# Validation of Base Year

This notebook validates the results of the PyPSA-USA model for the base year against two authoritative datasets: **EIA (U.S. Energy Information Administration)** and **Ember**. The validation covers three key aspects of the electricity system:
- **Electricity Demand**
- **Electricity Generation**
- **Installed Generation Capacity**

### Loading libraries

We begin by importing the necessary libraries for data handling, analysis, and visualization. These include PyPSA for power system analysis, pandas and numpy for data manipulation, geopandas for spatial data, and seaborn/matplotlib for plotting.

---

In [1]:

import os
import pypsa
import numpy as np
import pandas as pd
import geopandas as gpd
import matplotlib.pyplot as plt
import seaborn as sns


import warnings
warnings.filterwarnings("ignore")


The namespace `pypsa.networkclustering` is deprecated and will be removed in PyPSA v0.24. Please use `pypsa.clustering.spatial instead`. 



In [2]:
import sys
sys.path.append(os.path.abspath(".."))
from plots.results_validation import (
    load_ember_data, PLOTS_DIR, DATA_DIR, get_generation_capacity_ember, 
    convert_two_country_code_to_three, get_data_EIA, get_generation_capacity_pypsa,
    preprocess_eia_data, get_generation_capacity_ember, get_demand_pypsa, get_demand_ember,
    get_installed_capacity_ember, get_installed_capacity_pypsa
    )

### Loading Files

Here, we load the PyPSA-USA network results for the base year, as well as the relevant EIA and Ember datasets. These datasets provide reference values for demand, generation, and installed capacity, which will be used for validation.

---

In [3]:
# load network
project_root = os.getcwd()
results_dir = os.path.join(project_root, 'results')

# Load Base scenario
base_path = os.path.join(results_dir, 'base_year', "elec_s_100_ec_lcopt_Co2L-3H_3H_2020_0.071_AB_10export.nc")
base_network = pypsa.Network(base_path)
# base_network = pypsa.Network("/Users/gbotemi/Documents/code/PYPSA/US/efuels-supply-potentials/submodules/pypsa-earth/results/US_2021/postnetworks/elec_s_10_ec_lcopt_Co2L-24H_24H_2030_0.071_AB_10export.nc")

INFO:pypsa.io:Imported network elec_s_100_ec_lcopt_Co2L-3H_3H_2020_0.071_AB_10export.nc has buses, carriers, generators, global_constraints, lines, links, loads, storage_units, stores


In [4]:
def attach_state_to_buses(network, path_shapes, distance_crs="EPSG:4326"):
    """
    Attach state to buses
    """
    # Read the shapefile using geopandas
    shapes = gpd.read_file(path_shapes, crs=distance_crs)
    shapes["ISO_1"] = shapes["ISO_1"].apply(lambda x: x.split("-")[1])
    shapes.rename(columns={"ISO_1": "State"}, inplace=True)

    ac_dc_carriers = ["AC", "DC"]
    location_mapping = network.buses.query(
        "carrier in @ac_dc_carriers")[["x", "y"]]

    network.buses["x"] = network.buses["location"].map(
        location_mapping["x"]).fillna(0)
    network.buses["y"] = network.buses["location"].map(
        location_mapping["y"]).fillna(0)

    pypsa_gpd = gpd.GeoDataFrame(
        network.buses,
        geometry=gpd.points_from_xy(network.buses.x, network.buses.y),
        crs=4326
    )

    bus_cols = network.buses.columns
    bus_cols = list(bus_cols) + ["State"]

    st_buses = gpd.sjoin_nearest(shapes, pypsa_gpd, how="right")

    network.buses["state"] = st_buses["State"]

In [5]:
state_shape_path = "gadm41_USA_1.json"
attach_state_to_buses(base_network, state_shape_path)

In [6]:
country_code = "US"
horizon = 2020

three_country_code = convert_two_country_code_to_three(country_code)

In [7]:
EIA_demand_path = os.path.join(
    DATA_DIR, "validation", "EIA_demands.csv")
EIA_installed_capacities_path = os.path.join(
    DATA_DIR, "validation", "EIA_installed_capacities.csv")
EIA_generation_path = os.path.join(
    DATA_DIR, "validation", "EIA_electricity_generation.csv")

## Generation

In this section, we compare the **annual electricity generation** by technology as reported by PyPSA, EIA, and Ember. This helps to identify any discrepancies in the modeled generation mix and total output.

---

In [8]:
ember_data = load_ember_data()
generation_data_ember = get_generation_capacity_ember(
        ember_data, three_country_code, horizon).round(2)
generation_data_ember.drop(['Load shedding'], inplace=True)

EIA_generation = get_data_EIA(EIA_generation_path, country_code, horizon)
EIA_generation = preprocess_eia_data(EIA_generation).round(2)

In [9]:
ac_balance = base_network.statistics.energy_balance().xs("AC", level=2)
ac_total = (ac_balance[ac_balance>0]/1e6).unstack().sum().rename({
        "Onshore Wind": "Wind", "offwind": "Wind", "Offshore Wind (DC)": "Wind",
        "solar": "Solar", "Run of River": "Hydro", "Reservoir & Dam": "Hydro",
        "geothermal": "Geothermal", "nuclear": "Nuclear", "hydro": "Hydro",
        "OCGT": "Fossil fuels", "CCGT": "Fossil fuels", "Oil": "Fossil fuels",
        "Coal": "Fossil fuels", "biomass": "Biomass", "urban central gas CHP": "Fossil fuels", 
        "urban central solid biomass CHP": "Biomass",
    }).to_frame('PyPSA data').groupby(level=0).sum().round(2)

pypsa_gen_final = ac_total[ac_total > 0].dropna()
pypsa_gen_final.drop(['DC'], inplace=True)

In [10]:
generation_df = pd.concat(
    [pypsa_gen_final, generation_data_ember, EIA_generation], axis=1).fillna(0)
generation_df

Unnamed: 0,PyPSA data,Ember data,EIA data
Biomass,12744.36,54.7,67.57
Hydro,212.47,279.95,285.27
Nuclear,458.98,789.88,789.88
Solar,85.69,130.72,130.72
Wind,343.05,337.94,337.94
Fossil fuels,0.0,2431.9,2429.34
PHS,0.0,0.0,-5.32


# Capacity 

This section validates the **installed generation capacities** by technology. We compare the capacities in the PyPSA network to those reported by EIA and Ember to ensure the model's infrastructure assumptions are consistent with real-world data.

---

In [11]:
installed_capacity_ember = get_installed_capacity_ember(
        ember_data, three_country_code, horizon).round(2)

EIA_inst_capacities = get_data_EIA(
    EIA_installed_capacities_path, country_code, horizon)
EIA_inst_capacities = preprocess_eia_data(EIA_inst_capacities).round(2)

In [12]:
gen_carriers = {
    "onwind", "offwind-ac", "offwind-dc", "solar", "solar-rooftop",
    "csp", "nuclear", "geothermal", "ror", "PHS", "Reservoir & Dam", 'hydro'
}
link_carriers = {
    "OCGT", "CCGT", "coal", "oil", "biomass", "biomass CHP", "gas CHP"
}

# Generators
gen = base_network.generators.copy()
gen['carrier'] = gen['carrier'].replace({'offwind-ac': 'offwind', 'offwind-dc': 'offwind'})
gen = gen[gen.carrier.isin(gen_carriers)]
gen_totals = gen.groupby('carrier')['p_nom_opt'].sum()

# Storage
sto = base_network.storage_units.copy()
sto = sto[sto.carrier.isin(gen_carriers)]
sto_totals = sto.groupby('carrier')['p_nom_opt'].sum()

# Links (output side scaled by efficiency)
links = base_network.links.copy()
mask = (
    links.efficiency.notnull()
    & (links.p_nom_opt > 0)
    & links.carrier.isin(link_carriers)
)
links = links[mask]
links_totals = links.groupby('carrier').apply(
    lambda df: (df['p_nom_opt'] * df['efficiency']).sum()
)

# Combine all
all_totals = pd.concat([gen_totals, sto_totals, links_totals])
all_totals = all_totals.groupby(all_totals.index).sum()  # Merge duplicates
all_totals = all_totals[all_totals > 0] / 1e3

In [13]:
pypsa_cap = all_totals.rename({
    "onwind": "Wind", "offwind": "Wind",
    "solar": "Solar", "ror": "Hydro", "Reservoir & Dam": "Hydro",
    "geothermal": "Geothermal", "nuclear": "Nuclear", "hydro": "Hydro",
    "OCGT": "Fossil fuels", "CCGT": "Fossil fuels", "oil": "Fossil fuels",
    "coal": "Fossil fuels", "biomass": "Biomass",
}).to_frame('PyPSA data').round(2).groupby(level=0).sum()

In [14]:
installed_capacity_df = pd.concat(
    [pypsa_cap, installed_capacity_ember, EIA_inst_capacities], axis=1).fillna(0)
installed_capacity_df

Unnamed: 0,PyPSA data,Ember data,EIA data
Biomass,3.07,10.83,16.03
Fossil fuels,585.87,786.94,731.21
Geothermal,1.29,0.0,0.0
Hydro,66.91,83.83,79.92
Nuclear,52.4,96.5,96.5
PHS,21.98,0.0,23.02
Solar,60.39,76.44,75.64
Wind,132.76,118.66,118.38


# Demand

Finally, we validate the **total electricity demand** in the PyPSA network against the EIA and Ember datasets. This ensures that the modeled demand matches observed values for the base year.

---

In [15]:
demand_ember = get_demand_ember(ember_data, three_country_code, horizon)
pypsa_demand = get_demand_pypsa(base_network)

EIA_demand = get_data_EIA(EIA_demand_path, country_code, horizon)
EIA_demand = EIA_demand.iloc[0, 1]

In [16]:
demand_df = pd.DataFrame(
    {"PyPSA data": [pypsa_demand], "Ember data": [demand_ember], "EIA data": [EIA_demand]})
demand_df.index = ["Demand [TWh]"]
demand_df

Unnamed: 0,PyPSA data,Ember data,EIA data
Demand [TWh],11764.2561,4090.49,3897.8994
