## eGRID script to get localized carbon intensity of the grid

The eGRID database provides detailed information on the carbon intensity of electricity generation in the United States. Since 2018, the EPA has released it every January, with the latest data being from two years prior. For example, as of July 2024, the latest eGRID data is for 2022 and was released in January 2024. Re-run this script each year to incorporate the latest data.

>  Interactive tool: https://www.epa.gov/egrid/power-profiler

In [18]:
# change these values when a new dataset is released

latest_released_year = 2022

# since 2018, eGRID has been released yearly (before this, it was inconsistent)
# eGRID Data files linked from https://www.epa.gov/egrid/download-data and https://www.epa.gov/egrid/historical-egrid-data
# eGRID Subregions files linked from https://www.epa.gov/egrid/egrid-mapping-files
egrid_urls = {
  2018: {
    "egrid_xlsx": 'https://www.epa.gov/sites/default/files/2020-03/egrid2018_data_v2.xlsx',
    "subregions_kmz": 'https://www.epa.gov/sites/default/files/2020-03/egrid2018_subregions.kmz',
  },
  2019: {
    "egrid_xlsx": 'https://www.epa.gov/sites/default/files/2021-02/egrid2019_data.xlsx',
    "subregions_kmz": 'https://www.epa.gov/sites/default/files/2021-02/egrid2019_subregions.kmz',
  },
  2020: {
    "egrid_xlsx": 'https://www.epa.gov/system/files/documents/2022-09/eGRID2020_Data_v2.xlsx',
    "subregions_kmz": 'https://www.epa.gov/system/files/other-files/2022-01/egrid2020_subregions.kmz',
  },
  2021: {
    "egrid_xlsx": 'https://www.epa.gov/system/files/documents/2023-01/eGRID2021_data.xlsx',
    "subregions_kml": 'https://www.epa.gov/system/files/other-files/2023-05/eGRID2021_subregions.kml',
  },
  2022: {
    "egrid_xlsx": 'https://www.epa.gov/system/files/documents/2024-01/egrid2022_data.xlsx',
    "subregions_kmz": 'https://www.epa.gov/system/files/other-files/2024-05/egrid2022_subregions.kmz',
  },
}

In [19]:
# imports

import os
import json
import requests
from zipfile import ZipFile
from io import BytesIO

from script_utils import is_up_to_date, load_dataframe


eGRID reports data either by state or by eGRID subregion (27 in total).

> The 27 eGRID subregions in the US are defined by EPA using data from the Energy Information Administration (EIA) and the North American Electric Reliability Corporation (NERC). The subregions are defined to limit the amount of imports and exports across regions in order to best represent the electricity used in each of the subregions. More information can be found in section 3.4.2 of the eGRID Technical Support Document.

Although state-level data might be easier to use, the eGRID subregions are more accurate. The subregion boundaries are published per year as boundary files (KMZ or KML in the URLs above), so we can perform coordinate-based lookups to get the eGRID subregion for a given location.
These files are quite large, so we will simplify them to a coarser resolution to make lookups faster. Then we will save them in GeoJSON format, which is easier to work with in Python and JavaScript.

We will also include a 'metadata' field with 'data_source_urls' in each GeoJSON file to make it easier to track where the data comes from. If this script is run again and nothing has changed, the GeoJSON files do not need to be regenerated. If eGRID publishes a revision in the future, we can update the URL and the script will know which files need to be regenerated.
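
The real `is_up_to_date` helper lives in `script_utils` and is not shown in this notebook. As a rough illustration of the idea only, a check like this could compare the URLs stored in a file's 'metadata' field against the URLs the script is about to use. The function name `sketch_is_up_to_date` and the example URLs are hypothetical.

```python
import json
import os
import tempfile


def sketch_is_up_to_date(output_filename, source_urls):
    """Hypothetical stand-in for script_utils.is_up_to_date (not the real helper).

    Returns True only if the file exists and its metadata lists exactly
    the same source URLs that we are about to download from.
    """
    if not os.path.exists(output_filename):
        return False
    try:
        with open(output_filename) as f:
            data = json.load(f)
    except json.JSONDecodeError:
        return False
    metadata = data.get("metadata") if isinstance(data, dict) else None
    if not isinstance(metadata, dict):
        return False
    return metadata.get("data_source_urls") == list(source_urls)


# Example with a throwaway file and made-up URLs
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example.json")
    with open(path, "w") as f:
        json.dump({"metadata": {"year": 2022,
                                "data_source_urls": ["https://example.com/v1.kmz"]}}, f)
    unchanged = sketch_is_up_to_date(path, ["https://example.com/v1.kmz"])
    revised = sketch_is_up_to_date(path, ["https://example.com/v2.kmz"])
    missing = sketch_is_up_to_date(os.path.join(tmp_dir, "nope.json"), [])
```

With this approach, changing a URL in `egrid_urls` (for example to a `_v2` revision) is enough to mark the corresponding output file as stale.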

In [23]:
for year in egrid_urls:
  output_filename = f"../src/emcommon/resources/egrid{year}_subregions_5pct.json"
  urls = egrid_urls[year]
  # 2021 is published as a plain .kml; the other years are zipped .kmz archives
  subregions_url = urls["subregions_kml"] if "subregions_kml" in urls else urls["subregions_kmz"]

  if is_up_to_date(output_filename, [subregions_url]):
    continue

  print(f"Downloading subregions for {year}...")
  r = requests.get(subregions_url)
  r.raise_for_status()  # fail early instead of processing an error page
  os.makedirs('tmp', exist_ok=True)
  if "subregions_kml" in urls:
    with open('tmp/doc.kml', 'wb') as f:
      f.write(r.content)
  else:
    # a .kmz is a zip archive; this assumes it contains 'doc.kml', which mapshaper reads below
    ZipFile(BytesIO(r.content)).extractall('tmp')

  print(f"Simplifying geometry for {year}...")
  ! npx mapshaper 'tmp/doc.kml' -simplify dp 5% -o precision=0.0001 'tmp/out.json'

  with open('tmp/out.json') as f:
    output = json.load(f)

  ! rm -rf 'tmp'
  
  output['metadata'] = {
      "year": year,
      "data_source_urls": [subregions_url],
  }
  with open(output_filename, 'w') as f:
    json.dump(output, f)

print("Done generating eGRID subregions simplified geojson files")

../src/emcommon/resources/egrid2018_subregions_5pct.json is up to date, skipping
../src/emcommon/resources/egrid2019_subregions_5pct.json is up to date, skipping
Creating ../src/emcommon/resources/egrid2020_subregions_5pct.json
Downloading subregions for 2020...
Simplifying geometry for 2020...
[simplify] Repaired 2,739 intersections
[o] Wrote tmp/out.json
Creating ../src/emcommon/resources/egrid2021_subregions_5pct.json
Downloading subregions for 2021...
Simplifying geometry for 2021...
[simplify] Repaired 1,423 intersections
[o] Wrote tmp/out.json
Creating ../src/emcommon/resources/egrid2022_subregions_5pct.json
Downloading subregions for 2022...
Simplifying geometry for 2022...
[simplify] Repaired 2,573 intersections
[o] Wrote tmp/out.json
Done generating eGRID subregions simplified geojson files
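
With the simplified GeoJSON files in place, a coordinate-based lookup could look roughly like the sketch below. It uses a plain ray-casting point-in-polygon test and no geospatial libraries. The property that holds the subregion name (`name_property`) is an assumption, since it depends on what mapshaper carries over from the KML; the toy regions "AAAA" and "BBBB" are made up for illustration.

```python
def point_in_ring(lon, lat, ring):
    """Ray casting: count edge crossings of a ray going east from (lon, lat)."""
    inside = False
    j = len(ring) - 1
    for i in range(len(ring)):
        # index [0]/[1] only, so positions with an altitude value also work
        xi, yi = ring[i][0], ring[i][1]
        xj, yj = ring[j][0], ring[j][1]
        if (yi > lat) != (yj > lat):
            x_cross = xi + (lat - yi) * (xj - xi) / (yj - yi)
            if lon < x_cross:
                inside = not inside
        j = i
    return inside


def point_in_polygon(lon, lat, polygon_coords):
    """GeoJSON Polygon coordinates: first ring is the outline, the rest are holes."""
    outer, *holes = polygon_coords
    if not point_in_ring(lon, lat, outer):
        return False
    return not any(point_in_ring(lon, lat, hole) for hole in holes)


def find_subregion(geojson, lon, lat, name_property="name"):
    """Return the name of the first feature containing the point, or None.

    name_property is an assumed property key; check the generated files.
    """
    for feature in geojson.get("features", []):
        geometry = feature.get("geometry") or {}
        if geometry.get("type") == "Polygon":
            polygons = [geometry["coordinates"]]
        elif geometry.get("type") == "MultiPolygon":
            polygons = geometry["coordinates"]
        else:
            continue  # other geometry types are ignored in this sketch
        if any(point_in_polygon(lon, lat, p) for p in polygons):
            return (feature.get("properties") or {}).get(name_property)
    return None


# Toy data: two adjacent unit squares (not real eGRID boundaries)
toy_geojson = {
    "type": "FeatureCollection",
    "features": [
        {"type": "Feature", "properties": {"name": "AAAA"},
         "geometry": {"type": "Polygon",
                      "coordinates": [[[0, 0], [1, 0], [1, 1], [0, 1], [0, 0]]]}},
        {"type": "Feature", "properties": {"name": "BBBB"},
         "geometry": {"type": "MultiPolygon",
                      "coordinates": [[[[1, 0], [2, 0], [2, 1], [1, 1], [1, 0]]]]}},
    ],
}
```

Note that GeoJSON positions are ordered longitude first, then latitude, so `find_subregion(toy_geojson, 0.5, 0.5)` takes `lon` before `lat`.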


Now let's include the carbon intensity for each region, also by year.
The field we will use for carbon intensity is "Annual CO2 equivalent total output emission rate (lb/MWh)". For the national average, the column is called `USC2ERTA`; for the per-region values, it is called `SRC2ERTA`. We'll convert these values from lb/MWh to kg of CO2 equivalent per MWh.
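
As a quick worked example of the unit conversion: one pound is exactly 0.45359237 kg, so a rate in lb/MWh is multiplied by that factor to get kg/MWh. The helper name and the input value below are illustrative only, not taken from eGRID.

```python
KG_PER_LB = 0.45359237  # exact by definition of the international pound


def lb_per_mwh_to_kg_per_mwh(lb_per_mwh):
    """Convert an emission rate from lb/MWh to kg/MWh."""
    return lb_per_mwh * KG_PER_LB


# An illustrative rate of 1000 lb/MWh is about 453.59 kg/MWh
example_kg_per_mwh = lb_per_mwh_to_kg_per_mwh(1000.0)
```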

In [24]:
KG_PER_LB = 0.45359237  # 1 lb = 0.45359237 kg, so (lb/MWh) * KG_PER_LB = kg/MWh

for year, urls in egrid_urls.items():
  url = urls['egrid_xlsx']
  output = {}
  output_filename = f"../src/emcommon/resources/egrid{year}_intensities.json"

  if is_up_to_date(output_filename, [url]):
    continue

  # national average in the "USxx" sheet; e.g. "US22"
  # it only has one row
  national_df = load_dataframe(urls, 'egrid', 'US' + str(year)[-2:], skiprows=[0])
  output['national_kg_per_mwh'] = national_df[['USC2ERTA']].iloc[0, 0] * KG_PER_LB

  # per-region averages in the "SRLxx" sheet; e.g. "SRL22"
  regions_df = load_dataframe(urls, 'egrid', 'SRL' + str(year)[-2:], skiprows=[0])
  # map each subregion code to its emission rate, converted from lb/MWh to kg/MWh
  output['regions_kg_per_mwh'] = regions_df[['SUBRGN', 'SRC2ERTA']] \
    .set_index('SUBRGN')['SRC2ERTA'] \
    .apply(lambda lbs: lbs * KG_PER_LB) \
    .to_dict()

  output['metadata'] = {
    "year": year,
    "data_source_urls": [url],
  }
  with open(output_filename, 'w') as f:
    json.dump(output, f, indent=2)

print("Done generating eGRID intensity json files")
    

../src/emcommon/resources/egrid2018_intensities.json is up to date, skipping
../src/emcommon/resources/egrid2019_intensities.json is up to date, skipping
../src/emcommon/resources/egrid2020_intensities.json is up to date, skipping
../src/emcommon/resources/egrid2021_intensities.json is up to date, skipping
../src/emcommon/resources/egrid2022_intensities.json is up to date, skipping
Done generating eGRID intensity json files
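
To illustrate how the generated intensity files might be consumed, here is a small sketch that looks up a subregion's intensity and falls back to the national average when the subregion is unknown. The keys `national_kg_per_mwh` and `regions_kg_per_mwh` match the output written above; the function name, the subregion codes, and all numbers are made up for the example.

```python
def intensity_for_subregion(intensities, subregion):
    """Return kg/MWh for a subregion, or the national average if it is not listed."""
    regions = intensities.get("regions_kg_per_mwh", {})
    if subregion in regions:
        return regions[subregion]
    return intensities["national_kg_per_mwh"]


# Same shape as an egrid{year}_intensities.json file, with made-up values
example_intensities = {
    "national_kg_per_mwh": 400.0,
    "regions_kg_per_mwh": {"AAAA": 250.0, "BBBB": 600.0},
    "metadata": {"year": 2022,
                 "data_source_urls": ["https://example.com/egrid.xlsx"]},
}

known = intensity_for_subregion(example_intensities, "AAAA")
fallback = intensity_for_subregion(example_intensities, None)
```

Falling back to the national average is one possible design choice for locations outside every subregion polygon; callers that need to distinguish the two cases could return `None` instead.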
