# EPA GHGRP demo of Data in Data Commons

The purpose of this notebook is to provide a visual representation of the availability and connectivity of some of the data in the Data Commons. The data are:
* EPA_GHGRP: Asset-level data (physical plants and operations in the US that emit >= 25 kt CO2e per year, plus others covered by the EPA's GHG Reporting Program)
* GLEIF: Legal Entity Identifiers for parent companies that own assets (and identification of parent companies that don't have LEIs)
* Types of emissions from parent companies: Direct Emitters, LDC Emissions, Onshore Oil and Gas Production, Gathering and Boosting, Transmission Pipelines, SF6 from Electrical Equipment
* Types of operations by SIC/NAICS codes (Steel, Cement, Electricity Generation, Pulp and Paper Manufacturing, etc)
* Sectors (Manufacturing, Transportation Communications and Utilities, Service Industries, Mining, etc)
* SEC 10-K reports: Revenue Data (can be compared/contrasted with EPA CO2e emissions data)

The data developed in this notebook can be visualized by running the notebook https://github.com/os-climate/data-platform-demo/blob/master/notebooks/Sankey.ipynb and then exploring the results in Superset here: https://superset-secure-odh-superset.apps.odh-cl2.apps.os-climate.org/superset/explore/

This data is incomplete from a number of perspectives:
* Major non-emitting power plants (hydroelectric dams, solar arrays, wind turbines, and nuclear power plants) may be missing as assets
* There are no metrics for energy generation or consumption, nor for other units of production (such as tons of steel produced)
* All emissions are essentially Scope 1 emissions; there are no Scope 2 attributions for major energy consumers (such as steel manufacturing)
* There are also no Scope 3 metrics
* The data is exclusively US-based

Despite these shortcomings, this illustration/demonstration shows how additional data can be linked in to provide a more complete picture:
* WRI Power Plant data, providing a global perspective on power plants, including emissions, generation, capacity, fuel type, etc.
* RMI Utility Transition Hub data, providing fine-grained, up-to-date information about US power plants, including emissions targets
* SEC data at Business Segment level (to separate Berkshire Hathaway's \\$65B energy business from their overall \\$265B enterprise, for example)
* SPGI sustainability reports (require NLP analysis to yield quantitative metrics)
* etc.

# Begin with Credentials and Connection to Trino

In [1]:
import os
import pathlib
from dotenv import load_dotenv

# Load some standard environment variables from a dot-env file, if it exists.
# If no such file can be found, does not fail, and so allows these environment vars to
# be populated in some other way
dotenv_dir = os.environ.get('CREDENTIAL_DOTENV_DIR', os.environ.get('PWD', '/opt/app-root/src'))
dotenv_path = pathlib.Path(dotenv_dir) / 'credentials.env'
if os.path.exists(dotenv_path):
    load_dotenv(dotenv_path=dotenv_path,override=True)
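
The connection cell below reads `TRINO_USER`, `TRINO_HOST`, `TRINO_PORT`, and `TRINO_PASSWD` from the environment. A minimal sketch of a pre-flight check follows, assuming those four names are the only ones required. The helper `missing_vars` is hypothetical and not part of the notebook.

```python
import os

# Variable names taken from the connection cell; treating this list as complete is an assumption.
REQUIRED_VARS = ("TRINO_USER", "TRINO_HOST", "TRINO_PORT", "TRINO_PASSWD")

def missing_vars(env=None, required=REQUIRED_VARS):
    """Return the required variable names that are unset or empty in env."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]

# An explicit dict lets the check be tried without real credentials.
example_env = {"TRINO_USER": "demo", "TRINO_HOST": "trino.example.org", "TRINO_PORT": "443"}
print(missing_vars(example_env))
```

Calling `missing_vars()` with no arguments checks the real environment, which can give a clearer message than the `KeyError` raised by `os.environ[...]`.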

Set the connection's default catalog and schema to make query terms much more compact

In [2]:
import trino
from sqlalchemy.engine import create_engine

sqlstring = 'trino://{user}@{host}:{port}/'.format(
    user = os.environ['TRINO_USER'],
    host = os.environ['TRINO_HOST'],
    port = os.environ['TRINO_PORT']
)

ingest_catalog = 'osc_datacommons_dev'
ingest_schema = 'sandbox'
epa_table_prefix = 'epa_'
dera_table_prefix = 'dera_'

sqlargs = {
    'auth': trino.auth.JWTAuthentication(os.environ['TRINO_PASSWD']),
    'http_scheme': 'https',
    'catalog': ingest_catalog,
    'schema': ingest_schema,
}
engine = create_engine(sqlstring, connect_args = sqlargs)
connection = engine.connect()

import pandas as pd
import osc_ingest_trino as osc

In [3]:
# Set to True to drop every table in the ingest schema, and then the schema itself.
# This is destructive; leave it False unless you intend to rebuild everything.
cleanup = False

if cleanup:
    for schema in [ ingest_schema ]:
        print(schema)
        qres = engine.execute(f'show tables in {schema}')
        tables = qres.fetchall()

        for table in tables:
            qres = engine.execute(f'drop table {schema}.{table[0]}')
            display(qres.fetchall())

        # Confirm the schema is now empty before dropping it
        qres = engine.execute(f'show tables in {schema}')
        display(qres.fetchall())

        qres = engine.execute(f'drop schema {schema}')
        display(qres.fetchall())

    qres = engine.execute('show schemas')
    display(qres.fetchall())

# Introduction to EPA GHG Reporting Program data (EPA_GHGRP)

The EPA's GHG Reporting Program (GHGRP) appears to be a gold standard for building a bottom-up list of emitters that is good enough to play a major role in top-down estimates.

`Direct_Emitters` account for the lion's share of CO2 _emissions_. `Suppliers` tracks fuels and products which, when used as intended, will create GHG emissions (by direct emitters or others).

In [4]:
qres = engine.execute(f"describe {epa_table_prefix}direct_emitters")
display(qres.fetchall())

[('facility_id', 'bigint', '', ''),
 ('frs_id', 'varchar', '', ''),
 ('facility_name', 'varchar', '', ''),
 ('city', 'varchar', '', ''),
 ('state', 'varchar', '', ''),
 ('zip_code', 'varchar', '', ''),
 ('address', 'varchar', '', ''),
 ('county', 'varchar', '', ''),
 ('latitude', 'double', '', ''),
 ('longitude', 'double', '', ''),
 ('primary_naics_code', 'varchar', '', ''),
 ('latest_reported_industry_type_subparts', 'varchar', '', ''),
 ('latest_reported_industry_type_sectors', 'varchar', '', ''),
 ('total_reported_emissions', 'double', '', ''),
 ('total_reported_emissions_units', 'varchar', '', ''),
 ('year', 'timestamp(6)', '', '')]

In [5]:
qres = engine.execute(f"""
select format('%tY', year), format('%,.2f', sum(total_reported_emissions)/1e9) || ' Gt CO2e'
from {epa_table_prefix}direct_emitters group by year order by year desc""")
display(qres.fetchall())

[('2020', '2.40 Gt CO2e'),
 ('2019', '2.63 Gt CO2e'),
 ('2018', '2.78 Gt CO2e'),
 ('2017', '2.74 Gt CO2e'),
 ('2016', '2.81 Gt CO2e'),
 ('2015', '2.94 Gt CO2e'),
 ('2014', '3.08 Gt CO2e'),
 ('2013', '3.07 Gt CO2e'),
 ('2012', '3.06 Gt CO2e'),
 ('2011', '3.21 Gt CO2e')]

Here's a look at how the sectors stack up (from a database perspective; we should also look at this in Superset).

In [6]:
qres = engine.execute(f"""
select count (*), latest_reported_industry_type_sectors,
       format('%,.2f', sum(total_reported_emissions)/1e6) || ' Mt CO2e' as MtCO2e
from {epa_table_prefix}direct_emitters
where year>=DATE('2020-01-01') and year<DATE('2021-01-01')
group by latest_reported_industry_type_sectors
order by MtCO2e desc
""")
display(qres.fetchall())

[(1200, 'Petroleum and Natural Gas Systems', '97.56 Mt CO2e'),
 (1201, 'Waste', '96.91 Mt CO2e'),
 (1059, 'Other', '93.91 Mt CO2e'),
 (6, 'Chemicals,Refineries', '9.87 Mt CO2e'),
 (4, 'Chemicals,Petroleum Product Suppliers,Refineries,Suppliers of CO2', '8.58 Mt CO2e'),
 (41, 'Chemicals,Petroleum Product Suppliers,Refineries', '60.97 Mt CO2e'),
 (1, 'Petroleum Product Suppliers,Power Plants,Refineries', '6.56 Mt CO2e'),
 (1, 'Metals,Minerals', '6.17 Mt CO2e'),
 (275, 'Metals', '59.17 Mt CO2e'),
 (69, 'Petroleum Product Suppliers,Refineries', '56.06 Mt CO2e'),
 (1, 'Chemicals,Petroleum Product Suppliers,Power Plants,Refineries', '5.97 Mt CO2e'),
 (1, 'Metals,Power Plants', '5.96 Mt CO2e'),
 (7, 'Refineries', '5.28 Mt CO2e'),
 (1, 'Chemicals,Other,Petroleum Product Suppliers,Power Plants,Refineries', '5.11 Mt CO2e'),
 (10, 'Injection of CO2,Petroleum and Natural Gas Systems,Suppliers of CO2', '4.81 Mt CO2e'),
 (1, 'Chemicals,Other,Petroleum and Natural Gas Systems,Waste', '4.67 Mt CO2e'),
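
The rows above are ordered as text rather than as numbers: `MtCO2e` is a formatted string, so `'9.87 Mt CO2e'` sorts ahead of `'60.97 Mt CO2e'`. The self-contained sketch below reproduces the difference with a few values from the output. In the query itself, ordering by `sum(total_reported_emissions)` instead of the alias would presumably give the numeric order, though that variant was not run here.

```python
# A few labels copied from the query output above.
labels = ['97.56 Mt CO2e', '9.87 Mt CO2e', '60.97 Mt CO2e', '8.58 Mt CO2e']

# Sorting the formatted strings compares character by character.
as_text = sorted(labels, reverse=True)

# Sorting on the leading number gives the order a reader expects.
as_numbers = sorted(labels, key=lambda s: float(s.split()[0]), reverse=True)

print(as_text)
print(as_numbers)
```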

This looks at the `Minerals` industry (which includes cement).  We see that the top emitters have multiple facility locations.

In [7]:
qres = engine.execute(f"""
select count (*), parent_company_name, format('%5.2f', sum(total_reported_emissions)/1e6) || ' Mt CO2e' as MtCO2e
from {epa_table_prefix}direct_emitters, {epa_table_prefix}parent_company
where year>=DATE('2020-01-01') and year<DATE('2021-01-01') and year=reporting_year
      and latest_reported_industry_type_sectors='Minerals'
      and {epa_table_prefix}direct_emitters.facility_id={epa_table_prefix}parent_company.ghgrp_facility_id
group by parent_company_name
order by MtCO2e desc
limit 20
""")
display(qres.fetchall())

[(13, 'HOLCIM PARTICIPATIONS (US) INC', '11.80 Mt CO2e'),
 (12, 'HANSON LEHIGH INC', ' 6.61 Mt CO2e'),
 (7, 'CEMEX INC', ' 6.33 Mt CO2e'),
 (7, 'RC LONESTAR INC', ' 6.10 Mt CO2e'),
 (11, 'Lhoist North America, Inc', ' 5.75 Mt CO2e'),
 (9, 'CRH AMERICAS INC', ' 4.94 Mt CO2e'),
 (12, 'EAGLE MATERIALS INC', ' 4.64 Mt CO2e'),
 (4, 'MARTIN MARIETTA MATERIALS INC', ' 3.93 Mt CO2e'),
 (11, 'CARMEUSE LIME INC', ' 3.84 Mt CO2e'),
 (9, 'GRAYMONT INC', ' 3.69 Mt CO2e'),
 (3, 'TAIHEIYO CEMENT USA INC', ' 3.29 Mt CO2e'),
 (4, 'ARGOS USA LLC', ' 3.19 Mt CO2e'),
 (3, 'HBM HOLDINGS CO', ' 3.11 Mt CO2e'),
 (5, 'GCC OF AMERICA INC', ' 2.15 Mt CO2e'),
 (2, 'TITAN AMERICA LLC', ' 2.14 Mt CO2e'),
 (1, 'GENESIS ENERGY LP', ' 1.60 Mt CO2e'),
 (2, 'NATIONAL CEMENT CO INC', ' 1.47 Mt CO2e'),
 (1, 'VOTORANTIM CIMENTOS NORTH AMERICA INC', ' 1.45 Mt CO2e'),
 (3, 'GIANT CEMENT HOLDING INC', ' 1.45 Mt CO2e'),
 (2, 'SUMMIT MATERIALS INC', ' 1.45 Mt CO2e')]

`Suppliers` are those who buy and sell GHG-emitting products, but they do not, themselves, cause the emissions.  They merely enable others to emit.

In [8]:
qres = engine.execute(f"describe {epa_table_prefix}suppliers")
display(qres.fetchall())

[('facility_id', 'bigint', '', ''),
 ('frs_id', 'varchar', '', ''),
 ('facility_name', 'varchar', '', ''),
 ('city', 'varchar', '', ''),
 ('state', 'varchar', '', ''),
 ('zip_code', 'varchar', '', ''),
 ('address', 'varchar', '', ''),
 ('county', 'varchar', '', ''),
 ('latitude', 'double', '', ''),
 ('longitude', 'double', '', ''),
 ('primary_naics_code', 'varchar', '', ''),
 ('latest_reported_industry_type_subparts', 'varchar', '', ''),
 ('coal_based_liquid_fuel_production_ghg', 'double', '', ''),
 ('coal_based_liquid_fuel_production_ghg_units', 'varchar', '', ''),
 ('petroleum_products_produced_ghg', 'double', '', ''),
 ('petroleum_products_produced_ghg_units', 'varchar', '', ''),
 ('petroleum_products_imported_ghg', 'double', '', ''),
 ('petroleum_products_imported_ghg_units', 'varchar', '', ''),
 ('petroleum_products_exported_ghg', 'double', '', ''),
 ('petroleum_products_exported_ghg_units', 'varchar', '', ''),
 ('natural_gas_supply_ghg', 'double', '', ''),
 ('natural_gas_supply_g

A quick summary of how many rows of data we have in `epa_ghgrp`.

68k rows in `direct_emitters`: lots of facilities  
103k rows in `parent_company`: lots of facility/owner relationships

In [9]:
qres = engine.execute(f'show tables in {ingest_schema}')
tables = [t[0] for t in qres.fetchall() if t[0].startswith(epa_table_prefix)]

totalrows = 0
for table in tables:
    s = f'select count (*) from {table}'
    qres = engine.execute(s)
    rowcount = qres.fetchall()[0][0]
    totalrows += rowcount
    print(f"{rowcount:>6} <- {s}")

print(f'{totalrows:>6} <- total rows')

   954 <- select count (*) from epa_co2_injection
 68472 <- select count (*) from epa_direct_emitters
  1703 <- select count (*) from epa_gathering_boosting
    20 <- select count (*) from epa_geologic_sequestration_of_co2
  1730 <- select count (*) from epa_ldc_direct_emissions
  5068 <- select count (*) from epa_onshore_oil_gas_prod
 89879 <- select count (*) from epa_parent_attribution
103043 <- select count (*) from epa_parent_company
  1022 <- select count (*) from epa_sankey
  1012 <- select count (*) from epa_sf6_from_elec_equip
  8539 <- select count (*) from epa_suppliers
   780 <- select count (*) from epa_transmission_pipelines
282222 <- total rows


# Reshaping tables to make them easier to chart

The key metric is total reported emissions (in metric tons of CO2e). Every emissions table stores it in the same column, `total_reported_emissions` (paired with `total_reported_emissions_units`), so each per-table sum is renamed to `<table>_em` to keep the column names unique when the tables are merged.

We also know that each of the sums feeding into the final summary table has exactly one row per year. We use `iat[0,1]` to access the 0th row and the 1st column (whose name is specific to the source/process). By using `iat`, we get a scalar value we can sum, instead of a Series object we'd have to `squeeze`.
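
To make the `iat` point concrete, here is a small sketch on a toy DataFrame shaped like one entry of `q_dict`. The two rows reuse figures from this notebook, and the variable names are illustrative only.

```python
import pandas as pd

# Toy stand-in for one entry of q_dict: one row per year.
df = pd.DataFrame({
    "year": [2019, 2020],
    "direct_emitters_em": [2626532000.0, 2400335000.0],
    "direct_emitters_em_units": ["t CO2", "t CO2"],
})

one_year = df[df.year == 2020]               # a one-row DataFrame
as_series = one_year["direct_emitters_em"]   # still a Series of length 1
as_scalar = one_year.iat[0, 1]               # row 0, column 1: a scalar

print(type(as_series).__name__, type(as_scalar).__name__)
```

Because `as_scalar` is a single number, a list of such values can be passed straight to `sum`.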

In [10]:
import pandas as pd

emission_tables = ['direct_emitters', 'onshore_oil_gas_prod', 'gathering_boosting',
                   'transmission_pipelines', 'ldc_direct_emissions', 'sf6_from_elec_equip']
tot_em_columns = []

q_dict = {}

for t in emission_tables:
    qres = engine.execute(f"describe {epa_table_prefix}{t}")
    tr = qres.fetchall()
    # Each table's total reported emissions are in column total_reported_emissions
    # (paired with total_reported_emissions_units).  We need to make unique names for this merge
    total_emission_cname = t+"_em"
    tot_em_columns.append(total_emission_cname)
    qres = engine.execute(f"""
select year, sum(total_reported_emissions), total_reported_emissions_units
from {epa_table_prefix}{t}
group by year, total_reported_emissions_units
""")
    q_dict[t] = pd.DataFrame(qres.fetchall(), columns=['year', total_emission_cname, total_emission_cname+"_units"])

In [11]:
# A function that excludes terms using SQL to say "and X!=Y"
def excl_text(excl):
    return ' and '.join([f"latest_reported_industry_type_sectors!='{e}'" for e in excl])

# A function that includes text that matches; SQL that says "or X like '%Y%'"
def incl_text(incl):
    return ' or '.join([f"latest_reported_industry_type_sectors like '%{e}%'" for e in incl])

t = 'direct_emitters'
qres = engine.execute(f"describe {epa_table_prefix}{t}")
t_cols = qres.fetchall()
total_emission_cname = t+"_em"

incl = [ 'Power', 'Petroleum']
qres = engine.execute(f"""
select year, sum(total_reported_emissions),  total_reported_emissions_units
from {epa_table_prefix}{t}
where {incl_text(incl)}
group by year, total_reported_emissions_units
""")

q_dict[t + f" (incl {','.join(incl)})"] = pd.DataFrame(qres.fetchall(),
                                                       columns=['year',
                                                                total_emission_cname + f" (matching {','.join(incl)})",
                                                                total_emission_cname + f" (matching {','.join(incl)})" + "_units"])

excl = [ 'Minerals', 'Other', 'Waste', 'Chemicals', 'Pulp and Paper,Waste',
        'Metals,Waste', 'Pulp and Paper']
qres = engine.execute(f"""
select year, sum(total_reported_emissions),  total_reported_emissions_units
from {epa_table_prefix}{t}
where {excl_text(excl)}
group by year, total_reported_emissions_units
""")
q_dict[t + f" (excl {','.join(excl)})"] = pd.DataFrame(qres.fetchall(),
                                                       columns=['year',
                                                                total_emission_cname + f" (excl {','.join(excl)})",
                                                                total_emission_cname + f" (excl {','.join(excl)})" + "_units"])
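
To show what the generated `where` clauses look like, the sketch below repeats the two helper functions and prints the SQL fragments they build. Values are interpolated straight into the SQL text, which is fine for the fixed lists used here but would need escaping or bound parameters for untrusted input.

```python
# Same helpers as in the cell above, repeated so this block runs on its own.
def excl_text(excl):
    return ' and '.join([f"latest_reported_industry_type_sectors!='{e}'" for e in excl])

def incl_text(incl):
    return ' or '.join([f"latest_reported_industry_type_sectors like '%{e}%'" for e in incl])

print(incl_text(['Power', 'Petroleum']))
print(excl_text(['Minerals', 'Other']))
```

The inclusion test is a substring match, while the exclusion test is an exact match. A combined value such as `Chemicals,Refineries` is therefore not removed by excluding `Chemicals`, which is why the `excl` list names some combinations explicitly.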

In [12]:
for t in emission_tables:
    qres = engine.execute(f"describe {epa_table_prefix}{t}")
    tr = qres.fetchall()
    total_emission_cname = t+"_em"
    qres = engine.execute(f"""
select year, sum(total_reported_emissions),  total_reported_emissions_units
from {epa_table_prefix}{t}
group by year, total_reported_emissions_units
""")
    q_dict[t] = pd.DataFrame(qres.fetchall(), columns=['year',
                                                       total_emission_cname,
                                                       total_emission_cname + "_units"])

grand_total = {}

for year in q_dict['direct_emitters'].year:
    grand_total[year] = sum([q_dict[t][q_dict[t].year==year].iat[0,1]
                             for t in emission_tables if year in q_dict[t].year.values])

df = pd.DataFrame.from_dict(grand_total, orient='index', columns=['total_co2e']).reset_index()
df.rename(columns={'index':'year'}, inplace=True)
# Grab the common unit that all these reports share
df['total_co2e_units'] = q_dict['direct_emitters'].iat[0,2]
q_dict['grand_total'] = df

This gem comes from https://stackoverflow.com/questions/44327999/python-pandas-merge-multiple-dataframes

In [13]:
from functools import reduce

df_merged = reduce(lambda left,right: pd.merge(left,right,on=['year'], how='outer'), q_dict.values()).fillna(0)
df_merged.sort_values(by='year', ascending=False, inplace=True)
df_merged.index = pd.RangeIndex(len(df_merged.index))
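
A toy version of the same `reduce`/`merge` pattern, with three small frames standing in for the entries of `q_dict` (the frame and column names are made up for illustration):

```python
from functools import reduce
import pandas as pd

a = pd.DataFrame({"year": [2019, 2020], "a_em": [1.0, 2.0]})
b = pd.DataFrame({"year": [2020], "b_em": [5.0]})            # no 2019 row
c = pd.DataFrame({"year": [2019, 2020], "c_em": [7.0, 8.0]})

# Fold the list pairwise: ((a merge b) merge c), keeping every year seen in any frame.
merged = reduce(lambda left, right: pd.merge(left, right, on="year", how="outer"),
                [a, b, c]).fillna(0)
merged = merged.sort_values(by="year", ascending=False).reset_index(drop=True)
print(merged)
```

The outer join keeps years that are missing from some tables, and `fillna(0)` then fills every gap with zero, including gaps in text columns. That is consistent with the `0` shown in the units columns of the merged table for years where a source has no data.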

In [14]:
df_merged

Unnamed: 0,year,direct_emitters_em,direct_emitters_em_units,onshore_oil_gas_prod_em,onshore_oil_gas_prod_em_units,gathering_boosting_em,gathering_boosting_em_units,transmission_pipelines_em,transmission_pipelines_em_units,ldc_direct_emissions_em,ldc_direct_emissions_em_units,sf6_from_elec_equip_em,sf6_from_elec_equip_em_units,"direct_emitters_em (matching Power,Petroleum)","direct_emitters_em (matching Power,Petroleum)_units","direct_emitters_em (excl Minerals,Other,Waste,Chemicals,Pulp and Paper,Waste,Metals,Waste,Pulp and Paper)","direct_emitters_em (excl Minerals,Other,Waste,Chemicals,Pulp and Paper,Waste,Metals,Waste,Pulp and Paper)_units",total_co2e,total_co2e_units
0,2020-01-01 00:00:00.000,2400335000.0,t CO2,93488110.0,t CO2,90028670.0,t CO2,3497590.0,t CO2,12641100.0,t CO2,2004836.0,t CO2,1768353000.0,t CO2,1947463000.0,t CO2,2601995000.0,t CO2
1,2019-01-01 00:00:00.000,2626532000.0,t CO2,120174300.0,t CO2,92765660.0,t CO2,2859475.0,t CO2,12847020.0,t CO2,2510832.0,t CO2,1958369000.0,t CO2,2153508000.0,t CO2,2857689000.0,t CO2
2,2018-01-01 00:00:00.000,2779471000.0,t CO2,111958800.0,t CO2,83325600.0,t CO2,3050315.0,t CO2,13236260.0,t CO2,2270228.0,t CO2,2099221000.0,t CO2,2304343000.0,t CO2,2993312000.0,t CO2
3,2017-01-01 00:00:00.000,2735841000.0,t CO2,96241460.0,t CO2,77830580.0,t CO2,2699047.0,t CO2,13670430.0,t CO2,2555766.0,t CO2,2070082000.0,t CO2,2270354000.0,t CO2,2928838000.0,t CO2
4,2016-01-01 00:00:00.000,2805105000.0,t CO2,86898250.0,t CO2,82597010.0,t CO2,3183982.0,t CO2,14002290.0,t CO2,2930497.0,t CO2,2144348000.0,t CO2,2337524000.0,t CO2,2994717000.0,t CO2
5,2015-01-01 00:00:00.000,2939444000.0,t CO2,101748500.0,t CO2,0.0,0,0.0,0,14558310.0,t CO2,2472281.0,t CO2,2261725000.0,t CO2,2452045000.0,t CO2,3058223000.0,t CO2
6,2014-01-01 00:00:00.000,3084069000.0,t CO2,101951700.0,t CO2,0.0,0,0.0,0,14771850.0,t CO2,3220287.0,t CO2,2392070000.0,t CO2,2592859000.0,t CO2,3204013000.0,t CO2
7,2013-01-01 00:00:00.000,3073214000.0,t CO2,97959460.0,t CO2,0.0,0,0.0,0,15161470.0,t CO2,3258298.0,t CO2,2392082000.0,t CO2,2593335000.0,t CO2,3189593000.0,t CO2
8,2012-01-01 00:00:00.000,3058076000.0,t CO2,92539660.0,t CO2,0.0,0,0.0,0,15412350.0,t CO2,3236291.0,t CO2,2376107000.0,t CO2,2579361000.0,t CO2,3169264000.0,t CO2
9,2011-01-01 00:00:00.000,3207583000.0,t CO2,91190570.0,t CO2,0.0,0,0.0,0,15667940.0,t CO2,3920547.0,t CO2,2509918000.0,t CO2,2728839000.0,t CO2,3318362000.0,t CO2


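A summary table consolidating the totals from the GHGRP, plus three additional columns:
1. direct emitters whose sector list matches "Power" or "Petroleum"
2. direct emitters excluding the sectors `Minerals`, `Other`, `Waste`, `Chemicals`, `Pulp and Paper,Waste`, `Metals,Waste`, and `Pulp and Paper`
3. total CO2e summed across the six emission tables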

In [15]:
df_merged.rename(columns={v:v.replace('_', ' ') for v in df_merged.columns.values})

Unnamed: 0,year,direct emitters em,direct emitters em units,onshore oil gas prod em,onshore oil gas prod em units,gathering boosting em,gathering boosting em units,transmission pipelines em,transmission pipelines em units,ldc direct emissions em,ldc direct emissions em units,sf6 from elec equip em,sf6 from elec equip em units,"direct emitters em (matching Power,Petroleum)","direct emitters em (matching Power,Petroleum) units","direct emitters em (excl Minerals,Other,Waste,Chemicals,Pulp and Paper,Waste,Metals,Waste,Pulp and Paper)","direct emitters em (excl Minerals,Other,Waste,Chemicals,Pulp and Paper,Waste,Metals,Waste,Pulp and Paper) units",total co2e,total co2e units
0,2020-01-01 00:00:00.000,2400335000.0,t CO2,93488110.0,t CO2,90028670.0,t CO2,3497590.0,t CO2,12641100.0,t CO2,2004836.0,t CO2,1768353000.0,t CO2,1947463000.0,t CO2,2601995000.0,t CO2
1,2019-01-01 00:00:00.000,2626532000.0,t CO2,120174300.0,t CO2,92765660.0,t CO2,2859475.0,t CO2,12847020.0,t CO2,2510832.0,t CO2,1958369000.0,t CO2,2153508000.0,t CO2,2857689000.0,t CO2
2,2018-01-01 00:00:00.000,2779471000.0,t CO2,111958800.0,t CO2,83325600.0,t CO2,3050315.0,t CO2,13236260.0,t CO2,2270228.0,t CO2,2099221000.0,t CO2,2304343000.0,t CO2,2993312000.0,t CO2
3,2017-01-01 00:00:00.000,2735841000.0,t CO2,96241460.0,t CO2,77830580.0,t CO2,2699047.0,t CO2,13670430.0,t CO2,2555766.0,t CO2,2070082000.0,t CO2,2270354000.0,t CO2,2928838000.0,t CO2
4,2016-01-01 00:00:00.000,2805105000.0,t CO2,86898250.0,t CO2,82597010.0,t CO2,3183982.0,t CO2,14002290.0,t CO2,2930497.0,t CO2,2144348000.0,t CO2,2337524000.0,t CO2,2994717000.0,t CO2
5,2015-01-01 00:00:00.000,2939444000.0,t CO2,101748500.0,t CO2,0.0,0,0.0,0,14558310.0,t CO2,2472281.0,t CO2,2261725000.0,t CO2,2452045000.0,t CO2,3058223000.0,t CO2
6,2014-01-01 00:00:00.000,3084069000.0,t CO2,101951700.0,t CO2,0.0,0,0.0,0,14771850.0,t CO2,3220287.0,t CO2,2392070000.0,t CO2,2592859000.0,t CO2,3204013000.0,t CO2
7,2013-01-01 00:00:00.000,3073214000.0,t CO2,97959460.0,t CO2,0.0,0,0.0,0,15161470.0,t CO2,3258298.0,t CO2,2392082000.0,t CO2,2593335000.0,t CO2,3189593000.0,t CO2
8,2012-01-01 00:00:00.000,3058076000.0,t CO2,92539660.0,t CO2,0.0,0,0.0,0,15412350.0,t CO2,3236291.0,t CO2,2376107000.0,t CO2,2579361000.0,t CO2,3169264000.0,t CO2
9,2011-01-01 00:00:00.000,3207583000.0,t CO2,91190570.0,t CO2,0.0,0,0.0,0,15667940.0,t CO2,3920547.0,t CO2,2509918000.0,t CO2,2728839000.0,t CO2,3318362000.0,t CO2


# Cross-check with ESSD top-down dataset

A quick look at *just* CO2.  We'll look at CO2e in the next set of cells.

In [16]:
qres = engine.execute("""
select format('%tY', year), sector_title, format('%,.2f', sum(value)/1e9) || ' Gt CO2' as GtCO2 from essd_ghg_data
where sector_title='Energy systems' and gas='CO2' and year>=DATE('2010-01-01') and year<DATE('2021-01-01') and ISO='USA'
group by year, sector_title, gas order by year desc""")
qres.fetchall()

[('2020', 'Energy systems', '1.75 Gt CO2'),
 ('2019', 'Energy systems', '1.99 Gt CO2'),
 ('2018', 'Energy systems', '2.13 Gt CO2'),
 ('2017', 'Energy systems', '2.11 Gt CO2'),
 ('2016', 'Energy systems', '2.18 Gt CO2'),
 ('2015', 'Energy systems', '2.28 Gt CO2'),
 ('2014', 'Energy systems', '2.41 Gt CO2'),
 ('2013', 'Energy systems', '2.41 Gt CO2'),
 ('2012', 'Energy systems', '2.39 Gt CO2'),
 ('2011', 'Energy systems', '2.51 Gt CO2'),
 ('2010', 'Energy systems', '2.63 Gt CO2')]

In [17]:
qres = engine.execute('describe essd_ghg_data')
qres.fetchall()

[('iso', 'varchar', '', ''),
 ('country', 'varchar', '', ''),
 ('region_ar6_6', 'varchar', '', ''),
 ('region_ar6_10', 'varchar', '', ''),
 ('region_ar6_22', 'varchar', '', ''),
 ('region_ar6_dev', 'varchar', '', ''),
 ('sector_title', 'varchar', '', ''),
 ('subsector_title', 'varchar', '', ''),
 ('gas', 'varchar', '', ''),
 ('gwp100_ar5', 'integer', '', ''),
 ('value', 'double', '', ''),
 ('value_units', 'varchar', '', ''),
 ('year', 'timestamp(6)', '', '')]

In [18]:
qres = engine.execute('describe essd_gwp100_data')
qres.fetchall()

[('iso', 'varchar', '', ''),
 ('country', 'varchar', '', ''),
 ('region_ar6_6', 'varchar', '', ''),
 ('region_ar6_10', 'varchar', '', ''),
 ('region_ar6_22', 'varchar', '', ''),
 ('region_ar6_dev', 'varchar', '', ''),
 ('sector_title', 'varchar', '', ''),
 ('subsector_title', 'varchar', '', ''),
 ('co2', 'double', '', ''),
 ('co2_units', 'varchar', '', ''),
 ('ch4', 'double', '', ''),
 ('ch4_units', 'varchar', '', ''),
 ('n2o', 'double', '', ''),
 ('n2o_units', 'varchar', '', ''),
 ('fgas', 'double', '', ''),
 ('fgas_units', 'varchar', '', ''),
 ('ghg', 'double', '', ''),
 ('ghg_units', 'varchar', '', ''),
 ('year', 'timestamp(6)', '', '')]

A look at CO2e (presuming that is what the `ghg` column of the GWP100 table gives us) for the sector `Energy systems`.

In [19]:
qres = engine.execute("""
select format('%tY', year), sector_title, format('%,.2f', sum(GHG)/1e9) || ' Gt CO2e' as GtCO2e from essd_gwp100_data
where sector_title='Energy systems' and year>DATE('2010-01-01') and year<=DATE('2021-01-01') and ISO='USA'
group by year, sector_title order by year desc""")
qres.fetchall()

[('2020', 'Energy systems', '1.75 Gt CO2e'),
 ('2019', 'Energy systems', '2.35 Gt CO2e'),
 ('2018', 'Energy systems', '2.48 Gt CO2e'),
 ('2017', 'Energy systems', '2.45 Gt CO2e'),
 ('2016', 'Energy systems', '2.51 Gt CO2e'),
 ('2015', 'Energy systems', '2.63 Gt CO2e'),
 ('2014', 'Energy systems', '2.78 Gt CO2e'),
 ('2013', 'Energy systems', '2.78 Gt CO2e'),
 ('2012', 'Energy systems', '2.76 Gt CO2e'),
 ('2011', 'Energy systems', '2.89 Gt CO2e')]
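
One way to line up the bottom-up and top-down figures is sketched below, using values copied from the tables printed in this notebook. Which GHGRP column best corresponds to the ESSD `Energy systems` sector is a judgment call. Pairing it with the "matching Power,Petroleum" column is an assumption made here for illustration, not something the datasets assert.

```python
# GHGRP: direct emitters matching "Power" or "Petroleum", in metric tons (from df_merged).
ghgrp_power_petroleum_t = {2019: 1958369000.0, 2018: 2099221000.0, 2017: 2070082000.0}
# ESSD: USA "Energy systems" GHG totals, in Gt (from the GWP100 query above).
essd_energy_systems_gt = {2019: 2.35, 2018: 2.48, 2017: 2.45}

ratios = {}
for year in sorted(essd_energy_systems_gt, reverse=True):
    ghgrp_gt = ghgrp_power_petroleum_t[year] / 1e9   # metric tons -> Gt
    ratios[year] = ghgrp_gt / essd_energy_systems_gt[year]
    print(f"{year}: GHGRP {ghgrp_gt:.2f} Gt vs ESSD {essd_energy_systems_gt[year]:.2f} Gt "
          f"(ratio {ratios[year]:.2f})")
```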

# Connect with economic data provided by US CENSUS All-sector Survey (2017)

In [20]:
qres = engine.execute("describe census_all_sector_survey_2017")
display(qres.fetchall())
qres = engine.execute("select * from census_all_sector_survey_2017 where naics2012='221112'")
display(qres.fetchall())


[('geo_id', 'varchar', '', ''),
 ('name', 'varchar', '', ''),
 ('geo_id_f', 'bigint', '', ''),
 ('naics2012', 'varchar', '', ''),
 ('naics2012_f', 'varchar', '', ''),
 ('naics2012_label', 'varchar', '', ''),
 ('estab', 'varchar', '', ''),
 ('rcptot', 'varchar', '', ''),
 ('payann', 'varchar', '', ''),
 ('emp', 'varchar', '', ''),
 ('year', 'varchar', '', '')]

[('0100000US', 'United States', None, '221112', None, 'Fossil fuel electric power generation', '1711', '75455040', '8192622', '76058', '2017'),
 ('0100000US', 'United States', None, '221112', None, 'Fossil fuel electric power generation', '1416', '81473633', '7997908', '82071', '2012')]

Exercise the connection to NAICS and sector information provided by US Department of Commerce (US_CENSUS)

In [21]:
# Show how many facilities are tagged with what primary NAICS codes

qres = engine.execute(f"""
select count (*), format('%tY', {epa_table_prefix}direct_emitters.year), primary_naics_code, naics2012_label
from {epa_table_prefix}direct_emitters, census_all_sector_survey_2017
where primary_naics_code=naics2012
      and census_all_sector_survey_2017.year='2017' and {epa_table_prefix}direct_emitters.year=DATE('2017-01-01')
group by {epa_table_prefix}direct_emitters.year, primary_naics_code, naics2012_label
order by count (*) desc limit 20
""")
display(qres.fetchall())

[(1281, '2017', '221112', 'Fossil fuel electric power generation'),
 (1134, '2017', '562212', 'Solid waste landfill'),
 (585, '2017', '486210', 'Pipeline transportation of natural gas'),
 (173, '2017', '325193', 'Ethyl alcohol manufacturing'),
 (141, '2017', '324110', 'Petroleum refineries'),
 (120, '2017', '331110', 'Iron and steel mills and ferroalloy manufacturing'),
 (114, '2017', '322121', 'Paper (except newsprint) mills'),
 (100, '2017', '325199', 'All other basic organic chemical manufacturing'),
 (93, '2017', '327310', 'Cement manufacturing'),
 (79, '2017', '212112', 'Bituminous coal underground mining'),
 (77, '2017', '322130', 'Paperboard mills'),
 (75, '2017', '325211', 'Plastics material and resin manufacturing'),
 (69, '2017', '325120', 'Industrial gas manufacturing'),
 (65, '2017', '562213', 'Solid waste combustors and incinerators'),
 (59, '2017', '221330', 'Steam and air-conditioning supply'),
 (59, '2017', '325180', 'Other basic inorganic chemical manufacturing'),
 (55

# More table reshaping: attribution estimation

In [22]:
df = pd.read_sql(f"""
select facility_id, year, latitude, longitude, latest_reported_industry_type_sectors, total_reported_emissions, total_reported_emissions_units
from {epa_table_prefix}direct_emitters""", engine)
df.facility_id = df.facility_id.astype('int64')
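# to_datetime(..., utc=True) yields timezone-aware timestamps without relying on astype
df.year = pd.to_datetime(df.year, utc=True)
df.total_reported_emissions = df.total_reported_emissions.astype('float64')
df.total_reported_emissions_units = df.total_reported_emissions_units.astype('string')
# Assign the result back rather than using inplace=True on an attribute access
df.latest_reported_industry_type_sectors = df.latest_reported_industry_type_sectors.fillna('Other')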

df['sector_groupings'] = pd.Series([f"{s[0]} ({len(s)+1})" if len(s)>1 else s[0] for s in df.latest_reported_industry_type_sectors.str.split(',')])

In [23]:
for sl in df.latest_reported_industry_type_sectors.str.split(','):
    # Ensure all primary (and if listed, secondary) sectors are represented
    if f's_{sl[0]}' not in df.columns:
        df[f's_{sl[0]}'] = 0.0
    if len(sl)>1 and f's_{sl[1]}' not in df.columns:
        df[f's_{sl[1]}'] = 0.0

In [24]:
attribution_vector = [ pd.Series([1.0]),
                       pd.Series([2.0/3.0, 1.0/3.0]),
                       pd.Series([0.5, 0.3, 0.2]),
                       pd.Series([0.4, 0.3, 0.2, 0.1]),
                       pd.Series([0.30, 0.25, 0.20, 0.15, 0.10]),
                       pd.Series([0.30, 0.24, 0.19, 0.14, 0.09, 0.04])]

def apply_attribution(x):
    sl = x.latest_reported_industry_type_sectors.split(',')
    # Tertiary sectors not previously mentioned are silently converted to Other, keeping our attribution columns from exploding
    appropriate_columns = list(set([f's_{s}' if f's_{s}' in x else 's_Other' for s in sl]))
    x[ appropriate_columns ] = x.total_reported_emissions * attribution_vector[len(appropriate_columns)-1].values
    return x

df_emitters = df.apply(apply_attribution, axis=1)
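
In the output that follows, some rows give the larger share to the second-listed sector (for example `Other,Waste`, where `s_Waste` receives two thirds). That is consistent with `set` not preserving the order in which sectors are listed. A minimal sketch of an order-preserving alternative follows, written with plain dicts rather than DataFrame rows. The function `attribute` and its arguments are hypothetical, and only the first three weight vectors are reproduced.

```python
# First three weight vectors from attribution_vector, as plain lists.
WEIGHTS = [[1.0], [2.0 / 3.0, 1.0 / 3.0], [0.5, 0.3, 0.2]]

def attribute(sectors, total, known_sectors):
    """Split total across sectors, largest share to the first-listed sector."""
    # Sectors without a column of their own are folded into "Other".
    names = [s if s in known_sectors else "Other" for s in sectors.split(",")]
    # dict.fromkeys de-duplicates while keeping first-seen order (unlike set).
    ordered = list(dict.fromkeys(names))
    weights = WEIGHTS[len(ordered) - 1]
    return {name: total * w for name, w in zip(ordered, weights)}

known = {"Other", "Waste", "Chemicals"}
shares = attribute("Other,Waste", 300.0, known)
print(shares)
```

With this ordering the primary sector always receives the first (largest) weight, which appears to be the intent of the descending weight vectors.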

In [25]:
df_emitters[df_emitters.latest_reported_industry_type_sectors.str.contains(',')]

Unnamed: 0,facility_id,year,latitude,longitude,latest_reported_industry_type_sectors,total_reported_emissions,total_reported_emissions_units,sector_groupings,s_Waste,s_Other,...,s_Industrial Gas Suppliers,s_Metals,s_Suppliers of CO2,s_Pulp and Paper,s_Petroleum Product Suppliers,s_Refineries,s_Natural Gas and Natural Gas Liquids Suppliers,s_Injection of CO2,s_Import and Export of Equipment Containing Fluorintaed GHGs,s_Coal-based Liquid Fuel Supply
14,1004206,2014-01-01 00:00:00+00:00,34.641667,-87.038611,"Chemicals,Industrial Gas Suppliers",5.302489e+04,t CO2,Chemicals (3),0.000000,0.000000,...,17674.963520,0.0,0.000,0.0,0.000000,0.000000e+00,0.000000,0.0,0.0,0.0
15,1006665,2014-01-01 00:00:00+00:00,41.755000,-90.284167,"Chemicals,Industrial Gas Suppliers",1.230363e+06,t CO2,Chemicals (3),0.000000,0.000000,...,410121.011733,0.0,0.000,0.0,0.000000,0.000000e+00,0.000000,0.0,0.0,0.0
16,1004836,2014-01-01 00:00:00+00:00,44.789444,-92.908333,"Chemicals,Industrial Gas Suppliers,Minerals",6.630663e+04,t CO2,Chemicals (4),0.000000,0.000000,...,13261.325792,0.0,0.000,0.0,0.000000,0.000000e+00,0.000000,0.0,0.0,0.0
36,1002627,2014-01-01 00:00:00+00:00,43.499510,-92.917090,"Other,Waste",1.791617e+05,t CO2,Other (3),119441.109333,59720.554667,...,0.000000,0.0,0.000,0.0,0.000000,0.000000e+00,0.000000,0.0,0.0,0.0
40,1004761,2014-01-01 00:00:00+00:00,44.958900,-90.960800,"Other,Suppliers of CO2",7.514327e+04,t CO2,Other (3),0.000000,25047.756000,...,0.000000,0.0,50095.512,0.0,0.000000,0.000000e+00,0.000000,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
68445,1007002,2012-01-01 00:00:00+00:00,29.722222,-95.126944,"Chemicals,Petroleum Product Suppliers,Refineries",4.308236e+06,t CO2,Chemicals (4),0.000000,0.000000,...,0.000000,0.0,0.000,0.0,861647.249600,2.154118e+06,0.000000,0.0,0.0,0.0
68446,1004211,2012-01-01 00:00:00+00:00,48.472836,-122.560194,"Petroleum Product Suppliers,Refineries",2.105542e+06,t CO2,Petroleum Product Suppliers (3),0.000000,0.000000,...,0.000000,0.0,0.000,0.0,701847.402000,1.403695e+06,0.000000,0.0,0.0,0.0
68456,1005002,2012-01-01 00:00:00+00:00,27.799140,-97.572170,"Natural Gas and Natural Gas Liquids Suppliers,...",2.312958e+05,t CO2,Natural Gas and Natural Gas Liquids Suppliers (3),0.000000,0.000000,...,0.000000,0.0,0.000,0.0,0.000000,0.000000e+00,77098.605333,0.0,0.0,0.0
68468,1009291,2012-01-01 00:00:00+00:00,41.261600,-110.807000,"Petroleum Product Suppliers,Refineries",3.507487e+04,t CO2,Petroleum Product Suppliers (3),0.000000,0.000000,...,0.000000,0.0,0.000,0.0,11691.624667,2.338325e+04,0.000000,0.0,0.0,0.0


In [26]:
df_emitters[df_emitters.latest_reported_industry_type_sectors.str.count(',')>1]

Unnamed: 0,facility_id,year,latitude,longitude,latest_reported_industry_type_sectors,total_reported_emissions,total_reported_emissions_units,sector_groupings,s_Waste,s_Other,...,s_Industrial Gas Suppliers,s_Metals,s_Suppliers of CO2,s_Pulp and Paper,s_Petroleum Product Suppliers,s_Refineries,s_Natural Gas and Natural Gas Liquids Suppliers,s_Injection of CO2,s_Import and Export of Equipment Containing Fluorintaed GHGs,s_Coal-based Liquid Fuel Supply
16,1004836,2014-01-01 00:00:00+00:00,44.789444,-92.908333,"Chemicals,Industrial Gas Suppliers,Minerals",6.630663e+04,t CO2,Chemicals (4),0.00,0.0,...,13261.325792,0.0,0.000,0.00,0.0000,0.000,0.0,0.000000e+00,0.0,0.0
78,1007217,2012-01-01 00:00:00+00:00,41.778990,-107.104000,"Chemicals,Petroleum Product Suppliers,Refineries",9.519548e+05,t CO2,Chemicals (4),0.00,0.0,...,0.000000,0.0,0.000,0.00,190390.9692,475977.423,0.0,0.000000e+00,0.0,0.0
225,1007923,2012-01-01 00:00:00+00:00,39.805556,-104.944444,"Chemicals,Petroleum Product Suppliers,Refineries",8.957908e+05,t CO2,Chemicals (4),0.00,0.0,...,0.000000,0.0,0.000,0.00,179158.1552,447895.388,0.0,0.000000e+00,0.0,0.0
355,1002150,2012-01-01 00:00:00+00:00,41.887000,-110.094460,"Injection of CO2,Petroleum and Natural Gas Sys...",4.482297e+06,t CO2,Injection of CO2 (4),0.00,0.0,...,0.000000,0.0,2241148.643,0.00,0.0000,0.000,0.0,1.344689e+06,0.0,0.0
472,1002027,2012-01-01 00:00:00+00:00,44.850583,-93.002139,"Chemicals,Petroleum Product Suppliers,Refineries",7.713741e+05,t CO2,Chemicals (4),0.00,0.0,...,0.000000,0.0,0.000,0.00,154274.8276,385687.069,0.0,0.000000e+00,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
67984,1001920,2012-01-01 00:00:00+00:00,36.682100,-97.089500,"Chemicals,Petroleum Product Suppliers,Refineries",2.021000e+06,t CO2,Chemicals (4),0.00,0.0,...,0.000000,0.0,0.000,0.00,404200.0960,1010500.240,0.0,0.000000e+00,0.0,0.0
67987,1007196,2012-01-01 00:00:00+00:00,36.946100,-89.079200,"Pulp and Paper,Suppliers of CO2,Waste",1.984303e+05,t CO2,Pulp and Paper (4),59529.09,0.0,...,0.000000,0.0,99215.150,39686.06,0.0000,0.000,0.0,0.000000e+00,0.0,0.0
68389,1006395,2012-01-01 00:00:00+00:00,38.043600,-122.253200,"Chemicals,Petroleum Product Suppliers,Refineries",1.510389e+06,t CO2,Chemicals (4),0.00,0.0,...,0.000000,0.0,0.000,0.00,302077.8760,755194.690,0.0,0.000000e+00,0.0,0.0
68390,1002565,2012-01-01 00:00:00+00:00,35.395110,-119.046520,"Chemicals,Petroleum Product Suppliers,Refineries",9.681997e+04,t CO2,Chemicals (4),0.00,0.0,...,0.000000,0.0,0.000,0.00,19363.9932,48409.983,0.0,0.000000e+00,0.0,0.0


# Working with Materialized Views

For now, we use Database Tables, because the specific Trino connector we are using does not support Materialized Views.
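
As a rough illustration of this table-instead-of-view approach, the sketch below rebuilds a derived table from a query each time it is run. It uses Python's built-in `sqlite3` as a stand-in for Trino, and the helper name `materialize` and the toy `emitters` table are assumptions made only for this example.

```python
import sqlite3

def materialize(conn, table_name, select_sql):
    """Emulate a materialized view: drop the old table, rebuild it from a query."""
    # Table names cannot be bound as parameters, so only pass trusted names here.
    conn.execute(f"drop table if exists {table_name}")
    conn.execute(f"create table {table_name} as {select_sql}")

conn = sqlite3.connect(":memory:")
conn.execute("create table emitters (facility_id integer, year integer, emissions real)")
conn.executemany(
    "insert into emitters values (?, ?, ?)",
    [(1, 2019, 100.0), (1, 2020, 80.0), (2, 2020, 50.0)],
)

# "Refreshing" the derived table just means running the helper again.
materialize(conn, "emissions_by_year",
            "select year, sum(emissions) as total from emitters group by year")
totals = conn.execute("select year, total from emissions_by_year order by year").fetchall()
print(totals)
```

The trade-off is that such a table only reflects the source data as of the last rebuild, so it has to be re-created whenever the underlying tables change.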

Here's an example of a facility with many owners...

In [27]:
qres = engine.execute(f"""
select ghgrp_facility_id,frs_id_facility,lei,format('%tY', reporting_year),facility_name,
       facility_city,facility_state,parent_company_name,facility_naics_code
from {epa_table_prefix}parent_company where YEAR(reporting_year)=2020 and ghgrp_facility_id=1005071 order by lei""")
qres.fetchall()

[(1005071, '110000702730', '2549000NXAL5JJHJYT18', '2020', 'North Terrebonne Gas Plant', 'Gibson', 'LA', 'ENERGY RESOURCES TECHNOLOGY LAND INC', '211130'),
 (1005071, '110000702730', '54930000S35EESPK1C27', '2020', 'North Terrebonne Gas Plant', 'Gibson', 'LA', 'BYRON ENERGY LLC', '211130'),
 (1005071, '110000702730', '5493003QENHHS261UR94', '2020', 'North Terrebonne Gas Plant', 'Gibson', 'LA', 'TARGA RESOURCES CORP', '211130'),
 (1005071, '110000702730', '5493005Y7TJPYWLDEO18', '2020', 'North Terrebonne Gas Plant', 'Gibson', 'LA', 'ARENA ENERGY LP', '211130'),
 (1005071, '110000702730', '5493007VQUSLFRDRBT52', '2020', 'North Terrebonne Gas Plant', 'Gibson', 'LA', 'SUPERIOR NATURAL GAS CORP', '211130'),
 (1005071, '110000702730', '549300HX0ISXOOEMR657', '2020', 'North Terrebonne Gas Plant', 'Gibson', 'LA', 'BLACK ELK ENERGY OFFSHORE OPERATIONS LLC', '211130'),
 (1005071, '110000702730', '549300IRDTHJQ1PVET45', '2020', 'North Terrebonne Gas Plant', 'Gibson', 'LA', 'FREEPORT-MCMORAN INC',

...meaning this one facility contributes 10 parent-company rows, which puts it outside our easy-to-aggregate data

In [28]:
qres = engine.execute(f"""
select facility_id,facility_name,total_reported_emissions,
       city,state,latitude,longitude,primary_naics_code,
       latest_reported_industry_type_subparts,latest_reported_industry_type_sectors,format('%tY', year)
from {epa_table_prefix}direct_emitters where facility_id=1005071 order by year""")
qres.fetchall()

[(1005071, 'North Terrebonne Gas Plant', 383446.646, 'Gibson', 'LA', 29.6257, -90.9289, '211130', 'C,W-PROC', 'Petroleum and Natural Gas Systems', '2011'),
 (1005071, 'North Terrebonne Gas Plant', 339163.524, 'Gibson', 'LA', 29.6257, -90.9289, '211130', 'C,W-PROC', 'Petroleum and Natural Gas Systems', '2012'),
 (1005071, 'North Terrebonne Gas Plant', 313640.418, 'Gibson', 'LA', 29.6257, -90.9289, '211130', 'C,W-PROC', 'Petroleum and Natural Gas Systems', '2013'),
 (1005071, 'North Terrebonne Gas Plant', 312585.924, 'Gibson', 'LA', 29.6257, -90.9289, '211130', 'C,W-PROC', 'Petroleum and Natural Gas Systems', '2014'),
 (1005071, 'North Terrebonne Gas Plant', 292713.398, 'Gibson', 'LA', 29.6257, -90.9289, '211130', 'C,W-PROC', 'Petroleum and Natural Gas Systems', '2015'),
 (1005071, 'North Terrebonne Gas Plant', 216519.086, 'Gibson', 'LA', 29.6257, -90.9289, '211130', 'C,W-PROC', 'Petroleum and Natural Gas Systems', '2016'),
 (1005071, 'North Terrebonne Gas Plant', 194438.318, 'Gibson', '

Create actual materialized data from a large concatenation operation

In [29]:
import osc_ingest_trino as osc
import itertools

engine.execute(f"create schema if not exists {ingest_schema}")

# display([(x, y) for x, y in zip(emission_tables,tot_em_columns)])

emission_selects = [ f"""
select ghgrp_facility_id, reporting_year, lei, '{e_tbl}' as table_source,
         primary_naics_code, parent_co_percent_ownership * 0.01 * {e_col} as fractional_emissions,
         facility_naics_code, parent_company_name
    from {epa_table_prefix}parent_company as PC join {epa_table_prefix}{e_tbl} as ET on PC.ghgrp_facility_id=ET.facility_id and PC.reporting_year=ET.year
""" for e_tbl, e_col in zip(emission_tables,itertools.repeat('total_reported_emissions')) ]

qres = engine.execute(f"drop table if exists {epa_table_prefix}parent_attribution")
print(qres.fetchall())

sql = f"""
create table {epa_table_prefix}parent_attribution as {' union all '.join(emission_selects)}
"""

print(sql)

qres = engine.execute(sql)
print(qres.fetchall())

[(True,)]

create table epa_parent_attribution as 
select ghgrp_facility_id, reporting_year, lei, 'direct_emitters' as table_source,
         primary_naics_code, parent_co_percent_ownership * 0.01 * total_reported_emissions as fractional_emissions,
         facility_naics_code, parent_company_name
    from epa_parent_company as PC join epa_direct_emitters as ET on PC.ghgrp_facility_id=ET.facility_id and PC.reporting_year=ET.year
 union all 
select ghgrp_facility_id, reporting_year, lei, 'onshore_oil_gas_prod' as table_source,
         primary_naics_code, parent_co_percent_ownership * 0.01 * total_reported_emissions as fractional_emissions,
         facility_naics_code, parent_company_name
    from epa_parent_company as PC join epa_onshore_oil_gas_prod as ET on PC.ghgrp_facility_id=ET.facility_id and PC.reporting_year=ET.year
 union all 
select ghgrp_facility_id, reporting_year, lei, 'gathering_boosting' as table_source,
         primary_naics_code, parent_co_percent_ownership * 0.01 * 
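
The attribution rule in the generated SQL, `parent_co_percent_ownership * 0.01 * total_reported_emissions`, can be checked on a toy example. The sketch below uses plain Python. The owner names and percentages are made up for illustration, and only the emissions figure (the 2011 value reported for facility 1005071 earlier in this notebook) comes from the data.

```python
# Hypothetical ownership shares for one facility (names and percentages are made up).
owners = [
    ("OWNER A", 50.0),
    ("OWNER B", 30.0),
    ("OWNER C", 20.0),
]
total_reported_emissions = 383446.646  # t CO2e

# Same rule as the SQL: percent ownership * 0.01 * total emissions.
fractional = {
    name: pct * 0.01 * total_reported_emissions
    for name, pct in owners
}

for name, share in fractional.items():
    print(f"{name}: {share:,.2f} t CO2e")

# When the percentages add up to 100, the shares add back up to the facility total.
attributed_total = sum(fractional.values())
```

If the reported percentages for a facility sum to less than 100, part of its emissions stays unattributed under this rule.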

In [30]:
qres = engine.execute(f"describe {epa_table_prefix}parent_attribution")
display(qres.fetchall())

qres = engine.execute(f"""
select ghgrp_facility_id, YEAR(reporting_year), lei, table_source, format('%,.2f', fractional_emissions) || ' t CO2e' as metric
from {epa_table_prefix}parent_attribution""")
qres.fetchall()[::2000]

[('ghgrp_facility_id', 'bigint', '', ''),
 ('reporting_year', 'timestamp(6)', '', ''),
 ('lei', 'varchar', '', ''),
 ('table_source', 'varchar', '', ''),
 ('primary_naics_code', 'varchar', '', ''),
 ('fractional_emissions', 'double', '', ''),
 ('facility_naics_code', 'varchar', '', ''),
 ('parent_company_name', 'varchar', '', '')]

[(1010455, 2020, 'VQOHU6HCVU6YY1KUKU03', 'transmission_pipelines', '3,135.46 t CO2e'),
 (1010811, 2012, 'XRZQ5S7HYJFPHJ78L959', 'sf6_from_elec_equip', '91,298.04 t CO2e'),
 (1001626, 2012, 'SHMU4TLUTSPECU6W5487', 'ldc_direct_emissions', '62,605.10 t CO2e'),
 (1001314, 2020, '549300OX0Q38NLSKPB49', 'direct_emitters', '1,118,330.59 t CO2e'),
 (1012923, 2020, '5493005JBO5YSIGK1814', 'direct_emitters', '35,902.89 t CO2e'),
 (1002000, 2012, 'FP4Y9QGZ4D2KY6M7GL79', 'direct_emitters', '47,195.88 t CO2e'),
 (1002910, 2012, '5493007RXSP83CHG5U59', 'direct_emitters', '775,064.70 t CO2e'),
 (1001762, 2011, None, 'direct_emitters', '156,836.09 t CO2e'),
 (1004097, 2012, '5299005MZ4WZECVATV08', 'direct_emitters', '1,079,406.91 t CO2e'),
 (1005138, 2013, None, 'direct_emitters', '112,634.25 t CO2e'),
 (1006570, 2015, '6JGDANYXFLQCUTA0R543', 'direct_emitters', '40,989.20 t CO2e'),
 (1007803, 2015, None, 'direct_emitters', '906,818.51 t CO2e'),
 (1001919, 2016, '5493007131OTEECH0P98', 'direct_emitters

How many **_facilities owned by public companies_** match corporate reports that we can see in the SEC's DERA dataset?

See how many `PARENT_COMPANY` records have LEIs we know.  Note that there are about 8400 total facilities, so for each facility covered by an LEI there are roughly 4 that are not.
There are 3K-4K distinctly named entities, so the average entity owns (at least partially) approximately 2-3 facilities.  It also means we know the LEIs of approximately half of the parent companies.

In [31]:
qres = engine.execute(f"""select count (*), DATE(reporting_year)
from (select lei, reporting_year from {epa_table_prefix}parent_company where LEI is not null group by lei, reporting_year)
group by reporting_year order by reporting_year desc""")
qres.fetchall()

[(1542, '2020-01-01'),
 (1559, '2019-01-01'),
 (1626, '2018-01-01'),
 (1562, '2017-01-01'),
 (1560, '2016-01-01'),
 (1541, '2015-01-01'),
 (1712, '2014-01-01'),
 (1695, '2013-01-01'),
 (1657, '2012-01-01'),
 (1578, '2011-01-01'),
 (1309, '2010-01-01')]

In [32]:
qres = engine.execute(f"describe {dera_table_prefix}sub")
qres.fetchall()

[('adsh', 'varchar', '', ''),
 ('cik', 'integer', '', ''),
 ('name', 'varchar', '', ''),
 ('lei', 'varchar', '', ''),
 ('sic', 'integer', '', ''),
 ('countryba', 'varchar', '', ''),
 ('stprba', 'varchar', '', ''),
 ('cityba', 'varchar', '', ''),
 ('zipba', 'varchar', '', ''),
 ('bas1', 'varchar', '', ''),
 ('bas2', 'varchar', '', ''),
 ('baph', 'varchar', '', ''),
 ('countryma', 'varchar', '', ''),
 ('stprma', 'varchar', '', ''),
 ('cityma', 'varchar', '', ''),
 ('zipma', 'varchar', '', ''),
 ('mas1', 'varchar', '', ''),
 ('mas2', 'varchar', '', ''),
 ('countryinc', 'varchar', '', ''),
 ('stprinc', 'varchar', '', ''),
 ('ein', 'bigint', '', ''),
 ('former', 'varchar', '', ''),
 ('changed', 'varchar', '', ''),
 ('afs', 'varchar', '', ''),
 ('wksi', 'boolean', '', ''),
 ('fye', 'varchar', '', ''),
 ('form', 'varchar', '', ''),
 ('period', 'timestamp(6)', '', ''),
 ('fy', 'timestamp(6)', '', ''),
 ('fp', 'varchar', '', ''),
 ('filed', 'timestamp(6)', '', ''),
 ('accepted', 'timestamp(6)',

In [33]:
qres = engine.execute(f"describe {dera_table_prefix}num")
qres.fetchall()

[('adsh', 'varchar', '', ''),
 ('tag', 'varchar', '', ''),
 ('version', 'varchar', '', ''),
 ('coreg', 'varchar', '', ''),
 ('ddate', 'timestamp(6)', '', ''),
 ('qtrs', 'integer', '', ''),
 ('uom', 'varchar', '', ''),
 ('value', 'double', '', ''),
 ('footnote', 'varchar', '', ''),
 ('srcdir', 'varchar', '', '')]

In [34]:
qres = engine.execute(f"""select count (*), DATE(reporting_year)
from {epa_table_prefix}parent_attribution PA join {dera_table_prefix}sub S on PA.lei=S.lei
where (form='10-K' or form='20-F' or form='40-F')
and YEAR(reporting_year)=2020 and YEAR(fy)=2020
and PA.lei is not null
group by PA.reporting_year
order by PA.reporting_year
""")
qres.fetchall()

[(3772, '2020-01-01')]

We can tie these companies to ticker symbols...

In [35]:
qres = engine.execute(f"""select * from ticker limit 10""")
qres.fetchall()

[(None, 1074769),
 ('ephyw', 1827248),
 ('epr-pe', 1045450),
 ('emmaw', 822370),
 ('eocw-un', 1843862),
 ('eeh', 352960),
 ('edncw', 1864891),
 ('ecolw', 1783400),
 ('ebgef', 895728),
 ('chpmw', 1785041)]

How many distinct companies own these facilities (and what are their ticker symbols)?

In [36]:
qres = engine.execute(f"""
with leis as (select DISTINCT(S.lei), name, if(tname IS NULL, '<private>', tname) as ticker
              from {epa_table_prefix}parent_attribution PA join {dera_table_prefix}sub S on PA.lei=S.lei
                   join ticker on S.cik=ticker.cik
              where (form='10-K' or form='20-F' or form='40-F')
              and period>=DATE('2020-01-01')
              and period<DATE('2021-01-01'))
select count (*), ticker, leis.lei, name, format('%tY', reporting_year)
from {epa_table_prefix}parent_attribution PA join leis on PA.lei=leis.lei
where YEAR(reporting_year)=2020
group by leis.ticker, leis.lei, name, reporting_year
order by count(*) desc
-- limit 10
""")
ticker_list = qres.fetchall()
print(len(ticker_list))

433


Note that some companies have more than one ticker symbol!

In [37]:
ticker_list[0:50]

[(259, 'kmi', '549300WR7IX8XE0TBO16', 'KINDER MORGAN, INC.', '2020'),
 (259, 'ep-pc', '549300WR7IX8XE0TBO16', 'KINDER MORGAN, INC.', '2020'),
 (228, 'wm', '549300YX8JIID70NFS41', 'WASTE MANAGEMENT INC', '2020'),
 (162, 'rsg', 'NKNQHM6BLECKVOQP7O46', 'REPUBLIC SERVICES, INC.', '2020'),
 (158, 'et-pe', 'MTLVN9N7JE8MIBIJ1H73', 'ENERGY TRANSFER LP', '2020'),
 (158, 'et-pc', 'MTLVN9N7JE8MIBIJ1H73', 'ENERGY TRANSFER LP', '2020'),
 (158, 'et-pd', 'MTLVN9N7JE8MIBIJ1H73', 'ENERGY TRANSFER LP', '2020'),
 (158, 'et', 'MTLVN9N7JE8MIBIJ1H73', 'ENERGY TRANSFER LP', '2020'),
 (114, 'brk-b', '5493000C01ZX7D35SD85', 'BERKSHIRE HATHAWAY INC', '2020'),
 (114, 'brk-a', '5493000C01ZX7D35SD85', 'BERKSHIRE HATHAWAY INC', '2020'),
 (89, 'xom', 'J3WHBG0MTS7O8ZVMDC91', 'EXXON MOBIL CORP', '2020'),
 (87, 'wmb', 'D71FAKCBLFS2O0RBPG08', 'WILLIAMS COMPANIES, INC.', '2020'),
 (65, 'epd', 'K4CDIF4M54DJZ6TB4Q48', 'ENTERPRISE PRODUCTS PARTNERS L.P.', '2020'),
 (61, 'so', '549300FC3G3YU2FBZD92', 'SOUTHERN CO', '2020'),


We can try to add up all the facilities for all the tickers, but that double-counts companies that have multiple ticker symbols (the total should be 2651, not 5779).

In [38]:
sum([te[0] for te in ticker_list])

5779
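
One way to avoid the double counting is to key the counts by LEI instead of by ticker. The sketch below is only an illustration of that idea on a handful of rows copied from `ticker_list` above; the variable names `sample`, `count_by_lei`, `naive_total` and `deduped_total` are introduced here and are not part of the notebook.

```python
# A few rows copied from ticker_list: (count, ticker, lei, name, year)
sample = [
    (259, 'kmi', '549300WR7IX8XE0TBO16', 'KINDER MORGAN, INC.', '2020'),
    (259, 'ep-pc', '549300WR7IX8XE0TBO16', 'KINDER MORGAN, INC.', '2020'),
    (228, 'wm', '549300YX8JIID70NFS41', 'WASTE MANAGEMENT INC', '2020'),
    (158, 'et-pe', 'MTLVN9N7JE8MIBIJ1H73', 'ENERGY TRANSFER LP', '2020'),
    (158, 'et', 'MTLVN9N7JE8MIBIJ1H73', 'ENERGY TRANSFER LP', '2020'),
]

# Summing per ticker counts Kinder Morgan and Energy Transfer twice.
naive_total = sum(row[0] for row in sample)

# Keep one count per LEI; every ticker of a company carries the same count.
count_by_lei = {}
for count, _ticker, lei, _name, _year in sample:
    count_by_lei[lei] = count

deduped_total = sum(count_by_lei.values())
print(naive_total, deduped_total)
```

The same loop applied to the full `ticker_list` would give one count per company rather than one per ticker symbol.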

Sample data to cross-check LEI, Facility ID and EDGAR submission data

In [39]:
qres = engine.execute(f"""
select DISTINCT(S.lei), ghgrp_facility_id, adsh
              from {epa_table_prefix}parent_attribution PA join {dera_table_prefix}sub S on PA.lei=S.lei
              where YEAR(reporting_year)=2020
              and (form='10-K' or form='20-F' or form='40-F')
              and YEAR(period)=2020
              order by S.lei desc""")
l = qres.fetchall()
print(len(l))
display(l[::100])

3648


[('ZW1LRE7C3H17O2ZN9B45', 1005442, '0001628280-21-003434'),
 ('WD6L6041MNRW1JE49D58', 1013001, '0000100493-20-000132'),
 ('UMI46YPGBLUE4VGNNT48', 1013722, '0000753308-21-000014'),
 ('R8V1FN4M5ITGZOG7BS19', 1009462, '0001140361-21-003906'),
 ('NKNQHM6BLECKVOQP7O46', 1008003, '0001060391-21-000014'),
 ('NKNQHM6BLECKVOQP7O46', 1006747, '0001060391-21-000014'),
 ('MTLVN9N7JE8MIBIJ1H73', 1003126, '0001276187-21-000034'),
 ('MP3J6QPYPGN75NVW2S34', 1002179, '0000055785-21-000016'),
 ('K4CDIF4M54DJZ6TB4Q48', 1012520, '0001061219-21-000009'),
 ('J3WHBG0MTS7O8ZVMDC91', 1005858, '0000034088-21-000012'),
 ('IM7X0T3ECJW4C1T7ON55', 1003552, '0000797468-21-000009'),
 ('I1BZKREC126H0VB1BL91', 1001543, '0001326160-21-000063'),
 ('FCNMH6O7VWU7LHXMK351', 1008503, '0000858470-21-000013'),
 ('CE5OG6JPOZMDSA0LAQ19', 1007781, '0001021635-21-000026'),
 ('824LMFJDH41EY779Q875', 1006776, '0000051434-21-000012'),
 ('5E2UPK5SW04M13XY7I38', 1007243, '0001013871-21-000005'),
 ('549300YX8JIID70NFS41', 1007575, '0001

Compute intensity in metric tons of CO2e per million dollars

In [40]:
qres = engine.execute(f"""
select PA.lei, sic, floor(sic/100) as sic_2digit, format('%1$tY-%1$tm-%1$td', reporting_year),
       name, sum(fractional_emissions) as tot_co2e,
       uom || ' $M', round(max(value)/1e6,3) as tot_revenue,
       format('%7.2f', 1e6*sum(fractional_emissions)/sum(value)) || ' t CO2e/$M' as intensity
from {epa_table_prefix}parent_attribution as PA join {dera_table_prefix}sub as S on PA.lei=S.lei
     join {dera_table_prefix}num as N on S.adsh=N.adsh
where YEAR(reporting_year)=2020
and (form='10-K' or form='20-F' or form='40-F')
and YEAR(period)=2020
and YEAR(ddate)=2020
and coreg is NULL
and (N.tag='Revenues'
     or N.tag='RevenueFromContractWithCustomerIncludingAssessedTax'
     or N.tag='RevenueFromContractWithCustomerExcludingAssessedTax'
     or N.tag='RevenuesNetOfInterestExpense'
     or N.tag='RegulatedAndUnregulatedOperatingRevenue'
     or N.tag='RegulatedOperatingRevenuePipelines')
and N.qtrs=4
group by PA.lei, PA.reporting_year, sic, name, uom
order by intensity desc
-- limit 100
""")
rows = qres.fetchall()
print(len(rows))
display(rows[::5])

337


[('549300O4B5CVWMKUES27', 3829, 38, '2020-01-01', 'MIDWEST ENERGY EMISSIONS CORP.', 29981.588, 'USD $M', 8.158, '3674.91 t CO2e/$M'),
 ('254900GKEQRHOI2SSC19', 1220, 12, '2020-01-01', 'HALLADOR ENERGY CO', 2396625.0, 'USD $M', 245.295, '1639.12 t CO2e/$M'),
 ('5493007QR70AVQSNF619', 1311, 13, '2020-01-01', 'SILVERBOW RESOURCES, INC.', 203574.548, 'USD $M', 177.386, '1147.64 t CO2e/$M'),
 ('529900CG8YAQFZ2JMV97', 2870, 28, '2020-01-01', 'CF INDUSTRIES HOLDINGS, INC.', 19602923.075999998, 'USD $M', 4124.0, ' 950.68 t CO2e/$M'),
 ('5493002H80P81B3HXL31', 4911, 49, '2020-01-01', 'CLECO CORPORATE HOLDINGS LLC', 19183441.194319956, 'USD $M', 1498.146, ' 822.08 t CO2e/$M'),
 ('549300NNLSIMY6Z8OT86', 4931, 49, '2020-01-01', 'ALLETE INC', 4235536.671091213, 'USD $M', 1169.1, ' 724.58 t CO2e/$M'),
 ('549300G6KKUMMXM8NH73', 1311, 13, '2020-01-01', 'BERRY CORP (BRY)', 3339128.304000001, 'USD $M', 523.833, ' 598.48 t CO2e/$M'),
 ('549300JK3KH8PWM3B226', 1311, 13, '2020-01-01', 'CNX RESOURCES CORP',
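
One thing worth double-checking in the query above is how the join interacts with the aggregates. Each facility row for a company is paired with that company's revenue row, so `sum(value)` grows with the number of joined rows while `max(value)` does not. In the output, CF Industries shows 19,602,923 t CO2e and 4,124 $M, which divides to about 4,753 t CO2e/$M, roughly five times the displayed 950.68. That would be consistent with this fan-out effect. The sketch below reproduces it with Python's built-in `sqlite3` as a stand-in for Trino. The table names, the LEI and all numbers are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table attribution (lei text, fractional_emissions real);
create table revenue (lei text, value real);
-- One company with three facilities and a single revenue figure of $2M.
insert into attribution values ('LEI1', 100.0), ('LEI1', 200.0), ('LEI1', 300.0);
insert into revenue values ('LEI1', 2000000.0);
""")

row = conn.execute("""
select sum(fractional_emissions)                    as tot_co2e,
       max(value) / 1e6                             as revenue_musd,
       1e6 * sum(fractional_emissions) / sum(value) as intensity_sum,
       1e6 * sum(fractional_emissions) / max(value) as intensity_max
from attribution a join revenue r on a.lei = r.lei
group by a.lei
""").fetchone()

tot_co2e, revenue_musd, intensity_sum, intensity_max = row
print(row)
```

Here `intensity_max` equals total emissions divided by revenue, while `intensity_sum` is smaller by a factor of three, the number of facility rows. Dividing by `max(value)`, or aggregating emissions per company before joining to revenue, are two possible ways to keep the ratio consistent with the displayed totals.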

# A Deep Dive into outlier data

In [41]:
qreg=engine.execute(f"""
select DISTINCT(S.lei), ghgrp_facility_id, name, adsh
from {epa_table_prefix}parent_attribution PA join {dera_table_prefix}sub S on PA.lei=S.lei
where YEAR(reporting_year)=2020 and S.lei='549300O4B5CVWMKUES27'
and (form='10-K' or form='20-F' or form='40-F')
and YEAR(period)=2020""")
qreg.fetchall()

[('549300O4B5CVWMKUES27', 1012016, 'MIDWEST ENERGY EMISSIONS CORP.', '0001477932-21-002039')]

In [42]:
qreg=engine.execute(f"""
select DATE(reporting_year), format ('%,10.2f', sum(fractional_emissions)) || ' t CO2e' as metric
from {epa_table_prefix}parent_attribution
where lei='549300O4B5CVWMKUES27'
group by DATE(reporting_year)
""")
l = qreg.fetchall()
l

[('2014-01-01', ' 21,395.30 t CO2e'),
 ('2018-01-01', '100,039.15 t CO2e'),
 ('2013-01-01', ' 24,847.40 t CO2e'),
 ('2012-01-01', ' 24,854.15 t CO2e'),
 ('2019-01-01', ' 69,246.51 t CO2e'),
 ('2017-01-01', ' 71,698.95 t CO2e'),
 ('2020-01-01', ' 29,981.59 t CO2e'),
 ('2015-01-01', ' 46,877.45 t CO2e'),
 ('2011-01-01', ' 25,032.10 t CO2e'),
 ('2016-01-01', ' 98,413.61 t CO2e')]

# GHGRP Direct Emitters include Cement and Steel Plants (which we can connect to SFI data)

In [43]:
qres = engine.execute("describe sfi_cement")
display(qres.fetchall())

[('uid', 'varchar', '', ''),
 ('city', 'varchar', '', ''),
 ('state', 'varchar', '', ''),
 ('country', 'varchar', '', ''),
 ('iso3', 'varchar', '', ''),
 ('country_code', 'bigint', '', ''),
 ('region', 'varchar', '', ''),
 ('sub_region', 'varchar', '', ''),
 ('latitude', 'double', '', ''),
 ('longitude', 'double', '', ''),
 ('accuracy', 'varchar', '', ''),
 ('status', 'varchar', '', ''),
 ('plant_type', 'varchar', '', ''),
 ('production_type', 'varchar', '', ''),
 ('capacity', 'double', '', ''),
 ('capacity_units', 'varchar', '', ''),
 ('capacity_source', 'varchar', '', ''),
 ('year', 'timestamp(6)', '', ''),
 ('owner_permid', 'bigint', '', ''),
 ('owner_name', 'varchar', '', ''),
 ('owner_source', 'varchar', '', ''),
 ('parent_permid', 'bigint', '', ''),
 ('parent_name', 'varchar', '', ''),
 ('ownership_stake', 'double', '', ''),
 ('parent_lei', 'varchar', '', ''),
 ('parent_holding_status', 'varchar', '', ''),
 ('parent_ticker', 'varchar', '', ''),
 ('parent_exchange', 'varchar', '',

In [44]:
qres = engine.execute("select count (*) from sfi_cement")
display(qres.fetchall())
qres = engine.execute("select count (*) from sfi_steel")
display(qres.fetchall())

# There are 105 US-located cement plants listed in the SFI report with parent LEIs
qres = engine.execute("select count (*), iso3 from sfi_cement where iso3='USA' group by iso3")
display(qres.fetchall())

qres = engine.execute(f"""
select owner_name, parent_name, lei, parent_lei, facility_id
from sfi_cement, {epa_table_prefix}direct_emitters, {epa_table_prefix}parent_company
where ghgrp_facility_id=facility_id
and reporting_year={epa_table_prefix}direct_emitters.year
and YEAR(reporting_year)=2020
and sfi_cement.iso3='USA'
and abs(sfi_cement.latitude-{epa_table_prefix}direct_emitters.latitude)<0.01
and abs(sfi_cement.longitude-{epa_table_prefix}direct_emitters.longitude)<0.01
""")
l = qres.fetchall()
print(f"{len(l)}: facilities/parent relationships matched in USA using lat/lon")

[(3117,)]

[(1598,)]

[(105, 'USA')]

106: facilities/parent relationships matched in USA using lat/lon
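
The matching rule in the query above pairs a cement plant with a GHGRP facility when both latitude and longitude differ by less than 0.01 degrees. The sketch below shows the same test in plain Python. The function name `within_box`, the identifiers and the coordinates are all hypothetical and serve only to illustrate the rule.

```python
def within_box(lat_a, lon_a, lat_b, lon_b, tol=0.01):
    """True when both coordinates differ by less than tol degrees."""
    return abs(lat_a - lat_b) < tol and abs(lon_a - lon_b) < tol

# Hypothetical plant and facility coordinates, for illustration only.
plants = [("plant-1", 29.6300, -90.9300), ("plant-2", 41.7550, -90.2842)]
facilities = [(101, 29.6257, -90.9289), (102, 41.9000, -90.2842)]

# Compare every plant with every facility, as the SQL cross join does.
matches = [
    (plant_id, facility_id)
    for plant_id, p_lat, p_lon in plants
    for facility_id, f_lat, f_lon in facilities
    if within_box(p_lat, p_lon, f_lat, f_lon)
]
print(matches)
```

A tolerance of 0.01 degrees of latitude is roughly 1.1 km, and the same tolerance in longitude covers less ground away from the equator. Because the query also joins `parent_company`, a plant whose facility has several parent companies yields several matched rows.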


In [45]:
l[3::2]

[('Suwannee American Cement Company LLC', 'CRH PLC', '549300RN11MJ182CNF63', '549300MIDJNNTH068E74', 1000533),
 ('Lehigh White Cement Co', 'Cementir Holding NV', None, '8156008B101B97A43B02', 1002554),
 ('Argos USA Corp', 'Grupo Argos SA', '549300U2JOCV4PZHDX74', '254900HANAO95XIAE681', 1003479),
 ('Lehigh Hanson Inc', 'HeidelbergCement AG', None, 'LZ2C6E0W5W7LQMX5ZI37', 1004612),
 ('Texas Industries Inc', 'Martin Marietta Materials Inc', '5299005MZ4WZECVATV08', '5299005MZ4WZECVATV08', 1007792),
 ('Argos USA Corp', 'Grupo Argos SA', '549300FC3G3YU2FBZD92', '254900HANAO95XIAE681', 1001508),
 ('Argos USA Corp', 'Grupo Argos SA', '2549000NKLSHNQQBTJ24', '254900HANAO95XIAE681', 1005458),
 ('Lehigh Hanson Inc', 'HeidelbergCement AG', '40XIFLS8XDQGGHGPGC04', 'LZ2C6E0W5W7LQMX5ZI37', 1000362),
 ('Armstrong Cement & Supply Corporation', 'Snyder Associated Co Inc', None, None, 1006785),
 ('St Marys Cement Inc Canada', 'Votorantim SA', '549300V0FRLSRFEZFU13', '5493009RIHSE12DQ0J28', 1006263),
 ('