# ITR Data Pipeline

The ITR data pipeline organizes and assembles data needed for the ITR tool.  The data may come from many sources, but the output of this pipeline is a complete, consistent dataset that can be fully interrogated by the ITR tool.  If users wish to add additional data or analyze additional portfolio companies, they must create a new dataset using this pipeline.

These are the data needed to create the ITR dataset:
* Global Parameters (just for reference--we do nothing with them here)
* Industry Data (Sector Projections aka Benchmarks)
* Portfolio Data (Must cover all the stocks a user may query)
* Company Data (Must cover all companies in all possible portfolio universes)
* Automization (Must cover all years and scenarios a user may query)

The ITR tool can create secondary datasets:
* Cumulative emissions targets trajectories
* Cumulative emissions budgets
* Target and trajectory overshoot/undershoot ratios
* Target and trajectory temperature scores

These secondary datasets are not the concern of this pipeline.

### Environment variables and dot-env

The following cell looks for a "dot-env" file in some standard locations,
and loads its contents into `os.environ`.

In [1]:
import os
import pathlib
from dotenv import load_dotenv

# Load some standard environment variables from a dot-env file, if it exists.
# If no such file can be found, does not fail, and so allows these environment vars to
# be populated in some other way
dotenv_dir = os.environ.get('CREDENTIAL_DOTENV_DIR', os.environ.get('PWD', '/opt/app-root/src'))
dotenv_path = pathlib.Path(dotenv_dir) / 'credentials.env'
if os.path.exists(dotenv_path):
    load_dotenv(dotenv_path=dotenv_path,override=True)

import numpy as np
import pandas as pd
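
The later cells read several variables with `os.environ[...]`, which raises `KeyError` when a name is missing. A minimal sketch of an up-front check follows; the helper `missing_env_vars` is hypothetical and not part of this pipeline, while the variable names are the ones read elsewhere in this notebook.

```python
import os

# Names read by the S3 and Trino cells of this notebook.
REQUIRED_VARS = [
    "S3_LANDING_ENDPOINT", "S3_LANDING_ACCESS_KEY", "S3_LANDING_SECRET_KEY",
    "S3_LANDING_BUCKET", "TRINO_USER", "TRINO_HOST", "TRINO_PORT", "TRINO_PASSWD",
]

def missing_env_vars(required, env=None):
    """Return the names in `required` that are unset or empty in `env`.

    Hypothetical helper: `env` defaults to os.environ but can be any mapping.
    """
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]

# Example with an explicit mapping instead of the real environment.
example_env = {"TRINO_USER": "alice", "TRINO_HOST": "trino.example.org"}
print(missing_env_vars(["TRINO_USER", "TRINO_HOST", "TRINO_PORT"], example_env))
```

Calling `missing_env_vars(REQUIRED_VARS)` right after the dot-env load would report every missing credential at once rather than failing on the first one.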

### S3 and boto3

In [2]:
import boto3

s3_source = boto3.resource(
    service_name="s3",
    endpoint_url=os.environ['S3_LANDING_ENDPOINT'],
    aws_access_key_id=os.environ['S3_LANDING_ACCESS_KEY'],
    aws_secret_access_key=os.environ['S3_LANDING_SECRET_KEY'],
)
source_bucket = s3_source.Bucket(os.environ['S3_LANDING_BUCKET'])

In [4]:
import osc_ingest_trino as osc
import io

### Connecting to Trino with sqlalchemy

In the context of the Data Vault, this pipeline operates with full visibility into all the data it prepares for the ITR tool.  When the data is output, it is labeled so that the Data Vault can enforce its data management access rules.

In [5]:
import trino
from sqlalchemy.engine import create_engine

ingest_catalog = 'osc_datacommons_dev'
ingest_schema = 'sandbox'
dera_schema = 'sandbox'
dera_prefix = 'dera_'
gleif_schema = 'sandbox'
rmi_schema = 'sandbox'
iso3166_schema = 'sandbox'
essd_schema = 'sandbox'
demo_schema = 'demo_dv'

sqlstring = 'trino://{user}@{host}:{port}/'.format(
    user = os.environ['TRINO_USER'],
    host = os.environ['TRINO_HOST'],
    port = os.environ['TRINO_PORT']
)
sqlargs = {
    'auth': trino.auth.JWTAuthentication(os.environ['TRINO_PASSWD']),
    'http_scheme': 'https',
    'catalog': ingest_catalog,
    'schema': ingest_schema,
}
engine = create_engine(sqlstring, connect_args = sqlargs)
connection = engine.connect()

## Global Parameters

These parameters are set/selected by the ITR tool.  They are included here for reference only (the following is not live code).

Create the ISIC-to-Sector table manually until we have a proper sector mapping table

In [6]:
i2s_df = pd.DataFrame({"isic": [2410, 4010],
                       "sector": ['Steel', 'Electricity Utilities']}).convert_dtypes()

ingest_table = 'isic_to_sector'
drop_table = engine.execute(f"drop table if exists {ingest_schema}.{ingest_table}")
drop_table.fetchall()

columnschema = osc.create_table_schema_pairs(i2s_df)

tabledef = f"""
create table if not exists {ingest_catalog}.{ingest_schema}.{ingest_table}(
{columnschema}
) with (
    format = 'ORC',
    partitioning = array['bucket(isic,20)']
)
"""

print(tabledef)
qres = engine.execute(tabledef)
print(qres.fetchall())
i2s_df.to_sql(ingest_table,
              con=engine, schema=ingest_schema, if_exists='append',
              index=False,
              method=osc.TrinoBatchInsert(batch_size = 2000, verbose = True))


create table if not exists osc_datacommons_dev.sandbox.isic_to_sector(
    isic bigint,
    sector varchar
) with (
    format = 'ORC',
    partitioning = array['bucket(isic,20)']
)

[(True,)]
inserting 2 records
  (2410, 'Steel')
  (4010, 'Electricity Utilities')
constructed fully qualified table name as: "sandbox.isic_to_sector"
batch insert result: [(2,)]
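
The table definition printed above is assembled from a column list plus fixed storage options. The following is only a sketch of that assembly in plain Python. `schema_pairs`, `table_ddl`, and `TYPE_MAP` are hypothetical stand-ins and do not reproduce how `osc.create_table_schema_pairs` is actually implemented; the dtype names assume what `convert_dtypes()` would typically yield for these two columns.

```python
# Hypothetical dtype-to-Trino mapping; the real helper may cover more types.
TYPE_MAP = {"Int64": "bigint", "Float64": "double", "string": "varchar", "boolean": "boolean"}

def schema_pairs(columns, indent="    "):
    """Render {column: dtype} as comma-separated 'name type' lines."""
    lines = [f"{indent}{name} {TYPE_MAP[dtype]}" for name, dtype in columns.items()]
    return ",\n".join(lines)

def table_ddl(catalog, schema, table, columns, bucket_column, buckets=20):
    """Build a 'create table' statement shaped like the one printed above."""
    return (
        f"create table if not exists {catalog}.{schema}.{table}(\n"
        f"{schema_pairs(columns)}\n"
        ") with (\n"
        "    format = 'ORC',\n"
        f"    partitioning = array['bucket({bucket_column},{buckets})']\n"
        ")"
    )

ddl = table_ddl("osc_datacommons_dev", "sandbox", "isic_to_sector",
                {"isic": "Int64", "sector": "string"}, "isic")
print(ddl)
```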


## Portfolio Data

The user will ultimately supply portfolio selection and position information to the ITR tool as part of the weighting calculations.  This part of the pipeline just collects the LEI and ISIN information for companies we should expect to analyze (i.e., companies for which we have fundamental financial information, production, intensity, and target information, in sectors for which we have benchmark projections).

Because this pipeline does the full pre-computation of data for the tool, there is no sense carrying forward information that is not fully closed.  I.e., there's no reason to carry forward an LEI:ISIN relationship if there is no financial, production, or target information related to that LEI and/or ISIN.  The user does not add such data later; the data is collected and fully processed by this pipeline now.
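
The "fully closed" rule above can be sketched as a filter over identifier pairs. This is an illustration only: `close_universe` is a hypothetical helper, and the LEI and ISIN strings are placeholders rather than real identifiers.

```python
def close_universe(identifiers, financial_leis, production_leis, target_leis):
    """Keep (lei, isin) pairs whose LEI appears in at least one supporting dataset."""
    supported = set(financial_leis) | set(production_leis) | set(target_leis)
    return [(lei, isin) for lei, isin in identifiers if lei in supported]

# Placeholder identifiers for illustration.
pairs = [("LEI_A", "ISIN_A"), ("LEI_B", "ISIN_B"), ("LEI_C", "ISIN_C")]
kept = close_universe(pairs,
                      financial_leis=["LEI_A"],
                      production_leis=["LEI_A", "LEI_C"],
                      target_leis=[])
print(kept)
```

Here `LEI_B` is dropped because no financial, production, or target record refers to it.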

### Get LEI/ISIN data

RMI hands us data already matched with LEIs and ISINs.  Other lists of company names may require us to stitch that together manually.

In [6]:
# TODO: sort why some notorious utilities are missing LEIs in the following query--bad source data?
rmi_lei_isin = pd.read_sql(f"select DISTINCT parent_name, parent_lei, parent_isin from {rmi_schema}.utility_information", engine)
rmi_lei_isin.loc[rmi_lei_isin.parent_name=='Mt. Carmel Public Utility Co.', 'parent_lei'] = rmi_lei_isin.apply(lambda x: f"RMI{x.name:017}", axis=1)
rmi_lei_isin.loc[rmi_lei_isin.parent_name=='PG&E Corp.', 'parent_lei'] = '8YQ2GSDWYZXO2EDN3511'
rmi_lei_isin.loc[rmi_lei_isin.parent_name=='Verso Corp.', 'parent_lei'] = '549300FODXCTQ8DGT594'
rmi_lei_isin.loc[rmi_lei_isin.parent_name=='Verso Corp.', 'parent_isin'] = 'US92531L2079'
rmi_lei_dict = dict(zip(rmi_lei_isin.parent_lei, rmi_lei_isin.parent_isin))

Implement an *ad hoc* ingestion pipeline for the Steel portfolio.  Later we will ingest steel production data.  We use this portfolio only to define the universe, not for actual investment information.

In [7]:
steel_idx = pd.read_csv(os.environ.get('PWD')+f"/itr-data-pipeline/data/external/mdt-steel-portfolio.csv", header=0, sep=';', dtype=str, engine='c')
steel_idx = steel_idx.drop('investment_value', axis=1)
steel_idx

index,company_name,company_lei,company_id
0,CARPENTER TECHNOLOGY CORP,DX6I6ZD3X5WNNCDJKP85,US1442851036
1,CLEVELAND-CLIFFS INC,549300TM2WLI2BJMDD86,US1858991011
2,COMMERCIAL METALS CO,549300OQS2LO07ZJ7N73,US2017231034
3,FRIEDMAN INDUSTRIES INC,LEI05,US3584351056
4,GENERAL STEEL HOLDINGS INC,5493008ZKBIR02ICY091,US3708532029
5,GERDAU S.A.,254900YDV6SEQQPZVG24,US3737371050
6,"GIBRALTAR INDUSTRIES, INC.",LEI08,US3746891072
7,GROUP SIMEC SA DE CV,529900LCYCXPA0TZEU09,MXP4984U1083
8,HAYNES INTERNATIONAL INC,549300I9MS5UZLRFDO40,US4208772016
9,INSTEEL INDUSTRIES INC,52990026LKY4MOX3L174,US45774W1080


Prepare GLEIF matching data for SEC DERA data.  In the future, such matching will use the ESG Entity-Matching pipeline (https://github.com/os-climate/financial-entity-cleaner/tree/version_0.1.0).

In [8]:
gleif_file = s3_source.Object(os.environ['S3_LANDING_BUCKET'],'mtiemann-GLEIF/DERA-matches.csv')
gleif_file.download_file(f'/tmp/dera-gleif.csv')
gleif_df = pd.read_csv(f'/tmp/dera-gleif.csv', header=0, sep=',', dtype=str, engine='c')
gleif_dict = dict(zip(gleif_df.name, gleif_df.LEI))
del(gleif_df)

# Many of the following ISINs are bonds, but some are also stocks (on various exchanges)
# But we don't need to load and match here, because the portfolio has the ISINs
if False:
    gleif_isin_file = s3_source.Object(os.environ['S3_LANDING_BUCKET'],'mtiemann-GLEIF/ISIN_LEI_20211009.csv')
    gleif_isin_file.download_file(f'/tmp/ISIN_LEI_20211009.csv')
    gleif_isins = pd.read_csv(f'/tmp/ISIN_LEI_20211009.csv', header=0, sep=',', dtype=str, engine='c')

Create a very simple entity matcher, cleaning up slight variations in company names between RMI's entity names, the SEC's entity names, and GLEIF's entity names.

The commented-out lines are names we would have to fix if there were SEC data for them.  Because there is none, there is nothing for those names to match.

In [9]:
# gleif_dict['Basin Electric Power Coop'.upper()] = gleif_dict['BASIN ELECTRIC POWER COOPERATIVE']
# gleif_dict['Big Rivers Electric Corp'.upper()] = gleif_dict['BIG RIVERS ELECTRIC CORPORATION']
gleif_dict['Cleco Partners LP'.upper()] = gleif_dict['CLECO CORPORATE HOLDINGS LLC']
# gleif_dict['Golden Spread Electric Coop., Inc'.upper()] = gleif_dict['GOLDEN SPREAD ELECTRIC COOPERATIVE, INC.']
gleif_dict['MIDWEST ENERGY INC'] = '549300O4B5CVWMKUES27'
gleif_dict['OG&E Energy'.upper()] = gleif_dict['OGE ENERGY CORP.']
# gleif_dict['Ohio Valley Electric Corp'.upper()] = gleif_dict['OHIO VALLEY ELECTRIC CORPORATION']
gleif_dict['Old Dominion Electric Coop'.upper()] = gleif_dict['OLD DOMINION ELECTRIC COOPERATIVE']
gleif_dict['PG&E Corp.'.upper()] = gleif_dict['PG&E CORP']
gleif_dict['Tri-State Generation & Transmission Association'.upper()] = gleif_dict['TRI-STATE GENERATION & TRANSMISSION ASSOCIATION, INC.']
gleif_dict['DOMINION ENERGY INC'] = 'ILUL7B6Z54MRYCF6H308'

gleif_dict['GROUP SIMEC SA DE CV'] = '529900LCYCXPA0TZEU09'

gleif_1 = { k.split(',')[0].split(' ')[0]:v for k,v in gleif_dict.items() }
gleif_2 = { ' '.join(k.split(',')[0].split(' ')[0:2]):v for k,v in gleif_dict.items() }

def gleif_match(x):
    x = x.split(',')[0]
    if x in gleif_dict:
        return gleif_dict[x]
    x = x.replace('.','')
    if x in gleif_dict:
        return gleif_dict[x]
    x2 = ' '.join(x.split(' ')[0:2])
    if x2 in gleif_2:
        return gleif_2[x2]
    if ' ' not in x and x in gleif_1:
        return gleif_1[x]
    return None
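
To see how the fallback order in `gleif_match` behaves, here is a self-contained sketch using a three-entry dictionary. The LEI strings are placeholders, not real identifiers, and the function body is copied from the cell above.

```python
# Placeholder LEI strings, not real identifiers.
gleif_dict = {
    "OGE ENERGY CORP.": "LEI-OGE",
    "PG&E CORP": "LEI-PGE",
    "OLD DOMINION ELECTRIC COOPERATIVE": "LEI-ODEC",
}
# Lookup tables keyed on the first word and on the first two words of each name.
gleif_1 = {k.split(",")[0].split(" ")[0]: v for k, v in gleif_dict.items()}
gleif_2 = {" ".join(k.split(",")[0].split(" ")[0:2]): v for k, v in gleif_dict.items()}

def gleif_match(x):
    x = x.split(",")[0]                 # drop anything after the first comma
    if x in gleif_dict:
        return gleif_dict[x]
    x = x.replace(".", "")              # retry without periods
    if x in gleif_dict:
        return gleif_dict[x]
    x2 = " ".join(x.split(" ")[0:2])    # retry on the first two words
    if x2 in gleif_2:
        return gleif_2[x2]
    if " " not in x and x in gleif_1:   # single-word names: retry on first word
        return gleif_1[x]
    return None

for name in ["PG&E CORP.", "OLD DOMINION ELECTRIC COOP", "OGE",
             "UNKNOWN UTILITY INC", "OLD DOMINION FREIGHT LINE"]:
    print(name, "->", gleif_match(name))
```

The last example shows the cost of the two-word fallback: an unrelated name that shares its first two words with a known entity is matched to that entity's LEI, so results from this matcher deserve a manual review.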

Collect the universe of company names for the sectors we cover.  Steel sector is SIC 3310-3317. Electricity Utilities is SIC 4911 (but also 4931-4932 and 4991).

Some conglomerates have more general SIC codes that hide their activities in sectors of interest.  Others report those SIC codes only within reportable segments.
Until the more detailed SEC DERA data (available in an S3 bucket but not yet processed as a pipeline) is ingested, we cannot collect all the company names we need.
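
The SIC codes named above can be expressed as a small classifier. This is a sketch in plain Python that mirrors the `where` clause of the query; `sector_for_sic` is a hypothetical helper, not something the pipeline defines.

```python
ELECTRICITY_SICS = {4911, 4931, 4932, 4991}

def sector_for_sic(sic):
    """Map an SIC code to one of the two covered sectors, or None if not covered."""
    if 3310 <= sic <= 3317:
        return "Steel"
    if sic in ELECTRICITY_SICS:
        return "Electricity Utilities"
    return None

for code in (3312, 4911, 4924):
    print(code, sector_for_sic(code))
```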

In [10]:
sec_lei_isin = pd.read_sql(f"""
select DISTINCT F.name, F.lei, F.sic
from {dera_schema}.financials_by_lei F
where (sic=4911 or sic=4931 or sic=4932 or sic=4991)
      or (sic>=3310 and sic<=3317)
""", engine)
sec_lei_isin.loc[sec_lei_isin.name=='DOMINION ENERGY INC', 'lei'] = 'ILUL7B6Z54MRYCF6H308'
sec_lei_isin.loc[sec_lei_isin.name=='GROUP SIMEC SA DE CV', 'lei'] = '529900LCYCXPA0TZEU09'

missing_leis = sec_lei_isin[sec_lei_isin.lei.isna()]
sec_lei_isin.dropna(inplace=True)
print("The following companies are missing LEI information and will be dropped:")
display(missing_leis)

The following companies are missing LEI information and will be dropped:


index,name,lei,sic
0,BRAZILIAN ELECTRIC POWER CO,,4911
7,"CLEANSPARK, INC.",,4991
16,"HELIOGEN, INC.",,4911
29,8POINT3 ENERGY PARTNERS LP,,4911
38,"ENERGY CONVERSION SERVICES, INC.",,4911
54,"MONTAUK RENEWABLES, INC.",,4932
63,VETANOVA INC.,,4911
64,"PECK CO HOLDINGS, INC.",,4932
74,UNITIL CORP,,4931
79,CHUGACH ELECTRIC ASSOCIATION INC,,4911


We create a theoretical portfolio that conveniently contains all available LEI and ISIN information, meaning we don't need to do entity matching or ISIN matching.

Other portfolios may need a lot more work before they can be used to precompute other data.  The code above is a sample of the kind of extra data and processing needed for such portfolios.

In [11]:
rmi_idx = rmi_lei_isin.rename(columns={'parent_name':'company_name', 'parent_lei':'company_lei', 'parent_isin':'company_id'})
# rmi_idx.insert(1, 'company_lei', portfolio_df.company_name.str.upper().map(gleif_match))
# if rmi_idx.company_lei.isna().any():
#     display(rmi_idx[rmi_idx.company_lei.isna()])
rmi_idx.loc[rmi_idx.company_id.isna(), 'company_id'] = rmi_idx.apply(lambda x: f"ZZ{x.name:010}", axis=1)

print(f"Number of RMI portfolio companies = {len(rmi_idx)}")

Number of RMI portfolio companies = 184


Show list of RMI companies that use made-up LEIs or ISINs

In [12]:
rmi_idx[rmi_idx.company_lei.str.startswith('RMI')|rmi_idx.company_id.str.startswith('ZZ')]

index,company_name,company_lei,company_id
0,"Buckeye Power, Inc.",549300VR7GQZV6W7OR57,ZZ0000000000
5,"Freeport-Mcmoran, Inc.",549300IRDTHJQ1PVET45,ZZ0000000005
7,Citizens Energy Corp.,5493008ORX814MK1WM19,ZZ0000000007
8,Oglethorpe Power Corp.,3EERXCUSWMS9GV5D9M98,ZZ0000000008
14,Omya AG,5299004YRCHMOU9FKK67,ZZ0000000014
15,"Southwest Power Pool, Inc.",549300NXXWJMFXIKNU79,ZZ0000000015
16,Wolverine Power Supply Coop.,549300ROWOIV5X5MB591,ZZ0000000016
17,"Vermont Electric Coop., Inc.",549300GNSLQRYVBRRM43,ZZ0000000017
26,Puget Holdings LLC,8MNFJR7KOMBQ7X62LK44,ZZ0000000026
27,"New Hampshire Electric Coop., Inc.",5493003TZVX6QJ0PBO15,ZZ0000000027
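
The made-up identifiers listed above are built from the row index with zero-padded format specifiers. A minimal sketch of the two formats used in this notebook follows; the function names are hypothetical.

```python
def placeholder_isin(row_index):
    """'ZZ' plus a 10-digit zero-padded index: 12 characters, like a real ISIN."""
    return f"ZZ{row_index:010}"

def placeholder_lei(row_index):
    """'RMI' plus a 17-digit zero-padded index: 20 characters, like a real LEI."""
    return f"RMI{row_index:017}"

print(placeholder_isin(5))
print(placeholder_lei(3))
```

Because the value depends on row position, these placeholders are only stable as long as the source query returns rows in the same order.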


Add Steel company portfolio

In [13]:
portfolio_idx = pd.concat([rmi_idx, steel_idx])
portfolio_idx = portfolio_idx.convert_dtypes()

print(f"Number of total portfolio companies = {len(portfolio_idx)}")

Number of total portfolio companies = 209


## Company Data

The SIC-to-ISIC table is an open workstream item: https://github.com/os-climate/itr-data-pipeline/issues/1

### Capture a list of the companies for which we have good financial info

We limit our view to the companies in our portfolio.  The user can prioritize whether this is the best source of revenue, market cap, etc., or whether they prefer another source.

Note for future reference: Berkshire Hathaway has one line of business for Energy and another for Steel.  We don't yet have line-of-business info because we use summary data from SEC DERA, not the detailed Notes version of the dataset.

In [14]:
ingest_table = 'portfolio_universe'

drop_table = engine.execute(f"drop table if exists {ingest_schema}.{ingest_table}")
drop_table.fetchall()

columnschema = osc.create_table_schema_pairs(portfolio_idx, typemap={"datetime64[ns]":"timestamp(6)"})

tabledef = f"""
create table if not exists {ingest_catalog}.{ingest_schema}.{ingest_table}(
{columnschema}
) with (
    format = 'ORC',
    partitioning = array['bucket(company_lei, 20)']
)
"""
print(tabledef)
create_table = engine.execute(tabledef)
print(create_table.fetchall())
portfolio_idx.to_sql(ingest_table,
                     con=engine, schema=ingest_schema, if_exists='append',
                     index=False,
                     method=osc.TrinoBatchInsert(batch_size = 5000, verbose = True))


create table if not exists osc_datacommons_dev.sandbox.portfolio_universe(
    company_name varchar,
    company_lei varchar,
    company_id varchar
) with (
    format = 'ORC',
    partitioning = array['bucket(company_lei, 20)']
)

[(True,)]
inserting 209 records
  ('Buckeye Power, Inc.', '549300VR7GQZV6W7OR57', 'ZZ0000000000')
  ('Platte-Clay Electric Coop. Inc.', NULL, 'ZZ0000000001')
  ('Fortis, Inc.', '549300MQYQ9Y065XPR71', 'CA3495531079')
  ...
  ('WORTHINGTON INDUSTRIES INC', '1WRCIANKYOIK6KYE5E82', 'US9818111026')
constructed fully qualified table name as: "sandbox.portfolio_universe"
batch insert result: [(209,)]
constructed fully qualified table name as: "sandbox.portfolio_universe"
execute optimize 209 rows: []


### Create a list with metric labels embedded in the output for easy reading...

Highlight any rows that have NULL data

### Capture and print a list of companies with financial info

Financial information is part of the "fundamental data" we need for the ITR portfolio companies.  The other part is base year production, emission, and intensity data.  We query the two separately because we have a unified source of truth for the former (SEC DERA) but multiple sources for the latter (RMI for Electric Utilities and MDT for Steel).

### Financial info:
* Company Name, LEI, ISIN, year
* ISIC Code (for Sector)
* Country and Region
* Revenue, Market Cap, Enterprise Value, Assets, Cash

We currently focus exclusively on data from 2019 as our base year

In [15]:
base_financial_sql = f"""
select DISTINCT P.company_name, P.company_lei, P.company_id,
       F.country, UN.region_ar6_10 as region,
       if(S2I.isic=2410 or P.company_name='CLEVELAND-CLIFFS INC', 'Steel', 'Electricity Utilities') as sector,
       'equity' as exposure, 'USD' as currency,
       year(F.ddate) as year,
       F.market_cap_usd as company_market_cap,
       F.revenue_usd as company_revenue,
       F.market_cap_usd+F.debt_usd-F.cash_usd as company_ev,
       F.market_cap_usd+F.debt_usd as company_evic,
       F.assets_usd as company_total_assets,
       F.cash_usd as company_cash_equivalents,
       F.debt_usd as company_debt
from {ingest_schema}.portfolio_universe as P
     left join {dera_schema}.financials_by_lei as F on F.lei=P.company_lei and year(F.ddate)=2019
     join {iso3166_schema}.countries as I on F.country=I.alpha_2
     join {essd_schema}.regions as UN on I.alpha_3=UN.iso
     -- join {dera_schema}.{dera_prefix}sub as S on S.cik=F.cik
     -- left join {rmi_schema}.utility_information as U on U.parent_lei=P.company_lei
     -- left join {gleif_schema}.gleif_isin_lei G on G.lei=P.lei and G.isin=U.parent_isin
     left join {dera_schema}.sic_isic as S2I on S2I.sic=F.sic
     -- left join {rmi_schema}.operations_emissions_by_fuel as E on U.respondent_id=E.respondent_id and year(E.year)=year(F.ddate)
-- where E.owned_or_total='owned'
group by P.company_name, P.company_lei, P.company_id,
       F.country, UN.region_ar6_10,
       if(S2I.isic=2410 or P.company_name='CLEVELAND-CLIFFS INC', 'Steel', 'Electricity Utilities'),
       7, 8, -- exposure, currency (select-list positions)
       year(F.ddate),
       F.market_cap_usd, F.revenue_usd, F.market_cap_usd+F.debt_usd-F.cash_usd, F.market_cap_usd+F.debt_usd, F.assets_usd, F.cash_usd, F.debt_usd
order by P.company_name
limit 200
"""

### Emissions/Production info
* Company Name, LEI, ISIN (join axis with financial info)
* Sector (inferred from RMI data as a source rather than ISIC)
* Production (in whatever units -- we need units in either metadata or a column or as part of the data element itself)
* S1, S2, S3 emissions (in million metric tons, Mt CO2e)
* S1, S2, S3 emissions intensity (emissions / production, in whatever units this resolves to)

We currently focus exclusively on data from 2019 as our base year

Note that RMI data is S1 only (own generation); we use zero as S2 value
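
A minimal sketch of the intensity arithmetic (emissions divided by production) follows. The figures are the AES Corp. 2019 values from the RMI output in this notebook (Mt CO2 and TWh, as labeled later); the helper name and its handling of missing or zero production are assumptions.

```python
def emissions_intensity(emissions, production):
    """Emissions per unit of production; None when it cannot be computed."""
    if emissions is None or not production:
        return None
    return emissions / production

# AES Corp., 2019: S1 emissions in Mt CO2, generation in TWh.
ei_s1 = emissions_intensity(11.616368, 15.292477)
print(round(ei_s1, 4))  # Mt CO2 per TWh, numerically equal to t CO2 per MWh
```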

In [16]:
# 'sector', 's1_co2', 's2_co2', 's3_co2', 's1_ei', 's2_ei', 's3_ei', 'production'

emissions_sql = f"""
select DISTINCT P.company_name, P.company_lei, P.company_id,
       'Electricity Utilities' as sector, year(E.year) as year,
       sum(E.emissions_co2 + (265/1000000.0)*coalesce(E.emissions_nox, 0)) as ghg_s1, 0 as ghg_s2, NULL as ghg_s3,
       sum(E.emissions_co2 + (265/1000000.0)*coalesce(E.emissions_nox, 0)) / sum(E.generation) as ei_s1, 0 as ei_s2, NULL as ei_s3,
       sum(E.generation) as production
from {ingest_schema}.portfolio_universe as P
     join {rmi_schema}.utility_information as U on U.parent_lei=P.company_lei
     join {rmi_schema}.operations_emissions_by_fuel as E on U.respondent_id=E.respondent_id
where year(E.year)>=2016 and year(E.year)<2023
-- and E.owned_or_total='owned'
group by P.company_name, P.company_lei, P.company_id, 3, year(E.year)
order by P.company_name
"""

### `financial_df` contains the base year (2019) financial data

For now our benchmark data covers only North America and Europe.  Over time, we expect additional regions (possibly on a per-sector basis).

In [17]:
financial_df = pd.read_sql(base_financial_sql, engine, index_col=['company_name', 'company_lei', 'company_id', 'sector']).convert_dtypes()
financial_df.region = financial_df.region.apply(lambda x: x if x in ['Asia', 'Europe', 'North America'] else 'Global').astype('string')
financial_df

company_name,company_lei,company_id,sector,country,region,exposure,currency,year,company_market_cap,company_revenue,company_ev,company_evic,company_total_assets,company_cash_equivalents,company_debt
AES Corp.,2NUNNB7D43COUIRE5295,US00130H1059,Electricity Utilities,US,North America,equity,USD,2019,10870000000.0,10189000000.0,10102000000,11131000000.0,33648000000.0,1029000000.0,261000000.0
"ALLETE, Inc.",549300NNLSIMY6Z8OT86,US0185223007,Electricity Utilities,US,North America,equity,USD,2019,4285299935.0,1240500000.0,5829799935,5899099935.0,5482800000.0,69300000.0,1613800000.0
Alcoa Corp.,549300T12EZ1F6PWWU29,US0138721065,Electricity Utilities,US,North America,equity,USD,2019,4300000000.0,10433000000.0,5221000000,6100000000.0,14631000000.0,879000000.0,1800000000.0
Algonquin Power & Utilities Corp.,549300K5VIUTJXQL7X75,US0158577090,Electricity Utilities,CA,North America,equity,USD,2019,,1624921000.0,,,10911470000.0,62485000.0,6500799000.0
Alliant Energy,5493009ML300G373MZ12,US0188021085,Electricity Utilities,US,North America,equity,USD,2019,11600000000.0,3647700000.0,18503600000,18519900000.0,16700700000.0,16300000.0,6919900000.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Verso Corp.,549300FODXCTQ8DGT594,US92531L2079,Electricity Utilities,US,North America,equity,USD,2019,658075983.0,2444000000.0,622075983,664075983.0,1721000000.0,42000000.0,6000000.0
Vistra Corp.,549300KP43CPCUJOOG15,US92840M1027,Electricity Utilities,US,North America,equity,USD,2019,8654325784.0,11809000000.0,18456325784,18756325784.0,26616000000.0,300000000.0,10102000000.0
WEC Energy Group,549300IGLYTZUK3PVP70,US92939U1060,Electricity Utilities,US,North America,equity,USD,2019,26300000000.0,7523100000.0,38120800000,38158300000.0,34951800000.0,37500000.0,11858300000.0
WORTHINGTON INDUSTRIES INC,1WRCIANKYOIK6KYE5E82,US9818111026,Steel,US,North America,equity,USD,2019,1633376617.0,3759556000.0,2294113617,2386476617.0,2510796000.0,92363000.0,753100000.0
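
The `company_ev` and `company_evic` columns follow directly from the expressions in the query. The sketch below checks them against the AES Corp. row above; the function names are hypothetical, and returning `None` for missing inputs is an assumption meant to mimic SQL `NULL` propagation (as in the Algonquin row, where market cap is absent).

```python
def enterprise_value(market_cap, debt, cash):
    """EV = market cap + debt - cash; None if any input is missing."""
    if None in (market_cap, debt, cash):
        return None
    return market_cap + debt - cash

def evic(market_cap, debt):
    """EVIC as computed in the query: market cap + debt; None if any input is missing."""
    if None in (market_cap, debt):
        return None
    return market_cap + debt

# AES Corp., 2019, in USD.
aes = {"market_cap": 10_870_000_000, "debt": 261_000_000, "cash": 1_029_000_000}
print(enterprise_value(aes["market_cap"], aes["debt"], aes["cash"]))
print(evic(aes["market_cap"], aes["debt"]))
```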


### `rmi_emissions_df` contains the RMI production and emissions data (2016-2020 below, including the 2019 base year)

In [18]:
rmi_emissions_df = pd.read_sql(emissions_sql, engine, index_col=['company_name', 'company_lei', 'company_id', 'sector']).convert_dtypes()
rmi_emissions_df.ghg_s3 = rmi_emissions_df.ghg_s3.astype('float64')
template_rmi_df = rmi_emissions_df.pivot(index=None, columns='year')

# Put column names into YYYY_metric order (Multi-index has this order inverted)
template_rmi_df.columns = template_rmi_df.columns.map(lambda x: f"{x[1]}_{x[0]}")
template_rmi_df = template_rmi_df.loc[:, ~template_rmi_df.columns.str.contains('_ei_')]
display(template_rmi_df)

company_name,company_lei,company_id,sector,2016_ghg_s1,2017_ghg_s1,2018_ghg_s1,2019_ghg_s1,2020_ghg_s1,2016_ghg_s2,2017_ghg_s2,2018_ghg_s2,2019_ghg_s2,2020_ghg_s2,2016_ghg_s3,2017_ghg_s3,2018_ghg_s3,2019_ghg_s3,2020_ghg_s3,2016_production,2017_production,2018_production,2019_production,2020_production
AES Corp.,2NUNNB7D43COUIRE5295,US00130H1059,Electricity Utilities,20.952695,10.483392,11.23589,11.616368,9.42552,0,0,0,0,0,,,,,,22.186759,10.959302,13.537873,15.292477,13.075168
"ALLETE, Inc.",549300NNLSIMY6Z8OT86,US0185223007,Electricity Utilities,8.028792,6.56607,6.622019,4.223366,3.750732,0,0,0,0,0,,,,,,10.311127,9.033366,8.743458,6.490906,6.078342
Alcoa Corp.,549300T12EZ1F6PWWU29,US0138721065,Electricity Utilities,0.0,0.0,0.0,0.0,0.0,0,0,0,0,0,,,,,,0.742563,0.621888,1.069254,1.026422,1.196076
Algonquin Power & Utilities Corp.,549300K5VIUTJXQL7X75,US0158577090,Electricity Utilities,3.427649,3.972491,3.768993,3.327286,2.408914,0,0,0,0,0,,,,,,4.900562,6.28555,6.311677,5.314576,4.588301
Alliant Energy,5493009ML300G373MZ12,US0188021085,Electricity Utilities,12.247084,13.595806,14.580424,11.098765,11.037756,0,0,0,0,0,,,,,,16.476444,18.561338,21.667852,20.524337,22.008184
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
WEC Energy Group,549300IGLYTZUK3PVP70,US92939U1060,Electricity Utilities,19.020371,19.627232,16.478434,9.874577,9.667298,0,0,0,0,0,,,,,,23.367324,23.864635,21.139317,15.874234,15.865286
"Wabash Valley Power Assn, Inc",VR27ZYPWHGW7Z1BM8Y69,ZZ0000000125,Electricity Utilities,1.204542,1.02107,1.254569,0.913953,1.013642,0,0,0,0,0,,,,,,1.642564,1.434571,1.69599,1.314171,1.443082
Wolverine Power Supply Coop.,549300ROWOIV5X5MB591,ZZ0000000016,Electricity Utilities,0.542754,0.533111,0.807801,0.542314,0.825158,0,0,0,0,0,,,,,,0.821739,0.794862,1.210924,0.782358,1.291555
"Xcel Energy, Inc.",LGJNMI9GH8XIDG5RCM61,US98389B1008,Electricity Utilities,46.128179,45.010999,45.358516,41.448405,34.87948,0,0,0,0,0,,,,,,73.830968,72.028543,76.00693,75.731407,69.493404

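
The column handling after the pivot can be sketched without pandas: each `(metric, year)` pair becomes a `YYYY_metric` name, and intensity (`ei`) columns are dropped. `flatten_columns` is a hypothetical helper that only illustrates the naming rule.

```python
def flatten_columns(pairs):
    """Turn (metric, year) pairs into 'YYYY_metric' names, dropping '_ei_' columns."""
    names = [f"{year}_{metric}" for metric, year in pairs]
    return [name for name in names if "_ei_" not in name]

pairs = [(metric, year)
         for metric in ("ghg_s1", "ei_s1", "production")
         for year in (2019, 2020)]
flat = flatten_columns(pairs)
print(flat)
```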

### Collect emissions/production info from the MDT Steel data
* Company Name, LEI, ISIN (join axis with financial info)
* Sector (inferred as Steel from source)
* Production (in whatever units -- we need units in either metadata or a column or as part of the data element itself)
* S1, S2, S3 emissions (in whatever units of CO2e)
* S1, S2, S3 emissions intensity (emissions / production, in whatever units this resolves to)

In [19]:
steel_wb = pd.read_excel(os.environ.get('PWD')+f"/itr-data-pipeline/data/external/mdt-steel-demo.xlsx", sheet_name=None)
steel_production = steel_wb['Steel Fe_tons'].dropna(axis=1,how='all')
steel_production.set_index(steel_production.columns[0:3].to_list(), inplace=True)
steel_co2 = {}
steel_ei = {}
scopes = ['s1', 's2', 's3']
for scope in scopes:
    steel_co2[scope] = steel_wb[f"Steel CO2e {scope.upper()}"].dropna(axis=1,how='all')
    steel_co2[scope].set_index(steel_co2[scope].columns[0:3].to_list(), inplace=True)
    steel_ei[scope] = (steel_co2[scope] / steel_production).dropna(how='all')

In [20]:
def rename_column_emissions(df, scope):
    df = df.loc[:, 2016:2020]
    df.columns = df.columns.map(lambda x: f"{x}_ghg_{scope}")
    return df

template_steel_co2 = pd.concat([rename_column_emissions(steel_co2[scope], scope) for scope in scopes], axis=1)
for year in range(2016,2021):
    template_steel_co2.insert(len(template_steel_co2.columns)-5,f"{year}_ghg_s1s2", steel_co2['s1'][year]+steel_co2['s2'][year])
template_steel_co2

company_name,company_lei,company_id,2016_ghg_s1,2017_ghg_s1,2018_ghg_s1,2019_ghg_s1,2020_ghg_s1,2016_ghg_s2,2017_ghg_s2,2018_ghg_s2,2019_ghg_s2,2020_ghg_s2,2016_ghg_s1s2,2017_ghg_s1s2,2018_ghg_s1s2,2019_ghg_s1s2,2020_ghg_s1s2,2016_ghg_s3,2017_ghg_s3,2018_ghg_s3,2019_ghg_s3,2020_ghg_s3
CARPENTER TECHNOLOGY CORP,DX6I6ZD3X5WNNCDJKP85,US1442851036,298055.0,298055.0,298055.0,298055.0,292832.2,660000.0,660000.0,660000.0,660000.0,658435.0,958055.0,958055.0,958055.0,958055.0,951267.2,,,,,
CLEVELAND-CLIFFS INC,549300TM2WLI2BJMDD86,US1858991011,33209460.0,32357760.0,31034980.0,30349900.0,25607730.0,4431505.0,4473104.0,4413355.0,4426105.0,3671367.0,37640970.0,36830870.0,35448340.0,34776010.0,29279100.0,1934076.0,2449775.0,2449866.0,2194702.0,1851780.0
COMMERCIAL METALS CO,549300OQS2LO07ZJ7N73,US2017231034,1048006.0,1048006.0,1048006.0,1048006.0,1106156.0,2548437.0,2548437.0,2548437.0,1500431.0,1466830.0,3596443.0,3596443.0,3596443.0,2548437.0,2572986.0,,,,,
FRIEDMAN INDUSTRIES INC,LEI05,US3584351056,,,,,,,,,,,,,,,,,,,,
GENERAL STEEL HOLDINGS INC,5493008ZKBIR02ICY091,US3708532029,,,,,,,,,,,,,,,,,,,,
GERDAU S.A.,254900YDV6SEQQPZVG24,US3737371050,12075000.0,12075000.0,10707410.0,9056519.0,9198407.0,4025000.0,4025000.0,3569137.0,2890986.0,2082515.0,16100000.0,16100000.0,14276550.0,11947500.0,11280920.0,,,,,
"GIBRALTAR INDUSTRIES, INC.",LEI08,US3746891072,,,,,,,,,,,,,,,,,,,,
GROUP SIMEC SA DE CV,529900LCYCXPA0TZEU09,MXP4984U1083,,,,,,,,,,,,,,,,,,,,
HAYNES INTERNATIONAL INC,549300I9MS5UZLRFDO40,US4208772016,,,,,,,,,,,,,,,,,,,,
INSTEEL INDUSTRIES INC,52990026LKY4MOX3L174,US45774W1080,,,,,,,,,,,,,,,,,,,,
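
The `insert(len(columns)-5, ...)` call in the cell above places each combined S1+S2 column just before the five S3 columns. The sketch below replays that positioning on a plain list of column names and checks one combined value against the Carpenter Technology row above.

```python
years = range(2016, 2021)
scopes = ["s1", "s2", "s3"]

# Column order after concatenating the per-scope frames: all s1, all s2, all s3.
columns = [f"{year}_ghg_{scope}" for scope in scopes for year in years]

# The last five entries are always the s3 block, so len(columns) - 5 is the
# slot right before it; each new s1s2 name lands after the previous one.
for year in years:
    columns.insert(len(columns) - 5, f"{year}_ghg_s1s2")

print(columns[10:15])

# Carpenter Technology, 2016: S1 + S2 in t CO2.
s1s2_2016 = 298055.0 + 660000.0
print(s1s2_2016)
```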


In [21]:
template_steel_production = steel_production.loc[:, 2016:2020]
template_steel_production.columns = template_steel_production.columns.map(lambda x: f"{x}_production")
template_steel_production

company_name,company_lei,company_id,2016_production,2017_production,2018_production,2019_production,2020_production
AK STEEL HOLDING CORP,529900DT4E7ZNETMVC04,US0015471081,6051800.0,5596200.0,5683400.0,5342200.0,5422333.0
ARCELORMITTAL,2EULGUTUI56JI9SAL165,LU0140205948,83900000.0,85200000.0,83900000.0,84500000.0,69100000.0
CARPENTER TECHNOLOGY CORP,DX6I6ZD3X5WNNCDJKP85,US1442851036,138831.0,138831.0,138831.0,138831.0,140944.9
CLEVELAND-CLIFFS INC,549300TM2WLI2BJMDD86,US1858991011,89951800.0,90796200.0,89583400.0,89842200.0,74522330.0
COMMERCIAL METALS CO,549300OQS2LO07ZJ7N73,US2017231034,5301216.0,5301216.0,5301216.0,5301216.0,5543677.0
FRIEDMAN INDUSTRIES INC,LEI05,US3584351056,,,,,
GENERAL STEEL HOLDINGS INC,5493008ZKBIR02ICY091,US3708532029,,,,,
GERDAU S.A.,254900YDV6SEQQPZVG24,US3737371050,16100000.0,16100000.0,14276550.0,12453100.0,13142350.0
"GIBRALTAR INDUSTRIES, INC.",LEI08,US3746891072,,,,,
GROUP SIMEC SA DE CV,529900LCYCXPA0TZEU09,MXP4984U1083,,,,,


In [22]:
template_steel_df = pd.concat([template_steel_co2, template_steel_production], axis=1)
template_steel_df.insert(0, 'sector', 'Steel')
template_steel_df.set_index(['sector'], append=True, inplace=True)
template_steel_df.insert(0, 'emissions_metric', 't CO2')
template_steel_df.insert(1, 'production_metric', 'Fe_ton')
template_steel_df

company_name,company_lei,company_id,sector,emissions_metric,production_metric,2016_ghg_s1,2017_ghg_s1,2018_ghg_s1,2019_ghg_s1,2020_ghg_s1,2016_ghg_s2,2017_ghg_s2,2018_ghg_s2,...,2016_ghg_s3,2017_ghg_s3,2018_ghg_s3,2019_ghg_s3,2020_ghg_s3,2016_production,2017_production,2018_production,2019_production,2020_production
CARPENTER TECHNOLOGY CORP,DX6I6ZD3X5WNNCDJKP85,US1442851036,Steel,t CO2,Fe_ton,298055.0,298055.0,298055.0,298055.0,292832.2,660000.0,660000.0,660000.0,...,,,,,,138831.0,138831.0,138831.0,138831.0,140944.9
CLEVELAND-CLIFFS INC,549300TM2WLI2BJMDD86,US1858991011,Steel,t CO2,Fe_ton,33209460.0,32357760.0,31034980.0,30349900.0,25607730.0,4431505.0,4473104.0,4413355.0,...,1934076.0,2449775.0,2449866.0,2194702.0,1851780.0,89951800.0,90796200.0,89583400.0,89842200.0,74522330.0
COMMERCIAL METALS CO,549300OQS2LO07ZJ7N73,US2017231034,Steel,t CO2,Fe_ton,1048006.0,1048006.0,1048006.0,1048006.0,1106156.0,2548437.0,2548437.0,2548437.0,...,,,,,,5301216.0,5301216.0,5301216.0,5301216.0,5543677.0
FRIEDMAN INDUSTRIES INC,LEI05,US3584351056,Steel,t CO2,Fe_ton,,,,,,,,,...,,,,,,,,,,
GENERAL STEEL HOLDINGS INC,5493008ZKBIR02ICY091,US3708532029,Steel,t CO2,Fe_ton,,,,,,,,,...,,,,,,,,,,
GERDAU S.A.,254900YDV6SEQQPZVG24,US3737371050,Steel,t CO2,Fe_ton,12075000.0,12075000.0,10707410.0,9056519.0,9198407.0,4025000.0,4025000.0,3569137.0,...,,,,,,16100000.0,16100000.0,14276550.0,12453100.0,13142350.0
"GIBRALTAR INDUSTRIES, INC.",LEI08,US3746891072,Steel,t CO2,Fe_ton,,,,,,,,,...,,,,,,,,,,
GROUP SIMEC SA DE CV,529900LCYCXPA0TZEU09,MXP4984U1083,Steel,t CO2,Fe_ton,,,,,,,,,...,,,,,,,,,,
HAYNES INTERNATIONAL INC,549300I9MS5UZLRFDO40,US4208772016,Steel,t CO2,Fe_ton,,,,,,,,,...,,,,,,,,,,
INSTEEL INDUSTRIES INC,52990026LKY4MOX3L174,US45774W1080,Steel,t CO2,Fe_ton,,,,,,,,,...,,,,,,,,,,


In [23]:
pd.options.display.max_rows = 99
pd.options.display.max_columns = 49
template_df = pd.concat([financial_df, pd.concat([template_steel_df, template_rmi_df])], axis=1).dropna(thresh=16).drop(columns=['company_cash_equivalents', 'company_debt'], axis=1)
template_df.loc[pd.IndexSlice[:, :, :, ['Electricity Utilities']], ['emissions_metric', 'production_metric']] = ['Mt CO2', 'TWh']
template_df = template_df.reset_index()
cols = template_df.columns.tolist()
cols = cols[:3] + cols[4:6] + [cols[3]] + cols[6:]
template_df = template_df[cols]
for col in cols:
    if col.startswith('2020_'):
        col_index = template_df.columns.get_loc(col)
        for year in [2022, 2021]:
            newcol = col.replace('2020', str(year))
            template_df.insert(col_index+1, newcol, np.nan)
display(template_df)
pd.reset_option("display.max_rows")
pd.reset_option("display.max_columns")

Unnamed: 0,company_name,company_lei,company_id,country,region,sector,exposure,currency,year,company_market_cap,company_revenue,company_ev,company_evic,company_total_assets,emissions_metric,production_metric,2016_ghg_s1,2017_ghg_s1,2018_ghg_s1,2019_ghg_s1,2020_ghg_s1,2021_ghg_s1,2022_ghg_s1,2016_ghg_s2,...,2020_ghg_s2,2021_ghg_s2,2022_ghg_s2,2016_ghg_s1s2,2017_ghg_s1s2,2018_ghg_s1s2,2019_ghg_s1s2,2020_ghg_s1s2,2021_ghg_s1s2,2022_ghg_s1s2,2016_ghg_s3,2017_ghg_s3,2018_ghg_s3,2019_ghg_s3,2020_ghg_s3,2021_ghg_s3,2022_ghg_s3,2016_production,2017_production,2018_production,2019_production,2020_production,2021_production,2022_production
0,AES Corp.,2NUNNB7D43COUIRE5295,US00130H1059,US,North America,Electricity Utilities,equity,USD,2019.0,10870000000.0,10189000000.0,10102000000.0,11131000000.0,33648000000.0,Mt CO2,TWh,20.952695,10.483392,11.23589,11.616368,9.42552,,,0.0,...,0.0,,,,,,,,,,,,,,,,,22.186759,10.959302,13.537873,15.292477,13.075168,,
1,"ALLETE, Inc.",549300NNLSIMY6Z8OT86,US0185223007,US,North America,Electricity Utilities,equity,USD,2019.0,4285299935.0,1240500000.0,5829799935.0,5899099935.0,5482800000.0,Mt CO2,TWh,8.028792,6.56607,6.622019,4.223366,3.750732,,,0.0,...,0.0,,,,,,,,,,,,,,,,,10.311127,9.033366,8.743458,6.490906,6.078342,,
2,Alcoa Corp.,549300T12EZ1F6PWWU29,US0138721065,US,North America,Electricity Utilities,equity,USD,2019.0,4300000000.0,10433000000.0,5221000000.0,6100000000.0,14631000000.0,Mt CO2,TWh,0.0,0.0,0.0,0.0,0.0,,,0.0,...,0.0,,,,,,,,,,,,,,,,,0.742563,0.621888,1.069254,1.026422,1.196076,,
3,Algonquin Power & Utilities Corp.,549300K5VIUTJXQL7X75,US0158577090,CA,North America,Electricity Utilities,equity,USD,2019.0,,1624921000.0,,,10911470000.0,Mt CO2,TWh,3.427649,3.972491,3.768993,3.327286,2.408914,,,0.0,...,0.0,,,,,,,,,,,,,,,,,4.900562,6.28555,6.311677,5.314576,4.588301,,
4,Alliant Energy,5493009ML300G373MZ12,US0188021085,US,North America,Electricity Utilities,equity,USD,2019.0,11600000000.0,3647700000.0,18503600000.0,18519900000.0,16700700000.0,Mt CO2,TWh,12.247084,13.595806,14.580424,11.098765,11.037756,,,0.0,...,0.0,,,,,,,,,,,,,,,,,16.476444,18.561338,21.667852,20.524337,22.008184,,
5,Ameren Corp.,XRZQ5S7HYJFPHJ78L959,US0236081024,US,North America,Electricity Utilities,equity,USD,2019.0,18378774986.0,5910000000.0,27804774986.0,27820774986.0,28933000000.0,Mt CO2,TWh,28.146924,31.18751,30.672014,23.409708,25.799494,,,0.0,...,0.0,,,,,,,,,,,,,,,,,38.509235,40.953088,42.757388,35.416853,35.824987,,
6,"American Electric Power Co., Inc.",1B4S6S7G0TW5EE83BO58,US0255371017,US,North America,Electricity Utilities,equity,USD,2019.0,43491855142.0,15561400000.0,73417055142.0,73663855142.0,75892300000.0,Mt CO2,TWh,91.800594,66.9755,66.441626,58.130282,43.961628,,,0.0,...,0.0,,,,,,,,,,,,,,,,,127.769808,93.827355,93.541289,83.962493,71.732567,,
7,American States Water Co.,529900L26LIS2V8PWM23,US0298991011,US,North America,Electricity Utilities,equity,USD,2019.0,2771217000.0,473869000.0,3054582000.0,3055916000.0,1641331000.0,Mt CO2,TWh,0.000337,0.000406,0.000414,0.000402,0.000702,,,0.0,...,0.0,,,,,,,,,,,,,,,,,-3.6e-05,0.000174,5.6e-05,0.000385,0.001295,,
8,"Avangrid, Inc.",549300OX0Q38NLSKPB49,US05351W1036,US,North America,Electricity Utilities,equity,USD,2019.0,2836000000.0,6338000000.0,10826000000.0,11004000000.0,34416000000.0,Mt CO2,TWh,0.020854,0.046252,0.026542,0.026234,0.025008,,,0.0,...,0.0,,,,,,,,,,,,,,,,,0.373739,0.445768,0.326458,0.233509,0.147153,,
9,Avista Corp.,Q0IK63NITJD6RJ47SW96,US05379B1070,US,North America,Electricity Utilities,equity,USD,2019.0,2948564738.0,1345622000.0,4917868738.0,4927764738.0,6082456000.0,Mt CO2,TWh,2.234314,2.212798,2.166581,2.446111,2.053892,,,0.0,...,0.0,,,,,,,,,,,,,,,,,7.613518,7.618778,7.602209,7.614823,7.216391,,


In [24]:
with pd.ExcelWriter("../data/processed/template-20220415-output.xlsx", datetime_format="YYYY") as writer:
    template_df.to_excel(writer, sheet_name="ITR input data", index=False)

In [25]:
stop!

SyntaxError: invalid syntax (3319058519.py, line 1)

### Load emissions target data

The RMI power plant data is valid for Scope 1 emissions only.

In [None]:
engine.execute(f"describe {rmi_schema}.emissions_targets").fetchall()

### `targets_df` has all the historical and target emissions data (which can be interpreted to provide trajectory data as well)

We also preserve RMI's 1.5°C target data, which can be presented as a trajectory for comparing corporate targets with RMI's best policy recommendations.
* rtg_df is the RMI contribution to targets_df
* mtg_df is the Steel contribution to targets_df

We do not consider targets for wires-only utilities (which have no generation of their own).

In [None]:
# Emissions targets are now reported by state, but we want them rolled up to the company level.
# Therefore we sum the absolute amounts (emissions and generation) and recompute intensities from the aggregated totals.

rtg_df = pd.read_sql(f"""
select ET.parent_name as company_name, ET.respondent_id, 'Electricity Utilities' as sector, year(ET.year) as year,
       sum(co2_target) as co2_s1_target,
       sum(co2_historical) as co2_s1_historical,
       sum(co2_target_all_years) as co2_s1_target_all_years,
       sum(co2_1point5C) as co2_s1_1point5C,
       sum(generation_historical) as production_historical,
       sum(generation_projected) as production_projected,
       sum(generation_1point5C) as production_1point5C
from {rmi_schema}.emissions_targets ET
     join (select respondent_id, year
           from {rmi_schema}.operations_emissions_by_tech
           where technology_eia!='Batteries' and technology_eia!='Hydroelectric Pumped Storage'
           group by respondent_id, year) EM
           on ET.respondent_id=EM.respondent_id and ((year(ET.year)>2020 and year(EM.year)=2020) or (ET.year=EM.year) or ((year(ET.year)<2005 and year(EM.year)=2005) ))
     -- join (select parent_name, parent_lei from {rmi_schema}.utility_information group by parent_name, parent_lei) U
     --       on ET.parent_name=U.parent_name
     -- join {dera_schema}.financials_by_lei as F on F.lei=U.parent_lei
where ET.target_type='All'
group by ET.parent_name, ET.respondent_id, year(ET.year)
order by company_name, year
""", engine) # parse_dates=['year']

rtg_df.insert(1, 'company_lei', rtg_df.company_name.str.upper().map(gleif_match))
rtg_df.insert(2, 'company_id', rtg_df.company_lei.map(rmi_lei_dict))
rtg_df.loc[rtg_df.production_historical > 0, 'ei_s1_historical'] = rtg_df.co2_s1_historical / rtg_df.production_historical
rtg_df['production_general'] = rtg_df[['production_historical', 'production_projected']].bfill(axis=1).iloc[:, 0]
rtg_df.loc[rtg_df.production_general > 0, 'ei_s1_target'] = rtg_df.co2_s1_target / rtg_df.production_general
rtg_df.loc[rtg_df.production_general > 0, 'ei_s1_target_all_years'] = rtg_df.co2_s1_target_all_years / rtg_df.production_general
rtg_df.loc[rtg_df.production_1point5C > 0, 'ei_s1_1point5C'] = rtg_df.co2_s1_1point5C / rtg_df.production_1point5C
rtg_df.drop(columns='production_general', inplace=True)
rtg_df.co2_s1_historical = rtg_df.co2_s1_historical.astype('float64')
rtg_df.ei_s1_target = rtg_df.ei_s1_target.astype('float64')
rtg_df.ei_s1_target_all_years = rtg_df.ei_s1_target_all_years.astype('float64')
rtg_df.ei_s1_1point5C = rtg_df.ei_s1_1point5C.astype('float64')

print(f"len(rtg_df) = {len(rtg_df)}")
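
The roll-up in the query above sums absolute emissions and generation across state-level rows and only then divides, rather than averaging per-state intensities. The following is a minimal sketch of why the order matters. The numbers are made up, and the column names (`state`, `co2`, `generation`, `intensity`) are illustrative assumptions, not the RMI schema.

```python
import pandas as pd

# Toy state-level rows for one hypothetical company (illustrative values only)
state_df = pd.DataFrame({
    "company_name": ["Example Utility", "Example Utility"],
    "state": ["AA", "BB"],
    "co2": [90.0, 10.0],            # absolute emissions per state
    "generation": [100.0, 400.0],   # absolute generation per state
})
state_df["intensity"] = state_df["co2"] / state_df["generation"]

# Roll up absolutes first, then recompute the intensity from the totals
company_df = state_df.groupby("company_name")[["co2", "generation"]].sum()
company_df["intensity"] = company_df["co2"] / company_df["generation"]

# For contrast: an unweighted mean of state intensities over-weights the small state
naive_mean = state_df.groupby("company_name")["intensity"].mean()

print(company_df)
print(naive_mean)
```

With these toy numbers the company-level intensity is 100/500 = 0.2, while the unweighted mean of the two state intensities (0.9 and 0.025) is 0.4625.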

### Fix target information comprehensively (mostly fixed with March 2022 update)

1. Where co2_target is set to zero before 2019 and then ramps up to a non-zero number before 2020, clear the target number and replace all target data with historical data.
2. Where co2_target is NULL, generation_historical==1, and co2_intensity_historical==0, remove the false generation_historical==1 data point: there is never any generation before generators are operational.
3. Where co2_historical is non-NULL and non-zero, look for outlier data.  If the generation_historical for the outlier data is not an outlier in the generation data, recompute co2_intensity_historical and co2_historical based on non-outlier data.
4. Where max(year) < 2020, discard forward-looking projections: they are represented elsewhere.
5. Where production_projected is non-NULL and flat from 2021 to 2050, replace it with OECM production growth values for the 'North America' region.
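
As an illustration of rule 2 in the list above, here is a minimal sketch on a toy frame. It assumes the raw column names used in the list (`co2_target`, `generation_historical`, `co2_intensity_historical`) and assumes that "remove" means blanking the placeholder value rather than dropping the row. Neither assumption is confirmed by the pipeline code.

```python
import numpy as np
import pandas as pd

# Toy rows for one hypothetical respondent (illustrative values only)
toy_df = pd.DataFrame({
    "year": [2008, 2009, 2010],
    "co2_target": [np.nan, np.nan, 80.0],
    "generation_historical": [1.0, 250.0, 240.0],
    "co2_intensity_historical": [0.0, 0.4, 0.35],
})

# Rule 2: no target, generation exactly 1, intensity exactly 0 -> placeholder row
placeholder = (
    toy_df["co2_target"].isna()
    & (toy_df["generation_historical"] == 1)
    & (toy_df["co2_intensity_historical"] == 0)
)

# Blank the false generation value so it cannot feed later intensity calculations
toy_df.loc[placeholder, "generation_historical"] = np.nan
print(toy_df)
```

Only the 2008 row matches all three conditions, so only its generation value is cleared.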

In [None]:
print("Step 4: When data is exhausted prior to 2020, discard forward-looking projections represented elsewhere")

step4_df = rtg_df.loc[rtg_df.year==2019, ['respondent_id', 'production_historical']].fillna(0)
step4_index = step4_df[step4_df.production_historical!=0]['respondent_id']
print(f"Initial length of target dataset: {len(rtg_df)}")
print("respondent_id not in index")
print(sorted(rtg_df.loc[~rtg_df.respondent_id.isin(step4_index), 'respondent_id'].drop_duplicates().tolist()))
rtg_df = rtg_df.loc[rtg_df.respondent_id.isin(step4_index)]
print(f"Resulting length of target dataset: {len(rtg_df)}")


The RMI targets only cover S1, so we don't need to compute the non-existent S2 and S3 numbers (until RMI provides them).

In [None]:
def compute_sums_and_wavg(x):
    d = { 'co2_s1_target_by_year':x['co2_s1_target_all_years'].sum(),
          'production_by_year':x[['production_historical', 'production_projected']].bfill(axis=1).iloc[:, 0].sum() }
    if d['production_by_year']:
        d['ei_s1_target_by_year'] = (x[['production_historical', 'production_projected']].bfill(axis=1).iloc[:, 0] * x['ei_s1_target_all_years']).sum() / d['production_by_year']
    else:
        d['ei_s1_target_by_year'] = 0
    return pd.Series(d, index=['ei_s1_target_by_year', 'co2_s1_target_by_year', 'production_by_year'])

targets_df = (rtg_df[rtg_df.year>=2014]
      .groupby(['company_name', 'company_lei', 'company_id', 'sector', 'year'])
      .apply(compute_sums_and_wavg)
      .sort_values(['company_name', 'year'], ascending=[True, False])
     ).reset_index()

targets_df.loc[(targets_df.production_by_year!=0)&targets_df.co2_s1_target_by_year.notnull(), 'ei_s1_target_by_year'] = targets_df.co2_s1_target_by_year/targets_df.production_by_year

In [None]:
targets_df[(targets_df.company_name=='AES Corp.')&(targets_df.year==2016)]

In [None]:
mdt_production = (steel_production
                  .reset_index()
                  .melt(id_vars=['company_name','company_lei','company_id'], var_name='year',
                                 value_name='production_by_year')
                  .dropna()
                  .set_index(['company_name','company_lei','company_id', 'year']))
display(mdt_production)
mdt_co2 = pd.concat([steel_co2[scope]
                     .reset_index()
                     .melt(id_vars=['company_name','company_lei','company_id'], var_name='year',
                                    value_name=f"co2_{scope}_target_by_year")
                     .dropna()
                     .set_index(['company_name','company_lei','company_id', 'year'])
                     for scope in scopes],
                    join='outer', axis=1)
display(mdt_co2)
mdt_ei = pd.concat([steel_ei[scope]
                    .reset_index()
                    .melt(id_vars=['company_name','company_lei','company_id'], var_name='year',
                                   value_name=f"ei_{scope}_target_by_year")
                    .dropna()
                    .set_index(['company_name','company_lei','company_id', 'year'])
                    for scope in scopes],
                    join='outer', axis=1)
display(mdt_ei)
mdt_trajectory = mdt_ei.copy()
for scope in scopes:
    mdt_trajectory.rename(columns={f"ei_{scope}_target_by_year":f"ei_{scope}_trajectory_by_year"}, inplace=True)
display(mdt_trajectory)

In [None]:
steel_targets_df = pd.concat([mdt_production, mdt_co2, mdt_ei], join='outer', axis=1).reset_index()
steel_targets_df.insert(3, 'sector', 'Steel')
targets_df = pd.concat([targets_df, steel_targets_df])

In [None]:
stop!

In [None]:
traj_df = {}
traj_mdf = {}
for scope in scopes:
    traj_df[scope] = targets_df.pivot(index=['company_name', 'company_lei', 'company_id', 'sector'], columns='year', values=f"ei_{scope}_target_by_year").reset_index()
    # We handicap historic progress by averaging with "no progress"
    historic_progress = (1.0 + traj_df[scope][2019] / traj_df[scope][2014]) / 2

    # There are weird artifacts where energy storage systems have negative generation, so treat their progress as zero
    # If intensity is actually growing, cap trajectory at 1 (no progress)
    annualized_progress = historic_progress.where(historic_progress>=0, 0).where(historic_progress<=1, 1) ** (1/(2019-2014))

    traj_df[scope].loc[:, 2021:2049]=np.nan
    traj_df[scope][2050] = traj_df[scope][2020] * annualized_progress ** (2050-2020)
    traj_df[scope].loc[:, 2020:2050] = traj_df[scope].loc[:, 2020:2050].astype('float64').interpolate(axis=1)
    traj_mdf[scope] = (traj_df[scope]
                       .melt(id_vars=['company_name','company_lei','company_id','sector'], var_name='year', value_name=f"ei_{scope}_trajectory_by_year")
                       .set_index(['company_name','company_lei','company_id','sector','year'])
                       .convert_dtypes())

traj_mdf = pd.concat([*traj_mdf.values()], join='outer', axis=1).reset_index()
display(traj_mdf)
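
The trajectory logic in the cell above can be followed for a single company with plain arithmetic. This is a sketch under stated assumptions: `project_trajectory` is a hypothetical helper, not part of the pipeline, and it assumes a non-zero 2014 intensity. It mirrors the same steps: average the 2019/2014 ratio with 1.0, clamp to [0, 1], annualize over five years, compound to 2050, and interpolate linearly from 2020.

```python
def project_trajectory(ei_2014, ei_2019, ei_2020, end_year=2050):
    """Return {year: intensity} for 2020..end_year (hypothetical helper)."""
    # Handicap historic progress by averaging it with "no progress" (ratio 1.0)
    historic_progress = (1.0 + ei_2019 / ei_2014) / 2
    # Negative ratios count as zero; growing intensity is capped at 1 (no progress)
    historic_progress = min(max(historic_progress, 0.0), 1.0)
    # Convert the five-year ratio into a per-year multiplier
    annualized_progress = historic_progress ** (1 / (2019 - 2014))
    # Compound the annual multiplier from 2020 out to the end year
    ei_end = ei_2020 * annualized_progress ** (end_year - 2020)
    # Linear interpolation between the 2020 value and the end-year value
    span = end_year - 2020
    return {2020 + i: ei_2020 + (ei_end - ei_2020) * i / span for i in range(span + 1)}


# Intensity halved between 2014 and 2019: handicapped ratio is (1 + 0.5) / 2 = 0.75
traj = project_trajectory(ei_2014=0.8, ei_2019=0.4, ei_2020=0.4)
print(traj[2020], traj[2035], traj[2050])

# Intensity grew: the ratio is capped at 1, so the trajectory stays flat
flat = project_trajectory(ei_2014=0.4, ei_2019=0.6, ei_2020=0.6)
print(flat[2050])
```

Because the values between 2020 and 2050 are filled by linear interpolation, only the 2050 endpoint follows the compounded rate; the intermediate years lie on a straight line between the two endpoints.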

In [None]:
targets_df = targets_df.merge(traj_mdf, on=['company_name','company_lei','company_id','sector','year']).convert_dtypes()
print(f"Final len(targets_df) = {len(targets_df)}")

In [None]:
targets_df

### TODO: Implement Units

Intensity and production data need units to distinguish TWh of generation from tons of steel production.

Company data is converted to USD by SEC_DERA ingestion for now, but should support any currency in the future.
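
Until units are implemented, one lightweight option is to carry unit labels per sector, as the template cells earlier in this notebook already do with `emissions_metric` and `production_metric`. The sketch below is only an assumption about how that could look: `SECTOR_UNITS` and `intensity_unit` are hypothetical names, and the labels are the ones used in the template output.

```python
# Unit labels per sector, taken from the template cells in this notebook
SECTOR_UNITS = {
    "Electricity Utilities": {"emissions_metric": "Mt CO2", "production_metric": "TWh"},
    "Steel": {"emissions_metric": "t CO2", "production_metric": "Fe_ton"},
}


def intensity_unit(sector):
    """Build the intensity unit label (emissions per unit of production) for a sector."""
    units = SECTOR_UNITS[sector]
    return f"{units['emissions_metric']}/{units['production_metric']}"


for sector in SECTOR_UNITS:
    print(sector, "->", intensity_unit(sector))
```

String labels only document the units. They do not convert between them, so a real implementation would still need a units-aware representation.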

In [None]:
tablenames = 'company_data', 'intensity_data', 'trajectory_data', 'emissions_data', 'production_data'

In [None]:
schema_create = engine.execute(f"""
CREATE SCHEMA if not exists {ingest_catalog}.{demo_schema}
 AUTHORIZATION USER michaeltiemannosc
 WITH (
     location = 's3a://osc-datacommons-s3-bucket-dev02/data/demo_dv.db'
 )
""")
schema_create.fetchall()

In [None]:
dataframes = [financial_df[financial_df.company_id.isin(targets_df.company_id)],
              targets_df[['company_name','company_lei','company_id','year','ei_s1_target_by_year']],
              targets_df[['company_name','company_lei','company_id','year','ei_s1_trajectory_by_year']],
              targets_df[['company_name','company_lei','company_id','year','co2_s1_target_by_year']],
              targets_df[['company_name','company_lei','company_id','year','production_by_year']],]

for ingest_table, df in zip(tablenames, dataframes):
    drop_table = engine.execute(f"drop table if exists {demo_schema}.{ingest_table}")
    drop_table.fetchall()

    columnschema = osc.create_table_schema_pairs(df)

    tabledef = f"""
create table if not exists {ingest_catalog}.{demo_schema}.{ingest_table}(
{columnschema}
) with (
    format = 'ORC',
    partitioning = array['year']
)
"""

    print(tabledef)
    qres = engine.execute(tabledef)
    print(qres.fetchall())
    df.to_sql(ingest_table,
              con=engine, schema=demo_schema, if_exists='append',
              index=False,
              method=osc.TrinoBatchInsert(batch_size = 2000, verbose = True))

In [None]:
pdf = targets_df.pivot(index=['company_name', 'company_lei', 'company_id'], columns='year').reset_index()

In [None]:
pdf

In [None]:
stop!
# pdf.insert(1, 'company_lei', pdf.company_name.str.upper().map(gleif_match))
# pdf.insert(2, 'company_id', pdf.company_lei.map(rmi_lei_dict))
# pdf = pdf.set_index(['company_name','company_lei', 'company_id'], drop=True)
pdf.columns.names=[None,None]
pdf

In [None]:
ei_s1_df = pd.concat([pdf.company_name, pdf.company_lei, pdf.company_id, pdf.ei_s1_target_by_year.reset_index()], axis=1).drop('index', axis=1)
ei_s1_df

In [None]:
ei_s2_df = pd.concat([pdf.company_name, pdf.company_lei, pdf.company_id, pdf.ei_s2_target_by_year.reset_index()], axis=1).drop('index', axis=1)
ei_s2_df

In [None]:
ei_s1_df.iloc[:, 3] = 2*ei_s1_df.iloc[:, 4] - ei_s1_df.iloc[:, 5]
ei_s1_df = ei_s1_df[ei_s1_df.company_id.notna()]
ei_s1_df.insert(3, 'scope', 'S1')
ei_s1_df.head(10)

In [None]:
ei_s2_df.iloc[:, 3] = 2*ei_s2_df.iloc[:, 4] - ei_s2_df.iloc[:, 5]
ei_s2_df = ei_s2_df[ei_s2_df.company_id.notna()]
ei_s2_df.insert(3, 'scope', 'S2')
ei_s2_df.head(10)

In [None]:
ei_s1_df.iloc[:, 3] = 2*ei_s1_df.iloc[:, 4] - ei_s1_df.iloc[:, 5]
ei_s1_df = co2_ei_df[co2_ei_df.company_id.notna()]
ei_s1_df.insert(3, 'scope', 'S1')
ei_s1_df.head(10)

In [None]:
co2_df = pd.concat([pdf.company_name, pdf.company_lei, pdf.company_id, pdf.co2_target_by_year.reset_index()], axis=1).drop('index', axis=1)
co2_df = co2_df[co2_df.company_id.notna()]
co2_df.insert(3, 'scope', 'S1+S2')
co2_df.head()

In [None]:
gen_df = pd.concat([pdf.company_name, pdf.company_lei, pdf.company_id, pdf.production_by_year.reset_index()], axis=1).drop('index', axis=1)
gen_df.iloc[:, 3] = 2*gen_df.iloc[:, 4] - gen_df.iloc[:, 5]
gen_df = gen_df[gen_df.company_id.notna()]
gen_df.insert(3, 'production', 'TWh')
gen_df.head()

In [None]:
with pd.ExcelWriter("rmi-20220307-output.xlsx", datetime_format="YYYY") as writer:
    financial_df.to_excel(writer, sheet_name="fundamental_data", index=False)
    co2_ei_df.to_excel(writer, sheet_name="projected_ei_in_Wh", index=False)
    gen_df.to_excel(writer, sheet_name="projected_production", index=False)
    co2_df.to_excel(writer, sheet_name="projected_co2", index=False)


In [None]:
portfolio_zero = portfolio_df.copy()
portfolio_zero.target_probability = 0.0
portfolio_one = portfolio_df.copy()
portfolio_one.target_probability = 1.0

portfolio_df.to_csv("rmi-20220307-portfolio.csv", sep=';', index=False)

In [None]:
engine.execute(f"select count (*) from (select parent_name from {rmi_schema}.utility_information group by parent_name)").fetchall()

If the following query returns any rows, the Data Vault will reject the company data.

In [None]:
engine.execute(f"select C.company_name, C.company_id, EI.* from {demo_schema}.company_data C left join {demo_schema}.intensity_data EI on EI.company_name=C.company_name where EI.co2_intensity_target_by_year is NULL").fetchall()
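
The same completeness check can be run in pandas before anything is ingested. This is a minimal sketch on toy frames: the company rows are made up, and the intensity column is assumed to be `ei_s1_target_by_year`, the name used when the intensity table was built above.

```python
import pandas as pd

# Toy stand-ins for the company and intensity tables (illustrative rows only)
company_data = pd.DataFrame({
    "company_name": ["A Corp", "B Corp"],
    "company_id": ["ID-A", "ID-B"],
})
intensity_data = pd.DataFrame({
    "company_name": ["A Corp"],
    "year": [2020],
    "ei_s1_target_by_year": [0.4],
})

# Left join keeps every company; companies without intensity rows get NaN
merged = company_data.merge(intensity_data, on="company_name", how="left")
missing = merged[merged["ei_s1_target_by_year"].isna()]

# Any rows here are companies that would fail the check
print(missing[["company_name", "company_id"]])
```

An empty `missing` frame corresponds to the SQL query returning no rows.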