# ITR Data Pipeline

The ITR data pipeline organizes and assembles data needed for the ITR tool.  The data may come from many sources, but the output of this pipeline is a complete, consistent dataset that can be fully interrogated by the ITR tool.  If users wish to add additional data or analyze additional portfolio companies, they must create a new dataset using this pipeline.

These are the data needed to create the ITR dataset:
* Global Parameters (just for reference--we do nothing with them here)
* Industry Data (Sector Projections aka Benchmarks)
* Portfolio Data (Must cover all the stocks a user may query)
* Company Data (Must cover all companies in possible portfolios)
* Automization (Must cover all years and scenarios a user may query)

The ITR tool can create secondary datasets:
* Cumulative emissions targets trajectories
* Cumulative emissions budgets
* Target and trajectory overshoot/undershoot ratios
* Target and trajectory temperature scores

These secondary datasets are not the concern of this pipeline.

### Environment variables and dot-env

The following cell looks for a "dot-env" file in some standard locations,
and loads its contents into `os.environ`.

In [1]:
import os
import pathlib
from dotenv import load_dotenv

# Load some standard environment variables from a dot-env file, if it exists.
# If no such file can be found, does not fail, and so allows these environment vars to
# be populated in some other way
dotenv_dir = os.environ.get('CREDENTIAL_DOTENV_DIR', os.environ.get('PWD', '/opt/app-root/src'))
dotenv_path = pathlib.Path(dotenv_dir) / 'credentials.env'
if os.path.exists(dotenv_path):
    load_dotenv(dotenv_path=dotenv_path,override=True)

import numpy as np
import pandas as pd
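
The S3 and Trino cells that follow read their credentials directly from `os.environ`, so a missing variable surfaces as a `KeyError` partway through the notebook. A minimal sketch of an up-front check, assuming the variable names used in this notebook; `REQUIRED_VARS` and `missing_env_vars` are hypothetical helpers, not part of the pipeline:

```python
import os

# Variable names as they appear in this notebook's cells; extend as needed.
REQUIRED_VARS = [
    "S3_LANDING_ENDPOINT", "S3_LANDING_ACCESS_KEY", "S3_LANDING_SECRET_KEY", "S3_LANDING_BUCKET",
    "S3_DEV_ENDPOINT", "S3_DEV_ACCESS_KEY", "S3_DEV_SECRET_KEY",
    "TRINO_USER", "TRINO_HOST", "TRINO_PORT", "TRINO_PASSWD",
]

def missing_env_vars(names, env=None):
    """Return the names that are unset or empty in env (defaults to os.environ)."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]

# Report what is missing now instead of failing later with a KeyError.
missing = missing_env_vars(REQUIRED_VARS)
if missing:
    print("Missing environment variables:", ", ".join(missing))
```

Passing a plain dictionary as `env` makes the check easy to exercise without touching the real environment.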

### S3 and boto3

In [2]:
import boto3

s3_source = boto3.resource(
    service_name="s3",
    endpoint_url=os.environ['S3_LANDING_ENDPOINT'],
    aws_access_key_id=os.environ['S3_LANDING_ACCESS_KEY'],
    aws_secret_access_key=os.environ['S3_LANDING_SECRET_KEY'],
)
source_bucket = s3_source.Bucket(os.environ['S3_LANDING_BUCKET'])

In [3]:
import osc_ingest_trino as osc
import io

s3 = boto3.resource(
    service_name="s3",
    endpoint_url=os.environ["S3_DEV_ENDPOINT"],
    aws_access_key_id=os.environ["S3_DEV_ACCESS_KEY"],
    aws_secret_access_key=os.environ["S3_DEV_SECRET_KEY"],
)
trino_bucket = osc.attach_s3_bucket("S3_DEV")

### Connecting to Trino with sqlalchemy

In the context of the Data Vault, this pipeline operates with full visibility into all the data it prepares for the ITR tool.  When the data is output, it is labeled so that the Data Vault can enforce its data management access rules.

In [4]:
import trino
from sqlalchemy.engine import create_engine

ingest_catalog = 'osc_datacommons_dev'
ingest_schema = 'itr_mdt'
demo_schema = 'demo'

sqlstring = 'trino://{user}@{host}:{port}/'.format(
    user = os.environ['TRINO_USER'],
    host = os.environ['TRINO_HOST'],
    port = os.environ['TRINO_PORT']
)
sqlargs = {
    'auth': trino.auth.JWTAuthentication(os.environ['TRINO_PASSWD']),
    'http_scheme': 'https',
    'catalog': ingest_catalog,
    'schema': ingest_schema,
}
engine = create_engine(sqlstring, connect_args = sqlargs)
connection = engine.connect()

## Global Parameters

These parameters are set/selected by the ITR tool.  They are included here for reference only (the following is not live code).

## Industry Data (Sector Projections)

We presently ingest two TPI benchmark projections and one OECM projection covering Electricity Utilities and Steel.  If users want to evaluate their portfolios under different or expanded projections, they need to be added here.

In [5]:
scenarios = {}
for scenario in ['TPI', 'TPI_below_2', 'OECM']:
    df_dict = pd.read_excel(os.environ.get('PWD')+f"/itr-data-pipeline/data/external/{scenario}_EI_and_production_benchmarks{('','_v2')[scenario=='OECM']}.xlsx", sheet_name=None)
    for projtype in ['projected_production', 'projected_ei_in_Wh']:
        df_dict[projtype]['projection'] = projtype
        df_dict[projtype]['scenario'] = scenario
    scenarios[scenario] = pd.concat(df_dict.values())
df = pd.concat(scenarios, ignore_index=True)
cols = df.columns.tolist()
# Shift 'projection' and 'scenario' to the first two columns of the data frame
cols = cols[-2:]+cols[0:-2]
df = df[cols]
# display(df)

sector_projections = df.melt(id_vars=cols[0:4], value_vars=cols[4:], var_name='year')
display(sector_projections)

Unnamed: 0,projection,scenario,region,sector,year,value
0,projected_ei_in_Wh,TPI,Global,Steel,2019,0.607560
1,projected_ei_in_Wh,TPI,Global,Electricity Utilities,2019,1.669000
2,projected_production,TPI,Global,Steel,2019,0.000000
3,projected_production,TPI,Europe,Steel,2019,0.000000
4,projected_production,TPI,North America,Steel,2019,0.000000
...,...,...,...,...,...,...
891,projected_production,OECM,Europe,Steel,2050,0.015000
892,projected_production,OECM,North America,Steel,2050,0.015000
893,projected_production,OECM,Global,Electricity Utilities,2050,0.011913
894,projected_production,OECM,Europe,Electricity Utilities,2050,0.006360
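
The reshaping step above turns one column per year into one row per year. A small sketch of the same `melt` call on a toy table may make the transformation easier to follow; the values are made up and the frame is far smaller than the real benchmarks:

```python
import pandas as pd

# Toy wide-format benchmark table; the numbers are illustrative only.
wide = pd.DataFrame({
    "projection": ["projected_production", "projected_production"],
    "scenario": ["TPI", "TPI"],
    "region": ["Global", "Europe"],
    "sector": ["Steel", "Steel"],
    2019: [1.0, 2.0],
    2020: [1.5, 2.5],
})

id_cols = ["projection", "scenario", "region", "sector"]
year_cols = [c for c in wide.columns if c not in id_cols]

# One output row per (projection, scenario, region, sector, year).
long_df = wide.melt(id_vars=id_cols, value_vars=year_cols, var_name="year")
print(long_df)
```

Two input rows with two year columns become four output rows, with the measurement in the default `value` column.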


Create the ISIC-to-Sector table manually until we have a proper sector mapping table

In [6]:
i2s_df = pd.DataFrame({"isic": [2410, 4010],
                       "sector": ['Steel', 'Electricity Utilities']}).convert_dtypes()
# drop_unmanaged_table matches the calls used elsewhere in this notebook; the earlier
# drop_unmanaged_data call with these arguments raised
# AttributeError: 'str' object has no attribute 'objects'
osc.drop_unmanaged_table(ingest_catalog, demo_schema, 'isic_to_sector', engine, trino_bucket)
osc.ingest_unmanaged_parquet(i2s_df, demo_schema, 'isic_to_sector', trino_bucket)

sql = osc.unmanaged_parquet_tabledef(i2s_df, ingest_catalog, demo_schema, 'isic_to_sector', trino_bucket)
engine.execute(sql).fetchall()

## Portfolio Data

The user will ultimately supply portfolio selection and position information to the ITR tool as part of the weighting calculations.  This part of the pipeline just collects the LEI and ISIN information for companies we should expect to analyze (i.e., companies for which we have fundamental financial information, production, intensity, and target information, in sectors for which we have benchmark projections).

Because this pipeline does the full pre-computation of data for the tool, there is no sense carrying forward information that is not fully closed.  I.e., there's no reason to carry forward an LEI:ISIN relationship if there is no financial, production, or target information related to that LEI and/or ISIN.  The user does not add such data later; the data is collected and fully processed by this pipeline now.

### Get LEI/ISIN data

RMI hands us data already matched with LEIs and ISINs.  Other lists of company names may require us to stitch that together manually.

In [None]:
rmi_lei_isin = pd.read_sql('select parent_name, parent_lei, parent_isin from rmi_20211120.utility_information', engine)
rmi_dict = dict(zip(rmi_lei_isin.parent_lei, rmi_lei_isin.parent_isin))
rmi_lei_isin

Prepare GLEIF matching data for SEC DERA data.  In the future, such matching will use the ESG Entity-Matching pipeline (https://github.com/os-climate/financial-entity-cleaner/tree/version_0.1.0).

In [None]:
gleif_file = s3_source.Object(os.environ['S3_LANDING_BUCKET'],'mtiemann-GLEIF/DERA-matches.csv')
gleif_file.download_file(f'/tmp/dera-gleif.csv')
gleif_df = pd.read_csv(f'/tmp/dera-gleif.csv', header=0, sep=',', dtype=str, engine='c')
gleif_dict = dict(zip(gleif_df.name, gleif_df.LEI))

# Many of the following ISINs are bonds, but some are also stocks (on various exchanges)

gleif_isin_file = s3_source.Object(os.environ['S3_LANDING_BUCKET'],'mtiemann-GLEIF/ISIN_LEI_20211009.csv')
gleif_isin_file.download_file(f'/tmp/ISIN_LEI_20211009.csv')
gleif_isins = pd.read_csv(f'/tmp/ISIN_LEI_20211009.csv', header=0, sep=',', dtype=str, engine='c')

In [None]:
gleif_isins[gleif_isins.ISIN.str.startswith('JP')]

Create a very simple entity matcher, cleaning up slight variations in company names between RMI's entity names, the SEC's entity names, and GLEIF's entity names.

In [None]:
# gleif_dict['Basin Electric Power Coop'.upper()] = gleif_dict['BASIN ELECTRIC POWER COOPERATIVE']
# gleif_dict['Big Rivers Electric Corp'.upper()] = gleif_dict['BIG RIVERS ELECTRIC CORPORATION']
gleif_dict['Cleco Partners LP'.upper()] = gleif_dict['CLECO CORPORATE HOLDINGS LLC']
# gleif_dict['Golden Spread Electric Coop., Inc'.upper()] = gleif_dict['GOLDEN SPREAD ELECTRIC COOPERATIVE, INC.']
gleif_dict['MIDWEST ENERGY INC'] = '549300O4B5CVWMKUES27'
gleif_dict['OG&E Energy'.upper()] = gleif_dict['OGE ENERGY CORP.']
# gleif_dict['Ohio Valley Electric Corp'.upper()] = gleif_dict['OHIO VALLEY ELECTRIC CORPORATION']
gleif_dict['Old Dominion Electric Coop'.upper()] = gleif_dict['OLD DOMINION ELECTRIC COOPERATIVE']
gleif_dict['PG&E Corp.'.upper()] = gleif_dict['PG&E CORP']
gleif_dict['Tri-State Generation & Transmission Association'.upper()] = gleif_dict['TRI-STATE GENERATION & TRANSMISSION ASSOCIATION, INC.']
gleif_dict['DOMINION ENERGY INC'] = 'ILUL7B6Z54MRYCF6H308'

gleif_dict['GROUP SIMEC SA DE CV'] = '529900LCYCXPA0TZEU09'

gleif_1 = { k.split(',')[0].split(' ')[0]:v for k,v in gleif_dict.items() }
gleif_2 = { ' '.join(k.split(',')[0].split(' ')[0:2]):v for k,v in gleif_dict.items() }

def gleif_match(x):
    x = x.split(',')[0]
    if x in gleif_dict:
        return gleif_dict[x]
    x = x.replace('.','')
    if x in gleif_dict:
        return gleif_dict[x]
    x2 = ' '.join(x.split(' ')[0:2])
    if x2 in gleif_2:
        return gleif_2[x2]
    if ' ' not in x and x in gleif_1:
        return gleif_1[x]
    return None
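
To see how the fallback order in `gleif_match` behaves, here is a self-contained sketch that rebuilds the same logic over a toy dictionary. `build_matcher` is a hypothetical wrapper, and the company names and LEI strings are placeholders rather than real identifiers:

```python
def build_matcher(name_to_lei):
    """Return a matcher with the same fallback order as gleif_match above."""
    first_word = {k.split(',')[0].split(' ')[0]: v for k, v in name_to_lei.items()}
    first_two = {' '.join(k.split(',')[0].split(' ')[0:2]): v for k, v in name_to_lei.items()}

    def match(name):
        name = name.split(',')[0]          # drop anything after the first comma
        if name in name_to_lei:
            return name_to_lei[name]
        name = name.replace('.', '')       # retry without periods
        if name in name_to_lei:
            return name_to_lei[name]
        two = ' '.join(name.split(' ')[0:2])
        if two in first_two:               # fall back to the first two words
            return first_two[two]
        if ' ' not in name and name in first_word:
            return first_word[name]        # single-word names only
        return None

    return match

# Placeholder data for illustration only.
toy = {
    "ACME STEEL CORPORATION": "LEI-ACME",
    "NORTHERN POWER COOPERATIVE, INC.": "LEI-NORTH",
}
match = build_matcher(toy)
print(match("ACME STEEL CORP."), match("NORTHERN POWER COOPERATIVE, INC."), match("UNKNOWN UTILITY"))
```

The two-word fallback is deliberately loose, so unrelated companies that share their first two words would resolve to the same LEI. When several keys share a prefix, the last one in the dictionary wins. Results for a new portfolio are worth reviewing by hand.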

Collect the universe of company names for the sectors we cover.  Steel sector is SIC 3310-3317. Electricity Utilities is SIC 4911 (but also 4931-4932 and 4991).

Some conglomerates have more general SIC codes that hide their activities in sectors of interest.  Others report those SIC codes within reportable segments.
Without more detailed SEC DERA data (available in an S3 bucket but not yet processed as a pipeline), we cannot collect all the company names we need.

In [None]:
sec_lei_isin = pd.read_sql(f"""
select DISTINCT F.name, F.lei, F.sic
from sec_dera.financials_by_lei F
where (sic=4911 or sic=4931 or sic=4932 or sic=4991)
      or (sic>=3310 and sic<=3317)
""", engine)
sec_lei_isin.loc[sec_lei_isin.name=='DOMINION ENERGY INC', 'lei'] = 'ILUL7B6Z54MRYCF6H308'
sec_lei_isin.loc[sec_lei_isin.name=='GROUP SIMEC SA DE CV', 'lei'] = '529900LCYCXPA0TZEU09'

sec_dict = dict(zip(sec_lei_isin.lei, rmi_lei_isin.parent_isin))
missing_leis = sec_lei_isin[sec_lei_isin.lei.isna()]
sec_lei_isin.dropna(inplace=True)
print("The following companies are missing LEI information and will be dropped:")
display(missing_leis)

This sample portfolio conveniently contains all necessary LEI and ISIN information, meaning we don't need to do entity matching or ISIN matching.

Other portfolios may need a lot more work before they can be used to precompute other data.  The code above is a sample of the kind of extra data and processing such portfolios need.

In [None]:
portfolio_df = pd.read_csv(f"{os.environ.get('PWD')}/itr-data-pipeline/data/external/mdt-20220116-portfolio.csv",
                           delimiter=';')
# portfolio_df.insert(1, 'company_lei', portfolio_df.company_name.str.upper().map(gleif_match))
# portfolio_df.loc[portfolio_df.company_lei.isin(rmi_dict.keys()), 'company_id'] = portfolio_df.company_lei.map(rmi_dict)
# portfolio_df = portfolio_df.drop(columns='company_isin')
if portfolio_df.company_lei.isna().any():
    display(portfolio_df[portfolio_df.company_lei.isna()])

In [None]:
portfolio_df = portfolio_df.dropna(how='any').convert_dtypes()
print(f"Number of portfolio companies = {len(portfolio_df)}")
portfolio_df.iloc[2:60:3]

In [None]:
engine.execute(f"create schema if not exists {ingest_schema}").fetchall()

qres = engine.execute(f"show tables in {ingest_schema}")
l = qres.fetchall()
for x in l:
    osc.drop_unmanaged_table(ingest_catalog, ingest_schema, x[0], engine, trino_bucket)

engine.execute(f"drop schema {ingest_schema}").fetchall()
engine.execute(f"create schema {ingest_schema}").fetchall()

## Company Data

The SIC-to-ISIC table is an open workstream item: https://github.com/os-climate/itr-data-pipeline/issues/1

In [None]:
# We have no S3 emissions in RMI data.

engine.execute("select * from sec_dera.sic_isic").fetchall()

### Capture a list of the companies for which we have good financial info

We limit our view to the companies in our portfolio.  The user can prioritize whether this is the best source of revenue, market cap, etc., or whether they prefer another source.

In [None]:
osc.drop_unmanaged_table(ingest_catalog, ingest_schema, 'portfolio', engine, trino_bucket)
osc.ingest_unmanaged_parquet(portfolio_df, ingest_schema, 'portfolio', trino_bucket)

sql = osc.unmanaged_parquet_tabledef(portfolio_df, ingest_catalog, ingest_schema, 'portfolio', trino_bucket)
engine.execute(sql).fetchall()

### Print that list with metric labels embedded in the output for easy reading...

The counts tell how many constituent plants are summed to arrive at the final numbers.

### Capture and print a list of companies with financial info

Financial information is part of the "fundamental data" we need for the ITR portfolio companies.  The other part is base year production, emission, and intensity data.  We query the two separately because we have a unified source of truth for the former (SEC DERA) but multiple sources for the latter (RMI for Electric Utilities and MDT for Steel).

### Financial info:
* Company Name, LEI, ISIN, year
* ISIC Code (for Sector)
* Country and Region
* Revenue, Market Cap, Enterprise Value, Assets, Cash

We currently focus exclusively on data from 2019 as our base year.

In [None]:
qres = engine.execute(f"""
select DISTINCT 'P.company_name', 'P.company_lei', 'P.company_id', 'S2I.isic',
       'F.country', 'UN.region',
       'F.revenue_usd', 'F.market_cap_usd', 'ev_usd', 'F.assets_usd', 'F.cash_usd'
""")
l = qres.fetchall()
print(l)

base_financial_sql = f"""
select DISTINCT P.company_name, P.company_lei, P.company_id, S2I.isic, year(F.ddate) as year,
       F.country, UN.region_ar6_10 as region,
       F.revenue_usd as company_revenue,
       F.market_cap_usd as company_market_cap,
       F.market_cap_usd+F.debt_usd-F.cash_usd as company_enterprise_value,
       F.assets_usd as company_total_assets,
       F.cash_usd as company_cash_equivalents
from {ingest_schema}.portfolio as P
     left join sec_dera.financials_by_lei as F on F.lei=P.company_lei and F.ddate>=DATE('2019-01-01') and F.ddate<DATE('2020-01-01')
     join iso3166.countries as I on F.country=I.alpha_2
     join essd.regions as UN on I.alpha_3=UN.iso
     -- join sec_dera.sub as S on S.cik=F.cik
     -- left join rmi_20211120.utility_information as U on U.parent_lei=P.company_lei
     -- left join gleif_mdt.gleif_isin_lei G on G.lei=P.lei and G.isin=U.parent_isin
     left join sec_dera.sic_isic as S2I on S2I.sic=F.sic
     -- left join rmi_20211120.operations_emissions_by_fuel as E on U.respondent_id=E.respondent_id and year(E.year)=year(F.ddate)
-- where E.owned_or_total='owned'
group by P.company_name, P.company_lei, P.company_id, year(F.ddate), S2I.isic,
       F.country, UN.region_ar6_10, -- 'Electric Utilities', NULL, NULL,
       F.revenue_usd, F.market_cap_usd, F.market_cap_usd+F.debt_usd-F.cash_usd, F.assets_usd, F.cash_usd
order by P.company_name
limit 200
"""

qres = engine.execute(base_financial_sql)
company_financials = qres.fetchall()
print(f"Number of companies in financial list: {len(company_financials)}\nFirst 40 are\n")
display(company_financials[0:40])
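
The `company_enterprise_value` column in `base_financial_sql` is derived rather than read from the source: market capitalization plus debt minus cash. A minimal sketch of that arithmetic, with `enterprise_value` as a hypothetical helper and made-up inputs:

```python
def enterprise_value(market_cap_usd, debt_usd, cash_usd):
    """Market cap + debt - cash; a missing input yields None, like NULL arithmetic in SQL."""
    if market_cap_usd is None or debt_usd is None or cash_usd is None:
        return None
    return market_cap_usd + debt_usd - cash_usd

# Illustrative figures, not real company data.
print(enterprise_value(10_000, 4_000, 1_000))
print(enterprise_value(10_000, None, 1_000))
```

Because a single missing input makes the whole value missing, companies without reported debt or cash come out of the query with a null enterprise value.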

### Emissions/Production info
* Company Name, LEI, ISIN (join axis with financial info)
* Sector (inferred from RMI data as a source rather than ISIC)
* Production (in whatever units -- we need units in either metadata or a column or as part of the data element itself)
* S1+S2, S3 emissions (in whatever units of CO2e)
* S1+S2, S3 emissions intensity (emissions / production, in whatever units this resolves to)

We currently focus exclusively on data from 2019 as our base year.

In [None]:
# 'sector', 'production', 's1s2_co2', 's3_co2', 's1s2_ei', 's3_ei'

base_emissions_sql = f"""
select DISTINCT P.company_name, P.company_lei, P.company_id, 2019 as year,
       'Electricity Utilities' as sector, sum(E.generation) as production, sum(E.emissions_co2 + (265/1000000.0)*coalesce(E.emissions_nox, 0)) as s1s2_co2, NULL as s3_co2,
       sum(E.emissions_co2 + (265/1000000.0)*coalesce(E.emissions_nox, 0)) / sum(E.generation) as s1s2_ei, NULL as s3_ei
from {ingest_schema}.portfolio as P
     join rmi_20211120.utility_information as U on U.parent_lei=P.company_lei
     join rmi_20211120.operations_emissions_by_fuel as E on U.respondent_id=E.respondent_id and year(E.year)=2019
-- where E.owned_or_total='owned'
group by P.company_name, P.company_lei, P.company_id, 3, 4
order by P.company_name
limit 200
"""

qres = engine.execute(base_emissions_sql)
company_emissions = qres.fetchall()
print(f"Number of companies in emissions list: {len(company_emissions)}\nFirst 40 are\n")
display(company_emissions[0:40])
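
The query above computes company-level intensity as total emissions divided by total generation, which is not the same as averaging per-plant intensities. A toy pandas sketch of the difference, using made-up plant data and omitting the scaled NOx term the query adds:

```python
import pandas as pd

# Illustrative plant-level rows; company A has one large clean plant and one small dirty one.
plants = pd.DataFrame({
    "company_name": ["A", "A", "B"],
    "generation": [100.0, 300.0, 50.0],
    "emissions_co2": [80.0, 60.0, 25.0],
})

# Ratio of sums, mirroring sum(emissions) / sum(generation) in the SQL.
by_company = plants.groupby("company_name")[["generation", "emissions_co2"]].sum()
by_company["s1s2_ei"] = by_company["emissions_co2"] / by_company["generation"]

# Mean of per-plant ratios, shown only for contrast.
plants["plant_ei"] = plants["emissions_co2"] / plants["generation"]
mean_of_ratios = plants.groupby("company_name")["plant_ei"].mean()

print(by_company)
print(mean_of_ratios)
```

For company A the ratio of sums is 140/400 while the mean of plant ratios is noticeably higher, because the small plant's intensity is no longer weighted by its output.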

### `financial_df` contains all the base year (2019) financial data

In [None]:
financial_df = pd.read_sql(base_financial_sql, engine).convert_dtypes()

print(f"Total number of financial reports = {len(financial_df)}")

### `rmi_emissions_df` contains the RMI base year (2019) production and emissions data

In [None]:
rmi_emissions_df = pd.read_sql(base_emissions_sql, engine)
print(f"Total number of rmi emissions reports = {len(rmi_emissions_df)}")
rmi_emissions_df

### Collect emissions/production info from the MDT Steel data
* Company Name, LEI, ISIN (join axis with financial info)
* Sector (inferred as Steel from source)
* Production (in whatever units -- we need units in either metadata or a column or as part of the data element itself)
* S1+S2, S3 emissions (in whatever units of CO2e)
* S1+S2, S3 emissions intensity (emissions / production, in whatever units this resolves to)

In [None]:
steel_wb = pd.read_excel(os.environ.get('PWD')+f"/itr-data-pipeline/data/external/mdt-steel-demo.xlsx", sheet_name=None)
steel_co2 = steel_wb['Steel CO2e'].dropna(axis=1,how='all')
steel_ei = steel_wb['Steel EI_per_Fe_Ton'].dropna(axis=1,how='all')
steel_production = steel_wb['Steel Fe_tons'].dropna(axis=1,how='all')

steel_emissions_df = pd.concat([steel_production[['company_name', 'company_lei', 'company_id']],
                                pd.Series(steel_production[2019], name='production'),
                                pd.Series(steel_co2[2019], name='s1s2_co2'),
                                pd.Series(steel_ei[2019], name='s1s2_ei')], axis=1).dropna()
steel_emissions_df.insert(6, 's3_ei', [np.nan] * len(steel_emissions_df))
steel_emissions_df.insert(5, 's3_co2', [np.nan] * len(steel_emissions_df))
steel_emissions_df.insert(3, 'sector', ['Steel'] * len(steel_emissions_df))
steel_emissions_df.insert(3, 'year', [2019] * len(steel_emissions_df))

steel_emissions_df

In [None]:
emissions_df = pd.concat([rmi_emissions_df, steel_emissions_df])

### Load emissions target data

In [None]:
engine.execute("describe rmi_20211120.emissions_targets").fetchall()

### `targets_df` has all the historical and target emissions data (which can be interpreted to provide trajectory data as well)

We also preserve RMI's 1.5 degree target info, which can be presented as a trajectory to compare/contrast corporate targets with RMI's best policy recommendations
* rtg_df is the RMI contribution to targets_df
* mtg_df is the Steel contribution to targets_df

In [None]:
rtg_df = pd.read_sql(f"""
select ET.parent_name as company_name, 'Electricity Utilities' as sector, year(year) as year,
       co2_intensity_historical, co2_intensity_target_all_years, co2_intensity_1point5C,
       co2_historical, co2_target_all_years, co2_1point5C,
       generation_historical as production_historical, generation_projected as production_projected, generation_1point5C as production_1point5C
from rmi_20211120.emissions_targets ET
     -- join (select parent_name, parent_lei from rmi_20211120.utility_information group by parent_name, parent_lei) U
     --      on ET.parent_name=U.parent_name
     -- join sec_dera.financials_by_lei as F on F.lei=U.parent_lei
""", engine) # parse_dates=['year']

rtg_df.insert(1, 'company_lei', rtg_df.company_name.str.upper().map(gleif_match))
rtg_df.insert(2, 'company_id', rtg_df.company_lei.map(rmi_dict))

print(f"len(rtg_df) = {len(rtg_df)}")

In [None]:
rtg_df.loc[rtg_df.year==2019]

Fix some inconsistencies in the data, such as retrospective target information being null where historical data is available, or a retrospective target of zero where emissions grow to the present date only to shrink again.

In [None]:
rtg_df.loc[rtg_df.year<2020, 'co2_intensity_target_all_years'] = rtg_df.loc[rtg_df.year<2020, ['co2_intensity_historical', 'co2_intensity_target_all_years', 'co2_intensity_1point5C']].max(skipna=True, axis=1)
rtg_df.loc[rtg_df.year<2020, 'co2_target_all_years'] = rtg_df.loc[rtg_df.year<2020, ['co2_historical', 'co2_target_all_years', 'co2_1point5C']].max(skipna=True, axis=1)
rtg_df.loc[rtg_df.year<2020, 'production_projected'] = rtg_df.loc[rtg_df.year<2020, ['production_historical', 'production_projected']].max(skipna=True, axis=1)
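
The three assignments above apply the same rule: for years before 2020, replace the target with the row-wise maximum of the available series, ignoring nulls. A toy sketch of that rule on made-up emissions values:

```python
import numpy as np
import pandas as pd

# Illustrative rows: a null retrospective target, a zero retrospective target, and a future year.
toy = pd.DataFrame({
    "year": [2018, 2019, 2025],
    "co2_historical": [10.0, 9.0, np.nan],
    "co2_target_all_years": [np.nan, 0.0, 6.0],
    "co2_1point5C": [8.0, 8.5, 5.0],
})

past = toy.year < 2020
cols = ["co2_historical", "co2_target_all_years", "co2_1point5C"]

# Row-wise maximum across the three series; NaN entries are skipped.
toy.loc[past, "co2_target_all_years"] = toy.loc[past, cols].max(skipna=True, axis=1)
print(toy)
```

The null and zero targets in the past rows are replaced by the historical value, and the 2025 row is left untouched.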

In [None]:
def compute_sums_and_wavg(x):
    d = { 'co2_target_by_year':x['co2_target_all_years'].sum(),
          'production_by_year':x['production_projected'].sum() }
    if d['production_by_year']:
        d['co2_intensity_target_by_year'] = (x['production_projected'] * x['co2_intensity_target_all_years']).sum() / d['production_by_year']
    else:
        d['co2_intensity_target_by_year'] = np.nan
    return pd.Series(d, index=['co2_intensity_target_by_year', 'co2_target_by_year', 'production_by_year'])

targets_df = (rtg_df[rtg_df.year>=2014]
      .ffill().groupby(['company_name', 'company_lei', 'company_id', 'sector', 'year'])
      .apply(compute_sums_and_wavg)
      .sort_values(['company_name', 'year'], ascending=[True, False])
     ).reset_index()
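
`compute_sums_and_wavg` combines several rows for the same company and year into a production-weighted average intensity. The sketch below reproduces that arithmetic on two made-up rows without `groupby.apply`, as an illustration of the formula rather than a drop-in replacement:

```python
import pandas as pd

# Two illustrative rows for the same company and year.
rows = pd.DataFrame({
    "company_name": ["A", "A"],
    "year": [2030, 2030],
    "production_projected": [100.0, 300.0],
    "co2_intensity_target_all_years": [0.8, 0.4],
})

# Weight each intensity by its production, then divide by total production.
rows["weighted"] = rows.production_projected * rows.co2_intensity_target_all_years
grouped = rows.groupby(["company_name", "year"])[["production_projected", "weighted"]].sum()
grouped["co2_intensity_target_by_year"] = grouped.weighted / grouped.production_projected
print(grouped)
```

Here the weighted result is (100 × 0.8 + 300 × 0.4) / 400, which is lower than the simple mean of 0.6 because the larger unit has the lower intensity. The function in the cell above additionally returns NaN when total production is zero.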

In [None]:
mdt_co2 = steel_co2.melt(id_vars=['company_name','company_lei','company_id'], var_name='year', value_name='co2_target_by_year').dropna()
mdt_co2.loc[:, 'co2_target_by_year'] = mdt_co2.co2_target_by_year / 1000000
mdt_ei = steel_ei.melt(id_vars=['company_name','company_lei','company_id'], var_name='year', value_name='co2_intensity_target_by_year').dropna()
mdt_production = steel_production.melt(id_vars=['company_name','company_lei','company_id'], var_name='year', value_name='production_by_year').dropna()
mdt_production.loc[:, 'production_by_year'] = mdt_production.production_by_year / 1000000
mdt_trajectory = mdt_ei.rename(columns={'co2_intensity_target_by_year':'co2_intensity_trajectory_by_year'})

steel_targets_df = (mdt_ei.merge(mdt_co2, on=['company_name','company_lei','company_id','year'])
                    .merge(mdt_production, on=['company_name','company_lei','company_id','year']))
steel_targets_df.insert(2, 'sector', 'Steel')
targets_df = pd.concat([targets_df, steel_targets_df])
targets_df

In [None]:
traj_df = targets_df.pivot(index=['company_name', 'company_lei', 'company_id', 'sector'], columns='year', values='co2_intensity_target_by_year').reset_index()
# We handicap historic progress by averaging with "no progress"
historic_progress = (1.0 + traj_df[2019] / traj_df[2014]) / 2
annualized_progress = historic_progress.where(historic_progress>=0, 0).where(historic_progress<=1, 1).dropna() ** (1/(2019-2014))

traj_df.loc[:, 2021:2049]=np.nan
traj_df[2050] = traj_df[2019] * annualized_progress ** (2050-2020)
traj_df.loc[:, 2020:2050] = traj_df.loc[:, 2020:2050].astype('float64').interpolate(axis=1)
traj_mdf = traj_df.melt(id_vars=['company_name','company_lei','company_id','sector'], var_name='year', value_name='co2_intensity_trajectory_by_year').convert_dtypes()
traj_mdf
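
The trajectory cell above extrapolates each company's 2014 to 2019 intensity change out to 2050. A scalar sketch of the same steps, with `extrapolate_intensity` as a hypothetical helper and made-up intensities; it assumes non-null inputs and leaves out the linear interpolation between 2020 and 2050:

```python
def extrapolate_intensity(ei_2014, ei_2019):
    """Return (annualized_progress, ei_2050) following the notebook's formula."""
    ratio = ei_2019 / ei_2014
    historic_progress = (1.0 + ratio) / 2        # average with "no progress" (ratio of 1)
    clipped = min(max(historic_progress, 0.0), 1.0)
    annualized = clipped ** (1 / (2019 - 2014))  # per-year factor over the five-year window
    ei_2050 = ei_2019 * annualized ** (2050 - 2020)
    return annualized, ei_2050

# A company that cut intensity by 20% over five years...
print(extrapolate_intensity(1.0, 0.8))
# ...and one whose intensity rose, which is treated as flat going forward.
print(extrapolate_intensity(1.0, 1.2))
```

Clipping the progress factor to at most 1 means a company with rising intensity is projected to hold at its 2019 level rather than keep rising.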

In [None]:
targets_df = targets_df.merge(traj_mdf, on=['company_name','company_lei','company_id','sector','year']).convert_dtypes()
targets_df

In [None]:
tablenames = 'company_data', 'intensity_data', 'trajectory_data', 'emissions_data', 'production_data'

In [None]:
dataframes = [financial_df[financial_df.company_id.isin(targets_df.company_id)],
              targets_df[['company_name','company_lei','company_id','year','co2_intensity_target_by_year']],
              targets_df[['company_name','company_lei','company_id','year','co2_intensity_trajectory_by_year']],
              targets_df[['company_name','company_lei','company_id','year','co2_target_by_year']],
              targets_df[['company_name','company_lei','company_id','year','production_by_year']],]

for table, df in zip(tablenames, dataframes):
    osc.drop_unmanaged_table(ingest_catalog, demo_schema, table, engine, trino_bucket)
    osc.ingest_unmanaged_parquet(df, demo_schema, table, trino_bucket)
    sql = osc.unmanaged_parquet_tabledef(df, ingest_catalog, demo_schema, table, trino_bucket)
    qres = engine.execute(sql)
    for row in qres.fetchall():
        print(row)

In [None]:
financial_df