# Ingest data in Trino
This notebook shows how to ingest raw data into Trino so that it can be used in Superset for visualization. First, we read raw CSV files from an S3 bucket, convert them to Parquet files, and upload those back to S3. We then create Trino tables that read from the Parquet files. The notebook also shows how to join two Trino tables to create a new table.

In [1]:
import re
import pandas as pd
from dotenv import load_dotenv
import os
import pathlib
import boto3
import trino

### Example `credentials.env` file
This file is required to connect to Trino and S3.

```
# s3 credentials
S3_ENDPOINT=https://s3.us-east-1.amazonaws.com
S3_BUCKET=ocp-odh-os-demo-s3
S3_ACCESS_KEY=xxx
S3_SECRET_KEY=xxx

# trino credentials
TRINO_USER=xxx
TRINO_PASSWD=xxx
TRINO_HOST=trino-secure-odh-trino.apps.odh-cl1.apps.os-climate.org
TRINO_PORT=443
```

In [2]:
# Load credentials
dotenv_dir = os.environ.get(
    "CREDENTIAL_DOTENV_DIR", os.environ.get("PWD", "/opt/app-root/src")
)
dotenv_path = pathlib.Path(dotenv_dir) / "credentials.env"
if os.path.exists(dotenv_path):
    load_dotenv(dotenv_path=dotenv_path, override=True)

In [3]:
# Create an S3 client
s3 = boto3.client(
    service_name="s3",
    endpoint_url=os.environ["S3_ENDPOINT"],
    aws_access_key_id=os.environ["S3_ACCESS_KEY"],
    aws_secret_access_key=os.environ["S3_SECRET_KEY"],
)

In [4]:
obj = s3.get_object(
    Bucket=os.environ["S3_BUCKET"],
    Key="urgentem/UrgentemDataSampleEmissionsTargetsDec2020.csv",
)

# load the raw file from the bucket
df_emissions = pd.read_csv(obj["Body"])

# convert columns to nullable pandas dtypes (string, Float64, Int64)
df_emissions = df_emissions.convert_dtypes()

In [5]:
obj_2 = s3.get_object(
    Bucket=os.environ["S3_BUCKET"], Key="urgentem/UrgentemDataSampleDec2020.csv"
)

# load the raw file from the bucket
df_emissions_2 = pd.read_csv(obj_2["Body"])

# convert columns to nullable pandas dtypes (string, Float64, Int64)
df_emissions_2 = df_emissions_2.convert_dtypes()

In [6]:
## Methods to clean column names

_wsdedup = re.compile(r"\s+")
_usdedup = re.compile(r"__+")
_rmpunc = re.compile(r"[,.()&$/+-]+")
_p2smap = {"string": "varchar", "Float64": "double", "Int64": "bigint"}


def snakify(name, maxlen):
    w = name.casefold().strip()
    w = w.replace("-", "_")
    w = _rmpunc.sub("", w)
    w = _wsdedup.sub("_", w)
    w = _usdedup.sub("_", w)
    w = w.replace("average", "avg")
    w = w.replace("maximum", "max")
    w = w.replace("minimum", "min")
    w = w.replace("absolute", "abs")
    w = w.replace("source", "src")
    w = w.replace("distribution", "dist")
    # these are common in the sample names but unsure of standard abbv
    # w = w.replace("inference", "inf")
    # w = w.replace("emissions", "emis")
    # w = w.replace("intensity", "int")
    # w = w.replace("reported", "rep")
    # w = w.replace("revenue", "rev")
    w = w[:maxlen]
    return w


def snakify_columns(df, inplace=False, maxlen=63):
    icols = df.columns.to_list()
    ocols = [snakify(e, maxlen=maxlen) for e in icols]
    if len(set(ocols)) < len(ocols):
        raise ValueError("remapped column names were not unique!")
    rename_map = dict(zip(icols, ocols))
    return df.rename(columns=rename_map, inplace=inplace)


def pandas_type_to_sql(pt):
    st = _p2smap.get(pt)
    if st is not None:
        return st
    raise ValueError("unexpected pandas column type '{pt}'".format(pt=pt))


# add ability to specify optional dict for specific fields?
# if column name is present, use specified value?
def generate_table_schema_pairs(df):
    ptypes = [str(e) for e in df.dtypes.to_list()]
    stypes = [pandas_type_to_sql(e) for e in ptypes]
    pz = list(zip(df.columns.to_list(), stypes))
    return ",\n".join(["    {n} {t}".format(n=e[0], t=e[1]) for e in pz])

In [7]:
# map column names to a form that works for SQL
snakify_columns(df_emissions, inplace=True)

# do the same for the second table; maxlen is raised to 100 because
# truncating at the default 63 characters produced duplicate column names
snakify_columns(df_emissions_2, inplace=True, maxlen=100)
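
The need to raise `maxlen` comes from truncation: two long column names that share their first 63 characters collapse to the same snake-case name. The sketch below shows one way to list such collisions before renaming. It uses a condensed copy of `snakify` (the abbreviation replacements are left out), and `find_collisions` plus the sample names are hypothetical, not taken from the data.

```python
import re

_wsdedup = re.compile(r"\s+")
_usdedup = re.compile(r"__+")
_rmpunc = re.compile(r"[,.()&$/+-]+")


def snakify(name, maxlen=63):
    # Condensed copy of the notebook's snakify, without the abbreviation step.
    w = name.casefold().strip().replace("-", "_")
    w = _rmpunc.sub("", w)
    w = _wsdedup.sub("_", w)
    w = _usdedup.sub("_", w)
    return w[:maxlen]


def find_collisions(names, maxlen=63):
    """Map each snakified name to its originals, keeping only duplicates."""
    groups = {}
    for name in names:
        groups.setdefault(snakify(name, maxlen), []).append(name)
    return {k: v for k, v in groups.items() if len(v) > 1}


# Made-up names that differ only after the 80th character.
prefix = "Intensity Inference " * 4
names = [prefix + "Scope 1", prefix + "Scope 2"]

print(len(find_collisions(names, maxlen=63)))   # 1 colliding group
print(len(find_collisions(names, maxlen=100)))  # 0 colliding groups
```

Listing the offending originals makes it easier to pick a `maxlen`, or a better abbreviation, than the bare `ValueError` raised by `snakify_columns`.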

In [8]:
# a way to examine the structure of a pandas data frame
df_emissions.info(verbose=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19 entries, 0 to 18
Data columns (total 15 columns):
 #   Column                              Non-Null Count  Dtype  
---  ------                              --------------  -----  
 0   company_name                        19 non-null     string 
 1   isin                                19 non-null     string 
 2   target_type                         19 non-null     string 
 3   scope                               19 non-null     string 
 4   coverage_s1                         16 non-null     Float64
 5   coverage_s2                         15 non-null     Float64
 6   coverage_s3                         4 non-null      Int64  
 7   reduction_ambition                  19 non-null     Float64
 8   base_year                           19 non-null     Int64  
 9   end_year                            19 non-null     Int64  
 10  start_year                          19 non-null     Int64  
 11  base_year_ghg_emissions_s1_tco2e    1 non-null 

# Save parquet files for Trino tables

In [9]:
# parquet has multiple options for appending or updating data
# including adding new files, or appending, sharding directory trees, etc
df_emissions.to_parquet("/tmp/emissions_table1.parquet", index=False)
s3.upload_file(
    Bucket=os.environ["S3_BUCKET"],
    Key="urgentem/trino/itr_emissions_join_1/emissions.parquet",
    Filename="/tmp/emissions_table1.parquet",
)

In [10]:
# parquet has multiple options for appending or updating data
# including adding new files, or appending, sharding directory trees, etc
df_emissions_2.to_parquet("/tmp/emissions_table2.parquet", index=False)
s3.upload_file(
    Bucket=os.environ["S3_BUCKET"],
    Key="urgentem/trino/itr_emissions_join_2/emissions.parquet",
    Filename="/tmp/emissions_table2.parquet",
)

# Interaction with Trino

In [11]:
conn = trino.dbapi.connect(
    auth=trino.auth.BasicAuthentication(
        os.environ["TRINO_USER"], os.environ["TRINO_PASSWD"]
    ),
    host=os.environ["TRINO_HOST"],
    port=int(os.environ["TRINO_PORT"]),
    http_scheme="https",
    verify=True,
)
cur = conn.cursor()

## Create tables

In [12]:
# generate a sql schema that will correspond to the data types
# of columns in the pandas DF
# to-do: add some mechanisms for overriding types, either here
# or on the pandas data-frame itself before we write it out
schema = generate_table_schema_pairs(df_emissions)

tabledef = """create table if not exists default.urgentem.itr_emissions_1(
{schema}
) with (
    format = 'parquet',
    external_location = 's3a://ocp-odh-os-demo-s3/urgentem/trino/itr_emissions_join_1/'
)""".format(
    schema=schema
)
print(tabledef)

# tables created externally may not show up immediately in cloud-beaver
cur.execute(tabledef)
cur.fetchall()

create table if not exists default.urgentem.itr_emissions_1(
    company_name varchar,
    isin varchar,
    target_type varchar,
    scope varchar,
    coverage_s1 double,
    coverage_s2 double,
    coverage_s3 bigint,
    reduction_ambition double,
    base_year bigint,
    end_year bigint,
    start_year bigint,
    base_year_ghg_emissions_s1_tco2e varchar,
    base_year_ghg_emissions_s1s2_tco2e varchar,
    base_year_ghg_emissions_s3_tco2e varchar,
    achieved_reduction double
) with (
    format = 'parquet',
    external_location = 's3a://ocp-odh-os-demo-s3/urgentem/trino/itr_emissions_join_1/'
)


[[True]]
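
The to-do in the cell above asks for a way to override column types. One possible shape for that is sketched here: a variant of `generate_table_schema_pairs` that accepts an optional dict from column name to SQL type. The function name `schema_pairs_with_overrides` and its `(name, dtype)` input format are assumptions made so that the example runs without pandas. An override changes only the DDL text, so the Parquet column would still need to be written with a compatible type.

```python
# Same pandas-to-SQL mapping as in the notebook.
_p2smap = {"string": "varchar", "Float64": "double", "Int64": "bigint"}


def schema_pairs_with_overrides(columns, overrides=None):
    """Build the column list of a CREATE TABLE statement.

    columns:   iterable of (column name, pandas dtype name) pairs
    overrides: optional {column name: SQL type} that wins over the mapping
    """
    overrides = overrides or {}
    lines = []
    for name, ptype in columns:
        if name in overrides:
            stype = overrides[name]
        elif ptype in _p2smap:
            stype = _p2smap[ptype]
        else:
            raise ValueError(f"unexpected pandas column type '{ptype}'")
        lines.append(f"    {name} {stype}")
    return ",\n".join(lines)


# With a DataFrame, the pairs could be built as:
# columns = [(c, str(t)) for c, t in df_emissions.dtypes.items()]
columns = [("isin", "string"), ("coverage_s3", "Int64")]
print(schema_pairs_with_overrides(columns, {"coverage_s3": "double"}))
```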

In [13]:
# generate a sql schema that will correspond to the data types
# of columns in the pandas DF
# to-do: add some mechanisms for overriding types, either here
# or on the pandas data-frame itself before we write it out
schema = generate_table_schema_pairs(df_emissions_2)

tabledef = """create table if not exists default.urgentem.itr_emissions_2(
{schema}
) with (
    format = 'parquet',
    external_location = 's3a://ocp-odh-os-demo-s3/urgentem/trino/itr_emissions_join_2/'
)""".format(
    schema=schema
)
print(tabledef[:1000])

# tables created externally may not show up immediately in cloud-beaver
cur.execute(tabledef)
cur.fetchall()

create table if not exists default.urgentem.itr_emissions_2(
    company_short_name varchar,
    isin varchar,
    sedol varchar,
    bloomberg_ticker varchar,
    urgentem_disclosure_category bigint,
    number_of_scope_3_categories_disclosed bigint,
    country varchar,
    region varchar,
    intensity_avg_inference_scope_1_2_3_total_tco2em_revenue double,
    intensity_avg_inference_scope_1_2_total_tco2em_revenue double,
    intensity_avg_inference_scope_1_2_total_tco2em_revenue_src varchar,
    intensity_avg_inference_scope_3_total_tco2em_revenue double,
    intensity_avg_inference_scope_3_total_tco2em_revenue_src varchar,
    intensity_avg_inference_scope_1_tco2em_revenue double,
    intensity_avg_inference_scope_1_tco2em_revenue_src varchar,
    intensity_avg_inference_scope_2_location_based_tco2em_revenue double,
    intensity_avg_inference_scope_2_location_based_tco2em_revenue_src varchar,
    intensity_avg_inference_scope_2_market_based_tco2em_revenue double,
    intensity_av

[[True]]
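
In both table definitions, `external_location` points at the directory that holds the uploaded Parquet file, and the bucket name is written out by hand while the upload cells read it from `S3_BUCKET`. A small sketch of deriving the object key and the location from the same inputs follows. The helper names `parquet_key` and `external_location` are hypothetical; the `s3a` scheme and the path layout are copied from the cells above.

```python
import posixpath


def parquet_key(prefix, table_dir, filename="emissions.parquet"):
    """Object key of the Parquet file, e.g. <prefix>/<table_dir>/<filename>."""
    return f"{prefix.strip('/')}/{table_dir.strip('/')}/{filename}"


def external_location(bucket, key, scheme="s3a"):
    """Directory URL for the table, derived from the file's object key."""
    directory = posixpath.dirname(key)
    if not directory:
        raise ValueError("key must include a directory part")
    return f"{scheme}://{bucket}/{directory}/"


key = parquet_key("urgentem/trino", "itr_emissions_join_1")
print(key)
print(external_location("ocp-odh-os-demo-s3", key))
```

Passing `os.environ["S3_BUCKET"]` as `bucket` would keep the upload target and the table definition in step if the bucket changes.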

In [14]:
# Check that table 1 exists; show the second of the five returned rows
cur.execute("select * from default.urgentem.itr_emissions_1 LIMIT 5")
cur.fetchall()[1]

['ADIDAS AG',
 'DE000A1EWWW0',
 'Absolute',
 'S1+S2',
 0.9,
 0.9,
 None,
 0.15,
 2015,
 2020,
 2015,
 None,
 ' 59,132 ',
 ' 295,660 ',
 1.0]

In [15]:
# Check that table 2 exists; show the first 15 columns of the second row
cur.execute("select * from default.urgentem.itr_emissions_2 LIMIT 5")
cur.fetchall()[1][:15]

['ADIDAS AG',
 'DE000A1EWWW0',
 '4031976',
 'ADS GR',
 3,
 3,
 'Germany',
 'Europe',
 301.0,
 17.5,
 'Sum of Location and Scope One',
 283.5,
 'Sum of Average Category Intensities',
 1.8,
 'Inferred - Average - Industry winsor']

# Join the two tables

In [16]:
# Generate column names for df_emissions table
# removing isin column to avoid duplication
# of key column in the join operation
b_columns = list(df_emissions.columns)
b_columns.remove("isin")
b_columns = ["default.urgentem.itr_emissions_1." + i for i in b_columns]
b_columns = ", ".join(b_columns)

In [17]:
# Write the join_query
join_query = f"CREATE TABLE if not exists default.urgentem.itr_emissions_joined_2 AS\
              SELECT default.urgentem.itr_emissions_2.*, {b_columns} \
              FROM default.urgentem.itr_emissions_2 \
              LEFT JOIN default.urgentem.itr_emissions_1 \
              ON default.urgentem.itr_emissions_1.isin=default.urgentem.itr_emissions_2.isin"
join_query

'CREATE TABLE if not exists default.urgentem.itr_emissions_joined_2 AS              SELECT default.urgentem.itr_emissions_2.*, default.urgentem.itr_emissions_1.company_name, default.urgentem.itr_emissions_1.target_type, default.urgentem.itr_emissions_1.scope, default.urgentem.itr_emissions_1.coverage_s1, default.urgentem.itr_emissions_1.coverage_s2, default.urgentem.itr_emissions_1.coverage_s3, default.urgentem.itr_emissions_1.reduction_ambition, default.urgentem.itr_emissions_1.base_year, default.urgentem.itr_emissions_1.end_year, default.urgentem.itr_emissions_1.start_year, default.urgentem.itr_emissions_1.base_year_ghg_emissions_s1_tco2e, default.urgentem.itr_emissions_1.base_year_ghg_emissions_s1s2_tco2e, default.urgentem.itr_emissions_1.base_year_ghg_emissions_s3_tco2e, default.urgentem.itr_emissions_1.achieved_reduction               FROM default.urgentem.itr_emissions_2               LEFT JOIN default.urgentem.itr_emissions_1               ON default.urgentem.itr_emissions_1.isi

In [18]:
cur.execute(join_query)
cur.fetchall()

[[0]]
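
The join statement above repeats the fully qualified table name for every column. As an alternative sketch, the same statement can be assembled with short table aliases. The function `build_join_query` and the aliases `l` and `r` are illustrative choices; the table names are the ones used in this notebook.

```python
def build_join_query(target, left, right, right_columns, key="isin"):
    """CREATE TABLE ... AS with a LEFT JOIN, skipping the duplicate key column."""
    cols = ", ".join(f"r.{c}" for c in right_columns if c != key)
    return (
        f"CREATE TABLE IF NOT EXISTS {target} AS "
        f"SELECT l.*, {cols} "
        f"FROM {left} AS l "
        f"LEFT JOIN {right} AS r "
        f"ON r.{key} = l.{key}"
    )


query = build_join_query(
    target="default.urgentem.itr_emissions_joined_2",
    left="default.urgentem.itr_emissions_2",
    right="default.urgentem.itr_emissions_1",
    right_columns=["company_name", "isin", "target_type"],  # shortened list
)
print(query)
```

In the notebook, `right_columns` would be `list(df_emissions.columns)`. The names are interpolated directly into SQL, so this is only suitable for trusted, already snakified identifiers.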

In [19]:
cur.execute("select * from default.urgentem.itr_emissions_joined_2 LIMIT 5")
pd.DataFrame(cur.fetchall())

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,176,177,178,179,180,181,182,183,184,185
0,CHRISTIAN DIOR,FR0000130403,4061393,CDI FP,1,3,France,Europe,167.1,5.3,...,,,,,,,,,,
1,EQUINOR ASA,NO0010096985,7133608,EQNR NO,1,13,Norway,Europe,5361.8,262.1,...,,,0.21,2016.0,2030.0,2017.0,9329201.0,,,0.06
2,GLAXOSMITHKLINE,GB0009252882,925288,GSK LN,1,12,United Kingdom,Europe,485.4,33.9,...,,1.0,0.16,2017.0,2030.0,2017.0,,,7475825.0,0.0
3,GLAXOSMITHKLINE,GB0009252882,925288,GSK LN,1,12,United Kingdom,Europe,485.4,33.9,...,1.0,,1.0,2017.0,2025.0,2017.0,,1495165.0,7475825.0,0.88
4,IBERDROLA SA,ES0144580Y14,B288C92,IBE SM,1,5,Spain,Europe,1753.6,665.5,...,,,,,,,,,,
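
The DataFrame above has numeric column labels because `fetchall()` returns only row values. A minimal sketch for attaching column names follows. It assumes the cursor follows the Python DB-API 2.0 convention in which `cursor.description` holds one sequence per column with the column name first. The helper `fetch_records` is hypothetical.

```python
def fetch_records(cursor):
    """Fetch all rows from an executed cursor as a list of {column: value} dicts."""
    rows = cursor.fetchall()
    # DB-API 2.0: each description entry starts with the column name.
    names = [d[0] for d in cursor.description]
    return [dict(zip(names, row)) for row in rows]


# Possible usage with the notebook's cursor:
# cur.execute("select * from default.urgentem.itr_emissions_joined_2 LIMIT 5")
# df_joined = pd.DataFrame(fetch_records(cur))
```

A list of dicts is accepted by `pd.DataFrame`, which then uses the keys as column labels.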


## Conclusion
We saw how to take raw CSV files from S3 and convert them to Parquet format. We also used the Trino Python client to create tables from the Parquet files and from a join of existing tables. The tables can now be used in a Superset dashboard for visualization.