# Demo 3 - Data Ingestion

This notebook reads the inference results produced in Demo 2 from Ceph S3 storage and ingests them into Trino as a table. The table can then be used to create visualizations in Apache Superset.

In [1]:
import os
import pathlib
from dotenv import load_dotenv
import pandas as pd
import glob
import config
from src.data.s3_communication import S3Communication
import osc_ingest_trino as osc
%xmode Minimal

## Injecting Credentials

In order to run this notebook, we need credentials to connect with S3 storage to retrieve data and the Trino server to create tables.

In an automated environment, the credentials can be specified in a pipeline's environment variables or through OpenShift secrets.

For running the notebook in a local environment, we define them as environment variables in a `credentials.env` file at the root of the project repository and load them using dotenv. An example of what the contents of `credentials.env` could look like is shown below.

```
# s3 credentials
S3_LANDING_ENDPOINT=https://s3.us-east-1.amazonaws.com
S3_LANDING_BUCKET=ocp-odh-os-demo-s3
S3_LANDING_ACCESS_KEY=xxx
S3_LANDING_SECRET_KEY=xxx

# trino credentials
TRINO_USER=xxx
TRINO_PASSWD=xxx
TRINO_HOST=trino-secure-odh-trino.apps.odh-cl1.apps.os-climate.org
TRINO_PORT=443
```

In [2]:
# Load credentials
dotenv_dir = os.environ.get(
    "CREDENTIAL_DOTENV_DIR", os.environ.get("PWD", "/opt/app-root/src")
)
dotenv_path = pathlib.Path(dotenv_dir) / "credentials.env"
if os.path.exists(dotenv_path):
    load_dotenv(dotenv_path=dotenv_path, override=True)
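
Because `load_dotenv` is skipped silently when `credentials.env` does not exist, a missing credential only surfaces later as a connection error. The following is a minimal sketch of an early check. The helper `missing_env_vars` and the list `REQUIRED_VARS` are hypothetical names introduced here; the S3 variable names are the ones read by the S3 connector cell in this notebook, and the Trino names are taken from the `credentials.env` example, on the assumption that the Trino helper reads them.

```python
import os

# Variable names this notebook is assumed to need (see the note above).
REQUIRED_VARS = [
    "S3_LANDING_ENDPOINT",
    "S3_LANDING_ACCESS_KEY",
    "S3_LANDING_SECRET_KEY",
    "S3_LANDING_BUCKET",
    "TRINO_USER",
    "TRINO_PASSWD",
    "TRINO_HOST",
    "TRINO_PORT",
]


def missing_env_vars(names, env=None):
    """Return the names that are unset or empty in the given environment mapping."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


# Report missing credentials up front instead of failing on first use.
missing = missing_env_vars(REQUIRED_VARS)
if missing:
    print("Missing credentials:", ", ".join(missing))
```

Passing a plain dictionary as `env` makes the helper easy to exercise without touching the real environment.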

## Read Raw Data from S3

First, we will read some sample data from S3. We will format the column data types to ensure they can be understood by Trino, and rename the columns where needed so that they are compatible with SQL naming conventions.
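
As an illustration of the renaming step, the sketch below reduces arbitrary column labels to lower-case identifiers made of letters, digits and underscores. The helper `sql_safe_name` and the sample labels are hypothetical and are not part of this notebook or of `osc_ingest_trino`; the exact rules a given SQL engine accepts may differ.

```python
import re


def sql_safe_name(name):
    """Lower-case a column label and reduce it to letters, digits and underscores."""
    # Collapse every run of other characters into a single underscore.
    cleaned = re.sub(r"[^0-9a-zA-Z]+", "_", str(name).strip()).strip("_").lower()
    # An identifier that is empty or starts with a digit gets a prefix.
    if not cleaned or cleaned[0].isdigit():
        cleaned = "col_" + cleaned
    return cleaned


# Sample labels chosen for illustration only.
labels = ["Unnamed: 0", "pdf_name", "No Answer Score (+boost)", "2019 total"]
print([sql_safe_name(label) for label in labels])
```

With pandas, the same helper could be applied as `df.columns = [sql_safe_name(c) for c in df.columns]`. The column names in this dataset already consist of lower-case letters and underscores, so they would pass through unchanged.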

In [3]:
# init s3 connector
s3c = S3Communication(
    s3_endpoint_url=os.getenv("S3_LANDING_ENDPOINT"),
    aws_access_key_id=os.getenv("S3_LANDING_ACCESS_KEY"),
    aws_secret_access_key=os.getenv("S3_LANDING_SECRET_KEY"),
    s3_bucket=os.getenv("S3_LANDING_BUCKET"),
)

In [4]:
if os.getenv("AUTOMATION"):
    if not os.path.exists(config.BASE_INFER_KPI_FOLDER):
        pathlib.Path(config.BASE_INFER_KPI_FOLDER).mkdir(parents=True, exist_ok=True)

    # Download the inference output files under the configured prefix from S3
    s3c.download_files_in_prefix_to_dir(
        s3_prefix=config.BASE_INFER_KPI_S3_PREFIX,
        destination_dir=config.BASE_INFER_KPI_FOLDER
    )

In [5]:
all_files = glob.glob(str(config.BASE_INFER_KPI_FOLDER / "*.csv"))
list_of_files = []

for filename in all_files:
    # Read one CSV, use nullable dtypes, and drop the saved index column
    df = (
        pd.read_csv(filename, index_col=None, header=0)
        .convert_dtypes()
        .drop(columns=["Unnamed: 0"])
    )
    list_of_files.append(df)

preds_kpi = pd.concat(list_of_files, axis=0, ignore_index=True)

len_preds_kpi = len(preds_kpi)

# convert columns to specific data types
preds_kpi = preds_kpi.convert_dtypes().drop(['index'], axis=1, errors='ignore')
preds_kpi.head()

| | pdf_name | kpi | kpi_id | answer | page | paragraph | source | score | no_ans_score | no_answer_score_plus_boost |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | sustainability-report-2019 | In which year was the annual report or the sus... | | 2019 | 3.0 | This report focuses on the sustainability topi... | Text | 12.819067 | -11.38402 | -26.38402 |
| 1 | sustainability-report-2019 | In which year was the annual report or the sus... | | 2018 | 7.0 | According to IPCC’s 1.5 C report from 2018 and... | Text | 12.508747 | -6.967487 | -21.967487 |
| 2 | sustainability-report-2019 | In which year was the annual report or the sus... | | 2019 | 26.0 | Equinor Sustainability report 2019 High value ... | Text | 12.427499 | -9.680321 | -24.680321 |
| 3 | sustainability-report-2019 | In which year was the annual report or the sus... | | 2019 | 8.0 | Equinor Sustainability report 2019Low carbon —... | Text | 12.356201 | -8.748014 | -23.748014 |
| 4 | sustainability-report-2019 | What is the annual total production from coal? | | no_answer | | | Text | 2.840456 | | |


In [7]:
# a way to examine the structure of a pandas data frame
preds_kpi.info(verbose=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 96 entries, 0 to 95
Data columns (total 10 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   pdf_name                    96 non-null     string 
 1   kpi                         96 non-null     string 
 2   kpi_id                      0 non-null      Int64  
 3   answer                      96 non-null     string 
 4   page                        79 non-null     Int64  
 5   paragraph                   79 non-null     string 
 6   source                      96 non-null     string 
 7   score                       96 non-null     Float64
 8   no_ans_score                79 non-null     Float64
 9   no_answer_score_plus_boost  79 non-null     Float64
dtypes: Float64(3), Int64(2), string(5)
memory usage: 8.1 KB


## Connect with Trino

In [8]:
ingest_catalog = 'osc_datacommons_dev'
ingest_schema = 'aicoe_osc_demo_results'
ingest_table = 'infer_kpi'

osc.load_credentials_dotenv()
engine = osc.attach_trino_engine(verbose=True, catalog=ingest_catalog)

using connect string: trino://Shreyanand@trino-secure-odh-trino.apps.odh-cl2.apps.os-climate.org:443/osc_datacommons_dev


## Create a Table on Trino

Next, we create a table in Trino to hold the inference results. The column definitions are derived from the `preds_kpi` data frame with `osc.create_table_schema_pairs`, and the table is stored in ORC format.

In [49]:
columnschema = osc.create_table_schema_pairs(preds_kpi)

tabledef = f"""
create table if not exists {ingest_catalog}.{ingest_schema}.{ingest_table}(
{columnschema})
with (
    format = 'ORC'
)
"""
print(tabledef)
qres = engine.execute(tabledef)
print(qres.fetchall())


create table if not exists osc_datacommons_dev.aicoe_osc_demo_results.infer_kpi(
    pdf_name varchar,
    kpi varchar,
    kpi_id bigint,
    answer varchar,
    page bigint,
    paragraph varchar,
    source varchar,
    score double,
    no_ans_score double,
    no_answer_score_plus_boost double)
with (
    format = 'ORC'
)

[(True,)]
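
The printed statement shows how each pandas dtype ended up as a Trino column type: `string` became `varchar`, `Int64` became `bigint`, and `Float64` became `double`. The sketch below reproduces that mapping for these three dtypes only. It is not the implementation of `osc.create_table_schema_pairs`; the names `PANDAS_TO_TRINO` and `schema_pairs` are hypothetical and introduced here for illustration.

```python
# Hypothetical mapping from pandas extension dtype names to Trino column types,
# limited to the three dtypes that appear in preds_kpi.
PANDAS_TO_TRINO = {
    "string": "varchar",
    "Int64": "bigint",
    "Float64": "double",
}


def schema_pairs(dtypes, indent="    "):
    """Render a {column: dtype} mapping as comma-separated 'name type' lines."""
    lines = []
    for column, dtype in dtypes.items():
        trino_type = PANDAS_TO_TRINO.get(str(dtype))
        if trino_type is None:
            # Fail loudly rather than guess a type for an unmapped dtype.
            raise ValueError(f"no Trino type known for dtype {dtype!r} ({column})")
        lines.append(f"{indent}{column} {trino_type}")
    return ",\n".join(lines)


# A subset of the preds_kpi columns, written out as dtype names.
example_dtypes = {"pdf_name": "string", "page": "Int64", "score": "Float64"}
print(schema_pairs(example_dtypes))
```

Raising on an unknown dtype is a deliberate choice in this sketch: a silently defaulted column type is harder to notice than a failed table creation.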


In [9]:
sql=f"""
select * from {ingest_catalog}.{ingest_schema}.{ingest_table}
"""
pd.read_sql(sql, engine)

| | pdf_name | kpi | kpi_id | answer | page | paragraph | source | score | no_ans_score | no_answer_score_plus_boost |
|---|---|---|---|---|---|---|---|---|---|---|


### Insert the data into the table

In [None]:
preds_kpi.to_sql(ingest_table,
                 con=engine,
                 schema=ingest_schema,
                 if_exists='append',
                 index=False,
                 method=osc.TrinoBatchInsert(batch_size=5, verbose=True))
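
The `batch_size=5` argument suggests that rows are sent to Trino in groups rather than one at a time. The sketch below only illustrates the general idea of splitting rows into fixed-size batches; it is not the implementation of `osc.TrinoBatchInsert`, and the helper `batched` and the placeholder rows are hypothetical.

```python
def batched(rows, batch_size):
    """Yield consecutive slices of rows, each holding at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]


# 96 placeholder rows stand in for the 96 records of preds_kpi.
rows = [(i, f"row-{i}") for i in range(96)]
batches = list(batched(rows, batch_size=5))
print(len(batches), len(batches[-1]))  # prints "20 1"
```

With 96 rows and a batch size of 5, this gives 19 full batches and a final batch with a single row. A larger batch size means fewer round trips but longer statements.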

In [11]:
sql=f"""
select * from {ingest_catalog}.{ingest_schema}.{ingest_table}
"""
pd.read_sql(sql, engine)

| | pdf_name | kpi | kpi_id | answer | page | paragraph | source | score | no_ans_score | no_answer_score_plus_boost |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | sustainability-report-2019 | What is the total installed capacity from lign... | | 4-6GW | 5.0 | Equinor renewable equity generation capacity e... | Text | -10.342312 | 17.458086 | 2.458086 |
| 1 | sustainability-report-2019 | What is the total installed capacity from lign... | | 4 and 6GW | 11.0 | By 2026 Equinor expects to increase our share ... | Text | -12.341133 | 17.691341 | 2.691341 |
| 2 | sustainability-report-2019 | What is the total installed capacity from lign... | | . | 1.0 | more than 350,000 barrels per day. It is powered | Text | -14.051498 | 17.214603 | 2.214603 |
| 3 | sustainability-report-2019 | What is the total volume of crude oil liquid p... | | no_answer | | | Text | 3.003765 | | |
| 4 | sustainability-report-2019 | What is the total volume of crude oil liquid p... | | 660,000 barrels of oil per day | 25.0 | Norway This year Equinor and the Johan Sverdru... | Text | -4.086257 | 18.003765 | 3.003765 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 91 | sustainability-report-2019 | What is the total volume of natural gas produc... | | no_answer | | | Text | 2.754639 | | |
| 92 | sustainability-report-2019 | What is the total volume of natural gas produc... | | 18kg CO₂/boe. | 10.0 | Equinor aims to reduce the CO₂ intensity of it... | Text | -4.949422 | 17.604679 | 2.604679 |
| 93 | sustainability-report-2019 | What is the total volume of natural gas produc... | | 2.5 tonnes/1000 tonnes | 15.0 | Our 2019 flaring intensity (upstream, operated... | Text | -8.543850 | 17.754639 | 2.754639 |
| 94 | sustainability-report-2019 | What is the total volume of natural gas produc... | | 2050 | 10.0 | Further reduction ambitions towards 70% in 204... | Text | -11.088131 | 17.561235 | 2.561235 |


## Conclusion

In this notebook, we read the KPI inference results for the 2019 sustainability report, which follow the same format as the output of the KPI inference model in Demo 2. We then inferred the table schema from the data frame, created a table in Trino, and inserted the results so that they can be visualized in Apache Superset.