## Ingest Tutorial

Examples of standard techniques and idioms for ingesting data into the OS-Climate Data Commons.

## Python dependencies
The following Python packages are commonly used for data ingest.
If your Jupyter environment does not already have them,
you can copy and paste the lines below into a live code cell to install them.
```
%%capture pipoutput
# The 'capture' cell magic keeps long pip output from spamming your notebook.
# A cell magic must be the first line of its cell.

# For loading predefined environment variables from files
# Typically used to load sensitive access credentials
%pip install python-dotenv

# Standard python package for interacting with S3 buckets
%pip install boto3

# Interacting with Trino and using Trino with sqlalchemy
%pip install trino sqlalchemy sqlalchemy-trino

# Pandas and parquet file i/o
%pip install pandas pyarrow fastparquet

# OS-Climate utilities to make data ingest easier
%pip install osc-ingest-tools
```

## Environment variables and dot-env

The following cell looks for a "dot-env" file in some standard locations,
and loads its contents into `os.environ`.

In [2]:
import os
import pathlib
from dotenv import load_dotenv

# Load some standard environment variables from a dot-env file, if it exists.
# If no such file is found, this does not fail, which allows these environment
# variables to be populated in some other way.
dotenv_dir = os.environ.get('CREDENTIAL_DOTENV_DIR', os.environ.get('PWD', '/opt/app-root/src'))
dotenv_path = pathlib.Path(dotenv_dir) / 'credentials.env'
if dotenv_path.exists():
    load_dotenv(dotenv_path=dotenv_path, override=True)
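
Because a missing dot-env file is silently ignored, a variable that was never set only surfaces later as a `KeyError`. The following is a minimal sketch of an early check. The helper name `missing_env_vars` is hypothetical, and the list of names is simply the set of variables read by the S3 and Trino cells in this tutorial.

```python
import os

# Variable names read by the S3 and Trino cells in this tutorial.
REQUIRED_VARS = [
    "S3_DEV_ENDPOINT", "S3_DEV_ACCESS_KEY", "S3_DEV_SECRET_KEY", "S3_DEV_BUCKET",
    "TRINO_USER", "TRINO_HOST", "TRINO_PORT", "TRINO_PASSWD",
]

def missing_env_vars(names, env=None):
    """Return the names that are unset or empty in env (hypothetical helper)."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]

missing = missing_env_vars(REQUIRED_VARS)
if missing:
    print("Missing environment variables:", ", ".join(missing))
```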

## S3 and boto3

In [14]:
import boto3
s3 = boto3.resource(
    service_name="s3",
    endpoint_url=os.environ["S3_DEV_ENDPOINT"],
    aws_access_key_id=os.environ["S3_DEV_ACCESS_KEY"],
    aws_secret_access_key=os.environ["S3_DEV_SECRET_KEY"],
)
bucket = s3.Bucket(os.environ["S3_DEV_BUCKET"])

## Connecting to Trino with sqlalchemy

In [5]:
import trino
from sqlalchemy.engine import create_engine

sqlstring = 'trino://{user}@{host}:{port}/'.format(
    user=os.environ['TRINO_USER'],
    host=os.environ['TRINO_HOST'],
    port=os.environ['TRINO_PORT'],
)
sqlargs = {
    # TRINO_PASSWD is passed as the token for JWT authentication
    'auth': trino.auth.JWTAuthentication(os.environ['TRINO_PASSWD']),
    'http_scheme': 'https',
}
engine = create_engine(sqlstring, connect_args=sqlargs)
connection = engine.connect()
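
SQLAlchemy URLs require special characters in the user name to be percent-encoded, which matters if `TRINO_USER` contains a character such as `@`. The sketch below builds the same `trino://` URL with the user name encoded. The function name `trino_url` and the example host are hypothetical.

```python
from urllib.parse import quote

def trino_url(user, host, port):
    """Build a trino:// SQLAlchemy URL, percent-encoding the user name."""
    return f"trino://{quote(str(user), safe='')}@{host}:{port}/"

print(trino_url("me@example.com", "trino.example.org", 443))
# trino://me%40example.com@trino.example.org:443/
```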

## Tutorial example data

In [6]:
import pandas as pd
data = [['tom', 10], ['nick', 15], ['juli', 14]]
df = pd.DataFrame(data, columns=['First Name', 'Age In Years'])
df

   First Name  Age In Years
0         tom            10
1        nick            15
2        juli            14


## Staging example data to an S3 bucket

In [15]:
from io import BytesIO

tutorial_data_prefix = 'tutorial/ingest'
tutorial_data_filename = 'example01.parquet'

buf = BytesIO()
df.to_parquet(path=buf)
buf.seek(0)

bucket.upload_fileobj(Fileobj=buf,
                      Key=f'{tutorial_data_prefix}/{tutorial_data_filename}')

## Loading a raw parquet file from S3 into pandas

In [20]:
obj = bucket.Object(f'{tutorial_data_prefix}/{tutorial_data_filename}')
buf = BytesIO()
obj.download_fileobj(buf)
buf.seek(0)  # rewind so the reader starts at the beginning of the buffer
df = pd.read_parquet(buf)
df

   First Name  Age In Years
0         tom            10
1        nick            15
2        juli            14


## Pandas DataFrame column types
Here we see that pandas has left the `First Name` column as the generic `object` dtype rather than `string`.

In [21]:
df.info(verbose=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 2 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   First Name    3 non-null      object
 1   Age In Years  3 non-null      int64 
dtypes: int64(1), object(1)
memory usage: 176.0+ bytes


## Using `df.convert_dtypes()`

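You can tell pandas to assign more specific column types with the `convert_dtypes()` method.
In the output below, `First Name` becomes `string` and `Age In Years` becomes `Int64`, pandas' nullable integer type.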

In [22]:
df = df.convert_dtypes()
df.info(verbose=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 2 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   First Name    3 non-null      string
 1   Age In Years  3 non-null      Int64 
dtypes: Int64(1), string(1)
memory usage: 179.0 bytes


## `osc-ingest-tools`

## Enforcing valid SQL column names

In [23]:
import osc_ingest_trino as osc
osc.enforce_sql_column_names(df, inplace=True)
df

   first_name  age_in_years
0         tom            10
1        nick            15
2        juli            14
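
To make the renaming above concrete, here is a rough sketch of the kind of normalization involved: lowercase the name and replace runs of other characters with underscores. This is an approximation for illustration, not the actual `osc-ingest-tools` implementation, and `to_sql_column_name` is a hypothetical name.

```python
import re

def to_sql_column_name(name):
    """Approximate a SQL-safe column name: lowercase, non-alphanumerics to '_'."""
    cleaned = re.sub(r"[^0-9a-z]+", "_", name.strip().lower()).strip("_")
    # Identifiers should not start with a digit; prefix an underscore if so.
    if cleaned and cleaned[0].isdigit():
        cleaned = "_" + cleaned
    return cleaned

print([to_sql_column_name(c) for c in ["First Name", "Age In Years"]])
# ['first_name', 'age_in_years']
```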


## Trino catalog, schema, table

In [24]:
ingest_catalog = 'osc_datacommons_dev'
ingest_schema = 'demo'
ingest_table = 'ingest_tutorial_01'

## Staging parquet files for Trino tables

In [25]:
buf = BytesIO()
df.to_parquet(path=buf)
buf.seek(0)

bucket.upload_fileobj(Fileobj=buf,
                      Key=f'trino/{ingest_schema}/{ingest_table}/data.parquet')
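
The object key used here has to sit under the directory that the table definition later names as its `external_location`, otherwise Trino will not see the file. One way to keep the two in sync is to derive both from the same inputs, as in this sketch. Both function names are hypothetical, and the `trino/<schema>/<table>/` layout is taken from the cells in this tutorial.

```python
def trino_data_key(schema, table, filename="data.parquet"):
    """S3 object key for a parquet file backing the given table."""
    return f"trino/{schema}/{table}/{filename}"

def trino_external_location(bucket_name, schema, table):
    """Directory URL to use as the table's external_location."""
    return f"s3a://{bucket_name}/trino/{schema}/{table}/"

print(trino_data_key("demo", "ingest_tutorial_01"))
# trino/demo/ingest_tutorial_01/data.parquet
```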

## Declaring a Trino table on top of a raw parquet file

In [30]:
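from sqlalchemy import text

# Generate "name type" pairs for each DataFrame column
columnschema = osc.create_table_schema_pairs(df)

tabledef = f"""
create table if not exists {ingest_catalog}.{ingest_schema}.{ingest_table}(
{columnschema}
) with (
    format = 'parquet',
    external_location = 's3a://{bucket.name}/trino/{ingest_schema}/{ingest_table}/'
)
"""
print(tabledef)
# Engine.execute() was removed in SQLAlchemy 2.0; use the open connection instead
qres = connection.execute(text(tabledef))
print(qres.fetchall())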


create table if not exists osc_datacommons_dev.demo.ingest_tutorial_01(
    first_name varchar,
    age_in_years bigint
) with (
    format = 'parquet',
    external_location = 's3a://ocp-odh-os-demo-s3/trino/demo/ingest_tutorial_01/'
)

[(True,)]
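
The column list in the statement above pairs each column name with a Trino type. As an illustration of the idea, the sketch below maps pandas dtype names (as reported by `df.info()` after `convert_dtypes()`) to Trino types. The mapping and the name `schema_pairs` are assumptions made for this example; the real mapping in `osc-ingest-tools` may differ.

```python
# Illustrative mapping only; the real osc-ingest-tools mapping may differ.
PANDAS_TO_TRINO = {
    "string": "varchar",
    "Int64": "bigint",
    "Float64": "double",
    "boolean": "boolean",
}

def schema_pairs(dtypes, indent="    "):
    """Format {column: pandas dtype name} as 'name type' lines for CREATE TABLE."""
    lines = [f"{indent}{col} {PANDAS_TO_TRINO[dtype]}" for col, dtype in dtypes.items()]
    return ",\n".join(lines)

print(schema_pairs({"first_name": "string", "age_in_years": "Int64"}))
```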


## SQL queries on our new table

In [31]:
sql = f"""
select * from {ingest_catalog}.{ingest_schema}.{ingest_table}
"""
pd.read_sql(sql, engine).convert_dtypes()

   first_name  age_in_years
0         tom            10
1        nick            15
2        juli            14
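
The query above interpolates the catalog, schema, and table names directly into the SQL text. If any of those names ever come from outside the notebook, it may be worth validating them first. A minimal sketch, with the hypothetical helper name `qualified_table_name` and a deliberately strict rule of letters, digits, and underscores only:

```python
import re

# Accept only simple identifiers: a letter or underscore, then letters, digits, underscores.
_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")

def qualified_table_name(catalog, schema, table):
    """Return 'catalog.schema.table' after checking each part is a simple identifier."""
    for part in (catalog, schema, table):
        if not _IDENTIFIER.fullmatch(part):
            raise ValueError(f"not a simple SQL identifier: {part!r}")
    return f"{catalog}.{schema}.{table}"

print(qualified_table_name("osc_datacommons_dev", "demo", "ingest_tutorial_01"))
# osc_datacommons_dev.demo.ingest_tutorial_01
```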


## Dropping a table

In [33]:
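from sqlalchemy import text

sql = f"""
drop table if exists {ingest_catalog}.{ingest_schema}.{ingest_table}
"""
# Engine.execute() was removed in SQLAlchemy 2.0; use the open connection instead
qres = connection.execute(text(sql))
print(qres.fetchall())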

[(True,)]


## Removing unmanaged parquet files

In [34]:
bucket.objects \
    .filter(Prefix=f'trino/{ingest_schema}/{ingest_table}/') \
    .delete()

[{'ResponseMetadata': {'RequestId': 'VFN51ZQG18RK2091',
   'HostId': 'RzWFMCazA0Ujr5uByh68ub0C34mG4BEvflCJyCuQVENtjQ5FiSYKrrgoqCHOV3lJ0rRoCvcRc6I=',
   'HTTPStatusCode': 200,
   'HTTPHeaders': {'x-amz-id-2': 'RzWFMCazA0Ujr5uByh68ub0C34mG4BEvflCJyCuQVENtjQ5FiSYKrrgoqCHOV3lJ0rRoCvcRc6I=',
    'x-amz-request-id': 'VFN51ZQG18RK2091',
    'date': 'Mon, 22 Nov 2021 21:13:17 GMT',
    'content-type': 'application/xml',
    'transfer-encoding': 'chunked',
    'server': 'AmazonS3',
    'connection': 'close'},
   'RetryAttempts': 0},
  'Deleted': [{'Key': 'trino/demo/ingest_tutorial_01/data.parquet'}]}]
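
Deleting by prefix cannot be undone, so it can help to list the matching keys before calling `.delete()`. The sketch below assumes `bucket` is the boto3 `Bucket` resource created earlier; `keys_under_prefix` is a hypothetical helper name.

```python
def keys_under_prefix(bucket, prefix):
    """Return the keys of all objects in bucket whose key starts with prefix."""
    return [obj.key for obj in bucket.objects.filter(Prefix=prefix)]

# Usage, assuming the variables defined in the cells above:
# keys_under_prefix(bucket, f'trino/{ingest_schema}/{ingest_table}/')
```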