## Loading package dependencies

The `sqlalchemy-trino` package currently requires `sqlalchemy==1.3`.
This requirement may be lifted in the future.

`%pip install` commands need to be run only once per JupyterHub session.
If you restart your JupyterHub server, the installed packages are lost and you must run the commands again.

Notebook dependencies may be pre-installed on custom notebook images in future iterations.

In [1]:
%%capture pipoutput
%pip install trino python-dotenv pandas
%pip install --upgrade sqlalchemy==1.3 sqlalchemy-trino
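
Because the install cell captures its output, it can be useful to confirm what actually got installed. The following is a minimal sketch using only the standard library (`importlib.metadata`, Python 3.8+); the helper name `installed_version` is hypothetical and not part of any of the packages above.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string for `package`, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Report the version of each package installed by the cell above
for name in ("trino", "python-dotenv", "pandas", "sqlalchemy", "sqlalchemy-trino"):
    print(name, installed_version(name) or "not installed")
```

If `sqlalchemy` reports a version outside the 1.3 series, re-running the install cell should restore the pinned version.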

## Load credentials
The OS-Climate convention is to store credentials in a `dotenv` file named `credentials.env`.

In [2]:
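from dotenv import load_dotenv
import os
import pathlib

# Look for credentials.env in CREDENTIAL_DOTENV_DIR, falling back to the
# PWD directory and then to the default notebook home directory.
dotenv_dir = os.environ.get('CREDENTIAL_DOTENV_DIR', os.environ.get('PWD', '/opt/app-root/src'))
dotenv_path = pathlib.Path(dotenv_dir) / 'credentials.env'
if dotenv_path.exists():
    # override=True lets values in the file replace variables that are already set
    load_dotenv(dotenv_path=dotenv_path, override=True)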
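
If `credentials.env` is missing or incomplete, the loading cell finishes silently and the failure only surfaces later as a `KeyError`. A minimal sketch of an early check, assuming the four variable names that the connection cell reads (`TRINO_USER`, `TRINO_HOST`, `TRINO_PORT`, `TRINO_PASSWD`); the helper `missing_credentials` is hypothetical:

```python
import os

# Variable names taken from the connection cell of this notebook
REQUIRED_VARS = ("TRINO_USER", "TRINO_HOST", "TRINO_PORT", "TRINO_PASSWD")


def missing_credentials(env=None):
    """Return the required variable names that are unset or empty in `env`."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


missing = missing_credentials()
if missing:
    print("missing credentials: " + ", ".join(missing))
```

Passing a plain dictionary instead of `os.environ` makes the check easy to try out without touching the real environment.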

## Connect to Trino with an sqlalchemy engine

The following cell creates an `sqlalchemy` engine for Trino and opens a connection.

In [3]:
import trino
from sqlalchemy.engine import create_engine

sqlstring = 'trino://{user}@{host}:{port}/'.format(
    user = os.environ['TRINO_USER'],
    host = os.environ['TRINO_HOST'],
    port = os.environ['TRINO_PORT']
)
sqlargs = {
    'auth': trino.auth.JWTAuthentication(os.environ['TRINO_PASSWD']),
    'http_scheme': 'https'
}
engine = create_engine(sqlstring, connect_args = sqlargs)
print("connecting with engine " + str(engine))
connection = engine.connect()

connecting with engine Engine(trino://erikerlandson@trino-secure-odh-trino.apps.odh-cl1.apps.os-climate.org:443/)
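
The connection string above is assembled with plain string formatting, which works as long as the user name contains no characters that are special in a URL (such as `@` or `/`). The following sketch percent-encodes the user name with the standard library before building the string. The helper `trino_url` is hypothetical, and it rests on the assumption that `sqlalchemy` decodes percent-encoded user names when it parses the URL.

```python
from urllib.parse import quote


def trino_url(user, host, port):
    """Build a trino:// connection string, percent-encoding the user name."""
    # safe="" forces every reserved character, including "@" and "/", to be encoded
    return "trino://{user}@{host}:{port}/".format(
        user=quote(user, safe=""),
        host=host,
        port=port,
    )


print(trino_url("first.last@example.org", "trino.example.org", 443))
```

For user names made up only of letters and digits, the result is identical to the formatted string used in the cell above.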


## Load an SQL query into pandas
The `pandas` library can read the result of an SQL query directly into a DataFrame
using an `sqlalchemy` engine, as shown in the following cell.

Note the use of `convert_dtypes()`, which asks pandas to infer the most specific data type for each column (for example `string` rather than the generic `object`).

In [4]:
import pandas as pd
df = pd.read_sql("show catalogs", engine) \
       .convert_dtypes()
df

                       Catalog
0                          jmx
1          osc_datacommons_dev
2  osc_datacommons_iceberg_dev
3         osc_datacommons_prod
4                       system
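
To see what `convert_dtypes()` changes without a Trino connection, the following sketch applies it to a small in-memory DataFrame. The sample data is made up for illustration; only the `Catalog` column name comes from the query above.

```python
import pandas as pd

# Made-up sample data: the None in "n" makes pandas store the column as float64 with NaN
raw = pd.DataFrame({"Catalog": ["jmx", "system"], "n": [1, None]})
converted = raw.convert_dtypes()

# Compare the column types before and after conversion
print(raw.dtypes)
print(converted.dtypes)
```

After conversion, the whole-number column uses the nullable `Int64` type, so the missing value is kept as `<NA>` instead of forcing the column to floating point.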


## Check the column data types

You can check the column types returned for your query using the `info` DataFrame method:

In [5]:
df.info(verbose=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 1 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   Catalog  5 non-null      string
dtypes: string(1)
memory usage: 168.0 bytes
