Run this in a notebook cell if you need to install these packages into your notebook environment.

```python
%%capture pipoutput
# The 'capture' cell magic must be the first line of the cell;
# it prevents long pip output from spamming your notebook

# For loading predefined environment variables from files
# Typically used to load sensitive access credentials
%pip install python-dotenv

# Standard python package for interacting with S3 buckets
%pip install boto3

# Interacting with Trino and using Trino with sqlalchemy
%pip install trino sqlalchemy sqlalchemy-trino

# Pandas and parquet file i/o
%pip install pandas pyarrow fastparquet

# OS-Climate utilities to make data ingest easier
%pip install osc-ingest-tools
```

In [1]:
# use a catalog that is configured for iceberg
ingest_catalog = 'osc_datacommons_iceberg_dev'
ingest_schema = 'iceberg_demo'
ingest_table = 'trino_iceberg_demo'

In [2]:
# load standard credentials and get a database connection
import osc_ingest_trino as osc
osc.load_credentials_dotenv()
engine = osc.attach_trino_engine(verbose=True, catalog=ingest_catalog)

using connect string: trino://erikerlandson@trino-secure-odh-trino.apps.odh-cl1.apps.os-climate.org:443/osc_datacommons_iceberg_dev


Set up some example data in a pandas DataFrame.

In [3]:
import pandas as pd
data1 = [
    ['2021Q4', 0.6],
    ['2021Q4', 0.7],
    ['2021Q4', 0.8],
    ['2022Q1', 0.7],
    ['2022Q1', 0.8],
    ['2022Q1', 0.9],
    ['2022Q2', 0.8],
    ['2022Q2', 0.9],
    ['2022Q2', 0.95],
]
df1 = pd.DataFrame(data1, columns = ['quarter', 'reduction'])
df1 = df1.convert_dtypes()
print(df1.info(verbose=True))
df1

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9 entries, 0 to 8
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   quarter    9 non-null      string 
 1   reduction  9 non-null      Float64
dtypes: Float64(1), string(1)
memory usage: 281.0 bytes
None


Unnamed: 0,quarter,reduction
0,2021Q4,0.6
1,2021Q4,0.7
2,2021Q4,0.8
3,2022Q1,0.7
4,2022Q1,0.8
5,2022Q1,0.9
6,2022Q2,0.8
7,2022Q2,0.9
8,2022Q2,0.95


In [4]:
# make sure schema exists, or table creation below will fail in weird ways
sql = f"""
create schema if not exists {ingest_catalog}.{ingest_schema}
"""
qres = engine.execute(sql)
print(qres.fetchall())

[(True,)]


Create a table that is partitioned on our `quarter` column,
and request that the underlying data use the `ORC` columnar storage format.

In [5]:
import osc_ingest_trino as osc
columnschema = osc.create_table_schema_pairs(df1)

tabledef = f"""
create table if not exists {ingest_catalog}.{ingest_schema}.{ingest_table}(
{columnschema}
) with (
    format = 'ORC',
    partitioning = array['quarter']
)
"""
print(tabledef)
qres = engine.execute(tabledef)
print(qres.fetchall())


create table if not exists osc_datacommons_iceberg_dev.iceberg_demo.trino_iceberg_demo(
    quarter varchar,
    reduction double
) with (
    format = 'ORC',
    partitioning = array['quarter']
)

[(True,)]
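
For readers curious how a column schema like the one printed above can be derived from a dataframe, here is a minimal sketch. It is not the `osc-ingest-tools` implementation: `PANDAS_TO_TRINO` and `schema_pairs_from_dtypes` are hypothetical names, and the mapping covers only a few dtypes. It takes dtype names as plain strings (what `str(dtype)` gives for each column), so it runs without a database connection.

```python
# Hypothetical mapping from pandas dtype names to Trino column types.
# Only a handful of dtypes are covered; extend as needed.
PANDAS_TO_TRINO = {
    'string': 'varchar',
    'object': 'varchar',
    'Float64': 'double',
    'float64': 'double',
    'Int64': 'bigint',
    'int64': 'bigint',
    'boolean': 'boolean',
    'bool': 'boolean',
}

def schema_pairs_from_dtypes(dtypes, indent='    '):
    """Render 'name type' pairs, one per line, for a create table statement.

    dtypes: dict of column name -> pandas dtype name, for example
            {c: str(t) for c, t in df1.dtypes.items()}
    """
    lines = []
    for col, dtype_name in dtypes.items():
        if dtype_name not in PANDAS_TO_TRINO:
            raise ValueError(f'no Trino type known for dtype {dtype_name!r}')
        lines.append(f'{indent}{col} {PANDAS_TO_TRINO[dtype_name]}')
    return ',\n'.join(lines)

# dtype names taken from the df1.info() output earlier in this notebook
columnschema_sketch = schema_pairs_from_dtypes({'quarter': 'string', 'reduction': 'Float64'})
print(columnschema_sketch)
```

Raising on an unknown dtype, rather than guessing a type, keeps a surprising column from silently becoming `varchar`.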


In [6]:
# Delete all rows from our table, so we start with an empty table
sql=f"""
delete from {ingest_catalog}.{ingest_schema}.{ingest_table}
"""
qres = engine.execute(sql)
print(qres.fetchall())

[(None,)]


The following is the standard pandas `to_sql` functionality, working through `sqlalchemy-trino`.
With `method='multi'`, this logic writes the entire pandas dataframe as a single SQL `insert`.
Note: on very large dataframes this may fail because Trino limits the size of SQL commands.
Below we will create a custom insertion class to adapt to these limits.

In [7]:
# method = 'multi' is important, default will not work
# important to tell it about schema here, and catalog when you create the db connection above
# index = False, unless you declared that as a column when you create the table
# use 'append' mode since we already created the table
df1.to_sql(ingest_table,
           con=engine,
           schema=ingest_schema,
           if_exists='append',
           index=False,
           method='multi')

In [8]:
sql=f"""
select * from {ingest_catalog}.{ingest_schema}.{ingest_table}
"""
pd.read_sql(sql, engine)

Unnamed: 0,quarter,reduction
0,2021Q4,0.6
1,2021Q4,0.7
2,2021Q4,0.8
3,2022Q2,0.8
4,2022Q1,0.7
5,2022Q1,0.8
6,2022Q1,0.9
7,2022Q2,0.9
8,2022Q2,0.95


Pandas also allows you to write your own callable insertion method and
pass it to `to_sql` via the `method` argument.
Below we define a class with a `__call__` method
that has the function signature pandas expects.
This custom method breaks the data up into batches, which is useful if you are
inserting large dataframes that may exceed Trino's limits on SQL command size.

In [9]:
class TrinoBatchInsert(object):
    def __init__(self,
        catalog = None,
        schema = None,
        batch_size = 1000,
        verbose = False):
        self.catalog = catalog
        self.schema = schema
        self.batch_size = batch_size
        self.verbose = verbose

    # conforms to signature expected by pandas 'callable' value for method kw arg
    # https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_sql.html
    # https://pandas.pydata.org/docs/user_guide/io.html#io-sql-method
    def __call__(self, sqltbl, dbcxn, columns, data_iter):
        batch = []
        for r in data_iter:
            # each row of data_iter is a python tuple
            batch.append(str(r))
            # possible alternative: dispatch batches by total batch size in bytes
            if len(batch) >= self.batch_size:
                self._do_insert(dbcxn, sqltbl, batch)
                batch = []
        if len(batch) > 0:
            self._do_insert(dbcxn, sqltbl, batch)

    def _do_insert(self, dbcxn, sqltbl, batch_rows):
        valclause = ',\n'.join(batch_rows)
        sql = f'insert into {self._full_table_name(sqltbl)} values\n{valclause}'
        if self.verbose: print(f'{sql}')
        qres = dbcxn.execute(sql)
        x = qres.fetchall()
        if self.verbose: print(x)

    def _full_table_name(self, sqltbl):
        # start with table name
        name = f'{sqltbl.name}'
        # prepend schema - allow override from this class
        name = f'{self.schema or sqltbl.schema}.{name}'
        # prepend catalog, if provided
        if self.catalog is not None:
            name = f'{self.catalog}.{name}'
        return name
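
One caveat with `str(r)` in `__call__`: it produces Python's tuple representation, which happens to match SQL value syntax for the simple strings and numbers in this demo, but not for missing values or for strings that contain single quotes. The sketch below shows one possible way to render a row instead. `sql_literal` and `row_to_values` are hypothetical helpers, not part of pandas or `osc-ingest-tools`; they assume missing values arrive as `None` and cover only booleans, numbers, and strings. Mapping NaN to `NULL` is a choice made here, not a requirement.

```python
import math

def sql_literal(value):
    """Render one Python value as a SQL literal (hypothetical helper)."""
    if value is None:
        return 'NULL'
    if isinstance(value, bool):
        # check bool before numbers: bool is a subclass of int
        return 'TRUE' if value else 'FALSE'
    if isinstance(value, float) and math.isnan(value):
        # design choice in this sketch: treat NaN as missing
        return 'NULL'
    if isinstance(value, (int, float)):
        return repr(value)
    # treat everything else as a string; double any embedded single quotes
    escaped = str(value).replace("'", "''")
    return f"'{escaped}'"

def row_to_values(row):
    """Render a row tuple as a parenthesized SQL value list."""
    return '(' + ', '.join(sql_literal(v) for v in row) + ')'

print(row_to_values(('2021Q4', 0.6)))
print(row_to_values(("O'Brien", None)))
```

With this in place, `batch.append(row_to_values(r))` could stand in for `batch.append(str(r))`.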
        

Below we use our custom insertion method.
With `batch_size = 5` and nine rows, it inserts our data in two separate insert commands,
and so it adds TWO snapshots to our Iceberg table.

In [10]:
df1.to_sql(ingest_table,
           con=engine,
           schema=ingest_schema,
           if_exists='append',
           index=False,
           method=TrinoBatchInsert(batch_size = 5, verbose = True))

insert into iceberg_demo.trino_iceberg_demo values
('2021Q4', 0.6),
('2021Q4', 0.7),
('2021Q4', 0.8),
('2022Q1', 0.7),
('2022Q1', 0.8)
[(5,)]
insert into iceberg_demo.trino_iceberg_demo values
('2022Q1', 0.9),
('2022Q2', 0.8),
('2022Q2', 0.9),
('2022Q2', 0.95)
[(4,)]
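
The two inserts above were split by row count. The comment in `__call__` mentions dispatching batches by total size in bytes as an alternative; a minimal sketch of that idea follows. `batch_by_bytes` and `max_bytes` are hypothetical names, and a sensible limit depends on how your Trino cluster is configured.

```python
def batch_by_bytes(rows, max_bytes):
    """Yield lists of row strings whose combined UTF-8 size stays within max_bytes.

    Only the row strings are counted, not the separators or the
    'insert into ...' prefix. A single row larger than max_bytes is
    still yielded, alone in its own batch.
    """
    batch, size = [], 0
    for row in rows:
        row_size = len(row.encode('utf-8'))
        # close the current batch if adding this row would exceed the limit
        if batch and size + row_size > max_bytes:
            yield batch
            batch, size = [], 0
        batch.append(row)
        size += row_size
    if batch:
        yield batch

rows = [str(r) for r in [('2021Q4', 0.6), ('2021Q4', 0.7), ('2022Q1', 0.9)]]
batches = list(batch_by_bytes(rows, max_bytes=32))
print([len(b) for b in batches])
```

Each yielded batch could then be passed to `_do_insert` in place of the fixed-size batches.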


We can see that a second copy of our dataframe has been inserted into the table.

In [11]:
sql=f"""
select * from {ingest_catalog}.{ingest_schema}.{ingest_table}
"""
pd.read_sql(sql, engine)

Unnamed: 0,quarter,reduction
0,2022Q1,0.9
1,2022Q2,0.8
2,2022Q2,0.9
3,2022Q2,0.95
4,2022Q1,0.7
5,2021Q4,0.6
6,2022Q2,0.8
7,2021Q4,0.7
8,2021Q4,0.8
9,2022Q1,0.8


Iceberg maintains snapshots that capture the state of your table after each operation.
Below is an example of how to look at the most recent snapshots.

In [12]:
sql=f"""
select snapshot_id, committed_at from {ingest_catalog}.{ingest_schema}."{ingest_table}$snapshots"
    order by committed_at desc
    limit 5
"""
qres = engine.execute(sql)
snapshots = qres.fetchall()
snapshots

[(1634140632585271867, '2021-12-14 23:22:09.112 UTC'),
 (2152499919448736732, '2021-12-14 23:22:08.045 UTC'),
 (827552413813726480, '2021-12-14 23:21:55.455 UTC'),
 (4306330142958682731, '2021-12-14 23:21:50.847 UTC'),
 (6454393995092049285, '2021-12-14 23:21:46.827 UTC')]

Here is the ID of the snapshot that holds the previous state of our table.

In [13]:
previous_snapshot = snapshots[1][0]
previous_snapshot

2152499919448736732
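
Indexing with `snapshots[1]` relies on the order of the list. If you would rather pick the snapshot that was current at a given time, something like the following sketch could work. `snapshot_as_of` and `parse_committed_at` are hypothetical helpers, and they assume `committed_at` comes back as a string in the format shown in the output above.

```python
from datetime import datetime

def parse_committed_at(ts):
    # e.g. '2021-12-14 23:22:08.045 UTC'; the zone suffix is matched literally
    return datetime.strptime(ts, '%Y-%m-%d %H:%M:%S.%f UTC')

def snapshot_as_of(snapshot_rows, when):
    """Return the id of the latest snapshot committed at or before `when`, or None."""
    candidates = [(parse_committed_at(ts), sid) for sid, ts in snapshot_rows
                  if parse_committed_at(ts) <= when]
    return max(candidates)[1] if candidates else None

# rows copied from the snapshot listing above
example_rows = [
    (1634140632585271867, '2021-12-14 23:22:09.112 UTC'),
    (2152499919448736732, '2021-12-14 23:22:08.045 UTC'),
    (827552413813726480, '2021-12-14 23:21:55.455 UTC'),
]
print(snapshot_as_of(example_rows, datetime(2021, 12, 14, 23, 22, 8, 500000)))
```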

You can run your query against a particular snapshot of your table,
as in this example.

Notice that the previous snapshot includes the FIRST insertion from our custom method,
but not the second one.

In [14]:
sql=f"""
select * from {ingest_catalog}.{ingest_schema}."{ingest_table}@{previous_snapshot}"
"""
pd.read_sql(sql, engine)

Unnamed: 0,quarter,reduction
0,2022Q2,0.8
1,2022Q2,0.9
2,2022Q2,0.95
3,2021Q4,0.6
4,2021Q4,0.7
5,2021Q4,0.8
6,2022Q1,0.7
7,2022Q1,0.8
8,2022Q1,0.9
9,2022Q1,0.7
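
The `table@snapshot_id` form used above is what worked on the Trino version this notebook was run against. More recent Trino documentation describes a `FOR VERSION AS OF` clause for the Iceberg connector instead. Which form your cluster accepts depends on its version, so treat the sketch below as an assumption to verify. `snapshot_query` is a hypothetical helper that only builds the SQL string.

```python
def snapshot_query(catalog, schema, table, snapshot_id, style='at'):
    """Build a select against one snapshot of an Iceberg table.

    style='at'      -> "table@snapshot_id" form used in this notebook
    style='version' -> FOR VERSION AS OF form (assumed to need a newer Trino)
    """
    snapshot_id = int(snapshot_id)  # guard against anything but an integer id
    if style == 'at':
        return f'select * from {catalog}.{schema}."{table}@{snapshot_id}"'
    if style == 'version':
        return f'select * from {catalog}.{schema}.{table} for version as of {snapshot_id}'
    raise ValueError(f'unknown style: {style!r}')

print(snapshot_query('osc_datacommons_iceberg_dev', 'iceberg_demo',
                     'trino_iceberg_demo', 2152499919448736732, style='version'))
```

The resulting string can be passed to `pd.read_sql(sql, engine)` in the same way as the query above.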


You can also roll back the state of your table to a particular snapshot.

Note: rollback does not appear to be robust against dropping a table and re-creating it.
Still investigating what the problem is here.

In [15]:
try:
    sql=f"""
    call {ingest_catalog}.system.rollback_to_snapshot('{ingest_schema}', '{ingest_table}', {previous_snapshot})
    """
    qres = engine.execute(sql)
    print(qres.fetchall())
except Exception as e:
    print(e)

TrinoQueryError(type=INTERNAL_ERROR, name=PROCEDURE_CALL_FAILED, message="Cannot roll back to unknown snapshot id: 2152499919448736732", query_id=20211214_232231_00922_gf2h4)


In [16]:
sql=f"""
select * from {ingest_catalog}.{ingest_schema}.{ingest_table}
"""
pd.read_sql(sql, engine)

Unnamed: 0,quarter,reduction
0,2021Q4,0.6
1,2021Q4,0.7
2,2021Q4,0.8
3,2022Q2,0.8
4,2022Q2,0.9
5,2022Q2,0.95
6,2022Q2,0.8
7,2022Q1,0.7
8,2022Q1,0.8
9,2022Q1,0.9


Iceberg + Trino has an integration bug: it does NOT remove the underlying files after dropping a table.
It is possible to drop the table and remove these files manually, but be very careful doing it.

Copy/paste this if you need to drop this table.
Exercise extreme caution. NEVER remove the underlying files unless the table has been dropped first.

```python
sql = f"""
drop table if exists {ingest_catalog}.{ingest_schema}.{ingest_table}
"""
qres = engine.execute(sql)
print(qres.fetchall())
bucket = osc.attach_s3_bucket('S3_ICEBERG_DEV')
bucket.objects.filter(Prefix=f'data/{ingest_schema}.db/{ingest_table}/').delete()
```
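
Because the `delete()` call removes everything under the given prefix, an empty or mistyped schema or table name could match far more than intended. A small guard such as the sketch below can help. `iceberg_data_prefix` is a hypothetical helper, and the `data/<schema>.db/<table>/` layout is taken from the snippet above rather than from any documented guarantee.

```python
import re

def iceberg_data_prefix(schema, table):
    """Build the S3 prefix for one table, refusing empty or unusual names."""
    for label, name in (('schema', schema), ('table', table)):
        # allow only simple identifiers: letters, digits, underscores
        if not isinstance(name, str) or not re.fullmatch(r'[A-Za-z0-9_]+', name):
            raise ValueError(f'refusing to build prefix: bad {label} name {name!r}')
    return f'data/{schema}.db/{table}/'

print(iceberg_data_prefix('iceberg_demo', 'trino_iceberg_demo'))
```

The returned string could then be passed as `Prefix=` in the `bucket.objects.filter(...)` call.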