<a href="https://colab.research.google.com/github/lestermartin/starburst-dataframes-exploration/blob/main/IcebergMigrationTool/Migrate2Iceberg.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Starburst Galaxy Iceberg migration tool

An interactive notebook for migrating non-Iceberg tables in a given
Starburst Galaxy data lake catalog to Apache Iceberg. Only Hive tables are migrated at this time.

See the [Migrate Hive tables to Apache Iceberg with Starburst Galaxy](https://www.starburst.io/tutorials/migrate-hive-tables-to-iceberg-with-starburst-galaxy/#0) tutorial.

---
## Config & setup

In [1]:
!pip install pystarburst

Collecting pystarburst
  Downloading pystarburst-0.9.0-py3-none-any.whl.metadata (2.9 kB)
Collecting trino<0.330.0,>=0.329.0 (from pystarburst)
  Downloading trino-0.329.0-py3-none-any.whl.metadata (18 kB)
Collecting zstandard<0.23.0,>=0.22.0 (from pystarburst)
  Downloading zstandard-0.22.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (2.9 kB)
Downloading pystarburst-0.9.0-py3-none-any.whl (135 kB)
Downloading trino-0.329.0-py3-none-any.whl (53 kB)
Downloading zstandard-0.22.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (5.4 MB)
Installing collected packages: zstandard, trino, pys

### Galaxy cluster & user credentials

Run the next cell. Note that it only collects your values; it does NOT validate them.

In [2]:
import getpass

# grab credentials from the notebook user to be used when making a connection
host = input("Host name")
username = input("User name")
password = getpass.getpass("Password")

Host nametrain-aws-us-east-1-small.trino.galaxy.starburst.io
User namelester.martin@starburstdata.com/accountadmin
Password··········


### Migration process parameters

Set the values below to match your migration effort.

In [8]:
# Galaxy target catalog to perform migration on
tgt_cat = 'mycatalog'

# schema to target migration effort on
tgt_sch = 'myschema'
#TODO: allow '*' to be valid and used to loop through all schemas in tgt_cat

# CTAS properties used in WITH clause on new tables created for shadow migration
with_props = "type='iceberg', format='parquet'"
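The TODO above could be handled by building the `information_schema.tables` filter from `tgt_sch`. The following is a minimal sketch, assuming `'*'` should mean "every schema in the catalog except `information_schema`". `build_table_filter` is a hypothetical helper and not part of PyStarburst.

```python
def build_table_filter(tgt_sch: str) -> str:
    """Return a filter expression for information_schema.tables.

    Hypothetical helper: '*' selects base tables in every schema except
    information_schema; any other value selects that single schema.
    """
    base = "table_type = 'BASE TABLE'"
    if tgt_sch == "*":
        return base + " AND table_schema <> 'information_schema'"
    # double any single quote so the value stays a valid SQL string literal
    safe_sch = tgt_sch.replace("'", "''")
    return "table_schema = '" + safe_sch + "' AND " + base


print(build_table_filter("myschema"))
print(build_table_filter("*"))
```

The returned string could be passed to `.filter(...)` in the "Identify targeted tables" cell in place of the hard-coded expression.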

### Setup PyStarburst session

The cell below should return `[Row(Working='Yes')]` if the session is working. If an exception is raised,
the most likely cause is an incorrect host name or incorrect credentials.

In [4]:
import trino

from pystarburst import Session
from pystarburst import functions as F
from pystarburst.functions import *
from pystarburst.window import Window as W

# PyStarburst setup
session_properties = {
    "host": host,
    "port": 443,
    "http_scheme": "https",
    "auth": trino.auth.BasicAuthentication(username, password)
}
session = Session.builder.configs(session_properties).create()

# validate PyStarburst working
session.sql("select 'Yes' as Working").collect()

[Row(Working='Yes')]

In [10]:
session.sql("SELECT * FROM system.runtime.nodes").collect()

[Row(node_id='trino-worker-56fb5497-8pv85', http_uri='http://10.5.125.147:8080', node_version='476-galaxy-1-u136-g710558245c7', coordinator=False, state='active'),
 Row(node_id='trino-worker-56fb5497-bclpb', http_uri='http://10.5.19.39:8080', node_version='476-galaxy-1-u136-g710558245c7', coordinator=False, state='active'),
 Row(node_id='trino-worker-56fb5497-g92g2', http_uri='http://10.5.85.205:8080', node_version='476-galaxy-1-u136-g710558245c7', coordinator=False, state='active'),
 Row(node_id='trino-coordinator-5bb5959994-zgdpk', http_uri='http://10.5.54.157:8080', node_version='476-galaxy-1-u136-g710558245c7', coordinator=True, state='active'),
 Row(node_id='trino-worker-56fb5497-gkdfz', http_uri='http://10.5.18.139:8080', node_version='476-galaxy-1-u136-g710558245c7', coordinator=False, state='active')]

---
## Initial analysis

### Identify targeted tables

The fully qualified names of all tables that will be interrogated and, where appropriate, migrated.

In [11]:
# get all BASE TABLE entries in the info schema's tables table
table_list = session \
    .table(tgt_cat + ".information_schema.tables") \
    .filter("table_schema = '" + tgt_sch + "' AND table_type = 'BASE TABLE'") \
    .select_expr("table_catalog||'.'||table_schema||'.'||table_name as table_name") \
    .collect()

for a_table in table_list:
    print(a_table.table_name)

students.hive2ice.tpch_nation_hive_textfile
students.hive2ice.tpch_orders_hive_orc
students.hive2ice.tpch_orders_hive_json
students.hive2ice.tpch_cust_ice_orc
students.hive2ice.tpch_cust_hive_orc
students.hive2ice.tpch_orders_delta
students.hive2ice.tpch_orders_hive_textfile
students.hive2ice.tpch_cust_ice_avro
students.hive2ice.tpch_orders_hive_parquet
students.hive2ice.tpch_cust_hive_textfile
students.hive2ice.tpch_orders_ice_orc
students.hive2ice.tpch_orders_hive_avro
students.hive2ice.tpch_cust_hive_avro
students.hive2ice.tpch_cust_delta
students.hive2ice.tpch_cust_ice_parquet
students.hive2ice.tpch_orders_ice_avro
students.hive2ice.tpch_orders_ice_parquet
students.hive2ice.tpch_cust_hive_parquet
students.hive2ice.tpch_cust_hive_json
students.hive2ice.tpch_nation_hive_json


### Categorize tables

Sort each table into one of the following categories.

- Existing Iceberg tables (no action to take)
- Hive tables backed by ORC, Parquet, or Avro (will attempt in-place migrations via an `ALTER TABLE` command)
- Hive tables backed by other file formats (will attempt shadow migrations via a CTAS statement)
- Tables that are neither Hive nor Iceberg (no action is taken on these at this time)
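The categorization rules above can be expressed as a small pure function, which makes them easy to test without a cluster. This is a sketch that reuses the same substring checks as the notebook cell. The function name, the category labels, and the sample DDL strings are all hypothetical and only serve the illustration.

```python
def classify_table(create_table_ddl: str) -> str:
    """Map SHOW CREATE TABLE output to one of four categories.

    The labels returned here are hypothetical names chosen for this sketch.
    """
    in_place_formats = ("PARQUET", "ORC", "AVRO")
    if "type = 'HIVE'" in create_table_ddl:
        # Hive tables in these formats can be converted without rewriting data
        if any("format = '" + fmt + "'" in create_table_ddl for fmt in in_place_formats):
            return "in_place"
        return "shadow"
    if "type = 'ICEBERG'" in create_table_ddl:
        return "iceberg"
    return "other"


# made-up DDL fragments, shaped like the strings the checks look for
samples = {
    "hive_orc": "CREATE TABLE c.s.t1 (id bigint) WITH (format = 'ORC', type = 'HIVE')",
    "hive_json": "CREATE TABLE c.s.t2 (id bigint) WITH (format = 'JSON', type = 'HIVE')",
    "iceberg": "CREATE TABLE c.s.t3 (id bigint) WITH (format = 'PARQUET', type = 'ICEBERG')",
    "delta": "CREATE TABLE c.s.t4 (id bigint) WITH (type = 'DELTA')",
}
for name, ddl in samples.items():
    print(name, "->", classify_table(ddl))
```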

In [12]:
# create some lists to separate the table types into
hive_tables_2rewrite = list()
hive_tables_2migrate = list()
iceberg_tables = list()
other_tables   = list()

# look at the table create statement to determine which category to use
for a_table in table_list:
    table_name = a_table.table_name
    crt_tb = session.sql('show create table ' + table_name).collect()
    cts = crt_tb[0]["Create Table"]

    if "type = 'HIVE'" in cts:
        if "format = 'PARQUET'" in cts or "format = 'ORC'" in cts or "format = 'AVRO'" in cts:
            hive_tables_2migrate.append(table_name)
        else:
            hive_tables_2rewrite.append(table_name)
    elif "type = 'ICEBERG'" in cts:
        iceberg_tables.append(table_name)
    else:
        other_tables.append(table_name)


print("\n" + str(len(iceberg_tables)) + " Iceberg tables already exist -- no action will be taken on these...")
print(iceberg_tables)

print("\n" + str(len(hive_tables_2migrate)) + " Hive tables targeted to be MIGRATED (in-place) to Iceberg...")
print(hive_tables_2migrate)

print("\n" + str(len(hive_tables_2rewrite)) + " Hive tables targeted to be REWRITTEN (ctas) to Iceberg...")
print(hive_tables_2rewrite)

print("\n" + str(len(other_tables)) + " non Iceberg or Hive tables exist -- no action will be taken on these "
      + "(BUT Delta Lake TABLES COULD BE REWRITTEN)...")
print(other_tables)


6 Iceberg tables already exist -- no action will be taken on these...
['students.hive2ice.tpch_cust_ice_orc', 'students.hive2ice.tpch_cust_ice_avro', 'students.hive2ice.tpch_orders_ice_orc', 'students.hive2ice.tpch_cust_ice_parquet', 'students.hive2ice.tpch_orders_ice_avro', 'students.hive2ice.tpch_orders_ice_parquet']

6 Hive tables targeted to be MIGRATED (in-place) to Iceberg...
['students.hive2ice.tpch_orders_hive_orc', 'students.hive2ice.tpch_cust_hive_orc', 'students.hive2ice.tpch_orders_hive_parquet', 'students.hive2ice.tpch_orders_hive_avro', 'students.hive2ice.tpch_cust_hive_avro', 'students.hive2ice.tpch_cust_hive_parquet']

6 Hive tables targeted to be REWRITTEN (ctas) to Iceberg...
['students.hive2ice.tpch_nation_hive_textfile', 'students.hive2ice.tpch_orders_hive_json', 'students.hive2ice.tpch_orders_hive_textfile', 'students.hive2ice.tpch_cust_hive_textfile', 'students.hive2ice.tpch_cust_hive_json', 'students.hive2ice.tpch_nation_hive_json']

2 non Iceberg or Hive tables

---
## Begin migration

### Perform in-place migrations

Run `ALTER TABLE table_name SET PROPERTIES type = 'ICEBERG'` on each Hive table backed by the ORC, Parquet, or Avro file format.

In [13]:
print("\n")
print("+++++++++++++++++++++++++++++")
print("++++ IN-PLACE MIGRATIONS ++++")
print("+++++++++++++++++++++++++++++")
print("\n")

# NEED TO HANDLE ANY EXCEPTIONS THAT MIGHT BE RAISED
#  these would likely be caused by invalid data types for starters;
#  once identified, those tables could be added to the shadow migration list

for tbl in hive_tables_2migrate:
    print("in-place migration > " + tbl)
    session.sql("ALTER TABLE " + tbl + " SET PROPERTIES type = 'ICEBERG'").show()



+++++++++++++++++++++++++++++
++++ IN-PLACE MIGRATIONS ++++
+++++++++++++++++++++++++++++


in-place migration > students.hive2ice.tpch_orders_hive_orc
----------
|status  |
----------
|ok      |
----------

in-place migration > students.hive2ice.tpch_cust_hive_orc
----------
|status  |
----------
|ok      |
----------

in-place migration > students.hive2ice.tpch_orders_hive_parquet
----------
|status  |
----------
|ok      |
----------

in-place migration > students.hive2ice.tpch_orders_hive_avro
----------
|status  |
----------
|ok      |
----------

in-place migration > students.hive2ice.tpch_cust_hive_avro
----------
|status  |
----------
|ok      |
----------

in-place migration > students.hive2ice.tpch_cust_hive_parquet
----------
|status  |
----------
|ok      |
----------
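The loop in the cell above stops at the first failed `ALTER TABLE`, which is the gap its comment calls out. One possible way to close it is sketched below: each statement is tried on its own, and the tables that fail are collected so they can fall back to the shadow approach. `migrate_in_place`, `run_sql`, and `fake_run_sql` are hypothetical names. The fake runner only stands in for a real session so the sketch runs anywhere.

```python
from typing import Callable, Iterable, List, Tuple


def migrate_in_place(
    run_sql: Callable[[str], None], tables: Iterable[str]
) -> Tuple[List[str], List[str]]:
    """Try the in-place ALTER on each table and return (migrated, failed)."""
    migrated, failed = [], []
    for tbl in tables:
        stmt = "ALTER TABLE " + tbl + " SET PROPERTIES type = 'ICEBERG'"
        try:
            run_sql(stmt)
            migrated.append(tbl)
        except Exception as exc:  # broad on purpose: any failure means "fall back"
            print("in-place migration FAILED > " + tbl + " (" + str(exc) + ")")
            failed.append(tbl)
    return migrated, failed


def fake_run_sql(stmt: str) -> None:
    # stand-in for a real session; fails for one made-up table name
    if "bad_table" in stmt:
        raise RuntimeError("unsupported column type")


migrated, failed = migrate_in_place(fake_run_sql, ["c.s.good_table", "c.s.bad_table"])
print("migrated:", migrated)
print("failed:", failed)
```

In the notebook, `run_sql` could be `lambda stmt: session.sql(stmt).show()`, and the `failed` list could be appended to `hive_tables_2rewrite` before the shadow migration cell runs.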



### Perform shadow migrations

Perform the following steps for Hive tables backed by file formats not supported directly by Iceberg.

- `ALTER TABLE table_name RENAME TO hold_name`
- `CREATE TABLE table_name WITH (with_props) AS SELECT * FROM hold_name`
- `ALTER TABLE hold_name RENAME TO rm_name`
- Add `rm_name` to a collection to later be dropped, if desired
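The steps above can be sketched as a pure function that builds the three statements for one table, so the generated SQL can be reviewed before anything runs. `shadow_migration_statements` is a hypothetical helper. It assumes the table name has exactly three dot-separated parts and, like the notebook cell, it derives the final name by prepending `rm_` to the `hold_` name.

```python
from typing import List


def shadow_migration_statements(table_name: str, with_props: str) -> List[str]:
    """Build the rename, CTAS, and rename statements in execution order."""
    # assumes catalog.schema.table with no dots inside any part
    catalog, schema, table = table_name.split(".")
    hold_name = catalog + "." + schema + ".hold_" + table
    rm_name = catalog + "." + schema + ".rm_hold_" + table
    return [
        "ALTER TABLE " + table_name + " RENAME TO " + hold_name,
        "CREATE TABLE " + table_name + " WITH (" + with_props + ") AS SELECT * FROM " + hold_name,
        "ALTER TABLE " + hold_name + " RENAME TO " + rm_name,
    ]


stmts = shadow_migration_statements(
    "students.hive2ice.tpch_nation_hive_textfile",
    "type='iceberg', format='parquet'",
)
for stmt in stmts:
    print(stmt)
```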

In [14]:
print("\n")
print("+++++++++++++++++++++++++++++")
print("++++ SHADOW MIGRATIONS ++++++")
print("+++++++++++++++++++++++++++++")
print("+++ with_props > " + with_props)
print("+++++++++++++++++++++++++++++")
print("\n")

old_tbls2rm = list()

for tbl in hive_tables_2rewrite:
    print("******** shadow migration > " + tbl + "\n")
    fqtn = tbl.split(".")
    hold_name = fqtn[0]+"."+fqtn[1]+".hold_"+fqtn[2]
    alter2hold_cmd = "ALTER TABLE " + tbl + " RENAME TO " + hold_name
    print(alter2hold_cmd)
    session.sql(alter2hold_cmd).show()

    # NEED TO TACKLE PARTITIONS BY LOOPING THROUGH THEM INSERTING FROM MOST RECENT TO LEAST RECENT
    #  could even create a union with the old and new table then once new partition is committed,
    #  quickly drop the old partition (briefly have 2 copies of each partition!!)

    ctas_cmd = "CREATE TABLE " + tbl + " WITH (" + with_props + ") AS SELECT * FROM " + hold_name
    print(ctas_cmd)
    session.sql(ctas_cmd).show()

    fqtn = hold_name.split(".")
    rm_name = fqtn[0]+"."+fqtn[1]+".rm_"+fqtn[2]
    alter2rm_cmd = "ALTER TABLE " + hold_name + " RENAME TO " + rm_name
    print(alter2rm_cmd)
    session.sql(alter2rm_cmd).show()

    # hold on to the rm_name for possible deletions later
    old_tbls2rm.append(rm_name)



+++++++++++++++++++++++++++++
++++ SHADOW MIGRATIONS ++++++
+++++++++++++++++++++++++++++
+++ with_props > type='iceberg', format='parquet'
+++++++++++++++++++++++++++++


******** shadow migration > students.hive2ice.tpch_nation_hive_textfile

ALTER TABLE students.hive2ice.tpch_nation_hive_textfile RENAME TO students.hive2ice.hold_tpch_nation_hive_textfile
----------
|status  |
----------
|ok      |
----------

CREATE TABLE students.hive2ice.tpch_nation_hive_textfile WITH (type='iceberg', format='parquet') AS SELECT * FROM students.hive2ice.hold_tpch_nation_hive_textfile
----------
|status  |
----------
|ok      |
----------

ALTER TABLE students.hive2ice.hold_tpch_nation_hive_textfile RENAME TO students.hive2ice.rm_hold_tpch_nation_hive_textfile
----------
|status  |
----------
|ok      |
----------

******** shadow migration > students.hive2ice.tpch_orders_hive_json

ALTER TABLE students.hive2ice.tpch_orders_hive_json RENAME TO students.hive2ice.hold_tpch_orders_hive_json
--------

### Optionally, delete migrated tables

If desired, run `DROP TABLE` commands on the original Hive tables that were migrated with the shadow (i.e. CTAS) approach.

NOTE: Their names are prefixed with `rm_hold_`, because the `rm_` prefix is added to the already `hold_`-prefixed name.

In [15]:
# cleanup of the original tables that were shadow migrated

for tbl in old_tbls2rm:
    print("\ndropping original table > " + tbl)
    session.sql("DROP TABLE " + tbl).show()


dropping original table > students.hive2ice.rm_hold_tpch_nation_hive_textfile
----------
|status  |
----------
|ok      |
----------


dropping original table > students.hive2ice.rm_hold_tpch_orders_hive_json
----------
|status  |
----------
|ok      |
----------


dropping original table > students.hive2ice.rm_hold_tpch_orders_hive_textfile
----------
|status  |
----------
|ok      |
----------


dropping original table > students.hive2ice.rm_hold_tpch_cust_hive_textfile
----------
|status  |
----------
|ok      |
----------


dropping original table > students.hive2ice.rm_hold_tpch_cust_hive_json
----------
|status  |
----------
|ok      |
----------


dropping original table > students.hive2ice.rm_hold_tpch_nation_hive_json
----------
|status  |
----------
|ok      |
----------
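Because `DROP TABLE` is destructive, it may be worth guarding the cleanup so that only tables carrying the `rm_hold_` prefix are ever dropped. The following is a minimal sketch of such a guard. `drop_statements` is a hypothetical helper, and the table names in the example are made up.

```python
from typing import Iterable, List


def drop_statements(tables: Iterable[str], prefix: str = "rm_hold_") -> List[str]:
    """Return DROP TABLE statements, skipping names without the prefix."""
    stmts = []
    for tbl in tables:
        # the table part is whatever follows the last dot
        table_part = tbl.split(".")[-1]
        if table_part.startswith(prefix):
            stmts.append("DROP TABLE " + tbl)
        else:
            print("skipping (unexpected name) > " + tbl)
    return stmts


drops = drop_statements(["c.s.rm_hold_orders", "c.s.orders"])
for stmt in drops:
    print(stmt)
```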

