# Starburst Galaxy Iceberg migration tool

An interactive notebook used to migrate non-Iceberg tables in a given 
Starburst Galaxy data lake catalog.

See [Migrate Hive tables to Apache Iceberg with Starburst Galaxy tutorial](https://www.starburst.io/tutorials/migrate-hive-tables-to-iceberg-with-starburst-galaxy/#0).

---
## Config & setup

### Galaxy cluster & user credentials

Run the next cell. Note that it only collects your values; it does NOT validate them.

In [None]:
import getpass

# grab credentials from the notebook user to be used when making a connection
host = input("Host name: ")
username = input("User name: ")
password = getpass.getpass("Password: ")
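Because the cell above only collects values, a quick local sanity check can catch obvious mistakes before a connection is attempted. The following is a minimal sketch; the helper names `missing_fields` and `looks_like_bare_host` are hypothetical and not part of the original notebook, and the checks are heuristics rather than real validation against the cluster.

```python
def missing_fields(values):
    """Return the names of settings that are empty or only whitespace."""
    return [name for name, value in values.items()
            if not value or not value.strip()]


def looks_like_bare_host(host):
    """Heuristic only: a bare host name has no URL scheme, path, or spaces."""
    return bool(host) and "://" not in host and "/" not in host and " " not in host
```

In the notebook this could be called as `missing_fields({"host": host, "username": username, "password": password})`; an empty list means nothing is blank. The connection checks in the setup cells remain the real test of the values.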

### Migration process parameters

Update the values below to the most appropriate values for your migration effort.

In [137]:
# Galaxy target catalog to perform migration on
tgt_cat = 'mycatalog'

# schema to target migration effort on  
tgt_sch = 'myschema'
#TODO: allow '*' to be valid and used to loop through all schemas in tgt_cat

# CTAS properties used in WITH clause on new tables created for shadow migration
with_props = "type='iceberg', format='orc'"
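The TODO above about accepting `'*'` could be approached by generating the `information_schema.tables` filter from `tgt_sch`. This is only a sketch under assumptions: `schema_filter` is a hypothetical helper, and excluding `information_schema` is a guess at what "all schemas" should mean here.

```python
def schema_filter(tgt_sch):
    """Build the filter string used against information_schema.tables."""
    base = "table_type = 'BASE TABLE'"
    if tgt_sch == "*":
        # assumption: every schema except the catalog's own metadata schema
        return "table_schema <> 'information_schema' AND " + base
    # double any single quotes so the value stays a valid SQL string literal
    return "table_schema = '" + tgt_sch.replace("'", "''") + "' AND " + base
```

If `'*'` were supported this way, the table listing would also need to select `table_schema` so that each fully qualified name can be built from the schema of its own row rather than from `tgt_sch`.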

### Set up PyStarburst session

The cell should return `[Row(Working='Yes')]` if the session is functional. If an exception is raised,
it is most likely due to an incorrect cluster host or incorrect credentials.

In [None]:
import trino.auth  # provides BasicAuthentication, used in the session properties below
from pystarburst import Session
from pystarburst import functions as F
from pystarburst.functions import *
from pystarburst.window import Window as W

# PyStarburst setup
session_properties = {
    "host": host,
    "port": 443,
    "http_scheme": "https",
    "auth": trino.auth.BasicAuthentication(username, password)
}
session = Session.builder.configs(session_properties).create()

# validate PyStarburst working
session.sql("select 'Yes' as Working").collect()

### Set up trino-python-client connection

The cell should print `[['Yes, Working']]` if the connection is functional. If an exception is raised,
it is most likely due to an incorrect cluster host or incorrect credentials.

In [None]:
import trino.auth  # provides BasicAuthentication; not bound by the dbapi import alone
from trino.dbapi import connect

# trino-python-client setup
conn = connect(
    host=host,
    port=443,
    http_scheme='https',
    auth=trino.auth.BasicAuthentication(username, password)
)
cur = conn.cursor()

# validate trino-python-client working
cur.execute("select 'Yes, Working' as dummy")
print(cur.fetchall())

---
## Initial analysis

### Identify targeted tables

Fully qualified names of all tables that will be interrogated and, where appropriate, migrated.

In [145]:
# get all BASE TABLE entries in the info schema's tables table
table_list_from_collect = session \
    .table(tgt_cat + ".information_schema.tables") \
    .filter("table_schema = '" + tgt_sch + "' AND table_type = 'BASE TABLE'") \
    .select("table_name").sort("table_name") \
    .collect()

table_list = list()
for a_table in table_list_from_collect:
    full_name = tgt_cat + "." + tgt_sch + "." + a_table.table_name
    table_list.append(full_name)
    print(full_name)

students.mv2ice.tpch_cust_delta
students.mv2ice.tpch_cust_hive_avro
students.mv2ice.tpch_cust_hive_json
students.mv2ice.tpch_cust_hive_orc
students.mv2ice.tpch_cust_hive_parquet
students.mv2ice.tpch_cust_hive_textfile
students.mv2ice.tpch_cust_ice_avro
students.mv2ice.tpch_cust_ice_orc
students.mv2ice.tpch_cust_ice_parquet
students.mv2ice.tpch_nation_hive_json
students.mv2ice.tpch_nation_hive_textfile
students.mv2ice.tpch_orders_delta
students.mv2ice.tpch_orders_hive_avro
students.mv2ice.tpch_orders_hive_json
students.mv2ice.tpch_orders_hive_orc
students.mv2ice.tpch_orders_hive_parquet
students.mv2ice.tpch_orders_hive_textfile
students.mv2ice.tpch_orders_ice_avro
students.mv2ice.tpch_orders_ice_orc
students.mv2ice.tpch_orders_ice_parquet


### Categorize tables

Identify tables for each of the following categories.

- Existing Iceberg tables (no action to take)
- Hive tables backed by ORC, Parquet, or Avro (will attempt in-place migrations via ALTER command)
- Hive tables backed by other file formats (will attempt shadow migrations via CTAS statement)
- Tables that are neither Hive nor Iceberg (no action is taken on these at this time)

In [None]:
# create some lists to separate the table types into
hive_tables_2rewrite = list()
hive_tables_2migrate = list()
iceberg_tables = list()
other_tables   = list()

# could NOT get the SHOW CREATE TABLE output via PyStarburst,
#  nor figure out how to get the table type and format any other way,
#  so trino-python-client is used here

for a_table in table_list:
    cur.execute("SHOW CREATE TABLE " + a_table)
    cts = cur.fetchall()[0][0]
    
    if "type = 'HIVE'" in cts:
        if "format = 'PARQUET'" in cts or "format = 'ORC'" in cts or "format = 'AVRO'" in cts:
            hive_tables_2migrate.append(a_table)
        else:
            hive_tables_2rewrite.append(a_table)
    elif "type = 'ICEBERG'" in cts:
        iceberg_tables.append(a_table)
    else:
        other_tables.append(a_table)
        
        
print("\n" + str(len(iceberg_tables)) + " Iceberg tables already exist -- no action will be taken on these...")
print(iceberg_tables)

print("\n" + str(len(hive_tables_2migrate)) + " Hive tables targeted to be MIGRATED (in-place) to Iceberg...")
print(hive_tables_2migrate)

print("\n" + str(len(hive_tables_2rewrite)) + " Hive tables targeted to be REWRITTEN (ctas) to Iceberg...")
print(hive_tables_2rewrite)

print("\n" + str(len(other_tables)) + " non Iceberg or Hive tables exist -- no action will be taken on these " + \
      "(BUT Delta Lake TABLES COULD BE REWRITTEN)...")
print(other_tables)
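The substring checks in the cell above depend on the exact spacing and casing of the `SHOW CREATE TABLE` output. A slightly more tolerant variant is sketched below; `table_property` and `categorize` are hypothetical helper names, and the regular expression assumes properties appear as `name = 'value'` pairs in the DDL text, which is what the original checks also rely on.

```python
import re

# Hive file formats the notebook migrates in place
IN_PLACE_FORMATS = {"PARQUET", "ORC", "AVRO"}


def table_property(ddl, name):
    """Return the upper-cased value of a `name = 'value'` property, or None."""
    pattern = r"\b" + re.escape(name) + r"\s*=\s*'([^']*)'"
    match = re.search(pattern, ddl, flags=re.IGNORECASE)
    return match.group(1).upper() if match else None


def categorize(ddl):
    """Map SHOW CREATE TABLE text to 'iceberg', 'migrate', 'rewrite', or 'other'."""
    table_type = table_property(ddl, "type")
    if table_type == "ICEBERG":
        return "iceberg"
    if table_type == "HIVE":
        if table_property(ddl, "format") in IN_PLACE_FORMATS:
            return "migrate"
        return "rewrite"
    return "other"
```

Inside the loop, `categorize(cts)` could then select which of the four lists a table is appended to. A text match is still a heuristic: a table comment containing something like `type = '...'` could be misread.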

---
## Begin migration

### Perform in-place migrations

Run `ALTER TABLE table_name SET PROPERTIES type = 'ICEBERG'` on Hive tables backed by the ORC, Parquet, or Avro file formats.

In [142]:
print("\n")
print("+++++++++++++++++++++++++++++")
print("++++ IN-PLACE MIGRATIONS ++++")
print("+++++++++++++++++++++++++++++")
print("\n")

# NEED TO HANDLE ANY EXCEPTIONS THAT MIGHT BE RAISED
#  these would likely be caused by invalid data types, for starters;
#  once identified, such tables could be added to the shadow migration list

for tbl in hive_tables_2migrate:
    print("in-place migration > " + tbl)
    session.sql("ALTER TABLE " + tbl + " SET PROPERTIES type = 'ICEBERG'").show()



+++++++++++++++++++++++++++++
++++ IN-PLACE MIGRATIONS ++++
+++++++++++++++++++++++++++++


in-place migration > students.mv2ice.tpch_cust_hive_avro
----------
|status  |
----------
|ok      |
----------

in-place migration > students.mv2ice.tpch_cust_hive_orc
----------
|status  |
----------
|ok      |
----------

in-place migration > students.mv2ice.tpch_cust_hive_parquet
----------
|status  |
----------
|ok      |
----------

in-place migration > students.mv2ice.tpch_orders_hive_avro
----------
|status  |
----------
|ok      |
----------

in-place migration > students.mv2ice.tpch_orders_hive_orc
----------
|status  |
----------
|ok      |
----------

in-place migration > students.mv2ice.tpch_orders_hive_parquet
----------
|status  |
----------
|ok      |
----------
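The in-place migration cell notes that exceptions still need to be handled. One possible shape for that is sketched here: `migrate_in_place` is a hypothetical helper, and `run_sql` is an assumed callable that executes one statement (in the notebook it could be `lambda stmt: session.sql(stmt).collect()`). Catching the broad `Exception` is a deliberate simplification so that one failing table does not stop the rest.

```python
def migrate_in_place(tables, run_sql):
    """Try the in-place ALTER on each table; return (migrated, failed)."""
    migrated, failed = [], []
    for tbl in tables:
        stmt = "ALTER TABLE " + tbl + " SET PROPERTIES type = 'ICEBERG'"
        try:
            run_sql(stmt)
            migrated.append(tbl)
        except Exception as err:  # broad on purpose: record the error and continue
            failed.append((tbl, str(err)))
    return migrated, failed
```

Tables that end up in `failed` could then be appended to `hive_tables_2rewrite` so that the shadow migration picks them up, as the comment in the cell suggests.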



### Perform shadow migrations

Perform the following steps for Hive tables backed by file formats not supported directly by Iceberg.

- `ALTER TABLE table_name RENAME TO hold_name`
- `CREATE TABLE table_name WITH (with_props) AS SELECT * FROM hold_name`
- `ALTER TABLE hold_name RENAME TO rm_name`
- Add `rm_name` to a collection to later be dropped, if desired
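The steps above can be previewed without touching the cluster by generating the statements first. This is a dry-run sketch: `shadow_plan` is a hypothetical helper, and it assumes the table name has exactly three dot-separated parts with no dots inside quoted identifiers.

```python
def shadow_plan(tbl, with_props):
    """Return (statements, rm_name) for one shadow migration, without running them."""
    catalog, schema, name = tbl.split(".")
    hold_name = f"{catalog}.{schema}.hold_{name}"
    rm_name = f"{catalog}.{schema}.rm_hold_{name}"
    statements = [
        f"ALTER TABLE {tbl} RENAME TO {hold_name}",
        f"CREATE TABLE {tbl} WITH ({with_props}) AS SELECT * FROM {hold_name}",
        f"ALTER TABLE {hold_name} RENAME TO {rm_name}",
    ]
    return statements, rm_name
```

Printing the plan for every entry in `hive_tables_2rewrite` before executing anything gives a chance to review the generated names and the `WITH` properties.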

In [143]:
print("\n")
print("+++++++++++++++++++++++++++++")
print("++++ SHADOW MIGRATIONS ++++++")
print("+++++++++++++++++++++++++++++")
print("+++ with_props > " + with_props)
print("+++++++++++++++++++++++++++++")
print("\n")

old_tbls2rm = list()

for tbl in hive_tables_2rewrite:
    print("******** shadow migration > " + tbl + "\n")
    fqtn = tbl.split(".")
    hold_name = fqtn[0]+"."+fqtn[1]+".hold_"+fqtn[2]
    alter2hold_cmd = "ALTER TABLE " + tbl + " RENAME TO " + hold_name
    print(alter2hold_cmd)
    session.sql(alter2hold_cmd).show()
    
    # NEED TO TACKLE PARTITIONS BY LOOPING THROUGH THEM INSERTING FROM MOST RECENT TO LEAST RECENT
    #  could even create a union with the old and new table then once new partition is committed,
    #  quickly drop the old partition (briefly have 2 copies of each partition!!)

    ctas_cmd = "CREATE TABLE " + tbl + " WITH (" + with_props + ") AS SELECT * FROM " + hold_name
    print(ctas_cmd)
    session.sql(ctas_cmd).show()
    
    fqtn = hold_name.split(".")
    rm_name = fqtn[0]+"."+fqtn[1]+".rm_"+fqtn[2]
    alter2rm_cmd = "ALTER TABLE " + hold_name + " RENAME TO " + rm_name
    print(alter2rm_cmd)
    session.sql(alter2rm_cmd).show()
    
    # hold on to the rm_name for possible deletions later
    old_tbls2rm.append(rm_name)



+++++++++++++++++++++++++++++
++++ SHADOW MIGRATIONS ++++++
+++++++++++++++++++++++++++++
+++ with_props > type='iceberg', format='orc'
+++++++++++++++++++++++++++++
******** shadow migration > students.mv2ice.tpch_cust_hive_json

ALTER TABLE students.mv2ice.tpch_cust_hive_json RENAME TO students.mv2ice.hold_tpch_cust_hive_json
----------
|status  |
----------
|ok      |
----------

CREATE TABLE students.mv2ice.tpch_cust_hive_json WITH (type='iceberg', format='orc') AS SELECT * FROM students.mv2ice.hold_tpch_cust_hive_json
----------
|status  |
----------
|ok      |
----------

ALTER TABLE students.mv2ice.hold_tpch_cust_hive_json RENAME TO students.mv2ice.rm_hold_tpch_cust_hive_json
----------
|status  |
----------
|ok      |
----------

******** shadow migration > students.mv2ice.tpch_cust_hive_textfile

ALTER TABLE students.mv2ice.tpch_cust_hive_textfile RENAME TO students.mv2ice.hold_tpch_cust_hive_textfile
----------
|status  |
----------
|ok      |
----------

CREATE TABLE stude

### Optionally, delete migrated tables

If desired, run `DROP TABLE` commands on the original Hive tables that were migrated with the shadow (i.e. CTAS) approach.

NOTE: Their names are prefixed with `rm_hold_`.

In [144]:
# cleanup of the original tables that were shadow migrated

for tbl in old_tbls2rm:
    print("\ndropping original table > " + tbl)
    session.sql("DROP TABLE " + tbl).show()


dropping original table > students.mv2ice.rm_hold_tpch_cust_hive_json
----------
|status  |
----------
|ok      |
----------


dropping original table > students.mv2ice.rm_hold_tpch_cust_hive_textfile
----------
|status  |
----------
|ok      |
----------


dropping original table > students.mv2ice.rm_hold_tpch_nation_hive_json
----------
|status  |
----------
|ok      |
----------


dropping original table > students.mv2ice.rm_hold_tpch_nation_hive_textfile
----------
|status  |
----------
|ok      |
----------


dropping original table > students.mv2ice.rm_hold_tpch_orders_hive_json
----------
|status  |
----------
|ok      |
----------


dropping original table > students.mv2ice.rm_hold_tpch_orders_hive_textfile
----------
|status  |
----------
|ok      |
----------

