# Iceberg Lab 
## Unit 6: Time Travel

In the previous unit, we:
1. Explored the metadata inspection tables that Iceberg provides


In this unit, we will:
1. Explore the time travel feature of Iceberg tables

### 1. Imports

In [16]:
from datetime import datetime
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

import warnings
warnings.filterwarnings('ignore')

### 2. Create a Spark session powered by Cloud Dataproc 

In [1]:
spark = SparkSession.builder.appName('Loan Analysis').getOrCreate()
spark.sparkContext.setLogLevel("WARN")
spark

### 3. Declare variables

In [2]:
project_id_output = !gcloud config list --format "value(core.project)" 2>/dev/null
PROJECT_ID = project_id_output[0]
print("PROJECT_ID: ", PROJECT_ID)

PROJECT_ID:  nikhim-iceberg-lab


In [3]:
project_name_output = !gcloud projects describe $PROJECT_ID | grep name | cut -d':' -f2 | xargs
PROJECT_NAME = project_name_output[0]
print("PROJECT_NAME: ", PROJECT_NAME)

PROJECT_NAME:  nikhim-iceberg-lab


In [4]:
project_number_output = !gcloud projects describe $PROJECT_ID | grep projectNumber | cut -d':' -f2 | xargs
PROJECT_NUMBER = project_number_output[0]
print("PROJECT_NUMBER: ", PROJECT_NUMBER)

PROJECT_NUMBER:  928505941962


In [5]:
DPMS_NAME=f"iceberg-hms-{PROJECT_NUMBER}"
LOCATION="us-central1"

metastore_dir = !gcloud metastore services describe $DPMS_NAME --location $LOCATION |grep 'hive.metastore.warehouse.dir'| cut -d':' -f2- | xargs 
HIVE_METASTORE_WAREHOUSE_DIR = metastore_dir[0]
print("HIVE_METASTORE_WAREHOUSE_DIR",HIVE_METASTORE_WAREHOUSE_DIR)

HIVE_METASTORE_WAREHOUSE_DIR gs://gcs-bucket-iceberg-hms-928505941962-71d67f3e-cf27-4b25-a996-86a/hive-warehouse


In [7]:
TABLE_NAME="loans_by_state_iceberg"
DB_NAME="loan_db"
#fully qualified table name
FQTN=f"{DB_NAME}.{TABLE_NAME}"
print("Fully qualified table name :",FQTN)

Fully qualified table name : loan_db.loans_by_state_iceberg


### 4. Time Travel

#### a. Time Travel with Snapshots

In [8]:
spark.sql(f"select committed_at, snapshot_id, operation from {FQTN}.snapshots").show(truncate=False)


+-----------------------+-------------------+---------+
|committed_at           |snapshot_id        |operation|
+-----------------------+-------------------+---------+
|2023-02-10 15:26:16.364|3648627921780331930|append   |
|2023-02-10 22:29:45.022|5222601969543758311|overwrite|
|2023-02-10 22:33:06.455|9145457862466461068|append   |
|2023-02-10 22:35:11.244|8627182030940064924|overwrite|
|2023-02-10 22:36:30.093|2697368997376323351|overwrite|
|2023-02-10 22:44:01.95 |5865803199727045458|overwrite|
+-----------------------+-------------------+---------+



                                                                                

**Note: Before running the statements below, replace the snapshot-id values with IDs returned by the query above in your own environment.**

In [11]:
print("Table state at snapshot-id '2697368997376323351'")
spark.read.option("snapshot-id","2697368997376323351").format("iceberg").load(f"{FQTN}").filter(col('addr_state').isin('IA','AZ','CA','IN')).show(truncate=False)
      

print("Table state at snapshot-id '3648627921780331930'")
spark.read.option("snapshot-id","3648627921780331930").format("iceberg").load(f"{FQTN}").filter(col('addr_state').isin('IA','AZ','CA','IN')).show(truncate=False)
   
print("Table state at latest snapshot")
spark.read.format("iceberg").load(f"{FQTN}").filter(col('addr_state').isin('IA','AZ','CA','IN')).show(truncate=False)


Table state at snapshot-id '2697368997376323351'


                                                                                

+----------+----------+
|addr_state|loan_count|
+----------+----------+
|IN        |7511      |
|AZ        |11111     |
+----------+----------+

Table state at snapshot-id '3648627921780331930'


                                                                                

+----------+----------+
|addr_state|loan_count|
+----------+----------+
|AZ        |10318     |
|CA        |62090     |
|IN        |7511      |
|IA        |1         |
+----------+----------+

Table state at latest snapshot
+----------+----------+
|addr_state|loan_count|
+----------+----------+
|CA        |11111     |
|IA        |11111     |
|IN        |11111     |
|AZ        |11111     |
+----------+----------+



#### b. Time Travel with Timestamps

In [14]:
#checking all updates to table
spark.table(f"{FQTN}.history").show(truncate=False)

+-----------------------+-------------------+-------------------+-------------------+
|made_current_at        |snapshot_id        |parent_id          |is_current_ancestor|
+-----------------------+-------------------+-------------------+-------------------+
|2023-02-10 15:26:16.364|3648627921780331930|null               |true               |
|2023-02-10 22:29:45.022|5222601969543758311|3648627921780331930|true               |
|2023-02-10 22:33:06.455|9145457862466461068|5222601969543758311|true               |
|2023-02-10 22:35:11.244|8627182030940064924|9145457862466461068|true               |
|2023-02-10 22:36:30.093|2697368997376323351|8627182030940064924|true               |
|2023-02-10 22:44:01.95 |5865803199727045458|2697368997376323351|true               |
+-----------------------+-------------------+-------------------+-------------------+



**Note: Before running the statements below, replace the timestamp values for _'dt1'_ and _'dt2'_ with timestamps taken from the query result above in your own environment.**

In [17]:

dt_fmt = "%Y-%m-%d %H:%M:%S"

dt1 = '2023-02-10 22:35:11'
dt1_millis = int(datetime.strptime(dt1, dt_fmt).timestamp() * 1000)  # epoch millis in the local timezone; strftime("%s") is a non-portable extension
print("Table state at timestamp ",dt1)
spark.read.option("as-of-timestamp",dt1_millis).format("iceberg").load(f"{FQTN}").filter(col('addr_state').isin('IA','AZ','CA','IN')).show(truncate=False)
      

dt2 = '2023-02-10 22:29:45'
dt2_millis = int(datetime.strptime(dt2, dt_fmt).timestamp() * 1000)  # epoch millis in the local timezone; strftime("%s") is a non-portable extension
print("Table state at timestamp ",dt2)
spark.read.option("as-of-timestamp",dt2_millis).format("iceberg").load(f"{FQTN}").filter(col('addr_state').isin('IA','AZ','CA','IN')).show(truncate=False)
   
    
print("Table state at latest timestamp")
spark.read.format("iceberg").load(f"{FQTN}").filter(col('addr_state').isin('IA','AZ','CA','IN')).show(truncate=False)


Table state at timestamp  2023-02-10 22:35:11
+----------+----------+
|addr_state|loan_count|
+----------+----------+
|CA        |62090     |
|IN        |7511      |
|IA        |1         |
|AZ        |50000     |
+----------+----------+

Table state at timestamp  2023-02-10 22:29:45
+----------+----------+
|addr_state|loan_count|
+----------+----------+
|AZ        |10318     |
|CA        |62090     |
|IN        |7511      |
|IA        |1         |
+----------+----------+

Table state at latest timestamp
+----------+----------+
|addr_state|loan_count|
+----------+----------+
|CA        |11111     |
|IA        |11111     |
|IN        |11111     |
|AZ        |11111     |
+----------+----------+
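
The `as-of-timestamp` option selects the snapshot that was current at the given time, that is, the most recent snapshot made current at or before that timestamp. The sketch below illustrates this resolution in plain Python, using the rows from the history output above. The names `to_epoch_millis` and `snapshot_as_of` are hypothetical helpers written for this illustration and are not part of Iceberg or PySpark. It also assumes the timestamps are in UTC; adjust `tz` if your session uses a different timezone.

```python
from datetime import datetime, timedelta, timezone

DT_FMT = "%Y-%m-%d %H:%M:%S"
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

# (made_current_at, snapshot_id) rows copied from the history output above.
HISTORY = [
    ("2023-02-10 15:26:16.364", 3648627921780331930),
    ("2023-02-10 22:29:45.022", 5222601969543758311),
    ("2023-02-10 22:33:06.455", 9145457862466461068),
    ("2023-02-10 22:35:11.244", 8627182030940064924),
    ("2023-02-10 22:36:30.093", 2697368997376323351),
    ("2023-02-10 22:44:01.95", 5865803199727045458),
]


def to_epoch_millis(ts, tz=timezone.utc):
    """Parse 'YYYY-MM-DD HH:MM:SS[.fff]' as a time in tz (assumed UTC) and
    return milliseconds since the Unix epoch, using integer arithmetic."""
    fmt = DT_FMT + ".%f" if "." in ts else DT_FMT
    parsed = datetime.strptime(ts, fmt).replace(tzinfo=tz)
    return (parsed - EPOCH) // timedelta(milliseconds=1)


def snapshot_as_of(history, ts_millis):
    """Return the id of the latest snapshot made current at or before ts_millis."""
    candidates = [
        (to_epoch_millis(made_current_at), snapshot_id)
        for made_current_at, snapshot_id in history
        if to_epoch_millis(made_current_at) <= ts_millis
    ]
    if not candidates:
        raise ValueError("no snapshot was current at that time")
    return max(candidates)[1]


for ts in ("2023-02-10 22:35:11", "2023-02-10 22:29:45"):
    print(ts, "->", snapshot_as_of(HISTORY, to_epoch_millis(ts)))
```

Because `dt2` is truncated to whole seconds, it falls just before the commit at `22:29:45.022` and therefore resolves to the first snapshot, `3648627921780331930`. This agrees with the counts printed for that snapshot-id in section a.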



**NOTE:** 

In Spark 3.3 and later, Iceberg supports the SQL clauses `TIMESTAMP AS OF` and `VERSION AS OF`, which make time travel easier.
If your kernel is running **Spark 3.3** or later, try the statements below, substituting a timestamp and a snapshot ID from your own table:


_1. spark.sql("SELECT * FROM loan_db.loans_by_state_iceberg TIMESTAMP AS OF '2023-02-09 04:51:39'")_

_2. spark.sql("SELECT * FROM loan_db.loans_by_state_iceberg VERSION AS OF '7858592723528865925'")_
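
To avoid hand-editing these statements, the query text can be assembled from the notebook's variables. The following is a minimal sketch; `time_travel_sql` is a hypothetical helper introduced here for illustration, not an Iceberg or PySpark function. It emits the snapshot ID as an unquoted number, which is the form used in the Iceberg Spark documentation for `VERSION AS OF`.

```python
def time_travel_sql(fqtn, *, timestamp=None, snapshot_id=None):
    """Build a Spark SQL time travel query (hypothetical helper).

    Exactly one of `timestamp` (a 'YYYY-MM-DD HH:MM:SS' string) or
    `snapshot_id` (an int or numeric string) must be given.
    """
    if (timestamp is None) == (snapshot_id is None):
        raise ValueError("pass exactly one of timestamp or snapshot_id")
    if timestamp is not None:
        return f"SELECT * FROM {fqtn} TIMESTAMP AS OF '{timestamp}'"
    # int() rejects anything that is not a plain integer snapshot ID.
    return f"SELECT * FROM {fqtn} VERSION AS OF {int(snapshot_id)}"


# Values taken from the snapshots and history output earlier in this unit.
fqtn = "loan_db.loans_by_state_iceberg"
print(time_travel_sql(fqtn, timestamp="2023-02-10 22:35:11"))
print(time_travel_sql(fqtn, snapshot_id="2697368997376323351"))
```

In the notebook, the returned string could then be passed to `spark.sql(...)` on a kernel running Spark 3.3 or later.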


### THIS CONCLUDES THIS UNIT. PROCEED TO THE NEXT NOTEBOOK