# Apache Hudi Time Travel Queries Example

This notebook demonstrates how to use **Time Travel Queries** in Apache Hudi using PySpark. We will:

- Generate a large synthetic dataset
- Create a Hudi table
- Upsert a subset of the records
- Capture commit timestamps
- Query historical versions of the data using the `as.of.instant` read option

In [6]:
from pyspark.sql import SparkSession

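# Assumes the Hudi Spark bundle is already on the Spark classpath
# (for example via spark.jars.packages); this cell only sets session options.
spark = SparkSession.builder \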
    .appName("SimpleHudiCreate") \
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
    .config("spark.sql.extensions", "org.apache.spark.sql.hudi.HoodieSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.hudi.catalog.HoodieCatalog") \
    .getOrCreate()

In [7]:
from datetime import datetime
from pyspark.sql.functions import lit
import uuid

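def generate_data(start_id, count, name_prefix):
    """Build `count` rows with ids in [start_id, start_id + count)."""
    # spark.range produces the `id` column used below as the Hudi record key.
    # `name`, `ts`, and `uuid` are literals, so every row in a batch shares them;
    # a later batch gets a lexicographically larger ISO-8601 `ts` string.
    return spark.range(start_id, start_id + count) \
        .withColumn("name", lit(name_prefix)) \
        .withColumn("ts", lit(datetime.now().isoformat())) \
        .withColumn("uuid", lit(str(uuid.uuid4())))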

In [8]:
base_path = "/home/jovyan/hudi"
table_name = "hudi_time_travel_table"

In [9]:
df_initial = generate_data(0, 1000000, "initial")

df_initial.write.format("hudi") \
    .option("hoodie.table.name", table_name) \
    .option("hoodie.datasource.write.recordkey.field", "id") \
    .option("hoodie.datasource.write.precombine.field", "ts") \
    .option("hoodie.datasource.write.operation", "insert") \
    .option("hoodie.datasource.write.table.type", "COPY_ON_WRITE") \
    .mode("overwrite") \
    .save(base_path)

In [10]:
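# `_hoodie_commit_time` holds the instant of the commit that last wrote each record,
# so this lists only commits that still own at least one row in the latest snapshot.
commits = spark.read.format("hudi").load(base_path) \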
    .select("_hoodie_commit_time").distinct().orderBy("_hoodie_commit_time", ascending=True)

commit_times = [row["_hoodie_commit_time"] for row in commits.collect()]
first_commit = commit_times[0]
print("First commit time:", first_commit)

First commit time: 20250714160208465
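
The printed value is a Hudi instant, which in this output has the form `yyyyMMddHHmmssSSS` (17 digits, millisecond precision). As a small sketch, the hypothetical helper `parse_hudi_instant` below (not part of Hudi or of this notebook) turns such a string into a Python `datetime`. It assumes the instant is either 14 or 17 digits long and leaves the result timezone-naive, because the string itself carries no zone information.

```python
from datetime import datetime


def parse_hudi_instant(instant: str) -> datetime:
    """Parse a yyyyMMddHHmmss[SSS] instant string into a naive datetime.

    Hypothetical helper for illustration; not a Hudi API.
    """
    if not instant.isdigit() or len(instant) not in (14, 17):
        raise ValueError(f"unexpected instant format: {instant!r}")
    seconds_part, millis_part = instant[:14], instant[14:]
    parsed = datetime.strptime(seconds_part, "%Y%m%d%H%M%S")
    # The optional trailing three digits are milliseconds.
    millis = int(millis_part) if millis_part else 0
    return parsed.replace(microsecond=millis * 1000)
```

For the value printed above, `parse_hudi_instant("20250714160208465")` corresponds to 14 July 2025 at 16:02:08.465.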


In [11]:
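import time
time.sleep(2)  # pause so the next batch gets a clearly later `ts` and commit time

# ids 500000-599999 already exist, so this batch updates rows rather than adding new ones
df_upsert = generate_data(500000, 100000, "upserted")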

df_upsert.write.format("hudi") \
    .option("hoodie.table.name", table_name) \
    .option("hoodie.datasource.write.recordkey.field", "id") \
    .option("hoodie.datasource.write.precombine.field", "ts") \
    .option("hoodie.datasource.write.operation", "upsert") \
    .mode("append") \
    .save(base_path)
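
The insert and upsert cells repeat most of the same writer options. One possible way to keep them in sync is a small helper that builds the option dictionary. `hudi_write_options` is a hypothetical name introduced here; the keys and values are exactly the ones used in the cells above.

```python
def hudi_write_options(table_name, operation, table_type=None):
    """Build the Hudi writer options used in this notebook (illustrative helper)."""
    options = {
        "hoodie.table.name": table_name,
        "hoodie.datasource.write.recordkey.field": "id",
        "hoodie.datasource.write.precombine.field": "ts",
        "hoodie.datasource.write.operation": operation,
    }
    # In this notebook the table type is only passed on the initial write.
    if table_type is not None:
        options["hoodie.datasource.write.table.type"] = table_type
    return options
```

With this in place, the upsert could be written as `df_upsert.write.format("hudi").options(**hudi_write_options(table_name, "upsert")).mode("append").save(base_path)`.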

In [12]:
commits = spark.read.format("hudi").load(base_path) \
    .select("_hoodie_commit_time").distinct().orderBy("_hoodie_commit_time", ascending=True)

commit_times = [row["_hoodie_commit_time"] for row in commits.collect()]
second_commit = commit_times[-1]
print("Second commit time:", second_commit)

Second commit time: 20250714160302927
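
With two commits on the timeline, it can help to know which commit a given point in time falls under. As far as the Hudi documentation describes it, `as.of.instant` reads the table as it was at the given instant, so a value between two commits should resolve to the earlier one. The sketch below models that lookup in plain Python. `latest_commit_at_or_before` is a hypothetical helper, and it assumes all instants are digit strings of the same length, so that string order matches chronological order.

```python
from bisect import bisect_right


def latest_commit_at_or_before(commit_times, instant):
    """Return the newest commit instant <= `instant`, or None if there is none.

    Illustrative helper; assumes equal-length digit strings such as
    yyyyMMddHHmmssSSS, which sort chronologically as plain strings.
    """
    ordered = sorted(commit_times)
    # bisect_right gives the number of commits that are <= instant.
    index = bisect_right(ordered, instant)
    return ordered[index - 1] if index > 0 else None
```

For example, with the two commit times printed in this notebook, an instant of `"20250714160230000"` lies between them and maps to the first commit.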


In [13]:
df_first_version = spark.read.format("hudi") \
    .option("as.of.instant", first_commit) \
    .load(base_path)

print("Record count at first commit:", df_first_version.count())

Record count at first commit: 1000000


In [14]:
df_second_version = spark.read.format("hudi") \
    .option("as.of.instant", second_commit) \
    .load(base_path)

print("Record count at second commit:", df_second_version.count())

Record count at second commit: 1000000
