# Demo2 : Schema Validation and Evolution in an S3 Datalake using Apache Hudi

## Table of Contents:

1. [Overview](#Overview)
2. [Bulk Insert the Initial Dataset](#Bulk-Insert-the-Initial-Dataset)
3. [Schema Validation](#Schema-Validation)
4. [Schema Evolution](#Schema-Evolution)

## Overview

This notebook continues from the first notebook and covers the following concepts when writing Copy-On-Write tables to an S3 Datalake:

- Schema Validation.
- Schema Evolution.

There are 4 cases:

1. Columns are added in the middle - which works automatically in Hudi.
2. Columns are dropped i.e. missing - which is a compatible change as nulls are expected for newer records.
3. Column types are changed - which is an **incompatible** change. You could try automatic casting as a resolution here.
4. Columns are added at the end - which is a compatible change.

The assumption is that all case mismatches in column names have been resolved by lowercasing the field names.
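
As a minimal sketch of that assumption, the helper below lowercases a list of column names and refuses to continue if two names collide after lowercasing. The function name `lowercase_columns` is hypothetical and not part of the original notebook; it works on plain Python lists so it can be tried without a Spark session.

```python
from collections import Counter


def lowercase_columns(names):
    """Return the lowercased column names (hypothetical helper).

    Raises ValueError if two names become identical after lowercasing,
    because such columns could no longer be told apart.
    """
    lowered = [name.lower() for name in names]
    collisions = sorted(n for n, count in Counter(lowered).items() if count > 1)
    if collisions:
        raise ValueError("Columns collide after lowercasing: " + ", ".join(collisions))
    return lowered


print(lowercase_columns(["ID", "Sk", "txt"]))
```

With a Spark DataFrame `df`, the result could be applied with `df = df.toDF(*lowercase_columns(df.columns))`.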

**This demo was run on a single node (r5.4xlarge) EMR Cluster version 6.1, i.e. Hudi 0.5.1** <br>
**This demo focuses on a flattened schema. If you have a complex nested schema, there are other (solvable) challenges not discussed in this notebook.**

Let's start by initializing the Spark Session to connect this notebook to our Spark EMR cluster:

Note that the files hudi-spark-bundle.jar and spark-avro.jar are copied into HDFS.

In [1]:
%%configure -f
{
    "conf":  { 
             "spark.jars":"hdfs:///httpclient-4.5.9.jar,hdfs:///hudi-spark-bundle.jar,hdfs:///spark-avro.jar",
             "spark.serializer":"org.apache.spark.serializer.KryoSerializer",
             "spark.sql.hive.convertMetastoreParquet":"false",
             "spark.dynamicAllocation.executorIdleTimeout": 3600,
             "spark.executor.memory": "7G",
             "spark.executor.cores": 1,
             "spark.dynamicAllocation.initialExecutors":16
           } 
}

In [2]:
## CHANGE ME ##
config = {
    "table_name": "example_hudi_table",
    "target": "s3://[s3_bucket]/tmp/hudi/example_hudi_table",
    "primary_key": "id",
    "sort_key": "sk",
    "commits_to_retain": "2"
}

Starting Spark application


ID,YARN Application ID,Kind,State,Spark UI,Driver log,Current session?
45,application_1603824359861_0051,pyspark,idle,Link,Link,✔


SparkSession available as 'spark'.


The constants for Python to use:

In [3]:
# General Constants
HUDI_FORMAT = "org.apache.hudi"
TABLE_NAME = "hoodie.table.name"
RECORDKEY_FIELD_OPT_KEY = "hoodie.datasource.write.recordkey.field"
PRECOMBINE_FIELD_OPT_KEY = "hoodie.datasource.write.precombine.field"
OPERATION_OPT_KEY = "hoodie.datasource.write.operation"
BULK_INSERT_OPERATION_OPT_VAL = "bulk_insert"
UPSERT_OPERATION_OPT_VAL = "upsert"
BULK_INSERT_PARALLELISM = "hoodie.bulkinsert.shuffle.parallelism"
UPSERT_PARALLELISM = "hoodie.upsert.shuffle.parallelism"
S3_CONSISTENCY_CHECK = "hoodie.consistency.check.enabled"
HUDI_CLEANER_POLICY = "hoodie.cleaner.policy"
KEEP_LATEST_COMMITS = "KEEP_LATEST_COMMITS"
HUDI_COMMITS_RETAINED = "hoodie.cleaner.commits.retained"
PAYLOAD_CLASS_OPT_KEY = "hoodie.datasource.write.payload.class"
EMPTY_PAYLOAD_CLASS_OPT_VAL = "org.apache.hudi.EmptyHoodieRecordPayload"

# Hive Constants
HIVE_SYNC_ENABLED_OPT_KEY="hoodie.datasource.hive_sync.enable"
HIVE_PARTITION_FIELDS_OPT_KEY="hoodie.datasource.hive_sync.partition_fields"
HIVE_ASSUME_DATE_PARTITION_OPT_KEY="hoodie.datasource.hive_sync.assume_date_partitioning"
HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY="hoodie.datasource.hive_sync.partition_extractor_class"
HIVE_TABLE_OPT_KEY="hoodie.datasource.hive_sync.table"

# Partition Constants
NONPARTITION_EXTRACTOR_CLASS_OPT_VAL="org.apache.hudi.hive.NonPartitionedExtractor"
MULIPART_KEYS_EXTRACTOR_CLASS_OPT_VAL="org.apache.hudi.hive.MultiPartKeysValueExtractor"
KEYGENERATOR_CLASS_OPT_KEY="hoodie.datasource.write.keygenerator.class"
NONPARTITIONED_KEYGENERATOR_CLASS_OPT_VAL="org.apache.hudi.keygen.NonpartitionedKeyGenerator"
COMPLEX_KEYGENERATOR_CLASS_OPT_VAL="org.apache.hudi.keygen.ComplexKeyGenerator"
PARTITIONPATH_FIELD_OPT_KEY="hoodie.datasource.write.partitionpath.field"

Some helper functions:

In [4]:
## Generates Data
def get_new_json_data(start, count, increment=0):
    data = [{"id": i, "sk": i+increment, "txt": chr(65 + (i % 26))} for i in range(start, start + count)]
    return data

# Creates the Dataframe
def create_json_df(spark, data):
    sc = spark.sparkContext
    return spark.read.json(sc.parallelize(data, 2))


## Bulk Insert the Initial Dataset

Let's generate 4,000 records to load into our Data Lake:

In [5]:
spark.sql("drop table if exists "+config['table_name']).show(100,False)

++
||
++
++

In [6]:
df1 = create_json_df(spark, get_new_json_data(0, 4000))
print(df1.count())
df1.printSchema()
df1.show(3)

4000
root
 |-- id: long (nullable = true)
 |-- sk: long (nullable = true)
 |-- txt: string (nullable = true)

+---+---+---+
| id| sk|txt|
+---+---+---+
|  0|  0|  A|
|  1|  1|  B|
|  2|  2|  C|
+---+---+---+
only showing top 3 rows

And write the data to S3:

In [7]:
(df1.write.format(HUDI_FORMAT)
      .option(PRECOMBINE_FIELD_OPT_KEY, config["sort_key"])
      .option(RECORDKEY_FIELD_OPT_KEY, config["primary_key"])
      .option(TABLE_NAME, config['table_name'])
      .option(OPERATION_OPT_KEY, BULK_INSERT_OPERATION_OPT_VAL)
      .option(BULK_INSERT_PARALLELISM, 3)
      .option(S3_CONSISTENCY_CHECK, "true")
      .option(HIVE_TABLE_OPT_KEY,config['table_name'])
      .option(HIVE_SYNC_ENABLED_OPT_KEY,"true")
      .option(HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY,NONPARTITION_EXTRACTOR_CLASS_OPT_VAL)
      .option(KEYGENERATOR_CLASS_OPT_KEY,NONPARTITIONED_KEYGENERATOR_CLASS_OPT_VAL)
      .mode("Overwrite")
      .save(config['target']))
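
The write calls in this notebook repeat the same long chain of options. As a sketch only, those options could be collected in one dictionary. The helper name `hudi_write_options` is hypothetical; the option keys and values are the ones defined in the constants cell above, and the `config` dictionary repeats the notebook's sample values.

```python
def hudi_write_options(config, operation, parallelism):
    """Build the Hudi options shared by the writes in this notebook (hypothetical helper)."""
    options = {
        "hoodie.table.name": config["table_name"],
        "hoodie.datasource.write.recordkey.field": config["primary_key"],
        "hoodie.datasource.write.precombine.field": config["sort_key"],
        "hoodie.datasource.write.operation": operation,
        "hoodie.consistency.check.enabled": "true",
        "hoodie.datasource.hive_sync.enable": "true",
        "hoodie.datasource.hive_sync.table": config["table_name"],
        "hoodie.datasource.hive_sync.partition_extractor_class": "org.apache.hudi.hive.NonPartitionedExtractor",
        "hoodie.datasource.write.keygenerator.class": "org.apache.hudi.keygen.NonpartitionedKeyGenerator",
    }
    if operation == "bulk_insert":
        options["hoodie.bulkinsert.shuffle.parallelism"] = str(parallelism)
    else:
        # Upserts in this notebook also configure the cleaner.
        options["hoodie.upsert.shuffle.parallelism"] = str(parallelism)
        options["hoodie.cleaner.policy"] = "KEEP_LATEST_COMMITS"
        options["hoodie.cleaner.commits.retained"] = config["commits_to_retain"]
    return options


# Sample values taken from the notebook's config cell.
config = {
    "table_name": "example_hudi_table",
    "primary_key": "id",
    "sort_key": "sk",
    "commits_to_retain": "2",
}
for key, value in sorted(hudi_write_options(config, "bulk_insert", 3).items()):
    print(key, "=", value)
```

A write could then be expressed as `df1.write.format("org.apache.hudi").options(**hudi_write_options(config, "bulk_insert", 3)).mode("Overwrite").save(target)`, where `target` stands for the S3 path from the notebook's config.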

In [8]:
spark.sql("show tables").show(100,False)

+--------+------------------+-----------+
|database|tableName         |isTemporary|
+--------+------------------+-----------+
|default |example_hudi_table|false      |
+--------+------------------+-----------+

In [9]:
spark.sql("Select id, sk, txt from "+config['table_name'] + " limit 5").show(100,False)

+----+----+---+
|id  |sk  |txt|
+----+----+---+
|2139|2139|H  |
|214 |214 |G  |
|2140|2140|I  |
|2141|2141|J  |
|2142|2142|K  |
+----+----+---+

## Schema Validation

Let's see how we can easily implement Schema Validation in Apache Hudi with just a few lines of code.

There are 4 cases:


1. Columns are added in the middle - which strict validation flags as a mismatch, although the upsert in this demo still succeeds.
2. Columns are dropped i.e. missing - which is a compatible change as nulls are expected for newer records.
3. Column types are changed - which is an **incompatible** change. You could try automatic casting as a resolution here.
4. Columns are added at the end - which is a compatible change.
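
To make the cases easier to tell apart, here is a small sketch that compares two schemas given as lists of `(name, type)` pairs and reports which columns were added, dropped, or changed type. The function name `diff_schemas` and the sample schemas are assumptions for illustration, not part of the original notebook.

```python
def diff_schemas(table_fields, incoming_fields):
    """Classify differences between two schemas (hypothetical helper).

    Both arguments are lists of (column name, type string) pairs.
    """
    table = dict(table_fields)
    incoming = dict(incoming_fields)
    return {
        # In the incoming data but not in the table.
        "added": [name for name, _ in incoming_fields if name not in table],
        # In the table but missing from the incoming data.
        "dropped": [name for name, _ in table_fields if name not in incoming],
        # Present on both sides with different types.
        "retyped": [name for name, typ in incoming_fields if name in table and table[name] != typ],
    }


table_fields = [("id", "bigint"), ("sk", "bigint"), ("txt", "string")]
incoming_fields = [("id", "bigint"), ("new_col", "bigint"), ("sk", "bigint"), ("txt", "bigint")]
print(diff_schemas(table_fields, incoming_fields))
```

In PySpark, `DataFrame.dtypes` returns pairs in this shape, so the helper could be fed `table_df.dtypes` and `incoming_df.dtypes`.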

### Adding Columns in the middle

Let's add a new column called 'new_col':

In [10]:
## Generates Data
def get_json_data(start, count, increment=0):
    data = [{"id": i, "sk": i+increment, "new_col": i, "txt": chr(65 + (i % 26))} for i in range(start, start + count)]
    return data

idf = create_json_df(spark, get_json_data(0, 100))
idf.printSchema()

root
 |-- id: long (nullable = true)
 |-- new_col: long (nullable = true)
 |-- sk: long (nullable = true)
 |-- txt: string (nullable = true)

In [11]:
def validateSchema(df, table):
    original_df=spark.sql("SELECT * FROM "+table+" LIMIT 0")
    columns_to_drop = ['_hoodie_commit_time', '_hoodie_commit_seqno','_hoodie_record_key','_hoodie_partition_path','_hoodie_file_name']
    odf = original_df.drop(*columns_to_drop)
    if (df.schema != odf.schema):
        print ("Schema Validation Failed : Incoming Schema is not compatible with existing Table.")
        if len(odf.schema) > len(df.schema):
            print ("Original Data Diff : "+str(set(odf.schema)-set(df.schema)))
        else:
            print ("Incoming Data Diff : "+str(set(df.schema)-set(odf.schema)))
        return False
    return True

In [12]:
validateSchema(idf,config['table_name'])

Schema Validation Failed : Incoming Schema is not compatible with existing Table.
Incoming Data Diff : {StructField(new_col,LongType,true)}
False

Our validation function identified that a new column has been added.

Let's see what would happen if we try to insert this data:

In [13]:
(idf.write.format(HUDI_FORMAT)
      .option(PRECOMBINE_FIELD_OPT_KEY, config["sort_key"])
      .option(RECORDKEY_FIELD_OPT_KEY, config["primary_key"])
      .option(TABLE_NAME, config['table_name'])
      .option(OPERATION_OPT_KEY, UPSERT_OPERATION_OPT_VAL)
      .option(UPSERT_PARALLELISM, 20)
      .option(S3_CONSISTENCY_CHECK, "true")
      .option(HUDI_CLEANER_POLICY, KEEP_LATEST_COMMITS)
      .option(HUDI_COMMITS_RETAINED,config["commits_to_retain"])
      .option(HIVE_TABLE_OPT_KEY,config['table_name'])
      .option(HIVE_SYNC_ENABLED_OPT_KEY,"true")
      .option(HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY,NONPARTITION_EXTRACTOR_CLASS_OPT_VAL)
      .option(KEYGENERATOR_CLASS_OPT_KEY,NONPARTITIONED_KEYGENERATOR_CLASS_OPT_VAL)  
      .mode("Append")
      .save(config['target']))

The write succeeds. Note that the new rows have a value in column new_col:

In [14]:
spark.sql("Select id, new_col, sk, txt from "+config['table_name'] + " where new_col is not null limit 1").show(100,False)

+---+-------+---+---+
|id |new_col|sk |txt|
+---+-------+---+---+
|32 |32     |32 |G  |
+---+-------+---+---+

The older rows, by contrast, correctly have new_col set to null:

In [15]:
spark.sql("Select id, new_col, sk, txt from "+config['table_name'] + " where new_col is null limit 1").show(100,False)

+----+-------+----+---+
|id  |new_col|sk  |txt|
+----+-------+----+---+
|2139|null   |2139|H  |
+----+-------+----+---+

### Changing Column Types

Let's change the type of one column this time. We first recreate the table from the original dataset:

In [16]:
spark.sql("drop table if exists  "+config['table_name']).show(100,False)
spark.sql("show tables").show(100,False)
df1 = create_json_df(spark, get_new_json_data(0, 4000))
(df1.write.format(HUDI_FORMAT)
      .option(PRECOMBINE_FIELD_OPT_KEY, config["sort_key"])
      .option(RECORDKEY_FIELD_OPT_KEY, config["primary_key"])
      .option(TABLE_NAME, config['table_name'])
      .option(OPERATION_OPT_KEY, BULK_INSERT_OPERATION_OPT_VAL)
      .option(BULK_INSERT_PARALLELISM, 3)
      .option(S3_CONSISTENCY_CHECK, "true")
      .option(HIVE_TABLE_OPT_KEY,config['table_name'])
      .option(HIVE_SYNC_ENABLED_OPT_KEY,"true")
      .option(HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY,NONPARTITION_EXTRACTOR_CLASS_OPT_VAL)
      .option(KEYGENERATOR_CLASS_OPT_KEY,NONPARTITIONED_KEYGENERATOR_CLASS_OPT_VAL)
      .mode("Overwrite")
      .save(config['target']))
spark.sql("show tables").show(100,False)

++
||
++
++

+--------+---------+-----------+
|database|tableName|isTemporary|
+--------+---------+-----------+
+--------+---------+-----------+

+--------+------------------+-----------+
|database|tableName         |isTemporary|
+--------+------------------+-----------+
|default |example_hudi_table|false      |
+--------+------------------+-----------+

In [17]:
## Generates Data
def get_json_data(start, count, increment=0):
    data = [{"id": i, "sk": i+increment, "txt": (65 + (i % 26))} for i in range(start, start + count)]
    return data

In [18]:
df1 = create_json_df(spark, get_json_data(0, 100))
df1.printSchema()

root
 |-- id: long (nullable = true)
 |-- sk: long (nullable = true)
 |-- txt: long (nullable = true)

In [19]:
df1.show(3)

+---+---+---+
| id| sk|txt|
+---+---+---+
|  0|  0| 65|
|  1|  1| 66|
|  2|  2| 67|
+---+---+---+
only showing top 3 rows

In [20]:
validateSchema(df1,config['table_name'])

Schema Validation Failed : Incoming Schema is not compatible with existing Table.
Incoming Data Diff : {StructField(txt,LongType,true)}
False

Our function identified that the type of column txt has changed. Let's now write it to Hudi:

In [21]:
(df1.write.format(HUDI_FORMAT)
      .option(PRECOMBINE_FIELD_OPT_KEY, config["sort_key"])
      .option(RECORDKEY_FIELD_OPT_KEY, config["primary_key"])
      .option(TABLE_NAME, config['table_name'])
      .option(OPERATION_OPT_KEY, UPSERT_OPERATION_OPT_VAL)
      .option(UPSERT_PARALLELISM, 20)
      .option(S3_CONSISTENCY_CHECK, "true")
      .option(HUDI_CLEANER_POLICY, KEEP_LATEST_COMMITS)
      .option(HUDI_COMMITS_RETAINED,config["commits_to_retain"])
      .option(HIVE_TABLE_OPT_KEY,config['table_name'])
      .option(HIVE_SYNC_ENABLED_OPT_KEY,"true")
      .option(HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY,NONPARTITION_EXTRACTOR_CLASS_OPT_VAL)
      .option(KEYGENERATOR_CLASS_OPT_KEY,NONPARTITIONED_KEYGENERATOR_CLASS_OPT_VAL)  
      .mode("Append")
      .save(config['target']))

An error was encountered:
An error occurred while calling o215.save.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 88.0 failed 14 times, most recent failure: Lost task 1.13 in stage 88.0 (TID 528, ip-172-31-41-126.us-west-2.compute.internal, executor 2): org.apache.hudi.exception.HoodieUpsertException: Error upserting bucketType UPDATE for partition :1
	at org.apache.hudi.table.HoodieCopyOnWriteTable.handleUpsertPartition(HoodieCopyOnWriteTable.java:270)
	at org.apache.hudi.client.HoodieWriteClient.lambda$upsertRecordsInternal$9c951a5d$1(HoodieWriteClient.java:472)
	at org.apache.spark.api.java.JavaRDDLike.$anonfun$mapPartitionsWithIndex$1(JavaRDDLike.scala:102)
	at org.apache.spark.api.java.JavaRDDLike.$anonfun$mapPartitionsWithIndex$1$adapted(JavaRDDLike.scala:102)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsWithIndex$2(RDD.scala:889)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsWithIndex$2$adapted(RDD.scala:889)
	at org.apache.spar

We see that Hudi failed with the following error message when it tried to read the existing data as a number:
    
java.lang.UnsupportedOperationException: org.apache.parquet.avro.AvroConverters$FieldLongConverter

We will try to fix this later in this notebook by force-casting the column that changed type.

### Dropping Columns

Next, let's drop a column: we recreate the table and then send records without the txt column.


In [22]:
spark.sql("drop table if exists  "+config['table_name']).show(100,False)
spark.sql("show tables").show(100,False)
df1 = create_json_df(spark, get_new_json_data(0, 4000))
(df1.write.format(HUDI_FORMAT)
      .option(PRECOMBINE_FIELD_OPT_KEY, config["sort_key"])
      .option(RECORDKEY_FIELD_OPT_KEY, config["primary_key"])
      .option(TABLE_NAME, config['table_name'])
      .option(OPERATION_OPT_KEY, BULK_INSERT_OPERATION_OPT_VAL)
      .option(BULK_INSERT_PARALLELISM, 3)
      .option(S3_CONSISTENCY_CHECK, "true")
      .option(HIVE_TABLE_OPT_KEY,config['table_name'])
      .option(HIVE_SYNC_ENABLED_OPT_KEY,"true")
      .option(HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY,NONPARTITION_EXTRACTOR_CLASS_OPT_VAL)
      .option(KEYGENERATOR_CLASS_OPT_KEY,NONPARTITIONED_KEYGENERATOR_CLASS_OPT_VAL)
      .mode("Overwrite")
      .save(config['target']))
spark.sql("show tables").show(100,False)

++
||
++
++

+--------+---------+-----------+
|database|tableName|isTemporary|
+--------+---------+-----------+
+--------+---------+-----------+

+--------+------------------+-----------+
|database|tableName         |isTemporary|
+--------+------------------+-----------+
|default |example_hudi_table|false      |
+--------+------------------+-----------+

In [23]:
## Generates Data
def get_json_data(start, count, increment=0):
    data = [{"id": i, "sk": i+increment} for i in range(start, start + count)]
    return data

In [24]:
df1 = create_json_df(spark, get_json_data(0, 100))
df1.printSchema()

root
 |-- id: long (nullable = true)
 |-- sk: long (nullable = true)

In [25]:
df1.show(3)

+---+---+
| id| sk|
+---+---+
|  0|  0|
|  1|  1|
|  2|  2|
+---+---+
only showing top 3 rows

In [26]:
validateSchema(df1,config['table_name'])

Schema Validation Failed : Incoming Schema is not compatible with existing Table.
Original Data Diff : {StructField(txt,StringType,true)}
False

In [27]:
(df1.write.format(HUDI_FORMAT)
      .option(PRECOMBINE_FIELD_OPT_KEY, config["sort_key"])
      .option(RECORDKEY_FIELD_OPT_KEY, config["primary_key"])
      .option(TABLE_NAME, config['table_name'])
      .option(OPERATION_OPT_KEY, UPSERT_OPERATION_OPT_VAL)
      .option(UPSERT_PARALLELISM, 20)
      .option(S3_CONSISTENCY_CHECK, "true")
      .option(HUDI_CLEANER_POLICY, KEEP_LATEST_COMMITS)
      .option(HUDI_COMMITS_RETAINED,config["commits_to_retain"])
      .option(HIVE_TABLE_OPT_KEY,config['table_name'])
      .option(HIVE_SYNC_ENABLED_OPT_KEY,"true")
      .option(HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY,NONPARTITION_EXTRACTOR_CLASS_OPT_VAL)
      .option(KEYGENERATOR_CLASS_OPT_KEY,NONPARTITIONED_KEYGENERATOR_CLASS_OPT_VAL)  
      .mode("Append")
      .save(config['target']))

An error was encountered:
An error occurred while calling o303.save.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 122.0 failed 14 times, most recent failure: Lost task 0.13 in stage 122.0 (TID 780, ip-172-31-41-126.us-west-2.compute.internal, executor 1): org.apache.hudi.exception.HoodieUpsertException: Error upserting bucketType UPDATE for partition :0
	at org.apache.hudi.table.HoodieCopyOnWriteTable.handleUpsertPartition(HoodieCopyOnWriteTable.java:270)
	at org.apache.hudi.client.HoodieWriteClient.lambda$upsertRecordsInternal$9c951a5d$1(HoodieWriteClient.java:472)
	at org.apache.spark.api.java.JavaRDDLike.$anonfun$mapPartitionsWithIndex$1(JavaRDDLike.scala:102)
	at org.apache.spark.api.java.JavaRDDLike.$anonfun$mapPartitionsWithIndex$1$adapted(JavaRDDLike.scala:102)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsWithIndex$2(RDD.scala:889)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsWithIndex$2$adapted(RDD.scala:889)
	at org.apache.sp

Hudi threw an error message:
    
Caused by: org.apache.parquet.io.InvalidRecordException: Parquet/Avro schema mismatch: Avro field 'txt' not found

What if we added a null column called 'txt'?

In [28]:
from pyspark.sql.functions import when, lit, col
from pyspark.sql.types import *

df1 = create_json_df(spark, get_json_data(0, 100)).withColumn("txt",lit(None).cast(StringType()))
df1.printSchema()

root
 |-- id: long (nullable = true)
 |-- sk: long (nullable = true)
 |-- txt: string (nullable = true)

In [29]:
(df1.write.format(HUDI_FORMAT)
      .option(PRECOMBINE_FIELD_OPT_KEY, config["sort_key"])
      .option(RECORDKEY_FIELD_OPT_KEY, config["primary_key"])
      .option(TABLE_NAME, config['table_name'])
      .option(OPERATION_OPT_KEY, UPSERT_OPERATION_OPT_VAL)
      .option(UPSERT_PARALLELISM, 20)
      .option(S3_CONSISTENCY_CHECK, "true")
      .option(HUDI_CLEANER_POLICY, KEEP_LATEST_COMMITS)
      .option(HUDI_COMMITS_RETAINED,config["commits_to_retain"])
      .option(HIVE_TABLE_OPT_KEY,config['table_name'])
      .option(HIVE_SYNC_ENABLED_OPT_KEY,"true")
      .option(HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY,NONPARTITION_EXTRACTOR_CLASS_OPT_VAL)
      .option(KEYGENERATOR_CLASS_OPT_KEY,NONPARTITIONED_KEYGENERATOR_CLASS_OPT_VAL)  
      .mode("Append")
      .save(config['target']))

In [30]:
spark.sql("select id,sk, txt from "+config['table_name']+' where id in (0,10000)').show(2)

+---+---+----+
| id| sk| txt|
+---+---+----+
|  0|  0|null|
+---+---+----+

It works! So we need to identify dropped columns and make sure we NULL those out in our dataframe.
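
One possible way to do this generically is sketched below. It builds one select expression per table column and replaces every column missing from the incoming data with a typed NULL. The helper name `null_fill_exprs` is hypothetical, and the type strings are assumed to be Spark SQL type names such as those returned by `DataFrame.dtypes`.

```python
def null_fill_exprs(table_fields, incoming_names):
    """Build select expressions in table order (hypothetical helper).

    table_fields:   list of (column name, SQL type string) pairs of the table.
    incoming_names: column names present in the incoming data.
    """
    incoming = set(incoming_names)
    exprs = []
    for name, sql_type in table_fields:
        if name in incoming:
            # Column is present: pass it through unchanged.
            exprs.append(f"`{name}`")
        else:
            # Column was dropped: add it back as a typed NULL.
            exprs.append(f"cast(null as {sql_type}) as `{name}`")
    return exprs


table_fields = [("id", "bigint"), ("sk", "bigint"), ("txt", "string")]
print(null_fill_exprs(table_fields, ["id", "sk"]))
```

With Spark DataFrames, the expressions could be applied as `df.selectExpr(*null_fill_exprs(table_df.dtypes, df.columns))`.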

## Schema Evolution

To summarize: we tested the 4 cases:

1. Columns are added at the end - which is automatic in Hudi.
2. Columns are added in the middle - which is an incompatible change but could be resolved by rearranging the input columns.
3. Columns are dropped - which is an incompatible change but could be resolved by adding a null column.
4. Column types are changed - which is an incompatible change and can potentially be resolved by casting the changed column to the right type.

Let's add columns at the end to ensure our schema remains compatible with the original schema.

In [31]:
## Generates Data
def get_json_data(start, count, increment=0):
    data = [{"id": i, "sk": i+increment, "new_col": i, "txt": chr(65 + (i % 26))} for i in range(start, start + count)]
    return data

In [32]:
spark.sql("drop table if exists "+config['table_name']).show(100,False)
spark.sql("show tables").show(100,False)
df1 = create_json_df(spark, get_new_json_data(0, 4000))
(df1.write.format(HUDI_FORMAT)
      .option(PRECOMBINE_FIELD_OPT_KEY, config["sort_key"])
      .option(RECORDKEY_FIELD_OPT_KEY, config["primary_key"])
      .option(TABLE_NAME, config['table_name'])
      .option(OPERATION_OPT_KEY, BULK_INSERT_OPERATION_OPT_VAL)
      .option(BULK_INSERT_PARALLELISM, 3)
      .option(S3_CONSISTENCY_CHECK, "true")
      .option(HIVE_TABLE_OPT_KEY,config['table_name'])
      .option(HIVE_SYNC_ENABLED_OPT_KEY,"true")
      .option(HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY,NONPARTITION_EXTRACTOR_CLASS_OPT_VAL)
      .option(KEYGENERATOR_CLASS_OPT_KEY,NONPARTITIONED_KEYGENERATOR_CLASS_OPT_VAL)
      .mode("Overwrite")
      .save(config['target']))
spark.sql("show tables").show(100,False)

++
||
++
++

+--------+---------+-----------+
|database|tableName|isTemporary|
+--------+---------+-----------+
+--------+---------+-----------+

+--------+------------------+-----------+
|database|tableName         |isTemporary|
+--------+------------------+-----------+
|default |example_hudi_table|false      |
+--------+------------------+-----------+

The function below does the following things:

1. Sets dropped columns to NULL to preserve schema compatibility.
2. Optionally attempts to force-cast a column whose type changed back to the old type.
3. Rearranges the columns so that newly added columns go to the end of the table, though this is not really needed for Hudi.
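
The column ordering in the last step can be illustrated in isolation. The sketch below keeps the table's columns first, in table order, and appends new columns in the order they arrive. The function name `evolved_column_order` is hypothetical and only meant to show the intended ordering.

```python
def evolved_column_order(table_columns, incoming_columns):
    """Existing table columns first, then new columns in incoming order (hypothetical helper)."""
    known = set(table_columns)
    new_columns = [c for c in incoming_columns if c not in known]
    return list(table_columns) + new_columns


print(evolved_column_order(["id", "sk", "txt"], ["id", "new", "sk", "txt"]))
```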

In [33]:
def evolveSchema(df, table, strict=False, forcecast=True):
    # Read only the schema of the existing table and ignore Hudi's metadata columns.
    original_df = spark.sql("SELECT * FROM " + table + " LIMIT 0")
    columns_to_drop = ['_hoodie_commit_time', '_hoodie_commit_seqno', '_hoodie_record_key', '_hoodie_partition_path', '_hoodie_file_name']
    odf = original_df.drop(*columns_to_drop)
    if df.schema == odf.schema:
        return (True, df)
    if strict:
        print("Strict Schema Validation Failed : Incoming Schema is not compatible with existing Table.")
        if len(odf.schema) > len(df.schema):
            print("Original Data Diff : " + str(set(odf.schema) - set(df.schema)))
        else:
            print("Incoming Data Diff : " + str(set(df.schema) - set(odf.schema)))
        return (False, df)
    # Columns that exist only in the incoming data, kept in their incoming order.
    new_cols = [c for c in df.columns if c not in odf.columns]
    existing_cols = []
    for s in odf.schema.fields:
        sql_type = s.dataType.simpleString()
        if s.name not in df.columns:
            # Dropped column: add it back as a typed NULL to preserve compatibility.
            existing_cols.append(f'cast(null as {sql_type}) {s.name}')
        elif forcecast:
            # Force cast the incoming column to the type used by the table.
            existing_cols.append(f'cast({s.name} as {sql_type}) {s.name}')
        else:
            existing_cols.append(s.name)
    # Existing columns come first (in table order), newly added columns go to the end.
    # The reordering is not required by Hudi, but it keeps schema changes readable.
    new_df = df.selectExpr(existing_cols + new_cols)
    return (True, new_df)

We now perform a type change as well as add a new column.

In [34]:
## Generates Data
def get_json_data(start, count, increment=0):
    data = [{"id": i, "sk": i+increment, 'new':i, "txt": (65 + (i % 26))} for i in range(start, start + count)]
    return data

In [35]:
idf = create_json_df(spark, get_json_data(0, 100))
idf.printSchema()

root
 |-- id: long (nullable = true)
 |-- new: long (nullable = true)
 |-- sk: long (nullable = true)
 |-- txt: long (nullable = true)

In [36]:
(schemaValidationResult,evolvedDF)=evolveSchema(idf,config['table_name'],False)
evolvedDF.printSchema()

root
 |-- id: long (nullable = true)
 |-- sk: long (nullable = true)
 |-- txt: string (nullable = true)
 |-- new: long (nullable = true)

**We can see that schema evolution has rearranged the columns to match our original Hudi table and has fixed the data types as well.**

In [37]:
(evolvedDF.write.format(HUDI_FORMAT)
      .option(PRECOMBINE_FIELD_OPT_KEY, config["sort_key"])
      .option(RECORDKEY_FIELD_OPT_KEY, config["primary_key"])
      .option(TABLE_NAME, config['table_name'])
      .option(OPERATION_OPT_KEY, UPSERT_OPERATION_OPT_VAL)
      .option(UPSERT_PARALLELISM, 20)
      .option(S3_CONSISTENCY_CHECK, "true")
      .option(HUDI_CLEANER_POLICY, KEEP_LATEST_COMMITS)
      .option(HUDI_COMMITS_RETAINED,config["commits_to_retain"])
      .option(HIVE_TABLE_OPT_KEY,config['table_name'])
      .option(HIVE_SYNC_ENABLED_OPT_KEY,"true")
      .option(HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY,NONPARTITION_EXTRACTOR_CLASS_OPT_VAL)
      .option(KEYGENERATOR_CLASS_OPT_KEY,NONPARTITIONED_KEYGENERATOR_CLASS_OPT_VAL)  
      .mode("Append")
      .save(config['target']))

In [38]:
df=spark.sql("select * from "+config['table_name'])
df.printSchema()

root
 |-- _hoodie_commit_time: string (nullable = true)
 |-- _hoodie_commit_seqno: string (nullable = true)
 |-- _hoodie_record_key: string (nullable = true)
 |-- _hoodie_partition_path: string (nullable = true)
 |-- _hoodie_file_name: string (nullable = true)
 |-- id: long (nullable = true)
 |-- sk: long (nullable = true)
 |-- txt: string (nullable = true)
 |-- new: long (nullable = true)