# Demo: Batch Updates to an S3 Datalake using Apache Hudi

## Table of Contents:

1. [Overview](#Overview)
2. [Bulk Insert the Initial Dataset](#Bulk-Insert-the-Initial-Dataset)
3. [Batch Upsert some records](#Batch-Upsert-some-records)
4. [Deleting Records](#Deleting-Records.)
5. [Working with Partitioned Tables](#Working-with-Partitioned-Tables)
6. [Creating Manifests for Athena/Redshift Spectrum](#Creating_Manifests_for_Athena/Redshift_Spectrum)

## Overview

**This demo notebook runs fine on a single node (r5.4xlarge) EMR Cluster version 5.30.**

This notebook demonstrates using PySpark with [Apache Hudi](https://aws.amazon.com/emr/features/hudi/) on Amazon EMR to upsert records into an S3 data lake.

Here are some good reference links to read later:

* [Apache Hudi concepts](https://hudi.apache.org/concepts.html)
* [How Hudi Works](https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hudi-how-it-works.html)

This notebook covers the following concepts when writing Copy-On-Write tables to an S3 Datalake:

- Write Hudi Spark jobs in PySpark.
- Bulk Insert the Initial Dataset.
- Write a MultiKey Partitioned table as well as a Non-Partitioned table.
- Tune the Bulk Insert write performance as per expected number of target files.
- Sync the Hudi tables to the Hive/Glue Catalog.
- Upsert some records to a Hudi table.
- Delete records from a Hudi table.
- Understand how Hudi Commit Retention policy works.


Let's start by initializing the Spark Session to connect this notebook to our Spark EMR cluster:

- Note that the files hudi-spark-bundle.jar and spark-avro.jar are copied into HDFS. 
- When working with the AWS Glue Catalog, the httpclient-4.5.9.jar library must be the first jar specified in the spark.jars configuration.

In [1]:
%%configure -f
{
    "conf":  { 
             "spark.jars":"hdfs:///httpclient-4.5.9.jar,hdfs:///hudi-spark-bundle.jar,hdfs:///spark-avro.jar,",
             "spark.serializer":"org.apache.spark.serializer.KryoSerializer",
             "spark.sql.hive.convertMetastoreParquet":"false",
             "spark.dynamicAllocation.executorIdleTimeout": 3600,
             "spark.executor.memory": "7G",
             "spark.executor.cores": 1,
             "spark.dynamicAllocation.initialExecutors":16
           } 
}
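
The ordering requirement on `spark.jars` is easy to break when the jar list is assembled by hand. As a minimal sketch (the helper name `build_spark_jars` is hypothetical and not part of this notebook), the value could be built so that the httpclient jar always comes first:

```python
# Hypothetical helper (not part of the notebook): build the spark.jars value
# with the httpclient jar forced to the front of the list.
def build_spark_jars(jars, first="httpclient"):
    """Return a comma-separated jar list where any jar whose file name starts
    with `first` is moved to the front; the relative order is otherwise kept."""
    def is_first(path):
        return path.rsplit("/", 1)[-1].startswith(first)

    leading = [j for j in jars if is_first(j)]
    rest = [j for j in jars if not is_first(j)]
    return ",".join(leading + rest)


jars = [
    "hdfs:///hudi-spark-bundle.jar",
    "hdfs:///spark-avro.jar",
    "hdfs:///httpclient-4.5.9.jar",
]
spark_jars = build_spark_jars(jars)
print(spark_jars)
```

The resulting string can be pasted into the `spark.jars` entry of the `%%configure` cell.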

In [2]:
## CHANGE ME ##
config = {
    "table_name": "example_hudi_table",
    "target": "s3://[YOUR-S3-BUCKET]/tmp/hudi/example_hudi_table",
    "primary_key": "id",
    "sort_key": "sk",
    "commits_to_retain": "2"
}

Starting Spark application


ID,YARN Application ID,Kind,State,Spark UI,Driver log,Current session?
2,application_1590604325881_0003,pyspark,idle,Link,Link,✔



SparkSession available as 'spark'.



The constants for Python to use:

In [3]:
# General Constants
HUDI_FORMAT = "org.apache.hudi"
TABLE_NAME = "hoodie.table.name"
RECORDKEY_FIELD_OPT_KEY = "hoodie.datasource.write.recordkey.field"
PRECOMBINE_FIELD_OPT_KEY = "hoodie.datasource.write.precombine.field"
OPERATION_OPT_KEY = "hoodie.datasource.write.operation"
BULK_INSERT_OPERATION_OPT_VAL = "bulk_insert"
UPSERT_OPERATION_OPT_VAL = "upsert"
BULK_INSERT_PARALLELISM = "hoodie.bulkinsert.shuffle.parallelism"
UPSERT_PARALLELISM = "hoodie.upsert.shuffle.parallelism"
HUDI_CLEANER_POLICY = "hoodie.cleaner.policy"
KEEP_LATEST_COMMITS = "KEEP_LATEST_COMMITS"
HUDI_COMMITS_RETAINED = "hoodie.cleaner.commits.retained"
PAYLOAD_CLASS_OPT_KEY = "hoodie.datasource.write.payload.class"
EMPTY_PAYLOAD_CLASS_OPT_VAL = "org.apache.hudi.common.model.EmptyHoodieRecordPayload"

# Hive Constants
HIVE_SYNC_ENABLED_OPT_KEY="hoodie.datasource.hive_sync.enable"
HIVE_PARTITION_FIELDS_OPT_KEY="hoodie.datasource.hive_sync.partition_fields"
HIVE_ASSUME_DATE_PARTITION_OPT_KEY="hoodie.datasource.hive_sync.assume_date_partitioning"
HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY="hoodie.datasource.hive_sync.partition_extractor_class"
HIVE_TABLE_OPT_KEY="hoodie.datasource.hive_sync.table"

# Partition Constants
NONPARTITION_EXTRACTOR_CLASS_OPT_VAL="org.apache.hudi.hive.NonPartitionedExtractor"
MULIPART_KEYS_EXTRACTOR_CLASS_OPT_VAL="org.apache.hudi.hive.MultiPartKeysValueExtractor"
KEYGENERATOR_CLASS_OPT_KEY="hoodie.datasource.write.keygenerator.class"
NONPARTITIONED_KEYGENERATOR_CLASS_OPT_VAL="org.apache.hudi.keygen.NonpartitionedKeyGenerator"
COMPLEX_KEYGENERATOR_CLASS_OPT_VAL="org.apache.hudi.ComplexKeyGenerator"
PARTITIONPATH_FIELD_OPT_KEY="hoodie.datasource.write.partitionpath.field"


Some helper functions:

In [4]:
from datetime import datetime

## Generates Data
def get_json_data(start, count, increment=0):
    now = str(datetime.today().replace(microsecond=0))
    data = [{"id": i, "sk": i+increment, "txt": chr(65 + (i % 26)), "modified_time" : now} for i in range(start, start + count)]
    return data

# Creates the Dataframe
def create_json_df(spark, data):
    sc = spark.sparkContext
    return spark.read.json(sc.parallelize(data, 2))



### Bulk Insert the Initial Dataset

Let's generate 4M records to load into our Data Lake:

In [5]:
df1 = create_json_df(spark, get_json_data(0, 4000000))
print(df1.count())
df1.printSchema()
df1.show(3)


4000000
root
 |-- id: long (nullable = true)
 |-- modified_time: string (nullable = true)
 |-- sk: long (nullable = true)
 |-- txt: string (nullable = true)

+---+-------------------+---+---+
| id|      modified_time| sk|txt|
+---+-------------------+---+---+
|  0|2020-05-27 20:48:57|  0|  A|
|  1|2020-05-27 20:48:57|  1|  B|
|  2|2020-05-27 20:48:57|  2|  C|
+---+-------------------+---+---+
only showing top 3 rows

In [6]:
import pyspark.sql.functions as F
from pyspark.sql.functions import unix_timestamp, from_unixtime

df2=df1.withColumn("modified_timestamp",F.to_timestamp(F.col('modified_time'), "yyyy-MM-dd HH:mm:ss")).drop("modified_time")
df2.show(3,False)
df2.printSchema()


+---+---+---+-------------------+
|id |sk |txt|modified_timestamp |
+---+---+---+-------------------+
|0  |0  |A  |2020-05-27 20:48:57|
|1  |1  |B  |2020-05-27 20:48:57|
|2  |2  |C  |2020-05-27 20:48:57|
+---+---+---+-------------------+
only showing top 3 rows

root
 |-- id: long (nullable = true)
 |-- sk: long (nullable = true)
 |-- txt: string (nullable = true)
 |-- modified_timestamp: timestamp (nullable = true)

In [7]:
df1.schema


StructType(List(StructField(id,LongType,true),StructField(modified_time,StringType,true),StructField(sk,LongType,true),StructField(txt,StringType,true)))

And write the data to S3:

In [8]:
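# "if exists" keeps this cell from failing on a first run, when the table has not been created yet
spark.sql("drop table if exists "+config['table_name']).show(100,False)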


++
||
++
++

In [9]:
(df2.write.format(HUDI_FORMAT)
      .option(PRECOMBINE_FIELD_OPT_KEY, config["sort_key"])
      .option(RECORDKEY_FIELD_OPT_KEY, config["primary_key"])
      .option(TABLE_NAME, config['table_name'])
      .option(OPERATION_OPT_KEY, BULK_INSERT_OPERATION_OPT_VAL)
      .option(BULK_INSERT_PARALLELISM, 3)
      .option(HIVE_TABLE_OPT_KEY,config['table_name'])
      .option(HIVE_SYNC_ENABLED_OPT_KEY,"true")
      .option(HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY,NONPARTITION_EXTRACTOR_CLASS_OPT_VAL)
      .option(KEYGENERATOR_CLASS_OPT_KEY,NONPARTITIONED_KEYGENERATOR_CLASS_OPT_VAL)
      .mode("Overwrite")
      .save(config['target']))
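
The chained `.option(...)` calls above can also be collected into a plain dictionary, which makes the settings easier to reuse across the bulk insert, upsert and delete cells. This is only a sketch: `base_write_options` and `example_config` are hypothetical names, while the option keys and class names are the ones defined in the constants cell.

```python
# Sketch only: gather the Hudi options used for the non-partitioned table
# into a dict. `base_write_options` is a hypothetical helper name.
def base_write_options(config, operation):
    return {
        "hoodie.table.name": config["table_name"],
        "hoodie.datasource.write.recordkey.field": config["primary_key"],
        "hoodie.datasource.write.precombine.field": config["sort_key"],
        "hoodie.datasource.write.operation": operation,
        "hoodie.datasource.hive_sync.enable": "true",
        "hoodie.datasource.hive_sync.table": config["table_name"],
        "hoodie.datasource.hive_sync.partition_extractor_class":
            "org.apache.hudi.hive.NonPartitionedExtractor",
        "hoodie.datasource.write.keygenerator.class":
            "org.apache.hudi.keygen.NonpartitionedKeyGenerator",
    }


# Same shape as the notebook's `config` dict; the bucket placeholder is kept as-is.
example_config = {
    "table_name": "example_hudi_table",
    "target": "s3://[YOUR-S3-BUCKET]/tmp/hudi/example_hudi_table",
    "primary_key": "id",
    "sort_key": "sk",
}

bulk_insert_options = base_write_options(example_config, "bulk_insert")
bulk_insert_options["hoodie.bulkinsert.shuffle.parallelism"] = "3"

# With a live Spark session the dict would be passed in one call, for example:
# (df2.write.format("org.apache.hudi")
#      .options(**bulk_insert_options)
#      .mode("Overwrite")
#      .save(example_config["target"]))
for key, value in sorted(bulk_insert_options.items()):
    print(key, "=", value)
```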


In [10]:
%%local
!aws s3 ls  s3://[YOUR-S3-BUCKET]/tmp/hudi/example_hudi_table/

                           PRE .hoodie/
2020-05-27 20:49:37          0 .hoodie_$folder$
2020-05-27 20:50:17         93 .hoodie_partition_metadata
2020-05-27 20:50:30    9867126 3812e9ba-9376-4ea1-a61c-38a6be5b9645-0_1-9-13_20200527204934.parquet
2020-05-27 20:50:31   10704243 aa492e8d-e0e1-4e39-a01c-07cdf9ad4e6c-0_2-9-14_20200527204934.parquet
2020-05-27 20:50:35   13949105 ea2cdfe9-cf72-4652-8199-22a05ac3fb32-0_0-9-12_20200527204934.parquet


Let's observe the number of files in S3. We expect 3 files because BULK_INSERT_PARALLELISM is set to 3.

4M records took approximately 1min 20s to get written to S3 in 3 files using a single node EMR cluster - r4.2xlarge.

When sizing an EMR cluster, you get perfect parallelism when the number of executors matches the number of files to be written. The more usual scenario, however, is that the number of files to be written is far larger than the number of executors.
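
To make the sizing argument concrete, here is a rough back-of-the-envelope sketch. It assumes one write task per file and one task at a time per executor (this notebook sets `spark.executor.cores` to 1); `write_waves` is a hypothetical helper, not a Hudi or Spark API.

```python
import math


def write_waves(num_files, num_executors):
    """Number of sequential rounds needed to write `num_files` files when each
    executor writes one file at a time (a simplifying assumption)."""
    if num_files < 0 or num_executors <= 0:
        raise ValueError("num_files must be >= 0 and num_executors must be > 0")
    return math.ceil(num_files / num_executors)


# 3 files on 16 executors: everything is written in a single round.
print(write_waves(3, 16))
# 200 files on 16 executors: the executors need several rounds.
print(write_waves(200, 16))
```

When the result is 1, every file is written in parallel; larger values mean each executor handles several files one after another.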

Some Apache Hudi write parameters to note when performing a bulk insert are listed below, though you may not need to override them all for every workload:

| Config | Description |
| :--- | :--- |
| hoodie.parquet.max.file.size | Target size for parquet files produced by Hudi write phases. |
| hoodie.parquet.small.file.limit | Size below which a file is considered a small file; should be less than hoodie.parquet.max.file.size. |
| hoodie.parquet.compression.ratio | Expected compression ratio of parquet data, used by Hudi. |
| hoodie.bulkinsert.shuffle.parallelism | Parallelism determines the initial number of files in your table. |
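
As an illustration of how these settings relate to each other, the sketch below checks that the small-file limit stays below the maximum file size and derives a bulk insert parallelism from an expected output size. The numeric values are illustrative assumptions, not recommendations, and `check_storage_options` and `estimate_parallelism` are hypothetical helpers.

```python
import math

MB = 1024 * 1024

# Illustrative values only (assumptions for this sketch).
storage_options = {
    "hoodie.parquet.max.file.size": str(128 * MB),
    "hoodie.parquet.small.file.limit": str(100 * MB),
    "hoodie.parquet.compression.ratio": "0.1",
}


def check_storage_options(options):
    """Raise if the small-file limit is not below the maximum file size."""
    max_size = int(options["hoodie.parquet.max.file.size"])
    small_limit = int(options["hoodie.parquet.small.file.limit"])
    if small_limit >= max_size:
        raise ValueError("hoodie.parquet.small.file.limit must be < hoodie.parquet.max.file.size")
    return options


def estimate_parallelism(expected_output_bytes, options):
    """Rough estimate: one output file per unit of parallelism, each close to
    the target file size."""
    max_size = int(options["hoodie.parquet.max.file.size"])
    return max(1, math.ceil(expected_output_bytes / max_size))


check_storage_options(storage_options)
# About 1 GiB of expected parquet output with a 128 MiB target file size.
print(estimate_parallelism(1024 * MB, storage_options))
```

The estimate would then be used as the value of `hoodie.bulkinsert.shuffle.parallelism`.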

Let's inspect the table created and query the data:

In [None]:
spark.sql("show tables").show(100,False)

In [12]:
spark.sql("show create table "+config['table_name']).show(100,False)


+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|createtab_stmt                                                                                                                                                                                                                                                                             

Note the extra columns that are added by Hudi to keep track of commits and filenames.
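
The Hudi bookkeeping columns all share the `_hoodie_` prefix, as the `printSchema()` output in the delete section shows. A small sketch (with a hypothetical `split_columns` helper) for separating them from the business columns:

```python
HOODIE_META_PREFIX = "_hoodie_"


def split_columns(columns):
    """Split column names into (Hudi metadata columns, data columns)."""
    meta = [c for c in columns if c.startswith(HOODIE_META_PREFIX)]
    data = [c for c in columns if not c.startswith(HOODIE_META_PREFIX)]
    return meta, data


# Column names as they appear in this notebook's table schema.
columns = [
    "_hoodie_commit_time",
    "_hoodie_commit_seqno",
    "_hoodie_record_key",
    "_hoodie_partition_path",
    "_hoodie_file_name",
    "id",
    "sk",
    "txt",
    "modified_timestamp",
]
meta_columns, data_columns = split_columns(columns)
print(meta_columns)
print(data_columns)
```

With a live session, `split_columns(df.columns)` would give the list to pass to `df.select(*data_columns)` when only the original fields are wanted.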

In [13]:
df2=spark.read.format(HUDI_FORMAT).load(config["target"]+"/*")
df2.count()


4000000

We can query the Hive table as well:

In [14]:
spark.sql("select count(*) from "+config['table_name']).show(100,False)


+--------+
|count(1)|
+--------+
|4000000 |
+--------+

### Batch Upsert some records

Let's modify a few records:

In [15]:
spark.sql("select id, sk from "+config['table_name'] +" where id between 3000000 and 3000010").show(100,False)


+-------+-------+
|id     |sk     |
+-------+-------+
|3000000|3000000|
|3000001|3000001|
|3000002|3000002|
|3000003|3000003|
|3000004|3000004|
|3000005|3000005|
|3000006|3000006|
|3000007|3000007|
|3000008|3000008|
|3000009|3000009|
|3000010|3000010|
+-------+-------+

In [16]:
df2 = create_json_df(spark, get_json_data(3000000, 10000, 2))
df2.count()


10000

In [17]:
df2.printSchema()


root
 |-- id: long (nullable = true)
 |-- modified_time: string (nullable = true)
 |-- sk: long (nullable = true)
 |-- txt: string (nullable = true)

In [18]:
df3=df2.withColumn("modified_timestamp",F.to_timestamp(F.col('modified_time'), "yyyy-MM-dd HH:mm:ss")).drop("modified_time")
df3.show(3,False)
df3.printSchema()


+-------+-------+---+-------------------+
|id     |sk     |txt|modified_timestamp |
+-------+-------+---+-------------------+
|3000000|3000002|Q  |2020-05-27 20:51:01|
|3000001|3000003|R  |2020-05-27 20:51:01|
|3000002|3000004|S  |2020-05-27 20:51:01|
+-------+-------+---+-------------------+
only showing top 3 rows

root
 |-- id: long (nullable = true)
 |-- sk: long (nullable = true)
 |-- txt: string (nullable = true)
 |-- modified_timestamp: timestamp (nullable = true)

In [19]:
df3.select("id","sk").show(5)


+-------+-------+
|     id|     sk|
+-------+-------+
|3000000|3000002|
|3000001|3000003|
|3000002|3000004|
|3000003|3000005|
|3000004|3000006|
+-------+-------+
only showing top 5 rows

We have incremented the value in the sk column by 2 for those 10,000 records. Let's write the changes to S3. Note that the operation is now Upsert, as opposed to BulkInsert for the initial load:

```
      .option(OPERATION_OPT_KEY, UPSERT_OPERATION_OPT_VAL)

```
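
To build intuition for what an upsert does with the record key and the precombine field, here is a simplified in-memory model. It is an assumption-laden sketch, not Hudi's implementation: incoming records that share a key are first reduced to the one with the largest precombine value, and the survivor then replaces (or is added to) the stored record with that key.

```python
def upsert(table, batch, key="id", precombine="sk"):
    """Simplified model of an upsert: dedupe the batch per key on the
    precombine field, then overwrite or insert into a copy of the table."""
    latest = {}
    for record in batch:
        current = latest.get(record[key])
        if current is None or record[precombine] > current[precombine]:
            latest[record[key]] = record
    merged = dict(table)   # leave the input table untouched
    merged.update(latest)  # existing keys are replaced, new keys are inserted
    return merged


table = {i: {"id": i, "sk": i} for i in range(5)}
batch = [
    {"id": 3, "sk": 5},  # update for an existing key
    {"id": 3, "sk": 4},  # same key, lower precombine value: dropped
    {"id": 7, "sk": 9},  # new key: inserted
]
result = upsert(table, batch)
print(result[3], result[7], len(result))
```

Keys that do not appear in the batch are left untouched, which is why the table's row count stays at 4,000,000 after the upsert in this notebook.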

In [20]:
(df3.write.format(HUDI_FORMAT)
      .option(PRECOMBINE_FIELD_OPT_KEY, config["sort_key"])
      .option(RECORDKEY_FIELD_OPT_KEY, config["primary_key"])
      .option(TABLE_NAME, config['table_name'])
      .option(OPERATION_OPT_KEY, UPSERT_OPERATION_OPT_VAL)
      .option(UPSERT_PARALLELISM, 20)
      .option(HUDI_CLEANER_POLICY, KEEP_LATEST_COMMITS)
      .option(HUDI_COMMITS_RETAINED,config["commits_to_retain"])
      .option(HIVE_TABLE_OPT_KEY,config['table_name'])
      .option(HIVE_SYNC_ENABLED_OPT_KEY,"true")
      .option(HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY,NONPARTITION_EXTRACTOR_CLASS_OPT_VAL)
      .option(KEYGENERATOR_CLASS_OPT_KEY,NONPARTITIONED_KEYGENERATOR_CLASS_OPT_VAL)  
      .mode("Append")
      .save(config['target']))


In [21]:
%%local
!aws s3 ls  s3://[YOUR-S3-BUCKET]/tmp/hudi/example_hudi_table/

                           PRE .hoodie/
2020-05-27 20:49:37          0 .hoodie_$folder$
2020-05-27 20:50:17         93 .hoodie_partition_metadata
2020-05-27 20:51:29    9867588 3812e9ba-9376-4ea1-a61c-38a6be5b9645-0_0-52-355_20200527205105.parquet
2020-05-27 20:50:30    9867126 3812e9ba-9376-4ea1-a61c-38a6be5b9645-0_1-9-13_20200527204934.parquet
2020-05-27 20:50:31   10704243 aa492e8d-e0e1-4e39-a01c-07cdf9ad4e6c-0_2-9-14_20200527204934.parquet
2020-05-27 20:50:35   13949105 ea2cdfe9-cf72-4652-8199-22a05ac3fb32-0_0-9-12_20200527204934.parquet


Let's observe the number of files in S3. Expected: 4 files, as one file now has 2 versions stored.

Let's rerun the previous cell and observe the number of files.

Rerun it once more and observe the number of files again. Expected: 5 files.

Notice that the number of files does not increase beyond initial files (3) + commits_to_retain (2) = 5 files. This is because the Hudi cleaning policy deletes older file versions on write, as commits_to_retain is set to 2.
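
The file counts described above can be captured in a tiny model. This sketch only encodes the behaviour observed in this notebook, under the assumption that every upsert rewrites exactly one of the initial files; it is not a description of Hudi's cleaner algorithm, and `expected_file_count` is a hypothetical helper.

```python
def expected_file_count(initial_files, commits_to_retain, upserts):
    """Expected number of parquet files after `upserts` upsert commits, when
    each upsert adds one new version of the same file and the cleaner keeps
    at most `commits_to_retain` extra versions."""
    return initial_files + min(upserts, commits_to_retain)


for upserts in range(4):
    print(upserts, expected_file_count(3, 2, upserts))
```

With 3 initial files and `commits_to_retain` set to 2, the count grows to 4 and then 5 and stays there on further reruns.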

Let's query our table after the changes:

In [22]:
df2=spark.read.format(HUDI_FORMAT).load(config["target"]+"/*")
df2.count()


4000000

In [23]:
spark.sql("select id, sk, modified_timestamp from "+config['table_name'] +" where id between 2000000 and 2000010").show(100,False)


+-------+-------+------------------+
|id     |sk     |modified_timestamp|
+-------+-------+------------------+
|2000000|2000000|1590612537000000  |
|2000001|2000001|1590612537000000  |
|2000002|2000002|1590612537000000  |
|2000003|2000003|1590612537000000  |
|2000004|2000004|1590612537000000  |
|2000005|2000005|1590612537000000  |
|2000006|2000006|1590612537000000  |
|2000007|2000007|1590612537000000  |
|2000008|2000008|1590612537000000  |
|2000009|2000009|1590612537000000  |
|2000010|2000010|1590612537000000  |
+-------+-------+------------------+

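Note that this query reads ids 2000000 to 2000010, which were not part of the upsert (it touched ids 3000000 to 3009999), so their sk values are unchanged. To observe the updated records, run the same query with `id between 3000000 and 3000010`; sk should then equal id + 2.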
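
The `modified_timestamp` values in the output above are long integers rather than formatted timestamps. They appear to be microseconds since the Unix epoch (an assumption, based on the value matching the load time printed earlier in this notebook). A minimal sketch for converting such a value back to a readable UTC timestamp:

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def micros_to_datetime(micros):
    """Convert microseconds since the Unix epoch to a timezone-aware UTC datetime."""
    return EPOCH + timedelta(microseconds=micros)


print(micros_to_datetime(1590612537000000))  # 2020-05-27 20:48:57+00:00
```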

## Deleting Records.

Apache Hudi supports implementing two types of deletes on data stored in Hudi datasets, by enabling the user to specify a different record payload implementation.

* **Soft Deletes** : With soft deletes, the user wants to retain the key but null out the values of all other fields. This can be achieved by ensuring the appropriate fields are nullable in the dataset schema and upserting the dataset after setting these fields to null.
    
* **Hard Deletes** : A stronger form of delete is to physically remove any trace of the record from the dataset. 
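
Soft deletes are not run in this notebook. As a minimal sketch of the idea on plain Python records, the hypothetical `soft_delete` helper below keeps the record key and the precombine field (an assumption: both are still needed to upsert the record) and nulls out everything else.

```python
def soft_delete(record, keep_fields=("id", "sk")):
    """Return a copy of `record` with every field outside `keep_fields` set to None."""
    return {
        name: (value if name in keep_fields else None)
        for name, value in record.items()
    }


record = {"id": 2000000, "sk": 2000000, "txt": "Y", "modified_time": "2020-05-27 20:48:57"}
print(soft_delete(record))
```

The nulled records would then be written back with the same upsert options used earlier, so the keys remain in the table while the other values are cleared.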

Let's now execute some hard delete operations on our dataset which will remove the records from our dataset.

In [24]:
## Generates only the ids of the records to delete
def get_json_data(start, count, increment=0):
    # `increment` is unused here; it is kept so the signature matches the earlier helper.
    data = [{"id": i} for i in range(start, start + count)]
    return data

df2 = create_json_df(spark, get_json_data(2000000, 10, 1))
df2.createOrReplaceTempView("data_to_delete_v")
# join the incoming delete ids with the original table.
df3=spark.sql("SELECT a.* from "+config['table_name']+' a,data_to_delete_v b where a.id = b.id')
df3.count()


10

In [25]:
df3.printSchema()


root
 |-- _hoodie_commit_time: string (nullable = true)
 |-- _hoodie_commit_seqno: string (nullable = true)
 |-- _hoodie_record_key: string (nullable = true)
 |-- _hoodie_partition_path: string (nullable = true)
 |-- _hoodie_file_name: string (nullable = true)
 |-- id: long (nullable = true)
 |-- sk: long (nullable = true)
 |-- txt: string (nullable = true)
 |-- modified_timestamp: long (nullable = true)

Let's now delete these 10 records. Note that the only change is the single line that sets hoodie.datasource.write.payload.class to org.apache.hudi.common.model.EmptyHoodieRecordPayload, which deletes the records.

```
.option(PAYLOAD_CLASS_OPT_KEY, EMPTY_PAYLOAD_CLASS_OPT_VAL)
```

In [26]:
(df3.write.format(HUDI_FORMAT)
      .option(PRECOMBINE_FIELD_OPT_KEY, config["sort_key"])
      .option(RECORDKEY_FIELD_OPT_KEY, config["primary_key"])
      .option(TABLE_NAME, config['table_name'])
      .option(OPERATION_OPT_KEY, UPSERT_OPERATION_OPT_VAL)
      .option(UPSERT_PARALLELISM, 20)
      .option(HUDI_CLEANER_POLICY, KEEP_LATEST_COMMITS)
      .option(HUDI_COMMITS_RETAINED,config["commits_to_retain"])
      .option(HIVE_TABLE_OPT_KEY,config['table_name'])
      .option(HIVE_SYNC_ENABLED_OPT_KEY,"true")
      .option(HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY,NONPARTITION_EXTRACTOR_CLASS_OPT_VAL)
      .option(KEYGENERATOR_CLASS_OPT_KEY,NONPARTITIONED_KEYGENERATOR_CLASS_OPT_VAL)
      .option(PAYLOAD_CLASS_OPT_KEY, EMPTY_PAYLOAD_CLASS_OPT_VAL)
      .mode("Append")
      .save(config['target']))


In [27]:
spark.sql("select id, sk from "+config['table_name'] +" where id between 2000000 and 2000009").show(100,False)


+---+---+
|id |sk |
+---+---+
+---+---+

We can observe that the records no longer exist in our table.

## Working with Partitioned Tables

Let's do the same thing with Partitioned Tables.

In [28]:
## CHANGE ME ##
config = {
    "table_name": "example_hudi_partitioned_table",
    "target": "s3://[YOUR-S3-BUCKET]/tmp/hudi/example_hudi_partitioned_table",
    "primary_key": "id",
    "sort_key": "sk",
    "commits_to_retain": "2",
    "partition_keys" : "year,month"
}


In [29]:
import random

## Generates Data - we are adding year and month columns this time.
def get_json_data(start, count, increment=0):
    data = [{"id": i, "sk": i+increment, "txt": chr(65 + (i % 26)), "year" : "2019", "month": random.randint(1,12) } for i in range(start, start + count)]
    return data

# Creates the Dataframe
def create_json_df(spark, data):
    sc = spark.sparkContext
    return spark.read.json(sc.parallelize(data, 2))



Let's generate the data:

In [30]:
df1 = create_json_df(spark, get_json_data(0, 4000000))
print(df1.count())


4000000

We add the partitionKey column to the dataframe. Note that we are doing this because we want Hive Style Partitions : year=<year>/month=<month> etc.

In [31]:
from pyspark.sql.functions import concat, col, lit

hudiTablePartitionKey="partitionKey"
df1 = df1.withColumn(hudiTablePartitionKey,concat(lit("year="),col("year"),lit("/month="),col("month")))
df1.select(hudiTablePartitionKey).show(5)


+-----------------+
|     partitionKey|
+-----------------+
|year=2019/month=3|
|year=2019/month=4|
|year=2019/month=6|
|year=2019/month=7|
|year=2019/month=1|
+-----------------+
only showing top 5 rows
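
The Hive-style partition path built above with `concat` has a simple structure: one `key=value` segment per partition column, joined by `/`. The plain-Python sketch below builds such a path and splits it back into its parts, which is conceptually what a multi-part key extractor has to do; both helper names are hypothetical.

```python
def build_partition_path(record, partition_keys):
    """Build a Hive-style partition path such as 'year=2019/month=3'."""
    return "/".join(f"{key}={record[key]}" for key in partition_keys)


def parse_partition_path(path):
    """Split a Hive-style partition path back into a {column: value} dict.
    Values come back as strings."""
    return dict(segment.split("=", 1) for segment in path.split("/"))


record = {"id": 0, "sk": 0, "txt": "A", "year": "2019", "month": 3}
path = build_partition_path(record, ["year", "month"])
print(path)
print(parse_partition_path(path))
```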

And we can now write out the data to S3. Notice that the Hive Partition Extractor class has changed in the statement below:

```
      .option(HIVE_PARTITION_FIELDS_OPT_KEY, config["partition_keys"])
      .option(HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY,MULIPART_KEYS_EXTRACTOR_CLASS_OPT_VAL)
      .option(PARTITIONPATH_FIELD_OPT_KEY,"partitionKey")
```


In [32]:
(df1.write.format(HUDI_FORMAT)
      .option(PRECOMBINE_FIELD_OPT_KEY, config["sort_key"])
      .option(RECORDKEY_FIELD_OPT_KEY, config["primary_key"])
      .option(TABLE_NAME, config['table_name'])
      .option(OPERATION_OPT_KEY, BULK_INSERT_OPERATION_OPT_VAL)
      .option(BULK_INSERT_PARALLELISM, 3)
      .option(HIVE_PARTITION_FIELDS_OPT_KEY, config["partition_keys"])
      .option(HIVE_TABLE_OPT_KEY,config['table_name'])
      .option(HIVE_SYNC_ENABLED_OPT_KEY,"true")
      .option(HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY,MULIPART_KEYS_EXTRACTOR_CLASS_OPT_VAL)
      .option(PARTITIONPATH_FIELD_OPT_KEY,"partitionKey")
      .mode("Overwrite")
      .save(config['target']))


In [33]:
spark.sql("show create table "+config['table_name']).show(100,False)


+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|createtab_stmt                                                                                                                                                                                                            

We can see the partition fields are present in our Hive table.

```
PARTITIONED BY (`year` STRING, `month` BIGINT)
```

Let's now query the data and group by the partition columns:

In [34]:
spark.sql("Select year, month, count(*) from "+config['table_name']+" group by year, month order by month").show(100,False)


+----+-----+--------+
|year|month|count(1)|
+----+-----+--------+
|2019|1    |334289  |
|2019|2    |332567  |
|2019|3    |333834  |
|2019|4    |332667  |
|2019|5    |333940  |
|2019|6    |333554  |
|2019|7    |333280  |
|2019|8    |332993  |
|2019|9    |332970  |
|2019|10   |333269  |
|2019|11   |333607  |
|2019|12   |333030  |
+----+-----+--------+

The other operations, such as upsert, behave the same way on partitioned tables.
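
For example, an upsert into the partitioned table could combine the upsert and cleaner options from the non-partitioned section with the partition options used above. This is a sketch rather than a cell from the notebook: `partitioned_upsert_options` and `updates_df` are hypothetical names, and it assumes the incoming DataFrame carries the same `partitionKey` column as the rows it updates.

```python
def partitioned_upsert_options(config, partition_path_field="partitionKey"):
    """Collect the options for an upsert into the partitioned table (sketch)."""
    return {
        "hoodie.table.name": config["table_name"],
        "hoodie.datasource.write.recordkey.field": config["primary_key"],
        "hoodie.datasource.write.precombine.field": config["sort_key"],
        "hoodie.datasource.write.operation": "upsert",
        "hoodie.upsert.shuffle.parallelism": "20",
        "hoodie.cleaner.policy": "KEEP_LATEST_COMMITS",
        "hoodie.cleaner.commits.retained": config["commits_to_retain"],
        "hoodie.datasource.write.partitionpath.field": partition_path_field,
        "hoodie.datasource.hive_sync.enable": "true",
        "hoodie.datasource.hive_sync.table": config["table_name"],
        "hoodie.datasource.hive_sync.partition_fields": config["partition_keys"],
        "hoodie.datasource.hive_sync.partition_extractor_class":
            "org.apache.hudi.hive.MultiPartKeysValueExtractor",
    }


# Same shape as the notebook's partitioned `config` dict.
partitioned_config = {
    "table_name": "example_hudi_partitioned_table",
    "target": "s3://[YOUR-S3-BUCKET]/tmp/hudi/example_hudi_partitioned_table",
    "primary_key": "id",
    "sort_key": "sk",
    "commits_to_retain": "2",
    "partition_keys": "year,month",
}
upsert_options = partitioned_upsert_options(partitioned_config)

# With a live Spark session (updates_df is a hypothetical DataFrame of changed rows):
# (updates_df.write.format("org.apache.hudi")
#      .options(**upsert_options)
#      .mode("Append")
#      .save(partitioned_config["target"]))
print(upsert_options["hoodie.datasource.write.operation"])
```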