The purpose of this solution is to demonstrate incremental data ingestion from a Delta Share into Google Cloud BigQuery using Dataproc.
- Set up a Delta Share. In this solution, I have used a customer churn dataset.
- Create the table, enable CDC (Change Data Feed), insert data, and create the share (see the provider-side sketch after this list).
- Set up a recipient and generate the credentials file.
- Create a single-node Dataproc cluster and specify the Python package dependencies by setting the cluster property below:
- dataproc:pip.packages = delta-sharing==1.0.3,google-cloud-bigquery==3.20.1
- Create a BigQuery dataset (equivalent to a database or schema).
- Create the "job_reference_table" in the BigQuery dataset. This table is used as a watermark table to track the latest Delta table version loaded (see the creation sketch after this list).
- Create a Google Cloud Storage bucket to store the credentials file generated above and the Python program.
- Create a Google Cloud Storage bucket required for querying BigQuery using the Spark or pandas API (the BigQuery connector stages data in this bucket).
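The provider-side steps in the first bullets (create the table with CDC enabled, insert data, create the share and recipient) might look like the following on Databricks. All object names here (`main.default.customer_churn`, `churn_share`, `bq_recipient`) and the column list are illustrative assumptions, not the actual objects used in this solution:

```python
# Provider side (e.g. a Databricks notebook). All object names are illustrative.

# Create the source table with Change Data Feed (CDC) enabled.
spark.sql("""
    CREATE TABLE IF NOT EXISTS main.default.customer_churn (
        customer_id STRING,
        churned     BOOLEAN,
        updated_at  TIMESTAMP
    )
    TBLPROPERTIES (delta.enableChangeDataFeed = true)
""")

# Insert some data so there is a table version to share.
spark.sql("""
    INSERT INTO main.default.customer_churn
    VALUES ('C001', false, current_timestamp())
""")

# Create the share and add the table. WITH HISTORY lets recipients query
# the change data feed rather than only the latest snapshot.
spark.sql("CREATE SHARE IF NOT EXISTS churn_share")
spark.sql("ALTER SHARE churn_share ADD TABLE main.default.customer_churn WITH HISTORY")

# Create the recipient; its activation link is used to download the
# credentials file referenced in the later steps.
spark.sql("CREATE RECIPIENT IF NOT EXISTS bq_recipient")
```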
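A minimal sketch of creating the watermark table with the google-cloud-bigquery client. The project, dataset, and column names are assumptions; the original does not spell out the schema:

```python
from google.cloud import bigquery

# Assumed project and dataset names; adjust to your environment.
client = bigquery.Client(project="my-gcp-project")
table_id = "my-gcp-project.churn_dataset.job_reference_table"

# Watermark schema (assumed): one row per shared table, tracking the
# latest Delta table version that has been loaded into BigQuery.
schema = [
    bigquery.SchemaField("share_table_name", "STRING", mode="REQUIRED"),
    bigquery.SchemaField("last_version_loaded", "INTEGER", mode="REQUIRED"),
    bigquery.SchemaField("load_timestamp", "TIMESTAMP", mode="REQUIRED"),
]
client.create_table(bigquery.Table(table_id, schema=schema), exists_ok=True)
```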
For the first-time load, the provider must share the latest table version; this seeds the watermark table, and subsequent runs read only the changes after that version.
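The version handling can then work like this: the job looks up the last loaded version in job_reference_table, does a full snapshot read when no watermark exists yet (the first load), and otherwise reads only the change data feed after the watermark. A sketch with the delta-sharing pandas API, reusing the assumed names from above:

```python
import delta_sharing
from google.cloud import bigquery

# Credentials (profile) file and shared table, as '<profile>#<share>.<schema>.<table>'.
# The gs:// path assumes gcsfs is available; otherwise copy the file locally first.
profile = "gs://my-bucket/config.share"
table_url = profile + "#churn_share.default.customer_churn"

# Look up the watermark for this table (None on the very first run).
bq = bigquery.Client(project="my-gcp-project")
rows = list(bq.query(
    "SELECT MAX(last_version_loaded) AS v "
    "FROM `my-gcp-project.churn_dataset.job_reference_table` "
    "WHERE share_table_name = 'churn_share.default.customer_churn'"
).result())
last_version = rows[0].v

if last_version is None:
    # First load: full snapshot at the latest shared table version.
    df = delta_sharing.load_as_pandas(table_url)
else:
    # Incremental load: only the changes after the watermark version.
    df = delta_sharing.load_table_changes_as_pandas(
        table_url, starting_version=last_version + 1
    )
```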
Submit a Dataproc job with the configurations below (a skeleton of the job program follows the list):
- Job type = PySpark
- Cluster = the cluster created above (it must be in a running state)
- Main Python file = gs:// path to the Python program in the bucket created above
- Jar files = gs://spark-lib/bigquery/spark-bigquery-with-dependencies_2.12-0.23.2.jar
- Properties
  - spark.jars.packages = io.delta:delta-sharing-spark_2.12:0.6.4
  - spark.executorEnv.bucket = <temporary GCS bucket created above>
  - spark.executorEnv.share_file_path = <gs:// path to the credentials file>
  - spark.executorEnv.share_table_name = <share>.<schema>.<table>
  - spark.executorEnv.gcp_project_dataset = <project>.<dataset>
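Putting it together, the submitted program can read its settings from the spark.executorEnv.* properties and use the Spark connectors directly. A skeleton under the same assumptions (the target table name and the hard-coded starting version are placeholders; in the real job the version comes from the watermark lookup shown earlier):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("delta-share-to-bigquery").getOrCreate()

# Settings passed in via the job properties above.
bucket = spark.conf.get("spark.executorEnv.bucket")
share_file_path = spark.conf.get("spark.executorEnv.share_file_path")
share_table_name = spark.conf.get("spark.executorEnv.share_table_name")
gcp_project_dataset = spark.conf.get("spark.executorEnv.gcp_project_dataset")

# Read the change data feed from the shared table. starting_version is a
# placeholder; the real job derives it from job_reference_table.
starting_version = 1
df = (
    spark.read.format("deltaSharing")
    .option("readChangeFeed", "true")
    .option("startingVersion", starting_version)
    .load(f"{share_file_path}#{share_table_name}")
)

# Append the changes to BigQuery. The connector stages data in the
# temporary GCS bucket created earlier, hence the temporaryGcsBucket option.
(
    df.write.format("bigquery")
    .option("table", f"{gcp_project_dataset}.customer_churn")
    .option("temporaryGcsBucket", bucket)
    .mode("append")
    .save()
)

# Finally, the job would update job_reference_table with the newly loaded
# version so the next run picks up from there.
```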