## Load Data Incrementally to Target Table

As files are added to the source, let us run the downstream process that partitions the data by date and converts it to the Delta file format.
* Let us write this data to the target location using the `delta` file format. We will use `trigger(once=True)` so that the query processes the available data in a single run and then stops.
* We will also validate by running the `%fs` command to see whether the Delta files have been written.
* The files that are available at the source at this time will be picked up automatically.

In [0]:
spark.conf.set('spark.sql.streaming.schemaInference', 'true')

In [0]:
ghactivity_df = spark.readStream.json('/mnt/itv-github-db/streaming/landing/ghactivity')

In [0]:
from pyspark.sql.functions import year, date_format

In [0]:
# Derive the partition columns (year, month, day of month) from the event timestamp
ghactivity_df = ghactivity_df. \
  withColumn('created_year', year('created_at')). \
  withColumn('created_month', date_format('created_at', 'MM')). \
  withColumn('created_dayofmonth', date_format('created_at', 'dd'))
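
To make the effect of the three derived columns concrete, here is a minimal plain-Python sketch (no Spark needed) of how one `created_at` value maps to a partition directory. The function name `partition_dir` and the timestamp format `2021-01-14T09:30:00Z` are assumptions made for illustration; Spark performs the equivalent work when it evaluates `year` and `date_format` and applies `partitionBy`.

```python
from datetime import datetime


def partition_dir(created_at: str) -> str:
    """Return the partition sub-directory for one event timestamp.

    Hypothetical helper for illustration; assumes timestamps such as
    '2021-01-14T09:30:00Z'.
    """
    ts = datetime.strptime(created_at, '%Y-%m-%dT%H:%M:%SZ')
    created_year = ts.year                   # like year('created_at'): an integer
    created_month = ts.strftime('%m')        # like date_format(..., 'MM'): zero-padded string
    created_dayofmonth = ts.strftime('%d')   # like date_format(..., 'dd'): zero-padded string
    return (f'created_year={created_year}/'
            f'created_month={created_month}/'
            f'created_dayofmonth={created_dayofmonth}')


print(partition_dir('2021-01-14T09:30:00Z'))
# prints: created_year=2021/created_month=01/created_dayofmonth=14
```

Using `MM` and `dd` rather than numeric month and day functions keeps the values zero-padded, so the directory names sort in calendar order.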

In [0]:
# Write as Delta, partitioned by date. With trigger(once=True) the query
# processes the files available now and then stops.
ghactivity_df. \
  writeStream. \
  partitionBy('created_year', 'created_month', 'created_dayofmonth'). \
  format('delta'). \
  option("checkpointLocation", "/mnt/itv-github-db/streaming/bronze/checkpoint/ghactivity"). \
  option("path", "/mnt/itv-github-db/streaming/bronze/data/ghactivity"). \
  trigger(once=True). \
  start()
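
Note that `start()` returns a `StreamingQuery` handle right away while the write continues in the background, so the validation cells below are only meaningful once the query has stopped. The following is a small sketch of one way to wait for it; `wait_until_done` is a hypothetical helper name, and it relies only on `awaitTermination` and `lastProgress` of the query handle.

```python
def wait_until_done(query, timeout_seconds=None):
    """Block until a started streaming query stops, then return its last progress.

    Hypothetical helper: `query` is expected to be the object returned by
    writeStream...start().
    """
    if timeout_seconds is None:
        # Blocks until the query stops (or re-raises the query's failure).
        query.awaitTermination()
    elif not query.awaitTermination(timeout_seconds):
        # With a timeout, awaitTermination returns False if still running.
        raise TimeoutError(f'query still running after {timeout_seconds} seconds')
    return query.lastProgress


# Usage in the notebook (assumes the result of start() was kept in a variable):
# query = ghactivity_df.writeStream. ... .trigger(once=True).start()
# wait_until_done(query)
```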

* Validating the checkpoint location. We can see multiple folders. These folders hold the files Spark uses to track the progress of the stream, so that the next run picks up only the files that have not been processed yet.

In [0]:
%fs ls /mnt/itv-github-db/streaming/bronze/checkpoint/ghactivity

path,name,size
dbfs:/mnt/itv-github-db/streaming/bronze/checkpoint/ghactivity/commits/,commits/,0
dbfs:/mnt/itv-github-db/streaming/bronze/checkpoint/ghactivity/metadata,metadata,45
dbfs:/mnt/itv-github-db/streaming/bronze/checkpoint/ghactivity/offsets/,offsets/,0
dbfs:/mnt/itv-github-db/streaming/bronze/checkpoint/ghactivity/sources/,sources/,0
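
As a rough mental model of how these folders support incremental loads, `offsets/` records each micro-batch that was planned and `commits/` records each micro-batch that finished, with files named after the batch id. The sketch below is a simplification for illustration, not Spark's actual recovery code; the helper names and the sample file names are hypothetical.

```python
def batch_ids(file_names):
    """Keep only names that look like batch ids (plain integers)."""
    return {int(name) for name in file_names if name.isdigit()}


def pending_batches(offset_files, commit_files):
    """Batches that were planned (offsets/) but never marked complete (commits/)."""
    return sorted(batch_ids(offset_files) - batch_ids(commit_files))


# Hypothetical listing: batches 0..2 were planned, only 0 and 1 completed.
print(pending_batches(['0', '1', '2'], ['0', '1']))
# prints: [2]
```

An empty result corresponds to a run that finished cleanly, which is what we expect after a successful `trigger(once=True)` execution.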


* Validating the data location. We should see Parquet files under the partition folders, as Delta stores the table data in the Parquet file format (alongside a `_delta_log` transaction log at the root of the table location).

In [0]:
%fs ls /mnt/itv-github-db/streaming/bronze/data/ghactivity/created_year=2021/created_month=01

path,name,size
dbfs:/mnt/itv-github-db/streaming/bronze/data/ghactivity/created_year=2021/created_month=01/created_dayofmonth=13/,created_dayofmonth=13/,0
dbfs:/mnt/itv-github-db/streaming/bronze/data/ghactivity/created_year=2021/created_month=01/created_dayofmonth=14/,created_dayofmonth=14/,0


In [0]:
%fs ls /mnt/itv-github-db/streaming/bronze/data/ghactivity/created_year=2021/created_month=01/created_dayofmonth=14/

path,name,size
dbfs:/mnt/itv-github-db/streaming/bronze/data/ghactivity/created_year=2021/created_month=01/created_dayofmonth=14/part-00000-12cb6849-4729-44df-9262-af7574cd34b9.c000.snappy.parquet,part-00000-12cb6849-4729-44df-9262-af7574cd34b9.c000.snappy.parquet,214925270
dbfs:/mnt/itv-github-db/streaming/bronze/data/ghactivity/created_year=2021/created_month=01/created_dayofmonth=14/part-00001-c0454c00-dd98-4ce1-9faf-c3b0989e4666.c000.snappy.parquet,part-00001-c0454c00-dd98-4ce1-9faf-c3b0989e4666.c000.snappy.parquet,150909453
dbfs:/mnt/itv-github-db/streaming/bronze/data/ghactivity/created_year=2021/created_month=01/created_dayofmonth=14/part-00002-f890bc45-9179-4f73-8089-804e50748cd0.c000.snappy.parquet,part-00002-f890bc45-9179-4f73-8089-804e50748cd0.c000.snappy.parquet,146356691
dbfs:/mnt/itv-github-db/streaming/bronze/data/ghactivity/created_year=2021/created_month=01/created_dayofmonth=14/part-00003-469755f2-bfa6-441b-8c24-d00d9c579f85.c000.snappy.parquet,part-00003-469755f2-bfa6-441b-8c24-d00d9c579f85.c000.snappy.parquet,146570818
dbfs:/mnt/itv-github-db/streaming/bronze/data/ghactivity/created_year=2021/created_month=01/created_dayofmonth=14/part-00004-b021684e-c4fa-4968-a5fc-afc30c113ca6.c000.snappy.parquet,part-00004-b021684e-c4fa-4968-a5fc-afc30c113ca6.c000.snappy.parquet,144929275
dbfs:/mnt/itv-github-db/streaming/bronze/data/ghactivity/created_year=2021/created_month=01/created_dayofmonth=14/part-00005-6fe1183a-20bb-470c-96ed-800b7fbfeed5.c000.snappy.parquet,part-00005-6fe1183a-20bb-470c-96ed-800b7fbfeed5.c000.snappy.parquet,136733721
dbfs:/mnt/itv-github-db/streaming/bronze/data/ghactivity/created_year=2021/created_month=01/created_dayofmonth=14/part-00006-6f94bc01-eaca-4c56-b2df-50f0bdd5187a.c000.snappy.parquet,part-00006-6f94bc01-eaca-4c56-b2df-50f0bdd5187a.c000.snappy.parquet,129833194
dbfs:/mnt/itv-github-db/streaming/bronze/data/ghactivity/created_year=2021/created_month=01/created_dayofmonth=14/part-00007-720ae7cb-ea73-40ff-96e6-1ef0b7002db2.c000.snappy.parquet,part-00007-720ae7cb-ea73-40ff-96e6-1ef0b7002db2.c000.snappy.parquet,124659342
dbfs:/mnt/itv-github-db/streaming/bronze/data/ghactivity/created_year=2021/created_month=01/created_dayofmonth=14/part-00008-a15cc1f6-a658-4458-8137-34733fe0bff3.c000.snappy.parquet,part-00008-a15cc1f6-a658-4458-8137-34733fe0bff3.c000.snappy.parquet,125953230
dbfs:/mnt/itv-github-db/streaming/bronze/data/ghactivity/created_year=2021/created_month=01/created_dayofmonth=14/part-00009-5236ccd9-39f9-4828-af02-1d61654c3093.c000.snappy.parquet,part-00009-5236ccd9-39f9-4828-af02-1d61654c3093.c000.snappy.parquet,122110466
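
To go a step beyond eyeballing the listing, the output can be summarized by file count and total size. The sketch below parses a listing in the same `path,name,size` layout shown above; `summarize_listing` is a hypothetical helper, and the sample rows use shortened, made-up paths and sizes rather than the real output.

```python
import csv
import io


def summarize_listing(listing: str):
    """Return (file_count, total_bytes) for a 'path,name,size' listing."""
    rows = list(csv.DictReader(io.StringIO(listing.strip())))
    # Directory entries end with '/' and report size 0, so count files only.
    files = [row for row in rows if not row['name'].endswith('/')]
    return len(files), sum(int(row['size']) for row in files)


# Hypothetical sample in the same layout as the %fs ls output.
sample = """path,name,size
dbfs:/tmp/demo/part-00000.snappy.parquet,part-00000.snappy.parquet,100
dbfs:/tmp/demo/part-00001.snappy.parquet,part-00001.snappy.parquet,200
dbfs:/tmp/demo/sub/,sub/,0
"""
print(summarize_listing(sample))
# prints: (2, 300)
```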
