## Incremental Load using Archival Process

Let us see how to read data and write it to the target location using Spark Structured Streaming APIs, while taking care of the files that are already processed.
* As part of data pipelines, we typically delete or archive the files which are already processed.
* As part of the archival process, we typically move the files to a different location backed by low-cost storage.
* Spark Structured Streaming APIs support both `delete` and `archive` through the `cleanSource` option.
* By default, `cleanSource` is set to `off`. We can set it to either `delete` or `archive`.
* When we use `archive`, we also need to set the archive folder using `sourceArchiveDir`. We need to pass the fully qualified path of the archive folder to `sourceArchiveDir`.
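
The rules in the list above can be captured in a small helper that builds the reader options. This is only a sketch: `build_clean_source_options` is a hypothetical name and not part of Spark, while `cleanSource` and `sourceArchiveDir` are the option names described above.

```python
def build_clean_source_options(mode='off', archive_dir=None):
    """Build file-source options for cleaning up processed files.

    Hypothetical helper for illustration; it is not part of the Spark API.
    """
    valid_modes = ('off', 'delete', 'archive')
    if mode not in valid_modes:
        raise ValueError(f'cleanSource must be one of {valid_modes}, got {mode!r}')
    options = {'cleanSource': mode}
    if mode == 'archive':
        # Archiving needs a target folder for the processed files.
        if not archive_dir:
            raise ValueError("sourceArchiveDir is required when cleanSource is 'archive'")
        options['sourceArchiveDir'] = archive_dir
    return options
```

The resulting dictionary can be passed to the reader with `spark.readStream.format('json').options(**opts)`. With `delete`, the processed files are removed instead of being moved.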

In [None]:
from pyspark.sql import SparkSession

import getpass
username = getpass.getuser()

spark = SparkSession. \
    builder. \
    config('spark.ui.port', '0'). \
    config('spark.sql.warehouse.dir', f'/user/{username}/warehouse'). \
    enableHiveSupport(). \
    appName(f'{username} | Python - Incremental Loads using Spark Structured Streaming'). \
    master('yarn'). \
    getOrCreate()

In [None]:
spark.conf.set('spark.sql.shuffle.partitions', '8')

In [None]:
spark.conf.set('spark.sql.streaming.schemaInference', 'true')

In [None]:
ghactivity_df = spark. \
    readStream. \
    format('json'). \
    option('maxFilesPerTrigger', 8). \
    option('cleanSource', 'archive'). \
    option('sourceArchiveDir', f'/user/{username}/itv-github/streaming/landing/archive/ghactivity'). \
    load(f'/user/{username}/itv-github/streaming/landing/ghactivity')

In [None]:
from pyspark.sql.functions import year, month, dayofmonth, lpad

In [None]:
ghactivity_df = ghactivity_df. \
    withColumn('created_year', year('created_at')). \
    withColumn('created_month', lpad(month('created_at'), 2, '0')). \
    withColumn('created_dayofmonth', lpad(dayofmonth('created_at'), 2, '0'))
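
To see what the derived columns look like, here is a plain-Python sketch (no Spark needed) of the same zero-padding logic and of the partition folder name it leads to when the data is written partitioned by these columns. The function name `partition_dir` and the sample timestamp are made up for illustration, and the timestamp format is an assumption about `created_at`.

```python
from datetime import datetime

def partition_dir(created_at: str) -> str:
    """Return the partition sub-folder for a timestamp string."""
    # Assumes timestamps shaped like '2021-01-16T00:00:00Z'.
    ts = datetime.strptime(created_at, '%Y-%m-%dT%H:%M:%SZ')
    year = str(ts.year)
    month = str(ts.month).zfill(2)   # mirrors lpad(month('created_at'), 2, '0')
    day = str(ts.day).zfill(2)       # mirrors lpad(dayofmonth('created_at'), 2, '0')
    return f'created_year={year}/created_month={month}/created_dayofmonth={day}'

print(partition_dir('2021-01-16T00:00:00Z'))
# created_year=2021/created_month=01/created_dayofmonth=16
```

Padding the month and day to two digits keeps the partition folders sorted correctly when listed.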

In [None]:
ghactivity_df. \
    writeStream. \
    partitionBy('created_year', 'created_month', 'created_dayofmonth'). \
    format('parquet'). \
    option("checkpointLocation", f"/user/{username}/itv-github/streaming/bronze/checkpoint/ghactivity"). \
    option("path", f"/user/{username}/itv-github/streaming/bronze/data/ghactivity"). \
    trigger(processingTime='120 seconds'). \
    start()

* Validating the checkpoint location. We can see multiple folders. These folders hold the metadata that Spark maintains to keep track of the progress of the stream, such as the source files that are already processed and the offsets of each batch.

In [None]:
!hdfs dfs -ls /user/${USER}/itv-github/streaming/bronze/checkpoint/ghactivity

In [None]:
!hdfs dfs -ls -R /user/${USER}/itv-github/streaming/bronze/checkpoint/ghactivity/sources

In [None]:
# Check one of the most recent files under sources/0
!hdfs dfs -cat /user/${USER}/itv-github/streaming/bronze/checkpoint/ghactivity/sources/0/6

In [None]:
!hdfs dfs -cat /user/${USER}/itv-github/streaming/bronze/checkpoint/ghactivity/sources/0/1
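
The files under `sources/0` can be read programmatically as well. The sketch below assumes the layout commonly seen for the file source log: a version line such as `v1`, followed by one JSON object per processed file. Treat that layout as an assumption to verify against the output of the `hdfs dfs -cat` commands above. `parse_source_log` is a hypothetical helper name.

```python
import json

def parse_source_log(text: str) -> list:
    """Parse the text of one sources/0/<batch> file into a list of dicts."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    # Skip the leading version marker (for example 'v1') if it is present.
    if lines and not lines[0].startswith('{'):
        lines = lines[1:]
    # Each remaining line is expected to be one JSON object.
    return [json.loads(line) for line in lines]
```

The text can be captured with `subprocess.run(['hdfs', 'dfs', '-cat', path], capture_output=True, text=True).stdout`, after which the returned list has one dictionary per file recorded in that log file.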

In [None]:
!hdfs dfs -ls /user/${USER}/itv-github/streaming/bronze/checkpoint/ghactivity/offsets

In [None]:
!hdfs dfs -cat /user/${USER}/itv-github/streaming/bronze/checkpoint/ghactivity/offsets/0

In [None]:
!hdfs dfs -cat /user/${USER}/itv-github/streaming/bronze/checkpoint/ghactivity/offsets/1

* Validating the data location. We should see Parquet files in this location, as we are just copying the data from the JSON files into the Parquet file format, partitioned by year, month, and day.

In [None]:
!hdfs dfs -ls /user/${USER}/itv-github/streaming/bronze/data/ghactivity/created_year=2021/created_month=01

In [None]:
!hdfs dfs -ls /user/${USER}/itv-github/streaming/bronze/data/ghactivity/created_year=2021/created_month=01/created_dayofmonth=16/

* Validating the archive folder

In [None]:
!hdfs dfs -ls -R /user/${USER}/itv-github/streaming/landing/archive/ghactivity
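
The recursive listing is useful because, per the Spark documentation, archived files keep their original full path underneath `sourceArchiveDir`: a file `/a/b/dataset.txt` archived to `/archived/here` ends up at `/archived/here/a/b/dataset.txt`. A minimal sketch of that mapping, with `archived_path` as a hypothetical helper name:

```python
import posixpath
from urllib.parse import urlparse

def archived_path(source_file: str, archive_dir: str) -> str:
    """Approximate where a processed file lands under the archive folder."""
    # Keep only the path part in case the source is a full URI such as hdfs://host/a/b.
    source_path = urlparse(source_file).path
    # Append the original path (without the leading slash) to the archive folder.
    return posixpath.join(archive_dir, source_path.lstrip('/'))

print(archived_path('/a/b/dataset.txt', '/archived/here'))
# /archived/here/a/b/dataset.txt
```

So for this pipeline we can expect the landing path of each file to be repeated below the archive folder rather than the files sitting directly in it.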