## Using maxFilesPerTrigger and latestFirst

Let us go through the details of reading data and writing it to a target location with the Spark Structured Streaming APIs using `maxFilesPerTrigger`. We will also see how to process the latest files first.

* `maxFilesPerTrigger` is primarily used to keep the usage of resources under control.
* It is useful for baseline loads as well as sudden spikes in incremental loads.
* By default, the oldest files (based on the timestamp associated with the files) are read first; however, we can change this behavior using `latestFirst`.
* We will also validate by running the `hdfs dfs -ls` command to see whether the Parquet files are written.
* The files that are available at the source at this time will be picked up automatically. However, only the latest 8 files will be picked up as part of the first iteration.

`maxFilesPerTrigger` applies only when the job runs are triggered at regular intervals, for example with `trigger(processingTime='60 seconds')`. It is ignored with `trigger(once=True)`.
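
How `maxFilesPerTrigger` and `latestFirst` interact can be illustrated without a cluster. The following is a minimal sketch in plain Python, not Spark's actual implementation: `plan_micro_batches` is a hypothetical helper, the file names are made up, and the sketch assumes a fixed backlog in which no new files arrive between triggers.

```python
from typing import List, Tuple


def plan_micro_batches(
    files: List[Tuple[str, int]],
    max_files_per_trigger: int,
    latest_first: bool = False,
) -> List[List[str]]:
    """Group (path, timestamp) pairs into micro-batches.

    Hypothetical model: files are ordered by timestamp (newest first
    when latest_first is True) and then cut into chunks of at most
    max_files_per_trigger files.
    """
    ordered = sorted(files, key=lambda f: f[1], reverse=latest_first)
    return [
        [path for path, _ in ordered[i:i + max_files_per_trigger]]
        for i in range(0, len(ordered), max_files_per_trigger)
    ]


# 20 made-up files whose timestamps increase with the index
backlog = [(f"file-{i}.json.gz", i) for i in range(20)]

batches = plan_micro_batches(backlog, max_files_per_trigger=8, latest_first=True)
for batch_id, batch in enumerate(batches):
    print(batch_id, len(batch), batch[0])
```

Under these assumptions the 20 files are split into batches of 8, 8, and 4 files, and the first batch starts with the newest file.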

In [1]:
from pyspark.sql import SparkSession

import getpass
username = getpass.getuser()

spark = SparkSession. \
    builder. \
    config('spark.ui.port', '0'). \
    config('spark.sql.warehouse.dir', f'/user/{username}/warehouse'). \
    enableHiveSupport(). \
    appName(f'{username} | Python - Incremental Loads using Spark Structured Streaming'). \
    master('yarn'). \
    getOrCreate()

In [2]:
spark.conf.set('spark.sql.shuffle.partitions', '8')
spark.conf.set('spark.sql.streaming.schemaInference', 'true')

In [3]:
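# Read at most 8 files per micro-batch, newest files first,
# and delete each source file after it has been processed.
ghactivity_df = spark. \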
    readStream. \
    format('json'). \
    option('maxFilesPerTrigger', 8). \
    option('latestFirst', 'True'). \
    option('cleanSource', 'delete'). \
    load(f'/user/{username}/github/streaming/landing/ghactivity/')

In [4]:
from pyspark.sql.functions import year, month, dayofmonth, lpad

In [5]:
ghactivity_df = ghactivity_df. \
    withColumn('created_year', year('created_at')). \
    withColumn('created_month', lpad(month('created_at'), 2, '0')). \
    withColumn('created_dayofmonth', lpad(dayofmonth('created_at'), 2, '0'))

In [6]:
ghactivity_df. \
    writeStream. \
    partitionBy('created_year', 'created_month', 'created_dayofmonth'). \
    format('parquet'). \
    option("checkpointLocation", f"/user/{username}/github/streaming/bronze/checkpoint/ghactivity"). \
    option("path", f"/user/{username}/github/streaming/bronze/data/ghactivity"). \
    trigger(processingTime='60 seconds'). \
    start()

<pyspark.sql.streaming.StreamingQuery at 0x7febd4cbe3c8>
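
With a processing-time trigger, `maxFilesPerTrigger` also determines how many micro-batches a backlog needs. The arithmetic can be sketched as follows. The helper names are hypothetical and the backlog of 24 files is only an illustrative figure. The sketch assumes that the first micro-batch starts immediately and that each micro-batch finishes within the trigger interval.

```python
import math


def batches_needed(backlog_files: int, max_files_per_trigger: int) -> int:
    """Number of micro-batches required to consume the backlog."""
    return math.ceil(backlog_files / max_files_per_trigger)


def final_batch_start_seconds(
    backlog_files: int,
    max_files_per_trigger: int,
    interval_seconds: int,
) -> int:
    """Seconds after the query starts at which the last batch begins.

    Assumes the first batch starts at second 0 and that later batches
    start exactly one trigger interval apart.
    """
    n = batches_needed(backlog_files, max_files_per_trigger)
    return max(n - 1, 0) * interval_seconds


print(batches_needed(24, 8))                  # 3
print(final_batch_start_seconds(24, 8, 60))   # 120
```

If a micro-batch takes longer than the trigger interval, the real schedule stretches accordingly, so these numbers are a lower bound.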

* Validating the checkpoint location. We can see multiple folders. These folders hold the bookkeeping files that the checkpoint needs to track the progress of the stream.

In [7]:
!hdfs dfs -ls /user/${USER}/github/streaming/bronze/checkpoint/ghactivity

Found 4 items
drwxr-xr-x   - itv007304 supergroup          0 2023-07-14 12:23 /user/itv007304/github/streaming/bronze/checkpoint/ghactivity/commits
-rw-r--r--   3 itv007304 supergroup         45 2023-07-14 12:17 /user/itv007304/github/streaming/bronze/checkpoint/ghactivity/metadata
drwxr-xr-x   - itv007304 supergroup          0 2023-07-25 10:27 /user/itv007304/github/streaming/bronze/checkpoint/ghactivity/offsets
drwxr-xr-x   - itv007304 supergroup          0 2023-07-14 12:17 /user/itv007304/github/streaming/bronze/checkpoint/ghactivity/sources


In [8]:
!hdfs dfs -ls -R /user/${USER}/github/streaming/bronze/checkpoint/ghactivity/sources

drwxr-xr-x   - itv007304 supergroup          0 2023-07-25 10:27 /user/itv007304/github/streaming/bronze/checkpoint/ghactivity/sources/0
-rw-r--r--   3 itv007304 supergroup       4432 2023-07-14 12:17 /user/itv007304/github/streaming/bronze/checkpoint/ghactivity/sources/0/0
-rw-r--r--   3 itv007304 supergroup       4432 2023-07-25 10:27 /user/itv007304/github/streaming/bronze/checkpoint/ghactivity/sources/0/1


In [9]:
!hdfs dfs -cat /user/${USER}/github/streaming/bronze/checkpoint/ghactivity/sources/0/0

v1
{"path":"hdfs://m01.itversity.com:9000/user/itv007304/github/streaming/landing/ghactivity/year=2023/month=07/dayofmonth=11/2023-07-11-0.json.gz","timestamp":1689199156255,"batchId":0}
{"path":"hdfs://m01.itversity.com:9000/user/itv007304/github/streaming/landing/ghactivity/year=2023/month=07/dayofmonth=11/2023-07-11-1.json.gz","timestamp":1689276160654,"batchId":0}
{"path":"hdfs://m01.itversity.com:9000/user/itv007304/github/streaming/landing/ghactivity/year=2023/month=07/dayofmonth=11/2023-07-11-2.json.gz","timestamp":1689276171657,"batchId":0}
{"path":"hdfs://m01.itversity.com:9000/user/itv007304/github/streaming/landing/ghactivity/year=2023/month=07/dayofmonth=11/2023-07-11-3.json.gz","timestamp":1689276182338,"batchId":0}
{"path":"hdfs://m01.itversity.com:9000/user/itv007304/github/streaming/landing/ghactivity/year=2023/month=07/dayofmonth=11/2023-07-11-4.json.gz","timestamp":1689276191737,"batchId":0}
{"path":"hdfs://m01.itversity.com:9000/user/itv007304/github/streaming/landin

In [10]:
!hdfs dfs -cat /user/${USER}/github/streaming/bronze/checkpoint/ghactivity/sources/0/1

v1
{"path":"hdfs://m01.itversity.com:9000/user/itv007304/github/streaming/landing/ghactivity/year=2023/month=07/dayofmonth=12/2023-07-12-0.json.gz","timestamp":1690126013086,"batchId":1}
{"path":"hdfs://m01.itversity.com:9000/user/itv007304/github/streaming/landing/ghactivity/year=2023/month=07/dayofmonth=12/2023-07-12-1.json.gz","timestamp":1690126022761,"batchId":1}
{"path":"hdfs://m01.itversity.com:9000/user/itv007304/github/streaming/landing/ghactivity/year=2023/month=07/dayofmonth=12/2023-07-12-2.json.gz","timestamp":1690126032846,"batchId":1}
{"path":"hdfs://m01.itversity.com:9000/user/itv007304/github/streaming/landing/ghactivity/year=2023/month=07/dayofmonth=12/2023-07-12-3.json.gz","timestamp":1690126041951,"batchId":1}
{"path":"hdfs://m01.itversity.com:9000/user/itv007304/github/streaming/landing/ghactivity/year=2023/month=07/dayofmonth=12/2023-07-12-4.json.gz","timestamp":1690126050808,"batchId":1}
{"path":"hdfs://m01.itversity.com:9000/user/itv007304/github/streaming/landin
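
The source log entries shown above are one JSON object per line after a version header, so they are easy to summarize. The following is a small sketch that counts the files recorded per `batchId`. `files_per_batch` is a hypothetical helper, and the sample text uses made-up paths that only mimic the format of the output above.

```python
import json
from collections import Counter
from typing import Dict


def files_per_batch(log_text: str) -> Dict[int, int]:
    """Count source log entries per batchId.

    Lines that are not complete JSON objects (the 'v1' header,
    blank lines, truncated lines) are skipped.
    """
    counts: Counter = Counter()
    for line in log_text.splitlines():
        line = line.strip()
        if not line.startswith("{"):
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue
        counts[entry["batchId"]] += 1
    return dict(counts)


# Made-up sample in the same shape as the checkpoint source log
sample = """v1
{"path":"hdfs://example-host:9000/landing/a.json.gz","timestamp":1,"batchId":0}
{"path":"hdfs://example-host:9000/landing/b.json.gz","timestamp":2,"batchId":0}
{"path":"hdfs://example-host:9000/landin
"""

print(files_per_batch(sample))  # {0: 2}
```

In a notebook, the text could come from capturing the output of the `hdfs dfs -cat` commands used in this section.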

In [11]:
!hdfs dfs -ls /user/${USER}/github/streaming/bronze/checkpoint/ghactivity/offsets

Found 2 items
-rw-r--r--   3 itv007304 supergroup        471 2023-07-14 12:17 /user/itv007304/github/streaming/bronze/checkpoint/ghactivity/offsets/0
-rw-r--r--   3 itv007304 supergroup        471 2023-07-25 10:27 /user/itv007304/github/streaming/bronze/checkpoint/ghactivity/offsets/1


In [13]:
!hdfs dfs -cat /user/${USER}/github/streaming/bronze/checkpoint/ghactivity/offsets/1

v1
{"batchWatermarkMs":0,"batchTimestampMs":1690295265397,"conf":{"spark.sql.streaming.stateStore.providerClass":"org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider","spark.sql.streaming.join.stateFormatVersion":"2","spark.sql.streaming.flatMapGroupsWithState.stateFormatVersion":"2","spark.sql.streaming.multipleWatermarkPolicy":"min","spark.sql.streaming.aggregation.stateFormatVersion":"2","spark.sql.shuffle.partitions":"16"}}
{"logOffset":1}

* Validating the data location. We should see files in this location, as we are just copying the data into the Parquet file format.

In [23]:
!hdfs dfs -ls /user/${USER}/github/streaming/bronze/data/ghactivity/created_year=2023/created_month=07/created_dayofmonth=12

Found 24 items
-rw-r--r--   3 itv007304 supergroup  236303493 2023-07-25 10:31 /user/itv007304/github/streaming/bronze/data/ghactivity/created_year=2023/created_month=07/created_dayofmonth=12/part-00000-10dadd13-e7ff-4a21-9017-d4a6551314c6.c000.snappy.parquet
-rw-r--r--   3 itv007304 supergroup  222001999 2023-07-25 10:30 /user/itv007304/github/streaming/bronze/data/ghactivity/created_year=2023/created_month=07/created_dayofmonth=12/part-00001-4a50d886-3b31-4ed3-bee0-bb5fde49745b.c000.snappy.parquet
-rw-r--r--   3 itv007304 supergroup  230568614 2023-07-25 10:31 /user/itv007304/github/streaming/bronze/data/ghactivity/created_year=2023/created_month=07/created_dayofmonth=12/part-00002-e3116fcf-7651-438a-ab46-7ea001daab63.c000.snappy.parquet
-rw-r--r--   3 itv007304 supergroup  221314719 2023-07-25 10:29 /user/itv007304/github/streaming/bronze/data/ghactivity/created_year=2023/created_month=07/created_dayofmonth=12/part-00003-822f743d-2e86-4327-b29e-d9a6e222f465.c000.snappy.parquet
-rw-r

In [None]:
!hdfs dfs -ls -R /user/${USER}/github/streaming/bronze/data/ghactivity

* Validating the source location to see whether the files are deleted. You should not see the files related to 2023-07-11.

In [24]:
!hdfs dfs -ls -R /user/${USER}/github/streaming/landing/ghactivity

drwxr-xr-x   - itv007304 supergroup          0 2023-07-12 17:57 /user/itv007304/github/streaming/landing/ghactivity/year=2023
drwxr-xr-x   - itv007304 supergroup          0 2023-07-23 11:26 /user/itv007304/github/streaming/landing/ghactivity/year=2023/month=07
drwxr-xr-x   - itv007304 supergroup          0 2023-07-25 10:27 /user/itv007304/github/streaming/landing/ghactivity/year=2023/month=07/dayofmonth=11
drwxr-xr-x   - itv007304 supergroup          0 2023-07-23 11:31 /user/itv007304/github/streaming/landing/ghactivity/year=2023/month=07/dayofmonth=12
-rw-r--r--   3 itv007304 supergroup   77672256 2023-07-23 11:26 /user/itv007304/github/streaming/landing/ghactivity/year=2023/month=07/dayofmonth=12/2023-07-12-0.json.gz
-rw-r--r--   3 itv007304 supergroup   76987420 2023-07-23 11:27 /user/itv007304/github/streaming/landing/ghactivity/year=2023/month=07/dayofmonth=12/2023-07-12-1.json.gz
-rw-r--r--   3 itv007304 supergroup   99933896 2023-07-23 11:28 /user/itv007304/github/streaming/land
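
The same check can be done programmatically by comparing the paths recorded in the checkpoint source log with a recursive listing of the landing folder. The following is a rough sketch: `listed_files` and `still_present` are hypothetical helpers, the sample paths are made up, and the parsing assumes the `hdfs dfs -ls -R` layout shown above with paths that contain no spaces.

```python
from typing import Iterable, List, Set
from urllib.parse import urlparse


def listed_files(ls_output: str) -> Set[str]:
    """Extract file paths (not directories) from `hdfs dfs -ls -R` output."""
    paths = set()
    for line in ls_output.splitlines():
        parts = line.split()
        # File entries have 8 fields and permissions starting with '-'
        if len(parts) >= 8 and parts[0].startswith("-"):
            paths.add(parts[-1])
    return paths


def still_present(processed_uris: Iterable[str], ls_output: str) -> List[str]:
    """Processed files that are still listed in the source location."""
    remaining = listed_files(ls_output)
    processed_paths = (urlparse(uri).path for uri in processed_uris)
    return sorted(p for p in processed_paths if p in remaining)


# Made-up listing and processed URIs in the same shape as the outputs above
sample_listing = """drwxr-xr-x   - demo supergroup          0 2023-07-25 10:27 /user/demo/landing
-rw-r--r--   3 demo supergroup        100 2023-07-23 11:26 /user/demo/landing/b.json.gz
"""
processed = [
    "hdfs://example-host:9000/user/demo/landing/a.json.gz",
    "hdfs://example-host:9000/user/demo/landing/b.json.gz",
]

print(still_present(processed, sample_listing))
```

An empty result would mean that every processed file has already been removed from the source location.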