## Using maxFilesPerTrigger and latestFirst

Let us go through the details of reading data and writing it to a target location with the Spark Structured Streaming APIs, using `maxFilesPerTrigger` to limit how many files each micro-batch reads. We will also see how to process the latest files first.

* `maxFilesPerTrigger` caps the number of new files read in each micro-batch. It is primarily used to keep resource usage under control.
* It is useful for baseline loads as well as sudden spikes in incremental loads.
* By default, the oldest files (based on the timestamp associated with each file) are read first. We can change this behavior using `latestFirst`.
* We will also validate by running the `hdfs dfs -ls` command to see whether the Parquet files have been written.
* The files that are available at the source at this time will be picked up automatically. However, only the latest 8 files will be picked up in the first iteration.

`maxFilesPerTrigger` applies only when the job runs as recurring micro-batches at a fixed interval. It is ignored with `trigger(once=True)`.
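The effect of the two options can be sketched in plain Python, without Spark. This is a simplified model under stated assumptions, not Spark's implementation: `plan_batches` is a hypothetical helper, the listing is static, and the timestamps are made up. The file names follow the pattern used in this walkthrough.

```python
def plan_batches(files, max_files_per_trigger, latest_first=False):
    """Split (name, modification_time) pairs into micro-batches.

    Simplified model: order the files by timestamp (newest first when
    latest_first is True), then cut the list into chunks of at most
    max_files_per_trigger files.
    """
    ordered = sorted(files, key=lambda f: f[1], reverse=latest_first)
    names = [name for name, _ in ordered]
    return [
        names[i:i + max_files_per_trigger]
        for i in range(0, len(names), max_files_per_trigger)
    ]


# Hypothetical landing folder: 24 hourly files with increasing timestamps.
files = [(f"2023-07-13-{hour}.json.gz", 1000 + hour) for hour in range(24)]

batches = plan_batches(files, max_files_per_trigger=8, latest_first=True)
print(len(batches))    # 3
print(batches[0][0])   # 2023-07-13-23.json.gz
print(batches[1][0])   # 2023-07-13-15.json.gz
```

With `latest_first=False`, the first batch would instead start with `2023-07-13-0.json.gz`.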

In [1]:
from pyspark.sql import SparkSession

import getpass
username = getpass.getuser()

spark = SparkSession. \
    builder. \
    config('spark.ui.port', '0'). \
    config('spark.sql.warehouse.dir', f'/user/{username}/warehouse'). \
    enableHiveSupport(). \
    appName(f'{username} | Python - Incremental Loads using Spark Structured Streaming'). \
    master('yarn'). \
    getOrCreate()

In [2]:
spark.conf.set('spark.sql.shuffle.partitions', '8')
spark.conf.set('spark.sql.streaming.schemaInference', 'true')

In [3]:
ghactivity_df = spark. \
    readStream. \
    format('json'). \
    option('maxFilesPerTrigger', 8). \
    option('latestFirst', True). \
    option('cleanSource', 'delete'). \
    load(f'/user/{username}/github/streaming/landing/ghactivity/')

In [4]:
from pyspark.sql.functions import year, month, dayofmonth, lpad

In [5]:
ghactivity_df = ghactivity_df. \
    withColumn('created_year', year('created_at')). \
    withColumn('created_month', lpad(month('created_at'), 2, '0')). \
    withColumn('created_dayofmonth', lpad(dayofmonth('created_at'), 2, '0'))
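
The three derived columns become the partition directories of the target location. As a rough illustration in plain Python (not Spark code), the mapping from a `created_at` value to a directory looks like this. The helper `partition_path` is hypothetical, and the ISO-8601 `created_at` format with a trailing `Z` is an assumption about the input data.

```python
from datetime import datetime


def partition_path(created_at: str) -> str:
    """Build the partition directory that the three derived columns imply."""
    # Assumed input format, e.g. 2023-07-13T05:00:00Z
    ts = datetime.strptime(created_at, "%Y-%m-%dT%H:%M:%SZ")
    # rjust(2, '0') plays the role of lpad(..., 2, '0') for 1- and 2-digit values
    month = str(ts.month).rjust(2, "0")
    day = str(ts.day).rjust(2, "0")
    return (
        f"created_year={ts.year}/"
        f"created_month={month}/"
        f"created_dayofmonth={day}"
    )


print(partition_path("2023-07-13T05:00:00Z"))
# created_year=2023/created_month=07/created_dayofmonth=13
```

Padding the month and day to two digits keeps the directory names sorted in calendar order.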

In [6]:
ghactivity_df. \
    writeStream. \
    partitionBy('created_year', 'created_month', 'created_dayofmonth'). \
    format('parquet'). \
    option("checkpointLocation", f"/user/{username}/github/streaming/bronze/checkpoint/ghactivity"). \
    option("path", f"/user/{username}/github/streaming/bronze/data/ghactivity"). \
    trigger(processingTime='60 seconds'). \
    start()

# If a job run completes in under 60 seconds, the next run waits until the 60-second interval has elapsed.
# If a job run takes more than 60 seconds to complete, the next run starts immediately after it finishes.

<pyspark.sql.streaming.StreamingQuery at 0x7fba5f7fe518>
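
The timing rule in the two comments above can be sketched as a small plain-Python model. It only mirrors those two statements and is not Spark's scheduler, which may align triggers to interval boundaries differently. `batch_start_times` and the batch durations are hypothetical.

```python
def batch_start_times(durations, interval=60):
    """Return the start time (in seconds) of each micro-batch.

    Simplified model of a processing-time trigger:
    - a batch that finishes early waits for the interval to elapse;
    - a batch that overruns is followed immediately by the next one.
    """
    starts = []
    next_start = 0
    for duration in durations:
        starts.append(next_start)
        finished = next_start + duration
        next_start = max(next_start + interval, finished)
    return starts


# Hypothetical batch durations: 20 s, 90 s, 30 s
print(batch_start_times([20, 90, 30]))  # [0, 60, 150]
```

The second batch starts at 60 s because the first one finished early. The third starts at 150 s, as soon as the 90-second batch ends.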

* Validating the checkpoint location. We can see multiple folders. They hold the files Spark uses to track the progress of the streaming query, such as the source files seen, the offsets, and the commits.

In [7]:
!hdfs dfs -ls /user/${USER}/github/streaming/bronze/checkpoint/ghactivity

Found 4 items
drwxr-xr-x   - itv007304 supergroup          0 2023-07-25 10:33 /user/itv007304/github/streaming/bronze/checkpoint/ghactivity/commits
-rw-r--r--   3 itv007304 supergroup         45 2023-07-14 12:17 /user/itv007304/github/streaming/bronze/checkpoint/ghactivity/metadata
drwxr-xr-x   - itv007304 supergroup          0 2023-07-29 00:15 /user/itv007304/github/streaming/bronze/checkpoint/ghactivity/offsets
drwxr-xr-x   - itv007304 supergroup          0 2023-07-14 12:17 /user/itv007304/github/streaming/bronze/checkpoint/ghactivity/sources


In [9]:
!hdfs dfs -ls -R /user/${USER}/github/streaming/bronze/checkpoint/ghactivity/sources

drwxr-xr-x   - itv007304 supergroup          0 2023-07-29 00:18 /user/itv007304/github/streaming/bronze/checkpoint/ghactivity/sources/0
-rw-r--r--   3 itv007304 supergroup       4432 2023-07-14 12:17 /user/itv007304/github/streaming/bronze/checkpoint/ghactivity/sources/0/0
-rw-r--r--   3 itv007304 supergroup       4432 2023-07-25 10:27 /user/itv007304/github/streaming/bronze/checkpoint/ghactivity/sources/0/1
-rw-r--r--   3 itv007304 supergroup       1482 2023-07-29 00:15 /user/itv007304/github/streaming/bronze/checkpoint/ghactivity/sources/0/2
-rw-r--r--   3 itv007304 supergroup       1480 2023-07-29 00:18 /user/itv007304/github/streaming/bronze/checkpoint/ghactivity/sources/0/3


In [10]:
!hdfs dfs -cat /user/${USER}/github/streaming/bronze/checkpoint/ghactivity/sources/0/0

v1
{"path":"hdfs://m01.itversity.com:9000/user/itv007304/github/streaming/landing/ghactivity/year=2023/month=07/dayofmonth=11/2023-07-11-0.json.gz","timestamp":1689199156255,"batchId":0}
{"path":"hdfs://m01.itversity.com:9000/user/itv007304/github/streaming/landing/ghactivity/year=2023/month=07/dayofmonth=11/2023-07-11-1.json.gz","timestamp":1689276160654,"batchId":0}
{"path":"hdfs://m01.itversity.com:9000/user/itv007304/github/streaming/landing/ghactivity/year=2023/month=07/dayofmonth=11/2023-07-11-2.json.gz","timestamp":1689276171657,"batchId":0}
{"path":"hdfs://m01.itversity.com:9000/user/itv007304/github/streaming/landing/ghactivity/year=2023/month=07/dayofmonth=11/2023-07-11-3.json.gz","timestamp":1689276182338,"batchId":0}
{"path":"hdfs://m01.itversity.com:9000/user/itv007304/github/streaming/landing/ghactivity/year=2023/month=07/dayofmonth=11/2023-07-11-4.json.gz","timestamp":1689276191737,"batchId":0}
{"path":"hdfs://m01.itversity.com:9000/user/itv007304/github/streaming/landin

In [11]:
!hdfs dfs -cat /user/${USER}/github/streaming/bronze/checkpoint/ghactivity/sources/0/1

v1
{"path":"hdfs://m01.itversity.com:9000/user/itv007304/github/streaming/landing/ghactivity/year=2023/month=07/dayofmonth=12/2023-07-12-0.json.gz","timestamp":1690126013086,"batchId":1}
{"path":"hdfs://m01.itversity.com:9000/user/itv007304/github/streaming/landing/ghactivity/year=2023/month=07/dayofmonth=12/2023-07-12-1.json.gz","timestamp":1690126022761,"batchId":1}
{"path":"hdfs://m01.itversity.com:9000/user/itv007304/github/streaming/landing/ghactivity/year=2023/month=07/dayofmonth=12/2023-07-12-2.json.gz","timestamp":1690126032846,"batchId":1}
{"path":"hdfs://m01.itversity.com:9000/user/itv007304/github/streaming/landing/ghactivity/year=2023/month=07/dayofmonth=12/2023-07-12-3.json.gz","timestamp":1690126041951,"batchId":1}
{"path":"hdfs://m01.itversity.com:9000/user/itv007304/github/streaming/landing/ghactivity/year=2023/month=07/dayofmonth=12/2023-07-12-4.json.gz","timestamp":1690126050808,"batchId":1}
{"path":"hdfs://m01.itversity.com:9000/user/itv007304/github/streaming/landin

In [12]:
!hdfs dfs -cat /user/${USER}/github/streaming/bronze/checkpoint/ghactivity/sources/0/2

v1
{"path":"hdfs://m01.itversity.com:9000/user/itv007304/github/streaming/landing/ghactivity/year=2023/month=07/dayofmonth=13/2023-07-13-23.json.gz","timestamp":1690601469796,"batchId":2}
{"path":"hdfs://m01.itversity.com:9000/user/itv007304/github/streaming/landing/ghactivity/year=2023/month=07/dayofmonth=13/2023-07-13-22.json.gz","timestamp":1690601458911,"batchId":2}
{"path":"hdfs://m01.itversity.com:9000/user/itv007304/github/streaming/landing/ghactivity/year=2023/month=07/dayofmonth=13/2023-07-13-21.json.gz","timestamp":1690601447783,"batchId":2}
{"path":"hdfs://m01.itversity.com:9000/user/itv007304/github/streaming/landing/ghactivity/year=2023/month=07/dayofmonth=13/2023-07-13-20.json.gz","timestamp":1690601434392,"batchId":2}
{"path":"hdfs://m01.itversity.com:9000/user/itv007304/github/streaming/landing/ghactivity/year=2023/month=07/dayofmonth=13/2023-07-13-19.json.gz","timestamp":1690601420916,"batchId":2}
{"path":"hdfs://m01.itversity.com:9000/user/itv007304/github/streaming/l

In [13]:
!hdfs dfs -cat /user/${USER}/github/streaming/bronze/checkpoint/ghactivity/sources/0/3

v1
{"path":"hdfs://m01.itversity.com:9000/user/itv007304/github/streaming/landing/ghactivity/year=2023/month=07/dayofmonth=13/2023-07-13-15.json.gz","timestamp":1690601362716,"batchId":3}
{"path":"hdfs://m01.itversity.com:9000/user/itv007304/github/streaming/landing/ghactivity/year=2023/month=07/dayofmonth=13/2023-07-13-14.json.gz","timestamp":1690601350312,"batchId":3}
{"path":"hdfs://m01.itversity.com:9000/user/itv007304/github/streaming/landing/ghactivity/year=2023/month=07/dayofmonth=13/2023-07-13-13.json.gz","timestamp":1690601335094,"batchId":3}
{"path":"hdfs://m01.itversity.com:9000/user/itv007304/github/streaming/landing/ghactivity/year=2023/month=07/dayofmonth=13/2023-07-13-12.json.gz","timestamp":1690601321692,"batchId":3}
{"path":"hdfs://m01.itversity.com:9000/user/itv007304/github/streaming/landing/ghactivity/year=2023/month=07/dayofmonth=13/2023-07-13-11.json.gz","timestamp":1690601307004,"batchId":3}
{"path":"hdfs://m01.itversity.com:9000/user/itv007304/github/streaming/l

In [14]:
!hdfs dfs -cat /user/${USER}/github/streaming/bronze/checkpoint/ghactivity/sources/0/4

v1
{"path":"hdfs://m01.itversity.com:9000/user/itv007304/github/streaming/landing/ghactivity/year=2023/month=07/dayofmonth=13/2023-07-13-7.json.gz","timestamp":1690601247874,"batchId":4}
{"path":"hdfs://m01.itversity.com:9000/user/itv007304/github/streaming/landing/ghactivity/year=2023/month=07/dayofmonth=13/2023-07-13-6.json.gz","timestamp":1690601234646,"batchId":4}
{"path":"hdfs://m01.itversity.com:9000/user/itv007304/github/streaming/landing/ghactivity/year=2023/month=07/dayofmonth=13/2023-07-13-5.json.gz","timestamp":1690601222349,"batchId":4}
{"path":"hdfs://m01.itversity.com:9000/user/itv007304/github/streaming/landing/ghactivity/year=2023/month=07/dayofmonth=13/2023-07-13-4.json.gz","timestamp":1690601210120,"batchId":4}
{"path":"hdfs://m01.itversity.com:9000/user/itv007304/github/streaming/landing/ghactivity/year=2023/month=07/dayofmonth=13/2023-07-13-3.json.gz","timestamp":1690601200306,"batchId":4}
{"path":"hdfs://m01.itversity.com:9000/user/itv007304/github/streaming/landin
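
The source log files shown above consist of a version line followed by one JSON record per input file. A minimal sketch for inspecting them in plain Python follows. `parse_source_log` is a hypothetical helper, and the sample text reuses two complete lines from the `sources/0/2` output above, since the displayed outputs are cut off.

```python
import json
from collections import defaultdict


def parse_source_log(text):
    """Parse a file-source log: a version line, then one JSON record per file."""
    lines = text.strip().splitlines()
    version, records = lines[0], lines[1:]
    by_batch = defaultdict(list)
    for line in records:
        record = json.loads(line)
        file_name = record["path"].rsplit("/", 1)[-1]
        by_batch[record["batchId"]].append((file_name, record["timestamp"]))
    return version, dict(by_batch)


sample = """v1
{"path":"hdfs://m01.itversity.com:9000/user/itv007304/github/streaming/landing/ghactivity/year=2023/month=07/dayofmonth=13/2023-07-13-23.json.gz","timestamp":1690601469796,"batchId":2}
{"path":"hdfs://m01.itversity.com:9000/user/itv007304/github/streaming/landing/ghactivity/year=2023/month=07/dayofmonth=13/2023-07-13-22.json.gz","timestamp":1690601458911,"batchId":2}
"""

version, batches = parse_source_log(sample)
print(version)                             # v1
print([name for name, _ in batches[2]])    # ['2023-07-13-23.json.gz', '2023-07-13-22.json.gz']
```

Within batch 2 the timestamps decrease from one record to the next, which is consistent with `latestFirst` being enabled.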

In [15]:
!hdfs dfs -ls /user/${USER}/github/streaming/bronze/checkpoint/ghactivity/offsets

Found 5 items
-rw-r--r--   3 itv007304 supergroup        471 2023-07-14 12:17 /user/itv007304/github/streaming/bronze/checkpoint/ghactivity/offsets/0
-rw-r--r--   3 itv007304 supergroup        471 2023-07-25 10:27 /user/itv007304/github/streaming/bronze/checkpoint/ghactivity/offsets/1
-rw-r--r--   3 itv007304 supergroup        471 2023-07-29 00:15 /user/itv007304/github/streaming/bronze/checkpoint/ghactivity/offsets/2
-rw-r--r--   3 itv007304 supergroup        471 2023-07-29 00:18 /user/itv007304/github/streaming/bronze/checkpoint/ghactivity/offsets/3
-rw-r--r--   3 itv007304 supergroup        471 2023-07-29 00:20 /user/itv007304/github/streaming/bronze/checkpoint/ghactivity/offsets/4


In [17]:
!hdfs dfs -cat /user/${USER}/github/streaming/bronze/checkpoint/ghactivity/offsets/0

v1
{"batchWatermarkMs":0,"batchTimestampMs":1689351454553,"conf":{"spark.sql.streaming.stateStore.providerClass":"org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider","spark.sql.streaming.join.stateFormatVersion":"2","spark.sql.streaming.flatMapGroupsWithState.stateFormatVersion":"2","spark.sql.streaming.multipleWatermarkPolicy":"min","spark.sql.streaming.aggregation.stateFormatVersion":"2","spark.sql.shuffle.partitions":"16"}}
{"logOffset":0}

In [19]:
!hdfs dfs -cat /user/${USER}/github/streaming/bronze/checkpoint/ghactivity/offsets/3

v1
{"batchWatermarkMs":0,"batchTimestampMs":1690604283947,"conf":{"spark.sql.streaming.stateStore.providerClass":"org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider","spark.sql.streaming.join.stateFormatVersion":"2","spark.sql.streaming.flatMapGroupsWithState.stateFormatVersion":"2","spark.sql.streaming.multipleWatermarkPolicy":"min","spark.sql.streaming.aggregation.stateFormatVersion":"2","spark.sql.shuffle.partitions":"16"}}
{"logOffset":3}
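
Each offsets file has the same three-part layout: a version line, a JSON metadata line, and one JSON line per source. A minimal sketch for reading it in plain Python is shown below. `parse_offset_log` is a hypothetical helper, and the `conf` map in the sample is shortened to a single key for readability.

```python
import json


def parse_offset_log(text):
    """Parse an offsets file: version line, metadata line, then source offsets."""
    lines = text.strip().splitlines()
    version = lines[0]
    metadata = json.loads(lines[1])
    offsets = [json.loads(line) for line in lines[2:]]
    return version, metadata, offsets


# Sample based on offsets/3 above; conf shortened to one key for readability.
sample = """v1
{"batchWatermarkMs":0,"batchTimestampMs":1690604283947,"conf":{"spark.sql.shuffle.partitions":"16"}}
{"logOffset":3}
"""

version, metadata, offsets = parse_offset_log(sample)
print(offsets[0]["logOffset"])                          # 3
print(metadata["conf"]["spark.sql.shuffle.partitions"])  # 16
```

The recorded `spark.sql.shuffle.partitions` is 16 even though the session above sets it to 8. A likely explanation is that the checkpoint was first created with 16 and the value stored in the checkpoint is reused when the query restarts from the same checkpoint location.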

* Validating the data location. We should see files in this location, as we are simply copying the data into the Parquet file format.

In [20]:
!hdfs dfs -ls /user/${USER}/github/streaming/bronze/data/ghactivity/created_year=2023/created_month=07

Found 3 items
drwxr-xr-x   - itv007304 supergroup          0 2023-07-14 12:22 /user/itv007304/github/streaming/bronze/data/ghactivity/created_year=2023/created_month=07/created_dayofmonth=11
drwxr-xr-x   - itv007304 supergroup          0 2023-07-25 10:32 /user/itv007304/github/streaming/bronze/data/ghactivity/created_year=2023/created_month=07/created_dayofmonth=12
drwxr-xr-x   - itv007304 supergroup          0 2023-07-29 00:20 /user/itv007304/github/streaming/bronze/data/ghactivity/created_year=2023/created_month=07/created_dayofmonth=13


In [21]:
!hdfs dfs -ls /user/${USER}/github/streaming/bronze/data/ghactivity/created_year=2023/created_month=07/created_dayofmonth=13

Found 24 items
-rw-r--r--   3 itv007304 supergroup  205772846 2023-07-29 00:22 /user/itv007304/github/streaming/bronze/data/ghactivity/created_year=2023/created_month=07/created_dayofmonth=13/part-00000-99ff4ce2-7daf-4d40-9892-8b1439259318.c000.snappy.parquet
-rw-r--r--   3 itv007304 supergroup  223230532 2023-07-29 00:20 /user/itv007304/github/streaming/bronze/data/ghactivity/created_year=2023/created_month=07/created_dayofmonth=13/part-00000-b6f8c050-43ba-4203-a061-a50e551c5ab4.c000.snappy.parquet
-rw-r--r--   3 itv007304 supergroup  208222488 2023-07-29 00:17 /user/itv007304/github/streaming/bronze/data/ghactivity/created_year=2023/created_month=07/created_dayofmonth=13/part-00000-ecd72c59-de78-44e3-9bff-becead37acd9.c000.snappy.parquet
-rw-r--r--   3 itv007304 supergroup  201686617 2023-07-29 00:17 /user/itv007304/github/streaming/bronze/data/ghactivity/created_year=2023/created_month=07/created_dayofmonth=13/part-00001-2691e150-c63a-4098-8d3d-57b2ccaade59.c000.snappy.parquet
-rw-r

In [None]:
!hdfs dfs -ls -R /user/${USER}/github/streaming/bronze/data/ghactivity

* Validating the source location to see whether the files are deleted or not.

In [23]:
!hdfs dfs -ls -R /user/${USER}/github/streaming/landing/ghactivity

drwxr-xr-x   - itv007304 supergroup          0 2023-07-12 17:57 /user/itv007304/github/streaming/landing/ghactivity/year=2023
drwxr-xr-x   - itv007304 supergroup          0 2023-07-28 23:25 /user/itv007304/github/streaming/landing/ghactivity/year=2023/month=07
drwxr-xr-x   - itv007304 supergroup          0 2023-07-25 10:27 /user/itv007304/github/streaming/landing/ghactivity/year=2023/month=07/dayofmonth=11
drwxr-xr-x   - itv007304 supergroup          0 2023-07-29 00:16 /user/itv007304/github/streaming/landing/ghactivity/year=2023/month=07/dayofmonth=12
drwxr-xr-x   - itv007304 supergroup          0 2023-07-29 00:20 /user/itv007304/github/streaming/landing/ghactivity/year=2023/month=07/dayofmonth=13
-rw-r--r--   3 itv007304 supergroup   79630690 2023-07-28 23:25 /user/itv007304/github/streaming/landing/ghactivity/year=2023/month=07/dayofmonth=13/2023-07-13-0.json.gz
-rw-r--r--   3 itv007304 supergroup   79798127 2023-07-28 23:26 /user/itv007304/github/streaming/landing/ghactivity/year=2