## Using maxFilesPerTrigger and latestFirst

Let us go through the details of reading data and writing it to the target location using the Spark Structured Streaming APIs with `maxFilesPerTrigger`. We will also see how to process the latest files first.
* `maxFilesPerTrigger` caps the number of new files read in each micro-batch, which keeps resource usage under control.
* It is useful for baseline loads as well as sudden spikes in incremental loads.
* By default, the oldest files (based on the timestamp associated with the files) are read first; we can reverse this order using `latestFirst`.
* We will also validate by running the `hdfs dfs -ls` command to check whether the Parquet files are written or not.
* All the files available at the source at this time will be picked up automatically. However, with `maxFilesPerTrigger` set to 8 and `latestFirst` enabled, only the latest 8 files will be picked up in the first iteration.

```{note}
`maxFilesPerTrigger` is not applicable when we trigger job runs using `trigger(once=True)`.
```
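
The selection behavior described above can be sketched in plain Python. This is only a simplified simulation, not Spark's implementation: it assumes a static directory listing (Spark re-lists the source on every trigger), and the function `plan_batches` and the file names are hypothetical, introduced only for illustration.

```python
from typing import List, Tuple


def plan_batches(
    files: List[Tuple[str, int]],
    max_files_per_trigger: int,
    latest_first: bool = False,
) -> List[List[str]]:
    """Group (file_name, modification_time) pairs into micro-batches.

    Hypothetical helper: files are ordered by modification time
    (newest first when latest_first is True) and then cut into
    chunks of at most max_files_per_trigger files.
    """
    if max_files_per_trigger <= 0:
        raise ValueError("max_files_per_trigger must be positive")
    ordered = sorted(files, key=lambda f: f[1], reverse=latest_first)
    names = [name for name, _ in ordered]
    return [
        names[i:i + max_files_per_trigger]
        for i in range(0, len(names), max_files_per_trigger)
    ]


# 20 made-up files with increasing modification times
landing = [(f"file-{i:02d}.json", 1000 + i) for i in range(20)]
batches = plan_batches(landing, max_files_per_trigger=8, latest_first=True)

for batch_id, batch in enumerate(batches):
    print(batch_id, len(batch), batch[0], batch[-1])
# 0 8 file-19.json file-12.json
# 1 8 file-11.json file-04.json
# 2 4 file-03.json file-00.json
```

With 20 files and a limit of 8, the backlog is drained in three micro-batches instead of one large run, which is why the option helps with baseline loads and spikes.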

In [None]:
from pyspark.sql import SparkSession

import getpass
username = getpass.getuser()

spark = SparkSession. \
    builder. \
    config('spark.ui.port', '0'). \
    config('spark.sql.warehouse.dir', f'/user/{username}/warehouse'). \
    enableHiveSupport(). \
    appName(f'{username} | Python - Incremental Loads using Spark Structured Streaming'). \
    master('yarn'). \
    getOrCreate()

In [None]:
spark.conf.set('spark.sql.shuffle.partitions', '8')

In [None]:
spark.conf.set('spark.sql.streaming.schemaInference', 'true')

In [None]:
ghactivity_df = spark. \
    readStream. \
    format('json'). \
    option('maxFilesPerTrigger', 8). \
    option('latestFirst', True). \
    option('cleanSource', 'delete'). \
    option('path', f'/user/{username}/itv-github/streaming/landing/ghactivity'). \
    load()

# We can also pass the path directly to load()

In [None]:
from pyspark.sql.functions import year, month, dayofmonth, lpad

In [None]:
ghactivity_df = ghactivity_df. \
    withColumn('created_year', year('created_at')). \
    withColumn('created_month', lpad(month('created_at'), 2, '0')). \
    withColumn('created_dayofmonth', lpad(dayofmonth('created_at'), 2, '0'))
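
When these derived columns are used as partition columns in the write, each distinct combination becomes a directory of the form `column=value`. The following plain-Python sketch mirrors the zero-padding done by `lpad` to show what directory a given record would land in. The helper `partition_dir` is hypothetical, and the `created_at` format (`YYYY-MM-DDTHH:MM:SSZ`) is an assumption about the input data.

```python
from datetime import datetime


def partition_dir(created_at: str) -> str:
    """Return the relative partition directory for a created_at value.

    Hypothetical helper: assumes created_at looks like
    '2021-01-15T10:30:00Z'. Month and day are zero-padded to two
    characters, matching lpad(..., 2, '0') in the Spark code.
    """
    ts = datetime.strptime(created_at, "%Y-%m-%dT%H:%M:%SZ")
    return (
        f"created_year={ts.year}/"
        f"created_month={ts.month:02d}/"
        f"created_dayofmonth={ts.day:02d}"
    )


print(partition_dir("2021-01-15T10:30:00Z"))
# created_year=2021/created_month=01/created_dayofmonth=15
```

Zero-padding keeps the directory names a fixed width, so they sort correctly when listed.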

In [None]:
ghactivity_df. \
    writeStream. \
    partitionBy('created_year', 'created_month', 'created_dayofmonth'). \
    format('parquet'). \
    option("checkpointLocation", f"/user/{username}/itv-github/streaming/bronze/checkpoint/ghactivity"). \
    option("path", f"/user/{username}/itv-github/streaming/bronze/data/ghactivity"). \
    trigger(processingTime='60 seconds'). \
    start()

# If a run completes in less than 60 seconds, the next run waits until the next 60-second trigger.
# If a run takes more than 60 seconds, the next run starts immediately after it completes.
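
The two timing rules in the comments above can be illustrated with a small simulation. This is a simplified model under stated assumptions, not Spark's scheduler: time is counted in whole seconds from 0, trigger boundaries fall on multiples of the interval, and `batch_start_times` is a hypothetical helper.

```python
from typing import List


def batch_start_times(durations: List[int], interval: int = 60) -> List[int]:
    """Return the start time (in seconds) of each micro-batch.

    Hypothetical model: a batch that finishes before the next interval
    boundary is followed by a wait until that boundary; a batch that
    overruns the boundary is followed immediately by the next batch.
    """
    starts = []
    now = 0
    for duration in durations:
        starts.append(now)
        finished = now + duration
        next_boundary = (now // interval + 1) * interval
        # Wait for the boundary only if the batch finished early.
        now = max(finished, next_boundary)
    return starts


# Batch durations of 20s, 90s, 30s and 45s with a 60-second trigger
print(batch_start_times([20, 90, 30, 45]))
# [0, 60, 150, 180]
```

The second batch takes 90 seconds and misses the boundary at 120, so the third batch starts at 150 as soon as the second one finishes.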

* Validating the checkpoint location. We can see multiple folders. These folders hold the metadata the checkpoint needs to track progress, such as the source files already read (`sources`) and the offsets of each batch (`offsets`).

```{note}
You may need to wait a few minutes before running the cells below.
```

In [None]:
!hdfs dfs -ls /user/${USER}/itv-github/streaming/bronze/checkpoint/ghactivity

In [None]:
!hdfs dfs -ls -R /user/${USER}/itv-github/streaming/bronze/checkpoint/ghactivity/sources

In [None]:
!hdfs dfs -cat /user/${USER}/itv-github/streaming/bronze/checkpoint/ghactivity/sources/0/0

In [None]:
!hdfs dfs -cat /user/${USER}/itv-github/streaming/bronze/checkpoint/ghactivity/sources/0/1

In [None]:
!hdfs dfs -cat /user/${USER}/itv-github/streaming/bronze/checkpoint/ghactivity/sources/0/2

In [None]:
!hdfs dfs -cat /user/${USER}/itv-github/streaming/bronze/checkpoint/ghactivity/sources/0/3

In [None]:
!hdfs dfs -cat /user/${USER}/itv-github/streaming/bronze/checkpoint/ghactivity/sources/0/4

In [None]:
!hdfs dfs -ls /user/${USER}/itv-github/streaming/bronze/checkpoint/ghactivity/offsets

In [None]:
!hdfs dfs -cat /user/${USER}/itv-github/streaming/bronze/checkpoint/ghactivity/offsets/0

In [None]:
!hdfs dfs -cat /user/${USER}/itv-github/streaming/bronze/checkpoint/ghactivity/offsets/1

In [None]:
!hdfs dfs -cat /user/${USER}/itv-github/streaming/bronze/checkpoint/ghactivity/offsets/3

* Validating the data location. We should see Parquet files in this location, as we are just copying the source data into the Parquet file format.

In [None]:
!hdfs dfs -ls /user/${USER}/itv-github/streaming/bronze/data/ghactivity/created_year=2021/created_month=01

In [None]:
!hdfs dfs -ls /user/${USER}/itv-github/streaming/bronze/data/ghactivity/created_year=2021/created_month=01/created_dayofmonth=15/

* Validating the source location to see if the files are deleted or not.

In [None]:
!hdfs dfs -ls -R /user/${USER}/itv-github/streaming/landing/ghactivity