## Validate Incremental Load

Let us analyze the GHActivity data that our Spark Structured Streaming process has added to the target location.
* Location: **/user/{username}/itv-github/streaming/bronze/data/ghactivity**.
* As the files are in Parquet format, we can read them with `spark.read.parquet(path)`, which is shorthand for `spark.read.format('parquet').load(path)`.

In [1]:
from pyspark.sql import SparkSession

import getpass
username = getpass.getuser()

spark = SparkSession. \
    builder. \
    config('spark.ui.port', '0'). \
    config('spark.sql.warehouse.dir', f'/user/{username}/warehouse'). \
    enableHiveSupport(). \
    appName(f'{username} | Python - Incremental Loads using Spark Structured Streaming'). \
    master('yarn'). \
    getOrCreate()

In [2]:
!hdfs dfs -ls /user/${USER}/itv-github/streaming/bronze/data/ghactivity/created_year=2021/created_month=01

Found 4 items
drwxr-xr-x   - itversity itversity          0 2021-09-14 14:37 /user/itversity/itv-github/streaming/bronze/data/ghactivity/created_year=2021/created_month=01/created_dayofmonth=13
drwxr-xr-x   - itversity itversity          0 2021-09-14 15:00 /user/itversity/itv-github/streaming/bronze/data/ghactivity/created_year=2021/created_month=01/created_dayofmonth=14
drwxr-xr-x   - itversity itversity          0 2021-09-14 15:17 /user/itversity/itv-github/streaming/bronze/data/ghactivity/created_year=2021/created_month=01/created_dayofmonth=15
drwxr-xr-x   - itversity itversity          0 2021-09-14 15:51 /user/itversity/itv-github/streaming/bronze/data/ghactivity/created_year=2021/created_month=01/created_dayofmonth=16
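
The directory names in the listing encode the partition values. As a minimal sketch, assuming the `created_year=/created_month=/created_dayofmonth=` layout shown above stays the same, the paths can be turned into dates so that the loaded days can be checked programmatically. `partition_date` is a hypothetical helper written for this illustration; it is not part of the streaming pipeline.

```python
import re
from datetime import date

# Matches the partition layout shown in the listing (assumed to be stable).
PARTITION_RE = re.compile(
    r'created_year=(\d{4})/created_month=(\d{1,2})/created_dayofmonth=(\d{1,2})'
)


def partition_date(path):
    """Return the date encoded in a partition path, or None if no full partition spec is present."""
    match = PARTITION_RE.search(path)
    if match is None:
        return None
    year, month, day = (int(part) for part in match.groups())
    return date(year, month, day)


# Paths copied from the listing above.
base = '/user/itversity/itv-github/streaming/bronze/data/ghactivity'
paths = [
    f'{base}/created_year=2021/created_month=01/created_dayofmonth={day}'
    for day in (13, 14, 15, 16)
]

loaded_dates = sorted(partition_date(p) for p in paths)
print(loaded_dates[0], loaded_dates[-1])  # 2021-01-13 2021-01-16
```

A path without a complete partition spec returns `None`, so such entries can be filtered out before sorting.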


In [3]:
ghactivity = spark. \
    read. \
    parquet(f'/user/{username}/itv-github/streaming/bronze/data/ghactivity')

In [4]:
ghactivity.count()

9591684

In [5]:
ghactivity.printSchema()

root
 |-- actor: struct (nullable = true)
 |    |-- avatar_url: string (nullable = true)
 |    |-- display_login: string (nullable = true)
 |    |-- gravatar_id: string (nullable = true)
 |    |-- id: long (nullable = true)
 |    |-- login: string (nullable = true)
 |    |-- url: string (nullable = true)
 |-- created_at: string (nullable = true)
 |-- id: string (nullable = true)
 |-- org: struct (nullable = true)
 |    |-- avatar_url: string (nullable = true)
 |    |-- gravatar_id: string (nullable = true)
 |    |-- id: long (nullable = true)
 |    |-- login: string (nullable = true)
 |    |-- url: string (nullable = true)
 |-- payload: struct (nullable = true)
 |    |-- action: string (nullable = true)
 |    |-- before: string (nullable = true)
 |    |-- comment: struct (nullable = true)
 |    |    |-- _links: struct (nullable = true)
 |    |    |    |-- html: struct (nullable = true)
 |    |    |    |    |-- href: string (nullable = true)
 |    |    |    |-- pull_request: struct (nul

In [6]:
ghactivity. \
  filter("type = 'CreateEvent' AND payload.ref_type = 'repository'"). \
  count()

404582

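* Get the count by date, first derived from `created_at` and then from the `year`, `month`, and `dayofmonth` columns. The two sets of counts should agree.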

In [7]:
from pyspark.sql.functions import to_date

In [8]:
ghactivity. \
  groupby(to_date('created_at')). \
  count(). \
  show()

+---------------------+-------+
|to_date(`created_at`)|  count|
+---------------------+-------+
|           2021-01-16|1251855|
|           2021-01-15|2652900|
|           2021-01-14|2857818|
|           2021-01-13|2829111|
+---------------------+-------+



In [9]:
ghactivity. \
  groupby('year', 'month', 'dayofmonth'). \
  count(). \
  show()

+----+-----+----------+-------+
|year|month|dayofmonth|  count|
+----+-----+----------+-------+
|2021|    1|        14|2857818|
|2021|    1|        13|2829111|
|2021|    1|        15|2652900|
|2021|    1|        16|1251855|
+----+-----+----------+-------+
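
The two groupings can also be reconciled programmatically rather than by eye. The following is a minimal sketch in plain Python that uses the numbers from the outputs above; `reconcile` is a hypothetical helper, and the dictionaries are typed in by hand here as an assumption about how the results would be collected.

```python
from datetime import date


def reconcile(by_created_at, by_partition, total):
    """Return a list of discrepancies; an empty list means the counts agree."""
    problems = []
    for day in sorted(set(by_created_at) | set(by_partition)):
        left = by_created_at.get(day)
        right = by_partition.get(day)
        if left != right:
            problems.append(f'{day}: created_at={left}, partition={right}')
    summed = sum(by_created_at.values())
    if summed != total:
        problems.append(f'sum by date {summed} != total {total}')
    return problems


# Values copied from the outputs above. With a live session they could be
# collected instead, for example:
#   {row[0]: row['count'] for row in grouped_df.collect()}
# where grouped_df is a hypothetical name for the grouped DataFrame.
by_created_at = {
    date(2021, 1, 13): 2829111,
    date(2021, 1, 14): 2857818,
    date(2021, 1, 15): 2652900,
    date(2021, 1, 16): 1251855,
}
by_partition = {
    date(2021, 1, day): count
    for day, count in ((13, 2829111), (14, 2857818), (15, 2652900), (16, 1251855))
}

print(reconcile(by_created_at, by_partition, total=9591684))  # []
```

An empty list confirms that both groupings return the same count for every day and that the daily counts add up to the overall `ghactivity.count()` of 9591684.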



In [None]:
ghactivity. \
  filter("type = 'CreateEvent' AND payload.ref_type = 'repository'"). \
  groupby(to_date('created_at')). \
  count(). \
  show()