## Validate Incremental Load

Let us analyze the GHActivity data that our Spark Structured Streaming process has added to the target.
* Location: **/user/{username}/github/streaming/bronze/data/ghactivity**.
* The data files under this location are Parquet files, so we can use `spark.read.parquet` to read them.

In [1]:
from pyspark.sql import SparkSession

import getpass
username = getpass.getuser()

spark = SparkSession. \
    builder. \
    config('spark.ui.port', '0'). \
    config('spark.sql.warehouse.dir', f'/user/{username}/warehouse'). \
    enableHiveSupport(). \
    appName(f'{username} | Python - Incremental Loads using Spark Structured Streaming'). \
    master('yarn'). \
    getOrCreate()

In [2]:
!hdfs dfs -ls /user/${USER}/github/streaming/bronze/data/ghactivity/created_year=2023/created_month=07

Found 2 items
drwxr-xr-x   - itv007304 supergroup          0 2023-07-14 12:22 /user/itv007304/github/streaming/bronze/data/ghactivity/created_year=2023/created_month=07/created_dayofmonth=11
drwxr-xr-x   - itv007304 supergroup          0 2023-07-25 10:32 /user/itv007304/github/streaming/bronze/data/ghactivity/created_year=2023/created_month=07/created_dayofmonth=12


In [8]:
username

'itv007304'

In [3]:
ghactivity = spark. \
    read. \
    parquet(f"/user/{username}/github/streaming/bronze/data/ghactivity")
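The target directory uses the Hive-style `key=value` layout seen in the listing above (`created_year=2023/created_month=07/...`). When the base path is read, Spark's partition discovery exposes each of those directory levels as a column. As a rough illustration of that mapping only, the plain-Python sketch below parses such a path. `parse_partition_path` is a hypothetical helper written for this note and is not part of Spark.

```python
def parse_partition_path(path):
    """Collect Hive-style key=value path segments into a dict (illustrative helper)."""
    parts = {}
    for segment in path.strip("/").split("/"):
        if "=" in segment:
            # Split on the first '=' only, so values containing '=' stay intact
            key, _, value = segment.partition("=")
            parts[key] = value
    return parts


# Path taken from the hdfs listing shown earlier in this notebook
sample = (
    "/user/itv007304/github/streaming/bronze/data/ghactivity/"
    "created_year=2023/created_month=07/created_dayofmonth=11"
)
print(parse_partition_path(sample))
```

The helper keeps every value as a string. With default settings Spark usually infers a type for partition values, so a directory such as `created_month=07` would typically show up as a numeric column.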

In [16]:
ghactivity.printSchema()

root
 |-- actor: struct (nullable = true)
 |    |-- avatar_url: string (nullable = true)
 |    |-- display_login: string (nullable = true)
 |    |-- gravatar_id: string (nullable = true)
 |    |-- id: long (nullable = true)
 |    |-- login: string (nullable = true)
 |    |-- url: string (nullable = true)
 |-- created_at: string (nullable = true)
 |-- id: string (nullable = true)
 |-- org: struct (nullable = true)
 |    |-- avatar_url: string (nullable = true)
 |    |-- gravatar_id: string (nullable = true)
 |    |-- id: long (nullable = true)
 |    |-- login: string (nullable = true)
 |    |-- url: string (nullable = true)
 |-- payload: struct (nullable = true)
 |    |-- action: string (nullable = true)
 |    |-- before: string (nullable = true)
 |    |-- comment: struct (nullable = true)
 |    |    |-- _links: struct (nullable = true)
 |    |    |    |-- html: struct (nullable = true)
 |    |    |    |    |-- href: string (nullable = true)
 |    |    |    |-- pull_request: struct (nul

In [17]:
ghactivity.createOrReplaceTempView('ghactivity')

In [18]:
new_repos = spark.sql("""
    SELECT
        repo.id AS repo_id,
        repo.name AS repo_name,
        actor.id AS actor_id,
        actor.login AS actor_login,
        actor.display_login AS actor_display_login,
        payload.ref_type AS ref_type,
        type,
        created_at,
        year(created_at) AS year,
        month(created_at) AS month,
        dayofmonth(created_at) AS day
    FROM ghactivity
    WHERE 
        type = 'CreateEvent'
        AND payload.ref_type = 'repository'
""")

In [19]:
display(new_repos)

repo_id,repo_name,actor_id,actor_login,actor_display_login,ref_type,type,created_at,year,month,day
665549714,ShubhamMalik818/A...,123889995,ShubhamMalik818,ShubhamMalik818,repository,CreateEvent,2023-07-12T13:00:00Z,2023,7,12
665570747,kylezfldla/ChatBot,47021913,kylezfldla,kylezfldla,repository,CreateEvent,2023-07-12T13:51:52Z,2023,7,12
665549715,channiboi1998/nes...,91501578,channiboi1998,channiboi1998,repository,CreateEvent,2023-07-12T13:00:00Z,2023,7,12
665549719,anoushka-10/unity...,125151652,anoushka-10,anoushka-10,repository,CreateEvent,2023-07-12T13:00:00Z,2023,7,12
665549722,parmeet2311/Ecomm...,73023547,parmeet2311,parmeet2311,repository,CreateEvent,2023-07-12T13:00:01Z,2023,7,12
665570757,idsb3t1/hello-hel...,18148588,idsb3t1,idsb3t1,repository,CreateEvent,2023-07-12T13:51:53Z,2023,7,12
665549727,Nike2447/test,66029681,Nike2447,Nike2447,repository,CreateEvent,2023-07-12T13:00:01Z,2023,7,12
665549726,Feemoai/Pendaftar...,109532738,Feemoai,Feemoai,repository,CreateEvent,2023-07-12T13:00:01Z,2023,7,12
665549728,qamilo/recipes-api,80307763,qamilo,qamilo,repository,CreateEvent,2023-07-12T13:00:02Z,2023,7,12
665570760,Rashi-04/pythonPr...,96513991,Rashi-04,Rashi-04,repository,CreateEvent,2023-07-12T13:51:53Z,2023,7,12


In [20]:
first = spark.sql("""
    SELECT
        repo.id AS repo_id,
        repo.name AS repo_name,
        actor.id AS actor_id,
        actor.login AS actor_login,
        actor.display_login AS actor_display_login,
        payload.ref_type AS ref_type,
        type,
        created_at,
        year(created_at) AS year,
        month(created_at) AS month,
        dayofmonth(created_at) AS day
    FROM ghactivity
    WHERE 
        dayofmonth(created_at) = 11
""")

In [None]:
spark.sql("""
    SELECT
        COUNT(*) AS cnt
    FROM ghactivity
    WHERE 
        dayofmonth(created_at) = 11
""").show(5, False)

In [None]:
ghactivity. \
  filter("type = 'CreateEvent' AND payload.ref_type = 'repository'"). \
  count()

* Get the record count for each date to confirm that every day's data has been loaded.

In [4]:
from pyspark.sql.functions import to_date

In [None]:
ghactivity. \
  groupby(to_date('created_at')). \
  count(). \
  show()

In [None]:
# Group by the partition columns derived from the directory names
ghactivity. \
  groupby('created_year', 'created_month', 'created_dayofmonth'). \
  count(). \
  show()

In [None]:
ghactivity. \
  filter("type = 'CreateEvent' AND payload.ref_type = 'repository'"). \
  groupby(to_date('created_at')). \
  count(). \
  show()
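Once the per-date counts are available, one way to finish the validation is to compare them with counts taken from the source files. The sketch below assumes both sets of counts have already been collected into plain dictionaries keyed by date. `find_count_mismatches` is a hypothetical helper, and the numbers are made-up placeholders, not results from this dataset.

```python
def find_count_mismatches(expected, actual):
    """Return {date: (expected_count, actual_count)} for dates whose counts differ.

    A date missing from either side is treated as a count of 0.
    """
    mismatches = {}
    for day in sorted(set(expected) | set(actual)):
        exp = expected.get(day, 0)
        act = actual.get(day, 0)
        if exp != act:
            mismatches[day] = (exp, act)
    return mismatches


# Placeholder counts for illustration only. In the notebook, the "actual" side
# could be built by collecting the grouped DataFrame, for example:
#   {str(row[0]): row["count"] for row in grouped_df.collect()}
# where grouped_df is an assumed name for the groupby(...).count() result.
expected_counts = {"2023-07-11": 100, "2023-07-12": 250}
actual_counts = {"2023-07-11": 100, "2023-07-12": 240, "2023-07-13": 5}

print(find_count_mismatches(expected_counts, actual_counts))
```

An empty result means every date matched. Any entry shows the expected and actual counts side by side, which points to the dates that need a closer look.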