## Analyze GHArchive Data in Parquet files using Spark

Let us analyze the GHArchive data written by our Spark Structured Streaming job.
* Location: **/user/{username}/itv-github/streaming/bronze/data/ghactivity**.
* As the files are in Parquet format, we can use `spark.read.parquet` to read these files.

In [None]:
from pyspark.sql import SparkSession

import getpass
username = getpass.getuser()

spark = SparkSession. \
    builder. \
    config('spark.ui.port', '0'). \
    config('spark.sql.warehouse.dir', f'/user/{username}/warehouse'). \
    enableHiveSupport(). \
    appName(f'{username} | Python - Incremental Loads using Spark Structured Streaming'). \
    master('yarn'). \
    getOrCreate()

In [None]:
!hdfs dfs -ls /user/${USER}/itv-github/streaming/bronze/data/ghactivity/

In [None]:
ghactivity = spark. \
    read. \
    parquet(f"/user/{username}/itv-github/streaming/bronze/data/ghactivity")

In [None]:
ghactivity.printSchema()

In [None]:
ghactivity.count()

In [None]:
ghactivity. \
  filter("type = 'CreateEvent' AND payload.ref_type = 'repository'"). \
  count()
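
To make the filter condition concrete, here is a minimal plain-Python sketch of the same predicate applied to a few hand-made records. The records, the `is_new_repo` helper, and the variable names are hypothetical and only illustrate the logic; they are not part of the streaming job. As in Spark SQL, a row whose `payload.ref_type` is missing or null does not satisfy the equality and is excluded.

```python
# Hypothetical sample records; real GH Archive events carry many more fields.
sample_events = [
    {"type": "CreateEvent", "payload": {"ref_type": "repository"}},
    {"type": "CreateEvent", "payload": {"ref_type": "branch"}},
    {"type": "PushEvent", "payload": {"ref_type": None}},
    {"type": "DeleteEvent", "payload": {"ref_type": "tag"}},
]


def is_new_repo(event):
    """Return True when the event is a CreateEvent whose ref_type is 'repository'."""
    # Treat a missing or null payload as an empty struct.
    payload = event.get("payload") or {}
    return (
        event.get("type") == "CreateEvent"
        and payload.get("ref_type") == "repository"
    )


# Equivalent in spirit to filter(...).count() on the DataFrame.
new_repo_count = sum(1 for event in sample_events if is_new_repo(event))
print(new_repo_count)
```

Only the first record passes both conditions; the second fails on `ref_type` and the others fail on `type`.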

We can also register the DataFrame as a temporary view and analyze the data using Spark SQL.

In [None]:
ghactivity.createOrReplaceTempView('ghactivity')

In [None]:
new_repos = spark.sql("""
  SELECT
    repo.id AS repo_id,
    repo.name AS repo_name,
    actor.id AS actor_id,
    actor.login AS actor_login,
    actor.display_login AS actor_display_login,
    payload.ref_type AS ref_type,
    type,
    created_at,
    year(created_at) AS year,
    month(created_at) AS month,
    dayofmonth(created_at) AS day
  FROM ghactivity
  WHERE type = 'CreateEvent'
    AND payload.ref_type = 'repository'
""")

In [None]:
display(new_repos)

In [None]:
new_repos.count()

In [None]:
spark.sql("""
  SELECT count(1)
  FROM ghactivity
  WHERE type = 'CreateEvent'
    AND payload.ref_type = 'repository'
"""). \
    show()

* Get the event count for each date to confirm that data is available for every expected date.

In [None]:
from pyspark.sql.functions import to_date

In [None]:
ghactivity. \
  groupby(to_date('created_at')). \
  count(). \
  show()
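
As a rough cross-check of what the grouping computes, the following plain-Python sketch counts a few hand-made records per calendar date. It assumes `created_at` is an ISO-8601 UTC string such as `2021-01-13T00:05:12Z` and ignores Spark's session time zone handling; the sample records and the `count_by_date` helper are hypothetical.

```python
from collections import Counter
from datetime import datetime

# Hypothetical sample records; only the fields needed for the count are shown.
sample_events = [
    {"type": "CreateEvent", "created_at": "2021-01-13T00:05:12Z"},
    {"type": "PushEvent", "created_at": "2021-01-13T23:59:59Z"},
    {"type": "CreateEvent", "created_at": "2021-01-14T01:00:00Z"},
]


def count_by_date(events):
    """Count events per calendar date, similar in spirit to groupby(to_date('created_at')).count()."""
    counts = Counter()
    for event in events:
        # Parse the timestamp and keep only its date part.
        ts = datetime.strptime(event["created_at"], "%Y-%m-%dT%H:%M:%SZ")
        counts[ts.date().isoformat()] += 1
    return dict(counts)


daily_counts = count_by_date(sample_events)
for day in sorted(daily_counts):
    print(day, daily_counts[day])
```

Sorting the dates before printing makes gaps easier to spot; the Spark output above is not guaranteed to come back in date order unless an explicit `orderBy` is added.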

In [None]:
ghactivity. \
  select('type', 'payload.ref_type'). \
  distinct(). \
  show(100)

In [None]:
ghactivity. \
  filter("payload.ref_type = 'repository'"). \
  groupby('type'). \
  count(). \
  show()

In [None]:
ghactivity. \
  filter("type = 'CreateEvent'"). \
  groupby('payload.ref_type'). \
  count(). \
  show()

In [None]:
ghactivity.printSchema()