## Process JSON Data in Landing Zone

Let us quickly understand how to process JSON data in the landing zone.
* The GitHub Activity Archive files that were downloaded and copied to the landing zone are in JSON format.
* Spark provides robust APIs to deal with JSON data.
* Processing JSON data with special data types is covered in subsequent topics. For now, we will look at the basics of processing data in JSON structure.
* A JSON record will have attributes of different types.
  * Simple - Numeric, String, Boolean, Null, etc.
  * Object Type (a nested JSON object)
  * JSON Array. An array can contain elements of simple type, object type, or another JSON Array.
* Our GitHub Archive data has attributes of all these types. We will review how to process data of each type here.
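
The three kinds of attributes listed above can be seen in a small plain-Python sketch. It uses the standard `json` module rather than Spark, and the record is a made-up event that only loosely resembles a GitHub Archive record. The helper `kind` is a hypothetical name introduced for illustration.

```python
import json

# Hypothetical event; field names and values are made up for illustration.
raw = '''{"id": "1", "public": true, "org": null,
          "repo": {"id": 42, "name": "demo/repo"},
          "payload": {"size": 1, "commits": [{"sha": "abc", "distinct": true}]}}'''


def kind(value):
    """Return 'object', 'array', or 'simple' for a parsed JSON value."""
    if isinstance(value, dict):
        return "object"
    if isinstance(value, list):
        return "array"
    return "simple"  # str, int, float, bool, or None


record = json.loads(raw)

# Classify every top-level attribute of the record.
kinds = {name: kind(value) for name, value in record.items()}
print(kinds)

# Nested attributes can be classified the same way.
print(kind(record["payload"]["commits"]))
```

Here `repo` and `payload` are objects, `payload.commits` is an array of objects, and the rest are simple values.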

In [1]:
from pyspark.sql import SparkSession

import getpass
username = getpass.getuser()

spark = SparkSession. \
    builder. \
    config('spark.ui.port', '0'). \
    config("spark.sql.warehouse.dir", f"/user/{username}/warehouse"). \
    enableHiveSupport(). \
    appName(f'{username} | Analyze GitHub Archive Data'). \
    master('yarn'). \
    getOrCreate()

In [None]:
%%sh

hdfs dfs -ls /user/${USER}/itvgithub/landing/

* We can use `spark.read.json` to read a file or all files in a folder or some files based upon a pattern.
* To explore basic capabilities of Spark to process JSON data, we will read one file into a Data Frame.

In [2]:
ghdata = spark.read.json(f'/user/{username}/itvgithub/landing/2021-01-13-0.json.gz')
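
With default settings, `spark.read.json` treats each line of the input as one JSON object and decompresses `.gz` files transparently. The sketch below mimics that layout in plain Python so you can see what such a file contains. The file name `sample.json.gz` and the two events are assumptions made for this example, not part of the GitHub Archive data.

```python
import gzip
import json
import os
import tempfile

# Two made-up events, written as one JSON object per line (JSON Lines).
events = [
    {"id": "1", "type": "PushEvent"},
    {"id": "2", "type": "WatchEvent"},
]

# Hypothetical file name; a temp directory keeps the sketch self-contained.
path = os.path.join(tempfile.mkdtemp(), "sample.json.gz")
with gzip.open(path, "wt", encoding="utf-8") as f:
    for event in events:
        f.write(json.dumps(event) + "\n")

# Read it back: decompress, then parse each non-empty line independently.
with gzip.open(path, "rt", encoding="utf-8") as f:
    loaded = [json.loads(line) for line in f if line.strip()]

print(len(loaded), loaded[0]["type"])
```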

Spend some time reviewing the schema.
* `actor`, `org`, `payload`, and `repo` are of type struct.
* `created_at` and `id` are of type string.
* Most of the attributes under payload are either of simple type or JSON object type. However `commits` under payload is of type JSON Array.

In [None]:
ghdata.printSchema()

In [3]:
ghdata.select('repo').show()

+--------------------+
|                repo|
+--------------------+
|[67224522, i-RIC/...|
|[329141406, kaned...|
|[221279833, arche...|
|[182814691, Auden...|
|[4542716, NixOS/n...|
|[329130975, eterw...|
|[104382627, littl...|
|[302490178, qmk/q...|
|[156042726, Maybe...|
|[329144511, direw...|
|[91074692, zaland...|
|[280011532, GeopJ...|
|[32481543, cBioPo...|
|[270887418, feeda...|
|[322448852, ehenn...|
|[325641835, machi...|
|[189429001, mlysy...|
|[307762661, steve...|
|[214051777, leigh...|
|[97922418, leanpr...|
+--------------------+
only showing top 20 rows



In [4]:
ghdata.select('repo').printSchema()

root
 |-- repo: struct (nullable = true)
 |    |-- id: long (nullable = true)
 |    |-- name: string (nullable = true)
 |    |-- url: string (nullable = true)



In [5]:
# we can access all the attributes from struct using .*
ghdata.select('repo.*').printSchema()

root
 |-- id: long (nullable = true)
 |-- name: string (nullable = true)
 |-- url: string (nullable = true)
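
As a rough mental model of what `repo.*` does, the following plain-Python sketch promotes the attributes of a nested object to top-level keys. It is not Spark code. The row is made up, and `expand_struct` and `select_with_struct` are hypothetical helper names.

```python
# Made-up row with a nested `repo` struct.
row = {
    "created_at": "2021-01-13T00:00:00Z",
    "repo": {"id": 42, "name": "demo/repo", "url": "https://example.com/demo/repo"},
}


def expand_struct(row, field):
    """Mimic select('field.*'): return the nested attributes as top-level keys."""
    return dict(row[field])


def select_with_struct(row, keep, field):
    """Mimic select(*keep, 'field.*'): keep some columns, then append the struct's attributes."""
    out = {name: row[name] for name in keep}
    out.update(row[field])
    return out


print(expand_struct(row, "repo"))
print(select_with_struct(row, ["created_at"], "repo"))
```

The second call yields the keys `created_at`, `id`, `name`, and `url`, in that order.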



In [6]:
ghdata.select('created_at', 'repo.*').show()

+--------------------+---------+--------------------+--------------------+
|          created_at|       id|                name|                 url|
+--------------------+---------+--------------------+--------------------+
|2021-01-13T00:00:00Z| 67224522|   i-RIC/prepost-gui|https://api.githu...|
|2021-01-13T00:00:00Z|329141406| kaneda96/React-quiz|https://api.githu...|
|2021-01-13T00:00:00Z|221279833|archesproject/arc...|https://api.githu...|
|2021-01-13T00:00:00Z|182814691|    Audentio/kinetic|https://api.githu...|
|2021-01-13T00:00:00Z|  4542716|       NixOS/nixpkgs|https://api.githu...|
|2021-01-13T00:00:00Z|329130975|   eterwin/schastota|https://api.githu...|
|2021-01-13T00:00:00Z|104382627|littlebizzy/slick...|https://api.githu...|
|2021-01-13T00:00:00Z|302490178|   qmk/qmk_keyboards|https://api.githu...|
|2021-01-13T00:00:00Z|156042726|MaybeNotWrong/lc-sep|https://api.githu...|
|2021-01-13T00:00:00Z|329144511|direwolf-github/e...|https://api.githu...|
|2021-01-13T00:00:00Z| 91

* `payload.commits` is of type array. Each element in the array is of type JSON object.

In [7]:
ghdata.select('payload.commits').printSchema()

root
 |-- commits: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- author: struct (nullable = true)
 |    |    |    |-- email: string (nullable = true)
 |    |    |    |-- name: string (nullable = true)
 |    |    |-- distinct: boolean (nullable = true)
 |    |    |-- message: string (nullable = true)
 |    |    |-- sha: string (nullable = true)
 |    |    |-- url: string (nullable = true)



In [8]:
ghdata.count()

90911

* We can use `explode` to flatten the array. Each element of the array becomes its own row, so an event with several commits produces several rows.

In [9]:
from pyspark.sql.functions import explode
ghdata. \
    select(explode('payload.commits').alias('commits')). \
    printSchema()

root
 |-- commits: struct (nullable = true)
 |    |-- author: struct (nullable = true)
 |    |    |-- email: string (nullable = true)
 |    |    |-- name: string (nullable = true)
 |    |-- distinct: boolean (nullable = true)
 |    |-- message: string (nullable = true)
 |    |-- sha: string (nullable = true)
 |    |-- url: string (nullable = true)



* `explode` by default drops the records where the array is null or has no elements.

In [10]:
from pyspark.sql.functions import explode
ghdata. \
    select(explode('payload.commits').alias('commits')). \
    count()

75708

* `explode_outer` retains the records where the JSON Array has no elements (the exploded column is null for them) and explodes the array where it has elements.
* With `explode_outer`, the count here is greater than the count of the original Data Frame.

In [11]:
from pyspark.sql.functions import explode_outer
ghdata. \
    select(explode_outer('payload.commits').alias('commits')). \
    count()

119495
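
The three counts above fit together, assuming they all come from the same file. `explode_outer` adds exactly one row for every event whose `commits` array is null or empty, so 119495 - 75708 = 43787 events have no commits, and the remaining 90911 - 43787 = 47124 events account for the 75708 commits. The plain-Python sketch below models both functions on made-up rows. It illustrates the row semantics only and is not Spark's implementation.

```python
def explode(rows, field):
    """One output row per array element; rows with a null or empty array are dropped."""
    return [elem for row in rows for elem in (row.get(field) or [])]


def explode_outer(rows, field):
    """Like explode, but a null or empty array still yields one row holding None."""
    out = []
    for row in rows:
        elems = row.get(field) or []
        out.extend(elems if elems else [None])
    return out


# Made-up rows: two with commits, one with an empty array, one with null.
rows = [
    {"id": 1, "commits": ["a", "b"]},
    {"id": 2, "commits": ["c"]},
    {"id": 3, "commits": []},
    {"id": 4, "commits": None},
]

exploded = explode(rows, "commits")
exploded_outer = explode_outer(rows, "commits")

print(len(rows), len(exploded), len(exploded_outer))  # 4 3 5

# The difference is the number of rows without any commits.
print(len(exploded_outer) - len(exploded))  # 2
```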

In [13]:
from pyspark.sql.functions import explode_outer
ghdata. \
    select('repo.*', 'created_at', explode_outer('payload.commits').alias('commits')). \
    show()

+---------+--------------------+--------------------+--------------------+--------------------+
|       id|                name|                 url|          created_at|             commits|
+---------+--------------------+--------------------+--------------------+--------------------+
| 67224522|   i-RIC/prepost-gui|https://api.githu...|2021-01-13T00:00:00Z|                null|
|329141406| kaneda96/React-quiz|https://api.githu...|2021-01-13T00:00:00Z|                null|
|221279833|archesproject/arc...|https://api.githu...|2021-01-13T00:00:00Z|                null|
|182814691|    Audentio/kinetic|https://api.githu...|2021-01-13T00:00:00Z|                null|
|  4542716|       NixOS/nixpkgs|https://api.githu...|2021-01-13T00:00:00Z|                null|
|329130975|   eterwin/schastota|https://api.githu...|2021-01-13T00:00:00Z|[[394a73ceb6ee034...|
|104382627|littlebizzy/slick...|https://api.githu...|2021-01-13T00:00:00Z|[[a5c95b3d7cb4d0a...|
|302490178|   qmk/qmk_keyboards|https://

* You can access the authors for the commits by using `commits.author.*` after exploding `commits` which is of type JSON Array.

In [15]:
from pyspark.sql.functions import explode_outer
ghdata. \
    select('repo.*', 'created_at', explode_outer('payload.commits').alias('commits')). \
    select('id', 'name', 'url', 'created_at', 'commits.author.*'). \
    show()

+---------+--------------------+--------------------+--------------------+--------------------+-------------------+
|       id|                name|                 url|          created_at|               email|               name|
+---------+--------------------+--------------------+--------------------+--------------------+-------------------+
| 67224522|   i-RIC/prepost-gui|https://api.githu...|2021-01-13T00:00:00Z|                null|               null|
|329141406| kaneda96/React-quiz|https://api.githu...|2021-01-13T00:00:00Z|                null|               null|
|221279833|archesproject/arc...|https://api.githu...|2021-01-13T00:00:00Z|                null|               null|
|182814691|    Audentio/kinetic|https://api.githu...|2021-01-13T00:00:00Z|                null|               null|
|  4542716|       NixOS/nixpkgs|https://api.githu...|2021-01-13T00:00:00Z|                null|               null|
|329130975|   eterwin/schastota|https://api.githu...|2021-01-13T00:00:00
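
The header of the output above contains two columns called `name`: one from `repo.*` and one from `commits.author.*`. In Spark, a later reference to `name` on such a Data Frame would be ambiguous, so it is common to alias the nested fields explicitly, for example `col('commits.author.name').alias('author_name')`. The plain-Python sketch below shows the same idea end to end on made-up rows. The function `commit_authors` and the prefixed key names are assumptions chosen for illustration.

```python
# Made-up events: one without commits, one with two commits.
rows = [
    {"repo": {"id": 1, "name": "demo/a"}, "payload": {"commits": None}},
    {
        "repo": {"id": 2, "name": "demo/b"},
        "payload": {
            "commits": [
                {"author": {"email": "x@example.com", "name": "X"}},
                {"author": {"email": "y@example.com", "name": "Y"}},
            ]
        },
    },
]


def commit_authors(rows):
    """Explode commits (outer style) and flatten repo and author under distinct key names."""
    out = []
    for row in rows:
        # explode_outer semantics: keep events without commits as a single row.
        commits = row["payload"].get("commits") or [None]
        for commit in commits:
            author = (commit or {}).get("author") or {}
            out.append(
                {
                    "repo_id": row["repo"]["id"],
                    "repo_name": row["repo"]["name"],
                    "author_email": author.get("email"),
                    "author_name": author.get("name"),
                }
            )
    return out


result = commit_authors(rows)
for item in result:
    print(item)
```

The event without commits is kept with `author_email` and `author_name` set to `None`, which mirrors the null values in the first rows of the Spark output.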