## Read JSON Data using Spark Structured Streaming

Let us understand how to read files using Spark Structured Streaming.
* `spark.readStream` exposes several APIs to read data using different file formats.
  * `json`
  * `csv`
  * `parquet`
  * `orc`
* You can list them by typing `spark.readStream.` and then pressing Tab.
* We can also pass file format as argument to `spark.readStream.format`.
* Here are examples of reading files using the `json` file format:
  * Direct API: `spark.readStream.json('/mnt/itv-github-db/streaming/landing/ghactivity')`
  * Using format: `spark.readStream.format('json').load('/mnt/itv-github-db/streaming/landing/ghactivity')`
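
Because the format is just a string argument, the `format(...).load(...)` style makes it easy to choose the file format at runtime. Below is a minimal sketch, assuming an existing `spark` session; `read_stream` and `SUPPORTED_FORMATS` are hypothetical names introduced for illustration and are not part of Spark. A schema is passed explicitly because streaming file sources require one by default.

```python
# Hypothetical helper (not a Spark API): chooses the file format at runtime.
# The set below only covers the formats listed in this section.
SUPPORTED_FORMATS = {'json', 'csv', 'parquet', 'orc'}


def read_stream(spark, path, schema, file_format='json'):
    """Return a streaming DataFrame for files of the given format under path."""
    if file_format not in SUPPORTED_FORMATS:
        raise ValueError(f'Unsupported file format: {file_format}')
    # With file_format='json' this is equivalent to
    # spark.readStream.schema(schema).json(path)
    return spark.readStream.format(file_format).schema(schema).load(path)
```

The `schema` argument can be a `StructType` or a DDL-formatted string such as `'id STRING, type STRING'`.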

In [1]:
from pyspark.sql import SparkSession

import getpass
username = getpass.getuser()

spark = SparkSession. \
    builder. \
    config('spark.ui.port', '0'). \
    config('spark.sql.warehouse.dir', f'/user/{username}/warehouse'). \
    enableHiveSupport(). \
    appName(f'{username} | Python - Incremental Loads using Spark Structured Streaming'). \
    master('yarn'). \
    getOrCreate()

* When we read `json` files using `spark.readStream`, the schema is not inferred by default.
* The cell below fails because a schema is mandatory for `spark.readStream.json`.

In [3]:
spark.readStream.json(f'/user/{username}/github/streaming/landing/ghactivity/')

IllegalArgumentException: Schema must be specified when creating a streaming source DataFrame. If some files already exist in the directory, then depending on the file format you may be able to create a static DataFrame on that directory with 'spark.read.load(directory)' and infer schema from it.

* We can set `spark.sql.streaming.schemaInference` to `true` so that the schema can be inferred automatically when we use `spark.readStream.json`.
* However, you should use it with caution, as the existing data has to be scanned to infer the schema every time the streaming DataFrame is created.
* Let us go ahead and try reading `json` files after enabling the **schema inference**.

In [4]:
spark.conf.set('spark.sql.streaming.schemaInference', 'true')
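
To see why inference is costly, consider what any inference step has to do: look at the records before the schema is known. The sketch below is an illustration in plain Python only and is not Spark's inference algorithm; `infer_top_level_fields` and the sample records are made up for this example.

```python
import json


def infer_top_level_fields(json_lines):
    """Collect top-level field names and the Python type names seen for each.

    Illustration only: not Spark's algorithm, but like any inference it has
    to visit every record it wants the schema to cover.
    """
    fields = {}
    for line in json_lines:
        record = json.loads(line)
        for name, value in record.items():
            fields.setdefault(name, set()).add(type(value).__name__)
    return fields


# Made-up records; note that 'org' only appears in the second one.
sample = [
    '{"id": "1", "type": "PushEvent", "public": true}',
    '{"id": "2", "type": "ForkEvent", "org": {"id": 10}}',
]
print(infer_top_level_fields(sample))
```

A field such as `org` that is missing from the first record is only discovered by reading further, which is why inference over large compressed files takes noticeable time.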

In [5]:
!hdfs dfs -ls -R /user/${USER}/github/streaming/landing/ghactivity

drwxr-xr-x   - itv007304 supergroup          0 2023-07-12 17:57 /user/itv007304/github/streaming/landing/ghactivity/year=2023
drwxr-xr-x   - itv007304 supergroup          0 2023-07-12 17:57 /user/itv007304/github/streaming/landing/ghactivity/year=2023/month=07
drwxr-xr-x   - itv007304 supergroup          0 2023-07-13 15:27 /user/itv007304/github/streaming/landing/ghactivity/year=2023/month=07/dayofmonth=11
-rw-r--r--   3 itv007304 supergroup   96670341 2023-07-12 17:59 /user/itv007304/github/streaming/landing/ghactivity/year=2023/month=07/dayofmonth=11/2023-07-11-0.json.gz
-rw-r--r--   3 itv007304 supergroup   90660972 2023-07-13 15:22 /user/itv007304/github/streaming/landing/ghactivity/year=2023/month=07/dayofmonth=11/2023-07-11-1.json.gz
-rw-r--r--   3 itv007304 supergroup  104839007 2023-07-13 15:24 /user/itv007304/github/streaming/landing/ghactivity/year=2023/month=07/dayofmonth=11/2023-07-11-10.json.gz
-rw-r--r--   3 itv007304 supergroup  110194573 2023-07-13 15:24 /user/itv007304
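
The `year=2023/month=07/dayofmonth=11` directories follow the `key=value` naming that Spark's file sources treat as partition columns, which is why `year`, `month`, and `dayofmonth` appear as columns next to the JSON fields. The sketch below mimics that idea in plain Python; `partition_values` is a hypothetical helper and not how Spark implements partition discovery. It keeps the values as strings, whereas Spark may additionally infer a data type for them.

```python
def partition_values(file_path):
    """Extract key=value directory segments from a path as a dict of strings."""
    values = {}
    # Skip the last segment, which is the file name.
    for segment in file_path.split('/')[:-1]:
        key, sep, value = segment.partition('=')
        if sep and key and value:
            values[key] = value
    return values


# Path taken from the listing above
path = ('/user/itv007304/github/streaming/landing/ghactivity/'
        'year=2023/month=07/dayofmonth=11/2023-07-11-0.json.gz')
print(partition_values(path))
```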

In [6]:
ghactivity_df = spark. \
    readStream. \
    format('json'). \
    load(f'/user/{username}/github/streaming/landing/ghactivity/')

In [7]:
ghactivity_df.isStreaming

True

In [8]:
ghactivity_df.printSchema()

root
 |-- actor: struct (nullable = true)
 |    |-- avatar_url: string (nullable = true)
 |    |-- display_login: string (nullable = true)
 |    |-- gravatar_id: string (nullable = true)
 |    |-- id: long (nullable = true)
 |    |-- login: string (nullable = true)
 |    |-- url: string (nullable = true)
 |-- created_at: string (nullable = true)
 |-- id: string (nullable = true)
 |-- org: struct (nullable = true)
 |    |-- avatar_url: string (nullable = true)
 |    |-- gravatar_id: string (nullable = true)
 |    |-- id: long (nullable = true)
 |    |-- login: string (nullable = true)
 |    |-- url: string (nullable = true)
 |-- payload: struct (nullable = true)
 |    |-- action: string (nullable = true)
 |    |-- before: string (nullable = true)
 |    |-- comment: struct (nullable = true)
 |    |    |-- _links: struct (nullable = true)
 |    |    |    |-- html: struct (nullable = true)
 |    |    |    |    |-- href: string (nullable = true)
 |    |    |    |-- pull_request: struct (nul

> Keep in mind that we typically do not infer the schema, as compute is wasted scanning the data just to infer it. Instead, we specify the schema explicitly.
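
Specifying a schema could look like the following minimal sketch. It assumes that only a handful of top-level fields are needed; the field names and types are taken from the `printSchema()` output above, while `GHACTIVITY_SCHEMA_DDL` and `read_ghactivity` are hypothetical names introduced for illustration.

```python
# Assumption: only these fields are needed downstream.
# Names and types follow the printSchema() output above.
GHACTIVITY_SCHEMA_DDL = ', '.join([
    'id STRING',
    'created_at STRING',
    'actor STRUCT<id: BIGINT, login: STRING>',
    'org STRUCT<id: BIGINT, login: STRING>',
])


def read_ghactivity(spark, path):
    """Create the streaming DataFrame with a user-supplied schema (no inference)."""
    # DataStreamReader.schema accepts a StructType or a DDL-formatted string.
    return spark.readStream.schema(GHACTIVITY_SCHEMA_DDL).json(path)
```

It can be called as `read_ghactivity(spark, f'/user/{username}/github/streaming/landing/ghactivity/')`. JSON fields that are not listed in the schema are left out of the resulting DataFrame. According to the Structured Streaming guide, partition columns such as `year` are filled in from the path when they are listed in the user-provided schema, so add them there if they are needed.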

In [9]:
ghactivity_df. \
    writeStream. \
    format('memory'). \
    queryName('ghactivity'). \
    start()

<pyspark.sql.streaming.StreamingQuery at 0x7f642d37e8d0>

In [12]:
# The table can be empty if the first micro-batch has not finished yet, or if the data does not fit in memory
spark.sql('SELECT * FROM ghactivity').show()

+-----+----------+---+---+-------+------+----+----+----+-----+----------+
|actor|created_at| id|org|payload|public|repo|type|year|month|dayofmonth|
+-----+----------+---+---+-------+------+----+----+----+-----+----------+
+-----+----------+---+---+-------+------+----+----+----+-----+----------+
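
An empty result like the one above is expected while the first micro-batch is still running, since the in-memory table is only populated once a batch completes. One way to check is to poll the query's progress before querying the table. The sketch below assumes the object returned by `start()` was kept in a variable; `wait_for_input_rows` is a hypothetical helper, and it assumes each progress entry can be indexed like a dictionary.

```python
import time


def wait_for_input_rows(query, timeout_seconds=60, poll_seconds=1.0):
    """Poll a StreamingQuery until some micro-batch reports input rows.

    Returns True once rows were processed, False on timeout or if the
    query is no longer active.
    """
    deadline = time.monotonic() + timeout_seconds
    while time.monotonic() < deadline:
        # recentProgress holds the most recent progress updates (may be empty)
        if any(p['numInputRows'] > 0 for p in query.recentProgress):
            return True
        if not query.isActive:
            return False
        time.sleep(poll_seconds)
    return False
```

For example, after `query = ghactivity_df.writeStream.format('memory').queryName('ghactivity').start()`, run `wait_for_input_rows(query)` and then `spark.sql('SELECT * FROM ghactivity').show()`. Call `query.stop()` when the query is no longer needed.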

