## Read JSON Data using Spark Structured Streaming

Let us understand how to read files using Spark Structured Streaming.
* `spark.readStream` exposes several APIs to read data using different file formats.
  * `json`
  * `csv`
  * `parquet`
  * `orc`
* You can list them by typing `spark.readStream.` and then pressing Tab.
* We can also pass file format as argument to `spark.readStream.format`.
* Depending on the file format chosen, we need to apply additional options. For example, if we use `csv`, we might have to specify `header` and a custom separator (`sep`).
* Some options are applicable to all file formats. Here are the commonly used ones.
  * `path`
  * `maxFilesPerTrigger`
  * `latestFirst`
  * `maxFileAge`
  * `cleanSource` (`archive`, `delete`, `off`). When we use `archive`, we also need to provide `sourceArchiveDir`.
* Here are two ways to read files using the `json` file format:
  * Direct API: `spark.readStream.json(f'/user/{username}/itv-github/streaming/landing/ghactivity')`
  * Using format: `spark.readStream.format('json').load(f'/user/{username}/itv-github/streaming/landing/ghactivity')`
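
Let us see how the common options listed above can be passed to the reader. This is only a minimal sketch: the helper names `file_source_options` and `read_json_stream` are made up for illustration, the values are arbitrary, and `schema` is assumed to be a `StructType` or DDL string that matches the files.

```python
def file_source_options(max_files_per_trigger=1, latest_first=False,
                        max_file_age="7d", archive_dir=None):
    """Build an options dict for a Structured Streaming file source.

    Hypothetical helper; the option names are the ones listed above.
    """
    options = {
        "maxFilesPerTrigger": str(max_files_per_trigger),
        "latestFirst": str(latest_first).lower(),
        "maxFileAge": max_file_age,
        "cleanSource": "off",
    }
    if archive_dir is not None:
        # cleanSource=archive also needs sourceArchiveDir
        options["cleanSource"] = "archive"
        options["sourceArchiveDir"] = archive_dir
    return options


def read_json_stream(spark, path, schema, options):
    """Return a streaming DataFrame over the JSON files under path."""
    return spark.readStream.schema(schema).options(**options).json(path)


# Usage in the notebook (assumes spark, username and a schema are defined):
# ghactivity_df = read_json_stream(
#     spark,
#     f'/user/{username}/itv-github/streaming/landing/ghactivity',
#     schema,
#     file_source_options(max_files_per_trigger=5)
# )
```

Building the options as a plain dictionary keeps them easy to review and reuse across `json`, `csv`, `parquet` and `orc` readers.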

In [None]:
from pyspark.sql import SparkSession

import getpass
username = getpass.getuser()

spark = SparkSession. \
    builder. \
    config('spark.ui.port', '0'). \
    config('spark.sql.warehouse.dir', f'/user/{username}/warehouse'). \
    enableHiveSupport(). \
    appName(f'{username} | Python - Incremental Loads using Spark Structured Streaming'). \
    master('yarn'). \
    getOrCreate()

* When we read `json` files using `spark.readStream`, the schema is not inferred by default.
* The cell below fails because a schema is mandatory for `spark.readStream.json`.

In [None]:
spark.readStream.json(f'/user/{username}/itv-github/streaming/landing/ghactivity')

* We can set `spark.sql.streaming.schemaInference` to `true` so that the schema can be inferred automatically when we use `spark.readStream.json`.
* However, use it with caution: Spark has to scan the existing files to infer the schema each time the stream is defined, which is expensive on large directories.
* Let us go ahead and try reading `json` files after enabling the **schema inference**.

In [None]:
spark.conf.set('spark.sql.streaming.schemaInference', 'true')

In [None]:
!hdfs dfs -ls -R /user/${USER}/itv-github/streaming/landing/ghactivity

In [None]:
ghactivity_df = spark. \
    readStream. \
    format('json'). \
    load(f'/user/{username}/itv-github/streaming/landing/ghactivity')

In [None]:
ghactivity_df.isStreaming

In [None]:
ghactivity_df.printSchema()

> Keep in mind that we typically do not infer the schema for streaming sources, as scanning the data just to infer the schema wastes compute. Instead, we specify the schema explicitly.
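
Let us see how a schema can be applied instead of being inferred. This is a sketch under assumptions: the field list below is an assumed subset of the top-level attributes in the `ghactivity` files and should be verified against the output of `printSchema`, and the helpers `to_ddl` and `read_with_schema` are hypothetical names. `spark.readStream.schema` accepts a DDL-formatted string as well as a `StructType`.

```python
# Assumed subset of top-level fields; verify against the actual files.
GHACTIVITY_FIELDS = [
    ("id", "STRING"),
    ("type", "STRING"),
    ("public", "BOOLEAN"),
    ("created_at", "STRING"),
]


def to_ddl(fields):
    """Turn (name, type) pairs into a DDL schema string."""
    return ", ".join(f"{name} {dtype}" for name, dtype in fields)


def read_with_schema(spark, path, fields=GHACTIVITY_FIELDS):
    """Read JSON files as a stream with an explicit schema (no inference)."""
    return spark.readStream.schema(to_ddl(fields)).json(path)


# Usage in the notebook (assumes spark and username are defined):
# ghactivity_df = read_with_schema(
#     spark,
#     f'/user/{username}/itv-github/streaming/landing/ghactivity'
# )
```

With an explicit schema, only the listed fields are read and `spark.sql.streaming.schemaInference` does not need to be enabled.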

In [None]:
ghactivity_df. \
    writeStream. \
    format('memory'). \
    queryName('ghactivity'). \
    start()

In [None]:
spark.sql('SELECT * FROM ghactivity').show()
# The result may be empty if the first micro-batch has not completed yet.
# The memory sink keeps all rows in driver memory, so it suits only small data sets.
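
Let us also see one way to make sure the in-memory table has data before we query it. This is a sketch, assuming we keep the handle returned by `start()` in a variable (for example `query = ghactivity_df.writeStream. ... .start()`); the helper name `show_memory_sink` is hypothetical. `processAllAvailable` blocks until the data currently available in the source is processed and is intended for testing, so it may take a while on a large landing folder.

```python
def show_memory_sink(spark, query, table_name, num_rows=5):
    """Wait for the available data, preview the memory table, then stop the query."""
    try:
        # Block until all data currently in the source has been processed
        query.processAllAvailable()
        spark.sql(f'SELECT * FROM {table_name} LIMIT {num_rows}').show(truncate=False)
    finally:
        # Always stop the streaming query so it does not keep running
        query.stop()


# Usage in the notebook (assumes spark and ghactivity_df are defined):
# query = ghactivity_df. \
#     writeStream. \
#     format('memory'). \
#     queryName('ghactivity'). \
#     start()
# show_memory_sink(spark, query, 'ghactivity')
```

Stopping the query in `finally` avoids leaving a streaming job running in the notebook session when the preview fails.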