## Reading Web Server logs using Spark Structured Streaming

Now that we can simulate log message generation, let us read these logs using Spark Structured Streaming.
* `spark`, which is of type `SparkSession`, has an attribute called `readStream`. It is of type `pyspark.sql.streaming.DataStreamReader`.
* It exposes APIs such as `csv` and `json`, along with the generic `format`. To read the web server log messages over a network socket, we can use `socket` as the format.
* We need to set the options `host` and `port`, then invoke `load` to read the data in streaming fashion.
* `load` returns an object of type `pyspark.sql.dataframe.DataFrame`.
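
To make the role of `host` and `port` concrete, here is a minimal sketch in plain Python (no Spark involved) of what a socket-style reader consumes: a TCP endpoint that sends newline-terminated text, with each line becoming one row holding a single `value` field. The helper names `serve_lines` and `read_rows` and the sample log lines are made up for illustration, and this is only a model of the idea, not how Spark implements its socket source.

```python
import socket
import threading


def serve_lines(server_sock, lines):
    """Accept one client and send each log line terminated by a newline."""
    conn, _ = server_sock.accept()
    with conn:
        for line in lines:
            conn.sendall((line + "\n").encode("utf-8"))


def read_rows(host, port):
    """Connect to host:port and turn every received text line into a row."""
    rows = []
    with socket.create_connection((host, port)) as client:
        with client.makefile("r", encoding="utf-8") as stream:
            for line in stream:
                # One line of text becomes one row with a single 'value' field
                rows.append({"value": line.rstrip("\n")})
    return rows


# Hypothetical sample log lines, invented for this sketch
sample_logs = [
    '10.0.0.1 - - [01/Jan/2021:00:00:01 +0000] "GET /department/fitness HTTP/1.1" 200 1234',
    '10.0.0.2 - - [01/Jan/2021:00:00:02 +0000] "GET /department/golf HTTP/1.1" 200 987',
]

# Bind to port 0 so the operating system picks a free port for the demo
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 0))
server.listen(1)
port = server.getsockname()[1]

producer = threading.Thread(target=serve_lines, args=(server, sample_logs))
producer.start()
rows = read_rows("127.0.0.1", port)
producer.join()
server.close()

for row in rows:
    print(row)
```

The reader side only needs a host and a port, which is why those are the two options set on the `socket` format before calling `load`.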

In [None]:
from pyspark.sql import SparkSession

import getpass
username = getpass.getuser()

spark = SparkSession. \
    builder. \
    config('spark.ui.port', '0'). \
    config("spark.sql.warehouse.dir", f"/user/{username}/warehouse"). \
    enableHiveSupport(). \
    appName(f'{username} | Python - Overview of Structured Streaming'). \
    master('yarn'). \
    getOrCreate()

In [None]:
socketDF = spark \
    .readStream \
    .format("socket") \
    .option("host", "localhost") \
    .option("port", 9000) \
    .load()

In [None]:
socketDF.isStreaming # True, as the Data Frame is backed by a streaming source

In [None]:
socketDF.printSchema() # the socket source exposes a single string column named value

In [None]:
socketDF.show() # raises AnalysisException: queries with streaming sources must be executed with writeStream.start()

In [None]:
socketDF \
    .writeStream \
    .outputMode("append") \
    .format("console") \
    .start()

* Run the code below and watch the output. With the `processingTime` trigger set to 5 seconds, newly arrived messages are processed as a micro-batch every 5 seconds.

In [None]:
socketDF \
    .writeStream \
    .outputMode("append") \
    .format("console") \
    .trigger(processingTime='5 seconds') \
    .start()

# Triggers every 5 seconds
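
The effect of a 5 second processing-time trigger can be pictured with a small plain Python model: messages are collected as they arrive and handed over in groups, one group per trigger interval. The function `group_into_batches` and the sample events are hypothetical, and the sketch is deliberately simplified; the window indexes it prints are not Spark's batch ids, and Spark's actual scheduling also depends on how long each micro-batch takes.

```python
from collections import defaultdict


def group_into_batches(events, interval_seconds):
    """Assign each (arrival_second, message) pair to the trigger window it falls in."""
    batches = defaultdict(list)
    for arrival_second, message in events:
        # Window 0 covers [0, interval), window 1 covers [interval, 2 * interval), ...
        window = int(arrival_second // interval_seconds)
        batches[window].append(message)
    return dict(sorted(batches.items()))


# Hypothetical arrival times (in seconds since the query started) and messages
events = [(0.5, "GET /a"), (3.2, "GET /b"), (5.1, "GET /c"), (12.0, "GET /d")]

batches = group_into_batches(events, 5)
for window, messages in batches.items():
    print(f"Window {window}: {messages}")
```

Messages arriving within the same 5 second window show up together in the console output, which is the pattern to look for when running the streaming query above.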