## Reading Web Server logs using Spark Structured Streaming

Now that we can simulate log message generation, let us read these logs using Spark Structured Streaming.
* `spark`, which is of type `SparkSession`, has an attribute called `readStream`. It is of type `pyspark.sql.streaming.DataStreamReader`.
* It exposes APIs such as `csv` and `json`, along with the generic `format`. To read the web server log messages over a TCP socket, we can use `socket` as the format. Spark intends the socket source for testing only, as it provides no fault-tolerance guarantees.
* We need to set the options `host` and `port`, then invoke `load` to read the data in streaming fashion.
* `load` returns an object of type `pyspark.sql.dataframe.DataFrame`.
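
The socket source connects, as a client, to a TCP server listening on the given `host` and `port`, and treats every newline-terminated UTF-8 line as one record. For local experiments where the log simulator is not available, a minimal stand-in server might look like the sketch below. The helper name `start_line_server` and its parameters are hypothetical and not part of Spark.

```python
import socket
import threading


def start_line_server(lines, host="127.0.0.1", port=0):
    """Serve newline-terminated UTF-8 lines to the first client that connects.

    Hypothetical helper for local testing; not part of Spark.
    Returns the port actually bound (port=0 lets the OS pick a free one).
    """
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    server.bind((host, port))
    server.listen(1)

    def handle():
        # Block until a client (for example, Spark's socket source) connects.
        conn, _ = server.accept()
        with conn:
            for line in lines:
                # One record per line: the trailing newline is the delimiter.
                conn.sendall((line + "\n").encode("utf-8"))
        server.close()

    # Serve in the background so the caller can go on to start the reader.
    threading.Thread(target=handle, daemon=True).start()
    return server.getsockname()[1]
```

The returned port is the value to pass to `option("port", ...)`, with `host` set to the address the server is bound to. This sketch sends its lines once and then closes the connection, so it only suits a quick check rather than a long-running stream.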

Launch PySpark using one of the commands below, then run the Spark Structured Streaming code.

**Using Pyspark2**

```shell
export PYSPARK_PYTHON=python3

pyspark2 \
    --master yarn \
    --conf spark.ui.port=0 \
    --conf spark.sql.warehouse.dir=/user/${USER}/warehouse
```

**Using Pyspark3**

```shell
export PYSPARK_PYTHON=python3

pyspark3 \
    --master yarn \
    --conf spark.ui.port=0 \
    --conf spark.sql.warehouse.dir=/user/${USER}/warehouse
```

* Deriving the hostname.

```python
import socket
hostname = socket.gethostname()
```

* Creating the streaming Data Frame. The code connects to port 9000 on this host, so the log messages must be published there.

```python
socketDF = spark. \
    readStream. \
    format("socket"). \
    option("host", hostname). \
    option("port", 9000). \
    load()
```

* Validating whether the Data Frame is a streaming Data Frame. `isStreaming` returns `True` for it.

```python
socketDF.isStreaming
```

* Previewing the schema. The socket source produces a single string column named `value`.

```python
socketDF.printSchema()
```

* We cannot use `show` to preview the data of a streaming Data Frame. A streaming query has to be started through `writeStream`.

```python
socketDF.show()  # raises AnalysisException: streaming sources must be run with writeStream.start()
```

* Previewing the data. The query runs continuously and prints each new batch to the console until it is stopped.

```python
socketDF. \
    writeStream. \
    outputMode("append"). \
    format("console"). \
    start()
```
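
Because the query keeps running in the background, it continues to print to the console until it is stopped, and a second console query started later would interleave its output with the first. A minimal sketch of stopping every active query in the session follows. It relies on `spark.streams.active` and `StreamingQuery.stop()`; the helper name `stop_active_queries` is hypothetical and not part of PySpark.

```python
def stop_active_queries(spark):
    """Stop every streaming query still running in this session.

    Hypothetical helper, not part of PySpark.
    Returns the ids of the queries that were stopped.
    """
    stopped = []
    # spark.streams is the StreamingQueryManager; .active lists running queries.
    for query in list(spark.streams.active):
        query.stop()
        stopped.append(query.id)
    return stopped
```

In the shell, calling `stop_active_queries(spark)` before starting the next query keeps the console output readable. Alternatively, assign the result of `start()` to a variable and call `stop()` on that single query.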

* Run the code below and watch the output. With a processing-time trigger of 5 seconds, newly arrived messages are printed in batches at 5-second intervals.

```python
socketDF. \
    writeStream. \
    outputMode("append"). \
    format("console"). \
    trigger(processingTime='5 seconds'). \
    start()

# Triggers every 5 seconds
```