## Reading Web Server Logs using Spark Structured Streaming

Let us read web server logs using Spark Structured Streaming.
* We need to ensure that log data is being pushed to the netcat-based server socket that Spark will connect to.
* We can read the data from that socket by setting `format` to `socket`.
* For this demo of the end-to-end pipeline, we use `socket` as the source. In production, the source would more likely be a Kafka topic, files, or some other streaming tool.
* After reading the data, we will preview it using `memory` as the sink format.
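
If netcat is not available, the first bullet can be approximated with a few lines of plain Python. The following is only a minimal sketch: `serve_lines` is a hypothetical helper (not part of Spark or of this notebook) that accepts one client and writes newline-terminated lines to it, which is the shape of input the `socket` source expects.

```python
import socket
import threading
import time


def serve_lines(lines, host="0.0.0.0", port=9000, delay=0.0):
    """Serve each line to the first client that connects, then close.

    Hypothetical stand-in for a netcat listener. Returns the bound port
    and the background thread doing the sending.
    """
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    server.bind((host, port))
    server.listen(1)
    bound_port = server.getsockname()[1]  # useful when port=0 is passed

    def _run():
        conn, _ = server.accept()  # blocks until a client (e.g. Spark) connects
        with conn:
            for line in lines:
                # Send one newline-terminated UTF-8 line per message
                conn.sendall((line.rstrip("\n") + "\n").encode("utf-8"))
                time.sleep(delay)
        server.close()

    thread = threading.Thread(target=_run, daemon=True)
    thread.start()
    return bound_port, thread
```

Calling `serve_lines(sample_lines, port=9000)` before starting the streaming query would let the `socket` source connect to it, where `sample_lines` is any list of log lines you supply. It serves a single client and then exits, so it is suitable only for a quick local test.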

In [1]:
from pyspark.sql import SparkSession

import getpass
username = getpass.getuser()

# Port 0 lets Spark pick any free port for the UI; the warehouse directory is per user.
spark = SparkSession. \
    builder. \
    config('spark.ui.port', '0'). \
    config("spark.sql.warehouse.dir", f"/user/{username}/warehouse"). \
    enableHiveSupport(). \
    appName(f'{username} | Python - Overview of Structured Streaming'). \
    master('yarn'). \
    getOrCreate()

In [2]:
# Keep the number of shuffle partitions low for this small demo (the default is 200).
spark.conf.set('spark.sql.shuffle.partitions', '2')

In [3]:
import socket
# The socket source below connects to this machine's hostname on port 9000.
hostname = socket.gethostname()

In [4]:
log_messages = spark. \
    readStream. \
    format("socket"). \
    option("host", hostname). \
    option("port", 9000). \
    load()
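
For comparison with the production case mentioned at the top (a Kafka topic as the source), a rough sketch of the equivalent read is shown here. It assumes the Spark Kafka connector package is available on the cluster; the function name `read_log_stream_from_kafka` and its arguments are illustrative and do not come from this notebook.

```python
def read_log_stream_from_kafka(spark, bootstrap_servers, topic):
    """Return a streaming DataFrame with a single string column `value`.

    `bootstrap_servers` and `topic` are placeholders to be supplied by
    the caller (assumed values, not taken from this notebook).
    """
    raw = (
        spark.readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", bootstrap_servers)
        .option("subscribe", topic)
        .load()
    )
    # The Kafka source exposes key and value as binary columns; casting
    # value to a string gives the same shape as the socket source output.
    return raw.selectExpr("CAST(value AS STRING) AS value")
```

Because the result has the same one-column schema as `log_messages`, the downstream cells would not need to change.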

In [5]:
log_messages.isStreaming

True

In [6]:
log_messages.printSchema()

root
 |-- value: string (nullable = true)



In [7]:
log_messages. \
    writeStream. \
    format("memory"). \
    queryName("log_messages"). \
    start()

<pyspark.sql.streaming.StreamingQuery at 0x7f22d882f518>
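
The cell above discards the `StreamingQuery` object that `start()` returns. A small sketch of keeping the handle follows; `start_memory_preview` is a hypothetical wrapper name, and the output mode is passed explicitly only to make the default (`append`) visible.

```python
def start_memory_preview(stream_df, query_name, output_mode="append"):
    """Start a memory-sink query and return its StreamingQuery handle.

    The memory sink keeps all results on the driver, so this is meant
    for debugging and previews, not for production volumes.
    """
    return (
        stream_df.writeStream
        .format("memory")
        .queryName(query_name)   # becomes the in-memory table name
        .outputMode(output_mode)
        .start()
    )
```

With the handle in hand, `query.isActive` and `query.status` report on the running query, and `query.stop()` shuts it down. If the query from the previous cell is still running under the same name, stop it first (for example via `spark.streams.active`) before starting another one with that name.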

In [10]:
spark.sql('SELECT * FROM log_messages').show(truncate=False)

+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|value                                                                                                                                                                                                                                    |
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|118.51.187.164 - - [06/Sep/2021:22:02:51 -0800] "GET /department/fitness/products HTTP/1.1" 200 1788 "-" "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.153 Safari/537.36"                 |
|60.80.158.33 - - [06/Sep/2021:22:02:53 -0800] "GET /dep

In [9]:
spark.sql('SELECT count(1) FROM log_messages').show(truncate=False)

+--------+
|count(1)|
+--------+
|0       |
+--------+
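
A count of 0 is what the in-memory table returns until the first micro-batch has been written, which is presumably why cell 9 shows no rows while the later cell 10 does. One way to avoid guessing is to poll the table until rows appear. The helper below, `wait_for_rows`, is a hypothetical sketch and not a Spark API.

```python
import time


def wait_for_rows(spark, table_name, timeout_s=30.0, poll_s=1.0):
    """Poll an in-memory streaming table until it has rows or time runs out.

    Returns the last observed row count (0 if the timeout was reached
    before any data arrived).
    """
    deadline = time.monotonic() + timeout_s
    while True:
        row = spark.sql(f"SELECT count(1) AS cnt FROM {table_name}").first()
        count = row["cnt"]
        if count > 0 or time.monotonic() >= deadline:
            return count
        time.sleep(poll_s)
```

For example, `wait_for_rows(spark, "log_messages")` could be run before the `SELECT *` preview so that the preview is not issued against an empty table.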

