## Previewing the Streaming Data

Let us understand how to preview streaming data using the `console` sink as well as the `memory` sink. We have already used `console` earlier.
* Here is an example that previews streaming data using `console`. It uses `update` output mode, with an aggregation as part of the transformations. Launch the **Pyspark CLI** and run this script.

```python
spark.conf.set('spark.sql.shuffle.partitions', '2')

import socket
hostname = socket.gethostname()

log_messages = spark. \
    readStream. \
    format("socket"). \
    option("host", hostname). \
    option("port", 9000). \
    load()

from pyspark.sql.functions import split, count, lit

department_count = log_messages. \
    filter(split(split('value', ' ')[6], '/')[1] == 'department'). \
    select(split(split('value', ' ')[6], '/')[2].alias('department')). \
    groupBy('department'). \
    agg(count(lit(1)).alias('department_count'))

department_count. \
    writeStream. \
    outputMode("update"). \
    format("console"). \
    option('truncate', 'false'). \
    trigger(processingTime='5 seconds'). \
    start()
```
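
The `update` output mode used above writes, for each micro-batch, only the aggregated rows whose values changed in that batch. A minimal plain-Python sketch of that behaviour follows; it does not use Spark, and the function name `update_mode_batches` and the sample batches are made up for illustration.

```python
from collections import Counter


def update_mode_batches(batches):
    """Yield, per micro-batch, only the keys whose running count changed."""
    totals = Counter()
    for batch in batches:
        batch_counts = Counter(batch)
        totals.update(batch_counts)
        # Only keys seen in this batch are emitted, with their new running totals.
        yield {key: totals[key] for key in batch_counts}


# Hypothetical department values arriving in two consecutive micro-batches.
batches = [
    ["golf", "fitness", "golf"],
    ["fitness", "apparel"],
]

emitted = list(update_mode_batches(batches))
for batch_id, rows in enumerate(emitted):
    print(batch_id, rows)
```

In the second batch `golf` is not emitted again because its count did not change, whereas `complete` mode would rewrite every row each time.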

Launch Pyspark using one of the commands below, then run the Spark Structured Streaming code.

**Using Pyspark2**

```shell
export PYSPARK_PYTHON=python3

pyspark2 \
    --master yarn \
    --conf spark.ui.port=0 \
    --conf spark.sql.warehouse.dir=/user/${USER}/warehouse
```

**Using Pyspark3**

```shell
export PYSPARK_PYTHON=python3

pyspark3 \
    --master yarn \
    --conf spark.ui.port=0 \
    --conf spark.sql.warehouse.dir=/user/${USER}/warehouse
```
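
The `socket` source in this section connects as a client to port 9000 on the host, so something must already be listening there and writing newline-terminated text. As a rough stand-in for testing, here is a plain-Python sketch of such a server; the helper names `serve_lines` and `read_all_lines` are hypothetical, and the sketch binds to an ephemeral local port rather than 9000 so that it can run anywhere.

```python
import socket
import threading


def serve_lines(lines, host="127.0.0.1", port=0):
    """Start a one-shot TCP server that sends `lines` to the first client.

    Returns the (host, port) address it listens on and the serving thread.
    """
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    server.bind((host, port))  # port=0 lets the OS pick a free port
    server.listen(1)
    address = server.getsockname()

    def handle():
        conn, _ = server.accept()
        with conn:
            for line in lines:
                # Each record is one newline-terminated line of text.
                conn.sendall((line + "\n").encode("utf-8"))
        server.close()

    thread = threading.Thread(target=handle, daemon=True)
    thread.start()
    return address, thread


def read_all_lines(address):
    """Connect like a client would and read until the server closes."""
    data = b""
    with socket.create_connection(address) as client:
        while True:
            chunk = client.recv(4096)
            if not chunk:
                break
            data += chunk
    return data.decode("utf-8").splitlines()


address, _ = serve_lines(['1.2.3.4 - - [t] "GET /department/golf HTTP/1.1" 200 1'])
print(read_all_lines(address))
```

To feed a real streaming query, the server would instead listen on the host and port passed to `option("host", ...)` and `option("port", ...)` and keep the connection open while new lines arrive.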

In [1]:
from pyspark.sql import SparkSession

import getpass
username = getpass.getuser()

spark = SparkSession. \
    builder. \
    config('spark.ui.port', '0'). \
    config("spark.sql.warehouse.dir", f"/user/{username}/warehouse"). \
    enableHiveSupport(). \
    appName(f'{username} | Python - Overview of Structured Streaming'). \
    master('yarn'). \
    getOrCreate()

In [2]:
spark.conf.set('spark.sql.shuffle.partitions', '2')

In [4]:
import socket
hostname = socket.gethostname()
hostname

'g02'

In [5]:
log_messages = spark. \
    readStream. \
    format("socket"). \
    option("host", hostname). \
    option("port", 9000). \
    load()

In [6]:
log_messages.isStreaming

True

In [7]:
log_messages.printSchema()

root
 |-- value: string (nullable = true)



In [8]:
# outputMode is not set: with no aggregation, the default (append) is the mode that applies
log_messages. \
    writeStream. \
    format("memory"). \
    queryName("log_messages"). \
    start()

<pyspark.sql.streaming.StreamingQuery at 0x7fa871534748>

In [9]:
spark.sql('SELECT * FROM log_messages').show(truncate=False)

+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|value                                                                                                                                                                                                                       |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|44.189.215.171 - - [08/Jun/2023:13:56:20 -0800] "GET /checkout HTTP/1.1" 200 623 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.76.4 (KHTML, like Gecko) Version/7.0.4 Safari/537.76.4"                |
|24.45.196.46 - - [08/Jun/2023:13:56:22 -0800] "GET /departments HTTP/1.1" 200 455 "-" "Mozilla/5.0 (Windows

In [12]:
spark.sql('SELECT count(1) FROM log_messages').show(truncate=False)

+--------+
|count(1)|
+--------+
|272     |
+--------+



In [13]:
spark.sql("""
    SELECT * FROM log_messages
    WHERE split(split(value, ' ')[6], '/')[1] = 'department'
""").show(truncate=False)

+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|value                                                                                                                                                                                                                                    |
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|44.8.32.120 - - [08/Jun/2023:13:56:24 -0800] "GET /department/fitness/categories HTTP/1.1" 200 1908 "-" "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:30.0) Gecko/20100101 Firefox/30.0"                                                       |
|13.57.150.162 - - [08/Jun/2023:13:56:30 -0800] "GET /de

In [16]:
spark.sql("""
    SELECT count(1) FROM log_messages
    WHERE split(split(value, ' ')[6], '/')[1] = 'department'
""").show(truncate=False)

+--------+
|count(1)|
+--------+
|107     |
+--------+
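
The nested `split` expression used in these queries takes the seventh space-separated token of each log line (the request path) and then splits that path on `/`. A minimal plain-Python sketch of the same logic follows; it does not use Spark, the function name `extract_department` is made up for illustration, and the sample lines are shortened versions of the rows shown earlier.

```python
def extract_department(log_line):
    """Return the department from a '/department/<name>/...' request, else None."""
    tokens = log_line.split(' ')
    if len(tokens) <= 6:
        return None
    # tokens[6] is the request path, e.g. '/department/fitness/categories'
    path_parts = tokens[6].split('/')
    # A leading '/' makes path_parts[0] an empty string.
    if len(path_parts) > 2 and path_parts[1] == 'department':
        return path_parts[2]
    return None


department_line = ('44.8.32.120 - - [08/Jun/2023:13:56:24 -0800] '
                   '"GET /department/fitness/categories HTTP/1.1" 200 1908')
other_line = ('24.45.196.46 - - [08/Jun/2023:13:56:22 -0800] '
              '"GET /departments HTTP/1.1" 200 455')

print(extract_department(department_line))
print(extract_department(other_line))
```

The second line is rejected because its path segment is `departments`, not `department`, which is the same comparison the `WHERE` clause performs.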



In [18]:
spark.sql("""
    SELECT split(split(value, ' ')[6], '/')[2] AS department, 
        count(1) AS cnt
    FROM log_messages
    WHERE split(split(value, ' ')[6], '/')[1] = 'department'
    GROUP BY split(split(value, ' ')[6], '/')[2]
""").show(truncate=False)

+-------------+---+
|department   |cnt|
+-------------+---+
|fitness      |19 |
|team%20sports|20 |
|fan%20shop   |21 |
|outdoors     |7  |
|golf         |21 |
|footwear     |11 |
|apparel      |20 |
+-------------+---+

Running the same query again a little later returns higher counts, because the `memory` sink keeps adding incoming rows to the `log_messages` table while the stream is active.



In [19]:
spark.sql("""
    SELECT split(split(value, ' ')[6], '/')[2] AS department, 
        count(1) AS cnt
    FROM log_messages
    WHERE split(split(value, ' ')[6], '/')[1] = 'department'
    GROUP BY split(split(value, ' ')[6], '/')[2]
""").show(truncate=False)

+-------------+---+
|department   |cnt|
+-------------+---+
|fitness      |19 |
|fan%20shop   |23 |
|team%20sports|22 |
|outdoors     |8  |
|golf         |21 |
|apparel      |21 |
|footwear     |11 |
+-------------+---+
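
Some department names in the output above are still URL-encoded (for example `team%20sports`). As a small sketch outside Spark, the standard library's `urllib.parse.unquote` decodes such values; the list below simply copies three names from the output.

```python
from urllib.parse import unquote

# Department names as they appear in the query output.
encoded = ['fitness', 'team%20sports', 'fan%20shop']

# %20 is the percent-encoding of a space character.
decoded = [unquote(name) for name in encoded]
print(decoded)
```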

