## Writing Streaming Data to Files

Now that we have read the data and confirmed that it is being processed using `writeStream.format('console')`, let us see how the data can be written to files.

Here are the steps to write the data to files:
1. Ensure the logs are being redirected to the Netcat web server.
2. Read the data using `spark.readStream` with `format('socket')`.
3. Use `writeStream.format` with the options required by the file format. We will be using `writeStream.format('csv')`, so we need to specify both the checkpoint location and the target path.

```python
socketDF. \
    writeStream. \
    format("csv"). \
    option("checkpointLocation", "/user/itversity/retail_logs/gen_logs/checkpoint"). \
    option("path", "/user/itversity/retail_logs/gen_logs/data"). \
    start()
```

4. Validate both the checkpoint location and the data location to which the files are being written.
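Besides listing the directories, the validation step can also be done by reading the output back as a batch DataFrame. The following is a minimal sketch, not the author's method: the helper name `count_written_rows` is hypothetical, and it assumes the stream came from the socket source, which produces a single string column named `value`.

```python
def count_written_rows(spark, data_path):
    """Batch-read the CSV files written so far and return the row count."""
    # Assumption: one string column named `value`, as produced by the socket source.
    # Passing the schema explicitly avoids schema inference when no data file exists yet.
    df = spark.read.schema("value STRING").csv(data_path)
    return df.count()
```

Calling `count_written_rows(spark, '/user/itversity/retail_logs/gen_logs/data')` a few times while the query is running should show the count growing as new log messages arrive.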

```python
from pyspark.sql import SparkSession

import getpass
username = getpass.getuser()

spark = SparkSession. \
    builder. \
    config('spark.ui.port', '0'). \
    config("spark.sql.warehouse.dir", f"/user/{username}/warehouse"). \
    enableHiveSupport(). \
    appName(f'{username} | Python - Overview of Structured Streaming'). \
    master('yarn'). \
    getOrCreate()
```

```python
import socket
hostname = socket.gethostname()
```

```python
log_messages = spark. \
    readStream. \
    format('socket'). \
    option('host', hostname). \
    option('port', 9000). \
    load()
```

```python
import getpass
username = getpass.getuser()
```

Starting the query without a checkpoint location fails:

```python
log_messages. \
    writeStream. \
    format('csv'). \
    option('path', f'/user/{username}/retail_logs/gen_logs/data'). \
    start()
```

```text
AnalysisException: 'checkpointLocation must be specified either through option("checkpointLocation", ...) or SparkSession.conf.set("spark.sql.streaming.checkpointLocation", ...);'
```
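The error message names two ways to supply a checkpoint location. The second one, a session-wide default, could look roughly like the sketch below. The helper name `set_default_checkpoint` and the `checkpoints` directory are hypothetical and chosen only for illustration.

```python
def set_default_checkpoint(spark, username):
    """Set a session-wide default checkpoint root (hypothetical helper)."""
    # Assumed directory layout; any HDFS path the user can write to would do.
    root = f"/user/{username}/retail_logs/gen_logs/checkpoints"
    # This is the configuration key quoted in the AnalysisException.
    spark.conf.set("spark.sql.streaming.checkpointLocation", root)
    return root
```

After `set_default_checkpoint(spark, username)`, a query started without `option("checkpointLocation", ...)` should no longer raise this error. As far as I know, Spark then keeps each query's checkpoint in its own subdirectory under that root, and an explicit `checkpointLocation` option on a query still takes precedence.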

With the checkpoint location specified, the query starts and writes a new set of files every 5 seconds:

```python
log_messages. \
    writeStream. \
    format('csv'). \
    option("checkpointLocation", f'/user/{username}/retail_logs/gen_logs/checkpoint'). \
    option('path', f'/user/{username}/retail_logs/gen_logs/data'). \
    trigger(processingTime='5 seconds'). \
    start()
```

```text
<pyspark.sql.streaming.StreamingQuery at 0x7f81bd057e10>
```
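`start()` returns a `StreamingQuery` object, which can be used to monitor and stop the stream. The following is a small sketch under that assumption; the helper name `run_then_stop` is hypothetical, while `awaitTermination`, `status`, `lastProgress`, and `stop` are members of PySpark's `StreamingQuery`.

```python
def run_then_stop(query, seconds):
    """Let a StreamingQuery run for about `seconds`, then stop it if still active."""
    # awaitTermination(timeout) returns True if the query terminated within the timeout.
    finished = query.awaitTermination(seconds)
    if not finished:
        # Still running: print the current status and the latest progress, then stop.
        print(query.status)
        print(query.lastProgress)
        query.stop()
    return finished
```

To use it, keep the object returned by `start()` in a variable, for example `query`, and then call `run_then_stop(query, 30)`.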

```python
!hdfs dfs -ls /user/${USER}/retail_logs/gen_logs/data
```

```text
Found 30 items
drwxr-xr-x   - itversity itversity          0 2021-08-22 09:57 /user/itversity/retail_logs/gen_logs/data/_spark_metadata
-rw-r--r--   3 itversity itversity        633 2021-08-22 09:57 /user/itversity/retail_logs/gen_logs/data/part-00000-16713de8-c877-45d0-818c-78d9333b0d11-c000.csv
-rw-r--r--   3 itversity itversity          0 2021-08-22 09:56 /user/itversity/retail_logs/gen_logs/data/part-00000-3e52a404-1f53-43f9-97a7-0db2a90c6ff9-c000.csv
-rw-r--r--   3 itversity itversity        680 2021-08-22 09:57 /user/itversity/retail_logs/gen_logs/data/part-00000-5d1d48a4-57dd-4ede-94ba-c7ebd71979e1-c000.csv
-rw-r--r--   3 itversity itversity        653 2021-08-22 09:57 /user/itversity/retail_logs/gen_logs/data/part-00000-640f67e4-2c52-4bdf-ba08-5ba69932f277-c000.csv
-rw-r--r--   3 itversity itversity        623 2021-08-22 09:57 /user/itversity/retail_logs/gen_logs/data/part-00000-68d61ad5-c1aa-46da-9ab2-4d1a3b10ce68-c000.csv
-rw-r--r--   3 itversity itversity        622 2021-08-
```
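The listing contains many small part files, including some that are empty. As a rough aid for validation, the listing text can be summarized with plain Python. This is only a sketch: the helper name `summarize_listing` is hypothetical, and it assumes the usual 8-column layout of `hdfs dfs -ls` output shown above.

```python
def summarize_listing(listing):
    """Summarize `hdfs dfs -ls` output: part-file count, empty files, total bytes."""
    part_files, empty_files, total_bytes = 0, 0, 0
    for line in listing.splitlines():
        fields = line.split()
        # File rows have 8 fields: permissions, replication, owner, group,
        # size, date, time, path. Skip the header, directories, and truncated rows.
        if len(fields) != 8 or fields[0].startswith("d"):
            continue
        name = fields[7].rsplit("/", 1)[-1]
        if not name.startswith("part-"):
            continue
        size = int(fields[4])
        part_files += 1
        total_bytes += size
        if size == 0:
            empty_files += 1
    return {"part_files": part_files, "empty_files": empty_files, "total_bytes": total_bytes}
```

In a notebook, `listing = !hdfs dfs -ls <path>` captures the output as a list of lines, so `summarize_listing("\n".join(listing))` would produce the summary.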

```python
spark.stop()
```