## Write Data to HDFS using Spark with Header

Let us now write the data to HDFS using the Spark Structured Streaming APIs, which take care of continuously writing a streaming data frame to HDFS. This time the files include a header and use a custom separator.

Here are the steps involved.
* Create the Spark session object.
* Create a streaming data frame by subscribing to the Kafka topic using `readStream`.
* Apply the required transformations to add new fields such as `year`, `month` and `dayofmonth`.
* Write the data from the streaming data frame to HDFS in an appropriate format. We will use `csv` for now; the file sink also supports other built-in formats such as `json`, `orc` and `parquet`.
* Specify `checkpointLocation` while writing the data to HDFS. The location should be in HDFS itself.
* Specify the options `header`, to write the field names as the first line of each file, and `sep`, to customize the delimiter.
* Validate that the data is written to HDFS in csv format with the header and the custom separator.
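
The transformation step above can be tried out on a single record without Spark. The following is a plain Python sketch rather than the Spark API; the sample line is taken from the output shown later in this section, and the helper name `partition_fields` is only illustrative.

```python
from datetime import datetime


def partition_fields(log_line):
    """Derive log_date, year, month and dayofmonth from one access log line."""
    # The 4th space-separated token looks like "[02/Sep/2021:14:15:23".
    token = log_line.split(' ')[3]
    # Drop the leading "[" and parse the remainder as a timestamp.
    ts = datetime.strptime(token[1:], '%d/%b/%Y:%H:%M:%S')
    return {
        'log_date': ts.strftime('%Y-%m-%d'),
        'year': ts.strftime('%Y'),
        'month': ts.strftime('%m'),
        'dayofmonth': ts.strftime('%d'),
    }


sample = '107.151.90.45 - - [02/Sep/2021:14:15:23 -0800] "GET /department/outdoors/products HTTP/1.1" 200 1298'
fields = partition_fields(sample)
print(fields)
```

The Spark version later in this section does the same thing with `split`, `substring`, `to_date` and `date_format`, applied to every record of the stream.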

In [1]:
from pyspark.sql import SparkSession

import getpass
username = getpass.getuser()

spark = SparkSession. \
    builder. \
    config('spark.jars.packages', 'org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.1'). \
    config('spark.ui.port', '0'). \
    config('spark.sql.warehouse.dir', f'/user/{username}/warehouse'). \
    enableHiveSupport(). \
    appName(f'{username} | Python - Kafka and Spark Integration'). \
    master('yarn'). \
    getOrCreate()

In [2]:
kafka_bootstrap_servers = 'w01.itversity.com:9092,w02.itversity.com:9092'

In [3]:
df = spark. \
  readStream. \
  format('kafka'). \
  option('kafka.bootstrap.servers', kafka_bootstrap_servers). \
  option('subscribe', f'{username}_retail'). \
  load()
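
The Kafka source is configured entirely through options, so they can also be collected in a dictionary first. The sketch below assumes a hypothetical helper named `kafka_source_options`; it additionally sets `startingOffsets`, an option of the Kafka source that is not used in this notebook and is shown here only as an illustration.

```python
def kafka_source_options(bootstrap_servers, topic, starting_offsets='latest'):
    """Build the option dictionary for a Kafka streaming source."""
    return {
        'kafka.bootstrap.servers': bootstrap_servers,
        'subscribe': topic,
        # 'latest' reads only new messages; 'earliest' replays the topic from the start.
        'startingOffsets': starting_offsets,
    }


options = kafka_source_options(
    'w01.itversity.com:9092,w02.itversity.com:9092',
    'itversity_retail',
    starting_offsets='earliest',
)
print(options)
# With a Spark session available, the dictionary can be passed as:
# df = spark.readStream.format('kafka').options(**options).load()
```

Note that `startingOffsets` only matters when a query starts without an existing checkpoint; once a checkpoint exists, the query resumes from the offsets recorded there.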

In [4]:
from pyspark.sql.functions import date_format, to_date, split, substring

In [5]:
df.selectExpr("CAST(key AS STRING) AS key", "CAST(value AS STRING) AS value").printSchema()

root
 |-- key: string (nullable = true)
 |-- value: string (nullable = true)



In [6]:
!hdfs dfs -rm -R -skipTrash /user/${USER}/kafka/retail_logs/

Deleted /user/itversity/kafka/retail_logs


In [7]:
# The 4th space-separated token of each log line looks like "[02/Sep/2021:14:15:23".
# substring(..., 2, 20) drops the leading "[" and to_date keeps only the date part.
# The header and sep options write tab-separated files with the field names on the first line.
df.selectExpr("CAST(value AS STRING)"). \
    withColumn('log_date', to_date(substring(split('value', ' ')[3], 2, 20), 'dd/MMM/yyyy:HH:mm:ss')). \
    withColumn('year', date_format('log_date', 'yyyy')). \
    withColumn('month', date_format('log_date', 'MM')). \
    withColumn('dayofmonth', date_format('log_date', 'dd')). \
    writeStream. \
    partitionBy('year', 'month', 'dayofmonth'). \
    format('csv'). \
    option("checkpointLocation", f'/user/{username}/kafka/retail_logs/gen_logs/checkpoint'). \
    option('path', f'/user/{username}/kafka/retail_logs/gen_logs/data'). \
    option('header', True). \
    option('sep', '\t'). \
    trigger(processingTime='30 seconds'). \
    start()

<pyspark.sql.streaming.StreamingQuery at 0x7f6957f96c88>
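
The query started above keeps running in the background and is not assigned to a variable. One possible way to stop it once the validation is done is to go through `spark.streams.active`, which lists the active queries of the session. The helper name `stop_active_queries` below is hypothetical, and the sketch assumes that stopping every active query of the session is acceptable.

```python
def stop_active_queries(spark):
    """Stop every active streaming query of the session and return their ids."""
    stopped = []
    # spark.streams is the StreamingQueryManager; .active lists the running queries.
    for query in spark.streams.active:
        stopped.append(query.id)
        query.stop()
    return stopped
```

Calling `stop_active_queries(spark)` after the checks below would end the stream; the files already written and the checkpoint stay in HDFS.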

In [8]:
!hdfs dfs -ls /user/${USER}/kafka/retail_logs/gen_logs

Found 2 items
drwxr-xr-x   - itversity itversity          0 2021-09-02 14:15 /user/itversity/kafka/retail_logs/gen_logs/checkpoint
drwxr-xr-x   - itversity itversity          0 2021-09-02 14:15 /user/itversity/kafka/retail_logs/gen_logs/data


In [9]:
!hdfs dfs -ls -R /user/${USER}/kafka/retail_logs/gen_logs/data

drwxr-xr-x   - itversity itversity          0 2021-09-02 14:15 /user/itversity/kafka/retail_logs/gen_logs/data/_spark_metadata
-rw-r--r--   3 itversity itversity          2 2021-09-02 14:15 /user/itversity/kafka/retail_logs/gen_logs/data/_spark_metadata/0
-rw-r--r--   3 itversity itversity        884 2021-09-02 14:15 /user/itversity/kafka/retail_logs/gen_logs/data/_spark_metadata/1
drwxr-xr-x   - itversity itversity          0 2021-09-02 14:15 /user/itversity/kafka/retail_logs/gen_logs/data/year=2021
drwxr-xr-x   - itversity itversity          0 2021-09-02 14:15 /user/itversity/kafka/retail_logs/gen_logs/data/year=2021/month=09
drwxr-xr-x   - itversity itversity          0 2021-09-02 14:15 /user/itversity/kafka/retail_logs/gen_logs/data/year=2021/month=09/dayofmonth=02
-rw-r--r--   3 itversity itversity        960 2021-09-02 14:15 /user/itversity/kafka/retail_logs/gen_logs/data/year=2021/month=09/dayofmonth=02/part-00000-838deec6-62f2-49d5-9db0-4760747fd3a0.c000.csv
-rw-r--r--   3 itve

In [11]:
!hdfs dfs -cat /user/itversity/kafka/retail_logs/gen_logs/data/year=2021/month=09/dayofmonth=02/part-00002-182efb8d-e990-4036-8f1f-251b68c0e88e.c000.csv

value	log_date
"107.151.90.45 - - [02/Sep/2021:14:15:23 -0800] \"GET /department/outdoors/products HTTP/1.1\" 200 1298 \"-\" \"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.153 Safari/537.36\""	2021-09-02
"73.83.117.251 - - [02/Sep/2021:14:15:26 -0800] \"GET /departments HTTP/1.1\" 200 1260 \"-\" \"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.153 Safari/537.36\""	2021-09-02


In [12]:
df = spark.read.csv(f'/user/{username}/kafka/retail_logs/gen_logs/data', sep='\t', header=True)
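
Reading with `sep='\t'` and `header=True` mirrors the options used while writing. The effect of these options on a single file can be reproduced with Python's `csv` module. This is only a sketch on an in-memory sample shaped like the `hdfs dfs -cat` output above (shortened, without the user agent); it assumes the quoting seen there, with the field wrapped in double quotes and inner quotes escaped by a backslash.

```python
import csv
import io

# A header line followed by one record whose first field is quoted
# and contains backslash-escaped double quotes.
sample = (
    'value\tlog_date\n'
    '"73.83.117.251 - - [02/Sep/2021:14:15:26 -0800] \\"GET /departments HTTP/1.1\\" 200 1260"\t2021-09-02\n'
)

reader = csv.DictReader(
    io.StringIO(sample),
    delimiter='\t',     # matches option('sep', '\t')
    quotechar='"',
    escapechar='\\',
    doublequote=False,
)
rows = list(reader)
print(reader.fieldnames)
print(rows[0]['value'])
```

The header contains only `value` and `log_date`. The partition fields `year`, `month` and `dayofmonth` are not stored inside the files because they are encoded in the directory names.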

In [13]:
df.printSchema()

root
 |-- value: string (nullable = true)
 |-- log_date: string (nullable = true)
 |-- year: integer (nullable = true)
 |-- month: integer (nullable = true)
 |-- dayofmonth: integer (nullable = true)
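
In the schema above, `value` and `log_date` are strings because the csv files are read without schema inference, while `year`, `month` and `dayofmonth` come from the partition directory names and show up as integers. The sketch below is a simplified illustration of that idea in plain Python, not Spark's actual partition discovery; the helper name `partition_values` is hypothetical.

```python
def partition_values(path):
    """Collect key=value directory names from a path, casting digit-only values to int."""
    values = {}
    for part in path.split('/'):
        if '=' in part:
            key, raw = part.split('=', 1)
            values[key] = int(raw) if raw.isdigit() else raw
    return values


path = (
    '/user/itversity/kafka/retail_logs/gen_logs/data/'
    'year=2021/month=09/dayofmonth=02/'
    'part-00000-838deec6-62f2-49d5-9db0-4760747fd3a0.c000.csv'
)
print(partition_values(path))
```

Because the values are treated as numbers, the leading zeros of `month=09` and `dayofmonth=02` are lost when the data is read back.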



In [14]:
df.show(truncate=False)

+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------+----+-----+----------+
|value                                                                                                                                                                                                                                        |log_date  |year|month|dayofmonth|
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------+----+-----+----------+
|199.93.223.194 - - [02/Sep/2021:14:21:00 -0800] "GET /product/695 HTTP/1.1" 200 1444 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chro