## Write Data to HDFS using Spark

Let us now write the data to HDFS using the Spark Structured Streaming APIs, which take care of writing a streaming data frame to HDFS.

Here are the steps involved.
* Create a Spark session object.
* Create a streaming data frame by subscribing to the Kafka topic using `readStream`.
* Apply the required transformations to add new fields such as `year`, `month` and `dayofmonth`.
* Write the data from the streaming data frame to HDFS in an appropriate format. We will use `csv` for now; the other built-in file formats such as `json`, `parquet`, `orc` and `text` can be written the same way.
* Specify `checkpointLocation` while writing data to HDFS. The location should be in HDFS itself.
* As part of this lecture, we will only validate whether the files are being generated. We will perform detailed validation as part of the next lecture.

In [1]:
from pyspark.sql import SparkSession

import getpass
username = getpass.getuser()

spark = SparkSession. \
    builder. \
    config('spark.jars.packages', 'org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.1'). \
    config('spark.ui.port', '0'). \
    config('spark.sql.warehouse.dir', f'/user/{username}/warehouse'). \
    enableHiveSupport(). \
    appName(f'{username} | Python - Kafka and Spark Integration'). \
    master('yarn'). \
    getOrCreate()

In [2]:
kafka_bootstrap_servers = 'w01.itversity.com:9092,w02.itversity.com:9092'

In [3]:
df = spark. \
  readStream. \
  format('kafka'). \
  option('kafka.bootstrap.servers', kafka_bootstrap_servers). \
  option('subscribe', f'{username}_retail'). \
  load()

In [4]:
from pyspark.sql.functions import date_format, to_date, split, substring

In [5]:
df.selectExpr("CAST(key AS STRING) AS key", "CAST(value AS STRING) AS value").printSchema()

root
 |-- key: string (nullable = true)
 |-- value: string (nullable = true)
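
The cast is needed because the Kafka source exposes `key` and `value` as binary columns. As a rough plain-Python analogy (no Spark involved; the sample log line below is made up for illustration and only assumed to resemble the messages in the topic), `CAST(value AS STRING)` corresponds to decoding the raw bytes as UTF-8:

```python
# Hypothetical raw Kafka message value: one access-log line as bytes.
raw_value = b'10.0.0.1 - - [02/Sep/2021:13:20:05 -0800] "GET /department/fitness HTTP/1.1" 200 1024'

# Roughly what CAST(value AS STRING) does for each record.
value = raw_value.decode('utf-8')

# After decoding, ordinary string operations such as split become available.
print(type(raw_value).__name__, '->', type(value).__name__)
print(value.split(' ')[3])
```

Once the value is a string, it can be split and parsed like any other text column.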



In [6]:
!hdfs dfs -rm -R -skipTrash /user/${USER}/kafka/retail_logs/

Deleted /user/itversity/kafka/retail_logs


In [7]:
# Take the 4th space-separated token of each log line (the bracketed timestamp),
# skip its leading '[' with substring, and parse it into a date.
# The year, month and dayofmonth columns are then used to partition the output.
df.selectExpr("CAST(value AS STRING)"). \
    withColumn('log_date', to_date(substring(split('value', ' ')[3], 2, 21), 'dd/MMM/yyyy:HH:mm:ss')). \
    withColumn('year', date_format('log_date', 'yyyy')). \
    withColumn('month', date_format('log_date', 'MM')). \
    withColumn('dayofmonth', date_format('log_date', 'dd')). \
    writeStream. \
    partitionBy('year', 'month', 'dayofmonth'). \
    format('csv'). \
    option("checkpointLocation", f'/user/{username}/kafka/retail_logs/gen_logs/checkpoint'). \
    option('path', f'/user/{username}/kafka/retail_logs/gen_logs/data'). \
    trigger(processingTime='30 seconds'). \
    start()

<pyspark.sql.streaming.StreamingQuery at 0x7f56bafd2e48>
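
To see what the `split`, `substring`, `to_date` and `date_format` calls in the cell above compute for a single record, here is a minimal plain-Python sketch of the same steps. It does not use Spark, and the sample line is a made-up example that is only assumed to follow the log layout the pipeline expects.

```python
from datetime import datetime

# Made-up sample line (an assumption, not taken from the actual topic).
line = '10.0.0.1 - - [02/Sep/2021:13:20:05 -0800] "GET /department/fitness HTTP/1.1" 200 1024'

# split('value', ' ')[3]: the 4th space-separated token.
token = line.split(' ')[3]

# substring(token, 2, 21): Spark positions are 1-based, so this skips the '['.
stamp = token[1:22]

# to_date(...): parse the timestamp and keep only the date part.
log_date = datetime.strptime(stamp, '%d/%b/%Y:%H:%M:%S').date()

# date_format(...) with 'yyyy', 'MM' and 'dd' yields zero-padded strings.
year = log_date.strftime('%Y')
month = log_date.strftime('%m')
dayofmonth = log_date.strftime('%d')

print(token, stamp, log_date, year, month, dayofmonth)
```

Note that the derived fields are strings rather than integers, which is why the month and day keep their leading zeros.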

In [8]:
!hdfs dfs -ls /user/${USER}/kafka/retail_logs/gen_logs

Found 2 items
drwxr-xr-x   - itversity itversity          0 2021-09-02 13:20 /user/itversity/kafka/retail_logs/gen_logs/checkpoint
drwxr-xr-x   - itversity itversity          0 2021-09-02 13:20 /user/itversity/kafka/retail_logs/gen_logs/data


In [9]:
!hdfs dfs -ls -R /user/${USER}/kafka/retail_logs/gen_logs/data

drwxr-xr-x   - itversity itversity          0 2021-09-02 13:20 /user/itversity/kafka/retail_logs/gen_logs/data/_spark_metadata
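
Once micro-batches containing data have been written, `partitionBy('year', 'month', 'dayofmonth')` is expected to produce Hive-style subdirectories under the `data` folder, next to `_spark_metadata`. The sketch below only builds the directory name we would look for; `partition_dir` is a hypothetical helper written for this illustration, and the date is an assumed example.

```python
from datetime import date


def partition_dir(base_path: str, log_date: date) -> str:
    """Build the Hive-style partition directory expected for a given log date.

    Hypothetical helper for illustration; it only formats a path string
    and does not touch HDFS.
    """
    return (
        f'{base_path}'
        f'/year={log_date:%Y}'
        f'/month={log_date:%m}'
        f'/dayofmonth={log_date:%d}'
    )


# Base path taken from the listing above; the date is an assumed example.
base = '/user/itversity/kafka/retail_logs/gen_logs/data'
expected = partition_dir(base, date(2021, 9, 2))
print(expected)
```

Re-running the recursive `hdfs dfs -ls -R` command after a few triggers and looking for a directory of this shape is a quick way to confirm that files are being generated.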
