## Validate Data in HDFS using Spark

Now that we have written data to HDFS in streaming fashion, let us validate that the data landed in HDFS as expected. We will also review the checkpoint location to understand what is captured as part of the checkpoint.

In [1]:
import getpass

from pyspark.sql import SparkSession

# Current OS user; used for the warehouse path and the application name
username = getpass.getuser()

# Spark session on YARN with Hive support.
# spark.ui.port=0 lets Spark pick any free port for its UI.
spark = SparkSession. \
    builder. \
    config('spark.jars.packages', 'org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.1'). \
    config('spark.ui.port', '0'). \
    config('spark.sql.warehouse.dir', f'/user/{username}/warehouse'). \
    enableHiveSupport(). \
    appName(f'{username} | Python - Kafka and Spark Integration'). \
    master('yarn'). \
    getOrCreate()

In [2]:
!hdfs dfs -ls /user/${USER}/kafka/retail_logs/gen_logs

Found 2 items
drwxr-xr-x   - itversity itversity          0 2021-09-02 13:20 /user/itversity/kafka/retail_logs/gen_logs/checkpoint
drwxr-xr-x   - itversity itversity          0 2021-09-02 13:21 /user/itversity/kafka/retail_logs/gen_logs/data


In [None]:
!hdfs dfs -ls -R /user/${USER}/kafka/retail_logs/gen_logs/data

In [4]:
!hdfs dfs -ls /user/${USER}/kafka/retail_logs/gen_logs/checkpoint

Found 4 items
drwxr-xr-x   - itversity itversity          0 2021-09-02 13:32 /user/itversity/kafka/retail_logs/gen_logs/checkpoint/commits
-rw-r--r--   3 itversity itversity         45 2021-09-02 13:20 /user/itversity/kafka/retail_logs/gen_logs/checkpoint/metadata
drwxr-xr-x   - itversity itversity          0 2021-09-02 13:32 /user/itversity/kafka/retail_logs/gen_logs/checkpoint/offsets
drwxr-xr-x   - itversity itversity          0 2021-09-02 13:20 /user/itversity/kafka/retail_logs/gen_logs/checkpoint/sources


In [5]:
!hdfs dfs -ls -R /user/${USER}/kafka/retail_logs/gen_logs/checkpoint/sources

drwxr-xr-x   - itversity itversity          0 2021-09-02 13:20 /user/itversity/kafka/retail_logs/gen_logs/checkpoint/sources/0
-rw-r--r--   3 itversity itversity         56 2021-09-02 13:20 /user/itversity/kafka/retail_logs/gen_logs/checkpoint/sources/0/0


In [6]:
!hdfs dfs -cat /user/${USER}/kafka/retail_logs/gen_logs/checkpoint/sources/0/0

 v1
{"itversity_retail":{"2":17133,"1":15506,"0":15348}}

In [9]:
!hdfs dfs -ls -R /user/${USER}/kafka/retail_logs/gen_logs/checkpoint/offsets

-rw-r--r--   3 itversity itversity        509 2021-09-02 13:20 /user/itversity/kafka/retail_logs/gen_logs/checkpoint/offsets/0
-rw-r--r--   3 itversity itversity        509 2021-09-02 13:21 /user/itversity/kafka/retail_logs/gen_logs/checkpoint/offsets/1
-rw-r--r--   3 itversity itversity        509 2021-09-02 13:25 /user/itversity/kafka/retail_logs/gen_logs/checkpoint/offsets/10
-rw-r--r--   3 itversity itversity        509 2021-09-02 13:26 /user/itversity/kafka/retail_logs/gen_logs/checkpoint/offsets/11
-rw-r--r--   3 itversity itversity        509 2021-09-02 13:26 /user/itversity/kafka/retail_logs/gen_logs/checkpoint/offsets/12
-rw-r--r--   3 itversity itversity        509 2021-09-02 13:27 /user/itversity/kafka/retail_logs/gen_logs/checkpoint/offsets/13
-rw-r--r--   3 itversity itversity        509 2021-09-02 13:27 /user/itversity/kafka/retail_logs/gen_logs/checkpoint/offsets/14
-rw-r--r--   3 itversity itversity        509 2021-09-02 13:28 /user/itversity/kafka/retail_logs/gen_logs/

In [10]:
!hdfs dfs -cat /user/${USER}/kafka/retail_logs/gen_logs/checkpoint/offsets/0

v1
{"batchWatermarkMs":0,"batchTimestampMs":1630603245236,"conf":{"spark.sql.streaming.stateStore.providerClass":"org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider","spark.sql.streaming.join.stateFormatVersion":"2","spark.sql.streaming.flatMapGroupsWithState.stateFormatVersion":"2","spark.sql.streaming.multipleWatermarkPolicy":"min","spark.sql.streaming.aggregation.stateFormatVersion":"2","spark.sql.shuffle.partitions":"200"}}
{"itversity_retail":{"2":17133,"1":15506,"0":15348}}
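The offsets file printed above has three lines: a version marker, a JSON object with batch metadata, and a JSON object with per-partition offsets for the topic. The following is a minimal sketch, using only the Python standard library, that parses that content and converts `batchTimestampMs` (epoch milliseconds) to a UTC timestamp. The variable names are illustrative, and the `conf` map is left out of the metadata line for brevity.

```python
import json
from datetime import datetime, timezone

# Content of checkpoint/offsets/0 as printed above ("conf" map omitted for brevity)
raw = '''v1
{"batchWatermarkMs":0,"batchTimestampMs":1630603245236}
{"itversity_retail":{"2":17133,"1":15506,"0":15348}}'''

# Line 1: version, line 2: batch metadata, line 3: offsets per topic and partition
version, metadata_line, offsets_line = raw.splitlines()
metadata = json.loads(metadata_line)
offsets = json.loads(offsets_line)

# batchTimestampMs is in milliseconds since the Unix epoch
batch_time = datetime.fromtimestamp(metadata["batchTimestampMs"] / 1000, tz=timezone.utc)

print(version)
print(batch_time.isoformat())
print(offsets["itversity_retail"])
```

For this file the timestamp falls on 2021-09-02 at 17:20:45 UTC, which lines up with the 13:20 modification time in the HDFS listing if the cluster clock runs four hours behind UTC.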

In [11]:
!hdfs dfs -cat /user/${USER}/kafka/retail_logs/gen_logs/checkpoint/offsets/1

v1
{"batchWatermarkMs":0,"batchTimestampMs":1630603260011,"conf":{"spark.sql.streaming.stateStore.providerClass":"org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider","spark.sql.streaming.join.stateFormatVersion":"2","spark.sql.streaming.flatMapGroupsWithState.stateFormatVersion":"2","spark.sql.streaming.multipleWatermarkPolicy":"min","spark.sql.streaming.aggregation.stateFormatVersion":"2","spark.sql.shuffle.partitions":"200"}}
{"itversity_retail":{"2":17139,"1":15511,"0":15352}}

In [13]:
!hdfs dfs -cat /user/${USER}/kafka/retail_logs/gen_logs/checkpoint/offsets/2

v1
{"batchWatermarkMs":0,"batchTimestampMs":1630603290010,"conf":{"spark.sql.streaming.stateStore.providerClass":"org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider","spark.sql.streaming.join.stateFormatVersion":"2","spark.sql.streaming.flatMapGroupsWithState.stateFormatVersion":"2","spark.sql.streaming.multipleWatermarkPolicy":"min","spark.sql.streaming.aggregation.stateFormatVersion":"2","spark.sql.shuffle.partitions":"200"}}
{"itversity_retail":{"2":17148,"1":15522,"0":15362}}
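Comparing the three offsets files shows the Kafka offsets advancing from one batch to the next. Assuming each file records the position reached in every partition when its batch was planned, the difference between consecutive files approximates the number of records picked up per partition. The sketch below does that arithmetic on the values printed above; `offset_deltas` is a helper name introduced only for this illustration.

```python
# Per-partition offsets copied from offsets/0, offsets/1 and offsets/2 above
batches = [
    {"2": 17133, "1": 15506, "0": 15348},
    {"2": 17139, "1": 15511, "0": 15352},
    {"2": 17148, "1": 15522, "0": 15362},
]

def offset_deltas(previous, current):
    """Return, per partition, how far the offset advanced between two batches."""
    return {p: current[p] - previous[p] for p in sorted(current)}

for batch_id in range(1, len(batches)):
    delta = offset_deltas(batches[batch_id - 1], batches[batch_id])
    # Batch id, per-partition advance, and total advance across partitions
    print(batch_id, delta, sum(delta.values()))
```

With these values the offsets advance by 15 in total between files 0 and 1, and by 30 between files 1 and 2. Note that `offsets/0` holds the same values as `sources/0/0`.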

In [16]:
df = spark.read.csv(f'/user/{username}/kafka/retail_logs/gen_logs/data')

In [17]:
df.printSchema()

root
 |-- _c0: string (nullable = true)
 |-- _c1: string (nullable = true)
 |-- year: integer (nullable = true)
 |-- month: integer (nullable = true)
 |-- dayofmonth: integer (nullable = true)
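The schema has two string columns from the CSV files plus three integer columns, `year`, `month`, and `dayofmonth`. Integer columns like these typically come from Spark's partition discovery, which reads `key=value` directory names under the data path. That the data was written this way is an assumption here, since the writer and the recursive listing are not shown. The plain-Python sketch below only illustrates the idea on a hypothetical file path; `partition_values` is not a Spark API.

```python
def partition_values(path):
    """Extract key=value directory segments from a path, casting digit-only values to int."""
    values = {}
    for segment in path.split("/"):
        key, sep, value = segment.partition("=")
        if sep:  # only segments that contain "=" are partition directories
            values[key] = int(value) if value.isdigit() else value
    return values

# Hypothetical path: the actual file names under .../data are not listed in this notebook
example = "/user/itversity/kafka/retail_logs/gen_logs/data/year=2021/month=9/dayofmonth=2/part-00000.csv"
print(partition_values(example))
```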



In [18]:
df.show(truncate=False)

+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------+----+-----+----------+
|_c0                                                                                                                                                                                                                                         |_c1       |year|month|dayofmonth|
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------+----+-----+----------+
|138.246.234.237 - - [02/Sep/2021:13:32:30 -0800] "GET /departments HTTP/1.1" 200 1400 "-" "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:30.0) Gecko/20100101 Firefox/30.0"                   