Below is some Spark and Kafka code to complete. These are the steps we want to achieve:
- Answer the questions in the initial cell so that the answers can be streamed to Kafka.
- Once the information has been streamed, read it back and write it to individual CSV files.
- Load the .csv files with our wrestler information into a combined Spark DataFrame.

Before we get started, we should make a new Kafka topic:

In [1]:
%pip install kafka-python

Note: you may need to restart the kernel to use updated packages.


In [20]:
from kafka.admin import KafkaAdminClient, NewTopic


admin_client = KafkaAdminClient(
    bootstrap_servers="kafka:9092", 
    client_id='test_two'
)

topic_list = []
topic_list.append(NewTopic(name="hell_in_a_cell_two", num_partitions=1, replication_factor=1))
admin_client.create_topics(new_topics=topic_list, validate_only=False)

CreateTopicsResponse_v3(throttle_time_ms=0, topic_errors=[(topic='hell_in_a_cell_two', error_code=0, error_message=None)])
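The response above reports error_code=0 for the topic, which means the topic was created. As a rough sketch, assuming the errors arrive as (topic, error_code, error_message) tuples shaped like the topic_errors field printed above, a small helper could turn them into readable statuses. The name summarise_topic_errors is hypothetical and not part of kafka-python. Code 36 is the Kafka protocol code for TOPIC_ALREADY_EXISTS; some client versions raise an exception instead of returning the code, so treat this as illustrative only.

```python
# Hypothetical helper: interpret (topic, error_code, error_message) tuples
# shaped like the topic_errors field in the output above.
TOPIC_ALREADY_EXISTS = 36  # Kafka protocol error code


def summarise_topic_errors(topic_errors):
    """Return a dict mapping each topic name to a short status string."""
    summary = {}
    for topic, error_code, error_message in topic_errors:
        if error_code == 0:
            summary[topic] = "created"
        elif error_code == TOPIC_ALREADY_EXISTS:
            summary[topic] = "already exists"
        else:
            summary[topic] = f"failed (code {error_code}): {error_message}"
    return summary


print(summarise_topic_errors([("hell_in_a_cell_two", 0, None)]))
# prints {'hell_in_a_cell_two': 'created'}
```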

In [39]:
import json
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
from pyspark.sql.functions import lit, col

# Questions to answer - no need to edit this section:
print("THIS HELL IN A CELL MATCH FEATURES THREE OF THE GREATEST WRESTLERS OF 2001.")
for i in range(1):
    print(f"WELCOME WRESTLER {i}!!!")
    name = input("What is your name?")
    favourite_colour = input("What is your favourite colour?")
    entrance_song = input("What track would you choose as your wrestler alter-ego entrance song?")
    height = input("How many feet will you plummet?")

    # Create a dictionary representing the JSON object
    data = {
        'name': name,
        'color': favourite_colour,
        'song': entrance_song,
        'height': height
    }

    # Serialize the dictionary to a JSON string
    json_string = json.dumps(data)

    # Print the JSON string
    print(json_string)

    # Writing to data.json (this will save locally but likely inside the container)
    with open("data.json", "w") as outfile:
        outfile.write(json_string)

    spark = SparkSession.builder \
        .appName("json_entrance_app") \
        .config(
            "spark.jars",
            "work/data/commons-pool2-2.11.1.jar,"
            "work/data/spark-sql-kafka-0-10_2.12-3.4.0.jar,"
            "work/data/spark-streaming-kafka-0-10-assembly_2.12-3.4.0.jar"
        ) \
        .getOrCreate()

    schema = StructType([
        StructField("name", StringType(), nullable=False),
        StructField("color", StringType(), nullable=False),
        StructField("song", StringType(), nullable=False),
        StructField("height", IntegerType(), nullable=False)
    ])

    # Use spark.read.load() to load the JSON data.
    # schema=None lets Spark infer the schema, so the StructType above is not applied here.
    df = spark.read.load('./data.json', 
        format='json',
        multiLine=True,
        schema=None
        )

    # Add a 'value' column holding the JSON string using lit();
    # the Kafka sink publishes this column as the message value
    df = df.withColumn("value", lit(json_string))

    df.show()

    # Write the DataFrame to Kafka
    df.write \
        .format("kafka") \
        .option("kafka.bootstrap.servers", "kafka:9092") \
        .option("topic", "hell_in_a_cell_two") \
        .save()

THIS HELL IN A CELL MATCH FEATURES THREE OF THE GREATEST WRESTLERS OF 2001.
WELCOME WRESTLER 0!!!
{"name": "Triple H", "color": "Green", "song": "Throwdown", "height": "55"}
+-----+------+--------+---------+--------------------+
|color|height|    name|     song|               value|
+-----+------+--------+---------+--------------------+
|Green|    55|Triple H|Throwdown|{"name": "Triple ...|
+-----+------+--------+---------+--------------------+
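Note that input() always returns a string, so height is serialized as "55" rather than 55, even though the StructType in the cell declares it as IntegerType. A minimal sketch, assuming you want a numeric height in the JSON payload, is shown below. The helper name build_payload is hypothetical; the keys match the dictionary in the cell above.

```python
import json


def build_payload(name, colour, song, height_text):
    """Build the wrestler JSON string with height stored as an integer."""
    data = {
        "name": name,
        "color": colour,
        "song": song,
        # int() raises ValueError if the answer is not a whole number
        "height": int(height_text.strip()),
    }
    return json.dumps(data)


print(build_payload("Triple H", "Green", "Throwdown", "55"))
# prints {"name": "Triple H", "color": "Green", "song": "Throwdown", "height": 55}
```

Wrapping the int() call in a try/except around the input() prompt would let the cell ask again instead of failing on a non-numeric answer.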



In [40]:
from pyspark.sql.types import StructType, StringType
from pyspark.sql.functions import from_json

schema = StructType() \
    .add("name", StringType()) \
    .add("color", StringType()) \
    .add("song", StringType()) \
    .add("height", StringType()) 

# Read stream with all the required .option()
df = spark \
    .readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "kafka:9092") \
    .option("subscribe", "hell_in_a_cell_two") \
    .option("kafka.group.id", "test_two") \
    .option("startingOffsets", "earliest") \
    .option("maxOffsetsPerTrigger", 1) \
    .load() \
    .selectExpr("CAST(value AS STRING)")



# Parse the JSON data and apply the schema with from_json()
df = df.select(from_json(df.value, schema).alias("data")).select("data.*")


# writeStream to CSV (the file sink only supports append mode) - you will have to look up how to do this one! Give it a go!
query = df \
    .writeStream \
    .outputMode("append") \
    .format("csv") \
    .option('path', './work/data/wrestler.csv') \
    .option('checkpointLocation', './work/data/delete') \
    .start()

# query.awaitTermination()

StreamingQueryException: [STREAM_FAILED] Query [id = 91ff02be-94d8-43b0-9649-50e3264b2f27, runId = 03fb6a14-2052-490a-b626-6701050bceff] terminated with exception: Unable to find batch work/data/wrestler.csv/_spark_metadata/0.
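An error of this kind can occur when the output directory's _spark_metadata log and the checkpoint directory are out of sync, for example after files from an earlier run were deleted by hand. That diagnosis is an assumption here; the message alone does not confirm it. A sketch for clearing both directories before restarting the query follows. The helper name reset_stream_dirs is hypothetical, and the paths in the commented call are the ones used in the writeStream cell.

```python
import shutil
from pathlib import Path


def reset_stream_dirs(*dirs):
    """Delete each directory that exists and return the paths removed."""
    removed = []
    for d in dirs:
        path = Path(d)
        if path.is_dir():
            shutil.rmtree(path)
            removed.append(str(path))
    return removed


# Run only while the streaming query is stopped; this also deletes
# any CSV files the stream has already written.
# reset_stream_dirs('./work/data/wrestler.csv', './work/data/delete')
```

With startingOffsets set to "earliest" and a fresh checkpoint, the restarted query should read the topic from the beginning again.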

In [37]:
import glob
 
# List only the CSV part files written by the stream
csv_files = glob.glob('./work/data/wrestler.csv/*.csv')

# print(csv_files) # print this variable to see csv paths of individual csv files from the stream

df = spark.read.csv(csv_files, header=False, schema=schema)

df.show()


+--------------------+------+----------------+----+
|                name|colour|            song|feet|
+--------------------+------+----------------+----+
|Stone Cold Steve ...|  null|Witchita Lineman|null|
|      The Undertaker|  null|     Hells Bells|null|
|                Kane|  null|      Fight Song|null|
+--------------------+------+----------------+----+
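The null values in the colour and feet columns are worth a closer look. One possible explanation (an assumption, since the cell that produced this output is not shown) is that the stream was parsed with a schema whose fields were named colour and feet, while the JSON messages used the keys color and height. from_json matches fields by name and returns null for names that are missing from the JSON. The pure-Python sketch below mimics that name-based lookup; extract_fields is a hypothetical helper, and the "Red" and "20" values are made-up sample data.

```python
import json


def extract_fields(json_string, field_names):
    """Look up each field by name, using None when the key is absent."""
    record = json.loads(json_string)
    return {field: record.get(field) for field in field_names}


# Sample message using the keys from the producer cell (values are illustrative)
message = '{"name": "Kane", "color": "Red", "song": "Fight Song", "height": "20"}'

print(extract_fields(message, ["name", "colour", "song", "feet"]))
# prints {'name': 'Kane', 'colour': None, 'song': 'Fight Song', 'feet': None}

print(extract_fields(message, ["name", "color", "song", "height"]))
# prints {'name': 'Kane', 'color': 'Red', 'song': 'Fight Song', 'height': '20'}
```

Keeping the key names identical in the producer dictionary and the from_json schema avoids this class of silent nulls.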

