In [1]:
'''
# Objective:
To develop a Spark Streaming application that reads JSON data from a Kafka topic,
processes the data, and writes the results to S3 in Parquet format.

Overview:
This Spark Streaming application is designed to integrate with Apache Kafka for data
ingestion. It reads JSON-formatted review data associated with applications, parses
this data, and subsequently stores it on S3 for further analysis or processing. The
primary workflow includes initializing a Spark session, defining the data schema,
streaming data from Kafka, parsing the JSON data, and persisting the processed data
to S3.
'''

# Spark Environment Setup: Import Libraries:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql import types as T

In [2]:

# Define the JSON Schema for Incoming Data:

'''
Define the JSON Schema for Incoming Data:
When Spark reads structured data such as JSON, an explicit schema tells it the name
and type of every field. F.from_json, used later in this notebook, requires such a
schema in order to map each incoming JSON string onto typed fields.
The schema below is built with Spark's StructType and StructField classes from the
pyspark.sql.types module (imported here as T).
'''

json_schema = T.StructType([
T.StructField('application_name', T.StringType()),
T.StructField('num_of_positive_sentiments', T.LongType()),
T.StructField('num_of_neutral_sentiments', T.LongType()),
T.StructField('num_of_negative_sentiments', T.LongType()),
T.StructField('avg_sentiment_polarity', T.DoubleType()),
T.StructField('avg_sentiment_subjectivity', T.DoubleType()),
T.StructField('category', T.StringType()),
T.StructField('rating', T.StringType()),
T.StructField('reviews', T.StringType()),
T.StructField('size', T.StringType()),
T.StructField('num_of_installs', T.DoubleType()),
T.StructField('price', T.DoubleType()),
T.StructField('age_limit', T.LongType()),
T.StructField('genres', T.StringType()),
T.StructField('version', T.StringType())])

In [None]:
'''
Initialize Spark Session: Establish a local Spark session utilizing all available cores and
adding a Spark-Kafka integration package, which allows Spark to interact with Kafka.
'''


spark = SparkSession \
.builder \
.master("local[*]") \
.appName('ex6_store_results') \
.config('spark.jars.packages', 'org.apache.spark:spark-sql-kafka-0-10_2.12:3.2.0') \
.getOrCreate()
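
The value passed to spark.jars.packages is a Maven coordinate that encodes both the Scala binary version (2.12) and the connector version (3.2.0), and the connector version is generally expected to match the installed Spark release. A minimal sketch of assembling that string; the helper name kafka_package is hypothetical and not part of PySpark:

```python
def kafka_package(spark_version: str, scala_version: str = "2.12") -> str:
    """Build the Maven coordinate of the Spark-Kafka connector (hypothetical helper)."""
    return f"org.apache.spark:spark-sql-kafka-0-10_{scala_version}:{spark_version}"


# Reproduces the coordinate used in the cell above.
print(kafka_package("3.2.0"))
# -> org.apache.spark:spark-sql-kafka-0-10_2.12:3.2.0
```

In a live environment, pyspark.__version__ could be passed in place of the hard-coded "3.2.0", assuming the Scala version of the installation is known.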

In [4]:
# Read Streaming Data from Kafka
'''
This cell initializes a streaming DataFrame (stream_df) from a Kafka source:
    • readStream with format 'kafka' sets up a streaming read from Kafka.
    • kafka.bootstrap.servers sets the Kafka bootstrap server address to
      "course-kafka:9092".
    • subscribe selects the Kafka topic "gps-with-reviews".
    • startingOffsets = 'earliest' makes a newly started query read the topic from its
      beginning.
    • Finally, after load(), only the 'value' column is selected and cast to a
      string type.
'''


stream_df = spark \
.readStream \
.format('kafka') \
.option("kafka.bootstrap.servers", "course-kafka:9092") \
.option("subscribe", "gps-with-reviews") \
.option('startingOffsets', 'earliest') \
.load() \
.select(F.col('value').cast(T.StringType()))

'''
Today's tip
    In Kafka, each message consists of a key and a value. When you consume
    messages from Kafka using Spark Structured Streaming, they are read into a
    DataFrame with several columns, including key and value (alongside metadata
    such as topic, partition, offset and timestamp).
    The value column holds the actual content of the Kafka message as binary data.
    In the code above it is selected and cast to a string, which is the usual first
    step when the messages carry JSON-formatted text.

'''
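
The cast described in the tip can be pictured for a single message in plain Python. This is only a sketch outside Spark: the message bytes are made up, and decoding as UTF-8 is a rough equivalent of casting the binary value column to a string.

```python
import json

# Hypothetical raw Kafka message value; Kafka itself only stores bytes.
raw_value = b'{"application_name": "App A", "category": "TOOLS", "price": 0.0}'

# Rough single-message equivalent of F.col('value').cast(T.StringType()).
value_str = raw_value.decode("utf-8")

# Only after decoding can the text be parsed as JSON.
record = json.loads(value_str)
print(type(value_str).__name__, record["application_name"])
# -> str App A
```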

In [5]:
# parse the JSON content

#First Part:
parsed_df = stream_df \
.withColumn('parsed_json', F.from_json(F.col('value'), json_schema))

'''
    • F.from_json(...): is a PySpark SQL function that parses a column of JSON strings
        into a structured format based on the provided schema.
    • F.col('value') refers to the column named value, which contains the JSON
        strings; the provided schema is applied to it.
    • After this operation, the DataFrame will contain a new column named
        parsed_json. This new column holds the structured data that results from parsing
        the JSON strings in the value column.
# ChatGPT
1. Original Column (value):
    The value column contains the JSON data from Kafka as plain text (strings).
    In its raw form it looks like this:
    value
    {"application_name": "App A", "category": "TOOLS", "rating": "4.5", ...}

    Type: StringType (a single unstructured string).
    Content: Raw JSON as a string.

In this form, Spark does not yet recognize the text as a JSON object.
It is just a block of text, so individual fields such as application_name or rating cannot be accessed directly.

2. New Column (parsed_json):
    Applying F.from_json(F.col('value'), json_schema) transforms the raw JSON string
    into a structured value that follows the schema (json_schema),
    so Spark can treat each part of the JSON as a typed field.

After parsing, the new column parsed_json has a StructType, which is a nested data structure.
Each field named in the schema becomes an individual field of this structure.

The column looks similar to this:
parsed_json
{"application_name": "App A", "category": "TOOLS", "rating": "4.5", ...}

Internally, however, Spark knows this is no longer a string but a structured object whose fields match the schema you provided.

Type: StructType (a structured object with fields such as application_name, category, rating, etc.).
Content: A parsed JSON object with fields recognized by Spark based on the schema.

This structure allows you to access each field individually and perform operations on them.

Difference Summary:
Original column (value): This is a plain string. You can't easily access individual parts of the JSON without parsing it.
New column (parsed_json): This is a structured object (StructType). Spark recognizes the individual fields inside the JSON, and now you can easily work with those fields in subsequent transformations.
Why Is This Important?
The structured nature of parsed_json allows you to access and manipulate each part of the JSON.
'''
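
The effect of applying a schema can be imitated for one record in plain Python. This is a rough sketch, not how Spark implements from_json: it only shows that fields named in the schema are kept, that schema fields absent from the JSON come back empty (null in Spark, None here), and that JSON keys outside the schema are dropped. It does no type conversion. SCHEMA_FIELDS is a shortened, illustrative subset of json_schema, and parse_with_schema is a hypothetical helper.

```python
import json

# Shortened subset of the notebook's json_schema field names (illustration only).
SCHEMA_FIELDS = ["application_name", "category", "rating", "price"]


def parse_with_schema(value: str, fields: list) -> dict:
    """Keep only the schema fields; fields missing from the JSON become None."""
    data = json.loads(value)
    return {name: data.get(name) for name in fields}


# "translated_review" is not in the schema, so it does not survive parsing.
value = '{"application_name": "App A", "price": 0.0, "translated_review": "Good app"}'
parsed = parse_with_schema(value, SCHEMA_FIELDS)
print(parsed)
# -> {'application_name': 'App A', 'category': None, 'rating': None, 'price': 0.0}
```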

#Second part

parsed_df = parsed_df.select(F.col('parsed_json.*'))

'''
F.col('parsed_json.*'): This syntax is used to select all the individual fields from the
structured column parsed_json and elevate them to the top level. Essentially, it
"flattens" the nested or structured data in parsed_json

# ChatGPT:
F.col('parsed_json.*'): This expands the fields inside the parsed_json column into individual columns.
The .* syntax takes every field of the struct column (parsed_json) and creates a separate column for each of them.

What happens here:

After parsing the JSON into the parsed_json column (which is a StructType),
the select statement expands the fields inside parsed_json into separate top-level columns.
(This is unrelated to F.explode, which turns the elements of an array into rows.)
For example, given a smaller illustrative schema (not the json_schema defined in this notebook):

example_schema = T.StructType([
    T.StructField('application_name', T.StringType()),
    T.StructField('translated_review', T.StringType()),
    T.StructField('sentiment_rank', T.IntegerType()),
    T.StructField('sentiment_polarity', T.FloatType()),
    T.StructField('sentiment_subjectivity', T.FloatType())
])

The resulting DataFrame will have individual columns for each field in the JSON:

application_name    translated_review    sentiment_rank    sentiment_polarity    sentiment_subjectivity
App A               Good app             1                 0.8                   0.6
App B               Needs improvement    2                 -0.5                  0.9


'''


In [None]:
# write the data as a stream to S3

query = parsed_df \
.writeStream \
.trigger(processingTime='1 minute') \
.format('parquet') \
.outputMode('append') \
.option("path", "s3a://spark/data/target/google_reviews_calc") \
.option('checkpointLocation', 's3a://spark/checkpoints/ex6/store_result') \
.start()

'''

• parsed_df.writeStream:
    This begins the definition of a streaming write for the data in parsed_df. It
    returns a writer object on which the trigger, format and options below are
    configured.
• .trigger(processingTime='1 minute'):
    This sets the trigger so that a new micro-batch is started once every minute.
• .format('parquet'):
    This specifies that the output data format should be Parquet, which is a popular
    columnar storage format.
• .outputMode('append'):
    This indicates that only new rows are added to the output directory; data that
    has already been written is not rewritten.

• .option("path", "s3a://spark/data/target/google_reviews_calc"):
    This sets the destination where the processed streaming data is written, here a
    location in S3 addressed through the s3a:// scheme.
• .option('checkpointLocation', 's3a://spark/checkpoints/ex6/store_result'):
    Streaming queries in PySpark need a location to store recovery information, which is
    used in case of a failure to restart the stream from where it left off. This is called a
    checkpoint location, and it's set to a directory in S3 in this code.
• .start():
    This method actually starts the streaming query. Until this method is called, no data
    is processed or written. Once started, the data will be processed according to the
    configurations set up above and written to the specified location in S3.

'''
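
The role of the checkpoint location can be illustrated with a small simulation in plain Python. This is a conceptual sketch only: the file name offsets.json, its contents and the run_batch helper are made up for illustration and do not reflect the layout Spark actually writes to the checkpoint directory.

```python
import json
import os
import tempfile


def run_batch(messages, checkpoint_path, batch_size=2):
    """Process one micro-batch, resuming from the offset saved in the checkpoint."""
    start = 0
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            start = json.load(f)["next_offset"]
    batch = messages[start:start + batch_size]
    # Record progress only after the batch has been handled.
    with open(checkpoint_path, "w") as f:
        json.dump({"next_offset": start + len(batch)}, f)
    return batch


messages = ["m0", "m1", "m2", "m3", "m4"]
checkpoint = os.path.join(tempfile.mkdtemp(), "offsets.json")

first = run_batch(messages, checkpoint)   # fresh start: no checkpoint yet
second = run_batch(messages, checkpoint)  # simulated restart: resumes at offset 2
print(first, second)
# -> ['m0', 'm1'] ['m2', 'm3']
```

Deleting the checkpoint file in this sketch would make the next run start again from m0, which mirrors why a streaming query without its checkpoint cannot pick up where it left off.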

24/09/26 08:07:02 WARN ResolveWriteToStream: spark.sql.adaptive.enabled is not supported in streaming DataFrames/Datasets and will be disabled.
24/09/26 08:07:02 WARN StreamingQueryManager: Stopping existing streaming query [id=be405fe6-268b-433d-8553-79f9fbaa369d, runId=fdf9412a-78df-4d32-9fc2-2b8788312f14], as a new run is being started.



24/09/26 08:07:02 WARN AdminClientConfig: The configuration 'key.deserializer' was supplied but isn't a known config.
24/09/26 08:07:02 WARN AdminClientConfig: The configuration 'value.deserializer' was supplied but isn't a known config.
24/09/26 08:07:02 WARN AdminClientConfig: The configuration 'enable.auto.commit' was supplied but isn't a known config.
24/09/26 08:07:02 WARN AdminClientConfig: The configuration 'max.poll.records' was supplied but isn't a known config.
24/09/26 08:07:02 WARN AdminClientConfig: The configuration 'auto.offset.reset' was supplied but isn't a known config.


In [11]:
# streaming query termination

query.awaitTermination()
spark.stop()

'''
After the streaming query has started, awaitTermination() blocks the application
until the query terminates, either because of a failure or because it is stopped
manually. It is essentially saying, "Hold on here and keep running until the
streaming process finishes for some reason."
spark.stop() then stops the SparkSession, freeing its resources and ending the
application. Because a streaming query normally runs indefinitely, this line is
reached only after the query has ended. If the cell is interrupted from the
keyboard, as in the traceback below, the exception is raised inside
awaitTermination() and spark.stop() does not run.

'''
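
One common way to make sure the session is stopped even when the blocking call is interrupted is a try/finally block. The sketch below uses two hypothetical stand-in classes (FakeQuery and FakeSession) so that it runs without Spark; in the notebook, the real query and spark objects would take their place, and both StreamingQuery.stop() and SparkSession.stop() exist in PySpark.

```python
class FakeQuery:
    """Hypothetical stand-in for the StreamingQuery returned by .start()."""

    def __init__(self):
        self.stopped = False

    def awaitTermination(self):
        # Simulate the user interrupting the blocked notebook cell.
        raise KeyboardInterrupt

    def stop(self):
        self.stopped = True


class FakeSession:
    """Hypothetical stand-in for the SparkSession."""

    def __init__(self):
        self.stopped = False

    def stop(self):
        self.stopped = True


query, spark = FakeQuery(), FakeSession()

try:
    query.awaitTermination()
except KeyboardInterrupt:
    # Stop the streaming query cleanly instead of leaving it running.
    query.stop()
finally:
    # Runs whether the query failed, finished, or was interrupted.
    spark.stop()

print(query.stopped, spark.stopped)
# -> True True
```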


ERROR:root:KeyboardInterrupt while sending command.
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/py4j/java_gateway.py", line 1038, in send_command
    response = connection.send_command(command)
  File "/usr/local/lib/python3.10/dist-packages/py4j/clientserver.py", line 511, in send_command
    answer = smart_decode(self.stream.readline()[:-1])
  File "/usr/lib/python3.10/socket.py", line 705, in readinto
    return self._sock.recv_into(b)
KeyboardInterrupt


KeyboardInterrupt: 