# Homework - Question 6 Code

This notebook contains the code for Question 6 of the homework; please refer to 'Part 2' of `02-answers.md` for further details. Note that all of the code cells in this notebook assume that you're running the notebook **within the `06-stream/homework` directory**.


Before running any of the code in this notebook, make sure that you've:
1. Started up a local Kafka cluster by executing `docker compose up -d` inside of a terminal instance that is located within the `06-stream/homework` directory.
1. Installed all the Python requirements by executing `pip install -r requirements.txt`.

## Set-Up

First, let's import the modules we'll be using:

In [1]:
import os
import time
import shutil

from pyspark.sql import SparkSession
import pyspark.sql.types as T
import pyspark.sql.functions as F

# Local helper modules:
from producers import confluent_producer, kafka_python_producer
from consumers import confluent_consumer, kafka_python_consumer
from utils import download_csv, load_kafka_settings
from streaming import (
    parse_taxi_messages,
    show_streaming_df,
    write_df_to_topic,
    read_df_from_kafka,
    prepare_df_for_producing,
    count_streaming_df_rows,
)

Next, we'll download the Green and FHV taxi CSV files specified in the question:

In [2]:
green_csv_path = download_csv(
    "https://github.com/DataTalksClub/nyc-tlc-data/releases/download/green/green_tripdata_2019-01.csv.gz"
)
fhv_csv_path = download_csv(
    "https://github.com/DataTalksClub/nyc-tlc-data/releases/download/fhv/fhv_tripdata_2019-01.csv.gz"
)
TAXI_CSVS = {"green": green_csv_path, "fhv": fhv_csv_path}

Let's now load `kafka_settings.yaml`, which lists the configuration settings associated with the Kafka cluster running in Docker:

In [3]:
SETTINGS = load_kafka_settings()

Finally, we'll load our Kafka consumers and producers. As we mentioned in `Section 2` of `02-answers.md`, we've implemented *two* producers and *two* consumers:
- One producer has been implemented using the `confluent` Python package, the other has been implemented using the `kafka-python` package.
- One consumer has been implemented using the `confluent` Python package, the other has been implemented using the `kafka-python` package. Importantly, the `confluent` consumer can only read messages written by the `confluent` producer, and the `kafka-python` consumer can only read messages written by the `kafka-python` producer.
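One plausible reason for this incompatibility is the message framing. Assuming the `confluent` producer serializes with Confluent's Schema Registry Avro serializer (an assumption here; the actual implementation lives in the producer modules), each message starts with a magic byte and a 4-byte schema ID, whereas the `kafka-python` messages are plain strings. A minimal sketch of that framing, using a made-up schema ID and payload:

```python
import struct
from typing import Tuple

MAGIC_BYTE = 0  # first byte of the Schema Registry wire format


def frame_confluent(schema_id: int, payload: bytes) -> bytes:
    """Prefix `payload` with the magic byte and a big-endian 4-byte schema ID."""
    return struct.pack(">bI", MAGIC_BYTE, schema_id) + payload


def unframe_confluent(message: bytes) -> Tuple[int, bytes]:
    """Split a framed message into (schema_id, payload), rejecting other formats."""
    magic, schema_id = struct.unpack(">bI", message[:5])
    if magic != MAGIC_BYTE:
        raise ValueError("not a Schema Registry framed message")
    return schema_id, message[5:]


# Hypothetical schema ID and payload, for illustration only:
framed = frame_confluent(7, b"avro-bytes")
schema_id, payload = unframe_confluent(framed)

# A plain comma-separated string starts with an ordinary character, not the
# magic byte, so a consumer expecting the framing would reject it:
plain = "264, 2018-12-21 15:17:29".encode("utf-8")
try:
    unframe_confluent(plain)
    plain_accepted = True
except ValueError:
    plain_accepted = False
```
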

With this in mind, let's load our two producers and two consumers:

In [4]:
PRODUCERS = {"confluent": confluent_producer, "kafka-python": kafka_python_producer}
CONSUMERS = {"confluent": confluent_consumer, "kafka-python": kafka_python_consumer}

## Testing of Kafka Consumers and Producers Implemented in Python

Let's first do a quick 'test run' with both of our producers. In particular, we'll write a few messages from each taxi CSV to a topic that is specific to each producer implementation *and* to each taxi service type (e.g. with the `confluent` producer, we'll write the green taxi trips to the `'confluent_green_trips'` topic).

If this test run throws an error, then one of two problems has most likely occurred:
1. The Kafka cluster is not running - make sure you've run `docker compose up -d` before running this notebook.
1. The Kafka cluster is still starting up - wait five or so seconds before re-running this cell.
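For the second case, one alternative to waiting a fixed number of seconds is to poll the broker's port until it accepts connections. The helper below is a hypothetical sketch that uses only the standard library; `wait_for_port` is not part of this repository, and the host and port you pass should come from `SETTINGS["bootstrap_servers"]` (the `localhost:9092` in the comment is only an assumed example):

```python
import socket
import time


def wait_for_port(
    host: str, port: int, timeout: float = 30.0, interval: float = 1.0
) -> bool:
    """Return True once a TCP connection to (host, port) succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # A successful connection is closed again straight away.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)


# Example call (assumed address): wait_for_port("localhost", 9092)
```

An open port only shows that something is listening; the broker may still need a moment before it accepts produce requests, so a retry of the cell can still be necessary.
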

In [5]:
max_messages = 3
sleep = 0.0
# Loop over each producer implementation:
for i, (package_name, producer) in enumerate(PRODUCERS.items()):
    # Loop over each taxi CSV:
    for j, (taxi, csv_path) in enumerate(TAXI_CSVS.items()):
        # Add space between outputs of each taxi-producer combo:
        if i > 0 or j > 0:
            print()
        print(
            f"Producing {taxi} taxi data with Producer implemented in {package_name}:"
        )
        producer.produce_taxi_data(
            csv_path=csv_path,
            topic=f"{package_name}_{taxi}_trips",
            key_schema=SETTINGS[f"{taxi}_key_schema"],
            value_schema=SETTINGS[f"{taxi}_value_schema"],
            bootstrap_servers=SETTINGS["bootstrap_servers"],
            schema_registry_url=SETTINGS["schema_registry_url"],
            max_messages=max_messages,
            sleep=sleep,
            verbose=True,
        )

Producing green taxi data with Producer implemented in confluent:
Record successfully produced to Partition 0 of topic 'confluent_green_trips' at offset 0.
Record successfully produced to Partition 0 of topic 'confluent_green_trips' at offset 1.
Record successfully produced to Partition 0 of topic 'confluent_green_trips' at offset 2.
Successfully produced 3 messages to 'confluent_green_trips' topic.

Producing fhv taxi data with Producer implemented in confluent:
Record successfully produced to Partition 0 of topic 'confluent_fhv_trips' at offset 0.
Record successfully produced to Partition 0 of topic 'confluent_fhv_trips' at offset 1.
Record successfully produced to Partition 0 of topic 'confluent_fhv_trips' at offset 2.
Successfully produced 3 messages to 'confluent_fhv_trips' topic.

Producing green taxi data with Producer implemented in kafka-python:
Record successfully produced to Partition 0 of topic 'kafka-python_green_trips' at offset 0.
Record successfully produced to Partitio

Let's now check that our consumers are working correctly: with each consumer, we'll read from both the FHV taxi and Green taxi topics written to by the corresponding producer (e.g. for the `kafka-python` consumer, we'll read from the `'kafka-python_green_trips'` and the `'kafka-python_fhv_trips'` topics):

In [6]:
min_messages = 3
sleep = 0.0
for i, (package_name, consumer) in enumerate(CONSUMERS.items()):
    for j, (taxi, csv_path) in enumerate(TAXI_CSVS.items()):
        if i > 0 or j > 0:
            print()
        topic = f"{package_name}_{taxi}_trips"
        print(f"Consuming from '{topic}' topic with {package_name} consumer:")
        consumer.consume_taxi_data(
            topic=topic,
            key_schema=SETTINGS[f"{taxi}_key_schema"],
            value_schema=SETTINGS[f"{taxi}_value_schema"],
            schema_registry_url=SETTINGS["schema_registry_url"],
            bootstrap_servers=SETTINGS["bootstrap_servers"],
            group_id=f"{package_name}_{taxi}_taxi",
            offset="earliest",
            min_messages=min_messages,
            sleep=sleep,
            verbose=True,
        )

Consuming from 'confluent_green_trips' topic with confluent consumer:
Record 1, Key: {'lpep_pickup_datetime': '2018-12-21 15:17:29'}, Values: {'PULocationID': 264, 'lpep_pickup_datetime': '2018-12-21 15:17:29', 'lpep_dropoff_datetime': '2018-12-21 15:18:57', 'passenger_count': 5, 'trip_distance': 0.0, 'fare_amount': 3.0}
Record 2, Key: {'lpep_pickup_datetime': '2019-01-01 00:10:16'}, Values: {'PULocationID': 97, 'lpep_pickup_datetime': '2019-01-01 00:10:16', 'lpep_dropoff_datetime': '2019-01-01 00:16:32', 'passenger_count': 2, 'trip_distance': 0.8600000143051147, 'fare_amount': 6.0}
Record 3, Key: {'lpep_pickup_datetime': '2019-01-01 00:27:11'}, Values: {'PULocationID': 49, 'lpep_pickup_datetime': '2019-01-01 00:27:11', 'lpep_dropoff_datetime': '2019-01-01 00:31:38', 'passenger_count': 2, 'trip_distance': 0.6600000262260437, 'fare_amount': 4.5}
Succesfully consumed 3 records from topics: ['confluent_green_trips'].

Consuming from 'confluent_fhv_trips' topic with confluent consumer:
Rec

Pleasingly, our consumers appear to have correctly read the records written by our producers. Now that we know that our Kafka consumers and producers work as intended, let's move on to actually analysing taxi data from Kafka using `pyspark`.

## Connecting PySpark to Kafka

We'll first need to start our Spark session. In order to have PySpark correctly interact with our Kafka cluster, we'll have to correctly specify the `PYSPARK_SUBMIT_ARGS` environment variable:

In [7]:
os.environ["PYSPARK_SUBMIT_ARGS"] = (
    "--packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.3.1,"
    "org.apache.spark:spark-avro_2.12:3.3.1 pyspark-shell"
)
spark = SparkSession.builder.appName("kafka-example").getOrCreate()
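The package coordinates above embed both a Scala version (`2.12`) and a connector version (`3.3.1`), and the connector version is usually chosen to match the installed Spark version. As a small sketch, the same string could be assembled from those two values; `build_submit_args` is a hypothetical helper, not something defined in this repository:

```python
def build_submit_args(spark_version: str, scala_version: str = "2.12") -> str:
    """Assemble PYSPARK_SUBMIT_ARGS for the Kafka and Avro connector packages."""
    packages = [
        f"org.apache.spark:spark-sql-kafka-0-10_{scala_version}:{spark_version}",
        f"org.apache.spark:spark-avro_{scala_version}:{spark_version}",
    ]
    return f"--packages {','.join(packages)} pyspark-shell"


# Reproduces the value used in the cell above:
submit_args = build_submit_args("3.3.1")
```
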

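If a `checkpoint` directory is left over from a previous PySpark run of this notebook, we should make sure to clear this before starting any new streaming queries: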

In [8]:
if os.path.exists("checkpoint"):
    shutil.rmtree("checkpoint")

For the sake of reproducibility, we'll also take this opportunity to print out the version of Spark we're using:

In [9]:
!spark-submit --version

23/03/27 23:32:10 WARN Utils: Your hostname, pop-os resolves to a loopback address: 127.0.1.1; using 192.168.1.146 instead (on interface enp9s0)
23/03/27 23:32:10 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.3.2
      /_/
                        
Using Scala version 2.12.15, OpenJDK 64-Bit Server VM, 11.0.17
Branch HEAD
Compiled by user liangchi on 2023-02-10T19:57:40Z
Revision 5103e00c4ce5fcc4264ca9c4df12295d42557af6
Url https://github.com/apache/spark
Type --help for more information.


For the purposes of writing data to our Kafka cluster that we'll analyse using PySpark, we'll restrict ourselves to **only using the `kafka-python` producer**. We choose the `kafka-python` producer over the `confluent` producer here since it's a lot easier to deserialize the messages written with `kafka-python` (which are just plain strings that contain all the message values concatenated together) than those written with `confluent`.

With this in mind, we'll define the following function to write a specified number of rows in the FHV taxi and Green taxi CSVs to the topics `fhv_trips` and `green_trips` respectively:

In [10]:
def produce_taxi_data_with_kafka_python(num_messages: int) -> None:
    """
    Using `kafka_python_producer`, writes the first `num_messages`
    rows of the Green taxi CSV data and the first `num_messages` rows
    of the FHV taxi CSV data to the 'green_trips' and `fhv_trips`
    topics respectively.
    """
    for taxi, csv_path in TAXI_CSVS.items():
        kafka_python_producer.produce_taxi_data(
            csv_path=csv_path,
            topic=f"{taxi}_trips",
            key_schema=SETTINGS[f"{taxi}_key_schema"],
            value_schema=SETTINGS[f"{taxi}_value_schema"],
            bootstrap_servers=SETTINGS["bootstrap_servers"],
            schema_registry_url=SETTINGS["schema_registry_url"],
            max_messages=num_messages,
            sleep=0.0,
            verbose=False,
        )
    return None

Let's now use our function to write twenty thousand messages to each topic:

In [11]:
produce_taxi_data_with_kafka_python(num_messages=20_000)

Successfully produced 20000 messages to 'green_trips' topic.
Successfully produced 20000 messages to 'fhv_trips' topic.


### Loading Kafka Messages into PySpark

Now that we've written messages to the `green_trips` and `fhv_trips` topics, let's read from these topics using PySpark to create two streaming dataframes (i.e. one for the `green_trips` topic and another for the `fhv_trips` topic):

In [12]:
df_green_raw = read_df_from_kafka(spark, topic="green_trips")
df_fhv_raw = read_df_from_kafka(spark, topic="fhv_trips")

To better understand what `read_df_from_kafka` has done here, please refer to `streaming.py`. An important point to note is that the schema of the dataframes we've loaded is **not** the same as the schema of the taxi trip CSVs. Instead, `df_green_raw` and `df_fhv_raw` contain the *serialized* `key` and `value` of each message we've read, alongside Kafka metadata such as the topic, partition and offset:

In [13]:
show_streaming_df(df_green_raw, spark)

+--------------------+--------------------+-----------+---------+------+--------------------+-------------+
|                 key|               value|      topic|partition|offset|           timestamp|timestampType|
+--------------------+--------------------+-----------+---------+------+--------------------+-------------+
|[32 30 31 38 2D 3...|[32 36 34 2C 20 3...|green_trips|        0|     0|2023-03-27 23:32:...|            0|
|[32 30 31 39 2D 3...|[39 37 2C 20 32 3...|green_trips|        0|     1|2023-03-27 23:32:...|            0|
|[32 30 31 39 2D 3...|[34 39 2C 20 32 3...|green_trips|        0|     2|2023-03-27 23:32:...|            0|
|[32 30 31 39 2D 3...|[31 38 39 2C 20 3...|green_trips|        0|     3|2023-03-27 23:32:...|            0|
|[32 30 31 39 2D 3...|[38 32 2C 20 32 3...|green_trips|        0|     4|2023-03-27 23:32:...|            0|
|[32 30 31 39 2D 3...|[34 39 2C 20 32 3...|green_trips|        0|     5|2023-03-27 23:32:...|            0|
|[32 30 31 39 2D 3...|[32 35
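The `key` and `value` columns above are displayed as hex-encoded bytes. As a quick sanity check, the complete bytes visible at the start of the first row can be decoded with plain Python (only the prefix is shown in the output, so only the prefix is decoded here):

```python
def decode_hex_prefix(hex_bytes: str) -> str:
    """Decode a whitespace-separated run of hex bytes as UTF-8 text."""
    return bytes.fromhex(hex_bytes).decode("utf-8")


# First row of the output above, dropping the final truncated byte:
key_prefix = decode_hex_prefix("32 30 31 38 2D")    # start of the pickup datetime
value_prefix = decode_hex_prefix("32 36 34 2C 20")  # pickup location ID, then ", "
```

The key starts with the year of the pickup datetime and the value starts with the pickup location ID followed by a comma and a space, which is consistent with the plain-string messages written by the `kafka-python` producer.
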

Before actually working with the data we've loaded from each topic, then, we'll have to deserialize the `value` of each loaded message. To do this, we first need to declare to PySpark the schema of the message values we're going to load, which we already know ahead of time (see the Avro files in the `schema/` directory):

In [14]:
green_schema = T.StructType(
    [
        T.StructField("PULocationID", T.IntegerType()),
        T.StructField("lpep_pickup_datetime", T.TimestampType()),
        T.StructField("lpep_dropoff_datetime", T.TimestampType()),
        T.StructField("passenger_count", T.IntegerType()),
        T.StructField("trip_distance", T.FloatType()),
        T.StructField("fare_amount", T.FloatType()),
    ]
)
fhv_schema = T.StructType(
    [
        T.StructField("PUlocationID", T.IntegerType()),
        T.StructField("DOlocationID", T.IntegerType()),
        T.StructField("pickup_datetime", T.TimestampType()),
        T.StructField("dropOff_datetime", T.TimestampType()),
        T.StructField("dispatching_base_num", T.StringType()),
        T.StructField("SR_Flag", T.BooleanType()),
    ]
)

With the schema of our messages defined, we can deserialize and parse each message value using the `parse_taxi_messages` function we've defined in `streaming.py`:

In [15]:
df_green = parse_taxi_messages(df_green_raw, green_schema)
df_fhv = parse_taxi_messages(df_fhv_raw, fhv_schema)
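To give a feel for what this parsing step involves, here is a pure-Python sketch that splits a single message value and casts each part according to a list of fields. It is an illustration only, not the implementation in `streaming.py`; `GREEN_FIELDS` and `parse_message` are hypothetical names, and the example message is an assumed reconstruction of the first Green taxi record:

```python
from datetime import datetime

# Field names follow `green_schema`; each is paired with a Python cast.
GREEN_FIELDS = [
    ("PULocationID", int),
    ("lpep_pickup_datetime", datetime.fromisoformat),
    ("lpep_dropoff_datetime", datetime.fromisoformat),
    ("passenger_count", int),
    ("trip_distance", float),
    ("fare_amount", float),
]


def parse_message(value: bytes, fields, sep: str = ", ") -> dict:
    """Decode a comma-separated message value and cast each part by position."""
    parts = value.decode("utf-8").split(sep)
    return {name: cast(part) for (name, cast), part in zip(fields, parts)}


raw = b"264, 2018-12-21 15:17:29, 2018-12-21 15:18:57, 5, 0.0, 3.0"
record = parse_message(raw, GREEN_FIELDS)
```
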

To confirm that we've correctly parsed the Kafka messages we've read, let's now print the contents of the `df_green` and `df_fhv` dataframes:

In [16]:
print("Parsed Green taxi dataframe:")
show_streaming_df(df_green, spark)
print("Parsed FHV taxi dataframe:")
show_streaming_df(df_fhv, spark)

Parsed Green taxi dataframe:
+------------+--------------------+---------------------+---------------+-------------+-----------+
|PULocationID|lpep_pickup_datetime|lpep_dropoff_datetime|passenger_count|trip_distance|fare_amount|
+------------+--------------------+---------------------+---------------+-------------+-----------+
|         264| 2018-12-21 15:17:29|  2018-12-21 15:18:57|              5|          0.0|        3.0|
|          97| 2019-01-01 00:10:16|  2019-01-01 00:16:32|              2|         0.86|        6.0|
|          49| 2019-01-01 00:27:11|  2019-01-01 00:31:38|              2|         0.66|        4.5|
|         189| 2019-01-01 00:46:20|  2019-01-01 01:04:54|              2|         2.68|       13.5|
|          82| 2019-01-01 00:19:06|  2019-01-01 00:39:43|              1|         4.53|       18.0|
|          49| 2019-01-01 00:12:35|  2019-01-01 00:19:09|              1|         1.05|        6.5|
|         255| 2019-01-01 00:47:55|  2019-01-01 01:00:01|              

### Combining Streaming Dataframes

Now that we have our `df_green` and `df_fhv` streaming dataframes, let's merge them into a unified `df_rides` dataframe. To achieve this, let's first ensure that the columns shared by `df_green` and `df_fhv` have the same names:

In [17]:
# Now "pickup_datetime" and "dropoff_datetime" in both `df_green` and `df_fhv`
# (FHV's "dropOff_datetime" differs only in case, which Spark's default
# case-insensitive column resolution treats as the same name):
df_green = df_green.withColumnRenamed(
    "lpep_pickup_datetime", "pickup_datetime"
).withColumnRenamed("lpep_dropoff_datetime", "dropoff_datetime")
# Now "PULocationID" in both `df_green` and `df_fhv`:
df_fhv = df_fhv.withColumnRenamed("PUlocationID", "PULocationID")

We'll also add a `service_type` column to `df_green` and `df_fhv` so that once we merge these dataframes into a single dataframe, we'll be able to track whether a record came from the Green taxi dataset or the FHV taxi dataset:

In [18]:
df_green = df_green.withColumn("service_type", F.lit("green").cast("string"))
df_fhv = df_fhv.withColumn("service_type", F.lit("fhv").cast("string"))

The final step we need to perform before we merge our two dataframes is to add a timestamp column to each dataframe, and then set this column as a *watermark*. Without going into detail here, watermarks are used by PySpark to correctly handle messages that 'arrive late'. For more information on how watermarks work, please refer to [this blog post](https://www.databricks.com/blog/2017/05/08/event-time-aggregation-watermarking-apache-sparks-structured-streaming.html), as well as [this blog post](https://www.databricks.com/blog/2022/08/22/feature-deep-dive-watermarking-apache-spark-structured-streaming.html).

In [19]:
df_fhv_wm = df_fhv.withColumn("timestamp", F.current_timestamp()).withWatermark(
    "timestamp", "1 hours"
)
df_green_wm = df_green.withColumn("timestamp", F.current_timestamp()).withWatermark(
    "timestamp", "1 hours"
)
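As a rough intuition for what a watermark does, the pure-Python sketch below keeps track of the newest timestamp seen so far and flags anything older than that timestamp minus a fixed delay as 'too late'. It is a simplified model under stated assumptions, not Spark's implementation: `filter_late_events` is a hypothetical name, the timestamps are made up, and whether Spark actually discards late rows depends on the kind of query being run.

```python
from datetime import datetime, timedelta
from typing import List, Optional, Tuple


def filter_late_events(
    batches: List[List[datetime]], delay: timedelta
) -> Tuple[List[datetime], List[datetime]]:
    """Split timestamps into (kept, dropped) using a trailing watermark."""
    max_seen: Optional[datetime] = None
    watermark: Optional[datetime] = None
    kept, dropped = [], []
    for batch in batches:
        for ts in batch:
            # Anything older than the current watermark counts as too late.
            if watermark is not None and ts < watermark:
                dropped.append(ts)
            else:
                kept.append(ts)
            if max_seen is None or ts > max_seen:
                max_seen = ts
        # The watermark only advances between batches, trailing the newest
        # timestamp seen so far by `delay`.
        if max_seen is not None:
            watermark = max_seen - delay
    return kept, dropped


def t(hour: int, minute: int) -> datetime:
    return datetime(2019, 1, 1, hour, minute)


batches = [[t(12, 0), t(12, 30)], [t(10, 0), t(13, 0)], [t(11, 45)]]
kept, dropped = filter_late_events(batches, timedelta(hours=1))
```

Here the 10:00 timestamp arrives after the watermark has reached 11:30, and the 11:45 timestamp arrives after it has reached 12:00, so both are flagged as late.
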

We're now able to unify our two streaming dataframes into one:

In [20]:
df_rides = df_fhv_wm.unionByName(df_green_wm, allowMissingColumns=True)
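Conceptually, a union by name with missing columns allowed lines rows up by column name and fills any column that one side lacks with nulls. The sketch below mimics that behaviour on plain lists of dicts; `union_by_name` is a hypothetical helper with made-up rows, and it ignores details such as types and case-insensitive name matching:

```python
from typing import Dict, List


def union_by_name(left: List[Dict], right: List[Dict]) -> List[Dict]:
    """Stack two lists of rows, filling columns missing on either side with None."""
    columns: List[str] = []
    # Left-hand columns come first, followed by any extra right-hand columns.
    for row in left + right:
        for name in row:
            if name not in columns:
                columns.append(name)
    return [{name: row.get(name) for name in columns} for row in left + right]


fhv_rows = [{"pickup_datetime": "2019-01-01 00:30:00", "service_type": "fhv"}]
green_rows = [
    {"pickup_datetime": "2018-12-21 15:17:29", "service_type": "green", "fare_amount": 3.0}
]
rides = union_by_name(fhv_rows, green_rows)
```
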

Let's print our newly created `df_rides` dataframe to make sure this unification has occurred correctly:

In [21]:
show_streaming_df(df_rides, spark)

+------------+------------+-------------------+-------------------+--------------------+-------+------------+--------------------+---------------+-------------+-----------+
|PULocationID|DOlocationID|    pickup_datetime|   dropOff_datetime|dispatching_base_num|SR_Flag|service_type|           timestamp|passenger_count|trip_distance|fare_amount|
+------------+------------+-------------------+-------------------+--------------------+-------+------------+--------------------+---------------+-------------+-----------+
|        null|        null|2019-01-01 00:30:00|2019-01-01 02:51:55|              B00001|   null|         fhv|2023-03-27 23:32:...|           null|         null|       null|
|        null|        null|2019-01-01 00:45:00|2019-01-01 00:54:49|              B00001|   null|         fhv|2023-03-27 23:32:...|           null|         null|       null|
|        null|        null|2019-01-01 00:15:00|2019-01-01 00:54:52|              B00001|   null|         fhv|2023-03-27 23:32:...|     

### Querying Streaming Dataframes

Now that we've combined the Green and FHV taxi trips into a single streaming dataframe, let's try to work out the ID of the most popular pickup location across all of these taxi trips. 

For this purpose, we'll define the function `count_pu_locations`, which will use a simple SQL query to determine the top `num_result` most popular pick-up location IDs:

In [22]:
def count_pu_locations(df, spark, num_result, query_name="count_pu_locs"):
    query = (
        df.writeStream.queryName(query_name)
        .format("memory")
        .trigger(availableNow=True)
        .start()
    )
    query.awaitTermination()
    query_results = spark.sql(
        f"""
        SELECT
            PULocationID, COUNT(PULocationID) AS count
        FROM
            {query_name}
        GROUP BY
            PULocationID
        ORDER BY
            count DESC
        LIMIT {num_result};
    """
    )
    return query_results
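For readers less familiar with SQL, the grouping and counting performed by this query can be sketched in plain Python. Note that `COUNT(PULocationID)` only counts non-null IDs, so rows without a pickup location add nothing to any count; the sketch simply skips them. `top_pickup_locations` and the sample rows are hypothetical:

```python
from collections import Counter
from typing import Dict, List, Tuple


def top_pickup_locations(rows: List[Dict], num_result: int) -> List[Tuple[int, int]]:
    """Return the `num_result` most frequent non-null pickup location IDs."""
    counts = Counter(
        row["PULocationID"] for row in rows if row["PULocationID"] is not None
    )
    return counts.most_common(num_result)


sample_rows = [
    {"PULocationID": pu} for pu in [74, 74, 41, None, None, None, 74, 41, 7]
]
top_two = top_pickup_locations(sample_rows, num_result=2)
```
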

Using `count_pu_locations`, let's find out the top 10 most popular pick-up location IDs:

In [23]:
count_pu_locations(df_rides, spark, num_result=10).show()

+------------+-----+
|PULocationID|count|
+------------+-----+
|          74| 1623|
|          41| 1220|
|           7| 1219|
|         181|  983|
|          42|  961|
|          75|  912|
|         129|  859|
|          82|  855|
|         255|  848|
|          97|  698|
+------------+-----+



The convenient thing about *streaming* dataframes is that they *automatically update* when more data is produced to a topic. To illustrate this, let's write some more data to the `fhv_trips` and `green_trips` topics and then recompute the top 10 most popular pickup location IDs:

In [24]:
print("Add new data to Kafka topics:")
produce_taxi_data_with_kafka_python(num_messages=500)
print("Re-run count:")
count_pu_locations(df_rides, spark, num_result=10).show()

Add new data to Kafka topics:
Successfully produced 500 messages to 'green_trips' topic.
Successfully produced 500 messages to 'fhv_trips' topic.
Re-run count:
+------------+-----+
|PULocationID|count|
+------------+-----+
|          74| 1641|
|          41| 1250|
|           7| 1246|
|         181| 1016|
|          42|  985|
|          75|  929|
|         255|  872|
|         129|  871|
|          82|  870|
|          97|  713|
+------------+-----+



As expected, the result of our query has changed upon writing more data to the topics we're streaming data from - in particular, check the values in the `count` column.

### Writing Streaming Dataframes to a Kafka Topic

Now that we've queried our combined `df_rides` dataframe, let's try to write this dataframe to a new topic in our Kafka cluster.

Unfortunately, we can't directly write `df_rides` in its current form to Kafka; this is because the Kafka sink expects a dataframe with a `key` column, which stores the *serialized* key of the message we're writing, and a `value` column, which stores the *serialized* values of the message we're writing. In this case, our serialized message values are simply serialized strings, where each string is formed by joining the values we want to store with commas.

With this in mind, we'll prepare our `df_rides` dataframe for writing by using the `prepare_df_for_producing` function we've defined in `streaming.py`:

In [25]:
df_rides_produce = prepare_df_for_producing(
    df_rides,
    value_columns=["service_type", "pickup_datetime", "dropoff_datetime"],
    key_column="pickup_datetime",
)
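The shape of the resulting `key` and `value` columns can be illustrated with a pure-Python sketch that turns one row into a pair of byte strings. This is not the code in `streaming.py`; `to_kafka_record` is a hypothetical helper, the `", "` separator is an assumption based on the messages produced earlier in this notebook, and the row is made up from values shown above:

```python
from typing import Dict, List


def to_kafka_record(
    row: Dict, key_column: str, value_columns: List[str], sep: str = ", "
) -> Dict[str, bytes]:
    """Serialize one row into UTF-8 encoded `key` and `value` byte strings."""
    key = str(row[key_column]).encode("utf-8")
    value = sep.join(str(row[col]) for col in value_columns).encode("utf-8")
    return {"key": key, "value": value}


row = {
    "service_type": "fhv",
    "pickup_datetime": "2019-01-01 00:30:00",
    "dropoff_datetime": "2019-01-01 02:51:55",
}
kafka_record = to_kafka_record(
    row,
    key_column="pickup_datetime",
    value_columns=["service_type", "pickup_datetime", "dropoff_datetime"],
)
```
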

Now that we've defined `df_rides_produce` (which is just `df_rides` put into a form that is suitable for writing to a Kafka topic), we can use the `write_df_to_topic` function we've defined in `streaming.py` to continuously write the contents of this dataframe to a new topic, which we'll call `all_trips`.

Since `df_rides_produce` is a streaming dataframe whose contents will automatically update should new data be written to the `green_trips` and/or `fhv_trips` topics, we also need to pass a write frequency to `write_df_to_topic`: this is the number of seconds PySpark waits between successive checks for new data in `df_rides_produce`. If any new data has arrived since the last check, it will be automatically appended to the specified topic:

In [26]:
write_freq = 1
write_query = write_df_to_topic(
    df_rides_produce, topic="all_trips", write_freq=write_freq
)
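The periodic check-and-append behaviour described above can be modelled in a few lines of plain Python. In this simplified sketch, a list stands in for the source topics, another list stands in for the target topic, and each call to `trigger` plays the role of one periodic check; `MicroBatchWriter` is a hypothetical class, not part of PySpark or of this repository:

```python
from typing import List


class MicroBatchWriter:
    """Copy records from `source` to `sink`, remembering how far it has read."""

    def __init__(self, source: List, sink: List) -> None:
        self.source = source
        self.sink = sink
        self.next_offset = 0  # position of the first record not yet written

    def trigger(self) -> int:
        """Append any records added since the last trigger; return how many."""
        new_records = self.source[self.next_offset:]
        self.sink.extend(new_records)
        self.next_offset += len(new_records)
        return len(new_records)


source_topic = ["a", "b", "c"]
target_topic: List[str] = []
writer = MicroBatchWriter(source_topic, target_topic)

first = writer.trigger()           # writes the three existing records
second = writer.trigger()          # nothing new, so nothing is written
source_topic.extend(["d", "e"])    # new data arrives in the source
third = writer.trigger()           # only the two new records are written
```

Because the writer remembers its offset, records are appended once rather than being rewritten on every check.
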

To make sure that we've successfully written all the messages currently stored in the `green_trips` and `fhv_trips` topics (i.e. the Kafka topics that the `df_rides_produce` dataframe is streaming from) to the new `all_trips` topic, let's read a dataframe from the `all_trips` topic we've just created:

In [27]:
df_rides_read = read_df_from_kafka(spark, topic="all_trips")
# Wait until all new data has streamed into new topic:
while write_query.status["isTriggerActive"]:
    time.sleep(0.1)
show_streaming_df(df_rides_read, spark)

+--------------------+--------------------+---------+---------+------+--------------------+-------------+
|                 key|               value|    topic|partition|offset|           timestamp|timestampType|
+--------------------+--------------------+---------+---------+------+--------------------+-------------+
|[32 30 31 39 2D 3...|[66 68 76 2C 20 3...|all_trips|        0|     0|2023-03-27 23:33:...|            0|
|[32 30 31 38 2D 3...|[67 72 65 65 6E 2...|all_trips|        0|     1|2023-03-27 23:33:...|            0|
|[32 30 31 39 2D 3...|[66 68 76 2C 20 3...|all_trips|        0|     2|2023-03-27 23:33:...|            0|
|[32 30 31 39 2D 3...|[67 72 65 65 6E 2...|all_trips|        0|     3|2023-03-27 23:33:...|            0|
|[32 30 31 39 2D 3...|[66 68 76 2C 20 3...|all_trips|        0|     4|2023-03-27 23:33:...|            0|
|[32 30 31 39 2D 3...|[66 68 76 2C 20 3...|all_trips|        0|     5|2023-03-27 23:33:...|            0|
|[32 30 31 39 2D 3...|[67 72 65 65 6E 2...|all

Let's also count the number of records we've written to the `all_trips` topic:

In [28]:
print(
    f"Number of records in `all_trips`: {count_streaming_df_rows(df_rides_read, spark)}"
)

Number of records in `all_trips`: 41000


Finally, let's check that any new data that's written to the `green_trips` and `fhv_trips` topics is *automatically* streamed to the `df_rides` and `df_rides_produce` dataframes, and is then automatically written to the `all_trips` topic:

In [29]:
# Write some more messages to `green_trips` and `fhv_trips`:
produce_taxi_data_with_kafka_python(num_messages=123)
# Wait for PySpark to check for new messages written to DataFrame:
time.sleep(1.5 * write_freq)
# Wait until all new data has streamed into new topic:
while write_query.status["isTriggerActive"]:
    time.sleep(0.1)
print(f"Number of rows: {count_streaming_df_rows(df_rides_read, spark)}")

Successfully produced 123 messages to 'green_trips' topic.
Successfully produced 123 messages to 'fhv_trips' topic.
Number of rows: 41246


Pleasingly, we can see that PySpark has automatically written the messages we sent to `green_trips` and `fhv_trips` to the `all_trips` topic.

If we no longer want PySpark to automatically update the `all_trips` topic with new observations in `df_rides_produce`, we just need to call the `stop()` method of the `write_query` object returned by our `write_df_to_topic` function:

In [30]:
write_query.stop()