# Data Ingestion

Before proceeding, we need to make sure Kafka has the topic we will use: `raw-taxi-data`.
We use `medallion` as the message key, so that records with the same `medallion` always go to the same partition.
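
To make the effect of a message key concrete, here is a minimal plain-Python sketch of key-based partitioning. It is an illustration only: `pick_partition` and `NUM_PARTITIONS` are hypothetical names, the medallion values are made up, and CRC32 is a stand-in hash (Kafka's default partitioner uses murmur2, so real partition numbers will differ).

```python
import zlib

NUM_PARTITIONS = 4  # hypothetical partition count for `raw-taxi-data`


def pick_partition(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Map a message key to a partition index.

    Stand-in hash: Kafka's default partitioner uses murmur2, not CRC32,
    so the actual partition numbers will differ.
    """
    return zlib.crc32(key.encode("utf-8")) % num_partitions


# Made-up medallion values, for illustration only
keys = ["MEDALLION_A", "MEDALLION_B", "MEDALLION_A"]
partitions = [pick_partition(k) for k in keys]
print(partitions)
```

Whatever hash is used, the property that matters is the same: equal keys always map to the same partition, which preserves per-medallion ordering.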

We will produce the data using the CSV file as the data source, then read the stream from Kafka, load it into a DataFrame, and start the EDA.


In [1]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, to_json, struct, lit, current_timestamp, expr
from pyspark.sql.types import (
    StructType,
    StructField,
    StringType,
    DoubleType,
    TimestampType,
    IntegerType,
)
import pandas as pd

In [2]:
raw_sample_df = pd.read_csv('input/sample.csv', header=0)

In [3]:
raw_sample_df.columns

Index(['medallion', ' hack_license', ' vendor_id', ' rate_code',
       ' store_and_fwd_flag', ' pickup_datetime', ' dropoff_datetime',
       ' passenger_count', ' trip_time_in_secs', ' trip_distance',
       ' pickup_longitude', ' pickup_latitude', ' dropoff_longitude',
       ' dropoff_latitude'],
      dtype='object')
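
Note that every column name after the first carries a leading space (for example `' hack_license'`), so a lookup such as `raw_sample_df['hack_license']` would raise a `KeyError`. A minimal sketch of one way to normalize the names, using a tiny in-memory CSV with made-up values in place of `input/sample.csv`:

```python
import io

import pandas as pd

# Two-line stand-in for the real file; the values are made up
sample_csv = io.StringIO(
    "medallion, hack_license, vendor_id\n"
    "ABC, XYZ, CMT\n"
)

df = pd.read_csv(sample_csv, header=0)
print(list(df.columns))  # names still carry the leading space

# Strip surrounding whitespace from every column name
df.columns = df.columns.str.strip()
print(list(df.columns))
```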

In [4]:
spark = (
    SparkSession.builder.appName("NYC Taxi Data Kafka ETL")
    .config("spark.jars.packages", "org.apache.spark:spark-sql-kafka-0-10_2.12:3.5.0")
    .config("spark.sql.session.timeZone", "UTC")
    .getOrCreate()
)

# noise reduction
spark.sparkContext.setLogLevel("WARN")

In [5]:
raw_taxi_schema = StructType(
    [
        StructField("medallion", StringType(), True),
        StructField("hack_license", StringType(), True),
        StructField("vendor_id", StringType(), True),
        StructField("rate_code", StringType(), True),
        StructField("store_and_fwd_flag", StringType(), True),
        StructField("pickup_datetime", TimestampType(), True),
        StructField("dropoff_datetime", TimestampType(), True),
        StructField("passenger_count", IntegerType(), True),
        StructField("trip_time_in_secs", IntegerType(), True),
        StructField("trip_distance", DoubleType(), True),
        StructField("pickup_longitude", DoubleType(), True),
        StructField("pickup_latitude", DoubleType(), True),
        StructField("dropoff_longitude", DoubleType(), True),
        StructField("dropoff_latitude", DoubleType(), True),
    ]
)


In [None]:
def ingest_csv_to_kafka(csv_path, batch_size=1000):
    """
    Read a CSV file in batches and publish records to Kafka
    """
    print(
        f"Ingesting data from {csv_path} to Kafka topic 'raw-taxi-data' with batch size {batch_size}"
    )

    # loading the data
    taxi_data = spark.read.csv(csv_path, header=True, schema=raw_taxi_schema)

    # Get total count for progress reporting
    total_count = taxi_data.count()
    print(f"Total records to process: {total_count}")

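    # Process in batches
    for offset in range(0, total_count, batch_size):
        # skip `offset` rows first, then take one batch; limit-then-offset
        # would return no rows for every batch after the first.
        # Without an explicit orderBy, the row order is not guaranteed to be
        # stable between batches.
        current_batch = taxi_data.offset(offset).limit(batch_size)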

        # medallion used as partition key
        current_batch = current_batch.withColumn("kafka_key", col("medallion"))

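        # jsonify the data for kafka; the Kafka sink takes the message key
        # from a column named "key" and the payload from "value"
        batch_json = current_batch.select(
            col("kafka_key").cast("string").alias("key"),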
            to_json(
                struct(*[col(c) for c in current_batch.columns if c != "kafka_key"])
            ).alias("value"),
        )

        # write current batch
        batch_json.write.format("kafka").option(
            "kafka.bootstrap.servers", "kafka:9092"
        ).option("topic", "raw-taxi-data").save()

        print(
            f"Progress: processed {min(offset + batch_size, total_count)}/{total_count} records"
        )

    print("Finished ingesting data to Kafka topic 'raw-taxi-data'")
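
The batching loop combines an offset with a limit, and the order of the two matters. A plain-Python sketch using list slices as a stand-in for DataFrame rows (`iter_batches` and `rows` are hypothetical names, not part of the notebook):

```python
def iter_batches(rows, batch_size):
    """Yield consecutive slices of `rows`, each at most `batch_size` long."""
    for offset in range(0, len(rows), batch_size):
        # skip `offset` rows first, then keep the next `batch_size` rows
        yield rows[offset:][:batch_size]


rows = list(range(10))  # stand-in for DataFrame rows
batches = list(iter_batches(rows, batch_size=4))
print(batches)

# Reversed order: cap at `batch_size` rows first, then skip `offset` of them
reversed_order = [rows[:4][offset:] for offset in range(0, len(rows), 4)]
print(reversed_order)
```

With the reversed order, only the first batch contains rows; every later batch is empty because the offset is at least as large as the capped result.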

## Producer

We simulate event production by sending the CSV records to Kafka. The next step is to subscribe to the topic and read the events as a stream.
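
Each event is meant to reach Kafka as a key/value pair of bytes: the key is the `medallion` and the value is the row serialized as JSON, so a consumer has to decode and parse the value before it can work with the fields. A minimal plain-Python sketch of that round trip, with a made-up record and no broker involved:

```python
import json

# Made-up trip record with a subset of the schema's fields
record = {
    "medallion": "MEDALLION_A",
    "passenger_count": 2,
    "trip_distance": 1.7,
}

# Producer side: key is the medallion, value is the JSON-encoded record
key = record["medallion"].encode("utf-8")
value = json.dumps(record).encode("utf-8")

# Consumer side: Kafka hands back raw bytes, which are decoded and parsed
decoded = json.loads(value.decode("utf-8"))
print(key, decoded)
```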

In [6]:
csv_path = "input/trip_data_8.csv"
ingest_csv_to_kafka(csv_path, batch_size=10000)

Ingesting data from input/trip_data_8.csv to Kafka topic 'raw-taxi-data' with batch size 10000
Total records to process: 12597109
Progress: processed 10000/12597109 records
Progress: processed 20000/12597109 records
Progress: processed 30000/12597109 records
Progress: processed 40000/12597109 records
Progress: processed 50000/12597109 records
Progress: processed 60000/12597109 records
Progress: processed 70000/12597109 records
Progress: processed 80000/12597109 records
Progress: processed 90000/12597109 records
Progress: processed 100000/12597109 records
Progress: processed 110000/12597109 records
Progress: processed 120000/12597109 records
Progress: processed 130000/12597109 records
Progress: processed 140000/12597109 records
Progress: processed 150000/12597109 records
Progress: processed 160000/12597109 records
Progress: processed 170000/12597109 records
Progress: processed 180000/12597109 records
Progress: processed 190000/12597109 records
Progress: processed 200000/12597109 records

In [9]:
spark.stop()

## Consumer