## Problem Statement:

The Taxi and Limousine Commission (TLC) of New York City collects trip record data from licensed taxis and for-hire vehicles (FHVs) and provides it to the public. The data includes details such as pick-up and drop-off times, locations, passenger counts, and payment information for each trip. As a data engineer, your task is to build a batch data processing pipeline using PySpark to process and analyze this data to gain insights into taxi and FHV trips in New York City.

### Goals:

Data ingestion: Download the trip record data from the NYC TLC website and ingest it into the pipeline for further processing.

Data cleaning and validation: Perform data quality checks and validation to ensure that the data is clean and consistent. Identify and remove duplicates, null values, and other data quality issues that may impact downstream analysis.

Data transformation: Transform the raw trip record data into a format that is optimized for analysis. This may include aggregating the data by time periods, geographical regions, and other factors of interest.

Data analysis: Use PySpark to perform statistical analysis, data exploration, and data visualization to gain insights into taxi and FHV trips in New York City. This may include identifying popular pick-up and drop-off locations, peak trip times, and other patterns and trends in the data.

Data storage: Store the processed and analyzed data in a suitable data storage system such as Hadoop Distributed File System (HDFS) or Apache Cassandra for future use.

Automation and scheduling: Automate the data processing pipeline using tools such as Apache Airflow or Apache Oozie. Schedule the pipeline to run at regular intervals to ensure that the data is up to date and accurate.

---

The overall goal of the project is to build a batch data processing pipeline using PySpark to extract insights from the NYC TLC trip record data. The pipeline should be scalable, efficient, and automated to enable easy data processing and analysis.

### Import Libraries and Initiate Spark Session

In [1]:
import configparser

import pandas as pd

from pyspark.sql import SparkSession
import pyspark.sql.functions as F
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, LongType, TimestampType

In [2]:
spark = SparkSession.builder.appName("nyc_batch_pipeline").getOrCreate()

In [3]:
spark

### Data Ingestion

In [4]:
# Parsing from config file
conf = configparser.ConfigParser()
conf.read("config")
data_source_path = conf.get("DATASOURCE PATH", "PATH")

# Read the data source with PySpark
df = spark.read.parquet(data_source_path)
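
The `config` file itself is not shown in the notebook. As a minimal sketch, assuming an INI-style layout with a `DATASOURCE PATH` section and a `PATH` key (the names the cell above looks up), it might resemble the string below. The path value and the `sample_` variable names are placeholders for illustration, not the real location or the notebook's own variables.

```python
import configparser

# Hypothetical contents of the `config` file; the path is a placeholder.
SAMPLE_CONFIG = """
[DATASOURCE PATH]
PATH = /data/nyc_tlc/yellow_tripdata.parquet
"""

sample_conf = configparser.ConfigParser()
# read_string parses INI text directly, which avoids touching the file system here.
sample_conf.read_string(SAMPLE_CONFIG)

# Same lookup shape as in the notebook: section name first, then key.
sample_path = sample_conf.get("DATASOURCE PATH", "PATH")
print(sample_path)
```

Note that `ConfigParser.read` silently skips files it cannot find, so a missing `config` file only surfaces later as a `NoSectionError` from `conf.get`.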

In [5]:
print(f"No of Partitions: {df.rdd.getNumPartitions()}")

In [6]:
# Schema
df.printSchema()

### Data Cleaning And Validation

#### Missing Values

In [7]:
no_of_row = df.count()
print(f"No of Rows: {no_of_row}")
print(f"No of cols: {len(df.columns)}")

In [8]:
# Count null (or NaN) values per column
null_count = df.select([F.count(F.when(F.isnan(c) | F.col(c).isNull(), c)).alias(c) for c in df.columns])
null_df = null_count.pandas_api().transpose()

In [9]:
null_col_list = null_df[null_df[0] > 0].index.to_list()

In [10]:
# The three columns below have a very large number of null values:
# Rate_Code and mta_tax are entirely null,
# and store_and_forward has only a few non-null rows
null_col_list = ["Rate_Code", "store_and_forward", "mta_tax"]
df.select([F.count(F.when(F.isnan(c) | F.col(c).isNull(), c)).alias(c) for c in null_col_list]).show()

In [11]:
# Check the number of distinct values in the columns that contain nulls
df.select([F.countDistinct(F.col(c)).alias(c) for c in null_col_list]).show()

In [12]:
# Value counts for store_and_forward; the non-null rows are very few compared to the total row count
df.select(F.col("store_and_forward")).groupBy('store_and_forward').count().show()

In [13]:
# Nulls make up most of these columns relative to the total record count, so drop them
df_not_null = df.drop(*null_col_list)
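
The list of columns to drop is hard-coded above. As a sketch of how the same decision could be derived from the null counts, the helper below picks every column whose null fraction exceeds a threshold. The function name, the 0.5 default threshold, and the sample counts are all assumptions made for illustration.

```python
def columns_above_null_ratio(null_counts, total_rows, threshold=0.5):
    """Return the column names whose null fraction exceeds `threshold`."""
    if total_rows <= 0:
        raise ValueError("total_rows must be positive")
    return [
        name
        for name, nulls in null_counts.items()
        if nulls / total_rows > threshold
    ]


# Made-up counts, purely for illustration (not taken from the dataset).
example_counts = {
    "Rate_Code": 1000,
    "mta_tax": 1000,
    "Fare_Amt": 0,
    "store_and_forward": 990,
}
to_drop = columns_above_null_ratio(example_counts, total_rows=1000)
print(to_drop)
```

In the notebook, the input dictionary could plausibly come from `null_count.first().asDict()`, with `no_of_row` as the total.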

In [14]:
df_not_null.printSchema()

#### DateTime

In [15]:
# The datetime columns are currently strings; inspect their format
date_col = ["Trip_Pickup_DateTime", "Trip_Dropoff_DateTime"]
df_not_null.select(date_col).show(3)

In [16]:
# The format appears to be yyyy-MM-dd HH:mm:ss (Spark pattern), i.e. %Y-%m-%d %H:%M:%S in strptime terms
spark.conf.set("spark.sql.legacy.timeParserPolicy", "LEGACY")
date_dict_map = {date_c: F.to_timestamp(F.col(date_c)) for date_c in date_col}
df_date_parsed = df_not_null.withColumns(date_dict_map)
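
Before parsing the whole column, the pattern can be sanity-checked on a single value in plain Python. This is a small sketch using a made-up sample string in the shape the pickup and dropoff values appear to have. The strptime pattern `%Y-%m-%d %H:%M:%S` corresponds to the Spark pattern `yyyy-MM-dd HH:mm:ss`.

```python
from datetime import datetime

# Made-up sample value; not taken from the dataset.
sample = "2009-01-04 02:52:00"

# strptime directives: %Y year, %m month, %d day, %H hour, %M minute, %S second.
parsed = datetime.strptime(sample, "%Y-%m-%d %H:%M:%S")
print(parsed.isoformat())
```

A value that does not match the pattern raises `ValueError` here, whereas `F.to_timestamp` yields null for unparseable strings when ANSI mode is off, so it is worth re-counting nulls in the datetime columns after the conversion.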

In [17]:
df_date_parsed.printSchema()

#### Distinct Values

In [18]:
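# Use the StringType imported from pyspark.sql.types rather than reaching it through the functions module
string_cols = [f.name for f in df_date_parsed.schema.fields if isinstance(f.dataType, StringType)]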
distinct_count = df_date_parsed.select([F.countDistinct(F.col(c)).alias(c) for c in string_cols])
distinct_count_pd = distinct_count.pandas_api().transpose()
dist_cols = distinct_count_pd[distinct_count_pd[0]<50].index.tolist()

In [19]:
df_date_parsed.select([F.countDistinct(F.col(c)).alias(c) for c in dist_cols]).show()

In [20]:
# View Distinct Values
for c in dist_cols:
    df_date_parsed.select(c).distinct().show()

#### Duplicates

In [21]:
# Drop fully duplicated rows
df_dup_drop = df_date_parsed.dropDuplicates()

In [22]:
# Row count after dropping duplicates
cnt_after_drop = df_dup_drop.count()

# Number of duplicates dropped
no_of_row - cnt_after_drop

In [23]:
df_cleaned = df_dup_drop

### Data Validation

#### Schema Validation

In [24]:
# Defining Schema for Validation
validate_schema = StructType(
    [
        StructField('vendor_name', StringType(), True), 
        StructField('Trip_Pickup_DateTime', TimestampType(), True), 
        StructField('Trip_Dropoff_DateTime', TimestampType(), True), 
        StructField('Passenger_Count', LongType(), True), 
        StructField('Trip_Distance', DoubleType(), True), 
        StructField('Start_Lon', DoubleType(), True), 
        StructField('Start_Lat', DoubleType(), True), 
        StructField('End_Lon', DoubleType(), True), 
        StructField('End_Lat', DoubleType(), True), 
        StructField('Payment_Type', StringType(), True), 
        StructField('Fare_Amt', DoubleType(), True), 
        StructField('surcharge', DoubleType(), True), 
        StructField('Tip_Amt', DoubleType(), True), 
        StructField('Tolls_Amt', DoubleType(), True),
        StructField('Total_Amt', DoubleType(), True)
    ]
)

In [25]:
# Validate Schema
assert validate_schema == df_cleaned.schema, "schema is not valid"
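
When the assertion fails, it only reports that the schemas differ. As a sketch of a more informative check, the helper below compares two lists of `(column, type)` pairs, the shape returned by `DataFrame.dtypes`, and lists each mismatch. The function name and the sample pairs are hypothetical. Unlike `StructType` equality, this comparison ignores column order and nullability.

```python
def diff_dtypes(expected, actual):
    """Compare two lists of (column, type) pairs and describe every mismatch."""
    expected_map, actual_map = dict(expected), dict(actual)
    problems = []
    for name, dtype in expected_map.items():
        if name not in actual_map:
            problems.append(f"missing column: {name}")
        elif actual_map[name] != dtype:
            problems.append(f"{name}: expected {dtype}, got {actual_map[name]}")
    for name in actual_map:
        if name not in expected_map:
            problems.append(f"unexpected column: {name}")
    return problems


# Made-up pairs in the shape of DataFrame.dtypes (type names are Spark's simple strings).
expected = [("vendor_name", "string"), ("Passenger_Count", "bigint")]
actual = [("vendor_name", "string"), ("Passenger_Count", "double"), ("extra", "string")]
problems = diff_dtypes(expected, actual)
print(problems)
```

With the notebook's objects, the expected pairs could be built as `[(f.name, f.dataType.simpleString()) for f in validate_schema.fields]` and compared against `df_cleaned.dtypes`.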

#### Null Value Validation

In [26]:
# This single-pass method consumes more memory, so it is commented out; it is better to check the columns one by one
# is_null_values = df_cleaned.select([F.count(F.when(F.col(c).isNull(), c)).alias(c) for c in df_cleaned.columns]).collect()[0].asDict()
# [col for col in is_null_values if is_null_values[col] > 0]

In [27]:
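null_cols = {}
for c in df_cleaned.columns:
    cnt = df_cleaned.select(c).where(F.col(c).isNull()).count()
    if cnt > 0:
        null_cols[c] = cnt
        print(c, cnt)
# A for/else branch would run after every loop that ends without break,
# so report the all-clear only when no column contained nulls
if not null_cols:
    print("There are no null columns")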

#### Data Range Validation