## Problem Statement

The Taxi and Limousine Commission (TLC) of New York City collects trip record data from licensed taxis and for-hire vehicles (FHVs) and provides it to the public. The data includes details such as pick-up and drop-off times, locations, passenger counts, and payment information for each trip. As a data engineer, your task is to build a batch data processing pipeline using PySpark to process and analyze this data to gain insights into taxi and FHV trips in New York City.

### Goals

Data ingestion: Download the trip record data from the NYC TLC website and ingest it into the pipeline for further processing.
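The download step could look roughly like the sketch below. The `BASE_URL` value is an assumption about where the monthly Parquet files are hosted and should be checked against the TLC trip record page; `trip_file_name` and `download_trip_file` are hypothetical helper names, not part of any library.

```python
import urllib.request
from pathlib import Path

# Assumed host for the monthly Parquet files; verify on the TLC trip record page.
BASE_URL = "https://d37ci6vzurychx.cloudfront.net/trip-data"


def trip_file_name(taxi_type: str, year: int, month: int) -> str:
    """Build a monthly file name such as yellow_tripdata_2009-01.parquet."""
    return f"{taxi_type}_tripdata_{year}-{month:02d}.parquet"


def download_trip_file(taxi_type: str, year: int, month: int,
                       dest_dir: str = "DataSource") -> Path:
    """Download one monthly file into dest_dir, skipping files already present."""
    name = trip_file_name(taxi_type, year, month)
    target = Path(dest_dir) / name
    if not target.exists():
        target.parent.mkdir(parents=True, exist_ok=True)
        urllib.request.urlretrieve(f"{BASE_URL}/{name}", str(target))
    return target
```

Calling `download_trip_file("yellow", 2009, 1)` would place the file at `DataSource/yellow_tripdata_2009-01.parquet`, the path read in the ingestion step of this notebook.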

Data cleaning and validation: Perform data quality checks and validation to ensure that the data is clean and consistent. Identify and remove duplicates, null values, and other data quality issues that may impact downstream analysis.

Data transformation: Transform the raw trip record data into a format that is optimized for analysis. This may include aggregating the data by time periods, geographical regions, and other factors of interest.

Data analysis: Use PySpark to perform statistical analysis, data exploration, and data visualization to gain insights into taxi and FHV trips in New York City. This may include identifying popular pick-up and drop-off locations, peak trip times, and other patterns and trends in the data.

Data storage: Store the processed and analyzed data in a suitable storage system, such as the Hadoop Distributed File System (HDFS) or Apache Cassandra, for future use.

Automation and scheduling: Automate the data processing pipeline using tools such as Apache Airflow or Apache Oozie. Schedule the pipeline to run at regular intervals to ensure that the data is up to date and accurate.

---

The overall goal of the project is to build a batch data processing pipeline using PySpark to extract insights from the NYC TLC trip record data. The pipeline should be scalable, efficient, and automated to enable easy data processing and analysis.

### Import Libraries and Initiate Spark Session

In [1]:
from pyspark.sql import SparkSession

In [2]:
spark = SparkSession.builder.appName("nyc_batch_pipeline").getOrCreate()


In [3]:
spark

### Data Ingestion

In [4]:
df = spark.read.parquet("DataSource/yellow_tripdata_2009-01.parquet")

In [6]:
df.rdd.getNumPartitions()

4

In [8]:
df.printSchema()

root
 |-- vendor_name: string (nullable = true)
 |-- Trip_Pickup_DateTime: string (nullable = true)
 |-- Trip_Dropoff_DateTime: string (nullable = true)
 |-- Passenger_Count: long (nullable = true)
 |-- Trip_Distance: double (nullable = true)
 |-- Start_Lon: double (nullable = true)
 |-- Start_Lat: double (nullable = true)
 |-- Rate_Code: double (nullable = true)
 |-- store_and_forward: double (nullable = true)
 |-- End_Lon: double (nullable = true)
 |-- End_Lat: double (nullable = true)
 |-- Payment_Type: string (nullable = true)
 |-- Fare_Amt: double (nullable = true)
 |-- surcharge: double (nullable = true)
 |-- mta_tax: double (nullable = true)
 |-- Tip_Amt: double (nullable = true)
 |-- Tolls_Amt: double (nullable = true)
 |-- Total_Amt: double (nullable = true)

