# Yellow Taxi Trip Records


In this notebook, we will perform the ETL process for the [Yellow Taxi Trip Records](https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page).

**Note:** To assess the data, we will use the [Data Dictionary – Yellow Taxi Trip Records](https://www.nyc.gov/assets/tlc/downloads/pdf/data_dictionary_trip_records_yellow.pdf), provided by the NYC TLC website. It describes each field, including its possible values and value ranges.

## Step 1: Import Dependencies

In [1]:
import numpy as np
import pandas as pd
import datetime
import pyspark.sql.functions as f
from pyspark.sql.types import IntegerType

## Step 2: Load the Data

Since all files share the same schema, we can read the whole folder at once by passing its path to `spark.read.parquet`:

In [2]:
df = spark.read.parquet("gs://mobilab-tech-task-bucket/yellow-taxi")

We will also define the first year and the current year, which bound the analysis period:

In [3]:
begin = 2020

now = datetime.datetime.now()
until = now.year

## Step 3: Exploratory Data Analysis

In order to get to know our data, we will perform a basic exploratory analysis of it:

In [4]:
print(f"There are {df.count()} rows in the dataframe")

There are 81698054 rows in the dataframe


In [5]:
df.printSchema()

root
 |-- VendorID: long (nullable = true)
 |-- tpep_pickup_datetime: timestamp (nullable = true)
 |-- tpep_dropoff_datetime: timestamp (nullable = true)
 |-- passenger_count: double (nullable = true)
 |-- trip_distance: double (nullable = true)
 |-- RatecodeID: double (nullable = true)
 |-- store_and_fwd_flag: string (nullable = true)
 |-- PULocationID: long (nullable = true)
 |-- DOLocationID: long (nullable = true)
 |-- payment_type: long (nullable = true)
 |-- fare_amount: double (nullable = true)
 |-- extra: double (nullable = true)
 |-- mta_tax: double (nullable = true)
 |-- tip_amount: double (nullable = true)
 |-- tolls_amount: double (nullable = true)
 |-- improvement_surcharge: double (nullable = true)
 |-- total_amount: double (nullable = true)
 |-- congestion_surcharge: double (nullable = true)
 |-- airport_fee: integer (nullable = true)



In [6]:
df.select('passenger_count','trip_distance', 'total_amount', 'payment_type', 'tip_amount').describe().show()

+-------+------------------+-----------------+-----------------+------------------+------------------+
|summary|   passenger_count|    trip_distance|     total_amount|      payment_type|        tip_amount|
+-------+------------------+-----------------+-----------------+------------------+------------------+
|  count|          78539123|         81698054|         81698054|          81698054|          81698054|
|   mean|1.4321764453621413|5.571054262810674|19.74366564854705|1.2040914830113334|2.3662261661738015|
| stddev|1.0405594659825625|573.4568144618119|226.0265670676297|0.5231703283378963| 2.902432948777313|
|    min|               0.0|           -30.62|          -2567.8|                 0|           -493.22|
|    max|             112.0|        357192.65|        1000003.8|                 5|           1400.16|
+-------+------------------+-----------------+-----------------+------------------+------------------+



In [7]:
df.select('tpep_pickup_datetime','tpep_dropoff_datetime').show(10)

+--------------------+---------------------+
|tpep_pickup_datetime|tpep_dropoff_datetime|
+--------------------+---------------------+
| 2020-01-01 00:28:15|  2020-01-01 00:33:03|
| 2020-01-01 00:35:39|  2020-01-01 00:43:04|
| 2020-01-01 00:47:41|  2020-01-01 00:53:52|
| 2020-01-01 00:55:23|  2020-01-01 01:00:14|
| 2020-01-01 00:01:58|  2020-01-01 00:04:16|
| 2020-01-01 00:09:44|  2020-01-01 00:10:37|
| 2020-01-01 00:39:25|  2020-01-01 00:39:29|
| 2019-12-18 15:27:49|  2019-12-18 15:28:59|
| 2019-12-18 15:30:35|  2019-12-18 15:31:35|
| 2020-01-01 00:29:01|  2020-01-01 00:40:28|
+--------------------+---------------------+
only showing top 10 rows



**Issues**

*   `trip_distance` < 0
*   `total_amount` < 0
*   `tip_amount` < 0
*   `passenger_count` > 10
*   `tpep_pickup_datetime` before 2020
*   `payment_type` out of range (a value of `0` appears)


**Possible Issues**

*   `passenger_count` < 0
*   `tpep_dropoff_datetime` after 2022
*   `tpep_dropoff_datetime` - `tpep_pickup_datetime` <= 0

We will handle these issues in the transformation step.
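
The checks listed above can be summarized as a single row-level predicate. The following is a minimal plain-Python sketch, not the PySpark code used in this notebook; the function name `find_issues`, the dictionary-based record, and the `begin`/`until` defaults are assumptions made for illustration.

```python
from datetime import datetime

def find_issues(row, begin=2020, until=2022):
    """Return the names of the checks that a single trip record fails."""
    issues = []
    if row["trip_distance"] < 0:
        issues.append("trip_distance < 0")
    if row["total_amount"] < 0:
        issues.append("total_amount < 0")
    if row["tip_amount"] < 0:
        issues.append("tip_amount < 0")
    # A missing passenger_count is treated as an issue here (an assumption)
    if row["passenger_count"] is None or not 0 <= row["passenger_count"] <= 10:
        issues.append("passenger_count out of range")
    if not begin <= row["tpep_pickup_datetime"].year <= until:
        issues.append("pickup year out of range")
    if not begin <= row["tpep_dropoff_datetime"].year <= until:
        issues.append("dropoff year out of range")
    if row["tpep_dropoff_datetime"] <= row["tpep_pickup_datetime"]:
        issues.append("non-positive trip duration")
    return issues

# Hypothetical record combining suspicious values seen in the summary above
sample = {
    "trip_distance": -30.62,
    "total_amount": 12.5,
    "tip_amount": 0.0,
    "passenger_count": 112,
    "tpep_pickup_datetime": datetime(2019, 12, 18, 15, 27, 49),
    "tpep_dropoff_datetime": datetime(2019, 12, 18, 15, 28, 59),
}
print(find_issues(sample))
```

A record that passes every check yields an empty list, which makes the function easy to use for spot checks on a few collected rows.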

## Step 4: Data Transformation

Here, we will apply a series of data transformations, such as filtering, type conversion, and row dropping, with the goal of building a more robust dataset.

**Trip Distance Filtering**

In [8]:
df = df.filter('trip_distance >= 0')

In [9]:
df.filter('trip_distance < 0').count()

0

**Passenger Count Filtering**

In [10]:
df = df.filter('passenger_count <= 10')
df.agg({'passenger_count': 'max' }).show()

+--------------------+
|max(passenger_count)|
+--------------------+
|                 9.0|
+--------------------+



**Total Amount Filtering**

In [11]:
df = df.filter('total_amount >= 0')
df.agg({'total_amount': 'min' }).show()

+-----------------+
|min(total_amount)|
+-----------------+
|              0.0|
+-----------------+



**Tip Amount Filtering**

In [12]:
df = df.filter('tip_amount >= 0')
df.agg({'tip_amount': 'min' }).show()

+---------------+
|min(tip_amount)|
+---------------+
|            0.0|
+---------------+
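
The individual range filters above could also be collected in one place and applied in a single pass. This is a minimal sketch, assuming the hypothetical names `RANGE_RULES` and `build_filter`; the resulting string is a SQL expression of the kind that `df.filter` accepts.

```python
# Hypothetical rule table: one SQL condition per validated column
RANGE_RULES = {
    "trip_distance": "trip_distance >= 0",
    "passenger_count": "passenger_count <= 10",
    "total_amount": "total_amount >= 0",
    "tip_amount": "tip_amount >= 0",
}

def build_filter(rules):
    """Join the individual conditions into a single SQL filter expression."""
    return " AND ".join(f"({condition})" for condition in rules.values())

expression = build_filter(RANGE_RULES)
print(expression)
# With the notebook's DataFrame this could be applied as: df = df.filter(expression)
```

Keeping the rules in a dictionary also makes it straightforward to count how many rows each rule removes before combining them.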



**Payment Type Analysis**

Checking whether the payment type is consistent with the total amount. A trip with a total amount of zero should have `payment_type` 3 (No charge).

In [13]:
df.select('total_amount','payment_type').filter('total_amount == 0 and payment_type != 3').show(10)

+------------+------------+
|total_amount|payment_type|
+------------+------------+
|         0.0|           2|
|         0.0|           2|
|         0.0|           2|
|         0.0|           2|
|         0.0|           2|
|         0.0|           2|
|         0.0|           2|
|         0.0|           2|
|         0.0|           2|
|         0.0|           2|
+------------+------------+
only showing top 10 rows



In [14]:
df.filter('total_amount == 0 and payment_type != 3').count()

16360

In [15]:
df = df.withColumn("payment_type", f.when(df["total_amount"] == 0, 3).otherwise(df["payment_type"]))

df.filter('total_amount == 0 and payment_type != 3').count()

0

Replacing the value `0` in `payment_type` with `NaN`, since it is out of range.

In [16]:
df = df.withColumn("payment_type", f.when(df["payment_type"] == 0, np.nan).otherwise(df["payment_type"]))
df.filter('payment_type == 0').count()

0
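
The two payment-type rules above can be sketched in plain Python for a single record. The code set 1 to 6 is assumed from the data dictionary linked at the top of this notebook; the function name `normalize_payment_type` is hypothetical, and `None` stands in for the missing-value marker.

```python
# Codes 1-6 as assumed from the data dictionary; 3 means "No charge"
VALID_PAYMENT_TYPES = {1, 2, 3, 4, 5, 6}
NO_CHARGE = 3

def normalize_payment_type(total_amount, payment_type):
    """Mirror the two rules above for a single record."""
    if total_amount == 0:
        return NO_CHARGE  # a zero total is recorded as "No charge"
    if payment_type not in VALID_PAYMENT_TYPES:
        return None  # an out-of-range code becomes a missing value
    return payment_type

print(normalize_payment_type(0.0, 2))   # 3
print(normalize_payment_type(15.3, 0))  # None
print(normalize_payment_type(15.3, 1))  # 1
```

The order of the two checks matches the order of the cells above: the zero-total rule is applied first, then the out-of-range rule.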

**Time Period**

**Pickup Datetime**

- 1.0 Checking whether `tpep_pickup_datetime` is within the range of years previously defined.

In [17]:
df.select('tpep_pickup_datetime').sort(f.col("tpep_pickup_datetime")).show(5)

+--------------------+
|tpep_pickup_datetime|
+--------------------+
| 2001-01-01 00:03:14|
| 2001-01-01 00:27:45|
| 2001-01-01 01:02:18|
| 2001-01-01 01:23:51|
| 2001-01-01 01:52:48|
+--------------------+
only showing top 5 rows



- 1.1 - Dropping the out of range rows

In [18]:
df = df.withColumn("year", f.year(f.col("tpep_pickup_datetime")))

In [19]:
df = df.filter(f'year >= {begin} and year <= {until}')
df = df.drop('year')

In [20]:
df.select('tpep_pickup_datetime').sort(f.col("tpep_pickup_datetime")).show(5)

+--------------------+
|tpep_pickup_datetime|
+--------------------+
| 2020-01-01 00:00:00|
| 2020-01-01 00:00:00|
| 2020-01-01 00:00:00|
| 2020-01-01 00:00:00|
| 2020-01-01 00:00:00|
+--------------------+
only showing top 5 rows



**Dropoff Datetime**

- 2.0 Checking whether `tpep_dropoff_datetime` is within the range of years previously defined.

In [21]:
df.select('tpep_dropoff_datetime').sort(f.col("tpep_dropoff_datetime").desc()).show(5)

+---------------------+
|tpep_dropoff_datetime|
+---------------------+
|  2022-09-02 10:02:33|
|  2022-09-01 23:25:26|
|  2022-09-01 23:11:08|
|  2022-09-01 23:06:04|
|  2022-09-01 22:56:06|
+---------------------+
only showing top 5 rows



In this particular case, the data frame already meets the requirement, but we will still apply the filter with future use cases in mind.

- 2.1 - Dropping the out of range rows

In [22]:
df = df.withColumn("year", f.year(f.col("tpep_dropoff_datetime")))

In [23]:
df = df.filter(f'year >= {begin} and year <= {until}')
df = df.drop('year')

In [24]:
df.select('tpep_dropoff_datetime').sort(f.col("tpep_dropoff_datetime").desc()).show(5)

+---------------------+
|tpep_dropoff_datetime|
+---------------------+
|  2022-09-02 10:02:33|
|  2022-09-01 23:25:26|
|  2022-09-01 23:11:08|
|  2022-09-01 23:06:04|
|  2022-09-01 22:56:06|
+---------------------+
only showing top 5 rows
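
Because the same year-range filter is applied to both timestamp columns, it could be factored into a small helper. This is a minimal sketch, assuming a hypothetical helper `year_range_condition` that builds a Spark SQL condition string; the years 2020 and 2022 stand in for the `begin` and `until` values defined in Step 2.

```python
def year_range_condition(column, begin, until):
    """Build a SQL condition keeping rows whose year lies in [begin, until]."""
    if begin > until:
        raise ValueError("begin must not be greater than until")
    return f"year({column}) BETWEEN {begin} AND {until}"

for column in ("tpep_pickup_datetime", "tpep_dropoff_datetime"):
    condition = year_range_condition(column, 2020, 2022)
    print(condition)
    # With the notebook's DataFrame: df = df.filter(condition)
```

Filtering on the expression directly avoids adding and then dropping a temporary `year` column.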



**Timestamps Analysis**

The difference between `tpep_dropoff_datetime` and `tpep_pickup_datetime` must be greater than zero.


In [25]:
df = df.withColumn('DiffInSeconds', f.unix_timestamp("tpep_dropoff_datetime") - f.unix_timestamp('tpep_pickup_datetime'))

In [26]:
df = df.filter('DiffInSeconds > 0')
df = df.drop('DiffInSeconds')
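
The duration rule can be checked on individual timestamps in plain Python. This is a minimal sketch; `trip_duration_seconds` is a hypothetical helper, and the sample values are taken from the first row shown in Step 3.

```python
from datetime import datetime

def trip_duration_seconds(pickup, dropoff):
    """Return the trip duration in whole seconds (negative if the timestamps are swapped)."""
    return int((dropoff - pickup).total_seconds())

pickup = datetime(2020, 1, 1, 0, 28, 15)
dropoff = datetime(2020, 1, 1, 0, 33, 3)

duration = trip_duration_seconds(pickup, dropoff)
print(duration)      # 288
print(duration > 0)  # True: this record would be kept
```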

### 4.1: Timestamp Requirement

Since the data science team also wants to evaluate the data by hour and by day of the week, we will add extra columns for both to our dataset.

Our date and time values are already of timestamp type, so this is a quick transformation that will save our team time in the future.

**Hours**

In 24-hour format.

In [27]:
df = df.withColumn("tpep_pickup_hour", f.hour(f.col("tpep_pickup_datetime"))) \
       .withColumn("tpep_dropoff_hour", f.hour(f.col("tpep_dropoff_datetime")))

In [28]:
df.select('tpep_pickup_hour', 'tpep_dropoff_hour').show(5)

+----------------+-----------------+
|tpep_pickup_hour|tpep_dropoff_hour|
+----------------+-----------------+
|               0|                0|
|               0|                0|
|               0|                0|
|               0|                1|
|               0|                0|
+----------------+-----------------+
only showing top 5 rows



**Day of the week**

This transformation will generate a column with the first three letters of the respective day of the week based on the timestamps.

In [29]:
df = df.withColumn("pickup_day", f.date_format('tpep_pickup_datetime', 'E')) \
       .withColumn("dropoff_day", f.date_format('tpep_dropoff_datetime', 'E'))

In [30]:
df.select('tpep_pickup_datetime','pickup_day', 'tpep_dropoff_datetime', 'dropoff_day').show(5)

+--------------------+----------+---------------------+-----------+
|tpep_pickup_datetime|pickup_day|tpep_dropoff_datetime|dropoff_day|
+--------------------+----------+---------------------+-----------+
| 2020-01-01 00:28:15|       Wed|  2020-01-01 00:33:03|        Wed|
| 2020-01-01 00:35:39|       Wed|  2020-01-01 00:43:04|        Wed|
| 2020-01-01 00:47:41|       Wed|  2020-01-01 00:53:52|        Wed|
| 2020-01-01 00:55:23|       Wed|  2020-01-01 01:00:14|        Wed|
| 2020-01-01 00:01:58|       Wed|  2020-01-01 00:04:16|        Wed|
+--------------------+----------+---------------------+-----------+
only showing top 5 rows
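
To sanity-check the derived columns, the same values can be computed for a single timestamp in plain Python. This is a minimal sketch; `hour_and_day` is a hypothetical helper, and the weekday abbreviations are hard-coded to avoid locale differences.

```python
from datetime import datetime

def hour_and_day(ts):
    """Return the hour (0-23) and the abbreviated English weekday name."""
    days = ("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")
    return ts.hour, days[ts.weekday()]  # weekday() counts from Monday = 0

print(hour_and_day(datetime(2020, 1, 1, 0, 55, 23)))  # (0, 'Wed')
print(hour_and_day(datetime(2020, 1, 1, 1, 0, 14)))   # (1, 'Wed')
```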



## Step 5: Data Schema Check

First, let us take a look at the current data schema:

In [31]:
df.printSchema()

root
 |-- VendorID: long (nullable = true)
 |-- tpep_pickup_datetime: timestamp (nullable = true)
 |-- tpep_dropoff_datetime: timestamp (nullable = true)
 |-- passenger_count: double (nullable = true)
 |-- trip_distance: double (nullable = true)
 |-- RatecodeID: double (nullable = true)
 |-- store_and_fwd_flag: string (nullable = true)
 |-- PULocationID: long (nullable = true)
 |-- DOLocationID: long (nullable = true)
 |-- payment_type: double (nullable = true)
 |-- fare_amount: double (nullable = true)
 |-- extra: double (nullable = true)
 |-- mta_tax: double (nullable = true)
 |-- tip_amount: double (nullable = true)
 |-- tolls_amount: double (nullable = true)
 |-- improvement_surcharge: double (nullable = true)
 |-- total_amount: double (nullable = true)
 |-- congestion_surcharge: double (nullable = true)
 |-- airport_fee: integer (nullable = true)
 |-- tpep_pickup_hour: integer (nullable = true)
 |-- tpep_dropoff_hour: integer (nullable = true)
 |-- pickup_day: string (nullable = t

Some of the data types above are unsuitable, for example `passenger_count` stored as a `double` (float) although it only holds whole numbers.

In the code cell below, we define more appropriate data types for the final dataset schema.

In [32]:
df = df \
    .withColumn("VendorID", df["VendorID"].cast(IntegerType())) \
    .withColumn("passenger_count", df["passenger_count"].cast(IntegerType())) \
    .withColumn("payment_type", df["payment_type"].cast(IntegerType()))

df.printSchema()

root
 |-- VendorID: integer (nullable = true)
 |-- tpep_pickup_datetime: timestamp (nullable = true)
 |-- tpep_dropoff_datetime: timestamp (nullable = true)
 |-- passenger_count: integer (nullable = true)
 |-- trip_distance: double (nullable = true)
 |-- RatecodeID: double (nullable = true)
 |-- store_and_fwd_flag: string (nullable = true)
 |-- PULocationID: long (nullable = true)
 |-- DOLocationID: long (nullable = true)
 |-- payment_type: integer (nullable = true)
 |-- fare_amount: double (nullable = true)
 |-- extra: double (nullable = true)
 |-- mta_tax: double (nullable = true)
 |-- tip_amount: double (nullable = true)
 |-- tolls_amount: double (nullable = true)
 |-- improvement_surcharge: double (nullable = true)
 |-- total_amount: double (nullable = true)
 |-- congestion_surcharge: double (nullable = true)
 |-- airport_fee: integer (nullable = true)
 |-- tpep_pickup_hour: integer (nullable = true)
 |-- tpep_dropoff_hour: integer (nullable = true)
 |-- pickup_day: string (nullabl

## Step 6: Data Transformation Check

Now, we will repeat the earlier exploratory data analysis in order to evaluate the results of the data transformation step.

The goal here is to confirm that we dealt properly with the spotted issues.

In [33]:
df.select('passenger_count','trip_distance', 'total_amount', 'payment_type', 'tip_amount').describe().show()

+-------+------------------+------------------+------------------+-------------------+-----------------+
|summary|   passenger_count|     trip_distance|      total_amount|       payment_type|       tip_amount|
+-------+------------------+------------------+------------------+-------------------+-----------------+
|  count|          78103927|          78103927|          78103927|           78103927|         78103927|
|   mean|1.4326720601385383|3.0998538611251334|19.463213593572913|  1.242082360340217|2.374415468767291|
| stddev|1.0414274073772392|44.631973382627606| 231.1103016694798|0.44808221913761276|2.902552025221473|
|    min|                 0|               0.0|               0.0|                  1|              0.0|
|    max|                 9|          184340.8|         1000003.8|                  5|          1400.16|
+-------+------------------+------------------+------------------+-------------------+-----------------+



In [34]:
df.select('tpep_pickup_datetime','tpep_dropoff_datetime').show(5)

+--------------------+---------------------+
|tpep_pickup_datetime|tpep_dropoff_datetime|
+--------------------+---------------------+
| 2020-01-01 00:28:15|  2020-01-01 00:33:03|
| 2020-01-01 00:35:39|  2020-01-01 00:43:04|
| 2020-01-01 00:47:41|  2020-01-01 00:53:52|
| 2020-01-01 00:55:23|  2020-01-01 01:00:14|
| 2020-01-01 00:01:58|  2020-01-01 00:04:16|
+--------------------+---------------------+
only showing top 5 rows



In [35]:
print(f"There are {df.count()} rows in the dataframe")

There are 78103927 rows in the dataframe


## Step 7: Outputs

According to the pipeline requirements defined by our data science team, the output datasets are required in:

1. **Column-oriented format**
2. **Row-oriented format**
3. **Delta Lake format**

Since we are working on the Google Cloud (GC) platform, we intended to use GC resources to meet these requirements:

1. **Column-oriented format**

     **Not available due to versioning issues.**


2. **Row-oriented format**

     **Not available due to versioning issues.**

For the two formats above, PySpark was not able to generate the files due to a versioning error. This error only appears with the Yellow Taxi Trip data and the FHV Trip data. The Green and FHVHV data, even with the same function, do not have this problem.

3. **Delta Lake format**

     **Not available due to versioning issues.**
