# FHV Trip Records


In this notebook, we will perform the ETL process for the [FHV Trip Records](https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page).

**Note:** To assess the data, we will use the [Data Dictionary – FHV Trip Records](https://www.nyc.gov/assets/tlc/downloads/pdf/data_dictionary_trip_records_green.pdf), provided by the NYC TLC website. We will use it to check the description of each field, keeping in mind the possible values and value ranges of each one.

## Step 1: Import Dependencies

In [1]:
import numpy as np
import pandas as pd
import datetime
import pyspark.sql.functions as f
from pyspark.sql.types import IntegerType

## Step 2: Load the Data

Since all files share the same schema, we can read them in a single call to:

```spark.read.parquet()```

passing the folder that contains them:

In [2]:
df = spark.read.parquet("gs://mobilab-tech-task-bucket/fhv")

We will also define the first year and the current year, which bound the time period used in the later analysis:

In [3]:
begin = 2020

now = datetime.datetime.now()
until = now.year

## Step 3: Exploratory Data Analysis

To get to know the data, we will perform a basic exploratory analysis:

In [4]:
print(f"There are {df.count()} rows in the dataframe")

There are 39535386 rows in the dataframe


In [5]:
df.printSchema()

root
 |-- dispatching_base_num: string (nullable = true)
 |-- pickup_datetime: timestamp (nullable = true)
 |-- dropOff_datetime: timestamp (nullable = true)
 |-- PUlocationID: double (nullable = true)
 |-- DOlocationID: double (nullable = true)
 |-- SR_Flag: integer (nullable = true)
 |-- Affiliated_base_number: string (nullable = true)



In [6]:
df.select('dispatching_base_num', 'SR_Flag').describe().show()

+-------+--------------------+-------+
|summary|dispatching_base_num|SR_Flag|
+-------+--------------------+-------+
|  count|            39535386|      0|
|   mean|                null|   null|
| stddev|                null|   null|
|    min|              B00001|   null|
|    max|              b03340|   null|
+-------+--------------------+-------+



In [7]:
df.select('pickup_datetime','dropOff_datetime').show(10)

+-------------------+-------------------+
|    pickup_datetime|   dropOff_datetime|
+-------------------+-------------------+
|2020-01-01 00:30:00|2020-01-01 01:44:00|
|2020-01-01 00:30:00|2020-01-01 00:47:00|
|2020-01-01 00:48:00|2020-01-01 01:19:00|
|2020-01-01 00:34:00|2020-01-01 00:43:00|
|2020-01-01 00:23:00|2020-01-01 00:32:00|
|2020-01-01 00:52:00|2020-01-01 01:01:00|
|2020-01-01 00:20:30|2020-01-01 00:45:52|
|2020-01-01 00:08:15|2020-01-01 00:12:03|
|2020-01-01 00:40:30|2020-01-01 01:06:23|
|2020-01-01 00:53:04|2020-01-01 01:19:13|
+-------------------+-------------------+
only showing top 10 rows



**Issues**

*   The `SR_Flag` column contains only `Null` values (its non-null count is 0)

**Possible Issues**

*   `pickup_datetime` before 2020
*   `dropOff_datetime` after the current year
*   `dropOff_datetime` - `pickup_datetime` <= 0

We will handle these issues in the transformation step.
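
The null check behind the first issue can be reproduced on a handful of rows. What follows is a minimal plain-Python sketch rather than the notebook's Spark code: `count_nulls` and the sample rows are hypothetical, and they only mirror the idea that the `count` row of `describe()` reports non-null values per column.

```python
def count_nulls(rows, columns):
    """Return {column: number of rows whose value for that column is None}."""
    return {c: sum(1 for r in rows if r.get(c) is None) for c in columns}


# Hypothetical sample shaped like two FHV columns (values are made up).
sample_rows = [
    {"dispatching_base_num": "B00001", "SR_Flag": None},
    {"dispatching_base_num": "B00009", "SR_Flag": None},
    {"dispatching_base_num": "B00013", "SR_Flag": None},
]

null_counts = count_nulls(sample_rows, ["dispatching_base_num", "SR_Flag"])
print(null_counts)
```

A column whose null count equals the number of rows, as `SR_Flag` does here, carries no information for downstream analysis.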

## Step 4: Data Transformation

Here, we will apply a series of data transformations, such as filtering, type conversion, and row dropping, with the goal of building a more robust dataset.

**Time Period**

**Pickup Datetime**

- 1.0 Checking whether `pickup_datetime` is within the range of years previously defined.

In [8]:
df.select('pickup_datetime').sort(f.col("pickup_datetime")).show(5)

+-------------------+
|    pickup_datetime|
+-------------------+
|2020-01-01 00:00:00|
|2020-01-01 00:00:00|
|2020-01-01 00:00:02|
|2020-01-01 00:00:07|
|2020-01-01 00:00:07|
+-------------------+
only showing top 5 rows



- 1.1 - Dropping the out-of-range rows

In [9]:
df = df.withColumn("year", f.year(f.col("pickup_datetime")))

In [10]:
df = df.filter(f'year >= {begin} and year <= {until}')
df = df.drop('year')

In [11]:
df.select('pickup_datetime').sort(f.col("pickup_datetime")).show(5)

+-------------------+
|    pickup_datetime|
+-------------------+
|2020-01-01 00:00:00|
|2020-01-01 00:00:00|
|2020-01-01 00:00:02|
|2020-01-01 00:00:07|
|2020-01-01 00:00:07|
+-------------------+
only showing top 5 rows
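
The year-range rule applied by the filter can be checked in isolation. The sketch below is plain Python, not Spark: `in_year_range` is a hypothetical helper, the bounds are assumed for the example, and `None` values are treated as out of range on the assumption that Spark drops rows whose filter condition evaluates to null.

```python
from datetime import datetime


def in_year_range(ts, begin, until):
    """True when ts is set and begin <= ts.year <= until (both inclusive)."""
    return ts is not None and begin <= ts.year <= until


begin, until = 2020, 2022  # assumed bounds for this example

timestamps = [
    datetime(2019, 12, 31, 23, 59),  # before the first year: dropped
    datetime(2020, 1, 1, 0, 0),      # on the lower boundary: kept
    datetime(2050, 2, 5, 11, 30),    # far in the future: dropped
    None,                            # missing timestamp: dropped
]

kept = [ts for ts in timestamps if in_year_range(ts, begin, until)]
print(kept)
```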



**Dropoff Datetime**

- 2.0 Checking whether `dropOff_datetime` is within the range of years previously defined.

In [12]:
df.select('dropOff_datetime').sort(f.col("dropOff_datetime").desc()).show(5)

+-------------------+
|   dropOff_datetime|
+-------------------+
|2050-02-05 11:30:00|
|2031-05-01 16:00:04|
|2028-07-28 17:50:00|
|2027-07-01 06:00:00|
|2024-12-01 14:30:18|
+-------------------+
only showing top 5 rows



In this case, the data frame does not meet the requirement: several drop-off timestamps lie far in the future (as late as 2050), so we filter them out.

- 2.1 - Dropping the out-of-range rows

In [13]:
df = df.withColumn("year", f.year(f.col("dropOff_datetime")))

In [14]:
df = df.filter(f'year >= {begin} and year <= {until}')
df = df.drop('year')

In [15]:
df.select('dropOff_datetime').sort(f.col("dropOff_datetime").desc()).show(5)

+-------------------+
|   dropOff_datetime|
+-------------------+
|2022-12-30 16:00:00|
|2022-12-28 19:50:00|
|2022-12-28 18:45:00|
|2022-12-28 14:30:00|
|2022-12-27 16:30:00|
+-------------------+
only showing top 5 rows



**Timestamps Analysis**

The difference between `dropOff_datetime` and `pickup_datetime` must be greater than zero.


In [16]:
df = df.withColumn('DiffInSeconds', f.unix_timestamp("dropOff_datetime") - f.unix_timestamp('pickup_datetime'))

In [17]:
df = df.filter('DiffInSeconds > 0')
df = df.drop('DiffInSeconds')
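
The duration rule can be illustrated on single trips. This is a minimal plain-Python sketch of the same check, not the Spark implementation: `trip_seconds` and `has_positive_duration` are hypothetical helpers, and missing timestamps are treated as invalid by assumption.

```python
from datetime import datetime


def trip_seconds(pickup, dropoff):
    """Whole seconds between pickup and drop-off, or None if either is missing."""
    if pickup is None or dropoff is None:
        return None
    return int((dropoff - pickup).total_seconds())


def has_positive_duration(pickup, dropoff):
    """True only when both timestamps exist and drop-off is after pickup."""
    seconds = trip_seconds(pickup, dropoff)
    return seconds is not None and seconds > 0


# First row of the sample shown in Step 3: 00:30:00 to 01:44:00.
first_trip = trip_seconds(datetime(2020, 1, 1, 0, 30), datetime(2020, 1, 1, 1, 44))
print(first_trip)  # 74 minutes expressed in seconds
```

Trips with a zero or negative duration fail the check, which matches the strict `> 0` comparison used in the filter.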

### 4.1: Timestamp Requirement

Since the data science team also wants to evaluate the data by hour and by day of the week, we add extra columns to the dataset for both.

Our date and time values are already of timestamp type, so this is a quick transformation that will save our team time in the future.

**Hours**

In 24-hour format.

In [18]:
df = df.withColumn("pickup_hour", f.hour(f.col("pickup_datetime"))) \
       .withColumn("dropOff_hour", f.hour(f.col("dropOff_datetime")))

**Day of the week**

This transformation generates, for each timestamp, a column with the first three letters of the corresponding day of the week.

In [19]:
df = df.withColumn("pickup_day", f.date_format('pickup_datetime', 'E')) \
       .withColumn("dropoff_day", f.date_format('dropOff_datetime', 'E'))
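
To make the expected values of these columns concrete, here is a minimal plain-Python sketch that derives the same two features from a single timestamp. `time_features` and `DAY_ABBREVIATIONS` are hypothetical names, and the English three-letter abbreviations are an assumption about what the `'E'` pattern produces in this environment.

```python
from datetime import datetime

# datetime.weekday() returns 0 for Monday through 6 for Sunday.
DAY_ABBREVIATIONS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]


def time_features(ts):
    """Return (hour in 24-hour format, three-letter day abbreviation)."""
    return ts.hour, DAY_ABBREVIATIONS[ts.weekday()]


# First pickup timestamp of the sample shown in Step 3.
pickup_hour, pickup_day = time_features(datetime(2020, 1, 1, 0, 30))
print(pickup_hour, pickup_day)
```

A fixed lookup list is used instead of `strftime('%a')` so that the result does not depend on the locale of the machine running the check.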


## Step 5: Data Schema Check

First, let us take a look at the current data schema:

In [20]:
df.printSchema()

root
 |-- dispatching_base_num: string (nullable = true)
 |-- pickup_datetime: timestamp (nullable = true)
 |-- dropOff_datetime: timestamp (nullable = true)
 |-- PUlocationID: double (nullable = true)
 |-- DOlocationID: double (nullable = true)
 |-- SR_Flag: integer (nullable = true)
 |-- Affiliated_base_number: string (nullable = true)
 |-- pickup_hour: integer (nullable = true)
 |-- dropOff_hour: integer (nullable = true)
 |-- pickup_day: string (nullable = true)
 |-- dropoff_day: string (nullable = true)



The schema is well defined.

## Step 6: Data Transformation Check

Now, we will repeat the exploratory data analysis from before, in order to evaluate the results of the data transformation step.

The goal here is to confirm that we properly dealt with the issues we spotted.

In [21]:
df.select('dispatching_base_num', 'SR_Flag').describe().show()

+-------+--------------------+-------+
|summary|dispatching_base_num|SR_Flag|
+-------+--------------------+-------+
|  count|            39535379|      0|
|   mean|                null|   null|
| stddev|                null|   null|
|    min|              B00001|   null|
|    max|              b03340|   null|
+-------+--------------------+-------+



In [22]:
print(f"There are {df.count()} rows in the transformed data frame")

There are 39535379 rows in the transformed data frame
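
Comparing the two row counts shows how much data the transformation step removed. The arithmetic below uses the counts printed in this notebook. The variable names are illustrative only.

```python
rows_before = 39_535_386  # count printed in Step 3
rows_after = 39_535_379   # count printed after the transformations

dropped = rows_before - rows_after
dropped_share = dropped / rows_before

print(f"Dropped {dropped} rows ({dropped_share:.6%} of the input)")
```

The filters removed only a handful of rows, so the cleaned dataset remains essentially the same size as the input.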


## Step 7: Outputs

As per the pipeline requirements defined by our data science team, the output datasets are required in:

1. **Column-oriented format**
2. **Row-oriented format**
3. **Delta lake format**

Since we are working on the Google Cloud (GC) platform, we will use GC resources to meet these requirements:

1. **Column-oriented format**

     **Not available due to versioning issues.**


2. **Row-oriented format**

     **Not available due to versioning issues.**

For the two formats above, PySpark was unable to generate the files because of a versioning error. This error appears only for the Yellow Taxi Trip data and the FHV Trip data. The Green and FHVHV data do not have this problem, even though they use the same function.


3. **Delta lake format**

     **Not available due to versioning issues.**