# FHVHV Trip Records


In this notebook, we will perform the ETL process for the [FHVHV Trip Records](https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page).

**Note:** To perform the data assessment, we will use the [Data Dictionary – FHVHV Trip Records](https://www.nyc.gov/assets/tlc/downloads/pdf/data_dictionary_trip_records_hvfhs.pdf), provided by the NYC TLC website. In this document, we will check the description of each field, keeping in mind the possible values and value ranges for each one.

## Step 1: Import Dependencies

In [3]:
import numpy as np
import pandas as pd
import datetime
import pyspark.sql.functions as f
from pyspark.sql.types import IntegerType

## Step 2: Load the Data

Since all files share the same schema, we can read them in a single call with:

```spark.read.parquet()```

and pass the folder to it:

In [4]:
df = spark.read.parquet("gs://mobilab-tech-task-bucket/fhvhv")

We will also define the first year and the current year, which bound the time range used in the filters later on.

In [5]:
begin = 2020

now = datetime.datetime.now()
until = now.year

## Step 3: Exploratory Data Analysis

In order to get to know our data, we will perform a basic exploratory analysis of it:

In [6]:
print(f"There are {df.count()} rows in the dataframe")

There are 455471222 rows in the dataframe


In [7]:
df.printSchema()

root
 |-- hvfhs_license_num: string (nullable = true)
 |-- dispatching_base_num: string (nullable = true)
 |-- originating_base_num: string (nullable = true)
 |-- request_datetime: timestamp (nullable = true)
 |-- on_scene_datetime: timestamp (nullable = true)
 |-- pickup_datetime: timestamp (nullable = true)
 |-- dropoff_datetime: timestamp (nullable = true)
 |-- PULocationID: long (nullable = true)
 |-- DOLocationID: long (nullable = true)
 |-- trip_miles: double (nullable = true)
 |-- trip_time: long (nullable = true)
 |-- base_passenger_fare: double (nullable = true)
 |-- tolls: double (nullable = true)
 |-- bcf: double (nullable = true)
 |-- sales_tax: double (nullable = true)
 |-- congestion_surcharge: double (nullable = true)
 |-- airport_fee: double (nullable = true)
 |-- tips: double (nullable = true)
 |-- driver_pay: double (nullable = true)
 |-- shared_request_flag: string (nullable = true)
 |-- shared_match_flag: string (nullable = true)
 |-- access_a_ride_flag: string (nul

In [8]:
df.select('trip_miles','trip_time', 'tips', 'driver_pay', 'base_passenger_fare').describe().show()

+-------+-----------------+------------------+------------------+------------------+-------------------+
|summary|       trip_miles|         trip_time|              tips|        driver_pay|base_passenger_fare|
+-------+-----------------+------------------+------------------+------------------+-------------------+
|  count|        455471222|         455471222|         455471222|         455471222|          455471222|
|   mean|4.819631009109297|1088.5584195350107|0.8036569993439668|16.794573631012756| 21.004440467214035|
| stddev|5.512223049853493| 765.2270550604832|2.5864894824143256|13.804533881142962|  17.45281175653547|
|    min|              0.0|                 0|               0.0|          -2035.92|            -520.11|
|    max|          1310.51|            240764|            1000.0|           4894.62|            8157.74|
+-------+-----------------+------------------+------------------+------------------+-------------------+



In [9]:
df.select('pickup_datetime','dropoff_datetime').show(10)

+-------------------+-------------------+
|    pickup_datetime|   dropoff_datetime|
+-------------------+-------------------+
|2020-01-01 00:45:34|2020-01-01 01:02:20|
|2020-01-01 00:47:50|2020-01-01 00:53:23|
|2020-01-01 00:04:37|2020-01-01 00:21:49|
|2020-01-01 00:26:36|2020-01-01 00:33:00|
|2020-01-01 00:37:49|2020-01-01 00:46:59|
|2020-01-01 00:49:23|2020-01-01 01:07:26|
|2020-01-01 00:21:11|2020-01-01 00:36:58|
|2020-01-01 00:38:28|2020-01-01 00:42:38|
|2020-01-01 00:46:26|2020-01-01 01:09:55|
|2020-01-01 00:15:35|2020-01-01 00:23:21|
+-------------------+-------------------+
only showing top 10 rows



**Issues**

*   `driver_pay` < 0
*   `base_passenger_fare` < 0

**Possible Issues**

* `pickup_datetime` < 2020
* `dropoff_datetime` > 2022
* `dropoff_datetime` - `pickup_datetime` < 0
* `pickup_datetime` - `request_datetime` < 0
* `boolean` flag columns with values out of range

We will handle these issues in the transformation step.
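The checks listed above can be expressed as Spark SQL predicates and counted before any rows are dropped. The following is a minimal sketch and not part of the original notebook: `CHECKS` and `count_violations` are hypothetical names, and the function assumes `df` is the DataFrame loaded in Step 2. The two timestamp predicates approximate the rules applied later in Step 4.

```python
# Hypothetical helper: each entry maps an issue name to a Spark SQL predicate
# that is true for a row violating the corresponding rule.
CHECKS = {
    "negative_driver_pay": "driver_pay < 0",
    "negative_base_fare": "base_passenger_fare < 0",
    "dropoff_not_after_pickup": "dropoff_datetime <= pickup_datetime",
    "pickup_not_after_request": "pickup_datetime <= request_datetime",
}


def count_violations(df, checks=CHECKS):
    """Return {issue_name: number_of_violating_rows} for a Spark DataFrame."""
    return {name: df.filter(predicate).count() for name, predicate in checks.items()}
```

Calling `count_violations(df)` triggers one full scan per check, so on a dataset of this size it may be worth running it on a sample first.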

## Step 4: Data Transformation

Here, we will apply a series of data transformations, such as filtering, type conversion, and row dropping. The focus is on building a more robust dataset.

**Driver Pay Filtering**

In [10]:
df = df.filter('driver_pay >= 0')

**Base Passenger Fare Filtering**

In [11]:
df = df.filter('base_passenger_fare >= 0')

**Boolean Filtering**

In total, we have 5 columns with boolean characteristics. We will filter on the two most general ones.

In [12]:
df = df.filter(df.shared_match_flag.contains('N') | df.shared_match_flag.contains('Y'))

In [13]:
df = df.filter(df.shared_request_flag.contains('N') | df.shared_request_flag.contains('Y'))
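Note that `contains` performs a substring match, so a value such as `NO` would also pass the filters above. If only the exact values `Y` and `N` should be accepted, an exact-match predicate is one alternative. This is a sketch under that assumption; `flag_condition` is a hypothetical helper and not part of the original notebook.

```python
def flag_condition(column, allowed=("Y", "N")):
    """Build a Spark SQL predicate keeping rows whose flag is exactly one of `allowed`."""
    values = ", ".join(f"'{value}'" for value in allowed)
    return f"{column} IN ({values})"


# Example predicate for one flag column. In the notebook it could be applied
# with df.filter(flag_condition("shared_match_flag")).
print(flag_condition("shared_match_flag"))
```

The same helper can be reused for the other flag columns by changing the column name.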

**Time Period**

**Pickup Datetime**

- 1.0 Checking whether `pickup_datetime` is within the range of years previously defined.

In [14]:
df.select('pickup_datetime').sort(f.col("pickup_datetime")).show(5)

+-------------------+
|    pickup_datetime|
+-------------------+
|2020-01-01 00:00:00|
|2020-01-01 00:00:00|
|2020-01-01 00:00:01|
|2020-01-01 00:00:01|
|2020-01-01 00:00:01|
+-------------------+
only showing top 5 rows



- 1.1 Dropping the out-of-range rows

In [15]:
df = df.withColumn("year", f.year(f.col("pickup_datetime")))

In [16]:
df = df.filter(f'year >= {begin} and year <= {until}')
df = df.drop('year')

**Dropoff Datetime**

- 2.0 Checking whether `dropoff_datetime` is within the range of years previously defined.

In [17]:
df.select('dropoff_datetime').sort(f.col("dropoff_datetime").desc()).show(5)

+-------------------+
|   dropoff_datetime|
+-------------------+
|2022-09-01 04:37:02|
|2022-09-01 03:13:58|
|2022-09-01 02:42:08|
|2022-09-01 02:04:35|
|2022-09-01 01:46:23|
+-------------------+
only showing top 5 rows



In this particular case, the data frame already meets the requirement, but we will implement the filter with future use cases in mind.

- 2.1 Dropping the out-of-range rows

In [18]:
df = df.withColumn("year", f.year(f.col("dropoff_datetime")))

In [19]:
df = df.filter(f'year >= {begin} and year <= {until}')
df = df.drop('year')
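The pickup and dropoff year filters follow the same pattern, so the condition can be built once and reused. The snippet below is a minimal sketch, assuming the same `begin` and `until` bounds as in Step 2; `year_range_condition` is a hypothetical helper name. It relies on the Spark SQL `year()` function, which avoids creating and dropping a temporary `year` column.

```python
import datetime


def year_range_condition(column, begin, until):
    """Build a Spark SQL predicate keeping rows whose year lies in [begin, until]."""
    return f"year({column}) BETWEEN {begin} AND {until}"


begin = 2020
until = datetime.datetime.now().year

pickup_condition = year_range_condition("pickup_datetime", begin, until)
dropoff_condition = year_range_condition("dropoff_datetime", begin, until)

# In the notebook these could be applied as:
#   df = df.filter(pickup_condition).filter(dropoff_condition)
print(pickup_condition)
```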

**Timestamps Analysis**

The difference between `dropoff_datetime` and `pickup_datetime` must be greater than zero.


In [20]:
df = df.withColumn('DiffInSeconds', f.unix_timestamp("dropoff_datetime") - f.unix_timestamp('pickup_datetime'))

In [21]:
df = df.filter('DiffInSeconds > 0')
df = df.drop('DiffInSeconds')

The difference between `pickup_datetime` and `request_datetime` must be greater than zero.

In [22]:
df = df.withColumn('DiffInSeconds', f.unix_timestamp("pickup_datetime") - f.unix_timestamp('request_datetime'))

In [23]:
df = df.filter('DiffInSeconds > 0')
df = df.drop('DiffInSeconds')
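To make the rule concrete, the same difference can be computed for the first sample row shown in Step 3 using only the standard library. This illustrates the arithmetic rather than the Spark code itself; `diff_in_seconds` is a hypothetical helper.

```python
from datetime import datetime

TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S"


def diff_in_seconds(start, end):
    """Return the number of whole seconds from `start` to `end`."""
    delta = datetime.strptime(end, TIMESTAMP_FORMAT) - datetime.strptime(start, TIMESTAMP_FORMAT)
    return int(delta.total_seconds())


# First sample row: pickup 00:45:34, dropoff 01:02:20 (16 min 46 s).
trip_seconds = diff_in_seconds("2020-01-01 00:45:34", "2020-01-01 01:02:20")
print(trip_seconds)  # 1006
```

A row is kept only when this value is strictly positive, so trips whose dropoff equals or precedes the pickup are removed.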

### 4.1: Timestamp Requirement

Since the data science team also wants to evaluate the data by hour and by day of the week, we add hour and day-of-week columns to our dataset.

Our date and time values are already of timestamp type, so this is a quick transformation that will save our team time in the future.

**Hours**

In 24-hour format.

In [24]:
df = df.withColumn("pickup_hour", f.hour(f.col("pickup_datetime"))) \
       .withColumn("dropoff_hour", f.hour(f.col("dropoff_datetime")))

In [25]:
df.select('pickup_hour', 'dropoff_hour').show(2)

+-----------+------------+
|pickup_hour|dropoff_hour|
+-----------+------------+
|          0|           1|
|          0|           0|
+-----------+------------+
only showing top 2 rows



**Day of the week**

This transformation will generate a column with the first three letters of the respective day of the week based on the timestamps.

In [26]:
df = df.withColumn("pickup_day", f.date_format('pickup_datetime', 'E')) \
       .withColumn("dropoff_day", f.date_format('dropoff_datetime', 'E'))


In [27]:
df.select('pickup_datetime','pickup_day', 'dropoff_datetime', 'dropoff_day').show(2)

+-------------------+----------+-------------------+-----------+
|    pickup_datetime|pickup_day|   dropoff_datetime|dropoff_day|
+-------------------+----------+-------------------+-----------+
|2020-01-01 00:45:34|       Wed|2020-01-01 01:02:20|        Wed|
|2020-01-01 00:47:50|       Wed|2020-01-01 00:53:23|        Wed|
+-------------------+----------+-------------------+-----------+
only showing top 2 rows
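As a quick sanity check on the derived hour and day columns, the same values can be reproduced for the first sample row with the standard library. This is an illustrative sketch only; `hour_and_day` is a hypothetical helper, and the fixed abbreviation list assumes English day names, as in the output above.

```python
from datetime import datetime

# datetime.weekday() returns 0 for Monday through 6 for Sunday.
DAY_ABBREVIATIONS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]


def hour_and_day(timestamp):
    """Return (hour in 24-hour format, three-letter day name) for a timestamp string."""
    parsed = datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S")
    return parsed.hour, DAY_ABBREVIATIONS[parsed.weekday()]


print(hour_and_day("2020-01-01 00:45:34"))  # (0, 'Wed')
print(hour_and_day("2020-01-01 01:02:20"))  # (1, 'Wed')
```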



## Step 5: Data Schema Check

First, let us take a look at the current data schema:

In [28]:
df.printSchema()

root
 |-- hvfhs_license_num: string (nullable = true)
 |-- dispatching_base_num: string (nullable = true)
 |-- originating_base_num: string (nullable = true)
 |-- request_datetime: timestamp (nullable = true)
 |-- on_scene_datetime: timestamp (nullable = true)
 |-- pickup_datetime: timestamp (nullable = true)
 |-- dropoff_datetime: timestamp (nullable = true)
 |-- PULocationID: long (nullable = true)
 |-- DOLocationID: long (nullable = true)
 |-- trip_miles: double (nullable = true)
 |-- trip_time: long (nullable = true)
 |-- base_passenger_fare: double (nullable = true)
 |-- tolls: double (nullable = true)
 |-- bcf: double (nullable = true)
 |-- sales_tax: double (nullable = true)
 |-- congestion_surcharge: double (nullable = true)
 |-- airport_fee: double (nullable = true)
 |-- tips: double (nullable = true)
 |-- driver_pay: double (nullable = true)
 |-- shared_request_flag: string (nullable = true)
 |-- shared_match_flag: string (nullable = true)
 |-- access_a_ride_flag: string (nul

The original schema above is suitable for representing the data.

## Step 6: Data Transformation Check

Now, we will perform the same exploratory data analysis as before, in order to evaluate the results of the data transformation step.

The goal here is to confirm that we dealt properly with the spotted issues.

In [29]:
df.select('trip_miles','trip_time', 'tips', 'driver_pay', 'base_passenger_fare').describe().show()

+-------+-----------------+------------------+------------------+------------------+-------------------+
|summary|       trip_miles|         trip_time|              tips|        driver_pay|base_passenger_fare|
+-------+-----------------+------------------+------------------+------------------+-------------------+
|  count|        451765224|         451765224|         451765224|         451765224|          451765224|
|   mean|4.790737384438591|1084.6164302721982|0.7939633281734142|16.713675021330857| 20.933809697964115|
| stddev|5.465292537316417| 758.6772573366525|   2.5577681447341|13.683363837334575|  17.29711410951984|
|    min|              0.0|                 0|               0.0|               0.0|                0.0|
|    max|           738.95|            147918|            1000.0|           4894.62|            8157.74|
+-------+-----------------+------------------+------------------+------------------+-------------------+



In [30]:
df.select('pickup_datetime','dropoff_datetime').show(5)

+-------------------+-------------------+
|    pickup_datetime|   dropoff_datetime|
+-------------------+-------------------+
|2020-01-01 00:45:34|2020-01-01 01:02:20|
|2020-01-01 00:47:50|2020-01-01 00:53:23|
|2020-01-01 00:04:37|2020-01-01 00:21:49|
|2020-01-01 00:26:36|2020-01-01 00:33:00|
|2020-01-01 00:37:49|2020-01-01 00:46:59|
+-------------------+-------------------+
only showing top 5 rows



In [31]:
print(f"There are {df.count()} rows in the transformed dataframe")

There are 451765224 rows in the transformed dataframe
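Comparing the two row counts reported in this notebook shows how much data the transformation step removed. The short calculation below uses only those two figures; the variable names are illustrative.

```python
rows_before = 455_471_222  # count reported in Step 3
rows_after = 451_765_224   # count reported after the transformations

dropped = rows_before - rows_after
dropped_pct = 100 * dropped / rows_before

print(f"Dropped {dropped} rows ({dropped_pct:.2f}% of the original data)")
# Dropped 3705998 rows (0.81% of the original data)
```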


## Step 7: Outputs

According to the pipeline requirements defined by our data science team, the output datasets must be provided in:

1. **Column-oriented format**
2. **Row-oriented format**
3. **Delta lake format**

Since we are working on the Google Cloud (GC) platform, we will use GC resources to meet these requirements:

1. **Column-oriented format**

    Export to Google Cloud Storage as `.parquet` files, then load the files into a BigQuery table. BigQuery storage is column-oriented. More information is available in the [Overview of BigQuery storage](https://cloud.google.com/bigquery/docs/storage_overview).


2. **Row-oriented format**

    Export to Google Cloud Storage as `.csv` files. CSV is a common row-oriented format, and these files can later be loaded into a SQL solution (e.g., MySQL, Postgres, or Google Cloud SQL).


3. **Delta lake format**

    Not available due to library issues.

**1. Column-oriented format**

Since the data does not include a vehicle ID, we select the pickup day as the partition column. This column has far fewer distinct values than the others, so partitioning on it produces a manageable number of files.

In [None]:
df.write.mode('overwrite').format('parquet').partitionBy('pickup_day').save('gs://mobilab-tech-task-bucket/outputs/fhvhv/parquet')
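Because `partitionBy` writes one `column=value` directory per distinct value, partitioning on `pickup_day` should yield at most seven directories under the output path. The sketch below lists the expected layout, assuming all seven days occur in the data; `partition_paths` is a hypothetical helper, and the number of part files inside each directory depends on the Spark job.

```python
DAY_ABBREVIATIONS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]


def partition_paths(base_path, column, values):
    """Return the Hive-style partition directories expected under `base_path`."""
    return [f"{base_path}/{column}={value}" for value in values]


paths = partition_paths(
    "gs://mobilab-tech-task-bucket/outputs/fhvhv/parquet",
    "pickup_day",
    DAY_ABBREVIATIONS,
)

print(len(paths))  # 7
print(paths[2])    # gs://mobilab-tech-task-bucket/outputs/fhvhv/parquet/pickup_day=Wed
```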

**2. Row-oriented format**

In [None]:
df.write.mode('overwrite').format('csv').partitionBy('pickup_day').save('gs://mobilab-tech-task-bucket/outputs/fhvhv/csv')