# Validation Queries 

In this notebook we run a series of validation queries on the Green Taxi Trip Records and the FHVHV Trip Records. The goal is a quick data analysis that checks whether the results are plausible.

## 0. Import Dependencies

In [1]:
import numpy as np
import pandas as pd
import datetime
import pyspark.sql.functions as f
from pyspark.sql.types import IntegerType

## 1. Green Taxi Trip Records

### 1.1. Load the Dataset

In [98]:
df = spark.read.parquet("gs://mobilab-tech-task-bucket/outputs/green_trip/parquet")

In [3]:
df.printSchema()

root
 |-- VendorID: integer (nullable = true)
 |-- lpep_pickup_datetime: timestamp (nullable = true)
 |-- lpep_dropoff_datetime: timestamp (nullable = true)
 |-- store_and_fwd_flag: string (nullable = true)
 |-- RatecodeID: double (nullable = true)
 |-- PULocationID: long (nullable = true)
 |-- DOLocationID: long (nullable = true)
 |-- trip_distance: double (nullable = true)
 |-- fare_amount: double (nullable = true)
 |-- extra: double (nullable = true)
 |-- mta_tax: double (nullable = true)
 |-- tip_amount: double (nullable = true)
 |-- tolls_amount: double (nullable = true)
 |-- ehail_fee: integer (nullable = true)
 |-- improvement_surcharge: double (nullable = true)
 |-- total_amount: double (nullable = true)
 |-- payment_type: integer (nullable = true)
 |-- trip_type: integer (nullable = true)
 |-- congestion_surcharge: double (nullable = true)
 |-- lpep_pickup_hour: integer (nullable = true)
 |-- lpep_dropoff_hour: integer (nullable = true)
 |-- pickup_day: string (nullable = true)

### 1.2. Run the Query

### The average distance driven by green taxis per hour

In [66]:
df1 = df.select('lpep_pickup_datetime', 'lpep_dropoff_datetime', 'trip_distance')

In [67]:
df1 = df1.withColumn('DiffInSeconds', f.unix_timestamp("lpep_dropoff_datetime") - f.unix_timestamp('lpep_pickup_datetime'))

In [69]:
time = df1.agg({'DiffInSeconds': 'sum' })
time.show()

+------------------+
|sum(DiffInSeconds)|
+------------------+
|        2908161620|
+------------------+



In [70]:
dist = df1.agg({'trip_distance': 'sum' })
dist.show()

+------------------+
|sum(trip_distance)|
+------------------+
| 8316941.189999112|
+------------------+



In [71]:
time_p = time.toPandas()
dist_p = dist.toPandas()

In [74]:
hours = (time_p['sum(DiffInSeconds)'][0])/3600
miles = dist_p['sum(trip_distance)'][0]

In [75]:
avg = miles/hours

print(f'The average distance driven by a green taxi per hour is: {avg} miles')

The average distance driven by a green taxi per hour is: 10.295503550451507 miles
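
As a quick cross-check, the figure above can be reproduced from the two printed totals alone. The sketch below only reuses the sums shown in the aggregation outputs; the variable names are illustrative and not part of the notebook. Dividing total distance by total time gives a duration-weighted average over all trips, which is generally not the same as the mean of per-trip speeds.

```python
# Totals copied from the two aggregation outputs above.
total_seconds = 2908161620        # sum(DiffInSeconds)
total_miles = 8316941.189999112   # sum(trip_distance)

# Convert the total trip time to hours, then take the ratio of the totals.
total_hours = total_seconds / 3600
avg_miles_per_hour = total_miles / total_hours

print(f"{avg_miles_per_hour:.2f} miles per hour")
```

A value of roughly 10 miles per hour is plausible for city traffic, which is the kind of feasibility statement this notebook is after.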


### The day of the week in 2020 and in 2021 with the lowest number of single-passenger trips

**Starting with 2020**

In [82]:
df2020 = df.withColumn("year", f.year(f.col("lpep_pickup_datetime")))
df2020 = df2020.filter('year == 2020')

In [84]:
df2020 = df2020.select('pickup_day', 'passenger_count')
df2020.show(5)

+----------+---------------+
|pickup_day|passenger_count|
+----------+---------------+
|       Wed|              1|
|       Wed|              1|
|       Wed|              1|
|       Wed|              1|
|       Wed|              1|
+----------+---------------+
only showing top 5 rows



In [85]:
df2020 = df2020.filter('passenger_count == 1')

In [93]:
min2020 = df2020.groupBy("pickup_day").count()
min2020.show()

+----------+------+
|pickup_day| count|
+----------+------+
|       Sun|116254|
|       Mon|139845|
|       Thu|163505|
|       Sat|149566|
|       Wed|158047|
|       Fri|171478|
|       Tue|145553|
+----------+------+



The day of the week in 2020 with the lowest number of single-passenger trips was: **Sunday**

**Looking at 2021**

In [95]:
df2021 = df.withColumn("year", f.year(f.col("lpep_pickup_datetime")))
df2021 = df2021.filter('year == 2021')

df2021 = df2021.select('pickup_day', 'passenger_count')
df2021.show(5)

+----------+---------------+
|pickup_day|passenger_count|
+----------+---------------+
|       Thu|              1|
|       Fri|              1|
|       Fri|              1|
|       Fri|              1|
|       Fri|              1|
+----------+---------------+
only showing top 5 rows



In [96]:
df2021 = df2021.filter('passenger_count == 1')
min2021 = df2021.groupBy("pickup_day").count()
min2021.show()

+----------+-----+
|pickup_day|count|
+----------+-----+
|       Sun|58076|
|       Mon|80004|
|       Thu|88904|
|       Sat|76811|
|       Wed|88446|
|       Fri|90543|
|       Tue|83504|
+----------+-----+



The day of the week in 2021 with the lowest number of single-passenger trips was: **Sunday**
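
The minimum can also be picked programmatically rather than read off the tables. A minimal sketch, using the counts printed above as plain dictionaries (the names `counts_2020`, `counts_2021`, and `least_busy_day` are illustrative and not part of the notebook):

```python
# Single-passenger trip counts per pickup day, copied from the outputs above.
counts_2020 = {"Sun": 116254, "Mon": 139845, "Thu": 163505, "Sat": 149566,
               "Wed": 158047, "Fri": 171478, "Tue": 145553}
counts_2021 = {"Sun": 58076, "Mon": 80004, "Thu": 88904, "Sat": 76811,
               "Wed": 88446, "Fri": 90543, "Tue": 83504}


def least_busy_day(counts):
    """Return the day whose trip count is smallest."""
    return min(counts, key=counts.get)


for year, counts in ((2020, counts_2020), (2021, counts_2021)):
    print(year, least_busy_day(counts))
```

On the Spark side, a comparable result could presumably be obtained by ordering the grouped counts in ascending order and taking the first row, for example `min2020.orderBy("count").first()`.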

### The top 3 busiest hours in the output dataset

In [99]:
hours = df.select('lpep_pickup_hour')

In [100]:
hours.groupBy("lpep_pickup_hour").count().show(3)

+----------------+------+
|lpep_pickup_hour| count|
+----------------+------+
|              12|133704|
|              22| 76205|
|               1| 30588|
+----------------+------+
only showing top 3 rows



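The query above does not sort by count, so `show(3)` returns three arbitrary groups rather than the three busiest hours. To confirm the ranking, the counts need to be ordered in descending order first, for example with `orderBy(f.desc("count"))`. The three pickup hours shown for the Green Taxi Trip data are:

- 12 (12 PM)
- 22 (10 PM)
- 01 (01 AM)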
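
Ranking hours by trip count requires an explicit descending sort before the first three rows are taken. The sketch below illustrates the idea in plain Python with `collections.Counter` on a small made-up sample; the values in `sample_hours` are invented for illustration and are not taken from the dataset. On the Spark DataFrame, the analogous step would be to add `orderBy(f.desc("count"))` between `count()` and `show(3)`.

```python
from collections import Counter

# Made-up pickup hours, for illustration only (not real trip data).
sample_hours = [18, 17, 18, 9, 17, 18, 12, 9, 17, 18, 1]

# Count trips per hour, then take the three hours with the highest counts.
# most_common(n) returns (value, count) pairs sorted by count, descending.
trips_per_hour = Counter(sample_hours)
top3 = trips_per_hour.most_common(3)

for hour, count in top3:
    print(f"hour {hour:02d}: {count} trips")
```

Without the sort, the first three groups depend on how the data happens to be partitioned, so they say nothing about which hours are busiest.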

## 2. FHVHV Trip Records

Now we look at the FHVHV dataset.

### 2.1. Load the Dataset

In [103]:
fhvhv = spark.read.parquet("gs://mobilab-tech-task-bucket/outputs/fhvhv/parquet")

### 2.2. Run the Query

### The top 3 busiest hours in the output dataset

In [105]:
fhvhv.printSchema()

root
 |-- hvfhs_license_num: string (nullable = true)
 |-- dispatching_base_num: string (nullable = true)
 |-- originating_base_num: string (nullable = true)
 |-- request_datetime: timestamp (nullable = true)
 |-- on_scene_datetime: timestamp (nullable = true)
 |-- pickup_datetime: timestamp (nullable = true)
 |-- dropoff_datetime: timestamp (nullable = true)
 |-- PULocationID: long (nullable = true)
 |-- DOLocationID: long (nullable = true)
 |-- trip_miles: double (nullable = true)
 |-- trip_time: long (nullable = true)
 |-- base_passenger_fare: double (nullable = true)
 |-- tolls: double (nullable = true)
 |-- bcf: double (nullable = true)
 |-- sales_tax: double (nullable = true)
 |-- congestion_surcharge: double (nullable = true)
 |-- airport_fee: double (nullable = true)
 |-- tips: double (nullable = true)
 |-- driver_pay: double (nullable = true)
 |-- shared_request_flag: string (nullable = true)
 |-- shared_match_flag: string (nullable = true)
 |-- access_a_ride_flag: string (nullable = true)

In [107]:
hours_fhvhv = fhvhv.select('pickup_hour')

In [108]:
hours_fhvhv.groupBy("pickup_hour").count().show(3)

+-----------+--------+
|pickup_hour|   count|
+-----------+--------+
|         12|20283921|
|         22|23436764|
|          1|11716423|
+-----------+--------+
only showing top 3 rows



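The query above does not sort by count, so `show(3)` returns three arbitrary groups rather than the three busiest hours; in the output, hour 22 has a higher count than hour 12, which is listed first. To confirm the ranking, the counts need to be ordered in descending order first, for example with `orderBy(f.desc("count"))`. The three pickup hours shown for the FHVHV trip data are:

- 12 (12 PM)
- 22 (10 PM)
- 01 (01 AM)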
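
Converting 24-hour values to 12-hour labels is easy to get wrong around noon and midnight: hour 12 is 12 PM and hour 0 is 12 AM. A small helper makes the mapping explicit (`hour_label` is a hypothetical name used only for this illustration):

```python
def hour_label(hour):
    """Format an hour in 0..23 as a 12-hour clock label, e.g. 22 -> '10 PM'."""
    if not 0 <= hour <= 23:
        raise ValueError(f"hour must be in 0..23, got {hour}")
    suffix = "AM" if hour < 12 else "PM"
    hour12 = hour % 12 or 12  # 0 and 12 both map to 12 on a 12-hour clock
    return f"{hour12:02d} {suffix}"


for h in (12, 22, 1):
    print(h, "->", hour_label(h))
```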