# **NYC High Volume FHV Data Analysis - Companies: Uber, Lyft, Juno and Via**

Exploratory analysis of NYC High Volume For-Hire Vehicle (FHV) trip data (e.g., Uber, Lyft, Juno, and Via), looking into customer segmentation and factors that may affect tipping behavior across the five boroughs.

--- 

### **Imports & Setup**

In [3]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import sys
sys.path.append('../scripts') # imports custom scripts from the scripts directory
from data_utils import remove_outliers

# setting display options for pandas
pd.set_option('display.max_columns', None) # displays all columns in the dataframe
pd.set_option('display.max_rows', 100) # sets the max number of rows to 100

### **Loading Data**

In [4]:
# look up tables
zone = pd.read_csv('../data/taxi_zone_lookup.csv') 
vendor = pd.read_csv('../data/taxi_vendor_lookup.csv')
payment = pd.read_csv('../data/payment_lookup.csv')
hvfhs = pd.read_csv('../data/hvfhs_lookup.csv')

# fhvhv data
df = pd.read_parquet('../data/fhvhv_tripdata_2025-05.parquet')

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21091193 entries, 0 to 21091192
Data columns (total 25 columns):
 #   Column                Dtype         
---  ------                -----         
 0   hvfhs_license_num     object        
 1   dispatching_base_num  object        
 2   originating_base_num  object        
 3   request_datetime      datetime64[us]
 4   on_scene_datetime     datetime64[us]
 5   pickup_datetime       datetime64[us]
 6   dropoff_datetime      datetime64[us]
 7   PULocationID          int32         
 8   DOLocationID          int32         
 9   trip_miles            float64       
 10  trip_time             int64         
 11  base_passenger_fare   float64       
 12  tolls                 float64       
 13  bcf                   float64       
 14  sales_tax             float64       
 15  congestion_surcharge  float64       
 16  airport_fee           float64       
 17  tips                  float64       
 18  driver_pay            float64       
 19

In [11]:
df.describe()

| | request_datetime | on_scene_datetime | pickup_datetime | dropoff_datetime | PULocationID | DOLocationID | trip_miles | trip_time | base_passenger_fare | tolls | bcf | sales_tax | congestion_surcharge | airport_fee | tips | driver_pay | cbd_congestion_fee |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 21091193 | 21091193 | 21091193 | 21091193 | 21091190.0 | 21091190.0 | 21091190.0 | 21091190.0 | 21091190.0 | 21091190.0 | 21091190.0 | 21091190.0 | 21091190.0 | 21091190.0 | 21091190.0 | 21091190.0 | 21091190.0 |
| mean | 2025-05-16 09:17:08.694439 | 2025-05-16 09:21:11.248639 | 2025-05-16 09:22:05.263951 | 2025-05-16 09:42:49.797092 | 138.4395 | 142.5403 | 5.097999 | 1244.524 | 28.09792 | 1.116688 | 0.6993011 | 2.32869 | 0.9926219 | 0.2244738 | 1.240683 | 21.6988 | 0.5148894 |
| min | 2025-04-30 23:36:17 | 2025-04-30 23:34:18 | 2025-05-01 00:00:00 | 2025-05-01 00:03:01 | 1.0 | 1.0 | 0.0 | 0.0 | -50.96 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | -27.12 | 0.0 |
| 25% | 2025-05-08 17:49:55 | 2025-05-08 17:54:24 | 2025-05-08 17:55:17 | 2025-05-08 18:20:32 | 75.0 | 76.0 | 1.55 | 608.0 | 12.64 | 0.0 | 0.31 | 1.04 | 0.0 | 0.0 | 0.0 | 9.49 | 0.0 |
| 50% | 2025-05-16 09:10:28 | 2025-05-16 09:13:27 | 2025-05-16 09:14:27 | 2025-05-16 09:34:32 | 138.0 | 141.0 | 3.02 | 996.0 | 19.99 | 0.0 | 0.49 | 1.68 | 0.0 | 0.0 | 0.0 | 16.12 | 0.0 |
| 75% | 2025-05-23 22:19:23 | 2025-05-23 22:23:19 | 2025-05-23 22:24:13 | 2025-05-23 22:43:07 | 209.0 | 216.0 | 6.39 | 1598.0 | 33.5 | 0.0 | 0.83 | 2.86 | 2.75 | 0.0 | 0.0 | 27.38 | 1.5 |
| max | 2025-06-01 00:16:47 | 2025-06-01 00:02:41 | 2025-05-31 23:59:59 | 2025-06-01 03:19:25 | 265.0 | 265.0 | 276.18 | 35036.0 | 1691.79 | 93.49 | 42.28 | 158.45 | 5.5 | 7.5 | 199.4 | 954.21 | 3.0 |
| std | | | | | 74.82419 | 78.0486 | 5.975552 | 926.3445 | 26.04773 | 3.650176 | 0.6689067 | 2.130185 | 1.31659 | 0.7185373 | 3.78856 | 18.9842 | 0.7121967 |


In [7]:
df.head()

| | hvfhs_license_num | dispatching_base_num | originating_base_num | request_datetime | on_scene_datetime | pickup_datetime | dropoff_datetime | PULocationID | DOLocationID | trip_miles | trip_time | base_passenger_fare | tolls | bcf | sales_tax | congestion_surcharge | airport_fee | tips | driver_pay | shared_request_flag | shared_match_flag | access_a_ride_flag | wav_request_flag | wav_match_flag | cbd_congestion_fee |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | HV0003 | B03404 | B03404 | 2025-04-30 23:59:52 | 2025-05-01 00:03:02 | 2025-05-01 00:03:25 | 2025-05-01 00:12:22 | 160 | 82 | 2.51 | 537 | 12.54 | 0.0 | 0.31 | 1.13 | 0.0 | 0.0 | 0.0 | 10.47 | N | N | N | N | N | 0.0 |
| 1 | HV0003 | B03404 | B03404 | 2025-05-01 00:05:01 | 2025-05-01 00:06:30 | 2025-05-01 00:07:04 | 2025-05-01 00:17:17 | 82 | 129 | 1.51 | 614 | 11.14 | 0.0 | 0.21 | 0.76 | 0.0 | 0.0 | 0.0 | 8.34 | N | N | N | N | N | 0.0 |
| 2 | HV0003 | B03404 | B03404 | 2025-05-01 00:16:00 | 2025-05-01 00:19:06 | 2025-05-01 00:20:32 | 2025-05-01 00:39:01 | 129 | 226 | 5.01 | 1109 | 18.82 | 0.0 | 0.47 | 1.66 | 0.0 | 0.0 | 0.0 | 18.28 | N | N | N | N | N | 0.0 |
| 3 | HV0005 | B03406 | | 2025-05-01 00:44:56 | 2025-05-01 00:49:38 | 2025-05-01 00:50:38 | 2025-05-01 01:00:23 | 37 | 112 | 1.29 | 585 | 10.73 | 0.0 | 0.27 | 0.95 | 0.0 | 0.0 | 0.0 | 8.26 | N | N | N | N | Y | 0.0 |
| 4 | HV0003 | B03404 | B03404 | 2025-05-01 00:04:08 | 2025-05-01 00:04:39 | 2025-05-01 00:06:24 | 2025-05-01 00:31:44 | 97 | 87 | 5.04 | 1520 | 34.21 | 0.0 | 0.82 | 3.04 | 2.75 | 0.0 | 0.0 | 22.44 | N | N | N | N | N | 1.5 |


---
### **Data Assumptions**
* Each row represents a single trip record
* No explicit primary key column
* A row could likely be identified uniquely by a combination of columns (e.g., pickup, dropoff, vendor)
  
*Note: for this exploratory analysis, a primary key is not critical, as much of the analysis is focused on aggregation patterns rather than uniquely identifying rows* 
> If necessary, a `ride_id` column derived from the row index could serve as a primary key
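
The two ideas above (a composite key and an index-based `ride_id`) can be sketched on a small toy frame. This is an illustration under stated assumptions, not part of the analysis itself: the `trips` frame is a stand-in built from three of the sample rows shown earlier, and the choice of `key_cols` is a hypothetical reading of "pickup, dropoff, vendor".

```python
import pandas as pd

# Toy stand-in for the trip data, using three rows from the df.head() output
trips = pd.DataFrame({
    "hvfhs_license_num": ["HV0003", "HV0003", "HV0005"],
    "pickup_datetime": pd.to_datetime([
        "2025-05-01 00:03:25", "2025-05-01 00:07:04", "2025-05-01 00:50:38",
    ]),
    "dropoff_datetime": pd.to_datetime([
        "2025-05-01 00:12:22", "2025-05-01 00:17:17", "2025-05-01 01:00:23",
    ]),
    "PULocationID": [160, 82, 37],
    "DOLocationID": [82, 129, 112],
})

# Hypothetical composite key (an assumption, not confirmed by the data dictionary)
key_cols = [
    "hvfhs_license_num", "pickup_datetime", "dropoff_datetime",
    "PULocationID", "DOLocationID",
]

# Number of rows that repeat an earlier row on the candidate key
n_key_dupes = int(trips.duplicated(subset=key_cols).sum())

# Surrogate key: a fresh 0..n-1 index copied into a leading 'ride_id' column
trips = trips.reset_index(drop=True)
trips.insert(0, "ride_id", trips.index)
```

If `n_key_dupes` were non-zero on the full dataset, the candidate columns would not be sufficient as a key, and the surrogate `ride_id` would be the safer identifier.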

---
### **Data Cleaning**

In [6]:
df.isna().sum()

hvfhs_license_num             0
dispatching_base_num          0
originating_base_num    6034550
request_datetime              0
on_scene_datetime             0
pickup_datetime               0
dropoff_datetime              0
PULocationID                  0
DOLocationID                  0
trip_miles                    0
trip_time                     0
base_passenger_fare           0
tolls                         0
bcf                           0
sales_tax                     0
congestion_surcharge          0
airport_fee                   0
tips                          0
driver_pay                    0
shared_request_flag           0
shared_match_flag             0
access_a_ride_flag            0
wav_request_flag              0
wav_match_flag                0
cbd_congestion_fee            0
dtype: int64
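
Raw counts are hard to judge against 21 million rows; in the output above, 6,034,550 missing values out of 21,091,193 rows is roughly 28.6% of `originating_base_num`. A minimal sketch of reporting counts and percentages together follows; the `sample` frame and the `n_missing` / `pct_missing` column names are illustrative assumptions.

```python
import pandas as pd

# Toy stand-in with one missing originating base (illustrative values only)
sample = pd.DataFrame({
    "dispatching_base_num": ["B03404", "B03404", "B03406", "B03404"],
    "originating_base_num": ["B03404", "B03404", None, "B03404"],
})

# isna().mean() gives the fraction of missing values per column
missing = pd.DataFrame({
    "n_missing": sample.isna().sum(),
    "pct_missing": (sample.isna().mean() * 100).round(2),
})

# Keep only the columns that actually have missing values
missing = missing[missing["n_missing"] > 0]
```

The same two expressions should work unchanged on `df`, since they only rely on `DataFrame.isna`.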

In [8]:
df['originating_base_num'].unique()

array(['B03404', None, 'B03406', 'B02026', 'B02003'], dtype=object)

In [9]:
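# standardize missing values (None or np.nan) to pd.NA
# assign the result back: inplace replace on a column selection may not update df in recent pandas
df['originating_base_num'] = df['originating_base_num'].mask(df['originating_base_num'].isna(), pd.NA)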

In [10]:
# checking for fully duplicated rows
df.duplicated().sum()

0

---
### **Exploratory**: The 5 boroughs of NYC