# NYC-TLC High Volume For-Hire Vehicle ("FHVHV") Trip Metadata Exploration

## Introduction

This notebook explores the file metadata of the [NYC Taxi and Limousine Commission High Volume For-Hire Vehicle ("FHVHV") Trip Record Data](https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page). It can also serve as a basis for deciding which FHVHV trip data files to download and use when performing a specific analysis.

### Data Dictionary

For column definitions, see the [Data Dictionary – High Volume For-Hire Vehicle ("FHVHV") Taxi Trip Records](https://www.nyc.gov/assets/tlc/downloads/pdf/data_dictionary_trip_records_hvfhs.pdf).

## Extracting the Data

Change the year (the value passed to `-y`) to extract or update the metadata for that year.

In [1]:
# !python extract_trips_metadata.py -s web -t fhvhv -y 2024
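
To refresh several years in one go, the command above can be generated per year. The following is a minimal sketch: the script name and the `-s`, `-t`, and `-y` flags are copied from the cell above, while the helper `build_extract_commands` and the assumption that the script handles one year per call are introduced here for illustration only.

```python
import shlex
import sys


def build_extract_commands(years, record_type="fhvhv", source="web"):
    """Build one extraction command per year (hypothetical helper)."""
    return [
        [sys.executable, "extract_trips_metadata.py",
         "-s", source, "-t", record_type, "-y", str(year)]
        for year in years
    ]


# Years 2019 through 2024, matching the years covered in this notebook.
commands = build_extract_commands(range(2019, 2025))

for command in commands:
    # Print only; pass `command` to subprocess.run(command, check=True) to execute.
    print(shlex.join(command))
```

Printing the commands first makes it easy to review them before running anything.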

## Loading the Data

### Import libraries

In [2]:
import glob
import matplotlib.pyplot as plt
import pyarrow as pa
import pandas as pd

from conf import DATASET_LOCAL_METADATA_PATH

### Load the data

In [3]:
METADATA_FILES = glob.glob(f"{DATASET_LOCAL_METADATA_PATH}/fhvhv_tripmetadata_*.csv")

In [4]:
df = pd.concat([pd.read_csv(file) for file in METADATA_FILES], ignore_index=True)

### Print data summary

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 64 entries, 0 to 63
Data columns (total 14 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   file_name               64 non-null     object 
 1   file_s3_url             64 non-null     object 
 2   file_cloudfront_url     64 non-null     object 
 3   file_record_type        64 non-null     object 
 4   file_year               64 non-null     int64  
 5   file_month              64 non-null     int64  
 6   file_modification_time  64 non-null     object 
 7   file_num_rows           64 non-null     int64  
 8   file_num_columns        64 non-null     int64  
 9   file_column_names       64 non-null     object 
 10  file_size_bytes         64 non-null     int64  
 11  file_size_mbs           64 non-null     float64
 12  file_size_gbs           64 non-null     float64
 13  file_metadata_source    64 non-null     object 
dtypes: float64(2), int64(5), object(7)
memory us

## Exploring the Data

### What is the total number of all records (rows)?

In [6]:
print("{:,d} records.".format(df["file_num_rows"].sum()))

1,098,184,332 records.


### What is the total compressed size (GBs) of all records?

In [7]:
print("{:,.4f} GBs.".format(df["file_size_gbs"].sum()))

26.0451 GBs.


### Which years are covered by all records?

In [8]:
pd.DataFrame({"file_year": sorted(df["file_year"].unique())})

Unnamed: 0,file_year
0,2019
1,2020
2,2021
3,2022
4,2023
5,2024


### What is the total number of records (rows) per year?

In [9]:
df2 = df[["file_year", "file_num_rows"]].groupby(by="file_year").sum()
df2 = df2.reset_index()
df2 = df2.sort_values(by="file_num_rows", ascending=False)
df2["file_num_rows"] = df2["file_num_rows"].apply(lambda x: "{:,d}".format(x))
df2

Unnamed: 0,file_year,file_num_rows
0,2019,234630264
4,2023,232490020
3,2022,212416083
2,2021,174596652
1,2020,143309871
5,2024,100741442
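
Yearly totals are not directly comparable because the years contain different numbers of files (11 for 2019 and 5 for 2024, according to the file counts reported later in this notebook). A minimal sketch of normalizing by file count follows. The totals and counts are copied from this notebook's outputs, and `rows_per_file` is a column name introduced here for illustration.

```python
import pandas as pd

# Yearly row totals and file counts copied from this notebook's outputs.
yearly = pd.DataFrame(
    {
        "file_year": [2019, 2020, 2021, 2022, 2023, 2024],
        "file_num_rows": [
            234630264, 143309871, 174596652,
            212416083, 232490020, 100741442,
        ],
        "num_of_files": [11, 12, 12, 12, 12, 5],
    }
)

# Average number of records per monthly file in each year.
yearly["rows_per_file"] = yearly["file_num_rows"] / yearly["num_of_files"]
yearly = yearly.sort_values(by="rows_per_file", ascending=False)

print(yearly.to_string(index=False))
```

On the full metadata frame, the same figures could be derived with `df.groupby("file_year")["file_num_rows"].agg(["sum", "count", "mean"])`.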


### What is the total compressed size (GBs) of records per year?

In [10]:
df3 = df[["file_year", "file_size_gbs"]].groupby(by="file_year").sum()
df3 = df3.reset_index()
df3 = df3.sort_values(by="file_size_gbs", ascending=False)
df3["file_size_gbs"] = df3["file_size_gbs"].apply(lambda x: "{:,.4f}".format(x))
df3

Unnamed: 0,file_year,file_size_gbs
0,2019,5.5982
4,2023,5.421
3,2022,5.0633
2,2021,4.2411
1,2020,3.4698
5,2024,2.2518
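
Size figures like these can help decide which files to download for a given analysis. The sketch below picks files for one year, in month order, until a size budget is reached. The function `select_files` and the 300 MB budget are assumptions made for illustration; the sample rows reuse file names and sizes shown elsewhere in this notebook, with the month values taken from the file names.

```python
import pandas as pd

# Small illustrative subset; in the notebook, the loaded `df` would be used instead.
sample = pd.DataFrame(
    {
        "file_name": [
            "fhvhv_tripdata_2020-04.parquet",
            "fhvhv_tripdata_2020-05.parquet",
            "fhvhv_tripdata_2020-06.parquet",
            "fhvhv_tripdata_2019-03.parquet",
        ],
        "file_year": [2020, 2020, 2020, 2019],
        "file_month": [4, 5, 6, 3],
        "file_size_mbs": [109.515195, 153.107369, 188.801557, 582.556135],
    }
)


def select_files(metadata, year, max_total_mbs):
    """Return the files of `year`, in month order, whose running size fits the budget."""
    candidates = metadata[metadata["file_year"] == year].sort_values(by="file_month")
    # Sizes are positive, so the running total only grows and the mask keeps a prefix.
    within_budget = candidates["file_size_mbs"].cumsum() <= max_total_mbs
    return candidates[within_budget]


selected = select_files(sample, year=2020, max_total_mbs=300)
print(selected[["file_name", "file_size_mbs"]])
```

With a 300 MB budget, this keeps the April and May 2020 files and drops June 2020, which would push the running total past the limit.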


### Describe the compressed file sizes (MBs)

In [11]:
df[["file_size_mbs"]].describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
file_size_mbs,64.0,416.722377,96.300971,109.515195,368.137658,441.92011,478.740239,582.556135


### Which files have the largest compressed sizes (MBs)?

In [12]:
df5 = df[["file_name", "file_size_mbs"]]
df5 = df5.sort_values(by="file_size_mbs", ascending=False)
df5.head(n=10)

Unnamed: 0,file_name,file_size_mbs
18,fhvhv_tripdata_2019-03.parquet,582.556135
27,fhvhv_tripdata_2019-12.parquet,549.935223
20,fhvhv_tripdata_2019-05.parquet,544.377504
26,fhvhv_tripdata_2019-11.parquet,535.022415
19,fhvhv_tripdata_2019-04.parquet,533.730174
41,fhvhv_tripdata_2020-02.parquet,532.984617
25,fhvhv_tripdata_2019-10.parquet,523.920382
21,fhvhv_tripdata_2019-06.parquet,511.352468
40,fhvhv_tripdata_2020-01.parquet,506.664212
30,fhvhv_tripdata_2023-03.parquet,498.280401


### Which files have the smallest compressed sizes (MBs)?

In [13]:
df6 = df[["file_name", "file_size_mbs"]]
df6 = df6.sort_values(by="file_size_mbs", ascending=True)
df6.head(n=10)

Unnamed: 0,file_name,file_size_mbs
43,fhvhv_tripdata_2020-04.parquet,109.515195
44,fhvhv_tripdata_2020-05.parquet,153.107369
45,fhvhv_tripdata_2020-06.parquet,188.801557
46,fhvhv_tripdata_2020-07.parquet,248.19617
47,fhvhv_tripdata_2020-08.parquet,276.992729
50,fhvhv_tripdata_2020-11.parquet,287.872459
53,fhvhv_tripdata_2021-02.parquet,288.613521
51,fhvhv_tripdata_2020-12.parquet,289.215115
52,fhvhv_tripdata_2021-01.parquet,294.613778
48,fhvhv_tripdata_2020-09.parquet,300.111181


### Describe the number of records (rows) per file

In [14]:
df[["file_num_rows"]].describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
file_num_rows,64.0,17159130.0,4112863.0,4312909.0,14743486.0,18239742.5,20134360.25,23864598.0


### Which files have the largest number of records (rows)?

In [15]:
df7 = df[["file_name", "file_num_rows"]]
df7 = df7.sort_values(by="file_num_rows", ascending=False)
df7["file_num_rows"] = df7["file_num_rows"].apply(lambda x: "{:,d}".format(x))
df7.head(n=10)

Unnamed: 0,file_name,file_num_rows
18,fhvhv_tripdata_2019-03.parquet,23864598
20,fhvhv_tripdata_2019-05.parquet,22329247
27,fhvhv_tripdata_2019-12.parquet,22243901
19,fhvhv_tripdata_2019-04.parquet,21734822
41,fhvhv_tripdata_2020-02.parquet,21725100
26,fhvhv_tripdata_2019-11.parquet,21635568
14,fhvhv_tripdata_2024-03.parquet,21280788
25,fhvhv_tripdata_2019-10.parquet,21162290
21,fhvhv_tripdata_2019-06.parquet,21001990
16,fhvhv_tripdata_2024-05.parquet,20704538


### Which files have the smallest number of records (rows)?

In [16]:
df8 = df[["file_name", "file_num_rows"]]
df8 = df8.sort_values(by="file_num_rows", ascending=True)
df8["file_num_rows"] = df8["file_num_rows"].apply(lambda x: "{:,d}".format(x))
df8.head(n=10)

Unnamed: 0,file_name,file_num_rows
43,fhvhv_tripdata_2020-04.parquet,4312909
44,fhvhv_tripdata_2020-05.parquet,6089999
45,fhvhv_tripdata_2020-06.parquet,7555193
46,fhvhv_tripdata_2020-07.parquet,9958454
47,fhvhv_tripdata_2020-08.parquet,11096852
50,fhvhv_tripdata_2020-11.parquet,11596865
53,fhvhv_tripdata_2021-02.parquet,11613942
51,fhvhv_tripdata_2020-12.parquet,11637123
52,fhvhv_tripdata_2021-01.parquet,11908468
48,fhvhv_tripdata_2020-09.parquet,12106669


### How do column names change across files?

In [17]:
df9 = df[["file_year", "file_column_names"]].groupby(by=["file_year", "file_column_names"]).size()
df9 = df9.reset_index(name="num_of_files")
pd.set_option('display.max_colwidth', None)
df9

Unnamed: 0,file_year,file_column_names,num_of_files
0,2019,"hvfhs_license_num,dispatching_base_num,originating_base_num,request_datetime,on_scene_datetime,pickup_datetime,dropoff_datetime,PULocationID,DOLocationID,trip_miles,trip_time,base_passenger_fare,tolls,bcf,sales_tax,congestion_surcharge,airport_fee,tips,driver_pay,shared_request_flag,shared_match_flag,access_a_ride_flag,wav_request_flag,wav_match_flag",11
1,2020,"hvfhs_license_num,dispatching_base_num,originating_base_num,request_datetime,on_scene_datetime,pickup_datetime,dropoff_datetime,PULocationID,DOLocationID,trip_miles,trip_time,base_passenger_fare,tolls,bcf,sales_tax,congestion_surcharge,airport_fee,tips,driver_pay,shared_request_flag,shared_match_flag,access_a_ride_flag,wav_request_flag,wav_match_flag",12
2,2021,"hvfhs_license_num,dispatching_base_num,originating_base_num,request_datetime,on_scene_datetime,pickup_datetime,dropoff_datetime,PULocationID,DOLocationID,trip_miles,trip_time,base_passenger_fare,tolls,bcf,sales_tax,congestion_surcharge,airport_fee,tips,driver_pay,shared_request_flag,shared_match_flag,access_a_ride_flag,wav_request_flag,wav_match_flag",12
3,2022,"hvfhs_license_num,dispatching_base_num,originating_base_num,request_datetime,on_scene_datetime,pickup_datetime,dropoff_datetime,PULocationID,DOLocationID,trip_miles,trip_time,base_passenger_fare,tolls,bcf,sales_tax,congestion_surcharge,airport_fee,tips,driver_pay,shared_request_flag,shared_match_flag,access_a_ride_flag,wav_request_flag,wav_match_flag",12
4,2023,"hvfhs_license_num,dispatching_base_num,originating_base_num,request_datetime,on_scene_datetime,pickup_datetime,dropoff_datetime,PULocationID,DOLocationID,trip_miles,trip_time,base_passenger_fare,tolls,bcf,sales_tax,congestion_surcharge,airport_fee,tips,driver_pay,shared_request_flag,shared_match_flag,access_a_ride_flag,wav_request_flag,wav_match_flag",12
5,2024,"hvfhs_license_num,dispatching_base_num,originating_base_num,request_datetime,on_scene_datetime,pickup_datetime,dropoff_datetime,PULocationID,DOLocationID,trip_miles,trip_time,base_passenger_fare,tolls,bcf,sales_tax,congestion_surcharge,airport_fee,tips,driver_pay,shared_request_flag,shared_match_flag,access_a_ride_flag,wav_request_flag,wav_match_flag",5
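
Comparing long column-name strings by eye is error-prone, so a programmatic check can confirm whether every file shares one schema. The following sketch uses hypothetical file names and column lists (not taken from the TLC data) to show how a deviating file could be detected.

```python
import pandas as pd

# Hypothetical metadata rows; the file names and columns are illustrative only.
sample = pd.DataFrame(
    {
        "file_name": ["a.parquet", "b.parquet", "c.parquet"],
        "file_column_names": ["x,y,z", "x,y,z", "x,y"],
    }
)

# Number of distinct column-name strings; 1 would mean a single shared schema.
distinct_schemas = sample["file_column_names"].nunique()

# Files whose column list differs from the most common one.
most_common = sample["file_column_names"].mode().iloc[0]
outliers = sample.loc[sample["file_column_names"] != most_common, "file_name"].tolist()

print(distinct_schemas, outliers)
```

Applied to the notebook's `df`, `df["file_column_names"].nunique()` returning 1 would indicate that all files list the same columns in the same order.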


### In how many files does each column name appear?

In [18]:
df10 = df["file_column_names"].str.split(",").explode()
df10 = pd.DataFrame(df10)
df10 = df10.groupby(by="file_column_names").size()
df10 = df10.reset_index(name="num_of_files")
df10 = df10.sort_values(by="num_of_files", ascending=False)
df10

Unnamed: 0,file_column_names,num_of_files
0,DOLocationID,64
1,PULocationID,64
22,wav_match_flag,64
21,trip_time,64
20,trip_miles,64
19,tolls,64
18,tips,64
17,shared_request_flag,64
16,shared_match_flag,64
15,sales_tax,64


### Which files have longitude and latitude columns?

In [19]:
df11 = df[(df["file_column_names"].str.contains("long", case=False) | 
           df["file_column_names"].str.contains("lat", case=False))]
df11 = df11[["file_size_mbs", "file_cloudfront_url"]]
pd.set_option('display.max_colwidth', None)
df11

Unnamed: 0,file_size_mbs,file_cloudfront_url
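
Substring checks for "long" and "lat" can match unrelated names that merely contain those letters. A stricter alternative, sketched below, splits each column name into words and accepts only whole words such as `lat`, `latitude`, `lon`, `long`, or `longitude`. The helper `coordinate_columns` and the second sample string are hypothetical; the first sample is a shortened version of the column list shown in this notebook.

```python
import re

# Whole-word patterns treated as coordinate markers (an assumption for this sketch).
COORDINATE_WORD = re.compile(r"lat(itude)?|lon(g(itude)?)?")


def coordinate_columns(column_names):
    """Return the column names that contain a whole word naming a coordinate."""
    matches = []
    for name in column_names.split(","):
        name = name.strip()
        # Split on anything that is not a letter or digit, e.g. underscores.
        words = re.split(r"[^a-z0-9]+", name.lower())
        if any(COORDINATE_WORD.fullmatch(word) for word in words):
            matches.append(name)
    return matches


fhvhv_columns = "hvfhs_license_num,PULocationID,DOLocationID,trip_miles,sales_tax"
hypothetical_columns = "pickup_longitude,pickup_latitude,flat_rate,belongs_to"

print(coordinate_columns(fhvhv_columns))
print(coordinate_columns(hypothetical_columns))
```

The first call finds no coordinate columns, consistent with the empty result above, while the second returns only the two `pickup_` columns and ignores `flat_rate` and `belongs_to`. A column using `long` in another sense would still match, so results are worth checking by hand.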
