# NYC-TLC For-Hire Vehicle ("FHV") Trip Metadata Exploration

## Introduction

This notebook explores the file metadata of the [NYC Taxi and Limousine Commission For-Hire Vehicle ("FHV") Trip Record Data](https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page). It can also serve as a basis for deciding which FHV trip data files to download and use for a specific analysis.

### Data Dictionary

See the [Data Dictionary – For-Hire Vehicle ("FHV") Taxi Trip Records](https://www.nyc.gov/assets/tlc/downloads/pdf/data_dictionary_trip_records_fhv.pdf) for column definitions.

## Extracting the Data

Change the year passed to `-y` to extract (or update) the metadata for that year.

In [1]:
# !python extract_trips_metadata.py -s web -t fhv -y 2024
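
To refresh metadata for several years in one pass, the command above could be generated once per year. This is a minimal sketch, assuming the script accepts the same `-s`, `-t`, and `-y` flags shown in the cell above. `build_extract_commands` is a hypothetical helper, not part of the project, and it only builds the commands without running them.

```python
import sys

def build_extract_commands(years, record_type="fhv", source="web"):
    """Return one argv list per year, mirroring the command shown above.

    Hypothetical helper: the flag names are copied from the notebook cell.
    """
    return [
        [sys.executable, "extract_trips_metadata.py",
         "-s", source, "-t", record_type, "-y", str(year)]
        for year in years
    ]

commands = build_extract_commands(range(2015, 2025))
for cmd in commands:
    # Preview only; pass cmd to subprocess.run(cmd, check=True) to execute it.
    print(" ".join(cmd[1:]))
```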

## Loading the Data

### Import libraries

In [2]:
import glob
import matplotlib.pyplot as plt
import pyarrow as pa
import pandas as pd

from conf import DATASET_LOCAL_METADATA_PATH

### Load the data

In [3]:
METADATA_FILES = glob.glob(f"{DATASET_LOCAL_METADATA_PATH}/fhv_tripmetadata_*.csv")

In [4]:
df = pd.concat([pd.read_csv(file) for file in METADATA_FILES], ignore_index=True)
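
`pd.concat` raises a `ValueError` when given an empty list, which happens here if no metadata file has been extracted yet. The sketch below wraps the two loading cells in one function with a clearer error. `load_metadata` is a hypothetical name introduced for illustration.

```python
import glob

import pandas as pd

def load_metadata(pattern):
    """Read every CSV matching `pattern` into a single DataFrame."""
    files = sorted(glob.glob(pattern))  # sorted for a reproducible row order
    if not files:
        # Fail early with the pattern in the message instead of
        # pd.concat's generic "No objects to concatenate".
        raise FileNotFoundError(f"No metadata files match {pattern!r}")
    return pd.concat([pd.read_csv(path) for path in files], ignore_index=True)
```

In this notebook the call would be `df = load_metadata(f"{DATASET_LOCAL_METADATA_PATH}/fhv_tripmetadata_*.csv")`. Sorting the paths can change the row order relative to a plain `glob.glob`, so the index labels in the outputs may differ.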

### Print data summary

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 113 entries, 0 to 112
Data columns (total 14 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   file_name               113 non-null    object 
 1   file_s3_url             113 non-null    object 
 2   file_cloudfront_url     113 non-null    object 
 3   file_record_type        113 non-null    object 
 4   file_year               113 non-null    int64  
 5   file_month              113 non-null    int64  
 6   file_modification_time  113 non-null    object 
 7   file_num_rows           113 non-null    int64  
 8   file_num_columns        113 non-null    int64  
 9   file_column_names       113 non-null    object 
 10  file_size_bytes         113 non-null    int64  
 11  file_size_mbs           113 non-null    float64
 12  file_size_gbs           113 non-null    float64
 13  file_metadata_source    113 non-null    object 
dtypes: float64(2), int64(5), object(7)
memory 

## Exploring the Data

### What is the total number of all records (rows)?

In [6]:
print("{:,d} records.".format(df["file_num_rows"].sum()))

758,801,924 records.


### What is the total compressed size (GBs) of all records?

In [7]:
print("{:,.4f} GBs.".format(df["file_size_gbs"].sum()))

5.2644 GBs.
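
The figures in this notebook suggest that `file_size_gbs` is `file_size_mbs` divided by 1,024, not by 1,000. The check below uses only numbers copied from the notebook's outputs: the file count from `df.info()`, the mean of `file_size_mbs` from the `describe()` cell, and the total above. How bytes are converted to MBs is not verified here.

```python
# Figures copied from this notebook's outputs.
num_files = 113
mean_size_mbs = 47.706138   # mean of file_size_mbs reported by describe()
total_size_gbs = 5.2644     # sum of file_size_gbs printed above

total_mbs = num_files * mean_size_mbs
binary_gbs = total_mbs / 1024    # assumes 1 GB = 1,024 MB
decimal_gbs = total_mbs / 1000   # assumes 1 GB = 1,000 MB

print(f"binary:  {binary_gbs:.4f} GB")
print(f"decimal: {decimal_gbs:.4f} GB")
```

Only the binary conversion reproduces the reported 5.2644 GBs.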


### Which years are covered by all records?

In [8]:
pd.DataFrame({"file_year": sorted(df["file_year"].unique())})

Unnamed: 0,file_year
0,2015
1,2016
2,2017
3,2018
4,2019
5,2020
6,2021
7,2022
8,2023
9,2024
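
A list of years does not show whether every month inside that range has a file. The sketch below lists the missing (year, month) pairs between the first and last months present, using the `file_year` and `file_month` columns. `missing_months` is a hypothetical helper and the sample rows are illustrative, not real metadata.

```python
import pandas as pd

def missing_months(metadata):
    """Return (year, month) pairs with no file between the first and last months present."""
    present = set(zip(metadata["file_year"], metadata["file_month"]))
    first, last = min(present), max(present)
    expected = [
        (year, month)
        for year in range(first[0], last[0] + 1)
        for month in range(1, 13)
        if first <= (year, month) <= last
    ]
    return [pair for pair in expected if pair not in present]

# Illustrative data only: 2023-02 is deliberately left out.
sample = pd.DataFrame({"file_year": [2022, 2023, 2023], "file_month": [12, 1, 3]})
print(missing_months(sample))
```

Calling `missing_months(df)` on the loaded metadata would return an empty list if the coverage has no gaps.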


### What is the total number of records (rows) per year?

In [9]:
df2 = df[["file_year", "file_num_rows"]].groupby(by="file_year").sum()
df2 = df2.reset_index()
df2 = df2.sort_values(by="file_num_rows", ascending=False)
df2["file_num_rows"] = df2["file_num_rows"].apply(lambda x: "{:,d}".format(x))
df2

Unnamed: 0,file_year,file_num_rows
3,2018,260874753
2,2017,192309558
1,2016,132114083
0,2015,63388532
4,2019,43261276
8,2023,15858639
5,2020,14945465
6,2021,14805265
7,2022,14511664
9,2024,6732689


### What is the total compressed size (GBs) of records per year?

In [10]:
df3 = df[["file_year", "file_size_gbs"]].groupby(by="file_year").sum()
df3 = df3.reset_index()
df3 = df3.sort_values(by="file_size_gbs", ascending=False)
df3["file_size_gbs"] = df3["file_size_gbs"].apply(lambda x: "{:,.4f}".format(x))
df3

Unnamed: 0,file_year,file_size_gbs
3,2018,2.2087
2,2017,1.3022
1,2016,0.5031
4,2019,0.3685
0,2015,0.2221
8,2023,0.1686
5,2020,0.1417
6,2021,0.14
7,2022,0.1371
9,2024,0.0725


### How are compressed file sizes (MBs) distributed?

In [11]:
df[["file_size_mbs"]].describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
file_size_mbs,113.0,47.706138,61.455682,6.187413,12.08057,15.750369,46.808203,206.910898


### Which files have the largest compressed sizes (MBs)?

In [12]:
df5 = df[["file_name", "file_size_mbs"]]
df5 = df5.sort_values(by="file_size_mbs", ascending=False)
df5.head(n=10)

Unnamed: 0,file_name,file_size_mbs
40,fhv_tripdata_2018-12.parquet,206.910898
38,fhv_tripdata_2018-10.parquet,202.694967
5,fhv_tripdata_2019-01.parquet,200.946259
39,fhv_tripdata_2018-11.parquet,198.895775
36,fhv_tripdata_2018-08.parquet,192.433067
37,fhv_tripdata_2018-09.parquet,192.115993
31,fhv_tripdata_2018-03.parquet,188.931273
35,fhv_tripdata_2018-07.parquet,188.241599
33,fhv_tripdata_2018-05.parquet,187.223758
34,fhv_tripdata_2018-06.parquet,183.044876


### Which files have the smallest compressed sizes (MBs)?

In [13]:
df6 = df[["file_name", "file_size_mbs"]]
df6 = df6.sort_values(by="file_size_mbs", ascending=True)
df6.head(n=10)

Unnamed: 0,file_name,file_size_mbs
20,fhv_tripdata_2020-04.parquet,6.187413
21,fhv_tripdata_2020-05.parquet,8.070605
55,fhv_tripdata_2015-03.parquet,8.628676
42,fhv_tripdata_2021-02.parquet,10.152308
22,fhv_tripdata_2020-06.parquet,10.163461
56,fhv_tripdata_2015-04.parquet,10.263732
111,fhv_tripdata_2022-11.parquet,10.775536
77,fhv_tripdata_2023-01.parquet,10.785082
101,fhv_tripdata_2022-01.parquet,11.122918
23,fhv_tripdata_2020-07.parquet,11.197661


### How is the number of records (rows) per file distributed?

In [14]:
df[["file_num_rows"]].describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
file_num_rows,113.0,6715061.0,7468261.0,566426.0,1254734.0,1897856.0,11759658.0,23904082.0


### Which files have the largest number of records (rows)?

In [15]:
df7 = df[["file_name", "file_num_rows"]]
df7 = df7.sort_values(by="file_num_rows", ascending=False)
df7["file_num_rows"] = df7["file_num_rows"].apply(lambda x: "{:,d}".format(x))
df7.head(n=10)

Unnamed: 0,file_name,file_num_rows
40,fhv_tripdata_2018-12.parquet,23904082
38,fhv_tripdata_2018-10.parquet,23289768
5,fhv_tripdata_2019-01.parquet,23159064
39,fhv_tripdata_2018-11.parquet,22911479
37,fhv_tripdata_2018-09.parquet,22151736
36,fhv_tripdata_2018-08.parquet,22120593
31,fhv_tripdata_2018-03.parquet,21985270
35,fhv_tripdata_2018-07.parquet,21599714
33,fhv_tripdata_2018-05.parquet,21565752
34,fhv_tripdata_2018-06.parquet,21137951


### Which files have the smallest number of records (rows)?

In [16]:
df8 = df[["file_name", "file_num_rows"]]
df8 = df8.sort_values(by="file_num_rows", ascending=True)
df8["file_num_rows"] = df8["file_num_rows"].apply(lambda x: "{:,d}".format(x))
df8.head(n=10)

Unnamed: 0,file_name,file_num_rows
20,fhv_tripdata_2020-04.parquet,566426
21,fhv_tripdata_2020-05.parquet,774970
22,fhv_tripdata_2020-06.parquet,1011867
42,fhv_tripdata_2021-02.parquet,1037692
111,fhv_tripdata_2022-11.parquet,1106084
78,fhv_tripdata_2023-02.parquet,1110797
77,fhv_tripdata_2023-01.parquet,1114320
23,fhv_tripdata_2020-07.parquet,1127489
101,fhv_tripdata_2022-01.parquet,1143691
108,fhv_tripdata_2022-08.parquet,1151155


### How do column names change across files?

In [17]:
df9 = df[["file_year", "file_column_names"]].groupby(by=["file_year", "file_column_names"]).size()
df9 = df9.reset_index(name="num_of_files")
pd.set_option('display.max_colwidth', None)
df9

Unnamed: 0,file_year,file_column_names,num_of_files
0,2015,"dispatching_base_num,pickup_datetime,dropOff_datetime,PUlocationID,DOlocationID,SR_Flag,Affiliated_base_number",12
1,2016,"dispatching_base_num,pickup_datetime,dropOff_datetime,PUlocationID,DOlocationID,SR_Flag,Affiliated_base_number",12
2,2017,"dispatching_base_num,pickup_datetime,dropOff_datetime,PUlocationID,DOlocationID,SR_Flag,Affiliated_base_number",12
3,2018,"dispatching_base_num,pickup_datetime,dropOff_datetime,PUlocationID,DOlocationID,SR_Flag,Affiliated_base_number",12
4,2019,"dispatching_base_num,pickup_datetime,dropOff_datetime,PUlocationID,DOlocationID,SR_Flag,Affiliated_base_number",12
5,2020,"dispatching_base_num,pickup_datetime,dropOff_datetime,PUlocationID,DOlocationID,SR_Flag,Affiliated_base_number",12
6,2021,"dispatching_base_num,pickup_datetime,dropOff_datetime,PUlocationID,DOlocationID,SR_Flag,Affiliated_base_number",12
7,2022,"dispatching_base_num,pickup_datetime,dropOff_datetime,PUlocationID,DOlocationID,SR_Flag,Affiliated_base_number",12
8,2023,"dispatching_base_num,pickup_datetime,dropOff_datetime,PUlocationID,DOlocationID,SR_Flag,Affiliated_base_number",12
9,2024,"dispatching_base_num,pickup_datetime,dropOff_datetime,PUlocationID,DOlocationID,SR_Flag,Affiliated_base_number",5
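
Because every year shows the same column list, a more compact view groups by the column list alone and reports how many files and which year range each distinct schema covers. This is a sketch using pandas named aggregation. `schema_summary` is a hypothetical helper and the sample rows are illustrative.

```python
import pandas as pd

def schema_summary(metadata):
    """Count files and report the year range for each distinct column list."""
    return (
        metadata.groupby("file_column_names")
        .agg(
            num_of_files=("file_name", "count"),
            first_year=("file_year", "min"),
            last_year=("file_year", "max"),
        )
        .reset_index()
    )

# Illustrative data only: two files share one schema, a third adds a column.
sample = pd.DataFrame({
    "file_name": ["a.parquet", "b.parquet", "c.parquet"],
    "file_year": [2015, 2016, 2016],
    "file_column_names": ["x,y", "x,y", "x,y,z"],
})
summary = schema_summary(sample)
print(summary)
```

On the loaded metadata, `schema_summary(df)` would be expected to return a single row, since the table above shows one column list for all years.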


### How many files does each column name appear in?

In [18]:
df10 = df["file_column_names"].str.split(",").explode()
df10 = pd.DataFrame(df10)
df10 = df10.groupby(by="file_column_names").size()
df10 = df10.reset_index(name="num_of_files")
df10 = df10.sort_values(by="num_of_files", ascending=False)
df10

Unnamed: 0,file_column_names,num_of_files
0,Affiliated_base_number,113
1,DOlocationID,113
2,PUlocationID,113
3,SR_Flag,113
4,dispatching_base_num,113
5,dropOff_datetime,113
6,pickup_datetime,113


### Which files have longitude and latitude columns?

In [19]:
# Case-insensitive substring match on "long" or "lat"; an empty result means no file has coordinate columns.
df11 = df[df["file_column_names"].str.contains("long|lat", case=False)]
df11 = df11[["file_size_mbs", "file_cloudfront_url"]]
pd.set_option('display.max_colwidth', None)
df11

Unnamed: 0,file_size_mbs,file_cloudfront_url
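
As noted in the introduction, this metadata can inform which files to download. One possible approach is sketched below: keep the files for the chosen years, in chronological order, until a download size budget is reached. `select_files` and its parameters are hypothetical, and the sample rows and `example.com` URLs are placeholders, not real TLC links.

```python
import pandas as pd

def select_files(metadata, years, max_total_mbs):
    """Pick files from `years`, oldest first, while the running size stays within budget."""
    subset = metadata[metadata["file_year"].isin(years)]
    subset = subset.sort_values(by=["file_year", "file_month"])
    # Sizes are positive, so the running total only grows and the mask is a prefix.
    within_budget = subset["file_size_mbs"].cumsum() <= max_total_mbs
    return subset.loc[within_budget, ["file_name", "file_size_mbs", "file_cloudfront_url"]]

# Illustrative data only.
sample = pd.DataFrame({
    "file_name": ["f_2020-02", "f_2020-01", "f_2021-01"],
    "file_year": [2020, 2020, 2021],
    "file_month": [2, 1, 1],
    "file_size_mbs": [10.0, 8.0, 12.0],
    "file_cloudfront_url": [
        "https://example.com/f_2020-02",
        "https://example.com/f_2020-01",
        "https://example.com/f_2021-01",
    ],
})
picked = select_files(sample, years=[2020], max_total_mbs=15)
print(picked)
```

The `file_cloudfront_url` column of the result can then be passed to whatever download step the analysis uses.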
