# NYC-TLC Yellow Trip Metadata Exploration

## Introduction

This notebook explores the file metadata of the [NYC Taxi and Limousine Commission Yellow Trip Record Data](https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page). It can also serve as a basis for deciding which yellow trip data files to download and use for a specific analysis.

### Data Dictionary

See the [Data Dictionary – Yellow Taxi Trip Records](https://www.nyc.gov/assets/tlc/downloads/pdf/data_dictionary_trip_records_yellow.pdf) for the meaning of each trip record column.

## Loading the Data

### Import libraries

In [1]:
import matplotlib.pyplot as plt
import pyarrow as pa
import pandas as pd

### Load the data

In [2]:
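# Path to the metadata snapshot (CSV with one row per trip data file).
METADATA_PATH = "./data/trips-metadata/2023-12-28.csv"
df = pd.read_csv(METADATA_PATH)
# Keep only the rows that describe yellow taxi files.
df = df[df["file_record_type"] == 'yellow']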

### Print data summary

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 170 entries, 269 to 438
Data columns (total 13 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   file_name               170 non-null    object 
 1   file_s3_url             170 non-null    object 
 2   file_cloudfront_url     170 non-null    object 
 3   file_record_type        170 non-null    object 
 4   file_year               170 non-null    int64  
 5   file_month              170 non-null    int64  
 6   file_modification_time  170 non-null    object 
 7   file_num_rows           170 non-null    int64  
 8   file_num_columns        170 non-null    int64  
 9   file_column_names       170 non-null    object 
 10  file_size_bytes         170 non-null    int64  
 11  file_size_mbs           170 non-null    float64
 12  file_size_gbs           170 non-null    float64
dtypes: float64(2), int64(5), object(6)
memory usage: 18.6+ KB


## Exploring the Data

### What is the total number of all records (rows)?

In [4]:
print("{:,d} records.".format(df["file_num_rows"].sum()))

1,660,780,266 records.


### What is the total compressed size (GBs) of all records?

In [5]:
print("{:,.4f} GBs.".format(df["file_size_gbs"].sum()))

26.1997 GBs.
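
The unit behind `file_size_gbs` is not stated in this notebook. Judging from values printed in later cells, it looks like a binary gigabyte (megabytes divided by 1,024). The sketch below checks that assumption against three (MB, GB) pairs copied from the outputs further down. It is only a consistency check on the displayed, rounded values, not a statement about how the metadata file was produced.

```python
# (file_name, file_size_mbs, file_size_gbs) copied from outputs in this notebook.
samples = [
    ("yellow_tripdata_2009-03.parquet", 460.235812, 0.449449),
    ("yellow_tripdata_2009-05.parquet", 472.375512, 0.461304),
    ("yellow_tripdata_2009-10.parquet", 503.091026, 0.491300),
]

# Assumption under test: GB = MB / 1024 (binary units), not MB / 1000.
TOLERANCE = 1e-5  # the printed values are rounded to six decimals

binary_ok = all(abs(mbs / 1024 - gbs) < TOLERANCE for _, mbs, gbs in samples)
decimal_ok = all(abs(mbs / 1000 - gbs) < TOLERANCE for _, mbs, gbs in samples)

print("consistent with MB / 1024:", binary_ok)
print("consistent with MB / 1000:", decimal_ok)
```

If the first check passes and the second fails, the reported totals are best read as GiB-style figures.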


### Which years do the records cover?

In [6]:
pd.DataFrame({"file_year": sorted(df["file_year"].unique())})

Unnamed: 0,file_year
0,2009
1,2010
2,2011
3,2012
4,2013
5,2014
6,2015
7,2016
8,2017
9,2018
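
The listing above stops at 2018, although files from 2020 and 2021 appear later in this notebook, so the displayed table is probably only a partial view. A per-year month count can summarize coverage more compactly and flag years with missing files. The sketch below uses a small hypothetical `sample` frame in place of `df`; only the column names `file_year` and `file_month` come from the metadata.

```python
import pandas as pd

# Hypothetical stand-in for df: 2009 is complete, 2010 lacks months 3 and 11.
sample = pd.DataFrame(
    [(2009, m) for m in range(1, 13)]
    + [(2010, m) for m in range(1, 13) if m not in (3, 11)],
    columns=["file_year", "file_month"],
)

# Count the distinct months available for each year.
coverage = sample.groupby("file_year")["file_month"].nunique().rename("num_months")

# List the missing months for every year that is incomplete.
all_months = set(range(1, 13))
missing = {
    int(year): sorted(all_months - set(months))
    for year, months in sample.groupby("file_year")["file_month"]
    if set(months) != all_months
}

print(coverage)
print(missing)
```

Applied to the real `df`, the most recent year would also be flagged whenever it is only partially published.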


### What is the total number of records (rows) per year?

In [7]:
df2 = df[["file_year", "file_num_rows"]].groupby(by="file_year").sum()
df2 = df2.reset_index()
df2 = df2.sort_values(by="file_num_rows", ascending=False)
df2["file_num_rows"] = df2["file_num_rows"].apply(lambda x: "{:,d}".format(x))
df2

Unnamed: 0,file_year,file_num_rows
2,2011,176887259
4,2013,171816340
3,2012,171359007
0,2009,170896055
5,2014,165447579
6,2015,146039231
7,2016,131131805
8,2017,113500327
1,2010,111529714
9,2018,102871387


### What is the total compressed size (GBs) of records per year?

In [8]:
df3 = df[["file_year", "file_size_gbs"]].groupby(by="file_year").sum()
df3 = df3.reset_index()
df3 = df3.sort_values(by="file_size_gbs", ascending=False)
df3["file_size_gbs"] = df3["file_size_gbs"].apply(lambda x: "{:,.4f}".format(x))
df3

Unnamed: 0,file_year,file_size_gbs
0,2009,5.3248
1,2010,3.5191
2,2011,2.0566
3,2012,2.0338
5,2014,2.0007
4,2013,1.9967
6,2015,1.8878
7,2016,1.7072
8,2017,1.4852
9,2018,1.3638
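
Comparing the two per-year tables, 2009 has the largest compressed size but not the most rows, so the average compressed size per row differs between years. A minimal sketch of that ratio follows, using the 2009, 2010 and 2011 totals printed above and assuming that `file_size_gbs` is a binary gigabyte (1,024³ bytes), which this notebook does not state.

```python
import pandas as pd

# Per-year totals copied from the two outputs above.
yearly = pd.DataFrame(
    {
        "file_year": [2009, 2010, 2011],
        "file_num_rows": [170_896_055, 111_529_714, 176_887_259],
        "file_size_gbs": [5.3248, 3.5191, 2.0566],
    }
)

# Assumption: 1 GB = 1024**3 bytes.
BYTES_PER_GB = 1024**3
yearly["bytes_per_row"] = (
    yearly["file_size_gbs"] * BYTES_PER_GB / yearly["file_num_rows"]
)

print(yearly.sort_values("bytes_per_row", ascending=False))
```

Under that assumption, 2009 and 2010 come out at roughly 33 to 34 compressed bytes per row, while 2011 is close to 12.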


### Which files have the largest compressed sizes (GBs)?

In [9]:
df5 = df[["file_name", "file_size_gbs"]]
df5 = df5.sort_values(by="file_size_gbs", ascending=False)
df5.head(n=10)

Unnamed: 0,file_name,file_size_gbs
285,yellow_tripdata_2010-05.parquet,0.50458
284,yellow_tripdata_2010-04.parquet,0.492021
278,yellow_tripdata_2009-10.parquet,0.4913
286,yellow_tripdata_2010-06.parquet,0.481978
281,yellow_tripdata_2010-01.parquet,0.480006
287,yellow_tripdata_2010-07.parquet,0.475898
273,yellow_tripdata_2009-05.parquet,0.461304
280,yellow_tripdata_2009-12.parquet,0.454099
271,yellow_tripdata_2009-03.parquet,0.449449
279,yellow_tripdata_2009-11.parquet,0.445455


### Which files have the smallest compressed sizes (GBs)?

In [10]:
df6 = df[["file_name", "file_size_gbs"]]
df6 = df6.sort_values(by="file_size_gbs", ascending=True)
df6.head(n=10)

Unnamed: 0,file_name,file_size_gbs
400,yellow_tripdata_2020-04.parquet,0.004138
401,yellow_tripdata_2020-05.parquet,0.005802
402,yellow_tripdata_2020-06.parquet,0.008853
403,yellow_tripdata_2020-07.parquet,0.012468
404,yellow_tripdata_2020-08.parquet,0.015461
405,yellow_tripdata_2020-09.parquet,0.019913
409,yellow_tripdata_2021-01.parquet,0.020197
410,yellow_tripdata_2021-02.parquet,0.020282
408,yellow_tripdata_2020-12.parquet,0.021439
407,yellow_tripdata_2020-11.parquet,0.021964


### Which files have the largest number of records (rows)?

In [11]:
df7 = df[["file_name", "file_num_rows"]]
df7 = df7.sort_values(by="file_num_rows", ascending=False)
df7["file_num_rows"] = df7["file_num_rows"].apply(lambda x: "{:,d}".format(x))
df7.head(n=10)

Unnamed: 0,file_name,file_num_rows
303,yellow_tripdata_2012-03.parquet,16146923
291,yellow_tripdata_2011-03.parquet,16066351
315,yellow_tripdata_2013-03.parquet,15749228
298,yellow_tripdata_2011-10.parquet,15697804
278,yellow_tripdata_2009-10.parquet,15604551
293,yellow_tripdata_2011-05.parquet,15554868
285,yellow_tripdata_2010-05.parquet,15481351
327,yellow_tripdata_2014-03.parquet,15428134
317,yellow_tripdata_2013-05.parquet,15285052
284,yellow_tripdata_2010-04.parquet,15144990


### Which files have the smallest number of records (rows)?

In [12]:
df8 = df[["file_name", "file_num_rows"]]
df8 = df8.sort_values(by="file_num_rows", ascending=True)
df8["file_num_rows"] = df8["file_num_rows"].apply(lambda x: "{:,d}".format(x))
df8.head(n=10)

Unnamed: 0,file_name,file_num_rows
400,yellow_tripdata_2020-04.parquet,238073
401,yellow_tripdata_2020-05.parquet,348415
402,yellow_tripdata_2020-06.parquet,549797
403,yellow_tripdata_2020-07.parquet,800412
404,yellow_tripdata_2020-08.parquet,1007286
405,yellow_tripdata_2020-09.parquet,1341017
409,yellow_tripdata_2021-01.parquet,1369769
410,yellow_tripdata_2021-02.parquet,1371709
408,yellow_tripdata_2020-12.parquet,1461898
407,yellow_tripdata_2020-11.parquet,1509000


### How do column names change across files?

In [13]:
df9 = df[["file_column_names"]].groupby(by=["file_column_names"]).size()
df9 = df9.reset_index(name="num_of_files")
df9

Unnamed: 0,file_column_names,num_of_files
0,"VendorID,tpep_pickup_datetime,tpep_dropoff_dat...",5
1,"VendorID,tpep_pickup_datetime,tpep_dropoff_dat...",145
2,"vendor_id,pickup_datetime,dropoff_datetime,pas...",6
3,"vendor_id,pickup_datetime,dropoff_datetime,pas...",2
4,"vendor_name,Trip_Pickup_DateTime,Trip_Dropoff_...",12


### How many times does each column name appear across files?

In [14]:
df10 = df["file_column_names"].str.split(',').explode()
df10 = pd.DataFrame(df10)
df10 = df10.groupby(by='file_column_names').size()
df10 = df10.reset_index(name="num_of_files")
df10 = df10.sort_values(by="num_of_files", ascending=False)
df10

Unnamed: 0,file_column_names,num_of_files
28,mta_tax,170
36,store_and_fwd_flag,158
38,tip_amount,158
39,tolls_amount,158
40,total_amount,158
30,payment_type,158
29,passenger_count,158
26,fare_amount,158
43,trip_distance,158
21,congestion_surcharge,150


### Which files contain longitude and latitude columns?

In [15]:
df11 = df[(df["file_column_names"].str.contains("long", case=False) | 
           df["file_column_names"].str.contains("lat", case=False))]
df11 = df11[["file_year", "file_month", "file_name", "file_size_mbs", "file_cloudfront_url"]]
df11

Unnamed: 0,file_year,file_month,file_name,file_size_mbs,file_cloudfront_url
269,2009,1,yellow_tripdata_2009-01.parquet,447.998042,https://d37ci6vzurychx.cloudfront.net/trip-dat...
270,2009,2,yellow_tripdata_2009-02.parquet,422.8917,https://d37ci6vzurychx.cloudfront.net/trip-dat...
271,2009,3,yellow_tripdata_2009-03.parquet,460.235812,https://d37ci6vzurychx.cloudfront.net/trip-dat...
272,2009,4,yellow_tripdata_2009-04.parquet,455.984558,https://d37ci6vzurychx.cloudfront.net/trip-dat...
273,2009,5,yellow_tripdata_2009-05.parquet,472.375512,https://d37ci6vzurychx.cloudfront.net/trip-dat...
274,2009,6,yellow_tripdata_2009-06.parquet,451.648567,https://d37ci6vzurychx.cloudfront.net/trip-dat...
275,2009,7,yellow_tripdata_2009-07.parquet,433.548706,https://d37ci6vzurychx.cloudfront.net/trip-dat...
276,2009,8,yellow_tripdata_2009-08.parquet,437.041971,https://d37ci6vzurychx.cloudfront.net/trip-dat...
277,2009,9,yellow_tripdata_2009-09.parquet,446.650237,https://d37ci6vzurychx.cloudfront.net/trip-dat...
278,2009,10,yellow_tripdata_2009-10.parquet,503.091026,https://d37ci6vzurychx.cloudfront.net/trip-dat...
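
As the introduction notes, this metadata can guide which files to download. A minimal sketch of that idea: keep the files of a given year and take them smallest first until a size budget is reached. The `pick_files` helper and the 1,000 MB budget are hypothetical; the sample rows are copied from the output above.

```python
import pandas as pd

# Rows copied from the output above (first three files of 2009).
sample = pd.DataFrame(
    {
        "file_year": [2009, 2009, 2009],
        "file_month": [1, 2, 3],
        "file_name": [
            "yellow_tripdata_2009-01.parquet",
            "yellow_tripdata_2009-02.parquet",
            "yellow_tripdata_2009-03.parquet",
        ],
        "file_size_mbs": [447.998042, 422.8917, 460.235812],
    }
)


def pick_files(meta: pd.DataFrame, year: int, budget_mbs: float) -> pd.DataFrame:
    """Hypothetical helper: smallest files of `year` whose total size fits the budget."""
    candidates = meta[meta["file_year"] == year].sort_values("file_size_mbs")
    # The running total only grows, so this mask keeps a prefix of the sorted rows.
    within_budget = candidates["file_size_mbs"].cumsum() <= budget_mbs
    return candidates[within_budget]


selected = pick_files(sample, year=2009, budget_mbs=1000)
print(selected[["file_name", "file_size_mbs"]])
```

The same filter could be applied to `df11`, using its `file_cloudfront_url` column to fetch the selected files.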
