# NYC-TLC Trip Metadata Exploration

## Introduction

This notebook explores the file metadata of the [NYC Taxi and Limousine Commission Trip Record Data](https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page). It can also serve as a basis for deciding which trip data files to download and use when performing a specific analysis.

## Loading the Data

### Import libraries

In [1]:
import matplotlib.pyplot as plt
import pyarrow as pa
import pandas as pd

### Load the data

In [2]:
METADATA_PATH = "./data/trips-metadata/2023-12-28.csv"
df = pd.read_csv(METADATA_PATH)

### Print data summary

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 439 entries, 0 to 438
Data columns (total 13 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   file_name               439 non-null    object 
 1   file_s3_url             439 non-null    object 
 2   file_cloudfront_url     439 non-null    object 
 3   file_record_type        439 non-null    object 
 4   file_year               439 non-null    int64  
 5   file_month              439 non-null    int64  
 6   file_modification_time  439 non-null    object 
 7   file_num_rows           439 non-null    int64  
 8   file_num_columns        439 non-null    int64  
 9   file_column_names       439 non-null    object 
 10  file_size_bytes         439 non-null    int64  
 11  file_size_mbs           439 non-null    float64
 12  file_size_gbs           439 non-null    float64
dtypes: float64(2), int64(5), object(6)
memory usage: 44.7+ KB
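
Given the columns listed above, the metadata could be used to pick the files needed for an analysis before downloading anything. The following is a minimal sketch, not part of the original notebook: `select_files` is a hypothetical helper, and the sample rows (including the `example.com` URLs and the sizes) are made up for illustration. Only the column names come from the `df.info()` output.

```python
import pandas as pd

# Hypothetical sample rows. The column names match the metadata file,
# but the URLs and sizes below are invented for this example.
sample = pd.DataFrame(
    {
        "file_name": [
            "yellow_tripdata_2019-01.parquet",
            "yellow_tripdata_2019-02.parquet",
            "green_tripdata_2019-01.parquet",
            "yellow_tripdata_2020-01.parquet",
        ],
        "file_cloudfront_url": [
            "https://example.com/yellow_tripdata_2019-01.parquet",
            "https://example.com/yellow_tripdata_2019-02.parquet",
            "https://example.com/green_tripdata_2019-01.parquet",
            "https://example.com/yellow_tripdata_2020-01.parquet",
        ],
        "file_record_type": ["yellow", "yellow", "green", "yellow"],
        "file_year": [2019, 2019, 2019, 2020],
        "file_month": [1, 2, 1, 1],
        "file_size_gbs": [0.10, 0.09, 0.01, 0.08],
    }
)


def select_files(metadata, record_type, years):
    """Return the rows for one record type and a set of years (hypothetical helper)."""
    mask = (metadata["file_record_type"] == record_type) & metadata["file_year"].isin(years)
    return metadata.loc[mask].sort_values(by=["file_year", "file_month"])


selected = select_files(sample, record_type="yellow", years=[2019])
print(selected["file_name"].tolist())
print("{:,.4f} GBs to download.".format(selected["file_size_gbs"].sum()))
```

With the real metadata, the same call would take `df` instead of `sample`, and the `file_cloudfront_url` column of the result would list what to fetch.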


## Exploring the Data

### What is the total number of all records (rows)?

In [4]:
print("{:,d} records.".format(df["file_num_rows"].sum()))

3,367,004,170 records.


### What is the total compressed size (GBs) of all records?

In [5]:
print("{:,.4f} GBs.".format(df["file_size_gbs"].sum()))

53.6287 GBs.


### Which years do the records cover?

In [6]:
pd.DataFrame({"file_year": sorted(df["file_year"].unique())})

Unnamed: 0,file_year
0,2009
1,2010
2,2011
3,2012
4,2013
5,2014
6,2015
7,2016
8,2017
9,2018


### Which record types do the records cover?

In [7]:
pd.DataFrame({"file_record_type": sorted(df["file_record_type"].unique())})

Unnamed: 0,file_record_type
0,fhv
1,fhvhv
2,green
3,yellow


### What is the total number of records (rows) per year?

In [8]:
df2 = df[["file_year", "file_num_rows"]].groupby(by="file_year").sum()
df2 = df2.reset_index()
df2 = df2.sort_values(by="file_num_rows", ascending=False)
df2["file_num_rows"] = df2["file_num_rows"].apply(lambda x: "{:,d}".format(x))
df2

Unnamed: 0,file_year,file_num_rows
9,2018,372645858
10,2019,368790969
8,2017,317546944
7,2016,279631429
13,2022,267424247
6,2015,228661528
12,2021,221374980
11,2020,184638604
5,2014,181284588
2,2011,176887259
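
The group, sum, and sort steps used here recur in several later cells, so they could be wrapped in a small function. This is only a sketch under assumptions: `total_by` is a hypothetical name, the toy data is invented, and unlike the cell above it resets the index and leaves the totals numeric rather than formatting them as strings.

```python
import pandas as pd


def total_by(frame, key, value):
    """Sum `value` per `key` and sort descending (hypothetical helper)."""
    return (
        frame.groupby(by=key, as_index=False)[value]
        .sum()
        .sort_values(by=value, ascending=False)
        .reset_index(drop=True)
    )


# Invented toy data that reuses the metadata column names.
toy = pd.DataFrame(
    {
        "file_year": [2009, 2009, 2010, 2011],
        "file_num_rows": [10, 20, 50, 5],
    }
)

result = total_by(toy, "file_year", "file_num_rows")
print(result)
```

Keeping the totals numeric means the result can still be sorted, plotted, or summed afterwards. Thousands separators can be applied only at display time.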


### What is the total compressed size (GBs) of records per year?

In [9]:
df3 = df[["file_year", "file_size_gbs"]].groupby(by="file_year").sum()
df3 = df3.reset_index()
df3 = df3.sort_values(by="file_size_gbs", ascending=False)
df3["file_size_gbs"] = df3["file_size_gbs"].apply(lambda x: "{:,.4f}".format(x))
df3

Unnamed: 0,file_year,file_size_gbs
10,2019,7.2228
13,2022,5.7894
0,2009,5.3248
12,2021,4.8466
11,2020,3.9884
9,2018,3.7065
1,2010,3.5191
14,2023,3.1394
8,2017,2.9584
7,2016,2.4437


### What is the total number of records (rows) per record type?

In [10]:
df4 = df[["file_record_type", "file_num_rows"]].groupby(by="file_record_type").sum()
df4 = df4.reset_index()
df4 = df4.sort_values(by="file_num_rows", ascending=False)
df4["file_num_rows"] = df4["file_num_rows"].apply(lambda x: "{:,d}".format(x))
df4

Unnamed: 0,file_record_type,file_num_rows
3,yellow,1660780266
1,fhvhv,880165609
0,fhv,743615705
2,green,82442590
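
As a quick sanity check, the four per-type totals above should add up to the overall total of 3,367,004,170 records reported earlier. The sketch below copies the figures from this notebook's outputs and also derives each type's share. The variable names are illustrative only.

```python
# Figures copied from the outputs in this notebook.
rows_by_type = {
    "yellow": 1_660_780_266,
    "fhvhv": 880_165_609,
    "fhv": 743_615_705,
    "green": 82_442_590,
}
total_rows = 3_367_004_170

# The per-type totals should account for every record.
assert sum(rows_by_type.values()) == total_rows

# Share of all records contributed by each record type.
shares = {name: count / total_rows for name, count in rows_by_type.items()}
for name, share in sorted(shares.items(), key=lambda item: item[1], reverse=True):
    print("{:<6} {:>14,d} {:>7.2%}".format(name, rows_by_type[name], share))
```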


### How are the files' compressed sizes (MBs) distributed?

In [11]:
df[["file_size_mbs"]].describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
file_size_mbs,439.0,125.092887,150.475805,0.680516,12.857474,50.817932,172.85405,582.556135
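
The mean above can be cross-checked against the total compressed size reported earlier (53.6287 GBs). The arithmetic below suggests that `file_size_gbs` is `file_size_mbs` divided by 1024 rather than 1000. This is an inference from the printed figures, not something the metadata file documents.

```python
# Figures copied from the outputs in this notebook.
num_files = 439
mean_size_mbs = 125.092887  # mean of file_size_mbs from describe()
total_size_gbs = 53.6287    # sum of file_size_gbs

# Total size in MBs reconstructed from the mean.
total_size_mbs = num_files * mean_size_mbs

# Compare both candidate divisors against the reported total.
gbs_binary = total_size_mbs / 1024
gbs_decimal = total_size_mbs / 1000
print("{:,.4f} GBs with divisor 1024".format(gbs_binary))
print("{:,.4f} GBs with divisor 1000".format(gbs_decimal))
```

Only the 1024 divisor reproduces the reported total to within rounding, which matters when comparing these sizes with tools that report decimal gigabytes.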


### Which files have the largest compressed sizes (GBs)?

In [12]:
df5 = df[["file_name", "file_size_gbs"]]
df5 = df5.sort_values(by="file_size_gbs", ascending=False)
df5.head(n=10)

Unnamed: 0,file_name,file_size_gbs
103,fhvhv_tripdata_2019-03.parquet,0.568902
112,fhvhv_tripdata_2019-12.parquet,0.537046
105,fhvhv_tripdata_2019-05.parquet,0.531619
111,fhvhv_tripdata_2019-11.parquet,0.522483
104,fhvhv_tripdata_2019-04.parquet,0.521221
114,fhvhv_tripdata_2020-02.parquet,0.520493
110,fhvhv_tripdata_2019-10.parquet,0.511641
285,yellow_tripdata_2010-05.parquet,0.50458
106,fhvhv_tripdata_2019-06.parquet,0.499368
113,fhvhv_tripdata_2020-01.parquet,0.494789


### Which files have the smallest compressed sizes (GBs)?

In [13]:
df6 = df[["file_name", "file_size_gbs"]]
df6 = df6.sort_values(by="file_size_gbs", ascending=True)
df6.head(n=10)

Unnamed: 0,file_name,file_size_gbs
230,green_tripdata_2020-04.parquet,0.000665
231,green_tripdata_2020-05.parquet,0.001008
240,green_tripdata_2021-02.parquet,0.001067
232,green_tripdata_2020-06.parquet,0.001122
251,green_tripdata_2022-01.parquet,0.001168
261,green_tripdata_2022-11.parquet,0.001183
233,green_tripdata_2020-07.parquet,0.001218
257,green_tripdata_2022-07.parquet,0.001222
239,green_tripdata_2021-01.parquet,0.001242
258,green_tripdata_2022-08.parquet,0.001254
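
The sort-then-`head` pattern used for the largest and smallest files could also be written with pandas' `nlargest` and `nsmallest`. The sketch below uses invented file names and sizes. It is an alternative formulation, not the notebook's own code.

```python
import pandas as pd

# Invented toy data that reuses the metadata column names.
toy = pd.DataFrame(
    {
        "file_name": ["a.parquet", "b.parquet", "c.parquet", "d.parquet"],
        "file_size_gbs": [0.5, 0.001, 0.2, 0.03],
    }
)

# Top and bottom two files by compressed size, original index preserved.
largest = toy.nlargest(2, "file_size_gbs")
smallest = toy.nsmallest(2, "file_size_gbs")

print(largest)
print(smallest)
```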


### How are the files' record (row) counts distributed?

In [14]:
df[["file_num_rows"]].describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
file_num_rows,439.0,7669713.0,6954584.0,35644.0,1234488.0,6310419.0,13912006.5,23904082.0


### Which files have the largest number of records (rows)?

In [15]:
df7 = df[["file_name", "file_num_rows"]]
df7 = df7.sort_values(by="file_num_rows", ascending=False)
df7["file_num_rows"] = df7["file_num_rows"].apply(lambda x: "{:,d}".format(x))
df7.head(n=10)

Unnamed: 0,file_name,file_num_rows
47,fhv_tripdata_2018-12.parquet,23904082
103,fhvhv_tripdata_2019-03.parquet,23864598
45,fhv_tripdata_2018-10.parquet,23289768
48,fhv_tripdata_2019-01.parquet,23159064
46,fhv_tripdata_2018-11.parquet,22911479
105,fhvhv_tripdata_2019-05.parquet,22329247
112,fhvhv_tripdata_2019-12.parquet,22243901
44,fhv_tripdata_2018-09.parquet,22151736
43,fhv_tripdata_2018-08.parquet,22120593
38,fhv_tripdata_2018-03.parquet,21985270


### Which files have the smallest number of records (rows)?

In [16]:
df8 = df[["file_name", "file_num_rows"]]
df8 = df8.sort_values(by="file_num_rows", ascending=True)
df8["file_num_rows"] = df8["file_num_rows"].apply(lambda x: "{:,d}".format(x))
df8.head(n=10)

Unnamed: 0,file_name,file_num_rows
230,green_tripdata_2020-04.parquet,35644
231,green_tripdata_2020-05.parquet,57361
261,green_tripdata_2022-11.parquet,62313
251,green_tripdata_2022-01.parquet,62495
232,green_tripdata_2020-06.parquet,63110
257,green_tripdata_2022-07.parquet,64192
240,green_tripdata_2021-02.parquet,64572
264,green_tripdata_2023-02.parquet,64809
266,green_tripdata_2023-04.parquet,65392
268,green_tripdata_2023-06.parquet,65550


### How many files does each column name appear in?

In [23]:
df10 = df["file_column_names"].str.split(",").explode()
df10 = pd.DataFrame(df10)
df10 = df10.groupby(by="file_column_names").size()
df10 = df10.reset_index(name="num_of_files")
df10 = df10.sort_values(by="num_of_files", ascending=False)
pd.set_option("display.max_rows", df10.shape[0])
df10

Unnamed: 0,file_column_names,num_of_files
2,DOLocationID,317
7,PULocationID,317
28,congestion_surcharge,317
42,mta_tax,284
65,trip_distance,272
37,fare_amount,272
62,total_amount,272
61,tolls_amount,272
58,tip_amount,272
56,store_and_fwd_flag,272
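
To make the split-and-explode step concrete, here is a minimal sketch on invented data. It uses `value_counts` in place of the `groupby(...).size()` call above, which should give the same counts. It assumes the names within one file are unique and separated by plain commas with no surrounding spaces, otherwise a name would be counted more than once per file or under a padded key.

```python
import pandas as pd

# Three hypothetical files with comma-separated column names.
toy = pd.DataFrame(
    {
        "file_column_names": [
            "PULocationID,DOLocationID,fare_amount",
            "PULocationID,DOLocationID",
            "PULocationID",
        ]
    }
)

# Split each string into a list, then give every name its own row.
names = toy["file_column_names"].str.split(",").explode()

# Count how many rows (and therefore files) mention each name.
counts = names.value_counts()
print(counts)
```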
