# NYC-TLC Green Trip Metadata Exploration

## Introduction

This notebook explores the file metadata of the [NYC Taxi and Limousine Commission Green Trip Record Data](https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page). It can also serve as a basis for deciding which green trip data files to download and use when performing a specific analysis.

**Note:** _This exploration does not include Green trips metadata for 2013_
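
As a minimal sketch of how the metadata can guide file selection, the example below filters a toy metadata table by year range and reports the number of rows and megabytes in the selection. The column names follow the metadata files used in this notebook, and the row counts and sizes reuse figures from this notebook's outputs. The helper `select_files` and the placeholder URLs are assumptions made for illustration.

```python
import pandas as pd

# Toy metadata with the notebook's column names. Row counts and sizes are
# taken from this notebook's outputs; the URLs are placeholders, not real links.
metadata = pd.DataFrame(
    {
        "file_name": [
            "green_tripdata_2015-05.parquet",
            "green_tripdata_2020-04.parquet",
            "green_tripdata_2024-02.parquet",
        ],
        "file_year": [2015, 2020, 2024],
        "file_month": [5, 4, 2],
        "file_num_rows": [1786848, 35644, 53577],
        "file_size_mbs": [25.469047, 0.680516, 1.224332],
        "file_cloudfront_url": [
            "https://example.invalid/green_tripdata_2015-05.parquet",
            "https://example.invalid/green_tripdata_2020-04.parquet",
            "https://example.invalid/green_tripdata_2024-02.parquet",
        ],
    }
)


def select_files(meta, start_year, end_year):
    """Hypothetical helper: keep files whose year is in [start_year, end_year]."""
    picked = meta[meta["file_year"].between(start_year, end_year)]
    return picked.sort_values(["file_year", "file_month"])


selection = select_files(metadata, 2020, 2024)
total_rows = int(selection["file_num_rows"].sum())
total_mbs = float(selection["file_size_mbs"].sum())

print(selection["file_cloudfront_url"].tolist())
print("{:,d} rows, {:.2f} MBs".format(total_rows, total_mbs))
```

Checking the totals before downloading makes it easier to judge whether a selection fits in memory.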

### Data Dictionary

See the [Data Dictionary – Green Taxi Trip Records](https://www.nyc.gov/assets/tlc/downloads/pdf/data_dictionary_trip_records_green.pdf) for column definitions.

## Extracting the Data

Change the year passed to `-y` to extract (or update) the metadata for that year.

In [1]:
# !python extract_trips_metadata.py -s web -t green -y 2024

## Loading the Data

### Import libraries

In [2]:
import glob
import matplotlib.pyplot as plt
import pyarrow as pa
import pandas as pd

from conf import DATASET_LOCAL_METADATA_PATH

### Load the data

In [3]:
METADATA_FILES = glob.glob(f"{DATASET_LOCAL_METADATA_PATH}/green_tripmetadata_*.csv")

In [4]:
df = pd.concat([pd.read_csv(file) for file in METADATA_FILES], ignore_index=True)
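
`pd.concat` raises a `ValueError` when it receives an empty list, which happens if the glob pattern matches no files. The sketch below wraps the load in a hypothetical helper, `load_metadata`, that sorts the matched paths for a reproducible row order and fails with a clearer message. The helper name and the throwaway demo files are assumptions, not part of the notebook's code.

```python
import glob
import os
import tempfile

import pandas as pd


def load_metadata(pattern):
    """Hypothetical helper: read every CSV matching `pattern` into one DataFrame."""
    files = sorted(glob.glob(pattern))  # sorted for a reproducible row order
    if not files:
        raise FileNotFoundError(f"No metadata files match {pattern!r}")
    return pd.concat([pd.read_csv(f) for f in files], ignore_index=True)


# Demo with two throwaway CSV files written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    for year, rows in [(2014, 1645787), (2015, 1786848)]:
        pd.DataFrame({"file_year": [year], "file_num_rows": [rows]}).to_csv(
            os.path.join(tmp, f"green_tripmetadata_{year}.csv"), index=False
        )
    demo = load_metadata(os.path.join(tmp, "green_tripmetadata_*.csv"))

    # A pattern that matches nothing should fail loudly.
    try:
        load_metadata(os.path.join(tmp, "missing_*.csv"))
        empty_pattern_failed = False
    except FileNotFoundError:
        empty_pattern_failed = True

print(demo)
```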

### Print data summary

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 125 entries, 0 to 124
Data columns (total 14 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   file_name               125 non-null    object 
 1   file_s3_url             125 non-null    object 
 2   file_cloudfront_url     125 non-null    object 
 3   file_record_type        125 non-null    object 
 4   file_year               125 non-null    int64  
 5   file_month              125 non-null    int64  
 6   file_modification_time  125 non-null    object 
 7   file_num_rows           125 non-null    int64  
 8   file_num_columns        125 non-null    int64  
 9   file_column_names       125 non-null    object 
 10  file_size_bytes         125 non-null    int64  
 11  file_size_mbs           125 non-null    float64
 12  file_size_gbs           125 non-null    float64
 13  file_metadata_source    125 non-null    object 
dtypes: float64(2), int64(5), object(7)
memory 

## Exploring the Data

### What is the total number of all records (rows)?

In [6]:
print("{:,d} records.".format(df["file_num_rows"].sum()))

83,109,529 records.


### What is the total compressed size (GBs) of all records?

In [7]:
print("{:,.4f} GBs.".format(df["file_size_gbs"].sum()))

1.2152 GBs.
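
The size columns in this notebook appear to use binary units (1 GB = 1,024 MBs). As a quick check, multiplying the mean of `file_size_mbs` from the `describe()` output (9.954786) by the 125 files and dividing by 1,024 reproduces the total above. Only the MB-to-GB step is checked here; how the extraction script converts bytes to MBs is not shown in this notebook.

```python
# Figures taken from this notebook's outputs.
n_files = 125
mean_size_mbs = 9.954786

total_mbs = n_files * mean_size_mbs
total_gbs_binary = total_mbs / 1024   # 1 GB = 1,024 MBs
total_gbs_decimal = total_mbs / 1000  # 1 GB = 1,000 MBs, for comparison

print("{:,.2f} MBs".format(total_mbs))
print("{:.4f} GBs (binary)".format(total_gbs_binary))    # matches the 1.2152 total
print("{:.4f} GBs (decimal)".format(total_gbs_decimal))  # does not match
```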


### Which years do the records cover?

In [8]:
pd.DataFrame({"file_year": sorted(df["file_year"].unique())})

Unnamed: 0,file_year
0,2014
1,2015
2,2016
3,2017
4,2018
5,2019
6,2020
7,2021
8,2022
9,2023


### What is the total number of records (rows) per year?

In [9]:
df2 = df[["file_year", "file_num_rows"]].groupby(by="file_year").sum()
df2 = df2.reset_index()
df2 = df2.sort_values(by="file_num_rows", ascending=False)
df2["file_num_rows"] = df2["file_num_rows"].apply(lambda x: "{:,d}".format(x))
df2

Unnamed: 0,file_year,file_num_rows
1,2015,19233765
2,2016,16385541
0,2014,15837009
3,2017,11737059
4,2018,8899718
5,2019,6300985
6,2020,1734176
7,2021,1068755
8,2022,840402
9,2023,787060


### What is the total compressed size (GBs) of records per year?

In [10]:
df3 = df[["file_year", "file_size_gbs"]].groupby(by="file_year").sum()
df3 = df3.reset_index()
df3 = df3.sort_values(by="file_size_gbs", ascending=False)
df3["file_size_gbs"] = df3["file_size_gbs"].apply(lambda x: "{:,.4f}".format(x))
df3

Unnamed: 0,file_year,file_size_gbs
1,2015,0.2703
2,2016,0.2334
0,2014,0.2216
3,2017,0.1711
4,2018,0.134
5,2019,0.0981
6,2020,0.028
7,2021,0.0189
9,2023,0.0174
8,2022,0.0161
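
The two per-year summaries can also be computed in a single `groupby` with named aggregation. Keeping the columns numeric and formatting only a display copy means the result can still be sorted or plotted. This is a minimal sketch on a four-file sample whose values come from this notebook's outputs. It uses `file_size_mbs` rather than `file_size_gbs`, and the output column names are assumptions.

```python
import pandas as pd

# Four files from this notebook's outputs (2015-05, 2015-03, 2020-04, 2020-05).
sample = pd.DataFrame(
    {
        "file_year": [2015, 2015, 2020, 2020],
        "file_num_rows": [1786848, 1722574, 35644, 57361],
        "file_size_mbs": [25.469047, 24.695971, 0.680516, 1.032434],
    }
)

# One pass over the data: file count, total rows and total size per year.
per_year = (
    sample.groupby("file_year")
    .agg(
        num_files=("file_num_rows", "size"),
        total_rows=("file_num_rows", "sum"),
        total_mbs=("file_size_mbs", "sum"),
    )
    .sort_values("total_rows", ascending=False)
)

# Format a copy for display; `per_year` itself stays numeric.
display_copy = per_year.assign(
    total_rows=per_year["total_rows"].map("{:,d}".format),
    total_mbs=per_year["total_mbs"].map("{:,.4f}".format),
)
print(display_copy)
```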


### How are compressed file sizes (MBs) distributed?

In [11]:
df[["file_size_mbs"]].describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
file_size_mbs,125.0,9.954786,8.249145,0.680516,1.491245,9.83886,17.886513,25.469047


### Which files have the largest compressed sizes (MBs)?

In [12]:
df5 = df[["file_name", "file_size_mbs"]]
df5 = df5.sort_values(by="file_size_mbs", ascending=False)
df5.head(n=10)

Unnamed: 0,file_name,file_size_mbs
76,green_tripdata_2015-05.parquet,25.469047
74,green_tripdata_2015-03.parquet,24.695971
75,green_tripdata_2015-04.parquet,23.883558
77,green_tripdata_2015-06.parquet,23.561923
81,green_tripdata_2015-10.parquet,23.466003
35,green_tripdata_2014-12.parquet,23.282601
83,green_tripdata_2015-12.parquet,23.207394
115,green_tripdata_2016-03.parquet,22.681435
73,green_tripdata_2015-02.parquet,22.503124
117,green_tripdata_2016-05.parquet,22.405044


### Which files have the smallest compressed sizes (MBs)?

In [13]:
df6 = df[["file_name", "file_size_mbs"]]
df6 = df6.sort_values(by="file_size_mbs", ascending=True)
df6.head(n=10)

Unnamed: 0,file_name,file_size_mbs
104,green_tripdata_2020-04.parquet,0.680516
105,green_tripdata_2020-05.parquet,1.032434
13,green_tripdata_2021-02.parquet,1.092605
106,green_tripdata_2020-06.parquet,1.149343
60,green_tripdata_2022-01.parquet,1.196185
70,green_tripdata_2022-11.parquet,1.211475
85,green_tripdata_2024-02.parquet,1.224332
107,green_tripdata_2020-07.parquet,1.247594
66,green_tripdata_2022-07.parquet,1.251557
12,green_tripdata_2021-01.parquet,1.271743


### How are the numbers of records (rows) per file distributed?

In [14]:
df[["file_num_rows"]].describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
file_num_rows,125.0,664876.232,582414.676975,35644.0,76136.0,615594.0,1224158.0,1786848.0


### Which files have the largest number of records (rows)?

In [15]:
df7 = df[["file_name", "file_num_rows"]]
df7 = df7.sort_values(by="file_num_rows", ascending=False)
df7["file_num_rows"] = df7["file_num_rows"].apply(lambda x: "{:,d}".format(x))
df7.head(n=10)

Unnamed: 0,file_name,file_num_rows
76,green_tripdata_2015-05.parquet,1786848
74,green_tripdata_2015-03.parquet,1722574
75,green_tripdata_2015-04.parquet,1664394
35,green_tripdata_2014-12.parquet,1645787
77,green_tripdata_2015-06.parquet,1638868
81,green_tripdata_2015-10.parquet,1630536
83,green_tripdata_2015-12.parquet,1608297
115,green_tripdata_2016-03.parquet,1576393
73,green_tripdata_2015-02.parquet,1574830
34,green_tripdata_2014-11.parquet,1548159


### Which files have the smallest number of records (rows)?

In [16]:
df8 = df[["file_name", "file_num_rows"]]
df8 = df8.sort_values(by="file_num_rows", ascending=True)
df8["file_num_rows"] = df8["file_num_rows"].apply(lambda x: "{:,d}".format(x))
df8.head(n=10)

Unnamed: 0,file_name,file_num_rows
104,green_tripdata_2020-04.parquet,35644
85,green_tripdata_2024-02.parquet,53577
87,green_tripdata_2024-04.parquet,56471
84,green_tripdata_2024-01.parquet,56551
105,green_tripdata_2020-05.parquet,57361
86,green_tripdata_2024-03.parquet,57457
43,green_tripdata_2023-08.parquet,60649
88,green_tripdata_2024-05.parquet,61003
42,green_tripdata_2023-07.parquet,61343
70,green_tripdata_2022-11.parquet,62313


### How do column names change across files?

In [17]:
df9 = df[["file_year", "file_column_names"]].groupby(by=["file_year", "file_column_names"]).size()
df9 = df9.reset_index(name="num_of_files")
pd.set_option('display.max_colwidth', None)
df9

Unnamed: 0,file_year,file_column_names,num_of_files
0,2014,"VendorID,lpep_pickup_datetime,lpep_dropoff_datetime,store_and_fwd_flag,RatecodeID,PULocationID,DOLocationID,passenger_count,trip_distance,fare_amount,extra,mta_tax,tip_amount,tolls_amount,ehail_fee,improvement_surcharge,total_amount,payment_type,trip_type,congestion_surcharge",12
1,2015,"VendorID,lpep_pickup_datetime,lpep_dropoff_datetime,store_and_fwd_flag,RatecodeID,PULocationID,DOLocationID,passenger_count,trip_distance,fare_amount,extra,mta_tax,tip_amount,tolls_amount,ehail_fee,improvement_surcharge,total_amount,payment_type,trip_type,congestion_surcharge",12
2,2016,"VendorID,lpep_pickup_datetime,lpep_dropoff_datetime,store_and_fwd_flag,RatecodeID,PULocationID,DOLocationID,passenger_count,trip_distance,fare_amount,extra,mta_tax,tip_amount,tolls_amount,ehail_fee,improvement_surcharge,total_amount,payment_type,trip_type,congestion_surcharge",12
3,2017,"VendorID,lpep_pickup_datetime,lpep_dropoff_datetime,store_and_fwd_flag,RatecodeID,PULocationID,DOLocationID,passenger_count,trip_distance,fare_amount,extra,mta_tax,tip_amount,tolls_amount,ehail_fee,improvement_surcharge,total_amount,payment_type,trip_type,congestion_surcharge",12
4,2018,"VendorID,lpep_pickup_datetime,lpep_dropoff_datetime,store_and_fwd_flag,RatecodeID,PULocationID,DOLocationID,passenger_count,trip_distance,fare_amount,extra,mta_tax,tip_amount,tolls_amount,ehail_fee,improvement_surcharge,total_amount,payment_type,trip_type,congestion_surcharge",12
5,2019,"VendorID,lpep_pickup_datetime,lpep_dropoff_datetime,store_and_fwd_flag,RatecodeID,PULocationID,DOLocationID,passenger_count,trip_distance,fare_amount,extra,mta_tax,tip_amount,tolls_amount,ehail_fee,improvement_surcharge,total_amount,payment_type,trip_type,congestion_surcharge",12
6,2020,"VendorID,lpep_pickup_datetime,lpep_dropoff_datetime,store_and_fwd_flag,RatecodeID,PULocationID,DOLocationID,passenger_count,trip_distance,fare_amount,extra,mta_tax,tip_amount,tolls_amount,ehail_fee,improvement_surcharge,total_amount,payment_type,trip_type,congestion_surcharge",12
7,2021,"VendorID,lpep_pickup_datetime,lpep_dropoff_datetime,store_and_fwd_flag,RatecodeID,PULocationID,DOLocationID,passenger_count,trip_distance,fare_amount,extra,mta_tax,tip_amount,tolls_amount,ehail_fee,improvement_surcharge,total_amount,payment_type,trip_type,congestion_surcharge",12
8,2022,"VendorID,lpep_pickup_datetime,lpep_dropoff_datetime,store_and_fwd_flag,RatecodeID,PULocationID,DOLocationID,passenger_count,trip_distance,fare_amount,extra,mta_tax,tip_amount,tolls_amount,ehail_fee,improvement_surcharge,total_amount,payment_type,trip_type,congestion_surcharge",12
9,2023,"VendorID,lpep_pickup_datetime,lpep_dropoff_datetime,store_and_fwd_flag,RatecodeID,PULocationID,DOLocationID,passenger_count,trip_distance,fare_amount,extra,mta_tax,tip_amount,tolls_amount,ehail_fee,improvement_surcharge,total_amount,payment_type,trip_type,congestion_surcharge",12
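
When every year shows the same column list, as in the table above, a more compact check is to count the distinct schemas and record the range of years each one spans. The sketch below does this on a toy table. The file names and the second schema (with `extra_col`) are hypothetical and only show how a schema change would appear.

```python
import pandas as pd

# Toy metadata: two files share a schema, one has an extra (hypothetical) column.
sample = pd.DataFrame(
    {
        "file_name": ["a.parquet", "b.parquet", "c.parquet"],
        "file_year": [2014, 2015, 2015],
        "file_column_names": [
            "VendorID,trip_distance",
            "VendorID,trip_distance",
            "VendorID,trip_distance,extra_col",
        ],
    }
)

# A single distinct value would mean the schema never changed.
num_schemas = sample["file_column_names"].nunique()

# One row per schema, with how many files use it and over which years.
schemas = (
    sample.groupby("file_column_names")
    .agg(
        num_of_files=("file_name", "size"),
        first_year=("file_year", "min"),
        last_year=("file_year", "max"),
    )
    .reset_index()
)

print(num_schemas)
print(schemas)
```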


### How many files does each column name appear in?

In [18]:
df10 = df["file_column_names"].str.split(",").explode()
df10 = pd.DataFrame(df10)
df10 = df10.groupby(by="file_column_names").size()
df10 = df10.reset_index(name="num_of_files")
df10 = df10.sort_values(by="num_of_files", ascending=False)
df10

Unnamed: 0,file_column_names,num_of_files
0,DOLocationID,125
1,PULocationID,125
18,trip_distance,125
17,total_amount,125
16,tolls_amount,125
15,tip_amount,125
14,store_and_fwd_flag,125
13,payment_type,125
12,passenger_count,125
11,mta_tax,125
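
Comparing each count with the number of files highlights columns that are missing from some files. The sketch below uses `value_counts()` on the exploded column names. The toy schemas are hypothetical and chosen so that one column appears in only one file.

```python
import pandas as pd

# Toy metadata: `ehail_fee` is present in only one of the three files.
sample = pd.DataFrame(
    {
        "file_column_names": [
            "VendorID,trip_distance,ehail_fee",
            "VendorID,trip_distance",
            "VendorID,trip_distance",
        ]
    }
)

# Number of files in which each column name appears.
counts = sample["file_column_names"].str.split(",").explode().value_counts()

# Columns that do not appear in every file.
partial = counts[counts < len(sample)]

print(counts)
print(partial)
```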


### Which files have longitude and latitude columns?

In [19]:
df11 = df[(df["file_column_names"].str.contains("long", case=False) | 
           df["file_column_names"].str.contains("lat", case=False))]
df11 = df11[["file_size_mbs", "file_cloudfront_url"]]
pd.set_option('display.max_colwidth', None)
df11

Unnamed: 0,file_size_mbs,file_cloudfront_url
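
The substring test above returns no files here, but in general it can give false positives, because `"lat"` matches any column name that merely contains those letters. A stricter variant, sketched below, splits the column list and matches whole underscore-delimited tokens. The column names `Pickup_longitude`, `Pickup_latitude` and `flat_rate`, the token list and the helper `has_geo_column` are all assumptions for illustration.

```python
import re

import pandas as pd

# Toy metadata: one file with coordinate columns, one without,
# and one whose column name only contains the letters "lat".
sample = pd.DataFrame(
    {
        "file_name": ["old.parquet", "new.parquet", "tricky.parquet"],
        "file_column_names": [
            "VendorID,Pickup_longitude,Pickup_latitude",
            "VendorID,PULocationID,DOLocationID",
            "VendorID,flat_rate",
        ],
    }
)

# Match coordinate words only as whole tokens delimited by "_" or the name boundary.
GEO_TOKEN = re.compile(
    r"(?:^|_)(?:lat|lon|long|latitude|longitude)(?:$|_)", re.IGNORECASE
)


def has_geo_column(column_names):
    """Hypothetical helper: True if any column name has a coordinate token."""
    return any(GEO_TOKEN.search(name) for name in column_names.split(","))


sample["has_geo"] = sample["file_column_names"].apply(has_geo_column)
naive = sample["file_column_names"].str.contains("lat", case=False)

print(sample[["file_name", "has_geo"]])
print(naive.tolist())
```

On this toy table the naive substring test flags `tricky.parquet` because of `flat_rate`, while the token-based test does not.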
