### Background

This notebook conducts exploratory data analysis on NYC taxi datasets. The conclusions drawn from this analysis will help us design better tests that use these datasets.

The primary focus of this analysis is on numeric values, namely monetary values around fares and tolls.

The analysis can be run per year or on a single dataframe that aggregates all years. Comment or uncomment the relevant lines in each cell to switch between the two.
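
As an alternative to toggling comments, one option is to tag each yearly dataframe with a `year` column before combining, so per-year and overall statistics come from the same frame. The sketch below is only an illustration: the helper `tag_and_combine` and the toy data are hypothetical and not part of the original notebook.

```python
import pandas as pd


def tag_and_combine(frames_by_year):
    """Add a `year` column to each frame, then stack them into one dataframe.

    `frames_by_year` is a hypothetical dict such as {2018: df_18, 2019: df_19}.
    """
    tagged = [frame.assign(year=year) for year, frame in frames_by_year.items()]
    return pd.concat(tagged, ignore_index=True)


# Toy data standing in for the real yearly dataframes.
toy_18 = pd.DataFrame({"fare_amount": [1.0, 3.0]})
toy_19 = pd.DataFrame({"fare_amount": [10.0]})
combined = tag_and_combine({2018: toy_18, 2019: toy_19})

# Per-year statistics without commenting lines in or out.
per_year = combined.groupby("year")["fare_amount"].describe()
print(per_year)
```

With the `year` column in place, `combined.describe()` still gives the aggregate view.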

In [74]:
### Imports

import os
import glob
import pandas as pd


In [75]:
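### Concatenate all CSVs into one dataframe per year

def create_yearly_df(year):
    """Load every CSV in the working directory whose name contains `year`."""
    dfs = []
    for file in os.listdir("."):
        if file.endswith(".csv") and str(year) in file:
            file_df = pd.read_csv(file, index_col=None, header=0)
            dfs.append(file_df)

    # Note: pd.concat raises a ValueError if no file matched the year.
    df = pd.concat(dfs)
    # Drop these ID columns where present; absent ones are skipped.
    id_cols = ["VendorID", "RatecodeID", "PULocationID", "DOLocationID"]
    df = df.drop(columns=id_cols, errors="ignore")

    return df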


df_18 = create_yearly_df(2018)
df_19 = create_yearly_df(2019)
df_20 = create_yearly_df(2020)
df = pd.concat([df_18, df_19, df_20], ignore_index=True)

# print(f"2018: {df_18.shape}")
# print(f"2019: {df_19.shape}")
# print(f"2020: {df_20.shape}")
print(f"All: {df.shape}")


All: (360000, 18)
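
The notebook imports `glob` but never uses it. A minimal sketch of a glob-based loader follows, assuming the same file naming (year somewhere in the name, `.csv` extension). The function name `load_year`, the `data_dir` parameter, and the `DROP_COLS` constant are hypothetical. Unlike `create_yearly_df`, this version resets the index and fails early with a descriptive error when no file matches.

```python
import glob
import os

import pandas as pd

# Hypothetical constant: the columns the notebook discards when present.
DROP_COLS = ["VendorID", "RatecodeID", "PULocationID", "DOLocationID"]


def load_year(year, data_dir="."):
    """Concatenate every CSV in data_dir whose file name contains the year."""
    pattern = os.path.join(data_dir, f"*{year}*.csv")
    paths = sorted(glob.glob(pattern))
    if not paths:
        # pd.concat would raise a less specific ValueError on an empty list,
        # so fail early with a message that names the pattern.
        raise FileNotFoundError(f"no CSV files match {pattern}")
    frames = [pd.read_csv(path) for path in paths]
    combined = pd.concat(frames, ignore_index=True)
    # errors="ignore" skips any listed column that is absent.
    return combined.drop(columns=DROP_COLS, errors="ignore")
```

Sorting the paths keeps the row order reproducible across runs, since `glob.glob` does not guarantee an order.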


In [76]:
### General stats

# df_18.describe()
# df_19.describe()
# df_20.describe()
df.describe()

statistic,vendor_id,passenger_count,trip_distance,rate_code_id,pickup_location_id,dropoff_location_id,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,congestion_surcharge
count,341756.0,351756.0,360000.0,341756.0,350000.0,350000.0,351756.0,360000.0,360000.0,360000.0,360000.0,360000.0,360000.0,360000.0,233673.0
mean,1.628712,1.523223,3.112494,1.050709,162.817074,160.24936,1.307395,14.194708,0.80938,0.494658,1.992985,0.345135,0.298431,19.044735,2.148283
std,0.499568,1.183894,78.134036,0.748074,66.89527,71.069931,0.490034,715.91409,1.102911,0.087537,2.726035,1.784045,0.02972,715.953363,0.88707
min,1.0,0.0,-24.49,1.0,1.0,1.0,1.0,-281.0,-36.71,-0.5,-2.55,-6.12,-0.3,-289.92,-2.5
25%,1.0,1.0,0.99,1.0,114.0,107.0,1.0,6.5,0.0,0.5,0.0,0.0,0.3,10.12,2.5
50%,2.0,1.0,1.66,1.0,162.0,162.0,1.0,9.0,0.5,0.5,1.66,0.0,0.3,13.56,2.5
75%,2.0,2.0,3.1,1.0,234.0,234.0,2.0,14.5,1.0,0.5,2.75,0.0,0.3,19.56,2.5
max,4.0,7.0,46826.21,99.0,265.0,265.0,4.0,429490.2,7.0,37.51,400.04,500.05,0.3,429491.0,2.75


The summary statistics show unexpected negative values in the following monetary columns:

    fare_amount
    extra
    mta_tax
    tip_amount
    tolls_amount
    improvement_surcharge
    total_amount
    congestion_surcharge

`trip_distance` also has a negative minimum (-24.49), but it falls outside the monetary focus of this analysis.

Let's see what share of the rows these negative values account for.

In [77]:
### Negative values

cols = [
    "fare_amount", 
    "extra", 
    "mta_tax", 
    "tip_amount", 
    "tolls_amount", 
    "improvement_surcharge", 
    "total_amount", 
    "congestion_surcharge"
]

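def count_negs(df, col):
    """Print how many rows of `df` are negative in `col`, and their share of all rows."""
    num = df[df[col] < 0].shape[0]
    print(f"Column {col} has {num} negative values")
    # The denominator is every row in df, including rows where col is NaN.
    print(f"This comprises {num/df.shape[0]:%} of the total dataset\n")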
    
for col in cols:
#     count_negs(df_18, col)
#     count_negs(df_19, col)
#     count_negs(df_20, col)
    count_negs(df, col)


Column fare_amount has 833 negative values
This comprises 0.231389% of the total dataset

Column extra has 363 negative values
This comprises 0.100833% of the total dataset

Column mta_tax has 816 negative values
This comprises 0.226667% of the total dataset

Column tip_amount has 8 negative values
This comprises 0.002222% of the total dataset

Column tolls_amount has 21 negative values
This comprises 0.005833% of the total dataset

Column improvement_surcharge has 830 negative values
This comprises 0.230556% of the total dataset

Column total_amount has 832 negative values
This comprises 0.231111% of the total dataset

Column congestion_surcharge has 585 negative values
This comprises 0.162500% of the total dataset
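
The per-column loop above can also be expressed as one vectorized summary. The counts for `fare_amount`, `improvement_surcharge`, and `total_amount` are nearly identical, which suggests (but does not prove) that the same rows are negative across several columns. The sketch below includes a helper to check that. Both function names and the toy data are hypothetical.

```python
import pandas as pd


def negative_summary(df, cols):
    """Count and percentage of negative values per column (NaN counts as not negative)."""
    present = [c for c in cols if c in df.columns]
    is_negative = df[present] < 0
    return pd.DataFrame({
        "n_negative": is_negative.sum(),
        # Same denominator as count_negs: all rows in df.
        "pct_of_rows": is_negative.mean() * 100,
    })


def negative_overlap(df, cols):
    """For each k, count the rows in which exactly k of the columns are negative."""
    present = [c for c in cols if c in df.columns]
    per_row = (df[present] < 0).sum(axis=1)
    return per_row.value_counts().sort_index()


# Toy data standing in for the real dataframe.
toy = pd.DataFrame({
    "fare_amount": [-1.0, 2.0, 3.0, -4.0],
    "total_amount": [-1.5, 2.0, 3.0, 4.0],
})
summary = negative_summary(toy, ["fare_amount", "total_amount", "not_a_column"])
overlap = negative_overlap(toy, ["fare_amount", "total_amount"])
print(summary)
print(overlap)
```

If most affected rows in the real data show several negative columns at once, that would be consistent with whole-trip reversals rather than isolated data-entry errors, though confirming this needs a look at the rows themselves.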



In [85]:
### NaN / null values

# df_18.isna().sum()
# df_19.isna().sum()
# df_20.isna().sum()
df.isna().sum()


vendor_id                 18244
pickup_datetime               0
dropoff_datetime              0
passenger_count            8244
trip_distance                 0
rate_code_id              18244
store_and_fwd_flag         8244
pickup_location_id        10000
dropoff_location_id       10000
payment_type               8244
fare_amount                   0
extra                         0
mta_tax                       0
tip_amount                    0
tolls_amount                  0
improvement_surcharge         0
total_amount                  0
congestion_surcharge     126327
dtype: int64
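
Raw null counts are easier to compare as shares of the rows, and a per-year breakdown shows whether the gaps are concentrated in particular years. A minimal sketch, assuming the yearly dataframes are available in a dict. The helper names `null_share` and `null_share_by_year` and the toy data are hypothetical.

```python
import pandas as pd


def null_share(df):
    """Percentage of missing values per column, largest first."""
    return (df.isna().mean() * 100).sort_values(ascending=False)


def null_share_by_year(frames_by_year):
    """One row per year, one column per field, values in percent."""
    shares = {year: frame.isna().mean() * 100
              for year, frame in frames_by_year.items()}
    return pd.DataFrame(shares).T


# Toy data standing in for df_18 and df_19.
toy_18 = pd.DataFrame({"fare_amount": [1.0, None], "congestion_surcharge": [None, None]})
toy_19 = pd.DataFrame({"fare_amount": [1.0, 2.0], "congestion_surcharge": [2.5, None]})

overall = null_share(pd.concat([toy_18, toy_19], ignore_index=True))
by_year = null_share_by_year({2018: toy_18, 2019: toy_19})
print(overall)
print(by_year)
```

On the real data this would be called as `null_share_by_year({2018: df_18, 2019: df_19, 2020: df_20})`.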