## **Data Cleaning**

 **Importing basic libraries**

In [10]:
import numpy as np
import pandas as pd

**Importing data**

In [11]:
# importing data
nyc_taxi = pd.read_parquet("data/yellow_tripdata_2022-03.parquet")

**Removing non essential features**

*In our analysis `store_and_fwd_flag`, `fare_amount`, `extra` , `mta_tax`, `tip_amount` ,`tolls_amount` ,
`improvement_surcharge`, `congestion_surcharge`, `airport_fee` are dropped.*

In [13]:
drop_cols = ["store_and_fwd_flag", "store_and_fwd_flag", "fare_amount", "extra" , "mta_tax", "tip_amount",
              "tolls_amount" , "improvement_surcharge", "congestion_surcharge", "airport_fee"]
nyc_taxi.drop(drop_cols, axis=1,inplace=True)

### **Missing Values**

In [14]:
nyc_taxi.isna().sum()

VendorID                      0
tpep_pickup_datetime          0
tpep_dropoff_datetime         0
passenger_count          117814
trip_distance                 0
RatecodeID               117814
PULocationID                  0
DOLocationID                  0
payment_type                  0
total_amount                  0
dtype: int64

*There are `117814` values are missing from column `passenger_count` and `Rate_Code_ID`*

**Dealing with RatecodeID**

*Majority of instances belong to category 1 and 2. If the payment value lies within 1 SD of mean value of total_amount for category 2 
we assign it values 2 all else are assigned value 1*

In [15]:
nyc_taxi["RatecodeID"].value_counts()/len(nyc_taxi)

1.0     0.921587
2.0     0.033114
5.0     0.005875
99.0    0.003550
3.0     0.002237
4.0     0.001156
6.0     0.000007
Name: RatecodeID, dtype: float64

In [16]:
miss_pos = nyc_taxi[nyc_taxi["RatecodeID"].isna()].index
meanrt2 = nyc_taxi[nyc_taxi["RatecodeID"] == 2.0]["total_amount"].mean()
stdrt2 = nyc_taxi[nyc_taxi["RatecodeID"] == 2.0]["total_amount"].std()
for pos in miss_pos:
    payment = nyc_taxi["total_amount"][pos]
    if (payment >= (meanrt2 - stdrt2)) | (payment <= meanrt2 + stdrt2):
        nyc_taxi.loc[pos,"RatecodeID"] = 2.0
    else:
        nyc_taxi.loc[pos,"RatecodeID"] = 1.0

**Dealing with passenger_count**

In [17]:
np.round(nyc_taxi["passenger_count"].value_counts()/len(nyc_taxi),2)

1.0    0.72
2.0    0.14
3.0    0.04
0.0    0.02
5.0    0.02
4.0    0.01
6.0    0.01
7.0    0.00
8.0    0.00
9.0    0.00
Name: passenger_count, dtype: float64

*Impute all missing class with  `category 1` as it is the majority class in most instances*

In [18]:
nyc_taxi.loc[nyc_taxi[nyc_taxi["passenger_count"].isna()].index, "passenger_count"] = 1.0

*All missing instances are now filled with appropriate value*

In [20]:
nyc_taxi.isna().sum()

VendorID                 0
tpep_pickup_datetime     0
tpep_dropoff_datetime    0
passenger_count          0
trip_distance            0
RatecodeID               0
PULocationID             0
DOLocationID             0
payment_type             0
total_amount             0
dtype: int64

### **Data Type**

In [21]:
nyc_taxi.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3627882 entries, 0 to 3627881
Data columns (total 10 columns):
 #   Column                 Dtype         
---  ------                 -----         
 0   VendorID               int64         
 1   tpep_pickup_datetime   datetime64[ns]
 2   tpep_dropoff_datetime  datetime64[ns]
 3   passenger_count        float64       
 4   trip_distance          float64       
 5   RatecodeID             float64       
 6   PULocationID           int64         
 7   DOLocationID           int64         
 8   payment_type           int64         
 9   total_amount           float64       
dtypes: datetime64[ns](2), float64(4), int64(4)
memory usage: 276.8 MB


*Since `passenger_count` and `RatecodeID` are categorical variables we will change there data type from float to int*

In [22]:
nyc_taxi["passenger_count"] = nyc_taxi["passenger_count"].astype("int")
nyc_taxi["RatecodeID"] = nyc_taxi["RatecodeID"].astype("int")

*Convert the `int64` datatype to `int16` to save memory space*

In [23]:
for col in nyc_taxi.columns:
    if nyc_taxi[col].dtype == "int64":
        nyc_taxi[col] = nyc_taxi[col].astype("int16")
    else:
        pass

*We don't convert the `float64` because it needs `high precision`*

In [25]:
# required format
nyc_taxi.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3627882 entries, 0 to 3627881
Data columns (total 10 columns):
 #   Column                 Dtype         
---  ------                 -----         
 0   VendorID               int16         
 1   tpep_pickup_datetime   datetime64[ns]
 2   tpep_dropoff_datetime  datetime64[ns]
 3   passenger_count        int16         
 4   trip_distance          float64       
 5   RatecodeID             int16         
 6   PULocationID           int16         
 7   DOLocationID           int16         
 8   payment_type           int16         
 9   total_amount           float64       
dtypes: datetime64[ns](2), float64(2), int16(6)
memory usage: 152.2 MB


*Reduce the size of dataframe by 45%*

### **Saving the dataframe in pickle file**

In [28]:
nyc_taxi.to_pickle("data/nyc_taxi.pickle")

In [29]:
nyc_taxi.describe()

Unnamed: 0,VendorID,passenger_count,trip_distance,RatecodeID,PULocationID,DOLocationID,payment_type,total_amount
count,3627882.0,3627882.0,3627882.0,3627882.0,3627882.0,3627882.0,3627882.0,3627882.0
mean,1.714948,1.376396,5.76129,1.444936,164.9635,163.0534,1.180307,20.59364
std,0.4984502,0.9596307,569.4616,5.837187,65.03559,69.97796,0.4971751,16.53309
min,1.0,0.0,0.0,1.0,1.0,1.0,0.0,-895.3
25%,1.0,1.0,1.1,1.0,132.0,113.0,1.0,11.84
50%,2.0,1.0,1.83,1.0,162.0,162.0,1.0,15.36
75%,2.0,1.0,3.4,1.0,234.0,234.0,1.0,21.82
max,6.0,9.0,286259.8,99.0,265.0,265.0,5.0,1783.85
