## 1.6 Homework

The goal of this homework is to train a simple model for predicting the duration of a ride, similar to what we did in this module.


## Q1. Downloading the data

We'll use [the same NYC taxi dataset](https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page),
but instead of "Green Taxi Trip Records", we'll use "For-Hire Vehicle Trip Records".

Download the data for January and February 2021.

Note that you need "For-Hire Vehicle Trip Records", not "High Volume For-Hire Vehicle Trip Records".

Read the data for January. How many records are there?

* 1054112
* 1154112
* 1254112
* 1354112

In [1]:
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings("ignore")

In [2]:
df_jan = pd.read_parquet("data/fhv_tripdata_2021-01.parquet")
print(df_jan.shape[0])

1154112


## Q2. Computing duration

Now let's compute the `duration` variable. It should contain the duration of a ride in minutes.

What's the average trip duration in January?

* 15.16
* 19.16
* 24.16
* 29.16

In [3]:
# Vectorized conversion of the timedelta to minutes (no per-row lambda needed)
df_jan['duration'] = (df_jan['dropOff_datetime'] - df_jan['pickup_datetime']).dt.total_seconds() / 60
print(df_jan['duration'].mean())

19.167224093791006


## Data preparation

Check the distribution of the duration variable. There are some outliers.

Let's remove them and keep only the records where the duration was between 1 and 60 minutes (inclusive).

How many records did you drop?

In [4]:
# .copy() makes the filtered frame independent, so later column assignments are unambiguous
df_jan_prep = df_jan.query("1 <= duration <= 60").copy()
df_jan_prep.shape[0]

1109826

In [5]:
print(f"We dropped {df_jan.shape[0] - df_jan_prep.shape[0]} observations")

We dropped 44286 observations
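
One way to check the distribution before settling on the cut-offs is to look at a few percentiles. The sketch below uses a small synthetic `duration` column (made-up values, not the taxi data) and a hypothetical helper named `filter_duration`. Only the 1 and 60 minute bounds come from the text above.

```python
import pandas as pd

# Synthetic durations in minutes (illustrative values only, not the FHV data).
toy = pd.DataFrame({"duration": [0.2, 1.0, 5.5, 12.0, 30.0, 60.0, 61.0, 400.0]})

# Percentiles make the long right tail visible.
print(toy["duration"].describe(percentiles=[0.5, 0.95, 0.99]))


def filter_duration(df, low=1, high=60):
    """Keep rows whose duration lies in [low, high], bounds inclusive."""
    mask = df["duration"].between(low, high)  # inclusive on both ends by default
    return df[mask].copy()  # copy so later column assignments do not warn


kept = filter_duration(toy)
dropped = len(toy) - len(kept)
print(f"kept {len(kept)} rows, dropped {dropped}")
```

`Series.between` is inclusive on both ends by default, which matches the "between 1 and 60 minutes (inclusive)" requirement.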


## Q3. Missing values

The features we'll use for our model are the pickup and dropoff location IDs.

But they have a lot of missing values. Let's replace them with "-1".

What's the fraction of missing values for the pickup location ID? (Or the fraction of "-1"s after you filled the NAs)

* 53%
* 63%
* 73%
* 83%

In [6]:
df_jan_prep.isna().sum()

dispatching_base_num            0
pickup_datetime                 0
dropOff_datetime                0
PUlocationID               927008
DOlocationID               147907
SR_Flag                   1109826
Affiliated_base_number        773
duration                        0
dtype: int64

In [7]:
df_jan_prep.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1109826 entries, 0 to 1154111
Data columns (total 8 columns):
 #   Column                  Non-Null Count    Dtype         
---  ------                  --------------    -----         
 0   dispatching_base_num    1109826 non-null  object        
 1   pickup_datetime         1109826 non-null  datetime64[ns]
 2   dropOff_datetime        1109826 non-null  datetime64[ns]
 3   PUlocationID            182818 non-null   float64       
 4   DOlocationID            961919 non-null   float64       
 5   SR_Flag                 0 non-null        float64       
 6   Affiliated_base_number  1109053 non-null  object        
 7   duration                1109826 non-null  float64       
dtypes: datetime64[ns](2), float64(4), object(2)
memory usage: 76.2+ MB


In [8]:
cols_fill_miss = ['PUlocationID', 'DOlocationID']
# Replace missing location IDs with the sentinel value -1
for col in cols_fill_miss:
    df_jan_prep[col] = df_jan_prep[col].fillna(-1)

In [9]:
share_of_miss_PU = df_jan_prep.query('PUlocationID == -1').shape[0] / df_jan_prep.shape[0]
print(f"the fraction of missing values for the pickup location ID = {np.round(share_of_miss_PU * 100, 2)}%")

the fraction of missing values for the pickup location ID = 83.53%
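
The same share can also be read off before filling, because the mean of a boolean mask is the fraction of `True` values. A minimal sketch on synthetic location IDs (made-up values, not the taxi data):

```python
import numpy as np
import pandas as pd

# Synthetic location IDs (illustrative only): 3 of 4 pickup IDs are missing.
toy = pd.DataFrame({
    "PUlocationID": [np.nan, 71.0, np.nan, np.nan],
    "DOlocationID": [72.0, np.nan, 61.0, 71.0],
})

# isna() gives booleans; their mean is the fraction of missing values.
missing_share = toy["PUlocationID"].isna().mean()

# After filling, the fraction of -1 values is the same number.
filled = toy.fillna({"PUlocationID": -1, "DOlocationID": -1})
filled_share = (filled["PUlocationID"] == -1).mean()

print(f"{missing_share:.2%} missing, {filled_share:.2%} equal to -1")
```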


## Q4. One-hot encoding

Let's apply one-hot encoding to the pickup and dropoff location IDs. We'll use only these two features for our model.

* Turn the dataframe into a list of dictionaries
* Fit a dictionary vectorizer
* Get a feature matrix from it

What's the dimensionality of this matrix? (The number of columns)

* 2
* 152
* 352
* 525
* 725

In [12]:
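from sklearn.feature_extraction import DictVectorizer
from sklearn.preprocessing import OneHotEncoder

train_dicts = df_jan_prep[cols_fill_miss].to_dict(orient='records')

# The location IDs are floats, so DictVectorizer passes them through as two numeric columns.
dv = DictVectorizer(sparse=False)
X_train = dv.fit_transform(train_dicts)

# OneHotEncoder does the actual one-hot encoding; IDs unseen during fit are encoded as all zeros.
ohe = OneHotEncoder(handle_unknown="ignore")
X_train_ohe = ohe.fit_transform(X_train)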

In [13]:
print(f"the dimensionality of this matrix is {X_train_ohe.shape}")

the dimensionality of this matrix is (1109826, 525)


## Q5. Training a model

Now let's use the feature matrix from the previous step to train a model.

* Train a plain linear regression model with default parameters
* Calculate the RMSE of the model on the training data

What's the RMSE on train?

* 5.52
* 10.52
* 15.52
* 20.52

In [16]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
lr = LinearRegression()
target = 'duration'
y_train = df_jan_prep[target].values

lr.fit(X_train_ohe, y_train)

y_pred = lr.predict(X_train_ohe)

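# sqrt of the MSE avoids the `squared` argument, which newer scikit-learn releases no longer accept
rmse_train = np.sqrt(mean_squared_error(y_train, y_pred))
print(f"RMSE for linear regression on train = {np.round(rmse_train, 2)}")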

RMSE for linear regression on train = 10.53
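
For reference, RMSE is the square root of the mean squared difference between targets and predictions. The sketch below computes it by hand on made-up numbers and compares the result with scikit-learn. The arrays are illustrative and unrelated to the taxi data.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up targets and predictions in minutes (illustrative only).
y_true = np.array([10.0, 20.0, 30.0, 40.0])
y_hat = np.array([12.0, 18.0, 33.0, 37.0])

# RMSE by hand: square the errors, average them, take the square root.
rmse_manual = np.sqrt(np.mean((y_true - y_hat) ** 2))

# Same quantity via scikit-learn; sqrt of the MSE works on any version.
rmse_sklearn = np.sqrt(mean_squared_error(y_true, y_hat))

print(rmse_manual, rmse_sklearn)
```

Because the errors are squared before averaging, a few very long rides can dominate the score, which is one reason the outliers were removed earlier.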


## Q6. Evaluating the model

Now let's apply this model to the validation dataset (Feb 2021).

What's the RMSE on validation?

* 7.85
* 12.85
* 17.85
* 22.85

In [18]:
df_feb = pd.read_parquet("data/fhv_tripdata_2021-02.parquet")
# Same preparation as for January: duration in minutes, keep rides of 1 to 60 minutes
df_feb_prep = df_feb\
    .assign(
    duration=lambda x: (x['dropOff_datetime'] - x['pickup_datetime']).dt.total_seconds() / 60)\
    .query("1 <= duration <= 60")\
    .copy()
df_feb_prep[cols_fill_miss] = df_feb_prep[cols_fill_miss].fillna(-1)
val_dicts = df_feb_prep[cols_fill_miss].to_dict(orient='records')
X_val = dv.transform(val_dicts)
X_val_ohe = ohe.transform(X_val)

y_val = df_feb_prep[target].values

y_pred_val = lr.predict(X_val_ohe)

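# sqrt of the MSE avoids the `squared` argument, which newer scikit-learn releases no longer accept
rmse_val = np.sqrt(mean_squared_error(y_val, y_pred_val))
print(f"RMSE for linear regression on validation = {np.round(rmse_val, 2)}")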

RMSE for linear regression on validation = 11.01
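
Since January and February go through identical preparation steps, they could be wrapped in one function so that both months are guaranteed to be treated the same way. The sketch below is a hypothetical helper (`prepare` is not part of the original notebook), shown on a tiny synthetic frame that only borrows the FHV column names.

```python
import pandas as pd


def prepare(df, categorical=("PUlocationID", "DOlocationID")):
    """Add duration in minutes, keep 1-60 minute rides, fill missing IDs with -1.

    Hypothetical helper, not part of the original notebook.
    """
    df = df.copy()
    delta = df["dropOff_datetime"] - df["pickup_datetime"]
    df["duration"] = delta.dt.total_seconds() / 60
    df = df[df["duration"].between(1, 60)].copy()
    cols = list(categorical)
    df[cols] = df[cols].fillna(-1)
    return df


# Tiny synthetic frame (illustrative only) with the FHV column names.
toy = pd.DataFrame({
    "pickup_datetime": pd.to_datetime(
        ["2021-02-01 00:00:00", "2021-02-01 01:00:00", "2021-02-01 02:00:00"]),
    "dropOff_datetime": pd.to_datetime(
        ["2021-02-01 00:10:00", "2021-02-01 03:00:00", "2021-02-01 02:30:00"]),
    "PUlocationID": [None, 5.0, 7.0],
    "DOlocationID": [3.0, None, None],
})

prepared = prepare(toy)
print(prepared[["duration", "PUlocationID", "DOlocationID"]])
```

With such a helper, the validation frame would be built by a single call on the February data, and the fitted `dv` and `ohe` objects would then only be used with `transform`, never refitted.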


In [41]:
df_feb_prep['duration'].max()

60.0