# 1.6 Homework
The goal of this homework is to train a simple model for predicting the duration of a ride, similar to what we did in this module.

## Q1. Downloading the data
We'll use the same NYC taxi dataset, but instead of "Green Taxi Trip Records", we'll use "For-Hire Vehicle Trip Records".

Download the data for January and February 2021.

Note that you need "For-Hire Vehicle Trip Records", not "High Volume For-Hire Vehicle Trip Records".

Read the data for January. How many records are there?

- 1054112
- 1154112
- 1254112
- 1354112

In [75]:
import pandas as pd
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

In [61]:
df = pd.read_parquet("fhv_tripdata_2021-01.parquet")
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1154112 entries, 0 to 1154111
Data columns (total 7 columns):
 #   Column                  Non-Null Count    Dtype         
---  ------                  --------------    -----         
 0   dispatching_base_num    1154112 non-null  object        
 1   pickup_datetime         1154112 non-null  datetime64[ns]
 2   dropOff_datetime        1154112 non-null  datetime64[ns]
 3   PUlocationID            195845 non-null   float64       
 4   DOlocationID            991892 non-null   float64       
 5   SR_Flag                 0 non-null        object        
 6   Affiliated_base_number  1153227 non-null  object        
dtypes: datetime64[ns](2), float64(2), object(3)
memory usage: 61.6+ MB


In [24]:
print(f"For January there are {len(df)} records")

For January there are 1154112 records


## Q2. Computing duration
Now let's compute the duration variable. It should contain the duration of a ride in minutes.

What's the average trip duration in January?

- 15.16
- 19.16
- 24.16
- 29.16
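
As a small sketch of the computation, the snippet below derives the duration in minutes for two pickup/drop-off pairs taken from the January data. The frame name `toy` is only for illustration; subtracting two datetime columns gives a timedelta column, which `.dt.total_seconds()` converts to seconds.

```python
import pandas as pd

# Two pickup/drop-off pairs from the January data; `toy` is an illustrative name.
toy = pd.DataFrame({
    "pickup_datetime": pd.to_datetime(["2021-01-01 00:27:00", "2021-01-01 00:13:09"]),
    "dropOff_datetime": pd.to_datetime(["2021-01-01 00:44:00", "2021-01-01 00:21:26"]),
})

# datetime - datetime -> timedelta; total seconds divided by 60 gives minutes.
toy["duration"] = (toy["dropOff_datetime"] - toy["pickup_datetime"]).dt.total_seconds() / 60
print(toy["duration"].round(2).tolist())  # [17.0, 8.28]
```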

In [62]:
# Trip length as a timedelta, converted to minutes
df["duration"] = (df["dropOff_datetime"] - df["pickup_datetime"]).dt.total_seconds() / 60

In [63]:
print(f"The mean of the duration in minutes is {df['duration'].mean():.2f}")

The mean of the duration in minutes is 19.17


## Data preparation
Check the distribution of the duration variable. There are some outliers.

Let's remove them and keep only the records where the duration was between 1 and 60 minutes (inclusive).

How many records did you drop?
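
To illustrate the inclusive bounds, here is a minimal sketch on made-up durations (the values are hypothetical and chosen to sit on and around the limits). `Series.between` includes both endpoints by default, so trips of exactly 1 and exactly 60 minutes are kept.

```python
import pandas as pd

# Hypothetical durations in minutes, on and around the bounds.
durations = pd.Series([0.5, 1.0, 30.0, 60.0, 60.1])

# between() is inclusive on both ends by default.
mask = durations.between(1, 60)
kept = durations[mask]
dropped = len(durations) - len(kept)
print(kept.tolist(), dropped)  # [1.0, 30.0, 60.0] 2
```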

In [17]:
df["duration"].describe()

count    1.154112e+06
mean     1.916722e+01
std      3.986922e+02
min      1.666667e-02
25%      7.766667e+00
50%      1.340000e+01
75%      2.228333e+01
max      4.233710e+05
Name: duration, dtype: float64

In [19]:
df["duration"].quantile(0.05), df["duration"].quantile(0.95)

(3.0166666666666666, 47.25)

In [27]:
df[(df["duration"] >= 1) & (df["duration"] <= 60)]

| | dispatching_base_num | pickup_datetime | dropOff_datetime | PUlocationID | DOlocationID | SR_Flag | Affiliated_base_number | duration |
|---|---|---|---|---|---|---|---|---|
| 0 | B00009 | 2021-01-01 00:27:00 | 2021-01-01 00:44:00 | | | | B00009 | 17.000000 |
| 1 | B00009 | 2021-01-01 00:50:00 | 2021-01-01 01:07:00 | | | | B00009 | 17.000000 |
| 3 | B00037 | 2021-01-01 00:13:09 | 2021-01-01 00:21:26 | | 72.0 | | B00037 | 8.283333 |
| 4 | B00037 | 2021-01-01 00:38:31 | 2021-01-01 00:53:44 | | 61.0 | | B00037 | 15.216667 |
| 5 | B00037 | 2021-01-01 00:59:02 | 2021-01-01 01:08:05 | | 71.0 | | B00037 | 9.050000 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1154107 | B03266 | 2021-01-31 23:43:03 | 2021-01-31 23:51:48 | 7.0 | 7.0 | | B03266 | 8.750000 |
| 1154108 | B03284 | 2021-01-31 23:50:27 | 2021-02-01 00:48:03 | 44.0 | 91.0 | | | 57.600000 |
| 1154109 | B03285 | 2021-01-31 23:13:46 | 2021-01-31 23:29:58 | 171.0 | 171.0 | | B03285 | 16.200000 |
| 1154110 | B03285 | 2021-01-31 23:58:03 | 2021-02-01 00:17:29 | 15.0 | 15.0 | | B03285 | 19.433333 |


In [64]:
records_df = len(df)
df = df[df["duration"].between(1, 60)]

In [65]:
print(f"After removing trips shorter than 1 minute or longer than 1 hour, we dropped {records_df - len(df)} records")

After removing trips shorter than 1 minute or longer than 1 hour, we dropped 44286 records


## Q3. Missing values
The features we'll use for our model are the pickup and dropoff location IDs.

These columns have a lot of missing values, though. Let's replace them with "-1".

What's the fraction of missing values for the pickup location ID? That is, the fraction of "-1"s after you have filled the NAs.

- 53%
- 63%
- 73%
- 83%
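
The fraction can be checked both before and after filling, as in the sketch below on a made-up column. As an assumption for this example, the placeholder is the number -1 rather than the string "-1", which keeps the toy column numeric; the idea is the same either way.

```python
import pandas as pd

# Hypothetical pickup location IDs with missing values.
pu = pd.Series([7.0, None, None, 44.0, None])

# Fraction of missing values before filling.
frac_before = pu.isna().mean()

# Fill with a numeric placeholder (assumption: -1 instead of "-1"),
# then measure how often the placeholder appears.
filled = pu.fillna(-1)
frac_after = (filled == -1).mean()
print(frac_before, frac_after)  # 0.6 0.6
```

Both numbers agree as long as -1 never occurs as a real location ID.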

In [57]:
df["PUlocationID"].isna().sum() / len(df)

0.8352732770722617

In [66]:
df["PUlocationID"] = df["PUlocationID"].fillna("-1")
df["DOlocationID"] = df["DOlocationID"].fillna("-1")

In [59]:
print(f"The proportion of missing values for the pickup location ID is: {(df['PUlocationID'] == '-1').sum() / len(df):.2%}")

The proportion of missing values for the pickup location ID is: 83.53%


## Q4. One-hot encoding
Let's apply one-hot encoding to the pickup and dropoff location IDs. We'll use only these two features for our model.

- Turn the dataframe into a list of dictionaries
- Fit a dictionary vectorizer
- Get a feature matrix from it

What's the dimensionality of this matrix (the number of columns)?

- 2
- 152
- 352
- 525
- 725
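
A minimal sketch of how `DictVectorizer` one-hot encodes string values, using three made-up records (the values are hypothetical). Each distinct `feature=value` pair becomes one column, so the number of columns equals the number of distinct pairs seen during fitting.

```python
from sklearn.feature_extraction import DictVectorizer

# Hypothetical records; string values are one-hot encoded.
records = [
    {"PUlocationID": "-1", "DOlocationID": "72.0"},
    {"PUlocationID": "7.0", "DOlocationID": "7.0"},
    {"PUlocationID": "-1", "DOlocationID": "61.0"},
]

dv_toy = DictVectorizer()
X_toy = dv_toy.fit_transform(records)  # sparse matrix, one row per record

print(X_toy.shape)  # (3, 5)
print(sorted(dv_toy.vocabulary_))  # the learned "feature=value" column names
```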

In [68]:
categorical = ['PUlocationID', 'DOlocationID']
df[categorical] = df[categorical].astype(str)


In [93]:
train_dicts = df[categorical].to_dict(orient='records')
dv = DictVectorizer()
X_train = dv.fit_transform(train_dicts)

In [82]:
type(X_train)

scipy.sparse._csr.csr_matrix

In [94]:
X_train.shape

(1109826, 525)

In [95]:
print(f"The dimensionality of this matrix is {X_train.shape[1]} (columns)")

The dimensionality of this matrix is 525 (columns)


## Q5. Training a model
Now let's use the feature matrix from the previous step to train a model.

- Train a plain linear regression model with default parameters
- Calculate the RMSE of the model on the training data

What's the RMSE on train?

- 5.52
- 10.52
- 15.52
- 20.52
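
For reference, RMSE is the square root of the mean squared error. The sketch below computes it on a tiny made-up dataset (features and targets are hypothetical) by taking `np.sqrt` of `mean_squared_error`, which does not depend on the `squared` argument being available in the installed scikit-learn version.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Hypothetical one-hot features (two categories) and durations in minutes.
X_toy = np.array([[1, 0], [1, 0], [0, 1], [0, 1]])
y_toy = np.array([10.0, 14.0, 20.0, 30.0])

model = LinearRegression()
model.fit(X_toy, y_toy)
y_hat = model.predict(X_toy)  # each prediction is the mean of its category

# RMSE = sqrt(mean((y - y_hat)^2))
rmse = np.sqrt(mean_squared_error(y_toy, y_hat))
print(round(float(rmse), 2))  # 3.81
```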

In [96]:
y_train = df["duration"]

In [97]:
lr = LinearRegression()
lr.fit(X_train, y_train)

y_pred = lr.predict(X_train)

mean_squared_error(y_train, y_pred, squared=False)

10.528519434683

In [85]:
print(f"The RMSE of the model is: {mean_squared_error(y_train, y_pred, squared=False):.2f}")

The RMSE of the model is: 10.53


## Q6. Evaluating the model
Now let's apply this model to the validation dataset (Feb 2021).

What's the RMSE on validation?

- 6.01
- 11.01
- 16.01
- 21.01
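
When applying a fitted model to new data, the feature matrix must have the same columns the model was trained on. The sketch below, on made-up records, shows the difference between `transform` (reuses the columns learned from the training data and ignores unseen values) and refitting on the validation data (learns a new, incompatible column layout).

```python
from sklearn.feature_extraction import DictVectorizer

# Hypothetical training and validation records.
train = [{"PUlocationID": "-1"}, {"PUlocationID": "7.0"}]
val = [{"PUlocationID": "-1"}, {"PUlocationID": "44.0"}, {"PUlocationID": "15.0"}]

dv_toy = DictVectorizer()
X_tr = dv_toy.fit_transform(train)  # learns 2 columns from the training data
X_va = dv_toy.transform(val)        # reuses those 2 columns; unseen values are ignored

print(X_tr.shape, X_va.shape)  # (2, 2) (3, 2)

# Refitting on the validation data produces a different number of columns.
X_refit = DictVectorizer().fit_transform(val)
print(X_refit.shape)  # (3, 3)
```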

In [106]:
df_val = pd.read_parquet("fhv_tripdata_2021-02.parquet")
df_val["duration"] = (df_val["dropOff_datetime"] - df_val["pickup_datetime"]).dt.total_seconds() / 60
df_val = df_val[df_val["duration"].between(1, 60)]
df_val["PUlocationID"] = df_val["PUlocationID"].fillna("-1")
df_val["DOlocationID"] = df_val["DOlocationID"].fillna("-1")

In [107]:
categorical = ['PUlocationID', 'DOlocationID']
df_val[categorical] = df_val[categorical].astype(str)
eval_dicts = df_val[categorical].to_dict(orient='records')

In [108]:
# Reuse the vectorizer fitted on the training data. Calling fit_transform here
# would relearn the vocabulary from the February data and change the number of
# columns, which makes lr.predict fail with a feature-count mismatch.
X_val = dv.transform(eval_dicts)
y_val = df_val["duration"]

In [104]:
type(X_val)

scipy.sparse._csr.csr_matrix

In [109]:
X_val.shape

(990113, 525)

In [105]:
y_pred = lr.predict(X_val)
mean_squared_error(y_val, y_pred, squared=False)