# **MLOps ZoomCamp - Homework #1**

~~~
The goal of this homework is to train a simple model for predicting the duration of a ride - similar to what we did in this module.
~~~

In [1]:
# Import Libs
import pandas as pd

import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LinearRegression, Lasso, Ridge

from sklearn.metrics import mean_squared_error

## **Q1. Downloading the data**

We'll use the same NYC taxi dataset, but instead of "Green Taxi Trip Records", we'll use "For-Hire Vehicle Trip Records".

Download the data for January and February 2021.

Note that you need "For-Hire Vehicle Trip Records", not "High Volume For-Hire Vehicle Trip Records".

Read the data for January. How many records are there?

In [18]:
df = pd.read_parquet('./data/fhv_tripdata_2021-01.parquet')

In [19]:
df.shape

(1154112, 7)

> **1,154,112 records in January**

## **Q2. Computing duration**

Now let's compute the duration variable. It should contain the duration of a ride in minutes.

What's the average trip duration in January?

In [21]:
df.head()

Unnamed: 0,dispatching_base_num,pickup_datetime,dropOff_datetime,PUlocationID,DOlocationID,SR_Flag,Affiliated_base_number
0,B00009,2021-01-01 00:27:00,2021-01-01 00:44:00,,,,B00009
1,B00009,2021-01-01 00:50:00,2021-01-01 01:07:00,,,,B00009
2,B00013,2021-01-01 00:01:00,2021-01-01 01:51:00,,,,B00013
3,B00037,2021-01-01 00:13:09,2021-01-01 00:21:26,,72.0,,B00037
4,B00037,2021-01-01 00:38:31,2021-01-01 00:53:44,,61.0,,B00037


In [22]:
df.dtypes

dispatching_base_num              object
pickup_datetime           datetime64[ns]
dropOff_datetime          datetime64[ns]
PUlocationID                     float64
DOlocationID                     float64
SR_Flag                           object
Affiliated_base_number            object
dtype: object

In [23]:
df['duration'] = df['dropOff_datetime'] - df['pickup_datetime']
df['duration'] = df['duration'].apply(lambda i: i.total_seconds() / 60)

In [24]:
df['duration'].mean()

19.1672240937939

> **Average trip duration in January: 19.17 minutes**
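As a quick sanity check of the minute conversion, the sketch below rebuilds the first three rows shown by `df.head()` in a small frame (the name `toy` is hypothetical) and computes the duration with the vectorized `.dt.total_seconds()` accessor. It should give the same values as the `apply` version, without a Python-level loop over rows.

```python
import pandas as pd

# Toy frame (hypothetical name) built from the first three rows of df.head()
toy = pd.DataFrame({
    'pickup_datetime': pd.to_datetime([
        '2021-01-01 00:27:00', '2021-01-01 00:50:00', '2021-01-01 00:01:00']),
    'dropOff_datetime': pd.to_datetime([
        '2021-01-01 00:44:00', '2021-01-01 01:07:00', '2021-01-01 01:51:00']),
})

# Timedelta column -> seconds -> minutes, computed column-wise
delta = toy['dropOff_datetime'] - toy['pickup_datetime']
toy['duration'] = delta.dt.total_seconds() / 60

durations = toy['duration'].tolist()
print(durations)  # [17.0, 17.0, 110.0]
```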

## **Data preparation**

Check the distribution of the duration variable. There are some outliers. 

Let's remove them and keep only the records where the duration was between 1 and 60 minutes (inclusive).

How many records did you drop?

In [25]:
df['duration'].describe(percentiles=[.95, .98, .99]).apply('{0:.2f}'.format)

count    1154112.00
mean          19.17
std          398.69
min            0.02
50%           13.40
95%           47.25
98%           66.13
99%           90.30
max       423371.05
Name: duration, dtype: object

In [26]:
start = df.shape[0]
df = df[(df.duration >= 1) & (df.duration <= 60)]
end = df.shape[0]

start - end


44286

> **44,286 records were dropped from the dataset**
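The inclusive 1 to 60 minute filter can also be written with `Series.between`, which includes both endpoints by default. This is a minimal sketch on made-up durations (the frame `toy` and its values are illustrative, not taken from the dataset), showing that rides of exactly 1 and 60 minutes are kept.

```python
import pandas as pd

# Hypothetical durations in minutes, including values on and outside the boundaries
toy = pd.DataFrame({'duration': [0.02, 1.0, 13.4, 60.0, 60.01, 423371.05]})

# between() is inclusive on both ends by default, matching ">= 1 and <= 60"
mask = toy['duration'].between(1, 60)
kept = toy[mask]

n_dropped = len(toy) - len(kept)
print(n_dropped)  # 3
```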

## **Q3. Missing values**

The features we'll use for our model are the pickup and dropoff location IDs. 

But they have a lot of missing values. Let's replace them with "-1".

What's the fraction of missing values for the pickup location ID? I.e. the fraction of "-1"s after you filled the NAs.

In [29]:
df.shape

(1109826, 8)

In [34]:
df.isnull().sum().sum()

2185514

In [35]:
df = df.fillna(-1)

In [38]:
(df["PUlocationID"] == -1).mean()

0.8352732770722617

> **About 83.5% of pickup location IDs are missing**
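Note that `df.fillna(-1)` fills every column, including `SR_Flag`. If only the two feature columns should be touched, one possible variant is sketched below on a small made-up frame (`toy` and `loc_cols` are hypothetical names); the fraction of "-1"s is computed the same way as in the cell above.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: four of five pickup IDs are missing
toy = pd.DataFrame({
    'PUlocationID': [np.nan, np.nan, 72.0, np.nan, np.nan],
    'DOlocationID': [np.nan, 72.0, 61.0, np.nan, 71.0],
    'SR_Flag': [None, None, None, None, None],
})

# Fill only the two feature columns; other columns keep their missing values
loc_cols = ['PUlocationID', 'DOlocationID']
toy[loc_cols] = toy[loc_cols].fillna(-1)

# The mean of a boolean mask is the fraction of True values
missing_fraction = (toy['PUlocationID'] == -1).mean()
print(missing_fraction)  # 0.8
```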

## **Q4. One-hot encoding**

Let's apply one-hot encoding to the pickup and dropoff location IDs. We'll use only these two features for our model. 

* Turn the dataframe into a list of dictionaries
* Fit a dictionary vectorizer 
* Get a feature matrix from it

What's the dimensionality of this matrix? (The number of columns).

In [39]:
cat_cols = ['PUlocationID', 'DOlocationID']
df[cat_cols] = df[cat_cols].astype(str)

dv = DictVectorizer()

train_dicts = df[cat_cols].to_dict(orient='records')
X_train = dv.fit_transform(train_dicts)

In [41]:
X_train.shape

(1109826, 525)

> **The matrix has 525 columns**

## **Q5. Training a model**

Now let's use the feature matrix from the previous step to train a model. 

* Train a plain linear regression model with default parameters 
* Calculate the RMSE of the model on the training data

What's the RMSE on train?

In [42]:
target = 'duration'
y_train = df[target].values

lr = LinearRegression()
lr.fit(X_train, y_train)

y_pred = lr.predict(X_train)

rmse_train = mean_squared_error(y_train, y_pred, squared=False)

In [43]:
rmse_train

10.528519107210744

> **The RMSE on train is 10.53**
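For reference, RMSE is the square root of the mean squared residual, $\sqrt{\frac{1}{n}\sum_i (y_i - \hat{y}_i)^2}$. Newer scikit-learn releases deprecate the `squared=False` argument of `mean_squared_error`, so a version-independent way to get the same number is to compute it directly with NumPy. The helper `rmse` below is a hypothetical name and the inputs are made up.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error: square root of the mean squared residual."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Hypothetical durations (minutes) and predictions: residuals are -3, 4 and 0
example = rmse([10.0, 20.0, 30.0], [13.0, 16.0, 30.0])
print(example)  # sqrt(25 / 3)
```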

## **Q6. Evaluating the model**

Now let's apply this model to the validation dataset (Feb 2021). 

What's the RMSE on validation?


In [44]:
df_val = pd.read_parquet('./data/fhv_tripdata_2021-02.parquet')

df_val['duration'] = df_val['dropOff_datetime'] - df_val['pickup_datetime']
df_val['duration'] = df_val.duration.apply(lambda td: td.total_seconds() / 60)
df_val = df_val[(df_val.duration >= 1) & (df_val.duration <= 60)]

df_val = df_val.fillna(-1)

df_val[cat_cols] = df_val[cat_cols].astype(str)  

val_dicts = df_val[cat_cols].to_dict(orient='records')
X_val = dv.transform(val_dicts)

y_val = df_val['duration'].values
y_pred = lr.predict(X_val)

rmse_val = mean_squared_error(y_val, y_pred, squared=False)

In [45]:
rmse_val

11.014283196111764

> **The RMSE on validation is 11.01**
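Since the validation frame goes through exactly the same steps as the training frame, the preprocessing could be collected in one function so the two cannot drift apart. This is only a sketch: `prepare` and `CAT_COLS` are hypothetical names, the input frame is made up, and it fills only the two location columns rather than the whole frame.

```python
import pandas as pd

CAT_COLS = ['PUlocationID', 'DOlocationID']

def prepare(df):
    """Hypothetical helper: same preprocessing steps for train and validation frames."""
    df = df.copy()
    delta = df['dropOff_datetime'] - df['pickup_datetime']
    df['duration'] = delta.dt.total_seconds() / 60
    # Keep rides between 1 and 60 minutes (inclusive)
    df = df[df['duration'].between(1, 60)].copy()
    # Missing location IDs become -1, then strings so they are one-hot encoded
    df[CAT_COLS] = df[CAT_COLS].fillna(-1).astype(str)
    return df

# Toy input with one valid ride and one ride that is too long
raw = pd.DataFrame({
    'pickup_datetime': pd.to_datetime(['2021-02-01 00:00:00', '2021-02-01 00:00:00']),
    'dropOff_datetime': pd.to_datetime(['2021-02-01 00:10:00', '2021-02-01 02:00:00']),
    'PUlocationID': [None, 72.0],
    'DOlocationID': [61.0, None],
})
prepared = prepare(raw)
print(prepared[['PUlocationID', 'DOlocationID', 'duration']])
```

The vectorizer is still fitted on the training data only; the validation dictionaries go through `dv.transform`, as in the cell above, so both matrices share the same columns.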