## Homework

The goal of this homework is to train a simple model for predicting the duration of a ride, similar to what we did in this module.

In [1]:
import pandas as pd
import numpy as np

## Q1. Downloading the data

We'll use [the same NYC taxi dataset](https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page),
but instead of "**Green** Taxi Trip Records", we'll use "**Yellow** Taxi Trip Records".

Download the data for January and February 2022.

Read the data for January. How many columns are there?

- [ ] 16
- [ ] 17
- [ ] 18
- [x] 19

In [2]:
data_jan = pd.read_parquet('../../data/yellow_tripdata_2022-01.parquet')
data_jan.head()

| | VendorID | tpep_pickup_datetime | tpep_dropoff_datetime | passenger_count | trip_distance | RatecodeID | store_and_fwd_flag | PULocationID | DOLocationID | payment_type | fare_amount | extra | mta_tax | tip_amount | tolls_amount | improvement_surcharge | total_amount | congestion_surcharge | airport_fee |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2022-01-01 00:35:40 | 2022-01-01 00:53:29 | 2.0 | 3.8 | 1.0 | N | 142 | 236 | 1 | 14.5 | 3.0 | 0.5 | 3.65 | 0.0 | 0.3 | 21.95 | 2.5 | 0.0 |
| 1 | 1 | 2022-01-01 00:33:43 | 2022-01-01 00:42:07 | 1.0 | 2.1 | 1.0 | N | 236 | 42 | 1 | 8.0 | 0.5 | 0.5 | 4.0 | 0.0 | 0.3 | 13.3 | 0.0 | 0.0 |
| 2 | 2 | 2022-01-01 00:53:21 | 2022-01-01 01:02:19 | 1.0 | 0.97 | 1.0 | N | 166 | 166 | 1 | 7.5 | 0.5 | 0.5 | 1.76 | 0.0 | 0.3 | 10.56 | 0.0 | 0.0 |
| 3 | 2 | 2022-01-01 00:25:21 | 2022-01-01 00:35:23 | 1.0 | 1.09 | 1.0 | N | 114 | 68 | 2 | 8.0 | 0.5 | 0.5 | 0.0 | 0.0 | 0.3 | 11.8 | 2.5 | 0.0 |
| 4 | 2 | 2022-01-01 00:36:48 | 2022-01-01 01:14:20 | 1.0 | 4.3 | 1.0 | N | 68 | 163 | 1 | 23.5 | 0.5 | 0.5 | 3.0 | 0.0 | 0.3 | 30.3 | 2.5 | 0.0 |


In [3]:
print(f"Number of columns: {len(data_jan.columns)}")

Number of columns: 19


## Q2. Computing duration

Now let's compute the `duration` variable. It should contain the duration of a ride in minutes. 
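
As a quick sanity check of the unit conversion, here is a minimal sketch that uses only the standard library and the pickup and dropoff timestamps from the first row of the January data shown earlier. The variable names are illustrative and not part of the homework.

```python
from datetime import datetime

# Timestamps taken from the first row of the January data shown earlier.
pickup = datetime(2022, 1, 1, 0, 35, 40)
dropoff = datetime(2022, 1, 1, 0, 53, 29)

# Subtracting two datetimes yields a timedelta; convert its seconds to minutes.
duration_minutes = (dropoff - pickup).total_seconds() / 60
print(round(duration_minutes, 2))
```

The ride lasts 17 minutes and 49 seconds, so the result is a fractional number of minutes rather than a whole one.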

What's the standard deviation of the trip durations in January?

- [ ] 41.45
- [X] 46.45
- [ ] 51.45
- [ ] 56.45

In [4]:
# Creating the duration column (in minutes)
data_jan['duration'] = (data_jan['tpep_dropoff_datetime'] - data_jan['tpep_pickup_datetime']).dt.total_seconds() / 60

# Standard deviation of the duration column
print(f"Standard deviation of the duration column: {round(data_jan['duration'].std(), 2)}")

Standard deviation of the duration column: 46.45


## Q3. Dropping outliers

Next, we need to check the distribution of the `duration` variable. There are some outliers. Let's remove them and keep only the records where the duration was between 1 and 60 minutes (inclusive).
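
To make the "inclusive" part concrete, here is a small sketch on made-up durations (the values are hypothetical, chosen to sit on and around the two bounds). It shows that rides of exactly 1 and exactly 60 minutes are kept.

```python
# Hypothetical durations in minutes, chosen to sit on and around the bounds.
durations = [0.5, 1.0, 17.8, 60.0, 60.1, 240.0]

# Keep values in the closed interval [1, 60]: both endpoints are retained.
kept = [d for d in durations if 1 <= d <= 60]

# Fraction of records that survive the filter.
fraction_kept = len(kept) / len(durations)
print(kept)
print(fraction_kept)
```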

What fraction of the records is left after you drop the outliers?

- [ ] 90%
- [ ] 92%
- [ ] 95%
- [x] 98%

In [5]:
# Keeping durations between 1 and 60 minutes
data_jan_clean = data_jan[(data_jan['duration'] >= 1) & (data_jan['duration'] <= 60)]

# Fraction of trips with durations between 1 and 60 minutes
print(f"Fraction of trips with durations between 1 and 60 minutes: {round(len(data_jan_clean) / len(data_jan), 2)}")

Fraction of trips with durations between 1 and 60 minutes: 0.98


## Q4. One-hot encoding

Let's apply one-hot encoding to the pickup and dropoff location IDs. We'll use only these two features for our model. 

* Turn the dataframe into a list of dictionaries
* Fit a dictionary vectorizer 
* Get a feature matrix from it
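
The steps above can be illustrated with a plain-Python sketch of the idea behind one-hot encoding a list of dictionaries. This is not scikit-learn's implementation (which returns a sparse matrix); the rides are hypothetical, and the `field=value` column naming is used here only to mirror how string-valued features are typically labelled.

```python
# Hypothetical rides; IDs are strings so they are treated as categories.
rides = [
    {"PULocationID": "142", "DOLocationID": "236"},
    {"PULocationID": "236", "DOLocationID": "42"},
    {"PULocationID": "142", "DOLocationID": "42"},
]

# Each distinct (field, value) pair becomes one column.
columns = sorted({f"{key}={value}" for ride in rides for key, value in ride.items()})
column_index = {name: i for i, name in enumerate(columns)}

# Build one row per ride with a 1 in the column of each of its pairs.
matrix = []
for ride in rides:
    row = [0] * len(columns)
    for key, value in ride.items():
        row[column_index[f"{key}={value}"]] = 1
    matrix.append(row)

print(columns)
print(matrix)
```

Note that location "236" appears both as a pickup and as a dropoff and gets two separate columns, so the number of columns is the number of distinct pickup IDs plus the number of distinct dropoff IDs.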

What's the dimensionality of this matrix (number of columns)?

- [ ] 2
- [ ] 155
- [ ] 345
- [X] 515
- [ ] 715


In [6]:
from sklearn.feature_extraction import DictVectorizer

# Selecting only the location columns and converting them to strings
df_location = data_jan_clean[['PULocationID', 'DOLocationID']].astype(str)

# Turn the dataframe into a list of dictionaries
train_dict = df_location.to_dict(orient='records')

# Fit a dictionary vectorizer
vectorizer = DictVectorizer()
X_train = vectorizer.fit_transform(train_dict)

# Number of features
print(f"Number of features: {np.shape(X_train)[1]}")

Number of features: 515


## Q5. Training a model

Now let's use the feature matrix from the previous step to train a model. 

* Train a plain linear regression model with default parameters 
* Calculate the RMSE of the model on the training data
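
For reference, RMSE is the square root of the mean of the squared differences between actual and predicted values. A minimal sketch on made-up numbers (the values below are hypothetical, not taken from the taxi data):

```python
import math

# Hypothetical actual and predicted durations in minutes.
y_true = [10.0, 20.0, 30.0]
y_pred = [12.0, 18.0, 33.0]

# Square each error, average the squares, then take the square root.
squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
mse = sum(squared_errors) / len(squared_errors)
rmse = math.sqrt(mse)
print(round(rmse, 2))
```

Because the errors are squared before averaging, RMSE is expressed in the same unit as the target (minutes here) and penalizes large misses more heavily than small ones.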

What's the RMSE on train?

- [X] 6.99
- [ ] 11.99
- [ ] 16.99
- [ ] 21.99

In [8]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Fit a linear regression model
model = LinearRegression()
model.fit(X_train, data_jan_clean['duration'])

# Root mean squared error on the training data
rmse = np.sqrt(mean_squared_error(data_jan_clean['duration'], model.predict(X_train)))
print(f"Root Mean squared error: {round(rmse, 2)}")

Root Mean squared error: 6.99


## Q6. Evaluating the model

Now let's apply this model to the validation dataset (February 2022). 
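
One detail worth keeping in mind: the validation data must be encoded with the columns learned from the training data, so the vectorizer is only applied, not refitted. scikit-learn's `DictVectorizer.transform` silently ignores features it did not see during fitting. The plain-Python sketch below illustrates that behaviour with hypothetical columns and a hypothetical `encode` helper; it is not the library's code.

```python
# Columns learned from hypothetical training rides (fixed after fitting).
columns = ["DOLocationID=236", "DOLocationID=42", "PULocationID=142"]
column_index = {name: i for i, name in enumerate(columns)}


def encode(ride):
    """Encode one ride against the fixed columns; unknown pairs are skipped."""
    row = [0] * len(columns)
    for key, value in ride.items():
        idx = column_index.get(f"{key}={value}")
        if idx is not None:
            row[idx] = 1
    return row


# "999" was never seen as a pickup location, so it contributes no 1.
print(encode({"PULocationID": "999", "DOLocationID": "42"}))
```

The encoded row keeps the same width as the training matrix, which is what allows the already trained model to make predictions on it.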

What's the RMSE on validation?

- [X] 7.79
- [ ] 12.79
- [ ] 17.79
- [ ] 22.79

In [9]:
# Validation data
data_feb = pd.read_parquet('../../data/yellow_tripdata_2022-02.parquet')

# Creating the duration column (in minutes)
data_feb['duration'] = (data_feb['tpep_dropoff_datetime'] - data_feb['tpep_pickup_datetime']).dt.total_seconds() / 60

# Keeping durations between 1 and 60 minutes
data_feb_clean = data_feb[(data_feb['duration'] >= 1) & (data_feb['duration'] <= 60)]

# Selecting only the location columns and converting them to strings
df_location = data_feb_clean[['PULocationID', 'DOLocationID']].astype(str)

# Turn the dataframe into a list of dictionaries
test_dict = df_location.to_dict(orient='records')

# Transforming the validation data with the vectorizer fitted on January
X_test = vectorizer.transform(test_dict)

# Root mean squared error on the validation data
rmse_test = np.sqrt(mean_squared_error(data_feb_clean['duration'], model.predict(X_test)))
print(f"Root Mean squared error: {round(rmse_test, 2)}")

Root Mean squared error: 7.79
