In [1]:
!python -V

Python 3.9.13


In [2]:
import pandas as pd
import pickle
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Lasso
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

## Homework

The goal of this homework is to train a simple model for predicting the duration of a ride - similar to what we did in this module.




## Q1. Downloading the data

We'll use [the same NYC taxi dataset](https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page),
but instead of "**Green** Taxi Trip Records", we'll use "**Yellow** Taxi Trip Records".

Download the data for January and February 2022.

Read the data for January. How many columns are there?

* 16
* 17
* 18
* 19



In [122]:
# download yellow taxi data (parquet)
! wget https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2022-01.parquet -P data/
! wget https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2022-02.parquet -P data/

--2023-05-25 10:24:56--  https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2022-01.parquet
Resolving d37ci6vzurychx.cloudfront.net (d37ci6vzurychx.cloudfront.net)... 2600:9000:200c:fa00:b:20a5:b140:21, 2600:9000:200c:a200:b:20a5:b140:21, 2600:9000:200c:6e00:b:20a5:b140:21, ...
Connecting to d37ci6vzurychx.cloudfront.net (d37ci6vzurychx.cloudfront.net)|2600:9000:200c:fa00:b:20a5:b140:21|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 38139949 (36M) [application/x-www-form-urlencoded]
Saving to: 'data/yellow_tripdata_2022-01.parquet'


In [123]:
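def read_dataframe(filename):
    # Load a trip-record file. CSV timestamps need explicit parsing,
    # while parquet files already store them as datetimes.
    if filename.endswith('.csv'):
        df = pd.read_csv(filename)
        df.tpep_dropoff_datetime = pd.to_datetime(df.tpep_dropoff_datetime)
        df.tpep_pickup_datetime = pd.to_datetime(df.tpep_pickup_datetime)
    elif filename.endswith('.parquet'):
        df = pd.read_parquet(filename)
    else:
        # Without this branch, df would be undefined for other extensions
        raise ValueError(f'Unsupported file type: {filename}')

    return df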

In [124]:
# read parquet files
df_train = read_dataframe('./data/yellow_tripdata_2022-01.parquet')
df_val   = read_dataframe('./data/yellow_tripdata_2022-02.parquet')

In [125]:
size_train = len(df_train)
size_val = len(df_val)

size_train, size_val

(2463931, 2979431)

In [126]:
print(f'number of columns: {len(df_train.columns)}')
# print(df_train.columns)

number of columns: 19



## Q2. Computing duration

Now let's compute the `duration` variable. It should contain the duration of a ride in minutes. 

What's the standard deviation of the trip durations in January?

* 41.45
* 46.45
* 51.45
* 56.45


In [127]:
def calculate_duration_in_minutes(df):
    # Vectorized conversion of the pickup-to-dropoff timedelta into minutes
    duration = df.tpep_dropoff_datetime - df.tpep_pickup_datetime
    df['duration'] = duration.dt.total_seconds() / 60
    return df

In [128]:
df_train = calculate_duration_in_minutes(df_train)
df_val   = calculate_duration_in_minutes(df_val)

In [129]:
print(f'duration std: {round(df_train.duration.std(),2)} minutes')

duration std: 46.45 minutes
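
As a minimal sketch of the duration computation, the toy example below builds two made-up rides (the timestamps are illustrative, not taken from the dataset) and converts the dropoff-minus-pickup timedelta into minutes. Only the column names follow the yellow taxi schema used above.

```python
import pandas as pd

# Two hypothetical rides; the values are invented for illustration only.
rides = pd.DataFrame({
    "tpep_pickup_datetime": pd.to_datetime(
        ["2022-01-01 00:00:00", "2022-01-01 10:15:00"]
    ),
    "tpep_dropoff_datetime": pd.to_datetime(
        ["2022-01-01 00:12:30", "2022-01-01 11:00:00"]
    ),
})

# Subtracting two datetime columns yields a timedelta column;
# .dt.total_seconds() turns it into floats, and / 60 gives minutes.
delta = rides["tpep_dropoff_datetime"] - rides["tpep_pickup_datetime"]
rides["duration"] = delta.dt.total_seconds() / 60

print(rides["duration"].tolist())  # [12.5, 45.0]
```

Note that pandas `std()` computes the sample standard deviation (`ddof=1`) by default, which is what the notebook reports.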




## Q3. Dropping outliers

Next, we need to check the distribution of the `duration` variable. There are some outliers. Let's remove them and keep only the records where the duration was between 1 and 60 minutes (inclusive).

What fraction of the records is left after you drop the outliers?

* 90%
* 92%
* 95%
* 98%



In [130]:
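def drop_outliers(df, duration_lower, duration_upper):
    # Keep rides whose duration lies in [duration_lower, duration_upper];
    # .copy() avoids SettingWithCopyWarning when columns are modified later
    return df[(df.duration >= duration_lower) & (df.duration <= duration_upper)].copy()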

In [136]:
df_train = drop_outliers(df_train,1,60)
df_val = drop_outliers(df_val,1,60)

In [137]:
size_train_new = len(df_train)
size_val_new = len(df_val)

size_train_new, size_val_new

(2421440, 2918187)

In [138]:
# Fraction of records kept across both months: filtered rows / original rows
fraction_left = (size_train_new + size_val_new) / (size_train + size_val)
print(f'Fraction of the records left after outlier filtering: {round(100 * fraction_left, 0)}%')

Fraction of the records left after outlier filtering: 98.0%
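
The inclusive 1 to 60 minute filter can also be illustrated on a few made-up durations. This is only a sketch: the values are hypothetical, and `Series.between` is used as an alternative to the explicit `>=` / `<=` comparison in `drop_outliers`.

```python
import pandas as pd

# Hypothetical durations in minutes, chosen to hit both boundaries.
durations = pd.DataFrame({"duration": [0.5, 1.0, 30.0, 60.0, 61.0]})

# between() includes both endpoints by default, matching the
# "between 1 and 60 minutes (inclusive)" rule.
mask = durations["duration"].between(1, 60)
kept = durations[mask]

# Fraction of records left = rows after filtering / rows before filtering.
fraction_left = len(kept) / len(durations)

print(kept["duration"].tolist())  # [1.0, 30.0, 60.0]
print(fraction_left)
```

The key point is that the denominator is the row count before filtering, so the fraction can never exceed 1.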



## Q4. One-hot encoding

Let's apply one-hot encoding to the pickup and dropoff location IDs. We'll use only these two features for our model. 

* Turn the dataframe into a list of dictionaries
* Fit a dictionary vectorizer 
* Get a feature matrix from it

What's the dimensionality of this matrix (number of columns)?

* 2
* 155
* 345
* 515
* 715


In [139]:
def transform_categorical_column_to_string(df, categorical):
    df[categorical] = df[categorical].astype(str)
    return df

categorical = ['PULocationID', 'DOLocationID']
df_train = transform_categorical_column_to_string(df_train, categorical)
df_val = transform_categorical_column_to_string(df_val, categorical)

In [140]:
def create_PU_DO_column(df):
    df['PU_DO'] = df['PULocationID'] + '_' + df['DOLocationID']
    return df

df_train = create_PU_DO_column(df_train)
df_val = create_PU_DO_column(df_val)

In [141]:
categorical = ['PULocationID', 'DOLocationID']
# Note: trip_distance is a numerical feature added on top of the two
# location IDs from the question, so it contributes one extra column.
numerical = ['trip_distance']

# Turn the dataframe into a list of dictionaries
train_dicts = df_train[categorical + numerical].to_dict(orient='records')
val_dicts   = df_val[categorical + numerical].to_dict(orient='records')

# Fit a dictionary vectorizer on the training data and get the feature matrices
dv = DictVectorizer()
X_train = dv.fit_transform(train_dicts)
X_val   = dv.transform(val_dicts)

In [142]:
print(f'dimensionality: {X_train.shape[1]}')

dimensionality: 516
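
To see where the column count comes from, here is a small sketch of the same three steps on two made-up records (the IDs and distances are hypothetical). String values get one column per distinct `feature=value` pair, while a numerical value such as `trip_distance` passes through as a single column.

```python
from sklearn.feature_extraction import DictVectorizer

# Two hypothetical rides as dictionaries; location IDs are strings
# so that DictVectorizer one-hot encodes them.
records = [
    {"PULocationID": "1", "DOLocationID": "2", "trip_distance": 1.5},
    {"PULocationID": "3", "DOLocationID": "2", "trip_distance": 4.0},
]

dv_demo = DictVectorizer()
X_demo = dv_demo.fit_transform(records)

# feature_names_ lists one entry per column of the matrix.
print(dv_demo.feature_names_)
# ['DOLocationID=2', 'PULocationID=1', 'PULocationID=3', 'trip_distance']
print(X_demo.shape)  # (2, 4)
```

By the same logic, the notebook's feature matrix has one column per distinct pickup or dropoff ID plus one for `trip_distance`, so it is one column wider than a matrix built from the two location features alone.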


## Q5. Training a model

Now let's use the feature matrix from the previous step to train a model. 

* Train a plain linear regression model with default parameters 
* Calculate the RMSE of the model on the training data

What's the RMSE on train?

* 6.99
* 11.99
* 16.99
* 21.99

In [144]:
target = 'duration'
y_train = df_train[target].values
y_val = df_val[target].values

In [145]:
lr = LinearRegression()
lr.fit(X_train, y_train)

In [146]:
y_pred = lr.predict(X_train)

mean_squared_error(y_train, y_pred, squared=False)

7.001496179430534
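
For reference, RMSE is the square root of the mean squared difference between targets and predictions. The sketch below computes it by hand with NumPy on made-up numbers, which should correspond to what `mean_squared_error(..., squared=False)` returns.

```python
import numpy as np

# Hypothetical targets and predictions, in minutes.
y_true = np.array([10.0, 20.0, 30.0])
y_hat = np.array([12.0, 18.0, 33.0])

# Errors are -2, 2, -3; their squares are 4, 4, 9; the mean is 17/3.
rmse = np.sqrt(np.mean((y_true - y_hat) ** 2))

print(round(rmse, 2))  # 2.38
```

In more recent scikit-learn releases the `squared` argument was deprecated in favor of a separate `root_mean_squared_error` function, so the call used here may need adjusting depending on the installed version.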

## Q6. Evaluating the model

Now let's apply this model to the validation dataset (February 2022). 

What's the RMSE on validation?

* 7.79
* 12.79
* 17.79
* 22.79

In [147]:
y_pred = lr.predict(X_val)

mean_squared_error(y_val, y_pred, squared=False)

7.7954986956554695
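
One detail worth keeping in mind when applying the fitted vectorizer to another month: `DictVectorizer.transform` silently ignores feature values it did not see during fitting. The sketch below uses made-up location IDs to show that such a record becomes an all-zero row for that feature.

```python
from sklearn.feature_extraction import DictVectorizer

# Fit on two hypothetical pickup locations.
dv_demo = DictVectorizer()
dv_demo.fit([{"PULocationID": "1"}, {"PULocationID": "2"}])

# "99" was never seen during fit, so it maps to no column at all.
X_new = dv_demo.transform([{"PULocationID": "2"}, {"PULocationID": "99"}])

print(X_new.toarray())
# [[0. 1.]
#  [0. 0.]]
```

This is why the validation matrix keeps the same number of columns as the training matrix, even if February contains location IDs that January does not.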


## Submit the results

* Submit your results here: https://forms.gle/uYTnWrcsubi2gdGV7
* You can submit your solution multiple times. In this case, only the last submission will be used
* If your answer doesn't match the options exactly, select the closest one


## Deadline

The deadline for submitting is 23 May 2023 (Tuesday), 23:00 CEST (Berlin time). 

After that, the form will be closed.