In [1]:
!python -V

Python 3.12.10


In [2]:
import pandas as pd

In [3]:
import pickle

In [4]:
import seaborn as sns
import matplotlib.pyplot as plt

In [5]:
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Lasso
from sklearn.linear_model import Ridge

from sklearn.metrics import root_mean_squared_error

1. Read the data for January. How many columns are there?



In [10]:
df = pd.read_parquet('./data/yellow_tripdata_2023-01.parquet')
print(f"There are: {len(df.columns)} columns")


There are: 19 columns


2. Now let's compute the duration variable. It should contain the duration of a ride in minutes. What's the standard deviation of the trip duration in January?

In [12]:
df['duration'] = df.tpep_dropoff_datetime - df.tpep_pickup_datetime
# Vectorized conversion from timedelta to minutes (same result as a row-wise apply, but faster)
df['duration'] = df['duration'].dt.total_seconds() / 60

print(f"The standard deviation is: {df['duration'].std()}")

The standard deviation is: 42.59435124195458
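As a small sanity check on the duration logic, here is a minimal sketch on two made-up trips (the timestamps are illustrative and do not come from the dataset). It also shows that pandas `std()` defaults to the sample standard deviation (`ddof=1`), which is presumably what the figure above reports, since the cell calls `std()` without arguments.

```python
import pandas as pd

# Two invented trips, used only to illustrate the computation
trips = pd.DataFrame({
    "tpep_pickup_datetime": pd.to_datetime(
        ["2023-01-01 00:00:00", "2023-01-01 08:15:00"]),
    "tpep_dropoff_datetime": pd.to_datetime(
        ["2023-01-01 00:10:00", "2023-01-01 09:45:30"]),
})

# Subtracting two datetime columns yields a timedelta column
delta = trips["tpep_dropoff_datetime"] - trips["tpep_pickup_datetime"]

# Convert to minutes with the vectorized .dt accessor
trips["duration"] = delta.dt.total_seconds() / 60
durations = trips["duration"].tolist()

sample_std = trips["duration"].std()            # ddof=1 (pandas default)
population_std = trips["duration"].std(ddof=0)  # ddof=0 (NumPy default)

print(durations, sample_std, population_std)
```

With only two rows the two estimates differ a lot; on millions of rows the difference is negligible.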


3. Next, we need to check the distribution of the duration variable. There are some outliers. Let's remove them and keep only the records where the duration was between 1 and 60 minutes (inclusive). What fraction of the records is left after you drop the outliers?

In [26]:
length = len(df)

# .copy() avoids a SettingWithCopyWarning when columns are reassigned later
df = df[(df.duration >= 1) & (df.duration <= 60)].copy()

print(f"Fraction of rows left: {len(df) / length * 100}%")

Fraction of rows left: 98.1220282212598%


4. Let's apply one-hot encoding to the pickup and dropoff location IDs. We'll use only these two features for our model.

- Turn the dataframe into a list of dictionaries (remember to cast the IDs to strings; otherwise the vectorizer treats them as numeric values instead of one-hot encoding them)
- Fit a dictionary vectorizer
- Get a feature matrix from it

    What's the dimensionality of this matrix (number of columns)?

In [28]:
categorical = ['PULocationID', 'DOLocationID']
df[categorical] = df[categorical].astype(str)

train_dicts = df[categorical].to_dict(orient='records')

dv = DictVectorizer()
X_train = dv.fit_transform(train_dicts)

print(f"Number of features: {X_train.shape[1]}")

Number of features: 515


5. Now let's use the feature matrix from the previous step to train a model.

- Train a plain linear regression model with default parameters, where duration is the response variable
- Calculate the RMSE of the model on the training data

    What's the RMSE on train?

In [30]:
target = 'duration'
y_train = df[target].values

lr = LinearRegression()
lr.fit(X_train, y_train)

y_pred = lr.predict(X_train)

# Compute the metric once and reuse it
rmse_train = root_mean_squared_error(y_train, y_pred)
print(f"RMSE on train: {rmse_train}")

RMSE on train: 7.649261927686161
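For reference, RMSE is the square root of the mean squared difference between targets and predictions. The sketch below computes it by hand with NumPy on three invented values (not model output); for a single target without sample weights this is the same quantity that `root_mean_squared_error` reports.

```python
import numpy as np

# Invented targets and predictions, in minutes
y_true = np.array([10.0, 20.0, 30.0])
y_hat = np.array([12.0, 18.0, 33.0])

# Residuals, squared, averaged, then square-rooted
errors = y_true - y_hat
mse = float(np.mean(errors ** 2))
rmse = float(np.sqrt(mse))

print(rmse)
```

Because the errors are squared before averaging, the metric is in the same unit as the target (minutes) and penalizes large misses more than small ones.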


6. Now let's apply this model to the validation dataset (February 2023).

    What's the RMSE on validation?

In [34]:
df_val = pd.read_parquet('./data/yellow_tripdata_2023-02.parquet')

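df_val['duration'] = df_val.tpep_dropoff_datetime - df_val.tpep_pickup_datetime
# Vectorized conversion from timedelta to minutes
df_val['duration'] = df_val['duration'].dt.total_seconds() / 60

# .copy() avoids a SettingWithCopyWarning when the ID columns are recast below
df_val = df_val[(df_val.duration >= 1) & (df_val.duration <= 60)].copy()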

df_val[categorical] = df_val[categorical].astype(str)

val_dicts = df_val[categorical].to_dict(orient='records')
X_val = dv.transform(val_dicts)

y_val = df_val[target].values

y_pred = lr.predict(X_val)

# Compute the metric once and reuse it
rmse_val = root_mean_squared_error(y_val, y_pred)
print(f"RMSE on val: {rmse_val}")


RMSE on val: 7.811817957524739
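The January and February cells repeat the same preprocessing steps. One way to avoid the duplication is a small helper, sketched below. The name `prepare_trips` and the constant `CATEGORICAL` are hypothetical and not part of the notebook; the toy frame uses invented values so the behavior can be checked without the parquet files.

```python
import pandas as pd

CATEGORICAL = ["PULocationID", "DOLocationID"]  # hypothetical constant


def prepare_trips(df: pd.DataFrame) -> pd.DataFrame:
    """Add a duration column in minutes, keep trips of 1 to 60 minutes,
    and cast the location IDs to strings. The input is not modified."""
    df = df.copy()
    delta = df["tpep_dropoff_datetime"] - df["tpep_pickup_datetime"]
    df["duration"] = delta.dt.total_seconds() / 60
    # between() is inclusive on both ends by default
    df = df[df["duration"].between(1, 60)].copy()
    df[CATEGORICAL] = df[CATEGORICAL].astype(str)
    return df


# Invented trips lasting 0.5, 20 and 120 minutes
toy = pd.DataFrame({
    "tpep_pickup_datetime": pd.to_datetime(["2023-02-01 10:00:00"] * 3),
    "tpep_dropoff_datetime": pd.to_datetime([
        "2023-02-01 10:00:30",
        "2023-02-01 10:20:00",
        "2023-02-01 12:00:00",
    ]),
    "PULocationID": [43, 166, 7],
    "DOLocationID": [151, 151, 7],
})

clean = prepare_trips(toy)
print(clean[["duration"] + CATEGORICAL])
```

With such a helper, the training and validation frames could each be produced by a single call on the output of `pd.read_parquet`, which keeps the two preprocessing paths from drifting apart.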
