## Homework

The goal of this homework is to train a simple model for predicting the duration of a ride, similar to what we did in this module.


## Q1. Downloading the data

We'll use [the same NYC taxi dataset](https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page),
but instead of "**Green** Taxi Trip Records", we'll use "**Yellow** Taxi Trip Records".

Download the data for January and February 2023.
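As a minimal sketch, the monthly files can be read straight from their URLs with pandas. The URL pattern below is the one that appears in the download log of the solution notebook; the helper names `trip_data_url` and `load_trips` are hypothetical and only used for illustration.

```python
import pandas as pd

# Base URL as seen in the download log of the solution notebook.
BASE_URL = "https://d37ci6vzurychx.cloudfront.net/trip-data"


def trip_data_url(taxi: str, year: int, month: int) -> str:
    """Build the parquet URL for one month of trip records (hypothetical helper)."""
    return f"{BASE_URL}/{taxi}_tripdata_{year}-{month:02d}.parquet"


def load_trips(taxi: str, year: int, month: int) -> pd.DataFrame:
    """Read one month of trips from the URL (needs network access and a parquet engine)."""
    return pd.read_parquet(trip_data_url(taxi, year, month))


print(trip_data_url("yellow", 2023, 1))
# Once loaded, the number of columns is len(load_trips("yellow", 2023, 1).columns)
```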

Read the data for January. How many columns are there?

* 16
* 17
* 18
* 19


## Q2. Computing duration

Now let's compute the `duration` variable. It should contain the duration of a ride in minutes. 
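One way to do this, sketched on a two-row toy frame that reuses the yellow-taxi timestamp column names (`tpep_pickup_datetime`, `tpep_dropoff_datetime`), is to subtract the two datetime columns and convert the result to minutes:

```python
import pandas as pd

# Toy frame: timestamps copied from the first two rows of the January sample.
df = pd.DataFrame({
    "tpep_pickup_datetime": pd.to_datetime(["2023-01-01 00:32:10", "2023-01-01 00:55:08"]),
    "tpep_dropoff_datetime": pd.to_datetime(["2023-01-01 00:40:36", "2023-01-01 01:01:27"]),
})

# Subtracting two datetime columns gives a timedelta column;
# total_seconds() / 60 converts it to fractional minutes.
delta = df["tpep_dropoff_datetime"] - df["tpep_pickup_datetime"]
df["duration"] = delta.dt.total_seconds() / 60

print(df["duration"].round(2).tolist())
```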

What's the standard deviation of the trip durations in January?

* 32.59
* 42.59
* 52.59
* 62.59


## Q3. Dropping outliers

Next, we need to check the distribution of the `duration` variable. There are some outliers. Let's remove them and keep only the records where the duration was between 1 and 60 minutes (inclusive).
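A small sketch of the inclusive filter on made-up durations. `Series.between` includes both endpoints by default, so trips of exactly 1 or 60 minutes are kept, and the mean of the boolean mask is the fraction of records that survive:

```python
import pandas as pd

# Made-up durations in minutes, chosen to hit both boundaries.
durations = pd.Series([0.5, 1.0, 12.3, 60.0, 75.0], name="duration")

# between() is inclusive on both ends by default.
mask = durations.between(1, 60)

# The mean of a boolean mask is the share of True values.
kept_fraction = mask.mean()
print(mask.tolist(), kept_fraction)
```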

What fraction of the records is left after dropping the outliers?

* 90%
* 92%
* 95%
* 98%


## Q4. One-hot encoding

Let's apply one-hot encoding to the pickup and dropoff location IDs. We'll use only these two features for our model. 

* Turn the dataframe into a list of dictionaries (remember to cast the IDs to strings first; otherwise
  the vectorizer will treat them as plain numeric features instead of one-hot encoding them)
* Fit a dictionary vectorizer 
* Get a feature matrix from it
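To see why the string cast matters, here is a minimal sketch with made-up location IDs. With integer values, `DictVectorizer` keeps one numeric column per key; with string values, it creates one column per `key=value` pair, which is the one-hot encoding we want.

```python
from sklearn.feature_extraction import DictVectorizer

# Made-up location IDs, once as integers and once as strings.
rows_int = [
    {"PULocationID": 161, "DOLocationID": 141},
    {"PULocationID": 43, "DOLocationID": 237},
]
rows_str = [{key: str(value) for key, value in row.items()} for row in rows_int]

dv_int = DictVectorizer()
X_int = dv_int.fit_transform(rows_int)  # numeric values are passed through as-is

dv_str = DictVectorizer()
X_str = dv_str.fit_transform(rows_str)  # string values are one-hot encoded

print(dv_int.feature_names_, X_int.shape)
print(dv_str.feature_names_, X_str.shape)
```

The number of columns of the one-hot matrix is `len(dv_str.feature_names_)`, or equivalently `X_str.shape[1]`.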

What's the dimensionality of this matrix (number of columns)?

* 2
* 155
* 345
* 515
* 715


## Q5. Training a model

Now let's use the feature matrix from the previous step to train a model. 

* Train a plain linear regression model with default parameters, where duration is the response variable
* Calculate the RMSE of the model on the training data
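The two steps can be sketched on a tiny made-up training set. RMSE is computed here directly from its definition with NumPy, which should give the same value as a library helper such as scikit-learn's `root_mean_squared_error` where that is available.

```python
import numpy as np
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LinearRegression

# Made-up rides: two pickup/dropoff pairs, two rides each.
train_dicts = [
    {"PULocationID": "161", "DOLocationID": "141"},
    {"PULocationID": "161", "DOLocationID": "141"},
    {"PULocationID": "43", "DOLocationID": "237"},
    {"PULocationID": "43", "DOLocationID": "237"},
]
y_train = np.array([8.0, 10.0, 6.0, 7.0])  # durations in minutes

dv = DictVectorizer()
X_train = dv.fit_transform(train_dicts)

# Plain linear regression with default parameters.
model = LinearRegression()
model.fit(X_train, y_train)

# RMSE = square root of the mean squared residual.
y_pred = model.predict(X_train)
rmse = float(np.sqrt(np.mean((y_train - y_pred) ** 2)))
print(round(rmse, 3))
```

With only two distinct pickup/dropoff pairs, the least-squares fit predicts roughly the mean duration of each pair, so the training RMSE reflects the spread within each pair.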

What's the RMSE on train?

* 3.64
* 7.64
* 11.64
* 16.64


## Q6. Evaluating the model

Now let's apply this model to the validation dataset (February 2023). 
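When encoding the validation month, the vectorizer fitted on the training data should be reused with `transform` rather than refitted, so that the columns line up with what the model was trained on. A minimal sketch with made-up IDs, which also shows that a location never seen in training is silently ignored (its row is all zeros):

```python
from sklearn.feature_extraction import DictVectorizer

train_dicts = [{"PULocationID": "161"}, {"PULocationID": "43"}]
# "999" is a made-up ID that does not occur in the training data.
val_dicts = [{"PULocationID": "161"}, {"PULocationID": "999"}]

dv = DictVectorizer()
dv.fit(train_dicts)

# transform() reuses the training vocabulary; it does not add new columns.
X_val = dv.transform(val_dicts)

print(dv.feature_names_)
print(X_val.toarray())
```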

What's the RMSE on validation?

* 3.81
* 7.81
* 11.81
* 16.81

## Submit the results

* Submit your results here: https://courses.datatalks.club/mlops-zoomcamp-2025/homework/hw1
* If your answer doesn't match any of the options exactly, select the closest one

In [1]:
import os
import argparse
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pickle
import joblib
from typing import List, Optional, Dict, Union, Tuple
from datetime import datetime
import logging
from NYC_trip_duration_pred import NYCTaxiDurationPredictor

from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Lasso
from sklearn.linear_model import Ridge

from sklearn.metrics import root_mean_squared_error

In [2]:
nyc = NYCTaxiDurationPredictor()

In [6]:
# Question 1: 19 columns (option D)
df_jan = nyc.download_data(year=2023, month=1, taxi='yellow')
df_jan

Downloading data from https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-01.parquet
Downloaded 3066766 records


Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,RatecodeID,store_and_fwd_flag,PULocationID,DOLocationID,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,congestion_surcharge,airport_fee
0,2,2023-01-01 00:32:10,2023-01-01 00:40:36,1.0,0.97,1.0,N,161,141,2,9.30,1.00,0.5,0.00,0.0,1.0,14.30,2.5,0.00
1,2,2023-01-01 00:55:08,2023-01-01 01:01:27,1.0,1.10,1.0,N,43,237,1,7.90,1.00,0.5,4.00,0.0,1.0,16.90,2.5,0.00
2,2,2023-01-01 00:25:04,2023-01-01 00:37:49,1.0,2.51,1.0,N,48,238,1,14.90,1.00,0.5,15.00,0.0,1.0,34.90,2.5,0.00
3,1,2023-01-01 00:03:48,2023-01-01 00:13:25,0.0,1.90,1.0,N,138,7,1,12.10,7.25,0.5,0.00,0.0,1.0,20.85,0.0,1.25
4,2,2023-01-01 00:10:29,2023-01-01 00:21:19,1.0,1.43,1.0,N,107,79,1,11.40,1.00,0.5,3.28,0.0,1.0,19.68,2.5,0.00
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3066761,2,2023-01-31 23:58:34,2023-02-01 00:12:33,,3.05,,,107,48,0,15.80,0.00,0.5,3.96,0.0,1.0,23.76,,
3066762,2,2023-01-31 23:31:09,2023-01-31 23:50:36,,5.80,,,112,75,0,22.43,0.00,0.5,2.64,0.0,1.0,29.07,,
3066763,2,2023-01-31 23:01:05,2023-01-31 23:25:36,,4.67,,,114,239,0,17.61,0.00,0.5,5.32,0.0,1.0,26.93,,
3066764,2,2023-01-31 23:40:00,2023-01-31 23:53:00,,3.15,,,230,79,0,18.15,0.00,0.5,4.43,0.0,1.0,26.58,,


In [7]:
df_feb = nyc.download_data(year=2023, month=2, taxi='yellow')
df_feb

Downloading data from https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-02.parquet
Downloaded 2913955 records


Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,RatecodeID,store_and_fwd_flag,PULocationID,DOLocationID,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,congestion_surcharge,Airport_fee
0,1,2023-02-01 00:32:53,2023-02-01 00:34:34,2.0,0.30,1.0,N,142,163,2,4.40,3.50,0.5,0.00,0.0,1.0,9.40,2.5,0.00
1,2,2023-02-01 00:35:16,2023-02-01 00:35:30,1.0,0.00,1.0,N,71,71,4,-3.00,-1.00,-0.5,0.00,0.0,-1.0,-5.50,0.0,0.00
2,2,2023-02-01 00:35:16,2023-02-01 00:35:30,1.0,0.00,1.0,N,71,71,4,3.00,1.00,0.5,0.00,0.0,1.0,5.50,0.0,0.00
3,1,2023-02-01 00:29:33,2023-02-01 01:01:38,0.0,18.80,1.0,N,132,26,1,70.90,2.25,0.5,0.00,0.0,1.0,74.65,0.0,1.25
4,2,2023-02-01 00:12:28,2023-02-01 00:25:46,1.0,3.22,1.0,N,161,145,1,17.00,1.00,0.5,3.30,0.0,1.0,25.30,2.5,0.00
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2913950,2,2023-02-28 23:46:00,2023-03-01 00:05:00,,4.65,,,249,140,0,20.22,0.00,0.5,4.84,0.0,1.0,29.06,,
2913951,2,2023-02-28 23:26:02,2023-02-28 23:37:10,,2.47,,,186,79,0,13.66,0.00,0.5,2.65,0.0,1.0,20.31,,
2913952,2,2023-02-28 23:24:00,2023-02-28 23:38:00,,3.49,,,158,143,0,17.64,0.00,0.5,0.00,0.0,1.0,21.64,,
2913953,2,2023-02-28 23:03:00,2023-02-28 23:10:00,,2.13,,,79,162,0,13.56,0.00,0.5,2.63,0.0,1.0,20.19,,


In [8]:
df_jan['duration'] = (df_jan['tpep_dropoff_datetime'] - df_jan['tpep_pickup_datetime']).dt.total_seconds() / 60

In [9]:
df_feb['duration'] = (df_feb['tpep_dropoff_datetime'] - df_feb['tpep_pickup_datetime']).dt.total_seconds() / 60

In [10]:
# Question 2: 42.59 (option B)
df_jan['duration'].std(ddof=1)

42.594351241920904

In [11]:
#Question 3: 98%
((df_jan['duration'] >= 1) & (df_jan['duration'] <= 60)).mean()

0.9812202822125979

In [12]:
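# Keep only trips lasting between 1 and 60 minutes (inclusive).
# .copy() makes the filtered frames independent, so later column assignments
# do not trigger pandas' SettingWithCopyWarning.
df_jan2 = df_jan[(df_jan['duration'] >= 1) & (df_jan['duration'] <= 60)].copy()
df_feb2 = df_feb[(df_feb['duration'] >= 1) & (df_feb['duration'] <= 60)].copy()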

In [23]:
df_feb2.shape

(2855951, 20)

In [13]:
feats = ["PULocationID", "DOLocationID"]
df_jan2[feats] = df_jan2[feats].astype(str)
df_feb2[feats] = df_feb2[feats].astype(str)

In [14]:
from sklearn.feature_extraction import DictVectorizer
dv = DictVectorizer()

train_dicts = df_jan2[feats].to_dict(orient='records')
X_train = dv.fit_transform(train_dicts)

In [15]:
# Question 4: D 515
len(dv.feature_names_)

515

In [16]:
val_dicts = df_feb2[feats].to_dict(orient='records')
X_val = dv.transform(val_dicts)

In [17]:
target = 'duration'
y_train = df_jan2[target].values
y_val = df_feb2[target].values

In [18]:
lr = LinearRegression()
lr.fit(X_train, y_train)

In [None]:
# Question 6: 7.81
y_pred = lr.predict(X_val)

root_mean_squared_error(y_val, y_pred)

7.811818488357944

In [20]:
#Question 5: 7.649
root_mean_squared_error(y_train, lr.predict(X_train))

7.649261932004951

In [None]:
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

# Create pipeline with DictVectorizer and LinearRegression
pipeline = make_pipeline(
    DictVectorizer(), 
    LinearRegression()
)

# Train the model
pipeline.fit(train_dicts, y_train)

root_mean_squared_error(y_val, pipeline.predict(val_dicts))

7.811818488357944

## Using the Python script

In [5]:
!make train-custom \
  DATA_YEAR=2023 \
  DATA_MONTH=1 \
  TEST_YEAR=2023 \
  TEST_MONTH=1 \
  TAXI=yellow \
  MODEL=linear_regression \
  TARGET_TRANSFORM=none \
  FEATURES="PULocationID DOLocationID" \
  CAT_FEATURES="PULocationID DOLocationID" \
  NUM_FEATURES=none \
  CAT_PREPROCESSOR=dictvectorizer \
  NUM_PREPROCESSOR=none

mkdir -p models plots data
python3 NYC_trip_duration_pred.py train \
	--train-year 2023 \
	--train-month 1 \
	--test-year 2023 \
	--test-month 1 \
	--taxi-type yellow \
	--model linear_regression \
	--target-transform none \
	--features PULocationID DOLocationID \
	--categorical-features PULocationID DOLocationID \
	--numerical-features none \
	--cat-preprocessor dictvectorizer \
	--num-preprocessor none \
	--save-model \
	--save-plot \
	--random-state 42
2025-05-12 16:58:25,903 - INFO - Starting NYC Taxi Duration Prediction - Training Pipeline
2025-05-12 16:58:25,904 - INFO - Model: linear_regression, Target Transform: none
2025-05-12 16:58:25,904 - INFO - Cat Preprocessor: dictvectorizer, Num Preprocessor: none
Downloading data from https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-01.parquet
Downloaded 3066766 records
Downloading data from https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-01.parquet
Downloaded 3066766 records
Training linear_reg

In [6]:
!make predict \
    MODEL_PATH=models/nyc_taxi_duration_linear_regression_20250512_165903.joblib \
    DATA_YEAR=2023 \
    TAXI=yellow \
    DATA_MONTH=2

python3 NYC_trip_duration_pred.py predict \
	--model-path models/nyc_taxi_duration_linear_regression_20250512_165903.joblib \
	--year 2023 \
	--month 2 \
	--taxi-type yellow \
	--save-plot
2025-05-12 17:01:17,416 - INFO - Starting NYC Taxi Duration Prediction - Prediction Pipeline
2025-05-12 17:01:17,416 - INFO - Loading model from models/nyc_taxi_duration_linear_regression_20250512_165903.joblib
2025-05-12 17:01:17,416 - INFO - Loading model from models/nyc_taxi_duration_linear_regression_20250512_165903.joblib
2025-05-12 17:01:17,418 - INFO - Model loaded from models/nyc_taxi_duration_linear_regression_20250512_165903.joblib
2025-05-12 17:01:17,418 - INFO - Model: linear_regression
2025-05-12 17:01:17,418 - INFO - Features: PULocationID, DOLocationID
Downloading data from https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-02.parquet
Downloaded 2913955 records
RMSE: 7.81
MAE: 5.82
R²: 0.40
2025-05-12 17:01:42,735 - INFO - Plot saved to plots/nyc_taxi_duration_linear_