# Week 1: NYC Taxi Ride Duration Prediction – Baseline Model

In this notebook, I build a baseline machine learning model that predicts the duration of NYC green taxi rides. It uses trip data from January and February 2021 and follows the Week 1 lessons of the [MLOps Zoomcamp](https://github.com/DataTalksClub/mlops-zoomcamp).

The data can be found at: https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page

### Objectives:
- Load and explore the NYC Green Taxi trip dataset.
- Perform data cleaning and filtering (e.g., removing very short or long rides).
- Engineer useful features such as ride distance and pickup/drop-off combinations.
- Encode categorical variables using `DictVectorizer`.
- Train and evaluate linear regression models (Linear, Lasso, Ridge).
- Establish a baseline RMSE score using a hold-out validation set.
- Save the model and preprocessing pipeline for later use.


## Install Packages and Import Libraries

In [1]:
!pip install pyarrow # read parquet files

In [35]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns 

In [36]:
from scipy.stats import gaussian_kde
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LinearRegression, Lasso, Ridge
from sklearn.metrics import mean_squared_error

In [39]:
import pickle

## Data Ingestion & Basic Cleaning

`read_dataframe(url)` downloads a Parquet file from the given URL and prepares it for modeling:

- Loads the file via `pd.read_parquet` (pyarrow engine).
- Calculates trip duration in minutes from the pickup/drop-off timestamps.
- Keeps only trips lasting between 1 and 60 minutes (inclusive) to remove obvious outliers.
- Casts the pickup and drop-off location IDs to strings and joins them into a combined `PU_DO` route column so they can be one-hot encoded later.

In [24]:
def read_dataframe(url: str) -> pd.DataFrame:
    """
    Load NYC Green Taxi data from a Parquet URL and apply minimal filtering.
    Returns a DataFrame with an extra 'PU_DO' categorical column.
    """
    df = pd.read_parquet(url, engine="pyarrow")

    # trip duration (minutes)
    df["duration"] = (
        df.lpep_dropoff_datetime - df.lpep_pickup_datetime
    ).dt.total_seconds() / 60

    # keep 1–60 min trips
    df = df.query("1 <= duration <= 60").copy()

    # categorical ids → string then combined route id
    df["PULocationID"] = df["PULocationID"].astype(str)
    df["DOLocationID"] = df["DOLocationID"].astype(str)
    df["PU_DO"] = df["PULocationID"] + "_" + df["DOLocationID"]

    return df
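
As a quick sanity check of the duration and filtering logic, the same steps can be run on a tiny hand-made frame. The timestamps and location IDs below are made up for illustration; only the column names come from the dataset.

```python
import pandas as pd

# Three made-up trips lasting 30 seconds, 15 minutes, and 2 hours.
toy = pd.DataFrame({
    "lpep_pickup_datetime": pd.to_datetime(
        ["2021-01-01 10:00:00", "2021-01-01 11:00:00", "2021-01-01 12:00:00"]
    ),
    "lpep_dropoff_datetime": pd.to_datetime(
        ["2021-01-01 10:00:30", "2021-01-01 11:15:00", "2021-01-01 14:00:00"]
    ),
    "PULocationID": [43, 166, 74],
    "DOLocationID": [151, 239, 75],
})

# Same duration formula as read_dataframe: timedelta -> minutes.
toy["duration"] = (
    toy.lpep_dropoff_datetime - toy.lpep_pickup_datetime
).dt.total_seconds() / 60

# Only the 15-minute trip falls inside the 1-60 minute window.
kept = toy.query("1 <= duration <= 60").copy()
kept["PU_DO"] = (
    kept["PULocationID"].astype(str) + "_" + kept["DOLocationID"].astype(str)
)

print(kept[["duration", "PU_DO"]])
```

The 30-second and 2-hour trips are dropped, which leaves a single row with route `166_239`.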

## Feature Engineering with DictVectorizer

`make_X_y(df, dv=None, fit=True)` turns the cleaned DataFrame into:

- `X`: a sparse feature matrix from `DictVectorizer` (one-hot `PU_DO` plus numeric `trip_distance`).
- `y`: the duration vector in minutes.
- `dv`: the fitted vectorizer, returned so it can be reused on the validation data.

In [25]:
from typing import Optional

def make_X_y(
    df: pd.DataFrame,
    dv: Optional[DictVectorizer] = None,
    fit: bool = True,
):
    """
    Turn the DataFrame into X (sparse) and y.
    Features: one-hot 'PU_DO'  + numeric 'trip_distance'.
    """
    cat = ["PU_DO"]
    num = ["trip_distance"]

    records = df[cat + num].to_dict(orient="records")

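    if dv is None:
        if not fit:
            # transform-only mode needs a vectorizer that was fitted earlier
            raise ValueError("Pass a fitted DictVectorizer when fit=False.")
        dv = DictVectorizer(sparse=True)

    # fit on training data; only transform validation data
    X = dv.fit_transform(records) if fit else dv.transform(records)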
    y = df["duration"].values
    return X, y, dv

## Train / Validation Split (Jan + Feb 2021 Green Taxi)

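Train on January 2021 data and validate on February 2021 data. The vectorizer is fitted on the training month only and then reused to transform the validation month.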

In [26]:
URL_TRAIN = "https://d37ci6vzurychx.cloudfront.net/trip-data/green_tripdata_2021-01.parquet"
URL_VAL   = "https://d37ci6vzurychx.cloudfront.net/trip-data/green_tripdata_2021-02.parquet"

df_train = read_dataframe(URL_TRAIN)
df_val   = read_dataframe(URL_VAL)

X_train, y_train, dv = make_X_y(df_train, fit=True)
X_val,   y_val,  _   = make_X_y(df_val,   dv=dv, fit=False)
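
Because the vectorizer only knows routes from the training month, it can be useful to check how many validation trips use a route it has never seen. The helper below is a hypothetical sketch (`unseen_share` is not part of the notebook) and is shown on made-up route IDs.

```python
import pandas as pd

def unseen_share(train_routes: pd.Series, val_routes: pd.Series) -> float:
    """Fraction of validation rows whose route never appears in training."""
    seen = set(train_routes.unique())
    return float((~val_routes.isin(seen)).mean())

# Made-up route IDs: two of the four validation routes are new.
train_routes = pd.Series(["43_151", "166_239", "43_151"])
val_routes = pd.Series(["43_151", "74_75", "166_239", "7_7"])

share = unseen_share(train_routes, val_routes)
print(f"Unseen validation routes: {share:.0%}")
```

On the real data this would be called as `unseen_share(df_train["PU_DO"], df_val["PU_DO"])`.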

## Baseline Model: Linear Regression

Fit an ordinary least-squares model and report RMSE on February 2021.

In [27]:
lin_reg = LinearRegression()
lin_reg.fit(X_train, y_train)

y_pred_lr = lin_reg.predict(X_val)
# RMSE = sqrt(MSE); avoids the `squared` argument, which newer scikit-learn releases no longer accept
rmse_lr   = np.sqrt(mean_squared_error(y_val, y_pred_lr))
print(f"Linear Regression RMSE: {rmse_lr:.12f} minutes")

Linear Regression RMSE: 7.479586896300 minutes
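
For reference, the RMSE reported above is the square root of the mean squared difference between true and predicted durations, so it is expressed in minutes. A minimal NumPy version, with made-up numbers and a hypothetical helper name `rmse`, looks like this:

```python
import numpy as np

def rmse(y_true, y_pred) -> float:
    """Root mean squared error: sqrt(mean((y_true - y_pred)^2))."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Made-up durations in minutes; the errors are -3, 4 and 0.
example = rmse([10.0, 20.0, 30.0], [13.0, 16.0, 30.0])
print(f"{example:.4f}")  # sqrt((9 + 16 + 0) / 3)
```

Squaring the errors means that a few badly mispredicted trips weigh more heavily than many small misses.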


## Regularised Model: Lasso Regression

Fit a Lasso model with ℓ¹ regularisation (alpha=0.001) and compare its RMSE with the linear regression baseline.

In [43]:
lasso = Lasso(alpha=0.001)
lasso.fit(X_train, y_train)

y_pred_lasso = lasso.predict(X_val)
# RMSE = sqrt(MSE); avoids the `squared` argument, which newer scikit-learn releases no longer accept
rmse_lasso   = np.sqrt(mean_squared_error(y_val, y_pred_lasso))
print(f"Lasso Regression RMSE: {rmse_lasso:.12f} minutes")

Lasso Regression RMSE: 9.233436225721 minutes


## Regularised Model: Ridge Regression

Fit a Ridge model with ℓ² regularisation (alpha=1.0) and evaluate.

In [47]:
ridge = Ridge(alpha=1.0)
ridge.fit(X_train, y_train)

y_pred_ridge = ridge.predict(X_val)
# RMSE = sqrt(MSE); avoids the `squared` argument, which newer scikit-learn releases no longer accept
rmse_ridge   = np.sqrt(mean_squared_error(y_val, y_pred_ridge))
print(f"Ridge Regression RMSE: {rmse_ridge:.12f} minutes")

Ridge Regression RMSE: 11.342603943250 minutes
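
To compare the three models side by side, the validation scores can be collected and ranked. The values below are copied from the cell outputs in this notebook; the dictionary and variable names are only illustrative.

```python
# Validation RMSE in minutes, copied from the outputs above.
rmse_by_model = {
    "LinearRegression": 7.479586896300,
    "Lasso(alpha=0.001)": 9.233436225721,
    "Ridge(alpha=1.0)": 11.342603943250,
}

# Lower RMSE is better, so sort ascending.
ranked = sorted(rmse_by_model.items(), key=lambda kv: kv[1])
for name, score in ranked:
    print(f"{name:<20} {score:6.3f}")

best_name = ranked[0][0]
```

With these settings, plain linear regression has the lowest validation RMSE of the three.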
