# MLOps Zoomcamp 2025: Homework 1

This notebook solves the homework for predicting NYC Yellow Taxi trip durations using January and February 2023 data. We train a LinearRegression model with one-hot encoded pickup and dropoff location IDs and evaluate it on training and validation sets.

## Questions
1. How many columns are in the January 2023 dataset?
2. What’s the standard deviation of trip durations in January?
3. What fraction of records remain after removing duration outliers (1–60 minutes)?
4. What’s the dimensionality of the feature matrix after one-hot encoding?
5. What’s the RMSE on the training data?
6. What’s the RMSE on the validation data (February 2023)?

## Environment
- Run in the `exp-tracking-env` conda environment.
- Dependencies: pandas, scikit-learn, mlflow, pyarrow, requests.
- MLflow tracking URI: `sqlite:///mlflow.db`, Experiment: `nyc-taxi-homework`.
- Data: Yellow Taxi Trip Records from [TLC Trip Record Data](https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page).

In [1]:
!python -V

Python 3.10.16


## Imports and Setup

Import libraries and configure MLflow tracking.

In [2]:
import pandas as pd
import requests
import os
import pickle
import mlflow
import logging
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import root_mean_squared_error
from mlflow.models.signature import infer_signature

In [4]:
# Set up logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')

# Ensure data and models directories exist
os.makedirs('data', exist_ok=True)
os.makedirs('models', exist_ok=True)

# Set MLflow tracking
mlflow.set_tracking_uri('sqlite:///mlflow.db')
mlflow.set_experiment('nyc-taxi-homework')

## Q1: Downloading and Loading Data

Download Yellow Taxi Trip Records for January and February 2023 from the TLC website and load the January data to count columns.

In [5]:
# Download data
urls = {
    '01': 'https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-01.parquet',
    '02': 'https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-02.parquet'
}

for month, url in urls.items():
    filename = f'data/yellow_tripdata_2023-{month}.parquet'
    if not os.path.exists(filename) or os.path.getsize(filename) == 0:
        try:
            logging.info(f'Downloading {filename}')
            response = requests.get(url, timeout=30)
            response.raise_for_status()  # Raise exception for HTTP errors
            with open(filename, 'wb') as f:
                f.write(response.content)
            if os.path.getsize(filename) > 0:
                logging.info(f'Successfully downloaded {filename} ({os.path.getsize(filename)} bytes)')
            else:
                logging.error(f'Downloaded file {filename} is empty')
                raise ValueError(f'Empty file downloaded for {filename}')
        except (requests.exceptions.RequestException, ValueError) as e:
            logging.error(f'Failed to download {filename}: {e}')
            raise
    else:
        logging.info(f'{filename} already exists and is non-empty ({os.path.getsize(filename)} bytes)')

# Load January data
try:
    df_jan = pd.read_parquet('data/yellow_tripdata_2023-01.parquet')
    num_columns = len(df_jan.columns)
    logging.info(f'Number of columns in January 2023 data: {num_columns}')
    print(f'Q1 Answer: {num_columns}')
except FileNotFoundError as e:
    logging.error(f'File not found: {e}')
    raise
except Exception as e:
    logging.error(f'Error loading January data: {e}')
    raise

2025-05-21 01:56:53,040 - INFO - Downloading data/yellow_tripdata_2023-01.parquet
2025-05-21 01:57:02,186 - INFO - Successfully downloaded data/yellow_tripdata_2023-01.parquet (47673370 bytes)
2025-05-21 01:57:02,187 - INFO - Downloading data/yellow_tripdata_2023-02.parquet
2025-05-21 01:57:11,850 - INFO - Successfully downloaded data/yellow_tripdata_2023-02.parquet (47748012 bytes)
2025-05-21 01:57:12,189 - INFO - Number of columns in January 2023 data: 19


Q1 Answer: 19


## Q2 & Q3: Computing Duration and Dropping Outliers

Compute trip duration in minutes, calculate its standard deviation, and filter out durations outside 1–60 minutes to find the fraction of remaining records.

In [6]:
def read_dataframe(filename):
    try:
        df = pd.read_parquet(filename)
        df['duration'] = (df['tpep_dropoff_datetime'] - df['tpep_pickup_datetime']).dt.total_seconds() / 60
        logging.info(f'Loaded {filename} with {len(df)} rows')
        return df
    except Exception as e:
        logging.error(f'Error reading {filename}: {e}')
        raise

# Load and preprocess January data
df_jan = read_dataframe('data/yellow_tripdata_2023-01.parquet')

# Q2: Standard deviation of duration
std_duration = df_jan['duration'].std()
logging.info(f'Standard deviation of duration: {std_duration:.2f}')
print(f'Q2 Answer: {std_duration:.2f}')

# Q3: Fraction of records after removing outliers
original_count = len(df_jan)
df_jan_filtered = df_jan[(df_jan['duration'] >= 1) & (df_jan['duration'] <= 60)]
filtered_count = len(df_jan_filtered)
fraction_remaining = filtered_count / original_count
logging.info(f'Fraction of records remaining: {fraction_remaining:.2%}')
print(f'Q3 Answer: {fraction_remaining:.0%}')

2025-05-21 01:57:12,520 - INFO - Loaded data/yellow_tripdata_2023-01.parquet with 3066766 rows
2025-05-21 01:57:12,580 - INFO - Standard deviation of duration: 42.59


Q2 Answer: 42.59


2025-05-21 01:57:12,960 - INFO - Fraction of records remaining: 98.12%


Q3 Answer: 98%
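
The duration and filtering logic above can be checked on a small hand-built frame where the boundary cases are obvious. This is a sketch with made-up timestamps, not real trip data; it uses `Series.between`, which is inclusive on both ends by default and so matches the `>= 1` and `<= 60` conditions.

```python
import pandas as pd

# Made-up trips that all start at midnight; only the dropoff time varies.
toy = pd.DataFrame({
    "tpep_pickup_datetime": pd.to_datetime(["2023-01-01 00:00:00"] * 4),
    "tpep_dropoff_datetime": pd.to_datetime([
        "2023-01-01 00:00:30",  # 0.5 minutes: below the lower bound
        "2023-01-01 00:01:00",  # exactly 1 minute: kept
        "2023-01-01 01:00:00",  # exactly 60 minutes: kept
        "2023-01-01 01:00:01",  # just over 60 minutes: dropped
    ]),
})

# Same duration formula as read_dataframe: timedelta -> seconds -> minutes.
toy["duration"] = (
    toy["tpep_dropoff_datetime"] - toy["tpep_pickup_datetime"]
).dt.total_seconds() / 60

kept = toy[toy["duration"].between(1, 60)]
fraction_kept = len(kept) / len(toy)
print(kept["duration"].tolist(), fraction_kept)
```

Note that `Series.std()` computes the sample standard deviation (`ddof=1`) by default, which is the value reported for Q2.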


## Q4: One-hot Encoding

Apply one-hot encoding to pickup and dropoff location IDs and get the dimensionality of the feature matrix.

In [7]:
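# Prepare features: cast location IDs to strings so DictVectorizer one-hot encodes them.
# assign() returns a new DataFrame, which avoids pandas' SettingWithCopyWarning.
df_jan_filtered = df_jan_filtered.assign(
    PULocationID=df_jan_filtered['PULocationID'].astype(str),
    DOLocationID=df_jan_filtered['DOLocationID'].astype(str),
)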

categorical = ['PULocationID', 'DOLocationID']
dv = DictVectorizer()

train_dicts = df_jan_filtered[categorical].to_dict(orient='records')
X_train = dv.fit_transform(train_dicts)

num_features = X_train.shape[1]
logging.info(f'Dimensionality of feature matrix: {num_features}')
print(f'Q4 Answer: {num_features}')

2025-05-21 01:57:26,183 - INFO - Dimensionality of feature matrix: 515


Q4 Answer: 515


## Q5: Training a Model

Train a LinearRegression model on the January data and compute RMSE on the training set.

In [8]:
with mlflow.start_run():
    mlflow.set_tag('model', 'LinearRegression')
    mlflow.set_tag('developer', 'John')
    mlflow.log_param('train-data-path', './data/yellow_tripdata_2023-01.parquet')
    mlflow.log_param('features', 'PULocationID, DOLocationID')

    y_train = df_jan_filtered['duration'].values
    lr = LinearRegression()
    lr.fit(X_train, y_train)
    y_pred = lr.predict(X_train)
    rmse_train = root_mean_squared_error(y_train, y_pred)
    mlflow.log_metric('rmse_train', rmse_train)

    with open('models/lin_reg.bin', 'wb') as f_out:
        pickle.dump((dv, lr), f_out)
    mlflow.log_artifact('models/lin_reg.bin', artifact_path='models_pickle')

    input_example = X_train[:5].toarray()
    signature = infer_signature(X_train, y_pred)
    mlflow.sklearn.log_model(lr, artifact_path='models_mlflow', signature=signature, input_example=input_example)

    logging.info(f'Training RMSE: {rmse_train:.2f}')
    print(f'Q5 Answer: {rmse_train:.2f}')

2025-05-21 01:58:02,182 - INFO - Training RMSE: 7.65


Q5 Answer: 7.65


## Q6: Evaluating the Model

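Apply the trained model to February 2023 data and compute RMSE on the validation set. The validation location IDs need the same string cast as the training data. The RMSE of 13.32 recorded below comes from a run in which they were left as integers: `DictVectorizer` then matches none of its fitted features, the validation matrix is all zeros, and the model predicts only its intercept.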

In [9]:
# Load and preprocess February data
df_feb = read_dataframe('data/yellow_tripdata_2023-02.parquet')
df_feb_filtered = df_feb[(df_feb['duration'] >= 1) & (df_feb['duration'] <= 60)]

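# Prepare validation features. Cast the IDs to strings exactly as in training:
# integer IDs would be treated as numeric features that the fitted
# DictVectorizer has never seen, producing an all-zero matrix.
val_dicts = df_feb_filtered[categorical].astype(str).to_dict(orient='records')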
X_val = dv.transform(val_dicts)
y_val = df_feb_filtered['duration'].values

with mlflow.start_run():
    mlflow.set_tag('model', 'LinearRegression')
    mlflow.set_tag('developer', 'John')
    mlflow.log_param('valid-data-path', './data/yellow_tripdata_2023-02.parquet')
    mlflow.log_param('features', 'PULocationID, DOLocationID')

    y_pred = lr.predict(X_val)
    rmse_val = root_mean_squared_error(y_val, y_pred)
    mlflow.log_metric('rmse_val', rmse_val)

    logging.info(f'Validation RMSE: {rmse_val:.2f}')
    print(f'Q6 Answer: {rmse_val:.2f}')

2025-05-21 01:58:02,528 - INFO - Loaded data/yellow_tripdata_2023-02.parquet with 2913955 rows
2025-05-21 01:58:07,828 - INFO - Validation RMSE: 13.32


Q6 Answer: 13.32
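
A useful sanity check on any validation matrix produced by a fitted `DictVectorizer` is that it actually contains non-zero entries. The sketch below uses made-up records to show why: a vectorizer fitted on string IDs silently ignores the same IDs passed as integers, because numeric values map to a feature named after the key alone rather than `<key>=<value>`.

```python
from sklearn.feature_extraction import DictVectorizer

# Fit on string IDs, as in the training cell (records are made up).
toy_dv = DictVectorizer()
toy_dv.fit([
    {"PULocationID": "1", "DOLocationID": "2"},
    {"PULocationID": "4", "DOLocationID": "3"},
])

# Same trip, once with integer IDs and once with string IDs.
as_int = toy_dv.transform([{"PULocationID": 1, "DOLocationID": 2}])
as_str = toy_dv.transform([{"PULocationID": "1", "DOLocationID": "2"}])

# nnz counts stored non-zero entries in the sparse matrix.
nnz_int, nnz_str = as_int.nnz, as_str.nnz
print(nnz_int, nnz_str)
```

Checking `X_val.nnz` before predicting would catch an all-zero validation matrix early.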
