# NYC Taxi Duration Prediction — Experiment Tracking with MLflow

This notebook trains and evaluates models to predict NYC taxi trip duration, while tracking experiments with MLflow.

- Objective: predict trip duration (minutes)
- Dataset: NYC Taxi (sample months)
- Stack: pandas, scikit-learn, XGBoost, Hyperopt, MLflow

Sections
- Environment and Imports
- MLflow Tracking Setup
- Data Reading Utility
- Load Train and Validation Data
- Feature Engineering
- Vectorization (DictVectorizer)
- Define Target Variable
- Baseline Linear Regression (optional)
- Hyperparameter Tuning with XGBoost and Hyperopt
- Train final XGBoost model and log to MLflow
- Quick Model Comparisons with scikit-learn (autolog)
- Load Logged Model and Predict


## Environment and Imports

- Check Python version
- Import core libraries for data processing, modeling, and tracking


In [None]:
!python -V

In [None]:
import pandas as pd
import numpy as np

In [None]:
import pickle

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Lasso
from sklearn.linear_model import Ridge

from sklearn.metrics import mean_squared_error

### MLflow Tracking Setup

- Configure tracking URI and experiment name
- Ensure the tracking backend is available (e.g., SQLite file)


In [None]:
import mlflow

mlflow.set_tracking_uri("sqlite:///mlflow.db")
mlflow.set_experiment("nyc-taxi-experiment")

The code above sets up MLflow for experiment tracking in this notebook.

- It imports the `mlflow` package.
- It configures the tracking URI so that run metadata (parameters, metrics, tags) is stored in a local SQLite database (`mlflow.db`). Artifacts such as model files are stored as files outside the database.
- It specifies the name of the experiment ("nyc-taxi-experiment").
- If this experiment doesn't exist already, MLflow will create it automatically.

With this setup, every run started later in the notebook is recorded under the same experiment, so results can be compared across runs and reproduced later.

### Data Reading Utility

- Load parquet files with selected columns
- Compute trip duration and filter outliers (1–60 minutes)
- Cast categorical columns to strings
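
To make the duration filter concrete, here is a minimal sketch on three made-up trips lasting 30 seconds, 15 minutes, and 2 hours. The timestamps and location IDs are invented for illustration; only the column names and the 1 to 60 minute window come from this section.

```python
import pandas as pd

CATEGORICAL = ["PULocationID", "DOLocationID"]

# Three invented trips: 30 seconds, 15 minutes, and 2 hours long
trips = pd.DataFrame({
    "PULocationID": [142, 236, 7],
    "DOLocationID": [43, 237, 7],
    "tpep_pickup_datetime": pd.to_datetime(["2021-01-01 00:00:00"] * 3),
    "tpep_dropoff_datetime": pd.to_datetime([
        "2021-01-01 00:00:30",
        "2021-01-01 00:15:00",
        "2021-01-01 02:00:00",
    ]),
})

# Duration in minutes, then keep only trips lasting 1 to 60 minutes
trips["duration"] = (
    trips["tpep_dropoff_datetime"] - trips["tpep_pickup_datetime"]
).dt.total_seconds() / 60
kept = trips[(trips.duration >= 1) & (trips.duration <= 60)].copy()

# Location IDs become strings so they are later treated as categories, not numbers
kept[CATEGORICAL] = kept[CATEGORICAL].astype(str)

print(kept[["PULocationID", "DOLocationID", "duration"]])
```

Only the 15-minute trip survives the filter; the other two fall outside the window and are dropped.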


In [None]:
CATEGORICAL = ['PULocationID', 'DOLocationID']
NEEDED_COLS = CATEGORICAL + ['tpep_pickup_datetime', 'tpep_dropoff_datetime', 'trip_distance']

def read_dataframe(filename):
    df = pd.read_parquet(filename, columns=NEEDED_COLS)

    df['tpep_pickup_datetime'] = pd.to_datetime(df['tpep_pickup_datetime'])
    df['tpep_dropoff_datetime'] = pd.to_datetime(df['tpep_dropoff_datetime'])

    df['duration'] = (df['tpep_dropoff_datetime'] - df['tpep_pickup_datetime']).dt.total_seconds() / 60
    df = df[(df.duration >= 1) & (df.duration <= 60)].copy()

    df[CATEGORICAL] = df[CATEGORICAL].astype(str)
    return df


### Load Train and Validation Data

- Read January as training and February as validation
- Apply the same preprocessing for both


In [None]:
df_train = read_dataframe('/workspaces/mlops-zoomcamp/data/yellow_tripdata_2021-01.parquet')
df_val = read_dataframe('/workspaces/mlops-zoomcamp/data/yellow_tripdata_2021-02.parquet')

In [None]:
# df_train.head()

In [None]:
# len(df_train), len(df_val)

### Feature Engineering

- Create `PU_DO` combined categorical feature
- Keep distance as a numerical feature


In [None]:
# Create a new feature 'PU_DO' that combines pickup and dropoff location IDs.
# This represents a specific route or trip pattern (e.g., pickup from zone 142 to dropoff at zone 43).
df_train['PU_DO'] = df_train['PULocationID'] + '_' + df_train['DOLocationID']
df_val['PU_DO'] = df_val['PULocationID'] + '_' + df_val['DOLocationID']

### Vectorization (DictVectorizer)

- Convert feature dicts into sparse matrices
- Fit on training data; transform validation with the same `DictVectorizer`


In [None]:
categorical = ['PU_DO']  # the combined route feature replaces the separate 'PULocationID' and 'DOLocationID'
numerical = ['trip_distance']

dv = DictVectorizer()

train_dicts = df_train[categorical + numerical].to_dict(orient='records')
X_train = dv.fit_transform(train_dicts)

val_dicts = df_val[categorical + numerical].to_dict(orient='records')
X_val = dv.transform(val_dicts)

### Define Target Variable

- Predict the trip `duration` in minutes
- Split labels for train/validation


In [None]:
target = 'duration'
y_train = df_train[target].values
y_val = df_val[target].values

### Baseline Linear Regression (optional)

- Fit a simple baseline (e.g., Linear or Lasso/Ridge)
- Compute RMSE on validation set
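
For reference, RMSE is the square root of the mean squared difference between true and predicted durations, so it is expressed in minutes. A small sketch with invented numbers (not results from this dataset):

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Invented durations in minutes, for illustration only
y_true = np.array([10.0, 20.0, 30.0])
y_hat = np.array([12.0, 18.0, 33.0])

# Same expression the notebook uses
rmse = np.sqrt(mean_squared_error(y_true, y_hat))

# Equivalent computation written out by hand
rmse_manual = np.sqrt(np.mean((y_true - y_hat) ** 2))

print(round(float(rmse), 4))
```

The errors here are 2, 2, and 3 minutes, giving a mean squared error of 17/3 and an RMSE of about 2.38 minutes.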


In [None]:
# lr = LinearRegression()
# lr.fit(X_train, y_train)

# y_pred = lr.predict(X_val)

# np.sqrt(mean_squared_error(y_val, y_pred))

In [None]:
# with open('models/lin_reg.bin', 'wb') as f_out:
#     pickle.dump((dv, lr), f_out)

### Track Linear Model with MLflow (example)

- Demonstrates manual logging: params, metrics, artifacts
- Useful pattern even when autolog is disabled


In [None]:
# # Start a new MLflow run — everything logged inside this block 
# # will be grouped under the same experiment run in the MLflow UI.
# with mlflow.start_run():

#     # Tag this run with metadata — useful for filtering or identifying runs later.
#     mlflow.set_tag("developer", "cristian")

#     # Log input data paths as parameters to keep track of which datasets were used for training and validation.
#     mlflow.log_param("train-data-path", "./data/yellow_tripdata_2021-01.parquet")
#     mlflow.log_param("valid-data-path", "./data/yellow_tripdata_2021-02.parquet")

#     # Define and log the model hyperparameter 'alpha' for the Lasso regression.
#     alpha = 0.1
#     mlflow.log_param("alpha", alpha)
    
#     # Initialize and train the Lasso regression model using the training data.
#     lr = Lasso(alpha)
#     lr.fit(X_train, y_train)

#     # Make predictions on the validation dataset.
#     y_pred = lr.predict(X_val)

#     # Calculate the Root Mean Squared Error (RMSE) to evaluate model performance.
#     rmse = np.sqrt(mean_squared_error(y_val, y_pred))

#     # Log the RMSE metric so it appears in MLflow for comparison across runs.
#     mlflow.log_metric("rmse", rmse)

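#     # Log a pickled model file as an artifact. 'models/lin_reg.bin' is the file written by the
#     # earlier pickle cell, so it must already exist, and it holds the (dv, lr) pair saved there
#     # rather than the Lasso model fitted in this run.
#     # 'artifact_path' defines the subfolder within the run's artifact storage.
#     mlflow.log_artifact(local_path="models/lin_reg.bin", artifact_path="models_pickle")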

In [None]:
# # Check current tracking URI
# print("Tracking URI:", mlflow.get_tracking_uri())

# # List all experiments
# experiments = mlflow.search_experiments()
# for exp in experiments:
#     print(f"Experiment: {exp.name} (ID: {exp.experiment_id})")

# # List runs for the experiment configured above (looked up by name instead of a hard-coded ID)
# runs = mlflow.search_runs(experiment_names=["nyc-taxi-experiment"])
# print(runs[['run_id', 'metrics.rmse', 'params.alpha']].head())

## Hyperparameter Tuning with XGBoost and Hyperopt

### XGBoost and Hyperopt Imports

- Bring in XGBoost for gradient-boosted trees
- Use Hyperopt for hyperparameter search
- Keep MLflow tracking enabled


In [None]:
import xgboost as xgb

In [None]:
from hyperopt import fmin, tpe, hp, STATUS_OK, Trials
from hyperopt.pyll import scope

### Prepare DMatrix for XGBoost

- Wrap the SciPy sparse feature matrices and their labels in XGBoost `DMatrix` objects
- `DMatrix` is the data structure `xgb.train` expects, and it is optimized for memory use and training speed


In [None]:
train = xgb.DMatrix(X_train, label=y_train)
valid = xgb.DMatrix(X_val, label=y_val)

### Define the objective function that Hyperopt will minimize

In [None]:
# # Define the objective function that Hyperopt will minimize.
# def objective(params):
#     with mlflow.start_run():
#         mlflow.set_tag("model", "xgboost")
#         mlflow.log_params(params)

#         # Train the XGBoost model with the given parameters.
#         # 'dtrain' is the training data matrix, 'num_boost_round' is the number of boosting rounds,
#         # 'evals' is a list of tuples containing the validation data and a name for the evaluation,
#         # 'early_stopping_rounds' is the number of rounds to wait before stopping if the validation score doesn't improve.
#         booster = xgb.train(
#             params=params,
#             dtrain=train,
#             num_boost_round=100,
#             evals=[(valid, 'validation')],
#             early_stopping_rounds=20
#         )
#         y_pred = booster.predict(valid)
#         rmse = np.sqrt(mean_squared_error(y_val, y_pred))
#         mlflow.log_metric("rmse", rmse)

#     return {'loss': rmse, 'status': STATUS_OK}

### Define the hyperparameter search space
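
One detail worth spelling out: the bounds passed to `hp.loguniform` are in log space, because it samples `exp(uniform(low, high))`. The short sketch below converts the bounds used in the search space of this section into the actual value ranges they imply.

```python
import math

# (low, high) bounds as passed to hp.loguniform in this section
log_bounds = {
    "learning_rate": (-3, 0),
    "reg_alpha": (-5, -1),
    "reg_lambda": (-6, -1),
    "min_child_weight": (-1, 3),
}

# The sampled value lies between exp(low) and exp(high)
value_ranges = {
    name: (math.exp(low), math.exp(high)) for name, (low, high) in log_bounds.items()
}

for name, (lo, hi) in value_ranges.items():
    print(f"{name}: {lo:.4f} to {hi:.4f}")
```

So `learning_rate`, for example, is drawn from roughly 0.05 to 1.0, with small values sampled as densely as large ones on a log scale.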

In [None]:
# search_space = {
#     'max_depth': scope.int(hp.quniform('max_depth', 4, 50, 1)),
#     'learning_rate': hp.loguniform('learning_rate', -3, 0),
#     'reg_alpha': hp.loguniform('reg_alpha', -5, -1),
#     'reg_lambda': hp.loguniform('reg_lambda', -6, -1),
#     'min_child_weight': hp.loguniform('min_child_weight', -1, 3),
#     'objective': 'reg:squarederror',
#     'seed': 42
# }

### Hyperparameter Optimization with Hyperopt

In [None]:
# # Use Hyperopt to find the best hyperparameters for the XGBoost model.
# best_result = fmin(
#     fn=objective,          # Objective function to minimize (returns validation RMSE)
#     space=search_space,    # The hyperparameter space defined above
#     algo=tpe.suggest,      # TPE algorithm: Bayesian optimizer that models p(x|y)
#     max_evals=10,          # Number of trials (iterations) to perform
#     trials=Trials()        # Object to store details of each run (params, loss, status)
# )

## Train final XGBoost model and log to MLflow

- Use best params to train with early stopping
- Log parameters, metrics, and artifacts (preprocessor, model)
- Save model with MLflow XGBoost flavor


In [None]:
# Disable XGBoost autologging to manually control what gets logged to MLflow
# (autologging would otherwise record parameters, metrics, and models automatically)
mlflow.xgboost.autolog(disable=True)

In [None]:
# from mlflow.models import infer_signature

# with mlflow.start_run():
    
#     # Wrap the sparse feature matrices from DictVectorizer in XGBoost's optimized DMatrix format
#     # This structure improves memory efficiency and training performance
#     train = xgb.DMatrix(X_train, label=y_train)
#     valid = xgb.DMatrix(X_val, label=y_val)

#     # Define the best hyperparameters found from hyperparameter tuning (e.g., Hyperopt)
#     # These control model complexity, learning rate, regularization, and random seed
#     best_params = {
#         'learning_rate': 0.09585355369315604,
#         'max_depth': 30,
#         'min_child_weight': 1.060597050922164,
#         'objective': 'reg:squarederror',
#         'reg_alpha': 0.018060244040060163,
#         'reg_lambda': 0.011658731377413597,
#         'seed': 42
#     }

#     # Log all chosen hyperparameters to MLflow for reproducibility
#     mlflow.log_params(best_params)

#     # Train the XGBoost model using the defined parameters
#     # - num_boost_round: maximum number of boosting iterations
#     # - evals: list of (DMatrix, name) pairs evaluated each round; only the validation set is used here
#     # - early_stopping_rounds: stop training if validation metric doesn’t improve for 20 rounds
#     booster = xgb.train(
#         params=best_params,
#         dtrain=train,
#         num_boost_round=100,
#         evals=[(valid, 'validation')],
#         early_stopping_rounds=20
#     )

#     # Make predictions on the validation dataset
#     y_pred = booster.predict(valid)

#     # Calculate the Root Mean Squared Error (RMSE) to evaluate model performance
#     rmse = np.sqrt(mean_squared_error(y_val, y_pred))

#     # Log the RMSE metric so it appears in MLflow for comparison across runs
#     mlflow.log_metric("rmse", rmse)

#     # Save the preprocessor (feature transformation model) as a pickle file
#     with open("models/preprocessor.b", "wb") as f_out:
#         pickle.dump(dv, f_out)

#     # Log the preprocessor as an artifact in MLflow
#     mlflow.log_artifact("models/preprocessor.b", artifact_path="preprocessor")

#     # Infer the model signature (input and output schema) from the validation features and predictions
#     signature = infer_signature(X_val, y_pred)

#     # Log the trained XGBoost model in MLflow with its signature and an input example
#     input_example = X_val[:3]
#     mlflow.xgboost.log_model(
#         booster,
#         artifact_path="models_mlflow",  # Path within the artifacts folder
#         signature=signature,
#         input_example=input_example
#     )

## Quick Model Comparisons with scikit-learn (autolog)

- Enable MLflow autologging for scikit-learn
- Train a couple of lightweight baseline models
- Compare validation RMSE across runs


In [None]:
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor, ExtraTreesRegressor
from sklearn.svm import LinearSVR

for cls in (RandomForestRegressor, GradientBoostingRegressor, ExtraTreesRegressor, LinearSVR):
    m = cls()
    print(cls.__name__, m.get_params())


In [None]:
# from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor, ExtraTreesRegressor
# from sklearn.svm import LinearSVR

# # Enable automatic MLflow logging for all scikit-learn models
# # This logs:
# #   - Model parameters (e.g., n_estimators, max_depth)
# #   - Training metrics (computed on the data passed to fit)
# #   - Trained model artifacts (serialized .pkl files)
# #   - Model signature and environment info (for reproducibility)
# # Validation RMSE is not captured automatically, so it is logged manually below
# mlflow.sklearn.autolog()

# # Loop through multiple model classes to train and compare them easily
# # Each iteration will:
# #   1. Start a new MLflow run
# #   2. Train one model
# #   3. Evaluate it
# #   4. Log results to MLflow automatically
# for model_class in (RandomForestRegressor, GradientBoostingRegressor, ExtraTreesRegressor, LinearSVR):

#     with mlflow.start_run():
        
#         # Log paths to the training and validation datasets for reproducibility
#         mlflow.log_param("train-data-path", "./data/yellow_tripdata_2021-01.parquet")
#         mlflow.log_param("valid-data-path", "./data/yellow_tripdata_2021-02.parquet")

#         # Log the preprocessor as an artifact in MLflow
#         mlflow.log_artifact("models/preprocessor.b", artifact_path="preprocessor")

#         # Initialize and train the model
#         mlmodel = model_class()
#         mlmodel.fit(X_train, y_train)

#         # Make predictions on the validation dataset
#         y_pred = mlmodel.predict(X_val)
#         rmse = np.sqrt(mean_squared_error(y_val, y_pred))

#         # Log the RMSE metric to MLflow for comparison across runs
#         mlflow.log_metric("rmse", rmse)

In [None]:
import numpy as np
import mlflow
from sklearn.metrics import mean_squared_error
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.svm import LinearSVR

# Turn on MLflow autologging for scikit-learn
mlflow.sklearn.autolog(
    log_datasets=False,        # don't snapshot the dataset
    log_input_examples=False   # don't store a full input example
)

# Define two relatively light models:
# - LinearSVR: fast linear model (good baseline)
# - GradientBoostingRegressor: small tree ensemble with conservative settings
model_specs = [
    ("LinearSVR", LinearSVR(max_iter=5000)),                  # higher max_iter to help the solver converge
    ("GradientBoostingRegressor", GradientBoostingRegressor(  # small/quick ensemble
        n_estimators=80,          # fewer trees → faster
        max_depth=3,              # shallow trees
        learning_rate=0.08,       # a bit higher to compensate for fewer trees
        subsample=0.9,            # stochastic boosting → faster, regularization
        random_state=42
    )),
]

for model_name, model in model_specs:
    with mlflow.start_run(run_name=model_name):
        # Record data sources for reproducibility (the files loaded with read_dataframe above)
        mlflow.log_param("train-data-path", "./data/yellow_tripdata_2021-01.parquet")
        mlflow.log_param("valid-data-path", "./data/yellow_tripdata_2021-02.parquet")

        # Log the pickled DictVectorizer; this file must already exist on disk
        # (it is written by the final XGBoost training cell)
        mlflow.log_artifact("models/preprocessor.b", artifact_path="preprocessor")

        # Train
        model.fit(X_train, y_train)

        # Validate
        y_pred = model.predict(X_val)
        rmse = np.sqrt(mean_squared_error(y_val, y_pred))

        # Track results
        mlflow.log_param("model_name", model_name)
        mlflow.log_metric("rmse", rmse)


## Load Logged Model and Predict

- Load a previously logged model from MLflow
- Run predictions using both PyFunc and native XGBoost flavors
- Inspect a sample of predictions


In [None]:
import mlflow

# Define the model URI (Uniform Resource Identifier).
# This string points to a specific model stored in MLflow.
logged_model = 'runs:/712e9c4fb3294a75bd60b15f76102062/models_mlflow'

# Load the model using MLflow’s PyFunc interface/flavor.
# This loads the model as a generic Python function (PyFuncModel),
# which can make predictions on pandas DataFrames and other supported input types.
loaded_model = mlflow.pyfunc.load_model(logged_model)

In [None]:
loaded_model

In [None]:
# Predict directly on the vectorized feature matrix (no DMatrix needed with the PyFunc flavor)
y_pred_pyfunc = loaded_model.predict(X_val)

# View predictions
print(y_pred_pyfunc[:10])

In [None]:
# Load the same model with the native XGBoost flavor, which returns the XGBoost model object that was logged
xgboost_model = mlflow.xgboost.load_model(logged_model)

In [None]:
xgboost_model

In [None]:
# The native booster expects a DMatrix, so reuse `valid` from the tuning section
y_pred = xgboost_model.predict(valid)

In [None]:
# check the first 10
y_pred[:10]