# 🧠 MLOps Zoomcamp – Homework 2 Review Notebook 📘  
_A step-by-step summary of my work, questions, answers, and debugging adventures._

---

## ✅ Q1. MLflow Version

Check which MLflow version is installed:

```bash
mlflow --version
```

Output:

```text
mlflow, version 2.22.0
```
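
The same check can be done from Python, which is handy inside a notebook. This is a minimal sketch using only the standard library; the helper name `installed_version` is my own, and it returns `None` when the package is not installed in the active environment.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Prints the version string, or None if mlflow is not installed here.
print("mlflow:", installed_version("mlflow"))
```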


---

## ✅ Q2. Preprocessing the Dataset

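Preprocess the raw data into `train.pkl`, `val.pkl`, `test.pkl`, and `dv.pkl`. The command below is run from a notebook cell; from a shell, replace `%run` with `python`.

```python
%run preprocess_data.py --raw_data_path ./data --dest_path ./output
```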

Answer:
4 files created
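
To confirm the count without opening a file browser, a small check like the following could be used. It is a sketch that assumes the `./output` destination from the command above; the helper name `missing_outputs` is hypothetical and not part of the homework scripts.

```python
from pathlib import Path

# File names taken from the preprocessing step described above.
EXPECTED_FILES = ("train.pkl", "val.pkl", "test.pkl", "dv.pkl")


def missing_outputs(dest_path):
    """Return the expected file names that are absent from dest_path."""
    dest = Path(dest_path)
    return [name for name in EXPECTED_FILES if not (dest / name).is_file()]


missing = missing_outputs("./output")
print("all 4 files present" if not missing else f"missing: {missing}")
```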

---

## ✅ Q3. Manual Logging with MLflow

In `train.py`, the `RandomForestRegressor` run is tracked manually:

```python
import mlflow

mlflow.set_tracking_uri("sqlite:///mlflow.db")
mlflow.set_experiment("nyc-taxi-experiment-homework")

with mlflow.start_run():
    mlflow.set_tag("developer", "michelangelo")
    mlflow.log_params({"max_depth": 10, "random_state": 0})
    # rmse is computed from the fitted model's predictions (omitted here)
    mlflow.log_metric("rmse", rmse)
```
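
The `rmse` value logged in this excerpt is computed elsewhere in `train.py`. As a reminder of what the metric measures, here is a dependency-free sketch. The function is illustrative only: the real script presumably uses scikit-learn's metric helpers, and the sample numbers are made up.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error of two equal-length, non-empty sequences."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and the same length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up trip durations in minutes, purely for illustration.
print(rmse([10.0, 20.0], [13.0, 16.0]))
```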

Run training:

```bash
python train.py --data_path ./output
```

Launch MLflow UI:

```bash
mlflow ui --port 6006 \
  --backend-store-uri sqlite:///mlflow.db \
  --default-artifact-root ./mlruns
```

Q: What is the value of `min_samples_split`?

A: 2 (default for `RandomForestRegressor`)

Check with:

```bash
python -c "from sklearn.ensemble import RandomForestRegressor; print(RandomForestRegressor().min_samples_split)"
```


---

## ✅ Q4. Launch MLflow Tracking Server

```bash
mlflow ui --port 6006 \
  --backend-store-uri sqlite:///mlflow.db \
  --default-artifact-root ./artifacts
```

Answer: `default-artifact-root`

💡 Struggles:
- Had to kill hanging servers, found with `ps aux | grep gunicorn`
- Deleted corrupted folders inside `mlruns/`
- Restarted MLflow with the correct backend and artifact settings
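
Before restarting, it helps to know whether an old server is still holding the port. The sketch below only tests whether something accepts TCP connections on port 6006 (the port used above); it does not identify or kill the process, and the helper name `port_in_use` is my own.

```python
import socket


def port_in_use(port, host="127.0.0.1"):
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(1.0)
        # connect_ex returns 0 on success instead of raising an exception.
        return sock.connect_ex((host, port)) == 0


print("port 6006 busy:", port_in_use(6006))
```

If this reports the port as busy after the UI was supposedly stopped, a leftover worker is the likely cause.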

---

## ✅ Q5. Hyperparameter Tuning with Hyperopt

Modified `hpo.py` to:
- Call `mlflow.log_params()` inside `objective()`
- Log `rmse` manually
- Avoid `mlflow.sklearn.autolog()`

```python
import mlflow
from hyperopt import STATUS_OK


def objective(params):
    with mlflow.start_run():
        mlflow.log_params(params)
        ...
        mlflow.log_metric("rmse", rmse)
        return {'loss': rmse, 'status': STATUS_OK}
```
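
To show how an optimizer consumes the `{'loss', 'status'}` dictionary returned by `objective()`, here is a dependency-free sketch. Note the substitutions: it uses plain random search, not Hyperopt's own search algorithm, a toy loss instead of a trained model, and a hypothetical `max_depth` range of 1 to 20. Only the return-value contract mirrors the real code.

```python
import random

STATUS_OK = "ok"  # same string value as hyperopt.STATUS_OK


def objective(params):
    """Toy stand-in for the real objective: loss is lowest at max_depth == 12."""
    loss = abs(params["max_depth"] - 12) + 1.0
    return {"loss": loss, "status": STATUS_OK}


def random_search(objective_fn, max_evals, seed=42):
    """Sample max_depth at random and keep the trial with the lowest loss."""
    rng = random.Random(seed)
    best_params, best_result = None, None
    for _ in range(max_evals):
        params = {"max_depth": rng.randint(1, 20)}  # hypothetical range
        result = objective_fn(params)
        if result["status"] != STATUS_OK:
            continue  # skip failed trials
        if best_result is None or result["loss"] < best_result["loss"]:
            best_params, best_result = params, result
    return best_params, best_result


best_params, best_result = random_search(objective, max_evals=15)
print(best_params, best_result["loss"])
```

In the real `hpo.py`, each call to `objective()` also opens its own MLflow run, so every trial shows up as a separate row in the UI.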

Run optimization:

```bash
python hpo.py
```

Answer: ✅ Best RMSE = 5.335

---

## ✅ Q6. Register Best Model to MLflow Registry

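In `register_model.py`, we:
1. Selected the top 5 runs from `random-forest-hyperopt`
2. Trained each model on the training set, evaluated it on the validation and test sets, and logged it
3. Logged `val_rmse` and `test_rmse` manually
4. Registered the best model:

```python
import mlflow

# best_run is the run selected in the previous steps
mlflow.register_model(
    model_uri=f"runs:/{best_run.info.run_id}/model",
    name="best-random-forest-model"
)
```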
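
The selection that produces `best_run` can be pictured with plain Python. This sketch assumes the best run is the one with the lowest `test_rmse`; the run IDs and metric values are invented for illustration, and in the real script the records come from MLflow rather than a hard-coded list.

```python
# Hypothetical run records; the real ones are fetched from MLflow.
runs = [
    {"run_id": "run-a", "test_rmse": 5.9},
    {"run_id": "run-b", "test_rmse": 5.6},
    {"run_id": "run-c", "test_rmse": 6.1},
]


def pick_best(candidates, metric="test_rmse"):
    """Return the record with the lowest value for the given metric."""
    return min(candidates, key=lambda run: run[metric])


def model_uri(run_id, artifact_path="model"):
    """Build a runs:/ URI in the same shape used for registration above."""
    return f"runs:/{run_id}/{artifact_path}"


best = pick_best(runs)
print(model_uri(best["run_id"]))
```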

Fixes made:
- ❌ Removed `mlflow.sklearn.autolog()` (caused kernel crashes in Codespaces)
- ✅ Added `import numpy as np`

Final Result:
- `val_rmse`: 5.363
- `test_rmse`: 5.594

Answer: ✅ 5.567 (picked as the closest answer option to the logged `test_rmse`)

---

## 🔁 Quick Command Reference

```bash
# Start MLflow UI
mlflow ui --port 6006 \
  --backend-store-uri sqlite:///mlflow.db \
  --default-artifact-root ./mlruns

# Preprocess data (in a notebook cell, use %run instead of python)
python preprocess_data.py --raw_data_path ./data --dest_path ./output

# Train model
python train.py --data_path ./output

# Tune with Hyperopt
python hpo.py

# Register best model
python register_model.py
```

