
# ML Zoomcamp — Homework 6: Tree-based Models (Fuel Efficiency)

This notebook follows the assignment instructions:

- Fill missing values with zeros
- Train/Val/Test split = 60/20/20 with `random_state=1`
- Use `DictVectorizer(sparse=True)`
- Build tree-based models and answer questions


In [7]:

# Setup
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction import DictVectorizer
from sklearn.metrics import mean_squared_error

from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor

import warnings
warnings.filterwarnings("ignore")

# Optional: install xgboost if needed
# %pip install xgboost -q

import xgboost as xgb
print("Libraries ready.")


Libraries ready.



## 1) Load & Prepare the Dataset

- Load from the provided URL
- Fill missing values with `0`
- Split 60/20/20 using `random_state=1`
- Vectorize features with `DictVectorizer(sparse=True)`
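
To make the vectorization step concrete, here is a minimal pure-Python sketch of the encoding `DictVectorizer` produces: numeric values pass through under their column name, string values become one-hot `column=value` features, and feature names are sorted. This is an illustration, not scikit-learn's implementation. The helper names `fit_feature_names` and `transform` are hypothetical, and the `origin` column and its values are made up for the example.

```python
def fit_feature_names(records):
    """Collect sorted feature names: 'col' for numbers, 'col=value' for strings."""
    names = set()
    for rec in records:
        for key, value in rec.items():
            names.add(f"{key}={value}" if isinstance(value, str) else key)
    return sorted(names)


def transform(records, names):
    """Encode each record as a dense row aligned with `names`."""
    index = {name: i for i, name in enumerate(names)}
    rows = []
    for rec in records:
        row = [0.0] * len(names)
        for key, value in rec.items():
            is_category = isinstance(value, str)
            name = f"{key}={value}" if is_category else key
            if name in index:  # categories not seen during fit are dropped
                row[index[name]] = 1.0 if is_category else float(value)
        rows.append(row)
    return rows


# Hypothetical records; `origin` and its values are assumptions for illustration.
train = [
    {"vehicle_weight": 3000.0, "horsepower": 150.0, "origin": "USA"},
    {"vehicle_weight": 2500.0, "horsepower": 120.0, "origin": "Europe"},
]
names = fit_feature_names(train)
rows = transform(train, names)
unseen = transform([{"vehicle_weight": 2000.0, "origin": "Asia"}], names)

print(names)
print(rows)
print(unseen)
```

A record with a category that was not present at fit time gets zeros in every one-hot column, which is why the vectorizer is fitted on the training split only and then reused for validation and test.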


In [3]:

# Load
url = 'https://raw.githubusercontent.com/alexeygrigorev/datasets/master/car_fuel_efficiency.csv'
df = pd.read_csv(url)

# Fill NA
df = df.fillna(0)

# Separate target
y = df['fuel_efficiency_mpg'].values
X = df.drop(columns=['fuel_efficiency_mpg'])

# Train/Val/Test split: 60/20/20 with random_state=1
df_full_train, df_test, y_full_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
df_train, df_val, y_train, y_val = train_test_split(df_full_train, y_full_train, test_size=0.25, random_state=1)  # 0.25 * 0.8 = 0.2

# DictVectorizer
dv = DictVectorizer(sparse=True)
X_train = dv.fit_transform(df_train.to_dict(orient='records'))
X_val = dv.transform(df_val.to_dict(orient='records'))
X_test = dv.transform(df_test.to_dict(orient='records'))

X_train.shape, X_val.shape, X_test.shape


((5822, 14), (1941, 14), (1941, 14))
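
The row counts above can be checked with plain arithmetic. The sketch below assumes that the held-out share in each split is rounded up to a whole row, which is consistent with the shapes printed above; `split_sizes` is a hypothetical helper, not part of scikit-learn.

```python
from math import ceil


def split_sizes(n_rows, test_size):
    """Return (n_kept, n_held_out), rounding the held-out share up (assumption)."""
    n_held_out = ceil(test_size * n_rows)
    return n_rows - n_held_out, n_held_out


n_rows = 5822 + 1941 + 1941  # total rows implied by the shapes above

# First split: hold out 20% for test
n_full_train, n_test = split_sizes(n_rows, 0.2)

# Second split: 25% of the remaining 80% is 20% of the whole dataset
n_train, n_val = split_sizes(n_full_train, 0.25)

print(n_train, n_val, n_test)
```

This is why the second call uses `test_size=0.25` rather than `0.2`: it is applied to the 80% that remains after the test set is removed.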


## Question 1 — Decision Tree Regressor (`max_depth=1`)

**Task:** Train a decision tree with `max_depth=1` and see which feature is used for splitting.

**Answer:** **`vehicle_weight`**


In [4]:

dt = DecisionTreeRegressor(max_depth=1, random_state=1)
dt.fit(X_train, y_train)

# A depth-1 tree has a single split, so exactly one feature has non-zero importance
feature_names = dv.get_feature_names_out()
split_feature = feature_names[np.argmax(dt.feature_importances_)]
print("Split feature (max_depth=1):", split_feature)


Split feature (max_depth=1): vehicle_weight
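
As a rough illustration of how a depth-1 regression tree picks its split, the sketch below tries every midpoint threshold on every feature and keeps the one with the lowest total squared error in the two leaves. It is a simplified stand-in for what `DecisionTreeRegressor` does, not its actual implementation; the helper names and the toy data are made up.

```python
def sse(values):
    """Sum of squared deviations from the mean (0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(rows, target):
    """Return (feature, threshold, total_sse) for the best single split."""
    best = None
    for feature in rows[0]:
        values = sorted({row[feature] for row in rows})
        # Candidate thresholds are midpoints between consecutive distinct values
        for lo, hi in zip(values, values[1:]):
            threshold = (lo + hi) / 2
            left = [y for row, y in zip(rows, target) if row[feature] <= threshold]
            right = [y for row, y in zip(rows, target) if row[feature] > threshold]
            total = sse(left) + sse(right)
            if best is None or total < best[2]:
                best = (feature, threshold, total)
    return best


# Toy data (made up): mpg falls with weight and is unrelated to model_year
rows = [
    {"vehicle_weight": 2000, "model_year": 2005},
    {"vehicle_weight": 2200, "model_year": 2015},
    {"vehicle_weight": 3600, "model_year": 2006},
    {"vehicle_weight": 3800, "model_year": 2016},
]
target = [20.0, 19.0, 12.0, 11.0]

feature, threshold, total = best_split(rows, target)
print(feature, threshold, total)
```

On this toy data the weight split separates the high-mpg and low-mpg cars cleanly, so it wins over every `model_year` threshold.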



## Question 2 — Random Forest Regressor (`n_estimators=10`)

**Task:** Train a RandomForestRegressor with:
- `n_estimators=10`
- `random_state=1`
- `n_jobs=-1` (optional)

Compute RMSE on the validation set.

**Answer:** **`0.45`** (computed value: about 0.46)


In [9]:

from math import sqrt

rf = RandomForestRegressor(n_estimators=10, random_state=1, n_jobs=-1)
rf.fit(X_train, y_train)
y_pred = rf.predict(X_val)

# RMSE is the square root of the mean squared error
rmse_q2 = sqrt(mean_squared_error(y_val, y_pred))

print("Validation RMSE (n_estimators=10):", rmse_q2)


Validation RMSE (n_estimators=10): 0.4595777223092726
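
For reference, the metric used throughout this notebook can be written out by hand. The sketch below is a plain-Python version of RMSE for lists of numbers; the function name `rmse` and the sample values are illustrative only.

```python
from math import sqrt


def rmse(y_true, y_pred):
    """Root mean squared error for two equal-length, non-empty sequences."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return sqrt(sum(squared_errors) / len(squared_errors))


# Made-up values: errors are 1, 0 and -2, so the MSE is (1 + 0 + 4) / 3
example = rmse([15.0, 20.0, 25.0], [14.0, 20.0, 27.0])
print(round(example, 4))
```

Because RMSE is expressed in the target's units, a value near 0.46 here means the forest is off by roughly half an mpg on a typical validation car.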



## Question 3 — Sweep `n_estimators` from 10 to 200 (step=10)

**Task:** For `n_estimators` in `[10, 20, ..., 200]` (with `random_state=1`), compute RMSE on validation set and find the point **after which RMSE stops improving** (3 decimal places).

**Answer:** **`200`** (in this run the rounded RMSE last improves at `n_estimators=180` and stays flat through `200`)


In [13]:
from math import sqrt
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

def evaluate_rf_n_estimators(X_train, y_train, X_val, y_val, start=10, stop=200, step=10):
    rf = RandomForestRegressor(random_state=1, n_jobs=-1, warm_start=True)
    scores = []
    for n in range(start, stop + step, step):
        rf.n_estimators = n
        rf.fit(X_train, y_train)
        rmse = sqrt(mean_squared_error(y_val, rf.predict(X_val)))
        scores.append((n, round(rmse, 3)))
    return scores

scores = evaluate_rf_n_estimators(X_train, y_train, X_val, y_val)
for n, rmse in scores:
    print(f"{n:>3}: {rmse:.3f}")

best_n, best_rmse = min(scores, key=lambda x: x[1])
print(f"\nBest RMSE: {best_rmse:.3f} at n_estimators={best_n}")

 10: 0.460
 20: 0.454
 30: 0.452
 40: 0.449
 50: 0.447
 60: 0.445
 70: 0.445
 80: 0.445
 90: 0.445
100: 0.445
110: 0.444
120: 0.444
130: 0.444
140: 0.443
150: 0.443
160: 0.443
170: 0.443
180: 0.442
190: 0.442
200: 0.442

Best RMSE: 0.442 at n_estimators=180
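
One way to read the sweep is to list every `n_estimators` value at which the rounded RMSE drops below all earlier values. The sketch below does that for the scores printed above; `improvement_points` is a hypothetical helper name.

```python
# (n_estimators, RMSE rounded to 3 decimals), copied from the output above
scores = [
    (10, 0.460), (20, 0.454), (30, 0.452), (40, 0.449), (50, 0.447),
    (60, 0.445), (70, 0.445), (80, 0.445), (90, 0.445), (100, 0.445),
    (110, 0.444), (120, 0.444), (130, 0.444), (140, 0.443), (150, 0.443),
    (160, 0.443), (170, 0.443), (180, 0.442), (190, 0.442), (200, 0.442),
]


def improvement_points(scores):
    """Return the n values where RMSE is strictly lower than every earlier RMSE."""
    best = float("inf")
    points = []
    for n, rmse in scores:
        if rmse < best:
            best = rmse
            points.append(n)
    return points


points = improvement_points(scores)
print(points)
print("Last improvement at n_estimators =", points[-1])
```

The plateau between 60 and 100 trees is temporary: the score improves again at 110, 140 and 180, so stopping at 80 would miss later gains.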



## Question 4 — Select Best `max_depth` by Mean RMSE

**Task:** For `max_depth in [10, 15, 20, 25]`, and for each, sweep `n_estimators` from 10 to 200 (step=10).  
Compute the **mean RMSE** across these runs and select the best `max_depth` (smallest mean RMSE).

**Answer:** **`20`**


In [17]:
from math import sqrt
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

depths = [10, 15, 20, 25]
results = {}

for d in depths:
    rmses = []
    # initialize once per depth and add trees incrementally
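    # Note: max_features='sqrt' speeds up training but departs from the
    # RandomForestRegressor default (all features), so these RMSE values
    # are not directly comparable with the Q2/Q3 results.
    rf_tmp = RandomForestRegressor(
        max_depth=d,
        n_estimators=10,          # start at 10
        random_state=1,
        n_jobs=-1,
        warm_start=True,          # add trees instead of retraining from scratch
        max_features='sqrt'
    )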
    for n in range(10, 201, 10):
        rf_tmp.n_estimators = n   # add 10 more trees each loop
        rf_tmp.fit(X_train, y_train)
        y_pred = rf_tmp.predict(X_val)
        rmse = sqrt(mean_squared_error(y_val, y_pred))
        rmses.append(rmse)
    results[d] = float(np.mean(rmses))

print("Mean RMSE by max_depth:")
for d, m in results.items():
    print(f"max_depth={d}: mean RMSE = {m:.5f}")

best_depth = min(results, key=results.get)
print("Best max_depth (mean RMSE):", best_depth)


Mean RMSE by max_depth:
max_depth=10: mean RMSE = 0.89033
max_depth=15: mean RMSE = 0.70204
max_depth=20: mean RMSE = 0.67920
max_depth=25: mean RMSE = 0.68682
Best max_depth (mean RMSE): 20



## Question 5 — Feature Importance (Random Forest)

**Task:** Train a RandomForestRegressor with:
- `n_estimators=10`
- `max_depth=20`
- `random_state=1`

Extract `feature_importances_` and identify the **most important feature** among:
- `vehicle_weight`
- `horsepower`
- `acceleration`
- `engine_displacement`

**Answer:** **`vehicle_weight`**


In [18]:

rf_imp = RandomForestRegressor(n_estimators=10, max_depth=20, random_state=1, n_jobs=-1)
rf_imp.fit(X_train, y_train)

importances = pd.Series(rf_imp.feature_importances_, index=dv.get_feature_names_out()).sort_values(ascending=False)

candidates = ['vehicle_weight', 'horsepower', 'acceleration', 'engine_displacement']
candidate_importances = importances[importances.index.isin(candidates)]
print("Candidate feature importances:")
print(candidate_importances)

top_candidate = candidate_importances.idxmax()
print("\nMost important among candidates:", top_candidate)


Candidate feature importances:
vehicle_weight         0.959150
horsepower             0.015998
acceleration           0.011480
engine_displacement    0.003273
dtype: float64

Most important among candidates: vehicle_weight
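
A quick sanity check on these numbers: random forest importances are normalized, so (assuming they sum to 1 across all 14 features) the four candidates account for nearly all of the total. The sketch below reuses the values printed above.

```python
# Importances copied from the output above
candidate_importances = {
    "vehicle_weight": 0.959150,
    "horsepower": 0.015998,
    "acceleration": 0.011480,
    "engine_displacement": 0.003273,
}

candidate_total = sum(candidate_importances.values())

# Assumption: importances over all features sum to 1,
# so the rest is shared by the other 10 features
remaining = 1.0 - candidate_total

top = max(candidate_importances, key=candidate_importances.get)

print(f"Candidates together: {candidate_total:.6f}")
print(f"All other features:  {remaining:.6f}")
print("Top candidate:", top)
```

With `vehicle_weight` near 0.96, the forest relies almost entirely on that one feature, which agrees with the depth-1 split found in Question 1.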



## Question 6 — XGBoost: Compare `eta=0.3` vs `eta=0.1`

**Task:** Train for 100 rounds with parameters:

```python
xgb_params = {
    'eta': 0.3, 
    'max_depth': 6,
    'min_child_weight': 1,
    'objective': 'reg:squarederror',
    'nthread': 8,
    'seed': 1,
    'verbosity': 1,
}
```

Then change `eta` to `0.1` and compare validation RMSE.

**Answer:** **`0.1`** (validation RMSE 0.426 vs. 0.450 for `eta=0.3`)


In [21]:

from math import sqrt
import xgboost as xgb
from sklearn.metrics import mean_squared_error

# Prepare DMatrix (convert feature names to list)
feature_names = dv.get_feature_names_out().tolist()

dtrain = xgb.DMatrix(X_train, label=y_train, feature_names=feature_names)
dval   = xgb.DMatrix(X_val,   label=y_val,   feature_names=feature_names)

watchlist = [(dtrain, 'train'), (dval, 'val')]

# eta = 0.3
params = {
    'eta': 0.3,
    'max_depth': 6,
    'min_child_weight': 1,
    'objective': 'reg:squarederror',
    'nthread': 8,
    'seed': 1,
    'verbosity': 0
}
model_03 = xgb.train(params, dtrain, num_boost_round=100, evals=watchlist, verbose_eval=False)
pred_03 = model_03.predict(dval)
rmse_03 = sqrt(mean_squared_error(y_val, pred_03))  # manual RMSE

# eta = 0.1
params['eta'] = 0.1
model_01 = xgb.train(params, dtrain, num_boost_round=100, evals=watchlist, verbose_eval=False)
pred_01 = model_01.predict(dval)
rmse_01 = sqrt(mean_squared_error(y_val, pred_01))  # manual RMSE

print("Validation RMSE with eta=0.3:", rmse_03)
print("Validation RMSE with eta=0.1:", rmse_01)

better = "0.1" if rmse_01 <= rmse_03 else "0.3"
print("Better eta on validation:", better)

Validation RMSE with eta=0.3: 0.45017755678087246
Validation RMSE with eta=0.1: 0.42622800553359225
Better eta on validation: 0.1
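
To show what `eta` controls, here is a minimal sketch of gradient boosting for squared error with one-split trees on a single feature. Each round fits a stump to the current residuals and adds only a fraction `eta` of its prediction. This is a teaching simplification, not XGBoost's algorithm (which also uses second-order information and regularization); the function names and the toy data are made up.

```python
from math import sqrt


def fit_stump(xs, residuals):
    """Return (threshold, left_value, right_value) minimizing squared error."""
    best = None
    distinct = sorted(set(xs))
    for lo, hi in zip(distinct, distinct[1:]):
        t = (lo + hi) / 2
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lv = sum(left) / len(left)
        rv = sum(right) / len(right)
        err = sum((r - lv) ** 2 for r in left) + sum((r - rv) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, t, lv, rv)
    return best[1], best[2], best[3]


def boost(xs, ys, eta, rounds):
    """Return the training RMSE after each boosting round."""
    preds = [sum(ys) / len(ys)] * len(ys)  # start from the mean
    history = []
    for _ in range(rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        t, lv, rv = fit_stump(xs, residuals)
        # Shrinkage: apply only a fraction eta of the stump's correction
        preds = [p + eta * (lv if x <= t else rv) for x, p in zip(xs, preds)]
        history.append(sqrt(sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)))
    return history


# Toy data (made up)
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [30.0, 28.0, 27.0, 24.0, 20.0, 18.0, 15.0, 12.0]

history_03 = boost(xs, ys, eta=0.3, rounds=20)
history_01 = boost(xs, ys, eta=0.1, rounds=20)

print("eta=0.3 train RMSE, first/last round:", history_03[0], history_03[-1])
print("eta=0.1 train RMSE, first/last round:", history_01[0], history_01[-1])
```

The sketch only tracks training error, where a larger `eta` usually looks better for a fixed number of rounds. On held-out data the smaller step can generalize better, as the validation RMSE above suggests for this dataset.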



## ✅ Final Answers

1. **Feature used for splitting (max_depth=1):** `vehicle_weight`  
2. **RMSE on validation (RF, n_estimators=10):** `0.45`  
3. **n_estimators after which RMSE stops improving (3 decimals):** `200`  
4. **Best max_depth (by mean RMSE):** `20`  
5. **Most important feature (among given):** `vehicle_weight`  
6. **Best eta for XGBoost:** `0.1`  
