# Fuel Efficiency

## Load the data

In [6]:
import urllib.request

# Download the dataset (equivalent to `!wget $url` in a notebook)
url = "https://raw.githubusercontent.com/alexeygrigorev/datasets/master/car_fuel_efficiency.csv"
urllib.request.urlretrieve(url, "car_fuel_efficiency.csv")
print("Downloaded successfully!")

In [8]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.feature_extraction import DictVectorizer

# Read the downloaded CSV file
df = pd.read_csv("car_fuel_efficiency.csv")
df.head()

|   | engine_displacement | num_cylinders | horsepower | vehicle_weight | acceleration | model_year | origin | fuel_type | drivetrain | num_doors | fuel_efficiency_mpg |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 170 | 3.0 | 159.0 | 3413.433759 | 17.7 | 2003 | Europe | Gasoline | All-wheel drive | 0.0 | 13.231729 |
| 1 | 130 | 5.0 | 97.0 | 3149.664934 | 17.8 | 2007 | USA | Gasoline | Front-wheel drive | 0.0 | 13.688217 |
| 2 | 170 | NaN | 78.0 | 3079.038997 | 15.1 | 2018 | Europe | Gasoline | Front-wheel drive | 0.0 | 14.246341 |
| 3 | 220 | 4.0 | NaN | 2542.392402 | 20.2 | 2009 | USA | Diesel | All-wheel drive | 2.0 | 16.912736 |
| 4 | 210 | 1.0 | 140.0 | 3460.87099 | 14.4 | 2009 | Europe | Gasoline | All-wheel drive | 2.0 | 12.488369 |


The goal of this homework is to build a regression model that predicts car fuel efficiency (column `'fuel_efficiency_mpg'`).

## Preparing the dataset

Preparation:
- Fill missing values with zeros.
- Split the data into train/validation/test sets with a `60%/20%/20%` distribution.
- Use the `train_test_split` function and set the `random_state` parameter to 1.
- Use `DictVectorizer(sparse=True)` to turn the dataframes into matrices.
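
To make the `DictVectorizer` step more concrete, here is a minimal pure-Python sketch of the encoding idea: numeric values keep their own column, while string values become one-hot `feature=value` columns. The helper `encode_records` and the two toy records are hypothetical and only illustrate the idea; this is not scikit-learn's implementation, and it builds a dense matrix rather than a sparse one.

```python
def encode_records(records):
    """Toy dict vectorization (hypothetical helper, not scikit-learn code)."""
    # Build column names: numeric features keep their name,
    # string features get one "name=value" column per distinct value.
    columns = set()
    for rec in records:
        for key, value in rec.items():
            columns.add(f"{key}={value}" if isinstance(value, str) else key)
    columns = sorted(columns)
    index = {name: i for i, name in enumerate(columns)}

    # Fill a dense matrix row by row.
    matrix = []
    for rec in records:
        row = [0.0] * len(columns)
        for key, value in rec.items():
            if isinstance(value, str):
                row[index[f"{key}={value}"]] = 1.0
            else:
                row[index[key]] = float(value)
        matrix.append(row)
    return columns, matrix


# Made-up records that reuse two of the dataset's column names.
records = [
    {"horsepower": 159.0, "origin": "Europe"},
    {"horsepower": 97.0, "origin": "USA"},
]
columns, matrix = encode_records(records)
print(columns)
print(matrix)
```

Columns such as `origin=USA` follow the same naming pattern as the feature names that show up in the feature importance output for Question 5.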

### Question 1

Let's train a decision tree regressor to predict the `fuel_efficiency_mpg` variable.
- Train a model with `max_depth=1`.

Which feature is used for splitting the data?
- `'vehicle_weight'`
- `'model_year'`
- `'origin'`
- `'fuel_type'`
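
A tree with `max_depth=1` makes exactly one split, so the question comes down to which feature and threshold give the best single split. The sketch below shows one way such a search can work, assuming a squared-error criterion: it tries midpoints between distinct feature values and keeps the split with the lowest size-weighted child MSE. The function names and the toy data are made up for illustration; scikit-learn's own search is more elaborate.

```python
def mse(values):
    """Mean squared error around the mean (the variance) of a list of targets."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def best_stump_split(rows, targets, feature_names):
    """Return (feature, threshold, weighted child MSE) of the best single split."""
    best = None
    n = len(targets)
    for j, name in enumerate(feature_names):
        values = sorted({row[j] for row in rows})
        # Candidate thresholds: midpoints between consecutive distinct values.
        for lo, hi in zip(values, values[1:]):
            threshold = (lo + hi) / 2
            left = [t for row, t in zip(rows, targets) if row[j] <= threshold]
            right = [t for row, t in zip(rows, targets) if row[j] > threshold]
            score = (len(left) * mse(left) + len(right) * mse(right)) / n
            if best is None or score < best[2]:
                best = (name, threshold, score)
    return best


# Made-up toy data: the target falls as weight rises and ignores model year.
feature_names = ["vehicle_weight", "model_year"]
rows = [[2000, 2005], [2200, 2015], [3400, 2006], [3600, 2016]]
targets = [20.0, 19.0, 13.0, 12.0]

feature, threshold, score = best_stump_split(rows, targets, feature_names)
print(feature, threshold, round(score, 3))
```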

In [9]:
# Fill missing values
df = df.fillna(0)
df.isnull().sum()

engine_displacement    0
num_cylinders          0
horsepower             0
vehicle_weight         0
acceleration           0
model_year             0
origin                 0
fuel_type              0
drivetrain             0
num_doors              0
fuel_efficiency_mpg    0
dtype: int64

In [10]:
# Split into train, validation, and test
df_train_full, df_test = train_test_split(df, test_size=0.2, random_state=1)
df_train, df_val = train_test_split(df_train_full, test_size=0.25, random_state=1)  # 0.25 x 0.8 = 0.2

y_train = df_train.fuel_efficiency_mpg.values
y_val = df_val.fuel_efficiency_mpg.values

# Drop target
del df_train['fuel_efficiency_mpg']
del df_val['fuel_efficiency_mpg']

In [11]:
# Convert to dicts
dicts_train = df_train.to_dict(orient='records')
dicts_val = df_val.to_dict(orient='records')

# Vectorize
dv = DictVectorizer(sparse=True)
X_train = dv.fit_transform(dicts_train)
X_val = dv.transform(dicts_val)

In [12]:
# Train decision tree
dt = DecisionTreeRegressor(max_depth=1, random_state=1)
dt.fit(X_train, y_train)

In [13]:
# See which feature was used
feature_name = dv.feature_names_[dt.tree_.feature[0]]
print("Feature used for splitting:", feature_name)

Feature used for splitting: vehicle_weight


### Question 2

Train a random forest regressor with these parameters:
- `n_estimators=10`
- `random_state=1`
- `n_jobs=-1` (optional - to make training faster)

What's the RMSE of this model on the validation data?
- 0.045
- 0.45
- 4.5
- 45.0

In [14]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
import numpy as np

# Train the model
rf = RandomForestRegressor(n_estimators=10, random_state=1, n_jobs=-1)
rf.fit(X_train, y_train)

In [15]:
# Predict
y_pred = rf.predict(X_val)
y_pred

array([18.62692845, 15.23747064, 18.20763942, ..., 14.81525146,
       13.52753444, 16.10909938])

In [16]:
# Compute RMSE
rmse = np.sqrt(mean_squared_error(y_val, y_pred))
print("Validation RMSE:", rmse)

Validation RMSE: 0.4595777223092726
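
For reference, RMSE is the square root of the mean squared difference between targets and predictions, so it is expressed in the same units as the target (mpg here). A minimal sketch of the arithmetic without NumPy or scikit-learn, using made-up numbers:

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up values, only to show the arithmetic.
y_true = [15.0, 13.0, 18.0, 16.0]
y_pred = [14.0, 13.0, 19.0, 18.0]
print(rmse(y_true, y_pred))
```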


### Question 3

Now let's experiment with the `n_estimators` parameter:
- Try different values of this parameter from 10 to 200 with step 10.
- Set `random_state` to 1.
- Evaluate the model on the validation dataset.

After which value of `n_estimators` does RMSE stop improving? Consider 3 decimal places for calculating the answer.
- 10
- 25
- 80
- 200

If it doesn't stop improving, use the latest iteration number in your answer.

In [17]:
scores = []

for n in range(10, 201, 10):
    rf = RandomForestRegressor(n_estimators=n, random_state=1, n_jobs=-1)
    rf.fit(X_train, y_train)
    y_pred = rf.predict(X_val)
    rmse = np.sqrt(mean_squared_error(y_val, y_pred))
    scores.append((n, rmse))

for n, rmse in scores:
    print(n, round(rmse, 3))

10 0.46
20 0.454
30 0.452
40 0.449
50 0.447
60 0.445
70 0.445
80 0.445
90 0.445
100 0.445
110 0.444
120 0.444
130 0.444
140 0.443
150 0.443
160 0.443
170 0.443
180 0.442
190 0.442
200 0.442
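
One way to read the output above is to look for the first `n_estimators` at which the rounded RMSE reaches its minimum. That reading rule is an assumption about what "stops improving" means; the values are copied from the printed output.

```python
# Rounded validation RMSE values copied from the output above.
scores = [
    (10, 0.46), (20, 0.454), (30, 0.452), (40, 0.449), (50, 0.447),
    (60, 0.445), (70, 0.445), (80, 0.445), (90, 0.445), (100, 0.445),
    (110, 0.444), (120, 0.444), (130, 0.444), (140, 0.443), (150, 0.443),
    (160, 0.443), (170, 0.443), (180, 0.442), (190, 0.442), (200, 0.442),
]

# Lowest rounded RMSE, and the first n_estimators that reaches it.
best_rmse = min(rmse for _, rmse in scores)
first_best_n = next(n for n, rmse in scores if rmse == best_rmse)
print(first_best_n, best_rmse)
```

For the values above, the rounded RMSE first reaches its minimum of 0.442 at `n_estimators=180` and does not change afterwards.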


### Question 4

Let's select the best `max_depth`:
- Try different values of `max_depth`: `[10, 15, 20, 25]`.
- For each of these values,
  - try different values of `n_estimators` from 10 to 200 (with step 10),
  - calculate the mean RMSE.
- Fix the random seed: `random_state=1`.

What's the best `max_depth`, using the mean RMSE?
- 10
- 15
- 20
- 25

In [18]:
max_depth_values = [10, 15, 20, 25]
results = {}

for d in max_depth_values:
    rmses = []
    for n in range(10, 201, 10):
        rf = RandomForestRegressor(n_estimators=n, max_depth=d, random_state=1, n_jobs=-1)
        rf.fit(X_train, y_train)
        y_pred = rf.predict(X_val)
        rmse = np.sqrt(mean_squared_error(y_val, y_pred))
        rmses.append(rmse)
    results[d] = np.mean(rmses)
    print(f"max_depth={d}: mean RMSE={np.mean(rmses):.3f}")

max_depth=10: mean RMSE=0.442
max_depth=15: mean RMSE=0.445
max_depth=20: mean RMSE=0.446
max_depth=25: mean RMSE=0.446
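
Picking the winner from these numbers is a one-liner once the mean RMSE per depth is in a dict. The sketch below re-enters the rounded values printed above rather than reusing the notebook's `results` variable, so it runs on its own.

```python
# Mean RMSE per max_depth, copied (rounded) from the output above.
mean_rmse_by_depth = {10: 0.442, 15: 0.445, 20: 0.446, 25: 0.446}

# The best depth is the key with the smallest mean RMSE.
best_depth = min(mean_rmse_by_depth, key=mean_rmse_by_depth.get)
print(best_depth, mean_rmse_by_depth[best_depth])
```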


### Question 5

We can extract feature importance information from tree-based models.

At each step, the decision tree learning algorithm looks for the best split. For every split we can compute the "gain": the reduction in impurity from the parent node to its children. Accumulated per feature, this gain indicates which features matter most to a tree-based model.

In Scikit-Learn, tree-based models contain this information in the [`feature_importances_`](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html#sklearn.ensemble.RandomForestRegressor.feature_importances_) field.
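
As a rough illustration of "gain" for a regression tree, the sketch below uses variance as the impurity measure and computes the reduction for a single split. The function names and numbers are made up. scikit-learn accumulates reductions of this kind over all splits on a feature and normalizes them, so this shows only the core arithmetic, not the library's exact computation.

```python
def variance(values):
    """Variance of a list of targets (the squared-error impurity of a node)."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def split_gain(parent, left, right):
    """Impurity reduction: parent variance minus the size-weighted child variances."""
    n = len(parent)
    weighted_children = (len(left) * variance(left) + len(right) * variance(right)) / n
    return variance(parent) - weighted_children


# Made-up targets for one node and its two children.
parent = [20.0, 19.0, 13.0, 12.0]
left, right = [20.0, 19.0], [13.0, 12.0]
gain = split_gain(parent, left, right)
print(gain)
```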

For this homework question, we'll find the most important feature:
- Train the model with these parameters:
  - `n_estimators=10`,
  - `max_depth=20`,
  - `random_state=1`,
  - `n_jobs=-1` (optional)
- Get the feature importance information from this model

What's the most important feature (among these 4)?
- `vehicle_weight`
- `horsepower`
- `acceleration`
- `engine_displacement`

In [19]:
# Train the Random Forest model
rf = RandomForestRegressor(n_estimators=10, max_depth=20, random_state=1, n_jobs=-1)
rf.fit(X_train, y_train)

# Get feature importance
importances = rf.feature_importances_
feature_names = dv.get_feature_names_out()

# Combine into a DataFrame for readability
fi = pd.DataFrame({'feature': feature_names, 'importance': importances}).sort_values(by='importance', ascending=False)
fi.head(10)

|    | feature | importance |
|---|---|---|
| 13 | vehicle_weight | 0.95915 |
| 6 | horsepower | 0.015998 |
| 0 | acceleration | 0.01148 |
| 3 | engine_displacement | 0.003273 |
| 7 | model_year | 0.003212 |
| 8 | num_cylinders | 0.002343 |
| 9 | num_doors | 0.001635 |
| 12 | origin=USA | 0.00054 |
| 11 | origin=Europe | 0.000519 |
| 10 | origin=Asia | 0.000462 |


### Question 6

Now let's train an XGBoost model! For this question, we'll tune the `eta` parameter:
- Install XGBoost
- Create DMatrix for train and validation
- Create a watchlist
- Train a model with these parameters for 100 rounds:

<pre>xgb_params = {
    'eta': 0.3,
    'max_depth': 6,
    'min_child_weight': 1,

    'objective': 'reg:squarederror',
    'nthread': 8,

    'seed': 1,
    'verbosity': 1,
}</pre>


Now change `eta` from `0.3` to `0.1`.

Which `eta` leads to the best RMSE score on the validation dataset?
- 0.3
- 0.1
- Both give equal value
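
To build some intuition for `eta` (the learning rate, or shrinkage), here is a deliberately simplified boosting loop in pure Python. Each round fits the simplest possible "weak learner", the mean of the current residuals, and adds `eta` times its output to the prediction. The function `boost_constant` and the targets are made up for illustration; real XGBoost fits regularized trees, so this only sketches the role of `eta`.

```python
def boost_constant(y, eta, rounds):
    """Toy boosting: each round fits the mean residual and adds eta times it."""
    pred = 0.0
    for _ in range(rounds):
        residuals = [t - pred for t in y]
        step = sum(residuals) / len(residuals)  # output of the "weak learner"
        pred += eta * step
    return pred


y = [10.0, 12.0, 14.0]  # made-up targets with mean 12
for eta in (0.3, 0.1):
    print(eta, round(boost_constant(y, eta, rounds=10), 3))
```

In this toy setting a smaller `eta` approaches the target mean more slowly for the same number of rounds. Which value validates better on real data depends on the number of rounds and on overfitting, which is what the question asks you to measure.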

In [21]:
# !pip install xgboost
import xgboost as xgb

# Convert to DMatrix
dtrain = xgb.DMatrix(X_train, label=y_train)
dval = xgb.DMatrix(X_val, label=y_val)
watchlist = [(dtrain, 'train'), (dval, 'val')]

ModuleNotFoundError: No module named 'xgboost'

In [None]:
# Train model with different eta
# Common parameters
params = {
    'max_depth': 6,
    'min_child_weight': 1,
    'objective': 'reg:squarederror',
    'nthread': 8,
    'seed': 1,
    'verbosity': 1
}

# eta = 0.3
params['eta'] = 0.3
model1 = xgb.train(params, dtrain, num_boost_round=100, evals=watchlist, verbose_eval=False)
preds1 = model1.predict(dval)
rmse1 = np.sqrt(mean_squared_error(y_val, preds1))

# eta = 0.1
params['eta'] = 0.1
model2 = xgb.train(params, dtrain, num_boost_round=100, evals=watchlist, verbose_eval=False)
preds2 = model2.predict(dval)
rmse2 = np.sqrt(mean_squared_error(y_val, preds2))

print(f"RMSE eta=0.3: {rmse1:.3f}")
print(f"RMSE eta=0.1: {rmse2:.3f}")