### Dataset

In this homework, we will use the California Housing Prices dataset from [Kaggle](https://www.kaggle.com/datasets/camnugent/california-housing-prices).

Here's a wget-able [link](https://raw.githubusercontent.com/alexeygrigorev/datasets/master/housing.csv):

```bash
wget https://raw.githubusercontent.com/alexeygrigorev/datasets/master/housing.csv
```

The goal of this homework is to create a regression model for predicting housing prices (column `'median_house_value'`).


### Preparing the dataset

For this homework, we only want to use a subset of data. This is the same subset we used in homework #2.
But in contrast to homework #2, we are going to use all columns of the dataset.

First, keep only the records where `ocean_proximity` is either `'<1H OCEAN'` or `'INLAND'`.

Preparation:

* Fill missing values with zeros.
* Apply the log transform to `median_house_value`.
* Do train/validation/test split with 60%/20%/20% distribution.
* Use the `train_test_split` function and set the `random_state` parameter to 1.
* Use `DictVectorizer(sparse=True)` to turn the dataframes into matrices.
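
`train_test_split` only splits data in two, so one common way to reach a 60%/20%/20% distribution is to call it twice. The sketch below checks the arithmetic with exact fractions; the `split_fractions` helper and its argument names are hypothetical and used only for illustration.

```python
from fractions import Fraction


def split_fractions(test_size, val_size_of_remainder):
    """Return (train, val, test) shares of the full dataset for a two-stage split."""
    test = test_size
    remainder = 1 - test_size  # what is left after carving out the test set
    val = remainder * val_size_of_remainder
    train = remainder - val
    return train, val, test


# First split holds out 20% for test; second split takes 25% of the remaining 80%.
train, val, test = split_fractions(Fraction(1, 5), Fraction(1, 4))
print(train, val, test)  # 3/5 1/5 1/5
```

These are ideal shares; the actual row counts can differ slightly because set sizes are rounded to whole rows.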


## Question 1

Let's train a decision tree regressor to predict the `median_house_value` variable.

* Train a model with `max_depth=1`.


Which feature is used for splitting the data?

* `ocean_proximity`
* `total_rooms`
* `latitude`
* `population`

In [2]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction import DictVectorizer
from sklearn.tree import DecisionTreeRegressor
from sklearn.tree import export_text

data_path = 'https://raw.githubusercontent.com/alexeygrigorev/datasets/master/housing.csv'
housing_data = pd.read_csv(data_path)

# Preprocess the data
# 1. Keep only the records where `ocean_proximity` is either `'<1H OCEAN'` or `'INLAND'`
filtered_data = housing_data[housing_data['ocean_proximity'].isin(['<1H OCEAN', 'INLAND'])]

# 2. Fill missing values with zeros
filled_data = filtered_data.fillna(0)

# 3. Apply the log transform to `median_house_value`
filled_data['median_house_value'] = np.log1p(filled_data['median_house_value'])

# 4. Do train/validation/test split with 60%/20%/20% distribution, setting random_state to 1
X = filled_data.drop('median_house_value', axis=1)
y = filled_data['median_house_value']
X_train_full, X_test, y_train_full, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
X_train, X_val, y_train, y_val = train_test_split(X_train_full, y_train_full, test_size=0.25, random_state=1)

# 5. Use `DictVectorizer(sparse=True)` to turn the dataframes into matrices
dv = DictVectorizer(sparse=True)
X_train_dict = X_train.to_dict(orient='records')
X_val_dict = X_val.to_dict(orient='records')
X_test_dict = X_test.to_dict(orient='records')
X_train_vect = dv.fit_transform(X_train_dict)
X_val_vect = dv.transform(X_val_dict)
X_test_vect = dv.transform(X_test_dict)

dt_regressor = DecisionTreeRegressor(max_depth=1, random_state=1)
dt_regressor.fit(X_train_vect, y_train)

tree_rules = export_text(dt_regressor, feature_names=list(dv.get_feature_names_out()))
print(tree_rules)

|--- ocean_proximity=<1H OCEAN <= 0.50
|   |--- value: [11.61]
|--- ocean_proximity=<1H OCEAN >  0.50
|   |--- value: [12.30]
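
The depth-1 tree above is a single rule on the one-hot column `ocean_proximity=<1H OCEAN`. As an illustration only, the sketch below re-implements that rule by hand, with the leaf values copied from the tree output, and converts them back to the original price scale with `expm1`, the inverse of `log1p`. The `stump_predict` function is a hypothetical helper, not part of scikit-learn.

```python
import math


def stump_predict(record):
    """Mimic the depth-1 tree: one threshold on a one-hot feature."""
    # DictVectorizer encodes the category as 1.0 when present, 0.0 otherwise.
    is_near_ocean = 1.0 if record["ocean_proximity"] == "<1H OCEAN" else 0.0
    if is_near_ocean <= 0.5:
        return 11.61  # leaf value for INLAND records (log1p scale)
    return 12.30  # leaf value for <1H OCEAN records (log1p scale)


for proximity in ["INLAND", "<1H OCEAN"]:
    log_price = stump_predict({"ocean_proximity": proximity})
    # expm1 undoes the log1p transform applied to median_house_value.
    print(proximity, log_price, round(math.expm1(log_price)))
```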

## Question 2

Train a random forest model with these parameters:

* `n_estimators=10`
* `random_state=1`
* `n_jobs=-1` (optional - to make training faster)


What's the RMSE of this model on validation?

* 0.045
* 0.245
* 0.545
* 0.845

In [3]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

# Training a random forest model with specified parameters
rf_regressor = RandomForestRegressor(n_estimators=10, random_state=1, n_jobs=-1)
rf_regressor.fit(X_train_vect, y_train)

# Predicting on the validation set
y_pred = rf_regressor.predict(X_val_vect)

# Calculating RMSE on the validation set
rmse = np.sqrt(mean_squared_error(y_val, y_pred))
rmse

0.24495290030597153
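
For reference, RMSE is the square root of the mean squared difference between targets and predictions. Because the target was log-transformed, the value above is in log units rather than dollars. A minimal pure-Python sketch on made-up numbers follows; the `rmse` helper and the sample values are illustrative and not taken from the dataset.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error of two equal-length sequences."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up log-scale targets and predictions, for illustration only.
y_true = [12.0, 11.5, 12.5, 11.0]
y_pred = [12.0, 11.0, 13.0, 11.0]
print(rmse(y_true, y_pred))
```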

## Question 3

Now let's experiment with the `n_estimators` parameter.

* Try different values of this parameter from 10 to 200 with step 10.
* Set `random_state` to `1`.
* Evaluate the model on the validation dataset.


After which value of `n_estimators` does RMSE stop improving?
Round RMSE to 3 decimal places when determining the answer.

- 10
- 25
- 50
- 160

In [5]:
# Trying different values of n_estimators from 10 to 200 with step 10, and evaluating the model on the validation dataset
n_estimators_range = range(10, 201, 10)
rmse_scores = []

for n_estimators in n_estimators_range:
    rf_regressor = RandomForestRegressor(n_estimators=n_estimators, random_state=1, n_jobs=-1)
    rf_regressor.fit(X_train_vect, y_train)
    y_pred = rf_regressor.predict(X_val_vect)
    rmse = np.sqrt(mean_squared_error(y_val, y_pred))
    rmse_scores.append((n_estimators, rmse))
    print(f"Number of estimators: {n_estimators}, RMSE: {rmse}")

# Displaying RMSE scores for different values of n_estimators
rmse_scores

Number of estimators: 10, RMSE: 0.24495290030597153
Number of estimators: 20, RMSE: 0.23833358987366798
Number of estimators: 30, RMSE: 0.23650402956159838
Number of estimators: 40, RMSE: 0.23509490973460043
Number of estimators: 50, RMSE: 0.23475019819586204
Number of estimators: 60, RMSE: 0.2345062010902977
Number of estimators: 70, RMSE: 0.23440154550608353
Number of estimators: 80, RMSE: 0.23457219159286974
Number of estimators: 90, RMSE: 0.23447044062017394
Number of estimators: 100, RMSE: 0.23428380626857082
Number of estimators: 110, RMSE: 0.23421064634979388
Number of estimators: 120, RMSE: 0.234022293955548
Number of estimators: 130, RMSE: 0.2338243957678853
Number of estimators: 140, RMSE: 0.23366366433540584
Number of estimators: 150, RMSE: 0.23354802078200423
Number of estimators: 160, RMSE: 0.23342267954173773
Number of estimators: 170, RMSE: 0.23338695822923
Number of estimators: 180, RMSE: 0.23356148302077012
Number of estimators: 190, RMSE: 0.23380601278831945
Number of estimators: 200, RMSE: 0.23370275895578774

[(10, 0.24495290030597153),
 (20, 0.23833358987366798),
 (30, 0.23650402956159838),
 (40, 0.23509490973460043),
 (50, 0.23475019819586204),
 (60, 0.2345062010902977),
 (70, 0.23440154550608353),
 (80, 0.23457219159286974),
 (90, 0.23447044062017394),
 (100, 0.23428380626857082),
 (110, 0.23421064634979388),
 (120, 0.234022293955548),
 (130, 0.2338243957678853),
 (140, 0.23366366433540584),
 (150, 0.23354802078200423),
 (160, 0.23342267954173773),
 (170, 0.23338695822923),
 (180, 0.23356148302077012),
 (190, 0.23380601278831945),
 (200, 0.23370275895578774)]

In [7]:
print(min(rmse_scores, key=lambda tup: tup[1]))

(170, 0.23338695822923)
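
The question asks for 3 decimal places, whereas the `min` call above compares unrounded values. One possible way to apply the rounding is sketched below, using the validation scores from the output above; `first_best` is a hypothetical helper name.

```python
# Validation RMSE per n_estimators, copied from the output above.
rmse_scores = [
    (10, 0.24495290030597153), (20, 0.23833358987366798),
    (30, 0.23650402956159838), (40, 0.23509490973460043),
    (50, 0.23475019819586204), (60, 0.2345062010902977),
    (70, 0.23440154550608353), (80, 0.23457219159286974),
    (90, 0.23447044062017394), (100, 0.23428380626857082),
    (110, 0.23421064634979388), (120, 0.234022293955548),
    (130, 0.2338243957678853), (140, 0.23366366433540584),
    (150, 0.23354802078200423), (160, 0.23342267954173773),
    (170, 0.23338695822923), (180, 0.23356148302077012),
    (190, 0.23380601278831945), (200, 0.23370275895578774),
]


def first_best(scores, ndigits=3):
    """Return the first (n_estimators, rmse) whose rounded RMSE is the minimum."""
    # min() keeps the first item among ties, so with scores sorted by
    # n_estimators, equal rounded values resolve to the smallest n_estimators.
    return min(scores, key=lambda pair: round(pair[1], ndigits))


print(first_best(rmse_scores))  # (160, 0.23342267954173773)
```

At 3 decimal places the scores for 160 and 170 estimators are both 0.233, so the rounded RMSE stops improving at 160.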


## Question 4

Let's select the best `max_depth`:

* Try different values of `max_depth`: `[10, 15, 20, 25]`
* For each of these values,
  * try different values of `n_estimators` from 10 to 200 (with step 10)
  * calculate the mean RMSE
* Fix the random seed: `random_state=1`


What's the best `max_depth`, using the mean RMSE?

* 10
* 15
* 20
* 25

In [12]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
import numpy as np

max_depth_values = [10, 15, 20, 25]
n_estimators_range = range(10, 201, 10)
mean_rmse_scores = {}

for max_depth in max_depth_values:
    print(f"""##########
Max depth: {max_depth}""")
    rmse_scores = []
    for n_estimators in n_estimators_range:
        rf_regressor = RandomForestRegressor(n_estimators=n_estimators, max_depth=max_depth, random_state=1, n_jobs=-1)
        rf_regressor.fit(X_train_vect, y_train)
        y_pred = rf_regressor.predict(X_val_vect)
        rmse = np.sqrt(mean_squared_error(y_val, y_pred))
        print(f"Number of estimators: {n_estimators}, RMSE: {rmse}")
        rmse_scores.append(rmse)
    mean_rmse_scores[max_depth] = np.mean(rmse_scores)

# Finding the max_depth with the lowest mean RMSE
best_max_depth, best_mean_rmse = min(mean_rmse_scores.items(), key=lambda x: x[1])
print("Best max_depth:", best_max_depth, "with mean RMSE:", best_mean_rmse)

##########
Max depth: 10
Number of estimators: 10, RMSE: 0.250682176668064
Number of estimators: 20, RMSE: 0.24745502781785716
Number of estimators: 30, RMSE: 0.24626379186617842
Number of estimators: 40, RMSE: 0.24502936288687294
Number of estimators: 50, RMSE: 0.24543003121441526
Number of estimators: 60, RMSE: 0.24522113262934925
Number of estimators: 70, RMSE: 0.24528983725642312
Number of estimators: 80, RMSE: 0.24553630580063834
Number of estimators: 90, RMSE: 0.24545212921717774
Number of estimators: 100, RMSE: 0.2453700257720694
Number of estimators: 110, RMSE: 0.24526239970363065
Number of estimators: 120, RMSE: 0.24505304321886626
Number of estimators: 130, RMSE: 0.24478991812503073
Number of estimators: 140, RMSE: 0.24457214573772543
Number of estimators: 150, RMSE: 0.2445535109898143
Number of estimators: 160, RMSE: 0.24445890373792142
Number of estimators: 170, RMSE: 0.24441806155085397
Number of estimators: 180, RMSE: 0.2445233168833414
Number of estimators: 190, RMSE: 0.

## Question 5

We can extract feature importance information from tree-based models.

At each step, the decision tree learning algorithm finds the best split.
For each split, we can calculate the "gain": the reduction in impurity achieved by the split.
This gain helps identify which features are important for tree-based models.
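
To make "gain" concrete, here is a minimal sketch for a regression split, assuming variance (mean squared deviation from the mean) as the impurity measure. The `variance` and `split_gain` helpers and the toy targets are hypothetical and only illustrate the idea.

```python
def variance(values):
    """Population variance: mean squared deviation from the mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def split_gain(left, right):
    """Impurity of the parent minus the size-weighted impurity of the children."""
    parent = left + right
    n = len(parent)
    children = (len(left) / n) * variance(left) + (len(right) / n) * variance(right)
    return variance(parent) - children


# A split that separates low targets from high ones removes all variance.
print(split_gain([1.0, 1.0, 1.0], [5.0, 5.0, 5.0]))  # 4.0
# A split that leaves both sides equally mixed gains nothing.
print(split_gain([1.0, 5.0], [1.0, 5.0]))  # 0.0
```

A feature whose splits repeatedly produce large gains accumulates a high importance score.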

In Scikit-Learn, tree-based models expose this information in the
[`feature_importances_`](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html#sklearn.ensemble.RandomForestRegressor.feature_importances_)
attribute.

For this homework question, we'll find the most important feature:

* Train the model with these parameters:
  * `n_estimators=10`,
  * `max_depth=20`,
  * `random_state=1`,
  * `n_jobs=-1` (optional)
* Get the feature importance information from this model


What's the most important feature (among these 4)?

* `total_rooms`
* `median_income`
* `total_bedrooms`
* `longitude`


In [13]:
rf_regressor = RandomForestRegressor(n_estimators=10, max_depth=20, random_state=1, n_jobs=-1)
rf_regressor.fit(X_train_vect, y_train)

# Getting feature importance information from the model
feature_importances = rf_regressor.feature_importances_
features = dv.get_feature_names_out()

# Creating a dictionary to map feature names to their importance scores
feature_importance_dict = dict(zip(features, feature_importances))

# Finding the most important feature among the specified ones
specified_features = ['total_rooms', 'median_income', 'total_bedrooms', 'longitude']
most_important_feature = max(specified_features, key=lambda x: feature_importance_dict[x])
print("Most important feature:", most_important_feature)

Most important feature: median_income


## Question 6

Now let's train an XGBoost model! For this question, we'll tune the `eta` parameter:

* Install XGBoost
* Create DMatrix for train and validation
* Create a watchlist
* Train a model with these parameters for 100 rounds:

```
xgb_params = {
    'eta': 0.3,
    'max_depth': 6,
    'min_child_weight': 1,
    
    'objective': 'reg:squarederror',
    'nthread': 8,
    
    'seed': 1,
    'verbosity': 1,
}
```

Now change `eta` from `0.3` to `0.1`.

Which eta leads to the best RMSE score on the validation dataset?

* 0.3
* 0.1
* Both give equal value

In [14]:
!pip install xgboost



In [15]:
import xgboost as xgb

dtrain = xgb.DMatrix(X_train_vect, label=y_train)
dval = xgb.DMatrix(X_val_vect, label=y_val)

watchlist = [(dtrain, 'train'), (dval, 'val')]

xgb_params = {
    'eta': 0.3,
    'max_depth': 6,
    'min_child_weight': 1,
    'objective': 'reg:squarederror',
    'nthread': 8,
    'seed': 1,
    'verbosity': 1,
}

model = xgb.train(xgb_params, dtrain, num_boost_round=100, evals=watchlist, verbose_eval=5)

[0]	train-rmse:0.44350	val-rmse:0.44250
[5]	train-rmse:0.25338	val-rmse:0.27463
[10]	train-rmse:0.21444	val-rmse:0.25179
[15]	train-rmse:0.19858	val-rmse:0.24522
[20]	train-rmse:0.18524	val-rmse:0.23978
[25]	train-rmse:0.17757	val-rmse:0.23830
[30]	train-rmse:0.16888	val-rmse:0.23570
[35]	train-rmse:0.16113	val-rmse:0.23416
[40]	train-rmse:0.15542	val-rmse:0.23318
[45]	train-rmse:0.14941	val-rmse:0.23190
[50]	train-rmse:0.14536	val-rmse:0.23225
[55]	train-rmse:0.14150	val-rmse:0.23197
[60]	train-rmse:0.13719	val-rmse:0.23139
[65]	train-rmse:0.13259	val-rmse:0.23158
[70]	train-rmse:0.12943	val-rmse:0.23068
[75]	train-rmse:0.12555	val-rmse:0.23039
[80]	train-rmse:0.12192	val-rmse:0.22886
[85]	train-rmse:0.11854	val-rmse:0.22888
[90]	train-rmse:0.11496	val-rmse:0.22861
[95]	train-rmse:0.11211	val-rmse:0.22908
[99]	train-rmse:0.10989	val-rmse:0.22862


In [16]:
y_pred = model.predict(dval)
rmse_eta_03 = np.sqrt(mean_squared_error(y_val, y_pred))
print('RMSE with eta=0.3:', rmse_eta_03)

RMSE with eta=0.3: 0.228623199980106


In [17]:
xgb_params['eta'] = 0.1

model = xgb.train(xgb_params, dtrain, num_boost_round=100, evals=watchlist, verbose_eval=5)
y_pred = model.predict(dval)
rmse_eta_01 = np.sqrt(mean_squared_error(y_val, y_pred))
print('RMSE with eta=0.1:', rmse_eta_01)

[0]	train-rmse:0.52449	val-rmse:0.52045
[5]	train-rmse:0.37822	val-rmse:0.38151
[10]	train-rmse:0.30326	val-rmse:0.31427
[15]	train-rmse:0.26538	val-rmse:0.28380
[20]	train-rmse:0.24512	val-rmse:0.26882
[25]	train-rmse:0.23026	val-rmse:0.25997
[30]	train-rmse:0.21887	val-rmse:0.25266
[35]	train-rmse:0.21020	val-rmse:0.24826
[40]	train-rmse:0.20392	val-rmse:0.24539
[45]	train-rmse:0.19814	val-rmse:0.24293
[50]	train-rmse:0.19215	val-rmse:0.24020
[55]	train-rmse:0.18809	val-rmse:0.23878
[60]	train-rmse:0.18457	val-rmse:0.23791
[65]	train-rmse:0.18063	val-rmse:0.23698
[70]	train-rmse:0.17741	val-rmse:0.23622
[75]	train-rmse:0.17468	val-rmse:0.23510
[80]	train-rmse:0.17242	val-rmse:0.23453
[85]	train-rmse:0.17014	val-rmse:0.23404
[90]	train-rmse:0.16797	val-rmse:0.23332
[95]	train-rmse:0.16562	val-rmse:0.23276
[99]	train-rmse:0.16323	val-rmse:0.23209
RMSE with eta=0.1: 0.23208927121609343


In [18]:
import math

if math.isclose(rmse_eta_03, rmse_eta_01):
    print("Both give equal value")
elif rmse_eta_03 < rmse_eta_01:
    print("eta=0.3 is better")
else:
    print("eta=0.1 is better")

eta=0.3 is better
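
The comparison above uses the RMSE after the final round. As an optional check, the sketch below parses the last few logged lines of each run (copied from the outputs above) and reports the round with the lowest validation RMSE among them. `best_round` and the regular expression are illustrative helpers, not part of XGBoost.

```python
import re

# Matches lines such as "[90] train-rmse:0.11496 val-rmse:0.22861".
LOG_LINE = re.compile(r"\[(\d+)\]\s+train-rmse:([\d.]+)\s+val-rmse:([\d.]+)")


def best_round(log_text):
    """Return (round, val_rmse) with the lowest validation RMSE in a log."""
    rows = [(int(m.group(1)), float(m.group(3))) for m in LOG_LINE.finditer(log_text)]
    return min(rows, key=lambda row: row[1])


# Last logged rounds of each run, copied from the training output.
log_eta_03 = """
[80] train-rmse:0.12192 val-rmse:0.22886
[85] train-rmse:0.11854 val-rmse:0.22888
[90] train-rmse:0.11496 val-rmse:0.22861
[95] train-rmse:0.11211 val-rmse:0.22908
[99] train-rmse:0.10989 val-rmse:0.22862
"""
log_eta_01 = """
[80] train-rmse:0.17242 val-rmse:0.23453
[85] train-rmse:0.17014 val-rmse:0.23404
[90] train-rmse:0.16797 val-rmse:0.23332
[95] train-rmse:0.16562 val-rmse:0.23276
[99] train-rmse:0.16323 val-rmse:0.23209
"""

print("eta=0.3:", best_round(log_eta_03))  # eta=0.3: (90, 0.22861)
print("eta=0.1: ", best_round(log_eta_01)[1])
```

Among these logged rounds, `eta=0.3` still reaches the lower validation RMSE, which agrees with the final-round comparison.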


## Submit the results

- Submit your results here: https://forms.gle/Qa2SuzG7QGZNCaoV9
- If your answer doesn't match options exactly, select the closest one.
- You can submit your solution multiple times. In this case, only the last submission will be used.