
#**HOMEWORK**  

The goal of this homework is to create a regression model for predicting housing prices (column 'median_house_value').  

In this homework we'll again use the California Housing Prices dataset - the same one we used in homework 2 and 3.

You can take it from [Kaggle](https://www.kaggle.com/datasets/camnugent/california-housing-prices) or download it with `wget`:

```
wget https://raw.githubusercontent.com/alexeygrigorev/datasets/master/housing.csv

```



In [1]:
#@ IMPORTING LIBRARIES AND DEPENDENCIES:
import re
import numpy as np
import pandas as pd
import seaborn as sns
import xgboost as xgb
import matplotlib.pyplot as plt
from tqdm.auto import tqdm
from sklearn.tree import DecisionTreeRegressor
from sklearn.tree import export_text
from sklearn.feature_extraction import DictVectorizer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

%matplotlib inline

In [6]:
#@ DOWNLOADING THE DATASET: UNCOMMENT BELOW:
!wget https://raw.githubusercontent.com/alexeygrigorev/datasets/master/housing.csv

--2022-10-12 17:09:01--  https://raw.githubusercontent.com/alexeygrigorev/datasets/master/housing.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.109.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1423529 (1.4M) [text/plain]
Saving to: ‘housing.csv.1’


2022-10-12 17:09:01 (65.2 MB/s) - ‘housing.csv.1’ saved [1423529/1423529]



In [7]:
#@ READING DATASET:
PATH = "./housing.csv"
select_cols = ["longitude", "latitude", "housing_median_age", "total_rooms", "total_bedrooms", "population", "households", 
               "median_income", "median_house_value", "ocean_proximity"]
df = pd.read_csv(PATH, usecols=select_cols)
df.total_bedrooms = df.total_bedrooms.fillna(0)

- Apply the log transform to `median_house_value`. 
- Do train/validation/test split with 60%/20%/20% distribution.
- Use the `train_test_split` function and set the `random_state` parameter to 1.
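The 60%/20%/20% split takes two calls to `train_test_split`: the first holds out 20% for testing, and the second takes 25% of the remaining 80% (which is 20% of the total) for validation. A minimal sketch on a hypothetical stand-in DataFrame (`df_demo` and its values are made up for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the housing data: 1000 rows, one feature, one target.
df_demo = pd.DataFrame({
    "feature": np.arange(1000),
    "median_house_value": np.linspace(50_000, 500_000, 1000),
})

# Log-transform the target, as the homework asks.
df_demo["median_house_value"] = np.log(df_demo["median_house_value"])

# Step 1: hold out 20% of all rows for testing.
df_full_train_demo, df_test_demo = train_test_split(
    df_demo, test_size=0.2, random_state=1
)

# Step 2: 0.25 of the remaining 80% equals 20% of the original data.
df_train_demo, df_val_demo = train_test_split(
    df_full_train_demo, test_size=0.25, random_state=1
)

print(len(df_train_demo), len(df_val_demo), len(df_test_demo))  # 600 200 200
```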

In [8]:
# Apply the log transform to median_house_value.
df.median_house_value = np.log(df.median_house_value)

In [11]:
#@ SPLITTING THE DATASET FOR TRAINING AND TEST:

df_full_train, df_test = train_test_split(df, test_size=0.2, random_state=1)
df_train, df_val = train_test_split(df_full_train, test_size=0.25, random_state=1)

y_train = df_train.median_house_value.values
y_test = df_test.median_house_value.values
y_val = df_val.median_house_value.values

del df_train['median_house_value']
del df_test['median_house_value']
del df_val['median_house_value']

In [12]:
df_train.isna().sum()

longitude             0
latitude              0
housing_median_age    0
total_rooms           0
total_bedrooms        0
population            0
households            0
median_income         0
ocean_proximity       0
dtype: int64

- We will use `DictVectorizer` to turn train and validation into matrices.
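A `DictVectorizer` is usually fitted on the training records only and then applied to validation with `transform`, so both matrices share the same columns. The sketch below uses toy records (hypothetical values) in which the validation set lacks one category, to show what refitting would change:

```python
from sklearn.feature_extraction import DictVectorizer

# Toy records (hypothetical values); the validation set lacks the ISLAND category.
train_records = [
    {"ocean_proximity": "INLAND", "median_income": 3.0},
    {"ocean_proximity": "ISLAND", "median_income": 5.0},
    {"ocean_proximity": "NEAR BAY", "median_income": 4.0},
]
val_records = [
    {"ocean_proximity": "INLAND", "median_income": 2.5},
    {"ocean_proximity": "NEAR BAY", "median_income": 6.0},
]

dv_demo = DictVectorizer(sparse=False)
X_train_demo = dv_demo.fit_transform(train_records)  # learns the column layout
X_val_demo = dv_demo.transform(val_records)          # reuses that layout

# Refitting on validation silently produces a different set of columns.
X_val_refit = DictVectorizer(sparse=False).fit_transform(val_records)

print(X_train_demo.shape, X_val_demo.shape, X_val_refit.shape)
# (3, 4) (2, 4) (2, 3)
```

With `transform`, a category missing from validation simply becomes an all-zero column, so a model trained on `X_train_demo` sees the features in the positions it expects.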

In [13]:
#@ IMPLEMENTATION OF DICTVECTORIZER:
dv = DictVectorizer(sparse=False)

train_dicts = df_train.to_dict(orient='records')
x_train = dv.fit_transform(train_dicts)

val_dicts = df_val.to_dict(orient='records')
x_val = dv.transform(val_dicts)  # reuse the columns fitted on train; do not refit


#**Question 1**

Let's train a decision tree regressor to predict the `median_house_value` variable.

Train a model with `max_depth=1`.

In [14]:
#@ TRAINING THE REGRESSION MODEL:
model_dt = DecisionTreeRegressor(max_depth=1, random_state=1)
model_dt.fit(x_train, y_train)

DecisionTreeRegressor(max_depth=1, random_state=1)

In [15]:
#@ INSPECTION:

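# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it.
print(export_text(model_dt, feature_names=list(dv.get_feature_names_out())))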

|--- ocean_proximity=INLAND <= 0.50
|   |--- value: [12.31]
|--- ocean_proximity=INLAND >  0.50
|   |--- value: [11.61]





- Which feature is used for splitting the data?

- Answer: **ocean_proximity**
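To see why a depth-1 tree reports a single feature, here is a small sketch on synthetic data (the names `is_inland` and `noise` are hypothetical). The tree has room for only one split, so it picks the feature that reduces the error the most:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

rng = np.random.default_rng(1)

# Synthetic data: column 0 is a binary flag that shifts the target, column 1 is noise.
flag = rng.integers(0, 2, size=200)
noise = rng.normal(size=200)
X_demo = np.column_stack([flag, noise])
y_demo = 12.0 - 0.7 * flag + 0.01 * rng.normal(size=200)

stump = DecisionTreeRegressor(max_depth=1, random_state=1)
stump.fit(X_demo, y_demo)

# export_text takes a plain list of feature names.
print(export_text(stump, feature_names=["is_inland", "noise"]))

# Index of the feature used at the root node.
root_feature = stump.tree_.feature[0]
```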

#**Question 2**

Train a random forest model with these parameters:

- `n_estimators=10`  
- `random_state=1`  
- `n_jobs=-1` (optional, to make training faster)

In [18]:
#@ TRAINING RANDOM FOREST MODEL:
model_rf = RandomForestRegressor(n_estimators=10, n_jobs=-1, random_state=1)
model_rf.fit(x_train, y_train)
y_pred = model_rf.predict(x_val)


In [21]:
#@ CALCULATING MEAN SQUARED ERROR:
mean_squared_error(y_val, y_pred)

0.060197049028941726

- What's the RMSE of this model on validation?

- Answer: **0.25** (the value printed above is the MSE; its square root, √0.0602 ≈ 0.245, is the RMSE)
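`mean_squared_error` returns the MSE, so an RMSE question needs one more step: taking the square root. A minimal sketch with toy values (hypothetical, on the log scale like the homework target):

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Toy values (hypothetical), on the log scale like the homework target.
y_true_demo = np.array([12.0, 11.5, 12.3, 11.8])
y_pred_demo = np.array([12.2, 11.4, 12.0, 11.9])

mse_demo = mean_squared_error(y_true_demo, y_pred_demo)
rmse_demo = np.sqrt(mse_demo)  # RMSE is the square root of MSE

print(f"MSE:  {mse_demo:.4f}")
print(f"RMSE: {rmse_demo:.4f}")
```

Because the errors here are below 1, the RMSE is larger than the MSE, which is why the two are easy to tell apart once both are printed.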

#**Question 3**

Now, let's experiment with the `n_estimators` parameter.

- Try different values of this parameter from 10 to 200 with step 10.
- Set `random_state` to 1.
- Evaluate the model on the validation dataset.

In [35]:
param = np.linspace(10, 200, 20, dtype='int') 
param

array([ 10,  20,  30,  40,  50,  60,  70,  80,  90, 100, 110, 120, 130,
       140, 150, 160, 170, 180, 190, 200])

In [47]:
#@ TRAINING THE RANDOM FOREST MODEL:
q3 = []

for n in param:
  model_rf = RandomForestRegressor(n_estimators=n, n_jobs=-1, random_state=1)
  model_rf.fit(x_train, y_train)
  y_pred = model_rf.predict(x_val)

  q3.append([n, "  ",round(mean_squared_error(y_val, y_pred),2)])

KeyboardInterrupt: ignored

In [39]:
#@ INSPECTING THE SCORES (MSE ROUNDED TO 2 DECIMALS, NOT RMSE):
q3

[[10, '  ', 0.06],
 [20, '  ', 0.06],
 [30, '  ', 0.06],
 [40, '  ', 0.05],
 [50, '  ', 0.05],
 [60, '  ', 0.05],
 [70, '  ', 0.05],
 [80, '  ', 0.05],
 [90, '  ', 0.05],
 [100, '  ', 0.05],
 [110, '  ', 0.05],
 [120, '  ', 0.05],
 [130, '  ', 0.05],
 [140, '  ', 0.05],
 [150, '  ', 0.05],
 [160, '  ', 0.05],
 [170, '  ', 0.05],
 [180, '  ', 0.05],
 [190, '  ', 0.05],
 [200, '  ', 0.05]]

- After which value of `n_estimators` does RMSE stop improving?

- Answer: **40**
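The stopping point can also be read off programmatically rather than by eye. The sketch below rebuilds the rounded scores printed above and records the last `n_estimators` value at which the score improved; the variable names are illustrative:

```python
# Rounded scores as printed above: 0.06 up to n=30, 0.05 from n=40 onward.
scores = [(n, 0.06 if n < 40 else 0.05) for n in range(10, 201, 10)]

best = float("inf")
last_improvement = None

for n, score in scores:
    # Count only strict improvements over the best score seen so far.
    if score < best:
        best = score
        last_improvement = n

print(last_improvement)  # 40
```

Note that rounding to two decimals hides small improvements, so keeping more digits may move the stopping point.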

#**Question 4**

Let's select the best `max_depth`:

- Try different values of `max_depth`: [10, 15, 20, 25].
- For each of these values, try different values of `n_estimators` from 10 to 200 (with step 10).
- Fix the random seed: `random_state=1`.

In [46]:
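#@ TRAINING THE MODEL WITH DEPTH:
q4 = []
depths = [10, 15, 20, 25]  # renamed from `max` to avoid shadowing the built-in

for m in depths:
    for n in range(10, 201, 10):
        model_rf = RandomForestRegressor(n_estimators=n, max_depth=m, n_jobs=-1, random_state=1)
        model_rf.fit(x_train, y_train)
        y_pred = model_rf.predict(x_val)

        # Store (max_depth, n_estimators, rmse); RMSE is the square root of MSE.
        rmse = np.sqrt(mean_squared_error(y_val, y_pred))
        q4.append([m, n, rmse])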

KeyboardInterrupt: ignored

In [None]:
columns = ['max_depth', 'n_estimators', 'rmse']
df_scores = pd.DataFrame(q4, columns=columns)

In [None]:
for d in [10, 15, 20, 25]:
    df_subset = df_scores[df_scores.max_depth == d]

    # Plot the rmse column; df_scores has no `auc` column.
    plt.plot(df_subset.n_estimators, df_subset.rmse,
             label='max_depth=%d' % d)

plt.legend()

- What's the best `max_depth`?

- Answer:
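Besides plotting, one way to compare depths is to average the RMSE over all `n_estimators` values for each `max_depth` and take the lowest. The numbers below are made up purely to show the mechanics; they are not results from this notebook:

```python
import pandas as pd

# Made-up scores for illustration only; NOT results from this notebook.
df_scores_demo = pd.DataFrame(
    [
        (10, 10, 0.30), (10, 20, 0.29),
        (15, 10, 0.27), (15, 20, 0.26),
        (20, 10, 0.28), (20, 20, 0.27),
    ],
    columns=["max_depth", "n_estimators", "rmse"],
)

# Mean RMSE per depth, lowest (best) first.
mean_rmse = df_scores_demo.groupby("max_depth")["rmse"].mean().sort_values()
print(mean_rmse)

best_depth = mean_rmse.index[0]
print("best max_depth:", best_depth)
```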

#**Question 5**

We can extract feature importance information from tree-based models.

At each step of the decision tree learning algorithm, it finds the best split. When doing so, we can calculate the "gain": the reduction in impurity from before the split to after it. This gain is quite useful for understanding which features are important for tree-based models.

In Scikit-Learn, tree-based models contain this information in the `feature_importances_` field.

For this homework question, we'll find the most important feature:

Train the model with these parameters:
- `n_estimators=10`,
- `max_depth=20`,
- `random_state=1`,
- `n_jobs=-1` (optional)

Get the feature importance information from this model.
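As a rough sketch of how the importances can be read out, the example below trains a forest with the parameters listed above on synthetic data (the feature names and the data are hypothetical) and ranks the features by `feature_importances_`:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)

# Synthetic data: the target is driven almost entirely by column 1.
X_demo = rng.normal(size=(300, 3))
y_demo = 5.0 * X_demo[:, 1] + 0.1 * rng.normal(size=300)
names = ["feat_a", "feat_b", "feat_c"]  # hypothetical feature names

rf_demo = RandomForestRegressor(
    n_estimators=10, max_depth=20, random_state=1, n_jobs=-1
)
rf_demo.fit(X_demo, y_demo)

# Pair each name with its importance and sort from most to least important.
ranked = sorted(
    zip(names, rf_demo.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, importance in ranked:
    print(f"{name}: {importance:.3f}")
```

In the homework, the names would come from the fitted `DictVectorizer` instead of a hand-written list, in the same order as the matrix columns.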

In [None]:
#@ TRAINING THE RANDOM FOREST MODEL:


- What's the most important feature?

- Answer:

#**Question 6**

Now let's train an XGBoost model! For this question, we'll tune the `eta` parameter:

- Install XGBoost.
- Create a `DMatrix` for train and validation.
- Create a watchlist.
- Train a model with these parameters for 100 rounds:

```
xgb_params = {  
    'eta': 0.3,  
    'max_depth': 6,  
    'min_child_weight': 1,  

    'objective': 'reg:squarederror',
    'nthread': 8,

    'seed': 1,
    'verbosity': 1,
}
```



In [None]:
#@ CREATING THE DMATRIX:
features = dv.feature_names_

# XGBoost rejects feature names containing '[', ']' or '<' (e.g. "ocean_proximity=<1H OCEAN").
regex = re.compile(r"[\[\]<]")
features = [regex.sub("_", col) for col in features]

# Use the lowercase x_train / x_val matrices created by the DictVectorizer cell.
dtrain = xgb.DMatrix(x_train, label=y_train, feature_names=features)
dval = xgb.DMatrix(x_val, label=y_val, feature_names=features)

- Now, change eta first to 0.1 and then to 0.01.

- Which eta leads to the best RMSE score on the validation dataset?

- Answer:

#**Submit The Results**

* Submit your results here: https://forms.gle/3yMSuQ4BeNuZFHTU8
* You can submit your solution multiple times. In this case, only the last submission will be used.
* If your answer doesn't match options exactly, select the closest one

**Deadline**

The deadline for submitting is 17 October (Monday), 23:00 CEST.

After that, the form will be closed.