The deadline for this homework is **08.11.2023 08:59** (right before the practice session). After completing the exercises, you should:

1. Download this file to your computer (`File` $\to$ `Download .ipynb`)

2. Name the file in the following way *HWx_NameSurname* (for example `HW4_NshanPotikyan.ipynb`)

3. Send the file to this email address `nshan.potikyan@gmail.com` with subject **ML4**

**Note**

* If you fail to meet any of the above conditions, your homework will not be graded.

* You do not need to send any dataset files or helper scripts that I provide with your homework (since I already have them).

* You need to write the code for the exercises yourself; you can use ``built-in functions``, ``numpy``, ``pandas``, ``sklearn``
and ``matplotlib``.

**Problem.** During the practice session we tried several regression methods on the California housing dataset.

* In this homework, take the same dataset, but this time you need to

 * perform feature engineering (remove unnecessary features, combine several features into one, etc.)
 * experiment with different regression methods (the ones that we have discussed so far) to beat the score we got during the practice session
 * perform hyper-parameter tuning for each of the methods

* Evaluate the final model's performance in terms of mean absolute error.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
train_data = pd.read_csv('sample_data/california_housing_train.csv')
test_data = pd.read_csv('sample_data/california_housing_test.csv')

In [3]:
train_data.head()

| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value |
|---|---|---|---|---|---|---|---|---|---|
| 0 | -114.31 | 34.19 | 15.0 | 5612.0 | 1283.0 | 1015.0 | 472.0 | 1.4936 | 66900.0 |
| 1 | -114.47 | 34.4 | 19.0 | 7650.0 | 1901.0 | 1129.0 | 463.0 | 1.82 | 80100.0 |
| 2 | -114.56 | 33.69 | 17.0 | 720.0 | 174.0 | 333.0 | 117.0 | 1.6509 | 85700.0 |
| 3 | -114.57 | 33.64 | 14.0 | 1501.0 | 337.0 | 515.0 | 226.0 | 3.1917 | 73400.0 |
| 4 | -114.57 | 33.57 | 20.0 | 1454.0 | 326.0 | 624.0 | 262.0 | 1.925 | 65500.0 |


In [4]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler


X_train, X_val, y_train, y_val = train_test_split(train_data.iloc[:, :-1],
                                                  train_data.iloc[:, [-1]],
                                                  test_size=0.2,
                                                  random_state=42)

In [5]:
X_train['rooms_per_household'] = X_train['total_rooms'] / X_train['households']
X_train['bedrooms_per_room'] = X_train['total_bedrooms'] / X_train['total_rooms']
X_train['population_per_household'] = X_train['population'] / X_train['households']

In [6]:
X_val['rooms_per_household'] = X_val['total_rooms'] / X_val['households']
X_val['bedrooms_per_room'] = X_val['total_bedrooms'] / X_val['total_rooms']
X_val['population_per_household'] = X_val['population'] / X_val['households']

In [7]:
X_train = X_train.drop(['total_rooms', 'total_bedrooms', 'population', 'households'], axis=1)
X_val = X_val.drop(['total_rooms', 'total_bedrooms', 'population', 'households'], axis=1)
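
The cells above apply the same three ratio features and the same column drop to `X_train` and `X_val` separately, and the test set would need them too. A minimal sketch of one way to keep the splits consistent, assuming a hypothetical helper named `add_ratio_features` (not part of the original notebook) and a tiny demo frame whose values are copied from the first two rows of `train_data.head()`:

```python
import pandas as pd

# Raw count columns that are replaced by the per-household ratios.
RAW_COUNT_COLUMNS = ["total_rooms", "total_bedrooms", "population", "households"]


def add_ratio_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with ratio features added and raw counts dropped."""
    out = df.copy()  # work on a copy so the caller's frame is left untouched
    out["rooms_per_household"] = out["total_rooms"] / out["households"]
    out["bedrooms_per_room"] = out["total_bedrooms"] / out["total_rooms"]
    out["population_per_household"] = out["population"] / out["households"]
    return out.drop(columns=RAW_COUNT_COLUMNS)


# Illustrative frame only; in the notebook this would be X_train, X_val or the test features.
demo = pd.DataFrame({
    "total_rooms": [5612.0, 7650.0],
    "total_bedrooms": [1283.0, 1901.0],
    "population": [1015.0, 1129.0],
    "households": [472.0, 463.0],
    "median_income": [1.4936, 1.82],
})

engineered = add_ratio_features(demo)
print(engineered.columns.tolist())
```

Working on a copy also sidesteps pandas' chained-assignment warnings that can appear when new columns are assigned to a frame returned by `train_test_split`.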


In [8]:
X_train.head()

| | longitude | latitude | housing_median_age | median_income | rooms_per_household | bedrooms_per_room | population_per_household |
|---|---|---|---|---|---|---|---|
| 4679 | -118.07 | 33.81 | 22.0 | 8.5407 | 7.366848 | 0.129841 | 3.546196 |
| 2512 | -117.63 | 33.5 | 12.0 | 7.2013 | 7.355691 | 0.148107 | 3.060976 |
| 993 | -117.09 | 32.57 | 17.0 | 5.1478 | 5.103448 | 0.186937 | 4.103448 |
| 1327 | -117.16 | 32.81 | 34.0 | 3.6371 | 6.002639 | 0.164835 | 2.693931 |
| 4630 | -118.07 | 34.17 | 36.0 | 5.5418 | 5.847458 | 0.163147 | 2.941889 |


In standard linear regression (i.e., ordinary least squares), it is not necessary to scale the target variable.

However, scaling the target variable might be beneficial in certain situations:

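* **Comparing Coefficients**: Scaling the target rescales every coefficient by the same factor, so on its own it does not make coefficients of features on different scales comparable; for that, the features themselves need to be standardized. What target scaling does give you is coefficients expressed in standard deviations of the target, which can be easier to read than raw dollar amounts.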

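* **Gradient Descent**: If you train the model with an optimization algorithm like gradient descent, a target with very large values (house values here are in the tens to hundreds of thousands) produces large gradients, so scaling it can make the step size easier to choose and help the optimization converge. For regularized techniques like ``Ridge`` or ``Lasso``, it is mainly the scale of the features that matters, because the penalty acts on the coefficients; for ``Lasso``, the useful range of ``alpha`` also depends on the scale of the target.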

* **Residual Analysis**: With a standardized target, the residuals (the differences between predicted and actual values) are expressed in standard deviations of the target. This makes them comparable across datasets, at the cost of losing the original units.

* **Numerical Stability**: Very large target values can hurt precision or conditioning in some solvers; bringing the target to unit scale can reduce that risk.

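If you apply target scaling, the model's predictions are in scaled units as well, so apply the inverse transformation (for example `target_scaler.inverse_transform`) to report predictions in the original units. Error metrics computed on the scaled target, like the ones reported in this notebook, are in units of the target's standard deviation.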

In [9]:
target_scaler = StandardScaler()

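# Fit the scaler on the training target only, then reuse it for validation.
# ravel() returns the 1-D shape that sklearn estimators expect for y.
y_train = target_scaler.fit_transform(y_train).ravel()
y_val = target_scaler.transform(y_val).ravel()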
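
Because the target is standardized, any metric computed against the scaled `y_val` is in standard deviations of the house value rather than in dollars. A minimal sketch of mapping predictions back with `inverse_transform` before computing MAE, on synthetic data (the arrays `X` and `y` below are made up for illustration and are not the housing data):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
# Synthetic target on a house-price-like scale (assumed values, for illustration only).
y = 200_000 + 50_000 * X[:, 0] + rng.normal(scale=10_000, size=200)

scaler = StandardScaler()
y_scaled = scaler.fit_transform(y.reshape(-1, 1)).ravel()

model = LinearRegression().fit(X, y_scaled)
pred_scaled = model.predict(X)

# Map predictions back to the original units before reporting.
pred = scaler.inverse_transform(pred_scaled.reshape(-1, 1)).ravel()

mae_scaled = mean_absolute_error(y_scaled, pred_scaled)
mae_original = mean_absolute_error(y, pred)

# StandardScaler is an affine map, so the two MAEs differ by the fitted standard deviation.
print(mae_scaled, mae_original, mae_scaled * scaler.scale_[0])
```

An alternative that handles both directions automatically is `sklearn.compose.TransformedTargetRegressor`, which wraps a regressor together with a target transformer.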

# Linear Regression

In [10]:
from sklearn.linear_model import LinearRegression

model = LinearRegression()
model.fit(X_train, y_train)

In [11]:
from sklearn.metrics import mean_absolute_error, mean_squared_error

y_pred = model.predict(X_val)

mae = mean_absolute_error(y_val, y_pred)
mse = mean_squared_error(y_val, y_pred)
rmse = np.sqrt(mse)

print("Mean Absolute Error:", mae)
print("Mean Squared Error:", mse)
print("Root Mean Squared Error:", rmse)

Mean Absolute Error: 0.4499302000329306
Mean Squared Error: 0.3794658441020234
Root Mean Squared Error: 0.6160079902907294
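
The same three metrics are computed and printed in every section of this notebook. One way to avoid repeating that code is a small helper; the names `regression_metrics` and `print_metrics` are hypothetical and not part of the original notebook, and the demo values are made up.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error


def regression_metrics(y_true, y_pred):
    """Return MAE, MSE and RMSE as a dictionary."""
    # Flatten so that (n, 1) and (n,) inputs are treated the same way.
    y_true = np.ravel(y_true)
    y_pred = np.ravel(y_pred)
    mse = mean_squared_error(y_true, y_pred)
    return {
        "mae": mean_absolute_error(y_true, y_pred),
        "mse": mse,
        "rmse": float(np.sqrt(mse)),
    }


def print_metrics(name, y_true, y_pred):
    """Print the metrics in the same format used throughout the notebook."""
    metrics = regression_metrics(y_true, y_pred)
    print(f"{name} - Mean Absolute Error:", metrics["mae"])
    print(f"{name} - Mean Squared Error:", metrics["mse"])
    print(f"{name} - Root Mean Squared Error:", metrics["rmse"])
    return metrics


# Made-up values, only to show the call.
demo_metrics = print_metrics("Demo", [0, 0, 0, 0], [1, -1, 3, -3])
```

In the notebook, a call such as `print_metrics("Ridge Regression", y_val, y_pred)` would replace the repeated metric lines in each cell.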


# Ridge Regression

In [12]:
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_absolute_error, mean_squared_error

ridge = Ridge()
parameters = {'alpha': [0.001, 0.01, 0.1, 1, 10, 100]}

ridge_regressor = GridSearchCV(ridge, parameters, scoring='neg_mean_absolute_error', cv=5)
ridge_regressor.fit(X_train, y_train)

y_pred = ridge_regressor.predict(X_val)
mae = mean_absolute_error(y_val, y_pred)
mse = mean_squared_error(y_val, y_pred)
rmse = np.sqrt(mse)

print("Ridge Regression - Mean Absolute Error:", mae)
print("Ridge Regression - Mean Squared Error:", mse)
print("Ridge Regression - Root Mean Squared Error:", rmse)

Ridge Regression - Mean Absolute Error: 0.44993027821162074
Ridge Regression - Mean Squared Error: 0.37946582808925294
Ridge Regression - Root Mean Squared Error: 0.6160079772935193


# Lasso Regression

In [13]:
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_absolute_error, mean_squared_error

lasso = Lasso()

parameters = {'alpha': np.logspace(-4, 1, 20)}
lasso_regressor = GridSearchCV(lasso, parameters, scoring='neg_mean_absolute_error', cv=5, n_jobs=-1)

lasso_regressor.fit(X_train, y_train)

best_lasso = lasso_regressor.best_estimator_
y_pred = best_lasso.predict(X_val)

mae = mean_absolute_error(y_val, y_pred)
mse = mean_squared_error(y_val, y_pred)
rmse = np.sqrt(mse)

print("Lasso Regression - Mean Absolute Error:", mae)
print("Lasso Regression - Mean Squared Error:", mse)
print("Lasso Regression - Root Mean Squared Error:", rmse)


Lasso Regression - Mean Absolute Error: 0.4499578577674567
Lasso Regression - Mean Squared Error: 0.37946158167968347
Lasso Regression - Root Mean Squared Error: 0.6160045305674979


# Regression Trees

In [14]:
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_absolute_error, mean_squared_error

tree_model = DecisionTreeRegressor()

param_grid = {
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 10, 20],
    'min_samples_leaf': [1, 5, 10]
}

tree_regressor_cv = GridSearchCV(tree_model, param_grid, scoring='neg_mean_absolute_error', cv=5, n_jobs=-1)
tree_regressor_cv.fit(X_train, y_train)

best_tree_model = tree_regressor_cv.best_estimator_
y_pred = best_tree_model.predict(X_val)

mae = mean_absolute_error(y_val, y_pred)
mse = mean_squared_error(y_val, y_pred)
rmse = np.sqrt(mse)

print("Decision Tree Regressor - Mean Absolute Error:", mae)
print("Decision Tree Regressor - Mean Squared Error:", mse)
print("Decision Tree Regressor - Root Mean Squared Error:", rmse)


Decision Tree Regressor - Mean Absolute Error: 0.33933818133640914
Decision Tree Regressor - Mean Squared Error: 0.26490010342027953
Decision Tree Regressor - Root Mean Squared Error: 0.5146844697679147


# Bagging Trees

In [15]:
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_absolute_error, mean_squared_error

base_model = DecisionTreeRegressor()

bagging_model = BaggingRegressor(base_model, random_state=42)
param_grid = {
    'n_estimators': [10, 50, 100],
    'max_samples': [0.5, 1.0],
    'max_features': [0.5, 1.0]
}

bagging_regressor_cv = GridSearchCV(bagging_model, param_grid, scoring='neg_mean_absolute_error', cv=5, n_jobs=-1)
bagging_regressor_cv.fit(X_train, y_train)

best_bagging_model = bagging_regressor_cv.best_estimator_
y_pred = best_bagging_model.predict(X_val)

mae = mean_absolute_error(y_val, y_pred)
mse = mean_squared_error(y_val, y_pred)
rmse = np.sqrt(mse)

print("Bagging Regressor - Mean Absolute Error:", mae)
print("Bagging Regressor - Mean Squared Error:", mse)
print("Bagging Regressor - Root Mean Squared Error:", rmse)




Bagging Regressor - Mean Absolute Error: 0.28308943354771327
Bagging Regressor - Mean Squared Error: 0.18953642279230273
Bagging Regressor - Root Mean Squared Error: 0.43535781007385493


# Random Forest Regressor

In [17]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_absolute_error, mean_squared_error

random_forest_model = RandomForestRegressor(random_state=42)

param_grid = {
    'n_estimators': [100, 200],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

rf_regressor_cv = GridSearchCV(random_forest_model, param_grid, scoring='neg_mean_absolute_error', cv=5, n_jobs=-1)
rf_regressor_cv.fit(X_train, y_train)

best_rf_model = rf_regressor_cv.best_estimator_
y_pred = best_rf_model.predict(X_val)

mae = mean_absolute_error(y_val, y_pred)
mse = mean_squared_error(y_val, y_pred)
rmse = np.sqrt(mse)

print("Random Forest Regressor - Mean Absolute Error:", mae)
print("Random Forest Regressor - Mean Squared Error:", mse)
print("Random Forest Regressor - Root Mean Squared Error:", rmse)




Random Forest Regressor - Mean Absolute Error: 0.28166720516181987
Random Forest Regressor - Mean Squared Error: 0.18818839776603433
Random Forest Regressor - Root Mean Squared Error: 0.4338068668958968
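
The problem statement asks for the final model's mean absolute error, and `test_data` is loaded at the top but not used in the cells shown. A minimal sketch of what a final evaluation could look like: apply the same feature engineering to the test set, predict, undo the target scaling, and report MAE in original units. Everything below runs on a synthetic stand-in (`make_fake_housing` and `prepare` are hypothetical helpers, and only a subset of the columns is mimicked), so the printed number says nothing about the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.preprocessing import StandardScaler


def make_fake_housing(n, seed):
    """Synthetic stand-in that reuses a few column names from the real CSV files."""
    rng = np.random.default_rng(seed)
    households = rng.integers(100, 1000, size=n).astype(float)
    df = pd.DataFrame({
        "median_income": rng.uniform(1, 10, size=n),
        "households": households,
        "total_rooms": households * rng.uniform(3, 8, size=n),
    })
    df["median_house_value"] = 40_000 * df["median_income"] + rng.normal(0, 5_000, size=n)
    return df


def prepare(df):
    """Split off the target and apply identical feature engineering to any split."""
    X = df.drop(columns="median_house_value")
    X["rooms_per_household"] = X["total_rooms"] / X["households"]
    X = X.drop(columns=["total_rooms", "households"])
    return X, df[["median_house_value"]]


train_df, test_df = make_fake_housing(400, seed=0), make_fake_housing(100, seed=1)
X_tr, y_tr = prepare(train_df)
X_te, y_te = prepare(test_df)

# Fit the target scaler on the training target only.
target_scaler = StandardScaler().fit(y_tr)

model = RandomForestRegressor(n_estimators=50, random_state=42)
model.fit(X_tr, target_scaler.transform(y_tr).ravel())

# Predictions come out in scaled units; map them back before scoring.
pred = target_scaler.inverse_transform(model.predict(X_te).reshape(-1, 1)).ravel()
test_mae = mean_absolute_error(y_te.to_numpy().ravel(), pred)
print(f"Test MAE (original units): {test_mae:.0f}")
```

In the notebook itself, the corresponding objects would be the tuned estimator (for example `best_rf_model`), the fitted `target_scaler`, and `test_data` after the same ratio features and column drops.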
