## Clustering with random forest

In this notebook we combine clustering with a random forest.

Our assumption is that if we add the cluster labels to our dataset as a new feature, then the random forest will perform better: it will be able to split the data on cluster membership and not only on the initial features.

With respect to the initial features, this lets the random forest make splits that are not simple thresholds on a single feature but "metric" splits, since K-means clusters are defined by the Euclidean metric.
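The loaded files already contain a `cluster` column, and how it was produced is not shown here. A minimal sketch of one way to build such a feature, assuming K-means fitted on scaled training features (the synthetic data and the names `X_new`, `X_train_k2`, `X_new_k2` are made up for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the housing features (hypothetical, not the notebook's data).
rng = np.random.default_rng(0)
cols = ["f0", "f1", "f2", "f3"]
X_train = pd.DataFrame(rng.normal(size=(200, 4)), columns=cols)
X_new = pd.DataFrame(rng.normal(size=(50, 4)), columns=cols)

# K-means relies on Euclidean distance, so the features are scaled first.
scaler = StandardScaler().fit(X_train)
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(scaler.transform(X_train))

# Training rows get their fitted label; unseen rows get the nearest centroid's label.
X_train_k2 = X_train.assign(cluster=kmeans.labels_)
X_new_k2 = X_new.assign(cluster=kmeans.predict(scaler.transform(X_new)))

print(X_train_k2["cluster"].value_counts())
```

Fitting the scaler and the clustering on the training rows only keeps information from unseen rows out of the new feature.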

In [104]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import plotly.express as px
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, root_mean_squared_error, mean_squared_error
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import cross_val_score
from sklearn import tree

## Data load

In [113]:
k2_train = pd.read_csv("models/random_forest_on_clusters/k2/k2_train.csv")
k2_test = pd.read_csv("models/random_forest_on_clusters/k2/k2_test.csv")

k3_train = pd.read_csv("models/random_forest_on_clusters/k3/k3_train.csv")
k3_test = pd.read_csv("models/random_forest_on_clusters/k3/k3_test.csv")

k6_train = pd.read_csv("models/random_forest_on_clusters/k6/k6_train.csv")
k6_test = pd.read_csv("models/random_forest_on_clusters/k6/k6_test.csv")

## Checking feature importance in new data

We can deliberately overfit a DecisionTreeRegressor (no depth limit) to check whether our new feature gets at least some importance when the tree chooses its splits.

### K = 2

In [114]:
reg2 = DecisionTreeRegressor()
reg2.fit(k2_train.drop("SalePrice", axis=1), k2_train["SalePrice"])

feat_importances_k2 = pd.DataFrame({"importance": reg2.feature_importances_}).set_index(k2_train.drop("SalePrice", axis=1).columns)
feat_importances_k2.sort_values(by="importance", ascending=False).head()

feature,importance
cluster,0.520982
OverallQual,0.173541
GrLivArea,0.06232
TotalBsmtSF,0.032317
1stFlrSF,0.021719


Super! For K=2 the cluster label gets the highest importance (about 0.52), well ahead of OverallQual (about 0.17). It looks like these labels carry real signal and can help us make more meaningful splits.

### K = 3

In [115]:
reg3 = DecisionTreeRegressor()
reg3.fit(k3_train.drop("SalePrice", axis=1), k3_train["SalePrice"])

feat_importances_k3 = pd.DataFrame({"importance": reg3.feature_importances_}).set_index(k3_train.drop("SalePrice", axis=1).columns)
feat_importances_k3.sort_values(by="importance", ascending=False).head(n=10)

feature,importance
OverallQual,0.566155
GrLivArea,0.122713
TotalBsmtSF,0.071143
1stFlrSF,0.020775
CentralAir,0.017935
GarageCars,0.017687
YearBuilt,0.01635
BsmtFinSF1,0.015297
OverallCond,0.012788
cluster,0.010756


For K=3 the cluster feature is far less important for the decision tree: it ranks tenth, with an importance of about 0.011.

### K = 6

In [116]:
reg6 = DecisionTreeRegressor()
reg6.fit(k6_train.drop("SalePrice", axis=1), k6_train["SalePrice"])

feat_importances_k6 = pd.DataFrame({"importance": reg6.feature_importances_}).set_index(k6_train.drop("SalePrice", axis=1).columns)
feat_importances_k6.loc["cluster"]

importance    0.003145
Name: cluster, dtype: float64

For K=6 the cluster feature is almost unimportant (importance of about 0.003).
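Importances from a single fully grown tree are computed on the training data and can change from run to run, so a second opinion can be useful. A hedged sketch of permutation importance on a held-out split, using synthetic data in which `cluster` is derived from feature `a` (all names and data here are illustrative assumptions):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data: the target depends on "a"; "cluster" is a coarse copy of "a".
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(300, 3)), columns=["a", "b", "c"])
X["cluster"] = (X["a"] > 0).astype(int)
y = 2.0 * X["a"] + rng.normal(scale=0.1, size=300)

X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.25, random_state=0)
model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_tr, y_tr)

# Shuffle one column at a time on held-out rows and measure the drop in score.
result = permutation_importance(model, X_val, y_val, n_repeats=5, random_state=0)
perm_importances = pd.Series(result.importances_mean, index=X.columns)
print(perm_importances.sort_values(ascending=False))
```

With the real data, the same call could be run on a validation split of each of the K=2, K=3 and K=6 frames to see whether the ranking of `cluster` holds up.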

## Applying the model

Let's train our Random Forest for each K and see what losses we get.

### K = 2

In [100]:
X1 = k2_train.drop(["SalePrice"], axis=1)
y1 = k2_train["SalePrice"]

X_train, X_test, y_train, y_test = train_test_split(X1, y1, test_size=0.2,
                                                    random_state=3, stratify=X1["cluster"])

real_y_train = np.exp(y_train)
real_y_test = np.exp(y_test)

print(X_train.shape)

rf_reg2 = RandomForestRegressor(max_depth=10, min_samples_leaf=4).fit(X_train, y_train)

train_pred = rf_reg2.predict(X_train)
test_pred = rf_reg2.predict(X_test)

print("TRAIN LOG RMSE:", root_mean_squared_error(y_train, train_pred))
print("TEST LOG RMSE:", root_mean_squared_error(y_test, test_pred))
print("." * 35)
print("TRAIN RMSE", root_mean_squared_error(real_y_train, np.exp(train_pred)))
print("TEST RMSE", root_mean_squared_error(real_y_test, np.exp(test_pred)))

(1167, 314)
TRAIN LOG RMSE: 0.09007113299443846
TEST LOG RMSE: 0.12200729663657744
...................................
TRAIN RMSE 18849.151556862333
TEST RMSE 23403.144409659362


Our Random Forest Regressor is overfitting: the loss is small on the train set and noticeably larger on the test set (log RMSE of about 0.090 versus 0.122). If we run a cross-validation score with a default forest, we get almost the same results on the test folds as we got on the initial data in the baseline file.

In [107]:
scores = cross_val_score(RandomForestRegressor(), X1, y1, cv=5, scoring="neg_root_mean_squared_error")
scores

array([-0.13619624, -0.15754212, -0.14676518, -0.13165255, -0.1440899 ])

### K = 3

In [120]:
X2 = k3_train.drop(["SalePrice"], axis=1)
y2 = k3_train["SalePrice"]

X_train, X_test, y_train, y_test = train_test_split(X2, y2, test_size=0.2,
                                                    random_state=3, stratify=X2["cluster"])

real_y_train = np.exp(y_train)
real_y_test = np.exp(y_test)

print(X_train.shape)

rf_reg3 = RandomForestRegressor().fit(X_train, y_train)

train_pred = rf_reg3.predict(X_train)
test_pred = rf_reg3.predict(X_test)

print("TRAIN LOG RMSE:", root_mean_squared_error(y_train, train_pred))
print("TEST LOG RMSE:", root_mean_squared_error(y_test, test_pred))
print("." * 35)
print("TRAIN RMSE", root_mean_squared_error(real_y_train, np.exp(train_pred)))
print("TEST RMSE", root_mean_squared_error(real_y_test, np.exp(test_pred)))

(1167, 314)
TRAIN LOG RMSE: 0.05368963545580474
TEST LOG RMSE: 0.13815264289097393
...................................
TRAIN RMSE 11477.88889114934
TEST RMSE 30618.22078262861


In [121]:
scores = cross_val_score(RandomForestRegressor(), X2, y2, cv=5, scoring="neg_root_mean_squared_error")
scores

array([-0.1390145 , -0.15613265, -0.14363535, -0.13050008, -0.14631757])
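The fold scores for K=2 and K=3 are easier to compare as averages. The sketch below copies the two printed arrays and flips the sign, since `neg_root_mean_squared_error` reports the negated RMSE (the variable names are chosen here for illustration):

```python
import numpy as np

# Fold scores copied from the cross_val_score outputs above (negated log RMSE).
scores_k2 = np.array([-0.13619624, -0.15754212, -0.14676518, -0.13165255, -0.1440899])
scores_k3 = np.array([-0.1390145, -0.15613265, -0.14363535, -0.13050008, -0.14631757])

for name, scores in [("K=2", scores_k2), ("K=3", scores_k3)]:
    rmse = -scores  # flip the sign back to a plain RMSE
    print(f"{name}: mean log RMSE = {rmse.mean():.4f}, std = {rmse.std():.4f}")
```

Both means come out at roughly 0.143, so on these folds the choice between K=2 and K=3 makes practically no difference to the default forest.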

### K = 6

In [122]:
X3 = k6_train.drop(["SalePrice"], axis=1)
y3 = k6_train["SalePrice"]

X_train, X_test, y_train, y_test = train_test_split(X3, y3, test_size=0.2,
                                                    random_state=3, stratify=X3["cluster"])

real_y_train = np.exp(y_train)
real_y_test = np.exp(y_test)

print(X_train.shape)

rf_reg6 = RandomForestRegressor().fit(X_train, y_train)

train_pred = rf_reg6.predict(X_train)
test_pred = rf_reg6.predict(X_test)

print("TRAIN LOG RMSE:", root_mean_squared_error(y_train, train_pred))
print("TEST LOG RMSE:", root_mean_squared_error(y_test, test_pred))
print("." * 35)
print("TRAIN RMSE", root_mean_squared_error(real_y_train, np.exp(train_pred)))
print("TEST RMSE", root_mean_squared_error(real_y_test, np.exp(test_pred)))

(1167, 314)
TRAIN LOG RMSE: 0.05364762001285598
TEST LOG RMSE: 0.146299498185989
...................................
TRAIN RMSE 11258.794667425187
TEST RMSE 27787.893399186516
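The three K blocks repeat the same steps, so they could be folded into one helper. A minimal sketch, assuming each frame has a log-scale `SalePrice` column and a `cluster` column as above; the function name `evaluate_forest` and the demo frame are hypothetical:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split


def evaluate_forest(df, target="SalePrice", seed=3, **rf_params):
    """Fit a forest on a stratified 80/20 split and return train/test RMSE."""
    X = df.drop(columns=[target])
    y = df[target]  # assumed to be log(price), as in the cells above
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, random_state=seed, stratify=X["cluster"]
    )
    model = RandomForestRegressor(random_state=seed, **rf_params).fit(X_tr, y_tr)

    def rmse(actual, predicted):
        # sqrt of MSE; also works on scikit-learn versions without root_mean_squared_error
        return float(np.sqrt(mean_squared_error(actual, predicted)))

    pred_tr, pred_te = model.predict(X_tr), model.predict(X_te)
    return {
        "train_log_rmse": rmse(y_tr, pred_tr),
        "test_log_rmse": rmse(y_te, pred_te),
        "train_rmse": rmse(np.exp(y_tr), np.exp(pred_tr)),
        "test_rmse": rmse(np.exp(y_te), np.exp(pred_te)),
    }


# Hypothetical demo frame; with the real data this would be k2_train, k3_train, k6_train.
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.normal(size=(300, 3)), columns=["f0", "f1", "f2"])
demo["cluster"] = rng.integers(0, 2, size=300)
demo["SalePrice"] = 12.0 + 0.3 * demo["f0"] + rng.normal(scale=0.1, size=300)

results = evaluate_forest(demo, max_depth=10, min_samples_leaf=4, n_estimators=50)
print(results)
```

Passing the same `rf_params` for every K would also make the three runs directly comparable, since the K=2 run above limits depth and leaf size while the K=3 and K=6 runs use a default forest.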
