## Clustering with random forest

In this notebook we will try combining clustering with a random forest.

Our assumption is that if we add the cluster labels to our dataset as a new feature, then the random forest will perform better: it will be able to split the data on cluster membership as well as on the initial features.

With respect to the initial features, this lets the random forest make splits that are not simple thresholds on a single feature. They are rather "metric" splits, because K-means assigns each point to its nearest centroid under the Euclidean metric.
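The idea above can be sketched as follows. This is a minimal illustration on synthetic data, not the preprocessing actually used to produce the CSV files in this notebook: the column names `f1`, `f2`, `f3`, the choice of three clusters, and the decision to fit K-means on the training split only are all assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Synthetic stand-in for the real feature table (hypothetical column names).
rng = np.random.default_rng(0)
X_train = pd.DataFrame(rng.normal(size=(200, 3)), columns=["f1", "f2", "f3"])
X_test = pd.DataFrame(rng.normal(size=(50, 3)), columns=["f1", "f2", "f3"])

# Fit K-means on the training features only, so no test information leaks in.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
kmeans.fit(X_train)

# Append the cluster label as an extra column; test rows get the label of
# their nearest training centroid.
train_with_cluster = X_train.assign(cluster=kmeans.labels_)
test_with_cluster = X_test.assign(cluster=kmeans.predict(X_test))
```

Because K-means relies on Euclidean distance, features on very different scales would probably need to be standardized first. The sketch skips that step because the synthetic features already share one scale.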

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import plotly.express as px
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, root_mean_squared_error
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor

## Data load

In [3]:
k2_train = pd.read_csv("models/random_forest_on_clusters/k2/k2_train.csv")
k2_test = pd.read_csv("models/random_forest_on_clusters/k2/k2_test.csv")

k3_train = pd.read_csv("models/random_forest_on_clusters/k3/k3_train.csv")
k3_test = pd.read_csv("models/random_forest_on_clusters/k3/k3_test.csv")

k6_train = pd.read_csv("models/random_forest_on_clusters/k6/k6_train.csv")
k6_test = pd.read_csv("models/random_forest_on_clusters/k6/k6_test.csv")

## Checking feature importance in new data

We can deliberately overfit a DecisionTreeRegressor to check whether our new feature has at least some importance during splitting.

### K = 2

In [12]:
k2_train.columns

Index(['Unnamed: 0', 'LotFrontage', 'LotArea', 'Street', 'Utilities',
       'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd', 'MasVnrArea',
       ...
       'MSSubClass_80', 'MSSubClass_85', 'MSSubClass_90', 'MSSubClass_120',
       'MSSubClass_150', 'MSSubClass_160', 'MSSubClass_180', 'MSSubClass_190',
       'SalePrice', 'cluster'],
      dtype='object', length=316)
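The `Unnamed: 0` column in this listing looks like a row index that was written to the CSV and then read back as an ordinary column. That is an assumption, since the code that created the files is not shown. If it holds, the column carries no real information and could be kept out of the feature set, as in this sketch on a small made-up frame:

```python
import io
import pandas as pd

# A tiny made-up frame, written the way DataFrame.to_csv does by default
# (the index is included as an unnamed first column).
original = pd.DataFrame({"LotArea": [100, 200], "SalePrice": [10, 20]})
csv_text = original.to_csv()

# Default read: the saved index comes back as a column named "Unnamed: 0".
with_index_column = pd.read_csv(io.StringIO(csv_text))

# Reading with index_col=0 restores it as the index instead of a feature.
without_index_column = pd.read_csv(io.StringIO(csv_text), index_col=0)
```

The same `index_col=0` argument could be passed to the `pd.read_csv` calls in the data load cell. Alternatively, the column could be dropped alongside `SalePrice` before fitting.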

In [10]:
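# Fit an unconstrained (deliberately overfit) tree on every column except the target.
X_k2 = k2_train.drop("SalePrice", axis=1)
reg2 = DecisionTreeRegressor()
reg2.fit(X_k2, k2_train["SalePrice"])

# One importance value per feature, indexed by feature name.
feat_importances_k2 = pd.DataFrame({"importance": reg2.feature_importances_}, index=X_k2.columns)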

In [11]:
feat_importances_k2

| | importance |
|---|---|
| Unnamed: 0 | 0.001922 |
| LotFrontage | 0.002382 |
| LotArea | 0.012400 |
| Street | 0.000000 |
| Utilities | 0.000000 |
| ... | ... |
| MSSubClass_150 | 0.000000 |
| MSSubClass_160 | 0.000000 |
| MSSubClass_180 | 0.000000 |
| MSSubClass_190 | 0.000000 |
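The truncated display above does not show the row for the `cluster` feature, which is the one we care about. One way to inspect it is to look the row up by label and to rank all features by importance. The sketch below uses a small frame of made-up importances (the values and the four feature names are placeholders, not results from this notebook). The same calls should work on `feat_importances_k2`.

```python
import pandas as pd

# Made-up importances; the real values come from reg2.feature_importances_.
feat_importances = pd.DataFrame(
    {"importance": [0.6, 0.3, 0.0, 0.1]},
    index=["OverallQual", "LotArea", "Street", "cluster"],
)

# Rank all features so zero-importance rows sink to the bottom.
ranked = feat_importances.sort_values("importance", ascending=False)

# Look up the engineered feature directly by its row label.
cluster_importance = feat_importances.loc["cluster", "importance"]
cluster_rank = ranked.index.get_loc("cluster") + 1  # 1-based position
```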


## K = 2

First, we will use the data clustered by K-means with K = 2.