## Predicting the selling price of houses in Ames, Iowa using machine learning techniques (Multiple Linear Regression, SVR, Decision Tree, Random Forest)

### Information about the dataset

- Number of observations (rows): **1461**
- Number of variables: **79**
- Dataset: https://www.kaggle.com/c/house-prices-advanced-regression-techniques
- Data fields description: see "data_description.txt"

### Importing main libraries

In [1]:
import numpy as np
import pandas as pd

### Importing the dataset

In [2]:
df = pd.read_csv('ames_1.csv')
df = df.drop(df.columns[0], axis=1) #deleting the id column
df = df.fillna(0) #replacing NaN with zeros; OneHotEncoder does not accept NaN
X = df.iloc[:, :-1].values
y = df.iloc[:, -1].astype(float).values #the last column holds the prices

### Creating a list of categorical variables and encoding them with LabelEncoder

In [3]:
col_list = [0,1,4,5,6,7,8,9,10,11,12,13,14,15,16,17,20,
            21,22,23,24,26,27,28,29,30,31,32,34,
            38,39,40,41,52,54,56,57,59,62,63,64,71,72,73,77,78]
#selection was done manually because some categorical variables are numerical, some are strings

from sklearn.preprocessing import LabelEncoder, OneHotEncoder
labelencoder = LabelEncoder() 
for i in col_list:
    df.iloc[:, i] = labelencoder.fit_transform(df.iloc[:, i].astype(str))
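As a rough illustration of what the label-encoding step does to a single column, the sketch below reproduces the mapping with plain NumPy. The column values and the variable names are made up for the example and are not taken from the dataset. `LabelEncoder` numbers the distinct labels in sorted order, which is also what `np.unique(..., return_inverse=True)` returns; note that a missing value filled with 0 becomes the label "0" after `.astype(str)`.

```python
import numpy as np

# Hypothetical categorical column after fillna(0) and .astype(str)
zoning = np.array(["RL", "RM", "RL", "FV", "0", "RM"])

# Sorted distinct labels and the integer code of every row
classes, codes = np.unique(zoning, return_inverse=True)

print(classes)  # distinct labels in sorted order
print(codes)    # one integer code per row
```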

### Creating a list of continuous variables

In [4]:
no_cat_var = []
for el in range(len(df.columns)-1):         #excluding variable that we are predicting
    if el in col_list:
        continue
    else:
        no_cat_var.append(el)

### Creating a reference dictionary to find corresponding variables after OneHotEncoding in the initial dataframe
This dictionary can be used to look up the original variables behind the columns chosen by "Forward Selection" and "Backward Elimination" further below.

In [5]:
ref_dict = {}
dict_iter=0 

### Encoding categorical variables with OneHotEncoder
The first categorical column is encoded and the result is stored in a separate ndarray, excluding the first dummy column. All other categorical columns are then encoded and appended to this ndarray in a loop. This makes it possible to apply OneHotEncoder to the whole range of categorical variables without manually encoding one variable after another.

In [6]:
df_categorical = df.iloc[:, col_list]                           #df with categorical variables

X_cat = df_categorical.iloc[:, :].values                        #categorical ndarray
X_cat[:, 0] = labelencoder.fit_transform(X_cat[:, 0])           
onehotencoder = OneHotEncoder(categorical_features = [0])       #encoding 1st column
X_cc = onehotencoder.fit_transform(X_cat).toarray()              
dummy_col = df_categorical.iloc[:, 0].nunique()                 #number of dummy columns created

X_cc_2 = X_cc[:, 1:dummy_col]                                   #moving to a separate ndarray, excluding the first dummy column

df_cat_no_one = df_categorical.iloc[:, 1:]                      #the first column is already encoded, so it is excluded from the loop below
X_cat_no_one = df_cat_no_one.iloc[:, :].values

ref_dict[0] = list(range(dict_iter, dict_iter+dummy_col-1))   #key: id of the original column; value: positions of its dummy columns (the first dummy was dropped)
dict_iter = dict_iter + dummy_col - 1


The first column has now been encoded into dummy variables, which were stored in a separate ndarray. <br>
The other encoded variables are appended to this ndarray in the loop below.
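The `categorical_features` argument used in this notebook belongs to older scikit-learn releases. Assuming a release in which `OneHotEncoder` accepts the `drop` parameter, the same "encode and drop the first dummy" step can be sketched for all categorical columns at once. The small array below is hypothetical and only stands in for the label-encoded categorical block.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical label-encoded categorical block with two columns
X_cat_demo = np.array([[0, 1],
                       [1, 0],
                       [2, 1],
                       [1, 1]])

encoder = OneHotEncoder(drop="first")          # drops the first level of every column
X_dummies = encoder.fit_transform(X_cat_demo).toarray()

print(X_dummies.shape)
print(encoder.categories_)                     # levels found in each column
```

The first column has three levels and the second has two, so three dummy columns remain after the first level of each is dropped.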

### Adding other categorical variables to ndarray via loop

In [7]:
for c in range(len(col_list)-1):                                                  #dict_iter carries over from the previous cell
    X_cat_no_one[:, c] = labelencoder.fit_transform(X_cat_no_one[:, c])
    onehotencoder = OneHotEncoder(categorical_features = [c])
    X_cc = onehotencoder.fit_transform(X_cat_no_one).toarray()
    dummy_col = df_categorical.iloc[:, c+1].nunique()                             #+1 because df_categorical still contains the first column
    X_cc2_2 = X_cc[:, 1:dummy_col]                                                #excluding first dummy column
    X_cc_2 = np.concatenate((X_cc_2, X_cc2_2), axis=1)                            #merge 2 ndarrays
    ref_dict[c+1] = list(range(dict_iter, dict_iter+dummy_col-1))                 #key: id of the original column; value: positions of its dummy columns (the first dummy was dropped)
    dict_iter = dict_iter + dummy_col - 1
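The bookkeeping behind the reference dictionary can be sketched on a toy example. This is a minimal NumPy-only illustration, not the notebook's own code: the helper `one_hot_drop_first` and the array `cat` are hypothetical, and the codes are assumed to be contiguous integers starting at 0, as a label encoder produces.

```python
import numpy as np

def one_hot_drop_first(codes):
    """Return dummy columns for integer codes 0..k-1, without the first level."""
    k = int(codes.max()) + 1
    dummies = np.eye(k)[codes]       # full one-hot matrix, shape (n, k)
    return dummies[:, 1:]            # drop the first dummy column

# Two hypothetical label-encoded categorical columns
cat = np.array([[0, 2],
                [1, 0],
                [2, 1],
                [1, 2]])

blocks, ref, start = [], {}, 0
for col in range(cat.shape[1]):
    block = one_hot_drop_first(cat[:, col])
    blocks.append(block)
    # original column id -> positions of its dummy columns in the merged array
    ref[col] = list(range(start, start + block.shape[1]))
    start += block.shape[1]

encoded = np.concatenate(blocks, axis=1)
print(encoded.shape)
print(ref)
```

Each original column with k levels contributes k-1 dummy columns, so the running offset advances by k-1 per column.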


### Adding all continuous variables to the ndarray with encoded categorical variables

In [8]:
df_non_categorical = df.iloc[:, no_cat_var]
X_no_cat_var = df_non_categorical.iloc[:, :].values
merged_dataset = np.concatenate((X_cc_2, X_no_cat_var), axis=1)

### Final dataset to work with
The merged dataset contains 296 columns.

### Selecting columns to work with
At this point there are 296 columns to choose from for a machine learning model. A subjective selection is not appropriate, so two approaches will be used to select the variables for the machine learning algorithms: "Backward Elimination" and "Forward Selection" (see https://en.wikipedia.org/wiki/Stepwise_regression).

Let's start with <b>Backward Elimination</b>:

In [9]:
import statsmodels.api as sm    #OLS with array inputs is provided by statsmodels.api

p=0.05

#inputs: the dataset and the p-value threshold
def BackwardElimination(merged_dataset, p):
    #prepend a column of ones (the intercept term), so the ones are in the first column
    #np.size(merged_dataset,0) is the number of rows in the numpy array
    merged_dataset = np.append(arr = np.ones((np.size(merged_dataset,0),1)).astype(int), values=merged_dataset, axis=1)

    #p is the significance threshold; it can be adjusted depending on the desired result (default - 0.05)
    while True:
        regressor_OLS = sm.OLS(endog = y, exog = merged_dataset).fit()
        p_values = regressor_OLS.pvalues
        #enable these prints to watch the selection process in real time
        #print("P values are: "+str(['%.3f' % i for i in p_values.tolist()]))
        #print("Max p value: "+str(max(p_values)))
        #print("==============================================")
        if max(p_values)<p:
            return merged_dataset       #every remaining variable is significant
        #otherwise drop the variable with the largest p-value and refit
        p_max_pos = p_values.tolist().index(max(p_values))
        merged_dataset = np.delete(merged_dataset, [p_max_pos], axis=1)

X = BackwardElimination(merged_dataset, p)
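To make the elimination loop easier to follow, here is a self-contained sketch on synthetic data that needs only NumPy. It is not the notebook's implementation: the helper names are hypothetical, and the p-values use a normal approximation to the t-distribution, which is an assumption that is reasonable only for large samples.

```python
import math
import numpy as np

def ols_p_values(X, y):
    """Two-sided p-values of OLS coefficients (normal approximation)."""
    n, k = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / (n - k)             # residual variance
    cov = sigma2 * np.linalg.pinv(X.T @ X)       # covariance of the coefficients
    t = beta / np.sqrt(np.diag(cov))
    return np.array([math.erfc(abs(v) / math.sqrt(2)) for v in t])

def backward_elimination(X, y, threshold=0.05):
    """Drop the column with the largest p-value until all are below threshold."""
    keep = list(range(X.shape[1]))
    while keep:
        p = ols_p_values(X[:, keep], y)
        worst = int(np.argmax(p))
        if p[worst] < threshold:
            break
        del keep[worst]
    return keep

rng = np.random.default_rng(0)
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=(n, 4))])
# only columns 1 and 2 influence the target; columns 3 and 4 are pure noise
y = 3.0 + 2.0 * X[:, 1] - 1.5 * X[:, 2] + rng.normal(scale=0.5, size=n)

print(backward_elimination(X, y))   # indices of the columns that survive
```

The intercept and the two informative columns always survive here; a noise column can occasionally survive as well, because a 5% threshold lets some false positives through.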

### LinearRegression with cross-validation (Backward Elimination)

In [10]:
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

from sklearn.linear_model import LinearRegression
regressor_back = LinearRegression()
regressor_back.fit(X_train, y_train)

# Predicting the Test set results
y_pred = regressor_back.predict(X_test)

r2_scores = cross_val_score(regressor_back, X_train, y_train, scoring='r2', cv=3)
print('Cross-validation score for r^2={}'.format(r2_scores))

Cross-validation score for r^2=[0.78636832 0.91921861 0.9122908 ]


<b>R-squared of Linear Regression</b>

In [11]:
from sklearn.metrics import r2_score
r2_score(y_test, y_pred) 

0.570934689084265

<b>MSE of Linear Regression</b>


In [12]:
from sklearn.metrics import mean_squared_error
mean_squared_error(y_test, y_pred)

2963060661.942607

<b>MAE of Linear Regression</b>


In [13]:
from sklearn.metrics import mean_absolute_error
mean_absolute_error(y_test, y_pred)

19686.99444786974
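For reference, the three metrics reported in this notebook can be computed by hand. The sketch below uses made-up prices and predictions, not values from the dataset. It also shows why the MSE above is so large: it is expressed in squared price units, and its square root (roughly 54,400 for the value above) is on the same scale as the prices.

```python
import numpy as np

y_true = np.array([200000.0, 150000.0, 320000.0, 180000.0])  # made-up prices
y_hat = np.array([210000.0, 140000.0, 300000.0, 185000.0])   # made-up predictions

residuals = y_true - y_hat
mse = np.mean(residuals ** 2)                 # mean squared error, in squared price units
mae = np.mean(np.abs(residuals))              # mean absolute error, in price units
ss_res = np.sum(residuals ** 2)               # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot                    # share of variance explained

print(mse, mae, r2)
```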

<b>Adding results to a table for summarization in the end</b>

In [14]:
model_name=[]
mse=[]
r2=[]
mae=[]

model_name.append("Backward/MLR")
mae.append(mean_absolute_error(y_test, y_pred))
r2.append(r2_score(y_test, y_pred))
mse.append(mean_squared_error(y_test, y_pred))

### SVR (RBF kernel) (Backward Elimination)

<b>Train test split and Feature Scaling</b>

In [15]:
from sklearn.svm import SVR
from sklearn.model_selection  import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)
sc_y = StandardScaler()
y_train = sc_y.fit_transform(y_train.reshape(-1, 1)) #reshape(-1, 1) is needed because y is a single column
y_test = sc_y.fit_transform(y_test.reshape(-1, 1)) #note: this re-fits the scaler, so y_test is standardized with its own mean and std
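The cell above standardizes the test targets with a second `fit_transform` call, that is, with the test set's own mean and standard deviation. A common alternative is to reuse the statistics estimated on the training data, which is what `sc_y.transform(...)` would do. The sketch below shows that idea with plain NumPy on made-up numbers, including the mapping back to the original units.

```python
import numpy as np

y_train = np.array([100.0, 200.0, 300.0, 400.0])   # made-up training targets
y_test = np.array([150.0, 450.0])                  # made-up test targets

# statistics are estimated on the training data only
mean, std = y_train.mean(), y_train.std()          # population std, as StandardScaler uses

y_train_scaled = (y_train - mean) / std
y_test_scaled = (y_test - mean) / std              # reuse the training statistics

# mapping back to the original units
y_test_restored = y_test_scaled * std + mean
print(y_test_scaled, y_test_restored)
```

Mapping predictions back this way also makes errors from scaled models comparable with errors reported in price units.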

<b>Using GridSearch to find the best combination of C and gamma</b>

In [16]:
#parameters
Cs = [0.0001, 0.001, 0.01, 0.1, 1, 10]
gammas = [0.0001, 0.001, 0.01, 0.1, 1, 2] 
param_grid = dict(gamma=gammas, C=Cs)

#model
from sklearn.model_selection import GridSearchCV
svr = SVR(kernel='rbf')
grid_search = GridSearchCV(svr, param_grid)

#fit best combination of parameters
grid_search.fit(X_train, y_train.ravel()) #ravel flattens the (n, 1) target back to a 1-D array

y_pred = grid_search.predict(X_test)

In [17]:
print('Grid best parameter (max. accuracy): ', grid_search.best_params_)

Grid best parameter (max. accuracy):  {'C': 10, 'gamma': 0.001}


In [18]:
print('Grid best score (accuracy): ', grid_search.best_score_) #mean cross-validated score (R^2 for regressors) on the training data

Grid best score (accuracy):  0.9094093728337792


<b>R-squared of SVR</b>

In [19]:
from sklearn.metrics import r2_score
r2_score(y_test, y_pred) 

0.7680587755446528

<b>MSE of SVR</b>

In [20]:
from sklearn.metrics import mean_squared_error
mean_squared_error(y_test, y_pred)

0.2319412244553471

<b>MAE of SVR</b>

In [21]:
from sklearn.metrics import mean_absolute_error
mean_absolute_error(y_test, y_pred)

0.21544819542982774

<b>Adding results to a table for summarization in the end</b>


In [22]:
model_name.append("Backward/SVR")
mae.append(mean_absolute_error(y_test, y_pred))
r2.append(r2_score(y_test, y_pred))
mse.append(mean_squared_error(y_test, y_pred))
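The SVR figures above add up to one: 0.768 + 0.232 = 1. This is a consequence of standardizing `y_test` with its own mean and standard deviation: the total sum of squares then equals the number of test samples, so MSE = 1 - R-squared. The sketch below checks the identity on synthetic numbers; the data and variable names are made up for the illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
y_test = rng.normal(loc=180000.0, scale=70000.0, size=200)   # made-up prices

# standardize the test targets with their own mean and (population) std
z_test = (y_test - y_test.mean()) / y_test.std()
# any predictions on that scale; here the targets plus noise
z_pred = z_test + rng.normal(scale=0.5, size=200)

mse = np.mean((z_test - z_pred) ** 2)
r2 = 1.0 - np.sum((z_test - z_pred) ** 2) / np.sum((z_test - z_test.mean()) ** 2)

print(mse + r2)   # equals 1 up to floating-point rounding
```

For the scaled models, the MSE column therefore carries no information beyond R-squared.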

### Decision Tree with cross-validation and GridSearch (Backward Elimination)

<b>Train test split and Feature Scaling</b>

In [23]:
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection  import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)
sc_y = StandardScaler()
y_train = sc_y.fit_transform(y_train.reshape(-1, 1)) #reshape(-1, 1) is needed because y is a single column
y_test = sc_y.fit_transform(y_test.reshape(-1, 1)) #note: this re-fits the scaler, so y_test is standardized with its own mean and std

<b>Using GridSearch to find the best combination of parameters</b>

In [24]:
max_depth = np.linspace(1, 40, 40, endpoint=True)

param_grid = dict(max_depth=max_depth)

#model
from sklearn.model_selection import GridSearchCV
dec_tree = DecisionTreeRegressor()
grid_search = GridSearchCV(dec_tree, param_grid)

#fit best combination of parameters
grid_search.fit(X_train, y_train.ravel()) #ravel flattens the (n, 1) target back to a 1-D array

y_pred = grid_search.predict(X_test)
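The grid above is built with `np.linspace`, which produces float depths such as 9.0; newer scikit-learn releases may reject non-integer values for `max_depth`. The following sketch, on synthetic data that is not part of this notebook, runs the same kind of search with integer depths and shows the difference between the cross-validated score and the score on the held-out test set.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))                      # synthetic features
y = 2.0 * X[:, 0] + np.sin(3.0 * X[:, 1]) + rng.normal(scale=0.1, size=300)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

param_grid = {"max_depth": list(range(1, 11))}     # integer depths
grid = GridSearchCV(DecisionTreeRegressor(random_state=0), param_grid, scoring="r2", cv=3)
grid.fit(X_train, y_train)

print(grid.best_params_)              # depth chosen by cross-validation
print(grid.best_score_)               # mean cross-validated R^2 on the training folds
print(grid.score(X_test, y_test))     # R^2 on the held-out test set
```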

In [25]:
print('Grid best parameter (max. accuracy): ', grid_search.best_params_)

Grid best parameter (max. accuracy):  {'max_depth': 9.0}


In [26]:
print('Grid best score (accuracy): ', grid_search.best_score_) #mean cross-validated score (R^2 for regressors) on the training data

Grid best score (accuracy):  0.7326437511733014


<b>R-squared of Decision Tree</b>

In [27]:
from sklearn.metrics import r2_score
r2_score(y_test, y_pred) 

0.7646343334387243

<b>MSE of Decision Tree</b>

In [28]:
from sklearn.metrics import mean_squared_error
mean_squared_error(y_test, y_pred)

0.23536566656127567

<b>MAE of Decision Tree</b>


In [29]:
from sklearn.metrics import mean_absolute_error
mean_absolute_error(y_test, y_pred)

0.3132771151905251

<b>Adding results to a table for summarization in the end</b>

In [30]:
model_name.append("Backward/Decision Tree")
mae.append(mean_absolute_error(y_test, y_pred))
r2.append(r2_score(y_test, y_pred))
mse.append(mean_squared_error(y_test, y_pred))

### Random Forest with GridSearch (Backward Elimination)

<b>Train test split and Feature Scaling</b>

In [31]:
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection  import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)
sc_y = StandardScaler()
y_train = sc_y.fit_transform(y_train.reshape(-1, 1)) #reshape(-1, 1) is needed because y is a single column
y_test = sc_y.fit_transform(y_test.reshape(-1, 1)) #note: this re-fits the scaler, so y_test is standardized with its own mean and std

<b>Using GridSearch to find the best combination of parameters</b>

In [32]:
from sklearn.ensemble import RandomForestRegressor

max_depth = np.linspace(1, 40, 40, endpoint=True)
n_estimators = [5,10,15,20,30]

param_grid = dict(max_depth=max_depth, n_estimators = n_estimators)

#model
from sklearn.model_selection import GridSearchCV
forest = RandomForestRegressor()
grid_search = GridSearchCV(forest, param_grid)

#fit best combination of parameters
grid_search.fit(X_train, y_train.ravel())

y_pred = grid_search.predict(X_test)

In [33]:
print('Grid best parameter (max. accuracy): ', grid_search.best_params_)

Grid best parameter (max. accuracy):  {'max_depth': 32.0, 'n_estimators': 30}


In [34]:
print('Grid best score (accuracy): ', grid_search.best_score_) #mean cross-validated score (R^2 for regressors) on the training data

Grid best score (accuracy):  0.857532073745342


<b>R-squared of Random Forest</b>

In [35]:
from sklearn.metrics import r2_score
r2_score(y_test, y_pred) 

0.8256867368203639

<b>MSE of Random Forest</b>

In [36]:
from sklearn.metrics import mean_squared_error
mean_squared_error(y_test, y_pred)

0.1743132631796361

<b>MAE of Random Forest</b>


In [37]:
from sklearn.metrics import mean_absolute_error
mean_absolute_error(y_test, y_pred)

0.23992240987487665

<b>Adding results to a table for summarization in the end</b>


In [38]:
model_name.append("Backward/Random Forest")
mae.append(mean_absolute_error(y_test, y_pred))
r2.append(r2_score(y_test, y_pred))
mse.append(mean_squared_error(y_test, y_pred))

Let's apply <b>Forward Selection</b>:

In [39]:
import statsmodels.api as sm    #OLS with array inputs is provided by statsmodels.api
y = df.iloc[:, -1].astype(float).values

def ForwardSelection(merged_dataset, p):
    unknown_variables = []                  #variables not yet selected as "good"; each iteration moves one variable from "unknown" to "good"
    for i in range(merged_dataset.shape[1]):
        unknown_variables.append(i)
    
    #adding b0 variable from formula
    merged_dataset = np.append(arr = np.ones((np.size(merged_dataset,0),1)).astype(int), values=merged_dataset, axis=1) #np.size(merged_categ,0) - number of rows in numpy array
    p = p
    
    ###first iteration is added separately, others in a loop below
    p_values_list=[]
    good_variables=[]
    for i in range(merged_dataset.shape[1]):
        X_opt = merged_dataset[:, i]
        regressor_OLS = sm.OLS(endog = y, exog = X_opt).fit()   #finding p value of every variable and y(the variable to predict)
        p_value = regressor_OLS.pvalues
        p_values_list.extend(p_value.tolist())
    min_p_value = min(p_values_list)                            #finding the minimum p value
    min_index = p_values_list.index(min_p_value)                #variable with the smallest p value
    good_variables.append(min_index)                            #add a variable to a "good" list
    unknown_variables.remove(min_index)                         #remove index from the list of "unknown" variables
    
    end=False
    while end==False:
        comb_list = []
        p_values_list=[]
        
        #this loop exists to make combinations of "good" variables with every "unknown" to find p value of every combination
        for i in unknown_variables:                            
            temp_list = []
            for t in good_variables:
                temp_list.append(t)
            temp_list.append(i)
            comb_list.append([temp_list])
            #print(temp_list)
        for el in comb_list:
            X_opt = merged_dataset[:, el[0]]
            regressor_OLS = sm.OLS(endog = y, exog = X_opt).fit()
            p_value = regressor_OLS.pvalues
            pvalue_lst = p_value.tolist()
            p_values_list.append(pvalue_lst[-1])
        #finding combination with min p value
        min_p_value = min(p_values_list)                            
        min_index = p_values_list.index(min_p_value)
        good_variables.append(comb_list[min_index][-1][-1])
        unknown_variables.remove(comb_list[min_index][-1][-1])
        #uncomment to see every step
        #print("Min p value: "+str(min_p_value))
        #print("List of variables: "+str(good_variables))
        #print("####################################")
        if min_p_value>p:
             end=True
        
    #print("UN: "+str(unknown_variables))
    print("GN: "+str(good_variables))
    return merged_dataset[:, good_variables]

In [40]:
p = 0.05
X = ForwardSelection(merged_dataset, p)

GN: [0, 269, 259, 159, 265, 279, 275, 93, 92, 85, 70, 235, 40, 168, 91, 264, 1, 249, 49, 99, 258, 100, 98, 101, 56, 50, 286, 234, 233, 243, 255, 206, 113, 55, 116, 60, 277, 77, 260, 261, 37, 285, 107, 97, 76, 173, 29, 19, 268, 41, 112, 119, 170, 22, 108, 158, 284, 276, 274, 270, 272, 228, 210, 186, 17]
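For comparison, a textbook variant of forward selection can be sketched with NumPy alone. It is not the function above: it always starts from the intercept, it stops before adding a variable whose p-value is not below the threshold, and it approximates p-values with the normal distribution (an assumption that is reasonable only for large samples). All names and the synthetic data are hypothetical.

```python
import math
import numpy as np

def last_coef_p_value(X, y):
    """Two-sided p-value of the last column's OLS coefficient (normal approximation)."""
    n, k = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / (n - k)             # residual variance
    cov = sigma2 * np.linalg.pinv(X.T @ X)       # covariance of the coefficients
    t = beta[-1] / math.sqrt(cov[-1, -1])
    return math.erfc(abs(t) / math.sqrt(2))

def forward_selection(X, y, threshold=0.05):
    """Grow the model one column at a time; column 0 is the intercept."""
    selected = [0]
    candidates = list(range(1, X.shape[1]))
    while candidates:
        p_values = [last_coef_p_value(X[:, selected + [c]], y) for c in candidates]
        best = int(np.argmin(p_values))
        if p_values[best] >= threshold:
            break                                # no remaining candidate is significant
        selected.append(candidates.pop(best))
    return selected

rng = np.random.default_rng(0)
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=(n, 4))])
# only columns 1 and 2 influence the target; columns 3 and 4 are pure noise
y = 3.0 + 2.0 * X[:, 1] - 1.5 * X[:, 2] + rng.normal(scale=0.5, size=n)

print(forward_selection(X, y))   # indices in the order they were added
```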


### LinearRegression via cross-validation (Forward Selection)

In [41]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)

# Predicting the Test set results
y_pred = regressor.predict(X_test)

r2_scores = cross_val_score(regressor, X_train, y_train, scoring='r2', cv=3)
print('Cross-validation score for r^2={}'.format(r2_scores))

Cross-validation score for r^2=[0.78110129 0.91666683 0.89260419]


<b>R-squared of Linear Regression</b>


In [42]:
from sklearn.metrics import r2_score
r2_score(y_test, y_pred) 

0.6605208927634301

<b>MSE of Linear Regression</b>


In [43]:
from sklearn.metrics import mean_squared_error
mean_squared_error(y_test, y_pred)

2344391780.489629

<b>MAE of Linear Regression</b>


In [44]:
from sklearn.metrics import mean_absolute_error
mean_absolute_error(y_test, y_pred)

19278.979224686384

<b>Adding results to a table for summarization in the end</b>


In [45]:
model_name.append("Forward/MLR")
mae.append(mean_absolute_error(y_test, y_pred))
r2.append(r2_score(y_test, y_pred))
mse.append(mean_squared_error(y_test, y_pred))

### SVR (RBF kernel) (Forward Selection)

<b>Train test split and Feature Scaling</b>


In [46]:
from sklearn.svm import SVR
from sklearn.model_selection  import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)
sc_y = StandardScaler()
y_train = sc_y.fit_transform(y_train.reshape(-1, 1)) #reshape(-1, 1) is needed because y is a single column
y_test = sc_y.fit_transform(y_test.reshape(-1, 1)) #note: this re-fits the scaler, so y_test is standardized with its own mean and std

<b>Using GridSearch to find the best combination of C and gamma</b>

In [47]:
#parameters
Cs = [0.0001, 0.001, 0.01, 0.1, 1, 10]
gammas = [0.0001, 0.001, 0.01, 0.1, 1, 2] 
param_grid = dict(gamma=gammas, C=Cs)

#model
from sklearn.model_selection import GridSearchCV
svr = SVR(kernel='rbf')
grid_search = GridSearchCV(svr, param_grid)

#fit best combination of parameters
grid_search.fit(X_train, y_train.ravel()) #ravel flattens the (n, 1) target back to a 1-D array

y_pred = grid_search.predict(X_test)

In [48]:
print('Grid best parameter (max. accuracy): ', grid_search.best_params_)

Grid best parameter (max. accuracy):  {'C': 10, 'gamma': 0.001}


In [49]:
print('Grid best score (accuracy): ', grid_search.best_score_) #mean cross-validated score (R^2 for regressors) on the training data

Grid best score (accuracy):  0.9012570709150719


<b>R-squared of SVR</b>

In [50]:
from sklearn.metrics import r2_score
r2_score(y_test, y_pred) 

0.759773373113279

<b>MSE of SVR</b>

In [51]:
from sklearn.metrics import mean_squared_error
mean_squared_error(y_test, y_pred)

0.24022662688672103

<b>MAE of SVR</b>

In [52]:
from sklearn.metrics import mean_absolute_error
mean_absolute_error(y_test, y_pred)

0.21734930595806085

<b>Adding results to a table for summarization in the end</b>

In [53]:
model_name.append("Forward/SVR")
mae.append(mean_absolute_error(y_test, y_pred))
r2.append(r2_score(y_test, y_pred))
mse.append(mean_squared_error(y_test, y_pred))

### Decision Tree with cross-validation and GridSearch (Forward Selection)

<b>Train test split and Feature Scaling</b>

In [54]:
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection  import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)
sc_y = StandardScaler()
y_train = sc_y.fit_transform(y_train.reshape(-1, 1)) #reshape(-1, 1) is needed because y is a single column
y_test = sc_y.fit_transform(y_test.reshape(-1, 1)) #note: this re-fits the scaler, so y_test is standardized with its own mean and std

<b>Using GridSearch to find the best combination of parameters</b>

In [55]:
max_depth = np.linspace(1, 40, 40, endpoint=True)

param_grid = dict(max_depth=max_depth)

#model
from sklearn.model_selection import GridSearchCV
dec_tree = DecisionTreeRegressor()
grid_search = GridSearchCV(dec_tree, param_grid)

#fit best combination of parameters
grid_search.fit(X_train, y_train.ravel()) #ravel flattens the (n, 1) target back to a 1-D array

y_pred = grid_search.predict(X_test)

In [56]:
print('Grid best parameter (max. accuracy): ', grid_search.best_params_)

Grid best parameter (max. accuracy):  {'max_depth': 8.0}


In [57]:
print('Grid best score (accuracy): ', grid_search.best_score_) #mean cross-validated score (R^2 for regressors) on the training data

Grid best score (accuracy):  0.7363753120652549


<b>R-squared of Decision Tree</b>

In [58]:
from sklearn.metrics import r2_score
r2_score(y_test, y_pred) 

0.7534796402230192

<b>MSE of Decision Tree</b>

In [59]:
from sklearn.metrics import mean_squared_error
mean_squared_error(y_test, y_pred)

0.24652035977698084

<b>MAE of Decision Tree</b>


In [60]:
from sklearn.metrics import mean_absolute_error
mean_absolute_error(y_test, y_pred)

0.33652024310605166

<b>Adding results to a table for summarization in the end</b>

In [61]:
model_name.append("Forward/Decision Tree")
mae.append(mean_absolute_error(y_test, y_pred))
r2.append(r2_score(y_test, y_pred))
mse.append(mean_squared_error(y_test, y_pred))

### Random Forest with GridSearch (Forward Selection)

<b>Train test split and Feature Scaling</b>

In [62]:
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection  import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)
sc_y = StandardScaler()
y_train = sc_y.fit_transform(y_train.reshape(-1, 1)) #reshape(-1, 1) is needed because y is a single column
y_test = sc_y.fit_transform(y_test.reshape(-1, 1)) #note: this re-fits the scaler, so y_test is standardized with its own mean and std

<b>Using GridSearch to find the best combination of parameters</b>

In [63]:
from sklearn.ensemble import RandomForestRegressor

max_depth = np.linspace(1, 40, 40, endpoint=True)
n_estimators = [5,10,15,20,30]

param_grid = dict(max_depth=max_depth, n_estimators = n_estimators)

#model
from sklearn.model_selection import GridSearchCV
forest = RandomForestRegressor()
grid_search = GridSearchCV(forest, param_grid)

#fit best combination of parameters
grid_search.fit(X_train, y_train.ravel())

y_pred = grid_search.predict(X_test)

In [64]:
print('Grid best parameter (max. accuracy): ', grid_search.best_params_)

Grid best parameter (max. accuracy):  {'max_depth': 20.0, 'n_estimators': 30}


In [65]:
print('Grid best score (accuracy): ', grid_search.best_score_) #mean cross-validated score (R^2 for regressors) on the training data

Grid best score (accuracy):  0.8610798031811986


<b>R-squared of Random Forest</b>

In [66]:
from sklearn.metrics import r2_score
r2_score(y_test, y_pred) 

0.8449653911665643

<b>MSE of Random Forest</b>

In [67]:
from sklearn.metrics import mean_squared_error
mean_squared_error(y_test, y_pred)

0.15503460883343567

<b>MAE of Random Forest</b>

In [68]:
from sklearn.metrics import mean_absolute_error
mean_absolute_error(y_test, y_pred)

0.23708490356253453

<b>Adding results to a table for summarization in the end</b>

In [69]:
model_name.append("Forward/Random Forest")
mae.append(mean_absolute_error(y_test, y_pred))
r2.append(r2_score(y_test, y_pred))
mse.append(mean_squared_error(y_test, y_pred))

In [75]:
d = {'model_name': model_name, 'mse': mse, 'r2': r2,
'mae': mae}
df = pd.DataFrame(data=d)
df.round(3)

| | model_name | mse | r2 | mae |
|---|---|---|---|---|
| 0 | Backward/MLR | 2963061000.0 | 0.571 | 19686.994 |
| 1 | Backward/SVR | 0.232 | 0.768 | 0.215 |
| 2 | Backward/Decision Tree | 0.235 | 0.765 | 0.313 |
| 3 | Backward/Random Forest | 0.174 | 0.826 | 0.24 |
| 4 | Forward/MLR | 2344392000.0 | 0.661 | 19278.979 |
| 5 | Forward/SVR | 0.24 | 0.76 | 0.217 |
| 6 | Forward/Decision Tree | 0.247 | 0.753 | 0.337 |
| 7 | Forward/Random Forest | 0.155 | 0.845 | 0.237 |


### Summary

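In summary, **Random Forest** achieved the best R-squared of all models (0.845 with Forward Selection, 0.826 with Backward Elimination) and the lowest MSE among the models evaluated on a standardized target. As for the feature selection approach, **Forward Selection** did better for Random Forest and linear regression, but slightly worse for SVR and Decision Tree. The worst model is linear regression. Note that the MSE and MAE of linear regression are in original price units, whereas the other models were evaluated on standardized prices, so their MSE and MAE values cannot be compared with those of linear regression.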