# 4. Fine tuning our model
Spot checking algorithms helped us identify a candidate to model our issue. In our current projet, it's a random forest algorithm. We will now try to fine tune this type of model.

What method is best suited? This article on __[Hyperparameter Tuning the Random Forest in Python](https://towardsdatascience.com/hyperparameter-tuning-the-random-forest-in-python-using-scikit-learn-28d2aa77dd74)__ provides a good methodology that we follow here, as it allows for:
1. first quick browsing hyperparameter range across a Random Hyperparameter Grid. This helps narrow down our research for the best hyperparameters. 
2. then focus on a smaller grid where all combinations of the specified hyperparameters will be tested.

## Data preprocessing

In [1]:
# import libraries
import pandas as pd
import numpy as np

In [2]:
# load data
path = " "
df = pd.read_excel(path+"Real estate valuation data set.xlsx")
df.info()

df.set_index('No', inplace = True)

# Define the variables
X = df.drop(['Y house price of unit area','X6 longitude'], axis =1)
y = df['Y house price of unit area'].values

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 414 entries, 0 to 413
Data columns (total 8 columns):
No                                        414 non-null int64
X1 transaction date                       414 non-null float64
X2 house age                              414 non-null float64
X3 distance to the nearest MRT station    414 non-null float64
X4 number of convenience stores           414 non-null int64
X5 latitude                               414 non-null float64
X6 longitude                              414 non-null float64
Y house price of unit area                414 non-null float64
dtypes: float64(6), int64(2)
memory usage: 26.0 KB


In [3]:
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 0)

  _nan_object_mask = _nan_object_array != _nan_object_array


In [4]:
# Normalise and scale the numerical features
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
ss_X = StandardScaler()
X_train = ss_X.fit_transform(X_train)
mms_X = MinMaxScaler(feature_range=(0, 1))
X_train = mms_X.fit_transform(X_train)

  return self.partial_fit(X, y)
  return self.fit(X, **fit_params).transform(X)


In [5]:
# Normalise and scale the numerical features - X_test
X_test = ss_X.transform(X_test)
X_test = mms_X.transform(X_test)

  from ipykernel import kernelapp as app


## Baseline
First, we look at the default parameters of our random forest model and its performance. We will use it as our baseline.

In [6]:
from sklearn.ensemble import RandomForestRegressor
regressor = RandomForestRegressor()

regressor.fit(X_train, y_train)

from pprint import pprint
# Look at parameters used by our current forest
print('Parameters currently in use:\n')
pprint(regressor.get_params())

Parameters currently in use:

{'bootstrap': True,
 'criterion': 'mse',
 'max_depth': None,
 'max_features': 'auto',
 'max_leaf_nodes': None,
 'min_impurity_decrease': 0.0,
 'min_impurity_split': None,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 10,
 'n_jobs': None,
 'oob_score': False,
 'random_state': None,
 'verbose': 0,
 'warm_start': False}




In [7]:
# Predicting the Test set results
y_pred = regressor.predict(X_test)

In [8]:
# Define a function for model evaluation using cross validation
def evaluate_model_cross_validation(name, model, X_train, y_train, folds = 10):

    from sklearn.model_selection import cross_val_score
 
    # Cross Validation Regression MAE
    metric='neg_mean_absolute_error'
    scores = cross_val_score(regressor, X_train, y_train, scoring=metric, cv=folds, n_jobs=-1)
    mean_score, std_score = np.mean(scores), np.std(scores)
    print('>%s - training - MAE: %.3f (+/-%.3f)' % (name, mean_score, std_score))
    
    
    # Cross Validation Regression MSE
    metric='neg_mean_squared_error'
    scores = cross_val_score(regressor, X_train, y_train, scoring=metric, cv=folds, n_jobs=-1)
    mean_score, std_score = np.mean(scores), np.std(scores)
    print('>%s - training - MSE: %.3f (+/-%.3f)' % (name, mean_score, std_score))
    
    # Cross Validation Regression R^2
    metric='r2'
    scores = cross_val_score(regressor, X_train, y_train, scoring=metric, cv=folds, n_jobs=-1)
    mean_score, std_score = np.mean(scores), np.std(scores)
    print('>%s - training - R^2: %.3f (+/-%.3f)' % (name, mean_score, std_score))

In [9]:
# Define a function for model evaluation using a test set
def evaluate_model_test_set(name, model, y_test, y_predicted):
    from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
    print('>%s - test - MAE: %.3f' % (name, mean_absolute_error(y_test, y_predicted)))
    print('>%s - test - MSE: %.3f' % (name, mean_squared_error(y_test, y_predicted)))
    print('>%s - test - R^2: %.3f' % (name, r2_score(y_test, y_predicted)))

In [10]:
# Evaluate the baseline model
evaluate_model_cross_validation("RF_baseline", regressor, X_train, y_train, 10)

>RF_baseline - training - MAE: -5.030 (+/-0.939)
>RF_baseline - training - MSE: -64.689 (+/-59.342)
>RF_baseline - training - R^2: 0.655 (+/-0.161)


In [11]:
evaluate_model_test_set("RF_baseline", regressor, y_test, y_pred)

>RF_baseline - test - MAE: 5.207
>RF_baseline - test - MSE: 64.513
>RF_baseline - test - R^2: 0.621


## Random Search Cross Validation
We will first perform random search training to try out a wide range of values and see what works!  

In [11]:
# Define the random grid

# Number of trees in random forest
n_estimators = [int(x) for x in np.linspace(start = 200, stop = 2000, num = 10)]
# Number of features to consider at every split
max_features = ['auto', 'sqrt']
# Maximum number of levels in tree
max_depth = [int(x) for x in np.linspace(10, 110, num = 11)]
max_depth.append(None)
# Minimum number of samples required to split a node
min_samples_split = [2, 5, 10]
# Minimum number of samples required at each leaf node
min_samples_leaf = [1, 2, 4]
# Method of selecting samples for training each tree
bootstrap = [True, False]

In [12]:
# Create the random grid
random_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'bootstrap': bootstrap}
pprint(random_grid)

{'bootstrap': [True, False],
 'max_depth': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, None],
 'max_features': ['auto', 'sqrt'],
 'min_samples_leaf': [1, 2, 4],
 'min_samples_split': [2, 5, 10],
 'n_estimators': [200, 400, 600, 800, 1000, 1200, 1400, 1600, 1800, 2000]}


As explained in the article cited above: "The most important arguments in RandomizedSearchCV are <b>n_iter</b>, which controls 
the number of different combinations to try, and <b>cv</b> which is the number of folds
 to use for cross validation (we use 100 and 3 respectively). More iterations will
 cover a wider search space and more cv folds reduces the chances of overfitting,
 but raising each will increase the run time. Machine learning is a field of trade-offs,
 and performance vs time is one of the most fundamental."

In [13]:
# First create the base model to tune 
regressor = RandomForestRegressor()
# Random search of parameters, using 3 fold cross validation, 
# search across 100 different combinations, and use all available cores
from sklearn.model_selection import RandomizedSearchCV
rf_random = RandomizedSearchCV(estimator = regressor,
                               param_distributions = random_grid,
                               n_iter = 100, cv = 3, verbose=2, random_state=42,
                               n_jobs = -1)

In [14]:
# Fit the random search model (This might take a while)
rf_random.fit(X_train, y_train)

Fitting 3 folds for each of 100 candidates, totalling 300 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done  17 tasks      | elapsed:   11.9s
[Parallel(n_jobs=-1)]: Done 138 tasks      | elapsed:  1.3min
[Parallel(n_jobs=-1)]: Done 300 out of 300 | elapsed:  2.7min finished


RandomizedSearchCV(cv=3, error_score='raise-deprecating',
          estimator=RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators='warn', n_jobs=None,
           oob_score=False, random_state=None, verbose=0, warm_start=False),
          fit_params=None, iid='warn', n_iter=100, n_jobs=-1,
          param_distributions={'n_estimators': [200, 400, 600, 800, 1000, 1200, 1400, 1600, 1800, 2000], 'max_features': ['auto', 'sqrt'], 'bootstrap': [True, False], 'max_depth': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, None], 'min_samples_leaf': [1, 2, 4], 'min_samples_split': [2, 5, 10]},
          pre_dispatch='2*n_jobs', random_state=42, refit=True,
          return_train_score='warn', scoring=None, verbose=2)

In [15]:
# print the best parameters
rf_random.best_params_

{'bootstrap': True,
 'max_depth': 110,
 'max_features': 'sqrt',
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'n_estimators': 1000}

In [16]:
# Fitting Random forest Model to the dataset using the best parameters after random search
rf_r = RandomForestRegressor(bootstrap = True,
                                  max_depth = 110,
                                  max_features ='sqrt',
                                  min_samples_leaf = 1,
                                  min_samples_split = 2,
                                  n_estimators = 1000)

rf_r.fit(X_train, y_train)

# Predicting the Test set results
y_pred = rf_r.predict(X_test)

In [17]:
# evaluate model after random grid search
evaluate_model_cross_validation("RF_random", rf_r, X_train, y_train, 10)

>RF_random - training - MAE: -5.280 (+/-1.113)
>RF_random - training - MSE: -65.284 (+/-55.941)
>RF_random - training - R^2: 0.704 (+/-0.137)


In [18]:
evaluate_model_test_set("RF_random", rf_r, y_test, y_pred)

>RF_random - test - MAE: 4.563
>RF_random - test - MSE: 51.341
>RF_random - test - R^2: 0.699


Our metrics improved from our baseline model. Let's check if we can further improve them by fine tuning our model using a systematic grid search.
## Grid Search with Cross Validation

In [23]:
# Applying Grid Search to find the best model and the best parameters
from sklearn.model_selection import GridSearchCV

# Number of trees in random forest
n_estimators = [800, 900, 1000, 1100, 1200]
# Number of features to consider at every split
max_features = ['sqrt']
# Maximum number of levels in tree
max_depth = [100, 110, 120]
# Minimum number of samples required to split a node
min_samples_split = [2, 3]
# Minimum number of samples required at each leaf node
min_samples_leaf = [1, 2]
# Method of selecting samples for training each tree
bootstrap = [True]
# Create the random grid
parameters = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'bootstrap': bootstrap}

In [24]:
# First create the base model to tune 
regressor = RandomForestRegressor()

In [25]:
grid_search = GridSearchCV(estimator = regressor,
                           param_grid = parameters,
                           cv = 3, n_jobs = -1, verbose = 2)

We have 5 x 1 x 3 x 2 x 2 x 2 x 1 = 60 features and 3 folds, which makes 180 combinations of settings.

In [26]:
# Fit the grid search model (This might take a while)
grid_search = grid_search.fit(X_train, y_train)

Fitting 3 folds for each of 60 candidates, totalling 180 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done  17 tasks      | elapsed:    6.2s
[Parallel(n_jobs=-1)]: Done 138 tasks      | elapsed:   36.4s
[Parallel(n_jobs=-1)]: Done 180 out of 180 | elapsed:   47.1s finished


In [27]:
# print the best parameters
grid_search.best_params_

{'bootstrap': True,
 'max_depth': 120,
 'max_features': 'sqrt',
 'min_samples_leaf': 1,
 'min_samples_split': 3,
 'n_estimators': 900}

In [28]:
# Fitting Random forest Model to the dataset using the best parameters
rf_g = RandomForestRegressor(bootstrap = True,
                                  max_depth = 120,
                                  max_features ='sqrt',
                                  min_samples_leaf = 1,
                                  min_samples_split = 3,
                                  n_estimators = 900)

rf_g.fit(X_train, y_train)


# Predicting the Test set results
y_pred = rf_g.predict(X_test)

In [29]:
# evaluate model after grid search
evaluate_model_cross_validation("RF_grid_search", rf_g, X_train, y_train, 10)

>RF_grid_search - training - MAE: -4.994 (+/-1.043)
>RF_grid_search - training - MSE: -65.844 (+/-61.745)
>RF_grid_search - training - R^2: 0.691 (+/-0.107)


In [30]:
evaluate_model_test_set("RF_grid_search", rf_g, y_test, y_pred)

>RF_grid_search - test - MAE: 4.587
>RF_grid_search - test - MSE: 51.736
>RF_grid_search - test - R^2: 0.696


The new random forest model after a systematic grid search does not yield better performances than the random forest model selected after a randomized grid search. 
# Conclusion
We have found a model that help us predict the house price of unit area in Xindian district using a few features. Through this project, we have seen how to:
* Conduct an exploratory data analysis using Python and Tableau.
* Build summary maps using Graphic Information Processing.
* Test rapidly different regression models using spot checking.
* Fine tune our regression candidate using Random search and then Grid search.

Further analysis could include:
* Excluding some of the properties identified as outliers in the exploratory phase.
* Run a few additional randomized grid search to identify hyper parameters that increase the performances of the random forest model.
* Perform some clustering analysis to assess if we can identify geographical categories, as those identified with Graphic Information Processing.
* Testing the model with more recent data to check how the market has evolved and the need for building a new model.