
# Random Forest

Regression: Boston House Prices

Classification: MNIST handwritten digits

## Regression: Boston House Prices
Fit a random forest regressor to predict Boston house prices and compare its error with the k-NN regressor. Random forests are generally a better choice than k-NN for high-dimensional feature spaces, where distances between neighbours become less informative.

In [25]:
from sklearn.ensemble import RandomForestRegressor 
import matplotlib.pyplot as plt
import sklearn
from sklearn.model_selection import train_test_split
from sklearn import datasets
from sklearn.metrics import accuracy_score, mean_squared_error
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler
# note: load_boston was removed in scikit-learn 1.2, so this cell needs an older version
from sklearn.datasets import load_boston
import numpy as np

In [13]:
X, y = load_boston(return_X_y=True)

In [14]:
# no need to scale data for random forest (no StandardScaler() step)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
rfr = RandomForestRegressor(n_estimators = 50, random_state = 42) # 50 trees
rfr.fit(X_train, y_train) 
y_pred = rfr.predict(X_test)

In [15]:
mean_squared_error(y_test, y_pred) # 50 trees, without cross-validation

10.260187137724552

Lower MSE than the k-NN regressor.
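For reference, the metric used above is the mean of the squared differences between true and predicted values. The following is a minimal pure-Python sketch of that definition; the helper name `mse` and the toy numbers are illustrative and not part of the notebook. It also takes the square root of the reported score to express the error in the units of the target.

```python
import math

def mse(y_true, y_pred):
    """Mean squared error: average of squared residuals (illustrative helper)."""
    if len(y_true) != len(y_pred):
        raise ValueError("inputs must have the same length")
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

# toy values, not taken from the Boston data
toy_true = [1.0, 2.0, 3.0]
toy_pred = [1.0, 2.0, 5.0]
toy_mse = mse(toy_true, toy_pred)  # (0 + 0 + 4) / 3

# RMSE of the 50-tree forest, using the MSE reported above
rmse_50_trees = math.sqrt(10.260187137724552)
print(toy_mse, round(rmse_50_trees, 2))
```

Because the squared error is symmetric, swapping the two arguments does not change the result. The RMSE of roughly 3.2 is in the same units as the target, which for this dataset is the median house value in thousands of dollars.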

## Cross validation of Random Forests

First, examine the feature importances of the fitted forest.

In [16]:
importances = rfr.feature_importances_
indices = np.argsort(importances)[::-1] 
for f in range(X.shape[1]): 
    print("%d. feature %d (%f)" % (f + 1, indices[f], importances[indices[f]])) 

1. feature 5 (0.447439)
2. feature 12 (0.378119)
3. feature 7 (0.053390)
4. feature 0 (0.031685)
5. feature 10 (0.017446)
6. feature 6 (0.017015)
7. feature 11 (0.012885)
8. feature 4 (0.012477)
9. feature 9 (0.012078)
10. feature 2 (0.007195)
11. feature 8 (0.005473)
12. feature 1 (0.003723)
13. feature 3 (0.001075)
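The ranking is easier to interpret with column names instead of indices. The sketch below maps the top-ranked indices to names, assuming the standard column order of the Boston data (in older scikit-learn versions the same names are available as `load_boston().feature_names`). The importances are copied from the output above.

```python
# Column names of the Boston data in their standard order (index 0 to 12)
feature_names = ["CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM", "AGE",
                 "DIS", "RAD", "TAX", "PTRATIO", "B", "LSTAT"]

# (feature index, importance) pairs copied from the top of the printed ranking
top_ranked = [(5, 0.447439), (12, 0.378119), (7, 0.053390), (0, 0.031685)]

named = [(feature_names[i], imp) for i, imp in top_ranked]
for rank, (name, imp) in enumerate(named, start=1):
    print(f"{rank}. {name} ({imp:.6f})")

# share of total importance carried by the two strongest features
top_two_share = sum(imp for _, imp in named[:2])
```

Under that assumption, the two dominant features are RM (average number of rooms) and LSTAT (share of lower-status population), which together account for roughly 83% of the total importance.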


Implement cross-validation using GridSearchCV. First, create a dictionary of the hyperparameter values to search over.

In [17]:
from sklearn.model_selection import GridSearchCV

parameters =   {'n_estimators': [5,10,15,20,50,100],
                'max_features': ['sqrt', 'auto', 'log2'],
                'max_depth': [10, 30, 50, None],
                'bootstrap': [True, False]}

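Specify the model as usual, then run the grid search. The grid contains $6\times3\times4\times2 = 144$ hyperparameter combinations, and with 5-fold cross-validation (cv=5) each combination is fitted 5 times, giving $144\times5 = 720$ random forest fits, plus one final refit of the best combination on the full training set.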
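The cost of a grid search can be checked before running it by enumerating the Cartesian product of the value lists. This is a small sketch using only the standard library; the variable names `cv_folds`, `candidates`, `n_candidates` and `n_fits` are illustrative.

```python
from itertools import product

# same grid as the parameters dictionary above
parameters = {'n_estimators': [5, 10, 15, 20, 50, 100],
              'max_features': ['sqrt', 'auto', 'log2'],
              'max_depth': [10, 30, 50, None],
              'bootstrap': [True, False]}

cv_folds = 5  # matches cv=5 in the grid search call

# one candidate per element of the Cartesian product of all value lists
candidates = list(product(*parameters.values()))
n_candidates = len(candidates)
n_fits = n_candidates * cv_folds  # excludes the final refit on the full training set

print(n_candidates, n_fits)
```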

In [18]:
gridsearch = GridSearchCV(RandomForestRegressor(random_state=42), parameters,
                          scoring='neg_mean_squared_error', cv=5, error_score=0)
gridsearch.fit(X_train, y_train)

best = gridsearch.best_params_

# rebuild a forest with the best hyperparameters found by the grid search
rfr = RandomForestRegressor(n_estimators=best['n_estimators'],
                            max_features=best['max_features'],
                            max_depth=best['max_depth'],
                            bootstrap=best['bootstrap'],
                            random_state=42)

rfr.fit(X_train, y_train)
y_pred = rfr.predict(X_test)
mean_squared_error(y_test, y_pred)



9.808918832335316

In [19]:
print(best)

{'bootstrap': False, 'max_depth': 30, 'max_features': 'sqrt', 'n_estimators': 100}


With the tuned random forest (100 trees), the test MSE on Boston house prices drops below 10 (9.81, compared with 10.26 for the untuned 50-tree forest).
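As a side note, GridSearchCV refits the best combination on the full training set by default (refit=True), so the fitted model is also available as `best_estimator_` without rebuilding it by hand. The sketch below illustrates this on synthetic data from `make_regression` with a deliberately small grid; the data, the grid and all variable names are stand-ins, not the notebook's setup.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split

# synthetic stand-in data (assumption: not the Boston data)
X_demo, y_demo = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.33, random_state=42)

# deliberately small grid so the sketch runs quickly
small_grid = {'n_estimators': [10, 20], 'max_depth': [5, None]}
search = GridSearchCV(RandomForestRegressor(random_state=42), small_grid,
                      scoring='neg_mean_squared_error', cv=3)
search.fit(X_tr, y_tr)

# with the default refit=True, best_estimator_ is already fitted on all of X_tr
best_model = search.best_estimator_
test_mse = mean_squared_error(y_te, best_model.predict(X_te))

# best_score_ is the mean cross-validated *negative* MSE, so flip the sign
cv_mse = -search.best_score_
print(search.best_params_, cv_mse, test_mse)
```

Comparing `cv_mse` with `test_mse` gives a quick check on whether the cross-validated estimate is in line with the held-out error.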

## Classification: MNIST dataset

Train a random forest to classify handwritten digits 0-9.

In [34]:
# note: fetch_mldata was removed in scikit-learn 0.22, so this cell needs an older version
from sklearn.datasets import fetch_mldata
from sklearn.ensemble import RandomForestClassifier 
from sklearn.metrics import classification_report

mnist = fetch_mldata('MNIST original') 
X = mnist.data 
y = mnist.target



In [35]:
print(X.shape)
print(y.shape)

(70000, 784)
(70000,)


In [36]:
# split the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
print(X_train.shape)
print(X_test.shape)

(46900, 784)
(23100, 784)


In [37]:
# train the random forest
rf = RandomForestClassifier(n_estimators=100)
rf.fit(X_train, y_train)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=None, max_features='auto', max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)

In [38]:
y_pred = rf.predict(X_test)
print("Classification Report")
print(classification_report(y_test, y_pred))

Classification Report
              precision    recall  f1-score   support

         0.0       0.98      0.98      0.98      2229
         1.0       0.99      0.98      0.99      2543
         2.0       0.96      0.97      0.97      2321
         3.0       0.96      0.96      0.96      2319
         4.0       0.97      0.98      0.97      2253
         5.0       0.97      0.96      0.97      2167
         6.0       0.98      0.98      0.98      2323
         7.0       0.97      0.97      0.97      2426
         8.0       0.95      0.96      0.96      2273
         9.0       0.96      0.95      0.95      2246

    accuracy                           0.97     23100
   macro avg       0.97      0.97      0.97     23100
weighted avg       0.97      0.97      0.97     23100



The random forest reaches 97% accuracy on the held-out test set.
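The overall accuracy in the report is the support-weighted average of the per-class recalls, which can be checked by hand. The sketch below uses the rounded recall and support values copied from the report above, so it only reproduces the accuracy approximately; the variable names are illustrative.

```python
# (recall, support) per digit, copied from the classification report above
report = {
    0: (0.98, 2229), 1: (0.98, 2543), 2: (0.97, 2321), 3: (0.96, 2319),
    4: (0.98, 2253), 5: (0.96, 2167), 6: (0.98, 2323), 7: (0.97, 2426),
    8: (0.96, 2273), 9: (0.95, 2246),
}

total = sum(support for _, support in report.values())

# recall * support approximates the number of correctly classified samples per class
approx_correct = sum(recall * support for recall, support in report.values())
approx_accuracy = approx_correct / total

print(total, round(approx_accuracy, 2))
```

The per-class values also show where the errors concentrate: digits 8 and 9 have the lowest f1-scores, while 0, 1 and 6 are classified most reliably.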

In [46]:
# 10-fold cross-validation on the training set
# (cross_val_score clones and fits the model on each fold, so no prior fit is needed)
rf = RandomForestClassifier(n_estimators=100)
score = cross_val_score(rf, X_train, y_train, cv=10, scoring='accuracy')
print(np.mean(score))

0.9656502214998388


The mean 10-fold cross-validated accuracy is about 96.6%, consistent with the 97% measured on the test set.
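Reporting the spread across folds alongside the mean shows how stable the estimate is. The following sketch does this on the small 8x8 digits set bundled with scikit-learn (`load_digits`), used here only as a fast stand-in for MNIST; the number of trees, the fold count and the variable names are assumptions chosen to keep the run short.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# small 8x8 digits set bundled with scikit-learn (assumption: stand-in for MNIST)
X_small, y_small = load_digits(return_X_y=True)

clf = RandomForestClassifier(n_estimators=50, random_state=42)
scores = cross_val_score(clf, X_small, y_small, cv=5, scoring='accuracy')

# the spread across folds indicates how stable the accuracy estimate is
mean_acc = np.mean(scores)
std_acc = np.std(scores)
print(f"accuracy: {mean_acc:.3f} +/- {std_acc:.3f}")
```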