### The Random Forest algorithm

A random forest is an ensemble of decision trees. Each tree in the ensemble is built from a sample drawn with replacement (a bootstrap sample) from the training set. While growing a tree, the algorithm computes, for each feature under consideration, the locally optimal feature/split combination. To make a prediction, the forest collects the prediction of every tree and combines them, by voting for classification or by averaging for regression. It can therefore be used for both classification and regression tasks.
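
The bootstrap-and-vote idea can be sketched by hand with individual decision trees. This is only an illustration under assumed settings: the toy data, the 25 trees, and the names `trees`, `all_preds` and `y_vote` are not part of the original example. Note also that scikit-learn's own RandomForestClassifier combines trees by averaging their predicted class probabilities rather than by counting hard votes.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Toy two-class data; sizes and seeds are arbitrary choices for illustration.
X, y = make_classification(n_samples=300, n_features=8, n_informative=4,
                           random_state=0)
rng = np.random.default_rng(0)

trees = []
for i in range(25):
    # Bootstrap sample: draw n row indices with replacement.
    idx = rng.integers(0, len(X), size=len(X))
    # max_features="sqrt": each split considers a random subset of features.
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=i)
    trees.append(tree.fit(X[idx], y[idx]))

# One row of predictions per tree, then the majority class for each sample.
all_preds = np.stack([t.predict(X) for t in trees])
votes = np.apply_along_axis(np.bincount, 0, all_preds, minlength=2)
y_vote = votes.argmax(axis=0)

train_accuracy = (y_vote == y).mean()
print(train_accuracy)
```

The accuracy printed here is measured on the training data, so it only shows that the voting mechanism works, not how well the ensemble generalizes.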

#### **Classification with Random Forest**

To create a random forest classifier, Scikit-learn provides sklearn.ensemble.RandomForestClassifier. The main parameters to adjust when building a random forest classifier are ‘max_features’ and ‘n_estimators’.
Here, ‘max_features’ is the size of the random subset of features to consider when splitting a node. If we set this parameter to None, all features are considered rather than a random subset. ‘n_estimators’, on the other hand, is the number of trees in the forest. A larger number of trees generally gives better results, but it also takes longer to compute, and the improvement levels off beyond a certain number of trees.
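
To get a feel for these two parameters, one option is to cross-validate a small grid of values. The sketch below assumes a synthetic dataset and an arbitrary grid (1, 10 and 50 trees; "sqrt" versus None); the scores depend on the data and the scikit-learn version, so none are quoted here.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic data, chosen only for illustration.
X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                           random_state=0)

mean_scores = {}
for n_trees in (1, 10, 50):
    for max_feat in ("sqrt", None):  # None means: consider all features
        clf = RandomForestClassifier(n_estimators=n_trees,
                                     max_features=max_feat,
                                     random_state=0)
        mean_scores[(n_trees, max_feat)] = cross_val_score(clf, X, y, cv=3).mean()

for (n_trees, max_feat), score in mean_scores.items():
    print(f"n_estimators={n_trees:>3}  max_features={max_feat!s:>4}  "
          f"accuracy={score:.3f}")
```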

**Implementation example**

In the following example, we build a random forest classifier with sklearn.ensemble.RandomForestClassifier and check its accuracy with the cross_val_score function.

In [1]:
from sklearn.model_selection import cross_val_score
from sklearn.datasets import make_blobs
from sklearn.ensemble import RandomForestClassifier
X, Y = make_blobs(n_samples=10000, n_features=10, centers=100,random_state=0)

In [7]:
RFclf = RandomForestClassifier(n_estimators=10,max_depth=None,min_samples_split=2, random_state=0)
scores = cross_val_score(RFclf, X, Y, cv=5)
scores.mean()

0.9997
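
For readers wondering what cross_val_score does with cv=5, the following sketch repeats the folds by hand. It assumes a smaller dataset than the one above so that it runs quickly, and it relies on the fact that, for a classifier and an integer cv, scikit-learn uses stratified folds without shuffling; the variable names are illustrative.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_blobs
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Smaller dataset than in the example above, to keep the sketch fast.
X, y = make_blobs(n_samples=600, n_features=10, centers=6, random_state=0)
clf = RandomForestClassifier(n_estimators=10, random_state=0)

manual_scores = []
for train_idx, test_idx in StratifiedKFold(n_splits=5).split(X, y):
    # Fit a fresh copy on four folds, then score it on the held-out fold.
    model = clone(clf).fit(X[train_idx], y[train_idx])
    manual_scores.append(model.score(X[test_idx], y[test_idx]))

library_scores = cross_val_score(clf, X, y, cv=5)
print(np.mean(manual_scores), library_scores.mean())
```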

We can also build a Random Forest classifier on a real dataset. In the following example we use the iris dataset, loaded from the UCI repository with pandas, and compute the confusion matrix, classification report and accuracy score.

In [8]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

path = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
headernames = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'Class']
dataset = pd.read_csv(path, names=headernames)
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, 4].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30)
RFclf = RandomForestClassifier(n_estimators=50)
RFclf.fit(X_train, y_train)
y_pred = RFclf.predict(X_test)

In [9]:
result = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
print(result)
result1 = classification_report(y_test, y_pred)

Confusion Matrix:
[[12  0  0]
 [ 0 18  1]
 [ 0  1 13]]


In [10]:
print("Classification Report:")
print(result1)

Classification Report:
                 precision    recall  f1-score   support

    Iris-setosa       1.00      1.00      1.00        12
Iris-versicolor       0.95      0.95      0.95        19
 Iris-virginica       0.93      0.93      0.93        14

       accuracy                           0.96        45
      macro avg       0.96      0.96      0.96        45
   weighted avg       0.96      0.96      0.96        45



In [11]:
result2 = accuracy_score(y_test,y_pred)
print("Accuracy:",result2)

Accuracy: 0.9555555555555556
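
The accuracy, precision and recall reported above can be recomputed directly from the confusion matrix, whose rows are the true classes and whose columns are the predicted classes. The sketch below copies the matrix from the output above; the variable names are illustrative.

```python
import numpy as np

# Confusion matrix copied from the output above (rows: true, columns: predicted).
cm = np.array([[12, 0, 0],
               [0, 18, 1],
               [0, 1, 13]])

accuracy = np.trace(cm) / cm.sum()         # correct predictions / all predictions = 43 / 45
precision = np.diag(cm) / cm.sum(axis=0)   # column sums = samples predicted per class
recall = np.diag(cm) / cm.sum(axis=1)      # row sums = true samples per class (support)

print(accuracy)
print(precision.round(2), recall.round(2))
```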


#### **Regression with Random Forest**

To create a random forest regressor, Scikit-learn provides sklearn.ensemble.RandomForestRegressor. It takes the same main parameters as sklearn.ensemble.RandomForestClassifier.

**Implementation example**

In the following example, we build a random forest regressor with sklearn.ensemble.RandomForestRegressor and predict for new values with the predict() method.

In [12]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.datasets import make_regression
X, y = make_regression(n_features=10, n_informative=2,random_state=0, shuffle=False)
RFregr = RandomForestRegressor(max_depth=10,random_state=0,n_estimators=100)
RFregr.fit(X, y)

Once fitted, we can predict from the regression model as follows:

In [13]:
print(RFregr.predict([[0, 2, 3, 0, 1, 1, 1, 1, 2, 2]]))

[98.47729198]
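
For regression, the forest's prediction is the mean of the individual trees' predictions. The sketch below checks this on the same kind of model as above by reading the fitted trees from the estimators_ attribute; the names `x_new`, `per_tree` and `forest_pred` are illustrative, and no output value is quoted because it can vary with the scikit-learn version.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_features=10, n_informative=2, random_state=0,
                       shuffle=False)
regr = RandomForestRegressor(max_depth=10, random_state=0,
                             n_estimators=100).fit(X, y)

x_new = np.array([[0, 2, 3, 0, 1, 1, 1, 1, 2, 2]], dtype=float)

# Ask every fitted tree for its own prediction, then average them.
per_tree = np.array([tree.predict(x_new)[0] for tree in regr.estimators_])
forest_pred = regr.predict(x_new)[0]

print(per_tree.mean(), forest_pred)
```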


### Extra Trees

Extra trees (extremely randomized trees) add one more level of randomness: for each feature under consideration, a random value is selected for the split instead of the locally optimal one. The benefit of this method is that it reduces the variance of the model a bit more. The disadvantage is that it slightly increases the bias.
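
One way to see the two methods side by side is to cross-validate both on the same data. The sketch below assumes a synthetic dataset and 50 trees per model; which method scores higher depends on the data, so no result is quoted. It also prints the default value of the bootstrap parameter, which differs between the two classes in scikit-learn.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic data, chosen only for illustration.
X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                           random_state=0)

models = {
    "random forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "extra trees": ExtraTreesClassifier(n_estimators=50, random_state=0),
}

results = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)
    results[name] = (scores.mean(), scores.std())
    # By default, random forests bootstrap the rows; extra trees do not.
    print(f"{name}: mean={scores.mean():.3f} std={scores.std():.3f} "
          f"bootstrap={model.bootstrap}")
```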

#### **Extra Trees method**

To create a classifier using the extra-trees method, Scikit-learn provides sklearn.ensemble.ExtraTreesClassifier. It uses the same parameters as sklearn.ensemble.RandomForestClassifier. The only difference is in the way the trees are built, as discussed above.

**Implementation example**

In the following example, we build an extra-trees classifier with sklearn.ensemble.ExtraTreesClassifier and check its accuracy with the cross_val_score function.

In [14]:
from sklearn.model_selection import cross_val_score
from sklearn.datasets import make_blobs
from sklearn.ensemble import ExtraTreesClassifier
X, y = make_blobs(n_samples=10000, n_features=10, centers=100,random_state=0)
ETclf = ExtraTreesClassifier(n_estimators=10,max_depth=None,min_samples_split=10, random_state=0)
scores = cross_val_score(ETclf, X, y, cv=5)
scores.mean()

1.0

We can also build a classifier with the extra-trees method on a real dataset. In the following example we use the Pima Indians diabetes dataset, loaded from a CSV file.

In [8]:
import pandas as pd
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import ExtraTreesClassifier

path = "/content/drive/MyDrive/Colab Notebooks/data/pima-indians-diabetes.csv"
header_names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']

data = pd.read_csv(path, names=header_names)
array = data.values
X = array[:, 0:8]
Y = array[:, 8]

seed = 7
kfold = KFold(n_splits=10, shuffle=True, random_state=seed)
num_trees = 150
max_features = 5

ETclf = ExtraTreesClassifier(n_estimators=num_trees, max_features=max_features)

results = cross_val_score(ETclf, X, Y, cv=kfold)

print(results.mean())



