# Random Forest
## <a href="#I">I How the Random Forest Algorithm Works</a>
### <a href="#I.1">I.1 Advantages of using Random Forest</a>
### <a href="#I.2">I.2 Disadvantages of using Random Forest</a>
## <a href="#II">II Using Random Forest for Regression</a>
### <a href="#II.1">II.1 Problem Definition</a>
### <a href="#II.2">II.2 Solution</a>
## <a href="#III">III Using Random Forest for Classification</a>
### <a href="#III.1">III.1 Problem Definition</a>
### <a href="#III.2">III.2 Solution</a>

# Random Forest
Random forest is a __supervised__ machine learning algorithm based on __ensemble learning__.<br>
__Ensemble learning__ combines different algorithms, or the same algorithm trained multiple times, to form a more powerful prediction model.<br>
The random forest algorithm combines multiple models of the same type, namely __decision trees__, resulting in a forest of trees, hence the name "Random Forest".<br>
The random forest algorithm can be used for both __regression__ and __classification__ tasks.

<a id="I"></a>
## I How the Random Forest Algorithm Works
The basic steps of the random forest algorithm are:

1. Pick N random records from the dataset (in practice a bootstrap sample, drawn with replacement).
2. Build a decision tree based on these N records.
3. Choose the number of trees you want in the forest and repeat steps 1 and 2 for each tree.
4. For a regression problem, each tree in the forest predicts a value for Y (the output) for a new record.
   The final value is the average of the values predicted by all the trees in the forest.<br>
   For a classification problem, each tree in the forest predicts the category to which the new record
   belongs. The new record is then assigned to the category that wins the majority vote.
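
The steps above can be sketched by hand with Scikit-Learn's `DecisionTreeRegressor`. This is a minimal illustration, not how `RandomForestRegressor` is implemented internally; the synthetic data from `make_regression`, the tree count, and the names `trees`, `per_tree`, and `votes` are assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

# Synthetic data stands in for a real dataset (an assumption for illustration).
X, y = make_regression(n_samples=200, n_features=4, noise=10.0, random_state=0)
X_train, y_train = X[:150], y[:150]
X_new = X[150:]

rng = np.random.default_rng(0)
n_trees = 25                 # step 3: number of trees in the forest
n_records = len(X_train)

trees = []
for _ in range(n_trees):
    # Step 1: draw N records at random, with replacement (a bootstrap sample).
    idx = rng.integers(0, n_records, size=n_records)
    # Step 2: grow one decision tree on that sample. max_features limits the
    # features considered at each split, which makes the trees differ more.
    seed = int(rng.integers(0, 2**31 - 1))
    tree = DecisionTreeRegressor(max_features="sqrt", random_state=seed)
    tree.fit(X_train[idx], y_train[idx])
    trees.append(tree)

# Step 4 (regression): average the per-tree predictions for the new records.
per_tree = np.stack([t.predict(X_new) for t in trees])  # shape (n_trees, n_new)
forest_prediction = per_tree.mean(axis=0)

# Step 4 (classification) replaces the mean with a majority vote.
# Hypothetical votes: one row per tree, one column per new record.
votes = np.array([[0, 1, 1],
                  [1, 1, 0],
                  [1, 0, 0]])
majority = np.array([np.bincount(col).argmax() for col in votes.T])
print(majority)
```

Scikit-Learn's `RandomForestClassifier` combines the trees by averaging their predicted class probabilities rather than by counting hard votes, so the majority vote shown here follows the description above rather than that library's exact behaviour.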
   
<a id="I.1"></a>
### I.1 Advantages of using Random Forest

1. The random forest algorithm is less prone to _overfitting_ than a single decision tree. Each tree is trained on a different random subset of the data, so the errors of the individual trees tend to differ, and averaging over many trees reduces the __variance__ of the prediction. (__Bias__, defined in Machine Learning as systematic error caused by faulty assumptions in the model, stays roughly the same as that of a single tree.)<br>
Basically, the random forest algorithm relies on the power of "the crowd"; therefore the overall error of the ensemble is usually lower than that of its individual trees.

2. This algorithm is very stable. Even if a new data point is introduced into the dataset, the overall algorithm is not affected much: the new data may impact one tree, but it is very unlikely to impact all the trees.

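3. The random forest algorithm works well when you have both categorical and numerical features. Note, however, that Scikit-Learn's implementation expects numeric input, so categorical features still need to be encoded there.

4. The random forest algorithm also works well when the data has not been scaled, because tree splits depend only on the ordering of the feature values. Support for missing values depends on the implementation; with older Scikit-Learn versions they must be imputed first.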
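
The point about unscaled data can be checked with a small experiment. The sketch below, on synthetic data (an assumption, as are all parameter values), fits the same forest once on raw features and once on standardized features; since standardizing preserves the ordering of each feature, the two sets of predictions should agree up to floating-point effects.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import StandardScaler

# Synthetic regression data (illustrative only).
X, y = make_regression(n_samples=250, n_features=4, noise=5.0, random_state=0)
X_train, X_test, y_train = X[:200], X[200:], y[:200]

# Fit the scaler on the training data only, as in a normal workflow.
scaler = StandardScaler().fit(X_train)

# Same hyperparameters and seed; only the feature scale differs.
raw = RandomForestRegressor(n_estimators=20, random_state=0)
raw.fit(X_train, y_train)
scaled = RandomForestRegressor(n_estimators=20, random_state=0)
scaled.fit(scaler.transform(X_train), y_train)

pred_raw = raw.predict(X_test)
pred_scaled = scaled.predict(scaler.transform(X_test))

# Share of test records for which both models predict the same value.
share_equal = np.mean(np.isclose(pred_raw, pred_scaled))
print(f"share of matching predictions: {share_equal:.2f}")
```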

<a id="I.2"></a>
### I.2 Disadvantages of using Random Forest

1. A major disadvantage of random forests lies in their complexity. They require a lot of computational resources, since many trees have to be built and stored.
2. Due to this complexity, they take much longer to train than other comparable algorithms.

<a id="II"></a>
## II Using Random Forest for Regression

In this section we will study how random forests can be used to solve __regression__ problems using __Scikit-Learn__. 

<a id="II.1"></a>
### II.1 Problem Definition

The problem here is to predict the gas consumption (in millions of gallons) in 48 US states based on petrol tax (in cents), per capita income (in dollars), paved highways (in miles) and the proportion of the population with a driver's license.<br>

<a id="II.2"></a>
### II.2 Solution
To solve this regression problem we will use the random forest algorithm via the Scikit-Learn Python library. 
<br>
We will follow the traditional machine learning pipeline, which is roughly:

1. import the data
2. prepare the data
3. transform (scale, standardize, ...) the data
4. reduce the number of features (not needed here, since the dataset has only four)
5. train the algorithm
6. evaluate the algorithm
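
As a rough sketch, the six steps can also be chained in a single Scikit-Learn `Pipeline`. The synthetic data, the use of `PCA` for step 4, and all parameter values below are assumptions chosen for illustration; they are not part of the petrol example.

```python
from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Steps 1-2: obtain the data and split it (synthetic data as a stand-in).
X, y = make_regression(n_samples=300, n_features=6, n_informative=4,
                       noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Steps 3-5: scale, reduce the number of features, train the forest.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("reduce", PCA(n_components=4)),
    ("forest", RandomForestRegressor(n_estimators=50, random_state=0)),
])
pipeline.fit(X_train, y_train)

# Step 6: evaluate on the held-out records.
y_pred = pipeline.predict(X_test)
mae = mean_absolute_error(y_test, y_pred)
print("Mean Absolute Error:", mae)
```

Wrapping the steps in a pipeline ensures that the scaler and the feature reduction are fitted on the training data only and then applied unchanged to the test data.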

In [14]:
import pandas as pd
import numpy as np
df = pd.read_csv('datasets/petrol_consumption.csv')
df.head()
df.info()
# Preparing data

# Divide data into features and labels:
X = df.iloc[:, 0:4]
y = df.iloc[:, 4]

# Divide the data into training and testing sets

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Feature Scaling and Standardizing (optional here because of the algorithm used)

from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

# Training the algorithm

from sklearn.ensemble import RandomForestRegressor

regressor = RandomForestRegressor(n_estimators=20, random_state=0)

# The most important parameter of the RandomForestRegressor class is n_estimators,
# which defines the number of trees in the random forest.

# RandomForestRegressor trains multiple trees as an ensemble: each tree is fitted on a
# bootstrap sample of the training records (drawn with replacement), and a random subset
# of the features can be considered at each split (controlled by max_features).
# The random_state parameter makes it easy for others to replicate the results if given
# the same training data and parameters.

regressor.fit(X_train, y_train)
y_pred = regressor.predict(X_test)

# Evaluating the algorithm

from sklearn import metrics

print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred))
print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))


   Petrol_tax  Average_income  Paved_Highways  Population_Driver_licence(%)  Petrol_Consumption
0         9.0            3571            1976                         0.525                 541
1         9.0            4092            1250                         0.572                 524
2         9.0            3865            1586                          0.58                 561
3         7.5            4870            2351                         0.529                 414
4         8.0            4399             431                         0.544                 410


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48 entries, 0 to 47
Data columns (total 5 columns):
Petrol_tax                      48 non-null float64
Average_income                  48 non-null int64
Paved_Highways                  48 non-null int64
Population_Driver_licence(%)    48 non-null float64
Petrol_Consumption              48 non-null int64
dtypes: float64(2), int64(3)
memory usage: 2.0 KB


RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
                      max_features='auto', max_leaf_nodes=None,
                      min_impurity_decrease=0.0, min_impurity_split=None,
                      min_samples_leaf=1, min_samples_split=2,
                      min_weight_fraction_leaf=0.0, n_estimators=20,
                      n_jobs=None, oob_score=False, random_state=0, verbose=0,
                      warm_start=False)

Mean Absolute Error: 51.76500000000001
Mean Squared Error: 4216.166749999999
Root Mean Squared Error: 64.93201637097064


In [15]:
# With 20 trees, the root mean squared error is 64.93, which is greater than 10 percent of the
# average petrol consumption (576.77).
df.Petrol_Consumption.describe()
# We can try to improve the RMSE by increasing the number of estimators (trees).

count     48.000000
mean     576.770833
std      111.885816
min      344.000000
25%      509.500000
50%      568.500000
75%      632.750000
max      968.000000
Name: Petrol_Consumption, dtype: float64
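
To see how the error changes with the number of trees, a small sweep over `n_estimators` can help. The sketch below uses synthetic data from `make_regression` instead of the petrol dataset, and the candidate tree counts are arbitrary assumptions, so the numbers it prints are not comparable to the results in this notebook.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (illustrative only).
X, y = make_regression(n_samples=200, n_features=4, noise=20.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Train one forest per candidate size and record its test RMSE.
rmse_by_trees = {}
for n_trees in (5, 20, 100, 300):
    model = RandomForestRegressor(n_estimators=n_trees, random_state=0)
    model.fit(X_train, y_train)
    mse = mean_squared_error(y_test, model.predict(X_test))
    rmse_by_trees[n_trees] = float(np.sqrt(mse))

for n_trees, rmse in rmse_by_trees.items():
    print(f"{n_trees:>4} trees: RMSE = {rmse:.2f}")
```

The improvement typically flattens out as trees are added, while training time keeps growing roughly in proportion to the number of trees.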

In [16]:
regressor = RandomForestRegressor(n_estimators=300, random_state=0)
regressor.fit(X_train, y_train)
y_pred = regressor.predict(X_test)

from sklearn import metrics

print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred))
print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))

RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
                      max_features='auto', max_leaf_nodes=None,
                      min_impurity_decrease=0.0, min_impurity_split=None,
                      min_samples_leaf=1, min_samples_split=2,
                      min_weight_fraction_leaf=0.0, n_estimators=300,
                      n_jobs=None, oob_score=False, random_state=0, verbose=0,
                      warm_start=False)

Mean Absolute Error: 48.17666666666667
Mean Squared Error: 3472.820688888889
Root Mean Squared Error: 58.93064303814179
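
To put the two results in perspective, the RMSE can be expressed as a fraction of the mean petrol consumption. The short calculation below only reuses the numbers printed above; the variable names are chosen for illustration.

```python
mean_consumption = 576.770833        # mean from df.Petrol_Consumption.describe()
rmse_20_trees = 64.93201637097064    # RMSE reported with n_estimators=20
rmse_300_trees = 58.93064303814179   # RMSE reported with n_estimators=300

# Relative error: RMSE divided by the mean of the target variable.
rel_20 = rmse_20_trees / mean_consumption
rel_300 = rmse_300_trees / mean_consumption

print(f"20 trees:  RMSE is {rel_20:.1%} of the mean")
print(f"300 trees: RMSE is {rel_300:.1%} of the mean")
```

With 300 trees the relative error drops from about 11.3% to about 10.2% of the mean, which is still slightly above the 10 percent mark.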


<a id="III"></a>
## III Using Random Forest for Classification

<a id="III.1"></a>
### III.1 Problem Definition
The task here is to predict whether a bank currency note is authentic or not based on four attributes of the note's image, i.e. the __variance__ of the wavelet-transformed image, __skewness__, __entropy__, and __kurtosis__ (the `Curtosis` column in the dataset).

<a id="III.2"></a>
### III.2 Solution
This is a binary classification problem, and we will use a random forest classifier to solve it.<br>
The steps are similar to those performed for regression.


In [17]:
import pandas as pd
import numpy as np
df = pd.read_csv("datasets/bill_authentication.csv")
df.head()
df.info()

# Divide data into features and labels:

X = df.iloc[:, 0:4]
y = df.iloc[:, 4]

# Divide the data into training and testing sets

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Feature Scaling and Standardizing (optional here because of the algorithm used)

from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

# Training the algorithm

from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier(n_estimators=20, random_state=0)
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)

# Evaluating the algorithm

from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
print(confusion_matrix(y_test,y_pred))
print(classification_report(y_test,y_pred))
print(accuracy_score(y_test, y_pred))

# The accuracy achieved by our random forest classifier with 20 trees is 98.91%.
# Unlike before, changing the number of estimators for this problem doesn't significantly improve the results.


   Variance  Skewness  Curtosis   Entropy  Class
0    3.6216    8.6661   -2.8073  -0.44699      0
1    4.5459    8.1674   -2.4586   -1.4621      0
2     3.866   -2.6383    1.9242   0.10645      0
3    3.4566    9.5228   -4.0112   -3.5944      0
4   0.32924   -4.4552    4.5718   -0.9888      0


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1372 entries, 0 to 1371
Data columns (total 5 columns):
Variance    1372 non-null float64
Skewness    1372 non-null float64
Curtosis    1372 non-null float64
Entropy     1372 non-null float64
Class       1372 non-null int64
dtypes: float64(4), int64(1)
memory usage: 53.7 KB


RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=None, max_features='auto', max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=20,
                       n_jobs=None, oob_score=False, random_state=0, verbose=0,
                       warm_start=False)

[[155   2]
 [  1 117]]
              precision    recall  f1-score   support

           0       0.99      0.99      0.99       157
           1       0.98      0.99      0.99       118

    accuracy                           0.99       275
   macro avg       0.99      0.99      0.99       275
weighted avg       0.99      0.99      0.99       275

0.9890909090909091
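
The figures in the classification report follow directly from the confusion matrix. As a small worked example, the sketch below recomputes them with NumPy from the matrix printed above (rows are the true classes, columns the predicted classes); the variable names are chosen for illustration.

```python
import numpy as np

# Confusion matrix reported above: rows = true class, columns = predicted class.
cm = np.array([[155, 2],
               [1, 117]])

# Accuracy: correctly classified records (the diagonal) over all records.
accuracy = np.trace(cm) / cm.sum()

# Precision: correct predictions of a class over all predictions of that class.
precision = np.diag(cm) / cm.sum(axis=0)

# Recall: correct predictions of a class over all true members of that class.
recall = np.diag(cm) / cm.sum(axis=1)

# F1 score: harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

print("accuracy: ", accuracy)
print("precision:", precision.round(2))
print("recall:   ", recall.round(2))
print("f1-score: ", f1.round(2))
```

The accuracy is 272 correct predictions out of 275 test records, which matches the value printed by `accuracy_score`.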


In [8]:
# Load the library with the iris dataset
from sklearn.datasets import load_iris

# Load scikit's random forest classifier library
from sklearn.ensemble import RandomForestClassifier

# Load pandas
import pandas as pd

# Load numpy
import numpy as np

# Set random seed
np.random.seed(0)
# Create an object called iris with the iris data
iris = load_iris()

# Create a dataframe with the four feature variables
df = pd.DataFrame(iris.data, columns=iris.feature_names)

# View the top 5 rows
df.head()
# Add a new column with the species names, this is what we are going to try to predict
df['species'] = pd.Categorical.from_codes(iris.target, iris.target_names)
df['species'].unique()
# View the top 5 rows
df.head()
# Create a new column that for each row, generates a random number between 0 and 1, and
# if that value is less than or equal to .75, then sets the value of that cell as True
# and false otherwise. This is a quick and dirty way of randomly assigning some rows to
# be used as the training data and some as the test data.
df['is_train'] = np.random.uniform(0, 1, len(df)) <= .75

# View the top 5 rows
df.head()
# Create two new dataframes, one with the training rows, one with the test rows
train, test = df[df['is_train']], df[~df['is_train']]
# Show the number of observations for the test and training dataframes
print('Number of observations in the training data:', len(train))
print('Number of observations in the test data:', len(test))
# Select the feature columns to train on. Columns 2 and 3 are the two petal
# measurements; the sepal columns are left out.
features = df.columns[2:4]

# View features
features

# train['species'] contains the actual species names. Before we can use it,
# we need to convert each species name into a digit. So, in this case there
# are three species, which have been coded as 0, 1, or 2.
y = pd.factorize(train['species'])[0]

# View target
y
# Create a random forest Classifier. By convention, clf means 'Classifier'
clf = RandomForestClassifier(n_jobs=2, random_state=0)

# Train the Classifier to take the training features and learn how they relate
# to the training y (the species)
clf.fit(train[features], y)
# Apply the Classifier we trained to the test data (which, remember, it has never seen before)
clf.predict(test[features])
# View the predicted probabilities of the first 10 observations
clf.predict_proba(test[features])[0:10]

# Map the predicted class codes back to species names
preds = iris.target_names[clf.predict(test[features])]
# View the PREDICTED species for the first five observations
preds[0:5]
# View the importance score of each feature
list(zip(features, clf.feature_importances_))
# Create confusion matrix
pd.crosstab(test['species'], preds, rownames=['Actual Species'], colnames=['Predicted Species'])

   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
0                5.1               3.5                1.4               0.2
1                4.9               3.0                1.4               0.2
2                4.7               3.2                1.3               0.2
3                4.6               3.1                1.5               0.2
4                5.0               3.6                1.4               0.2


[setosa, versicolor, virginica]
Categories (3, object): [setosa, versicolor, virginica]

   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)  species
0                5.1               3.5                1.4               0.2   setosa
1                4.9               3.0                1.4               0.2   setosa
2                4.7               3.2                1.3               0.2   setosa
3                4.6               3.1                1.5               0.2   setosa
4                5.0               3.6                1.4               0.2   setosa


   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)  species  is_train
0                5.1               3.5                1.4               0.2   setosa      True
1                4.9               3.0                1.4               0.2   setosa      True
2                4.7               3.2                1.3               0.2   setosa      True
3                4.6               3.1                1.5               0.2   setosa      True
4                5.0               3.6                1.4               0.2   setosa      True


Number of observations in the training data: 118
Number of observations in the test data: 32


Index(['petal length (cm)', 'petal width (cm)'], dtype='object')

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2], dtype=int64)



RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=None, max_features='auto', max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=2,
                       oob_score=False, random_state=0, verbose=0,
                       warm_start=False)

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 1, 1, 2, 2, 1, 1, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2], dtype=int64)

array([[1., 0., 0.],
       [1., 0., 0.],
       [1., 0., 0.],
       [1., 0., 0.],
       [1., 0., 0.],
       [1., 0., 0.],
       [1., 0., 0.],
       [1., 0., 0.],
       [1., 0., 0.],
       [1., 0., 0.]])

array(['setosa', 'setosa', 'setosa', 'setosa', 'setosa'], dtype='<U10')

[('petal length (cm)', 0.37723763051914105),
 ('petal width (cm)', 0.6227623694808591)]

Predicted Species  setosa  versicolor  virginica
Actual Species
setosa                 13           0          0
versicolor              0           4          3
virginica               0           0         12
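
The confusion matrix above corresponds to 29 correct predictions out of 32 test records, with all errors being versicolor flowers predicted as virginica. The random `is_train` mask is a quick way to split the data, but it does not guarantee that each species is equally represented in both parts. A possible alternative, sketched below, is `train_test_split` with `stratify`; the 25% test size and `n_estimators=100` are assumptions chosen for illustration, so the resulting numbers will differ from the ones above.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

iris = load_iris()
X = iris.data[:, 2:4]   # petal length and petal width, as in the cell above
y = iris.target

# stratify=y keeps the proportion of each species the same in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)

accuracy = accuracy_score(y_test, clf.predict(X_test))
print("Number of observations in the training data:", len(X_train))
print("Number of observations in the test data:", len(X_test))
print("Accuracy:", accuracy)
```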
