Random forest is a type of supervised machine learning algorithm based on ensemble learning. Ensemble learning is a type of learning where you join different types of algorithms or same algorithm multiple times to form a more powerful prediction model. The random forest algorithm combines multiple algorithm of the same type i.e. multiple decision trees, resulting in a forest of trees, hence the name "Random Forest". The random forest algorithm can be used for both regression and classification tasks.

# How the Random Forest Algorithm Works


The following are the basic steps involved in performing the random forest algorithm

-Pick N random records from the dataset.
    
-Build a decision tree based on these N records.
    
-Choose the number of trees you want in your algorithm and repeat steps 1 and 2.
    
-In case of a regression problem, for a new record, each tree in the forest predicts a value for Y (output). The final value  can be calculated by taking the average of all the values predicted by all the trees in forest. 
    
-In case of a classification problem, each tree in the forest predicts the category to which the new record belongs. Finally,  the new record is assigned to the category that wins the majority vote.

# Advantages of using Random Forest

-The random forest algorithm is not biased, since, there are multiple trees and each tree is trained on a subset of data. Basically, the random forest algorithm relies on the power of "the crowd"; therefore the overall biasedness of the algorithm is reduced.

-This algorithm is very stable. Even if a new data point is introduced in the dataset the overall algorithm is not affected much since new data may impact one tree, but it is very hard for it to impact all the trees.

-The random forest algorithm works well when you have both categorical and numerical features.

-The random forest algorithm also works well when data has missing values or it has not been scaled well (although we have performed feature scaling in this article just for the purpose of demonstration).

# Disadvantages of using Random Forest

-A major disadvantage of random forests lies in their complexity. 
They required much more computational resources, owing to the large number of decision trees joined together.

-Due to their complexity, they require much more time to train than other comparable algorithms.

# Part 1: Using Random Forest for Regression


# Problem Definition


The problem here is to predict the gas consumption (in millions of gallons) in 48 of the US states based on petrol tax (in cents), per capita income (dollars), paved highways (in miles) and the proportion of population with the driving license.

In [36]:
import pandas as pd 
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestRegressor
from sklearn import metrics

In [5]:
dataset = pd.read_csv("./petrol_consumption.csv",header=0)

In [6]:
dataset.head()

Unnamed: 0,Petrol_tax,Average_income,Paved_Highways,Population_Driver_licence(%),Petrol_Consumption
0,9.0,3571,1976,0.525,541
1,9.0,4092,1250,0.572,524
2,9.0,3865,1586,0.58,561
3,7.5,4870,2351,0.529,414
4,8.0,4399,431,0.544,410


In [10]:
#create attributes 

X = dataset.iloc[:,:4].values
Y = dataset.iloc[:,4:].values

In [13]:

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=0)


In [16]:
#StandardScaler transforms your data in a distribution that will have a mean value 0 
#and standard deviation of 1


sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.fit_transform(X_test)

In [34]:
# n_estimators = number of trees 
regressor = RandomForestRegressor(n_estimators=20, random_state=0)  
regressor.fit(X_train, Y_train)  
y_pred = regressor.predict(X_test)  

  This is separate from the ipykernel package so we can avoid doing imports until


In [38]:
print('Mean Absolute Error:', metrics.mean_absolute_error(Y_test, y_pred))  
print('Mean Squared Error:', metrics.mean_squared_error(Y_test, y_pred))  
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(Y_test, y_pred)))  

Mean Absolute Error: 68.44500000000002
Mean Squared Error: 6656.516250000002
Root Mean Squared Error: 81.58747606097398


In [39]:
dataset['Petrol_Consumption'].mean()

576.7708333333334

In [50]:
# n_estimators = number of trees 
regressor = RandomForestRegressor(n_estimators=40, random_state=0)  
regressor.fit(X_train, Y_train)  
y_pred = regressor.predict(X_test)  

print('Mean Absolute Error:', metrics.mean_absolute_error(Y_test, y_pred))  
print('Mean Squared Error:', metrics.mean_squared_error(Y_test, y_pred))  
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(Y_test, y_pred)))  

Mean Absolute Error: 65.155
Mean Squared Error: 6368.158874999999
Root Mean Squared Error: 79.80074482735107


  This is separate from the ipykernel package so we can avoid doing imports until


# Part 2: Using Random Forest for Classification

In [74]:
dataset  = pd.read_csv("./bill_authentication.csv")

In [81]:
dataset.head()

Unnamed: 0,Variance,Skewness,Curtosis,Entropy,Class
0,3.6216,8.6661,-2.8073,-0.44699,0
1,4.5459,8.1674,-2.4586,-1.4621,0
2,3.866,-2.6383,1.9242,0.10645,0
3,3.4566,9.5228,-4.0112,-3.5944,0
4,0.32924,-4.4552,4.5718,-0.9888,0


In [82]:
X = dataset.iloc[:,:4].values
Y = dataset.iloc[:,4:].values

In [83]:
Y = dataset.iloc[:,4:].values
np.unique(Y)

array([0, 1], dtype=int64)

In [84]:
sc = StandardScaler()

In [85]:
X = sc.fit_transform(X)

In [86]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.2, random_state = 0 )

In [87]:
regressor = RandomForestRegressor(n_estimators=20, random_state=0)  
regressor.fit(X_train, Y_train)  
y_pred = regressor.predict(X_test)  

  


In [91]:
y_pred = [1 if x > 0.5 else 0 for x in y_pred]

In [96]:
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

print(confusion_matrix(Y_test,y_pred))  
print(classification_report(Y_test,y_pred))  
print(accuracy_score(Y_test, y_pred))  

[[155   2]
 [  0 118]]
              precision    recall  f1-score   support

           0       1.00      0.99      0.99       157
           1       0.98      1.00      0.99       118

   micro avg       0.99      0.99      0.99       275
   macro avg       0.99      0.99      0.99       275
weighted avg       0.99      0.99      0.99       275

0.9927272727272727
