# **Random Forest Algorithm with Python and Scikit-Learn**

# Both Regression & Classification 

Random forest is a type of supervised machine learning algorithm based on ensemble learning.

The random forest algorithm combines multiple algorithms of the same type, i.e. multiple decision trees, resulting in a forest of trees.

The random forest algorithm can be used for both regression and classification tasks.

## **How the Random Forest Algorithm Works**

1.  Pick N random records from the dataset.

2.  Build a decision tree based on these N records.

3.  Choose the number of trees you want in your algorithm and repeat steps 1 and 2.

4.  In case of a ***regression problem***, for a new record, each tree in the forest predicts a value for Y (output). The final value is the average of the values predicted by all the trees in the forest.

    In case of a ***classification problem***, each tree in the forest predicts the category to which the new record belongs. The new record is then assigned to the category that wins the majority vote.
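
The steps above can be sketched by hand with individual decision trees. This is only an illustrative sketch, not how Scikit-Learn implements its forests internally: the helper names `fit_forest` and `predict_forest` and the toy data are made up for this example. It also assumes the usual random forest conventions that records are sampled with replacement (bootstrap sampling) and that each split considers a random subset of features, neither of which is spelled out in the steps above.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor


def fit_forest(X, y, n_trees=20, seed=0):
    """Fit n_trees decision trees, each on a bootstrap sample of (X, y)."""
    rng = np.random.default_rng(seed)
    trees = []
    for _ in range(n_trees):  # Step 3: repeat steps 1 and 2 for every tree.
        # Step 1: pick N random records (assumed: sampled with replacement).
        idx = rng.integers(0, len(X), size=len(X))
        # Step 2: build a decision tree on those records.
        # max_features="sqrt" makes each split look at a random feature subset.
        tree = DecisionTreeRegressor(
            max_features="sqrt", random_state=int(rng.integers(1_000_000))
        )
        trees.append(tree.fit(X[idx], y[idx]))
    return trees


def predict_forest(trees, X):
    """Step 4 (regression): average the predictions of all trees."""
    return np.mean([tree.predict(X) for tree in trees], axis=0)


# Hypothetical toy data: y depends on the first feature, plus a little noise.
rng = np.random.default_rng(42)
X_toy = rng.normal(size=(200, 4))
y_toy = 3.0 * X_toy[:, 0] + rng.normal(scale=0.1, size=200)

forest = fit_forest(X_toy, y_toy)
print(predict_forest(forest, X_toy[:5]))
```

For classification, the averaging in `predict_forest` would be replaced by a majority vote over the labels predicted by the individual trees.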

## **Part 1: Using Random Forest for Regression**

### Problem Definition
The problem here is to predict the gas consumption (in millions of gallons) in 48 of the US states based on petrol tax (in cents), per capita income (in dollars), paved highways (in miles), and the proportion of the population with a driver's license.




### 1. Import Libraries
Execute the following code to import the necessary libraries:

In [3]:
import pandas as pd
import numpy as np

### 2. Importing Dataset
The dataset for this problem is available [here](https://drive.google.com/file/d/1mVmGNx6cbfvRHC_DvF12ZL3wGLSHD9f_/view).

In [4]:
dataset = pd.read_csv('/content/drive/My Drive/petrol_consumption.csv')

In [5]:
dataset.head()

Unnamed: 0,Petrol_tax,Average_income,Paved_Highways,Population_Driver_licence(%),Petrol_Consumption
0,9.0,3571,1976,0.525,541
1,9.0,4092,1250,0.572,524
2,9.0,3865,1586,0.58,561
3,7.5,4870,2351,0.529,414
4,8.0,4399,431,0.544,410


### 3. Preparing Data For Training
Two tasks will be performed in this section. The first task is to divide data into 'attributes' and 'label' sets. 

The resultant data is then divided into training and test sets.

The following script divides data into attributes and labels:

In [6]:
X = dataset.iloc[:, 0:4].values
y = dataset.iloc[:, 4].values
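
Positional slicing with `iloc` depends on the column order. As a hedged alternative, the same attribute/label division can be written with column names. The sketch below rebuilds only the five rows shown by `dataset.head()` so that it runs on its own; the variable names `sample`, `X_named`, and `y_named` are introduced here for illustration.

```python
import pandas as pd

# The first five rows of the dataset, as shown by dataset.head().
sample = pd.DataFrame({
    "Petrol_tax": [9.0, 9.0, 9.0, 7.5, 8.0],
    "Average_income": [3571, 4092, 3865, 4870, 4399],
    "Paved_Highways": [1976, 1250, 1586, 2351, 431],
    "Population_Driver_licence(%)": [0.525, 0.572, 0.580, 0.529, 0.544],
    "Petrol_Consumption": [541, 524, 561, 414, 410],
})

target = "Petrol_Consumption"
X_named = sample.drop(columns=target).values  # the four attribute columns
y_named = sample[target].values               # the label column

print(X_named.shape, y_named.shape)  # (5, 4) (5,)
```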

Finally, let's divide the data into training and testing sets:

In [7]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
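
To see what `test_size=0.2` means for a dataset of 48 records, the split can be reproduced on placeholder arrays of the same shape. The arrays `X_demo` and `y_demo` below are stand-ins, not the petrol data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with the same shape as the petrol dataset: 48 rows, 4 attributes.
X_demo = np.arange(48 * 4).reshape(48, 4)
y_demo = np.arange(48)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# 20% of 48 is 9.6, which is rounded up to 10 test records; 38 remain for training.
print(X_tr.shape, X_te.shape)  # (38, 4) (10, 4)
```

Fixing `random_state` makes the shuffle reproducible, so rerunning the cell yields the same split.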

### 4. Feature Scaling
Our features are not on a common scale: for instance, the Average_income field has values in the thousands, while the Petrol_tax values shown above are below ten.

Therefore, it would be beneficial to scale our data (although this step isn't as important for random forests, since decision trees split on feature thresholds and are largely insensitive to the scale of the features).

We will use Scikit-Learn's StandardScaler class for this. Execute the following code:

In [8]:
# Feature Scaling
from sklearn.preprocessing import StandardScaler

sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
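
Note that the scaler is fitted on the training set only and then reused on the test set, so no information from the test set leaks into the preprocessing. The sketch below illustrates this on made-up data with two features on very different scales; the arrays and the names `train`, `test`, and `scaler` are assumptions for this example.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Two hypothetical features: one in the thousands, one around eight.
train = np.column_stack([rng.normal(4000, 500, 40), rng.normal(8, 1, 40)])
test = np.column_stack([rng.normal(4000, 500, 10), rng.normal(8, 1, 10)])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learns mean and std from train
test_scaled = scaler.transform(test)        # reuses the training statistics

# After scaling, each training column has mean 0 and standard deviation 1.
print(train_scaled.mean(axis=0).round(6), train_scaled.std(axis=0).round(6))
# The test columns are only approximately centred, because they were
# standardized with the training mean and std rather than their own.
print(test_scaled.mean(axis=0))
```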

### 5. Training the Algorithm
Now that we have scaled our dataset, it is time to train our random forest algorithm to solve this regression problem.

In [9]:
from sklearn.ensemble import RandomForestRegressor

regressor = RandomForestRegressor(n_estimators=20, random_state=0)
regressor.fit(X_train, y_train)
y_pred = regressor.predict(X_test)

The **RandomForestRegressor** class of the **sklearn.ensemble** module is used to solve regression problems via random forest.

The most important parameter of the RandomForestRegressor class is the **n_estimators** parameter.

This parameter defines the number of trees in the random forest. We will start with n_estimators=20 to see how our algorithm performs.
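
One way to explore the effect of `n_estimators` is to train several forests of different sizes and compare their test errors. The sketch below does this on synthetic data from `make_regression`, not on the petrol dataset, so the numbers it prints say nothing about the results in this tutorial; the tree counts 5, 20, and 100 are arbitrary choices.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic regression data with four features (an assumption for this sketch).
X_syn, y_syn = make_regression(n_samples=200, n_features=4, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.2, random_state=0)

mae_by_trees = {}
for n in (5, 20, 100):
    model = RandomForestRegressor(n_estimators=n, random_state=0)
    model.fit(X_tr, y_tr)
    mae_by_trees[n] = mean_absolute_error(y_te, model.predict(X_te))
    print(f"n_estimators={n:>3}: MAE={mae_by_trees[n]:.2f}")
```

More trees usually make the predictions more stable at the cost of longer training time, but the improvement tends to level off beyond some point.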

### 6. Evaluating the Algorithm
The final step of solving a machine learning problem is to evaluate the performance of the algorithm.

For regression problems, the metrics commonly used to evaluate an algorithm are mean absolute error, mean squared error, and root mean squared error.

In [10]:
from sklearn import metrics

print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred))
print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))

Mean Absolute Error: 51.76500000000001
Mean Squared Error: 4216.166749999999
Root Mean Squared Error: 64.93201637097064
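
The three metrics are simple functions of the prediction errors, and the root mean squared error is just the square root of the mean squared error (here, the square root of 4216.17 is about 64.93). A small worked example on made-up values, checked against Scikit-Learn's own functions:

```python
import numpy as np
from sklearn import metrics

# Hypothetical true and predicted values, chosen to keep the arithmetic easy.
y_true = np.array([3.0, 5.0, 2.0])
y_hat = np.array([2.0, 5.0, 4.0])

errors = y_true - y_hat          # [1, 0, -2]
mae = np.mean(np.abs(errors))    # (1 + 0 + 2) / 3 = 1.0
mse = np.mean(errors ** 2)       # (1 + 0 + 4) / 3, about 1.667
rmse = np.sqrt(mse)              # about 1.291

print(mae, mse, rmse)
print(metrics.mean_absolute_error(y_true, y_hat),
      metrics.mean_squared_error(y_true, y_hat))
```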


## **Part 2: Using Random Forest for Classification**

### Problem Definition
The task here is to predict whether a bank currency note is authentic or not based on four attributes of its wavelet-transformed image: variance, skewness, curtosis, and entropy.

### 1. Import Libraries

In [11]:
import pandas as pd
import numpy as np

### 2. Importing Dataset
The dataset can be downloaded from the following link:

[click here](https://drive.google.com/file/d/13nw-uRXPY8XIZQxKRNZ3yYlho-CYm_Qt/view)

In [21]:
dataset = pd.read_csv("/content/drive/My Drive/bill_authentication.csv")

In [22]:
dataset.head()

Unnamed: 0,Variance,Skewness,Curtosis,Entropy,Class
0,3.6216,8.6661,-2.8073,-0.44699,0
1,4.5459,8.1674,-2.4586,-1.4621,0
2,3.866,-2.6383,1.9242,0.10645,0
3,3.4566,9.5228,-4.0112,-3.5944,0
4,0.32924,-4.4552,4.5718,-0.9888,0


### 3. Preparing Data For Training

In [23]:
X = dataset.iloc[:, 0:4].values
y = dataset.iloc[:, 4].values

In [24]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

### 4. Feature Scaling

In [25]:
# Feature Scaling
from sklearn.preprocessing import StandardScaler

sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

### 5. Training the Algorithm

In [26]:
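from sklearn.ensemble import RandomForestRegressor

# Note: a regressor is fitted on the 0/1 class labels here, so its predictions
# are continuous values that are rounded to class labels during evaluation.
# RandomForestClassifier is the usual choice for classification tasks.
regressor = RandomForestRegressor(n_estimators=20, random_state=0)
regressor.fit(X_train, y_train)
y_pred = regressor.predict(X_test)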
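
For classification tasks, Scikit-Learn also provides the `RandomForestClassifier` class, which predicts class labels directly, so no rounding is needed. The sketch below shows it on synthetic data from `make_classification` rather than on the bank note dataset, so its accuracy is not comparable to the results reported in this tutorial.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic binary classification data with four features (an assumption).
X_syn, y_syn = make_classification(
    n_samples=500, n_features=4, n_informative=3, n_redundant=1, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.2, random_state=0)

classifier = RandomForestClassifier(n_estimators=20, random_state=0)
classifier.fit(X_tr, y_tr)

labels = classifier.predict(X_te)        # class labels (0 or 1)
probas = classifier.predict_proba(X_te)  # class probabilities averaged over the trees

acc = accuracy_score(y_te, labels)
print(f"Accuracy: {acc:.3f}")
```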

### 6. Evaluating the Algorithm

For classification problems, the metrics used to evaluate an algorithm are accuracy, confusion matrix, precision, recall, and F1 score.

In [36]:
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

# y_pred holds continuous values, so round them to the nearest class label (0 or 1).
print(confusion_matrix(y_test, y_pred.round()))
print(classification_report(y_test, y_pred.round()))
print(accuracy_score(y_test, y_pred.round()))

[[155   2]
 [  0 118]]
              precision    recall  f1-score   support

           0       1.00      0.99      0.99       157
           1       0.98      1.00      0.99       118

    accuracy                           0.99       275
   macro avg       0.99      0.99      0.99       275
weighted avg       0.99      0.99      0.99       275

0.9927272727272727
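
The figures in the classification report follow directly from the confusion matrix, whose rows are the true classes and whose columns are the predicted classes. The short sketch below recomputes them from the matrix printed above:

```python
import numpy as np

# Confusion matrix reported above: rows are true classes, columns are predictions.
cm = np.array([[155, 2],
               [0, 118]])

accuracy = np.trace(cm) / cm.sum()        # (155 + 118) / 275
precision = np.diag(cm) / cm.sum(axis=0)  # correct / all predicted as that class
recall = np.diag(cm) / cm.sum(axis=1)     # correct / all truly in that class
f1 = 2 * precision * recall / (precision + recall)

print(round(accuracy, 4))   # 0.9927
print(precision.round(2))   # class 0: 1.00, class 1: 0.98
print(recall.round(2))      # class 0: 0.99, class 1: 1.00
print(f1.round(2))          # class 0: 0.99, class 1: 0.99
```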


The accuracy achieved by our random forest with **20 trees** is **99.27%**.

Changing the number of estimators for this problem didn't significantly improve the results.