# Random Forest Classifier

Random Forest Classifier is an ensemble learning method that aggregates the predictions of multiple individual decision trees to make a final classification prediction. By combining the predictions of multiple trees, the Random Forest Classifier aims to achieve higher accuracy and improved generalization compared to using a single decision tree. Random Forest Classifier is built upon decision trees, which are hierarchical structures that recursively partition the input data based on features until reaching leaf nodes, where class labels are assigned. Each decision tree in the Random Forest independently learns to make predictions based on different subsets of the training data.

## Working of Random Forest Classifier

The following is a step-by-step breakdown of how the Random Forest Classifier works:  
1. ***Data Preparation***:  
    Prepare the training dataset, which consists of input features (X) and corresponding class labels (y). Perform any necessary preprocessing steps such as handling missing values, encoding categorical variables, and scaling features if required.  
  
2. ***Ensemble Creation***:  
    a. Determine the number of decision trees to include in the Random Forest ensemble, specified by the hyperparameter "n_estimators."  
    b. Initialize an empty ensemble to hold the decision trees.  
  
3. ***For each tree in the ensemble***:  
    a. Random Subsampling:  
    Create a bootstrap sample by randomly selecting instances from the training data with replacement. The size of the bootstrap sample is the same as the original training data. Each bootstrap sample will include some repeated instances and exclude some instances, ensuring diversity within each tree's training data.  
  
    b. Tree Construction:  
    Create a decision tree using the bootstrap sample obtained in the previous step. At each node of the decision tree, randomly select a subset of features to consider for splitting. The number of features to consider is controlled by the hyperparameter "max_features." Determine the best split based on a criterion such as Gini impurity or entropy. This involves evaluating different splits and selecting the one that maximizes information gain or reduces impurity the most.  
  
    c. Tree Growth:  
    Recursively split the nodes of the decision tree until a stopping criterion is met. The stopping criterion can be a maximum depth limit specified by the "max_depth" hyperparameter or a minimum number of samples required to split a node defined by the "min_samples_split" hyperparameter.  
  
    d. Add the constructed decision tree to the ensemble.  
  
4. ***Prediction Aggregation***:  
    a. Once all the decision trees in the ensemble are constructed, the Random Forest makes predictions by aggregating the predictions of individual trees.  
  
    b. Classification task:  
    Each decision tree in the ensemble predicts the class label for a given input. The final prediction is determined by either majority voting (selecting the class with the highest number of votes) or probability averaging (averaging the predicted probabilities across all trees).  
  
5. ***Out-Of-Bag(OOB) Evaluation***:  
    During training, each tree leaves out the data points that were not drawn into its bootstrap sample; these are its out-of-bag (OOB) samples. The OOB samples can be used to evaluate the performance of the Random Forest without the need for a separate validation set. Each data point is predicted only by the trees that did not see it during training, and these predictions are aggregated to calculate an evaluation metric such as accuracy.  
  
6. ***Feature Importance***:  
    Random Forest provides a measure of feature importance based on the collective contribution of features in making predictions across all the trees. Feature importance can be calculated by evaluating how much the impurity or error decreases when a particular feature is used for splitting.
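
The sampling and aggregation steps above can be sketched in a few lines of plain Python. This is a minimal illustration rather than how any library implements them: the helper names (`bootstrap_indices`, `oob_indices`, `majority_vote`, `average_probabilities`) and the toy votes and probabilities are all made up for the example.

```python
import random
from collections import Counter


def bootstrap_indices(n, rng):
    """Draw n row indices with replacement (one bootstrap sample)."""
    return [rng.randrange(n) for _ in range(n)]


def oob_indices(n, sample):
    """Rows never drawn into the sample are that tree's out-of-bag rows."""
    drawn = set(sample)
    return [i for i in range(n) if i not in drawn]


def majority_vote(labels):
    """Hard voting: the most frequent predicted label wins."""
    return Counter(labels).most_common(1)[0][0]


def average_probabilities(per_tree_probs):
    """Soft voting: average each class probability, then pick the largest."""
    n_trees = len(per_tree_probs)
    classes = per_tree_probs[0].keys()
    mean = {c: sum(p[c] for p in per_tree_probs) / n_trees for c in classes}
    return max(mean, key=mean.get)


rng = random.Random(0)
n_rows = 1000
sample = bootstrap_indices(n_rows, rng)
oob = oob_indices(n_rows, sample)
oob_fraction = len(oob) / n_rows  # roughly 1/e (about 0.37) for large n

# Hypothetical per-tree class probabilities for one input (classes 1 and 2).
tree_probs = [{1: 0.55, 2: 0.45}, {1: 0.55, 2: 0.45}, {1: 0.10, 2: 0.90}]
tree_votes = [max(p, key=p.get) for p in tree_probs]  # each tree's own label

hard = majority_vote(tree_votes)          # two of three trees vote for class 1
soft = average_probabilities(tree_probs)  # mean probability favours class 2
print(oob_fraction, hard, soft)
```

The toy probabilities are chosen so that the two aggregation rules disagree: two weakly confident trees outvote one strongly confident tree under majority voting, but not under probability averaging.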
  



## Hyperparameters of Random Forest Classifier

1. ***n_estimators***:  
    It represents the number of decision trees in the Random Forest. Increasing the number of trees can improve performance but also increases computational complexity.  
  
2. ***max_depth***:  
    This parameter controls the maximum depth of each decision tree in the ensemble. Deeper trees can capture more complex relationships but are more prone to overfitting.  
  
3. ***min_samples_split***:  
    It sets the minimum number of samples required to split an internal node. Larger values prevent overfitting by requiring a higher number of samples for a node to be split.  
  
4. ***min_samples_leaf***:  
    It specifies the minimum number of samples required to be at a leaf node. Similar to min_samples_split, larger values help prevent overfitting.  
  
5. ***max_features***:  
    This parameter sets the number of features that are randomly selected as split candidates at each node of a tree. A lower value reduces the correlation between trees and can help prevent overfitting.  
  
6. ***bootstrap***:  
    It determines whether bootstrap samples are used when building decision trees. If set to True, each tree is trained on a random subset of the training data with replacement. Setting it to False results in using the entire dataset for training each tree.  
    
  
7. ***criterion***:  
    The criterion specifies the function used to measure the quality of a split at each node of a decision tree. In the Random Forest Classifier, the commonly used criteria are Gini impurity and entropy. Both quantify how mixed the class labels in a node are: Gini impurity is the probability of mislabeling a randomly chosen sample if it were labeled according to the node's class distribution, whereas entropy measures the disorder of that distribution.
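
To make the two criteria concrete, here is a small sketch that computes Gini impurity, entropy, and the impurity decrease of a candidate split from lists of class labels. The function names are illustrative only and are not part of any library API.

```python
import math
from collections import Counter


def class_proportions(labels):
    """Fraction of samples belonging to each class present in the node."""
    n = len(labels)
    return [count / n for count in Counter(labels).values()]


def gini(labels):
    """Gini impurity: 1 - sum(p_k^2)."""
    return 1.0 - sum(p * p for p in class_proportions(labels))


def entropy(labels):
    """Shannon entropy in bits: -sum(p_k * log2(p_k))."""
    return -sum(p * math.log2(p) for p in class_proportions(labels))


def impurity_decrease(parent, left, right, impurity):
    """Parent impurity minus the size-weighted impurity of the two children."""
    n = len(parent)
    weighted = (len(left) / n) * impurity(left) + (len(right) / n) * impurity(right)
    return impurity(parent) - weighted


parent = [1, 1, 2, 2]                      # perfectly mixed two-class node
print(gini(parent), entropy(parent))       # 0.5 and 1.0
pure_split = impurity_decrease(parent, [1, 1], [2, 2], gini)
print(pure_split)                          # 0.5: the split removes all impurity
```

When entropy is passed as the `impurity` argument, the impurity decrease is the information gain of the split.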

### About the Dataset:  
  
This is the Glass Identification Data Set from UCI. It contains 10 attributes, including an id. The response is the glass type (7 discrete values).  
  
Attribute Information:  

RI: refractive index  
Na: Sodium (unit measurement: weight percent in corresponding oxide, as are attributes 4-10)  
Mg: Magnesium  
Al: Aluminum  
Si: Silicon  
K: Potassium  
Ca: Calcium  
Ba: Barium  
Fe: Iron  
Type of glass: (class attribute)  
-- 1 building_windows_float_processed  
-- 2 building_windows_non_float_processed  
-- 3 vehicle_windows_float_processed  
-- 4 vehicle_windows_non_float_processed (none in this database)  
-- 5 containers  
-- 6 tableware  
-- 7 headlamps  

In [1]:
import warnings
import pandas as pd
import numpy as np


warnings.filterwarnings("ignore")

In [None]:
df = pd.read_csv('glass.csv')

In [2]:
df.head()

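|   | RI | Na | Mg | Al | Si | K | Ca | Ba | Fe | Type |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.52101 | 13.64 | 4.49 | 1.10 | 71.78 | 0.06 | 8.75 | 0.0 | 0.0 | 1 |
| 1 | 1.51761 | 13.89 | 3.60 | 1.36 | 72.73 | 0.48 | 7.83 | 0.0 | 0.0 | 1 |
| 2 | 1.51618 | 13.53 | 3.55 | 1.54 | 72.99 | 0.39 | 7.78 | 0.0 | 0.0 | 1 |
| 3 | 1.51766 | 13.21 | 3.69 | 1.29 | 72.61 | 0.57 | 8.22 | 0.0 | 0.0 | 1 |
| 4 | 1.51742 | 13.27 | 3.62 | 1.24 | 73.08 | 0.55 | 8.07 | 0.0 | 0.0 | 1 |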


In [3]:
df.describe().T

|   | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| RI | 214.0 | 1.518365 | 0.003037 | 1.51115 | 1.516522 | 1.51768 | 1.519157 | 1.53393 |
| Na | 214.0 | 13.40785 | 0.816604 | 10.73 | 12.9075 | 13.3 | 13.825 | 17.38 |
| Mg | 214.0 | 2.684533 | 1.442408 | 0.0 | 2.115 | 3.48 | 3.6 | 4.49 |
| Al | 214.0 | 1.444907 | 0.49927 | 0.29 | 1.19 | 1.36 | 1.63 | 3.5 |
| Si | 214.0 | 72.650935 | 0.774546 | 69.81 | 72.28 | 72.79 | 73.0875 | 75.41 |
| K | 214.0 | 0.497056 | 0.652192 | 0.0 | 0.1225 | 0.555 | 0.61 | 6.21 |
| Ca | 214.0 | 8.956963 | 1.423153 | 5.43 | 8.24 | 8.6 | 9.1725 | 16.19 |
| Ba | 214.0 | 0.175047 | 0.497219 | 0.0 | 0.0 | 0.0 | 0.0 | 3.15 |
| Fe | 214.0 | 0.057009 | 0.097439 | 0.0 | 0.0 | 0.0 | 0.1 | 0.51 |
| Type | 214.0 | 2.780374 | 2.103739 | 1.0 | 1.0 | 2.0 | 3.0 | 7.0 |


In [4]:
df.isnull().sum()

RI      0
Na      0
Mg      0
Al      0
Si      0
K       0
Ca      0
Ba      0
Fe      0
Type    0
dtype: int64

In [5]:
df.duplicated().sum()

1

In [6]:
df.drop_duplicates(inplace=True)

**Let's build a Random Forest Classifier model for our dataset.**

In [7]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

In [8]:
X = df.drop('Type', axis=1)
y = df['Type']

In [9]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [10]:
rf = RandomForestClassifier(
    n_estimators=500, criterion='entropy', max_depth=None,
    min_samples_split=2, min_samples_leaf=1, max_features='sqrt',
)

In [11]:
rf.fit(X_train, y_train)

In [12]:
pred = rf.predict(X_test)

In [14]:
accuracy = accuracy_score(y_test, pred)
print("Accuracy:", accuracy)

Accuracy: 0.813953488372093
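
The OOB evaluation and feature importance described in the working steps can also be read directly from a fitted scikit-learn forest. The sketch below is self-contained and uses synthetic data from `make_classification` as a stand-in, so the variable names (`X_demo`, `y_demo`, `rf_demo`) and the numbers it prints are assumptions for illustration and say nothing about the glass dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (an assumption, not the glass dataset):
# 300 rows, 9 features, 3 classes.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=9, n_informative=5, n_redundant=2,
    n_classes=3, random_state=0,
)

# oob_score=True asks the forest to score each row using only the trees
# whose bootstrap sample did not contain that row.
rf_demo = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0)
rf_demo.fit(X_demo, y_demo)

oob_accuracy = rf_demo.oob_score_
importances = rf_demo.feature_importances_  # impurity-based, sums to 1

# List features from most to least important.
ranked = sorted(enumerate(importances), key=lambda pair: pair[1], reverse=True)
for index, score in ranked:
    print(f"feature {index}: {score:.3f}")
print("OOB accuracy:", round(oob_accuracy, 3))
```

The same two attributes, `oob_score_` and `feature_importances_`, would be available on the notebook's `rf` model if it were constructed with `oob_score=True`.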
