Random Forest is a popular ensemble learning technique used in machine learning for both classification and regression tasks. It builds on decision trees, combining many of them to make predictions that are usually more accurate than those of a single tree. In this tutorial, I'll explain how Random Forest works and then train one with scikit-learn:

What is Random Forest?
Random Forest is an ensemble method that builds many decision trees during training and combines their outputs for better predictive performance. It gets its name from the randomness introduced while the trees are built: each tree is trained on a random sample of the rows and considers a random subset of the features at each split. It works with both numerical and categorical features, although some implementations (scikit-learn among them) need categorical features encoded as numbers first.

How Does Random Forest Work?
Here's a step-by-step explanation of how Random Forest works:

1. Bootstrapping (Random Sampling): The algorithm starts by creating multiple subsets of the training data through random sampling with replacement. Each subset is called a "bootstrap sample."

2. Feature Selection: While a tree is being grown, each node considers only a random subset of the features as split candidates, instead of all of them, and picks the best split within that subset. This introduces diversity among the individual trees.

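3. Decision Tree Building: One decision tree is grown on each bootstrap sample, with the random feature subset applied at every node of that tree. These trees are the ensemble's "base learners." They are sometimes loosely called "weak learners," although in a Random Forest they are usually grown deep, so each one fits its sample closely and has high variance on its own.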

4. Voting or Averaging: For classification tasks, the predictions of each tree are combined through a majority vote (mode), while for regression tasks, the predictions are averaged.

5. Reduced Variance: Because the trees are constructed using bootstrapped samples and feature subsets, each tree is slightly different. This diversity helps reduce overfitting and increases the model's generalization ability.
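
The steps above can be sketched by hand with plain decision trees. This is only an illustration under stated assumptions, not how scikit-learn implements `RandomForestClassifier` internally: the data is synthetic (`make_classification`), the number of trees (25) is arbitrary, and the vote is a simple hard majority vote over 0/1 labels.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data, used only for illustration
X, y = make_classification(n_samples=300, n_features=8, n_informative=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

rng = np.random.default_rng(0)
n_trees = 25  # odd, so a 0/1 majority vote can never tie
trees = []
for i in range(n_trees):
    # Step 1: bootstrap sample, same size as the training set, drawn with replacement
    idx = rng.integers(0, len(X_train), size=len(X_train))
    # Steps 2-3: grow one tree on that sample; max_features="sqrt" makes the
    # tree look at a random subset of the features at every split
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=i)
    tree.fit(X_train[idx], y_train[idx])
    trees.append(tree)

# Step 4: majority vote; with 0/1 labels this is the mean vote thresholded at 0.5
all_preds = np.stack([t.predict(X_test) for t in trees])  # shape (n_trees, n_test)
forest_pred = (all_preds.mean(axis=0) > 0.5).astype(int)

single_acc = (trees[0].predict(X_test) == y_test).mean()
forest_acc = (forest_pred == y_test).mean()
print(f"single tree accuracy: {single_acc:.2f}")
print(f"{n_trees}-tree vote accuracy: {forest_acc:.2f}")
```

Comparing the two printed accuracies gives a feel for step 5: any single tree depends heavily on the sample it happened to see, while the vote smooths those differences out.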

Advantages of Random Forest:
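Strong predictive performance on a wide range of tabular tasks, often with little tuning.
More robust to overfitting than a single decision tree, because the predictions of many trees are averaged.
Handles both numerical and categorical data (in scikit-learn, categorical features must be encoded as numbers first).
Provides feature importance scores that show how much each feature contributes to the model's splits.
Can handle missing data in some implementations; in others, missing values must be imputed first.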
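
As a rough sketch of the feature importance point, a fitted scikit-learn forest exposes its scores through the `feature_importances_` attribute. The data here is synthetic and is an assumption made for the example: with `shuffle=False`, `make_classification` places the 3 informative features in the first 3 columns and pure noise in the last 3.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: columns 0-2 carry signal, columns 3-5 are noise (shuffle=False keeps that order)
X, y = make_classification(
    n_samples=300, n_features=6, n_informative=3, n_redundant=0,
    shuffle=False, random_state=0,
)

forest = RandomForestClassifier(n_estimators=100, random_state=42)
forest.fit(X, y)

# Impurity-based importances, normalized so that they sum to 1
importances = forest.feature_importances_
for i, importance in enumerate(importances):
    print(f"feature_{i}: {importance:.3f}")
```

These scores are impurity-based, so they describe how the trees used each feature during training rather than a causal effect.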

In [12]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.ensemble import RandomForestClassifier

In [13]:
#Raw string (r'...') so the backslashes in the Windows path are not read as escape sequences
data = pd.read_csv(r'D:\GIT1\Machine-Learning\Decision Trees\salaries.csv')
data

Unnamed: 0,company,job,degree,salary_more_then_100k
0,google,sales executive,bachelors,0
1,google,sales executive,masters,0
2,google,business manager,bachelors,1
3,google,business manager,masters,1
4,google,computer programmer,bachelors,0
5,google,computer programmer,masters,1
6,abc pharma,sales executive,masters,0
7,abc pharma,computer programmer,bachelors,0
8,abc pharma,business manager,bachelors,0
9,abc pharma,business manager,masters,1


In [48]:
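#Assign the target (y) and the features (X)
#The leftover 'Unnamed: 0' index column stays in X, but the ColumnTransformer below only keeps the columns it is given
X = data.drop('salary_more_then_100k', axis=1)
y = data['salary_more_then_100k']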

In [49]:
#Define the category order for each ordinal feature
#OrdinalEncoder maps each list position to an integer, starting at 0
custom_order_jobs = ['sales executive', 'computer programmer', 'business manager']  #sales executive -> 0 ... business manager -> 2
custom_order_degree = ['bachelors', 'masters']  #bachelors -> 0, masters -> 1

In [50]:
#Define the ordinal and nominal features
nominal_features = ['company']
ordinal_features = ['job','degree']

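#Create the encoders: one-hot for the nominal feature, ordinal for the ranked ones
nominal_transformers = OneHotEncoder(drop='first')  #drop='first' removes one redundant dummy column per feature
ordinal_transformers = OrdinalEncoder(categories=[custom_order_jobs, custom_order_degree])  #one category list per entry in ordinal_features, in the same order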

In [51]:
#Applying the columnTransformers
preprocessor = ColumnTransformer(
    transformers=[
        ('ord',ordinal_transformers,ordinal_features),
        ('nom',nominal_transformers,nominal_features)
    ]
)
preprocessor

In [52]:
#Creating the Random Forest
random_tree = RandomForestClassifier(n_estimators=100, random_state=42)
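
Because each tree is trained on a bootstrap sample, roughly a third of the training rows are left out of any given tree. scikit-learn can score each row using only the trees that never saw it, which gives an "out-of-bag" (OOB) accuracy estimate without a separate test set. The sketch below is an optional aside on synthetic data, not part of the salary example; `oob_score=True` is a documented `RandomForestClassifier` parameter.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data, used only for illustration
X, y = make_classification(n_samples=300, n_features=8, n_informative=4, random_state=0)

# oob_score=True asks the forest to evaluate each row with the trees that did not train on it
forest = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=42)
forest.fit(X, y)

print(f"OOB accuracy estimate: {forest.oob_score_:.2f}")
```

This can be handy on small datasets, where holding out a test set leaves very few rows to train or evaluate on.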

In [53]:
#Create a pipeline that chains the preprocessing and the RandomForestClassifier
pipeline_1 = Pipeline([
    ('preprocessor',preprocessor),
    ('model',random_tree)
])
pipeline_1

In [54]:
#splitting the data into training and Test sets
X_train,X_test,y_train,y_test = train_test_split(X,y, test_size=0.2, random_state=42)

In [56]:
#Fitting the Pipeline
pipeline_1.fit(X_train,y_train)


In [58]:
#Making predictions with the model
y_pred = pipeline_1.predict(X_test)
y_pred

array([0, 0, 0, 1], dtype=int64)

In [59]:
#Evaluate the model on the test set
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
class_report = classification_report(y_test, y_pred)


#Print the evaluation metrics
print(f'Accuracy: {accuracy}')
print(f'Confusion Matrix:\n{conf_matrix}')
print(f'Classification Report:\n{class_report}')

Accuracy: 0.75
Confusion Matrix:
[[2 0]
 [1 1]]
Classification Report:
              precision    recall  f1-score   support

           0       0.67      1.00      0.80         2
           1       1.00      0.50      0.67         2

    accuracy                           0.75         4
   macro avg       0.83      0.75      0.73         4
weighted avg       0.83      0.75      0.73         4
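
The numbers in the report can be recomputed directly from the confusion matrix printed above, which helps when reading it. In scikit-learn's layout, rows are the true classes and columns are the predicted classes. The sketch below hard-codes that matrix; the variable names are only for illustration.

```python
import numpy as np

# Confusion matrix from the output above: rows = true class, columns = predicted class
cm = np.array([[2, 0],
               [1, 1]])

correct = np.diag(cm)                    # correctly classified samples per class
accuracy = correct.sum() / cm.sum()      # 3 of 4 test samples are correct
precision = correct / cm.sum(axis=0)     # divide by column sums (samples predicted as each class)
recall = correct / cm.sum(axis=1)        # divide by row sums (samples truly in each class)
f1 = 2 * precision * recall / (precision + recall)

print("accuracy: ", round(float(accuracy), 2))
print("precision:", precision.round(2))
print("recall:   ", recall.round(2))
print("f1-score: ", f1.round(2))
```

With only 4 test samples, a single prediction moves accuracy by 25 percentage points, so these scores are a rough indication rather than a stable estimate.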

