<h1 style='text-align:center;'>Random Forest Classification</h1>


<h2 style="text-align:center;">Introduction</h2>

**Random Forest** is an ensemble learning algorithm that combines multiple **Decision Trees** to improve predictive performance.  
It follows the **Bagging (Bootstrap Aggregating)** approach: each tree is trained on a bootstrap sample of the dataset (rows drawn with replacement) and considers only a random subset of features at each split.  

Key concepts:
- **Ensemble Learning**: Combines predictions of multiple models.  
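- **Bagging**: Trains many trees independently on bootstrap samples and aggregates their predictions to reduce variance.  
- **Randomness**: Each tree sees a random subset of rows, and each split considers a random subset of features, which decorrelates the trees and improves generalization.  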
- **Hyperparameters**: 
  - `n_estimators`: Number of trees.  
  - `max_depth`: Maximum depth of each tree.  
  - `criterion`: Split-quality measure, `'gini'` (the scikit-learn default) or `'entropy'`.  

Random Forest reduces **overfitting** (common in single decision trees) by averaging many decorrelated trees, and it usually generalizes better than any single tree.
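The two sources of randomness and the aggregation step described above can be illustrated with a minimal NumPy sketch. The sizes, the three hypothetical trees, and the helper `majority_vote` are assumptions made for illustration; scikit-learn's own forest combines trees by averaging their predicted class probabilities rather than by counting hard votes.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

n_samples, n_features = 400, 2  # sizes chosen to mirror this notebook's data

# Bootstrap sample: draw n_samples row indices WITH replacement
bootstrap_idx = rng.integers(0, n_samples, size=n_samples)
unique_fraction = np.unique(bootstrap_idx).size / n_samples

# Feature randomness: pick a random subset of feature indices (here 1 of 2)
feature_idx = rng.choice(n_features, size=1, replace=False)

def majority_vote(votes):
    """Most common 0/1 label per column of an (n_trees, n_samples) array.

    Ties resolve to class 1; with an odd number of trees there are no ties.
    """
    return (votes.mean(axis=0) >= 0.5).astype(int)

# Three hypothetical trees voting on four samples
votes = np.array([[0, 1, 1, 0],
                  [0, 1, 0, 0],
                  [1, 1, 0, 0]])
ensemble_pred = majority_vote(votes)

print("Fraction of distinct rows in bootstrap sample:", unique_fraction)
print("Randomly chosen feature index:", feature_idx)
print("Ensemble prediction:", ensemble_pred)
```

A bootstrap sample leaves out roughly a third of the rows on average, which is why each tree ends up seeing a slightly different view of the data.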


<h2 style='text-align:center;'>Importing Libraries and Dataset</h2>

In [1]:

# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Importing the dataset
dataset = pd.read_csv("../data/Social_Network_Ads.csv")
X = dataset.iloc[:, [2, 3]].values  # Features: Age, Estimated Salary
y = dataset.iloc[:, -1].values      # Target: Purchased
dataset.head()  # ➤ Display first few rows


|   | User ID | Gender | Age | EstimatedSalary | Purchased |
|---|---|---|---|---|---|
| 0 | 15624510 | Male | 19 | 19000 | 0 |
| 1 | 15810944 | Male | 35 | 20000 | 0 |
| 2 | 15668575 | Female | 26 | 43000 | 0 |
| 3 | 15603246 | Female | 27 | 57000 | 0 |
| 4 | 15804002 | Male | 19 | 76000 | 0 |


<h2 style='text-align:center;'>Splitting the Dataset</h2>

In [2]:

from sklearn.model_selection import train_test_split

# Splitting into training and testing sets (80-20 split)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=0)

X_train.shape, X_test.shape  # ➤ Display dataset dimensions


((320, 2), (80, 2))
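When the classes are imbalanced, a plain random split can leave the test set with a different class ratio than the training set. The sketch below shows the optional `stratify` argument of `train_test_split` on synthetic stand-in data; the arrays `X_demo` and `y_demo` and the 30% positive rate are assumptions, not properties of the Social Network Ads file.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for the notebook's data: 400 rows, 2 features, 30% positives
X_demo = rng.normal(size=(400, 2))
y_demo = np.array([1] * 120 + [0] * 280)

# stratify=y_demo keeps the class ratio (almost) identical in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=0, stratify=y_demo
)

print("Shapes:", X_tr.shape, X_te.shape)
print("Positive rate (train/test):", y_tr.mean(), y_te.mean())
```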

<h2 style='text-align:center;'>Training the Random Forest Model</h2>

In [3]:

from sklearn.ensemble import RandomForestClassifier

# Creating classifier with entropy criterion
classifier = RandomForestClassifier(max_depth=4, n_estimators=60, criterion='entropy', random_state=0)
classifier.fit(X_train, y_train)  # ➤ Model trained
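To see what the hyperparameters control, it can help to inspect a fitted forest. The following sketch uses synthetic two-feature data from `make_classification` as a stand-in for Age and EstimatedSalary (an assumption, since the CSV is not bundled here) and reads back the number of trees, the deepest tree, and the feature importances.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic two-feature data standing in for Age / EstimatedSalary (assumption)
X_demo, y_demo = make_classification(
    n_samples=400, n_features=2, n_informative=2, n_redundant=0, random_state=0
)

forest = RandomForestClassifier(
    max_depth=4, n_estimators=60, criterion="entropy", random_state=0
)
forest.fit(X_demo, y_demo)

n_trees = len(forest.estimators_)  # one fitted tree per estimator
deepest = max(tree.get_depth() for tree in forest.estimators_)  # bounded by max_depth
importances = forest.feature_importances_  # normalized to sum to 1

print("Trees in the forest:", n_trees)
print("Deepest tree:", deepest)
print("Feature importances:", importances)
```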


<h2 style='text-align:center;'>Predicting Test Results</h2>

In [4]:

# Predicting the Test set results
y_pred = classifier.predict(X_test)
y_pred[:10]  # ➤ Display first 10 predictions


array([0, 0, 0, 0, 0, 0, 0, 1, 0, 1], dtype=int64)
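Hard labels hide how confident the forest is. As a rough illustration on synthetic stand-in data (`X_demo` and `y_demo` are assumptions), `predict_proba` returns the class probabilities averaged over the trees, and the label from `predict` is the class with the highest averaged probability.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic two-feature data standing in for the notebook's features (assumption)
X_demo, y_demo = make_classification(
    n_samples=400, n_features=2, n_informative=2, n_redundant=0, random_state=0
)
forest = RandomForestClassifier(max_depth=4, n_estimators=60, random_state=0)
forest.fit(X_demo, y_demo)

proba = forest.predict_proba(X_demo[:10])  # shape (10, 2): one column per class
labels = forest.predict(X_demo[:10])

# The predicted label is the class with the highest averaged probability
labels_from_proba = forest.classes_[np.argmax(proba, axis=1)]

print("Probabilities:\n", proba)
print("Labels match argmax of probabilities:", np.array_equal(labels, labels_from_proba))
```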

<h2 style='text-align:center;'>Model Evaluation</h2>

In [5]:

from sklearn.metrics import confusion_matrix, accuracy_score

# Confusion Matrix
cm = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:\n", cm)  # ➤ Display confusion matrix

# Accuracy Score
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy Score:", accuracy)  # ➤ Display accuracy


Confusion Matrix:
 [[55  3]
 [ 1 21]]
Accuracy Score: 0.95
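Accuracy alone can hide how the errors are distributed between the classes. The sketch below recomputes accuracy from the confusion matrix printed above and derives precision, recall, and F1 for the positive class, relying on scikit-learn's convention that rows are true classes and columns are predicted classes.

```python
import numpy as np

# Confusion matrix reported above: rows = true class, columns = predicted class
cm = np.array([[55, 3],
               [1, 21]])

tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()  # (21 + 55) / 80
precision = tp / (tp + fp)       # 21 / 24
recall = tp / (tp + fn)          # 21 / 22
f1 = 2 * precision * recall / (precision + recall)

print(f"Accuracy:  {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall:    {recall:.4f}")
print(f"F1 score:  {f1:.4f}")
```

Here the model misses 1 of the 22 actual buyers and wrongly flags 3 of the 58 non-buyers.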


<h2 style='text-align:center;'>Bias and Variance Analysis</h2>

In [6]:

# Training accuracy: a low value here would suggest high bias (underfitting)
train_accuracy = classifier.score(X_train, y_train)

# Testing accuracy: a value far below training accuracy would suggest high variance (overfitting)
test_accuracy = classifier.score(X_test, y_test)

print("Training Accuracy:", train_accuracy)  # ➤ Display training accuracy
print("Testing Accuracy:", test_accuracy)    # ➤ Display testing accuracy


Training Accuracy: 0.921875
Testing Accuracy: 0.95
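The gap between training and testing accuracy is easier to interpret next to a baseline. As a hedged sketch on synthetic, deliberately noisy data (the dataset and the `gaps` dictionary are assumptions), the code below compares an unrestricted single decision tree with a depth-limited forest; the single tree typically memorizes the training set and shows the larger gap.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, deliberately noisy data (flip_y mislabels about 10% of rows); an assumption
X_demo, y_demo = make_classification(
    n_samples=400, n_features=2, n_informative=2, n_redundant=0,
    flip_y=0.1, random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.20, random_state=0)

models = {
    "single tree": DecisionTreeClassifier(random_state=0),  # unrestricted depth
    "random forest": RandomForestClassifier(max_depth=4, n_estimators=60, random_state=0),
}

gaps = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    train_acc = model.score(X_tr, y_tr)
    test_acc = model.score(X_te, y_te)
    gaps[name] = train_acc - test_acc  # a large positive gap hints at overfitting
    print(f"{name}: train={train_acc:.3f} test={test_acc:.3f} gap={gaps[name]:.3f}")
```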



<h2 style="text-align:center;">Summary</h2>

In this notebook, we:  
1. Loaded the **Social Network Ads dataset**.  
2. Trained a **Random Forest Classifier** using entropy as the criterion.  
3. Evaluated the model using **confusion matrix** and **accuracy score**.  
4. Compared **training and testing accuracy** as a rough check for overfitting/underfitting.  

**Key Takeaways**:  
- Random Forest reduces variance compared to a single decision tree.  
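- On this split, the model reaches about 92% training accuracy and 95% testing accuracy, which shows no sign of overfitting.  
- Increasing the number of trees (`n_estimators`) usually makes predictions more stable, with diminishing returns and higher training cost.  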
- Random Forest is widely used due to its robustness and high accuracy.
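
The takeaway about `n_estimators` can be checked empirically. This sketch refits forests of different sizes with several random seeds on synthetic stand-in data (the data, the seed range, and the `spread` dictionary are assumptions) and reports the mean and standard deviation of test accuracy; the spread across seeds usually shrinks as trees are added, though that is not guaranteed on every dataset.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic, noisy two-feature data standing in for the notebook's dataset (assumption)
X_demo, y_demo = make_classification(
    n_samples=400, n_features=2, n_informative=2, n_redundant=0,
    flip_y=0.1, random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.20, random_state=0)

spread = {}
for n_trees in (1, 10, 60):
    # Refit with different seeds; only the forest's internal randomness changes
    scores = [
        RandomForestClassifier(n_estimators=n_trees, max_depth=4, random_state=seed)
        .fit(X_tr, y_tr)
        .score(X_te, y_te)
        for seed in range(5)
    ]
    spread[n_trees] = (float(np.mean(scores)), float(np.std(scores)))
    print(f"n_estimators={n_trees}: mean={spread[n_trees][0]:.3f} std={spread[n_trees][1]:.3f}")
```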
