# Ensemble Machine Learning Algorithms

## Table of Contents
1. Introduction to Ensemble Learning
2. Why Ensemble Methods Work
3. Types of Ensemble Methods
4. Bagging Methods
5. Boosting Methods
6. Stacking
7. Voting Classifiers
8. Practical Implementation with Code
9. When to Use Which Method

## 1. Introduction to Ensemble Learning

**What is Ensemble Learning?**
Ensemble learning is a machine learning technique that combines multiple models (called "base learners" or "weak learners") to create a stronger, more accurate predictive model. Think of it like asking multiple experts for their opinion and combining their answers to get a better decision.

**Key Terminology:**
- **Base Learner/Weak Learner**: Individual model in the ensemble (e.g., a single decision tree)
- **Strong Learner**: The combined ensemble model
- **Homogeneous Ensemble**: All base learners are of the same type
- **Heterogeneous Ensemble**: Base learners are of different types

## 2. Why Ensemble Methods Work

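Prediction error has three components. Ensemble methods work by reducing the first two; the third sets a floor that no model can go below:

1. **Bias**: Error from overly simplistic assumptions. Boosting mainly targets this.
2. **Variance**: Error from sensitivity to variations in the training data. Bagging mainly targets this.
3. **Noise**: Random error inherent in the data. This part is irreducible, so an ensemble cannot remove it.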

**The Wisdom of Crowds Principle:**
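If multiple models make errors that are largely independent of one another, those errors tend to cancel out when the predictions are combined, so the ensemble is more accurate than a typical individual model. The benefit shrinks as the models' errors become more correlated.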
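The cancellation effect can be illustrated with a small simulation. This is a minimal sketch under idealized assumptions: the ensemble size, the per-model accuracy, and the fully independent errors are all hypothetical values chosen for illustration, and real models trained on the same data are rarely this independent.

```python
import numpy as np

rng = np.random.default_rng(0)

n_models = 25       # hypothetical ensemble size (odd, so votes cannot tie)
p_correct = 0.6     # hypothetical accuracy of each individual model
n_samples = 10_000  # number of simulated predictions

# Each entry is True when that model predicts that sample correctly.
# Errors are drawn independently for every model (the idealized assumption).
correct = rng.random((n_samples, n_models)) < p_correct

single_accuracy = correct[:, 0].mean()
# The majority vote is right when more than half of the models are right.
ensemble_accuracy = (correct.sum(axis=1) > n_models / 2).mean()

print(f"Single model accuracy:  {single_accuracy:.3f}")
print(f"Majority vote accuracy: {ensemble_accuracy:.3f}")
```

Under these assumptions the majority vote is noticeably more accurate than any single model, even though each model is only modestly better than chance.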


## 3. Types of Ensemble Methods

There are three main categories:

1. **Bagging (Bootstrap Aggregating)**: Train models in parallel, each on a different bootstrap sample of the data
2. **Boosting**: Train models sequentially, each correcting the errors of the previous models
3. **Stacking**: Combine different types of models using a meta-learner
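The three categories map onto separate scikit-learn estimators. The sketch below is only an illustration: the synthetic dataset, the choice of `GradientBoostingClassifier` as the boosting example, the base models, and all hyperparameters are assumptions made here, not recommendations from this guide.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    BaggingClassifier,
    GradientBoostingClassifier,
    StackingClassifier,
)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, used only for illustration
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

models = {
    # Bagging: many trees, each trained independently on a bootstrap sample
    "bagging": BaggingClassifier(
        DecisionTreeClassifier(), n_estimators=25, random_state=0
    ),
    # Boosting: shallow trees added one after another to fix remaining errors
    "boosting": GradientBoostingClassifier(n_estimators=50, random_state=0),
    # Stacking: different model types combined by a meta-learner
    "stacking": StackingClassifier(
        estimators=[
            ("tree", DecisionTreeClassifier(max_depth=3, random_state=0)),
            ("knn", KNeighborsClassifier()),
        ],
        final_estimator=LogisticRegression(),
    ),
}

# Mean 5-fold cross-validated accuracy for each ensemble type
scores = {
    name: cross_val_score(model, X, y, cv=5).mean()
    for name, model in models.items()
}
for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```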


## 4. Bagging Methods

### 4.1 Concept
Bagging creates multiple subsets of training data by random sampling with replacement (bootstrapping), trains a model on each subset, and combines predictions by voting (classification) or averaging (regression).

**Key Benefits:**
- Reduces variance
- Helps prevent overfitting
- Works well with high-variance models like decision trees
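The procedure described above can be written out by hand in a few lines. This is a minimal sketch on synthetic data (the dataset and the number of models are assumptions for illustration); in practice scikit-learn's `BaggingClassifier` handles these steps.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (labels are 0 and 1)
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

rng = np.random.default_rng(0)
n_models = 25  # odd, so a majority vote cannot tie
all_preds = []

for _ in range(n_models):
    # Bootstrap: draw len(X_train) row indices with replacement
    idx = rng.integers(0, len(X_train), size=len(X_train))
    tree = DecisionTreeClassifier(random_state=0)
    tree.fit(X_train[idx], y_train[idx])
    all_preds.append(tree.predict(X_test))

votes = np.array(all_preds)  # shape: (n_models, n_test_samples)
# Majority vote: predict 1 when more than half of the trees predict 1
bagged_pred = (votes.mean(axis=0) > 0.5).astype(int)
bagged_accuracy = (bagged_pred == y_test).mean()

print(f"Bagged accuracy: {bagged_accuracy:.3f}")
```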

### 4.2 Random Forest

Random Forest is the most popular bagging algorithm. It builds multiple decision trees and combines their predictions.

**How it Works:**
1. Create multiple bootstrap samples from training data
2. For each sample, build a decision tree
3. At each split, consider only a random subset of features
4. Combine predictions by majority voting (classification) or averaging (regression)
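A fitted forest exposes its individual trees, which makes the combination step easy to inspect. The sketch below uses the Iris dataset and arbitrary settings as assumptions. One detail worth knowing: scikit-learn's `RandomForestClassifier` combines trees by averaging their predicted class probabilities rather than by counting hard votes.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

# max_features="sqrt" limits each split to a random subset of the features
rf = RandomForestClassifier(n_estimators=50, max_features="sqrt", random_state=0)
rf.fit(X, y)

# rf.estimators_ holds the fitted decision trees
tree_probas = np.array([tree.predict_proba(X) for tree in rf.estimators_])

# Average the per-tree class probabilities by hand
manual_proba = tree_probas.mean(axis=0)

print("Number of trees:", len(rf.estimators_))
print("Matches forest output:", np.allclose(manual_proba, rf.predict_proba(X)))
```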

**Important Parameters:**
- `n_estimators`: Number of trees in the forest
- `max_depth`: Maximum depth of each tree
- `max_features`: Number of features to consider at each split
- `bootstrap`: Whether to use bootstrap samples
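The parameters above can be set together as in the sketch below. The dataset and the specific values are assumptions for illustration. The sketch also uses `oob_score=True`, a scikit-learn option not listed above: it scores each tree on the rows left out of its bootstrap sample, so it only works when `bootstrap=True`.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()
X, y = iris.data, iris.target

rf = RandomForestClassifier(
    n_estimators=200,     # number of trees
    max_depth=4,          # cap on the depth of each tree
    max_features="sqrt",  # features considered at each split
    bootstrap=True,       # train each tree on a bootstrap sample
    oob_score=True,       # evaluate on the out-of-bag rows
    random_state=0,
)
rf.fit(X, y)

print(f"Out-of-bag accuracy: {rf.oob_score_:.3f}")

# Impurity-based feature importances (they sum to 1)
importances = rf.feature_importances_
for name, importance in zip(iris.feature_names, importances):
    print(f"{name}: {importance:.3f}")
```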



In [2]:
import pandas as pd

data = {
    "CustomerID": [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15],
    "Age": [25,45,52,23,34,48,29,41,55,38,50,27,60,33,49],
    "Salary": [30000,60000,52000,28000,45000,80000,32000,61000,72000,47000,69000,35000,90000,44000,71000],
    "Years_with_Company": [1,5,10,2,3,12,1,4,15,6,8,2,20,3,9],
    "Num_Products": [1,2,3,1,2,4,1,2,3,2,3,1,4,2,3],
    "IsActive": [0,1,1,0,1,1,0,1,1,1,1,0,1,1,1],
    "Churn": ["No","No","Yes","No","No","Yes","No","No","Yes","No","Yes","No","Yes","No","Yes"]
}

df = pd.DataFrame(data)

df.to_csv("customer.csv", index=False)

print("CSV file created successfully!")


CSV file created successfully!


## **Random Forest:**

## *Classification*

In [3]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import numpy as np

# Load the customer churn dataset
df = pd.read_csv('customer.csv')
df


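| | CustomerID | Age | Salary | Years_with_Company | Num_Products | IsActive | Churn |
|---|---|---|---|---|---|---|---|
| 0 | 1 | 25 | 30000 | 1 | 1 | 0 | No |
| 1 | 2 | 45 | 60000 | 5 | 2 | 1 | No |
| 2 | 3 | 52 | 52000 | 10 | 3 | 1 | Yes |
| 3 | 4 | 23 | 28000 | 2 | 1 | 0 | No |
| 4 | 5 | 34 | 45000 | 3 | 2 | 1 | No |
| 5 | 6 | 48 | 80000 | 12 | 4 | 1 | Yes |
| 6 | 7 | 29 | 32000 | 1 | 1 | 0 | No |
| 7 | 8 | 41 | 61000 | 4 | 2 | 1 | No |
| 8 | 9 | 55 | 72000 | 15 | 3 | 1 | Yes |
| 9 | 10 | 38 | 47000 | 6 | 2 | 1 | No |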


In [4]:

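# Feature matrix. Note: CustomerID is an identifier rather than a predictive
# feature and would normally be dropped.
X = df[['CustomerID','Age','Salary','Years_with_Company','Num_Products','IsActive']]
# Select the target as a 1-D Series; a one-column DataFrame makes fit() emit a DataConversionWarning
y = df['Churn']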
# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Create Random Forest Classifier
rf_classifier = RandomForestClassifier(
    n_estimators=100,      # Number of trees
    max_depth=3,           # Maximum depth of trees
    random_state=42,
    n_jobs=-1              # Use all CPU cores
)

# Train the model
rf_classifier.fit(X_train, y_train)

# Make predictions
y_pred = rf_classifier.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f"Random Forest Accuracy: {accuracy:.4f}")
print("Classification Report:")
print(classification_report(y_test, y_pred))



Random Forest Accuracy: 1.0000
Classification Report:
              precision    recall  f1-score   support

          No       1.00      1.00      1.00         4
         Yes       1.00      1.00      1.00         1

    accuracy                           1.00         5
   macro avg       1.00      1.00      1.00         5
weighted avg       1.00      1.00      1.00         5





## *Regression*

In [5]:
import pandas as pd

data = {
    "HouseID": [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15],
    "Area_sqft": [850,1200,950,1500,700,1300,1600,1100,900,1400,1750,1000,1250,700,1550],
    "Bedrooms": [2,3,2,4,1,3,4,3,2,3,4,2,3,1,4],
    "Bathrooms": [1,2,1,3,1,2,3,2,1,2,3,1,2,1,3],
    "Age": [10,5,15,8,20,7,4,12,18,6,3,14,9,22,5],
    "Distance_to_City_km": [8,5,10,3,12,6,4,9,11,7,2,9,6,13,4],
    "HousePrice": [520000,780000,460000,1050000,350000,850000,1150000,690000,410000,920000,1300000,480000,830000,330000,1120000]
}

df = pd.DataFrame(data)
df.to_csv("house_sample.csv", index=False)

print("CSV file created successfully!")


CSV file created successfully!


In [6]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.datasets import load_diabetes
from sklearn.metrics import mean_squared_error, r2_score
import matplotlib.pyplot as plt

# Load the dataset into df, the name the next cell uses
df = pd.read_csv('house_sample.csv')
df


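| | HouseID | Area_sqft | Bedrooms | Bathrooms | Age | Distance_to_City_km | HousePrice |
|---|---|---|---|---|---|---|---|
| 0 | 1 | 850 | 2 | 1 | 10 | 8 | 520000 |
| 1 | 2 | 1200 | 3 | 2 | 5 | 5 | 780000 |
| 2 | 3 | 950 | 2 | 1 | 15 | 10 | 460000 |
| 3 | 4 | 1500 | 4 | 3 | 8 | 3 | 1050000 |
| 4 | 5 | 700 | 1 | 1 | 20 | 12 | 350000 |
| 5 | 6 | 1300 | 3 | 2 | 7 | 6 | 850000 |
| 6 | 7 | 1600 | 4 | 3 | 4 | 4 | 1150000 |
| 7 | 8 | 1100 | 3 | 2 | 12 | 9 | 690000 |
| 8 | 9 | 900 | 2 | 1 | 18 | 11 | 410000 |
| 9 | 10 | 1400 | 3 | 2 | 6 | 7 | 920000 |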


In [7]:

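# Feature matrix (HouseID is an identifier and is left out)
X = df[['Area_sqft','Bedrooms','Bathrooms','Age','Distance_to_City_km']]
# Select the target as a 1-D Series; a one-column DataFrame makes fit() emit a DataConversionWarning
y = df['HousePrice']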

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Create Random Forest Regressor
rf_regressor = RandomForestRegressor(
    n_estimators=100,
    max_depth=5,
    random_state=42
)

# Train the model
rf_regressor.fit(X_train, y_train)

# Make predictions
y_pred = rf_regressor.predict(X_test)

# Evaluate
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"Random Forest Regression MSE: {mse:.4f}")
print(f"R² Score: {r2:.4f}")

Random Forest Regression MSE: 2485832000.0000
R² Score: 0.9516
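An MSE in the billions is hard to read because it is expressed in squared price units. Taking the square root gives the RMSE, which is in the same units as `HousePrice`. The sketch below simply reuses the MSE value printed above; keep in mind that the test set holds only five houses, so both figures are rough estimates.

```python
import numpy as np

mse = 2485832000.0  # MSE reported in the output above

# RMSE is in the same units as the target (house price)
rmse = np.sqrt(mse)
print(f"RMSE: {rmse:,.0f}")
```

For reference, the house prices in this dataset range from 330,000 to 1,300,000.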


