# **Bagging**
Let's imagine you have a big box of crayons with lots of different colors. You want to draw a picture, but you want to make sure it looks really nice. Instead of just using one crayon, you decide to use a few different crayons to make your picture even better. 

**Ensemble Learning** is like using a group of crayons to make a better picture. Instead of just one crayon (which is like one model), you use many crayons (which are like many models) to help you create something great!

**Bagging** is a special way of using those crayons. Here's how it works:

1. **Choose a few crayons**: You pick some crayons from the box, but you don't just pick them once. You keep picking different crayons over and over again. Sometimes you might pick the same crayon more than once!

2. **Draw with each set**: Each time you pick a new set of crayons, you draw a different picture. Some pictures might be really colorful, and some might be a little messy, but that's okay!

3. **Look at all the pictures**: After you've drawn a bunch of pictures, you look at them all. You notice that some pictures are better than others.

4. **Combine the pictures**: Instead of keeping just one picture, you blend all of them together into one super awesome picture, so the messy bits in any single drawing get smoothed out!

In the world of computers, instead of crayons, we have models that try to guess or predict things. Bagging helps us by using many models to make better guesses. Just like you used different crayons to make a better picture, bagging uses different models to make better predictions!

### Example:
Let's say you want to guess how many candies are in a jar.

- **One guess**: If you just ask one friend, they might guess 50 candies.
- **More guesses**: If you ask 5 friends, one might say 40, another 60, another 55, and so on.
- **Average it out**: Instead of just taking one guess, you can take all their guesses, add them up, and then divide by how many friends you asked. This way, you get a better idea of how many candies are really in the jar!

So, bagging is like asking many friends for their guesses and then using all those guesses to make a better guess!
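
To make the candy example concrete, here is a small sketch. The first four guesses come from the example above; the fifth (65) is a made-up value added only so that there are five friends.

```python
from statistics import mean

# 50, 40, 60 and 55 appear in the example; 65 is an assumed fifth guess.
guesses = [50, 40, 60, 55, 65]

single_guess = guesses[0]        # asking only one friend
ensemble_guess = mean(guesses)   # asking everyone and averaging

print("One friend:", single_guess)
print("Average of five friends:", ensemble_guess)
```

The average pulls the high and low guesses toward the middle, which is the same thing bagging does with model predictions in a regression task.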

# **Boosting**

Imagine you're trying to draw a really beautiful picture, but sometimes you make mistakes. Instead of just using a bunch of crayons like in bagging, boosting is a little different. It's like having a special helper who watches you draw and gives you tips to make your picture better!

Here's how boosting works:

1. **Start with a drawing**: You begin by drawing your picture with your first crayon. Maybe you draw a tree, but it doesn't look quite right.

2. **Learn from mistakes**: Your helper looks at your drawing and points out what you could do better. For example, they might say, "The tree is too small!"

3. **Try again**: You take that advice and draw another tree, making it bigger this time. But maybe you still make a mistake, like the leaves are too dark.

4. **Keep improving**: Your helper keeps giving you tips based on what you did wrong. Each time you draw, you focus on fixing the mistakes from the last drawing.

5. **Combine everything**: After you've drawn several trees, you look at all of them together. You take the best parts from each tree and combine them to make one really beautiful tree!

In the world of computers, boosting works similarly. Instead of just using one model, boosting creates a series of models where each new model tries to fix the mistakes of the previous ones. 

### Example:
Let's say you're trying to guess how many candies are in a jar again.

1. **First guess**: You ask your first friend, and they guess 50 candies.
2. **Check the answer**: You find out there are actually 70 candies. Your friend was a bit off.
3. **Learn from it**: Now, you ask another friend, but this time you tell them, "My first friend guessed 50, but it was too low. Can you guess a bit higher?"
4. **Second guess**: Your second friend might guess 80 candies. They were too high, but they learned from the first guess.
5. **Keep going**: You keep asking more friends, and each time, they learn from the previous guesses to make better guesses.

So, boosting is like having a team of friends who learn from each other's mistakes to make better guesses together! Each new guess is smarter because it's trying to fix what went wrong before.
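
The idea of "fixing what went wrong before" can be sketched in a few lines. This is a deliberately simplified picture of boosting, not a real library implementation: each round corrects only part of the remaining error. The learning rate of 0.5 and the five rounds are illustrative assumptions; the true count (70) and the first guess (50) come from the example above.

```python
# Simplified boosting sketch: every new "friend" corrects part of the error
# left over by the running guess. During training the true answer is known,
# which is why the remaining error can be measured at each step.
true_count = 70       # actual number of candies (from the example)
prediction = 50.0     # first friend's guess (from the example)
learning_rate = 0.5   # assumed: how much of the remaining error each round fixes

history = [prediction]
for round_number in range(1, 6):
    residual = true_count - prediction       # what is still wrong
    prediction += learning_rate * residual   # correct part of the mistake
    history.append(prediction)
    print(f"round {round_number}: guess = {prediction:.2f}")
```

Each guess lands closer to 70 than the one before it, which mirrors how gradient boosting adds models that fit the errors of the ensemble built so far.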

## **Resampling With Replacement**

Let's use a fun example to explain **resampling with replacement**.

Imagine you have a big jar filled with colorful jellybeans. You want to taste some jellybeans, but you want to make sure you get a good mix of flavors. Here's how resampling with replacement works:

1. **Pick a jellybean**: You reach into the jar and grab one jellybean. Let's say you pick a red one.

2. **Taste it**: You taste the red jellybean. Yum! But now, instead of keeping it out of the jar, you put it back in. So, the jar still has the same number of jellybeans as before.

3. **Pick again**: You reach in again and grab another jellybean. This time, you might pick a green one.

4. **Repeat**: You keep picking jellybeans, tasting them, and putting them back in the jar. Sometimes you might pick the same jellybean more than once, and sometimes you might get different colors.

### Why is this important?
Because every jellybean goes back into the jar, the jar is the same for every pick. That lets you draw as many samples as you like, and the same jellybean can show up more than once. Comparing many such samples helps you understand the mix of flavors in the jar better!

### In the world of data:
In statistics and machine learning, resampling with replacement is often used to create different samples from a dataset. For example, if you have a list of numbers or data points, you can randomly pick some of them to create a new sample. Since you put the data points back after picking them, you might end up with some data points appearing more than once in your new sample.

### Example in practice:
Let's say you have a small dataset of numbers: [1, 2, 3]. If you resample with replacement, you might get:

- First sample: [1, 2, 1]
- Second sample: [3, 2, 2]
- Third sample: [1, 3, 3]

Each time you create a new sample, you can have the same number appear more than once, and that helps you understand the data better!

So, resampling with replacement is like tasting jellybeans from a jar where you can put them back after tasting, allowing you to explore different combinations and flavors!
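
Here is a minimal sketch of the same idea with NumPy. The seed is an arbitrary choice for repeatability, and the samples you get will generally differ from the three listed above.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # arbitrary seed, only for repeatability
data = np.array([1, 2, 3])

# Draw three samples, each the same size as the original data.
# replace=True puts every value "back in the jar" before the next pick.
samples = [rng.choice(data, size=len(data), replace=True) for _ in range(3)]

for i, sample in enumerate(samples, start=1):
    print(f"Sample {i}: {sample.tolist()}")
```

Setting `replace=False` instead would only shuffle the data, because every value could be picked just once.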

Bagging draws many bootstrap samples (random samples taken with replacement) from the dataset, trains a separate model on each one, and combines their outputs. Any single model may be unreliable on its own, but the combined result is usually more accurate. This is known as ***bootstrap aggregation***, which is where the name bagging comes from.


# **Bootstrap Aggregation (Bagging): Explained Simply 🎯**  

### **🔹 What is Bootstrap Aggregation (Bagging)?**  
**Bagging (Bootstrap Aggregation)** is an **ensemble learning** technique that improves accuracy and reduces overfitting by training multiple models on **bootstrap samples** (random samples drawn with replacement from the training set) and combining their predictions by voting or averaging.  

It is most often used with **decision trees** and is the foundation of **Random Forests**. 🌲🌲🌲  

---

### **🔹 How Bagging Works (Step-by-Step)**  
1️⃣ **Bootstrap Sampling**: Randomly select **multiple subsets** (with replacement) from the training dataset.  
2️⃣ **Train Multiple Models**: Train a separate model (e.g., decision tree) on each subset.  
3️⃣ **Aggregation (Voting or Averaging)**: Combine predictions from all models:  
   - **For classification** → Use **majority voting**.  
   - **For regression** → Use **averaging**.  
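
The three steps above can be sketched by hand before reaching for a ready-made class. This is an illustrative version on the Iris dataset, not how Scikit-Learn implements bagging internally; the number of models (25) and the seeds are arbitrary assumptions.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

rng = np.random.default_rng(0)
n_models = 25  # assumed value, chosen only for illustration
trees = []

for _ in range(n_models):
    # Step 1: bootstrap sample (row indices drawn with replacement)
    idx = rng.integers(0, len(X_train), size=len(X_train))
    # Step 2: train one model on that sample
    tree = DecisionTreeClassifier(random_state=0)
    tree.fit(X_train[idx], y_train[idx])
    trees.append(tree)

# Step 3: majority vote over the trees for every test row
all_preds = np.stack([t.predict(X_test) for t in trees])  # (n_models, n_test)
votes = np.apply_along_axis(
    lambda col: np.bincount(col, minlength=3).argmax(), 0, all_preds
)

accuracy = (votes == y_test).mean()
print("Hand-rolled bagging accuracy:", accuracy)
```

For a regression task, the voting step would be replaced by `all_preds.mean(axis=0)`.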

---

### **🔹 Why Use Bagging?**  
✅ **Reduces Overfitting**: Helps control variance by smoothing predictions.  
✅ **Improves Stability**: Works well even on noisy data.  
✅ **Works Best with High-Variance Models**: Decision trees and other complex models benefit the most.  
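
A quick simulation shows why averaging controls variance. It assumes ten models whose errors are independent, which is an idealization: models trained on bootstrap samples of the same data are correlated, so the reduction seen in practice is smaller. All numbers below are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
true_value = 70.0   # assumed target
noise_std = 10.0    # assumed spread of a single model's error
n_trials = 10_000
n_models = 10

# Each row is one trial; each column is one noisy "model" prediction.
predictions = true_value + rng.normal(0.0, noise_std, size=(n_trials, n_models))

single_var = predictions[:, 0].var()            # one model on its own
averaged_var = predictions.mean(axis=1).var()   # average of all ten models

print(f"Variance of a single model:       {single_var:.1f}")
print(f"Variance of the 10-model average: {averaged_var:.1f}")
```

With independent errors, the variance of the average is roughly the single-model variance divided by the number of models.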

---

### **🔹 Example: Bagging with Decision Trees**  
Let's implement **BaggingClassifier** using **Scikit-Learn**.

```python
from sklearn.model_selection import train_test_split
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score

# Load dataset
iris = load_iris()
X = iris.data
y = iris.target

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a BaggingClassifier with Decision Trees
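# Note: scikit-learn 1.2 renamed `base_estimator` to `estimator`;
# the old name was removed in 1.4.
bagging_model = BaggingClassifier(
    estimator=DecisionTreeClassifier(),  # Base model trained on each bootstrap sample
    n_estimators=10,  # Number of models
    random_state=42
)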
bagging_model.fit(X_train, y_train)

# Predictions
y_pred = bagging_model.predict(X_test)

# Accuracy
print("Bagging Model Accuracy:", accuracy_score(y_test, y_pred))
```

---

### **🔹 Bagging vs Boosting (Difference)**
| Feature | Bagging | Boosting |
|---------|---------|----------|
| Goal | Reduce variance (overfitting) | Reduce bias (underfitting) |
| Training | Independent models | Sequential models (learn from mistakes) |
| Example | Random Forest | AdaBoost, Gradient Boosting |
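
As a rough illustration of the table, the sketch below cross-validates one bagging model and one boosting model on the Iris dataset. The dataset, the number of estimators, and the seeds are assumptions made for this example, and the scores say nothing general about which method is better.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

models = {
    # Bagging: independent, fully grown trees on bootstrap samples
    "Bagging (deep trees)": BaggingClassifier(
        DecisionTreeClassifier(), n_estimators=50, random_state=0
    ),
    # Boosting: shallow trees (the AdaBoost default) trained one after another
    "AdaBoost (default stumps)": AdaBoostClassifier(n_estimators=50, random_state=0),
}

results = {}
for name, model in models.items():
    results[name] = cross_val_score(model, X, y, cv=5).mean()
    print(f"{name}: {results[name]:.3f}")
```

The base estimator is passed positionally to `BaggingClassifier` so the snippet does not depend on whether the installed version calls that parameter `estimator` or `base_estimator`.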

---

### **🔹 Summary**
✔ Bagging is an ensemble method that **trains multiple models on bootstrap samples** (random subsets drawn with replacement) and **combines their predictions**.  
✔ It **reduces variance** and helps prevent **overfitting**, especially for **decision trees**.  
✔ **Random Forest** builds on Bagging: it bags **multiple decision trees** and also considers only a random subset of features at each split.  



![image.png](attachment:image.png)
![image-2.png](attachment:image-2.png)

In [1]:
from sklearn.model_selection import train_test_split
import pandas as pd

df = pd.read_csv('diabetes.csv')
df.head()

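| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |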


In [2]:
df.isnull().sum()

Pregnancies                 0
Glucose                     0
BloodPressure               0
SkinThickness               0
Insulin                     0
BMI                         0
DiabetesPedigreeFunction    0
Age                         0
Outcome                     0
dtype: int64

In [3]:
df.describe()

| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| count | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 |
| mean | 3.845052 | 120.894531 | 69.105469 | 20.536458 | 79.799479 | 31.992578 | 0.471876 | 33.240885 | 0.348958 |
| std | 3.369578 | 31.972618 | 19.355807 | 15.952218 | 115.244002 | 7.88416 | 0.331329 | 11.760232 | 0.476951 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.078 | 21.0 | 0.0 |
| 25% | 1.0 | 99.0 | 62.0 | 0.0 | 0.0 | 27.3 | 0.24375 | 24.0 | 0.0 |
| 50% | 3.0 | 117.0 | 72.0 | 23.0 | 30.5 | 32.0 | 0.3725 | 29.0 | 0.0 |
| 75% | 6.0 | 140.25 | 80.0 | 32.0 | 127.25 | 36.6 | 0.62625 | 41.0 | 1.0 |
| max | 17.0 | 199.0 | 122.0 | 99.0 | 846.0 | 67.1 | 2.42 | 81.0 | 1.0 |


The summary statistics look reasonable, so we skip outlier handling here.

In [4]:
df.Outcome.value_counts()

Outcome
0    500
1    268
Name: count, dtype: int64

In [5]:
268/500

0.536

In [6]:
X = df.drop('Outcome',axis='columns')
y = df.Outcome

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

In [7]:
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(X_scaled,y,stratify=y,random_state=20)

In [8]:
X_train.shape

(576, 8)

In [9]:
X_test.shape

(192, 8)

In [10]:
y_train.value_counts()

Outcome
0    375
1    201
Name: count, dtype: int64

In [11]:
201/375

0.536

In [12]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

scores = cross_val_score(DecisionTreeClassifier(),X,y,cv=5)
scores

array([0.66883117, 0.66883117, 0.67532468, 0.79738562, 0.7124183 ])

In [13]:
scores.mean()

0.7045581869111281

In [23]:
from sklearn.ensemble import BaggingClassifier
bag_model = BaggingClassifier(
    estimator=DecisionTreeClassifier(),
    n_estimators=100,
    max_samples=0.8,
    oob_score=True,  # estimate accuracy from the rows each tree did not see (out-of-bag)
    random_state=0
)
bag_model.fit(X_train,y_train)

In [24]:
bag_model.oob_score_

0.7517361111111112
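
The out-of-bag score is possible because each tree's bootstrap sample leaves some training rows unseen. The sketch below estimates how many rows one tree misses, assuming the settings used above (576 training rows and `max_samples=0.8`, drawn with replacement). It simulates the sampling with NumPy and does not inspect the fitted `bag_model`.

```python
import numpy as np

rng = np.random.default_rng(0)
n_rows = 576                  # training rows, as reported by X_train.shape
n_draws = int(0.8 * n_rows)   # rows drawn per tree when max_samples=0.8

# Simulate one tree's bootstrap sample and mark the rows it never saw.
drawn = rng.integers(0, n_rows, size=n_draws)
oob_mask = np.ones(n_rows, dtype=bool)
oob_mask[drawn] = False

oob_fraction = oob_mask.mean()
# Probability that a given row is missed by every one of the draws.
expected_fraction = (1 - 1 / n_rows) ** n_draws

print(f"Simulated out-of-bag fraction: {oob_fraction:.3f}")
print(f"Expected out-of-bag fraction:  {expected_fraction:.3f}")
```

Under these assumptions, a bit under half of the rows are out-of-bag for any given tree, so each row can be scored by the many trees that never trained on it.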

In [25]:
bag_model.score(X_test,y_test)

0.7864583333333334

In [26]:
from sklearn.model_selection import cross_val_score

bag_model = BaggingClassifier(
    estimator=DecisionTreeClassifier(),
    n_estimators=100,
    max_samples=0.8,
    oob_score=True,  # estimate accuracy from the rows each tree did not see (out-of-bag)
    random_state=0
)

scores = cross_val_score(bag_model,X,y,cv=5)
scores.mean()

0.7578728461081402

In [29]:
from sklearn.ensemble import RandomForestClassifier
scores = cross_val_score(RandomForestClassifier(),X,y,cv=5)
scores.mean()

0.7669892199303965