# Random Forest

## Topic 1: Concept → Bagging & Ensemble Learning
## Part 1: Ensemble Learning vs Bagging
### 1. Ensemble Learning (general idea)

Think of ensemble learning as: “Don’t trust one model, combine many.”

It’s like asking multiple people instead of one person.

Techniques: Bagging, Boosting, Stacking (all are types of ensemble learning).

👉 Example:
Election Prediction 🗳️

One survey company may predict wrong.

But if you combine 100 different surveys, the combined result (average) is usually more accurate than a typical single survey, because the individual errors partly cancel out.
This is ensemble learning.
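
The survey idea can be sketched with a small simulation. This is only an illustration: the true support (52%), the number of surveys, and the number of people per survey are all made-up assumptions.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

TRUE_SUPPORT = 0.52      # assumed "real" share of voters for one candidate
N_SURVEYS = 100          # number of independent surveys
PEOPLE_PER_SURVEY = 200  # respondents per survey

def run_survey():
    """Poll PEOPLE_PER_SURVEY random voters and return the estimated support."""
    votes = [random.random() < TRUE_SUPPORT for _ in range(PEOPLE_PER_SURVEY)]
    return sum(votes) / PEOPLE_PER_SURVEY

surveys = [run_survey() for _ in range(N_SURVEYS)]

# Typical error of a single survey vs. error of the combined (averaged) estimate
mean_single_error = statistics.mean(abs(s - TRUE_SUPPORT) for s in surveys)
combined_error = abs(statistics.mean(surveys) - TRUE_SUPPORT)

print(f"Average error of one survey:  {mean_single_error:.4f}")
print(f"Error of the combined survey: {combined_error:.4f}")
```

The combined error can never be larger than the average single-survey error, and it is usually much smaller when the surveys err in different directions.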

### 2. Bagging (a type of ensemble)

Bagging (bootstrap aggregating) = “train the same kind of model (like many decision trees), but each on a different random sample of the data, drawn with replacement.”

Each tree is trained on slightly different data → so they don’t all make the same mistakes.

Then we combine their predictions (majority vote for classification, average for regression).
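
The steps above can be sketched by hand with a few lines of NumPy and scikit-learn. This is a minimal sketch on synthetic data (an assumption, not the notebook's dataset); in practice `sklearn.ensemble.BaggingClassifier` does the same job.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary-classification data (an assumption, used only for illustration)
X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

rng = np.random.default_rng(0)
n_trees = 25  # odd number avoids ties in a two-class vote
trees = []

for _ in range(n_trees):
    # Bootstrap sample: draw row indices with replacement, same size as the training set
    idx = rng.integers(0, len(X_train), size=len(X_train))
    tree = DecisionTreeClassifier(random_state=0).fit(X_train[idx], y_train[idx])
    trees.append(tree)

# Each tree votes; the majority class wins (labels here are 0 and 1)
votes = np.stack([t.predict(X_test) for t in trees])  # shape: (n_trees, n_test)
bagged_preds = (votes.mean(axis=0) > 0.5).astype(int)

print("Bagged accuracy:", (bagged_preds == y_test).mean())
```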

👉 Example:
You want to know if a new restaurant is good 🍴.

You ask 10 different food bloggers.

Each blogger randomly tries a few items (not the full menu).

Then you take their combined opinion.
That’s bagging.

✅ Difference:

Ensemble Learning = big umbrella (combine models).

Bagging = one specific technique (train models on random bootstrap samples of the data & combine them).

## 🌟 Part 2: Random Forest vs Decision Tree
### 1. Decision Tree 🌳

A single if-else flowchart.

Simple, interpretable.

But can overfit (memorize training data).

👉 Example:
One doctor decides if you’re sick:

“If fever > 100 °F → sick, else not sick.”

Works, but depends entirely on that one doctor’s judgment.
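
To see the "memorize training data" problem, here is a small sketch on synthetic data with some deliberately noisy labels (both are assumptions for illustration). A tree with no depth limit fits the training set perfectly, noise included, while a depth-1 tree is literally one if-else rule like the doctor's.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic data with about 10% of labels assigned at random (label noise)
X, y = make_classification(n_samples=600, n_features=6, flip_y=0.1, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

# With no depth limit, the tree keeps splitting until every leaf is pure
deep_tree = DecisionTreeClassifier(random_state=1).fit(X_train, y_train)
train_acc = deep_tree.score(X_train, y_train)
test_acc = deep_tree.score(X_test, y_test)

print("Train accuracy:", train_acc)
print("Test accuracy: ", test_acc)

# A depth-1 tree is a single if-else rule, like the one-doctor example
stump = DecisionTreeClassifier(max_depth=1, random_state=1).fit(X_train, y_train)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]
print(export_text(stump, feature_names=feature_names))
```

A large gap between train and test accuracy is the usual sign of overfitting.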

### 2. Random Forest 🌲🌲🌲🌲

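A collection of many decision trees (built using bagging).

Each tree is trained on a random sample of the rows, and each split considers only a random subset of the features.

Final prediction = majority vote (classification) or average (regression).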

👉 Example:
Instead of one doctor, you ask 100 doctors.

Each doctor has seen a slightly different set of past patients (data) and looks at a different subset of symptoms (features).

Then they vote.

The group decision is more reliable than one doctor.
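
A short sketch of how this looks in scikit-learn, on synthetic data (an assumption). `bootstrap` and `max_features` control the random rows and random features. One detail: scikit-learn's forest averages each tree's predicted class probabilities and picks the class with the highest mean, which is a "soft" version of the vote described above. The code checks this by hand.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=400, n_features=10, random_state=0)

forest = RandomForestClassifier(
    n_estimators=100,     # number of trees ("doctors")
    bootstrap=True,       # each tree trains on a bootstrap sample of the rows
    max_features="sqrt",  # each split considers a random subset of the features
    random_state=0,
).fit(X, y)

# The fitted trees are available individually
print("Number of trees:", len(forest.estimators_))

# Average the per-tree class probabilities by hand ...
tree_probs = np.mean([tree.predict_proba(X) for tree in forest.estimators_], axis=0)
manual_preds = forest.classes_[np.argmax(tree_probs, axis=1)]

# ... and check that this matches the forest's own output
probs_match = np.allclose(tree_probs, forest.predict_proba(X))
preds_match = np.array_equal(manual_preds, forest.predict(X))
print("Matches forest.predict_proba:", probs_match)
print("Matches forest.predict:", preds_match)
```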

### ✅ Difference:

| Decision Tree | Random Forest |
|---|---|
| Single model | Many trees (ensemble) |
| High chance of overfitting | Less overfitting |
| Easy to interpret | Harder to interpret |
| Fast training | Slower (many trees) |
| Example: 1 doctor’s opinion | Example: 100 doctors voting |

### 📌 Key Insight:

Decision Tree = simple but risky (overfit).

Random Forest = safer, more stable, and usually more accurate, because averaging many different trees (bagging + random features) cancels out much of each tree’s individual error.


```python
import pandas as pd
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Load Titanic dataset
titanic = sns.load_dataset("titanic")

# Select features and target; .copy() gives an independent DataFrame to modify
X = titanic[["pclass", "age", "sex", "fare"]].copy()
y = titanic["survived"]

# Handle missing values: assign the filled column back instead of using inplace=True
X["age"] = X["age"].fillna(X["age"].median())

# Convert categorical (sex) → numeric
X = pd.get_dummies(X, drop_first=True)

# Train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 🌳 Single Decision Tree
tree_model = DecisionTreeClassifier(max_depth=4, random_state=42)
tree_model.fit(X_train, y_train)
tree_preds = tree_model.predict(X_test)

# 🌲 Random Forest
forest_model = RandomForestClassifier(n_estimators=100, max_depth=4, random_state=42)
forest_model.fit(X_train, y_train)
forest_preds = forest_model.predict(X_test)

# Compare accuracy
tree_acc = accuracy_score(y_test, tree_preds)
forest_acc = accuracy_score(y_test, forest_preds)

print("Decision Tree Accuracy:", tree_acc)
print("Random Forest Accuracy:", forest_acc)
```


Decision Tree Accuracy: 0.7988826815642458
Random Forest Accuracy: 0.8156424581005587
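
With a test set of 179 passengers (20% of the 891 rows), these two accuracies correspond to 143 vs 146 correct predictions, so the gap is only 3 passengers. A single split is a thin basis for comparison; cross-validation gives a steadier picture. The sketch below uses a hypothetical helper `compare_models` and stand-in synthetic data (both are assumptions); the prepared Titanic `X` and `y` could be passed in instead.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

def compare_models(X, y, cv=5):
    """Return mean cross-validated accuracy for a single tree and a forest."""
    models = {
        "Decision Tree": DecisionTreeClassifier(max_depth=4, random_state=42),
        "Random Forest": RandomForestClassifier(
            n_estimators=100, max_depth=4, random_state=42
        ),
    }
    return {
        name: cross_val_score(model, X, y, cv=cv, scoring="accuracy").mean()
        for name, model in models.items()
    }

# Stand-in data (an assumption); pass the prepared Titanic X and y to reuse it there
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=42)
results = compare_models(X_demo, y_demo)

for name, score in results.items():
    print(f"{name}: {score:.3f}")
```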


Warning (pandas): calling `X["age"].fillna(X["age"].median(), inplace=True)` on a column of a DataFrame slice raises a `FutureWarning` and a `SettingWithCopyWarning`. The intermediate object behaves as a copy, so the in-place fill is not guaranteed to reach the original DataFrame, and this pattern will stop working in pandas 3.0. Instead of `df[col].method(value, inplace=True)`, use `df.method({col: value}, inplace=True)` or `df[col] = df[col].method(value)`.

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
