## RandomForest_2

### What is “Feature Importance” in Random Forests?

When you train a random forest, it keeps track of how much each feature (column in your dataset) helps the trees make good splits.

Some features help the model a lot.

Some features add very little value.

The model gives a score to each feature, showing how important it is. In scikit-learn these scores are stored in `feature_importances_` and add up to 1, so a score of 0.50 means 50%.

🏠 Real-life example

Imagine you’re predicting house prices with these features:

🏠 Size of house (sq. ft.)

📍 Location

🛋️ Number of rooms

🚗 Parking availability

🎨 Paint color of walls

After training a Random Forest, feature importance might look like:

Location → 50%

Size → 30%

Rooms → 15%

Parking → 5%

Paint color → 0%

👉 This means:

Location and Size are the biggest factors in predicting house prices.

Paint color doesn’t really matter.

### ✅ In short: Feature importance tells us which factors really drive the prediction.
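To make the house example concrete, here is a minimal sketch on made-up data. Everything in it is an assumption for illustration: the column names, the price formula, and the noise level are invented, and `RandomForestRegressor` is used because price is a number rather than a class. The only point is that a feature the price never depends on (`paint_color`) ends up with a score near zero.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic houses: every number below is invented for illustration
rng = np.random.default_rng(0)
n = 500
houses = pd.DataFrame({
    "location_score": rng.integers(1, 11, n),   # hypothetical 1-10 neighbourhood rating
    "size_sqft": rng.uniform(500, 3000, n),
    "rooms": rng.integers(1, 7, n),
    "parking": rng.integers(0, 2, n),           # 0 = no parking, 1 = parking
    "paint_color": rng.integers(0, 5, n),       # 5 colour codes, never used in the price
})

# Made-up price formula: paint_color is deliberately left out
price = (
    30_000 * houses["location_score"]
    + 80 * houses["size_sqft"]
    + 5_000 * houses["rooms"]
    + 10_000 * houses["parking"]
    + rng.normal(0, 10_000, n)                  # random noise
)

house_model = RandomForestRegressor(n_estimators=100, random_state=42)
house_model.fit(houses, price)

# Pair each column with its score and sort from most to least important
house_importance = pd.Series(
    house_model.feature_importances_, index=houses.columns
).sort_values(ascending=False)
print(house_importance.round(3))
```

The exact scores depend on the invented formula, so treat them as a demonstration of the idea rather than as facts about real house prices.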

In [2]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

# Load Titanic dataset (from seaborn for simplicity)
import seaborn as sns
titanic = sns.load_dataset("titanic")

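# Select useful columns (.copy() gives us our own DataFrame, not a slice of titanic)
X = titanic[["age", "sex", "pclass"]].copy()
y = titanic["survived"]

# Handle missing values (assign the result back instead of using inplace=True)
X["age"] = X["age"].fillna(X["age"].median())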

# Convert categorical (sex) → numeric
X = pd.get_dummies(X, drop_first=True)

# Split train/test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train Random Forest
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Get feature importance
importances = model.feature_importances_

# Show results
for feature, importance in zip(X.columns, importances):
    print(f"{feature}: {importance:.2f}")


age: 0.45
pclass: 0.16
sex_male: 0.39
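One caution when reading these scores: `feature_importances_` is computed from the training data, and it tends to favor features with many distinct values (such as `age`), because fully grown trees have more places to split on them. The sketch below illustrates this on made-up data, not on Titanic. The names `signal` and `noise` are hypothetical: `signal` really drives the label, while `noise` is just random numbers. It compares the built-in scores with scikit-learn's `permutation_importance`, which shuffles one column at a time on held-out data and measures how much accuracy drops.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 1000
demo = pd.DataFrame({
    "signal": rng.integers(0, 2, n),   # binary feature that drives the label
    "noise": rng.normal(size=n),       # random numbers, unrelated to the label
})

# Label equals signal, except that about 20% of rows are flipped at random
flip = rng.random(n) < 0.2
label = np.where(flip, 1 - demo["signal"], demo["signal"])

Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    demo, label, test_size=0.3, random_state=42
)

forest = RandomForestClassifier(n_estimators=100, random_state=42)
forest.fit(Xd_train, yd_train)

# Built-in (impurity-based) scores, computed from the training data
impurity_scores = pd.Series(forest.feature_importances_, index=demo.columns)

# Permutation scores: mean drop in test accuracy when one column is shuffled
perm = permutation_importance(forest, Xd_test, yd_test, n_repeats=10, random_state=42)
perm_scores = pd.Series(perm.importances_mean, index=demo.columns)

print("Built-in scores:")
print(impurity_scores.round(3))
print("Permutation scores:")
print(perm_scores.round(3))
```

In this setup the built-in score for `noise` is typically large, while shuffling `noise` barely changes test accuracy. The same check could be run on the Titanic model by passing `model`, `X_test`, and `y_test` to `permutation_importance`.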




### 🌳 Random Forest has many knobs we can adjust

These knobs are called hyperparameters (settings you choose before training).
The two most important ones are:

1. n_estimators → Number of trees 🌲🌲🌲

Think of it like a voting team.

More trees = more stable predictions, though the gains level off after a while.

But more trees = slower training.

👉 Example:

n_estimators=10 → only 10 trees (fast, but results can swing with the random seed).

n_estimators=200 → 200 trees (more stable, but slower).

2. max_depth → How deep each tree can grow

A deep tree memorizes the training data (can overfit).

A shallow tree may miss patterns (can underfit).

👉 Example:

max_depth=2 → tree makes very simple rules (may be too simple).

max_depth=None → tree can grow fully (may overfit).
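The overfit/underfit trade-off can be seen by comparing accuracy on the training data with accuracy on held-out data. This is a minimal sketch on a synthetic dataset from `make_classification`, chosen so that it runs without downloading anything. The sample size, feature counts, and the 10% label noise (`flip_y`) are arbitrary assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic data with some label noise, so there is something to "memorize"
Xs, ys = make_classification(
    n_samples=600, n_features=10, n_informative=4, flip_y=0.1, random_state=0
)
Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    Xs, ys, test_size=0.3, random_state=42
)

depth_results = {}
for depth in [2, 5, None]:
    rf = RandomForestClassifier(n_estimators=100, max_depth=depth, random_state=42)
    rf.fit(Xs_train, ys_train)
    train_acc = accuracy_score(ys_train, rf.predict(Xs_train))
    test_acc = accuracy_score(ys_test, rf.predict(Xs_test))
    depth_results[depth] = (train_acc, test_acc)
    print(f"Depth={depth}: train={train_acc:.3f}, test={test_acc:.3f}")
```

A large gap between the train and test numbers is the usual sign of overfitting; low numbers in both suggest underfitting.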

🎯 Goal of tuning

Find the sweet spot where the model is:

Accurate on training data ✅

Also works well on unseen test data ✅

👉 Example analogy:

n_estimators = number of judges in a contest.

max_depth = how detailed each judge’s checklist is.

Too many judges with overly detailed checklists → slow, and they nitpick details that don’t matter.

Too few judges with very short checklists → rushed, unreliable decisions.

In [4]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Load Titanic dataset from seaborn (or local CSV if you have it)
import seaborn as sns
titanic = sns.load_dataset("titanic")

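# Select useful columns (.copy() gives us our own DataFrame, not a slice of titanic)
X = titanic[["pclass", "sex", "age", "fare"]].copy()
y = titanic["survived"]

# Handle missing values (assign the result back instead of using inplace=True)
X["age"] = X["age"].fillna(X["age"].median())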

# Convert categorical (sex) → numeric
X = pd.get_dummies(X, drop_first=True)

# Train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

for n in [10, 50, 100]:
    for depth in [3, 5, None]:
        model = RandomForestClassifier(n_estimators=n, max_depth=depth, random_state=42)
        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)
        acc = accuracy_score(y_test, y_pred)
        print(f"Trees={n}, Depth={depth}, Accuracy={acc:.3f}")




Trees=10, Depth=3, Accuracy=0.817
Trees=10, Depth=5, Accuracy=0.802
Trees=10, Depth=None, Accuracy=0.780
Trees=50, Depth=3, Accuracy=0.802
Trees=50, Depth=5, Accuracy=0.810
Trees=50, Depth=None, Accuracy=0.791
Trees=100, Depth=3, Accuracy=0.791
Trees=100, Depth=5, Accuracy=0.806
Trees=100, Depth=None, Accuracy=0.787
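The loop above picks settings by looking at test accuracy, which makes the final test score a little optimistic, because the test set has already helped choose the model. A common alternative is cross-validation on the training data only. The sketch below uses scikit-learn's `GridSearchCV` on synthetic data (an assumption so that it is self-contained); with the Titanic split you would call `search.fit(X_train, y_train)` instead.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data; sizes are arbitrary
Xg, yg = make_classification(n_samples=400, n_features=8, n_informative=4, random_state=0)
Xg_train, Xg_test, yg_train, yg_test = train_test_split(
    Xg, yg, test_size=0.3, random_state=42
)

# Same grid as the manual loop
param_grid = {"n_estimators": [10, 50, 100], "max_depth": [3, 5, None]}

# 5-fold cross-validation: each setting is scored on held-out folds of the training data
search = GridSearchCV(
    RandomForestClassifier(random_state=42), param_grid, cv=5, scoring="accuracy"
)
search.fit(Xg_train, yg_train)

print("Best settings:", search.best_params_)
print(f"Cross-validated accuracy: {search.best_score_:.3f}")

# The test set is used only once, at the very end
print(f"Test accuracy: {search.score(Xg_test, yg_test):.3f}")
```

By default `GridSearchCV` refits the best setting on the full training data, so `search` can be used for prediction directly.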


In [5]:
# Train a Random Forest
model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42)
model.fit(X_train, y_train)

# Accuracy
y_pred = model.predict(X_test)
acc = accuracy_score(y_test, y_pred)
print("Accuracy:", round(acc, 3))

# Feature Importance
importances = model.feature_importances_

print("\nFeature Importance:")
for feature, importance in zip(X.columns, importances):
    print(f"{feature}: {importance:.3f}")


Accuracy: 0.806

Feature Importance:
pclass: 0.159
age: 0.158
fare: 0.220
sex_male: 0.463
