# Tree-Based Models for Text Data
## Objective

Evaluate the use of tree-based models on text data represented via BoW / TF-IDF, and understand:

- When they work

- When they fail

- How to mitigate known limitations

> This notebook emphasizes diagnostic insight over raw performance.

## Why This Notebook Exists

Tree-based models are powerful for:

- Tabular data

- Non-linear interactions

- Feature selection

However, text features are:

- High-dimensional

- Sparse

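- Weakly correlated with the target when taken one token at a time

This creates a structural mismatch: each tree split tests a single feature, and in a sparse matrix most single-token splits separate only a handful of documents from the rest.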

Understanding this mismatch prevents:

- Wasted compute

- Overfitting

- Misleading benchmarks
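The sparsity behind this mismatch can be measured directly. The following is a minimal sketch, assuming a small hypothetical corpus (`docs`) chosen only for illustration: it reports the fraction of non-zero cells in the TF-IDF matrix and how many terms appear in exactly one document.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus, used only to illustrate sparsity.
docs = [
    "this model works well",
    "terrible results and poor model",
    "excellent performance and stability",
    "bad predictions and weak accuracy",
]

X = TfidfVectorizer().fit_transform(docs)  # SciPy sparse matrix

n_docs, n_terms = X.shape
density = X.nnz / (n_docs * n_terms)  # fraction of non-zero cells

# Number of documents in which each term appears (non-zero count per column).
doc_freq = np.asarray((X > 0).sum(axis=0)).ravel()
n_singletons = int((doc_freq == 1).sum())

print(f"shape={X.shape}, density={density:.2f}")
print("terms appearing in exactly one document:", n_singletons)
```

A split on a term that occurs in a single document can isolate at most that one document, which is why a tree needs many such splits to cover a corpus.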

## Imports and Setup

In [2]:
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer

from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

from sklearn.metrics import accuracy_score, classification_report


# Example Dataset

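This is the same binary classification setup used for the linear models. With only six documents, the metrics below are illustrative rather than statistically meaningful.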

In [5]:
data = {
    "text": [
        "this model works well",
        "terrible results and poor model",
        "excellent performance and stability",
        "bad predictions and weak accuracy",
        "robust and interpretable model",
        "awful behavior and unreliable output"
    ],
    "label": [1, 0, 1, 0, 1, 0]
}

df = pd.DataFrame(data)


# Train / Test Split

In [8]:
X_train, X_test, y_train, y_test = train_test_split(
    df["text"],
    df["label"],
    test_size=0.3,
    random_state=2010,
    stratify=df["label"]
)

# Decision Tree
## Why Try It?

- Acts as a diagnostic baseline

- Highlights sparsity issues immediately

## Pipeline

In [11]:
dt_pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(max_features=100)),
    ("model", DecisionTreeClassifier(
        max_depth=5,
        random_state=2010
    ))
])

## Train and Evaluate

In [14]:
dt_pipeline.fit(X_train, y_train)

y_pred_dt = dt_pipeline.predict(X_test)

print("Decision Tree Accuracy:", accuracy_score(y_test, y_pred_dt))
print(classification_report(y_test, y_pred_dt))

Decision Tree Accuracy: 0.0
              precision    recall  f1-score   support

           0       0.00      0.00      0.00       1.0
           1       0.00      0.00      0.00       1.0

    accuracy                           0.00       2.0
   macro avg       0.00      0.00      0.00       2.0
weighted avg       0.00      0.00      0.00       2.0



# Random Forest
## Why Random Forest?

- Ensemble reduces variance

- Implicit feature selection

But:

- Memory-intensive

- Still struggles with sparse signals

## Pipeline

In [19]:
rf_pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(max_features=200)),
    ("model", RandomForestClassifier(
        n_estimators=200,
        max_depth=10,
        random_state=2010,
        n_jobs=-1
    ))
])

## Train and Evaluate

In [24]:
rf_pipeline.fit(X_train, y_train)

y_pred_rf = rf_pipeline.predict(X_test)

print("Random Forest Accuracy:", accuracy_score(y_test, y_pred_rf))
print(classification_report(y_test, y_pred_rf))


Random Forest Accuracy: 0.5
              precision    recall  f1-score   support

           0       0.00      0.00      0.00         1
           1       0.50      1.00      0.67         1

    accuracy                           0.50         2
   macro avg       0.25      0.50      0.33         2
weighted avg       0.25      0.50      0.33         2





# Gradient Boosting
## Why Gradient Boosting?

- Sequential error correction

- Can sometimes extract signal from noisy features

Limitation:

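- Densification risk: scikit-learn's `GradientBoostingClassifier` accepts sparse input, but some boosting implementations (for example `HistGradientBoostingClassifier`) require dense arrays → the TF-IDF matrix must be densified first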
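The cost of densification can be estimated before it happens. The sketch below assumes hypothetical matrix dimensions (1,000 documents, a 5,000-term vocabulary, 1% non-zero cells) and a random sparse matrix standing in for real TF-IDF output; it compares CSR storage against the size of the equivalent dense `float64` array without allocating that array.

```python
import numpy as np
from scipy import sparse

# Hypothetical sizes chosen for illustration, not taken from a real corpus.
n_docs, n_terms, density = 1_000, 5_000, 0.01

# Random sparse matrix standing in for a TF-IDF matrix.
X_sparse = sparse.random(
    n_docs, n_terms, density=density, format="csr",
    dtype=np.float64, random_state=0,
)

# CSR stores the non-zero values, their column indices, and row pointers.
sparse_bytes = (
    X_sparse.data.nbytes + X_sparse.indices.nbytes + X_sparse.indptr.nbytes
)

# A dense float64 array needs 8 bytes per cell; computed without allocating it.
dense_bytes = n_docs * n_terms * np.dtype(np.float64).itemsize

print(f"sparse: {sparse_bytes / 1e6:.2f} MB")
print(f"dense:  {dense_bytes / 1e6:.2f} MB")
```

The gap widens as the vocabulary grows, because dense storage scales with every cell while sparse storage scales only with the non-zero ones.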

## Pipeline

In [27]:
gb_pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(max_features=100)),
    ("model", GradientBoostingClassifier(
        n_estimators=100,
        learning_rate=0.1,
        random_state=2010
    ))
])

## Train and Evaluate

In [30]:
gb_pipeline.fit(X_train, y_train)

y_pred_gb = gb_pipeline.predict(X_test)

print("Gradient Boosting Accuracy:", accuracy_score(y_test, y_pred_gb))
print(classification_report(y_test, y_pred_gb))


Gradient Boosting Accuracy: 0.0
              precision    recall  f1-score   support

           0       0.00      0.00      0.00       1.0
           1       0.00      0.00      0.00       1.0

    accuracy                           0.00       2.0
   macro avg       0.00      0.00      0.00       2.0
weighted avg       0.00      0.00      0.00       2.0



# Performance Comparison

In [33]:
results = pd.DataFrame({
    "model": [
        "Decision Tree",
        "Random Forest",
        "Gradient Boosting"
    ],
    "accuracy": [
        accuracy_score(y_test, y_pred_dt),
        accuracy_score(y_test, y_pred_rf),
        accuracy_score(y_test, y_pred_gb)
    ]
})

results

               model  accuracy
0      Decision Tree       0.0
1      Random Forest       0.5
2  Gradient Boosting       0.0


# Feature Importance Caveats

Tree-based feature importance on text:

- Biased toward high-frequency tokens

- Unstable across runs

- Hard to interpret semantically

In [36]:
rf_model = rf_pipeline.named_steps["model"]
vectorizer = rf_pipeline.named_steps["tfidf"]

feature_names = vectorizer.get_feature_names_out()
importances = rf_model.feature_importances_

top_features = np.argsort(importances)[-10:]

for idx in reversed(top_features):
    print(feature_names[idx], importances[idx])

model 0.16479400749063672
and 0.12921348314606745
output 0.08801498127340825
this 0.08052434456928839
robust 0.0702247191011236
weak 0.06367041198501872
predictions 0.055867665418227214
well 0.0552434456928839
bad 0.047752808988764044
awful 0.04681647940074907


# Key Lessons Learned

- Trees are not text-native models

- Sparsity kills split quality

- Linear models dominate in classical NLP

Trees shine when text is:

- Heavily engineered

- Reduced to dense representations

# When Tree-Based Models DO Make Sense

- `[ok]` - TF-IDF + dimensionality reduction

- `[ok]` - Text-derived numerical features

- `[ok]` - Embeddings (dense vectors)

- `[ok]` - Hybrid tabular + NLP models
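The first option, TF-IDF followed by dimensionality reduction, might look like the sketch below. It assumes `TruncatedSVD` as the reduction step and `n_components=3`, both illustrative choices for the six-document toy corpus rather than recommended settings.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

# Same toy corpus as the example dataset in this notebook.
texts = [
    "this model works well",
    "terrible results and poor model",
    "excellent performance and stability",
    "bad predictions and weak accuracy",
    "robust and interpretable model",
    "awful behavior and unreliable output",
]
labels = [1, 0, 1, 0, 1, 0]

svd_rf_pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    # Project the sparse TF-IDF matrix onto a few dense components.
    # n_components=3 is an assumption sized for this tiny corpus.
    ("svd", TruncatedSVD(n_components=3, random_state=2010)),
    ("model", RandomForestClassifier(n_estimators=100, random_state=2010)),
])

svd_rf_pipeline.fit(texts, labels)

# Inspect what the forest actually sees: a small dense matrix.
X_dense = svd_rf_pipeline[:-1].transform(texts)
print("input to the forest:", X_dense.shape)
```

After the projection every feature is non-zero for most documents, so each split can partition the whole sample instead of peeling off one document at a time.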

# Common Mistakes

- `[x]` - Blindly applying Random Forests to raw TF-IDF

- `[x]` - Ignoring memory usage

- `[x]` - Over-tuning weak model classes

- `[x]` - Misreading feature importance


# Key Takeaways

- Tree-based models are diagnostic, not default, for text

- Sparsity vs split logic is the core conflict

- Use trees after representation engineering

- Always benchmark against linear baselines
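A linear baseline can be benchmarked alongside a tree model in a few lines. The sketch below assumes `LogisticRegression` as the baseline and 3-fold stratified cross-validation, both illustrative choices; on a six-document corpus the scores are far too noisy to rank the models, so it shows the procedure only.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline

# Same toy corpus as the example dataset in this notebook.
texts = [
    "this model works well",
    "terrible results and poor model",
    "excellent performance and stability",
    "bad predictions and weak accuracy",
    "robust and interpretable model",
    "awful behavior and unreliable output",
]
labels = [1, 0, 1, 0, 1, 0]

candidates = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=2010),
}

# Same folds for every model so the comparison is like-for-like.
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=2010)

scores = {}
for name, model in candidates.items():
    pipeline = Pipeline([("tfidf", TfidfVectorizer()), ("model", model)])
    fold_scores = cross_val_score(
        pipeline, texts, labels, cv=cv, scoring="accuracy"
    )
    scores[name] = fold_scores.mean()

for name, score in scores.items():
    print(f"{name}: mean CV accuracy = {score:.2f}")
```

Putting the vectorizer inside the pipeline keeps the vocabulary from being fitted on held-out folds.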