# Feature Extraction: Bag-of-Words and TF-IDF
## Objective

Transform normalized textual data into numerical feature representations suitable for:

- Classical machine learning models

- Baseline NLP systems

- Interpretable text analytics

> This notebook focuses on sparse, high-dimensional representations that trade semantic richness for transparency and control.

## Why Feature Extraction Matters

Machine learning models do not operate on raw text; they operate on numbers.

Poor feature extraction leads to:

- Sparse but meaningless vectors

- Overfitting on rare terms

- Unstable model coefficients

- Data leakage when the vectorizer is fitted on validation or test data

BoW and TF-IDF remain:

- Strong baselines

- Highly interpretable

- Computationally efficient

## Imports and Setup

In [2]:
import pandas as pd
import numpy as np

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer


## Example Corpus (From Previous Notebooks)

We assume text has been:

- Cleaned

- Tokenized

- Normalized (stopword removal + lemmatization)

scikit-learn's vectorizers expect one raw string per document by default, so we rejoin the tokens with spaces.

In [5]:
data = {
    "tokens_normalized": [
        ["amazing", "visit"],
        ["nlp", "hard"],
        ["tokenization", "error", "silent", "model", "failure"],
        ["clean", "text", "better", "model"]
    ]
}

df = pd.DataFrame(data)

df["text_normalized"] = df["tokens_normalized"].apply(lambda x: " ".join(x))
df

|   | tokens_normalized | text_normalized |
| --- | --- | --- |
| 0 | [amazing, visit] | amazing visit |
| 1 | [nlp, hard] | nlp hard |
| 2 | [tokenization, error, silent, model, failure] | tokenization error silent model failure |
| 3 | [clean, text, better, model] | clean text better model |


# Bag-of-Words (BoW)
## Concept

BoW represents text as:

- Token counts

- Order-agnostic

- High-dimensional sparse vectors

## Fit CountVectorizer

__Important:__ Always fit the vectorizer on training data only. The cell below fits on the whole toy corpus purely for illustration.

In [8]:
count_vectorizer = CountVectorizer()

X_bow = count_vectorizer.fit_transform(df["text_normalized"])


### Inspect Vocabulary

In [11]:
vocab = count_vectorizer.get_feature_names_out()
vocab

array(['amazing', 'better', 'clean', 'error', 'failure', 'hard', 'model',
       'nlp', 'silent', 'text', 'tokenization', 'visit'], dtype=object)

### Dense View (For Inspection Only)

In [14]:
bow_df = pd.DataFrame(
    X_bow.toarray(),
    columns=vocab
)

bow_df

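|   | amazing | better | clean | error | failure | hard | model | nlp | silent | text | tokenization | visit |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 |
| 3 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 |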
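The dense view is for inspection only because a realistic vocabulary makes `toarray()` expensive. A minimal sketch of inspecting the sparse matrix directly, without densifying it, could look like this (the variable names are illustrative):

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "amazing visit",
    "nlp hard",
    "tokenization error silent model failure",
    "clean text better model",
]

X = CountVectorizer().fit_transform(corpus)

n_docs, n_terms = X.shape
# nnz is the number of explicitly stored (non-zero) entries.
density = X.nnz / (n_docs * n_terms)

print(f"shape={X.shape}, stored non-zeros={X.nnz}, density={density:.2%}")
```

On a corpus this small the density is fairly high. On real corpora with tens of thousands of terms it is typically far below one percent, which is why the sparse format matters.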


## TF-IDF Vectorization
### Concept

TF-IDF weights tokens by:

- Term frequency (TF)

- Inverse document frequency (IDF)

This down-weights terms that appear in most documents, so ubiquitous terms contribute less than terms concentrated in a few documents.
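The weighting can be written out as follows. This assumes scikit-learn's default `TfidfVectorizer` settings (`smooth_idf=True`, `norm="l2"`, `sublinear_tf=False`), which the notebook uses implicitly. Here $n$ is the number of documents, $\mathrm{df}(t)$ the number of documents containing term $t$, and $\mathrm{tf}(t, d)$ the count of $t$ in document $d$; other libraries and textbooks use slightly different variants.

```latex
\mathrm{idf}(t) = \ln\frac{1 + n}{1 + \mathrm{df}(t)} + 1,
\qquad
w(t, d) = \frac{\mathrm{tf}(t, d)\,\mathrm{idf}(t)}
               {\sqrt{\sum_{t'} \bigl(\mathrm{tf}(t', d)\,\mathrm{idf}(t')\bigr)^{2}}}
```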

### Fit TfidfVectorizer

In [17]:
tfidf_vectorizer = TfidfVectorizer()

X_tfidf = tfidf_vectorizer.fit_transform(df["text_normalized"])


### Inspect Vocabulary and Weights

In [20]:
tfidf_vocab = tfidf_vectorizer.get_feature_names_out()

tfidf_df = pd.DataFrame(
    X_tfidf.toarray(),
    columns=tfidf_vocab
)

tfidf_df


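|   | amazing | better | clean | error | failure | hard | model | nlp | silent | text | tokenization | visit |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0.707107 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.707107 |
| 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.707107 | 0.0 | 0.707107 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 0.0 | 0.0 | 0.0 | 0.465162 | 0.465162 | 0.0 | 0.366739 | 0.0 | 0.465162 | 0.0 | 0.465162 | 0.0 |
| 3 | 0.0 | 0.525473 | 0.525473 | 0.0 | 0.0 | 0.0 | 0.414289 | 0.0 | 0.0 | 0.525473 | 0.0 | 0.0 |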


## BoW vs TF-IDF


| Aspect           | BoW        | TF-IDF             |
| ---------------- | ---------- | ------------------ |
| Simplicity       | Very high  | High               |
| Interpretability | Excellent  | Very good          |
| Term weighting   | Raw counts | Frequency + rarity |
| Default baseline | ✅          | ✅                  |


# Controlling Sparsity and Noise
## Limit Vocabulary Size

In [25]:
tfidf_limited = TfidfVectorizer(
    max_features=5
)

X_limited = tfidf_limited.fit_transform(df["text_normalized"])
tfidf_limited.get_feature_names_out()

array(['amazing', 'better', 'clean', 'error', 'model'], dtype=object)

## Remove Rare and Frequent Tokens

In [28]:
tfidf_pruned = TfidfVectorizer(
    min_df=2,
    max_df=0.9
)

X_pruned = tfidf_pruned.fit_transform(df["text_normalized"])
tfidf_pruned.get_feature_names_out()


array(['model'], dtype=object)

# N-grams (Optional)

In [31]:
tfidf_ngrams = TfidfVectorizer(
    ngram_range=(1, 2)
)

X_ngrams = tfidf_ngrams.fit_transform(df["text_normalized"])
tfidf_ngrams.get_feature_names_out()[:10]


array(['amazing', 'amazing visit', 'better', 'better model', 'clean',
       'clean text', 'error', 'error silent', 'failure', 'hard'],
      dtype=object)

# Pipeline-Safe Design Pattern

Vectorization must be:

- Fitted on training data

- Reused unchanged for validation/test

In [34]:
def build_tfidf_vectorizer():
    return TfidfVectorizer(
        min_df=2,
        max_df=0.9,
        ngram_range=(1, 2)
    )


# Common Feature Extraction Mistakes

- ❌ Fitting vectorizers on the full dataset (training plus validation/test)
- ❌ Changing vocabulary across experiments
- ❌ Ignoring sparsity when choosing models
- ❌ Treating TF-IDF as semantic embeddings

# When NOT to Use BoW / TF-IDF

Avoid when:

- Long-range context matters

- Semantic similarity is required

- Cross-language generalization is needed

➡ Use embeddings or transformers instead.

# Key Takeaways

- BoW and TF-IDF are strong, interpretable baselines

- Sparsity control is essential

- Vectorizers should be encapsulated in the modelling pipeline

- Treat a fitted vectorizer as part of the model: fit it once on training data, then reuse it unchanged
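Treating the vectorizer as a fitted object also means persisting it alongside the model. A minimal sketch using the standard-library `pickle` module is shown below. Serializing to an in-memory byte string keeps the example self-contained; writing to a file, or using `joblib`, would be the usual choice in practice.

```python
import pickle

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

train_texts = [
    "amazing visit",
    "nlp hard",
    "tokenization error silent model failure",
    "clean text better model",
]

vectorizer = TfidfVectorizer().fit(train_texts)

# Serialize the fitted vectorizer (vocabulary and IDF weights included).
blob = pickle.dumps(vectorizer)
restored = pickle.loads(blob)

# Hypothetical new text arriving at inference time.
new_texts = ["silent model error"]
original = vectorizer.transform(new_texts).toarray()
roundtrip = restored.transform(new_texts).toarray()

print(np.allclose(original, roundtrip))
```

Only load pickles from sources you trust, and keep the scikit-learn version consistent between saving and loading.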

# Next Notebook