# Lesson 39: TF-IDF parameter tuning activity

In this activity, you will explore how TF-IDF parameters affect text classification accuracy on the NLTK movie reviews corpus. You will run three experiments:

1. **N-gram range** - Compare unigrams, bigrams, and trigrams
2. **Vocabulary size** - Explore the effect of `max_features`
3. **Document frequency filtering** - Use the `min_df` and `max_df` parameters

## Notebook set-up

### Imports

In [None]:
import nltk
from nltk.corpus import movie_reviews
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

nltk.download('movie_reviews', quiet=True)

## 1. Load and prepare data

In [None]:
# Load movie reviews
documents = [
    (' '.join(movie_reviews.words(fileid)), category)
    for category in movie_reviews.categories()
    for fileid in movie_reviews.fileids(category)
]

texts = [doc[0] for doc in documents]
labels = [1 if doc[1] == 'pos' else 0 for doc in documents]

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, random_state=315
)

print(f'Training samples: {len(X_train)}')
print(f'Testing samples: {len(X_test)}')

## 2. Helper function

This function trains and evaluates a classifier with a given vectorizer configuration.

In [None]:
def evaluate_vectorizer(vectorizer, X_train, X_test, y_train, y_test):
    '''Train Naive Bayes with given vectorizer and return accuracy and vocab size.'''
    
    X_train_vec = vectorizer.fit_transform(X_train)
    X_test_vec = vectorizer.transform(X_test)
    
    model = MultinomialNB()
    model.fit(X_train_vec, y_train)
    predictions = model.predict(X_test_vec)
    
    accuracy = accuracy_score(y_test, predictions)
    vocab_size = len(vectorizer.vocabulary_)
    
    return accuracy, vocab_size
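
The helper fits the vectorizer on the training texts only and then reuses that vocabulary to transform the test texts. The sketch below illustrates the same pattern on a tiny made-up corpus (`train_docs` and `test_docs` are hypothetical and not part of the activity data): a word that appears only in the test text gets no column.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy data, for illustration only
train_docs = ['great film', 'dull film']
test_docs = ['great sequel']

vectorizer = TfidfVectorizer()
train_vec = vectorizer.fit_transform(train_docs)  # learns the vocabulary
test_vec = vectorizer.transform(test_docs)        # reuses it, learns nothing new

# 'sequel' never appeared in training, so it is absent from the vocabulary
print(sorted(vectorizer.vocabulary_))
print(train_vec.shape, test_vec.shape)
```

Both matrices have the same number of columns, which is what lets a model trained on one make predictions on the other.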

## 3. N-gram range experiment

### Task 1: Compare unigrams, bigrams, and combinations

Complete the code below to test different n-gram ranges and observe their effect on accuracy.

**Hints:**
- `ngram_range=(1, 1)` uses only unigrams (single words)
- `ngram_range=(1, 2)` uses unigrams and bigrams
- `ngram_range=(2, 2)` uses only bigrams
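
To see what these settings do before running them on the full dataset, here is a minimal sketch on a two-sentence toy corpus (`toy_docs` is made up for illustration). It prints the vocabulary that each `ngram_range` produces.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus, for illustration only
toy_docs = ['the movie was great', 'the movie was dull']

ngram_features = {}
for ngram_range in [(1, 1), (1, 2), (2, 2)]:
    vectorizer = TfidfVectorizer(ngram_range=ngram_range)
    vectorizer.fit(toy_docs)
    # vocabulary_ maps each n-gram to its column index
    ngram_features[ngram_range] = sorted(vectorizer.vocabulary_)
    print(ngram_range, ngram_features[ngram_range])
```

Adding bigrams roughly doubles the vocabulary even on this tiny corpus, so expect much larger growth on the movie reviews.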

In [None]:
# TODO: Define n-gram configurations to test
ngram_configs = [
    # YOUR CODE HERE - add tuples like (1, 1), (1, 2), (2, 2), (1, 3)
]

print('N-gram range comparison:\n')
for ngram_range in ngram_configs:

    # TODO: Create TfidfVectorizer with this ngram_range
    vectorizer = None  # YOUR CODE HERE
    
    # Evaluate
    accuracy, vocab_size = evaluate_vectorizer(
        vectorizer, X_train, X_test, y_train, y_test
    )
    
    print(f'ngram_range={ngram_range}: accuracy={accuracy:.4f}, vocab_size={vocab_size:,}')

## 4. Vocabulary size experiment

### Task 2: Explore the `max_features` parameter

The `max_features` parameter limits the vocabulary to the N terms with the highest term frequency across the training corpus. Test how this limit affects accuracy.

**Hints:**
- Try values like 100, 500, 1000, 5000, 10000, and `None` (no limit)
- Consider the trade-off between model size and accuracy
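
As a small illustration of how the cut-off works, the sketch below fits vectorizers with different `max_features` values on a made-up corpus (`toy_docs` is hypothetical) in which "good" occurs four times, "film" three times, and "bad" twice.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus, for illustration only
toy_docs = [
    'good good good film film',
    'good film bad plot',
    'bad acting',
]

kept_terms = {}
for max_features in [2, 3, None]:
    vectorizer = TfidfVectorizer(max_features=max_features)
    vectorizer.fit(toy_docs)
    # Only the most frequent terms across the corpus survive the limit
    kept_terms[max_features] = sorted(vectorizer.vocabulary_)
    print(max_features, kept_terms[max_features])
```

The terms are ranked by their total count across the corpus, not by their TF-IDF weight, so a small `max_features` tends to keep very common words.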

In [None]:
# TODO: Define max_features values to test
max_features_values = [
    # YOUR CODE HERE - add values like 100, 500, 1000, 5000, 10000, None
]

print('max_features comparison:\n')
for max_features in max_features_values:

    # TODO: Create TfidfVectorizer with this max_features
    vectorizer = None  # YOUR CODE HERE
    
    # Evaluate
    accuracy, vocab_size = evaluate_vectorizer(
        vectorizer, X_train, X_test, y_train, y_test
    )
    
    print(f'max_features={str(max_features):>6s}: accuracy={accuracy:.4f}, vocab_size={vocab_size:,}')

## 5. Document frequency filtering

### Task 3: Use `min_df` and `max_df` to filter terms

- `min_df` removes terms that appear in fewer than the given number of documents (integer) or fraction of documents (float)
- `max_df` removes terms that appear in more than the given number of documents (integer) or fraction of documents (float)

**Your task:** Experiment with different combinations and observe the effect.

**Hints:**
- `min_df=5` means "must appear in at least 5 documents"
- `max_df=0.95` means "must appear in at most 95% of documents"
- Together, these filters help remove rare terms such as typos and terms so common that they carry little signal
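
The sketch below shows both filters on a four-document toy corpus (`toy_docs` is made up for illustration). "the" and "was" appear in every document, "great" and "dull" in two each, and the remaining words in one each.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus, for illustration only
toy_docs = [
    'the plot was great',
    'the acting was great',
    'the ending was dull',
    'the sequel was dull',
]

df_results = {}
for min_df, max_df in [(1, 1.0), (2, 1.0), (1, 0.75), (2, 0.75)]:
    vectorizer = TfidfVectorizer(min_df=min_df, max_df=max_df)
    vectorizer.fit(toy_docs)
    # Terms outside the [min_df, max_df] document-frequency band are dropped
    df_results[(min_df, max_df)] = sorted(vectorizer.vocabulary_)
    print(f'min_df={min_df}, max_df={max_df}:', df_results[(min_df, max_df)])
```

Note that an integer is read as a document count and a float as a fraction, so `max_df=1` and `max_df=1.0` mean very different things.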

In [None]:
# TODO: Define min_df/max_df configurations to test
df_configs = [
    # YOUR CODE HERE - add tuples like (1, 1.0), (5, 0.95), (10, 0.8)
    # Format: (min_df, max_df)
]

print('Document frequency filtering comparison:\n')
for min_df, max_df in df_configs:

    # TODO: Create TfidfVectorizer with these parameters
    vectorizer = None  # YOUR CODE HERE
    
    # Evaluate
    accuracy, vocab_size = evaluate_vectorizer(
        vectorizer, X_train, X_test, y_train, y_test
    )
    
    print(f'min_df={min_df}, max_df={max_df}: accuracy={accuracy:.4f}, vocab_size={vocab_size:,}')

## 6. Analysis questions

After completing the experiments above, answer these questions:

1. Which n-gram configuration gave the best accuracy? Why?
2. At what point does reducing `max_features` start hurting accuracy significantly?
3. How do `min_df` and `max_df` affect both vocabulary size and accuracy?
4. What combination of parameters would you recommend for a production system?