[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/USERNAME/REPO/blob/main/Exercise_DTM_TFIDF.ipynb)

# Coding Exercise: From Text to Matrix

## Building Document-Term Matrices and TF-IDF Representations

**Workshop**: Quantitative Text Analysis and Natural Language Processing using Python  
**Day 2**: Bag-of-Words Models

---

### Learning Objectives

By the end of this exercise, you will be able to:

1. Transform text data into numerical representations using `CountVectorizer` and `TfidfVectorizer`
2. Understand and interpret the properties of document-term matrices (dimensionality, sparsity)
3. Compare raw frequency counts with TF-IDF weighted representations
4. Identify distinctive vocabulary associated with different categories of text

### The Data

We'll work with the same populism dataset from Day 1: annotated sentences from speeches by European populist leaders. Each sentence has been coded for the type of populist rhetoric it represents:

| Code | Category | Description |
|------|----------|-------------|
| 0 | Neutral | No clear populist framing |
| 1 | Us vs. Them | Constructing in-group/out-group divisions |
| 2 | People-centrism | Appeals to "the people" as a unified group |
| 3 | Anti-elite | Criticism of elites, establishment, or powerful groups |

---

## Setup

First, let's import the libraries we'll need and load the data.

**If running in Google Colab**: The cell below will automatically download the data file from GitHub.

In [None]:
# Import required libraries
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# For visualisation (optional)
import matplotlib.pyplot as plt

print("Libraries loaded successfully!")

In [None]:
# Download data if running in Google Colab
# ============================================
# IMPORTANT: Replace USERNAME/REPO with your actual GitHub username and repository name
# For example: 'luukschmitz/nlp-workshop'
# ============================================

import os

# Check if data file exists locally; if not, download from GitHub
data_file = 'populism_annotation_sample.csv'

if not os.path.exists(data_file):
    print("Downloading data from GitHub...")
    !wget -q https://raw.githubusercontent.com/USERNAME/REPO/main/populism_annotation_sample.csv
    print("Download complete!")
else:
    print("Data file found locally.")

In [None]:
# Load the data
df = pd.read_csv('populism_annotation_sample.csv')

# Create readable labels for the populism categories
pop_labels = {
    0: 'Neutral',
    1: 'Us vs. Them',
    2: 'People-centrism',
    3: 'Anti-elite'
}
df['pop_label'] = df['pop_code'].map(pop_labels)

print(f"Data loaded: {len(df)} sentences")

---

## Section 1: Exploring the Data

Before we transform text into numbers, we should understand what we're working with.

In [None]:
# Basic dataset overview
print(f"Dataset shape: {df.shape}")
print(f"\nColumns: {df.columns.tolist()}")

In [None]:
# Distribution of categories
print("Distribution of populism categories:")
print(df['pop_label'].value_counts())

In [None]:
# Which speakers are in the dataset?
print("Speakers:")
print(df['speaker'].value_counts())

### 💭 Question 1

Look at the distribution of speakers and categories. What potential issues might this create for our analysis? 

*Hint: Think about what we might actually be capturing when we identify "distinctive words" per category.*

**Your answer:** *(double-click to edit)*



In [None]:
# Let's look at example sentences from each category
print("Example sentences per category:\n")

for code in sorted(df['pop_code'].unique()):
    label = pop_labels[code]
    example = df[df['pop_code'] == code]['translated_sentence'].iloc[0]
    print(f"--- {label} (code {code}) ---")
    print(f"{example}\n")

---

## Section 2: Creating a Count-based Document-Term Matrix

The **Document-Term Matrix (DTM)** is the foundational representation for bag-of-words models. Each row represents a document (in our case, a sentence), and each column represents a unique word in the vocabulary. The cells contain word counts.

We'll use scikit-learn's `CountVectorizer` to create this matrix.

In [None]:
# Initialize the CountVectorizer with default settings
count_vectorizer = CountVectorizer()

# Fit and transform: learn vocabulary from text and create the matrix
dtm_count = count_vectorizer.fit_transform(df['translated_sentence'])

# Get the vocabulary (feature names)
vocabulary = count_vectorizer.get_feature_names_out()

print(f"DTM shape: {dtm_count.shape}")
print(f"  - {dtm_count.shape[0]} documents")
print(f"  - {dtm_count.shape[1]} unique terms")

### 💭 Question 2

The DTM has 40 rows and several hundred columns. In plain language, what does each number in this matrix represent?

**Your answer:** *(double-click to edit)*



In [None]:
# Let's peek at the vocabulary
print("First 20 terms (alphabetically sorted):")
print(vocabulary[:20])

print("\nLast 20 terms:")
print(vocabulary[-20:])

In [None]:
# The DTM is stored as a "sparse matrix" for efficiency
# Let's convert a small portion to see what it looks like

# Convert to a regular DataFrame for the first 5 documents and first 10 words
sample_dtm = pd.DataFrame(
    dtm_count[:5, :10].toarray(),  # First 5 docs, first 10 terms
    columns=vocabulary[:10]
)

print("Sample of the DTM (first 5 documents, first 10 terms):")
print(sample_dtm)
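As a rough illustration of what "sparse" storage means, the sketch below builds a tiny, made-up matrix and converts it to SciPy's compressed sparse row (CSR) format, which to my understanding is the format `fit_transform` returns. The array values are invented for the example; only the non-zero entries and their positions are stored.

```python
import numpy as np
from scipy import sparse

# A small, mostly-zero matrix (values invented for illustration)
dense = np.zeros((4, 6), dtype=np.int64)
dense[0, 1] = 2
dense[2, 4] = 1
dense[3, 0] = 1

csr = sparse.csr_matrix(dense)

# CSR keeps three short arrays instead of all 24 cells
print("Non-zero values:   ", csr.data)
print("Column indices:    ", csr.indices)
print("Row pointer array: ", csr.indptr)

# Share of cells that are zero
toy_sparsity = 1 - csr.nnz / (dense.shape[0] * dense.shape[1])
print(f"Sparsity: {toy_sparsity:.1%}")
```

Calling `.toarray()` reverses the conversion, which is affordable here but can use a lot of memory for large corpora.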

---

## Section 3: Creating a TF-IDF Matrix

Raw counts treat all words equally, but not all words are equally informative. **TF-IDF** (Term Frequency–Inverse Document Frequency) reweights words by how distinctive they are.

$$\text{TF-IDF}(t, d) = \text{TF}(t, d) \times \text{IDF}(t)$$

- **TF** (Term Frequency): How often does word $t$ appear in document $d$?
- **IDF** (Inverse Document Frequency): How rare is word $t$ across all documents?

Words that appear frequently in one document but rarely across the corpus get high TF-IDF scores.
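The formula above is the general textbook form. As a hedged sketch of what scikit-learn appears to do with its default settings (smoothed IDF, then L2 normalisation of each row), the example below recomputes the weights by hand on three made-up sentences and compares them with `TfidfVectorizer`. The sentences and variable names are assumptions for illustration only.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Invented sentences, used only for illustration
toy_docs = [
    "the people want change",
    "the elites ignore the people",
    "we want freedom",
]

# TF: raw term counts per document
counts = CountVectorizer().fit_transform(toy_docs).toarray()
n_docs = counts.shape[0]

# Document frequency: in how many documents does each term occur?
doc_freq = (counts > 0).sum(axis=0)

# Smoothed IDF as documented for the default smooth_idf=True
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# Multiply TF by IDF, then scale every row to unit length (default norm='l2')
weighted = counts * idf
manual_tfidf = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

sklearn_tfidf = TfidfVectorizer().fit_transform(toy_docs).toarray()
print("Manual result matches TfidfVectorizer:", np.allclose(manual_tfidf, sklearn_tfidf))
```

Because of the row normalisation, TF-IDF scores are comparable within a document but depend on how many other terms that document contains.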

In [None]:
# Initialize the TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer()

# Fit and transform
dtm_tfidf = tfidf_vectorizer.fit_transform(df['translated_sentence'])

print(f"TF-IDF matrix shape: {dtm_tfidf.shape}")

In [None]:
# Verify that both vectorizers produce the same vocabulary
vocab_count = count_vectorizer.get_feature_names_out()
vocab_tfidf = tfidf_vectorizer.get_feature_names_out()

print(f"Same vocabulary: {np.array_equal(vocab_count, vocab_tfidf)}")

---

## Section 4: Examining Vocabulary Size and Sparsity

Text data has some distinctive properties. Let's examine them.

In [None]:
# Calculate sparsity
total_cells = dtm_count.shape[0] * dtm_count.shape[1]
non_zero_cells = dtm_count.nnz  # .nnz gives the number of non-zero elements
sparsity = 1 - (non_zero_cells / total_cells)

print(f"Matrix dimensions: {dtm_count.shape[0]} documents × {dtm_count.shape[1]} terms")
print(f"\nTotal cells: {total_cells:,}")
print(f"Non-zero cells: {non_zero_cells:,}")
print(f"Sparsity: {sparsity:.1%}")

### 💭 Question 3

The matrix is over 95% zeros. Why is this? Is this a problem or a feature of text data?

**Your answer:** *(double-click to edit)*



In [None]:
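# How many words per document?
# Note: row sums count only the tokens CountVectorizer kept
# (by default, single-character tokens such as "a" or "I" are dropped)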
words_per_doc = np.array(dtm_count.sum(axis=1)).flatten()

print("Words per document:")
print(f"  Mean: {words_per_doc.mean():.1f}")
print(f"  Min:  {words_per_doc.min()}")
print(f"  Max:  {words_per_doc.max()}")

In [None]:
# Word frequency distribution (Zipf's Law in action)
word_frequencies = np.array(dtm_count.sum(axis=0)).flatten()

print("Word frequency distribution:")
print(f"  Words appearing exactly once: {np.sum(word_frequencies == 1)}")
print(f"  Words appearing 2-5 times:    {np.sum((word_frequencies >= 2) & (word_frequencies <= 5))}")
print(f"  Words appearing 6+ times:     {np.sum(word_frequencies >= 6)}")

### 💭 Question 4

Most words appear only once (these are called *hapax legomena*). What does this tell us about the challenges of working with small text corpora?

**Your answer:** *(double-click to edit)*



In [None]:
# What are the most frequent words?
word_freq_df = pd.DataFrame({
    'word': vocabulary,
    'count': word_frequencies
}).sort_values('count', ascending=False)

print("Top 15 most frequent words:")
print(word_freq_df.head(15).to_string(index=False))

### 💭 Question 5

Look at the most frequent words. How many of these would you consider substantively meaningful for understanding populist rhetoric? What does this suggest about the importance of preprocessing?

**Your answer:** *(double-click to edit)*



---

## Section 5: Comparing Raw Counts vs. TF-IDF

Now let's see how TF-IDF changes which words appear important.

In [None]:
# Convert matrices to DataFrames for easier manipulation
dtm_count_df = pd.DataFrame(
    dtm_count.toarray(),
    columns=vocabulary
)

dtm_tfidf_df = pd.DataFrame(
    dtm_tfidf.toarray(),
    columns=vocabulary
)

In [None]:
# Compare common words vs. distinctive words
words_to_compare = ['the', 'to', 'of',           # Function words (stopwords)
                    'people', 'our', 'we',        # Potentially meaningful
                    'corruption', 'freedom']      # Clearly substantive

# Filter to words that exist in our vocabulary
words_to_compare = [w for w in words_to_compare if w in vocabulary]

print("Comparison: Raw Counts vs. TF-IDF\n")
print(f"{'Word':<12} {'Raw Count':>12} {'Avg TF-IDF':>12} {'Docs':>8}")
print("-" * 48)

for word in words_to_compare:
    raw_count = dtm_count_df[word].sum()
    tfidf_values = dtm_tfidf_df[word]
    docs_with_word = (tfidf_values > 0).sum()
    avg_tfidf = tfidf_values[tfidf_values > 0].mean() if docs_with_word > 0 else 0
    
    print(f"{word:<12} {raw_count:>12} {avg_tfidf:>12.4f} {docs_with_word:>8}")

### 💭 Question 6

Notice that "the" appears 102 times but has a relatively low average TF-IDF score, while "freedom" appears only twice but has a high TF-IDF score. Explain why this happens.

**Your answer:** *(double-click to edit)*



In [None]:
# Let's look at a single document in detail
doc_index = 0

print(f"Document {doc_index}:")
print(f"'{df['translated_sentence'].iloc[doc_index]}'")
print(f"\nCategory: {df['pop_label'].iloc[doc_index]}")

In [None]:
# Top words in this document by raw count
doc_counts = dtm_count_df.iloc[doc_index]
doc_tfidf = dtm_tfidf_df.iloc[doc_index]

# Only non-zero entries
doc_counts_nonzero = doc_counts[doc_counts > 0].sort_values(ascending=False)
doc_tfidf_nonzero = doc_tfidf[doc_tfidf > 0].sort_values(ascending=False)

print("Top 8 words by RAW COUNT:")
for word, count in doc_counts_nonzero.head(8).items():
    print(f"  {word}: {int(count)}")

print("\nTop 8 words by TF-IDF:")
for word, score in doc_tfidf_nonzero.head(8).items():
    print(f"  {word}: {score:.4f}")

---

## Section 6: Identifying Distinctive Words per Category

Now for the payoff: can we identify which words are most associated with each type of populist rhetoric?

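**Approach**: For each category, we calculate the mean TF-IDF score for each word across all documents in that category. Words with a high mean TF-IDF are candidates for that category's distinctive vocabulary, although a word that scores highly in every category is not distinctive of any single one.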

In [None]:
# Calculate mean TF-IDF per category
for code in sorted(df['pop_code'].unique()):
    label = pop_labels[code]
    
    # Select documents from this category
    category_mask = df['pop_code'] == code
    
    # Calculate mean TF-IDF for each word
    mean_tfidf = dtm_tfidf_df[category_mask].mean()
    
    # Get top words
    top_words = mean_tfidf.sort_values(ascending=False).head(10)
    
    print(f"\n{'='*50}")
    print(f"{label.upper()} (code {code})")
    print(f"{'='*50}")
    print(f"\nTop 10 distinctive words:")
    for i, (word, score) in enumerate(top_words.items(), 1):
        print(f"  {i:2}. {word:<15} {score:.4f}")

### 💭 Question 7

Look at the distinctive words for each category. Do they make substantive sense given what the categories are supposed to capture? Which category has the most interpretable distinctive vocabulary? Which is least interpretable?

**Your answer:** *(double-click to edit)*



### Comparing Two Categories

Another way to find distinctive vocabulary is to directly compare categories. Let's compare **Anti-elite** rhetoric with **Neutral** speech.

In [None]:
# Mean TF-IDF for each category
mean_antielite = dtm_tfidf_df[df['pop_code'] == 3].mean()
mean_neutral = dtm_tfidf_df[df['pop_code'] == 0].mean()

# Difference: positive = more anti-elite, negative = more neutral
diff = mean_antielite - mean_neutral

print("Words MORE associated with Anti-elite rhetoric:")
print("-" * 40)
for word, score in diff.sort_values(ascending=False).head(10).items():
    print(f"  {word:<15} +{score:.4f}")

print("\nWords MORE associated with Neutral speech:")
print("-" * 40)
for word, score in diff.sort_values(ascending=True).head(10).items():
    print(f"  {word:<15} {score:.4f}")

---

## 🧪 Your Turn: Hands-on Task

Now it's your turn to apply what you've learned. Complete the following task:

**Task**: Compare **People-centrism** (code 2) with **Us vs. Them** (code 1). Which words distinguish these two categories?

Use the code cell below to:
1. Calculate mean TF-IDF for both categories
2. Find the difference
3. Identify the top 10 words that distinguish each category

In [None]:
# YOUR CODE HERE
# Hint: Follow the pattern from the Anti-elite vs. Neutral comparison above

# Step 1: Calculate mean TF-IDF for People-centrism (code 2)
mean_people = ___

# Step 2: Calculate mean TF-IDF for Us vs. Them (code 1)
mean_usvsthem = ___

# Step 3: Calculate the difference
diff = ___

# Step 4: Print results
print("Words MORE associated with People-centrism:")
# ...

print("\nWords MORE associated with Us vs. Them:")
# ...

---

## Reflection Questions

Before wrapping up, consider these broader questions:

### 💭 Question 8

We found that function words like "the", "of", and "to" often dominate our results. What preprocessing step could we add to address this? What might we lose if we remove these words entirely?

**Your answer:** *(double-click to edit)*



### 💭 Question 9

Our corpus has only 40 sentences. How might our results change with a larger corpus (e.g., 1,000 or 10,000 sentences)? Think about both the vocabulary and the distinctiveness of words per category.

**Your answer:** *(double-click to edit)*



### 💭 Question 10

We used TF-IDF to identify "distinctive" words, but distinctiveness is not the same as importance or meaning. What are the limitations of this approach for understanding populist rhetoric? What would you need to complement this analysis?

**Your answer:** *(double-click to edit)*



---

## Summary

In this exercise, you learned to:

1. **Transform text to numbers** using `CountVectorizer` (raw counts) and `TfidfVectorizer` (weighted counts)
2. **Understand DTM properties**: high dimensionality (one column per word), extreme sparsity (most cells are zero)
3. **Compare representations**: TF-IDF downweights common words and highlights distinctive vocabulary
4. **Find distinctive words**: By comparing mean TF-IDF across categories

### Key Takeaways

- Text data is inherently high-dimensional and sparse
- Raw frequency counts are dominated by common function words
- TF-IDF reweighting emphasises words that are frequent *in some documents* but rare *across the corpus*
- With small corpora, many words appear only once, limiting statistical power
- Bag-of-words representations ignore word order and context, a significant limitation
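
To illustrate the last takeaway, here is a small sketch with two made-up sentences that contain the same words in a different order and mean opposite things. With default settings they should receive identical count vectors. The example sentences are invented, and the `ngram_range=(1, 2)` variant is shown only as one possible partial remedy, not as something covered in this exercise.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Same words, opposite meaning (invented sentences)
pair = [
    "the elite betrayed the people",
    "the people betrayed the elite",
]

# Unigram bag-of-words: word order is discarded
unigram_rows = CountVectorizer().fit_transform(pair).toarray()
print("Identical unigram vectors:", (unigram_rows[0] == unigram_rows[1]).all())

# Adding bigrams (pairs of adjacent words) recovers some local order
bigram_rows = CountVectorizer(ngram_range=(1, 2)).fit_transform(pair).toarray()
print("Identical with bigrams:   ", (bigram_rows[0] == bigram_rows[1]).all())
```

Bigrams enlarge the vocabulary considerably, so on a small corpus they tend to make the matrix even sparser.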

### Next Steps

In the spring workshop, we'll see how **word embeddings** and **LLMs** address some of these limitations by capturing semantic similarity and context.

---

## Bonus: Visualisation (Optional)

If time permits, here's a quick visualisation of the distinctive words.

In [None]:
# Simple bar chart of top words per category
fig, axes = plt.subplots(2, 2, figsize=(12, 10))
axes = axes.flatten()

for idx, code in enumerate(sorted(df['pop_code'].unique())):
    label = pop_labels[code]
    category_mask = df['pop_code'] == code
    mean_tfidf = dtm_tfidf_df[category_mask].mean()
    top_words = mean_tfidf.sort_values(ascending=True).tail(10)  # Ascending for horizontal bar
    
    axes[idx].barh(top_words.index, top_words.values, color='steelblue')
    axes[idx].set_title(f'{label}', fontsize=12, fontweight='bold')
    axes[idx].set_xlabel('Mean TF-IDF')

plt.tight_layout()
plt.suptitle('Top 10 Distinctive Words per Populism Category', y=1.02, fontsize=14, fontweight='bold')
plt.show()