[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/dbamman/anlp25/blob/main/2.compare/ChiSquare_Mann-Whitney_Log-odds.ipynb)

# Comparing prevalence of terms

This notebook examines the words that distinguish the [2024 Democratic Party platform](https://www.presidency.ucsb.edu/documents/2024-democratic-party-platform) from the [2024 Republican Party platform](https://www.presidency.ucsb.edu/documents/2024-republican-party-platform) (both sourced from the American Presidency Project at UCSB), using the $\chi^2$ (chi-square) test and the Mann-Whitney test.

In [None]:
!wget --no-check-certificate -P ../data https://raw.githubusercontent.com/dbamman/anlp25/main/data/2024_democrat_party_platform.txt
!wget --no-check-certificate -P ../data https://raw.githubusercontent.com/dbamman/anlp25/main/data/2024_republican_party_platform.txt

In [None]:
import json
import math
import operator
import sys
from collections import Counter

import nltk
import numpy as np
from scipy.stats import mannwhitneyu
import matplotlib.pyplot as plt

nltk.download("punkt")
nltk.download("punkt_tab")

In [None]:
def read(filename):
    with open(filename, encoding="utf-8") as file:
        # lowercase text
        return file.read().lower()

In [None]:
democrat_text = read("../data/2024_democrat_party_platform.txt")

In [None]:
republican_text = read("../data/2024_republican_party_platform.txt")

Examine your assumptions about which words will most distinguish the Democratic and Republican platforms.  Before looking at the results of the tests, which words do you expect to be distinctive of each?  (If you're not familiar with either, skim the platforms linked above.)

In [None]:
def tokenize(data):
    return nltk.word_tokenize(data)

In [None]:
def get_counts(tokens):
    counts = Counter()
    for token in tokens:
        counts[token] += 1
    return counts
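
As an aside, `collections.Counter` can also build these tallies directly from an iterable, which should be equivalent to the loop in `get_counts`. A minimal sketch on a made-up token list (the tokens are illustrative and not drawn from either platform):

```python
from collections import Counter

# Toy token list, made up for illustration
toy_tokens = ["we", "will", "protect", "we", "will", "we"]

# Counter accepts any iterable and tallies its items
toy_counts = Counter(toy_tokens)

# Missing keys return 0 rather than raising KeyError, which is why
# counts[word] is safe for words that never occur in one of the corpora
print(toy_counts["we"], toy_counts["absent"])

# most_common(n) returns the n highest-count (token, count) pairs
print(toy_counts.most_common(2))
```

The zero default for unseen words is what lets the later functions look up any vocabulary word in either platform's counts without extra checks.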

## $\chi^2$ test

The $\chi^2$ test, as used to compare texts, measures whether the counts in a 2x2 contingency table differ significantly from what we would expect if a word's frequency were independent of the corpus it appears in.  Use the following functions to analyze the difference between the platforms.  How do the most distinctive terms comport with your assumptions?

In [None]:
def get_contingency_table(word, word_counts_a, word_counts_b, total_count_a = None, total_count_b = None):
    """
    Construct a 2x2 contingency table that takes the form of:

             # word in A | # word in B
      -------------------+-------------------
      # other words in A | # other words in B
    """
    # we can take the total counts as input if they are precomputed
    # otherwise, we can also compute them easily by summing the word counts
    if total_count_a is None:
        total_count_a = sum(word_counts_a.values())
    if total_count_b is None:
        total_count_b = sum(word_counts_b.values())

    return np.array([
        [word_counts_a[word], word_counts_b[word]],
        [total_count_a - word_counts_a[word], total_count_b - word_counts_b[word]]
    ])

def chi_sq(o):
    """Calculate chi-square value for 2x2 contingency table o
    
    We use the simpler form given in Manning and Schuetze (1999)
    for 2x2 contingency tables:
    https://nlp.stanford.edu/fsnlp/promo/colloc.pdf, equation 5.7
    """

    N = o.sum()
    return (N * (o[0,0] * o[1,1] - o[0,1] * o[1,0]) ** 2) / ((o[0,0] + o[0,1]) * (o[0,0] + o[1,0]) * (o[0,1] + o[1,1]) * (o[1,0] + o[1,1]))
    
def run_chi_square_on_corpus(word_counts_a, word_counts_b): 
    total_count_a = 0.0
    total_count_b = 0.0
    vocab = {}
    for word in word_counts_a:
        vocab[word] = 1
        total_count_a += word_counts_a[word]
    for word in word_counts_b:
        vocab[word] = 1
        total_count_b += word_counts_b[word]

    total_words = total_count_a + total_count_b

    chisq_vals = {}
    for word in vocab:
        contingency_table = get_contingency_table(
            word,
            word_counts_a,
            word_counts_b,
            total_count_a,
            total_count_b
        )

        chisq_vals[word] = chi_sq(contingency_table)

    sorted_chi = sorted(chisq_vals.items(), key=lambda x: x[1], reverse=True)
    corpus_a_words = []
    corpus_b_words = []

    for word, chisq_val in sorted_chi:
        if word_counts_a[word] / total_count_a > word_counts_b[word] / total_count_b:
            corpus_a_words.append(word)
        else:
            corpus_b_words.append(word)

    print("Democrat:\n")
    for word in corpus_a_words[:20]:
        print(f"{word}\t{chisq_vals[word]}")

    print("Republican:\n")
    for word in corpus_b_words[:20]:
        print(f"{word}\t{chisq_vals[word]}")
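
To see where the closed form in `chi_sq` comes from, it can be checked against the general definition of the statistic, the sum of (observed - expected)² / expected over all four cells. The sketch below does this on a made-up table; `chi_sq_general`, `chi_sq_shortcut`, and the counts are illustrative names and values, and the comparison with `scipy.stats.chi2_contingency` assumes the continuity correction is switched off.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Toy 2x2 table (made-up counts):
# rows = [word, all other words], columns = [corpus A, corpus B]
toy_table = np.array([[10.0, 2.0],
                      [90.0, 98.0]])

def chi_sq_general(o):
    """Sum of (observed - expected)^2 / expected over all cells."""
    row_totals = o.sum(axis=1, keepdims=True)
    col_totals = o.sum(axis=0, keepdims=True)
    # Expected counts if word use were independent of corpus
    expected = row_totals * col_totals / o.sum()
    return ((o - expected) ** 2 / expected).sum()

def chi_sq_shortcut(o):
    """Closed form for 2x2 tables (same expression as chi_sq above)."""
    n = o.sum()
    numerator = n * (o[0, 0] * o[1, 1] - o[0, 1] * o[1, 0]) ** 2
    denominator = (
        (o[0, 0] + o[0, 1]) * (o[0, 0] + o[1, 0])
        * (o[0, 1] + o[1, 1]) * (o[1, 0] + o[1, 1])
    )
    return numerator / denominator

general = chi_sq_general(toy_table)
shortcut = chi_sq_shortcut(toy_table)
# correction=False turns off the Yates continuity correction that
# scipy applies to 2x2 tables by default
scipy_stat = chi2_contingency(toy_table, correction=False)[0]

print(general, shortcut, scipy_stat)
```

All three values should agree up to floating-point error, which suggests the shortcut is an algebraic rearrangement of the general formula rather than an approximation.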

In [None]:
democrat_tokens = tokenize(democrat_text)
democrat_counts = get_counts(democrat_tokens)

In [None]:
republican_tokens = tokenize(republican_text)
republican_counts = get_counts(republican_tokens)

Before running chi-square on all the words, let's step through the calculation for a single word first. We begin by plotting the 2x2 contingency table of that word's occurrences in each corpus. We normalize each column by its total, so the cells show proportions within each platform rather than raw counts.
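
The normalization step relies on NumPy broadcasting. A small sketch with made-up counts (not taken from the platforms) shows what dividing by `table.sum(axis=0)` does:

```python
import numpy as np

# Made-up counts: rows = [word, all other words], columns = [corpus A, corpus B]
toy_table = np.array([[30, 5],
                      [9970, 4995]])

# sum(axis=0) gives one total per column (one per corpus); broadcasting
# then divides every cell by the total of its own column
proportions = toy_table / toy_table.sum(axis=0)

print(proportions)
print(proportions.sum(axis=0))  # each column sums to 1
```

Normalizing by column makes the top row directly comparable across corpora of different sizes, since each entry becomes the share of that corpus's tokens accounted for by the word.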

In [None]:
def plot_contingency_table(table, normalize=False):
    if normalize:
        table = table / table.sum(axis=0)

    
    fig, ax = plt.subplots()
    im = ax.imshow(table, aspect='auto')
        
    ax.set_xticks(range(table.shape[1]))
    ax.set_yticks(range(table.shape[0]))
    
    for i in range(table.shape[0]):
        for j in range(table.shape[1]):
            value = table.iloc[i, j] if hasattr(table, 'iloc') else table[i, j]
            text_color = 'white' if value < table.max() * 0.5 else 'black'
            
            if normalize:
                text = f'{value:.3f}'
            else:
                text = f'{int(value)}'
            
            ax.text(j, i, text, ha='center', va='center', 
                   color=text_color, fontsize=12, fontweight='bold')

    title = "Contingency Table"
    x_label = "Corpus"
    y_label = "Present"
    ax.set_title(title, fontsize=16, fontweight='bold', pad=20)
    ax.set_xlabel(x_label, fontsize=14, fontweight='bold')
    ax.set_ylabel(y_label, fontsize=14, fontweight='bold')
    ax.set_xticklabels(["Democrat", "Republican"])
    ax.set_yticklabels(["Yes", "No"])
    # Add grid
    ax.set_xticks(np.arange(table.shape[1]+1)-.5, minor=True)
    ax.set_yticks(np.arange(table.shape[0]+1)-.5, minor=True)
    ax.grid(which='minor', color='black', linestyle='-', linewidth=1)
    
    plt.tight_layout()
    return ax


In [None]:
table = get_contingency_table("biden", democrat_counts, republican_counts)

In [None]:
plot_contingency_table(table, normalize=True);

We can see that the Democratic platform mentions "biden" more often, but is the difference statistically significant? We can use the chi-square test to test the null hypothesis that "biden" occurs at the same rate in both platforms.

For a 2x2 table (one degree of freedom), we can consult a [chi-square table](https://math.arizona.edu/~jwatkins/chi-square-table.pdf) to see that a value greater than $7.879$ is statistically significant at $\alpha = 0.005$.
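
The threshold can also be recovered without a printed table. The sketch below uses `scipy.stats.chi2`, assuming one degree of freedom for a 2x2 table; `hypothetical_stat` is a made-up value and not a result computed from the platforms.

```python
from scipy.stats import chi2

alpha = 0.005
df = 1  # (rows - 1) * (columns - 1) for a 2x2 table

# ppf is the inverse CDF: the value below which a fraction
# (1 - alpha) of the chi-square distribution lies
critical_value = chi2.ppf(1 - alpha, df)
print(round(critical_value, 3))

# sf is the survival function (1 - CDF), i.e. the p-value
# associated with an observed statistic
hypothetical_stat = 10.0  # made-up statistic for illustration
p_value = chi2.sf(hypothetical_stat, df)
print(p_value)
```

Working with `chi2.sf` directly gives an exact p-value for any statistic, rather than only a comparison against the handful of thresholds listed in a table.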

In [None]:
chi_sq(table)

Now let's run the chi-square test on all the words in the vocabulary to see which ones are more prevalent in which platform.

In [None]:
run_chi_square_on_corpus(democrat_counts, republican_counts)

Are these results surprising? Examine specific words to check their frequency in both datasets.

In [None]:
print("Totals: R: %s, D: %s" % (len(republican_tokens), len(democrat_tokens)))

In [None]:
word = "climate"
print("%s -- R: %s, D: %s" % (word, republican_counts[word], democrat_counts[word]))
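
Raw counts can be hard to compare if the two platforms differ in length (the totals printed above show whether they do). One option is to convert counts into a rate per 10,000 tokens. The helper `rate_per_10k` and all the numbers below are hypothetical, chosen only to illustrate the arithmetic:

```python
from collections import Counter

def rate_per_10k(word, counts, total_tokens):
    """Occurrences of `word` per 10,000 tokens (hypothetical helper)."""
    return 10_000 * counts[word] / total_tokens

# Made-up counts and corpus sizes, for illustration only
toy_counts_a = Counter({"climate": 40})
toy_counts_b = Counter({"climate": 2})
toy_total_a, toy_total_b = 40_000, 8_000

rate_a = rate_per_10k("climate", toy_counts_a, toy_total_a)
rate_b = rate_per_10k("climate", toy_counts_b, toy_total_b)
print(rate_a, rate_b)
```

With the real data, the same call could take `democrat_counts` and `len(democrat_tokens)` (and their Republican counterparts) as arguments.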

## Mann-Whitney

We saw earlier that $\chi^2$ is not a perfect estimator since it doesn't account for the "burstiness" of language -- if we see the word "Dracula" in a text, we're probably going to see it again in that same text. Occurrences of a word are not independent random events; they are tightly coupled with each other. If we're trying to understand the robust differences between two corpora, we might prefer to prioritize words that show up more frequently everywhere in corpus A (but not in corpus B) over those that show up very frequently only within a narrow slice of A (such as one text in a genre, one chapter in a book, or one speaker when measuring the differences between political parties). Use the following functions to run the Mann-Whitney test, which accounts for this phenomenon when finding distinctive terms.
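
A toy comparison may make the contrast concrete. In the sketch below, two hypothetical words each occur 20 times in corpus A and never in corpus B, so their aggregate counts are identical; they differ only in how the occurrences are spread across chunks. The per-chunk counts are made up for illustration.

```python
from scipy.stats import mannwhitneyu

# Made-up per-chunk counts: 10 chunks in corpus A, 10 in corpus B.
# Both words occur 20 times in A and never in B.
bursty_a = [20] + [0] * 9   # all 20 occurrences fall in a single chunk
even_a = [2] * 10           # 2 occurrences in every chunk
absent_b = [0] * 10

_, p_bursty = mannwhitneyu(bursty_a, absent_b, alternative="two-sided")
_, p_even = mannwhitneyu(even_a, absent_b, alternative="two-sided")

print(f"bursty word p-value: {p_bursty:.4f}")
print(f"evenly spread word p-value: {p_even:.6f}")
```

Because the test works on ranks of per-chunk counts, the evenly spread word should receive a much smaller p-value than the bursty one, even though a test on aggregate counts alone would treat the two words identically.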

In [None]:
def count_differences(tokens_a, tokens_b):
    """Measure the difference in the frequency of each word between two corpora."""
    total_len_a = len(tokens_a)
    total_len_b = len(tokens_b)

    word_counts_a = Counter(tokens_a)
    word_counts_b = Counter(tokens_b)

    vocab = set(word_counts_a) | set(word_counts_b)

    differences = {}
    for word in vocab:
        freq_a = word_counts_a[word] / total_len_a
        freq_b = word_counts_b[word] / total_len_b
        diff = freq_a - freq_b
        differences[word] = diff

    return differences


def get_chunk_counts(tokens, chunk_size):
    """Get token counts for chunks of the corpus.

    Returns a list of Counters, each with token counts for their respective chunk.
    """
    chunks = []
    for i in range(0, len(tokens), chunk_size):
        counts = Counter()
        for j in range(chunk_size):
            if i + j < len(tokens):
                counts[tokens[i + j]] += 1
        chunks.append(counts)
    return chunks


def mann_whitney(tokens_a, tokens_b):
    chunk_size = 500
    chunks_a = get_chunk_counts(tokens_a, chunk_size)
    chunks_b = get_chunk_counts(tokens_b, chunk_size)

    pvals = {}
    vocab = set(tokens_a + tokens_b)
    for word in vocab:
        a = []
        b = []

        # Note a and b can be different lengths (i.e., different sample sizes)
        #
        # See Mann and Whitney (1947), "On a Test of Whether one of Two Random
        # Variables is Stochastically Larger than the Other"
        # https://projecteuclid.org/download/pdf_1/euclid.aoms/1177730491

        # (This is part of their innovation over the case of equal sample sizes in Wilcoxon 1945)

        for chunk in chunks_a:
            a.append(chunk[word])
        for chunk in chunks_b:
            b.append(chunk[word])

        # Consider: what information do `a` and `b` encode?

        # we use the scipy implementation of the Mann-Whitney U rank test
        # see: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.mannwhitneyu.html
        statistic, pval = mannwhitneyu(a, b, alternative="two-sided")

        # We'll use the p-value as our quantity of interest.  [Note that in the normal approximation
        # that Mann-Whitney uses to assess significance for large sample sizes, the significance
        # of the raw statistic depends on the number of ties in the data, so the statistic itself
        # isn't exactly comparable across different words.]
        pvals[word] = pval

    return pvals


def mann_whitney_analysis(tokens_a, tokens_b):

    pvals = mann_whitney(tokens_a, tokens_b)

    # Mann-Whitney tells us the significance of a term's difference in two groups, but we also
    # need the directionality of that difference (whether it's used more by group A or group B).

    # Let's use our difference-in-proportions function above to check the directionality.
    # [Note we could also measure directionality by checking whether the Mann-Whitney statistic
    # is greater or less than its mean under the null, len(chunks_a) * len(chunks_b) * 0.5.]

    differences = count_differences(tokens_a, tokens_b)

    # differences[k] = freq_a - freq_b, so a positive value means the term is relatively
    # more frequent in A. The labels below assume A is the Democrat token list and B the
    # Republican one, as in the call that follows.
    more_b = {k: pvals[k] for k in pvals if differences[k] <= 0}
    more_a = {k: pvals[k] for k in pvals if differences[k] > 0}

    print("More Republican:\n")
    for k, v in sorted(more_b.items(), key=operator.itemgetter(1))[:20]:
        print("%s\t%.15f" % (k, v))

    print("\nMore Democrat:\n")
    for k, v in sorted(more_a.items(), key=operator.itemgetter(1))[:20]:
        print("%s\t%.15f" % (k, v))

In [None]:
mann_whitney_analysis(democrat_tokens, republican_tokens)
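
The comment in `mann_whitney_analysis` notes that directionality could also be read off the U statistic itself. A minimal sketch of that alternative follows; `direction_from_u` is a hypothetical helper, the per-chunk counts are made up, and the code assumes scipy 1.7 or later, where the returned statistic is the U value for the first sample.

```python
from scipy.stats import mannwhitneyu

def direction_from_u(a, b):
    """Return 'a', 'b', or 'tie' by comparing U to its null mean (hypothetical helper)."""
    # Assumption: scipy >= 1.7, where `statistic` is the U statistic of sample `a`
    u_stat = mannwhitneyu(a, b, alternative="two-sided").statistic
    # Under the null hypothesis, U has mean len(a) * len(b) / 2
    null_mean = len(a) * len(b) * 0.5
    if u_stat > null_mean:
        return "a"
    if u_stat < null_mean:
        return "b"
    return "tie"

# Made-up per-chunk counts for one word in two corpora
first = direction_from_u([3, 4, 5, 6], [0, 1, 2])
second = direction_from_u([0, 0, 1], [2, 3, 4, 5])
print(first, second)
```

Unlike the difference in overall proportions, this check uses the same per-chunk ranks as the test itself, so the direction and the p-value come from a single source of evidence.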