<a href="https://colab.research.google.com/github/middlebury-csci-0451/CSCI-0451/blob/main/lecture-notes/text-classification.ipynb" target="_parent">Open these notes in Google Colab</a>

<a href="https://colab.research.google.com/github/middlebury-csci-0451/CSCI-0451/blob/main/lecture-notes/text-classification-live.ipynb" target="_parent">Open the live version in Google Colab</a>


*Major components of this set of lecture notes are based on the [Text Classification](https://pytorch.org/tutorials/beginner/text_sentiment_ngrams_tutorial.html) tutorial from the PyTorch documentation*. 

## Deep Text Classification and Word Embedding

In this set of notes, we'll discuss the problem of *text classification*. Text classification is a common problem in which we aim to classify pieces of text into different categories. These categories might be about:

- **Subject matter**: is this news article about politics, fashion, or finance?
- **Emotional valence**: is this tweet happy or sad? Excited or calm? This particular class of questions is so important that it has its own name: sentiment analysis.
- **Automated content moderation**: is this Facebook comment a possible instance of abuse or harassment? Is this Reddit thread promoting violence? Is this email spam?

We saw text classification previously when we first considered the problem of vectorizing pieces of text. We are now going to look at a somewhat more contemporary approach to text using *word embeddings*. 


In [None]:
import pandas as pd
import torch
import numpy as np

# for embedding visualization later
import plotly.express as px 
import plotly.io as pio

# for VSCode plotly rendering
pio.renderers.default = "plotly_mimetype+notebook"

# for appearance
pio.templates.default = "plotly_white"

from sklearn.model_selection import train_test_split

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

For this example, we are going to use a data set containing headlines from a large number of different news articles on the website [HuffPost](https://www.huffpost.com/). I retrieved this data [from Kaggle](https://www.kaggle.com/rmisra/news-category-dataset). 

In [None]:
# access the data
url = "https://raw.githubusercontent.com/PhilChodrow/PIC16B/master/datasets/news/News_Category_Dataset_v2.json"
df  = pd.read_json(url, lines=True)
df  = df[["category", "headline"]]

There are over 200,000 headlines listed here, along with the category in which they appeared on the website.


In [None]:
df.head()

Our task will be to teach an algorithm to classify headlines by predicting the category based on the text of the headline. 

Training a model on this much text data can require a lot of time, so we are going to simplify the problem a little bit, by reducing the number of categories. Let's take a look at which categories we have: 

In [None]:
df.groupby("category").size()

Some of these categories are a little odd:

- "Women"? 
- "Weird News"? 
- What's the difference between "Style," "Style & Beauty," and "Taste"? 
- "Parenting" vs. "Parents"? 
- Etc.

Well, there are definitely some questions here! Let's just choose a few categories, and discard the rest. We're going to give each of the categories an integer that we'll use to encode the category in the target variable. 

In [None]:
categories = {
    "STYLE"   : 0,
    "SCIENCE" : 1, 
    "TECH" : 2
}

# keep only the chosen categories; .copy() lets us safely modify the result later
df = df[df["category"].isin(list(categories.keys()))].copy()
df.head()

In [None]:
df["category"] = df["category"].apply(categories.get)
df

Next we need to wrap this Pandas dataframe as a Torch data set. While we've been using pre-implemented Torch classes for things like directories of images, in this case it's not so hard to just implement our own Dataset. We just need to implement `__getitem__()` to return the appropriate row of the dataframe. 
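Here is a minimal sketch of what such a wrapper could look like. The class name `TextDataFromDF` and the toy dataframe are assumptions made for illustration. The only firm requirement is that `__getitem__()` returns one row as a tuple of text and label, and that `__len__()` reports the number of rows.

```python
import pandas as pd
from torch.utils.data import Dataset

class TextDataFromDF(Dataset):
    """Wrap a dataframe with `category` and `headline` columns (hypothetical class name)."""

    def __init__(self, df):
        # reset the index so that positions 0, 1, 2, ... always work
        self.df = df.reset_index(drop=True)

    def __len__(self):
        return len(self.df)

    def __getitem__(self, index):
        row = self.df.iloc[index]
        # return (text, label): the text comes first, the integer label second
        return row["headline"], int(row["category"])

# toy stand-in for the filtered HuffPost dataframe
toy_df = pd.DataFrame({
    "category": [0, 1, 2],
    "headline": ["A style headline", "A science headline", "A tech headline"],
})
toy_data = TextDataFromDF(toy_df)
print(toy_data[1])
```

Indexing the wrapped data with an integer gives back a single headline together with its label, which is all that a Torch `DataLoader` needs.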

Now let's perform a train-validation split and make Datasets from each one. 
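As a rough sketch, the split itself can be done on the dataframe with `train_test_split` before any wrapping happens. The 20% validation fraction, the `random_state`, and the toy dataframe below are assumptions for illustration. Each piece would then be wrapped as a Dataset to give `train_data` and `val_data`.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# toy stand-in for the filtered HuffPost dataframe
toy_df = pd.DataFrame({
    "category": [0, 1, 2, 0, 1, 2, 0, 1, 2, 0],
    "headline": [f"headline {i}" for i in range(10)],
})

# hold out 20% of the rows for validation; random_state makes the shuffle repeatable
df_train, df_val = train_test_split(toy_df, test_size=0.2, shuffle=True, random_state=0)

print(len(df_train), len(df_val))
```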

Each element of our data sets is a tuple of text and label: 

## Text Vectorization (Again)

Now we need to vectorize our text. This time, we're not going to use one-hot encodings. Instead, we are going to treat each sentence as a sequence of words, and identify each word via an integer index. First we'll use a *tokenizer* to split each sentence into individual words: 

In [None]:
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator

tokenizer = get_tokenizer('basic_english')

tokenized = tokenizer(train_data[194][0])
tokenized

You might reasonably disagree about whether this is a good tokenization: should punctuation marks be included? Should "you're" really have become "you", "'", and "re"? These are excellent questions that we won't discuss too much further right now. 

We're now ready to build a *vocabulary*. A vocabulary is a mapping from words to integers. The code below loops through the training data and uses it to build such a mapping. 

Here are the first couple elements of the vocabulary: 

This vocabulary can be applied to a list of tokens like this: 
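To make the idea concrete without depending on any particular library, here is a pure-Python sketch of a vocabulary. The helpers `build_vocab` and `encode` are hypothetical stand-ins written for this illustration, not the torchtext API. Tokens are numbered by how often they appear, and index 0 is reserved for a special `<unk>` token that absorbs any word we never saw in training.

```python
from collections import Counter

def build_vocab(token_lists, specials=("<unk>",)):
    """Map each token to an integer: specials first, then tokens by descending frequency."""
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    itos = list(specials) + [tok for tok, _ in counts.most_common()]
    return {tok: i for i, tok in enumerate(itos)}

def encode(tokens, stoi, unk="<unk>"):
    """Replace each token by its index, falling back to the <unk> index."""
    return [stoi.get(tok, stoi[unk]) for tok in tokens]

# toy tokenized "headlines"
corpus = [["the", "new", "phone"], ["the", "new", "dress"], ["the", "moon"]]
stoi = build_vocab(corpus)

encoded = encode(["the", "moon", "landing"], stoi)
print(encoded)
```

The word "landing" never appears in the toy corpus, so it is mapped to the `<unk>` index rather than raising an error.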

## Batch Collation

Now we're ready to construct the function that is going to actually pass a batch of data to our training loop. Here are the main steps: 

1. We pull some feature data (i.e. a batch of headlines). 
2. We represent each headline as a sequence of integers using the `vocab`. 
3. We pad the headlines with an unused integer index if necessary so that all headlines have the same length. This index corresponds to "blank" or "no words in this slot." 
4. We return the batch of headlines as a consolidated tensor. 
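The steps above can be sketched as follows. The tiny vocabulary, the whitespace tokenizer inside `encode`, and the choice to pad each batch only up to its own longest headline are all assumptions made for illustration. A fixed maximum length would work just as well.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

# toy stand-in for the vocabulary built earlier
stoi = {"<unk>": 0, "the": 1, "new": 2, "phone": 3, "moon": 4}
PAD_IDX = len(stoi)   # an index that no real token uses

def encode(text):
    """Lowercase, split on whitespace, and look up each token."""
    return [stoi.get(tok, stoi["<unk>"]) for tok in text.lower().split()]

def collate_batch(batch):
    """Turn a list of (text, label) pairs into (labels, padded_text) tensors."""
    labels = torch.tensor([label for _, label in batch], dtype=torch.int64)
    sequences = [torch.tensor(encode(text), dtype=torch.int64) for text, _ in batch]
    # pad shorter headlines on the right so that every row has the same length
    text = pad_sequence(sequences, batch_first=True, padding_value=PAD_IDX)
    return labels, text

labels, text = collate_batch([("the new phone", 2), ("the moon", 1)])
print(labels)
print(text)
```

The second headline has only two words, so its third slot is filled with `PAD_IDX`.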

In [None]:
from torch.utils.data import DataLoader

train_loader = DataLoader(train_data, batch_size=8, shuffle=True, collate_fn=collate_batch)
val_loader = DataLoader(val_data, batch_size=8, shuffle=True, collate_fn=collate_batch)

Let's take a look at a batch of data now: 

The first element is the tensor of labels. The second is the tensor of integer indices representing 8 headlines' worth of text, with one row per headline, padded so that all rows have the same length. 

## Modeling

### Word Embedding

A *word embedding* refers to a representation of a word in a vector space. Each word is assigned an individual vector. The general aim of a word embedding is to create a representation such that words with related meanings are close to each other in a vector space, while words with different meanings are farther apart. One usually hopes for the *directions* connecting words to be meaningful as well. Here's a nice diagram illustrating some of the general concepts: 

![](https://miro.medium.com/max/1838/1*OEmWDt4eztOcm5pr2QbxfA.png)

*Image credit: [Towards Data Science](https://towardsdatascience.com/creating-word-embeddings-coding-the-word2vec-algorithm-in-python-using-deep-learning-b337d0ba17a8)*

Word embeddings are often produced as intermediate stages in many machine learning algorithms. In our case, we're going to add an embedding layer at the very base of our model. We'll allow the user to flexibly specify the number of dimensions. 

We'll use fairly low-dimensional embeddings in this lecture, but state-of-the-art embeddings typically have many more dimensions. For example, the [Embedding Projector demo](http://projector.tensorflow.org/) supplied by TensorFlow uses a default dimension of 200. 
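Here is one possible sketch of a model with an embedding layer at its base. The class name, the averaging over word positions, and the single linear layer are assumptions chosen to keep the example small. Other architectures (for example, flattening the embedded sequence) would fit the same description.

```python
import torch
from torch import nn

class TextClassificationModel(nn.Module):
    def __init__(self, vocab_size, embedding_dim, num_class):
        super().__init__()
        # one extra row for the padding index, which stays fixed at the zero vector
        self.embedding = nn.Embedding(vocab_size + 1, embedding_dim, padding_idx=vocab_size)
        self.fc = nn.Linear(embedding_dim, num_class)

    def forward(self, x):
        x = self.embedding(x)   # (batch, seq_len, embedding_dim)
        x = x.mean(dim=1)       # average over positions; padded slots contribute zeros
        return self.fc(x)       # (batch, num_class) unnormalized class scores

# the user chooses the embedding dimension when constructing the model
model = TextClassificationModel(vocab_size=5, embedding_dim=16, num_class=3)

batch = torch.tensor([[1, 2, 3], [1, 4, 5]])   # 5 is the padding index here
scores = model(batch)
print(scores.shape)
```

Each row of `model.embedding.weight` is the learned vector for one word in the vocabulary.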

Let's learn and train a model! 

In [None]:
import time

optimizer = torch.optim.Adam(model.parameters(), lr=0.5)
loss_fn = torch.nn.CrossEntropyLoss()

def train(dataloader):
    epoch_start_time = time.time()
    # keep track of some counts for measuring accuracy
    total_acc, total_count = 0, 0

    for idx, (label, text) in enumerate(dataloader):
        # zero gradients
        optimizer.zero_grad()
        # form prediction on batch
        predicted_label = model(text)
        # evaluate loss on prediction
        loss = loss_fn(predicted_label, label)
        # compute gradient
        loss.backward()
        # take an optimization step
        optimizer.step()

        # for printing accuracy
        total_acc   += (predicted_label.argmax(1) == label).sum().item()
        total_count += label.size(0)
        
    # note: `epoch` is not an argument; it is read from the enclosing scope,
    # so call train() inside a loop that defines it
    print(f'| epoch {epoch:3d} | train accuracy {total_acc/total_count:8.3f} | time: {time.time() - epoch_start_time:5.2f}s')
    
def evaluate(dataloader):

    total_acc, total_count = 0, 0

    with torch.no_grad():
        for idx, (label, text) in enumerate(dataloader):
            predicted_label = model(text)
            total_acc += (predicted_label.argmax(1) == label).sum().item()
            total_count += label.size(0)
    return total_acc/total_count
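To show how functions like these fit together, here is a condensed, self-contained sketch of an epoch loop on random toy data. Everything in it (the toy batches, the tiny model, the learning rate, and the helper names `train_one_epoch` and `accuracy`) is an assumption for illustration. In the notes, the loop would instead call `train(train_loader)` once per epoch and then `evaluate(val_loader)`.

```python
import torch

torch.manual_seed(0)

# toy stand-in for a data loader: batches of (label, text) with random contents
X = torch.randint(0, 20, (64, 5))
y = torch.randint(0, 3, (64,))
toy_loader = [(y[i:i + 8], X[i:i + 8]) for i in range(0, 64, 8)]

# tiny stand-in model: embed 5 tokens into 4 dimensions each, flatten, classify
model = torch.nn.Sequential(
    torch.nn.Embedding(20, 4),
    torch.nn.Flatten(),
    torch.nn.Linear(5 * 4, 3),
)
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
loss_fn = torch.nn.CrossEntropyLoss()

def train_one_epoch(dataloader):
    for label, text in dataloader:
        optimizer.zero_grad()
        loss = loss_fn(model(text), label)
        loss.backward()
        optimizer.step()

def accuracy(dataloader):
    correct, count = 0, 0
    with torch.no_grad():
        for label, text in dataloader:
            correct += (model(text).argmax(1) == label).sum().item()
            count += label.size(0)
    return correct / count

EPOCHS = 5
history = []
for epoch in range(1, EPOCHS + 1):
    train_one_epoch(toy_loader)
    history.append(accuracy(toy_loader))

print(history)
```

Because the toy labels are random, the accuracies here say nothing about headlines. The point is only the shape of the loop.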

Our accuracy on validation data is much lower than what we achieved on the training data. This is a possible sign of overfitting. Regardless, this predictive performance is much better than what we would have achieved by guesswork: 

In [None]:
df_train.groupby("category").size() / len(df_train)

## Inspecting Word Embeddings

Recall from our discussion of image classification that the intermediate layers learned by the model can help us understand the representations that the model uses to construct its final outputs. In the case of word embeddings, we can simply extract this matrix from the corresponding layer of the model: 
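As a small sketch of the extraction step, the weights of a Torch embedding layer can be pulled out as a NumPy array. The layer below is a freshly created stand-in. With a trained model, we would access the corresponding attribute instead (something like `model.embedding`, where the attribute name is an assumption about how the model was written).

```python
import torch

# stand-in for the trained model's embedding layer: 6 tokens, 16 dimensions
embedding = torch.nn.Embedding(6, 16)

# detach from the autograd graph, move to the CPU, and convert to a NumPy array
embedding_matrix = embedding.weight.detach().cpu().numpy()
print(embedding_matrix.shape)
```

The matrix has one row per token and one column per embedding dimension.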

Let's also extract the words from our vocabulary: 

The weight matrix itself has 16 columns, which is too many for us to conveniently visualize. So, instead we are going to use our friend PCA to extract a 2-dimensional representation that we can plot. 

We'll use the Plotly package to do the plotting. Plotly works best with dataframes: 
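Here is a hedged sketch of the PCA and dataframe steps together. The random weight matrix and the toy token list are stand-ins for the learned embedding and the real vocabulary. The column names `word`, `x0`, and `x1` match the ones used in the plotting code in these notes.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
tokens = ["<unk>", "the", "new", "phone", "dress", "moon"]   # toy vocabulary
weights = rng.normal(size=(len(tokens), 16))                 # stand-in for learned weights

# project the 16-dimensional vectors onto their first two principal components
pca = PCA(n_components=2)
projected = pca.fit_transform(weights)

# one row per word, with its two plotting coordinates
embedding_df = pd.DataFrame({
    "word": tokens,
    "x0": projected[:, 0],
    "x1": projected[:, 1],
})
print(embedding_df.head())
```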

And, let's plot! We've used Plotly for the interactivity: hover over a dot to see the word it corresponds to. 

In [None]:
fig = px.scatter(embedding_df, 
                 x = "x0", 
                 y = "x1", 
                 size = list(np.ones(len(embedding_df))),
                 size_max = 10,
                 hover_name = "word")

fig.show()

We've made an embedding! We might notice that this embedding appears to be a little bit "stretched out" in three main directions. Each one corresponds to one of the three classes in our training data. 

## Bias in Text Embeddings

Whenever we create a machine learning model that might conceivably have impact on the thoughts or actions of human beings, we have a responsibility to understand the limitations and biases of that model. Biases can enter into machine learning models through several routes, including the data used as well as choices made by the modeler along the way. For example, in our case: 

1. **Data**: we used data from a popular news source. 
2. **Modeler choice**: we only used data corresponding to a certain subset of labels. 

With these considerations in mind, let's see what kinds of words our model associates with female and male genders. 

In [None]:
feminine = ["she", "her", "woman"]
masculine = ["he", "him", "man"]

highlight_1 = ["strong", "powerful", "smart",     "thinking", "brave", "muscle"]
highlight_2 = ["hot",    "sexy",     "beautiful", "shopping", "children", "thin"]

def gender_mapper(x):
    if x in feminine:
        return 1
    elif x in masculine:
        return 4
    elif x in highlight_1:
        return 3
    elif x in highlight_2:
        return 2
    else:
        return 0

embedding_df["highlight"] = embedding_df["word"].apply(gender_mapper)
embedding_df["size"]      = np.array(1.0 + 50*(embedding_df["highlight"] > 0))

# keep only the highlighted words for plotting
sub_df = embedding_df[embedding_df["highlight"] > 0]

In [None]:
import plotly.express as px 

fig = px.scatter(sub_df, 
                 x = "x0", 
                 y = "x1", 
                 color = "highlight",
                 size = list(sub_df["size"]),
                 size_max = 10,
                 hover_name = "word", 
                 text = "word")

fig.update_traces(textposition='top center')


fig.show()

Our text classification model's word embedding is unambiguously sexist. 

- Words like "hot", "sexy", and "shopping" are located closer to feminine words like "she", "her", and "woman".
- Words like "strong", "smart", and "thinking" are located closer to masculine words like "he", "him", and "man". 
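One way to make "closer" more precise than eyeballing the plot is to compute average distances in the plotted plane. The helper `mean_distance` below is a hypothetical function written for this illustration, and the coordinates are made up with placeholder words purely to show the calculation. Applied to the real `embedding_df`, one could compare a word's average distance to the feminine list against its average distance to the masculine list.

```python
import numpy as np
import pandas as pd

def mean_distance(df, word, anchors):
    """Average Euclidean distance in the (x0, x1) plane from `word` to each anchor word."""
    coords = df.set_index("word")[["x0", "x1"]]
    target = coords.loc[word].to_numpy()
    others = coords.loc[anchors].to_numpy()
    return float(np.linalg.norm(others - target, axis=1).mean())

# made-up coordinates with placeholder words, only to show the calculation
toy_df = pd.DataFrame({
    "word": ["query", "a1", "a2", "b1", "b2"],
    "x0":   [0.0, 3.0, 0.0, 6.0, 0.0],
    "x1":   [0.0, 4.0, 5.0, 8.0, 10.0],
})

dist_a = mean_distance(toy_df, "query", ["a1", "a2"])   # (5 + 5) / 2
dist_b = mean_distance(toy_df, "query", ["b1", "b2"])   # (10 + 10) / 2
print(dist_a, dist_b)
```

Keep in mind that distances in the 2-dimensional PCA projection only approximate distances in the full embedding space.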

Where did these biases come from? 

- The primary source is the data itself: HuffPost headlines in certain categories can be highly gendered, and the "Style" category is an example of this. 
- A secondary source is the choices that I made as a modeler. In particular, I intentionally chose categories that would emphasize biases in the data and make them easy to visualize. 

While I could have made different choices and obtained different results, this episode highlights a fundamental set of questions usually underexamined in contemporary machine learning: 

- What biases are built into my data source? 
- How do my choices about which data to use influence the biases present in my model? 

For more on the topic of bias in language models, you may wish to read the now-infamous paper by Emily Bender, Timnit Gebru, Angelina McMillan-Major, and "Shmargaret Shmitchell" (Margaret Mitchell), "[On the Dangers of Stochastic Parrots](https://faculty.washington.edu/ebender/papers/Stochastic_Parrots.pdf)." This is the paper that ultimately led to Google's firing of Gebru and Mitchell in late 2020 and early 2021, respectively. 

Here's a very recent example (from Margaret Mitchell) illustrating gender bias in ChatGPT: 

<blockquote class="twitter-tweet"><p lang="en" dir="ltr">I replicated this (my screenshot below).<br>Really great example of gender bias, for those of you who need a canonical example to make the point. <a href="https://t.co/O1A8Tk7oI1">https://t.co/O1A8Tk7oI1</a> <a href="https://t.co/hKt4HSBzh3">pic.twitter.com/hKt4HSBzh3</a></p>&mdash; MMitchell (@mmitchell_ai) <a href="https://twitter.com/mmitchell_ai/status/1650110045781393410?ref_src=twsrc%5Etfw">April 23, 2023</a></blockquote> <script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>