# Tutorial: Text Classification

## Step 1: Token Classification

Our ultimate goal is to classify entire texts into categories. For instance, is this employee review positive or negative? Is this news article about a merger or a scandal? Did this layoff announcement provide utilitarian or normative justifications? A simpler version of this task, however, is to classify individual tokens. For example, is this word positive or negative? Is this word a noun or a verb? Is this word a person or an organization?

In [None]:
from pathlib import Path


dictionaries = {
    "pos": ["joy", "happiness", "love", "excitement", "delight", "pleasure", "contentment", "cheerful",
    "optimism", "euphoria", "bliss", "grateful", "satisfied", "elated", "thrilled", "ecstatic",
    "enthusiasm", "hopeful", "affection", "proud", "compassion", "warmth", "amusement", "serene",
    "exhilaration", "inspired", "confidence", "tranquil", "trusting", "peaceful", "relieved",
    "uplifted", "encouraged", "radiant", "vivacious", "glad", "playful", "reassured", "fulfilled",
    "loving", "charmed", "jubilant", "festive", "giddy", "carefree", "graceful", "hearted", "motivated",
    "rejoicing", "affectionate", "beaming"],
    "neg": ["anger", "sadness", "fear", "anxiety", "grief", "frustration", "disgust", "guilt",
    "shame", "hopeless", "loneliness", "resentment", "irritation", "jealousy", "embarrassment",
    "rage", "misery", "depression", "bitterness", "doubt", "distrust", "hurt", "vulnerable",
    "melancholy", "uneasy", "overwhelmed", "insecure", "worried", "defeated", "nervous",
    "pessimistic", "tense", "gloomy", "disappointed", "distressed", "mournful", "hateful",
    "desperate", "exhausted", "despair", "fatigue", "apathy", "alienated", "troubled",
    "shattered", "tormented", "withdrawn", "irate", "lonely", "agitated", "powerless"],
    "test": ['positive', 'negative', 'gorilla']
}

print("Dictionaries loaded.")

As we discussed in previous weeks, we cannot really work with text directly. Supervised machine learning algorithms generally require numerical inputs, so we need to convert text into numbers. Let's start with a simple example: one-hot encoding.

In one-hot encoding, we create a vector for each word in our vocabulary. The length of the vector is equal to the size of the vocabulary. Each word is represented by a vector that has a 1 in the position corresponding to that word and 0s elsewhere. For example, if our vocabulary is `["the", "cat", "sat", "on", "mat"]`, then the word "cat" would be represented as `[0, 1, 0, 0, 0]`.
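To make the encoding concrete, here is a minimal sketch that builds the vectors by hand for the toy vocabulary above. The helper `one_hot` is a hypothetical name introduced only for this illustration; the tutorial itself uses scikit-learn's `CountVectorizer` in the next cell.

```python
import numpy as np

# Toy vocabulary from the example above
vocabulary = ["the", "cat", "sat", "on", "mat"]

def one_hot(word, vocabulary):
    """Return a one-hot vector for `word` (hypothetical helper).

    Words outside the vocabulary get an all-zero vector.
    """
    vector = np.zeros(len(vocabulary), dtype=int)
    if word in vocabulary:
        vector[vocabulary.index(word)] = 1
    return vector

cat_vector = one_hot("cat", vocabulary)
dog_vector = one_hot("dog", vocabulary)  # "dog" is not in the vocabulary

print("cat:", cat_vector)
print("dog:", dog_vector)
```

Note that a word outside the vocabulary has no position of its own, so this sketch simply returns all zeros for it.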

In [None]:
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

words = dictionaries['pos'] + dictionaries['neg'] + dictionaries['test']
sentiment_words = dictionaries['pos'] + dictionaries['neg']
labels = [1] * len(dictionaries['pos']) + [0] * len(dictionaries['neg'])

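# Note: despite the variable name, this is a binary CountVectorizer (one-hot encoding), not TF-IDF.
tfidf_vectorizer = CountVectorizer(vocabulary=words, binary=True)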
X_train = tfidf_vectorizer.fit_transform(sentiment_words).toarray()
y_train = np.array(labels)

print("Labels: 1 = positive, 0 = negative\n")
print("The first three training words, their vectors, and their labels:")
print("-"*50)
for i in range(3):
    print(f"{sentiment_words[i]}:\n\tVector: {X_train[i]}\n\tLabel: {y_train[i]}\n")

print("\nThe last three training words, their vectors, and their labels:")
print("-"*50)
for i in range(3):
    print(f"{sentiment_words[-(i+1)]}:\n\tVector: {X_train[-(i+1)]}\n\tLabel: {y_train[-(i+1)]}\n")

OK! We have a way to represent words as numbers. But how do we classify them? In the previous tutorial, we learned that logistic regression works for binary classification. We can use the same approach here: train a logistic regression model to predict the class of each word from its one-hot encoding. Let's try it out!

In [None]:
from sklearn.linear_model import LogisticRegression

classifier = LogisticRegression(random_state=24601)
classifier.fit(X_train, y_train)

# Test the classifier on a few words
sample_words = ['exhilaration', 'tense', 'giddy', 'hateful']

X_test = tfidf_vectorizer.transform(sample_words).toarray()
predictions = classifier.predict(X_test)
print("Predictions for sample words:")
for word, prediction in zip(sample_words, predictions):
    sentiment = "positive" if prediction == 1 else "negative"
    print(f"{word}: {sentiment}")

That looks right! The positive and negative words from the training set were classified correctly. But what about words that were not in the training set? Our test set comprises "positive", "negative", and "gorilla". Ideally, these would map to the positive, negative, and unknown classes, respectively. Let's see what happens.

In [None]:
X_test = tfidf_vectorizer.transform(dictionaries['test']).toarray()
predictions = classifier.predict(X_test)
print("Predictions for sample words:")
for word, prediction in zip(dictionaries['test'], predictions):
    sentiment = "positive" if prediction == 1 else "negative"
    print(f"{word}: {sentiment}")

That doesn't look quite right. Everything is classified as negative. Could this be because we used logistic regression? Let's try a different approach. How about Naive Bayes? It performed best last time around... right?

In [None]:
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import RandomizedSearchCV

param_dist = {"var_smoothing": np.logspace(-12, -3, 100)}

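nb_kfoldgrid = GaussianNB()

# Randomized search over var_smoothing (defaults: 10 candidates, 5-fold CV); seeded for reproducibility
random_search = RandomizedSearchCV(
    estimator=nb_kfoldgrid,
    param_distributions=param_dist,
    random_state=24601,
)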
random_search.fit(X_train, y_train)

# Test the classifier on a few words
sample_words = ['exhilaration', 'tense', 'giddy', 'hateful']

X_test = tfidf_vectorizer.transform(sample_words).toarray()
predictions = random_search.best_estimator_.predict(X_test)
print("Predictions for sample words:")
for word, prediction in zip(sample_words, predictions):
    sentiment = "positive" if prediction == 1 else "negative"
    print(f"{word}: {sentiment}")


Well, that still looks good. What about the test set?

In [None]:
X_test = tfidf_vectorizer.transform(dictionaries['test']).toarray()
predictions = random_search.best_estimator_.predict(X_test)
print("Predictions for sample words:")
for word, prediction in zip(dictionaries['test'], predictions):
    sentiment = "positive" if prediction == 1 else "negative"
    print(f"{word}: {sentiment}")

Um... nope. That doesn't look good either. It seems like Naive Bayes is also struggling with words that were not in the training set. Why is that?

Let's look at the training set again. Remember that the word encodings are one-hot vectors: each word has a 1 in its own position and 0s everywhere else. So, when the machine learning algorithm learns from the training set, it only learns something about the positions that belong to training words. And because there is a one-to-one mapping between words and their encodings, nothing it learns about one word carries over to any other word, so it cannot generalize to words that are not in the training set. This is why both logistic regression and Naive Bayes struggled with the test set.
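The point about generalization can be checked on a tiny example. The sketch below uses a made-up five-word vocabulary (an assumption for illustration, not the tutorial's dictionaries) in which "gorilla" has a column but never appears in training. Its column is all zeros during fitting, so the model should learn essentially no weight for it, and the prediction for "gorilla" should be the same as for an empty input.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical toy vocabulary: four training words plus one unseen word
toy_vocabulary = ["joy", "love", "anger", "fear", "gorilla"]

# One-hot rows for the four training words (the "gorilla" column stays all zeros)
X_toy = np.eye(len(toy_vocabulary))[:4]
y_toy = np.array([1, 1, 0, 0])  # 1 = positive, 0 = negative

toy_model = LogisticRegression().fit(X_toy, y_toy)

# Weight learned for the unseen word's column
unseen_weight = toy_model.coef_[0, 4]

# Compare the prediction for "gorilla" with the prediction for an all-zero input
unseen_vector = np.eye(len(toy_vocabulary))[[4]]
blank_vector = np.zeros((1, len(toy_vocabulary)))
p_unseen = toy_model.predict_proba(unseen_vector)
p_blank = toy_model.predict_proba(blank_vector)

print("Weight for the unseen column:", unseen_weight)
print("P(classes | 'gorilla'):", p_unseen[0])
print("P(classes | empty input):", p_blank[0])
```

In other words, for an unseen word the model falls back on its intercept alone, whatever the word actually means.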

Let's try a different approach. Instead of using one-hot encoding, let's use dense, lower-dimensional word embeddings. Recall from last week that dense word embeddings represent words as vectors in a lower-dimensional space. Importantly, these vectors can capture semantic relationships between words: two words with similar meanings are likely to have similar vectors. For example, the words "cat" and "dog" might be represented by vectors that are close together in this space. This stands in stark contrast to one-hot encoding, where the vectors for "cat" and "dog" are orthogonal to each other, so knowing about one tells you nothing about the other.
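A small sketch of that contrast, using cosine similarity. The three-dimensional "dense" vectors below are invented for illustration only (they are not real GloVe values), and `cosine_similarity` is a hand-written helper rather than a library call.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1 = same direction, 0 = orthogonal)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# One-hot vectors: every pair of distinct words is orthogonal
one_hot_cat = [0, 1, 0]
one_hot_dog = [0, 0, 1]

# Made-up dense vectors (assumption: similar words get similar values)
dense_cat = [0.9, 0.8, 0.1]
dense_dog = [0.8, 0.9, 0.2]
dense_car = [0.1, 0.2, 0.9]

sim_one_hot = cosine_similarity(one_hot_cat, one_hot_dog)
sim_cat_dog = cosine_similarity(dense_cat, dense_dog)
sim_cat_car = cosine_similarity(dense_cat, dense_car)

print(f"one-hot cat vs dog: {sim_one_hot:.2f}")
print(f"dense   cat vs dog: {sim_cat_dog:.2f}")
print(f"dense   cat vs car: {sim_cat_car:.2f}")
```

With one-hot vectors the similarity is zero for any two different words; with dense vectors, similarity can vary by degree, which is what lets a classifier carry what it learned about one word over to its neighbours.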

We'll use the GloVe embeddings from Stanford that we used last week. Let's load them up and see what we can do.

In [7]:
local_data_path = Path().resolve().parent / "local_data"
assert local_data_path.exists(), "Data path does not exist"
glove_file = local_data_path / "glove.6B.100d.txt"
embeddings = {}
with open(glove_file, "r", encoding="utf-8") as f:
    for line in f:
        values = line.split()
        word = values[0]
        vector = list(map(float, values[1:]))
        embeddings[word] = vector

In [None]:
# Replace the vectorizer with a custom one that uses GloVe embeddings

class GloVeVectorizer:
    def __init__(self, embeddings):
        self.embeddings = embeddings

    def transform(self, words):
        vectors = []
        for word in words:
            if word in self.embeddings:
                vectors.append(self.embeddings[word])
            else:
                vectors.append([0.0] * 100)
        return np.array(vectors)

glove_vectorizer = GloVeVectorizer(embeddings)
X_train = glove_vectorizer.transform(sentiment_words)
print("Labels: 1 = positive, 0 = negative\n")
print("The first three training words, their vectors, and their labels:")
print("-"*50)
for i in range(3):
    print(f"{sentiment_words[i]}:\n\tVector: {X_train[i]}\n\tLabel: {y_train[i]}\n")

print("\nThe last three training words, their vectors, and their labels:")
print("-"*50)
for i in range(3):
    print(f"{sentiment_words[-(i+1)]}:\n\tVector: {X_train[-(i+1)]}\n\tLabel: {y_train[-(i+1)]}\n")

OK, that looks better: our vectors have non-zero values for all dimensions. Let's train the logistic regression model again and see what happens.

In [None]:
classifier = LogisticRegression(random_state=24601)
classifier.fit(X_train, y_train)

# Test the classifier on a few words
sample_words = ['exhilaration', 'tense', 'giddy', 'hateful']

X_test = glove_vectorizer.transform(sample_words)
predictions = classifier.predict(X_test)
print("Predictions for sample words:")
for word, prediction in zip(sample_words, predictions):
    sentiment = "positive" if prediction == 1 else "negative"
    print(f"{word}: {sentiment}")

That looks right, but that was correct last time around as well. Let's try the test set.

In [None]:
X_test = glove_vectorizer.transform(dictionaries['test'])
predictions = classifier.predict(X_test)
print("Predictions for sample words:")
for word, prediction in zip(dictionaries['test'], predictions):
    sentiment = "positive" if prediction == 1 else "negative"
    print(f"{word}: {sentiment}")

Hey! Closer! "positive" and "negative" are classified correctly. But "gorilla" is classified as positive. What went wrong?

Let's think back to how the model was trained. The model learned two things: some words are positive, and some words are negative (along with a sense of how to tell them apart). It never learned that some words are neither positive nor negative. So, when it encounters a neutral word such as "gorilla", it has no "neither" option to fall back on. It must pick one of the two labels, and it picks whichever side of the decision boundary the word's embedding happens to fall on. Here, that happens to be the positive side.

We won't worry about this for now. For now, let's move on to the next step: classifying entire texts.

## Step 2: Text Classification

In the previous step, we classified individual tokens. Now, we want to classify entire texts. Last week we used the 20-Newsgroups dataset to do some topic modeling. Now let's use that same dataset to do some text classification. 

A great way of doing this would be to use the same approach as last week: getting document embeddings from a large language model (e.g., OpenAI). However, for a dataset this big, that's a bit expensive. Let's try a simpler approach and use TF-IDF to turn each document into a vector. A TF-IDF vector is sparse and has one dimension per vocabulary term, so, unlike the dense embeddings above, it is not low-dimensional. Each word in a document gets a weight based on its frequency in that document (TF) and its inverse document frequency in the corpus (IDF). The idea is that words that are frequent in a document but rare in the rest of the corpus are more informative than words that are common everywhere.
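To see the weighting at work, here is a minimal sketch on a three-document toy corpus (the documents are invented for illustration). It relies on scikit-learn's `TfidfVectorizer` with its default settings, which use a smoothed IDF and L2-normalize each document vector.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus: "the" is in every document, "merger" in two, "announced" in one
toy_docs = [
    "the merger was announced",
    "the scandal was reported",
    "the merger talks stalled",
]

toy_tfidf = TfidfVectorizer()
toy_matrix = toy_tfidf.fit_transform(toy_docs).toarray()
column = toy_tfidf.vocabulary_  # maps each term to its column index

# Each of these words occurs exactly once in the first document,
# so any difference in weight comes from the IDF part alone.
first_doc = toy_matrix[0]
weight_the = first_doc[column["the"]]
weight_merger = first_doc[column["merger"]]
weight_announced = first_doc[column["announced"]]

print(f"the:       {weight_the:.3f}")
print(f"merger:    {weight_merger:.3f}")
print(f"announced: {weight_announced:.3f}")
```

The rarer a word is across the corpus, the larger its weight in the documents where it does appear, which is why "announced" outweighs "the" in the first document.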

In [None]:
# Load the 20 newsgroups dataset
import re
import pandas as pd
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from sklearn.datasets import fetch_20newsgroups
from sklearn.model_selection import train_test_split

print("Getting the 20 newsgroups dataset... this may take a few minutes...")
newsgroups = fetch_20newsgroups(subset='all', shuffle=True, random_state=24601)
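# Requires NLTK's stopword list and Punkt tokenizer data (install once with nltk.download).
stop_words = stopwords.words('english')

def preprocess(text: str) -> str:
    """Lowercase, keep letters only, drop stopwords, and return a space-joined string."""
    text = re.sub(r'[^A-Za-z]', ' ', text)  # replace every non-letter with a space
    words = word_tokenize(text.lower())
    words = [word for word in words if word not in stop_words and word.isalpha()]
    return " ".join(words)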

newsgroups_data = pd.DataFrame({'text': newsgroups.data, 'label': newsgroups.target})
newsgroups_data['text'] = newsgroups_data['text'].apply(preprocess)
X_train, X_test, y_train, y_test = train_test_split(newsgroups_data['text'], newsgroups_data['label'], test_size=0.2, random_state=24601)
print("20 newsgroups dataset loaded.")

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import RandomizedSearchCV, KFold
from sklearn.metrics import make_scorer, matthews_corrcoef, classification_report
from scipy.stats import loguniform


print("Training a logistic regression model on the 20 newsgroups dataset. This will take a while...")

tfidf_vectorizer = TfidfVectorizer(stop_words=stop_words, max_features=5000)
X_train_v = tfidf_vectorizer.fit_transform(X_train)
X_test_v = tfidf_vectorizer.transform(X_test)

param_dist = {
    'C': loguniform(1e-5, 1e5),
    'solver': ['lbfgs', 'liblinear', 'newton-cg', 'saga'],
    'penalty': ['l2'],
    'max_iter': [100, 500, 1000],
    'tol': loguniform(1e-6, 1e-2),
    'class_weight': [None, 'balanced']
}
kfold = KFold(n_splits=5, shuffle=True, random_state=24601)
mcc_scorer = make_scorer(matthews_corrcoef)

logistic = LogisticRegression(max_iter=1000)
random_search = RandomizedSearchCV(
    estimator=logistic,
    param_distributions=param_dist,
    n_iter=100,
    scoring=mcc_scorer,
    cv=kfold,
    verbose=1,
    n_jobs=-1,
    random_state=24601
)
random_search.fit(X_train_v, y_train)

# Predict on the test set
y_pred = random_search.best_estimator_.predict(X_test_v)

# Evaluate the model
print(f'MCC: {matthews_corrcoef(y_test, y_pred)}')
print(classification_report(y_test, y_pred, target_names=newsgroups.target_names))

Looks good! The MCC is 0.88, which is pretty good. However, not every category is classified equally well. For example, the "talk.religion.misc" category has an F1 score of 0.77, which is decent but not great. In contrast, the "talk.politics.mideast" category has an F1 score of 0.95, which is excellent.

Let's look at the confusion matrix to see what the religion category is getting confused with.

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(10, 8))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=newsgroups.target_names, yticklabels=newsgroups.target_names)
plt.ylabel('Actual')
plt.xlabel('Predicted')
plt.title('Confusion Matrix')
plt.show()


It looks like "talk.religion.misc" gets confused with alt.atheism and soc.religion.christian. This makes sense: these categories are all related to religion, so it's not surprising that they get confused with each other.
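Reading confusions off a 20-by-20 heatmap by eye can be fiddly, so here is a hedged sketch of doing it programmatically. The labels and predictions below are made up for illustration (they are not the tutorial's results); the same row-based logic could be applied to `cm` from the cell above.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical three-class example; values are invented, not real model output
toy_names = ["alt.atheism", "soc.religion.christian", "talk.religion.misc"]
toy_true = [2, 2, 2, 2, 2, 0, 0, 0, 1, 1, 1]
toy_pred = [2, 2, 0, 0, 1, 0, 0, 2, 1, 1, 1]

toy_cm = confusion_matrix(toy_true, toy_pred)  # rows = actual, columns = predicted

target = 2                      # index of "talk.religion.misc"
row = toy_cm[target].copy()
row[target] = 0                 # ignore the correct predictions on the diagonal

# List the wrong predictions for this class, most frequent first
order = np.argsort(row)[::-1]
confusions = [(toy_names[j], int(row[j])) for j in order if row[j] > 0]

for name, count in confusions:
    print(f"{toy_names[target]} predicted as {name}: {count} time(s)")
```

Each row of the matrix holds the documents of one actual class, so zeroing the diagonal entry and sorting what remains gives that class's misclassifications in order of frequency.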

Now, let's try our trained classifier on some new data that it never saw during training. I used ChatGPT to generate some text for us to classify. Let's see how it does.

In [None]:
test_statements = [
    "I don't believe in any gods, and I think science explains everything.",
    "Religious texts are just ancient stories with cultural significance.",
    "I need help rendering 3D models using OpenGL.",
    "Photoshop filters can create some amazing visual effects.",
    "Windows keeps crashing after the latest update!",
    "How do I change the registry settings in Windows 11?",
    "What’s the best graphics card for gaming under $500?",
    "My PC won’t boot—could it be a power supply issue?",
    "Is the M2 chip significantly faster than the M1 for video editing?",
    "My MacBook’s battery drains too fast—any suggestions?",
    "How do I configure Xorg settings for dual monitors?",
    "My Linux desktop environment isn't rendering fonts correctly.",
    "Selling a barely used RTX 3090, DM for details.",
    "Looking for a second-hand iPhone 13, must be in good condition.",
    "Should I go for a Tesla Model 3 or a Toyota Camry hybrid?",
    "My car makes a weird knocking sound—what could be wrong?",
    "Best beginner-friendly motorcycle for commuting?",
    "My bike won’t start in cold weather—what could be the problem?",
    "Who do you think will win the World Series this year?",
    "Barry Bonds should definitely be in the Hall of Fame.",
    "The Maple Leafs might finally break their playoff curse this season!",
    "Which NHL goalie has the best save percentage this year?",
    "AES encryption is still secure, but quantum computing could change that.",
    "How does RSA encryption work in simple terms?",
    "I need help designing a simple transistor amplifier circuit.",
    "What’s the difference between AC and DC motors?",
    "Is intermittent fasting really effective for weight loss?",
    "What are the long-term effects of taking antibiotics frequently?",
    "NASA just announced a new mission to explore Europa!",
    "Could humans realistically colonize Mars within the next 50 years?",
    "What does the Bible say about forgiveness?",
    "I’m looking for a good church in my area—any recommendations?",
    "Should stricter gun laws be implemented to reduce crime?",
    "The Second Amendment guarantees the right to bear arms, but what about regulations?",
    "The Israel-Palestine conflict has deep historical roots.",
    "What are the latest peace efforts in the Middle East?",
    "The next presidential election will be crucial for climate policies.",
    "How does the electoral college impact voting outcomes in the US?",
    "Buddhism teaches mindfulness and detachment from material desires.",
    "What are the core beliefs of Hinduism compared to Christianity?"
]

for text in test_statements:
    text_processed = preprocess(text)
    text_vectorized = tfidf_vectorizer.transform([text_processed])
    prediction = random_search.best_estimator_.predict(text_vectorized)
    print(f'Text: {text} -- {newsgroups.target_names[prediction[0]]}')


We have accomplished two things here:
* We have classified entire texts rather than individual tokens.
* We have seen how these classifiers can be used to classify texts into multiple categories.

Done...