# Error analysis of clickbait classification with logistic regression
You must be so sick of clickbait by now.

## Load clickbait data from Kaggle
This data consists of headlines classified as clickbait or not (regular news). Source site: https://www.kaggle.com/datasets/amananandrai/clickbait-dataset

In [None]:
# Read in the dataset with pandas
# 0 corresponds to not clickbait, 1 has been judged as clickbait

import pandas as pd

# Set pandas to display entire texts in dataframes
pd.set_option('display.max_colwidth', None)

data = pd.read_csv('data/clickbait_data.csv')
data.info()
data.head()

## Split into training, development, and test sets
It's best to do error analysis on a development set rather than the test set. If you look at test examples and then change your model in response, you risk overfitting to the test set, and your test scores will no longer be a fair estimate of performance on unseen data.

In [None]:
from sklearn.model_selection import train_test_split

test_size = int(0.1 * len(data))
rest, test = train_test_split(data, test_size=test_size, random_state=9) # split data into test and the "rest" (which will be train + dev)
train, dev = train_test_split(rest, test_size=test_size, random_state=9) # split the "rest" into train and dev

print(len(train))
print(len(dev))
print(len(test))

## Extract unigram features from the text data
As a reminder, this step converts each headline to a numeric vector of unigram counts (how many times each word type occurs).
"Training" (fitting) the vectorizer means building a vocabulary of the unique features (in this case, unique words) found in the training set. The size of that vocabulary sets the number of columns in the matrix, and words that appear only in the dev or test set are ignored.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
import nltk

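# nltk.word_tokenize may require NLTK's tokenizer data to be downloaded first (e.g. via nltk.download)
unigram_vectorizer = CountVectorizer(tokenizer=nltk.word_tokenize)
unigram_vectorizer.fit(train['headline']) # input is an iterable of strings (documents); learns the vocabulary from the training set only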
train_features = unigram_vectorizer.transform(train['headline'])
dev_features = unigram_vectorizer.transform(dev['headline'])
test_features = unigram_vectorizer.transform(test['headline'])

print(type(train_features))
print(train_features.shape) # (number of rows in the matrix, number of columns)
print(dev_features.shape)  # (number of rows in the matrix, number of columns)
print(test_features.shape)  # (number of rows in the matrix, number of columns)

## Train and evaluate a logistic regression model for clickbait classification
We'll use `scikit-learn`'s `LogisticRegression` class to train a classifier using unigram features.

In [None]:
from sklearn.linear_model import LogisticRegression

clf_unigrams = LogisticRegression() # Instantiate a logistic regression classifier
clf_unigrams.fit(train_features, train['clickbait']) # Train the classifier

In [None]:
# Evaluate unigram logistic regression classifier
from sklearn.metrics import classification_report # this provides a bunch of useful evaluation metrics

dev_labels = dev['clickbait'] # true (gold) dev set labels for clickbait/not clickbait
unigram_dev_predictions = clf_unigrams.predict(dev_features)

results = pd.DataFrame(classification_report(dev_labels, unigram_dev_predictions, output_dict=True))
results

# Error analysis
That's pretty good performance! But what's with those pesky errors? Let's dig into what kind of errors the system made.

First we'll create what's called a **confusion matrix** or **contingency table** of different types of errors.
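
As a warm-up, the sketch below builds a confusion matrix from a handful of made-up labels so the layout is easy to check by hand. The variables `toy_true` and `toy_pred` are invented for illustration. In scikit-learn's `confusion_matrix`, rows correspond to true labels and columns to predicted labels.

```python
from sklearn.metrics import confusion_matrix

# Made-up labels for illustration: 0 = not clickbait, 1 = clickbait
toy_true = [0, 0, 0, 1, 1, 1, 1]
toy_pred = [0, 0, 1, 1, 1, 0, 0]

toy_confusion = confusion_matrix(toy_true, toy_pred)
print(toy_confusion)

# Rows are true labels, columns are predicted labels
true_0_pred_1 = toy_confusion[0, 1]  # actual 0, predicted 1
true_1_pred_0 = toy_confusion[1, 0]  # actual 1, predicted 0
n_correct = toy_confusion[0, 0] + toy_confusion[1, 1]  # the diagonal holds correct predictions

print(true_0_pred_1, true_1_pred_0, n_correct)
```

The off-diagonal cells are the two kinds of error, and the rest of this section looks at each in turn.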

In [None]:
from sklearn.metrics import confusion_matrix

y_true = dev_labels # the variable where the dev set actual (gold) labels are stored
y_pred = unigram_dev_predictions # the variable where the model's dev set predictions are stored
confusion = pd.DataFrame(confusion_matrix(y_true, y_pred), columns=['pred_0', 'pred_1'], index=['true_0', 'true_1'])
confusion

It looks like our most common error is where we predict 0 (not clickbait) and the actual label is 1 (clickbait). **What is the name of that kind of error?**

Let's look at some examples of these types of errors. **Note that this should only be done with a development set, not a test set, otherwise any changes you make to the model to address errors may lead to overestimated performance.**

In [None]:
dev = dev.copy() # work on a copy to avoid a possible pandas SettingWithCopyWarning
dev['prediction'] = unigram_dev_predictions # add a column for the system predictions
false_negatives = dev[(dev.prediction == 0) & (dev.clickbait == 1)]
false_negatives

Do you notice any patterns in what the system might not be picking up on? It's okay to speculate here, but remember what features it's using: only unigrams (individual words).

Sometimes it's useful to compare with examples the model got right (true positives in this case). Let's take a look at those.
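
One rough way to make that comparison less dependent on eyeballing is to count words in each group. The sketch below is only an illustration: the two small dataframes are made up, the helper `word_counts` is hypothetical, and it splits on whitespace instead of using `nltk.word_tokenize`. In the notebook you could pass the `headline` columns of the false negatives and true positives instead.

```python
from collections import Counter

import pandas as pd

# Made-up rows for illustration; stand-ins for the false negative
# and true positive dataframes
toy_false_negatives = pd.DataFrame(
    {'headline': ["A Guide To Houseplants", "The History Of Tea"]}
)
toy_true_positives = pd.DataFrame(
    {'headline': ["17 Things You Need To Know", "You Will Love These 10 Photos"]}
)

def word_counts(headlines):
    """Count lowercased, whitespace-separated tokens across headlines."""
    counter = Counter()
    for headline in headlines:
        counter.update(headline.lower().split())
    return counter

fn_counts = word_counts(toy_false_negatives['headline'])
tp_counts = word_counts(toy_true_positives['headline'])

# Words seen in the missed clickbait but never in the caught clickbait
only_in_fn = sorted(set(fn_counts) - set(tp_counts))

print(tp_counts.most_common(3))
print(only_in_fn)
```

Counts like these are only a starting point for speculation; reading the actual headlines is still the main part of error analysis.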

In [None]:
true_positives = dev[(dev.prediction == 1) & (dev.clickbait == 1)]
true_positives.sample(20)

Finally, take a look at the other type of error our system makes: false positives. **What does the system predict and what is the true label in this case?**

In [None]:
false_positives = dev[(dev.prediction == 1) & (dev.clickbait == 0)]
false_positives

Do you observe any potential patterns in these false positives? It's again good to keep in mind what features the system sees: unigrams.