# Clickbait classification with neural networks

In [None]:
import spacy

# Load the small English spaCy model, disabling the pipeline components we don't need
nlp = spacy.load('en_core_web_sm', disable=['tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner'])

## Load clickbait data from Kaggle
This data consists of headlines classified as clickbait or not (regular news). Source site: https://www.kaggle.com/datasets/amananandrai/clickbait-dataset

In [None]:
# Read in the dataset with pandas
# 0 corresponds to not clickbait, 1 has been judged as clickbait

import pandas as pd

# Set pandas to display entire texts in dataframes
pd.set_option('display.max_colwidth', None)

data = pd.read_csv('data/clickbait_data.csv')
data.info()
data.head()

## Split into training and test sets

In [None]:
from sklearn.model_selection import train_test_split

test_size = int(0.1 * len(data))
train, test = train_test_split(data, test_size=test_size, random_state=9)
print(len(train))
print(len(test))

## Extract vector representations (embeddings) for documents
This is different from usual! We use no sparse n-gram features here, because those feature vectors are very long (one dimension per n-gram) and tend not to work as well with neural networks.

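Instead, we compute each document embedding as the average of its token vectors from `spacy` (this is what `doc.vector` returns by default). This gives every headline a fixed-size vector that is **dense** rather than sparse (mostly 0s). Note that `en_core_web_sm` ships without static word vectors, so the token vectors here come from the pipeline's context-sensitive token representations rather than from pretrained word2vec-style embeddings.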
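
To make the averaging idea concrete, here is a minimal sketch that uses made-up 4-dimensional vectors instead of real `spacy` output. The names `toy_vectors` and `average_embedding` are hypothetical and exist only for this illustration. The point is that headlines of different lengths still end up with vectors of the same size.

```python
import numpy as np

# Hypothetical 4-dimensional word vectors, invented for illustration only
# (real spaCy vectors have many more dimensions).
toy_vectors = {
    "you": np.array([1.0, 0.0, 2.0, 0.0]),
    "won't": np.array([0.0, 1.0, 0.0, 1.0]),
    "believe": np.array([2.0, 2.0, 1.0, 0.0]),
    "this": np.array([1.0, 1.0, 1.0, 3.0]),
}


def average_embedding(tokens, vectors, dim=4):
    """Average the vectors of the known tokens; return all zeros if none are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return np.zeros(dim)
    return np.mean(known, axis=0)


short_vec = average_embedding(["believe", "this"], toy_vectors)
long_vec = average_embedding(["you", "won't", "believe", "this"], toy_vectors)

# A 2-token and a 4-token "headline" both map to a vector of length 4.
print(short_vec.shape, long_vec.shape)
print(long_vec)
```

Averaging discards word order, which is a deliberate simplification: it trades some information for a compact, fixed-size input that a feedforward network can consume directly.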

In [None]:
import numpy as np

train_vecs = np.array([doc.vector for doc in nlp.pipe(train['headline'])])
test_vecs = np.array([doc.vector for doc in nlp.pipe(test['headline'])])
print(train_vecs.shape)
print(test_vecs.shape)

**How long are the document vectors here?** That is, how many columns do they have? Let's take a look at one example.

In [None]:
i = # FILL IN an index into train_vecs (any integer from 0 to len(train_vecs) - 1)
train_vecs[i]

## Train and evaluate a neural network for clickbait classification
We'll use `scikit-learn`'s `MLPClassifier` class to train a classifier on these document vectors instead of hand-crafted features. This classifier provides an implementation of a feedforward neural network.

Feel free to change the number of units and number of layers in the next cell and run it multiple times.  
Here are some example values:

* `(50,)` # 50 neurons in 1 hidden layer
* `(30, 10)` # 30 neurons in the first hidden layer, then 10 in the second
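
To see how the tuple maps onto the network's layers, here is a small sketch on synthetic data (not the clickbait vectors). The toy data, the variable names, and the `random_state` values are assumptions made for illustration. After fitting, `coefs_` holds one weight matrix per pair of adjacent layers, so the shapes show the layer sizes directly.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

# Synthetic binary classification data: 200 samples with 8 features each.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Two hidden layers: 30 units, then 10 units.
toy_clf = MLPClassifier(hidden_layer_sizes=(30, 10), max_iter=500, random_state=0)
toy_clf.fit(X, y)

# One weight matrix per pair of adjacent layers:
# input (8) -> hidden (30) -> hidden (10) -> output (1 unit for binary labels)
weight_shapes = [w.shape for w in toy_clf.coefs_]
print(weight_shapes)  # [(8, 30), (30, 10), (10, 1)]
```

With the clickbait data, the first dimension of the first matrix would instead equal the length of the document vectors.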

In [None]:
from sklearn.neural_network import MLPClassifier

hidden_layers = # FILL IN a tuple indicating how many units (neurons) are in each hidden layer (see text above)
clf = MLPClassifier(hidden_layer_sizes=hidden_layers, max_iter=1000)
train_x = train_vecs
train_y = # FILL IN the column of the training dataframe that holds the labels you are trying to predict
clf.fit(train_x, train_y)

In [None]:
# Evaluate FFNN classifier
from sklearn.metrics import classification_report # this provides a bunch of useful evaluation metrics

test_labels = test['clickbait'] # true (gold) test set labels for clickbait/not clickbait
test_predictions = clf.predict(test_vecs)

results = pd.DataFrame(classification_report(test_labels, test_predictions, output_dict=True))
results
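
To help read the table that `classification_report` produces, here is a small sketch on hand-made labels where the metrics are easy to check by hand. The lists `gold` and `pred` are invented for illustration and have nothing to do with the clickbait data.

```python
import pandas as pd
from sklearn.metrics import classification_report

# Six hand-made examples: one non-clickbait item (0) is wrongly predicted as clickbait (1).
gold = [0, 0, 0, 1, 1, 1]
pred = [0, 0, 1, 1, 1, 1]

toy_report = classification_report(gold, pred, output_dict=True)
toy_results = pd.DataFrame(toy_report)

# Columns: one per class, plus 'accuracy', 'macro avg' and 'weighted avg'.
# Rows: precision, recall, f1-score and support (number of gold examples).
print(toy_results)

# Class 1: 3 of the 4 predicted 1s are correct, and all 3 gold 1s are found.
print(toy_report["1"]["precision"], toy_report["1"]["recall"])  # 0.75 1.0
```

Precision answers "of the headlines predicted as clickbait, how many really are?", while recall answers "of the real clickbait headlines, how many did the classifier find?".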