# Clickbait classification
This notebook guides you through creating a simple machine learning model that classifies headlines as clickbait or not clickbait, using a **training set** of labeled examples of both.

## Import necessary packages
Test out the environment to make sure you have the packages needed to run this notebook. We'll import **scikit-learn** (sklearn), a useful Python package for traditional machine learning approaches.

In [None]:
# Test importing packages. 🤞 for no errors!
import sklearn
import pandas as pd
import nltk
nltk.download('punkt_tab')

## Load clickbait data from Kaggle
This data consists of headlines classified as clickbait or not (regular news). It is from a dataset on Kaggle, a site that hosts machine learning datasets and competitions. Source site: https://www.kaggle.com/datasets/amananandrai/clickbait-dataset

In [None]:
# Read in the dataset with pandas
# 0 corresponds to not clickbait, 1 has been judged as clickbait

pd.set_option('display.max_colwidth', None) # Make sure entire headlines will be displayed

data = pd.read_csv('data/clickbait_data.csv')
data.info()
data.head()

## Split into training and test sets

In [None]:
from sklearn.model_selection import train_test_split

test_size = int(0.1 * len(data))
train, test = train_test_split(data, test_size=test_size)
print(len(train))
print(len(test))

**What does this "length" refer to?**

## Extract numeric "features" from the raw text data
This step converts each instance (datapoint, row) of raw text to a numeric vector. No need to worry about the details of this now! We will cover this in the next few class sessions.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
import nltk

vectorizer = CountVectorizer(min_df=1, tokenizer=nltk.word_tokenize)
vectorizer.fit(train['headline']) # input is a sequence of strings (one per document)
train_features = vectorizer.transform(train['headline'])
test_features = vectorizer.transform(test['headline'])

print(type(train_features))
print(train_features.shape) # prints (number of rows in the matrix, number of columns)
print(test_features.shape)  # prints (number of rows in the matrix, number of columns)

Note that the input is now a **matrix** with each row as a datapoint (headline) and each column as a numeric feature (don't worry about that now). This is one of the places matrices are used in NLP: as input to machine learning models.

## Train Naive Bayes model
Naive Bayes is a simple machine learning algorithm that we won't cover in this course. Like any machine learning model, though, it learns patterns from a training set: parameters of a mathematical model that describe the relationship between input and output.

That trained model can then be used to make predictions.

In our dataset, `train['clickbait']` is the column that contains the **output** we care about: whether the text is clickbait or not. We pass the example input and output in our training set to train the model.
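
As a rough illustration of what "learning parameters" means, the sketch below trains the same kind of model on four made-up headlines with made-up labels (all hypothetical). The names `toy_texts`, `toy_labels`, `toy_X`, and `toy_clf` are introduced only for this example.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical training examples: input text and output label (1 = clickbait, 0 = not)
toy_texts = [
    "you will not believe this trick",
    "this trick will shock you",
    "senate passes budget bill",
    "court rules on budget dispute",
]
toy_labels = [1, 1, 0, 0]

# Convert text to a count matrix, then fit the classifier on inputs and outputs
toy_vectorizer = CountVectorizer()
toy_X = toy_vectorizer.fit_transform(toy_texts)
toy_clf = MultinomialNB()
toy_clf.fit(toy_X, toy_labels)

# The labels the model learned to distinguish
print(toy_clf.classes_)

# Learned parameters: one log-probability per (class, vocabulary word) pair
print(toy_clf.feature_log_prob_.shape)
```

The learned parameters form a small table with one row per class and one column per feature, which is what the model consults when it makes a prediction.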

In [None]:
from sklearn.naive_bayes import MultinomialNB

clf = MultinomialNB()
clf.fit(train_features, train['clickbait'])

## Make predictions from the trained model

In [None]:
# Select an example from the test set
# Note that the classifier has never seen this example

example = test.sample(1)
example

Get the text into a vector format that the classifier can use.

In [None]:
example_features = vectorizer.transform(example['headline'])
example_features.shape

In [None]:
clf.predict(example_features)

Recall that 0 = not clickbait and 1 = clickbait. **Did the model get it right?** Feel free to re-run the code in this section with other examples. 
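
One way to check a single prediction is to compare it with the known label and look at the model's estimated probabilities. The sketch below does this on toy data; the headlines, the labels, and the `demo_` names are all hypothetical and stand in for the notebook's `vectorizer`, `clf`, and `example`.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical training data (1 = clickbait, 0 = not clickbait)
demo_texts = [
    "you will not believe this trick",
    "this trick will shock you",
    "senate passes budget bill",
    "court rules on budget dispute",
]
demo_labels = [1, 1, 0, 0]

demo_vectorizer = CountVectorizer()
demo_clf = MultinomialNB()
demo_clf.fit(demo_vectorizer.fit_transform(demo_texts), demo_labels)

# A made-up unseen headline with an assumed true label
new_headline = "this trick will shock the senate"
true_label = 1

# Transform with the already-fitted vectorizer, then predict
new_features = demo_vectorizer.transform([new_headline])
predicted = int(demo_clf.predict(new_features)[0])

label_names = {0: "not clickbait", 1: "clickbait"}
print("Predicted:", label_names[predicted])
print("Correct?", predicted == true_label)

# Estimated probability of each class, in the order given by demo_clf.classes_
proba = demo_clf.predict_proba(new_features)
print(proba)
```

Words the vectorizer never saw during fitting (here, "the") are simply ignored at prediction time.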

Evaluation is an important part of machine learning and NLP. We'll systematically evaluate the model on a test set in later class sessions.
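
As a small preview, one common measure is accuracy: the fraction of predictions that match the true labels. The sketch below uses made-up label lists purely for illustration; it is not a result from this dataset.

```python
from sklearn.metrics import accuracy_score

# Hypothetical true labels and model predictions for five headlines
true_labels = [1, 0, 1, 1, 0]
predicted_labels = [1, 0, 0, 1, 0]

# Accuracy = number of matching positions / total number of examples
acc = accuracy_score(true_labels, predicted_labels)
print(acc)  # 4 of 5 predictions match
```

In this notebook, the analogous call would presumably be `accuracy_score(test['clickbait'], clf.predict(test_features))`.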