# N-gram document representations of headlines
This time, we'll build and examine n-gram feature representations of documents in clickbait classification data.

## Setup
You only need to run this if you did not successfully install `scikit-learn` and `nltk` during the last class session.

In [None]:
# Install scikit-learn (a machine learning package) and nltk (a natural language processing package)
! pip install --user scikit-learn
! pip install --user nltk

Now select **Kernel > Restart Kernel** from the menu bar.

In [None]:
import sklearn
import nltk
nltk.download('punkt_tab')

## Load clickbait data from Kaggle
This data consists of headlines classified as clickbait or not (regular news). It is from a dataset on Kaggle, a site where machine learning competitions and datasets are often hosted. Source site: https://www.kaggle.com/datasets/amananandrai/clickbait-dataset

In [None]:
# Read in the dataset with pandas
# 0 corresponds to not clickbait, 1 has been judged as clickbait

import pandas as pd

# Set pandas to display entire texts in dataframes
pd.set_option('display.max_colwidth', None)

data = pd.read_csv('data/clickbait_data.csv')
data.info()
data.head()

## Split into training and test sets
This isn't strictly necessary since we're not training a machine learning model in this notebook, but it is good practice to "train" the vectorizer (figure out things like the vocabulary) on the training set only. Otherwise you are "looking" at the test set: information from it leaks into the features, and any performance you later measure on the test set will be overly optimistic.

In [None]:
from sklearn.model_selection import train_test_split

test_size = int(0.1 * len(data))
train, test = train_test_split(data, test_size=test_size, random_state=9)
print(len(train))
print(len(test))

## Extract **n-gram features** from the raw text data
"Features" are data fields or attributes "extracted" from raw data, in our case, text data. The features we are examining here are "unigram" features: sequences of 1 word, so each feature is a unique word type. This step converts each headline to a numeric vector of unigram counts (how many times each word type occurs).
"Training" the vectorizer means finding the unique features (in this case, unique words) in the training set. The number of unique features sets the number of columns in the matrix.
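
To make the fit and transform steps concrete, here is a minimal pure-Python sketch of the same idea on toy data. It is not how `CountVectorizer` is implemented: the helper names (`fit_vocabulary`, `transform`) and the toy headlines are made up for illustration, and tokens are split on whitespace instead of with `nltk.word_tokenize`.

```python
from collections import Counter

def fit_vocabulary(documents):
    """Map each unique lowercased token in the training documents to a column index."""
    vocabulary = sorted({token for doc in documents for token in doc.lower().split()})
    return {token: column for column, token in enumerate(vocabulary)}

def transform(documents, vocabulary):
    """Turn each document into a list of counts, one entry per vocabulary column."""
    rows = []
    for doc in documents:
        counts = Counter(doc.lower().split())
        # Tokens that were not seen during fitting have no column and are dropped
        rows.append([counts.get(token, 0) for token in vocabulary])
    return rows

toy_train = ["you won't believe this", "this is news"]  # hypothetical headlines
toy_test = ["believe this this scandal"]                # "scandal" is not in toy_train

toy_vocabulary = fit_vocabulary(toy_train)
toy_train_rows = transform(toy_train, toy_vocabulary)
toy_test_rows = transform(toy_test, toy_vocabulary)

print(toy_vocabulary)
print(toy_train_rows)
print(toy_test_rows)
```

The vocabulary is built from the training documents only, so the test row has the same number of columns as the training rows and the unseen word "scandal" contributes nothing to it.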

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
import nltk

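# token_pattern=None avoids a warning that the default pattern is unused when a tokenizer is given
unigram_vectorizer = CountVectorizer(tokenizer=nltk.word_tokenize, token_pattern=None)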
unigram_vectorizer.fit(train['headline']) # input is an iterable of strings (documents), here a pandas Series
train_features = unigram_vectorizer.transform(train['headline'])
test_features = unigram_vectorizer.transform(test['headline'])

print(type(train_features))
print(train_features.shape) # prints (number of rows in the matrix, number of columns)
print(test_features.shape)  # prints (number of rows in the matrix, number of columns)

Let's explore those training set unigram features a bit more. First convert the `scipy` sparse matrix into a regular (dense) `numpy` array to take a look at it.
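
As a rough illustration of the difference between sparse and dense storage, the sketch below keeps only the nonzero counts of one row in a dictionary and then expands it into a full list. This is a conceptual toy: `scipy` uses a more compact compressed format internally, and the function name `to_dense` and the example values are assumptions made for this example.

```python
def to_dense(sparse_row, n_columns):
    """Expand a {column_index: count} mapping into a full list with explicit zeros."""
    dense_row = [0] * n_columns
    for column, count in sparse_row.items():
        dense_row[column] = count
    return dense_row

# Hypothetical headline with 3 distinct words in a 10-word vocabulary
toy_sparse_row = {2: 1, 5: 2, 9: 1}
toy_dense_row = to_dense(toy_sparse_row, n_columns=10)

print(toy_dense_row)
print(len(toy_sparse_row), "stored values in sparse form versus", len(toy_dense_row), "in dense form")
```

With a real vocabulary of many thousands of words and headlines of around ten words, almost every entry of a dense row is zero, which is why the vectorizer returns a sparse matrix by default.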

In [None]:
unigram_features = train_features.toarray() # convert the sparse matrix to a dense numpy array
print(type(unigram_features))
print(unigram_features.shape)

Each row in this matrix is a headline. Each column holds the counts of one unique word type. **How many words are there in the entire vocabulary of this dataset?**  
Let's take a look at a few example headline vectors.

In [None]:
sample_index = 9 # FILL IN a random number less than the number of rows (datapoints) in unigram_features here
train.iloc[sample_index] # Take a look at the text

How many values in this large, sparse vector aren't 0?

In [None]:
import numpy as np

np.count_nonzero(unigram_features[sample_index])

Label the nonzero features with the words they correspond to:

In [None]:
# Make a pandas dataframe from the unigram features and label each column with its corresponding feature (unigram or word type)
feature_names = unigram_vectorizer.get_feature_names_out()
print(len(feature_names))

unigram_feature_matrix = pd.DataFrame(unigram_features, columns=feature_names)
unigram_feature_matrix

In [None]:
# View the nonzero values in the feature vector for the example headline
column_mask = unigram_feature_matrix.loc[sample_index] > 0 # True for every word that occurs in the headline
nonzero_columns = column_mask[column_mask]
unigram_feature_matrix.loc[[sample_index], nonzero_columns.index]

## Extract bigram features
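Bigrams are sequences of 2 adjacent words. Each bigram feature counts how many times one particular word pair occurs in a headline.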
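
A minimal sketch of how word n-grams can be read off a token list with a sliding window is shown here. The helper name `extract_ngrams` and the toy headline are made up for illustration, and the tokens come from a plain whitespace split; `CountVectorizer` handles this step internally through its `ngram_range` parameter.

```python
def extract_ngrams(tokens, n):
    """Return every run of n adjacent tokens, joined with spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

toy_tokens = "you won't believe this".split()  # hypothetical headline
toy_unigrams = extract_ngrams(toy_tokens, 1)
toy_bigrams = extract_ngrams(toy_tokens, 2)

print(toy_unigrams)
print(toy_bigrams)
```

A headline with 4 tokens yields 3 bigrams. Because there are far more possible word pairs than single words, a bigram vocabulary is usually much larger than the unigram vocabulary built from the same data.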

In [None]:
# note the ngram_range parameter; token_pattern=None avoids a warning about the unused default pattern
bigram_vectorizer = CountVectorizer(ngram_range=(2,2), tokenizer=nltk.word_tokenize, token_pattern=None)
bigram_vectorizer.fit(train['headline'])
train_bigram_features = bigram_vectorizer.transform(train['headline'])
test_bigram_features = bigram_vectorizer.transform(test['headline'])

print(train_bigram_features.shape) # prints (number of rows in the matrix, number of columns)

In [None]:
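# The dense bigram matrix has many more columns than the unigram one, so this can use a lot of memory
bigram_feature_matrix = pd.DataFrame(train_bigram_features.toarray(), columns=bigram_vectorizer.get_feature_names_out())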
bigram_feature_matrix

In [None]:
# View the nonzero values in the feature vector for the example headline
column_mask = bigram_feature_matrix.loc[sample_index] > 0 # True for every bigram that occurs in the headline
nonzero_columns = column_mask[column_mask]
bigram_feature_matrix.loc[[sample_index], nonzero_columns.index]

## Cosine similarity
We can use the numeric feature vectors computed for every headline to compute similarities with other headlines using **cosine similarity**. Though contemporary information retrieval (search engine) systems are of course much more complex, they still use this basic framework to return results: convert texts to vectors and return the documents most similar to your query.
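
Cosine similarity is the cosine of the angle between two vectors: $\cos(a, b) = \frac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert}$. The sketch below computes it from scratch on made-up toy vectors to show what the formula does. The function name `cosine_similarity` and the choice to return 0 for an all-zero vector are assumptions of this example, not part of `scipy`.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length count vectors."""
    dot_product = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # The angle is undefined for an all-zero vector; returning 0 is a convention chosen here
        return 0.0
    return dot_product / (norm_a * norm_b)

toy_a = [1, 1, 0, 2]
toy_b = [2, 2, 0, 4]  # same direction as toy_a, twice the length
toy_c = [0, 0, 3, 0]  # shares no words with toy_a

sim_ab = cosine_similarity(toy_a, toy_b)
sim_ac = cosine_similarity(toy_a, toy_c)

print(sim_ab)
print(sim_ac)
```

Vectors that point in the same direction score (approximately, given floating-point rounding) 1 regardless of their length, and count vectors with no words in common score 0. For count vectors, which have no negative entries, the similarity always falls between 0 and 1.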

In [None]:
# Compute cosine similarity between the sample headline vector and all other headlines in the training set

from scipy.spatial.distance import cosine # cosine distance from the scipy package

In [None]:
sample_vector = unigram_features[sample_index]
sample_vector

In [None]:
def compute_cosine_similarity_to_sample(vector):
    """ Compute cosine similarity with sample vector """
    return 1 - cosine(vector, sample_vector)

In [None]:
unigram_feature_matrix.shape

In [None]:
sample_vector.shape

In [None]:
similarities = unigram_feature_matrix.apply(compute_cosine_similarity_to_sample, axis=1) # apply function over every row in the df
similarities

A sanity check first. What should the sample vector's similarity with itself be?

In [None]:
similarities[sample_index]

Now let's rank the similarities and find out which headlines are most similar to the sample headline.
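
The ranking step can be sketched in plain Python as a sort by score. The scores and positions below are made up for illustration. Dropping the sample's own position is an optional extra that the notebook cells do not perform.

```python
# Hypothetical similarity scores, keyed by row position
toy_similarities = {0: 0.25, 1: 1.0, 2: 0.0, 3: 0.75}
toy_sample_position = 1  # the sample headline itself, which scores 1.0

# Sort (position, score) pairs from most to least similar
ranked = sorted(toy_similarities.items(), key=lambda item: item[1], reverse=True)

# Drop the sample itself so that it does not occupy the top spot
ranked_others = [(position, score) for position, score in ranked if position != toy_sample_position]
top_2 = ranked_others[:2]

print(ranked)
print(top_2)
```

Without that filter, the sample headline ranks first (possibly tied with exact duplicates), so the first row of a top-10 list is typically the sample itself.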

In [None]:
sorted_similarities = similarities.sort_values(ascending=False)
sorted_similarities

In [None]:
train.iloc[sorted_similarities.index[:10]] # Take a look at the top 10 most similar headlines