# Tf-idf document representations of headlines
Back to clickbait! Let's take a look at how tf-idf document representations compare to raw counts.

## Load clickbait data from Kaggle
This data consists of headlines classified as clickbait or not (regular news). It is from a dataset on Kaggle, a site where machine learning competitions and datasets are often hosted. Source site: https://www.kaggle.com/datasets/amananandrai/clickbait-dataset

In [None]:
# Read in the dataset with pandas
# 0 corresponds to not clickbait, 1 has been judged as clickbait

import pandas as pd

# Set pandas to display entire texts in dataframes
pd.set_option('display.max_colwidth', None)

data = pd.read_csv('data/clickbait_data.csv')
data.info()
data.head()

## Split into training and test sets

In [None]:
from sklearn.model_selection import train_test_split

test_size = int(0.1 * len(data))
train, test  = train_test_split(data, test_size=test_size, random_state=9)
print(len(train))
print(len(test))

## Extract unigram (raw bag-of-words count) features from the text data
"Features" are data fields or attributes "extracted" from raw data, in our case text data. The features we are examining here are "unigram" features: single words. This step converts each headline to a numeric vector of unigram counts (how many times each word type occurs).
"Training" (fitting) the vectorizer means finding the unique features (in this case, unique words) in the training set. This sets the number of columns in the matrix. Words that appear only in the test set get no column and are ignored.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
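import nltk # word_tokenize needs NLTK's Punkt tokenizer data: nltk.download('punkt') (or 'punkt_tab' in newer NLTK versions)

unigram_vectorizer = CountVectorizer(tokenizer=nltk.word_tokenize)
unigram_vectorizer.fit(train['headline']) # input is an iterable of strings (documents), here a pandas Series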
train_features = unigram_vectorizer.transform(train['headline'])
test_features = unigram_vectorizer.transform(test['headline'])

print(type(train_features))
print(train_features.shape) # prints (number of rows in the matrix, number of columns)
print(test_features.shape)  # prints (number of rows in the matrix, number of columns)

Let's explore those training set unigram features a bit more. First convert the `scipy` sparse matrix into a regular (dense) `numpy` array to take a look at it.

In [None]:
unigram_features = train_features.toarray() # .toarray() replaces the .A shorthand, which newer SciPy versions no longer support
print(type(unigram_features))
print(unigram_features.shape)
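
Why is the matrix sparse in the first place? Each headline contains only a handful of the vocabulary's words, so almost every entry is zero, and a sparse matrix stores only the nonzero ones. The sketch below uses a small made-up count matrix (an assumption for illustration) to show the round trip between the two representations.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical count matrix: 3 documents x 5 vocabulary words
dense_counts = np.array([[1, 1, 0, 1, 0],
                         [1, 1, 0, 1, 1],
                         [2, 0, 1, 0, 0]])
sparse_counts = csr_matrix(dense_counts)

print(sparse_counts.nnz)              # number of explicitly stored (nonzero) entries
round_trip = sparse_counts.toarray()  # back to a dense NumPy array
print(type(round_trip), round_trip.shape)
```

On a real vocabulary with thousands of columns, the dense array can take far more memory than the sparse matrix, so converting is best kept for inspection like this.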

Let's take a look at a few example headline vectors.

In [None]:
# Re-run this cell with as many different sample indexes as you like

sample_index = # FILL IN any whole number from 0 up to (but not including) the number of rows (datapoints) in unigram_features here.
train.iloc[sample_index] # Take a look at the text

Label the nonzero features with the words they correspond to:

In [None]:
# Make a pandas dataframe from the unigram features and label the columns with their corresponding features (unigrams, i.e. word types)
feature_names = unigram_vectorizer.get_feature_names_out()
print(len(feature_names))

unigram_feature_matrix = pd.DataFrame(unigram_features, columns=feature_names)

# View the nonzero values in the feature vector for the example headline
column_mask = unigram_feature_matrix.loc[sample_index] > 0
nonzero_columns = column_mask[column_mask]
unigram_feature_matrix.loc[[sample_index], nonzero_columns.index]

## Extract **tf-idf-weighted** unigram features from the text data
Let's take a look at how the document vectors change when we use tf-idf instead of raw counts.
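
Before running it on the headlines, it may help to see the arithmetic. The sketch below recomputes tf-idf by hand on a made-up corpus, assuming scikit-learn's documented defaults for `TfidfVectorizer` (`smooth_idf=True`, so idf = ln((1 + n) / (1 + df)) + 1, followed by L2 normalization of each row). The `toy_docs` corpus and variable names are illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical toy corpus (not from the clickbait dataset)
toy_docs = ["cats love boxes", "ten boxes cats love", "boxes boxes everywhere"]

counts = CountVectorizer().fit_transform(toy_docs).toarray()
n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)              # number of documents containing each word

idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1  # smoothed inverse document frequency
weighted = counts * idf                          # term frequency times idf
manual_tfidf = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)  # L2-normalize rows

sklearn_tfidf = TfidfVectorizer().fit_transform(toy_docs).toarray()
print(np.allclose(manual_tfidf, sklearn_tfidf))  # do the two computations agree?
```

A word that occurs in every document ("boxes" here) gets the smallest idf, so it contributes less to each vector than a word found in only one document.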

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
import nltk

tfidf_vectorizer = TfidfVectorizer(tokenizer=nltk.word_tokenize)
tfidf_vectorizer.fit(train['headline']) # input is an iterable of strings (documents)
# Note: these assignments overwrite the raw-count train_features and test_features from above
train_features = tfidf_vectorizer.transform(train['headline'])
test_features = tfidf_vectorizer.transform(test['headline'])

print(type(train_features))
print(train_features.shape) # prints (number of rows in the matrix, number of columns)
print(test_features.shape)  # prints (number of rows in the matrix, number of columns)

In [None]:
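# Convert the sparse feature matrix to a dense NumPy array for inspection
tfidf_features = train_features.toarray() # .toarray() replaces the .A shorthand, which newer SciPy versions no longer support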
print(tfidf_features.shape)

Let's take a look at the same example headline's vector, now with tf-idf weights.

In [None]:
# Make a pandas dataframe from the tf-idf features and label the columns with their corresponding features (unigrams, i.e. word types)
feature_names = tfidf_vectorizer.get_feature_names_out()
tfidf_feature_matrix = pd.DataFrame(tfidf_features, columns=feature_names)

# View the nonzero values in the feature vector for the example headline
column_mask = tfidf_feature_matrix.loc[sample_index] > 0
nonzero_columns = column_mask[column_mask]
tfidf_feature_matrix.loc[[sample_index], nonzero_columns.index]

**How do the weighted counts differ for common words and rarer words?**

# Find similar documents with unweighted unigram count vectors
We can use the numeric feature vectors computed for every headline to compute similarities with other headlines using cosine similarity. Though contemporary information retrieval (search engine) systems are of course much more complex, they still use this basic framework to return results: convert texts to vectors and return the most similar documents to your query.
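
Cosine similarity is the dot product of two vectors divided by the product of their lengths, so it measures whether two vectors point in the same direction regardless of their magnitude. Here is a minimal sketch with made-up vectors; the function name `cosine_similarity_sketch` and the vectors `a`, `b`, `c` are illustrative assumptions.

```python
import numpy as np
from scipy.spatial.distance import cosine  # note: this is cosine *distance*

def cosine_similarity_sketch(u, v):
    """Dot product of u and v divided by the product of their lengths."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

a = np.array([1.0, 2.0, 0.0])
b = np.array([2.0, 4.0, 0.0])  # same direction as a, twice as long
c = np.array([0.0, 0.0, 3.0])  # shares no nonzero dimension with a

sim_ab = cosine_similarity_sketch(a, b)
sim_ac = cosine_similarity_sketch(a, c)
scipy_sim_ab = 1 - cosine(a, b)  # convert SciPy's distance into a similarity

print(sim_ab, sim_ac, scipy_sim_ab)
```

For count and tf-idf vectors, which have no negative entries, the similarity ranges from 0 (no words in common) to 1 (same direction).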

In [None]:
# Compute cosine similarity between the sample headline vector and all other headlines in the training set

from scipy.spatial.distance import cosine # cosine distance from the scipy package

def compute_cosine_similarity_to_sample(vector):
    """ Compute cosine similarity between `vector` and the global `sample_vector` (1 minus scipy's cosine distance) """
    return 1 - cosine(vector, sample_vector)

sample_vector = unigram_features[sample_index]
sample_vector

Now let's rank the similarities and find out which headlines are most similar to the sample headline.

In [None]:
# Compute cosine similarity from every training document's vector to the sample document (the sample itself is included and ranks first with similarity 1)
similarities = unigram_feature_matrix.apply(compute_cosine_similarity_to_sample, axis=1) # apply function over every row in the df
sorted_similarities = similarities.sort_values(ascending=False)

# Create a pandas dataframe of the training headlines, sorted by cosine similarity with the sample
similar_headlines = train.iloc[sorted_similarities.index].copy().reset_index(drop=True)
similar_headlines['cosine_similarity'] = sorted_similarities.values
similar_headlines.head(10)
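
Applying a Python function to every row of a dense dataframe can be slow on a full training set. As an alternative sketch, scikit-learn's `cosine_similarity` accepts sparse matrices directly and compares one row against all rows in a single call. The toy headlines and variable names below are assumptions for illustration, and the sketch uses tf-idf features, though count features work the same way.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical toy headlines (not from the clickbait dataset)
toy_headlines = ["cats love boxes", "ten boxes cats love",
                 "boxes boxes everywhere", "stock markets fall"]
toy_features = TfidfVectorizer().fit_transform(toy_headlines)  # stays sparse

query_index = 0
# Slice (rather than index) so the query stays a 2-D, single-row matrix
sims = cosine_similarity(toy_features[query_index:query_index + 1], toy_features).ravel()

ranking = np.argsort(-sims)  # row numbers from most to least similar
for i in ranking:
    print(round(float(sims[i]), 3), toy_headlines[i])
```

The query headline itself comes first, and a headline that shares no words with the query gets a similarity of 0.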

# Find similar documents with **tf-idf-weighted** unigram count vectors
Now let's repeat the same similarity search with the tf-idf vectors instead of raw counts. The framework is unchanged: convert texts to vectors, then return the documents most similar to the query by cosine similarity.

In [None]:
# Switch the sample vector to its tf-idf version; compute_cosine_similarity_to_sample reads this global variable
sample_vector = tfidf_features[sample_index]
sample_vector

Now let's rank the similarities and find out which headlines are most similar to the sample headline.

In [None]:
# Compute cosine similarity from every training document's vector to the sample document (the sample itself is included and ranks first with similarity 1)
similarities = tfidf_feature_matrix.apply(compute_cosine_similarity_to_sample, axis=1) # apply function over every row in the df
sorted_similarities = similarities.sort_values(ascending=False)

# Create a pandas dataframe of the training headlines, sorted by cosine similarity with the sample
similar_headlines = train.iloc[sorted_similarities.index].copy().reset_index(drop=True)
similar_headlines['cosine_similarity'] = sorted_similarities.values
similar_headlines.head(10)

**How do these results compare with the most similar documents from raw counts?**  
Do they seem better or worse in any particular way? Do you see the influence of the goal of tf-idf weighting, i.e. downweighting common words?

# Evaluate clickbait classifiers
Let's switch gears to building a classifier for this clickbait dataset, as we did before with Naive Bayes. No need to worry about how Naive Bayes works; it's sufficient to know that it's a simple machine learning classifier.

## Evaluation on a test set

In [None]:
from sklearn.naive_bayes import MultinomialNB

clf = MultinomialNB() # Instantiate a Naive Bayes classifier
clf.fit(train_features, train['clickbait']) # Train on the most recently extracted features (the tf-idf features, which overwrote the raw counts above)

Way back in session 4 we made individual predictions with our classifier. Now let's evaluate the classifier on a test set!

In [None]:
from sklearn.metrics import classification_report # this automatically provides a bunch of useful evaluation metrics

test_labels = test['clickbait'] # true (gold) test set labels for clickbait/not clickbait
test_predictions = clf.predict(test_features)

results = pd.DataFrame(classification_report(test_labels, test_predictions, output_dict=True))
results

Try to interpret these results. `0` and `1` refer to the two classes (output values): non-clickbait (0) and clickbait (1). **Why are precision and recall different for different classes?**  
`support` is the number of datapoints (rows) in the test set whose true label is each class, not the number predicted as that class.
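
To see where the per-class numbers come from, here is a minimal sketch that computes precision, recall and support by hand for each class. The labels and predictions are made up for illustration, and `per_class_metrics` is a hypothetical helper, not a scikit-learn function.

```python
import numpy as np

# Hypothetical gold labels and predictions for 10 headlines
toy_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
toy_pred = np.array([1, 1, 0, 0, 1, 0, 0, 0, 0, 0])

def per_class_metrics(y_true, y_pred, label):
    """Treat `label` as the positive class and count hits and misses."""
    tp = int(np.sum((y_pred == label) & (y_true == label)))  # correctly predicted as label
    fp = int(np.sum((y_pred == label) & (y_true != label)))  # wrongly predicted as label
    fn = int(np.sum((y_pred != label) & (y_true == label)))  # true label missed
    return {"precision": tp / (tp + fp),  # of those predicted as label, how many were right
            "recall": tp / (tp + fn),     # of those truly label, how many were found
            "support": tp + fn}           # number of datapoints truly in this class

metrics_1 = per_class_metrics(toy_true, toy_pred, 1)
metrics_0 = per_class_metrics(toy_true, toy_pred, 0)
print(metrics_1)
print(metrics_0)
```

A mistake counts differently depending on which class is treated as positive: a clickbait headline predicted as regular news lowers recall for class 1 but lowers precision for class 0.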

## Cross validation
Let's evaluate using cross validation now. To do so, we'll make a scikit-learn pipeline (with `make_pipeline`), which combines our preprocessing and training steps. This matters because, within each split, the vectorizer should be fit on **only the training folds**, never on the held-out fold.
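
The sketch below spells out by hand roughly what a pipeline inside `cross_validate` does on each split: fit the vectorizer and the classifier on the training folds, then score on the held-out fold. The toy headlines, labels and variable names are assumptions for illustration, and `StratifiedKFold` is used here on the understanding that it is scikit-learn's default splitter for classifiers when `cv` is an integer.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import StratifiedKFold
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy data (not from the clickbait dataset): 1 = clickbait, 0 = regular news
toy_texts = ["you will not believe these cat photos", "ten tricks doctors hate",
             "this one trick will change your life", "which pizza topping are you",
             "senate passes budget bill", "storm closes coastal highway",
             "council approves new budget", "court upholds election ruling"]
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]

splitter = StratifiedKFold(n_splits=2)
fold_accuracies = []
for fold_train_idx, fold_test_idx in splitter.split(toy_texts, toy_labels):
    fold_train_texts = [toy_texts[i] for i in fold_train_idx]
    fold_test_texts = [toy_texts[i] for i in fold_test_idx]
    fold_train_labels = [toy_labels[i] for i in fold_train_idx]
    fold_test_labels = [toy_labels[i] for i in fold_test_idx]

    # The vocabulary is learned from the training folds only
    fold_vectorizer = CountVectorizer().fit(fold_train_texts)
    fold_model = MultinomialNB().fit(fold_vectorizer.transform(fold_train_texts), fold_train_labels)

    # The held-out fold is transformed with that vocabulary, never used to build it
    fold_accuracies.append(fold_model.score(fold_vectorizer.transform(fold_test_texts), fold_test_labels))

print(fold_accuracies)
```

Fitting the vectorizer once on all the data before splitting would let the held-out headlines shape the vocabulary, which is the leak the pipeline avoids.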

In [None]:
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_validate

clf = make_pipeline(CountVectorizer(), MultinomialNB())
X_input = # FILL IN column of raw text data as input to our pipeline
X_labels = # FILL IN column of labels for that data
num_folds = # FILL IN number of folds to try

scores = cross_validate(clf, X_input, X_labels, scoring=['accuracy', 'precision', 'recall', 'f1'], cv=num_folds, return_train_score=False)
scores = pd.DataFrame(scores)[['test_accuracy', 'test_precision', 'test_recall', 'test_f1']]
scores.mean()