<img src="logoiit.png" width="200" style="float: right;">

**NATURAL LANGUAGE PROCESSING. HOMEWORK 2.**<br>
Author: Lucía Colín Cosano. A20552447.

**PROBLEM 1**

- Using Python, read in the 2 clickbait datasets, and combine both into a single, shuffled dataset.
- Next, split your dataset into train, test, and validation datasets. Use a split of 72% train, 8% validation, and 20% test. (This is equivalent to a 20% test set, with the remainder split 90%/10% between train and validation.)
- What is the "target rate" of each of these three datasets? That is, what % of the test dataset is labeled as clickbait? Show your result in your notebook.

In [1]:
import pandas as pd
import numpy as np
from sklearn.utils import shuffle

In [2]:
positive_file_path = r"C:\Users\lulac\Desktop\Chicago\Fall 2023\Natural Language Processing\Homework 2\clickbait.txt"
negative_file_path = r"C:\Users\lulac\Desktop\Chicago\Fall 2023\Natural Language Processing\Homework 2\not-clickbait.txt"

positive_data = []
negative_data = []

# Read the positive (clickbait) dataset
with open(positive_file_path, 'r', encoding='utf-8') as file:
    for line in file:
        # Check if the line is not empty
        if line.strip():
            positive_data.append(line.strip())

# Read the negative dataset
with open(negative_file_path, 'r', encoding='utf-8') as file:
    for line in file:
        # Check if the line is not empty
        if line.strip():
            negative_data.append(line.strip())

# Create DataFrames from the lists
positive_df = pd.DataFrame({'text': positive_data, 'label': 'clickbait'})
negative_df = pd.DataFrame({'text': negative_data, 'label': 'no clickbait'})


# Combine and shuffle the two datasets
combined_data = pd.concat([positive_df, negative_df])
combined_data = shuffle(combined_data, random_state=42)
combined_data.reset_index(drop=True, inplace=True)

# Split the data into train, validation, and test sets
train_ratio = 0.72
validation_ratio = 0.08
test_ratio = 0.20

total_samples = len(combined_data)
train_size = int(train_ratio * total_samples)
validation_size = int(validation_ratio * total_samples)
# The test set takes all remaining rows, so no explicit test size is needed

train_data = combined_data[:train_size]
validation_data = combined_data[train_size:(train_size + validation_size)]
test_data = combined_data[(train_size + validation_size):]

# Calculate the target rate for each dataset
target_rate_train = (train_data['label'] == 'clickbait').mean()
target_rate_validation = (validation_data['label'] == 'clickbait').mean()
target_rate_test = (test_data['label'] == 'clickbait').mean()

# Print the target rates
print("Target Rate for Training Data:", target_rate_train)
print("Target Rate for Validation Data:", target_rate_validation)
print("Target Rate for Test Data:", target_rate_test)


Target Rate for Training Data: 0.3566026759744037
Target Rate for Validation Data: 0.27225130890052357
Target Rate for Test Data: 0.3117154811715481
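
The target rates above differ noticeably between splits (about 0.36, 0.27, and 0.31), which an unstratified split can produce on a small dataset. As a hedged alternative, not the method used in this notebook, the two-stage description in the problem statement (20% test, then 90%/10% of the remainder) can be sketched with scikit-learn's `train_test_split` and its `stratify` argument. The toy DataFrame and the helper name `split_72_8_20` are assumptions made for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def split_72_8_20(df, label_col="label", seed=42):
    """Hypothetical helper: 20% test, then 90%/10% train/validation on the rest."""
    # Stage 1: hold out 20% as the test set, keeping the label ratio.
    rest, test = train_test_split(
        df, test_size=0.20, stratify=df[label_col], random_state=seed
    )
    # Stage 2: 10% of the remaining 80% (8% overall) becomes validation.
    train, validation = train_test_split(
        rest, test_size=0.10, stratify=rest[label_col], random_state=seed
    )
    return train, validation, test


# Toy data (made up): 100 rows, 30% labeled "clickbait".
toy = pd.DataFrame({
    "text": [f"headline {i}" for i in range(100)],
    "label": ["clickbait"] * 30 + ["no clickbait"] * 70,
})

train, validation, test = split_72_8_20(toy)
for name, part in [("train", train), ("validation", validation), ("test", test)]:
    rate = (part["label"] == "clickbait").mean()
    print(f"{name}: {len(part)} rows, target rate {rate:.2f}")
```

With stratification, each split keeps a target rate close to that of the full dataset, up to rounding on small splits.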


**PROBLEM 3** 

- Using the scikit-learn pipeline module, create a Pipeline to train a BOW Naïve Bayes model. We suggest the classes CountVectorizer and MultinomialNB. Include both unigrams and bigrams in your vectorizer vocabulary.
- Fit your classifier on your training set.
- Compute the precision, recall, and F1-score on both your training and validation datasets using functions in sklearn.metrics. Use "clickbait" as your target class (i.e., y=1 for clickbait and y=0 for non-clickbait).

In [3]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.metrics import precision_score, recall_score, f1_score

In [4]:
# Binary targets: 1 = clickbait, 0 = not clickbait
y_train = (train_data['label'] == 'clickbait').astype(int)
y_validation = (validation_data['label'] == 'clickbait').astype(int)

In [5]:
pipeline = Pipeline([
    ('vectorizer', CountVectorizer(ngram_range=(1, 2))),  # Include unigrams and bigrams
    ('classifier', MultinomialNB())
])

pipeline.fit(train_data['text'], y_train)

In [6]:
# Predictions on both training and validation datasets
y_train_pred = pipeline.predict(train_data['text'])
y_validation_pred = pipeline.predict(validation_data['text'])

# Compute precision, recall, and F1-score on the training dataset
precision_train = precision_score(y_train, y_train_pred)
recall_train = recall_score(y_train, y_train_pred)
f1_train = f1_score(y_train, y_train_pred)

# Compute precision, recall, and F1-score on the validation dataset
precision_validation = precision_score(y_validation, y_validation_pred)
recall_validation = recall_score(y_validation, y_validation_pred)
f1_validation = f1_score(y_validation, y_validation_pred)

# Print the results
print("Training Metrics:")
print(f"Precision: {precision_train:.2f}")
print(f"Recall: {recall_train:.2f}")
print(f"F1-score: {f1_train:.2f}")
print("\nValidation Metrics:")
print(f"Precision: {precision_validation:.2f}")
print(f"Recall: {recall_validation:.2f}")
print(f"F1-score: {f1_validation:.2f}")

Training Metrics:
Precision: 1.00
Recall: 1.00
F1-score: 1.00

Validation Metrics:
Precision: 0.94
Recall: 0.90
F1-score: 0.92
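
The three separate metric calls used above can also be collapsed into one. The sketch below is an illustration on made-up labels, not a re-run of the notebook's data: it uses `precision_recall_fscore_support` from `sklearn.metrics` with `average="binary"` so that only the positive (clickbait) class is scored.

```python
from sklearn.metrics import precision_recall_fscore_support

# Toy labels (made up): 1 = clickbait, 0 = not clickbait.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]

# average="binary" reports the metrics for pos_label only.
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="binary", pos_label=1
)
print(f"Precision: {precision:.2f}  Recall: {recall:.2f}  F1-score: {f1:.2f}")
```

Here there are 3 true positives, 1 false positive, and 1 false negative, so all three metrics come out to 0.75.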


**PROBLEM 4**

Using the ParameterGrid class, run a small grid search where you vary at least 3 parameters of your model:
- max_df for your count vectorizer (threshold to filter document frequency)
- alpha or smoothing of your Naïve Bayes model
- One other parameter of your choice.

In [7]:
from sklearn.model_selection import ParameterGrid
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.metrics import f1_score

In [8]:
# Define the hyperparameter grid for your pipeline
param_grid = {
    'vectorizer__max_df': [0.5, 0.7, 0.9], 
    'classifier__alpha': [0.1, 0.5, 1.0], 
    'vectorizer__ngram_range': [(1, 1), (1, 2)]
}

# Initialize an empty list to store results
results = []

# Loop over the parameter combinations
for params in ParameterGrid(param_grid):
    # Create a scikit-learn pipeline with the current parameter settings
    pipeline = Pipeline([
        ('vectorizer', CountVectorizer(max_df=params['vectorizer__max_df'], ngram_range=params['vectorizer__ngram_range'])),
        ('classifier', MultinomialNB(alpha=params['classifier__alpha']))
    ])

    # Fit the model on the training data
    pipeline.fit(train_data['text'], y_train)

    # Make predictions on the validation data
    y_validation_pred = pipeline.predict(validation_data['text'])

    # Calculate the F1-score for the current parameter settings
    f1 = f1_score(y_validation, y_validation_pred)

    # Store the results
    results.append({
        'params': params,
        'f1_score': f1
    })

# Find the best parameter combination based on F1-score
best_result = max(results, key=lambda x: x['f1_score'])

# Print the best parameter combination and its F1-score
print("Best Parameter Combination:")
print(best_result['params'])
print("Best F1-score:", best_result['f1_score'])

Best Parameter Combination:
{'classifier__alpha': 0.5, 'vectorizer__max_df': 0.5, 'vectorizer__ngram_range': (1, 1)}
Best F1-score: 0.9215686274509804
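
Because the grid keys already follow the `step__parameter` naming convention, the pipeline does not have to be rebuilt by hand for every combination. A minimal sketch, assuming toy data in place of the real splits, is to clone one base pipeline and pass each combination to `set_params`:

```python
from sklearn.base import clone
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import f1_score
from sklearn.model_selection import ParameterGrid
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy data (made up) standing in for the train and validation splits.
train_texts = [
    "you will not believe this trick",
    "this is what happens next",
    "10 things you need to know",
    "senate passes budget bill",
    "stocks fall after rate decision",
    "city council approves new park",
]
train_labels = [1, 1, 1, 0, 0, 0]
val_texts = ["you need to see this", "council passes budget"]
val_labels = [1, 0]

base_pipeline = Pipeline([
    ("vectorizer", CountVectorizer()),
    ("classifier", MultinomialNB()),
])

param_grid = {
    "vectorizer__max_df": [0.5, 0.9],
    "vectorizer__ngram_range": [(1, 1), (1, 2)],
    "classifier__alpha": [0.1, 1.0],
}

results = []
for params in ParameterGrid(param_grid):
    # clone() gives a fresh, unfitted copy; set_params() routes each
    # "step__parameter" key to the matching pipeline step.
    model = clone(base_pipeline).set_params(**params)
    model.fit(train_texts, train_labels)
    score = f1_score(val_labels, model.predict(val_texts))
    results.append({"params": params, "f1_score": score})

best = max(results, key=lambda r: r["f1_score"])
print(len(results), "combinations tried; best:", best["params"])
```

This keeps the grid definition as the single place where parameters are named, so adding a fourth parameter only requires a new key in `param_grid`.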


**PROBLEM 5**

Using these validation-set metrics from the previous problem, choose one model as your selected model. It is up to you how to choose this model; one approach is to choose the model that shows the highest F1-score on your validation set.
Next, apply your selected model to your test set and compute precision, recall, and F1.

In [9]:
y_test = (test_data['label'] == 'clickbait').astype(int)

In [10]:
# Find the best parameter combination based on F1-score on the validation set
best_result = max(results, key=lambda x: x['f1_score'])
best_params = best_result['params']

# Create the selected model using the best parameters
selected_model = Pipeline([
    ('vectorizer', CountVectorizer(max_df=best_params['vectorizer__max_df'], ngram_range=best_params['vectorizer__ngram_range'])),
    ('classifier', MultinomialNB(alpha=best_params['classifier__alpha']))
])
selected_model.fit(train_data['text'], y_train)

# Make predictions on the test data using the selected model
y_test_pred = selected_model.predict(test_data['text'])

precision_test = precision_score(y_test, y_test_pred)
recall_test = recall_score(y_test, y_test_pred)
f1_test = f1_score(y_test, y_test_pred)

print("Test Set Metrics:")
print(f"Precision: {precision_test:.2f}")
print(f"Recall: {recall_test:.2f}")
print(f"F1-score: {f1_test:.2f}")


Test Set Metrics:
Precision: 0.82
Recall: 0.86
F1-score: 0.84
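
As a quick consistency check on the printed test metrics, F1 is the harmonic mean of precision and recall. Plugging in the rounded values above (so the result is itself approximate):

$$
F_1 = \frac{2PR}{P + R} = \frac{2 \times 0.82 \times 0.86}{0.82 + 0.86} = \frac{1.4104}{1.68} \approx 0.84
$$

which matches the reported F1-score.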


**PROBLEM 6**

Using the log-probabilities of the model you selected in the previous problem, select 5 words that are strong Clickbait indicators. That is, if you needed to filter headlines based on a single word, without a machine learning model, then these words would be good options. Show this list of keywords in your notebook.

In [11]:
# Access the classifier from the selected model
classifier = selected_model.named_steps['classifier']

# Access the feature log probabilities for both classes
log_probabilities = classifier.feature_log_prob_

# Get the vocabulary used by the vectorizer
vectorizer = selected_model.named_steps['vectorizer']
vocabulary = vectorizer.get_feature_names_out()

# Create dictionaries to store each word's log probability under each class
clickbait_log_probs = {}
non_clickbait_log_probs = {}

# Assuming that clickbait is class 1 and non-clickbait is class 0
for word_idx, word in enumerate(vocabulary):
    clickbait_log_prob = log_probabilities[1][word_idx]
    non_clickbait_log_prob = log_probabilities[0][word_idx]
    
    clickbait_log_probs[word] = clickbait_log_prob
    non_clickbait_log_probs[word] = non_clickbait_log_prob

# Sort the words by their clickbait log probability (descending).
# Note: this ranks by frequency within the clickbait class, so very common words rank first.
sorted_clickbait_words = sorted(clickbait_log_probs.items(), key=lambda x: x[1], reverse=True)

# Select the top 5 words as strong clickbait indicators
top_clickbait_words = sorted_clickbait_words[:5]

# Print the list of top clickbait indicator words
print("Top 5 Strong Clickbait Indicator Words:")
for word, log_prob in top_clickbait_words:
    print(f"{word}: {log_prob:.2f}")


Top 5 Strong Clickbait Indicator Words:
the: -3.55
you: -4.09
to: -4.14
this: -4.22
is: -4.27


**PROBLEM 7**

Your IT department has reached out to you because they heard you can help them find clickbait. They are interested in your machine learning model, but they need a solution today.
- Write a regular expression that checks if any of the keywords from the previous problem are found in the text. You should write one regular expression that detects any of your top 5 keywords. Your regular expression should be aware of word boundaries in some way. That is, the keyword "win" should not be detected in the text "Gas prices up in winter months".
- Using the Python re library, apply your regular expression to your test set (see the function re.search). What is the precision and recall of this classifier?

In [12]:
import re

# Define the top 5 clickbait indicator keywords
top_clickbait_keywords = ["the", "you", "to", "this", "is"]

# Create a regular expression pattern to match any of the keywords with word boundaries
pattern = r'\b(?:' + '|'.join(re.escape(keyword) for keyword in top_clickbait_keywords) + r')\b'

# Initialize variables to count true positives, false positives, and false negatives
true_positives = 0
false_positives = 0
false_negatives = 0

# Iterate over the test data and labels to calculate precision and recall
for text, label in zip(test_data['text'], y_test):
    # Search for the pattern in the text
    if re.search(pattern, text, flags=re.IGNORECASE):
        # If the pattern is found in the text and it's labeled as clickbait, it's a true positive
        if label == 1:
            true_positives += 1
        # If the pattern is found in the text but it's not labeled as clickbait, it's a false positive
        else:
            false_positives += 1
    # If the pattern is not found in the text and it's labeled as clickbait, it's a false negative
    elif label == 1:
        false_negatives += 1

# Calculate precision and recall
precision = true_positives / (true_positives + false_positives)
recall = true_positives / (true_positives + false_negatives)

# Print the precision and recall
print("Precision:", precision)
print("Recall:", recall)

Precision: 0.4180327868852459
Recall: 0.6845637583892618
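
The word-boundary requirement from the problem statement can be checked in isolation. The sketch below reuses the `win`/`winter` example, compiles the pattern once, and scores the predictions with `precision_score` and `recall_score` instead of hand counts. The keyword list, the toy headlines, and the helper name `flag_clickbait` are assumptions made for illustration.

```python
import re
from sklearn.metrics import precision_score, recall_score

keywords = ["win", "you", "this"]  # hypothetical keyword list for illustration
pattern = re.compile(
    r"\b(?:" + "|".join(re.escape(k) for k in keywords) + r")\b",
    flags=re.IGNORECASE,
)


def flag_clickbait(text):
    """Hypothetical helper: 1 if any keyword occurs as a whole word, else 0."""
    return int(pattern.search(text) is not None)


# Toy headlines (made up) with labels: 1 = clickbait, 0 = not clickbait.
headlines = [
    ("Gas prices up in winter months", 0),
    ("Win a free trip with one simple trick", 1),
    ("You will not believe what happened", 1),
    ("Council says this plan is final", 0),
    ("Ten photos that prove cats are liquid", 1),
]
y_true = [label for _, label in headlines]
y_pred = [flag_clickbait(text) for text, _ in headlines]

print("Predictions:", y_pred)
print("Precision:", precision_score(y_true, y_pred))
print("Recall:", recall_score(y_true, y_pred))
```

The first headline is not flagged because `\b` requires a word boundary after "win", and in "winter" the next character is a letter.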
