<img src="logoiit.png" width="200" style="float: right;">

**NATURAL LANGUAGE PROCESSING. HOMEWORK 3.**<br>
Author: Lucía Colín Cosano. A20552447.

**PROBLEM 1 – Reading the data**

• Read in file "train.tsv" from the Stanford Sentiment Treebank (SST) as shared in the GLUE task. (See section "DATA" above.)

• Next, split your dataset into train, test, and validation datasets with the sizes defined.

• Review the column "label" which indicates positive=1 or negative=0 sentiment. What is the prior probability of each class on your training set? Show results in your notebook.

In [1]:
import nltk
from nltk.tokenize import word_tokenize
import pandas as pd
import math

In [2]:
df = pd.read_csv("train.tsv", sep='\t', header=0)

In [3]:
validation_size = 100
test_size = 100

validation_set = df.sample(validation_size, random_state=1)
df = df.drop(validation_set.index)

test_set = df.sample(test_size, random_state=2)
df = df.drop(test_set.index)

training_set = df

In [4]:
num_positive = training_set['label'].sum() 
num_negative = len(training_set) - num_positive  
total_samples = len(training_set)

prior_prob_positive = num_positive / total_samples
prior_prob_negative = num_negative / total_samples

print("Prior Probability of Positive Sentiment:", prior_prob_positive)
print("Prior Probability of Negative Sentiment:", prior_prob_negative)

Prior Probability of Positive Sentiment: 0.5578638550090098
Prior Probability of Negative Sentiment: 0.4421361449909902
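
As a cross-check, the same priors can be computed for any label set by counting labels directly. The sketch below uses a small hypothetical list, `toy_labels`, in place of `training_set['label']`; all names in it are illustrative and not part of the assignment.

```python
from collections import Counter

# Hypothetical labels standing in for training_set['label']
toy_labels = [1, 0, 1, 1, 0, 1, 0, 1]

# Count each label, then divide by the number of samples to get class priors
label_counts = Counter(toy_labels)
toy_priors = {label: n / len(toy_labels) for label, n in label_counts.items()}

print(toy_priors[1])  # 0.625
print(toy_priors[0])  # 0.375
```

Unlike the sum-based computation, this form does not rely on the labels being exactly 0 and 1.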


**PROBLEM 2 – Tokenizing data**

• Write a function that takes a sentence as input, represented as a string, and converts it to a tokenized sequence padded by start and end symbols. 

• Apply your function to all sentences in your training set. Show the tokenization of the first sentence of your training set in your notebook output.

• What is the vocabulary size of your training set? Include your start and end symbol in your vocabulary. Show your result in your notebook.

In [5]:
def tokenize_sentence(sentence):
    tokens = word_tokenize(sentence)
    tokens = ['<s>'] + tokens + ['</s>']
    return tokens

In [6]:
training_set['tokenized_sentence'] = training_set['sentence'].apply(tokenize_sentence)

unique_tokens = set()

# Collect unique tokens from all tokenized sentences
for tokens in training_set['tokenized_sentence']:
    unique_tokens.update(tokens)

# Tokenize the first sentence in your training set
first_sentence = training_set['sentence'].iloc[0]
tokenized_first_sentence = tokenize_sentence(first_sentence)

# Display the tokenization of the first sentence
print("Tokenization of the first sentence:")
print(first_sentence)
print(tokenized_first_sentence)

Tokenization of the first sentence:
hide new secretions from the parental units 
['<s>', 'hide', 'new', 'secretions', 'from', 'the', 'parental', 'units', '</s>']


In [7]:
# Include start and end symbols in the vocabulary
unique_tokens.add('<s>')
unique_tokens.add('</s>')

# Calculate the vocabulary size
vocab_size = len(unique_tokens)

# Display the vocabulary size
print("Vocabulary Size:", vocab_size)

Vocabulary Size: 14802
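
To make the padding and vocabulary count easy to check by hand, here is a minimal sketch on two made-up sentences. It assumes a plain whitespace split as a stand-in for `word_tokenize`, so that it runs without NLTK data; `toy_tokenize` and the other names are hypothetical.

```python
from collections import Counter

def toy_tokenize(sentence):
    # Whitespace split is a stand-in for nltk's word_tokenize (assumption)
    return ['<s>'] + sentence.split() + ['</s>']

toy_sentences = ["a great movie", "a dull movie"]
toy_tokenized = [toy_tokenize(s) for s in toy_sentences]

# Counter keeps token frequencies; its keys form the vocabulary
toy_token_freq = Counter(tok for toks in toy_tokenized for tok in toks)
toy_vocab_size = len(toy_token_freq)

print(toy_tokenized[0])  # ['<s>', 'a', 'great', 'movie', '</s>']
print(toy_vocab_size)    # 6
```

Because every tokenized sentence already carries `<s>` and `</s>`, both symbols end up in the vocabulary without being added separately.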


**PROBLEM 3 – Bigram counts**

• Write a function that takes an array of tokenized sequences as input (i.e., a list of lists) and counts bigram frequencies in that dataset. Your function should return a two-level dictionary (dictionary of dictionaries) or similar data structure, where the value at index [wi][wj] gives the frequency count of bigram (wi, wj). For example, this expression would give the counts of the bigram "academy award": bigram_counts["academy"]["award"]

• Apply your function to the output of problem 2. You should build one counter that represents all sentences in the training dataset.

In [8]:
def count_bigrams(tokenized_sequences):
    bigram_counts = {} 

    for tokens in tokenized_sequences:
        for wi, wj in zip(tokens, tokens[1:]):
            bigram_counts.setdefault(wi, {}).setdefault(wj, 0)
            bigram_counts[wi][wj] += 1

    return bigram_counts

In [9]:
bigram_counts = count_bigrams(training_set['tokenized_sentence'])

In [10]:
start_with_the_count = bigram_counts.get("<s>", {}).get("the", 0)

In [11]:
print("Bigram count of ('<s>', 'the'): ", start_with_the_count)

Bigram count of ('<s>', 'the'):  4455
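
An equivalent two-level structure can be sketched with `collections.defaultdict` and `Counter`, which removes the nested `setdefault` calls. This is an alternative illustration on hypothetical toy sequences, not the function used for the results above; `count_bigrams_dd` is an assumed name.

```python
from collections import Counter, defaultdict

def count_bigrams_dd(tokenized_sequences):
    # Outer key: previous word; inner Counter: next word -> frequency
    counts = defaultdict(Counter)
    for tokens in tokenized_sequences:
        for wi, wj in zip(tokens, tokens[1:]):
            counts[wi][wj] += 1
    return counts

toy_sequences = [
    ['<s>', 'the', 'academy', 'award', '</s>'],
    ['<s>', 'the', 'award', '</s>'],
]
toy_bigrams = count_bigrams_dd(toy_sequences)

print(toy_bigrams['<s>']['the'])        # 2
print(toy_bigrams['academy']['award'])  # 1
print(toy_bigrams['award']['the'])      # 0 (unseen bigram)
```

One design caveat: indexing a `defaultdict` with an unseen first word inserts an empty `Counter` for it, so lookups with `.get(...)` are safer when the structure must not grow.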


**PROBLEM 4 – Smoothing**

• Write a function that implements formula [6.13] in that E-NLP textbook (page 129, 6.2 Smoothing and discounting). That is, write a function that applies smoothing and returns a (negative) log-probability of a word given the previous word in the sequence. 

• Use this function to show the log probability that the word "academy" will be followed by the word "award". Try this with alpha=0.001 and alpha=0.5 (you should see very different results!). Show your results in your notebook.
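
For reference, the add-alpha estimate that the `calculate_smoothed_log_prob` function in this problem computes can be written as follows, where $V$ is the vocabulary size (`vocab_size`) and the sum runs over all words $w'$ observed after $w_{m-1}$:

$$
\log p_{\text{smooth}}(w_m \mid w_{m-1}) = \log \frac{\operatorname{count}(w_{m-1}, w_m) + \alpha}{\sum_{w'} \operatorname{count}(w_{m-1}, w') + \alpha V}
$$

With the vocabulary size of 14802 found in Problem 2, the term $\alpha V$ equals 14.802 for $\alpha = 0.001$ and 7401 for $\alpha = 0.5$, so for a context word with few observations the larger alpha dominates the denominator.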

In [12]:
def calculate_smoothed_log_prob(wm, wm_1, bigram_counts, alpha, vocab_size):
    count_wm_1_wm = bigram_counts.get(wm_1, {}).get(wm, 0)

    numerator = count_wm_1_wm + alpha
    denominator = sum(bigram_counts.get(wm_1, {}).values()) + (alpha * vocab_size)
    log_prob = math.log(numerator / denominator)

    return log_prob

In [13]:
# Assuming you already have bigram_counts, vocab_size, and the words "academy" and "award"
word_wm_1 = "academy"
word_wm = "award"
alpha_1 = 0.001
alpha_2 = 0.5

log_prob_alpha_1 = calculate_smoothed_log_prob(word_wm, word_wm_1, bigram_counts, alpha_1, vocab_size)
log_prob_alpha_2 = calculate_smoothed_log_prob(word_wm, word_wm_1, bigram_counts, alpha_2, vocab_size)

print(f"Log Probability ('{word_wm_1}' -> '{word_wm}') with alpha = {alpha_1}: {log_prob_alpha_1}")
print(f"Log Probability ('{word_wm_1}' -> '{word_wm}') with alpha = {alpha_2}: {log_prob_alpha_2}")

Log Probability ('academy' -> 'award') with alpha = 0.001: -1.0248273197292836
Log Probability ('academy' -> 'award') with alpha = 0.5: -6.172171898547395
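
The sensitivity to alpha can be reproduced on small numbers. The sketch below assumes hypothetical counts (a bigram seen 3 times, its context word seen 5 times, and a vocabulary of 1000 words); `smoothed_log_prob` is an illustrative helper that takes counts directly rather than the notebook's dictionary.

```python
import math

def smoothed_log_prob(count_bigram, count_context, alpha, vocab_size):
    # (bigram count + alpha) / (context total + alpha * V), in log space
    return math.log((count_bigram + alpha) / (count_context + alpha * vocab_size))

# Hypothetical counts: bigram seen 3 times, context seen 5 times, V = 1000
log_prob_small_alpha = smoothed_log_prob(3, 5, 0.001, 1000)
log_prob_large_alpha = smoothed_log_prob(3, 5, 0.5, 1000)

print(log_prob_small_alpha)
print(log_prob_large_alpha)
print(log_prob_small_alpha > log_prob_large_alpha)  # True
```

With the small alpha the estimate stays close to the unsmoothed ratio 3/5, while the large alpha moves most of the probability mass to unseen continuations.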


**PROBLEM 5 – Sentence log-probability**

• Write a function that returns the log-probability of a sentence, which is expected to be a negative number. To do this, assume that the probability of a word in a sequence depends only on the previous word.

• Use your function to compute the log probability of these two sentences. (Note that the 2nd is not natural English, so it should have a lower (more negative) result than the first.)

In [14]:
def calculate_sentence_log_prob(sentence, bigram_counts, alpha, vocab_size):
    # Note: plain whitespace split; unlike tokenize_sentence, this adds no <s>/</s> padding
    words = sentence.split()
    log_prob = 0.0

    for i in range(1, len(words)):
        log_prob += calculate_smoothed_log_prob(words[i], words[i - 1], bigram_counts, alpha, vocab_size)

    return log_prob

In [15]:
sentence1 = "this was a really great movie but it was a little too long."
sentence2 = "long too little a was it but movie great really a was this."

alpha = 0.01 

log_prob1 = calculate_sentence_log_prob(sentence1, bigram_counts, alpha, vocab_size)
log_prob2 = calculate_sentence_log_prob(sentence2, bigram_counts, alpha, vocab_size)

print("Log Probability of Sentence 1:", log_prob1)
print("Log Probability of Sentence 2:", log_prob2)

Log Probability of Sentence 1: -67.79457212208044
Log Probability of Sentence 2: -126.27667113068851


The results confirm the expected behavior: the second sentence has a more negative log probability (about -126.28 versus -67.79).
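
Note that `calculate_sentence_log_prob` splits on whitespace and does not add `<s>` and `</s>`, whereas the training bigrams were counted on padded, tokenized sentences. A variant that scores sentences with the same padding might look like the sketch below. It is self-contained and uses a whitespace tokenizer plus a two-sentence toy corpus as assumptions; applying this idea to the real data would change the numbers reported above.

```python
import math

def toy_tokenize(sentence):
    # Whitespace split stands in for word_tokenize (assumption); padding as in Problem 2
    return ['<s>'] + sentence.split() + ['</s>']

def toy_count_bigrams(tokenized_sequences):
    counts = {}
    for tokens in tokenized_sequences:
        for wi, wj in zip(tokens, tokens[1:]):
            inner = counts.setdefault(wi, {})
            inner[wj] = inner.get(wj, 0) + 1
    return counts

def padded_sentence_log_prob(sentence, bigram_counts, alpha, vocab_size):
    # Score every adjacent pair, including (<s>, first word) and (last word, </s>)
    tokens = toy_tokenize(sentence)
    log_prob = 0.0
    for prev, cur in zip(tokens, tokens[1:]):
        context = bigram_counts.get(prev, {})
        numerator = context.get(cur, 0) + alpha
        denominator = sum(context.values()) + alpha * vocab_size
        log_prob += math.log(numerator / denominator)
    return log_prob

toy_corpus = ["a great movie", "a great film"]
toy_counts = toy_count_bigrams([toy_tokenize(s) for s in toy_corpus])
toy_vocab = {tok for s in toy_corpus for tok in toy_tokenize(s)}

natural = padded_sentence_log_prob("a great movie", toy_counts, 0.1, len(toy_vocab))
scrambled = padded_sentence_log_prob("movie great a", toy_counts, 0.1, len(toy_vocab))
print(natural > scrambled)  # True
```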

**PROBLEM 6 – Tuning Alpha**

Next, use your validation set to select a good value for "alpha".

• Apply the function you wrote in Problem 5 to your validation dataset using 3 different values of "alpha", such as (0.001, 0.01, 0.1). For each value, show the log-likelihood estimate of the validation set. That is, in your notebook show the sum of the log probabilities of all sentences.

• Which alpha gives you the best result? To indicate your selection to the grader, save your selected value to a variable named "selected_alpha".

In [16]:
alpha_values = [0.001, 0.01, 0.1]

log_likelihoods = [] 

for alpha in alpha_values:
    total_log_prob = 0.0
    for sentence in validation_set['sentence']:
        total_log_prob += calculate_sentence_log_prob(sentence, bigram_counts, alpha, vocab_size)
    log_likelihoods.append(total_log_prob)

best_alpha = alpha_values[log_likelihoods.index(max(log_likelihoods))]

print("Log-likelihood estimates for different alpha values:")
for i, alpha in enumerate(alpha_values):
    print(f"Alpha = {alpha}: Log-likelihood = {log_likelihoods[i]}")

print("Best alpha:", best_alpha)
selected_alpha=best_alpha

Log-likelihood estimates for different alpha values:
Alpha = 0.001: Log-likelihood = -3829.3573328360308
Alpha = 0.01: Log-likelihood = -4286.71505268244
Alpha = 0.1: Log-likelihood = -5217.489322163317
Best alpha: 0.001
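
The selection step can also be written with a dictionary keyed by alpha, which keeps each value next to its score. The sketch below reuses the three log-likelihoods printed above; `validation_ll` and `best_alpha_from_dict` are illustrative names.

```python
# Validation log-likelihoods copied from the output above
validation_ll = {
    0.001: -3829.3573328360308,
    0.01: -4286.71505268244,
    0.1: -5217.489322163317,
}

# The best alpha is the one with the highest (least negative) log-likelihood
best_alpha_from_dict = max(validation_ll, key=validation_ll.get)
print(best_alpha_from_dict)  # 0.001
```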


**PROBLEM 7 – Applying Language Models**

In this problem, you will classify your test set of 100 sentences by sentiment, by applying your work from previous problems and modeling the language of both positive and negative sentiment.
To do this, you can follow these steps:

• Separate your training dataset into positive and negative sentences, and compute vocabulary size and bigram counts for both datasets.

• For each of the 100 sentences in your test set:

- Compute both a "positive sentiment score" and a "negative sentiment score" using (1) the function you wrote in Problem 5, (2) Bayes rule, and (3) class priors as computed in Problem 1.

- Compare these scores to assign a predicted sentiment label to the sentence.

• What is the class distribution of your predicted label? That is, how often did your method predict positive sentiment, correctly or incorrectly? How often did it predict negative sentiment? Show results in your notebook.

• Compare your predicted label to the true sentiment label. What is the accuracy of this experiment? That is, how often did the true and predicted label match on the test set? Show results in your notebook.

In [17]:
# Step 1: Separate the Training Dataset into positive and negative subsets
positive_training_set = training_set[training_set['label'] == 1]
negative_training_set = training_set[training_set['label'] == 0]

# Step 2: Compute Vocabulary Size and Bigram Counts for Both Datasets
vocab_size_pos = len(set(word for tokens in positive_training_set['tokenized_sentence'] for word in tokens))
bigram_counts_pos = count_bigrams(positive_training_set['tokenized_sentence'])

vocab_size_neg = len(set(word for tokens in negative_training_set['tokenized_sentence'] for word in tokens))
bigram_counts_neg = count_bigrams(negative_training_set['tokenized_sentence'])


# Step 3 and 4: Compute Sentiment Scores and Predict Sentiment Labels for the Test Set
predicted_sentiment_labels = []

for sentence in test_set['sentence']:
    log_prob_positive = calculate_sentence_log_prob(sentence, bigram_counts_pos, selected_alpha, vocab_size_pos) + math.log(prior_prob_positive)
    log_prob_negative = calculate_sentence_log_prob(sentence, bigram_counts_neg, selected_alpha, vocab_size_neg) + math.log(prior_prob_negative)

    if log_prob_positive > log_prob_negative:
        predicted_sentiment_labels.append(1)  # Predicted positive sentiment
    else:
        predicted_sentiment_labels.append(0)  # Predicted negative sentiment

# Step 5: Analyze the Class Distribution of Predicted Labels
positive_predictions = sum(predicted_sentiment_labels)
negative_predictions = len(predicted_sentiment_labels) - positive_predictions

# Step 6: Calculate Accuracy
true_labels = test_set['label'].tolist()
correct_predictions = sum(1 for true, predicted in zip(true_labels, predicted_sentiment_labels) if true == predicted)
accuracy = correct_predictions / len(test_set)

# Print results
print("Class Distribution of Predicted Labels:")
print("Predicted Positive Sentiment:", positive_predictions)
print("Predicted Negative Sentiment:", negative_predictions)
print("Accuracy:", accuracy)

Class Distribution of Predicted Labels:
Predicted Positive Sentiment: 56
Predicted Negative Sentiment: 44
Accuracy: 0.83
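
The decision rule used above adds the log prior to each class-conditional log-likelihood and picks the larger score, which is Bayes' rule with the shared denominator dropped. The following sketch isolates that rule and adds a small confusion count; the per-sentence scores, labels, and priors are hypothetical values chosen for illustration.

```python
import math
from collections import Counter

def predict_label(log_lik_pos, log_lik_neg, prior_pos, prior_neg):
    # log P(class | sentence) is proportional to log P(sentence | class) + log P(class)
    score_pos = log_lik_pos + math.log(prior_pos)
    score_neg = log_lik_neg + math.log(prior_neg)
    return 1 if score_pos > score_neg else 0

# Hypothetical (positive-model, negative-model) log-likelihoods and true labels
toy_scores = [(-10.0, -12.0), (-20.0, -15.0), (-8.0, -8.1), (-30.0, -29.0)]
toy_true = [1, 0, 0, 0]

toy_pred = [predict_label(p, n, 0.56, 0.44) for p, n in toy_scores]

# Keys are (true label, predicted label) pairs
confusion = Counter(zip(toy_true, toy_pred))
toy_accuracy = sum(t == p for t, p in zip(toy_true, toy_pred)) / len(toy_true)

print(toy_pred)      # [1, 0, 1, 0]
print(toy_accuracy)  # 0.75
```

A confusion count of this kind shows whether errors come mostly from false positives or false negatives, which the overall accuracy alone does not reveal.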
