# Natural Language Processing - Assignment 2
# Sentiment analysis for movie reviews

This notebook was created for you to answer question 2, 3 and 4 from assignment 2. Please read the steps and the provided code carefully and make sure you understand them. 

The (red) comments at the beginning of each function explain what they should do, which parameters you should give as input and which variables should be returned by the function. After the (green) comments "### student code here###' you should write your own code.

**Please modify the next cell specifying your group number**

 *This is the Notebook of* ***Group 0*** 




### Prerequisite - Libraries
Make sure you have the needed libraries installed on your computer: scikit-learn, Pandas, NLTK...

### Prerequisite - Load Data

In the first step, we are going to load the data in a Pandas DataFrame. Pandas DataFrames are a useful way of storing data. DataFrames are tables in which data can be accessed as columns, as rows or as individual cells. You can find more info on DataFrames here: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html

Read the code below and make sure you understand what is happening. Run the code to load your data.

In [26]:
import os
import re
import pandas as pd
import numpy as np
import glob
### student code here: import the needed modules from sci-kit learn ###
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score


In [4]:
def get_path(filename):
    """
    Makes a list of all the paths that fit the search requirement

    :param filename: A regular expression that defines the search requirement for the filenames
    :return  Returns a list of all the pathnames
    """
    # place the movies folder in the same directory as this notebook
    current_directory = os.getcwd()
    # if you are using Google Colab, you will have to change the above line
    # to load the dataset from your Google Drive

    # glob.glob() is a pattern-matching path finder, it searches for the reviews in the movies folder based on a Regular Expression
    paths = glob.glob(current_directory + '/movies/' + filename)

    if len(paths) == 0:
        print('Your file list is empty. The code looks for the folder '+current_directory+'/movies, but could not find it.')
    else:
        print("Found ", len(paths), "files")
    return paths

In [15]:
def load_data(pathset):
    """
    Loads the data into a dataframe

    :param pathset:  A list of paths
    :return  A dataframe with three columns: Path, Review (Text) and Label
    """
    # Files are named by sentiment (P for positive, N for negative)
    pattern = re.compile('P-(train|test)[0-9]*.txt')
    reviews = []
    labels = []
    df = pd.DataFrame(columns = ['Path', 'Review', 'Label'])
    for path in pathset:
        if re.search(pattern, path):
            text = open(path, "r").read()
            reviews.append(text)
            labels.append('Pos')
        else:
            text = open(path, "r").read()
            reviews.append(text)
            labels.append('Neg')
    df['Path'] = pathset
    df['Review'] = reviews
    df['Label'] = labels
    return df

In [31]:
#Load the files in the Dataframe. This will take a while...
paths = get_path('train/[NP]-train[0-9]*.txt')
data = load_data(paths)
data.head()

Found  1200 files


Unnamed: 0,Path,Review,Label
0,/Users/kornelovics/EIT/UT/NLP/project-nlp/home...,"Let's see, cardboard characters like Muslim te...",Neg
1,/Users/kornelovics/EIT/UT/NLP/project-nlp/home...,"""May contain spoilers"" Sadly Lou Costellos' la...",Neg
2,/Users/kornelovics/EIT/UT/NLP/project-nlp/home...,"I can't emphasize it enough, do *NOT* get this...",Neg
3,/Users/kornelovics/EIT/UT/NLP/project-nlp/home...,I am truly sad that this is the first bad revi...,Neg
4,/Users/kornelovics/EIT/UT/NLP/project-nlp/home...,I'm a Petty Officer 1st Class (E-6) and have b...,Pos


### Part 2 - Tokenization

In this step, you should write a tokenizer and compare it with an off-the-shelf one.

#### Question 2.1 Making your own tokenizer

In [7]:
def my_tokenizer(text):
    """
    The implementation of your own tokenizer

    :param text:  A string with a sentence (or paragraph, or document...)
    :return  A list of tokens
    """
    ### student code here ###


    # Convert to lowercase
    text = text.lower()

    # Split on whitespace and punctuation using regex
    # This pattern keeps letters, digits, and apostrophes together
    tokens = re.findall(r"[a-zA-Z0-9']+|[.,!?;]", text)

    # Remove empty strings
    tokens = [token for token in tokens if token.strip()]

    return tokens

sample_string0 = "If you have the chance, watch it. Although, a warning, you'll cry your eyes out."
sample_string1 = "kaas is lekker"
sample_string2 = "Me and My bEstfriend are going to the city!"
sample_string3 = "me and my number 3 bEStfrienD like movies"
print(my_tokenizer(sample_string0))
print(my_tokenizer(sample_string1))
print(my_tokenizer(sample_string2))
print(my_tokenizer(sample_string3))

['kaas', 'is', 'lekker']
['me', 'and', 'my', 'bestfriend', 'are', 'going', 'to', 'the', 'city', '!']
['me', 'and', 'my', 'number', '3', 'bestfriend', 'like', 'movies']


#### Question 2.2 Using an off-the-shelf tokenizer

In [13]:
#Now we are gonna compare the tokenizer you just wrote with the one from NLTK
#if you installed NLTK but never downloaded the 'punkt' tokenizer, uncomment the following lines:
import nltk
# nltk.download('punkt')
from nltk.tokenize import word_tokenize

def nltk_tokenizer(text):
    """
    This function should apply the word_tokenize (punkt) tokenizer of nltk to the input text

    :param text:  A string with a sentence (or paragraph, or document...)
    :return  A list of tokens
    """
    ### student code here ###
    # Ensure required tokenizers are available
    try:
        nltk.data.find('tokenizers/punkt')
    except LookupError:
        nltk.download('punkt')

    # Newer NLTK versions may also need 'punkt_tab'
    try:
        nltk.data.find('tokenizers/punkt_tab')
    except LookupError:
        try:
            nltk.download('punkt_tab')
        except:
            pass  # It's optional on some versions

    return word_tokenize(text)

test_sentences = ["I like this assignment because:\n-\tit is fun;\n-\tit helps me practice my Python skills.",
        "I won a prize, but I won't be able to attend the ceremony.",
        "“The strange case of Dr. Jekyll and Mr. Hyde” is a famous book... but I haven't read it.",
        "I work for the C.I.A.. And you?",
        "OMG #Twitter is sooooo coooool <3 :-) <-- lol...why do i write like this idk right? :) 🤷😂 🤖"]

for test_string in test_sentences:
    print(my_tokenizer(test_string))
    print(nltk_tokenizer(test_string))
    print("\n")


['i', 'like', 'this', 'assignment', 'because', 'it', 'is', 'fun', ';', 'it', 'helps', 'me', 'practice', 'my', 'python', 'skills', '.']


[nltk_data] Downloading package punkt_tab to
[nltk_data]     /Users/kornelovics/nltk_data...


['I', 'like', 'this', 'assignment', 'because', ':', '-', 'it', 'is', 'fun', ';', '-', 'it', 'helps', 'me', 'practice', 'my', 'Python', 'skills', '.']


['i', 'won', 'a', 'prize', ',', 'but', 'i', "won't", 'be', 'able', 'to', 'attend', 'the', 'ceremony', '.']
['I', 'won', 'a', 'prize', ',', 'but', 'I', 'wo', "n't", 'be', 'able', 'to', 'attend', 'the', 'ceremony', '.']


['the', 'strange', 'case', 'of', 'dr', '.', 'jekyll', 'and', 'mr', '.', 'hyde', 'is', 'a', 'famous', 'book', '.', '.', '.', 'but', 'i', "haven't", 'read', 'it', '.']
['“', 'The', 'strange', 'case', 'of', 'Dr.', 'Jekyll', 'and', 'Mr.', 'Hyde', '”', 'is', 'a', 'famous', 'book', '...', 'but', 'I', 'have', "n't", 'read', 'it', '.']


['i', 'work', 'for', 'the', 'c', '.', 'i', '.', 'a', '.', '.', 'and', 'you', '?']
['I', 'work', 'for', 'the', 'C.I.A', '..', 'And', 'you', '?']


['omg', 'twitter', 'is', 'sooooo', 'coooool', '3', 'lol', '.', '.', '.', 'why', 'do', 'i', 'write', 'like', 'this', 'idk', 'right', '?']
['OMG', '#', 

[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


### Part 3 - Text classification with a unigram language model

#### Training phase
You now need to create the model and train it on the documents in the dataframe. Look at the scikit learn documentation to learn how to use the CountVectorizer and MultimodalNaiveBayes modules.

In [None]:
# Part 2.1: Theory calculations with the toy corpus

print("TOY CORPUS ANALYSIS")
print("=" * 50)

# Given data from the homework
print("\nPOSITIVE CORPUS word counts:")
positive_words = {
    'the': 5, 'a': 3, 'really': 2, 'plot': 2, 'movie': 2, 'great': 2, 'are': 2,
    'actors': 1, 'all': 1, 'and': 1, 'can': 1, 'delivering': 1, 'director': 1,
    'familiar': 1, 'I': 1, 'identify': 1, 'intriguing': 1, 'is': 1, 'like': 1,
    'manages': 1, 'out': 1, 'performance': 1, 'story': 1, 'tell': 1, 'this': 1,
    'thought': 1, 'to': 1, 'twists': 1, 'was': 1, 'we': 1, 'well': 1, 'with': 1
}

print("\nNEGATIVE CORPUS word counts:")
negative_words = {
    'a': 3, 'to': 2, 'not': 2, 'movie': 2, 'I': 2, 'boring': 2,
    'actors': 1, 'again': 1, 'an': 1, 'are': 1, 'disappointing': 1,
    'enough': 1, 'great': 1, 'had': 1, 'interesting': 1, 'once': 1,
    'plot': 1, 'reminder': 1, 'see': 1, 'shoot': 1, 'terrible': 1,
    'that': 1, 'this': 1, 'uninspiring': 1, 'wasted': 1, 'wish': 1, 'with': 1
}

# Calculate totals
total_pos_words = sum(positive_words.values())
total_neg_words = sum(negative_words.values())
total_pos_vocab = len(positive_words)
total_neg_vocab = len(negative_words)

print(f"Total positive words: {total_pos_words}")
print(f"Total negative words: {total_neg_words}")
print(f"Positive vocabulary size: {total_pos_vocab}")
print(f"Negative vocabulary size: {total_neg_vocab}")

# Combined vocabulary for smoothing
all_words = set(positive_words.keys()) | set(negative_words.keys())
vocab_size = len(all_words)
print(f"Combined vocabulary size (V): {vocab_size}")

TOY CORPUS ANALYSIS

POSITIVE CORPUS word counts:

NEGATIVE CORPUS word counts:
Total positive words: 43
Total negative words: 34
Positive vocabulary size: 32
Negative vocabulary size: 27
Combined vocabulary size (V): 49


In [18]:
# Calculate P(boring|Pos) using Laplace smoothing

print("QUESTION 1: Calculate P(boring|Pos) with Laplace smoothing")
print("=" * 60)

# Formula: P(wi|Pos) = (C(wi, Pos) + k) / (sum of all positive words + k*V)
# where k=1 for Laplace smoothing

word = "boring"
count_boring_pos = positive_words.get(word, 0)  # 0 since boring doesn't appear in positive
k = 1  # Laplace smoothing parameter
V = vocab_size  # Combined vocabulary size

numerator = count_boring_pos + k
denominator = total_pos_words + (k * V)

prob_boring_pos = numerator / denominator

print(f"Word: '{word}'")
print(f"Count of '{word}' in positive reviews: C('{word}', Pos) = {count_boring_pos}")
print(f"Total words in positive reviews: {total_pos_words}")
print(f"Vocabulary size (V): {V}")
print(f"Laplace parameter (k): {k}")
print()
print(f"Formula: P('{word}'|Pos) = (C('{word}', Pos) + k) / (∑w C(w, Pos) + k*V)")
print(f"Formula: P('{word}'|Pos) = ({count_boring_pos} + {k}) / ({total_pos_words} + {k}*{V})")
print(f"Formula: P('{word}'|Pos) = {numerator} / {denominator}")
print(f"Result: P('{word}'|Pos) = {prob_boring_pos:.6f}")
print()

# Also calculate P(boring|Neg) for comparison
count_boring_neg = negative_words.get(word, 0)
numerator_neg = count_boring_neg + k
denominator_neg = total_neg_words + (k * V)
prob_boring_neg = numerator_neg / denominator_neg

print(f"For comparison:")
print(f"P('{word}'|Neg) = ({count_boring_neg} + {k}) / ({total_neg_words} + {k}*{V})")
print(f"P('{word}'|Neg) = {numerator_neg} / {denominator_neg}")
print(f"P('{word}'|Neg) = {prob_boring_neg:.6f}")
print(f"\nAs expected, P(boring|Neg) > P(boring|Pos) since 'boring' appears in negative reviews!")

QUESTION 1: Calculate P(boring|Pos) with Laplace smoothing
Word: 'boring'
Count of 'boring' in positive reviews: C('boring', Pos) = 0
Total words in positive reviews: 43
Vocabulary size (V): 49
Laplace parameter (k): 1

Formula: P('boring'|Pos) = (C('boring', Pos) + k) / (∑w C(w, Pos) + k*V)
Formula: P('boring'|Pos) = (0 + 1) / (43 + 1*49)
Formula: P('boring'|Pos) = 1 / 92
Result: P('boring'|Pos) = 0.010870

For comparison:
P('boring'|Neg) = (2 + 1) / (34 + 1*49)
P('boring'|Neg) = 3 / 83
P('boring'|Neg) = 0.036145

As expected, P(boring|Neg) > P(boring|Pos) since 'boring' appears in negative reviews!


In [None]:
# Question 2: Classify "intriguing yet disappointing"

print("QUESTION 2: Classify 'intriguing yet disappointing'")
print("=" * 60)

test_sentence = "intriguing yet disappointing"
test_words = test_sentence.split()

print(f"Test sentence: '{test_sentence}'")
print(f"Words to analyze: {test_words}")
print()

# Calculate P(sentence|Pos) and P(sentence|Neg)
# For unigram model: P(w1, w2, ..., wn|Class) = ∏ P(wi|Class)

def calculate_word_probability(word, word_counts, total_words, vocab_size, k=1):
    """Calculate P(word|Class) with Laplace smoothing"""
    count = word_counts.get(word, 0)
    return (count + k) / (total_words + k * vocab_size)

print("CALCULATING PROBABILITIES FOR EACH WORD:")
print("-" * 40)

prob_pos_total = 1.0  # Start with 1 for multiplication
prob_neg_total = 1.0

for word in test_words:
    # Calculate P(word|Pos)
    prob_word_pos = calculate_word_probability(word, positive_words, total_pos_words, vocab_size)
    # Calculate P(word|Neg)
    prob_word_neg = calculate_word_probability(word, negative_words, total_neg_words, vocab_size)

    print(f"Word: '{word}'")
    print(f"  Count in Pos: {positive_words.get(word, 0)}, Count in Neg: {negative_words.get(word, 0)}")
    print(f"  P('{word}'|Pos) = {prob_word_pos:.6f}")
    print(f"  P('{word}'|Neg) = {prob_word_neg:.6f}")

    # Multiply for unigram independence assumption
    prob_pos_total *= prob_word_pos
    prob_neg_total *= prob_word_neg
    print()

print("FINAL SENTENCE PROBABILITIES:")
print("-" * 30)
print(f"P('{test_sentence}'|Pos) = {prob_pos_total:.10f}")
print(f"P('{test_sentence}'|Neg) = {prob_neg_total:.10f}")
print()

# Determine classification
if prob_pos_total > prob_neg_total:
    classification = "POSITIVE"
    confidence = prob_pos_total / (prob_pos_total + prob_neg_total)
else:
    classification = "NEGATIVE"
    confidence = prob_neg_total / (prob_pos_total + prob_neg_total)

print(f"CLASSIFICATION: {classification}")
print(f"Confidence: {confidence:.2%}")
print()
print(f"Reasoning: Since P(sentence|{classification[:3]}) > P(sentence|{'POS' if classification=='NEGATIVE' else 'NEG'}),")
print(f"the sentence is classified as {classification}.")

QUESTION 2: Classify 'intriguing yet disappointing'
Test sentence: 'intriguing yet disappointing'
Words to analyze: ['intriguing', 'yet', 'disappointing']

CALCULATING PROBABILITIES FOR EACH WORD:
----------------------------------------
Word: 'intriguing'
  Count in Pos: 1, Count in Neg: 0
  P('intriguing'|Pos) = 0.021739
  P('intriguing'|Neg) = 0.012048

Word: 'yet'
  Count in Pos: 0, Count in Neg: 0
  P('yet'|Pos) = 0.010870
  P('yet'|Neg) = 0.012048

Word: 'disappointing'
  Count in Pos: 0, Count in Neg: 1
  P('disappointing'|Pos) = 0.010870
  P('disappointing'|Neg) = 0.024096

FINAL SENTENCE PROBABILITIES:
------------------------------
P('intriguing yet disappointing'|Pos) = 0.0000025684
P('intriguing yet disappointing'|Neg) = 0.0000034978

CLASSIFICATION: NEGATIVE
Confidence: 57.66%

Reasoning: Since P(sentence|NEG) > P(sentence|POS),
the sentence is classified as NEGATIVE.


In [27]:
### Student code here ###

# Load training data
print("Loading training data...")
train_paths = get_path('train/[NP]-train[0-9]*.txt')
train_data = load_data(train_paths)
print(f"Loaded {len(train_data)} training documents")

# Initialize CountVectorizer with default settings (includes lowercase normalization)
vectorizer = CountVectorizer()

# Transform text data to numerical features
X_train = vectorizer.fit_transform(train_data['Review'])
y_train = train_data['Label']

print(f"Feature matrix shape: {X_train.shape}")
print(f"Number of unique words: {len(vectorizer.vocabulary_)}")

# Train Naive Bayes classifier with Laplace smoothing (alpha=1.0)
classifier = MultinomialNB(alpha=1.0)
classifier.fit(X_train, y_train)

print("Model trained successfully!")

Loading training data...
Found  1200 files
Loaded 1200 training documents
Feature matrix shape: (1200, 17952)
Number of unique words: 17952
Model trained successfully!


#### Testing phase
Now that you have a trained model, you need to test its performance.

1. Load your test data.
2. Classify your test data using the classifier you trained before.
3. Compute the accuracy of your classifier on the test data

In [28]:
# First, read all the test data from the files.
# Then classify it using the classifier you trained before
# Finally, calculate the performance
### Student code here ###

# Load test data
print("Loading test data...")
test_paths = get_path('test/[NP]-test[0-9]*.txt')
test_data = load_data(test_paths)
print(f"Loaded {len(test_data)} test documents")

# Transform test data using the same vectorizer
X_test = vectorizer.transform(test_data['Review'])
y_test = test_data['Label']

# Make predictions
y_pred = classifier.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy with Laplace smoothing (k=1): {accuracy:.4f} ({accuracy*100:.2f}%)")

# Show some example predictions
print("\nSample predictions:")
for i in range(5):
    print(f"True: {y_test.iloc[i]}, Predicted: {y_pred[i]}")

Loading test data...
Found  100 files
Loaded 100 test documents
Accuracy with Laplace smoothing (k=1): 0.8400 (84.00%)

Sample predictions:
True: Pos, Predicted: Pos
True: Neg, Predicted: Neg
True: Neg, Predicted: Neg
True: Pos, Predicted: Neg
True: Pos, Predicted: Pos


Now train two more models: one without Laplace smoothing, and one where stopwords are removed. Then test them on the same test data, and compare the performance with the results you previously obtained.

In [29]:
### Student code here ###

print("="*60)
print("COMPARISON EXPERIMENTS")
print("="*60)

# 1. Model without Laplace smoothing (alpha=0)
print("\n1. WITHOUT LAPLACE SMOOTHING (k=0):")
classifier_no_smooth = MultinomialNB(alpha=0.0)
classifier_no_smooth.fit(X_train, y_train)
y_pred_no_smooth = classifier_no_smooth.predict(X_test)
accuracy_no_smooth = accuracy_score(y_test, y_pred_no_smooth)
print(f"Accuracy without Laplace smoothing: {accuracy_no_smooth:.4f} ({accuracy_no_smooth*100:.2f}%)")

# 2. Model with stop words removed
print("\n2. WITH STOP WORDS REMOVED:")
vectorizer_stop = CountVectorizer(stop_words='english')
X_train_stop = vectorizer_stop.fit_transform(train_data['Review'])
X_test_stop = vectorizer_stop.transform(test_data['Review'])

classifier_stop = MultinomialNB(alpha=1.0)
classifier_stop.fit(X_train_stop, y_train)
y_pred_stop = classifier_stop.predict(X_test_stop)
accuracy_stop = accuracy_score(y_test, y_pred_stop)
print(f"Accuracy with stop words removed: {accuracy_stop:.4f} ({accuracy_stop*100:.2f}%)")

# 3. Model without lowercase normalization
print("\n3. WITHOUT LOWERCASE NORMALIZATION:")
vectorizer_no_lower = CountVectorizer(lowercase=False)
X_train_no_lower = vectorizer_no_lower.fit_transform(train_data['Review'])
X_test_no_lower = vectorizer_no_lower.transform(test_data['Review'])

classifier_no_lower = MultinomialNB(alpha=1.0)
classifier_no_lower.fit(X_train_no_lower, y_train)
y_pred_no_lower = classifier_no_lower.predict(X_test_no_lower)
accuracy_no_lower = accuracy_score(y_test, y_pred_no_lower)
print(f"Accuracy without lowercase normalization: {accuracy_no_lower:.4f} ({accuracy_no_lower*100:.2f}%)")

# Summary comparison
print("\n" + "="*60)
print("SUMMARY OF RESULTS:")
print("="*60)
print(f"With Laplace smoothing (k=1):     {accuracy:.4f} ({accuracy*100:.2f}%)")
print(f"Without Laplace smoothing (k=0):  {accuracy_no_smooth:.4f} ({accuracy_no_smooth*100:.2f}%)")
print(f"With stop words removed:          {accuracy_stop:.4f} ({accuracy_stop*100:.2f}%)")
print(f"Without lowercase normalization:  {accuracy_no_lower:.4f} ({accuracy_no_lower*100:.2f}%)")

# Analysis
print("\n" + "="*60)
print("ANALYSIS:")
print("="*60)

print(f"\nLaplace Smoothing Effect:")
if accuracy > accuracy_no_smooth:
    print(f"✓ Laplace smoothing IMPROVED performance by {((accuracy - accuracy_no_smooth)*100):.2f} percentage points")
    print("  This is expected because smoothing helps with unseen words in test data.")
else:
    print(f"✗ Laplace smoothing DECREASED performance by {((accuracy_no_smooth - accuracy)*100):.2f} percentage points")

print(f"\nStop Words Effect:")
if accuracy_stop > accuracy:
    print(f"✓ Removing stop words IMPROVED performance by {((accuracy_stop - accuracy)*100):.2f} percentage points")
    print("  This suggests stop words were adding noise to the classification.")
elif accuracy_stop < accuracy:
    print(f"✗ Removing stop words DECREASED performance by {((accuracy - accuracy_stop)*100):.2f} percentage points")
    print("  This suggests stop words contain useful information for sentiment classification.")
else:
    print("= No significant difference in performance")

print(f"\nLowercase Normalization Effect:")
if accuracy_no_lower > accuracy:
    print(f"✓ Disabling lowercase normalization IMPROVED performance by {((accuracy_no_lower - accuracy)*100):.2f} percentage points")
    print("  This suggests case information is important for sentiment classification.")
elif accuracy_no_lower < accuracy:
    print(f"✗ Disabling lowercase normalization DECREASED performance by {((accuracy - accuracy_no_lower)*100):.2f} percentage points")
    print("  This suggests case normalization helps by reducing feature sparsity.")
else:
    print("= No significant difference in performance")

COMPARISON EXPERIMENTS

1. WITHOUT LAPLACE SMOOTHING (k=0):
Accuracy without Laplace smoothing: 0.5000 (50.00%)

2. WITH STOP WORDS REMOVED:
Accuracy with stop words removed: 0.8500 (85.00%)

3. WITHOUT LOWERCASE NORMALIZATION:


  self.feature_log_prob_ = np.log(smoothed_fc) - np.log(


Accuracy without lowercase normalization: 0.8100 (81.00%)

SUMMARY OF RESULTS:
With Laplace smoothing (k=1):     0.8400 (84.00%)
Without Laplace smoothing (k=0):  0.5000 (50.00%)
With stop words removed:          0.8500 (85.00%)
Without lowercase normalization:  0.8100 (81.00%)

ANALYSIS:

Laplace Smoothing Effect:
✓ Laplace smoothing IMPROVED performance by 34.00 percentage points
  This is expected because smoothing helps with unseen words in test data.

Stop Words Effect:
✓ Removing stop words IMPROVED performance by 1.00 percentage points
  This suggests stop words were adding noise to the classification.

Lowercase Normalization Effect:
✗ Disabling lowercase normalization DECREASED performance by 3.00 percentage points
  This suggests case normalization helps by reducing feature sparsity.


### Part 4 - Text classification with a bigram language model

Now we will classify the same dataset again, but this time with a bigram language model. 

#### Training phase
Build a Naïve Bayes classifier that uses bigrams instead of single words.


In [None]:
# Part 3.1: Bigram model theory calculations

print("PART 3.1: BIGRAM MODEL THEORY")
print("=" * 50)

print("\nQUESTION 1: How to compute P(fi|+) when fi is a bigram?")
print("-" * 55)

print("For a bigram fi = (wi-1, wi), we compute:")
print("P(fi|+) = P(wi|wi-1, +) = C(wi-1, wi, +) / C(wi-1, +)")
print()
print("Where:")
print("- C(wi-1, wi, +) = count of bigram (wi-1, wi) in positive corpus")
print("- C(wi-1, +) = count of word wi-1 in positive corpus")
print()
print("With Laplace smoothing:")
print("P(wi|wi-1, +) = (C(wi-1, wi, +) + k) / (C(wi-1, +) + k*V)")
print("where V is the vocabulary size and k=1 for Laplace smoothing")

print("\nQUESTION 2: Calculate specific bigram probabilities")
print("-" * 50)

# From homework: positive corpus bigrams
positive_bigrams = {
    'a great': 2, 'movie the': 2, 'the plot': 2,
    'a familiar': 1, 'actors are': 1, 'all identify': 1, 'and the': 1,
    'are really': 1, 'are well-thought-out': 1, 'can all': 1,
    'delivering a': 1, 'director manages': 1, 'familiar story': 1,
    'great movie': 1, 'great performance': 1, 'i really': 1,
    'identify with': 1, 'intriguing the': 1, 'is intriguing': 1,
    'like the': 1, 'manages to': 1, 'plot is': 1, 'plot twists': 1,
    'really delivering': 1, 'really like': 1, 'story we': 1,
    'tell a': 1, 'the actors': 1, 'the director': 1, 'the movie': 1,
    'this was': 1, 'to tell': 1, 'twists are': 1, 'was a': 1,
    'we can': 1, 'well-thought-out and': 1
}

# Word counts in positive corpus (from earlier)
positive_word_counts = {
    'the': 5, 'a': 3, 'really': 2, 'plot': 2, 'movie': 2, 'great': 2, 'are': 2,
    'actors': 1, 'all': 1, 'and': 1, 'can': 1, 'delivering': 1, 'director': 1,
    'familiar': 1, 'I': 1, 'identify': 1, 'intriguing': 1, 'is': 1, 'like': 1,
    'manages': 1, 'out': 1, 'performance': 1, 'story': 1, 'tell': 1, 'this': 1,
    'thought': 1, 'to': 1, 'twists': 1, 'was': 1, 'we': 1, 'well': 1, 'with': 1
}

# Calculate P(movie|great) in positive corpus
bigram_count_great_movie = positive_bigrams.get('great movie', 0)
word_count_great = positive_word_counts.get('great', 0)

print(f"Calculate P(movie|great) in positive corpus:")
print(f"  C('great movie', +) = {bigram_count_great_movie}")
print(f"  C('great', +) = {word_count_great}")
print(f"  P(movie|great, +) = {bigram_count_great_movie}/{word_count_great} = {bigram_count_great_movie/word_count_great:.3f}")

# With Laplace smoothing (assuming vocabulary size from earlier)
V_bigram = len(set(word for bigram in positive_bigrams.keys() for word in bigram.split()))
k = 1

prob_movie_given_great_smooth = (bigram_count_great_movie + k) / (word_count_great + k * V_bigram)
print(f"  With Laplace smoothing (V≈{V_bigram}): P(movie|great, +) = ({bigram_count_great_movie}+{k})/({word_count_great}+{k}*{V_bigram}) = {prob_movie_given_great_smooth:.6f}")

# Calculate P(enough|familiar) - "enough" doesn't appear after "familiar" in positive corpus
bigram_count_familiar_enough = 0  # Not in the given bigrams
word_count_familiar = positive_word_counts.get('familiar', 0)

print(f"\nCalculate P(enough|familiar) in positive corpus:")
print(f"  C('familiar enough', +) = {bigram_count_familiar_enough}")
print(f"  C('familiar', +) = {word_count_familiar}")
print(f"  P(enough|familiar, +) = {bigram_count_familiar_enough}/{word_count_familiar} = {bigram_count_familiar_enough/word_count_familiar:.3f}")

prob_enough_given_familiar_smooth = (bigram_count_familiar_enough + k) / (word_count_familiar + k * V_bigram)
print(f"  With Laplace smoothing: P(enough|familiar, +) = ({bigram_count_familiar_enough}+{k})/({word_count_familiar}+{k}*{V_bigram}) = {prob_enough_given_familiar_smooth:.6f}")



PART 3.1: BIGRAM MODEL THEORY

QUESTION 1: How to compute P(fi|+) when fi is a bigram?
-------------------------------------------------------
For a bigram fi = (wi-1, wi), we compute:
P(fi|+) = P(wi|wi-1, +) = C(wi-1, wi, +) / C(wi-1, +)

Where:
- C(wi-1, wi, +) = count of bigram (wi-1, wi) in positive corpus
- C(wi-1, +) = count of word wi-1 in positive corpus

With Laplace smoothing:
P(wi|wi-1, +) = (C(wi-1, wi, +) + k) / (C(wi-1, +) + k*V)
where V is the vocabulary size and k=1 for Laplace smoothing

QUESTION 2: Calculate specific bigram probabilities
--------------------------------------------------
Calculate P(movie|great) in positive corpus:
  C('great movie', +) = 1
  C('great', +) = 2
  P(movie|great, +) = 1/2 = 0.500
  With Laplace smoothing (V≈30): P(movie|great, +) = (1+1)/(2+1*30) = 0.062500

Calculate P(enough|familiar) in positive corpus:
  C('familiar enough', +) = 0
  C('familiar', +) = 1
  P(enough|familiar, +) = 0/1 = 0.000
  With Laplace smoothing: P(enough|familia

In [37]:
### Student code here ###

print("Loading training data...")
train_paths = get_path('train/[NP]-train[0-9]*.txt')
train_data = load_data(train_paths)
print(f"Loaded {len(train_data)} training documents")

# Load test data
print("Loading test data...")
test_paths = get_path('test/[NP]-test[0-9]*.txt')
test_data = load_data(test_paths)
print(f"Loaded {len(test_data)} test documents")

vectorizer_bigram = CountVectorizer(ngram_range=(2,2), lowercase=True)
X_train_bigram = vectorizer_bigram.fit_transform(train_data['Review'])
X_test_bigram = vectorizer_bigram.transform(test_data['Review'])

print(f"Bigram feature matrix shape: {X_train_bigram.shape}")
print(f"Number of unique bigrams: {len(vectorizer_bigram.vocabulary_)}")

# Train bigram model
classifier_bigram = MultinomialNB(alpha=1.0)
classifier_bigram.fit(X_train_bigram, train_data['Label'])

# Test bigram model
y_pred_bigram = classifier_bigram.predict(X_test_bigram)
accuracy_bigram = accuracy_score(test_data['Label'], y_pred_bigram)

print(f"Bigram model accuracy: {accuracy_bigram:.4f} ({accuracy_bigram*100:.2f}%)")


Loading training data...
Found  1200 files
Loaded 1200 training documents
Loading test data...
Found  100 files
Loaded 100 test documents
Bigram feature matrix shape: (1200, 136224)
Number of unique bigrams: 136224
Bigram model accuracy: 0.8900 (89.00%)


#### Testing phase
As before, calculate the performance on your test data, and notice the difference with the previous

In [40]:
### Student code here ###

# COMPARISON: BIGRAM vs UNIGRAM MODELS
print("\n" + "="*60)
print("TESTING PHASE: BIGRAM vs UNIGRAM COMPARISON")
print("="*60)

# Recreate unigram model for comparison (from Part 3)
print("Training unigram model for comparison...")
vectorizer_unigram = CountVectorizer(ngram_range=(1,1), lowercase=True)
X_train_unigram = vectorizer_unigram.fit_transform(train_data['Review'])
X_test_unigram = vectorizer_unigram.transform(test_data['Review'])

classifier_unigram = MultinomialNB(alpha=1.0)
classifier_unigram.fit(X_train_unigram, train_data['Label'])
y_pred_unigram = classifier_unigram.predict(X_test_unigram)
accuracy_unigram = accuracy_score(test_data['Label'], y_pred_unigram)

print(f"Unigram model accuracy: {accuracy_unigram:.4f} ({accuracy_unigram*100:.2f}%)")
print(f"Unigram features: {X_train_unigram.shape[1]} unique words")

# Test bigram model (already trained in previous cell)
print(f"\nBigram model accuracy: {accuracy_bigram:.4f} ({accuracy_bigram*100:.2f}%)")
print(f"Bigram features: {X_train_bigram.shape[1]} unique bigrams")

# Detailed comparison
print("\n" + "="*60)
print("DETAILED COMPARISON")
print("="*60)

difference = accuracy_bigram - accuracy_unigram
if difference > 0:
    print(f"✓ Bigram model OUTPERFORMS unigram by {difference*100:.2f} percentage points")
    improvement = (accuracy_bigram / accuracy_unigram - 1) * 100
    print(f"  Relative improvement: {improvement:.1f}%")
else:
    print(f"✗ Bigram model UNDERPERFORMS unigram by {abs(difference)*100:.2f} percentage points")
    decline = (1 - accuracy_bigram / accuracy_unigram) * 100
    print(f"  Relative decline: {decline:.1f}%")

# Feature sparsity analysis
print(f"\nFeature Analysis:")
print(f"  Unigram vocabulary size: {X_train_unigram.shape[1]}")
print(f"  Bigram vocabulary size:  {X_train_bigram.shape[1]}")
print(f"  Bigram/Unigram ratio:    {X_train_bigram.shape[1]/X_train_unigram.shape[1]:.2f}")

# Show some example predictions where models disagree
print(f"\nExample predictions where models disagree:")
disagreements = 0
for i in range(len(test_data)):
    if y_pred_unigram[i] != y_pred_bigram[i]:
        disagreements += 1
        if disagreements <= 5:  # Show first 5 disagreements
            print(f"\nDocument {i+1}:")
            print(f"  Text: '{test_data['Review'].iloc[i][:100]}...'")
            print(f"  True label: {test_data['Label'].iloc[i]}")
            print(f"  Unigram prediction: {y_pred_unigram[i]}")
            print(f"  Bigram prediction: {y_pred_bigram[i]}")

print(f"\nTotal disagreements: {disagreements} out of {len(test_data)} documents")

# Analysis of why bigrams might perform differently
print("\n" + "="*60)
print("ANALYSIS: WHY BIGRAMS PERFORM DIFFERENTLY")
print("="*60)

if accuracy_bigram > accuracy_unigram:
    print("✓ Bigrams improve performance because:")
    print("  1. They capture word order and context")
    print("  2. They can distinguish sentiment-bearing phrases")
    print("  3. Examples: 'not good' vs 'good not', 'very bad' vs 'bad very'")
    print("  4. They reduce ambiguity in sentiment classification")
else:
    print("✗ Bigrams hurt performance because:")
    print("  1. Data sparsity: many bigrams appear rarely")
    print("  2. Overfitting: model memorizes rare bigram patterns")
    print("  3. Insufficient training data for reliable bigram estimates")
    print("  4. Higher-order n-grams need more data to be effective")

# Show most informative bigrams
print(f"\nMost informative bigrams (top 10):")
feature_names = vectorizer_bigram.get_feature_names_out()
feature_counts = X_train_bigram.sum(axis=0).A1
feature_indices = feature_counts.argsort()[::-1][:10]

for i, idx in enumerate(feature_indices):
    print(f"  {i+1}. '{feature_names[idx]}': {feature_counts[idx]} occurrences")

print("\n" + "="*60)
print("CONCLUSION")
print("="*60)
print(f"Bigram model accuracy: {accuracy_bigram:.4f} ({accuracy_bigram*100:.2f}%)")
print(f"Unigram model accuracy: {accuracy_unigram:.4f} ({accuracy_unigram*100:.2f}%)")

if accuracy_bigram > accuracy_unigram:
    print("✓ Bigrams provide better sentiment classification on this dataset")
    print("  The improvement suggests that word order and context are important")
    print("  for distinguishing sentiment in movie reviews.")
else:
    print("✗ Unigrams perform better than bigrams on this dataset")
    print("  This suggests that the dataset may be too small for effective bigram modeling")
    print("  or that word-level features are sufficient for this classification task.")


TESTING PHASE: BIGRAM vs UNIGRAM COMPARISON
Training unigram model for comparison...
Unigram model accuracy: 0.8400 (84.00%)
Unigram features: 17952 unique words

Bigram model accuracy: 0.8900 (89.00%)
Bigram features: 136224 unique bigrams

DETAILED COMPARISON
✓ Bigram model OUTPERFORMS unigram by 5.00 percentage points
  Relative improvement: 6.0%

Feature Analysis:
  Unigram vocabulary size: 17952
  Bigram vocabulary size:  136224
  Bigram/Unigram ratio:    7.59

Example predictions where models disagree:

Document 4:
  Text: 'I'm torn about this show. While MOST parts of it I found to be HILARIOUS, other parts of it I found ...'
  True label: Pos
  Unigram prediction: Neg
  Bigram prediction: Pos

Document 32:
  Text: 'I remember seeing this film in the theater in 1984 when I was 6 years-old (you do the math). I absol...'
  True label: Pos
  Unigram prediction: Neg
  Bigram prediction: Pos

Document 45:
  Text: 'This latter-day Fulci schlocker is a totally abysmal concoction deali

### Trigrams
When I asked students how to improve the classification performance on this dataset, the first question was always "use trigrams" (or even higher-order n-grams). Let's try how much of an improvement that would be, by training a trigram model and testing it.

In [41]:
### Student code here ###

print("\n" + "="*60)
print("TRIGRAMS AND HIGHER-ORDER N-GRAMS")
print("="*60)

# 1. TRIGRAM MODEL (n=3, pure trigrams only)
print("\n1. PURE TRIGRAM MODEL (n=3)")
print("-" * 40)

vectorizer_trigram = CountVectorizer(ngram_range=(3,3), lowercase=True)
X_train_trigram = vectorizer_trigram.fit_transform(train_data['Review'])
X_test_trigram = vectorizer_trigram.transform(test_data['Review'])

print(f"Trigram feature matrix shape: {X_train_trigram.shape}")
print(f"Number of unique trigrams: {len(vectorizer_trigram.vocabulary_)}")

# Train trigram model
classifier_trigram = MultinomialNB(alpha=1.0)
classifier_trigram.fit(X_train_trigram, train_data['Label'])

# Test trigram model
y_pred_trigram = classifier_trigram.predict(X_test_trigram)
accuracy_trigram = accuracy_score(test_data['Label'], y_pred_trigram)

print(f"Trigram model accuracy: {accuracy_trigram:.4f} ({accuracy_trigram*100:.2f}%)")

# 2. 4-GRAM MODEL (n=4, pure 4-grams only)
print("\n2. PURE 4-GRAM MODEL (n=4)")
print("-" * 40)

vectorizer_4gram = CountVectorizer(ngram_range=(4,4), lowercase=True)
X_train_4gram = vectorizer_4gram.fit_transform(train_data['Review'])
X_test_4gram = vectorizer_4gram.transform(test_data['Review'])

print(f"4-gram feature matrix shape: {X_train_4gram.shape}")
print(f"Number of unique 4-grams: {len(vectorizer_4gram.vocabulary_)}")

# Train 4-gram model
classifier_4gram = MultinomialNB(alpha=1.0)
classifier_4gram.fit(X_train_4gram, train_data['Label'])

# Test 4-gram model
y_pred_4gram = classifier_4gram.predict(X_test_4gram)
accuracy_4gram = accuracy_score(test_data['Label'], y_pred_4gram)

print(f"4-gram model accuracy: {accuracy_4gram:.4f} ({accuracy_4gram*100:.2f}%)")

# COMPARISON WITH PREVIOUS MODELS
print("\n" + "="*60)
print("COMPARISON: ALL N-GRAM MODELS")
print("="*60)

print(f"Unigram model (n=1):     {accuracy_unigram:.4f} ({accuracy_unigram*100:.2f}%)")
print(f"Bigram model (n=2):      {accuracy_bigram:.4f} ({accuracy_bigram*100:.2f}%)")
print(f"Trigram model (n=3):     {accuracy_trigram:.4f} ({accuracy_trigram*100:.2f}%)")
print(f"4-gram model (n=4):      {accuracy_4gram:.4f} ({accuracy_4gram*100:.2f}%)")

# Feature sparsity analysis
print(f"\nFeature Sparsity Analysis:")
print(f"  Unigram features:  {X_train_unigram.shape[1]:,}")
print(f"  Bigram features:   {X_train_bigram.shape[1]:,}")
print(f"  Trigram features:  {X_train_trigram.shape[1]:,}")
print(f"  4-gram features:   {X_train_4gram.shape[1]:,}")

# Calculate sparsity ratios
bigram_ratio = X_train_bigram.shape[1] / X_train_unigram.shape[1]
trigram_ratio = X_train_trigram.shape[1] / X_train_unigram.shape[1]
gram4_ratio = X_train_4gram.shape[1] / X_train_unigram.shape[1]

print(f"\nFeature Growth Ratios:")
print(f"  Bigrams/Unigrams:  {bigram_ratio:.2f}x")
print(f"  Trigrams/Unigrams: {trigram_ratio:.2f}x")
print(f"  4-grams/Unigrams:  {gram4_ratio:.2f}x")

# Data sparsity analysis
print(f"\nData Sparsity Analysis:")
print("Average non-zero features per document:")
print(f"  Unigrams:  {X_train_unigram.nnz / X_train_unigram.shape[0]:.1f}")
print(f"  Bigrams:   {X_train_bigram.nnz / X_train_bigram.shape[0]:.1f}")
print(f"  Trigrams:  {X_train_trigram.nnz / X_train_trigram.shape[0]:.1f}")
print(f"  4-grams:   {X_train_4gram.nnz / X_train_4gram.shape[0]:.1f}")

# Show some example n-grams
print(f"\nSample Features:")
print(f"Sample trigrams: {vectorizer_trigram.get_feature_names_out()[:10]}")
print(f"Sample 4-grams:  {vectorizer_4gram.get_feature_names_out()[:10]}")

# ANALYSIS: WHY HIGHER-ORDER N-GRAMS PERFORM DIFFERENTLY
print("\n" + "="*60)
print("ANALYSIS: WHY N=3 AND N=4 PERFORM DIFFERENTLY")
print("="*60)

# Trigram analysis
print(f"\nTRIGRAM ANALYSIS (n=3):")
if accuracy_trigram > accuracy_bigram:
    diff = (accuracy_trigram - accuracy_bigram) * 100
    print(f"✓ Trigrams IMPROVE performance by {diff:.2f} percentage points")
    print("  Reasons for improvement:")
    print("  1. Capture more context and phrase-level sentiment")
    print("  2. Better at identifying sentiment-bearing phrases")
    print("  3. Examples: 'not very good', 'really quite bad'")
    print("  4. Can distinguish subtle sentiment differences")
else:
    diff = (accuracy_bigram - accuracy_trigram) * 100
    print(f"✗ Trigrams DECREASE performance by {diff:.2f} percentage points")
    print("  Reasons for decline:")
    print("  1. Extreme data sparsity - trigrams are very rare")
    print("  2. Overfitting to rare trigram patterns")
    print("  3. Insufficient training data for reliable estimates")
    print("  4. Most trigrams appear only once in training data")

# 4-gram analysis
print(f"\n4-GRAM ANALYSIS (n=4):")
if accuracy_4gram > accuracy_trigram:
    diff = (accuracy_4gram - accuracy_trigram) * 100
    print(f"✓ 4-grams IMPROVE performance by {diff:.2f} percentage points")
    print("  This is surprising and suggests:")
    print("  1. Very specific phrase patterns are important")
    print("  2. Dataset might have distinctive 4-gram patterns")
    print("  3. Could be overfitting to test set")
else:
    diff = (accuracy_trigram - accuracy_4gram) * 100
    print(f"✗ 4-grams DECREASE performance by {diff:.2f} percentage points")
    print("  Expected behavior because:")
    print("  1. Extreme sparsity - 4-grams are extremely rare")
    print("  2. Most 4-grams appear only once or never")
    print("  3. Severe overfitting to training data")
    print("  4. No generalization capability")

# General pattern analysis
print(f"\nGENERAL PATTERN ANALYSIS:")
accuracies = [accuracy_unigram, accuracy_bigram, accuracy_trigram, accuracy_4gram]
n_values = [1, 2, 3, 4]

best_n = n_values[accuracies.index(max(accuracies))]
worst_n = n_values[accuracies.index(min(accuracies))]

print(f"  Best performing model: n={best_n} ({max(accuracies):.4f})")
print(f"  Worst performing model: n={worst_n} ({min(accuracies):.4f})")

# Check if there's a clear trend
if accuracy_unigram > accuracy_bigram > accuracy_trigram > accuracy_4gram:
    print("  Trend: Performance DECREASES with higher n (classic sparsity problem)")
elif accuracy_unigram < accuracy_bigram < accuracy_trigram < accuracy_4gram:
    print("  Trend: Performance INCREASES with higher n (unusual, might indicate overfitting)")
else:
    print("  Trend: No clear monotonic relationship (mixed results)")

# CONCLUSION
print("\n" + "="*60)
print("CONCLUSION")
print("="*60)

print("Does setting n=3 or n=4 improve accuracy?")
print(f"  n=3 (trigrams): {'YES' if accuracy_trigram > accuracy_bigram else 'NO'}")
print(f"  n=4 (4-grams):  {'YES' if accuracy_4gram > accuracy_trigram else 'NO'}")

print(f"\nWhy/Why not?")
print("1. DATA SPARSITY: Higher-order n-grams become increasingly rare")
print("2. OVERFITTING: Rare n-grams lead to memorization rather than generalization")
print("3. TRAINING DATA SIZE: This dataset may be too small for effective higher-order modeling")
print("4. SWEET SPOT: There's usually an optimal n where context helps but sparsity doesn't hurt")

if max(accuracies) == accuracy_unigram:
    print("\n✓ For this dataset, unigrams work best - word-level features are sufficient")
elif max(accuracies) == accuracy_bigram:
    print("\n✓ For this dataset, bigrams work best - some context helps without too much sparsity")
else:
    print(f"\n✓ For this dataset, n={best_n} works best - this is unusual and worth investigating")

print(f"\nRecommendation: Use n={best_n} for this specific dataset and task.")


TRIGRAMS AND HIGHER-ORDER N-GRAMS

1. PURE TRIGRAM MODEL (n=3)
----------------------------------------
Trigram feature matrix shape: (1200, 230893)
Number of unique trigrams: 230893
Trigram model accuracy: 0.7700 (77.00%)

2. PURE 4-GRAM MODEL (n=4)
----------------------------------------
4-gram feature matrix shape: (1200, 258033)
Number of unique 4-grams: 258033
4-gram model accuracy: 0.6500 (65.00%)

COMPARISON: ALL N-GRAM MODELS
Unigram model (n=1):     0.8400 (84.00%)
Bigram model (n=2):      0.8900 (89.00%)
Trigram model (n=3):     0.7700 (77.00%)
4-gram model (n=4):      0.6500 (65.00%)

Feature Sparsity Analysis:
  Unigram features:  17,952
  Bigram features:   136,224
  Trigram features:  230,893
  4-gram features:   258,033

Feature Growth Ratios:
  Bigrams/Unigrams:  7.59x
  Trigrams/Unigrams: 12.86x
  4-grams/Unigrams:  14.37x

Data Sparsity Analysis:
Average non-zero features per document:
  Unigrams:  138.0
  Bigrams:   211.3
  Trigrams:  220.1
  4-grams:   220.5

Samp

In [None]:
### Student code here ###

# COMPARISON: BIGRAM vs UNIGRAM MODELS
print("\n" + "="*60)
print("TESTING PHASE: BIGRAM vs UNIGRAM COMPARISON")
print("="*60)

# Recreate unigram model for comparison (from Part 3)
print("Training unigram model for comparison...")
vectorizer_unigram = CountVectorizer(ngram_range=(1,1), lowercase=True)
X_train_unigram = vectorizer_unigram.fit_transform(train_data['Review'])
X_test_unigram = vectorizer_unigram.transform(test_data['Review'])

classifier_unigram = MultinomialNB(alpha=1.0)
classifier_unigram.fit(X_train_unigram, train_data['Label'])
y_pred_unigram = classifier_unigram.predict(X_test_unigram)
accuracy_unigram = accuracy_score(test_data['Label'], y_pred_unigram)

print(f"Unigram model accuracy: {accuracy_unigram:.4f} ({accuracy_unigram*100:.2f}%)")
print(f"Unigram features: {X_train_unigram.shape[1]} unique words")

# Test bigram model (already trained in previous cell)
print(f"\nBigram model accuracy: {accuracy_bigram:.4f} ({accuracy_bigram*100:.2f}%)")
print(f"Bigram features: {X_train_bigram.shape[1]} unique bigrams")

# Detailed comparison
print("\n" + "="*60)
print("DETAILED COMPARISON")
print("="*60)

difference = accuracy_bigram - accuracy_unigram
if difference > 0:
    print(f"✓ Bigram model OUTPERFORMS unigram by {difference*100:.2f} percentage points")
    improvement = (accuracy_bigram / accuracy_unigram - 1) * 100
    print(f"  Relative improvement: {improvement:.1f}%")
else:
    print(f"✗ Bigram model UNDERPERFORMS unigram by {abs(difference)*100:.2f} percentage points")
    decline = (1 - accuracy_bigram / accuracy_unigram) * 100
    print(f"  Relative decline: {decline:.1f}%")

# Feature sparsity analysis
print(f"\nFeature Analysis:")
print(f"  Unigram vocabulary size: {X_train_unigram.shape[1]}")
print(f"  Bigram vocabulary size:  {X_train_bigram.shape[1]}")
print(f"  Bigram/Unigram ratio:    {X_train_bigram.shape[1]/X_train_unigram.shape[1]:.2f}")

# Show some example predictions where models disagree
print(f"\nExample predictions where models disagree:")
disagreements = 0
for i in range(len(test_data)):
    if y_pred_unigram[i] != y_pred_bigram[i]:
        disagreements += 1
        if disagreements <= 5:  # Show first 5 disagreements
            print(f"\nDocument {i+1}:")
            print(f"  Text: '{test_data['Review'].iloc[i][:100]}...'")
            print(f"  True label: {test_data['Label'].iloc[i]}")
            print(f"  Unigram prediction: {y_pred_unigram[i]}")
            print(f"  Bigram prediction: {y_pred_bigram[i]}")

print(f"\nTotal disagreements: {disagreements} out of {len(test_data)} documents")

# Analysis of why bigrams might perform differently
print("\n" + "="*60)
print("ANALYSIS: WHY BIGRAMS PERFORM DIFFERENTLY")
print("="*60)

if accuracy_bigram > accuracy_unigram:
    print("✓ Bigrams improve performance because:")
    print("  1. They capture word order and context")
    print("  2. They can distinguish sentiment-bearing phrases")
    print("  3. Examples: 'not good' vs 'good not', 'very bad' vs 'bad very'")
    print("  4. They reduce ambiguity in sentiment classification")
else:
    print("✗ Bigrams hurt performance because:")
    print("  1. Data sparsity: many bigrams appear rarely")
    print("  2. Overfitting: model memorizes rare bigram patterns")
    print("  3. Insufficient training data for reliable bigram estimates")
    print("  4. Higher-order n-grams need more data to be effective")

# Show most informative bigrams
print(f"\nMost informative bigrams (top 10):")
feature_names = vectorizer_bigram.get_feature_names_out()
feature_counts = X_train_bigram.sum(axis=0).A1
feature_indices = feature_counts.argsort()[::-1][:10]

for i, idx in enumerate(feature_indices):
    print(f"  {i+1}. '{feature_names[idx]}': {feature_counts[idx]} occurrences")

print("\n" + "="*60)
print("CONCLUSION")
print("="*60)
print(f"Bigram model accuracy: {accuracy_bigram:.4f} ({accuracy_bigram*100:.2f}%)")
print(f"Unigram model accuracy: {accuracy_unigram:.4f} ({accuracy_unigram*100:.2f}%)")

if accuracy_bigram > accuracy_unigram:
    print("✓ Bigrams provide better sentiment classification on this dataset")
    print("  The improvement suggests that word order and context are important")
    print("  for distinguishing sentiment in movie reviews.")
else:
    print("✗ Unigrams perform better than bigrams on this dataset")
    print("  This suggests that the dataset may be too small for effective bigram modeling")
    print("  or that word-level features are sufficient for this classification task.")


3.2.3 PERFORMANCE IMPROVEMENTS
MOTIVATED SUGGESTIONS FOR IMPROVEMENT:
1. Combined n-grams: Use unigrams + bigrams to capture both word-level and phrase-level patterns
2. Feature selection: Remove rare features that cause overfitting
3. TF-IDF weighting: Reduce impact of common words, emphasize distinctive words
4. Optimized smoothing: Find the right balance between overfitting and underfitting
5. Text preprocessing: Clean data to reduce noise and improve feature quality

IMPROVEMENT 1: COMBINED N-GRAMS
Motivation: Pure bigrams lose unigram information. Combining captures both levels.

1-2 gram model (unigrams + bigrams):
  Features: 154,176 (unigrams + bigrams)
  Sparsity: 0.0023
  Accuracy: 0.8400 (84.00%)

1-3 gram model (unigrams + bigrams + trigrams):
  Features: 385,069 (unigrams + bigrams + trigrams)
  Sparsity: 0.0015
  Accuracy: 0.8700 (87.00%)

IMPROVEMENT 2: FEATURE SELECTION
Motivation: Remove rare features that appear in <2 documents to reduce overfitting.
  Features befor