Homework 4: Sentiment Analysis - Task 2
----

Names 
----
Names: Kaan Tural, Arinjay Singh

Task 2: Train a Naive Bayes Model (30 points)
----

Using `nltk`'s `NaiveBayesClassifier` class, train a Naive Bayes classifier using a Bag of Words as features.
https://www.nltk.org/_modules/nltk/classify/naivebayes.html 

Naive Bayes classifiers apply Bayes’ theorem under the "naive" assumption that features are conditionally independent given the label. They are fast to train and make a good baseline for NLP applications. You can use one as a baseline for your project!
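
To make the Bayes' theorem step concrete, here is a minimal sketch that scores a two-word review by hand. It is a multinomial Naive Bayes with add-one smoothing written from scratch, which is not exactly what NLTK does internally; the toy data and the helper names `train_nb` and `log_posteriors` are hypothetical and exist only for this illustration.

```python
import math
from collections import Counter, defaultdict

# Toy training data (hypothetical, for illustration only): (tokens, label)
toy_train = [
    (["great", "movie"], 1),
    (["loved", "it"], 1),
    (["terrible", "movie"], 0),
    (["hated", "it"], 0),
]


def train_nb(data):
    """Collect document counts per label, word counts per label, and the vocabulary."""
    label_counts = Counter(label for _, label in data)
    word_counts = defaultdict(Counter)
    for tokens, label in data:
        word_counts[label].update(tokens)
    vocab = {w for tokens, _ in data for w in tokens}
    return label_counts, word_counts, vocab


def log_posteriors(tokens, label_counts, word_counts, vocab):
    """Return log P(label) + sum of log P(word | label) for each label (unnormalized)."""
    total_docs = sum(label_counts.values())
    scores = {}
    for label, n_docs in label_counts.items():
        score = math.log(n_docs / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for w in tokens:
            if w not in vocab:
                continue  # skip words never seen in training
            # add-one (Laplace) smoothed likelihood
            score += math.log((word_counts[label][w] + 1) / (total_words + len(vocab)))
        scores[label] = score
    return scores


model = train_nb(toy_train)
scores = log_posteriors(["great", "movie"], *model)
prediction = max(scores, key=scores.get)
print(scores)
print("predicted label:", prediction)  # predicted label: 1
```

The word "movie" appears once under each label, so it contributes equally to both scores; "great" is what tips the decision toward the positive label.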

In [6]:
# our utility functions
# RESTART your jupyter notebook kernel if you make changes to this file
import sentiment_utils as sutils

# nltk for Naive Bayes and metrics
import nltk
import nltk.classify.util
from nltk.metrics.scores import (precision, recall, f_measure, accuracy)
from nltk.classify import NaiveBayesClassifier

# some potentially helpful data structures from collections
from collections import defaultdict, Counter

# so that we can make plots
import matplotlib.pyplot as plt
# if you want to use seaborn to make plots
import seaborn as sns

[nltk_data] Downloading package punkt to /Users/turalk/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [4]:
# define constants for the files we are using
TRAIN_FILE = "movie_reviews_train.txt"
DEV_FILE = "movie_reviews_dev.txt"

In [9]:
# load in your data and make sure you understand the format
# Do not print out too much so as to impede readability of your notebook
train_tups = sutils.generate_tuples_from_file(TRAIN_FILE)
dev_tups = sutils.generate_tuples_from_file(DEV_FILE)

In [11]:
# set up a sentiment classifier using NLTK's NaiveBayesClassifier and 
# a bag of words as features
# take a look at the function in lecture notebook 7 (feel free to copy + paste that function)
# the nltk classifier expects a dictionary of features as input where the key is the feature name
# and the value is the feature value

# need to return a dict to work with the NLTK classifier
# Possible problem for students: evaluate the difference 
# between using binarized features and using counts (non binarized features)

def word_feats(words, binary=True):
    """
    Generate a dictionary of word features for input to NLTK's NaiveBayesClassifier.
    
    Args:
        words (list of str): The words from a document.
        binary (bool): If True, features are binarized (presence/absence); 
                       if it's False, features are based on word counts.
    
    Returns:
        dict: A dictionary where keys are words, and values are either binary or counts.
    """
    if binary:
        return {word: True for word in words}
    # Counter tallies every word in a single pass over the document
    return dict(Counter(words))


# set up & train a sentiment classifier using NLTK's NaiveBayesClassifier and
# classify the first example in the dev set as an example
# make sure your output is well-labeled

def train_classifier(train_data, binary=True):
    """
    Train a Naive Bayes sentiment classifier.
    
    Args:
        train_data (list of (list of str, int)): Training data, each item is a tuple where
                                                 the first element is a list of words from a document,
                                                 and the second element is the label (e.g., 0 or 1 for negative/positive).
        binary (bool): Whether to use binarized features.
    
    Returns:
        NaiveBayesClassifier: The trained NLTK Naive Bayes classifier.
    """
    # transforming the training data into the format expected by NLTK: (features_dict, label)
    train_features = [(word_feats(words, binary), label) for words, label in train_data]
    
    classifier = NaiveBayesClassifier.train(train_features)
    
    return classifier

def classify_example(classifier, example_words, binary=True):
    """
    Classify a new example using the trained classifier.
    
    Args:
        classifier (NaiveBayesClassifier): The trained classifier.
        example_words (list of str): The words from the new example to classify.
        binary (bool): Whether the classifier was trained on binarized features.
    
    Returns:
        The predicted label, in the same form as the training labels (e.g., 0 or 1).
    """
    features = word_feats(example_words, binary)
    return classifier.classify(features)

# test to make sure that you can train the classifier and use it to classify a new example

# generate_tuples_from_file returns (X, y): X is a list of token lists, y is the matching list of labels
train_data = sutils.generate_tuples_from_file(TRAIN_FILE)
dev_data = sutils.generate_tuples_from_file(DEV_FILE)

# pair each token list with its label: [(words, label), ...]
train_tups = list(zip(train_data[0], train_data[1]))
dev_tups = list(zip(dev_data[0], dev_data[1]))

classifier_binary = train_classifier(train_tups, binary=True)

classifier_count = train_classifier(train_tups, binary=False)

test_subset = dev_tups[:10]  # only the first 10 dev examples, to keep the output readable

for example, label in test_subset:
    prediction_binary = classify_example(classifier_binary, example, binary=True)
    prediction_count = classify_example(classifier_count, example, binary=False)
    print(f"Example words: {example}")
    print(f"Actual Label: {label}")
    print(f"Prediction with binarized features: {prediction_binary}")
    print(f"Prediction with count-based features: {prediction_count}\n")
    
# a few hand-written examples; stored under a new name so the dev data loaded above is not overwritten
toy_examples = [
    ['really', 'liked', 'the', 'movie'],
    ['bad', 'movie', 'did', 'not', 'like'],
    ['what', 'a', 'great', 'movie'],
    ['terrible', 'movie']
]

for example in toy_examples:
    prediction_binary = classify_example(classifier_binary, example, binary=True)
    prediction_count = classify_example(classifier_count, example, binary=False)
    print(f"Example words: {example}")
    print(f"Prediction with binarized features: {prediction_binary}")
    print(f"Prediction with count-based features: {prediction_count}\n")

Example words: ['The', 'movie', "'Gung", 'Ho', '!', "'", ':', 'The', 'Story', 'of', 'Carlson', "'s", 'Makin', 'Island', 'Raiders', 'was', 'made', 'in', '1943', 'with', 'a', 'view', 'to', 'go', 'up', 'the', 'moral', 'of', 'American', 'people', 'at', 'the', 'duration', 'of', 'second', 'world', 'war', '.', 'It', 'shows', 'with', 'the', 'better', 'way', 'that', 'the', 'cinema', 'can', 'constitute', 'body', 'of', 'propaganda', '.', 'The', 'value', 'of', 'this', 'film', 'is', 'only', 'collection', 'and', 'no', 'artistic', '.', 'In', 'a', 'film', 'of', 'propaganda', 'it', 'is', 'useless', 'to', 'judge', 'direction', 'and', 'actors', '.', 'Watch', 'that', 'movie', 'if', 'you', 'are', 'interested', 'to', 'learn', 'how', 'propaganda', 'functions', 'in', 'the', 'movies', 'or', 'if', 'you', 'are', 'a', 'big', 'fun', 'of', 'Robert', 'Mitchum', 'who', 'has', 'a', 'small', 'role', 'in', 'the', 'film', '.', 'If', 'you', 'want', 'to', 'see', 'a', 'film', 'for', 'the', 'second', 'world', 'war', ',', 'th
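
The difference between the two feature modes is easier to see on a short token list. The sketch below uses a hypothetical helper, `bow_features`, that mirrors `word_feats` above. Note that NLTK's `NaiveBayesClassifier` treats every feature value as a discrete category, so with count features a word seen twice and a word seen three times are simply two different values, not points on a numeric scale.

```python
from collections import Counter


def bow_features(tokens, binary=True):
    """Hypothetical helper mirroring word_feats: presence flags or raw counts."""
    if binary:
        return {tok: True for tok in tokens}
    return dict(Counter(tokens))


tokens = ["good", "good", "plot", "bad", "acting"]
binary_feats = bow_features(tokens, binary=True)
count_feats = bow_features(tokens, binary=False)

print(binary_feats)  # {'good': True, 'plot': True, 'bad': True, 'acting': True}
print(count_feats)   # {'good': 2, 'plot': 1, 'bad': 1, 'acting': 1}
```

Only the repeated word "good" is represented differently, which is why the two classifiers tend to disagree mainly on documents with heavily repeated words.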

In [None]:
# Using the provided dev set, evaluate your model with precision, recall, and f1 score as well as accuracy
# You may use nltk's implemented `precision`, `recall`, `f_measure`, and `accuracy` functions
# (make sure to look at the documentation for these functions!)
# you will be creating a similar graph for logistic regression and neural nets, so make sure
# you use functions wisely so that you do not have excessive repeated code
# write any helper functions you need in sentiment_utils.py (functions that you'll use in your other notebooks as well)


# create a graph of your classifier's performance on the dev set as a function of the amount of training data
# the x-axis should be the amount of training data (as a percentage of the total training data)
# NOTE : make sure one of your experiments uses 10% of the data, you will need this to answer the first question in task 5
# the y-axis should be the performance of the classifier on the dev set
# the graph should have 4 lines, one for each of precision, recall, f1, and accuracy
# the graph should have a legend, title, and axis labels

# NOTE: assumes create_index and featurize are defined in sentiment_utils.py;
# only `sutils` is imported in this notebook, so the bare names would raise a NameError
vocab = sutils.create_index([word for words, _ in train_tups for word in words])

train_feats_binary = [dict(zip(vocab, sutils.featurize(vocab, words, binary=True))) for words, _ in train_tups]
dev_feats_binary = [dict(zip(vocab, sutils.featurize(vocab, words, binary=True))) for words, _ in dev_tups]

train_feats_count = [dict(zip(vocab, sutils.featurize(vocab, words, binary=False))) for words, _ in train_tups]
dev_feats_count = [dict(zip(vocab, sutils.featurize(vocab, words, binary=False))) for words, _ in dev_tups]
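
One possible shape for the evaluation helper the comments above ask for is sketched below. It assumes binary labels with 1 as the positive class; the function name `evaluate` and the toy data are hypothetical. The `nltk.metrics.scores` functions `precision`, `recall`, and `f_measure` compare sets, so the sketch passes sets of example indices, while `accuracy` compares two parallel lists.

```python
from nltk.classify import NaiveBayesClassifier
from nltk.metrics.scores import accuracy, f_measure, precision, recall


def evaluate(classifier, labeled_feats, positive_label=1):
    """Score a classifier on a list of (features_dict, label) pairs."""
    gold = [label for _, label in labeled_feats]
    preds = [classifier.classify(feats) for feats, _ in labeled_feats]
    # indices of the examples that are (gold) or were predicted (preds) positive
    gold_pos = {i for i, y in enumerate(gold) if y == positive_label}
    pred_pos = {i for i, y in enumerate(preds) if y == positive_label}
    return {
        # nltk returns None when a score is undefined (e.g. no positive predictions)
        "precision": precision(gold_pos, pred_pos),
        "recall": recall(gold_pos, pred_pos),
        "f1": f_measure(gold_pos, pred_pos),
        "accuracy": accuracy(gold, preds),
    }


# Toy data (hypothetical) in the (features_dict, label) format NLTK expects
toy = [
    ({"great": True, "movie": True}, 1),
    ({"loved": True, "it": True}, 1),
    ({"terrible": True, "movie": True}, 0),
    ({"hated": True, "it": True}, 0),
]
toy_classifier = NaiveBayesClassifier.train(toy)

# scored on its own training data here, only to show the call
metrics = evaluate(toy_classifier, toy)
for name, value in metrics.items():
    print(f"{name}: {value}")
```

A function like this could live in `sentiment_utils.py` and be reused for the logistic regression and neural net notebooks by swapping in any object with a `classify` method.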


Test your model using both a __binarized__ and a __multinomial__ BoW. Use whichever one gives you a better final f1 score on the dev set to produce your graphs.

- f1 score binarized: __YOUR ANSWER HERE__
- f1 score multinomial: __YOUR ANSWER HERE__
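
Once the better feature type is chosen, the graph of dev performance against training-set size could be produced along the following lines. This is a sketch under several assumptions: the names `learning_curve` and `plot_learning_curve` are hypothetical, the training data is shuffled with a fixed seed before slicing (the assignment may intend a plain prefix instead), and only accuracy is traced; precision, recall, and f1 would be added as further lines in the same way.

```python
import random

from nltk.classify import NaiveBayesClassifier
from nltk.classify.util import accuracy


def learning_curve(train_feats, dev_feats, percents=(10, 20, 40, 60, 80, 100), seed=0):
    """Train on growing slices of the training data; return [(percent, dev accuracy), ...]."""
    shuffled = list(train_feats)
    random.Random(seed).shuffle(shuffled)  # fixed seed so the curve is reproducible
    results = []
    for pct in percents:
        n = max(1, len(shuffled) * pct // 100)  # always keep at least one example
        clf = NaiveBayesClassifier.train(shuffled[:n])
        results.append((pct, accuracy(clf, dev_feats)))
    return results


def plot_learning_curve(results, title="Naive Bayes: dev accuracy vs. training data"):
    """Draw the curve with a legend, title, and axis labels; returns the Axes."""
    import matplotlib.pyplot as plt  # imported here so the metric code runs without matplotlib

    xs, ys = zip(*results)
    fig, ax = plt.subplots()
    ax.plot(xs, ys, marker="o", label="accuracy")
    ax.set_xlabel("Training data used (%)")
    ax.set_ylabel("Score on dev set")
    ax.set_title(title)
    ax.legend()
    return ax


# Toy data (hypothetical) in the (features_dict, label) format NLTK expects
toy = [
    ({"great": True, "movie": True}, 1),
    ({"loved": True, "it": True}, 1),
    ({"terrible": True, "movie": True}, 0),
    ({"hated": True, "it": True}, 0),
]
curve = learning_curve(toy, toy, percents=(50, 100))
print(curve)
# plot_learning_curve(curve)  # uncomment in a notebook to draw the figure
```

Including 10 in `percents` (the default above) covers the 10% experiment that the first question in task 5 relies on.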