<a href="https://colab.research.google.com/github/dasmiq/cs6120-assignment2/blob/main/reviews.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Predicting Sentiment Labels

In this experiment, you will explore the accuracy of sentiment classification using different feature representations of text documents.

First, you will implement `createBasicFeatures`, which creates a sparse matrix representation of a collection of documents.  As we discussed in class, many classification models represent a document as a sparse vector of features and their weights. Weights could be boolean, counts, tf-idf values, or other functions of the document. In a linear model, we then combine the document vectors with the candidate output classes and learn weights for each of the feature-class pairs. This last step is implemented for you.
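To make the weighting options concrete, here is a minimal sketch that computes boolean, count, and tf-idf weights for one document of a toy corpus. The corpus, the helper name `weight_document`, and the particular tf-idf variant (raw count times `log(N / df)`) are illustrative assumptions, not part of the assignment; other tf-idf definitions are equally common.

```python
import math
from collections import Counter

# Toy tokenized corpus (made up for illustration).
docs = [
    ["good", "good", "movie"],
    ["bad", "movie"],
    ["good", "plot"],
]

n_docs = len(docs)
# Document frequency: the number of documents that contain each term.
doc_freq = Counter(tok for doc in docs for tok in set(doc))

def weight_document(doc, scheme):
    """Return a sparse {term: weight} mapping for one tokenized document."""
    counts = Counter(doc)
    if scheme == "boolean":
        return {term: 1 for term in counts}
    if scheme == "count":
        return dict(counts)
    if scheme == "tfidf":
        # Assumed variant: raw term count times log(N / document frequency).
        return {term: c * math.log(n_docs / doc_freq[term])
                for term, c in counts.items()}
    raise ValueError(f"unknown scheme: {scheme}")

for scheme in ("boolean", "count", "tfidf"):
    print(scheme, weight_document(docs[0], scheme))
```

Only terms that occur in the document get an entry, which is what makes the representation sparse.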

For this exercise, you should have a feature for each word containing at least one alphabetic character. You may use the `numpy` and `sklearn` packages to help with implementing a sparse matrix.
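As a small illustration of the "at least one alphabetic character" rule, the sketch below splits a made-up string on whitespace and keeps only tokens that contain a letter. The helper name `alphabetic_tokens` and the restriction to ASCII letters are assumptions; a different definition of "alphabetic" (for example, one that admits accented letters) would need a different pattern.

```python
import re

def alphabetic_tokens(text):
    """Split on whitespace, lowercase, and keep tokens with at least one ASCII letter."""
    return [tok.lower() for tok in re.findall(r"\S+", text)
            if re.search(r"[a-zA-Z]", tok)]

# Made-up sample in the pre-tokenized style of the review corpus.
sample = "It's a 10 / 10 film , 2nd viewing : GREAT !"
print(alphabetic_tokens(sample))
# → ["it's", 'a', 'film', '2nd', 'viewing', 'great']
```

Pure punctuation and pure numbers are dropped, while mixed tokens such as `2nd` and `it's` are kept.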

Then, you will implement `createFancyFeatures`, which can add any other features you choose to help improve performance on the classification task.

The two code blocks at the end train and evaluate two models—logistic regression with L1 and L2 regularization—using your featurization functions. Besides held-out classification accuracy with 10-fold cross-validation, you will also see the features in each class given high weights by the model.
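For intuition about what 10-fold cross-validation does with the data, here is a rough standard-library sketch of shuffled k-fold splitting. The function name `k_fold_indices` is hypothetical, and the shuffle will not reproduce the exact folds that scikit-learn's `KFold` produces; it only shows the idea that every example is held out exactly once.

```python
import random

def k_fold_indices(n_examples, n_splits=10, seed=1):
    """Yield (train_indices, test_indices) pairs for shuffled k-fold cross-validation."""
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)
    # Spread the remainder so the first n_examples % n_splits folds get one extra item.
    base, extra = divmod(n_examples, n_splits)
    start = 0
    for fold in range(n_splits):
        size = base + (1 if fold < extra else 0)
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

folds = list(k_fold_indices(25, n_splits=10))
print([len(test) for _, test in folds])
# → [3, 3, 3, 3, 3, 2, 2, 2, 2, 2]
```

The reported accuracy is then the mean of the per-fold test accuracies, each computed by a model that never saw its test fold during training.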

There are many helpful resources online for getting up to speed with vector representations of documents. One example is the first two chapters of Delip Rao and Brian McMahan, _Natural Language Processing with PyTorch_, O'Reilly, 2019.  You should be able to <a href="https://learning.oreilly.com/library/view/natural-language-processing/9781491978221/">read it online</a> via the Northeastern Library's subscription using a <tt>northeastern.edu</tt> email address.

In [1]:
import json
import requests
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate,LeaveOneOut,KFold
import numpy as np

In [2]:
# read in the movie review corpus
def readReviews():
  raw = requests.get("https://raw.githubusercontent.com/dasmiq/cs6120-assignment2/refs/heads/main/cornell_reviews.json").text.strip()
  corpus = [json.loads(line) for line in raw.split("\n")]

  return corpus

This is where you will implement two functions to featurize the data.

In [20]:
# TODO: Implement createBasicFeatures
# NB: The current contents are for testing only
# This function should return:
#  -a sparse matrix of document features (here a scipy.sparse csr_matrix)
#  -a list of the correct class for each document
#  -a list of the vocabulary used by the features, such that the ith term of the
#    list is the word whose counts appear in the ith column of the matrix.

# This function should create a feature representation using all tokens that
# contain an alphabetic character.
import re
from collections import Counter
from scipy.sparse import csr_matrix

def createBasicFeatures(corpus):
    # Extract class labels
    classes = [doc['class'] for doc in corpus]
    
    # Extract all tokens containing alphabetic characters and build vocabulary
    vocab_counter = Counter()
    all_tokens = []
    
    for doc in corpus:
        # Extract tokens containing alphabetic characters
        tokens = [token.lower() for token in re.findall(r'\S+', doc['text']) 
                  if re.search(r'[a-zA-Z]', token)]
        all_tokens.append(tokens)
        vocab_counter.update(tokens)
    
    # Create vocabulary (sorted for consistency)
    vocab = sorted(vocab_counter.keys())
    vocab_to_idx = {word: idx for idx, word in enumerate(vocab)}
    
    # Build sparse matrix
    rows, cols, data = [], [], []
    
    for doc_idx, tokens in enumerate(all_tokens):
        token_counts = Counter(tokens)
        for token, count in token_counts.items():
            rows.append(doc_idx)
            cols.append(vocab_to_idx[token])
            data.append(count)
    
    # Create sparse matrix
    texts = csr_matrix((data, (rows, cols)), 
                       shape=(len(corpus), len(vocab)))
    
    return texts, classes, vocab


In [None]:
# TODO: Implement createFancyFeatures and describe in comments what the
# features are and why they might be helpful.
# This function can add other features you want that help classification
# accuracy, such as bigrams, word prefixes and suffixes, etc.
def createFancyFeatures(corpus):

    # Extract class labels
    classes = [doc['class'] for doc in corpus]
    
    # Extract tokens and build vocabulary with unigrams + bigrams
    vocab_counter = Counter()
    all_tokens = []
    
    for doc in corpus:
        # Extract tokens containing alphabetic characters
        tokens = [token.lower() for token in re.findall(r'\S+', doc['text']) 
                  if re.search(r'[a-zA-Z]', token)]
        
        # Add unigrams
        features = tokens.copy()
        
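        # Add bigrams: adjacent word pairs capture local context such as
        # negation ("not_good") that single-word counts cannot express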
        for i in range(len(tokens) - 1):
            features.append(f"{tokens[i]}_{tokens[i+1]}")
        
        all_tokens.append(features)
        vocab_counter.update(features)
    
    # Create vocabulary (sorted for consistency)
    vocab = sorted(vocab_counter.keys())
    vocab_to_idx = {word: idx for idx, word in enumerate(vocab)}
    
    # Build sparse matrix
    rows, cols, data = [], [], []
    
    for doc_idx, tokens in enumerate(all_tokens):
        token_counts = Counter(tokens)
        for token, count in token_counts.items():
            rows.append(doc_idx)
            cols.append(vocab_to_idx[token])
            data.append(count)
    
    # Create sparse matrix
    texts = csr_matrix((data, (rows, cols)), 
                       shape=(len(corpus), len(vocab)))
    
    return texts, classes, vocab

In [11]:
# given a sparse matrix representation of the features for the training set,
# the vector of true classes for each example, and the vocabulary as described
# above, this computes the accuracy of the model using 10-fold cross validation
# and reports the most indicative features for each class

def evaluateModel(X,y,vocab,penalty="l1"):
  # create and fit the model
  model = LogisticRegression(penalty=penalty,solver="liblinear")
  results = cross_validate(model,X,y,cv=KFold(n_splits=10, shuffle=True, random_state=1))

  # determine the average accuracy
  scores = results["test_score"]
  avg_score = sum(scores)/len(scores)

  # determine the most informative features
  # this requires us to fit the model to everything, because we need a
  # single model to draw coefficients from, rather than one per fold
  model.fit(X,y)
  class0_weight_sorted = model.coef_[0, :].argsort()
  class1_weight_sorted = (-model.coef_[0, :]).argsort()

  termsToTake = 20
  class0_indicators = [vocab[i] for i in class0_weight_sorted[:termsToTake]]
  class1_indicators = [vocab[i] for i in class1_weight_sorted[:termsToTake]]

  if model.classes_[0] == "pos":
    return avg_score,class0_indicators,class1_indicators
  else:
    return avg_score,class1_indicators,class0_indicators

def runEvaluation(X,y,vocab):
  print("----------L1 Norm-----------")
  avg_score,pos_indicators,neg_indicators = evaluateModel(X,y,vocab,"l1")
  print("The model's average accuracy is %f"%avg_score)
  print("The most informative terms for pos are: %s"%pos_indicators)
  print("The most informative terms for neg are: %s"%neg_indicators)
  # this call will fit a model with L2 regularization
  print("----------L2 Norm-----------")
  avg_score,pos_indicators,neg_indicators = evaluateModel(X,y,vocab,"l2")
  print("The model's average accuracy is %f"%avg_score)
  print("The most informative terms for pos are: %s"%pos_indicators)
  print("The most informative terms for neg are: %s"%neg_indicators)

In [3]:
corpus = readReviews()

In [None]:
# Check the structure of the first document
print(corpus[0])
print(corpus[0].keys())

{'class': 'neg', 'text': 'plot : two teen couples go to a church party , drink and then drive . they get into an accident . one of the guys dies , but his girlfriend continues to see him in her life , and has nightmares . what\'s the deal ? watch the movie and " sorta " find out . . . critique : a mind-fuck movie for the teen generation that touches on a very cool idea , but presents it in a very bad package . which is what makes this review an even harder one to write , since i generally applaud films which attempt to break the mold , mess with your head and such ( lost highway & memento ) , but there are good and bad ways of making all types of films , and these folks just didn\'t snag this one correctly . they seem to have taken this pretty neat concept , but executed it terribly . so what are the problems with the movie ? well , its main problem is that it\'s simply too jumbled . it starts off " normal " but then downshifts into this " fantasy " world in which you , as an audience 

Run the following to train and evaluate two models using basic features:

In [None]:
X,y,vocab = createBasicFeatures(corpus)
runEvaluation(X, y, vocab)

----------L1 Norm-----------
The model's average accuracy is 0.816000
The most informative terms for pos are: ['flaws', 'excellent', 'terrific', 'memorable', 'edge', 'fantastic', 'using', 'command', 'perfectly', 'follows', 'color', 'sherri', 'allows', 'fun', 'overall', 'jackie', 'enjoyable', 'gas', 'easily', 'masterpiece']
The most informative terms for neg are: ['tedious', 'waste', 'mess', 'awful', 'ridiculous', 'lame', 'worst', 'unfortunately', 'boring', 'cheap', 'superior', 'write', 'terrible', 'bad', 'nothing', 'annoying', 'neither', 'looks', 'headed', 'flat']
----------L2 Norm-----------
The model's average accuracy is 0.833500
The most informative terms for pos are: ['fun', 'back', 'great', 'excellent', 'yet', 'overall', 'quite', 'seen', 'well', 'perfectly', 'jackie', 'terrific', 'american', 'memorable', 'job', 'pulp', 'true', 'performances', 'bit', 'follows']
The most informative terms for neg are: ['bad', 'unfortunately', 'worst', 'nothing', 'waste', 'boring', 'only', 'script',

Run the following to train and evaluate two models using extended features:

In [None]:
X,y,vocab = createFancyFeatures(corpus)
runEvaluation(X, y, vocab)

----------L1 Norm-----------
The model's average accuracy is 0.835000
The most informative terms for pos are: ['even_if', 'flaws', 'terrific', 'on_screen', 'follows', 'excellent', 'due_to', 'overall', 'masterpiece', 'memorable', 'the_more', 'view', 'her_husband', 'by_a', 'perfectly', 'others', 'seen', "he's_a", 'using', 'gas']
The most informative terms for neg are: ['ridiculous', 'awful', 'unfortunately', 'worst', 'mess', 'boring', 'waste', 'cheap', 'should_have', 'lame', 'bad', 'tedious', 'terrible', 'write', 'flat', 'heston', 'annoying', 'poor', 'designed', 'neither']
----------L2 Norm-----------
The model's average accuracy is 0.853000
The most informative terms for pos are: ['great', 'well', 'fun', 'seen', 'back', 'very', 'also', 'quite', 'yet', 'excellent', 'many', 'life', 'while', 'job', 'perfectly', 'jackie', 'people', 'you', 'most', 'american']
The most informative terms for neg are: ['bad', 'only', 'unfortunately', 'worst', 'nothing', 'boring', 'any', 'plot', 'waste', 'script

**TODO**: Briefly comment on your results. You do _not_ need to beat the basic features model to get full credit.

The Fancy Features model performs better than the Basic Features model. With L1 regularization, accuracy improved from 81.6% to 83.5%; with L2 regularization, it improved from 83.35% to 85.3%. This suggests that adding bigram features helps the model classify movie reviews more accurately.

Bigrams capture word pairs that single words cannot express well. For example, "should_have" is among the most heavily weighted negative terms under L1, plausibly because reviewers use it when criticizing a film, while "even_if" and "on_screen" are among the top positive terms. These word pairs give the model more context about the sentiment being expressed.
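A tiny sketch of why word order matters: the two made-up token lists below have identical unigram counts, so a unigram-only model cannot tell them apart, but their bigram features differ. The helper name `unigrams_and_bigrams` is hypothetical and only mirrors the underscore-joined bigram scheme used in `createFancyFeatures`.

```python
from collections import Counter

def unigrams_and_bigrams(tokens):
    """Return the unigram features plus adjacent-pair features joined with '_'."""
    bigrams = [f"{a}_{b}" for a, b in zip(tokens, tokens[1:])]
    return tokens + bigrams

# Two made-up phrases with opposite sentiment but the same bag of words.
a = ["good", "not", "bad"]
b = ["bad", "not", "good"]

print(Counter(a) == Counter(b))  # → True: unigram counts are identical
print(Counter(unigrams_and_bigrams(a)) == Counter(unigrams_and_bigrams(b)))  # → False
print(unigrams_and_bigrams(a)[3:], unigrams_and_bigrams(b)[3:])
# → ['good_not', 'not_bad'] ['bad_not', 'not_good']
```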

L2 regularization outperforms L1 with both feature sets, reaching the highest accuracy of 85.3% when combined with the bigram features.