In this experiment, you will explore the accuracy of sentiment classification using different feature representations of text documents.

First, you will implement `createBasicFeatures`, which creates a sparse matrix representation of a collection of documents. For this exercise, you should have a feature for each word containing at least one alphabetic character. You may use the `numpy` and `sklearn` packages to help with implementing a sparse matrix.
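
As a small illustration of the tokenization rule above, the sketch below builds a sparse count matrix that keeps only tokens containing at least one alphabetic character. The toy documents and the regular expression are assumptions made for this example and are not part of the starter code; `token_pattern` is a standard `CountVectorizer` parameter.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy documents, used only for illustration.
toy_docs = ["R2D2 was great in 1977", "14 dull scenes, 2 great ones"]

# A run of word characters that contains at least one letter.
# [^\W\d_] matches a word character that is neither a digit nor "_".
alpha_pattern = r"(?u)\b\w*[^\W\d_]\w*\b"

toy_vectorizer = CountVectorizer(token_pattern=alpha_pattern)
toy_matrix = toy_vectorizer.fit_transform(toy_docs)  # scipy sparse matrix

# Recover the vocabulary in column order from the fitted word-to-column mapping.
toy_vocab = sorted(toy_vectorizer.vocabulary_, key=toy_vectorizer.vocabulary_.get)

print(toy_vocab)             # purely numeric tokens such as "1977" are absent
print(toy_matrix.toarray())  # dense view, only sensible for tiny examples
```

Note that the default `CountVectorizer` pattern keeps purely numeric tokens of two or more characters, so a custom pattern along these lines is one way to meet the requirement.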

Then, you will implement `createFancyFeatures`, which can add any other features you choose to help improve performance on the classification task.

The two code blocks at the end train and evaluate two models—logistic regression with L1 and L2 regularization—using your featurization functions. Besides held-out classification accuracy with 10-fold cross-validation, you will also see the features in each class given high weights by the model.
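
To make the difference between the two penalties concrete, here is a minimal sketch on synthetic data. The data, the names `X_toy` and `y_toy`, and the choice `C=0.1` are assumptions for illustration only; the evaluation code in this notebook uses the default `C` on the review corpus. The sketch counts how many weights each penalty drives to exactly zero.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_validate

# Synthetic data (an assumption): 50 features, of which only the first 2 matter.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 50))
y_toy = np.where(X_toy[:, 0] + X_toy[:, 1] > 0, "pos", "neg")

zero_counts = {}
for penalty in ("l1", "l2"):
    model = LogisticRegression(penalty=penalty, solver="liblinear", C=0.1)
    folds = KFold(n_splits=10, shuffle=True, random_state=1)
    scores = cross_validate(model, X_toy, y_toy, cv=folds)["test_score"]

    # Refit on all data to inspect a single set of coefficients.
    model.fit(X_toy, y_toy)
    zero_counts[penalty] = int(np.sum(model.coef_ == 0))
    print(penalty, "mean accuracy: %.3f" % scores.mean(),
          "zero weights:", zero_counts[penalty])
```

L1 regularization tends to set the weights of uninformative features to exactly zero, while L2 shrinks them without eliminating them, which is one reason the two models can rank different terms as most informative.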

A helpful resource for getting up to speed with vector representations of documents is the first two chapters of Delip Rao and Brian McMahan, _Natural Language Processing with PyTorch_, O'Reilly, 2019.  You should be able to <a href="https://learning.oreilly.com/library/view/natural-language-processing/9781491978221/">read it online</a> via the Northeastern Library's subscription using a <tt>northeastern.edu</tt> email address.

In [117]:
import json
import requests
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate,LeaveOneOut,KFold
import numpy as np

In [118]:
# read in the movie review corpus
def readReviews():
  raw = requests.get("https://raw.githubusercontent.com/mutherr/CS6120-PS1-data/master/cornell_reviews.json").text.strip()
  corpus = [json.loads(line) for line in raw.split("\n")]

  return corpus

This is where you will implement two functions to featurize the data.

In [135]:
#NB: The current contents are for testing only
#This function should return: 
#  -a sparse numpy matrix of document features
#  -a list of the correct class for each document
#  -a list of the vocabulary used by the features, such that the ith term of the
#    list is the word whose counts appear in the ith column of the matrix. 

# This function should create a feature representation using all tokens that
# contain an alphabetic character.
from sklearn.feature_extraction.text import CountVectorizer
cvectorizer = CountVectorizer()

def createBasicFeatures(corpus):
    #Your code here
    classes = []
    vocab = []
    all_text = []
    for i in corpus:
        classes.append(i['class'])
        all_text.append(i['text'])
    texts = cvectorizer.fit_transform(all_text)
    # get_feature_names() was removed in scikit-learn 1.2; use the *_out variant
    vocab = cvectorizer.get_feature_names_out().tolist()

    return texts, classes, vocab

# This function can add other features you want that help classification
# accuracy, such as bigrams, word prefixes and suffixes, etc.

nvectorizer = CountVectorizer(ngram_range=(1, 2))

def createFancyFeatures(corpus):
    #Your code here
    classes = []
    vocab = []
    all_text = []
    for i in corpus:
        classes.append(i.get('class'))
        text = i.get('text')
        all_text.append(text)  
        
    texts = nvectorizer.fit_transform(all_text) 
    # get_feature_names() was removed in scikit-learn 1.2; use the *_out variant
    vocab = nvectorizer.get_feature_names_out().tolist()

    return texts,classes,vocab

import nltk
import re
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
from collections import defaultdict
from sklearn.feature_extraction import DictVectorizer
dictvectorizer = DictVectorizer()

def token_freqs(doc):
    freq = defaultdict(int)
    for tok in doc:
        freq[tok] += 1
    return freq

def createFancyFeatures2(corpus):
    classes = []
    all_text = []
    for i in corpus:
        classes.append(i.get('class'))
        text = i.get('text')
        words = nltk.word_tokenize(text)
        tagged = nltk.pos_tag(words)       
        temp = token_freqs(tagged)
        all_text.append(temp)  
    texts = dictvectorizer.fit_transform(all_text) 
    # feature_names_ keeps the (word, tag) tuples in column order
    vocab = list(dictvectorizer.feature_names_)
    return texts,classes,vocab   

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


In [8]:
#given a numpy matrix representation of the features for the training set, the 
# vector of true classes for each example, and the vocabulary as described 
# above, this computes the accuracy of the model using 10-fold cross 
# validation and reports the most indicative features for each class

def evaluateModel(X,y,vocab,penalty="l1"):
  #create and fit the model
  model = LogisticRegression(penalty=penalty,solver="liblinear")
  results = cross_validate(model,X,y,cv=KFold(n_splits=10, shuffle=True, random_state=1))
  
  #determine the average accuracy
  scores = results["test_score"]
  avg_score = sum(scores)/len(scores)
  
  #determine the most informative features
  # this requires us to fit the model to everything, because we need a
  # single model to draw coefficients from, rather than one per fold
  model.fit(X,y)
  class0_weight_sorted = model.coef_[0, :].argsort()
  class1_weight_sorted = (-model.coef_[0, :]).argsort()

  termsToTake = 20
  class0_indicators = [vocab[i] for i in class0_weight_sorted[:termsToTake]]
  class1_indicators = [vocab[i] for i in class1_weight_sorted[:termsToTake]]

  if model.classes_[0] == "pos":
    return avg_score,class0_indicators,class1_indicators
  else:
    return avg_score,class1_indicators,class0_indicators

def runEvaluation(X,y,vocab):
  print("----------L1 Norm-----------")
  avg_score,pos_indicators,neg_indicators = evaluateModel(X,y,vocab,"l1")
  print("The model's average accuracy is %f"%avg_score)
  print("The most informative terms for pos are: %s"%pos_indicators)
  print("The most informative terms for neg are: %s"%neg_indicators)
  #this call will fit a model with L2 regularization
  print("----------L2 Norm-----------")
  avg_score,pos_indicators,neg_indicators = evaluateModel(X,y,vocab,"l2")
  print("The model's average accuracy is %f"%avg_score)
  print("The most informative terms for pos are: %s"%pos_indicators)
  print("The most informative terms for neg are: %s"%neg_indicators)

In [158]:
corpus = readReviews()

Run the following to train and evaluate two models using basic features:

In [159]:
X,y,vocab = createBasicFeatures(corpus)
runEvaluation(X, y, vocab)

----------L1 Norm-----------
The model's average accuracy is 0.828500
The most informative terms for pos are: ['flaws', 'memorable', 'terrific', 'edge', 'masterpiece', 'excellent', 'perfectly', 'sherri', 'gas', 'using', 'enjoyable', 'overall', 'fun', 'follows', 'different', 'quite', 'solid', 'allows', 'fantastic', 'side']
The most informative terms for neg are: ['waste', 'mess', 'lame', 'worst', 'headed', 'ridiculous', 'unfortunately', 'awful', 'cheap', 'write', 'boring', 'tedious', 'superior', 'jesse', 'terrible', 'bad', 'poor', 'iii', 'designed', 'looks']
----------L2 Norm-----------
The model's average accuracy is 0.833000
The most informative terms for pos are: ['fun', 'great', 'back', 'quite', 'well', 'seen', 'excellent', 'perfectly', 'memorable', 'overall', 'job', 'yet', 'american', 'terrific', 'pulp', 'bit', 'true', 'performances', 'husband', 'masterpiece']
The most informative terms for neg are: ['bad', 'unfortunately', 'worst', 'waste', 'nothing', 'script', 'awful', 'boring', 

Run the following to train and evaluate two models using extended features:

In [121]:
X,y,vocab = createFancyFeatures(corpus)
runEvaluation(X, y, vocab)

----------L1 Norm-----------
The model's average accuracy is 0.839000
The most informative terms for pos are: ['even if', 'flaws', 'masterpiece', 'memorable', 'her husband', 'terrific', 'due to', 'gas', 'follows', 'overall', 'as much', 'perfectly', 'loved', 'great', 'fun', 'the two', 'the true', 'fantastic', 'works', 'the more']
The most informative terms for neg are: ['waste', 'ridiculous', 'mess', 'headed', 'lame', 'worst', 'unfortunately', 'awful', 'cheap', 'write', 'should have', 'poor', 'boring', 'bad', 'metro', 'terrible', 'designed', 'flat', 'jesse', 'iii']
----------L2 Norm-----------
The model's average accuracy is 0.853500
The most informative terms for pos are: ['great', 'fun', 'well', 'seen', 'back', 'very', 'quite', 'also', 'people', 'life', 'many', 'yet', 'job', 'american', 'see', 'while', 'the two', 'excellent', 'perfectly', 'most']
The most informative terms for neg are: ['bad', 'only', 'unfortunately', 'worst', 'nothing', 'plot', 'any', 'boring', 'waste', 'script', 'po

In [122]:
X,y,vocab = createFancyFeatures2(corpus)
runEvaluation(X, y, vocab)

----------L1 Norm-----------
The model's average accuracy is 0.815000
The most informative terms for pos are: [('terrific', 'JJ'), ('fantastic', 'JJ'), ('memorable', 'JJ'), ('excellent', 'JJ'), ('sometimes', 'RB'), ('remember', 'VBP'), ('entertaining', 'JJ'), ('fun', 'NN'), ('follows', 'VBZ'), ('sherri', 'NN'), ('amidala', 'NN'), ('portrayed', 'VBN'), ('loved', 'VBD'), ('perfectly', 'RB'), ('deserves', 'VBZ'), ('epic', 'NN'), ('pulp', 'NN'), ('7', 'CD'), ('class', 'NN'), ('seen', 'VBN')]
The most informative terms for neg are: [('awful', 'JJ'), ('mess', 'NN'), ('worst', 'JJS'), ('ridiculous', 'JJ'), ('waste', 'NN'), ('lame', 'JJ'), ('boring', 'JJ'), ('terrible', 'JJ'), ('unfortunately', 'RB'), ('jesse', 'NN'), ('dull', 'JJ'), ('forgot', 'VBD'), ('bad', 'JJ'), ('designed', 'VBN'), ('flat', 'JJ'), ('poor', 'JJ'), ('failure', 'NN'), ('work', 'VB'), ('jakob', 'NN'), ('only', 'JJ')]
----------L2 Norm-----------
The model's average accuracy is 0.844000
The most informative terms for pos are:

**Basic Features**

A bag-of-words representation is used to create the sparse matrix. The model predicts the class of the reviews well, with about 83% accuracy under both L1 (0.8285) and L2 (0.8330) regularization.

**Fancy Features**

Unigram and bigram counts (`ngram_range=(1, 2)`) were used to create the sparse matrix. The model takes much longer to run because of the higher-dimensional feature space. We see a minor improvement in accuracy (0.8390 with L1 and 0.8535 with L2). Including stopwords did not improve the model.
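
The growth in dimensionality can be seen even on a tiny example. The sketch below compares vocabulary sizes with and without bigrams; the three toy documents are hypothetical and chosen only to keep the counts easy to check by hand.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy documents, used only for illustration.
toy_docs = ["not good at all", "good fun", "not fun at all"]

unigram_vec = CountVectorizer().fit(toy_docs)
bigram_vec = CountVectorizer(ngram_range=(1, 2)).fit(toy_docs)

n_unigram = len(unigram_vec.vocabulary_)  # distinct single words
n_bigram = len(bigram_vec.vocabulary_)    # single words plus adjacent pairs

print("unigram features:", n_unigram)
print("unigram + bigram features:", n_bigram)
print(sorted(term for term in bigram_vec.vocabulary_ if " " in term))
```

Bigrams such as "not good" let the model weight a negated phrase separately from its component words, at the cost of a larger feature space.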

Part-of-speech tagging was used to create `createFancyFeatures2`. Since both positive and negative reviews can contain the same word, the part of speech can help differentiate between the features. Compared with the basic features, accuracy improves with L2 regularization (0.8440 vs. 0.8330) but drops with L1 (0.8150 vs. 0.8285), and the model takes much longer to run.
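
A short sketch of why tagging separates features: the same word with two different tags becomes two distinct keys. The tokens below are tagged by hand and are hypothetical; in the notebook the tags come from `nltk.pos_tag`.

```python
from collections import Counter

# Hand-tagged toy tokens (an assumption for illustration).
tagged_doc = [
    ("the", "DT"), ("plot", "NN"), ("does", "VBZ"), ("not", "RB"), ("work", "VB"),
    ("but", "CC"), ("the", "DT"), ("camera", "NN"), ("work", "NN"), ("is", "VBZ"),
    ("terrific", "JJ"),
]

# Plain bag of words: both uses of "work" share one feature.
word_counts = Counter(word for word, _tag in tagged_doc)

# (word, tag) pairs: the verb and the noun are counted separately.
tagged_counts = Counter(tagged_doc)

print(word_counts["work"])
print(tagged_counts[("work", "VB")], tagged_counts[("work", "NN")])
```

Splitting counts this way also makes each individual feature rarer, which may be one reason the tagged features do not help uniformly.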


**Analysis of incorrect predictions**

We can see below that the model is wrong in cases where even a human reader could guess the class incorrectly. Overall accuracy is close to 85%. The text of a misclassified review, shown below, does not clearly read as positive or negative, which explains the model's ambiguity.


In [165]:
# Analysis of the incorrect predictions 

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.01)

model = LogisticRegression(penalty="l1",solver="liblinear")
results = cross_validate(model,X_train,y_train,cv=KFold(n_splits=10, shuffle=True, random_state=1))

model.fit(X_train,y_train)
prediction = model.predict(X_test)

In [166]:
print("Prediction Classes", prediction.tolist())
print("True Classes", y_test)

Prediction Classes ['pos', 'pos', 'neg', 'pos', 'pos', 'neg', 'neg', 'pos', 'pos', 'pos', 'neg', 'pos', 'pos', 'neg', 'neg', 'neg', 'neg', 'neg', 'pos', 'neg']
True Classes ['neg', 'pos', 'pos', 'pos', 'neg', 'neg', 'neg', 'pos', 'neg', 'pos', 'neg', 'neg', 'pos', 'neg', 'neg', 'neg', 'neg', 'neg', 'pos', 'neg']


In [167]:
incorrect = []
for index, key in enumerate(prediction):
  if key != y_test[index]:
    incorrect.append(index)
print(incorrect, len(incorrect)*100/len(prediction), "% predictions are incorrect")

[0, 2, 4, 8, 11] 25.0 % predictions are incorrect
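
The errors can also be broken down by type. The sketch below tallies the predicted and true labels copied from the output above into a small confusion count; the variable names are hypothetical, and with only 20 held-out reviews the breakdown is indicative at best.

```python
from collections import Counter

# Labels copied from the printed output above.
predicted = ['pos', 'pos', 'neg', 'pos', 'pos', 'neg', 'neg', 'pos', 'pos', 'pos',
             'neg', 'pos', 'pos', 'neg', 'neg', 'neg', 'neg', 'neg', 'pos', 'neg']
actual = ['neg', 'pos', 'pos', 'pos', 'neg', 'neg', 'neg', 'pos', 'neg', 'pos',
          'neg', 'neg', 'pos', 'neg', 'neg', 'neg', 'neg', 'neg', 'pos', 'neg']

# Count each (true class, predicted class) pair.
confusion = Counter(zip(actual, predicted))
wrong_idx = [i for i, (t, p) in enumerate(zip(actual, predicted)) if t != p]

for (true_label, pred_label), count in sorted(confusion.items()):
    print("true=%s predicted=%s: %d" % (true_label, pred_label, count))
print("misclassified indices:", wrong_idx)
```

In this particular split, most of the mistakes are negative reviews predicted as positive.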


In [168]:
X_test_orig = cvectorizer.inverse_transform(X_test)
" ".join(X_test_orig[2])

'14 about achieved addition adequate age all almost also although amidala an anakin and anticipated anything are arguably asexual babe bad be because believe better between biggest birth bit blockbuster broods but by can caprio celebrate cells century change characters christian christmas claims clan command concentrations concept conception considered dark darth day defines department desirable despite di do doubt down effects emissary encounter end entire every ewan exceeds excellence except experience fabric fact faith far fascinating find for force from george gets girlfriend gon guy hair has he her hermits hero high him his hot how humerous if immaculate in individual inferior interest interesting is it jedi jinn kenobi kill knight knights knows lack lacks leadership led leonardo let liam life live loners love lucas make makes mature mcgregor mean means metoclorian micro milestone missed mistake mixed more most mostly movie much must natalie neeson next no not obi of on one only o