# Sentiment Analyzer Using Sentiment Lexicons

From Chapter 7 of Getting Started with Natural Language Processing (2022, Kochmar)

## 0. Understanding your task

Suppose you are planning an evening out with some friends, and you'd like to go to a cinema. Your friends' preferences seem to have divided between a superhero movie and an action movie. Both start around the same time, and you like both genres. To choose which group of friends to join at the cinema, you decide to check what those who have already seen these movies think about them. You visit a movie review website and find out that there are hundreds of reviews about both movies. Reading through all these reviews would not be feasible, so you decide to apply a sentiment analyzer to see how many positive and negative opinions there are about each of these movies and then make up your mind. How can you implement such a sentiment analyzer?

## 1. Approach

Take a collection of positive and negative reviews. Set up a machine-learning pipeline, as you did for the applications in the previous chapters. This pipeline should rely on the dataset of reviews previously determined to be positive and negative. You should split this set into training and test data, define the set of features to learn the sentiment from, train a classifier of your choice on the training data with the selected set of features, and evaluate it on the test set.
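
As a rough sketch of the data-splitting step in this pipeline, the snippet below shuffles a toy set of labeled reviews and divides it into training and test portions. The toy reviews, the helper name `split_data`, the 0.8 ratio, and the fixed seed are illustrative assumptions, not taken from the chapter.

```python
import random

def split_data(labeled_reviews, train_ratio=0.8, seed=42):
    """Shuffle (text, label) pairs and split them into train and test lists."""
    shuffled = list(labeled_reviews)        # copy so the input is not modified
    random.Random(seed).shuffle(shuffled)   # fixed seed for reproducibility
    cutoff = int(len(shuffled) * train_ratio)
    return shuffled[:cutoff], shuffled[cutoff:]

# toy stand-ins for the real reviews (hypothetical data)
toy_reviews = [(f"positive review {i}", "pos") for i in range(5)] + \
              [(f"negative review {i}", "neg") for i in range(5)]

train_set, test_set = split_data(toy_reviews)
print(len(train_set), len(test_set))
```

Shuffling before the split matters when, as here, the data is ordered by label; otherwise the test portion could end up containing only one class.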

## 2. Data loading and preprocessing

Data: polarity dataset v2.0 from http://mng.bz/Wxnd

In [1]:
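import os, codecs

# helper function to read in data
def read_in(folder):
    files = os.listdir(folder)
    a_dict = {}
    for a_file in sorted(files):
        # skip hidden files (names starting with ".")
        if not a_file.startswith("."):
            # the with-block closes the file, so no explicit close() is needed
            with codecs.open(os.path.join(folder, a_file),
                             encoding = 'ISO-8859-1',
                             errors = 'ignore') as f:
                # use the file name without its extension as the review id
                file_id = a_file.split(".")[0].strip()
                a_dict[file_id] = f.read()
    return a_dict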

In [2]:
# specify folder
folder = "Data/txt_sentoken/"

# read positive reviews into a dict, print len and first entry
pos_dict = read_in(folder + "pos/")

# check
print(f"Number of positive reviews: {len(pos_dict)}")
print("First positive review:")
print(pos_dict.get(next(iter(pos_dict))))

print()

# same for negative reviews
neg_dict = read_in(folder + "neg/")

# check
print(f"Number of negative reviews: {len(neg_dict)}")
print("First negative review:")
print(neg_dict.get(next(iter(neg_dict))))

Number of positive reviews: 1000
First positive review:
films adapted from comic books have had plenty of success , whether they're about superheroes ( batman , superman , spawn ) , or geared toward kids ( casper ) or the arthouse crowd ( ghost world ) , but there's never really been a comic book like from hell before . 
for starters , it was created by alan moore ( and eddie campbell ) , who brought the medium to a whole new level in the mid '80s with a 12-part series called the watchmen . 
to say moore and campbell thoroughly researched the subject of jack the ripper would be like saying michael jackson is starting to look a little odd . 
the book ( or " graphic novel , " if you will ) is over 500 pages long and includes nearly 30 more that consist of nothing but footnotes . 
in other words , don't dismiss this film because of its source . 
if you can get past the whole comic book thing , you might find another stumbling block in from hell's directors , albert and allen hughes . 
get

Reviews are now stored in two Python dictionaries (pos_dict and neg_dict), each holding 1000 items. Each item has a unique identifier as its key and the text of the review as its value.

Each review is a single string in which every sentence starts on a new line.
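
To make this layout concrete, here is a small sketch using a made-up dictionary of the same shape. The id `cv000_00000` and the review text are hypothetical, not taken from the dataset.

```python
# hypothetical stand-in with the same shape as pos_dict / neg_dict
toy_dict = {
    "cv000_00000": "a fine film . \nthe cast is great . \n",
}

# take the first key, as done with next(iter(...)) in the cell above
first_id = next(iter(toy_dict))
review = toy_dict[first_id]

# one sentence per line; drop empty lines left by the trailing newline
sentences = [line.strip() for line in review.split("\n") if line.strip()]
print(first_id, sentences)
```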

## 3. Standard quantitative checks

The texts are already tokenized, so to extract words they only need to be split on whitespace.

Calculate:

   1. Average length of a review (in words)
   2. Average sentence length
   3. Vocabulary size (number of distinct words/word types)
   4. Lexical diversity (computed here as tokens per type, i.e. the inverse of the type-token ratio; measures how often, on average, each word occurs; if every word were unique, the measure would equal 1)
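
The four measures can be worked through by hand on a tiny example. The sketch below uses two made-up reviews; unlike the `statistics` function in this notebook, it skips empty lines when counting sentences, which is a simplifying choice for the illustration.

```python
# two made-up, pre-tokenized reviews (one sentence per line)
toy_reviews = ["good film .\ngreat cast .", "bad film ."]

# all tokens and all non-empty sentences across the toy corpus
tokens = [tok for review in toy_reviews for tok in review.split()]
sentences = [s for review in toy_reviews
             for s in review.split("\n") if s.strip()]

avg_review_len = len(tokens) / len(toy_reviews)                    # 9 / 2
avg_sent_len = sum(len(s.split()) for s in sentences) / len(sentences)  # 9 / 3
vocab_size = len(set(tokens))                                      # 6 types
tokens_per_type = len(tokens) / vocab_size                         # 9 / 6

print(avg_review_len, avg_sent_len, vocab_size, tokens_per_type)
# → 4.5 3.0 6 1.5
```

Here "film" and "." each occur more than once, which is why the last measure is above 1.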

In [3]:
# helper function to split reviews on whitespace
def tokenize(text):
    
    # replace new lines with spaces (str.replace returns a new string,
    # so the result has to be assigned back)
    text = text.replace("\n", " ")
    
    return text.split()
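
A quick check of how the underlying string methods behave may help here. Called without arguments, `str.split()` splits on any run of whitespace, newlines included, and `str.replace()` returns a new string rather than changing the original. The sample string is made up.

```python
sample = "a fine film . \nthe cast is great . \n"

# split() with no arguments treats spaces, tabs and newlines alike
tokens = sample.split()
print(tokens)

# str.replace returns a new string; the original is left untouched
replaced = sample.replace("\n", " ")
print("\n" in sample, "\n" in replaced)
```

Since `split()` already handles newlines, the replacement step does not change the resulting token list.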

In [7]:
def statistics(a_dict):
    
    # initialize variables
    length = 0
    sent_length = 0
    num_sents = 0
    vocab = set()
    
    # iterate over values in dict
    for review in a_dict.values():
        
        # tokenize once and add number of words in review to length
        words = tokenize(review)
        length += len(words)
        
        # to get number of sentences, split review on new lines
        # (if a review ends with a newline, the final empty string
        # is counted as a sentence too)
        sents = review.split("\n")
        num_sents += len(sents)
        
        # count sentence length
        for sent in sents:
            sent_length += len(tokenize(sent))
            
        # add words to vocab (a set keeps one instance of each)
        vocab.update(words)
    
    # average length of a review (in words)
    avg_length = length / len(a_dict)
    
    # average length of sentence
    avg_sent_length = sent_length / num_sents
    
    # vocab size
    vocab_size = len(vocab)
    
    # diversity (tokens per type)
    diversity = length / vocab_size
    
    return avg_length, avg_sent_length, vocab_size, diversity

In [8]:
# print statistics in tabular format

categories = ["Positive", "Negative"]
rows = []
rows.append(["Category", "Avg_Len(Review)", "Avg_Len(Sent)", "Vocabulary Size", "Diversity"])

stats = {}
stats["Positive"] = statistics(pos_dict)
stats["Negative"] = statistics(neg_dict)

for cat in categories:
    # vocabulary size is an integer count, so format it without decimals
    rows.append([cat, 
                 f"{stats.get(cat)[0]:.6f}",
                 f"{stats.get(cat)[1]:.6f}",
                 f"{stats.get(cat)[2]:d}",
                 f"{stats.get(cat)[3]:.6f}"])

columns = zip(*rows)
column_widths = [max(len(item) for item in col) for col in columns]

for row in rows:
    print(''.join(' {:{width}} '.format(row[i], 
                                        width = column_widths[i]) 
                                        for i in range(0, len(row))))

 Category  Avg_Len(Review)  Avg_Len(Sent)  Vocabulary Size  Diversity 
 Positive  787.051000       23.191531      36805            21.384350 
 Negative  705.630000       21.524266      34542            20.428174 


These statistics show that positive reviews are, on average, longer than negative reviews (in number of words) and have longer sentences. They also use a larger vocabulary. Perhaps partly because of this greater length, each distinct word occurs about 21 times on average in the positive reviews, compared with about 20 times in the negative reviews.
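
Because the tokens-per-type measure tends to grow with text length, one way to compare two collections on a more equal footing is to compute it over the same number of tokens from each. The sketch below is only an illustration: the helper `tokens_per_type_at` and the toy token lists are assumptions, not part of the chapter's code.

```python
def tokens_per_type_at(tokens, n):
    """Average occurrences per distinct word over the first n tokens."""
    sample = tokens[:n]
    return len(sample) / len(set(sample))

# toy token lists of different lengths (made-up data)
longer = "the film is good and the cast is good and the plot is good".split()
shorter = "the film is bad and the cast is bad".split()

# on the full texts, the longer list looks more repetitive
full_longer = tokens_per_type_at(longer, len(longer))
full_shorter = tokens_per_type_at(shorter, len(shorter))

# on equal-sized samples, the difference disappears in this toy case
n = min(len(longer), len(shorter))
equal_longer = tokens_per_type_at(longer, n)
equal_shorter = tokens_per_type_at(shorter, n)

print(full_longer, full_shorter, equal_longer, equal_shorter)
# → 2.0 1.5 1.5 1.5
```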

Are there differences between the two sets of vocabulary used for positive and negative reviews?

In [11]:
def vocab_difference(list1, list2):
    
    # initialize vocabulary variable for each dict
    vocab1 = []
    vocab2 = []
    
    # create vocabulary of tokenized reviews for each dict
    for rev in list1:
        vocab1 += tokenize(rev)
        
    for rev in list2:
        vocab2 += tokenize(rev)

    # return a sorted list of the words (one instance of each)
    # that occur in list1 but not in list2
    return sorted(set(vocab1) - set(vocab2))

In [12]:
# collect the review texts of each dict into a list
pos_wordlist = list(pos_dict.values())
neg_wordlist = list(neg_dict.values())

In [19]:
# check size of unique vocabularies
print("Number of unique words in pos_dict only: "
      f"{len(vocab_difference(pos_wordlist, neg_wordlist))}")

print("Number of unique words in neg_dict only: "
      f"{len(vocab_difference(neg_wordlist, pos_wordlist))}")

Number of unique words in pos_dict only: 16378
Number of unique words in neg_dict only: 14115
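
These counts come from a set difference. As a small illustration on made-up word lists, the sketch below shows the words exclusive to each side and, as an extra that the notebook does not compute, the words shared by both.

```python
# made-up token lists standing in for the two vocabularies
pos_words = "a moving and memorable film".split()
neg_words = "a dull and forgettable film".split()

pos_vocab, neg_vocab = set(pos_words), set(neg_words)

only_pos = sorted(pos_vocab - neg_vocab)   # in the positive list only
only_neg = sorted(neg_vocab - pos_vocab)   # in the negative list only
shared = sorted(pos_vocab & neg_vocab)     # in both lists

print(only_pos, only_neg, shared)
# → ['memorable', 'moving'] ['dull', 'forgettable'] ['a', 'and', 'film']
```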


In [20]:
# check unique words

print("Words unique to pos_dict:")
print(vocab_difference(pos_wordlist, neg_wordlist)[1000:1100])

print()

print("Words unique to neg_dict:")
print(vocab_difference(neg_wordlist, pos_wordlist)[1000:1100])

Words unique to pos_dict:
['agreeably', 'agreements', 'agricultural', 'aguilar', 'aguirresarobe', 'ah-ha', 'ahab', 'ahmad', "ahmad's", 'aides', 'aids-afflicted', 'aids-cautionary', 'ailment', "ain't-i-sexy", 'airborne', 'airbrushed', 'airplane-type', 'airplay', 'airs', 'airtime', 'airwaves', 'aisling', 'aix', 'ajay', 'akiko', "al'", 'alabama', 'alacrity', 'aladdin', "alain's", 'alakina', 'alamo', 'alanis', 'alar', 'albania', 'albanian', 'albany', 'albarn', 'alberta', 'albums', 'alc', 'alchemy', 'alcohol-based', "alda's", 'alderaan', 'aldous', 'alea', 'alejandro', 'alek', 'alexa', "alexander's", 'alexandra', 'alexandria', 'alexi-malle', 'alferd', 'alfredo', 'algar', 'algeria', "ali's", 'aliases', 'alibi', 'alibis', 'alida', "alien's", 'alien-like', 'alien3', 'alienating', 'aliens-as-human', 'aligned', 'aline', 'all--except', 'all-ages', 'all-america', 'all-but-dead', 'all-but-illegible', 'all-consuming', 'all-day', 'all-grown-up', 'all-in-all', 'all-male', 'all-or-nothing', 'all-powerfu