# Sentiment Analysis (unsupervised, lexicon-based)

With the main preprocessing steps in place, we can start with higher-level text analysis. In this tutorial, we perform sentiment analysis using so-called sentiment lexicons. A sentiment lexicon is a set of words that have been labeled as "positive" or "negative", or with a numerical value reflecting the sentiment of the word. For example, the word "happy" is intuitively associated with a positive sentiment.

This approach is called unsupervised since it does not rely on labeled input data, i.e., the text for which we want to calculate the sentiment does not need to be labeled.


## Import important packages

In [1]:
import pandas as pd
import random

# The next imports are only needed for the preprocessing
from nltk.tokenize import TweetTokenizer
from nltk.stem.porter import PorterStemmer
from nltk.stem import WordNetLemmatizer
from utils.nlputil import preprocess_text

from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report

tweet_tokenizer = TweetTokenizer()
porter_stemmer = PorterStemmer()
wordnet_lemmatizer = WordNetLemmatizer()

## Load and prepare data

We need to load and prepare two types of data:

* the sentiment lexicon

* the input data (here: tweets)

### Load and process sentiment lexicon

We use `pandas` to load the publicly available VADER sentiment lexicon. The advantage of this lexicon compared to others is that it also contains a wide range of non-word tokens such as emoticons. This is useful when dealing with social media data.

In [2]:
df_sentiment = pd.read_csv('data/sentiment-lexicon/sentilex-vader.txt', sep='\t', encoding = "ISO-8859-1", header=None)
df_sentiment.head()

| | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| 0 | $: | -1.5 | 0.80623 | [-1, -1, -1, -1, -3, -1, -3, -1, -2, -1] |
| 1 | %) | -0.4 | 1.0198 | [-1, 0, -1, 0, 0, -2, -1, 2, -1, 0] |
| 2 | %-) | -1.5 | 1.43178 | [-2, 0, -2, -2, -1, 2, -2, -3, -2, -3] |
| 3 | &-: | -0.4 | 1.42829 | [-3, -1, 0, 0, -1, -1, -1, 2, -1, 2] |
| 4 | &: | -0.7 | 0.64031 | [0, -1, -1, -1, 1, -1, -1, -1, -1, -1] |


While the lexicon contains the 10 individual labels from each annotator, we only need the average value in Column 1. While not crucial here, we also normalize the sentiment scores from [-4, ..., 4] to [-1, ..., 1]. This is usually good practice when combining different sentiment lexicons, since many of them use a different scoring scheme.
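To make the rescaling concrete, here is a minimal sketch of a linear normalization that maps a score from an arbitrary range onto [-1, 1]. The function name `rescale_score` and the second lexicon's range of [1, 9] are assumptions made purely for illustration; they are not part of the VADER lexicon or of this notebook.

```python
def rescale_score(score, old_min, old_max):
    """Linearly map a score from [old_min, old_max] to [-1, 1]."""
    if old_max <= old_min:
        raise ValueError("old_max must be greater than old_min")
    return 2.0 * (score - old_min) / (old_max - old_min) - 1.0

# VADER-style range [-4, 4]: this is equivalent to dividing by 4
print(rescale_score(-1.5, -4, 4))  # -0.375

# Hypothetical lexicon scored from 1 (most negative) to 9 (most positive)
print(rescale_score(5, 1, 9))      # 0.0
```

For a range that is symmetric around 0, such as [-4, 4], this reduces to the simple division used in the next cell.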

In [3]:
sentiment_dict = {}

for index, row in df_sentiment.iterrows():
    token, score = row[0], row[1]
    sentiment_dict[token] = score / 4.0 # normalize score from [-4,...,4] to [-1,...,1]

# Print a random sample of 5 lexicon entries
print (random.sample( list(sentiment_dict.items()), 5 ))

[('beaten', -0.45), ('fubar', -0.75), ('unhappier', -0.6), ('brooding', 0.025), ('wellsite', 0.125)]


Let's illustrate the approach with 2 example sentences, one positive and one negative.

In [4]:
pos = "I was very happy with the service."
neg = "I wasn't very happy with the service."

documents = [pos, neg]

As usual, we first need to preprocess our input documents. Most sentiment lexicons only contain words in the base form, e.g., "happy", but not derived forms, e.g., "happier". Note also that not only adjectives are associated with a sentiment, but also nouns such as "love", "hate", "mistake", "luck", etc., as well as verbs such as "celebrate", "suffer", "enjoy", etc.

Note that we do NOT remove stopwords, since "not" and "n't" are considered stopwords and removing them would clearly alter the meaning of a document. You can try removing the stopwords and see the effects.
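The following sketch illustrates why stopword removal is harmful here. The stopword set `toy_stopwords` and the helper `remove_stopwords` are hypothetical and only serve this example; real stopword lists are much longer, and `preprocess_text` may handle stopwords differently.

```python
# Hypothetical mini stopword list, for illustration only
toy_stopwords = {"i", "be", "very", "with", "the", "not", "n't"}

def remove_stopwords(text, stopwords):
    """Drop all whitespace-separated tokens that appear in the stopword set."""
    return " ".join(tok for tok in text.split() if tok not in stopwords)

pos_doc = "i be very happy with the service ."
neg_doc = "i be n't very happy with the service ."

print(remove_stopwords(pos_doc, toy_stopwords))  # happy service .
print(remove_stopwords(neg_doc, toy_stopwords))  # happy service .
```

After removal, the positive and the negative sentence become identical, so no lexicon-based method could tell them apart anymore.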

In [5]:
processed_documents = [''] * len(documents)

for idx, doc in enumerate(documents):
    #processed_documents[idx] = preprocess_text(doc, remove_stopwords=False, remove_punctuation=False)
    #processed_documents[idx] = preprocess_text(doc, stemmer=porter_stemmer, remove_stopwords=False, remove_punctuation=False)
    processed_documents[idx] = preprocess_text(doc, lemmatizer=wordnet_lemmatizer, remove_stopwords=False, remove_punctuation=False)
    
print (processed_documents)

['i be very happy with the service .', "i be n't very happy with the service ."]


Let's go through both documents and check whether each word is in the sentiment lexicon. If so, we add the sentiment score of the word to the overall score `review_score`.

In [8]:
for doc in processed_documents:
    review_score = 0.0
    for token in doc.split(): # Here split() is sufficient
        if token in sentiment_dict:
            review_score = review_score + sentiment_dict[token]
    print (review_score)

0.675
0.675


As you can see, both documents got the same score although they clearly express opposite sentiments. That's because we didn't properly handle the negation. "not" or "n't" themselves are not associated with any sentiment score, but indicate that the scores of the following words need to be flipped (change of sign).

To accomplish this, we first need to know which words (beyond "not" and "n't") can flip the polarities of words.

In [7]:
df_negation_words = pd.read_csv('data/word-lists/english-negation-words-lowercase.txt', sep='\t', encoding = "ISO-8859-1", header=None)

negation_words = df_negation_words[0].tolist()

print(negation_words)

['neither', 'never', 'no', 'nor', 'nothing', 'nowhere', 'noone', 'none', 'not', 'havent', "haven't", 'hasnt', "hasn't", 'hadnt', "hadn't", 'cant', "can't", 'cannot', 'couldnt', "couldn't", 'shouldnt', "shouldn't", 'wont', "won't", 'wouldnt', "wouldn't", 'dont', "don't", 'doesnt', "doesn't", 'didnt', "didn't", 'isnt', "isn't", 'arent', "aren't", 'aint', "ain't", 'wasnt', "wasn't", 'werent', "weren't", "n't"]


We can now improve the scoring method. Determining how far the effect of a negation word extends within a document is actually a challenging task. In the following, we adopt a common and simple heuristic: we flip the sentiment scores within a fixed window after a negation word. With `negation_scope = 2` as set in the code below, this window covers the negation word itself and the 2 tokens that follow it.

In [8]:
for doc in processed_documents:
    review_score = 0.0
    negation_scope = -1
    
    for token in doc.split(): # Here split() is sufficient
        if token in negation_words:
            negation_scope = 2
        if token in sentiment_dict:
            token_score = sentiment_dict[token]
            if negation_scope >= 0: # If we are still in the negation scope, flip the sentiment score
                token_score *= (-1)
            review_score += token_score # Accumulate the (possibly flipped) score
        negation_scope -= 1 # Reduce the negation scope by 1
    print (review_score)

0.675
-0.675


Now we handle the negation in a better way. Be aware, however, that this approach is far from foolproof. For example, imagine the 2nd sentence is *"I wasn't really very very happy with the service."*. Here, the word "happy" would be outside the scope of the negation.
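This limitation can be reproduced with a small variant of the scoring loop in which the size of the negation window is a parameter. This is only a sketch: the names `score_with_negation`, `toy_lexicon`, and `toy_negations` are assumptions for illustration, and unlike the cell above, this variant skips the negation word itself when scoring.

```python
def score_with_negation(tokens, lexicon, negation_words, window=2):
    """Sum lexicon scores, flipping the sign of tokens that fall within
    `window` positions after a negation word."""
    score = 0.0
    remaining = 0  # number of upcoming tokens that are still negated
    for token in tokens:
        if token in negation_words:
            remaining = window  # (re)open the negation window
            continue
        if token in lexicon:
            token_score = lexicon[token]
            if remaining > 0:
                token_score = -token_score
            score += token_score
        if remaining > 0:
            remaining -= 1
    return score

toy_lexicon = {"happy": 0.675}
toy_negations = {"n't", "not"}

short_doc = "i be n't very happy with the service .".split()
long_doc = "i be n't really very very happy with the service .".split()

print(score_with_negation(short_doc, toy_lexicon, toy_negations, window=2))  # -0.675
print(score_with_negation(long_doc, toy_lexicon, toy_negations, window=2))   # 0.675
print(score_with_negation(long_doc, toy_lexicon, toy_negations, window=4))   # -0.675
```

A larger window catches the longer sentence, but it also increases the risk of flipping words that are not actually negated, so the window size is a trade-off rather than a fix.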

### Load and process Twitter data

We use a publicly available dataset of several hundred tweets. All tweets are labeled with a polarity:

* 0 = negative
* 2 = neutral
* 4 = positive

Note that we need these labels only to evaluate the performance and not for any training.

In [9]:
df_tweets = pd.read_csv('data/twitter-sentiment/twitter-sentiment-bowden-training.csv')
df_tweets.head()

| | tweet | senti |
|---|---|---|
| 0 | @united UA5396 can wait for me. I'm on the gro... | 0 |
| 1 | I hate Time Warner! Soooo wish I had Vios. Can... | 0 |
| 2 | Tom Shanahan's latest column on SDSU and its N... | 2 |
| 3 | Found the self driving car!! /IWo3QSvdu2 | 2 |
| 4 | @united arrived in YYZ to take our flight to T... | 0 |


We store the polarities and tweets in 2 separate lists for further processing. The list `polarities` contains the true polarity of each tweet, which we can use to evaluate our approach.

In [10]:
polarities = df_tweets['senti'].tolist() 
tweets = df_tweets['tweet'].tolist() 

Again, we need to properly preprocess all tweets:

In [None]:
processed_tweets = [''] * len(tweets)

for idx, doc in enumerate(tweets):
    #processed_tweets[idx] = preprocess_text(doc, remove_stopwords=False, remove_punctuation=False)
    #processed_tweets[idx] = preprocess_text(doc, stemmer=porter_stemmer, remove_stopwords=False, remove_punctuation=False)
    processed_tweets[idx] = preprocess_text(doc, tokenizer=tweet_tokenizer, lemmatizer=wordnet_lemmatizer, remove_stopwords=False, remove_punctuation=False)
    
  

## Sentiment calculation

The following two methods perform the same steps as shown above. Both loop over all tweets and check for each word in a tweet whether it has a sentiment score (i.e., is in the sentiment lexicon). Depending on the overall score for a tweet (less than 0, equal to 0, or greater than 0), they assign the respective polarity to the tweet (0, 2, or 4). The method `calc_polarities_negation()` considers the scope of a negation the same way as shown above: with `negation_scope = 2`, it flips the sentiment scores of the negation word itself and of the 2 tokens that follow it.

Note that there is one special case. Let's assume we only want to label the tweets with "positive" or "negative" (thus, only 2 classes). How do we handle tweets with an overall score of 0? In this case, we assume that the tweet is positive.

Both methods return a list of polarity labels, e.g., `[2, 0, 0, 2, 0, 4, 2, 0, 4, 4, ...]`. The length of the list is the number of documents (tweets).

In [17]:
def calc_polarities(docs, num_polarities=3):
    calculated_polarities = []
    for doc in docs:
        doc_score = 0.0
        for token in doc.split(): # Here split() is sufficient
            if token in sentiment_dict:
                doc_score += sentiment_dict[token]
        if doc_score > 0:
            calculated_polarities.append(4)
        elif doc_score < 0:
            calculated_polarities.append(0)
        else:
            if num_polarities == 3:
                calculated_polarities.append(2)
            else:
                calculated_polarities.append(4) # By default, assume it's positive (for 2 classes)
    return calculated_polarities


def calc_polarities_negation(docs, num_polarities=3):
    calculated_polarities = []
    for doc in docs:
        doc_score = 0.0
        negation_scope = -1
        for token in doc.split(): # Here split() is sufficient
            if token in negation_words:
                negation_scope = 2
            if token in sentiment_dict:
                token_score = sentiment_dict[token]
                if negation_scope >= 0:
                    token_score *= (-1)
                doc_score += token_score
            negation_scope -= 1
        if doc_score > 0:
            calculated_polarities.append(4)
        elif doc_score < 0:
            calculated_polarities.append(0)
        else:
            if num_polarities == 3:
                calculated_polarities.append(2)
            else:
                calculated_polarities.append(4) # By default, assume it's positive (for 2 classes)
    return calculated_polarities
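The label assignment at the end of both functions is identical, so it could be factored out into a small helper. The following is only a sketch of that idea; the name `score_to_polarity` is an assumption and does not appear elsewhere in this notebook.

```python
def score_to_polarity(doc_score, num_polarities=3):
    """Map an aggregated sentiment score to a label:
    0 (negative), 2 (neutral), or 4 (positive)."""
    if doc_score > 0:
        return 4
    if doc_score < 0:
        return 0
    # Score is exactly 0: neutral with 3 classes, otherwise fall back to positive
    return 2 if num_polarities == 3 else 4

example_scores = (0.675, -0.675, 0.0)
print([score_to_polarity(s) for s in example_scores])                    # [4, 0, 2]
print([score_to_polarity(s, num_polarities=2) for s in example_scores])  # [4, 0, 4]
```

Keeping the mapping in one place makes it easier to experiment later, for example with a small neutral band around 0 instead of an exact comparison.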



Let's calculate the polarities for all tweets using the methods defined above. You can try both methods (with or without negation handling) and see the effects.

In [26]:
#calculated_polarities = calc_polarities(processed_tweets)
calculated_polarities = calc_polarities_negation(processed_tweets)

print(calculated_polarities)

[0, 4, 2, 2, 0, 2, 0, 2, 2, 0, 0, 0, 4, 4, 0, 4, 0, 2, 4, 2, 2, 4, 4, 4, 2, 4, 0, 4, 2, 0, 4, 4, 4, 4, 0, 0, 0, 4, 0, 2, 4, 4, 0, 2, 0, 0, 2, 0, 4, 2, 4, 0, 0, 4, 4, 2, 4, 2, 4, 4, 0, 0, 2, 2, 4, 4, 0, 4, 0, 4, 2, 4, 4, 0, 0, 0, 2, 0, 0, 0, 4, 4, 0, 4, 4, 0, 4, 0, 4, 4, 0, 2, 4, 4, 0, 0, 4, 4, 0, 4, 4, 4, 0, 2, 2, 2, 4, 4, 2, 4, 4, 0, 4, 2, 4, 0, 4, 4, 4, 0, 0, 4, 4, 2, 4, 2, 0, 2, 2, 4, 0, 4, 4, 4, 4, 4, 2, 4, 4, 2, 4, 0, 4, 0, 2, 4, 4, 0, 4, 4, 4, 0, 4, 4, 4, 2, 0, 2, 0, 0, 4, 0, 2, 2, 2, 4, 0, 0, 4, 2, 4, 4, 0, 2, 0, 2, 2, 0, 0, 0, 2, 4, 0, 4, 4, 4, 4, 2, 4, 0, 4, 2, 4, 4, 4, 2, 0, 2, 4, 4, 4, 4, 4, 4, 4, 2, 0, 4, 4, 2, 4, 2, 0, 2, 0, 0, 4, 4, 2, 0, 4, 0, 2, 4, 4, 4, 0, 4, 4, 4, 2, 4, 4, 0, 4, 2, 4, 0, 4, 4, 2, 4, 2, 2, 4, 0, 2, 4, 2, 4, 4, 0, 0, 4, 0, 4, 0, 4, 4, 2, 0, 0, 2, 0, 4, 4, 4, 2, 2, 0, 0, 4, 0, 4, 4, 4, 4, 4, 2, 4, 4, 4, 4, 4, 2, 2, 4, 0, 4, 2, 0, 0, 2, 0, 0, 0, 0, 0, 4, 4, 2, 4, 4, 2, 4, 4, 4, 0, 4, 4, 0, 2, 2, 4, 2, 2, 0, 4, 4, 2, 4, 2, 4, 0, 4, 0, 4, 0, 4, 2, 2, 0, 0, 

## Evaluation

### Common metrics

We can now use existing metrics to evaluate how well our sentiment analysis performed. Recall that `polarities` contains the true polarity labels.

In [27]:
print (confusion_matrix(polarities, calculated_polarities))

[[180  22  62]
 [ 26  92  50]
 [ 13  29 225]]


In [22]:
print (classification_report(polarities, calculated_polarities))

             precision    recall  f1-score   support

          0       0.82      0.68      0.75       264
          2       0.64      0.55      0.59       168
          4       0.67      0.84      0.75       267

avg / total       0.72      0.71      0.71       699



A combined f1-score of 0.71 is actually not a bad value for sentiment analysis. Sentiments are often highly subjective, and annotators often disagree significantly when labeling a document.
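To see where the numbers in the report come from, the per-class metrics can be recomputed by hand from the confusion matrix printed above (rows are true labels, columns are predicted labels). This is a plain-Python sketch; the helper name `per_class_metrics` is an assumption for illustration.

```python
# Confusion matrix as printed above: rows = true labels, columns = predicted labels
labels = [0, 2, 4]
cm = [
    [180, 22, 62],
    [26, 92, 50],
    [13, 29, 225],
]

def per_class_metrics(matrix, class_labels):
    """Return {label: (precision, recall, f1)} computed from a confusion matrix."""
    metrics = {}
    for i, label in enumerate(class_labels):
        true_pos = matrix[i][i]
        predicted = sum(row[i] for row in matrix)  # column sum: all predictions of this label
        actual = sum(matrix[i])                    # row sum: all true instances of this label
        precision = true_pos / predicted
        recall = true_pos / actual
        f1 = 2 * precision * recall / (precision + recall)
        metrics[label] = (precision, recall, f1)
    return metrics

total = sum(sum(row) for row in cm)
accuracy = sum(cm[i][i] for i in range(len(labels))) / total
print("accuracy: {:.2f}".format(accuracy))

for label, (p, r, f1) in per_class_metrics(cm, labels).items():
    print("{}: precision={:.2f} recall={:.2f} f1={:.2f}".format(label, p, r, f1))
```

The off-diagonal entries also show which confusions dominate: for instance, 62 negative tweets were predicted as positive, compared with only 13 positive tweets predicted as negative.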

### Error analysis

A common part of the evaluation is the so-called error analysis. The idea is to manually inspect all misclassified documents to get an idea of the cases in which the sentiment assignment fails. This in turn serves as a basis for improving the approach.
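Before reading individual tweets, it can help to count how often each kind of mistake occurs, so that the manual inspection can start with the most frequent one. The sketch below uses toy label lists that are not taken from the tweet dataset; the helper name `count_error_types` is an assumption for illustration.

```python
from collections import Counter

def count_error_types(true_labels, pred_labels):
    """Count how often each (true, predicted) pair of differing labels occurs."""
    return Counter(
        (t, p) for t, p in zip(true_labels, pred_labels) if t != p
    )

# Toy labels for illustration only (not taken from the tweet dataset)
toy_true = [0, 0, 4, 4, 2, 2, 2]
toy_pred = [4, 2, 0, 4, 0, 0, 2]

error_counts = count_error_types(toy_true, toy_pred)
for (t, p), n in error_counts.most_common():
    print("True: {}, Predicted: {} -> {} tweets".format(t, p, n))
```

With the real data, the same function could be called with `polarities` and `calculated_polarities`.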

In [28]:
num_wrong_predictions = 0

for idx in range(len(polarities)):
    true_polarity = polarities[idx]
    pred_polarity = calculated_polarities[idx]
    if true_polarity != pred_polarity:
        num_wrong_predictions += 1
        print ("True: {}, Predicted: {}, Tweet: {}".format(true_polarity, pred_polarity, tweets[idx]))
        
print (num_wrong_predictions)

True: 0, Predicted: 4, Tweet: I hate Time Warner! Soooo wish I had Vios. Cant watch the fricken Mets game w/o buffering. I feel like im watching free internet porn.
True: 0, Predicted: 2, Tweet: Driverless cars ? What's the point ?
True: 4, Predicted: 0, Tweet: how can you not love Obama? he makes jokes about himself.
True: 4, Predicted: 2, Tweet: Safeway is very rock n roll tonight
True: 2, Predicted: 0, Tweet: I saw Night at the Museum: Battle of the Swithsonian today. It was okay. Your typical [kids] Ben Stiller movie.
True: 2, Predicted: 0, Tweet:  Missed this. each is 'newer generation'. I'd start w allegra then go claritin then zyrtec. I don't envy you!
True: 2, Predicted: 4, Tweet: I hope the girl at work  buys my Kindle2
True: 4, Predicted: 0, Tweet: @ontheMAPP DITTO! not as good as the Nirvana Sandwiches 
True: 2, Predicted: 4, Tweet: 12 Gift Ideas For The Apple Lover Who Has Everything /0RTcyLOsAD  #technology
True: 2, Predicted: 4, Tweet: New blog post: Harvard Versus Stanfo