Sentiment analysis


## Introduction

Sentiment analysis is the process of identifying and characterising the emotions or the opinions that are expressed within machine-readable texts. Sentiment analysis applications commonly make use of lists of words that are indicative of specific sentiments. Such lists or lexicons usually specify whether the words refer to negative or to positive emotions. By calculating the frequencies of these affective words, and by examining the contexts in which these words are used, such tools generally aim to calculate specific sentiment scores. 

The most basic sentiment analysis approaches classify text fragments simply as either positive or negative. Valence-based approaches, by contrast, also consider the intensity of the emotions that are expressed, and aim to calculate more nuanced sentiment scores. 

This notebook discusses [Vader](https://github.com/cjhutto/vaderSentiment), which is available both as a separate package and as part of Python's NLTK library. Vader stands for the *Valence Aware Dictionary and sEntiment Reasoner*. Vader makes use of [a list of words](https://github.com/cjhutto/vaderSentiment/blob/master/vaderSentiment/vader_lexicon.txt) whose affective connotations have been evaluated manually by human volunteers. 



## Installing Vader

Before you can work with Vader, the package needs to be installed. 

In [None]:
import sys
# Install the package into the Python environment that runs this notebook
!{sys.executable} -m pip install vaderSentiment

If, for some reason, you are unable to install the package, you can instead download the Vader lexicon using NLTK.

In [None]:
import nltk
nltk.download('vader_lexicon', quiet=False)

If you managed to install Vader successfully, you should be able to import the `SentimentIntensityAnalyzer` class from the vaderSentiment library in your code. 

In the code below, an instance of this class is assigned to the variable `ana`. This object will function as a sentiment analyser. 

In [None]:
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer 
#from nltk.sentiment.vader import SentimentIntensityAnalyzer

ana = SentimentIntensityAnalyzer()

If you have downloaded the Vader lexicon via NLTK instead, you should place a hash before the first line and remove the hash in front of the second line, so that the class is imported from NLTK. 


## Sentiment ratings

The `SentimentIntensityAnalyzer` object has a method named `polarity_scores()` which calculates the sentiment scores. The method only requires a string as a parameter. 

The method returns a dictionary with four keys:

* `neg` gives a score for the level of negativity
* `neu` assesses the level of neutrality
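* `pos` assigns a score for the positivity
* the `compound` score, finally, is an overall assessment of the sentiments that are expressed. It is calculated by summing the valence scores of the individual words, adjusted according to Vader's rules, and by normalising this sum.

The first three scores range from 0 to 1. They indicate the proportions of the text that are negative, neutral and positive, and together they add up to approximately 1. The compound score ranges from -1 to +1. 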
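
The compound score stays within the range from -1 to +1 because the summed word valences are normalised. The sketch below assumes the normalisation formula found in the vaderSentiment source code, namely x / sqrt(x² + alpha) with alpha set to 15. The function name `normalise` is illustrative only, and the valence of 1.9 for 'good' is assumed to be the value in the lexicon.

```python
import math

def normalise(valence_sum, alpha=15):
    """Squash a summed valence into the range -1..+1."""
    score = valence_sum / math.sqrt(valence_sum ** 2 + alpha)
    # Clip the result to guard against rounding just outside the range
    return max(-1.0, min(1.0, score))

# Assumed lexicon valence for the word 'good'
print(round(normalise(1.9), 4))

# Large sums approach, but never exceed, the limits of the range
print(round(normalise(25.0), 4))
print(round(normalise(-25.0), 4))
```

Under these assumptions, a single word with a valence of 1.9 yields a compound score of about 0.44, while large sums move ever closer to -1 or +1.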

Sentiment scores can firstly be requested for individual words. 

In [None]:
# Request the scores for a positive, a negative and a neutral word
for word in ['good', 'terrible', 'ordinary']:
    print( f'{word}')

    scores = ana.polarity_scores(word)
    for s in scores:
        print( f'  {s} => {scores[s]}')

## Sentiments of sentences

The string that you provide as a parameter to the `polarity_scores()` method can also be a full sentence.  

When you run the code below, you will see that the first sentence is considered to be 44% neutral and 55% positive, with a compound score of 0.8225. 

The second sentence is given a score of 0.494 for negativity and a score of 0.506 on the neutrality scale. The score for positivity is 0.0. On the whole, the sentence receives a negative compound score of -0.5994.  

In [None]:
scores = ana.polarity_scores("A thing of beauty is a joy forever")

for s in scores:
    print( f'{s}: {scores[s]}'  )
    
print('\n')
    
scores = ana.polarity_scores("April is the cruellest month")

for s in scores:
    print( f'{s}: {scores[s]}'  )


## Context 

The scores that are generated by Vader are not simply the summations of the scores from the lexicon. The application also considers the broader contexts of the words in the sentences. Aspects such as punctuation and capitalisation are taken into account as well. 

Vader assumes, for example, that capitalisation can increase the intensity of an emotion. 

In [None]:
scores = ana.polarity_scores("It was the BEST of times, it was the worst of times.")

print( f'Positive: { scores["pos"] }' )
print( f'Negative: { scores["neg"] }' )

print('\n')

scores = ana.polarity_scores("It was the best of times, it was the worst of times.")

print( f'Positive: { scores["pos"] }' )
print( f'Negative: { scores["neg"] }' )


Intensifiers such as 'very' or 'really' likewise raise the ratings for particular emotions. The same is the case for exclamation marks. Vader also recognises that the word 'not' is a negation, and that a positive word following 'not' should in fact be treated as negative. 

In [None]:
scores = ana.polarity_scores("This novel is good.")
print( f'Overall score: { scores["compound"] }' )

In [None]:
scores = ana.polarity_scores("This novel is very good!")
print( f'Overall score: { scores["compound"] }' )

In [None]:
scores = ana.polarity_scores("This novel is very GOOD!")
print( f'Overall score: { scores["compound"] }' )

In [None]:
scores = ana.polarity_scores("This novel is not good.")
print( f'Overall score: { scores["compound"] }' )

In [None]:
scores = ana.polarity_scores("The novel isn't bad.")
print( f'Overall score: { scores["compound"] }' )

Note that Vader also takes into account emoticon codes such as ':)'.  Without the emoticon, the positivity score for the sentence below is 0.23. With the added smiley code, the positivity score rises to 0.338. 

In [None]:
scores = ana.polarity_scores("It was the best of times, it was the worst of times. :) ")

print( f'Positive: { scores["pos"] }' )
print( f'Negative: { scores["neg"] }' )
print( f'Compound: { scores["compound"] }' )

## Sentiment analyses of longer texts

Vader's `polarity_scores()` method can give good results for individual words, or for relatively short texts. When the method is applied to longer texts (e.g. the full text of a novel), however, the scores quickly become meaningless: because the compound score is normalised, it tends towards -1 or +1 as the number of affective words grows. 

When you want to analyse the degree of positivity or negativity in a full novel, it is generally advisable to divide the text into its separate sentences first, using the `sent_tokenize()` function from the `nltk` package, for instance. Once you have found the sentiment scores for all of these sentences individually, you can also calculate the average of these scores.
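
The averaging step can be sketched as follows. This is a minimal illustration rather than a part of Vader: `average_compound` is a hypothetical helper, and the scoring function is passed in as an argument, so that in this notebook it could be called as `average_compound(sent_tokenize(text), ana.polarity_scores)`. The `toy_scorer` function returns made-up values and only stands in for Vader, so that the example runs on its own.

```python
from statistics import mean

def average_compound(sentences, score_fn):
    """Return the mean compound score of the sentences (0.0 if there are none)."""
    compounds = [score_fn(s)["compound"] for s in sentences if s.strip()]
    return mean(compounds) if compounds else 0.0

# Stand-in scorer with made-up values, used only to make the sketch runnable
def toy_scorer(sentence):
    return {"compound": 0.5 if "joy" in sentence else -0.5}

print(average_compound(["A joy forever.", "The cruellest month."], toy_scorer))
```

Skipping empty strings and returning 0.0 for an empty list are choices made for this sketch; they avoid a division by zero when a text contains no sentences.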

As a second approach, you can also consider the sentiment scores of individual words. We can count all the words with a positive score, and all the words with a negative score. When we then subtract the number of negative words from the number of positive words, this may also give an indication of the overall affective nature of the text. 
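
A word-based count of this kind might look as follows. The helper name `word_balance` is hypothetical, and the sign of the compound score is used here as the criterion for a positive or a negative word, which is only one of several possible choices. The `toy_scorer` function uses made-up values in place of Vader; in this notebook, `ana.polarity_scores` could be passed in instead.

```python
def word_balance(words, score_fn):
    """Return the counts of positive and negative words, and their difference."""
    positive = negative = 0
    for word in words:
        compound = score_fn(word)["compound"]
        if compound > 0:
            positive += 1
        elif compound < 0:
            negative += 1
    return positive, negative, positive - negative

# Made-up scores standing in for the Vader lexicon
toy_scores = {"good": 0.4, "terrible": -0.5}

def toy_scorer(word):
    return {"compound": toy_scores.get(word.lower(), 0.0)}

print(word_balance("A good book with a terrible ending".split(), toy_scorer))
```

A difference above zero suggests that positive words outnumber negative words, and a difference below zero suggests the opposite.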

You can experiment with these two approaches in the exercises below. 

## Bibliography

C.J. Hutto and Eric Gilbert, "VADER: A Parsimonious Rule-based Model for Sentiment Analysis of Social Media Text", in: *ICWSM 2014* <[https://www.aaai.org/ocs/index.php/ICWSM/ICWSM14/paper/view/8109](https://www.aaai.org/ocs/index.php/ICWSM/ICWSM14/paper/view/8109)>



## Exercise 12.1

Use the code below to download a file named 'ulysses_reviews.txt', available at [edu.nl/fyf8v](https://edu.nl/fyf8v).

This file contains the full text of a number of reviews of James Joyce's novel *Ulysses* which were posted on the [Goodreads](https://www.goodreads.com/book/show/338798.Ulysses) social reading platform. 

Can you select all the positive words from these reviews? Can you also select all the words with a negative connotation?
Is the number of positive words higher than the number of negative words?

You can follow the steps below:

1. Create two lists, named `positive_words` and `negative_words`.
2. Tokenise the text file, and calculate the score for positivity and the score for negativity for each of these words. 
3. If the positivity score is higher than 0.75, append the word to the list named `positive_words`. If the negativity score is higher than 0.75, append the word to `negative_words`.
4. Finally, you can subtract the length of the list `negative_words` from the length of the list `positive_words`.

In [None]:
import requests

response = requests.get('http://edu.nl/fyf8v')
if response:
    response.encoding = 'utf-8'
    with open( 'ulysses_reviews.txt' , 'w' , encoding = 'utf-8') as out:
        out.write(response.text)
    

In [None]:
from nltk import word_tokenize

positive_words = []
negative_words = []

# Read the reviews line by line; the file is closed automatically
with open( 'ulysses_reviews.txt' , encoding = 'utf-8') as reviews:
    for line in reviews:
        words = word_tokenize(line)
        for word in words:
            scores = ana.polarity_scores(word)
            if scores["pos"] > 0.75:
                positive_words.append(word)
            elif scores["neg"] > 0.75:
                negative_words.append(word)

#print( positive_words) 
#print( negative_words) 

print( len(positive_words) )
print( len(negative_words) )

# Step 4: a result above zero means that positive words are more frequent
print( len(positive_words) - len(negative_words) )


## Exercise 12.2

What is the most negative sentence in these reviews of *Ulysses*? What is the most positive sentence? 

Try to implement the following steps:
    
1. Create a dictionary named `sent_scores` which can save the scores. In this dictionary, you can use sentences as keys, and the compound scores as values.
2. Navigate across all the lines in the file. Each line of the file is the full text of a single review.
3. Tokenise the reviews into sentences.
4. Navigate across all the sentences in these reviews and save the compound scores for these sentences in the dictionary `sent_scores`.
5. Sort the dictionary named `sent_scores` in descending order, and print the first few items it contains. The positive sentences will be shown at the top of the list. To see the negative sentences, you need to sort the dictionary in ascending order.

Unfortunately, there is no standard function in Python that sorts a dictionary by its values directly. This specific task can be performed using the function `sortedByValue()`, defined below. This function can optionally be used with a parameter named `ascending`. If the value of this parameter is `False`, the values will be sorted in descending order. 

In [None]:
def sortedByValue( dictionary , ascending = True ):
    # Sort the (key, value) pairs by value; the new dict keeps this order
    pairs = sorted( dictionary.items() , key = lambda item: item[1] , reverse = not ascending )
    return dict(pairs)


In [None]:
from nltk import sent_tokenize
sent_scores = dict()

# The file name must match the name used when the file was downloaded
with open( 'ulysses_reviews.txt' , encoding = 'utf-8') as file:
    for review in file:
        sentences = sent_tokenize(review)
        for s in sentences:
            scores = ana.polarity_scores(s)
            sent_scores[s] = scores['compound']

nr_sentences = 10

print('\nPositive sentences\n')

# Descending order: the most positive sentences come first
for s in list( sortedByValue( sent_scores , ascending = False ) )[:nr_sentences]:
    print( f'{s} [{ sent_scores[s]}]' )

print('\nNegative sentences\n')

# Ascending order: the most negative sentences come first
for s in list( sortedByValue( sent_scores , ascending = True ) )[:nr_sentences]:
    print( f'{s} [{ sent_scores[s]}]' )
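
Sorting the whole dictionary is not strictly needed when only the top few sentences are of interest. As an alternative sketch, the functions `heapq.nlargest()` and `heapq.nsmallest()` from the standard library select the extreme items directly. The dictionary below contains made-up sentences and scores, used only so that the example runs on its own; in this notebook you would pass `sent_scores` instead.

```python
import heapq

# Made-up compound scores, standing in for the real `sent_scores` dictionary
example_scores = {
    "I loved every page.": 0.6,
    "It was a chore to finish.": -0.3,
    "The book has eighteen episodes.": 0.0,
    "A dreadful, exhausting read.": -0.7,
}

# The key function looks up the score that belongs to each sentence
most_positive = heapq.nlargest(2, example_scores, key=example_scores.get)
most_negative = heapq.nsmallest(2, example_scores, key=example_scores.get)

print(most_positive)
print(most_negative)
```

Both functions return a list of keys, ordered from the most extreme score onwards, so the scores themselves can still be looked up in the dictionary.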

## Exercise 12.3

Does *Ulysses* express more positivity than *Pride and Prejudice*? Try to answer this question by following the steps below:

1. Read in the full text of *Ulysses*.
2. Create an empty list, named `all_scores` for instance, to capture all the scores.
3. Divide the novel into its separate sentences.
4. For each sentence, find the positive sentiment score.
5. Add this score to the list `all_scores`.
6. Once you have processed all the sentences, divide the sum of all the scores by the total number of sentences.
7. Follow steps 1-6 for *Pride and Prejudice*, and compare the two averages you have calculated.


In [1]:
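from nltk import sent_tokenize

def average_score(path):
    # Calculate the mean positive score over all the sentences in a file
    all_scores = []
    with open( path , encoding = 'utf-8' ) as file:
        full_text = file.read()
    sentences = sent_tokenize( full_text )
    for s in sentences:
        scores = ana.polarity_scores(s)
        all_scores.append( scores["pos"] )
    # Avoid a division by zero when the file contains no sentences
    if not all_scores:
        return 0.0
    return sum(all_scores) / len(all_scores)

# Call average_score() with the path to each novel, and compare the two results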
        
