# DIGI405 Lab Class: Sentiment Analysis

Make sure you use the Python 3.12 Kernel to run this notebook. 

This lab will investigate lexicon-based sentiment analysis with VADER (‘Valence Aware Dictionary for sEntiment Reasoning’). VADER is open source software, so you can inspect the code and modify it if you wish. In this week’s lab we will mainly refer to the lexicon.

Although VADER is more than 10 years old, it is still commonly used. You can learn lots about how language expresses sentiment by using VADER, understanding how it works, when it works and when it doesn't. 

The following cells import libraries and create a SentimentIntensityAnalyzer object.

In [None]:
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
import pandas as pd
from datasets import load_dataset
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from IPython.display import display, HTML
from matplotlib.colors import LinearSegmentedColormap, to_hex

In [None]:
import nltk
nltk.download('punkt_tab')
from nltk.tokenize import sent_tokenize

In [None]:
from textplumber.report import plot_confusion_matrix, plot_logistic_regression_features_from_pipeline, preview_dataset
from textplumber.embeddings import Model2VecEmbedder
from textplumber.store import TextFeatureStore

# Textplumber implements VADER scoring of texts, and extraction of sentiment features
from textplumber.vader import VaderSentimentEstimator, VaderSentimentExtractor, VaderSentimentProfileExtractor, SentimentIntensityInterpreter

In [None]:
custom_cmap = LinearSegmentedColormap.from_list("red_white_green", ["red", "white", "green"])

In [None]:
def norm_score(s):
    score_min = -1
    score_max = 1
    return (s - score_min) / (score_max - score_min) 

def highlight_row(row):
    normed = norm_score(row['compound'])
    color = to_hex(custom_cmap(normed))
    return [f'background-color: {color}; color: black'] * len(row)

In [None]:
pd.set_option('display.max_colwidth', None)

In [None]:
analyzer = SentimentIntensityAnalyzer()
interpreter = SentimentIntensityInterpreter()

## 1. Learn about VADER scores

In the cell below is a short phrase to show you the output of VADER. Get VADER's scores for the provided text and make sure you understand what each number tells us.

In [None]:
example = '''
This movie is terrible.
'''
vs = analyzer.polarity_scores(example)
print(str(vs))

In [None]:
interpreter.explain(example)

Read the "About the Scoring" section of the VADER GitHub README, which explains the scores that VADER returns:  
https://github.com/cjhutto/vaderSentiment#about-the-scoring
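
The compound score can be easier to interpret once you see how a raw sum of valences is squashed into the range -1 to 1. The following is a minimal sketch of that normalisation step, assuming the formula and the constant `alpha = 15` used in the vaderSentiment source. The function name `normalise_valence_sum` and the example sums are hypothetical and for illustration only.

```python
import math

def normalise_valence_sum(valence_sum, alpha=15):
    """Squash an unbounded valence sum into the range [-1, 1]."""
    norm = valence_sum / math.sqrt(valence_sum ** 2 + alpha)
    # Clamp to guard against floating point overshoot
    return max(-1.0, min(1.0, norm))

# Hypothetical valence sums, not taken from any real text
for total in [-6.0, -2.0, 0.0, 2.0, 6.0]:
    print(total, round(normalise_valence_sum(total), 4))
```

Notice that doubling the raw sum does not double the compound score: the curve flattens as the sum grows, which is worth keeping in mind when scoring long texts with many sentiment-bearing words.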

### 1.1 Questions

1. What do the 'neu', 'pos', and 'neg' scores represent?  
2. What range of values of the Compound Score should be associated with a "neutral" classification?  


## 2. Score some text and understand Vader's lexicon and booster/negation rules

 

Here's another example - you can copy and paste this code into new code cells to test out different phrases.

In [None]:
example = '''
The movie was great.
'''
vs = analyzer.polarity_scores(example)
print(str(vs))

### 2.1 Activities

Try different text and make sure you understand the scores VADER returns. Copy the code above into new cells below for each example you come up with.

Create examples for the following conditions:   

1. A sentence that is obviously positive like "The movie is great"
2. A sentence that uses a "booster" e.g. "The movie is really terrible"
3. A sentence that uses negation e.g. "The movie is not great". 
4. Some sentences that attempt to fool VADER. Think about the discussion in class and challenges with sentiment classification in general, and specific challenges related to VADER's lexicon or rules.
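
To make the idea of boosters and negation concrete, here is a toy sketch of how such rules can adjust a word's valence. It is not VADER's implementation: it only looks at the single word before each lexicon word, whereas VADER checks several preceding words and also handles capitalisation, punctuation, "but" and idioms. The tiny lexicon, the constants (assumed to match values in the vaderSentiment source) and the function name `toy_valence` are all illustrative assumptions.

```python
# Toy illustration only: a tiny hand-picked lexicon and two simplified rules.
# Valences and constants are assumed to match the vaderSentiment source.
TOY_LEXICON = {"great": 3.1, "terrible": -2.1}
BOOSTERS = {"really": 0.293}   # boosters push a valence further from zero
NEGATIONS = {"not"}
NEGATION_SCALAR = -0.74        # negation flips and dampens a valence

def toy_valence(text):
    """Sum word valences, adjusting each by the word immediately before it."""
    words = [w.strip(".,!?") for w in text.lower().split()]
    total = 0.0
    for i, word in enumerate(words):
        valence = TOY_LEXICON.get(word)
        if valence is None:
            continue  # word is not in the toy lexicon
        previous = words[i - 1] if i > 0 else ""
        if previous in BOOSTERS:
            # Boost in the direction of the word's own polarity
            boost = BOOSTERS[previous]
            valence += boost if valence > 0 else -boost
        elif previous in NEGATIONS:
            valence *= NEGATION_SCALAR
        total += valence
    return total

for phrase in ["The movie is great",
               "The movie is really terrible",
               "The movie is not great"]:
    print(phrase, "->", round(toy_valence(phrase), 3))
```

Compare the toy output with what `analyzer.polarity_scores` returns for the same phrases, then look for phrases where the one-word window of this sketch is clearly too simple.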

### 2.2 Aid your understanding of VADER

Look at the lexicon and the booster/negation words on the VADER repository to get more insight into the scores.

* The VADER module code is here: https://github.com/nltk/nltk/blob/develop/nltk/sentiment/vader.py  
* Negations and booster words are on lines 48-181.  
* The Vader lexicon is available here: https://github.com/cjhutto/vaderSentiment/blob/master/vaderSentiment/vader_lexicon.txt  Note: you can search the lexicon in your browser or you can download it and inspect it in a text editor.  
* Make sure you are clear what the values in the VADER lexicon actually mean.  

Here are some examples for your reference: 

    hope 	1.9 0.53852 [3, 2, 2, 1, 2, 2, 1, 2, 2, 2]
    hopeless -2.0 1.78885 [-3, -3, -3, -3, 3, -1, -3, -3, -2, -2]

* The VADER paper itself is helpful also: https://ojs.aaai.org/index.php/ICWSM/article/view/14550
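
One way to check your reading of the lexicon entries is to recompute the summary values from the raw ratings. The sketch below assumes each entry holds a token, the mean rating, the standard deviation and the individual ratings from ten human raters, as described in the VADER README. The helper `parse_entry` is hypothetical and not part of any library.

```python
import ast
import statistics

# The two example entries shown above
entries = [
    "hope 1.9 0.53852 [3, 2, 2, 1, 2, 2, 1, 2, 2, 2]",
    "hopeless -2.0 1.78885 [-3, -3, -3, -3, 3, -1, -3, -3, -2, -2]",
]

def parse_entry(line):
    """Split an entry into token, mean valence, standard deviation and raw ratings."""
    token, mean, sd, ratings = line.split(None, 3)
    return token, float(mean), float(sd), ast.literal_eval(ratings)

for line in entries:
    token, mean, sd, ratings = parse_entry(line)
    # Recompute the summary values from the ten individual ratings
    print(token, mean, statistics.mean(ratings), sd, round(statistics.pstdev(ratings), 5))
```

The recomputed values match the listed ones. The large standard deviation for "hopeless" comes from a single rating of 3 among otherwise negative ratings, a reminder that raters do not always agree.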

## 3. Score longer texts

Below we load the movie reviews dataset we used in a previous lab. 

You can browse the dataset here: https://huggingface.co/datasets/polsci/sentiment-polarity-dataset-v2.0
or download the texts here: https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/packages/corpora/movie_reviews.zip 

In [None]:
dataset = load_dataset('polsci/sentiment-polarity-dataset-v2.0')

In [None]:
preview_dataset(dataset)

In [None]:
X = list(dataset['train']['text'])
y = list(dataset['train']['label'])

In [None]:
target_names = dataset['train'].features['label'].names
target_classes = list(range(len(target_names)))

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

### 3.1 Activity:

Run the cells below to preview a review and get VADER's scores.

Try some different reviews from the dataset and see what scores Vader comes up with. Are the scores correct against the actual label?

In [None]:
review_id = 904
try:
    review = X_train[review_id]
    print(f"Label: {target_names[y_train[review_id]]}")
    print()
    print(review)
except IndexError:
    print(f"Review ID {review_id} is out of range for the training set.")

In [None]:
vs = analyzer.polarity_scores(review)
print(str(vs))

### 3.2. Evaluating VADER's performance on long texts

The cells below compare VADER scoring with a model based on classification of embeddings. 

First, here is an evaluation of VADER.

Note: The labels in the dataset are either positive or negative (i.e. no neutral). Here compound scores greater than or equal to 0 are considered positive, and scores less than 0 are considered negative.
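
The mapping described in the note can be sketched in a few lines. This illustrates the rule only and is not the code inside `VaderSentimentEstimator`; the function name `compound_to_label` and the example scores are hypothetical.

```python
def compound_to_label(compound, threshold=0.0):
    """Map a compound score to 1 (positive) or 0 (negative), with no neutral class."""
    return 1 if compound >= threshold else 0

# Hypothetical compound scores for illustration
scores = [0.9981, 0.0, -0.0258, -0.8745]
print([compound_to_label(s) for s in scores])  # prints [1, 1, 0, 0]
```

Under this rule a compound score of exactly 0 is labelled positive.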

In [None]:
pipeline = Pipeline([
        ('classifier', VaderSentimentEstimator(output = 'labels', neutral_threshold = 0, label_mapping = {'positive': 1, 'negative': 0})),
], verbose=True)

pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)

print(classification_report(y_test, y_pred, labels = target_classes, target_names = target_names, digits=3))
plot_confusion_matrix(y_test, y_pred, target_classes, target_names)

Second, here is a model based on a Logistic Regression classifier and Model2Vec embeddings on the same dataset. 

In [None]:
feature_store = TextFeatureStore('sentiment-lab-movie_reviews.sqlite')
pipeline = Pipeline([
        ('features', Model2VecEmbedder(feature_store = feature_store)),
        ('classifier', LogisticRegression(max_iter = 5000, random_state=55))
], verbose=True)
display(pipeline)
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
print(classification_report(y_test, y_pred, labels = target_classes, target_names = target_names, digits=3))
plot_confusion_matrix(y_test, y_pred, target_classes, target_names)

The labels based on VADER scores are accurate more often than not, but accuracy is not great on these long texts (around 65% on this dataset) when compared to a basic classifier (around 75%). Observe that on these long reviews, VADER has a tendency to label reviews as positive more than negative. 

VADER works better on short texts. The original VADER paper indicates it worked best on social media texts.

Despite these limitations, we can use VADER to explore some of the problems with deriving overall sentiment scores using a lexicon-based approach, and some of the challenges of measuring sentiment more generally.

## 4. Examining sentiment scores by sentence

Let’s look at an example review to think about the different frames of reference to which sentiments might be connected. The example we will use is a review of Neil Jordan’s film The Butcher Boy (filename cv079_11933.txt). 

A descriptive statement describes the content of the film. E.g. sentence 3: Francie is a “sick, needy child” - this tells us about what happens in the film.

An analytic statement analyses the content of the film. 

E.g. sentence 3: “I found it difficult to laugh at some of Francie’s darkly comic shenanigans” - here the reviewer is analysing the effects of the film.

It’s not a perfect distinction, but we can observe that negative content in the film doesn’t necessarily imply a negative review of the film. Both types of statements can include evaluative language and include indications of the reviewer's point of view about the movie, but lexicon-based sentiment analysis will have difficulty if a review has a lot of “negative” content, but is nonetheless given a positive review.



### 4.1. Activity
Run the following cells to split the text into sentences and output scores for each sentence.

In [None]:
review_id = 904 # change the review ID to examine another review
try:
    review = X_train[review_id]
    # this splits the review using NLTK's sentence tokenizer and removes empty sentences or sentences with only common punctuation
    sentences = sent_tokenize(review)
    sentences = [s for s in sentences if s.strip() and s.strip() not in ['.', ',', '!', '?']]
    print(f"Label: {target_names[y_train[review_id]]}")
    print()
    display(sentences)
except IndexError:
    print(f"Review ID {review_id} is out of range for the training set.")

In [None]:
data = []
for sentence in sentences:
    vs = analyzer.polarity_scores(sentence)
    data.append([sentence, vs['neg'], vs['neu'], vs['pos'], vs['compound']])

df = pd.DataFrame(data, columns=['sentence','neg','neu','pos','compound'])

print(f"Label: {target_names[y_train[review_id]]}")
print()
display(df.style.apply(highlight_row, axis=1))

### 4.2. Questions

1. Look closely at each sentence and work out which ones relate to the reviewer's evaluation of the movie. Is Vader doing a good job of scoring these sentences?  
2. Try this with another review. Change the ID number in the cell above to load another review. Look carefully at the positively and negatively evaluated sentences using the compound score. From this analysis, what challenges do you see in correctly assigning overall sentiment scores to movie reviews?

## 5. Examining the structure of reviews

In class we talked about the argumentative structure of reviews, what reviewers are doing when they write a review and who a review is for. When evaluating a film, reviewers rarely just say "Loved it" or "Hated it"; that is what the number rating is for. Reviewers tend to craft an argument that justifies their rating, using the descriptive and analytical statements discussed above. Reviewers also tend to follow conventions of other reviews they've read and anticipate that people read reviews to find movies to watch. For a reviewer to be viewed as credible and their review to be useful to its potential audience, reviewers will often point out positive and negative features of a film, while expressing their evaluation. This weighing up of good and bad may help readers understand if a film is suitable for them or not. In a positive review, we can expect some discussion of negative features, and in a negative review we can expect some discussion of positive features. 

This point is not just about reviews; it is a general point about the structure of opinion-giving. Part of giving your view is anticipating the views of others.  

### 5.1 Activity

Below is an example of a review that discusses positive and negative features of a film and discusses who the film might be suitable for. Change the ID number and examine how other review authors are orienting to their audience and structuring their evaluation.

In [None]:
review_id = 214 # change the review ID to examine another review
try:
    review = X_train[review_id]
    # this splits the review using NLTK's sentence tokenizer and removes empty sentences or sentences with only common punctuation
    sentences = sent_tokenize(review)
    sentences = [s for s in sentences if s.strip() and s.strip() not in ['.', ',', '!', '?']]

    data = []
    for sentence in sentences:
        vs = analyzer.polarity_scores(sentence)
        data.append([sentence, vs['neg'], vs['neu'], vs['pos'], vs['compound']])

    df = pd.DataFrame(data, columns=['sentence','neg','neu','pos','compound'])

    print(f"Label: {target_names[y_train[review_id]]}")
    print()
    display(df.style.apply(highlight_row, axis=1))
except IndexError:
    print(f"Review ID {review_id} is out of range for the training set.")


### 5.2. Looking at structure across the corpus

The following visualisation shows some interesting patterns from exploratory analysis of the review corpus. The visualisation clusters reviews by the structure of sentiment scores. Take a look at the visualisation now. There are some notes at the bottom of the image to help you interpret it.

In [None]:
VaderSentimentProfileExtractor(output='profileonly').plot_sentiment_structure(X_train, y_train, target_classes = target_classes, target_names = target_names)

### 5.3. Questions about the visualisation

1. What are some differences you notice between positive and negative reviews?  
2. Are there clusters you would expect to be misclassified by VADER across the whole document?

## 6. What happens if we classify based on VADER scores across documents?

The following cell trains a model based on multiple sentiment features for each text. The sentiment profile for each text includes the overall VADER compound score and positive/negative/neutral proportions, as well as sentiment scores across the structure of the texts. Compound scores are extracted for the first three sentences, the last three sentences and four random sentences from the middle of the text. Short texts are handled by padding the extracted features with zeros. 
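
To see what the structural part of such a profile could look like, here is a minimal sketch that turns a list of per-sentence compound scores into a fixed-length vector. It is not textplumber's implementation: the function name `sentiment_profile`, the feature order, the seed and the position of the zero padding are all assumptions, and the overall document scores are left out.

```python
import random

def sentiment_profile(sentence_scores, n_first=3, n_middle=4, n_last=3, seed=0):
    """Build a fixed-length profile from per-sentence compound scores."""
    first = sentence_scores[:n_first]
    rest = sentence_scores[n_first:]
    # Whatever remains is divided into a middle part and the final sentences
    split = max(len(rest) - n_last, 0)
    middle, last = rest[:split], rest[split:]

    # Sample middle sentences at random, keeping them in document order
    rng = random.Random(seed)
    k = min(n_middle, len(middle))
    picked = sorted(rng.sample(range(len(middle)), k))
    middle = [middle[i] for i in picked]

    def pad(values, size):
        # Short segments are padded with zeros
        return list(values) + [0.0] * (size - len(values))

    return pad(first, n_first) + pad(middle, n_middle) + pad(last, n_last)

# Hypothetical per-sentence compound scores
long_review = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, -0.1, -0.2, -0.3]
print(sentiment_profile(long_review))
print(sentiment_profile([0.5, -0.2]))
```

Every text yields ten values regardless of length, which is what allows a classifier such as logistic regression to learn a separate weight for each position in the review.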

The cell below takes a while to run!

In [None]:
pipeline = Pipeline([
        ('features', VaderSentimentProfileExtractor(output='profile')),
        ('classifier', LogisticRegression(max_iter = 5000, random_state=55))
], verbose=True)
display(pipeline)
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
print(classification_report(y_test, y_pred, labels = target_classes, target_names = target_names, digits=3))
plot_confusion_matrix(y_test, y_pred, target_classes, target_names)

Although it does not reach the performance of the model based on embeddings, this model outperforms the overall VADER scoring of long texts. These sentiment features can be combined with other features to improve performance further.

Take a moment to review the plot of discriminative features below. 

In [None]:
plot_logistic_regression_features_from_pipeline(pipeline, target_classes, target_names, top_n=20, classifier_step_name = 'classifier', features_step_name = 'features')

The plot of discriminative features is interesting, as it shows the model learned that sentiment scores for the conclusion of a review are more important than those for the introduction or body in predicting a document's sentiment. 

## 7. Concluding Activities and Questions

1. **ACTIVITY:** In class this week we discussed how sentiment analysis might not be an appropriate technique for analysing some kinds of texts. For example, some texts are not primarily about presenting a point of view or evaluation (e.g. journalistic texts, scientific writing) and authors/speakers don't always present their evaluations in a straightforward way (e.g. some political texts). Take some time to explore some different kinds of texts (e.g. editorials, fiction, tweets, news articles, political speeches, texts from the corpus you built for the Corpus Building Project). Vader will tend to perform better with short texts, so make sure you try texts of different lengths.  
**QUESTION:** How does Vader perform on different kinds of texts? What kinds of texts are challenging for a lexicon-based approach to sentiment analysis? What kinds of texts are not appropriate for sentiment analysis?

In [None]:
example = '''
Put your text samples here.
'''
vs = analyzer.polarity_scores(example)
print(str(vs))

2. **ACTIVITY:** Under the Readings section on this week's AKO|LEARN page is a link to a Hugging Face Space that allows you to test pre-trained models for Sentiment Analysis. We have also linked to the relevant model the web app is using. Try out some of the sentences from the movie review example above. Try other texts you have tested in the lab today.  
**QUESTION:** How do these models perform compared with Vader? What are some of the advantages and disadvantages of using pre-trained machine learning models?

**Discuss what you have found with your neighbour. If you have time at the end of the lab you can work on your assignment.**