# **Sentiment Analysis**

Sentiment analysis, sometimes referred to as opinion mining, is a technique for identifying the tone of a body of text. It is typically framed as a classification task.

We will start by using some lexicon-based methods for measuring sentiment. First, we need to install the required libraries using pip (or pip3 or conda).

In [5]:
!pip install textblob



Now, we will import the required libraries and load the data on which we will perform sentiment analysis.

In [6]:
import pandas as pd
from textblob import TextBlob
import nltk
nltk.download('all')
from nltk.sentiment.vader import SentimentIntensityAnalyzer

# Load the data
data = pd.read_csv('sentiment_data.csv', sep='\t', header=None,  names=["text", "class"])

We will start with TextBlob, which is a Python library for processing textual data. TextBlob has a built-in sentiment analyzer that uses a lexicon-based approach. We can use this to calculate the polarity and subjectivity of the text.
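
To make "lexicon-based" concrete, here is a toy sketch of the general idea: look up each word in a polarity dictionary and average the scores that are found. The `TOY_LEXICON` values and the `toy_polarity` function are invented for illustration. They are not TextBlob's lexicon or algorithm, which among other things also accounts for negation and intensifiers.

```python
import re

# Hypothetical mini-lexicon: word -> polarity in [-1, 1] (values invented for illustration)
TOY_LEXICON = {"great": 0.8, "good": 0.7, "slow": -0.3, "terrible": -1.0}

def toy_polarity(text):
    """Average the polarity of all words found in the lexicon; 0.0 if none match."""
    words = re.findall(r"[a-z']+", text.lower())
    scores = [TOY_LEXICON[w] for w in words if w in TOY_LEXICON]
    return sum(scores) / len(scores) if scores else 0.0

print(toy_polarity("Great food, terrible service"))  # average of 0.8 and -1.0
print(toy_polarity("The plot was slow"))              # only "slow" is in the lexicon
print(toy_polarity("Nothing to report"))              # no lexicon words, so 0.0
```

Averaging means that one strongly negative word can cancel a positive one, which is one reason mixed sentences often come out close to neutral.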

In [7]:
# Print the first 5 rows of the data
pd.set_option('display.max_colwidth', 100)
print(data.head())

                                                                                                  text  \
0                A very, very, very slow-moving, aimless movie about a distressed, drifting young man.   
1    Not sure who was more lost - the flat characters or the audience, nearly half of whom walked out.   
2  Attempting artiness with black & white and clever camera angles, the movie disappointed - became...   
3                                                           Very little music or anything to speak of.   
4  The best scene in the movie was when Gerardo is trying to find a song that keeps running through...   

   class  
0      0  
1      0  
2      0  
3      0  
4      1  


In [8]:
def get_textblob_sentiment(text):
    blob = TextBlob(text)
    return blob.sentiment.polarity, blob.sentiment.subjectivity

# get polarity and subjectivity
data['blob_polarity'], data['blob_subjectivity'] = zip(*data.iloc[:, 0].apply(get_textblob_sentiment))

data.head()

| | text | class | blob_polarity | blob_subjectivity |
|---|---|---|---|---|
| 0 | A very, very, very slow-moving, aimless movie about a distressed, drifting young man. | 0 | 0.18 | 0.395 |
| 1 | Not sure who was more lost - the flat characters or the audience, nearly half of whom walked out. | 0 | 0.014583 | 0.420139 |
| 2 | Attempting artiness with black & white and clever camera angles, the movie disappointed - became... | 0 | -0.122917 | 0.514583 |
| 3 | Very little music or anything to speak of. | 0 | -0.24375 | 0.65 |
| 4 | The best scene in the movie was when Gerardo is trying to find a song that keeps running through... | 1 | 1.0 | 0.3 |


Next, we will use VADER, another lexicon-based sentiment analysis tool. VADER combines its lexicon with rules (for example, for negation and intensifiers) to calculate the sentiment of a piece of text. We can use the SentimentIntensityAnalyzer class from NLTK's VADER module to obtain a compound score, which ranges from -1 (most negative) to 1 (most positive).

In [9]:
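# Create the analyzer once; constructing it loads the lexicon from disk
analyzer = SentimentIntensityAnalyzer()

def get_vader_sentiment(text):
    # 'compound' is the normalized overall score in [-1, 1]
    scores = analyzer.polarity_scores(text)
    return scores['compound']

# get vader sentiment score
data['vader_score'] = data.iloc[:, 0].apply(get_vader_sentiment)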

data.head()

| | text | class | blob_polarity | blob_subjectivity | vader_score |
|---|---|---|---|---|---|
| 0 | A very, very, very slow-moving, aimless movie about a distressed, drifting young man. | 0 | 0.18 | 0.395 | -0.4215 |
| 1 | Not sure who was more lost - the flat characters or the audience, nearly half of whom walked out. | 0 | 0.014583 | 0.420139 | -0.5507 |
| 2 | Attempting artiness with black & white and clever camera angles, the movie disappointed - became... | 0 | -0.122917 | 0.514583 | -0.7178 |
| 3 | Very little music or anything to speak of. | 0 | -0.24375 | 0.65 | 0.0 |
| 4 | The best scene in the movie was when Gerardo is trying to find a song that keeps running through... | 1 | 1.0 | 0.3 | 0.6369 |
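
A compound score is continuous, so comparing it with the binary class column requires a cutoff. The sketch below buckets a score into three labels. The `label_from_compound` helper is hypothetical, and the threshold of 0.05 is an assumption (a commonly used convention for VADER compound scores) that should be tuned for the data at hand.

```python
def label_from_compound(compound, threshold=0.05):
    """Map a compound score in [-1, 1] to a coarse sentiment label."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"

# Compound scores copied from the first five rows of the data
for score in [-0.4215, -0.5507, -0.7178, 0.0, 0.6369]:
    print(score, label_from_compound(score))
```

Note that the fourth review is labeled negative in the data but receives a compound score of 0.0, so any threshold around zero would call it neutral.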


While TextBlob and VADER are useful tools for sentiment analysis, lexicon-based methods tend to struggle in certain cases. Some situations where TextBlob and VADER may miss the mark:

- Negation
- Metaphors
- Sarcasm
- Subtlety
- Context

In [10]:
# Example sentences to test
sentences = [
    "I don't hate this product",
    "The movie wasn't bad",
    "The sun was a smiling face in the sky",
    "I just love being stuck in traffic",
    "This restaurant has great food, but terrible service"
    # add some more
]

**Exercise: What happens when you get sentiment scores for these sentences? You can try TextBlob and VADER. Can you find other example sentences for which TextBlob and VADER give different values?**
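
One way to approach the exercise is to collect the sentences on which two scorers disagree in sign. The sketch below is a hypothetical helper, `find_disagreements`, that accepts any two scoring functions. The two stand-in scorers exist only so that the example runs on its own; they are not TextBlob or VADER.

```python
def find_disagreements(sentences, scorer_a, scorer_b):
    """Return (sentence, score_a, score_b) for sentences whose scores differ in sign."""
    def sign(x):
        return (x > 0) - (x < 0)

    rows = []
    for sentence in sentences:
        a, b = scorer_a(sentence), scorer_b(sentence)
        if sign(a) != sign(b):
            rows.append((sentence, a, b))
    return rows

# Stand-in scorers so the sketch is self-contained (hypothetical, for illustration only)
def always_neutral(text):
    return 0.0

def counts_love(text):
    return 1.0 if "love" in text.lower() else 0.0

demo = ["I just love being stuck in traffic", "The movie wasn't bad"]
print(find_disagreements(demo, always_neutral, counts_love))
```

In the notebook, the stand-ins could be replaced with `lambda s: TextBlob(s).sentiment.polarity` and `get_vader_sentiment`.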

## Rule-based Sentiment Analysis

Lists of positive and negative words can be obtained from a variety of sources. The AFINN word list is a well-known list of words with sentiment scores that is commonly used in text analysis. WordNet synsets can also serve as a source of positive and negative words (SentiWordNet, for example, attaches sentiment scores to them). Below are two options that are available through TextBlob and VADER.

In [11]:
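from textblob.wordnet import Synset

# Lemmas (word forms) of the WordNet synsets for "good" and "bad"
positive_lemmas = Synset('good.a.01').lemmas()
negative_lemmas = Synset('bad.a.01').lemmas()

positive_words = [lemma.name() for lemma in positive_lemmas]
negative_words = [lemma.name() for lemma in negative_lemmas]

# VADER's lexicon is a dict mapping each word to a valence score; split it by sign
analyzer = SentimentIntensityAnalyzer()
vader_positive_words = [word for word, score in analyzer.lexicon.items() if score > 0]
vader_negative_words = [word for word, score in analyzer.lexicon.items() if score < 0]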

Let's try some heuristics for classifying text as negative, positive, or neutral. In addition to counting positive and negative words, we look for shifter words (such as "but" or "although") that can signal a change from, say, positive to negative sentiment.

In [12]:
import re

positive_words = ['happy', 'joy', 'love', 'great']
negative_words = ['angry', 'hate', 'sad', 'bad']
positive_shifters = ['although', 'but', 'despite', 'even though', 'however', 'nonetheless', 'regardless', 'still', 'though', 'yet']
negative_shifters = ['albeit', 'although', 'despite', 'even though', 'however', 'in spite of', 'nevertheless', 'nonetheless', 'though', 'while', 'yet']

def classify_sentiment(text):
    # Remove special characters and lowercase the text
    text = re.sub(r'[^a-zA-Z0-9\s]', '', text)
    text = text.lower()

    # tokenize
    words = nltk.word_tokenize(text)

    # Initialize sentiment score
    sentiment_score = 0

    # Look for sentiment-shifters and adjust sentiment score
    for i in range(len(words)):
        if words[i] in positive_shifters and i+1 < len(words) and words[i+1] in negative_words:
            sentiment_score -= 1
        elif words[i] in negative_shifters and i+1 < len(words) and words[i+1] in positive_words:
            sentiment_score += 1
        elif words[i] in positive_words:
            sentiment_score += 1
        elif words[i] in negative_words:
            sentiment_score -= 1

    # Classify sentiment based on sentiment score
    if sentiment_score > 0:
        return 'positive'
    elif sentiment_score < 0:
        return 'negative'
    else:
        return 'neutral'


In [13]:
# Example sentences
sentences = ["I hate rainy days, they make me sad.",
             "Although it's cold outside, I'm still happy.",
             "In spite of his mistakes, I still love him."]

# Create the VADER analyzer once and reuse it for every sentence
sid = SentimentIntensityAnalyzer()

for sentence in sentences:
    # Vader sentiment analysis
    vader_score = sid.polarity_scores(sentence)['compound']
    if vader_score > 0:
        vader_sentiment = 'positive'
    elif vader_score < 0:
        vader_sentiment = 'negative'
    else:
        vader_sentiment = 'neutral'
    print("Vader: ", vader_sentiment)

    # Custom sentiment analysis
    custom_sentiment = classify_sentiment(sentence)
    print("Custom: ", custom_sentiment)

Vader:  negative
Custom:  negative
Vader:  positive
Custom:  positive
Vader:  negative
Custom:  positive
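
One limitation of matching shifters token by token is that multi-word shifters such as "in spite of" or "even though" can never equal a single token. The sketch below shows one possible way to detect shifter phrases with a regular expression instead. `SHIFTERS`, `SHIFTER_RE`, and `find_shifters` are hypothetical names, and the phrase list is a subset of the lists used earlier.

```python
import re

# Subset of the shifter lists; sorted longest first so longer phrases are tried first
SHIFTERS = sorted(
    ["in spite of", "even though", "although", "despite", "however", "though", "yet"],
    key=len,
    reverse=True,
)
# \b word boundaries prevent "though" from matching inside "although"
SHIFTER_RE = re.compile(r"\b(" + "|".join(re.escape(s) for s in SHIFTERS) + r")\b")

def find_shifters(text):
    """Return the shifter phrases found in the text, in order of appearance."""
    return SHIFTER_RE.findall(text.lower())

print(find_shifters("In spite of his mistakes, I still love him."))
print(find_shifters("Even though it rained, we had fun."))
print(find_shifters("Although it's cold outside, I'm still happy."))
```

The positions of the matched phrases could then be used to decide which sentiment words fall inside the scope of a shifter.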


## Features for Classification

Until now, we have mostly relied on bag-of-words (BOW) or distributed representations as features for classification. However, we are not limited to these types of features. Other features can be extracted from text for classification, including:

- Count of positive words
- Count of negative words
- Count of adjectives
- Count of adverbs
- Polarity score
- Count of positive bigrams
- Count of negative bigrams
- Bigram polarity score
- Presence of named entity
- etc.

Below, we extract some of those features.

In [14]:
import nltk
from nltk import pos_tag
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import stopwords
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder
from sklearn.feature_extraction.text import CountVectorizer
import string

# initialize sentiment analyzer
analyzer = SentimentIntensityAnalyzer()

def extract_features(text):
    # Tokenize text
    tokens = word_tokenize(text.lower())

    # Get parts of speech for each token
    pos_tags = pos_tag(tokens)

    # Count occurrences of positive and negative words
    positive_words = 0
    negative_words = 0
    for word in tokens:
        sentiment = analyzer.polarity_scores(word)
        if sentiment['compound'] >= 0.5:
            positive_words += 1
        elif sentiment['compound'] <= -0.5:
            negative_words += 1

    # Count occurrences of adjectives and adverbs
    adj_count = 0
    adv_count = 0
    for word, pos in pos_tags:
        if pos == 'JJ':
            adj_count += 1
        elif pos == 'RB':
            adv_count += 1

    # Calculate polarity score
    polarity_score = analyzer.polarity_scores(text)['compound']

    # Check for named entities
    named_entities = nltk.ne_chunk(pos_tag(word_tokenize(text)), binary=False)
    named_entity_presence = False
    for chunk in named_entities:
        if hasattr(chunk, 'label') and chunk.label() in ['PERSON', 'ORGANIZATION', 'LOCATION']:
            named_entity_presence = True

    # Combine features into a single feature vector
    feature_vector = [positive_words, negative_words, adj_count, adv_count, polarity_score, named_entity_presence]

    return feature_vector


In [15]:
# Create a new DataFrame with the extracted features
feature_vectors = []
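for index, row in data.iterrows():
    # Access columns by label; positional access on a labeled row is deprecated in recent pandas
    text = row['text']
    feature_vector = extract_features(text)
    class_label = row['class']
    feature_vector.append(class_label)
    feature_vectors.append(feature_vector)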

feature_names = ['positive_words', 'negative_words', 'adj_count', 'adv_count', 'polarity_score', 'named_entity_presence', 'class_label']
feature_df = pd.DataFrame(feature_vectors, columns=feature_names)

feature_df.head()

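| | positive_words | negative_words | adj_count | adv_count | polarity_score | named_entity_presence | class_label |
|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 4 | 3 | -0.4215 | False | 0 |
| 1 | 0 | 0 | 3 | 2 | -0.5507 | False | 0 |
| 2 | 0 | 0 | 7 | 2 | -0.7178 | False | 0 |
| 3 | 0 | 0 | 1 | 1 | 0.0 | False | 0 |
| 4 | 1 | 0 | 0 | 0 | 0.6369 | True | 1 |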
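
The bigram features from the list are not part of the extracted feature vector. The sketch below shows one possible definition, under stated assumptions: a bigram's score is the sum of its two word scores, except that a negator in first position flips the score of the word that follows. The `WORD_SCORES` and `NEGATORS` values are invented for illustration, and `bigram_features` is a hypothetical helper.

```python
# Hypothetical word scores and negators, invented for illustration
WORD_SCORES = {"good": 1.0, "best": 1.0, "bad": -1.0, "slow": -0.5}
NEGATORS = {"not", "never", "no"}

def bigram_features(tokens):
    """Return (positive_bigrams, negative_bigrams, bigram_polarity) for a token list."""
    pos = neg = 0
    total = 0.0
    for first, second in zip(tokens, tokens[1:]):
        if first in NEGATORS:
            # A negator flips the polarity of the word that follows it
            score = -WORD_SCORES.get(second, 0.0)
        else:
            score = WORD_SCORES.get(first, 0.0) + WORD_SCORES.get(second, 0.0)
        if score > 0:
            pos += 1
        elif score < 0:
            neg += 1
        total += score
    return pos, neg, total

print(bigram_features("the movie was not good".split()))
print(bigram_features("the best scene".split()))
```

Because every inner word belongs to two bigrams, a sentiment word contributes to two bigram scores under this definition; a real feature set might normalize for that.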


**Activity: See if you can build a Random Forest classifier that classifies samples as positive or negative using the features we extracted.**

In [16]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(feature_df[['positive_words',
                                                                'negative_words',
                                                                'adj_count',
                                                                'adv_count',
                                                                'polarity_score',
                                                                'named_entity_presence']], data['class'], test_size=0.2, random_state=1)
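
As a starting point for the activity, here is a minimal sketch of fitting and scoring a Random Forest. It runs on a small synthetic feature table rather than on `feature_df`, so the data, the feature names, and the resulting accuracy are illustrative assumptions and say nothing about the review data.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the feature table: one informative and one noise feature
rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=200)
polarity = np.where(labels == 1, 0.5, -0.5) + rng.normal(0, 0.2, size=200)
adj_count = rng.integers(0, 6, size=200)  # unrelated to the label
features = np.column_stack([polarity, adj_count])

X_tr, X_te, y_tr, y_te = train_test_split(features, labels, test_size=0.2, random_state=1)

# Fit the forest on the training split and evaluate on the held-out split
clf = RandomForestClassifier(n_estimators=100, random_state=1)
clf.fit(X_tr, y_tr)
accuracy = accuracy_score(y_te, clf.predict(X_te))
print(f"accuracy on synthetic data: {accuracy:.3f}")
print("feature importances:", clf.feature_importances_)
```

In the notebook, the same `fit` and `predict` calls would be applied to `X_train`, `y_train`, and `X_test`, and the feature importances can hint at which extracted features carry the signal.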


## Example DL-based Solution

In [17]:
!pip install torch
!pip install transformers



In [18]:
import torch
from transformers import DistilBertTokenizer, DistilBertForSequenceClassification

# Load the pre-trained DistilBert model and tokenizer
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
model = DistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased')


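# Freeze the first three transformer layers; the embeddings, the remaining layers,
# and the classification head stay trainable
for i in range(3):
    for param in model.distilbert.transformer.layer[i].parameters():
        param.requires_grad = False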

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [19]:
# Prepare the training data (X is a list of input sequences, y is a list of labels)
# convert the first two columns to lists
X = data.iloc[:,0].tolist()
y = data.iloc[:,1].tolist()

In [None]:
batch_size = 8
# Ceiling division, so no empty trailing batch when len(X) is a multiple of batch_size
n_batches = (len(X) + batch_size - 1) // batch_size

# Define the binary cross-entropy loss function
criterion = torch.nn.BCELoss()
# Define the optimizer for fine-tuning (only parameters that are not frozen)
optimizer = torch.optim.Adam(filter(lambda p: p.requires_grad, model.parameters()), lr=2e-5)

# from_pretrained() returns the model in evaluation mode; enable dropout for training
model.train()

for epoch in range(10):
    epoch_loss = 0.0
    epoch_accuracy = 0.0
    for i in range(n_batches):
        # Get the batch
        batch_X = X[i*batch_size:(i+1)*batch_size]
        batch_y = y[i*batch_size:(i+1)*batch_size]
        # Tokenize the batch and convert it to PyTorch tensors
        inputs = tokenizer(batch_X, padding=True, truncation=True, return_tensors="pt")
        # Forward pass: logits of shape (batch, 2), squashed to (0, 1) per class
        logits = model(inputs['input_ids'], attention_mask=inputs['attention_mask'])[0]
        probs = torch.sigmoid(logits)
        # One-hot encode the labels by selecting rows of a 2x2 identity matrix
        labels = torch.tensor(batch_y).long()
        targets = torch.eye(2).index_select(dim=0, index=labels)
        # Compute the loss
        loss = criterion(probs, targets)
        epoch_loss += loss.item()
        # Backward pass and update parameters
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        # Compute accuracy: the predicted class is the one with the higher probability
        preds = probs.argmax(dim=1)
        epoch_accuracy += torch.sum(preds == labels).item()
    # Print loss and accuracy for the epoch
    epoch_loss /= n_batches
    epoch_accuracy /= len(X)
    print(f"Epoch {epoch+1}: loss={epoch_loss:.4f}, accuracy={epoch_accuracy:.4f}")



Epoch 1: loss=0.5215, accuracy=0.7460
Epoch 2: loss=0.1607, accuracy=0.9559
Epoch 3: loss=0.0830, accuracy=0.9799
Epoch 4: loss=0.0482, accuracy=0.9880
Epoch 5: loss=0.0198, accuracy=0.9973
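
A detail that is easy to get wrong when slicing batches by hand: `len(X) // batch_size + 1` counts one batch too many whenever the dataset size is an exact multiple of the batch size, and the extra slice is empty. The sketch below shows a small generator that avoids this. `iter_batches` is a hypothetical helper, not part of PyTorch.

```python
def iter_batches(items, batch_size):
    """Yield consecutive slices of at most batch_size items, never an empty one."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

# 16 items split evenly into two batches; 18 items leave a partial last batch
print([len(b) for b in iter_batches(list(range(16)), 8)])
print([len(b) for b in iter_batches(list(range(18)), 8)])

# By contrast, floor division plus one reports three batches for 16 items
print(16 // 8 + 1)
```

For larger experiments, `torch.utils.data.DataLoader` handles batching and shuffling, which the manual loop does not do.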
