# Sentiment Analysis in Natural Language Processing

## 1. Introduction to Sentiment Analysis

- **Definition**: Sentiment analysis is the process of determining the emotional tone behind a body of text. It's used to understand opinions, attitudes, and emotions expressed in text. Texts are usually placed in classification categories like **positive**, **negative**, or **neutral** sentiment.
- **Real-World Examples**:
  - Reviews: Analyzing movie, product, or restaurant reviews to understand customer sentiment.
  - Social Media: Tracking sentiment in tweets or posts to gauge public opinion on events or brands.
- **Importance**: Sentiment analysis is widely used in business to understand customer feedback, optimize marketing strategies, and improve products or services.

- **Techniques Covered in This Lecture**:
    - **Rule-Based**: Using predefined rules to identify sentiment.
    - **Lexicon-Based**: Using predefined word dictionaries (lexicons) to assign sentiment scores.
    - **Machine Learning-Based**: Using supervised learning models like Logistic Regression on vectorized text features such as bag-of-words counts or TF-IDF weights.

## 2. Techniques to Implement a Sentiment Analyzer

### I. Data Loading and Preprocessing
  - Typically, reviews or social media data are mined for sentiment scores and classification. Data is stored in separate categories, such as "positive" and "negative".
  - [IMDB Dataset of Movie Reviews](https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews) from Kaggle

In [1]:
# Install dependencies as needed:
# pip install kagglehub[pandas-datasets]
import kagglehub
from kagglehub import KaggleDatasetAdapter

# Set the path to the file you'd like to load
file_path = "IMDB Dataset.csv"

# Load the latest version
imdb_data = kagglehub.load_dataset(
  KaggleDatasetAdapter.PANDAS,
  "lakshmi25npathi/imdb-dataset-of-50k-movie-reviews",
  file_path
)

print("First 5 records:", imdb_data.head())


Using Colab cache for faster access to the 'imdb-dataset-of-50k-movie-reviews' dataset.
First 5 records:                                               review sentiment
0  One of the other reviewers has mentioned that ...  positive
1  A wonderful little production. <br /><br />The...  positive
2  I thought this was a wonderful way to spend ti...  positive
3  Basically there's a family where a little boy ...  negative
4  Petter Mattei's "Love in the Time of Money" is...  positive
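The sample output above shows raw HTML line breaks (`<br /><br />`) inside the review text. A minimal cleaning sketch follows, assuming a hypothetical helper named `clean_review` (not part of the original notebook) and a made-up input string. It strips tags and normalizes whitespace, one plausible preprocessing step before any of the analyzers below are applied.

```python
import html
import re

# Hypothetical helper (not in the original notebook): removes HTML tags such as
# <br /> and collapses runs of whitespace into single spaces.
TAG_PATTERN = re.compile(r"<[^>]+>")

def clean_review(text):
    text = html.unescape(text)          # decode entities such as &amp;
    text = TAG_PATTERN.sub(" ", text)   # replace each tag with a space
    return re.sub(r"\s+", " ", text).strip()

# Made-up example string, not taken from the dataset.
print(clean_review("Great plot.<br /><br />Weak ending &amp; pacing."))
# Great plot. Weak ending & pacing.
```

With pandas, the helper could be applied column-wise, for example `imdb_data["review"].apply(clean_review)`.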


- Understanding the distribution

In [2]:
# Check sentiment distribution
imdb_data['sentiment'].value_counts()

sentiment,count
positive,25000
negative,25000


In [3]:
sampled_reviews = imdb_data.groupby('sentiment').sample(n=5000, random_state=4)
sampled_reviews['sentiment'].value_counts()

sentiment,count
negative,5000
positive,5000


### II. Rule-Based Sentiment Analysis: User-Built
- A predefined sentiment lexicon is used to determine if a text is positive or negative.
- **Key Topics**:
  - Basic rule-based systems work by looking for specific words in the text and labeling them with either a categorical marker (e.g. "positive/negative") or a numeric value.
  - **Limitations**: Rule-based methods are simple but lack flexibility for complex language patterns, negations, and sarcasm.

In [4]:
# Rule-Based Sentiment Analyzer
def rule_sentiment_analyzer(review):
    # Example lexicon with sentiment scores
    lexicon = {
        "good": 1,
        "great": 2,
        "excellent": 3,
        "bad": -2,
        "poor": -3,
        "terrible": -5
    }
    score = 0

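    # Lowercase for case-insensitive matching, then split on whitespace.
    # Note: punctuation stays attached to tokens, so "bad," or "great!" will not match.
    words = review.lower().split()
    for word in words:
        if word in lexicon:
            score += lexicon[word]
    # A score of 0 (no lexicon hits, or hits that cancel out) falls through to "negative".
    return "positive" if score > 0 else "negative"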

test_reviews = ["This is a terrible example of a bad movie!", "I'm excellent now that I've walked out of the movie."]

print("Rule-Based Sentiment Analysis Results:")
for review in test_reviews:
    print(f"Review: {review}\nSentiment: {rule_sentiment_analyzer(review)}\n")

Rule-Based Sentiment Analysis Results:
Review: This is a terrible example of a bad movie!
Sentiment: negative

Review: I'm excellent now that I've walked out of the movie.
Sentiment: positive
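
Two of the limitations listed above, attached punctuation and negation, can be partly addressed with small changes. The sketch below is an illustrative variant, not the notebook's method: the function name `rule_sentiment_with_negation` and the `NEGATORS` set are assumptions introduced here, and the lexicon is the same toy lexicon used by `rule_sentiment_analyzer`.

```python
import re

# Same toy lexicon as the analyzer above.
LEXICON = {"good": 1, "great": 2, "excellent": 3, "bad": -2, "poor": -3, "terrible": -5}
# Assumption: a small, illustrative set of negators.
NEGATORS = {"not", "no", "never"}

def rule_sentiment_with_negation(review):
    # Extract alphabetic tokens so trailing punctuation ("bad," or "great!") is ignored.
    tokens = re.findall(r"[a-z']+", review.lower())
    score = 0
    for i, token in enumerate(tokens):
        if token in LEXICON:
            value = LEXICON[token]
            # Flip the sign when the immediately preceding token is a negator.
            if i > 0 and tokens[i - 1] in NEGATORS:
                value = -value
            score += value
    return "positive" if score > 0 else "negative"

print(rule_sentiment_with_negation("Not bad, honestly."))     # positive
print(rule_sentiment_with_negation("The acting was great!"))  # positive
print(rule_sentiment_with_negation("This was not good"))      # negative
```

Sarcasm, as in the second test review above, is still misclassified by this variant; only the immediately preceding token is checked for negation.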



In [5]:
# Apply rule_sentiment_analyzer to reviews and create new column
sampled_reviews["rule_based_sent"] = sampled_reviews["review"].apply(rule_sentiment_analyzer)
sampled_reviews

index,review,sentiment,rule_based_sent
3942,"OK, to start with, this movie was not at all l...",negative,positive
41815,"CQ could have been good, campy fun. But it com...",negative,negative
26637,"If you rent a movie titled ""Exterminators of t...",negative,negative
1898,As much as I dislike saying 'me too' in respon...,negative,positive
30362,"I was prepared for a bad movie, and a bad movi...",negative,negative
...,...,...,...
44495,Lawrence Olivier and Merle Oberon did two movi...,positive,positive
24206,The plot: A crime lord is uniting 3 different ...,positive,negative
10068,"Like many, I first saw The Water Babies as a c...",positive,negative
12673,"This is screamingly funny (well, except when B...",positive,positive


### III. Lexicon-Based Sentiment Analysis: TextBlob
- **TextBlob** is a lexicon-based tool that assigns polarity and subjectivity scores to text. Here it is accessed through the `spacytextblob` spaCy pipeline component.
  - **Polarity**: Measures how positive or negative a statement is (range: -1 to 1).
  - **Subjectivity**: Measures how much personal opinion is present (range: 0 to 1).

In [6]:
!pip install spacytextblob



In [7]:
import spacy
from spacytextblob.spacytextblob import SpacyTextBlob

nlp = spacy.load('en_core_web_sm')
text = "I had a really horrible day. It was the worst day ever! But every now and then I have a really good day that makes me happy."
nlp.add_pipe("spacytextblob")
doc = nlp(text)

print(doc._.blob.polarity)
# -0.125

print(doc._.blob.subjectivity)
# 0.9

print(doc._.blob.sentiment_assessments.assessments)
# [(['really', 'horrible'], -1.0, 1.0, None), (['worst', '!'], -1.0, 1.0, None), (['really', 'good'], 0.7, 0.6000000000000001, None), (['happy'], 0.8, 1.0, None)]

-0.125
0.9
[(['really', 'horrible'], -1.0, 1.0, None), (['worst', '!'], -1.0, 1.0, None), (['really', 'good'], 0.7, 0.6000000000000001, None), (['happy'], 0.8, 1.0, None)]
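
The overall scores reported above are consistent with a simple average of the per-phrase assessments. The short check below copies the four assessments from the output and averages them; it is a sanity check on the printed numbers, not a reimplementation of TextBlob.

```python
from statistics import mean

# Assessments copied from the output above: (tokens, polarity, subjectivity, label)
assessments = [
    (["really", "horrible"], -1.0, 1.0, None),
    (["worst", "!"], -1.0, 1.0, None),
    (["really", "good"], 0.7, 0.6000000000000001, None),
    (["happy"], 0.8, 1.0, None),
]

polarity = mean(a[1] for a in assessments)
subjectivity = mean(a[2] for a in assessments)

print(round(polarity, 3))      # -0.125
print(round(subjectivity, 3))  # 0.9
```

Because the two strongly negative phrases outweigh the two milder positive ones, the text as a whole comes out slightly negative.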


- Build a custom function that gets TextBlob's sentiment score

In [8]:
nlp = spacy.load('en_core_web_sm')
nlp.add_pipe("spacytextblob")

# Create a textblob_sentiment_analyzer function
def textblob_sentiment_analyzer(review):
    # Run the spaCy pipeline; spacytextblob attaches a TextBlob object at doc._.blob
    doc = nlp(review)
    sentiment = doc._.blob.polarity

    # Interpret the polarity score (-1 to 1); a polarity of exactly 0 is labeled "positive"
    if sentiment >= 0:
        return "positive"
    else:
        return "negative"

In [9]:
# Apply textblob_sentiment_analyzer to reviews
sampled_reviews["textblob_sent"] = sampled_reviews["review"].apply(textblob_sentiment_analyzer)
sampled_reviews

index,review,sentiment,rule_based_sent,textblob_sent
3942,"OK, to start with, this movie was not at all l...",negative,positive,negative
41815,"CQ could have been good, campy fun. But it com...",negative,negative,positive
26637,"If you rent a movie titled ""Exterminators of t...",negative,negative,negative
1898,As much as I dislike saying 'me too' in respon...,negative,positive,positive
30362,"I was prepared for a bad movie, and a bad movi...",negative,negative,positive
...,...,...,...,...
44495,Lawrence Olivier and Merle Oberon did two movi...,positive,positive,positive
24206,The plot: A crime lord is uniting 3 different ...,positive,negative,positive
10068,"Like many, I first saw The Water Babies as a c...",positive,negative,positive
12673,"This is screamingly funny (well, except when B...",positive,positive,positive


### IV. Machine Learning-Based Sentiment Analysis
 - Traditional models like **Logistic Regression** use labeled training data and vectorized text features (e.g., bag of words).
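
To make "bag of words" concrete, here is a minimal pure-Python sketch on three made-up documents. Each document becomes a vector of token counts over a shared vocabulary. scikit-learn's `CountVectorizer` produces the same kind of representation at scale, with its own tokenization rules and sparse output.

```python
from collections import Counter

# Toy documents, made up for illustration.
docs = ["good movie", "bad movie", "good good plot"]

# Build a sorted vocabulary from all whitespace-separated tokens.
vocabulary = sorted({token for doc in docs for token in doc.split()})

def bag_of_words(doc):
    counts = Counter(doc.split())
    # One count per vocabulary term, in vocabulary order.
    return [counts[term] for term in vocabulary]

print(vocabulary)  # ['bad', 'good', 'movie', 'plot']
for doc in docs:
    print(bag_of_words(doc))
# [0, 1, 1, 0]
# [1, 0, 1, 0]
# [0, 2, 0, 1]
```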

In [None]:
### ONLY BUILD A ML MODEL WITH A SELECT FEW EXAMPLES!!!!
'''
from sklearn.model_selection import train_test_split

# Split data into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(X_train.shape)
print(X_test.shape)
'''

In [10]:
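# Build a logistic regression model on bag-of-words counts, with the sentiment label as the target
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Features and labels from the full 50,000-review dataset.
# Note: sampled_reviews is a subset of imdb_data, so later predictions on it are on training data.
X = imdb_data['review']
y = imdb_data['sentiment']

# Vectorize the text into a sparse matrix of token counts
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(X)

# Train the model (the default max_iter=100 triggers the convergence warning shown below)
model = LogisticRegression()
model.fit(X, y)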

STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


In [11]:
# Predict sentiment of new text
input_text = ["That movie is amazing!"]
input_features = vectorizer.transform(input_text)
prediction = model.predict(input_features)[0]
print(f"Machine Learning Sentiment Prediction: {prediction}")

Machine Learning Sentiment Prediction: positive


In [12]:
# Apply logistic regression model to reviews
reviews = vectorizer.transform(sampled_reviews["review"])
sampled_reviews["log_sent"] = model.predict(reviews)
sampled_reviews

index,review,sentiment,rule_based_sent,textblob_sent,log_sent
3942,"OK, to start with, this movie was not at all l...",negative,positive,negative,negative
41815,"CQ could have been good, campy fun. But it com...",negative,negative,positive,negative
26637,"If you rent a movie titled ""Exterminators of t...",negative,negative,negative,negative
1898,As much as I dislike saying 'me too' in respon...,negative,positive,positive,negative
30362,"I was prepared for a bad movie, and a bad movi...",negative,negative,positive,negative
...,...,...,...,...,...
44495,Lawrence Olivier and Merle Oberon did two movi...,positive,positive,positive,positive
24206,The plot: A crime lord is uniting 3 different ...,positive,negative,positive,positive
10068,"Like many, I first saw The Water Babies as a c...",positive,negative,positive,positive
12673,"This is screamingly funny (well, except when B...",positive,positive,positive,positive
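
Because the model above was fit on all of `imdb_data`, and `sampled_reviews` was drawn from that same data, the `log_sent` predictions are made on reviews the model has already seen. A fairer comparison holds out a test set before fitting. The sketch below shows that pattern on a made-up toy dataset. It also swaps in `TfidfVectorizer` (the notebook uses `CountVectorizer`) and raises `max_iter`; both are choices made for this illustration only.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Toy data, made up for illustration; in the notebook these would be
# imdb_data['review'] and imdb_data['sentiment'].
texts = [
    "great film", "excellent acting", "good story", "wonderful cast",
    "loved every minute", "a great and moving story",
    "terrible film", "poor acting", "bad story", "awful cast",
    "hated every minute", "a bad and boring story",
]
labels = ["positive"] * 6 + ["negative"] * 6

# Hold out a stratified test set BEFORE fitting anything.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=4, stratify=labels, random_state=42
)

# The pipeline fits the vectorizer on training text only; max_iter is raised
# to give the solver more room to converge.
clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)
print(f"Held-out accuracy: {accuracy_score(y_test, y_pred):.2%}")
```

Wrapping the vectorizer and classifier in one pipeline keeps the vocabulary from being learned on test reviews, which is a second, subtler form of leakage.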


## 3. Evaluate Each Model Using a Classification Report
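
A classification report summarizes precision, recall, and F1 for each class. As a reading aid, the sketch below computes those three metrics by hand for one class; the helper name `per_class_metrics` and the toy labels are assumptions made for illustration. scikit-learn's `classification_report` reports the same quantities for every class at once.

```python
def per_class_metrics(y_true, y_pred, label):
    # Count true positives, false positives, and false negatives for one class.
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy labels, made up for illustration.
y_true = ["positive", "positive", "negative", "negative", "negative"]
y_pred = ["positive", "negative", "negative", "positive", "negative"]

precision, recall, f1 = per_class_metrics(y_true, y_pred, "positive")
print(precision, recall, f1)  # 0.5 0.5 0.5
```

Precision answers "of the reviews labeled positive, how many were truly positive?", while recall answers "of the truly positive reviews, how many were found?". A method can score high on one and low on the other.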



In [13]:
import numpy as np
from sklearn.metrics import classification_report, accuracy_score
import seaborn as sns
import matplotlib.pyplot as plt

# Get true labels
y_true = sampled_reviews['sentiment']

# Rule-Based Method
print("\n1. RULE-BASED METHOD")
print("-" * 30)
y_pred = sampled_reviews['rule_based_sent']
print(f"Accuracy: {accuracy_score(y_true, y_pred):.2%}")
print(classification_report(y_true, y_pred))

# TextBlob Method
print("\n2. TEXTBLOB METHOD")
print("-" * 30)
y_pred = sampled_reviews['textblob_sent']
print(f"Accuracy: {accuracy_score(y_true, y_pred):.2%}")
print(classification_report(y_true, y_pred))

# Logistic Regression Method
print("\n3. LOGISTIC REGRESSION METHOD")
print("-" * 30)
y_pred = sampled_reviews['log_sent']
print(f"Accuracy: {accuracy_score(y_true, y_pred):.2%}")
print(classification_report(y_true, y_pred))


1. RULE-BASED METHOD
------------------------------
Accuracy: 61.44%
              precision    recall  f1-score   support

    negative       0.59      0.73      0.65      5000
    positive       0.65      0.50      0.56      5000

    accuracy                           0.61     10000
   macro avg       0.62      0.61      0.61     10000
weighted avg       0.62      0.61      0.61     10000


2. TEXTBLOB METHOD
------------------------------
Accuracy: 68.86%
              precision    recall  f1-score   support

    negative       0.90      0.43      0.58      5000
    positive       0.62      0.95      0.75      5000

    accuracy                           0.69     10000
   macro avg       0.76      0.69      0.67     10000
weighted avg       0.76      0.69      0.67     10000


3. LOGISTIC REGRESSION METHOD
------------------------------
Accuracy: 95.80%
              precision    recall  f1-score   support

    negative       0.96      0.96      0.96      5000
    positive       0