# VADER LEXICON

VADER (Valence Aware Dictionary and sEntiment Reasoner) is a lexicon-based sentiment analysis tool designed for social media text, using a predefined dictionary of words with associated sentiment scores. It is particularly effective for short, informal text and can gauge the intensity of sentiment while accounting for negations and emoticons.
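
To make the lexicon idea concrete, the sketch below scores a sentence with a tiny hand-made lexicon. Everything in it is an assumption for illustration: `TOY_LEXICON`, `NEGATIONS`, `normalize` and `toy_compound` are hypothetical names, and the constants (a valence of 3.2 for "love", a 0.292 boost per exclamation mark, a -0.74 negation multiplier, `alpha=15`) are chosen to resemble VADER's heuristics as I understand them, not to reproduce its implementation.

```python
import math

# Hypothetical mini-lexicon; VADER's real lexicon rates thousands of words and emoticons.
TOY_LEXICON = {"love": 3.2, "good": 1.9, "terrible": -2.1}
NEGATIONS = {"not", "never", "no"}


def normalize(score, alpha=15):
    """Squash an unbounded valence sum into the open interval (-1, 1)."""
    return score / math.sqrt(score * score + alpha)


def toy_compound(text):
    """Sum word valences, flip negated words, and boost for exclamation marks."""
    words = [w.strip(".,!?").lower() for w in text.split()]
    total = 0.0
    for i, word in enumerate(words):
        valence = TOY_LEXICON.get(word, 0.0)
        if i > 0 and words[i - 1] in NEGATIONS:
            valence *= -0.74  # a preceding negation reverses and dampens the word
        total += valence
    # Each "!" (capped at four) pushes a non-zero total further from zero.
    boost = 0.292 * min(text.count("!"), 4)
    if total > 0:
        total += boost
    elif total < 0:
        total -= boost
    return normalize(total)


print(round(toy_compound("I love this product!"), 4))
print(round(toy_compound("The food was not good."), 4))
```

With these assumed constants the first sentence scores 0.6696, and the negated sentence comes out below zero even though "good" is a positive word.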

In [1]:
import pandas as pd
from nltk.sentiment import SentimentIntensityAnalyzer
import nltk



In [2]:
df = pd.read_csv(r"C:\Users\*****\Downloads\Customer Feedback Dataset - Ideal Output.csv")
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 96 entries, 0 to 95
Data columns (total 7 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Text              96 non-null     object 
 1   Source            96 non-null     object 
 2   Date_Time         96 non-null     object 
 3   User_ID           96 non-null     object 
 4   Location          96 non-null     object 
 5   Sentiment         96 non-null     object 
 6   Confidence_Score  96 non-null     float64
dtypes: float64(1), object(6)
memory usage: 5.4+ KB


In [3]:
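# Note: drop_duplicates() returns a new DataFrame and the result is not assigned here,
# so df itself is unchanged (hence the identical info() output below).
df.drop_duplicates()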
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 96 entries, 0 to 95
Data columns (total 7 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Text              96 non-null     object 
 1   Source            96 non-null     object 
 2   Date_Time         96 non-null     object 
 3   User_ID           96 non-null     object 
 4   Location          96 non-null     object 
 5   Sentiment         96 non-null     object 
 6   Confidence_Score  96 non-null     float64
dtypes: float64(1), object(6)
memory usage: 5.4+ KB


### Since we are focusing on sentiment analysis, we skip data cleaning for now.

In [4]:
# Download VADER Lexicon
nltk.download('vader_lexicon')

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     C:\Users\*****\AppData\Roaming\nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


True

In [5]:
sia = SentimentIntensityAnalyzer()

In [6]:
# Function to get sentiment and confidence score
def get_sentiment(text):
    scores = sia.polarity_scores(text)
    compound = scores['compound']
    
    # Assign sentiment label
    if compound >= 0.05:
        sentiment = "Positive"
    elif compound <= -0.05:
        sentiment = "Negative"
    else:
        sentiment = "Neutral"
    
    # Confidence score (absolute value of compound)
    confidence = abs(compound)
    
    return sentiment, confidence
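
The labelling step can be tried on its own, without loading the analyzer. In this minimal sketch `label_from_compound` is a hypothetical helper that mirrors the ±0.05 cut-offs used in `get_sentiment`; the sample scores are hand-picked inputs, not model output.

```python
def label_from_compound(compound, threshold=0.05):
    """Map a compound score in [-1, 1] to a label using symmetric cut-offs."""
    if compound >= threshold:
        return "Positive"
    if compound <= -threshold:
        return "Negative"
    return "Neutral"


# Hand-picked scores covering each branch, including one just inside the neutral band.
for score in (0.6696, -0.4767, 0.0, 0.04):
    print(score, label_from_compound(score), abs(score))
```

Note that `abs(compound)` measures how strongly the text leans one way; it is not a calibrated probability, so it is only loosely comparable to the dataset's `Confidence_Score`.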

In [7]:
df.head(10)

Unnamed: 0,Text,Source,Date_Time,User_ID,Location,Sentiment,Confidence_Score
0,I love this product!,Twitter,15-06-2023 09:23,@user123,New York,Positive,0.85
1,The service was terrible.,Yelp Reviews,15-06-2023 11:45,user456,Los Angeles,Negative,0.65
2,This movie is amazing!,IMDb,15-06-2023 14:10,moviefan789,London,Positive,0.92
3,I'm so disappointed with their customer support.,Online Forum,15-06-2023 17:35,forumuser1,Toronto,Negative,0.78
4,Just had the best meal of my life!,TripAdvisor,16-06-2023 08:50,foodie22,Paris,Positive,0.88
5,The quality of this product is subpar.,Amazon Reviews,16-06-2023 10:15,shopper123,San Francisco,Negative,0.72
6,I can't stop listening to this song. It's incr...,Spotify,16-06-2023 13:40,musiclover456,Berlin,Positive,0.91
7,Their website is so user-friendly. Love it!,Website Testimonial,16-06-2023 16:05,testimonialuser1,Sydney,Positive,0.87
8,I loved the movie! It was fantastic!,IMDb,02-07-2023 09:12,user123,New York,Positive,0.92
9,The customer service was terrible.,Yelp Reviews,02-07-2023 10:45,user456,Los Angeles,Negative,0.65


In [8]:
# Apply function to DataFrame
df[['New_Sentiment', 'New_Confidence']] = df['Text'].apply(lambda x: pd.Series(get_sentiment(x)))

In [9]:
# Display updated DataFrame
print(df.head())

                                               Text        Source  \
0                              I love this product!       Twitter   
1                         The service was terrible.  Yelp Reviews   
2                            This movie is amazing!          IMDb   
3  I'm so disappointed with their customer support.  Online Forum   
4                Just had the best meal of my life!   TripAdvisor   

          Date_Time      User_ID     Location Sentiment  Confidence_Score  \
0  15-06-2023 09:23     @user123     New York  Positive              0.85   
1  15-06-2023 11:45      user456  Los Angeles  Negative              0.65   
2  15-06-2023 14:10  moviefan789       London  Positive              0.92   
3  15-06-2023 17:35   forumuser1      Toronto  Negative              0.78   
4  16-06-2023 08:50     foodie22        Paris  Positive              0.88   

  New_Sentiment  New_Confidence  
0      Positive          0.6696  
1      Negative          0.4767  
2      Positive     

The 'Sentiment' column ships with the dataset and serves as our ideal output, whereas 'New_Sentiment' is the label produced by VADER.

In [10]:
df[df['Sentiment'] != df['New_Sentiment']][['Text','Sentiment','New_Sentiment']]

Unnamed: 0,Text,Sentiment,New_Sentiment
5,The quality of this product is subpar.,Negative,Neutral
21,The customer service at this store is top-notch.,Positive,Neutral
30,The product I ordered never arrived. Terrible ...,Negative,Positive
31,I can't get enough of this band. Their music i...,Positive,Neutral
36,The customer service at this store is exceptio...,Positive,Neutral
38,The live concert was electrifying! The band's ...,Positive,Negative
54,The concert was mind-blowing! The band knows h...,Positive,Neutral
59,The website loading speed is frustratingly slo...,Negative,Neutral
65,The product I ordered never arrived. Poor cust...,Negative,Positive
69,The website navigation is smooth and intuitive...,Positive,Neutral
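
To see which kinds of errors dominate, the mismatched rows can be tallied by (expected, predicted) pair. This is a small sketch using only the ten rows listed in the table above, typed in by hand; `mismatches` and `confusion` are illustrative names.

```python
from collections import Counter

# (expected, predicted) pairs copied from the ten mismatched rows listed above.
mismatches = [
    ("Negative", "Neutral"),   # row 5
    ("Positive", "Neutral"),   # row 21
    ("Negative", "Positive"),  # row 30
    ("Positive", "Neutral"),   # row 31
    ("Positive", "Neutral"),   # row 36
    ("Positive", "Negative"),  # row 38
    ("Positive", "Neutral"),   # row 54
    ("Negative", "Neutral"),   # row 59
    ("Negative", "Positive"),  # row 65
    ("Positive", "Neutral"),   # row 69
]

confusion = Counter(mismatches)
for (expected, predicted), count in confusion.most_common():
    print(f"{expected} -> {predicted}: {count}")

# How many of the listed errors come from VADER falling back to Neutral?
neutral_predictions = sum(c for (_, pred), c in confusion.items() if pred == "Neutral")
print("Mismatches predicted as Neutral:", neutral_predictions)
```

Most of the listed errors are Neutral predictions, which suggests the relevant words ("subpar", "top-notch") carry little or no weight in the lexicon, while the reference labels contain only Positive and Negative.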


In [11]:
df[df['Sentiment'] != df['New_Sentiment']]['Text'].nunique()

11

### 11 of the 96 texts are tagged incorrectly, so let's try something better than VADER.

# DistilBERT from Hugging Face

In [13]:
!pip install transformers torch pandas









In [14]:
!pip install tf-keras






In [16]:
import pandas as pd
from transformers import pipeline

In [17]:
# Load pre-trained sentiment analysis pipeline
sentiment_pipeline = pipeline("sentiment-analysis", model="distilbert-base-uncased-finetuned-sst-2-english")




Device set to use cpu


This is a DistilBERT model fine-tuned on the Stanford Sentiment Treebank (SST-2) dataset, which makes it suitable for binary sentiment classification (positive/negative). The model is available through the Hugging Face Transformers library and is downloaded automatically from the Hugging Face Model Hub the first time it is used with the `pipeline` function.

In [18]:
# Function to analyze sentiment
def get_sentiment(text):
    result = sentiment_pipeline(text[:512])[0]  # Keep the first 512 characters (a rough guard; DistilBERT's real limit is 512 tokens)
    sentiment = result['label']
    confidence = result['score']
    return sentiment, confidence
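
As far as I understand the pipeline, the `score` it returns for a two-class model like this one is the softmax probability of the winning class. The sketch below reproduces that step in plain Python; the logits are made-up numbers, `classify` is a hypothetical helper, and the `ID2LABEL` order is assumed to match the checkpoint's configuration.

```python
import math

# Label order assumed to match the SST-2 checkpoint's id2label mapping.
ID2LABEL = {0: "NEGATIVE", 1: "POSITIVE"}


def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(logits):
    """Return the most probable label and its probability, pipeline-style."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": ID2LABEL[best], "score": probs[best]}


print(classify([-2.0, 3.0]))  # hypothetical logits for a clearly positive text
print(classify([0.1, -0.1]))  # hypothetical logits for a borderline text
```

Because the score is a probability over just two classes, it never drops below 0.5, so a low-looking value such as 0.55 already signals an uncertain prediction.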

In [19]:
# Apply function to DataFrame
df[['New_Sentiment_1', 'New_Confidence_1']] = df['Text'].apply(lambda x: pd.Series(get_sentiment(str(x))))

In [20]:
# Display updated DataFrame
print(df.head())

                                               Text        Source  \
0                              I love this product!       Twitter   
1                         The service was terrible.  Yelp Reviews   
2                            This movie is amazing!          IMDb   
3  I'm so disappointed with their customer support.  Online Forum   
4                Just had the best meal of my life!   TripAdvisor   

          Date_Time      User_ID     Location Sentiment  Confidence_Score  \
0  15-06-2023 09:23     @user123     New York  Positive              0.85   
1  15-06-2023 11:45      user456  Los Angeles  Negative              0.65   
2  15-06-2023 14:10  moviefan789       London  Positive              0.92   
3  15-06-2023 17:35   forumuser1      Toronto  Negative              0.78   
4  16-06-2023 08:50     foodie22        Paris  Positive              0.88   

  New_Sentiment  New_Confidence New_Sentiment_1  New_Confidence_1  
0      Positive          0.6696        POSITIVE       

The 'Sentiment' column ships with the dataset and serves as our ideal output, whereas 'New_Sentiment_1' is the label produced by DistilBERT.

In [21]:
df[df['Sentiment'] != df['New_Sentiment_1']][['Text','Sentiment','New_Sentiment_1']]

Unnamed: 0,Text,Sentiment,New_Sentiment_1
0,I love this product!,Positive,POSITIVE
1,The service was terrible.,Negative,NEGATIVE
2,This movie is amazing!,Positive,POSITIVE
3,I'm so disappointed with their customer support.,Negative,NEGATIVE
4,Just had the best meal of my life!,Positive,POSITIVE
...,...,...,...
91,Just had the most amazing vacation! I can't wa...,Positive,POSITIVE
92,The food at this restaurant was awful. Never g...,Negative,NEGATIVE
93,I can't stop listening to this song. It's my n...,Positive,POSITIVE
94,Their website is so confusing and poorly desig...,Negative,NEGATIVE


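We have a case-sensitivity issue here: the model returns 'POSITIVE'/'NEGATIVE' while the dataset uses 'Positive'/'Negative', so every row is reported as a mismatch.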

In [22]:
df['New_Sentiment_1'] = df['New_Sentiment_1'].apply(lambda x: 'Positive' if x == 'POSITIVE' else 'Negative' if x == 'NEGATIVE' else x)

In [23]:
df[df['Sentiment'] != df['New_Sentiment_1']][['Text','Sentiment','New_Sentiment_1']]

Unnamed: 0,Text,Sentiment,New_Sentiment_1
20,This movie is a masterpiece! I was blown away.,Positive,Negative


Compared against the sentiment labels supplied with the dataset, VADER tagged 85 of the 96 texts correctly, while DistilBERT tagged 95 correctly.
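
Expressed as accuracy, the two counts work out as follows. This is plain arithmetic on the figures quoted above; `accuracy` is just an illustrative helper.

```python
def accuracy(correct, total):
    """Fraction of rows whose predicted label matches the reference label."""
    return correct / total


total_rows = 96
for name, correct in (("VADER", 85), ("DistilBERT", 95)):
    print(f"{name}: {correct}/{total_rows} = {accuracy(correct, total_rows):.1%}")
```

That is roughly 88.5% for VADER against 99.0% for DistilBERT on this 96-row dataset.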


# Use a Better Pretrained Model (RoBERTa)

### **What is RoBERTa?**  
RoBERTa (**Robustly Optimized BERT Pretraining Approach**) is an improved version of **BERT** developed by **Facebook AI**. It enhances BERT by removing the next sentence prediction (NSP) task, increasing training data, and using dynamic masking, leading to better performance on NLP tasks.

### **Is RoBERTa Better than DistilBERT for Sentiment Analysis?**  
Often, yes. **RoBERTa generally outperforms DistilBERT** on sentiment analysis because:  
✅ **More Accurate** – RoBERTa scores higher on many NLP benchmarks, including sentiment classification tasks.  
✅ **Better Context Understanding** – Its longer pretraining on more data helps it capture nuanced sentiment.  
✅ **Larger and More Robust** – RoBERTa-base keeps the full 12-layer BERT-style architecture, whereas DistilBERT is a distilled 6-layer model optimized for speed and size.  

However, **DistilBERT is much faster and lighter** while still providing good accuracy, making it a better choice for real-time or resource-constrained applications. If accuracy is the priority, **RoBERTa is better**; if speed and efficiency are crucial, **DistilBERT is preferable**.

In [25]:
from transformers import pipeline

# Load RoBERTa-based sentiment model (better than DistilBERT)
sentiment_pipeline = pipeline("sentiment-analysis", model="cardiffnlp/twitter-roberta-base-sentiment")

# Apply sentiment analysis (cast to str first, then keep the first 512 characters)
df[['New_Sentiment_2', 'New_Confidence_2']] = df['Text'].apply(lambda x: pd.Series(sentiment_pipeline(str(x)[:512])[0]))


Device set to use cpu


The 'Sentiment' column ships with the dataset and serves as our ideal output, whereas 'New_Sentiment_2' is the label produced by RoBERTa.

In [26]:
df[df['Sentiment'] != df['New_Sentiment_2']][['Text','Sentiment','New_Sentiment_2']]

Unnamed: 0,Text,Sentiment,New_Sentiment_2
0,I love this product!,Positive,LABEL_2
1,The service was terrible.,Negative,LABEL_0
2,This movie is amazing!,Positive,LABEL_2
3,I'm so disappointed with their customer support.,Negative,LABEL_0
4,Just had the best meal of my life!,Positive,LABEL_2
...,...,...,...
91,Just had the most amazing vacation! I can't wa...,Positive,LABEL_2
92,The food at this restaurant was awful. Never g...,Negative,LABEL_0
93,I can't stop listening to this song. It's my n...,Positive,LABEL_2
94,Their website is so confusing and poorly desig...,Negative,LABEL_0


In [27]:
df['New_Sentiment_2'].unique()

array(['LABEL_2', 'LABEL_0'], dtype=object)

In [28]:
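# This checkpoint names its classes LABEL_0 (negative), LABEL_1 (neutral) and LABEL_2 (positive);
# only LABEL_0 and LABEL_2 occur in this dataset, so LABEL_1 is left unmapped.
df['New_Sentiment_2'] = df['New_Sentiment_2'].apply(lambda x: 'Positive' if x == 'LABEL_2' else 'Negative' if x == 'LABEL_0' else x)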

In [29]:
df['New_Sentiment_2'].unique()

array(['Positive', 'Negative'], dtype=object)
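
Since each model uses its own label names, one lookup table can replace the separate per-model lambdas. The following is a minimal sketch; `LABEL_MAP` and `normalize_label` are hypothetical names, and the `LABEL_1` entry is included on the assumption that the model's middle class means neutral.

```python
# Raw model labels mapped onto the dataset's vocabulary.
LABEL_MAP = {
    "POSITIVE": "Positive",  # DistilBERT SST-2 labels
    "NEGATIVE": "Negative",
    "LABEL_0": "Negative",   # twitter-roberta-base-sentiment labels
    "LABEL_1": "Neutral",
    "LABEL_2": "Positive",
}


def normalize_label(raw):
    """Return the shared label, leaving unknown or already-normalized values untouched."""
    return LABEL_MAP.get(raw, raw)


print([normalize_label(x) for x in ("POSITIVE", "LABEL_0", "LABEL_1", "Positive")])
```

On a DataFrame this would be applied as `df['New_Sentiment_2'].map(normalize_label)`.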

In [30]:
df[df['Sentiment'] != df['New_Sentiment_2']][['Text','Sentiment','New_Sentiment_2']]

Unnamed: 0,Text,Sentiment,New_Sentiment_2


# That's 100% correct tagging. Great!

# Use an Ensemble Approach 

- If both models agree, the shared label is likely correct.
- If they disagree, the ensemble chooses the label from the more confident model.
- Combining the strengths of both models can boost accuracy.
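
The decision rule itself can be exercised without loading any model. In this sketch `combine` is a hypothetical helper that takes two `(label, score)` pairs whose labels already share one vocabulary; the scores are invented for illustration.

```python
def combine(first, second):
    """Merge two (label, score) predictions whose labels share one vocabulary."""
    (label_a, score_a), (label_b, score_b) = first, second
    if label_a == label_b:
        # Agreement: keep the label and average the two confidences.
        return label_a, (score_a + score_b) / 2
    # Disagreement: trust whichever model is more confident (ties go to the second).
    return first if score_a > score_b else second


print(combine(("Positive", 0.98), ("Positive", 0.90)))  # models agree
print(combine(("Negative", 0.55), ("Positive", 0.91)))  # models disagree
```

One caveat with this design: the two models' confidence scores are not calibrated against each other, so "more confident" is a heuristic rather than a guarantee.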

In [33]:
bert_model = pipeline("sentiment-analysis", model="distilbert-base-uncased-finetuned-sst-2-english")
roberta_model = pipeline("sentiment-analysis", model="cardiffnlp/twitter-roberta-base-sentiment")

# The two models name their classes differently ('POSITIVE' vs 'LABEL_2'), so map the
# raw labels to a shared vocabulary before comparing; otherwise they never "agree".
SHARED_LABELS = {'POSITIVE': 'Positive', 'NEGATIVE': 'Negative',
                 'LABEL_2': 'Positive', 'LABEL_1': 'Neutral', 'LABEL_0': 'Negative'}

def ensemble_sentiment(text):
    bert_result = bert_model(text[:512])[0]
    roberta_result = roberta_model(text[:512])[0]

    # If both models agree, use that result with the averaged confidence
    if SHARED_LABELS.get(bert_result['label']) == SHARED_LABELS.get(roberta_result['label']):
        return bert_result['label'], (bert_result['score'] + roberta_result['score']) / 2

    # Otherwise, choose the one with higher confidence
    if bert_result['score'] > roberta_result['score']:
        return bert_result['label'], bert_result['score']
    return roberta_result['label'], roberta_result['score']

df[['New_Sentiment_3', 'New_Confidence_3']] = df['Text'].apply(lambda x: pd.Series(ensemble_sentiment(str(x))))


Device set to use cpu
Device set to use cpu


In [34]:
df[df['Sentiment'] != df['New_Sentiment_3']][['Text','Sentiment','New_Sentiment_3']]

Unnamed: 0,Text,Sentiment,New_Sentiment_3
0,I love this product!,Positive,POSITIVE
1,The service was terrible.,Negative,NEGATIVE
2,This movie is amazing!,Positive,POSITIVE
3,I'm so disappointed with their customer support.,Negative,NEGATIVE
4,Just had the best meal of my life!,Positive,POSITIVE
...,...,...,...
91,Just had the most amazing vacation! I can't wa...,Positive,POSITIVE
92,The food at this restaurant was awful. Never g...,Negative,NEGATIVE
93,I can't stop listening to this song. It's my n...,Positive,POSITIVE
94,Their website is so confusing and poorly desig...,Negative,NEGATIVE


In [35]:
df['New_Sentiment_3'] = df['New_Sentiment_3'].apply(lambda x: 'Positive' if x == 'POSITIVE' else 'Negative' if x == 'NEGATIVE' else x)

In [36]:
df['New_Sentiment_3'] = df['New_Sentiment_3'].apply(lambda x: 'Positive' if x == 'LABEL_2' else 'Negative' if x == 'LABEL_0' else x)

The 'Sentiment' column ships with the dataset and serves as our ideal output, whereas 'New_Sentiment_3' is the label produced by the ensemble approach.

In [37]:
df[df['Sentiment'] != df['New_Sentiment_3']][['Text','Sentiment','New_Sentiment_3']]

Unnamed: 0,Text,Sentiment,New_Sentiment_3


# 100% correct tagging, matching the ideal output