# **Sentiment Analysis in Python**
**DESCRIPTION**: <br>
This notebook runs sentiment analysis over customer request emails to identify those with highly negative sentiment. These emails can then be flagged to customer service representatives so that urgent issues are resolved promptly.

---

**Author:** Kyle Dyett<br>
**Last Updated By:** Kyle Dyett<br>
**Last Updated Date:** 31/08/2024

## Import Modules

In [26]:
import pandas as pd
import os
from tqdm.notebook import tqdm 

## Read Data

In [28]:
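# List all files in the 'Files' directory, then keep those whose names start with 'Email'
# (os.path.join avoids the invalid '\F' escape sequence in the original path string)
files_dir = os.path.join(os.getcwd(), 'Files')
files = [f for f in os.listdir(files_dir) if os.path.isfile(os.path.join(files_dir, f))]
files = [f for f in files if f.startswith('Email')]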

for i, f in enumerate(files):
    print(f'{i+1}: {f}')

1: Email1.csv
2: Email10.csv
3: Email11.csv
4: Email12.csv
5: Email13.csv
6: Email14.csv
7: Email15.csv
8: Email16.csv
9: Email2.csv
10: Email3.csv
11: Email4.csv
12: Email5.csv
13: Email6.csv
14: Email7.csv
15: Email8.csv
16: Email9.csv
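
The listing above is in lexicographic order, so `Email10.csv` appears before `Email2.csv`. If numeric order is preferred, a natural-sort key is one option. This is a minimal sketch; the helper name `natural_key` and the sample list are illustrative and not part of the notebook.

```python
import re

def natural_key(name):
    """Split a filename into text and integer chunks so 'Email2' sorts before 'Email10'."""
    return [int(part) if part.isdigit() else part.lower()
            for part in re.split(r'(\d+)', name)]

# A few of the filenames from the listing, in the order they were printed
sample_files = ['Email1.csv', 'Email10.csv', 'Email11.csv', 'Email2.csv', 'Email3.csv']
sorted_files = sorted(sample_files, key=natural_key)
print(sorted_files)
```

Applying the same key to `files` would make the row order of the combined DataFrame follow the Email ID order.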


In [5]:
emails = pd.DataFrame(columns=['Email ID', 'Subject', 'Body', 'Sentiment'])
for f in files:
    email = pd.read_csv(f"{os.getcwd()}\\Files\\{f}", encoding='utf-8', encoding_errors='ignore')
    emails = pd.concat([emails, email], ignore_index=True)

## Explore Columns & Data Types

In [7]:
emails.head()

| | Email ID | Subject | Body | Sentiment |
|---|---|---|---|---|
| 0 | 1 | Inaccurate Balance Statement – Urgent Attentio... | Hi,\n\nI have just reviewed my latest statemen... | Negative |
| 1 | 10 | Satisfied with Early Settlement Process | Hello,\n\nI recently settled my Hire Purchase ... | Positive |
| 2 | 11 | Complaint About Payment Processing | To Whom It May Concern,\n\nI am extremely unha... | Negative |
| 3 | 12 | Unacceptable Delay in Response | Hi,\n\nI’m really unhappy with the lack of res... | Negative |
| 4 | 13 | Thank You for the Great Service | Hello,\n\nI just wanted to drop a quick messag... | Positive |


In [29]:
print(emails.shape)
print(emails.dtypes)

(16, 4)
Email ID     object
Subject      object
Body         object
Sentiment    object
dtype: object


In [8]:
for i, col in enumerate(emails):
    print(f'Column {i+1}: {col}')

Column 1: Email ID
Column 2: Subject
Column 3: Body
Column 4: Sentiment


In [9]:
emails['Sentiment'] = emails['Sentiment'].str.strip()

### Analysis of Sentiment in Dataset

In [10]:
emails['Sentiment'].value_counts()

Neutral     7
Negative    6
Positive    3
Name: Sentiment, dtype: int64

## Transformer (Hugging Face) Model

In [11]:
from transformers import AutoTokenizer
from transformers import AutoModelForSequenceClassification
from scipy.special import softmax

### Load Model

In [12]:
MODEL = "cardiffnlp/twitter-roberta-base-sentiment"
# MODEL = "bhadresh-savani/distilbert-base-uncased-emotion"
# MODEL = "nlptown/bert-base-multilingual-uncased-sentiment"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL)



In [13]:
for i, row in tqdm(emails.tail(5).iterrows()):
    print(i, ': ', row['Body'])

0it [00:00, ?it/s]

11 :  Hi,

I’ve tried contacting your team multiple times over the past week regarding my request to adjust the loan repayment terms, and I have yet to receive a response. This level of service is completely unacceptable, and it’s causing unnecessary stress on my end. If this issue isn’t resolved soon, I’ll have no choice but to look for alternative lenders.

Please respond immediately.
Rachel
12 :  Hi,

Could you please provide me with an update on the current balance of my loan? I just want to confirm that my recent payments have been applied correctly. I’m not in a rush, but a quick email with the latest figures would be helpful for my financial planning.

Thanks for your help.
Ben
13 :  Hi there,

I’ve been reviewing my loan repayment schedule, and I just have a few questions about the interest calculation for the remaining term. Could someone from your team please get back to me at your convenience? There’s no urgent issue; I just want to fully understand my agreement.

Thanks in 

### Create Function to run Sentiment Analysis

In [33]:
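def sentiment_analysis_roberta(text):
    # Tokenize the email body; truncate to 512 tokens so long emails
    # do not exceed the RoBERTa model's maximum input length
    encoded_text = tokenizer(text, return_tensors='pt', truncation=True, max_length=512)
    output = model(**encoded_text)
    # Take the logits for the single input and convert them to probabilities
    scores = output.logits[0].detach().numpy()
    scores = softmax(scores)
    # Label order for this model: 0 = negative, 1 = neutral, 2 = positive
    scores_dict = {
        'Roberta_neg': scores[0],
        'Roberta_neu': scores[1],
        'Roberta_pos': scores[2]
    }
    return scores_dict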
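
The `softmax` call turns the model's three raw logits into scores that are non-negative and sum to 1, which is why `Roberta_neg`, `Roberta_neu`, and `Roberta_pos` can be read as a probability split. Below is a dependency-free sketch of the same calculation. The logits are made-up values for illustration, not model output, and `softmax_list` is a hypothetical helper name.

```python
import math

def softmax_list(logits):
    """Numerically stable softmax over a plain list of floats."""
    peak = max(logits)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical [negative, neutral, positive] logits
example_logits = [2.0, 0.5, -1.0]
probs = softmax_list(example_logits)
print([round(p, 4) for p in probs])
```

The largest logit always receives the largest share, so the ranking of the three classes is unchanged by the transformation.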

### Create Dataframe to Store Results

In [40]:
trns_df = pd.DataFrame(columns=['Email ID', 'Roberta_neg', 'Roberta_neu', 'Roberta_pos'])

### Iterate over Emails & Use RoBERTa Sentiment Analysis

In [41]:
for i, row in tqdm(emails.iterrows()):
    text = row['Body']
    roberta = sentiment_analysis_roberta(text)
    roberta['Email ID'] = row['Email ID']
    roberta_df = pd.DataFrame([roberta])
    trns_df = pd.concat([trns_df, roberta_df], ignore_index=True)
    print(i, ': ', roberta)

0it [00:00, ?it/s]

0 :  {'Roberta_neg': 0.9566534, 'Roberta_neu': 0.03866507, 'Roberta_pos': 0.0046815816, 'Email ID': 1}
1 :  {'Roberta_neg': 0.0022362897, 'Roberta_neu': 0.02455041, 'Roberta_pos': 0.97321326, 'Email ID': 10}
2 :  {'Roberta_neg': 0.95206535, 'Roberta_neu': 0.043183554, 'Roberta_pos': 0.0047511267, 'Email ID': 11}
3 :  {'Roberta_neg': 0.89628977, 'Roberta_neu': 0.09201832, 'Roberta_pos': 0.011691968, 'Email ID': 12}
4 :  {'Roberta_neg': 0.0018663536, 'Roberta_neu': 0.011588338, 'Roberta_pos': 0.9865454, 'Email ID': 13}
5 :  {'Roberta_neg': 0.23537403, 'Roberta_neu': 0.5712173, 'Roberta_pos': 0.19340864, 'Email ID': 14}
6 :  {'Roberta_neg': 0.16278999, 'Roberta_neu': 0.35517603, 'Roberta_pos': 0.482034, 'Email ID': 15}
7 :  {'Roberta_neg': 0.072705634, 'Roberta_neu': 0.31069672, 'Roberta_pos': 0.6165977, 'Email ID': 16}
8 :  {'Roberta_neg': 0.9506371, 'Roberta_neu': 0.045175537, 'Roberta_pos': 0.004187347, 'Email ID': 2}
9 :  {'Roberta_neg': 0.0028110011, 'Roberta_neu': 0.024401939, 'Robe

In [45]:
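# Find the column with the maximum value and assign Class
# (cast to float first: columns built by concatenating onto an empty DataFrame can end up with object dtype)
trns_df['Roberta_class'] = trns_df[['Roberta_neg', 'Roberta_neu', 'Roberta_pos']].astype(float).idxmax(axis=1)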

In [46]:
# Define a mapping dictionary to convert column names to meaningful descriptions
roberta_mapping = {
    'Roberta_neg': 'NEGATIVE',
    'Roberta_neu': 'NEUTRAL',
    'Roberta_pos': 'POSITIVE'
}
# Map the values in the 'Roberta_class' column to the meaningful descriptions
trns_df['Roberta_class'] = trns_df['Roberta_class'].map(roberta_mapping)
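
For a single email, the `idxmax` and `map` steps amount to picking the key with the highest score and translating it to a label. This plain-Python sketch uses the scores printed earlier for Email ID 1; the variable names are illustrative.

```python
# Scores printed by the loop for Email ID 1
scores = {
    'Roberta_neg': 0.9566534,
    'Roberta_neu': 0.03866507,
    'Roberta_pos': 0.0046815816,
}
label_map = {
    'Roberta_neg': 'NEGATIVE',
    'Roberta_neu': 'NEUTRAL',
    'Roberta_pos': 'POSITIVE',
}

# Equivalent of idxmax for one row: the key whose value is largest
top_column = max(scores, key=scores.get)
# Equivalent of .map(roberta_mapping) for one value
predicted = label_map[top_column]
print(top_column, '->', predicted)
```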

### View Final Output and Save Output File

In [49]:
trns_df.sort_values(by='Email ID', inplace=True)
trns_df

| | Email ID | Roberta_neg | Roberta_neu | Roberta_pos | Roberta_class |
|---|---|---|---|---|---|
| 0 | 1 | 0.956653 | 0.038665 | 0.004682 | NEGATIVE |
| 8 | 2 | 0.950637 | 0.045176 | 0.004187 | NEGATIVE |
| 9 | 3 | 0.002811 | 0.024402 | 0.972787 | POSITIVE |
| 10 | 4 | 0.03741 | 0.26149 | 0.7011 | POSITIVE |
| 11 | 5 | 0.933342 | 0.06025 | 0.006408 | NEGATIVE |
| 12 | 6 | 0.021373 | 0.401582 | 0.577044 | POSITIVE |
| 13 | 7 | 0.1326 | 0.696757 | 0.170643 | NEUTRAL |
| 14 | 8 | 0.239014 | 0.450408 | 0.310578 | NEUTRAL |
| 15 | 9 | 0.044978 | 0.278284 | 0.676738 | POSITIVE |
| 1 | 10 | 0.002236 | 0.02455 | 0.973213 | POSITIVE |


In [64]:
table_name = 'ROBERTA_SENTIMENT_ANALYSIS'
trns_df.to_csv(f"{os.getcwd()}\\Files\\{table_name}.csv")
print(f'{table_name} Table Saved as CSV File')

ROBERTA_SENTIMENT_ANALYSIS Table Saved as CSV File
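
The stated aim of the notebook is to surface highly negative emails, and one simple way to do that is to filter the negative score against a cut-off. The sketch below uses a few rows from the table above. The 0.9 threshold and the name `flag_urgent` are assumptions for illustration and would need tuning on real data.

```python
NEGATIVE_THRESHOLD = 0.9  # assumed cut-off, not taken from the notebook

# A subset of (Email ID, Roberta_neg) pairs from the results table
rows = [
    {'Email ID': 1, 'Roberta_neg': 0.956653},
    {'Email ID': 2, 'Roberta_neg': 0.950637},
    {'Email ID': 3, 'Roberta_neg': 0.002811},
    {'Email ID': 5, 'Roberta_neg': 0.933342},
    {'Email ID': 7, 'Roberta_neg': 0.1326},
]

def flag_urgent(records, threshold=NEGATIVE_THRESHOLD):
    """Return the Email IDs whose negative score meets or exceeds the threshold."""
    return [r['Email ID'] for r in records if r['Roberta_neg'] >= threshold]

urgent_ids = flag_urgent(rows)
print(urgent_ids)
```

Filtering on the score rather than on `Roberta_class` alone lets the cut-off be raised or lowered depending on how many emails the team can review.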


## OpenAI - Generative AI Model for Sentiment Analysis

### Retrieve the API Key
* The API key must first be created in your OpenAI account
* The API key is read from a text file, `api_key.txt`, in the `Files` folder

In [55]:
# Open the API Key File in read mode
with open(os.getcwd()+'\\Files\\api_key.txt', 'r') as file:
    # Read the API Key, stripping the trailing newline and any surrounding whitespace
    API_KEY = file.readline().strip()
    print('Retrieved API Key')

Retrieved API Key
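
Keeping the key in a file next to the data makes it easy to commit or share by accident. A common alternative is to read it from an environment variable and fall back to the file only if the variable is not set. This is a minimal sketch; the function name `load_api_key` is hypothetical, and `OPENAI_API_KEY` is the variable name assumed here.

```python
import os

def load_api_key(env_var='OPENAI_API_KEY', fallback_path=None):
    """Return the key from the environment, else from the first line of a file."""
    key = os.environ.get(env_var, '').strip()
    if key:
        return key
    if fallback_path and os.path.isfile(fallback_path):
        with open(fallback_path, 'r', encoding='utf-8') as fh:
            # strip() drops the trailing newline that readline() keeps
            return fh.readline().strip()
    raise RuntimeError(f'No API key found in ${env_var} or {fallback_path!r}')
```

With this in place, the cell above could call something like `load_api_key(fallback_path=os.path.join(os.getcwd(), 'Files', 'api_key.txt'))`.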


### Create Function to run Sentiment Analysis using OpenAI

In [56]:
import openai
import re

openai.api_key = API_KEY

def analyze_sentiment_openai(text):
    response = openai.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "user", "content": f"Analyze the sentiment of the following text and provide scores with 4 decimal places for positive, negative, and neutral sentiments from 0.00 to 1.00:\n\n'{text}'\n\nResponse format:\nPositive: <value>\nNegative: <value>\nNeutral: <value>"}
        ]
    )

    sentiment = response.choices[0].message.content
    
    # Use regex to extract the scores
    positive_score = re.search(r'Positive:\s*([0-1]\.\d+|[0-1])', sentiment)
    negative_score = re.search(r'Negative:\s*([0-1]\.\d+|[0-1])', sentiment)
    neutral_score = re.search(r'Neutral:\s*([0-1]\.\d+|[0-1])', sentiment)

    return {
        "positive_gpt": float(positive_score.group(1)) if positive_score else None,
        "negative_gpt": float(negative_score.group(1)) if negative_score else None,
        "neutral_gpt": float(neutral_score.group(1)) if neutral_score else None,
    }
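
The parsing step can be checked without calling the API by running the same number pattern over a sample reply. In this sketch, `sample_reply` is a hand-written string in the requested format, not a real model response, and `parse_scores` is an illustrative name.

```python
import re

SCORE = r'([0-1]\.\d+|[0-1])'  # same number pattern as the function above

def parse_scores(reply):
    """Extract the three scores from a reply; a missing score becomes None."""
    result = {}
    for label in ('Positive', 'Negative', 'Neutral'):
        match = re.search(rf'{label}:\s*{SCORE}', reply)
        result[f'{label.lower()}_gpt'] = float(match.group(1)) if match else None
    return result

sample_reply = 'Positive: 0.0500\nNegative: 0.9000\nNeutral: 0.0500'
parsed = parse_scores(sample_reply)
print(parsed)

# A reply that ignores the requested format yields None for every score
print(parse_scores('The sentiment is mostly negative.'))
```

Because the model is not guaranteed to follow the format, rows with `None` scores are worth checking for before assigning a class.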

### Create Dataframe to Store Results

In [69]:
genai_df = pd.DataFrame(columns=['Email ID', 'negative_gpt', 'neutral_gpt', 'positive_gpt'])

### Iterate over Emails & Use OpenAI Sentiment Analysis

In [70]:
for i, row in tqdm(emails.iterrows()):
    print(row['Email ID'])
    text = row['Body']
    opn = analyze_sentiment_openai(text)
    opn['Email ID'] = row['Email ID']
    open_df = pd.DataFrame([opn])
    genai_df = pd.concat([genai_df, open_df], ignore_index=True)
    print(i, ': ', opn)

0it [00:00, ?it/s]

1
0 :  {'positive_gpt': 0.0, 'negative_gpt': 1.0, 'neutral_gpt': 0.0, 'Email ID': 1}
10
1 :  {'positive_gpt': 0.9321, 'negative_gpt': 0.0, 'neutral_gpt': 0.0679, 'Email ID': 10}
11
2 :  {'positive_gpt': 0.0, 'negative_gpt': 1.0, 'neutral_gpt': 0.0, 'Email ID': 11}
12
3 :  {'positive_gpt': 0.0, 'negative_gpt': 0.8989, 'neutral_gpt': 0.1011, 'Email ID': 12}
13
4 :  {'positive_gpt': 1.0, 'negative_gpt': 0.0, 'neutral_gpt': 0.0, 'Email ID': 13}
14
5 :  {'positive_gpt': 0.6851, 'negative_gpt': 0.0218, 'neutral_gpt': 0.2931, 'Email ID': 14}
15
6 :  {'positive_gpt': 0.5455, 'negative_gpt': 0.0909, 'neutral_gpt': 0.3636, 'Email ID': 15}
16
7 :  {'positive_gpt': 0.8167, 'negative_gpt': 0.0191, 'neutral_gpt': 0.1642, 'Email ID': 16}
2
8 :  {'positive_gpt': 0.0, 'negative_gpt': 0.775, 'neutral_gpt': 0.225, 'Email ID': 2}
3
9 :  {'positive_gpt': 1.0, 'negative_gpt': 0.0, 'neutral_gpt': 0.0, 'Email ID': 3}
4
10 :  {'positive_gpt': 0.8231, 'negative_gpt': 0.0, 'neutral_gpt': 0.1769, 'Email ID': 4}
5

In [73]:
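# Find the column with the maximum value and assign Class
# (cast to float first: columns built by concatenating onto an empty DataFrame can end up with object dtype)
genai_df['gpt_class'] = genai_df[['positive_gpt', 'negative_gpt', 'neutral_gpt']].astype(float).idxmax(axis=1)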

In [74]:
# Define a mapping dictionary to convert column names to meaningful descriptions
openai_mapping = {
    'negative_gpt': 'NEGATIVE',
    'neutral_gpt': 'NEUTRAL',
    'positive_gpt': 'POSITIVE'
}
# Map the values in the 'gpt_class' column to the meaningful descriptions
genai_df['gpt_class'] = genai_df['gpt_class'].map(openai_mapping)

### View Final Output and Save Output File

In [75]:
genai_df.sort_values(by='Email ID', inplace=True)
genai_df

| | Email ID | negative_gpt | neutral_gpt | positive_gpt | gpt_class |
|---|---|---|---|---|---|
| 0 | 1 | 1.0 | 0.0 | 0.0 | NEGATIVE |
| 8 | 2 | 0.775 | 0.225 | 0.0 | NEGATIVE |
| 9 | 3 | 0.0 | 0.0 | 1.0 | POSITIVE |
| 10 | 4 | 0.0 | 0.1769 | 0.8231 | POSITIVE |
| 11 | 5 | 0.8192 | 0.1808 | 0.0 | NEGATIVE |
| 12 | 6 | 0.0 | 0.2717 | 0.7283 | POSITIVE |
| 13 | 7 | 0.0 | 0.32 | 0.68 | POSITIVE |
| 14 | 8 | 0.0846 | 0.4026 | 0.5128 | POSITIVE |
| 15 | 9 | 0.0 | 0.2535 | 0.7465 | POSITIVE |
| 1 | 10 | 0.0 | 0.0679 | 0.9321 | POSITIVE |


In [76]:
table_name = 'OPENAI_SENTIMENT_ANALYSIS'
genai_df.to_csv(f"{os.getcwd()}\\Files\\{table_name}.csv")
print(f'{table_name} Table Saved as CSV File')

OPENAI_SENTIMENT_ANALYSIS Table Saved as CSV File
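
With both outputs saved, a quick sanity check is how often the two models assign the same class. This sketch uses only the ten rows displayed in the two results tables (Email IDs 1 to 10); the variable names are illustrative.

```python
# Classes shown in the RoBERTa results table
roberta_class = {
    1: 'NEGATIVE', 2: 'NEGATIVE', 3: 'POSITIVE', 4: 'POSITIVE', 5: 'NEGATIVE',
    6: 'POSITIVE', 7: 'NEUTRAL', 8: 'NEUTRAL', 9: 'POSITIVE', 10: 'POSITIVE',
}
# Classes shown in the OpenAI results table
gpt_class = {
    1: 'NEGATIVE', 2: 'NEGATIVE', 3: 'POSITIVE', 4: 'POSITIVE', 5: 'NEGATIVE',
    6: 'POSITIVE', 7: 'POSITIVE', 8: 'POSITIVE', 9: 'POSITIVE', 10: 'POSITIVE',
}

# Email IDs where the two models disagree
disagreements = sorted(eid for eid in roberta_class if roberta_class[eid] != gpt_class[eid])
agreement_rate = 1 - len(disagreements) / len(roberta_class)
print(disagreements, agreement_rate)
```

On these rows the models agree on every negative email and differ only where RoBERTa predicts neutral, which suggests the neutral class is where manual review is most useful.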
