# Sentiment Analysis 

**Sentiment analysis**, also known as opinion mining, is a natural language processing (NLP) technique used to determine the sentiment or emotional tone expressed in a piece of text. It aims to classify the sentiment of a document, sentence, or even individual words as positive, negative, or neutral.

The process of sentiment analysis involves several steps. First, the text data is preprocessed, which includes tasks like tokenization, removing stopwords, and normalizing words. Next, the sentiment analysis algorithm analyzes the text to identify sentiment-bearing words and phrases.
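As a rough illustration of the preprocessing step described above, here is a minimal sketch in plain Python. The `STOPWORDS` set and the `preprocess` helper are made up for this example; real projects usually rely on a fuller stopword list from a library such as NLTK or spaCy.

```python
import re

# A tiny, illustrative stopword list (an assumption for this sketch).
STOPWORDS = {"a", "an", "the", "is", "was", "and", "of", "to", "it", "this"}

def preprocess(text):
    """Normalize to lowercase, tokenize, and drop stopwords."""
    # Normalization: lowercase everything so "BEST" and "best" match.
    lowered = text.lower()
    # Tokenization: keep runs of letters, digits, and apostrophes.
    tokens = re.findall(r"[a-z0-9']+", lowered)
    # Stopword removal: discard very common words that carry little sentiment.
    return [tok for tok in tokens if tok not in STOPWORDS]

print(preprocess("This is the BEST taffy, and it was a great price!"))
# ['best', 'taffy', 'great', 'price']
```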

There are different approaches to sentiment analysis, including:

- Lexicon-based: Lexicon-based methods rely on sentiment lexicons or dictionaries that associate words with sentiment scores. Each word is assigned a polarity value, and the sentiment of a text is calculated based on the aggregated scores of the words it contains.

- Machine learning-based: Machine learning techniques use labeled training data to train a model that can predict the sentiment of new, unseen text. This involves feature extraction, where numerical features are derived from the text, and classification algorithms, such as support vector machines (SVM) or recurrent neural networks (RNN), are used to classify the sentiment.

- Hybrid approaches: Hybrid approaches combine lexicon-based methods with machine learning techniques to leverage the strengths of both. For example, lexicons can be used to bootstrap the sentiment classification process, and machine learning models can be fine-tuned using the labeled data.
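To make the lexicon-based idea more concrete, here is a small sketch. The `LEXICON` dictionary and its scores are invented for illustration and are not taken from any published sentiment lexicon.

```python
# Hypothetical mini-lexicon: word -> polarity score (values are made up).
LEXICON = {
    "great": 1.0, "good": 0.5, "love": 1.0,
    "bad": -0.5, "terrible": -1.0, "awful": -1.0,
}

def lexicon_sentiment(text):
    """Sum the polarity of known words and map the total to a label."""
    words = text.lower().split()
    # Words missing from the lexicon contribute a score of 0.
    score = sum(LEXICON.get(word.strip(".,!?"), 0.0) for word in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(lexicon_sentiment("Great taffy at a great price"))
print(lexicon_sentiment("Terrible product, not as advertised"))
print(lexicon_sentiment("Cough medicine"))
```

A simple sum like this ignores negation ("not great") and context, which is one reason the machine learning and hybrid approaches exist.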


Note that this notebook is adapted from [this example](https://towardsdatascience.com/a-beginners-guide-to-sentiment-analysis-in-python-95e354ea84f6).

In [1]:
pip install seaborn




In [1]:
import pandas as pd

import matplotlib.pyplot as plt
import matplotlib

import seaborn as sns
import numpy as np

import plotly.offline as py
import plotly.graph_objs as go
import plotly.tools as tls
import plotly.express as px

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import confusion_matrix,classification_report

color = sns.color_palette()
%matplotlib inline

For this analysis, we'll be looking at reviews of Amazon products. The data includes the columns `ProductId`, `UserId`, `ProfileName`, `HelpfulnessNumerator`, `HelpfulnessDenominator`, `Score`, `Time`, `Summary`, and `Text`, but we'll mainly use `Score` and `Summary` for this sentiment analysis.

In [2]:
df = pd.read_csv(r"D:\Coding_Stuff\GitHub\Natural-Language-Processing\data\Amazon_Product_reviews.csv")
df.head()

| Unnamed: 0.1 | Unnamed: 0 | Id | ProductId | UserId | ProfileName | HelpfulnessNumerator | HelpfulnessDenominator | Score | Time | Summary | Text |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | B001E4KFG0 | A3SGXH7AUHU8GW | delmartian | 1 | 1 | 5 | 1303862400 | Good Quality Dog Food | I have bought several of the Vitality canned d... |
| 1 | 1 | 2 | B00813GRG4 | A1D87F6ZCVE5NK | dll pa | 0 | 0 | 1 | 1346976000 | Not as Advertised | Product arrived labeled as Jumbo Salted Peanut... |
| 2 | 2 | 3 | B000LQOCH0 | ABXLMWJIXXAIN | Natalia Corres "Natalia Corres" | 1 | 1 | 4 | 1219017600 | "Delight" says it all | This is a confection that has been around a fe... |
| 3 | 3 | 4 | B000UA0QIQ | A395BORC6FGVXV | Karl | 3 | 3 | 2 | 1307923200 | Cough Medicine | If you are looking for the secret ingredient i... |
| 4 | 4 | 5 | B006K2ZZ7K | A1UQRSCLF8GW1T | Michael D. Bigham "M. Wassir" | 0 | 0 | 5 | 1350777600 | Great taffy | Great taffy at a great price.  There was a wid... |


After reading in the data using `pandas`, we can do some quick analysis to look at the distribution of scores.

In [3]:
# Quick data discovery on product scores 
fig = px.histogram(df, x="Score")
fig.update_traces(marker_color="turquoise",marker_line_color='rgb(8,48,107)',
                  marker_line_width=1.5)
fig.update_layout(title_text='Product Score')
fig.show()

It looks like the majority of the reviews have a rating of `4` or `5`. For the sentiment analysis, we'll need to create labels. For this example, we'll drop the neutral reviews with a score of exactly 3, treat anything below 3 as negative sentiment, denoted by `-1`, and treat anything above 3 as positive sentiment, denoted by `1`.

In [4]:
# Creating labels
# Dropping the middle ground (3) as we want +ve and -ve parts only
df = df[df['Score'] != 3].copy()  # .copy() avoids pandas' SettingWithCopyWarning below
df['sentiment'] = df['Score'].apply(lambda rating : +1 if rating > 3 else -1)

Now that we've created labels, let's do some additional exploratory analysis with them.

In [5]:
# split by sentiment 
positive = df[df['sentiment'] == 1]
negative = df[df['sentiment'] == -1]
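Before plotting, a quick numeric check of the class balance can be useful, since it affects how we read the model's metrics later. The sketch below uses a small made-up DataFrame (`toy`) rather than the review data, so the numbers are purely illustrative.

```python
import pandas as pd

# Made-up scores standing in for the real reviews (neutral 3s already removed).
toy = pd.DataFrame({"Score": [5, 4, 1, 5, 2, 5]})
toy["sentiment"] = toy["Score"].apply(lambda rating: 1 if rating > 3 else -1)

# Absolute counts and relative shares of each label.
counts = toy["sentiment"].value_counts()
shares = toy["sentiment"].value_counts(normalize=True)

print(counts)
print(shares.round(2))
```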

In [6]:
# Explore data by sentiment 
df['sentimentt'] = df['sentiment'].map({-1 : 'negative', 1 : 'positive'})
fig = px.histogram(df, x="sentimentt")
fig.update_traces(marker_color="indianred",marker_line_color='rgb(8,48,107)',
                  marker_line_width=1.5)
fig.update_layout(title_text='Product Sentiment')
fig.show()


Moving on from EDA, we'll do some light cleaning of the `Text` and `Summary` columns to prepare them for modeling.

In [7]:
# Clean data, remove punctuation 
def remove_punctuation(text):
    final = "".join(u for u in text if u not in ("?", ".", ";", ":",  "!",'"'))
    return final
df['Text'] = df['Text'].apply(remove_punctuation)
df = df.dropna(subset=['Summary'])
df['Summary'] = df['Summary'].apply(remove_punctuation)
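Note that `remove_punctuation` strips only six specific characters, so commas and apostrophes survive. As a hedged alternative, the sketch below shows one way to remove all ASCII punctuation with `str.translate`. The helper name `remove_all_punctuation` is made up for this example, and whether the broader cleaning helps the model is something to test rather than assume.

```python
import string

def remove_punctuation(text):
    # Same behavior as the notebook's function: drops six specific characters.
    return "".join(ch for ch in text if ch not in ("?", ".", ";", ":", "!", '"'))

# Translation table that deletes every character in string.punctuation.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def remove_all_punctuation(text):
    """Hypothetical broader variant: strips all ASCII punctuation."""
    return text.translate(_PUNCT_TABLE)

print(remove_punctuation('"Delight" says it all!'))      # Delight says it all
print(remove_punctuation("Don't buy, it's bad."))        # Don't buy, it's bad
print(remove_all_punctuation("Don't buy, it's bad."))    # Dont buy its bad
```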

In [8]:
featurized_df = df[['Summary','sentiment']]
featurized_df.head()

|  | Summary | sentiment |
|---|---|---|
| 0 | Good Quality Dog Food | 1 |
| 1 | Not as Advertised | -1 |
| 2 | Delight says it all | 1 |
| 3 | Cough Medicine | -1 |
| 4 | Great taffy | 1 |


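Now that the labels have been created and the data is cleaned, we'll move on to splitting the data into training and test sets and applying a `CountVectorizer` to turn each summary into a bag-of-words count vector (a sparse feature representation rather than a learned embedding).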

In [9]:
# Split data
X_train, X_test, y_train, y_test = train_test_split(featurized_df["Summary"], 
                                                    featurized_df["sentiment"], test_size=0.2)
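Because the labels are imbalanced, one option worth considering is a stratified, seeded split, which keeps the positive/negative ratio similar in both sets and makes the results reproducible. The sketch below uses made-up data; the `stratify` and `random_state` settings are a suggestion, not something the original notebook uses.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Made-up data: 16 positive and 4 negative examples.
texts = [f"review {i}" for i in range(20)]
labels = [1] * 16 + [-1] * 4

# stratify=labels preserves the 80/20 class ratio in both splits;
# random_state fixes the shuffle so reruns give the same split.
X_tr, X_te, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=42
)

print(Counter(y_tr))  # 12 positive, 3 negative
print(Counter(y_te))  # 4 positive, 1 negative
```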

In [10]:
# Data prep for count vectorizer 
vectorizer = CountVectorizer(token_pattern=r'\b\w+\b')

train_matrix = vectorizer.fit_transform(X_train)
test_matrix = vectorizer.transform(X_test)
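To see what the vectorizer actually produces, here is a small sketch on three made-up summaries. Each row of the matrix is one document and each column counts one vocabulary word. The custom `token_pattern` also keeps single-character tokens, which the default pattern would drop.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up documents for illustration.
docs = ["great taffy", "not great", "great great price"]

toy_vectorizer = CountVectorizer(token_pattern=r"\b\w+\b")
toy_matrix = toy_vectorizer.fit_transform(docs)

# Recover the vocabulary in column order (words are indexed alphabetically).
vocab = sorted(toy_vectorizer.vocabulary_, key=toy_vectorizer.vocabulary_.get)
print(vocab)                 # ['great', 'not', 'price', 'taffy']
print(toy_matrix.toarray())  # one row of word counts per document
```

Note that the vectorizer is fit on the training data only and then reused to transform the test data, so words that appear only in the test set are ignored.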

Now that we have numeric features, we'll use a basic `LogisticRegression` to predict whether a review's sentiment is positive or negative.

In [11]:
# Logistic Regression
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression(max_iter=500)
lr.fit(train_matrix, y_train)
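One nice property of logistic regression on word counts is that each word gets its own coefficient, so we can inspect which words push a prediction toward positive or negative. The sketch below trains on a handful of made-up summaries, so the exact weights are illustrative only.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Made-up training data for illustration.
docs = ["great taffy", "great price", "love it",
        "not as advertised", "terrible taste", "bad product"]
labels = [1, 1, 1, -1, -1, -1]

vec = CountVectorizer(token_pattern=r"\b\w+\b")
X = vec.fit_transform(docs)
model = LogisticRegression(max_iter=500).fit(X, labels)

# model.classes_ is [-1, 1], so a positive coefficient favors class 1.
words = sorted(vec.vocabulary_, key=vec.vocabulary_.get)
weights = dict(zip(words, model.coef_[0]))

# Print words from most negative to most positive weight.
for word, weight in sorted(weights.items(), key=lambda kv: kv[1]):
    print(f"{word:12s} {weight:+.3f}")
```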

After the model is trained, we'll use it to generate predictions on our test set. With those predictions, we can use `classification_report` and `confusion_matrix` (both imported earlier) to investigate the quality of the model.

In [12]:
predictions = lr.predict(test_matrix)

In [13]:
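# Take a look at the classification report.
# Caution: the signature is classification_report(y_true, y_pred); with the arguments
# reversed as below, the precision and recall columns swap and support counts predictions.
print(classification_report(predictions,y_test))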

              precision    recall  f1-score   support

          -1       0.51      0.83      0.63        94
           1       0.98      0.91      0.94       827

    accuracy                           0.90       921
   macro avg       0.75      0.87      0.79       921
weighted avg       0.93      0.90      0.91       921



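Based on the evaluation metrics, the model does a pretty good job of predicting positive sentiment (F1 of 0.94), but it is noticeably weaker on negative sentiment (F1 of 0.63). The overall accuracy of 0.90 is helped by how heavily the reviews skew positive.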
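To see where the errors fall, a confusion matrix is a useful companion to the report. The sketch below uses made-up labels (`y_true` and `y_pred` are not the notebook's data) and passes the true labels first, which is the argument order scikit-learn expects for both functions.

```python
from sklearn.metrics import classification_report, confusion_matrix

# Made-up labels for illustration.
y_true = [1, 1, 1, 1, -1, -1, -1, 1]
y_pred = [1, 1, 1, -1, -1, 1, 1, 1]

# Rows are true classes, columns are predicted classes, in the order [-1, 1].
cm = confusion_matrix(y_true, y_pred, labels=[-1, 1])
print(cm)
# [[1 2]
#  [1 4]]

# True labels first, predictions second.
print(classification_report(y_true, y_pred, labels=[-1, 1]))
```

In this toy case, two of the three truly negative examples are predicted as positive, the same kind of weakness on the minority class that the report above points to.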