# Amazon Reviews - Sentiment Analysis

Objective: To obtain sentence-level sentiment analysis for Amazon reviews.
Package and Method: Hugging Face Sentiment Analysis Pipeline

## Install required libraries and import packages

In [None]:
!pip install transformers



In [None]:
import pandas as pd
import re
import nltk
nltk.download('stopwords')
nltk.download('wordnet')
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from transformers import AutoTokenizer, pipeline
from collections import Counter

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


## Read Data
* I have chosen 5 Amazon reviews for a laptop.
* These reviews are stored in an Excel file and read into a DataFrame.

In [None]:
# Read data from the Excel file 'Amazon reviews.xlsx' into a DataFrame
df = pd.read_excel(r'Amazon reviews.xlsx')

# Display the first few rows of the DataFrame to get an initial overview
df.head()

Unnamed: 0,Reviews
0,charger is too delicate. I will post again if ...
1,I don't know why people are cribbing so much a...
2,After 5 days of review\nI bought this one from...
3,"Firstly , I would like to talk about the scree..."
4,I am writing this review after one month of us...


## Individual Sentence Extraction
* In this step, I iterate through each review and split it into individual sentences, using a period (.) and the newline character (\n) as delimiters.
* The **Individual Sentence** column stores the list of sentences for each review, as shown below.

In [None]:
# Create an empty list to store the individual sentences
sentence_list = []

# Iterate through each review in the 'Reviews' column of the DataFrame
for review in df['Reviews']:

    # Split the review into sentences using periods and newlines as delimiters
    sentences = re.split(r'[.\n]', review)

    # Remove empty sentences (sentences with zero length)
    sentences = [x for x in sentences if len(x)]

    # Append the list of sentences for each review to the sentence_list
    sentence_list.append(sentences)

# Create a new column 'Individual Sentence' in the DataFrame to store the sentences
df['Individual Sentence'] = sentence_list

# Display the updated DataFrame with the new column
df

Unnamed: 0,Reviews,Individual Sentence
0,charger is too delicate. I will post again if ...,"[charger is too delicate, I will post again i..."
1,I don't know why people are cribbing so much a...,[I don't know why people are cribbing so much ...
2,After 5 days of review\nI bought this one from...,"[After 5 days of review, I bought this one fro..."
3,"Firstly , I would like to talk about the scree...","[Firstly , I would like to talk about the scre..."
4,I am writing this review after one month of us...,[I am writing this review after one month of u...
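
The output above shows that fragments keep their leading space, and the `len(x)` filter would also keep a fragment made only of spaces. A minimal sketch of a stricter split is shown here, assuming a hypothetical helper named `split_into_sentences` (not part of the notebook) and a made-up sample string:

```python
import re

def split_into_sentences(review):
    """Split on periods and newlines, dropping empty or whitespace-only pieces."""
    pieces = re.split(r'[.\n]', review)
    # strip() trims surrounding spaces; the filter discards pieces that are blank after trimming
    return [piece.strip() for piece in pieces if piece.strip()]

# Hypothetical sample text, not taken from the review file
sample = "charger is too delicate. I will post again.\n \nGood screen."
print(split_into_sentences(sample))
```

Splitting on every period also breaks decimals and abbreviations, so this remains a rough approximation of sentence boundaries.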


## Sentence Cleaning
* I have created a custom function to perform text cleaning. It is designed to remove URLs, hashtags, @ user mentions, digits, punctuation, other non-alphabetic characters, and extra whitespace.
* It also converts the text to lower case, splits it into words, removes stop words, lemmatizes each word, and finally joins the words back into a sentence.

In [None]:
stop_words = set(stopwords.words('english'))  # words to remove, such as "will", "i", "and", "at"

In [None]:
def get_clean_sentence(sentence):

  """
    Clean and preprocess a given sentence or text.

    Parameters:
    -----------
    sentence : str
        The input sentence or text to be cleaned and preprocessed.

    Returns:
    --------
    str
        The cleaned and preprocessed sentence or text.
    """

  # Remove handles (@), numbers, URLs with a scheme, emojis and any other special characters so only letters remain
  sentence_text_cln = re.sub(r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+://\S+)|[0-9]", ' ', str(sentence))

  # Remove URLs
  sentence_text_cln = re.sub(r'http\S+|www.\S+', '', sentence_text_cln)

  # Remove Punctuations
  sentence_text_cln = re.sub(r'[^\w\s]', '', sentence_text_cln)

  # Remove Numbers
  sentence_text_cln = re.sub(r'\d+', '', sentence_text_cln)

  # Remove extra white spaces
  sentence_text_cln = re.sub(r'\s+', ' ', sentence_text_cln).strip()

  # Remove @ mentions
  sentence_text_cln = re.sub(r'@\w+', '', sentence_text_cln)

  # Remove hashtags
  sentence_text_cln = re.sub(r'#[A-Za-z0-9]+', '', sentence_text_cln)

  # Convert all words to lower case
  sentence_text_cln = sentence_text_cln.lower()

  # Split sentences into words
  sentence_text_cln = sentence_text_cln.split()

  # Remove English stopwords
  sentence_text_cln = [x for x in sentence_text_cln if x not in stop_words]

  # Lemmatize the text
  lemmatizer = WordNetLemmatizer()
  sentence_text_cln = [lemmatizer.lemmatize(x) for x in sentence_text_cln]

  # Join the remaining words back into a single cleaned sentence
  sentence_text_cln = " ".join(sentence_text_cln)

  # Return the cleaned sentence
  return sentence_text_cln
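
In the function above, the first substitution already turns `#` and `.` into spaces, so the later hashtag pattern appears to have nothing left to match, and a scheme-less address such as `www.amazon.in` is only partly removed. The sketch below shows one possible ordering, assuming the intent is to drop whole URL, mention, and hashtag tokens. It covers only the regex steps (no stop word removal or lemmatization), and `strip_noise` is a hypothetical name, not part of the notebook:

```python
import re

def strip_noise(sentence):
    """Regex-only cleaning sketch: whole-token removals run before the catch-all."""
    text = str(sentence)
    # Remove URLs first, while '.' and '/' are still present to anchor the match
    text = re.sub(r'(\w+://\S+)|(www\.\S+)', ' ', text)
    # Remove @mentions and #hashtags as whole tokens
    text = re.sub(r'[@#]\w+', ' ', text)
    # Replace everything that is not a letter or whitespace (digits, punctuation, emojis)
    text = re.sub(r'[^A-Za-z\s]', ' ', text)
    # Collapse runs of whitespace and lower-case the result
    return re.sub(r'\s+', ' ', text).strip().lower()

# Hypothetical sample text, not taken from the review file
print(strip_noise("Bought from www.amazon.in for 45,000 #deal @seller, great!"))
# prints: bought from for great
```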

* Iterate through the reviews, clean each sentence, and store the result in the **Cleaned Reviews** column.

In [None]:
# Create an empty list to store the cleaned reviews
cleaned_reviews = []

# Iterate through each item (index and review) in the 'Individual Sentence' column
for i, review in df['Individual Sentence'].items():
    # Printing review and individual sentence details
    print(f"{'-'*20}")
    print(f"Review number = {i+1}")  # Index of the review in the list

    # Count the number of sentences in this review
    print(f"\tNumber of sentences in this review = {len(review)}")

    sentence_list = []

    # Iterate through each sentence in the current review
    for sentence in review:
        # Clean and preprocess each sentence using the get_clean_sentence function
        cleaned_sentence = get_clean_sentence(sentence)
        sentence_list.append(cleaned_sentence)

    # Append the cleaned sentences for this review to the cleaned_reviews list
    cleaned_reviews.append(sentence_list)

# Create a new column 'Cleaned Reviews' in the DataFrame to store the cleaned sentences
df['Cleaned Reviews'] = cleaned_reviews

--------------------
Review number = 1
	Number of sentences in this review = 7
--------------------
Review number = 2
	Number of sentences in this review = 18
--------------------
Review number = 3
	Number of sentences in this review = 15
--------------------
Review number = 4
	Number of sentences in this review = 7
--------------------
Review number = 5
	Number of sentences in this review = 5


In [None]:
# 'Cleaned Reviews' Column
df

Unnamed: 0,Reviews,Individual Sentence,Cleaned Reviews
0,charger is too delicate. I will post again if ...,"[charger is too delicate, I will post again i...","[charger delicate, post get replacement, pin m..."
1,I don't know why people are cribbing so much a...,[I don't know why people are cribbing so much ...,"[know people cribbing much screen, laptop ip d..."
2,After 5 days of review\nI bought this one from...,"[After 5 days of review, I bought this one fro...","[day review, bought one amazon great, indian f..."
3,"Firstly , I would like to talk about the scree...","[Firstly , I would like to talk about the scre...",[firstly would like talk screen quality averag...
4,I am writing this review after one month of us...,[I am writing this review after one month of u...,[writing review one month usage first got lapt...


## Hugging Face - Sentiment Analysis Pipeline
* I have chosen to use the default model and tokenizer for the **Sentiment Analysis** pipeline from Hugging Face, which uses the following labels: POSITIVE, NEGATIVE.
* I have created a custom function to return the label for a given sentence.

In [None]:
# Create a sentiment analysis pipeline using the Hugging Face Transformers library.
# Only the "sentiment-analysis" task is specified, so the library loads its default pre-trained model.
# The resulting pipeline can be called on text inputs to classify their sentiment.
nlp = pipeline("sentiment-analysis")

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


In [None]:
def get_sentiment_label(sentence):
    """
    Analyze the sentiment of a given sentence using the Hugging Face Transformers pipeline.

    Parameters:
    -----------
    sentence : str
        The input sentence for sentiment analysis.

    Returns:
    --------
    str
        The sentiment label ('POSITIVE','NEGATIVE') of the input sentence.
    """
    # Analyze the sentiment of the input sentence using the Hugging Face Transformers pipeline
    result = nlp(sentence)

    # Extract and return the sentiment label from the pipeline result
    return result[0]['label']

## Obtaining Sentence-wise Sentiment Analysis
* The sentiment of each sentence (*Positive or Negative*) is obtained and stored in a dictionary, where the *sentence* is the key and the *sentiment label* is the value.
* This dictionary is stored in a new column called **Review Labels**.

In [None]:
# Create an empty list to store the sentiment labels for sentences in each review
sentence_label_list = []

# Iterate through each item (index and cleaned reviews) in the 'Cleaned Reviews' column
for i, review in df['Cleaned Reviews'].items():
    # Printing review and individual sentence details
    print(f"{'-'*20}")
    print(f"Review number = {i+1}")  # Index of the review in the list

    # Count the number of sentences in this review
    print(f"\tNumber of sentences in this review = {len(review)}")

    # Create a dictionary to store sentence-label pairs for this review
    sentence_label_dict = {}

    # Iterate through each sentence in the current review
    for j, sentence in enumerate(review):
        print(f"\t{'-'*10}")
        print(f"\tSentence number = {j+1}")
        print(f"\tSentence = {sentence}")

        # Get the sentiment label for the current sentence using the get_sentiment_label function
        label = get_sentiment_label(sentence)
        print(f"\tLabel = {label}")

        # Store the sentence-label pair in the dictionary
        sentence_label_dict[sentence] = label

    # Append the dictionary of sentence-label pairs for this review to the sentence_label_list
    sentence_label_list.append(sentence_label_dict)

# Create a new column 'Review Labels' in the DataFrame to store the sentiment labels
df['Review Labels'] = sentence_label_list

--------------------
Review number = 1
	Number of sentences in this review = 7
	----------
	Sentence number = 1
	Sentence = charger delicate
	Label = POSITIVE
	----------
	Sentence number = 2
	Sentence = post get replacement
	Label = NEGATIVE
	----------
	Sentence number = 3
	Sentence = pin mm outer diameter
	Label = POSITIVE
	----------
	Sentence number = 4
	Sentence = mm inner diameter
	Label = POSITIVE
	----------
	Sentence number = 5
	Sentence = get market one artis price
	Label = NEGATIVE
	----------
	Sentence number = 6
	Sentence = gb expandable gb throwing away existing gb ram
	Label = NEGATIVE
	----------
	Sentence number = 7
	Sentence = thought gb run memory odd tab different browzers one excel window explorer get message restart browzer memory shortage
	Label = NEGATIVE
--------------------
Review number = 2
	Number of sentences in this review = 18
	----------
	Sentence number = 1
	Sentence = know people cribbing much screen
	Label = POSITIVE
	----------
	Sentence number = 2


In [None]:
# 'Review Labels' column
df

Unnamed: 0,Reviews,Individual Sentence,Cleaned Reviews,Review Labels
0,charger is too delicate. I will post again if ...,"[charger is too delicate, I will post again i...","[charger delicate, post get replacement, pin m...","{'charger delicate': 'POSITIVE', 'post get rep..."
1,I don't know why people are cribbing so much a...,[I don't know why people are cribbing so much ...,"[know people cribbing much screen, laptop ip d...",{'know people cribbing much screen': 'POSITIVE...
2,After 5 days of review\nI bought this one from...,"[After 5 days of review, I bought this one fro...","[day review, bought one amazon great, indian f...","{'day review': 'POSITIVE', 'bought one amazon ..."
3,"Firstly , I would like to talk about the scree...","[Firstly , I would like to talk about the scre...",[firstly would like talk screen quality averag...,{'firstly would like talk screen quality avera...
4,I am writing this review after one month of us...,[I am writing this review after one month of u...,[writing review one month usage first got lapt...,{'writing review one month usage first got lap...
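
Because the dictionary is keyed by the cleaned sentence, two sentences that clean down to the same text share one entry and would count once in any later tally. A sentence that cleans down to an empty string is also still sent to the model. The sketch below is one possible alternative that keeps (sentence, label) pairs in order and skips blank sentences. The names `label_sentences` and `fake_classifier` are hypothetical, and the stand-in classifier exists only so the example runs without downloading a model:

```python
def label_sentences(sentences, classifier):
    """Return (sentence, label) pairs in order, skipping blank sentences."""
    kept = [s for s in sentences if s.strip()]
    if not kept:
        return []
    # Hugging Face pipelines accept a list of strings and return one dict per input
    results = classifier(kept)
    return [(s, r['label']) for s, r in zip(kept, results)]

# Stand-in for the real pipeline: mimics its output shape with a trivial rule
def fake_classifier(texts):
    return [{'label': 'POSITIVE' if 'good' in t else 'NEGATIVE', 'score': 1.0}
            for t in texts]

pairs = label_sentences(['good screen', '', 'battery drain', 'good screen'], fake_classifier)
print(pairs)
```

In the notebook, the `nlp` pipeline object could be passed in place of `fake_classifier`. Passing the whole list in one call also avoids invoking the pipeline once per sentence.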


## Obtaining Final Review Sentiment
* Finally, I use majority voting to compute the final sentiment of each review (Positive, Negative, or Neutral when the counts tie) and store it in the **Final Label** column.

In [None]:
def get_majority_voting(review):
    """
    Determine the majority sentiment label for a given review based on the sentiment labels of its sentences.

    Parameters:
    -----------
    review : dict
        A dictionary where keys are sentences and values are sentiment labels.

    Returns:
    --------
    str
        The majority sentiment label for the review ("Positive", "Negative", or "Neutral").
    """
    # Count the occurrences of sentiment labels in the review using Counter
    value_counts = Counter(review.values())

    # Access the count of "POSITIVE" and "NEGATIVE" sentiment labels
    positive_count = value_counts["POSITIVE"]
    negative_count = value_counts["NEGATIVE"]

    # Determine the majority sentiment label based on counts
    if positive_count > negative_count:
        return "Positive"
    elif negative_count > positive_count:
        return "Negative"
    else:
        return "Neutral"
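
As a worked example of the vote, the seven labels printed for review 1 earlier contain three POSITIVE and four NEGATIVE, so the review comes out Negative. The sketch below reproduces that count with a hypothetical `majority_label` helper that takes a plain list of labels rather than the dictionary used in the notebook:

```python
from collections import Counter

def majority_label(labels):
    """Majority vote over sentence labels; a tie is reported as Neutral."""
    counts = Counter(labels)
    # Counter returns 0 for a label that never occurs, so no KeyError is possible
    positive, negative = counts["POSITIVE"], counts["NEGATIVE"]
    if positive > negative:
        return "Positive"
    if negative > positive:
        return "Negative"
    return "Neutral"

# Labels for review 1, in the order printed in the output above
review_1 = ["POSITIVE", "NEGATIVE", "POSITIVE", "POSITIVE",
            "NEGATIVE", "NEGATIVE", "NEGATIVE"]
print(majority_label(review_1))  # prints: Negative
```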

In [None]:
# Determine the final sentiment for each review using majority voting
final_label = [get_majority_voting(review) for review in df['Review Labels']]
df['Final Label'] = final_label

In [None]:
# 'Final Label' column
df

Unnamed: 0,Reviews,Individual Sentence,Cleaned Reviews,Review Labels,Final Label
0,charger is too delicate. I will post again if ...,"[charger is too delicate, I will post again i...","[charger delicate, post get replacement, pin m...","{'charger delicate': 'POSITIVE', 'post get rep...",Negative
1,I don't know why people are cribbing so much a...,[I don't know why people are cribbing so much ...,"[know people cribbing much screen, laptop ip d...",{'know people cribbing much screen': 'POSITIVE...,Negative
2,After 5 days of review\nI bought this one from...,"[After 5 days of review, I bought this one fro...","[day review, bought one amazon great, indian f...","{'day review': 'POSITIVE', 'bought one amazon ...",Positive
3,"Firstly , I would like to talk about the scree...","[Firstly , I would like to talk about the scre...",[firstly would like talk screen quality averag...,{'firstly would like talk screen quality avera...,Negative
4,I am writing this review after one month of us...,[I am writing this review after one month of u...,[writing review one month usage first got lapt...,{'writing review one month usage first got lap...,Negative


## Displaying Final Results
* Each original review is displayed with its final sentiment label.

In [None]:
df[['Reviews','Final Label']]

Unnamed: 0,Reviews,Final Label
0,charger is too delicate. I will post again if ...,Negative
1,I don't know why people are cribbing so much a...,Negative
2,After 5 days of review\nI bought this one from...,Positive
3,"Firstly , I would like to talk about the scree...",Negative
4,I am writing this review after one month of us...,Negative
