# Delivery n°4 : Sentiment analysis

*Mathematics and Big Data - Mathias Lommel*

In this 4th delivery, we will perform a sentiment analysis on Amazon reviews. The final goal is to solve some business problems for Amazon.

For this purpose, I have chosen one of the 3 datasets that were proposed in the .zip file : *Datafiniti_Amazon_Consumer_Reviews_of_Amazon_Products_May19.csv* (which is the biggest one, in order to face Big Data issues).

## Library importations

As always, we have to import a few libraries that will be needed for our work.

In [1]:
import pandas as pd
import wordcloud

import nltk
nltk.download('vader_lexicon')
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from nltk.corpus import stopwords
from nltk.corpus import movie_reviews

import string
import re

import numpy as np

[nltk_data] Downloading package vader_lexicon to /root/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


## Definition of the functions

Now, let's define the functions that we will use to solve our problem.

### Reading file

Again, I made this delivery on Google Colab. I have therefore written 2 different functions that can be used to read the .csv file : one for a file stored on Google Drive, and one for a file stored locally.

In [2]:
# Using google Colab
from google.colab import drive
drive.mount('/content/drive')

def open_document_with_Drive(path):
  """
  This function reads a csv file from Drive.

    Input :
        path : string - path of the file
    Output :
       data  : pd.DataFrame - extracted data
  """
  data = pd.read_csv(path)

  return data

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [3]:
# If the document is stored locally on the computer
def open_document(path):
  """
  This function reads a csv file stored locally.

    Input :
        path : string - path of the file
    Output :
        data  : pd.DataFrame - extracted data
  """
  data = pd.read_csv(path)

  return data

### Obtain the opinion of a review

To determine the opinion of a review, we will use the *SentimentIntensityAnalyzer* to decide whether a review is *positive*, *negative* or *neutral*.

Here, because we want to study only positive and negative opinions, the function will give 3 possible results :
  - If positive score > negative score : + positive score
  - If positive score < negative score : - negative score
  - If positive score = negative score : 0

In [5]:
def get_opinion(text):
  """
  This function computes the opinion of a text.

    Input :
        text           : string - review of a product
    Output :
        0              : int - for neutral opinion
        -output['neg'] : float - for negative opinion
        output['pos']  : float - for positive opinion
    """
  vader_analyzer = SentimentIntensityAnalyzer()
  output = vader_analyzer.polarity_scores(text)

  if output['neg']>output['pos']:
    return -output['neg']
  elif  output['pos']>output['neg']:
    return output['pos']
  return 0
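
To make the rule concrete, here is a minimal sketch that applies the same sign logic to hand-written score dictionaries. The dictionaries mimic the `neg` / `neu` / `pos` / `compound` keys returned by `polarity_scores`, but the numbers are made up for illustration, and the helper name `signed_opinion` is hypothetical (it is not part of the notebook).

```python
def signed_opinion(scores):
    """Apply the sign rule of get_opinion to a VADER-like score dictionary."""
    if scores['neg'] > scores['pos']:
        return -scores['neg']
    if scores['pos'] > scores['neg']:
        return scores['pos']
    return 0

# Made-up scores, only meant to illustrate the three cases
mostly_positive = {'neg': 0.0, 'neu': 0.6, 'pos': 0.4, 'compound': 0.7}
mostly_negative = {'neg': 0.5, 'neu': 0.5, 'pos': 0.0, 'compound': -0.6}
balanced = {'neg': 0.2, 'neu': 0.6, 'pos': 0.2, 'compound': 0.0}

print(signed_opinion(mostly_positive))  # 0.4
print(signed_opinion(mostly_negative))  # -0.5
print(signed_opinion(balanced))         # 0
```

Note that the rule ignores the `neu` and `compound` values : only the larger of `pos` and `neg` is kept, with its sign.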

### Pre-processing

As we have done in the previous deliveries, we have to pre-process our data.

Here, we will write 2 functions : one to preprocess a single review, and another one, which uses the first, to preprocess the whole database.

In [6]:
# Function that cleans the data
def preprocess_review(review):
  """
  This function preprocesses a text.

    Input :
        review : string - product's review to preprocess
    Output :
        review : string - preprocessed review
  """
  # Change to lower case
  review = review.lower()

  # Remove URLs (http and https)
  review = re.sub(r"https?://\S+", "", review)

  # Remove emails
  review = re.sub(r'\b[A-Za-z0-9._-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b', '', review)

  # Remove mentions
  review = re.sub(r"@\S+", "", review)

  # Remove punctuations, commas and special characters
  punctuation = string.punctuation
  translation_table = str.maketrans('', '', punctuation)

  review = review.translate(translation_table)

  # Remove numbers
  review = re.sub(r'\d+', '', review)

  return review

def preprocess_data(data):
  """
  This function preprocesses the whole database.

    Input :
        data          : pd.DataFrame - database to preprocess
    Output :
        cleaned_data  : pd.DataFrame - preprocessed database
  """
  # We apply the preprocessing function to each review
  cleaned_reviews = data['reviews.text'].apply(preprocess_review)

  # We build a new dataframe, with preprocessed reviews
  data_without_reviews = data.drop(columns=['reviews.text'])
  cleaned_data = pd.concat([data_without_reviews, cleaned_reviews], axis=1)

  print("Pre-processing successfully computed.")

  return cleaned_data
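
As an illustration of what these cleaning steps do to a single review, here is a small self-contained sketch. The function name `clean_review_sketch` and the sample review are hypothetical, and the URL pattern `https?://\S+` is a simplified assumption of this sketch.

```python
import re
import string

def clean_review_sketch(review):
    """Lower-case a review, then strip URLs, emails, mentions, punctuation and digits."""
    review = review.lower()
    review = re.sub(r"https?://\S+", "", review)                                   # URLs
    review = re.sub(r"\b[a-z0-9._-]+@[a-z0-9.-]+\.[a-z]{2,}\b", "", review)        # emails
    review = re.sub(r"@\S+", "", review)                                           # mentions
    review = review.translate(str.maketrans("", "", string.punctuation))           # punctuation
    review = re.sub(r"\d+", "", review)                                            # numbers
    return review

# Hypothetical review, written for this example
sample = "GREAT batteries!!! Bought 2 packs from @seller, see http://example.com"
cleaned = clean_review_sketch(sample)

# Removed tokens leave extra spaces behind, so we normalise whitespace for display
print(" ".join(cleaned.split()))  # great batteries bought packs from see
```

The cleaned text keeps only lower-case words, so any emphasis carried by capital letters or exclamation marks is lost before the sentiment is computed.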


### Summary matrix

Here, we are going to define a function that will create a "summary matrix".

This matrix will be useful to solve the problems that we are asked to solve.

It will have one row per distinct product in the database, and the following columns :
  - **Product** : Name of the product
  - **Brand** : Brand of the product
  - **Category** : Main category of the product
  - **Rate** : Mean opinion rate of its reviews
  - **Variance** : Variance of its reviews
  - **Score** : A score to quantify the quality of the product, taking into account its reviews*
  - **nb_reviews** : Number of reviews for the product
  - **nb_Positive** : Number of positive reviews
  - **nb_Negative** : Number of negative reviews
  - **nb_pos_threshold** : Number of positive reviews, with a rate over the threshold (parameter of the function)
  - **nb_neg_threshold** : Number of negative reviews, with a rate below minus the threshold

*The score is based on different parameters :
  - The mean rate : the greater it is, the better the score
  - The proportion of positive and negative reviews over the total number of reviews
  - The proportion of significant positive (resp. negative) reviews over the total number of positive (resp. negative) reviews
  - The number of reviews
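
To show how these parameters combine, here is a minimal sketch of the score for one product, written on a plain list of rates. It follows the formula used in `business_problem`; the function name `product_score` and the sample rates are hypothetical and only serve as a worked example.

```python
def product_score(rates, threshold):
    """Score of one product, from the list of signed rates of its reviews."""
    n = len(rates)
    mean_rate = sum(rates) / n
    # With 5 reviews or fewer, the ratios are not meaningful: the score is the mean rate
    if n <= 5:
        return mean_rate

    nb_pos = sum(r > 0 for r in rates)
    nb_neg = sum(r < 0 for r in rates)
    nb_pos_threshold = sum(r > threshold for r in rates)
    nb_neg_threshold = sum(r < -threshold for r in rates)

    sign = 1 if mean_rate > 0 else -1
    ratio_pos = nb_pos_threshold / nb_pos if nb_pos > 0 else 0
    ratio_neg = nb_neg_threshold / nb_neg if nb_neg > 0 else 0
    balance = (nb_pos - nb_neg) / n + ratio_pos - ratio_neg

    return mean_rate * (1 + balance * sign + 0.0001 * n)

# Made-up rates: 6 strongly positive reviews and 2 strongly negative ones
rates = [0.5] * 6 + [-0.5] * 2
print(round(product_score(rates, 0.4), 4))    # 0.3752
print(round(product_score([0.5, -0.2], 0.4), 4))  # 0.15 (too few reviews, score = mean rate)
```

In the first example the mean rate is 0.25, the balance term is (6 - 2)/8 + 1 - 1 = 0.5, and the size bonus is 0.0001 × 8, which gives 0.25 × 1.5008 = 0.3752.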

In [7]:
import warnings

def business_problem(data,threshold):
  """
  This function computes the Summary matrix.

    Input :
        data          : pd.DataFrame - database of study
        threshold     : float - threshold for significant rates
    Output :
        res           : pd.DataFrame - Summary matrix
  """

  # We build a new data frame (summary matrix)
  res = pd.DataFrame(columns=['Product','Brand','Category','Rate','Variance','Score','nb_reviews','nb_Positive','nb_Negative','nb_pos_threshold','nb_neg_threshold'])

  # We will work on each product separately
  data_by_product = data.groupby('name')

  for key in list(data_by_product.groups.keys()):
    # We get the part of the data frame dedicated to the product being studied
    data_product = data_by_product.get_group(key)

    # We get the rates for this product
    rates = data_product['reviews.text'].apply(get_opinion)

    # Mean rate
    mean_rate = sum(rates)/len(data_product)
    # Variance of the rates
    var = sum( (rates - mean_rate)**2 )/len(data_product)

    # Number of positive / negative rates
    nb_pos = sum( rates > 0 )
    nb_neg = sum( rates < 0 )
    # Number of significant positive / negative rates
    nb_pos_threshold = sum( rates > threshold )
    nb_neg_threshold = sum( rates < -threshold )

    # Computation of a product's score
    ## Positivity test of the global opinion
    test_pos_neg = 1 if mean_rate>0 else -1
    ## Ratio of positive/negative significant reviews
    ratio_pos_threshold = nb_pos_threshold/nb_pos if nb_pos > 0 else 0
    ratio_neg_threshold = nb_neg_threshold/nb_neg if nb_neg > 0 else 0
    ## We compute the score considering that a product needs more than 5 reviews to change its score (ratios are not significant on small-sized data)
    score = mean_rate * ( 1 + ((nb_pos - nb_neg)/len(data_product) + ratio_pos_threshold - ratio_neg_threshold)*test_pos_neg + 0.0001*len(data_product)) if len(data_product)>5 else mean_rate

    # We mute FutureWarning (to avoid showing them when executing the function)
    with warnings.catch_warnings():
      warnings.simplefilter(action='ignore', category=FutureWarning)
      # We add this product's data to the result data frame
      # (pd.concat is used because DataFrame.append was removed in pandas 2.0)
      new_row = {'Product':data_product.iloc[0]['name'], 'Brand' : data_product.iloc[0]['brand'], 'Category':data_product.iloc[0]['primaryCategories'], 'Rate':mean_rate, 'Variance':var, 'Score':score, 'nb_reviews':len(data_product), 'nb_Positive':nb_pos, 'nb_Negative':nb_neg, 'nb_pos_threshold':nb_pos_threshold,'nb_neg_threshold':nb_neg_threshold}
      res = pd.concat([res, pd.DataFrame([new_row])], ignore_index=True)

  print("Summary Matrix successfully computed")

  return res


### Printing function

Here, we define a printing function in order to show clearly the different products that we want to highlight.

In [8]:
def print_products(products,type):
  """
  This function shows selected products, with their characteristics.

    Input :
        products : pd.DataFrame - products to print
        type     : string - Type of products
  """

  nb_products = len(products)
  print("\n")
  print(type)
  print("\n")
  for i in range(nb_products):
    print(i+1,"-",products.iloc[i]['Product'])
    print("       Brand          :",products.iloc[i]['Brand'])
    print("       Category       :",products.iloc[i]['Category'])
    print("       Average Rate   :",products.iloc[i]['Rate'])
    print("       Number reviews :",products.iloc[i]['nb_reviews'])
    print("       Score          :",products.iloc[i]['Score'])
    print("\n")

## Application of our functions

Now, we are going to apply our functions to the dataset under study, in order to try to solve Amazon's problems.

### Reading of the file

In [9]:
# Reading of the file
path = '/content/drive/My Drive/Datafiniti_Amazon_Consumer_Reviews_of_Amazon_Products_May19.csv'
data = open_document_with_Drive(path)

### Preprocessing

In [10]:
cleaned_data = preprocess_data(data)

Pre-processing successfully computed.


### Computation of the summary Matrix

In [11]:
data_summary = business_problem(cleaned_data,0.4)

Summary Matrix successfully computed


### Problems solving

Let's now answer the different questions. For each of them, we will give the 5 products which seem to best match the ones that we want to select.


**1 - Products to keep on the website**.

The products that we want to keep are the ones that are frequently ordered, and that are considered as good products by customers.

We will therefore keep the products with the largest number of positive reviews.

In [12]:
# Products to keep
## We sort the matrix, with decreasing nb_Positive
sorted_data = data_summary.sort_values(by='nb_Positive', ascending=False)
## We keep the 5 first products of the list
products_to_keep = sorted_data.head(5)
## We print the result
print_products(products_to_keep,"Products that should be kept :")



Products that should be kept :


1 - AmazonBasics AAA Performance Alkaline Batteries (36 Count)
       Brand          : Amazonbasics
       Category       : Health & Beauty
       Average Rate   : 0.3133037276758983
       Number reviews : 8343
       Score          : 0.8851516019551187


2 - AmazonBasics AA Performance Alkaline Batteries (48 Count) - Packaging May Vary
       Brand          : Amazonbasics
       Category       : Health & Beauty
       Average Rate   : 0.31934844420600794
       Number reviews : 3728
       Score          : 0.7586352359572055


3 - Fire HD 8 Tablet with Alexa, 8 HD Display, 16 GB, Tangerine - with Special Offers
       Brand          : Amazon
       Category       : Electronics
       Average Rate   : 0.26132746623004527
       Number reviews : 2443
       Score          : 0.5978631531132693


4 - All-New Fire HD 8 Tablet, 8 HD Display, Wi-Fi, 16 GB - Includes Special Offers, Black
       Brand          : Amazon
       Category       : Electronics
  

We can see here that the selection is quite similar to what we expected to find. Products of this type, with a huge number of reviews, many of them good, are in my opinion the best products to order, and so the ones to keep on the website.

**2 - Products which should be dropped**.

Here, it's the complete opposite : we want to drop the products that are not ordered frequently, and that are not well rated.

By looking at badly rated products, we can see that most of them have fewer than 5 reviews. We will therefore consider the lowest rated products as the ones to be dropped.


In [13]:
# Products to drop
sorted_data = data_summary.sort_values(by='Rate', ascending=True)
## We keep the 5 first products of the list
products_to_drop = sorted_data.head(5)
## We print the result
print_products(products_to_drop,"Products that should be dropped :")



Products that should be dropped :


1 - Oem Amazon Kindle Power Usb Adapter Wall Travel Charger Fire/dx/+micro Usb Cable
       Brand          : Amazon
       Category       : Electronics
       Average Rate   : -0.12449999999999999
       Number reviews : 4
       Score          : -0.12449999999999999


2 - Certified Refurbished Amazon Fire TV with Alexa Voice Remote
       Brand          : Amazon
       Category       : Electronics
       Average Rate   : -0.06939999999999999
       Number reviews : 5
       Score          : -0.06939999999999999


3 - Amazon Kindle Replacement Power Adapter (Fits Latest Generation Kindle and Kindle DX) For shipment in the U.S only
       Brand          : Amazon
       Category       : Electronics
       Average Rate   : -0.0038000000000000035
       Number reviews : 5
       Score          : -0.0038000000000000035


4 - AmazonBasics Silicone Hot Handle Cover/Holder - Red
       Brand          : Amazonbasics
       Category       : Home & Garden
   

As in the first case, the result matches what we expected : these products are not really interesting, with just a few reviews, which are not really good.

**3 - Products that are junk**

Here, we will consider the number of strongly negative reviews (those with a rate below -0.4, i.e. beyond the threshold 0.4) as a good indicator for junk products.
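
Under this criterion, the products with the most strongly negative reviews should come first, which corresponds to a descending order on that count. A minimal sketch on made-up data (the product names, the rates and the helper `count_strong_negatives` are all hypothetical) :

```python
def count_strong_negatives(rates, threshold):
    """Number of reviews whose signed rate is below -threshold."""
    return sum(r < -threshold for r in rates)

# Hypothetical products with made-up review rates
rates_by_product = {
    'product A': [0.5, 0.3, -0.1],
    'product B': [-0.6, -0.5, 0.2],
    'product C': [-0.45, 0.4, 0.1],
}

counts = {name: count_strong_negatives(rates, 0.4)
          for name, rates in rates_by_product.items()}

# Descending order: the product with the most strongly negative reviews comes first
ranking = sorted(counts, key=counts.get, reverse=True)
print(ranking)  # ['product B', 'product C', 'product A']
```

A raw count favours products with many reviews, so dividing it by the number of reviews of each product could be a reasonable variant.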

In [18]:
# Junk Products
sorted_data = data_summary.sort_values(by='nb_neg_threshold', ascending=True)
## We keep the 5 first products of the list
junk_products = sorted_data.head(5)
## We print the result
print_products(junk_products,"Products that are junk :")



Products that are junk :


1 - All-New Fire 7 Tablet with Alexa, 7" Display, 8 GB - Marine Blue
       Brand          : Amazon
       Category       : Electronics
       Average Rate   : 0.27275609756097563
       Number reviews : 82
       Score          : 0.578824581713429


2 - Oem Amazon Kindle Power Usb Adapter Wall Travel Charger Fire/dx/+micro Usb Cable
       Brand          : Amazon
       Category       : Electronics
       Average Rate   : -0.12449999999999999
       Number reviews : 4
       Score          : -0.12449999999999999


3 - AmazonBasics External Hard Drive Case
       Brand          : Amazonbasics
       Category       : Electronics
       Average Rate   : 0.218
       Number reviews : 6
       Score          : 0.39979746666666666


4 - AmazonBasics Nespresso Pod Storage Drawer - 50 Capsule Capacity
       Brand          : AmazonBasics
       Category       : Home & Garden
       Average Rate   : 0.271
       Number reviews : 1
       Score          : 0.271


5 

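I think that the criterion used here can be discussed, since the first product does not seem to be that bad. Note also that the sort above is ascending, so it lists the products with the fewest strongly negative reviews; sorting with `ascending=False` would match the stated criterion more closely.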

**4 - Products recommended**

We will consider a specific time period : because trends move fast, recommending a product requires taking the dates of the reviews into account.

Here, our database stops on 24-11-2018. So we will imagine that we want to recommend a product for December 2018. For this purpose, we take into account the reviews written during the year preceding this recommendation date (from 24-11-2017 to 24-11-2018).

Then, we will use the score created before to determine the best current products.
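
The time window itself can be sketched with plain `datetime.date` objects. Comparing whole dates avoids having to combine separate conditions on the day, the month and the year. The names `WINDOW_START`, `WINDOW_END` and `in_recommendation_window` are hypothetical; the two bounds are the ones stated above.

```python
from datetime import date

WINDOW_START = date(2017, 11, 24)
WINDOW_END = date(2018, 11, 24)

def in_recommendation_window(d):
    """True if the date d lies between the two bounds (both included)."""
    return WINDOW_START <= d <= WINDOW_END

print(in_recommendation_window(date(2017, 12, 10)))  # True
print(in_recommendation_window(date(2017, 11, 23)))  # False
print(in_recommendation_window(date(2018, 12, 1)))   # False
```

A date such as 10-12-2017 is a useful test case : its day is below 24, yet it clearly belongs to the window.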

In [19]:
# Recommended products
## We build a new dataset, composed of the most recent reviews
actual_data = cleaned_data.copy()
actual_data['Date'] = pd.to_datetime(actual_data['dateAdded'])
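## We keep the dates from 24-11-2017 to 24-11-2018 (both included)
year, month, day = actual_data['Date'].dt.year, actual_data['Date'].dt.month, actual_data['Date'].dt.day
from_2017 = (year == 2017) & ((month == 12) | ((month == 11) & (day >= 24)))
until_2018 = (year == 2018) & ((month < 11) | ((month == 11) & (day <= 24)))
actual_data = actual_data[from_2017 | until_2018]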
## We compute a new summary matrix
actual_summary = business_problem(actual_data,0.4)

## We get the products with the best score
sorted_data = actual_summary.sort_values(by='Score', ascending=False)
## We keep the 5 first products of the list
recommended_products = sorted_data.head(5)
## We print the result
print_products(recommended_products,"Most recommended products :")

Summary Matrix successfully computed


Most recommended products :


1 - All-New Kindle Oasis E-reader - 7 High-Resolution Display (300 ppi), Waterproof, Built-In Audible, 32 GB, Wi-Fi - Includes Special Offers
       Brand          : Amazon
       Category       : Electronics,Media
       Average Rate   : 0.30214285714285716
       Number reviews : 7
       Score          : 0.6044972142857143


2 - Cat Litter Box Covered Tray Kitten Extra Large Enclosed Hooded Hidden Toilet
       Brand          : Amazonbasics
       Category       : Animals & Pet Supplies
       Average Rate   : 0.5820000000000001
       Number reviews : 2
       Score          : 0.5820000000000001


3 - Certified Refurbished Amazon Echo
       Brand          : Amazon
       Category       : Electronics
       Average Rate   : 0.223
       Number reviews : 7
       Score          : 0.41960848095238096


4 - Fire TV Stick Streaming Media Player Pair Kit
       Brand          : Amazon
       Category       : Electronic

**5 - Best products**

Best products are the ones with the best score.

In [20]:
# Best products
sorted_data = data_summary.sort_values(by='Score', ascending=False)
## We keep the 5 first products of the list
best_products = sorted_data.head(5)
## We print the result
print_products(best_products,"Best products :")



Best products :


1 - AmazonBasics AAA Performance Alkaline Batteries (36 Count)
       Brand          : Amazonbasics
       Category       : Health & Beauty
       Average Rate   : 0.3133037276758983
       Number reviews : 8343
       Score          : 0.8851516019551187


2 - Expanding Accordion File Folder Plastic Portable Document Organizer Letter Size
       Brand          : Amazonbasics
       Category       : Office Supplies
       Average Rate   : 0.3323333333333333
       Number reviews : 9
       Score          : 0.8311324333333333


3 - AmazonBasics AA Performance Alkaline Batteries (48 Count) - Packaging May Vary
       Brand          : Amazonbasics
       Category       : Health & Beauty
       Average Rate   : 0.31934844420600794
       Number reviews : 3728
       Score          : 0.7586352359572055


4 - All-New Fire HD 8 Tablet, 8 HD Display, Wi-Fi, 16 GB - Includes Special Offers, Blue
       Brand          : Amazon
       Category       : Electronics
       Average

The way the score has been built can certainly be discussed, but this example shows that it combines different aspects quite well : number of reviews, average rate, number of positive/negative reviews, ...

**6 - Products for coming winter**

Once again, we have to study a particular range of time. Let's focus on the winter period, and choose the best products.

As we did to recommend products, we now only consider winter reviews (written from December to February).
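
Since winter spans two calendar years, the months cannot be written as a single range : the test has to accept December or January/February. A minimal sketch, where the names `WINTER_MONTHS` and `is_winter` are hypothetical :

```python
from datetime import date

WINTER_MONTHS = {12, 1, 2}

def is_winter(d):
    """True if the date d falls in December, January or February (any year)."""
    return d.month in WINTER_MONTHS

print(is_winter(date(2017, 12, 25)))  # True
print(is_winter(date(2018, 2, 14)))   # True
print(is_winter(date(2018, 7, 1)))    # False
```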

In [21]:
# Best products for coming winter
## New database, for winter reviews
winter_data = cleaned_data.copy()
winter_data['Date'] = pd.to_datetime(winter_data['dateAdded'])
winter_data = winter_data[(winter_data['Date'].dt.month >= 12) | (winter_data['Date'].dt.month <= 2)]
## New summary matrix
winter_summary = business_problem(winter_data,0.4)

sorted_data = winter_summary.sort_values(by='Score', ascending=False)
## We keep the 5 first products of the list
winter_products = sorted_data.head(5)
## We print the result
print_products(winter_products,"Products for coming winter :")

Summary Matrix successfully computed


Products for coming winter :


1 - AmazonBasics AA Performance Alkaline Batteries (48 Count) - Packaging May Vary
       Brand          : Amazonbasics
       Category       : Health & Beauty
       Average Rate   : 0.31934844420600794
       Number reviews : 3728
       Score          : 0.7586352359572055


2 - Fire Kids Edition Tablet, 7 Display, Wi-Fi, 16 GB, Blue Kid-Proof Case
       Brand          : Amazon
       Category       : Electronics
       Average Rate   : 0.26950385964912266
       Number reviews : 1425
       Score          : 0.5912374050298681


3 - Fire Kids Edition Tablet, 7 Display, Wi-Fi, 16 GB, Pink Kid-Proof Case
       Brand          : Amazon
       Category       : Toys & Games,Electronics
       Average Rate   : 0.26823806682577567
       Number reviews : 1676
       Score          : 0.5888491393946643


4 - Amazon Tap Smart Assistant Alexaenabled (black) Brand New
       Brand          : Amazon
       Category       : El

**7 - Products that require advertisement**

The products that require advertisement are the ones that are not ordered frequently. We therefore look for the products with the lowest number of reviews (here 1), and then select the ones that have the best score.

In [23]:
# Require advertisement
products_with_1_review = data_summary[data_summary['nb_reviews'] == 1]
sorted_data = products_with_1_review.sort_values(by='Score',ascending=False)
## We keep the 5 first products of the list
require_advertisement = sorted_data.head(5)
## We print the result
print_products(require_advertisement,"Products that require advertisment :")



Products that require advertisment :


1 - Two Door Top Load Pet Kennel Travel Crate Dog Cat Pet Cage Carrier Box Tray 23"
       Brand          : Amazonbasics
       Category       : Animals & Pet Supplies
       Average Rate   : 0.412
       Number reviews : 1
       Score          : 0.412


2 - AmazonBasics Single-Door Folding Metal Dog Crate - Large (42x28x30 Inches)
       Brand          : AmazonBasics
       Category       : Animals & Pet Supplies
       Average Rate   : 0.369
       Number reviews : 1
       Score          : 0.369


3 - AmazonBasics Nespresso Pod Storage Drawer - 50 Capsule Capacity
       Brand          : AmazonBasics
       Category       : Home & Garden
       Average Rate   : 0.271
       Number reviews : 1
       Score          : 0.271


4 - Echo Dot (Previous generation)
       Brand          : Amazon
       Category       : Electronics
       Average Rate   : 0.135
       Number reviews : 1
       Score          : 0.135


5 - Amazon Echo Show - Black
  

**8 - How many quintuples have positive sentiment**

In [24]:
# We sum up the number of positive reviews over all products
positives = sum(data_summary['nb_Positive'])
print("Number of quintuples with positive sentiment : ", positives, "over", len(cleaned_data), "reviews.")

Number of quintuples with positive sentiment :  23747 over 28332 reviews.
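
As a quick check, the two figures printed above can be turned into a proportion. The variable names below are hypothetical; the numbers are the ones from the output.

```python
positives = 23747      # reviews with positive sentiment (from the output above)
total_reviews = 28332  # total number of reviews (from the output above)

share = positives / total_reviews
print(f"{share:.1%}")             # 83.8%
print(total_reviews - positives)  # 4585 reviews are negative or neutral
```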


## Conclusion

During this 4th delivery, we used the theoretical notions learnt in class to answer business problems. Re-using the first function that I created for the first delivery, and writing new ones, I had to make some choices on how to answer the different questions.

For me, this delivery was quite different from the others, since it relied much more on autonomous reflection and personal understanding of the questions. Because I made choices, and because I interpreted the questions in my own way, the answers that I gave can of course be discussed.