# Sentiment Analysis
(Sentiment analysis is a special case of the more general text classification problem.)
#### References
Dataset - https://www.kaggle.com/snap/amazon-fine-food-reviews

The Amazon Fine Food Reviews dataset consists of reviews of fine foods from Amazon.

Number of reviews: 568,454<br>
Number of users: 256,059<br>
Number of products: 74,258<br>
Timespan: Oct 1999 - Oct 2012<br>
Number of Attributes/Columns in data: 10 

Attribute Information:

1. Id
2. ProductId - unique identifier for the product
3. UserId - unique identifier for the user
4. ProfileName
5. HelpfulnessNumerator - number of users who found the review helpful
6. HelpfulnessDenominator - number of users who indicated whether they found the review helpful or not
7. Score - rating between 1 and 5
8. Time - timestamp for the review
9. Summary - brief summary of the review
10. Text - text of the review

#### Objective:
Given a review, determine whether it is positive (rating of 4 or 5) or negative (rating of 1 or 2).

[Q] How to determine if a review is positive or negative?<br>
<br> 
[Ans] We could use the Score/Rating. A rating of 4 or 5 could be considered a positive review. A rating of 1 or 2 could be considered negative. A rating of 3 is neutral and is ignored. This is an approximate way of determining the polarity (positivity/negativity) of a review.

#### Loading the data
The dataset is available in two forms
1. .csv file
2. SQLite Database

Since we only want the overall sentiment of each review (positive or negative), we deliberately ignore all reviews with a Score equal to 3. If the Score is above 3, the review is labelled positive. Otherwise, it is labelled negative.
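The notebook loads the SQLite form. For the .csv form, a roughly equivalent way of dropping the Score 3 reviews could look like the sketch below. The inline sample is made up and stands in for the real file, and the file name mentioned in the comment is an assumption about the Kaggle download, not something taken from this notebook.

```python
import io
import pandas as pd

# Tiny made-up stand-in for the CSV file (only two of the ten columns).
sample_csv = io.StringIO(
    "Id,Score\n"
    "1,5\n"
    "2,3\n"
    "3,1\n"
    "4,4\n"
)

# With the real file this would be something like pd.read_csv("Reviews.csv")
# (file name assumed, not taken from the notebook).
reviews = pd.read_csv(sample_csv)

# Keep only non-neutral reviews, i.e. drop every row with Score == 3.
filtered = reviews[reviews["Score"] != 3]
print(filtered["Score"].tolist())
```

Boolean indexing returns a new DataFrame, so the original `reviews` frame is left unchanged.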

In [1]:
## Import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

import sqlite3    ## SQL Interface
import pickle     ## Used to save your data - Converts objects to byte stream and vice versa
import time       ## Time module - to calculate time taken for execution

## Modules to perform Text Preprocessing
import re  # Regular Expressions
# References:
# https://docs.python.org/3/library/re.html (Official Documentation)
# https://pymotw.com/3/re/ (This is nice tutorial with examples)

import nltk # Natural Language Tool Kit
from nltk.corpus import stopwords

In [60]:
# using the SQLite Table to read data.
conn = sqlite3.connect('../8. Amazon Fine Food Review/database.sqlite')

# filtering only positive and negative reviews i.e. 
# not taking into consideration those reviews with Score=3
filtered_data = pd.read_sql_query("""
SELECT *
FROM Reviews
WHERE Score != 3
""", conn)

# close the connection to the database
conn.close()

In [61]:
# View the top 5 rows
filtered_data.head()

Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text
0,1,B001E4KFG0,A3SGXH7AUHU8GW,delmartian,1,1,5,1303862400,Good Quality Dog Food,I have bought several of the Vitality canned d...
1,2,B00813GRG4,A1D87F6ZCVE5NK,dll pa,0,0,1,1346976000,Not as Advertised,Product arrived labeled as Jumbo Salted Peanut...
2,3,B000LQOCH0,ABXLMWJIXXAIN,"Natalia Corres ""Natalia Corres""",1,1,4,1219017600,"""Delight"" says it all",This is a confection that has been around a fe...
3,4,B000UA0QIQ,A395BORC6FGVXV,Karl,3,3,2,1307923200,Cough Medicine,If you are looking for the secret ingredient i...
4,5,B006K2ZZ7K,A1UQRSCLF8GW1T,"Michael D. Bigham ""M. Wassir""",0,0,5,1350777600,Great taffy,Great taffy at a great price. There was a wid...


In [7]:
# shape of the dataset
filtered_data.shape

(525814, 10)

### Mark all the reviews with score 1 and 2 as negative while those with score 4 and 5 as positive. Ignore the reviews with score 3.

In [62]:
# Give reviews with Score > 3 a positive rating, and reviews with a Score < 3 a negative rating.
def partition(x):
    if x < 3:
        return 0 # indicating negative
    return 1 # indicating positive

## A pandas Series has a map method which applies a function to every element
filtered_data['Score'] = filtered_data['Score'].map(partition)

In [63]:
filtered_data.head()

Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text
0,1,B001E4KFG0,A3SGXH7AUHU8GW,delmartian,1,1,1,1303862400,Good Quality Dog Food,I have bought several of the Vitality canned d...
1,2,B00813GRG4,A1D87F6ZCVE5NK,dll pa,0,0,0,1346976000,Not as Advertised,Product arrived labeled as Jumbo Salted Peanut...
2,3,B000LQOCH0,ABXLMWJIXXAIN,"Natalia Corres ""Natalia Corres""",1,1,1,1219017600,"""Delight"" says it all",This is a confection that has been around a fe...
3,4,B000UA0QIQ,A395BORC6FGVXV,Karl,3,3,0,1307923200,Cough Medicine,If you are looking for the secret ingredient i...
4,5,B006K2ZZ7K,A1UQRSCLF8GW1T,"Michael D. Bigham ""M. Wassir""",0,0,1,1350777600,Great taffy,Great taffy at a great price. There was a wid...


**Note:** Time is stored as a Unix timestamp, but it carries only the date of the review, not the time of day.

In [14]:
filtered_data['Time'].nunique()

3157
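To make the note about Time concrete, the sketch below converts a few of the Time values shown in the rows above into UTC datetimes using only the standard library. It is an illustration, not part of the original notebook.

```python
from datetime import datetime, timezone

# Time values copied from the first three rows displayed above.
timestamps = [1303862400, 1346976000, 1219017600]

for ts in timestamps:
    dt = datetime.fromtimestamp(ts, tz=timezone.utc)
    # Each value falls exactly on midnight UTC, so only the date is informative.
    print(ts, "->", dt.date(), dt.time())
```

Every value is a whole multiple of 86400 seconds, which is why the time-of-day part is always 00:00:00.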

## Exploratory Data Analysis and Data Cleaning
### Cleaning Part 1 - Deduplication (Removing duplicates from the data)
Real-world data often contains duplicate entries, which must be removed; otherwise the results may be biased.<br>
Two common sayings in ML apply here: **Garbage in - Garbage out** and **Better Data beats Fancier Algorithms**.

Let's look at one example user (this case was found after some experimentation).

In [66]:
filtered_data[(filtered_data['ProfileName'] == 'R. Ellis "Bobby"') & \
              (filtered_data['Summary'] == 'The price is right')]

Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text
8166,8935,B0007A0AP8,A74SHV5ZD3RLT,"R. Ellis ""Bobby""",15,15,1,1303862400,The price is right,We have a little Maltese that we spoil to no e...
162162,175817,B0014DUUFC,A74SHV5ZD3RLT,"R. Ellis ""Bobby""",15,15,1,1303862400,The price is right,We have a little Maltese that we spoil to no e...
494170,534267,B0007A0AOY,A74SHV5ZD3RLT,"R. Ellis ""Bobby""",15,15,1,1303862400,The price is right,We have a little Maltese that we spoil to no e...
504729,545770,B001E5E1C8,A74SHV5ZD3RLT,"R. Ellis ""Bobby""",15,15,1,1303862400,The price is right,We have a little Maltese that we spoil to no e...


**Why are these duplicate rows present in the dataset?**
1. Some research and domain knowledge suggest that when a user reviews a product, that review sometimes appears on all the different brands of the same product.
2. Since this data was collected by scraping the HTML pages, the same review was captured multiple times.
3. For example, see the pages below:<br>
https://www.amazon.com/dp/B0007A0AP8<br>
https://www.amazon.com/dp/B0007A0AOY
4. Having the same text in both the train and test sets would leak information and lead to biased results.

In [20]:
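# Deduplication - If there are multiple rows with the same user id and text, keep only the first and remove the rest
# (subset is passed as a list; newer pandas versions may not accept a set here)
final=filtered_data.drop_duplicates(subset=["UserId", "Text"], keep="first")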
final.shape

(363859, 10)

**Observe:** We are left with approximately 69% of the filtered data (363,859 of 525,814 rows) after deduplication, which means more than 30% of the rows were duplicates.
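The effect of deduplicating on (UserId, Text) can be checked on a small made-up frame. The sketch below is only an illustration with invented rows: `duplicated` counts the rows that would be dropped, and `drop_duplicates` keeps the first row of each group.

```python
import pandas as pd

# Made-up rows: one user posts the same text under three product IDs.
toy = pd.DataFrame({
    "ProductId": ["P1", "P2", "P3", "P4"],
    "UserId":    ["U1", "U1", "U1", "U2"],
    "Text":      ["same text", "same text", "same text", "other text"],
})

# duplicated() flags every repeat after the first occurrence of a key.
is_repeat = toy.duplicated(subset=["UserId", "Text"], keep="first")
print("rows that would be dropped:", int(is_repeat.sum()))

# drop_duplicates() removes exactly those flagged rows.
deduped = toy.drop_duplicates(subset=["UserId", "Text"], keep="first")
print(deduped["ProductId"].tolist())
```

Checking `duplicated(...).sum()` before dropping is a cheap way to see how much data a chosen key would remove.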

#### HelpfulnessNumerator must always be less than or equal to HelpfulnessDenominator. Let's check whether this is always the case

In [21]:
final[final['HelpfulnessNumerator'] > final['HelpfulnessDenominator']]

Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text
41159,44737,B001EQ55RW,A2V0I904FH7ABY,Ram,3,2,positive,1212883200,Pure cocoa taste with crunchy almonds inside,It was almost a 'love at first bite' - the per...
59301,64422,B000MIDROQ,A161DK06JJMCYF,"J. E. Stephens ""Jeanne""",3,1,positive,1224892800,Bought This for My Son at College,My son loves spaghetti so I didn't hesitate or...


These two rows are inconsistent (the numerator exceeds the denominator), so they must be removed.

In [22]:
final=final[final.HelpfulnessNumerator<=final.HelpfulnessDenominator]
final.shape

(363857, 10)

In [23]:
final['Score'].value_counts()

positive    306779
negative     57078
Name: Score, dtype: int64

**Note:** The dataset is imbalanced: positive reviews outnumber negative reviews by roughly 5 to 1.
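The degree of imbalance follows directly from the counts printed above. The short calculation below is only arithmetic on those two numbers.

```python
# Counts taken from the value_counts() output above.
n_positive = 306779
n_negative = 57078
total = n_positive + n_negative

pos_share = n_positive / total
neg_share = n_negative / total
print(f"positive: {pos_share:.1%}, negative: {neg_share:.1%}")
print(f"positive-to-negative ratio: {n_positive / n_negative:.2f}")
```

Since about 84% of the reviews are positive, a model that always predicts the positive class would already reach roughly that accuracy, so accuracy alone is a weak measure on this data.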

### Cleaning Part 2 -  We can see that many reviews contain HTML tags and special characters which are unwanted for our purpose

In [26]:
# Refer - https://docs.python.org/3/library/re.html#re.findall

# Print text of rows containing HTML tags
i = 0
for sen in final['Text'].values:
    if(len(re.findall('<.*?>', sen))): # Find all strings starting with '<' and ending with '>'
        print(sen,"\n\n")
        i += 1
    if i == 5:
        break

I don't know if it's the cactus or the tequila or just the unique combination of ingredients, but the flavour of this hot sauce makes it one of a kind!  We picked up a bottle once on a trip we were on and brought it back home with us and were totally blown away!  When we realized that we simply couldn't find it anywhere in our city we were bummed.<br /><br />Now, because of the magic of the internet, we have a case of the sauce and are ecstatic because of it.<br /><br />If you love hot sauce..I mean really love hot sauce, but don't want a sauce that tastelessly burns your throat, grab a bottle of Tequila Picante Gourmet de Inclan.  Just realize that once you taste it, you will never want to use any other sauce.<br /><br />Thank you for the personal, incredible service! 


Twizzlers, Strawberry my childhood favorite candy, made in Lancaster Pennsylvania by Y & S Candies, Inc. one of the oldest confectionery Firms in the United States, now a Subsidiary of the Hershey Company, the Company
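The pattern `<.*?>` used above is non-greedy, and that matters for reviews with several tags. The sketch below compares it with the greedy `<.*>` on a fragment shortened from the first review printed above.

```python
import re

# Fragment shortened from the first review printed above.
fragment = ("we were bummed.<br /><br />Now we have a case of the sauce."
            "<br /><br />If you love hot sauce")

lazy = re.findall(r"<.*?>", fragment)    # non-greedy: stops at the first '>'
greedy = re.findall(r"<.*>", fragment)   # greedy: runs to the last '>'

print(lazy)
print(greedy)

# Replacing the non-greedy matches removes the tags but keeps the text.
cleaned = re.sub(r"<.*?>", " ", fragment)
print(cleaned)
```

The greedy pattern produces a single match that swallows the sentence between the tags, which is why the non-greedy form is the safer choice for tag removal.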

### Cleaning Part 3 -  Stopword Removal
**Stopwords are very common words which contribute little meaning to a sentence**

In [45]:
stop_words = set(stopwords.words('english')) #set of stopwords
print(stop_words)

{'himself', 'or', 'they', 'haven', 'are', 'for', 'each', 'its', 'this', 'mightn', 'her', 'because', 'through', 'did', 'y', 'who', 'll', 'hadn', 'just', 'more', 'not', 'any', 'ourselves', 'now', 'other', 'couldn', 'wouldn', 'your', 'while', 'and', 'that', 'a', 'same', 'once', 'ma', 'hers', 'at', 'few', 'being', 'itself', 'between', 'than', 'ours', 'his', 'most', 'which', 'needn', 'in', 'having', 'to', 'down', 'won', 'off', 'do', 'by', 'were', 'such', 'over', 'him', 'as', 'our', 'isn', 'about', 'after', 'from', 'my', 'm', 'with', 'below', 're', 'have', 'will', 'he', 'why', 'weren', 'where', 'mustn', 'again', 't', 'been', 'out', 'on', 'of', 'what', 'was', 'up', 'under', 'some', 'yourself', 'very', 'shouldn', 'then', 'too', 'wasn', 'when', 'themselves', 'all', 's', 'both', 'you', 'does', 'nor', 'these', 'am', 'so', 'doesn', 'she', 'didn', 'here', 'no', 'i', 'own', 'has', 'further', 'd', 'doing', 'until', 'how', 'don', 'be', 'only', 'into', 'those', 'an', 'above', 'can', 'before', 'yourselv
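Note that the printed set contains negations such as 'not', 'no' and 'nor'. For sentiment analysis this can matter, as the sketch below shows with a small hand-picked subset of the set. The helper `remove_stop_words` and the variation that keeps negations are hypothetical illustrations, not what this notebook does.

```python
# Small hand-picked subset of the stopword set printed above.
toy_stop_words = {"this", "was", "not", "very", "a", "and"}

def remove_stop_words(sentence, stop_words):
    """Drop every lower-cased token that appears in stop_words."""
    return [w for w in sentence.lower().split() if w not in stop_words]

review = "this was not good"
print(remove_stop_words(review, toy_stop_words))

# Hypothetical variation (not used in the notebook): keep the negation.
kept_negations = toy_stop_words - {"not"}
print(remove_stop_words(review, kept_negations))
```

With the full list, "not good" and "good" reduce to the same token, so a downstream model may lose the negation unless n-grams or a trimmed stopword list are used.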

### Cleaning Part 4 - Stemming
**In linguistic morphology and information retrieval, stemming is the process of reducing inflected (or sometimes derived) words to their word stem, base or root form**

In [39]:
sno = nltk.stem.SnowballStemmer('english') #initialising the snowball stemmer

#### Example of stemming

In [40]:
print(sno.stem("operating"))
print(sno.stem("operated"))
print(sno.stem("operation"))
print(sno.stem("operate"))

oper
oper
oper
oper


### Function to perform all the cleaning on the reviews

In [46]:
def data_cleaning(series):
    '''The function takes a Pandas Series object containing text in all the cells
       and performs the following preprocessing steps on each cell, in order:
       1. Remove HTML tags
       2. Replace punctuation and special characters with spaces
          (only ASCII letters and digits are kept)
       3. Collapse repeated whitespace and convert to lower case
       4. Remove stopwords and words with length <= 2
       5. Apply stemming to the remaining words
       
       Return values:
       1. final_string - List of cleaned sentences
       2. list_of_sent - List of lists which will be used as input to the W2V model'''
    
    final_string = []    ## This list will contain cleaned sentences
    list_of_sent = []    ## This is a list of lists used as input to the W2V model at a later stage
    cleanr = re.compile(r'<.*?>') # Compile re to remove html tags
    
    for sent in series.values:
        filtered_sent = []
        sent = re.sub(cleanr, ' ', sent) # remove html tags
        sent = re.sub(r'[^a-zA-Z0-9\n]', ' ', sent) # remove special characters
        sent = re.sub(r'\s+', ' ', sent) # replace multiple spaces with single space (raw string avoids an invalid-escape warning)
        sent = sent.lower() # convert all characters to lower case
        for word in sent.split():
            if word not in stop_words and len(word)>2:
                word = sno.stem(word) # Apply Stemming using snowball stemmer
                filtered_sent.append(word)
        list_of_sent.append(filtered_sent) # This list is used later
        final_string.append(" ".join(filtered_sent)) # Cleaned sentence
    return final_string, list_of_sent
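To see what the regular-expression and length steps do on their own, the sketch below applies them to two made-up strings. The helper `basic_tokens` is hypothetical and deliberately leaves out stopword removal and stemming, so it runs without NLTK.

```python
import re

def basic_tokens(text):
    """Regex and length steps only; no stopword removal, no stemming."""
    text = re.sub(r"<.*?>", " ", text)            # strip HTML tags
    text = re.sub(r"[^a-zA-Z0-9\n]", " ", text)   # punctuation -> spaces
    text = re.sub(r"\s+", " ", text).lower()      # collapse whitespace, lowercase
    return [w for w in text.split() if len(w) > 2]

# Made-up review text for illustration.
print(basic_tokens("Great taffy at a great price.<br />Delivery was quick!"))

# Digits survive the character filter; only short tokens are dropped.
print(basic_tokens("Pack of 12 cans, 100% tasty"))
```

The second call shows that numeric tokens of three or more characters are retained, because the character class keeps `0-9`.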

#### First 5 reviews before cleaning

In [42]:
for x in final['Text'].iloc[:5].values:
    print(x,"\n\n")

I have bought several of the Vitality canned dog food products and have found them all to be of good quality. The product looks more like a stew than a processed meat and it smells better. My Labrador is finicky and she appreciates this product better than  most. 


Product arrived labeled as Jumbo Salted Peanuts...the peanuts were actually small sized unsalted. Not sure if this was an error or if the vendor intended to represent the product as "Jumbo". 


This is a confection that has been around a few centuries.  It is a light, pillowy citrus gelatin with nuts - in this case Filberts. And it is cut into tiny squares and then liberally coated with powdered sugar.  And it is a tiny mouthful of heaven.  Not too chewy, and very flavorful.  I highly recommend this yummy treat.  If you are familiar with the story of C.S. Lewis' "The Lion, The Witch, and The Wardrobe" - this is the treat that seduces Edmund into selling out his Brother and Sisters to the Witch. 


If you are looking for the

#### First 5 reviews after cleaning

In [47]:
final_string, list_of_sent = data_cleaning(final['Text'].iloc[:5])
for x in final_string:
    print(x,"\n\n")

bought sever vital can dog food product found good qualiti product look like stew process meat smell better labrador finicki appreci product better 


product arriv label jumbo salt peanut peanut actual small size unsalt sure error vendor intend repres product jumbo 


confect around centuri light pillowi citrus gelatin nut case filbert cut tini squar liber coat powder sugar tini mouth heaven chewi flavor high recommend yummi treat familiar stori lewi lion witch wardrob treat seduc edmund sell brother sister witch 


look secret ingredi robitussin believ found got addit root beer extract order good made cherri soda flavor medicin 


great taffi great price wide assort yummi taffi deliveri quick taffi lover deal 




#### Cleaning all the reviews

In [53]:
start = time.time()
final_string, list_of_sent = data_cleaning(final['Text'])
end = time.time()
print("Time taken in seconds =", end - start)

Time taken in seconds = 1324.2997612953186
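Roughly 22 minutes is a long time for a preprocessing step. Where the time goes was not measured here, but one plausible contributor is stemming the same word again and again. A common remedy, not used in the original notebook, is to memoise the stemmer per distinct word. The sketch below shows the idea with `functools.lru_cache` and a toy stand-in stemmer (`toy_stem` is hypothetical and far simpler than Snowball).

```python
from functools import lru_cache

def toy_stem(word):
    """Stand-in for an expensive stemmer; it only strips a trailing 's'."""
    return word[:-1] if word.endswith("s") else word

@lru_cache(maxsize=None)
def cached_stem(word):
    # The first call per distinct word does the work; repeats hit the cache.
    return toy_stem(word)

words = ["peanuts", "product", "peanuts", "products", "peanuts"]
stems = [cached_stem(w) for w in words]

info = cached_stem.cache_info()
print(stems)
print("cache hits:", info.hits, "misses:", info.misses)
```

In a real run the benefit depends on how often words repeat across reviews, so it is worth timing both variants before adopting the cache.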


In [65]:
# Adding a column of CleanedText which displays the data after cleaning of the reviews
final['CleanedText']=final_string
final.head()

Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text,CleanedText
0,1,B001E4KFG0,A3SGXH7AUHU8GW,delmartian,1,1,1,1303862400,Good Quality Dog Food,I have bought several of the Vitality canned d...,bought sever vital can dog food product found ...
1,2,B00813GRG4,A1D87F6ZCVE5NK,dll pa,0,0,0,1346976000,Not as Advertised,Product arrived labeled as Jumbo Salted Peanut...,product arriv label jumbo salt peanut peanut a...
2,3,B000LQOCH0,ABXLMWJIXXAIN,"Natalia Corres ""Natalia Corres""",1,1,1,1219017600,"""Delight"" says it all",This is a confection that has been around a fe...,confect around centuri light pillowi citrus ge...
3,4,B000UA0QIQ,A395BORC6FGVXV,Karl,3,3,0,1307923200,Cough Medicine,If you are looking for the secret ingredient i...,look secret ingredi robitussin believ found go...
4,5,B006K2ZZ7K,A1UQRSCLF8GW1T,"Michael D. Bigham ""M. Wassir""",0,0,1,1350777600,Great taffy,Great taffy at a great price. There was a wid...,great taffi great price wide assort yummi taff...


#### Save the updated DataFrame as an SQL Table for future use

In [67]:
conn = sqlite3.connect('final.sqlite')
# to_sql works directly on the connection, so no cursor is needed
final.to_sql('Reviews', conn, if_exists='replace', index = False)
conn.close()
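Reading the saved table back in a later session could look like the sketch below. It uses a made-up two-row frame and an in-memory database so that it runs without touching any files; with the real data the connection would point at the SQLite file written above.

```python
import sqlite3
import pandas as pd

# Made-up two-row frame standing in for the cleaned DataFrame.
toy = pd.DataFrame({"Id": [1, 2],
                    "CleanedText": ["great taffi", "product arriv"]})

# An in-memory database keeps the sketch free of files on disk.
conn = sqlite3.connect(":memory:")
toy.to_sql("Reviews", conn, if_exists="replace", index=False)

# Reading the table back, e.g. at the start of a later notebook.
restored = pd.read_sql_query("SELECT * FROM Reviews", conn)
conn.close()

print(restored.shape)
print(restored["CleanedText"].tolist())
```

Because `index=False` was used when writing, the restored frame has only the data columns and a fresh default index.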

#### Save list_of_sent in a pickle file so that it can be loaded directly in future sessions

In [57]:
with open('list_of_sent_for_input_to_w2v.pkl', 'wb') as pickle_file:
    pickle.dump(list_of_sent, pickle_file)
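Loading the list back mirrors the dump call. The sketch below round-trips a small made-up list through a temporary file; the file name is hypothetical and the token lists are invented.

```python
import os
import pickle
import tempfile

# Made-up token lists standing in for list_of_sent.
toy_sents = [["great", "taffi"], ["product", "arriv", "label"]]

# Hypothetical file name inside a fresh temporary directory.
path = os.path.join(tempfile.mkdtemp(), "toy_list_of_sent.pkl")
with open(path, "wb") as pickle_file:
    pickle.dump(toy_sents, pickle_file)

# Loading mirrors dumping, with the file opened in binary read mode.
with open(path, "rb") as pickle_file:
    loaded = pickle.load(pickle_file)

print(loaded == toy_sents)
```

Pickle files can execute arbitrary code when loaded, so only unpickle files that you created yourself or otherwise trust.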