### Objective: Apply text preprocessing to the Amazon Fine Food Reviews dataset.

# Amazon Fine Food Reviews Analysis


Data Source: https://www.kaggle.com/snap/amazon-fine-food-reviews

The Amazon Fine Food Reviews dataset consists of reviews of fine foods from Amazon.<br>

Number of reviews: 568,454<br>
Number of users: 256,059<br>
Number of products: 74,258<br>
Timespan: Oct 1999 - Oct 2012<br>
Number of Attributes/Columns in data: 10 

Attribute Information:

1.  Id
2.  ProductId - unique identifier for the product
3.  UserId - unique identifier for the user
4.  ProfileName
5.  HelpfulnessNumerator - number of users who found the review helpful
6.  HelpfulnessDenominator - number of users who indicated whether they 
    found the review helpful or not
7.  Score - rating between 1 and 5
8.  Time - timestamp for the review
9.  Summary - brief summary of the review
10. Text - text of the review


#### Objective:
Given a review, determine whether it is positive (rating of 4 or 5) or negative (rating of 1 or 2).

<br>
[Q] How to determine if a review is positive or negative?<br>
<br> 
[Ans] We could use the Score/Rating. A rating of 4 or 5 could be considered a positive review.
A rating of 1 or 2 could be considered negative. A rating of 3 is neutral and ignored. 
This is an approximate, proxy way of determining the polarity (positivity/negativity) of a review.
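
As a small illustration of the rule above, the sketch below maps a few made-up scores to labels and drops the neutral ones. The helper name `label_from_score` and the sample scores are hypothetical; they are not taken from the dataset.

```python
def label_from_score(score):
    """Return 'positive' for 4-5, 'negative' for 1-2, and None for the neutral score 3."""
    if score > 3:
        return 'positive'
    if score < 3:
        return 'negative'
    return None  # neutral reviews are ignored


# Made-up scores, for illustration only.
sample_scores = [5, 1, 3, 4, 2]
labels = [label_from_score(s) for s in sample_scores]
kept = [label for label in labels if label is not None]
print(kept)  # ['positive', 'negative', 'positive', 'negative']
```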


#### Load the Amazon Reviews Dataset 

In [13]:
%matplotlib inline
import warnings
warnings.filterwarnings("ignore")

import sqlite3
import pandas as pd
import numpy as np
import nltk
import string
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import TfidfVectorizer

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import confusion_matrix
from sklearn import metrics
from sklearn.metrics import roc_curve, auc
import re
# Tutorial about Python regular expressions: https://pymotw.com/2/re/
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.stem.wordnet import WordNetLemmatizer

from gensim.models import Word2Vec
from gensim.models import KeyedVectors
import pickle

# Install a pip package in the current Jupyter kernel
import sys
!{sys.executable} -m pip install tqdm

from tqdm import tqdm # for displaying progress bar.


# using the SQLite Table to read the amazon fine food reviews data.
con = sqlite3.connect('database.sqlite') 

#filtering only positive and negative reviews i.e. 
# not taking into consideration those reviews with Score=3
filtered_data = pd.read_sql_query(""" SELECT * FROM Reviews WHERE Score != 3 """, con) 

# Give reviews with Score>3 a positive rating, and reviews with a score<3 a negative rating.
def partition(x):
    if x < 3:
        return 'negative'
    return 'positive'

#mapping scores below 3 to 'negative' and scores above 3 to 'positive'
actualScore = filtered_data['Score']
positiveNegative = actualScore.map(partition) 
filtered_data['Score'] = positiveNegative






### Data Cleaning : Deduplication

In [14]:
# Inspect one user's reviews to see how duplicate entries appear across products.
display= pd.read_sql_query("""
SELECT *
FROM Reviews
WHERE Score != 3 AND UserId="AR5J8UI46CURR"
ORDER BY ProductID
""", con)

#Sorting data according to ProductId in ascending order
sorted_data=filtered_data.sort_values('ProductId', axis=0, ascending=True, inplace=False, kind='quicksort', na_position='last')

#Deduplication of entries
final=sorted_data.drop_duplicates(subset=["UserId","ProfileName","Time","Text"], keep='first', inplace=False)

#Remove data anomalies: HelpfulnessNumerator can never exceed HelpfulnessDenominator.
final=final[final.HelpfulnessNumerator<=final.HelpfulnessDenominator]

#Before starting the next phase of preprocessing, let's see the number of entries left
print(final.shape)

#How many positive and negative reviews are present in our dataset?
final['Score'].value_counts()


(364171, 10)


positive    307061
negative     57110
Name: Score, dtype: int64

In [20]:
# A high number of false positives was observed during the NB assignment.
# Solution: append the Summary column to the review Text column
#           to reduce the false-positive count.

# Compare model performance between using the review text alone and using the review text +
# summary.
final['SummaryText'] = final['Text'] + ". " + final['Summary'] 

###  Text Preprocessing: Stemming, stop-word removal and Lemmatization.

Now that deduplication is done, the data needs some preprocessing 
before we go further with the analysis and build the prediction model.

In the preprocessing phase we do the following, in this order:

1. Remove the HTML tags
2. Remove punctuation and a limited set of special characters such as , or . or # etc.
3. Check that the word is made up of English letters only and is not alphanumeric
4. Check that the length of the word is greater than 2 
   (on the reported basis that there are no two-letter adjectives)
5. Convert the word to lowercase
6. Remove stopwords
7. Finally, apply Snowball stemming to the word (it was observed to be better than Porter stemming)<br>

After that, we collect the words used to describe positive and negative reviews.
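
The steps listed above can be sketched on a single review as follows. This is a minimal, standard-library-only illustration, not the notebook's actual cell: the function name `preprocess`, the tiny `STOPWORDS` set, and the sample review are assumptions made for the example, and the `stem` argument defaults to a no-op placeholder rather than a real stemmer.

```python
import re

# Tiny stand-in stopword list (an assumption for this sketch);
# the notebook uses NLTK's English stopword list instead.
STOPWORDS = {'this', 'the', 'and', 'was', 'not', 'for'}


def preprocess(review, stopwords=STOPWORDS, stem=lambda word: word):
    """Apply steps 1-7 to one review; `stem` defaults to a no-op placeholder."""
    text = re.sub(r'<.*?>', ' ', review)     # 1. strip HTML tags
    text = re.sub(r'[?!\'"#]', '', text)     # 2. drop some punctuation outright...
    text = re.sub(r'[.,)(|/]', ' ', text)    #    ...and turn the rest into spaces
    kept = []
    for word in text.split():
        # 3-4. keep purely alphabetic words longer than 2 characters
        if not word.isalpha() or len(word) <= 2:
            continue
        word = word.lower()                  # 5. lowercase
        if word in stopwords:                # 6. remove stopwords
            continue
        kept.append(stem(word))              # 7. stem
    return ' '.join(kept)


print(preprocess('This tea was GREAT!<br />Bought 2 boxes, no regrets.'))
# tea great bought boxes regrets
```

Passing `stem=nltk.stem.SnowballStemmer('english').stem` and NLTK's stopword set in place of the placeholders should bring the sketch closer to the behaviour described in the list.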

In [21]:
import nltk
nltk.download('stopwords')


stop = set(stopwords.words('english')) #set of stopwords
sno = nltk.stem.SnowballStemmer('english') #initialising the snowball stemmer

def cleanhtml(sentence): #function to clean the sentence of any html-tags
    cleanr = re.compile('<.*?>')
    cleantext = re.sub(cleanr, ' ', sentence)
    return cleantext
def cleanpunc(sentence): #function to clean the word of any punctuation or special characters
    # '|' inside a character class is a literal pipe, not alternation, so it is listed only once.
    cleaned = re.sub(r'[?!\'"#|]',r'',sentence)
    cleaned = re.sub(r'[.,)(/]',r' ',cleaned)
    return  cleaned


[nltk_data] Downloading package stopwords to C:\Users\Vijay
[nltk_data]     Joseph\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [22]:
#Code for implementing step-by-step the checks mentioned in the pre-processing phase

final_string=[]
all_positive_words=[] # store words from +ve reviews here
all_negative_words=[] # store words from -ve reviews here.
scores=final['Score'].values # look the labels up once instead of once per word
# Take one review at a time, together with its label.
for sent, score in tqdm(zip(final['SummaryText'].values, scores), total=len(scores)):
    filtered_sentence=[]
    sent=cleanhtml(sent) # remove HTML tags
    # Split sentences into words.
    for w in sent.split():
        # Remove punctuation and special characters in each word.
        for cleaned_words in cleanpunc(w).split():
            # Keep only purely alphabetic words longer than 2 characters that are not stopwords.
            if cleaned_words.isalpha() and len(cleaned_words)>2 and cleaned_words.lower() not in stop:
                # Apply stemming; the stem is stored as UTF-8 bytes and decoded when the column is built.
                s=(sno.stem(cleaned_words.lower())).encode('utf8')
                filtered_sentence.append(s)
                if score == 'positive':
                    all_positive_words.append(s) #list of all words used to describe positive reviews
                if score == 'negative':
                    all_negative_words.append(s) #list of all words used to describe negative reviews
    str1 = b" ".join(filtered_sentence) #final string of cleaned words
    final_string.append(str1)

100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 364171/364171 [06:16<00:00, 966.67it/s]


In [23]:
# adding a column "CleanedText" to the final cleaned dataset that displays the data after 
# pre-processing of the review. 
final['CleanedText']=final_string 
final['CleanedText']=final['CleanedText'].str.decode("utf-8")
final[['SummaryText','CleanedText']].head()

| | SummaryText | CleanedText |
|---|---|---|
| 138706 | this witty little book makes my son laugh at l... | witti littl book make son laugh loud recit car... |
| 138688 | I grew up reading these Sendak books, and watc... | grew read sendak book watch realli rosi movi i... |
| 138689 | This is a fun way for children to learn their ... | fun way children learn month year learn poem t... |
| 138690 | This is a great little book to read aloud- it ... | great littl book read nice rhythm well good re... |
| 138691 | This is a book of poetry about the months of t... | book poetri month year goe month cute littl po... |


In [24]:
# Save the final cleaned text to a CSV file for reuse later.
final.to_csv('finalCleanedText.csv')