<a href="https://colab.research.google.com/github/pSN0W/AI_Practice/blob/main/Amazon_Fine_Food_Review.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#Loading Data from Kaggle


In [1]:
!pip install kaggle



In [2]:
from google.colab import files
# Keep the return value in a variable so the uploaded file's contents (the API key) are not echoed as cell output
uploaded = files.upload()

Saving kaggle.json to kaggle.json

In [3]:
!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/

#change the permission
!chmod 600 ~/.kaggle/kaggle.json

In [4]:
!kaggle datasets download -d snap/amazon-fine-food-reviews

Downloading amazon-fine-food-reviews.zip to /content
 93% 226M/242M [00:07<00:00, 26.8MB/s]
100% 242M/242M [00:07<00:00, 34.1MB/s]


In [5]:
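from zipfile import ZipFile
file_name = "amazon-fine-food-reviews.zip"
# Extract the archive into the current directory; the handle is not named `zip` to avoid shadowing the built-in
with ZipFile(file_name, 'r') as archive:
  archive.extractall()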

#Analyzing the Data

##Importing Modules

In [6]:
import re
# Tutorial about Python regular expressions: https://pymotw.com/2/re/
import sqlite3
import string

import matplotlib.pyplot as plt
import nltk
import numpy as np
import pandas as pd
import seaborn as sns
from gensim.models import KeyedVectors, Word2Vec
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.stem.wordnet import WordNetLemmatizer
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer, TfidfVectorizer
from sklearn.metrics import auc, roc_curve

##Loading Data

In [7]:
#The same data can be loaded from a CSV file, but it is worth knowing the SQL route as well

#Create a connection to the database
con=sqlite3.connect('database.sqlite')

In [8]:
filtered_data=pd.read_sql_query("SELECT * FROM Reviews WHERE Score!=3",con)
print(filtered_data.columns)
print(filtered_data['Score'].value_counts())

Index(['Id', 'ProductId', 'UserId', 'ProfileName', 'HelpfulnessNumerator',
       'HelpfulnessDenominator', 'Score', 'Time', 'Summary', 'Text'],
      dtype='object')
5    363122
4     80655
1     52268
2     29769
Name: Score, dtype: int64
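The filter in the query above can be tried without downloading the full dataset. The following is a minimal sketch, assuming a toy in-memory table with only `Id` and `Score` columns; the rows and the names `toy_con` and `toy_rows` are made up for illustration.

```python
import sqlite3

# Build a tiny in-memory stand-in for the Reviews table (illustrative rows only)
toy_con = sqlite3.connect(":memory:")
toy_con.execute("CREATE TABLE Reviews (Id INTEGER, Score INTEGER)")
toy_con.executemany(
    "INSERT INTO Reviews VALUES (?, ?)",
    [(1, 5), (2, 3), (3, 1), (4, 4), (5, 3)],
)

# Same filter as above: leave out the 3-star reviews
toy_rows = toy_con.execute(
    "SELECT Id, Score FROM Reviews WHERE Score != 3 ORDER BY Id"
).fetchall()
print(toy_rows)
toy_con.close()
```

Only the rows with a score other than 3 come back, which is the behaviour the real query relies on.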


In [9]:
def partition(x):
  if x<3:
    return "negative"
  return "positive"

In [10]:
actualScore=filtered_data['Score']
positive_negative=actualScore.map(partition)
filtered_data['Score']=positive_negative

In [11]:
filtered_data['Score'].value_counts()

positive    443777
negative     82037
Name: Score, dtype: int64

#Data Preprocessing

##Data Cleaning

In [12]:
# Sort the data in ascending order of ProductId

sorted_data=filtered_data.sort_values('ProductId')

In [13]:
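#Dropping duplicates: rows with the same user, profile name, timestamp and text count as one review
print("Dimension of the data before dropping Duplicates : ",sorted_data.shape)

# subset is passed as a list, since pandas documents it as a label or sequence of labels
final=sorted_data.drop_duplicates(subset=["UserId","ProfileName","Time","Text"],keep='first',inplace=False)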

print("Dimension of the data after dropping Duplicates : ",final.shape)

Dimension of the data before dropping Duplicates :  (525814, 10)
Dimension of the data after dropping Duplicates :  (364173, 10)
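What `keep='first'` does after the sort can be sketched in plain Python. This is only an illustration under assumed toy data: the tuples below are invented, and `toy_reviews`, `seen_keys` and `deduplicated` are hypothetical names, not part of the notebook.

```python
# Toy rows: (Id, ProductId, UserId, ProfileName, Time, Text); values are made up
toy_reviews = [
    (2, "B002", "U1", "Ann", 1000, "great tea"),   # same review attached to another product
    (1, "B001", "U1", "Ann", 1000, "great tea"),
    (3, "B003", "U2", "Bob", 2000, "too sweet"),
]

seen_keys = set()
deduplicated = []
# Sort by ProductId first so that "first" is well defined, as in the cells above
for row in sorted(toy_reviews, key=lambda r: r[1]):
    key = row[2:]  # (UserId, ProfileName, Time, Text)
    if key not in seen_keys:
        seen_keys.add(key)
        deduplicated.append(row)

print([row[0] for row in deduplicated])
```

The review with `Id` 2 is dropped because an identical user, profile name, timestamp and text combination was already seen under an earlier `ProductId`.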


In [14]:
#Dropping all the reviews whose HelpfulnessNumerator is greater than their HelpfulnessDenominator

final=final[final['HelpfulnessNumerator']<=final['HelpfulnessDenominator']]
final.shape

(364171, 10)

In [16]:
final['Score'].value_counts()

positive    307061
negative     57110
Name: Score, dtype: int64

##Text Preprocessing

Now that deduplication is done, the text needs some preprocessing before we go on with the analysis and build a prediction model.

In the preprocessing phase we do the following, in this order:

1. Remove the HTML tags
2. Remove punctuation and a limited set of special characters such as , . or #
3. Check that the word consists only of English letters and is not alphanumeric
4. Check that the word is longer than 2 characters (reportedly, there are no two-letter adjectives)
5. Convert the word to lowercase
6. Remove stop words
7. Finally, apply Snowball stemming to the word (it was observed to work better than Porter stemming)

After this we collect the words used to describe positive and negative reviews.
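Steps 1 to 6 above can be sketched for a single string with the standard library only. This is a rough illustration, not the notebook's implementation: the regex tag stripper stands in for a real HTML parser, `TOY_STOPWORDS` is a deliberately tiny hypothetical stop-word set, `clean_review` is a made-up helper name, and stemming (step 7) is left out.

```python
import re

# Hypothetical, deliberately tiny stop-word set for illustration only
TOY_STOPWORDS = {"the", "this", "and", "for", "was"}

def clean_review(text):
    text = re.sub(r"<[^>]+>", " ", text)        # 1. drop HTML tags (regex stand-in)
    text = re.sub(r"[^A-Za-z0-9]+", " ", text)  # 2. punctuation/special characters -> space
    words = []
    for word in text.split():
        if not word.isalpha():       # 3. skip tokens that contain digits
            continue
        if len(word) <= 2:           # 4. skip very short words
            continue
        word = word.lower()          # 5. lowercase
        if word in TOY_STOPWORDS:    # 6. drop stop words
            continue
        words.append(word)
    return " ".join(words)

print(clean_review("This tea was GREAT!<br />Bought 2 boxes for the office."))
```

Lowercasing before the stop-word lookup means a capitalized stop word such as "This" is filtered out as well.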

In [19]:
# Print a handful of raw reviews, separated by a rule, to see what needs cleaning
sample_indices = [0, 100, 1000, 2000, 3000, 4000, 10000, 20000, 30000]
print(("\n" + "="*100 + "\n").join(final['Text'].values[i] for i in sample_indices))

this witty little book makes my son laugh at loud. i recite it in the car as we're driving along and he always can sing the refrain. he's learned about whales, India, drooping roses:  i love all the new words this book  introduces and the silliness of it all.  this is a classic book i am  willing to bet my son will STILL be able to recite from memory when he is  in college
Pros:<br />Dog will do anything for this treat.<br />Doesn't smell as bad as many other treats.<br />Easy to break into smaller pieces.<br />Nothing artificial, easy digestion.<br /><br />Cons:<br />More costly than other dog treats.<br /><br />Overall, this is a great product. While more expensive, my dog will do anything for this treat. He has several phobias, including getting in and out of the car, and walking through doorways, but he ignores all of his fears to get to this treat.
I was really looking forward to these pods based on the reviews.  Starbucks is good, but I prefer bolder taste.... imagine my surprise

We can see that the text contains HTML tags and numbers, which are of no use to us, so we will remove them before vectorizing the text.

###First we remove URLs from the text

In [26]:
txt="""Why is this $[...] when the same product is available for $[...] here?<br />
http://www.amazon.com/VICTOR-FLY-MAGNET-BAIT-REFILL/dp/B00004RBDY<br /><br />
The Victor M380 and M502 traps are unreal, of course -- total fly genocide. Pretty stinky, but only right nearby."""

In [27]:
# https://stackoverflow.com/a/40823105/4084039

# The re module offers a set of functions for searching a string for a match:
#
# Function   Description
# findall    Returns a list containing all matches
# search     Returns a Match object if there is a match anywhere in the string
# split      Returns a list where the string has been split at each match
# sub        Replaces one or many matches with a string

# Replace every URL with an empty string. \S+ runs up to the next whitespace,
# so anything glued to the URL (here the "<br" of the following tag) is consumed with it.
txt1=re.sub(r"http\S+","",txt)
print(txt)
print(txt1)

Why is this $[...] when the same product is available for $[...] here?<br />
http://www.amazon.com/VICTOR-FLY-MAGNET-BAIT-REFILL/dp/B00004RBDY<br /><br />
The Victor M380 and M502 traps are unreal, of course -- total fly genocide. Pretty stinky, but only right nearby.
Why is this $[...] when the same product is available for $[...] here?<br />
 /><br />
The Victor M380 and M502 traps are unreal, of course -- total fly genocide. Pretty stinky, but only right nearby.
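In the output above, the URL match also swallowed the `<br` that followed it, leaving a stray ` />` behind. One possible way around this, sketched here as an assumption rather than as the notebook's method, is to stop the match at `<` as well as at whitespace; `sample`, `url_pattern` and `without_url` are illustrative names.

```python
import re

sample = (
    "available here?<br />\n"
    "http://www.amazon.com/VICTOR-FLY-MAGNET-BAIT-REFILL/dp/B00004RBDY<br /><br />\n"
    "The Victor traps are unreal."
)

# Stop the match at whitespace or at '<', so an adjacent tag is left intact
url_pattern = re.compile(r"https?://[^\s<]+")
without_url = url_pattern.sub("", sample)
print(without_url)
```

With this pattern the `<br />` tags survive whole, so a later tag-removal step can deal with them cleanly.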


###Removing all the tags from the text

In [29]:
# https://stackoverflow.com/questions/16206380/python-beautifulsoup-how-to-remove-all-tags-from-an-element

from bs4 import BeautifulSoup

soup=BeautifulSoup(txt1,"lxml")
txt2=soup.get_text()
txt2

'Why is this $[...] when the same product is available for $[...] here?\n />\nThe Victor M380 and M502 traps are unreal, of course -- total fly genocide. Pretty stinky, but only right nearby.'
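If BeautifulSoup or lxml is not available, the standard library's `html.parser` can extract text in a similar spirit. The sketch below is an assumption-laden alternative, not a drop-in replacement: `TextExtractor` and `strip_tags` are hypothetical names, and text nodes are simply concatenated with no separator.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects only the text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Called for every run of text between tags
        self.parts.append(data)

def strip_tags(html_text):
    parser = TextExtractor()
    parser.feed(html_text)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.parts)

print(strip_tags("Pros:<br />Easy to break.<br /><br />Cons:<br />Costly."))
```

Because nothing is inserted where a tag used to be, words on either side of a `<br />` end up adjacent; replacing tags with a space first is one way to avoid that.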

###Decontracting our text

In [30]:
# https://stackoverflow.com/a/47091490/4084039

def decontracted(phrase):
    # specific cases that the general rules below would get wrong
    phrase = re.sub(r"won't", "will not", phrase)
    phrase = re.sub(r"can't", "can not", phrase)

    # general suffixes (note: "'s" is always expanded to " is", possessives included)
    phrase = re.sub(r"n't", " not", phrase)
    phrase = re.sub(r"'re", " are", phrase)
    phrase = re.sub(r"'s", " is", phrase)
    phrase = re.sub(r"'d", " would", phrase)
    phrase = re.sub(r"'ll", " will", phrase)
    phrase = re.sub(r"'t", " not", phrase)
    phrase = re.sub(r"'ve", " have", phrase)
    phrase = re.sub(r"'m", " am", phrase)
    return phrase

In [31]:
print(decontracted("I won't party I'll study"))

I will not party I will study
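The same rules can be written as an ordered table, which makes their limits easier to see. This is a sketch that mirrors the substitutions above under a different, hypothetical name (`expand_contractions`); the example sentence is made up to show that the `'s` rule also rewrites possessives.

```python
import re

# Ordered (pattern, replacement) pairs mirroring the rules above;
# the two specific cases must come before the general "n't" rule
CONTRACTION_RULES = [
    (r"won't", "will not"),
    (r"can't", "can not"),
    (r"n't", " not"),
    (r"'re", " are"),
    (r"'s", " is"),
    (r"'d", " would"),
    (r"'ll", " will"),
    (r"'t", " not"),
    (r"'ve", " have"),
    (r"'m", " am"),
]

def expand_contractions(phrase):
    for pattern, replacement in CONTRACTION_RULES:
        phrase = re.sub(pattern, replacement, phrase)
    return phrase

print(expand_contractions("It's my dog's treat and he can't wait"))
```

The possessive "dog's" becomes "dog is". In this notebook that is mostly harmless, because "is" is in the stop-word list and gets removed later.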


###Remove words containing numbers

In [32]:
# Remove words that contain digits: https://stackoverflow.com/a/18082370/4084039

# A raw string keeps \S and \d from being treated as (invalid) string escapes
sent_0 = re.sub(r"\S*\d\S*", "", "I took 7 pie out of7").strip()
print(sent_0)
print(sent_0)

I took  pie out
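Removing a token leaves its surrounding spaces behind, which is why the output above has two spaces between "took" and "pie". A small sketch of one way to tidy this up follows; `no_digits` and `tidy` are illustrative names.

```python
import re

no_digits = re.sub(r"\S*\d\S*", "", "I took 7 pie out of7")
# Collapse the runs of whitespace left behind by the removed tokens
tidy = re.sub(r"\s+", " ", no_digits).strip()
print(tidy)
```

A `split()` followed by `' '.join(...)` has the same effect, so a pipeline that already tokenizes the text this way does not need the extra substitution.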


###Remove Special Characters

In [36]:
# Remove special characters: https://stackoverflow.com/a/5843547/4084039
# Every run of characters other than letters and digits becomes a single space
txt3 = re.sub('[^A-Za-z0-9]+', ' ', txt2)
print(txt2)
print("="*200)
print(txt3)

Why is this $[...] when the same product is available for $[...] here?
 />
The Victor M380 and M502 traps are unreal, of course -- total fly genocide. Pretty stinky, but only right nearby.
Why is this when the same product is available for here The Victor M380 and M502 traps are unreal of course total fly genocide Pretty stinky but only right nearby 


###StopWords

In [37]:
# https://gist.github.com/sebleier/554280
# 'no', 'nor' and 'not' are deliberately left out of this stop-word list
# <br /><br /> ==> after the steps above, what is left of these tags is "br br",
# so 'br' is added to the stop-word list
# had the tags been written <br/> instead of <br />, they would have been removed in the first step
# note: the name `stopwords` rebinds the nltk.corpus.stopwords object imported earlier

stopwords= set(['br', 'the', 'i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've",\
            "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', \
            'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their',\
            'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', \
            'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', \
            'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', \
            'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after',\
            'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further',\
            'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more',\
            'most', 'other', 'some', 'such', 'only', 'own', 'same', 'so', 'than', 'too', 'very', \
            's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', \
            've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn',\
            "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn',\
            "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", \
            'won', "won't", 'wouldn', "wouldn't"])

###Stemming

In [42]:
from nltk.stem import SnowballStemmer
sno= SnowballStemmer("english")
print(sno.stem('tasty'))

tasti


##Combining all the steps and applying them to the text

In [95]:
import tqdm.notebook as tq
# tqdm gives us a progress bar
# Note: no stemming here, because it is not needed for Word2Vec
preprocessed_text=[]
for sentence in tq.tqdm(final['Text'].values):
  sentence = re.sub(r"http\S+","",sentence)              # remove URLs
  sentence = BeautifulSoup(sentence,"lxml").get_text()   # strip HTML tags
  sentence = decontracted(sentence)                      # expand contractions
  sentence = re.sub(r"\S*\d\S*", "", sentence)           # drop words containing digits
  sentence = re.sub('[^A-Za-z]+', ' ', sentence)         # keep letters only
  # lowercase before the stop-word test so that capitalized stop words ("This") are removed too
  sentence = ' '.join(e.lower()  for e in sentence.split()  if e.lower() not in stopwords and len(e)>2)
  preprocessed_text.append(sentence)

In [97]:
preprocessed_text[5]

'charming rhyming book describes circumstances eat not chicken soup rice month month sounds like kind thing kids would make recess sing drive teachers crazy cute catchy sounds really childlike skillfully written'

##Summary Conversion

In [98]:
# NOTE: despite the section title, this loop reads final['Text'] again;
# use final['Summary'] here to preprocess the review summaries instead
preprocessed_reviews=[]
for sentence in tq.tqdm(final['Text'].values):
  sentence = re.sub(r"http\S+","",sentence)              # remove URLs
  sentence = BeautifulSoup(sentence,"lxml").get_text()   # strip HTML tags
  sentence = decontracted(sentence)                      # expand contractions
  sentence = re.sub(r"\S*\d\S*", "", sentence)           # drop words containing digits
  sentence = re.sub('[^A-Za-z]+', ' ', sentence)         # keep letters only
  # lowercase before the stop-word test so that capitalized stop words ("This") are removed too
  sentence = ' '.join(e.lower()  for e in sentence.split()  if e.lower() not in stopwords and len(e)>2)
  preprocessed_reviews.append(sentence)

In [99]:
preprocessed_reviews[10]

'get movie sound track sing along carol king great stuff whole extended family knows songs heart quality kids storytelling music'