# Amazon Fine Food Reviews Analysis
Data Source:Kaggle


The Amazon Fine Food Reviews dataset consists of reviews of fine foods from Amazon.
Number of reviews: 568,454
Number of users: 256,059
Number of products: 74,258
Timespan: Oct 1999 - Oct 2012
Number of Attributes/Columns in data: 10 

Attribute Information:
1. Id
2. ProductId - unique identifier of the product
3. UserId - unqiue identifier of the user
4. ProfileName
5. HelpfulnessNumerator - number of users who found the review helpful
6. HelpfulnessDenominator - number of users who found the review helpful or not helpful
7. Score - rating between 1 and 5
8. Time - timestamp for the review
9. Summary - brief summary of the review
10. Text - text of the review

# Objective:
Determine the polarity of the given review. Or determine whether a given review for a product is positive or not.


Approach:
We could use the Score/Rating. A score of 4 or 5 could be considered as positive review. A score of 1 or 2 could be considered as negative. Review of rating 3 is ignored as it could be considered as neutral. This approach is approximate and proxy way of determining the polarity of the review.


In [1]:
#Importing modules
%matplotlib inline
import warnings
warnings.filterwarnings("ignore")



import sqlite3
import pandas as pd
import numpy as np
import nltk
import string
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import TfidfVectorizer

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import confusion_matrix
from sklearn import metrics
from sklearn.metrics import roc_curve, auc
from nltk.stem.porter import PorterStemmer


import re

import string
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.stem.wordnet import WordNetLemmatizer

from gensim.models import Word2Vec
from gensim.models import KeyedVectors
import pickle

from tqdm import tqdm
import os

# 1. Loading and reading data.

In [19]:

connec=sqlite3.connect('database.sqlite')

filtered_data= pd.read_sql_query(""" SELECT * FROM Reviews WHERE Score !=3 LIMIT 10000""",connec)

def partition(x):
    if x<3:
        return 0;
    return 1
actualScore=filtered_data['Score']
positiveNegative=actualScore.map(partition)
filtered_data['Score']=positiveNegative
print("Number of datapoints in filtered_data:",filtered_data.shape)
filtered_data.head(3)


Number of datapoints in filtered_data: (10000, 10)


Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text
0,1,B001E4KFG0,A3SGXH7AUHU8GW,delmartian,1,1,1,1303862400,Good Quality Dog Food,I have bought several of the Vitality canned d...
1,2,B00813GRG4,A1D87F6ZCVE5NK,dll pa,0,0,0,1346976000,Not as Advertised,Product arrived labeled as Jumbo Salted Peanut...
2,3,B000LQOCH0,ABXLMWJIXXAIN,"Natalia Corres ""Natalia Corres""",1,1,1,1219017600,"""Delight"" says it all",This is a confection that has been around a fe...


Reviews given by the same UsedId. Lets see those.

In [20]:
dupl_review = pd.read_sql_query("""
SELECT UserId, ProductId, ProfileName, Time, Score, Text, COUNT(*)
FROM Reviews
GROUP BY UserId
HAVING COUNT(*)>1
""", connec)
print(dupl_review.shape)
dupl_review.head()


(80668, 7)


Unnamed: 0,UserId,ProductId,ProfileName,Time,Score,Text,COUNT(*)
0,#oc-R115TNMSPFT9I7,B005ZBZLT4,Breyton,1331510400,2,Overall its just OK when considering the price...,2
1,#oc-R11D9D7SHXIJB9,B005HG9ESG,"Louis E. Emory ""hoppy""",1342396800,5,"My wife has recurring extreme muscle spasms, u...",3
2,#oc-R11DNU2NBKQ23Z,B005ZBZLT4,Kim Cieszykowski,1348531200,1,This coffee is horrible and unfortunately not ...,2
3,#oc-R11O5J5ZVQE25C,B005HG9ESG,Penguin Chick,1346889600,5,This will be the bottle that you grab from the...,3
4,#oc-R12KPBODL2B5ZD,B007OSBEV0,Christopher P. Presta,1348617600,1,I didnt like this coffee. Instead of telling y...,2


In [21]:
dupl_review[dupl_review['UserId']=='AZY10LLTJ71NX']

Unnamed: 0,UserId,ProductId,ProfileName,Time,Score,Text,COUNT(*)
80638,AZY10LLTJ71NX,B001ATMQK2,"undertheshrine ""undertheshrine""",1296691200,5,I bought this 6 pack because for the price tha...,5


In [22]:
dupl_review['COUNT(*)'].sum()

393063

# 2. EDA

1. Data Cleaning: Duplication.
As we have seen that there are multiple reviews given by the same useId. There may be duplicate entries.We can see in the following table that there are some duplicate entries in the table. Time stamp for the some reviews is same. How it can be possible. So it is necessary to remove these duplicate reviews for unbiased analysis of the data.

In [23]:
display= pd.read_sql_query("""
SELECT *
FROM Reviews
WHERE Score != 3 AND UserId="AR5J8UI46CURR"
ORDER BY ProductID
""", connec)
display.head()

Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text
0,78445,B000HDL1RQ,AR5J8UI46CURR,Geetha Krishnan,2,2,5,1199577600,LOACKER QUADRATINI VANILLA WAFERS,DELICIOUS WAFERS. I FIND THAT EUROPEAN WAFERS ...
1,138317,B000HDOPYC,AR5J8UI46CURR,Geetha Krishnan,2,2,5,1199577600,LOACKER QUADRATINI VANILLA WAFERS,DELICIOUS WAFERS. I FIND THAT EUROPEAN WAFERS ...
2,138277,B000HDOPYM,AR5J8UI46CURR,Geetha Krishnan,2,2,5,1199577600,LOACKER QUADRATINI VANILLA WAFERS,DELICIOUS WAFERS. I FIND THAT EUROPEAN WAFERS ...
3,73791,B000HDOPZG,AR5J8UI46CURR,Geetha Krishnan,2,2,5,1199577600,LOACKER QUADRATINI VANILLA WAFERS,DELICIOUS WAFERS. I FIND THAT EUROPEAN WAFERS ...
4,155049,B000PAQ75C,AR5J8UI46CURR,Geetha Krishnan,2,2,5,1199577600,LOACKER QUADRATINI VANILLA WAFERS,DELICIOUS WAFERS. I FIND THAT EUROPEAN WAFERS ...


In [24]:
sorted_data=filtered_data.sort_values('ProductId',axis=0, ascending=True, inplace=False, kind='quicksort',na_position='last')
final=sorted_data.drop_duplicates(subset={"UserId","ProfileName","Time","Text"},keep='first',inplace=False)
final.shape

(9564, 10)

In [25]:
#How much data remains
(final['Id'].size*1.0)/(filtered_data['Id'].size*1)*100

95.64

It is also observed that for some reviews the HelpfulnessNumerator is greater than HelpfulnessDenominator (see the following table), which is not possible (by the definitions of these features).Hence the rows with such contradictory values of these two features must be removed.

In [26]:
display= pd.read_sql_query("""
SELECT *
FROM Reviews
WHERE Score != 3 AND Id=44737 OR Id=64422
ORDER BY ProductID
""", connec)

display.head()

Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text
0,64422,B000MIDROQ,A161DK06JJMCYF,"J. E. Stephens ""Jeanne""",3,1,5,1224892800,Bought This for My Son at College,My son loves spaghetti so I didn't hesitate or...
1,44737,B001EQ55RW,A2V0I904FH7ABY,Ram,3,2,4,1212883200,Pure cocoa taste with crunchy almonds inside,It was almost a 'love at first bite' - the per...


In [27]:
#Removing such rows
final=final[final.HelpfulnessNumerator <= final.HelpfulnessDenominator]
print(final.shape)

(9564, 10)


In [28]:
final['Score'].value_counts()

1    7976
0    1588
Name: Score, dtype: int64

# 3.1 Text PreProcessing
After above steps, our data requires some preprocessing before making further analysis and prediction model. In the preprocessing step, we remove stop words, html tags and punctuations or special characters. Also we convert words to lower case and check if it is not alphanumeric and has length greater than 2.    


In [37]:
# Printing some random reviews
sent_1=final['Text'].values[0]
print(sent_1)
print("="*50)

sent_2=final['Text'].values[1]
print(sent_2)
print("="*50)

sent_100=final['Text'].values[100]
print(sent_100)
print("="*50)

sent_1580=final['Text'].values[1580]
print(sent_1580)
print("="*50)

sent_4900=final['Text'].values[4900]
print(sent_4900)
print("="*50)

sent_5500=final['Text'].values[5500]
print(sent_5500)
print("="*50)

sent_7000=final['Text'].values[7000]
print(sent_7000)
print("="*50)

sent_7500=final['Text'].values[7500]
print(sent_7500)
print("="*50)

sent_8500=final['Text'].values[8500]
print(sent_8500)
print("="*50)

sent_9000=final['Text'].values[9000]
print(sent_9000)
print("="*50)

We have used the Victor fly bait for 3 seasons.  Can't beat it.  Great product!
Why is this $[...] when the same product is available for $[...] here?<br />http://www.amazon.com/VICTOR-FLY-MAGNET-BAIT-REFILL/dp/B00004RBDY<br /><br />The Victor M380 and M502 traps are unreal, of course -- total fly genocide. Pretty stinky, but only right nearby.
Adzuki ( or Azuki) beans are ment to be used in asian sweets. You can make sweet bean paste by boiling the beans for 30-40 min, (changing the water out at least twice) draining and mashing the beans with sugar.(strain the paste if you do not like the bean skins)  What you get is a paste that can be put on ice cream, fill pasteries or stuff into Mochi Cakes.
These chips taste awesome. And unlike most other flavored chips, they actually make sure that plenty of the flavory salty goodness gets on each individual chip. Just don't pass gas near any pretty ladies after consumption. They'll totally know it was you.
These tablets definitely made things 

In [38]:
# Remove urls from text.
sent_1 = re.sub(r"http\S+", "", sent_1)
sent_2 = re.sub(r"http\S+", "", sent_2)
sent_100 = re.sub(r"http\S+", "", sent_100)
sent_1580 = re.sub(r"http\S+", "", sent_1580)
sent_4900 = re.sub(r"http\S+", "", sent_4900)
sent_5500 = re.sub(r"http\S+", "", sent_5500)
sent_7000 = re.sub(r"http\S+", "", sent_7000)
sent_7500 = re.sub(r"http\S+", "", sent_7500)
sent_8500 = re.sub(r"http\S+", "", sent_8500)
sent_9000 = re.sub(r"http\S+", "", sent_9000)

print(sent_2)

Why is this $[...] when the same product is available for $[...] here?<br /> /><br />The Victor M380 and M502 traps are unreal, of course -- total fly genocide. Pretty stinky, but only right nearby.


In [41]:
#beautifulSoup-how-to-remove-all-tags-from-an-element

from bs4 import BeautifulSoup
soup=BeautifulSoup(sent_1, 'lxml');
text=soup.get_text();
print(text)
print("="*50)

soup=BeautifulSoup(sent_2, 'lxml');
text=soup.get_text();
print(text)
print("="*50)

soup=BeautifulSoup(sent_100, 'lxml');
text=soup.get_text();
print(text)
print("="*50)

soup=BeautifulSoup(sent_1580, 'lxml');
text=soup.get_text();
print(text)
print("="*50)

soup=BeautifulSoup(sent_4900, 'lxml');
text=soup.get_text();
print(text)
print("="*50)

soup=BeautifulSoup(sent_5500, 'lxml');
text=soup.get_text();
print(text)
print("="*50)

soup=BeautifulSoup(sent_7000, 'lxml');
text=soup.get_text();
print(text)
print("="*50)

soup=BeautifulSoup(sent_7500, 'lxml');
text=soup.get_text();
print(text)
print("="*50)

soup=BeautifulSoup(sent_8500, 'lxml');
text=soup.get_text();
print(text)
print("="*50)

soup=BeautifulSoup(sent_9000, 'lxml');
text=soup.get_text();
print(text)
print("="*50)



We have used the Victor fly bait for 3 seasons.  Can't beat it.  Great product!
Why is this $[...] when the same product is available for $[...] here? />The Victor M380 and M502 traps are unreal, of course -- total fly genocide. Pretty stinky, but only right nearby.
Adzuki ( or Azuki) beans are ment to be used in asian sweets. You can make sweet bean paste by boiling the beans for 30-40 min, (changing the water out at least twice) draining and mashing the beans with sugar.(strain the paste if you do not like the bean skins)  What you get is a paste that can be put on ice cream, fill pasteries or stuff into Mochi Cakes.
These chips taste awesome. And unlike most other flavored chips, they actually make sure that plenty of the flavory salty goodness gets on each individual chip. Just don't pass gas near any pretty ladies after consumption. They'll totally know it was you.
These tablets definitely made things sweeter -- like lemons, limes, and grapefruit.  But it wasn't to the point of sh

In [42]:
#Replace some short form of words with full words, by the following function:

def decontracted(phrase):
    # specific
    phrase = re.sub(r"won\'t", "will not", phrase)
    phrase = re.sub(r"can\'t", "can not", phrase)

    # general
    phrase = re.sub(r"n\'t", " not", phrase)
    phrase = re.sub(r"\'re", " are", phrase)
    phrase = re.sub(r"\'s", " is", phrase)
    phrase = re.sub(r"\'d", " would", phrase)
    phrase = re.sub(r"\'ll", " will", phrase)
    phrase = re.sub(r"\'t", " not", phrase)
    phrase = re.sub(r"\'ve", " have", phrase)
    phrase = re.sub(r"\'m", " am", phrase)
    return phrase

In [44]:
sent_1=decontracted(sent_1)
print(sent_1)
print("="*50)

sent_1580=decontracted(sent_1580)
print(sent_1580)
print("="*50)


We have used the Victor fly bait for 3 seasons.  Ca not beat it.  Great product!
These chips taste awesome. And unlike most other flavored chips, they actually make sure that plenty of the flavory salty goodness gets on each individual chip. Just do not pass gas near any pretty ladies after consumption. They will totally know it was you.


In [47]:
#Remove words with numbers:
sent_2 = re.sub("\S*\d\S*", "", sent_2).strip()
print(sent_2)

Why is this $[...] when the same product is available for $[...] here?<br /> /><br />The Victor  and  traps are unreal, of course -- total fly genocide. Pretty stinky, but only right nearby.


In [48]:
#Remove spacial character:
sent_2 = re.sub('[^A-Za-z0-9]+', ' ', sent_2)
print(sent_2)

Why is this when the same product is available for here br br The Victor and traps are unreal of course total fly genocide Pretty stinky but only right nearby 


we are removing the words from the stop words list: 'no', 'nor', 'not'
<br /><br /> ==> after the above steps, we are getting "br br"
we are including them into stop words list
instead of <br /> if we have <br/> these tags would have revmoved in the 1st step

In [49]:
#Source:https://gist.github.com/sebleier/554280
stopwords= set(['br', 'the', 'i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've",\
            "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', \
            'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their',\
            'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', \
            'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', \
            'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', \
            'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after',\
            'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further',\
            'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more',\
            'most', 'other', 'some', 'such', 'only', 'own', 'same', 'so', 'than', 'too', 'very', \
            's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', \
            've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn',\
            "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn',\
            "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", \
            'won', "won't", 'wouldn', "wouldn't"])

In [50]:
# Combining all the above changes. 

from tqdm import tqdm
preprocessed_reviews = []
# tqdm is for printing the status bar
for sentc in tqdm(final['Text'].values):
    sentc = re.sub(r"http\S+", "", sentc)
    sentc = BeautifulSoup(sentc, 'lxml').get_text()
    sentc = decontracted(sentc)
    sentc = re.sub("\S*\d\S*", "", sentc).strip()
    sentc = re.sub('[^A-Za-z]+', ' ', sentc)
    sentc = ' '.join(e.lower() for e in sentc.split() if e.lower() not in stopwords)
    preprocessed_reviews.append(sentc.strip())

100%|█████████████████████████████████████| 9564/9564 [00:03<00:00, 2608.52it/s]


In [53]:
print(preprocessed_reviews[2])
print("="*50)
print(preprocessed_reviews[1580])


received shipment could hardly wait try product love slickers call instead stickers removed easily daughter designed signs printed reverse use car windows printed beautifully print shop program going lot fun product windows everywhere surfaces like tv screens computer monitors
chips taste awesome unlike flavored chips actually make sure plenty flavory salty goodness gets individual chip not pass gas near pretty ladies consumption totally know


# 3.2 Preprocessing Summary

In [54]:
#By following the above steps for summary, we can preprocess the Summary column.
preprocessed_summary = []
for sentc in tqdm(final['Summary'].values):
    sentc = re.sub(r"http\S+", "", sentc)
    sentc = BeautifulSoup(sentc, 'lxml').get_text()
    sentc = decontracted(sentc)
    sentc = re.sub("\S*\d\S*", "", sentc).strip()
    sentc = re.sub('[^A-Za-z]+', ' ', sentc)
    sentc = ' '.join(e.lower() for e in sentc.split() if e.lower() not in stopwords)
    preprocessed_summary.append(sentc.strip())

100%|█████████████████████████████████████| 9564/9564 [00:01<00:00, 5151.44it/s]


In [56]:
print(preprocessed_summary[1])
print("="*50)
print(preprocessed_summary[100])
print("="*50)
print(preprocessed_summary[1580])
print("="*50)
print(preprocessed_summary[4900])
print("="*50)
print(preprocessed_summary[9000])
print("="*50)
print(preprocessed_summary[9563])
print("="*50)

thirty bucks
adzuki beans
much flavor farts smell like sweet onions
pretty cool not life changing
taste great
delicious


# 4 Feature Engineering:
    

# 4.1 Converting Reviews and Summary into Numerical Vector

In [64]:
#BoW for reviews
count_vect=CountVectorizer()
count_vect.fit(preprocessed_reviews)
print("Some features names ",count_vect.get_feature_names()[:10])
print("="*50)

final_count=count_vect.transform(preprocessed_reviews)
print("The type of count_vectorizer ",type(final_count))
print("The shape of BoW vectorizer ", final_count.get_shape())
print("The number of unique words ",final_count.get_shape()[1])
print(final_count[1,:])


Some features names  ['aa', 'aaaa', 'aahhhs', 'ab', 'aback', 'abandon', 'abates', 'abberline', 'abbott', 'abby']
The type of count_vectorizer  <class 'scipy.sparse._csr.csr_matrix'>
The shape of BoW vectorizer  (9564, 18244)
The number of unique words  18244
  (0, 1007)	1
  (0, 3612)	1
  (0, 6183)	1
  (0, 6648)	1
  (0, 10473)	1
  (0, 12286)	1
  (0, 12388)	1
  (0, 13523)	1
  (0, 15363)	1
  (0, 16572)	1
  (0, 16674)	1
  (0, 17090)	1
  (0, 17396)	1


In [59]:
# BoW for Summary
count_vect_sum=CountVectorizer()
count_vect_sum.fit(preprocessed_summary)
print("Some features names in Summary_df ",count_vect_sum.get_feature_names()[:10])
print("="*50)

final_count_summary=count_vect_sum.transform(preprocessed_summary)
print("The type of count_vectorizer in summary_df  ",type(final_count_summary))
print("The shape of BoW vectorizer in summary_df ", final_count_summary.get_shape())
print("The number of unique words in summary_df  ",final_count_summary.get_shape()[1])


Some features names in Summary_df  ['aaaarrrrrgggghhhhh', 'able', 'absolute', 'absolutel', 'absolutely', 'absotively', 'acai', 'acceptable', 'accidents', 'according']
The type of count_vectorizer in summary_df   <class 'scipy.sparse._csr.csr_matrix'>
The shape of BoW vectorizer in summary_df  (9564, 4321)
The number of unique words in summary_df   4321


In [65]:
#Bi-Grams and n-Grams for reviews
count_vect = CountVectorizer(ngram_range=(1,2), min_df=10, max_features=7000)
final_bigram = count_vect.fit_transform(preprocessed_reviews)
print("The type of count vectorizer ",type(final_bigram))
print("The shape final_bigram: ",final_bigram.get_shape())
print("The number of unique words including both unigrams and bigrams: ", final_bigram.get_shape()[1])

The type of count vectorizer  <class 'scipy.sparse._csr.csr_matrix'>
The shape final_bigram:  (9564, 5765)
The number of unique words including both unigrams and bigrams:  5765


In [66]:
#Bi-Grams and n-Grams for summary
count_vect_sum = CountVectorizer(ngram_range=(1,2), min_df=10, max_features=7000)
final_bigram_sum = count_vect.fit_transform(preprocessed_summary)
print("The type of count vectorizer for Summary ",type(final_bigram_sum))
print("The shape final_bigram for summary: ",final_bigram_sum.get_shape())
print("The number of unique words including both unigrams and bigrams: ", final_bigram_sum.get_shape()[1])



The type of count vectorizer for Summary  <class 'scipy.sparse._csr.csr_matrix'>
The shape final_bigram for summary:  (9564, 578)
The number of unique words including both unigrams and bigrams:  578


In [67]:
#TF-IDF for reviews
tf_idf_vect = TfidfVectorizer(ngram_range=(1,2), min_df=10)
tf_idf_vect.fit(preprocessed_reviews)
print("Some sample features(unique words in the corpus): ",tf_idf_vect.get_feature_names()[0:10])
print('='*50)

final_tf_idf_rev = tf_idf_vect.transform(preprocessed_reviews)
print("The type of count vectorizer ",type(final_tf_idf_rev))
print("The shape of final_tf_idf_rev ",final_tf_idf_rev.get_shape())
print("The number of unique words including both unigrams and bigrams ", final_tf_idf_rev.get_shape()[1])

Some sample features(unique words in the corpus):  ['ability', 'able', 'able buy', 'able eat', 'able find', 'able get', 'able order', 'able use', 'absolute', 'absolute best']
The type of count vectorizer  <class 'scipy.sparse._csr.csr_matrix'>
The shape of final_tf_idf_rev  (9564, 5765)
The number of unique words including both unigrams and bigrams  5765


In [68]:
#TF-IDF for summary
tf_idf_vect = TfidfVectorizer(ngram_range=(1,2), min_df=10)
tf_idf_vect.fit(preprocessed_summary)
print("Some sample features(unique words in the corpus): ",tf_idf_vect.get_feature_names()[0:10])
print('='*50)

final_tf_idf_sum = tf_idf_vect.transform(preprocessed_summary)
print("The type of count vectorizer ",type(final_tf_idf_sum))
print("The shape of final_tf_idf_sum ",final_tf_idf_sum.get_shape())
print("The number of unique words including both unigrams and bigrams ", final_tf_idf_sum.get_shape()[1])

Some sample features(unique words in the corpus):  ['absolutely', 'absolutely delicious', 'actually', 'addicted', 'addictive', 'agree', 'almonds', 'almost', 'alternative', 'aluminum']
The type of count vectorizer  <class 'scipy.sparse._csr.csr_matrix'>
The shape of final_tf_idf_sum  (9564, 578)
The number of unique words including both unigrams and bigrams  578


In [69]:
#Word_2_Vec for review and summary

list_of_sentc_rev=[]
for sentance in preprocessed_reviews:
    list_of_sentc_rev.append(sentance.split())
    
list_of_sentc_sum=[]
for senrance in preprocessed_summary:
    list_of_sentc_sum.append(sentance.split())

In [96]:
want_to_train_w2v=True
if want_to_train_w2v:
    # min_count = 5 considers only words that occured atleast 5 times
    w2v_model_rev=Word2Vec(list_of_sentc_rev,min_count=5,vector_size=50, workers=4)
    print(w2v_model_rev.wv.most_similar('great'))
    print('='*50)
    print(w2v_model_rev.wv.most_similar('worst'))


[('excellent', 0.8910835385322571), ('good', 0.8632918000221252), ('wonderful', 0.7901906371116638), ('fresh', 0.747199535369873), ('overall', 0.7378928065299988), ('makes', 0.7327553033828735), ('well', 0.726107120513916), ('sickening', 0.7162603735923767), ('decent', 0.7157618403434753), ('looking', 0.7107333540916443)]
[('hands', 0.9751193523406982), ('disappointing', 0.9742241501808167), ('eaten', 0.9655758738517761), ('awful', 0.9605571627616882), ('varieties', 0.9604638814926147), ('absolute', 0.9597297310829163), ('jamaica', 0.9589985609054565), ('blends', 0.9584805965423584), ('grey', 0.9580742120742798), ('pg', 0.9576289057731628)]


In [97]:
if want_to_train_w2v:
    w2v_model_sum=Word2Vec(list_of_sentc_sum,min_count=5,vector_size=50, workers=4)
    print(w2v_model_rev.wv.most_similar('easy'))
    print('='*50)
    print(w2v_model_rev.wv.most_similar('love'))

[('quick', 0.8625193238258362), ('make', 0.8391530513763428), ('prepare', 0.8192410469055176), ('works', 0.8069700002670288), ('also', 0.7778550982475281), ('makes', 0.7706996202468872), ('sushi', 0.7660984396934509), ('clean', 0.7644457221031189), ('need', 0.7523777484893799), ('lunch', 0.7502284049987793)]
[('absolutely', 0.8775326013565063), ('licorice', 0.8384461998939514), ('likes', 0.8232775330543518), ('unique', 0.8221467733383179), ('enjoy', 0.8172870874404907), ('spicy', 0.8169312477111816), ('delicious', 0.812677800655365), ('tasty', 0.8089853525161743), ('thai', 0.8075454831123352), ('really', 0.8063867092132568)]


In [98]:
w2v_words_rev = list(w2v_model_rev.wv.key_to_index)
print("number of words that occured minimum 5 times in reviews: ",len(w2v_words_rev))
print("sample words ", w2v_words_rev[0:50])

number of words that occured minimum 5 times in reviews:  5652
sample words  ['not', 'like', 'good', 'great', 'taste', 'coffee', 'one', 'would', 'product', 'flavor', 'love', 'no', 'tea', 'food', 'really', 'get', 'much', 'use', 'best', 'time', 'amazon', 'also', 'tried', 'little', 'make', 'buy', 'price', 'find', 'well', 'better', 'try', 'even', 'cup', 'chips', 'bag', 'chocolate', 'sugar', 'water', 'eat', 'first', 'hot', 'could', 'drink', 'made', 'found', 'mix', 'used', 'bought', 'free', 'sweet']


In [99]:
w2v_words_sum = list(w2v_model_sum.wv.key_to_index)
print("number of words that occured minimum 5 times in summary: ",len(w2v_words_sum))
print("sample words ", w2v_words_sum[0:50])

number of words that occured minimum 5 times in summary:  19
sample words  ['recommend', 'easy', 'product', 'local', 'store', 'ny', 'kids', 'love', 'quick', 'meal', 'strongly', 'put', 'toaster', 'oven', 'toast', 'min', 'ready', 'eat', 'purchased']


In [101]:
#Avg Word2Vec and TF-IDF weighted Word2Vec
sent_vectors_rev = []; # the avg-w2v for each sentence/review is stored in this list
for sent in tqdm(list_of_sentc_rev):
    cnt_words =0;
    sent_vec=np.zeros(50)
    for word in sent:
        if word in w2v_words_rev:
            vec = w2v_model_rev.wv[word];
            sent_vec +=vec;
            cnt_words += 1;
    if cnt_words != 0:
        sent_vec /= cnt_words
    sent_vectors_rev.append(sent_vec)
print(len(sent_vectors))
print(len(sent_vectors[0]))


100%|█████████████████████████████████████| 9564/9564 [00:08<00:00, 1087.10it/s]

9564
50





In [102]:
sent_vectors_sum = []; # the avg-w2v for each summary is stored in this list
for sent in tqdm(list_of_sentc_sum):
    cnt_words =0;
    sent_vec=np.zeros(50)
    for word in sent:
        if word in w2v_words_sum:
            vec = w2v_model_sum.wv[word];
            sent_vec +=vec;
            cnt_words += 1;
    if cnt_words != 0:
        sent_vec /= cnt_words
    sent_vectors_sum.append(sent_vec)
print(len(sent_vectors_sum))
print(len(sent_vectors_sum[0]))

100%|█████████████████████████████████████| 9564/9564 [00:01<00:00, 8948.87it/s]

9564
50





In [104]:
#TFIDF weighted Word2Vec for reviews and summary.
model1 = TfidfVectorizer()
model2 = TfidfVectorizer()
model1.fit(preprocessed_reviews)
model2.fit(preprocessed_summary)

# we are converting a dictionary with word as a key, and the idf as a value
dictionary1 = dict(zip(model1.get_feature_names(), list(model1.idf_)))
dictionary2 = dict(zip(model2.get_feature_names(), list(model2.idf_)))

In [105]:
# TF-IDF weighted Word2Vec for reviews
tfidf_feat_rev = model1.get_feature_names() 
# final_tf_idf is the sparse matrix with row= sentence, col=word and cell_val = tfidf

tfidf_sent_vectors_rev = []; # the tfidf-w2v for each sentence/review is stored in this list
row=0;
for sent in tqdm(list_of_sentc_rev): 
    sent_vec = np.zeros(50)
    weight_sum =0; # num of words with a valid vector in the sentence/review
    for word in sent: 
        if word in w2v_words_rev and word in tfidf_feat_rev:
            vec = w2v_model_rev.wv[word]
#             tf_idf = tf_idf_matrix[row, tfidf_feat.index(word)]
            # to reduce the computation we are 
            # dictionary[word] = idf value of word in whole courpus
            # sent.count(word) = tf valeus of word in this review
            tf_idf = dictionary1[word]*(sent.count(word)/len(sent))
            sent_vec += (vec * tf_idf)
            weight_sum += tf_idf
    if weight_sum != 0:
        sent_vec /= weight_sum
    tfidf_sent_vectors_rev.append(sent_vec)
    row += 1

100%|██████████████████████████████████████| 9564/9564 [01:11<00:00, 134.23it/s]


In [106]:
print(len(tfidf_sent_vectors_rev))
print(len(tfidf_sent_vectors_rev[0]))
print("Number of rows: ",row)

9564
50
Number of rows:  9564


In [107]:
# TF-IDF weighted Word2Vec for summary
tfidf_feat_sum = model2.get_feature_names() 
# final_tf_idf is the sparse matrix with row= sentence, col=word and cell_val = tfidf

tfidf_sent_vectors_sum = []; # the tfidf-w2v for each summary of review is stored in this list
row=0;
for sent in tqdm(list_of_sentc_sum): 
    sent_vec = np.zeros(50)
    weight_sum =0; # num of words with a valid vector in the summary
    for word in sent: 
        if word in w2v_words_sum and word in tfidf_feat_sum:
            vec = w2v_model_sum.wv[word]
            tf_idf = dictionary2[word]*(sent.count(word)/len(sent))
            sent_vec += (vec * tf_idf)
            weight_sum += tf_idf
    if weight_sum != 0:
        sent_vec /= weight_sum
    tfidf_sent_vectors_sum.append(sent_vec)
    row += 1

100%|██████████████████████████████████████| 9564/9564 [00:11<00:00, 828.43it/s]


In [108]:
print(len(tfidf_sent_vectors_sum))
print(len(tfidf_sent_vectors_sum[0]))
print("Number of rows: ",row)

9564
50
Number of rows:  9564


The text data for reviews and summary is converted into numerical vectors. Now we can further analyse and trian our model on this engineered data. 