The Amazon Fine Food Reviews dataset consists of reviews of fine foods from Amazon.

Number of reviews: 568,454
Number of users: 256,059
Number of products: 74,258
Timespan: Oct 1999 - Oct 2012
Number of Attributes/Columns in data: 10

Attribute Information:

Id
ProductId - unique identifier for the product
UserId - unqiue identifier for the user
ProfileName
HelpfulnessNumerator - number of users who found the review helpful
HelpfulnessDenominator - number of users who indicated whether they found the review helpful or not
Score - rating between 1 and 5
Time - timestamp for the review
Summary - brief summary of the review
Text - text of the review
Objective:
Given a review, determine whether the review is positive (Rating of 4 or 5) or negative (rating of 1 or 2).


[Q] How to determine if a review is positive or negative?

[Ans] We could use the Score/Rating. A rating of 4 or 5 could be cosnidered a positive review. A review of 1 or 2 could be considered negative. A review of 3 is nuetral and ignored. This is an approximate and proxy way of determining the polarity (positivity/negativity) of a review.

In [1]:
%matplotlib inline
import warnings
warnings.filterwarnings("ignore")


import sqlite3
import pandas as pd
import numpy as np
import nltk          #natural language processing tool kit : for processing Text
import string
import matplotlib.pyplot as plt
import seaborn as sns

#scikitlearn library for machine learning
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import TfidfVectorizer

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import confusion_matrix
from sklearn import metrics
from sklearn.metrics import roc_curve, auc
from nltk.stem.porter import PorterStemmer

import re
# Tutorial about Python regular expressions: https://pymotw.com/2/re/

import string
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.stem.wordnet import WordNetLemmatizer

from gensim.models import Word2Vec
from gensim.models import KeyedVectors
import pickle

from tqdm import tqdm
import os

In [3]:
con = sqlite3.connect('./amazon-fine-food-reviews/database.sqlite') 
filtered_data = pd.read_sql_query('''
SELECT * 
FROM Reviews
WHERE Score !=3
''',con)

def partition(x):
    if x<3:
        return 0
    return 1

actualScore = filtered_data['Score']
positiveNegative = actualScore.map(partition)
filtered_data['Score'] = positiveNegative

In [4]:
print('no of data points in our data',filtered_data.shape)
filtered_data.head()

no of data points in our data (525814, 10)


Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text
0,1,B001E4KFG0,A3SGXH7AUHU8GW,delmartian,1,1,1,1303862400,Good Quality Dog Food,I have bought several of the Vitality canned d...
1,2,B00813GRG4,A1D87F6ZCVE5NK,dll pa,0,0,0,1346976000,Not as Advertised,Product arrived labeled as Jumbo Salted Peanut...
2,3,B000LQOCH0,ABXLMWJIXXAIN,"Natalia Corres ""Natalia Corres""",1,1,1,1219017600,"""Delight"" says it all",This is a confection that has been around a fe...
3,4,B000UA0QIQ,A395BORC6FGVXV,Karl,3,3,0,1307923200,Cough Medicine,If you are looking for the secret ingredient i...
4,5,B006K2ZZ7K,A1UQRSCLF8GW1T,"Michael D. Bigham ""M. Wassir""",0,0,1,1350777600,Great taffy,Great taffy at a great price. There was a wid...


## Data Cleaning : Deduplication 

In [8]:
display = pd.read_sql_query('''
Select * FROM Reviews
WHERE Score!=3 AND UserId="AR5J8UI46CURR"
ORDER BY ProductId
''',con)
display.head()
# single user should have given review for the single product. products would have same. but model would be different
# so we should dedupe the data 

Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text
0,78445,B000HDL1RQ,AR5J8UI46CURR,Geetha Krishnan,2,2,5,1199577600,LOACKER QUADRATINI VANILLA WAFERS,DELICIOUS WAFERS. I FIND THAT EUROPEAN WAFERS ...
1,138317,B000HDOPYC,AR5J8UI46CURR,Geetha Krishnan,2,2,5,1199577600,LOACKER QUADRATINI VANILLA WAFERS,DELICIOUS WAFERS. I FIND THAT EUROPEAN WAFERS ...
2,138277,B000HDOPYM,AR5J8UI46CURR,Geetha Krishnan,2,2,5,1199577600,LOACKER QUADRATINI VANILLA WAFERS,DELICIOUS WAFERS. I FIND THAT EUROPEAN WAFERS ...
3,73791,B000HDOPZG,AR5J8UI46CURR,Geetha Krishnan,2,2,5,1199577600,LOACKER QUADRATINI VANILLA WAFERS,DELICIOUS WAFERS. I FIND THAT EUROPEAN WAFERS ...
4,155049,B000PAQ75C,AR5J8UI46CURR,Geetha Krishnan,2,2,5,1199577600,LOACKER QUADRATINI VANILLA WAFERS,DELICIOUS WAFERS. I FIND THAT EUROPEAN WAFERS ...


In [9]:
#step 1: Sorting data according to ProductId in ascending order

sorted_data = filtered_data.sort_values('ProductId',axis=0,ascending=True,inplace=False,kind='quicksort',na_position='last')

In [10]:
# once sorting is done next deduplications of entries

final = sorted_data.drop_duplicates(subset={'UserId','ProfileName','Time','Summary','Text'}, keep='first', inplace=False)
final.shape

# 365333 rows and 10 columns
# after 568,454 after dropping duplicates 365333 is remained

# after filtered_data reviews were 568457 after final filtering data 365333

(365333, 10)

In [11]:
#Checking to see how much % of data still remains
(final['Id'].size*1.0) / (filtered_data['Id'].size*1.0)*100   # 365333/ 568454 *100

# 69% data remained after we cleaned up the data by removing duplicates

69.4795117665182

Observation:- It was also seen that in two rows given below the value of HelpfulnessNumerator is greater than HelpfulnessDenominator which is not practically possible hence these two rows too are removed from calcualtions

In [12]:
# HelpfullnessDenominator > HelpfullnessNumerator

display = pd.read_sql_query("""
SELECT * 
FROM Reviews
WHERE Score !=3 AND Id=44737 OR Id=64422
ORDER BY ProductID
""",con)
display


Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text
0,64422,B000MIDROQ,A161DK06JJMCYF,"J. E. Stephens ""Jeanne""",3,1,5,1224892800,Bought This for My Son at College,My son loves spaghetti so I didn't hesitate or...
1,44737,B001EQ55RW,A2V0I904FH7ABY,Ram,3,2,4,1212883200,Pure cocoa taste with crunchy almonds inside,It was almost a 'love at first bite' - the per...


In [13]:
final = final[final.HelpfulnessNumerator <= final.HelpfulnessDenominator]

In [23]:
#Before starting the next phase of preprocessing lets see the number of entries left
print(final.shape)

final['Score'].value_counts()

# positve reviews:307967  and negative reviews:57364

(365331, 10)


1    307967
0     57364
Name: Score, dtype: int64

## Text Preprocessing

In [28]:
# find sentence containing HTML tags
# Alphanumeric : a character that is either a letter or a number.

i=0
for sent in final['Text'].values:       # for each review / sentence
    if (len(re.findall('<.*>',sent))):# if regular expression of finding that special characters print them along with the count
        print(i)
        print(sent)
        break
    i+=1


6
I set aside at least an hour each day to read to my son (3 y/o). At this point, I consider myself a connoisseur of children's books and this is one of the best. Santa Clause put this under the tree. Since then, we've read it perpetually and he loves it.<br /><br />First, this book taught him the months of the year.<br /><br />Second, it's a pleasure to read. Well suited to 1.5 y/o old to 4+.<br /><br />Very few children's books are worth owning. Most should be borrowed from the library. This book, however, deserves a permanent spot on your shelf. Sendak's best.


In [48]:
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english')) # set of stopwords
#stop_words = stop_words.remove('not')
snowball_stemmer = nltk.stem.SnowballStemmer('english') # initialising the snowball stemmer

def cleanhtml(s):
    a=re.sub('[|>.*?|\.*|?.*?]',"",s)
    return a
def cleanpunc(s):
    a=re.sub('[.|,|!,|)|(|/|\|”|\’|#|@|$|-|%|]',"",s)  # substitue means replace with space
    return a

print(stop_words)
print('='*100)
print(snowball_stemmer.stem('tasty'))  # stemming drops last two characters

# here i have removed html tags, punctutaions and special charcters etc., and gave the stemming words

{'mightn', "shan't", 'hadn', "haven't", 'are', 'and', 'such', 'had', 'for', 'they', 'during', 'themselves', 'each', 's', 'under', 'out', 'shouldn', "wouldn't", "you're", 'here', 'his', 'of', 'doesn', 'again', 'my', 'couldn', 'itself', 'but', 'did', 'with', 'were', 'won', 'over', 't', 'both', 'to', 'our', 'no', 'yourself', 'at', 'wouldn', 'by', 'further', 'yours', 'in', 'all', "don't", 'so', 'which', 'hasn', 'now', 'against', 'ours', 'before', 'as', "she's", 'hers', 'an', 'not', "mustn't", "you'd", 'below', 'off', 'should', 'few', 'she', 'or', 'theirs', 'will', 'than', "you'll", 'you', 'only', 'd', 'didn', 'other', "shouldn't", 'this', 'a', 'above', 'm', 'nor', 'shan', 'very', "that'll", 'myself', "wasn't", 'her', 'while', 'most', 'he', 'ourselves', 'down', 'there', "should've", 'on', 'me', 'where', 'whom', 'who', 'after', "isn't", 'll', 'same', 'do', "it's", 'its', 'into', 'have', 'once', "weren't", 'isn', 'i', 'aren', 'then', 'yourselves', 'just', 'these', "couldn't", 'ain', 'between'

In [71]:
# preprocessing steps

i = 0
str1 = ' '
final_string =[]
all_positive_words=[]    # store words from positive reviews here
all_negative_words=[]    # store words from negative reviews here
s=''
for sentence in final['Text'].values:
    filtered_sentence = []  # after all the cleaning is done im going to store it in filtered_sentence
    #print(sent)
    sentence = cleanhtml(sentence)  # remove html tags
    for words in sentence.split():       # it splits all the words in the sentence and into the words
        for cleaned_words in cleanpunc(words).split(): # it splits the removal special characters wordsafter cleaning
            if ((cleaned_words.isalpha())) & (len(cleaned_words)>2): # if cleaned_words are either a letter or number
                                                                    # and each cleaned words length is greater than 2
                if (cleaned_words.lower() not in stop_words):#the lowercase cleaned_words which are not present in stopwords
                    s = (snowball_stemmer.stem(cleaned_words.lower())).encode('utf8')
                        # after stemming the lower case cleaned_words eg: taste,tasti
                    filtered_sentence.append(s)  # append stemmed cleaned_words into filtered_sentence
                        
                    if (final['Score'].values)[i] == 1:
                        all_positive_words.append(s)          # list of all positve words appending into all_positive_words
                    if (final['Score'].values)[i] == 0:
                        all_negative_words.append(s)          # list of all positve words appending into all_positive_words
                else:
                    continue  # if lower case cleaned_words are present in stopwords then skip
            else:
                continue # if cleaned_words are not alpha numeric and not len of cleaned_words are greater than 2 then continue
                
    str1 = b' '.join(filtered_sentence) # final string of cleaned words      b is bytes
    final_string.append(str1)
    i+=1
                            

In [58]:
final['CleanedText'] = final_string   # adding a column of CleandText which stores the appending values of final_string

In [59]:
final.head(3)


# store final table into sqllite table for future

conn = sqlite3.connect('final.sqlite')
c = conn.cursor()
conn.text_factory = str
final.to_sql('Reviews',conn, schema = None, if_exists='replace')

observations : 1. a new column has been added to our dataset. which is CleanedText from which we have the filtered_sentence

## BAG OF WORDS

In [62]:
# BOW
# Countvectorizer() in scikit-learn
# converting word to a vector is called vectorizer
# By using CountVectorizer function we can convert text document to matrix of word count.
#eg: review ==> unique words  ==> 1,2,3,4,5 ==> 11010 etc.,
# Matrix which is produced here is sparse matrix. 

#CountVectorizer converts text document to matrix of word count[sparse matrix]

count_vect = CountVectorizer()
final_counts = count_vect.fit_transform(final['Text'].values)

#in final dataframe at Text column ,get the Text column convert them into values

In [37]:
type(final_counts)

scipy.sparse.csr.csr_matrix

In [63]:
final_counts.get_shape()         #365331 rows/reviews  and 115281 columns/words in document
                                 # every column here corresponds to an unqiue word

#sparse matrix = 365331 * 115281
#115281 unique words ; every word has different dimensioanlity



# Total -elements --> n*m = 365331 * 1152581 = 64803042 words
# total non-zero elements ==> k = 156497

# sparsity ==> s= (n*m-k)/(n*m) = (64803042 - 156497) / 64803042 = 0.9975
# Density ==> D = (k/n*m) or (1-sparsity) = 156497/64803042 = 0.0025
# Reduction = (n*m)/(3* no of non zero elements) = 64803042/3*156497 = 138 times

# sparsity is no of 0 values. 
#Note — Countvectorizer produces sparse matrix which sometime not suited for some machine learning model 
#hence first convert this sparse matrix to dense matrix then apply machine learning model

#thus, dense = 1 - sparsity = 1-0.9975 = 0.0025  
# dense matrix is no of non zero values
# Hence dense matrix is suited for machine learning model

(365331, 115281)

In [60]:
count_vect = CountVectorizer()
final_counts = count_vect.fit_transform(final['CleanedText'].values)

In [52]:
type(final_counts)

scipy.sparse.csr.csr_matrix

In [61]:
final_counts.get_shape()   #after all the cleaning is done  we came to 114602 words [dimensions]

(365331, 114602)

## Uni-gram, Bi-gram and N-gram

In [54]:
#After which we collect the words used to describe positive and negative reviews

#now that we have our list of words describing positive and negative reviews lets analyse them.
# we begin analysis by getting the frequency distribution of the words as shown below

In [72]:
freq_dist_positive = nltk.FreqDist(all_positive_words)
freq_dist_negative = nltk.FreqDist(all_negative_words)
print('Most common positive words:',freq_dist_positive.most_common(10))
print('Most common Negative words:',freq_dist_negative.most_common(10))

Most common positive words: [(b'like', 139093), (b'tast', 126408), (b'good', 110151), (b'love', 107082), (b'flavor', 106234), (b'use', 103384), (b'great', 102388), (b'one', 95384), (b'product', 88636), (b'tri', 85869)]
Most common Negative words: [(b'tast', 33853), (b'like', 32189), (b'product', 27476), (b'one', 20283), (b'flavor', 18603), (b'would', 18023), (b'tri', 17670), (b'use', 15246), (b'good', 14572), (b'coffe', 14158)]


Observation: From the above it can be seen that the most common positive and negative words overlap 
for eg: 'like' could be used as 'not like' etc.,

so it is a good idea to consider pairs of consequent words(bi-grams) or q sequence of n consecutive words(n-grams)

In [75]:
# bi-gram , tri-gram and n-gram
import re
# Tutorial about Python regular expressions: https://pymotw.com/2/re/

import string
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.stem.wordnet import WordNetLemmatizer

#ngram_range=(1, 1),(min_n , max_n)
count_vect = CountVectorizer(ngram_range=(1,2)) # in scikit learn  ; the words ranges between uni-gram and bi-gram gives massivecount
 # (uni-gram, bi-gram) # as n increases, d[dimensions] increases 
final_bigram_counts = count_vect.fit_transform(final['CleanedText'].values)


In [76]:
final_bigram_counts.get_shape()

#during bow we had 115k dimensions[uni-gram] now we have 2.9M dimensions [massive]

(365331, 2958327)

## TF-IDF (term frequency - inverse documnet frequency)

In [73]:


tf_idf_vect = TfidfVectorizer(ngram_range=(1,2))  # (unigram, bigram)
final_tf_idf = tf_idf_vect.fit_transform(final['CleanedText'].values)

# final_tf_idf itslef is a sparse matirx

In [74]:
final_tf_idf.get_shape()
#during bow we had 115k dimensions[uni-gram] now we have 2.9M dimensions [massive]

#2.9M features/dimensions

(365331, 2958327)

In [77]:
features = tf_idf_vect.get_feature_names()
len(features)

2958327

In [78]:
features[100000:100010]   #bi grams

['anoth method',
 'anoth metra',
 'anoth metro',
 'anoth middl',
 'anoth might',
 'anoth migrain',
 'anoth mild',
 'anoth milder',
 'anoth mile',
 'anoth milk']

In [79]:
# convert a row in sparsematrix to a numpy array
# for review3 if i wanna get the vector
print(final_tf_idf[3,:].toarray()[0])   #if want to get review 3 into sparse matrix

[0. 0. 0. ... 0. 0. 0.]


In [86]:
def top_tfidf_features(row, features, top_n=25):
    # get top n tfidf values in row and return them with their corresponding ranks
    topn_ids = np.argsort(row)[::-1][:top_n]#sorting the top 25 in descending order of review given in calling function, till 25
    #np.argsort(row)[::-1] # sorting the row according to given in calling function[which review/row from descending order]
    #np.argsort(row)[::-1][:top_n] == it sort the row till 25 values
    top_feats = [(features[i],row[i]) for i in topn_ids]  # topfeatures = features,tfidf values and iterates till 25
    df = pd.DataFrame(top_feats)              # creating a dataframe of top_feats
    df.columns = ['feature','tfidf']          # creating the columns
    return df

top_tfidf = top_tfidf_features(final_tf_idf[1,:].toarray()[0],features,25)

# row = i had get the review 1 and converted into sparse matrix 

In [87]:
top_tfidf      # feature is bi-gram

Unnamed: 0,feature,tfidf
0,version paperback,0.192039
1,page open,0.192039
2,grew read,0.192039
3,incorpor love,0.192039
4,keep page,0.192039
5,read sendak,0.192039
6,sendak book,0.192039
7,rosi movi,0.192039
8,movi incorpor,0.192039
9,paperback seem,0.192039


In [None]:
# TF-IDF still doesnt take semantic word so that is why we use word2vec : tasty ==> delicious ; cheap==> affordable

## Word2Vec 

In [89]:
# it takes semantic meaning of words into consideration

from gensim.models import Word2Vec
from gensim.models import KeyedVectors
import pickle

model = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin.gz',binary = True)
# google gives you the vector representation in the form of above link and vector representatition is 300 dimension

In [90]:
model.wv['computer']

array([ 1.07421875e-01, -2.01171875e-01,  1.23046875e-01,  2.11914062e-01,
       -9.13085938e-02,  2.16796875e-01, -1.31835938e-01,  8.30078125e-02,
        2.02148438e-01,  4.78515625e-02,  3.66210938e-02, -2.45361328e-02,
        2.39257812e-02, -1.60156250e-01, -2.61230469e-02,  9.71679688e-02,
       -6.34765625e-02,  1.84570312e-01,  1.70898438e-01, -1.63085938e-01,
       -1.09375000e-01,  1.49414062e-01, -4.65393066e-04,  9.61914062e-02,
        1.68945312e-01,  2.60925293e-03,  8.93554688e-02,  6.49414062e-02,
        3.56445312e-02, -6.93359375e-02, -1.46484375e-01, -1.21093750e-01,
       -2.27539062e-01,  2.45361328e-02, -1.24511719e-01, -3.18359375e-01,
       -2.20703125e-01,  1.30859375e-01,  3.66210938e-02, -3.63769531e-02,
       -1.13281250e-01,  1.95312500e-01,  9.76562500e-02,  1.26953125e-01,
        6.59179688e-02,  6.93359375e-02,  1.02539062e-02,  1.75781250e-01,
       -1.68945312e-01,  1.21307373e-03, -2.98828125e-01, -1.15234375e-01,
        5.66406250e-02, -

In [91]:
model.wv.similarity('woman','man')

0.76640123

In [92]:
model.wv.similarity('man','man')

1.0

In [95]:
model.wv.similarity('woman','Aeroplane')

-0.021405596

In [96]:
model.wv.most_similar('women')

[('men', 0.767493724822998),
 ('Women', 0.7283450365066528),
 ('womens', 0.6786180138587952),
 ('girls', 0.633903980255127),
 ('females', 0.6240420937538147),
 ('mothers', 0.6050933599472046),
 ('ladies', 0.5865179300308228),
 ('husbands', 0.5705342292785645),
 ('transwomen', 0.5697939991950989),
 ('Men', 0.5693342685699463)]

In [97]:
model.wv.most_similar('robots')

[('robot', 0.8341808319091797),
 ('Robots', 0.7578040361404419),
 ('robotic', 0.7340320348739624),
 ('autonomous_robots', 0.7151706218719482),
 ('humanoid_robots', 0.7042970657348633),
 ('robotics', 0.6718464493751526),
 ('Robot', 0.6514714956283569),
 ('bots', 0.6371974349021912),
 ('humanoid', 0.6353020668029785),
 ('androids', 0.6327425837516785)]

In [100]:
model.wv.most_similar('intelligence')

[('Intelligence', 0.7189884185791016),
 ('intel', 0.6356417536735535),
 ('CIA', 0.6148777008056641),
 ('counterintelligence', 0.604588508605957),
 ('Alain_Chouet', 0.5940318703651428),
 ('Intelligence_Agency', 0.5846039056777954),
 ('counterterrorism', 0.5823408365249634),
 ('humint', 0.5769444108009338),
 ('chief_Ali_Mamluk', 0.5650478005409241),
 ('traditional_spycraft', 0.5622859001159668)]

In [102]:
#model.wv.most_similar('tasti')  # it gives you an error
model.wv.most_similar('tasty')

[('delicious', 0.8730389475822449),
 ('scrumptious', 0.8007042407989502),
 ('yummy', 0.7856924533843994),
 ('flavorful', 0.7420164346694946),
 ('delectable', 0.7385421991348267),
 ('juicy_flavorful', 0.7114803791046143),
 ('appetizing', 0.7017217874526978),
 ('crunchy_salty', 0.7012300491333008),
 ('flavourful', 0.6912213563919067),
 ('flavoursome', 0.6857702732086182)]

In [104]:
#model.wv.similarity('taste','tasti') # it gives you an error becoz it wont always accept the stemming words [meaningless]
model.wv.similarity('taste','tasty')

0.45559818

In [110]:
# Train your own word2vec model using your own text corpus
import gensim
i = 0
list_of_sentence = []
for sent in final['Text'].values:
    filtered_sentence = []      # after all the cleaning is done im going to store it in filtered_sentence
    sentence = cleanhtml(sent)    # removes the html
    
    for words in sent.split(): # it splits all cleaned words in the sentence and into the words
        for cleaned_words in cleanpunc(words).split(): # it splits the removal special characters wordsafter cleaning
            if ((cleaned_words.isalpha())) & (len(cleaned_words)>2): # if cleaned_words are either a letter or number
                                                                    # and each cleaned words length is greater than 2
                filtered_sentence.append(cleaned_words.lower())#it adds all the lower case cleaned words into filtered_sentence 
            else:
                continue   # if they are not a letter or a number it just skips
    list_of_sentence.append(filtered_sentence)


In [111]:
print(final['Text'].values[0])
print('='*100)
print(list_of_sentence[0])

this witty little book makes my son laugh at loud. i recite it in the car as we're driving along and he always can sing the refrain. he's learned about whales, India, drooping roses:  i love all the new words this book  introduces and the silliness of it all.  this is a classic book i am  willing to bet my son will STILL be able to recite from memory when he is  in college
['this', 'witty', 'little', 'book', 'makes', 'son', 'laugh', 'loud', 'recite', 'the', 'car', 'driving', 'along', 'and', 'always', 'can', 'sing', 'the', 'refrain', 'learned', 'about', 'whales', 'india', 'drooping', 'love', 'all', 'the', 'new', 'words', 'this', 'book', 'introduces', 'and', 'the', 'silliness', 'all', 'this', 'classic', 'book', 'willing', 'bet', 'son', 'will', 'still', 'able', 'recite', 'from', 'memory', 'when', 'college']


In [113]:
# To train the word2Vec model
w2v_model= gensim.models.Word2Vec(list_of_sentence,min_count=5,size=50,workers=4)



In [115]:
words = list(w2v_model.wv.vocab)
print(len(words))

33016


In [117]:
w2v_model.wv.most_similar('tasty')  # using my corpus earlier i have used google corpus

[('tastey', 0.8855271339416504),
 ('yummy', 0.8363196849822998),
 ('delicious', 0.8146854639053345),
 ('satisfying', 0.8130784034729004),
 ('filling', 0.7824431657791138),
 ('flavorful', 0.7773221731185913),
 ('scrumptious', 0.6963809132575989),
 ('versatile', 0.6841164231300354),
 ('hardy', 0.6776205897331238),
 ('delicous', 0.6768090724945068)]

In [118]:
w2v_model.wv.most_similar('like')

[('resemble', 0.7000719308853149),
 ('prefer', 0.6620005965232849),
 ('alright', 0.6514576077461243),
 ('gross', 0.6503801345825195),
 ('weird', 0.6427991390228271),
 ('dislike', 0.6390759944915771),
 ('okay', 0.6293050646781921),
 ('fake', 0.6194825768470764),
 ('hate', 0.6076527833938599),
 ('remind', 0.6031914949417114)]

In [120]:
count_feature_vect = count_vect.get_feature_names()
print(count_feature_vect.index('like'))
print(count_feature_vect[1466718])

1466718
like


## Average Word2vec and TFIDF Word2vec

In [121]:
# Average word2vec
#w2v : word==> vector(d-dim)
#AvgW2v : sequence of words/sentence --> vector

# n1words ==> review1: w1 w2 w1 w3 w4 w5
# review1 ==> vector1
# AvgW2v: 1/n1 [w2v(w1) + w2v(w2) + w2v(w1) + w2v(w3) + w2v(w4) + w2v(w5)]
#AvgW2v: 1/n sum of w2v wi


In [None]:
# tfidf word2vec
# review1 : w1 w2 w1 w3 w4 w5
#tfidf :   |t1|t2|t3|t4|t5|t6|   # while converting word to vector values has been given in the box

#tfidf-w2v(review1) = t1*w2v(w1)+t2*w2v(w2)+t3*w2v(w3)+t4*w2v(w4)+t5*w2v(w5)  //  (t1+t2+t3+t4+t5)

#tfidf-w2v(review1)  = sumof (ti * w2v(wi)) // sum of ti

In [123]:
#average word2vec
# compute average word2vec for each review

sentence_vectors = [] # the average word2vec for each review/sentence is stored in sentence_vect
for sentence in list_of_sentence:  # for each review/sentence
    sent_vect = np.zeros(50)   # as words vectors are of zero length
    count_words = 0  # no of words with a valid vector in the sentence/review
    for word in sentence:  # for each word in a review/sentence
        try:
            vector = w2v_model.wv[word]
            sent_vect += vector
            count_words += 1
        except:
            pass
    sent_vect = sent_vect / count_words
    sentence_vectors.append(sent_vect)
print(len(sentence_vectors))
print(len(sentence_vectors[0]))

365331
50


In [None]:
# tf-idf word2vec
tfidf_feature = tf_idf_vect.get_feature_names()  # tfidf words/col-names
#final-tf_idf is the sparse matrix with row = sentence, col=word and cell_val = tfidf

tfidf_sentence_vectors = []# the tfidf-w2v for eah sentence/review is stored in the variable tfidf_sent_vectors
row = 0
for sentence in list_of_sentence:  # for each review/sentence
    sent_vect = np.zeros(50)    # as words vectors are of zero length
    weight_sum = 0   # no of words with a valid vector in the sentence/review
    for word in sentence:  # for each word in a review/sentence
        try:
            vector = w2v_model.wv[word]
            # obtaun the tfidf of a word in a sentence/word
            tf_idf = final_tf_idf[row, tfidf_feature.index(word)]
            sent_vect += (vector * tf_idf)
            weight_sum += tf_idf
        except: 
            pass
    sent_vect = sent_vect / weight_sum
    tfidf_sentence_vectors.append(sent_vect)
    row +=1
print(len(tfidf_sentence_vectors))
print(len(tfidf_sentence_vectors[0]))