
#Loading Data from Kaggle


In [1]:
!pip install kaggle



In [2]:
from google.colab import files
files.upload()

Saving kaggle.json to kaggle.json


{'kaggle.json': b'{"username":"pratyakshsingh","key":"e1a64879a9d9f50ccfaafd16798ab02d"}'}

In [3]:
!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/

#change the permission
!chmod 600 ~/.kaggle/kaggle.json

In [4]:
!kaggle datasets download -d snap/amazon-fine-food-reviews

Downloading amazon-fine-food-reviews.zip to /content
 94% 228M/242M [00:02<00:00, 150MB/s]
100% 242M/242M [00:02<00:00, 107MB/s]


In [5]:
from zipfile import ZipFile
file_name = "amazon-fine-food-reviews.zip"
with ZipFile(file_name,'r') as zf:  # named zf so the built-in zip() is not shadowed
  zf.extractall()

#Analyzing the Data

##Importing Modules

In [6]:
import os
import re  # Tutorial about Python regular expressions: https://pymotw.com/2/re/
import sqlite3
import string

import pandas as pd
import numpy as np
import nltk
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import roc_curve, auc

from nltk.stem import PorterStemmer
from nltk.stem.wordnet import WordNetLemmatizer

from gensim.models import Word2Vec
from gensim.models import KeyedVectors

##Loading Data

In [7]:
# The CSV file works too, but you will sometimes need to work with SQL, so it is worth learning this as well

#Create a connection to the database
con=sqlite3.connect('database.sqlite')

In [8]:
filtered_data=pd.read_sql_query("SELECT * FROM Reviews WHERE Score!=3",con)
print(filtered_data.columns)
print(filtered_data['Score'].value_counts())

Index(['Id', 'ProductId', 'UserId', 'ProfileName', 'HelpfulnessNumerator',
       'HelpfulnessDenominator', 'Score', 'Time', 'Summary', 'Text'],
      dtype='object')
5    363122
4     80655
1     52268
2     29769
Name: Score, dtype: int64
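
The same filter can be tried on a tiny in-memory database before running it on the full file. This is a minimal sketch, assuming a hypothetical three-column `Reviews` table with invented rows (the real table has the ten columns listed above):

```python
import sqlite3

# Hypothetical toy table; the rows are invented for illustration.
con_demo = sqlite3.connect(":memory:")
con_demo.execute("CREATE TABLE Reviews (Id INTEGER, Score INTEGER, Text TEXT)")
con_demo.executemany(
    "INSERT INTO Reviews VALUES (?, ?, ?)",
    [(1, 5, "great"), (2, 3, "okay"), (3, 1, "awful"), (4, 4, "good")],
)

# Same predicate as the notebook query: neutral 3-star reviews are excluded.
rows = con_demo.execute(
    "SELECT Id, Score FROM Reviews WHERE Score != 3 ORDER BY Id"
).fetchall()
print(rows)
con_demo.close()
```

Only the review with `Score` equal to 3 is dropped; every other row passes through unchanged.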


In [9]:
def partition(x):
  if x<3:
    return "negative"
  return "positive"

In [10]:
actualScore=filtered_data['Score']
positive_negative=actualScore.map(partition)
filtered_data['Score']=positive_negative

In [11]:
filtered_data['Score'].value_counts()

positive    443777
negative     82037
Name: Score, dtype: int64

#Data Preprocessing

##Data Cleaning

In [12]:
# Sorting the data in ascending order of ProductId

sorted_data=filtered_data.sort_values('ProductId')

In [13]:
#Dropping Duplicates
print("Dimension of the data before dropping Duplicates : ",sorted_data.shape)

final=sorted_data.drop_duplicates(subset=["UserId","ProfileName","Time","Text"],keep='first',inplace=False)

print("Dimension of the data after dropping Duplicates : ",final.shape)

Dimension of the data before dropping Duplicates :  (525814, 10)
Dimension of the data after dropping Duplicates :  (364173, 10)
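
`keep='first'` retains the first row for each distinct combination of the subset columns, so the preceding sort decides which copy survives. A plain-Python sketch of that rule on invented rows (the tuples below are hypothetical, not taken from the dataset, and this only mimics what pandas does):

```python
# Invented rows: (ProductId, UserId, ProfileName, Time, Text)
rows = [
    ("B002", "U1", "ann", 100, "love it"),   # same review posted on a second product
    ("B001", "U1", "ann", 100, "love it"),
    ("B003", "U2", "bob", 200, "too salty"),
]

seen = set()
deduped = []
for row in sorted(rows, key=lambda r: r[0]):   # sort by ProductId first
    key = row[1:]                               # UserId, ProfileName, Time, Text
    if key not in seen:                         # keep only the first occurrence
        seen.add(key)
        deduped.append(row)

print([r[0] for r in deduped])
```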


In [14]:
#Dropping all the reviews whose HelpfulnessNumerator is greater than HelpfulnessDenominator

final=final[final['HelpfulnessNumerator']<=final['HelpfulnessDenominator']]
final.shape

(364171, 10)

In [15]:
final['Score'].value_counts()

positive    307061
negative     57110
Name: Score, dtype: int64

##Text Preprocessing

Now that deduplication is finished, the data needs some preprocessing before we go further with the analysis and build the prediction model.

In the preprocessing phase we do the following, in this order:

1. Begin by removing the HTML tags
2. Remove punctuation and a limited set of special characters such as , or . or #
3. Check that the word is made up of English letters and is not alphanumeric
4. Check that the length of the word is greater than 2 (on the assumption that no adjective has only two letters)
5. Convert the word to lowercase
6. Remove stop words
7. Finally, apply Snowball stemming to the word (it was observed to work better than Porter stemming)

After that, we collect the words used to describe positive and negative reviews.

In [16]:
print(final['Text'].values[0])
print("="*100)
print(final['Text'].values[100])
print("="*100)
print(final['Text'].values[1000])
print("="*100)
print(final['Text'].values[2000])
print("="*100)
print(final['Text'].values[3000])
print("="*100)
print(final['Text'].values[4000])
print("="*100)
print(final['Text'].values[10000])
print("="*100)
print(final['Text'].values[20000])
print("="*100)
print(final['Text'].values[30000])

this witty little book makes my son laugh at loud. i recite it in the car as we're driving along and he always can sing the refrain. he's learned about whales, India, drooping roses:  i love all the new words this book  introduces and the silliness of it all.  this is a classic book i am  willing to bet my son will STILL be able to recite from memory when he is  in college
Pros:<br />Dog will do anything for this treat.<br />Doesn't smell as bad as many other treats.<br />Easy to break into smaller pieces.<br />Nothing artificial, easy digestion.<br /><br />Cons:<br />More costly than other dog treats.<br /><br />Overall, this is a great product. While more expensive, my dog will do anything for this treat. He has several phobias, including getting in and out of the car, and walking through doorways, but he ignores all of his fears to get to this treat.
I was really looking forward to these pods based on the reviews.  Starbucks is good, but I prefer bolder taste.... imagine my surprise

We can see that the text contains HTML tags and numbers that are of no use to us, so we remove them before vectorizing the reviews.

###Removing URLs from the text

In [17]:
txt="""Why is this $[...] when the same product is available for $[...] here?<br />
http://www.amazon.com/VICTOR-FLY-MAGNET-BAIT-REFILL/dp/B00004RBDY<br /><br />
The Victor M380 and M502 traps are unreal, of course -- total fly genocide. Pretty stinky, but only right nearby."""

In [18]:
# https://stackoverflow.com/a/40823105/4084039

# The re module offers a set of functions for searching a string for a match:

# Function   Description
# findall    Returns a list containing all matches
# search     Returns a Match object if there is a match anywhere in the string
# split      Returns a list where the string has been split at each match
# sub        Replaces one or many matches with a string

txt1=re.sub(r"http\S+","",txt) #replaces url with empty string
print(txt)
print(txt1)

Why is this $[...] when the same product is available for $[...] here?<br />
http://www.amazon.com/VICTOR-FLY-MAGNET-BAIT-REFILL/dp/B00004RBDY<br /><br />
The Victor M380 and M502 traps are unreal, of course -- total fly genocide. Pretty stinky, but only right nearby.
Why is this $[...] when the same product is available for $[...] here?<br />
 /><br />
The Victor M380 and M502 traps are unreal, of course -- total fly genocide. Pretty stinky, but only right nearby.


###Removing all the tags from the text

In [19]:
# https://stackoverflow.com/questions/16206380/python-beautifulsoup-how-to-remove-all-tags-from-an-element

from bs4 import BeautifulSoup

soup=BeautifulSoup(txt1,"lxml")
txt2=soup.get_text()
txt2

'Why is this $[...] when the same product is available for $[...] here?\n />\nThe Victor M380 and M502 traps are unreal, of course -- total fly genocide. Pretty stinky, but only right nearby.'
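
The leftover ` />` in the output above comes from the URL pattern: `http\S+` runs up to the next whitespace, so it also swallows the `<br` that directly follows the link. One possible workaround is to strip the markup before removing URLs. A minimal sketch, using a simple regular expression as a stand-in for BeautifulSoup (the helper name `strip_tags_then_urls` and the shortened sample string are hypothetical):

```python
import re

def strip_tags_then_urls(text):
    """Remove HTML tags first, then URLs, so no tag fragments are left behind."""
    # Crude tag stripper; assumes no '>' character inside tag attributes.
    no_tags = re.sub(r"<[^>]+>", " ", text)
    no_urls = re.sub(r"http\S+", "", no_tags)
    # Collapse the whitespace left behind by the two substitutions.
    return re.sub(r"\s+", " ", no_urls).strip()

# Invented, shortened sample in the style of the review above.
sample = "cheaper here?<br />\nhttp://www.amazon.com/dp/B00004RBDY<br /><br />\nThe traps work."
print(strip_tags_then_urls(sample))
```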

###Decontracting our text

In [20]:
# https://stackoverflow.com/a/47091490/4084039

def decontracted(phrase):
    # specific
    phrase = re.sub(r"won't", "will not", phrase)
    phrase = re.sub(r"can\'t", "can not", phrase)

    # general
    phrase = re.sub(r"n\'t", " not", phrase)
    phrase = re.sub(r"\'re", " are", phrase)
    phrase = re.sub(r"\'s", " is", phrase)
    phrase = re.sub(r"\'d", " would", phrase)
    phrase = re.sub(r"\'ll", " will", phrase)
    phrase = re.sub(r"\'t", " not", phrase)
    phrase = re.sub(r"\'ve", " have", phrase)
    phrase = re.sub(r"\'m", " am", phrase)
    return phrase

In [21]:
print(decontracted("I won't party I'll study"))

I will not party I will study
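
One caveat with the general rules: `'s` is always expanded to ` is`, so possessives are rewritten as well. A short self-contained check (the reduced `expand` helper is a hypothetical two-rule copy of `decontracted`, used here only so the snippet runs on its own):

```python
import re

def expand(phrase):
    # Two of the rules from decontracted(), copied for a standalone demo.
    phrase = re.sub(r"n\'t", " not", phrase)
    phrase = re.sub(r"\'s", " is", phrase)
    return phrase

contraction = expand("it's good")        # a real contraction
possessive = expand("the dog's bowl")    # a possessive, expanded the same way
print(contraction)
print(possessive)
```

Since `is` is in the stop-word list used later, the stray word is dropped again in the final tokens, so the practical effect should be small.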


###Removing words that contain numbers

In [22]:
#remove words with numbers python: https://stackoverflow.com/a/18082370/4084039

sent_0 = re.sub(r"\S*\d\S*", "", "I took 7 pie out of7").strip()  # raw string avoids invalid-escape warnings
print(sent_0)

I took  pie out


###Removing special characters

In [23]:
#remove special characters: https://stackoverflow.com/a/5843547/4084039
txt3 = re.sub('[^A-Za-z0-9]+', ' ', txt2)
print(txt2)
print("="*200)
print(txt3)

Why is this $[...] when the same product is available for $[...] here?
 />
The Victor M380 and M502 traps are unreal, of course -- total fly genocide. Pretty stinky, but only right nearby.
Why is this when the same product is available for here The Victor M380 and M502 traps are unreal of course total fly genocide Pretty stinky but only right nearby 


###StopWords

In [24]:
# https://gist.github.com/sebleier/554280
# we removed 'no', 'nor' and 'not' from the stop-word list so that negations are kept
# <br /><br /> ==> after the above steps, we are left with "br br",
# so 'br' is added to the stop-word list
# if the tags were written <br/> instead of <br />, they would have been removed in the first step

stopwords= set(['br', 'the', 'i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've",\
            "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', \
            'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their',\
            'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', \
            'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', \
            'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', \
            'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after',\
            'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further',\
            'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more',\
            'most', 'other', 'some', 'such', 'only', 'own', 'same', 'so', 'than', 'too', 'very', \
            's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', \
            've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn',\
            "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn',\
            "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", \
            'won', "won't", 'wouldn', "wouldn't"])

###Stemming

In [25]:
from nltk.stem import SnowballStemmer
sno= SnowballStemmer("english")
print(sno.stem('tasty'))

tasti


##Applying all the steps to the text

In [26]:
import tqdm.notebook as tq
# tqdm gives us a progress bar
# Note that we do not apply stemming here because it is not needed for W2V
preprocessed_text=[]
for sentance in tq.tqdm(final['Text'].values):
  sentance = re.sub(r"http\S+","",sentance)             # remove URLs
  sentance = BeautifulSoup(sentance,"lxml").get_text()  # remove HTML tags
  sentance = decontracted(sentance)                     # expand contractions
  sentance = re.sub(r"\S*\d\S*", "", sentance)          # remove words containing digits
  sentance = re.sub('[^A-Za-z]+', ' ', sentance)        # keep letters only
  # the stop-word test runs before lowercasing, so capitalised stop words such as 'This' survive
  sentance = ' '.join(e.lower()  for e in sentance.split()  if e not in stopwords and len(e)>2)
  preprocessed_text.append(sentance)


In [27]:
preprocessed_text[5]

'charming rhyming book describes circumstances eat not chicken soup rice month month this sounds like kind thing kids would make recess sing drive teachers crazy cute catchy sounds really childlike skillfully written'

##Summary Conversion

In [28]:
preprocessed_summary=[]
# NOTE: this loop reads final['Text'], so preprocessed_summary ends up identical to preprocessed_text;
# iterate over final['Summary'].values instead to preprocess the summaries themselves
for sentance in tq.tqdm(final['Text'].values):
  sentance = re.sub(r"http\S+","",sentance)
  sentance = BeautifulSoup(sentance,"lxml").get_text()
  sentance = decontracted(sentance)
  sentance = re.sub(r"\S*\d\S*", "", sentance)
  sentance = re.sub('[^A-Za-z]+', ' ', sentance)
  sentance = ' '.join(e.lower()  for e in sentance.split()  if e not in stopwords and len(e)>2)
  preprocessed_summary.append(sentance)


In [29]:
preprocessed_summary[10]

'get movie sound track sing along carol king this great stuff whole extended family knows songs heart quality kids storytelling music'

###Analysing text and summary

In [30]:
print(preprocessed_text[10])
print("="*200)
print(preprocessed_summary[10])

get movie sound track sing along carol king this great stuff whole extended family knows songs heart quality kids storytelling music
get movie sound track sing along carol king this great stuff whole extended family knows songs heart quality kids storytelling music


In [31]:
print(preprocessed_text[100])
print("="*200)
print(preprocessed_summary[100])

pros dog anything treat does not smell bad many treats easy break smaller pieces nothing artificial easy digestion cons more costly dog treats overall great product while expensive dog anything treat several phobias including getting car walking doorways ignores fears get treat
pros dog anything treat does not smell bad many treats easy break smaller pieces nothing artificial easy digestion cons more costly dog treats overall great product while expensive dog anything treat several phobias including getting car walking doorways ignores fears get treat


In [32]:
print(preprocessed_text[1000])
print("="*200)
print(preprocessed_summary[1000])

really looking forward pods based reviews starbucks good prefer bolder taste imagine surprise ordered boxes expired one expired back gosh sakes admit amazon agreed credit cost plus part shipping geez years expired hoping find local san diego area shoppe carries pods try something different starbucks
really looking forward pods based reviews starbucks good prefer bolder taste imagine surprise ordered boxes expired one expired back gosh sakes admit amazon agreed credit cost plus part shipping geez years expired hoping find local san diego area shoppe carries pods try something different starbucks


Comparing the two, we find that the text and the summary are identical at every index we checked. That is expected here, because both lists were built from the `Text` column. A summary is in any case a short form of the review text, so we featurise only the text: featurising both would mostly repeat the same values.
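
Comparing a handful of indices by eye can be backed up with a count over every review. A minimal sketch on invented lists (`texts_demo` and `summaries_demo` are hypothetical stand-ins for `preprocessed_text` and `preprocessed_summary`):

```python
def fraction_identical(a, b):
    """Share of positions at which two equally long lists hold the same string."""
    if len(a) != len(b):
        raise ValueError("lists must have the same length")
    if not a:
        return 0.0
    same = sum(1 for i in range(len(a)) if a[i] == b[i])
    return same / len(a)

# Invented examples, not taken from the dataset.
texts_demo = ["great treat", "stale pods", "dog loves"]
summaries_demo = ["great treat", "expired", "dog loves"]
share = fraction_identical(texts_demo, summaries_demo)
print(share)
```

A value of exactly 1.0 on the real lists would mean they are copies of each other, which is worth checking against the column each preprocessing loop reads.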

#Featurisation of text

##Bag of Words

###Stemming for BoW

In [33]:
#For BoW it is good to perform stemming too

stemmed_text=[]
for sentance in tq.tqdm(preprocessed_text):
  sentance=' '.join([sno.stem(word)  for word in sentance.split()])
  stemmed_text.append(sentance)


In [34]:
stemmed_text[40]

'this onli dog treat lhasa apso eat make happ becuas ad ingredi preserv well ad salt this onli dog treat vet approv'

In [35]:
stemmed_summary=[]
for sentance in tq.tqdm(preprocessed_summary):
  sentance=' '.join([sno.stem(word)  for word in sentance.split()])
  stemmed_summary.append(sentance)


In [36]:
stemmed_summary[80]

'dog trainer recommend obedi class found best price amazon puppi love treat great train treat use moder much organ meat not good anyon also excel put treat contain non liver treat cooki use contain shaker help train dog come treat get liver flavor dust coat make dog love even limit liver dust treat two paw stuff'

###BoW

In [37]:
# Keep as features only those words that appear in at least 50 reviews (min_df=50)
# We set this threshold because a model cannot be trained well on features that are this sparse
# You can change min_df to suit your needs; setting it to 10 also works fine
# The number of features can also be capped with max_features
# Limiting the number of features helps when training the model

count_vect = CountVectorizer(min_df=50)
count_vect.fit(tq.tqdm(stemmed_text))
# get_feature_names() was removed in scikit-learn 1.2; newer versions use get_feature_names_out()
print("Few features of the vector are ",count_vect.get_feature_names()[:10])
print("="*200)
final_counts=count_vect.transform(tq.tqdm(stemmed_text))
print("The type of count vectorizer : ",type(final_counts))
print("Shape of count : ",final_counts.get_shape())
print("Total number of unique words which appear in at least 50 reviews : ",final_counts.get_shape()[1])

Few features of the vector are  ['aback', 'abandon', 'abdomin', 'abil', 'abl', 'about', 'abroad', 'absenc', 'absent', 'absolut']

The type of count vectorizer :  <class 'scipy.sparse.csr.csr_matrix'>
Shape of count :  (364171, 7182)
Total number of unique words which appear in at least 50 reviews :  7182
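
`min_df` thresholds on document frequency, i.e. the number of reviews that contain a term, not on the total number of occurrences. A plain-Python sketch of that rule on three invented documents (it mimics the idea and is not scikit-learn's implementation):

```python
from collections import Counter

# Invented toy corpus.
docs = ["good good good treat", "good price", "bad treat"]

# Count in how many documents each term appears;
# set() ignores repeats within a single document.
doc_freq = Counter()
for doc in docs:
    doc_freq.update(set(doc.split()))

min_df = 2
vocab = sorted(term for term, df in doc_freq.items() if df >= min_df)
print(vocab)
```

Here `good` occurs four times in total but has a document frequency of 2, which is the number `min_df` is compared against.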


### Bi-gram Tri-gram and n-Grams

In [38]:
# bi-gram, tri-gram and n-gram

# removing stop words like "not" should be avoided before building n-grams
# count_vect = CountVectorizer(ngram_range=(1,3))
# please do read the CountVectorizer documentation http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html
# you can choose values such as min_df=10 and max_features=5000 as you see fit

count_vect = CountVectorizer(ngram_range=(1,3), max_features=5000)
final_trigram_count = count_vect.fit_transform(tq.tqdm(stemmed_text))
print("Few of the feature of the vectors are : ",count_vect.get_feature_names()[:10])
print("Type of count vectorizer is ",type(final_trigram_count))
print("The shape of trigram is ",final_trigram_count.get_shape())
print("The total number of unique mono bi and tri gram is",final_trigram_count.get_shape()[1])


Few of the feature of the vectors are :  ['abil', 'abl', 'abl buy', 'abl find', 'abl get', 'about', 'absolut', 'absolut best', 'absolut delici', 'absolut favorit']
Type of count vectorizer is  <class 'scipy.sparse.csr.csr_matrix'>
The shape of trigram is  (364171, 5000)
The total number of unique mono bi and tri gram is 5000


### TF-IDF BoW

In [39]:
count_vect=TfidfVectorizer(ngram_range=(1,3), max_features=5000)
tf_idf_bow_matrix=count_vect.fit_transform(tq.tqdm(stemmed_text))
print("Few of the feature of the vectors are : ",count_vect.get_feature_names()[:10])
print("Type of count vectorizer is ",type(tf_idf_bow_matrix))
print("The shape of trigram is ",tf_idf_bow_matrix.get_shape())
print("The total number of unique mono bi and tri gram is",tf_idf_bow_matrix.get_shape()[1])


Few of the feature of the vectors are :  ['abil', 'abl', 'abl buy', 'abl find', 'abl get', 'about', 'absolut', 'absolut best', 'absolut delici', 'absolut favorit']
Type of count vectorizer is  <class 'scipy.sparse.csr.csr_matrix'>
The shape of trigram is  (364171, 5000)
The total number of unique mono bi and tri gram is 5000
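
For reference, with its default settings (`smooth_idf=True`) `TfidfVectorizer` computes the inverse document frequency as idf(t) = ln((1 + n) / (1 + df(t))) + 1, where n is the number of documents and df(t) is the number of documents containing t, and then L2-normalises each row. A small sketch of the idf part only, on invented documents:

```python
import math

# Invented, already tokenised toy corpus.
docs = [["good", "treat"], ["good", "price"], ["bad", "treat"], ["good"]]
n_docs = len(docs)

def smoothed_idf(term):
    """Smoothed idf: ln((1 + n) / (1 + df)) + 1."""
    df = sum(1 for doc in docs if term in doc)   # document frequency
    return math.log((1 + n_docs) / (1 + df)) + 1

for term in ("good", "price"):
    print(term, round(smoothed_idf(term), 4))
```

The rarer term `price` receives the larger weight, which is the effect TF-IDF adds on top of plain counts.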


##Word2Vector

###W2V

In [40]:
# Text corpus for training my own W2V model
# It is advised not to use stemming for W2V, as some words lose their meaning
# You can use stemming while training your own model, but never do it with Google's model

list_of_words=[]
for sentance in tq.tqdm(preprocessed_text):
  list_of_words.append(sentance.split())


In [41]:
list_of_words[1]

['grew',
 'reading',
 'sendak',
 'books',
 'watching',
 'really',
 'rosie',
 'movie',
 'incorporates',
 'love',
 'son',
 'loves',
 'however',
 'miss',
 'hard',
 'cover',
 'version',
 'the',
 'paperbacks',
 'seem',
 'kind',
 'flimsy',
 'takes',
 'two',
 'hands',
 'keep',
 'pages',
 'open']

In [42]:
# Using Google News Word2Vectors

# in this project we are using a pretrained model by Google
# it's a 3.3G file; once you load this into your memory
# it occupies ~9Gb, so please do this step only if you have >12G of RAM
# we will provide a pickle file which contains a dict,
# and it contains all our corpus words as keys and model[word] as values
# To use this code snippet, download "GoogleNews-vectors-negative300.bin"
# from https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit
# it's 1.9GB in size.


# http://kavita-ganesan.com/gensim-word2vec-tutorial-starter-code/#.W17SRFAzZPY
# you can comment out this whole cell
# or change these variables according to your needs

import os  # needed for os.path.isfile below

is_your_ram_gt_16g=False
want_to_use_google_w2v = False
want_to_train_w2v = True

if want_to_train_w2v:
    # min_count = 5 considers only words that occurred at least 5 times
    # size = 50 is the dimension of the vectors we want (renamed vector_size in gensim 4.0)
    w2v_model=Word2Vec(tq.tqdm(list_of_words),min_count=5,size=50, workers=4)
    print(w2v_model.wv.most_similar('great'))
    print('='*50)
    print(w2v_model.wv.most_similar('worst'))
    
elif want_to_use_google_w2v and is_your_ram_gt_16g:
    if os.path.isfile('GoogleNews-vectors-negative300.bin'):
        w2v_model=KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)
        print(w2v_model.wv.most_similar('great'))
        print(w2v_model.wv.most_similar('worst'))
    else:
        print("you don't have Google's word2vec file, keep want_to_train_w2v = True, to train your own w2v ")


[('terrific', 0.8970778584480286), ('fantastic', 0.8785291910171509), ('awesome', 0.8780403733253479), ('good', 0.863150954246521), ('excellent', 0.8548327684402466), ('wonderful', 0.8215221166610718), ('perfect', 0.7789706587791443), ('fabulous', 0.7615782022476196), ('amazing', 0.7553120851516724), ('nice', 0.7517097592353821)]
[('nastiest', 0.8763128519058228), ('greatest', 0.7899782657623291), ('disgusting', 0.7329646348953247), ('best', 0.7305299639701843), ('horrible', 0.7141132354736328), ('vile', 0.6929420828819275), ('terrible', 0.6838528513908386), ('tastiest', 0.6776483654975891), ('awful', 0.6732717752456665), ('saltiest', 0.6710641980171204)]
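
The scores returned by `most_similar` are cosine similarities between word vectors. Words used in similar contexts end up close together, which may explain why superlatives such as `greatest` and `best` show up next to `worst`. A minimal sketch of the measure itself, on invented 3-dimensional vectors (the numbers are hypothetical and not taken from the trained model):

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equally long vectors."""
    dot = sum(u[i] * v[i] for i in range(len(u)))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    return dot / (norm_u * norm_v)

# Invented toy embeddings.
great = [1.0, 2.0, 0.0]
terrific = [2.0, 4.0, 0.0]   # same direction as 'great'
salty = [0.0, 0.0, 3.0]      # orthogonal to 'great'

print(cosine_similarity(great, terrific))
print(cosine_similarity(great, salty))
```

Only the direction of a vector matters for this measure, not its length.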


In [43]:
w2v_model.wv['love']

array([-5.88303626e-01, -3.77372682e-01,  9.28421915e-01,  2.17995977e+00,
       -1.05574772e-01,  1.07015431e+00, -7.09782913e-02,  2.52131534e+00,
       -4.05729651e-01,  3.10498214e+00, -1.54692483e+00,  5.97628176e-01,
       -2.15841913e+00,  5.48614919e-01, -2.63975680e-01, -2.44697309e+00,
       -1.41001165e+00,  2.11569643e+00, -1.34654403e+00,  6.44171417e-01,
       -6.41071737e-01,  1.08331494e-01,  1.37586415e+00,  1.81667554e+00,
        2.31952453e+00, -1.43946755e+00, -1.59849024e+00,  5.01860201e-01,
        2.06251577e-01,  2.18692923e+00,  2.90648532e+00, -4.83332872e-01,
       -2.08255792e+00,  1.61329627e-01,  4.37920284e+00,  6.02605700e-01,
       -1.46586418e+00,  8.13467443e-01, -9.02707279e-01, -1.62411857e+00,
        9.71889675e-01,  2.24373460e+00,  3.89243722e-01, -1.14296281e+00,
       -1.34345949e+00, -2.71557713e+00, -2.83025252e-03, -1.31463361e+00,
        1.73568296e+00, -3.12608421e-01], dtype=float32)

In [44]:
# in gensim 4.0 and later, wv.vocab is replaced by wv.key_to_index
w2v_words=list(w2v_model.wv.vocab)
print("Total number of words that occurred at least 5 times is : ",len(w2v_words))

Total number of words that occurred at least 5 times is :  33259


### Average W2V for each sentence

In [45]:
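average_w2v=[]
w2v_vocab=set(w2v_words)  # set membership tests are much faster than list lookups

for sentance in tq.tqdm(list_of_words):
  temp_vect=np.zeros(50)
  number_of_word=0
  for word in sentance:
    if word in w2v_vocab:
      vec=w2v_model.wv[word]
      temp_vect+=vec
      number_of_word+=1
  if number_of_word:
    temp_vect/=number_of_word
    # NOTE: the append is inside the `if`, so reviews with no in-vocabulary word are skipped
    # and average_w2v ends up shorter than list_of_words; dedent it to keep one vector per review
    average_w2v.append(temp_vect)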



In [46]:
len(preprocessed_text)

364171

In [47]:
len(list_of_words)

364171

In [48]:
len(average_w2v)

363205
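
The 966 missing rows (364171 - 363205) are reviews in which no word is in the Word2Vec vocabulary: the averaging cell appends a vector only when at least one word was found, so `average_w2v` no longer lines up row by row with the labels. A minimal sketch that keeps one vector per review, using a hypothetical toy embedding table in place of `w2v_model.wv`:

```python
# Invented 2-dimensional embeddings standing in for the trained model.
toy_vectors = {"good": [1.0, 3.0], "treat": [3.0, 1.0]}
dim = 2

def average_vector(tokens):
    """Mean of the known word vectors; all zeros if no token is known."""
    total = [0.0] * dim
    count = 0
    for tok in tokens:
        vec = toy_vectors.get(tok)
        if vec is None:
            continue
        for i in range(dim):
            total[i] += vec[i]
        count += 1
    if count:
        total = [x / count for x in total]
    return total

reviews = [["good", "treat"], ["zzz"], []]
averaged = [average_vector(r) for r in reviews]   # always one vector per review
print(len(averaged), averaged)
```

Returning a zero vector is one possible choice; dropping the same rows from the labels would be another way to keep the two aligned.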

###Tf-idf weighted W2V

In [49]:
model=TfidfVectorizer()
model.fit(tq.tqdm(preprocessed_text))

idf_values=list(model.idf_)
words=list(model.get_feature_names())


In [50]:
print(len(idf_values))
len(words)

116338


116338

In [51]:
# Storing idf of each word in a dictionary
dictionary={}
for i in range(len(words)):
  dictionary[words[i]]=idf_values[i]

In [None]:
tfidf_bow=[]
w2v_vocab=set(w2v_words)  # set lookup instead of scanning the lists for every word
for sentance in tq.tqdm(list_of_words):
  temp_vect=np.zeros(50)
  weight=0.0
  for word in sentance:
    if word in w2v_vocab and word in dictionary:
      vec=w2v_model.wv[word]

      # tf_idf = tf_idf_matrix[row, tfidf_feat.index(word)]
      # to reduce the computation we compute the weight directly:
      # dictionary[word] = idf value of word in the whole corpus
      # sentance.count(word) = tf value (raw count) of word in this review

      tf_idf=sentance.count(word)*dictionary[word]
      temp_vect+=(vec*tf_idf)
      weight+=tf_idf
  if weight:
    temp_vect /=weight
  tfidf_bow.append(temp_vect)
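
For each review, the cell above computes a weighted mean of the word vectors, with weight (count of the word in the review) x (idf of the word), divided by the sum of the weights. A toy version with invented vectors and idf values (all names and numbers below are hypothetical):

```python
# Invented 2-dimensional embeddings and idf values.
toy_vectors = {"good": [1.0, 0.0], "treat": [0.0, 1.0]}
toy_idf = {"good": 1.0, "treat": 3.0}

def tfidf_weighted_vector(tokens):
    """Weighted mean of word vectors; all zeros if no token is known."""
    total = [0.0, 0.0]
    weight_sum = 0.0
    for tok in tokens:
        if tok in toy_vectors and tok in toy_idf:
            weight = tokens.count(tok) * toy_idf[tok]   # raw count times idf
            total = [total[i] + weight * toy_vectors[tok][i] for i in range(2)]
            weight_sum += weight
    if weight_sum:
        total = [x / weight_sum for x in total]
    return total

weighted = tfidf_weighted_vector(["good", "treat"])
print(weighted)
```

The rarer word `treat` (higher idf) pulls the result towards its own vector, whereas a plain average would weight both words equally.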