## Text Classification on Amazon Fine Food Dataset with Google Word2Vec Word Embeddings in Gensim and training using LSTM In Keras.

## [Please star/upvote if you like it.]

### IMPORTING THE MODULES

In [2]:
# Ignore the warnings
import warnings
warnings.filterwarnings('ignore')

# data visualisation and manipulation
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib import style
import seaborn as sns
#configure
# sets matplotlib to inline and displays graphs below the corresponding cell.
%matplotlib inline
style.use('fivethirtyeight')
sns.set(style='whitegrid',color_codes=True)

#nltk
import nltk

#preprocessing
from nltk.corpus import stopwords  #stopwords
from nltk import word_tokenize,sent_tokenize # tokenizing
from nltk.stem import PorterStemmer,LancasterStemmer  # the Porter and Lancaster stemmers
from nltk.stem.snowball import SnowballStemmer
from nltk.stem import WordNetLemmatizer  # lemmatizer from WordNet

# for part-of-speech tagging
from nltk import pos_tag

# for named entity recognition (NER)
from nltk import ne_chunk

# vectorizers for creating the document-term-matrix (DTM)
from sklearn.feature_extraction.text import TfidfVectorizer,CountVectorizer

# BeautifulSoup library
from bs4 import BeautifulSoup 

import re # regex

#model_selection
from sklearn.model_selection import train_test_split,cross_validate
from sklearn.model_selection import KFold
from sklearn.model_selection import GridSearchCV

#evaluation
from sklearn.metrics import accuracy_score,roc_auc_score 
from sklearn.metrics import classification_report
from mlxtend.plotting import plot_confusion_matrix

#preprocessing scikit
# Imputer is left out: it is unused here and was removed from sklearn.preprocessing in scikit-learn 0.22
from sklearn.preprocessing import MinMaxScaler,StandardScaler,LabelEncoder

#classification.
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC,SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier,GradientBoostingClassifier,AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB,MultinomialNB
 
#stop-words
stop_words=set(nltk.corpus.stopwords.words('english'))

#keras
import keras
from keras.preprocessing.text import one_hot,Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Dense , Flatten ,Embedding,Input,CuDNNLSTM,LSTM
from keras.models import Model
from keras.preprocessing.text import text_to_word_sequence

#gensim w2v
#word2vec
from gensim.models import Word2Vec

Using TensorFlow backend.


### LOADING THE DATASET

In [0]:
rev_frame=pd.read_csv(r'drive/Colab Notebooks/amazon food reviews/Reviews.csv')

In [0]:
df=rev_frame.copy()

In [7]:
df.head()

|   | Id | ProductId | UserId | ProfileName | HelpfulnessNumerator | HelpfulnessDenominator | Score | Time | Summary | Text |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | B001E4KFG0 | A3SGXH7AUHU8GW | delmartian | 1 | 1 | 5 | 1303862400 | Good Quality Dog Food | I have bought several of the Vitality canned d... |
| 1 | 2 | B00813GRG4 | A1D87F6ZCVE5NK | dll pa | 0 | 0 | 1 | 1346976000 | Not as Advertised | Product arrived labeled as Jumbo Salted Peanut... |
| 2 | 3 | B000LQOCH0 | ABXLMWJIXXAIN | Natalia Corres "Natalia Corres" | 1 | 1 | 4 | 1219017600 | "Delight" says it all | This is a confection that has been around a fe... |
| 3 | 4 | B000UA0QIQ | A395BORC6FGVXV | Karl | 3 | 3 | 2 | 1307923200 | Cough Medicine | If you are looking for the secret ingredient i... |
| 4 | 5 | B006K2ZZ7K | A1UQRSCLF8GW1T | Michael D. Bigham "M. Wassir" | 0 | 0 | 5 | 1350777600 | Great taffy | Great taffy at a great price. There was a wid... |


#### A brief description of the dataset from the Overview tab on Kaggle:

Data includes:
- Reviews from Oct 1999 - Oct 2012
- 568,454 reviews
- 256,059 users
- 74,258 products
- 260 users with > 50 reviews

### DATA CLEANING AND PRE-PROCESSING

#### Since I am concerned with sentiment analysis here, I shall keep only the 'Text' and 'Score' columns.

In [0]:
df=df[['Text','Score']]

In [0]:
df['review']=df['Text']
df['rating']=df['Score']
df.drop(['Text','Score'],axis=1,inplace=True)


In [13]:
print(df.shape)
df.head()

(568454, 2)


|   | review | rating |
|---|---|---|
| 0 | I have bought several of the Vitality canned d... | 5 |
| 1 | Product arrived labeled as Jumbo Salted Peanut... | 1 |
| 2 | This is a confection that has been around a fe... | 4 |
| 3 | If you are looking for the secret ingredient i... | 2 |
| 4 | Great taffy at a great price. There was a wid... | 5 |


#### Let us now see if any of the columns has null values.

In [15]:
# check for null values
print(df['rating'].isnull().sum())
df['review'].isnull().sum()  # no null values.

0


0

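#### Note that there is no point in keeping exact duplicates, i.e. rows with the same review text and the same rating. So I will keep only one instance of each and drop the rest. Rows that share a review text but differ in rating are not affected, because the subset below includes both columns.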

In [0]:
# remove duplicates/ for every duplicate we will keep only one row of that type. 
df.drop_duplicates(subset=['rating','review'],keep='first',inplace=True) 

In [17]:
# now check the shape. note that the shape is reduced, which shows that we did have duplicate rows.
print(df.shape)
df.head()

(393675, 2)


|   | review | rating |
|---|---|---|
| 0 | I have bought several of the Vitality canned d... | 5 |
| 1 | Product arrived labeled as Jumbo Salted Peanut... | 1 |
| 2 | This is a confection that has been around a fe... | 4 |
| 3 | If you are looking for the secret ingredient i... | 2 |
| 4 | Great taffy at a great price. There was a wid... | 5 |


#### Let us now print some reviews and see if we can get insights from the text.

In [18]:
# printing some reviews to see insights.
for review in df['review'][:5]:
    print(review+'\n'+'\n')

I have bought several of the Vitality canned dog food products and have found them all to be of good quality. The product looks more like a stew than a processed meat and it smells better. My Labrador is finicky and she appreciates this product better than  most.


Product arrived labeled as Jumbo Salted Peanuts...the peanuts were actually small sized unsalted. Not sure if this was an error or if the vendor intended to represent the product as "Jumbo".


This is a confection that has been around a few centuries.  It is a light, pillowy citrus gelatin with nuts - in this case Filberts. And it is cut into tiny squares and then liberally coated with powdered sugar.  And it is a tiny mouthful of heaven.  Not too chewy, and very flavorful.  I highly recommend this yummy treat.  If you are familiar with the story of C.S. Lewis' "The Lion, The Witch, and The Wardrobe" - this is the treat that seduces Edmund into selling out his Brother and Sisters to the Witch.


If you are looking for the se

#### There is nothing much that I can figure out except the fact that there are some stray words and some punctuation that we have to remove before moving ahead.

**But note that if I remove the punctuation now, it will be difficult to break the reviews into sentences, which the Word2Vec constructor in Gensim requires. So we will first break the text into sentences and then clean those sentences.**
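
To see why the order matters, here is a minimal sketch. It uses a naive regex splitter as a stand-in for the Punkt tokenizer (an assumption made only to keep the example dependency-free), and the helper names `strip_non_letters` and `naive_sentence_split` are hypothetical.

```python
import re

def strip_non_letters(text):
    # Same idea as the cleaning step used later: keep letters only.
    return re.sub("[^a-zA-Z]", " ", text)

def naive_sentence_split(text):
    # Stand-in for a real sentence tokenizer: split after ., ! or ?
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

review = "Shipping was fast. The peas were stale! Would not buy again."

# Split first, then clean each sentence: the three sentences survive.
split_then_clean = [strip_non_letters(s).split() for s in naive_sentence_split(review)]

# Clean first, then split: the sentence boundaries are already gone.
clean_then_split = [s.split() for s in naive_sentence_split(strip_non_letters(review))]

print(len(split_then_clean))  # 3
print(len(clean_then_split))  # 1
```

Once the punctuation is stripped, the whole review collapses into a single "sentence", so Word2Vec would see much longer context windows than intended.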

#### Note that since we are doing sentiment analysis, I will convert the values in the score column to sentiment. Sentiment is 0 for ratings of 3 or below and 1 for ratings above 3.

In [0]:
def mark_sentiment(rating):
  if(rating<=3):
    return 0
  else:
    return 1
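
As a quick check of the threshold, the sketch below applies the same rule to every possible rating. It is written as a compact equivalent of the function above. Note that neutral 3-star reviews land in the negative class, which is a modelling choice and not a requirement.

```python
def mark_sentiment(rating):
    # Ratings of 3 or below are treated as negative (0), 4 and 5 as positive (1).
    return 0 if rating <= 3 else 1

labels = {rating: mark_sentiment(rating) for rating in range(1, 6)}
print(labels)  # {1: 0, 2: 0, 3: 0, 4: 1, 5: 1}
```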

In [0]:
df['sentiment']=df['rating'].apply(mark_sentiment)

In [0]:
df.drop(['rating'],axis=1,inplace=True)

In [25]:
df.head()

|   | review | sentiment |
|---|---|---|
| 0 | I have bought several of the Vitality canned d... | 1 |
| 1 | Product arrived labeled as Jumbo Salted Peanut... | 0 |
| 2 | This is a confection that has been around a fe... | 1 |
| 3 | If you are looking for the secret ingredient i... | 0 |
| 4 | Great taffy at a great price. There was a wid... | 1 |


In [26]:
df['sentiment'].value_counts()

1    306819
0     86856
Name: sentiment, dtype: int64

As you can see, the sentiment column now holds the sentiment of the corresponding product review.

#### Pre-processing steps :

1) First **remove punctuation and HTML tags**, if any. Note that HTML tags may be present, as the data must have been scraped from the web.

2) **Tokenize** the reviews into tokens or words.

3) Next **remove the stop words**, as they add noise.

4) **Stem or lemmatize** the words, depending on what does better. Here I have used the lemmatizer.

In [0]:
# function to clean and pre-process the text.
def clean_reviews(review):  
    
    # 1. Removing html tags (the parser is named explicitly to avoid bs4's missing-parser warning)
    review_text = BeautifulSoup(review, "html.parser").get_text()
    
    # 2. Retaining only alphabets.
    review_text = re.sub("[^a-zA-Z]"," ",review_text)
    
    # 3. Converting to lower case and splitting
    word_tokens= review_text.lower().split()
    
    # 4. Remove stopwords and lemmatize the remaining words
    le=WordNetLemmatizer()
    stop_words= set(stopwords.words("english"))     
    word_tokens= [le.lemmatize(w) for w in word_tokens if not w in stop_words]
    
    cleaned_review=" ".join(word_tokens)
    return cleaned_review

#### Note that pre-processing all the reviews takes far too much time, so I will take only 100K reviews. To balance the classes, I have taken an equal number of instances of each sentiment.

In [0]:
pos_df=df.loc[df.sentiment==1,:][:50000]
neg_df=df.loc[df.sentiment==0,:][:50000]
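
The slices above take the first 50,000 rows of each class in file order, and rows that sit next to each other in the file may come from the same product. A possible variant, shown here only as a sketch on toy data, draws the rows at random instead. The helper `balanced_sample` and the toy rows are hypothetical and not part of the notebook.

```python
import random

def balanced_sample(rows, label_of, per_class, seed=42):
    # Group rows by label, then draw the same number at random from each group.
    rng = random.Random(seed)
    by_label = {}
    for row in rows:
        by_label.setdefault(label_of(row), []).append(row)
    sample = []
    for label in sorted(by_label):
        sample.extend(rng.sample(by_label[label], per_class))
    rng.shuffle(sample)
    return sample

# Toy data: (review, sentiment) pairs with an 8:3 class imbalance.
rows = [("good %d" % i, 1) for i in range(8)] + [("bad %d" % i, 0) for i in range(3)]
subset = balanced_sample(rows, label_of=lambda r: r[1], per_class=3)

n_pos = sum(1 for _, s in subset if s == 1)
n_neg = sum(1 for _, s in subset if s == 0)
print(n_pos, n_neg)  # 3 3
```

With a DataFrame, `DataFrame.sample(n=50000, random_state=42)` on each class subset would play the same role.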

In [29]:
pos_df.head()

|   | review | sentiment |
|---|---|---|
| 0 | I have bought several of the Vitality canned d... | 1 |
| 2 | This is a confection that has been around a fe... | 1 |
| 4 | Great taffy at a great price. There was a wid... | 1 |
| 5 | I got a wild hair for taffy and ordered this f... | 1 |
| 6 | This saltwater taffy had great flavors and was... | 1 |


In [30]:
neg_df.head()

|   | review | sentiment |
|---|---|---|
| 1 | Product arrived labeled as Jumbo Salted Peanut... | 0 |
| 3 | If you are looking for the secret ingredient i... | 0 |
| 12 | My cats have been happily eating Felidae Plati... | 0 |
| 16 | I love eating them and they are good for watch... | 0 |
| 26 | The candy is just red , No flavor . Just plan... | 0 |


#### We can now combine the reviews of each sentiment and shuffle them so that they are in no particular order.

In [0]:
#combining
df=pd.concat([pos_df,neg_df],ignore_index=True)

In [32]:
print(df.shape)
df.head()

(100000, 2)


|   | review | sentiment |
|---|---|---|
| 0 | I have bought several of the Vitality canned d... | 1 |
| 1 | This is a confection that has been around a fe... | 1 |
| 2 | Great taffy at a great price. There was a wid... | 1 |
| 3 | I got a wild hair for taffy and ordered this f... | 1 |
| 4 | This saltwater taffy had great flavors and was... | 1 |


In [33]:
# shuffling rows
df = df.sample(frac=1).reset_index(drop=True)
print(df.shape)  # perfectly fine.
df.head()


(100000, 2)


|   | review | sentiment |
|---|---|---|
| 0 | I love wasabi peas and have ordered different ... | 1 |
| 1 | This tea supposed to be really good, but... It... | 0 |
| 2 | Slate Magazine has a review on table salts and... | 1 |
| 3 | This product Tastes nothing like Apple Cider, ... | 0 |
| 4 | "SweetLeaf Liquid Stevia Flavors Tin" indicat... | 0 |


### CREATING GOOGLE WORD2VEC WORD EMBEDDINGS IN GENSIM

In this section I have created the word embeddings in Gensim. Note that I planned to use pre-trained word embeddings, such as the Google Word2Vec vectors trained on the Google News corpus or the well-known Stanford GloVe embeddings. But as soon as I load those embeddings through Gensim, the runtime dies and the kernel crashes, perhaps because the model contains 3 million (30 lakh) words and exceeds the RAM available on Google Colab.

Because of this, for now I have created the embeddings by training on my own corpus.

In [1]:
# import gensim
# # load Google's pre-trained Word2Vec model.
# pre_w2v_model = gensim.models.KeyedVectors.load_word2vec_format(r'drive/Colab Notebooks/amazon food reviews/GoogleNews-vectors-negative300.bin', binary=True) 


#### First we need to break our data into sentences, which is required by the constructor of the Word2Vec class in Gensim. For this I have used the Punkt English tokenizer from NLTK.

In [38]:
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
sentences=[]
sent_count=0  # renamed from `sum` so that the built-in sum() is not shadowed
for review in df['review']:
  sents=tokenizer.tokenize(review.strip())
  sent_count+=len(sents)
  for sent in sents:
    cleaned_sent=clean_reviews(sent)
    sentences.append(cleaned_sent.split()) # can use word_tokenize also.
print(sent_count)
print(len(sentences))  # total no of sentences

512639
512639


#### Now let us print some sentences just to check if they are in the correct format.

In [39]:
# trying to print few sentences
for te in sentences[:5]:
  print(te,"\n")

['love', 'wasabi', 'pea', 'ordered', 'different', 'brand', 'bought', 'every', 'type', 'imaginable', 'grocery', 'store', 'shelf'] 

['pea', 'jr', 'mushroom', 'specialty', 'best', 'date', 'like', 'wasabi', 'pea', 'little', 'kick', 'like', 'way', 'pea', 'get'] 

['shipping', 'fast', 'well'] 

['even', 'ordered', 'afghanistan', 'delivered', 'quickly', 'perfect', 'condition', 'far', 'price', 'go', 'beat'] 

['usually', 'order', 'lb', 'pea', 'come', 'either', 'large', 'medium', 'sized', 'zip', 'lock', 'bag', 'take', 'want', 'seal', 'later', 'use'] 



#### Now actually creating the Word2Vec embeddings.

In [0]:
import gensim
w2v_model=gensim.models.Word2Vec(sentences=sentences,size=300,window=10,min_count=1)

#### Parameters:

**sentences:** The sentences we have obtained.

**size:** The number of dimensions of the vector used to represent each word (renamed `vector_size` in Gensim 4.x).

**window:** The number of words around any word used as its context.

**min_count:** The minimum number of times a word should appear for its embedding to be learnt.
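
To make `window` and `min_count` concrete, here is a small dependency-free sketch. It only illustrates what the two parameters control; Gensim's real training loop is more involved, and the helpers `filter_rare` and `context_pairs` are hypothetical names, not Gensim functions.

```python
from collections import Counter

def filter_rare(sentences, min_count):
    # Drop words seen fewer than min_count times across the whole corpus.
    counts = Counter(w for sent in sentences for w in sent)
    return [[w for w in sent if counts[w] >= min_count] for sent in sentences]

def context_pairs(sentence, window):
    # Pair each word with the neighbours at most `window` positions away.
    pairs = []
    for i, center in enumerate(sentence):
        lo, hi = max(0, i - window), min(len(sentence), i + window + 1)
        pairs.extend((center, sentence[j]) for j in range(lo, hi) if j != i)
    return pairs

toy = [["love", "wasabi", "pea"], ["wasabi", "pea", "kick"]]

print(filter_rare(toy, min_count=2))
# [['wasabi', 'pea'], ['wasabi', 'pea']]
print(context_pairs(toy[0], window=1))
# [('love', 'wasabi'), ('wasabi', 'love'), ('wasabi', 'pea'), ('pea', 'wasabi')]
```

With `min_count=1`, as used in this notebook, no word is filtered out, so every word in the corpus gets a vector.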


In [41]:
w2v_model.train(sentences,epochs=10,total_examples=len(sentences))

(38406306, 41193730)

#### Now can try some things with word2vec embeddings. Thanks to Gensim ;)

In [43]:
# embedding of a particular word.
w2v_model.wv.get_vector('like')

array([-1.03112519e+00, -1.17288835e-01,  2.63634294e-01,  4.15361196e-01,
        5.02656877e-01, -7.75653899e-01, -3.14052671e-01, -9.11366940e-01,
       -3.25315803e-01, -1.23334563e+00, -1.44130743e+00,  5.73770642e-01,
        9.83815432e-01,  9.11602616e-01, -7.81245589e-01,  3.03729355e-01,
       -1.50223106e-01,  2.89199412e-01,  4.93014812e-01,  1.23629682e-01,
       -7.04080403e-01, -1.60265613e+00, -1.18101704e+00,  4.15536165e-01,
        8.34703088e-01, -2.28984490e-01, -5.52173138e-01,  1.44407332e-01,
        8.45157862e-01, -7.58443534e-01, -3.69563490e-01,  7.48030722e-01,
       -4.08394367e-01,  1.00567269e+00,  6.78839386e-01, -5.30777097e-01,
        1.23378408e+00,  1.10454750e+00,  2.07447457e+00, -1.94725662e-01,
        1.53318092e-01, -1.00127973e-01, -4.44085449e-01,  3.26064564e-02,
        1.15453482e-01, -3.34932059e-01, -3.02316993e-01, -6.99055016e-01,
       -1.12457097e+00, -6.13700807e-01,  6.92114756e-02, -4.46245283e-01,
       -2.40285173e-01,  

In [44]:
# total number of extracted words.
vocab=w2v_model.wv.vocab
print("The total number of words are : ",len(vocab))

The total number of words are :  56379


In [45]:
# words most similar to a given word.
w2v_model.wv.most_similar('like')

[('reminded', 0.47334882616996765),
 ('weird', 0.4720185101032257),
 ('reminds', 0.4685436487197876),
 ('strange', 0.46590107679367065),
 ('alright', 0.44735682010650635),
 ('funny', 0.4438462257385254),
 ('wierd', 0.4427754580974579),
 ('funky', 0.4383835196495056),
 ('akin', 0.43709149956703186),
 ('gross', 0.4343647360801697)]

In [46]:
# similarity b/w two words
w2v_model.wv.similarity('good','like')

0.3698099
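
The value returned by `similarity` is the cosine similarity between the two word vectors. A minimal sketch of that computation on toy 2-d vectors (not the real embeddings) looks like this:

```python
import math

def cosine_similarity(u, v):
    # Dot product divided by the product of the two vector lengths.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

same = cosine_similarity([1.0, 0.0], [1.0, 0.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
diagonal = cosine_similarity([1.0, 1.0], [1.0, 0.0])

print(same)                # 1.0
print(orthogonal)          # 0.0
print(round(diagonal, 4))  # 0.7071
```

A score of about 0.37 for 'good' and 'like' therefore means the two vectors point in loosely related, but far from identical, directions.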

#### Now creating a dictionary with the words in the vocab and their embeddings. This will be used when creating the embedding matrix (for feeding to the Keras embedding layer).

In [48]:
print("The no of words :",len(vocab))
# print(vocab)

The no of words : 56379


In [0]:
# print(vocab)
vocab=list(vocab.keys())

In [50]:
word_vec_dict={}
for word in vocab:
  word_vec_dict[word]=w2v_model.wv.get_vector(word)
print("The no of key-value pairs : ",len(word_vec_dict)) # should come equal to vocab size
  

The no of key-value pairs :  56379


In [0]:
# # just check
# for word in vocab[:5]:
#   print(word_vec_dict[word])

### PREPARING THE DATA FOR KERAS EMBEDDING LAYER.

Now we have obtained the w2v embeddings. But there are a couple of steps required by Keras embedding layer before we can move on.

**Also note that since the w2v embeddings have now been created, we can preprocess our review column using the function that we saw above.**

In [0]:
# cleaning reviews.
df['clean_review']=df['review'].apply(clean_reviews)

#### We need to find the maximum length of any document (a review, in our case). We will pad all reviews to this same length, as required by the Keras embedding layer. Do check [this](https://www.kaggle.com/rajmehra03/a-detailed-explanation-of-keras-embedding-layer) kernel on Kaggle for a wonderful explanation of the Keras embedding layer.

In [54]:
# number of unique words = 56379.

# now since we will have to pad we need to find the maximum length of any document.

maxi=-1
for i,rev in enumerate(df['clean_review']):
  tokens=rev.split()
  if(len(tokens)>maxi):
    maxi=len(tokens)
print(maxi)

1564


#### Now we integer encode the words in the reviews using Keras tokenizer. 

**Note that there are two important variables: vocab_size, the total number of unique words plus one (index 0 is reserved for padding), and max_rev_len, the length of every document after padding. Both are required by the Keras embedding layer.**

In [0]:
tok = Tokenizer()
tok.fit_on_texts(df['clean_review'])
vocab_size = len(tok.word_index) + 1
encd_rev = tok.texts_to_sequences(df['clean_review'])

In [0]:
max_rev_len=1565  # padded length of a review (the longest cleaned review has 1564 tokens)
vocab_size = len(tok.word_index) + 1  # total no of words
embed_dim=300 # embedding dimension as choosen in word2vec constructor

In [58]:
# now padding to have a maximum length of 1565
pad_rev= pad_sequences(encd_rev, maxlen=max_rev_len, padding='post')
pad_rev.shape   # note that we had 100K reviews and we have padded each review to have a length of 1565 words.

(100000, 1565)
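
Padding every review to the longest one means most of this matrix is zeros. The sketch below mimics post-padding in plain Python and shows one optional variant, which the notebook does not use: capping the length at a percentile of the review lengths. The helpers `pad_post` and `length_at_percentile` are hypothetical names.

```python
def pad_post(seq, maxlen, value=0):
    # Keep the last `maxlen` tokens of long sequences (pad_sequences truncates
    # from the front by default), then pad short ones at the end.
    seq = seq[-maxlen:]
    return seq + [value] * (maxlen - len(seq))

def length_at_percentile(lengths, pct):
    # Smallest length that covers at least pct percent of the documents.
    ordered = sorted(lengths)
    rank = (pct * len(ordered) + 99) // 100  # integer ceiling of pct% of n
    return ordered[max(0, rank - 1)]

encoded = [[5, 2], [7, 1, 4, 9], [3], [8, 6, 2], list(range(1, 13))]
lengths = [len(s) for s in encoded]

maxlen = length_at_percentile(lengths, 80)
padded = [pad_post(s, maxlen) for s in encoded]

print(maxlen)  # 4
print(padded)
# [[5, 2, 0, 0], [7, 1, 4, 9], [3, 0, 0, 0], [8, 6, 2, 0], [9, 10, 11, 12]]
```

The trade-off is that the few very long reviews lose their opening tokens, in exchange for a much smaller input to the network.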

### CREATING THE EMBEDDING MATRIX

#### Now we need to pass the w2v word embeddings to the embedding layer in Keras. For this we will create the embedding matrix and pass it through the 'embeddings_initializer' parameter of the layer.

**The embedding matrix will be of dimensions (vocab_size,embed_dim), where the word_index of each word from the Keras tokenizer is its index into the matrix and the corresponding entry is its w2v vector ;)**

**Note that there may be words which are not present in the embeddings learnt by the w2v model. The embedding matrix entry corresponding to those words will be a vector of all zeros.**

**If you are wondering why a word would be missing: here we have trained on our own corpus, but with pre-trained embeddings some words specific to our dataset may be absent, and in that case a fixed vector of zeros can denote all the words that aren't present in the pre-trained embeddings. Words can also be missing if you have filtered them out by setting min_count in the w2v constructor.**

In [0]:
# now creating the embedding matrix
embed_matrix=np.zeros(shape=(vocab_size,embed_dim))
for word,i in tok.word_index.items():
  embed_vector=word_vec_dict.get(word)
  if embed_vector is not None:  # word is in the vocabulary learned by the w2v model
    embed_matrix[i]=embed_vector
  # if the word is not found then the row corresponding to that word will stay zero.
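
The same loop on a toy vocabulary makes the layout easier to see. The dictionaries below are hypothetical stand-ins for `tok.word_index` and `word_vec_dict`, and "zzyzx" plays the role of a word without a learnt vector.

```python
import numpy as np

embed_dim = 3
# Hypothetical stand-ins for tok.word_index and word_vec_dict.
word_index = {"taffy": 1, "great": 2, "zzyzx": 3}
word_vec_dict = {
    "taffy": np.array([0.1, 0.2, 0.3]),
    "great": np.array([0.4, 0.5, 0.6]),
}

vocab_size = len(word_index) + 1  # +1 because index 0 is reserved for padding
embed_matrix = np.zeros((vocab_size, embed_dim))
for word, i in word_index.items():
    vec = word_vec_dict.get(word)
    if vec is not None:
        embed_matrix[i] = vec  # row i holds the vector of the word with index i

print(embed_matrix)
```

Row 0 (padding) and row 3 (the unknown word) stay all zeros, while rows 1 and 2 hold the two known vectors.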

In [60]:
# checking.
print(embed_matrix[14])

[ 2.79096913e+00  1.75269902e+00  5.43922126e-01 -8.71004164e-01
 -2.79567409e+00  3.58174145e-01 -4.41622198e-01  8.37980881e-02
  1.45070171e+00  2.92627835e+00  5.62364399e-01  2.08853111e-02
 -1.60128057e+00  1.27360773e+00 -1.78457367e+00 -4.24255729e-01
 -2.03603911e+00 -7.85955548e-01  8.28200519e-01 -2.46066642e+00
 -3.02389354e-01 -1.79599479e-01  1.68112910e+00 -9.88541305e-01
  2.58514857e+00 -2.33111188e-01  9.99956131e-01 -4.64532673e-01
 -1.42131495e+00  1.13931441e+00  1.26426506e+00 -1.27443767e+00
  1.04536235e+00 -2.07876945e+00 -1.60261726e+00 -3.23246241e+00
  4.00768071e-02 -9.97284830e-01  1.30593526e+00  9.84316245e-02
  1.59767115e+00  6.23917580e-01 -9.50469911e-01  8.45327795e-01
  1.00067115e+00  1.78077495e+00 -9.47615266e-01 -2.22256136e+00
 -2.16277361e+00 -7.11105168e-01 -2.25981045e+00  1.10815513e+00
 -4.42510433e-02  6.48146868e-01 -6.03007615e-01  9.33634043e-01
  1.42349434e+00 -2.50034952e+00 -2.19512081e+00  1.03199989e-01
  2.86492556e-01 -2.32114

### PREPARING TRAIN AND VALIDATION SETS.

In [0]:
# prepare train and val sets first
Y=keras.utils.to_categorical(df['sentiment'])  # one hot target as required by NN.
x_train,x_test,y_train,y_test=train_test_split(pad_rev,Y,test_size=0.20,random_state=42)

### BUILDING A MODEL AND FINALLY PERFORMING TEXT CLASSIFICATION

Having done all the prerequisites, we finally move on to building the model in Keras.

**Note that I have commented out the LSTM layer, as including it causes the training loss to get stuck at a value of about 0.6932. I don't know why ;(**

**In case someone knows, please comment below.**
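
One observation that may help narrow this down: 0.6932 is almost exactly ln 2, which is the binary cross-entropy of a model that outputs 0.5 for every example. A loss stuck at that value therefore suggests the network is not learning anything from its input, not that it is learning slowly. A possible but unverified contributor is the padding: with `padding='post'` and 1565 time steps, a recurrent layer that returns only its last state has to process a long run of zero tokens after the real words of most reviews. The sketch below only verifies the arithmetic, on made-up labels.

```python
import math

def binary_cross_entropy(y_true, y_pred):
    # Mean of -[y*log(p) + (1-y)*log(1-p)] over the examples.
    losses = [-(y * math.log(p) + (1 - y) * math.log(1 - p))
              for y, p in zip(y_true, y_pred)]
    return sum(losses) / len(losses)

labels = [1, 0, 1, 1, 0]                # made-up labels
constant_guess = [0.5] * len(labels)    # a model that always answers 0.5

stuck_loss = binary_cross_entropy(labels, constant_guess)
print(round(stuck_loss, 4))   # 0.6931
print(round(math.log(2), 4))  # 0.6931
```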

In [0]:
from keras.initializers import Constant
from keras.layers import ReLU
from keras.layers import Dropout
model=Sequential()
model.add(Embedding(input_dim=vocab_size,output_dim=embed_dim,input_length=max_rev_len,embeddings_initializer=Constant(embed_matrix)))
# model.add(CuDNNLSTM(64,return_sequences=False)) # loss gets stuck at about 0.6932 (see the note above)
model.add(Flatten())
model.add(Dense(16,activation='relu'))
model.add(Dropout(0.50))
# model.add(Dense(16,activation='relu'))
# model.add(Dropout(0.20))
model.add(Dense(2,activation='sigmoid'))  # sigmoid for binary classification.
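
A note on the output layer: the two sigmoid units are independent of each other, so their outputs need not sum to 1 even though the targets are one-hot. The more conventional pairings are either `Dense(1, activation='sigmoid')` with 0/1 labels and `binary_crossentropy`, or `Dense(2, activation='softmax')` with one-hot labels and `categorical_crossentropy`. The sketch below, on made-up logits, only illustrates the numerical difference between the two activations.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def softmax(zs):
    m = max(zs)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0]  # made-up pre-activation values of the two output units
sig = [sigmoid(z) for z in logits]
soft = softmax(logits)

print([round(p, 3) for p in sig], round(sum(sig), 3))    # [0.881, 0.731] 1.612
print([round(p, 3) for p in soft], round(sum(soft), 3))  # [0.731, 0.269] 1.0
```

The setup used here still trains, as the results below show; the alternatives simply make the outputs readable as class probabilities.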

#### Let us now print a summary of the model.

In [83]:
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_5 (Embedding)      (None, 1565, 300)         16914000  
_________________________________________________________________
flatten_4 (Flatten)          (None, 469500)            0         
_________________________________________________________________
dense_7 (Dense)              (None, 16)                7512016   
_________________________________________________________________
dropout_4 (Dropout)          (None, 16)                0         
_________________________________________________________________
dense_8 (Dense)              (None, 2)                 34        
Total params: 24,426,050
Trainable params: 24,426,050
Non-trainable params: 0
_________________________________________________________________
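
The parameter counts in the summary can be reproduced by hand, which is a useful sanity check on the shapes. In the sketch below, `vocab_size = 56380` is inferred from the embedding layer's parameter count (56,379 words plus the padding index); the other values come from the cells above.

```python
vocab_size = 56380   # inferred: 16,914,000 embedding params / 300 dimensions
embed_dim = 300
max_rev_len = 1565
hidden_units = 16
n_classes = 2

embedding_params = vocab_size * embed_dim                   # one vector per word index
flatten_size = max_rev_len * embed_dim                      # Flatten has no parameters
dense_params = flatten_size * hidden_units + hidden_units   # weights + biases
output_params = hidden_units * n_classes + n_classes        # weights + biases

total = embedding_params + dense_params + output_params
print(embedding_params, flatten_size, dense_params, output_params, total)
# 16914000 469500 7512016 34 24426050
```

About 7.5 million of the parameters sit in the first dense layer, and that number scales directly with the padded review length.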


In [0]:
# compile the model
model.compile(optimizer=keras.optimizers.RMSprop(lr=1e-3),loss='binary_crossentropy',metrics=['accuracy'])

In [0]:
# specify batch size and epochs for training.
epochs=5
batch_size=64

In [86]:
# fitting the model.
model.fit(x_train,y_train,epochs=epochs,batch_size=batch_size,validation_data=(x_test,y_test))

Train on 80000 samples, validate on 20000 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x7f20d440dc88>

#### Note that the loss as well as the val_loss is still decreasing. You can train for more epochs, but I am not so patient ;)

**The final accuracy after 5 epochs is about 84% which is pretty decent.**

### FURTHER IDEAS : -

1) ProductId and UserId can be used to track the general ratings of a given product and also the review pattern of a particular user, e.g. whether he or she is strict in reviewing or not.

2) The helpfulness features may tell us something about the product. This is because the greater the number of people talking about a review, the stronger or more critical it is expected to be.

3) The Summary column can also give a hint.

4) One can also try pre-trained embeddings such as GloVe word vectors.

5) Lastly, tuning the network hyperparameters is always an option ;)

 

## THE END!!!

## [Please star/upvote if it was helpful.]