# Aim and Motivation 
[Nirant](https://www.kaggle.com/nirant)'s latest kernel on spaCy: [Hitchhiker's Guide to NLP in spaCy](https://www.kaggle.com/nirant/hitchhiker-s-guide-to-nlp-in-spacy) has made me realize that spaCy maybe as good or even better than NLTK for Natural Language Processing. My recent kernels deal with deep learning and I want to extend that by using text data for deep learning and intend to use spaCy for processing and modelling this data. 

In [1]:
# Usual imports
import numpy as np
import pandas as pd
from tqdm import tqdm
import string
import matplotlib.pyplot as plt
from sklearn.decomposition import NMF, LatentDirichletAllocation, TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.manifold import TSNE
import concurrent.futures
import time
import pyLDAvis.sklearn
from pylab import bone, pcolor, colorbar, plot, show, rcParams, savefig
import warnings
warnings.filterwarnings('ignore')

%matplotlib inline
import os
print(os.listdir("../data"))

# Plotly based imports for visualization
from plotly import tools
import chart_studio.plotly as py
from plotly.offline import init_notebook_mode, iplot
init_notebook_mode(connected=True)
import plotly.graph_objs as go
import plotly.figure_factory as ff

# spaCy based imports
import spacy
from spacy.lang.en.stop_words import STOP_WORDS
from spacy.lang.en import English
#!python -m spacy download en_core_web_lg

['Bag_Reviews.xlsx', 'productReviewShopee_1.csv']


In [2]:
import bz2
import re

In [3]:
reviews_ms = pd.read_excel('../data/Bag_Reviews.xlsx')
reviews_ms = reviews_ms[['rating','comments']]
reviews_ms.head()

Unnamed: 0,rating,comments
0,4,Give 4 stars because order at the price 37 but...
1,5,Ordered at a discount of 10 baht per piece. It...
2,5,"Small, cute, compact, good But the sash looks ..."
3,1,The size is not as large as it is down. The st...
4,1,The product is compared to the price. Okay. Se...


In [4]:
#reviews_ms=pd.DataFrame({'rating':test_labels,'comments':test_sentences})
reviews=reviews_ms
reviews.head()

Unnamed: 0,rating,comments
0,4,Give 4 stars because order at the price 37 but...
1,5,Ordered at a discount of 10 baht per piece. It...
2,5,"Small, cute, compact, good But the sash looks ..."
3,1,The size is not as large as it is down. The st...
4,1,The product is compared to the price. Okay. Se...


In [5]:
reviews.shape

(3182, 2)

### Remove Duplicates 

In [6]:
reviews.drop_duplicates().shape

(2677, 2)

In [7]:
reviews=reviews.drop_duplicates()
reviews.head()

Unnamed: 0,rating,comments
0,4,Give 4 stars because order at the price 37 but...
1,5,Ordered at a discount of 10 baht per piece. It...
2,5,"Small, cute, compact, good But the sash looks ..."
3,1,The size is not as large as it is down. The st...
4,1,The product is compared to the price. Okay. Se...


### Remove very small text

In [8]:
reviews.comments=reviews.comments.astype(str)
reviews['len_review']=reviews.comments.apply(len)

In [9]:
reviews.len_review.describe()

count    2677.000000
mean      163.776242
std       131.114011
min         2.000000
25%        77.000000
50%       140.000000
75%       202.000000
max      1236.000000
Name: len_review, dtype: float64

In [10]:
#reviews.comments.loc[reviews.len_review>=300]

In [11]:
reviews.comments=reviews.comments.apply(lambda x: x.replace('👍','good '))

In [12]:
s_limit=200
max_limit=1300
reviews=reviews.loc[reviews.len_review<=s_limit,:]
reviews.head()

Unnamed: 0,rating,comments,len_review
0,4,Give 4 stars because order at the price 37 but...,66
1,5,Ordered at a discount of 10 baht per piece. It...,167
2,5,"Small, cute, compact, good But the sash looks ...",150
3,1,The size is not as large as it is down. The st...,135
5,4,Beautiful work Sewing Good compact Suitable fo...,111


In [13]:
reviews.shape, reviews_ms.shape

((2000, 3), (3182, 2))

In [14]:
reviews.len_review.describe()

count    2000.000000
mean      105.794500
std        55.747216
min         2.000000
25%        52.000000
50%       118.000000
75%       150.250000
max       200.000000
Name: len_review, dtype: float64

In [15]:
int(reviews.shape[0]*.1)

200

In [16]:
#tqdm.pandas()
reviews_ms['len_review']=reviews_ms.comments.apply(len)
reviews_others=reviews_ms.loc[((reviews_ms.len_review>s_limit) &(reviews_ms.len_review<max_limit)) ,:]

In [17]:
reviews_others.shape

(749, 3)

In [18]:
reviews_outliers=reviews_others.sample(int(reviews.shape[0]*.1))
reviews_outliers.head()

Unnamed: 0,rating,comments,len_review
2089,5,"Beautiful, good price, but the line is not str...",379
2177,5,Good cheap products Value for money Fast deliv...,246
1456,4,Fast delivery shop Receive the product in acco...,218
2324,5,Very good. Very good. Reed. Press. Reed. Press...,263
1686,2,"Buy the product at the price of 8 baht, but pr...",229


In [19]:
reviews_outliers_remain = reviews_others[~reviews_others.index.isin(reviews_outliers.index)]
reviews_outliers_remain.shape

(549, 3)

In [20]:
reviews_outliers.shape

(200, 3)

In [21]:
reviews=reviews.append(reviews_outliers)
reviews.reset_index(inplace=True)
reviews.len_review.describe()

count    2200.000000
mean      128.273182
std       101.729377
min         2.000000
25%        59.000000
50%       125.000000
75%       161.000000
max      1236.000000
Name: len_review, dtype: float64

In [22]:
# Creating a spaCy object
nlp = spacy.load('en_core_web_lg')

spaCy also comes with a built-in named entity visualizer that lets you check your model's predictions in your browser. You can pass in one or more <code>Doc</code> objects and start a web server, export HTML files or view the visualization directly from a Jupyter Notebook.

### Named Entity Recognition
 Named Entity Recognition is an information extraction task where named entities in unstructured sentences are located and classified  in some pre-defined categories such as the person names, organizations, locations, medical codes, time expressions, quantities, monetary values, percentages, etc.

In [23]:
doc = nlp(reviews["comments"][0])
spacy.displacy.render(doc, style='ent',jupyter=True)

In [24]:
punctuations = string.punctuation
stopwords = list(STOP_WORDS)

In [25]:
punctuations

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

### Lemmatization
It is the  process of grouping together the inflected forms of a word so they can be analysed as a single item, identified by the word's lemma, or dictionary form. Words like "ran" and "running" are converted to "run" to avoid having words with similar meanings in our data.

In [26]:
review = str(" ".join([i.lemma_ for i in doc]))

In [27]:
doc = nlp(review)
spacy.displacy.render(doc, style='ent',jupyter=True)

The sentence looks much different now that it is lemmatized.

### Parts of Speech tagging

This is the process of marking up a word in a text (corpus) as corresponding to a particular part of speech,[1] based on both its definition and its context—i.e., its relationship with adjacent and related words in a phrase, sentence, or paragraph. A simplified form of this is commonly taught to school-age children, in the identification of words as nouns, verbs, adjectives, adverbs, etc.

In [28]:
# POS tagging
# for i in nlp(review):
#     print(i,"=>",i.pos_)

In [29]:
# Parser for reviews
parser = English()
def spacy_tokenizer(sentence):
    mytokens = parser(sentence)
    mytokens = [ word.lemma_.lower().strip() if word.lemma_ != "-PRON-" else word.lower_ for word in mytokens ]
    mytokens = [ word for word in mytokens if word not in stopwords and word not in punctuations ]
    mytokens = " ".join([i for i in mytokens])
    return mytokens

In [30]:
reviews.reset_index(inplace=True)

In [31]:
tqdm.pandas()
reviews["processed_description"] = reviews["comments"].progress_apply(spacy_tokenizer)

100%|██████████| 2200/2200 [00:00<00:00, 5396.24it/s]


In [32]:
reviews

Unnamed: 0,level_0,index,rating,comments,len_review,processed_description
0,0,0,4,Give 4 stars because order at the price 37 but...,66,4 star order price 37 today 29 baht sorry
1,1,1,5,Ordered at a discount of 10 baht per piece. It...,167,ordered discount 10 baht piece worth note leav...
2,2,2,5,"Small, cute, compact, good But the sash looks ...",150,small cute compact good sash look like little ...
3,3,3,1,The size is not as large as it is down. The st...,135,size large stitch wrong bag line contemplate l...
4,4,5,4,Beautiful work Sewing Good compact Suitable fo...,111,beautiful work sewing good compact suitable pr...
5,5,6,4,Worth the price ordered When holding the bag S...,116,worth price order hold bag send fast think wra...
6,6,8,4,"Very nice, but a little crumbling.",34,nice little crumble
7,7,14,5,"The product size is the same as the palm, but ...",136,product size palm purchase price 10 baht try u...
8,8,15,5,ðŸ˜šðŸŽ‰ â™¥ ï¸ Cute value bag Ordered brown ...,192,ðÿ˜šðÿž‰ â ™ ¥ ï¸ cute value bag ordered brow...
9,9,16,4,The product is not fully contactable. Not info...,178,product fully contactable inform delivery slow...


# What is topic-modelling?
In machine learning and natural language processing, a topic model is a type of statistical model for discovering the abstract "topics" that occur in a collection of documents. Topic modeling is a frequently used text-mining tool for discovery of hidden semantic structures in a text body. Intuitively, given that a document is about a particular topic, one would expect particular words to appear in the document more or less frequently: "dog" and "bone" will appear more often in documents about dogs, "cat" and "meow" will appear in documents about cats, and "the" and "is" will appear equally in both. A document typically concerns multiple topics in different proportions; thus, in a document that is 10% about cats and 90% about dogs, there would probably be about 9 times more dog words than cat words. 

The "topics" produced by topic modeling techniques are clusters of similar words. A topic model captures this intuition in a mathematical framework, which allows examining a set of documents and discovering, based on the statistics of the words in each, what the topics might be and what each document's balance of topics is. It involves various techniques of dimensionality reduction(mostly non-linear) and unsupervised learning like LDA, SVD, autoencoders etc.

Source: [Wikipedia](https://en.wikipedia.org/wiki/Topic_model)

In [33]:
# Creating a vectorizer
vectorizer = CountVectorizer(min_df=0.005, max_df=0.85, stop_words='english', lowercase=True, token_pattern='[a-zA-Z\-][a-zA-Z\-]{2,}')
data_vectorized = vectorizer.fit_transform(reviews["processed_description"])

In [34]:
data_vectorized.data

array([1, 1, 1, ..., 1, 2, 2], dtype=int64)

In [35]:
NUM_TOPICS = 5

In [36]:
# Latent Dirichlet Allocation Model
#lda = LatentDirichletAllocation(n_components=NUM_TOPICS, max_iter=50, learning_method='online',verbose=True)
lda = LatentDirichletAllocation(n_components=NUM_TOPICS, max_iter=50, learning_method='batch',verbose=True)
data_lda = lda.fit_transform(data_vectorized)

iteration: 1 of max_iter: 50
iteration: 2 of max_iter: 50
iteration: 3 of max_iter: 50
iteration: 4 of max_iter: 50
iteration: 5 of max_iter: 50
iteration: 6 of max_iter: 50
iteration: 7 of max_iter: 50
iteration: 8 of max_iter: 50
iteration: 9 of max_iter: 50
iteration: 10 of max_iter: 50
iteration: 11 of max_iter: 50
iteration: 12 of max_iter: 50
iteration: 13 of max_iter: 50
iteration: 14 of max_iter: 50
iteration: 15 of max_iter: 50
iteration: 16 of max_iter: 50
iteration: 17 of max_iter: 50
iteration: 18 of max_iter: 50
iteration: 19 of max_iter: 50
iteration: 20 of max_iter: 50
iteration: 21 of max_iter: 50
iteration: 22 of max_iter: 50
iteration: 23 of max_iter: 50
iteration: 24 of max_iter: 50
iteration: 25 of max_iter: 50
iteration: 26 of max_iter: 50
iteration: 27 of max_iter: 50
iteration: 28 of max_iter: 50
iteration: 29 of max_iter: 50
iteration: 30 of max_iter: 50
iteration: 31 of max_iter: 50
iteration: 32 of max_iter: 50
iteration: 33 of max_iter: 50
iteration: 34 of ma

In [37]:
# Non-Negative Matrix Factorization Model
nmf = NMF(n_components=NUM_TOPICS)
data_nmf = nmf.fit_transform(data_vectorized) 

In [38]:
# Latent Semantic Indexing Model using Truncated SVD
lsi = TruncatedSVD(n_components=NUM_TOPICS)
data_lsi = lsi.fit_transform(data_vectorized)

In [39]:
# Functions for printing keywords for each topic
def selected_topics(model, vectorizer, top_n=10):
    for idx, topic in enumerate(model.components_):
        print("Topic %d:" % (idx))
        print([(vectorizer.get_feature_names()[i], topic[i])
                        for i in topic.argsort()[:-top_n - 1:-1]]) 

In [40]:
# Keywords for topics clustered by Latent Dirichlet Allocation
print("LDA Model:")
selected_topics(lda, vectorizer)

LDA Model:
Topic 0:
[('product', 733.1329747733943), ('delivery', 169.44346792622804), ('shop', 140.825805887628), ('received', 140.19422433321145), ('order', 135.26850100926237), ('fast', 133.60531314559685), ('time', 91.78073179106534), ('deliver', 80.19105571233335), ('receive', 79.50657231783944), ('pack', 76.72638950202584)]
Topic 1:
[('price', 1050.1868804197202), ('worth', 223.1886068707072), ('baht', 206.19233104815515), ('bag', 187.6363232510208), ('cheap', 179.38290721841489), ('buy', 159.2332024267427), ('okay', 133.43719042687022), ('quality', 121.52837881489891), ('suitable', 117.32043721680907), ('reasonable', 95.19161411722038)]
Topic 2:
[('reed', 368.1986704206008), ('like', 350.1875446416171), ('beautiful', 332.2182524554942), ('small', 241.19275686611337), ('cute', 185.8162561306143), ('lot', 119.19140431433183), ('bag', 101.60018140079507), ('little', 89.51188016625241), ('leave', 68.19159937093214), ('line', 60.56125347582972)]
Topic 3:
[('color', 304.1900753392418)

In [41]:
# Keywords for topics clustered by Non-Negative Matrix Factorization
print("NMF Model:")
selected_topics(nmf, vectorizer)

NMF Model:
Topic 0:
[('good', 16.050791801858498), ('value', 0.40234590422026906), ('service', 0.18352799175744322), ('quality', 0.11884780750675987), ('delivery', 0.06112558703945681), ('speed', 0.05422096591025064), ('shop', 0.05285159385445546), ('company', 0.047060149234615406), ('transportation', 0.04386983535077017), ('ship', 0.041430196506091914)]
Topic 1:
[('product', 7.852213315887415), ('quality', 4.090446691935891), ('delivery', 2.7908776484245017), ('fast', 1.8974094110845068), ('beautiful', 1.4695729689781996), ('value', 1.3608968894253328), ('service', 1.2482752532659656), ('shop', 0.7094150682158288), ('order', 0.6270495177801724), ('cheap', 0.5186828513823669)]
Topic 2:
[('like', 6.106692606928153), ('lot', 5.004684078620655), ('receive', 0.2018299371766206), ('pretty', 0.10272477835086065), ('beautiful', 0.09875462700815703), ('seller', 0.09441828311284291), ('send', 0.08977383497798772), ('cute', 0.08333885709249023), ('order', 0.08125572892419562), ('wrong', 0.080210

In [42]:
# Keywords for topics clustered by Latent Semantic Indexing
print("LSI Model:")
selected_topics(lsi, vectorizer)

LSI Model:
Topic 0:
[('good', 0.9918893144241032), ('product', 0.0723197747919944), ('price', 0.054176539455994925), ('quality', 0.05068494357500812), ('value', 0.04052114524318918), ('delivery', 0.03515924187290823), ('service', 0.024553986887720855), ('fast', 0.02381396985847893), ('beautiful', 0.018022187960142027), ('shop', 0.011177684118339236)]
Topic 1:
[('good', 0.12139552733550249), ('able', -0.0006891254789252166), ('lack', -0.0007113397860697986), ('sent', -0.0009188991646991687), ('slowly', -0.0010899784989560266), ('neat', -0.001227000120729912), ('understand', -0.0012889309688799856), ('wallet', -0.0013292967130048706), ('yes', -0.0013749494426810345), ('match', -0.0014584869599756519)]
Topic 2:
[('like', 0.7491377359625666), ('lot', 0.6231295757293588), ('reed', 0.11671738179246642), ('receive', 0.021176144178573088), ('good', 0.014991515366150449), ('seller', 0.011091507877528526), ('pretty', 0.01105256284115569), ('wrong', 0.009143758470470432), ('send', 0.0081907301178

In [92]:
# Transforming an individual sentence
text = spacy_tokenizer("I like it. Worth money")
text = reviews_outliers_remain.comments.iloc[0]
x = lda.transform(vectorizer.transform([text]))#[0]
print(x)

[[0.01549253 0.54410057 0.40949246 0.01552459 0.01538984]]


In [93]:
reviews_outliers_remain.comments.iloc[0]

'This bag is like a shoulder bag. Colorful shapes But regret it when shipping the stuffing in the bag The line is wrinkled and broken almost the entire line. So I reduced the beauty down a bit. Bought more expensive than the price seen later But overall I like it'

The index in the above list with the largest value represents the most dominant topic for the given review.



#### Finding the main topic of each reviews which has length >1300

In [94]:
topic_list = []
for i in tqdm(range(len(reviews_outliers_remain))):
    text = reviews_outliers_remain.comments.iloc[i]
    x = lda.transform(vectorizer.transform([text]))[0]
    y = pd.Series(x)
    y = y[y==max(y)].index[0]+1
    topic_list.append(y)
    
reviews_outliers_remain['topic'] = topic_list


100%|██████████| 549/549 [00:00<00:00, 1136.03it/s]


In [95]:
reviews_outliers_remain.head()

Unnamed: 0,rating,comments,len_review,topic
4,1,The product is compared to the price. Okay. Se...,302,4
7,5,This bag is like a shoulder bag. Colorful shap...,262,2
9,5,"Beautiful, cheap, very worthwhile. ðŸ‘ðŸ‘ðŸ‘...",531,2
10,5,"A cute seller, very worthwhile. Very worthwhil...",285,2
12,5,55555 Cute little leaves. Should be able to we...,235,2


In [96]:
reviews_outliers_remain.topic.value_counts()

5    190
1    175
2    104
3     46
4     34
Name: topic, dtype: int64


# Visualizing LDA results with pyLDAvis

In [97]:
pyLDAvis.enable_notebook()
dash = pyLDAvis.sklearn.prepare(lda, data_vectorized, vectorizer, mds='tsne')
dash

## How to interpret this graph?
1. Topics on the left while their respective keywords are on the right.
2. Larger topics are more frequent and closer the topics, mor the similarity
3. Selection of keywords is based on their frequency and discriminancy.

**Hover over the topics on the left to get information about their keywords on the right.**

# Visualizing LSI(SVD) scatterplot
We will be visualizing our data for 2  topics to see similarity between keywords which is measured by distance with the markers using LSI model

In [128]:
svd_2d = TruncatedSVD(n_components=2)
data_2d = svd_2d.fit_transform(data_vectorized)

In [130]:
trace = go.Scattergl(
    x = data_2d[:,0],
    y = data_2d[:,1],
    mode = 'markers',
    marker = dict(
        color = '#FFBAD2',
        line = dict(width = 1)
    ),
    text = vectorizer.get_feature_names(),
    hovertext = vectorizer.get_feature_names(),
    hoverinfo = 'text' 
)
data = [trace]
# iplot(data, filename='scatter-mode')

## The text version of scatter plot looks messy but you can zoom it for great results

In [131]:
trace = go.Scattergl(
    x = data_2d[:,0],
    y = data_2d[:,1],
    mode = 'text',
    marker = dict(
        color = '#FFBAD2',
        line = dict(width = 1)
    ),
    text = vectorizer.get_feature_names()
)
data = [trace]
# iplot(data, filename='text-scatter-mode')

Let's see what happens when we use a spaCy based bigram tokenizer for topic modelling

In [132]:
def spacy_bigram_tokenizer(phrase):
    doc = parser(phrase) # create spacy object
    token_not_noun = []
    notnoun_noun_list = []
    noun = ""

    for item in doc:
        if item.pos_ != "NOUN": # separate nouns and not nouns
            token_not_noun.append(item.text)
        if item.pos_ == "NOUN":
            noun = item.text
        
        for notnoun in token_not_noun:
            notnoun_noun_list.append(notnoun + " " + noun)

    return " ".join([i for i in notnoun_noun_list])

In [133]:
bivectorizer = CountVectorizer(min_df=5, max_df=0.9, stop_words='english', lowercase=True, ngram_range=(1,2))
bigram_vectorized = bivectorizer.fit_transform(wines["processed_description"])

## LDA for bigram data

In [134]:
bi_lda = LatentDirichletAllocation(n_components=NUM_TOPICS, max_iter=10, learning_method='online',verbose=True)
data_bi_lda = bi_lda.fit_transform(bigram_vectorized)

iteration: 1 of max_iter: 10
iteration: 2 of max_iter: 10
iteration: 3 of max_iter: 10
iteration: 4 of max_iter: 10
iteration: 5 of max_iter: 10
iteration: 6 of max_iter: 10
iteration: 7 of max_iter: 10
iteration: 8 of max_iter: 10
iteration: 9 of max_iter: 10
iteration: 10 of max_iter: 10


### Topics for bigram model

In [135]:
print("Bi-LDA Model:")
selected_topics(bi_lda, bivectorizer)

Bi-LDA Model:
Topic 0:
[('fabric', 1279.3906417767519), ('good', 979.3483740289875), ('soft', 518.9963092578602), ('comfortable', 358.90695914186847), ('good fabric', 348.531936552534), ('fabric good', 255.54350101548306), ('sensitive', 239.21618709516886), ('soft fabric', 213.62763630610462), ('round', 155.32346231647145), ('order', 153.39090981912284)]
Topic 1:
[('like', 842.15461715345), ('price', 567.5175399729417), ('bag', 559.511920455546), ('small', 419.1657622893145), ('cute', 401.8435697694018), ('suitable', 383.4554223340235), ('little', 378.43662560745526), ('okay', 350.1816583952013), ('lot', 314.40994201363714), ('beautiful', 233.79428976264757)]
Topic 2:
[('order', 574.5230027549043), ('product', 525.9615007324182), ('color', 478.51683648329646), ('receive', 325.43108927413886), ('picture', 240.37829411130096), ('poor', 211.76910930890793), ('like', 149.83761605638324), ('receive product', 139.58124809504048), ('pink', 132.73228264930262), ('doe', 113.10456911668574)]
Top

In [136]:
bi_dash = pyLDAvis.sklearn.prepare(bi_lda, bigram_vectorized, bivectorizer, mds='tsne')
bi_dash

**Very few keywords with 2 words have been found like "spin dry" , "black cherry", etc.**

Kindly upvote and comment if you like this.