# Aim and Motivation 
[Nirant](https://www.kaggle.com/nirant)'s latest kernel on spaCy, [Hitchhiker's Guide to NLP in spaCy](https://www.kaggle.com/nirant/hitchhiker-s-guide-to-nlp-in-spacy), has made me realize that spaCy may be as good as, or even better than, NLTK for Natural Language Processing. My recent kernels deal with deep learning, and I want to extend that work to text data, using spaCy to process and model it.

In [162]:
# Usual imports
import numpy as np
import pandas as pd
from tqdm import tqdm
import string
import matplotlib.pyplot as plt
from sklearn.decomposition import NMF, LatentDirichletAllocation, TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.manifold import TSNE
import concurrent.futures
import time
import pyLDAvis.sklearn
from pylab import bone, pcolor, colorbar, plot, show, rcParams, savefig
import warnings
warnings.filterwarnings('ignore')

%matplotlib inline
import os
print(os.listdir("../data"))

# Plotly based imports for visualization
from plotly import tools
import chart_studio.plotly as py
from plotly.offline import init_notebook_mode, iplot
init_notebook_mode(connected=True)
import plotly.graph_objs as go
import plotly.figure_factory as ff

# spaCy based imports
import spacy
from spacy.lang.en.stop_words import STOP_WORDS
from spacy.lang.en import English
#!python -m spacy download en_core_web_lg

['productReviewShopee_1.csv']


In [63]:
# Loading data
reviews = pd.read_csv('../data/productReviewShopee_1.csv')
reviews.head()

Unnamed: 0,title,rating,date,categorie,comments,product_option
0,🔥 Mini shoulder bag (M-688),4,22-05-2019,Bag,Give 4 stars because order at the price 37 but...,Pink:
1,🔥 Mini shoulder bag (M-688),5,18-07-2019,Bag,Ordered at a discount of 10 baht per piece. It...,Black
2,🔥 Mini shoulder bag (M-688),5,02-07-2019,Bag,"Small, cute, compact, good But the sash looks ...",Tau
3,🔥 Mini shoulder bag (M-688),1,15-12-2018,Bag,The size is not as large as it is down. The st...,Black
4,🔥 Mini shoulder bag (M-688),1,24-07-2019,Bag,The product is compared to the price. Okay. Se...,Black


In [66]:
reviews.shape

(11416, 6)

### Remove Duplicates 

In [68]:
reviews.drop_duplicates().shape

(8984, 6)

In [69]:
reviews=reviews.drop_duplicates()
reviews.head()

Unnamed: 0,title,rating,date,categorie,comments,product_option
0,🔥 Mini shoulder bag (M-688),4,22-05-2019,Bag,Give 4 stars because order at the price 37 but...,Pink:
1,🔥 Mini shoulder bag (M-688),5,18-07-2019,Bag,Ordered at a discount of 10 baht per piece. It...,Black
2,🔥 Mini shoulder bag (M-688),5,02-07-2019,Bag,"Small, cute, compact, good But the sash looks ...",Tau
3,🔥 Mini shoulder bag (M-688),1,15-12-2018,Bag,The size is not as large as it is down. The st...,Black
4,🔥 Mini shoulder bag (M-688),1,24-07-2019,Bag,The product is compared to the price. Okay. Se...,Black


### Remove very small text

In [88]:
reviews.comments=reviews.comments.astype(str)
reviews['len_review']=reviews.comments.apply(len)

In [89]:
reviews.len_review.describe()

count    8975.000000
mean      106.887577
std        94.884396
min         3.000000
25%        40.000000
50%        84.000000
75%       149.000000
max      1387.000000
Name: len_review, dtype: float64

In [90]:
reviews.comments.loc[reviews.len_review<=5]

688        The
796      Price
807        For
914       Cute
949       Good
965       Late
981       Cute
996       Some
1005      Good
1057      Good
1060      send
1062      good
1063      good
1331     Price
1578       👍👍👍
1594      ✌🏻✌🏻
1604     Worth
1636     Good.
1642      Good
1645       Fit
1706       Fit
1812      Good
1889      good
1907      Good
1983     Price
2173      Soft
2330      Good
2371      Cool
2391     Price
2421     Cloth
         ...  
7607      mind
7780       nan
7790      good
7984      Just
8339      Good
8526      good
8557      late
8584       nan
8703      late
8789      Good
9023      Fast
9050      Good
9121      good
9144      Good
9145      Fair
9250      Good
9363       👍👍👍
10465    Rusty
10470     Cute
10471     Good
10518     Cute
10548     Okay
10631    Price
10633      For
10789     Cute
10804    Price
10835     Fair
10861    Price
10928     Cute
11381    Price
Name: comments, Length: 80, dtype: object

In [92]:
reviews.comments=reviews.comments.apply(lambda x: x.replace('👍','good '))

In [93]:
reviews=reviews.loc[reviews.len_review>=3,:]
reviews.head()

Unnamed: 0,title,rating,date,categorie,comments,product_option,len_review
0,🔥 Mini shoulder bag (M-688),4,22-05-2019,Bag,Give 4 stars because order at the price 37 but...,Pink:,66
1,🔥 Mini shoulder bag (M-688),5,18-07-2019,Bag,Ordered at a discount of 10 baht per piece. It...,Black,167
2,🔥 Mini shoulder bag (M-688),5,02-07-2019,Bag,"Small, cute, compact, good But the sash looks ...",Tau,150
3,🔥 Mini shoulder bag (M-688),1,15-12-2018,Bag,The size is not as large as it is down. The st...,Black,135
4,🔥 Mini shoulder bag (M-688),1,24-07-2019,Bag,The product is compared to the price. Okay. Se...,Black,302


In [94]:
reviews.len_review.describe()

count    8975.000000
mean      106.887577
std        94.884396
min         3.000000
25%        40.000000
50%        84.000000
75%       149.000000
max      1387.000000
Name: len_review, dtype: float64

In [95]:
# Creating a spaCy object
nlp = spacy.load('en_core_web_lg')

spaCy also comes with a built-in named entity visualizer that lets you check your model's predictions in your browser. You can pass in one or more <code>Doc</code> objects and start a web server, export HTML files or view the visualization directly from a Jupyter Notebook.

### Named Entity Recognition
Named Entity Recognition is an information extraction task in which named entities in unstructured sentences are located and classified into pre-defined categories such as person names, organizations, locations, medical codes, time expressions, quantities, monetary values, percentages, etc.

In [96]:
doc = nlp(reviews["comments"][3])
spacy.displacy.render(doc, style='ent',jupyter=True)

In [97]:
punctuations = string.punctuation
stopwords = list(STOP_WORDS)

### Lemmatization
Lemmatization is the process of grouping together the inflected forms of a word so they can be analysed as a single item, identified by the word's lemma, or dictionary form. Words like "ran" and "running" are converted to "run" so that inflected variants of the same word are not counted as separate words in our data.
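To make the idea concrete without loading a model, here is a toy sketch. It is not how spaCy lemmatizes: the `LEMMA_LOOKUP` table and the `toy_lemmatize` helper are hypothetical names made up for this illustration, and a real lemmatizer relies on far larger rules and data.

```python
# Hypothetical lookup table mapping inflected forms to a lemma (illustration only).
LEMMA_LOOKUP = {
    "ran": "run",
    "running": "run",
    "bags": "bag",
    "received": "receive",
}

def toy_lemmatize(text):
    """Lower-case, split on whitespace and replace known inflected forms."""
    tokens = text.lower().split()
    return " ".join(LEMMA_LOOKUP.get(token, token) for token in tokens)

lemmatized = toy_lemmatize("She ran while running")
print(lemmatized)
```

Both "ran" and "running" collapse to the same token, which is the effect we want before counting words.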

In [98]:
review = str(" ".join([i.lemma_ for i in doc]))

In [99]:
doc = nlp(review)
spacy.displacy.render(doc, style='ent',jupyter=True)

The sentence looks much different now that it is lemmatized.

### Parts of Speech tagging
This is the process of marking up a word in a text (corpus) as corresponding to a particular part of speech, based on both its definition and its context, i.e., its relationship with adjacent and related words in a phrase, sentence, or paragraph. A simplified form of this is commonly taught to school-age children, in the identification of words as nouns, verbs, adjectives, adverbs, etc.

In [100]:
# POS tagging
# for i in nlp(review):
#     print(i,"=>",i.pos_)

In [102]:
# Parser for reviews
parser = English()
def spacy_tokenizer(sentence):
    mytokens = parser(sentence)
    mytokens = [ word.lemma_.lower().strip() if word.lemma_ != "-PRON-" else word.lower_ for word in mytokens ]
    mytokens = [ word for word in mytokens if word not in stopwords and word not in punctuations ]
    mytokens = " ".join([i for i in mytokens])
    return mytokens

In [114]:
reviews.reset_index(inplace=True)

In [116]:
tqdm.pandas()
reviews["processed_description"] = reviews["comments"].progress_apply(spacy_tokenizer)





100%|██████████| 8975/8975 [00:00<00:00, 10732.24it/s]

# What is topic-modelling?
In machine learning and natural language processing, a topic model is a type of statistical model for discovering the abstract "topics" that occur in a collection of documents. Topic modeling is a frequently used text-mining tool for discovery of hidden semantic structures in a text body. Intuitively, given that a document is about a particular topic, one would expect particular words to appear in the document more or less frequently: "dog" and "bone" will appear more often in documents about dogs, "cat" and "meow" will appear in documents about cats, and "the" and "is" will appear equally in both. A document typically concerns multiple topics in different proportions; thus, in a document that is 10% about cats and 90% about dogs, there would probably be about 9 times more dog words than cat words. 

The "topics" produced by topic modeling techniques are clusters of similar words. A topic model captures this intuition in a mathematical framework, which allows examining a set of documents and discovering, based on the statistics of the words in each, what the topics might be and what each document's balance of topics is. It draws on dimensionality-reduction and unsupervised-learning techniques such as LDA, SVD and autoencoders.

Source: [Wikipedia](https://en.wikipedia.org/wiki/Topic_model)
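The 10% cats / 90% dogs intuition can be checked with a few lines of NumPy. This is a toy sketch, not part of the original kernel: the five-word vocabulary and the two topic-word distributions are made-up assumptions.

```python
import numpy as np

# Hypothetical vocabulary and topic-word distributions (illustrative only).
vocab = ["dog", "bone", "cat", "meow", "the"]
topic_dog = np.array([0.4, 0.4, 0.0, 0.0, 0.2])
topic_cat = np.array([0.0, 0.0, 0.4, 0.4, 0.2])

# A document that is 90% about dogs and 10% about cats.
doc_topic_mix = np.array([0.9, 0.1])
word_probs = doc_topic_mix @ np.vstack([topic_dog, topic_cat])

dog_mass = word_probs[0] + word_probs[1]   # "dog" + "bone"
cat_mass = word_probs[2] + word_probs[3]   # "cat" + "meow"
ratio = dog_mass / cat_mass

print({word: round(float(p), 3) for word, p in zip(vocab, word_probs)})
print("dog words / cat words:", round(float(ratio), 2))
```

The shared word "the" gets the same probability under both topics, so it does not help tell them apart.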

In [151]:
# Creating a vectorizer
vectorizer = CountVectorizer(min_df=0.005, max_df=0.85, stop_words='english', lowercase=True, token_pattern=r'[a-zA-Z\-][a-zA-Z\-]{2,}')  # raw string avoids invalid-escape warnings
data_vectorized = vectorizer.fit_transform(reviews["processed_description"])
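When `min_df` and `max_df` are floats, they are read as proportions of documents: terms that appear in too few or too many documents are dropped. The sketch below mimics that filtering in plain Python on a made-up corpus. It uses simple whitespace tokenization and a hypothetical helper name, so treat it as an illustration of the idea rather than scikit-learn's implementation.

```python
from collections import Counter

# Hypothetical mini-corpus (illustration only).
docs = [
    "good product fast delivery",
    "good quality good price",
    "good fabric soft",
    "late delivery",
]

def filter_by_document_frequency(docs, min_df, max_df):
    """Keep terms whose share of documents lies within [min_df, max_df]."""
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc.split()))  # count each term once per document
    return sorted(
        term for term, df in doc_freq.items()
        if min_df <= df / n_docs <= max_df
    )

kept = filter_by_document_frequency(docs, min_df=0.5, max_df=0.7)
print(kept)
```

Here "good" is dropped for being too common (3 of 4 documents) and most other words for being too rare, which is the same trade-off the vectorizer above makes with 0.005 and 0.85.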

In [152]:
NUM_TOPICS = 10

In [153]:
# Latent Dirichlet Allocation Model
lda = LatentDirichletAllocation(n_components=NUM_TOPICS, max_iter=10, learning_method='online',verbose=True)
data_lda = lda.fit_transform(data_vectorized)

iteration: 1 of max_iter: 10
iteration: 2 of max_iter: 10
iteration: 3 of max_iter: 10
iteration: 4 of max_iter: 10
iteration: 5 of max_iter: 10
iteration: 6 of max_iter: 10
iteration: 7 of max_iter: 10
iteration: 8 of max_iter: 10
iteration: 9 of max_iter: 10
iteration: 10 of max_iter: 10


In [154]:
# Non-Negative Matrix Factorization Model
nmf = NMF(n_components=NUM_TOPICS)
data_nmf = nmf.fit_transform(data_vectorized) 

In [155]:
# Latent Semantic Indexing Model using Truncated SVD
lsi = TruncatedSVD(n_components=NUM_TOPICS)
data_lsi = lsi.fit_transform(data_vectorized)

In [156]:
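# Function for printing the top keywords of each topic
def selected_topics(model, vectorizer, top_n=10):
    feature_names = vectorizer.get_feature_names()  # look up the vocabulary once
    for idx, topic in enumerate(model.components_):
        print("Topic %d:" % (idx))
        # argsort is ascending, so this slice walks back from the largest weight
        print([(feature_names[i], topic[i])
                        for i in topic.argsort()[:-top_n - 1:-1]])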
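The slice `argsort()[:-top_n - 1:-1]` is compact but easy to misread. The sketch below unpacks it on a made-up weight vector; the `weights` and `terms` values are hypothetical.

```python
import numpy as np

weights = np.array([0.1, 2.5, 0.7, 1.9, 0.3])        # hypothetical topic-term weights
terms = ["bag", "good", "price", "fabric", "late"]   # hypothetical vocabulary
top_n = 3

ascending = weights.argsort()              # indices from smallest to largest weight
top_indices = ascending[:-top_n - 1:-1]    # step backwards from the end, top_n items
top_terms = [(terms[i], float(weights[i])) for i in top_indices]
print(top_terms)
```

The result lists the `top_n` terms in descending order of weight, which is the order in which the topic keywords are printed.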

In [157]:
# Keywords for topics clustered by Latent Dirichlet Allocation
print("LDA Model:")
selected_topics(lda, vectorizer)

LDA Model:
Topic 0:
[('beautiful', 1180.714894965156), ('like', 1137.6013759132952), ('lot', 332.1180931501352), ('reasonable', 241.9843404611558), ('slow', 234.50534593453494), ('disappoint', 98.5390950560214), ('bite', 96.45754707101207), ('lovely', 55.12577840860908), ('little', 44.59101148036041), ('overall', 39.31083007414578)]
Topic 1:
[('good', 6889.769017339979), ('quality', 3819.4104516381067), ('product', 3627.8323949944884), ('value', 3130.489782445354), ('delivery', 1129.4639676535746), ('fast', 803.2671344810761), ('reed', 593.4965900746149), ('speed', 221.00695284198616), ('pack', 102.96264494666391), ('light', 60.987538437732404)]
Topic 2:
[('good', 10928.331445343772), ('recommend', 2.157128356644135), ('price', 0.17443220176293373), ('value', 0.10254998256143424), ('beautiful', 0.10090715825657193), ('product', 0.10068421630142722), ('lovely', 0.10027408754084417), ('body', 0.10009146173059813), ('right', 0.10007566264198961), ('work', 0.10006745238662187)]
Topic 3:
[(

In [158]:
# Keywords for topics clustered by Non-Negative Matrix Factorization
print("NMF Model:")
selected_topics(nmf, vectorizer)

NMF Model:
Topic 0:
[('good', 28.989801986679403), ('service', 0.07242228058062336), ('fabric', 0.03719620544692658), ('company', 0.024273130939245446), ('transportation', 0.023387094988641347), ('cheap', 0.02214641974786085), ('ship', 0.02103383236865832), ('shop', 0.018504565739875627), ('worthwhile', 0.0175982528409033), ('speed', 0.01445737562465288)]
Topic 1:
[('product', 14.09783989450921), ('complete', 0.4920647830961364), ('receive', 0.48135275066797845), ('long', 0.4601329169364193), ('received', 0.4344444645602124), ('suitable', 0.3256245670168312), ('deliver', 0.27397691483783315), ('shop', 0.2640445182763761), ('cheap', 0.2566909977620899), ('cover', 0.22617607590692132)]
Topic 2:
[('like', 9.628034510996798), ('lot', 4.670522921876767), ('receive', 0.08657547684384492), ('money', 0.07721137131015365), ('fabric', 0.06887464437657165), ('come', 0.06452147005011198), ('picture', 0.06207086089004432), ('buy', 0.054203576743599886), ('quickly', 0.04893439257549312), ('cute', 0.

In [159]:
# Keywords for topics clustered by Latent Semantic Indexing
print("LSI Model:")
selected_topics(lsi, vectorizer)

LSI Model:
Topic 0:
[('good', 0.9983640587654579), ('product', 0.033049161233831205), ('quality', 0.026254225156590993), ('value', 0.026031136534682586), ('delivery', 0.015196544865179417), ('price', 0.012810887953440025), ('fast', 0.011728000897526807), ('service', 0.009340023517035128), ('beautiful', 0.005408832121786787), ('shop', 0.0042181578548651085)]
Topic 1:
[('product', 0.7080369146193238), ('quality', 0.46789682631490154), ('value', 0.28003535008484237), ('delivery', 0.2646473643952752), ('fast', 0.21397180678801567), ('price', 0.1671053635544995), ('service', 0.11360521837248615), ('beautiful', 0.0703058921761972), ('like', 0.06634357433782001), ('order', 0.06540473864899793)]
Topic 2:
[('like', 0.8829774940669581), ('lot', 0.42552730070934763), ('beautiful', 0.06316030941408526), ('price', 0.048975247798520534), ('value', 0.04769546705814684), ('fast', 0.038011378279758146), ('delivery', 0.03719246732996216), ('order', 0.032711578739051554), ('fabric', 0.031048026695667617)

In [160]:
# Transforming an individual sentence
text = spacy_tokenizer("I like it. Worth money")
x = lda.transform(vectorizer.transform([text]))[0]
print(x)

[0.27500004 0.025      0.025      0.27500001 0.025      0.025
 0.27499995 0.025      0.025      0.025     ]


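The index in the above list with the largest value represents the most dominant topic for the given review. Here topics 0, 3 and 6 are effectively tied at about 0.275, so this short review has no single clearly dominant topic.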
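Picking the dominant topic is an `argmax` over the topic distribution, but it is worth checking for near-ties before trusting it. A minimal sketch using the values printed above; the `1e-6` tolerance is an arbitrary choice for this illustration.

```python
import numpy as np

# Topic distribution copied from the printed output above.
x = np.array([0.27500004, 0.025, 0.025, 0.27500001, 0.025,
              0.025, 0.27499995, 0.025, 0.025, 0.025])

dominant_topic = int(np.argmax(x))   # first index holding the maximum value

# Topics whose weight is within a small tolerance of the maximum.
near_ties = np.flatnonzero(np.isclose(x, x.max(), atol=1e-6)).tolist()
print(dominant_topic, near_ties)
```

When `near_ties` holds more than one index, reporting all of them is probably more honest than reporting the `argmax` alone.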


# Visualizing LDA results with pyLDAvis

In [161]:
pyLDAvis.enable_notebook()
dash = pyLDAvis.sklearn.prepare(lda, data_vectorized, vectorizer, mds='tsne')
dash

## How to interpret this graph?
1. Topics are shown on the left and their respective keywords on the right.
2. Larger circles mark more frequent topics, and the closer two topics are, the more similar they are.
3. Keywords are selected based on their frequency and on how well they discriminate between topics.

**Hover over the topics on the left to get information about their keywords on the right.**
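The balance between frequency and discrimination in point 3 can be sketched with the relevance score described by Sievert and Shirley for LDAvis, which, as far as I understand, is what the slider in the panel controls: `lam * log p(w|t) + (1 - lam) * log(p(w|t) / p(w))`. The probabilities and the default `lam=0.6` below are made-up values for illustration, not numbers taken from this model.

```python
import numpy as np

def relevance(p_w_given_t, p_w, lam=0.6):
    """lam * log p(w|t) + (1 - lam) * log(p(w|t) / p(w))."""
    p_w_given_t = np.asarray(p_w_given_t, dtype=float)
    p_w = np.asarray(p_w, dtype=float)
    return lam * np.log(p_w_given_t) + (1 - lam) * np.log(p_w_given_t / p_w)

# Hypothetical terms with their in-topic and corpus-wide probabilities.
terms = ["good", "fabric", "soft"]
p_w_given_t = [0.50, 0.30, 0.20]
p_w = [0.40, 0.05, 0.02]

# lam = 1 ranks purely by in-topic frequency, lam = 0 purely by lift.
by_frequency = [terms[i] for i in np.argsort(relevance(p_w_given_t, p_w, lam=1.0))[::-1]]
by_lift = [terms[i] for i in np.argsort(relevance(p_w_given_t, p_w, lam=0.0))[::-1]]
print(by_frequency, by_lift)
```

A word like "good" is frequent everywhere, so it leads the frequency ranking but falls to the bottom once only discrimination counts.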

# Visualizing LSI (SVD) scatterplot
We project the data onto 2 LSI components and plot each keyword as a marker, so that the distance between markers indicates how similar the keywords are.

In [128]:
svd_2d = TruncatedSVD(n_components=2)
data_2d = svd_2d.fit_transform(data_vectorized.T)  # transpose: one 2-D point per keyword, matching the feature-name labels
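To get one point per keyword rather than one per review, the decomposition has to run on the term-document matrix, i.e. the transpose of the document-term counts. The NumPy sketch below shows the shapes involved on a made-up 4 x 3 count matrix. It uses a plain SVD as a stand-in for `TruncatedSVD`, so signs and exact values may differ from what scikit-learn returns.

```python
import numpy as np

# Hypothetical document-term counts: 4 documents x 3 terms.
terms = ["good", "product", "fabric"]
doc_term = np.array([
    [2, 0, 0],
    [1, 1, 0],
    [0, 2, 1],
    [0, 0, 3],
], dtype=float)

# SVD of the term-document matrix (the transpose), keeping the top 2 components.
U, s, Vt = np.linalg.svd(doc_term.T, full_matrices=False)
term_coords = U[:, :2] * s[:2]   # one 2-D point per term

for term, (x_coord, y_coord) in zip(terms, term_coords):
    print(f"{term}: ({x_coord:.2f}, {y_coord:.2f})")
```

The number of rows in `term_coords` equals the number of terms, so the points line up with the feature names used as labels in the scatter plot.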

In [130]:
trace = go.Scattergl(
    x = data_2d[:,0],
    y = data_2d[:,1],
    mode = 'markers',
    marker = dict(
        color = '#FFBAD2',
        line = dict(width = 1)
    ),
    text = vectorizer.get_feature_names(),
    hovertext = vectorizer.get_feature_names(),
    hoverinfo = 'text' 
)
data = [trace]
# iplot(data, filename='scatter-mode')

## The text version of the scatter plot looks messy, but you can zoom in for better results

In [131]:
trace = go.Scattergl(
    x = data_2d[:,0],
    y = data_2d[:,1],
    mode = 'text',
    marker = dict(
        color = '#FFBAD2',
        line = dict(width = 1)
    ),
    text = vectorizer.get_feature_names()
)
data = [trace]
# iplot(data, filename='text-scatter-mode')

Let's see what happens when we use a spaCy-based bigram tokenizer for topic modelling.

In [132]:
def spacy_bigram_tokenizer(phrase):
    doc = nlp(phrase) # use the full pipeline: the blank English() parser assigns no POS tags
    token_not_noun = []
    notnoun_noun_list = []
    noun = ""

    for item in doc:
        if item.pos_ != "NOUN": # separate nouns and not nouns
            token_not_noun.append(item.text)
        if item.pos_ == "NOUN":
            noun = item.text
        
        for notnoun in token_not_noun:
            notnoun_noun_list.append(notnoun + " " + noun)

    return " ".join([i for i in notnoun_noun_list])

In [133]:
bivectorizer = CountVectorizer(min_df=5, max_df=0.9, stop_words='english', lowercase=True, ngram_range=(1,2))
bigram_vectorized = bivectorizer.fit_transform(reviews["processed_description"])
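With `ngram_range=(1,2)` the vectorizer counts both single words and pairs of adjacent words. The plain-Python sketch below shows which features that produces for one short phrase. The helper name is hypothetical, and it skips the stop-word removal and `min_df`/`max_df` filtering that the real vectorizer also applies.

```python
def unigrams_and_bigrams(text):
    """Return whitespace tokens followed by adjacent-token pairs."""
    tokens = text.lower().split()
    bigrams = [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    return tokens + bigrams

features = unigrams_and_bigrams("good soft fabric")
print(features)
```

Bigrams such as "soft fabric" only survive in the vocabulary if they occur in at least `min_df` documents, which is why relatively few of them show up among the topic keywords.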

## LDA for bigram data

In [134]:
bi_lda = LatentDirichletAllocation(n_components=NUM_TOPICS, max_iter=10, learning_method='online',verbose=True)
data_bi_lda = bi_lda.fit_transform(bigram_vectorized)

iteration: 1 of max_iter: 10
iteration: 2 of max_iter: 10
iteration: 3 of max_iter: 10
iteration: 4 of max_iter: 10
iteration: 5 of max_iter: 10
iteration: 6 of max_iter: 10
iteration: 7 of max_iter: 10
iteration: 8 of max_iter: 10
iteration: 9 of max_iter: 10
iteration: 10 of max_iter: 10


### Topics for bigram model

In [135]:
print("Bi-LDA Model:")
selected_topics(bi_lda, bivectorizer)

Bi-LDA Model:
Topic 0:
[('fabric', 1279.3906417767519), ('good', 979.3483740289875), ('soft', 518.9963092578602), ('comfortable', 358.90695914186847), ('good fabric', 348.531936552534), ('fabric good', 255.54350101548306), ('sensitive', 239.21618709516886), ('soft fabric', 213.62763630610462), ('round', 155.32346231647145), ('order', 153.39090981912284)]
Topic 1:
[('like', 842.15461715345), ('price', 567.5175399729417), ('bag', 559.511920455546), ('small', 419.1657622893145), ('cute', 401.8435697694018), ('suitable', 383.4554223340235), ('little', 378.43662560745526), ('okay', 350.1816583952013), ('lot', 314.40994201363714), ('beautiful', 233.79428976264757)]
Topic 2:
[('order', 574.5230027549043), ('product', 525.9615007324182), ('color', 478.51683648329646), ('receive', 325.43108927413886), ('picture', 240.37829411130096), ('poor', 211.76910930890793), ('like', 149.83761605638324), ('receive product', 139.58124809504048), ('pink', 132.73228264930262), ('doe', 113.10456911668574)]
Top

In [136]:
bi_dash = pyLDAvis.sklearn.prepare(bi_lda, bigram_vectorized, bivectorizer, mds='tsne')
bi_dash

**Very few two-word keywords have been found, e.g. "good fabric", "soft fabric" and "receive product".**

Kindly upvote and comment if you like this.