#### Problem Statement: 

Lenovo, a popular mobile phone brand, has launched a budget smartphone in the Indian market. The client wants to understand the voice of the customer (VOC) for the product. This will help not only to evaluate the current product but also to give direction for the product pipeline. The client is particularly interested in the different aspects that customers care about. Customer product reviews on a leading e-commerce site should provide a good view.

In [47]:
import pandas as pd

Read the data file using Pandas.

In [48]:
dt = pd.read_csv("K8 Reviews v0.2.csv")
dt.head()

Unnamed: 0,sentiment,review
0,1,Good but need updates and improvements
1,0,"Worst mobile i have bought ever, Battery is dr..."
2,1,when I will get my 10% cash back.... its alrea...
3,1,Good
4,0,The worst phone everThey have changed the last...


Convert all reviews to lower case and extract the text into a list.

In [49]:
dt_lower = [sent.lower() for sent in dt.review.values] 
dt_lower[0]

'good but need updates and improvements'

Tokenize using NLTK.

In [50]:
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
dt_token = [word_tokenize(sent) for sent in dt_lower]
dt_token[0]

['good', 'but', 'need', 'updates', 'and', 'improvements']

Perform part-of-speech tagging using nltk.pos_tag.

In [51]:
nltk.pos_tag(dt_token[0])

[('good', 'JJ'),
 ('but', 'CC'),
 ('need', 'VBP'),
 ('updates', 'NNS'),
 ('and', 'CC'),
 ('improvements', 'NNS')]

Do the same thing for all sentences.

In [52]:
dt_tagged = [nltk.pos_tag(tokens) for tokens in dt_token]
dt_tagged[0]

[('good', 'JJ'),
 ('but', 'CC'),
 ('need', 'VBP'),
 ('updates', 'NNS'),
 ('and', 'CC'),
 ('improvements', 'NNS')]

For the topic model, include only nouns.
* Find out all POS tags that correspond to nouns.
* Limit the data to only the terms with these tags.

In [53]:
nltk.help.upenn_tagset()

$: dollar
    $ -$ --$ A$ C$ HK$ M$ NZ$ S$ U.S.$ US$
'': closing quotation mark
    ' ''
(: opening parenthesis
    ( [ {
): closing parenthesis
    ) ] }
,: comma
    ,
--: dash
    --
.: sentence terminator
    . ! ?
:: colon or ellipsis
    : ; ...
CC: conjunction, coordinating
    & 'n and both but either et for less minus neither nor or plus so
    therefore times v. versus vs. whether yet
CD: numeral, cardinal
    mid-1890 nine-thirty forty-two one-tenth ten million 0.5 one forty-
    seven 1987 twenty '79 zero two 78-degrees eighty-four IX '60s .025
    fifteen 271,124 dozen quintillion DM2,000 ...
DT: determiner
    all an another any both del each either every half la many much nary
    neither no some such that the them these this those
EX: existential there
    there
FW: foreign word
    gemeinschaft hund ich jeux habeas Haementeria Herr K'ang-si vous
    lutihaw alai je jour objets salutaris fille quibusdam pas trop Monte
    terram fiche oui corporis ...
IN: preposition or

The tags we are interested in are NN, NNP, NNS, and NNPS, all of which begin with 'NN'.

In [54]:
dt_tuple = nltk.pos_tag(['great'])
dt_tuple[0]

('great', 'JJ')

Use the second element of each tuple (the POS tag) to keep only the nouns.

In [55]:
import re
dt_noun = []
for sent in dt_tagged:
    dt_noun.append([token for token in sent if re.search("NN.*",token[1])])

dt_noun[0]

[('updates', 'NNS'), ('improvements', 'NNS')]
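
Because all four noun tags share the 'NN' prefix, the same filter can also be written with `str.startswith`, which avoids the regular expression. A minimal sketch, using the tagged tokens shown above as hard-coded sample data; the helper name `keep_nouns` is illustrative and not part of the notebook:

```python
def keep_nouns(tagged_sentence):
    """Return only the (token, tag) pairs whose Penn Treebank tag starts with 'NN'."""
    return [(token, tag) for token, tag in tagged_sentence if tag.startswith("NN")]


# Sample data copied from the tagged output of the first review.
sample_tagged = [
    ("good", "JJ"), ("but", "CC"), ("need", "VBP"),
    ("updates", "NNS"), ("and", "CC"), ("improvements", "NNS"),
]

nouns = keep_nouns(sample_tagged)
print(nouns)
```

Applied to the full data, this would be `[keep_nouns(sent) for sent in dt_tagged]`.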

Lemmatize:
* Different forms of a term need to be treated as one.
* No need to provide POS tag to lemmatizer for now.

In [56]:
from nltk.stem import WordNetLemmatizer
lemm = WordNetLemmatizer()

Create an empty list to store the lemmatized results.

In [57]:
dt_lemm=[]
for sent in dt_noun:
    dt_lemm.append([lemm.lemmatize(word[0]) for word in sent]) 

In [58]:
dt_lemm[0]

['update', 'improvement']

Remove stopwords and punctuation

In [59]:
from string import punctuation
from nltk.corpus import stopwords
stop_nltk = stopwords.words("english")
stop_updated = stop_nltk + list(punctuation) + ["..."] + [".."]

In [60]:
reviews_sw_removed=[]
for sent in dt_lemm:
    reviews_sw_removed.append([term for term in sent if term not in stop_updated])

In [61]:
reviews_sw_removed[0]

['update', 'improvement']
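
Membership tests against a Python list scan the whole list for every term, so on a large review set it may be worth holding the stop terms in a set instead. A minimal sketch of that variant, assuming a shortened, hypothetical stop list in place of NLTK's full English list so that it runs without any downloads:

```python
from string import punctuation

# Hypothetical, shortened stop list for illustration only;
# the notebook itself uses stopwords.words("english") from NLTK.
sample_stopwords = ["the", "is", "a", "and"]

# A set gives constant-time membership checks on average.
stop_set = set(sample_stopwords) | set(punctuation) | {"...", ".."}


def remove_stopwords(sentences, stop_terms):
    """Drop every term that appears in stop_terms from each tokenized sentence."""
    return [[term for term in sent if term not in stop_terms] for sent in sentences]


sample_lemmas = [["update", "improvement"], ["the", "battery", "...", "backup"]]
cleaned = remove_stopwords(sample_lemmas, stop_set)
print(cleaned)
```

In the notebook, the equivalent change would be to build `set(stop_updated)` once and filter `dt_lemm` against it.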

Create a topic model using LDA on the cleaned up data with 12 topics.
* Print out the top terms for each topic.
* What is the coherence of the model with the c_v metric?

In [62]:
import gensim
import gensim.corpora as corpora
from gensim.models import CoherenceModel
from gensim.models import ldamodel
id2word = corpora.Dictionary(reviews_sw_removed)
texts = reviews_sw_removed

Using the Gensim dictionary, convert each review into a bag-of-words: a list of (term ID, count) pairs.

In [63]:
corpus = [id2word.doc2bow(text) for text in texts]
print(corpus[200])

[(426, 1), (427, 1), (428, 1), (429, 1)]
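
To make the (term ID, count) pairs above easier to read, the idea behind a bag-of-words conversion can be sketched in plain Python. This only illustrates the concept and is not Gensim's implementation; the helper names `build_vocab` and `to_bow` are hypothetical, and the IDs Gensim assigns may differ:

```python
from collections import Counter


def build_vocab(texts):
    """Assign an integer ID to each distinct term, in order of first appearance."""
    vocab = {}
    for doc in texts:
        for term in doc:
            if term not in vocab:
                vocab[term] = len(vocab)
    return vocab


def to_bow(doc, vocab):
    """Convert one tokenized document into sorted (term ID, count) pairs."""
    counts = Counter(vocab[term] for term in doc if term in vocab)
    return sorted(counts.items())


sample_texts = [["update", "improvement"], ["battery", "backup", "battery"]]
vocab = build_vocab(sample_texts)
bow = [to_bow(doc, vocab) for doc in sample_texts]
print(vocab)
print(bow)
```

Each pair says which term occurs in the review and how often; word order is discarded, which is what LDA expects as input.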


Build the topic model using LDA, with 12 topics

In [64]:
lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus, id2word=id2word, num_topics=12, random_state=42,
                                           passes=10, per_word_topics=True)
print(lda_model.print_topics())

[(0, '0.381*"mobile" + 0.023*"problem" + 0.023*"notification" + 0.017*"heat" + 0.016*"cell" + 0.016*"message" + 0.011*"hang" + 0.011*"rate" + 0.010*"whatsapp" + 0.009*"call"'), (1, '0.267*"battery" + 0.105*"problem" + 0.055*"backup" + 0.055*"heating" + 0.052*"issue" + 0.037*"performance" + 0.036*"hour" + 0.032*"day" + 0.030*"time" + 0.029*"life"'), (2, '0.062*"handset" + 0.051*"software" + 0.041*"box" + 0.032*"contact" + 0.030*"update" + 0.026*"set" + 0.023*"star" + 0.023*"option" + 0.022*"item" + 0.020*"purchase"'), (3, '0.080*"phone" + 0.049*"amazon" + 0.044*"service" + 0.030*"lenovo" + 0.030*"day" + 0.029*"issue" + 0.027*"problem" + 0.026*"time" + 0.022*"delivery" + 0.019*"experience"'), (4, '0.135*"feature" + 0.076*"camera" + 0.048*"mode" + 0.037*"video" + 0.027*"android" + 0.025*"stock" + 0.023*"depth" + 0.019*"gallery" + 0.018*"volta" + 0.017*"thanks"'), (5, '0.439*"product" + 0.090*"charger" + 0.018*"earphone" + 0.016*"turbo" + 0.016*"buy" + 0.016*"piece" + 0.015*"awesome" + 0.0

Calculating the coherence of the model using the c_v metric.

In [65]:
coherence_model_lda = CoherenceModel(model=lda_model, texts=reviews_sw_removed, dictionary=id2word, coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()
print('\nCoherence Score: ', coherence_lda)


Coherence Score:  0.5560767730635368


##### Determine which of the topics can be combined.
Looking at the topics and their terms, the following can be combined:
* Topics 2 and 5 possibly talk about 'pricing'
* Topics 4, 6 and 10 closely talk about 'battery related issues'
* Topics 3 and 11 vaguely talk about 'performance'
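
The grouping above is a judgment made by reading the terms. One possible way to support such a judgment is to measure how many top terms two topics share. A minimal sketch using Jaccard overlap, which is an assumption introduced here and not the method used in this notebook; the two term lists are copied from topics 1 and 3 of the 12-topic output printed earlier:

```python
def jaccard(terms_a, terms_b):
    """Share of distinct terms that two term lists have in common (0.0 to 1.0)."""
    a, b = set(terms_a), set(terms_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Top 10 terms of topics 1 and 3 from the 12-topic model output.
topic_1 = ["battery", "problem", "backup", "heating", "issue",
           "performance", "hour", "day", "time", "life"]
topic_3 = ["phone", "amazon", "service", "lenovo", "day",
           "issue", "problem", "time", "delivery", "experience"]

shared = sorted(set(topic_1) & set(topic_3))
overlap = jaccard(topic_1, topic_3)
print(shared, overlap)
```

A high overlap only suggests that two topics may be candidates for merging; generic terms such as 'problem' or 'issue' can inflate the score, so the terms still need to be read.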

Create a topic model using LDA with what you think is the optimal number of topics

What is the coherence of the model?

* Eight topics seem to be the right number for this data.
* We’ll create a topic model with 8 topics.

In [66]:
lda_model8 = gensim.models.ldamodel.LdaModel(corpus=corpus, id2word=id2word, num_topics=8, random_state=42,
                                           passes=10, per_word_topics=True)
coherence_model_lda = CoherenceModel(model=lda_model8, texts=reviews_sw_removed, dictionary=id2word, coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()
print('\nCoherence Score: ', coherence_lda)


Coherence Score:  0.5470127061130555
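
The two runs give c_v scores of about 0.556 for 12 topics and 0.547 for 8 topics, so the scores are close and the choice also depends on how interpretable the topics are. To compare more candidate values, the fit-and-score step can be wrapped in a small loop. A minimal sketch: `sweep_topic_counts` and `gensim_c_v` are hypothetical helper names, and the demonstration at the bottom reuses the two scores reported in this notebook instead of refitting any model:

```python
def sweep_topic_counts(candidates, score_fn):
    """Score each candidate topic count and return (best_k, {k: score})."""
    scores = {k: score_fn(k) for k in candidates}
    best_k = max(scores, key=scores.get)
    return best_k, scores


def gensim_c_v(k, corpus, id2word, texts):
    """Fit an LDA model with k topics and return its c_v coherence."""
    # Imported lazily so the sweep helper can be used without Gensim installed.
    from gensim.models import CoherenceModel, LdaModel

    model = LdaModel(corpus=corpus, id2word=id2word, num_topics=k,
                     random_state=42, passes=10)
    return CoherenceModel(model=model, texts=texts, dictionary=id2word,
                          coherence="c_v").get_coherence()


# Demonstration with the two coherence scores already reported above.
reported = {12: 0.5560767730635368, 8: 0.5470127061130555}
best_k, scores = sweep_topic_counts([8, 12], reported.get)
print(best_k, scores)
```

With the notebook's variables, a real sweep could be run as `sweep_topic_counts(range(4, 15, 2), lambda k: gensim_c_v(k, corpus, id2word, reviews_sw_removed))`; the candidate range is an arbitrary example.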


The business should be able to interpret the topics.
* Name each of the identified topics.
* Create a table with the topic name and the top 10 terms in each to present to the business.

In [67]:
x = lda_model8.show_topics(formatted=False)
topics_words = [(tp[0], [wd[0] for wd in tp[1]]) for tp in x]

for topic,words in topics_words:
    print(str(topic)+ "::"+ str(words))
print()

0::['mobile', 'charger', 'heat', 'charge', 'superb', 'turbo', 'hour', 'min', 'notification', 'awesome']
1::['battery', 'phone', 'problem', 'camera', 'backup', 'heating', 'issue', 'performance', 'quality', 'life']
2::['note', 'k8', 'lenovo', 'phone', 'software', 'screen', 'update', 'issue', 'handset', 'option']
3::['phone', 'amazon', 'issue', 'time', 'service', 'day', 'problem', 'month', 'lenovo', 'delivery']
4::['phone', 'camera', 'price', 'feature', 'range', 'mode', 'performance', 'device', 'quality', 'depth']
5::['product', 'money', 'waste', 'performance', 'ok', 'cast', 'item', 'pic', 'please', 'work']
6::['phone', 'network', 'call', 'sim', 'hai', 'jio', 'volta', 'budget', 'card', 'issue']
7::['camera', 'quality', 'money', 'value', 'music', 'speed', 'h', 'clarity', 'video', 'screen']



| Topic | Business Name |
|-------|---------------|
| 1 |Product Accessories |
| 2 |Amazon |
| 3 |Pricing |
| 4 |Phone Performance |
| 5 |Battery Related Issues |
| 6 |Camera Quality |
| 7 |Sound Features |
| 8 |Overall General Phone Features |
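
A table that also lists the top terms of each topic can be generated from the `topics_words` list built earlier. A minimal sketch, assuming a hypothetical helper `topics_to_markdown` and placeholder names; the business names still have to be assigned to topic IDs by hand, and the two sample rows are copied from the printed output for topics 0 and 1:

```python
def topics_to_markdown(topics_words, topic_names):
    """Render (topic_id, [terms]) pairs as a Markdown table with a name column."""
    lines = ["| Topic | Business Name | Top Terms |", "|---|---|---|"]
    for topic_id, words in topics_words:
        name = topic_names.get(topic_id, "(unnamed)")
        lines.append(f"| {topic_id} | {name} | {', '.join(words)} |")
    return "\n".join(lines)


# Sample rows copied from the 8-topic output above.
sample_topics_words = [
    (0, ["mobile", "charger", "heat", "charge", "superb",
         "turbo", "hour", "min", "notification", "awesome"]),
    (1, ["battery", "phone", "problem", "camera", "backup",
         "heating", "issue", "performance", "quality", "life"]),
]

# Placeholder names only; replace with the business names chosen for each topic ID.
placeholder_names = {0: "<name for topic 0>", 1: "<name for topic 1>"}

table = topics_to_markdown(sample_topics_words, placeholder_names)
print(table)
```

Keying the names by the model's own topic IDs (0 to 7) keeps the table aligned with the output of `show_topics`.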
