### Project - I (Review Project Analysis)

**DESCRIPTION:** Help a leading mobile brand understand the voice of the customer by analyzing the reviews of their product on Amazon and the topics that customers are talking about. You will perform topic modeling on specific parts of speech and then interpret the emerging topics.

**Problem Statement:** A popular mobile phone brand, Lenovo, has launched its budget smartphone in the Indian market. The client wants to understand the VOC (voice of the customer) on the product. This will be useful not just to evaluate the current product, but also to get some direction for developing the product pipeline. The client is particularly interested in the different aspects that customers care about. Product reviews by customers on a leading e-commerce site should provide a good view.

**Domain:** Amazon reviews for a leading phone brand

**Analysis to be done:** POS tagging, topic modeling using LDA, and topic interpretation

**Content:**

Dataset: ‘K8 Reviews v0.2.csv’

Columns:
- Sentiment: The sentiment of the review (4- and 5-star reviews are positive; 1- and 2-star reviews are negative)
- Reviews: The main text of the review

**Steps to perform:** Discover the topics in the reviews and present them to the business in a consumable format. Employ techniques in syntactic processing and topic modeling.

Perform specific cleanup and POS tagging, restrict the text to the relevant POS tags, and then perform topic modeling using LDA. Finally, give business-friendly names to the topics and make a table for the business.

**Tasks:**

1. Read the .csv file using Pandas. Take a look at the top few records.
2. Normalize casings for the review text and extract the text into a list for easier manipulation.
3. Tokenize the reviews using NLTK's word_tokenize function.
4. Perform parts-of-speech tagging on each sentence using the NLTK POS tagger.
5. For the topic model, we want to include only nouns.
    1. Find out all the POS tags that correspond to nouns.
    2. Limit the data to only terms with these tags.
6. Lemmatize. 
    1. Different forms of the terms need to be treated as one.
    2. No need to provide POS tag to lemmatizer for now.
7. Remove stopwords and punctuation (if there are any). 
8. Create a topic model using LDA on the cleaned-up data with 12 topics.
    1. Print out the top terms for each topic.
    2. What is the coherence of the model with the c_v metric?
9. Analyze the topics through the business lens.
    1. Determine which of the topics can be combined.
10. Create a topic model using LDA with what you think is the optimal number of topics.
    1. What is the coherence of the model?
11. The business should be able to interpret the topics.
    1. Name each of the identified topics.
    2. Create a table with the topic name and the top 10 terms in each to present to the business.

#### Task-1: Read the .csv file using Pandas. Take a look at the top few records.

In [1]:
import pandas as pd
import time

start_time = time.time()

df = pd.read_csv("K8 Reviews v0.2.csv")

In [2]:
df.head(8)

|   | sentiment | review |
| --- | --- | --- |
| 0 | 1 | Good but need updates and improvements |
| 1 | 0 | Worst mobile i have bought ever, Battery is dr... |
| 2 | 1 | when I will get my 10% cash back.... its alrea... |
| 3 | 1 | Good |
| 4 | 0 | The worst phone everThey have changed the last... |
| 5 | 0 | Only I'm telling don't buyI'm totally disappoi... |
| 6 | 1 | Phone is awesome. But while charging, it heats... |
| 7 | 0 | The battery level has worn down |


#### Task-2. Normalize casings for the review text and extract the text into a list for easier manipulation.

In [3]:
review = df['review'].str.lower().to_list()

In [4]:
review[:5]

['good but need updates and improvements',
 "worst mobile i have bought ever, battery is draining like hell, backup is only 6 to 7 hours with internet uses, even if i put mobile idle its getting discharged.this is biggest lie from amazon & lenove which is not at all expected, they are making full by saying that battery is 4000mah & booster charger is fake, it takes at least 4 to 5 hours to be fully charged.don't know how lenovo will survive by making full of us.please don;t go for this else you will regret like me.",
 'when i will get my 10% cash back.... its already 15 january..',
 'good',
 'the worst phone everthey have changed the last phone but the problem is still same and the amazon is not returning the phone .highly disappointing of amazon']

**Optional Step:** Let's remove special characters and numbers, since they may otherwise interfere with text preprocessing.

In [5]:
import re

# Raw string avoids the invalid "\]" escape; the pattern itself is unchanged.
review = [re.sub(r"[.,|:='~^0-9\\]", " ", item) for item in review]
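
To see what this cleanup does to a single review, the sketch below applies the same character class to a short string. The helper name `clean_review` and the sample text are made up for illustration; only the character class comes from the cell above.

```python
import re

# Same character class as the cleanup cell, written as a raw string so the
# escaped backslash inside the class is unambiguous.
CLEANUP_PATTERN = re.compile(r"[.,|:='~^0-9\\]")


def clean_review(text):
    """Replace each listed special character or digit with a single space."""
    return CLEANUP_PATTERN.sub(" ", text)


sample = "battery is 4000mah, charged.don't"  # hypothetical review snippet
cleaned = clean_review(sample)
print(cleaned.split())
```

Because the apostrophe is part of the character class, a contraction such as "don't" is split into "don" and "t", which is why those fragments appear in the tokenized output of the next task.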

#### Task-3. Tokenize the reviews using NLTK's word_tokenize function.

In [6]:
from nltk import word_tokenize

review_tokens = [word_tokenize(item) for item in review]
print(review_tokens[:5])

[['good', 'but', 'need', 'updates', 'and', 'improvements'], ['worst', 'mobile', 'i', 'have', 'bought', 'ever', 'battery', 'is', 'draining', 'like', 'hell', 'backup', 'is', 'only', 'to', 'hours', 'with', 'internet', 'uses', 'even', 'if', 'i', 'put', 'mobile', 'idle', 'its', 'getting', 'discharged', 'this', 'is', 'biggest', 'lie', 'from', 'amazon', '&', 'lenove', 'which', 'is', 'not', 'at', 'all', 'expected', 'they', 'are', 'making', 'full', 'by', 'saying', 'that', 'battery', 'is', 'mah', '&', 'booster', 'charger', 'is', 'fake', 'it', 'takes', 'at', 'least', 'to', 'hours', 'to', 'be', 'fully', 'charged', 'don', 't', 'know', 'how', 'lenovo', 'will', 'survive', 'by', 'making', 'full', 'of', 'us', 'please', 'don', ';', 't', 'go', 'for', 'this', 'else', 'you', 'will', 'regret', 'like', 'me'], ['when', 'i', 'will', 'get', 'my', '%', 'cash', 'back', 'its', 'already', 'january'], ['good'], ['the', 'worst', 'phone', 'everthey', 'have', 'changed', 'the', 'last', 'phone', 'but', 'the', 'problem', 

#### Task-4. Perform parts-of-speech tagging on each sentence using the NLTK POS tagger.

In [7]:
from nltk import pos_tag

review_postags = [pos_tag(item) for item in review_tokens]
print(review_postags[:2])

[[('good', 'JJ'), ('but', 'CC'), ('need', 'VBP'), ('updates', 'NNS'), ('and', 'CC'), ('improvements', 'NNS')], [('worst', 'JJS'), ('mobile', 'NN'), ('i', 'NN'), ('have', 'VBP'), ('bought', 'VBN'), ('ever', 'RB'), ('battery', 'NN'), ('is', 'VBZ'), ('draining', 'VBG'), ('like', 'IN'), ('hell', 'NN'), ('backup', 'NN'), ('is', 'VBZ'), ('only', 'RB'), ('to', 'TO'), ('hours', 'NNS'), ('with', 'IN'), ('internet', 'NN'), ('uses', 'NNS'), ('even', 'RB'), ('if', 'IN'), ('i', 'JJ'), ('put', 'VBP'), ('mobile', 'JJ'), ('idle', 'NN'), ('its', 'PRP$'), ('getting', 'VBG'), ('discharged', 'VBN'), ('this', 'DT'), ('is', 'VBZ'), ('biggest', 'JJS'), ('lie', 'NN'), ('from', 'IN'), ('amazon', 'NN'), ('&', 'CC'), ('lenove', 'NN'), ('which', 'WDT'), ('is', 'VBZ'), ('not', 'RB'), ('at', 'IN'), ('all', 'DT'), ('expected', 'VBN'), ('they', 'PRP'), ('are', 'VBP'), ('making', 'VBG'), ('full', 'JJ'), ('by', 'IN'), ('saying', 'VBG'), ('that', 'DT'), ('battery', 'NN'), ('is', 'VBZ'), ('mah', 'JJ'), ('&', 'CC'), ('boo

#### Task-5. For the topic model, we want to include only nouns.
1.	Find out all the POS tags that correspond to nouns.
2.	Limit the data to only terms with these tags.

In [8]:
import nltk
nltk.help.upenn_tagset()

$: dollar
    $ -$ --$ A$ C$ HK$ M$ NZ$ S$ U.S.$ US$
'': closing quotation mark
    ' ''
(: opening parenthesis
    ( [ {
): closing parenthesis
    ) ] }
,: comma
    ,
--: dash
    --
.: sentence terminator
    . ! ?
:: colon or ellipsis
    : ; ...
CC: conjunction, coordinating
    & 'n and both but either et for less minus neither nor or plus so
    therefore times v. versus vs. whether yet
CD: numeral, cardinal
    mid-1890 nine-thirty forty-two one-tenth ten million 0.5 one forty-
    seven 1987 twenty '79 zero two 78-degrees eighty-four IX '60s .025
    fifteen 271,124 dozen quintillion DM2,000 ...
DT: determiner
    all an another any both del each either every half la many much nary
    neither no some such that the them these this those
EX: existential there
    there
FW: foreign word
    gemeinschaft hund ich jeux habeas Haementeria Herr K'ang-si vous
    lutihaw alai je jour objets salutaris fille quibusdam pas trop Monte
    terram fiche oui corporis ...
IN: preposition or

In [9]:
noun_tags = ['NN', 'NNS', 'NNP', 'NNPS']

review_postags_nouns = []

for item in review_postags:
    noun_tokens = [token_tag[0] for token_tag in item if token_tag[1] in noun_tags]
    review_postags_nouns.append(noun_tokens)
    
print(review_postags_nouns[:5])

[['updates', 'improvements'], ['mobile', 'i', 'battery', 'hell', 'backup', 'hours', 'internet', 'uses', 'idle', 'lie', 'amazon', 'lenove', 'battery', 'booster', 'charger', 'hours', 'don', 't', 'don'], ['i', '%', 'cash'], [], ['phone', 'everthey', 'phone', 'problem', 'amazon', 'phone', 'amazon']]
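
The four noun tags in the Penn Treebank tagset (NN, NNS, NNP, NNPS) all share the prefix `NN`, so the same filter can also be written as a prefix check. The following is a minimal sketch of that variant; `keep_nouns` is a hypothetical helper name, and the tagged tokens are copied from the first review in the Task-4 output.

```python
def keep_nouns(tagged_tokens):
    """Return the tokens whose Penn Treebank tag starts with 'NN'."""
    return [token for token, tag in tagged_tokens if tag.startswith("NN")]


# First tagged review from the Task-4 output.
tagged_review = [
    ("good", "JJ"), ("but", "CC"), ("need", "VBP"),
    ("updates", "NNS"), ("and", "CC"), ("improvements", "NNS"),
]

nouns = keep_nouns(tagged_review)
print(nouns)
```

A review with no nouns simply yields an empty list, as the fourth review ("good") does in the output above.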


#### Task-6. Lemmatize. 
1.	Different forms of the terms need to be treated as one.
2.	No need to provide POS tag to lemmatizer for now.

In [10]:
from nltk import WordNetLemmatizer

wnl = WordNetLemmatizer()

review_postags_nouns_lemmed = []

for item in review_postags_nouns:
    lemmed_tokens = [wnl.lemmatize(token, 'n') for token in item]
    review_postags_nouns_lemmed.append(lemmed_tokens)
    
print(review_postags_nouns_lemmed[:5])

[['update', 'improvement'], ['mobile', 'i', 'battery', 'hell', 'backup', 'hour', 'internet', 'us', 'idle', 'lie', 'amazon', 'lenove', 'battery', 'booster', 'charger', 'hour', 'don', 't', 'don'], ['i', '%', 'cash'], [], ['phone', 'everthey', 'phone', 'problem', 'amazon', 'phone', 'amazon']]


#### Task-7. Remove stopwords and punctuation (if there are any). 

In [11]:
from nltk.corpus import stopwords
import string

sw = stopwords.words("english")
punc = list(string.punctuation)

custom_sw = sw + punc

review_preprocessed = []

for item in review_postags_nouns_lemmed:
    if len(item)>0:        
        preprocessed_tokens = [token for token in item if token not in custom_sw and len(token)>1]
        review_preprocessed.append(preprocessed_tokens)
    else:
        review_preprocessed.append(item)
    
print(review_preprocessed[:5])

[['update', 'improvement'], ['mobile', 'battery', 'hell', 'backup', 'hour', 'internet', 'us', 'idle', 'lie', 'amazon', 'lenove', 'battery', 'booster', 'charger', 'hour'], ['cash'], [], ['phone', 'everthey', 'phone', 'problem', 'amazon', 'phone', 'amazon']]
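
The filtering step can be written a little more compactly. The sketch below is one possible variant, not the notebook's own code: it stores the blocked tokens in a set for faster membership checks and drops the `if`/`else`, since a list comprehension over an empty review already returns an empty list. The stopword set is a tiny hypothetical stand-in for `stopwords.words("english")`.

```python
import string

# Tiny hypothetical stand-in for the NLTK English stopword list.
stop_words = {"i", "don", "t", "the", "is"}

# A set gives constant-time membership checks, unlike a list.
blocked = stop_words | set(string.punctuation)


def drop_noise(tokens):
    """Remove stopwords, punctuation, and single-character tokens."""
    return [tok for tok in tokens if tok not in blocked and len(tok) > 1]


sample_reviews = [["i", "%", "cash"], [], ["mobile", "i", "battery", "don", "t"]]
cleaned_reviews = [drop_noise(tokens) for tokens in sample_reviews]
print(cleaned_reviews)
```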


#### Task-8. Create a topic model using LDA on the cleaned-up data with 12 topics.
1.	Print out the top terms for each topic.
2.	What is the coherence of the model with the c_v metric?

In [12]:
import gensim
from gensim.corpora import Dictionary
import numpy as np

np.random.seed(100)

In [13]:
dictionary = Dictionary(review_preprocessed)

In [14]:
type(dictionary)

gensim.corpora.dictionary.Dictionary

In [15]:
dictionary.filter_extremes(no_below=5, no_above=.8 ,keep_n=None)

In [16]:
print(dictionary)

Dictionary(1128 unique tokens: ['improvement', 'update', 'amazon', 'backup', 'battery']...)
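
The `filter_extremes` call keeps only tokens that occur in at least `no_below` documents and in no more than a `no_above` fraction of all documents. The sketch below illustrates that rule in plain Python on a toy corpus. The function name and data are hypothetical, and Gensim's own implementation may differ in details such as rounding.

```python
from collections import Counter


def filter_by_document_frequency(docs, no_below, no_above):
    """Return the tokens whose document frequency lies within both bounds."""
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))  # count each token once per document
    max_docs = no_above * len(docs)
    return {tok for tok, freq in doc_freq.items() if no_below <= freq <= max_docs}


# Toy corpus: "phone" is in every document, "heat" in only one.
toy_docs = [
    ["phone", "battery"],
    ["phone", "camera"],
    ["phone", "battery", "heat"],
    ["phone", "camera"],
    ["phone"],
]

kept = filter_by_document_frequency(toy_docs, no_below=2, no_above=0.8)
print(sorted(kept))
```

Here "phone" is dropped for being too common and "heat" for being too rare, which mirrors how very frequent and very rare terms are pruned before the bag-of-words step.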


In [17]:
bow_text = [dictionary.doc2bow(item) for item in review_preprocessed]

In [18]:
bow_text[:5]

[[(0, 1), (1, 1)],
 [(2, 1),
  (3, 1),
  (4, 2),
  (5, 1),
  (6, 1),
  (7, 1),
  (8, 2),
  (9, 1),
  (10, 1),
  (11, 1),
  (12, 1),
  (13, 1)],
 [(14, 1)],
 [],
 [(2, 2), (15, 3), (16, 1)]]

In [19]:
sample_bow = bow_text[4]

for item in sample_bow:
    print("Word '{}' comes {} times in this sample review".format(dictionary[item[0]], item[1]))

Word 'amazon' comes 2 times in this sample review
Word 'phone' comes 3 times in this sample review
Word 'problem' comes 1 times in this sample review
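
To make the `(token_id, count)` pairs concrete, here is a plain-Python sketch of what a bag-of-words conversion does. `to_bow` is a hypothetical helper, not Gensim's `doc2bow`; the ids and tokens are taken from the fifth review shown above.

```python
from collections import Counter


def to_bow(tokens, token2id):
    """Count known tokens and return sorted (token_id, count) pairs."""
    counts = Counter(tok for tok in tokens if tok in token2id)
    return sorted((token2id[tok], n) for tok, n in counts.items())


# Ids as they appear in the notebook output for the fifth review.
token2id = {"amazon": 2, "phone": 15, "problem": 16}
sample_tokens = ["phone", "everthey", "phone", "problem", "amazon", "phone", "amazon"]

bow = to_bow(sample_tokens, token2id)
print(bow)
```

The token "everthey" has no id in this mapping, so it is silently dropped, which is consistent with the fifth review's bag of words above containing only three entries.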


In [20]:
# Let us create LDA model from Gensim library

lda_model = gensim.models.LdaMulticore(bow_text, 
                                   num_topics = 12, 
                                   id2word = dictionary,
                                   random_state=1,                                    
                                   passes = 50)

In [21]:
for idx, topic in lda_model.print_topics(-1):
    print("\nTopic: {} \nWords: {}".format(idx, topic ))


Topic: 0 
Words: 0.272*"battery" + 0.076*"hour" + 0.065*"time" + 0.063*"charger" + 0.044*"charge" + 0.038*"life" + 0.036*"day" + 0.033*"turbo" + 0.031*"drain" + 0.030*"hr"

Topic: 1 
Words: 0.064*"use" + 0.056*"superb" + 0.054*"star" + 0.039*"min" + 0.036*"data" + 0.033*"wifi" + 0.031*"internet" + 0.023*"heat" + 0.020*"month" + 0.020*"cable"

Topic: 2 
Words: 0.106*"phone" + 0.060*"amazon" + 0.057*"price" + 0.053*"service" + 0.042*"product" + 0.039*"day" + 0.033*"device" + 0.026*"delivery" + 0.024*"customer" + 0.023*"range"

Topic: 3 
Words: 0.055*"phone" + 0.049*"processor" + 0.038*"camera" + 0.032*"ram" + 0.026*"gb" + 0.025*"budget" + 0.023*"game" + 0.023*"smartphone" + 0.019*"device" + 0.019*"apps"

Topic: 4 
Words: 0.295*"camera" + 0.167*"quality" + 0.035*"sound" + 0.030*"display" + 0.021*"everything" + 0.019*"performance" + 0.019*"front" + 0.017*"feature" + 0.017*"price" + 0.016*"clarity"

Topic: 5 
Words: 0.213*"issue" + 0.130*"phone" + 0.073*"lenovo" + 0.043*"update" + 0.038*"s

In [22]:
from gensim.models import CoherenceModel

In [23]:
coherence_score = CoherenceModel(model=lda_model, texts=review_preprocessed, dictionary=dictionary, coherence='c_v')
coherence_lda = coherence_score.get_coherence()
print('Coherence Score for this LDA model is: ', coherence_lda)

Coherence Score for this LDA model is:  0.5558224125871732


#### Task-9. Analyze the topics through the business lens.
1.	Determine which of the topics can be combined.

In [24]:
# Let us visualize topics in our text

import pyLDAvis
from pyLDAvis import gensim_models

In [25]:
import warnings
warnings.simplefilter('ignore')

pyLDAvis.enable_notebook()



In [26]:
LDAvis_prepared = pyLDAvis.gensim_models.prepare(lda_model, bow_text, dictionary)

In [27]:
pyLDAvis.save_html(LDAvis_prepared, 'LDA_model_vis'+'.html')

**Answer:** Based on the topics printed above and the model visualization, the following topics can be grouped together:
- Topic- 1, 5 and 8
- Topic- 4 and 7
- Topic- 2 and 3
- Topic- 6 and 11
- Topic- 9 and 10
- Topic- 12

#### Task-10. Create a topic model using LDA with what you think is the optimal number of topics.
1.	What is the coherence of the model?

In [28]:
# From the previous step, six topics appear to be optimal for this text.
# Let us build the model again with 6 topics.

lda_model1 = gensim.models.LdaMulticore(bow_text, 
                                   num_topics = 6, 
                                   id2word = dictionary,  
                                   random_state=1,
                                   passes = 50)

for idx, topic in lda_model1.print_topics(-1):
    print("\nTopic: {} \nWords: {}".format(idx, topic ))


Topic: 0 
Words: 0.201*"battery" + 0.056*"phone" + 0.046*"hour" + 0.039*"charger" + 0.038*"backup" + 0.029*"day" + 0.029*"life" + 0.028*"charge" + 0.027*"time" + 0.025*"heat"

Topic: 1 
Words: 0.046*"screen" + 0.039*"call" + 0.038*"option" + 0.034*"speaker" + 0.033*"note" + 0.020*"feature" + 0.019*"app" + 0.019*"mobile" + 0.015*"music" + 0.015*"superb"

Topic: 2 
Words: 0.173*"product" + 0.042*"amazon" + 0.037*"service" + 0.028*"money" + 0.024*"day" + 0.023*"waste" + 0.020*"lenovo" + 0.020*"time" + 0.018*"device" + 0.018*"delivery"

Topic: 3 
Words: 0.346*"phone" + 0.060*"price" + 0.042*"feature" + 0.022*"range" + 0.019*"money" + 0.017*"value" + 0.014*"lenovo" + 0.013*"budget" + 0.012*"android" + 0.011*"stock"

Topic: 4 
Words: 0.230*"camera" + 0.094*"quality" + 0.067*"mobile" + 0.051*"battery" + 0.048*"performance" + 0.021*"sound" + 0.018*"mode" + 0.018*"display" + 0.016*"everything" + 0.013*"backup"

Topic: 5 
Words: 0.121*"problem" + 0.107*"phone" + 0.104*"issue" + 0.066*"note" + 0

In [29]:
coherence_score1 = CoherenceModel(model=lda_model1, texts=review_preprocessed, dictionary=dictionary, coherence='c_v')
coherence_lda1 = coherence_score1.get_coherence()
print('Coherence Score for new LDA model_1 is: ', coherence_lda1)

Coherence Score for new LDA model_1 is:  0.6521650462859487
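
The topic count here was chosen by inspecting the visualization. Another common check is to compare coherence across several candidate counts. The sketch below outlines such a sweep under a few assumptions: it uses the single-process `LdaModel` rather than `LdaMulticore`, the function names and the default of 10 passes are made up for illustration, and the training function is only defined, not run. The final lines compare the two scores reported in this notebook.

```python
def best_num_topics(coherence_by_k):
    """Return the topic count with the highest coherence score."""
    return max(coherence_by_k, key=coherence_by_k.get)


def sweep_coherence(bow_corpus, texts, dictionary, candidate_ks, passes=10):
    """Train one LDA model per candidate k and return {k: c_v coherence}."""
    # Imported lazily so the helper above works without Gensim installed.
    from gensim.models import CoherenceModel, LdaModel

    scores = {}
    for k in candidate_ks:
        model = LdaModel(corpus=bow_corpus, num_topics=k, id2word=dictionary,
                         random_state=1, passes=passes)
        coherence = CoherenceModel(model=model, texts=texts,
                                   dictionary=dictionary, coherence="c_v")
        scores[k] = coherence.get_coherence()
    return scores


# The two c_v scores reported in this notebook, rounded to four decimals.
reported = {12: 0.5558, 6: 0.6522}
print(best_num_topics(reported))
```

With the notebook's variables, a call such as `sweep_coherence(bow_text, review_preprocessed, dictionary, range(4, 13))` would return one score per candidate, though each model takes a while to train.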


#### Task-11. The business should be able to interpret the topics.
1. Name each of the identified topics.
2. Create a table with the topic name and the top 10 terms in each to present to the business.

In [30]:
topic_words = {}

for idx, topic in lda_model1.print_topics(-1): 
    temp = []
    for item in topic.split('+'):
        item_alpha = [letter for letter in item if letter.isalpha()]
        temp.append("".join(item_alpha))    
    topic_words[('Topic_'+str(idx+1))] = temp

topic_table = pd.DataFrame(topic_words)    
topic_table.index = ['Word_'+str(i+1) for i in range(topic_table.shape[0])]
topic_table

|   | Topic_1 | Topic_2 | Topic_3 | Topic_4 | Topic_5 | Topic_6 |
| --- | --- | --- | --- | --- | --- | --- |
| Word_1 | battery | screen | product | phone | camera | problem |
| Word_2 | phone | call | amazon | price | quality | phone |
| Word_3 | hour | option | service | feature | mobile | issue |
| Word_4 | charger | speaker | money | range | battery | note |
| Word_5 | backup | note | day | money | performance | heating |
| Word_6 | day | feature | waste | value | sound | network |
| Word_7 | life | app | lenovo | lenovo | mode | lenovo |
| Word_8 | charge | mobile | time | budget | display | update |
| Word_9 | time | music | device | android | everything | sim |
| Word_10 | heat | superb | delivery | stock | backup | software |


#### Topics according to above table:
- Topic 1: Battery Charging Time and Backup
- Topic 2: Screen and Speaker
- Topic 3: Overall Service
- Topic 4: Price and Value for Money
- Topic 5: Camera Quality & Battery Performance
- Topic 6: Problems in Phone
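
For the business-facing deliverable, the names above can be paired directly with each topic's top terms. The following is a minimal sketch that renders such a table as markdown. The helper `to_markdown_table` is hypothetical, and the names and terms are copied by hand from the outputs in this notebook rather than read from the model.

```python
topic_names = [
    "Battery Charging Time and Backup",
    "Screen and Speaker",
    "Overall Service",
    "Price and Value for Money",
    "Camera Quality & Battery Performance",
    "Problems in Phone",
]

# Top 10 terms per topic, in the same order as topic_names.
top_terms = [
    ["battery", "phone", "hour", "charger", "backup", "day", "life", "charge", "time", "heat"],
    ["screen", "call", "option", "speaker", "note", "feature", "app", "mobile", "music", "superb"],
    ["product", "amazon", "service", "money", "day", "waste", "lenovo", "time", "device", "delivery"],
    ["phone", "price", "feature", "range", "money", "value", "lenovo", "budget", "android", "stock"],
    ["camera", "quality", "mobile", "battery", "performance", "sound", "mode", "display", "everything", "backup"],
    ["problem", "phone", "issue", "note", "heating", "network", "lenovo", "update", "sim", "software"],
]


def to_markdown_table(names, terms):
    """Build a two-column markdown table: topic name and its top terms."""
    lines = ["| Topic | Top 10 terms |", "| --- | --- |"]
    for name, words in zip(names, terms):
        lines.append("| {} | {} |".format(name, ", ".join(words)))
    return "\n".join(lines)


table = to_markdown_table(topic_names, top_terms)
print(table)
```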

#### Another method to print table of topics-words:

In [31]:
lda_model1.show_topics(formatted=False)

[(0,
  [('battery', 0.20146361),
   ('phone', 0.056217175),
   ('hour', 0.046134613),
   ('charger', 0.039011948),
   ('backup', 0.037642494),
   ('day', 0.029444695),
   ('life', 0.028981982),
   ('charge', 0.02757468),
   ('time', 0.027095284),
   ('heat', 0.02451619)]),
 (1,
  [('screen', 0.04578664),
   ('call', 0.0387779),
   ('option', 0.037541877),
   ('speaker', 0.03352352),
   ('note', 0.03343935),
   ('feature', 0.0201642),
   ('app', 0.019448888),
   ('mobile', 0.018761748),
   ('music', 0.015398961),
   ('superb', 0.014905993)]),
 (2,
  [('product', 0.17258413),
   ('amazon', 0.042288005),
   ('service', 0.036922492),
   ('money', 0.028111745),
   ('day', 0.023756752),
   ('waste', 0.02300188),
   ('lenovo', 0.019892301),
   ('time', 0.01961759),
   ('device', 0.018222963),
   ('delivery', 0.018084608)]),
 (3,
  [('phone', 0.3464384),
   ('price', 0.060063317),
   ('feature', 0.04176926),
   ('range', 0.02201355),
   ('money', 0.019040186),
   ('value', 0.016864127),
   ('l

In [32]:
x = lda_model1.show_topics(formatted=False)
topics_words = [(tp[0], [wd[0] for wd in tp[1]]) for tp in x]

for topic, words in topics_words:
    print(str(topic)+"::"+str(words))
    
print()

0::['battery', 'phone', 'hour', 'charger', 'backup', 'day', 'life', 'charge', 'time', 'heat']
1::['screen', 'call', 'option', 'speaker', 'note', 'feature', 'app', 'mobile', 'music', 'superb']
2::['product', 'amazon', 'service', 'money', 'day', 'waste', 'lenovo', 'time', 'device', 'delivery']
3::['phone', 'price', 'feature', 'range', 'money', 'value', 'lenovo', 'budget', 'android', 'stock']
4::['camera', 'quality', 'mobile', 'battery', 'performance', 'sound', 'mode', 'display', 'everything', 'backup']
5::['problem', 'phone', 'issue', 'note', 'heating', 'network', 'lenovo', 'update', 'sim', 'software']



In [33]:
LDAvis_prepared1 = pyLDAvis.gensim_models.prepare(lda_model1, bow_text, dictionary)
pyLDAvis.save_html(LDAvis_prepared1, 'LDA_model_vis_1'+'.html')

#### Topics according to topic modelling visualization:
- Topic 1: Camera Quality, Battery Performance, Sound, Display
- Topic 2: Price, Feature, Money Value, Budget
- Topic 3: Problem, Heating, Network, Software
- Topic 4: Service, Waste, Delivery
- Topic 5: Battery Backup, Charge time
- Topic 6: Screen, Call, Speaker, Features