# Project Review Project Analysis
## Author: Jessica Bosch
## Last Update: May 21, 2021

## Description

Help a leading mobile brand understand the voice of the customer by analyzing the reviews of their product on Amazon and the topics that customers are talking about. You will perform topic modeling on specific parts of speech and then interpret the emerging topics.

**Problem Statement:**
A popular mobile phone brand, Lenovo, has launched its budget smartphone in the Indian market. The client wants to understand the VOC (voice of the customer) on the product. This will be useful not just to evaluate the current product, but also to get some direction for developing the product pipeline. The client is particularly interested in the different aspects that customers care about. Product reviews by customers on a leading e-commerce site should provide a good view.

**Domain:** Amazon reviews for a leading phone brand

**Analysis to be done:** POS tagging, topic modeling using LDA, and topic interpretation

**Content:**
- Dataset: ‘K8 Reviews v0.2.csv’
- Columns:
    - Sentiment: The sentiment label of the review (4- and 5-star reviews are positive; 1- and 2-star reviews are negative)
    - Reviews: The main text of the review

**Steps to perform:**
- Discover the topics in the reviews and present them to the business in a consumable format. Employ techniques in syntactic processing and topic modeling.
- Perform specific cleanup and POS tagging, restrict the data to the relevant POS tags, and then perform topic modeling using LDA. Finally, give business-friendly names to the topics and make a table for the business.

**Tasks:**
1. Read the .csv file using Pandas. Take a look at the top few records.


2. Normalize casings for the review text and extract the text into a list for easier manipulation.


3. Tokenize the reviews using NLTK's word_tokenize function.


4. Perform parts-of-speech tagging on each sentence using the NLTK POS tagger.


5. For the topic model, we want to include only nouns.
    1. Find out all the POS tags that correspond to nouns.
    2. Limit the data to only terms with these tags.


6. Lemmatize. 
    1. Different forms of the terms need to be treated as one.
    2. No need to provide POS tag to lemmatizer for now.


7. Remove stopwords and punctuation (if there are any). 


8. Create a topic model using LDA on the cleaned-up data with 12 topics.
    1. Print out the top terms for each topic.
    2. What is the coherence of the model with the c_v metric?


9. Analyze the topics through the business lens.
    1. Determine which of the topics can be combined.


10. Create a topic model using LDA with what you think is the optimal number of topics.
    1. What is the coherence of the model?


11. The business should be able to interpret the topics.
    1. Name each of the identified topics.
    2. Create a table with the topic name and the top 10 terms in each to present to the business.

In [1]:
import pandas as pd
import numpy as np

import nltk
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
from string import punctuation

from gensim.corpora.dictionary import Dictionary
from gensim import models, similarities
from gensim.models import CoherenceModel

### 1. Read the .csv file using Pandas. Take a look at the top few records.

In [2]:
df = pd.read_csv('K8 Reviews v0.2.csv')
df.head()

| | sentiment | review |
|---|---|---|
| 0 | 1 | Good but need updates and improvements |
| 1 | 0 | Worst mobile i have bought ever, Battery is dr... |
| 2 | 1 | when I will get my 10% cash back.... its alrea... |
| 3 | 1 | Good |
| 4 | 0 | The worst phone everThey have changed the last... |


In [3]:
# data types
df.dtypes

sentiment     int64
review       object
dtype: object

In [4]:
# size of dataset
df.shape

(14675, 2)

In [5]:
# check dataset for missing values
df.isnull().sum()

sentiment    0
review       0
dtype: int64

No missing values in the dataset.

### 2. Normalize casings for the review text and extract the text into a list for easier manipulation.

In [6]:
# convert review text to lower case
df['review'] = df['review'].str.lower()
df.head(10)

| | sentiment | review |
|---|---|---|
| 0 | 1 | good but need updates and improvements |
| 1 | 0 | worst mobile i have bought ever, battery is dr... |
| 2 | 1 | when i will get my 10% cash back.... its alrea... |
| 3 | 1 | good |
| 4 | 0 | the worst phone everthey have changed the last... |
| 5 | 0 | only i'm telling don't buyi'm totally disappoi... |
| 6 | 1 | phone is awesome. but while charging, it heats... |
| 7 | 0 | the battery level has worn down |
| 8 | 0 | it's over hitting problems...and phone hanging... |
| 9 | 0 | a lot of glitches dont buy this thing better g... |


In [7]:
# extract review text into a list
df_list = list(df['review'])
df_list

['good but need updates and improvements',
 "worst mobile i have bought ever, battery is draining like hell, backup is only 6 to 7 hours with internet uses, even if i put mobile idle its getting discharged.this is biggest lie from amazon & lenove which is not at all expected, they are making full by saying that battery is 4000mah & booster charger is fake, it takes at least 4 to 5 hours to be fully charged.don't know how lenovo will survive by making full of us.please don;t go for this else you will regret like me.",
 'when i will get my 10% cash back.... its already 15 january..',
 'good',
 'the worst phone everthey have changed the last phone but the problem is still same and the amazon is not returning the phone .highly disappointing of amazon',
 "only i'm telling don't buyi'm totally disappointedpoor batterypoor camerawaste of money",
 'phone is awesome. but while charging, it heats up allot..really a genuine reason to hate lenovo k8 note',
 'the battery level has worn down',
 "it'

In [8]:
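# remove emojis
import emoji

def remove_emoji(text):
    # emoji.get_emoji_regexp() was removed in emoji 2.0;
    # emoji.replace_emoji() is the replacement in current releases
    return emoji.replace_emoji(text, replace='')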

no_emoji_list = [remove_emoji(sentence) for sentence in df_list]
no_emoji_list

['good but need updates and improvements',
 "worst mobile i have bought ever, battery is draining like hell, backup is only 6 to 7 hours with internet uses, even if i put mobile idle its getting discharged.this is biggest lie from amazon & lenove which is not at all expected, they are making full by saying that battery is 4000mah & booster charger is fake, it takes at least 4 to 5 hours to be fully charged.don't know how lenovo will survive by making full of us.please don;t go for this else you will regret like me.",
 'when i will get my 10% cash back.... its already 15 january..',
 'good',
 'the worst phone everthey have changed the last phone but the problem is still same and the amazon is not returning the phone .highly disappointing of amazon',
 "only i'm telling don't buyi'm totally disappointedpoor batterypoor camerawaste of money",
 'phone is awesome. but while charging, it heats up allot..really a genuine reason to hate lenovo k8 note',
 'the battery level has worn down',
 "it'

### 3. Tokenize the reviews using NLTK's word_tokenize function.

In [9]:
tokenized_reviews = [nltk.word_tokenize(sentence) for sentence in no_emoji_list]
tokenized_reviews

[['good', 'but', 'need', 'updates', 'and', 'improvements'],
 ['worst',
  'mobile',
  'i',
  'have',
  'bought',
  'ever',
  ',',
  'battery',
  'is',
  'draining',
  'like',
  'hell',
  ',',
  'backup',
  'is',
  'only',
  '6',
  'to',
  '7',
  'hours',
  'with',
  'internet',
  'uses',
  ',',
  'even',
  'if',
  'i',
  'put',
  'mobile',
  'idle',
  'its',
  'getting',
  'discharged.this',
  'is',
  'biggest',
  'lie',
  'from',
  'amazon',
  '&',
  'lenove',
  'which',
  'is',
  'not',
  'at',
  'all',
  'expected',
  ',',
  'they',
  'are',
  'making',
  'full',
  'by',
  'saying',
  'that',
  'battery',
  'is',
  '4000mah',
  '&',
  'booster',
  'charger',
  'is',
  'fake',
  ',',
  'it',
  'takes',
  'at',
  'least',
  '4',
  'to',
  '5',
  'hours',
  'to',
  'be',
  'fully',
  'charged.do',
  "n't",
  'know',
  'how',
  'lenovo',
  'will',
  'survive',
  'by',
  'making',
  'full',
  'of',
  'us.please',
  'don',
  ';',
  't',
  'go',
  'for',
  'this',
  'else',
  'you',
  'will
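
The token list above contains fused tokens such as `'discharged.this'`, `'charged.do'` and `'us.please'`, because reviewers often omit the space after a full stop. One possible pre-tokenization fix is sketched below. The helper name `split_fused_sentences` and its regular expression are assumptions for illustration and are not part of the original notebook; the pattern relies on the text already being lower-cased and would also split abbreviations such as "e.g.".

```python
import re

# Hypothetical helper (not in the original notebook): insert a space after a
# full stop that sits directly between two lowercase letters.
_FUSED_STOP = re.compile(r'(?<=[a-z])\.(?=[a-z])')

def split_fused_sentences(text):
    """Separate sentences that were typed without a space after the period."""
    return _FUSED_STOP.sub('. ', text)

sample = "its getting discharged.this is biggest lie. battery is 4000mah, v0.2"
fixed = split_fused_sentences(sample)
print(fixed)
```

If used, it would be applied to each review before `nltk.word_tokenize`, so that the tagger sees `discharged` and `this` as separate tokens. Digits around a period (as in `v0.2`) are left untouched.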

### 4. Perform parts-of-speech tagging on each sentence using the NLTK POS tagger.

In [10]:
tagged_sentences = [nltk.pos_tag(sentence) for sentence in tokenized_reviews]
tagged_sentences

[[('good', 'JJ'),
  ('but', 'CC'),
  ('need', 'VBP'),
  ('updates', 'NNS'),
  ('and', 'CC'),
  ('improvements', 'NNS')],
 [('worst', 'JJS'),
  ('mobile', 'NN'),
  ('i', 'NN'),
  ('have', 'VBP'),
  ('bought', 'VBN'),
  ('ever', 'RB'),
  (',', ','),
  ('battery', 'NN'),
  ('is', 'VBZ'),
  ('draining', 'VBG'),
  ('like', 'IN'),
  ('hell', 'NN'),
  (',', ','),
  ('backup', 'NN'),
  ('is', 'VBZ'),
  ('only', 'RB'),
  ('6', 'CD'),
  ('to', 'TO'),
  ('7', 'CD'),
  ('hours', 'NNS'),
  ('with', 'IN'),
  ('internet', 'JJ'),
  ('uses', 'NNS'),
  (',', ','),
  ('even', 'RB'),
  ('if', 'IN'),
  ('i', 'JJ'),
  ('put', 'VBP'),
  ('mobile', 'JJ'),
  ('idle', 'NN'),
  ('its', 'PRP$'),
  ('getting', 'VBG'),
  ('discharged.this', 'NN'),
  ('is', 'VBZ'),
  ('biggest', 'JJS'),
  ('lie', 'NN'),
  ('from', 'IN'),
  ('amazon', 'NN'),
  ('&', 'CC'),
  ('lenove', 'NN'),
  ('which', 'WDT'),
  ('is', 'VBZ'),
  ('not', 'RB'),
  ('at', 'IN'),
  ('all', 'DT'),
  ('expected', 'VBN'),
  (',', ','),
  ('they', 'PRP'),


### 5. For the topic model, we want to include only nouns.

In [11]:
# Penn Treebank noun tags: singular, plural, proper singular, proper plural
noun_tags = {'NN', 'NNS', 'NNP', 'NNPS'}

noun_list = [[(word, tag) for word, tag in sent if tag in noun_tags]
             for sent in tagged_sentences]

noun_list

[[('updates', 'NNS'), ('improvements', 'NNS')],
 [('mobile', 'NN'),
  ('i', 'NN'),
  ('battery', 'NN'),
  ('hell', 'NN'),
  ('backup', 'NN'),
  ('hours', 'NNS'),
  ('uses', 'NNS'),
  ('idle', 'NN'),
  ('discharged.this', 'NN'),
  ('lie', 'NN'),
  ('amazon', 'NN'),
  ('lenove', 'NN'),
  ('battery', 'NN'),
  ('charger', 'NN'),
  ('hours', 'NNS'),
  ('don', 'NN')],
 [('i', 'NN'), ('%', 'NN'), ('cash', 'NN'), ('..', 'NN')],
 [],
 [('phone', 'NN'),
  ('everthey', 'NN'),
  ('phone', 'NN'),
  ('problem', 'NN'),
  ('amazon', 'NN'),
  ('phone', 'NN'),
  ('amazon', 'NN')],
 [('camerawaste', 'NN'), ('money', 'NN')],
 [('phone', 'NN'),
  ('allot', 'NN'),
  ('..', 'NNP'),
  ('reason', 'NN'),
  ('k8', 'NNS')],
 [('battery', 'NN'), ('level', 'NN')],
 [('problems', 'NNS'),
  ('phone', 'NN'),
  ('hanging', 'NN'),
  ('problems', 'NNS'),
  ('note', 'NN'),
  ('station', 'NN'),
  ('ahmedabad', 'NN'),
  ('years', 'NNS'),
  ('phone', 'NN'),
  ('lenovo', 'NN')],
 [('lot', 'NN'), ('glitches', 'NNS'), ('thing',
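
Task 5 asks which POS tags correspond to nouns. In the Penn Treebank tag set used by the NLTK tagger these are the four tags below. The following is a small self-contained sketch of the same filter on the first tagged review; the names `NOUN_TAGS` and `keep_nouns` are illustrative and not part of the notebook.

```python
# Penn Treebank noun tags with their standard descriptions.
NOUN_TAGS = {
    'NN': 'noun, common, singular or mass',
    'NNS': 'noun, common, plural',
    'NNP': 'noun, proper, singular',
    'NNPS': 'noun, proper, plural',
}

def keep_nouns(tagged_sentence):
    """Return only the (token, tag) pairs whose tag is a noun tag."""
    return [(token, tag) for token, tag in tagged_sentence if tag in NOUN_TAGS]

# First tagged review, copied from the step 4 output.
tagged = [('good', 'JJ'), ('but', 'CC'), ('need', 'VBP'),
          ('updates', 'NNS'), ('and', 'CC'), ('improvements', 'NNS')]
nouns = keep_nouns(tagged)
print(nouns)
```

As the output above shows, the tagger also labels tokens such as `'i'`, `'%'` and `'..'` as nouns, which is why the later stopword and punctuation cleanup is still needed.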

### 6. Lemmatize.

In [12]:
wnl = WordNetLemmatizer()

lemma_list = [[wnl.lemmatize(word[0]) for word in sent] for sent in noun_list]

lemma_list

[['update', 'improvement'],
 ['mobile',
  'i',
  'battery',
  'hell',
  'backup',
  'hour',
  'us',
  'idle',
  'discharged.this',
  'lie',
  'amazon',
  'lenove',
  'battery',
  'charger',
  'hour',
  'don'],
 ['i', '%', 'cash', '..'],
 [],
 ['phone', 'everthey', 'phone', 'problem', 'amazon', 'phone', 'amazon'],
 ['camerawaste', 'money'],
 ['phone', 'allot', '..', 'reason', 'k8'],
 ['battery', 'level'],
 ['problem',
  'phone',
  'hanging',
  'problem',
  'note',
  'station',
  'ahmedabad',
  'year',
  'phone',
  'lenovo'],
 ['lot', 'glitch', 'thing', 'option'],
 ['wrost'],
 ['phone', 'charger', 'damage', 'month'],
 ['item', 'battery', 'life'],
 ['i',
  'battery',
  'problem',
  'motherboard',
  'problem',
  'month',
  'mobile',
  'life'],
 ['phone', 'slim', 'battry', 'backup', 'screen'],
 ['headset'],
 ['time', 'i'],
 ['product',
  'prize',
  'range',
  'specification',
  'comparison',
  'mobile',
  'range',
  'i',
  'phone',
  'seal',
  'i',
  'credit',
  'card',
  'i',
  '..',
  '..

### 7. Remove stopwords and punctuation (if there are any).

In [13]:
# English stopwords
stopwords_en = stopwords.words('english')

# add additional stopwords based on the data
stopwords_en.extend(['mobile', 'phone', 'problem', 'issue', 'note', 'product', 'day', 'box', 'experience', 'month', 
                     'option', 'time'])
stopwords_en = set(stopwords_en)

# combine punctuation with stopwords
stopwords_en_withpunct = stopwords_en.union(set(punctuation))

# remove: stopwords and punctuation, all tokens that are not alphabetic, and all words with fewer than 3 characters
clean_list = [[word for word in sent if (word not in stopwords_en_withpunct) and (word.isalpha()) and (len(word) > 2)]
               for sent in lemma_list]
              
clean_list   

[['update', 'improvement'],
 ['battery',
  'hell',
  'backup',
  'hour',
  'idle',
  'lie',
  'amazon',
  'lenove',
  'battery',
  'charger',
  'hour'],
 ['cash'],
 [],
 ['everthey', 'amazon', 'amazon'],
 ['camerawaste', 'money'],
 ['allot', 'reason'],
 ['battery', 'level'],
 ['hanging', 'station', 'ahmedabad', 'year', 'lenovo'],
 ['lot', 'glitch', 'thing'],
 ['wrost'],
 ['charger', 'damage'],
 ['item', 'battery', 'life'],
 ['battery', 'motherboard', 'life'],
 ['slim', 'battry', 'backup', 'screen'],
 ['headset'],
 [],
 ['prize',
  'range',
  'specification',
  'comparison',
  'range',
  'seal',
  'credit',
  'card',
  'deal',
  'amazon'],
 ['battery', 'solution', 'battery', 'life'],
 ['smartphone'],
 [],
 ['galery', 'speaker'],
 ['camera', 'battery'],
 [],
 ['camera', 'battery'],
 ['cast', 'screen', 'wifi', 'call', 'hotspot'],
 ['usb', 'cable'],
 ['price', 'lenovo', 'display'],
 ['specification', 'function'],
 ['fon', 'fon', 'speekars'],
 ['color', 'screen'],
 ['update',
  'oreo',
  'b
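
The extra stopwords in the cell above (`'mobile'`, `'phone'`, `'problem'`, ...) were chosen based on the data. One way to surface such candidates, sketched here as an assumption rather than as the method actually used, is to count in how many reviews each lemma occurs and inspect the most frequent ones. The function name `document_frequency` and the small sample are illustrative.

```python
from collections import Counter

def document_frequency(docs):
    """Count in how many documents each token appears (once per document)."""
    counts = Counter()
    for doc in docs:
        counts.update(set(doc))
    return counts

# A few lemmatized reviews copied from the step 6 output.
sample_docs = [
    ['phone', 'everthey', 'phone', 'problem', 'amazon', 'phone', 'amazon'],
    ['phone', 'allot', '..', 'reason', 'k8'],
    ['battery', 'level'],
    ['phone', 'charger', 'damage', 'month'],
    ['item', 'battery', 'life'],
]
doc_freq = document_frequency(sample_docs)
print(doc_freq.most_common(2))
```

On the full `lemma_list`, terms that appear in a large share of reviews but carry no aspect information (such as `phone`) are natural stopword candidates, while frequent aspect terms (such as `battery`) should be kept.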

### 8. Create a topic model using LDA on the cleaned-up data with 12 topics.

In [14]:
# create a corpus from a list of texts
# transform the documents to a vectorized form
# Bag-of-words representation of the documents
dictionary = Dictionary(clean_list)
corpus = [dictionary.doc2bow(text) for text in clean_list]
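
To make the bag-of-words step concrete, here is a minimal pure-Python sketch of the idea behind `Dictionary` and `doc2bow`: every distinct token gets an integer id, and each document becomes a list of `(token_id, count)` pairs. The helpers `build_vocab` and `to_bow` are hypothetical stand-ins, and the ids they assign will generally differ from the ones gensim assigns.

```python
from collections import Counter

def build_vocab(docs):
    """Assign an integer id to every distinct token, in first-seen order."""
    token2id = {}
    for doc in docs:
        for token in doc:
            token2id.setdefault(token, len(token2id))
    return token2id

def to_bow(doc, token2id):
    """Return sorted (token_id, count) pairs for the known tokens in doc."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())

# Small sample in the same shape as clean_list.
docs = [['update', 'improvement'],
        ['battery', 'hell', 'backup', 'hour', 'battery', 'hour'],
        ['battery', 'level']]
vocab = build_vocab(docs)
bows = [to_bow(d, vocab) for d in docs]
print(bows[1])
```

Word order is discarded in this representation; only the per-document counts are passed on to LDA.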

In [15]:
# train the model on the corpus
lda = models.ldamodel.LdaModel(corpus, id2word=dictionary, num_topics=12, chunksize=100, passes=10)

#### 8.A Print out the top terms for each topic.

In [16]:
for topic in lda.show_topics(num_topics=12, num_words=5, formatted=False):
    print(topic,'\n')

(0, [('feature', 0.24322882), ('heating', 0.16471389), ('everything', 0.07270336), ('model', 0.040916547), ('headphone', 0.03558348)]) 

(1, [('performance', 0.16280702), ('charger', 0.07899474), ('waste', 0.05721072), ('turbo', 0.0433842), ('handset', 0.041374978)]) 

(2, [('price', 0.20818011), ('network', 0.13492903), ('range', 0.08370599), ('review', 0.038749248), ('card', 0.037778687)]) 

(3, [('lenovo', 0.1571492), ('charge', 0.061939955), ('sim', 0.059710395), ('hai', 0.046061322), ('game', 0.035799395)]) 

(4, [('battery', 0.29471016), ('update', 0.05396953), ('hour', 0.051154304), ('software', 0.04335487), ('life', 0.040314347)]) 

(5, [('screen', 0.08158129), ('processor', 0.055387765), ('display', 0.055002693), ('speaker', 0.052000154), ('music', 0.040605176)]) 

(6, [('money', 0.13709873), ('heat', 0.09322205), ('video', 0.0601744), ('lot', 0.059423015), ('value', 0.04740173)]) 

(7, [('amazon', 0.1455861), ('device', 0.12201087), ('mode', 0.09872138), ('usage', 0.06454679)

#### 8.B What is the coherence of the model with the c_v metric?

In [17]:
coherence_model_lda = CoherenceModel(model=lda, texts=clean_list, dictionary=dictionary, coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()
print('Coherence:', coherence_lda)

Coherence: 0.3956522226664394


### 9. Analyze the topics through the business lens.

Determine which of the topics can be combined.

Currently, the topics - based on the top terms - are:
1. General
2. Battery
3. Price
4. Service
5. Battery
6. Screen
7. Price
8. Service
9. Service
10. Service
11. Camera
12. Battery

So, we can combine:
- Topics 2 & 5 & 12
- Topics 3 & 7
- Topics 4 & 8 & 9 & 10

A better output would be something like:
1. Battery
2. Price
3. Service
4. Screen
5. Camera

I think ~5 is the optimal number of topics.
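
The grouping above can also be written down programmatically, which helps when the labels are revised. This is only an illustrative sketch: `topic_labels` repeats the provisional labels from the list above, using the same 1-based numbering.

```python
from collections import defaultdict

# Provisional labels for the 12 topics, as listed above (1-based numbering).
topic_labels = ['General', 'Battery', 'Price', 'Service', 'Battery', 'Screen',
                'Price', 'Service', 'Service', 'Service', 'Camera', 'Battery']

# Group topic numbers that share the same label.
merged = defaultdict(list)
for number, label in enumerate(topic_labels, start=1):
    merged[label].append(number)

for label, numbers in merged.items():
    print(label, numbers)
```

This yields six distinct labels; leaving out the catch-all "General" label gives the five aspect topics listed above.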

### 10. Create a topic model using LDA with what you think is the optimal number of topics

In [18]:
# this function is for parameter tuning (parameter=number of topics)
# note: the coherence model reads the global clean_list as its texts
def parameter_tuning(corpus, dictionary, k):
    lda_model = models.ldamodel.LdaModel(corpus=corpus,
                                         id2word=dictionary,
                                         num_topics=k, 
                                         random_state=1,
                                         chunksize=100,
                                         passes=10)
    
    coherence_model_lda = CoherenceModel(model=lda_model, texts=clean_list, dictionary=dictionary, coherence='c_v')
    
    return coherence_model_lda.get_coherence()

In [19]:
# topics range
min_topics = 3
max_topics = 11
step_size = 1
topics_range = range(min_topics, max_topics, step_size)

# build dictionary for topics and coherence
model_results = {'Topics': [], 'Coherence': []}

# iterate through the number of topics and compute coherence to find the best number of topics
for k in topics_range:
    # get the coherence
    coherence = parameter_tuning(corpus=corpus, dictionary=dictionary, k=k)
    # add results to our dictionary
    model_results['Topics'].append(k)
    model_results['Coherence'].append(coherence)

# print results
pd.DataFrame(model_results)

| | Topics | Coherence |
|---|---|---|
| 0 | 3 | 0.53034 |
| 1 | 4 | 0.494726 |
| 2 | 5 | 0.457784 |
| 3 | 6 | 0.460135 |
| 4 | 7 | 0.440915 |
| 5 | 8 | 0.438901 |
| 6 | 9 | 0.450048 |
| 7 | 10 | 0.401571 |


The best coherence score is for 3 topics, but I think that is too few topics for a useful business interpretation. I will keep 5 as the number of topics.
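
The trade-off between coherence and business granularity can be made explicit. The sketch below picks the smallest number of topics at or above a minimum that scores within a tolerance of the best eligible coherence. The function `pick_num_topics` and the values `min_k=5` and `tol=0.01` are assumptions chosen for illustration, not settings from the analysis itself.

```python
# Coherence (c_v) per number of topics, copied from the tuning results.
coherence_by_k = {3: 0.53034, 4: 0.494726, 5: 0.457784, 6: 0.460135,
                  7: 0.440915, 8: 0.438901, 9: 0.450048, 10: 0.401571}

def pick_num_topics(scores, min_k, tol):
    """Smallest k >= min_k whose score is within tol of the best such score."""
    eligible = {k: s for k, s in scores.items() if k >= min_k}
    best = max(eligible.values())
    return min(k for k, s in eligible.items() if best - s <= tol)

best_overall = max(coherence_by_k, key=coherence_by_k.get)
chosen = pick_num_topics(coherence_by_k, min_k=5, tol=0.01)
print(best_overall, chosen)
```

Under these assumptions the unconstrained optimum is 3 topics, while the constrained choice is 5: the 6-topic model scores only about 0.002 higher, which is well inside the tolerance.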

In [30]:
# train the model on the corpus
lda_mod = models.ldamodel.LdaModel(corpus, id2word=dictionary, num_topics=5, chunksize=100, passes=10)

In [31]:
# print top terms for each topic
for topic in lda_mod.show_topics(num_topics=5, num_words=5, formatted=False):
    print(topic,'\n')

(0, [('battery', 0.19497578), ('performance', 0.054742206), ('backup', 0.032805193), ('device', 0.03046186), ('heat', 0.027017077)]) 

(1, [('heating', 0.05501778), ('network', 0.05380345), ('update', 0.04042241), ('software', 0.032474056), ('superb', 0.021100897)]) 

(2, [('camera', 0.22879432), ('quality', 0.09056122), ('price', 0.060603615), ('screen', 0.03556442), ('display', 0.023980623)]) 

(3, [('feature', 0.067548044), ('money', 0.04609829), ('amazon', 0.042172596), ('lenovo', 0.040639285), ('service', 0.03651178)]) 

(4, [('call', 0.04564232), ('processor', 0.03234879), ('speaker', 0.030371679), ('sound', 0.02979413), ('everything', 0.023756728)]) 



#### 10.A What is the coherence of the model?

In [32]:
coherence_model_lda_mod = CoherenceModel(model=lda_mod, texts=clean_list, dictionary=dictionary, coherence='c_v')
coherence_lda_mod = coherence_model_lda_mod.get_coherence()
print('Coherence:', coherence_lda_mod)

Coherence: 0.4923424136157353


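The second model, with 5 topics, has a better coherence value (0.492) than the model with 12 topics (0.396). This value differs from the 5-topic row in the tuning results (0.458), most likely because the final model was trained without a fixed `random_state`.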

### 11. The business should be able to interpret the topics.

#### 11.A Name each of the identified topics.

1. Battery
2. Network/Software
3. Camera
4. Service
5. General

#### 11.B Create a table with the topic name and the top 10 terms in each to present to the business.

In [35]:
topic_names = ['Battery', 'Network/Software', 'Camera', 'Service', 'General']

In [36]:
my_dict = {'Topic_' + str(i): [token for token, score in lda_mod.show_topic(i, topn=10)] for i in range(0, lda_mod.num_topics)}

In [37]:
topics_df = pd.DataFrame(my_dict)
topics_df.columns = topic_names
topics_df

| | Battery | Network/Software | Camera | Service | General |
|---|---|---|---|---|---|
| 0 | battery | heating | camera | feature | call |
| 1 | performance | network | quality | money | processor |
| 2 | backup | update | price | amazon | speaker |
| 3 | device | software | screen | lenovo | sound |
| 4 | heat | superb | display | service | everything |
| 5 | mode | bit | speed | charger | music |
| 6 | range | value | video | charge | app |
| 7 | hour | budget | charging | waste | customer |
| 8 | life | game | glass | delivery | ram |
| 9 | sim | photo | android | hai | support |
