## Review Project Analysis

#### DESCRIPTION

Help a leading mobile brand understand the voice of the customer by analyzing the reviews of their product on Amazon and the topics that customers are talking about. You will perform topic modeling on specific parts of speech. Finally, you'll interpret the emerging topics.

#### Problem Statement: 

A popular mobile phone brand, Lenovo, has launched its budget smartphone in the Indian market. The client wants to understand the VOC (voice of the customer) on the product. This will be useful not just to evaluate the current product, but also to get some direction for developing the product pipeline. The client is particularly interested in the different aspects that customers care about. Product reviews by customers on a leading e-commerce site should provide a good view.

#### Domain: Amazon reviews for a leading phone brand

#### Analysis to be done: POS tagging, topic modeling using LDA, and topic interpretation

#### Content: 

Dataset: 'K8 Reviews v0.2.csv'

Columns:

Sentiment: The sentiment of the review (4- and 5-star reviews are positive; 1- and 2-star reviews are negative)

Reviews: The main text of the review

#### Steps to perform:

Discover the topics in the reviews and present them to the business in a consumable format. Employ techniques in syntactic processing and topic modeling.

Perform specific cleanup, POS tagging, and restriction to relevant POS tags, then perform topic modeling using LDA. Finally, give business-friendly names to the topics and make a table for the business.

Tasks: 

1. Read the .csv file using Pandas. Take a look at the top few records.

2. Normalize casings for the review text and extract the text into a list for easier manipulation.

3. Tokenize the reviews using NLTK's word_tokenize function.

4. Perform parts-of-speech tagging on each sentence using the NLTK POS tagger.

5. For the topic model, we want to include only nouns.

    1. Find out all the POS tags that correspond to nouns.

    2. Limit the data to only terms with these tags.

6. Lemmatize. 

    1. Different forms of the terms need to be treated as one.

    2. No need to provide POS tag to lemmatizer for now.

7. Remove stopwords and punctuation (if there are any). 

8. Create a topic model using LDA on the cleaned-up data with 12 topics.

    1. Print out the top terms for each topic.

    2. What is the coherence of the model with the c_v metric?

9. Analyze the topics through the business lens.

    1. Determine which of the topics can be combined.

10. Create a topic model using LDA with what you think is the optimal number of topics.

    1. What is the coherence of the model?

11. The business should be able to interpret the topics.

    1. Name each of the identified topics.

    2. Create a table with the topic name and the top 10 terms in each to present to the business.

#### Importing Libraries

In [1]:
import pandas as pd
import numpy as np
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.tag import pos_tag
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
import re
from pprint import pprint

# Gensim
import gensim
import gensim.corpora as corpora
from gensim.utils import simple_preprocess
from gensim.models import CoherenceModel

import pyLDAvis
# import pyLDAvis.gensim_models as gensimvis
import pyLDAvis.gensim 
pyLDAvis.enable_notebook()

import matplotlib.pyplot as plt
%matplotlib inline



#### 1. Read the .csv file using Pandas. Take a look at the top few records

In [2]:
#Read the .csv file using Pandas. Take a look at the top few records.
ReviewData = pd.read_csv(r'C:\Users\Lovely Rajput\Desktop\Natural Language Processing\project\1569836815_reviewprojectanalysis\K8 Reviews v0.2.csv')
ReviewData.head()



Unnamed: 0,sentiment,review
0,1,Good but need updates and improvements
1,0,"Worst mobile i have bought ever, Battery is dr..."
2,1,when I will get my 10% cash back.... its alrea...
3,1,Good
4,0,The worst phone everThey have changed the last...


#### 2. Normalize casings for the review text and extract the text into a list for easier manipulation.

In [3]:
def Normalize(reviews):
    NormalizeReviews = []
    for review in reviews:
        NormalizeReviews.append(review.lower())
    return NormalizeReviews
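
The same normalization can be sketched more compactly with a list comprehension. The snippet below is only an illustration: `normalize` and `sample_reviews` are hypothetical names, and the sample strings are made up rather than taken from the dataset. For a pandas column without missing values, `ReviewData['review'].str.lower().tolist()` should give an equivalent result.

```python
def normalize(reviews):
    """Lowercase every review and return the results as a list."""
    return [str(review).lower() for review in reviews]

# Hypothetical sample, for illustration only
sample_reviews = ["Good", "Battery is DRAINING fast"]
normalized = normalize(sample_reviews)
print(normalized)
```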



In [4]:
#Normalize casings for the review text and extract the text into a list for easier manipulation.
NormalizeReviewText = Normalize(ReviewData['review'].values)
NormalizeReviewText



['good but need updates and improvements',
 "worst mobile i have bought ever, battery is draining like hell, backup is only 6 to 7 hours with internet uses, even if i put mobile idle its getting discharged.this is biggest lie from amazon & lenove which is not at all expected, they are making full by saying that battery is 4000mah & booster charger is fake, it takes at least 4 to 5 hours to be fully charged.don't know how lenovo will survive by making full of us.please don;t go for this else you will regret like me.",
 'when i will get my 10% cash back.... its already 15 january..',
 'good',
 'the worst phone everthey have changed the last phone but the problem is still same and the amazon is not returning the phone .highly disappointing of amazon',
 "only i'm telling don't buyi'm totally disappointedpoor batterypoor camerawaste of money",
 'phone is awesome. but while charging, it heats up allot..really a genuine reason to hate lenovo k8 note',
 'the battery level has worn down',
 "it'

#### 4. Perform parts-of-speech tagging on each sentence using the NLTK POS tagger.

#### 5. For the topic model, we should want to include only nouns.
    1. Find out all the POS tags that correspond to nouns.

    2. Limit the data to only terms with these tags.

In [5]:
def Tokenize_POS(reviews):
    TokenizeReviews = []
    noun_tags = ('NN', 'NNP', 'NNS', 'NNPS')   # Penn Treebank noun tags
    for review in reviews:
        # Tokenize the review and tag each token with its part of speech
        for word, pos in nltk.pos_tag(nltk.word_tokenize(review)):
            if pos in noun_tags:
                # Appends the full review text once for every noun it contains
                TokenizeReviews.append(review)
    return TokenizeReviews
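
Task 5 asks for the data to be limited to terms with noun tags. One possible way to keep only the noun words themselves is sketched below. It is a minimal illustration, not the notebook's method: `NOUN_TAGS`, `keep_nouns`, and the hand-written `tagged` pairs are assumptions introduced here. In the notebook, the pairs would come from `nltk.pos_tag(nltk.word_tokenize(review))`.

```python
NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS"}  # Penn Treebank noun tags

def keep_nouns(tagged_tokens):
    """Return only the words whose POS tag is a noun tag."""
    return [word for word, tag in tagged_tokens if tag in NOUN_TAGS]

# Hand-written (word, tag) pairs in the format nltk.pos_tag returns
tagged = [("battery", "NN"), ("is", "VBZ"), ("draining", "VBG"),
          ("updates", "NNS"), ("lenovo", "NNP")]
nouns = keep_nouns(tagged)
print(nouns)
```

Applying such a filter per review would yield one list of nouns per review instead of repeated copies of the review text.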



In [6]:
#3. Tokenize the reviews using NLTKs word_tokenize function.
#4. Perform parts-of-speech tagging on each sentence using the NLTK POS tagger.
TokenizeReviews = Tokenize_POS(NormalizeReviewText)
TokenizeReviews



['good but need updates and improvements',
 'good but need updates and improvements',
 "worst mobile i have bought ever, battery is draining like hell, backup is only 6 to 7 hours with internet uses, even if i put mobile idle its getting discharged.this is biggest lie from amazon & lenove which is not at all expected, they are making full by saying that battery is 4000mah & booster charger is fake, it takes at least 4 to 5 hours to be fully charged.don't know how lenovo will survive by making full of us.please don;t go for this else you will regret like me.",
 "worst mobile i have bought ever, battery is draining like hell, backup is only 6 to 7 hours with internet uses, even if i put mobile idle its getting discharged.this is biggest lie from amazon & lenove which is not at all expected, they are making full by saying that battery is 4000mah & booster charger is fake, it takes at least 4 to 5 hours to be fully charged.don't know how lenovo will survive by making full of us.please don;

#### 6. Lemmatize.
    1. Different forms of the terms need to be treated as one.
    2. No need to provide POS tag to lemmatizer for now.


#### 7. Remove stopwords and punctuation (if there are any).

In [7]:
# function to remove stopwords
def Remove_Stopwords(word_list, lang='english'):
    """Remove stopwords from a list of words.
    Args:
        word_list : list of words
        lang      : language of the NLTK stopword list (default: 'english')
    Return:
        List of words with the stopwords removed
    """
    stopwords_list = set(stopwords.words(lang))   # a set gives fast membership tests
    return [w for w in word_list if w.lower() not in stopwords_list]



In [8]:
# function to simplify repeated punctuation
def Simplify_Punctuation(text):
    """
    This function simplifies doubled or more complex punctuation. The exception is '...'.
    """
    corrected = str(text)
    corrected = re.sub(r'([!?,;])\1+', r'\1', corrected)
    corrected = re.sub(r'\.{2,}', r'...', corrected)
    return corrected
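
To make the effect of the two substitutions concrete, here is a small self-contained sketch that applies the same regular expressions to two short strings. The function name `simplify_punctuation` and the `examples` list are illustrative assumptions.

```python
import re

def simplify_punctuation(text):
    """Collapse repeated !, ?, comma or semicolon; normalize runs of dots to '...'."""
    corrected = re.sub(r'([!?,;])\1+', r'\1', str(text))
    return re.sub(r'\.{2,}', '...', corrected)

examples = ["wow!!! really??", "cash back.... its already 15 january.."]
simplified = [simplify_punctuation(s) for s in examples]
print(simplified)
```

Note that this step only simplifies punctuation. It does not remove it.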



In [9]:
# function to lemmatize using WordNetLemmatizer
def Lemmatize_WordNet(words_list):
    wnl = WordNetLemmatizer()
    # pos="v" makes the lemmatizer treat every word as a verb
    return [wnl.lemmatize(word, pos="v") for word in words_list]



In [10]:
def tokenize(txt):
    """Split the text into sentences, strip apostrophes and punctuation,
       convert to lowercase, and tokenize each sentence into words.
    Args:
        txt  : text document
    Return:
        List of sentences, each a list of word tokens
    """
    return [word_tokenize(" ".join(re.findall(r'\w+', t,flags = re.UNICODE )).lower()) 
                for t in sent_tokenize(txt.replace("'", ""))]
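
The punctuation removal in this function comes from the `\w+` pattern, which keeps only runs of word characters. The sketch below isolates that part on a single sentence, leaving out the NLTK sentence and word tokenizers. `strip_punctuation` is a hypothetical helper introduced for illustration.

```python
import re

def strip_punctuation(sentence):
    """Drop apostrophes, keep only runs of word characters, and lowercase."""
    words = re.findall(r'\w+', sentence.replace("'", ""), flags=re.UNICODE)
    return " ".join(words).lower()

cleaned = strip_punctuation("Don't buy, I'm totally disappointed!")
print(cleaned)
```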



In [11]:
def Apply_Stopwords_punctuation_lemmatize(reviews):
    PreprocessReviews = []
    for review in reviews:
        lemmatized = []
        review = Simplify_Punctuation(review)  # Collapse repeated punctuation
        sentences = tokenize(review)           # Strip punctuation and tokenize
        for sentence in sentences:
            words = Remove_Stopwords(sentence)         # Remove stopwords
            words = Lemmatize_WordNet(words)           # Lemmatize
            # Skip short sentences with fewer than 3 words
            if len(words) < 3:
                continue
            lemmatized.append(" ".join(words))
        PreprocessReviews.append(" ".join(lemmatized))
    return PreprocessReviews



#### Lemmatize

In [12]:
PreProcessReviews = Apply_Stopwords_punctuation_lemmatize(TokenizeReviews)
PreProcessReviews



['good need update improvements',
 'good need update improvements',
 'worst mobile buy ever battery drain like hell backup 6 7 hours internet use even put mobile idle get discharge biggest lie amazon lenove expect make full say battery 4000mah booster charger fake take least 4 5 hours fully charge dont know lenovo survive make full us please go else regret like',
 'worst mobile buy ever battery drain like hell backup 6 7 hours internet use even put mobile idle get discharge biggest lie amazon lenove expect make full say battery 4000mah booster charger fake take least 4 5 hours fully charge dont know lenovo survive make full us please go else regret like',
 'worst mobile buy ever battery drain like hell backup 6 7 hours internet use even put mobile idle get discharge biggest lie amazon lenove expect make full say battery 4000mah booster charger fake take least 4 5 hours fully charge dont know lenovo survive make full us please go else regret like',
 'worst mobile buy ever battery drain 

#### 8. Create a topic model using LDA on the cleaned-up data with 12 topics.
    1. Print out the top terms for each topic.
    2. What is the coherence of the model with the c_v metric?

In [13]:
TokenizeReviews = []
for review in PreProcessReviews:
    TokenizeReviews.append(nltk.word_tokenize(review)) 
#TokenizeReviews



In [14]:
# Create Dictionary

id2word = corpora.Dictionary(TokenizeReviews)

# Create Corpus
texts = TokenizeReviews

# Term Document Frequency
corpus = [id2word.doc2bow(text) for text in texts]

# View
print(corpus[:1])
print(id2word[0])

[[(id2word[id], freq) for id, freq in cp] for cp in corpus[:1]]



[[(0, 1), (1, 1), (2, 1), (3, 1)]]
good


[[('good', 1), ('improvements', 1), ('need', 1), ('update', 1)]]
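
The output above pairs each token id with its count in the document. As a rough, plain-Python illustration of that bag-of-words idea (not gensim's actual implementation), the sketch below assigns ids to new tokens and counts occurrences per document. `build_bow` and the `docs` sample are assumptions made for the example.

```python
from collections import Counter

def build_bow(tokenized_docs):
    """Assign an integer id to each distinct token and count tokens per document."""
    token2id = {}
    corpus = []
    for tokens in tokenized_docs:
        counts = Counter(tokens)
        # Give unseen tokens the next free id, in alphabetical order
        for token in sorted(counts):
            token2id.setdefault(token, len(token2id))
        corpus.append(sorted((token2id[t], n) for t, n in counts.items()))
    return token2id, corpus

docs = [["good", "need", "update", "improvements"],
        ["good", "battery", "battery"]]
token2id, bow = build_bow(docs)
print(bow)
```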

In [15]:
# Build LDA model
lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus,
                                           id2word=id2word,
                                           num_topics=12, 
                                           random_state=100,
                                           update_every=1,
                                           chunksize=100,
                                           passes=10,
                                           alpha='auto',
                                           per_word_topics=True)



In [16]:
# Print the top keywords for each of the 12 topics
pprint(lda_model.print_topics())
doc_lda = lda_model[corpus]

[(0,
  '0.129*"lenovo" + 0.114*"note" + 0.101*"k8" + 0.083*"heat" + 0.035*"key" + '
  '0.030*"play" + 0.024*"waste" + 0.023*"review" + 0.023*"u" + 0.022*"im"'),
 (1,
  '0.090*"first" + 0.085*"touch" + 0.074*"internet" + 0.072*"please" + '
  '0.063*"would" + 0.051*"7" + 0.033*"couple" + 0.028*"complain" + '
  '0.028*"company" + 0.027*"slow"'),
 (2,
  '0.129*"work" + 0.118*"charge" + 0.075*"take" + 0.067*"bad" + 0.041*"2" + '
  '0.036*"charger" + 0.034*"turbo" + 0.033*"cant" + 0.032*"full" + '
  '0.029*"google"'),
 (3,
  '0.132*"time" + 0.083*"sensor" + 0.080*"back" + 0.075*"android" + 0.048*"mp" '
  '+ 0.048*"stock" + 0.039*"video" + 0.033*"card" + 0.032*"13" + '
  '0.031*"finger"'),
 (4,
  '0.077*"get" + 0.059*"mobile" + 0.054*"also" + 0.040*"even" + 0.033*"4" + '
  '0.030*"better" + 0.028*"5" + 0.027*"compare" + 0.026*"awesome" + '
  '0.025*"one"'),
 (5,
  '0.148*"much" + 0.110*"make" + 0.067*"life" + 0.058*"purchase" + '
  '0.045*"read" + 0.042*"provide" + 0.040*"picture" + 0.037*"su



In [17]:
# Compute the per-word log-likelihood bound (gensim derives perplexity from it)
print('\nPerplexity: ', lda_model.log_perplexity(corpus))  # values closer to zero mean lower perplexity, i.e. a better fit

# Compute Coherence Score
coherence_model_lda = CoherenceModel(model=lda_model, texts=TokenizeReviews, dictionary=id2word, coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()
print('\nCoherence Score: ', coherence_lda)




Perplexity:  -8.396436109705357

Coherence Score:  0.3110833641714682
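
The value printed as "Perplexity" is the number returned by `log_perplexity`, which is a per-word log-likelihood bound rather than perplexity itself. To the best of our understanding, gensim's own logging reports the perplexity estimate as 2 raised to the negative bound. Treat that as an assumption to verify against the installed version. A small sketch of the conversion, with `perplexity_from_bound` as a hypothetical helper:

```python
def perplexity_from_bound(per_word_bound):
    """Convert a per-word log-likelihood bound (base 2) into perplexity."""
    return 2 ** (-per_word_bound)

bound_12_topics = -8.396436109705357  # value printed above
perplexity_12 = perplexity_from_bound(bound_12_topics)
print(round(perplexity_12, 1))
```

Under this reading, a bound closer to zero corresponds to a lower perplexity.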


#### 9. Analyze the topics through the business lens.

Here are the possible topic headers. The number in parentheses is the group each topic falls into once similar topics are combined:

0 - Possible Topic - Lenovo Note K8 (1)

1 - Possible Topic - First Touch Phone (2)

2 - Possible Topic - Charging Review (3)

3 - Possible Topic - Review on sensor time (4)

4 - Possible Topic - Positive Mobile Review (5)

5 - Possible Topic - Picture quality (6)

6 - Possible Topic - Positive Review (5)

7 - Possible Topic - Review on Processor (7)

8 - Possible Topic - Positive Review (5)

9 - Possible Topic - Negative Review (8)

10 - Possible Topic - Review on Return policy (9)

11 - Possible Topic - Review on software update (10)

#### Determine which of the topics can be combined.

#### Topics 4, 6, and 8 are all positive reviews and can be combined, leaving 10 distinct topics
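
The grouping can be checked with a few lines of Python. The mapping below simply restates the numbers in parentheses from the topic headers. The variable names are illustrative.

```python
# Merged group for each of the 12 topic ids, as listed in the topic headers
topic_to_group = {0: 1, 1: 2, 2: 3, 3: 4, 4: 5, 5: 6,
                  6: 5, 7: 7, 8: 5, 9: 8, 10: 9, 11: 10}

# Collect the topic ids that fall into each merged group
groups = {}
for topic_id, group_id in topic_to_group.items():
    groups.setdefault(group_id, []).append(topic_id)

n_distinct = len(groups)
print(n_distinct, groups[5])
```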

#### 10. Create a topic model using LDA with what you think is the optimal number of topics.
    1. What is the coherence of the model?


In [18]:
# Build LDA model with 10 topics
lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus,
                                           id2word=id2word,
                                           num_topics=10, 
                                           random_state=100,
                                           update_every=1,
                                           chunksize=100,
                                           passes=10,
                                           alpha='auto',
                                           per_word_topics=True)



In [19]:
# Print the top keywords for each of the 10 topics
pprint(lda_model.print_topics())
doc_lda = lda_model[corpus]

[(0,
  '0.109*"heat" + 0.077*"product" + 0.076*"update" + 0.058*"days" + 0.044*"1" '
  '+ 0.039*"play" + 0.034*"software" + 0.033*"need" + 0.022*"user" + '
  '0.019*"ok"'),
 (1,
  '0.182*"lenovo" + 0.171*"note" + 0.150*"k8" + 0.038*"first" + 0.034*"u" + '
  '0.029*"previous" + 0.024*"mobiles" + 0.019*"still" + 0.019*"face" + '
  '0.018*"office"'),
 (2,
  '0.079*"work" + 0.073*"use" + 0.072*"charge" + 0.058*"get" + 0.045*"take" + '
  '0.039*"4" + 0.036*"2" + 0.033*"5" + 0.024*"like" + 0.022*"charger"'),
 (3,
  '0.085*"time" + 0.059*"bite" + 0.053*"sensor" + 0.052*"back" + '
  '0.048*"android" + 0.046*"image" + 0.043*"dedicate" + 0.031*"stock" + '
  '0.029*"lot" + 0.028*"music"'),
 (4,
  '0.270*"phone" + 0.063*"buy" + 0.035*"dont" + 0.033*"better" + 0.031*"get" + '
  '0.030*"compare" + 0.028*"one" + 0.023*"worst" + 0.020*"last" + '
  '0.019*"service"'),
 (5,
  '0.151*"poor" + 0.134*"dual" + 0.110*"much" + 0.082*"make" + 0.050*"life" + '
  '0.045*"8" + 0.043*"purchase" + 0.031*"provide" +



In [22]:
# Compute the per-word log-likelihood bound (gensim derives perplexity from it)
print('\nPerplexity: ', lda_model.log_perplexity(corpus))  # values closer to zero mean lower perplexity, i.e. a better fit

# Compute Coherence Score
coherence_model_lda = CoherenceModel(model=lda_model, texts=TokenizeReviews, dictionary=id2word, coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()
print('\nCoherence Score: ', coherence_lda)




Perplexity:  -7.968139323395193

Coherence Score:  0.2887103347547684


#### Evaluate LDA model

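Perplexity Score: Lower is better. The value printed by `log_perplexity` is a per-word log-likelihood bound, so a value closer to zero means lower perplexity.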

Coherence Score: Higher is better
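
Using the "higher is better" rule for coherence, the two c_v scores recorded in this notebook can be compared directly. The sketch below is only illustrative, and `best_num_topics` is a hypothetical helper. In a fuller search, one would train a model per candidate topic count and fill the dictionary from `CoherenceModel(...).get_coherence()`.

```python
def best_num_topics(coherence_by_k):
    """Return the topic count with the highest coherence score."""
    return max(coherence_by_k, key=coherence_by_k.get)

# c_v scores reported by the two runs in this notebook
coherence_by_k = {12: 0.3110833641714682, 10: 0.2887103347547684}
best_k = best_num_topics(coherence_by_k)
print(best_k)
```

On these two runs, the 12-topic model has the higher c_v coherence, so the reduction to 10 topics rests on interpretability rather than on the coherence score.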

#### 11. The business should be able to interpret the topics.
    1. Name each of the identified topics.
    2. Create a table with the topic name and the top 10 terms in each to present to the business.

#### Here are the possible topics and the top words for each topic

| Topic | Name | Top 10 terms |
| --- | --- | --- |
| 1 | General Review | heat, product, update, days, 1, play, software, need, user, ok |
| 2 | Review on Lenovo Note K8 | lenovo, note, k8, first, u, previous, mobiles, still, face, office |
| 3 | Review on Charging time | work, use, charge, get, take, 4, 2, 5, like, charger |
| 4 | Review on Sensor time | time, bite, sensor, back, android, image, dedicate, stock, lot, music |
| 5 | Negative Review | phone, buy, dont, better, get, compare, one, worst, last, service |
| 6 | Review on redmi | poor, dual, much, make, life, 8, purchase, provide, redmi, two |
| 7 | Review on camera | good, camera, quality, issue, game, also, clarity, average, screen, light |
| 8 | Review on network | doesnt, call, even, bad, network, many, cant, support, full, find |
| 9 | Review on battery life | battery, feature, mode, fast, drain, great, speed, nice, device, really |
| 10 | Review on price | mobile, amazon, problem, price, awesome, hai, return, properly, best, hang |
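
The topic names and top terms can also be rendered as a table programmatically. The following is a minimal sketch that formats (name, terms) pairs as a Markdown table. `topics_to_markdown` is a hypothetical helper, and only two of the topics are used as sample input.

```python
def topics_to_markdown(named_topics):
    """Render (name, terms) pairs as a two-column Markdown table."""
    lines = ["| Topic | Top terms |", "| --- | --- |"]
    for name, terms in named_topics:
        lines.append(f"| {name} | {', '.join(terms)} |")
    return "\n".join(lines)

# Two of the named topics, used as sample input
named_topics = [
    ("Review on camera", ["good", "camera", "quality", "issue", "game",
                          "also", "clarity", "average", "screen", "light"]),
    ("Review on network", ["doesnt", "call", "even", "bad", "network",
                           "many", "cant", "support", "full", "find"]),
]
table = topics_to_markdown(named_topics)
print(table)
```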

In [21]:
# Visualize the topics
pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim.prepare(lda_model, corpus, id2word)
vis



import re
import numpy as np
import pandas as pd
from pprint import pprint
import gensim
import gensim.corpora as corpora
from gensim.utils import simple_preprocess
from gensim.models import CoherenceModel
import spacy
import pyLDAvis
import pyLDAvis.gensim
import matplotlib.pyplot as plt
from nltk.corpus import stopwords

stop_words = stopwords.words('english')
stop_words.extend(['from', 'subject', 're', 'edu', 'use'])

# Work on the review text column rather than on the DataFrame itself
data = ReviewData['review'].astype(str).tolist()
data = [re.sub(r'\S*@\S*\s?', '', sent) for sent in data]  # remove e-mail addresses
data = [re.sub(r'\s+', ' ', sent) for sent in data]        # collapse whitespace
data = [re.sub(r"'", "", sent) for sent in data]           # remove apostrophes

# Assumption: data_words was never defined in the original code;
# simple_preprocess is used here to turn each review into a list of lowercase words
data_words = [simple_preprocess(sent, deacc=True) for sent in data]
print(data_words[:4])  # print the tokenized reviews before stopword removal

# Detect frequent bigrams and trigrams
bigram = gensim.models.Phrases(data_words, min_count=5, threshold=100)
trigram = gensim.models.Phrases(bigram[data_words], threshold=100)
bigram_mod = gensim.models.phrases.Phraser(bigram)
trigram_mod = gensim.models.phrases.Phraser(trigram)

def remove_stopwords(texts):
    return [[word for word in simple_preprocess(str(doc))
             if word not in stop_words] for doc in texts]

def make_bigrams(texts):
    return [bigram_mod[doc] for doc in texts]

def make_trigrams(texts):
    return [trigram_mod[bigram_mod[doc]] for doc in texts]

def lemmatization(texts, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV']):
    texts_out = []
    for sent in texts:
        doc = nlp(" ".join(sent))
        texts_out.append([token.lemma_ for token in doc if token.pos_ in allowed_postags])
    return texts_out

data_words_nostops = remove_stopwords(data_words)
data_words_bigrams = make_bigrams(data_words_nostops)

# Load the spaCy model before lemmatization() is called
nlp = spacy.load('en_core_web_md', disable=['parser', 'ner'])
data_lemmatized = lemmatization(data_words_bigrams, allowed_postags=[
    'NOUN', 'ADJ', 'VERB', 'ADV'
])
print(data_lemmatized[:4])  # print the lemmatized data

id2word = corpora.Dictionary(data_lemmatized)
texts = data_lemmatized
corpus = [id2word.doc2bow(text) for text in texts]
print(corpus[:4])  # print the corpus created above

# Show the words with their frequencies
[[(id2word[id], freq) for id, freq in cp] for cp in corpus[:4]]

lda_model = gensim.models.ldamodel.LdaModel(
    corpus=corpus, id2word=id2word, num_topics=20, random_state=100,
    update_every=1, chunksize=100, passes=10, alpha='auto', per_word_topics=True
)