 ## Overall Observation:
 ### Data pre-processing
    1. Sentence tokenization: in 2404 sentences sent_tokenize failed to split because the review has no space after the period, e.g. 'them.you', 'segment.using'
    2. 37 occurrences where a comma could not be separated from the adjoining word, e.g. 'back,2mp', 'september,2017'
    3. 37 non-English words in Indian local languages
    4. 217 words with emoji and symbols
    5. A few single-character tokens need treatment, and Hindi words written in Latin script need removal
        
### Model Overview
    1. The first model was created with 12 topics. 3 topics are clearly separated, but the remaining 9 overlap heavily with each other, which suggests we need far fewer than 12 topics
    2. The second model was created with 4 topics. It performed better, but the coherence score was only around 0.41, and two topics almost overlapped with each other.
    3. The third model was created with 3 topics. This time we got non-overlapping topics with a coherence score of 0.512
    4. After tuning a few parameters, I achieved a coherence score of 0.56
    
    **Final Model: Final model has 3 topics and a coherence score of 0.56**
    
### Topics Overview:
    The 3 topics can be categorized as below. Their top 10 words are listed as well.
        1. Customer Service Experience
        2. Phone performance and hardware
        3. Phone features and user experience 

    
            Customer Service Experience
            -------------------------------
                product
                problem
                issue
                money
                heating
                network
                month
                time
                day
                service
            -------------------------------


             Phone performance and hardware
            -------------------------------
                camera
                battery
                quality
                price
                feature
                performance
                backup
                day
                hour
                range
            -------------------------------


             Phone features and user experience
            -------------------------------
                note
                lenovo
                screen
                call
                update
                software
                charger
                sim
                option
                delivery
            -------------------------------


        

In [None]:
import os
os.chdir(r'E:\Simplilearn\Cohort 3 - Jan\PG DS - NLP _ Jul 25 - Aug 23 _ Shanti Swaroop (Cohort 3)\Project')

## Reading File

In [2]:
import pandas as pd
import numpy as np

In [3]:
data = pd.read_csv('K8 Reviews v0.2.csv')

In [4]:
data.head()

   sentiment  review
0          1  Good but need updates and improvements
1          0  Worst mobile i have bought ever, Battery is dr...
2          1  when I will get my 10% cash back.... its alrea...
3          1  Good
4          0  The worst phone everThey have changed the last...


In [5]:
data.shape

(14675, 2)

### Checking class balance

In [6]:
data.sentiment.value_counts()

0    7712
1    6963
Name: sentiment, dtype: int64
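The raw counts are easier to judge as proportions. The sketch below recomputes the class shares from the two counts printed above; the dictionary is typed in by hand for illustration, and in the notebook itself `data.sentiment.value_counts(normalize=True)` would give the same shares directly.

```python
# Counts copied from the value_counts() output above.
counts = {0: 7712, 1: 6963}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}

for label, share in shares.items():
    print(f'sentiment={label}: {share:.1%}')
```

A split of roughly 53% to 47% is close to balanced, so no resampling looks necessary for this analysis.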

In [7]:
data.shape

(14675, 2)

### Convert to lower case

In [8]:
reviews_lower = list(data.review.str.lower().values)

In [9]:
reviews_lower[0:3]

['good but need updates and improvements',
 "worst mobile i have bought ever, battery is draining like hell, backup is only 6 to 7 hours with internet uses, even if i put mobile idle its getting discharged.this is biggest lie from amazon & lenove which is not at all expected, they are making full by saying that battery is 4000mah & booster charger is fake, it takes at least 4 to 5 hours to be fully charged.don't know how lenovo will survive by making full of us.please don;t go for this else you will regret like me.",
 'when i will get my 10% cash back.... its already 15 january..']

### Special processing to keep sentence structure proper

In [10]:
import re

In [11]:
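# Raw strings avoid invalid-escape warnings for \. \w \s \d
patt_1 = re.compile(r'\.')       # periods
patt_2 = re.compile(r',')        # commas
patt_3 = re.compile(r'[^\w\s]')  # any remaining punctuation or symbol
patt_4 = re.compile(r'\d')       # digits

# Pad periods and commas with spaces so fused words ('them.you') come apart
reviews_lower = [patt_1.sub(' . ', sent) for sent in reviews_lower]
reviews_lower = [patt_2.sub(' , ', sent) for sent in reviews_lower]
# Note: this step also deletes the padded '.' and ',' themselves
reviews_lower = [patt_3.sub('', sent) for sent in reviews_lower]
reviews_lower = [patt_4.sub('', sent) for sent in reviews_lower]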
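Because the third substitution removes every punctuation mark, the sentence boundaries are gone by the time `sent_tokenize` runs. One possible alternative, shown here only as a sketch, is to repair the fused boundaries and keep the punctuation. The pattern names and the `split_fused` helper are hypothetical and not part of the original notebook.

```python
import re

# Assumed patterns: a period between two lowercase letters, and a comma
# wedged between two non-space characters.
FUSED_PERIOD = re.compile(r'(?<=[a-z])\.(?=[a-z])')
FUSED_COMMA = re.compile(r'(?<=\S),(?=\S)')

def split_fused(text):
    """Insert the missing space after a fused period or comma."""
    text = FUSED_PERIOD.sub('. ', text)
    text = FUSED_COMMA.sub(', ', text)
    return text

examples = ['them.you', 'segment.using', 'back,2mp', 'september,2017']
fixed = [split_fused(e) for e in examples]
print(fixed)
```

A caveat: the period rule would also split domain names such as 'amazon.in', so it may need an exception list on real review text.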




In [12]:
reviews_lower[0:10]

['good but need updates and improvements',
 'worst mobile i have bought ever   battery is draining like hell   backup is only  to  hours with internet uses   even if i put mobile idle its getting discharged  this is biggest lie from amazon  lenove which is not at all expected   they are making full by saying that battery is mah  booster charger is fake   it takes at least  to  hours to be fully charged  dont know how lenovo will survive by making full of us  please dont go for this else you will regret like me  ',
 'when i will get my  cash back         its already  january    ',
 'good',
 'the worst phone everthey have changed the last phone but the problem is still same and the amazon is not returning the phone   highly disappointing of amazon',
 'only im telling dont buyim totally disappointedpoor batterypoor camerawaste of money',
 'phone is awesome   but while charging   it heats up allot    really a genuine reason to hate lenovo k note',
 'the battery level has worn down',
 'its 

## Tokenize 

In [13]:
from nltk import sent_tokenize, word_tokenize

In [14]:
review_sent_token = [sent_tokenize(para) for para in reviews_lower]

In [15]:
word_tokens = []
for review in review_sent_token:
    for sent in review:
        word_tokens.append(word_tokenize(sent))    


In [16]:
len(word_tokens)

14663

## POS Tagging

In [17]:
from nltk import pos_tag

In [18]:
review_pos_tags= [pos_tag(tokens) for tokens in word_tokens]

### Filter Noun only tags

In [20]:
nouns_tag = [[word for word, pos in doc if pos.startswith('NN')] for doc in review_pos_tags]
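To make the filter concrete, here is a small illustration on hand-written `(word, tag)` pairs. The tags are typed in for the example rather than produced by `pos_tag`, so treat them as assumptions. Matching on the `'NN'` prefix keeps all Penn Treebank noun tags (`NN`, `NNS`, `NNP`, `NNPS`).

```python
# Hand-written (word, tag) pairs standing in for one pos_tag() result.
tagged = [
    ('battery', 'NN'),
    ('is', 'VBZ'),
    ('draining', 'VBG'),
    ('hours', 'NNS'),
    ('lenovo', 'NNP'),
]

# Keep the word whenever its tag starts with 'NN'.
nouns = [word for word, pos in tagged if pos.startswith('NN')]
print(nouns)
```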

## Remove Stopwords and Punctuation 

In [21]:
from string import punctuation 
from nltk.corpus import stopwords 

In [22]:
stop_words = list(punctuation) + stopwords.words('english')

In [79]:
context_stop_words = ['phone', 'mobile', 'cell', 'amazon', 'k8', 'k', 'ka', 'please', 'g', 'h', 'hai']

In [80]:
all_stop_words = stop_words + context_stop_words

In [81]:
nouns_post_sw = [[word for word in doc if word not in all_stop_words ] for doc in nouns_tag]

In [82]:
len(nouns_post_sw)

14663
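The document count is unchanged because stop-word removal drops tokens, not documents. The same filter can be expressed with a set, which makes each membership test constant-time instead of a scan through a list. This is a sketch with a tiny stand-in stop list; the real notebook uses the full NLTK English list.

```python
from string import punctuation

# Tiny stand-in for stopwords.words('english'); assumed for illustration only.
base_stop_words = set(punctuation) | {'the', 'is', 'a'}
context_stop_words = {'phone', 'mobile', 'amazon'}
all_stop_words = base_stop_words | context_stop_words

doc = ['phone', 'battery', 'the', 'camera', ',']
kept = [word for word in doc if word not in all_stop_words]
print(kept)
```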

## Lemmatize

In [27]:
from nltk.stem import WordNetLemmatizer

In [28]:
wl = WordNetLemmatizer()

In [83]:
nouns_lemmetized = [ [ wl.lemmatize(word) for word in doc ] for doc in nouns_post_sw]

In [53]:
len(nouns_lemmetized)

14663

In [84]:
english_only_nouns = [[ word for word in doc if re.match('[a-z]', word) ] for doc in nouns_lemmetized]

In [64]:
len(english_only_nouns)

14663
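One detail worth knowing about the filter above: `re.match('[a-z]', word)` only tests the first character, so a token that merely starts with an ASCII letter still passes. A stricter variant, sketched below with made-up tokens, uses `fullmatch` so every character must be a lowercase ASCII letter.

```python
import re

ASCII_WORD = re.compile(r'[a-z]+')

tokens = ['camera', 'cámara', 'battery']

# Original style: only the first character is checked.
kept_prefix = [t for t in tokens if re.match('[a-z]', t)]

# Stricter style: the whole token must consist of a-z.
kept_full = [t for t in tokens if ASCII_WORD.fullmatch(t)]

print(kept_prefix)
print(kept_full)
```

Neither version removes Hindi words written in Latin letters (such as 'hai' or 'ka'), which is why those are handled through the context stop-word list.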

In [86]:
exp = [ noun for noun in english_only_nouns if any(noun)]

In [87]:
len(exp)

12594
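The drop from 14663 to 12594 means that 2069 documents have no nouns left after filtering. Since every token is a non-empty string, `any(doc)` behaves the same as testing whether the list is non-empty, as this small sketch with made-up documents shows.

```python
docs = [['battery'], [], ['camera', 'quality'], []]

# Equivalent to the any(...) filter when tokens are non-empty strings.
non_empty = [doc for doc in docs if doc]
dropped = len(docs) - len(non_empty)

print(non_empty, dropped)
```

Note that the cells that follow build the dictionary and corpus from `english_only_nouns`, not from the filtered `exp` list, so the empty documents are still part of the corpus passed to LDA.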

## Using LDA to create a topic model

In [32]:
import gensim
import gensim.corpora as corpora
from gensim.models import CoherenceModel

import pyLDAvis
import pyLDAvis.gensim  # don't skip this
import matplotlib.pyplot as plt
%matplotlib inline

In [88]:
id2word = corpora.Dictionary(english_only_nouns)
texts = english_only_nouns
corpus = [id2word.doc2bow(text) for text in texts]
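For readers new to the bag-of-words format, the following standard-library sketch mimics what the dictionary and `doc2bow` steps produce: each document becomes a list of `(token_id, count)` pairs. It is an illustration only, not gensim's implementation, and the ids gensim assigns may differ.

```python
from collections import Counter

docs = [['battery', 'backup', 'battery'], ['camera', 'quality']]

# Assign an integer id to each new token in order of first appearance.
token2id = {}
for doc in docs:
    for token in doc:
        token2id.setdefault(token, len(token2id))

def to_bow(doc):
    """Return sorted (token_id, count) pairs for one document."""
    counts = Counter(token2id[token] for token in doc)
    return sorted(counts.items())

bows = [to_bow(doc) for doc in docs]
print(token2id)
print(bows)
```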

In [89]:
lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus,
                                           id2word=id2word,
                                           num_topics=12, 
                                           random_state=100,
                                           update_every=1,
                                           chunksize=100,
                                           passes=5,
                                           alpha='auto',
                                           per_word_topics=True)

In [119]:
for topic_id in range(12):
    terms = lda_model.get_topic_terms(topic_id)
    print(f'Topic ID: {topic_id}')
    for tid, score in terms:
        print(f'{id2word[tid]:{20}} {score}')
    

Topic ID: 0
value                0.1291103959083557
system               0.10465578734874725
expectation          0.07785967737436295
number               0.06553763896226883
internet             0.06366458535194397
cover                0.0400787852704525
hardware             0.02817434072494507
iam                  0.025098353624343872
recording            0.02279970422387123
something            0.02242615632712841
Topic ID: 1
camera               0.27911949157714844
quality              0.10801383852958679
mode                 0.031352218240499496
display              0.03051639348268509
processor            0.030391190201044083
speaker              0.026952629908919334
use                  0.024446213617920876
music                0.022135119885206223
gb                   0.016046706587076187
ram                  0.01591634191572666
Topic ID: 2
charger              0.10164657235145569
waste                0.08217675983905792
turbo                0.05367719382047653
handset         

In [90]:
import pprint
pprint.pprint(lda_model.print_topics())

[(0,
  '0.129*"value" + 0.105*"system" + 0.078*"expectation" + 0.066*"number" + '
  '0.064*"internet" + 0.040*"cover" + 0.028*"hardware" + 0.025*"iam" + '
  '0.023*"recording" + 0.022*"something"'),
 (1,
  '0.279*"camera" + 0.108*"quality" + 0.031*"mode" + 0.031*"display" + '
  '0.030*"processor" + 0.027*"speaker" + 0.024*"use" + 0.022*"music" + '
  '0.016*"gb" + 0.016*"ram"'),
 (2,
  '0.102*"charger" + 0.082*"waste" + 0.054*"turbo" + 0.053*"handset" + '
  '0.051*"return" + 0.047*"budget" + 0.045*"data" + 0.035*"volta" + '
  '0.035*"picture" + 0.030*"u"'),
 (3,
  '0.138*"month" + 0.093*"heat" + 0.084*"life" + 0.065*"video" + '
  '0.038*"company" + 0.034*"purchase" + 0.028*"application" + 0.026*"flash" + '
  '0.025*"policy" + 0.025*"till"'),
 (4,
  '0.139*"call" + 0.094*"sound" + 0.070*"doesnt" + 0.054*"card" + '
  '0.049*"front" + 0.045*"light" + 0.042*"look" + 0.032*"specification" + '
  '0.032*"mp" + 0.026*"voice"'),
 (5,
  '0.144*"time" + 0.088*"network" + 0.070*"device" + 0.044*"si

In [91]:
coherence_model_lda = CoherenceModel(model=lda_model, texts=english_only_nouns, dictionary=id2word, coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()
print('\nCoherence Score: ', coherence_lda)


Coherence Score:  0.4117204228480638


In [92]:
print('\nPerplexity: ', lda_model.log_perplexity(corpus)) 


Perplexity:  -7.794105068532517
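The number printed above is not the perplexity itself. gensim's `log_perplexity` returns a per-word likelihood bound, and gensim reports perplexity as 2 raised to the negative of that bound. The conversion can be done by hand:

```python
# Value printed by lda_model.log_perplexity(corpus) above.
log_bound = -7.794105068532517

# gensim's convention: perplexity = 2 ** (-bound)
perplexity = 2 ** (-log_bound)
print(f'Perplexity: {perplexity:.1f}')
```

This works out to roughly 222. Lower is better, but perplexity often disagrees with human judgments of topic quality, which is one reason to lean on the coherence score when comparing models.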


In [98]:
vis = pyLDAvis.gensim.prepare(lda_model, corpus, id2word)
vis

In [149]:
lda_model_2 = gensim.models.ldamodel.LdaModel(corpus=corpus,
                                           id2word=id2word,
                                           num_topics=3, 
                                           random_state=100,
                                           update_every=1,
                                           chunksize=100,
                                           passes=10,
                                           alpha='symmetric',
                                           decay=.4,
                                           per_word_topics=True)

In [150]:
coherence_model_lda_2 = CoherenceModel(model=lda_model_2, texts=english_only_nouns, dictionary=id2word, coherence='c_v')
coherence_lda_2 = coherence_model_lda_2.get_coherence()
print('\nCoherence Score: ', coherence_lda_2)


Coherence Score:  0.5558486403857085
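Comparing topic counts by hand can be turned into a small sweep. The helper below is a hypothetical sketch, not part of the notebook: `fit_and_score` is assumed to be a function that trains a model for a given number of topics and returns its c_v coherence. To keep the example self-contained, the scorer here simply replays the two coherence values printed in this notebook.

```python
from typing import Callable, Dict, Iterable, Tuple

def sweep_num_topics(candidates: Iterable[int],
                     fit_and_score: Callable[[int], float]) -> Tuple[int, Dict[int, float]]:
    """Score each candidate topic count; return the best count and all scores."""
    scores = {k: fit_and_score(k) for k in candidates}
    best_k = max(scores, key=scores.get)
    return best_k, scores

# In the notebook, fit_and_score(k) would train LdaModel(num_topics=k, ...)
# and return CoherenceModel(..., coherence='c_v').get_coherence().
# Stand-in scorer: replays the c_v scores printed for 12 and 3 topics.
reported = {12: 0.4117204228480638, 3: 0.5558486403857085}
best_k, scores = sweep_num_topics(reported, reported.get)
print(best_k, scores)
```

Keep in mind that the two runs in this notebook also differ in `passes`, `alpha`, and `decay`, so the gap between them is not due to the topic count alone. A fair sweep would hold those settings fixed.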


In [151]:
vis_2 = pyLDAvis.gensim.prepare(lda_model_2, corpus, id2word)
vis_2

In [153]:
pprint.pprint(lda_model_2.print_topics())

[(0,
  '0.096*"product" + 0.070*"problem" + 0.061*"issue" + 0.027*"money" + '
  '0.026*"heating" + 0.025*"network" + 0.024*"month" + 0.024*"time" + '
  '0.022*"day" + 0.020*"service"'),
 (1,
  '0.102*"camera" + 0.090*"battery" + 0.038*"quality" + 0.028*"price" + '
  '0.027*"feature" + 0.024*"performance" + 0.014*"backup" + 0.014*"day" + '
  '0.012*"hour" + 0.012*"range"'),
 (2,
  '0.052*"note" + 0.031*"lenovo" + 0.023*"screen" + 0.019*"call" + '
  '0.019*"update" + 0.017*"software" + 0.017*"charger" + 0.013*"sim" + '
  '0.012*"option" + 0.012*"delivery"')]
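A quick numeric check of how distinct the three topics are is the Jaccard overlap of their top-10 word lists. The lists below are copied from the output above; the `jaccard` helper is a name introduced for this sketch.

```python
from itertools import combinations

# Top-10 words per topic, copied from print_topics() above.
top_words = {
    0: ['product', 'problem', 'issue', 'money', 'heating',
        'network', 'month', 'time', 'day', 'service'],
    1: ['camera', 'battery', 'quality', 'price', 'feature',
        'performance', 'backup', 'day', 'hour', 'range'],
    2: ['note', 'lenovo', 'screen', 'call', 'update',
        'software', 'charger', 'sim', 'option', 'delivery'],
}

def jaccard(a, b):
    """Size of the intersection divided by size of the union."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb)

overlaps = {(i, j): jaccard(top_words[i], top_words[j])
            for i, j in combinations(top_words, 2)}
for pair, score in overlaps.items():
    print(pair, round(score, 3))
```

Only topics 0 and 1 share a word ('day'), and the other pairs share none, which is consistent with the non-overlapping topics described in the overview.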


In [126]:
topic_names = { 0 : 'Customer Service Experience',
                1 : 'Phone performance and hardware',
                2 : 'Phone features and user experience',}

In [158]:
for topic_id in range(3):
    # Top 20 (word_id, probability) pairs for this topic
    terms = lda_model_2.get_topic_terms(topic_id, topn=20)
    print(f'\t {topic_names[topic_id]}')
    print('\t-------------------------------')
    for tid, _ in terms:
        print(f'\t\t{id2word[tid]}')
    print('\t-------------------------------\n\n')

	 Customer Service Experience
	-------------------------------
		product
		problem
		issue
		money
		heating
		network
		month
		time
		day
		service
		device
		heat
		waste
		dont
		customer
		return
		signal
		value
		buy
		system
	-------------------------------


	 Phone performance and hardware
	-------------------------------
		camera
		battery
		quality
		price
		feature
		performance
		backup
		day
		hour
		range
		time
		mode
		processor
		display
		life
		charge
		sound
		speaker
		everything
		bit
	-------------------------------


	 Phone features and user experience
	-------------------------------
		note
		lenovo
		screen
		call
		update
		software
		charger
		sim
		option
		delivery
		support
		turbo
		budget
		handset
		superb
		hr
		app
		data
		volta
		ho
	-------------------------------


