### Topic Modelling

This notebook demonstrates data preprocessing as well as web based Visualization using pyLDAvis while doing topic modelling using gensim LDA algorithm. 

In [1]:
!pip install pyLDAvis

Collecting pyLDAvis
[?25l  Downloading https://files.pythonhosted.org/packages/03/a5/15a0da6b0150b8b68610cc78af80364a80a9a4c8b6dd5ee549b8989d4b60/pyLDAvis-3.3.1.tar.gz (1.7MB)
[K     |████████████████████████████████| 1.7MB 6.9MB/s 
[?25h  Installing build dependencies ... [?25l[?25hdone
  Getting requirements to build wheel ... [?25l[?25hdone
  Installing backend dependencies ... [?25l[?25hdone
    Preparing wheel metadata ... [?25l[?25hdone
Collecting numpy>=1.20.0
[?25l  Downloading https://files.pythonhosted.org/packages/3f/03/c3526fb4e79a793498829ca570f2f868204ad9a8040afcd72d82a8f121db/numpy-1.21.0-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (15.7MB)
[K     |████████████████████████████████| 15.7MB 383kB/s 
Collecting funcy
  Downloading https://files.pythonhosted.org/packages/44/52/5cf7401456a461e4b481650dfb8279bc000f31a011d0918904f86e755947/funcy-1.16-py2.py3-none-any.whl
Collecting pandas>=1.2.0
[?25l  Downloading https://files.pythonhosted.org/packa

## Read Dataset

In [2]:
import pandas as pd

reviews_df = pd.read_csv("https://drive.google.com/uc?export=download&id=10dXjNNV9dbkn5shYLPcKXk_FWHvLmbGT")

In [3]:
pd.set_option("max_colwidth", 200)

In [4]:
reviews_df.sample(10)

AttributeError: ignored

                                                                                                                                              review  sentiment
4493                                                                                             awesome  I really loved their services n food tqqqq          1
4530                                                                                                                                     yummy foood          1
1210                                                                                                               Quality wasnt good waste of money          0
2461  Good place to visit with friends Price of food is little bit high but taste is good\nI ordered pasta shake fish and chips and fries Must visit          1
500       Worst restaurant 2rs bun coupled with 2 tomatoes and 2 cucumber and theyll charge you 130 Avoid at any cost even if youre extremely hungry          0
3997                                    

In [5]:
reviews_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5476 entries, 0 to 5475
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   review     5322 non-null   object
 1   sentiment  5476 non-null   int64 
dtypes: int64(1), object(1)
memory usage: 85.7+ KB


In [6]:
reviews_df = reviews_df.dropna()

In [7]:
reviews_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 5322 entries, 0 to 5475
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   review     5322 non-null   object
 1   sentiment  5322 non-null   int64 
dtypes: int64(1), object(1)
memory usage: 124.7+ KB


##### Download NLTK Resources

In [8]:
import nltk

In [9]:
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


True

##### Clean Text and return Tokens

In [10]:
from nltk.tokenize import WhitespaceTokenizer
tokenizer_w = WhitespaceTokenizer()

def tokenize(text):
    tokenized_list = tokenizer_w.tokenize(text.lower())   
    return tokenized_list

##### wordnet and lemmatization

Use NLTK's wordnet to find meanings of words, synonyms, antonyms, and more. In addition, we use WordNetLemmatizer to get the root word.

In [11]:

from nltk.stem.wordnet import WordNetLemmatizer

def get_lemma(word):
    return WordNetLemmatizer().lemmatize(word)

##### Filter out stop words

In [12]:
import nltk
en_stop = set(nltk.corpus.stopwords.words('english'))

##### Define a function to prepare Text for Topic Modelling

In [13]:
min_token_length = 3

In [14]:
def prepare_text_for_lda(text):
    tokens = tokenize(text)
    tokens = [token for token in tokens if len(token) > min_token_length]
    tokens = [token for token in tokens if token not in en_stop]
    tokens = [get_lemma(token) for token in tokens]
    return tokens

### Prepare Data

In [15]:
reviews_tokens = reviews_df[reviews_df.sentiment == 1].review.map(lambda x: prepare_text_for_lda(x))

In [16]:
reviews_tokens[0:5]

1                [small, beautiful, place, really, loved, everything, beautiful, loved, every, visited, church, street, lasagna]
3                                                                                                                         [good]
5    [best, place, sunday, breakfast, always, crowded, need, wait, scrumptious, food, homely, environment, become, sunday, adda]
6                          [thai, curry, dont, evidence, tiramisu, okay, better, cozy, nice, place, good, music, nice, location]
7                           [calm, peaceful, place, good, choice, date, ambiance, good, loved, coffee, italian, dish, delicious]
Name: review, dtype: object

In [17]:
len(reviews_tokens)

2877

##### LDA with gensim

First, we are creating a dictionary from the data, then convert to bag-of-words corpus and save the dictionary and corpus for future use.

In [18]:
from gensim import corpora
dictionary = corpora.Dictionary(reviews_tokens)
corpus = [dictionary.doc2bow(text) for text in reviews_tokens]

In [19]:
dictionary[0]

'beautiful'

In [20]:
len(dictionary)  # total tokens

1016

In [21]:
corpus[0:5]

[[(0, 2),
  (1, 1),
  (2, 1),
  (3, 1),
  (4, 1),
  (5, 2),
  (6, 1),
  (7, 1),
  (8, 1),
  (9, 1),
  (10, 1)],
 [(11, 1)],
 [(6, 1),
  (12, 1),
  (13, 1),
  (14, 1),
  (15, 1),
  (16, 1),
  (17, 1),
  (18, 1),
  (19, 1),
  (20, 1),
  (21, 1),
  (22, 1),
  (23, 2),
  (24, 1)],
 [(6, 1),
  (11, 1),
  (25, 1),
  (26, 1),
  (27, 1),
  (28, 1),
  (29, 1),
  (30, 1),
  (31, 1),
  (32, 2),
  (33, 1),
  (34, 1),
  (35, 1)],
 [(5, 1),
  (6, 1),
  (11, 2),
  (36, 1),
  (37, 1),
  (38, 1),
  (39, 1),
  (40, 1),
  (41, 1),
  (42, 1),
  (43, 1),
  (44, 1)]]

In [22]:
dictionary.id2token[3]

'everything'

In [23]:
dictionary.id2token[8]

'small'

In [24]:
# save corpus and dictionary to disk so that we can use it later during visualization
import pickle

pickle.dump(corpus, open('corpus.pkl', 'wb'))
dictionary.save('dictionary.gensim')

##### Ask LDA to find 5 topics from the given data. Takes about 1 minute.

In [25]:
import gensim

NUM_TOPICS = 5

ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics = NUM_TOPICS, id2word=dictionary, passes=15)

### Save the model

In [26]:
ldamodel.save('review_topics.gensim')

###### Focus on Topics and Let us see top 4 words in each topic

In [27]:
topics = ldamodel.print_topics(num_words=6)
for topic in topics:
    print(topic)

(0, '0.136*"good" + 0.049*"place" + 0.041*"food" + 0.019*"taste" + 0.018*"nice" + 0.018*"ambience"')
(1, '0.048*"awesome" + 0.033*"great" + 0.031*"food" + 0.027*"place" + 0.026*"burger" + 0.024*"cafe"')
(2, '0.090*"place" + 0.041*"food" + 0.027*"loved" + 0.023*"good" + 0.021*"ambience" + 0.019*"really"')
(3, '0.044*"place" + 0.041*"nice" + 0.035*"food" + 0.030*"amazing" + 0.029*"great" + 0.024*"cafe"')
(4, '0.078*"good" + 0.053*"place" + 0.051*"food" + 0.047*"service" + 0.038*"great" + 0.035*"nice"')


With LDA, we can see how words contribute to make each topic

##### Test with a new document

Find out topic distribution for a given document

In [28]:
new_doc = 'The cofee tasted good, but the place is not pathetic'
new_doc = prepare_text_for_lda(new_doc)
new_doc_bow = dictionary.doc2bow(new_doc)
print(new_doc_bow)
print(ldamodel.get_document_topics(new_doc_bow))

[(6, 1), (467, 1)]
[(0, 0.067850046), (1, 0.0672918), (2, 0.069163986), (3, 0.7277469), (4, 0.067947276)]


Above new document is about machine learning algorithms, the LDA output shows that certain topic has the highest probability assigned, and some other topic has the second highest and some other topic the least.t probability assigned. We agreed!
Remember that the above 5 probabilities add up to 1.

In [29]:
dictionary[1]

'church'

In [30]:
dictionary[9]

'street'

Instead of 5 topics let us ask LDA to find just 3 topics in the data:

In [31]:
ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics = 3, id2word=dictionary, passes=15)
ldamodel.save('topics_3.gensim')
topics = ldamodel.print_topics(num_words=4)
for topic in topics:
    print(topic)

(0, '0.064*"food" + 0.061*"place" + 0.031*"good" + 0.022*"nice"')
(1, '0.103*"good" + 0.055*"place" + 0.035*"food" + 0.034*"really"')
(2, '0.073*"good" + 0.050*"place" + 0.033*"nice" + 0.025*"great"')


##### Visualization using pyLDAvis

pyLDAvis is designed to help users interpret the topics in a topic model that has been fit to a corpus of text data.

The package extracts information from a fitted LDA topic model to inform an interactive web-based visualization.

In [32]:
dictionary = gensim.corpora.Dictionary.load('dictionary.gensim')
corpus = pickle.load(open('corpus.pkl', 'rb'))
lda = gensim.models.ldamodel.LdaModel.load('review_topics.gensim')

In [33]:
import pyLDAvis
import pyLDAvis.gensim_models as gensimvis
pyLDAvis.enable_notebook()
lda_display = gensimvis.prepare(lda, corpus, dictionary, sort_topics=False)
lda_display

  from collections import Iterable


ImportError: ignored

Saliency: a measure of how much the term tells you about the topic.

Relevance: a weighted average of the probability of the word given the topic and the word given the topic normalized by the probability of the topic.

The size of the bubble measures the importance of the topics, relative to the data.

First, we got the most salient terms, means terms mostly tell us about what’s going on relative to the topics. We can also look at individual topic.

### Try with only the nouns

<code>
tags = nltk.pos_tag(tokenized_list)

nouns = [word for word,pos in tags if (pos == 'NN' or pos == 'NNP' or pos == 'NNS' or pos == 'NNPS')]

</code>

## Topic Modelling using NMF

In [34]:
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import NMF, LatentDirichletAllocation
import numpy as np

In [54]:
no_topics = 5 
no_top_words = 4 
no_top_documents = 3 

In [55]:
# NMF is able to use tf-idf
tfidf_vectorizer = TfidfVectorizer(max_df=0.95, min_df=2, stop_words='english')
tfidf = tfidf_vectorizer.fit_transform(reviews_df.review)
tfidf_feature_names = tfidf_vectorizer.get_feature_names()

In [56]:
# Run NMF
nmf_model = NMF(n_components=no_topics, random_state=1, init='nndsvd').fit(tfidf)
nmf_W = nmf_model.transform(tfidf)
nmf_H = nmf_model.components_

In [57]:
tfidf.shape

(5322, 1318)

In [58]:
nmf_W.shape

(5322, 5)

In [59]:
nmf_H.shape

(5, 1318)

In [62]:
def display_topics(H, W, feature_names, documents, no_top_words, no_top_documents):
    for topic_idx, topic in enumerate(H):
        print("Topic %d:" % (topic_idx))
        print("Topic Words:", " ".join([feature_names[i]
                        for i in topic.argsort()[:-no_top_words - 1:-1]]))
        top_doc_indices = np.argsort( W[:,topic_idx] )[::-1][0:no_top_documents]
        print("----------------------------------------------------------------------")        
        print("Some documents that contain this topic are:")
        print("----------------------------------------------------------------------")        
        for doc_index in top_doc_indices:
            print(documents[doc_index])
        print("----------------------------------------------------------------------")        
        print("----------------------------------------------------------------------")                

In [63]:
print("NMF Topics")
display_topics(nmf_H, nmf_W, tfidf_feature_names, reviews_df.review, no_top_words, no_top_documents)
print("--------------")

NMF Topics
Topic 0:
Topic Words: good taste wasnt food
----------------------------------------------------------------------
Some documents that contain this topic are:
----------------------------------------------------------------------
not bad
I didnt get the food i ordered
the choco lava cake was cold very disappointing
Topic 1:
Topic Words: bad taste quality pizza
----------------------------------------------------------------------
Some documents that contain this topic are:
----------------------------------------------------------------------
taste and even the packing of the food is good
Nothing above average tbh  coffee shake is trash  delivery was horrendous
not bad
Topic 2:
Topic Words: great place food nice
----------------------------------------------------------------------
Some documents that contain this topic are:
----------------------------------------------------------------------
bad quality
Much needed place in ORR has a great ambience The staff r too friendl

### Exercise for participants

- Try to model topics only for Positive sentiments
- Try to model topics only for Negative sentiments

### References

- https://towardsdatascience.com/setting-up-text-preprocessing-pipeline-using-scikit-learn-and-spacy-e09b9b76758f