# Unsupervised Topic Modeling

## Latent Dirichlet Allocation

The reviews we have been given are all unlabeled. Annotating them manually would take far too long and is generally impractical. We therefore turn to unsupervised methods to label the data, or at least to identify the abstract topics it contains. Here we apply Latent Dirichlet Allocation (LDA) to divide the reviews into topics.

In [1]:
# Reading reviews 
import pandas as pd

data = pd.read_csv('./data.csv')

docs = data['text']

### Preprocessing
Before fitting the model, we need some basic preprocessing to make the data suitable for LDA. The next few code cells take care of that.

In [2]:
# Using gensim to preprocess and tokenize the data 

import gensim
from gensim.utils import simple_preprocess

def sent_to_words(sentences):
    for sentence in sentences:
        # simple_preprocess lowercases, tokenizes and drops punctuation; deacc=True also strips accent marks
        yield simple_preprocess(str(sentence), deacc=True)

data = docs.values.tolist()
data_words = list(sent_to_words(data))

print(data_words[:1][0])



['tires', 'where', 'delivered', 'to', 'the', 'garage', 'of', 'my', 'choice', 'the', 'garage', 'notified', 'me', 'when', 'they', 'had', 'been', 'delivered', 'day', 'and', 'time', 'was', 'arranged', 'with', 'the', 'garage', 'and', 'went', 'and', 'had', 'them', 'fitted', 'hassel', 'free', 'experience']


In [3]:
# Build the bigram and trigram models
bigram = gensim.models.Phrases(data_words, min_count=5, threshold=100) # higher threshold fewer phrases.
trigram = gensim.models.Phrases(bigram[data_words], threshold=100)  

# Faster way to get a sentence clubbed as a trigram/bigram
bigram_mod = gensim.models.phrases.Phraser(bigram)
trigram_mod = gensim.models.phrases.Phraser(trigram)
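
To build some intuition for what `min_count` and `threshold` do, here is a rough pure-Python sketch of the kind of score `Phrases` computes for adjacent word pairs by default. It is a simplification, not gensim's implementation: `score_bigrams` and the toy sentences are made up for illustration, and the vocabulary size here counts single words only. A pair is merged into a phrase such as `hassle_free` only when its score exceeds `threshold`.

```python
from collections import Counter

def score_bigrams(sentences, min_count=5):
    """Score adjacent word pairs: (count(a, b) - min_count) * vocab_size / (count(a) * count(b))."""
    unigrams = Counter(word for sent in sentences for word in sent)
    bigrams = Counter(pair for sent in sentences for pair in zip(sent, sent[1:]))
    vocab_size = len(unigrams)  # simplification: single words only
    return {
        (a, b): (count_ab - min_count) * vocab_size / (unigrams[a] * unigrams[b])
        for (a, b), count_ab in bigrams.items()
    }

# Made-up toy sentences, already tokenized
toy_sentences = [
    ['hassle', 'free', 'service'],
    ['hassle', 'free', 'fitting'],
    ['good', 'service'],
    ['good', 'price'],
    ['good', 'fitting'],
]

scores = score_bigrams(toy_sentences, min_count=1)
# Pairs that co-occur more often than their individual frequencies suggest score highest
best_pair = max(scores, key=scores.get)
```

Raising `threshold` demands a higher score before two words are joined, which is why a higher threshold yields fewer phrases.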

In [4]:
# NLTK Stop words
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

stop_words = stopwords.words('english')
stop_words.extend(['redacted']) # Adding redacted as a stopword as it is not particularly relevant

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\praat\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [5]:
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.1.0

  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.1.0/en_core_web_sm-3.1.0-py3-none-any.whl (13.6 MB)
[+] Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


In [6]:
# Define functions for stopwords, bigrams, trigrams and lemmatization
def remove_stopwords(texts):
    return [[word for word in simple_preprocess(str(doc)) if word not in stop_words] for doc in texts]

def make_bigrams(texts):
    return [bigram_mod[doc] for doc in texts]

def make_trigrams(texts):
    return [trigram_mod[bigram_mod[doc]] for doc in texts]

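def lemmatization(texts, allowed_postags=('NOUN', 'ADJ', 'VERB', 'ADV')):
    """Lemmatize each document, keeping only tokens whose POS tag is in allowed_postags.

    Relies on the global spaCy pipeline `nlp`, which is loaded in the next cell.
    See https://spacy.io/api/annotation for the tag set.
    """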
    texts_out = []
    for sent in texts:
        doc = nlp(" ".join(sent)) 
        texts_out.append([token.lemma_ for token in doc if token.pos_ in allowed_postags])
    return texts_out

In [7]:
import spacy

# Remove Stop Words
data_words_nostops = remove_stopwords(data_words)

# Form Bigrams
data_words_bigrams = make_bigrams(data_words_nostops)

# Initialize spacy 'en' model, keeping only tagger component (for efficiency)
nlp = spacy.load("en_core_web_sm", disable=['parser', 'ner'])

# Do lemmatization keeping only noun, adj, vb, adv
data_lemmatized = lemmatization(data_words_bigrams, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV'])

print(data_lemmatized[:1][0])

['tire', 'deliver', 'garage', 'choice', 'garage', 'notify', 'deliver', 'day', 'time', 'arrange', 'garage', 'go', 'fit', 'hassel', 'free', 'experience']


In [8]:
import gensim.corpora as corpora

# Create Dictionary
id2word = corpora.Dictionary(data_lemmatized)

# Create Corpus
texts = data_lemmatized

# Term Document Frequency
corpus = [id2word.doc2bow(text) for text in texts]

# View
print(corpus[:1][0])

[(0, 1), (1, 1), (2, 1), (3, 2), (4, 1), (5, 1), (6, 1), (7, 3), (8, 1), (9, 1), (10, 1), (11, 1), (12, 1)]
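
Each `(id, count)` pair above says how many times the word with that id occurs in the document. The following pure-Python sketch mimics the idea on the first few lemmas of our first review. `build_vocab` and `to_bow` are illustrative helpers, not gensim functions, and the ids are assigned in order of first appearance, which may differ from the ids gensim assigns.

```python
from collections import Counter

def build_vocab(docs):
    """Map each distinct token to an integer id, in order of first appearance."""
    vocab = {}
    for doc in docs:
        for token in doc:
            vocab.setdefault(token, len(vocab))
    return vocab

def to_bow(doc, vocab):
    """Return a sorted list of (token_id, count) pairs; unknown tokens are skipped."""
    counts = Counter(vocab[token] for token in doc if token in vocab)
    return sorted(counts.items())

toy_doc = ['tire', 'deliver', 'garage', 'choice', 'garage', 'notify', 'deliver']
vocab = build_vocab([toy_doc])
bow = to_bow(toy_doc, vocab)
```

Word order is discarded in this representation: LDA only sees which words occur in a document and how often.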


This completes the preprocessing, and the corpus can now be fed to the LDA model. LDA requires us to specify up front the number of topics the data should be divided into. We initially tried an arbitrary value to see how a baseline model performed, and later used a function to identify a more suitable number of topics. Here, we skip straight to the custom function we wrote to score a given set of LDA parameters.

In [None]:
import tqdm
import numpy as np
from gensim.models import CoherenceModel

# Supporting function: train an LDA model with the given parameters and return its c_v coherence
def compute_coherence_values(corpus, dictionary, k, a, b):
    
    lda_model = gensim.models.LdaMulticore(corpus=corpus,
                                           id2word=dictionary,
                                           num_topics=k, 
                                           random_state=100,
                                           chunksize=100,
                                           passes=10,
                                           alpha=a,
                                           eta=b)
    
    # Coherence is always scored against the full lemmatized texts (global data_lemmatized)
    coherence_model_lda = CoherenceModel(model=lda_model, texts=data_lemmatized, dictionary=dictionary, coherence='c_v')
    
    return coherence_model_lda.get_coherence()
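
Coherence scores how strongly a topic's top words tend to appear together in the texts. The sketch below is not the `c_v` measure used above. As far as we understand, `c_v` builds on normalized pointwise mutual information (NPMI) together with a sliding window and an indirect similarity step, and the sketch only shows the simpler document-level NPMI idea. `npmi_coherence` and the toy documents are made up for illustration.

```python
import math
from itertools import combinations

def npmi_coherence(top_words, docs):
    """Average document-level NPMI over all pairs of a topic's top words."""
    doc_sets = [set(doc) for doc in docs]
    n_docs = len(doc_sets)

    def prob(*words):
        # Fraction of documents that contain every given word
        return sum(all(w in d for w in words) for d in doc_sets) / n_docs

    scores = []
    for w1, w2 in combinations(top_words, 2):
        p1, p2, p12 = prob(w1), prob(w2), prob(w1, w2)
        if p12 == 0:
            scores.append(-1.0)  # the words never co-occur: minimum NPMI
        elif p12 == 1:
            scores.append(0.0)   # both words in every document: NPMI undefined, treated as neutral
        else:
            scores.append(math.log(p12 / (p1 * p2)) / -math.log(p12))
    return sum(scores) / len(scores)

# Made-up toy documents, already lemmatized
toy_docs = [['staff', 'helpful', 'friendly'],
            ['staff', 'helpful'],
            ['tyre', 'price'],
            ['tyre', 'price', 'good']]

coherent = npmi_coherence(['staff', 'helpful'], toy_docs)    # words that always appear together
incoherent = npmi_coherence(['staff', 'price'], toy_docs)    # words that never appear together
```

A topic whose keywords keep showing up in the same reviews scores close to 1, while a topic that mixes unrelated words scores close to -1.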

### Hyperparameter Tuning
There are three hyperparameters that we will tune to improve our LDA model:
1) Number of topics \
2) Dirichlet hyperparameter alpha: document-topic density \
3) Dirichlet hyperparameter beta (called `eta` in gensim): topic-word density
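
To see what "density" means here, the sketch below draws topic mixtures from a symmetric Dirichlet distribution for a small and a large alpha and measures how much of each document the largest topic takes up on average. It uses only the standard library; `sample_dirichlet` and `mean_max_share` are illustrative helpers and are not part of gensim. The same reasoning applies to beta, with words in place of topics.

```python
import random

def sample_dirichlet(alpha, k, rng):
    """Draw one k-dimensional sample from a symmetric Dirichlet(alpha) via normalized gamma draws."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]

def mean_max_share(alpha, k=10, n=2000, seed=0):
    """Average share of the largest topic over n sampled document-topic mixtures."""
    rng = random.Random(seed)
    return sum(max(sample_dirichlet(alpha, k, rng)) for _ in range(n)) / n

sparse = mean_max_share(0.01)  # low alpha: each document is dominated by a single topic
dense = mean_max_share(0.91)   # higher alpha: each document spreads over several topics
```

With a low alpha the largest topic accounts for nearly the whole document, whereas with alpha close to 1 the mixture is spread much more evenly across the 10 topics.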

We run the code cell below to compute the coherence for every combination of hyperparameters. Higher coherence generally indicates more interpretable topics, so we treat it as the measure of a better model.

In [None]:
# Trying various hyperparameter combinations (grid search)

# Topics range
min_topics = 2
max_topics = 11
step_size = 1
topics_range = range(min_topics, max_topics, step_size)

# Alpha parameter
alpha = list(np.arange(0.01, 1, 0.3))
alpha.append('symmetric')
alpha.append('asymmetric')

# Beta parameter
beta = list(np.arange(0.01, 1, 0.3))
beta.append('symmetric')

# Validation sets
num_of_docs = len(corpus)
corpus_sets = [gensim.utils.ClippedCorpus(corpus, int(num_of_docs*0.75)), 
               corpus]

corpus_title = ['75% Corpus', '100% Corpus']

model_results = {'Validation_Set': [],
                 'Topics': [],
                 'Alpha': [],
                 'Beta': [],
                 'Coherence': []
                }

pbar = tqdm.tqdm(total=(len(beta)*len(alpha)*len(topics_range)*len(corpus_title)))

# iterate through validation corpora
for i in range(len(corpus_sets)):
    # iterate through number of topics
    for k in topics_range:
        # iterate through alpha values
        for a in alpha:
            # iterate through beta values
            for b in beta:
                # get the coherence score for the given parameters
                cv = compute_coherence_values(corpus=corpus_sets[i], dictionary=id2word, 
                                              k=k, a=a, b=b)
                # Save the model results
                model_results['Validation_Set'].append(corpus_title[i])
                model_results['Topics'].append(k)
                model_results['Alpha'].append(a)
                model_results['Beta'].append(b)
                model_results['Coherence'].append(cv)
                
                pbar.update(1)
pd.DataFrame(model_results).to_csv('/content/drive/MyDrive/lda_tuning_results.csv', index=False)
pbar.close()
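
To get a feel for the size of this search, the sketch below enumerates the same grid with `itertools.product` instead of nested loops. The numeric values are those produced by `np.arange(0.01, 1, 0.3)`, rounded to two decimals and written out by hand so that the snippet needs no NumPy.

```python
from itertools import product

topics_range = range(2, 11)                                   # 9 topic counts: 2..10
alpha = [0.01, 0.31, 0.61, 0.91, 'symmetric', 'asymmetric']   # 6 alpha settings
beta = [0.01, 0.31, 0.61, 0.91, 'symmetric']                  # 5 beta settings
corpus_title = ['75% Corpus', '100% Corpus']                  # 2 validation sets

# Every (validation set, topics, alpha, beta) combination, in the same order as the nested loops
combinations = list(product(corpus_title, topics_range, alpha, beta))
n_fits = len(combinations)  # 2 * 9 * 6 * 5 = 540
```

Each of these 540 combinations trains a full LDA model with `passes=10`, so the search can take a long time to complete.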

### Ideal Hyperparameters
After trying a variety of hyperparameters, we find that the best coherence (~0.60) is achieved with 5 topics, alpha = 0.31 and a symmetric eta. \
However, the 5-topic model did not seem to cover a wide enough range of words, and its topics felt vague. We therefore tried several other topic counts and found that 10 topics with alpha = 'asymmetric' and eta = 0.61 gave more distinct topics, with more interpretable and easily differentiable keywords. This model still achieves a coherence of ~0.56, so we proceed with these hyperparameters.


In [9]:
num_topics = 10

lda_model = gensim.models.LdaMulticore(corpus=corpus,
                                       id2word=id2word,
                                       num_topics=num_topics,
                                       random_state=100,
                                       chunksize=50,
                                       passes=5,
                                       alpha='asymmetric',
                                       eta=0.61)

In [10]:
from pprint import pprint

# Print the keywords in each topic
pprint(lda_model.print_topics())
doc_lda = lda_model[corpus]

[(0,
  '0.077*"service" + 0.059*"good" + 0.058*"price" + 0.047*"easy" + 0.044*"use" '
  '+ 0.042*"great" + 0.042*"tyre" + 0.029*"excellent" + 0.020*"garage" + '
  '0.019*"fit"'),
 (1,
  '0.058*"tyre" + 0.038*"time" + 0.031*"fit" + 0.029*"garage" + 0.021*"get" + '
  '0.017*"go" + 0.014*"car" + 0.013*"use" + 0.013*"day" + 0.012*"fitting"'),
 (2,
  '0.043*"start" + 0.042*"finish" + 0.022*"exactly" + 0.011*"smooth" + '
  '0.008*"perfect" + 0.008*"faultless" + 0.007*"process" + 0.005*"tin" + '
  '0.005*"say" + 0.005*"whole"'),
 (3,
  '0.043*"staff" + 0.037*"helpful" + 0.023*"friendly" + 0.018*"hassle_free" + '
  '0.011*"first_class" + 0.010*"polite" + 0.008*"guy" + 0.007*"knowledgeable" '
  '+ 0.004*"nice" + 0.004*"lovely"'),
 (4,
  '0.021*"anywhere_else" + 0.012*"wide" + 0.008*"seamless" + 0.008*"class" + '
  '0.007*"sell" + 0.006*"top" + 0.005*"grip" + 0.005*"try" + 0.005*"past" + '
  '0.004*"noise"'),
 (5,
  '0.045*"tyre" + 0.017*"car" + 0.012*"order" + 0.011*"fit" + 0.009*"size" + '
  '

## Visualizing topic clusters
We can visualize the generated topic clusters using pyLDAvis and see the various keywords for each topic.

In [12]:
import pyLDAvis
import pyLDAvis.gensim_models

# Visualize the topics
pyLDAvis.enable_notebook()
LDAvis_prepared = pyLDAvis.gensim_models.prepare(lda_model, corpus, id2word)
LDAvis_prepared


Now, let's save our model. The next notebook explores how to proceed with labeling.

In [11]:
lda_model.save('lda.model')

In [12]:
# Storing these variables as they will be required later on while making predictions in our next notebook
%store stop_words
%store bigram_mod
%store id2word
%store nlp

Stored 'stop_words' (list)
Stored 'bigram_mod' (FrozenPhrases)
Stored 'id2word' (Dictionary)
Stored 'nlp' (English)
