
# Implementation of LDA model in Python

Authors  
-   Nicolas Pizzo and Akshar Nair


> **note**
>
> -   This worksheet is part of a research project on clustering question-answer pairs using ML models at the University of Bath.
> -   Research Group: Alexandra Gkolia, Nicolas Pizzo, Nikhil Fernandes, Akshar Nair and James Davenport
> -   This worksheet is derived from https://towardsdatascience.com/evaluate-topic-model-in-python-latent-dirichlet-allocation-lda-7d57484bb5d0 and adapted to the data set collected from the CM20254 course on Data Structures and Algorithms at the University of Bath.
> -   For further queries, please feel free to email: (np700, ag2214, nkdf20, asn42, masjhd)@bath.ac.uk

# Note: This script is CPU-intensive. Be aware that your machine will be under heavy load while it runs; use with care.

To use the different ML models, we need various functions from the pandas library and several NLP packages. You may need the following installs (in case this is your first time using ML models):
>- pip install pandas
>- pip install nltk
>- pip install gensim
>- pip install spacy
>- pip install tqdm
>- pip install pyLDAvis
>- python -m spacy download en_core_web_sm

In [1]:
import pandas as pd
import os

os.chdir('..')

## Reading the csv file into a pandas DataFrame
### Note: your path for the csv file might be different. Please update accordingly.

In [2]:
papers = pd.read_csv('./Documents/Questions_300.csv')

In [3]:
#Print head
papers.head()

Unnamed: 0,Question,Answer 1,Answer 2,Answer 3,Answer 4,Answer 5,Explanation,Author
0,The runtime for the following code fragment is...,nlogn\n,n^2\n,n^2(logn)\n,n^3\n,None of the above\n,,wd371 (wd371)\n
1,"Given the binary search algorithm, as taught i...",1 iteration\n,2 iterations\n,3 iterations\n,4 iterations\n,,,wd371 (wd371)\n
2,An algorithm has time complexity . Using the D...,b = 12\n,b = -4\n,b = 6\n,b = 9\n,b = 0\n,Solve for the roots of |T(N)|=4N^2. The roots ...,
3,Which one of the following sorting algorithms ...,Selection sort\n,Insertion sort\n,Bubble sort\n,Merge sort\n,Quick\n,,aa2955 (aa2955)\n
4,Given the tree: a / \ b c / \ / \ e f g hWhat ...,ebfagch\n,abefcgh\n,efbghca\n,abcefgh\n,,For the in order traversal we start from the r...,


### Removing irrelevant columns from the dataset. 
#### `drop` removes rows (index labels) with axis = 0 and columns with axis = 1. Passing `columns=[...]` already selects columns, so `axis=1` is redundant in the call below.

In [4]:
papers = papers.drop(columns=['Answer 1', 'Answer 2', 'Answer 3', 'Answer 4', 'Answer 5', 'Author', 'Explanation'], axis=1)

In [5]:
#Will be used often.  Simply prints out the current state of the file.
papers.head()

Unnamed: 0,Question
0,The runtime for the following code fragment is...
1,"Given the binary search algorithm, as taught i..."
2,An algorithm has time complexity . Using the D...
3,Which one of the following sorting algorithms ...
4,Given the tree: a / \ b c / \ / \ e f g hWhat ...


In [6]:
#Load the regular expression library
import re

#### The following removes punctuation (commas, full stops, exclamation marks and question marks) and converts the questions to lowercase.

In [7]:
papers['Question_Processed'] = papers['Question'].map(lambda x: re.sub(r'[,.!?]', '', x))

In [8]:
papers['Question_Processed'] = papers['Question_Processed'].map(lambda x: x.lower())
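
As a quick check of the two cleaning steps above, the sketch below applies the same substitution and lowercasing to a single string with plain `re`, outside pandas. The sample sentence and the helper name `clean_question` are illustrative assumptions, not part of the dataset or the worksheet.

```python
import re

def clean_question(text):
    """Drop commas, full stops, '!' and '?', then lowercase (hypothetical helper)."""
    return re.sub(r'[,.!?]', '', text).lower()

sample = "Given the tree: a / b, what is the in-order traversal?"
cleaned = clean_question(sample)
print(cleaned)
```

Only the four listed characters are removed; other symbols such as `:` and `/` survive this step, which matches the `Question_Processed` column shown in the worksheet.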

In [9]:
papers.head()

Unnamed: 0,Question,Question_Processed
0,The runtime for the following code fragment is...,the runtime for the following code fragment is...
1,"Given the binary search algorithm, as taught i...",given the binary search algorithm as taught in...
2,An algorithm has time complexity . Using the D...,an algorithm has time complexity using the de...
3,Which one of the following sorting algorithms ...,which one of the following sorting algorithms ...
4,Given the tree: a / \ b c / \ / \ e f g hWhat ...,given the tree: a / \ b c / \ / \ e f g hwhat ...


In [10]:
#%%time

In [11]:
#Tokenising sentences into individual words with no punctuation

import gensim
from gensim.utils import simple_preprocess

def sent_to_words(sentences):
    for sentence in sentences:
        # simple_preprocess lowercases, tokenises and drops punctuation;
        # deacc=True additionally strips accent marks
        yield gensim.utils.simple_preprocess(str(sentence), deacc=True)

data = papers.Question_Processed.values.tolist()
data_words = list(sent_to_words(data))

#Note: [a:b] selects sentences a to b-1 (indexing starts at 0).  E.g. [1:3] prints the 2nd and 3rd sentences.
#[:1] prints only the first sentence, and [:0] prints no sentence.
print(data_words[:1])

[['the', 'runtime', 'for', 'the', 'following', 'code', 'fragment', 'is', 'what', 'is', 'for', 'int', 'for', 'int', 'for', 'int']]


### The following code builds the bigram and trigram models. Wrapping each trained `Phrases` model in a `Phraser` gives a faster way of joining frequent word pairs and triples into single tokens. The first two lines (the `Phrases` models) are required to build the faster `Phraser` models.

In [12]:
bigram = gensim.models.Phrases(data_words, min_count = 5, threshold = 100)
trigram = gensim.models.Phrases(bigram[data_words], threshold = 100)


bigram_mod = gensim.models.phrases.Phraser(bigram)
trigram_mod = gensim.models.phrases.Phraser(trigram)

In [13]:
print(bigram)
print(trigram)

Phrases<4599 vocab, min_count=5, threshold=100, max_vocab_size=40000000>
Phrases<4599 vocab, min_count=5, threshold=100, max_vocab_size=40000000>
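
For intuition on `min_count` and `threshold`, here is a simplified sketch of the kind of score used to decide whether two adjacent words form a phrase. As we understand it, gensim's default scorer has the form `(count(a, b) - min_count) * vocab_size / (count(a) * count(b))`, and a pair is joined only when its score exceeds `threshold`. The function `bigram_scores` and the toy sentences are assumptions for illustration; gensim's own vocabulary count also includes bigram entries, so absolute values will differ from this sketch.

```python
from collections import Counter

def bigram_scores(sentences, min_count=5):
    """Score every adjacent word pair (simplified, hypothetical helper)."""
    unigrams = Counter(word for sent in sentences for word in sent)
    bigrams = Counter(pair for sent in sentences for pair in zip(sent, sent[1:]))
    vocab_size = len(unigrams)  # simplification: unigrams only
    return {
        pair: (count - min_count) * vocab_size / (unigrams[pair[0]] * unigrams[pair[1]])
        for pair, count in bigrams.items()
    }

# Toy corpus: "binary search" occurs 6 times, "binary tree" once
toy_sentences = [["binary", "search"]] * 6 + [["binary", "tree"]]
scores = bigram_scores(toy_sentences, min_count=5)
for pair, score in scores.items():
    print(pair, round(score, 4))
```

Pairs seen fewer than `min_count` times get a negative score and can never pass the threshold, which is why rare pairs are not merged.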


#### The following code removes stopwords from a given text. It is run on the entire dataset of questions.
#### Note: You may need to uncomment lines 1 and 2 in case you haven't previously downloaded the stopwords.

In [14]:
#import nltk
#nltk.download('stopwords')
from nltk.corpus import stopwords

stop_words = stopwords.words('english')
stop_words.extend(['from', 'subject', 're', 'edu', 'use'])

#Define functions for stopwords, bigrams, trigrams and lemmatization

def remove_stopwords(texts):
    return [[word for word in simple_preprocess(str(doc)) if word not in stop_words] for doc in texts]

def make_bigrams(texts):
    return [bigram_mod[doc] for doc in texts]

def make_trigrams(texts):
    return [trigram_mod[bigram_mod[doc]] for doc in texts]

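def lemmatization(texts, allowed_postags=('NOUN', 'ADJ', 'VERB', 'ADV')):
    """Lemmatize each document, keeping only tokens with an allowed POS tag.

    `texts` must be a list of token lists. Uses the global spaCy pipeline `nlp`.
    See https://spacy.io/api/annotation
    """
    texts_out = []
    for sent in texts:
        doc = nlp(" ".join(sent))
        texts_out.append([token.lemma_ for token in doc if token.pos_ in allowed_postags])
    return texts_out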
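
To see what the stopword filter does without downloading the NLTK data, the sketch below runs the same list comprehension with a tiny hand-picked stop list. The set `toy_stop_words` and the name `filter_stopwords` are assumptions for illustration; NLTK's English list is much larger.

```python
# Hand-picked stand-in for NLTK's list plus the worksheet's extra words (assumption)
toy_stop_words = {'the', 'for', 'is', 'what', 'from', 'subject', 're', 'edu', 'use'}

def filter_stopwords(texts, stop_words=toy_stop_words):
    """Keep only the words of each document that are not stopwords."""
    return [[word for word in doc if word not in stop_words] for doc in texts]

tokens = [['the', 'runtime', 'for', 'the', 'following', 'code', 'fragment',
           'is', 'what', 'is', 'for', 'int', 'for', 'int', 'for', 'int']]
filtered = filter_stopwords(tokens)
print(filtered)
```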

#### Removing stopwords (using the NLTK list), forming bigrams, and lemmatizing the data with spaCy.

In [15]:
import spacy
# Remove Stop Words
data_words_nostops = remove_stopwords(data_words)

# Form Bigrams
data_words_bigrams = make_bigrams(data_words_nostops)

# Initialize spacy 'en' model, keeping only tagger component (for efficiency)
nlp = spacy.load("en_core_web_sm", disable = ['parser', 'ner'])
# Do lemmatization keeping only noun, adj, vb, adv
data_lemmatized = lemmatization(data_words_bigrams, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV'])
print(data_lemmatized[:2])

[['runtime', 'follow', 'code', 'fragment', 'int', 'int', 'int'], ['give', 'binary', 'search', 'teach', 'first', 'last', 'index', 'many', 'iteration', 'algorithm', 'need', 'find', 'element']]


In [16]:
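# Scratch cell: sanity checks for the helper functions
x = [['from','subject','int'],['second','sentence','use']]
y = remove_stopwords(x)
make_bigrams(y)
#print(lemmatization(make_bigrams(remove_stopwords(data_words))))
make_bigrams(remove_stopwords(data_words))[:3]
test = make_bigrams(remove_stopwords(data_words))[1]
print(test)
# Note: lemmatization() expects a list of documents. `test` is a single document
# (a list of words), so each word is treated as a document and split into
# characters, as the second output shows. Pass [test] to lemmatize the sentence.
print(lemmatization(test, allowed_postags=['NOUN','ADJ','VERB','ADV']))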

['given', 'binary', 'search', 'algorithm', 'taught', 'lecture', 'array', 'first', 'last', 'index', 'many', 'iterations', 'algorithm', 'need', 'find', 'element']
[['v', 'n'], ['i', 'r', 'y'], ['a', 'r', 'c', 'h'], ['g', 'o', 'r', 'i', 't', 'h', 'm'], ['g', 'h', 't'], ['c', 't', 'u', 'r', 'e'], ['r', 'r', 'y'], ['r', 's', 't'], ['t'], ['e', 'x'], ['n', 'y'], ['t', 'e', 'r', 't', 'i', 'o', 'n'], ['g', 'o', 'r', 'i', 't', 'h', 'm'], ['e', 'e', 'd'], ['i'], ['m', 'e', 'n', 't']]
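
The per-character output above comes from the shape of the argument rather than from spaCy. `lemmatization` joins each element of its input with `" ".join(...)`, so a single list of words is read as a list of one-word documents, and joining a string inserts a space between its characters. A minimal sketch of just that joining step (the helper `join_documents` is a hypothetical stand-in, and no spaCy model is needed):

```python
def join_documents(texts):
    """Mirror the `" ".join(sent)` step performed for each document."""
    return [" ".join(sent) for sent in texts]

test = ['given', 'binary', 'search']

as_words = join_documents(test)       # each word is treated as a document
as_sentence = join_documents([test])  # the whole sentence is one document

print(as_words)
print(as_sentence)
```

Wrapping the sentence in a list, as in `lemmatization([test])`, therefore gives the function the input shape it expects.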


#### Calculating the frequency of the words obtained in the previous step and creating the corpus (a bag-of-words representation).

In [17]:
import gensim.corpora as corpora
# Create Dictionary
id2word = corpora.Dictionary(data_lemmatized)
# Create Corpus
texts = data_lemmatized
# Term Document Frequency
corpus = [id2word.doc2bow(text) for text in texts]
# View
print(corpus[:1])

[[(0, 1), (1, 1), (2, 1), (3, 3), (4, 1)]]
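
Each pair in the corpus is `(word id, count in this document)`. The sketch below rebuilds that representation for the first lemmatized question using only `collections.Counter`. The helpers `build_vocab` and `to_bow` are hypothetical stand-ins for `corpora.Dictionary` and `doc2bow`; assigning ids in sorted order within each document is an assumption that happens to reproduce the output shown above.

```python
from collections import Counter

def build_vocab(docs):
    """Map each distinct token to an integer id (hypothetical helper)."""
    token2id = {}
    for doc in docs:
        for token in sorted(set(doc)):
            token2id.setdefault(token, len(token2id))
    return token2id

def to_bow(doc, token2id):
    """Return sorted (token id, count) pairs for one document."""
    counts = Counter(doc)
    return sorted((token2id[token], n) for token, n in counts.items())

docs = [['runtime', 'follow', 'code', 'fragment', 'int', 'int', 'int']]
token2id = build_vocab(docs)
bow = to_bow(docs[0], token2id)
print(bow)
```

The pair `(3, 3)` corresponds to `'int'`, which appears three times in the question.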


#### The following two cells create and view the LDA model for the given dataset.

In [18]:
# Build LDA model
lda_model = gensim.models.LdaMulticore(corpus=corpus, id2word=id2word, num_topics=10, random_state=100, chunksize=100, passes=10, per_word_topics=True)

In [19]:
#View the LDA model
from pprint import pprint
# Print the keywords in the 10 topics
pprint(lda_model.print_topics())
doc_lda = lda_model[corpus]

[(0,
  '0.057*"follow" + 0.042*"sort" + 0.038*"time" + 0.033*"complexity" + '
  '0.028*"path" + 0.021*"algorithm" + 0.017*"short" + 0.017*"record" + '
  '0.016*"node" + 0.013*"graph"'),
 (1,
  '0.069*"list" + 0.051*"time" + 0.041*"link" + 0.032*"sort" + '
  '0.031*"complexity" + 0.027*"average" + 0.024*"bad" + 0.022*"element" + '
  '0.020*"search" + 0.017*"case"'),
 (2,
  '0.084*"follow" + 0.047*"tree" + 0.046*"order" + 0.041*"give" + '
  '0.035*"statement" + 0.024*"hash" + 0.024*"use" + 0.022*"correct" + '
  '0.022*"function" + 0.021*"traversal"'),
 (3,
  '0.107*"int" + 0.059*"complexity" + 0.054*"follow" + 0.054*"sort" + '
  '0.046*"time" + 0.041*"case" + 0.035*"good" + 0.028*"algorithm" + '
  '0.026*"bad" + 0.021*"log"'),
 (4,
  '0.137*"tree" + 0.048*"binary" + 0.042*"follow" + 0.041*"avl" + '
  '0.031*"search" + 0.031*"value" + 0.030*"order" + 0.026*"use" + '
  '0.023*"balance" + 0.021*"balanced"'),
 (5,
  '0.022*"value" + 0.022*"system" + 0.020*"count" + 0.020*"number" + '
  '0.01

#### Computing the coherence score (c_v) of our model. The value obtained depends strongly on the dataset we are using.

In [20]:
#Calculating the coherence score of our model
from gensim.models import CoherenceModel
# Compute Coherence Score
coherence_model_lda = CoherenceModel(model=lda_model, texts=data_lemmatized, dictionary=id2word, coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()
print('\nCoherence Score: ', coherence_lda)


Coherence Score:  0.3731624974915805


In [26]:
# Finding the optimum parameter values
def compute_coherence_values(corpus, dictionary, a, b):
    # NOTE: the number of topics is read from the global variable `k`
    # (set by the grid-search loop), so `k` must be defined before calling.
    lda_model = gensim.models.LdaMulticore(corpus=corpus, id2word=dictionary, num_topics=k, random_state=100, chunksize=100, passes=10, alpha=a, eta=b, per_word_topics=True)
    coherence_model_lda = CoherenceModel(model=lda_model, texts=data_lemmatized, dictionary=dictionary, coherence='c_v')

    return coherence_model_lda.get_coherence()

compute_coherence_values(corpus,id2word,0.1,0.1)

0.3731624974915805

#### This describes the LDA model implementation after pre-processing, starting with the parameter ranges to search. Commenting in progress...

In [27]:
# Topics range
import numpy as np
min_topics = 2
max_topics = 11
step_size = 1
topics_range = range(min_topics, max_topics, step_size)

alpha = list(np.arange(0.01, 1, 0.3))

print(len(topics_range))
print(topics_range)
print(len(alpha))
print(alpha)

9
range(2, 11)
4
[0.01, 0.31, 0.61, 0.9099999999999999]


In [28]:
num_of_docs = len(corpus)
corpus_sets = [#gensim.utils.ClippedCorpus(corpus, int(num_of_docs*0.25)), 
               #gensim.utils.ClippedCorpus(corpus, int(num_of_docs*0.5)), 
               # the document limit should be an integer, so the float product is cast with int()
               gensim.utils.ClippedCorpus(corpus, int(num_of_docs*0.75)), 
               corpus]
corpus_title = ['75% Corpus', '100% Corpus']

#print(len(corpus_sets))
corpus_title[1]
#print(num_of_docs)
300*0.75
#compute_coherence_values(corpus_sets[1],id2word,1,0.1,0.1)
#working = gensim.models.LdaMulticore(corpus=corpus_sets[1], id2word=id2word, num_topics=10, random_state=100, chunksize=100, passes=10, alpha=0.1, eta=0.1, per_word_topics=True)
#check = gensim.models.LdaMulticore(corpus=corpus_sets[0], id2word=id2word, num_topics=10, random_state=100, chunksize=100, passes=10, alpha=0.1, eta=0.1, per_word_topics=True)
#corpus_sets[0]
#print(len(corpus_sets[0]))

225.0
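
Note that `300*0.75` evaluates to the float `225.0`, whereas a document limit is naturally an integer. The sketch below shows the same "first 75% of the documents" idea with `itertools.islice`, which only accepts integer limits. The function `first_fraction` and the dummy corpus are assumptions for illustration; `islice` stands in for `gensim.utils.ClippedCorpus` here.

```python
from itertools import islice

def first_fraction(corpus, fraction):
    """Return the first `fraction` of the documents (hypothetical helper)."""
    max_docs = int(len(corpus) * fraction)  # cast: islice rejects floats
    return list(islice(corpus, max_docs))

dummy_corpus = list(range(300))  # stand-in for the 300 bag-of-words documents
subset = first_fraction(dummy_corpus, 0.75)
print(len(subset))
```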

In [29]:
import numpy as np
import tqdm

# Topics range: LDA needs the number of topics fixed in advance, so we search over it
min_topics = 2
max_topics = 11
step_size = 1
topics_range = range(min_topics, max_topics, step_size)

# Alpha parameter
alpha = list(np.arange(0.01, 1, 0.3))

# Beta parameter
beta = list(np.arange(0.01, 1, 0.3))

#Validation Sets
num_of_docs = len(corpus)
corpus_sets = [#gensim.utils.ClippedCorpus(corpus, 75), 
               #gensim.utils.ClippedCorpus(corpus, 150), 
               gensim.utils.ClippedCorpus(corpus, 225), 
               corpus]
corpus_title = ['75% Corpus', '100% Corpus']

model_results = {'Validation_Set':[],'Topics': [], 'Alpha': [], 'Beta': [], 'Coherence': []}

# Can take a long time to run 
x = len(topics_range)*len(alpha)*len(beta)*len(corpus_sets)
print(x)
pbar = tqdm.tqdm(total=x)

#iterate through validation corpora
for i in range(0,len(corpus_sets)):
    #iterate through number of topics
    for k in topics_range:
        #iterate through alpha values
        for a in alpha:
            #iterate through beta values
            for b in beta:
                # get the coherence score for the given parameters
                # (the number of topics k is picked up by the function as a global)
                cv = compute_coherence_values(corpus=corpus_sets[i], dictionary=id2word, a=a, b=b)

                # Save the model results
                model_results['Validation_Set'].append(corpus_title[i])
                model_results['Topics'].append(k)
                model_results['Alpha'].append(a)
                model_results['Beta'].append(b)
                model_results['Coherence'].append(cv)

                pbar.update(1)
pd.DataFrame(model_results).to_csv('lda_tuning_results.csv', index=False)
pbar.close()


  0%|                                                                                          | 0/288 [00:00<?, ?it/s]

288


KeyboardInterrupt: 
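
The figure 288 is the size of the full parameter grid: 9 topic counts × 4 alpha values × 4 beta values × 2 validation sets, and each combination trains one LDA model, which is why the run is slow. The four nested loops can also be written as a single loop over `itertools.product`. A minimal sketch that only enumerates the grid (no model is trained; the alpha and beta values are rebuilt without NumPy and rounded to two decimals, which is an assumption for readability):

```python
from itertools import product

topics_range = range(2, 11)
alpha = [round(0.01 + 0.3 * i, 2) for i in range(4)]  # 0.01, 0.31, 0.61, 0.91
beta = list(alpha)
corpus_title = ['75% Corpus', '100% Corpus']

# Same order as the nested loops: validation set, topics, alpha, beta
grid = list(product(corpus_title, topics_range, alpha, beta))

print(len(grid))
print(grid[0])
```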

In [24]:
import numpy as np

topics_range = range(2,11,1)
alpha_check = np.arange(0.01, 1, 0.3)
beta_check = np.arange(0.01,1,0.3)
loop_length  = 0

#iterate through validation corpora
for i in range(0,len(corpus_sets)):
    #iterate through number of topics
    for k in topics_range:
        #iterate through alpha values
        for a in alpha_check:
            #iterate through beta values
            for b in beta_check:
                loop_length += 1
            
#print(loop_length)
print(model_results)

x = len(topics_range)*len(alpha)*len(beta)*len(corpus_sets)
#print(x)
#print(len(range(0,len(corpus_sets))))

NameError: name 'model_results' is not defined

In [39]:
for i in range(0,len(corpus_sets)):
    print(i)
    print(corpus_title[i])

0
75% Corpus
1
100% Corpus


In [28]:
print(num_of_docs)
print(len(corpus_sets))
print(len(range(1,len(corpus_sets))))

300
2
1


#### Here we train the final model using the best parameters found by the search above (num_topics=8, alpha=0.01, eta=0.9).

In [73]:
lda_model = gensim.models.LdaMulticore(corpus=corpus, id2word=id2word, num_topics=8, random_state=100, chunksize=100, passes=10, alpha=0.01, eta=0.9)
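
One way to read off such parameters from the grid search is to take the row with the highest coherence. The sketch below does this on a few rows shaped like the `model_results` entries. The coherence values are made-up placeholders for illustration only, not results from this dataset, and the variable names are hypothetical.

```python
# Placeholder rows: the coherence numbers are invented for illustration
rows = [
    {'Topics': 2, 'Alpha': 0.01, 'Beta': 0.01, 'Coherence': 0.10},
    {'Topics': 8, 'Alpha': 0.01, 'Beta': 0.91, 'Coherence': 0.30},
    {'Topics': 10, 'Alpha': 0.31, 'Beta': 0.61, 'Coherence': 0.20},
]

# Pick the parameter combination with the highest coherence score
best = max(rows, key=lambda row: row['Coherence'])
print(best['Topics'], best['Alpha'], best['Beta'])
```

In the worksheet itself the same selection could be made on `lda_tuning_results.csv` once the grid search has run to completion.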

#### The following gives a visual representation of the clusters.

In [74]:
# Requires: pip install pyLDAvis
# Note: in pyLDAvis 3.x this submodule was renamed to pyLDAvis.gensim_models
import pyLDAvis.gensim
import pyLDAvis
# Visualize the topics
pyLDAvis.enable_notebook()
LDAvis_prepared = pyLDAvis.gensim.prepare(lda_model, corpus, id2word)
LDAvis_prepared

ModuleNotFoundError: No module named 'pyLDAvis'