# Topic Modelling

* **Latent Dirichlet Allocation (LDA)**
* **Latent Semantic Indexing (LSI)**

### Download Wikipedia articles

* **wikipedia** library
* ***page*** object
* ***content*** attribute

In [1]:
import wikipedia

glw = wikipedia.page("Global Warming")
ai = wikipedia.page("Artificial Intelligence")
eif = wikipedia.page("Eiffel Tower")
mls = wikipedia.page("Mona Lisa")

corpus = [glw.content, ai.content, eif.content, mls.content]

### Data preprocessing

* Remove all special characters
* Remove all single characters
* Remove single characters from the start
* Replace multiple spaces with a single space
* Remove prefixed 'b'
* Convert text to lower-case
* Lemmatize words
* Remove stopwords
* Remove short words (five characters or fewer)

In [2]:
import re
import nltk
nltk.download('stopwords')
nltk.download('wordnet')  # corpus required by WordNetLemmatizer
from nltk.stem import WordNetLemmatizer

stopwords = set(nltk.corpus.stopwords.words('english'))
lemma = WordNetLemmatizer()

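def preprocess(document):
    # remove all special characters
    document = re.sub(r'\W', ' ', str(document))
    # remove all single characters surrounded by whitespace
    document = re.sub(r'\s+[a-zA-Z]\s+', ' ', document)
    # remove a single character at the start of the text
    document = re.sub(r'^[a-zA-Z]\s+', ' ', document)
    # replace multiple spaces with a single space
    document = re.sub(r'\s+', ' ', document)
    # remove a prefixed 'b' (only when it stands alone at the start)
    document = re.sub(r'^b\s+', '', document)
    document = document.lower()
    tokens = document.split()
    tokens = [lemma.lemmatize(word) for word in tokens]
    tokens = [word for word in tokens if word not in stopwords]
    # keep only words longer than five characters
    tokens = [word for word in tokens if len(word) > 5]
    return tokens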

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/rchattopadhyay/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
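
To see what the regular-expression steps do on their own, here is a minimal sketch that uses only the standard library. The helper name `clean_text` and the sample sentence are made up for illustration, and lemmatization and stopword removal are deliberately left out.

```python
import re

def clean_text(text):
    """Regex-only cleanup; lemmatization and stopword removal are left out."""
    text = re.sub(r'\W', ' ', str(text))         # special characters -> spaces
    text = re.sub(r'\s+[a-zA-Z]\s+', ' ', text)  # isolated single letters
    text = re.sub(r'^[a-zA-Z]\s+', ' ', text)    # single letter at the start
    text = re.sub(r'\s+', ' ', text)             # collapse runs of whitespace
    return text.lower().strip()

# hypothetical sample sentence
sample = "A structure, built in 1889 -- it's 330 m tall!"
cleaned = clean_text(sample)
long_tokens = [word for word in cleaned.split() if len(word) > 5]

print(cleaned)
print(long_tokens)
```

Only `structure` survives the length filter here, which shows how aggressive the `len(word) > 5` cut-off is on short texts.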


## LDA

### Modeling topics

In [3]:
from gensim import corpora
from gensim.models.ldamodel import LdaModel
import pickle

# preprocess data
data = [preprocess(document) for document in corpus]

# create and save dictionary
dictionary = corpora.Dictionary(data)
dictionary.save('dictionary.gensim')

# create and save bag of words
# (allow_update is not needed: the dictionary was already built from data)
bow = [dictionary.doc2bow(tokens) for tokens in data]
with open('bag_of_words.pkl', 'wb') as f:
    pickle.dump(bow, f)

# create and save LDA model
lda_model = LdaModel(bow, num_topics=4, id2word=dictionary, passes=20)
lda_model.save('lda_model.gensim')
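
The bag-of-words step can be pictured with a small pure-Python sketch. This is a simplified stand-in, not gensim's implementation: `build_vocabulary` and `to_bow` are hypothetical helpers, and the ids gensim assigns may differ from the first-appearance order assumed here.

```python
from collections import Counter

def build_vocabulary(documents):
    """Assign an integer id to each distinct token, in order of first appearance."""
    token2id = {}
    for tokens in documents:
        for token in tokens:
            token2id.setdefault(token, len(token2id))
    return token2id

def to_bow(tokens, token2id):
    """Return sorted (token_id, count) pairs, skipping tokens outside the vocabulary."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())

# toy documents, already tokenized
docs = [["climate", "change", "climate"], ["painting", "portrait"]]
vocab = build_vocabulary(docs)

print(to_bow(docs[0], vocab))                   # counts for the first document
print(to_bow(["painting", "louvre"], vocab))    # "louvre" is unknown and dropped
```

Each document becomes a sparse list of `(id, count)` pairs, which is the shape `LdaModel` receives as `bow`.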

**Using the model to find 10 words for each topic**
* ***print_topics***

In [4]:
topics = lda_model.print_topics(num_words=10)

for topic in topics:
    print(topic)

(0, '0.036*"painting" + 0.017*"leonardo" + 0.010*"portrait" + 0.009*"louvre" + 0.006*"century" + 0.006*"museum" + 0.006*"french" + 0.005*"italian" + 0.005*"giocondo" + 0.004*"subject"')
(1, '0.014*"climate" + 0.011*"change" + 0.011*"intelligence" + 0.009*"machine" + 0.009*"warming" + 0.009*"global" + 0.008*"artificial" + 0.008*"system" + 0.008*"problem" + 0.007*"learning"')
(2, '0.000*"climate" + 0.000*"change" + 0.000*"intelligence" + 0.000*"eiffel" + 0.000*"problem" + 0.000*"warming" + 0.000*"emission" + 0.000*"learning" + 0.000*"system" + 0.000*"painting"')
(3, '0.026*"eiffel" + 0.008*"second" + 0.006*"french" + 0.006*"structure" + 0.006*"exposition" + 0.005*"tallest" + 0.005*"engineer" + 0.004*"design" + 0.004*"france" + 0.004*"restaurant"')
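
Topic 2 above shows a weight of 0.000 for every word, so it carries no usable signal at this rounding. A rough way to spot such topics is to parse the printed strings, as in the sketch below. The helper `parse_topic` is hypothetical, and the two topic strings are shortened copies of the output above.

```python
import re

# matches one weight*"word" term, with an optional minus sign
TERM = re.compile(r'(-?\d+\.\d+)\*"([^"]+)"')

def parse_topic(topic_string):
    """Turn '0.036*"painting" + ...' into [('painting', 0.036), ...]."""
    return [(word, float(weight)) for weight, word in TERM.findall(topic_string)]

topics = [
    (0, '0.036*"painting" + 0.017*"leonardo" + 0.010*"portrait"'),
    (2, '0.000*"climate" + 0.000*"change" + 0.000*"intelligence"'),
]

empty_topics = []
for topic_id, topic_string in topics:
    terms = parse_topic(topic_string)
    if all(weight == 0.0 for _, weight in terms):
        empty_topics.append(topic_id)

print("Topics with no weight above 0.000:", empty_topics)
```

An empty topic like this may be a hint that four topics is more than this model actually uses, though that is a judgment call rather than a rule.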


### Evaluating the model

* **CoherenceModel** ***get_coherence***
* ***log_perplexity***

In [5]:
from gensim.models import CoherenceModel
c_model = CoherenceModel(model=lda_model, texts=data, dictionary=dictionary, coherence='c_v')
print('Coherence Score: ', c_model.get_coherence())

# log_perplexity returns the per-word likelihood bound (log scale), not the perplexity itself
print('Model Perplexity: ', lda_model.log_perplexity(bow))

Coherence Score:  0.48925924983784896
Model Perplexity:  -7.702318241758859
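
The value printed as "Model Perplexity" is negative because `log_perplexity` returns a per-word likelihood bound, not a perplexity. gensim reports the perplexity estimate as 2 raised to the negative bound. The sketch below applies that conversion to the number printed above; `bound_to_perplexity` is a hypothetical helper name.

```python
def bound_to_perplexity(per_word_bound):
    """Convert a per-word likelihood bound (base 2) to a perplexity estimate."""
    return 2 ** (-per_word_bound)

bound = -7.702318241758859  # value printed by log_perplexity above
perplexity = bound_to_perplexity(bound)

print(f"Perplexity estimate: {perplexity:.1f}")
```

Lower perplexity is better. Because it is computed here on the training corpus itself, treat it as an optimistic figure.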


### Testing the model

Create a test document and its bag of words.
* ***get_document_topics***

In [6]:
test_document = 'Great structures are built to remember a historical event.'
test_document = preprocess(test_document)
test_bow = dictionary.doc2bow(test_document)

print('Topics (test doc): ', lda_model.get_document_topics(test_bow))

Topics (test doc):  [(0, 0.25484878), (1, 0.085481435), (2, 0.0834606), (3, 0.5762092)]


Topic 0 = 25.5%<br>
Topic 1 = 8.5%<br>
Topic 2 = 8.3%<br>
Topic 3 = 57.6%<br>
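
The list returned by `get_document_topics` can be turned into percentages and a dominant topic with plain Python. This is a minimal sketch that hard-codes the output printed above instead of calling the model.

```python
# (topic_id, probability) pairs copied from the output above
doc_topics = [(0, 0.25484878), (1, 0.085481435), (2, 0.0834606), (3, 0.5762092)]

for topic_id, prob in doc_topics:
    print(f"Topic {topic_id} = {prob:.1%}")

# the dominant topic is the pair with the largest probability
dominant_id, dominant_prob = max(doc_topics, key=lambda pair: pair[1])
print(f"Dominant topic: {dominant_id} ({dominant_prob:.1%})")
```

Topic 3 is the Eiffel Tower topic in the `print_topics` output, which fits a test sentence about structures.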

### Visualizing the LDA model

* **pyLDAvis**
* ***prepare***
* ***display***

In [7]:
import pyLDAvis.gensim  # newer pyLDAvis releases name this module pyLDAvis.gensim_models

dictionary = corpora.Dictionary.load('dictionary.gensim')
with open('bag_of_words.pkl', 'rb') as f:
    bow = pickle.load(f)
lda_model = LdaModel.load('lda_model.gensim')

lda_vis = pyLDAvis.gensim.prepare(lda_model, bow, dictionary, sort_topics=False)
pyLDAvis.display(lda_vis)



## LSI

In [8]:
from gensim.models import LsiModel

lsi_model = LsiModel(bow, num_topics=4, id2word=dictionary)

topics = lsi_model.print_topics(num_words=10)

for topic in topics:
    print(topic)

(0, '0.312*"intelligence" + 0.270*"machine" + 0.236*"artificial" + 0.226*"climate" + 0.221*"problem" + 0.209*"system" + 0.199*"learning" + 0.195*"change" + 0.156*"network" + 0.152*"warming"')
(1, '0.452*"climate" + 0.352*"change" + 0.305*"warming" + 0.289*"global" + -0.205*"intelligence" + 0.201*"emission" + -0.177*"machine" + 0.161*"temperature" + 0.156*"greenhouse" + -0.154*"artificial"')
(2, '0.685*"painting" + 0.327*"leonardo" + 0.187*"eiffel" + 0.185*"portrait" + 0.180*"louvre" + 0.155*"french" + 0.124*"century" + 0.121*"museum" + 0.097*"original" + 0.092*"italian"')
(3, '0.656*"eiffel" + -0.272*"painting" + 0.180*"second" + 0.145*"exposition" + 0.145*"structure" + -0.134*"leonardo" + 0.128*"tallest" + 0.116*"engineer" + 0.107*"design" + 0.102*"restaurant"')
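
Unlike LDA probabilities, LSI weights are signed, so a topic can contrast two groups of words. The sketch below splits a printed topic string by sign. The helper `split_by_sign` is hypothetical, and the string is a shortened copy of topic 1 above.

```python
import re

# matches one weight*"word" term, with an optional minus sign
TERM = re.compile(r'(-?\d+\.\d+)\*"([^"]+)"')

def split_by_sign(topic_string):
    """Return (positive_words, negative_words) for a printed LSI topic."""
    positive, negative = [], []
    for weight, word in TERM.findall(topic_string):
        (negative if weight.startswith('-') else positive).append(word)
    return positive, negative

lsi_topic_1 = ('0.452*"climate" + 0.352*"change" + 0.305*"warming" + 0.289*"global" + '
               '-0.205*"intelligence" + 0.201*"emission" + -0.177*"machine"')

positive, negative = split_by_sign(lsi_topic_1)
print("positive:", positive)
print("negative:", negative)
```

For topic 1, the positive side collects climate vocabulary and the negative side collects AI vocabulary, so this dimension mostly separates those two articles.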
