https://datascienceplus.com/topic-modeling-in-python-with-nltk-and-gensim/

##### The Process 
* We pick the number of topics ahead of time even if we’re not sure what the topics are.
* Each document is represented as a distribution over topics.
* Each topic is represented as a distribution over words.
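
To make the two "distribution" bullets concrete, here is a minimal sketch of how LDA imagines a document being generated. The topic names, words, and probabilities are hypothetical toy values, not output from the model trained later in this notebook.

```python
import random

# Hypothetical toy model: two topics, each a distribution over words.
topics = {
    "networks": {"network": 0.5, "wireless": 0.3, "sensor": 0.2},
    "databases": {"query": 0.5, "database": 0.3, "search": 0.2},
}

# One document, represented as a distribution over the topics.
doc_topic_mix = {"networks": 0.7, "databases": 0.3}


def generate_document(topic_mix, topics, n_words, rng):
    """For each word slot: pick a topic, then pick a word from that topic."""
    words = []
    for _ in range(n_words):
        topic = rng.choices(list(topic_mix), weights=list(topic_mix.values()))[0]
        word_dist = topics[topic]
        word = rng.choices(list(word_dist), weights=list(word_dist.values()))[0]
        words.append(word)
    return words


rng = random.Random(0)  # fixed seed so the sample is repeatable
sample_doc = generate_document(doc_topic_mix, topics, n_words=8, rng=rng)
print(sample_doc)
```

Training LDA runs this story in reverse: given only the words, it estimates the topic mix of each document and the word distribution of each topic.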

### Text Cleaning

In [1]:
# Uncomment the line below if you wish to install spaCy

# !conda install --yes spacy

Fetching package metadata .........
Solving package specifications: .

# All requested packages already installed.
# packages in environment at /home/amit/anaconda3/envs/VenvPandas:
#
spacy                     1.8.2                    py36_0  


In [2]:
import spacy

In [3]:
spacy.load('en')
# from spacy.lang.en import English
# parser = English()



    Only loading the 'en' tokenizer.



<spacy.en.English at 0x7f9faacf6320>

In [4]:
# spaCy 1.x import path; in spaCy 2 and later use: from spacy.lang.en import English
from spacy.en import English
parser = English()

In [5]:
# Look at what the parser object can do
parser?

tokens = parser('An example sentence. Another example sentence.')

print(tokens.sentiment)
print(tokens[0].orth_.isspace())
print(tokens[0].head.tag_)

0.0
False



In [6]:
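def tokenize(text):
    """Split text into lowercase tokens, masking URLs and @handles."""
    lda_tokens = []
    tokens = parser(text)
    for token in tokens:
        if token.orth_.isspace():
            # Skip whitespace-only tokens
            continue
        elif token.like_url:
            # Replace every URL with one placeholder token
            lda_tokens.append('URL')
        elif token.orth_.startswith('@'):
            # Replace user handles with one placeholder token
            lda_tokens.append('SCREEN_NAME')
        else:
            lda_tokens.append(token.lower_)
    return lda_tokens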
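
The three rules in `tokenize` (mask URLs, mask screen names, lowercase everything else) can be tried without spaCy. The sketch below is a simplified stand-in, not the spaCy tokenizer: `simple_tokenize` and `URL_RE` are hypothetical names, and splitting on whitespace does not separate punctuation the way spaCy does.

```python
import re

# Assumption: a token counts as a URL if it starts with http(s):// or www.
URL_RE = re.compile(r"^(https?://|www\.)\S+$", re.IGNORECASE)


def simple_tokenize(text):
    """Whitespace tokenizer that applies the same three masking rules."""
    tokens = []
    for raw in text.split():  # split() already drops whitespace-only pieces
        if URL_RE.match(raw):
            tokens.append("URL")
        elif raw.startswith("@"):
            tokens.append("SCREEN_NAME")
        else:
            tokens.append(raw.lower())
    return tokens


print(simple_tokenize("Check http://example.com via @Alice NOW"))
# ['check', 'URL', 'via', 'SCREEN_NAME', 'now']
```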

We use __*NLTK’s WordNet*__ to find the meanings of <span style="color:red">words, synonyms, antonyms, and more</span>. In addition, we use __WordNetLemmatizer__ to get the root word.

In [7]:
import nltk

In [8]:
from nltk.corpus import wordnet as wn

def get_lemma(word):
    lemma = wn.morphy(word)
    if lemma is None:
        return word
    else:
        return lemma

from nltk.stem.wordnet import WordNetLemmatizer

def get_lemma2(word):
    return WordNetLemmatizer().lemmatize(word)

In [9]:
wn?

<u>Filtering out stop words</u>

In [10]:
# nltk.download('stopwords')

en_stop = set(nltk.corpus.stopwords.words('english'))

In [11]:
en_stop

{'a',
 'about',
 'above',
 'after',
 'again',
 'against',
 'ain',
 'all',
 'am',
 'an',
 'and',
 'any',
 'are',
 'aren',
 "aren't",
 'as',
 'at',
 'be',
 'because',
 'been',
 'before',
 'being',
 'below',
 'between',
 'both',
 'but',
 'by',
 'can',
 'couldn',
 "couldn't",
 'd',
 'did',
 'didn',
 "didn't",
 'do',
 'does',
 'doesn',
 "doesn't",
 'doing',
 'don',
 "don't",
 'down',
 'during',
 'each',
 'few',
 'for',
 'from',
 'further',
 'had',
 'hadn',
 "hadn't",
 'has',
 'hasn',
 "hasn't",
 'have',
 'haven',
 "haven't",
 'having',
 'he',
 'her',
 'here',
 'hers',
 'herself',
 'him',
 'himself',
 'his',
 'how',
 'i',
 'if',
 'in',
 'into',
 'is',
 'isn',
 "isn't",
 'it',
 "it's",
 'its',
 'itself',
 'just',
 'll',
 'm',
 'ma',
 'me',
 'mightn',
 "mightn't",
 'more',
 'most',
 'mustn',
 "mustn't",
 'my',
 'myself',
 'needn',
 "needn't",
 'no',
 'nor',
 'not',
 'now',
 'o',
 'of',
 'off',
 'on',
 'once',
 'only',
 'or',
 'other',
 'our',
 'ours',
 'ourselves',
 'out',
 'over',
 'own',
 'r

In [12]:
# Prepare text for topic modelling with latent Dirichlet allocation (LDA)
def prepare_text_for_lda(text):
    tokens = tokenize(text)
    # Keep only tokens longer than 4 characters
    tokens = [token for token in tokens if len(token) > 4]
    tokens = [token for token in tokens if token not in en_stop]
    tokens = [get_lemma(token) for token in tokens]
    return tokens
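
The order of the three filters matters. The sketch below mimics the pipeline on a toy token list. `TOY_STOP_WORDS` and `TOY_LEMMAS` are hypothetical stand-ins for the NLTK stop-word list and the WordNet lookup, so the example runs without any downloads.

```python
# Hypothetical stand-ins for en_stop and get_lemma
TOY_STOP_WORDS = {"about", "their", "there"}
TOY_LEMMAS = {"networks": "network", "queries": "query", "based": "base"}


def toy_prepare(tokens):
    tokens = [t for t in tokens if len(t) > 4]              # 1. length filter
    tokens = [t for t in tokens if t not in TOY_STOP_WORDS]  # 2. stop words
    return [TOY_LEMMAS.get(t, t) for t in tokens]            # 3. lemmatize


print(toy_prepare(["fast", "queries", "about", "wireless", "networks", "data"]))
# ['query', 'wireless', 'network']
print(toy_prepare(["based"]))
# ['base']
```

Because lemmatization runs last, a lemma shorter than five characters (such as `base` from `based`) can still end up in the result even though the length filter removes short surface forms like `data`.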

Open the data file and read it line by line. For each line, prepare the text for LDA and add the resulting tokens to a list.

In [13]:
import random
text_data = []
with open(r'./Data/Topic_modelling_data.csv','r') as f:
    for line in f:
        tokens = prepare_text_for_lda(line)
        # Print roughly 1% of the documents as a spot check
        if random.random() > 0.99:
            print(tokens)
        text_data.append(tokens)

['multiuser', 'detection', 'base', 'grover', 'algorithm']
['generate', 'diverse', 'representative', 'image', 'search', 'result', 'landmark']
['distribute', 'large', 'scale', 'natural', 'graph', 'factorization']
['efficient', 'evaluation', 'generalize', 'pattern', 'query']
['ultra', 'power', 'employ', 'noise', 'cancellation']
['automatic', 'identification', 'goal', 'search']
['ibind', 'smooth', 'indirect', 'binding', 'using', 'segment', 'layer']
['testbed', 'manage', 'dynamic', 'mix', 'workload']
['indexing', 'orient', 'overlay', 'network']
['consensus', 'network', 'multi', 'agent', 'system', 'model', 'predictive', 'control', 'horizon']
['phase', 'noise', 'bottom', 'series', 'coupling', 'capacitor', 'tapping']
['fix', 'pattern', 'noise', 'current', 'imager', 'using', 'velocity', 'saturate', 'readout', 'transistor']
['context', 'aware', 'image', 'semantic', 'extraction', 'social']
['interaction', 'tabletop', 'augment', 'reality']
['scalable', 'spatio', 'temporal', 'knowledge', 'harvestin

In [14]:
print(type(text_data))
print(len(text_data))
text_data

<class 'list'>
2507


[['innovation',
  'database',
  'management',
  'computer',
  'science',
  'engineering'],
 ['performance', 'prime', 'field', 'multiplication'],
 ['enchant',
  'scissors',
  'scissor',
  'interface',
  'support',
  'cutting',
  'interactive',
  'fabrication'],
 ['detection',
  'channel',
  'degradation',
  'attack',
  'intermediary',
  'linear',
  'network'],
 ['pinning', 'complex', 'network', 'betweenness', 'centrality', 'strategy'],
 ['analysis', 'design', 'memoryless', 'interconnect', 'encoding', 'scheme'],
 ['dynamic', 'bluescreens'],
 ['quantitative', 'assure', 'forwarding', 'service'],
 ['automatic',
  'sanitization',
  'social',
  'network',
  'prevent',
  'inference',
  'attack'],
 ['916;&#931',
  'radar',
  'range',
  'capability',
  'human',
  'monitoring',
  'system'],
 ['architecture', 'multi', 'memory', 'system', 'operation'],
 ['base', 'service', 'customization', 'houdini'],
 ['business', 'policy', 'modeling', 'enforcement', 'database'],
 ['speed', 'linearity', 'power', '

### LDA with Gensim

First, we create a dictionary from the data, then convert the data to a bag-of-words corpus, and finally save the dictionary and corpus for future use.

In [20]:
!mkdir -p ./output/topic_modelling/


In [21]:
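import pickle
from gensim import corpora
# Map each unique token to an integer id
dictionary = corpora.Dictionary(text_data)
# Convert each document to a list of (token_id, count) pairs
corpus = [dictionary.doc2bow(text) for text in text_data]
# Save both so later cells can reload them without recomputing
with open('./output/topic_modelling/corpus.pkl', 'wb') as f:
    pickle.dump(corpus, f)
dictionary.save('./output/topic_modelling/dictionary.gensim')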

In [16]:
type(corpus)

list
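
To show what a bag-of-words corpus looks like, here is a minimal pure-Python sketch of the idea behind `Dictionary` and `doc2bow`. It is a simplification, not gensim's implementation: `build_vocab` and `to_bow` are hypothetical helpers, and ids are assigned in order of first appearance, which may differ from the ids gensim assigns.

```python
from collections import Counter


def build_vocab(docs):
    """Assign each unique token an integer id (order of first appearance)."""
    token2id = {}
    for doc in docs:
        for token in doc:
            token2id.setdefault(token, len(token2id))
    return token2id


def to_bow(doc, token2id):
    """Return sorted (token_id, count) pairs; unknown tokens are ignored."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())


docs = [["wireless", "sensor", "network"], ["network", "query", "network"]]
vocab = build_vocab(docs)
print(vocab)
# {'wireless': 0, 'sensor': 1, 'network': 2, 'query': 3}
print(to_bow(docs[1], vocab))
# [(2, 2), (3, 1)]
print(to_bow(["network", "unseen"], vocab))
# [(2, 1)]
```

Word order is discarded; only the counts per token id survive, which is all LDA needs.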

We ask LDA to find 5 topics in the data.

In [28]:
import gensim
NUM_OF_TOPICS = 5
ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics=NUM_OF_TOPICS, id2word=dictionary, passes=15)
ldamodel.save('./output/topic_modelling/model5.gensim')

In [29]:
# we can check the major words related to each topic
topics = ldamodel.print_topics(num_words=4)
for topic in topics:
    print(topic)

(0, '0.028*"base" + 0.017*"algorithm" + 0.015*"network" + 0.011*"using"')
(1, '0.054*"network" + 0.026*"wireless" + 0.021*"sensor" + 0.013*"mobile"')
(2, '0.018*"query" + 0.013*"using" + 0.013*"model" + 0.011*"scalable"')
(3, '0.022*"using" + 0.018*"filter" + 0.014*"efficient" + 0.012*"search"')
(4, '0.047*"system" + 0.017*"base" + 0.012*"management" + 0.010*"design"')
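
Each topic is printed as a formatted string, which is awkward to work with programmatically. The sketch below parses one of the strings above into `(word, weight)` pairs with a regular expression; `parse_topic` is a hypothetical helper written for this illustration. When the model object is available, gensim's `show_topic` method returns such pairs directly.

```python
import re

# Matches terms of the form 0.054*"network"
TERM_RE = re.compile(r'([0-9.]+)\*"([^"]+)"')


def parse_topic(topic_string):
    """Turn '0.054*"network" + ...' into [('network', 0.054), ...]."""
    return [(word, float(weight)) for weight, word in TERM_RE.findall(topic_string)]


line = '0.054*"network" + 0.026*"wireless" + 0.021*"sensor" + 0.013*"mobile"'
print(parse_topic(line))
# [('network', 0.054), ('wireless', 0.026), ('sensor', 0.021), ('mobile', 0.013)]
```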


Let's try a new document

In [30]:
new_doc = 'Practical Bayesian Optimization of Machine Learning Algorithms'
new_doc = prepare_text_for_lda(new_doc)
new_doc_bow = dictionary.doc2bow(new_doc)
print(new_doc_bow)
print(ldamodel.get_document_topics(new_doc_bow))

[(133, 1), (234, 1), (440, 1), (587, 1), (1860, 1)]
[(0, 0.13474018905523655), (1, 0.033890014580698992), (2, 0.033334529831441383), (3, 0.51612457465521888), (4, 0.2819106918774042)]


My new document is about machine learning algorithms. The LDA output assigns the highest probability to topic 3 (about 0.52) and the second highest to topic 4 (about 0.28). We agreed!

Remember that the above 5 probabilities add up to 1.
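
A quick check of that claim, using the probabilities printed above. One caveat: `get_document_topics` drops topics whose probability falls below the model's `minimum_probability` threshold, so in general the printed values can sum to slightly less than 1.

```python
# Values copied from the output of get_document_topics above
doc_topics = [
    (0, 0.13474018905523655),
    (1, 0.033890014580698992),
    (2, 0.033334529831441383),
    (3, 0.51612457465521888),
    (4, 0.2819106918774042),
]

total = sum(prob for _, prob in doc_topics)
ranked = sorted(doc_topics, key=lambda pair: pair[1], reverse=True)

print(round(total, 6))
# 1.0
print([topic_id for topic_id, _ in ranked[:2]])
# [3, 4]
```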

Now we ask LDA to find 3 topics in the data.

In [31]:
ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics=3, id2word=dictionary, passes=15)
ldamodel.save('./output/topic_modelling/model3.gensim')
topics = ldamodel.print_topics(num_words=4)
for topic in topics:
    print(topic)

(0, '0.035*"network" + 0.021*"base" + 0.014*"wireless" + 0.014*"using"')
(1, '0.022*"query" + 0.020*"database" + 0.014*"search" + 0.012*"system"')
(2, '0.013*"system" + 0.012*"base" + 0.010*"network" + 0.010*"analysis"')


We can also ask for 10 topics.

In [33]:
ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics=10, id2word=dictionary, passes=15)
ldamodel.save('./output/topic_modelling/model10.gensim')
topics = ldamodel.print_topics(num_words=4)
for topic in topics:
    print(topic)

(0, '0.016*"using" + 0.016*"semantic" + 0.015*"base" + 0.014*"model"')
(1, '0.018*"large" + 0.017*"services" + 0.016*"mobile" + 0.016*"scale"')
(2, '0.026*"efficient" + 0.019*"information" + 0.015*"design" + 0.015*"system"')
(3, '0.026*"power" + 0.019*"application" + 0.018*"efficient" + 0.018*"system"')
(4, '0.032*"algorithm" + 0.016*"base" + 0.014*"using" + 0.013*"network"')
(5, '0.033*"database" + 0.026*"system" + 0.023*"search" + 0.020*"query"')
(6, '0.028*"video" + 0.021*"base" + 0.015*"coding" + 0.011*"voltage"')
(7, '0.030*"system" + 0.023*"base" + 0.014*"simulation" + 0.013*"level"')
(8, '0.026*"base" + 0.021*"detection" + 0.021*"image" + 0.018*"method"')
(9, '0.130*"network" + 0.049*"wireless" + 0.028*"sensor" + 0.016*"route"')


### pyLDAvis

[pyLDAvis](https://pypi.python.org/pypi/pyLDAvis/2.1.1) is designed to help users interpret the topics in a topic model that has been fit to a corpus of text data. The __package extracts information from a fitted LDA topic model to inform an interactive web-based visualization.__

Visualizing the 5-topic model

In [11]:
# Uncomment the line below and run this cell to install pyLDAvis
# !pip install pyLDAvis

In [2]:
import gensim
import pickle

In [4]:
dictionary = gensim.corpora.Dictionary.load('./output/topic_modelling/dictionary.gensim')
with open('./output/topic_modelling/corpus.pkl', 'rb') as f:
    corpus = pickle.load(f)
lda = gensim.models.ldamodel.LdaModel.load('./output/topic_modelling/model5.gensim')

In [8]:
# Module name in pyLDAvis 2.x; pyLDAvis 3.x renamed it to pyLDAvis.gensim_models
import pyLDAvis.gensim

In [9]:
lda_display = pyLDAvis.gensim.prepare(lda, corpus, dictionary, sort_topics=False)

In [10]:
pyLDAvis.display(lda_display)

__Saliency__: a measure of how much the term tells you about the topic.

__Relevance__: a weighted average of the log probability of the word given the topic and the log of its lift, where lift is the probability of the word given the topic divided by the overall probability of the word in the corpus.

The __size of the bubble measures the importance of the topics__, relative to the data.
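
The relevance definition can be written as a small function. This is a sketch of the formula, relevance = λ·log p(w|t) + (1 − λ)·log(p(w|t) / p(w)), not code taken from pyLDAvis. The probabilities below are hypothetical, and the default `lam=0.6` is only an illustrative choice; in the visualization, λ is set with the slider.

```python
import math


def relevance(p_word_given_topic, p_word, lam=0.6):
    """Weighted average of log p(w|t) and log lift, with weight lam in [0, 1]."""
    lift = p_word_given_topic / p_word
    return lam * math.log(p_word_given_topic) + (1 - lam) * math.log(lift)


# Hypothetical numbers: a frequent, general word vs. a rare, topic-specific word
general = {"p_word_given_topic": 0.13, "p_word": 0.04}
specific = {"p_word_given_topic": 0.016, "p_word": 0.002}

for lam in (1.0, 0.0):
    print(lam, relevance(lam=lam, **general), relevance(lam=lam, **specific))
```

With `lam=1.0` the ranking follows raw probability within the topic, so the general word scores higher. With `lam=0.0` the ranking follows lift alone, so the topic-specific word scores higher.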


Visualizing 3 topics

In [12]:
lda_model = gensim.models.ldamodel.LdaModel.load('./output/topic_modelling/model3.gensim')
lda_display = pyLDAvis.gensim.prepare(lda_model, corpus, dictionary, sort_topics=False)

In [13]:
pyLDAvis.display(lda_display)

Visualizing 10 topics

In [15]:
lda_model = gensim.models.ldamodel.LdaModel.load('./output/topic_modelling/model10.gensim')
lda_display = pyLDAvis.gensim.prepare(lda_model, corpus, dictionary, sort_topics=False)

In [16]:
pyLDAvis.display(lda_display)

When we have 5 or 10 topics, we can see that certain topics cluster together, which indicates that those topics are similar. What a nice way to visualize what we have done thus far!
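
One way to quantify "similar topics" is a divergence between their word distributions. The sketch below computes the Jensen-Shannon divergence in pure Python, on the assumption that the map places topics using a distance of this kind; the three toy distributions are hypothetical, not taken from the trained models.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Symmetric divergence: average KL of p and q to their midpoint m."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Hypothetical word distributions over the same 4-word vocabulary
topic_a = [0.5, 0.3, 0.1, 0.1]
topic_b = [0.4, 0.4, 0.1, 0.1]    # close to topic_a
topic_c = [0.05, 0.05, 0.5, 0.4]  # very different from topic_a

print(js_divergence(topic_a, topic_b))
print(js_divergence(topic_a, topic_c))
```

The first value is smaller than the second, so `topic_a` and `topic_b` would sit close together while `topic_c` would sit apart.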

#### Summary


### References
http://vis.stanford.edu/files/2012-Termite-AVI.pdf