
# TOPIC_MODELING

# Latent Dirichlet Allocation (LDA):

### Find a text dataset, remove the labels if it is labeled, and build a topic model!

In [0]:
import pandas as pd
import numpy as np

In [4]:

data = pd.read_csv("/home/data.csv")
print(data.shape)
data.head()

(2506, 1)


Unnamed: 0,Innovation in Database Management: Computer Science vs. Engineering.
0,High performance prime field multiplication fo...
1,enchanted scissors: a scissor interface for su...
2,Detection of channel degradation attack by Int...
3,Pinning a Complex Network through the Betweenn...
4,Analysis and Design of Memoryless Interconnect...


In [5]:
#spaCy comes with pre-trained statistical models and word vectors, and currently supports tokenization for 49+ languages.
import spacy
spacy.load('en')

<spacy.lang.en.English at 0x7f9025ee7278>

In [0]:
from spacy.lang.en import English  # from the 49+ supported languages, we choose English here
parser = English()

In [12]:
liss = []
with open('/home/data.csv') as f:
    for line in f:
        tokens = parser(line)
        liss.append(tokens)
liss[:5]       

[Innovation in Database Management: Computer Science vs. Engineering.,
 High performance prime field multiplication for GPU.,
 enchanted scissors: a scissor interface for support in cutting and interactive fabrication.,
 Detection of channel degradation attack by Intermediary Node in Linear Networks.,
 Pinning a Complex Network through the Betweenness Centrality Strategy.]

In [0]:
def tokenize(text):
    lda_tokens = []
    tokens = parser(text)
    
    for token in tokens:
        if token.orth_.isspace():
            continue
            
        elif token.like_url:
            lda_tokens.append('URL')
            
        elif token.orth_.startswith('@'):
            lda_tokens.append('SCREEN_NAME')
            
        else:
            lda_tokens.append(token.lower_)
            
    return lda_tokens

# We use NLTK’s WordNet to find the meanings of words, synonyms, antonyms, and more.
###### In addition, we use WordNetLemmatizer to get the root form (lemma) of a word.

In [14]:
import nltk
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [0]:
from nltk.corpus import wordnet as wn
def get_lemma(word):
    lemma = wn.morphy(word)
    if lemma is None:    # WordNet found no root form for this word
        return word
    else:
        return lemma

In [0]:
from nltk.stem.wordnet import WordNetLemmatizer

def get_lemma2(word):
    return WordNetLemmatizer().lemmatize(word)

# Remove the stopwords  

Stopwords are very common words (such as "the", "of", "is") that carry little meaning of their own for semantic analysis, so we remove them.

In [17]:
nltk.download('stopwords')   
en_stop = set(nltk.corpus.stopwords.words('english'))
print(en_stop)

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
{'before', 'y', 'they', 'he', 'being', "haven't", "hasn't", 'do', 'why', 'our', "weren't", 'wouldn', 't', 'couldn', 'their', 'those', 'of', 'now', 'very', 'while', 'too', 'hadn', 'didn', "should've", "don't", "shouldn't", "mustn't", 'having', "you're", 'your', 'on', 'down', "she's", 'after', 'themselves', 'other', 'the', 'there', 'yourself', 'am', 'herself', 'through', "wouldn't", 'these', 'don', 'is', 'wasn', "that'll", 'did', 'than', "needn't", 'where', 'me', 'aren', 'hasn', 'we', 'shouldn', 'o', 're', 'against', "isn't", "mightn't", 'just', 'but', 'own', 'out', 'then', 'into', 'needn', 'in', 'you', 'her', 'itself', 'not', 'a', 'that', 'm', "couldn't", 'how', "you've", 'each', 'most', 'once', 'which', 'for', 'up', 'were', 'under', 'both', 'ourselves', 'should', "doesn't", 'myself', 'my', "hadn't", 'can', 'will', 'nor', 'same', 'had', 'between', 'few', 'has', 'doing', 'd', "a

# Topic modeling

In [0]:
def prepare_text_for_lda(text):
    tokens = tokenize(text)
    
    tokens = [token for token in tokens if len(token) > 4]         # the length of the individual token should be greater than 4.
    tokens = [token for token in tokens if token not in en_stop]   # remove stopword 
    tokens = [get_lemma(token) for token in tokens]                 # find the root word
    return tokens
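
To see what each filtering step does without loading spaCy or NLTK, here is a minimal standard-library sketch of the same pipeline. The whitespace tokenizer, the tiny `TOY_STOPWORDS` set, and the name `prepare_text_sketch` are illustrative assumptions rather than the notebook's actual components, and lemmatization is skipped entirely.

```python
# Illustrative stand-in for prepare_text_for_lda(); not the spaCy/NLTK version.
TOY_STOPWORDS = {"about", "these", "their", "which"}  # hypothetical mini stop list


def prepare_text_sketch(text, min_len=5, stopwords=TOY_STOPWORDS):
    """Lowercase, split on whitespace, strip punctuation, then filter."""
    tokens = [tok.strip(".,:;!?()\"'").lower() for tok in text.split()]
    tokens = [tok for tok in tokens if len(tok) >= min_len]   # same as len(tok) > 4
    tokens = [tok for tok in tokens if tok not in stopwords]  # drop stop words
    return tokens


title = "Innovation in Database Management: Computer Science vs. Engineering."
print(prepare_text_sketch(title))
```

Most English stop words are shorter than five characters, so the length filter already removes many of them before the stop-word check runs.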

### Open up our data, read line by line, for each line, prepare text for LDA, then add to a list.

In [40]:
import random
text_data = []
all_tokens = []
with open('/home/data.csv') as f:
    for line in f:
        tokens = prepare_text_for_lda(line)
        all_tokens.append(tokens)
        val = random.random()
        if val > .99:
            print(val , tokens)
            text_data.append(tokens)
            

0.9905800739196984 ['parametric', 'keyframe', 'interpolation', 'incorporate', 'kinetic', 'adjustment', 'phrasing', 'control']
0.9966230792353066 ['design', '64-bit', 'energy', 'performance', 'adder', 'using', 'dynamic', 'feedthrough', 'logic']
0.9961678051443218 ['speed', 'front', 'photodiode', 'base', 'fluorescence', 'lifetime', 'measurement', 'system']
0.9911974310448745 ['smooth', 'distribute', 'multimedia', 'database', 'system']
0.9984689370951286 ['multiuser', 'detection', 'base', 'grover', 'algorithm']
0.9968423448568698 ['warmth', 'night']
0.9985135525474417 ['optimize', 'energy', 'latency', 'trade', 'sensor', 'network', 'control', 'mobility']
0.9964852715192138 ['articulate', 'deformation', 'range']
0.9939022031588081 ['voltage', 'dtmos', 'mtcmos', 'circuit', 'technique', 'design', 'optimization', 'power', 'application']
0.9907188596709444 ['restful', 'services', 'services', 'making', 'right', 'architectural', 'decision']
0.9954505338623301 ['numerical', 'simulation', 'fluid', 
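
The sampling cell above keeps a line only when `random.random()` exceeds 0.99, so `text_data` holds roughly 1% of the titles and changes on every run. A minimal sketch of a reproducible variant, assuming a fixed seed is acceptable; `sample_lines`, `keep_prob`, the seed value, and the placeholder titles are hypothetical and not part of the notebook:

```python
import random


def sample_lines(lines, keep_prob=0.01, seed=42):
    """Keep each line with probability keep_prob, reproducibly."""
    rng = random.Random(seed)  # private generator; the seed value is arbitrary
    return [line for line in lines if rng.random() > 1.0 - keep_prob]


lines = [f"title {i}" for i in range(2507)]  # placeholder documents
first = sample_lines(lines)
second = sample_lines(lines)
print(len(first), first == second)
```

Using a private `random.Random` instance leaves the global random state untouched, so other cells that draw random numbers are not affected.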

In [42]:
print( len(data)+1 , len(all_tokens))
all_tokens[:5]

2507 2507


[['innovation',
  'database',
  'management',
  'computer',
  'science',
  'engineering'],
 ['performance', 'prime', 'field', 'multiplication'],
 ['enchant',
  'scissors',
  'scissor',
  'interface',
  'support',
  'cutting',
  'interactive',
  'fabrication'],
 ['detection',
  'channel',
  'degradation',
  'attack',
  'intermediary',
  'linear',
  'network'],
 ['pinning', 'complex', 'network', 'betweenness', 'centrality', 'strategy']]

In [43]:
text_data[:5]

[['parametric',
  'keyframe',
  'interpolation',
  'incorporate',
  'kinetic',
  'adjustment',
  'phrasing',
  'control'],
 ['design',
  '64-bit',
  'energy',
  'performance',
  'adder',
  'using',
  'dynamic',
  'feedthrough',
  'logic'],
 ['speed',
  'front',
  'photodiode',
  'base',
  'fluorescence',
  'lifetime',
  'measurement',
  'system'],
 ['smooth', 'distribute', 'multimedia', 'database', 'system'],
 ['multiuser', 'detection', 'base', 'grover', 'algorithm']]

# LDA with Gensim

### Gensim is a free Python library designed to "automatically extract semantic topics from documents", as efficiently (computer-wise) and painlessly (human-wise) as possible.

We will use the bag-of-words representation to count how often each word occurs in each document.   
First we build a dictionary that maps each distinct word to an integer id, and then a list of bag-of-words vectors named corpus.

In [44]:
from gensim import corpora

dictionary = corpora.Dictionary(text_data)

print(list(dictionary))    # integer ids of all the words in the dictionary


[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132]


### The function doc2bow() counts the occurrences of each distinct word, converts the word to its integer word id, and returns the result as a list of (word id, count) pairs.

In [45]:
corpus1 = [dictionary.doc2bow(text) for text in text_data[:5]]

text_data[:5] , corpus1

([['parametric',
   'keyframe',
   'interpolation',
   'incorporate',
   'kinetic',
   'adjustment',
   'phrasing',
   'control'],
  ['design',
   '64-bit',
   'energy',
   'performance',
   'adder',
   'using',
   'dynamic',
   'feedthrough',
   'logic'],
  ['speed',
   'front',
   'photodiode',
   'base',
   'fluorescence',
   'lifetime',
   'measurement',
   'system'],
  ['smooth', 'distribute', 'multimedia', 'database', 'system'],
  ['multiuser', 'detection', 'base', 'grover', 'algorithm']],
 [[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1)],
  [(8, 1),
   (9, 1),
   (10, 1),
   (11, 1),
   (12, 1),
   (13, 1),
   (14, 1),
   (15, 1),
   (16, 1)],
  [(17, 1), (18, 1), (19, 1), (20, 1), (21, 1), (22, 1), (23, 1), (24, 1)],
  [(24, 1), (25, 1), (26, 1), (27, 1), (28, 1)],
  [(17, 1), (29, 1), (30, 1), (31, 1), (32, 1)]])
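
The counting step can be sketched in plain Python to make the `(word id, count)` format concrete. This is an illustrative stand-in, not gensim's implementation: `build_token2id` and `doc2bow_sketch` are hypothetical helpers, and the ids they assign will generally differ from the ones gensim produces.

```python
from collections import Counter


def build_token2id(docs):
    """Assign ids in order of first appearance (gensim's ordering may differ)."""
    token2id = {}
    for doc in docs:
        for tok in doc:
            token2id.setdefault(tok, len(token2id))
    return token2id


def doc2bow_sketch(doc, token2id):
    """Return sorted (token_id, count) pairs, ignoring unknown tokens."""
    counts = Counter(token2id[tok] for tok in doc if tok in token2id)
    return sorted(counts.items())


docs = [["restful", "services", "services", "decision"], ["services", "design"]]
token2id = build_token2id(docs)
bow = doc2bow_sketch(docs[0], token2id)
print(bow)
```

A word that occurs twice in a document, such as `services` here, gets a count of 2, and words missing from the dictionary are dropped.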

In [0]:
corpus = [dictionary.doc2bow(text) for text in text_data]

In [47]:
corpus

[[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1)],
 [(8, 1),
  (9, 1),
  (10, 1),
  (11, 1),
  (12, 1),
  (13, 1),
  (14, 1),
  (15, 1),
  (16, 1)],
 [(17, 1), (18, 1), (19, 1), (20, 1), (21, 1), (22, 1), (23, 1), (24, 1)],
 [(24, 1), (25, 1), (26, 1), (27, 1), (28, 1)],
 [(17, 1), (29, 1), (30, 1), (31, 1), (32, 1)],
 [(33, 1), (34, 1)],
 [(1, 1), (12, 1), (35, 1), (36, 1), (37, 1), (38, 1), (39, 1), (40, 1)],
 [(41, 1), (42, 1), (43, 1)],
 [(10, 1),
  (44, 1),
  (45, 1),
  (46, 1),
  (47, 1),
  (48, 1),
  (49, 1),
  (50, 1),
  (51, 1)],
 [(52, 1), (53, 1), (54, 1), (55, 1), (56, 1), (57, 2)],
 [(16, 1),
  (58, 1),
  (59, 1),
  (60, 1),
  (61, 1),
  (62, 1),
  (63, 1),
  (64, 1),
  (65, 1),
  (66, 1)],
 [(1, 1),
  (38, 1),
  (49, 1),
  (67, 1),
  (68, 1),
  (69, 1),
  (70, 1),
  (71, 1),
  (72, 1)],
 [(17, 1), (73, 1), (74, 1), (75, 1), (76, 1)],
 [(37, 1), (39, 1), (77, 1), (78, 1), (79, 1), (80, 1), (81, 1), (82, 1)],
 [(83, 1), (84, 1), (85, 1), (86, 1), (87, 1), (8

In [0]:
import pickle

with open('/home/corpus.pkl', 'wb') as f:   # save the corpus to local storage
    pickle.dump(corpus, f)                  # the file is closed automatically

dictionary.save('/home/dictionary.gensim')

### We are asking LDA to find 5 topics in the data:

In [0]:
import gensim
NUM_TOPICS = 5

ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics = NUM_TOPICS, id2word=dictionary, passes=15)
ldamodel.save('/home/model5.gensim')

In [50]:
topics = ldamodel.print_topics(num_words=4)
for topic in topics:
    print(topic)

(0, '0.039*"sensor" + 0.039*"measurement" + 0.021*"base" + 0.021*"photodiode"')
(1, '0.034*"control" + 0.034*"factor" + 0.018*"multi" + 0.018*"synthesis"')
(2, '0.042*"services" + 0.023*"structure" + 0.023*"base" + 0.023*"decision"')
(3, '0.039*"system" + 0.021*"design" + 0.021*"controller" + 0.021*"energy"')
(4, '0.020*"network" + 0.020*"algorithm" + 0.020*"using" + 0.020*"geometry"')
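
Each tuple printed above pairs a topic id with a string of `weight*"word"` terms, where the weight is the word's probability within that topic. If the numbers are needed for further processing, the string can be split apart as in this small sketch; `parse_topic` is a hypothetical helper, and gensim's `show_topic` method returns such pairs directly.

```python
import re

# One line copied from the print_topics() output above.
topic_line = '0.039*"sensor" + 0.039*"measurement" + 0.021*"base" + 0.021*"photodiode"'


def parse_topic(line):
    """Split a 'weight*"word" + ...' string into (word, weight) pairs."""
    return [(word, float(weight))
            for weight, word in re.findall(r'([0-9.]+)\*"([^"]+)"', line)]


pairs = parse_topic(topic_line)
print(pairs)
```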


As we can see here, topic 0 appears to be related to electronics (sensor, measurement, photodiode).


## With LDA, we can see that different documents are assigned different topics.

# Let's try a new document

In [52]:
new_doc = 'Practical Bayesian Optimization of Machine Learning Algorithms'
new_doc = prepare_text_for_lda(new_doc)     # tokenize, filter and lemmatize
new_doc_bow = dictionary.doc2bow(new_doc)   # (word id, count) pairs for the known words
print(new_doc_bow)
print(ldamodel.get_document_topics(new_doc_bow))

[(29, 1), (48, 1)]
[(0, 0.3996502), (1, 0.06669175), (2, 0.3977759), (3, 0.06669592), (4, 0.06918624)]


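My new document is about machine learning algorithms. The LDA output shows that topics 0 and 2 have the highest probabilities (about 0.40 each), while topics 4, 3 and 1 follow at about 0.07 each. Only two of the document's tokens were found in the dictionary, so this assignment rests on very little evidence.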

Remember that the above 5 probabilities add up to 1.
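
A quick check of that claim, using the values copied from the `get_document_topics()` output above. The sum is only approximately 1 because the printed numbers are rounded.

```python
# Topic distribution copied from the get_document_topics() output above.
doc_topics = [(0, 0.3996502), (1, 0.06669175), (2, 0.3977759),
              (3, 0.06669592), (4, 0.06918624)]

total = sum(prob for _, prob in doc_topics)
ranked = sorted(doc_topics, key=lambda pair: pair[1], reverse=True)

print(f"sum = {total:.4f}")
print("ranking:", [topic_id for topic_id, _ in ranked])
```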

Now we are asking LDA to find 3 topics in the data:

In [53]:
ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics = 3, id2word=dictionary, passes=15)
ldamodel.save('/home/model3.gensim')

topics = ldamodel.print_topics(num_words=4)
for topic in topics:
    print(topic)

(0, '0.033*"services" + 0.019*"structure" + 0.019*"decision" + 0.019*"decimators"')
(1, '0.024*"base" + 0.024*"measurement" + 0.014*"using" + 0.014*"route"')
(2, '0.019*"sensor" + 0.019*"control" + 0.019*"network" + 0.019*"design"')


# Find 10 topics

In [54]:
ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics = 10, id2word=dictionary, passes=15)
ldamodel.save('model10.gensim')
topics = ldamodel.print_topics(num_words=4)
for topic in topics:
    print(topic)

(0, '0.036*"network" + 0.036*"scalable" + 0.036*"independent" + 0.036*"robust"')
(1, '0.045*"multi" + 0.045*"structure" + 0.045*"distribute" + 0.045*"database"')
(2, '0.080*"base" + 0.042*"system" + 0.042*"measurement" + 0.042*"speed"')
(3, '0.069*"algorithm" + 0.036*"design" + 0.036*"imdst" + 0.036*"hardware"')
(4, '0.040*"control" + 0.040*"prefix" + 0.040*"phrasing" + 0.040*"deaggregation"')
(5, '0.025*"optimization" + 0.025*"circuit" + 0.025*"technique" + 0.025*"simulation"')
(6, '0.039*"design" + 0.039*"energy" + 0.039*"using" + 0.039*"performance"')
(7, '0.008*"appeal" + 0.008*"warmth" + 0.008*"heavenly" + 0.008*"algorithm"')
(8, '0.056*"services" + 0.056*"controller" + 0.029*"factor" + 0.029*"optimize"')
(9, '0.039*"sensor" + 0.039*"network" + 0.039*"route" + 0.039*"polyhedron"')


# pyLDAvis  

pyLDAvis is designed to help users interpret the topics in a topic model that has been fit to a corpus of text data. The package extracts information from a fitted LDA topic model to inform an interactive web-based visualization.

### Visualizing 5 topics:

In [0]:
# Load all the object and models

dictionary = gensim.corpora.Dictionary.load('/home/dictionary.gensim')

with open('/home/corpus.pkl', 'rb') as f:   # the file is closed automatically
    corpus = pickle.load(f)

lda = gensim.models.ldamodel.LdaModel.load('/home/model5.gensim')


In [56]:
!pip install pyLDAvis

Collecting pyLDAvis
  Downloading https://files.pythonhosted.org/packages/a5/3a/af82e070a8a96e13217c8f362f9a73e82d61ac8fff3a2561946a97f96266/pyLDAvis-2.1.2.tar.gz (1.6MB)
     |████████████████████████████████| 1.6MB 2.4MB/s
Collecting funcy (from pyLDAvis)
  Downloading https://files.pythonhosted.org/packages/b3/23/d1f90f4e2af5f9d4921ab3797e33cf0503e3f130dd390a812f3bf59ce9ea/funcy-1.12-py2.py3-none-any.whl
Building wheels for collected packages: pyLDAvis
  Building wheel for pyLDAvis (setup.py) ... done
  Stored in directory: /root/.cache/pip/wheels/98/71/24/513a99e58bb6b8465bae4d2d5e9dba8f0bef8179e3051ac414
Successfully built pyLDAvis
Installing collected packages: funcy, pyLDAvis
Successfully installed funcy-1.12 pyLDAvis-2.1.2


In [59]:
import pyLDAvis.gensim
lda_display = pyLDAvis.gensim.prepare(lda, corpus, dictionary, sort_topics=False)
pyLDAvis.display(lda_display)

of pandas will change to not sort by default.

To accept the future behavior, pass 'sort=False'.


  return pd.concat([default_term_info] + list(topic_dfs))


Saliency: a measure of how much the term tells you about the topic.

Relevance: a weighted average of the log probability of the word given the topic and the log of that probability divided by the overall probability of the word in the corpus (its lift).

The size of the bubble measures the importance of the topics, relative to the data.

First, we see the most salient terms, that is, the terms that tell us the most about what is going on across the topics. We can also look at an individual topic.
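
To make the relevance measure concrete, here is a small sketch of the formula as I understand it from the LDAvis definition: `lam * log p(word | topic) + (1 - lam) * log(p(word | topic) / p(word))`. The probabilities in `toy_terms` and the default `lam=0.6` are made-up assumptions for illustration, not values taken from the model above.

```python
import math


def relevance(p_word_given_topic, p_word, lam=0.6):
    """lam * log p(w|t) + (1 - lam) * log(p(w|t) / p(w))."""
    lift = p_word_given_topic / p_word
    return lam * math.log(p_word_given_topic) + (1 - lam) * math.log(lift)


# Hypothetical probabilities: (p(word | topic), p(word) in the whole corpus).
toy_terms = {"system": (0.040, 0.050), "photodiode": (0.021, 0.004)}

rankings = {}
for lam in (1.0, 0.0):
    rankings[lam] = sorted(
        toy_terms, key=lambda w: relevance(*toy_terms[w], lam=lam), reverse=True
    )
    print(lam, rankings[lam])
```

With `lam = 1.0` the ranking follows the raw topic probability, so the frequent word wins; with `lam = 0.0` it follows the lift, which favours the word that is rare in the corpus but concentrated in the topic.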

### Visualizing 3 topics:

In [61]:
lda3 = gensim.models.ldamodel.LdaModel.load('/home/model3.gensim')
lda_display3 = pyLDAvis.gensim.prepare(lda3, corpus, dictionary, sort_topics=False)
pyLDAvis.display(lda_display3)



### Visualizing 10 topics:

In [62]:


lda10 = gensim.models.ldamodel.LdaModel.load('model10.gensim')
lda_display10 = pyLDAvis.gensim.prepare(lda10, corpus, dictionary, sort_topics=False)
pyLDAvis.display(lda_display10)



We can see that more than one node (topic bubble) can be similar to another: bubbles that lie close together or overlap represent similar topics.