# Latent Dirichlet Allocation (LDA)

Topic modeling is useful when we have a large corpus and want to uncover the themes or categories in it, but the corpus is too large to go through manually.
In simple terms, LDA is a probabilistic, unsupervised model that discovers the topics in a corpus and the top words for each of them.
Suppose we have a set of documents. We have chosen a fixed number of topics, K, to discover, and want to use LDA to learn the topic representation of each document and the words associated with each topic. One way to learn these is collapsed Gibbs sampling, which works as follows:

* Go through each document, and randomly assign each word (w) in the document (d) to one of the K topics (t).
* This random assignment already gives us topic assignments and topic distributions, but they won't be good at all.
* So to improve upon them, for each document d:
    * Go through every word w in the document and, for each topic t, compute two things:
        * p(topic t | document d), i.e. the proportion of words in document d that are currently assigned to topic t.
        * p(word w | topic t), i.e. the proportion of assignments to topic t, over all documents, that come from this word w.
    * LDA is a generative model, so reassign word w to a new topic, choosing topic t with probability p(topic t | document d) × p(word w | topic t), and repeat until the assignments settle.
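
The steps above can be sketched in plain Python. This is a toy illustration and not what gensim runs internally: the function name `toy_gibbs_lda`, the smoothing values `alpha` and `beta`, and the tiny corpus are all assumptions made for the example.

```python
import random
from collections import defaultdict


def toy_gibbs_lda(docs, num_topics, n_iter=50, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler. `docs` is a list of token lists."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})

    doc_topic = [[0] * num_topics for _ in docs]                # words in d assigned to t
    topic_word = [defaultdict(int) for _ in range(num_topics)]  # times w is assigned to t
    topic_total = [0] * num_topics                              # all words assigned to t

    # Step 1: randomly assign every word to one of the K topics.
    assignments = []
    for d, doc in enumerate(docs):
        row = []
        for w in doc:
            t = rng.randrange(num_topics)
            row.append(t)
            doc_topic[d][t] += 1
            topic_word[t][w] += 1
            topic_total[t] += 1
        assignments.append(row)

    # Step 2: repeatedly resample the topic of each word.
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Take the current word out of the counts.
                t_old = assignments[d][i]
                doc_topic[d][t_old] -= 1
                topic_word[t_old][w] -= 1
                topic_total[t_old] -= 1

                # Unnormalized p(topic t | document d) * p(word w | topic t),
                # smoothed by alpha and beta so no weight is ever zero.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta) / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                t_new = rng.choices(range(num_topics), weights=weights)[0]

                # Put the word back under its newly sampled topic.
                assignments[d][i] = t_new
                doc_topic[d][t_new] += 1
                topic_word[t_new][w] += 1
                topic_total[t_new] += 1

    return assignments, doc_topic, topic_word


toy_docs = [["tax", "vat", "duty", "tax"], ["energy", "gas", "emission"], ["tax", "energy"]]
assignments, doc_topic, topic_word = toy_gibbs_lda(toy_docs, num_topics=2)
print(doc_topic)  # per-document topic counts after sampling
```

Each row of `doc_topic` always sums to the length of its document, because every word carries exactly one topic at any time. Only the split between topics changes from sweep to sweep.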
   
To learn more and to get an intuition for how LDA works, [click here](http://blog.echen.me/2011/08/22/introduction-to-latent-dirichlet-allocation/). I got my intuition from there.


### Steps 
   * Pre-processing and training corpus creation
   * Building dictionary
   * Feature extraction
   * LDA model training
   
Pre-processing text for LDA is a little different from what I did for the RNNs created in this [post](https://jdvala.github.io/blog.io/thesis/2018/05/23/Creating-Data-Set-Again-!.html).

In [10]:
# Let's pre-process the documents.
import os
import random
from nltk.corpus import stopwords
import string
import sys
import spacy
import re
import logging
from gensim.models.ldamodel import LdaModel as Lda
from gensim import corpora

In [3]:
# NLP model from spaCy ('en' is the spaCy 2.x shortcut; on spaCy 3+ load 'en_core_web_sm' instead)
nlp = spacy.load('en')

-----
### Pre-processing Text

In [4]:
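# Build the stop-word set once; calling stopwords.words() for every token is slow
STOP_WORDS = set(stopwords.words('english'))
# ASCII punctuation plus a few currency symbols and dashes
PUNCTUATION = string.punctuation + '$€¥₹£|–'
PUNCT_REGEX = re.compile('[%s]' % re.escape(PUNCTUATION))

def preprocess(text):
    """Returns the tokens of one document after preprocessing
    :params: text of a single document (str)
    :returns: list of lemmatized tokens longer than 2 characters"""

    # lowercase and remove stop words
    stop = " ".join([i for i in text.lower().split() if i not in STOP_WORDS])
    # remove digits
    digit = re.sub(r'\d+', '', stop)
    # remove punctuation (the extra symbols in PUNCTUATION are now actually used)
    punct_ = PUNCT_REGEX.sub('', digit)
    # lemmatize with spaCy
    doc = nlp(punct_)
    lam = " ".join(word.lemma_ for word in doc)
    # drop very short tokens
    return [s for s in lam.split() if len(s) > 2]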

In [5]:
# Let's load the text of every document into a list

document_list = []

for root, dirs, files in os.walk('/home/jay/Thesis_1/Data/Data_EN'):
    for file in files:
        if file != 'log.txt':
            with open(os.path.join(root, file), 'r') as f:
                document_list.append(f.read())

In [6]:
# Let's divide the documents into a test and a train set
# I am taking 20% of the documents for the test set, but before that let's shuffle them.
random.shuffle(document_list)
train = document_list[round(len(document_list)*.2):]
test = document_list[:round(len(document_list)*.2)]

In [7]:
len(train)

3692

In [8]:
# preprocess the training set
cleaned = [preprocess(doc) for doc in train]

-------------------------------------------------------------------------------------------------------------------
### Dictionary Building
To build the dictionary, gensim needs all the words in the corpus, given as one list of tokens per document. That is what `cleaned` already holds.

In [11]:
# For building the dictionary I will use gensim.
# A dictionary is nothing but a mapping from every unique term to a unique id, as we already created for training RNNs.
# We could also use gensim's HashDictionary, which uses a hashing algorithm to increase speed, but I will not worry
# about it as my corpus is small.

dictionary = corpora.Dictionary(cleaned)

In [15]:
print(dictionary)

Dictionary(19634 unique tokens: ['-PRON-', 'access', 'achieve', 'achievement', 'act']...)


Now that the dictionary is created, we need to filter it: we remove the words that occur in fewer than 4 documents and the words that occur in more than 40% of the documents. We do this because such words contribute little to distinguishing the different themes and topics in the corpus.


In [16]:
# removing extremes
dictionary.filter_extremes(no_below=4, no_above=0.4)

In [17]:
print(dictionary)

Dictionary(8231 unique tokens: ['-PRON-', 'access', 'achieve', 'achievement', 'adapt']...)
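
To make the two thresholds concrete, here is a small plain-Python sketch of document-frequency filtering. It only mirrors the idea behind `no_below` and `no_above`; it is not gensim's code, and gensim's own method additionally caps the vocabulary size with `keep_n`. The helper name and the toy documents are made up for the example.

```python
from collections import Counter


def filter_by_document_frequency(tokenized_docs, no_below, no_above):
    """Return the set of tokens that survive document-frequency filtering."""
    num_docs = len(tokenized_docs)
    # Document frequency: in how many documents does each token appear?
    doc_freq = Counter(token for doc in tokenized_docs for token in set(doc))
    return {
        token
        for token, df in doc_freq.items()
        if df >= no_below and df <= no_above * num_docs
    }


toy_docs = [
    ["directive", "tax", "vat"],
    ["directive", "energy", "gas"],
    ["directive", "tax", "energy"],
    ["directive", "fishery"],
]
kept = filter_by_document_frequency(toy_docs, no_below=2, no_above=0.5)
print(sorted(kept))  # ['energy', 'tax']
```

Here "directive" is dropped because it appears in every document, and "vat", "gas" and "fishery" are dropped because each appears in only one.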


-------------------------------------------------------------------------------------------------------------------
### Feature Extraction
Now that the dictionary is created, we move on to the next step, extracting features. Gensim provides the necessary tools to extract features from the corpus.
Feature extraction here is nothing but counting, for each document, the frequency of every vocabulary word that occurs in that document.

In [19]:
doc_term_matrix = [dictionary.doc2bow(doc) for doc in cleaned]
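
As a rough picture of what `doc2bow` produces, the sketch below builds the same kind of bag-of-words representation by hand: sorted `(token_id, count)` pairs, with tokens outside the vocabulary ignored. The function `to_bow` and the three-word vocabulary are hypothetical and only stand in for the real dictionary.

```python
from collections import Counter


def to_bow(tokens, token2id):
    """Turn a token list into sorted (token_id, count) pairs, skipping unknown tokens."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())


toy_token2id = {"energy": 0, "gas": 1, "tax": 2}   # hypothetical vocabulary
toy_doc = ["tax", "energy", "tax", "fishery"]       # "fishery" is not in the vocabulary
print(to_bow(toy_doc, toy_token2id))  # [(0, 1), (2, 2)]
```

Words that were filtered out of the dictionary simply vanish from the features, which is why the filtering step matters for what the model sees.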

---------------------
### Model Building

In [20]:
# As I know that there are only 32 topics in the corpus, I will set the num_topics argument to 32
# To see the progress I added logging as suggested in the Gensim tutorial
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

ldamodel = Lda(doc_term_matrix, num_topics=32, id2word = dictionary, passes=50, iterations=500)

2018-05-28 18:52:24,623 : INFO : using symmetric alpha at 0.03125
2018-05-28 18:52:24,627 : INFO : using symmetric eta at 0.03125
2018-05-28 18:52:24,631 : INFO : using serial LDA version on this node
2018-05-28 18:52:24,696 : INFO : running online (multi-pass) LDA training, 32 topics, 50 passes over the supplied corpus of 3692 documents, updating model once every 2000 documents, evaluating perplexity every 3692 documents, iterating 500x with a convergence threshold of 0.001000
2018-05-28 18:52:24,698 : INFO : PROGRESS: pass 0, at document #2000/3692
2018-05-28 18:53:04,688 : INFO : merging changes from 2000 documents into a model of 3692 documents
2018-05-28 18:53:04,760 : INFO : topic #14 (0.031): 0.013*"research" + 0.010*"innovation" + 0.009*"sector" + 0.007*"protection" + 0.007*"woman" + 0.006*"europe" + 0.006*"medium" + 0.005*"content" + 0.005*"financial" + 0.005*"interactive"
2018-05-28 18:53:04,763 : INFO : topic #1 (0.031): 0.007*"person" + 0.007*"directive" + 0.006*"product" +

2018-05-28 18:54:43,924 : INFO : topic #27 (0.031): 0.020*"datum" + 0.011*"access" + 0.010*"protection" + 0.007*"plan" + 0.007*"internet" + 0.006*"network" + 0.006*"partner" + 0.006*"region" + 0.006*"regional" + 0.006*"instrument"
2018-05-28 18:54:43,928 : INFO : topic #9 (0.031): 0.018*"financial" + 0.012*"crime" + 0.011*"terrorist" + 0.009*"fraud" + 0.009*"prevention" + 0.009*"infrastructure" + 0.009*"combat" + 0.008*"noise" + 0.008*"terrorism" + 0.008*"money"
2018-05-28 18:54:43,931 : INFO : topic #3 (0.031): 0.028*"transport" + 0.026*"air" + 0.023*"directive" + 0.022*"passenger" + 0.019*"tax" + 0.015*"vehicle" + 0.012*"carrier" + 0.011*"vat" + 0.011*"duty" + 0.009*"emission"
2018-05-28 18:54:43,937 : INFO : topic #13 (0.031): 0.041*"directive" + 0.019*"law" + 0.014*"court" + 0.013*"right" + 0.010*"person" + 0.010*"legal" + 0.009*"proceeding" + 0.008*"eec" + 0.008*"worker" + 0.008*"case"
2018-05-28 18:54:43,940 : INFO : topic diff=0.752502, rho=0.454264
2018-05-28 18:54:43,942 : INF

2018-05-28 18:56:07,223 : INFO : -7.157 per-word bound, 142.7 perplexity estimate based on a held-out corpus of 1692 documents with 605190 words
2018-05-28 18:56:07,224 : INFO : PROGRESS: pass 5, at document #3692/3692
2018-05-28 18:56:15,141 : INFO : merging changes from 1692 documents into a model of 3692 documents
2018-05-28 18:56:15,234 : INFO : topic #9 (0.031): 0.020*"financial" + 0.020*"crime" + 0.012*"terrorist" + 0.012*"combat" + 0.011*"prevention" + 0.010*"fraud" + 0.010*"terrorism" + 0.009*"money" + 0.009*"fight" + 0.008*"organise"
2018-05-28 18:56:15,236 : INFO : topic #19 (0.031): 0.067*"energy" + 0.018*"emission" + 0.017*"environmental" + 0.014*"gas" + 0.011*"electricity" + 0.010*"directive" + 0.010*"renewable" + 0.010*"source" + 0.009*"climate" + 0.009*"efficiency"
2018-05-28 18:56:15,238 : INFO : topic #28 (0.031): 0.026*"right" + 0.018*"datum" + 0.017*"justice" + 0.013*"judicial" + 0.013*"protection" + 0.011*"criminal" + 0.011*"freedom" + 0.009*"fundamental" + 0.009*"c

2018-05-28 18:57:24,976 : INFO : topic #9 (0.031): 0.024*"crime" + 0.019*"financial" + 0.013*"combat" + 0.012*"terrorist" + 0.012*"prevention" + 0.011*"terrorism" + 0.010*"organise" + 0.010*"criminal" + 0.010*"fraud" + 0.010*"fight"
2018-05-28 18:57:24,978 : INFO : topic #26 (0.031): 0.028*"treaty" + 0.017*"article" + 0.010*"procedure" + 0.010*"euro" + 0.009*"budgetary" + 0.009*"rate" + 0.009*"central" + 0.009*"bank" + 0.009*"deficit" + 0.009*"government"
2018-05-28 18:57:24,991 : INFO : topic diff=0.245463, rho=0.303644
2018-05-28 18:57:37,174 : INFO : -7.113 per-word bound, 138.5 perplexity estimate based on a held-out corpus of 1692 documents with 605190 words
2018-05-28 18:57:37,176 : INFO : PROGRESS: pass 8, at document #3692/3692
2018-05-28 18:57:44,115 : INFO : merging changes from 1692 documents into a model of 3692 documents
2018-05-28 18:57:44,203 : INFO : topic #6 (0.031): 0.051*"product" + 0.036*"food" + 0.017*"consumer" + 0.011*"health" + 0.010*"agricultural" + 0.009*"modi

2018-05-28 18:58:47,464 : INFO : topic #31 (0.031): 0.031*"progress" + 0.025*"accession" + 0.021*"acquis" + 0.021*"candidate" + 0.015*"turkey" + 0.014*"sec" + 0.013*"enlargement" + 0.012*"priority" + 0.012*"negotiation" + 0.012*"reform"
2018-05-28 18:58:47,466 : INFO : topic #10 (0.031): 0.029*"financial" + 0.024*"euro" + 0.021*"company" + 0.015*"capital" + 0.015*"payment" + 0.015*"bank" + 0.014*"directive" + 0.011*"credit" + 0.009*"coin" + 0.009*"account"
2018-05-28 18:58:47,470 : INFO : topic #23 (0.031): 0.056*"convention" + 0.038*"maritime" + 0.033*"ship" + 0.021*"sea" + 0.020*"pollution" + 0.019*"port" + 0.018*"protocol" + 0.016*"party" + 0.015*"marine" + 0.014*"vessel"
2018-05-28 18:58:47,477 : INFO : topic #9 (0.031): 0.026*"crime" + 0.018*"financial" + 0.014*"combat" + 0.013*"criminal" + 0.011*"prevention" + 0.011*"terrorist" + 0.011*"terrorism" + 0.011*"organise" + 0.010*"fight" + 0.009*"offence"
2018-05-28 18:58:47,479 : INFO : topic #1 (0.031): 0.023*"person" + 0.012*"citize

2018-05-28 19:00:02,397 : INFO : topic #19 (0.031): 0.075*"energy" + 0.022*"emission" + 0.020*"environmental" + 0.016*"gas" + 0.013*"water" + 0.012*"electricity" + 0.011*"renewable" + 0.011*"climate" + 0.010*"source" + 0.010*"directive"
2018-05-28 19:00:02,403 : INFO : topic diff=0.106847, rho=0.251212
2018-05-28 19:00:02,406 : INFO : PROGRESS: pass 14, at document #2000/3692
2018-05-28 19:00:10,798 : INFO : merging changes from 2000 documents into a model of 3692 documents
2018-05-28 19:00:10,896 : INFO : topic #22 (0.031): 0.038*"fund" + 0.026*"financial" + 0.019*"project" + 0.015*"million" + 0.014*"eur" + 0.014*"assistance" + 0.014*"period" + 0.013*"budget" + 0.013*"instrument" + 0.010*"finance"
2018-05-28 19:00:10,899 : INFO : topic #7 (0.031): 0.053*"tax" + 0.047*"custom" + 0.028*"request" + 0.025*"duty" + 0.023*"taxation" + 0.020*"assistance" + 0.019*"vat" + 0.014*"legislation" + 0.014*"excise" + 0.013*"agreement"
2018-05-28 19:00:10,902 : INFO : topic #25 (0.031): 0.081*"aid" + 

2018-05-28 19:01:27,112 : INFO : topic #8 (0.031): 0.033*"acquis" + 0.027*"legislation" + 0.026*"progress" + 0.019*"accession" + 0.014*"field" + 0.013*"law" + 0.012*"note" + 0.012*"november" + 0.012*"sector" + 0.011*"effort"
2018-05-28 19:01:27,114 : INFO : topic #10 (0.031): 0.031*"financial" + 0.024*"euro" + 0.023*"company" + 0.016*"directive" + 0.016*"payment" + 0.016*"capital" + 0.016*"bank" + 0.012*"credit" + 0.010*"account" + 0.009*"coin"
2018-05-28 19:01:27,116 : INFO : topic #16 (0.031): 0.016*"sector" + 0.012*"europe" + 0.010*"propose" + 0.010*"would" + 0.009*"industry" + 0.007*"change" + 0.007*"increase" + 0.007*"challenge" + 0.007*"competitiveness" + 0.007*"transport"
2018-05-28 19:01:27,120 : INFO : topic diff=0.077434, rho=0.230351
2018-05-28 19:01:27,123 : INFO : PROGRESS: pass 17, at document #2000/3692
2018-05-28 19:01:35,136 : INFO : merging changes from 2000 documents into a model of 3692 documents
2018-05-28 19:01:35,227 : INFO : topic #15 (0.031): 0.053*"health" + 0

2018-05-28 19:02:49,038 : INFO : merging changes from 1692 documents into a model of 3692 documents
2018-05-28 19:02:49,123 : INFO : topic #23 (0.031): 0.061*"convention" + 0.038*"maritime" + 0.033*"ship" + 0.022*"sea" + 0.020*"port" + 0.019*"pollution" + 0.018*"protocol" + 0.016*"party" + 0.015*"marine" + 0.013*"vessel"
2018-05-28 19:02:49,126 : INFO : topic #9 (0.031): 0.026*"crime" + 0.017*"criminal" + 0.015*"combat" + 0.015*"financial" + 0.011*"offence" + 0.011*"organise" + 0.011*"europol" + 0.011*"prevention" + 0.011*"terrorism" + 0.010*"terrorist"
2018-05-28 19:02:49,128 : INFO : topic #8 (0.031): 0.033*"acquis" + 0.027*"legislation" + 0.026*"progress" + 0.019*"accession" + 0.014*"field" + 0.013*"law" + 0.012*"note" + 0.012*"november" + 0.012*"sector" + 0.011*"effort"
2018-05-28 19:02:49,130 : INFO : topic #29 (0.031): 0.026*"right" + 0.024*"child" + 0.023*"woman" + 0.021*"human" + 0.018*"equal" + 0.013*"equality" + 0.013*"discrimination" + 0.012*"man" + 0.010*"protection" + 0.01

2018-05-28 19:03:56,016 : INFO : topic #12 (0.031): 0.035*"fishery" + 0.031*"committee" + 0.029*"fishing" + 0.015*"statistic" + 0.015*"vessel" + 0.014*"board" + 0.014*"statistical" + 0.011*"representative" + 0.011*"fish" + 0.010*"management"
2018-05-28 19:03:56,018 : INFO : topic diff=0.050097, rho=0.200619
2018-05-28 19:04:08,815 : INFO : -7.063 per-word bound, 133.7 perplexity estimate based on a held-out corpus of 1692 documents with 605190 words
2018-05-28 19:04:08,816 : INFO : PROGRESS: pass 22, at document #3692/3692
2018-05-28 19:04:15,477 : INFO : merging changes from 1692 documents into a model of 3692 documents
2018-05-28 19:04:15,565 : INFO : topic #20 (0.031): 0.034*"trade" + 0.028*"consumer" + 0.013*"standard" + 0.009*"business" + 0.009*"legislation" + 0.007*"practice" + 0.006*"access" + 0.006*"barrier" + 0.006*"committee" + 0.006*"sector"
2018-05-28 19:04:15,568 : INFO : topic #3 (0.031): 0.065*"transport" + 0.033*"air" + 0.027*"vehicle" + 0.026*"passenger" + 0.021*"road"

2018-05-28 19:05:16,593 : INFO : topic #9 (0.031): 0.026*"crime" + 0.018*"criminal" + 0.015*"combat" + 0.014*"financial" + 0.012*"organise" + 0.012*"europol" + 0.011*"offence" + 0.011*"prevention" + 0.011*"exchange" + 0.010*"terrorism"
2018-05-28 19:05:16,595 : INFO : topic #1 (0.031): 0.026*"person" + 0.013*"right" + 0.013*"citizen" + 0.012*"residence" + 0.012*"visa" + 0.012*"application" + 0.012*"condition" + 0.012*"noneu" + 0.011*"document" + 0.011*"entry"
2018-05-28 19:05:16,596 : INFO : topic #24 (0.031): 0.026*"paper" + 0.021*"medium" + 0.021*"audiovisual" + 0.021*"green" + 0.019*"general" + 0.017*"interest" + 0.012*"consultation" + 0.010*"television" + 0.010*"society" + 0.009*"sport"
2018-05-28 19:05:16,598 : INFO : topic diff=0.041875, rho=0.189504
2018-05-28 19:05:28,116 : INFO : -7.060 per-word bound, 133.4 perplexity estimate based on a held-out corpus of 1692 documents with 605190 words
2018-05-28 19:05:28,119 : INFO : PROGRESS: pass 25, at document #3692/3692
2018-05-28 19

2018-05-28 19:06:26,649 : INFO : PROGRESS: pass 28, at document #2000/3692
2018-05-28 19:06:34,434 : INFO : merging changes from 2000 documents into a model of 3692 documents
2018-05-28 19:06:34,531 : INFO : topic #25 (0.031): 0.081*"aid" + 0.027*"competition" + 0.019*"article" + 0.013*"grant" + 0.012*"treaty" + 0.012*"guideline" + 0.010*"application" + 0.009*"condition" + 0.009*"sector" + 0.008*"case"
2018-05-28 19:06:34,533 : INFO : topic #1 (0.031): 0.026*"person" + 0.013*"right" + 0.013*"citizen" + 0.012*"residence" + 0.012*"visa" + 0.012*"application" + 0.012*"condition" + 0.012*"noneu" + 0.011*"document" + 0.011*"entry"
2018-05-28 19:06:34,535 : INFO : topic #13 (0.031): 0.041*"directive" + 0.027*"law" + 0.019*"court" + 0.013*"contract" + 0.013*"person" + 0.012*"legal" + 0.011*"case" + 0.011*"procedure" + 0.011*"right" + 0.009*"proceeding"
2018-05-28 19:06:34,537 : INFO : topic #5 (0.031): 0.048*"region" + 0.042*"regional" + 0.024*"transport" + 0.018*"network" + 0.014*"infrastruc

2018-05-28 19:07:45,180 : INFO : topic #29 (0.031): 0.027*"right" + 0.024*"child" + 0.024*"woman" + 0.021*"human" + 0.018*"equal" + 0.014*"equality" + 0.013*"discrimination" + 0.012*"man" + 0.010*"combat" + 0.010*"gender"
2018-05-28 19:07:45,186 : INFO : topic #5 (0.031): 0.048*"region" + 0.043*"regional" + 0.024*"transport" + 0.018*"network" + 0.014*"infrastructure" + 0.014*"urban" + 0.013*"local" + 0.013*"partner" + 0.012*"cohesion" + 0.012*"investment"
2018-05-28 19:07:45,194 : INFO : topic diff=0.032519, rho=0.174485
2018-05-28 19:07:45,201 : INFO : PROGRESS: pass 31, at document #2000/3692
2018-05-28 19:07:53,158 : INFO : merging changes from 2000 documents into a model of 3692 documents
2018-05-28 19:07:53,256 : INFO : topic #15 (0.031): 0.061*"health" + 0.046*"safety" + 0.030*"nuclear" + 0.029*"directive" + 0.029*"risk" + 0.021*"protection" + 0.017*"worker" + 0.013*"euratom" + 0.011*"exposure" + 0.010*"radioactive"
2018-05-28 19:07:53,258 : INFO : topic #29 (0.031): 0.027*"right

2018-05-28 19:09:05,710 : INFO : topic #19 (0.031): 0.081*"energy" + 0.024*"emission" + 0.022*"environmental" + 0.017*"gas" + 0.016*"water" + 0.012*"electricity" + 0.012*"renewable" + 0.011*"climate" + 0.011*"efficiency" + 0.011*"directive"
2018-05-28 19:09:05,712 : INFO : topic #13 (0.031): 0.041*"directive" + 0.027*"law" + 0.019*"court" + 0.013*"contract" + 0.013*"legal" + 0.012*"person" + 0.011*"case" + 0.011*"procedure" + 0.010*"right" + 0.009*"proceeding"
2018-05-28 19:09:05,713 : INFO : topic #6 (0.031): 0.064*"product" + 0.040*"food" + 0.015*"agricultural" + 0.011*"consumer" + 0.010*"label" + 0.009*"import" + 0.009*"production" + 0.009*"name" + 0.009*"origin" + 0.009*"application"
2018-05-28 19:09:05,716 : INFO : topic #16 (0.031): 0.018*"sector" + 0.012*"europe" + 0.011*"propose" + 0.010*"would" + 0.008*"industry" + 0.008*"increase" + 0.008*"could" + 0.007*"future" + 0.007*"change" + 0.007*"challenge"
2018-05-28 19:09:05,717 : INFO : topic diff=0.028947, rho=0.167024
2018-05-28

2018-05-28 19:10:26,247 : INFO : -7.052 per-word bound, 132.7 perplexity estimate based on a held-out corpus of 1692 documents with 605190 words
2018-05-28 19:10:26,249 : INFO : PROGRESS: pass 36, at document #3692/3692
2018-05-28 19:10:32,532 : INFO : merging changes from 1692 documents into a model of 3692 documents
2018-05-28 19:10:32,623 : INFO : topic #17 (0.031): 0.024*"strategy" + 0.024*"research" + 0.015*"innovation" + 0.014*"plan" + 0.013*"technology" + 0.012*"sustainable" + 0.011*"investment" + 0.011*"business" + 0.009*"growth" + 0.009*"europe"
2018-05-28 19:10:32,625 : INFO : topic #26 (0.031): 0.034*"treaty" + 0.021*"article" + 0.011*"euro" + 0.010*"procedure" + 0.010*"central" + 0.009*"bank" + 0.009*"government" + 0.009*"rate" + 0.009*"deficit" + 0.009*"budgetary"
2018-05-28 19:10:32,627 : INFO : topic #15 (0.031): 0.061*"health" + 0.047*"safety" + 0.032*"nuclear" + 0.029*"risk" + 0.027*"directive" + 0.022*"protection" + 0.017*"worker" + 0.013*"euratom" + 0.011*"exposure" 

2018-05-28 19:11:35,742 : INFO : topic #1 (0.031): 0.026*"person" + 0.014*"right" + 0.013*"citizen" + 0.013*"residence" + 0.013*"condition" + 0.012*"application" + 0.012*"noneu" + 0.012*"visa" + 0.012*"document" + 0.011*"entry"
2018-05-28 19:11:35,745 : INFO : topic #6 (0.031): 0.065*"product" + 0.042*"food" + 0.016*"agricultural" + 0.011*"consumer" + 0.010*"label" + 0.009*"import" + 0.009*"production" + 0.009*"modify" + 0.009*"origin" + 0.009*"name"
2018-05-28 19:11:35,750 : INFO : topic diff=0.024092, rho=0.154587
2018-05-28 19:11:48,421 : INFO : -7.051 per-word bound, 132.6 perplexity estimate based on a held-out corpus of 1692 documents with 605190 words
2018-05-28 19:11:48,423 : INFO : PROGRESS: pass 39, at document #3692/3692
2018-05-28 19:11:54,681 : INFO : merging changes from 1692 documents into a model of 3692 documents
2018-05-28 19:11:54,767 : INFO : topic #7 (0.031): 0.060*"tax" + 0.057*"custom" + 0.029*"duty" + 0.025*"taxation" + 0.025*"request" + 0.021*"vat" + 0.019*"ass

2018-05-28 19:12:53,182 : INFO : topic #24 (0.031): 0.027*"paper" + 0.023*"medium" + 0.022*"audiovisual" + 0.022*"green" + 0.020*"general" + 0.018*"interest" + 0.012*"consultation" + 0.011*"television" + 0.011*"sport" + 0.010*"society"
2018-05-28 19:12:53,185 : INFO : topic #15 (0.031): 0.063*"health" + 0.047*"safety" + 0.031*"nuclear" + 0.030*"risk" + 0.025*"directive" + 0.022*"protection" + 0.017*"worker" + 0.013*"euratom" + 0.011*"exposure" + 0.010*"radioactive"
2018-05-28 19:12:53,190 : INFO : topic #19 (0.031): 0.080*"energy" + 0.025*"emission" + 0.022*"environmental" + 0.017*"gas" + 0.016*"water" + 0.012*"electricity" + 0.011*"renewable" + 0.011*"climate" + 0.011*"directive" + 0.011*"reduce"
2018-05-28 19:12:53,192 : INFO : topic #17 (0.031): 0.025*"research" + 0.024*"strategy" + 0.015*"innovation" + 0.014*"plan" + 0.013*"technology" + 0.012*"sustainable" + 0.011*"investment" + 0.011*"business" + 0.009*"growth" + 0.009*"europe"
2018-05-28 19:12:53,193 : INFO : topic diff=0.022139

2018-05-28 19:14:01,415 : INFO : topic diff=0.020855, rho=0.146105
2018-05-28 19:14:01,417 : INFO : PROGRESS: pass 45, at document #2000/3692
2018-05-28 19:14:08,058 : INFO : merging changes from 2000 documents into a model of 3692 documents
2018-05-28 19:14:08,158 : INFO : topic #27 (0.031): 0.061*"datum" + 0.026*"access" + 0.022*"electronic" + 0.017*"network" + 0.015*"internet" + 0.014*"protection" + 0.014*"personal" + 0.013*"digital" + 0.010*"process" + 0.009*"plan"
2018-05-28 19:14:08,160 : INFO : topic #30 (0.031): 0.013*"strategy" + 0.012*"security" + 0.012*"dialogue" + 0.011*"partnership" + 0.010*"political" + 0.009*"partner" + 0.009*"relation" + 0.009*"governance" + 0.009*"aid" + 0.008*"global"
2018-05-28 19:14:08,162 : INFO : topic #25 (0.031): 0.081*"aid" + 0.027*"competition" + 0.019*"article" + 0.013*"grant" + 0.012*"treaty" + 0.012*"guideline" + 0.010*"application" + 0.009*"condition" + 0.009*"sector" + 0.008*"case"
2018-05-28 19:14:08,165 : INFO : topic #14 (0.031): 0.060

2018-05-28 19:15:18,955 : INFO : topic #7 (0.031): 0.060*"tax" + 0.058*"custom" + 0.029*"duty" + 0.025*"taxation" + 0.024*"request" + 0.021*"vat" + 0.019*"assistance" + 0.015*"legislation" + 0.015*"excise" + 0.014*"administrative"
2018-05-28 19:15:18,957 : INFO : topic #18 (0.031): 0.055*"directive" + 0.042*"agency" + 0.021*"safety" + 0.021*"railway" + 0.019*"infrastructure" + 0.016*"network" + 0.016*"technical" + 0.015*"rail" + 0.015*"requirement" + 0.015*"equipment"
2018-05-28 19:15:18,958 : INFO : topic diff=0.019507, rho=0.141640
2018-05-28 19:15:18,961 : INFO : PROGRESS: pass 48, at document #2000/3692
2018-05-28 19:15:26,213 : INFO : merging changes from 2000 documents into a model of 3692 documents
2018-05-28 19:15:26,297 : INFO : topic #24 (0.031): 0.027*"paper" + 0.023*"medium" + 0.023*"audiovisual" + 0.022*"green" + 0.020*"general" + 0.019*"interest" + 0.012*"consultation" + 0.012*"sport" + 0.012*"television" + 0.010*"society"
2018-05-28 19:15:26,300 : INFO : topic #20 (0.031
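
The "per-word bound" and "perplexity estimate" pairs in the log above are consistent with perplexity = 2^(-bound). The short check below recomputes a few of them; the helper name is made up, and small differences are expected because the bound in the log is rounded to three decimals.

```python
def perplexity_from_bound(per_word_bound):
    """Perplexity implied by a base-2 per-word likelihood bound."""
    return 2 ** (-per_word_bound)


# (per-word bound, reported perplexity) pairs copied from the training log
logged = [(-7.157, 142.7), (-7.063, 133.7), (-7.052, 132.7)]
for bound, reported in logged:
    print(bound, round(perplexity_from_bound(bound), 1), reported)
```

Lower perplexity is better. In this run it falls from 142.7 around pass 5 to 132.6 around pass 39, with most of the improvement in the early passes.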

In [21]:
# Print topics
# I also know that none of my topics is more than 5 words long, so I will set the num_words argument to 5
for i,topic in enumerate(ldamodel.print_topics(num_topics=32, num_words=5)):
    words = topic[1].split("+")
    print (words,"\n")

2018-05-28 19:16:25,031 : INFO : topic #0 (0.031): 0.045*"employment" + 0.029*"labour" + 0.019*"worker" + 0.017*"job" + 0.013*"people"
2018-05-28 19:16:25,033 : INFO : topic #1 (0.031): 0.027*"person" + 0.014*"right" + 0.013*"citizen" + 0.013*"application" + 0.013*"residence"
2018-05-28 19:16:25,035 : INFO : topic #2 (0.031): 0.043*"directive" + 0.027*"animal" + 0.015*"eec" + 0.015*"product" + 0.012*"substance"
2018-05-28 19:16:25,037 : INFO : topic #3 (0.031): 0.071*"transport" + 0.034*"air" + 0.028*"vehicle" + 0.027*"passenger" + 0.022*"road"
2018-05-28 19:16:25,038 : INFO : topic #4 (0.031): 0.066*"right" + 0.034*"property" + 0.026*"protection" + 0.026*"intellectual" + 0.017*"copyright"
2018-05-28 19:16:25,041 : INFO : topic #5 (0.031): 0.051*"region" + 0.044*"regional" + 0.025*"transport" + 0.019*"network" + 0.016*"cohesion"
2018-05-28 19:16:25,042 : INFO : topic #6 (0.031): 0.065*"product" + 0.041*"food" + 0.016*"agricultural" + 0.011*"consumer" + 0.010*"label"
2018-05-28 19:16:25

['0.045*"employment" ', ' 0.029*"labour" ', ' 0.019*"worker" ', ' 0.017*"job" ', ' 0.013*"people"'] 

['0.027*"person" ', ' 0.014*"right" ', ' 0.013*"citizen" ', ' 0.013*"application" ', ' 0.013*"residence"'] 

['0.043*"directive" ', ' 0.027*"animal" ', ' 0.015*"eec" ', ' 0.015*"product" ', ' 0.012*"substance"'] 

['0.071*"transport" ', ' 0.034*"air" ', ' 0.028*"vehicle" ', ' 0.027*"passenger" ', ' 0.022*"road"'] 

['0.066*"right" ', ' 0.034*"property" ', ' 0.026*"protection" ', ' 0.026*"intellectual" ', ' 0.017*"copyright"'] 

['0.051*"region" ', ' 0.044*"regional" ', ' 0.025*"transport" ', ' 0.019*"network" ', ' 0.016*"cohesion"'] 

['0.065*"product" ', ' 0.041*"food" ', ' 0.016*"agricultural" ', ' 0.011*"consumer" ', ' 0.010*"label"'] 

['0.061*"tax" ', ' 0.059*"custom" ', ' 0.029*"duty" ', ' 0.025*"taxation" ', ' 0.024*"request"'] 

['0.034*"acquis" ', ' 0.027*"progress" ', ' 0.026*"legislation" ', ' 0.020*"accession" ', ' 0.013*"law"'] 

['0.026*"crime" ', ' 0.021*"criminal" ', ' 
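
Splitting on "+" leaves the weight and the quotes attached to every word. If clean `(word, weight)` pairs are needed, the formatted topic string can be parsed with a regular expression, as in this sketch (`parse_topic` is a hypothetical helper; gensim's `show_topics(..., formatted=False)` should give similar pairs directly).

```python
import re

# Matches one term of the form 0.045*"employment"
TOPIC_TERM = re.compile(r'([0-9.]+)\*"([^"]+)"')


def parse_topic(topic_string):
    """Parse '0.045*"employment" + ...' into [(word, weight), ...]."""
    return [(word, float(weight)) for weight, word in TOPIC_TERM.findall(topic_string)]


example = '0.045*"employment" + 0.029*"labour" + 0.019*"worker" + 0.017*"job" + 0.013*"people"'
print(parse_topic(example))
```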

In [23]:
# Now let's save the model, dictionary, and corpus for further use
# Saving Model
ldamodel.save('/home/jay/ANN_Models/LDA/LDA_Model')

# Saving Corpus (the cleaned, tokenized training documents)
import pickle
with open('/home/jay/ANN_Models/LDA/LAD_Corpus.pickle', 'wb') as p:
    pickle.dump(cleaned,p)
    
# Saving Dictionary
dictionary.save('/home/jay/ANN_Models/LDA/LDA_Dictonary')

2018-05-28 19:17:04,097 : INFO : saving LdaState object under /home/jay/ANN_Models/LDA/LDA_Model.state, separately None
2018-05-28 19:17:04,111 : INFO : saved /home/jay/ANN_Models/LDA/LDA_Model.state
2018-05-28 19:17:04,120 : INFO : saving LdaModel object under /home/jay/ANN_Models/LDA/LDA_Model, separately ['expElogbeta', 'sstats']
2018-05-28 19:17:04,122 : INFO : storing np array 'expElogbeta' to /home/jay/ANN_Models/LDA/LDA_Model.expElogbeta.npy
2018-05-28 19:17:04,126 : INFO : not storing attribute id2word
2018-05-28 19:17:04,127 : INFO : not storing attribute state
2018-05-28 19:17:04,128 : INFO : not storing attribute dispatcher
2018-05-28 19:17:04,132 : INFO : saved /home/jay/ANN_Models/LDA/LDA_Model
2018-05-28 19:17:04,848 : INFO : saving Dictionary object under /home/jay/ANN_Models/LDA/LDA_Dictonary, separately None
2018-05-28 19:17:04,856 : INFO : saved /home/jay/ANN_Models/LDA/LDA_Dictonary
