# Using LDA with scikit-learn and gensim

This notebook uses the scikit-learn and gensim packages to extract the 'n' most important topics from a corpus and the 'm' top words in each topic, using Latent Dirichlet Allocation (LDA).

In [1]:
import keras, tensorflow, sys
keras.__version__, tensorflow.__version__, sys.version

Using TensorFlow backend.


('2.2.4',
 '1.11.0',
 '3.6.6 |Anaconda, Inc.| (default, Jun 28 2018, 11:27:44) [MSC v.1900 64 bit (AMD64)]')

## Fetching data

In [2]:
# Use the readily available 20 Newsgroups dataset from sklearn.

from sklearn.datasets import fetch_20newsgroups

dataset = fetch_20newsgroups(shuffle=True, random_state=1, remove=('headers', 'footers', 'quotes'))
documents = dataset.data

In [3]:
documents[0]

"Well i'm not sure about the story nad it did seem biased. What\nI disagree with is your statement that the U.S. Media is out to\nruin Israels reputation. That is rediculous. The U.S. media is\nthe most pro-israeli media in the world. Having lived in Europe\nI realize that incidences such as the one described in the\nletter have occured. The U.S. media as a whole seem to try to\nignore them. The U.S. is subsidizing Israels existance and the\nEuropeans are not (at least not to the same degree). So I think\nthat might be a reason they report more clearly on the\natrocities.\n\tWhat is a shame is that in Austria, daily reports of\nthe inhuman acts commited by Israeli soldiers and the blessing\nreceived from the Government makes some of the Holocaust guilt\ngo away. After all, look how the Jews are treating other races\nwhen they got power. It is unfortunate.\n"

## NLTK stopwords for English

In [4]:
from nltk.corpus import stopwords
stopwords_en = set(stopwords.words('english'))
print(stopwords_en)

{'why', 'few', "hadn't", 'theirs', 'me', 'he', 'very', "should've", "wouldn't", 'in', 'my', 'hasn', 'with', 'the', "hasn't", 'for', 'down', 'himself', 'we', 'doesn', "it's", 'm', 'have', 'that', "aren't", 'whom', 'where', 's', 'their', 'other', 'she', 'was', 'shouldn', 'at', 'off', 'if', "she's", 'don', 'what', "you'd", 'be', 'here', 'do', 'each', 'no', 'isn', 'is', 'from', 'ain', 'couldn', 'more', 'were', 'then', 'too', 'by', 'on', 'its', 'own', "weren't", 'both', 'ourselves', 'any', 'up', "mustn't", 'and', 'mightn', 'hadn', 'had', 'after', 'weren', 'over', 'itself', 'some', 'will', 'to', 'so', 'during', 'shan', 'under', 'yours', "don't", 'these', 'because', 'myself', 'you', 'which', 'until', 't', 're', 'your', 'while', "isn't", "you've", 'him', 'should', "couldn't", 'd', 'our', 'all', "shouldn't", 'once', 'of', 'further', 'before', 'ma', 'are', "needn't", 'am', 'an', "didn't", 'between', "wasn't", 'as', 'didn', 'it', 'most', 'did', "mightn't", "that'll", 'haven', 'about', 'o', 'only'

## Using scikit-learn LDA to cluster documents into topics and find the top words in each topic

To run LDA with scikit-learn, the documents must first be vectorized with CountVectorizer.

CountVectorizer converts each document into a vector that holds the count of every vocabulary word in that document. Because a single document uses only a small part of the vocabulary, most entries of a document vector are 0.
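To make the count-vector idea concrete, here is a minimal sketch on a made-up three-document corpus (the `toy_*` names and the sentences are illustrative assumptions, not part of the notebook). It uses the default CountVectorizer settings, which lowercase the text and keep tokens of two or more characters.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, only for illustration.
toy_docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs",
]

toy_vectorizer = CountVectorizer()
toy_counts = toy_vectorizer.fit_transform(toy_docs)  # sparse document-term matrix

# Vocabulary words ordered by their column index in the matrix.
toy_vocab = sorted(toy_vectorizer.vocabulary_, key=toy_vectorizer.vocabulary_.get)
print(toy_vocab)
print(toy_counts.toarray())  # one row per document, one column per word

# Fraction of entries that are zero; it grows quickly with vocabulary size.
n_cells = toy_counts.shape[0] * toy_counts.shape[1]
print("fraction of zero entries: %.2f" % (1.0 - toy_counts.nnz / n_cells))
```

Even on this tiny corpus more than half of the entries are zero, which is why the result is stored as a sparse matrix.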

In [5]:
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer


c_vectorizer = CountVectorizer(max_df=0.90, min_df=2, max_features=1000, stop_words='english')



max_df -> ignore words that occur in more than this fraction of the documents (here 0.90, i.e. more than 90% of documents). This drops words that appear in almost every document, e.g. 'a', 'the'.

min_df -> ignore words that occur in fewer than this number of documents (here 2). This drops words that appear in very few documents, e.g. the name of a person.

max_features -> keep only the max_features most frequent words (here 1000) for the evaluation of topics. Words are ranked by their total frequency across the documents, after the words excluded by max_df and min_df have been dropped.

stop_words -> remove English stop words from the corpus. Stop words are very common words like 'a', 'the', 'of', etc. With 'english', scikit-learn's built-in stop word list is used.
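The effect of these filters is easiest to see on a tiny corpus. The sketch below uses four made-up documents (the `fruit_docs` corpus and variable names are assumptions for illustration) and the same max_df and min_df values as above.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical corpus: 'apple' is in every document, 'durian' in only one.
fruit_docs = [
    "apple banana apple",
    "apple cherry banana",
    "apple banana durian",
    "apple cherry",
]

# max_df=0.90 drops 'apple' (in 100% of documents);
# min_df=2 drops 'durian' (in only 1 document).
filtered = CountVectorizer(max_df=0.90, min_df=2).fit(fruit_docs)
kept_words = sorted(filtered.vocabulary_)
print(kept_words)

# max_features=1 then keeps only the most frequent surviving word.
top_only = CountVectorizer(max_df=0.90, min_df=2, max_features=1).fit(fruit_docs)
top_words = sorted(top_only.vocabulary_)
print(top_words)
```

Note that max_features is applied after the document-frequency filters, so a word removed by max_df never competes for a slot among the top features.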


Fit the vectorizer on the documents to obtain the document-term count matrix that LDA will be trained on.

In [6]:
c_vec = c_vectorizer.fit_transform(documents)
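# Note: on scikit-learn >= 1.0 use get_feature_names_out(); get_feature_names() was removed in 1.2.
c_feature_names = c_vectorizer.get_feature_names()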

Find the topics (LDA components) by fitting a LatentDirichletAllocation model.

In [7]:
from sklearn.decomposition import LatentDirichletAllocation

no_topics = 20

# Run LDA (n_components replaces the n_topics argument, which was deprecated in scikit-learn 0.19)
lda = LatentDirichletAllocation(n_components=no_topics, max_iter=5, learning_method='online',
                                learning_offset=50., random_state=0).fit(c_vec)



learning_method -> 'online' means the topics (components) are updated incrementally on mini-batches of the data, rather than from the whole dataset at once.

learning_offset -> a positive parameter for online training that downweights the early iterations, so the model learns slowly at the start of training.
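As a rough illustration of how learning_offset slows down early updates, the sketch below evaluates the step-size schedule used in the online LDA literature, (learning_offset + t) ** (-learning_decay), where t is the mini-batch counter. The helper name `online_step_size` is hypothetical, and the assumption here is that scikit-learn's online updates follow this form with its documented default learning_decay of 0.7.

```python
def online_step_size(iteration, learning_offset, learning_decay=0.7):
    """Weight given to the update from mini-batch number `iteration` (assumed schedule)."""
    return (learning_offset + iteration) ** (-learning_decay)

# Compare scikit-learn's default offset (10) with the value used in this notebook (50).
for offset in (10.0, 50.0):
    steps = [online_step_size(t, offset) for t in (1, 10, 100)]
    print("learning_offset=%.0f:" % offset, ["%.4f" % s for s in steps])
```

A larger offset gives smaller weights to the first mini-batches, so the initial topics are not dominated by whichever documents happen to be seen first.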

### Example topics and the top words in each topic

In [8]:

no_top_words = 10

for topic_idx, topic in enumerate(lda.components_):
    print("Topic %d:" % (topic_idx))
    print(" ".join([c_feature_names[i] for i in topic.argsort()[:-no_top_words - 1:-1]]))

Topic 0:
people gun state control right guns crime states law police
Topic 1:
time question book years did like don space answer just
Topic 2:
mr line rules science stephanopoulos title current define int yes
Topic 3:
key chip keys clipper encryption number des algorithm use bit
Topic 4:
edu com cs vs w7 cx mail uk 17 send
Topic 5:
use does window problem way used point different case value
Topic 6:
windows thanks know help db does dos problem like using
Topic 7:
bike water effect road design media dod paper like turn
Topic 8:
don just like think know people good ve going say
Topic 9:
car new price good power used air sale offer ground
Topic 10:
file available program edu ftp information files use image version
Topic 11:
ax max b8f g9v a86 145 pl 1d9 0t 34u
Topic 12:
government law privacy security legal encryption court fbi technology information
Topic 13:
card bit memory output video color data mode monitor 16
Topic 14:
drive scsi disk mac hard apple drives controller software port
T

## LDA using gensim

Stopwords are removed from the documents so that the top words found for each topic are more relevant.

In [9]:
from gensim import corpora
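# Build the vocabulary (token -> id mapping) from whitespace-separated tokens.
dictionary = corpora.Dictionary([x.split() for x in documents])
# Convert each document into bag-of-words (token id, count) pairs, skipping stopwords.
corpus = [dictionary.doc2bow([text for text in x.split() if text.lower() not in stopwords_en]) for x in documents]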
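
The cell above splits on whitespace only, so punctuation and mixed-case variants remain as separate tokens. As a hedged sketch of one possible alternative, the standard-library code below lowercases the text, keeps alphanumeric runs, drops stopwords, and builds (token id, count) pairs in the same shape that doc2bow returns. The helper names, the tiny stopword set, and the id assignment are illustrative assumptions, not gensim's implementation.

```python
import re
from collections import Counter

# Hypothetical, tiny stopword list for illustration; the notebook uses NLTK's full list.
STOPWORDS = {"the", "is", "a", "of", "and", "to"}
TOKEN_RE = re.compile(r"[a-z0-9]+")

def tokenize(text, stopwords=STOPWORDS):
    """Lowercase the text, keep alphanumeric runs, and drop stopwords."""
    return [tok for tok in TOKEN_RE.findall(text.lower()) if tok not in stopwords]

def bag_of_words(tokens, token2id):
    """Return sorted (token_id, count) pairs, assigning a new id on first sight."""
    counts = Counter(tokens)
    for tok in counts:
        token2id.setdefault(tok, len(token2id))
    return sorted((token2id[tok], n) for tok, n in counts.items())

token2id = {}
sample = "The key: is the KEY to the chip -- and the chip is secure."
tokens = tokenize(sample)
bow = bag_of_words(tokens, token2id)
print(tokens)
print(bow)
```

Token lists produced this way could be passed to corpora.Dictionary and doc2bow in place of x.split(), which would keep symbols such as ':' and '--' out of the vocabulary.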



### Topic words are given along with their importance

In [10]:
from gensim.models.ldamodel import LdaModel
ldamodel = LdaModel(corpus, num_topics=no_topics, id2word=dictionary, passes=15)

topics = ldamodel.print_topics(num_words=no_top_words)
for topic in topics:
    print(topic)

(0, '0.070*":" + 0.026*">" + 0.005*"-" + 0.004*"anonymous" + 0.003*"?" + 0.003*"RIPEM" + 0.003*"mail" + 0.002*"email" + 0.002*"information" + 0.002*"posting"')
(1, '0.023*"." + 0.008*"|" + 0.004*"Gordon" + 0.003*"----------------------------------------------------------------------------" + 0.003*"surrender" + 0.003*"Banks" + 0.003*"intellect," + 0.003*"shameful" + 0.003*"N3JXP" + 0.003*""Skepticism"')
(2, '0.007*"*" + 0.004*"used" + 0.003*"use" + 0.003*"-" + 0.003*"ground" + 0.002*"power" + 0.002*"one" + 0.002*"using" + 0.002*"car" + 0.002*"may"')
(3, '0.012*"key" + 0.005*"space" + 0.004*"launch" + 0.003*"keys" + 0.003*"algorithm" + 0.003*"first" + 0.003*"chip" + 0.003*"satellite" + 0.003*"DES" + 0.003*"--"')
(4, '0.005*"-" + 0.004*"&" + 0.004*"Space" + 0.004*"University" + 0.003*"1993" + 0.003*"available" + 0.003*"Center" + 0.003*"--" + 0.003*"NASA" + 0.003*"April"')
(5, '0.007*"use" + 0.006*"-" + 0.006*"get" + 0.005*"would" + 0.005*"like" + 0.005*"using" + 0.005*"know" + 0.004*"one