# LDA
 
 https://medium.com/mlreview/topic-modeling-with-scikit-learn-e80d33668730
 https://www.kaggle.com/thebrownviking20/topic-modelling-with-spacy-and-scikit-learn

# Dataset
[https://scikit-learn.org/0.19/datasets/twenty_newsgroups.html](https://scikit-learn.org/0.19/datasets/twenty_newsgroups.html)

In [1]:
from sklearn.datasets import fetch_20newsgroups
from pprint import pprint

In [2]:
newsgroups_train_data = fetch_20newsgroups(subset='train')

In [3]:
pprint(list(newsgroups_train_data.target_names)), len(newsgroups_train_data.data)

['alt.atheism',
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x',
 'misc.forsale',
 'rec.autos',
 'rec.motorcycles',
 'rec.sport.baseball',
 'rec.sport.hockey',
 'sci.crypt',
 'sci.electronics',
 'sci.med',
 'sci.space',
 'soc.religion.christian',
 'talk.politics.guns',
 'talk.politics.mideast',
 'talk.politics.misc',
 'talk.religion.misc']


(None, 11314)

In [4]:
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer

In [5]:
documents = newsgroups_train_data.data
num_features = 1000

# NMF is able to use tf-idf
tfidf_vectorizer = TfidfVectorizer(max_df=0.95, min_df=2, max_features=num_features, stop_words='english')
tfidf = tfidf_vectorizer.fit_transform(documents)
# Note: get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() on newer versions
tfidf_feature_names = tfidf_vectorizer.get_feature_names()

# LDA needs raw term counts because it is a probabilistic graphical model over word counts
tf_vectorizer = CountVectorizer(max_df=0.95, min_df=2, max_features=num_features, stop_words='english')
tf = tf_vectorizer.fit_transform(documents)
tf_feature_names = tf_vectorizer.get_feature_names()
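
To make the difference between the two matrices concrete, here is a minimal sketch on a made-up three-document corpus (`toy_docs` and the other names are illustrative and not part of the notebook). It assumes the default vectorizer settings, under which `CountVectorizer` yields integer counts and `TfidfVectorizer` yields L2-normalized real-valued weights.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical toy corpus, only for illustration
toy_docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs",
]

# Raw term counts: the kind of input LDA expects
count_vec = CountVectorizer()
counts = count_vec.fit_transform(toy_docs)

# tf-idf weights: the kind of input used for NMF above
tfidf_vec = TfidfVectorizer()
weights = tfidf_vec.fit_transform(toy_docs)

# "the" occurs twice in the first document
the_col = count_vec.vocabulary_["the"]
print(counts[0, the_col])  # 2

# With the default norm='l2', every tf-idf row has unit length
row_norms = np.sqrt(np.asarray(weights.multiply(weights).sum(axis=1)).ravel())
print(row_norms)
```

Both matrices have the same shape (documents by vocabulary terms); only the cell values differ.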

# Model
Linear Algebra Model: [https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.NMF.html](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.NMF.html)

Probabilistic Model: [https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.LatentDirichletAllocation.html](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.LatentDirichletAllocation.html)

In [6]:
from sklearn.decomposition import NMF, LatentDirichletAllocation

In [7]:
num_topics = 20

In [8]:
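# Run NMF
# Note: the `alpha` argument was removed in scikit-learn 1.2; newer versions use `alpha_W` and `alpha_H` instead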
nmf = NMF(n_components=num_topics, random_state=1, alpha=.1, l1_ratio=.5, init='nndsvd').fit(tfidf)

In [9]:
# Run LDA
lda = LatentDirichletAllocation(n_components=num_topics, max_iter=5, learning_method='online', learning_offset=50.,random_state=0)
lda.fit(tf)

LatentDirichletAllocation(learning_method='online', learning_offset=50.0,
                          max_iter=5, n_components=20, random_state=0)
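
Once fitted, the model can also report how strongly each document belongs to each topic. The sketch below shows this on a made-up four-document corpus rather than the newsgroups data, so that it runs in a moment. `toy_docs`, `toy_counts`, `toy_lda`, and `doc_topics` are illustrative names, and two topics is an arbitrary choice.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, only for illustration
toy_docs = [
    "space nasa moon orbit launch",
    "moon orbit rocket space shuttle",
    "hockey team game season players",
    "game players team win season",
]

toy_counts = CountVectorizer().fit_transform(toy_docs)

# Two topics is an assumption made for this toy example
toy_lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(toy_counts)

# transform() returns one row per document and one column per topic;
# each row is a normalized distribution over topics
doc_topics = toy_lda.transform(toy_counts)
print(doc_topics.shape)        # (4, 2)
print(doc_topics.sum(axis=1))  # each entry is approximately 1.0

# components_ holds one row of word weights per topic
print(toy_lda.components_.shape)
```

The same `transform` call should apply to `lda` and `tf` from the cells above.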

**Displaying the topics**

In [10]:
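def display_topics(model, feature_names, num_top_words):
    # Print the highest-weighted words of each topic (one row of model.components_ per topic)
    for topic_idx, topic in enumerate(model.components_):
        top_indices = topic.argsort()[:-num_top_words - 1:-1]
        print(f"Topic {topic_idx}:", " ".join(feature_names[i] for i in top_indices))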
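
The slice `[:-num_top_words - 1:-1]` in `display_topics` is compact but easy to misread. Here is a small sketch of what it does, using made-up weights and words (`weights`, `vocab`, and `n` are illustrative names): `argsort()` orders indices by ascending weight, and the negative-step slice walks back from the end to pick the `n` largest.

```python
import numpy as np

# Hypothetical topic weights and vocabulary, only for illustration
weights = np.array([0.1, 0.7, 0.05, 0.4, 0.3])
vocab = ["apple", "banana", "cherry", "date", "elder"]
n = 3

# Indices ordered from smallest to largest weight
order = weights.argsort()
print(order)  # [2 0 4 3 1]

# Walk backwards from the last index, taking n entries: the n largest weights
top = order[:-n - 1:-1]
top_words = [vocab[i] for i in top]
print(top_words)  # ['banana', 'date', 'elder']
```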

In [11]:
num_top_words = 10

In [12]:
display_topics(nmf, tfidf_feature_names, num_top_words)

Topic 0: people don just like think good know time right make
Topic 1: edu university posting nntp host article distribution writes washington cs
Topic 2: com hp article writes ibm sun posting att nntp host
Topic 3: drive scsi ide drives disk hard controller mac bus apple
Topic 4: game team games year players hockey season play win nhl
Topic 5: key clipper chip encryption keys escrow government algorithm security secret
Topic 6: uk ac university ed 44 __ newsreader host posting nntp
Topic 7: nasa gov space center research ___ __ moon laboratory station
Topic 8: cs pitt gordon banks science edu pittsburgh computer univ soon
Topic 9: god jesus bible christian christ christians faith believe church christianity
Topic 10: windows file dos window files card program use thanks help
Topic 11: ca canada bnr university writes article cs toronto posting nntp
Topic 12: israel israeli jews turkish armenian arab armenians org jewish armenia
Topic 13: ohio state acs edu university john article nntp 

In [13]:
display_topics(lda, tf_feature_names, num_top_words)

Topic 0: space gov nasa netcom access earth research center moon digex
Topic 1: key government chip encryption clipper president __ keys clinton security
Topic 2: file windows program files ftp available image server edu version
Topic 3: edu writes article state org uiuc ohio university acs cso
Topic 4: god jesus people christian believe bible christians does church say
Topic 5: problem time use work using hp ve problems does used
Topic 6: edu cs cc writes university article science pitt columbia computer
Topic 7: people government turkish war armenian armenians world said children years
Topic 8: ca team game edu year games hockey play season players
Topic 9: mit window code number db output source application motif widget
Topic 10: com mail software sale fax internet graphics list email information
Topic 11: 10 00 1993 25 15 16 20 11 14 12
Topic 12: drive card scsi mac apple ibm disk dos windows pc
Topic 13: ax max g9v b8f a86 0d 145 pl 1d9 34u
Topic 14: edu keith writes technology in

## Visualize

In [14]:
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)

In [24]:
import pyLDAvis
# Note: pyLDAvis 3.4 renamed the `pyLDAvis.sklearn` module to `pyLDAvis.lda_model`
import pyLDAvis.sklearn
from pyLDAvis import PreparedData

In [35]:
pyLDAvis.enable_notebook()

In [18]:
data = pyLDAvis.sklearn.prepare(lda, tf, tf_vectorizer, mds='tsne')



In [36]:
pyLDAvis.display(data, local=True)