# `pyLDAvis.sklearn`

pyLDAvis also supports LDA models fitted with scikit-learn. This notebook walks through that workflow in more detail. It follows the structure of the original 20 newsgroups example, but the data loaded below is a pre-processed DataFrame read from `moties_processed_df.pickle`.

In [11]:
from __future__ import print_function

In [12]:
import pyLDAvis
import pyLDAvis.sklearn
pyLDAvis.enable_notebook()

In [33]:

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation

## Load the dataset

The original example loads the 20 newsgroups dataset from sklearn at this point. Here, a pre-processed DataFrame is loaded from a pickle file instead, and its `Text_processed` column supplies the raw documents.

In [14]:
import pickle

# Load the pre-processed DataFrame; the context manager closes the file afterwards
with open("moties_processed_df.pickle", "rb") as handle:
    df = pickle.load(handle)
print(len(df))

25594


In [26]:
docs_raw = df['Text_processed']

In [34]:
# Read the stop words, one per line; the context manager closes the file afterwards
with open('stopwords.txt') as f:
    lines = [line.rstrip('\n') for line in f]

## Convert to document-term matrix

Next, the raw documents are converted into a document-term matrix, either as raw counts or in TF-IDF form.

In [35]:
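tf_vectorizer = CountVectorizer(strip_accents='unicode',
                                stop_words=lines,
                                lowercase=True,
                                token_pattern=r'\b[a-zA-Z]{3,}\b',  # ASCII words of 3+ letters
                                max_df=0.5,   # drop terms found in more than 50% of documents
                                min_df=10)    # drop terms found in fewer than 10 documents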
dtm_tf = tf_vectorizer.fit_transform(docs_raw)
print(dtm_tf.shape)

(25594, 9161)


In [36]:
tfidf_vectorizer = TfidfVectorizer(**tf_vectorizer.get_params())
dtm_tfidf = tfidf_vectorizer.fit_transform(docs_raw)
print(dtm_tfidf.shape)

(25594, 9161)
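
To make the difference between raw counts and TF-IDF weights concrete, here is a minimal sketch on a toy corpus. The three sentences and the names `toy_docs`, `count_vec` and `tfidf_vec` are invented for illustration and are not part of the notebook's data. Both vectorizers use the same token pattern as the cells above, so they produce the same vocabulary and the same matrix shape; only the cell values differ.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical toy corpus, used only for illustration
toy_docs = [
    "the budget for education rises",
    "the budget for defence rises",
    "the education budget needs healthcare funding",
]

# Same tokenisation rule as the notebook's vectorizer: ASCII words of 3+ letters
pattern = r'\b[a-zA-Z]{3,}\b'

count_vec = CountVectorizer(token_pattern=pattern)
counts = count_vec.fit_transform(toy_docs)

tfidf_vec = TfidfVectorizer(token_pattern=pattern)
tfidf = tfidf_vec.fit_transform(toy_docs)

# vocabulary_ maps each term to its column index (identical for both vectorizers here)
budget = count_vec.vocabulary_['budget']
defence = count_vec.vocabulary_['defence']

print(counts.shape, tfidf.shape)
# In the second document both terms occur once, so their raw counts are equal ...
print(counts[1, budget], counts[1, defence])
# ... but "budget" occurs in every document, so TF-IDF gives it a lower weight
print(tfidf[1, budget] < tfidf[1, defence])
```

A term that appears in every document carries little information about any single one, which is why TF-IDF down-weights it relative to a rarer term.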


## Fit Latent Dirichlet Allocation models

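With the document-term matrices in place, an LDA model with 30 topics is fitted on the raw-count matrix. The TF-IDF variant is left commented out.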

In [55]:
# for TF DTM
lda_tf = LatentDirichletAllocation(n_components=30, random_state=0)
topics = lda_tf.fit(dtm_tf)  # fit() returns the estimator itself, so `topics` is `lda_tf`
# for TFIDF DTM
# lda_tfidf = LatentDirichletAllocation(n_components=15, random_state=0)
# topics = lda_tfidf.fit(dtm_tfidf)
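
Before visualizing, it can help to look at the highest-weighted terms of each topic directly. The sketch below does this on a small invented corpus. The helper `top_terms` and the toy documents are hypothetical and not part of the notebook. It relies only on the fitted model's `components_` array (one row of term weights per topic) and the vectorizer's `vocabulary_` mapping.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, used only for illustration
toy_docs = [
    "school teachers education budget",
    "education school pupils teachers",
    "army defence budget soldiers",
    "defence soldiers army missions",
]

vec = CountVectorizer()
dtm = vec.fit_transform(toy_docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(dtm)

# Order the terms by their column index in the document-term matrix
terms = sorted(vec.vocabulary_, key=vec.vocabulary_.get)

def top_terms(model, terms, n=3):
    """Return the n highest-weighted terms for every topic of a fitted model."""
    tops = []
    for weights in model.components_:
        best = np.argsort(weights)[::-1][:n]  # indices of the largest weights first
        tops.append([terms[i] for i in best])
    return tops

for k, words in enumerate(top_terms(lda, terms)):
    print(k, words)
```

Applied to the notebook's objects, the same helper would take `lda_tf` and the term list derived from `tf_vectorizer.vocabulary_`.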

## Visualizing the models with pyLDAvis

In [None]:
# Hand-assigned (Dutch) labels for topics 0-14; topics 15-29 have no label yet
d = {0:'EU',1:'Onderwijs',2:'Zorg',3:'Chronisch zieken',4:'Werkgelegenheid',5:'Publieke omroep',6:'Defensie',7:'Criminaliteit',8:'Energie & Gaswinning',9:'Infrastructuur',10:'Belasting',11:'Pensioen',
    12:'?',13:'Asielzoekers',14:'Provincies'}

In [69]:
pyLDAvis.sklearn.prepare(lda_tf, dtm_tf, tf_vectorizer)



In [68]:
#pyLDAvis.sklearn.prepare(lda_tfidf, dtm_tfidf, tfidf_vectorizer)

### Using different MDS functions

With `sklearn` installed, other MDS functions, such as MMDS and t-SNE, can be used for plotting if the default PCoA is not satisfactory.

In [11]:
pyLDAvis.sklearn.prepare(lda_tf, dtm_tf, tf_vectorizer, mds='mmds')

pyLDAvis.sklearn.prepare(lda_tf, dtm_tf, tf_vectorizer, mds='tsne')

In [63]:
#trans = topics.transform(dtm_tfidf)
trans = topics.transform(dtm_tf)

In [64]:
trans[1]

array([0.0037037 , 0.0037037 , 0.0037037 , 0.0037037 , 0.89259259,
       0.0037037 , 0.0037037 , 0.0037037 , 0.0037037 , 0.0037037 ,
       0.0037037 , 0.0037037 , 0.0037037 , 0.0037037 , 0.0037037 ,
       0.0037037 , 0.0037037 , 0.0037037 , 0.0037037 , 0.0037037 ,
       0.0037037 , 0.0037037 , 0.0037037 , 0.0037037 , 0.0037037 ,
       0.0037037 , 0.0037037 , 0.0037037 , 0.0037037 , 0.0037037 ])
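
Each row returned by `transform` is a topic distribution for one document, so it should sum to roughly 1, and the largest entry marks the dominant topic. The sketch below rebuilds the printed row by hand to check this. The last part is an interpretation rather than something the notebook states: assuming scikit-learn's default document-topic prior of `1 / n_components` and a document with 8 counted tokens that all fall into one topic, the normalised values come out at the figures shown above.

```python
import numpy as np

# The row printed above, rebuilt by hand: 29 small equal entries and one large one
row = np.full(30, 0.0037037)
row[4] = 0.89259259

print(row.sum())          # approximately 1: the row is a probability distribution
print(int(row.argmax()))  # index of the dominant topic

# Assumption: default prior of 1 / n_components and 8 counted tokens in the document.
# The unnormalised weights then sum to n_topics * prior + n_tokens = 1 + n_tokens.
n_topics, n_tokens = 30, 8
prior = 1.0 / n_topics
print(prior / (n_tokens + 1))               # share of a topic holding only prior mass
print((prior + n_tokens) / (n_tokens + 1))  # share of a topic holding all tokens
```

Under that reading, the 29 identical small values are topics that received essentially nothing beyond the prior.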

In [65]:
import numpy as np
# Assign each document the index of its most probable topic
df['Topic'] = list(np.argmax(trans, axis=1))
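
One plausible use of a label dictionary such as `d` is to turn these topic indices into readable names. The sketch below shows the idea on a small invented document-topic matrix. The names `trans_demo` and `labels` and the fallback string `'unlabelled'` are hypothetical. Topics missing from the dictionary fall back to the placeholder instead of raising an error.

```python
import numpy as np

# Hypothetical document-topic matrix: 3 documents, 4 topics
trans_demo = np.array([[0.7, 0.1, 0.1, 0.1],
                       [0.1, 0.1, 0.2, 0.6],
                       [0.2, 0.5, 0.2, 0.1]])

# Partial label dictionary: topic 3 has no label yet
labels = {0: 'EU', 1: 'Onderwijs', 2: 'Zorg'}

# Most probable topic per document, then a lookup with a fallback for missing labels
dominant = np.argmax(trans_demo, axis=1)
named = [labels.get(int(t), 'unlabelled') for t in dominant]

print(list(dominant))
print(named)
```

With pandas, `df['Topic'].map(d)` performs the same lookup on a column and leaves `NaN` where a topic has no label.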

In [66]:
#df = df[df['Onderwerp'].str.contains('klimaat')]
import pickle
with open('moties_processed_df.pickle', 'wb') as handle:
    pickle.dump(df, handle, protocol=pickle.HIGHEST_PROTOCOL)