<font color=#505050 size=3 face='arial black'>
<br><br><center><font color=#505050 size=14 face='arial black'>Topic Models in Python</font></center>  <br><br><center><font color=#13577F size=5 face='arial black'>Kareem Carr</font></center><br><br><br><br>

<p>Python is a powerful programming language that is especially well suited to text analysis. In this workshop, we will cover some of the state-of-the-art Python packages for processing text.</p>

<p>Many of these packages are still relatively immature. They require a higher tolerance for non-intuitive interfaces and experimental, incomplete or imperfect features. However, if you stick with them, you can do many very cool things.</p>

</font>

<font color=#505050 size=3 face='arial black'>
<p><font color=#13577F size=4 face='arial black'> Outline of Latent Dirichlet Allocation-based Topic Modeling </font></p>

<p><font color=#13577F size=3 face='arial black'>Topic models</font> are statistical models for learning the hidden structure in document collections.</p>

<p><font color=#13577F size=3 face='arial black'>Latent Dirichlet allocation</font> and its extensions are a popular class of topic models (Blei et al. 2003). Each document is considered to be a mixture of topics. Each topic is defined by the probability of observing certain words. For example, one is more likely to observe the word 'ball' in a document where the topic is sport than in a document where the topic is politics. </p>

<br>

<figure>
    <img src='topics.png' alt='[missing figure]'/>
    <figcaption> <center> <font color=#505050 size=2 face='arial black'>Illustration of LDA topic model</font> </center> </figcaption>
</figure>

<br>

</font>
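
<font color=#505050 size=3 face='arial black'>
<p>The generative story behind LDA can be sketched in a few lines. The following is a toy illustration only, not how any library fits the model: the two topics, their word probabilities and the value of <code>alpha</code> are made-up assumptions. Each document draws a mixture over topics, and each word is drawn by first picking a topic from that mixture and then picking a word from that topic.</p>
</font>

```python
import random

# Hypothetical toy topics: each topic is a probability distribution over words.
topics = {
    "sport":    {"ball": 0.5, "team": 0.3, "vote": 0.1, "law": 0.1},
    "politics": {"ball": 0.05, "team": 0.15, "vote": 0.4, "law": 0.4},
}

def sample_dirichlet(alphas, rng):
    """Draw one sample from a Dirichlet distribution via normalized gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(n_words, alpha=0.5, seed=0):
    """Draw a topic mixture for the document, then a topic and a word per position."""
    rng = random.Random(seed)
    names = list(topics)
    mixture = sample_dirichlet([alpha] * len(names), rng)
    words = []
    for _ in range(n_words):
        topic = rng.choices(names, weights=mixture)[0]
        vocab = list(topics[topic])
        weights = list(topics[topic].values())
        words.append(rng.choices(vocab, weights=weights)[0])
    return mixture, words

mixture, words = generate_document(10)
print(mixture)
print(words)
```

<font color=#505050 size=3 face='arial black'>
<p>Fitting a topic model runs this story in reverse: given only the words, it estimates the topics and the per-document mixtures.</p>
</font>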

<p><font color=#13577F size=4 face='arial black'> What kind of text do you have and where is it located?</font></p>

<p><font color=#13577F size=4 face='arial black'> Case Study: HTML </font></p>

In [1]:
from urllib.request import urlopen
wikipedia = urlopen('https://en.wikipedia.org/wiki/Python_(programming_language)')
wikipedia_html = wikipedia.read()
wikipedia_html[0:200]

b'<!DOCTYPE html>\n<html class="client-nojs" lang="en" dir="ltr">\n<head>\n<meta charset="UTF-8"/>\n<title>Python (programming language) - Wikipedia</title>\n<script>document.documentElement.className = docu'

In [57]:
#http://stackoverflow.com/questions/1936466/beautifulsoup-grab-visible-webpage-text

from bs4 import BeautifulSoup
from bs4.element import Comment

def visible(element):
    #skip text inside tags whose content is never rendered
    if element.parent.name in ['style', 'script', '[document]', 'head', 'title']:
        return False
    #skip HTML comments (bs4 represents them as Comment objects)
    elif isinstance(element, Comment):
        return False
    return True


def clean_html(html):
    soup = BeautifulSoup(html, 'html.parser')
    texts = soup.find_all(string=True)
    visible_texts = filter(visible, texts)
    text = " ".join(visible_texts)
    return text

cleaned_html = clean_html(wikipedia_html)
print(cleaned_html[:200])


 
   
 
 
 
  CentralNotice  
 
 
 
 Python (programming language) 
 
 From Wikipedia, the free encyclopedia 
 
 
					Jump to:					 navigation , 					 search 
 
 
 Python 
 
 
 
 
 Paradigm 
 multi-


<p><font color=#13577F size=4 face='arial black'> Case Study: Trawling a folder for text files </font></p>

In [None]:
import os
current_dir = os.getcwd()
example_datasource = os.path.join(current_dir, 'example_datasource' )
print(current_dir)
print(example_datasource)

In [None]:
import os

text_files = []

for path,dirs, files in os.walk(example_datasource):
    for file in files:
        if file.endswith('.txt'):
            text_files.append(os.path.join(path,file))
            
print(text_files)
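
<font color=#505050 size=3 face='arial black'>
<p>The same search can also be written with <code>pathlib</code> from the standard library. This is a minimal sketch: the helper name <code>find_text_files</code> and the throwaway directory used for the demonstration are assumptions made for illustration.</p>
</font>

```python
import tempfile
from pathlib import Path

def find_text_files(root):
    """Return the sorted paths of all .txt files below root, as strings."""
    return sorted(str(path) for path in Path(root).rglob('*.txt') if path.is_file())

# Demonstration on a temporary directory with two text files and one other file.
with tempfile.TemporaryDirectory() as root:
    (Path(root) / 'sub').mkdir()
    (Path(root) / 'a.txt').write_text('one')
    (Path(root) / 'sub' / 'b.txt').write_text('two')
    (Path(root) / 'sub' / 'notes.md').write_text('skip me')
    found = find_text_files(root)
    print([Path(p).name for p in found])
```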

<font color=#505050 size=3 face='arial black'>
<p><font color=#13577F size=4 face='arial black'> How to Clean Text</font></p>

<p> Raw text often contains metadata such as the document name, chapter headings and the names of speakers (as in a play or transcript), along with boilerplate such as copyright or trademark notices.
</p>

<p> The solution to these challenges is a combination of specialized parsers, regular expressions, and removing data from fixed locations. </p>

</font>
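
<font color=#505050 size=3 face='arial black'>
<p>As a small illustration of the regular-expression approach, the sketch below removes chapter headings, a copyright line and leading speaker labels from a made-up transcript. The transcript and the <code>NAME:</code> speaker format are assumptions; real documents usually need patterns tailored to their own layout.</p>
</font>

```python
import re

# Hypothetical transcript snippet; the 'NAME:' speaker format is an assumption.
transcript = """CHAPTER 1
ALICE: Curiouser and curiouser!
RABBIT: Oh dear! I shall be too late!
Copyright 1865. All rights reserved."""

def strip_metadata(text):
    """Drop heading and copyright lines, and remove leading speaker labels."""
    kept = []
    for line in text.splitlines():
        if re.match(r'CHAPTER\s+\w+', line):      # chapter headings
            continue
        if re.match(r'Copyright\b', line):        # boilerplate line
            continue
        # remove an upper-case speaker label at the start of the line
        kept.append(re.sub(r'^[A-Z][A-Z ]*:\s*', '', line))
    return "\n".join(kept)

print(strip_metadata(transcript))
```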

<p><font color=#13577F size=4 face='arial black'> 1. Inspect the files </font></p>

In [3]:
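#the file header declares UTF-8; the with-block closes the file automatically
with open('alice_in_wonderland.txt','r',encoding='utf-8') as alice_file:
    raw_text = alice_file.read()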

In [4]:
#starts with meta data
print(raw_text[:800])

Project Gutenberg’s Alice’s Adventures in Wonderland, by Lewis Carroll

This eBook is for the use of anyone anywhere at no cost and with
almost no restrictions whatsoever.  You may copy it, give it away or
re-use it under the terms of the Project Gutenberg License included
with this eBook or online at www.gutenberg.org


Title: Alice’s Adventures in Wonderland

Author: Lewis Carroll

Posting Date: June 25, 2008 [EBook #11]
Release Date: March, 1994
Last Updated: October 6, 2016

Language: English

Character set encoding: UTF-8

*** START OF THIS PROJECT GUTENBERG EBOOK ALICE’S ADVENTURES IN WONDERLAND ***










ALICE’S ADVENTURES IN WONDERLAND

Lewis Carroll

THE MILLENNIUM FULCRUM EDITION 3.0




CHAPTER I. Down the Rabbit-Hole

Alice was beginning to get very tired of sitting by her 


In [5]:
#ends with legal language
print(raw_text[-1200:])

g Web pages for current donation
methods and addresses.  Donations are accepted in a number of other
ways including checks, online payments and credit card donations.
To donate, please visit: http://pglaf.org/donate


Section 5.  General Information About Project Gutenberg-tm electronic
works.

Professor Michael S. Hart is the originator of the Project Gutenberg-tm
concept of a library of electronic works that could be freely shared
with anyone.  For thirty years, he produced and distributed Project
Gutenberg-tm eBooks with only a loose network of volunteer support.


Project Gutenberg-tm eBooks are often created from several printed
editions, all of which are confirmed as Public Domain in the U.S.
unless a copyright notice is included.  Thus, we do not necessarily
keep eBooks in compliance with any particular paper edition.


Most people start at our Web site which has the main PG search facility:

     http://www.gutenberg.org

This Web site includes information about Project Gutenbe

<p><font color=#13577F size=4 face='arial black'> 2. Extract the content you are interested in </font></p>

In [6]:
import re

#locate the first occurrence of <term> followed by a colon and return the text after the colon
def _get_term(term, text):
    result = re.search(term+':.*',text)
    if result:
        return result.group()[len(term)+1:].strip()
    else:
        return ''

author   = _get_term('Author',raw_text)
title    = _get_term('Title',raw_text)
language = _get_term('Language',raw_text)

#split once at the start marker: everything before it is the preamble
preamble, body = re.split(r'\*\*\* START OF THIS PROJECT GUTENBERG EBOOK.*',raw_text,maxsplit=1)

#the last 18765 characters are treated as boilerplate; this offset is specific to this file
alice,boilerplate = body[:-18765],body[-18765:]

In [7]:
#meta data
print(author,title,language,sep='\n')

Lewis Carroll
Alice’s Adventures in Wonderland
English


In [8]:
print(alice[:500])












ALICE’S ADVENTURES IN WONDERLAND

Lewis Carroll

THE MILLENNIUM FULCRUM EDITION 3.0




CHAPTER I. Down the Rabbit-Hole

Alice was beginning to get very tired of sitting by her sister on the
bank, and of having nothing to do: once or twice she had peeped into the
book her sister was reading, but it had no pictures or conversations in
it, ‘and what is the use of a book,’ thought Alice ‘without pictures or
conversations?’

So she was considering in her own mind (as well as she could, fo
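
<font color=#505050 size=3 face='arial black'>
<p>A fixed character offset only works for one particular file. A more portable sketch splits on marker lines instead. The start marker is the one visible in the file above; the end marker is an assumption (it is not shown in the excerpts above) and should be checked against the actual file. The helper name <code>split_gutenberg</code> is made up for illustration.</p>
</font>

```python
import re

START = r'\*\*\* START OF THIS PROJECT GUTENBERG EBOOK.*'
END = r'\*\*\* END OF THIS PROJECT GUTENBERG EBOOK.*'   # assumed marker, verify in your file

def split_gutenberg(text):
    """Return (preamble, body, boilerplate) using the marker lines as boundaries."""
    preamble, rest = re.split(START, text, maxsplit=1)
    parts = re.split(END, rest, maxsplit=1)
    body = parts[0]
    # if the end marker is missing, everything after the start marker is the body
    boilerplate = parts[1] if len(parts) == 2 else ''
    return preamble, body, boilerplate

sample = ("Title: Example\n"
          "*** START OF THIS PROJECT GUTENBERG EBOOK EXAMPLE ***\n"
          "Story text.\n"
          "*** END OF THIS PROJECT GUTENBERG EBOOK EXAMPLE ***\n"
          "License text.\n")
preamble, body, boilerplate = split_gutenberg(sample)
print(body.strip())
```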


<p><font color=#13577F size=4 face='arial black'> 3. Standardize and Transform </font></p>

<font color=#505050 size=3 face='arial black'>
<ul>
<li>tokenization</li>
<li>stopword removal</li>
<li>normalization</li>
<li>collocation discovery </li>
</ul>

</font>

In [10]:
paragraph= """Alice was beginning to get very tired of sitting by her sister on the bank, and of having nothing to do: once or twice she had peeped into the book her sister was reading, but it had no pictures or conversations in it, ‘and what is the use of a book,’ thought Alice ‘without pictures or conversations?’"""
print(paragraph)

Alice was beginning to get very tired of sitting by her sister on the bank, and of having nothing to do: once or twice she had peeped into the book her sister was reading, but it had no pictures or conversations in it, ‘and what is the use of a book,’ thought Alice ‘without pictures or conversations?’


<p><font color=#13577F size=3 face='arial black'> Tokenization </font></p>

In [11]:
from nltk import RegexpTokenizer
tokenizer = RegexpTokenizer(r'\w+').tokenize
tokenized_text = tokenizer(paragraph)
print(tokenized_text)

['Alice', 'was', 'beginning', 'to', 'get', 'very', 'tired', 'of', 'sitting', 'by', 'her', 'sister', 'on', 'the', 'bank', 'and', 'of', 'having', 'nothing', 'to', 'do', 'once', 'or', 'twice', 'she', 'had', 'peeped', 'into', 'the', 'book', 'her', 'sister', 'was', 'reading', 'but', 'it', 'had', 'no', 'pictures', 'or', 'conversations', 'in', 'it', 'and', 'what', 'is', 'the', 'use', 'of', 'a', 'book', 'thought', 'Alice', 'without', 'pictures', 'or', 'conversations']
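
<font color=#505050 size=3 face='arial black'>
<p>The pattern <code>\w+</code> keeps runs of word characters, so punctuation is dropped and apostrophes split words apart. The sketch below shows the effect with the standard <code>re</code> module on a made-up sentence, together with one possible variant that keeps apostrophes inside words. The variant pattern is an illustration, not the tokenizer used in the rest of this workshop.</p>
</font>

```python
import re

text = "Alice's sister didn't read the book."

# Same pattern as the tokenizer above: apostrophes split words apart.
plain_tokens = re.findall(r"\w+", text)
print(plain_tokens)

# Variant that keeps an apostrophe (straight or curly) between word characters.
apostrophe_tokens = re.findall(r"\w+(?:['’]\w+)*", text)
print(apostrophe_tokens)
```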


<p><font color=#13577F size=3 face='arial black'> Stopword removal </font></p>

In [12]:
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')
stopword_list = list(stopwords.words('english'))
print(stopword_list)

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/kareemcarr/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', 'her', 'hers', 'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'ot

In [13]:
stopword_set = set(stopword_list)  #build the set once instead of once per word
tokenized_text = [word for word in tokenized_text if word not in stopword_set]
print(tokenized_text)

['Alice', 'beginning', 'get', 'tired', 'sitting', 'sister', 'bank', 'nothing', 'twice', 'peeped', 'book', 'sister', 'reading', 'pictures', 'conversations', 'use', 'book', 'thought', 'Alice', 'without', 'pictures', 'conversations']


<p><font color=#13577F size=3 face='arial black'> More filtering </font></p>

In [14]:
def has_no_numbers(inputString):
     return not any(char.isdigit() for char in inputString)

def filter_numbers_from_word_list(word_list):
    return list(filter(has_no_numbers, word_list))

filter_numbers_from_word_list(['cat','k33p','dog','mouse','1221'])

['cat', 'dog', 'mouse']

<p><font color=#13577F size=3 face='arial black'> Extremely stringent filtering </font></p>

In [15]:
#pip.main() is no longer available in recent versions of pip; use the notebook magic instead
%pip install pyenchant

import enchant
eng = enchant.Dict("en_US")
eng.check("Apple")

list(filter(eng.check,['cat','k33p','dog','mouse','CGIS','Knafel','IQSS']))



['cat', 'dog', 'mouse']

<p><font color=#13577F size=3 face='arial black'> Normalization </font></p>

In [16]:
#simple normalization
words = ['President','PRESIDENT','president']
[word.lower() for word in words]

['president', 'president', 'president']

<font color=#505050 size=3 face='arial black'>

<p> <font color=#13577F size=3 face='arial black'> Stemming </font> is the process of reducing inflected words to their stem, base or root form. The stem need not be a real word. </p>

<p> <font color=#13577F size=3 face='arial black'> Lemmatisation </font> is the process of grouping together the different inflected forms of a word so they can be analysed as a single item.</p>

</font>

In [17]:
from nltk.stem.porter import PorterStemmer
porter_stemmer = PorterStemmer().stem
porter_stemmer('president')

'presid'

In [18]:
from nltk.stem import WordNetLemmatizer
wordnet_lemmatizer = WordNetLemmatizer().lemmatize

print(wordnet_lemmatizer('president'))

president
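
<font color=#505050 size=3 face='arial black'>
<p>The difference between the two ideas can be illustrated with a deliberately crude sketch. The suffix list and the lookup table below are made up for this example; the Porter stemmer and the WordNet lemmatizer used above are far more elaborate.</p>
</font>

```python
# Toy illustrations only: the suffix list and lookup table are hypothetical.
SUFFIXES = ('ing', 'ed', 'es', 's')
LEMMAS = {'was': 'be', 'mice': 'mouse', 'better': 'good'}

def toy_stem(word):
    """Chop the first matching suffix; the result need not be a real word."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

def toy_lemmatize(word):
    """Map a known inflected form to its dictionary form; leave other words alone."""
    return LEMMAS.get(word, word)

for word in ['sitting', 'pictures', 'mice', 'was']:
    print(word, toy_stem(word), toy_lemmatize(word))
```

<font color=#505050 size=3 face='arial black'>
<p>The stemmer works by rule and can produce non-words, while the lemmatizer relies on a vocabulary and returns dictionary forms.</p>
</font>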


<font color=#505050 size=3 face='arial black'>
<p><font color=#13577F size=3 face='arial black'> Collocation discovery: n-grams </font></p>
</font>

In [19]:
from nltk import bigrams, ngrams
tokens = tokenizer('I had a cat. I had a rat. I had a bat.')
bgs = bigrams(tokens)
bgs=list(bgs)

tgs = ngrams(tokens,n=3)
tgs = list(tgs)

In [20]:
print(bgs)

[('I', 'had'), ('had', 'a'), ('a', 'cat'), ('cat', 'I'), ('I', 'had'), ('had', 'a'), ('a', 'rat'), ('rat', 'I'), ('I', 'had'), ('had', 'a'), ('a', 'bat')]


In [21]:
print(tgs)

[('I', 'had', 'a'), ('had', 'a', 'cat'), ('a', 'cat', 'I'), ('cat', 'I', 'had'), ('I', 'had', 'a'), ('had', 'a', 'rat'), ('a', 'rat', 'I'), ('rat', 'I', 'had'), ('I', 'had', 'a'), ('had', 'a', 'bat')]


In [22]:
nltk.FreqDist(bgs)

FreqDist({('I', 'had'): 3,
          ('a', 'bat'): 1,
          ('a', 'cat'): 1,
          ('a', 'rat'): 1,
          ('cat', 'I'): 1,
          ('had', 'a'): 3,
          ('rat', 'I'): 1})

In [23]:
nltk.FreqDist(tgs)

FreqDist({('I', 'had', 'a'): 3,
          ('a', 'cat', 'I'): 1,
          ('a', 'rat', 'I'): 1,
          ('cat', 'I', 'had'): 1,
          ('had', 'a', 'bat'): 1,
          ('had', 'a', 'cat'): 1,
          ('had', 'a', 'rat'): 1,
          ('rat', 'I', 'had'): 1})
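
<font color=#505050 size=3 face='arial black'>
<p>For intuition, n-gram counting can also be sketched with the standard library alone. The helper name <code>ngram_counts</code> is an assumption made for this example; it slides a window of n tokens over the list and counts each window.</p>
</font>

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count every run of n consecutive tokens."""
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(grams)

tokens = 'I had a cat I had a rat I had a bat'.split()
bigram_counts = ngram_counts(tokens, 2)
trigram_counts = ngram_counts(tokens, 3)
print(bigram_counts.most_common(2))
print(trigram_counts.most_common(1))
```

<font color=#505050 size=3 face='arial black'>
<p>N-grams that occur much more often than the others are candidates for collocations.</p>
</font>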

<font color=#505050 size=3 face='arial black'>
<p><font color=#13577F size=4 face='arial black'> Topic Modeling </font></p>
</font>

In [24]:
import json
import pandas as pd
shakespeare = pd.read_csv('will_play_text.data',delimiter=';',index_col=0,names=['play','act','index','speaker','dialogue'])
shakespeare.head(6)

Unnamed: 0,play,act,index,speaker,dialogue
1,Henry IV,,,,ACT I
2,Henry IV,,,,SCENE I. London. The palace.
3,Henry IV,,,,"Enter KING HENRY, LORD JOHN OF LANCASTER, the ..."
4,Henry IV,1.0,1.1.1,KING HENRY IV,"So shaken as we are, so wan with care,"
5,Henry IV,1.0,1.1.2,KING HENRY IV,"Find we a time for frighted peace to pant,"
6,Henry IV,1.0,1.1.3,KING HENRY IV,And breathe short-winded accents of new broils


In [25]:
shakespeare = shakespeare[shakespeare['speaker'].notnull()]
shakespeare.head(6)

Unnamed: 0,play,act,index,speaker,dialogue
4,Henry IV,1.0,1.1.1,KING HENRY IV,"So shaken as we are, so wan with care,"
5,Henry IV,1.0,1.1.2,KING HENRY IV,"Find we a time for frighted peace to pant,"
6,Henry IV,1.0,1.1.3,KING HENRY IV,And breathe short-winded accents of new broils
7,Henry IV,1.0,1.1.4,KING HENRY IV,To be commenced in strands afar remote.
8,Henry IV,1.0,1.1.5,KING HENRY IV,No more the thirsty entrance of this soil
9,Henry IV,1.0,1.1.6,KING HENRY IV,Shall daub her lips with her own children's bl...


In [26]:
shakespeare = shakespeare[['speaker','dialogue']]
shakespeare.head(6)

Unnamed: 0,speaker,dialogue
4,KING HENRY IV,"So shaken as we are, so wan with care,"
5,KING HENRY IV,"Find we a time for frighted peace to pant,"
6,KING HENRY IV,And breathe short-winded accents of new broils
7,KING HENRY IV,To be commenced in strands afar remote.
8,KING HENRY IV,No more the thirsty entrance of this soil
9,KING HENRY IV,Shall daub her lips with her own children's bl...


In [27]:
speech = shakespeare.groupby('speaker').apply(lambda x: " ".join(x['dialogue']))
speech.head()

speaker
A Lord                                            Sir, it was I.
A Patrician    This man has marr'd his fortune. You do the no...
A Player       So please your lordship to accept our duty. I ...
AARON          Now climbeth Tamora Olympus' top, Safe out of ...
ABERGAVENNY    I cannot tell What heaven hath given him,--let...
dtype: object

In [28]:
characters = pd.DataFrame()
characters['dialogue']=speech
characters

Unnamed: 0_level_0,dialogue
speaker,Unnamed: 1_level_1
A Lord,"Sir, it was I."
A Patrician,This man has marr'd his fortune. You do the no...
A Player,So please your lordship to accept our duty. I ...
AARON,"Now climbeth Tamora Olympus' top, Safe out of ..."
ABERGAVENNY,"I cannot tell What heaven hath given him,--let..."
ABHORSON,"Do you call, sir? A bawd, sir? fie upon him! h..."
ABRAHAM,"Do you bite your thumb at us, sir? Do you bite..."
ACHILLES,"Why, how now, Ajax! wherefore do you thus? How..."
ADAM,"Yonder comes my master, your brother. Sweet ma..."
ADRIAN,"Though this island seem to be desert,-- Uninha..."


In [29]:
corpus = characters['dialogue'].tolist()
print(corpus[0:2])

['Sir, it was I.', "This man has marr'd his fortune. You do the nobler. Ay, and burn too. Enter MENENIUS and Senators"]


In [30]:

def clean_and_tokenize(raw_text,custom_stopwords=None):
    if custom_stopwords is None: custom_stopwords=[] 
    
    #tokenization
    tokenizer = RegexpTokenizer(r'\w+').tokenize
    words = tokenizer(raw_text)
    
    #filters
    
    #stopwords (the comparison is case-sensitive, so capitalized
    #stopwords such as 'And' are not removed at this step)
    all_stopwords = set(custom_stopwords+stopword_list)
    words = [word for word in words if word not in all_stopwords]
    
    #numbers
    words = filter_numbers_from_word_list(words)
    
    #normalization (stemming)
    porter_stemmer = PorterStemmer()
    
    words = [porter_stemmer.stem(word) for word in words]
    
    words = [word.lower() for word in words]
    
    #drop very short tokens
    words = [word for word in words if len(word) > 2]

    return words

In [31]:
cleaned_corpus = list(map(clean_and_tokenize, corpus))

In [32]:
print(" ".join(cleaned_corpus[5]))

call sir bawd sir fie upon discredit mysteri sir mysteri sir mysteri everi true man apparel fit thief littl thief true man think big enough big thief thief think littl enough everi true man apparel fit thief enter provost come bawd instruct thee trade follow sirrah bring barnardin hither what barnardin tell must awak quickli fetch axe upon block sirrah truli sir would desir clap prayer look warrant come look sir come ghostli father jest think enter duke vincentio disguis


<font color=#505050 size=3 face='arial black'>
<p><font color=#13577F size=4 face='arial black'> Topic Models: gensim package</font></p>
</font>

In [33]:
%pip install gensim
import gensim

'pattern' package not found; tag filters are not available for English




<font color=#505050 size=3 face='arial black'>

<p>The <font color=#13577F face='arial black'>term frequency</font> is, in the simplest case, the raw count of a term in a document. (It is high if a word occurs often in a particular document.)</p>

<p>The <font color=#13577F face='arial black'>inverse document frequency</font> is based on the inverse of the fraction of documents that contain the term, usually on a log scale. (It is high if the term occurs in few documents.)</p>

<p>The <font color=#13577F face='arial black'>term frequency-inverse document frequency (tf-idf)</font> is the product of these two measures. A term has a high tf-idf weight if it occurs frequently in the given document and in few documents of the whole collection.</p>

<p>This transformation naturally downweights stopwords and words similar to stopwords.</p>

</font>
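
<font color=#505050 size=3 face='arial black'>
<p>The definitions above can be sketched in plain Python. This uses the textbook form, raw count times log(N / df), on a made-up three-document corpus. Libraries such as gensim and scikit-learn apply their own variants (for example a different logarithm, smoothing or normalization), so their numbers will generally differ from these.</p>
</font>

```python
import math
from collections import Counter

# Hypothetical tokenized corpus of three tiny documents.
docs = [['king', 'lord', 'king'], ['lord', 'love'], ['lord', 'sir']]

def tf_idf(docs):
    """Return one {term: weight} dict per document, with weight = tf * log(N / df)."""
    n_docs = len(docs)
    # document frequency: number of documents that contain each term
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)   # raw term frequency within this document
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights

weights = tf_idf(docs)
print(weights[0])
```

<font color=#505050 size=3 face='arial black'>
<p>Here 'lord' appears in every document, so its weight is zero everywhere, while 'king' appears twice in a single document and receives the largest weight.</p>
</font>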

In [34]:
from gensim import corpora
dictionary = corpora.Dictionary(cleaned_corpus)
vecs = [dictionary.doc2bow(document) for document in cleaned_corpus]

from gensim import models
tfidf = models.TfidfModel(vecs)
tfidf_vecs = tfidf[vecs]

adding document #0 to Dictionary(0 unique tokens: [])
built Dictionary(13999 unique tokens: ['inexecr', 'alway', 'cockl', 'prief', 'immodest']...) from 934 documents (total 445999 corpus positions)
collecting document frequencies
PROGRESS: processing document #0
calculating IDF weights for 934 documents and 13998 features (238502 matrix non-zeros)


In [47]:
from gensim.models.ldamodel import LdaModel
lda = LdaModel(corpus=tfidf_vecs,id2word=dictionary,num_topics=3,update_every=0,passes=1)

using symmetric alpha at 0.3333333333333333
using symmetric eta at 0.3333333333333333
using serial LDA version on this node
running batch LDA training, 3 topics, 1 passes over the supplied corpus of 934 documents, updating model once every 934 documents, evaluating perplexity every 934 documents, iterating 50x with a convergence threshold of 0.001000


too few updates, training might not converge; consider increasing the number of passes or iterations to improve accuracy


-23.043 per-word bound, 8644595.3 perplexity estimate based on a held-out corpus of 934 documents with 9523 words
PROGRESS: pass 0, at document #934/934
topic #0 (0.333): 0.001*"sir" + 0.001*"lord" + 0.001*"and" + 0.001*"madam" + 0.001*"thou" + 0.001*"love" + 0.001*"know" + 0.001*"king" + 0.001*"timon" + 0.001*"well"
topic #1 (0.333): 0.002*"thou" + 0.002*"and" + 0.001*"thee" + 0.001*"thi" + 0.001*"the" + 0.001*"shall" + 0.001*"come" + 0.001*"sir" + 0.001*"love" + 0.001*"that"
topic #2 (0.333): 0.002*"lord" + 0.002*"thi" + 0.001*"thou" + 0.001*"sir" + 0.001*"and" + 0.001*"king" + 0.001*"thee" + 0.001*"shall" + 0.001*"good" + 0.001*"come"
topic diff=1.328788, rho=1.000000


<font color=#13577F face='arial black'>getting the topics</font>

In [48]:
topics=lda.top_topics(tfidf_vecs)

In [49]:
topics[0][0]

[(0.0019144836857934065, 'thou'),
 (0.0015290445193528049, 'and'),
 (0.0012956516811382514, 'thee'),
 (0.0010381593824981712, 'thi'),
 (0.00099496145255033265, 'the'),
 (0.00098808924946222902, 'shall'),
 (0.00095541572018764383, 'come'),
 (0.0009503643707128906, 'sir'),
 (0.00089552143630192203, 'love'),
 (0.00087839237844720638, 'that'),
 (0.00085936672048703849, 'well'),
 (0.00085721788953281315, 'good'),
 (0.00085664173218580331, 'you'),
 (0.00084361326820827579, 'king'),
 (0.00083619940634704199, 'let'),
 (0.0008273345372338724, 'like'),
 (0.00080057799214420741, 'would'),
 (0.00078966577203003782, 'but'),
 (0.00076652829510123277, 'honour'),
 (0.0007508148135508978, 'lord')]

<font color=#13577F face='arial black'>getting the scores</font>

In [50]:
scores=list(lda[tfidf_vecs])

In [51]:
scores[0]

[(0, 0.22855052802710862), (1, 0.19884786266769566), (2, 0.57260160930519566)]

In [52]:
from collections import defaultdict

def dict_to_tuple(n,t):
    d = defaultdict(float,t)
    return [d[key] for key in range(n)]

scores=list(map(lambda t: dict_to_tuple(3,t),scores))

In [53]:
scores[2]

[0.80324393575143471, 0.099343009306734356, 0.097413054941831029]
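
<font color=#505050 size=3 face='arial black'>
<p>Once every document has a dense list of topic weights, a common next step is to assign each document to its highest-weighted topic. The sketch below uses made-up, rounded weights in the same dense format; <code>example_scores</code> and <code>dominant_topic</code> are hypothetical names, not part of gensim.</p>
</font>

```python
from collections import Counter

# Hypothetical, rounded document-topic weights in the dense format produced above.
example_scores = [
    [0.23, 0.20, 0.57],
    [0.10, 0.80, 0.10],
    [0.80, 0.10, 0.10],
    [0.15, 0.15, 0.70],
]

def dominant_topic(weights):
    """Index of the topic with the largest weight in one document."""
    return max(range(len(weights)), key=weights.__getitem__)

assignments = [dominant_topic(w) for w in example_scores]
print(assignments)
print(Counter(assignments))
```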

<font color=#505050 size=3 face='arial black'>
<p><font color=#13577F size=4 face='arial black'> Topic Models: scikit-learn</font></p>
</font>

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer 

vectorizer = TfidfVectorizer(ngram_range=(1, 3),
                             token_pattern=r'\b\w+\b',
                             max_features=10000,
                             min_df=10,max_df=0.3)

transformed_articles = vectorizer.fit_transform(corpus)

<font color=#505050 size=3 face='arial black'>
<p><font color=#13577F size=4 face='arial black'> How to Transform the scikit-learn vectorizer output for use in gensim</font></p>
</font>

In [None]:
import gensim

tfidf_vecs_scikit = gensim.matutils.Sparse2Corpus(transformed_articles, documents_columns=False)
dictionary_scikit = dict((v, k) for k, v in vectorizer.vocabulary_.items())

<font color=#505050 size=3 face='arial black'>
<p><font color=#13577F size=4 face='arial black'> Fitting LDA in scikit-learn</font></p>
</font>

In [None]:
from sklearn.decomposition import LatentDirichletAllocation
#n_topics was renamed to n_components in newer versions of scikit-learn
lda = LatentDirichletAllocation(n_components=3,doc_topic_prior=10)
lda.fit(transformed_articles)
scores_scikit = lda.transform(transformed_articles)

<font color=#505050 size=3 face='arial black'>
<p><font color=#13577F size=4 face='arial black'> Looking at results</font></p>
</font>

In [None]:
def print_top_words(model, feature_names, n_top_words):
    for topic_idx, topic in enumerate(model.components_):
        print("Topic #%d:" % topic_idx)
        print(" | ".join([feature_names[i]+","+str(topic[i])
                        for i in topic.argsort()[:-n_top_words - 1:-1]]))
        print()

print("\nTopics in LDA model:")
#recent scikit-learn versions use get_feature_names_out(); older ones used get_feature_names()
tf_feature_names = vectorizer.get_feature_names_out()
print_top_words(lda, tf_feature_names, 5)

In [None]:
##https://github.com/scikit-learn/scikit-learn/issues/6353



<font color=#505050 size=3 face='arial black'>

<p><font color=#13577F size=4 face='arial black'> Categories of bad topics</font></p>

<p>
 <ul>
  <li> general and specific words e.g. cat, rabbit, conspecific, heterogeneous </li>
  <li> mixed topics e.g. dog, cat, bird, honda, chevrolet, bmw </li>
  <li> chained topics e.g. reagan, roosevelt, clinton, lincoln, honda, chevrolet, bmw </li>
  <li> identical topics </li>
  <li> cluttered topics e.g. vii, vi, x, xiv, xviii </li>
  <li> nonsensical topics </li>
</ul>
</p>
</font>
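
<font color=#505050 size=3 face='arial black'>
<p>Some of these problems can be screened for automatically. As one example, identical or near-identical topics can be flagged by comparing the sets of top words. The sketch below uses Jaccard overlap; the word lists and the threshold of 0.5 are made-up assumptions, and a flagged pair still needs a human look.</p>
</font>

```python
from itertools import combinations

# Hypothetical top-word lists for three topics.
top_words = {
    0: ['thou', 'thee', 'thi', 'shall', 'come'],
    1: ['thou', 'thee', 'thi', 'shall', 'love'],
    2: ['king', 'lord', 'honour', 'crown', 'war'],
}

def jaccard(a, b):
    """Size of the intersection divided by the size of the union."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

def near_duplicates(top_words, threshold=0.5):
    """Pairs of topic ids whose top words overlap at least as much as the threshold."""
    return [(i, j) for i, j in combinations(sorted(top_words), 2)
            if jaccard(top_words[i], top_words[j]) >= threshold]

duplicates = near_duplicates(top_words)
print(duplicates)
```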

<font color=#505050 size=3 face='arial black'>

<p><font color=#13577F size=4 face='arial black'> Evaluating topics</font></p>

<p>
 <ul>
  <li> topic size (total probability of a topic across documents) </li>
  <li> average word length of top words </li>
  <li> topic prominence </li>
</ul>
</p>
</font>
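
<font color=#505050 size=3 face='arial black'>
<p>The first two measures are simple to compute. The sketch below uses made-up document-topic weights and top-word lists; the function names are hypothetical. Topic size is each topic's share of the total weight across documents, and a very short average word length can hint at a cluttered topic.</p>
</font>

```python
# Hypothetical document-topic weights (rows: documents, columns: topics).
doc_topic = [
    [0.2, 0.8],
    [0.6, 0.4],
    [0.1, 0.9],
]
# Hypothetical top words for the two topics.
top_words = [['vii', 'vi', 'x'], ['king', 'crown', 'honour']]

def topic_sizes(doc_topic):
    """Share of the total topic weight that each topic receives."""
    totals = [sum(column) for column in zip(*doc_topic)]
    grand_total = sum(totals)
    return [t / grand_total for t in totals]

def average_word_length(words):
    """Mean number of characters per word."""
    return sum(len(w) for w in words) / len(words)

sizes = topic_sizes(doc_topic)
lengths = [average_word_length(words) for words in top_words]
print(sizes)
print(lengths)
```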

In [58]:
%matplotlib inline
from collections import Counter

import matplotlib.pyplot as plt
Counter(scores_scikit.argmax(axis=1)+1)


In [None]:
scores_scikit.sum(axis=0)/scores_scikit.sum()