<font color=#505050 size=3 face='arial black'>
<br><br><center><font color=#505050 size=14 face='arial black'>Topic Models in Python</font></center>  <br><br><center><font color=#13577F size=5 face='arial black'>Kareem Carr</font></center><br><br><br><br>

<p>Python is a powerful programming language that's especially well suited to text analysis. In this workshop, we will cover some of the state-of-the-art Python packages for processing text. These packages require a tolerance for non-intuitive interfaces and experimental, incomplete or imperfect features. However, if you stick with it, you can do many very cool things.</p>
</font>

<font color=#505050 size=3 face='arial black'>
<p><font color=#13577F size=4 face='arial black'> Outline of Latent Dirichlet Allocation-based Topic Modeling </font></p>

<p><font color=#13577F size=3 face='arial black'>Topic models</font> are statistical models for learning the hidden structure in document collections.</p>

<p><font color=#13577F size=3 face='arial black'>Latent Dirichlet allocation</font> and its extensions are a popular class of topic models (Blei et al. 2003). Each document is considered to be a mixture of topics. Each topic is defined by the probability of observing certain words. For example, the word 'ball' is more likely to be observed in a document whose topic is sport than in a document whose topic is politics. </p>

<br>

<figure>
    <img src='topics.png' alt='[missing figure]'/>
    <figcaption> <center> <font color=#505050 size=2 face='arial black'>Illustration of LDA topic model</font> </center> </figcaption>
</figure>

<br>

</font>
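<p>To make the "mixture of topics" idea concrete, here is a minimal sketch of the generative story behind LDA, using only the standard library. The two topics, their word probabilities and the prior <code>alpha</code> are made-up values for illustration, and the helper names are hypothetical. This is not how a fitting library works internally; it only shows how a document could be generated once topics are known.</p>

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one sample from a Dirichlet distribution via normalized gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topics, alpha, n_words, rng):
    """Generate one document under the LDA generative story."""
    # Per-document topic proportions (the "mixture of topics").
    theta = sample_dirichlet(alpha, rng)
    topic_names = list(topics)
    words = []
    for _ in range(n_words):
        # Pick a topic for this word, then pick a word from that topic.
        topic = rng.choices(topic_names, weights=theta)[0]
        vocab = list(topics[topic])
        probs = [topics[topic][w] for w in vocab]
        words.append(rng.choices(vocab, weights=probs)[0])
    return theta, words

# Hypothetical topics: each maps words to probabilities that sum to 1.
toy_topics = {
    'sport':    {'ball': 0.5, 'team': 0.3, 'vote': 0.1, 'law': 0.1},
    'politics': {'ball': 0.05, 'team': 0.15, 'vote': 0.4, 'law': 0.4},
}

rng = random.Random(0)
theta, words = generate_document(toy_topics, alpha=[0.5, 0.5], n_words=10, rng=rng)
print(theta)
print(words)
```

<p>Fitting a topic model runs this story in reverse: given only the words, it estimates the topics and each document's proportions.</p>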

<p><font color=#13577F size=4 face='arial black'> What kind of text do you have and where is it located?</font></p>

<p><font color=#13577F size=4 face='arial black'> Case Study: HTML </font></p>

In [None]:
from urllib.request import urlopen
wikipedia = urlopen('https://en.wikipedia.org/wiki/Python_(programming_language)')
wikipedia_html = wikipedia.read()
wikipedia_html[0:200]

In [None]:
#http://stackoverflow.com/questions/1936466/beautifulsoup-grab-visible-webpage-text

from bs4 import BeautifulSoup
import re

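from bs4.element import Comment

def visible(element):
    #skip text that lives inside non-rendered tags
    if element.parent.name in ['style', 'script', '[document]', 'head', 'title']:
        return False
    #skip HTML comments (BeautifulSoup represents them as Comment objects)
    if isinstance(element, Comment):
        return False
    return True


def clean_html(html):
    soup = BeautifulSoup(html, 'html.parser')
    #collect every text node, then keep only the visible ones
    texts = soup.find_all(string=True)
    visible_texts = filter(visible, texts)
    return " ".join(visible_texts)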

cleaned_html = clean_html(wikipedia_html)
print(cleaned_html[:200])

<p><font color=#13577F size=4 face='arial black'> Case Study: Trawling a folder for text files </font></p>

In [None]:
import os
current_dir = os.getcwd()
example_datasource = os.path.join(current_dir, 'example_datasource' )
print(current_dir)
print(example_datasource)

In [None]:
import os

text_files = []

for path,dirs, files in os.walk(example_datasource):
    for file in files:
        if file.endswith('.txt'):
            text_files.append(os.path.join(path,file))
            
print(text_files)

<font color=#505050 size=3 face='arial black'>
<p><font color=#13577F size=4 face='arial black'> How to Clean Text</font></p>

<p> Raw text often contains metadata such as the document name, chapter headings and speaker names (as in a play or transcript), along with boilerplate such as copyright or trademark declarations and advertisements.
</p>

<p> The solution to these challenges is a combination of specialized parsers, regular expressions and removing data from fixed locations. </p>

</font>
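<p>As a rough sketch of how these three techniques combine, the example below cleans a tiny made-up transcript. The text, the end marker and the patterns are all hypothetical; real files need patterns tailored to their own layout.</p>

```python
import re

# Hypothetical raw transcript: a header line, speaker labels, and a footer.
raw = """Title: A Toy Play
ACT I
ALICE: Where is the rabbit?
BOB: He went down the hole.
*** END OF TOY PLAY ***
Copyright notice goes here."""

# Fixed location: drop the first line (known to be the title header).
lines = raw.splitlines()[1:]

# Regular expression: cut everything from the end marker onwards.
text = "\n".join(lines)
text = re.split(r'\*\*\* END OF .*', text, maxsplit=1)[0]

# Regular expressions: remove act headings and leading speaker labels.
text = re.sub(r'^ACT [IVX]+$', '', text, flags=re.MULTILINE)
text = re.sub(r'^[A-Z]+: ', '', text, flags=re.MULTILINE)

# Keep only the non-empty lines that remain.
cleaned = [line for line in text.splitlines() if line.strip()]
print(cleaned)
```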

<p><font color=#13577F size=4 face='arial black'> 1. Inspect the files </font></p>

In [None]:
#read the whole file; the context manager closes it afterwards
with open('alice_in_wonderland.txt', 'r') as alice_file:
    raw_text = alice_file.read()

In [None]:
#starts with meta data
print(raw_text[:800])

In [None]:
#ends with legal language
print(raw_text[-1200:])

<p><font color=#13577F size=4 face='arial black'> 2. Extract the content you are interested in </font></p>

In [None]:
import re

#locate the first occurrence of <term> followed by a colon in the text and extract the data after the colon
def _get_term(term, text):
    result = re.search(term+':.*',text)
    if result:
        return result.group()[len(term)+1:].strip()
    else:
        return ''

author   = _get_term('Author',raw_text)
title    = _get_term('Title',raw_text)
language = _get_term('Language',raw_text)
        
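#split once on the Project Gutenberg start marker; everything before it is preamble
preamble, body = re.split(r'\*\*\* START OF THIS PROJECT GUTENBERG EBOOK.*', raw_text, maxsplit=1)

#the trailing 18765 characters of this particular file are treated as boilerplate
alice, boilerplate = body[:-18765], body[-18765:]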

In [None]:
#meta data
print(author,title,language,sep='\n')

In [None]:
print(alice[:500])

<p><font color=#13577F size=4 face='arial black'> 3. Standardize and Transform </font></p>

<font color=#505050 size=3 face='arial black'>
<ul>
<li>tokenization</li>
<li>stopword removal</li>
<li>normalization</li>
<li>collocation discovery </li>
</ul>

</font>

In [None]:
paragraph= """Alice was beginning to get very tired of sitting by her sister on the bank, and of having nothing to do: once or twice she had peeped into the book her sister was reading, but it had no pictures or conversations in it, ‘and what is the use of a book,’ thought Alice ‘without pictures or conversations?’"""
print(paragraph)

<p><font color=#13577F size=3 face='arial black'> Tokenization </font></p>

In [None]:
from nltk import RegexpTokenizer
tokenizer = RegexpTokenizer(r'\w+').tokenize
tokenized_text = tokenizer(paragraph)
print(tokenized_text)

<p><font color=#13577F size=3 face='arial black'> Stopword removal </font></p>

In [None]:
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')
stopword_list = list(stopwords.words('english'))
print(stopword_list)

In [None]:
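#the NLTK stopword list is lowercase, so compare lowercased tokens against it
stopword_set = set(stopword_list)
tokenized_text = [word for word in tokenized_text if word.lower() not in stopword_set]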
print(tokenized_text)

<p><font color=#13577F size=3 face='arial black'> More filtering </font></p>

In [None]:
def has_no_numbers(inputString):
    return not any(char.isdigit() for char in inputString)

def filter_numbers_from_word_list(word_list):
    return list(filter(has_no_numbers, word_list))

filter_numbers_from_word_list(['cat','k33p','dog','mouse','1221'])

<p><font color=#13577F size=3 face='arial black'> Extremely stringent filtering </font></p>

In [None]:
import pip
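#pip has no supported Python API; run it as a subprocess of the current interpreter instead
import subprocess, sys
subprocess.check_call([sys.executable, '-m', 'pip', 'install', 'pyenchant'])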

import enchant
eng = enchant.Dict("en_US")
eng.check("Apple")

list(filter(eng.check,['cat','k33p','dog','mouse','CGIS','Knafel','IQSS']))

<p><font color=#13577F size=3 face='arial black'> Normalization </font></p>

In [None]:
#simple normalization
words = ['President','PRESIDENT','president']
[word.lower() for word in words]

<font color=#505050 size=3 face='arial black'>

<p> <font color=#13577F size=3 face='arial black'> Stemming </font> is the process of reducing inflected words to their stem, base or root form. </p>

<p> <font color=#13577F size=3 face='arial black'> Lemmatisation </font> is the process of grouping together the different inflected forms of a word so they can be analysed as a single item.</p>

</font>
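<p>The difference between the two can be sketched with toy versions of each. The suffix list and the lemma dictionary below are hypothetical and far cruder than a real stemmer or a lexical database; they only illustrate that stemming chops strings by rule while lemmatisation looks words up.</p>

```python
# Toy suffix-stripping stemmer: NOT the Porter algorithm, just an illustration.
SUFFIXES = ['ing', 'ed', 'es', 's']

def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

# Toy lemma dictionary (hypothetical entries): maps inflected forms to a headword.
LEMMAS = {'ran': 'run', 'running': 'run', 'mice': 'mouse', 'better': 'good'}

def toy_lemmatize(word):
    """Look the word up in the lemma dictionary, falling back to the word itself."""
    return LEMMAS.get(word, word)

for word in ['running', 'ran', 'mice', 'walked']:
    print(word, toy_stem(word), toy_lemmatize(word))
```

<p>Note that the stem need not be a real word ('runn'), whereas a lemma always is, but only for words the dictionary knows about.</p>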

In [None]:
from nltk.stem.porter import PorterStemmer
porter_stemmer = PorterStemmer().stem
porter_stemmer('president')

In [None]:
import nltk
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer
wordnet_lemmatizer = WordNetLemmatizer().lemmatize

print(wordnet_lemmatizer('president'))

<font color=#505050 size=3 face='arial black'>
<p><font color=#13577F size=3 face='arial black'> Collocation discovery: n-grams </font></p>
</font>
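<p>As a rough sketch of what n-gram counting involves, the idea can be written with the standard library alone. The helper name <code>simple_ngrams</code> is hypothetical; n-grams that occur unusually often are candidate collocations.</p>

```python
from collections import Counter

def simple_ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples."""
    return list(zip(*(tokens[i:] for i in range(n))))

tokens = 'I had a cat I had a rat I had a bat'.split()
bigram_counts = Counter(simple_ngrams(tokens, 2))

# Frequent n-grams are candidate collocations.
print(bigram_counts.most_common(3))
```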

In [None]:
from nltk import bigrams, ngrams
tokens = tokenizer('I had a cat. I had a rat. I had a bat.')
bgs = bigrams(tokens)
bgs=list(bgs)

tgs = ngrams(tokens,n=3)
tgs = list(tgs)

In [None]:
print(bgs)

In [None]:
print(tgs)

In [None]:
nltk.FreqDist(bgs)

In [None]:
nltk.FreqDist(tgs)

<font color=#505050 size=3 face='arial black'>
<p><font color=#13577F size=4 face='arial black'> Topic Modeling </font></p>

<p>Our example today is a plain text file containing all the lines in each of Shakespeare's plays. It is in a specialized format in which the fields of each entry are delimited by semicolons. We are going to load the data into Python and convert it to a corpus in which each document consists of all the lines spoken by a particular character in any play.
</p>

</font>

<font color=#505050 size=3 face='arial black'>
<p><font color=#13577F size=3 face='arial black'> read data from file </font></p>
</font>

In [1]:
import json
import pandas as pd
shakespeare = pd.read_csv('will_play_text.data',delimiter=';',index_col=0,names=['play','act','index','speaker','dialogue'])
shakespeare.head(6)

Unnamed: 0,play,act,index,speaker,dialogue
1,Henry IV,,,,ACT I
2,Henry IV,,,,SCENE I. London. The palace.
3,Henry IV,,,,"Enter KING HENRY, LORD JOHN OF LANCASTER, the ..."
4,Henry IV,1.0,1.1.1,KING HENRY IV,"So shaken as we are, so wan with care,"
5,Henry IV,1.0,1.1.2,KING HENRY IV,"Find we a time for frighted peace to pant,"
6,Henry IV,1.0,1.1.3,KING HENRY IV,And breathe short-winded accents of new broils


<font color=#505050 size=3 face='arial black'>
<p><font color=#13577F size=3 face='arial black'> isolate the text for each speaker </font></p>
</font>

In [None]:
shakespeare = shakespeare[shakespeare['speaker'].notnull()]
shakespeare.head(6)

In [None]:
shakespeare = shakespeare[['speaker','dialogue']]
shakespeare.head(6)

<font color=#505050 size=3 face='arial black'>
<p><font color=#13577F size=3 face='arial black'> collect and convert to a list </font></p>
</font>

In [None]:
#join all of each speaker's lines into one long string
speech = shakespeare.groupby('speaker')['dialogue'].apply(" ".join)
speech.head()

In [None]:
characters = pd.DataFrame()
characters['dialogue']=speech
characters

In [None]:
corpus = characters['dialogue'].tolist()
print(corpus[0:2])

<font color=#505050 size=3 face='arial black'>
<p><font color=#13577F size=3 face='arial black'> process for topic modeling using the processes we discussed earlier </font></p>
</font>

In [None]:

def clean_and_tokenize(raw_text, custom_stopwords=None):
    if custom_stopwords is None:
        custom_stopwords = []

    #tokenization
    tokenizer = RegexpTokenizer(r'\w+').tokenize
    words = tokenizer(raw_text)

    #normalization (lowercase first so that capitalized stopwords are caught)
    words = [word.lower() for word in words]

    #filters

    #stopwords (build the set once rather than once per word)
    all_stopwords = set(custom_stopwords) | set(stopword_list)
    words = [word for word in words if word not in all_stopwords]

    #numbers
    words = filter_numbers_from_word_list(words)

    #normalization (stemming)
    porter_stemmer = PorterStemmer()
    words = [porter_stemmer.stem(word) for word in words]

    #drop very short tokens
    words = [word for word in words if len(word) > 2]

    return words

In [None]:
cleaned_corpus = list(map(clean_and_tokenize, corpus))

In [None]:
print(" ".join(cleaned_corpus[5]))

<font color=#505050 size=3 face='arial black'>
<p><font color=#13577F size=4 face='arial black'> Topic Models: gensim package</font></p>
</font>

In [None]:
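#pip has no supported Python API; run it as a subprocess of the current interpreter instead
import subprocess, sys
subprocess.check_call([sys.executable, '-m', 'pip', 'install', 'gensim'])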
import gensim

<font color=#505050 size=3 face='arial black'>

<p>The <font color=#13577F face='arial black'>term frequency</font> is, in the simplest case, the raw count of a term in a document. (High if a word occurs a lot in a particular document.)</p>

<p>The <font color=#13577F face='arial black'>inverse document frequency</font> is related to the inverse of the fraction of documents that contain the word. (High if the term occurs rarely.)</p>

<p>The <font color=#13577F face='arial black'>term frequency-inverse document frequency (tf-idf)</font> is just the product of these two measures. A term has a high tf-idf weight if it is a high frequency term in the given document and has a low document frequency in the whole collection of documents.</p>

<p>This transformation naturally downweights stopwords and words similar to stopwords.</p>

</font>
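<p>A small worked example may help. The sketch below computes tf-idf by hand on a made-up corpus, assuming raw counts for the term frequency and <code>log(N/df)</code> for the inverse document frequency. Libraries differ in the exact formula (log base, smoothing, normalization), so their numbers will generally not match these.</p>

```python
import math
from collections import Counter

# Toy corpus of already-tokenized documents (hypothetical data).
docs = [
    ['the', 'cat', 'sat'],
    ['the', 'dog', 'sat'],
    ['the', 'cat', 'ate', 'the', 'fish'],
]

def tf_idf(docs):
    """Return one {term: weight} dict per document, using raw tf and log(N/df)."""
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights

weights = tf_idf(docs)
print(weights[2])
```

<p>Here 'the' appears in every document, so its weight is zero even though it occurs twice in the last document, while 'fish' appears in only one document and gets the largest weight there.</p>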

In [None]:
from gensim import corpora
dictionary = corpora.Dictionary(cleaned_corpus)
vecs = [dictionary.doc2bow(document) for document in cleaned_corpus]

from gensim import models
tfidf = models.TfidfModel(vecs)
tfidf_vecs = tfidf[vecs]

In [None]:
from gensim.models.ldamodel import LdaModel
lda = LdaModel(corpus=tfidf_vecs,id2word=dictionary,num_topics=3,update_every=0,passes=1)

<font color=#13577F  size=3 face='arial black'>getting the topics</font>

In [None]:
topics=lda.top_topics(tfidf_vecs)

In [None]:
topics[0][0]

<font color=#13577F size=3 face='arial black'>getting the scores</font>

In [None]:
scores=list(lda[tfidf_vecs])

In [None]:
scores[0]

In [None]:
from collections import defaultdict

def dict_to_tuple(n,t):
    d = defaultdict(float,t)
    return [d[key] for key in range(n)]

scores=list(map(lambda t: dict_to_tuple(3,t),scores))

In [None]:
scores[2]

<font color=#505050 size=3 face='arial black'>
<p><font color=#13577F size=4 face='arial black'> Topic Models: scikit-learn</font></p>
</font>

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer 

vectorizer = TfidfVectorizer(ngram_range=(1, 3),
                             token_pattern=r'\b\w+\b',
                             max_features=10000,
                             min_df=10,max_df=0.3)

transformed_articles = vectorizer.fit_transform(corpus)

<font color=#505050 size=3 face='arial black'>
<p><font color=#13577F size=4 face='arial black'> How to Transform the scikit-learn vectorizer output for use in gensim</font></p>
</font>

In [None]:
import gensim

tfidf_vecs_scikit = gensim.matutils.Sparse2Corpus(transformed_articles, documents_columns=False)
dictionary_scikit = dict((v, k) for k, v in vectorizer.vocabulary_.items())

<font color=#505050 size=3 face='arial black'>
<p><font color=#13577F size=4 face='arial black'> Fitting LDA in scikit-learn</font></p>
</font>

In [None]:
from sklearn.decomposition import LatentDirichletAllocation
#n_topics was renamed to n_components in newer scikit-learn releases
lda = LatentDirichletAllocation(n_components=3,doc_topic_prior=10)
lda.fit(transformed_articles)
scores_scikit = lda.transform(transformed_articles)

<font color=#505050 size=3 face='arial black'>
<p><font color=#13577F size=4 face='arial black'> Looking at results</font></p>
</font>

In [None]:
def print_top_words(model, feature_names, n_top_words):
    for topic_idx, topic in enumerate(model.components_):
        print("Topic #%d:" % topic_idx)
        print(" | ".join([feature_names[i]+","+str(topic[i])
                        for i in topic.argsort()[:-n_top_words - 1:-1]]))
        print()

print("\nTopics in LDA model:")
#get_feature_names() was replaced by get_feature_names_out() in newer scikit-learn releases
tf_feature_names = vectorizer.get_feature_names_out()
print_top_words(lda, tf_feature_names, 5)

In [None]:
##https://github.com/scikit-learn/scikit-learn/issues/6353



<font color=#505050 size=3 face='arial black'>

<p><font color=#13577F size=4 face='arial black'> Categories of bad topics</font></p>

<p>
 <ul>
  <li> general and specific words e.g. cat, rabbit, conspecific, heterogenous </li>
  <li> mixed topics e.g. dog, cat, bird, honda, chevrolet, bmw </li>
  <li> chained topics e.g. reagan, roosevelt, clinton, lincoln, honda, chevrolet, bmw </li>
  <li> identical topics </li>
  <li> cluttered topics e.g. vii, vi, x, xiv, xviii </li>
  <li> non-sensical topics </li>
</ul>
</p>
</font>
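<p>Some of these categories can be flagged automatically. As one hedged example, near-identical topics can be detected by measuring how much their top words overlap. The top-word lists and the 0.5 threshold below are hypothetical, and Jaccard similarity is just one reasonable choice of overlap measure.</p>

```python
from itertools import combinations

def jaccard(a, b):
    """Jaccard similarity between two collections of words."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

# Hypothetical top words for three fitted topics.
top_words = {
    0: ['king', 'lord', 'crown', 'england', 'duke'],
    1: ['love', 'heart', 'sweet', 'night', 'fair'],
    2: ['king', 'lord', 'crown', 'france', 'duke'],
}

# Report topic pairs whose top words overlap heavily (threshold is arbitrary).
suspects = [(i, j) for i, j in combinations(top_words, 2)
            if jaccard(top_words[i], top_words[j]) >= 0.5]
print(suspects)
```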

<font color=#505050 size=3 face='arial black'>

<p><font color=#13577F size=4 face='arial black'> Evaluating topics</font></p>

<p>
 <ul>
  <li> topic size (total probability of a topic across documents) </li>
  <li> average word length of top words </li>
  <li> topic prominence </li>
</ul>
</p>
</font>
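<p>The three measures can be sketched on a small made-up example using plain Python lists. The score matrix and top words are hypothetical; topic prominence is taken here to mean the number of documents in which a topic has the highest score.</p>

```python
# Hypothetical document-topic scores: one row per document, one column per topic.
doc_topic = [
    [0.7, 0.2, 0.1],
    [0.6, 0.3, 0.1],
    [0.1, 0.1, 0.8],
    [0.5, 0.4, 0.1],
]
# Hypothetical top words per topic.
top_words = [['king', 'crown', 'england'], ['love', 'heart', 'sweet'], ['vi', 'x', 'xiv']]

n_topics = len(top_words)
total = sum(sum(row) for row in doc_topic)

# Topic size: share of total probability mass assigned to each topic.
topic_size = [sum(row[k] for row in doc_topic) / total for k in range(n_topics)]

# Average word length of top words: very short words often signal clutter.
avg_word_length = [sum(len(w) for w in words) / len(words) for words in top_words]

# Topic prominence: number of documents for which the topic scores highest.
prominence = [0] * n_topics
for row in doc_topic:
    prominence[row.index(max(row))] += 1

print(topic_size, avg_word_length, prominence)
```

<p>In this toy data the second topic is never the top topic for any document, and the third topic's very short top words suggest clutter such as Roman numerals.</p>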

In [None]:
#topic size
scores_scikit.sum(axis=0)/scores_scikit.sum()

In [None]:
#topic prominence
%matplotlib inline
from collections import Counter

import matplotlib.pyplot as plt
Counter(scores_scikit.argmax(axis=1)+1)

<font color=#505050 size=3 face='arial black'>

<p>
Topic modeling requires continual refinement of the filtering process. You will typically generate the topics, look at the results and repeat the process until you find good results.
</p>

<br>
<br>


</font>

<font color=#505050 size=3 face='arial black'>

<p><font color=#13577F size=8 face='arial black'> Thanks!</font></p>

</font>