# NLP Topic Modeling


This notebook covers topic modeling in Natural Language Processing (NLP). Topic modeling identifies the major themes of a document and represents each topic as a group of weighted words. The library used here is gensim, and the method is primarily Latent Dirichlet Allocation (LDA). The notebook has two halves: the first covers data retrieval, cleanup, and organization, and the second covers the topic modeling itself. Most of the code in the second half comes from the following tutorial, which does a great job of explaining core gensim concepts: https://www.machinelearningplus.com/nlp/topic-modeling-gensim-python/



In [2]:
import nltk
import pickle

import gensim
from gensim.models.word2vec import Word2Vec
from gensim.models.phrases import Phraser, Phrases
from gensim import corpora
from gensim.utils import simple_preprocess
from gensim.models import CoherenceModel
from pprint import pprint

import pandas as pd
import requests
import string

from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
from nltk.tag import StanfordNERTagger
from sklearn.manifold import TSNE
from bokeh.io import output_notebook, output_file
from bokeh.plotting import show, figure
from bs4 import BeautifulSoup

# spacy for lemmatization
import spacy

# Plotting tools
!pip install pyLDAvis
import pyLDAvis
import pyLDAvis.gensim  # don't skip this
import matplotlib.pyplot as plt
%matplotlib inline

# Enable logging for gensim - optional
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.ERROR)

import warnings
warnings.filterwarnings("ignore",category=DeprecationWarning)




In [2]:
nltk.download('stopwords')
nltk.download('punkt')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

## Retrieve Dataset

In [0]:
# For text datasets, I often use SEC EDGAR filings of US-based public companies. For demonstration purposes, I keep the list limited to 1-3 such documents.
urls = []
urls.append("https://www.sec.gov/Archives/edgar/data/70858/000007085818000042/bac-930201810xq.htm") # BAC:10Q: 20183Q

"""
urls.append("https://www.sec.gov/Archives/edgar/data/886982/000119312518056383/d480167d10k.htm")
urls.append("https://www.sec.gov/Archives/edgar/data/72971/000007297119000227/wfc-12312018x10k.htm") # WFC: 10K: 2018
urls.append("https://www.sec.gov/Archives/edgar/data/78003/000007800318000091/pfe-09302018x10q.htm") # PFE : 10Q : 3Q2018
urls.append("https://www.sec.gov/Archives/edgar/data/886982/000119312517056804/d308759d10k.htm")
urls.append("https://www.sec.gov/Archives/edgar/data/895421/000119312517059212/d328282d10k.htm")
urls.append("https://www.sec.gov/Archives/edgar/data/895421/000119312518060831/d500533d10k.htm")
tgtUrl = 'https://www.sec.gov/Archives/edgar/data/886982/000119312519050198/d669877d10k.htm'
"""
# Retrieve the HTML pages for URLs
pages = ''
for url in urls:
  pages += requests.get(url).text

In [0]:
# Parse the HTML with BeautifulSoup. Fall back to 'html.parser' if lxml has problems.
soup = BeautifulSoup(pages, "lxml")  

In [0]:
# Find all 'div' and 'p' tags as these are the ones that contain data in our documents. Maintain the order of text.
tagTypes = ['div', 'p']
tags = soup.find_all(tagTypes)

In [0]:
# Retrieve the plain-text from HTML tags.
origTxt = ''
for t in tags:
    origTxt += t.text
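
The sentence output further down contains fused words such as `STATESSECURITIES`, most likely because the text of adjacent tags is concatenated with no separator. The following is a minimal sketch of the effect, not a change to the notebook's pipeline; the `fragments` list is a hypothetical stand-in for the `.text` of consecutive tags.

```python
# Hypothetical fragments standing in for the .text of consecutive HTML tags.
fragments = [
    "UNITED STATES",
    "SECURITIES AND EXCHANGE COMMISSION",
    "Washington, D.C. 20549",
]

# Concatenating with no separator fuses the last word of one fragment
# with the first word of the next.
glued = "".join(fragments)

# Joining with a single space keeps the word boundaries intact.
spaced = " ".join(fragments)

print(glued)
print(spaced)
```

With BeautifulSoup, a comparable effect can probably be had by calling `get_text(separator=" ")` on each tag instead of reading `.text`, at the cost of changing the token counts shown later in this notebook.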


## Data Cleanup

Now we have the raw text. We need to clean it up by removing stop words, punctuation, and other common trivial patterns.

The text contains several "\xa0" (non-breaking space) characters, which need to be replaced with regular spaces. Start the cleanup with these characters. Refer to:

https://stackoverflow.com/questions/10993612/python-removing-xa0-from-string
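
As a small illustration of this step, the sketch below compares the targeted `replace` used in this notebook with Unicode NFKC normalization, which is an alternative and not something the notebook itself uses. The sample string is made up for the example.

```python
import unicodedata

# Sample string with non-breaking spaces (\xa0) between words.
raw = "Bank\xa0of\xa0America Corporation"

# Option 1: the targeted replacement used in this notebook.
replaced = raw.replace("\xa0", " ")

# Option 2: NFKC normalization maps the non-breaking space (and other
# compatibility characters) to their plain equivalents in one pass.
normalized = unicodedata.normalize("NFKC", raw)

print(replaced)
print(replaced == normalized)
```

The targeted `replace` touches only `\xa0`; NFKC also rewrites other compatibility characters (ligatures, full-width forms), which may or may not be wanted for a given corpus.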

In [0]:
intermediateTxt = origTxt.replace(u'\xa0', u' ')

Now, clean stop words.

In [8]:
import itertools
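# stop_words (list, with a few extra words) is used later by remove_stopwords().
stop_words = stopwords.words('english')
stop_words.extend(['from', 'subject', 're', 'edu', 'use'])
# stopWords (set, with punctuation added) is used for the token filtering below.
stopWords = set(stopwords.words('english') + list(string.punctuation))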

for i, val in enumerate(itertools.islice(stopWords, 10)): # print sample from stopWords to ensure the set is populated
  print(val) 


a
itself
of
didn't
_
was
by
you'd
most
their


In [9]:
intermediateTokens = nltk.word_tokenize(intermediateTxt)
len(intermediateTokens)

226838

In [10]:
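# Remove stop words. Note: the membership test runs before lowercasing,
# so capitalized stop words such as "The" are not removed here.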
cleanTokens = []
for w in intermediateTokens:
    if w not in stopWords:
        cleanTokens.append(w.lower())
        cleanTokens.append(' ') # Need to append a single space for cases where words are losing space in between

len(cleanTokens)

292566

In [0]:
# Drop the spacer tokens, then join the remaining tokens with single spaces.
cleanedTxtLst = [token for token in cleanTokens if token != ' ']
cleanedTxt = ' '.join(cleanedTxtLst)
# cleanedTxtLst now contains the individual words or tokens

It is important to tokenize first and then match each token against the stop-word list. If we instead strip stop words and punctuation from the entire string in one pass, we lose information. Consider the token "I.R.S": stripping punctuation from the whole string removes its internal dots, whereas tokenizing first means the comparison is made against the entire token "I.R.S", so the dots survive. This is one simple example, but I have seen better results when stop-word removal is done after tokenization. If you need a single string afterwards, simply join the tokens in the list.
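
The difference between the two approaches can be sketched as follows. This is a simplified illustration: `str.split` stands in for `nltk.word_tokenize`, and the tiny stop-word set is an assumption made for the example.

```python
import string

text = "the I.R.S reviewed the filing of the bank ."

# Tiny illustrative stop-word set plus all punctuation characters.
stop = {"the", "of"} | set(string.punctuation)

# Approach 1: strip punctuation characters from the whole string.
# The dots inside "I.R.S" are removed along with everything else.
stripped = text.translate(str.maketrans("", "", string.punctuation))

# Approach 2: tokenize first, then compare whole tokens.
# "I.R.S" is not itself a punctuation token, so it is kept intact.
tokens = text.split()
kept = [t for t in tokens if t.lower() not in stop]

print(stripped)
print(kept)
```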

## Clean and Arrange Original Text into a List of Sentences

Thus far, we have tokenized the input into individual tokens or words. However, it is also important to tokenize the input by sentence. Tokenization by sentence helps surface phrases, sentiment, and similar features that cannot be derived efficiently from individual words.

In [0]:
# Start with original text scraped from web resources. Check that it wasn't inadvertently modified.
# origTxt

In [0]:
intermediateTxt = origTxt.replace(u'\xa0', u' ')

In [14]:
# Tokenize by sentence. Notice that stop words and punctuation have not been removed yet in this cell.
sents_tokenized = sent_tokenize(intermediateTxt)
sents_tokenized[0:10]

['               UNITED STATESSECURITIES AND EXCHANGE COMMISSIONWashington, D.C. 20549FORM 10-Q(Mark One)[ü] QUARTERLY REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIESEXCHANGE ACT OF 1934For the Quarterly Period Ended September 30, 2018 or[   ] TRANSITION REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIESEXCHANGE ACT OF 1934For the transition period from          toCommission file number:1-6523Exact name of registrant as specified in its charter:Bank of America CorporationState or other jurisdiction of incorporation or organization:DelawareIRS Employer Identification No.',
 ':56-0906609Address of principal executive offices:Bank of America Corporate Center100 N. Tryon StreetCharlotte, North Carolina 28255Registrant’s telephone number, including area code:(704) 386-5681Former name, former address and former fiscal year, if changed since last report:Indicate by check mark whether the registrant (1) has filed all reports required to be filed by Section 13 or 15(d) of the Sec

So sentences are recognized. We still need to do some data cleaning here as well. For that, we tokenize each sentence, remove punctuation at the token level, rebuild the sentence from the remaining tokens, and finally collect the cleaned sentences in a list.

In [15]:
sents_ClnTknzd = []
punctuations = list(string.punctuation) # only remove punctuations. Keep stop words for phrases and un-abbreviated forms. Don't lose "of" in US of A for example. 

for sent in sents_tokenized:
  tempStr = ''
  tempTokens = nltk.word_tokenize(sent)
  for token in tempTokens:
    if token not in punctuations:
        tempStr += (token)
        tempStr += ' '
        #cleanTokens.append(' ') # Need to append a single space for cases where words are losing space in between
  
  sents_ClnTknzd.append(tempStr.strip())

sents_ClnTknzd[0:10]

['UNITED STATESSECURITIES AND EXCHANGE COMMISSIONWashington D.C. 20549FORM 10-Q Mark One ü QUARTERLY REPORT PURSUANT TO SECTION 13 OR 15 d OF THE SECURITIESEXCHANGE ACT OF 1934For the Quarterly Period Ended September 30 2018 or TRANSITION REPORT PURSUANT TO SECTION 13 OR 15 d OF THE SECURITIESEXCHANGE ACT OF 1934For the transition period from toCommission file number:1-6523Exact name of registrant as specified in its charter Bank of America CorporationState or other jurisdiction of incorporation or organization DelawareIRS Employer Identification No',
 ':56-0906609Address of principal executive offices Bank of America Corporate Center100 N. Tryon StreetCharlotte North Carolina 28255Registrant ’ s telephone number including area code 704 386-5681Former name former address and former fiscal year if changed since last report Indicate by check mark whether the registrant 1 has filed all reports required to be filed by Section 13 or 15 d of the Securities Exchange Act of 1934 during the pre

## Topic Modeling

Now, we have clean data. Next we format it so that gensim can consume it and generate data for topic modeling. Code from here onwards has largely been borrowed from https://www.machinelearningplus.com/nlp/topic-modeling-gensim-python/.  

In [16]:
def sent_to_words(sentences):
    for sentence in sentences:
        yield(gensim.utils.simple_preprocess(str(sentence), deacc=True))  # deacc=True strips accent marks; the tokenizer itself already drops punctuation

data_words = list(sent_to_words(sents_ClnTknzd))

print(data_words[:1])

[['united', 'and', 'exchange', 'form', 'mark', 'one', 'quarterly', 'report', 'pursuant', 'to', 'section', 'or', 'of', 'the', 'act', 'of', 'for', 'the', 'quarterly', 'period', 'ended', 'september', 'or', 'transition', 'report', 'pursuant', 'to', 'section', 'or', 'of', 'the', 'act', 'of', 'for', 'the', 'transition', 'period', 'from', 'tocommission', 'file', 'number', 'exact', 'name', 'of', 'registrant', 'as', 'specified', 'in', 'its', 'charter', 'bank', 'of', 'america', 'or', 'other', 'jurisdiction', 'of', 'incorporation', 'or', 'organization', 'delawareirs', 'employer', 'identification', 'no']]


In [17]:
# Build the bigram and trigram models
bigram = gensim.models.Phrases(data_words, min_count=5, threshold=100) # higher threshold fewer phrases.
trigram = gensim.models.Phrases(bigram[data_words], threshold=100)  

# Faster way to get a sentence clubbed as a trigram/bigram
bigram_mod = gensim.models.phrases.Phraser(bigram)
trigram_mod = gensim.models.phrases.Phraser(trigram)

# See trigram example
print(trigram_mod[bigram_mod[data_words[0]]])



['united', 'and', 'exchange', 'form', 'mark', 'one', 'quarterly', 'report', 'pursuant', 'to', 'section', 'or', 'of', 'the', 'act', 'of', 'for', 'the', 'quarterly', 'period', 'ended', 'september', 'or', 'transition', 'report', 'pursuant', 'to', 'section', 'or', 'of', 'the', 'act', 'of', 'for', 'the', 'transition', 'period', 'from', 'tocommission', 'file', 'number', 'exact', 'name', 'of', 'registrant', 'as', 'specified', 'in', 'its', 'charter', 'bank', 'of', 'america', 'or', 'other', 'jurisdiction', 'of', 'incorporation', 'or', 'organization', 'delawareirs', 'employer', 'identification', 'no']
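
To give a feel for what `min_count` and `threshold` do, here is a minimal sketch of the default bigram scoring formula as I understand it from the gensim documentation. It is a plain-Python illustration, not gensim's code, and all counts below are hypothetical.

```python
def original_score(count_a, count_b, count_ab, vocab_size, min_count=5):
    """Score a candidate bigram; a higher score means a stronger collocation."""
    return (count_ab - min_count) / count_a / count_b * vocab_size

# Hypothetical counts, chosen only to illustrate the effect of the threshold.
vocab_size = 5000

# Two rare words that almost always appear together.
rare_pair = original_score(count_a=12, count_b=10, count_ab=9, vocab_size=vocab_size)

# Two frequent words that only sometimes appear together.
common_pair = original_score(count_a=900, count_b=700, count_ab=60, vocab_size=vocab_size)

threshold = 100
print(round(rare_pair, 1), rare_pair > threshold)
print(round(common_pair, 2), common_pair > threshold)
```

Under this formula, pairs of rare words that nearly always co-occur score high, while pairs of frequent words score low, which is consistent with the first sentence above showing no joined phrases at `threshold=100`.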


In [0]:
# Define functions for stopwords, bigrams, trigrams and lemmatization
def remove_stopwords(texts):
    return [[word for word in simple_preprocess(str(doc)) if word not in stop_words] for doc in texts]

def make_bigrams(texts):
    return [bigram_mod[doc] for doc in texts]

def make_trigrams(texts):
    return [trigram_mod[bigram_mod[doc]] for doc in texts]

def lemmatization(texts, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV']):
    """https://spacy.io/api/annotation"""
    texts_out = []
    for sent in texts:
        doc = nlp(" ".join(sent)) 
        texts_out.append([token.lemma_ for token in doc if token.pos_ in allowed_postags])
    return texts_out

In [19]:
# Remove Stop Words
data_words_nostops = remove_stopwords(data_words)

# Form Bigrams
data_words_bigrams = make_bigrams(data_words_nostops)

# Initialize spacy 'en' model, keeping only tagger component (for efficiency)
# python3 -m spacy download en
nlp = spacy.load('en', disable=['parser', 'ner'])

# Do lemmatization keeping only noun, adj, vb, adv
data_lemmatized = lemmatization(data_words_bigrams, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV'])

print(data_lemmatized[:1])

[['united', 'exchange', 'form', 'mark', 'quarterly', 'report', 'pursuant', 'section', 'act', 'quarterly', 'period', 'end', 'september', 'transition', 'report', 'pursuant', 'section', 'act', 'transition', 'period', 'tocommission', 'file', 'number', 'exact', 'name', 'registrant', 'specify', 'charter', 'bank', 'america', 'jurisdiction', 'incorporation', 'organization', 'delawareir', 'employer', 'identification']]


In [20]:
# Create Dictionary
id2word = corpora.Dictionary(data_lemmatized)

# Create Corpus
texts = data_lemmatized

# Term Document Frequency
corpus = [id2word.doc2bow(text) for text in texts]

# View
print(corpus[:1])

[[(0, 2), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1), (8, 1), (9, 1), (10, 1), (11, 1), (12, 1), (13, 1), (14, 1), (15, 1), (16, 1), (17, 1), (18, 2), (19, 2), (20, 2), (21, 1), (22, 2), (23, 2), (24, 1), (25, 1), (26, 1), (27, 2), (28, 1)]]


In [21]:
# Human readable format of corpus (term-frequency)
[[(id2word[id], freq) for id, freq in cp] for cp in corpus[:1]]

[[('act', 2),
  ('america', 1),
  ('bank', 1),
  ('charter', 1),
  ('delawareir', 1),
  ('employer', 1),
  ('end', 1),
  ('exact', 1),
  ('exchange', 1),
  ('file', 1),
  ('form', 1),
  ('identification', 1),
  ('incorporation', 1),
  ('jurisdiction', 1),
  ('mark', 1),
  ('name', 1),
  ('number', 1),
  ('organization', 1),
  ('period', 2),
  ('pursuant', 2),
  ('quarterly', 2),
  ('registrant', 1),
  ('report', 2),
  ('section', 2),
  ('september', 1),
  ('specify', 1),
  ('tocommission', 1),
  ('transition', 2),
  ('united', 1)]]
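
The bag-of-words representation above can be sketched in plain Python. This is an illustration of the idea only: the id assignment (sorted unique tokens of a single document) is an assumption for the example, and a real `corpora.Dictionary` builds its ids across the whole corpus.

```python
from collections import Counter

# A shortened document, using tokens from the lemmatized output above.
doc = ["quarterly", "report", "pursuant", "section", "act",
       "quarterly", "period", "report", "act"]

# Assumption for this sketch: ids follow the sorted order of unique tokens.
token2id = {tok: i for i, tok in enumerate(sorted(set(doc)))}

# Count each token, then express the document as (token_id, frequency) pairs.
counts = Counter(doc)
bow = sorted((token2id[tok], n) for tok, n in counts.items())

# Human-readable view, analogous to the cell above.
readable = [(tok, counts[tok]) for tok in sorted(counts)]

print(bow)
print(readable)
```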

In [0]:
# Build LDA model
lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus,
                                           id2word=id2word,
                                           num_topics=20, 
                                           random_state=100,
                                           update_every=1,
                                           chunksize=100,
                                           passes=50,
                                           alpha='auto',
                                           per_word_topics=True)

In [23]:
# Print the top 10 keywords for each topic
pprint(lda_model.print_topics())
doc_lda = lda_model[corpus]

[(0,
  '0.246*"loan" + 0.076*"commercial" + 0.062*"mortgage" + 0.060*"consumer" + '
  '0.053*"non" + 0.042*"credit" + 0.040*"month" + 0.040*"portfolio" + '
  '0.037*"estate" + 0.036*"real"'),
 (1,
  '0.184*"basis" + 0.167*"percent" + 0.108*"fte" + 0.050*"equal" + '
  '0.031*"subprime" + 0.028*"greater" + 0.027*"great" + 0.024*"effective" + '
  '0.024*"march" + 0.019*"embed"'),
 (2,
  '0.063*"section" + 0.053*"taxonomy_extension" + 0.049*"act" + '
  '0.044*"certification" + 0.042*"sarbanes_oxley" + 0.042*"linkbase_document" '
  '+ 0.042*"officer_pursuant" + 0.028*"estimate" + 0.026*"payment" + '
  '0.025*"weight"'),
 (3,
  '0.073*"year" + 0.061*"represent" + 0.051*"matter" + 0.044*"day" + '
  '0.043*"enter" + 0.038*"expect" + 0.036*"measure" + 0.032*"result" + '
  '0.030*"current" + 0.028*"generally"'),
 (4,
  '0.087*"valuation" + 0.086*"market" + 0.079*"activity" + 0.036*"realize" + '
  '0.035*"position" + 0.029*"pricing" + 0.029*"volatility" + 0.026*"severity" '
  '+ 0.024*"disclosure

In [24]:
# Compute Perplexity
print('\nPerplexity: ', lda_model.log_perplexity(corpus))  # log_perplexity returns the per-word likelihood bound, not the perplexity itself; perplexity = 2**(-bound), and lower perplexity is better.

# Compute Coherence Score
coherence_model_lda = CoherenceModel(model=lda_model, texts=data_lemmatized, dictionary=id2word, coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()
print('\nCoherence Score: ', coherence_lda)


Perplexity:  -6.053528584742239

Coherence Score:  0.45001237881600514
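
The value printed as "Perplexity" is negative because, per the gensim documentation, `log_perplexity` returns a per-word likelihood bound and the perplexity is `2 ** (-bound)`. A short sketch of the conversion, using the number printed above:

```python
# Per-word likelihood bound printed by lda_model.log_perplexity(corpus) above.
bound = -6.053528584742239

# Conversion described in the gensim documentation: perplexity = 2^(-bound).
perplexity = 2 ** (-bound)

print(f"Perplexity: {perplexity:.1f}")
```

The bound is easier to compare across runs on the same corpus than to interpret in isolation; a bound closer to zero corresponds to a lower perplexity.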


In [25]:
# Visualize the topics
pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim.prepare(lda_model, corpus, id2word)
vis



The next step is to present this information in an end-user-friendly way on a web page. I have a site and web pages where this information can be integrated seamlessly, but the challenge is to distill a word or two that properly summarizes the topic or core theme of a document, rather than showing this set of 20 (or any N) individual words. Is there a pre-trained deep neural network that can name a topic when a group of words is fed to it? How else can we condense these words into a much shorter phrase that serves as a topic?
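
One simple baseline, offered here as an assumption rather than an answer to the question above, is to label each topic with its one or two highest-weighted keywords. The sketch below parses a topic string in the format produced by `print_topics()`; the helper name `label_topic` is hypothetical.

```python
import re

def label_topic(topic_string, n_words=2):
    """Return the n highest-weighted keywords of a topic as a short label."""
    # Each term looks like 0.246*"loan"; capture the weight and the word.
    pairs = re.findall(r'([0-9.]+)\*"([^"]+)"', topic_string)
    ranked = sorted(pairs, key=lambda p: float(p[0]), reverse=True)
    return " / ".join(word for _, word in ranked[:n_words])

# First few terms of topic 0 from the output above.
topic_0 = ('0.246*"loan" + 0.076*"commercial" + 0.062*"mortgage" + '
           '0.060*"consumer"')

print(label_topic(topic_0))
```

This only shortens the keyword list; it does not produce a true summary phrase, so it is best treated as a starting point for comparison with any model-based labeling.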

