# NLP Topic Modeling


This notebook focuses on topic modeling in Natural Language Processing (NLP). As the name suggests, topic modeling identifies the major themes in a set of documents and represents each topic as a weighted group of words. The library used here is gensim, and the method is primarily Latent Dirichlet Allocation (LDA).

This notebook is divided into several sections. The first half covers data retrieval, cleanup, and organization; the second half gets into the mechanics of topic modeling. Most of the code in the second half comes from the following tutorial, which does a great job of explaining core gensim concepts:

https://www.machinelearningplus.com/nlp/topic-modeling-gensim-python/

The first half is my own code, which I often use to demonstrate NLP functionality.

In [0]:
import nltk
import pickle

#!pip install pyLDAvis
import pyLDAvis
import pyLDAvis.gensim  # don't skip this (pyLDAvis 3.x renames this module to pyLDAvis.gensim_models)

import gensim
from gensim.models.word2vec import Word2Vec
from gensim.models.phrases import Phraser, Phrases
from gensim import corpora
from gensim.utils import simple_preprocess
from gensim.models import CoherenceModel
from pprint import pprint

import pandas as pd
import re  # needed for the optional regex-based number removal later on
import requests
import string

from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
from nltk.tag import StanfordNERTagger
from sklearn.manifold import TSNE
from bokeh.io import output_notebook, output_file
from bokeh.plotting import show, figure
from bs4 import BeautifulSoup

# spacy for lemmatization
import spacy

# Plotting tools
import matplotlib.pyplot as plt
%matplotlib inline

# Enable logging for gensim - optional
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.ERROR)

import warnings
warnings.filterwarnings("ignore",category=DeprecationWarning)


In [40]:
nltk.download('stopwords')
nltk.download('punkt')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

## Retrieve Dataset

In [0]:
# For text datasets, I often use SEC EDGAR filings of US-based public companies. For demonstration purposes, I keep the list limited to around 5 such documents.
urls = []
urls.append("https://www.sec.gov/Archives/edgar/data/886982/000119312518056383/d480167d10k.htm")
"""
urls.append("https://www.sec.gov/Archives/edgar/data/78003/000007800318000091/pfe-09302018x10q.htm") # PFE : 10Q : 3Q2018

urls.append("https://www.sec.gov/Archives/edgar/data/886982/000119312517056804/d308759d10k.htm")
urls.append("https://www.sec.gov/Archives/edgar/data/895421/000119312517059212/d328282d10k.htm")
urls.append("https://www.sec.gov/Archives/edgar/data/895421/000119312518060831/d500533d10k.htm")
tgtUrl = 'https://www.sec.gov/Archives/edgar/data/886982/000119312519050198/d669877d10k.htm'
"""
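# Retrieve the HTML pages for the URLs and concatenate them into one string.
# Note: SEC EDGAR may reject requests that lack a descriptive User-Agent header;
# if that happens, pass headers={'User-Agent': ...} with your own contact details to requests.get().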
pages = ''
for url in urls:
  pages += requests.get(url).text

In [0]:
# Parse the HTML using BeautifulSoup. Fall back to 'html.parser' in case lxml has challenges.
soup = BeautifulSoup(pages, "lxml")  

In [0]:
# Find all 'div' and 'p' tags as these are the ones that contain data in our documents. Maintain the order of text.
tagTypes = ['div', 'p']
tags = soup.find_all(tagTypes)

In [0]:
# Retrieve the plain-text from HTML tags.
origTxt = ''
for t in tags:
    origTxt += t.text


## Data Cleanup

Now we have the raw text. We need to clean it up by removing stop words, punctuation, and other common trivial patterns.

This text contains many "\xa0" (non-breaking space) characters, which need to be replaced with regular spaces. Start the cleanup with these characters. Refer to:

https://stackoverflow.com/questions/10993612/python-removing-xa0-from-string
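
As a small illustration of the replacement described above, here is a minimal sketch on a made-up string (the sample text is my own, not taken from the filing). It also shows `unicodedata.normalize` with the `NFKC` form as an alternative; note that NFKC rewrites other compatibility characters too, so treat it as an option rather than a drop-in substitute.

```python
import unicodedata

# Hypothetical sample containing non-breaking spaces (U+00A0), as seen in scraped HTML.
raw = "Form\xa010-K for December\xa031,\xa02017"

# Option 1: explicit replacement, as done in this notebook.
replaced = raw.replace("\xa0", " ")

# Option 2: Unicode compatibility normalization, which maps U+00A0 to a plain space.
normalized = unicodedata.normalize("NFKC", raw)

print(replaced)
print(replaced == normalized)
```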

In [0]:
intermediateTxt = origTxt.replace(u'\xa0', u' ')

Now, clean stop words.

In [0]:
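# stop_words (a list, extended with a few extra words) is used later by remove_stopwords().
# stopWords (a set that also contains punctuation) is used for the token-level cleanup below.
stop_words = stopwords.words('english')
stop_words.extend(['from', 'subject', 're', 'edu', 'use'])
stopWords = set(stopwords.words('english') + list(string.punctuation))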


In [95]:
intermediateTokens = nltk.word_tokenize(intermediateTxt)
len(intermediateTokens)

457682

In [96]:
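# Remove stop words and punctuation; the tokens that are kept get lowercased.
# Note: the membership test runs before lowercasing, so capitalized stop words such as "The" are kept.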
cleanTokens = []
for w in intermediateTokens:
    if w not in stopWords:
        cleanTokens.append(w.lower())
        cleanTokens.append(' ') # Need to append a single space for cases where words are losing space in between

len(cleanTokens)

547674

In [0]:
# Drop the placeholder spaces to get the list of individual words or tokens,
# then join the tokens with single spaces to get one cleaned string.
cleanedTxtLst = [token for token in cleanTokens if token != ' ']
cleanedTxt = ' '.join(cleanedTxtLst)

It is important to tokenize first and then match each individual token against the stop words. If we simply search the entire string for stop words and punctuation and remove them, we lose important information. Consider the token "I.R.S." as an example. If the logic removes stop words and punctuation from the entire string in one go, the dots inside this token are removed. If we tokenize first, the comparison is made against the whole "I.R.S." token, so the dots inside it survive. This is one simple example, but I have seen better results when stop-word removal is done after tokenization. If you need a single string, simply join all the tokens in the list.
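
To make the "I.R.S." point concrete, here is a minimal sketch that uses only the standard library. The regex tokenizer and the four-word stop list are my own simplifications for illustration; they stand in for `nltk.word_tokenize` and the NLTK stop-word list and are not what the rest of this notebook runs.

```python
import re
import string

# Tiny illustrative stop-word list (an assumption, not the NLTK list).
STOP_WORDS = {"the", "is", "a", "of"}
PUNCT = set(string.punctuation)

# Simplified tokenizer: dotted abbreviations first, then words, then single punctuation marks.
TOKEN_RE = re.compile(r"[A-Za-z](?:\.[A-Za-z])+\.?|\w+|[^\w\s]")

def clean_tokens(text):
    """Tokenize first, then drop tokens that are stop words or punctuation."""
    tokens = TOKEN_RE.findall(text)
    return [t.lower() for t in tokens
            if t.lower() not in STOP_WORDS and t not in PUNCT]

text = "The I.R.S. is a bureau of the Treasury."

# Naive approach: strip punctuation from the whole string in one go.
naive = text.translate(str.maketrans("", "", string.punctuation))

tokens = clean_tokens(text)
print(naive)   # the dots inside the abbreviation are gone
print(tokens)  # the abbreviation survives as one token
```

Here the comparison lowercases each token before checking the stop list, so a capitalized "The" is filtered out as well.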

In [0]:
# Optional: remove numbers and any words containing numbers from the cleaned text (requires `import re`):
# cleanedTxt = re.sub('\\w*\\d\\w*', '', cleanedTxt)

## Clean and Arrange Original Text into a List of Sentences

Thus far, we have tokenized the input into individual tokens, or words. However, it is also important to tokenize the input by sentence. Tokenization by sentence helps surface phrases, sentiment, and similar features that cannot be derived efficiently from individual words.

In [99]:
# Start with original text scraped from web resources. Check that it wasn't inadvertently modified.
origTxt

'\nUNITED STATES SECURITIES AND EXCHANGE COMMISSION \nWashington, D.C. 20549  \xa0\n\xa0 Form\xa010-K  ANNUAL REPORT PURSUANT TO SECTION 13 OR 15(d) OF \nTHE SECURITIES EXCHANGE ACT OF 1934  \xa0\n\xa0 \xa0\n\n\n\n\n\n\n For the fiscal year ended December\xa031,\xa02017\n\xa0\nCommission File Number: 001-14965\n The Goldman Sachs Group, Inc. \n(Exact name of registrant as specified in its charter)  \xa0\n\n\n\n\n\n\nDelaware\n\xa0\n13-4019460\n\n (State or other jurisdiction of\nincorporation or organization)\n\xa0\n (I.R.S. Employer\nIdentification No.)\n\n\n\n\n200 West Street\n\xa0\n10282\n\n New York, N.Y.\n(Address of principal executive offices)\n\xa0\n(Zip Code)\n (212) 902-1000 \n(Registrant\x92s telephone number, including area code) \nSecurities registered pursuant to Section\xa012(b) of the Act:  \xa0\n\n\n\n\n\n\nTitle of each class:\n\xa0\nName of each exchange on which registered:\n\n Common stock, par value $.01 per share\n\xa0\nNew York Stock Exchange\n\n Depositary Sha

In [0]:
intermediateTxt = origTxt.replace(u'\xa0', u' ')

In [101]:
# Tokenize by sentence. Notice that stop words and punctuation have not been removed yet in this cell.
sents_tokenized = sent_tokenize(intermediateTxt)
sents_tokenized[1:20]

['Employer\nIdentification No.)',
 '200 West Street\n \n10282\n\n New York, N.Y.\n(Address of principal executive offices)\n \n(Zip Code)\n (212) 902-1000 \n(Registrant\x92s telephone number, including area code) \nSecurities registered pursuant to Section 12(b) of the Act:   \n\n\n\n\n\n\nTitle of each class:\n \nName of each exchange on which registered:\n\n Common stock, par value $.01 per share\n \nNew York Stock Exchange\n\n Depositary Shares, Each Representing 1/1,000th Interest in a Share of Floating Rate\nNon-Cumulative Preferred Stock, Series A\n \nNew York Stock Exchange\n\n Depositary Shares, Each Representing 1/1,000th Interest in a Share of 6.20%\nNon-Cumulative Preferred Stock, Series B\n \nNew York Stock Exchange\n\n Depositary Shares, Each Representing 1/1,000th Interest in a Share of Floating Rate\nNon-Cumulative Preferred Stock, Series C\n \nNew York Stock Exchange\n\n Depositary Shares, Each Representing 1/1,000th Interest in a Share of Floating Rate\nNon-Cumulative 

So sentences are recognized. We still need to do some data cleaning here as well. For that, we tokenize each sentence into words, remove punctuation at that point, rebuild the sentence from the remaining tokens, and finally push it into a list of sentences.

In [102]:
sents_ClnTknzd = []
punctuations = list(string.punctuation) # Only remove punctuation. Keep stop words for phrases and unabbreviated forms. Don't lose "of" in "US of A", for example.

for sent in sents_tokenized:
  tempTokens = nltk.word_tokenize(sent)
  # Re-join the non-punctuation tokens with single spaces to rebuild the sentence.
  sents_ClnTknzd.append(' '.join(token for token in tempTokens if token not in punctuations))

sents_ClnTknzd[1:20]

['Employer Identification No',
 '200 West Street 10282 New York N.Y. Address of principal executive offices Zip Code 212 902-1000 Registrant\x92s telephone number including area code Securities registered pursuant to Section 12 b of the Act Title of each class Name of each exchange on which registered Common stock par value .01 per share New York Stock Exchange Depositary Shares Each Representing 1/1,000th Interest in a Share of Floating Rate Non-Cumulative Preferred Stock Series A New York Stock Exchange Depositary Shares Each Representing 1/1,000th Interest in a Share of 6.20 Non-Cumulative Preferred Stock Series B New York Stock Exchange Depositary Shares Each Representing 1/1,000th Interest in a Share of Floating Rate Non-Cumulative Preferred Stock Series C New York Stock Exchange Depositary Shares Each Representing 1/1,000th Interest in a Share of Floating Rate Non-Cumulative Preferred Stock Series D New York Stock Exchange Depositary Shares Each Representing 1/1,000th Interest in

## Topic Modeling

Now we have clean data. Next, we format it so that gensim can consume it for topic modeling. The code from here onwards has largely been borrowed from https://www.machinelearningplus.com/nlp/topic-modeling-gensim-python/.

In [103]:
def sent_to_words(sentences):
    for sentence in sentences:
        # simple_preprocess lowercases and tokenizes, which drops punctuation; deacc=True additionally strips accent marks
        yield(gensim.utils.simple_preprocess(str(sentence), deacc=True))

data_words = list(sent_to_words(sents_ClnTknzd))

print(data_words[:1])

[['united', 'states', 'securities', 'and', 'exchange', 'commission', 'washington', 'form', 'annual', 'report', 'pursuant', 'to', 'section', 'or', 'of', 'the', 'securities', 'exchange', 'act', 'of', 'for', 'the', 'fiscal', 'year', 'ended', 'december', 'commission', 'file', 'number', 'the', 'goldman', 'sachs', 'group', 'inc', 'exact', 'name', 'of', 'registrant', 'as', 'specified', 'in', 'its', 'charter', 'delaware', 'state', 'or', 'other', 'jurisdiction', 'of', 'incorporation', 'or', 'organization']]
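
To show why numbers, punctuation, and one-letter tokens disappear in the output above, here is a rough standard-library approximation of what `simple_preprocess` does. The function names below are my own, and the details (alphabetic-only tokens, length limits of 2 and 15, accent stripping) follow my reading of the gensim documentation, so treat this as a sketch rather than gensim's actual implementation.

```python
import re
import unicodedata

# Runs of word characters that contain no digits (an approximation of gensim's tokenizer).
ALPHABETIC = re.compile(r"(?:(?!\d)\w)+")

def deaccent_sketch(text):
    """Strip accent marks by removing combining characters after NFD decomposition."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    return unicodedata.normalize("NFC", stripped)

def simple_preprocess_sketch(text, deacc=False, min_len=2, max_len=15):
    """Lowercase, optionally de-accent, tokenize, and keep tokens of a sensible length."""
    text = text.lower()
    if deacc:
        text = deaccent_sketch(text)
    return [t for t in ALPHABETIC.findall(text)
            if min_len <= len(t) <= max_len and not t.startswith("_")]

# Hypothetical sample sentence.
sample_tokens = simple_preprocess_sketch(
    "Form 10-K ANNUAL REPORT, Section 13 or 15(d); café", deacc=True)
print(sample_tokens)
```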


In [104]:
# Build the bigram and trigram models
bigram = gensim.models.Phrases(data_words, min_count=5, threshold=100) # higher threshold fewer phrases.
trigram = gensim.models.Phrases(bigram[data_words], threshold=100)  

# Faster way to get a sentence clubbed as a trigram/bigram
bigram_mod = gensim.models.phrases.Phraser(bigram)
trigram_mod = gensim.models.phrases.Phraser(trigram)

# See trigram example
print(trigram_mod[bigram_mod[data_words[0]]])



['united_states', 'securities', 'and', 'exchange', 'commission', 'washington', 'form', 'annual', 'report', 'pursuant', 'to', 'section', 'or', 'of', 'the', 'securities', 'exchange', 'act', 'of', 'for', 'the', 'fiscal', 'year', 'ended', 'december', 'commission', 'file', 'number', 'the', 'goldman', 'sachs', 'group', 'inc', 'exact', 'name', 'of', 'registrant', 'as', 'specified', 'in', 'its', 'charter', 'delaware', 'state', 'or', 'other', 'jurisdiction', 'of', 'incorporation', 'or', 'organization']
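
For intuition about `min_count` and `threshold`, here is a small sketch of the default phrase score as I understand it from the gensim documentation: `(count(a, b) - min_count) / (count(a) * count(b)) * vocabulary_size`, with a pair becoming a phrase when its score exceeds `threshold`. The function and the toy sentences are my own, and the vocabulary size here counts only single words, which may differ from gensim's bookkeeping.

```python
from collections import Counter

def score_bigrams(sentences, min_count=1, threshold=1.0):
    """Score adjacent word pairs and keep those whose score exceeds the threshold."""
    unigrams = Counter(w for s in sentences for w in s)
    bigrams = Counter(pair for s in sentences for pair in zip(s, s[1:]))
    vocab_size = len(unigrams)
    scores = {}
    for (a, b), n_ab in bigrams.items():
        score = (n_ab - min_count) / (unigrams[a] * unigrams[b]) * vocab_size
        if score > threshold:
            scores[(a, b)] = score
    return scores

# Toy corpus (hypothetical): "new york" co-occurs far more often than chance.
toy_sentences = [
    ["new", "york", "stock", "exchange"],
    ["new", "york", "bank"],
    ["the", "stock", "market"],
    ["the", "bank", "of", "new", "york"],
]

phrase_scores = score_bigrams(toy_sentences, min_count=1, threshold=1.0)
print(phrase_scores)
```

Raising `threshold` or `min_count` makes the score harder to exceed, which is why a higher threshold yields fewer phrases.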


In [0]:
# Define functions for stopwords, bigrams, trigrams and lemmatization
def remove_stopwords(texts):
    return [[word for word in simple_preprocess(str(doc)) if word not in stop_words] for doc in texts]

def make_bigrams(texts):
    return [bigram_mod[doc] for doc in texts]

def make_trigrams(texts):
    return [trigram_mod[bigram_mod[doc]] for doc in texts]

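def lemmatization(texts, allowed_postags=('NOUN', 'ADJ', 'VERB', 'ADV')):
    """Lemmatize each document with spaCy, keeping only tokens whose POS tag is allowed.

    Relies on the global `nlp` pipeline loaded in the next cell. See https://spacy.io/api/annotation
    """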
    texts_out = []
    for sent in texts:
        doc = nlp(" ".join(sent)) 
        texts_out.append([token.lemma_ for token in doc if token.pos_ in allowed_postags])
    return texts_out

In [106]:
# Remove Stop Words
data_words_nostops = remove_stopwords(data_words)

# Form Bigrams
data_words_bigrams = make_bigrams(data_words_nostops)

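# Initialize spaCy 'en' model, keeping only the tagger component (for efficiency)
# python3 -m spacy download en
# Note: spaCy 3.x removed the 'en' shortcut; there, download and load 'en_core_web_sm' instead.
nlp = spacy.load('en', disable=['parser', 'ner'])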

# Do lemmatization keeping only noun, adj, vb, adv
data_lemmatized = lemmatization(data_words_bigrams, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV'])

print(data_lemmatized[:1])

[['united_state', 'security', 'exchange', 'commission', 'washington', 'form', 'annual', 'report', 'pursuant', 'section', 'security', 'exchange', 'act', 'fiscal', 'year', 'end', 'december', 'commission', 'file', 'number', 'goldman', 'sach', 'group', 'inc', 'exact', 'name', 'registrant', 'specify', 'charter', 'delaware', 'state', 'jurisdiction', 'incorporation', 'organization']]


In [107]:
# Create Dictionary
id2word = corpora.Dictionary(data_lemmatized)

# Create Corpus
texts = data_lemmatized

# Term Document Frequency
corpus = [id2word.doc2bow(text) for text in texts]

# View
print(corpus[:1])

[[(0, 1), (1, 1), (2, 1), (3, 2), (4, 1), (5, 1), (6, 1), (7, 1), (8, 2), (9, 1), (10, 1), (11, 1), (12, 1), (13, 1), (14, 1), (15, 1), (16, 1), (17, 1), (18, 1), (19, 1), (20, 1), (21, 1), (22, 1), (23, 1), (24, 1), (25, 2), (26, 1), (27, 1), (28, 1), (29, 1), (30, 1)]]


In [108]:
# Human readable format of corpus (term-frequency)
[[(id2word[id], freq) for id, freq in cp] for cp in corpus[:1]]

[[('act', 1),
  ('annual', 1),
  ('charter', 1),
  ('commission', 2),
  ('december', 1),
  ('delaware', 1),
  ('end', 1),
  ('exact', 1),
  ('exchange', 2),
  ('file', 1),
  ('fiscal', 1),
  ('form', 1),
  ('goldman', 1),
  ('group', 1),
  ('inc', 1),
  ('incorporation', 1),
  ('jurisdiction', 1),
  ('name', 1),
  ('number', 1),
  ('organization', 1),
  ('pursuant', 1),
  ('registrant', 1),
  ('report', 1),
  ('sach', 1),
  ('section', 1),
  ('security', 2),
  ('specify', 1),
  ('state', 1),
  ('united_state', 1),
  ('washington', 1),
  ('year', 1)]]
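
The `(id, count)` pairs above are a bag-of-words encoding. The sketch below mimics the idea with a tiny dictionary class of my own (`TinyDictionary` is hypothetical and far simpler than `corpora.Dictionary`): each new token gets the next free integer id, and each document becomes a sorted list of `(token_id, frequency)` pairs.

```python
from collections import Counter

class TinyDictionary:
    """Toy stand-in for a token-to-id mapping; not gensim's implementation."""

    def __init__(self):
        self.token2id = {}

    def doc2bow(self, tokens, allow_update=False):
        counts = Counter(tokens)
        if allow_update:
            # Assign ids to unseen tokens in alphabetical order.
            for tok in sorted(counts):
                if tok not in self.token2id:
                    self.token2id[tok] = len(self.token2id)
        # Tokens missing from the dictionary are silently ignored.
        return sorted((self.token2id[t], n) for t, n in counts.items()
                      if t in self.token2id)

# Hypothetical lemmatized documents.
docs = [
    ["security", "exchange", "commission", "security"],
    ["exchange", "act", "commission"],
]

toy_dict = TinyDictionary()
toy_corpus = [toy_dict.doc2bow(doc, allow_update=True) for doc in docs]
print(toy_dict.token2id)
print(toy_corpus)
```

Word order is discarded in this representation; only which words occur and how often is kept, which is all LDA needs.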

In [0]:
# Build LDA model
lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus,
                                           id2word=id2word,
                                           num_topics=20, 
                                           random_state=100,
                                           update_every=1,
                                           chunksize=100,
                                           passes=50,
                                           alpha='auto',
                                           per_word_topics=True)

In [110]:
# Print the top keywords for each of the 20 topics
pprint(lda_model.print_topics())
doc_lda = lda_model[corpus]

[(0,
  '0.095*"certain" + 0.067*"equity" + 0.055*"option" + 0.053*"award" + '
  '0.044*"compensation" + 0.044*"investment" + 0.042*"price" + '
  '0.041*"director" + 0.032*"transaction" + 0.031*"performance"'),
 (1,
  '0.195*"bank" + 0.111*"gs" + 0.098*"service" + 0.077*"client" + 0.057*"usa" '
  '+ 0.057*"segment" + 0.054*"operating" + 0.045*"lending" + 0.029*"invest" + '
  '0.025*"proceeding"'),
 (2,
  '0.115*"february" + 0.105*"exhibit" + 0.094*"share" + 0.042*"act" + '
  '0.038*"outstanding" + 0.036*"new_york" + 0.034*"co" + 0.034*"principal" + '
  '0.021*"see" + 0.021*"law"'),
 (3,
  '0.176*"reference" + 0.160*"loan" + 0.096*"instrument" + 0.077*"may" + '
  '0.039*"future" + 0.028*"receivable" + 0.027*"underlie" + 0.021*"claim" + '
  '0.021*"various" + 0.020*"transfer"'),
 (4,
  '0.167*"term" + 0.101*"borrowing" + 0.098*"action" + 0.071*"long" + '
  '0.059*"sell" + 0.049*"applicable" + 0.039*"exclude" + 0.038*"own" + '
  '0.025*"recognize" + 0.025*"indemnification"'),
 (5,
  '0.135

In [111]:
# Compute Perplexity
# log_perplexity() returns the per-word likelihood bound (a negative number); gensim defines perplexity as 2**(-bound), so lower perplexity is better.
print('\nPerplexity: ', lda_model.log_perplexity(corpus))

# Compute Coherence Score
coherence_model_lda = CoherenceModel(model=lda_model, texts=data_lemmatized, dictionary=id2word, coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()
print('\nCoherence Score: ', coherence_lda)


Perplexity:  -7.5858258719464216

Coherence Score:  0.4002224099255929
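
A note on reading the first number: as far as I understand from the gensim documentation, `log_perplexity` returns a per-word likelihood bound, and the perplexity itself is `2 ** (-bound)`. The helper below is my own small sketch of that conversion, applied to the value printed above.

```python
def perplexity_from_bound(per_word_bound):
    """Convert a per-word likelihood bound into a perplexity value."""
    return 2 ** (-per_word_bound)

# Bound reported by lda_model.log_perplexity(corpus) in this run.
bound = -7.5858258719464216
perplexity = perplexity_from_bound(bound)
print(round(perplexity, 1))
```

A bound closer to zero therefore corresponds to a lower perplexity, which is the quantity usually compared across models.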


In [112]:
# Visualize the topics
pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim.prepare(lda_model, corpus, id2word)
vis