# Topic Modelling on SEC 10-K Forms

In [1]:
import random
from pprint import pprint
import gensim.corpora as corpora
from nltk.corpus import stopwords as sw
from gensim.models.ldamodel import LdaModel

from utils.processing import scrape_edgar, format_text, lem_and_stem, add_bigrams
from utils.stopwords import stopwords as custom_sw
from utils.analysis import cluster_companies_by_main_topic

from data.sec_edgar_urls import URLS_10K

In [2]:
stopwords = set(sw.words('english')).union(custom_sw)

## Scrape Text from Edgar ##

Using the URLs of the latest 10-K filings of 32 large-cap companies, we scrape the SEC EDGAR database to retrieve the documents in HTML format.

In [3]:
docContents = scrape_edgar(URLS_10K)

100%|██████████| 32/32 [02:32<00:00,  7.37s/it]


#### 10-k Document Sample

Here is a sample title page and table of contents from Apple's 10-K form for the 2018 fiscal year.

![title](data/images/Title.png)

![title](data/images/TableOfContents.png)

#### Structure & Navigating the Documents ####

While individual reports may vary between companies or fiscal years, the structure displayed in the table of contents seems to generalize across documents. We therefore use the header sections as a way to navigate the HTML document during text scraping.

#### Relevant Text ####

Furthermore, since most of each document is legal text that says little about the company's products, we discard all text except the section under the *Business* header. This section contains information such as the company description, its market and a list of its products.
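
As a rough illustration of this header-based navigation, the sketch below pulls the text between the *Business* header and the following header out of a raw HTML string. It is not the `scrape_edgar` implementation, which is not shown in this notebook. The function name `extract_business_section` and the header patterns (`Item 1. Business`, `Item 1A.`) are assumptions based on the usual 10-K layout, and real filings will need more careful handling.

```python
import html
import re


def extract_business_section(raw_html: str) -> str:
    """Return the plain text between the 'Item 1. Business' and 'Item 1A.' headers.

    Hypothetical helper: returns an empty string when the header is not found.
    """
    # Crude tag stripping with the standard library only, then decode entities.
    text = html.unescape(re.sub(r"<[^>]+>", " ", raw_html))
    text = re.sub(r"\s+", " ", text)

    # Header wording varies between filings; these patterns are assumptions.
    start = re.compile(r"Item\s+1\.\s*Business", re.IGNORECASE)
    end = re.compile(r"Item\s+1A\.", re.IGNORECASE)

    # The table of contents lists the headers too, so take the LAST start match.
    starts = list(start.finditer(text))
    if not starts:
        return ""
    begin = starts[-1].end()
    stop = end.search(text, begin)
    return text[begin: stop.start() if stop else len(text)].strip()
```

Returning an empty string for a missing header makes it easy to spot and drop documents for which the extraction failed.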

#### Failed Scrapes ####

Since this is a prototype and we are not aiming for robustness, we cut our losses and discard all documents for which our scraper failed to get text from the relevant sections. This reduces the initial 32 companies to 15: Alphabet, Amazon, Apple, Chevron, Cisco, Disney, Facebook, Home Depot, Mastercard, Merck, Pfizer, Philip Morris, UnitedHealth Group, Visa and Walmart.

## Text Processing ##

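For each document we normalize the text, remove stopwords, tag parts of speech, and lemmatize and stem the tokens. Since nouns are most indicative of the topic, we keep only tokens tagged as nouns. We also keep only dictionary words that appear in at least 5 documents; note that gensim's `filter_extremes` counts documents, not total occurrences, and with its default `no_above=0.5` it also drops words that appear in more than half of the documents.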

In [4]:
X = []
companies = []
for company, doc in docContents.items():
    companies.append(company)
    X.append(lem_and_stem(format_text(doc), stopwords))
X = add_bigrams(X)
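
The helpers `format_text`, `lem_and_stem` and `add_bigrams` live in `utils.processing` and are not shown here. The sketch below is only a simplified stand-in for two of the steps, normalization with stopword removal and joining frequent adjacent token pairs into bigrams. It skips POS tagging, lemmatizing and stemming, which need NLTK models. The names `normalize` and `join_frequent_bigrams` and the `min_count` threshold are hypothetical, and the real `add_bigrams` may behave differently (for example, it may keep the unigrams alongside the bigrams).

```python
import re
from collections import Counter


def normalize(text, stopwords):
    """Lowercase, keep alphabetic tokens only, and drop stopwords."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in stopwords]


def join_frequent_bigrams(docs, min_count=2):
    """Merge adjacent token pairs seen at least `min_count` times across all docs."""
    pair_counts = Counter(pair for doc in docs for pair in zip(doc, doc[1:]))
    merged_docs = []
    for doc in docs:
        merged, i = [], 0
        while i < len(doc):
            pair = tuple(doc[i:i + 2])
            if len(pair) == 2 and pair_counts[pair] >= min_count:
                merged.append("_".join(pair))  # e.g. "credit_card"
                i += 2  # skip the second token of the merged pair
            else:
                merged.append(doc[i])
                i += 1
        merged_docs.append(merged)
    return merged_docs
```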

In [5]:
idWordDictionary = corpora.Dictionary(X)
idWordDictionary.filter_extremes(no_below=5)  # drop words found in fewer than 5 documents
corpus = [idWordDictionary.doc2bow(doc) for doc in X]
readableCorpus = [[(idWordDictionary[wordid], freq) for wordid, freq in cp] for cp in corpus[:1]]
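
To make the dictionary and bag-of-words steps concrete, here is a minimal pure-Python sketch of the same idea: keep tokens by document frequency, then turn each document into `(token_id, count)` pairs. It only mimics the `no_below` part of `filter_extremes` and is not gensim's implementation; `build_vocab` and `doc_to_bow` are hypothetical names, and the toy documents are made up.

```python
from collections import Counter


def build_vocab(docs, no_below=5):
    """Map each token found in at least `no_below` documents to an integer id."""
    # set(doc) ensures a token is counted once per document (document frequency).
    doc_freq = Counter(token for doc in docs for token in set(doc))
    kept = sorted(t for t, df in doc_freq.items() if df >= no_below)
    return {token: idx for idx, token in enumerate(kept)}


def doc_to_bow(doc, vocab):
    """Return sorted (token_id, count) pairs, ignoring out-of-vocabulary tokens."""
    counts = Counter(vocab[t] for t in doc if t in vocab)
    return sorted(counts.items())


# Toy corpus (made up for illustration).
toy_docs = [["card", "card", "fee"], ["card", "cloud"], ["cloud", "fee", "card"]]
toy_vocab = build_vocab(toy_docs, no_below=2)
toy_corpus = [doc_to_bow(doc, toy_vocab) for doc in toy_docs]
```

Note that "card" appears twice in the first toy document but still counts as one document for the frequency threshold.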

## Topic Modelling ##

We will now empirically determine groups of companies by performing topic modelling on the 10-K form corpus. The idea is that if two companies have the same product or market, the words they use to describe their products in the 10-K form will be similar. Peer groups can then be inferred by looking at the topic allocation for each document, that is, for each company. Latent Dirichlet Allocation (LDA) is an unsupervised way to model topics, so we do not need to define the groups in advance. It also provides us with a set of words associated with each topic.

In [6]:
numTopics = 5
passes = 1000
seed = 667  # random.randint(0,1000)

In [7]:
lda = LdaModel(
    corpus=corpus,
    id2word=idWordDictionary,
    num_topics=numTopics, 
    passes=passes,
    eta='auto',
    alpha='auto',
    update_every=0,  # batch learning
    random_state=seed
)

## Results

Now that we have run the model, let's look at how the companies group together by assigning each one to its most relevant topic:

In [8]:
cluster = cluster_companies_by_main_topic(lda, numTopics, corpus, companies)
print("Companies clustered by their most relevant topic association: \n")
pprint(cluster)

Companies clustered by their most relevant topic association: 

{0: ['CHEVRON', 'HOMEDEPOT'],
 1: ['UNITEDHEALTH', 'PFIZER', 'MERCK'],
 2: ['APPLE', 'WALMART', 'AMAZON', 'CISCO', 'DISNEY'],
 3: ['VISA', 'MASTERCARD'],
 4: ['FACEBOOK', 'ALPHABET', 'PHILIPMORRIS']}
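
The helper `cluster_companies_by_main_topic` is defined in `utils.analysis` and not shown here. A minimal sketch of the underlying idea, assuming each document's topic distribution is available as a list of `(topic_id, probability)` pairs, could look like the following. The name `cluster_by_main_topic` and the toy probabilities are hypothetical.

```python
from collections import defaultdict


def cluster_by_main_topic(doc_topics, names):
    """Group names by the topic with the highest probability in each document."""
    clusters = defaultdict(list)
    for name, topics in zip(names, doc_topics):
        # Pick the (topic_id, probability) pair with the largest probability.
        main_topic, _ = max(topics, key=lambda pair: pair[1])
        clusters[main_topic].append(name)
    return dict(clusters)


# Made-up topic distributions, for illustration only.
toy_topics = [
    [(0, 0.1), (3, 0.9)],
    [(3, 0.7), (1, 0.3)],
    [(1, 0.8), (2, 0.2)],
]
toy_clusters = cluster_by_main_topic(toy_topics, ["A", "B", "C"])
```

With the fitted model, the per-document distributions could be obtained with something like `[lda.get_document_topics(bow) for bow in corpus]`.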


There are a couple of oddballs, which is expected with such a small sample. For example, it isn't obvious why Philip Morris is more similar to Facebook or Alphabet than to drug or health companies like UnitedHealth, Pfizer or Merck.

That being said, similar companies generally seem to group together. For example, group 3 contains payment providers, group 1 contains health and drug companies, and groups 2 and 4 seem to contain tech companies (where group 2 seems to hold tech companies that also have physical locations).

To get an idea of how to interpret the topics, we can view the words the LDA algorithm associates with each topic:

In [9]:
print("Words associated with each topic: \n")
pprint(lda.print_topics())

Words associated with each topic: 

[(0,
  '0.079*"field" + 0.059*"well" + 0.051*"home" + 0.044*"affili" + '
  '0.028*"capac" + 0.028*"energi" + 0.022*"australia" + 0.020*"water" + '
  '0.017*"mexico" + 0.015*"sharehold"'),
 (1,
  '0.116*"care" + 0.033*"phase" + 0.022*"insur" + 0.020*"adult" + '
  '0.017*"agenc" + 0.014*"japan" + 0.014*"coverag" + 0.014*"rule" + '
  '0.014*"section" + 0.013*"contract"'),
 (2,
  '0.036*"station" + 0.023*"cloud" + 0.020*"home" + 0.019*"video" + '
  '0.018*"execut" + 0.015*"enterpris" + 0.014*"hardwar" + 0.014*"hour" + '
  '0.014*"game" + 0.013*"shop"'),
 (3,
  '0.058*"institut" + 0.056*"card" + 0.040*"issuer" + 0.028*"fee" + '
  '0.028*"commerc" + 0.028*"core" + 0.028*"credit" + 0.021*"accept" + '
  '0.016*"fund" + 0.015*"jurisdict"'),
 (4,
  '0.019*"cloud" + 0.017*"hardwar" + 0.016*"video" + 0.013*"learn" + '
  '0.012*"climat" + 0.012*"execut" + 0.011*"trend" + 0.011*"disclosur" + '
  '0.011*"oblig" + 0.011*"instanc"')]
