# NLP Financials Metadata

This notebook generates the NLP data required for a financial return, in the form accepted by the upstream program. I already have a Java program that generates NLP data with Stanford CoreNLP, but the data generated in Python is of better quality, goes into more depth, and gives better control over the output. It is therefore pragmatic to take the hit now and invest time in generating the NLP data for the catalog in Python.

In [1]:
import nltk
import pickle
import gensim
import pandas as pd
import requests
import string
import datetime
import pattern
import hashlib
import os

from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
from nltk.tag import StanfordNERTagger
from nltk.stem.snowball import SnowballStemmer
from gensim.models.word2vec import Word2Vec
from gensim.models.phrases import Phraser, Phrases
from gensim import corpora
from sklearn.manifold import TSNE
from bs4 import BeautifulSoup
from gensim.utils import lemmatize

%matplotlib inline

In [2]:
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\pshar\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\pshar\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\pshar\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\pshar\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

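Each word CSV file must have: symbol name, filing type, word, lemma, POS, NER, fileName, s# (sentence number), and absolute path.

Each sentence CSV file must have: symbol name, filing type, fileName, s# (sentence number), filePath (local or URL), and the actual sentence. The code below also appends a SHA-224 hash of the sentence as a final column.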



Next, define a function that reads the input HTML page and extracts its text, free of any HTML markup.

In [3]:
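def retrieveTxt(htmlPage):
  # Parse the page and collect the text of every <div> and <p> element
  soup = BeautifulSoup(htmlPage, "lxml")
  tagTypes = ['div', 'p']
  tags = soup.find_all(tagTypes)
  # Note: a <p> nested inside a <div> contributes its text twice, once per matching tag
  origTxt = ''
  for t in tags:
    origTxt += (t.text + " ")
  return origTxt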

Clean the raw, markup-free text by taking out stop words, punctuation marks, some special UTF-8 characters, and so on. Note that the cleanup here is the result of earlier testing and has given me the best results thus far.

In [4]:
def genIntermediateTokens(origTxt):
  # Start by replacing the non-breaking space (U+00A0), which is prevalent in the filings, with a regular space
  intermediateTxt = origTxt.replace(u'\xa0', u' ')
  intermediateTokens = nltk.word_tokenize(intermediateTxt)
  return intermediateTokens

The next function removes the stop words. The value of the rest of the code in the function is somewhat arguable: the repeated for loops are there to avoid one specific condition in which the word at the end of a sentence gets merged with the word at the start of the next sentence. TODO: Look into the problem further and avoid this double for loop.
  

In [5]:
def cleanedWordsTxt(intermediateTokens):
  stopWords = set(stopwords.words('english') + list(string.punctuation))
  cleanTokens = []
  for w in intermediateTokens:
      # Compare in lower case: the NLTK stop-word list is all lower case
      if w.lower() not in stopWords:
          cleanTokens.append(w.lower())
          cleanTokens.append(' ') # Need to append a single space for cases where words are losing space in between
          
  cleanedTxt = ''
  cleanedTxtLst = []
  for token in cleanTokens:
    if token != ' ':
      cleanedTxtLst.append(token)
      cleanedTxt += (token)
  cleanedTxt = cleanedTxt.replace('  ', ' ')  
  # Only the token list is returned; cleanedTxt is built but currently unused
  return cleanedTxtLst
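
The TODO above could be approached with a single pass that lowercases each token before the stop-word test. This is only a minimal sketch, not the cleanup that was tested earlier: `cleanTokensSinglePass` and the tiny stop-word set are hypothetical, and the tokens are passed in already split so that the example does not depend on NLTK data.

```python
import string

def cleanTokensSinglePass(tokens, stopWords):
    """Lowercase tokens and drop stop words and punctuation in one pass."""
    cleaned = []
    for token in tokens:
        lowered = token.lower()
        if lowered not in stopWords:
            cleaned.append(lowered)
    return cleaned

# Hypothetical stop-word set; the notebook uses NLTK's English list instead.
demoStopWords = {"the", "of", "and"} | set(string.punctuation)
demoTokens = ["The", "Firm", "reported", "net", "income", ".",
              "Revenue", "of", "the", "Firm", "rose", "."]
demoCleaned = cleanTokensSinglePass(demoTokens, demoStopWords)
print(demoCleaned)
# Join with a space only when a single string is needed.
print(" ".join(demoCleaned))
```

Keeping the tokens in a list and joining them with a space only at the end is one way to keep the last word of a sentence from merging with the first word of the next.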

Thus far we have tokenized text into words and cleaned them. Now, do the same but keep the unit at sentence level rather than at word level. 

In [6]:
from nltk.tokenize import PunktSentenceTokenizer
def cleanedSentsTxt(origTxt):
  intermediateTxt = origTxt.replace(u'\xa0', u' ')
  sents_tokenized = sent_tokenize(intermediateTxt)
  sents_ClnTknzd = []
  punctuations = list(string.punctuation) # only remove punctuations. Keep stop words for phrases and un-abbreviated forms. Ex: United States "of" America. 

  for sent in sents_tokenized:
    tempStr = ''
    tempTokens = nltk.word_tokenize(sent)
    for token in tempTokens:
      if token not in punctuations:
          tempStr += (token)
          tempStr += ' '
          #cleanTokens.append(' ') # Need to append a single space for cases where words are losing space in between

    sents_ClnTknzd.append(tempStr.strip()) # Helps remove the space at the end of tempStr

  return sents_ClnTknzd


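It is important to use sentences to determine parts of speech. If PoS is determined from a bare list of words, the context is likely to be lost. This also presents a slight problem: the tags are computed with stop words included, so the token index has to advance for every token, including the stop words that are left out of the output.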
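
Tagging a full sentence and filtering stop words afterwards means each kept word has to stay paired with its own tag. A minimal sketch of one way to do that, assuming the tagger returns `(token, tag)` pairs in sentence order as `nltk.pos_tag` does; the tagged sentence below is hand-written for illustration and `filterTaggedTokens` is a hypothetical helper.

```python
def filterTaggedTokens(taggedTokens, stopWords):
    """Drop stop words after tagging so each kept word retains its own tag."""
    return [(word, tag) for (word, tag) in taggedTokens
            if word.lower() not in stopWords]

# Hand-written (token, tag) pairs standing in for tagger output on a full sentence.
demoTagged = [("The", "DT"), ("bank", "NN"), ("raised", "VBD"),
              ("its", "PRP$"), ("dividend", "NN")]
demoStop = {"the", "its"}  # hypothetical stop-word set
demoKept = filterTaggedTokens(demoTagged, demoStop)
print(demoKept)
```

Because the pairs travel together, no separate index has to be kept in step with the token loop.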

In [7]:
enableNER = False
stopWords = set(stopwords.words('english') + list(string.punctuation))
from nltk.stem.wordnet import WordNetLemmatizer
# create an object of stemming function
stemmer = SnowballStemmer("english")
# lemmatizer = WordNetLemmatizer()

if (enableNER):
  from nltk.parse import CoreNLPParser
  try:
    ner_tagger = CoreNLPParser(url='http://localhost:9000', tagtype='ner')
  except Exception:
    print("NER_Tagger not available. Ensure that local CoreNLP Server is running.")

def genAnnotations(symbol, filingType, cleanedSentsList, nameInputFile, url):
  annList = []
  sentList = []

  sCount = 0
  for sent in cleanedSentsList:
    sCount += 1
    i = 0
    
    # Disabling Annotations (words) output for the time-being until Solr cataloging is resolved
    """
    tokens = word_tokenize(sent)
    pos = nltk.pos_tag(tokens)
    if (enableNER):
      ner = ner_tagger.tag(tokens)
    posLength = len(pos)
    """
    
    
    # One row per sentence: symbol, filing type, file name, sentence id, path/URL, sentence text
    sentTxt = sent.replace(",", "") # Strip commas; the CSV is later written with quoting disabled
    lstSent = [symbol, filingType, nameInputFile, 's'+str(sCount), url, sentTxt]
    # This is purely to group search results by sentence. Solr won't allow grouping by large text fields, so hash it. 
    sentHash = hashlib.sha224(sentTxt.encode('utf-8')).hexdigest() # Named sentHash to avoid shadowing the built-in hash()
    lstSent.append(sentHash)
    sentList.append(lstSent)

    # Disabling Annotations (words) output for the time-being until Solr cataloging is resolved
    """
    try:
      for token in tokens:
        if token not in stopWords:
          # lemma = lemmatizer.lemmatize(token) # Didn't work 
          lemma = stemmer.stem(token)
          posWord = pos[i][1] 
          if (enableNER):
            nerWord = ner[i][1] 
          else:
            nerWord = 'ner'
                 
          lstWords = []
          lstWords.append(symbol); lstWords.append(filingType); lstWords.append(token)
          lstWords.append(lemma); lstWords.append(posWord); lstWords.append(nerWord)
          lstWords.append(nameInputFile); lstWords.append('s'+str(sCount)); lstWords.append(url)
          
          annList.append(lstWords)
        i += 1 # Must increase value of i regardless of whether the word is in stopWords or not. Recall that POS and NER were computed with stopwords included. 
        
    except IndexError:
      print("List index out of range likely for PoS or NER tagger")
      print("Sentence where the error occurred is: '"+sent+"'")
      print("Skipping indexing of this sentence.")
    """
  return (annList, sentList)
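
The sentence record built inside `genAnnotations` can be looked at in isolation. The sketch below is illustrative only: `buildSentenceRow` is a hypothetical helper, and the file name and path are made-up sample values. It shows that the SHA-224 hex digest is a fixed 56 characters regardless of sentence length, which is what makes it usable as a grouping key in place of the large text field.

```python
import hashlib

def buildSentenceRow(symbol, filingType, fileName, sentNum, filePath, sentence):
    """Assemble one sentence record plus a fixed-length hash of the sentence text."""
    cleaned = sentence.replace(",", "")
    sentHash = hashlib.sha224(cleaned.encode("utf-8")).hexdigest()
    return [symbol, filingType, fileName, "s" + str(sentNum), filePath, cleaned, sentHash]

# Made-up sample values for illustration.
demoRow = buildSentenceRow("JPM", "10-K", "example.htm", 1,
                           "/tmp/example.htm", "Revenue rose, costs fell")
demoRowAgain = buildSentenceRow("JPM", "10-K", "example.htm", 2,
                                "/tmp/example.htm", "Revenue rose, costs fell")
print(demoRow)
print(len(demoRow[6]))
```

Identical sentence text always yields the same digest, so repeated sentences fall into the same group even when they come from different positions or files.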

In [19]:
import csv
filesLocal = True
dirNLPOut = 'C:/Users/pshar/Dropbox/Programming/SampleTexts/NLP_Output'
"""
urls = []
urls.append("https://www.sec.gov/Archives/edgar/data/886982/000156459019008879/gs-424b2.htm")
# Manual settings for the test run. 
symbol = 'GS'
filingType = '10-K'
nameInputFile = 'd480167d10k.htm'
"""

page = ''
def traverseFilingFiles(symbol, filingType, filePath):
    (parentDir, nameInputFile) = os.path.split(filePath)
    origTxt = ''
    if not filesLocal:
        # The filePath here should be URL. 
        # TODO: Handle upstream function to send URL or localPath both
        htmlPage = requests.get(filePath).text # Retrieve text for one file at one point
        origTxt = retrieveTxt(htmlPage)
    else: 
        try:
            with open(filePath, 'r', encoding='utf8') as content_file:
                htmlPage = content_file.read()
                origTxt = retrieveTxt(htmlPage)
        except UnicodeDecodeError: 
            print ("UnicodeDecodeError reading HTML: '"+filePath+"' for symbol:"+symbol)
        
    """
    # Commenting out individual words based function calls, while keeping analysis at the sentence level
    intermediateTokens = genIntermediateTokens(origTxt)
    cleanedWordsList = cleanedWordsTxt(intermediateTokens)
    """
    cleanedSentsList = cleanedSentsTxt(origTxt)
    (annList, sentList) = genAnnotations(symbol, filingType, cleanedSentsList, nameInputFile, filePath)

    # Timestamp such as 2019-03-28_11-02-57, safe to use in a file name
    time = datetime.datetime.now().strftime("%Y-%m-%d_%H-%M-%S")

    
    # Disabling Annotations (words) output for the time-being until Solr cataloging is resolved
    """
    #fileWords = 'Annotations_'+symbol+'_'+filingType+'_'+nameInputFile+'_'+time+'.csv'
    #absPathFileWords = os.path.join(dirNLPOut, fileWords)
    #print(fileWords)
    #dfrmWords = pd.DataFrame(annList)
    #dfrmWords.to_csv(absPathFileWords, sep =",", index = None, header=False, quoting=csv.QUOTE_NONE, escapechar="\\")
    """
    fileSents = 'Sentences_'+symbol+'_'+filingType+'_'+nameInputFile+'_'+time+'.csv'
    absPathFileSents = os.path.join(dirNLPOut, fileSents)
    print(fileSents)
    dfrmSents = pd.DataFrame(sentList)#, columns = labelSents)
    dfrmSents.to_csv(absPathFileSents, sep =",", index = None, header=False, quoting=csv.QUOTE_NONE, escapechar="\\")
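
The `to_csv` call above disables quoting and sets a backslash as the escape character. The sketch below shows what that combination does to a field that still contains a comma. It uses the standard `csv` module rather than pandas, on the assumption that the same quoting options behave the same way there, and the rows are made-up sample values.

```python
import csv
import io

buffer = io.StringIO()
writer = csv.writer(buffer, delimiter=",", quoting=csv.QUOTE_NONE,
                    escapechar="\\", lineterminator="\n")
# A field that still contains a comma gets a backslash before the comma.
writer.writerow(["JPM", "10-K", "s1", "Revenue rose, costs fell"])
# A field with commas already stripped is written unchanged.
writer.writerow(["JPM", "10-K", "s2", "Revenue rose costs fell"])
csvText = buffer.getvalue()
print(csvText)
```

Stripping commas from the sentence before writing, as `genAnnotations` does, keeps escape characters out of the sentence column altogether.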

# Retrieve & Traverse Filings Previously Downloaded 

### Base functions to traverse symbols, then filing types (10-K, 10-Q, ...), and then individual filings

Start by listing the symbols whose filings were downloaded earlier into a directory tree arranged by symbol.

In [20]:
import os
from os.path import isdir
from os.path import abspath

def genListDirsSymbols(basePath):
    """List the symbol directories found directly under the local base location."""
    symbols = []
    dirsSymbols = os.listdir(basePath)
    for dirSymbol in dirsSymbols:
        # os.listdir returns bare entry names, so tail is the name itself
        (head, tail) = os.path.split(dirSymbol)
        symbols.append(tail)
    return symbols

For each symbol, generate a listing of the folders underneath it. This listing corresponds to the filing types (10-K, 10-Q, and so on) downloaded for that symbol.

In [21]:

def genListFilingsDirsForSymbol(basePathPlusSymbol):
    filingDirs = []
    dirsFilings = os.listdir(basePathPlusSymbol)
    for dirFilings in dirsFilings:
        (head, tail) = os.path.split(dirFilings)
        filingDirs.append(tail)
    return filingDirs   

Retrieve listing of files for given symbol and given filing type.

In [22]:
def genListFilingsForSymbol(basePathPlusSymbolPlusFilingDir):
    filings = []
    files = os.listdir(basePathPlusSymbolPlusFilingDir)
    for file in files:
        (head, tail) = os.path.split(file)
        # Skips only an entry named exactly 'txt'; this compares the whole name, not the extension
        if (tail != 'txt'): 
            filings.append(file)
    return filings
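
If the intent of the `'txt'` check is to skip plain-text files by extension (an assumption; the function above compares the whole entry name), a variant based on `os.path.splitext` might look like the sketch below. `listNonTxtFilings` and the file names in the demo are hypothetical.

```python
import os
import tempfile

def listNonTxtFilings(filingDir):
    """Return entries in filingDir whose extension is not .txt (case-insensitive)."""
    filings = []
    for name in sorted(os.listdir(filingDir)):
        (root, ext) = os.path.splitext(name)
        if ext.lower() != ".txt":
            filings.append(name)
    return filings

# Demo on a throwaway directory with made-up file names.
with tempfile.TemporaryDirectory() as demoDir:
    for demoName in ["a10k.htm", "a10k.txt", "notes.TXT"]:
        open(os.path.join(demoDir, demoName), "w").close()
    demoFilings = listNonTxtFilings(demoDir)
print(demoFilings)
```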

    

Now that functions are defined to list the symbols, the filing types, and the actual files, go through them and traverse the files, grouping them first by symbol and then by filing type.

In [24]:
dirBase = 'C:/Users/pshar/Dropbox/WebServices/mysite/polls/templates/polls/FilingsBySymbols'
filingsFullLst = []
symbols = genListDirsSymbols(dirBase)
for symbol in symbols:
    basePathPlusSymbol = os.path.join(dirBase, symbol.upper())
    filingDirs = genListFilingsDirsForSymbol(basePathPlusSymbol)
    for filingDir in filingDirs:
        basePathPlusSymbolPlusFilingDir =  os.path.join(basePathPlusSymbol, filingDir)
        filings = genListFilingsForSymbol(basePathPlusSymbolPlusFilingDir)
        for filing in filings:
            filePath = os.path.join(basePathPlusSymbolPlusFilingDir, filing)
            traverseFilingFiles(symbol, filingDir, filePath)
    
print("DONE!")

Sentences_JPM_10-K_corp10k2016htm_2019-03-28_11-02-57.csv
Sentences_JPM_10-K_corp10k2017htm_2019-03-28_11-03-40.csv
Sentences_JPM_10-K_corp10k2018htm_2019-03-28_11-03-43.csv
Sentences_JPM_10-Q_corpq12018htm_2019-03-28_11-04-03.csv
Sentences_JPM_10-Q_corpq22017htm_2019-03-28_11-04-24.csv
Sentences_JPM_10-Q_corpq22018htm_2019-03-28_11-04-48.csv
Sentences_JPM_10-Q_corpq32017htm_2019-03-28_11-05-17.csv
Sentences_JPM_10-Q_corpq32018htm_2019-03-28_11-05-18.csv
Sentences_MRK_10-K_mrk1231201610khtm_2019-03-28_11-05-27.csv
Sentences_MRK_10-K_mrk1231201710khtm_2019-03-28_11-05-35.csv
Sentences_MRK_10-K_mrk1231201810khtm_2019-03-28_11-05-44.csv
Sentences_MRK_10-Q_mrk0331201810qhtm_2019-03-28_11-05-50.csv
Sentences_MRK_10-Q_mrk0630201710qhtm_2019-03-28_11-05-54.csv
Sentences_MRK_10-Q_mrk0630201810qhtm_2019-03-28_11-05-59.csv
Sentences_MRK_10-Q_mrk0930201710qhtm_2019-03-28_11-06-03.csv
Sentences_MRK_10-Q_mrk0930201810qhtm_2019-03-28_11-06-11.csv
Sentences_MS_10-K_d328282d10khtm_2019-03-28_11-06-31.
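
The three nested loops in the cell above can also be expressed as a single generator over the `symbol/filingType/file` layout. This is a sketch of an alternative, not the code that produced the output above: `iterFilings` is a hypothetical name, it skips entries that are not directories (which the listing functions above do not), and the demo tree uses empty placeholder files.

```python
import os
import tempfile

def iterFilings(baseDir):
    """Yield (symbol, filingType, filePath) for a baseDir/symbol/filingType/file layout."""
    for symbol in sorted(os.listdir(baseDir)):
        symbolDir = os.path.join(baseDir, symbol)
        if not os.path.isdir(symbolDir):
            continue
        for filingType in sorted(os.listdir(symbolDir)):
            filingDir = os.path.join(symbolDir, filingType)
            if not os.path.isdir(filingDir):
                continue
            for fileName in sorted(os.listdir(filingDir)):
                yield (symbol, filingType, os.path.join(filingDir, fileName))

# Build a small throwaway tree with empty placeholder files.
with tempfile.TemporaryDirectory() as demoBase:
    for (sym, ftype, fname) in [("JPM", "10-K", "a.htm"),
                                ("JPM", "10-Q", "b.htm"),
                                ("MRK", "10-K", "c.htm")]:
        os.makedirs(os.path.join(demoBase, sym, ftype), exist_ok=True)
        open(os.path.join(demoBase, sym, ftype, fname), "w").close()
    demoTriples = [(s, t, os.path.basename(p)) for (s, t, p) in iterFilings(demoBase)]
print(demoTriples)
```

Each yielded triple maps directly onto the arguments of `traverseFilingFiles(symbol, filingType, filePath)`.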

## Possible Enhancements

- P1: The lemmatization API being called yields poor results. The lemma is not relied on right now, but this needs to be corrected later. 
- P1: Keep reference to NLP_Financials_Cloud.ipynb for other possible enhancements.
- P1: Can we introduce topic modeling for each paragraph? The p HTML tag is already defined.  
- P1: Bring complete cycle (download, generate NLP, and indexing) here rather than just the middle part. 
- P1: If not the full cycle, then keep the ability to generate NLP data from both local files and EDGAR. A couple of functions may require a little tweaking. 
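
For the last item, one possible shape for a loader that accepts either a local path or a URL is sketched below. It is an assumption-laden illustration: `isRemote` and `loadFilingHtml` are hypothetical names, and the remote branch uses the standard library's `urllib` instead of the `requests` call used earlier in this notebook. Only the local branch is exercised in the demo.

```python
import os
import tempfile
from urllib.parse import urlparse
from urllib.request import urlopen

def isRemote(pathOrUrl):
    """Treat only http/https as remote; a Windows drive letter also parses as a scheme."""
    return urlparse(pathOrUrl).scheme.lower() in ("http", "https")

def loadFilingHtml(pathOrUrl):
    """Return the HTML text from a URL or from a local UTF-8 file."""
    if isRemote(pathOrUrl):
        with urlopen(pathOrUrl) as response:
            return response.read().decode("utf-8", errors="replace")
    with open(pathOrUrl, "r", encoding="utf8") as contentFile:
        return contentFile.read()

# Demo of the local branch with a throwaway file.
with tempfile.TemporaryDirectory() as demoDir:
    demoPath = os.path.join(demoDir, "sample.htm")
    with open(demoPath, "w", encoding="utf8") as demoFile:
        demoFile.write("<p>hello</p>")
    demoHtml = loadFilingHtml(demoPath)
print(demoHtml)
```

With a helper of this kind, `traverseFilingFiles` would not need the global `filesLocal` switch, since the decision is made per path.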
