# NLP Financials Metadata

This notebook generates the NLP data for a financial filing in the form accepted by the upstream program. I already have a Java program that generates NLP data using Stanford CoreNLP, but the data generated in Python is of better quality, is more in-depth, and gives better control over the output. It is therefore pragmatic to take the hit now and invest the time in generating the NLP data for the catalog in Python.

In [102]:
import nltk
import pickle
import gensim
import pandas as pd
import requests
import string
import datetime
import pattern
import hashlib
import numpy as np


from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
from nltk.tag import StanfordNERTagger
from nltk.stem.snowball import SnowballStemmer
from gensim.models.word2vec import Word2Vec
from gensim.models.phrases import Phraser, Phrases
from gensim import corpora
from sklearn.manifold import TSNE
from bs4 import BeautifulSoup
from gensim.utils import lemmatize

%matplotlib inline

In [2]:
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')

[nltk_data] Error loading punkt: <urlopen error [Errno 11001]
[nltk_data]     getaddrinfo failed>
[nltk_data] Error loading stopwords: <urlopen error [Errno 11001]
[nltk_data]     getaddrinfo failed>
[nltk_data] Error loading averaged_perceptron_tagger: <urlopen error
[nltk_data]     [Errno 11001] getaddrinfo failed>
[nltk_data] Error loading wordnet: <urlopen error [Errno 11001]
[nltk_data]     getaddrinfo failed>


False

- Each word CSV file must have: symbol name, filing type, word, lemma, POS, NER, fileName, s#, and absolute path.
- Each sentence CSV file must have: symbol name, filing type, fileName, s#, filePath (local or URL), and the actual sentence.



Next, define a function that reads the input HTML page and returns its text content, stripped of all HTML markup.

In [3]:
def retrieveTxt(htmlPage):
  soup = BeautifulSoup(htmlPage, "lxml")
  tagTypes = ['div', 'p']
  tags = (soup).find_all(tagTypes)
  origTxt = ''
  for t in tags:
    origTxt += (t.text+" ")
  return origTxt

Clean the raw extracted text by taking out stop words, punctuation marks, and some special Unicode characters. Note that the cleanup here is the result of earlier testing and has given me the best results thus far.

In [4]:
def genIntermediateTokens(origTxt):
  # Start by replacing the non-breaking space (U+00A0), which is prevalent in the filings, with a regular space
  intermediateTxt = origTxt.replace(u'\xa0', u' ')
  intermediateTokens = nltk.word_tokenize(intermediateTxt)
  return intermediateTokens
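
The `\xa0` replaced above is the non-breaking space (U+00A0). If other compatibility characters turn out to cause trouble as well, Unicode NFKC normalization is one possible generalization. The sketch below is an assumption about how that might look, not something this notebook currently does, and `normalizeWhitespace` is a hypothetical helper name. Note that NFKC also rewrites other characters (ligatures, full-width forms), which may or may not be wanted for filings.

```python
import unicodedata

def normalizeWhitespace(origTxt):
    # Hypothetical helper: NFKC folds compatibility characters, including
    # the non-breaking space U+00A0, into their plain equivalents.
    normalized = unicodedata.normalize("NFKC", origTxt)
    # Collapse any run of whitespace (spaces, tabs, newlines) to one space.
    return " ".join(normalized.split())

print(normalizeWhitespace("Total\xa0revenues \n of\xa0\xa0$10"))
# Total revenues of $10
```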

The next function takes care of the stop words. The value of the rest of the code in the function is somewhat arguable: the repeated for loops are there to avoid one specific condition in which the word at the end of a sentence gets merged with the word at the start of the next sentence. TODO: Look into the problem further and avoid this double for loop.
  

In [5]:
def cleanedWordsTxt(intermediateTokens):
  stopWords = set(stopwords.words('english') + list(string.punctuation))
  cleanTokens = []
  for w in intermediateTokens:
      if w.lower() not in stopWords: # Compare in lower case; the NLTK stop word list is all lower case
          cleanTokens.append(w.lower())
          cleanTokens.append(' ') # Need to append a single space for cases where words are losing space in between
          
  cleanedTxt = ''
  cleanedTxtLst = []
  for token in cleanTokens:
    if token != ' ':
      cleanedTxtLst.append(token)
      cleanedTxt += (token)
  cleanedTxt = cleanedTxt.replace('  ', ' ')  
  return cleanedTxtLst
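
One way the TODO above might be addressed: since the function returns a list, the tokens are never concatenated into a single string, so neighbouring words cannot merge and one pass is enough. The following is a minimal sketch under that assumption. `cleanTokensSinglePass` and `SAMPLE_STOP_WORDS` are hypothetical names, and the tiny stop word set only stands in for `stopwords.words('english')` so that the sketch runs without the NLTK corpus.

```python
import string

# Hypothetical stand-in for stopwords.words('english'), to keep the sketch
# runnable without the NLTK corpus download.
SAMPLE_STOP_WORDS = {"the", "of", "and", "a", "to", "in"}

def cleanTokensSinglePass(tokens, stopWords=SAMPLE_STOP_WORDS):
    # Drop stop words and punctuation in one pass, comparing in lower case
    # so that capitalised stop words ("The") are removed as well.
    skip = set(stopWords) | set(string.punctuation)
    return [t.lower() for t in tokens if t.lower() not in skip]

tokens = ["The", "Company", "reported", "revenue", ".", "Costs", "rose", "."]
print(cleanTokensSinglePass(tokens))
# ['company', 'reported', 'revenue', 'costs', 'rose']
```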

Thus far we have tokenized text into words and cleaned them. Now, do the same but keep the unit at sentence level rather than at word level. 

In [6]:
from nltk.tokenize import PunktSentenceTokenizer
def cleanedSentsTxt(origTxt):
  intermediateTxt = origTxt.replace(u'\xa0', u' ')
  sents_tokenized = sent_tokenize(intermediateTxt)
  sents_ClnTknzd = []
  punctuations = list(string.punctuation) # only remove punctuations. Keep stop words for phrases and un-abbreviated forms. Ex: United States "of" America. 

  for sent in sents_tokenized:
    tempStr = ''
    tempTokens = nltk.word_tokenize(sent)
    for token in tempTokens:
      if token not in punctuations:
          tempStr += (token)
          tempStr += ' '
          #cleanTokens.append(' ') # Need to append a single space for cases where words are losing space in between

    sents_ClnTknzd.append(tempStr.strip()) # Helps remove the space at the end of tempStr

  return sents_ClnTknzd


Add a function that generates metadata such as the word itself, lemma, NER, PoS, etc. for the individual words in a sentence. Earlier, sentences and the words in those sentences were pushed to two separate files for each symbol and filing. That, however, caused a sentence and the words within it to get different unique IDs upon indexing, so sentences and words could not both appear under a single result and there was no way to group them afterwards.

To avoid all that, the word metadata must be indexed together with the sentence that contains those words, under a single unique identifier within Solr. This presents a challenge: while a sentence is unique within a single row, the "word", "pos", and "ner" values are no longer unique even within a single document as Solr defines it, because each row contains all the words of a sentence and their respective annotations. This not only increases the size of the input file for indexing but also increases indexing duration and complexity. I spent a few days thinking about alternative solutions but could not find a way to combine sentence and word annotation metadata indexed at separate times, so this approach is the option of last resort.

In [88]:
enableNER = False # Keep it disabled unless Stanford CoreNLP server is running locally and you are on a loaded (CPU, mem) instance
stopWords = set(stopwords.words('english') + list(string.punctuation))
from nltk.stem.wordnet import WordNetLemmatizer
# create an object of stemming function
stemmer = SnowballStemmer("english")
# lemmatizer = WordNetLemmatizer()

if (enableNER):
  from nltk.parse import CoreNLPParser
  try: # Careful when enabling NER generation. It can increase compute times by order of 20-30 times 
    ner_tagger = CoreNLPParser(url='http://localhost:9000', tagtype='ner')
  except:
    print("NER_Tagger not available. Ensure that local CoreNLP Server is running.")

def genLemmas(sent):
    i = 0
    tokens = word_tokenize(sent)
    pos = nltk.pos_tag(tokens)
    if (enableNER):
      ner = ner_tagger.tag(tokens)
    posLength = len(pos)
    # Higher level list which will contain map with token as "k" and list of lemma, ner, pos as "v"
    annListForSent = []
       
    try:
      for token in tokens:
        if token not in stopWords:
          tknEntry = {}
          lstTemp = [] 
          # lemma = lemmatizer.lemmatize(token) # Didn't work 
          lemma = stemmer.stem(token)
          posWord = pos[i][1] 
          if (enableNER): nerWord = ner[i][1] 
          else: nerWord = 'ner'

          lstTemp.append(lemma); lstTemp.append(posWord); lstTemp.append(nerWord) # 0:lemma, 1:pos, 2:ner, 
          tknEntry[token] = lstTemp


          annListForSent.append(tknEntry)
        i += 1 # Must increase value of i regardless of whether the word is in stopWords or not. Recall that POS and NER were computed with stopwords included. 
        
    except IndexError:
      print("List index out of range likely for PoS or NER tagger")
      print("Sentence where the error occurred is: '" + sent + "'")
      print("Skipping indexing of this sentence.")
    
    return annListForSent
        

In [89]:
"""
sent = "The Company has entered into a guarantee for the benefit of PSA Group and pursuant to \
    which the Company has agreed to guarantee the Seller's obligation to indemnify PSA Group for certain \
    losses resulting from any inaccuracy of certain representations and warranties or breaches of our covenants \
    in the Agreement and for certain other liabilities."
annListForSent = genLemmas(sent)
annListForSent
"""


It is important to use whole sentences to determine parts of speech. If PoS is determined from a bare list of words, the context is likely to be lost.

In [103]:
def genAnnotations(symbol, filingType, cleanedSentsList, nameInputFile, url):
  annList = []
  sentList = []

  sCount = 0
  for sent in cleanedSentsList:
    sCount += 1
    
    
    lstSent = []
    lstSent.append(symbol); lstSent.append(filingType); lstSent.append(nameInputFile)
    lstSent.append('s'+str(sCount)); lstSent.append(url); lstSent.append(sent.replace(",", ""))
    # This is purely to group search results by sentence. Solr won't allow grouping by large text fields, so hash it. 
    hash = hashlib.sha224(bytes(sent.replace(",", ""), encoding='utf-8')).hexdigest()
    lstSent.append(hash)
    # Now, generate annotations at individual word level
    annListForSent = genLemmas(sent)
    cntWords = len(annListForSent)
    lstSent.append(cntWords)
    for kvPair in annListForSent:
        for key, value in kvPair.items():
            lstSent.append(key)
            lstSent.append(value[0])
            lstSent.append(value[1])
            lstSent.append(value[2])

    #tempSentStr = symbol + "," + filingType + "," + nameInputFile + "," + str(sCount) + "," + url + "," + sent.replace(",", "")
    sentList.append(lstSent)
    
  return sentList
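
Each row built above has eight fixed fields (symbol, filing type, file name, s#, url, sentence, hash, word count) followed by four fields per word (word, lemma, pos, ner). Because sentences have different numbers of words, the word count is what tells a reader where a row's data ends. As a rough sketch of how a consumer might unpack one row, with `parseSentenceRow` being a hypothetical helper and the sample values made up for illustration:

```python
import hashlib

# Fixed leading fields: symbol, filing type, file name, s#, url,
# sentence, hash, word count. Every word then adds 4 more fields.
FIXED_FIELDS = 8
FIELDS_PER_WORD = 4  # word, lemma, pos, ner

def parseSentenceRow(row):
    # Hypothetical reader for one row laid out as in genAnnotations.
    symbol, filingType, fileName, sentId, url, sentence, sentHash, cntWords = row[:FIXED_FIELDS]
    words = []
    for n in range(int(cntWords)):
        start = FIXED_FIELDS + FIELDS_PER_WORD * n
        word, lemma, pos, ner = row[start:start + FIELDS_PER_WORD]
        words.append({"word": word, "lemma": lemma, "pos": pos, "ner": ner})
    return {"symbol": symbol, "filingType": filingType, "fileName": fileName,
            "sentId": sentId, "url": url, "sentence": sentence,
            "hash": sentHash, "words": words}

# Illustrative row; the annotation values are made up, not tagger output.
sentence = "Revenue increased"
row = ["GE", "10-K", "ge10k2016htm", "s1", "path/to/filing", sentence,
       hashlib.sha224(sentence.encode("utf-8")).hexdigest(), 2,
       "Revenue", "revenu", "NN", "ner",
       "increased", "increas", "VBD", "ner"]

parsed = parseSentenceRow(row)
print(parsed["sentId"], [w["word"] for w in parsed["words"]])
# s1 ['Revenue', 'increased']
```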

In [124]:
import csv
import os
filesLocal = True
dirNLPOut = 'C:/Users/pshar/Dropbox/Programming/SampleTexts/NLP_Output'
"""
urls = []
urls.append("https://www.sec.gov/Archives/edgar/data/886982/000156459019008879/gs-424b2.htm")
# Manual settings for the test run. 
symbol = 'GS'
filingType = '10-K'
nameInputFile = 'd480167d10k.htm'
"""

page = ''
def traverseFilingFiles(symbol, filingType, filePath):
    (parentDir, nameInputFile) = os.path.split(filePath)
    origTxt = ''
    if (filesLocal == False):
        # The filePath here should be URL. 
        # TODO: Handle upstream function to send URL or localPath both
        htmlPage = requests.get(filePath).text # Retrieve text for one file at one point
        origTxt = retrieveTxt(htmlPage)
    else: 
        try:
            with open(filePath, 'r', encoding='utf8') as content_file:
                htmlPage = content_file.read()
                origTxt = retrieveTxt(htmlPage)
        except UnicodeDecodeError: 
            print ("UnicodeDecodeError reading HTML: '"+filePath+"' for symbol:"+symbol)
        
    """
    # Commenting out individual words based function calls, while keeping analysis at the sentence level
    intermediateTokens = genIntermediateTokens(origTxt)
    cleanedWordsList = cleanedWordsTxt(intermediateTokens)
    """
    cleanedSentsList = cleanedSentsTxt(origTxt)
    sentList = genAnnotations(symbol, filingType, cleanedSentsList, nameInputFile, filePath)
    #print(sentList)

    now = datetime.datetime.now()
    time = str(now)[:19].replace(":", "-").replace(" ", "_")

    """
    # Disabling Annotations (words) output for the time-being until Solr cataloging is resolved
    
    fileWords = 'Annotations_'+symbol+'_'+filingType+'_'+nameInputFile+'_'+time+'.csv'
    absPathFileWords = os.path.join(dirNLPOut, fileWords)
    print(fileWords)
    dfrmWords = pd.DataFrame(annList)
    dfrmWords.to_csv(absPathFileWords, sep =",", index = None, header=False, quoting=csv.QUOTE_NONE, escapechar="\\")
    """
    
    fileSents = 'Sentences_'+symbol+'_'+filingType+'_'+nameInputFile+'_'+time+'.csv'
    absPathFileSents = os.path.join(dirNLPOut, fileSents)
    print(fileSents)
    dfrmSents = pd.DataFrame(sentList)#, columns = labelSents)
    #print(dfrmSents)
    dfrmSents.to_csv(absPathFileSents, sep =",", index = None, header=False, quoting=csv.QUOTE_NONE, escapechar="\\")    
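
The CSV above is written with `csv.QUOTE_NONE`, and `genAnnotations` strips commas from the sentence text beforehand. If the downstream indexer accepts quoted fields (an assumption not verified here), the standard `csv` module can keep the commas intact instead. A small sketch with a made-up row, using an in-memory buffer in place of the output file:

```python
import csv
import io

# Made-up sample row; the last field contains commas on purpose.
rows = [["GE", "10-K", "ge10k2016htm", "s1", "Revenue, net of costs, increased"]]

buffer = io.StringIO()  # stands in for the output file
writer = csv.writer(buffer, quoting=csv.QUOTE_MINIMAL)
writer.writerows(rows)  # only fields that need it get quoted

# Read the data back to confirm the commas survive the round trip.
buffer.seek(0)
roundTrip = list(csv.reader(buffer))
print(roundTrip == rows)
# True
```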

# Retrieve & Traverse Filings Previously Downloaded 

### Base functions to traverse symbols, then filing types (10-k, 10-Q...), and then individual filings

Start by listing the symbols whose filings were downloaded earlier into a directory tree arranged by symbol.

In [125]:
import os
from os.path import isdir
from os.path import abspath
"""
The base location locally 
"""
def genListDirsSymbols(basePath):
    symbols = []
    dirsSymbols = os.listdir(basePath)
    for dirSymbol in dirsSymbols:
        (head, tail) = os.path.split(dirSymbol)
        symbols.append(tail)
    return symbols

For each symbol, list the folders underneath it. These correspond to the filing types (10-K, etc.) downloaded for that symbol.

In [126]:

def genListFilingsDirsForSymbol(basePathPlusSymbol):
    filingDirs = []
    dirsFilings = os.listdir(basePathPlusSymbol)
    for dirFilings in dirsFilings:
        (head, tail) = os.path.split(dirFilings)
        filingDirs.append(tail)
    return filingDirs   

Retrieve listing of files for given symbol and given filing type.

In [127]:
def genListFilingsForSymbol(basePathPlusSymbolPlusFilingDir):
    filings = []
    files = os.listdir(basePathPlusSymbolPlusFilingDir)
    for file in files:
        (head, tail) = os.path.split(file)
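        if not tail.endswith('txt'): # Assumed intent: skip text files; the original check only skipped a file named exactly 'txt'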
            filings.append(file)
    return filings

    

Now that functions to list symbols, filing types, and the actual files are defined, traverse the files, grouping them first by symbol and then by filing type.

In [129]:
dirBase = 'C:/Users/pshar/Dropbox/WebServices/mysite/polls/templates/polls/FilingsBySymbols'
filingsFullLst = []
symbols = genListDirsSymbols(dirBase)
for symbol in symbols:
    basePathPlusSymbol = os.path.join(dirBase, symbol.upper())
    filingDirs = genListFilingsDirsForSymbol(basePathPlusSymbol)
    for filingDir in filingDirs:
        basePathPlusSymbolPlusFilingDir =  os.path.join(basePathPlusSymbol, filingDir)
        filings = genListFilingsForSymbol(basePathPlusSymbolPlusFilingDir)
        for filing in filings:
            filePath = os.path.join(basePathPlusSymbolPlusFilingDir, filing)
            traverseFilingFiles(symbol, filingDir, filePath)
               
            
    
print("DONE!")

Sentences_GE_10-K_ge10k2016htm_2019-04-01_19-34-53.csv
Sentences_GE_10-K_ge10k2017htm_2019-04-01_19-36-19.csv
Sentences_GE_10-K_ge10k2018htm_2019-04-01_19-37-18.csv
Sentences_GE_10-Q_ge1q201810qhtm_2019-04-01_19-37-45.csv
Sentences_GE_10-Q_ge2q201710qhtm_2019-04-01_19-38-08.csv


KeyboardInterrupt: 
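
The symbol, filing type, and file traversal used in this section could also be collapsed into a single generator. The sketch below is one possible shape for that, not the notebook's current implementation. `iterFilings` is a hypothetical name, and the file names in the usage part are made up. It assumes the same `<base>/<symbol>/<filingType>/<file>` layout and skips anything that is not a directory or file at the expected level.

```python
import os
import tempfile

def iterFilings(basePath):
    # Yield (symbol, filingType, filePath) for every file found under
    # basePath/<symbol>/<filingType>/, in sorted order.
    for symbol in sorted(os.listdir(basePath)):
        symbolDir = os.path.join(basePath, symbol)
        if not os.path.isdir(symbolDir):
            continue
        for filingType in sorted(os.listdir(symbolDir)):
            filingDir = os.path.join(symbolDir, filingType)
            if not os.path.isdir(filingDir):
                continue
            for name in sorted(os.listdir(filingDir)):
                filePath = os.path.join(filingDir, name)
                if os.path.isfile(filePath):
                    yield symbol, filingType, filePath

# Usage on a throwaway directory tree (hypothetical file names).
with tempfile.TemporaryDirectory() as base:
    os.makedirs(os.path.join(base, "GE", "10-K"))
    os.makedirs(os.path.join(base, "GE", "10-Q"))
    for parts in [("GE", "10-K", "a.htm"), ("GE", "10-Q", "b.htm")]:
        open(os.path.join(base, *parts), "w").close()
    found = [(s, f, os.path.basename(p)) for s, f, p in iterFilings(base)]

print(found)
# [('GE', '10-K', 'a.htm'), ('GE', '10-Q', 'b.htm')]
```

Each yielded tuple could then be passed to `traverseFilingFiles(symbol, filingType, filePath)` in a single loop.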

## Possible Enhancements

- P1: The lemmatization API being called yields poor results. The lemma is not relied upon right now, but this needs to be corrected later.
- P1: Refer to NLP_Financials_Cloud.ipynb for other possible enhancements.
- P1: Can we introduce topic modeling for each paragraph? The p HTML tag is already defined.
- P1: Bring the complete cycle (download, NLP generation, and indexing) here rather than just the middle part.
- P1: If not the full cycle, then keep the ability to generate NLP data both from local files and from EDGAR. A couple of functions may require a little tweaking.
