# NLP Analysis of Financial Returns
This notebook has a couple of major goals. The first and foremost is to analyze text using Python. I have handled similar analysis in Java and Scala using Stanford CoreNLP, Spark, and other libraries, but I haven't done it in Python. The second objective is to gain familiarity with NLTK, a major NLP package that I have not used before.

This notebook will also help me fill some gaps in the NLP analysis of financial returns that I am doing with Stanford CoreNLP. Generating NLP metadata can be resource intensive, and I have had problems with that on the Java side. I am wondering how much of that is a hardware bottleneck versus performance differences among the APIs. I will run these on AWS with well-provisioned instances anyway, but it is good to know what each of the primary NLP packages can offer at its best.

In [13]:
from bs4 import BeautifulSoup
import pickle
import nltk
import requests
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
from nltk.tag import StanfordNERTagger

nltk.download('punkt')
nltk.download('stopwords')
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\pshar\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\pshar\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\pshar\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

Download a filing from SEC EDGAR. Note that I have separate programs for downloading these files and arranging them by company, filing type, and so on. Here, to keep things simple, I will attempt to download a single file from EDGAR and run the NLP on that.

In [14]:
url = 'https://www.sec.gov/Archives/edgar/data/78003/000007800319000015/pfe-12312018x10kshell.htm'
page = requests.get(url).text
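
The one-line download above does not check the HTTP status or identify the client. A minimal sketch of a more defensive fetch follows. The helper name `fetch_filing`, the placeholder `User-Agent` string, and the 30-second timeout are all assumptions that are not part of the original notebook. SEC's fair-access guidance asks automated clients to declare a User-Agent, so the placeholder should be replaced with real contact details.

```python
import requests

# Placeholder identity (an assumption): replace with your own name and email.
USER_AGENT = "Sample Researcher sample@example.com"

def fetch_filing(url, session=None, timeout=30):
    """Return the body of a filing as text, raising on HTTP errors."""
    session = session or requests.Session()
    response = session.get(url, headers={"User-Agent": USER_AGENT}, timeout=timeout)
    response.raise_for_status()  # fail loudly on 403/404 instead of parsing an error page
    return response.text
```

With this helper, the cell above could become `page = fetch_filing(url)`.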

I have exclusively used Jaunt for data scraping so far. Here, I will attempt to use BeautifulSoup for the same purpose. The inner workings remain the same: you download the page over HTTP and then look for the HTML tags that hold the text, links, and so on that you are interested in. For financial returns, the text is usually in div or p tags.

In [15]:
soup = BeautifulSoup(page, "lxml")       # Fall back to 'html.parser' if lxml has challenges

In [16]:
tags = soup.find_all('div')

In [17]:
origTxt = ''
for t in tags:
    origTxt += t.text
origTxt
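
One thing to watch for with `find_all('div')`: `.text` on an element includes the text of everything nested inside it, so if the filing nests div elements, concatenating every div repeats the inner text. The sketch below demonstrates the effect on a tiny hand-written HTML snippet (not taken from the filing) and shows `soup.get_text()` as one possible alternative that visits each text node once. It uses the built-in 'html.parser' so that it does not depend on lxml.

```python
from bs4 import BeautifulSoup

# Hand-written sample (an assumption), with two divs nested inside an outer div.
sample_html = "<div><div>Item 1. Business</div><div>Item 1A. Risk Factors</div></div>"
sample_soup = BeautifulSoup(sample_html, "html.parser")

# Concatenating .text over every div: the outer div already contains the
# text of both inner divs, so each inner string ends up in the result twice.
concatenated = "".join(d.text for d in sample_soup.find_all("div"))
repeats = concatenated.count("Item 1. Business")

# get_text() walks the text nodes of the tree once, so nothing is repeated.
text_once = sample_soup.get_text(separator=" ", strip=True)
```

Whether this matters for a given filing depends on how its HTML is structured, so it is worth comparing `len(origTxt)` against `len(soup.get_text())` before relying on token counts.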



Now, let us clean the text by running regular expressions that strip bracketed spans, punctuation marks, numbers (commented out for now), and special characters.

In [18]:
import re
import string
def cleanTxt(inputText):
    retText = inputText.lower()
    retText = re.sub(r'\[.*?\]', ' ', retText)                              # drop bracketed spans
    retText = re.sub('[%s]' % re.escape(string.punctuation), ' ', retText)  # punctuation -> space
    retText = re.sub('\xa0', ' ', retText)                                  # non-breaking spaces -> space
    #retText = re.sub(r'\w*\d\w*', '', retText)
    return retText
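
To make the effect of each step concrete, here is a minimal sketch of the same cleaning steps applied to a short hand-written string (not taken from the filing). The name `clean_text_sketch` is hypothetical, and the sketch assumes the intent is to replace the actual non-breaking space character (`\xa0`) rather than the literal characters backslash, x, a, 0.

```python
import re
import string

def clean_text_sketch(text):
    """Lowercase, then blank out bracketed spans, punctuation, and non-breaking spaces."""
    text = text.lower()
    text = re.sub(r"\[.*?\]", " ", text)                              # e.g. "[1]" footnote markers
    text = re.sub("[%s]" % re.escape(string.punctuation), " ", text)  # every ASCII punctuation mark
    text = text.replace("\xa0", " ")                                  # non-breaking space character
    return text

# Hand-written sample containing a non-breaking space, a bracketed marker, and punctuation.
sample = "Item\xa01A. Risk Factors [1] (see Note 1G)"
words = clean_text_sketch(sample).split()
# words == ['item', '1a', 'risk', 'factors', 'see', 'note', '1g']
```

Each removed character is replaced by a space rather than deleted, so neighbouring words are never glued together; the extra spaces disappear at tokenization.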

In [19]:
cleanedTxt = cleanTxt(origTxt)
cleanedTxt



Next, tokenize the cleaned text and remove the stop words.

In [20]:
tokens = nltk.word_tokenize(cleanedTxt)
len(tokens)

48596

In [21]:
stopWords = set(stopwords.words('english'))
cleanTokens = []
for w in tokens:
    if w not in stopWords:
        cleanTokens.append(w)
len(cleanTokens)

32343
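
The filtering loop can also be written as a list comprehension. The sketch below is one way to do that, using a hypothetical helper `remove_stop_words` and a tiny hand-written stop word set so that it runs without the NLTK data; in the notebook the set would come from `stopwords.words('english')`. It also works out the share of tokens dropped, using the two counts printed above.

```python
def remove_stop_words(tokens, stop_words):
    """Return the tokens that are not in stop_words, preserving order."""
    stop_words = set(stop_words)  # set membership tests are O(1) on average
    return [t for t in tokens if t not in stop_words]

# Hand-written sample data (assumptions, not taken from the filing).
sample_tokens = ["the", "fda", "will", "approve", "the", "product"]
sample_stop_words = {"the", "will"}
kept = remove_stop_words(sample_tokens, sample_stop_words)
# kept == ['fda', 'approve', 'product']

# Share of tokens removed in the notebook run: 48596 tokens before, 32343 after.
removed_share = (48596 - 32343) / 48596  # roughly one third
```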

In [22]:
# This is how you can generate POS tags in NLTK. Left commented out because printing the full output can hang the notebook.
# A single call tags the whole token list; looping over the words and re-tagging the full text would repeat the same work for every token.
# posTags = nltk.pos_tag(cleanTokens)

In [25]:
# See the following for setup: https://stackoverflow.com/questions/13883277/stanford-parser-and-nltk/51981566#51981566
# Run the server locally from the CoreNLP directory
# (here C:\Users\pshar\Dropbox\Programming\NLP_Metadata\stanford-corenlp-full-2018-10-05) with:
#   java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -preload tokenize,ssplit,pos,lemma,ner,parse,depparse -status_port 9000 -port 9000 -timeout 15000 &


from nltk.parse import CoreNLPParser
ner_tagger = CoreNLPParser(url='http://localhost:9000', tagtype='ner')

In [12]:
list(ner_tagger.tag(('Rami Eid is studying at Stony Brook University in NY'.split())))

ConnectionError: HTTPConnectionPool(host='localhost', port=9000): Max retries exceeded with url: /?properties=%7B%22outputFormat%22%3A+%22json%22%2C+%22annotators%22%3A+%22tokenize%2Cssplit%2Cner%22%2C+%22ssplit.isOneSentence%22%3A+%22true%22%7D (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x00000258D253D9E8>: Failed to establish a new connection: [WinError 10061] No connection could be made because the target machine actively refused it',))
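
The `ConnectionError` above means nothing was listening on localhost:9000, i.e. the CoreNLP server was not running when the cell executed. A small pre-flight check can turn that long traceback into a clear message. The sketch below uses only the standard library; the helper name `server_is_up` and the 2-second timeout are assumptions, not part of NLTK or CoreNLP.

```python
import socket

def server_is_up(host="localhost", port=9000, timeout=2.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers connection refused, timeouts, and unreachable hosts
        return False
```

A cell could then guard the tagger call with `if not server_is_up(): print("Start the CoreNLP server first")`. Note that this only confirms the port accepts connections, not that the process behind it is actually CoreNLP.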

In [None]:
text = 'Our products are also subject to postmarket surveillance under the FFDCA and its implementing regulations with respect to drugs, as well as the Public Health Service Act and its implementing regulations with respect to biologics. Our Consumer Healthcare products are also subject to FDA regulation. Other U.S. federal agencies, including the DEA, also regulate certain of our products. The U.S. Federal Trade Commission has the authority to regulate the advertising of consumer healthcare products, including OTC drugs and dietary supplements. Many of our activities also are subject to the jurisdiction of the SEC. Biopharmaceutical companies seeking to market a product in the U.S. must first test the product to demonstrate that it is safe and effective for its intended use. If, after evaluation, the FDA determines the product is safe (i.e., its benefits outweigh its known risks) and effective, then the FDA will approve the product for marketing, issuing a NDA or BLA as appropriate. Companies seeking to market a generic prescription drug must scientifically demonstrate that the generic drug is bioequivalent to the innovator drug. The ANDA, or generic drug application, must show, among other things, that the generic drug is pharmaceutically equivalent to the brand, the manufacturer is capable of making the drug correctly, and the proposed label is the same as that of the innovator/brand drug’s label. Even after a drug or biologic is approved for marketing, it may still be subject to postmarketing commitments or postmarketing requirements. Postmarketing commitments are studies or clinical trials that the drug or biologic sponsor has agreed to conduct, but are not required by law and/or regulation. Postmarketing requirements include studies and clinical trials that sponsors are required to conduct, by law and/or regulation, as a condition of approval. 
Postmarketing studies or clinical trials can be required in order to assess a known risk or demonstrate clinical benefit for drugs or biologics approved pursuant to accelerated approval. If a company fails to meet its postmarketing requirements, the FDA may assess a civil monetary penalty, issue a warning letter or deem the drug or biologic misbranded. Once a drug or biologic is approved, any modifications to the product must be notified to the FDA and may also require a manufacturer to submit additional studies or conduct clinical trials. In addition, we are also required to report adverse events and comply with cGMPs, as well as advertising and promotion regulations. Failure to comply with the FFDCA may subject us to administrative and/or judicial sanctions, including warning letters, product recalls, seizures, delays in product approvals, injunctions, fines, civil penalties and/or criminal prosecution. Biosimilar Regulation. The ACA created a framework for the approval of biosimilars (also known as follow-on biologics) following the expiration of 12 years of exclusivity for the innovator biologic, with a potential six-month pediatric extension. Under the ACA, biosimilar applications may not be submitted until four years after the approval of the reference innovator biologic. The FDA is responsible for implementation of the legislation and approval of new biosimilars. Through FDA approvals and the issuance of draft and final guidance, the FDA has addressed a number of issues related to the biosimilars approval pathway, such as the labeling expectations for biosimilars. Over the next several years, the FDA is expected to issue additional draft and final guidance documents impacting biosimilars, including updated draft or final guidance regarding the standards for demonstrating interchangeability with a U.S.-licensed reference product. 
In addition, in 2017, the Biosimilar User Fee Act was reauthorized for a five-year period, which should lead to a significant increase in the FDA’s biosimilar user fee revenues, thereby providing the FDA with additional resources to process biosimilar applications. For example, in the first year under the newly authorized fee structure, the FDA estimates its revenues from biosimilar user fees will increase by more than $10 million. Sales and Marketing Laws and Regulations. The marketing practices of U.S. biopharmaceutical companies are generally subject to various federal and state healthcare laws that are intended, among other things, to prevent fraud and abuse in the healthcare industry and to protect the integrity of government healthcare programs. These laws include anti-kickback laws and false claims laws. Anti-kickback laws generally prohibit a biopharmaceutical company from soliciting, offering, receiving, or paying anything of value to generate business, including purchasing or prescribing of a particular product. False claims laws generally prohibit anyone from knowingly and willingly presenting, or causing to be presented, any claims for payment for goods (including drugs or biologics) or services to third-party payers (including Medicare and Medicaid) that are false or fraudulent and generally treat claims generated through kickbacks as false or fraudulent. Violations of fraud and abuse laws may be punishable by criminal or civil sanctions and/or exclusion from federal healthcare programs (including Medicare and Medicaid). The federal government and various states also have enacted laws to regulate the sales and marketing practices of pharmaceutical companies. The laws and regulations generally limit financial interactions between manufacturers and healthcare providers, require disclosure to the federal or state government and the public of such interactions, and/or require the adoption of compliance standards or programs. 
Many of these laws and regulations contain ambiguous requirements or require administrative guidance for implementation. Individual states, acting through their attorneys general, have become active as well, seeking to regulate the marketing of prescription drugs under state consumer protection and false advertising laws. Given the lack of clarity in laws and their implementation, our activities could be subject to the penalties under the pertinent laws and regulations. Pricing and Reimbursement. Pricing and reimbursement for our pharmaceutical products depends in part on government regulation. Pfizer must offer discounted pricing or rebates on purchases of pharmaceutical products under various federal and state healthcare programs, such as the Medicaid Drug Rebate Program, the “federal ceiling price” drug pricing program, the 340B drug pricing program and the Medicare Part D Program. Pfizer must also report specific prices to government agencies under healthcare programs, such as the Medicaid Drug Rebate Program and Medicare Part B. The calculations necessary to determine the prices reported are complex and the failure to report prices accurately may expose Pfizer to penalties. See the discussion regarding rebates in the Analysis of the Consolidated Statements of Income—Revenues—Overview section in our 2018 Financial Report and in the Notes to Consolidated Financial Statements—Note 1G. Basis of Presentation and Significant Accounting Policies: Revenues and Trade Accounts Receivable in our 2018 Financial Report, which are incorporated by reference. Drug Regulation. In the U.S., biopharmaceutical products are subject to extensive pre- and postmarket regulation by the FDA, including regulations that govern, among other things, the safety and efficacy of our medicines, clinical trials, advertising and promotion, manufacturing, labeling and record keeping.'

tokenized_text = word_tokenize(text)
classified_text = ner_tagger.tag(tokenized_text)

print(classified_text)
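
The tagger returns one `(token, tag)` pair per token, which is hard to read for a passage of this length. One way to condense it is to merge consecutive tokens that share a tag into a single entity, as sketched below. The helper name `group_entities` is hypothetical, and the sample input is hand-written for illustration using CoreNLP-style tag names; it is not the output of an actual run.

```python
from itertools import groupby

def group_entities(tagged_tokens, outside_tag="O"):
    """Merge runs of consecutive tokens with the same tag into (phrase, tag) pairs."""
    entities = []
    for tag, run in groupby(tagged_tokens, key=lambda pair: pair[1]):
        if tag != outside_tag:  # "O" marks tokens outside any entity
            entities.append((" ".join(token for token, _ in run), tag))
    return entities

# Hand-written illustrative input (an assumption, not real tagger output).
sample_tagged = [
    ("Rami", "PERSON"), ("Eid", "PERSON"), ("is", "O"), ("studying", "O"),
    ("at", "O"), ("Stony", "ORGANIZATION"), ("Brook", "ORGANIZATION"),
    ("University", "ORGANIZATION"), ("in", "O"), ("NY", "STATE_OR_PROVINCE"),
]
entities = group_entities(sample_tagged)
# entities == [('Rami Eid', 'PERSON'),
#              ('Stony Brook University', 'ORGANIZATION'),
#              ('NY', 'STATE_OR_PROVINCE')]
```

A limitation of this simple grouping is that two distinct entities with the same tag that appear back to back are merged into one phrase.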