# SEC Press Releases

This notebook is a lightly modified version of a notebook prepared by [Vincent Grégoire](http://www.vincentgregoire.com), Department of Finance, The University of Melbourne.



In [1]:
%matplotlib inline

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# This one is used to load web pages
from urllib.request import urlopen
# This one is used to parse (extract information) from web pages
from bs4 import BeautifulSoup
# This one is to search/replace in text
import re
# This one is for textual analysis
import nltk 

## Step 1: Web scraping

Our first goal is to extract all press releases from the SEC website listed at https://www.sec.gov/news/pressreleases?year=All&month=All&items_per_page=100&page=0

Before we start, we need to look at the website to see if we can identify patterns that will help with our task.

Looking at the website, it appears that all press releases for a given year are numbered in chronological order, formatted as `YYYY-###`, where `###` is a one-, two-, or three-digit number. We can assume that in a busy year it could reach four digits. Furthermore, the URL of each press release follows the format `https://www.sec.gov/news/press-release/YYYY-###`, which makes our job easier. We could iterate automatically over the list of press releases and extract each link. However, it's probably less work to figure out manually how many press releases (PRs) there are for each year we are interested in.

Remember, coding something is good, but sometimes doing it manually is the optimal way to go. This is especially true with web scraping, because you can't assume that you'll be able to reuse your code in the future: the format of the website might have changed by then.
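The URL pattern described above can be sketched as a small helper. This is only an illustration: the names `BASE_URL` and `build_pr_url` are hypothetical, and the pattern is assumed to hold only for the years discussed here.

```python
# Hypothetical helper: build a press release URL from a year and a sequence
# number, assuming the YYYY-### pattern described above.
BASE_URL = 'https://www.sec.gov/news/press-release/'

def build_pr_url(year, number):
    return f'{BASE_URL}{year}-{number}'

print(build_pr_url(2017, 1))
```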


In [2]:
# Say we want the 2016 and 2017 PR. We can see on the website that 
# there were 248 PR in 2017 and 283 PR in 2016.
# From that information, we build a list of PR to get.
pr_list = [(2017, x + 1) for x in range(248)] + [(2016, x + 1) for x in range(283)]
pr_list[-10:]

[(2016, 274),
 (2016, 275),
 (2016, 276),
 (2016, 277),
 (2016, 278),
 (2016, 279),
 (2016, 280),
 (2016, 281),
 (2016, 282),
 (2016, 283)]
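If more years are needed, the same list could be built from a mapping of year to number of press releases. A minimal sketch, where `pr_counts`, `build_pr_list`, and `pr_list_all` are hypothetical names and the counts are the ones read off the website above:

```python
# Number of press releases per year, as counted manually on the website
pr_counts = {2017: 248, 2016: 283}

def build_pr_list(counts):
    # One (year, number) pair per press release, numbered from 1
    return [(year, n) for year, total in counts.items() for n in range(1, total + 1)]

pr_list_all = build_pr_list(pr_counts)
print(len(pr_list_all))
print(pr_list_all[-1])
```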

In [3]:
# In class, we'll limit ourselves to the first 60 in 2017.
pr_list = [(2017, x + 1) for x in range(60)]

Next, we want to retrieve each press release and extract the textual content. Let's take 2017-1 as an example. The PR is available at https://www.sec.gov/news/pressrelease/2017-1.html. We see the content in the middle, but there is also a lot of other information (header, footer, menus, quick links, etc.). What the computer sees is the HTML code of the page, and that is what we should be looking at too. You can look at it using the *View source* menu option of your browser, or through this link (works with Firefox and Chrome): view-source:https://www.sec.gov/news/pressrelease/2017-1.html

Looking for the content in the page source, we see that the title is between `<h1 class="article-title"></h1>` tags, the date is between `<p class="article-location-publishdate"></p>` tags, and the actual body of the text is between `<div class="article-body"></div>` tags. This is the information we need to extract the data.

As we did in previous examples, we'll make our code work for one item (2017-1), and then we'll repackage our code in a function to apply it to every item.

In [4]:
pr_item = (2017, 1)

# First we load the page
url = 'https://www.sec.gov/news/press-release/' + str(pr_item[0]) + '-' + str(pr_item[1])
page = urlopen(url)

In [5]:
# Check the status. 200 means OK
page.status

200

In [6]:
# Parse the page content
soup = BeautifulSoup(page, 'html.parser')

In [7]:
# We can look at the page title. That might be an easier way to extract the news title
soup.title.contents

['SEC.gov | SEC Awards $5.5 Million to Whistleblower']

In [8]:
title = soup.title.contents[0][10:]
title

'SEC Awards $5.5 Million to Whistleblower'

In [9]:
# Now let's find the date
soup.find('p', attrs={'class' : 'article-location-publishdate'})

<p class="article-location-publishdate">
          Washington D.C.,           Jan. 6, 2017 —
        </p>

In [10]:
loc_date_str = soup.find('p', attrs={'class' : 'article-location-publishdate'}).get_text()
loc_date_str

'\n          Washington D.C.,           Jan. 6, 2017 —\n        '

In [11]:
location = loc_date_str.split(',')[0].strip()
location

'Washington D.C.'

In [12]:
date_str = loc_date_str.split(',')[1].strip() + ', ' + loc_date_str.split(',')[2][:5].strip()
date_str

'Jan. 6, 2017'

In [13]:
date = pd.to_datetime(date_str)
date

Timestamp('2017-01-06 00:00:00')
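Splitting on commas works for this page, but it would presumably break if a location itself contained a comma. As a possible alternative, here is a sketch that uses a regular expression and the standard library's `datetime` instead of pandas. The function name `split_location_date` is hypothetical, the second input string is made up for illustration, and month-name parsing with `%b` assumes an English locale.

```python
import re
from datetime import datetime

# Location, then a month name (optionally abbreviated with a period), day, and year
LOC_DATE_PATTERN = re.compile(
    r'(?P<location>.+?),\s*(?P<month>[A-Za-z]+)\.?\s+(?P<day>\d{1,2}),\s*(?P<year>\d{4})'
)

def split_location_date(raw):
    text = ' '.join(raw.split())  # collapse newlines and runs of spaces
    match = LOC_DATE_PATTERN.match(text)
    if match is None:
        return None
    # Keep the first three letters so "Jan", "Sept", and "March" all parse with %b
    month = match.group('month')[:3]
    date = datetime.strptime(f"{month} {match.group('day')} {match.group('year')}", '%b %d %Y')
    return match.group('location'), date

print(split_location_date('\n          Washington D.C.,           Jan. 6, 2017 \u2014\n        '))
print(split_location_date('New York, N.Y., Sept. 12, 2016'))
```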

In [14]:
# Finally, let's get the content
body = soup.find('div', attrs={'class' : 'article-body'}).get_text()
print(body)

The Securities and Exchange Commission today announced an award of more than $5.5 million to a whistleblower who provided critical information that helped the SEC uncover an ongoing scheme.
According to the SEC’s order, the whistleblower was employed at the company involved in the wrongdoing and reported the information directly to the SEC, which brought a successful enforcement action to end the scheme.
“Whistleblowers play a key role in bringing wrongdoing to the SEC’s attention, and this whistleblower helped prevent further harm to a vulnerable investor community by boldly stepping forward while still employed at the company,” said Jane Norberg, Chief of the SEC’s Office of the Whistleblower.
SEC enforcement actions from whistleblower tips have resulted in more than $904 million in financial remedies.
The SEC’s whistleblower program has now awarded approximately $142 million to 38 whistleblowers since issuing its first award in 2012. 
By law, the SEC protects the confidentiality of 

In [15]:
body

'The Securities and Exchange Commission today announced an award of more than $5.5 million to a whistleblower who provided critical information that helped the SEC uncover an ongoing scheme.\nAccording to the SEC’s order, the whistleblower was employed at the company involved in the wrongdoing and reported the information directly to the SEC, which brought a successful enforcement action to end the scheme.\n“Whistleblowers play a key role in bringing wrongdoing to the SEC’s attention, and this whistleblower helped prevent further harm to a vulnerable investor community by boldly stepping forward while still employed at the company,” said Jane Norberg, Chief of the SEC’s Office of the Whistleblower.\nSEC enforcement actions from whistleblower tips have resulted in more than $904 million in financial remedies.\nThe SEC’s whistleblower program has now awarded approximately $142 million to 38 whistleblowers since issuing its first award in 2012.\xa0\nBy law, the SEC protects the confidenti

In [16]:
# We want to get plain text only, so we need to remove all the non-visible characters
# such as `\xa0` and `\n` and replace them with a space.

# The '\n' (newline) is easy as it's a specific case
body = body.replace('\n', ' ')
body

'The Securities and Exchange Commission today announced an award of more than $5.5 million to a whistleblower who provided critical information that helped the SEC uncover an ongoing scheme. According to the SEC’s order, the whistleblower was employed at the company involved in the wrongdoing and reported the information directly to the SEC, which brought a successful enforcement action to end the scheme. “Whistleblowers play a key role in bringing wrongdoing to the SEC’s attention, and this whistleblower helped prevent further harm to a vulnerable investor community by boldly stepping forward while still employed at the company,” said Jane Norberg, Chief of the SEC’s Office of the Whistleblower. SEC enforcement actions from whistleblower tips have resulted in more than $904 million in financial remedies. The SEC’s whistleblower program has now awarded approximately $142 million to 38 whistleblowers since issuing its first award in 2012.\xa0 By law, the SEC protects the confidentiality

In [17]:
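# Handling `\xa0` and any similar characters that might appear needs more machinery.
# Note that the pattern below replaces every non-ASCII character, including curly
# quotes and apostrophes, with a space.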
# For that, we use regular expressions: https://docs.python.org/3/library/re.html

body = re.sub(r'[^\x00-\x7F]+',' ', body)
body

'The Securities and Exchange Commission today announced an award of more than $5.5 million to a whistleblower who provided critical information that helped the SEC uncover an ongoing scheme. According to the SEC s order, the whistleblower was employed at the company involved in the wrongdoing and reported the information directly to the SEC, which brought a successful enforcement action to end the scheme.  Whistleblowers play a key role in bringing wrongdoing to the SEC s attention, and this whistleblower helped prevent further harm to a vulnerable investor community by boldly stepping forward while still employed at the company,  said Jane Norberg, Chief of the SEC s Office of the Whistleblower. SEC enforcement actions from whistleblower tips have resulted in more than $904 million in financial remedies. The SEC s whistleblower program has now awarded approximately $142 million to 38 whistleblowers since issuing its first award in 2012.  By law, the SEC protects the confidentiality of
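The output above shows a side effect of stripping every non-ASCII character: the curly apostrophe in "SEC’s" becomes a space, leaving "SEC s". One possible refinement, which is not part of the original notebook, is to map common typographic characters to ASCII equivalents before applying the same regular expression. The function name `clean_text` and the choice of characters in `ASCII_EQUIVALENTS` are assumptions.

```python
import re

# Typographic characters mapped to plain ASCII before the catch-all replacement
ASCII_EQUIVALENTS = {
    '\u2019': "'",   # right single quote / apostrophe
    '\u2018': "'",   # left single quote
    '\u201c': '"',   # left double quote
    '\u201d': '"',   # right double quote
    '\xa0': ' ',     # non-breaking space
}

def clean_text(raw):
    for old, new in ASCII_EQUIVALENTS.items():
        raw = raw.replace(old, new)
    # Same catch-all as in the notebook: any remaining non-ASCII run becomes a space
    raw = re.sub(r'[^\x00-\x7F]+', ' ', raw)
    # Collapse newlines and repeated spaces into a single space
    return re.sub(r'\s+', ' ', raw).strip()

print(clean_text('the SEC\u2019s order\xa0\nBy law'))
```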

In [18]:
# Finally, we package the result in a dictionary. Once we have all the press releases
# in a list of dictionaries, it's very easy to convert to a pandas DataFrame.

result = {'year': pr_item[0],
          'item': pr_item[1],
          'title': title,
          'location': location,
          'date': date,
          'body': body}
result

{'year': 2017,
 'item': 1,
 'title': 'SEC Awards $5.5 Million to Whistleblower',
 'location': 'Washington D.C.',
 'date': Timestamp('2017-01-06 00:00:00'),
 'body': 'The Securities and Exchange Commission today announced an award of more than $5.5 million to a whistleblower who provided critical information that helped the SEC uncover an ongoing scheme. According to the SEC s order, the whistleblower was employed at the company involved in the wrongdoing and reported the information directly to the SEC, which brought a successful enforcement action to end the scheme.  Whistleblowers play a key role in bringing wrongdoing to the SEC s attention, and this whistleblower helped prevent further harm to a vulnerable investor community by boldly stepping forward while still employed at the company,  said Jane Norberg, Chief of the SEC s Office of the Whistleblower. SEC enforcement actions from whistleblower tips have resulted in more than $904 million in financial remedies. The SEC s whistleb

### Step 1 function

We're ready to package everything in a function. Because a lot of things might go wrong (e.g., a page might be missing), we want our code to keep running while still catching errors. For that, we'll use a `try` block and print out any error. For more information on catching errors, see https://docs.python.org/3/tutorial/errors.html.
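To illustrate the `try` block idea in isolation, here is a sketch that catches the two exception types `urllib` raises for failed requests. The function `fetch_html` and its `opener` parameter are hypothetical; the parameter exists only so the example can be run without network access, and `missing_page` is a stand-in that simulates a missing page.

```python
import io
from urllib.error import HTTPError, URLError
from urllib.request import urlopen

def fetch_html(url, opener=urlopen):
    """Return the raw page bytes, or None if the request fails."""
    try:
        with opener(url) as page:
            return page.read()
    except HTTPError as e:
        # The server answered, but with an error status such as 404.
        # HTTPError is a subclass of URLError, so it must be caught first.
        print(f'{url}: HTTP Error {e.code}: {e.reason}')
    except URLError as e:
        # The server could not be reached at all
        print(f'{url}: {e.reason}')
    return None

# Offline demonstration with a stand-in opener that always fails with a 404
def missing_page(url):
    raise HTTPError(url, 404, 'Not Found', None, io.BytesIO())

print(fetch_html('https://www.sec.gov/news/press-release/2017-58', opener=missing_page))
```

Catching specific exceptions rather than a bare `Exception` is a design choice: parsing bugs then surface as errors instead of being silently printed and skipped.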

In [19]:
# Good, now we're ready to package as a function
def download_and_parse_pr(pr_item):
    # First we load the page
    url = 'https://www.sec.gov/news/press-release/' + str(pr_item[0]) + '-' + str(pr_item[1])
    
    try:
        page = urlopen(url)

        # Parse the page content
        soup = BeautifulSoup(page, 'html.parser')

        # Get title
        title = soup.title.contents[0][10:]

        # Get location and date
        loc_date_str = soup.find('p', attrs={'class' : 'article-location-publishdate'}).get_text()
        location = loc_date_str.split(',')[0].strip()
        date_str = loc_date_str.split(',')[1].strip() + ', ' + loc_date_str.split(',')[2][:5].strip()
        date = pd.to_datetime(date_str)

        # Get the content
        body = soup.find('div', attrs={'class' : 'article-body'}).get_text()
        body = body.replace('\n', ' ')
        body = re.sub(r'[^\x00-\x7F]+',' ', body)

        result = {'year': pr_item[0],
                  'item': pr_item[1],
                  'title': title,
                  'location': location,
                  'date': date,
                  'body': body}
        return result
    
    except Exception as e:
        print(str(pr_item[0]) + '-' + str(pr_item[1]) + ': ' + str(e))
        return None

In [20]:
# Now download all the pages
results = []

for pr_item in pr_list:
    results.append(download_and_parse_pr(pr_item))

2017-58: HTTP Error 404: Not Found


In [21]:
# Package the result as dataframe
df_pr = pd.DataFrame([r for r in results if r is not None])

In [22]:
df_pr.head()

| | year | item | title | location | date | body |
|---|---|---|---|---|---|---|
| 0 | 2017 | 1 | SEC Awards $5.5 Million to Whistleblower | Washington D.C. | 2017-01-06 | The Securities and Exchange Commission today a... |
| 1 | 2017 | 2 | SEC Charges Two Brokers With Defrauding Customers | Washington D.C. | 2017-01-09 | The Securities and Exchange Commission today c... |
| 2 | 2017 | 3 | Investment Adviser, Lawyer Settle Charges in S... | Washington D.C. | 2017-01-09 | The Securities and Exchange Commission today a... |
| 3 | 2017 | 4 | SEC: Port Authority Omitted Risks to Investors... | Washington D.C. | 2017-01-10 | The Securities and Exchange Commission today a... |
| 4 | 2017 | 5 | SEC Charges Government Contractor With Inadequ... | Washington D.C. | 2017-01-11 | The Securities and Exchange Commission today a... |


In [23]:
# We can rearrange the columns
df_pr = df_pr[['year', 'item', 'date', 'location', 'title', 'body']]
df_pr

| | year | item | date | location | title | body |
|---|---|---|---|---|---|---|
| 0 | 2017 | 1 | 2017-01-06 | Washington D.C. | SEC Awards $5.5 Million to Whistleblower | The Securities and Exchange Commission today a... |
| 1 | 2017 | 2 | 2017-01-09 | Washington D.C. | SEC Charges Two Brokers With Defrauding Customers | The Securities and Exchange Commission today c... |
| 2 | 2017 | 3 | 2017-01-09 | Washington D.C. | Investment Adviser, Lawyer Settle Charges in S... | The Securities and Exchange Commission today a... |
| 3 | 2017 | 4 | 2017-01-10 | Washington D.C. | SEC: Port Authority Omitted Risks to Investors... | The Securities and Exchange Commission today a... |
| 4 | 2017 | 5 | 2017-01-11 | Washington D.C. | SEC Charges Government Contractor With Inadequ... | The Securities and Exchange Commission today a... |
| 5 | 2017 | 6 | 2017-01-12 | Washington D.C. | ITG Paying $24 Million for Improper Handling o... | The Securities and Exchange Commission today a... |
| 6 | 2017 | 7 | 2017-01-12 | Washington D.C. | SEC Announces 2017 Examination Priorities | The Securities and Exchange Commission today a... |
| 7 | 2017 | 8 | 2017-01-12 | Washington D.C. | Biomet Charged With Repeating FCPA Violations | The Securities and Exchange Commission today a... |
| 8 | 2017 | 9 | 2017-01-12 | Washington D.C. | BNY Mellon Settles Charges Stemming From Misca... | The Securities and Exchange Commission today a... |
| 9 | 2017 | 10 | 2017-01-12 | Washington D.C. | Michael J. Osnato Jr., Chief of Enforcement Di... | The Securities and Exchange Commission today a... |


## Step 2: Textual analysis

For this step, we want to look at the body of the text, and figure out the "sentiment," i.e., is it good or bad news. The algorithm we'll use for that is simple: it looks at the words appearing in the text and assigns a sentiment score based on those words. 

Textual analysis is done with the [Natural Language Toolkit (NLTK)](http://www.nltk.org/). We also need a context-specific dictionary (score associated with words). The standard one in finance is the [Loughran and McDonald word list](https://www3.nd.edu/~mcdonald/Word_Lists.html), so we'll use that one.


The idea is the same as before: we'll work with one piece of text and, once everything works, package it in a function.

In [24]:
text = df_pr[df_pr.item==1].iloc[0].body
text

'The Securities and Exchange Commission today announced an award of more than $5.5 million to a whistleblower who provided critical information that helped the SEC uncover an ongoing scheme. According to the SEC s order, the whistleblower was employed at the company involved in the wrongdoing and reported the information directly to the SEC, which brought a successful enforcement action to end the scheme.  Whistleblowers play a key role in bringing wrongdoing to the SEC s attention, and this whistleblower helped prevent further harm to a vulnerable investor community by boldly stepping forward while still employed at the company,  said Jane Norberg, Chief of the SEC s Office of the Whistleblower. SEC enforcement actions from whistleblower tips have resulted in more than $904 million in financial remedies. The SEC s whistleblower program has now awarded approximately $142 million to 38 whistleblowers since issuing its first award in 2012.  By law, the SEC protects the confidentiality of

In [25]:
# First, we want to extract all the words in the text.
# The regexp_tokenize line converts everything to lowercase, keeps only words (i.e., drops numbers and
# punctuation), and splits the text into tokens.
tokens = nltk.regexp_tokenize(text.lower(), '[a-z]+')
tokens[:10]

['the',
 'securities',
 'and',
 'exchange',
 'commission',
 'today',
 'announced',
 'an',
 'award',
 'of']

In [26]:
# NLTK has a lot of functions for analysis, from simple ones like this to very complex ones.
# Let's look at the 20 most frequent words
freq = nltk.FreqDist(tokens)
pd.Series(freq).sort_values(ascending=False).head(20)

the               21
to                12
sec               11
whistleblower     11
a                  7
of                 6
and                6
in                 5
information        5
s                  5
that               4
by                 4
million            4
an                 4
whistleblowers     4
enforcement        3
more               3
from               3
award              3
successful         2
dtype: int64
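The same kind of frequency count can be sketched with the standard library alone, which may help clarify what the tokenizer and `FreqDist` are doing. Here `re.findall` is used as a rough stand-in for `nltk.regexp_tokenize`, and the sample sentence is made up for illustration. Note how the letters-only pattern splits "SEC's" into `sec` and a stray `s`.

```python
import re
from collections import Counter

# Made-up sample text, not taken from a press release
sample = "The SEC's order says the SEC awarded the tip."

# Keep only runs of lowercase letters, as in the '[a-z]+' pattern above
sample_tokens = re.findall(r'[a-z]+', sample.lower())
counts = Counter(sample_tokens)

print(sample_tokens)
print(counts.most_common(2))
```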

In [27]:
# We typically want to remove the stop words (frequent
# words like "a" and "the").
# We'll use NLTK's default English stop word corpus, but
# first we need to download it if we haven't already.
# If it's the first time running this code, uncomment
# the last line and run it. Then download the "stopwords"
# corpus.
# Note that for our purposes this doesn't make a
# difference, so you can skip the filtering on stop
# words.

#nltk.download()

In [28]:
nltk.download('stopwords')


[nltk_data] Downloading package stopwords to /Users/leah/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [29]:
sr = nltk.corpus.stopwords.words('english')
sr[:10]

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]

In [30]:
clean_tokens = []
for t in tokens:
    if t not in sr:
        clean_tokens.append(t)
clean_tokens[:10]

['securities',
 'exchange',
 'commission',
 'today',
 'announced',
 'award',
 'million',
 'whistleblower',
 'provided',
 'critical']

In [31]:
# We could actually do all that in one go:
tokens = []
for t in nltk.regexp_tokenize(text.lower(), '[a-z]+'):
    if t not in sr:
        tokens.append(t)
tokens[:10]

['securities',
 'exchange',
 'commission',
 'today',
 'announced',
 'award',
 'million',
 'whistleblower',
 'provided',
 'critical']
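The tokenize-and-filter step can also be written as a single list comprehension. The sketch below is self-contained, so it uses `re.findall` in place of `nltk.regexp_tokenize` and a tiny made-up stop word set instead of the NLTK corpus; `tokenize` and `sample_stop_words` are hypothetical names. Using a `set` for the stop words keeps each membership test fast.

```python
import re

# Hypothetical, tiny stop word set used only for illustration
sample_stop_words = {'the', 'and', 'an', 'of', 'to', 'a'}

def tokenize(text, stop_words):
    # Lowercase, keep letter-only tokens, and drop the stop words
    return [t for t in re.findall(r'[a-z]+', text.lower()) if t not in stop_words]

print(tokenize('The Securities and Exchange Commission today announced an award',
               sample_stop_words))
```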

In [32]:
# Next we may want to stem the words (remove the ending)
# Note: not necessary for us, the LM dictionary is not stemmed.
stemmer = nltk.PorterStemmer()  # create the stemmer once and reuse it
stemmed_tokens = []
for t in clean_tokens:
    stemmed_tokens.append(stemmer.stem(t))
stemmed_tokens[:10]

['secur',
 'exchang',
 'commiss',
 'today',
 'announc',
 'award',
 'million',
 'whistleblow',
 'provid',
 'critic']

In [33]:
# Before we can do the sentiment analysis, we need to load the Loughran and McDonald dictionary.
# See https://www3.nd.edu/~mcdonald/Word_Lists.html
lmdict = pd.read_excel('https://www3.nd.edu/~mcdonald/Word_Lists_files/LoughranMcDonald_MasterDictionary_2014.xlsx')

In [34]:
lmdict.head()

Unnamed: 0,Word,Sequence Number,Word Count,Word Proportion,Average Proportion,Std Dev,Doc Count,Negative,Positive,Uncertainty,Litigious,Constraining,Superfluous,Interesting,Modal,Irr_Verb,Harvard_IV,Syllables,Source
0,AARDVARK,1,81,5.690194e-09,3.06874e-09,5.779943e-07,45,0,0,0,0,0,0,0,0,0,0,2,12of12inf
1,AARDVARKS,2,2,1.404986e-10,8.217606e-12,7.84187e-09,1,0,0,0,0,0,0,0,0,0,0,2,12of12inf
2,ABACI,3,8,5.619945e-10,1.686149e-10,7.09624e-08,7,0,0,0,0,0,0,0,0,0,0,3,12of12inf
3,ABACK,4,5,3.512466e-10,1.727985e-10,7.532677e-08,5,0,0,0,0,0,0,0,0,0,0,2,12of12inf
4,ABACUS,5,1752,1.230768e-07,1.198634e-07,1.110293e-05,465,0,0,0,0,0,0,0,0,0,0,3,12of12inf


In [35]:
lmdict.tail()

Unnamed: 0,Word,Sequence Number,Word Count,Word Proportion,Average Proportion,Std Dev,Doc Count,Negative,Positive,Uncertainty,Litigious,Constraining,Superfluous,Interesting,Modal,Irr_Verb,Harvard_IV,Syllables,Source
85126,ZYGOTE,85127,35,2.458726e-09,1.025127e-09,2.320929e-07,25,0,0,0,0,0,0,0,0,0,0,2,12of12inf
85127,ZYGOTES,85128,1,7.024931e-11,2.593031e-11,2.474469e-08,1,0,0,0,0,0,0,0,0,0,0,2,12of12inf
85128,ZYGOTIC,85129,0,0.0,0.0,0.0,0,0,0,0,0,0,0,0,0,0,0,3,12of12inf
85129,ZYMURGIES,85130,0,0.0,0.0,0.0,0,0,0,0,0,0,0,0,0,0,0,3,12of12inf
85130,ZYMURGY,85131,0,0.0,0.0,0.0,0,0,0,0,0,0,0,0,0,0,0,3,12of12inf


In [36]:
# So there are roughly 85k words in there. What are some positive words?
lmdict[lmdict.Positive != 0].head()

Unnamed: 0,Word,Sequence Number,Word Count,Word Proportion,Average Proportion,Std Dev,Doc Count,Negative,Positive,Uncertainty,Litigious,Constraining,Superfluous,Interesting,Modal,Irr_Verb,Harvard_IV,Syllables,Source
125,ABLE,126,3253260,0.0002285393,0.0002318702,0.000343,553588,0,2009,0,0,0,0,0,0,0,0,2,12of12inf
334,ABUNDANCE,335,5014,3.5223e-07,3.649827e-07,8e-06,4364,0,2009,0,0,0,0,0,0,0,0,3,12of12inf
336,ABUNDANT,337,8824,6.198799e-07,5.78917e-07,1.1e-05,6648,0,2009,0,0,0,0,0,0,0,0,3,12of12inf
435,ACCLAIMED,436,1513,1.062872e-07,1.003959e-07,4e-06,1168,0,2009,0,0,0,0,0,0,0,0,2,12of12inf
474,ACCOMPLISH,475,142345,9.999638e-06,1.043611e-05,5.4e-05,95816,0,2009,0,0,0,0,0,0,0,0,3,12of12inf


In [37]:
# And negative ones?
lmdict[lmdict.Negative != 0].head()

Unnamed: 0,Word,Sequence Number,Word Count,Word Proportion,Average Proportion,Std Dev,Doc Count,Negative,Positive,Uncertainty,Litigious,Constraining,Superfluous,Interesting,Modal,Irr_Verb,Harvard_IV,Syllables,Source
9,ABANDON,10,80492,5.654508e-06,5.249083e-06,4.1e-05,45941,2009,0,0,0,0,0,0,0,0,1,3,12of12inf
10,ABANDONED,11,174298,1.224431e-05,1.221126e-05,8.8e-05,83234,2009,0,0,0,0,0,0,0,0,2,3,12of12inf
11,ABANDONING,12,15926,1.118791e-06,9.946277e-07,1.5e-05,10125,2009,0,0,0,0,0,0,0,0,2,4,12of12inf
12,ABANDONMENT,13,177889,1.249658e-05,1.187335e-05,8.2e-05,65686,2009,0,0,0,0,0,0,0,0,1,4,12of12inf
13,ABANDONMENTS,14,7091,4.981379e-07,6.848194e-07,1.7e-05,3891,2009,0,0,0,0,0,0,0,0,2,4,12of12inf


In [38]:
# Ok, so the number in the columns is not a dummy (1), but the year it was added to the dictionary.
# We need to get a list of all the negative and positive words.
neg_words =  lmdict.loc[lmdict.Negative != 0, 'Word'].str.lower().unique()
pos_words =  lmdict.loc[lmdict.Positive != 0, 'Word'].str.lower().unique()

In [39]:
neg_words[:20]

array(['abandon', 'abandoned', 'abandoning', 'abandonment',
       'abandonments', 'abandons', 'abdicated', 'abdicates', 'abdicating',
       'abdication', 'abdications', 'aberrant', 'aberration',
       'aberrational', 'aberrations', 'abetting', 'abnormal',
       'abnormalities', 'abnormality', 'abnormally'], dtype=object)

In [40]:
pos_words[:20]

array(['able', 'abundance', 'abundant', 'acclaimed', 'accomplish',
       'accomplished', 'accomplishes', 'accomplishing', 'accomplishment',
       'accomplishments', 'achieve', 'achieved', 'achievement',
       'achievements', 'achieves', 'achieving', 'adequately',
       'advancement', 'advancements', 'advances'], dtype=object)

In [41]:
# Count the number of positive and negative words.
pos_count = 0
neg_count = 0
for t in tokens:
    if t in pos_words:
        pos_count += 1
    elif t in neg_words:
        neg_count += 1
print('Positive count: ' + str(pos_count))
print('Negative count: ' + str(neg_count))

Positive count: 2
Negative count: 9
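Testing membership against a NumPy array scans the array for every token, so for larger corpora it may be worth converting the word lists to sets first. The sketch below illustrates the counting logic with made-up miniature word lists; they are not taken from the Loughran and McDonald dictionary, and `count_pos_neg` is a hypothetical name.

```python
from collections import Counter

# Hypothetical miniature word lists standing in for the real dictionary
sample_pos = {'successful', 'boldly'}
sample_neg = {'wrongdoing', 'harm', 'scheme'}

def count_pos_neg(tokens, pos, neg):
    counts = Counter(tokens)
    pos_count = sum(n for word, n in counts.items() if word in pos)
    # A word found in both lists is counted as positive only,
    # mirroring the if/elif logic above
    neg_count = sum(n for word, n in counts.items() if word in neg and word not in pos)
    return pos_count, neg_count

sample_tokens = ['successful', 'scheme', 'wrongdoing', 'scheme', 'award']
print(count_pos_neg(sample_tokens, sample_pos, sample_neg))
```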


In [42]:
# A crude measure of sentiment is the normalized difference between the
# number of positive and negative words
sentiment = (pos_count - neg_count)/(pos_count + neg_count)
sentiment

-0.6363636363636364

In [43]:
# Now that may cause problems if we detect no positive
# or negative words (division by zero). In that case,
# we can assume the text is neutral (sentiment = 0)
if (pos_count + neg_count) > 0:
    sentiment = (pos_count - neg_count)/(pos_count + neg_count)
else:
    sentiment = 0
sentiment

-0.6363636363636364
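Written as a formula, with $P$ and $N$ as shorthand (introduced here, not in the original notebook) for the positive and negative word counts, the measure and its value for this press release ($P = 2$, $N = 9$) are:

```latex
\[
\text{sentiment} = \frac{P - N}{P + N},
\qquad
\frac{2 - 9}{2 + 9} = -\frac{7}{11} \approx -0.636
\]
```

The measure ranges from $-1$ (only negative words) to $+1$ (only positive words), and is set to $0$ by convention when $P + N = 0$.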

### Step 2 function

Now we have all we need for a function. There is no need to reload the dictionary for every bit of text, so we'll skip this step. Since the variables `pos_words` and `neg_words` have been defined in the global context (i.e. not in a function), the function will have access to them (the reverse would not be true). The function will return the sentiment measure.
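The scoping rule mentioned above can be illustrated with a toy example. All names here (`word_list`, `count_matches`, `local_total`) are hypothetical and unrelated to the dictionary variables.

```python
word_list = {'able', 'achieve'}  # defined at the global (module) level

def count_matches(tokens):
    # The function can read the global name word_list
    local_total = sum(1 for t in tokens if t in word_list)
    return local_total

print(count_matches(['able', 'to', 'achieve']))

# The reverse does not hold: local_total only exists inside the function
try:
    local_total
except NameError as e:
    print('NameError:', e)
```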

In [44]:
def compute_sentiment(text):
    # Tokenize and remove stop words
    tokens = []
    for t in nltk.regexp_tokenize(text.lower(), '[a-z]+'):
        if t not in sr:
            tokens.append(t)
    
    # Count the number of positive and negative words.
    pos_count = 0
    neg_count = 0
    for t in tokens:
        if t in pos_words:
            pos_count += 1
        elif t in neg_words:
            neg_count += 1
            
    # Compute sentiment
    if (pos_count + neg_count) > 0:
        sentiment = (pos_count - neg_count)/(pos_count + neg_count)
    else:
        sentiment = 0
    return sentiment

In [45]:
# Test it
compute_sentiment(text)

-0.6363636363636364

In [46]:
# Now we can apply it to all our press releases
df_pr['sentiment'] = df_pr['body'].apply(compute_sentiment)

In [47]:
# Top 10 negative news
df_pr.sort_values('sentiment').iloc[:10]

| | year | item | date | location | title | body | sentiment |
|---|---|---|---|---|---|---|---|
| 27 | 2017 | 28 | 2017-01-23 | Washington D.C. | SEC Announces Fraud Charges, Asset Freeze in A... | The Securities and Exchange Commission today a... | -1.0 |
| 1 | 2017 | 2 | 2017-01-09 | Washington D.C. | SEC Charges Two Brokers With Defrauding Customers | The Securities and Exchange Commission today c... | -1.0 |
| 2 | 2017 | 3 | 2017-01-09 | Washington D.C. | Investment Adviser, Lawyer Settle Charges in S... | The Securities and Exchange Commission today a... | -1.0 |
| 46 | 2017 | 47 | 2017-02-14 | Washington D.C. | Purported Real Estate Investment Manager Settl... | The Securities and Exchange Commission today a... | -1.0 |
| 32 | 2017 | 33 | 2017-01-25 | Washington D.C. | Brokerage Firm Charged With Gatekeeper Failure... | The Securities and Exchange Commission today a... | -1.0 |
| 5 | 2017 | 6 | 2017-01-12 | Washington D.C. | ITG Paying $24 Million for Improper Handling o... | The Securities and Exchange Commission today a... | -1.0 |
| 44 | 2017 | 45 | 2017-02-13 | Washington D.C. | Brokerage Firm Paying Penalty for Compliance a... | The Securities and Exchange Commission today a... | -1.0 |
| 34 | 2017 | 35 | 2017-01-26 | Washington D.C. | Citigroup Paying $18 Million for Overbilling C... | The Securities and Exchange Commission today a... | -1.0 |
| 8 | 2017 | 9 | 2017-01-12 | Washington D.C. | BNY Mellon Settles Charges Stemming From Misca... | The Securities and Exchange Commission today a... | -1.0 |
| 42 | 2017 | 43 | 2017-02-08 | Washington D.C. | SEC Announces Agenda for February 15 Meeting o... | The Securities and Exchange Commission today a... | -1.0 |


In [48]:
# Top 10 positive news
df_pr.sort_values('sentiment', ascending=False).iloc[:10]

| | year | item | date | location | title | body | sentiment |
|---|---|---|---|---|---|---|---|
| 52 | 2017 | 53 | 2017-02-23 | Washington D.C. | SEC Announces Agenda for March 9 Investor Advi... | The Securities and Exchange Commission today a... | 1.0 |
| 19 | 2017 | 20 | 2017-01-18 | Washington D.C. | SEC Deputy Chief of Staff Nathaniel Stankard ... | The Securities and Exchange Commission today a... | 1.0 |
| 57 | 2017 | 59 | 2017-03-02 | Washington D.C. | SEC’s Office of the Investor Advocate to Hold ... | The Securities and Exchange Commission s Offic... | 0.818182 |
| 22 | 2017 | 23 | 2017-01-19 | Washington D.C. | SEC Chief of Staff Andrew J. Donohue to Leave ... | The Securities and Exchange Commission today a... | 0.714286 |
| 37 | 2017 | 38 | 2017-01-30 | Washington D.C. | OCIE Director Marc Wyatt to Leave SEC | The Securities and Exchange Commission today a... | 0.692308 |
| 50 | 2017 | 51 | 2017-02-21 | Washington D.C. | SEC to Host Crowdfunding Dialogue February 28 | The Securities and Exchange Commission will ho... | 0.6 |
| 49 | 2017 | 50 | 2017-02-17 | Washington D.C. | SEC, NASAA Sign Info-Sharing Agreement for Cro... | The Securities and Exchange Commission and the... | 0.571429 |
| 51 | 2017 | 52 | 2017-02-23 | Washington D.C. | SEC Staff Issues Guidance Update and Investor ... | The Securities and Exchange Commission toda... | 0.5 |
| 35 | 2017 | 36 | 2017-01-27 | Washington D.C. | Chief Operating Officer Jeffery Heslop to Leav... | The Securities and Exchange Commission today a... | 0.5 |
| 55 | 2017 | 56 | 2017-03-01 | Washington D.C. | SEC Proposes Inline XBRL Filing of Tagged Data | The Securities and Exchange Commission today v... | 0.5 |
