## News Named Entity Recognition (NER) and Sentiment Analysis

In [0]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


###### Load dependency libraries 

In [0]:
!pip install psycopg2-binary
!pip install feedparser
import re  # used by get_sentiment() to clean text
import pandas as pd
import numpy as np
from textblob import TextBlob
import psycopg2
from pandas import json_normalize  # pandas.io.json.json_normalize is deprecated since pandas 1.0
import seaborn as sns
import feedparser
import requests
import json
import yaml



**News / NLP Signal Pipeline**

1. **Fetch news** - read in a news source via an RSS feed (Feedparser)
2. **Extract entities** - perform named entity recognition (NER) on the unstructured text (Thomson Reuters Intelligent Tagging (TRIT) / Refinitiv Open Calais)
3. **Filter on news and entities** - filter on entities and events of interest
4. **Sentiment analysis** - extract sentiment on the news item (TextBlob)
5. **Find signal** - correlate sentiment with price movement
6. **Historical EOD prices** - fetch historical prices (Eodhistoricaldata.com)
7. **Backtesting** - backtest for PnL performance (internal or Zipline)

https://www.altsignals.ai



In [0]:
# Dictionary of RSS feeds that we will fetch and combine
# GlobeNewswire / Europe - http://www.globenewswire.com/Rss/List
# potential keys: ['summary_detail', 'published_parsed', 'links', 'title', 'summary', 'guidislink', 'title_detail', 'link', 'published', 'id']
newsurls = {
    'globenewswire-us':           'http://www.globenewswire.com/RssFeed/country/United%20States/feedTitle/GlobeNewswire%20-%20News%20from%20United%20States',
}

**1. Fetch news from RSS feed**

In [0]:
# Function to fetch the rss feed and return the parsed RSS
def parse_rss( rss_url ):
    return feedparser.parse( rss_url ) 
    
# Function grabs the rss feed headlines (titles) and returns them as a list
def get_headlines( rss_url ):
    headlines = []
    feed = parse_rss( rss_url )
    for newsitem in feed['items']:
        headlines.append(newsitem['title'])
    return headlines

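# Function grabs the rss feed summaries (HTML fragments) and returns them as a list
def get_summaries( rss_url ):
    summaries = []
    feed = parse_rss( rss_url )
    for newsitem in feed['items']:
        summaries.append(newsitem['summary'])
    return summaries

# Function returns, for each feed item, the keys (fields) available on that item
def get_entries( rss_url ):
    entries = []
    feed = parse_rss( rss_url )
    for newsitem in feed['items']:
        entries.append(newsitem.keys())
    return entries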
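
Each helper above downloads and parses the feed again, so collecting headlines and summaries costs one request per helper per URL. A minimal sketch of a single-pass alternative follows. `entries_to_records` is a hypothetical helper name, the sample entries are illustrative stand-ins, and the only assumption is that each entry behaves like a mapping with a `.get` method, as feedparser entries do.

```python
def entries_to_records(entries, fields=('published', 'title', 'summary')):
    """Pick the given fields from each feed entry; missing fields become None."""
    return [{field: entry.get(field) for field in fields} for entry in entries]

# Stand-in entries shaped like feedparser items (illustrative values only)
sample_entries = [
    {'published': 'Sun, 26 Apr 2020 13:49 GMT', 'title': 'Velocity (VEL) Alert', 'summary': '<p>...</p>'},
    {'published': 'Sun, 26 Apr 2020 05:51 GMT', 'title': 'ALIGN DEADLINE ALERT'},
]
records = entries_to_records(sample_entries)
```

With a live feed this could be called as `entries_to_records(parse_rss(url)['items'])`, so each URL is parsed once.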

**1.1 Inspect entries available in news feed**

In [0]:
# Inspect the entries available in the RSS feed
entries = []

# Iterate over the feed urls
for key,url in newsurls.items():
    # Call get_entries() and collect the available keys of every feed item
    entries.extend( get_entries( url ) )

print(entries[0])

dict_keys(['id', 'guidislink', 'link', 'links', 'tags', 'title', 'title_detail', 'summary', 'summary_detail', 'published', 'published_parsed', 'dc_identifier', 'language', 'publisher', 'publisher_detail', 'contributors', 'dc_modified', 'dc_keyword'])


**1.2 Build dictionary of {published_date: headline}**

In [0]:


# Function grabs the rss feed headlines (titles) and returns them as a dict keyed by published date
def get_headlines_dict( rss_url ):
    feed = parse_rss( rss_url )
    hl_dict = {}
    for newsitem in feed['items']:
        hl_dict[newsitem['published']] = newsitem['title']
    return hl_dict

# Iterate over the feed urls and merge the headlines of every feed into one dict
hl_dict = {}
for key,url in newsurls.items():
    hl_dict.update( get_headlines_dict( url ) )

print(hl_dict)
#hf_df = pd.DataFrame(list(hl_dict.items()), columns=['published_date','headline'])
#print(hf_df)

{'Sun, 26 Apr 2020 15:17 GMT': 'PHARMACIELO DEADLINE ALERT: Faruqi & Faruqi, LLP Encourages Investors Who Suffered Losses Exceeding $50,000 In PharmaCielo Ltd. To Contact The Firm', 'Sun, 26 Apr 2020 13:49 GMT': 'Velocity (VEL) Alert: Johnson Fistel Investigates Velocity Financial, Inc.; Investors Suffering Losses Encouraged to Contact Firm', 'Sun, 26 Apr 2020 05:51 GMT': 'ALIGN DEADLINE ALERT: Faruqi & Faruqi, LLP Encourages Investors Who Suffered Losses Exceeding $100,000 In Align Technology, Inc. To Contact The Firm', 'Sun, 26 Apr 2020 05:34 GMT': 'ALLAKOS DEADLINE ALERT: Faruqi & Faruqi, LLP Encourages Investors Who Suffered Losses Exceeding $50,000 In Allakos Inc. To Contact The Firm', 'Sun, 26 Apr 2020 05:04 GMT': 'FUNKO LEAD PLAINTIFF DEADLINE ALERT: Faruqi & Faruqi, LLP Encourages Investors Who Suffered Losses Exceeding $50,000 In Funko, Inc. To Contact The Firm', 'Sun, 26 Apr 2020 04:05 GMT': 'WWE DEADLINE ALERT: Faruqi & Faruqi, LLP Encourages Investors Who Suffered Losses Ex
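
The dictionary keys above are plain strings, so they cannot be sorted or joined to price dates directly, and two items published in the same minute would overwrite each other. The sketch below parses the strings into timezone-aware datetimes and returns a sorted list of pairs instead. `parse_published` and `headlines_by_time` are hypothetical names, the format string is inferred from the output above, and the day and month names are assumed to be in English. feedparser also exposes a `published_parsed` field (listed among the entry keys) that may serve the same purpose.

```python
from datetime import datetime, timezone

def parse_published(published):
    """Parse a string such as 'Sun, 26 Apr 2020 15:17 GMT' into an aware UTC datetime."""
    naive = datetime.strptime(published, '%a, %d %b %Y %H:%M %Z')
    return naive.replace(tzinfo=timezone.utc)

def headlines_by_time(hl_dict):
    """Return (datetime, headline) pairs sorted from oldest to newest."""
    records = [(parse_published(published), title) for published, title in hl_dict.items()]
    return sorted(records, key=lambda record: record[0])

# Illustrative input in the same shape as hl_dict
sample = {
    'Sun, 26 Apr 2020 15:17 GMT': 'headline B',
    'Sun, 26 Apr 2020 05:51 GMT': 'headline A',
}
ordered = headlines_by_time(sample)
```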

In [0]:
# A list to hold all headlines and summaries
allheadlines = []
summaries = []

# Iterate over the feed urls
for key,url in newsurls.items():
    # Call get_headlines() and get_summaries() and combine the results across feeds
    allheadlines.extend( get_headlines( url ) )
    summaries.extend( get_summaries( url ) )

**1.3 View headlines**

In [0]:
# Iterate over the allheadlines list and print each headline
for hl in allheadlines:
    print(hl)

Velocity (VEL) Alert: Johnson Fistel Investigates Velocity Financial, Inc.; Investors Suffering Losses Encouraged to Contact Firm
ALIGN DEADLINE ALERT: Faruqi & Faruqi, LLP Encourages Investors Who Suffered Losses Exceeding $100,000 In Align Technology, Inc. To Contact The Firm
ALLAKOS DEADLINE ALERT: Faruqi & Faruqi, LLP Encourages Investors Who Suffered Losses Exceeding $50,000 In Allakos Inc. To Contact The Firm
FUNKO LEAD PLAINTIFF DEADLINE ALERT: Faruqi & Faruqi, LLP Encourages Investors Who Suffered Losses Exceeding $50,000 In Funko, Inc. To Contact The Firm
WWE DEADLINE ALERT: Faruqi & Faruqi, LLP Encourages Investors Who Suffered Losses Exceeding $50,000 in World Wrestling Entertainment, Inc. to Contact the Firm
ROSEN, A GLOBALLY RECOGNIZED LAW FIRM, Reminds Golden Star Resources Ltd. Investors of Important Deadline in Securities Class Action – GSS
ROSEN, A GLOBALLY RECOGNIZED LAW FIRM, Reminds LogicBio Therapeutics, Inc. Investors of the Important Deadline in Securities Class 

**1.4 View news summaries**


In [0]:
# Iterate over the summaries list and print each summary
# TODO: see if HTML chars can be removed
for s in summaries:
    print(s)

<p>SAN DIEGO, April  26, 2020  (GLOBE NEWSWIRE) -- Shareholder rights law firm Johnson Fistel, LLP is investigating potential violations of the federal securities laws by Velocity Financial, Inc. ("Velocity" or "the Company") (NYSE: VEL).<br></p>
<p align="justify">NEW YORK, April  26, 2020  (GLOBE NEWSWIRE) -- Faruqi &amp; Faruqi, LLP, a leading national securities law firm, reminds investors in Align Technology, Inc. (“Align” or the “Company”) (NASDAQ:ALGN) of the May 1, 2020 deadline to seek the role of lead plaintiff in a federal securities class action that has been filed against the Company.</p>
<p>NEW YORK, April  26, 2020  (GLOBE NEWSWIRE) -- Faruqi &amp; Faruqi, LLP, a leading national securities law firm, reminds investors in Allakos Inc. (“Allakos” or the “Company”) (NASDAQ:ALLK) of the May 11, 2020 deadline to seek the role of lead plaintiff in a federal securities class action that has been filed against the Company.</p>
<p align="left">NEW YORK, April  26, 2020  (GLOBE NE
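
The summaries arrive as HTML fragments with tags such as `<p>` and `<br>` and entities such as `&amp;`. One possible way to address the TODO above, using only the standard library, is sketched below. `strip_html` is a hypothetical helper name; treating `<p>` and `<br>` as word breaks is an assumption about how the text should read.

```python
from html.parser import HTMLParser

class _TextExtractor(HTMLParser):
    """Collect the text content of an HTML fragment and ignore the tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)  # turns &amp; into & automatically
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in ('p', 'br'):
            self.parts.append(' ')  # keep words on either side of a break apart

    def handle_data(self, data):
        self.parts.append(data)

def strip_html(fragment):
    """Return the plain text of an HTML fragment with whitespace collapsed."""
    parser = _TextExtractor()
    parser.feed(fragment)
    parser.close()
    return ' '.join(''.join(parser.parts).split())

plain = strip_html('<p align="justify">NEW YORK, April  26, 2020 -- Faruqi &amp; Faruqi, LLP<br></p>')
```

Applied to the feed, `[strip_html(s) for s in summaries]` would give plain-text summaries.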

**2. Named Entity Recognition (NER) - make an API call to Thomson Reuters Intelligent Tagging (TRIT) with news headline content**


* https://developers.refinitiv.com/open-permid/intelligent-tagging-restful-apiquick-start
* https://developers.refinitiv.com/article/intelligent-tagging-extract-information-api-response
* https://permid.org/faq

In [0]:
# Define sample content to be queried
contentText = allheadlines[1]
print(contentText)

Velocity (VEL) Alert: Johnson Fistel Investigates Velocity Financial, Inc.; Investors Suffering Losses Encouraged to Contact Firm


**2.1 Query TRIT / OpenCalais JSON API**

In [0]:

headType = "text/raw"
token = 'oSyQfYcRShExGJmJPXRgr4kOFAsIHqoJ'
url = "https://api-eit.refinitiv.com/permid/calais"
payload = contentText.encode('utf8')
headers = {
    'Content-Type': headType,
    'X-AG-Access-Token': token,
    'outputformat': "application/json"
    }

# The daily limit is 5,000 requests, and the concurrent limit varies by API from 1-4 calls per second.
TRITResponse = requests.post(url, data=payload, headers=headers)
# Load content into JSON object
JSONResponse = json.loads(TRITResponse.text)
# print(json.dumps(JSONResponse, indent=4, sort_keys=True))
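
The cell above tags a single headline. When tagging every headline in a feed, the per-second limit mentioned in the comment becomes relevant. The following is a minimal sketch of spacing calls out, not part of the original notebook: `tag_texts` and `tag_fn` are hypothetical names, and the one-second default interval is an assumption chosen to stay at the conservative end of the stated 1-4 calls per second.

```python
import time

def tag_texts(texts, tag_fn, min_interval=1.0, sleep=time.sleep, clock=time.monotonic):
    """Call tag_fn(text) for each text, keeping at least min_interval seconds between call starts."""
    results = []
    last_call = None
    for text in texts:
        if last_call is not None:
            wait = min_interval - (clock() - last_call)
            if wait > 0:
                sleep(wait)
        last_call = clock()
        results.append(tag_fn(text))
    return results

# Usage with a stand-in tagger; a real tag_fn would POST the text to the API
tagged = tag_texts(['headline one', 'headline two'], tag_fn=str.upper, min_interval=0.0)
```

Passing `sleep` and `clock` as parameters keeps the helper easy to exercise without waiting or touching the network.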

**2.2 Get entities in news**

In [0]:
#Get Entities
print('====Entities====')
print('Type, Name')

for key in JSONResponse:
    if ('_typeGroup' in JSONResponse[key]):
        if JSONResponse[key]['_typeGroup'] == 'entities':
            print(JSONResponse[key]['_type'] + ", " + JSONResponse[key]['name'])

====Entities====
Type, Name
Company, JOHNSON FISTEL
Company, velocity financial, inc.
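
The loop above prints entities as it finds them. To reuse them in later steps it can help to return them as data. A minimal sketch follows; `items_in_group` is a hypothetical helper name, and `sample_response` is a hand-made, trimmed-down stand-in that contains only the fields the notebook reads (`_typeGroup`, `_type`, `name`), not a real API response.

```python
def items_in_group(response, type_group):
    """Return the response values whose '_typeGroup' equals type_group."""
    return [
        value for value in response.values()
        if isinstance(value, dict) and value.get('_typeGroup') == type_group
    ]

# Hypothetical stand-in with only the fields read by the cell above
sample_response = {
    'doc': {'info': {}},
    'id-1': {'_typeGroup': 'entities', '_type': 'Company', 'name': 'JOHNSON FISTEL'},
    'id-2': {'_typeGroup': 'topics', 'name': 'Business_Finance', 'score': 1},
}
entities = [(item['_type'], item['name']) for item in items_in_group(sample_response, 'entities')]
```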


**2.3 Get RIC code for entity**

In [0]:
# Get the primary RIC code of each entity that resolves to a company

print('====RIC====')
print('RIC')

symbol = None  # stays None when no entity resolves to a RIC
for entity in JSONResponse:
    for info in JSONResponse[entity]:
        if info == 'resolutions':
            for companyinfo in JSONResponse[entity][info]:
                if 'primaryric' in companyinfo:
                    symbol = companyinfo['primaryric']
                    print(symbol)

====RIC====
RIC
VEL.N
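
The loop above leaves only the last RIC it finds in `symbol`, which matters when a headline mentions more than one listed company. A sketch that collects every distinct RIC is shown below. `primary_rics` is a hypothetical helper name and `sample_response` is a hand-made stand-in limited to the `resolutions` and `primaryric` fields used above.

```python
def primary_rics(response):
    """Collect distinct 'primaryric' values from every entry's 'resolutions', in first-seen order."""
    rics = []
    for value in response.values():
        if not isinstance(value, dict):
            continue
        for resolution in value.get('resolutions', []):
            ric = resolution.get('primaryric')
            if ric and ric not in rics:
                rics.append(ric)
    return rics

# Hypothetical stand-in: one resolved company and one unresolved company
sample_response = {
    'id-1': {'_typeGroup': 'entities', 'name': 'velocity financial, inc.',
             'resolutions': [{'primaryric': 'VEL.N'}]},
    'id-2': {'_typeGroup': 'entities', 'name': 'JOHNSON FISTEL'},
}
rics = primary_rics(sample_response)
```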


**2.4 Get topics for the news item**

In [0]:
#Print Header
print(symbol)
print('====Topics====')
print('Topics, Score')

for key in JSONResponse:
    if ('_typeGroup' in JSONResponse[key]):
        if JSONResponse[key]['_typeGroup'] == 'topics':
            print(JSONResponse[key]['name'] + ", " + str(JSONResponse[key]['score']))

VEL.N
====Topics====
Topics, Score
Business_Finance, 1
Health_Medical_Pharma, 0.935
Disaster_Accident, 0.817
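
Step 3 of the pipeline filters on entities and events of interest. One possible filter, sketched here under assumptions, keeps only topics whose score reaches a threshold. `topics_above` is a hypothetical helper name and the 0.9 cut-off is arbitrary; the topic values are copied from the output above.

```python
def topics_above(topics, min_score):
    """Keep topic names whose score is at least min_score, highest score first."""
    kept = [topic for topic in topics if float(topic['score']) >= min_score]
    kept.sort(key=lambda topic: float(topic['score']), reverse=True)
    return [topic['name'] for topic in kept]

# Topic scores as printed by the cell above
topics = [
    {'name': 'Business_Finance', 'score': 1},
    {'name': 'Health_Medical_Pharma', 'score': 0.935},
    {'name': 'Disaster_Accident', 'score': 0.817},
]
strong_topics = topics_above(topics, min_score=0.9)
```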


**4. Sentiment Analysis**

In [0]:
import re

# Define function to be used for text sentiment analysis
def get_sentiment(txt):
    '''
    Utility function to clean text by removing links and special characters
    using a simple regex, and to classify the sentiment of the passed text
    using TextBlob's sentiment polarity
    '''
    # clean text: drop @mentions, non-alphanumeric characters and URLs
    clean_txt = ' '.join(re.sub(r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)", " ", txt).split())
    # create TextBlob object of the cleaned text
    analysis = TextBlob(clean_txt)
    # classify by the sign of the polarity score
    if analysis.sentiment.polarity > 0:
        return 'positive'
    elif analysis.sentiment.polarity == 0:
        return 'neutral'
    else:
        return 'negative'

In [0]:
print('headline: ', allheadlines[1])
print('headline sentiment: ', get_sentiment(allheadlines[1]))
print('summary: ', summaries[1])
print('summary sentiment: ', get_sentiment(summaries[1]))

headline:  Velocity (VEL) Alert: Johnson Fistel Investigates Velocity Financial, Inc.; Investors Suffering Losses Encouraged to Contact Firm
headline sentiment:  negative
summary:  <p>SAN DIEGO, April  26, 2020  (GLOBE NEWSWIRE) -- Shareholder rights law firm Johnson Fistel, LLP is investigating potential violations of the federal securities laws by Velocity Financial, Inc. ("Velocity" or "the Company") (NYSE: VEL).<br></p>
summary sentiment:  negative
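
To see what the sentiment scorer actually receives, the cleaning step of `get_sentiment` can be run on its own. The sketch below uses the same pattern (written as a raw string, with the slashes unescaped) in a hypothetical helper named `clean_text`; the two inputs are shortened fragments in the style of the feed.

```python
import re

# Same alternatives as in get_sentiment: @mentions, non-alphanumeric characters, URLs
_CLEAN_PATTERN = re.compile(r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+://\S+)")

def clean_text(txt):
    """Replace every match with a space, then collapse runs of whitespace."""
    return ' '.join(_CLEAN_PATTERN.sub(' ', txt).split())

cleaned_headline = clean_text('Losses Exceeding $50,000 In Funko, Inc.')
cleaned_summary = clean_text('<p>Faruqi &amp; Faruqi, LLP</p>')
```

Punctuation is dropped, so `$50,000` becomes `50 000`. For HTML input the angle brackets are removed but the tag and entity names survive as words (`p`, `amp`), which suggests removing the markup from summaries before scoring them.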


**6. Get historical EOD price data**

In [0]:
eod_api_token = '5cc0ea63d1cda3.37070012'
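# Swap the RIC exchange suffix for EOD's US exchange code, e.g. 'VEL.N' -> 'VEL.US'.
# Replacing only the suffix avoids altering an 'N' inside the ticker itself.
eod_symbol = symbol.rsplit('.', 1)[0] + '.US'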
eod_price_url = 'https://eodhistoricaldata.com/api/eod/' + eod_symbol + '?api_token=' + eod_api_token
price_df = pd.read_csv(eod_price_url)
price_df.sort_values(by=['Date'], inplace=True, ascending=False)
price_df.head()


Unnamed: 0,Date,Open,High,Low,Close,Adjusted_close,Volume
68,3081,,,,,,
67,2020-04-24,3.22,3.22,3.03,3.09,3.09,31200.0
66,2020-04-23,3.13,3.32,3.06,3.14,3.14,68200.0
65,2020-04-22,3.29,3.361,3.1,3.16,3.16,89500.0
64,2020-04-21,3.42,3.42,3.05,3.15,3.15,112400.0
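
The "find signal" step of the pipeline relates sentiment to price movement. One common way to express price movement is the close-to-close simple return, sketched below in plain Python. `daily_returns` is a hypothetical helper name, and the closes are the four dated rows from the output above, reordered oldest first.

```python
def daily_returns(closes):
    """Close-to-close simple returns for a list of prices ordered oldest first."""
    return [curr / prev - 1 for prev, curr in zip(closes, closes[1:])]

# Closes for 2020-04-21 through 2020-04-24, oldest first
closes = [3.15, 3.16, 3.14, 3.09]
returns = daily_returns(closes)
```

Each return could then be paired with the sentiment label of news published on or before that day, which is the comparison the pipeline description refers to.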
