# Classifying Stocks Using Machine Learning
## Collection
<hr>

The goal of this notebook is to provide collection methods for attributes that we believe are important for predicting a stock's price movement. After collection and compilation into a dataframe, the data is saved to disk for cleaning and EDA.

### Objectives
 - Collect Ticker and Date
 - Collect Open Price
 - Collect Close Price
 - Collect Volume
 - Collect News Sentiment Score
    - Full-article sentiment is more accurate than headline-only sentiment (score the full content where possible)
 - Collect Social Sentiment Score
 - Collect Related Stock Sentiment Score
   - Collect related stock tickers that are found within articles
   - Compile into a list based on frequency
   - Apply the weighted social and news sentiment of these related stocks to the primary stock
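
The related-stock objective above can be sketched as a frequency-weighted average. This is only an illustration of the idea, not the collection code itself: the function name `related_sentiment` and both of its inputs (a list of tickers found in articles and a dict of per-ticker sentiment scores) are hypothetical.

```python
from collections import Counter

def related_sentiment(related_mentions, sentiment_by_ticker):
    """Frequency-weighted mean sentiment of related tickers.

    related_mentions: tickers found in the primary stock's articles,
        one entry per mention (hypothetical input).
    sentiment_by_ticker: dict mapping ticker -> sentiment score
        (hypothetical input).
    """
    counts = Counter(related_mentions)
    # Only tickers with a known sentiment score can contribute.
    scored = {t: n for t, n in counts.items() if t in sentiment_by_ticker}
    total = sum(scored.values())
    if total == 0:
        return None
    return sum(sentiment_by_ticker[t] * n for t, n in scored.items()) / total

found = ["MSFT", "AAPL", "MSFT", "ORCL", "MSFT"]
scores = {"MSFT": 3.0, "AAPL": 1.0}

# Related tickers ranked by how often they were mentioned.
ranked = Counter(found).most_common()
weighted = related_sentiment(found, scores)
```

Tickers without a score (here `ORCL`) are left out of both the numerator and the weight total, so they do not drag the average toward zero.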
   
### Organization of data

Data must be organized by ticker and date. After collection is complete, each row will be labeled using the **next day's** closing price, so we can evaluate whether sentiment affects the stock price.
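
One possible way to build such a label is a per-ticker shift of the closing price. The sketch below uses a toy frame, and it makes two assumptions that the text does not state: the label is binary (1 if the next close is higher, otherwise 0), and "next day" means the next available row for that ticker.

```python
import pandas as pd

# Toy data; the column names follow the final dataframe in this notebook.
toy = pd.DataFrame({
    "ticker": ["AAL", "AAL", "AAL", "XPO", "XPO"],
    "date": pd.to_datetime(["2021-03-15", "2021-03-16", "2021-03-17",
                            "2021-07-27", "2021-07-28"]),
    "close": [25.17, 24.47, 25.16, 138.87, 137.78],
})

toy = toy.sort_values(["ticker", "date"]).reset_index(drop=True)

# Shift within each ticker so one stock's prices never leak into another's.
toy["next_close"] = toy.groupby("ticker")["close"].shift(-1)

# The last row of each ticker has no next close, so it cannot be labeled.
labeled = toy.dropna(subset=["next_close"]).copy()
labeled["label"] = (labeled["next_close"] > labeled["close"]).astype(int)
```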

### Sample Stocks

Because we must collect articles for both the original ticker and any tickers mentioned in those articles, we limit collection to a sample of stocks across different industries.

 - Tech
   - UBER
   - FB
   - ORCL
   - PYPL
   - CSCO
   - MSFT
 - Finance
   - BAC
   - WFC
   - C
   - JPM
   - V
   - MS
 - Electronics
   - AMD
   - AAPL
   - MU
   - INTC
   - NOK
   - BA
 - Health
   - PFE
   - JNJ
 - Retail
   - COST
   - WMT
   - HD
   - CVS
   - BABA
   - KR
 - Energy
   - BP
   - XOM
   - CVX
 - Transportation
   - AAL
   - DAL
   - LUV
   - CSX
   - XPO
   - UPS
 - Consumer Items (durables)
   - F
   - GM
   - ATVI
   - TSLA
   - GT
   - EA
   - HMC
 - Consumer Items (non-durables)
   - KO
   - ABEV
   - UAA
   - PG
   - NKE
   - GIS
 - Communication
   - T
   - VZ
   - TMUS
 - Services
   - AMC
   - VIAC
   - MGM
   - CMCSA
   - DIS
   - WEN

In [1]:
# our focus will be the last 90 days for each of the following stocks

keys = ["c3s6juiad3ie4i7q63b0","c45oediad3ia3sn569mg",
        "c45oeliad3ia3sn56a1g","c45of1iad3ia3sn56a7g",
        "c45ofn2ad3ia3sn56as0","c45oftiad3ia3sn56b10",
        "c3dhc1iad3icrjj6i7qg"]

# small subset for testing prior to a full run
# tickers = ["UBER","FB","ORCL","PYPL"]

tickers = ["UBER","FB","ORCL","PYPL","CSCO","MSFT","BAC","WFC",
 "C","JPM","V","MS","AMD","AAPL","MU","INTC","NOK",
 "BA","PFE","JNJ","COST","WMT","HD","CVS","BABA","KR",
 "BP","XOM","CVX","AAL","DAL","LUV","CSX","XPO","UPS",
 "F","GM","ATVI","TSLA","GT","EA","HMC","KO","ABEV",
 "UAA","PG","NKE","GIS","T","VZ","TMUS","AMC","VIAC",
 "MGM","CMCSA","DIS","WEN"]

In [2]:
# Rotates through the API keys to allow more frequent calls

FINNHUB_API_KEY = keys[0]

def swap_key(current_key):
    """Return the key after current_key, wrapping around to the first key."""
    next_idx = (keys.index(current_key) + 1) % len(keys)
    return keys[next_idx]

# this swaps to the next key
# FINNHUB_API_KEY = swap_key(FINNHUB_API_KEY)

In [3]:
# Returns the day after str_date; input and output are 'YYYY-MM-DD' strings
def increment_one_day(str_date):

    _date = datetime.strptime(str_date, '%Y-%m-%d') + timedelta(days=1)
    _date = _date.strftime('%Y-%m-%d')
    
    return _date
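
As a hedged alternative to stepping one day at a time with `increment_one_day`, the whole date window can be produced by a small generator. The name `daily_dates` is hypothetical. Like the collection loop in this notebook, it excludes the stop date, but it compares with `<` so it also terminates if the stop date is earlier than the start date.

```python
from datetime import datetime, timedelta

def daily_dates(start, stop):
    """Yield 'YYYY-MM-DD' strings from start up to, but not including, stop."""
    current = datetime.strptime(start, "%Y-%m-%d")
    end = datetime.strptime(stop, "%Y-%m-%d")
    while current < end:
        yield current.strftime("%Y-%m-%d")
        current += timedelta(days=1)

# Crosses a month boundary to show the rollover is handled.
dates = list(daily_dates("2021-02-27", "2021-03-02"))
```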

### This code requires Stanford CoreNLP to be running as a local server.
Download it and run it as a server using the commands below.
```
cd stanford-corenlp-4.2.2
java -mx6g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -timeout 5000
```

In [4]:
import os
import requests
from datetime import datetime, timedelta
import time
import warnings
import pandas as pd
import numpy as np
from pycorenlp import StanfordCoreNLP
from IPython.display import display

warnings.filterwarnings('ignore')

In [5]:
# StanfordCoreNLP Function
nlp = StanfordCoreNLP('http://localhost:9000')

def get_sentiment(text):
    result = nlp.annotate(text, properties={
                   'annotators': 'sentiment',
                   'outputFormat': 'json',
                   'timeout': 5000,
               })
    return np.mean([int(i['sentimentValue']) for i in result['sentences']])
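
To make the score easier to interpret: CoreNLP assigns each sentence a `sentimentValue` from 0 (very negative) to 4 (very positive), and `get_sentiment` averages those values over all sentences. The sketch below reproduces that averaging on a hand-written, trimmed response so it runs without the server. The `mock_result` payload and the `mean_sentiment` helper are illustrative assumptions, not output captured from CoreNLP.

```python
import numpy as np

# Hypothetical, trimmed response in the shape get_sentiment reads.
mock_result = {
    "sentences": [
        {"sentimentValue": "3", "sentiment": "Positive"},
        {"sentimentValue": "1", "sentiment": "Negative"},
        {"sentimentValue": "2", "sentiment": "Neutral"},
    ]
}

def mean_sentiment(result):
    """Average the per-sentence sentiment values; NaN if there are none."""
    values = [int(s["sentimentValue"]) for s in result["sentences"]]
    return float(np.mean(values)) if values else float("nan")

score = mean_sentiment(mock_result)
```

A score near 2 therefore reads as neutral on average, which is worth keeping in mind when looking at the `news_sentiment_score` column later.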

In [6]:
# Gets news for a ticker on a specific date.
def get_news(ticker, date):
    r = requests.get(f'https://finnhub.io/api/v1/company-news?symbol={ticker}&from={date}&to={date}&token={FINNHUB_API_KEY}')
    
    data = r.json()
    h = []
    for d in data:
        d['date'] = datetime.utcfromtimestamp(d['datetime']).strftime('%Y-%m-%d')
        h.append([d['id'], d['date'], d['headline'], d['source'], d['summary'],d['url']])

    df = pd.DataFrame(h, columns=['id', 'date', 'headline', 'source', 'summary','url'])
    df['date'] = pd.to_datetime(df['date'])
    return df

In [7]:
# Gets aggregate news sentiment stats for a ticker. These appear to cover only a recent window (possibly the last week or two), not a specific date.
def get_news_sentiment(ticker):
    r = requests.get(f'https://finnhub.io/api/v1/news-sentiment?symbol={ticker}&token={FINNHUB_API_KEY}')

    data = r.json()
    h={}

    for d in data:
        try:
            for i in data[d]:
                sd=[]
                sd.append(data[d][i])
                h[i]=sd

        except:
            kl=[]
            kl.append(data[d])
            h[d]=kl

    df = pd.DataFrame.from_dict(h)
    df.insert(0,'Ticker',ticker)
    
    return df
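
The nested loop in `get_news_sentiment` flattens a dictionary whose values are either scalars or one-level dictionaries. A similar one-row frame can be sketched with `pd.json_normalize`. The payload below is a made-up example of that shape, not a recorded API response, and the `_` separator is an arbitrary choice.

```python
import pandas as pd

# Hypothetical payload: scalars mixed with one level of nested dictionaries.
payload = {
    "buzz": {"articlesInLastWeek": 20, "weeklyAverage": 25.0},
    "companyNewsScore": 0.6,
    "sentiment": {"bearishPercent": 0.2, "bullishPercent": 0.8},
    "symbol": "AAPL",
}

# Nested keys become prefixed column names such as 'sentiment_bullishPercent'.
flat = pd.json_normalize(payload, sep="_")
```

Unlike the loop above, this keeps the parent key as a prefix, so two nested dictionaries that share an inner key name cannot overwrite each other.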

In [8]:
def get_social_sent(day,symbol):
    # use real date arithmetic; bumping the last digit breaks on days ending in 9 and at month ends
    day2 = increment_one_day(day)

    r = requests.get(f'https://finnhub.io/api/v1/stock/social-sentiment?symbol={symbol}&token={FINNHUB_API_KEY}&from={day}&to={day2}')
    data = r.json()
    
    scores = []
    mentions = []

    for i in data['reddit']:
        scores.append(i['score'])
        mentions.append(i['mention'])
    for i in data['twitter']:
        scores.append(i['score'])
        mentions.append(i['mention'])
    
    if scores:
        products = [a * b for a, b in zip(scores, mentions)]
        return sum(products)/sum(mentions),sum(mentions)
    else:
        return -1,-1
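
The value returned by `get_social_sent` is a mention-weighted average: each score is multiplied by its mention count, and the sum is divided by the total number of mentions. A small worked example with invented numbers is sketched below. The helper name `weighted_sentiment` is hypothetical, and the guard against a zero mention total is an addition that the function above does not have.

```python
def weighted_sentiment(records):
    """Mention-weighted mean score for dicts with 'score' and 'mention' keys."""
    total_mentions = sum(r["mention"] for r in records)
    if total_mentions == 0:
        # Same "no data" sentinel that get_social_sent returns.
        return -1, -1
    weighted = sum(r["score"] * r["mention"] for r in records) / total_mentions
    return weighted, total_mentions

# Invented records standing in for the 'reddit' and 'twitter' lists.
reddit = [{"score": 0.5, "mention": 3}]
twitter = [{"score": -0.25, "mention": 1}]

# (0.5 * 3 + -0.25 * 1) / (3 + 1) = 1.25 / 4
score, mentions = weighted_sentiment(reddit + twitter)
```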

In [9]:
# Gets open, close, high, low, and volume for a ticker on a given day
def get_open_close(day, symbol):
    # build Unix timestamps for the start and end of the day
    start_t = day + " 00:00:00"
    end_t = day + " 23:59:59"
    start_t = int(time.mktime(datetime.strptime(start_t, "%Y-%m-%d %H:%M:%S").timetuple()))
    end_t = int(time.mktime(datetime.strptime(end_t, "%Y-%m-%d %H:%M:%S").timetuple()))

    r = requests.get(f'https://finnhub.io/api/v1/stock/candle?symbol={symbol}&resolution=D&from={start_t}&to={end_t}&token={FINNHUB_API_KEY}')
    data = r.json()
    
    try:
        return data['o'][0],data['c'][0],data['h'][0],data['l'][0],data['v'][0]
    except:
        return -1,-1,-1,-1,-1
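
`get_open_close` builds its timestamps with `time.mktime`, which interprets the parsed time in the local time zone of the machine running the notebook, so the requested window shifts depending on where the code runs. If a fixed window is preferred, the bounds can be computed in UTC. This is a sketch under that assumption, and `utc_day_bounds` is a hypothetical helper name.

```python
from datetime import datetime, timezone

def utc_day_bounds(day):
    """Return (start, end) Unix timestamps for a 'YYYY-MM-DD' day in UTC."""
    start = datetime.strptime(day, "%Y-%m-%d").replace(tzinfo=timezone.utc)
    start_ts = int(start.timestamp())
    # 23:59:59 on the same day
    end_ts = start_ts + 24 * 60 * 60 - 1
    return start_ts, end_ts

start_ts, end_ts = utc_day_bounds("2021-03-15")
```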

In [10]:
# def relationalStock():
    #TODO: relational stock collection
    #this may need to be its own notebook

In [11]:
%%time
# Main collection loop: fetches news for every ticker and date,
# counts repeated headlines, and combines everything into one dataframe.

all_news = pd.DataFrame([])

for ticker in tickers:
    #start date
    _date = '2021-03-15'
    
    #create empty dataframe
    df = pd.DataFrame([])
    
    #stop date
    _fdate = '2021-08-03'
    
    #loop over dates, fetch news articles, and append them to the dataframe (limit: 60 API calls per minute)
    while _date != _fdate: #datetime.today().strftime('%Y-%m-%d'):
        df = pd.concat([df, get_news(ticker, _date)])  # DataFrame.append was removed in pandas 2.0
        df = df.drop_duplicates()
        FINNHUB_API_KEY = swap_key(FINNHUB_API_KEY)
        time.sleep(0.15)
        _date = increment_one_day(_date)
    
    #There are some repeat headlines on the same day, so getting a daily headline count per article
    #Maybe duplicates of the same headline indicates more important news??
    duplicate_headlines = df[['date', 'headline', 'id']]
    dh = (duplicate_headlines.groupby(['date', 'headline'], as_index=False)
          .count()
          .rename(columns={'id': 'headline_count'}))
      
    # Get unique headlines by date
    no_dups = df.drop_duplicates(subset=['date', 'headline'])
    
    #Merge in headline counts
    no_dups = no_dups.merge(dh, how='left', on=['date', 'headline'])
    
    #Insert ticker
    no_dups.insert(0, 'ticker', ticker)
    
    #Append to dataframe that has all tickers
    all_news = pd.concat([all_news, no_dups])  # DataFrame.append was removed in pandas 2.0
    
all_news.reset_index(drop=True,inplace=True)

Wall time: 38min 12s
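
The headline-count step in the loop above is easier to follow on a toy frame. The data below is invented, and the operations mirror the groupby, drop-duplicates, and merge sequence used in the cell.

```python
import pandas as pd

# Invented articles: headline "A" appears twice on the first day.
toy = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "date": pd.to_datetime(["2021-03-15"] * 3 + ["2021-03-16"]),
    "headline": ["A", "A", "B", "A"],
})

# Count how many articles share each (date, headline) pair.
counts = (toy.groupby(["date", "headline"], as_index=False)["id"]
          .count()
          .rename(columns={"id": "headline_count"}))

# Keep one row per (date, headline) and attach its count.
unique = (toy.drop_duplicates(subset=["date", "headline"])
          .merge(counts, how="left", on=["date", "headline"]))
```

The same headline on a different day is counted separately, which matches the per-day grouping used when sentiment is scored later.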


In [12]:
# all headlines with the same ticker and date are merged into a list.
all_news_1=pd.DataFrame(all_news.groupby(['ticker','date'])['headline'].apply(list))

In [13]:
%%time
# Gets the mean sentiment score of each list of headlines, per ticker and date
finals=[]
for i,j in dict(all_news_1).items():
    for k in j:
        score=[]
        for l in k:
            try: 
                score.append(get_sentiment(l))
            except:
                score.append(-1)
        finals.append(round(np.mean(score),2))

Wall time: 1h 49min 24s
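
In the cell above, a headline that fails to score is recorded as -1, and that value is then included in the daily mean. Since valid scores lie between 0 and 4, each failure pulls the average down. One alternative, sketched here as an assumption rather than as the notebook's method, is to record failures as NaN and average only the valid scores. `safe_mean` is a hypothetical helper.

```python
import numpy as np

def safe_mean(scores):
    """Mean of the non-NaN scores rounded to 2 decimals; NaN if none are valid."""
    arr = np.array(scores, dtype=float)
    if np.all(np.isnan(arr)):
        return float("nan")
    return round(float(np.nanmean(arr)), 2)

# One failed headline (NaN) between two valid scores.
daily = safe_mean([2.0, np.nan, 3.0])
```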


In [14]:
# adds the sentiment score to dataframe column
all_news_1['news_sentiment_score']=finals

In [15]:
# adds sources
all_news_3=pd.DataFrame(all_news.groupby(['ticker','date'])['source'].apply(list))
all_news_1["source"] = all_news_3["source"]

In [16]:
# adds url
all_news_2=pd.DataFrame(all_news.groupby(['ticker','date'])['url'].apply(list))
all_news_1["url"] = all_news_2["url"]

In [17]:
# adds the number of articles per ticker and date
a=all_news_1.reset_index()
a['amount_of_articles']=a['headline'].apply(len)

a['date'] = a['date'].apply(lambda x: datetime.strftime(x, '%Y-%m-%d'))

In [18]:
%%time
# collecting the opens, closes, highs, lows, and volumes
opens = []
closes =  []
highs = []
lows = []
vols = []
for idx,row in a.iterrows():
    FINNHUB_API_KEY = swap_key(FINNHUB_API_KEY)
    time.sleep(0.15)
    o,c,h,l,v = get_open_close(row['date'], row['ticker'])
    opens.append(o)
    closes.append(c)
    vols.append(v)
    highs.append(h)
    lows.append(l)
    
b = {'open': opens, 'close': closes, 'highs': highs, 'lows': lows, 'volume': vols}
b = pd.DataFrame(data=b)

Wall time: 1h 26s


In [19]:
a["open"] = b["open"]
a["close"] = b["close"]
a["volume"] = b["volume"]
a["highs"] = b["highs"]
a["lows"] = b["lows"]

In [20]:
%%time
# collecting sentiments
sents=[]
mentions=[]
for idx,row in a.iterrows():
    FINNHUB_API_KEY = swap_key(FINNHUB_API_KEY)
    time.sleep(0.15)
    s,m = get_social_sent(row['date'],row['ticker'])
    sents.append(s)
    mentions.append(m)
    
c = {'social_sentiments': sents, 'mentions': mentions}
c = pd.DataFrame(data=c)

Wall time: 31min 56s


In [21]:
# combine dataframes
a["social_sentiments"] = c["social_sentiments"]
a["mentions"] = c["mentions"]

In [22]:
# this is the final dataframe
a

| | ticker | date | headline | news_sentiment_score | source | url | amount_of_articles | open | close | volume | highs | lows | social_sentiments | mentions |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | AAL | 2021-03-15 | [Get Ready For A Great American Travel And Spe... | 2.25 | [SeekingAlpha, DowJones, SeekingAlpha, MarketW... | [https://finnhub.io/api/news?id=7c2dbca0b704a9... | 4 | 24.55 | 25.17 | 94133688 | 25.9400 | 24.21 | -1.000000 | -1 |
| 1 | AAL | 2021-03-16 | [American Airlines Group Inc. stock underperfo... | 2.00 | [MarketWatch, SeekingAlpha, SeekingAlpha] | [https://finnhub.io/api/news?id=8097e865de52b8... | 3 | 25.11 | 24.47 | 47923579 | 25.2500 | 24.31 | -0.496911 | 2 |
| 2 | AAL | 2021-03-17 | [American Airlines Group Inc. stock outperform... | 2.00 | [MarketWatch, MarketWatch, MarketWatch, DowJones] | [https://finnhub.io/api/news?id=218be0fe1534fa... | 4 | 24.12 | 25.16 | 38540131 | 25.2200 | 23.90 | 0.014636 | 21 |
| 3 | AAL | 2021-03-18 | [American Airlines Group Inc. stock underperfo... | 2.00 | [MarketWatch, SeekingAlpha] | [https://finnhub.io/api/news?id=c8722f77fbf8f6... | 2 | 25.12 | 24.70 | 53368955 | 26.0900 | 24.55 | -0.237848 | 20 |
| 4 | AAL | 2021-03-19 | [American Airlines Group Inc. stock outperform... | 2.00 | [MarketWatch, SeekingAlpha, MarketWatch, Seeki... | [https://finnhub.io/api/news?id=e17f340aa8ca4b... | 4 | 24.68 | 24.97 | 49461200 | 25.1100 | 23.88 | -1.000000 | -1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 7090 | XPO | 2021-07-27 | [GXO Logistics, Victoria's Secret & GameStop S... | 1.50 | [Yahoo, MarketWatch] | [https://finnhub.io/api/news?id=e652932d3709d4... | 2 | 139.71 | 138.87 | 1363769 | 140.7499 | 137.60 | -1.000000 | -1 |
| 7091 | XPO | 2021-07-28 | [XPO Logistics (XPO) Tops Q2 Earnings and Reve... | 2.00 | [Yahoo, DowJones, Yahoo, Yahoo, MarketWatch, M... | [https://finnhub.io/api/news?id=181ba4c8ec3180... | 12 | 139.07 | 137.78 | 1302691 | 139.5364 | 136.14 | -1.000000 | -1 |
| 7092 | XPO | 2021-07-29 | [XPO Logistics's (XPO) CEO Brad Jacobs on Q2 2... | 2.00 | [SeekingAlpha] | [https://finnhub.io/api/news?id=dbd91b5ce37247... | 1 | 137.01 | 141.03 | 1712136 | 144.2000 | 136.39 | -1.000000 | -1 |
| 7093 | XPO | 2021-07-30 | [The XPO-GXO Spinoff Saga: 'See You At The Cro... | 2.00 | [Yahoo, Yahoo, Yahoo] | [https://finnhub.io/api/news?id=8937926dfdb7b6... | 3 | 139.57 | 138.69 | 3054300 | 141.0700 | 138.32 | -0.249992 | 4 |


In [24]:
# save as an HDF5 file because it saves and loads fast
a.to_hdf('stocksFull.h5', key='data')

In [280]:
# use the code below to recover the dataframe
#reread = pd.read_hdf('stocksFull.h5', key='data')