## How does news affect stock returns?
(Navruzbek Karamatov)

This challenge focuses on news articles and their effect on stock movements.

First, I scrape news articles published online on Nasdaq.com. Second, I score the articles with VADER sentiment analysis.
Lastly, I check whether the sentiment scores have any explanatory power in a CAPM regression.

The scraping functions are defined first, as shown below.


In [1]:
## First, here specific web scraping functions are defined

from dateutil import parser
import requests
import re
from bs4 import BeautifulSoup

## for extracting news date
def scrape_news_date(news_url):
    try:        
        news_html = requests.get(news_url).content 
        news_soup = BeautifulSoup(news_html , 'html.parser')
        all_p = news_soup.find('span', {'itemprop':'datePublished'})
        news_date = all_p.text 
        news_date = parser.parse(news_date)
        return news_date
    except Exception:  ## failed request, missing date tag, or unparsable date
        return 'No date'

## for extracting news title
def scrape_news_title(news_url):
    news_html = requests.get(news_url).content 
    news_soup = BeautifulSoup(news_html , 'html.parser')
    news_title = news_soup.title.text
    return news_title

## for extracting news text
def scrape_news_text(news_url):
    news_html = requests.get(news_url).content
    news_soup = BeautifulSoup(news_html , 'html.parser') 
    paragraphs = [par.text for par in news_soup.find_all('p')]
    news_text = ' '.join(paragraphs[1:-4]) ## just for the Nasdaq.com case
    return news_text

def get_news_urls(links_site):
    resp = requests.get(links_site)
    if not resp.ok:
        return []  ## empty list, so callers can still concatenate the results
 
    html = resp.content
    bs = BeautifulSoup(html , 'html.parser') 
    links = bs.find_all('a')

    urls = [link.get('href') for link in links]
    urls = [url for url in urls if url is not None] 
    news_urls = [url for url in urls if '/article/' in url]
 
    return news_urls

def scrape_all_articles(ticker , page_limit):
    website = 'http://www.nasdaq.com/symbol/' + ticker + '/news-headlines'
    all_news_urls = get_news_urls(website)
    
    ind = 2
    while ind <= page_limit:
        current_site = website + '?page=' + str(ind)
        urls_list = get_news_urls(current_site)
        
        all_news_urls = all_news_urls + urls_list
        ind += 1
    return all_news_urls
 
def scrape_all_info(all_news_urls):
    all_news_urls = list(set(all_news_urls))
    all_articles = [scrape_news_text(news_url) for news_url in all_news_urls]
    all_titles = [scrape_news_title(news_url) for news_url in all_news_urls]
    all_dates = [scrape_news_date(news_url) for news_url in all_news_urls]
    return all_articles, all_titles, all_dates


Here I start extracting news articles for a specific stock.
Stock tickers can be obtained from Yahoo Finance as shown below.


In [2]:
import pandas as pd
dow_info = pd.read_html('https://finance.yahoo.com/quote/%5EDJI/components?p=%5EDJI')[0]
dow_tickers = dow_info.Symbol.tolist()

In [2]:
# dow_tickers ## I focused only on the Boeing and Apple tickers

Using the ticker, each HTML page is fetched and parsed. 
All article links found in the parsed HTML are saved. 
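The link-collection step can be illustrated without a network call. The sketch below uses the standard library's `html.parser` instead of BeautifulSoup (a swapped-in technique, chosen only so the example runs on its own) and a made-up HTML fragment. The class and function names are hypothetical and not part of the notebook.

```python
from html.parser import HTMLParser


class ArticleLinkParser(HTMLParser):
    """Collect href values of <a> tags whose URL contains '/article/'."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Skip anchors without an href and links that are not articles
        if href is not None and "/article/" in href:
            self.urls.append(href)


def extract_article_urls(html):
    parser = ArticleLinkParser()
    parser.feed(html)
    # dict.fromkeys drops duplicates while keeping first-seen order
    return list(dict.fromkeys(parser.urls))


# Hypothetical page fragment, invented for illustration
sample_html = """
<a href="/article/ibm-earnings-preview">Preview</a>
<a href="/symbol/ibm">Quote</a>
<a>No link</a>
<a href="/article/ibm-earnings-preview">Preview again</a>
<a href="/article/cloud-stocks-to-watch">Cloud</a>
"""
article_urls = extract_article_urls(sample_html)
print(article_urls)
```

Unlike `list(set(...))`, the `dict.fromkeys` idiom keeps the URLs in page order, which makes repeated runs easier to compare.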

In [93]:
ap_news_urls = scrape_all_articles('ibm' , 109) #167 pages for boeing case

In [94]:
all_news_urls = list(set(ap_news_urls))
len(all_news_urls)

1088

Here I fetch only the article titles, so that articles can later be filtered by stock name.

In [95]:
all_titles = [scrape_news_title(news_url) for news_url in all_news_urls]

Now I select the articles related to a specific stock based on their titles.

In [97]:
import re
# all_boeing_titles = [re.search("Boeing", w) for w in all_titles]
# all_apple_titles = [re.search("alphabet|youtube|google|googl", w.lower()) for w in all_titles]

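## note: the variable names keep the 'apple' prefix from an earlier run; this run filters IBM titles
all_apple_titles = [re.search("ibm|international business machines", w.lower()) for w in all_titles]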

In [99]:
# title_indices = [all_boeing_titles.index(w) for w in all_boeing_titles if w is not None]
# boeing_titles = [all_titles[w] for w in title_indices]

## positions of the titles that matched (enumerate avoids a repeated list search)
title_indices = [i for i, w in enumerate(all_apple_titles) if w is not None]
apple_titles = [all_titles[w] for w in title_indices]
len(apple_titles)

494
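The title filter can also be written as one small helper. This is a minimal sketch with invented titles; `match_titles` is a hypothetical name, not a function from the notebook.

```python
import re


def match_titles(titles, pattern):
    """Return (index, title) pairs whose title matches the pattern, ignoring case."""
    compiled = re.compile(pattern, flags=re.IGNORECASE)
    return [(i, title) for i, title in enumerate(titles) if compiled.search(title)]


# Hypothetical titles, invented for illustration
sample_titles = [
    "IBM Beats Earnings Estimates",
    "Apple Unveils New iPad",
    "International Business Machines Declares Dividend",
    "Why Tech Stocks Fell Today",
]
matched = match_titles(sample_titles, r"ibm|international business machines")
matched_indices = [i for i, _ in matched]
print(matched_indices)
```

The pattern matches substrings, so a short name can also hit unrelated words. Wrapping an alternative in `\b...\b` is one way to require whole-word matches if that becomes a problem.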

I look through the news titles and keep the texts of the articles whose titles mention the specific company.

In [100]:
apple_urls = [all_news_urls[w] for w in title_indices] ## get urls 
apple_dates = [scrape_news_date(news_url)  for news_url in apple_urls]    
apple_articles = [scrape_news_text(news_url) for news_url in apple_urls]

# boeing_urls = [all_news_urls[w] for w in title_indices]
# boeing_dates = [all_dates[w] for w in title_indices]
# boeing_articles = [scrape_news_text(news_url) for news_url in boeing_urls]



In [101]:
## save it for further use
apple_articles = pd.DataFrame(apple_articles)
apple_titles = pd.DataFrame(apple_titles)
apple_dates = pd.DataFrame(apple_dates)
apple_urls = pd.DataFrame(apple_urls)

apple_articles.to_csv(r'C:\Users\navruzbek\Machine Learning Notebook\web_scraping\ibm_articles.csv', index = None, header=True)
apple_titles.to_csv(r'C:\Users\navruzbek\Machine Learning Notebook\web_scraping\ibm_titles.csv', index = None, header=True)
apple_dates.to_csv(r'C:\Users\navruzbek\Machine Learning Notebook\web_scraping\ibm_dates.csv', index = None, header=True)
apple_urls.to_csv(r'C:\Users\navruzbek\Machine Learning Notebook\web_scraping\ibm_urls.csv', index = None, header=True)

The data is also provided at the link,
so the analysis can be run without web scraping.


In [3]:
## import data

import pandas as pd
# boeing_articles = pd.read_csv(r"C:\boeing_articles.csv") 
# boeing_titles = pd.read_csv(r'C:\boeing_titles.csv')
# boeing_dates = pd.read_csv(r'C:\boeing_dates.csv')
# boeing_urls = pd.read_csv(r'C:\boeing_urls.csv')

# boeing_articles.columns = ['boeing_articles']
# boeing_articles = boeing_articles['boeing_articles'].tolist()

# boeing_dates.columns = ['boeing_dates']
# boeing_dates = boeing_dates['boeing_dates'].tolist()

apple_articles = pd.read_csv(r"C:\Users\navruzbek\Machine Learning Notebook\web_scraping\data\apple_articles.csv") 
apple_titles = pd.read_csv(r'C:\Users\navruzbek\Machine Learning Notebook\web_scraping\data\apple_titles.csv')
apple_dates = pd.read_csv(r'C:\Users\navruzbek\Machine Learning Notebook\web_scraping\data\apple_dates.csv')

apple_articles.columns = ['apple_articles']
apple_articles_l = apple_articles['apple_articles'].tolist()

apple_dates.columns = ['apple_dates']
apple_dates_l = apple_dates['apple_dates'].tolist()


In [4]:
apple_articles['apple_articles'] = apple_articles['apple_articles'].apply(lambda x: " ".join(x.lower() for x in x.split()))
apple_articles['apple_articles'] = apple_articles['apple_articles'].str.replace(r'[^\w\s]', '', regex=True) ## strip punctuation (non-word, non-space characters)


In [5]:
from nltk.corpus import stopwords
stop = stopwords.words('english')

apple_articles['apple_articles'] = apple_articles['apple_articles'].apply(lambda x: " ".join(x for x in x.split() if x not in stop))
apple_articles.iloc[0, 0]

'investorplace stock market news stock advice trading tips leading apple nasdaq aapl rumor mill today news new ipad models coming soon today well look apple rumors thursday new ipad looks like getting closer release next new ipad devices reports macrumors tech company registered two new ipad devices india devices model numbers a2124 a2133 model numbers shown eurasia registrations rumors claim new ipad models company planning release update ipad mini line smart gloves new patent reveals possible plans apple smart gloves appleinsider notes smart gloves would allow user type better touchscreens forms feed back includes gloves squeezing around tip finger offer precision also able give types feedback may allow sense closer haptic feedback folding iphone another patent apple reveals may solve one problem folding iphone reports 9to5mac patent details system would allow device heat screen cold weather could keep screen breaking folding temperatures cold still unknown far away folding iphone pa
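The three cleaning steps (lowercasing, punctuation removal, stopword removal) can be sketched in plain Python. The stopword set here is a tiny invented stand-in for NLTK's English list, and `clean_text` is a hypothetical helper.

```python
import re

# Tiny illustrative stopword set; the notebook uses NLTK's full English list
STOPWORDS = {"the", "a", "is", "to", "and", "of", "in"}


def clean_text(text, stopwords=STOPWORDS):
    lowered = text.lower()
    # Remove every character that is neither a word character nor whitespace
    no_punct = re.sub(r"[^\w\s]", "", lowered)
    return " ".join(word for word in no_punct.split() if word not in stopwords)


cleaned = clean_text("The new iPad is coming soon, and Apple's stock (AAPL) rose 2.5%!")
print(cleaned)  # new ipad coming soon apples stock aapl rose 25
```

Removing punctuation also glues fragments together: `Apple's` becomes `apples` and `2.5%` becomes `25`. This is the likely origin of tokens such as `43year` in the word counts. VADER treats capitalization and exclamation marks as intensity cues, so cleaned and uncleaned text can score differently.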

In [156]:
freq = pd.Series(' '.join(apple_articles['apple_articles']).split()).value_counts()[:10]
freq

alphabet    2866
google      2683
stocks      2082
stock       1945
zacks       1629
shares      1617
billion     1502
year        1442
market      1437
earnings    1404
dtype: int64
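The same word count can be produced with `collections.Counter`, which avoids joining all articles into one large string first. This is a sketch on invented documents; `top_words` is a hypothetical helper.

```python
from collections import Counter


def top_words(documents, n=10):
    """Count whitespace-separated tokens across all documents."""
    counts = Counter()
    for doc in documents:
        counts.update(doc.split())
    return counts.most_common(n)


# Hypothetical cleaned documents, invented for illustration
docs = [
    "stock market news stock advice",
    "apple stock earnings",
    "market earnings report",
]
top = top_words(docs, n=3)
print(top)
```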

In [43]:
## remove the most used words 

# freq = list(freq.index)
# apple_articles['apple_articles'] = apple_articles['apple_articles'].apply(lambda x: " ".join(x for x in x.split() if x not in freq))
# apple_articles['apple_articles'].head()

0    investorplace news advice trading tips leading...
1    stephen nellis march 14 reuters inc thursday l...
2    immediate release chicago il february 7 2019 z...
3    one product line proven runaway success nasdaq...
4    immediate release chicago il february 19 2019 ...
Name: apple_articles, dtype: object

In [34]:
freq = pd.Series(' '.join(apple_articles['apple_articles']).split()).value_counts()[-10:]
freq

sydney          1
231000          1
afterrenault    1
litany          1
11x             1
defying         1
43year          1
royallondon     1
raresales       1
swoon           1
dtype: int64

In [45]:
## remove the least used words

# freq = list(freq.index)
# apple_articles['apple_articles'] = apple_articles['apple_articles'].apply(lambda x: " ".join(x for x in x.split() if x not in freq))

In [57]:
## now correct spelling
## this takes a long time
## probably unnecessary, since news articles are usually spelled correctly

# from textblob import TextBlob
# print(apple_articles['apple_articles'].iloc[0,])
# print(TextBlob(apple_articles['apple_articles'].iloc[0,]).correct())

# .apply(lambda x: str(TextBlob(x).correct()))

In [56]:
## stemming: probably unnecessary

# from nltk.stem import PorterStemmer
# st = PorterStemmer()

# apple_articles['apple_articles'][:1].apply(lambda x: " ".join([st.stem(word) for word in x.split()]))

In [None]:
## lemmatization reduces each word to its dictionary form
## not needed for sentiment analysis

# from textblob import Word

# apple_articles['apple_articles'] = apple_articles['apple_articles'].apply(lambda x: " ".join([Word(w).lemmatize() for w in x.split()]))

## Sentiment analysis

Here, I do sentiment analysis for each article extracted earlier.

I repeated the same analysis for Boeing stock as well, but only the Apple case is shown here.

In [6]:

import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer
sid = SentimentIntensityAnalyzer()

In [7]:
apple_articles_l = apple_articles['apple_articles'].tolist()

pos_word_list=[]
neu_word_list=[]
neg_word_list=[]
polarity_list = []

for article in apple_articles_l:
    scores = sid.polarity_scores(article)  ## score each article once instead of four times
    polarity_list.append(scores['compound'])
    pos_word_list.append(scores['pos'])
    neu_word_list.append(scores['neu'])
    neg_word_list.append(scores['neg'])
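It helps to know how the `compound` score relates to the text length. As far as I understand the NLTK implementation, VADER sums the adjusted word valences and squashes the sum with `x / sqrt(x^2 + alpha)`, with `alpha = 15`. The sketch below reproduces only that normalization step. The function name is hypothetical, and the formula should be treated as an assumption to check against the installed NLTK version.

```python
import math


def normalize_compound(raw_sum, alpha=15.0):
    """Map an unbounded sum of word valences into the open interval (-1, 1)."""
    return raw_sum / math.sqrt(raw_sum * raw_sum + alpha)


# Larger raw sums (for example from long articles) push the score toward +/-1
for raw in (0.0, 1.0, 4.0, 20.0):
    print(raw, round(normalize_compound(raw), 4))
```

Because each article is scored as a single string, the raw sum can be large and `compound` tends to saturate near -1 or 1. The `pos`, `neg` and `neu` values are proportions, so they do not saturate in this way.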


In [8]:
apple_dates = pd.to_datetime(apple_dates_l, format='%Y-%m-%d %H:%M:%S', utc=False)

new_data = pd.DataFrame(
    {'pos': pos_word_list,
     'neg': neg_word_list,
     'neu': neu_word_list,
     'pol': polarity_list
    }, index = apple_dates)


In [9]:
new_data_ = new_data.resample('D')[["pos", "neg", "neu", "pol"]].mean()
new_data_med = new_data.resample('D')[["pos", "neg", "neu", "pol"]].median()

In [10]:
len(new_data_)

87
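The daily aggregation groups all articles published on the same calendar day and reduces their scores to one number. Here is a plain-Python sketch of that step with invented timestamps and scores; `aggregate_by_day` is a hypothetical helper. Unlike `resample('D')`, it does not create empty rows for days without articles.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean, median


def aggregate_by_day(timestamps, scores):
    """Return {date: (mean, median)} of the scores published on each day."""
    by_day = defaultdict(list)
    for ts, score in zip(timestamps, scores):
        by_day[ts.date()].append(score)
    return {day: (mean(vals), median(vals)) for day, vals in sorted(by_day.items())}


# Hypothetical article timestamps and compound scores
stamps = [
    datetime(2019, 3, 14, 9, 30),
    datetime(2019, 3, 14, 12, 0),
    datetime(2019, 3, 14, 16, 45),
    datetime(2019, 3, 15, 10, 0),
]
scores = [0.9, 0.8, -0.5, 0.2]
daily = aggregate_by_day(stamps, scores)
for day, (avg, med) in daily.items():
    print(day, round(avg, 3), med)
```

On the first day a single negative article pulls the mean down to 0.4 while the median stays at 0.8, which is why the notebook keeps both versions.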

In [144]:
# new_data_.sort_index()

Next, I add the stock price and the S&P 500 index level to the data for further analysis. The code below uses Apple (`AAPL`); the `_ba` column suffix is left over from the Boeing run.

In [11]:
import pandas_datareader as pdr
import numpy as np
import pandas as pd

start_date = '2018-07-09'
end_date = '2019-03-22'

# boeing = pdr.get_data_yahoo('BA', start_date, end_date) ## so far only yahoo works
apple = pdr.get_data_yahoo('AAPL', start_date, end_date) ## so far only yahoo works
sp500 = pdr.get_data_yahoo('^GSPC', start_date, end_date) ## so far only yahoo works

apple["Log_Ret"] = np.log(apple["Close"] / apple["Close"].shift(1))
sp500["Log_Ret"] = np.log(sp500["Close"] / sp500["Close"].shift(1))

## columns 4:7 are Volume, Adj Close and Log_Ret; the '_ba' suffix is kept from the Boeing run
apple_ = apple.iloc[:, 4:7]
apple_.columns = ['vol_ba', 'vol_close_ba', 'log_ret_ba']
sp500_ = sp500.iloc[:, 4:7]
sp500_.columns = ['vol_sp', 'vol_close_sp', 'log_ret_sp']


In [12]:
apple_.head()

Date,vol_ba,vol_close_ba,log_ret_ba
2018-07-09,19756600.0,188.445404,
2018-07-10,15939100.0,188.217972,-0.001208
2018-07-11,18831500.0,185.77565,-0.013061
2018-07-12,18041100.0,188.890366,0.016627
2018-07-13,12513900.0,189.187012,0.001569
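The log return is r_t = ln(P_t / P_{t-1}). The sketch below recomputes the first two returns from the `vol_close_ba` values in the table above. `log_returns` is a hypothetical helper. The notebook computes its returns from `Close`, so agreement with this column is only expected to the displayed precision.

```python
import math


def log_returns(prices):
    """Log return between each pair of consecutive prices."""
    return [math.log(curr / prev) for prev, curr in zip(prices, prices[1:])]


# First three prices from the table above (2018-07-09 to 2018-07-11)
prices = [188.445404, 188.217972, 185.77565]
returns = log_returns(prices)
print([round(r, 6) for r in returns])
```

The first date has no previous price, which is why its `log_ret_ba` entry is empty.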


In [13]:
new_data_ = pd.DataFrame(new_data_)
new_data_med = pd.DataFrame(new_data_med)

new_data_ = new_data_.join(apple_, how='outer')
new_data_ = new_data_.join(sp500_, how='outer')
new_data_["log_vol_ba"] = np.log(new_data_['vol_ba'])
new_data_["log_vol_sp"] = np.log(new_data_['vol_sp'])

new_data_med = new_data_med.join(apple_, how='outer')
new_data_med = new_data_med.join(sp500_, how='outer')
new_data_med["log_vol_ba"] = np.log(new_data_med['vol_ba'])
new_data_med["log_vol_sp"] = np.log(new_data_med['vol_sp'])

new_data_["pol_t_1"] = new_data_["pol"].shift(1)
new_data_["pos_t_1"] = new_data_["pos"].shift(1)
new_data_["neg_t_1"] = new_data_["neg"].shift(1)
new_data_["neu_t_1"] = new_data_["neu"].shift(1)

new_data_med["pol_t_1"] = new_data_med["pol"].shift(1)
new_data_med["pos_t_1"] = new_data_med["pos"].shift(1)
new_data_med["neg_t_1"] = new_data_med["neg"].shift(1)
new_data_med["neu_t_1"] = new_data_med["neu"].shift(1)

# new_data_med["pol_t_2"] = new_data_med["pol"].shift(2)
# new_data_med["pos_t_2"] = new_data_med["pos"].shift(2)
# new_data_med["neg_t_2"] = new_data_med["neg"].shift(2)
# new_data_med["neu_t_2"] = new_data_med["neu"].shift(2)


Now I also obtain the risk-free rate (the 4-week Treasury bill rate, series `DTB4WK`) from the Federal Reserve's FRED database. Next, I subtract the risk-free rate from the stock and market returns to estimate the stock's beta.

In [14]:
import pandas_datareader as pdr
start_date = '2018-07-09'
end_date = '2019-03-22'

risk_free = pdr.get_data_fred('DTB4WK', start_date, end_date) ## 4-week T-bill rate, quoted in percent per year

risk_free['1month_risk_free'] = risk_free['DTB4WK'] / 100

new_data_ = new_data_.join(risk_free, how='left')
new_data_med = new_data_med.join(risk_free, how='left')

new_data_['y'] = new_data_['log_ret_ba'] - new_data_['1month_risk_free']
new_data_['x'] = new_data_['log_ret_sp'] - new_data_['1month_risk_free']

new_data_med['y'] = new_data_med['log_ret_ba'] - new_data_med['1month_risk_free']
new_data_med['x'] = new_data_med['log_ret_sp'] - new_data_med['1month_risk_free']
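One point worth checking in this step is the time unit. `DTB4WK` is an annualized rate in percent, so dividing by 100 gives an annual rate, while the log returns are daily. The sketch below shows one possible conversion, assuming a 360-day year (the bank-discount convention; 252 trading days is another common choice). The function name is hypothetical. The numbers are the 2018-07-10 row of the data table printed later in the notebook.

```python
def annual_pct_to_daily(rate_pct, days_per_year=360):
    """Convert an annualized rate in percent to an approximate daily decimal rate."""
    return rate_pct / 100.0 / days_per_year


annual_pct = 1.85          # DTB4WK on 2018-07-10
stock_log_ret = -0.000120  # log_ret_ba on 2018-07-10

daily_rf = annual_pct_to_daily(annual_pct)
excess_daily = stock_log_ret - daily_rf                   # daily rate subtracted
excess_as_in_notebook = stock_log_ret - annual_pct / 100  # annual rate subtracted

print(round(daily_rf, 7), round(excess_daily, 6), round(excess_as_in_notebook, 6))
```

If the rate were constant over the sample, subtracting it from both `y` and `x` would leave the slope unchanged and only shift the intercept. The choice of unit therefore matters mostly for interpreting the intercept.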


### Least Squares regression

Here I run a least-squares regression to estimate the CAPM beta. Next, I add the negative score from the sentiment analysis as a regressor.
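In equation form, the two regressions can be written as follows. The symbols are chosen for this note and are not defined in the notebook: $R_{i,t}$ is the stock's log return, $R_{m,t}$ the S&P 500 log return, $R_{f,t}$ the risk-free rate, and $s_t$ the daily sentiment score.

$$
y_t = R_{i,t} - R_{f,t}, \qquad x_t = R_{m,t} - R_{f,t}
$$

$$
y_t = \alpha + \beta x_t + \varepsilon_t
$$

$$
y_t = \alpha + \beta x_t + \gamma s_t + \varepsilon_t
$$

The first regression gives the CAPM beta. The second tests whether the sentiment coefficient $\gamma$ differs from zero once the market return is controlled for.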

The first table shows the result when only the S&P 500 return is used as the market return to estimate the stock's beta; R² is around 44%.

The second table shows the regression result when the negative sentiment score is included. The variable is statistically significant and R² increases to 47%.

This result suggests that sentiment analysis of news is useful for studying stock movements.
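For the single-regressor case, the beta that `statsmodels` reports has a closed form: the covariance of `x` and `y` divided by the variance of `x`. The sketch below computes it in plain Python on synthetic excess returns (invented numbers, not the notebook's data). `ols_simple` is a hypothetical helper.

```python
def ols_simple(x, y):
    """Closed-form OLS for y = alpha + beta * x; returns (alpha, beta, r_squared)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    beta = sxy / sxx
    alpha = mean_y - beta * mean_x
    ss_res = sum((yi - alpha - beta * xi) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return alpha, beta, 1.0 - ss_res / ss_tot


# Synthetic data: true alpha 0.001, true beta 1.2, plus small noise
x = [-0.02, -0.01, 0.0, 0.01, 0.02]
noise = [0.001, -0.002, 0.002, -0.002, 0.001]
y = [0.001 + 1.2 * xi + e for xi, e in zip(x, noise)]

alpha, beta, r2 = ols_simple(x, y)
print(round(alpha, 6), round(beta, 6), round(r2, 4))
```

The noise here is constructed to sum to zero and to be uncorrelated with `x`, so the estimates recover the true values. With real returns they will not.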

In [116]:
## boeing case
## cleaned, uncapitalized, removed stopwords

import statsmodels.formula.api as sm

result = sm.ols(formula="y ~ x", data=new_data_).fit()
print(result.summary())

result = sm.ols(formula="y ~ x + pol", data=new_data_).fit()
print(result.summary())


                            OLS Regression Results                            
Dep. Variable:                      y   R-squared:                       0.444
Model:                            OLS   Adj. R-squared:                  0.441
Method:                 Least Squares   F-statistic:                     138.4
Date:                Tue, 26 Mar 2019   Prob (F-statistic):           7.48e-24
Time:                        22:01:31   Log-Likelihood:                 486.08
No. Observations:                 175   AIC:                            -968.2
Df Residuals:                     173   BIC:                            -961.8
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      0.0059      0.003      2.297      0.0

Here, I repeat the same analysis using only Apple stock data.
The results show that including the sentiment analysis's negativity score substantially improves R² and that the coefficient is statistically significant as well. This supports the earlier findings for the Boeing case. 

In [38]:
## apple case
## not cleaned and uncapitalized 

import statsmodels.formula.api as sm

# result = sm.ols(formula="log_vol_ba ~ neg + neg_t_1", data=new_data_).fit()
# print(result.summary())

# result = sm.ols(formula="y ~ x + pol + pol_t_1 ", data=new_data_).fit()
# print(result.summary())

result = sm.ols(formula="y ~ x + neg", data=new_data_).fit()
print(result.summary())

# new_data_.to_csv(r'C:\Users\navruzbek\Machine Learning Notebook\web_scraping\new_data_.csv', index = None, header=True)
# new_data_med.to_csv(r'C:\Users\navruzbek\Machine Learning Notebook\web_scraping\new_data_med.csv', index = None, header=True)


                            OLS Regression Results                            
Dep. Variable:                      y   R-squared:                       0.706
Model:                            OLS   Adj. R-squared:                  0.695
Method:                 Least Squares   F-statistic:                     64.95
Date:                Mon, 15 Apr 2019   Prob (F-statistic):           4.28e-15
Time:                        14:50:41   Log-Likelihood:                 169.80
No. Observations:                  57   AIC:                            -333.6
Df Residuals:                      54   BIC:                            -327.5
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      0.0341      0.007      4.949      0.0

In [31]:
## apple case
## not cleaned and uncapitalized 

import statsmodels.formula.api as sm

result1 = sm.ols(formula="y ~ x + pos + pos_t_1", data=new_data_).fit()
# print(result.summary())

result2 = sm.ols(formula="y ~ x + neg", data=new_data_).fit()
# print(result.summary())

result3 = sm.ols(formula="y ~ x", data=new_data_).fit()
# print(result.summary())


In [36]:
import statsmodels.formula.api as smf

beginningtex = """\\documentclass{report}
\\usepackage{booktabs}
\\begin{document}"""
endtex = "\\end{document}"

## write the three regression summaries into one LaTeX file
with open('myreg.tex', 'w') as f:
    f.write(beginningtex)
    f.write(result1.summary().as_latex())
    f.write(result2.summary().as_latex())
    f.write(result3.summary().as_latex())
    f.write(endtex)

In [53]:
## apple case
## pol_t and pol_t_1 is the best
## data is cleaned and str uncapitalized

import statsmodels.formula.api as sm

# print(new_data_.head())

result = sm.ols(formula="y ~ x", data=new_data_med).fit()
print(result.summary())

result = sm.ols(formula="y ~ x + pol ", data=new_data_med).fit()
print(result.summary())

# result = sm.ols(formula="y ~ x + pos ", data=new_data_).fit()
# print(result.summary())

# result = sm.ols(formula="y ~ x + neg + neg_t_1 + neg_t_2", data=new_data_med).fit()
# print(result.summary())


                            OLS Regression Results                            
Dep. Variable:                      y   R-squared:                       0.601
Model:                            OLS   Adj. R-squared:                  0.599
Method:                 Least Squares   F-statistic:                     260.5
Date:                Tue, 26 Mar 2019   Prob (F-statistic):           2.44e-36
Time:                        21:46:17   Log-Likelihood:                 507.72
No. Observations:                 175   AIC:                            -1011.
Df Residuals:                     173   BIC:                            -1005.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      0.0114      0.002      5.037      0.0

In [183]:
## google case

import statsmodels.formula.api as sm

result = sm.ols(formula="y ~ x  ", data=new_data_).fit()
print(result.summary())

result = sm.ols(formula="y ~ x + pol_t_1", data=new_data_).fit()
print(result.summary())


                            OLS Regression Results                            
Dep. Variable:                      y   R-squared:                       0.645
Model:                            OLS   Adj. R-squared:                  0.642
Method:                 Least Squares   F-statistic:                     313.7
Date:                Tue, 26 Mar 2019   Prob (F-statistic):           1.05e-40
Time:                        22:11:23   Log-Likelihood:                 551.20
No. Observations:                 175   AIC:                            -1098.
Df Residuals:                     173   BIC:                            -1092.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      0.0068      0.002      3.834      0.0

In [185]:
new_data_.head(100)

Date,pos,neg,neu,pol,vol_ba,vol_close_ba,log_ret_ba,vol_sp,vol_close_sp,log_ret_sp,log_vol_ba,log_vol_sp,pol_t_1,pos_t_1,neg_t_1,neu_t_1,DTB4WK,1month_risk_free,y,x
2018-07-09,,,,,1079200.0,1167.280029,,3.050040e+09,2784.169922,,13.891731,21.838421,,,,,1.83,0.0183,,
2018-07-10,,,,,1066700.0,1167.140015,-0.000120,3.063850e+09,2793.840088,0.003467,13.880080,21.842938,,,,,1.85,0.0185,-0.018620,-0.015033
2018-07-11,,,,,1662600.0,1171.459961,0.003694,2.964740e+09,2774.020020,-0.007119,14.323893,21.810055,,,,,1.86,0.0186,-0.014906,-0.025719
2018-07-12,,,,,2207400.0,1201.260010,0.025120,2.821690e+09,2798.290039,0.008711,14.607326,21.760602,,,,,1.86,0.0186,0.006520,-0.009889
2018-07-13,,,,,1630600.0,1204.420044,0.002627,2.614000e+09,2801.310059,0.001079,14.304459,21.684147,,,,,1.84,0.0184,-0.015773,-0.017321
2018-07-16,,,,,1339200.0,1196.510010,-0.006589,2.812230e+09,2798.429932,-0.001029,14.107583,21.757244,,,,,1.86,0.0186,-0.025189,-0.019629
2018-07-17,,,,,2008100.0,1213.079956,0.013754,3.050730e+09,2809.550049,0.003966,14.512700,21.838647,,,,,1.90,0.0190,-0.005246,-0.015034
2018-07-18,,,,,1947400.0,1212.910034,-0.000140,3.089780e+09,2815.620117,0.002158,14.482006,21.851366,,,,,1.87,0.0187,-0.018840,-0.016542
2018-07-19,,,,,1916900.0,1199.099976,-0.011451,3.266700e+09,2804.489990,-0.003961,14.466220,21.907046,,,,,1.86,0.0186,-0.030051,-0.022561
2018-07-20,,,,,1896900.0,1197.880005,-0.001018,3.230210e+09,2801.830078,-0.000949,14.455732,21.895813,,,,,1.82,0.0182,-0.019218,-0.019149


In [152]:
## ibm case
## cleaned, uncapitalized, removed stopwords

import statsmodels.formula.api as sm

result = sm.ols(formula="y ~ x", data=new_data_med).fit()
print(result.summary())

result = sm.ols(formula="y ~ x + pol + pol_t_1", data=new_data_med).fit()
print(result.summary())


                            OLS Regression Results                            
Dep. Variable:                      y   R-squared:                       0.382
Model:                            OLS   Adj. R-squared:                  0.379
Method:                 Least Squares   F-statistic:                     107.1
Date:                Tue, 26 Mar 2019   Prob (F-statistic):           7.72e-20
Time:                        22:06:11   Log-Likelihood:                 515.17
No. Observations:                 175   AIC:                            -1026.
Df Residuals:                     173   BIC:                            -1020.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     -0.0016      0.002     -0.730      0.4

In [141]:
result.model

<statsmodels.regression.linear_model.OLS at 0x17522024400>