### EDA

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
data = pd.read_csv("../data/jvdm_data_prep.csv")

In [3]:
data

| | Date | Headlines | Close | Close+1 | Close+2 | Close+3 | Close+4 | Close+5 | Close+6 | Close+7 | ... | PercentageD+5 | PercentageD+6 | PercentageD+7 | TrendD+1 | TrendD+2 | TrendD+3 | TrendD+4 | TrendD+5 | TrendD+6 | TrendD+7 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2017-12-17 | The Guardian view on Ryanair’s model : a union... | 239.5091 | 241.0279 | 240.1023 | 239.9764 | 240.4707 | 240.4078 | 240.4078 | 240.4078 | ... | 0.3752 | 0.3752 | 0.3752 | 2 | 1 | 1 | 1 | 1 | 1 | 1 |
| 1 | 2017-12-17 | Butchers carve out a niche as UK shoppers opt ... | 239.5091 | 241.0279 | 240.1023 | 239.9764 | 240.4707 | 240.4078 | 240.4078 | 240.4078 | ... | 0.3752 | 0.3752 | 0.3752 | 2 | 1 | 1 | 1 | 1 | 1 | 1 |
| 2 | 2017-12-17 | Grogonomics This year has been about companies... | 239.5091 | 241.0279 | 240.1023 | 239.9764 | 240.4707 | 240.4078 | 240.4078 | 240.4078 | ... | 0.3752 | 0.3752 | 0.3752 | 2 | 1 | 1 | 1 | 1 | 1 | 1 |
| 3 | 2017-12-17 | Youngest staff to be given UK workplace pensio... | 239.5091 | 241.0279 | 240.1023 | 239.9764 | 240.4707 | 240.4078 | 240.4078 | 240.4078 | ... | 0.3752 | 0.3752 | 0.3752 | 2 | 1 | 1 | 1 | 1 | 1 | 1 |
| 4 | 2017-12-17 | Peter Preston on press and broadcasting Paul D... | 239.5091 | 241.0279 | 240.1023 | 239.9764 | 240.4707 | 240.4078 | 240.4078 | 240.4078 | ... | 0.3752 | 0.3752 | 0.3752 | 2 | 1 | 1 | 1 | 1 | 1 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 53325 | 2020-07-18 | World Bank calls on creditors to cut poorest n... | 303.2904 | 303.2904 | 305.7415 | 306.3919 | 308.1360 | 304.4593 | 302.4986 | 302.4986 | ... | 0.3854 | -0.2611 | -0.2611 | 1 | 2 | 2 | 2 | 1 | 1 | 1 |
| 53326 | 2020-07-18 | British Airways retires Boeing 747 fleet as Co... | 303.2904 | 303.2904 | 305.7415 | 306.3919 | 308.1360 | 304.4593 | 302.4986 | 302.4986 | ... | 0.3854 | -0.2611 | -0.2611 | 1 | 2 | 2 | 2 | 1 | 1 | 1 |
| 53327 | 2020-07-18 | What will changes to England's lockdown rules ... | 303.2904 | 303.2904 | 305.7415 | 306.3919 | 308.1360 | 304.4593 | 302.4986 | 302.4986 | ... | 0.3854 | -0.2611 | -0.2611 | 1 | 2 | 2 | 2 | 1 | 1 | 1 |
| 53328 | 2020-07-18 | Atol protection to be extended to vouchers on ... | 303.2904 | 303.2904 | 305.7415 | 306.3919 | 308.1360 | 304.4593 | 302.4986 | 302.4986 | ... | 0.3854 | -0.2611 | -0.2611 | 1 | 2 | 2 | 2 | 1 | 1 | 1 |


### Headline length vs. token window of the model 

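**mrm8488/distilroberta-finetuned-financial-news-sentiment-analysis** --> 512-token window; assuming approx. 1 token per word (a rough lower bound, since subword tokenizers often split a word into several tokens)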

In [4]:
data["Word_Count"] = data["Headlines"].apply(lambda x: len(x.split()))

In [5]:
data["Word_Count"].describe()

count    53330.000000
mean        11.770429
std          2.820468
min          2.000000
25%         10.000000
50%         11.000000
75%         13.000000
max         31.000000
Name: Word_Count, dtype: float64

In [6]:
512/12

42.666666666666664

--> The token window will fit approx. 43 headlines of average length
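The estimate above depends on how many tokens each word turns into. A minimal sketch of the arithmetic follows; the helper name `headlines_per_window` and the 1.3 tokens-per-word ratio are illustrative assumptions, not values measured on this dataset.

```python
import math

def headlines_per_window(window_tokens, mean_words, tokens_per_word=1.0):
    """Estimate how many average-length headlines fit in the token window."""
    return math.floor(window_tokens / (mean_words * tokens_per_word))

# Mean headline length of 11.77 words is taken from the describe() output above
same_ratio = headlines_per_window(512, 11.77)         # 1 token per word, as assumed in the notebook
higher_ratio = headlines_per_window(512, 11.77, 1.3)  # hypothetical 1.3 tokens per word

print(same_ratio, higher_ratio)
```

If the real ratio is above 1, fewer headlines fit than the 43 estimated here, so the figure is best read as an upper bound.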

### Headlines per date?

In [7]:
date_counts = data.groupby('Date').size().reset_index(name='count')

In [45]:
date_counts.describe()

| | count |
|---|---|
| count | 931.0 |
| mean | 57.282492 |
| std | 31.718634 |
| min | 1.0 |
| 25% | 27.0 |
| 50% | 63.0 |
| 75% | 80.5 |
| max | 170.0 |


In [9]:
len(date_counts[date_counts["count"] <5])/len(date_counts)

0.031149301825993556

For about 3% of the dates in the dataset, fewer than 5 headlines are available.

### Sentiment analysis with the existing model

Using a random sample of 5,000 entries from the dataset

In [10]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from transformers import pipeline

In [11]:
tokenizer = AutoTokenizer.from_pretrained("mrm8488/distilroberta-finetuned-financial-news-sentiment-analysis")
model = AutoModelForSequenceClassification.from_pretrained("mrm8488/distilroberta-finetuned-financial-news-sentiment-analysis")



In [12]:
sentiment_analysis = pipeline("sentiment-analysis", model=model, tokenizer=tokenizer, device=0)

In [13]:
max_length = tokenizer.model_max_length
max_length

512

In [14]:
test_data = data.sample(5000, random_state=42)

In [15]:
# One pipeline call per headline; x[:max_length] slices characters, not tokens (harmless for short headlines)
results = test_data["Headlines"].apply(lambda x: sentiment_analysis(x[:max_length])[0])
test_data["Sentiment"] = results.map(lambda r: r["label"])
test_data["Score"] = results.map(lambda r: r["score"])

In [17]:
sentiment_mapping = {
    "negative": 0,
    "neutral": 1,
    "positive": 2
}

In [18]:
test_data["Sentiment_Value"] = test_data["Sentiment"].map(sentiment_mapping)

### Correlation between the sentiment value and the trend of the S&P500 after x days?

In [19]:
from sklearn.metrics import accuracy_score

for i in range(1,8):
    accuracy = accuracy_score(test_data[f"TrendD+{i}"], test_data["Sentiment_Value"])
    print(f"Accuracy D+{i}: {accuracy:.2f}")

Accuracy D+1: 0.41
Accuracy D+2: 0.38
Accuracy D+3: 0.34
Accuracy D+4: 0.32
Accuracy D+5: 0.30
Accuracy D+6: 0.29
Accuracy D+7: 0.29


--> Beyond 1 day, sentiment analysis as a predictor of the S&P500 trend performs similarly to or worse than a random guess (about 0.33 for three classes)

### Aggregating headlines by date

Rationale:
- Intuition says that single headlines should be a poor predictor of the S&P trend
- On most days, one would expect both positive and negative headlines
- Aggregating many headlines from the same day should give a better indication of investors' sentiment
- Practical reason: reducing the size of the dataset (from 50k+ to fewer than 1k entries) will make it easier to train a model (avoiding resource limits both locally and on Google Colab)

In [20]:
aggregated_data = data.groupby("Date").agg({
    "Headlines": lambda x: " ".join(x),  
    "TrendD+1": "first",
    "TrendD+2": "first",
    "TrendD+3": "first",
    "TrendD+4": "first",
    "TrendD+5": "first",
    "TrendD+6": "first",
    "TrendD+7": "first"
}).reset_index(drop=True)


In [21]:
aggregated_data

| | Headlines | TrendD+1 | TrendD+2 | TrendD+3 | TrendD+4 | TrendD+5 | TrendD+6 | TrendD+7 |
|---|---|---|---|---|---|---|---|---|
| 0 | The Guardian view on Ryanair’s model : a union... | 2 | 1 | 1 | 1 | 1 | 1 | 1 |
| 1 | Universal basic income is no panacea for us – ... | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 2 | Business live World markets driven to record h... | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 3 | Parts of UK that voted Brexit are most exposed... | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 4 | Fall in demand shrinks UK car making by 4 . 6%... | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 926 | VW to shift centre of software development to ... | 2 | 2 | 2 | 2 | 2 | 2 | 2 |
| 927 | Delta takes USD 3 billion charge on buyouts , ... | 1 | 1 | 1 | 1 | 2 | 2 | 2 |
| 928 | Negative U . S . rate bets persist , but seen ... | 1 | 1 | 1 | 2 | 2 | 2 | 2 |
| 929 | IQ Capital CEO Keith Bliss says tech and healt... | 1 | 1 | 2 | 2 | 2 | 1 | 1 |


In [22]:
aggregated_data["Headlines"].sample().values

array(["Business live Markets nervous amid new US-China tariffs and Trump's troubles - as it happened Bounced cheques , cancelled flights – why do we still fly Ryanair ? Ryanair deal with Irish pilots union looks to end strikes Competition watchdog to scrutinise merger of Sainsbury's and Asda Ban diesel cars from cities , say half of UK drivers in poll Goldman Sachs to tempt Britain's savers with Marcus account No-deal Brexit : Britons in EU could lose access to UK bank accounts Fighting to survive : Noble Group's fate hangs on investors restructuring vote Waymo sets up subsidiary in Shanghai as Google plans China push Tipping point ? Inflation creep at Australia's mines to erode margins Trade tensions may power down China's robot industry U . S . -China trade talks end with no breakthrough as tariffs kick in China's Huawei slams Australia 5G mobile network ban as 'politically motivated' Four Toyota group firms to form JV for self-driving tech : Nikkei SEC to review decision rejecting 

In [23]:
aggregated_data["Word_Count"] = aggregated_data["Headlines"].apply(lambda x: len(x.split()))

In [24]:
aggregated_data["Word_Count"].describe()

count     931.000000
mean      674.239527
std       373.397260
min         4.000000
25%       327.000000
50%       743.000000
75%       956.000000
max      1976.000000
Name: Word_Count, dtype: float64

--> Simply concatenating headlines results in strings that exceed the token window for more than 50% of the dates (median of 743 words vs. a 512-token window)

### Cleaning the concatenated Headlines data

- remove non-alphabetic characters and convert to lowercase
- remove stopwords
- remove duplicate words

In [25]:
import re
import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')

def process_text(text): 
    text = re.sub('[^a-zA-Z]', ' ', text).lower()
    words = text.split()
    
    all_stopwords = set(stopwords.words('english'))
    all_stopwords.discard('not')
    
    filtered_words = [word for word in words if word not in all_stopwords]

    seen = set()
    unique_words = [word for word in filtered_words if not (word in seen or seen.add(word))]

    return ' '.join(unique_words)

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/jeroen/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
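To make the effect of the three cleaning steps concrete, here is a small self-contained sketch on a made-up headline string. It mirrors the logic of `process_text` but uses a tiny hard-coded stopword set as a stand-in for NLTK's English list, and the name `condense` and the example text are hypothetical.

```python
import re

# Tiny stand-in stopword list for illustration only;
# the notebook uses NLTK's English stopwords with "not" removed from the list.
STOPWORDS = {"as", "the", "to", "of", "a", "in", "and"}

def condense(text):
    # Replace non-alphabetic characters with spaces, lowercase, split into words
    words = re.sub("[^a-zA-Z]", " ", text).lower().split()
    kept = [w for w in words if w not in STOPWORDS]
    # dict.fromkeys keeps the first occurrence of each word and preserves order
    return " ".join(dict.fromkeys(kept))

example = "Stocks fall as oil prices fall 4.6% ; oil stocks not spared"
condensed = condense(example)
print(condensed)  # stocks fall oil prices not spared
```

Note that numbers such as `4.6%` disappear entirely and that repeated words are kept only at their first position, so the condensed text no longer reads as separate headlines.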


In [26]:
aggregated_data["Headlines_Condensed"] = aggregated_data["Headlines"].apply(lambda x: process_text(x))

In [27]:
aggregated_data["Word_Count_Condensed"] = aggregated_data["Headlines_Condensed"].apply(lambda x: len(x.split()))

In [28]:
aggregated_data["Headlines_Condensed"].sample().values

array(['kremlin stays silent russia oil cut plans ahead opec meeting xerox hp blame takeover battle heats resiliency test well chinese firms cope financially virus hit amazon confirms first coronavirus case among u employees exclusive softbank backed cloudminds blocked exporting tech china hk bank east asia review assets elliott management supreme court leans toward sec power recover ill gotten gains ceo hosts pre ipo summit new york courts investors fed cuts rates blunt impact markets drop evans expects impacts economy short lived another foul day wall street surprise rate business live slides federal reserve makes emergency us happened prepares gets closer home impossible foods prices plant based meat sold distributors hewlett packard enterprise cash flow outlook shares chevron says hearing optimistic talk around production nbcuniversal sells record usd billion tokyo olympic ads berkshire hathaway hold may annual despite curb events mester sees possible economic outbreak jpmorgan wor

In [29]:
aggregated_data["Word_Count_Condensed"].describe()

count    931.000000
mean     359.446831
std      180.620984
min        4.000000
25%      195.000000
50%      406.000000
75%      497.000000
max      867.000000
Name: Word_Count_Condensed, dtype: float64

--> After this cleanup, the Headlines data for more than 75% of the dates will fit in the token window (judged by word count)

In [30]:
len(aggregated_data[aggregated_data["Word_Count_Condensed"] > 512].Headlines.values)/len(aggregated_data["Word_Count_Condensed"])

0.21804511278195488

### Sentiment analysis on condensed Headlines data

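(truncated if longer than the token window; note that `x[:max_length]` keeps the first 512 characters of each string, not the first 512 tokens)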

In [31]:
# One pipeline call per date; x[:max_length] slices characters, not tokens
results = aggregated_data["Headlines_Condensed"].apply(lambda x: sentiment_analysis(x[:max_length])[0])
aggregated_data["Sentiment"] = results.map(lambda r: r["label"])
aggregated_data["Score"] = results.map(lambda r: r["score"])
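Because `x[:max_length]` is a Python string slice, it limits characters rather than tokens. The following sketch uses a synthetic string (not data from the notebook) to show how few words survive a 512-character cut.

```python
# Synthetic text: 700 copies of a six-letter word, i.e. well over 512 words
text = " ".join(["market"] * 700)
max_length = 512

sliced = text[:max_length]        # what x[:max_length] does: keeps 512 characters
kept_words = len(sliced.split())  # words surviving the slice (the last one may be cut off)

print(len(text.split()), kept_words)
```

With six-letter words, only about 74 of the 700 words remain, so for long concatenated strings the model sees just the first few headlines of the day. To cut by tokens instead, one option is to drop the slice and request truncation from the tokenizer, for example by passing `truncation=True` to the pipeline call; this is worth checking against the installed transformers version.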

In [32]:
sentiment_mapping = {
    "negative": 0,
    "neutral": 1,
    "positive": 2
}

In [33]:
aggregated_data["Sentiment_Value"] = aggregated_data["Sentiment"].map(sentiment_mapping)

In [34]:
aggregated_data

| | Headlines | TrendD+1 | TrendD+2 | TrendD+3 | TrendD+4 | TrendD+5 | TrendD+6 | TrendD+7 | Word_Count | Headlines_Condensed | Word_Count_Condensed | Sentiment | Score | Sentiment_Value |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | The Guardian view on Ryanair’s model : a union... | 2 | 1 | 1 | 1 | 1 | 1 | 1 | 234 | guardian view ryanair model union friendly com... | 144 | neutral | 0.998799 | 1 |
| 1 | Universal basic income is no panacea for us – ... | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 438 | universal basic income panacea us labour back ... | 271 | neutral | 0.893167 | 1 |
| 2 | Business live World markets driven to record h... | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 250 | business live world markets driven record high... | 147 | neutral | 0.903362 | 1 |
| 3 | Parts of UK that voted Brexit are most exposed... | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 241 | parts uk voted brexit exposed effects report s... | 134 | positive | 0.570853 | 2 |
| 4 | Fall in demand shrinks UK car making by 4 . 6%... | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 241 | fall demand shrinks uk car making long read hi... | 149 | positive | 0.875229 | 2 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 926 | VW to shift centre of software development to ... | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 1357 | vw shift centre software development audi crow... | 669 | negative | 0.933551 | 0 |
| 927 | Delta takes USD 3 billion charge on buyouts , ... | 1 | 1 | 1 | 1 | 2 | 2 | 2 | 1482 | delta takes usd billion charge buyouts america... | 715 | negative | 0.739900 | 0 |
| 928 | Negative U . S . rate bets persist , but seen ... | 1 | 1 | 1 | 2 | 2 | 2 | 2 | 1404 | negative u rate bets persist seen unlikely hap... | 706 | negative | 0.934993 | 0 |
| 929 | IQ Capital CEO Keith Bliss says tech and healt... | 1 | 1 | 2 | 2 | 2 | 1 | 1 | 866 | iq capital ceo keith bliss says tech healthcar... | 482 | positive | 0.932718 | 2 |


In [35]:
for i in range(1,8):
    accuracy = accuracy_score(aggregated_data[f"TrendD+{i}"], aggregated_data["Sentiment_Value"])
    print(f"Accuracy D+{i}: {accuracy:.2f}")

Accuracy D+1: 0.33
Accuracy D+2: 0.32
Accuracy D+3: 0.33
Accuracy D+4: 0.32
Accuracy D+5: 0.32
Accuracy D+6: 0.35
Accuracy D+7: 0.33


--> Surprisingly(!), sentiment analysis of the aggregated Headlines has less predictive value for the next day than that of individual headlines (0.33 vs. 0.41) and is on par with a random guess from the next day up to the next 7 days

In [36]:
aggregated_data.to_csv("../data/jvdm_aggregated_data.csv", index=False)

In [42]:
data["TrendD+1"].value_counts()

TrendD+1
1    31524
2    12406
0     9400
Name: count, dtype: int64

In [43]:
test_data["TrendD+1"].value_counts()

TrendD+1
1    2928
2    1174
0     898
Name: count, dtype: int64

In [44]:
aggregated_data["TrendD+1"].value_counts()

TrendD+1
1    595
2    191
0    145
Name: count, dtype: int64