<a href="https://colab.research.google.com/github/gned0/NLP_stock_prediction/blob/main/cnn_bert_classifier.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


# Predizione di indici di borsa tramite financial news sentiment analysis

Progetto per tirocinio

Studente: Gian Luca Nediani

E-mail: gianluca.nediani@studio.unibo.it

## Introduzione

A partire da quanto mostrato nel paper [Deep Learning for Event-Driven Stock Prediction](https://www.ijcai.org/Proceedings/15/Papers/329.pdf), l'obiettivo è sviluppare una rete neurale in grado di predire l'andamento del mercato azionario tramite metodi di sentiment analysis: valutando le news di carattere finanziario di un dato giorno si vuole predire se il giorno dopo il valore di un certo indice di borsa aumenterà o diminuirà. Come nel paper, l'indice di riferimento utilizzato è _S&P500_, un indice rappresentativo delle performance delle 500 aziende più quotate nella borsa statunitense.

Per comprendere il significato semantico delle news e fare valutazioni sull'andamento del mercato, gli autori del paper rappresentano le news finanziarie come degli eventi. In questo esperimento invece, si farà ricorso a un'architettura Transformer, l'attuale stato dell'arte nel _natural language processing_. Grazie all'encoder di questa architettura, sarà possibile generare degli embedding in grado di rappresentare in maniera ricca il significato semantico dei titoli di notizie finanziarie. Questi embedding saranno poi l'input per una rete neurale di classificazione.

Come nel paper originale per realizzare una predizione per un dato giorno vengono utilizzate news finanziarie dell'intero mese precedente.

In [1]:
!pip install yfinance
!pip install transformers

Collecting yfinance
  Downloading yfinance-0.1.67-py2.py3-none-any.whl (25 kB)
Collecting lxml>=4.5.1
  Downloading lxml-4.7.1-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.manylinux_2_24_x86_64.whl (6.4 MB)
[K     |████████████████████████████████| 6.4 MB 7.5 MB/s 
Installing collected packages: lxml, yfinance
  Attempting uninstall: lxml
    Found existing installation: lxml 4.2.6
    Uninstalling lxml-4.2.6:
      Successfully uninstalled lxml-4.2.6
Successfully installed lxml-4.7.1 yfinance-0.1.67
Collecting transformers
  Downloading transformers-4.15.0-py3-none-any.whl (3.4 MB)
[K     |████████████████████████████████| 3.4 MB 5.4 MB/s 
[?25hCollecting pyyaml>=5.1
  Downloading PyYAML-6.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (596 kB)
[K     |████████████████████████████████| 596 kB 33.5 MB/s 
[?25hCollecting huggingface-hub<1.0,>=0.1.0
  Downloading huggingface_hub-0.2.1-py3-none-any.whl (61 kB)
[K     |████

In [2]:
import pandas as pd
import numpy as np
import torch
import datetime
from torch import nn, optim
from torch.utils.data import Dataset, DataLoader
from transformers import AdamW, get_linear_schedule_with_warmup, BertTokenizer, BertModel
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from collections import defaultdict

## Bert encoder fine-tuning

Per prima cosa viene caricato un modello Transformer pre-addestrato. Su di esso viene eseguito il fine-tuning tramite un dataset di notizie finanziarie, [Financial PhraseBank](https://www.kaggle.com/ankurzing/sentiment-analysis-for-financial-news). L'obiettivo di questa operazione è ottenere un modello Transformer "allenato" su notizie finanziarie, per poi utilizzarne l'encoder per generare gli emebdding delle news finanziarie che saranno dati in input alla rete neurale di classificazione.

Path dei pesi del modello transformer preaddestrato.

In [3]:
MODEL_PATH = 'bert-base-uncased'

Viene utilizzata la GPU fornita da Colab in quanto il calcolo degli embedding e l'addestramento della rete neurale tramite CPU sarebbero troppo lenti.

In [4]:
device = 'cuda' if torch.cuda.is_available() else 'cpu'
print('Using {} device'.format(device))
CUDA_LAUNCH_BLOCKING = "1"

Using cuda device


Download del dataset Financial PhraseBank per fine-tuning

In [5]:
import os.path
from urllib.request import urlretrieve

if not os.path.exists("financial_data_all.csv"):
    urlretrieve("https://raw.githubusercontent.com/gned0/financial_sentiment_analysis/main/financial_data_all.csv", "financial_data_all.csv")

data = pd.read_csv('financial_data_all.csv', delimiter=',', encoding='latin-1')

Rimozione delle entry etichettate come "neutral" (non vengono utilizzate nel fine-tuning) e labeling delle etichette.

In [6]:
data_binary = data.set_axis(['Target', "Text"], axis=1, inplace=False)
data_binary = data_binary[data_binary.Target != 'neutral']
le = preprocessing.LabelEncoder()
data_binary["Target"] = data_binary["Target"].astype("category")
data_binary['Target'] = le.fit_transform(data_binary.Target.values)

In [7]:
RANDOM_SEED = 21

df_train, df_test = train_test_split(
  data_binary,
  test_size=0.2,
  random_state=RANDOM_SEED
)

In [8]:
class FinetuningDataset(Dataset):
  def __init__(self, titles, targets, tokenizer, max_len):
    self.titles = titles
    self.targets = targets
    self.tokenizer = tokenizer
    self.max_len = max_len

  def __len__(self):
    return len(self.titles)

  def __getitem__(self, item):
    title = str(self.titles[item])
    target = self.targets[item]
    encoding = self.tokenizer.encode_plus(title, add_special_tokens=True, max_length=self.max_len, pad_to_max_length=True, return_attention_mask=True, return_tensors="pt")
    return {
      'titles': title,
      'input_ids': encoding['input_ids'].flatten(),
      'attention_mask': encoding['attention_mask'].flatten(),
      'targets': torch.tensor(target, dtype=torch.float)
    }

In [9]:
MAX_LEN = 256
BATCH_SIZE = 16
RANDOM_SEED = 21

In [10]:
def create_data_loader(text, targets, tokenizer, max_len, batch_size):
  ds = FinetuningDataset(
    titles=text,
    targets=targets,
    tokenizer=tokenizer,
    max_len=max_len
  )
  return DataLoader(
    ds,
    batch_size=BATCH_SIZE,
    shuffle=True
  )

In [11]:
tokenizer = BertTokenizer.from_pretrained(MODEL_PATH, truncation=True)
train_data_loader = create_data_loader(df_train['Text'].to_numpy(), df_train['Target'].to_numpy(), tokenizer, MAX_LEN, BATCH_SIZE)
test_data_loader = create_data_loader(df_test['Text'].to_numpy(), df_test['Target'].to_numpy(), tokenizer, MAX_LEN, BATCH_SIZE)

Downloading:   0%|          | 0.00/226k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/28.0 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/455k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/570 [00:00<?, ?B/s]

In [12]:
encoder = BertModel.from_pretrained(MODEL_PATH, output_hidden_states=True).to(device)

Downloading:   0%|          | 0.00/420M [00:00<?, ?B/s]

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Viene definita una rete neurale PyTorch di classificazione, essa è composta dall'encoder Transformer e da un layer di output binario. 

In [13]:
class Classifier(nn.Module):
  def __init__(self):
        super(Classifier, self).__init__()

        self.encoder = encoder
        self.out = nn.Linear(encoder.config.hidden_size, 1)

  def forward(self, input_ids, attention_mask):
        output = self.encoder(input_ids, attention_mask).pooler_output
        return self.out(output)
        
    

In [14]:
model = Classifier().to(device)

In [15]:
EPOCHS = 6
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, weight_decay=0.05)
total_steps = len(train_data_loader) * EPOCHS
scheduler = get_linear_schedule_with_warmup(
  optimizer,
  num_warmup_steps=0,
  num_training_steps=total_steps
)
loss_fn = nn.BCEWithLogitsLoss().to(device)

In [16]:
def train_epoch(model, data_loader, loss_fn, optimizer, scheduler, n_examples, device):
  model = model.train()
  losses = []
  correct_predictions = 0
  step = 0
  for d in data_loader:
      step += 1
      optimizer.zero_grad() # clears previous gradients
      input_ids = d["input_ids"].to(device)
      attention_mask = d["attention_mask"].to(device)
      targets = d["targets"].to(device)
      
      outputs = model(input_ids, attention_mask).to(device)
      preds = outputs>0    
      loss = loss_fn(outputs, targets.unsqueeze(1)) # computes loss
      correct_predictions += torch.sum(torch.transpose(preds, 0, 1) == targets)
      losses.append(loss.item())
      loss.backward() 
      nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
      optimizer.step() # optimizer takes step based on gradients
      scheduler.step() 
  return correct_predictions.double() / n_examples, np.mean(losses)

In [17]:
def eval_model(model, data_loader, loss_fn, device, n_examples):
  model = model.eval()
  losses = []
  correct_predictions = 0
  step = 0
  with torch.no_grad(): # gradient computation disabled for evalutaion
      for d in data_loader:
        step += 1
        input_ids = d["input_ids"].to(device)
        attention_mask = d["attention_mask"].to(device)
        targets = d["targets"].to(device)
        outputs = model(input_ids, attention_mask).to(device)
        preds = (outputs>0)    
        loss = loss_fn(outputs, targets.unsqueeze(1))
        correct_predictions += torch.sum(torch.transpose(preds, 0, 1) == targets)
        losses.append(loss.item())
  return correct_predictions.double() / n_examples, np.mean(losses)

In [18]:
history = defaultdict(list)
least_loss = 1000
for epoch in range(EPOCHS):
  
  print(f'Epoch {epoch + 1}/{EPOCHS}')
  train_acc, train_loss = train_epoch(
    model,
    train_data_loader,
    loss_fn,
    optimizer,
    scheduler,
    len(df_train),
    device
  )

  print(f'Train loss {train_loss} accuracy {train_acc}')
  
  val_acc, val_loss = eval_model(
    model,
    test_data_loader,
    loss_fn,
    device,
    len(df_test)
  )


  print(f'Val   loss {val_loss} accuracy {val_acc}')
  history['train_acc'].append(train_acc)
  history['train_loss'].append(train_loss)
  history['val_acc'].append(val_acc)
  history['val_loss'].append(val_loss)
  if float(val_loss) < float(least_loss):
    torch.save(model.state_dict(), 'best_model_state.bin')
    best_loss = val_loss

Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.


Epoch 1/6




Train loss 0.6008252718231895 accuracy 0.684043229497775
Val   loss 0.5617706322669983 accuracy 0.6954314720812182
Epoch 2/6
Train loss 0.49742390561585476 accuracy 0.750794659885569
Val   loss 0.3975951826572418 accuracy 0.8248730964467005
Epoch 3/6
Train loss 0.3751399035405631 accuracy 0.8289891926255563
Val   loss 0.29056292831897734 accuracy 0.8883248730964466
Epoch 4/6
Train loss 0.32906671244688707 accuracy 0.8607755880483153
Val   loss 0.2516662389039993 accuracy 0.9060913705583755
Epoch 5/6
Train loss 0.2896031912679624 accuracy 0.8874761602034329
Val   loss 0.24552456319332122 accuracy 0.9010152284263959
Epoch 6/6
Train loss 0.27061360449802996 accuracy 0.8970120788302607
Val   loss 0.2561542823910713 accuracy 0.9035532994923857


In [19]:
WEIGHTS = 'best_model_state.bin'
model.load_state_dict(torch.load(WEIGHTS))

<All keys matched successfully>

In [20]:
def compute_matches(preds, targets):
  TP = FP = FN = TN = 0
  targets = targets>0
  
  preds = preds.detach().cpu().numpy()
  targets = targets.detach().cpu().numpy()
    
  for i in range(len(preds)):
      if(preds[i] and targets[i]):
          TP += 1
      elif(preds[i] and not targets[i]):
          FP += 1
      elif(not preds[i] and targets[i]):
          FN += 1
      else:
          TN += 1
          
  return TP, FP, FN, TN

In [21]:
def final_model_evaluation(model, data_loader, loss_fn, device, n_examples):
  model = model.eval()
  losses = []
  correct_predictions = 0
  step = 0
  dictionary = {
      "TP": 0,
      "FP": 0,
      "FN": 0,
      "TN": 0
  }
  with torch.no_grad(): # gradient computation disabled for evalutaion
      for d in data_loader:
        step += 1
        input_ids = d["input_ids"].to(device)
        attention_mask = d["attention_mask"].to(device)
        targets = d["targets"].to(device)
        outputs = model(input_ids, attention_mask)
        preds = (outputs>0)
        
        matches = compute_matches(preds, targets)
        dictionary["TP"] += matches[0]
        dictionary["FP"] += matches[1]
        dictionary["FN"] += matches[2]
        dictionary["TN"] += matches[3]    

        loss = loss_fn(outputs, targets.unsqueeze(1))
        correct_predictions += torch.sum(torch.transpose(preds, 0, 1) == targets)
        losses.append(loss.item())
  return correct_predictions.double() / n_examples, np.mean(losses), dictionary

In [22]:
  val_acc, val_loss, dictionary = final_model_evaluation(
    model,
    test_data_loader,
    loss_fn,
    device,
    len(df_test)
  ) 
  
  print(f'Final model: loss {val_loss} accuracy {val_acc}')
  pd.DataFrame([["True positives: " + str(dictionary["TP"]), "False positives: " + str(dictionary["FP"])],
              ["False ngatives: " + str(dictionary["FN"]), "True negatives: " + str(dictionary["TN"])]])



Final model: loss 0.25353765815496443 accuracy 0.9035532994923857


Unnamed: 0,0,1
0,True positives: 241,False positives: 8
1,False ngatives: 30,True negatives: 115


In [23]:
torch.save(model.encoder.state_dict(), 'encoder_weights.bin')

Grazie al fine-tuning, l'encoder Transformer preaddestrato è ora più efficace nel generare embedding di news finanziarie. L'encoder viene nella seguente sezione utilizzato per generare gli embedding delle news di ogni giorno.

## Encoding del testo in embedding con attention

### Il dataset

Il dataset utilizzato in questo esperimento è ottenuto a partire da due dataset di news finanziarie, entrambi utilizzati nel paper [Deep Learning for Event-Driven Stock Prediction](https://www.ijcai.org/Proceedings/15/Papers/329.pdf). Essi racchiudono rispettivamente 450341 news di natura finanziaria provenienti dalla testata giornalistica _Bloomberg_ e 109110 news di natura finanziaria provenienti dalla testata giornalistica _Reuters_. Sulle orme del paper sopracitato, sono stati estratti soltanto i titoli delle news, in quanto considerati più significativi del corpo della notizia. Inoltre, siccome il modello sviluppato può processare un numero finito di informazioni, i titoli sono stati filtrati, mantenendo solo quelli che includano il nome di uno o più degli indici di borsa che compongono l'indice _S&P500_. 
Le operazioni preliminari appena descritte portano ad avere il seguente file CSV, che per ogni giorno del periodo preso in esame (2007-2016), unisce i titoli di Bloomberg e Reuters.

In [24]:
import os.path
from urllib.request import urlretrieve

if not os.path.exists("financial_titles.csv"):
    urlretrieve("https://raw.githubusercontent.com/gned0/NLP_stock_prediction/main/all_financial_titles.csv", "financial_titles.csv")

df = pd.read_csv('financial_titles.csv', delimiter=',')
df = df.drop('Unnamed: 0', 1)
df = df.dropna(axis=0)
df

Unnamed: 0,ts,title
0,20070102,Apple options probe spotlights ex-officials: p...
1,20070103,Ford CEO says restructuring going well. Ford s...
2,20070104,"US STOCKS-Indexes end up as Intel lifts techs,..."
3,20070105,Nasdaq says no decisions made about LSE stake....
4,20070107,"CES-UPDATE 2-Sony, Microsoft hit game console ..."
...,...,...
3059,20110305,AT&T Says John Stephens to Become CFO When Lin...
3060,20110312,Apple IPad 2 Lines Led by Gray Marketers Eager...
3061,20110414,Apple Is Said to Ready White IPhone Following ...
3062,20110917,"Samsung Seeks to Lift German Sales Ban, Sues A..."


### Embedding

In [25]:
MAX_LEN = 512

In [26]:
class EmbeddingGenerator():
  def __init__(self, encoder, tokenizer, max_len):
    self.encoder = encoder
    self.tokenizer = tokenizer
    self.max_len = max_len

  def tokenize(self, text):
    
    encoding = self.tokenizer.encode_plus(text, add_special_tokens=True, max_length=self.max_len, pad_to_max_length=True, return_attention_mask=True, return_tensors="pt")

    return encoding['input_ids'].to(device), encoding['attention_mask'].to(device)

  def encode(self, text):

    ids, att_mask = self.tokenize(text)
    output = self.encoder(ids, att_mask)
    return output.pooler_output

In [27]:
embedding_generator = EmbeddingGenerator(encoder, tokenizer, MAX_LEN)

Per ogni entry viene generato il rispettivo embedding

In [28]:
titles = (df['title'].to_numpy())
encodings = []
with torch.no_grad():
  for t in titles:
    encoding = embedding_generator.encode(t)
    encodings.append(encoding.cpu().detach().numpy())
series = pd.Series(encodings)



Gli embedding vengono aggiunti al dataframe

In [29]:
df["embedding"] = series

## Creazione del dataset per rete neurale di classificazione

A partire dal dataframe ottenuto in precedenza, è necessario ottenere il dataset finale da utilizzare per l'addestramento e la valutazione della rete neurale di classificazione. Ogni entry di tale dataset avrà le seguenti feature:

*   Data del giorno usata come indice
*   Dati a lungo termine (embedding dei 30 giorni precedenti, matrice 30x768)
*   Dati a medio termine (embedding dei 7 giorni precedenti, matrice 7x768)
*   Dati a breve termine (embedding del giorno precedente, matrice 1x768)





In [30]:
df["ts"] = df["ts"].astype(str)

In [31]:
df["ts"] = df["ts"].apply(lambda x: datetime.date(int(x[:4]), int(x[4:6]), int(x[6:8])))


In [32]:
df

Unnamed: 0,ts,title,embedding
0,2007-01-02,Apple options probe spotlights ex-officials: p...,"[[-0.17940022, -0.15011771, 0.052140586, 0.099..."
1,2007-01-03,Ford CEO says restructuring going well. Ford s...,"[[-0.120098576, -0.17048064, -0.36095574, 0.18..."
2,2007-01-04,"US STOCKS-Indexes end up as Intel lifts techs,...","[[-0.013124456, -0.15857759, 0.1519155, 0.0521..."
3,2007-01-05,Nasdaq says no decisions made about LSE stake....,"[[0.014146556, 0.06537137, 0.53293455, -0.1374..."
4,2007-01-07,"CES-UPDATE 2-Sony, Microsoft hit game console ...","[[-0.19307749, -0.06193406, -0.029514024, 0.21..."
...,...,...,...
3059,2011-03-05,AT&T Says John Stephens to Become CFO When Lin...,"[[-0.09041582, 0.26887658, 0.78703946, -0.1247..."
3060,2011-03-12,Apple IPad 2 Lines Led by Gray Marketers Eager...,"[[-0.3574039, -0.09415499, -0.74995434, 0.4718..."
3061,2011-04-14,Apple Is Said to Ready White IPhone Following ...,"[[-0.14961429, -0.0939417, -0.22493501, 0.1817..."
3062,2011-09-17,"Samsung Seeks to Lift German Sales Ban, Sues A...","[[-0.126408, 0.1760775, 0.94538426, -0.1181099..."


In [33]:
final_df = pd.DataFrame({'ts': [], 'data_long': [], 'data_mid': [], 'data_short': []})

In [34]:
step = 0
for _, row in df.iterrows():
  step += 1
  if((row.ts - df.iloc[0, 0]).days > 30):
    entry_long = []
    entry_mid = []
    entry_short = []
    
    for _, row2 in df.iterrows():
      
      if((row.ts - row2.ts).days == -1):
        entry_long.append(row2.embedding)
        entry_mid.append(row2.embedding)
        entry_short.append(row2.embedding)
      elif((row.ts - row2.ts).days in range(-1, -8, -1)):
        entry_long.append(row2.embedding)
        entry_mid.append(row2.embedding)
      elif((row.ts - row2.ts).days in range(-1, -31, -1)):
        entry_long.append(row2.embedding)

    if(len(entry_long) and len(entry_mid) and len(entry_short)):
        np_entry_long = np.array(entry_long).squeeze(1)
        padded_long = np.zeros((30, 768))
        padded_long[:np_entry_long.shape[0],:np_entry_long.shape[1]] = np_entry_long
        np_entry_mid = np.array(entry_mid).squeeze(1)
        padded_mid = np.zeros((7, 768))
        padded_mid[:np_entry_mid.shape[0],:np_entry_mid.shape[1]] = np_entry_mid

        np_entry_short = np.array(entry_short)
        entry = {'ts': row.ts, 'data_long': padded_long, 'data_mid': padded_mid, 'data_short': np_entry_short}
        final_df = final_df.append(entry, ignore_index=True)


In [35]:
final_df

Unnamed: 0,ts,data_long,data_mid,data_short
0,2007-02-04,"[[-0.05449385568499565, -0.18924008309841156, ...","[[-0.05449385568499565, -0.18924008309841156, ...","[[[-0.054493856, -0.18924008, -0.2798243, 0.15..."
1,2007-02-05,"[[-0.01331113651394844, -0.1721067875623703, -...","[[-0.01331113651394844, -0.1721067875623703, -...","[[[-0.0133111365, -0.17210679, -0.48394886, 0...."
2,2007-02-06,"[[-0.09542350471019745, -0.17273758351802826, ...","[[-0.09542350471019745, -0.17273758351802826, ...","[[[-0.095423505, -0.17273758, -0.39313886, 0.1..."
3,2007-02-07,"[[-0.09329986572265625, -0.1423080563545227, -...","[[-0.09329986572265625, -0.1423080563545227, -...","[[[-0.093299866, -0.14230806, -0.07187233, 0.1..."
4,2007-02-08,"[[-0.0364128015935421, -0.15996508300304413, -...","[[-0.0364128015935421, -0.15996508300304413, -...","[[[-0.0364128, -0.15996508, -0.06955198, 0.093..."
...,...,...,...,...
2695,2011-03-05,"[[-0.271576851606369, -0.18177928030490875, -0...","[[-0.271576851606369, -0.18177928030490875, -0...","[[[-0.27157685, -0.18177928, -0.5939481, 0.395..."
2696,2011-03-12,"[[-0.27202633023262024, -0.23935601115226746, ...","[[-0.27202633023262024, -0.23935601115226746, ...","[[[-0.27202633, -0.23935601, -0.8394005, 0.387..."
2697,2011-04-14,"[[-0.11568888276815414, -0.22953049838542938, ...","[[-0.11568888276815414, -0.22953049838542938, ...","[[[-0.11568888, -0.2295305, -0.72008353, 0.265..."
2698,2011-09-17,"[[-0.2923088073730469, -0.1281108409166336, -0...","[[-0.2923088073730469, -0.1281108409166336, -0...","[[[-0.2923088, -0.12811084, -0.48269793, 0.318..."


## Aggiunta dei dati finanziari al dataset

Ora è necessario ottenere le informazioni relative all'andamento della borsa, in particolare dell'indice S&P500. Tramite il pacchetto yfinance viene creato un dataframe con informazioni sull'andamento di tale titolo (label ^GSPC) nel periodo corrispondente a quello coperto dal dataset di news.

In [36]:
import yfinance as yf

stock = yf.download("^GSPC", start="2007-01-01", end="2016-08-16")
stock

[*********************100%***********************]  1 of 1 completed


Unnamed: 0_level_0,Open,High,Low,Close,Adj Close,Volume
Date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
2007-01-03,1418.030029,1429.420044,1407.859985,1416.599976,1416.599976,3429160000
2007-01-04,1416.599976,1421.839966,1408.430054,1418.339966,1418.339966,3004460000
2007-01-05,1418.339966,1418.339966,1405.750000,1409.709961,1409.709961,2919400000
2007-01-08,1409.260010,1414.979980,1403.969971,1412.839966,1412.839966,2763340000
2007-01-09,1412.839966,1415.609985,1405.420044,1412.109985,1412.109985,3038380000
...,...,...,...,...,...,...
2016-08-09,2182.239990,2187.659912,2178.610107,2181.739990,2181.739990,3334300000
2016-08-10,2182.810059,2183.409912,2172.000000,2175.489990,2175.489990,3254950000
2016-08-11,2177.969971,2188.449951,2177.969971,2185.790039,2185.790039,3423160000
2016-08-12,2183.739990,2186.280029,2179.419922,2184.050049,2184.050049,3000660000


Per ottenere le etichette da usare per la classificazione delle giornate nel mercato azionario, viene creato un valore binario: 0 se in un dato giorno il valore dell'indice chiude in calo rispetto all'apertura e 1 se al contrario chiude in rialzo.

In [37]:
def binarize(x):
  if x > 0:
    return 1
  return 0

In [38]:
stock['target'] = 0
stock['target'] = stock['Close'] - stock['Open']
stock['target'] = stock['target'].apply(binarize)
stock.reset_index(inplace=True)
stock.rename(columns={'Date':'ts'}, inplace = True)
stock

Unnamed: 0,ts,Open,High,Low,Close,Adj Close,Volume,target
0,2007-01-03,1418.030029,1429.420044,1407.859985,1416.599976,1416.599976,3429160000,0
1,2007-01-04,1416.599976,1421.839966,1408.430054,1418.339966,1418.339966,3004460000,1
2,2007-01-05,1418.339966,1418.339966,1405.750000,1409.709961,1409.709961,2919400000,0
3,2007-01-08,1409.260010,1414.979980,1403.969971,1412.839966,1412.839966,2763340000,1
4,2007-01-09,1412.839966,1415.609985,1405.420044,1412.109985,1412.109985,3038380000,0
...,...,...,...,...,...,...,...,...
2417,2016-08-09,2182.239990,2187.659912,2178.610107,2181.739990,2181.739990,3334300000,0
2418,2016-08-10,2182.810059,2183.409912,2172.000000,2175.489990,2175.489990,3254950000,0
2419,2016-08-11,2177.969971,2188.449951,2177.969971,2185.790039,2185.790039,3423160000,1
2420,2016-08-12,2183.739990,2186.280029,2179.419922,2184.050049,2184.050049,3000660000,1


In [39]:
stock.dropna(inplace=True)


In [40]:
stock = stock[['ts', 'target']]
stock['ts'] = stock['ts'].astype(str).apply(lambda x: x.replace('-', ''))
stock['ts'] = stock['ts'].apply(lambda x: datetime.date(int(x[:4]), int(x[4:6]), int(x[6:8])))


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  This is separate from the ipykernel package so we can avoid doing imports until


In [41]:
final_df = final_df.merge(stock, on='ts')
final_df

Unnamed: 0,ts,data_long,data_mid,data_short,target
0,2007-02-05,"[[-0.01331113651394844, -0.1721067875623703, -...","[[-0.01331113651394844, -0.1721067875623703, -...","[[[-0.0133111365, -0.17210679, -0.48394886, 0....",0
1,2007-02-06,"[[-0.09542350471019745, -0.17273758351802826, ...","[[-0.09542350471019745, -0.17273758351802826, ...","[[[-0.095423505, -0.17273758, -0.39313886, 0.1...",1
2,2007-02-07,"[[-0.09329986572265625, -0.1423080563545227, -...","[[-0.09329986572265625, -0.1423080563545227, -...","[[[-0.093299866, -0.14230806, -0.07187233, 0.1...",1
3,2007-02-08,"[[-0.0364128015935421, -0.15996508300304413, -...","[[-0.0364128015935421, -0.15996508300304413, -...","[[[-0.0364128, -0.15996508, -0.06955198, 0.093...",0
4,2007-02-12,"[[-0.012401222251355648, -0.11849361658096313,...","[[-0.012401222251355648, -0.11849361658096313,...","[[[-0.012401222, -0.11849362, 0.07116662, 0.06...",0
...,...,...,...,...,...
2143,2016-08-09,"[[-0.03872044011950493, -0.13178403675556183, ...","[[-0.03872044011950493, -0.13178403675556183, ...","[[[-0.03872044, -0.13178404, 0.11120205, 0.129...",0
2144,2016-08-10,"[[0.022384976968169212, -0.15621121227741241, ...","[[0.022384976968169212, -0.15621121227741241, ...","[[[0.022384977, -0.15621121, -0.25552735, 0.16...",0
2145,2016-08-11,"[[0.00803440436720848, -0.17861692607402802, -...","[[0.00803440436720848, -0.17861692607402802, -...","[[[0.008034404, -0.17861693, -0.24132417, 0.13...",1
2146,2016-08-12,"[[-0.0062514678575098515, -0.11046932637691498...","[[-0.0062514678575098515, -0.11046932637691498...","[[[-0.006251468, -0.11046933, 0.22151257, 0.03...",1


## Preprocessing

In [42]:
np_dataset = final_df.to_numpy()

In [43]:
class FinancialDataset(Dataset):
  def __init__(self, dates, long_data, mid_data, short_data, targets, max_len):
    self.dates = dates
    self.long_data = long_data
    self.mid_data = mid_data
    self.short_data = short_data
    self.targets = targets
    self.max_len = max_len

  def __len__(self):
    return len(self.dates)

  def __getitem__(self, item):
    date = str(self.dates[item])
    target = self.targets[item]
    long_data = self.long_data[item]
    mid_data = self.mid_data[item]
    short_data = self.short_data[item]
    return {
      'date': date,
      'long_data': torch.tensor(long_data, dtype=torch.float),
      'mid_data': torch.tensor(mid_data, dtype=torch.float),
      'short_data': torch.tensor(short_data, dtype=torch.float),
      'targets': torch.tensor(target, dtype=torch.float)
    }

In [44]:
def create_data_loader(dates, data_long, data_mid, data_short, targets, max_len, batch_size):
  ds = FinancialDataset(
    dates=dates,
    long_data = data_long,
    mid_data = data_mid,
    short_data = data_short,
    targets=targets,
    max_len=max_len
  )
  return DataLoader(
    ds,
    batch_size=BATCH_SIZE,
    shuffle=False
  )

In [45]:
BATCH_SIZE = 8
RANDOM_SEED = 21
MAX_LEN = 512

In [46]:
df_train, df_test = train_test_split(
  final_df,
  test_size=0.15,
  random_state=RANDOM_SEED
)

In [47]:
df_train = df_train.to_numpy()

In [48]:
df_test = df_test.to_numpy()

In [49]:
df_train_list = []
for i in df_test:
    
    #print(i[3].shape)
    if(i[3].shape != (0,)):
        df_train_list.append(i)
df_train = np.array(df_train_list)

In [50]:
df_test_list = []
for i in df_test:
    
    #print(i[3].shape)
    if(i[3].shape != (0,)):
        df_test_list.append(i)
df_test = np.array(df_test_list)
        

In [51]:
train_data_loader = create_data_loader(df_train[:, 0], df_train[:, 1], df_train[:, 2], df_train[:, 3], df_train[:, 4], MAX_LEN, BATCH_SIZE)

In [52]:
test_data_loader = create_data_loader(df_test[:, 0], df_test[:, 1], df_test[:, 2], df_test[:, 3], df_test[:, 4], MAX_LEN, BATCH_SIZE)

## Rete neurale convoluzionale per classificazione

Viene qui definita la rete neurale di classificazione: è composta da due blocchi convoluzionali, uno per processare i dati a lungo termine (30 giorni prima) e una per quelli a breve termine (7 giorni prima). Nei blocchi convoluzionali viene eseguita una convoluzione, seguita da normalizzazione, funzione di attivazione ReLU e dropout, infine è posto un max pooling per ridurre la dimensionalità.
L'output dei due blocchi convoluzionali sono due tensori 1x768, che vengono concatenati col tensore 1x768 dei dati a breve termine (1 giorno prima). Si ottiene dunque un tensore 1x2304, che viene passato al layer in output per eseguire la classificazione binaria.

In [53]:
class Classifier(nn.Module):
  def __init__(self):
        super(Classifier, self).__init__()

        self.cnn_long = self.conv_block(c_in=1, c_out=8, dropout=0.1, kernel_size=(3, 1), stride=(3, 1))
        self.maxpool_long = nn.MaxPool3d(kernel_size=(8, 10, 1))

        self.cnn_mid = self.conv_block(c_in=1, c_out=8, dropout=0.1, kernel_size=(3, 1), stride=(3, 1), padding=(1, 0))
        self.maxpool_mid = nn.MaxPool3d(kernel_size=(8, 3, 1))
        self.out = nn.Linear(2304, 1)



  def forward(self, input_long, input_mid, input_short):
        x = self.cnn_long(input_long)
        x = self.maxpool_long(x).squeeze(1)

        y = self.cnn_mid(input_mid)
        y = self.maxpool_mid(y).squeeze(1)
        
        concat = torch.cat([x.squeeze(1), y.squeeze(1), input_short.squeeze(1).squeeze(1)], dim=1)
        
        return self.out(concat)
  
  def conv_block(self, c_in, c_out, dropout,  **kwargs):
        seq_bloc = nn.Sequential(
            nn.Conv2d(in_channels=c_in, out_channels=c_out, **kwargs),
            nn.BatchNorm2d(num_features=c_out),
            nn.ReLU(),
            nn.Dropout2d(p=dropout)
        )
        return seq_bloc

In [54]:
model = Classifier().to(device)

## Addestramento e valutazione

Vengono definiti un ottimizzatore e una funzione d'errore. La funzione di errore utilizzata è _binary cross entropy_ in quanto si tratta di un problema di classificazione binaria. Viene utilizzata la versione _with logits_ in quanto gli output della rete neurale non passano per una funzione di attivazione.

In [55]:
EPOCHS = 30
optimizer = torch.optim.AdamW(model.parameters())
total_steps = len(train_data_loader) * EPOCHS
scheduler = get_linear_schedule_with_warmup(
  optimizer,
  num_warmup_steps=0,
  num_training_steps=total_steps
)
loss_fn = nn.BCEWithLogitsLoss().to(device)

Secondo le indicazioni della [documentazione PyTorch](https://pytorch.org/docs/stable/optim.html), vengono definiti gli step per l'addestramento e la valutazione del modello. 

In [56]:
def train_epoch(model, data_loader, loss_fn, optimizer, scheduler, n_examples, device):
  model = model.train()
  losses = []
  correct_predictions = 0
  step = 0
  for d in data_loader:
      step += 1
      optimizer.zero_grad() # clears previous gradients
      input_long = d["long_data"].unsqueeze(1).to(device)
      input_mid = d["mid_data"].unsqueeze(1).to(device)
      input_short = d["short_data"].to(device)
      targets = d["targets"].to(device)
      outputs = model(input_long, input_mid, input_short)
      preds = outputs>0    
      loss = loss_fn(outputs, targets.unsqueeze(1)) # computes loss
      correct_predictions += torch.sum(torch.transpose(preds, 0, 1) == targets)
      losses.append(loss.item())
      loss.backward() 
      nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
      optimizer.step() # optimizer takes step based on gradients
      scheduler.step() 
  return correct_predictions.double() / n_examples, np.mean(losses)

In [57]:
def eval_model(model, data_loader, loss_fn, device, n_examples):
  model = model.eval()
  losses = []
  correct_predictions = 0
  step = 0
  with torch.no_grad(): # gradient computation disabled for evalutaion
      for d in data_loader:
        step += 1
        input_long = d["long_data"].unsqueeze(1).to(device)
        input_mid = d["mid_data"].unsqueeze(1).to(device)
        input_short = d["short_data"].to(device)
        targets = d["targets"].to(device)
        outputs = model(input_long, input_mid, input_short)
        preds = (outputs>0)    
        loss = loss_fn(outputs, targets.unsqueeze(1))
        correct_predictions += torch.sum(torch.transpose(preds, 0, 1) == targets)
        losses.append(loss.item())
  return correct_predictions.double() / n_examples, np.mean(losses)

In [58]:
history = defaultdict(list)
least_loss = 1000
for epoch in range(EPOCHS):
  
  print(f'Epoch {epoch + 1}/{EPOCHS}')
  train_acc, train_loss = train_epoch(
    model,
    train_data_loader,
    loss_fn,
    optimizer,
    scheduler,
    len(df_train),
    device
  )

  print(f'Train loss {train_loss} accuracy {train_acc}')
  
  val_acc, val_loss = eval_model(
    model,
    test_data_loader,
    loss_fn,
    device,
    len(df_test)
  )


  print(f'Val   loss {val_loss} accuracy {val_acc}')
  history['train_acc'].append(train_acc)
  history['train_loss'].append(train_loss)
  history['val_acc'].append(val_acc)
  history['val_loss'].append(val_loss)
  if float(val_loss) < float(least_loss):
    torch.save(model.state_dict(), 'best_model_state.bin')
    best_loss = val_loss

Epoch 1/30
Train loss 0.7795568900864299 accuracy 0.5789473684210527
Val   loss 0.6919128836655035 accuracy 0.5882352941176471
Epoch 2/30
Train loss 0.7752503127586551 accuracy 0.5077399380804953
Val   loss 0.9549891279964913 accuracy 0.5851393188854489
Epoch 3/30
Train loss 0.8077404062922408 accuracy 0.5015479876160991
Val   loss 0.8946401520473201 accuracy 0.5851393188854489
Epoch 4/30
Train loss 0.7869316673860317 accuracy 0.5170278637770899
Val   loss 1.03883238463867 accuracy 0.5851393188854489
Epoch 5/30
Train loss 0.7820308462875646 accuracy 0.5510835913312694
Val   loss 1.056960157504896 accuracy 0.5851393188854489
Epoch 6/30
Train loss 0.7414963332618155 accuracy 0.6037151702786379
Val   loss 0.7223467397980574 accuracy 0.6006191950464397
Epoch 7/30
Train loss 0.7148974864948087 accuracy 0.56656346749226
Val   loss 0.8456680869183889 accuracy 0.5882352941176471
Epoch 8/30
Train loss 0.7149353812380534 accuracy 0.5851393188854489
Val   loss 0.8019622498896064 accuracy 0.588235

##Conclusioni

Vengono caricati i pesi relativi all'epoca con i risultati migliori in fase di addestramento

In [59]:
WEIGHTS = 'best_model_state.bin'
model.load_state_dict(torch.load(WEIGHTS))

<All keys matched successfully>

Viene fatta una valutazione finale del modello con tali pesi, con anche una confusion matrix per meglio interpretare i risultati.

In [60]:
def final_model_evaluation(model, data_loader, loss_fn, device, n_examples):
  model2 = model.eval()
  losses = []
  correct_predictions = 0
  step = 0
  dictionary = {
      "TP": 0,
      "FP": 0,
      "FN": 0,
      "TN": 0
  }
  with torch.no_grad(): # gradient computation disabled for evalutaion
      for d in data_loader:
        step += 1
        input_long = d["long_data"].unsqueeze(1).to(device)
        input_mid = d["mid_data"].unsqueeze(1).to(device)
        input_short = d["short_data"].to(device)
        targets = d["targets"].to(device)
        outputs = model(input_long, input_mid, input_short)
        preds = (outputs>0)
        
        matches = compute_matches(preds, targets)
        dictionary["TP"] += matches[0]
        dictionary["FP"] += matches[1]
        dictionary["FN"] += matches[2]
        dictionary["TN"] += matches[3]    

        loss = loss_fn(outputs, targets.unsqueeze(1))
        correct_predictions += torch.sum(torch.transpose(preds, 0, 1) == targets)
        losses.append(loss.item())
  return correct_predictions.double() / n_examples, np.mean(losses), dictionary

In [61]:
  val_acc, val_loss, dictionary = final_model_evaluation(
    model,
    test_data_loader,
    loss_fn,
    device,
    len(df_test)
  ) 
  
  print(f'Final model: loss {val_loss} accuracy {val_acc}')
  pd.DataFrame([["True positives: " + str(dictionary["TP"]), "False positives: " + str(dictionary["FP"])],
              ["False ngatives: " + str(dictionary["FN"]), "True negatives: " + str(dictionary["TN"])]])

Final model: loss 0.537829702220312 accuracy 0.7306501547987616


Unnamed: 0,0,1
0,True positives: 166,False positives: 64
1,False ngatives: 23,True negatives: 70


Il paper originale utilizzava due metriche di performance : l'accuracy totale del modello e il coefficiente MCC ($T P ·T N −F P ·F N
√
(TP +FP)(TP +FN )(TN +FP)(TN +FN )$, punteggio più alto è migliore). Per la predizione dell'indice S&P 500, l'accuracy del modello del paper si attestava al 65.08%, mentre l'MCC era pari a 0.4357. Il modello con encoder Transformer sviluppato in questo notebook nello stesso task e con lo stesso dataset ha una accuracy del 73.06% e uno score MCC di 0.4442, dimostrandosi, almeno empiricamente, migliore del modello a cui si ispira. 