The goal of this code is to build an SI-RCNN model to forecast intraday directional movements.

The first step is to load price data for our chosen instrument and compute seven technical indicators from it. For this assignment we used the S&P 500 index.

We used the following seven indicators:


1. Stochastic %K
2. Williams %R
3. Stochastic %D
4. A/D Oscillator
5. Momentum
6. Disparity
7. Rate of Change

In [None]:
import pandas as pd
import numpy as np
import yfinance as yf

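technical_layer = yf.download("^GSPC", start="2008-01-01", end="2013-12-31", interval="1d")
# Some yfinance versions return MultiIndex columns (field, ticker) even for a
# single ticker; flatten them to plain field names so each column is a Series.
if isinstance(technical_layer.columns, pd.MultiIndex):
    technical_layer.columns = technical_layer.columns.get_level_values(0)
technical_layer.reset_index(inplace=True)
technical_layer.dropna(inplace=True)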
lookback = 14

# 1. Stochastic %K
low_min = technical_layer['Low'].rolling(window=lookback).min()
high_max = technical_layer['High'].rolling(window=lookback).max()
technical_layer['Stochastic_%K'] = 100 * ((technical_layer['Close'] - low_min) / (high_max - low_min))

# 2. Williams %R
technical_layer["Williams_%R"] = -100 * ((high_max - technical_layer['Close']) / (high_max - low_min))

# 3. Stochastic %D (3-period SMA of %K)
technical_layer['Stochastic_%D'] = technical_layer['Stochastic_%K'].rolling(window=3).mean()

# 4. A/D Oscillator (Accumulation/Distribution Line)
ad = ((technical_layer['Close'] - technical_layer['Low']) - (technical_layer['High'] - technical_layer['Close'])) / (technical_layer['High'] - technical_layer['Low']) * technical_layer['Volume']
technical_layer['AD_Line'] = ad.cumsum()
technical_layer['AD_Oscillator'] = technical_layer['AD_Line'] - technical_layer['AD_Line'].shift(lookback)

# 5. Momentum (Close - Close n periods ago)
technical_layer['Momentum'] = technical_layer['Close'] - technical_layer['Close'].shift(lookback)

# 6. Disparity (Close / Moving Average * 100)
technical_layer['Disparity'] = (technical_layer['Close'] / technical_layer['Close'].rolling(window=lookback).mean()) * 100

# 7. Rate of Change (ROC)
technical_layer['ROC'] = ((technical_layer['Close'] - technical_layer['Close'].shift(lookback)) / technical_layer['Close'].shift(lookback)) * 100

[*********************100%***********************]  1 of 1 completed
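
As a quick sanity check on the formulas above, the sketch below recomputes Stochastic %K and Williams %R on synthetic prices. The data is randomly generated for illustration (it is not S&P 500 data), and the names `stoch_k`, `williams_r` and `offset_is_constant` are hypothetical. With the definitions used in this notebook the two indicators differ by a constant 100, so they carry the same information.

```python
import numpy as np
import pandas as pd

# Synthetic OHLC-like data, made up for illustration only.
rng = np.random.default_rng(0)
close = pd.Series(100 + rng.normal(0, 1, 60).cumsum())
high = close + rng.uniform(0.1, 1.0, 60)
low = close - rng.uniform(0.1, 1.0, 60)

lookback = 14
low_min = low.rolling(window=lookback).min()
high_max = high.rolling(window=lookback).max()

# Same formulas as in the cell above.
stoch_k = 100 * (close - low_min) / (high_max - low_min)
williams_r = -100 * (high_max - close) / (high_max - low_min)

# The first lookback - 1 rows are NaN because the rolling window is not full yet.
valid = stoch_k.notna()

# %K - %R = 100 * (high_max - low_min) / (high_max - low_min) = 100
offset_is_constant = np.allclose(stoch_k[valid] - williams_r[valid], 100.0)
print(offset_is_constant)
```

If both columns are fed to the model, one of them is redundant under these definitions; whether to drop it is a modelling choice this notebook does not make.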


The next step was to create a second input from financial news sentence embeddings. For this we used the FNSPID dataset, which holds millions of financial news records covering S&P 500 companies.

https://github.com/Zdong104/FNSPID_Financial_News_Dataset

We manually cleaned the dataset by removing some of the trailing rows, which contained garbage data after download. We then loaded it into Python, sorted it by date, and dropped every column except the article title and date before saving it again, so that this step does not need to be rerun. Finally, we reduced it to the 2008-2013 date range, which is our training window.

In [30]:
# full_csv = pd.read_csv('All_external.csv',usecols=['Article_title', 'Date'])  
# full_csv = full_csv.sort_values('Date').reset_index(drop=True)  

# full_csv = full_csv.set_index('Date')
# full_csv.to_csv('Sorted_Articles.csv')    

# filtered_layer = full_csv.loc['2008-01-01':'2013-12-31']
# filtered_layer.to_csv('Sorted_Articles_Reduced.csv')
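
The sort-and-filter step can be illustrated on a toy frame. This is a minimal sketch that assumes the `Date` column is parsed with `pd.to_datetime` before slicing, which the commented-out cell above does not do; the rows and the names `toy` and `reduced` are invented for the example.

```python
import pandas as pd

# Invented rows, deliberately out of order and partly outside the window.
toy = pd.DataFrame({
    "Date": ["2013-12-31 09:30:00", "2007-12-31 16:00:00",
             "2008-01-02 10:00:00", "2014-01-02 10:00:00"],
    "Article_title": ["d", "a", "b", "e"],
})

# Parse, sort, and index by date so that .loc can slice by date labels.
toy["Date"] = pd.to_datetime(toy["Date"])
toy = toy.sort_values("Date").set_index("Date")

# Both endpoints are inclusive; a day-level label covers the whole day.
reduced = toy.loc["2008-01-01":"2013-12-31"]
print(reduced["Article_title"].tolist())
```

With a parsed `DatetimeIndex` the end label `'2013-12-31'` keeps every timestamp on that day. Slicing raw date strings instead compares them lexicographically, so rows stamped later on the final day can fall outside the range.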

First we tokenize the titles. Curly quotation marks are normalized to their straight equivalents beforehand, as they caused some parsing issues.

In [31]:
from gensim.models import Word2Vec
import re

embedding_layer = pd.read_csv('Sorted_Articles_Reduced.csv')  
embedding_layer = embedding_layer.sort_values('Date').reset_index(drop=True)  
embedding_layer['Date'] = pd.to_datetime(embedding_layer['Date']).dt.tz_localize(None).dt.date

def preprocess_title(title):
    title = str(title).lower()
    title = title.replace("’", "'").replace("‘", "'").replace("“", '"').replace("”", '"')
    tokens = re.findall(r"\b[a-zA-Z']+\b", title)
    return tokens

token_list = embedding_layer['Article_title'].apply(preprocess_title)    
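
To make the tokenizer's behaviour concrete, here is the same function applied to a single headline. The headline is invented for the example and `sample_tokens` is a hypothetical name. Note that tokens containing digits (such as `q3` or `5%`) are dropped by the letters-and-apostrophes pattern.

```python
import re

def preprocess_title(title):
    # Lowercase, normalize curly quotes, then keep runs of letters/apostrophes.
    title = str(title).lower()
    title = title.replace("’", "'").replace("‘", "'").replace("“", '"').replace("”", '"')
    return re.findall(r"\b[a-zA-Z']+\b", title)

# Invented headline with curly quotes, a digit-bearing token and punctuation.
sample_tokens = preprocess_title("Apple’s Q3 “beats” estimates; shares up 5%")
print(sample_tokens)
```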

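Next we create sentence embeddings by averaging the word vectors of each title, then average the sentence vectors of each day to obtain one news vector per trading day.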

In [32]:
model = Word2Vec(sentences=token_list, vector_size=100, window=5, min_count=1, workers=4)

def get_sentence_vector(tokens):
    vectors = [model.wv[word] for word in tokens if word in model.wv]
    if vectors:
        return np.mean(vectors, axis=0)
    else:
        return np.zeros(model.vector_size)

embedding_layer['sentence_vector'] = token_list.apply(get_sentence_vector)

daily_news = embedding_layer.groupby('Date')['sentence_vector'].apply(
    lambda vecs: np.mean(list(vecs), axis=0)
).reset_index()
trading_days = pd.to_datetime(technical_layer['Date']).dt.tz_localize(None).dt.date
daily_news_trading_days = daily_news[daily_news['Date'].isin(trading_days)].reset_index(drop=True)
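
The two averaging steps (words to sentence, sentences to day) can be traced by hand on a tiny example. This sketch replaces the trained Word2Vec model with a hypothetical three-dimensional lookup table `toy_wv`, so the numbers are illustrative only; `toy_sentence_vector`, `toy_news` and `toy_daily` are likewise made-up names.

```python
import numpy as np
import pandas as pd

# Hypothetical 3-dimensional word vectors standing in for model.wv.
toy_wv = {
    "stocks": np.array([1.0, 0.0, 0.0]),
    "rally": np.array([0.0, 1.0, 0.0]),
    "fall": np.array([0.0, 0.0, 1.0]),
}

def toy_sentence_vector(tokens, dim=3):
    # Mean of the known word vectors; all zeros if no token is in the vocabulary.
    vectors = [toy_wv[t] for t in tokens if t in toy_wv]
    return np.mean(vectors, axis=0) if vectors else np.zeros(dim)

toy_news = pd.DataFrame({
    "Date": ["2008-01-02", "2008-01-02", "2008-01-03"],
    "tokens": [["stocks", "rally"], ["stocks", "fall"], ["unknown"]],
})
toy_news["sentence_vector"] = toy_news["tokens"].apply(toy_sentence_vector)

# One vector per day: the mean of that day's sentence vectors.
toy_daily = toy_news.groupby("Date")["sentence_vector"].apply(
    lambda vecs: np.mean(list(vecs), axis=0)
).reset_index()
print(toy_daily)
```

The first day averages `[0.5, 0.5, 0]` and `[0.5, 0, 0.5]` into `[0.5, 0.25, 0.25]`; the second day has only out-of-vocabulary tokens and falls back to the zero vector.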

In [33]:
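merged = technical_layer.copy()
# Align news vectors to price rows by date rather than by row position,
# so a trading day without news cannot shift every later row.
news_by_date = daily_news_trading_days.set_index('Date')['sentence_vector']
merged['sentence_vector'] = trading_days.map(news_by_date)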

In [34]:

print(daily_news_trading_days)
print(trading_days)

            Date                                    sentence_vector
0     2008-01-02  [-0.855402569357792, 0.4430555461688188, 0.158...
1     2008-01-03  [-0.7145925860987046, 0.4845895619678438, 0.24...
2     2008-01-04  [-0.8094568846824378, 0.4771541593007326, 0.27...
3     2008-01-07  [-0.5404372451732585, 0.5290310096459692, 0.35...
4     2008-01-08  [-0.5793470705228083, 0.5519009553358155, 0.35...
...          ...                                                ...
1505  2013-12-23  [-0.39359990033183284, 0.721949965451881, 0.44...
1506  2013-12-24  [-0.2235240432299434, 0.6596912460643772, 0.63...
1507  2013-12-26  [-0.18483768425603447, 0.6359868007674152, 0.7...
1508  2013-12-27  [-0.4623775944448141, 0.5904485740989636, 0.44...
1509  2013-12-30  [-0.34335334563084224, 0.6613745883400418, 0.4...

[1510 rows x 2 columns]
0       2008-01-02
1       2008-01-03
2       2008-01-04
3       2008-01-07
4       2008-01-08
           ...    
1505    2013-12-23
1506    2013-12-24
1507   

In [35]:
lookback = 5
X_news = []
X_tech = []
y = []

# 1-D array of closes; ravel() also covers a single-column DataFrame
close = merged['Close'].to_numpy().ravel()
feature_cols = ['Stochastic_%K', 'Williams_%R', 'Stochastic_%D',
                'AD_Oscillator', 'Momentum', 'Disparity', 'ROC']

for i in range(lookback, len(merged) - 1):
    # Features cover the `lookback` rows before day i
    news_seq = np.stack(merged['sentence_vector'].iloc[i-lookback:i].values)  # shape (5, 100)
    tech_seq = merged[feature_cols].iloc[i-lookback:i].values  # shape (5, 7)
    if np.isnan(news_seq).any() or np.isnan(tech_seq).any():
        continue

    # Label: 1 if the next close is above the close of day i, else 0
    label = 1 if close[i + 1] > close[i] else 0

    X_news.append(news_seq)
    X_tech.append(tech_seq)
    y.append(label)

X_news = np.array(X_news)      # shape: (num_samples, 5, 100)
X_tech = np.array(X_tech)      # shape: (num_samples, 5, 7)
y = np.array(y)                # shape: (num_samples,)

print("News shape:", X_news.shape)
print("Tech shape:", X_tech.shape)
print("Labels shape:", y.shape)

News shape: (1489, 5, 100)
Tech shape: (1489, 5, 7)
Labels shape: (1489,)
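
The windowing logic is easier to follow on a handful of rows. The sketch below applies the same loop to eight invented closing prices with a lookback of 3 and two placeholder features per day; all names prefixed with `toy_` or suffixed with `_toy` are hypothetical.

```python
import numpy as np

# Invented closing prices and placeholder features (2 per day).
toy_close = np.array([10.0, 11.0, 10.5, 10.8, 11.2, 11.0, 11.5, 11.4])
toy_feat = np.arange(len(toy_close) * 2, dtype=float).reshape(-1, 2)
toy_lookback = 3

X_toy, y_toy = [], []
for i in range(toy_lookback, len(toy_close) - 1):
    # Window holds rows i-3, i-2, i-1 (day i itself is not included).
    X_toy.append(toy_feat[i - toy_lookback:i])
    # Label compares the close of day i+1 with the close of day i.
    y_toy.append(1 if toy_close[i + 1] > toy_close[i] else 0)

X_toy = np.array(X_toy)
y_toy = np.array(y_toy)
print(X_toy.shape, y_toy)
```

Eight days with a lookback of 3 give four samples: the first three days only ever serve as history, and the last day only serves as a label target.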


In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F

class NewsTechLSTM(nn.Module):
    def __init__(self, embedding_dim=100, cnn_out_channels=64, news_hidden=128, tech_hidden=64,dropout=0.5, num_classes=2):
        super(NewsTechLSTM, self).__init__()

        # CNN + LSTM for News
        self.conv1 = nn.Conv1d(embedding_dim, cnn_out_channels, kernel_size=3, padding=1)
        self.relu = nn.ReLU()
        self.pool = nn.MaxPool1d(kernel_size=2)
        self.dropout = nn.Dropout(dropout)
        self.news_lstm = nn.LSTM(cnn_out_channels, news_hidden, batch_first=True)

        # LSTM for Technical Indicators
        self.tech_lstm = nn.LSTM(7, tech_hidden, batch_first=True)

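        # Classifier: outputs raw logits. nn.CrossEntropyLoss applies
        # log-softmax internally, so no softmax layer is needed here.
        self.fc = nn.Linear(news_hidden + tech_hidden, num_classes)

    def forward(self, news_seq, tech_seq):
        # news_seq is (batch, seq_len, embedding_dim); Conv1d expects (batch, channels, seq_len)
        x = news_seq.permute(0, 2, 1)
        x = self.conv1(x)
        x = self.relu(x)
        x = self.pool(x)                # roughly halves the sequence length (5 -> 2 here)
        x = self.dropout(x)
        x = x.permute(0, 2, 1)          # back to (batch, seq_len, channels) for the LSTM

        news_out, _ = self.news_lstm(x)
        news_last = news_out[:, -1, :]  # last time step of the news branch
        tech_out, _ = self.tech_lstm(tech_seq)
        tech_last = tech_out[:, -1, :]  # last time step of the technical branch

        combined = torch.cat((news_last, tech_last), dim=1)
        # Return logits; apply torch.softmax(out, dim=1) outside the model if probabilities are needed
        return self.fc(combined)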

In [37]:
from sklearn.model_selection import TimeSeriesSplit
from sklearn.metrics import accuracy_score
from torch.utils.data import TensorDataset, DataLoader
import copy

X_news_tensor = torch.tensor(X_news, dtype=torch.float32)
X_tech_tensor = torch.tensor(X_tech, dtype=torch.float32)
y_tensor = torch.tensor(y, dtype=torch.long)

tscv = TimeSeriesSplit(n_splits=5)
fold = 1
best_acc = 0.0
best_model_state = None

for train_idx, test_idx in tscv.split(X_news_tensor):
    print(f"\n--- Fold {fold} ---")

    train_dataset = TensorDataset(X_news_tensor[train_idx], X_tech_tensor[train_idx], y_tensor[train_idx])
    test_dataset = TensorDataset(X_news_tensor[test_idx], X_tech_tensor[test_idx], y_tensor[test_idx])

    train_loader = DataLoader(train_dataset, batch_size=32, shuffle=False)
    test_loader = DataLoader(test_dataset, batch_size=32, shuffle=False)

    model = NewsTechLSTM()
    optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
    loss_fn = nn.CrossEntropyLoss()

    model.train()
    for epoch in range(30):
        total_loss = 0
        for news_batch, tech_batch, y_batch in train_loader:
            optimizer.zero_grad()
            out = model(news_batch, tech_batch)
            loss = loss_fn(out, y_batch)
            loss.backward()
            optimizer.step()
            total_loss += loss.item()
        avg_loss = total_loss / len(train_loader)
        print(f"Epoch {epoch+1} Loss: {avg_loss:.4f}")

    model.eval()
    all_preds = []
    all_labels = []
    with torch.no_grad():
        for news_batch, tech_batch, y_batch in test_loader:
            out = model(news_batch, tech_batch)
            preds = torch.argmax(out, dim=1)
            all_preds.extend(preds.numpy())
            all_labels.extend(y_batch.numpy())

    acc = accuracy_score(all_labels, all_preds)
    print(f"Fold {fold} Accuracy: {acc:.4f}")

    if acc > best_acc:
        best_acc = acc
        best_model_state = copy.deepcopy(model.state_dict())

    fold += 1

torch.save(best_model_state, 'best_news_tech_model.pt')
print(f"\nBest model saved with accuracy: {best_acc:.4f}")
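
`nn.CrossEntropyLoss` applies log-softmax to whatever it receives, so it is meant to be given raw scores (logits). The NumPy sketch below re-implements that computation for a single sample, as an illustration rather than the PyTorch code itself, and compares a confident correct prediction passed as logits with the same prediction passed as probabilities. The function names and the example scores are assumptions made for the demonstration.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over a 1-D score vector.
    e = np.exp(z - z.max())
    return e / e.sum()

def cross_entropy_from_scores(scores, target):
    # What a cross-entropy loss on raw scores computes for one sample:
    # log-softmax followed by the negative log-likelihood of the target.
    return -np.log(softmax(scores)[target])

logits = np.array([20.0, -20.0])   # very confident, correct for class 0
probs = softmax(logits)            # approximately [1, 0]

loss_from_logits = cross_entropy_from_scores(logits, 0)
loss_from_probs = cross_entropy_from_scores(probs, 0)   # softmax applied twice

# Lower bound of the two-class loss when the inputs are probabilities in [0, 1].
floor = np.log(1 + np.exp(-1))
print(loss_from_logits, loss_from_probs, floor)
```

When probabilities are passed in, the two-class loss cannot drop below about 0.313 no matter how confident the model is, and gradients are compressed accordingly. For reference, a two-class loss near 0.693 (ln 2) corresponds to an uninformative 50/50 prediction.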


--- Fold 1 ---
Epoch 1 Loss: 0.6948
Epoch 2 Loss: 0.6939
Epoch 3 Loss: 0.6924
Epoch 4 Loss: 0.6909
Epoch 5 Loss: 0.6929
Epoch 6 Loss: 0.6906
Epoch 7 Loss: 0.6939
Epoch 8 Loss: 0.6927
Epoch 9 Loss: 0.6899
Epoch 10 Loss: 0.6911
Epoch 11 Loss: 0.6937
Epoch 12 Loss: 0.6904
Epoch 13 Loss: 0.6890
Epoch 14 Loss: 0.6967
Epoch 15 Loss: 0.6894
Epoch 16 Loss: 0.6942
Epoch 17 Loss: 0.6921
Epoch 18 Loss: 0.6911
Epoch 19 Loss: 0.6900
Epoch 20 Loss: 0.6933
Epoch 21 Loss: 0.6918
Epoch 22 Loss: 0.6910
Epoch 23 Loss: 0.6906
Epoch 24 Loss: 0.6906
Epoch 25 Loss: 0.6899
Epoch 26 Loss: 0.6893
Epoch 27 Loss: 0.6901
Epoch 28 Loss: 0.6908
Epoch 29 Loss: 0.6879
Epoch 30 Loss: 0.6849
Fold 1 Accuracy: 0.5565

--- Fold 2 ---
Epoch 1 Loss: 0.6924
Epoch 2 Loss: 0.6919
Epoch 3 Loss: 0.6900
Epoch 4 Loss: 0.6870
Epoch 5 Loss: 0.6875
Epoch 6 Loss: 0.6881
Epoch 7 Loss: 0.6886
Epoch 8 Loss: 0.6890
Epoch 9 Loss: 0.6857
Epoch 10 Loss: 0.6885
Epoch 11 Loss: 0.6880
Epoch 12 Loss: 0.6888
Epoch 13 Loss: 0.6865
Epoch 14 Loss: 0