The goal of this code is to build an SI-RCNN model to forecast intraday directional movements.

The first step is to load price data for our chosen instrument and compute seven technical indicators from it. For this assignment we used the S&P 500 index.

We used the following seven indicators:

1. Stochastic %K
2. Williams %R
3. Stochastic %D
4. A/D Oscillator
5. Momentum
6. Disparity
7. Rate of Change
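
As a quick sanity check on the definitions that the cells below implement, here is a minimal sketch on made-up prices. The four bars and the 3-bar lookback are illustrative assumptions (the notebook itself uses a 14-day lookback on S&P 500 data). It also shows that Williams %R is Stochastic %K shifted down by 100, so the two carry the same information.

```python
# Toy OHLC bars (made-up numbers, not S&P 500 data); lookback shortened to 3.
highs = [10.0, 11.0, 12.0, 12.5]
lows = [9.0, 9.5, 10.5, 11.0]
closes = [9.5, 10.5, 11.5, 12.0]
lookback = 3
t = len(closes) - 1  # index of the most recent bar

# Highest high and lowest low over the last `lookback` bars, including bar t.
window_high = max(highs[t - lookback + 1 : t + 1])
window_low = min(lows[t - lookback + 1 : t + 1])

stochastic_k = 100 * (closes[t] - window_low) / (window_high - window_low)
williams_r = -100 * (window_high - closes[t]) / (window_high - window_low)

# Momentum and rate of change compare the close with the close `lookback` bars earlier.
momentum = closes[t] - closes[t - lookback]
roc = 100 * momentum / closes[t - lookback]

# Disparity is the close as a percentage of its `lookback`-bar moving average.
moving_average = sum(closes[t - lookback + 1 : t + 1]) / lookback
disparity = 100 * closes[t] / moving_average

print(round(stochastic_k, 2), round(williams_r, 2))
print(momentum, round(roc, 2), round(disparity, 2))
```

Since %K and %R always differ by exactly 100, a model fed both receives one of them as a redundant input; whether that matters here is a modelling choice rather than an error.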

In [33]:
# import yfinance as yf
# import pandas as pd
# import numpy as np
# import ta

# data = yf.download("^GSPC", start="2023-01-01", end="2025-01-01", interval="1d")
# data.dropna(inplace=True)

# # 1. SMA (Simple Moving Average - 20 days)
# data['SMA_20'] = data['Close'].rolling(window=20).mean()

# # 2. EMA (Exponential Moving Average - 20 days)
# data['EMA_20'] = data['Close'].ewm(span=20, adjust=False).mean()

# # 3. RSI (Relative Strength Index - 14 days)
# delta = data['Close'].diff()
# gain = np.where(delta > 0, delta, 0)
# loss = np.where(delta < 0, -delta, 0)
# avg_gain = pd.Series(gain.reshape(-1)).rolling(window=14).mean()
# avg_loss = pd.Series(loss.reshape(-1)).rolling(window=14).mean()
# rs = avg_gain / avg_loss
# data['RSI_14'] = 100 - (100 / (1 + rs))

# # 4. MACD (Moving Average Convergence Divergence)
# ema_12 = data['Close'].ewm(span=12, adjust=False).mean()
# ema_26 = data['Close'].ewm(span=26, adjust=False).mean()
# data['MACD'] = ema_12 - ema_26

# # 5. Stochastic Oscillator %K (14-day)
# low_14 = data['Low'].rolling(window=14).min()
# high_14 = data['High'].rolling(window=14).max()
# data['Stochastic_K'] = 100 * ((data['Close'] - low_14) / (high_14 - low_14))

# # 6. ATR (Average True Range - 14 days)
# high_low = data['High'] - data['Low']
# high_close = np.abs(data['High'] - data['Close'].shift())
# low_close = np.abs(data['Low'] - data['Close'].shift())
# true_range = pd.concat([high_low, high_close, low_close], axis=1).max(axis=1)
# data['ATR'] = true_range.rolling(window=14).mean()

# # 7. OBV (On-Balance Volume)
# obv = (np.sign(data['Close'].diff()) * data['Volume']).fillna(0).cumsum()
# data['OBV'] = obv

# # Keep only indicator columns
# indicators = data[['SMA_20', 'EMA_20', 'RSI_14', 'MACD', 'Stochastic_K', 'ATR', 'OBV']].dropna()

In [34]:
import pandas as pd
import numpy as np
import yfinance as yf

# Download daily S&P 500 (^GSPC) prices, then move the date index into a 'Date' column
technical_layer = yf.download("^GSPC", start="2008-01-01", end="2013-01-01", interval="1d")
technical_layer.reset_index(inplace=True)
technical_layer.dropna(inplace=True)


# Parameters
lookback = 14  # typical lookback for most of these indicators

# 1. Stochastic %K
low_min = technical_layer['Low'].rolling(window=lookback).min()
high_max = technical_layer['High'].rolling(window=lookback).max()
technical_layer['Stochastic_%K'] = 100 * ((technical_layer['Close'] - low_min) / (high_max - low_min))

# 2. Williams %R
technical_layer["Williams_%R"] = -100 * ((high_max - technical_layer['Close']) / (high_max - low_min))

# 3. Stochastic %D (3-period SMA of %K)
technical_layer['Stochastic_%D'] = technical_layer['Stochastic_%K'].rolling(window=3).mean()

# 4. A/D Oscillator (lookback-period change in the Accumulation/Distribution line)
ad = ((technical_layer['Close'] - technical_layer['Low']) - (technical_layer['High'] - technical_layer['Close'])) / (technical_layer['High'] - technical_layer['Low']) * technical_layer['Volume']
technical_layer['AD_Line'] = ad.cumsum()
technical_layer['AD_Oscillator'] = technical_layer['AD_Line'] - technical_layer['AD_Line'].shift(lookback)

# 5. Momentum (Close - Close n periods ago)
technical_layer['Momentum'] = technical_layer['Close'] - technical_layer['Close'].shift(lookback)

# 6. Disparity (Close / Moving Average * 100)
technical_layer['Disparity'] = (technical_layer['Close'] / technical_layer['Close'].rolling(window=lookback).mean()) * 100

# 7. Rate of Change (ROC)
technical_layer['ROC'] = ((technical_layer['Close'] - technical_layer['Close'].shift(lookback)) / technical_layer['Close'].shift(lookback)) * 100

# Keep the date and indicator columns. Chaining .dropna() returns a new frame,
# which avoids the SettingWithCopyWarning raised by an in-place drop on a slice.
technical_indicators = technical_layer[['Date', 'Stochastic_%K', 'Williams_%R', 'Stochastic_%D','AD_Oscillator', 'Momentum', 'Disparity', 'ROC']].dropna()
print(technical_indicators.head())

[*********************100%***********************]  1 of 1 completed

Price        Date Stochastic_%K Williams_%R Stochastic_%D AD_Oscillator  \
Ticker                                                                    
15     2008-01-24     47.148721  -52.851279     34.063842 -1.397934e+09   
16     2008-01-25     37.795634  -62.204366     40.550381 -1.486316e+09   
17     2008-01-28     52.368422  -47.631578     45.770926  1.570116e+09   
18     2008-01-29     58.004306  -41.995694     49.389454  8.574271e+09   
19     2008-01-30     53.923576  -46.076424     54.765434 -6.741248e+08   

Price    Momentum  Disparity       ROC  
Ticker                                  
15     -95.090088  98.187044 -6.570807  
16     -81.020020  97.036433 -5.739466  
17     -62.220093  99.060320 -4.393516  
18     -27.889893  99.815992 -2.006193  
19     -53.319946  99.618459 -3.783891  





The next step was to create a second input of financial news sentence embeddings. For this we used the FNSPID dataset, which holds millions of financial news records covering S&P 500 companies.

https://github.com/Zdong104/FNSPID_Financial_News_Dataset

We manually cleaned the dataset by removing some of the trailing rows, which contained garbage data after download. We then loaded it into Python, sorted it by date, and kept only the article title and date columns before saving it again, so that this section does not have to be rerun each time. Finally, we reduced it to the five-year date range of 2008-2013, which is our training window.
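
The sort-and-filter step can be sketched on a few made-up rows. The column names `Date` and `Article_title` and the timestamp format come from the notebook; the rows themselves are invented, and parsing the dates before slicing is an assumption about how one might make the range filter robust.

```python
import pandas as pd

# Made-up rows standing in for the FNSPID file; only the column names come from the notebook.
articles = pd.DataFrame(
    {
        "Date": [
            "2014-02-03 00:00:00 UTC",
            "2008-05-01 00:00:00 UTC",
            "2007-12-31 00:00:00 UTC",
            "2013-12-31 00:00:00 UTC",
        ],
        "Article_title": ["too late", "in range", "too early", "last day in range"],
    }
)

# Parse the timestamps so that sorting and slicing are chronological rather than textual.
articles["Date"] = pd.to_datetime(articles["Date"], utc=True)
articles = articles.sort_values("Date").set_index("Date")

# Label-based slicing on a sorted DatetimeIndex includes both endpoints.
reduced = articles.loc["2008-01-01":"2013-12-31"]
print(reduced["Article_title"].tolist())
```

Note that a slice ending at 2013-12-31 spans six calendar years, whereas the price data above ends at 2013-01-01, so the two inputs would still need to be aligned on shared dates.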

In [35]:
# full_csv = pd.read_csv('All_external.csv',usecols=['Article_title', 'Date'])  
# full_csv = full_csv.sort_values('Date').reset_index(drop=True)  

# full_csv = full_csv.set_index('Date')
# full_csv.to_csv('Sorted_Articles.csv')    

# filtered_layer = full_csv.loc['2008-01-01':'2013-12-31']
# filtered_layer.to_csv('Sorted_Articles_Reduced.csv')

First we tokenize the titles. Curly quotation marks are normalized to straight ones beforehand, because they caused parsing issues.

In [36]:
from gensim.models import Word2Vec
import re

embedding_layer = pd.read_csv('Sorted_Articles_Reduced.csv')  
embedding_layer = embedding_layer.sort_values('Date').reset_index(drop=True)  

def preprocess_title(title):
    title = str(title).lower()
    title = title.replace("’", "'").replace("‘", "'").replace("“", '"').replace("”", '"')
    tokens = re.findall(r"\b[a-zA-Z']+\b", title)
    return tokens

token_list = embedding_layer['Article_title'].apply(preprocess_title)
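
To see what the tokenizer keeps and drops, here is a small usage sketch. The function body is repeated from the cell above so the snippet runs on its own; the headline is made up.

```python
import re

def preprocess_title(title):
    # Same steps as the notebook cell: lowercase, normalize curly quotes, keep letter runs.
    title = str(title).lower()
    title = title.replace("’", "'").replace("‘", "'").replace("“", '"').replace("”", '"')
    return re.findall(r"\b[a-zA-Z']+\b", title)

# Made-up headline with curly quotes, a number, and a percent sign.
print(preprocess_title("Apple’s shares “soar” 5% after earnings"))

# A missing title arrives as NaN, which str() turns into the literal token "nan".
print(preprocess_title(float("nan")))
```

Digits and symbols are discarded, so figures such as "5%" never reach the embedding step. The second call suggests that rows with missing titles may be worth dropping before tokenizing, since they would otherwise all share the token "nan".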

Next we create sentence embeddings by averaging the word vectors of each title.
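
To make the averaging concrete, here is a minimal sketch with hypothetical 3-dimensional word vectors in place of a trained Word2Vec model. `toy_vectors` and `average_vector` are illustrative names, not part of the notebook.

```python
import numpy as np

# Hypothetical 3-dimensional word vectors; the notebook's Word2Vec model uses 100 dimensions.
toy_vectors = {
    "stocks": np.array([1.0, 0.0, 2.0]),
    "rally": np.array([3.0, 2.0, 0.0]),
}

def average_vector(tokens, vectors, dim=3):
    # Average the vectors of the known tokens; unknown tokens are skipped.
    known = [vectors[t] for t in tokens if t in vectors]
    # A title with no known tokens falls back to the zero vector.
    return np.mean(known, axis=0) if known else np.zeros(dim)

print(average_vector(["stocks", "rally", "unseen"], toy_vectors))
print(average_vector([], toy_vectors))
```

Averaging discards word order, so "stocks rally" and "rally stocks" map to the same vector. That is the usual trade-off of this simple sentence embedding.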

In [None]:
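# Train Word2Vec on the tokenized titles. The name w2v_model keeps it from being
# overwritten by the neural network that a later cell assigns to `model`.
w2v_model = Word2Vec(sentences=token_list, vector_size=100, window=5, min_count=1, workers=4)

def get_sentence_vector(tokens):
    # Average the vectors of all in-vocabulary tokens; empty titles map to zeros
    vectors = [w2v_model.wv[word] for word in tokens if word in w2v_model.wv]
    if vectors:
        return np.mean(vectors, axis=0)
    return np.zeros(w2v_model.vector_size)

embedding_layer['sentence_vector'] = token_list.apply(get_sentence_vector)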
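
The model defined further down expects input of shape (batch_size, seq_len, embedding_dim), and a later cell prints a `day_to_vectors` object whose construction is not shown. One plausible route from per-title vectors to that shape is to stack each day's vectors and zero-pad every day to a common length. The sketch below is an assumption, not the notebook's actual code: `rows`, `pad_days`, and the 4-dimensional vectors are hypothetical, and with the notebook's DataFrame the grouping would presumably be a groupby on `Date`.

```python
from collections import defaultdict

import numpy as np

# Toy (date, sentence_vector) pairs: three titles over two days, 4-dim vectors (hypothetical).
rows = [
    ("2008-01-02", np.ones(4)),
    ("2008-01-02", 2 * np.ones(4)),
    ("2008-01-03", 3 * np.ones(4)),
]

# Collect the title vectors of each day, then stack them into (num_titles, dim) arrays.
vectors_by_day = defaultdict(list)
for day, vector in rows:
    vectors_by_day[day].append(vector)
day_to_vectors = {day: np.stack(vectors) for day, vectors in vectors_by_day.items()}

def pad_days(day_arrays, max_seq_len):
    # Zero-pad (or truncate) every day to max_seq_len rows so the days form one batch.
    dim = next(iter(day_arrays.values())).shape[1]
    batch = np.zeros((len(day_arrays), max_seq_len, dim), dtype=np.float32)
    for i, vectors in enumerate(day_arrays.values()):
        kept = vectors[:max_seq_len]
        batch[i, : len(kept)] = kept
    return batch

batch = pad_days(day_to_vectors, max_seq_len=2)
print(batch.shape)
```

The resulting array could be passed to `torch.from_numpy` to obtain a tensor. Zero rows are indistinguishable from titles with no known words, so a mask or packed sequences might be preferable in a fuller implementation.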

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F

# Define the combined CNN + LSTM model
class NewsCNN_LSTM(nn.Module):
    def __init__(self, embedding_dim=100, cnn_out_channels=64, lstm_hidden_size=128, lstm_layers=1, dropout=0.5, num_classes=2):
        super(NewsCNN_LSTM, self).__init__()

        # CNN Layer
        self.conv1 = nn.Conv1d(in_channels=embedding_dim, out_channels=cnn_out_channels, kernel_size=3, padding=1)
        self.relu = nn.ReLU()
        self.dropout = nn.Dropout(dropout)
        self.pool = nn.MaxPool1d(kernel_size=2)  # Halves the sequence length after the CNN

        # LSTM Layer
        self.lstm = nn.LSTM(input_size=cnn_out_channels, hidden_size=lstm_hidden_size,
                            num_layers=lstm_layers, batch_first=True, bidirectional=False)

        # Fully Connected + Softmax
        self.fc = nn.Linear(lstm_hidden_size, num_classes)
        self.softmax = nn.Softmax(dim=1)

    def forward(self, x):
        # x shape: (batch_size, seq_len, embedding_dim)
        x = x.permute(0, 2, 1)  # (batch_size, embedding_dim, seq_len) for Conv1D
        x = self.conv1(x)
        x = self.relu(x)
        x = self.pool(x)
        x = self.dropout(x)

        x = x.permute(0, 2, 1)  # Back to (batch_size, seq_len // 2, channels) for LSTM

        # LSTM layer
        lstm_out, _ = self.lstm(x)  # lstm_out: (batch_size, seq_len // 2, lstm_hidden_size)

        # Take last timestep output
        last_timestep = lstm_out[:, -1, :]  # (batch_size, lstm_hidden_size)

        logits = self.fc(last_timestep)
        probs = self.softmax(logits)

        return probs

# Example model instantiation (untrained, so the predictions below are arbitrary)
model = NewsCNN_LSTM()

# Example dummy input: batch of 4 days, each with max_seq_len=10, embedding_dim=100
batch_size = 4
max_seq_len = 10
embedding_dim = 100

dummy_input = torch.randn(batch_size, max_seq_len, embedding_dim)

# Forward pass in evaluation mode (dropout off, no gradient tracking)
model.eval()
with torch.no_grad():
    output = model(dummy_input)

# Example interpretation
for i, prediction in enumerate(output):
    label = [1,0] if prediction.argmax().item() == 0 else [0,1]
    direction = 'up' if label == [1,0] else 'down'
    print(f"Day {i+1} prediction: {label} -> {direction}")

Day 1 prediction: [0, 1] -> down
Day 2 prediction: [0, 1] -> down
Day 3 prediction: [0, 1] -> down
Day 4 prediction: [0, 1] -> down
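
Because of the pooling layer, the LSTM does not receive one step per title. The arithmetic can be checked without PyTorch using the output-length formula documented for `nn.Conv1d` and `nn.MaxPool1d`. The helper names below are hypothetical; the layer settings are those of the class above.

```python
def output_length(length, kernel_size, stride=1, padding=0, dilation=1):
    # Output-length formula documented for torch.nn.Conv1d and torch.nn.MaxPool1d.
    return (length + 2 * padding - dilation * (kernel_size - 1) - 1) // stride + 1

def lstm_steps(num_titles):
    # Conv1d(kernel_size=3, padding=1) keeps the length unchanged.
    after_conv = output_length(num_titles, kernel_size=3, padding=1)
    # MaxPool1d(kernel_size=2) uses a stride equal to its kernel size by default.
    return output_length(after_conv, kernel_size=2, stride=2)

for num_titles in (10, 5, 1):
    print(num_titles, "->", lstm_steps(num_titles))
```

With the dummy input of 10 titles per day the LSTM therefore sees 5 steps. A day with a single title pools down to zero steps, which PyTorch would be expected to reject, so such days would need padding to at least two rows.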


In [None]:

print(day_to_vectors)
print(output)

Date
2020-03-31 00:00:00 UTC    [[0.00038041366, 0.00038732446, 0.00061241444,...
2020-04-01 00:00:00 UTC    [[-0.0010377998, 0.00042375515, -0.0011617246,...
2020-04-02 00:00:00 UTC    [[-0.0023824708, -0.0012906671, 0.0009930803, ...
2020-04-06 00:00:00 UTC    [[-0.0007787028, 0.0014723241, -0.0006502996, ...
2020-04-08 00:00:00 UTC    [[-0.0023002836, 0.001976771, -0.00018580035, ...
2020-04-14 00:00:00 UTC    [[-0.0010352622, -0.0015254191, -0.002226289, ...
2020-04-22 00:00:00 UTC    [[-0.00014356375, -0.00014395872, -0.001554029...
2020-04-23 00:00:00 UTC    [[0.0007607373, 0.0022731104, -0.0003411403, 0...
2020-04-28 00:00:00 UTC    [[-0.0008626608, -0.0023146372, 0.002436483, 0...
2020-05-01 00:00:00 UTC    [[-0.0008346233, 0.0016315253, 0.0006025097, -...
2020-05-05 00:00:00 UTC    [[-0.00014000069, 0.002407807, 0.0018011844, -...
2020-05-08 00:00:00 UTC    [[-0.0008903536, 0.0041253697, -0.0019558165, ...
2020-05-15 00:00:00 UTC    [[-0.00420256, 0.001401495, 0.0012031626, 0.