This notebook contains the functions used to build the technical layer and the embedding layer.

We use the following seven indicators for the technical layer:
1. Stochastic %K
2. Williams %R
3. Stochastic %D
4. A/D Oscillator
5. Momentum
6. Disparity
7. Rate of Change

The choice of indicators was inspired by the paper "Combining News and Technical Indicators in Daily Stock Price Trends Prediction".

In [None]:
import pandas as pd
import numpy as np
import yfinance as yf
from datetime import datetime, timedelta

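def calculate_technical_indicator(start_date, end_date):
    lookback = 3
    # Fetch extra calendar days before start_date so that weekends and holidays
    # still leave at least `lookback` trading days for the rolling windows.
    buffer_days = 2 * lookback + 5
    start_date_lookback = datetime.strptime(start_date, "%Y-%m-%d") - timedelta(days=buffer_days)
    technical_layer = yf.download("^GSPC", start=start_date_lookback, end=end_date, interval="1d")
    # Some yfinance versions return MultiIndex columns (field, ticker) even for a single ticker; flatten them.
    if isinstance(technical_layer.columns, pd.MultiIndex):
        technical_layer.columns = technical_layer.columns.get_level_values(0)
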
    # 1. Stochastic %K
    low_min = technical_layer['Low'].rolling(window=lookback).min()
    high_max = technical_layer['High'].rolling(window=lookback).max()
    technical_layer['Stochastic_%K'] = 100 * ((technical_layer['Close'] - low_min) / (high_max - low_min))

    # 2. Williams %R
    technical_layer["Williams_%R"] = -100 * ((high_max - technical_layer['Close']) / (high_max - low_min))

    # 3. Stochastic %D 
    technical_layer['Stochastic_%D'] = technical_layer['Stochastic_%K'].rolling(window=lookback-1).mean()

    # 4. A/D Oscillator (Accumulation/Distribution Line)
    ad = ((technical_layer['Close'] - technical_layer['Low']) - (technical_layer['High'] - technical_layer['Close'])) / (technical_layer['High'] - technical_layer['Low']) * technical_layer['Volume']
    technical_layer['AD_Line'] = ad.cumsum()
    technical_layer['AD_Oscillator'] = technical_layer['AD_Line'] - technical_layer['AD_Line'].shift(lookback)

    # 5. Momentum 
    technical_layer['Momentum'] = technical_layer['Close'] - technical_layer['Close'].shift(lookback)

    # 6. Disparity 
    technical_layer['Disparity'] = (technical_layer['Close'] / technical_layer['Close'].rolling(window=lookback).mean()) * 100

    # 7. Rate of Change (ROC)
    technical_layer['ROC'] = ((technical_layer['Close'] - technical_layer['Close'].shift(lookback)) / technical_layer['Close'].shift(lookback)) * 100
    # Keep only the requested date range, drop rows with undefined indicators, then renumber the rows.
    technical_layer = technical_layer.loc[start_date:].dropna().reset_index()
    return technical_layer
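
As a sanity check on the formulas above, the following sketch recomputes four of the indicators on a small synthetic price table instead of downloaded data. The prices are made up for illustration and the `summary` frame is a hypothetical helper, not part of the notebook. It shows two things: the first `lookback - 1` rows of the rolling indicators are undefined (which is why the function ends with `dropna`), and Williams %R is always Stochastic %K shifted down by 100.

```python
import pandas as pd

# Synthetic OHLC prices (made up for illustration, not market data).
prices = pd.DataFrame({
    "High":  [11.0, 12.0, 13.0, 12.5, 14.0],
    "Low":   [9.0, 10.0, 11.0, 10.5, 12.0],
    "Close": [10.0, 11.5, 12.0, 11.0, 13.5],
})
lookback = 3

# Rolling extremes over the last `lookback` rows, as in the function above.
low_min = prices["Low"].rolling(window=lookback).min()
high_max = prices["High"].rolling(window=lookback).max()

stochastic_k = 100 * (prices["Close"] - low_min) / (high_max - low_min)
williams_r = -100 * (high_max - prices["Close"]) / (high_max - low_min)
momentum = prices["Close"] - prices["Close"].shift(lookback)
roc = 100 * momentum / prices["Close"].shift(lookback)

# Hypothetical summary table for inspection.
summary = pd.DataFrame({"%K": stochastic_k, "%R": williams_r, "Momentum": momentum, "ROC": roc})
print(summary)
```

In the third row the 3-day low is 9 and the 3-day high is 13, so a close of 12 gives %K = 75 and %R = -25. Because the two indicators differ only by a constant, they carry the same information for a model that can learn an offset.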

To generate the data for the embedding layer, we first preprocess each title: it is lowercased and curly quotation marks are converted to straight ones. The title is then tokenized with a regular expression that extracts runs of letters and apostrophes as word tokens.

In [None]:
from gensim.models import Word2Vec
import re

def preprocess_title(title):
    title = str(title).lower()
    title = title.replace("’", "'").replace("‘", "'").replace("“", '"').replace("”", '"')
    tokens = re.findall(r"\b[a-zA-Z']+\b", title)
    return tokens
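
To make the effect of the preprocessing concrete, here is a small usage sketch. The function body is repeated so the snippet runs on its own, and the headline is an invented example, not taken from the dataset.

```python
import re

def preprocess_title(title):
    # Same logic as the notebook function, repeated for a standalone run.
    title = str(title).lower()
    title = title.replace("’", "'").replace("‘", "'").replace("“", '"').replace("”", '"')
    return re.findall(r"\b[a-zA-Z']+\b", title)

# Invented headline with curly quotes, an ampersand and digits.
tokens = preprocess_title("Fed’s “Surprise” Rate Hike Hits S&P 500")
print(tokens)  # ["fed's", 'surprise', 'rate', 'hike', 'hits', 's', 'p']
```

Two consequences of the regular expression are visible here: digits are dropped entirely, and symbols such as `&` split a term like "S&P" into the single-letter tokens `s` and `p`.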


Given the tokens and a trained Word2Vec model, we generate a single vector per title: the mean of the word vectors of its tokens. If none of the tokens are in the model's vocabulary, a zero vector of the same size as the word vectors is returned.

In [None]:
def get_sentence_vector(tokens,model):
    vectors = [model.wv[word] for word in tokens if word in model.wv]
    if vectors:
        return np.mean(vectors, axis=0)
    else:
        return np.zeros(model.vector_size)
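
The averaging and the zero-vector fallback can be checked without training a model. The sketch below uses `toy_model`, a hypothetical stand-in that only provides the two attributes the function touches (`wv` for lookups and `vector_size`); it is not a gensim object, and the vectors are made up.

```python
import numpy as np
from types import SimpleNamespace

def get_sentence_vector(tokens, model):
    # Same logic as the notebook function, repeated for a standalone run.
    vectors = [model.wv[word] for word in tokens if word in model.wv]
    if vectors:
        return np.mean(vectors, axis=0)
    return np.zeros(model.vector_size)

# Hypothetical stand-in for a trained Word2Vec model (assumption, not gensim).
toy_model = SimpleNamespace(
    wv={"rate": np.array([1.0, 0.0, 2.0]), "hike": np.array([3.0, 4.0, 0.0])},
    vector_size=3,
)

known = get_sentence_vector(["rate", "hike", "unseen"], toy_model)
unknown = get_sentence_vector(["unseen"], toy_model)
print(known)    # mean of the "rate" and "hike" vectors; "unseen" is skipped
print(unknown)  # zero vector of length vector_size
```

Because the notebook trains Word2Vec with `min_count=1` on the same titles it later embeds, every token is in the vocabulary, so the zero-vector branch should only be reached for titles that produce no tokens at all.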

Here we process a CSV file containing news article titles and dates to generate daily news embeddings. The function reads the data, preprocesses the titles with the functions above, trains a Word2Vec model on the tokenized titles, and computes a sentence vector for each article by averaging its word embeddings.

These vectors are then aggregated daily by averaging them per date, resulting in a single embedding vector for each day. Only dates that appear in `trading_days` are kept.

In [None]:
def calculate_embedding_layer(filename,trading_days):
    embedding_layer = pd.read_csv(filename)  
    embedding_layer = embedding_layer.sort_values('Date').reset_index(drop=True)  
    embedding_layer['Date'] = pd.to_datetime(embedding_layer['Date']).dt.tz_localize(None).dt.date

    token_list = embedding_layer['Article_title'].apply(preprocess_title) 
    model = Word2Vec(sentences=token_list, vector_size=100, window=5, min_count=1, workers=4)
    embedding_layer['sentence_vector'] = token_list.apply(lambda tokens: get_sentence_vector(tokens, model))

    daily_news = embedding_layer.groupby('Date')['sentence_vector'].apply(lambda vecs: np.mean(list(vecs), axis=0)).reset_index()
    daily_news_trading_days = daily_news[daily_news['Date'].isin(trading_days)].reset_index(drop=True)   
    return daily_news_trading_days
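
The daily aggregation step can be illustrated in isolation. The sketch below skips the CSV and Word2Vec parts and starts from made-up two-dimensional article vectors; the `articles` frame and the `trading_days` list are hypothetical inputs chosen for the example. It assumes `trading_days` holds `datetime.date` objects, since the function converts the `Date` column with `.dt.date` before the `isin` filter.

```python
import datetime as dt
import numpy as np
import pandas as pd

# Made-up per-article vectors (2-dimensional for readability).
articles = pd.DataFrame({
    "Date": [dt.date(2024, 1, 5), dt.date(2024, 1, 5), dt.date(2024, 1, 6), dt.date(2024, 1, 8)],
    "sentence_vector": [
        np.array([1.0, 3.0]),
        np.array([3.0, 5.0]),
        np.array([9.0, 9.0]),
        np.array([0.0, 2.0]),
    ],
})
# Hypothetical trading calendar: 2024-01-06 is a Saturday, so it is absent.
trading_days = [dt.date(2024, 1, 5), dt.date(2024, 1, 8)]

# One mean vector per calendar date, as in calculate_embedding_layer.
daily_news = (
    articles.groupby("Date")["sentence_vector"]
    .apply(lambda vecs: np.mean(list(vecs), axis=0))
    .reset_index()
)
# Keep only the dates on which the market was open.
daily_news_trading_days = daily_news[daily_news["Date"].isin(trading_days)].reset_index(drop=True)
print(daily_news_trading_days)
```

The two articles from 5 January are averaged into one vector, and the Saturday article is discarded rather than carried over to the next trading day. If weekend news should influence Monday's embedding, the dates would need to be remapped before grouping; the notebook does not do this.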