# Data Processing 

This notebook pre-processes the S&P 500 data to produce the input tensors of the neural network. <br>
For each stock, the input is a raw time series of prices (High, Low, Open, Close). Note that we include both the Close of the previous period (t-1) and the Open of the current period (t), as they differ for daily stock market data. Even for continuously traded assets such as crypto, the data show discrepancies between the two, so they cannot be assumed equal.

The feature columns correspond to:
- Close(t-1)/Open(t-1)
- High(t-1)/Open(t-1)
- Low(t-1)/Open(t-1)
- Open(t)/Open(t-1)
    
<u>Remark:</u> We don't need to normalize the data, since each feature is already a ratio of two prices and is therefore close to one.
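
The four ratios above can be sketched on a single asset as follows. The prices in `toy` are hypothetical and only illustrate the alignment between period t-1 and period t; the column names are assumed to match the lowercase `open`, `high`, `low`, `close` columns read later in this notebook.

```python
import pandas as pd

# Hypothetical toy prices for one asset (illustrative values only).
toy = pd.DataFrame({
    "open":  [100.0, 102.0, 101.0, 103.0],
    "high":  [103.0, 104.0, 102.5, 105.0],
    "low":   [99.0, 100.5, 100.0, 102.0],
    "close": [102.0, 101.0, 102.0, 104.0],
})

prev = toy.iloc[:-1].reset_index(drop=True)  # period t-1 (drops the last row)
curr = toy.iloc[1:].reset_index(drop=True)   # period t   (drops the first row)

features = pd.DataFrame({
    "close_over_open": prev["close"] / prev["open"],      # Close(t-1)/Open(t-1)
    "high_over_open": prev["high"] / prev["open"],        # High(t-1)/Open(t-1)
    "low_over_open": prev["low"] / prev["open"],          # Low(t-1)/Open(t-1)
    "next_open_over_open": curr["open"] / prev["open"],   # Open(t)/Open(t-1)
})
print(features.round(4))
```

Every value stays within a few percent of one, which is the reason no further normalization is applied.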

The shape of the resulting tensor, (4, 5, 1258), corresponds to:
- 4: Number of features
- 5: Number of stocks 
- 1258: Number of data points
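
As a rough illustration of where this layout comes from, the sketch below uses random placeholder prices (not real data) with the dimensions used in this notebook. Pairing period t-1 with period t costs one row, which is why 1259 rows yield 1258 data points. The name `X_demo` is hypothetical.

```python
import numpy as np

n_features, n_assets, n_rows = 4, 5, 1259  # dimensions taken from this notebook

# Placeholder prices: one row per period, one column per asset.
rng = np.random.default_rng(0)
opens = 100.0 + rng.random((n_rows, n_assets))

prev_open = opens[:-1]  # Open(t-1): drops the last row
curr_open = opens[1:]   # Open(t):   drops the first row

# Stacking puts features first: (features, periods, assets).
# The same ratio is repeated here only to fill the feature axis.
stacked = np.array([curr_open / prev_open] * n_features)

# Swap the last two axes to obtain (features, assets, periods).
X_demo = np.transpose(stacked, axes=(0, 2, 1))
print(stacked.shape, X_demo.shape)
```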

In [1]:
import os
import numpy as np
import pandas as pd
from tqdm import tqdm

In [2]:
# Specify Directory of Data (StocksD/StocksH/StocksM/CryptoD/CryptoH)
data_dir = 'StocksD'
directory = os.getcwd() + '/' + data_dir + '/'
# Get list of all files
stock_files = os.listdir(directory)
stock_files.sort()
# Remove hidden and unwanted files (build a new list: removing while iterating skips entries)
stock_files = [file for file in stock_files if file[0] not in ('.', '~')]
# Load only subset of assets
selection = []    
if selection: stock_files = [stock_files[i] for i in selection ]
# Extract stock names only
stock_tickers = [file.split('_')[0] for file in stock_files]
# Check the history of stocks matches
kept_stocks = list() ; not_kept_stocks = list()
for s in tqdm(stock_files):
    df = pd.read_csv(directory + s)
    if data_dir == 'StocksD':   
        input_n = 1259 # about 5 years of daily bars (252*5 = 1260)
        if len(df)>=input_n: kept_stocks.append(s)
        else: not_kept_stocks.append(s)
    elif data_dir == 'StocksH':
        input_n = 1976  # 1/2 year (152*6.5*2)
        if len(df)>=input_n: kept_stocks.append(s)
        else: not_kept_stocks.append(s)
    elif data_dir == 'StocksM':
        input_n = 59280 # 1/2 year (152*6.5*60)
        if len(df)>=input_n: kept_stocks.append(s)
        else: not_kept_stocks.append(s)
    elif data_dir == 'CryptoH':
        input_n = 17520 # 1 year (365*24*2)
        if len(df)>=input_n: kept_stocks.append(s)
        else: not_kept_stocks.append(s)
# Report the assets found and any whose history is too short
print('\n There is {} different assets. \n Following assets will be loaded {}'.format(len(stock_tickers),stock_tickers))
if not_kept_stocks: print(' Not enough history for following stocks {}'.format(not_kept_stocks))

100%|██████████| 5/5 [00:00<00:00, 98.54it/s]
 There is 5 different assets. 
 Following assets will be loaded ['AAPL', 'AXP', 'BA', 'CAT', 'CSCO']
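
A note on dropping hidden files: removing entries from a list while iterating over it can skip the neighbour of each removed item, so a filtered copy is the safer pattern. A minimal sketch, using hypothetical file names:

```python
# Hypothetical directory listing (file names are illustrative only).
names = ['.DS_Store', '~lock.csv', 'AAPL_data.csv', 'AXP_data.csv']

# In-place removal during iteration: once '.DS_Store' is removed,
# '~lock.csv' slides into its slot and the loop never visits it.
buggy = list(names)
for name in buggy:
    if name.startswith(('.', '~')):
        buggy.remove(name)

# Filtered copy: every entry is checked exactly once.
clean = [name for name in names if not name.startswith(('.', '~'))]

# Ticker extraction as done in this notebook (text before the first underscore).
tickers = [name.split('_')[0] for name in clean]

print(buggy)
print(clean)
print(tickers)
```

Here `buggy` still contains `'~lock.csv'`, while `clean` holds only the two data files.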



In [26]:
# Read the stocks data
list_open = list()
list_close = list()
list_high = list()
list_low = list()

for s in tqdm(kept_stocks):
    # Back-fill missing values (fillna('bfill') would insert the literal string 'bfill')
    data = pd.read_csv(directory + s).bfill()
    data = data[['open', 'close', 'high', 'low']]
    data = data.tail(input_n)

    list_open.append(data.open.values)
    list_close.append(data.close.values)
    list_high.append(data.high.values)
    list_low.append(data.low.values)

array_open = np.transpose(np.array(list_open))[:-1]
array_open_of_the_day = np.transpose(np.array(list_open))[1:]
array_close = np.transpose(np.array(list_close))[:-1]
array_high = np.transpose(np.array(list_high))[:-1]
array_low = np.transpose(np.array(list_low))[:-1]

# Combine data together into one tensor
X = np.transpose(np.array([array_close/array_open, 
                           array_high/array_open,
                           array_low/array_open,
                           array_open_of_the_day/array_open]), axes= (0,2,1))
if False: X = np.transpose(np.array([array_high/array_open,
                                     array_low/array_open,
                                     array_open_of_the_day/array_open]), axes= (0,2,1))
# Save dimensions of data
input_f, input_m, input_n = X.shape
print('\n Data have shape {}'.format(X.shape))
print(' Number of features: {}'.format(input_f))
print(' Number of assets: {}'.format(input_m))
print(' Number of samples: {}'.format(input_n))
np.save('./input.npy', X)
print(' Input tensor saved!')

100%|██████████| 5/5 [00:00<00:00, 15.92it/s]
 Data have shape (4, 5, 1258)
 Number of features: 4
 Number of assets: 5
 Number of samples: 1258
 Input tensor saved!
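
Before feeding the saved tensor to a network, it may be worth reloading it and running a few sanity checks. The sketch below is one possible way to do so. The helper `check_ratio_tensor` is hypothetical, the demo tensor is synthetic and written to a temporary file rather than `./input.npy`, and the checks assume the feature order used in this notebook (Close, High, Low, next Open, each divided by Open). Real data with bad ticks may fail the High/Low checks.

```python
import os
import tempfile
import numpy as np

def check_ratio_tensor(X):
    """Basic sanity checks for a (features, assets, samples) ratio tensor."""
    close_r, high_r, low_r, open_r = X  # assumed feature order
    return {
        "finite": bool(np.isfinite(X).all()),
        "high_ge_one": bool((high_r >= 1.0).all()),      # High >= Open
        "low_le_one": bool((low_r <= 1.0).all()),        # Low <= Open
        "high_ge_close": bool((high_r >= close_r).all()),
        "low_le_close": bool((low_r <= close_r).all()),
    }

# Synthetic stand-in for the real tensor: 4 features, 5 assets, 20 samples.
rng = np.random.default_rng(0)
close_r = 1.0 + rng.normal(0.0, 0.01, size=(5, 20))
open_r = 1.0 + rng.normal(0.0, 0.01, size=(5, 20))
high_r = np.maximum(close_r, 1.0) + 0.005
low_r = np.minimum(close_r, 1.0) - 0.005
X_demo = np.array([close_r, high_r, low_r, open_r])

# Round trip through np.save / np.load, as done for input.npy.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "input_demo.npy")
    np.save(path, X_demo)
    X_loaded = np.load(path)

report = check_ratio_tensor(X_loaded)
print(X_loaded.shape, report)
```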



In [3]:
stock_tickers

['AAPL', 'AXP', 'BA', 'CAT', 'CSCO']