# Introduction

This notebook pre-processes the raw financial data to build a matrix of prices and volume, which is then fed to the network. <br>
For each time step, the program takes the 30 previous values of each of the following series to build the input tensor:
- Open
- High
- Low
- Close
- Volume

Each series is batch-normalized using its mean and standard deviation.

The matrix shape is (30, 5), corresponding to (time steps, number of price series + volume).
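The windowing described above can be sketched as follows. This is a minimal illustration, not the `get_X_Y` function from `utils`: the helper name `make_windows` is hypothetical, the data is synthetic, and normalizing each window with its own mean and standard deviation is an assumption about how the normalization is applied.

```python
import numpy as np

def make_windows(values, window=30):
    """Build normalized sliding windows from a (T, 5) array (hypothetical helper)."""
    values = np.asarray(values, dtype=float)
    windows = []
    for t in range(window, len(values) + 1):
        w = values[t - window:t]          # the `window` most recent rows
        mean = w.mean(axis=0)             # per-column mean inside the window
        std = w.std(axis=0)               # per-column standard deviation
        std[std == 0] = 1.0               # assumption: leave flat columns at zero
        windows.append((w - mean) / std)
    return np.stack(windows)

rng = np.random.default_rng(0)
ohlcv = rng.random((100, 5)) + 1.0        # synthetic Open, High, Low, Close, Volume
X = make_windows(ohlcv)
print(X.shape)  # (71, 30, 5)
```

With 100 rows and a window of 30 there are 100 - 30 + 1 = 71 matrices, each of shape (30, 5).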

# Imports

In [28]:
# for navigating the folders
import os
import pathlib

from time import strptime, strftime
from datetime import datetime

from tqdm import tqdm

import numpy as np
import pandas as pd

import PIL
import pickle

# helper functions (get_X_Y, create_Xt_Yt) live in the utils file
from utils import *


# Navigation in folders and building of the dataset

This part collects the ticker of every stock file found in the data folder.

In [17]:
data_dir = '/sandp500/individual_stocks_5yr/'
directory = os.getcwd() + data_dir # path to the files
files_tags = os.listdir(directory) # these are the individual CSV files, one per stock

# hidden files (names starting with '.') also appear in the listing, so filter them out.
# A new list is built because removing items while iterating can skip entries.
files_tags = [file for file in files_tags if not file.startswith('.')]
stock_name = [file.split('_')[0] for file in files_tags]
stocks = list(files_tags)
print('There are {} different stocks.'.format(len(stock_name)))

There are 505 different stocks.


# Start and end date of each stock

The stocks start and end on different dates. This part identifies the stocks that do not cover the longest time span (starting on 2013-02-08); they are excluded when the dataset is built.

## DataFrame

In [18]:
df_start_end = pd.DataFrame(columns = ['stock', 'start', 'end'])

In [19]:
name = list()
start=list()
end = list()

for s in stocks:
    df = pd.read_csv(os.getcwd() + data_dir + s)
    name.append(df.Name.iloc[0])
    start.append(df.date.iloc[0])
    end.append(df.date.iloc[-1])
df_start_end.stock = name
df_start_end.start = start
df_start_end.end = end


## Get the time span

In [20]:
end_date = set(df_start_end.end)
end_date

{'2018-02-07'}

All the stocks share the same end date, so no stock stops early and none needs to be removed for that reason.

In [21]:
starting_list = list(set(df_start_end.start))
starting_dict = {key: 0 for key in list(set(df_start_end.start))}
for i in df_start_end.start:
    for j in starting_list:
        if i == j:
            starting_dict[j]+=1

In [30]:
print(starting_dict)

{'2014-03-27': 1, '2015-01-02': 1, '2015-06-24': 1, '2017-09-01': 1, '2014-07-31': 1, '2014-06-19': 1, '2015-07-06': 2, '2016-01-05': 1, '2013-06-13': 1, '2017-07-03': 1, '2013-06-19': 4, '2017-01-04': 1, '2013-05-09': 1, '2013-02-08': 476, '2015-10-19': 2, '2014-04-17': 1, '2013-11-18': 1, '2015-11-16': 1, '2017-12-05': 1, '2016-07-01': 1, '2016-04-07': 1, '2017-07-17': 1, '2017-04-03': 1, '2014-09-24': 1, '2016-12-02': 1}


Almost all the stocks (476 of 505) already exist at the beginning of the time span.  <br>
We only keep these stocks for our study (starting date: 2013-02-08). <br>
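As an illustration, the count per starting date and the list of stocks covering the full span can also be obtained directly with pandas. The tickers and dates below are made up, and the notebook itself does the actual filtering later by checking the shape of the training array, so this is only a sketch of the idea.

```python
import pandas as pd

# Made-up stand-in for df_start_end
df_demo = pd.DataFrame({
    "stock": ["AAA", "BBB", "CCC"],
    "start": ["2013-02-08", "2014-03-27", "2013-02-08"],
})

# Number of stocks per starting date
start_counts = df_demo["start"].value_counts()

# ISO-formatted dates sort correctly as strings, so min() is the earliest date
first_date = df_demo["start"].min()
full_span = df_demo.loc[df_demo["start"] == first_date, "stock"].tolist()

print(start_counts.to_dict())
print(full_span)
```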

# Missing values

In [23]:
count = 0
for s in tqdm(stocks):
    data = pd.read_csv(os.getcwd() + data_dir + s)
    count += data.isnull().sum().sum()
    
print(count)

100%|██████████| 505/505 [00:02<00:00, 214.40it/s]

27





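We have only 27 NaN values, so we fill each one with the next valid value in its column (backward fill). <br>
In pandas this is done with the '.bfill()' method. Note that '.fillna('bfill')' passes 'bfill' as the fill value and writes that literal string into the gaps.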
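A small example of backward fill on made-up prices (the column names and values are illustrative only). It also shows a limitation worth knowing: a NaN in the last row has no later value to copy, so it stays missing.

```python
import numpy as np
import pandas as pd

prices = pd.DataFrame({
    "open":  [np.nan, 10.0, np.nan, 12.0],
    "close": [9.5, np.nan, 11.0, np.nan],
})

# Each NaN takes the next valid value below it in the same column
filled = prices.bfill()
remaining = int(filled.isnull().sum().sum())

print(filled)
print(remaining)  # 1: the trailing NaN in "close" cannot be back-filled
```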

# Creation of the Dataset

Here we create the datasets stock by stock. Every stock is processed regardless of its length; those that do not cover the whole time span are set aside by checking the shape of the training array.

In [24]:
kept_stocks = list()
not_kept_stocks = list()

#tqdm displays a progress bar for the loop
for s in tqdm(stocks):
    
    # backward fill: each NaN takes the next valid value in its column
    data = pd.read_csv(os.getcwd() + data_dir + s).bfill()
    name = s.split('_')[0]
    
    #function located in the utils file 
    #it returns an array of matrices (each matrix is one input of the network) built from the stock's time series
    X, Y = get_X_Y(data)
    X, Y = np.array(X), np.array(Y)
    
    
    #function located in the utils file 
    #it splits the data into train, test and validation sets
    #the split is not random: 
    #it follows the timeline
    #train -> test -> validation
    X_train, X_test, X_val, Y_train, Y_val, Y_test = create_Xt_Yt(X, Y, 
                                                                  percentage_train=0.8, percentage_val = 0.1, percentage_test = 0.1)
    
    #this checks whether there is any NaN in the matrices
    count = (np.isnan(X_train).sum() + np.isnan(X_test).sum() + np.isnan(X_val).sum()
             + np.isnan(Y_train).sum() + np.isnan(Y_val).sum() + np.isnan(Y_test).sum())
    
    if count>0:
        print('error nan', s)
        
    #keep only the stocks that cover the whole time span 
    if X_train.shape!=(982,30,5):
        not_kept_stocks.append(s)
        
    else:
        
        kept_stocks.append(s)
        
        #each stock is stored in a different folder
        out_path_stock = '/data_out_autoreg/'+name+'/'
        out_directory = os.getcwd() + out_path_stock
        #create folder
        pathlib.Path(out_directory).mkdir(parents=True, exist_ok=True) 
        
        
        #each matrix is saved with numpy function
        np.save(out_directory+'Y_train', Y_train)
        np.save(out_directory+'Y_val', Y_val)
        np.save(out_directory+'Y_test', Y_test)
        np.save(out_directory+'X_train', X_train)
        np.save(out_directory+'X_val', X_val)
        np.save(out_directory+'X_test', X_test)
    
    
    
    
    

100%|██████████| 505/505 [02:46<00:00,  3.03it/s]
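For readers without the `utils` file, a time-ordered split can be sketched as below. This is not the `create_Xt_Yt` implementation: the helper name `chronological_split`, its return order (train, validation, test) and the rounding of the split points are all assumptions made for illustration.

```python
import numpy as np

def chronological_split(X, Y, train=0.8, val=0.1):
    """Split arrays in time order without shuffling (hypothetical helper)."""
    n = len(X)
    i_train = int(n * train)              # end of the training block
    i_val = i_train + int(n * val)        # end of the validation block
    return (X[:i_train], X[i_train:i_val], X[i_val:],
            Y[:i_train], Y[i_train:i_val], Y[i_val:])

# Synthetic data: 100 samples of shape (30, 5) and one target per sample
X = np.arange(100 * 30 * 5, dtype=float).reshape(100, 30, 5)
Y = np.arange(100, dtype=float)

X_tr, X_va, X_te, Y_tr, Y_va, Y_te = chronological_split(X, Y)
print(X_tr.shape, X_va.shape, X_te.shape)
```

Because nothing is shuffled, every sample in a later block comes after every sample in an earlier block, which avoids training on data from the future of the evaluation period.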


In [25]:
print('The number of kept stocks is:', len(kept_stocks),'.')

The number of kept stocks is: 468 .


In [26]:
print('The number of not kept stocks is:', len(not_kept_stocks),'.')

The number of not kept stocks is: 37 .


# Conclusion 

In this part, the data are pre-processed to be used as input by the neural network. <br>
The input of this code is a set of 5 time series (Open, High, Low, Close, Volume), which is turned into batch-normalized matrices of shape (30, 5).
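The saved arrays can be read back with `np.load`. The sketch below does a round trip in a temporary directory so that it runs on its own; the ticker "AAA" and the zero-filled array are placeholders, and only the folder layout (`data_out_autoreg/<ticker>/X_train.npy`) follows the notebook.

```python
import pathlib
import tempfile

import numpy as np

with tempfile.TemporaryDirectory() as tmp:
    # Same layout as the notebook: one folder per stock ("AAA" is a made-up ticker)
    stock_dir = pathlib.Path(tmp) / "data_out_autoreg" / "AAA"
    stock_dir.mkdir(parents=True, exist_ok=True)

    # np.save appends the .npy extension when it is missing
    np.save(str(stock_dir / "X_train"), np.zeros((982, 30, 5)))

    # Loading requires the full file name, including .npy
    X_train = np.load(str(stock_dir / "X_train.npy"))

print(X_train.shape)  # (982, 30, 5)
```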