# Stock Trend Predictor
__Juan Javier Arosemena__
## Introduction 
As the knowledge and techniques surrounding machine learning increase, the interest in applying this knowledge to stock data for making predictions is growing as well. The stock market is a composition of buyers and sellers of stocks, which are units for representing partial ownership of a company. These stocks have a specified price which can vary each day and minute, and it is affected by unpredictable factors such as politics, social trends, the environment, and company-related events.
Stock data is information that represents the movement of stock prices for given companies (or market indexes such as S&P 500, NASDAQ-100) for each day that the stock market operates. 

Stock data usually has 7 main data fields per day:

* Date: the date of the stock data for that day
* Open: the price of the first stock transaction made after market opens
* High: the highest price of the stock
* Low: the lowest price of the stock
* Close: the price of the last stock transaction made before market closes
* Volume: the number of shares traded that day
* Adjusted Close: the closing price of a stock after considering corporate actions

Based on this data and data derived from it, financial analysts can perform technical analysis to estimate the direction that stock prices will take and, therefore, make decisions on buying and selling stocks with the lowest possible risk.

In stock market theory, the _Efficient Markets Hypothesis_ states that the market “follows a random walk and can be unpredictable based on historical data” ([Madge, 2015](https://www.cs.princeton.edu/sites/default/files/uploads/saahil_madge.pdf)). This holds true for stock predictions in the short term. Over long periods of time, however, it is possible to find patterns in stock data, which in turn means that there is a degree of predictability in the stock market.

Taking into account the fact that stock data holds complex patterns that can give information to formulate predictions, the application of __artificial neural networks__ (ANNs) to this data seems to be an appealing task for exploiting the potential of Artificial Intelligence (in fact, big hedge funds already use AI for stock predictions). 

## Workflow
For the development of this project I'm sticking to the Machine Learning Workflow, which has 3 main parts:
1. Data Exploration
    * Data retrieval
    * Feature engineering
    * Data preprocessing and creation of ANN feedable features.
2. Model Training
    * AI model definition
    * Training and validation
    * Parameter tuning
    * Model selection.
3. Model Deployment
    * Deployment of trained model.
    * Deployed model evaluation.
    * Model update.
![The machine learning workflow, Amazon Web Services](https://docs.aws.amazon.com/sagemaker/latest/dg/images/ml-concepts-10.png)

## Data Exploration
Data Exploration consists of the entire process of finding your data, converting it into a form that can be manipulated in code, extracting features from it, cleaning it, and finally constructing files that contain directly feedable features for an ANN or any machine learning model of choice.

In [1]:
import os
from os.path import isfile, join
import pandas_datareader as pdr
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import statistics
from statistics import mean, mode, median, stdev
import yfinance as yf
from sklearn.preprocessing import MinMaxScaler

### Data Retrieval
For this project I chose 3 data sources:
1. Tiingo: A financial research platform dedicated to creating innovative financial tools, which provides an API for downloading stock data.
2. IEX: Investors Exchange is a fair, simple and transparent stock exchange dedicated to investor and issuer protection, and also provides an API.
3. Yahoo Finance: It provides financial news, data and commentary including stock quotes.

In [2]:
# In order to download data from Tiingo and IEX we must provide an API key, 
# which can be found in your site's respective account page.
# My account's keys are stored as environment variables and correspond to free accounts.

tiingo_api_key = os.environ['TIINGO_API_KEY']
iex_api_key = os.environ['IEX_API_KEY']

In [42]:
# File containing all tickers listed by NASDAQ-100 Technology index. 

tickers_file = 'ndxt_tickers.txt'


# Directory tree to create for data processing.

data_dir = 'data/' #this directory must already exist.
raw_data_dir = data_dir + 'raw/'
processed_data_dir = data_dir + 'processed/'
final_data_dir = data_dir + 'final/'


# We will train different models that can predict different time ranges in the stock calendar.

time_range = [1, 5, 10, 20, 90, 270]
periods = [5, 10, 20, 90, 270]
time_words = {1:'day', 5:'week', 10:'two_weeks', 20:'month', 90:'four_months', 270:'year'}


# Make directories

if not os.path.exists(raw_data_dir):
    os.makedirs(raw_data_dir)
if not os.path.exists(processed_data_dir):
    os.makedirs(processed_data_dir)
if not os.path.exists(final_data_dir):
    os.makedirs(final_data_dir)
for n1 in periods:
    for n2 in periods:
        if not os.path.exists(processed_data_dir+f'/{n1}_{n2}/'):
            os.makedirs(processed_data_dir+f'/{n1}_{n2}/')
for t in time_range:
    for n1 in periods:
        for n2 in periods:
            if not os.path.exists(final_data_dir+time_words[t]+f'/{n1}_{n2}/'):
                os.makedirs(final_data_dir+time_words[t]+f'/{n1}_{n2}/')

In [4]:
# Read all the stock tickers to be downloaded

ndxt_tickers = []
with open(data_dir+tickers_file) as f:
    for ticker in f:
        ndxt_tickers.append(ticker.replace('\n', ''))

All data is downloaded and directly transformed into a ``pandas.DataFrame``. Immediately after downloading, the raw data is saved into ``.csv`` files.
The data to be downloaded consists of all available stock quotes from companies indexed by the _NASDAQ-100 Technology Sector_ (^NDXT), as well as the index data itself. Since we are using free accounts to retrieve the data from the mentioned APIs, the time range for all downloaded data is limited to the 5 years previous to the current date.
As for the ^NDXT data, we are using the ``yfinance`` library created by [Ran Aroussi](https://pypi.org/project/yfinance/). 

Do not abuse the following block of code.

In [5]:
# Code for downloading data and saving it, use only when necessary

raw_stock_data_tiingo = []
raw_stock_data_iex = []
error_tickers = []

for ticker in sorted(ndxt_tickers):
    try:
        raw_stock_data_tiingo.append(pdr.get_data_tiingo(ticker, api_key=tiingo_api_key))
    except Exception:
        # Remember the ticker so it can be retried with IEX
        error_tickers.append(ticker)

# Retry failed tickers with IEX, one at a time so that a single failure does not stop the rest
for ticker in error_tickers:
    try:
        raw_stock_data_iex.append(pdr.get_markets_iex(ticker, api_key=iex_api_key))
    except Exception:
        print(ticker + ' was not downloaded.')
raw_index_data_yahoo = yf.download('^NDXT', period='5y')


# Save each stock data in a CSV file

for t in raw_stock_data_tiingo:
    t.to_csv(raw_data_dir + t.index.values[0][0] + '.csv')
    
for t in raw_stock_data_iex:
    t.to_csv(raw_data_dir + t.index.values[0][0] + '.csv')
    
raw_index_data_yahoo.to_csv(raw_data_dir + '^NDXT.csv')


In [6]:
# Read downloaded data from files

raw_stock_data = []
raw_index_data_filename = '^NDXT.csv'
raw_stock_data_filenames = [f+'.csv' for f in ndxt_tickers]
raw_index_df = pd.read_csv(raw_data_dir + raw_index_data_filename)

for filename in raw_stock_data_filenames:
    raw_stock_data.append(pd.read_csv(raw_data_dir + filename))

### Data preprocessing
In order to manipulate the retrieved data, it is necessary to give it structure.

``raw_stock_data`` is a list containing all stock dataframes, and ``raw_index_df`` is the dataframe containing the ^NDXT data. For every dataframe, their index will be the dates of each stock or index quote.

In [10]:
# Reformat date in stocks dataframes, remove time

for data in raw_stock_data:
    data['date'] = data['date'].map(lambda x: x.split()[0])

    
# Volume is not a given data for the index quotes.

raw_index_df.drop(columns='Volume', inplace=True)


# Rename index columns to lowercase

raw_index_df.columns = ['date', 'open', 'high', 'low', 'close', 'adjClose']

Every stock dataframe will also contain its ticker symbol as part of its index. We also remove unnecessary information such as dividends and splits.

In [11]:
# Assign symbol and date as index identifiers for every stock record

for data in raw_stock_data:
    data.set_index(['symbol', 'date'], inplace=True, drop=True)
    
# Assign date as index identifier for index records as well

raw_index_df.set_index(['date'], inplace=True, drop=True)


# Remove unnecessary information

for df in raw_stock_data: df.drop(columns=['divCash', 'splitFactor'], inplace=True)

A crucial step for the data processing that follows is making sure that every dataframe, both stocks and index, covers the same date range. This is because the final features will be a mix of individual stock quotes with index quotes. Since we are not guaranteed that all the downloaded data covers the same time range, we must find the oldest last date among all quotes, as well as the newest first date.

In [12]:
# Find the oldest final date and newest starting date

last_dates = [raw_index_df.index[-1]]
first_dates = [raw_index_df.index[0]]

for df in raw_stock_data:
    dates = []
    
    for idx in df.index:
        dates.append(idx[1])
    
    last_dates.append(max(dates))
    first_dates.append(min(dates))

last_date = min(last_dates)
first_date = max(first_dates)

2019-10-15
2014-10-16
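
The same window logic can be checked on a toy example. The sketch below uses made-up symbols and dates (all hypothetical) and relies on the fact that ISO-formatted date strings (`YYYY-MM-DD`) sort lexicographically in chronological order, which is why plain string comparison is enough here.

```python
# Hypothetical date ranges per symbol, as ISO strings (oldest first).
quotes = {
    'AAA': ['2014-10-14', '2014-10-15', '2014-10-16', '2014-10-17'],
    'BBB': ['2014-10-15', '2014-10-16', '2014-10-17', '2014-10-20'],
    'IDX': ['2014-10-13', '2014-10-14', '2014-10-15', '2014-10-16', '2014-10-17'],
}


def common_window(date_lists):
    """Return the (first_date, last_date) range shared by every list of ISO dates."""
    date_lists = list(date_lists)
    first_date = max(min(dates) for dates in date_lists)  # newest first date
    last_date = min(max(dates) for dates in date_lists)   # oldest last date
    return first_date, last_date


first, last = common_window(quotes.values())

# Keep only the dates inside the shared window for every symbol.
trimmed = {symbol: [d for d in dates if first <= d <= last]
           for symbol, dates in quotes.items()}

print(first, last)
print(trimmed)
```

Note that the window only aligns the start and end of each series; it does not by itself guarantee that every series has a quote on every date in between.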


With the found time ranges, we can trim the dataframes to make sure they all contain the same respective dates for their data. Then, reverse dataframes so that the latest quote comes first, and the oldest quote goes last.

In [13]:
# Make sure all DataFrames cover the same date range (ending as close to today as possible).
# Dates are ISO-formatted strings, so they can be compared directly.

in_range = (raw_index_df.index >= first_date) & (raw_index_df.index <= last_date)
raw_index_df = raw_index_df[in_range]

for i, df in enumerate(raw_stock_data):
    dates = df.index.get_level_values('date')
    in_range = (dates >= first_date) & (dates <= last_date)
    # Trim to the common range and reverse so that the latest quote comes first
    raw_stock_data[i] = df[in_range].sort_index(ascending=False)


# Reverse index records as well

raw_index_df = raw_index_df.iloc[::-1]

In [14]:
# The DataFrames have been processed and are no longer considered raw, so just rename them.

stocks_df = raw_stock_data
index_df = pd.DataFrame(raw_index_df)

### Feature Engineering
Now that all the raw data has been transformed into explorable data, we can extract and compute information that we want to feed our machine learning model. 

The features to be calculated for stock and index data are the following:
1. Price Momentum Oscillator = TC – PPC
    * TC: today’s close
    * PPC: previous period’s close
    


2. Relative Strength Index = 100 – [100/(1 + RS)]
    * RS: average of x days up-closes divided by average of x days down-closes


3. Money Flow Index = 100 – [100/(1 + MR)]
    * MR = (PositiveMF / NegativeMF)
    * MF = TP * Volume
    * TP: average of high, low, and close prices for a given period. If the current Typical Price is greater than the previous period’s, it is considered Positive Money Flow.
    
    
4. Exponential Moving Average = [α * TC] + [(1 – α) * YEMA]
    * TC: today’s close
    * YEMA: yesterday’s exponential moving average
    * α: smoothing factor which is 2/(n+1) where n is the number of days in the period.
    
    
5. Stochastic Oscillator = [(CP - LP) / (HP - LP)]*100
    * CP: closing price
    * LP: lowest low price in the period
    * HP: highest high price in the period
    
    
6. Moving Average Convergence/Divergence = (12-day EMA) – (26-day EMA)


These features were proposed by [Abdul Salam, Emary, and Zawbaa (2018)](https://www.researchgate.net/publication/324029737_A_Hybrid_Moth-Flame_Optimization_and_Extreme_Learning_Machine_Model_for_Financial_Forecasting?enrichId=rgreq-72e17bad737cd78e1c16dfa2b01ab9a9-XXX&enrichSource=Y292ZXJQYWdlOzMyNDAyOTczNztBUzo2MDg2ODg5OTc5MzcxNTlAMTUyMjEzNDE3NDQ4Mg%3D%3D&el=1_x_2&_esc=publicationCoverPdf) in their paper on a machine learning model for stock market prediction.
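
To make the oscillator formulas concrete, here is a minimal plain-Python sketch that evaluates PMO, RSI, the Stochastic Oscillator, and MFI on a tiny made-up series. It is an illustration under stated assumptions, not the implementation used in this notebook: the function names and sample prices are hypothetical, lists are ordered oldest first, RSI uses raw close-to-close differences averaged over up days and down days separately, and MFI sums the positive and negative money flows over the period.

```python
def pmo(closes, period):
    """Price Momentum Oscillator: latest close minus the close `period` days earlier."""
    return closes[-1] - closes[-1 - period]


def rsi(closes, period):
    """Relative Strength Index over the last `period` close-to-close changes."""
    window = closes[-(period + 1):]
    changes = [b - a for a, b in zip(window, window[1:])]
    ups = [c for c in changes if c > 0]
    downs = [-c for c in changes if c < 0]
    if not downs:
        return 100.0
    if not ups:
        return 0.0
    rs = (sum(ups) / len(ups)) / (sum(downs) / len(downs))
    return 100 - 100 / (1 + rs)


def stochastic(closes, highs, lows, period):
    """Stochastic Oscillator: position of the close inside the period's low-high range."""
    lowest = min(lows[-period:])
    highest = max(highs[-period:])
    return (closes[-1] - lowest) / (highest - lowest) * 100


def mfi(closes, highs, lows, volumes, period):
    """Money Flow Index over the last `period` days (needs more than `period` rows)."""
    tp = [(h + l + c) / 3 for h, l, c in zip(highs, lows, closes)]
    positive = negative = 0.0
    for i in range(len(tp) - period, len(tp)):
        flow = tp[i] * volumes[i]
        if tp[i] > tp[i - 1]:
            positive += flow
        elif tp[i] < tp[i - 1]:
            negative += flow
    if negative == 0:
        return 100.0
    return 100 - 100 / (1 + positive / negative)


# Hypothetical five-day series, oldest first.
closes = [10.0, 11.0, 10.5, 12.0, 11.0]
highs = [10.5, 11.5, 11.0, 12.5, 12.0]
lows = [9.5, 10.0, 10.0, 11.0, 10.0]
volumes = [100, 200, 150, 300, 250]

print(pmo(closes, 4))
print(rsi(closes, 4))
print(stochastic(closes, highs, lows, 4))
print(mfi(closes, highs, lows, volumes, 4))
```

All three bounded indicators (RSI, Stochastic Oscillator, MFI) stay between 0 and 100, which is what later allows them to be normalized by dividing by 100.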

We will stick to 6 time ranges to use for label calculations as proposed by Madge: 
1. One day
2. One week
3. Two weeks
4. One month
5. Four months
6. One year

In [15]:
def labels(stock_df, since = 1):
    '''Function for labeling the trend in stock data given a period of time.
    
        Args:
            stock_df (pandas.DataFrame): contains a column 'close' for the closing prices in historical stock data,
                ordered from newest to oldest.
            since (int): time period for which to label trend.
        
        Returns:
            None: the passed DataFrame will have a new column with labels 1 for increasing close price after 'since' days, 0 otherwise.
    '''
    stock_df.drop(columns='y_'+str(since), inplace=True, errors='ignore')
    labels = []
    for i in range(len(stock_df)):
        if i-since >= 0:
            # Rows are ordered newest first, so the future quote sits 'since' rows above
            today = stock_df.iloc[i]['close']
            future = stock_df.iloc[i-since]['close']
            labels.append(1 if future>today else 0)
        else:
            # The most recent 'since' rows have no known future yet
            labels.append(None)
    stock_df.insert(loc=0, column='y_'+str(since), value=labels)
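
As a compact illustration of the labeling rule, the following sketch applies it to a plain list of closes. It assumes the list is ordered oldest first (unlike the DataFrames above, which are newest first), and the name `trend_labels` and the sample prices are hypothetical.

```python
def trend_labels(closes, horizon):
    """Label each day 1 if the close `horizon` days later is higher, else 0.

    `closes` is ordered oldest first; the last `horizon` days have no known
    future quote and are labeled None.
    """
    labels = []
    for i, today in enumerate(closes):
        if i + horizon < len(closes):
            labels.append(1 if closes[i + horizon] > today else 0)
        else:
            labels.append(None)
    return labels


closes = [10.0, 11.0, 10.5, 12.0, 11.0]
print(trend_labels(closes, 1))  # one-day horizon
print(trend_labels(closes, 2))  # two-day horizon
```

The longer the horizon, the more of the most recent rows end up without a label, which is why those rows are dropped before training.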

In [16]:
def change(stock_df, period = 1):
    '''Function for calculating the change percentage of closing prices since 'period' days ago.
    
        Args:
            stock_df (pandas.DataFrame): contains a column 'close' for the closing prices in historical stock data.
            period (int): time period for which to calculate change.
        Returns:
            None: the passed DataFrame will have a new column with change percentage.
    '''
    stock_df.drop(columns='change', inplace=True, errors='ignore')
    change = []
    for i in range(len(stock_df)):
        try:
            today = stock_df.iloc[i]['close']
            previous = stock_df.iloc[i+period]['close']
            change.append(100*(today-previous)/previous)
        except:
            change.append(None)
    stock_df.insert(loc=0, column='change', value=change)

In [17]:
def PMO(stock_df, period = 50):
    '''Price Momentum Oscillator.
        
        Args:
            stock_df (pandas.DataFrame): contains a column 'close' for the closing prices in historical stock data.
            period (int): time period for which to calculate.
        Returns:
            None: the passed DataFrame will have a new column with PMO.
    '''
    stock_df.drop(columns='PMO', inplace=True, errors='ignore')
    pmo = []
    for i in range(len(stock_df)):
        try:
            today = stock_df.iloc[i]['close']
            previous = stock_df.iloc[i+period]['close']
            pmo.append(today - previous)
        except:
            pmo.append(None)
    stock_df.insert(loc=0, column='PMO', value=pmo)

In [18]:
def RSI(stock_df, period = 50):
    '''Relative Strength Index.
        
        Args:
            stock_df (pandas.DataFrame): contains the columns 'close' for the closing prices and 'change' for the change percentage in historical stock data.
            period (int): time period for which to calculate.
        Returns:
            None: the passed DataFrame will have a new column with RSI.
    '''
    stock_df.drop(columns='RSI', inplace=True, errors='ignore')
    rsi = []
    for i in range(len(stock_df)):
        try:
            rsi_value = 0
            pos = []
            neg = []
            
            for j in range(period):
                change = stock_df.iloc[i+j]['change']
                if change > 0: 
                    pos.append(change)
                elif change < 0: 
                    neg.append(abs(change))
                    
            if not neg:
                rsi_value = 100
            elif not pos:
                rsi_value = 0
            else:
                pos = sum(pos)/len(pos)
                neg = sum(neg)/len(neg)
                rsi_value = 100 - (100/(1+(pos/neg)))
            rsi.append(rsi_value)
        except:
            rsi.append(None)
    stock_df.insert(loc=0, column='RSI', value=rsi)

In [19]:
def MFI(stock_df, period = 50):
    '''Money Flow Index.
        
        Args:
            stock_df (pandas.DataFrame): contains the columns 'close' for the closing prices, 'volume', 'high', and 'low' in historical stock data.
            period (int): time period for which to calculate.
        Returns:
            None: the passed DataFrame will have a new column with MFI.
    '''
    stock_df.drop(columns='MFI', inplace=True, errors='ignore')
    mfi = []
    for i in range(len(stock_df)):
        try:
            mfi_value = 0
            pos = []
            neg = []
            typical_prices = []
            
            for j in range(period):
                if not typical_prices: typical_prices.append( mean([stock_df.iloc[i+1]['high'] , stock_df.iloc[i+1]['low'] , stock_df.iloc[i+1]['close']]) ) 
                tp = (stock_df.iloc[i+j]['high'] + stock_df.iloc[i+j]['low'] + stock_df.iloc[i+j]['close']) / 3
                if tp > typical_prices[-1]: 
                    pos.append( tp * stock_df.iloc[i+j]['volume'] )
                elif tp < typical_prices[-1]: 
                    neg.append( tp * stock_df.iloc[i+j]['volume'] )
            
            if not neg:
                mfi_value = 100
            elif not pos:
                mfi_value = 0
            else:
                pos = sum(pos)/len(pos)
                neg = sum(neg)/len(neg)
                mfi_value = 100 - (100/(1+(pos/neg)))
            mfi.append(mfi_value)
        except:
            mfi.append(None)
    stock_df.insert(loc=0, column='MFI', value=mfi)

In [20]:
def EMA(stock_df, period=50):
    '''Exponential Moving Average.
        
        Args:
            stock_df (pandas.DataFrame): contains a column 'close' for the closing prices in historical stock data.
            period (int): time period for which to calculate.
        Returns:
            None: the passed DataFrame will have a new column with EMA.
    '''
    stock_df.drop(columns='EMA', inplace=True, errors='ignore')
    a = 2/(period + 1)
    # There are many ways to calculate the first term of an exponential moving average, so for now
    # I'll be using the average of the previous 3 closes
    initial_value_range = 3
    ema = []
    
    for i in range(len(stock_df)):
        emas = []
        try:
            
            for j in list(reversed(range(period))):
                if not emas: emas.append( mean([stock_df.iloc[i+j+day]['close'] for day in range(initial_value_range)]) )
                tc = stock_df.iloc[i+j]['close']
                this_ema = (a * tc) + ((1 - a) * emas[-1])
                emas.append(this_ema)
            
            ema.append(emas[-1])
        except:
            ema.append(None)
    stock_df.insert(loc=0, column='EMA', value=ema)

In [21]:
def SO(stock_df, period=50):
    '''Stochastic Oscillator.
        
        Args:
            stock_df (pandas.DataFrame): contains the columns 'close' for the closing prices, 'high', and 'low' in historical stock data.
            period (int): time period for which to calculate.
        Returns:
            None: the passed DataFrame will have a new column with SO.
    '''
    stock_df.drop(columns='SO', inplace=True, errors='ignore')
    
    so = []
    
    for i in range(len(stock_df)):
        try:
            tc = stock_df.iloc[i]['close']
            ll = min([stock_df.iloc[i+day]['low'] for day in range(period)])
            hh = max([stock_df.iloc[i+day]['high'] for day in range(period)])
            this_so = ((tc - ll) / (hh - ll)) * 100
            so.append(this_so)
        except:
            so.append(None)
    
    stock_df.insert(loc=0, column='SO', value=so)

In [22]:
def MACD(stock_df, p1=12, p2=26):
    '''Moving Average Convergence/Divergence.
        
        Args:
            stock_df (pandas.DataFrame): contains a column 'close' for the closing prices in historical stock data.
            p1 (int): time period for which to calculate first EMA.
            p2 (int): time period for which to calculate second EMA.
        Returns:
            None: the passed DataFrame will have a new column with MACD.
    '''
    stock_df.drop(columns='MACD', inplace=True, errors='ignore')
    
    a1 = 2/(p1 + 1)
    a2 = 2/(p2 + 1)
    initial_value_range = 3
    macd = []
    
    for i in range(len(stock_df)):
        ema1 = []
        ema2 = []
        try:
            for j in list(reversed(range(p1))):
                if not ema1: ema1.append( mean([stock_df.iloc[i+j+day]['close'] for day in range(initial_value_range)]) )
                tc = stock_df.iloc[i+j]['close']
                this_ema = (a1 * tc) + ((1 - a1) * ema1[-1])
                ema1.append(this_ema)
            
            for j in list(reversed(range(p2))):
                if not ema2: ema2.append( mean([stock_df.iloc[i+j+day]['close'] for day in range(initial_value_range)]) )
                tc = stock_df.iloc[i+j]['close']
                this_ema = (a2 * tc) + ((1 - a2) * ema2[-1])
                ema2.append(this_ema)
            
            macd.append(ema1[-1] - ema2[-1])
            
        except:
            macd.append(None)
    
    stock_df.insert(loc=0, column='MACD', value=macd)
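
The EMA recurrence behind both `EMA` and `MACD` can be checked by hand on a short series. The sketch below is a simplified stand-in, not the functions above: it works on a plain list ordered oldest first, seeds the average with the first close in the window (the notebook seeds with a mean of three closes instead), and uses tiny periods instead of 12 and 26 so the arithmetic is easy to follow. The names `ema` and `macd` and the sample prices are hypothetical.

```python
def ema(closes, period):
    """Exponential moving average of the last `period` closes (oldest first)."""
    alpha = 2 / (period + 1)          # smoothing factor
    window = closes[-period:]
    value = window[0]                 # seed: a simplification for this sketch
    for close in window:
        value = alpha * close + (1 - alpha) * value
    return value


def macd(closes, fast, slow):
    """Moving Average Convergence/Divergence: fast EMA minus slow EMA."""
    return ema(closes, fast) - ema(closes, slow)


closes = [10.0, 11.0, 12.0, 13.0]

# With period 3, alpha is 0.5, so each step averages the close with the running value.
print(ema(closes[:3], 3))
print(macd(closes, fast=1, slow=3))
```

With a period of 1 the smoothing factor is 1, so the EMA is simply the latest close. A positive MACD therefore says the latest prices sit above their slower-moving average.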

Madge proposes mixing different time ranges for the stock and index features. Given the enormous number of possible combinations (around 125 different datasets), we will instead use features calculated with a single time period of 20 days.

In [23]:
# Calculate features for index data, MFI is not calculated as it requires volume

period = 20

change(index_df)
MACD(index_df)
SO(index_df, period)
EMA(index_df, period)
RSI(index_df, period)
PMO(index_df, period)
index_df.fillna(value=np.nan, inplace=True)

1258

In [24]:
# Calculate features and labels for stock data, this takes a lot of time

i=0
for df in stocks_df:
    i += 1
    change(df)
    MACD(df)
    SO(df, period)
    EMA(df, period)
    MFI(df, period)
    RSI(df, period)
    PMO(df, period)
    for m in [1, 5, 10, 20, 90, 270]:
        labels(df, m)
    df.fillna(value=np.nan, inplace=True)
    
    print(f'{round(i*100/len(ndxt_tickers))}% ', end='')

stocks_df[-1].head()

3% 5% 8% 10% 13% 15% 18% 21% 23% 26% 28% 31% 33% 36% 38% 41% 44% 46% 49% 51% 54% 56% 59% 62% 64% 67% 69% 72% 74% 77% 79% 82% 85% 87% 90% 92% 95% 97% 100% 

| symbol | date | y_270 | y_90 | y_20 | y_10 | y_5 | y_1 | PMO | RSI | MFI | EMA | ... | adjClose | adjHigh | adjLow | adjOpen | adjVolume | close | high | low | open | volume |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| XLNX | 2019-10-15 | NaN | NaN | NaN | NaN | NaN | NaN | 4.93 | 45.699143 | 52.600614 | 94.139094 | ... | 96.97 | 97.45 | 95.8916 | 96.24 | 2774611 | 96.97 | 97.45 | 95.8916 | 96.24 | 2774611 |
| XLNX | 2019-10-14 | NaN | NaN | NaN | NaN | NaN | 1.0 | 0.03 | 40.79276 | 49.954551 | 93.62432 | ... | 95.93 | 98.86 | 95.56 | 98.31 | 3812968 | 95.93 | 98.86 | 95.56 | 98.31 | 3812968 |
| XLNX | 2019-10-11 | NaN | NaN | NaN | NaN | NaN | 0.0 | 0.27 | 41.425551 | 44.948593 | 93.382899 | ... | 96.22 | 98.15 | 94.93 | 95.12 | 4909771 | 96.22 | 98.15 | 94.93 | 95.12 | 4909771 |
| XLNX | 2019-10-10 | NaN | NaN | NaN | NaN | NaN | 1.0 | -4.43 | 37.015132 | 50.002449 | 92.797889 | ... | 92.81 | 94.32 | 90.5 | 90.61 | 2883108 | 92.81 | 94.32 | 90.5 | 90.61 | 2883108 |
| XLNX | 2019-10-09 | NaN | NaN | NaN | NaN | NaN | 1.0 | -6.71 | 26.073207 | 100.0 | 92.816159 | ... | 90.48 | 91.21 | 89.67 | 90.7 | 3246725 | 90.48 | 91.21 | 89.67 | 90.7 | 3246725 |


In [25]:
stocks_df[0].tail()

| symbol | date | y_270 | y_90 | y_20 | y_10 | y_5 | y_1 | PMO | RSI | MFI | EMA | ... | adjClose | adjHigh | adjLow | adjOpen | adjVolume | close | high | low | open | volume |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ADBE | 2014-10-22 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | NaN | NaN | NaN | NaN | ... | 65.04 | 66.49 | 64.99 | 66.31 | 2363348 | 65.04 | 66.49 | 64.99 | 66.31 | 2363348 |
| ADBE | 2014-10-21 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 | NaN | NaN | NaN | NaN | ... | 66.425 | 66.56 | 64.51 | 64.84 | 3229897 | 66.425 | 66.56 | 64.51 | 64.84 | 3229897 |
| ADBE | 2014-10-20 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | NaN | NaN | NaN | NaN | ... | 64.68 | 64.73 | 63.7 | 64.25 | 4266231 | 64.68 | 64.73 | 63.7 | 64.25 | 4266231 |
| ADBE | 2014-10-17 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | NaN | NaN | NaN | NaN | ... | 64.52 | 65.13 | 63.46 | 63.46 | 6540794 | 64.52 | 65.13 | 63.46 | 63.46 | 6540794 |
| ADBE | 2014-10-16 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | NaN | NaN | NaN | NaN | ... | 62.86 | 63.72 | 60.72 | 60.72 | 6639958 | 62.86 | 63.72 | 60.72 | 60.72 | 6639958 |


Once feature extraction is finished, save all processed data locally.

In [26]:
# Save the processed data as a milestone

index_df.to_csv(processed_data_dir+ '^NDXT.csv')
for df in stocks_df:
    df.to_csv(processed_data_dir+ f'{df.index[0][0]}.csv')

ANNs generally need their input features to be normalized. Therefore, we need to convert all features into the range [0, 1]. Features that represent percentages are divided by 100, and features with arbitrary ranges are scaled with a ``MinMaxScaler``.

In [27]:
# Normalizing features

scaler = MinMaxScaler()

idf = index_df[['PMO', 'EMA', 'MACD']]
scaler.fit(idf)
index_df[['PMO', 'EMA', 'MACD']] = scaler.transform(idf)
idf = index_df[['RSI', 'SO']]
index_df[['RSI', 'SO']] = idf/100

for i, df_ in enumerate(stocks_df):
    df = df_[['PMO', 'EMA', 'MACD']]
    scaler.fit(df)
    stocks_df[i][['PMO', 'EMA', 'MACD']] = scaler.transform(df)
    df = df_[['RSI' ,'MFI', 'SO']]
    stocks_df[i][['RSI' ,'MFI', 'SO']] = df/100
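
To show what the two normalization rules do to individual values, here is a plain-Python sketch of per-column min-max scaling, (x - min) / (max - min), next to the divide-by-100 rule. It is only an illustration: `min_max_scale` and the sample values are hypothetical, missing values are represented as `None` and passed through unchanged, and the column is assumed not to be constant (otherwise the denominator would be zero).

```python
def min_max_scale(values):
    """Scale values linearly to [0, 1], leaving None (missing) untouched."""
    present = [v for v in values if v is not None]
    lo, hi = min(present), max(present)
    return [None if v is None else (v - lo) / (hi - lo) for v in values]


# Hypothetical feature columns with one missing value each.
pmo_column = [None, -4.0, 0.0, 6.0]    # arbitrary range: min-max scaled
rsi_column = [None, 25.0, 50.0, 100.0]  # percentage: divided by 100

scaled_pmo = min_max_scale(pmo_column)
scaled_rsi = [None if v is None else v / 100 for v in rsi_column]

print(scaled_pmo)
print(scaled_rsi)
```

Because the scaler is fitted separately for each stock, the same scaled value can correspond to very different raw prices in different stocks.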


### Final data preparation
Data has been processed and normalized, and is ready to be unified into feedable train and test datasets. 
We will produce one test and one train dataset for each of the 6 models to be trained, and each record (each stock market day) of each dataset will contain the following structure.
1. Label
2. Stock PMO
3. Stock EMA
4. Stock MACD
5. Stock RSI
6. Stock MFI
7. Stock SO
8. Index PMO
9. Index EMA
10. Index MACD
11. Index RSI
12. Index SO


One such record is produced for each stock symbol and each stock date.

In [73]:
# Unify all data into separate training/testing sets

for t in time_range:
    train_df_list = []
    test_df_list = []
    
    for df in stocks_df:
        test_train_separation = round((len(df) + t - max(period, 26))*2/3)
        to_concat_train = [df[['y_'+str(t), 'PMO', 'EMA', 'MACD', 'RSI' ,'MFI', 'SO']].iloc[:test_train_separation], index_df[['PMO', 'EMA', 'MACD', 'RSI', 'SO']].iloc[:test_train_separation]]
        to_concat_test = [df[['y_'+str(t), 'PMO', 'EMA', 'MACD', 'RSI' ,'MFI', 'SO']].iloc[test_train_separation:], index_df[['PMO', 'EMA', 'MACD', 'RSI', 'SO']].iloc[test_train_separation:]]
        train_df_list.append(pd.concat([s.reset_index(drop=True) for s in to_concat_train], sort=False, axis=1))
        test_df_list.append(pd.concat([s.reset_index(drop=True) for s in to_concat_test], sort=False, axis=1))
    
    full_train_df = pd.concat([df for df in train_df_list], axis=0)
    full_test_df = pd.concat([df for df in test_df_list], axis=0)
    
    
    full_train_df.dropna(inplace=True)
    full_test_df.dropna(inplace=True)
    
    # Save final data
    full_train_df.to_csv(final_data_dir+time_words[t]+'/train.csv', header=False, index=False)
    full_test_df.to_csv(final_data_dir+time_words[t]+'/test.csv', header=False, index=False)
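
The split index used above can be reproduced in isolation to see where the train/test boundary falls for each prediction horizon. This sketch only repeats the arithmetic of `test_train_separation`. The function name `split_point` is hypothetical, and the row count of 1258 is an assumed figure for roughly five years of trading days.

```python
def split_point(n_rows, horizon, period, macd_slow=26):
    """Row index separating the train slice (before) from the test slice (after)."""
    return round((n_rows + horizon - max(period, macd_slow)) * 2 / 3)


n_rows = 1258  # assumed number of rows per stock
for horizon in [1, 5, 10, 20, 90, 270]:
    print(horizon, split_point(n_rows, horizon, period=20))
```

Since the rows are ordered from newest to oldest, the first slice (used for training) holds the most recent quotes and the second slice (used for testing) holds the oldest ones.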
        