## Data ReExploration - 150 models
Data exploration here covers the whole pipeline: finding the data, loading it into a form that can be manipulated in code, extracting features, cleaning them, and finally writing files whose features can be fed directly to an ANN or any other machine learning model.

In [1]:
import os
from os.path import isfile, join
import pandas_datareader as pdr
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import statistics
from statistics import mean, mode, median, stdev
import yfinance as yf
from sklearn.preprocessing import MinMaxScaler

### Data Retrieval
For this project I chose 3 data sources:
1. Tiingo: A financial research platform dedicated to creating innovative financial tools, which provides an API for downloading stock data.
2. IEX: Investors Exchange is a fair, simple and transparent stock exchange dedicated to investor and issuer protection, and also provides an API.
3. Yahoo Finance: It provides financial news, data and commentary including stock quotes.

In [2]:
# To download data from Tiingo and IEX we must provide an API key,
# which can be found on the account page of each site.
# My keys are stored as environment variables and correspond to free accounts.

tiingo_api_key = os.environ['TIINGO_API_KEY']
iex_api_key = os.environ['IEX_API_KEY']

In [3]:
# File containing all tickers listed by NASDAQ-100 Technology index. 

tickers_file = 'ndxt_tickers.txt'


# Directory tree to create for data processing.

data_dir = 'data_2/' #this directory must already exist.
raw_data_dir = data_dir + 'raw/'
processed_data_dir = data_dir + 'processed/'
final_data_dir = data_dir + 'final/'


# We will train different models that can predict different time ranges in the stock calendar.

time_range = [1, 5, 10, 20, 90, 270]
periods = [5, 10, 20, 90, 270]
time_words = {1:'day', 5:'week', 10:'two_weeks', 20:'month', 90:'four_months', 270:'year'}


# Make directories

for path in (raw_data_dir, processed_data_dir, final_data_dir):
    os.makedirs(path, exist_ok=True)
for n1 in periods:
    for n2 in periods:
        os.makedirs(processed_data_dir + f'{n1}_{n2}/', exist_ok=True)
        for t in time_range:
            os.makedirs(final_data_dir + time_words[t] + f'/{n1}_{n2}/', exist_ok=True)

In [4]:
# Read all the stock tickers to be downloaded

ndxt_tickers = []
with open(data_dir+tickers_file) as f:
    for ticker in f:
        ndxt_tickers.append(ticker.replace('\n', ''))

All data is downloaded and directly transformed into a ``pandas.DataFrame``. Immediately after downloading, the raw data is saved into ``.csv`` files.
The data to be downloaded are all available stock quotes for the companies in the _NASDAQ-100 Technology Sector_ index (^NDXT), as well as the index data itself. Since we use free accounts to retrieve data from these APIs, the downloaded data is limited to the 5 years preceding the current date.
As for the ^NDXT data, we are using the ``yfinance`` library created by [Ran Aroussi](https://pypi.org/project/yfinance/). 

Do not abuse the following block of code, as data retrieval comes from free-tier accounts.

In [5]:
# Code for downloading data and saving it
# ****USE ONLY WHEN NECESSARY****

# raw_stock_data_tiingo = []
# raw_stock_data_iex = []
# error_tickers = []

# for ticker in sorted(ndxt_tickers):
#     try:
#         raw_stock_data_tiingo.append(pdr.get_data_tiingo(ticker, api_key= tiingo_api_key))
#     except:
#         error_tickers.append(ticker)
# else: 
#     if error_tickers:
#         try:
#             for ticker in error_tickers:
#                 raw_stock_data_iex.append(pdr.get_markets_iex(ticker, api_key= iex_api_key))
#         except:
#             print(ticker+ ' was not downloaded.')
# raw_index_data_yahoo = yf.download('^NDXT', period='5y')


# # Save each stock data in a CSV file

# for t in raw_stock_data_tiingo:
#     t.to_csv(raw_data_dir + t.index.values[0][0] + '.csv')
    
# for t in raw_stock_data_iex:
#     t.to_csv(raw_data_dir + t.index.values[0][0] + '.csv')
    
# raw_index_data_yahoo.to_csv(raw_data_dir + '^NDXT.csv')


[*********************100%***********************]  1 of 1 downloaded


In [6]:
# Read downloaded data from files

raw_stock_data = []
raw_index_data_filename = '^NDXT.csv'
raw_stock_data_filenames = [f+'.csv' for f in ndxt_tickers]
raw_index_df = pd.read_csv(raw_data_dir + raw_index_data_filename)

for filename in raw_stock_data_filenames:
    raw_stock_data.append(pd.read_csv(raw_data_dir + filename))

### Data preprocessing
In order to manipulate the retrieved data, it is necessary to give it a proper structure.

``raw_stock_data`` is a list containing all stock dataframes, and ``raw_index_df`` is the dataframe containing the ^NDXT data. Each dataframe will be indexed by the date of each stock or index quote.

In [7]:
# Reformat date in stocks dataframes, remove time

for data in raw_stock_data:
    data['date'] = data['date'].map(lambda x: x.split()[0])

    
# Volume is not a given data for the index quotes.

raw_index_df.drop(columns='Volume', inplace=True)


# Rename index columns to lowercase

raw_index_df.columns = ['date', 'open', 'high', 'low', 'close', 'adjClose']

Every stock dataframe will also contain its ticker symbol as part of its index. We also remove unnecessary information such as dividends and splits.

In [8]:
# Assign symbol and date as index identifiers for every stock record

for data in raw_stock_data:
    data.set_index(['symbol', 'date'], inplace=True, drop=True)
    
# Assign date as index identifier for index records as well

raw_index_df.set_index(['date'], inplace=True, drop=True)


# Remove unnecessary information

for df in raw_stock_data: df.drop(columns=['divCash', 'splitFactor'], inplace=True)

A crucial step for the data processing that follows is making sure that every dataframe, both stocks and index, covers the same range of dates. This is because the final features will combine individual stock quotes with index quotes. Since we are not guaranteed that all the downloaded data covers the same time range, we must find the oldest last date and the newest first date among all dataframes.

In [9]:
# Find the oldest final date and newest starting date

last_dates = [raw_index_df.index[-1]]
first_dates = [raw_index_df.index[0]]

for df in raw_stock_data:
    dates = []
    
    for idx in df.index:
        dates.append(idx[1])
    
    last_dates.append(max(dates))
    first_dates.append(min(dates))

last_date = min(last_dates)
first_date = max(first_dates)

With the found time ranges, we can trim the dataframes to make sure they all contain the same respective dates for their data. Then, reverse dataframes so that the latest quote comes first, and the oldest quote goes last.
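
As a minimal sketch of the trimming step, assuming dates are ISO-formatted strings (so that string comparison matches chronological order), a boolean mask keeps only the rows inside the common range. The toy frames, the symbol `ABC`, and the helper name `trim_to_range` are hypothetical and not part of the notebook's data.

```python
import pandas as pd

def trim_to_range(df, first_date, last_date):
    """Keep rows whose 'date' index level lies within [first_date, last_date]."""
    dates = df.index.get_level_values('date')
    return df[(dates >= first_date) & (dates <= last_date)]

# Toy data (hypothetical): a stock frame and an index frame covering different ranges.
stock = pd.DataFrame(
    {'close': [10.0, 11.0, 12.0, 13.0]},
    index=pd.MultiIndex.from_tuples(
        [('ABC', '2020-01-02'), ('ABC', '2020-01-03'),
         ('ABC', '2020-01-06'), ('ABC', '2020-01-07')],
        names=['symbol', 'date']),
)
index = pd.DataFrame(
    {'close': [100.0, 101.0, 102.0, 103.0]},
    index=pd.Index(['2020-01-03', '2020-01-06', '2020-01-07', '2020-01-08'], name='date'),
)

# Newest first date and oldest last date across both frames.
first_date = max(stock.index.get_level_values('date').min(), index.index.min())
last_date = min(stock.index.get_level_values('date').max(), index.index.max())

stock_trimmed = trim_to_range(stock, first_date, last_date)
index_trimmed = trim_to_range(index, first_date, last_date)
```

After trimming, both toy frames cover the same three dates, which is what allows stock and index rows to be paired by position later on.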

In [10]:
# Trim every DataFrame to the common date range [first_date, last_date].
# Dates are 'YYYY-MM-DD' strings, so string comparison matches chronological order.

raw_index_df = raw_index_df[(raw_index_df.index >= first_date) & (raw_index_df.index <= last_date)]

for i, df in enumerate(raw_stock_data):
    dates = df.index.get_level_values('date')
    # Sort in the same step so that the latest quote comes first
    raw_stock_data[i] = df[(dates >= first_date) & (dates <= last_date)].sort_index(ascending=False)


# Reverse index records so that the latest quote comes first

raw_index_df = raw_index_df.iloc[::-1]

In [11]:
def labels(stock_df, since = 1):
    '''Function for labeling the trend in stock data given a period of time.
    
        Args:
            stock_df (pandas.DataFrame): contains a column 'close' for the closing prices in historical stock data.
            since (int): time period for which to label trend.
        
        Returns:
            None: the passed DataFrame will have a new column with labels 1 for increasing close price after 'since' days, 0 otherwise.
    '''
    stock_df.drop(columns='y_'+str(since), inplace=True, errors='ignore')
    labels = []
    for i in range(len(stock_df)):
        # Rows are ordered latest-first, so the quote 'since' days ahead sits at row i-since.
        if i-since < 0:
            labels.append(None)
            continue
        today = stock_df.iloc[i]['close']
        future = stock_df.iloc[i-since]['close']
        labels.append(1 if future>today else 0)
    stock_df.insert(loc=0, column='y_'+str(since), value=labels)
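
For comparison, the same labeling rule can be sketched without an explicit loop using `Series.shift`, assuming the closes are ordered latest-first as in this notebook. The name `labels_vectorized` is hypothetical; this is an alternative sketch, not the implementation used above.

```python
import numpy as np
import pandas as pd

def labels_vectorized(close, since=1):
    """1.0 where the close `since` rows earlier (later in time) is higher, else 0.0; NaN when no future quote exists."""
    future = close.shift(since)           # row i receives the close of row i - since
    out = (future > close).astype(float)
    out[future.isna()] = np.nan           # the first `since` rows have no future quote
    return out

close = pd.Series([12.0, 11.0, 13.0, 10.0])  # latest quote first (toy values)
y_1 = labels_vectorized(close, since=1)
```

With these toy values the first row has no label, and the remaining rows are labeled 1, 0, 1.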

In [12]:
def change(stock_df, period = 1):
    '''Function for calculating the change percentage of closing prices since 'period' days ago.
    
        Args:
            stock_df (pandas.DataFrame): contains a column 'close' for the closing prices in historical stock data.
            period (int): time period for which to calculate change.
        Returns:
            None: the passed DataFrame will have a new column with change percentage.
    '''
    stock_df.drop(columns='change', inplace=True, errors='ignore')
    change = []
    for i in range(len(stock_df)):
        try:
            today = stock_df.iloc[i]['close']
            previous = stock_df.iloc[i+period]['close']
            change.append(100*(today-previous)/previous)
        except:
            change.append(None)
    stock_df.insert(loc=0, column='change', value=change)

In [13]:
def PMO(stock_df, period = 50):
    '''Price Momentum Oscillator.
        
        Args:
            stock_df (pandas.DataFrame): contains a column 'close' for the closing prices in historical stock data.
            period (int): time period for which to calculate.
        Returns:
            None: the passed DataFrame will have a new column with PMO.
    '''
    stock_df.drop(columns='PMO', inplace=True, errors='ignore')
    pmo = []
    for i in range(len(stock_df)):
        try:
            today = stock_df.iloc[i]['close']
            previous = stock_df.iloc[i+period]['close']
            pmo.append(today - previous)
        except:
            pmo.append(None)
    stock_df.insert(loc=0, column='PMO', value=pmo)
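
In formula form, as a sketch of what `change` and `PMO` compute as implemented here, with $c_t$ the closing price on day $t$ and $p$ the period in trading days:

$$\text{change}_t = 100 \cdot \frac{c_t - c_{t-p}}{c_{t-p}}, \qquad \text{PMO}_t = c_t - c_{t-p}$$

Because rows are ordered latest-first, the quote from $p$ days earlier is found at row `i+period`.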

In [14]:
def RSI(stock_df, period = 50):
    '''Relative Strength Index.
        
        Args:
            stock_df (pandas.DataFrame): contains the columns 'close' for the closing prices and 'change' in historical stock data.
            period (int): time period for which to calculate.
        Returns:
            None: the passed DataFrame will have a new column with RSI.
    '''
    stock_df.drop(columns='RSI', inplace=True, errors='ignore')
    rsi = []
    for i in range(len(stock_df)):
        try:
            rsi_value = 0
            pos = []
            neg = []
            
            for j in range(period):
                change = stock_df.iloc[i+j]['change']
                if change > 0: 
                    pos.append(change)
                elif change < 0: 
                    neg.append(abs(change))
                    
            if not neg:
                rsi_value = 100
            elif not pos:
                rsi_value = 0
            else:
                pos = sum(pos)/len(pos)
                neg = sum(neg)/len(neg)
                rsi_value = 100 - (100/(1+(pos/neg)))
            rsi.append(rsi_value)
        except:
            rsi.append(None)
    stock_df.insert(loc=0, column='RSI', value=rsi)
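
The arithmetic inside `RSI` can be checked on a handful of values. The sketch below mirrors the averaging used above, where gains and losses are each averaged over the number of days that moved in that direction; note that the commonly cited Wilder definition averages over the whole period instead. The helper name `rsi_from_changes` and the sample values are hypothetical.

```python
def rsi_from_changes(changes):
    """RSI from a window of percentage changes, using the same averaging as the RSI function above."""
    gains = [c for c in changes if c > 0]
    losses = [abs(c) for c in changes if c < 0]
    if not losses:
        return 100
    if not gains:
        return 0
    avg_gain = sum(gains) / len(gains)
    avg_loss = sum(losses) / len(losses)
    return 100 - 100 / (1 + avg_gain / avg_loss)

# Average gain 1.5, average loss 1.0, so RS = 1.5 and RSI = 100 - 100/2.5
example_rsi = rsi_from_changes([1.0, -0.5, 2.0, -1.5])
```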

In [15]:
def MFI(stock_df, period = 50):
    '''Money Flow Index.
        
        Args:
            stock_df (pandas.DataFrame): contains the columns 'close' for the closing prices, 'volume', 'high', and 'low' in historical stock data.
            period (int): time period for which to calculate.
        Returns:
            None: the passed DataFrame will have a new column with MFI.
    '''
    stock_df.drop(columns='MFI', inplace=True, errors='ignore')
    mfi = []
    for i in range(len(stock_df)):
        try:
            mfi_value = 0
            pos = []
            neg = []
            typical_prices = []
            
            for j in range(period):
                if not typical_prices: typical_prices.append( mean([stock_df.iloc[i+1]['high'] , stock_df.iloc[i+1]['low'] , stock_df.iloc[i+1]['close']]) ) 
                tp = (stock_df.iloc[i+j]['high'] + stock_df.iloc[i+j]['low'] + stock_df.iloc[i+j]['close']) / 3
                if tp > typical_prices[-1]: 
                    pos.append( tp * stock_df.iloc[i+j]['volume'] )
                elif tp < typical_prices[-1]: 
                    neg.append( tp * stock_df.iloc[i+j]['volume'] )
            
            if not neg:
                mfi_value = 100
            elif not pos:
                mfi_value = 0
            else:
                pos = sum(pos)/len(pos)
                neg = sum(neg)/len(neg)
                mfi_value = 100 - (100/(1+(pos/neg)))
            mfi.append(mfi_value)
        except:
            mfi.append(None)
    stock_df.insert(loc=0, column='MFI', value=mfi)

In [16]:
def EMA(stock_df, period=50):
    '''Exponential Moving Average.
        
        Args:
            stock_df (pandas.DataFrame): contains a column 'close' for the closing prices in historical stock data.
            period (int): time period for which to calculate.
        Returns:
            None: the passed DataFrame will have a new column with EMA.
    '''
    stock_df.drop(columns='EMA', inplace=True, errors='ignore')
    a = 2/(period + 1)
    # There are many ways to calculate the first term of an exponential moving average, so for now
    # I'll be using the average of the previous 3 closes
    initial_value_range = 3
    ema = []
    
    for i in range(len(stock_df)):
        emas = []
        try:
            
            for j in list(reversed(range(period))):
                if not emas: emas.append( mean([stock_df.iloc[i+j+day]['close'] for day in range(initial_value_range)]) )
                tc = stock_df.iloc[i+j]['close']
                this_ema = (a * tc) + ((1 - a) * emas[-1])
                emas.append(this_ema)
            
            ema.append(emas[-1])
        except:
            ema.append(None)
    stock_df.insert(loc=0, column='EMA', value=ema)
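
The recurrence applied inside `EMA` can be sketched on a plain list, assuming closes are passed oldest-first and the seed value is supplied by the caller (the function above seeds with the mean of three closes). The helper name `ema_of` and the sample values are hypothetical.

```python
def ema_of(closes_oldest_first, period, seed):
    """Apply EMA_t = a * close_t + (1 - a) * EMA_{t-1} with a = 2 / (period + 1)."""
    a = 2 / (period + 1)
    value = seed
    for close in closes_oldest_first:
        value = a * close + (1 - a) * value
    return value

# period=3 gives a=0.5: seed 10 -> 11.0 after close 12 -> 12.5 after close 14
example_ema = ema_of([12.0, 14.0], period=3, seed=10.0)
```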

In [17]:
def SO(stock_df, period=50):
    '''Stochastic Oscillator.
        
        Args:
            stock_df (pandas.DataFrame): contains the columns 'close' for the closing prices, 'high', and 'low' in historical stock data.
            period (int): time period for which to calculate.
        Returns:
            None: the passed DataFrame will have a new column with SO.
    '''
    stock_df.drop(columns='SO', inplace=True, errors='ignore')
    
    so = []
    
    for i in range(len(stock_df)):
        try:
            tc = stock_df.iloc[i]['close']
            ll = min([stock_df.iloc[i+day]['low'] for day in range(period)])
            hh = max([stock_df.iloc[i+day]['high'] for day in range(period)])
            this_so = ((tc - ll) / (hh - ll)) * 100
            so.append(this_so)
        except:
            so.append(None)
    
    stock_df.insert(loc=0, column='SO', value=so)
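
As a small worked instance of the formula used in `SO`, the sketch below computes the oscillator for a single window. The helper name `stochastic_oscillator`, the guard for a flat window, and the sample values are assumptions added for illustration.

```python
def stochastic_oscillator(close, lows, highs):
    """100 * (close - lowest low) / (highest high - lowest low) over one window."""
    ll = min(lows)
    hh = max(highs)
    if hh == ll:
        return None  # flat window: the ratio is undefined
    return 100 * (close - ll) / (hh - ll)

# Lowest low 10, highest high 20, close 15 -> halfway through the range
example_so = stochastic_oscillator(15.0, lows=[10.0, 12.0, 11.0], highs=[20.0, 18.0, 19.0])
```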

In [18]:
def MACD(stock_df, p1=12, p2=26):
    '''Moving Average Convergence/Divergence.
        
        Args:
            stock_df (pandas.DataFrame): contains a column 'close' for the closing prices in historical stock data.
            p1 (int): time period for which to calculate first EMA.
            p2 (int): time period for which to calculate second EMA.
        Returns:
            None: the passed DataFrame will have a new column with MACD.
    '''
    stock_df.drop(columns='MACD', inplace=True, errors='ignore')
    
    a1 = 2/(p1 + 1)
    a2 = 2/(p2 + 1)
    initial_value_range = 3
    macd = []
    
    for i in range(len(stock_df)):
        ema1 = []
        ema2 = []
        try:
            for j in list(reversed(range(p1))):
                if not ema1: ema1.append( mean([stock_df.iloc[i+j+day]['close'] for day in range(initial_value_range)]) )
                tc = stock_df.iloc[i+j]['close']
                this_ema = (a1 * tc) + ((1 - a1) * ema1[-1])
                ema1.append(this_ema)
            
            for j in list(reversed(range(p2))):
                if not ema2: ema2.append( mean([stock_df.iloc[i+j+day]['close'] for day in range(initial_value_range)]) )
                tc = stock_df.iloc[i+j]['close']
                this_ema = (a2 * tc) + ((1 - a2) * ema2[-1])
                ema2.append(this_ema)
            
            macd.append(ema1[-1] - ema2[-1])
            
        except:
            macd.append(None)
    
    stock_df.insert(loc=0, column='MACD', value=macd)

Because the feature periods for index and stock data can be combined in many ways (150 different datasets in total), as proposed by Madge, the following cell can take several hours to run. Running it on a powerful instance is recommended.
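
The figure of 150 follows from the lists defined earlier: 5 index periods times 5 stock periods times 6 prediction horizons. A quick check, repeating those lists so the snippet stands alone:

```python
from itertools import product

periods = [5, 10, 20, 90, 270]
time_range = [1, 5, 10, 20, 90, 270]

# One dataset per (index period, stock period, prediction horizon)
combinations = list(product(periods, periods, time_range))

# Features only depend on the two periods, so they are computed once per pair
n_feature_sets = len(periods) ** 2
```

Here `len(combinations)` is 150 and `n_feature_sets` is 25, which is why the progress counter in the processing cell divides by 25 times the number of files.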

ANNs require the features they are fed to be normalized. Therefore, we need to convert all features to the range [0, 1]. Features that represent percentages are divided by 100, and features with arbitrary ranges are scaled with a ``MinMaxScaler``.
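
The two normalizations can be illustrated on plain lists. The sketch below applies the min-max formula that ``MinMaxScaler`` uses per column with its default range of (0, 1); the helper name `min_max_scale` and the sample values are hypothetical.

```python
def min_max_scale(values):
    """Map values linearly so the minimum becomes 0 and the maximum becomes 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

# A feature with an arbitrary range (PMO-like values can be negative)
pmo_scaled = min_max_scale([-2.0, 0.0, 6.0])

# A percentage feature (RSI-like values) is simply divided by 100
rsi_scaled = [v / 100 for v in [60.0, 25.0]]
```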

In [None]:
j=0
symbols = []

for n1 in periods:
    for n2 in periods:
        
        stocks_df = [df.copy() for df in raw_stock_data]
        index_df = raw_index_df.copy()
        scaler = MinMaxScaler()
        
        
        # Calculate features for index data, MFI is not calculated as it requires volume
        if not os.path.exists(processed_data_dir+ f'{n1}_{n2}/^NDXT.csv'):
            change(index_df)
            MACD(index_df, n1, 2*n1)
            SO(index_df, n1)
            EMA(index_df, n1)
            RSI(index_df, n1)
            PMO(index_df, n1)
            index_df.fillna(value=np.nan, inplace=True)
            
            # Normalizing features
            idf = index_df[['PMO', 'EMA', 'MACD']]
            scaler.fit(idf)
            index_df[['PMO', 'EMA', 'MACD']] = scaler.transform(idf)
            idf = index_df[['RSI', 'SO']]
            index_df[['RSI', 'SO']] = idf/100
            
            # Saving index file
            index_df.to_csv(processed_data_dir+f'{n1}_{n2}/^NDXT.csv')
        j += 1
        print(f'{round(j*100/(25*(1+len(ndxt_tickers))), 1)}% ', end='')
            
            
        # Calculate features and labels for stock data, this takes a lot of time
        for i, df in enumerate(stocks_df):
            symbol = df.index[0][0]
            if not os.path.exists(processed_data_dir+ f'{n1}_{n2}/{symbol}.csv'):
                change(df)
                MACD(df, n2, 2*n2)
                SO(df, n2)
                EMA(df, n2)
                MFI(df, n2)
                RSI(df, n2)
                PMO(df, n2)
                for m in time_range:
                    labels(df, m)
                df.fillna(value=np.nan, inplace=True)
                
                # Normalizing features
                df_ = df[['PMO', 'EMA', 'MACD']]
                scaler.fit(df_)
                df[['PMO', 'EMA', 'MACD']] = scaler.transform(df_)
                df_ = df[['RSI' ,'MFI', 'SO']]
                df[['RSI' ,'MFI', 'SO']] = df_/100
                
                # Saving each stock file
                df.to_csv(processed_data_dir+ f'{n1}_{n2}/{symbol}.csv')
            j += 1
            print(f'{round(j*100/(25*(1+len(ndxt_tickers))), 1)}% ', end='')

0.1% 0.2% 0.3% 0.4% 0.5% 0.6% 0.7% 0.8% 0.9% 1.0% 1.1% 

### Final data preparation
The data has been processed and normalized, and is ready to be combined into train and test datasets that can be fed to a model.
We will produce one train and one test dataset for each of the 150 models to be trained, and each record (one stock market day) of each dataset will contain the following fields:
1. Label
2. Stock PMO
3. Stock EMA
4. Stock MACD
5. Stock RSI
6. Stock MFI
7. Stock SO
8. Index PMO
9. Index EMA
10. Index MACD
11. Index RSI
12. Index SO


One such record is produced for each stock symbol and each stock date.
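
A minimal sketch of how one record set is assembled and split, assuming stock and index rows are already aligned by position. The toy values and the column name `index_RSI` are hypothetical; only one feature per source is shown.

```python
import numpy as np
import pandas as pd

# Toy features: the label is missing on the first row, a feature on the last
stock = pd.DataFrame({'y_1': [np.nan, 1, 0, 1, 0, 1],
                      'RSI': [0.6, 0.5, 0.4, 0.7, 0.3, np.nan]})
index = pd.DataFrame({'index_RSI': [0.55, 0.52, 0.48, 0.61, 0.40, 0.45]})

# Pair stock and index features row by row, then drop incomplete records
combined = pd.concat([stock.reset_index(drop=True),
                      index.reset_index(drop=True)], axis=1).dropna()

# The first two thirds of the records go to training, the rest to testing
split = round(len(combined) * 2 / 3)
train, test = combined.iloc[:split], combined.iloc[split:]
```

Four complete records remain in this toy case, giving three for training and one for testing.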

In [None]:
# Unify all data into separate training/testing sets

j=0
for n1 in periods:
    for n2 in periods:
        stocks_df = [(pd.read_csv(processed_data_dir+f'{n1}_{n2}/{symbol}.csv'), symbol) for symbol in ndxt_tickers]
        index_df = pd.read_csv(processed_data_dir+f'{n1}_{n2}/^NDXT.csv')
        for t in time_range:
            j+=1
            train_df_list = []
            test_df_list = []
            
            if not os.path.exists(final_data_dir+time_words[t]+f'/{n1}_{n2}/test/'):
                os.makedirs(final_data_dir+time_words[t]+f'/{n1}_{n2}/test/')
            if not os.path.exists(final_data_dir+time_words[t]+f'/{n1}_{n2}/train/'):
                os.makedirs(final_data_dir+time_words[t]+f'/{n1}_{n2}/train/')
            
            for df, symbol in stocks_df:
                to_concat = [df[['y_'+str(t), 'PMO', 'EMA', 'MACD', 'RSI' ,'MFI', 'SO']], index_df[['PMO', 'EMA', 'MACD', 'RSI', 'SO']]]
                concatenated = pd.concat([s.reset_index(drop=True) for s in to_concat], sort=False, axis=1).dropna()
                test_train_separation = round((len(concatenated))*2/3)
                
                concatenated_train = concatenated.iloc[:test_train_separation]
                concatenated_test = concatenated.iloc[test_train_separation:]
                
                train_df_list.append(concatenated_train)
                test_df_list.append(concatenated_test)
            
                concatenated_train.to_csv(final_data_dir+time_words[t]+f'/{n1}_{n2}/train/{symbol}.csv', header=False, index=False)
                concatenated_test.to_csv(final_data_dir+time_words[t]+f'/{n1}_{n2}/test/{symbol}.csv', header=False, index=False)
            
            full_train_df = pd.concat(train_df_list, axis=0)
            full_test_df = pd.concat(test_df_list, axis=0)

            # Save final data
            full_train_df.to_csv(final_data_dir+time_words[t]+f'/{n1}_{n2}'+'/train.csv', header=False, index=False)
            full_test_df.to_csv(final_data_dir+time_words[t]+f'/{n1}_{n2}'+'/test.csv', header=False, index=False)
            print(f'{round(j*100/(len(periods)*len(periods)*len(time_range)), 2)}% ', end='')