<a href="https://colab.research.google.com/github/marvin-hansen/SP-contest/blob/master/SAMPLE_Data_Proc_V_0_7.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data pre-processing for S&P prediction with fast.ai


---

|    	|               	|
|---------	|-------------------------	|
| Author  	| Marvin Hansen           	|
| Contact 	| marvin.hansen@gmail.com 	|
| Version 	| 0.7.3                     	|
| Updated 	| March 29, 2019          	|



## Summary
---

Dataset and sample pre-processing pipeline for the S&P 500 prediction challenge. This notebook contains a sample data processing pipeline consisting of the following tasks:

1. Load the S&P 500 (50Y) dataset 
2. Check for missing values (none in the 50Y set)
3. Apply data transformation
  * Rename columns 
  * Change date column format to datetime 
  * Categorify date to capture trends and seasons
  * Remove columns 
4. Feature engineering
  * Calculate technical indicators (MACD, RSI, etc)
  * Display feature matrix / heatmap 
5. Data processing
  * Replacing NaN with zero (required for TabularLearner)
  * Split data into train, test, validation sets
  * Store all three datasets in a single zipfile 
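
The "categorify date" step expands a single date into calendar features so that a tabular model can pick up weekly, monthly, and yearly patterns. The notebook relies on fastai's `add_datepart` for this. The sketch below only illustrates the idea with the standard library; the helper name `date_parts` and the selection of fields are assumptions and do not reproduce fastai's exact output.

```python
import calendar
import datetime


def date_parts(d: datetime.date) -> dict:
    """Expand one date into a few calendar features (hypothetical helper)."""
    last_day = calendar.monthrange(d.year, d.month)[1]
    return {
        "Year": d.year,
        "Month": d.month,
        "Week": d.isocalendar()[1],          # ISO week number
        "Day": d.day,
        "Dayofweek": d.weekday(),            # Monday = 0
        "Dayofyear": d.timetuple().tm_yday,
        "Is_month_end": d.day == last_day,
        "Is_month_start": d.day == 1,
    }


# End date of the 50Y dataset
parts = date_parts(datetime.date(2019, 3, 25))
print(parts)
```

Each key would become one additional column next to the original date column.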


## Getting started
---
1. Open notebook in colab 
2. Run install cells 
3. Run all cells (Ctrl-F9)
4. Set Flag "run" to true and run the processing pipeline
5. Have fun :-)


## Install 
Before running the notebook:  
1. Install all requirements by running all install cells
2. Set install flags to false 
3. Restart the environment, otherwise imports fail



## Known issues:
---

**Error:** 
TypeError: load_data() missing 1 required positional argument: 'path'

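**FIX:** 
Re-run the cell that defines the `load_data()` function, then run the pipeline again. The error most likely appears after the imports cell has been re-run, because the fastai star imports bring in a `load_data(path, ...)` function of the same name that replaces the notebook's own definition. 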



## Changelog
---
* V0.7.3 added missing install flags & added install documentation. Added zip download. Fixed many small issues and glitches.  
* V0.7 updated stock_data_processor
* V0.6 updated utils
* V0.5 added initial data processing 

## Compatibility
---

Libraries
* Python Version: 3.6.7
* Pandas Version: 0.24.2
* Numpy Version: 1.16.2
* TA Lib Version: 0.4.17
* FastAI Version: 1.0.50.post1
* PyTorch Version: 1.0.1.post2

GPU Acceleration
* GPU: NVIDIA K80 
* Cuda V10.0.130



# The Data

---


**Ticker:**  S&P 500 

**Time Frame:**  Daily closing price 

**Time range:** ~50 years 

**Start Date:** Dec/31/1969

**End Date:** Mar/25/2019


**Data Source:** [finance.yahoo.com](https://finance.yahoo.com/quote/%5EGSPC/history?period1=-3600&period2=1553554800&interval=1d&filter=history&frequency=1d)

**Data Storage:** [Github](https://github.com/marvin-hansen/SP-contest)

**Source file:** SP500-50Y-raw.csv

**Data Format:** CSV File 



**Train, Test, and Validation Datasets**

* **Total size:** 12419
* **Train/Test Split:** 80/20  
* **Train:** 9935 entries
* **Test:** 2484 entries
* **Valid:** 90 entries
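
The sizes above follow directly from the split parameters used later in the pipeline. The short calculation below is a worked example, not part of the pipeline; the variable names are chosen here for illustration.

```python
total_rows = 12419      # total size stated above
split_ratio = 0.80      # 80/20 train/test split
valid_size = 90         # rows in the validation set

train_rows = int(total_rows * split_ratio)   # int() truncates 9935.2 to 9935
test_rows = total_rows - train_rows

print(train_rows, test_rows, valid_size)
```

Train and test together already account for all 12419 rows, so the 90 validation rows are a subset of the data rather than an additional hold-out.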


##  Data Fields in source file 

| Name | Type | Comment |
| :--- | :--- | :--- |
| Date | object | Date of record |
| Open | float64 | Open price |
| High | float64 | Highest day price |
| Low | float64 | Lowest day price |
| Close | float64 | Closing day price |
| Adj Close | float64 | Adjusted close price |
| Volume | int64 | Trade volume |




# Flags 

In [0]:
# Before running the notebook:  
# 1) install all requirements by running the cells in the install section 
# 2) set install flags to false 
# 3) restart the environment, otherwise imports fail

# Once done, run the entire notebook with Ctrl-F9

# Flags for installation. Set false after install
install_TA = True # ta-lib compiles forever... get a coffee
install_RQ = True # install all other requirements 
install_QD = True # Quandl  

In [0]:
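# Start & end date for Quandl 
import datetime
start = datetime.date(2008, 3, 29)  # year, month, day
end = datetime.date.today()  
# Set API key
# https://docs.quandl.com/docs/getting-started
# quandl is only available after the install cells have run, so import it defensively
try:
    import quandl
    quandl.ApiConfig.api_key = "YourKey"
except ImportError:
    print("quandl is not installed yet. Run the install cells, then re-run this cell.")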

# **Run processor**

In [0]:
# flags for the processing pipeline
verbose = True
persist = True # store resulting train, test & valid files to disk 
download = False # set to true to download the generated datasets 


run = False # set true to run the entire processing pipeline 
if run:
    # process_stock_data is defined further down in this notebook
    process_stock_data(verbose, persist)

# Installations

### numpy, pandas & fast.ai

In [0]:
if(install_RQ): 
  
  # set correct version
  !pip install imgaug==0.2.7 

  # update pandas 
  !pip install --upgrade pandas 

  # update numpy 
  !pip install --upgrade numpy 

In [0]:
if(install_RQ): 
  
  # install latest fast.ai release
  # https://github.com/fastai/fastai/releases
  #!conda install -c pytorch -c fastai fastai
  !curl -s https://course.fast.ai/setup/colab | bash
  !ls

### Install TA/Lib for Technical Analysis

In [0]:
if(install_TA):
  

  # https://colab.research.google.com/drive/1bRdCYAejMOgVZyCAr2F0kx6xw4v5SpIZ#scrollTo=zfTJzr8UtL96
  # download TA-Lib 
  !wget http://prdownloads.sourceforge.net/ta-lib/ta-lib-0.4.0-src.tar.gz

  !tar -xzvf ta-lib-0.4.0-src.tar.gz

  # https://colab.research.google.com/drive/1bRdCYAejMOgVZyCAr2F0kx6xw4v5SpIZ#scrollTo=zfTJzr8UtL96
  import os
  os.chdir('ta-lib') # Can't use !cd in colab
  !./configure --prefix=/usr
  !make
  !make install
  # wait ~ 30s

  os.chdir('../')
  !pip install TA-Lib # finally...#
  # cleanup 
  !rm -r ta-lib
  !rm ta-*
  !ls

### Install Quandl

In [0]:
if(install_QD):  
  !pip install quandl

# Imports


In [0]:
import io
import platform
import datetime
import quandl
import warnings
from enum import Enum, unique
from pathlib import Path
from urllib.request import urlretrieve
import requests  # used by get_path to download the raw datasets
import os
from google.colab import files # for file up & download. See end of notebook

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from pandas import DataFrame  # importing pandas.datetime here would shadow the datetime module
from pandas.io.parsers import TextFileReader


# fast ai 
import fastai
from fastai import *
from fastai.imports import *
from fastai.basics import *
from fastai.tabular import *
from fastai.metrics import *

import torch
import talib

print("Done")

### Check Version 

In [0]:
print("* Python Version: " + str(platform.python_version()))
print("* Pandas Version: " + str(pd.__version__))
print("* Numpy Version: " + str(np.__version__))

print("* TA Lib Version: " + str(talib.__version__))
print("* FastAI Version: " + str(fastai.__version__))
print("* PyTorch Version: " + str(torch.__version__))
print()
!nvcc --version

## Verify GPU *acceleration*

In [0]:
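# current_device() raises an error without CUDA, so only call it when a GPU is available
if torch.cuda.is_available():
    print("Current device index: " + str(torch.cuda.current_device()))

print("Cuda available: " + str(torch.cuda.is_available()))
print("cuDNN enabled: " + str(torch.backends.cudnn.enabled))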

#https://stackoverflow.com/questions/48152674/how-to-check-if-pytorch-is-using-the-gpu
# setting device on GPU if available, else CPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print('Using device:', device)
print()


#Additional Info when using cuda
if device.type == 'cuda':
    print("GPU used: " + torch.cuda.get_device_name(0))
    print('Memory Usage:')
    print('Allocated:', round(torch.cuda.memory_allocated(0)/1024**3,1), 'GB')
    print('Cached:   ', round(torch.cuda.memory_cached(0)/1024**3,1), 'GB')

# Constants

In [0]:
%reload_ext autoreload
%autoreload 2
%matplotlib inline

In [0]:
#batch size 
bs = 64
# bs = 32 
# bs = 16   # uncomment this line if you run out of memory even after clicking Kernel->Restart

In [0]:
# https://docs.scipy.org/doc/numpy/reference/generated/numpy.random.seed.html
np.random.seed(42) 

# Utils

## Enums

In [0]:
## file utils
@unique
class Data(Enum):
    SP500_50Y_RAW = 0
    SP500_90Y_RAW = 1
    SP500_ALL = 2
    SP500_TRAIN = 3
    SP500_TEST = 4
    SP500_VALID = 5

## Quandl Data Fetcher 


In [0]:
# Treasury Yield Curve Rates
# https://www.quandl.com/data/USTREASURY/YIELD-Treasury-Yield-Curve-Rates
BOND_CODE = 'USTREASURY/'
BOND = ['YIELD']

# S&P 500 investor sentiment
# https://www.quandl.com/data/AAII/AAII_SENTIMENT-AAII-Investor-Sentiment-Data
# earliest record: 1987/6/26
SP_SENT_Code = 'AAII/'
SP_SENT = ['AAII_SENTIMENT']  # a list, because pullstocks iterates over its entries

# Yale sentiment data
SEN_YALE_CODE = "YALE/"
SEN_YALE = ['US_CONF_INDEX_VAL_INDIV', 'US_CONF_INDEX_VAL_INST', 'SPCOMP']

# US Misery Index
SEN_MIS_CODE = "USMISERY/"
SEN_MIS = ['INDEX']

# Housing data from NAHB
# https://www.quandl.com/data/NAHB-US-Housing-Indices
HOUSING_CODE = "NAHB/"
HOUSING = ["HOUSEACT", 'INTRATES', 'NWFHMI']


In [0]:
def pull_SINGLE_STOCK(code, ticker, folder):

    print("Pulling: " + ticker)
    # pullstocks expects a list of tickers, so wrap the single ticker
    pullstocks(code, [ticker], folder)

In [0]:
def pullstocks(code, stock, folder):
    """
    Pulls each entry of the list `stock` from Quandl and stores it as a CSV file in `folder`.
    Uses the global `start` and `end` dates defined in the flags section.
    """
    # https://medium.com/python-data/quandl-getting-end-of-day-stock-data-with-python-8652671d6661
    for ticker in stock:
        data = quandl.get_table(code, ticker=ticker,
                                date={'gte': start, 'lte': end},
                                paginate=True)

        data.to_csv(folder + ticker + "-Historical-Data.csv")

        print(ticker + " -- DONE")

    print("Done!")


In [0]:
def pull_Bond():
    # Treasury Yield Curve Rates
    label = "Treasury Yield Curve Rates"
    folder = ""
    print("Pulling: " + label)
    # 
    pullstocks(BOND_CODE, BOND, folder)

In [0]:
def pull_SEN():
    # Get AAII Investor Sentiment
    label = "Sentiment"
    folder = ""
    print("Pulling: " + label)
    
    # AAII
    stocks = SP_SENT
    code = SP_SENT_Code
    pullstocks(code, stocks, folder)

    # Yale Sentiment
    stocks = SEN_YALE
    code = SEN_YALE_CODE
    pullstocks(code, stocks, folder)

    # US Misery Index
    stocks = SEN_MIS
    code = SEN_MIS_CODE
    pullstocks(code, stocks, folder)

## View & Visualize 

In [0]:
def show_correlation_matrix(df):
    """
    Shows a heatmap of the pairwise correlations between the numeric columns of the data frame 
    :param df: pandas data frame 
    :return: None 
    """
    # adapted from
    # https://datascience.stackexchange.com/questions/10459/calculation-and-visualization-of-correlation-matrix-with-pandas
    from matplotlib import pyplot as plt
    from matplotlib import cm as cm

    # corr() only keeps numeric columns, so take the labels from its result
    corr = df.corr()
    labels = corr.columns.values.tolist()

    fig = plt.figure()
    ax1 = fig.add_subplot(111)
    cmap = cm.get_cmap('jet', 30)
    cax = ax1.imshow(corr, interpolation="nearest", cmap=cmap)
    ax1.grid(True)
    plt.title("Feature Correlation")
    # one tick per column so that every label lines up with its row and column
    ax1.set_xticks(list(range(len(labels))))
    ax1.set_yticks(list(range(len(labels))))
    ax1.set_xticklabels(labels, fontsize=6, rotation=90)
    ax1.set_yticklabels(labels, fontsize=6)
    # Add colorbar, make sure to specify tick locations to match desired ticklabels
    fig.colorbar(cax, ticks=[.50, .55, .60, .65, .70, .75, .80, .85, .90, .95, 1])
    plt.show()

In [0]:
def print_versions():
    print("* Python Version: " + str(platform.python_version()))
    print("* Pandas Version: " + str(pd.__version__))
    print("* Numpy Version: " + str(np.__version__))
    print("* TA Lib Version: " + str(talib.__version__))

## Load & Save 

In [0]:
def get_path(data_name: Data, url: bool):
    """
    Returns the local path of the dataset specified in the enum Data or,
    when url is True, the downloaded content of its reference file.
    Note, the enum is @unique so no two members can share the same value.

    ONLY the "raw" datasets have web URLs to download the official reference dataset.
    Train, test, valid, and all are generated files.

    When url = True, the reference file is downloaded from its web URL and its
    content is returned as bytes. For the train, test, and valid files an empty
    string is returned instead.

    By default, the path is just the file name in the working directory.

    Update data_folder to set a different path.

    :param data_name: Enum - Dataset
    :param url: bool flag to indicate whether to return a local path or the downloaded content
    :return: file path (str) or file content (bytes)
    """

    data_folder = "" #"Data/"
    sp_name = "SP500"
    sp50_name = "SP500-50Y"
    sp90_name = "SP500-90Y"
    frmt = ".csv"

    path = ""

    if (data_name is Data.SP500_50Y_RAW):
        path = data_folder + sp50_name + "-raw" + frmt
        if (url):
            u = "https://raw.githubusercontent.com/marvin-hansen/SP-contest/master/Data/SP500-50Y-raw.csv"
            path = requests.get(u).content

    if (data_name is Data.SP500_90Y_RAW):
        path = data_folder + sp90_name + "-raw" + frmt
        if (url):
            u = "https://raw.githubusercontent.com/marvin-hansen/SP-contest/master/Data/SP500-90Y-raw.csv"
            path = requests.get(u).content


    if (data_name is Data.SP500_ALL):
        path = data_folder + sp_name + "-all" + frmt
        # generated file: there is no web URL to download it from
        if (url): path = ""
    if (data_name is Data.SP500_TRAIN):
        path = data_folder + sp_name + "-train" + frmt
        if (url): path = ""
    if (data_name is Data.SP500_TEST):
        path = data_folder + sp_name + "-test" + frmt
        if (url): path = ""
    if (data_name is Data.SP500_VALID):
        path = data_folder + sp_name + "-valid" + frmt
        if (url): path = ""

    return path
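

The chain of `if` statements in `get_path` maps each `Data` member to one file name. The same mapping can also be written as a dictionary lookup. The sketch below is an alternative under stated assumptions, not the notebook's implementation: `FILE_NAMES` and `local_path` are hypothetical names, and the enum is repeated only to keep the example self-contained.

```python
from enum import Enum, unique


@unique
class Data(Enum):
    SP500_50Y_RAW = 0
    SP500_90Y_RAW = 1
    SP500_ALL = 2
    SP500_TRAIN = 3
    SP500_TEST = 4
    SP500_VALID = 5


# Hypothetical lookup table; the file names follow the pattern used in get_path
FILE_NAMES = {
    Data.SP500_50Y_RAW: "SP500-50Y-raw.csv",
    Data.SP500_90Y_RAW: "SP500-90Y-raw.csv",
    Data.SP500_ALL: "SP500-all.csv",
    Data.SP500_TRAIN: "SP500-train.csv",
    Data.SP500_TEST: "SP500-test.csv",
    Data.SP500_VALID: "SP500-valid.csv",
}


def local_path(data_name: Data, data_folder: str = "") -> str:
    """Return the local file path for a dataset (hypothetical helper)."""
    return data_folder + FILE_NAMES[data_name]


print(local_path(Data.SP500_TRAIN))
print(local_path(Data.SP500_50Y_RAW, "Data/"))
```

A lookup table keeps all file names in one place and raises a `KeyError` for a member that has no entry, instead of silently returning an empty path.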

In [0]:
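def load_csv_file(data_name: Data, url: bool):
    """ Loads the requested dataset, either from its local file or from the web
    :param data_name: Enum - dataset to load
    :param url: when True, download the file content instead of reading the local file
    :return: pandas data frame
    """
    if url:
        # get_path returns the downloaded bytes, which are decoded and wrapped in a text buffer
        content = get_path(data_name=data_name, url=url).decode('utf-8')
        return pd.read_csv(io.StringIO(content), infer_datetime_format=True)

    else:
        return pd.read_csv(get_path(data_name=data_name, url=url), infer_datetime_format=True)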

In [0]:
def load_data(data: Data, force_download: bool = False):
    """ Loads the requested dataset, either from the web or from a local copy.
    A local file is used whenever it exists, unless force_download is set to True.
    By default, force_download is False so that the local copy is tried first.
    Storing a local copy after a download is currently disabled
    (see the commented-out line below).

    @depends: Data - Enum that specifies available datasets

    @depends: get_path - adjust the local file path and URLs there.
    Default local path is the working directory.
    Default URL is the public GitHub repo.

    :param data: dataset to load
    :param force_download: download the web version even if a local copy exists. False by default.
    :return: pandas dataframe
    """
    path = Path(get_path(data_name=data, url=False))
    if force_download or not path.exists():
        print("Load from URL")
        df = load_csv_file(data_name=data, url=True)
        # ... store a local copy to accelerate the next data loading
        #df.to_csv(get_path(data_name=data, url=False))
        return df
    else: # local copy must be there b/c path exists
        print("Load data from local file")
        return load_csv_file(data_name=data, url=False)


In [0]:
def save_train_test_valid(df: DataFrame, split_ratio: float, valid_size: int, verbose: bool):
    """
    Splits a dataframe into train, test, and validation sets and stores each set in a different file
    :param df: pandas data frame
    :param split_ratio: share of rows that goes into the train set, e.g. 0.80
    :param valid_size: number of rows in the validation set
    :param verbose: prints out file paths when set to true
    :return: None
    """
    if (verbose):
        print("Save data to file.. ")
    
    # replace NaN with zero
    df = fill_nan(df)
    
    # store the first n rows as validation set
    # NOTE: head() returns the latest data points only if the frame is sorted newest first
    valid = df.head(valid_size)
    valid_file = get_path(data_name=Data.SP500_VALID, url=False)
    valid.to_csv(valid_file)

    # split the full frame into train & test sets
    # NOTE: the validation rows are not removed, so they also appear in the train set
    split = int(len(df) * split_ratio)
    train = df[0:split]
    test = df[split:len(df)]
  
    # store train dataset
    train_file = get_path(data_name=Data.SP500_TRAIN, url=False)
    train.to_csv(train_file)

    # store test dataset
    test_file = get_path(data_name=Data.SP500_TEST, url=False)
    test.to_csv(test_file)

    if (verbose):
        print("Done! All data are saved")
        print('All data : %d' % (len(df)))
        print('Training data: %d' % (len(train)))
        print('Testing data: %d' % (len(test)))
        print('Validation data: %d' % (len(valid)))
        print()
        print("Stored train data in file: ")
        print(train_file)
        print()
        print("Stored test data in file: ")
        print(test_file)
        print()
        print("Stored validation data in file: ")
        print(valid_file)
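

In `save_train_test_valid` the validation rows are taken from the same frame that is then split into train and test, so the three sets overlap. If a strict hold-out is wanted, the validation rows can be set aside first. The sketch below shows one way to do that; it is an alternative, not the notebook's method. The helper `split_without_overlap` is hypothetical and assumes the rows are sorted oldest first, so that the last rows are the most recent ones.

```python
from typing import Sequence, Tuple


def split_without_overlap(rows: Sequence, split_ratio: float,
                          valid_size: int) -> Tuple[Sequence, Sequence, Sequence]:
    """Hold out the last `valid_size` rows, then split the rest (hypothetical helper)."""
    cut = len(rows) - valid_size
    remaining, valid = rows[:cut], rows[cut:]
    split = int(len(remaining) * split_ratio)
    return remaining[:split], remaining[split:], valid


# Stand-in data: 1090 consecutive row numbers instead of real prices
train, test, valid = split_without_overlap(list(range(1090)), 0.80, 90)
print(len(train), len(test), len(valid))
```

For a pandas data frame the same positional slices can be taken with `df.iloc`. With such a split the set sizes listed in the data section would change, because the 80/20 ratio is applied to the remaining rows only.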

## Dataframe utils

In [0]:
def clean(verbose: bool):
    """
    Loads the 90Y raw dataset, checks it for missing values, and removes them
    :param verbose: print out details
    :return: data frame without missing values
    """
    # load the raw data
    df = load_data(data=Data.SP500_90Y_RAW)

    # check raw data for missing values
    missing = check_missing_values(df=df, verbose=verbose)
    # The S&P dataset has 45 missing values, which is barely 0.18% thus removing them won't hurt much.
    if (missing):
        df = remove_missing_values(df)
        # Double check, should print zero
        check_missing_values(df=df, verbose=verbose)
    return df

In [0]:
def check_missing_values(df, verbose: bool):
    """
    Checks the given data frame for missing value and returns a boolean value.
    When set to verbose, the function prints out number and percentage of missing data

    :param df: pandas dataframe
    :param verbose: boolean
    :return: True if df contains missing values, otherwise false.
    """

    # to get the total number of missing values in the DataFrame,
    # we chain two .sum() methods together:
    if (verbose):
        nr_missing = df.isnull().sum().sum()
        nr_rows = len(df)
        # missing cells relative to the number of rows
        prct_missing = (nr_missing / nr_rows) * 100

        print("Has missing values: " + str(df.isnull().values.any()))
        # inspect each column
        col_names = df.columns.values.tolist()
        for c in col_names:
            print(str(c) + " missing values: " + str(df[c].isnull().sum()))
        print()
        print("Total Rows: " + str(nr_rows))
        print("Total Missing: " + str(nr_missing))
        print("Percentage Missing (relative to rows): " + str(prct_missing))
        print()

    return df.isnull().values.any()
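

The counting in `check_missing_values` distinguishes between missing cells per column and missing cells in total. The toy example below uses invented values, only to show how the chained `.sum()` calls behave and how a count of affected rows differs from a count of missing cells. The variable names are chosen for this illustration.

```python
import numpy as np
import pandas as pd

# Toy frame with invented values, only to illustrate the counting
df = pd.DataFrame({
    "Close": [10.0, np.nan, 12.0, 13.0],
    "Volume": [100.0, 200.0, np.nan, np.nan],
})

missing_per_column = df.isnull().sum()           # one count per column
missing_total = int(missing_per_column.sum())    # all missing cells
rows_with_missing = int(df.isnull().any(axis=1).sum())
pct_rows_missing = rows_with_missing / len(df) * 100

print(missing_per_column.to_dict(), missing_total, rows_with_missing, pct_rows_missing)
```

`dropna()` removes every row counted in `rows_with_missing`, so that number indicates how much data `remove_missing_values` would discard.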


In [0]:
def remove_missing_values(df):
    """
    Drop each row in a data frame where at least one element is missing
    and returns a copy without missing values.
    :param df: pandas data frame
    :return: df without missing values.
    """
    return df.dropna()

In [0]:
def remove_column(df, col_name):
    """
    Deletes the given column(s) from the given data frame
    :param df: pandas data frame
    :param col_name: list of column names
    :return: the same data frame, without the columns
    """
    # drop() returns a copy by default, so drop in place to make the change stick
    df.drop(columns=col_name, inplace=True)
    return df


In [0]:
def fill_nan(df):
    """ Fills NaN values with zero
    :param df: pandas dataframe
    :return: dataframe  without NaN
    """
    return df.fillna(0)

In [0]:
def reverse_df(df):
    """
    Reverses the row order so that the last rows are listed first
    :param df: pandas data frame 
    :return: reversed frame
    """
    return df.iloc[::-1]


In [0]:
def rename_column(df, old_name, new_name):
    """
    Renames a column in the given data frame 
    :param df: pandas dataframe 
    :param old_name: current column name
    :param new_name: new column name
    :return: copy of the data frame with the renamed column
    """
    return df.rename(columns={old_name: new_name})


In [0]:
def inspect_df(df):
    """
    Shows key information about the given dataframe 
    :param df: pandas dataframe
    :return: None 
    """
    print("Nr. of data: " + str(len(df)))
    print()
    print("Meta Data")
    df.info()  # info() prints directly and returns None
    print()
    print("Sample data: ")
    print(df.tail(3).T)
    print()
    show_correlation_matrix(df)
    print("Correlation Matrix")
    print(df.corr())
    print()
    print("Nr. of data: " + str(len(df)))

# Data pre-processors 

In [0]:
def convert_date(df, col_name):
    """
    Converts the given date column from the usual object type to datetime

    :param df: pandas dataframe
    :param col_name: date column
    :return: dataframe with date column of type datetime
    """
    # Passing infer_datetime_format=True can often speed up parsing
    # if the format is not exactly ISO8601 but still regular.
    # https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.to_datetime.html
    df[col_name] = pd.to_datetime(df[col_name], infer_datetime_format=True)
    return df

In [0]:
def drop_features(df, features):
    """
    Removes the columns matching the given feature names
    :param df: pandas data frame
    :param features: list of column names
    :return: copy of the data frame without the given columns
    """
    return df.drop(columns=features)

In [0]:
def add_previous_values(df, column_name, number):
    """ Adds n-previous values and stores each in a separate column
        According to findings by tsfresh, the previous value can have as much
        as 88.5% significance on predicting the current value.
        https://github.com/blue-yonder/tsfresh/blob/master/notebooks/timeseries_forecasting_google_stock.ipynb

        NOTE: shift(-n) takes the value from n rows further down, which is a
        previous value only if the frame is sorted newest first.

    :param df: data frame
    :param column_name: source column
    :param number: number of time periods to add
    :return: None - modifies the frame in place
    """
    for n in range(1, (number + 1)):
        df[column_name + str("-") + str(n)] = df[column_name].shift(-n)
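

Whether `shift(-n)` yields a previous or a future value depends on the row order. The toy series below uses invented prices sorted oldest first and only demonstrates the direction of `shift`; the variable names are chosen for this illustration.

```python
import pandas as pd

# Invented closing prices, sorted oldest first
close = pd.Series([100.0, 101.0, 103.0, 102.0], name="Close")

lag_1 = close.shift(1)     # value from one row above
lead_1 = close.shift(-1)   # value from one row below

print(pd.DataFrame({"Close": close, "shift(1)": lag_1, "shift(-1)": lead_1}))
```

With rows sorted oldest first, `shift(1)` is the previous day's close and `shift(-1)` is the next day's close. With rows sorted newest first the roles swap, so the sort order of the input should be checked before these columns are used as features.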


In [0]:
def calc_percent_change(df, column_name):
    """
    Calculates the percentage change for each value in the given column.
    Requires the "<column_name>-delta" column, so run calc_row_delta first.
    :param df: pandas data frame
    :param column_name: String - name of the column 
    :return: None - modifies the frame in place
    """
    df[column_name + "-pct-chng"] = (df[column_name + "-delta"] / df[column_name]) * 100


In [0]:
def calc_row_delta(df, column_name):
    """
    Calculates the difference between the current value and the value in the next row of the given column.
    The next row holds the previous value only if the frame is sorted newest first.
    :param df: pandas data frame
    :param column_name: String - name of the column
    :return: None - modifies the frame in place
    """
    df[column_name + '-delta'] = df[column_name] - df[column_name].shift(-1)
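

`calc_percent_change` reads the `-delta` column that `calc_row_delta` creates, so the two calculations have to run in that order. The worked example below applies the same two expressions to a toy frame with invented prices, sorted newest first so that `shift(-1)` refers to the previous day.

```python
import pandas as pd

# Invented prices, sorted newest first so that shift(-1) is the previous day
df = pd.DataFrame({"Close": [110.0, 100.0, 80.0]})

# Same expressions as calc_row_delta and calc_percent_change, in that order
df["Close-delta"] = df["Close"] - df["Close"].shift(-1)
df["Close-pct-chng"] = (df["Close-delta"] / df["Close"]) * 100

print(df)
```

The delta is divided by the current row's value, so the first row yields about 9.09 rather than 10. This differs from pandas' `pct_change`, which divides by the earlier value. The last row has no following row and therefore stays NaN.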


In [0]:
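def calc_technical_indicators(df, column_name: str, id: int, all: bool, verbose: bool):
    """
    Calculates a set of selected technical indicators based on the close price of the given stock
    :param df: pandas data frame
    :param column_name: MUST refer to the close price of the stock
    :param id: number of the indicator group to calculate (1 = Bollinger bands, 2 = SMA, 3 = EMA,
               4 = Momentum, 5 = RSI, 6 = TRIX, 7 = cycle indicators)
    :param all: when True, calculates all indicator groups regardless of id
    :param verbose: prints the correlation matrix after each group
    :return: None - modifies the frame in place
    """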
    close = np.asarray(df[column_name])
    # This is for experimenting to generate a wide range of technical indicators
    # requires subsequent feature ranking
    full_range = [2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 25, 26, 27, 28, 29]
    short_range = [3, 5, 7, 10, 15, 20, 25]
    long_range = [50, 60, 70, 80, 100, 150, 200, 250]
    # NOTE: this name shadows the built-in range() inside this function
    range = full_range + long_range

    if (id == 1 or all):
        # Bollinger bands
        df['UP_BB'], df['MID_BB'], df['LOW_BB'] = talib.BBANDS(close, timeperiod=20, nbdevup=2, nbdevdn=2, matype=0)
        if (verbose):
            print("ID: " + str(id))
            print("Bollinger bands")
            print(df.corr())

    if (id == 2 or all):
        # Create Simple Moving Average
        # Time range adjusted based on feature ranking for S&P500
        periods = range  # [2, 3, 4, 5, 7]
        for period in periods:
            df['SMA-' + str(period)] = talib.SMA(close, timeperiod=period)
        if (verbose):
            print("ID: " + str(id))
            print("Simple Moving Average")
            print(df.corr())

    if (id == 3 or all):
        # Create Exponential moving average
        # correlation drops at 30 and beyond
        # time range adjusted based on feature ranking for S&P500
        periods = range  # [6,7,9,12]
        for period in periods:
            df['EMA-' + str(period)] = talib.EMA(close, timeperiod=period)
        if (verbose):
            print("ID: " + str(id))
            print("Exponential Moving Average")
            print(df.corr())

    if (id == 4 or all):

        # Create Momentum
        # no strong correlation for the MOM indicators was found, thus disabled.
        # only MOM-300 yields about ~ -30% Corr.
        periods = range
        for period in periods:
            df['MOM-' + str(period)] = talib.MOM(close, timeperiod=period)
        if (verbose):
            print("ID: " + str(id))
            print("Momentum")
            print(df.corr())

    if (id == 5 or all):
        # Create RSI
        # Time range adjusted based on feature ranking for S&P500
        periods = range  # [10, 11, 12, 13, 14, 15, 21, 22]
        for period in periods:
            df['RSI-' + str(period)] = talib.RSI(close, timeperiod=period)
        if (verbose):
            print("ID: " + str(id))
            print("RSI")
            print(df.corr())

    if (id == 6 or all):
        # Create TRIX
        # Time range adjusted based on feature ranking for S&P500
        # For a smaller sample size, only Trix-30 shows the highest correlation to close price.
        # Add full range to re-test and look how Trix-30 performs
        periods = range  # [3, 4]  # range
        for period in periods:
            df['TRIX-' + str(period)] = talib.TRIX(close, timeperiod=period)

        if (verbose):
            print("ID: " + str(id))
            print("Trix")
            print(df.corr())

    if (id == 7 or all):
        # Cycle Indicator Functions
        # https://mrjbq7.github.io/ta-lib/func_groups/cycle_indicators.html
        df["HT_DCPERIOD"] = talib.HT_DCPERIOD(close)
        df["HT_DCPHASE"] = talib.HT_DCPHASE(close)
        df["HT_TRENDMODE"] = talib.HT_TRENDMODE(close)
        if (verbose):
            print("ID: " + str(id))
            print("Cycle Indicator Functions")
            print(df.corr())
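

To make the moving-average indicators less of a black box, the sketch below computes a simple moving average and Bollinger-style bands with plain pandas on invented prices. It illustrates the arithmetic only and is not a replacement for the TA-Lib calls above; the use of the population standard deviation (`ddof=0`) is an assumption made for this example.

```python
import pandas as pd

# Invented closing prices, only to illustrate the arithmetic
close = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
period = 3

# Simple moving average: mean of the last `period` values
sma = close.rolling(window=period).mean()

# Bollinger-style bands: moving average +/- 2 rolling standard deviations.
# ddof=0 (population standard deviation) is an assumption made here.
std = close.rolling(window=period).std(ddof=0)
upper = sma + 2 * std
lower = sma - 2 * std

print(pd.DataFrame({"close": close, "sma": sma, "upper": upper, "lower": lower}))
```

The first `period - 1` rows are NaN because the window is not yet full. Indicators with long periods therefore produce many NaN rows at the start, which is why the pipeline replaces NaN with zero before saving.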
            

# Processing pipeline

In [0]:
def process_stock_data(verbose: bool, persist: bool):
    """

    Sample pre-processing pipeline on the S&P500 50 Year data

    The 50Y dataset contains open, high, low, close, volume [OHLCV] data and percentage change.

    Functions:
    - calc_all_features requires [OHLCV] data

    - get_chart_patterns adds all known chart patterns and requires [OHLCV] data

    - calc_technical_indicators requires only the closing price

    :param verbose: show details
    :param persist: store processed data in train, test, and valid files
    :return: void
    """

    if (verbose):
        print_versions()
        print("load raw data ...")
    df = load_data(data=Data.SP500_50Y_RAW, force_download=True)

    y_name = "Close"
    if (verbose):
        print("Set dependent variable to : " + y_name)

    if (verbose):
        print("Convert date to datetime : ")
    # That's required for automated feature engineering
    #convert_date(df, "Date")
    df['Date'] = pd.to_datetime(df['Date'], infer_datetime_format=True)
   
    if (verbose):
        print("Categorify Date  : ")
    # add_datepart must run regardless of the verbose flag
    add_datepart(df, "Date", drop=False)
    
    if (verbose):
        print("Inspect Data: ")
        print(df.info())
        print()
        print(df.tail(3).T)

    if (verbose):
        print("check raw data for missing values: ")

    missing = check_missing_values(df=df, verbose=verbose)
    # There should be none, but double check anyway
    if (missing):
        print("Found missing! Remove now... ")
        df = remove_missing_values(df)
        # Double check, should print zero
        check_missing_values(df=df, verbose=verbose)

    if (verbose):
        print("add closing values of the previous n days")
    add_previous_values(df=df, column_name=y_name, number=5)

    if (verbose):
        print("Calculate technical indicators")
        # id = sets the technical indicator group
        # all = do all technical indicators
        # Verbose = prints out the correlation matrix for each group
    calc_technical_indicators(df=df, column_name=y_name, id=1, all=True, verbose=False)


    if (verbose):
        print("Run feature engineering...")

        #@TODO: Have fun...

        #add_previous_values(df=df, column_name=y_name, number=5)
        #calc_technical_indicators(df=df, column_name=y_name, id=1, all=True, verbose=False)
        # def get_chart_patterns(df)
        print("Done feature engineering...")


    if (verbose):
        print("Remove columns ... ")
    # values of Adj Close are nearly identical to the Close column
    # Open, High, Low aren't predicted.
    rem_columns = ["Adj Close", "Open", "High", "Low"]
    # drop() returns a new frame, so the result has to be assigned back
    df = drop_features(df, rem_columns)

    if (verbose):
        inspect_df(df)

    if (persist):
        split = 0.80 # for a 80/20 split
        valid_size = 90  # last 3 months for validation
        save_train_test_valid(df=df, split_ratio=split, valid_size=valid_size, verbose=verbose)
        print("Done: Data processing completed")


# Run processing pipeline

In [0]:
run = True
download = False # set to true to download the generated datasets 
if run:
  process_stock_data(verbose, persist)

# Check new files 


## Load new dataset

In [0]:
if run: # only test the new files when the processor has generated them..
  print("Load & inspect train dataset")
  train_df = load_data(data=Data.SP500_TRAIN)
  train_df.info()

In [0]:
if run:
  print("Load & inspect test dataset")
  test_df = load_data(data=Data.SP500_TEST)
  test_df.info()

In [0]:
if run:
  print("Load & inspect valid dataset")
  valid_df = load_data(data=Data.SP500_VALID)
  valid_df.info()

In [0]:
if run:
  print("Show Train dataset")
  # inside an if block the last expression is not displayed automatically
  print(train_df.tail(3).T)

## Download new files 

In [0]:
if download:
  # zip everything together 
  !zip SP500-data-clean.zip SP500-*
  # Download
  files.download('SP500-data-clean.zip')

## Upload file 

... just in case

In [0]:
# https://www.tecmint.com/unzip-extract-zip-files-to-specific-directory-in-linux/
# Extract with !unzip SP500-data-clean.zip 
#from google.colab import files
#files.upload()