https://www.dataquest.io/m/65/guided-project%3A-predicting-the-stock-market

In this project, I will predict each day's closing stock price for [S&P 500 Index](https://en.wikipedia.org/wiki/S%26P_500_Index) based on the past record. I will use neural networks to train the model with records from `1950-2012`, and make predictions for `2013-2015`.

The dataset has been prepared by DataQuest, who describes it like the following.
___
Each row in the file contains a daily record of the price of the S&P500 Index from `1950` to `2015`. The dataset is stored in `sphist.csv`.

The columns of the dataset are:

*   `Date` \-\- The date of the record.
*   `Open` \-\- The opening price of the day (when trading starts).
*   `High` \-\- The highest trade price during the day.
*   `Low` \-\- The lowest trade price during the day.
*   `Close` \-\- The closing price for the day (when trading is finished).
*   `Volume` \-\- The number of shares traded.
*   `Adj Close` \-\- The daily closing price, adjusted retroactively to include any corporate actions. Read more [here](http://www.investopedia.com/terms/a/adjusted_closing_price.asp).
___

# 1. Data overview

In [1]:
from IPython.display import display
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import KFold
from sklearn.neural_network import MLPRegressor
from time import time

# from sklearn.feature_selection import RFE
# from sklearn.svm import SVR

import numpy as np
import pandas as pd
import re

# Load dataset
df = pd.read_csv("sphist.csv")

# Get names of original columns
original_cols = df.columns

# Set target column (closing price of the day)
target_col = "Close"

# Dataset summary
print(df.info())
print()

# Display first 5 rows
display(df.head(5))

# Check for missing data
print("Number of missing data point per feature:")
print(df.isna().sum())
print()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16590 entries, 0 to 16589
Data columns (total 7 columns):
Date         16590 non-null object
Open         16590 non-null float64
High         16590 non-null float64
Low          16590 non-null float64
Close        16590 non-null float64
Volume       16590 non-null float64
Adj Close    16590 non-null float64
dtypes: float64(6), object(1)
memory usage: 907.3+ KB
None



Unnamed: 0,Date,Open,High,Low,Close,Volume,Adj Close
0,2015-12-07,2090.419922,2090.419922,2066.780029,2077.070068,4043820000.0,2077.070068
1,2015-12-04,2051.23999,2093.840088,2051.23999,2091.689941,4214910000.0,2091.689941
2,2015-12-03,2080.709961,2085.0,2042.349976,2049.620117,4306490000.0,2049.620117
3,2015-12-02,2101.709961,2104.27002,2077.110107,2079.51001,3950640000.0,2079.51001
4,2015-12-01,2082.929932,2103.370117,2082.929932,2102.629883,3712120000.0,2102.629883


Number of missing data point per feature:
Date         0
Open         0
High         0
Low          0
Close        0
Volume       0
Adj Close    0
dtype: int64



# 2. Feature processing

In [2]:
def drop_correlated_features(df, corr_thres, target_col):
    """
    The code is from https://bit.ly/2J4WkIw
    
    df: Data frame
    corr_thres: Upper limit of correlation between features.
                If two features are correlated above corr_thes,
                one of them will be removed.
    target_col: Target column which will be excluded from
                correlation coefficient calculation
    
    Return data frame after removing features
    which are correlated together beyond corr_thres
    """
    
    
    ## Identify Highly Correlated Features
    # Create correlation matrix with feature columns
    corr_matrix = df.drop(target_col, axis=1).corr().abs()

    # Select upper triangle of correlation matrix
    upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(np.bool))

    # Find index of feature columns with correlation greater than 0.95
    to_drop = [column for column in upper.columns if any(upper[column] > corr_thres)]
    
    ## Drop Marked Features
    # Drop features
    df = df.drop(df[to_drop].columns, axis=1)

    return df

def feature_scaling(df, mode="minmax"):
    """
    df: pandas data frame
    mode: Method of scaling.
        "standardise" converts values to standard scores.
        "minmax" puts all values into [0, 1] range.
    
    Return df after replacing values in numerical columns
    with scaled values.
    """
    for col in df.select_dtypes(include=np.number).columns:
        series = df[col]
        if mode == "standardise":    
            df[col] = (series - np.mean(series)) / np.std(series)
        elif mode == "minmax":
            df[col] = (series - np.min(series)) / (np.max(series) - np.min(series))
        
    return df

def feature_aggregation(df, max_offset, first_ind_to_include):
    """
    df: pandas data frame
    first_ind_to_include: Index for the first row to include in
                          training and testing sets
    max_offset: Training and testing sets will exclude data
                from the first date in data up to max_offset days.
    
    Take rows to be included in training and testing sets.
    Add aggregated features to them.
    Return df.
    """

    # For all rows to be included in training and testing
    for ind, row in df.loc[first_ind_to_include:].iterrows():

        # Add new columns for year, month, day of week and
        # number of holidays in previous month
        ymd = row["Date"]
        
        df.loc[ind, "Year"] = ymd.year
        df.loc[ind, "Month"] = ymd.month
        df.loc[ind, "Day"] = ymd.day
        df.loc[ind, "Holidays in previous month"] = (ymd - pd.DateOffset(months=1)).days_in_month



        # For Close (closing price) and Volume columns
        for col in ["Close", "Volume"]:

            # Get values for the past (1) 5 and 30 trading days
            # and (2) max_offset days including holidays
            ind_5 = slice(ind - 5, ind)
            ind_30 = slice(ind - 30, ind)
            ind_max = slice(df[df["Date"] <= ymd - pd.DateOffset(days=max_offset)].index[0], ind)

            val_5 = df.loc[ind_5, "{}".format(col)]
            val_30 = df.loc[ind_30, "{}".format(col)]
            val_max = df.loc[ind_max, "{}".format(col)]

            # Add new columns of ...
            ## Value mean
            val_5_mean = np.mean(val_5)
            val_30_mean = np.mean(val_30)
            val_max_mean = np.mean(val_max)

            df.loc[ind, "{}: Past 5 days mean".format(col)] = val_5_mean
            df.loc[ind, "{}: Past 30 days mean".format(col)] = val_30_mean
            df.loc[ind, "{}: Past {} days mean".format(col, max_offset)] = val_max_mean

            ## Value SD
            val_5_sd = np.std(val_5)
            val_30_sd = np.std(val_30)
            val_max_sd = np.std(val_max)

            df.loc[ind, "{}: Past 5 days SD".format(col)] = val_5_sd
            df.loc[ind, "{}: Past 30 days SD".format(col)] = val_30_sd
            df.loc[ind, "{}: Past {} days SD".format(col, max_offset)] = val_max_sd

            ## Ratios between statistics from different periods in the past
            ### 5 days vs 30 days
            df.loc[ind, "{} means ratio: past 5 days / past 30 days".format(col)] = val_5_mean / val_30_mean
            df.loc[ind, "{} SDs ratio: past 5 days / past 30 days".format(col)] = val_5_sd / val_30_sd

            ### 5 days and max_offset days
            df.loc[ind, "{} means ratio: past 5 days / past {} days".format(col, max_offset)] = val_5_mean / val_max_mean
            df.loc[ind, "{} SDs ratio: past 5 days / past {} days".format(col, max_offset)] = val_5_sd / val_max_sd
    
    return df

def drop_excluded(df, first_ind_to_include, original_cols, target_col):
    """
    df: pandas data frame
    first_ind_to_include: Index of first row to include
    original_cols: Original columns which will be removed to avoid data leakage
                   (I am trying to predict each day's value based on info
                   from PREVIOUS DAYS. Therefore, any info from each day, which
                   the original columns contain, should not be input into
                   the learning algorighm.)
    target_col: Target column which contains values to be predicted
    """
    # Remove rows to be excluded from training and testing sets
    df = df.loc[first_ind_to_include:]

    # Remove original columns to avoid data leakage
    df = df.drop(original_cols.drop(target_col), axis=1)

    return df
    
def feature_processing(df, max_offset, original_cols, target_col, corr_thres):
    """
    df: pandas data frame
    max_offset: Training and testing sets will exclude the data
                    up to [first date in data + max_offset days]
    original_cols: Columns from original dataset.
    target_col: Target column which contains values to be predicted
    corr_thres: Upper limit of correlation between features.
                If two features are correlated above corr_thes,
                one of them will be removed.
    """

    # Get first date to include (= max_offset days after the first date in data)
    # in training and testing sets
    first_date_in_data = df.iloc[0]["Date"]
    first_date_to_include = first_date_in_data + pd.DateOffset(days=max_offset)
    first_ind_to_include = df.index[(df["Date"] >= first_date_to_include)][0]

    print("First date in date:", first_date_in_data)
    print("First date to include in training data", first_date_to_include)

    # Create aggregated features for training and testing
    df = feature_aggregation(df, max_offset, first_ind_to_include)
    
    # Remove rows and columns to be excluded from training and testing sets
    df = drop_excluded(df, first_ind_to_include, original_cols, target_col)

    # Scale numerical features
    df = feature_scaling(df, mode="minmax")

    # Drop features which are highly correlated with one another
    df = drop_correlated_features(df, corr_thres, target_col)

    return df

# Format Date column
df["Date"] = pd.to_datetime(df["Date"])

# Sort data frame by Date column and reset index
df = df.sort_values(by="Date")
df = df.reset_index(drop=True)

# Feature processing - create new columns based on existing ones
df = feature_processing(df, 365, original_cols, target_col, corr_thres=0.95)

# Display first 5 rows of data frame
df.head(5)

First date in date: 1950-01-03 00:00:00
First date to include in training data 1951-01-03 00:00:00


Unnamed: 0,Close,Year,Month,Day,Holidays in previous month,Close: Past 5 days mean,Close: Past 5 days SD,Close: Past 30 days SD,Close means ratio: past 5 days / past 30 days,Close SDs ratio: past 5 days / past 30 days,Close means ratio: past 5 days / past 365 days,Close SDs ratio: past 5 days / past 365 days,Volume: Past 5 days mean,Volume: Past 365 days mean,Volume: Past 5 days SD,Volume: Past 30 days SD,Volume means ratio: past 5 days / past 30 days,Volume SDs ratio: past 5 days / past 30 days,Volume means ratio: past 5 days / past 365 days,Volume SDs ratio: past 5 days / past 365 days
250,0.0,0.0,0.0,0.066667,1.0,0.0,0.002976,0.002239,0.803896,0.374452,0.023061,0.416674,0.000264,0.000517,0.000133,0.000286,0.51087,0.189881,0.047575,0.112902
251,8.5e-05,0.0,0.0,0.1,1.0,7.5e-05,0.002247,0.002536,0.82582,0.262149,0.024406,0.317158,0.000277,0.000523,9.3e-05,0.000288,0.528329,0.124929,0.050075,0.07829
252,8.5e-05,0.0,0.0,0.133333,1.0,0.00012,0.002058,0.002793,0.836977,0.224613,0.025173,0.289886,0.000286,0.00053,6.5e-05,0.00029,0.535468,0.080559,0.051535,0.054757
253,0.000147,0.0,0.0,0.233333,1.0,0.000169,0.001871,0.003083,0.848501,0.190383,0.02601,0.26305,0.000271,0.000534,0.000101,0.000283,0.498266,0.140983,0.048566,0.084911
254,0.000204,0.0,0.0,0.266667,1.0,0.000224,0.00143,0.003388,0.862964,0.137202,0.026949,0.203936,0.000278,0.000543,0.000135,0.000295,0.502812,0.18703,0.04964,0.111325


# 3. Predict and evaluate

The following comment and code have been taken from the section "6. Predict and evaluate" of [my other machine learning project](https://github.com/gknam/dataquest_projects/blob/master/DataScientist/Step6_MachineLearning/4_LinearRegressionForMachineLearning/project1/PredictingHouseSalePrices.ipynb) and slightly modified - feature selection is made 15 times in each fold and mean absolute error is used to evaluate prediction accuracy.

___

K-fold cross validation will be carried out where K will range from 2 to and including 10.

In each fold, feature selection will be done based on each feature's importance which will be evaluated using [ExtraTreesRegressor](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.ExtraTreesRegressor.html#sklearn.ensemble.ExtraTreesRegressor).

Feature selection will be done 15 times per fold. First selection will include the most important feature. The second selection will be the first selection plus the next most important feature. The same will be done for up to a selection of 20 most important features.

With each set of selected features, predictions will be made using [linear regression](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html) and then evaluated using [mean absolute error](https://en.wikipedia.org/wiki/Mean_absolute_error) (MAE). The resulting MAEs (n=15) will be averaged within each fold.

Finally, the best fold will be selected and its MAE will be reported together with the names and numbers of selected features.

**Note**: Feature selection is done ***during*** cross validation rather than beforehand. [This will prevent bias which can be created from using feature sets selected from the *whole* dataset to make prediction on *subsets* of data. By doing this, cross validation assesses the **model fitting process** rather than the model itself](https://stats.stackexchange.com/a/27751).
___

In [3]:
def feature_selection(selector, features, target, top_features_number):

    selector.fit(features, target)
    
    feature_cols_series = pd.Series(feature_cols, \
                                    name="Feature_cols")
    
    if type(selector) == ExtraTreesRegressor:
        feature_significance = selector.feature_importances_
        feature_significance_label = "Feature_importance_score"
        ascending = False

    elif type(selector) == RFE:
        feature_significance = selector.ranking_
        feature_significance_label = "Feature_importance_ranking"
        ascending = True
        
    
    feature_significance_series = pd.Series(feature_significance, \
                                          name=feature_significance_label)

    feature_cols_top = pd.concat([feature_cols_series, feature_significance_series], axis=1)\
            .sort_values(by=feature_significance_label, ascending=ascending)\
            .iloc[:top_features_number, 0].values
            
    return feature_cols_top

def train_and_test(fs_train, fs_test, t_train, t_test, model):
    
    # Fit model
    model.fit(fs_train, t_train)
    
    # Predict target using test dataset
    p_test = model.predict(fs_test)
    
    # Get MAE (mean absolute error)
    mae = mean_absolute_error(t_test, p_test)
    
    return mae


# Get features and target
feature_cols = df.columns.drop(target_col)

features = df[feature_cols]
target = df[target_col]

# Calculate number of neurons in hidden layer
n_input = (features.shape[1] + 1)
n_output = 1
n_sample = features.shape[0]
alpha = 2
n_hidden = int(n_sample / (alpha * (n_input + n_output)))

# Model to use for prediction
model = MLPRegressor(hidden_layer_sizes=n_hidden)

# Track lowest MAE
lowest_mae = df.max().max() - df.min().min()

# Track best method
# [fold, number of selected features, 
# names of selected features, lowest MAE]
best_method = [None, None, None, lowest_mae]


start = time()

# # initiate RFE feature selector
# estimator = SVR(kernel="linear")
# selector = RFE(estimator, 1, step=1)

# Initiate ExtraTreesRegressor feature selector
selector = ExtraTreesRegressor()

num_folds = range(2, 10)
mae_all = {str(fold) + "-fold CV": [] for fold in num_folds}
feature_set_sizes = range(1, 16)

# K-fold cross validation
for fold in num_folds:
    
    mae_split = []
    kf = KFold(n_splits=fold, shuffle=True)

    mae_selections = {}
    feature_selections = {}
    
    kf_splits = kf.split(features)
    
    for (train_ind, test_ind), split in zip(kf_splits, range(fold)):

        mae_selections[split] = []
        feature_selections[split] = []
        
        for fss in feature_set_sizes:

            # Get target
            t_train = target.iloc[train_ind]
            t_test = target.iloc[test_ind]
            
            # Get features
            f_train = features.iloc[train_ind]
            f_test = features.iloc[test_ind]
            
            # Select features
            fs_cols = feature_selection(selector, f_train, t_train, top_features_number=fss)
            
            fs_train = f_train[fs_cols]
            fs_test = f_test[fs_cols]
            
            # record MAE and selected features
            mae = train_and_test(fs_train, fs_test, t_train, t_test, model)
            mae_selections[split].append(mae)
            feature_selections[split].append(fs_cols)
        
    # Get mean MAE and feature names per feature selection
    mae_array = np.array([mae_selections[i] for i in mae_selections])
    feature_array = np.array([feature_selections[i] for i in feature_selections])
    for fs in feature_set_sizes:
        mae_fs_mean = np.mean(mae_array[:, fs - 1])
        features_fs = feature_array[:, fs - 1]
        
        if mae_fs_mean < lowest_mae:
            lowest_mae = mae_fs_mean
            best_method = [fold, fs, features_fs, lowest_mae]
        
        mae_all[str(fold) + "-fold CV"].append(mae_fs_mean)
    
end = time()

# Get unique feature selections in best_method
# (source https://stackoverflow.com/a/3724558)
features_fs_all = [list(x) for x in set(tuple(x) for x in best_method[2])]

# Get selector name
selector_type_str = str((type(selector)))
selector_name = re.sub(".*\.|'.*$", "", selector_type_str)

print("Duration: " + ("{:.2f}").format(end - start) + " seconds")
print()

print("K-fold cross validation was tried with K ranging from {} to and including {}."\
     .format(min(num_folds), max(num_folds)))
print()
print("Feature selection was made using {}.".format(selector_name))
print("Different selection sizes were tried with")
print("smallest set including {} features".format(min(feature_set_sizes)))
print("and maximum one including {} features.".format(max(feature_set_sizes)))
print()

print("Best prediction was made with MAE {} in".format(best_method[3]))
print("(1) {}-fold cross-validation".format(best_method[0]))
print("(2) with {} best features selected".format(best_method[1]))
print()
print("The best feature sets selected in each fold were")
print()
for ind, val in enumerate(features_fs_all):
    val = re.sub("\[|\]", "", str(val))
    to_print = str(val) + " and" if ind < len(features_fs_all) - 1 \
                              else str(val)
    
    print(to_print)
    print()
print()
print("(Each feature set lists features in order of importance)")
    
mae_all_df = pd.DataFrame(data=mae_all, index=[str(i) + " feature" for i in feature_set_sizes])
mae_all_df

Duration: 953.01 seconds

K-fold cross validation was tried with K ranging from 2 to and including 9.

Feature selection was made using ExtraTreesRegressor.
Different selection sizes were tried with
smallest set including 1 features
and maximum one including 15 features.

Best prediction was made with MAE 0.0029679271383957232 in
(1) 5-fold cross-validation
(2) with 2 best features selected

The best feature sets selected in each fold were

'Close: Past 5 days mean', 'Volume: Past 5 days mean' and

'Close: Past 5 days mean', 'Volume: Past 365 days mean'


(Each feature set lists features in order of importance)


Unnamed: 0,2-fold CV,3-fold CV,4-fold CV,5-fold CV,6-fold CV,7-fold CV,8-fold CV,9-fold CV
1 feature,0.021631,0.003386,0.003714,0.00947,0.008372,0.008106,0.003163,0.006794
2 feature,0.004757,0.004263,0.004921,0.002968,0.004149,0.004371,0.003769,0.003781
3 feature,0.004044,0.003632,0.009702,0.003623,0.007434,0.0037,0.003378,0.003418
4 feature,0.0043,0.003621,0.003326,0.003734,0.003126,0.003471,0.0032,0.005992
5 feature,0.004046,0.003864,0.004026,0.003514,0.0034,0.003583,0.003218,0.003586
6 feature,0.005068,0.003551,0.003464,0.004172,0.00311,0.003297,0.003359,0.003765
7 feature,0.005353,0.00436,0.003563,0.003587,0.003716,0.003671,0.003497,0.003341
8 feature,0.004495,0.003829,0.003789,0.003938,0.00378,0.003817,0.003578,0.003517
9 feature,0.004458,0.004047,0.003846,0.003848,0.003716,0.00403,0.003588,0.003521
10 feature,0.00469,0.004075,0.0043,0.003978,0.004058,0.004235,0.003603,0.004001


# 4. Closing remarks

## 4.1. Limitation

### 4.1.2. Feature selection method

Again, the comment has been taken from section "7.1.2. Feature selection method" of [my other machine learning project](https://github.com/gknam/dataquest_projects/blob/master/DataScientist/Step6_MachineLearning/4_LinearRegressionForMachineLearning/project1/PredictingHouseSalePrices.ipynb), with the hyperlink modified.

The output of the [previous step](#3.-Predict-and-evaluate) will be different each time it is run. This is because extra trees regressor is a method with intrinsic randomness. Intuitively, this is not good for making consistent predictions. I used it here because it was much faster than the recursive feature elimination (the only other method that I tried using).