# KNN tuning features

In this notebook we tune the number of features used by a k-nearest neighbors (KNN) regressor to predict stock prices.

In [1]:
import matplotlib

from sklearn import neighbors
from sklearn.model_selection import train_test_split
from IPython.display import display

from data.get_50_highest_weights import get_sp_50_highest_weights_symbols
from data_preparation.ochlva_data import OCHLVAData
from utils.column_modifiers import target_generator
from utils.column_modifiers import feature_generator
from utils.column_modifiers import keep_columns
from utils.scorers import normalized_root_mean_square_error
from estimators.predictions import calculate_rolling_prediction

In [2]:
matplotlib.use('nbAgg')

In [3]:
import matplotlib.pyplot as plt
from utils.visualizations import plot_scores
from utils.visualizations import plot_true_and_prediction

Load the S&P 500 index data (ticker `^GSPC`).

In [4]:
ochlva_data = OCHLVAData()

Load three additional stocks: the highest-weighted, the middle-weighted and the lowest-weighted of the 50 downloaded symbols.
This gives a better feel for how the model behaves across different stocks.

In [5]:
symbols = get_sp_50_highest_weights_symbols()

# Select symbols with high, medium and low weights
selected_symbols = (symbols.iloc[0], 
                    symbols.iloc[len(symbols)//2], 
                    symbols.iloc[-1])

for s in selected_symbols:
    ochlva_data.load_data(s)

For now, we train only on the adjusted close values.

In [6]:
# Keep only 'Adj. Close' column
ochlva_data.transform(keep_columns, ['Adj. Close'], copy=False)

Next, we create the target values for the data.
The target columns are the 'Adj. Close' values 7, 14 and 28 days ahead.

In [7]:
days = [7, 14, 28]
ochlva_data.transform(target_generator, 'Adj. Close', days, copy=False)
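
The implementation of `target_generator` is not shown in this notebook. As a minimal sketch of the idea, assuming each row is one trading day and that the targets are plain row shifts, the hypothetical helper `make_shifted_targets` below builds such columns with `pandas`. The column naming is also an assumption.

```python
import pandas as pd


def make_shifted_targets(df, column, days):
    """Hypothetical stand-in for target_generator: add future values as targets."""
    out = df.copy()
    for d in days:
        # shift(-d) moves the value from d rows ahead onto the current row,
        # so the last d rows have no target and become NaN
        out[f'{column} + {d} days'] = out[column].shift(-d)
    return out


prices = pd.DataFrame({'Adj. Close': [10.0, 11.0, 12.0, 13.0, 14.0]})
with_targets = make_shifted_targets(prices, 'Adj. Close', [1, 2])
print(with_targets)
```

Counting the trailing NaN values of each target column (`with_targets.isnull().sum()`) recovers the shift that produced it, which is how the prediction horizon is obtained later in this notebook.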

As we are tuning only the features, a single regressor instance with default hyperparameters suffices.

In [8]:
reg = neighbors.KNeighborsRegressor()
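
For intuition about what the regressor does, here is a small sketch on toy data (the arrays are made up for illustration). With uniform weights, a KNN regressor predicts the mean target of the `n_neighbors` closest training points. The default value of `n_neighbors` in scikit-learn is 5.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Toy data: two well-separated clusters where the target equals the feature
x_toy = np.array([[0.0], [1.0], [2.0], [10.0], [11.0], [12.0]])
y_toy = np.array([0.0, 1.0, 2.0, 10.0, 11.0, 12.0])

knn = KNeighborsRegressor(n_neighbors=3).fit(x_toy, y_toy)

# The three nearest neighbors of 1.2 are 1.0, 2.0 and 0.0,
# so the prediction is the mean of their targets
pred = knn.predict([[1.2]])
print(pred[0])  # → 1.0
```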

## Tuning the features on the validation set

To avoid leaking information from the unseen test data into the tuning, we tune the number of features on a validation set.
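
The nested chronological split can be illustrated on toy arrays (the data here is made up). With `shuffle=False`, `train_test_split` keeps the row order, so the validation rows come after the fitting rows and the test rows come last.

```python
import numpy as np
from sklearn.model_selection import train_test_split

x = np.arange(100).reshape(-1, 1)   # toy stand-in for the feature rows
y = np.arange(100)                  # toy stand-in for the targets

# First split: hold out the last 20 % as the untouched test set
x_train, x_test, y_train, y_test = train_test_split(
    x, y, shuffle=False, test_size=.2)

# Second split: carve the validation set out of the end of the training part
x_fit, x_validate, y_fit, y_validate = train_test_split(
    x_train, y_train, shuffle=False, test_size=.2)

print(len(x_fit), len(x_validate), len(x_test))  # → 64 16 20
```

Because the order is preserved, every row used for fitting is older than every validation row, and every validation row is older than every test row.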

In [9]:
feature_days = [1, 5, 10, 20, 40, 80, 160]

validation_scores = dict()
train_scores = dict()

for key in ochlva_data.transformed_data.keys():
    print(f'Processing {key}')
    # Extract the features and targets
    # (the targets are the last len(days) columns)
    data = ochlva_data.transformed_data[key]
    x = data.iloc[:, :-len(days)]
    y = data.iloc[:, -len(days):]
       
    # Append the stock to the scores
    validation_scores[key] = dict()
    train_scores[key] = dict()
    
    for f_days in feature_days:        
        x_w_features = feature_generator(x, 'Adj. Close', f_days, copy=True)
        
        x_train, _, y_train, y_test = \
            train_test_split(x_w_features, y, shuffle=False, test_size=.2)    
        x_train_for_validate, x_validate, y_train_for_validate, y_validate = \
            train_test_split(x_train, y_train, shuffle=False, test_size=.2)

        # Recover the prediction horizon of each target column from its NaN
        # count: a column named 'x + 2 days' has NaN in its last two rows
        prediction_days = y_test.isnull().sum()

        # Calculate validation scores
        y_pred, y_pred_train = \
            calculate_rolling_prediction(reg,
                                         x_train_for_validate,
                                         x_validate,
                                         y_train_for_validate,
                                         y_validate, 
                                         prediction_days,
                                         training_prediction=True)
        validation_scores[key][f_days] = \
            normalized_root_mean_square_error(y_validate, y_pred)
        
        # Look up the true targets for the rows that received a training
        # prediction
        y_train_true = y_train.loc[y_pred_train.index, :] 
        train_scores[key][f_days] = \
            normalized_root_mean_square_error(y_train_true, y_pred_train)  

Processing ^GSPC
Processing AAPL
Processing CMCSA
Processing GILD
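
The loop above relies on `feature_generator`, whose implementation is not shown here. One plausible reading, and it is only an assumption, is that `f_days` controls how many lagged copies of 'Adj. Close' are added as feature columns. The hypothetical helper `add_lag_features` sketches that reading.

```python
import pandas as pd


def add_lag_features(df, column, n_days):
    """Hypothetical stand-in for feature_generator: add n_days lagged copies."""
    out = df.copy()
    for lag in range(1, n_days + 1):
        # shift(lag) brings the value from `lag` rows earlier onto this row,
        # so the first `lag` rows of the new column are NaN
        out[f'{column} - {lag} days'] = out[column].shift(lag)
    return out


prices = pd.DataFrame({'Adj. Close': [10.0, 11.0, 12.0, 13.0, 14.0]})
lagged = add_lag_features(prices, 'Adj. Close', 2)
print(lagged)
```

Under this reading, a larger `f_days` gives each row a longer price history, at the cost of losing the first `f_days` rows to NaN values.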


In [10]:
_ = plot_scores(train_scores, validation_scores, x_label='Days')
plt.show()

Both the training and the validation error decrease as we increase the number of features.
There are no signs of overfitting in the tested range, so we use the largest value, 160, below.
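
The scorer `normalized_root_mean_square_error` comes from the project's `utils.scorers` module and its definition is not shown. As a sketch of one common convention, assuming normalization by the range of the true values (other conventions divide by the mean or the standard deviation), the hypothetical function `nrmse_by_range` could look like this:

```python
import numpy as np


def nrmse_by_range(y_true, y_pred):
    """RMSE divided by the range of the true values (one common convention)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # Ignore positions where either value is missing (e.g. trailing NaN targets)
    mask = ~(np.isnan(y_true) | np.isnan(y_pred))
    rmse = np.sqrt(np.mean((y_true[mask] - y_pred[mask]) ** 2))
    return rmse / (y_true[mask].max() - y_true[mask].min())


score = nrmse_by_range([10.0, 12.0, 14.0, np.nan], [11.0, 12.0, 13.0, 15.0])
print(score)
```

Normalizing makes the scores of stocks with very different price levels comparable, which matters when plotting them side by side.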

## Test on the unseen test set

We now test how well the model with the selected features generalizes to the unseen test set.
This also acts as a sanity check that the findings so far are reasonable.
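
The test relies on `calculate_rolling_prediction`, which is project-specific and not shown in this notebook. The sketch below is not that function. It is a generic walk-forward scheme, written under the assumption that the model is refitted before each prediction using only targets that would already have been observed. The name `walk_forward_predict` and the toy sine data are made up for illustration.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor


def walk_forward_predict(model, x, y, n_train, horizon):
    """Predict y[t] for every t >= n_train, refitting before each prediction.

    y[t] is the value `horizon` steps after x[t], so at time t only the
    targets of rows up to t - horizon have actually been observed.
    """
    predictions = []
    for t in range(n_train, len(x)):
        last_known = t - horizon + 1          # exclusive end of usable rows
        model.fit(x[:last_known], y[:last_known])
        predictions.append(model.predict(x[t:t + 1])[0])
    return np.array(predictions)


horizon = 7
series = np.sin(np.arange(200) / 5.0)      # toy price-like signal
x_toy = series[:-horizon].reshape(-1, 1)   # value at day t
y_toy = series[horizon:]                   # value at day t + horizon

y_hat = walk_forward_predict(KNeighborsRegressor(n_neighbors=3),
                             x_toy, y_toy, n_train=150, horizon=horizon)
print(y_hat.shape)  # → (43,)
```

Restricting each fit to rows whose targets are already known keeps future information out of the model, which is the same concern that motivates the chronological splits in this notebook.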

In [11]:
optimal_features = 160

for key in ochlva_data.transformed_data.keys():
    
    # Extract the features and targets
    # (the targets are the last len(days) columns)
    data = ochlva_data.transformed_data[key]
    x = data.iloc[:, :-len(days)]
    y = data.iloc[:, -len(days):]
        
    x_w_features = feature_generator(x, 'Adj. Close', optimal_features, 
                                     copy=True)   
    
    x_train, x_test, y_train, y_test = \
        train_test_split(x_w_features, y, shuffle=False, test_size=.2)
    
    # Recover the prediction horizon of each target column from its NaN
    # count: a column named 'x + 2 days' has NaN in its last two rows
    prediction_days = y_test.isnull().sum()
    
    # NOTE: We refit the model here with the same hyperparameters as we used
    #       in the predictions above.
    #       However, the training data differs at each step as we do a
    #       rolling prediction
    y_pred = calculate_rolling_prediction(reg,
                                          x_train,
                                          x_test,
                                          y_train,
                                          y_test, 
                                          prediction_days)
    
    # Plot results
    _ = plot_true_and_prediction(x_test, y_pred, 
                                 columns=['Adj. Close'], y_label='USD')
    plt.show()
    
    # Calculate the normalized root mean squared error
    nrmse = normalized_root_mean_square_error(y_test, y_pred)
    
    print((f'Normalized root mean squared error (averaged for the three '
           f'predictions): {nrmse}'))
    
    print('-'*80)
    print('\n'*5)

Normalized root mean squared error (averaged for the three predictions): 1.159163371408057
--------------------------------------------------------------------------------

Normalized root mean squared error (averaged for the three predictions): 0.2290206599114221
--------------------------------------------------------------------------------

Normalized root mean squared error (averaged for the three predictions): 0.0831406128300853
--------------------------------------------------------------------------------

Normalized root mean squared error (averaged for the three predictions): 0.11654022757518721
--------------------------------------------------------------------------------