Linear Projection Models

PCA
- The objective of PCA is to find linear combinations of the original predictors that summarize the maximal amount of variation in the original predictor space. From a statistical perspective, variation is synonymous with information. So, by finding combinations of the original predictors that capture variation, we find the subspace of the data that carries most of the information in the predictors.
- PCA is a particularly useful tool when the available data are composed of one or
more clusters of predictors that contain redundant information (e.g., predictors that
are highly correlated with one another).
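
As a rough illustration of the redundancy argument above, the sketch below builds a synthetic cluster of three nearly identical predictors plus one unrelated predictor, then computes PCA through the SVD of the standardized data. The data and variable names are made up for this example and are not part of the notebook's dataset.

```python
import numpy as np

rng = np.random.default_rng(0)

# Three predictors driven by one latent factor plus small noise, and one
# unrelated predictor: a cluster of redundant columns (synthetic data).
n = 500
latent = rng.normal(size=n)
X = np.column_stack([
    latent + 0.1 * rng.normal(size=n),
    latent + 0.1 * rng.normal(size=n),
    latent + 0.1 * rng.normal(size=n),
    rng.normal(size=n),
])

# Standardize, then take the SVD of the centered data matrix.
Z = (X - X.mean(axis=0)) / X.std(axis=0)
U, s, Vt = np.linalg.svd(Z, full_matrices=False)

scores = Z @ Vt.T                  # principal component scores
explained = s**2 / np.sum(s**2)    # share of total variance per component

print(np.round(explained, 3))
```

Because the three correlated columns carry almost the same signal, the first component should absorb most of their joint variance, and two components should be enough to describe nearly all of the four-column data.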

Kernel PCA
- Principal component analysis is an effective dimension reduction technique when
predictors are linearly correlated and when the resulting scores are associated with the
response. However, the orthogonal partitioning of the predictor space may not provide
a good predictive relationship with the response, especially if the true underlying
relationship between the predictors and the response is *non-linear*.
- The kernel PCA approach combines a specific mathematical view of PCA with kernel functions and the kernel ‘trick’, which lets PCA expand the dimension of the predictor space in which dimension reduction is performed.
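
The mechanics can be sketched in a few lines of NumPy: build a kernel matrix, center it (which corresponds to centering the data in the implicit expanded space), and take its leading eigenvectors. The function name `rbf_kernel_pca` and the `gamma` value are illustrative assumptions; this is not the scikit-learn implementation used later in the notebook, whose defaults may differ.

```python
import numpy as np

def rbf_kernel_pca(X, n_components=2, gamma=1.0):
    """Project X onto the leading components in RBF-kernel feature space."""
    # Pairwise squared Euclidean distances and the RBF kernel matrix.
    sq_norms = np.sum(X**2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    K = np.exp(-gamma * np.maximum(sq_dists, 0.0))

    # Center the kernel matrix, i.e. center the data in the implicit feature space.
    n = K.shape[0]
    one_n = np.full((n, n), 1.0 / n)
    Kc = K - one_n @ K - K @ one_n + one_n @ K @ one_n

    # eigh returns eigenvalues in ascending order, so reverse to get the largest first.
    eigvals, eigvecs = np.linalg.eigh(Kc)
    eigvals = eigvals[::-1][:n_components]
    eigvecs = eigvecs[:, ::-1][:, :n_components]

    # Scores of the training points: eigenvectors scaled by sqrt(eigenvalue).
    return eigvecs * np.sqrt(np.maximum(eigvals, 0.0))

# Toy data: two concentric rings, a structure no single linear direction describes.
rng = np.random.default_rng(0)
theta = rng.uniform(0, 2 * np.pi, size=200)
radius = np.repeat([1.0, 3.0], 100)
X = np.column_stack([radius * np.cos(theta), radius * np.sin(theta)])

scores = rbf_kernel_pca(X, n_components=2, gamma=0.5)
print(scores[:3])
```

The kernel matrix is n-by-n, so this direct approach scales with the number of rows rather than the number of predictors.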

ICA
- Independent component analysis (ICA) is similar to PCA in a number of ways. It creates new components that are linear combinations of the original variables, but does so in a way that makes the components as statistically independent of one another as possible. This lets ICA model a broader set of trends than PCA, which focuses on orthogonality and linear relationships. There are a number of ways for ICA to meet the constraint of statistical independence; often the goal is to maximize the “non-Gaussianity” of the resulting components.
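
Why non-Gaussianity is a sensible target can be seen with a small experiment: a linear mixture of independent non-Gaussian sources looks more Gaussian than the sources themselves, so un-mixing amounts to searching for directions that look less Gaussian. The sketch below uses excess kurtosis as the non-Gaussianity measure, which is one common choice among several; the sources are synthetic and the helper name `excess_kurtosis` is an assumption made for this example.

```python
import numpy as np

def excess_kurtosis(x):
    """Sample excess kurtosis: close to 0 for Gaussian data."""
    z = (x - x.mean()) / x.std()
    return np.mean(z**4) - 3.0

rng = np.random.default_rng(0)
n = 20000

# Two independent, clearly non-Gaussian sources (uniform distributions).
s1 = rng.uniform(-1, 1, size=n)
s2 = rng.uniform(-1, 1, size=n)

# An equal-weight linear mixture, standing in for an observed predictor.
mix = (s1 + s2) / np.sqrt(2)

k_source = excess_kurtosis(s1)
k_mix = excess_kurtosis(mix)
print(k_source, k_mix)
```

The mixture's excess kurtosis should sit closer to zero than the source's, which is the effect an ICA algorithm tries to reverse.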

NMF
- Non-negative matrix factorization (NMF) is another linear projection method, specific to features that are greater than or equal to zero. In this case, the algorithm finds the coefficients of the projection matrix (denoted A) such that their values are also non-negative, thus ensuring that the new features have the same property. The method for determining the coefficients is conceptually simple: find the set of coefficients that makes the scores as “close” as possible to the original data, subject to the constraint of non-negativity.
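
One classic way to carry out this constrained fit is the multiplicative update rule for the squared reconstruction error, sketched below in NumPy. It is shown only to make the idea concrete; it is not necessarily the solver scikit-learn's `NMF` uses, and the function name `nmf_multiplicative`, the iteration count, and the toy data are assumptions.

```python
import numpy as np

def nmf_multiplicative(X, n_components, n_iter=200, eps=1e-10, seed=0):
    """Factor non-negative X (n x p) as W @ H with W >= 0 and H >= 0."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    W = rng.uniform(0.1, 1.0, size=(n, n_components))
    H = rng.uniform(0.1, 1.0, size=(n_components, p))
    errors = []
    for _ in range(n_iter):
        # Each update multiplies by a ratio of non-negative quantities,
        # so entries that start non-negative stay non-negative.
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
        errors.append(np.linalg.norm(X - W @ H))
    return W, H, errors

# Toy non-negative data with rank 3 (synthetic).
rng = np.random.default_rng(1)
X = rng.uniform(0, 1, size=(60, 3)) @ rng.uniform(0, 1, size=(3, 10))

W, H, errors = nmf_multiplicative(X, n_components=3)
print(errors[0], errors[-1])
```

Here `W` plays the role of the new non-negative features (scores) and `H` holds the non-negative coefficients. This is also why the notebook rescales the indicators to [0, 1] before fitting NMF: the input itself must be non-negative.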

Autoencoders

- Autoencoders are computationally complex multivariate methods for finding representations of the predictor data and are commonly used in deep learning models.
- The idea is to create a nonlinear mapping between the original predictor data and a set of artificial features (usually of the same size).
These new features, which may not have any sensible interpretation, are then used as
the model predictors. While this does sound very similar to the previous projection
methods, autoencoders are very different in terms of how the new features are derived
and also in their potential benefit.
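
To make the encoder/decoder idea concrete, here is a deliberately small NumPy sketch: one `tanh` hidden layer as the encoder, a linear decoder, and plain gradient descent on the mean squared reconstruction error. The architecture, learning rate, iteration count, and toy data are all assumptions chosen for brevity; a real application would normally use a deep learning library and a deeper network.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 6 predictors that are nonlinear functions of 2 latent factors,
# rescaled to [0, 1] column by column (synthetic).
n, p, k = 400, 6, 2
t = rng.uniform(-1, 1, size=(n, 2))
X = np.column_stack([t[:, 0], t[:, 1], t[:, 0]**2, t[:, 1]**2,
                     t[:, 0] * t[:, 1], np.sin(2 * t[:, 0])])
X = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# Encoder maps p inputs to k codes; decoder maps k codes back to p outputs.
W1 = rng.normal(scale=0.5, size=(p, k))
b1 = np.zeros(k)
W2 = rng.normal(scale=0.5, size=(k, p))
b2 = np.zeros(p)

lr, losses = 0.2, []
for _ in range(3000):
    H = np.tanh(X @ W1 + b1)      # encoder output: the artificial features
    X_hat = H @ W2 + b2           # decoder output: reconstruction of the input
    err = X_hat - X
    losses.append(np.mean(err**2))

    # Backpropagate the mean squared reconstruction error.
    g_out = 2.0 * err / err.size
    g_W2, g_b2 = H.T @ g_out, g_out.sum(axis=0)
    g_hid = (g_out @ W2.T) * (1.0 - H**2)
    g_W1, g_b1 = X.T @ g_hid, g_hid.sum(axis=0)

    W1 -= lr * g_W1
    b1 -= lr * g_b1
    W2 -= lr * g_W2
    b2 -= lr * g_b2

codes = np.tanh(X @ W1 + b1)      # features a downstream model would consume
print(losses[0], losses[-1])
```

The features handed to a downstream model are the encoder output `codes`, not the reconstruction `X_hat`; the decoder exists only to give the encoder a training signal.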

In [1]:
# lib
import numpy as np
import pandas as pd
import datetime as dt
import matplotlib.pyplot as plt
import seaborn as sns;sns.set()
plt.style.use('tableau-colorblind10')

#ml
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

#feature
from sklearn import preprocessing
from sklearn.decomposition import PCA, KernelPCA, FastICA, NMF

# deep learning
import keras

# homemade
from mlutil import cross_validation, pkfold
from features import tautil
from labeling import trend_scanning

In [2]:
import warnings
warnings.filterwarnings(action='ignore')

In [3]:
import yfinance as yf
df_ = yf.download('^GSPC','2001-1-1','2021-1-1')
df = tautil.ohlcv(df_)

[*********************100%***********************]  1 of 1 completed


In [5]:
windows = [5,7,10,15,20,25,30,35]
TA = tautil.get_my_ta_windows(df_,windows).dropna()
TA.describe()

statistic,momentum_rsi_10,momentum_rsi_15,momentum_rsi_20,momentum_rsi_25,momentum_rsi_30,momentum_rsi_35,momentum_rsi_5,momentum_rsi_7,trend_dpo_10,trend_dpo_15,...,volatility_ui_5,volatility_ui_7,volume_cmf_10,volume_cmf_15,volume_cmf_20,volume_cmf_25,volume_cmf_30,volume_cmf_35,volume_cmf_5,volume_cmf_7
count,4998.0,4998.0,4998.0,4998.0,4998.0,4998.0,4998.0,4998.0,4998.0,4998.0,...,4998.0,4998.0,4998.0,4998.0,4998.0,4998.0,4998.0,4998.0,4998.0,4998.0
mean,54.268583,53.879164,53.611937,53.412959,53.256496,53.128631,54.909057,54.608281,-0.722007,-0.486498,...,1.23673,1.550163,0.104246,0.103092,0.102354,0.101766,0.101384,0.101148,0.105551,0.105075
std,13.465089,10.919005,9.414303,8.395605,7.651012,7.079507,19.292711,16.197241,19.956716,22.081914,...,1.401736,1.712926,0.214212,0.179747,0.158539,0.14394,0.132422,0.123009,0.295952,0.25261
min,9.707335,14.529425,18.507442,21.824085,24.593252,26.895908,3.049884,6.291691,-228.148096,-247.881413,...,0.0,0.0,-0.615034,-0.431248,-0.361312,-0.294267,-0.247534,-0.218745,-0.861791,-0.732036
25%,44.701994,46.125514,47.12647,47.677631,48.074392,48.423403,40.361439,43.002909,-9.39549,-9.668812,...,0.317262,0.441472,-0.046394,-0.02698,-0.009578,0.001656,0.007177,0.013075,-0.100073,-0.070921
50%,55.56196,54.916362,54.419702,54.124293,53.866632,53.703528,56.537944,56.16701,-0.590002,0.275297,...,0.787585,0.998475,0.101176,0.09334,0.089549,0.089379,0.091277,0.092654,0.108649,0.105702
75%,64.541209,62.10877,60.655221,59.633427,58.93973,58.313785,70.370005,67.247865,7.472977,8.991075,...,1.683842,2.06109,0.248664,0.225882,0.209134,0.195978,0.185748,0.179678,0.31624,0.280313
max,89.023607,86.19153,84.072869,82.359573,80.897979,79.610176,96.540206,91.580161,227.614111,240.415462,...,13.736955,15.720049,0.75326,0.726251,0.600581,0.577258,0.552162,0.487499,0.897133,0.877385


In [26]:
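def get_acc(close,feature,label,model,days=None,kfold=3):
    """Return the mean cross-validated accuracy of `model` ('rf' or 'svm') on `feature`."""
    if label == 'dir':
        # Direction label: sign of the price change `days` bars ahead.
        if days is None:
            raise ValueError("days must be set when label == 'dir'")
        mom = np.sign(close.shift(-days)-close)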
        y = mom.copy() # choose y
        raw_X = feature.copy() # choose X

        tmp_df = raw_X.join(y).dropna()
        raw_X=tmp_df.iloc[:,:-1]
        y=tmp_df.iloc[:,-1]
        t1 = pd.Series(y.index, index=y.index) # if y = mom
        
    elif label == 'trend_scanning':  
        trend_df_ = trend_scanning.trend_scanning_labels(close,step=10)
        trend_bin = trend_df_.bin.copy()
        y = trend_bin.copy()
        raw_X = feature.copy() # choose X

        tmp_df = raw_X.join(y).dropna()
        raw_X=tmp_df.iloc[:,:-1]
        y=tmp_df.iloc[:,-1]
        t1 = trend_df_.t1.loc[raw_X.index] # if y = trend scanning label
    else:
        raise ValueError(f"unknown label: {label!r}")

    # CV
    k=kfold
    cv = pkfold.PKFold(k,t1,0.01)

    # Scaling (note: the scaler is fit on all rows, including the rows of each test fold)
    scaler = preprocessing.MinMaxScaler((0,1))
    scaler.fit(raw_X)
    scaled_X = pd.DataFrame(scaler.transform(raw_X),index=raw_X.index,columns=raw_X.columns)
    X = scaled_X

    # Choose model

    rfc = RandomForestClassifier(n_estimators = 200, criterion='entropy',class_weight='balanced_subsample',
                                 bootstrap=True)
    svc = SVC(probability=True)
    if model == 'rf':
        clf = rfc
    elif model == 'svm':
        clf = svc
    else:
        raise ValueError(f"unknown model: {model!r}")
        
    accs=[]
    for train, test in cv.split(X, y):
        clf.fit(X.iloc[train],y.iloc[train])
        y_true = y.iloc[test]
        y_pred = clf.predict(X.iloc[test])
        accs.append(accuracy_score(y_true,y_pred))
        
    return np.mean(accs)
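
The `'dir'` branch of `get_acc` is easiest to follow on a handful of prices. The sketch below applies the same expression to a made-up five-day series; the prices and dates are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical closing prices on five business days.
close = pd.Series([100.0, 102.0, 101.0, 105.0, 105.0],
                  index=pd.date_range("2021-01-04", periods=5, freq="B"))
days = 1

# Same expression as the 'dir' branch: sign of the price change `days` ahead.
label = np.sign(close.shift(-days) - close)
print(label)
```

Two details follow from this construction: an unchanged price produces a label of 0, which becomes a third class alongside +1 and -1, and the final `days` rows have no future price, so they are NaN and are removed by the `dropna()` after the join.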

In [29]:
feature = TA.copy()
close = df.close
raw_acc = get_acc(close,feature,label='trend_scanning',model='rf')
print("Original features acc score: ", raw_acc)

Original features acc score:  0.5756197786928304


## Dimension Reduction (4 methods):

PCA, KPCA, ICA, NMF

In [30]:
scaler = preprocessing.StandardScaler()
TA_ = scaler.fit_transform(TA)

In [31]:
# Linear PCA
pca = PCA(n_components=5)
pca_TA = pca.fit_transform(TA_)
pca_TA = pd.DataFrame(pca_TA,index=TA.index)
pca_TA

Date,0,1,2,3,4
2001-02-21,7.339906,-3.038931,0.821730,0.113114,0.885610
2001-02-22,7.306959,-2.441791,1.544955,-0.184831,1.390790
2001-02-23,7.492939,-1.807003,2.063987,-0.098512,1.042909
2001-02-26,5.207762,-0.288038,2.382320,1.090023,0.356149
2001-02-27,5.730340,-0.130656,1.505092,0.923363,0.240657
...,...,...,...,...,...
2020-12-24,-3.170882,-0.798439,1.192625,0.303778,0.736435
2020-12-28,-3.858209,-0.521019,1.271167,0.982919,-0.556747
2020-12-29,-3.236371,-0.875883,0.434978,1.052576,-0.012676
2020-12-30,-3.038223,-1.118297,-0.468695,0.698860,-0.243117


In [32]:
# Kernel PCA
kpca = KernelPCA(n_components=5, kernel='rbf')
kpca_TA = kpca.fit_transform(TA_)
kpca_TA = pd.DataFrame(kpca_TA,index=TA.index)
kpca_TA

Date,0,1,2,3,4
2001-02-21,0.484943,0.429348,-0.166064,-0.186343,-0.001535
2001-02-22,0.471407,0.441902,-0.151300,-0.123609,0.041396
2001-02-23,0.459655,0.465964,-0.117514,-0.084133,0.075318
2001-02-26,0.485720,0.319089,-0.040172,0.075151,0.030384
2001-02-27,0.520284,0.360525,-0.031533,0.020893,0.016654
...,...,...,...,...,...
2020-12-24,-0.408602,-0.050862,-0.187227,0.090568,0.031010
2020-12-28,-0.477917,0.048604,-0.142916,-0.064455,0.065036
2020-12-29,-0.379746,-0.073422,-0.131874,-0.076213,0.201686
2020-12-30,-0.270122,-0.067610,-0.090717,-0.145404,0.320578


In [33]:
# ICA
ica = FastICA(n_components=5)
ica_TA = ica.fit_transform(TA_)
ica_TA = pd.DataFrame(ica_TA,index=TA.index)
ica_TA

Date,0,1,2,3,4
2001-02-21,-0.001784,0.016206,0.017644,0.005408,-0.026341
2001-02-22,-0.008451,0.016872,0.024138,0.003993,-0.022446
2001-02-23,-0.011243,0.015462,0.023276,0.010165,-0.019324
2001-02-26,-0.014721,0.017192,0.009148,0.016833,-0.007571
2001-02-27,-0.013516,0.012083,0.005603,0.014095,-0.012103
...,...,...,...,...,...
2020-12-24,0.001491,0.012435,0.006602,-0.004824,0.011819
2020-12-28,0.006722,0.010660,-0.004426,0.007380,0.015702
2020-12-29,0.006268,0.012074,-0.005339,-0.000769,0.009319
2020-12-30,0.010962,0.006335,-0.007630,-0.003359,0.004924


In [34]:
scaler2 = preprocessing.MinMaxScaler()
TA__ = scaler2.fit_transform(TA)

In [35]:
# NMF

nmf = NMF(n_components=5)
nmf_TA = nmf.fit_transform(TA__)
nmf_TA = pd.DataFrame(nmf_TA,index=TA.index)
nmf_TA

Date,0,1,2,3,4
2001-02-21,0.013432,0.232522,0.037775,0.054796,0.016183
2001-02-22,0.004713,0.237377,0.046152,0.069663,0.020829
2001-02-23,0.000000,0.236929,0.051503,0.067426,0.042660
2001-02-26,0.041623,0.185236,0.060699,0.063332,0.074887
2001-02-27,0.032299,0.189266,0.063896,0.060359,0.066325
...,...,...,...,...,...
2020-12-24,0.151706,0.065984,0.000000,0.111516,0.067858
2020-12-28,0.180832,0.039512,0.002217,0.099377,0.073133
2020-12-29,0.173378,0.049548,0.001824,0.100865,0.043674
2020-12-30,0.184544,0.040938,0.003762,0.094123,0.020047


In [36]:
feature = pca_TA.copy()
close = df.close
pca_acc = get_acc(close,feature,label='trend_scanning',model='rf')
print("PCA features acc score: ", pca_acc)

PCA features acc score:  0.5669837880757509


In [37]:
feature = kpca_TA.copy()
close = df.close
kpca_acc = get_acc(close,feature,label='trend_scanning',model='rf')
print("Kernel PCA features acc score: ", kpca_acc)

Kernel PCA features acc score:  0.5537256924019163


In [38]:
feature = ica_TA.copy()
close = df.close
ica_acc = get_acc(close,feature,label='trend_scanning',model='rf')
print("ICA features acc score: ", ica_acc)

ICA features acc score:  0.5597543882582773


In [39]:
feature = nmf_TA.copy()
close = df.close
nmf_acc = get_acc(close,feature,label='trend_scanning',model='rf')
print("NMF features acc score: ", nmf_acc)

NMF features acc score:  0.5509134791123506
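
One caveat about the four scores above: each projection was fit on the full sample before `get_acc` ran its cross-validation, so the held-out folds contributed to the projection. A possible way to avoid this is to put the scaler and the projection inside a scikit-learn pipeline so that they are refit on the training rows of every fold. The sketch below uses synthetic data, and `TimeSeriesSplit` stands in for the notebook's homemade `PKFold`, which is not shown here; both substitutions are assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import TimeSeriesSplit, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for the indicator matrix and binary labels.
X = rng.normal(size=(300, 12))
y = (X[:, 0] + 0.5 * rng.normal(size=300) > 0).astype(int)

# Scaler and PCA are refit on the training rows of every fold, so the
# held-out rows never influence the projection.
pipe = make_pipeline(
    StandardScaler(),
    PCA(n_components=5),
    RandomForestClassifier(n_estimators=50, random_state=0),
)

scores = cross_val_score(pipe, X, y, cv=TimeSeriesSplit(n_splits=3),
                         scoring="accuracy")
print(scores.mean())
```

Swapping `PCA` for `KernelPCA`, `FastICA`, or `NMF` (with a `MinMaxScaler` in front of it) follows the same pattern.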


## Autoencoder feature engineering

In [20]:
x = TA__.copy()

epochs=200
dimension=5
keras.backend.clear_session()

nl_encoder = keras.models.Sequential([
    keras.layers.Dense(20, input_shape=[x.shape[1]], activation='relu'),
    keras.layers.Dense(15, activation='selu'),
    keras.layers.Dense(dimension, activation='selu'),
])

nl_decoder = keras.models.Sequential([
    keras.layers.Dense(15, input_shape=[dimension], activation='selu'),
    keras.layers.Dense(20, activation='selu'),
    keras.layers.Dense(x.shape[1], activation='relu'),
])

nl_autoencoder = keras.models.Sequential([nl_encoder, nl_decoder])
# Note: recent Keras releases name this argument `learning_rate`; `lr` is the legacy spelling.
nl_autoencoder.compile(loss='mse', optimizer = keras.optimizers.SGD(lr=0.1, decay=1e-4))
nl_autoencoder.summary()


history = nl_autoencoder.fit(x,x, epochs=epochs, callbacks=[keras.callbacks.EarlyStopping(monitor='loss',patience=10)],
                             verbose=0)

Model: "sequential_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
sequential (Sequential)      (None, 5)                 1055      
_________________________________________________________________
sequential_1 (Sequential)    (None, 32)                1082      
Total params: 2,137
Trainable params: 2,137
Non-trainable params: 0
_________________________________________________________________


In [21]:
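# Note: predict() on the full autoencoder returns the 32-column reconstruction, not the
# 5-dimensional bottleneck; nl_encoder.predict(TA__) would return the compressed codes.
ae_TA = pd.DataFrame(nl_autoencoder.predict(TA__),index=TA.index)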
ae_TA

Date,0,1,2,3,4,5,6,7,8,9,...,22,23,24,25,26,27,28,29,30,31
2001-02-21,0.167837,0.234479,0.236307,0.255806,0.259399,0.259385,0.121653,0.115743,0.522225,0.524490,...,0.0,0.247688,0.262685,0.219829,0.274062,0.243399,0.265732,0.284837,0.0,0.258916
2001-02-22,0.173287,0.237534,0.240759,0.255665,0.258198,0.258522,0.115795,0.111217,0.530908,0.529608,...,0.0,0.252985,0.301399,0.270267,0.326591,0.294294,0.317398,0.332871,0.0,0.289167
2001-02-23,0.162942,0.235071,0.230060,0.232196,0.244025,0.235650,0.105933,0.120836,0.532726,0.530006,...,0.0,0.274738,0.330516,0.294730,0.337090,0.317103,0.323985,0.343327,0.0,0.313143
2001-02-26,0.316021,0.336216,0.325784,0.319696,0.335129,0.319168,0.260092,0.293713,0.525162,0.533852,...,0.0,0.223430,0.420237,0.384565,0.407552,0.386700,0.379587,0.398414,0.0,0.402552
2001-02-27,0.274228,0.305695,0.297674,0.294008,0.307887,0.293289,0.224816,0.247881,0.521135,0.529902,...,0.0,0.233027,0.369531,0.329020,0.353581,0.332094,0.331279,0.351216,0.0,0.360130
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2020-12-24,0.657694,0.636329,0.633058,0.638878,0.625903,0.611578,0.582677,0.614844,0.524002,0.513263,...,0.0,0.021561,0.620321,0.586509,0.639581,0.629060,0.604828,0.629359,0.0,0.558749
2020-12-28,0.750062,0.704750,0.685361,0.690225,0.678846,0.655648,0.718555,0.717396,0.508160,0.500471,...,0.0,0.000527,0.622329,0.577747,0.622182,0.588571,0.573047,0.599449,0.0,0.579966
2020-12-29,0.698211,0.658246,0.653754,0.664396,0.654173,0.635003,0.641928,0.651577,0.522230,0.506243,...,0.0,0.018821,0.579630,0.539341,0.590437,0.566742,0.549436,0.578759,0.0,0.527502
2020-12-30,0.724813,0.675181,0.671187,0.690134,0.683988,0.660737,0.674411,0.675102,0.521774,0.499596,...,0.0,0.012906,0.546697,0.503573,0.556053,0.526942,0.514976,0.553328,0.0,0.492978


In [40]:
feature = ae_TA.copy()
close = df.close
ae_acc = get_acc(close,feature,label='trend_scanning',model='rf')
print("Autoencoded features acc score: ", ae_acc)

Autoencoded features acc score:  0.5454923975588915


In [22]:
# Linear PCA
pca = PCA(n_components=5)
pca_aeTA = pca.fit_transform(ae_TA)
pca_aeTA = pd.DataFrame(pca_aeTA,index=TA.index)
pca_aeTA

Date,0,1,2,3,4
2001-02-21,1.081505,-0.334399,0.157224,-0.029990,0.034080
2001-02-22,1.016906,-0.301312,0.260825,-0.028625,0.029387
2001-02-23,1.048688,-0.201721,0.322112,-0.027939,0.040352
2001-02-26,0.674492,-0.063905,0.260644,-0.009409,0.019760
2001-02-27,0.819538,-0.119794,0.208123,-0.021514,0.029163
...,...,...,...,...,...
2020-12-24,-0.482970,-0.150158,0.156204,0.070384,-0.013339
2020-12-28,-0.619224,-0.067932,-0.022198,0.093590,-0.007559
2020-12-29,-0.477713,-0.150526,-0.002402,0.110082,-0.015670
2020-12-30,-0.478693,-0.162502,-0.110039,0.152415,-0.016579


In [41]:
feature = pca_aeTA.copy()
close = df.close
aepca_acc = get_acc(close,feature,label='trend_scanning',model='rf')
print("Autoencoder + PCA features acc score: ", aepca_acc)

Autoencoder + PCA features acc score:  0.5406696103279908


## RESULTS

Rank by Accuracy score

- Data: S&P 500 (^GSPC), 2001-01-01 to 2021-01-01 (about 20 years)
- Feature: TA's
- Label: Trend-Scanning Label

In [43]:
accuracys = pd.Series({'Original':raw_acc, 'PCA':pca_acc,'KernelPCA':kpca_acc, 'ICA':ica_acc,'NMF':nmf_acc,'AE':ae_acc,'AE-PCA':aepca_acc})

In [48]:
rank = accuracys.sort_values(ascending=False)
rank

Original     0.575620
PCA          0.566984
ICA          0.559754
KernelPCA    0.553726
NMF          0.550913
AE           0.545492
AE-PCA       0.540670
dtype: float64
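
To read the ranking as a comparison against the unreduced features, the gaps can be computed directly from the scores listed above. The snippet is a small convenience sketch in plain Python; the numbers are copied from the output, and nothing new is estimated.

```python
# Accuracy scores copied from the ranking above.
scores = {
    "Original": 0.575620, "PCA": 0.566984, "ICA": 0.559754,
    "KernelPCA": 0.553726, "NMF": 0.550913, "AE": 0.545492, "AE-PCA": 0.540670,
}

baseline = scores["Original"]
# Gap to the unreduced feature set, in percentage points.
gaps = {name: round((acc - baseline) * 100, 2) for name, acc in scores.items()}
best = max(scores, key=scores.get)

for name, gap in gaps.items():
    print(f"{name:<10} {gap:+.2f} pp")
```

Every reduced feature set scores below the original indicators, by roughly 0.9 to 3.5 percentage points. Since these come from a single run with no fixed random seed for the random forest, gaps this small may partly reflect run-to-run noise.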