# Sharpe's Style Analysis with ML Tools

For this assignment I will try to replicate Sharpe's Style Analysis of an investment fund using data from various MSCI indexes and EONIA.
In this first code snippet I simply filter the daily NAVs and index levels down to a set of specific dates, namely the weekly dates used for the return calculation.
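
As a small illustration of the filtering step, the sketch below keeps only the weekly dates from a daily series and then computes simple returns between them. The dates, the `Index A` column and its values are made up for the example; only the `isin` / `pct_change` pattern mirrors the notebook.

```python
import pandas as pd

# Toy daily index levels; dates and values are invented for illustration.
daily = pd.DataFrame({
    "Dates": ["2020-01-03", "2020-01-06", "2020-01-10", "2020-01-13", "2020-01-17"],
    "Index A": [100.0, 101.0, 102.0, 100.0, 105.06],
})
weekly_dates = ["2020-01-03", "2020-01-10", "2020-01-17"]  # Fridays only

# Keep only the weekly observations, then compute simple returns between them.
weekly = daily[daily["Dates"].isin(weekly_dates)].reset_index(drop=True)
weekly_returns = weekly.drop(columns="Dates").pct_change().iloc[1:]
print(weekly_returns)
```

The first row of `pct_change()` is always NaN, which is why it is dropped with `.iloc[1:]`.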

Everything is described by comments in the actual code snippet.

NB: I WOULD NOT RE-RUN THE SCRIPT AS SOME OF THE GRID SEARCHES ARE NOT THAT FAST

In [1]:
# Get fund and indexes return
import numpy as np
import pandas as pd

# Some models will fail to converge, so I suppress the warnings
# for the sake of a clear output
import warnings
warnings.filterwarnings('ignore')

INDEXES_DATA = '/Users/marco/Desktop/Tesi MSc/AppData/Indexes_quotations.csv'
FUNDS_NAME_DATA = '/Users/marco/Desktop/Tesi MSc/AppData/Funds_name.csv'
FUNDS_NAV_DATA = '/Users/marco/Desktop/Tesi MSc/AppData/Funds_navc.csv'
WEEKLY_DATES_DATA = '/Users/marco/Desktop/Tesi MSc/Weekly_dates.csv' 


# load all data
#
funds_name_df = pd.read_csv(FUNDS_NAME_DATA, header=0, engine='python')
indexes_level = pd.read_csv(INDEXES_DATA, header=0, engine='python')
funds_nav = pd.read_csv(FUNDS_NAV_DATA, header=0, engine='python').iloc[:, 1:]
all_dates_df = pd.read_csv(WEEKLY_DATES_DATA, header=0, engine='python').iloc[:, 0:1]

correct_dates_for_nav = indexes_level.iloc[:, 0:1]
funds_nav = pd.concat([correct_dates_for_nav, funds_nav], axis=1)
# put dates into list for filtering
#
all_dates = list(all_dates_df['Dates'])

# filter indexes for weekly return calculation (Friday to Friday)
indexes_weekly = indexes_level[indexes_level['Dates'].isin(all_dates)].reset_index(drop=True)
return_indexes_weekly = indexes_weekly.iloc[:, 1:].pct_change().iloc[1:, :]

# divide into strategic and geographic; I carry out the analysis only on the strategic indexes, as I think the
# geographic split is not really meaningful and the approach would be exactly the same; EONIA is also needed
eonia = return_indexes_weekly.iloc[:, 0:1]
strategic_return = return_indexes_weekly.iloc[:, 10:21]
strategic_return = pd.concat([eonia, strategic_return], axis=1)

# filter the NAVs and compute returns; the analysis is performed only on the first fund in
# funds_name_df, a BlackRock fund, but it can be extended to other funds as well, so I calculate
# all returns and use only that fund's column
funds_nav = funds_nav[funds_nav['Dates'].isin(all_dates)].reset_index(drop=True)
return_funds = funds_nav.iloc[:, 1:].pct_change().iloc[1:, :].reset_index(drop=True)

In [2]:
return_funds

Unnamed: 0,AT0000677927,AT0000712575,AT0000785266,AT0000A09ZL0,BE0058652646,BE0946564383,BE0946893766,BE0947574787,BE6228801435,DE0008474024,...,NL0009690221,NL0010408704,IE00B5340Q90,LU0998532633,LU1299311834,LU1242470554,FR0011108331,LU0342679015,LU0973154932,GB00BYRJNL93
0,-0.017733,-0.033651,-0.020138,-0.031269,-0.042595,-0.049802,-0.047938,-0.066993,-0.042887,-0.014313,...,-0.039902,-0.043665,-0.011011,-0.047178,-0.025419,-0.012731,-0.030272,-0.012842,0.000000,-0.026288
1,0.005818,0.014945,0.001572,-0.005200,0.017959,0.010615,0.010148,0.029500,0.012343,0.006682,...,0.007970,0.008669,0.006073,0.011067,0.012465,-0.021291,0.016170,0.011878,0.000000,-0.002160
2,-0.015737,-0.026947,-0.011774,-0.014216,0.011494,0.014442,0.017002,0.019361,0.010285,-0.005103,...,0.013273,0.012033,0.004024,0.015192,0.010759,0.021040,0.007680,-0.009503,0.000000,0.021645
3,0.027050,0.022473,0.023790,0.019982,-0.004030,0.001202,0.001520,-0.006838,0.003432,0.016730,...,0.002230,0.003114,0.000000,-0.068873,0.005834,0.002201,0.000793,0.015237,-0.029940,0.001059
4,-0.061175,-0.034485,-0.060069,-0.057377,-0.072570,-0.066020,-0.065250,-0.078432,-0.062585,-0.054421,...,-0.065907,-0.067871,-0.019038,-0.001298,-0.032055,-0.028346,-0.028390,-0.046693,0.000000,-0.049735
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
255,-0.029794,-0.029514,-0.023757,-0.043437,-0.036835,-0.034205,-0.038897,-0.018516,-0.041691,-0.031411,...,-0.041798,-0.039263,-0.010537,-0.040107,-0.031937,-0.038810,-0.027875,-0.044437,-0.000175,-0.042355
256,0.039808,0.037769,0.048546,0.040602,0.066214,0.068536,0.051344,0.060983,0.042123,0.052290,...,0.057824,0.059684,0.011715,0.065536,0.080854,0.055168,0.038364,0.060971,0.000000,0.047465
257,0.022855,0.024097,-0.002191,0.031337,0.009160,-0.003046,0.027866,0.028774,0.061736,0.016581,...,0.046991,0.037774,0.004211,0.023353,0.014859,0.076613,0.010362,0.020455,0.000000,0.050463
258,0.012776,0.000865,0.006764,0.026931,-0.010383,-0.001528,0.006987,0.012782,0.004811,0.004680,...,0.009847,0.013405,0.000000,0.001769,0.002524,0.032924,0.001020,-0.003818,0.000000,0.027451


In [3]:
strategic_return

Unnamed: 0,EONIA,MSCI World/Consumer Discrionary,MSCI World/Consumer Staples,MSCI World/Energy,MSCI World/Financials,MSCI World/Health Care,MSCI World/Industrials,MSCI World/Information Tech,MSCI World/Materials,MSCI World/Telecom svc,MSCI World/Utilities
1,-0.000036,-0.036289,-0.024476,-0.069547,-0.048661,-0.022309,-0.031703,-0.041175,-0.041774,-0.027609,-0.018963
2,-0.000053,0.006481,0.013142,-0.001813,0.014226,0.020149,0.000655,-0.003856,-0.008806,0.016380,0.023459
3,-0.000053,0.003628,0.014645,0.039213,0.013183,0.012922,0.012564,0.011945,0.030722,0.013088,0.009999
4,-0.000036,0.003705,-0.000262,-0.019262,-0.000756,0.001938,0.003231,-0.000659,-0.007558,0.002218,0.004284
5,-0.000036,-0.062126,-0.035926,-0.073853,-0.070861,-0.051211,-0.058149,-0.066517,-0.074014,-0.024531,-0.015161
...,...,...,...,...,...,...,...,...,...,...,...
256,-0.000091,-0.045958,-0.035880,-0.046455,0.000000,-0.041954,-0.048394,-0.055245,-0.038335,-0.024790,-0.032669
257,-0.000091,0.066188,0.038602,0.016144,0.000000,0.071700,0.067712,0.082704,0.062207,0.056239,0.031073
258,-0.000091,0.006421,0.037380,0.155923,0.000000,0.017144,0.048730,-0.001249,0.014422,0.013702,0.036516
259,-0.000091,0.015188,-0.013629,0.050417,0.000000,-0.025324,0.012304,-0.003947,0.007750,-0.005813,-0.022369


In [4]:
funds_name_df.head()

Unnamed: 0,sCodeISIN,Nom
0,LU0238689110,BlackRock Global Dynamic Equity Fund A2
1,LU0265550359,BlackRock Systmatc Glb Eq Hgh Inc A2 USD
2,LU0217139020,Pictet Premium Brands P EUR Acc
3,LU0101692670,Pictet Digital P USD
4,LU0097427784,JSS Sust Equity Global P EUR Dist


Now that the data are ready, my intent is to replicate Sharpe's Style Analysis with the machine learning tools we have learned throughout the course.
The idea is to regress the fund's returns on the various indexes so as to replicate the fund as closely as possible. As the performance metric we will use the classic R^2, since the problem is a regression.

Nevertheless, this R^2 can be read in terms of the "percentage activity" of the fund: the lower the R^2, the higher the fund's activity, and thus the harder it is to replicate its performance using a dynamic combination of passive indexes (EONIA + MSCI).
Conversely, a low R^2 implies that the fund is actively trading in the market and thus deserves a higher fee, while a high R^2 means that most of its return could have been obtained passively.

This is a very simplified and reductive approach to Sharpe's Style Analysis, but explaining the concept in full is beyond the scope of this project. The point of this short introduction is that a poor R^2 from our model is not necessarily a problem with the model; it may actually be a point in favour of the fund manager.
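
For reference, the classic form of the style regression constrains the index weights to be non-negative and to sum to one, and chooses them to minimise the variance of the tracking error. The sketch below is a minimal illustration on synthetic data, not the method used in this notebook: the helper `style_weights`, the three made-up indexes and the "true" weights are all assumptions, and the optimizer is SciPy's SLSQP.

```python
import numpy as np
from scipy.optimize import minimize

def style_weights(fund_returns, index_returns):
    """Long-only, fully invested weights minimising tracking-error variance."""
    n_indexes = index_returns.shape[1]
    fund_variance = np.var(fund_returns)

    def relative_tracking_variance(w):
        # Scaled by the fund variance so the objective is of order one.
        return np.var(fund_returns - index_returns @ w) / fund_variance

    result = minimize(
        relative_tracking_variance,
        x0=np.full(n_indexes, 1.0 / n_indexes),
        method="SLSQP",
        bounds=[(0.0, 1.0)] * n_indexes,
        constraints=[{"type": "eq", "fun": lambda w: np.sum(w) - 1.0}],
        options={"ftol": 1e-12},
    )
    return result.x

# Synthetic data: a "fund" that is a fixed mix of three indexes plus noise.
rng = np.random.default_rng(0)
index_returns = rng.normal(0.0, 0.02, size=(260, 3))
true_weights = np.array([0.6, 0.3, 0.1])
fund_returns = index_returns @ true_weights + rng.normal(0.0, 0.002, size=260)

weights = style_weights(fund_returns, index_returns)
tracking_error = fund_returns - index_returns @ weights
style_r2 = 1.0 - np.var(tracking_error) / np.var(fund_returns)
print(weights.round(3), round(style_r2, 3))
```

Under this reading, `1 - style_r2` is the share of return variance that the passive mix does not explain, which is the "activity" discussed above.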

For this analysis we will use the first fund in funds_name_df, BlackRock Global Dynamic Equity Fund A2, whose ISIN code is LU0238689110; we will use this code to extract the returns of the selected fund from the dataframe that contains all fund returns. I have already calculated the R^2 with the traditional style analysis in a separate Excel file for other purposes: for the last year of available data it should be 86.46%.
This will be our benchmark for the results, and I will split the data so that the test sample is the last year, i.e. 52 weekly observations. The return tables above have 259 rows, so the training set has only 207 observations, which is small for training but still a usable size.
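
A minimal sketch of a chronological split, assuming the rows are already in time order: the last 52 rows become the test set and everything before them the training set. The frame `X_demo` and series `y_demo` are placeholders with the same number of rows as the return tables above, not the real data.

```python
import numpy as np
import pandas as pd

n_obs, n_test = 259, 52  # 259 weekly returns, last 52 held out

# Placeholder data in time order (row i is week i).
X_demo = pd.DataFrame(np.arange(n_obs * 2).reshape(n_obs, 2), columns=["idx_a", "idx_b"])
y_demo = pd.Series(np.arange(n_obs), name="fund")

# Positional slicing keeps the chronology: no row of the test set precedes the training set.
X_train_demo, X_test_demo = X_demo.iloc[:-n_test], X_demo.iloc[-n_test:]
y_train_demo, y_test_demo = y_demo.iloc[:-n_test], y_demo.iloc[-n_test:]
print(len(X_train_demo), len(X_test_demo))
```

The same split can be obtained with `train_test_split(X, y, test_size=52, shuffle=False)`; with the default `shuffle=True` the 52 test rows are drawn at random instead.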

# Simple Regression

The first step of our assignment is to estimate the results with a simple regression, to get an idea of how a standard model performs in terms of R^2 on this dataset.

In [5]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge

In [6]:
lin_reg = LinearRegression()
# Run on the full sample
X = strategic_return
y = return_funds['LU0238689110']
lin_reg.fit(X,y)
print("Full sample R-squared:")
print(lin_reg.score(X,y))

Full sample R-squared:
0.8478228685655933


In [7]:
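# NOTE: train_test_split shuffles the rows by default, so this test set is a random
# sample of 52 weeks rather than the last year; pass shuffle=False for a chronological split
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=52)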
lin_reg.fit(X_train,y_train)
print("Train/test R-squared:")
print(lin_reg.score(X_train,y_train))
print(lin_reg.score(X_test,y_test))
print("Intercept/coefficient:")
print([lin_reg.intercept_,lin_reg.coef_])

Train/test R-squared:
0.8801336985713125
0.49022502272283797
Intercept/coefficient:
[-0.0035614636926175314, array([-4.40670781e+01,  1.85945062e-01,  1.65735689e-01,  3.04002447e-02,
        9.80686955e-02,  1.32073924e-01,  9.09889392e-02,  8.35128641e-03,
        2.03056577e-01,  1.29819158e-01, -5.14252504e-02])]


# Ridge Regression

In this part we train and test a Ridge model. Here I prefer L2 regularization because my intent is not to shrink parameters to exactly zero, but simply to decrease their weights when they are not really relevant for the selected period. This is the main reason I decided to carry out only a Ridge regression and not a Lasso as well.
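
To make the L2 versus L1 distinction concrete, the sketch below fits a Ridge and a Lasso on synthetic data in which only two of five features matter. The data, the alphas and the variable names are assumptions chosen for illustration; they are unrelated to the fund data.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Synthetic data: only the first two features drive the target.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
y_demo = 1.0 * X_demo[:, 0] + 0.5 * X_demo[:, 1] + rng.normal(scale=0.1, size=200)

ridge_demo = Ridge(alpha=10.0).fit(X_demo, y_demo)
lasso_demo = Lasso(alpha=0.1).fit(X_demo, y_demo)

# Ridge shrinks every coefficient but leaves them non-zero;
# Lasso sets the irrelevant ones to exactly zero.
print("ridge:", ridge_demo.coef_.round(3))
print("lasso:", lasso_demo.coef_.round(3))
n_zero_ridge = int(np.sum(ridge_demo.coef_ == 0))
n_zero_lasso = int(np.sum(lasso_demo.coef_ == 0))
print("exact zeros (ridge, lasso):", n_zero_ridge, n_zero_lasso)
```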

For a machine learning project it is important to fine-tune the selected hyperparameters. Consequently, I will perform a grid search on all the models used, in order to select the best set of hyperparameters and then train the best model. This methodology will be adopted for all other models.

It is also important to point out that we cannot perform standard cross-validation, as we would lose the structure of the time series data; instead we use the TimeSeriesSplit available in scikit-learn, which is recommended for grid searches in these situations.
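
The sketch below shows what `TimeSeriesSplit` does on a tiny ordered array: every validation fold lies strictly after the rows used to train on it. The array `X_demo` and the choice of three splits are illustrative assumptions (the notebook uses the default number of splits).

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X_demo = np.arange(24).reshape(12, 2)  # 12 ordered observations, 2 features
splitter = TimeSeriesSplit(n_splits=3)

# Each fold trains on an expanding window and validates on the block right after it.
folds = list(splitter.split(X_demo))
for train_idx, val_idx in folds:
    print("train:", train_idx, "validate:", val_idx)
```

Note that the splitter works on row positions, so it only respects chronology if the rows passed to `fit` are themselves in time order.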

In [8]:
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import TimeSeriesSplit
tscv = TimeSeriesSplit()

param_grid={'alpha':np.arange(start=0.0005, stop=0.05,step=0.001)}
ridge = Ridge()
grid_search=GridSearchCV(ridge,param_grid,scoring='r2', cv=tscv, return_train_score=True)
grid_search.fit(X_train,y_train)
results = pd.DataFrame(grid_search.cv_results_)
print(results[['rank_test_score','mean_test_score','param_alpha']])
print("\n")
print("-"*100)
print("\n")
print('Best result: ')
print(grid_search.best_params_)

    rank_test_score  mean_test_score param_alpha
0                50         0.819138      0.0005
1                35         0.834146      0.0015
2                27         0.840131      0.0025
3                21         0.843516      0.0035
4                16         0.845687      0.0045
5                13         0.847148      0.0055
6                10         0.848139      0.0065
7                 7         0.848796      0.0075
8                 5         0.849202      0.0085
9                 2         0.849414      0.0095
10                1         0.849472      0.0105
11                3         0.849405      0.0115
12                4         0.849235      0.0125
13                6         0.848980      0.0135
14                8         0.848652      0.0145
15                9         0.848264      0.0155
16               11         0.847823      0.0165
17               12         0.847337      0.0175
18               14         0.846811      0.0185
19               15 

In [9]:
# train a Ridge with alpha = 0.0055 (the grid-search table above ranks alpha = 0.0105 first)
best_ridge = Ridge(alpha=0.0055)
best_ridge.fit(X_train,y_train)
print("Train/test R-squared:")
print(best_ridge.score(X_train,y_train))
print(best_ridge.score(X_test,y_test))

Train/test R-squared:
0.8791605618284779
0.45872747429281346


From the results of the grid search and of the fine-tuned Ridge (alpha=0.0055) we can see that the model does not perform as well as expected in the test sample: the test R^2 is about 0.46, well below the 86% stated at the beginning, although the training R^2 (about 0.88) and the cross-validated scores (about 0.85) are close to it.
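
Rather than copying the selected hyperparameter by hand, one option is to reuse the model that the grid search has already refit. The sketch below does this on synthetic data; `X_demo`, `y_demo`, the alpha grid and the chronological split are assumptions for illustration only.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit

# Synthetic returns: a "fund" that is a fixed mix of four indexes plus noise.
rng = np.random.default_rng(1)
X_demo = rng.normal(scale=0.02, size=(259, 4))
y_demo = X_demo @ np.array([0.4, 0.3, 0.2, 0.1]) + rng.normal(scale=0.005, size=259)
X_tr, X_te, y_tr, y_te = X_demo[:-52], X_demo[-52:], y_demo[:-52], y_demo[-52:]

search = GridSearchCV(
    Ridge(),
    {"alpha": [1e-4, 1e-3, 1e-2, 1e-1]},
    scoring="r2",
    cv=TimeSeriesSplit(),
)
search.fit(X_tr, y_tr)

# With the default refit=True, best_estimator_ is already trained on all of X_tr
# using the best parameters, so nothing needs to be typed in manually.
best_model = search.best_estimator_
test_r2 = best_model.score(X_te, y_te)
print(search.best_params_, test_r2)
```

This avoids a mismatch between the alpha printed by the search and the alpha used in the final fit.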

# K-Nearest Neighbours (KNN)

The next model tested is KNN, which is very different from our initial objective. Here we have no functional form at all, whereas the relationship we are after should be linear. As KNN has only one parameter to search over here, we will concentrate on the number of neighbours to maximize our fit.

In [10]:
# k-neighbours regression (numpy is already imported as np above)
from sklearn.neighbors import KNeighborsRegressor
pg_knnreg = {'n_neighbors' : np.arange(start=2, stop=20,step=1)}
knnreg = KNeighborsRegressor()
grids_knnreg = GridSearchCV(knnreg,pg_knnreg,scoring='r2', cv=tscv, return_train_score=True)
grids_knnreg.fit(X_train,y_train)
results_knnreg = pd.DataFrame(grids_knnreg.cv_results_)
print(results_knnreg[['rank_test_score','mean_test_score','param_n_neighbors']])
print("\n")
print("-"*100)
print("\n")
print('Best result: ')
print(grids_knnreg.best_params_)

    rank_test_score  mean_test_score param_n_neighbors
0                 6         0.704257                 2
1                 4         0.709061                 3
2                 1         0.727624                 4
3                 2         0.723432                 5
4                 3         0.710300                 6
5                 5         0.708336                 7
6                 7         0.694908                 8
7                 8         0.682052                 9
8                 9         0.676380                10
9                10         0.668074                11
10               11         0.662954                12
11               12         0.648442                13
12               13         0.638571                14
13               14         0.628275                15
14               15         0.615135                16
15               16         0.605471                17
16               17         0.595254                18
17        

In [11]:
knn_bestmodel = KNeighborsRegressor(n_neighbors=3)
knn_bestmodel.fit(X_train,y_train)
print("Train/test R-squared:")
print(knn_bestmodel.score(X_train,y_train))
print(knn_bestmodel.score(X_test,y_test))

Train/test R-squared:
0.9028203836429722
0.6756473596523109


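The training fit is in line with the Ridge regression even though the underlying model is completely different. In the run shown here the test R^2 (about 0.68) is higher than that of the basic linear model (about 0.49), while the cross-validated score (about 0.73) is lower than the Ridge's (about 0.85), so the evidence in favour of KNN is mixed.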

# Linear Support Vector Machines 

Here we come back to a model with a functional form, which reasonably seems a better fit for what we are looking for. Furthermore, this is a linear model, which in theory should be very close to the style regression itself.
As SVMs have more hyperparameters to fine-tune, here we set up the dictionary with epsilon and max_iter as the parameters to grid search over.

As the output shows, when training so many models the SVM might not converge (I kept max_iter at 10000 to avoid making the script too slow to run); in that case the output reports a warning at the beginning, which we will simply ignore.
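
To show what epsilon controls, the sketch below computes the epsilon-insensitive loss, which is the default loss of `LinearSVR`: residuals smaller than epsilon cost nothing, larger ones cost their excess over epsilon. The function name and the three sample returns are assumptions for illustration.

```python
import numpy as np

def epsilon_insensitive_loss(y_true, y_pred, epsilon):
    """Per-observation loss: zero inside the epsilon tube, linear outside it."""
    residual = np.abs(np.asarray(y_true) - np.asarray(y_pred))
    return np.maximum(0.0, residual - epsilon)

# Made-up weekly returns and predictions.
y_true = np.array([0.010, -0.020, 0.005])
y_pred = np.array([0.0105, -0.015, 0.005])
loss = epsilon_insensitive_loss(y_true, y_pred, epsilon=0.001)
print(loss)
```

Because weekly returns are typically a few percent in absolute value, epsilon has to be small relative to that scale; with an epsilon of 0.1 or more, nearly every residual would fall inside the tube and the model would have little incentive to fit anything.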

In [12]:
from sklearn.svm import LinearSVR

pg_svrlin = {'epsilon' : np.arange(start=0.00001, stop=0.001, step=0.00005), 'max_iter':[10000]}
linear_svm = LinearSVR()
grids_linear_svm = GridSearchCV(linear_svm, pg_svrlin, scoring='r2', cv=tscv, return_train_score=True)
grids_linear_svm.fit(X_train,y_train)
results_grids_linear_svm = pd.DataFrame(grids_linear_svm.cv_results_)
print(results_grids_linear_svm[['rank_test_score','mean_test_score','param_epsilon']])
print("\n")
print("-"*100)
print("\n")
print('Best result: ')
print(grids_linear_svm.best_params_)

    rank_test_score  mean_test_score param_epsilon
0                 1         0.890018         1e-05
1                 2         0.889916         6e-05
2                 3         0.889847       0.00011
3                 4         0.889685       0.00016
4                 5         0.888803       0.00021
5                 6         0.888167       0.00026
6                14         0.887435       0.00031
7                20         0.886987       0.00036
8                18         0.887179       0.00041
9                19         0.887165       0.00046
10               16         0.887314       0.00051
11               17         0.887305       0.00056
12               10         0.887595       0.00061
13                7         0.887771       0.00066
14                8         0.887749       0.00071
15                9         0.887694       0.00076
16               11         0.887594       0.00081
17               12         0.887540       0.00086
18               13         0.8

In [13]:
linear_svm_bestmodel = LinearSVR(epsilon=1e-05, max_iter=10000)
linear_svm_bestmodel.fit(X_train,y_train)
print("Train/test R-squared:")
print(linear_svm_bestmodel.score(X_train,y_train))
print(linear_svm_bestmodel.score(X_test,y_test))

Train/test R-squared:
0.8744413376382348
0.4088860085762984


Expectations for this model were higher than what was realized: the test R^2 (about 0.41) is still worse than "expected" in forecasting, but the linear SVM has the highest cross-validated R^2 of the models tried so far (about 0.89).

# Non-Linear SVM

At the core of SVMs we have the non-linear support vector machine, which relies on the kernel trick to estimate complex non-linear relationships. Given the model originally theorized by Sharpe, we might expect these SVM models to be more 'complex' than needed for our task (thus leading to overfitting). I have therefore decided to limit the analysis of this family of models to a single grid search that includes both the polynomial and the radial basis kernels. From the polynomial grid search I have excluded degree 1, as we would come back to a linear SVM.
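
As a reminder of what the radial basis kernel computes, the sketch below builds the kernel matrix exp(-gamma * ||a - b||^2) by hand with NumPy. The helper `rbf_kernel_matrix` and the random data are assumptions; the gamma value follows scikit-learn's documented `gamma='scale'` rule, 1 / (n_features * X.var()).

```python
import numpy as np

def rbf_kernel_matrix(A, B, gamma):
    """RBF kernel between the rows of A and the rows of B."""
    sq_dists = (
        np.sum(A ** 2, axis=1)[:, None]
        + np.sum(B ** 2, axis=1)[None, :]
        - 2.0 * A @ B.T
    )
    # Clip tiny negative values caused by floating-point rounding.
    return np.exp(-gamma * np.maximum(sq_dists, 0.0))

rng = np.random.default_rng(0)
X_demo = rng.normal(scale=0.02, size=(6, 3))  # 6 observations, 3 features
gamma_scale = 1.0 / (X_demo.shape[1] * X_demo.var())
K = rbf_kernel_matrix(X_demo, X_demo, gamma_scale)
print(K.round(3))
```

Each entry is a similarity between two observations: 1 for identical rows and decaying towards 0 as they move apart, which is what lets the SVM fit non-linear relationships.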

NB: here too some of the models do not converge, but no worries, just scroll to the end of the output for the usual table.

In [14]:
# now we try non-linear svr

from sklearn.svm import SVR

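# NOTE: SVR requires C to be strictly positive, so the C=0 candidates are expected to fail to fit (scored as NaN)
pg_svm_nonlinear = {'C' : [0.1, 0, 1, 10], 'kernel':['poly', 'rbf'], 'gamma':['scale'],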
             'epsilon' : [0.1, 0, 1, 10], 'degree' : np.arange(start=2, stop=11, step=1),
             'max_iter':[10000]}
svm_nonlinear = SVR()
grids_svm_nonlinear = GridSearchCV(svm_nonlinear, pg_svm_nonlinear, scoring='r2', cv=tscv, return_train_score=True)
grids_svm_nonlinear.fit(X_train,y_train)
results_svm_nonlinear = pd.DataFrame(grids_svm_nonlinear.cv_results_)
print(results_svm_nonlinear[['rank_test_score','mean_test_score','param_epsilon','param_C', 'param_degree', 'param_kernel']])
print("\n")
print("-"*100)
print("\n")
print('Best result: ')
print(grids_svm_nonlinear.best_params_)

     rank_test_score  mean_test_score param_epsilon param_C param_degree  \
0                 29        -2.074619           0.1     0.1            2   
1                164        -2.164993           0.1     0.1            2   
2                 28        -0.579165             0     0.1            2   
3                  1         0.477985             0     0.1            2   
4                110        -2.155404             1     0.1            2   
..               ...              ...           ...     ...          ...   
283               19         0.385166             0      10           10   
284              110        -2.155404             1      10           10   
285              110        -2.155404             1      10           10   
286               56        -2.155404            10      10           10   
287               56        -2.155404            10      10           10   

    param_kernel  
0           poly  
1            rbf  
2           poly  
3          

In [15]:
nonlinear_svm_bestmodel = SVR(epsilon=0, max_iter=10000, degree=2, kernel='rbf', gamma='scale')
nonlinear_svm_bestmodel.fit(X_train,y_train)
print("Train/test R-squared:")
print(nonlinear_svm_bestmodel.score(X_train,y_train))
print(nonlinear_svm_bestmodel.score(X_test,y_test))

Train/test R-squared:
0.9890536023832558
0.6586468312598115


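Exactly as expected, the best model from the grid search, a radial basis kernel with epsilon=0 and C=0.1, overfits heavily, as we can see from the difference between the training and test scores. Note that the refit above does not pass C, so it uses scikit-learn's default C=1.0 rather than 0.1.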

# Tree Regression

After SVMs, we come back to a model without a functional form, which is not the ideal model for a regression task such as ours. Nevertheless, it is interesting to test the performance of such models on tasks that are apparently out of context.
This is a simple tree model and thus consists of only one tree, with different hyperparameters that are fine-tuned below.

NB: some models might not converge.


In [16]:
# now we try a simple tree regression

from sklearn.tree import DecisionTreeRegressor

pg_tree = {'max_depth' : [1, 2, 3, 5, 10, 20], 'min_samples_split':[2, 3], 
           'min_samples_leaf':[1, 2, 3], 'max_features' : [1, 3, 5, 11]}
simple_tree = DecisionTreeRegressor()
grids_simple_tree = GridSearchCV(simple_tree, pg_tree, scoring='r2', cv=tscv, return_train_score=True)
grids_simple_tree.fit(X_train,y_train)
results_simple_tree = pd.DataFrame(grids_simple_tree.cv_results_)
print(results_simple_tree[['rank_test_score','mean_test_score','param_max_depth','param_min_samples_split', 'param_max_features', 'param_min_samples_leaf']])
print("\n")
print("-"*100)
print("\n")
print('Best result: ')
print(grids_simple_tree.best_params_)

     rank_test_score  mean_test_score param_max_depth param_min_samples_split  \
0                130         0.321663               1                       2   
1                143         0.049788               1                       3   
2                144         0.026931               1                       2   
3                120         0.349674               1                       3   
4                141         0.188090               1                       2   
..               ...              ...             ...                     ...   
139               44         0.605019              20                       3   
140               39         0.613530              20                       2   
141               38         0.615183              20                       3   
142               62         0.587431              20                       2   
143               71         0.566562              20                       3   

    param_max_features para

In [17]:
best_simple_tree = DecisionTreeRegressor(max_depth=5, max_features=11, min_samples_leaf=1, 
                                         min_samples_split=2)
best_simple_tree.fit(X_train,y_train)
print("Train/test R-squared:")
print(best_simple_tree.score(X_train,y_train))
print(best_simple_tree.score(X_test,y_test))

Train/test R-squared:
0.9519624362493069
0.4073367674054942


The simple tree is overfitting a lot and provides really bad results out of sample, as we could have expected.

# Random Forest

We are now in the world of ensemble models and more complex trees. As always, we have a search grid to fine-tune the required hyperparameters.

In [18]:
# now we try a Random Forest regression with and without bootstrap and out of bag evaluation

from sklearn.ensemble import RandomForestRegressor

pg_randomforest = {'max_depth' : [1, 2, 3, 5, 10, 20, None], 'min_samples_split':[2, 3, 5], 
           'min_samples_leaf':[1, 2], 'n_estimators' : [100, 200, 300], 'bootstrap':[True, False], 'oob_score':[True, False]}
randomforest = RandomForestRegressor()
grids_randomforest = GridSearchCV(randomforest, pg_randomforest, scoring='r2', cv=tscv, return_train_score=True)
grids_randomforest.fit(X_train,y_train)
results_randomforest = pd.DataFrame(grids_randomforest.cv_results_)
print(results_randomforest[['rank_test_score','mean_test_score','param_max_depth','param_min_samples_split', 'param_n_estimators', 'param_min_samples_leaf']])
print("\n")
print("-"*100)
print("\n")
print('Best result: ')
print(grids_randomforest.best_params_)

     rank_test_score  mean_test_score param_max_depth param_min_samples_split  \
0                325         0.470812               1                       2   
1                328         0.469121               1                       2   
2                345         0.461201               1                       2   
3                333         0.466979               1                       2   
4                329         0.468476               1                       2   
..               ...              ...             ...                     ...   
499              305         0.631339            None                       5   
500              394              NaN            None                       5   
501              299         0.633859            None                       5   
502              380              NaN            None                       5   
503              303         0.632628            None                       5   

    param_n_estimators para

In [19]:
best_randomforest = RandomForestRegressor(max_depth=10, min_samples_leaf=1, n_estimators=300,
                                          bootstrap=True)
best_randomforest.fit(X_train,y_train)
print("Train/test R-squared:")
print(best_randomforest.score(X_train,y_train))
print(best_randomforest.score(X_test,y_test))

Train/test R-squared:
0.9745305406707453
0.3195389785005893


Here the model overfits a lot: the training R^2 is about 0.97 while the test R^2 is about 0.32, which in this run is below the results achieved with the simpler functional models.
It is interesting to note that the performance on the training sample is very high; this suggests that the model is not that good at forecasting but can provide a good fit for describing historical performance.
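
The NaN scores in the random forest grid output are consistent with candidates that cannot be fit: scikit-learn only allows `oob_score=True` together with `bootstrap=True`. A possible way to avoid those candidates, sketched below, is to pass the grid as a list of dictionaries so that only valid combinations are generated. The names `shared`, `full_grid` and `valid_grid` are illustrative.

```python
from sklearn.model_selection import ParameterGrid

# Hyperparameters shared by every candidate (same values as the grid above).
shared = {
    "max_depth": [1, 2, 3, 5, 10, 20, None],
    "min_samples_split": [2, 3, 5],
    "min_samples_leaf": [1, 2],
    "n_estimators": [100, 200, 300],
}

# Full cross product, including the invalid bootstrap=False / oob_score=True pairs.
full_grid = {**shared, "bootstrap": [True, False], "oob_score": [True, False]}

# List of dicts: out-of-bag scoring is only combined with bootstrap=True.
valid_grid = [
    {**shared, "bootstrap": [True], "oob_score": [True, False]},
    {**shared, "bootstrap": [False], "oob_score": [False]},
]

n_full = len(ParameterGrid(full_grid))
n_valid = len(ParameterGrid(valid_grid))
print(n_full, n_valid)
```

`GridSearchCV` accepts such a list directly as `param_grid`, so the search skips the combinations that would otherwise fail.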

# Gradient Boosting

Similarly to the random forest here we test the gradient boosting model, which shares some of the limitations of other tree models.

In [20]:
# now we try gradient boosting 

from sklearn.ensemble import GradientBoostingRegressor
boostRegress = GradientBoostingRegressor()
pg_boostRegress = {'max_depth' : [1, 2, 3, 5, None], 'n_estimators' : [1, 3, 5, 10], 
                   'learning_rate':[0.01, 0.05, 0.1, 0.5, 0.7, 0.9, 1]}
grids_boostRegress = GridSearchCV(boostRegress, pg_boostRegress, scoring='r2', cv=tscv, return_train_score=True)
grids_boostRegress.fit(X_train,y_train)
results_boostRegress = pd.DataFrame(grids_boostRegress.cv_results_)
print(results_boostRegress[['rank_test_score','mean_test_score','param_max_depth','param_learning_rate', 'param_n_estimators']])
print("\n")
print("-"*100)
print("\n")
print('Best result: ')
print(grids_boostRegress.best_params_)

     rank_test_score  mean_test_score param_max_depth param_learning_rate  \
0                140        -0.035216               1                0.01   
1                135        -0.019618               1                0.01   
2                132        -0.004628               1                0.01   
3                120         0.030784               1                0.01   
4                139        -0.030604               2                0.01   
..               ...              ...             ...                 ...   
135               39         0.660168               5                   1   
136               28         0.670145            None                   1   
137               35         0.661978            None                   1   
138               29         0.669161            None                   1   
139               34         0.663492            None                   1   

    param_n_estimators  
0                    1  
1                    3  


In [21]:
best_boostreg = GradientBoostingRegressor(learning_rate=0.5, max_depth=3, n_estimators=5)
best_boostreg.fit(X_train,y_train)
print("Train/test R-squared:")
print(best_boostreg.score(X_train,y_train))
print(best_boostreg.score(X_test,y_test))

Train/test R-squared:
0.9546029277222555
0.512567399710302


The result of gradient boosting is similar to the random forest in that it also overfits the training sample (as we could have expected from the lecture notes). In the run shown here the test R^2 is about 0.51, against about 0.32 for the random forest.

# Clustering with unsupervised learning

This is a rather particular approach and I don't expect to obtain great results. The idea is to use K-means clustering over all funds and divide them into possible groups. Once the clustering model is ready, we will transform X_train (the index returns used as regressors for the BlackRock fund) into distances from the centroids of the K-means clustering.
Then we will fit a linear SVM to the transformed X in a similar manner to what we have done before.

I do not expect this to perform well; it is just to show how to do it.
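
The centroid-distance idea can be sketched in a simplified form as follows. This is not the notebook's setup: it swaps in scikit-learn's Euclidean `KMeans` for tslearn's DTW-based `TimeSeriesKMeans`, fits the clusters on the regressors themselves, and uses synthetic data with five clusters, all of which are assumptions made to keep the example small.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for 11 index return series and one fund.
rng = np.random.default_rng(0)
X_demo = rng.normal(scale=0.02, size=(259, 11))
y_demo = X_demo.mean(axis=1) + rng.normal(scale=0.002, size=259)
X_tr, X_te, y_tr, y_te = X_demo[:-52], X_demo[-52:], y_demo[:-52], y_demo[-52:]

# Fit the clusters on the training regressors only, then replace each row
# by its distances to the cluster centroids.
kmeans = KMeans(n_clusters=5, n_init=10, random_state=0).fit(X_tr)
dist_tr = kmeans.transform(X_tr)
dist_te = kmeans.transform(X_te)

# Any regressor can then be trained on the distance features.
model = LinearRegression().fit(dist_tr, y_tr)
print(dist_tr.shape, dist_te.shape)
print(model.score(dist_te, y_te))
```

Fitting the clusters in the same feature space as the data being transformed keeps the distances directly interpretable, at the cost of discarding the sign and direction of the original returns.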

In [28]:
# I will try to cluster the various funds

from tslearn.barycenters import dtw_barycenter_averaging
from tslearn.clustering import TimeSeriesKMeans

n_clusters = [3, 5, 10, 15, 20, 30]

for i, n_cluster in enumerate(n_clusters): 
    kmcluster = TimeSeriesKMeans(metric="dtw", max_iter=50, n_clusters=n_cluster)
    kmcluster.fit(return_funds)
    print(f'\nNumber of clusters: {n_cluster}')
    print(f'Inertia: {kmcluster.inertia_}')  


Number of clusters: 3
Inertia: 0.037082923597718254

Number of clusters: 5
Inertia: 0.021483819714294302

Number of clusters: 10
Inertia: 0.014527284572224103

Number of clusters: 15
Inertia: 0.011505576649663266

Number of clusters: 20
Inertia: 0.010314076735945268

Number of clusters: 30
Inertia: 0.008216844117609793


In [32]:
print(kmcluster.cluster_centers_.shape )

(30, 320, 1)


In [33]:
# Since there are 11 indexes, I will cluster the fund returns into 11 clusters
km11cluster = TimeSeriesKMeans(metric="dtw", max_iter=20, n_clusters=11)
km11cluster.fit(return_funds)
cluster_X_train = km11cluster.transform(X_train)
cluster_X_test = km11cluster.transform(X_test)
print(X_train)
print(X_test)

        EONIA  MSCI World/Consumer Discrionary  MSCI World/Consumer Staples  \
90  -0.000072                        -0.001187                    -0.007761   
256 -0.000091                        -0.045958                    -0.035880   
98  -0.000054                        -0.002886                    -0.014025   
40  -0.000054                        -0.024131                    -0.036076   
19  -0.000054                         0.030045                     0.009819   
..        ...                              ...                          ...   
129 -0.000072                         0.009735                     0.015800   
125 -0.000072                         0.025368                     0.027115   
91  -0.000072                         0.002843                    -0.004987   
84  -0.000072                         0.016190                     0.014056   
7   -0.000036                         0.027181                     0.027445   

     MSCI World/Energy  MSCI World/Financials  MSCI

In [36]:
pg_lsvm_cluster = {'epsilon' : np.arange(start=0.00001, stop=0.001, step=0.00005), 'max_iter':[10000]}
lsvm_cluster = LinearSVR()
grids_lsvm_cluster = GridSearchCV(lsvm_cluster, pg_lsvm_cluster, scoring='r2', cv=tscv, return_train_score=True)
grids_lsvm_cluster.fit(cluster_X_train, y_train)
results_grids_lsvm_cluster = pd.DataFrame(grids_lsvm_cluster.cv_results_)
print(results_grids_lsvm_cluster[['rank_test_score','mean_test_score','param_epsilon']])
print("\n")
print("-"*100)
print("\n")
print('Best result: ')
print(grids_lsvm_cluster.best_params_)

    rank_test_score  mean_test_score param_epsilon
0                17         0.690663         1e-05
1                20         0.689189         6e-05
2                19         0.689619       0.00011
3                14         0.691632       0.00016
4                15         0.690712       0.00021
5                18         0.689919       0.00026
6                11         0.694371       0.00031
7                10         0.694383       0.00036
8                 7         0.696171       0.00041
9                 2         0.698372       0.00046
10                8         0.695804       0.00051
11                6         0.696795       0.00056
12                5         0.697570       0.00061
13                9         0.695332       0.00066
14               16         0.690703       0.00071
15               12         0.694259       0.00076
16               13         0.692911       0.00081
17                3         0.698314       0.00086
18                4         0.6

In [41]:
lvsm_cluster_bestmodel = LinearSVR(epsilon=0.00096, max_iter=10000)
lvsm_cluster_bestmodel.fit(cluster_X_train,y_train)
print("Train/test R-squared:")
print(lvsm_cluster_bestmodel.score(cluster_X_train,y_train))
print(lvsm_cluster_bestmodel.score(cluster_X_test,y_test))

Train/test R-squared:
0.7880485937737601
0.13203888714997125


As expected, the result is not very good (test R-squared of 0.13 against 0.79 in training), but it might be an interesting approach.
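
One possible refinement, sketched here on synthetic data rather than on the fund returns: the clustering step can be placed inside a `Pipeline`, so that `KMeans` is re-fitted on the training part of every time-series fold and the validation weeks never influence the cluster centres. The 11 clusters (guessed from the name `km11cluster`), the small epsilon grid and all the `*_demo` names are my assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVR

# Synthetic stand-in for weekly index returns and a fund return (assumption)
rng = np.random.default_rng(0)
X_demo = rng.normal(scale=0.02, size=(200, 6))
w_demo = np.array([0.3, 0.2, 0.2, 0.1, 0.1, 0.1])
y_demo = X_demo @ w_demo + rng.normal(scale=0.002, size=200)

# KMeans.transform returns the distance of each sample to every cluster centre,
# so inside a Pipeline it acts as a feature-construction step for the SVR
pipe_demo = Pipeline([
    ("km", KMeans(n_clusters=11, n_init=10, random_state=0)),
    ("svr", LinearSVR(max_iter=10000, random_state=0)),
])

# Hypothetical, much smaller grid than the one used in the notebook
pg_demo = {"svr__epsilon": [0.0001, 0.0005, 0.001]}

grid_demo = GridSearchCV(pipe_demo, pg_demo, scoring="r2",
                         cv=TimeSeriesSplit(n_splits=5))
grid_demo.fit(X_demo, y_demo)
print(grid_demo.best_params_)
```

With this layout, `grid_demo.predict(X_new)` applies the cluster transform and the regression in one call, so no separate `cluster_X_test` is needed.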

# Neural Networks

This is a very simple approach to Neural Networks: we will perform a simple GridSearch over some possible network shapes and activation functions.

In [30]:
# To close this exercise we will try a Neural Network
from sklearn.neural_network import MLPRegressor
hidden_layer_sizes = [[5,5,5], [10,20,10], [50,50,50], [100, 20, 100], [10,10,10,10,10,10,10]]
pg_mlpr = {"hidden_layer_sizes": hidden_layer_sizes, "activation": ["logistic", "relu"], "solver": ["lbfgs"], "alpha": [0.0001,0.001,0.01,0.1,1]}
mlpr = MLPRegressor(max_iter=7000)
grids_mlpr = GridSearchCV(estimator=mlpr, param_grid=pg_mlpr)
grids_mlpr.fit(X_train,y_train)
results_mlpr = pd.DataFrame(grids_mlpr.cv_results_)
print(results_mlpr[['rank_test_score','mean_test_score','param_hidden_layer_sizes','param_activation', 'param_alpha']])
print("\n")
print("-"*100)
print("\n")
print('Best result: ')
print(grids_mlpr.best_params_)

    rank_test_score  mean_test_score      param_hidden_layer_sizes  \
0                36        -0.022965                     [5, 5, 5]   
1                42        -0.023150                  [10, 20, 10]   
2                48        -0.023508                  [50, 50, 50]   
3                30        -0.022815                [100, 20, 100]   
4                35        -0.022937  [10, 10, 10, 10, 10, 10, 10]   
5                39        -0.023026                     [5, 5, 5]   
6                37        -0.023002                  [10, 20, 10]   
7                45        -0.023404                  [50, 50, 50]   
8                33        -0.022928                [100, 20, 100]   
9                28        -0.022736  [10, 10, 10, 10, 10, 10, 10]   
10               50        -0.023750                     [5, 5, 5]   
11               49        -0.023725                  [10, 20, 10]   
12               43        -0.023158                  [50, 50, 50]   
13               17 

In [42]:
mlpr_bestmodel = MLPRegressor(max_iter=7000, activation='relu', alpha=0.01,
                              hidden_layer_sizes=[100, 20, 100], solver='lbfgs')
mlpr_bestmodel.fit(X_train,y_train)
print("Train/test R-squared:")
print(mlpr_bestmodel.score(X_train,y_train))
print(mlpr_bestmodel.score(X_test,y_test))

Train/test R-squared:
0.8764305422957386
0.33765328970068276


Unfortunately, the Neural Network does not perform well (test R-squared of 0.34 against 0.88 in training), but this is more than expected, as our training sample is far too small for effective training.
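
Sample size may not be the only factor. Weekly returns are of order 0.01, and multi-layer perceptrons are generally sensitive to feature scale, so standardising inputs and target might be worth a try. The following is only a sketch on synthetic data, not a result for the fund: the `*_demo` names, the reduced grid and `max_iter=2000` are my assumptions.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for weekly index returns and a fund return (assumption)
rng = np.random.default_rng(0)
X_demo = rng.normal(scale=0.02, size=(250, 6))
w_demo = np.array([0.3, 0.2, 0.2, 0.1, 0.1, 0.1])
y_demo = X_demo @ w_demo + rng.normal(scale=0.002, size=250)

# Standardise the features in the pipeline and the target through
# TransformedTargetRegressor, so scaling is learned on training folds only
mlp_pipe_demo = Pipeline([
    ("scale", StandardScaler()),
    ("mlp", TransformedTargetRegressor(
        regressor=MLPRegressor(solver="lbfgs", max_iter=2000, random_state=0),
        transformer=StandardScaler(),
    )),
])

# Hypothetical reduced grid, just to keep the sketch quick to run
pg_mlp_demo = {
    "mlp__regressor__hidden_layer_sizes": [(5, 5, 5), (10, 20, 10)],
    "mlp__regressor__alpha": [0.01, 1.0],
}

grid_mlp_demo = GridSearchCV(mlp_pipe_demo, pg_mlp_demo, scoring="r2",
                             cv=TimeSeriesSplit(n_splits=4))
grid_mlp_demo.fit(X_demo, y_demo)
print(grid_mlp_demo.best_params_)
print(grid_mlp_demo.best_score_)
```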

# Conclusions

Overall, none of the models we have tested performs well enough. This is due to some evident limitations in our data: we do not have enough observations to effectively train most of the more advanced models we have tested (SVM, Random Forest, NN).

Furthermore, the task itself does not seem to fit some of the models we have tested, such as KNN and trees in general. Thus it is not surprising that the best results are obtained with simpler linear models, which should actually fit the theoretical background of Style Analysis better.
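
For reference, the theoretical background mentioned above is a constrained linear problem: Sharpe's formulation looks for index weights that are non-negative and sum to one, chosen to minimise the variance of the difference between the fund return and the weighted index return. A minimal sketch with `scipy.optimize.minimize`, on synthetic data, could look as follows. The data, `true_w` and all variable names are made up for illustration, and returns are expressed in percent only to keep the objective well scaled for the solver.

```python
import numpy as np
from scipy.optimize import minimize

# Synthetic weekly returns for 4 hypothetical indexes and one fund (assumption)
rng = np.random.default_rng(42)
n_weeks, n_idx = 260, 4
R_demo = rng.normal(scale=0.02, size=(n_weeks, n_idx))
true_w = np.array([0.5, 0.3, 0.2, 0.0])
r_fund = R_demo @ true_w + rng.normal(scale=0.001, size=n_weeks)

def tracking_variance(w):
    # Variance of the residual (fund minus style portfolio), in percent units
    return np.var(100.0 * (r_fund - R_demo @ w))

res = minimize(
    tracking_variance,
    x0=np.full(n_idx, 1.0 / n_idx),          # start from equal weights
    method="SLSQP",
    bounds=[(0.0, 1.0)] * n_idx,              # long-only weights
    constraints=[{"type": "eq", "fun": lambda w: np.sum(w) - 1.0}],  # fully invested
    options={"ftol": 1e-12},
)
style_weights = res.x
print(np.round(style_weights, 3))
```

An unconstrained linear regression drops both constraints, which is one reason its coefficients can be harder to read as portfolio weights.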

One last interesting point is the good training performance of Random Forest, which makes me think that a better result could be obtained if we could increase our sample.
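
One way to probe this idea without new data is to look at how the test score changes as the training window grows. The sketch below does this on synthetic data, so it says nothing about the actual fund: the sizes in `train_sizes`, the 240/60 split and the `*_demo` names are my assumptions. Each window uses the most recent `n` training weeks, so the time ordering is respected.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in for weekly index returns and a fund return (assumption)
rng = np.random.default_rng(1)
X_demo = rng.normal(scale=0.02, size=(300, 5))
w_demo = np.array([0.4, 0.3, 0.2, 0.1, 0.0])
y_demo = X_demo @ w_demo + rng.normal(scale=0.002, size=300)

# Chronological split: first 240 weeks for training, last 60 for testing
n_train = 240
X_tr, y_tr = X_demo[:n_train], y_demo[:n_train]
X_te, y_te = X_demo[n_train:], y_demo[n_train:]

train_sizes = [30, 60, 120, 240]
curve = []
for n in train_sizes:
    rf = RandomForestRegressor(n_estimators=100, random_state=0)
    # Fit on the n most recent training weeks only
    rf.fit(X_tr[-n:], y_tr[-n:])
    curve.append((n, rf.score(X_tr[-n:], y_tr[-n:]), rf.score(X_te, y_te)))

for n, r2_train, r2_test in curve:
    print(f"n={n:4d}  train R2={r2_train:.3f}  test R2={r2_test:.3f}")
```

If the test R-squared keeps rising with `n` while the training R-squared stays high, more data would plausibly help. If it flattens early, the gap is more likely due to the model itself.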