## MACHINE LEARNING


MODULE 8 | LESSON 4


---

# **Multilayer Perceptron: Timing Factors and Smart-Beta Strategies** 

|  |  |
|:---|:---|
|**Reading Time** |  50 minutes |
|**Prior Knowledge** |  Linear Regression, Neural Network, Machine Learning  |
|**Keywords** | Multilayer Perceptron, Deep learning, Neural Network, Momentum |


---

*In this lesson, we will revisit the momentum timing strategy of Lesson 2 using our newly acquired knowledge of neural networks. Multilayer perceptron networks (MLPs) add non-linearities to the prediction model, which will likely enhance its predictive power, but let's check by how much.*

## **1. MLP for Financial Market Predictions**

Multilayer perceptrons (MLPs) are among the simplest neural networks and have therefore been widely used in many applications, including the prediction of financial market returns. Several studies, many of them in the academic literature, explore the potential of MLPs in this setting. Here, we share a few of these works (you can easily find more online) in case you are interested. In the forthcoming Deep Learning in Finance course, we will review many more models and applications, but the works presented next are a reasonable baseline from which to grow our models. As you will see, most of them do not really achieve strong return predictability:

- Guresen, Erkam, et al. "Using Artificial Neural Network Models in Stock Market Index Prediction." *Expert Systems with Applications*, vol. 38, no. 8, 2011, pp. 10389-10397, https://www.sciencedirect.com/science/article/abs/pii/S0957417411002740?via%3Dihub

- Rather, Akhter Mohiuddin, et al. "Recurrent Neural Network and a Hybrid Model for Prediction of Stock Returns." *Expert Systems with Applications* vol. 42, no. 6, 2015, pp. 3234-3241, https://www.sciencedirect.com/science/article/abs/pii/S0957417414007684

- Devadoss, A. Victor, and T. Antony Alphonnse Ligori. "Forecasting of Stock Prices using Multi Layer Perceptron." *International Journal of Computing Algorithm*, vol. 2, no. 1, 2013, pp. 440-449, http://ijwebt.com/abstract_meta_author.php?id=V2-I2-P6

- Namdari, Alireza, and Zhaojun Steven Li. "Integrating Fundamental and Technical Analysis of Stock Market through Multi-Layer Perceptron." 2018 IEEE Technology and Engineering Management Conference (TEMSCON). *IEEE*, 2018, https://ieeexplore.ieee.org/abstract/document/8488440?casa_token=hbO9qp7fjdMAAAAA:QsTb7IcWieVNRxCzuw1O57Pq23SbKcmEQ652cULn2enCiDbt9fGx2XKkPVDc9mAK7V8abX5y


- Anand, C. "Comparison of Stock Price Prediction Models using Pre-Trained Neural Networks." *Journal of Ubiquitous Computing and Communication Technologies*, vol. 3, no. 2, 2021, pp. 122-134, https://irojournals.com/jucct/article/view/3/2/5

Let's see how our MLP model is able to help in the prediction of momentum factor returns that we tackled in Lesson 2 using linear regression.

## **2. Timing Momentum with Multilayer Perceptron (MLPs)**

### **2.1. Data**

First, we load and process all the data needed to construct our first deep learning model. The steps are virtually the same as in the linear regression case from Lesson 2. We refer you to that notebook in case you have doubts about any of the steps:

In [None]:
import numpy as np
import pandas as pd

In [None]:
route = "10_Portfolios_Prior_12_2_Daily.csv"

In [None]:
# Read the CSV file with the daily returns of the 10 portfolios
df = pd.read_csv(route, index_col=0)
# Format the date index
df.index = pd.to_datetime(df.index, format="%Y%m%d")
# Build the MOM strategy: Long "Hi PRIOR" and Short "Lo PRIOR"
df["Mom"] = df["Hi PRIOR"] - df["Lo PRIOR"]
df.head()

- **Inputs and outputs**


In [None]:
df["Ret"] = df["Mom"]
df["Ret10_MOMi"] = df["Mom"].rolling(10).apply(lambda x: np.prod(1 + x / 100) - 1)
df["Ret25_MOMi"] = df["Mom"].rolling(25).apply(lambda x: np.prod(1 + x / 100) - 1)
df["Ret60_MOMi"] = df["Mom"].rolling(60).apply(lambda x: np.prod(1 + x / 100) - 1)
df["Ret120_MOMi"] = df["Mom"].rolling(120).apply(lambda x: np.prod(1 + x / 100) - 1)
df["Ret240_MOMi"] = df["Mom"].rolling(240).apply(lambda x: np.prod(1 + x / 100) - 1)

df["Ret10_hi"] = df["Hi PRIOR"].rolling(10).apply(lambda x: np.prod(1 + x / 100) - 1)
df["Ret25_hi"] = df["Hi PRIOR"].rolling(25).apply(lambda x: np.prod(1 + x / 100) - 1)
df["Ret60_hi"] = df["Hi PRIOR"].rolling(60).apply(lambda x: np.prod(1 + x / 100) - 1)
df["Ret120_hi"] = df["Hi PRIOR"].rolling(120).apply(lambda x: np.prod(1 + x / 100) - 1)
df["Ret240_hi"] = df["Hi PRIOR"].rolling(240).apply(lambda x: np.prod(1 + x / 100) - 1)

df["Ret10_Low"] = df["Lo PRIOR"].rolling(10).apply(lambda x: np.prod(1 + x / 100) - 1)
df["Ret25_Low"] = df["Lo PRIOR"].rolling(25).apply(lambda x: np.prod(1 + x / 100) - 1)
df["Ret60_Low"] = df["Lo PRIOR"].rolling(60).apply(lambda x: np.prod(1 + x / 100) - 1)
df["Ret120_Low"] = df["Lo PRIOR"].rolling(120).apply(lambda x: np.prod(1 + x / 100) - 1)
df["Ret240_Low"] = df["Lo PRIOR"].rolling(240).apply(lambda x: np.prod(1 + x / 100) - 1)

df["Ret60"] = df["Ret60_MOMi"].shift(-60)
df = df.dropna()
df.tail(10)

df = df.drop(
    [
        "Lo PRIOR",
        "PRIOR 2",
        "PRIOR 3",
        "PRIOR 4",
        "PRIOR 5",
        "PRIOR 6",
        "PRIOR 7",
        "PRIOR 8",
        "PRIOR 9",
        "Hi PRIOR",
        "Mom",
    ],
    axis=1,
)

In [None]:
df.head()
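
To make the feature and target construction above more concrete, here is a minimal sketch on made-up numbers (the five daily returns and the two-day window are illustrative assumptions, not values from the lesson's data set). It shows that `rolling(k).apply(...)` compounds the last `k` daily percentage returns, and that `shift(-k)` turns this trailing return into the return over the *next* `k` days, which is how the `Ret60` target is built.

```python
import numpy as np
import pandas as pd

# Hypothetical daily returns in percent (illustrative values only)
daily_pct = pd.Series([1.0, -2.0, 0.5, 3.0, -1.0])
window = 2  # The lesson uses windows of 10 to 240 days; 2 keeps the example readable

# Trailing compounded return over the last `window` days (as a decimal)
trailing = daily_pct.rolling(window).apply(
    lambda x: np.prod(1 + x / 100) - 1, raw=True
)

# Forward-looking target: compounded return over the next `window` days
forward = trailing.shift(-window)

print(pd.DataFrame({"daily_pct": daily_pct, "trailing": trailing, "forward": forward}))
```

The last `window` rows of `forward` are `NaN` because those future returns are not yet known, which is why the lesson calls `dropna()` after building the target.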

- **Train-Test samples and Scaling**

In [None]:
from sklearn.model_selection import train_test_split

df.reset_index(inplace=True)
df.rename(columns={"index": "Date"}, inplace=True)
df.head()

In [None]:
df.reset_index(inplace=True, drop=True)

ts = int(0.4 * len(df))  # Number of observations in the test sample
split_time = len(df) - ts  # From this row onward, we are in the test sample
test_time = df.iloc[split_time:, 0:1].values  # Keep the test sample dates
Ret_vector = df.iloc[split_time:, 1:2].values
df.tail()

In [None]:
Xdf, ydf = df.iloc[:, 2:-1], df.iloc[:, -1]
X = Xdf.astype("float32")
y = ydf.astype("float32")

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=ts, shuffle=False
)  # It is important to keep "shuffle=False"
n_features = X_train.shape[1]
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

In [None]:
# Scaling

from sklearn.preprocessing import MinMaxScaler

scaler_input = MinMaxScaler(feature_range=(-1, 1))
scaler_input.fit(X_train)
X_train = scaler_input.transform(X_train)
X_test = scaler_input.transform(X_test)

mean_ret = np.mean(y_train)  # Useful to compute the performance = R2

scaler_output = MinMaxScaler(feature_range=(-1, 1))
y_train = y_train.values.reshape(len(y_train), 1)
y_test = y_test.values.reshape(len(y_test), 1)
scaler_output.fit(y_train)
y_train = scaler_output.transform(y_train)
y_test = scaler_output.transform(y_test)
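
As a rough illustration of why the scalers are fitted on the training sample only, the following sketch reproduces the min-max formula for the range $(-1, 1)$ by hand with NumPy. The toy numbers and the helper names `scale` and `unscale` are assumptions made for this example. Because the minimum and maximum come from the training sample, test observations can fall outside $[-1, 1]$, and the inverse transformation recovers the original units.

```python
import numpy as np

# Toy series in chronological order (illustrative values only)
series = np.array([0.0, 2.0, 4.0, 1.0, 5.0])

# Chronological split: the last 40% of observations form the test sample
n_test = int(0.4 * len(series))
train, test = series[:-n_test], series[-n_test:]

# "Fit" on the training sample only, so no information from the test set leaks in
lo, hi = train.min(), train.max()


def scale(x):
    """Map x linearly so that the training minimum/maximum become -1/+1."""
    return 2 * (x - lo) / (hi - lo) - 1


def unscale(z):
    """Inverse of `scale`: back to the original units."""
    return (z + 1) / 2 * (hi - lo) + lo


print(scale(train))          # All values lie within [-1, 1]
print(scale(test))           # Values may lie outside [-1, 1]
print(unscale(scale(test)))  # Recovers the original test values
```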

### **2.2. MLP Model and Training**

Now that we have all our data ready, let's build our first MLP model. To that end, we will define several things:

- **Activation function** 

Although the different hidden layers may employ different activation functions, in this case, we will select the same one for all hidden layers: the rectified linear unit (**ReLU**).

- **Hidden layers and units within layers**

In this model, we will use a total of 3 hidden layers. In order from the input layer, these layers will have 50, 30, and 10 units, respectively.

- **Output layer**

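Of course, we need to define a final fully connected layer for the output. Since we are predicting a single return, this layer has one unit and no activation function (i.e., it is linear).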

- **Learning rate**

As before, we will choose a learning rate of $10^{-5}$. (*Note that we will play around with this hyperparameter, i.e., tune it, in the forthcoming Deep Learning in Finance course.*)

- **Optimizer**

We will choose the **Adam** optimizer, which you are already familiar with.

- **Loss function**

Different from the linear regression case, here we will select a loss function based on the **mean absolute error (MAE)**. This is:


$$
L(y, \hat{y}) = \frac{1}{n}\sum_{i=1}^n |y_i - \hat{y}_i|
$$

where $n$ is the number of observations, $y_i$ is the observed value, and $\hat{y}_i$ is the model's prediction.

Note that, importantly, this error function treats outliers differently from the MSE (mean squared error) function that we used in linear regression, essentially because it does not square the differences, so large errors are not penalized disproportionately. This can be important, especially in the finance context, when we do not want to give too much importance to a few outliers that are simply "extreme events," not representative of the overall sample. Of course, it would be a completely different story if our model aimed at predicting extreme events (e.g., firm bankruptcy, client default on credit card debt, etc.), since these extreme values would be the main focus of our investigation.
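
A small numerical sketch can make this difference in the treatment of outliers concrete. The error values below are invented for illustration: four small errors and one extreme one.

```python
import numpy as np

# Hypothetical prediction errors (y - y_hat); the last one is an outlier
errors = np.array([0.1, -0.1, 0.1, -0.1, 2.0])

mae_all = np.mean(np.abs(errors))
mse_all = np.mean(errors**2)

# Same losses when the outlier is left out
mae_clean = np.mean(np.abs(errors[:-1]))
mse_clean = np.mean(errors[:-1] ** 2)

print(f"MAE: {mae_clean:.3f} -> {mae_all:.3f}")
print(f"MSE: {mse_clean:.3f} -> {mse_all:.3f}")
print(f"Outlier share of MAE: {abs(errors[-1]) / np.abs(errors).sum():.1%}")
print(f"Outlier share of MSE: {errors[-1] ** 2 / (errors**2).sum():.1%}")
```

In this toy case, the single outlier accounts for a larger share of the MSE than of the MAE, so a model trained on the MSE would be pushed harder to fit it.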

**Note also that we are setting our seed to be constant.** *This is important because different training runs could lead to different final outcomes for a variety of reasons (e.g., random weight initialization in Keras). You are welcome to remove this seed and check how things change!*

In [None]:
import tensorflow as tf

tf.random.set_seed(12345)

act_fun = "relu"  # Activation function
hp_units = 50  # Units in the first hidden layer
hp_units_2 = 30  # Units in the second hidden layer
hp_units_3 = 10  # Units in the third hidden layer

model = tf.keras.models.Sequential()
model.add(tf.keras.layers.Dense(units=hp_units, activation=act_fun))
model.add(tf.keras.layers.Dense(units=hp_units_2, activation=act_fun))
model.add(tf.keras.layers.Dense(units=hp_units_3, activation=act_fun))
model.add(tf.keras.layers.Dense(1))

hp_lr = 1e-5

adam = tf.keras.optimizers.Adam(learning_rate=hp_lr)  # "lr" is a deprecated alias
model.compile(optimizer=adam, loss="mean_absolute_error")
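
To see what this stack of `Dense` layers computes, here is a minimal NumPy sketch of the forward pass for the same architecture (15 inputs, hidden layers of 50, 30, and 10 units with ReLU, and one linear output). This is not how Keras implements it internally: the random weights, the zero biases, and the names `relu` and `mlp_forward` are assumptions for illustration only, and no training takes place here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Layer sizes: inputs -> three hidden layers -> one output
layer_sizes = [15, 50, 30, 10, 1]

# Randomly initialized weights and zero biases (illustrative, untrained)
weights = [
    rng.normal(scale=0.1, size=(n_in, n_out))
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])
]
biases = [np.zeros(n_out) for n_out in layer_sizes[1:]]


def relu(z):
    """Rectified linear unit: max(z, 0) applied element-wise."""
    return np.maximum(z, 0.0)


def mlp_forward(x):
    """Apply each hidden layer (affine map + ReLU), then a linear output layer."""
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        h = relu(h @ W + b)
    return h @ weights[-1] + biases[-1]


# A batch of 4 hypothetical observations with 15 scaled features each
x_batch = rng.uniform(-1, 1, size=(4, 15))
y_hat = mlp_forward(x_batch)
print(y_hat.shape)
```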

Once we have defined our model, we can train it!

In [None]:
model.fit(X_train, y_train, epochs=30, batch_size=32, verbose=2)

As before, we can get a feel for how the model is structured by calling 'model.summary()':

In [None]:
model.summary()

Now that we have a few hidden layers with several units in each one, being able to interpret the architecture and understand how many parameters are trained at each step of the model becomes crucial. 

In this case, the 'summary()' option of Keras is already telling us how many parameters are trained in each layer. But where do these parameters come from?

For example, in the first layer of the model there are 800 parameters. These are equal to the number of units in the layer (50) times the number of different inputs (15), because there is a weight associated with each input and unit in the layer, plus the bias terms $b$ for each unit (50). Thus, $15 \times 50 + 50 = 800$.

Where does the number of $1530$ parameters from the second layer come from? This is equal to the number of "inputs" to this layer, which is the number of units in the previous layer (50), times the number of units in the layer (30), plus the bias term for each unit (30). Thus, $50 \times 30 + 30 = 1530$.

You can check the number of parameters in the other layers using the same logic.
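
The same logic can be written as a short helper to check all layers at once. This is only a sketch: `dense_params` is a hypothetical helper, not a Keras function, and the layer sizes are the ones used in this lesson.

```python
def dense_params(n_inputs, n_units):
    """Weights (one per input and unit) plus one bias per unit."""
    return n_inputs * n_units + n_units


# 15 input features, then hidden layers of 50, 30, and 10 units, then 1 output unit
layer_sizes = [15, 50, 30, 10, 1]

params_per_layer = [
    dense_params(n_in, n_out)
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])
]

print(params_per_layer)
print(sum(params_per_layer))
```

The per-layer counts can then be compared with the figures reported by 'model.summary()'.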

### **2.3. Validation and Early stopping**

- **How many epochs should we use when training our models?**

This is a very legitimate question to ask at this point. The number of epochs that we have used so far (30) is completely discretionary and not based on any relevant criterion. We are going to change this now by including **Early stopping** in our training.

This means that we are going to instruct Keras to stop model training when some condition is met. This will be done via the **callback API** in Keras, as we will see shortly.

- **When should we stop training?**

There are multiple ways to define a stopping criterion. We are going to use one of the most common ones and monitor the loss function on a separate validation set. After each epoch of the training process, we will check whether (and by how much) the loss function on the validation set decreases. We will also define a parameter, **patience**, that indicates the number of epochs with no improvement on the validation set that we tolerate before stopping training.

*More information on how **Callback API** and **Early stopping** in Keras works can be found here:* https://keras.io/api/callbacks/early_stopping/ 

Let's therefore implement Early stopping when training our algorithm. First, we need to define the characteristics of Early stopping, where we indicate:

1. **The quantity/set to monitor**: in our case the validation set loss function (we will define the validation set in a minute).

2. **The 'mode'**: by setting this to 'min' we ensure training will stop when the quantity set in (1) has stopped decreasing.

3. **Patience**: we will allow for 10 epochs with no improvement in minimizing the loss function of the validation set before we stop the training.

4. **restore_best_weights**: due to the iteration process, it may be the case that the last iteration before stopping training does not yield the model weights that achieve the lowest loss function in validation. By setting this option to 'True' we ensure that we keep the weights that achieved the best loss function value (i.e., the lowest) in validation.

In [None]:
es = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss", mode="min", verbose=1, patience=10, restore_best_weights=True
)
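
The stopping rule configured above can be sketched in plain Python. This is a simplified illustration of the logic, not the Keras implementation: the function name `early_stopping_epoch` and the validation losses are made up, and details such as Keras's `min_delta` option are left out.

```python
def early_stopping_epoch(val_losses, patience):
    """Return (epoch at which training stops, epoch with the best validation loss).

    Epochs are numbered from 1. Training stops once `patience` consecutive
    epochs have passed without a new minimum of the validation loss.
    """
    best_loss = float("inf")
    best_epoch = None
    epochs_without_improvement = 0

    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:  # mode="min": lower is better
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch, best_epoch  # Weights of `best_epoch` would be restored

    return len(val_losses), best_epoch  # Never triggered: all epochs were run


# Hypothetical validation losses, one per epoch
val_losses = [0.50, 0.40, 0.35, 0.36, 0.37, 0.355, 0.36]
stopped, best = early_stopping_epoch(val_losses, patience=3)
print(f"Stopped at epoch {stopped}; best weights come from epoch {best}")
```

With `restore_best_weights=True`, the model would end up with the weights from the best epoch rather than those from the last epoch that was run.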

Finally, we can re-train our model with the Early stopping callback. Because Early stopping needs it, we also define a validation set of 20% of the training set. Importantly, Keras takes this set from the last 20% of the training samples, before any shuffling, and keeps it apart: the model is not trained on these observations, and they are not part of the test sample. Lastly, note that we incorporate the callback 'es' defined before.

We have selected here 100 epochs for training but included an Early stopping criterion. Would the model train for the whole 100 epochs? Let's see...

In [None]:
model.fit(
    X_train,
    y_train,
    validation_split=0.2,
    epochs=100,
    batch_size=32,
    verbose=2,
    callbacks=[es],
)

As you can see, given our new criterion, model training has stopped at epoch 11, when Early stopping kicks in.

Now, let's see if our simple deep learning model is able to produce a better prediction of the momentum factor returns than our linear regression did!

## **3. MLP Model Performance for Momentum Timing**

As we did in the case of linear regression, we will:

1. Evaluate the predictive performance of our MLP-based model using the out-of-sample $R^2$
2. Assess the viability of the model by backtesting our trading strategy

Let's start with the first one.

### **3.1. Out-of-Sample Predictive Power ($R^2_{OS}$)**

Let's first look at the out-of-sample explanatory power of our model via the $R^2_{OS}$ in Campbell and Thompson (2008). The construction of this measure is identical to what we did in Lesson 2 of the module. Please refer there if there are doubts in this regard.
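
For reference, the measure as it is computed in the code of this lesson can be summarized as follows. Here $r_t$ is the realized 60-day return in the test sample, $\hat{r}_t$ is the model's prediction, and $\bar{r}$ is the average return in the training sample (stored in `mean_ret`). The notation is ours and is chosen only for this summary:

$$
R^2_{OS} = 1 - \frac{\sum_{t} (r_t - \hat{r}_t)^2}{\sum_{t} (r_t - \bar{r})^2}
$$

A positive value means that the model's predictions have a lower squared error in the test sample than simply predicting the training-sample average, while a negative value means that they do worse.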

In [None]:
values = scaler_output.inverse_transform(y_test)

y_pred = model.predict(X_test)
y_pred = scaler_output.inverse_transform(y_pred)

In [None]:
y_pred.shape

In [None]:
def R2_campbell(y_true, y_predicted, mean_ret):
    y_predicted = y_predicted.reshape((-1,))
    sse = sum((y_true - y_predicted) ** 2)
    tse = sum((y_true - mean_ret) ** 2)
    r2_score = 1 - (sse / tse)
    return r2_score


R2_Campbell = R2_campbell(values.flatten(), y_pred.flatten(), mean_ret)

print("R2 (Campbell): ", R2_Campbell)

Now, our $R^2_{OS}$ is actually a big improvement over the linear regression case. Indeed, these kinds of numbers are in the neighborhood of what is published in leading finance journals in terms of return predictability.

**Note that there is some random weight initialization (among other things) going on when training the model, so if you run this model without setting the seed you will get different numbers. This will also impact the trading strategy returns. Feel free to try it!**

Still, the predictability is not very high, so let's see if we can actually make money (relative to the buy-and-hold case) by following a trading strategy based on our model's predictions.

### **3.2. Backtesting Momentum Timing**

As before, this section is essentially identical in construction to the one from Lesson 2, with the only changes coming from the predictions delivered by our MLP-based model. If you have doubts about how to perform some of the steps in this section, please go back to Lesson 2 of the module.

- **What do predicted versus real returns in a test set look like?**

In [None]:
df_predictions = pd.DataFrame(
    {
        "Date": test_time.flatten(),
        "Pred": y_pred.flatten(),
        "Ret": (Ret_vector.flatten() / 100),
        "Values": values.flatten(),
    }
)
df_predictions.tail()

In [None]:
# "Date" already holds datetime values, so no format string is needed
df_predictions.Date = pd.to_datetime(df_predictions.Date)
df = df_predictions
df.tail()

In [None]:
import matplotlib.pyplot as plt

fig = plt.figure(figsize=(12, 6))
ax = plt.gca()
df.plot(x="Date", y="Values", color="red", label="Real Stock Return", ax=ax)
df.plot(x="Date", y="Pred", color="blue", label="Predicted Returns", ax=ax)
plt.xlabel("Time")
plt.ylabel("Stock Return")
plt.legend()
plt.show()

As you can see, the predicted returns are much less volatile (noisy) in the case of the MLP model than in the linear regression case. This has to do with the enhanced predictive power of MLPs. Let's now see whether this predictive power is enough to effectively time momentum returns.

- **Momentum timing strategy**

In [None]:
df["Positions"] = df["Pred"].apply(np.sign)
df["Strat_ret"] = df["Positions"].shift(1) * df["Ret"]
df["Positions_L"] = df["Positions"].shift(1)
df.loc[df["Positions_L"] == -1, "Positions_L"] = 0  # Long-only: shorts become no position
df["Strat_ret_L"] = df["Positions_L"] * df["Ret"]
df["CumRet"] = df["Strat_ret"].expanding().apply(lambda x: np.prod(1 + x) - 1)
df["CumRet_L"] = df["Strat_ret_L"].expanding().apply(lambda x: np.prod(1 + x) - 1)
df["bhRet"] = df["Ret"].expanding().apply(lambda x: np.prod(1 + x) - 1)

Final_Return_L = np.prod(1 + df["Strat_ret_L"]) - 1
Final_Return = np.prod(1 + df["Strat_ret"]) - 1
Buy_Return = np.prod(1 + df["Ret"]) - 1

print("Strat Return Long Only =", Final_Return_L * 100, "%")
print("Strat Return =", Final_Return * 100, "%")
print("Buy and Hold Return =", Buy_Return * 100, "%")
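
To make the mechanics of the timing rule easier to follow, here is a toy version with NumPy on four made-up observations (the predictions and returns are invented, and setting the first lagged position to zero is an assumption of this sketch). Each position is the sign of the prediction, it is applied with a one-period lag, and the long-only variant replaces short positions with no position.

```python
import numpy as np

# Hypothetical predictions and realized daily returns (decimals, illustrative only)
pred = np.array([0.02, -0.01, 0.03, -0.02])
ret = np.array([0.01, -0.02, 0.01, 0.03])

# Position taken from the sign of the prediction: +1 (long) or -1 (short)
positions = np.sign(pred)

# Trade with a one-period lag; assumption: no position on the first day
lagged = np.concatenate(([0.0], positions[:-1]))
lagged_long_only = np.where(lagged == -1, 0.0, lagged)

strat_ret = lagged * ret
strat_ret_long_only = lagged_long_only * ret

# Compounded returns over the whole period
total_long_short = np.prod(1 + strat_ret) - 1
total_long_only = np.prod(1 + strat_ret_long_only) - 1
total_buy_hold = np.prod(1 + ret) - 1

print(total_long_short, total_long_only, total_buy_hold)
```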

In [None]:
import matplotlib.pyplot as plt

fig = plt.figure(figsize=(12, 6))
ax = plt.gca()
df.plot(x="Date", y="bhRet", label="Buy&Hold", ax=ax)
df.plot(x="Date", y="CumRet_L", label="Strat Only Long", ax=ax)
df.plot(x="Date", y="CumRet", label="Strat Long/Short", ax=ax)
plt.xlabel("date")
plt.ylabel("Cumulative Returns")
plt.grid()
plt.show()

df.describe()

Sadly, as you can see, our strategy is not really able to time momentum factor returns. Arguably, our MLP does a much better job than the linear regression, but we still have a long way to go before we can outperform the buy-and-hold strategy. This is not really surprising, since momentum has been one of the most profitable factors ever!


## **4. Conclusion**

Hopefully, you are not too disappointed by the failure of our trading strategies. Some of you may already be familiar with deep learning applications in other fields, where DL models achieve a remarkable performance and predictive power. Unfortunately, predicting the stock market is not as easy a task for an algorithm as recognizing and classifying pictures into groups of cats and dogs. 

Luckily for us, there is still a lot that can be done to improve the predictive performance of our algorithms in a financial market setting: hyperparameter tuning, denser and more complex networks, feeding data from other sources, etc. We will deal with all these issues (and more) in the upcoming Deep Learning in Finance course.

See you there!

---
Copyright 2024 WorldQuant University. This
content is licensed solely for personal use. Redistribution or
publication of this material is strictly prohibited.
