In [2]:
import numpy as np
import pandas as pd
%matplotlib inline
import matplotlib.pyplot as plt

# Machine Learning 

The basic issue in finance is that we want to know how expected returns move around, but we only observe realized returns.

We can compile lots and lots of information/data about different assets.

We saw how to run an OLS regression of returns on a large set of characteristics (I think it was 30).

But we didn't even consider interactions among them. For example, the value characteristic might carry different information about returns for small versus big stocks. Considering all these pairwise interactions would lead us to estimate up to 900 (30 × 30) coefficients. And of course there are potentially many more characteristics, and their lags, that could be informative about expected returns and co-movement.

You can see that you run out of data very quickly.

This is where recent advances in machine learning can be super useful.

At the end of the day, we want to estimate a function $F$ that maps observed characteristics into future returns:


$$R_{t+1}=F(X_t)$$

This function can be linear

$$R_{t+1}=BX_t$$


or linear in the characteristics and their interactions

$$R_{t+1}=BX_t+C (X_t \otimes X_t)$$

or have even higher-order or non-linear relationships.
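To see how quickly the interaction specification grows, here is a minimal sketch that builds the $X_t \otimes X_t$ block for a toy data set. The sizes and the random data are assumptions for illustration only; they are not our characteristics data.

```python
import numpy as np

rng = np.random.default_rng(0)
n_obs, n_chars = 5, 3   # toy sizes (assumption); the text works with about 30 characteristics
X = rng.standard_normal((n_obs, n_chars))

# Row-wise Kronecker product: every pairwise product x_i * x_j for each observation
XX = np.einsum("ni,nj->nij", X, X).reshape(n_obs, n_chars * n_chars)

# Stack levels and interactions into one design matrix [X, X (x) X]
Z = np.hstack([X, XX])
print(Z.shape)  # 3 linear terms + 9 interaction terms per row
```

With 30 characteristics the interaction block alone has 30 × 30 columns, which is where the 900 coefficients come from.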

This is where the tools of machine learning can be useful to us.

We will now discuss a few of the most used methods:


- **Lasso Regression** (L1 regularization)
- **Neural Network Regression** (customizable number of layers)
- **Random Forest Regression**
- **Gradient Boosted Regression Trees (GBRT)**
- **Elastic Net Regression** (combination of L1 and L2 regularization)

We will apply those to our data set.

We will have a training/estimation sample (1972-1992), a tuning sample (1992-2002), and a test sample (2002-2016).
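A chronological split like this can be sketched as follows. The toy monthly panel, the column name `date`, the helper `split_by_date`, and the choice to assign each boundary year (1992, 2002) to the later sample are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Toy monthly data standing in for the real panel (hypothetical column names)
dates = pd.date_range("1972-01-01", "2016-12-01", freq="MS")
toy = pd.DataFrame({"date": dates, "ret": rng.standard_normal(len(dates))})

def split_by_date(data, train_end="1992-01-01", tune_end="2002-01-01"):
    """Split chronologically: train < train_end <= tune < tune_end <= test."""
    train_end, tune_end = pd.Timestamp(train_end), pd.Timestamp(tune_end)
    train = data[data["date"] < train_end]
    tune = data[(data["date"] >= train_end) & (data["date"] < tune_end)]
    test = data[data["date"] >= tune_end]
    return train, tune, test

train, tune, test = split_by_date(toy)
print(len(train), len(tune), len(test))  # 240 120 180 months
```

Splitting by date rather than at random keeps future information out of the estimation sample.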



### 1. **Lasso Regression**
Lasso (Least Absolute Shrinkage and Selection Operator) regression is a linear regression model with L1 regularization. It minimizes the following objective:

$$
\min_{\beta} \left( \frac{1}{2n} \sum_{i=1}^n (y_i - X_i^\top \beta)^2 + \alpha \|\beta\|_1 \right)
$$

- **Key Characteristics**:
  - Shrinks some coefficients to exactly zero, effectively performing feature selection.
  - Useful for sparse models where only a subset of predictors are important.
  - Struggles with multicollinearity, as it tends to arbitrarily select one among correlated predictors.
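The selection property is easy to see on simulated data. In this sketch only 3 of 20 predictors matter; the data-generating process and the grid of `alpha` values are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p = 500, 20
X = rng.standard_normal((n, p))
beta_true = np.zeros(p)
beta_true[:3] = [1.0, -0.5, 0.25]   # only the first 3 predictors matter
y = X @ beta_true + 0.5 * rng.standard_normal(n)

# Count how many coefficients survive as the penalty grows
counts = {}
for alpha in [0.001, 0.05, 0.5]:
    fit = Lasso(alpha=alpha).fit(X, y)
    counts[alpha] = int(np.sum(fit.coef_ != 0))
    print(f"alpha={alpha}: {counts[alpha]} non-zero coefficients")
```

A larger `alpha` sets more coefficients exactly to zero; with a very large penalty even some true predictors are dropped.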



In [None]:
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Assuming df is your dataframe
# Extract Y (excess return) and X (characteristics)
X = df.iloc[:, 3:]  # Characteristics (columns after the first 3)
Y = df.iloc[:, 2]   # Excess return (3rd column)

# Split before scaling so the test data does not leak into the scaler.
# Assumption: df is sorted by date, so shuffle=False makes the test set
# the most recent 30% of rows (the 30% share is an arbitrary choice).
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.3, shuffle=False
)

# Standardize the features using training-sample means and standard deviations
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Perform Lasso regression
lasso = Lasso(alpha=0.1)  # You can adjust the alpha (regularization strength)
lasso.fit(X_train, Y_train)

# Coefficients and intercept
print("Lasso Coefficients:", lasso.coef_)
print("Intercept:", lasso.intercept_)

# Optional: Evaluate on test data
score = lasso.score(X_test, Y_test)
print("R^2 on Test Data:", score)

# Optional: Predicted values
Y_pred = lasso.predict(X_test)
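Rather than fixing `alpha` by hand, we can use the tuning sample to pick it. The sketch below does this on simulated data: fit on the training block for each candidate `alpha`, keep the one with the lowest tuning-sample MSE, and only then look at the test block. The data, the block sizes, and the `alpha` grid are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(1)
n, p = 600, 10
X = rng.standard_normal((n, p))
y = 0.5 * X[:, 0] + rng.standard_normal(n)

# Stand-in for a chronological split: first 300 rows train, next 150 tune, last 150 test
X_tr, X_tu, X_te = X[:300], X[300:450], X[450:]
y_tr, y_tu, y_te = y[:300], y[300:450], y[450:]

# Fit on the training block, score each alpha on the tuning block
alphas = [0.001, 0.01, 0.05, 0.1, 0.5]
tune_mse = {}
for a in alphas:
    model = Lasso(alpha=a).fit(X_tr, y_tr)
    tune_mse[a] = mean_squared_error(y_tu, model.predict(X_tu))

best_alpha = min(tune_mse, key=tune_mse.get)

# The test block is touched only once, with the chosen alpha
final = Lasso(alpha=best_alpha).fit(X_tr, y_tr)
test_mse = mean_squared_error(y_te, final.predict(X_te))
print("best alpha:", best_alpha, "test MSE:", round(test_mse, 3))
```

The same loop works for any hyperparameter of the other methods in this section.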



---



### 2. **Neural Network Regression**
A neural network is a flexible, non-linear model that uses layers of neurons to approximate complex relationships between inputs ($X$) and outputs ($Y$). The simplest form of a feedforward neural network can be expressed as:

$$
\hat{y} = f\left(W^{[L]} \sigma\left(W^{[L-1]} \dots \sigma\left(W^{[1]} X + b^{[1]}\right) + b^{[L-1]}\right) + b^{[L]}\right)
$$

Here $W^{[l]}$ and $b^{[l]}$ are the weights and biases of layer $l$, $L$ is the number of layers, $\sigma$ is the hidden-layer activation, and $f$ is the output function (the identity for regression).

- **Key Characteristics**:
  - Consists of an input layer, hidden layers, and an output layer.
  - Activation functions ($\sigma$, e.g., ReLU or sigmoid) introduce non-linearity.
  - The number of layers and neurons can be tuned to fit data complexity.
  - Requires careful tuning of hyperparameters (e.g., learning rate, number of layers, epochs).
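As a minimal sketch, here is a small feedforward network fit with scikit-learn's `MLPRegressor` on simulated data where the target is a pure interaction, so a linear model has nothing to pick up. The data, the two hidden layers of 32 and 16 neurons, and the other settings are assumptions for illustration, not tuned choices.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(1000, 2))
y = X[:, 0] * X[:, 1] + 0.1 * rng.standard_normal(1000)  # pure interaction

X_train, X_test = X[:800], X[800:]
y_train, y_test = y[:800], y[800:]

# hidden_layer_sizes sets the number of hidden layers and neurons per layer
net = make_pipeline(
    StandardScaler(),
    MLPRegressor(hidden_layer_sizes=(32, 16), activation="relu",
                 learning_rate_init=0.01, max_iter=2000, random_state=0),
)
net.fit(X_train, y_train)

lin = LinearRegression().fit(X_train, y_train)

r2_net = net.score(X_test, y_test)
r2_lin = lin.score(X_test, y_test)
print("network R^2:", round(r2_net, 3), "linear R^2:", round(r2_lin, 3))
```

Changing the tuple passed to `hidden_layer_sizes` is how the number of layers is customized.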

---

### 3. **Random Forest Regression**
Random Forest is an ensemble method that combines multiple decision trees to make predictions. Each tree is trained on a bootstrap sample of the data, and predictions are averaged:

$$
\hat{y} = \frac{1}{T} \sum_{t=1}^T h_t(X)
$$

Where $h_t(X)$ is the prediction of the $t$-th tree.

- **Key Characteristics**:
  - Reduces overfitting by averaging predictions across trees.
  - Handles non-linear relationships and interactions between features well.
  - Relatively robust to noisy data and outliers.
  - Does not extrapolate beyond the range of the training data.
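Two of these points can be checked directly: the forest prediction is the average of the individual tree predictions, and predictions never leave the range of the training targets. The simulated data and the number of trees below are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(500, 1))
y = 2.0 * X[:, 0] + 0.1 * rng.standard_normal(500)

forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# 0.5 is inside the training range of X, 3.0 is far outside it
X_new = np.array([[0.5], [3.0]])

# Average the individual trees by hand and compare with forest.predict
tree_preds = np.stack([tree.predict(X_new) for tree in forest.estimators_])
avg = tree_preds.mean(axis=0)
forest_pred = forest.predict(X_new)

print("matches tree average:", np.allclose(avg, forest_pred))
print("predictions:", forest_pred, "max training y:", y.max())
```

The true relationship would give 6 at $X = 3$, but the forest stays at or below the largest return it saw in training.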

---

### 4. **Gradient Boosted Regression Trees (GBRT)**
GBRT is an ensemble technique that builds trees sequentially, where each new tree corrects the errors of the ensemble built so far. The prediction is updated iteratively:

$$
\hat{y}_t(X) = \hat{y}_{t-1}(X) + \nu \cdot g_t(X)
$$

Where:
- $g_t(X)$: The tree fitted at step $t$ to the negative gradient of the loss function with respect to the current predictions (for squared error, the residuals $y - \hat{y}_{t-1}(X)$).
- $\nu$: Learning rate, controlling the contribution of each tree.

- **Key Characteristics**:
  - Optimizes a differentiable loss function (e.g., squared error for regression).
  - Can capture complex, non-linear patterns in the data.
  - Requires careful tuning of hyperparameters (e.g., learning rate, number of trees, maximum tree depth).
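The update rule can be written out by hand for squared-error loss, where the negative gradient is just the current residual. This is a minimal sketch on simulated data, not a replacement for a library implementation; the data, the tree depth, the learning rate, and the number of trees are assumptions for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(400, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(400)

nu = 0.1        # learning rate
n_trees = 100
pred = np.full_like(y, y.mean())          # start from the sample mean
mse_path = [np.mean((y - pred) ** 2)]

for t in range(n_trees):
    residual = y - pred                   # negative gradient of 0.5 * (y - pred)^2
    tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, residual)
    pred = pred + nu * tree.predict(X)    # y_hat_t = y_hat_{t-1} + nu * g_t(X)
    mse_path.append(np.mean((y - pred) ** 2))

print("in-sample MSE, start:", round(mse_path[0], 4), "end:", round(mse_path[-1], 4))
```

In-sample error falls with every tree, which is exactly why the number of trees and the learning rate should be chosen on the tuning sample rather than the training sample.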

---

### 5. **Elastic Net Regression**
Elastic Net combines L1 (Lasso) and L2 (Ridge) regularization to balance feature selection and multicollinearity handling. The objective function is:

$$
\min_{\beta} \left( \frac{1}{2n} \sum_{i=1}^n (y_i - X_i^\top \beta)^2 + \alpha_1 \|\beta\|_1 + \alpha_2 \|\beta\|_2^2 \right)
$$

Where:
- $\|\beta\|_1$: Lasso penalty, which encourages sparsity.
- $\|\beta\|_2^2$: Ridge penalty, which shrinks coefficients and stabilizes the estimates when predictors are correlated.

- **Key Characteristics**:
  - Balances Lasso's feature selection and Ridge's stability with correlated predictors.
  - Usually parameterized by two hyperparameters, with $\alpha_1 = \alpha\rho$ and $\alpha_2 = \alpha(1-\rho)/2$ (the convention scikit-learn uses):
    - $\alpha$: Overall regularization strength.
    - $\rho$ (mixing ratio): Balance between L1 and L2 penalties; $\rho = 1$ is Lasso and $\rho = 0$ is Ridge.
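The behavior with correlated predictors can be illustrated on simulated data with two nearly identical predictors. In scikit-learn the mixing ratio is the `l1_ratio` argument of `ElasticNet`. The data-generating process and the penalty values below are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso

rng = np.random.default_rng(0)
n = 500
z = rng.standard_normal(n)

# Two nearly identical predictors plus one irrelevant predictor
X = np.column_stack([
    z + 0.01 * rng.standard_normal(n),
    z + 0.01 * rng.standard_normal(n),
    rng.standard_normal(n),
])
y = X[:, 0] + X[:, 1] + 0.5 * rng.standard_normal(n)

lasso = Lasso(alpha=0.1).fit(X, y)
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)

print("Lasso coefficients:      ", lasso.coef_)
print("Elastic Net coefficients:", enet.coef_)
```

The ridge part of the penalty pushes the Elastic Net to split the weight about evenly across the correlated pair, while the Lasso split between them is close to arbitrary and often loads mostly on one.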

---

### Summary Table

| **Model**                  | **Type**       | **Key Strengths**                                        | **Limitations**                            |
|----------------------------|----------------|----------------------------------------------------------|--------------------------------------------|
| Lasso Regression           | Linear         | Feature selection, interpretable coefficients           | Struggles with multicollinearity           |
| Neural Network Regression   | Non-linear     | Flexible, captures complex patterns                     | Requires significant tuning and data       |
| Random Forest Regression    | Non-linear     | Robust to overfitting, handles feature interactions well | Computationally expensive for large data   |
| GBRT                       | Non-linear     | Accurate, optimizes for specific loss functions         | Sensitive to hyperparameters, overfitting  |
| Elastic Net Regression      | Linear         | Handles multicollinearity, balances selection & stability | Can be slower than Ridge or Lasso          |
