%run Latex_macros.ipynb
%run beautify_plots.py


$$
\newcommand{\R}{\mathbf{R}}
\newcommand{\r}{\mathbf{r}}
\newcommand{\F}{\mathbf{F}}
\newcommand{\V}{\mathbf{V}}
\newcommand{\ntickers}{{n_\text{tickers}}}
\newcommand{\ndates}{{n_\text{dates}}}
\newcommand{\nfactors}{{n_\text{factors}}}
\newcommand{\nchars}{{n_\text{chars}}}
\newcommand{\dp}{{(d)}}
\newcommand{\sp}{{(s)}}
\newcommand{\Bbeta}{\mathbf\beta}
$$

# Factor models via Autoencoders

A clever way of using Neural Networks to solve a familiar but important problem in Finance
was proposed by [Gu, Kelly, and Xiu, 2019](https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3335536).

It extends the Factor Model framework of Finance by using a dimensionality reduction tool
from Deep Learning, the Autoencoder, to find the factors.

You can find [code](https://github.com/stefan-jansen/machine-learning-for-trading/blob/main/20_autoencoders_for_conditional_risk_factors/06_conditional_autoencoder_for_asset_pricing_model.ipynb)
for this model as part of the excellent book by [Stefan Jansen](https://github.com/stefan-jansen/machine-learning-for-trading/blob/main/20_autoencoders_for_conditional_risk_factors/06_conditional_autoencoder_for_asset_pricing_model.ipynb)
- [Github](https://github.com/stefan-jansen/machine-learning-for-trading)
- In order to run the code notebook, you first need to run a notebook for [data preparation](https://github.com/stefan-jansen/machine-learning-for-trading/blob/main/20_autoencoders_for_conditional_risk_factors/05_conditional_autoencoder_for_asset_pricing_data.ipynb)
    - This notebook relies on files created by notebooks from earlier chapters of the book
    - So, if you want to run the code, you have a lot of preparatory work ahead of you
    - Even without running it, try to take away the ideas and the coding techniques

# Factor Model review

We will begin with a quick review/introduction to Factor Models in Finance.


First, some necessary notation:
- $\r^{(d)}_s$: Return of ticker $s$ on day $d$.
- $\hat\r^{(d)}_s$: approximation of $\r^{(d)}_s$

- $n_\text{tickers}$: **large** number of tickers
- $n_\text{dates}$: number of dates
- $n_\text{factors}$: **small** number of factors: independent variables (features) in our approximation
- Returns matrix $\R$ indexed by *date*
    - $\R: (n_\text{dates} \times n_\text{tickers})$
    - $|| \R^\dp || = n_\text{tickers}$
        - $\R^\dp$ is vector of returns for each of the $\ntickers$ on date $d$

- $\r$ will denote a vector of single day returns: $\R^\dp$ for some date $d$

**Notation summary**

term &nbsp; &nbsp;  &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;| meaning
:---|:---
$s$ | ticker
$\ntickers$ | number of tickers
$d$ | date
$\ndates$ | number of dates
$\nchars$   | number of characteristics per ticker
$m$ | number of examples
    | $m = \ndates$
$i$ | index of example
    | There will be one example per date, so we use $i$ and $d$ interchangeably.
$ [ \X^\ip, \R^\ip ]$ | example $i$
         | $|| \X^\ip || = (\ntickers \times  \nchars )$ 
         | $|| \R^\ip || = \ntickers$
$\X^\dp_s$ | vector of ticker $s$'s characteristics on day $d$
             | $ || \X^\dp_s || = \nchars$


**Note**

The paper actually seeks to predict $\hat\r^{(d+1)}_s$ (forward return) rather than approximate
the current return $\hat\r^\dp_s$.

We will present this as an approximation problem as opposed to a prediction problem for
simplicity of presentation (i.e., to include PCA as a model).

A **factor model** seeks to approximate/explain the return of a *number* of tickers in terms of common "factors" $\F$
- $\F: (\ndates \times \nfactors)$
$$
\begin{array}{lll}
\R^\dp_1 & = & \Bbeta^\dp_1 \cdot \F^\dp  + \epsilon_1\\
\vdots \\
\R^\dp_\ntickers & = & \Bbeta^\dp_\ntickers \cdot \F^\dp + \epsilon_\ntickers \\
\end{array}
$$
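
Stacking the per-ticker equations gives an equivalent matrix form for a single date $d$. The shape annotations below are a restatement of the definitions above rather than anything additional from the paper; $\epsilon$ here denotes the stacked vector of the per-ticker errors.

$$
\underbrace{\R^\dp}_{(\ntickers \times 1)}
=
\underbrace{\Bbeta^\dp}_{(\ntickers \times \nfactors)} \,
\underbrace{\F^\dp}_{(\nfactors \times 1)}
+
\underbrace{\epsilon}_{(\ntickers \times 1)}
$$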


There are several ways to create a factor model.

## Pre-defined factors, solve for sensitivities

First: suppose $\F$ is given
- For each date $d$, returns for: market, several industries, large/small cap
- Solve for $\Bbeta_s$, for each $s$
    - $\ntickers$ separate Linear Regression models: $\langle \X^\dp, \y^\dp \rangle = \langle \F^\dp, \r^\dp_s \rangle$
    - Regression of the time-series of a ticker's returns against the time-series of Factor returns
    - Solve for $\Bbeta_s$
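
A minimal sketch of this time-series regression on synthetic data, using NumPy's least-squares solver. The sizes, random seed, and noise level are illustrative assumptions; no intercept is fitted, to match the factor model equation.

```python
import numpy as np

rng = np.random.default_rng(0)
n_dates, n_tickers, n_factors = 500, 4, 2   # illustrative sizes (assumed)

# Synthetic "given" factor returns F and the hidden true sensitivities
F = rng.normal(size=(n_dates, n_factors))
beta_true = rng.normal(size=(n_tickers, n_factors))
R = F @ beta_true.T + 0.01 * rng.normal(size=(n_dates, n_tickers))

# One time-series regression per ticker s:
# features are the factor returns F, target is the ticker's return series R[:, s]
beta_hat = np.vstack([
    np.linalg.lstsq(F, R[:, s], rcond=None)[0]
    for s in range(n_tickers)
])

print(beta_hat.shape)   # (n_tickers, n_factors)
max_abs_err = np.abs(beta_hat - beta_true).max()
```

Each row of `beta_hat` is one ticker's estimated sensitivities, constant across dates.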

## Pre-defined sensitivities, solve for factors

Alternately: suppose $\Bbeta$ is given
- For each ticker $s$, the sensitivity $\Bbeta_{s,j}$ of $s$ to factor $j$
- Solve for $\F^\dp$, for each $d$
    - $\ndates$ separate Linear Regression models $\langle \X^\sp, \y^\sp \rangle = \langle \Bbeta_s, \r^\dp_s \rangle$
    - Regression of *cross-section* of tickers returns against a cross-section of ticker sensitivities
    - Solve for $\F^\dp$
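
The cross-sectional case can be sketched the same way, again on synthetic data with assumed sizes and noise level. Here the sensitivities are the features and each date contributes one regression.

```python
import numpy as np

rng = np.random.default_rng(1)
n_dates, n_tickers, n_factors = 5, 200, 3   # illustrative sizes (assumed)

# Synthetic "given" sensitivities beta and the hidden true factor returns
beta = rng.normal(size=(n_tickers, n_factors))
F_true = rng.normal(size=(n_dates, n_factors))
R = F_true @ beta.T + 0.01 * rng.normal(size=(n_dates, n_tickers))

# One cross-sectional regression per date d:
# features are the ticker sensitivities beta, target is the day's returns R[d]
F_hat = np.vstack([
    np.linalg.lstsq(beta, R[d], rcond=None)[0]
    for d in range(n_dates)
])

print(F_hat.shape)   # (n_dates, n_factors)
```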

## Solve for sensitivities and factors: PCA

Yet another possibility: solve for $\Bbeta$ and $\F$ *simultaneously*.

Recall Principal Components
- Representing $\X$ (with "standard" basis vectors) via an *alternate basis* $\V$
$$\X = \tilde\X \V^T$$

In this case without dimensionality reduction:
$$
\R = \tilde\R \V^T
$$

where
$$
\begin{array}{l}
\R, \tilde\R: (\ndates \times n_\text{tickers}) \\
\V^T: (n_\text{tickers} \times n_\text{tickers} ) \\
\end{array}
$$

With dimensionality reduced from $n_\text{tickers}$ to $n_\text{factors}$

$\R = \F \, \Bbeta^T$

- $\F: (\ndates \times \nfactors)$
    - is $\tilde\R$ with columns eliminated b/c of dimensionality reduction
- $\Bbeta^T: (  \nfactors \times \ntickers)$  
    - so $\Bbeta^\sp$ are sensitivities of $s$ to factors

- Solve for $\F, \Bbeta$ simultaneously

The daily observation of $\ntickers$ returns $\R^\dp$ is replaced by $\nfactors$ returns $\F^\dp$
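
A minimal sketch of the PCA factor model computed through the SVD, assuming synthetic returns with a low-rank structure plus noise (sizes, seed, and noise level are illustrative). The variable names `R_tilde`, `F`, and `beta_T` mirror $\tilde\R$, $\F$, and $\Bbeta^T$ in the text.

```python
import numpy as np

rng = np.random.default_rng(2)
n_dates, n_tickers, n_factors = 250, 20, 3   # illustrative sizes (assumed)

# Synthetic returns: rank-n_factors signal plus small noise
R = rng.normal(size=(n_dates, n_factors)) @ rng.normal(size=(n_factors, n_tickers))
R += 0.05 * rng.normal(size=R.shape)
R -= R.mean(axis=0)                 # PCA works on centered columns

# SVD: R = U diag(S) Vt, where the rows of Vt form the alternate basis V^T
U, S, Vt = np.linalg.svd(R, full_matrices=False)

# Without dimensionality reduction: R = R_tilde V^T exactly
R_tilde = R @ Vt.T                  # (n_dates x n_tickers)
assert np.allclose(R, R_tilde @ Vt)

# With dimensionality reduction: keep only the first n_factors components
F = R_tilde[:, :n_factors]          # (n_dates x n_factors)
beta_T = Vt[:n_factors]             # (n_factors x n_tickers)
R_hat = F @ beta_T                  # approximation of R

rel_err = np.linalg.norm(R - R_hat) / np.linalg.norm(R)
```

Note that `beta_T` is a single matrix for the whole sample: the sensitivities do not vary by date.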

# This paper

This paper will create a factor model that
- Solves for $\F, \Bbeta$ simultaneously
    - like PCA
- **But** where $\F$ and $\Bbeta$ are defined by Neural Networks

## Autoencoder

The paper refers to the model as a kind of Autoencoder.

Let's review the topic.
    
Training examples $\langle \X^\dp, \y^\dp \rangle = \langle \R^\dp, \R^\dp \rangle$

A standard Autoencoder has no obvious form as a factor model
- $\R^\dp = \r$ 
    - mapped by Encoder to latent $\z$ (of length $\nfactors$)
    - latent $\z$ mapped to $\r$ by Decoder

Imagine instead creating an "Autoencoder" that worked as follows
- Maps $\R^\dp$ to $\Bbeta^\dp$
    - the sensitivity of each of the $\ntickers$ tickers on day $d$ to the day $d$ returns $\F^\dp$ of the $\nfactors$ factors
- Maps $\R^\dp$ to $\F^\dp$: the day $d$ returns of the $\nfactors$ factors
- Outputs $\hat\y^\dp = \Bbeta^\dp \, \F^\dp$

It acts as an Autoencoder in the sense that the training examples are $\langle \X^\dp, \y^\dp \rangle = \langle \R^\dp, \R^\dp \rangle$
- But constrains $\hat\y^\dp = \hat\R^\dp$ to the form $\hat\R^\dp = \Bbeta^\dp \, \F^\dp$


This model solves for $\Bbeta^\dp, \F^\dp$ simultaneously
- almost what  PCA does **but**, in PCA,  $\Bbeta$ does not vary by day
- this model: the beta of a ticker $s$ to a factor $j$ changes by day $d$ !


This paper goes one step further than the standard Autoencoder
- Standard Autoencoder maps $\R^\dp$ to $\Bbeta^\dp$
- This paper allows $\nchars \ge 1$ daily *characteristics* $\X^\dp$ to map to $\Bbeta^\dp$
    - one characteristic may be $\R^\dp$

$\beta^\dp_s =  \text{NN}( \X^\dp_s ; \W_\Bbeta )$
- $\beta^\dp_s$ 
    - parameterized by weights $\W_\Bbeta$
    - is only a function of the characteristics of $s$
    - **not** a function of *other* ticker $s'$ characteristics $\X_{s'}$, as in PCA
- $\beta^\dp_{s}$ share the same weights $\W_\beta$ for all $s, d$
    - unlike the pre-defined factor case, where we solve for a $\beta_s$ that is
        - different for each $s$
        - the same for each day $d$


### This model: nothing pre-defined, solve for sensitivities and factors
- Simultaneously solve for $\beta^\dp_s$ and $\F^\dp$
    - $\Bbeta^\dp_s$ constrained: 
    $$
\beta^\dp_s = \text{Dense} \,( \nfactors ) ( \X^\dp_s )
$$
        - combination of ticker-specific, time-varying characteristics $\X^\dp_s$
        - we solve for the *combining weights*
            - shared by all tickers and dates
    - $\F^\dp$ constrained
    $$
\F^\dp = \text{Dense} \,( \nfactors ) ( \R^\dp )
$$

        - combination of time-varying *raw returns* $\R^\dp$ 
        - we solve for *combining weights*
            - shared by all dates

<table>
    <tr>
        <th><center>Autoencoder for Conditional Risk Factors</center></th>
    </tr>
    <tr>
        <td><img src="images/Autoencoder_for_conditional_risk_factors.png" width="90%"></td>
    </tr>
</table>


## This paper
- $\hat\r^\dp = \Bbeta^\dp \, \F^\dp$
    - $\hat\r^\dp$ shape is $(n_\text{tickers} \times 1)$
    - $\Bbeta^\dp$ shape is $(n_\text{tickers} \times n_\text{factors})$
    - $\F^\dp$ shape is $(n_\text{factors} \times 1)$
- Solve simultaneously for $\Bbeta^\dp, \F^\dp$, where $\beta^\dp_s =  f( \X^\dp_s )$
    - $\beta^\dp_s$ is only a function of the characteristics of $s$
    - **not** $f( \r^\dp)$: the simultaneous returns of *other* $s'$ as in PCA
    - $\beta^\dp_{s}$ share the same $\W_\beta$ for all $s, d$
        - unlike the pre-defined factor case, where we solve for a $\beta_s$ that is
            - different for each $s$
            - the same for each day $d$
- and where $\F^\dp = g(\r^\dp)$ for $g$ fixed over all $d$
    - like PCA


# Input side of network





## Input $\X$

$
\X : ( \ndates \times \ntickers \times \nchars )
$

$|| \X^\dp || =  (\ntickers \times \nchars)$
- one example per date
- example shape is $\ntickers \times \nchars$

## Dense $\beta$

- `Dense` $( \nfactors )$
    - `Dense`( $\nfactors ) :  (\ntickers \times \nchars) \mapsto (\ntickers \times \nfactors) $
    - threads over ticker dimension ([see](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Dense))
        - tickers share same weights
        - single `Dense`( $\nfactors )$ **not** $\ntickers$ copies of `Dense`$( \nfactors )$
         
- $\W_\beta: ( \nfactors \times \nchars )$
    - same across all $d, s$
    - $W_\Bbeta^\dp = W_\Bbeta^{(d')}$ like any other training (same weights for every example)
    - $W^\dp_{\Bbeta,(s)} = W^{(d)}_{\Bbeta,(s')}$: transformation of characteristics to beta *independent* of ticker 
    - hence, size of $\W_\beta$ is $( \nfactors \times \nchars )$



$$
\beta^\dp = \text{Dense} \,( \nfactors ) ( \X^\dp )
$$

$$ || \Bbeta^\dp || = ( \ntickers \times \nfactors )$$
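
The weight sharing can be illustrated with a NumPy stand-in for the layer. This is a sketch under assumptions: the layer is taken to be purely linear (no activation), the weights are random rather than trained, and `dense_beta` is a hypothetical helper, not part of Keras. The kernel is stored as (inputs x units), as Keras does, which is the transpose of the $(\nfactors \times \nchars)$ shape used in the text.

```python
import numpy as np

rng = np.random.default_rng(3)
n_tickers, n_chars, n_factors = 6, 4, 2     # illustrative sizes (assumed)

# A single kernel and bias, shared by every ticker and every date
W_beta = rng.normal(size=(n_chars, n_factors))
b_beta = np.zeros(n_factors)

def dense_beta(X_d):
    """Hypothetical linear Dense layer acting on the last axis of X_d."""
    return X_d @ W_beta + b_beta

X_d = rng.normal(size=(n_tickers, n_chars))  # one example: all tickers on day d
beta_d = dense_beta(X_d)                     # (n_tickers, n_factors)

# Applying the layer ticker by ticker gives the same result: the weights are shared
row_by_row = np.stack([dense_beta(X_d[s]) for s in range(n_tickers)])

# Changing ticker 0's characteristics changes only ticker 0's beta
X_alt = X_d.copy()
X_alt[0] += 1.0
beta_alt = dense_beta(X_alt)
```

The last check is the property emphasized earlier: a ticker's beta depends only on its own characteristics.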

# Factor side of network

## Input $\R$

$
\R : ( \ndates \times \ntickers )
$

$|| \R^\dp || =  (\ntickers)$
- one set of returns per date

## Dense $\delta$ (factor)

- `Dense` $( \nfactors )$
    - `Dense`( $\nfactors ) :  \ntickers \mapsto \nfactors$
- $\W_f: (\nfactors \times \ntickers )$
    - same across all $d, s$
    - $W_f^\dp = W_f^{(d')}$ like any other training (same weight for every example)
    - $\W_f$: transformation of ticker returns to factor returns 
 

$$
\F^{(d)} = \text{Dense} \,( n_\text{factors} ) ( \R^{(d)} )
$$

$$ || \F^{(d)} || =  n_\text{factors}$$
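
One way to read this layer: if it is linear and has no bias (both are assumptions made here for illustration), each factor return is the return of a fixed-weight portfolio of the tickers. A NumPy sketch with random, untrained weights:

```python
import numpy as np

rng = np.random.default_rng(4)
n_dates, n_tickers, n_factors = 3, 6, 2      # illustrative sizes (assumed)

# Kernel stored as (inputs x units); column j holds the portfolio weights of factor j
W_f = rng.normal(size=(n_tickers, n_factors)) / n_tickers

R = rng.normal(size=(n_dates, n_tickers))    # one row of ticker returns per date
F = R @ W_f                                  # (n_dates, n_factors); row d is F^(d)

# Factor returns on day 0, written out as a weighted sum of ticker returns
F0_manual = sum(R[0, s] * W_f[s] for s in range(n_tickers))
```

The same `W_f` is used for every date, so the composition of each factor portfolio is fixed over time.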

# Dot

$$\hat{\r}^\dp = \Bbeta^\dp \cdot \F^\dp$$
- The dot product sums over the factor dimension, producing one approximated return per ticker

$$ || \hat{\r}^\dp || = \ntickers $$
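
For a batch of dates, the contraction over the factor axis can be sketched with `np.einsum`. The inputs here are random placeholders with assumed sizes, standing in for the outputs of the two `Dense` layers.

```python
import numpy as np

rng = np.random.default_rng(5)
n_dates, n_tickers, n_factors = 3, 5, 2      # illustrative sizes (assumed)

beta = rng.normal(size=(n_dates, n_tickers, n_factors))  # beta^(d) for each date
F = rng.normal(size=(n_dates, n_factors))                # F^(d) for each date

# Sum over the factor axis j; keep the date axis d and the ticker axis s
r_hat = np.einsum("dsj,dj->ds", beta, F)

print(r_hat.shape)   # (n_dates, n_tickers)
```

Entry `r_hat[d, s]` equals the ordinary dot product `beta[d, s] @ F[d]`.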

# Loss


Let
$\loss^\dp_{(s)}$ denote error of ticker $s$ on day $d$.
$$
\loss^\dp_{(s)} = \r^\dp_s - \hat{\r}^\dp_s
$$
or perhaps 
$$
\loss^\dp_\sp = \r^{(d+1)}_s - \hat{\r}^\dp_s
$$


- $\loss^\dp$ is the loss of example $d$
    - this loss has $\ntickers$ sub-components
    - This appears in example $i = d: \X^\dp$
    - $\loss^\ip = \loss^\dp = \sum_s { \loss^\dp_{(s)} }$
- This is different than the loss $\loss'$ for the case where an example is a single ticker on a single day
    - $m' = n_\text{dates} * \ntickers$ examples in this case
    - $\loss'^\ip = \loss^\dp_\sp$
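
The two ways of counting examples can be compared numerically. The sketch below assumes a squared-error loss per ticker and day, which is an assumption made here for illustration: the text above defines the per-ticker term as the raw error. The returns and predictions are random placeholders.

```python
import numpy as np

rng = np.random.default_rng(6)
n_dates, n_tickers = 4, 3                    # illustrative sizes (assumed)

r = rng.normal(size=(n_dates, n_tickers))      # actual returns
r_hat = rng.normal(size=(n_dates, n_tickers))  # model output (placeholder values)

# Assumed squared error for ticker s on day d
loss_ds = (r - r_hat) ** 2

# One example per date: m = n_dates, each loss sums over the tickers
loss_per_date = loss_ds.sum(axis=1)

# One example per (date, ticker) pair: m' = n_dates * n_tickers
loss_per_pair = loss_ds.reshape(-1)
```

The totals agree; what differs is the number of examples and therefore how the losses are grouped into batches and averaged during training.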