# Adaptive Portfolio by Solving Multi-armed Bandit via Thompson Sampling

Shen and Wang (2016) presented an online portfolio algorithm that leverages Thompson sampling to mix two different strategies.

**NOTE: no risk control at all**

## Problem and modeling

In some cases, naive strategies such as the equally-weighted and value-weighted portfolios can even outperform more sophisticated ones. This motivates treating several classic strategies as the strategic arms of a multi-armed bandit, which connects the bandit setting naturally to the portfolio selection problem.

We consider a self-financing financial environment with a finite horizon and a finite number of assets. The trading periods are $t_k = k\Delta{t}, k=0,...,m$, where $\Delta{t}$ represents one day, week or month, depending on the rebalancing cycle, and $m$ is the total number of cycles of participation in the transaction. We denote the return vector of the $n$ assets from time $t_{k-1}$ to time $t_k$ as $\mathbf{R_k}$. The return $R_{k,i}$ of the i-th asset is $R_{k,i} = P_{k,i} / P_{k-1, i}$, where $P_{k-1,i}$ and $P_{k,i}$ are the prices of the i-th asset at times $t_{k-1}$ and $t_k$. Transaction fees are an important factor in the final profit, but they are left out here to keep the model simple. Even so, reducing the amount of trading remains a consideration.
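
As a small illustration of the return definition above, the following sketch builds the return vectors from a price matrix. The function name `gross_returns` and the toy prices are hypothetical and only serve the example; rows are assumed to be the times $t_0, ..., t_m$ and columns the assets.

```python
import numpy as np

def gross_returns(prices):
    """Row k-1 of the result holds R_k, i.e. P[k, i] / P[k-1, i] for every asset i."""
    prices = np.asarray(prices, dtype=float)
    return prices[1:] / prices[:-1]

# Toy prices for two assets over three dates (made-up numbers).
prices = np.array([[100.0, 50.0],
                   [110.0, 45.0],
                   [121.0, 54.0]])
R = gross_returns(prices)
print(R)  # approximately [[1.1, 0.9], [1.1, 1.2]]
```

Note that these are gross returns (price relatives): a value of 1.1 means a 10% gain, and subtracting 1 gives the net return.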

The portfolio weight vector $\mathbf{W_k}$ denotes the investment decision at time $t_k$, where $W_{k,i}$ is the allocation weight of the i-th asset in the entire portfolio.

### Strategic Arms

**Buy and hold (BH)**: This is an intuitive idea which involves doing nothing and continuing to hold the existing portfolio in this time window.
$$\begin{equation}\tag{1} \mathbf{W_k^{BH}} = \mathbf{W_{k-1}} \end{equation}$$

**Sell all (SA)**: Sell all the assets, so that the portfolio is an empty position, i.e. a pure cash position.
$$\begin{equation}\tag{2} \mathbf{W_k^{SA}} = \mathbf{0} \end{equation}$$

**Equally-weighted portfolio (EW)**: All assets are given the same weight in each rebalancing period, regardless of the asset.
$$\begin{equation}\tag{3} \mathbf{W_k^{EW}} = \frac{1}{n}\mathbf{1} \end{equation}$$
where $\mathbf{1}$ is the $n$-dimensional vector of ones.

**Value-weighted portfolio (VW)**: As a passive investment strategy, positions in each rebalancing period are allocated in proportion to the current capital held in each asset.
$$\begin{equation}\tag{4} \mathbf{W_k^{VW}}=\frac{\mathbf{W_{k-1}} \odot \mathbf{R_{k-1}}}{\mathbf{W^T_{k-1}} \mathbf{R_{k-1}}} \end{equation}$$
where $\odot$ denotes the element-wise product.
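
Equations (1) to (4) translate almost directly into array operations. The sketch below is one possible reading, not a reference implementation: the function names are hypothetical, and returning an all-cash vector when the previous portfolio holds no assets is an assumption added to avoid a division by zero.

```python
import numpy as np

def buy_and_hold(w_prev):
    """Eq. (1): keep the previous weights."""
    return np.asarray(w_prev, dtype=float).copy()

def sell_all(n):
    """Eq. (2): an empty (pure cash) position."""
    return np.zeros(n)

def equally_weighted(n):
    """Eq. (3): weight 1/n on every asset."""
    return np.full(n, 1.0 / n)

def value_weighted(w_prev, r_prev):
    """Eq. (4): previous weights grown by the returns, renormalised to sum to one."""
    grown = np.asarray(w_prev, dtype=float) * np.asarray(r_prev, dtype=float)
    total = grown.sum()
    # Assumption: stay in cash if nothing was held (the source does not cover this case).
    return grown / total if total > 0 else np.zeros_like(grown)

w_vw = value_weighted([0.5, 0.5], [1.2, 0.8])
print(w_vw)  # approximately [0.6, 0.4]
```

In the example the first asset gained 20% and the second lost 20%, so the value-weighted portfolio drifts from 50/50 to 60/40.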

**Mean-variance portfolio (MV)**: The mean-variance model is a strategy constructed in line with Markowitz's portfolio theory. It captures the trade-off between risk and return.
$$ \tag{5} \mathbf{W_k^{MV}} = \operatorname*{arg\,min}_{\mathbf{W_k^T}\mathbf{1}=1}{\mathbf{W_k^T}\Sigma_k\mathbf{W_k} - \mathbf{R_k^T}\mathbf{W_k}} $$
where $\mathbf{R_k^T}\mathbf{W_k}$ is the expected return, $\mathbf{W_k^T}\Sigma_k\mathbf{W_k}$ is the variance of the portfolio return with $\Sigma_k$ the covariance matrix of the asset returns, and the constraint $\mathbf{W_k^T}\mathbf{1}=1$ requires the weights to sum to one.
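
With only the budget constraint, the problem in Eq. (5) has a closed-form solution through a Lagrange multiplier $\lambda$: setting the gradient $2\Sigma\mathbf{W} - \mathbf{R} + \lambda\mathbf{1}$ to zero gives $\mathbf{W} = \tfrac{1}{2}\Sigma^{-1}(\mathbf{R} - \lambda\mathbf{1})$, and $\lambda$ is fixed by requiring the weights to sum to one. The sketch below assumes a positive-definite covariance matrix and that the expected returns `mu` and covariance `sigma` have already been estimated (for instance from a sliding window); the function name and the toy inputs are hypothetical. Because no sign constraint is imposed, the resulting weights may be negative (short positions).

```python
import numpy as np

def mean_variance_weights(mu, sigma):
    """Minimise w' Sigma w - mu' w subject to sum(w) = 1 (closed form)."""
    mu = np.asarray(mu, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    ones = np.ones_like(mu)
    sinv_mu = np.linalg.solve(sigma, mu)      # Sigma^{-1} mu
    sinv_one = np.linalg.solve(sigma, ones)   # Sigma^{-1} 1
    # Multiplier chosen so that the weights sum to one.
    lam = (ones @ sinv_mu - 2.0) / (ones @ sinv_one)
    return 0.5 * (sinv_mu - lam * sinv_one)

# Toy inputs (made-up numbers): two uncorrelated assets.
mu = np.array([1.02, 1.01])
sigma = np.array([[0.04, 0.00],
                  [0.00, 0.01]])
w_mv = mean_variance_weights(mu, sigma)
print(w_mv)  # approximately [0.3, 0.7]
```

The lower-variance asset receives the larger weight even though its expected return is slightly lower, which is the risk-return trade-off the objective encodes.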



### Portfolio Bandit via Thompson Sampling (PBTS)

The multi-armed bandit of the portfolio strategy is $\langle a, R_a \rangle$, where $a$ is a collection of strategic arms (classic portfolio strategies) and $R_a$ is the corresponding reward:

$$
\tag{6}
\begin{aligned}
a_k &= [a_{k,1},...,a_{k,l}] \\
a_{k,1} &= \mathbf{W_k^{BH}},\quad a_{k,2} = \mathbf{W_k^{SA}},\quad a_{k,3} = \mathbf{W_k^{EW}},\quad a_{k,4} = \mathbf{W_k^{VW}},\quad a_{k,5} = \mathbf{W_k^{MV}}
\end{aligned}
$$

At time $t_k$, a value $\theta_{k,j}$ is sampled at random for each arm $j$ from its respective Beta distribution, and the arm $j_k$ selected is the one with the largest sample:
$$ \tag{7} j_k = \operatorname*{arg\,max}_j{\theta_{k,j}}$$
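
The sampling step can be sketched with NumPy's random generator. The function name `select_arm` and the example parameters are hypothetical; `alpha` and `beta` hold one Beta parameter pair per arm.

```python
import numpy as np

def select_arm(alpha, beta, rng):
    """Draw one theta per arm from Beta(alpha_j, beta_j) and pick the largest, as in Eq. (7)."""
    theta = rng.beta(alpha, beta)
    return int(np.argmax(theta)), theta

rng = np.random.default_rng(0)
# Made-up state: the third arm has collected more successes than the others.
alpha = np.array([1.0, 1.0, 8.0, 1.0, 1.0])
beta = np.array([1.0, 1.0, 2.0, 1.0, 1.0])
j_k, theta = select_arm(alpha, beta, rng)
print(j_k, theta)
```

Arms with more recorded successes tend to draw larger values and are chosen more often, while arms with few observations keep a wide distribution and are still explored occasionally.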

Success or failure of the selected arm is judged with the Sharpe ratio:

$$
\tag{8}
\begin{aligned}
&\begin{cases}
\sum_{j=1}^l{\mathbf{1}_A(j)} \geq c & \text{success} \\
\sum_{j=1}^l{\mathbf{1}_A(j)} < c & \text{failure}
\end{cases} \\
&A = \{j \mid SR(a_{k, j_k}) - SR(a_{k,j}) \geq 0 \}
\end{aligned}
$$

$\mathbf{1}_A$ is the indicator function of the set $A$, and $SR(a_{k,j})$ represents the Sharpe ratio of the user's historical selection of arm $j$ at time
$t_k$. In other words, the selected arm $j_k$ succeeds when its Sharpe ratio is at least as high as that of at least $c$ of the $l$ arms, itself included.
By common international convention, the Sharpe ratio is calculated from 36 months of net growth rates.
The value of $c$ can be chosen according to the user's investment risk preference: $c$ can be smaller if the user prefers to pursue high risk and
high return, and larger if the user tends to pursue a relatively stable investment.
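
A minimal sketch of the success test in Eq. (8) follows. It assumes the usual definition of the Sharpe ratio (mean excess return divided by its sample standard deviation) with a zero risk-free rate by default, and it returns 0.0 for a series without variation; both choices, as well as the function names and the toy ratios, are assumptions for illustration.

```python
import numpy as np

def sharpe_ratio(net_returns, risk_free=0.0):
    """Mean excess return over its sample standard deviation (needs at least two points)."""
    excess = np.asarray(net_returns, dtype=float) - risk_free
    sd = excess.std(ddof=1)
    # Assumption: a constant series (e.g. pure cash) gets a ratio of 0.0.
    return excess.mean() / sd if sd > 0 else 0.0

def is_success(sr, j_k, c):
    """Eq. (8): count the arms whose Sharpe ratio does not exceed that of arm j_k."""
    sr = np.asarray(sr, dtype=float)
    return int(np.sum(sr[j_k] - sr >= 0)) >= c

# Made-up Sharpe ratios for the five arms.
sr = np.array([0.2, 0.0, 0.5, 0.3, 0.4])
print(is_success(sr, j_k=4, c=3))  # True: arm 4 is at least as good as 4 arms
print(is_success(sr, j_k=0, c=3))  # False: arm 0 is at least as good as only 2 arms
```

Since the selected arm always counts itself, $c = 1$ makes every selection a success, while $c = l$ demands that the selected arm has the best Sharpe ratio of all.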

#### Pseudo code


```text
[Input]: Total cycles of participation in the transaction (m),
number of assets (n), daily return (R), sliding window (τ),
the top-c threshold (c)
[Output]: Portfolio weight (w)

Initialize the Beta distribution θ_j ∼ Beta(α_j, β_j) of each strategic arm by α_1 = ... = α_l = β_1 = ... = β_l = 1.
for k = 1 to m do
    Calculate the weights of each basic portfolio strategy according to Eqs. (1)-(5).
    Sample θ_{k,j} for each arm j from the Beta(α_j, β_j) distribution.
    Select arm j_k according to Eq. (7).
    if k > τ then
        Assign the portfolio weight w_k = a_{k,j_k} at t_k.
    end if
    Evaluate arm j_k as success or failure according to Eq. (8).
    if success then
        α_{j_k} = α_{j_k} + 1
    else
        β_{j_k} = β_{j_k} + 1
    end if
end for
```
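
The pseudocode can be turned into a compact Python sketch. This is a simplified reading under several assumptions, not the authors' implementation: learning starts only once τ rows of history exist, each arm's Sharpe ratio is computed by applying its current weights to the sliding window (with cash earning nothing), the portfolio starts equally weighted, a small ridge term keeps the covariance matrix invertible, and the mean-variance arm uses the closed-form solution of the budget-constrained problem. The names `pbts`, `arm_weights` and `sharpe` and the synthetic data are hypothetical.

```python
import numpy as np

def sharpe(net):
    """Sample Sharpe ratio with zero risk-free rate; 0.0 if the series is constant."""
    sd = net.std(ddof=1)
    return net.mean() / sd if sd > 0 else 0.0

def arm_weights(w_prev, window):
    """Rows are the BH, SA, EW, VW and MV weights of Eqs. (1)-(5)."""
    n = window.shape[1]
    ones = np.ones(n)
    grown = w_prev * window[-1]
    vw = grown / grown.sum() if grown.sum() > 0 else np.zeros(n)
    mu = window.mean(axis=0)
    sigma = np.cov(window, rowvar=False) + 1e-6 * np.eye(n)  # ridge term (assumption)
    a, b = np.linalg.solve(sigma, mu), np.linalg.solve(sigma, ones)
    mv = 0.5 * (a - (ones @ a - 2.0) / (ones @ b) * b)
    return np.stack([w_prev, np.zeros(n), ones / n, vw, mv])

def pbts(R, tau, c, seed=0):
    """Run the bandit over gross returns R (m x n); returns the weights and Beta parameters."""
    rng = np.random.default_rng(seed)
    m, n = R.shape
    alpha, beta = np.ones(5), np.ones(5)
    w = np.full(n, 1.0 / n)  # starting portfolio (assumption)
    weights = []
    for k in range(tau, m):
        window = R[k - tau:k]                      # last tau observed return vectors
        arms = arm_weights(w, window)
        j = int(np.argmax(rng.beta(alpha, beta)))  # Eq. (7)
        sr = np.array([sharpe((window - 1.0) @ a) for a in arms])
        if np.sum(sr[j] - sr >= 0) >= c:           # Eq. (8)
            alpha[j] += 1
        else:
            beta[j] += 1
        w = arms[j]
        weights.append(w)
    return np.array(weights), alpha, beta

# Synthetic gross returns for three assets over 60 periods (made-up data).
R_demo = 1.0 + np.random.default_rng(1).normal(0.001, 0.01, size=(60, 3))
W, alpha, beta = pbts(R_demo, tau=20, c=3)
print(W.shape, alpha, beta)
```

Each of the 40 post-warm-up periods adds exactly one count to either `alpha` or `beta` of the selected arm, so the two vectors together summarise how often each strategy was chosen and how often it passed the Sharpe-ratio test.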

## Resources
* https://arxiv.org/abs/1911.05309: Adaptive Portfolio by Solving Multi-armed Bandit via Thompson Sampling
* https://rshare.library.torontomu.ca/articles/thesis/Financial_Bandits_-_Development_of_Thompson_Sampling_for_Financial_Data/24625161: Financial Bandits - Development of Thompson Sampling for Financial Data
* http://proceedings.mlr.press/v119/zhu20d/zhu20d.pdf: Thompson Sampling Algorithms for Mean-Variance Bandits
* Dataset: https://mba.tuck.dartmouth.edu/pages/faculty/ken.french/data_library.html 
* https://royalsocietypublishing.org/doi/pdf/10.1098/rsos.171377: Risk-aware multi-armed bandit problem with application to portfolio selection