## Abstract

Financial portfolio management is the process of constantly redistributing a fund among different financial products. This paper presents a financial-model-free Reinforcement Learning framework to provide a deep machine learning solution to the portfolio management problem. The framework consists of the Ensemble of Identical Independent Evaluators (EIIE) topology, a Portfolio-Vector Memory (PVM), an Online Stochastic Batch Learning (OSBL) scheme, and a fully exploiting and explicit reward function. The framework is realized in three instances in this work: a Convolutional Neural Network (CNN), a basic Recurrent Neural Network (RNN), and a Long Short-Term Memory (LSTM). Together with a number of recently reviewed or published portfolio-selection strategies, they are examined in three back-test experiments with a trading period of 30 minutes in a cryptocurrency market. Cryptocurrencies are electronic and decentralized alternatives to government-issued money, with Bitcoin as the best-known example. All three instances of the framework monopolize the top three positions in all experiments, outdistancing the other compared trading algorithms. Even with a high commission rate of 0.25% in the back-tests, the framework is able to achieve at least 4-fold returns in 50 days.

Keywords: Machine learning; Convolutional Neural Networks; Recurrent Neural Networks;
Long Short-Term Memory; Reinforcement learning; Deep Learning; Cryptocurrency;
Bitcoin; Algorithmic Trading; Portfolio Management; Quantitative Finance

## 1. Introduction

Previous successful attempts at model-free, fully machine-learning schemes for the algorithmic trading problem, which do not predict future prices, treat the problem as a Reinforcement Learning (RL) one. These include Moody and Saffell (2001), Dempster and Leemans (2006), Cumming (2015), and the recent deep RL utilization by Deng et al. (2017). These RL algorithms output discrete trading signals on an asset. Being limited to single-asset trading, they are not applicable to general portfolio management problems, where trading agents manage multiple assets.

Deep RL has lately drawn much attention due to its remarkable achievements in playing video games (Mnih et al., 2015) and board games (Silver et al., 2016). These are RL problems with discrete action spaces, and their solutions cannot be directly applied to portfolio selection problems, where actions are continuous. Although market actions can be discretized, discretization is considered a major drawback, because discrete actions come with unknown risks. For instance, one extreme discrete action may be defined as investing all the capital into one asset, without spreading the risk to the rest of the market. In addition, discretization scales badly. Market factors, like the total number of assets, vary from market to market. In order to take full advantage of the adaptability of machine learning over different markets, trading algorithms have to be scalable. A general-purpose continuous deep RL framework, the actor-critic Deterministic Policy Gradient Algorithms, was recently introduced (Silver et al., 2014; Lillicrap et al., 2016). The continuous output in these actor-critic algorithms is achieved by a neural-network approximated action policy function, and a second network is trained as the reward function estimator. Training two neural networks, however, is found to be difficult, and sometimes even unstable.

This paper proposes an RL framework specially designed for the task of portfolio management. The core of the framework is the Ensemble of Identical Independent Evaluators (EIIE) topology. An IIE is a neural network whose job is to inspect the history of an asset and evaluate its potential growth for the immediate future. The evaluation score of each asset is discounted by the size of the intended weight change for that asset in the portfolio and is presented to a softmax layer, whose outcome will be the new portfolio weights for the coming trading period. The portfolio weights define the market action of the RL agent. An asset with an increased target weight will be bought in an additional amount, and one with a decreased weight will be sold. Apart from the market history, portfolio weights from the previous trading period are also input to the EIIE. This allows the RL agent to consider the effect of transaction cost on its wealth. For this purpose, the portfolio weights of each period are recorded in a Portfolio Vector Memory (PVM). The EIIE is trained in an Online Stochastic Batch Learning (OSBL) scheme, which is compatible with both pre-trade training and online training during back-tests or online trading. The reward function of the RL framework is the explicit average of the periodic logarithmic returns. Having an explicit reward function, the EIIE evolves, under training, along the gradient ascending direction of the function. Three different species of IIEs are tested in this work: a Convolutional Neural Network (CNN) (Fukushima, 1980; Krizhevsky et al., 2012; Sermanet et al., 2012), a basic Recurrent Neural Network (RNN) (Werbos, 1988), and a Long Short-Term Memory (LSTM) (Hochreiter and Schmidhuber, 1997).


Being a fully machine-learning approach, the framework is not restricted to any particular market. To examine its validity and profitability, the framework is tested in a cryptocurrency (virtual money, with Bitcoin as the most famous example) exchange market, Poloniex.com. A set of coins is preselected by their ranking in trading volume over a time interval just before an experiment. Three back-test experiments over well-separated timespans are performed with a trading period of 30 minutes. The performance of the three EIIEs is compared with some recently published or reviewed portfolio selection strategies (Li et al., 2015a; Li and Hoi, 2014). The EIIEs significantly beat all other strategies in all three experiments.

Cryptographic currencies, or simply cryptocurrencies, are electronic and decentralized alternatives to government-issued money (Nakamoto, 2008; Grinberg, 2012). While the best-known example of a cryptocurrency is Bitcoin, there are more than 100 other tradable cryptocurrencies competing with each other and with Bitcoin (Bonneau et al., 2015). The motive behind this competition is that there are a number of design flaws in Bitcoin, and people are trying to invent new coins to overcome these defects, hoping their inventions will eventually replace Bitcoin (Bentov et al., 2014; Duffield and Hagan, 2014). There are, however, more and more cryptocurrencies being created not to beat Bitcoin, but to use the blockchain technology behind it to develop decentralized applications. (For example, Ethereum is a decentralized platform that runs smart contracts, and Siacoin is the currency for buying and selling storage service on the decentralized cloud Sia.) As of June 2017, the total market capitalization of all cryptocurrencies is 102 billion USD, 41 billion of which is Bitcoin ([Crypto-currency market capitalizations](http://coinmarketcap.com/)). Therefore, regardless of its design faults, Bitcoin is still the dominant cryptocurrency in the markets. As a result, many other currencies cannot be bought with fiat currencies, but can only be traded against Bitcoin.

Two properties of cryptocurrencies differentiate them from traditional financial assets, making their market the best test-ground for algorithmic portfolio management experiments. These properties are decentralization and openness, and the former implies the latter. Without a central regulating party, anyone can participate in cryptocurrency trading with low entrance requirements. One direct consequence is an abundance of small-volume currencies. Affecting the prices of these penny-markets requires a smaller amount of investment, compared to traditional markets. This will eventually allow trading machines to learn and take advantage of the impact of their own market actions. Openness also means the markets are more accessible. Most cryptocurrency exchanges have an application programming interface for obtaining market data and carrying out trading actions, and most exchanges are open 24/7 without restricting the frequency of trading. These non-stop markets are ideal for machines to learn in the real world in shorter time-frames.

The paper is organized as follows. Section 2 defines the portfolio management problem that this project is aiming to solve. Section 3 introduces asset preselection and the reasoning behind it, the input price tensor, and a way to deal with missing data in the market history. The portfolio management problem is re-described in the language of RL in Section 4. Section 5 presents the EIIE meta topology, the PVM, and the OSBL scheme. The results of the three experiments are presented in Section 6.

## 2. Problem Definition 

Portfolio management is the action of continuously reallocating capital among a number of financial assets. For an automatic trading robot, these investment decisions and actions are made periodically. This section provides a mathematical setting of the portfolio management problem.

### 2.1 Trading Period

In this work, trading algorithms are time-driven, where time is divided into periods of equal length \\( T \\). At the beginning of each period, the trading agent reallocates the fund among the assets. \\( T = 30 \\) minutes in all experiments of this paper. The price of an asset goes up and down within a period, but four important price points characterize the overall movement of a period, namely the opening, highest, lowest and closing prices (Rogers and Satchell, 1991). For continuous markets, the opening price of a financial instrument in a period is the closing price from the previous period. It is assumed in the back-test experiments that at the beginning of each period assets can be bought or sold at the opening price of that period. The justification of such an assumption is given in Section 2.4.

### 2.2 Mathematical Formalism

The portfolio consists of \\( m \\) assets. The closing prices of all assets comprise the price vector for Period \\( t \\), \\( {\boldsymbol v}_t \\). In other words, the \\( i \\)th element of \\( {\boldsymbol v}_t \\), \\( v_{i,t} \\), is the closing price of the \\( i \\)th asset in the \\( t \\)th period. Similarly, \\( {\boldsymbol v}^{(hi)}_t \\) and \\( {\boldsymbol v}^{(lo)}_t \\) denote the highest and lowest prices of the period. The first asset in the portfolio is special in that it is the quoted currency, referred to as the cash for the rest of the article. Since the prices of all assets are quoted in cash, the first elements of \\( {\boldsymbol v}_t \\), \\( {\boldsymbol v}^{(hi)}_t \\) and \\( {\boldsymbol v}^{(lo)}_t \\) are always one, that is \\( v^{(hi)}_{0,t} = v^{(lo)}_{0,t} = v_{0,t} = 1, \forall t \\). In the experiments of this paper, the cash is Bitcoin. For continuous markets, the elements of \\( {\boldsymbol v}_t \\) are the opening prices for Period \\( t + 1 \\) as well as the closing prices for Period \\( t \\). The price relative vector of the \\( t \\)th trading period, \\( {\boldsymbol y}_t \\), is defined as the element-wise division of \\( {\boldsymbol v}_t \\) by \\( {\boldsymbol v}_{t-1} \\):

$$ {\boldsymbol y}_t := {\boldsymbol v}_t \oslash {\boldsymbol v}_{t-1} = \biggl(1,\frac{v_{1,t}}{v_{1,t-1}},\frac{v_{2,t}}{v_{2,t-1}},\dots,\frac{v_{m,t}}{v_{m,t-1}} \biggr)^T \tag{1}$$

The elements of \\( {\boldsymbol y}_t \\) are the quotients of closing prices and opening prices for the individual assets in the period. The price relative vector can be used to calculate the change in total portfolio value in a period. If \\( p_{t-1} \\) is the portfolio value at the beginning of Period \\( t \\), ignoring transaction cost,

$$ p_t = p_{t-1}{\boldsymbol y}_t \cdot {\boldsymbol w}_{t-1} \tag{2}$$

where \\( {\boldsymbol w}_{t-1} \\) is the portfolio weight vector (referred to as the portfolio vector from now on) at the beginning of Period \\( t \\), whose \\( i \\)th element, \\( w_{t-1,i} \\), is the proportion of asset \\( i \\) in the portfolio after capital reallocation. The elements of \\( {\boldsymbol w}_t \\) always sum up to one by definition, \\( \sum_i w_{t,i} = 1, \forall t \\). The *rate of return* for Period \\( t \\) is then

$$ \rho_t := \frac{p_t}{p_{t-1}} -1 = {\boldsymbol y}_t \cdot {\boldsymbol w}_{t-1} -1 \tag{3}$$

and the corresponding logarithmic rate of return is

$$ r_t := \ln \frac{p_t}{p_{t-1}} = \ln \left( {\boldsymbol y}_t \cdot {\boldsymbol w}_{t-1} \right) \tag{4}$$
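As a minimal sketch of the quantities defined so far, the following Python snippet computes the price relative vector, the rate of return, and the logarithmic rate of return for one period, with the latter taken as the natural logarithm of the value ratio \\( p_t / p_{t-1} \\). The function names and the sample prices are illustrative assumptions, not part of the original work; the cash sits at index 0 with a constant price of one.

```python
import math

def price_relative(v_prev, v_curr):
    """Element-wise division of closing prices, as in Equation (1)."""
    return [c / p for c, p in zip(v_curr, v_prev)]

def rate_of_return(y, w_prev):
    """Simple return of one period, ignoring transaction cost."""
    return sum(yi * wi for yi, wi in zip(y, w_prev)) - 1.0

def log_rate_of_return(y, w_prev):
    """Natural logarithm of the period's portfolio value ratio."""
    return math.log(sum(yi * wi for yi, wi in zip(y, w_prev)))

# Hypothetical closing prices quoted in cash (index 0 is the cash itself).
v_prev = [1.0, 0.050, 0.0020]
v_curr = [1.0, 0.055, 0.0019]
w_prev = [0.2, 0.5, 0.3]  # portfolio vector at the beginning of the period

y = price_relative(v_prev, v_curr)
rho = rate_of_return(y, w_prev)
r = log_rate_of_return(y, w_prev)
```

With these numbers the portfolio grows by about 3.5% over the period, and `math.exp(r) - 1` recovers `rho`, which shows how the two return definitions relate.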

In a typical portfolio management problem, the initial portfolio weight vector \\( {\boldsymbol w}_0 \\) is chosen to be the first basis vector in the Euclidean space,

$$ {\boldsymbol w}_0 = (1,0,...,0)^T  \tag{5} $$

indicating all the capital is in the trading currency before entering the market. If there is no transaction cost, the final portfolio value will be

$$ p_f = p_0 \exp \biggl( \sum_{t=1}^{t_f+1}r_t \biggr) = p_0 \prod_{t=1}^{t_f + 1}{\boldsymbol y}_t \cdot {\boldsymbol w}_{t-1} \tag{6} $$

where \\( p_0 \\) is the initial investment amount. The job of a portfolio manager is to maximize \\( p_f \\) for a given time frame.
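The compounding in Equation (6) can be sketched in a few lines of Python. This is only an illustration under the no-transaction-cost assumption; the function name `final_value` and the three-period toy market are hypothetical.

```python
import math

def final_value(p0, price_relatives, weights):
    """Compound p0 over all periods; weights[k] is held during period k + 1."""
    log_sum = 0.0
    for y, w in zip(price_relatives, weights):
        # Logarithmic return of one period: ln(y . w)
        log_sum += math.log(sum(yi * wi for yi, wi in zip(y, w)))
    return p0 * math.exp(log_sum)

# Hypothetical market with the cash and one coin over three periods.
ys = [[1.0, 1.10], [1.0, 0.90], [1.0, 1.20]]
ws = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]  # starts fully in cash, as in Equation (5)

p_f = final_value(1.0, ys, ws)
```

Summing logarithmic returns and exponentiating once is equivalent to multiplying the per-period growth factors, which is why the two forms in Equation (6) agree.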

### 2.3 Transaction Cost

In a real-world scenario, buying or selling assets in a market is not free. The cost normally comes from commission fees. Assuming a constant commission rate, this section re-calculates the final portfolio value in Equation (6), using a recursive formula that extends the work of Ormos and Urbán (2013).

The portfolio vector at the beginning of Period \\( t \\) is \\( w_{t-1} \\). Due to price movements in the market, at the end of the same period, the weights evolve into


$$ {\boldsymbol w}'_t = \frac{{\boldsymbol y}_t \odot {\boldsymbol w}_{t-1}}{{\boldsymbol y}_t \cdot {\boldsymbol w}_{t-1}} \tag{7} $$

where \\( \odot \\) is the element-wise multiplication. The mission of the portfolio manager now, at the end of Period \\( t \\), is to reallocate the portfolio vector from \\( {\boldsymbol w}'_t \\) to \\( {\boldsymbol w}_t \\) by selling and buying the relevant assets. After all commission fees are paid, this reallocation shrinks the portfolio value by a factor \\( \mu_t \in (0,1] \\), which will be called the transaction remainder factor from now on and is determined below. Denoting \\( p_{t-1} \\) as the portfolio value at the beginning of Period \\( t \\) and \\( p'_t \\) at the end,

$$ p_t = \mu_tp'_t \tag{8}$$
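The drift of the weights caused by price movements alone can be sketched as follows. The function name `drifted_weights` and the sample numbers are assumptions made for illustration.

```python
def drifted_weights(y, w_prev):
    """Weights at the end of a period, before any reallocation."""
    growth = [yi * wi for yi, wi in zip(y, w_prev)]  # element-wise product
    total = sum(growth)                              # equals y . w_prev
    return [g / total for g in growth]

# Hypothetical period: the first coin gains 10%, the second loses 5%.
y = [1.0, 1.10, 0.95]
w_prev = [0.2, 0.5, 0.3]
w_drift = drifted_weights(y, w_prev)
```

The drifted weights still sum to one, and the asset that outperformed the portfolio as a whole ends the period with a larger share than it started with.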

Figure 1: Illustration of the effect of the transaction remainder factor \\( \mu_t \\). The market movement during Period \\( t \\), represented by the price-relative vector \\( {\boldsymbol y}_t \\), drives the portfolio value and portfolio weights from \\( p_{t-1} \\) and \\( {\boldsymbol w}_{t-1} \\) to \\( p'_t \\) and \\( {\boldsymbol w}'_t \\). The asset selling and purchasing action at time \\( t \\) redistributes the fund into \\( {\boldsymbol w}_t \\). As a side effect, these transactions shrink the portfolio to \\( p_t \\) by a factor of \\( \mu_t \\). The rate of return for Period \\( t \\) is calculated using portfolio values at the beginning of the two consecutive periods in Equation (9).

The rate of return (3) and logarithmic rate of return (4) are now

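$$ \rho_t = \frac{p_t}{p_{t-1}} - 1 = \frac{\mu_t p'_t}{p_{t-1}} - 1 = \mu_t {\boldsymbol y}_t \cdot {\boldsymbol w}_{t-1} - 1 \tag{9} $$

$$ r_t = \ln \frac{p_t}{p_{t-1}} = \ln \left( \mu_t {\boldsymbol y}_t \cdot {\boldsymbol w}_{t-1} \right) \tag{10} $$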

and the final portfolio value in Equation (6) becomes

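$$ p_f = p_0 \exp \biggl( \sum_{t=1}^{t_f+1} r_t \biggr) = p_0 \prod_{t=1}^{t_f+1} \mu_t {\boldsymbol y}_t \cdot {\boldsymbol w}_{t-1} \tag{11} $$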

Unlike Equations (4) and (2), where transaction cost is not considered, in Equations (10) and (11) \\( p'_t \neq p_t \\), and the difference between the two values is where the transaction remainder factor comes into play. Figure 1 demonstrates the relationship among portfolio vectors and values, and their dynamics on a time axis.
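To make the role of the transaction remainder factor concrete, the sketch below advances the portfolio value by one period: the market first moves the value from \\( p_{t-1} \\) to \\( p'_t \\), and the reallocation then shrinks it to \\( p_t = \mu_t p'_t \\) as in Equation (8). The function name `step_with_cost` and the value `mu=0.999` are hypothetical placeholders, not results from the experiments.

```python
import math

def step_with_cost(p_prev, y, w_prev, mu):
    """One period: market movement, then reallocation shrinks the value by mu."""
    p_end = p_prev * sum(yi * wi for yi, wi in zip(y, w_prev))  # p'_t
    p_next = mu * p_end                                         # p_t = mu_t * p'_t
    log_return = math.log(p_next / p_prev)
    return p_next, log_return

p0 = 1.0
p1, r1 = step_with_cost(p0, [1.0, 1.10, 0.95], [0.2, 0.5, 0.3], mu=0.999)
```

Chaining this step over all periods and multiplying the results gives the final portfolio value with transaction cost taken into account.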

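The remaining problem is to determine this transaction remainder factor \\( \mu_t \\). During the portfolio reallocation from \\( {\boldsymbol w}'_t \\) to \\( {\boldsymbol w}_t \\), some or all of asset \\( i \\) needs to be sold if \\( p'_t w'_{t,i} > p_t w_{t,i} \\), or equivalently \\( w'_{t,i} > \mu_t w_{t,i} \\). The total amount of cash obtained by all selling is

$$ (1 - c_s) p'_t \sum_{i=1}^{m} \left( w'_{t,i} - \mu_t w_{t,i} \right)^+ \tag{12} $$

where \\( 0 \le c_s < 1 \\) is the commission rate for selling, and \\( ({\boldsymbol v})^+ = \mathrm{ReLU}({\boldsymbol v}) \\) is the element-wise rectified linear function, \\( (x)^+ = x \\) if \\( x > 0 \\), \\( (x)^+ = 0 \\) otherwise. This money, together with the original cash reserve \\( p'_t w'_{t,0} \\) less the new reserve \\( \mu_t p'_t w_{t,0} \\), will be used to buy new assets,

$$ (1 - c_p) \left[ w'_{t,0} + (1 - c_s) \sum_{i=1}^{m} \left( w'_{t,i} - \mu_t w_{t,i} \right)^+ - \mu_t w_{t,0} \right] = \sum_{i=1}^{m} \left( \mu_t w_{t,i} - w'_{t,i} \right)^+ \tag{13} $$

where \\( 0 \le c_p < 1 \\) is the commission rate for purchasing, and \\( p'_t \\) has been canceled out on both sides. Using the identity \\( (a - b)^+ - (b - a)^+ = a - b \\), and the fact that \\( w'_{t,0} + \sum_{i=1}^{m} w'_{t,i} = 1 = w_{t,0} + \sum_{i=1}^{m} w_{t,i} \\), Equation (13) is simplified to

$$ \mu_t = \frac{1}{1 - c_p w_{t,0}} \left[ 1 - c_p w'_{t,0} - (c_s + c_p - c_s c_p) \sum_{i=1}^{m} \left( w'_{t,i} - \mu_t w_{t,i} \right)^+ \right] \tag{14} $$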
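The step from Equation (13) to Equation (14) can be spelled out as follows. This is a sketch of the algebra only; the shorthands \\( S \\) and \\( B \\) are introduced here for brevity and are not part of the original notation. Let \\( S = \sum_{i=1}^{m} (w'_{t,i} - \mu_t w_{t,i})^+ \\) be the sold fraction and \\( B = \sum_{i=1}^{m} (\mu_t w_{t,i} - w'_{t,i})^+ \\) the purchased fraction, so that the cash balance of selling and purchasing reads \\( (1 - c_p) \left[ w'_{t,0} + (1 - c_s) S - \mu_t w_{t,0} \right] = B \\). Applying the identity term by term and using the two normalization conditions gives

$$ S - B = \sum_{i=1}^{m} \left( w'_{t,i} - \mu_t w_{t,i} \right) = (1 - w'_{t,0}) - \mu_t (1 - w_{t,0}) $$

Substituting \\( B \\) into the cash balance and collecting the terms in \\( \mu_t \\) yields

$$ \mu_t (1 - c_p w_{t,0}) = 1 - c_p w'_{t,0} - (c_s + c_p - c_s c_p) S $$

and dividing by \\( 1 - c_p w_{t,0} \\) gives the implicit equation for \\( \mu_t \\), in which \\( \mu_t \\) still appears inside \\( S \\).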

The presence of \\( \mu_t \\) inside a linear rectifier means that \\( \mu_t \\) cannot be solved for analytically; it can only be solved iteratively.

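#### Theorem 1

*Denoting*

$$ f(\mu) := \frac{1}{1 - c_p w_{t,0}} \left[ 1 - c_p w'_{t,0} - (c_s + c_p - c_s c_p) \sum_{i=1}^{m} \left( w'_{t,i} - \mu w_{t,i} \right)^+ \right], $$

*the sequence* \\( \tilde\mu_t^{(k)} \\)*, defined as*

$$ \left\lbrace \tilde\mu_t^{(k)} \mid \tilde\mu_t^{(0)} = \mu_\odot \text{ and } \tilde\mu_t^{(k)} = f\left(\tilde\mu_t^{(k-1)}\right), k \in \mathbb{N}_0 \right\rbrace \tag{15} $$

*converges to* \\( \mu_t \\)*, the true solution of Equation (14), for any* \\( \mu_\odot \in [0,1] \\).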

While this convergence is not stated in Ormos and Urbán (2013), its proof is given in Appendix A. This theorem provides a way to approximate the transaction remainder factor \\( \mu_t \\) to arbitrary accuracy. The speed of convergence depends on the error of the initial guess \\( \mu_\odot \\): the smaller \\( \lvert \mu_t - \mu_\odot \rvert \\) is, the quicker Sequence (15) converges to \\( \mu_t \\). When \\( c_p = c_s = c \\), there is a practice (Moody et al., 1998) of approximating \\( \mu_t \\) with \\( c \sum_{i=1}^{m} \lvert w'_{t,i} - w_{t,i} \rvert \\). Therefore, in this work, \\( \mu_\odot \\) uses this as the first value of the sequence, that is,

$$ \mu_\odot = c \sum_{i=1}^{m} \lvert w'_{t,i} - w_{t,i} \rvert \tag{16} $$

In the training of the neural networks, \\( \tilde\mu_t^{(k)} \\) with a fixed \\( k \\) in (15) is used. In the back-test experiments, a tolerance \\( \delta \\) dynamically determines \\( k \\): the first \\( k \\) such that \\( \lvert \tilde\mu_t^{(k)} - \tilde\mu_t^{(k-1)} \rvert < \delta \\) is used, and the corresponding \\( \tilde\mu_t^{(k)} \\) approximates \\( \mu_t \\). In general, \\( \mu_t \\) and its approximations are functions of the portfolio vectors of the two most recent periods and the price relative vector,

$$ \mu_t = \mu_t({\boldsymbol w}_{t-1}, {\boldsymbol w}_t, {\boldsymbol y}_t). \tag{17} $$
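The iterative scheme described above can be sketched as a plain fixed-point iteration. This is a minimal illustration, not the authors' implementation: the function name `transaction_remainder`, the default tolerance, and the iteration cap are assumptions, and `f` is the right-hand side of the implicit equation for \\( \mu_t \\) obtained from the cash balance of selling and purchasing. The inputs are the drifted weights \\( {\boldsymbol w}'_t \\) and the target weights \\( {\boldsymbol w}_t \\), with the cash at index 0.

```python
def transaction_remainder(w_drift, w_new, c_s=0.0025, c_p=0.0025,
                          tol=1e-12, max_iter=100):
    """Fixed-point iteration for mu_t; index 0 of each vector is the cash."""
    def f(mu):
        # Fraction of the portfolio sold, summed over non-cash assets only.
        sold = sum(max(wd - mu * wn, 0.0)
                   for wd, wn in zip(w_drift[1:], w_new[1:]))
        return ((1.0 - c_p * w_drift[0] - (c_s + c_p - c_s * c_p) * sold)
                / (1.0 - c_p * w_new[0]))

    # Initial guess following the text (stated there for c_s == c_p);
    # any starting value in [0, 1] is expected to converge as well.
    mu = c_p * sum(abs(wd - wn) for wd, wn in zip(w_drift[1:], w_new[1:]))
    for k in range(1, max_iter + 1):
        mu_next = f(mu)
        if abs(mu_next - mu) < tol:  # stop once successive iterates agree
            return mu_next, k
        mu = mu_next
    return mu, max_iter

# Selling an all-coin portfolio for cash leaves 1 - c_s of the value.
mu_sell, steps = transaction_remainder([0.0, 1.0], [1.0, 0.0])
```

Because the commission rates are small, each iteration reduces the error by roughly two orders of magnitude, so only a handful of steps are needed. Passing a fixed `max_iter` with `tol=0.0` mimics the fixed-\\( k \\) variant used during training.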

Throughout this work, a single constant commission rate for both selling and purchasing of all non-cash assets is used, \\( c_s = c_p = 0.25\\% \\), the maximum rate at Poloniex. The purpose of the algorithmic agent is to generate a time sequence of portfolio vectors \\( \lbrace {\boldsymbol w}_1, {\boldsymbol w}_2, \cdots, {\boldsymbol w}_t, \cdots \rbrace \\) in order to maximize the accumulated capital in (11), taking transaction cost into account.

### 2.4 Two Hypotheses 
In this work, only back-test trading is considered, where the trading agent pretends to be back in time at a point in the market history, not knowing any "future" market information, and does paper trading from then onward. As a requirement for the back-test experiments, the following two assumptions are imposed:


1. Zero slippage: The liquidity of all market assets is high enough that each trade can be carried out immediately at the last price when an order is placed.

2. Zero market impact: The capital invested by the software trading agent is so insignificant that it has no influence on the market.

In a real-world trading environment, if the trading volume in a market is high enough, these two assumptions are close to reality.

## 3. Data Treatments 
The trading experiments are done on the exchange Poloniex, where there are about 80 tradable cryptocurrency pairs with about 65 available cryptocurrencies. However, for the reasons given below, only a subset of coins is considered by the trading robot in one period. Apart from the coin selection scheme, this section also describes the data structure that the neural networks take as their input, a normalization pre-process, and a scheme to deal with missing data.

### 3.1 Asset Pre-Selection 
In the experiments of the paper, the 11 non-cash assets with the highest trading volume are preselected for the portfolio. Together with the cash, Bitcoin, the size of the portfolio, \\( m + 1 \\), is 12. This number is chosen by experience and can be adjusted in future experiments. For markets with large volumes, like the foreign exchange market, \\( m \\) can be as big as the total number of available assets.

One reason for selecting top-volume cryptocurrencies (simply called coins below) is that a bigger volume implies better market liquidity of an asset. In turn, this means the market condition is closer to Hypothesis 1 set in Section 2.4. Higher volumes also suggest that the investment has less influence on the market, establishing an environment closer to Hypothesis 2. Considering the relatively high trading frequency (30 minutes) compared to some daily trading algorithms, liquidity and market size are particularly important in the current setting. In addition, the cryptocurrency market is not stable. Some previously rarely or popularly traded coins can have a sudden boost or drop in volume in a short period of time. Therefore, the volume for asset preselection is taken over a longer time-frame, relative to the trading period. In these experiments, volumes of 30 days are used. However, using top volumes for coin selection in back-test experiments can give rise to a survival bias. The trading volume of an asset is correlated with its popularity, which in turn is governed by its historic performance. Giving future volume rankings to a back-test will inevitably and indirectly pass future price information to the experiment, causing unreliably positive results. For this reason, volume information from just before the beginning of the back-tests is taken for preselection to avoid survival bias.
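The preselection rule can be sketched as a ranking by volume over a window that ends where the back-test begins. The function name `preselect_assets`, the data layout, and the coin names below are hypothetical; the sketch only illustrates how restricting the window avoids leaking volume information from the back-test span.

```python
def preselect_assets(volume_history, backtest_start, window, top_k=11):
    """Rank coins by traded volume in the window just before the back-test.

    volume_history maps a coin name to a list of (timestamp, volume) pairs.
    Only records with backtest_start - window <= timestamp < backtest_start
    are counted, so nothing from the back-test span itself is used.
    """
    totals = {}
    for coin, records in volume_history.items():
        totals[coin] = sum(vol for ts, vol in records
                           if backtest_start - window <= ts < backtest_start)
    ranked = sorted(totals, key=totals.get, reverse=True)
    return ranked[:top_k]

DAY = 24 * 60 * 60  # seconds
history = {
    "AAA": [(10 * DAY, 500.0), (40 * DAY, 5.0)],
    "BBB": [(20 * DAY, 300.0), (45 * DAY, 900.0)],
    "CCC": [(25 * DAY, 100.0)],
}
selected = preselect_assets(history, backtest_start=30 * DAY,
                            window=30 * DAY, top_k=2)
```

In this toy data, `BBB` becomes the most traded coin only after the back-test starts. Because that later volume is excluded, it ranks second rather than first, which is the behaviour needed to avoid survival bias.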

### 3.2 Price Tensor 
Historic price data is fed into a neural network to generate the output of a portfolio vector. This subsection describes the structure of the input tensor, its normalization scheme, and how missing data is dealt with. 

The input to the neural networks at the end of Period \\( t \\) is a tensor, \\( {\boldsymbol X}_t \\), of rank 3 with shape \\( (f, n, m) \\), where \\( m \\) is the number of preselected non-cash assets, \\( n \\) is the number of input periods before \\( t \\), and \\( f = 3 \\) is the feature number. Since prices further back in history have much less correlation with the current moment than recent ones, \\( n = 50 \\) (a day and an hour) is used for the experiments. The criterion for choosing the \\( m \\) assets was given in Section 3.1. The features for asset \\( i \\) in Period \\( t \\) are its closing, highest, and lowest prices in the interval. Using the notation from Section 2.2, these are \\( v_{i,t} \\), \\( v^{(hi)}_{i,t} \\), and \\( v^{(lo)}_{i,t} \\). However, these absolute price values are not fed directly to the networks. Since only the changes in prices determine the performance of the portfolio management (Equation (10)), all prices in the input tensor are normalized by the latest closing prices. Therefore, \\( {\boldsymbol X}_t \\) is the stacking of the three normalized price matrices,