<a href="https://colab.research.google.com/github/letianzj/QuantResearch/blob/master/ml/reinforcement_pm.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Reinforcement Portfolio Manager

## Setup

Uncomment the lines below and run them once.

In [7]:
# !sudo apt-get update
# !pip install yfinance
# !pip install ta
# !pip install -U gym==0.21.0
# !pip install -U quanttrader==0.5.5
# !pip install -U pyfolio==0.9.2

# !sudo apt-get install -y xvfb ffmpeg freeglut3-dev
# !pip install 'imageio==2.4.0'
# !pip install pyvirtualdisplay
# !pip install tf-agents[reverb]
# !pip install pyglet
# !pip install -U PyYaml==3.13

Restart the runtime so that PyYaml==3.13 takes effect; otherwise pyfolio will fail with a yaml.load error.
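One way to tell whether a restart is still needed is to compare the version of the module already loaded in memory with the version installed on disk. The sketch below is only an illustration: the helper names `loaded_and_installed` and `needs_restart` are hypothetical and not part of pyfolio or PyYAML, and it assumes the module exposes a `__version__` attribute, as `yaml` does.

```python
import importlib
from importlib import metadata


def loaded_and_installed(module_name, dist_name):
    """Return (version of the module in memory, version installed on disk)."""
    try:
        # import_module returns the already-loaded module if there is one.
        module = importlib.import_module(module_name)
        loaded = getattr(module, "__version__", None)
    except ImportError:
        loaded = None
    try:
        installed = metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        installed = None
    return loaded, installed


def needs_restart(module_name, dist_name):
    """True when both versions are known and they disagree."""
    loaded, installed = loaded_and_installed(module_name, dist_name)
    return loaded is not None and installed is not None and loaded != installed


print(loaded_and_installed("yaml", "PyYAML"), needs_restart("yaml", "PyYAML"))
```

If `needs_restart` returns `True`, the kernel is still holding the old module and the runtime should be restarted before importing pyfolio.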

The cell below might need to be run twice.

In [29]:
import os
import io
import tempfile
import shutil
import zipfile
from google.colab import files

from datetime import datetime
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import yfinance as yf
import gym
import quanttrader as qt
from quanttrader import PortfolioEnv
import pyfolio as pf

import tensorflow as tf
from tf_agents.agents.dqn import dqn_agent
from tf_agents.drivers import py_driver
from tf_agents.drivers.dynamic_step_driver import DynamicStepDriver
from tf_agents.environments import tf_py_environment
from tf_agents.environments import suite_gym
from tf_agents.eval import metric_utils
from tf_agents.metrics import tf_metrics
from tf_agents.networks import sequential, q_network, network
from tf_agents.policies import py_tf_eager_policy
from tf_agents.policies import random_tf_policy
from tf_agents.policies import policy_saver
from tf_agents.replay_buffers import TFUniformReplayBuffer
from tf_agents.trajectories import trajectory
from tf_agents.specs import tensor_spec
from tf_agents.utils import common

import base64
import imageio
import IPython
import matplotlib
import PIL.Image
import pyvirtualdisplay
import reverb

In [30]:
gym.__version__, qt.__version__, pf.__version__

('0.21.0', '0.5.5', '0.9.2')

In [8]:
def load_data():
    from datetime import timedelta
    import ta

    start_date = datetime(2010, 1, 1)
    end_date = datetime(2020, 12, 31)
    syms = ['SPY', 'QQQ']
    # max_price_scaler = 5_000.0        # alternative scaler; overridden by the line below
    max_price_scaler = 1                # prices are left unscaled
    max_volume_scaler = 1.5e8
    df_obs = pd.DataFrame()             # observation
    df_exch = pd.DataFrame()            # exchange; for order match

    for sym in syms:
        df = yf.download(sym, start=start_date, end=end_date)
        df.index = pd.to_datetime(df.index) + timedelta(hours=15, minutes=59, seconds=59)

        df_exch = pd.concat([df_exch, df['Close'].rename(sym)], axis=1)

        df['Open'] = df['Adj Close'] / df['Close'] * df['Open'] / max_price_scaler
        df['High'] = df['Adj Close'] / df['Close'] * df['High'] / max_price_scaler
        df['Low'] = df['Adj Close'] / df['Close'] * df['Low'] / max_price_scaler
        df['Volume'] = df['Adj Close'] / df['Close'] * df['Volume'] / max_volume_scaler
        df['Close'] = df['Adj Close'] / max_price_scaler
        df = df[['Open', 'High', 'Low', 'Close', 'Volume']]
        df.columns = [f'{sym}:{c.lower()}' for c in df.columns]

        macd = ta.trend.MACD(close=df[f'{sym}:close'])
        df[f'{sym}:macd'] = macd.macd()
        df[f'{sym}:macd_diff'] = macd.macd_diff()
        df[f'{sym}:macd_signal'] = macd.macd_signal()

        rsi = ta.momentum.RSIIndicator(close=df[f'{sym}:close'])
        df[f'{sym}:rsi'] = rsi.rsi()

        bb = ta.volatility.BollingerBands(close=df[f'{sym}:close'], window=20, window_dev=2)
        df[f'{sym}:bb_bbm'] = bb.bollinger_mavg()
        df[f'{sym}:bb_bbh'] = bb.bollinger_hband()
        df[f'{sym}:bb_bbl'] = bb.bollinger_lband()

        atr = ta.volatility.AverageTrueRange(high=df[f'{sym}:high'], low=df[f'{sym}:low'], close=df[f'{sym}:close'])
        df[f'{sym}:atr'] = atr.average_true_range()

        df_obs = pd.concat([df_obs, df], axis=1)

    return df_obs, df_exch

In [9]:
df_obs, df_exch = load_data()

[*********************100%***********************]  1 of 1 completed
[*********************100%***********************]  1 of 1 completed


## Trading Environment

In [15]:
look_back = 10
cash = 100_000.0
max_nav_scaler = cash

train_qt_env = PortfolioEnv(df_obs, df_exch)
train_qt_env.set_cash(cash)
train_qt_env.set_commission(0.0001)
train_qt_env.set_steps(n_lookback=look_back, n_warmup=50, n_maxsteps=250)
train_qt_env.set_feature_scaling(max_nav_scaler)

eval_qt_env = PortfolioEnv(df_obs, df_exch)
eval_qt_env.set_cash(cash)
eval_qt_env.set_commission(0.0001)
eval_qt_env.set_steps(n_lookback=look_back, n_warmup=50, n_maxsteps=2000, n_init_step=504)  # index 504 is 2012-01-03
eval_qt_env.set_feature_scaling(max_nav_scaler)

Take one step to see how the environment works.

In [19]:
o1 = eval_qt_env.reset()
action = np.array([0.4, 0.4], dtype=np.float64)
o2, reward, done, info = eval_qt_env.step(action)

There are two stocks, each with 13 features, plus NAV, for a total of 27 features.

The lookback window is 10 trading days, or about two weeks.

In [20]:
o1.shape, o2.shape

((10, 27), (10, 27))
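The (10, 27) shape can be checked by counting columns. The sketch below rebuilds the column names produced by `load_data` and appends one NAV column; the name `NAV` and the position of that column are assumptions for illustration, not taken from the quanttrader source.

```python
symbols = ["SPY", "QQQ"]

# The 13 per-symbol features created in load_data().
per_symbol = [
    "open", "high", "low", "close", "volume",
    "macd", "macd_diff", "macd_signal",
    "rsi",
    "bb_bbm", "bb_bbh", "bb_bbl",
    "atr",
]

# Column label "NAV" is hypothetical; the environment adds one NAV feature.
columns = [f"{s}:{f}" for s in symbols for f in per_symbol] + ["NAV"]

n_lookback = 10
obs_shape = (n_lookback, len(columns))
print(obs_shape)  # (10, 27)
```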

In [21]:
idx0 = eval_qt_env._init_step
idx1 = idx0+3
eval_qt_env._df_obs_scaled[idx0:idx1]         # observation

Date,SPY:open,SPY:high,SPY:low,SPY:close,SPY:volume,SPY:macd,SPY:macd_diff,SPY:macd_signal,SPY:rsi,SPY:bb_bbm,SPY:bb_bbh,SPY:bb_bbl,SPY:atr,QQQ:open,QQQ:high,QQQ:low,QQQ:close,QQQ:volume,QQQ:macd,QQQ:macd_diff,QQQ:macd_signal,QQQ:rsi,QQQ:bb_bbm,QQQ:bb_bbh,QQQ:bb_bbl,QQQ:atr
2012-01-03 15:59:59,105.490874,106.002808,105.218393,105.276192,1.066237,0.885565,0.284873,0.600691,60.32446,102.609565,105.926763,99.292368,1.674594,51.671302,51.925526,51.52603,51.662224,0.239178,0.060246,0.133801,-0.073555,56.593513,50.688891,52.166454,49.211328,0.876583
2012-01-04 15:59:59,105.028497,105.532172,104.623908,105.441345,0.700116,0.99283,0.313711,0.679119,60.780561,102.703508,106.223549,99.183468,1.619856,51.580514,51.952772,51.353527,51.880135,0.177978,0.147476,0.176824,-0.029349,57.817202,50.694258,52.18822,49.200296,0.856773
2012-01-05 15:59:59,104.871587,105.87893,104.392682,105.722046,0.957229,1.087947,0.327062,0.760885,61.588795,102.809845,106.552665,99.067025,1.610313,51.771184,52.35227,51.571434,52.306873,0.24975,0.248179,0.222022,0.026157,60.184417,50.728198,52.332262,49.124135,0.851349


In [22]:
eval_qt_env._df_exch[idx0:idx1]

Date,SPY,QQQ
2012-01-03 15:59:59,127.5,56.900002
2012-01-04 15:59:59,127.699997,57.139999
2012-01-05 15:59:59,128.039993,57.610001


At the close of 2012-01-03, suppose the action is [0.4, 0.4], i.e. 40% in SPY, 40% in QQQ, and the remaining 20% in cash. We then buy 40_000/127.50, or 313 whole shares of SPY, and 40_000/56.9, or 702 whole shares of QQQ. The commission is (313x127.50+702x56.9)x0.0001=7.985, and the remaining cash is 100_000-313x127.50-702x56.9-7.985=20_140.7, roughly 20% of NAV.

The market then moves to 2012-01-04: SPY rises to 127.70 and QQQ rises to 57.14. The positions are now worth 313x127.70 and 702x57.14, respectively, and NAV including cash becomes 313x127.70+702x57.14+20_140.7=100_223.09.

The reward is the change in NAV, in this case 223.09.

The position records below confirm these numbers.

In [26]:
eval_qt_env._df_positions.iloc[idx0]

SPY          0.0
QQQ          0.0
Cash    100000.0
NAV     100000.0
Name: 2012-01-03 15:59:59, dtype: float64

In [27]:
eval_qt_env._df_positions.iloc[idx0+1]

SPY        313.000000
QQQ        702.000000
Cash     20140.713799
NAV     100223.092415
Name: 2012-01-04 15:59:59, dtype: float64

In [28]:
reward, 100223.092415 - 100000.0

(223.09241505889892, 223.09241500000644)
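The same numbers can be reproduced by hand in plain Python. This is a minimal sketch that assumes whole-share purchases (rounding down) and a commission proportional to traded value, which matches the figures above; the helper `rebalance` is hypothetical and is not the quanttrader implementation. Prices are rounded to two decimals, so the result differs from the environment's output only beyond the second decimal.

```python
import math


def rebalance(cash, prices, weights, commission_rate):
    """Buy whole shares toward each target weight; return (shares, cash left)."""
    nav = cash  # starting from an all-cash portfolio, NAV equals cash
    shares = [math.floor(nav * w / p) for w, p in zip(weights, prices)]
    traded = sum(s * p for s, p in zip(shares, prices))
    fee = traded * commission_rate
    return shares, cash - traded - fee


prices_t0 = [127.50, 56.90]   # SPY, QQQ close on 2012-01-03
prices_t1 = [127.70, 57.14]   # SPY, QQQ close on 2012-01-04

shares, cash_left = rebalance(100_000.0, prices_t0, [0.4, 0.4], 0.0001)
nav_t1 = sum(s * p for s, p in zip(shares, prices_t1)) + cash_left
reward = nav_t1 - 100_000.0

print(shares, round(cash_left, 2), round(nav_t1, 2), round(reward, 2))
# [313, 702] 20140.71 100223.09 223.09
```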

Create a TF-Agents environment from the Gym environment.