<a href="https://colab.research.google.com/github/letianzj/QuantResearch/blob/master/ml/reinforcement_trader.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

(Use the Open in Colab button above to see the trading videos.)

## Introduction

From reinforcement gamer to reinforcement trader

As illustrated in the figure below, investing bears a clear resemblance to game playing. In fact, some good poker players, such as Edward Thorp, also stand out in the stock markets.


![From Game Player to Stock Trader](https://files.gitbook.com/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-MSGPuOMqasmUECLHyXj%2Fuploads%2Fgit-blob-87626e4bd747bdb40439277c09abce3e5aeb822d%2Fch5_rl_stock_trading.PNG?alt=media)

source: [Chapter Machine Learning](https://letianzj.gitbook.io/systematic-investing/products_and_methodologies/machine_learning)

Reinforcement learning has been applied to both stock trading and portfolio management. Xiong et al. (2018) apply it to the stock market, and Zhang et al. (2020) trade the futures market. Nan et al. (2020) add news headline sentiment to the training. Spooner et al. (2018) study market makers, who face inventory risk. Fischer (2018) surveys the state of reinforcement learning in financial markets.

This notebook focuses on the trading part. It trains a reinforcement trader to buy and sell stocks, with the objective of maximizing the ending dollar profit. Of course, risk-adjusted objectives such as the Sharpe ratio are also viable.
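
To make the two objectives concrete, here is a minimal sketch that scores the same NAV path by ending dollar profit and by an annualized Sharpe ratio. The function names, the zero risk-free rate, and the 252 trading days per year are assumptions chosen for illustration; they are not taken from the trading environment used later.

```python
import math

def end_profit(nav):
    """Ending dollar profit: final NAV minus starting NAV."""
    return nav[-1] - nav[0]

def sharpe_ratio(nav, periods_per_year=252):
    """Annualized Sharpe ratio of a NAV series (assumes a zero risk-free rate)."""
    rets = [nav[i] / nav[i - 1] - 1.0 for i in range(1, len(nav))]
    n = len(rets)
    if n < 2:
        return 0.0
    mean = sum(rets) / n
    var = sum((r - mean) ** 2 for r in rets) / (n - 1)  # sample variance
    if var == 0.0:
        return 0.0  # a flat path has no risk-adjusted return to report
    return mean / math.sqrt(var) * math.sqrt(periods_per_year)

nav_path = [100.0, 101.0, 100.0, 102.0]  # made-up daily NAV values
print(end_profit(nav_path), sharpe_ratio(nav_path))
```

Two paths with the same ending profit can have very different Sharpe ratios, which is why the choice of objective matters for what the trader learns.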

The next notebook will focus on the portfolio management part, training a reinforcement portfolio manager to perform strategic allocation among stocks or asset classes.

## Setup

Uncomment the lines below and run them once.

In [11]:
# !sudo apt-get update
# !pip install yfinance
# !pip install ta
# !pip install -U gym==0.21.0
# !pip install -U quanttrader==0.5.5
# !pip install -U pyfolio==0.9.2

# !sudo apt-get install -y xvfb ffmpeg freeglut3-dev
# !pip install 'imageio==2.4.0'
# !pip install pyvirtualdisplay
# !pip install tf-agents[reverb]
# !pip install pyglet
# !pip install -U PyYaml==3.13

Restart the runtime so that PyYaml==3.13 takes effect; otherwise pyfolio will fail with a yaml.load error.

The installation might need to be run twice.

In [6]:
import os
import io
import tempfile
import shutil
import zipfile
from google.colab import files

from datetime import datetime
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import yfinance as yf
import gym
import quanttrader as qt               # alias used below to print the package version
from quanttrader import TradingEnv
import pyfolio as pf

import tensorflow as tf
from tf_agents.agents.dqn import dqn_agent
from tf_agents.drivers import py_driver
from tf_agents.drivers.dynamic_step_driver import DynamicStepDriver
from tf_agents.environments import tf_py_environment
from tf_agents.environments import suite_gym
from tf_agents.eval import metric_utils
from tf_agents.metrics import tf_metrics
from tf_agents.networks import sequential, q_network, network
from tf_agents.policies import py_tf_eager_policy
from tf_agents.policies import random_tf_policy
from tf_agents.policies import policy_saver
from tf_agents.replay_buffers import TFUniformReplayBuffer
from tf_agents.trajectories import trajectory
from tf_agents.specs import tensor_spec
from tf_agents.utils import common

import base64
import imageio
import IPython
import matplotlib
import PIL.Image
import pyvirtualdisplay
import reverb

In [4]:
gym.__version__, qt.__version__, pf.__version__

('0.21.0', '0.5.5', '0.9.2')

In [8]:
def load_data():
    from datetime import timedelta
    import ta

    start_date = datetime(2010, 1, 1)
    end_date = datetime(2020, 12, 31)
    syms = ['SPY']
    max_price_scaler = 5_000.0
    max_volume_scaler = 1.5e10
    df_obs = pd.DataFrame()             # observation
    df_exch = pd.DataFrame()            # exchange; for order match

    for sym in syms:
        df = yf.download(sym, start=start_date, end=end_date)
        df.index = pd.to_datetime(df.index) + timedelta(hours=15, minutes=59, seconds=59)

        df_exch = pd.concat([df_exch, df['Close'].rename(sym)], axis=1)

        df['Open'] = df['Adj Close'] / df['Close'] * df['Open'] / max_price_scaler
        df['High'] = df['Adj Close'] / df['Close'] * df['High'] / max_price_scaler
        df['Low'] = df['Adj Close'] / df['Close'] * df['Low'] / max_price_scaler
        df['Volume'] = df['Adj Close'] / df['Close'] * df['Volume'] / max_volume_scaler
        df['Close'] = df['Adj Close'] / max_price_scaler
        df = df[['Open', 'High', 'Low', 'Close', 'Volume']].copy()   # copy so the indicator columns below are added to an independent frame
        df.columns = [f'{sym}:{c.lower()}' for c in df.columns]

        macd = ta.trend.MACD(close=df[f'{sym}:close'])
        df[f'{sym}:macd'] = macd.macd()
        df[f'{sym}:macd_diff'] = macd.macd_diff()
        df[f'{sym}:macd_signal'] = macd.macd_signal()

        rsi = ta.momentum.RSIIndicator(close=df[f'{sym}:close'])
        df[f'{sym}:rsi'] = rsi.rsi()

        bb = ta.volatility.BollingerBands(close=df[f'{sym}:close'], window=20, window_dev=2)
        df[f'{sym}:bb_bbm'] = bb.bollinger_mavg()
        df[f'{sym}:bb_bbh'] = bb.bollinger_hband()
        df[f'{sym}:bb_bbl'] = bb.bollinger_lband()

        atr = ta.volatility.AverageTrueRange(high=df[f'{sym}:high'], low=df[f'{sym}:low'], close=df[f'{sym}:close'])
        df[f'{sym}:atr'] = atr.average_true_range()

        df_obs = pd.concat([df_obs, df], axis=1)

    return df_obs, df_exch

In [9]:
df_obs, df_exch = load_data()

[*********************100%***********************]  1 of 1 completed
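
The price handling in `load_data` can be illustrated on a single bar: open, high, and low are multiplied by the ratio Adj Close / Close so that they sit on the same adjusted basis as the adjusted close, and everything is then divided by `max_price_scaler`. The sketch below uses a made-up bar and a hypothetical helper name `adjust_and_scale`; only the arithmetic mirrors the function above.

```python
def adjust_and_scale(bar, max_price_scaler=5_000.0):
    """Put OHLC on the adjusted-close basis, then scale into a small range."""
    factor = bar["adj_close"] / bar["close"]  # back-adjustment factor for this bar
    return {
        "open": bar["open"] * factor / max_price_scaler,
        "high": bar["high"] * factor / max_price_scaler,
        "low": bar["low"] * factor / max_price_scaler,
        "close": bar["adj_close"] / max_price_scaler,
    }

# Made-up bar for illustration; not real SPY data.
bar = {"open": 100.0, "high": 102.0, "low": 99.0, "close": 101.0, "adj_close": 90.9}
scaled = adjust_and_scale(bar)
print(scaled)
```

The adjustment preserves the ratios between open, high, low, and close within the bar, so the shape of each candle is unchanged while the level becomes comparable across dividend dates.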


## Trading Environment

In [10]:
look_back = 10
cash = 100_000.0
max_nav_scaler = cash

train_qt_env = TradingEnv(2, df_obs, df_exch)
train_qt_env.set_cash(cash)
train_qt_env.set_commission(0.0001)
train_qt_env.set_steps(n_lookback=10, n_warmup=50, n_maxsteps=250)
train_qt_env.set_feature_scaling(max_nav_scaler)

eval_qt_env = TradingEnv(2, df_obs, df_exch)
eval_qt_env.set_cash(cash)
eval_qt_env.set_commission(0.0001)
eval_qt_env.set_steps(n_lookback=10, n_warmup=50, n_maxsteps=2500, n_init_step=504)         # index 504 is 2012-01-03
eval_qt_env.set_feature_scaling(max_nav_scaler)

In [11]:
o1 = train_qt_env.reset()
train_qt_env._init_step

584

In [12]:
o1 = train_qt_env.reset()
total_reward = 0.0
while True:
    action = train_qt_env.action_space.sample()
    o2, reward, done, info = train_qt_env.step(action)
    total_reward += reward
    #print(action, reward * max_nav_scaler, info)
    if done:
        break

In [13]:
o1.shape, o2.shape

((10, 14), (10, 14))

In [14]:
x_left = train_qt_env._init_step
x_right = min(train_qt_env._df_exch.shape[0], train_qt_env._init_step+train_qt_env._maxsteps+1)
df_price = train_qt_env._df_exch[x_left:x_right]
df_nav = train_qt_env._df_positions['NAV'][x_left:x_right]
df_both = pd.concat([df_nav, df_price], axis=1)
df_both = df_both.iloc[1:]
df_both.columns = ['tf-agent', 'benchmark']
df_ret = df_both / df_both.shift(1) - 1
df_ret = df_ret[1:]
agent_perf_stats = pf.timeseries.perf_stats(df_ret['tf-agent'])
benchmark_perf_stats = pf.timeseries.perf_stats(df_ret['benchmark'])
perf_stats = pd.concat([agent_perf_stats, benchmark_perf_stats], axis=1)
perf_stats.columns = ['tf-agent', 'benchmark']
perf_stats

| Metric | tf-agent | benchmark |
|---|---|---|
| Annual return | -0.023693 | 0.014375 |
| Cumulative returns | -0.023414 | 0.014203 |
| Annual volatility | 0.109128 | 0.168052 |
| Sharpe ratio | -0.165274 | 0.168778 |
| Calmar ratio | -0.20494 | 0.102233 |
| Stability | 0.523998 | 0.005548 |
| Max drawdown | -0.115608 | -0.140615 |
| Omega ratio | 0.958061 | 1.029755 |
| Sortino ratio | -0.227923 | 0.234327 |
| Skew | -0.320142 | -0.24759 |


In [15]:
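print(df_nav.iloc[0], total_reward, df_nav.iloc[-1])     # positional access on a date-indexed Series
np.testing.assert_almost_equal(df_nav.iloc[0] + total_reward, df_nav.iloc[-1], decimal=5)       # starting NAV plus summed rewards should equal ending NAV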

100000.0 -2769.5254657608093 97230.47453423917
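
The check above passes if each step's reward is the change in NAV over that step, because the per-step differences then telescope to the total change. That reward definition is an assumption inferred from the assertion, not confirmed from the `TradingEnv` source. A minimal sketch with a made-up NAV path:

```python
# Made-up NAV path for illustration only.
nav = [100_000.0, 100_350.0, 99_800.0, 97_230.0]

# Assumed reward definition: dollar change in NAV from one step to the next.
rewards = [nav[t + 1] - nav[t] for t in range(len(nav) - 1)]
total_reward = sum(rewards)

# The intermediate NAV values cancel, leaving final NAV minus starting NAV.
assert abs(nav[0] + total_reward - nav[-1]) < 1e-6
print(rewards, total_reward)
```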


In [16]:
train_qt_env = gym.wrappers.FlattenObservation(train_qt_env)
train_py_env = suite_gym.wrap_env(train_qt_env)
train_env = tf_py_environment.TFPyEnvironment(train_py_env)

eval_qt_env = gym.wrappers.FlattenObservation(eval_qt_env)
eval_py_env = suite_gym.wrap_env(eval_qt_env)
eval_env = tf_py_environment.TFPyEnvironment(eval_py_env)

In [17]:
train_env.action_spec()

BoundedTensorSpec(shape=(), dtype=tf.int64, name='action', minimum=array(0), maximum=array(1))

In [18]:
train_env.time_step_spec()

TimeStep(
{'discount': BoundedTensorSpec(shape=(), dtype=tf.float32, name='discount', minimum=array(0., dtype=float32), maximum=array(1., dtype=float32)),
 'observation': BoundedTensorSpec(shape=(140,), dtype=tf.float32, name='observation', minimum=array(-3.4028235e+38, dtype=float32), maximum=array(3.4028235e+38, dtype=float32)),
 'reward': TensorSpec(shape=(), dtype=tf.float32, name='reward'),
 'step_type': TensorSpec(shape=(), dtype=tf.int32, name='step_type')})

Some helper functions

In [19]:
def embed_mp4(filename):
  """Embeds an mp4 file in the notebook."""
  with open(filename, 'rb') as f:   # close the file handle once the bytes are read
    video = f.read()
  b64 = base64.b64encode(video)
  tag = '''
  <video width="640" height="480" controls>
    <source src="data:video/mp4;base64,{0}" type="video/mp4">
  Your browser does not support the video tag.
  </video>'''.format(b64.decode())

  return IPython.display.HTML(tag)

def create_policy_eval_video(env, policy, filename, num_episodes=5, fps=30):
  filename = filename + ".mp4"
  with imageio.get_writer(filename, fps=fps) as video:
    for _ in range(num_episodes):
      time_step = env.reset()
      video.append_data(env.pyenv.envs[0].render())

      while not time_step.is_last():
        action_step = policy.action(time_step)
        time_step = env.step(action_step.action)
        video.append_data(env.pyenv.envs[0].render())

  return embed_mp4(filename)

## Spontaneous Trader

In [20]:
random_policy = random_tf_policy.RandomTFPolicy(train_env.time_step_spec(), train_env.action_spec())

In [21]:
time_step = train_py_env.reset()

In [22]:
random_policy.action_spec

BoundedTensorSpec(shape=(), dtype=tf.int64, name='action', minimum=array(0), maximum=array(1))

In [23]:
time_step = train_env.reset()
action_step = random_policy.action(time_step)

The video below shows the spontaneous trader's random trading behavior.

The upper half is the SPY price curve with red buy marks and green sell marks. The lower half is the NAV, or total asset value.

Because both the trading window and the trading actions are random, each re-run of the code below generates a slightly different video.

In [24]:
create_policy_eval_video(train_env, random_policy, "random-agent", num_episodes=1)

In [277]:
train_env.pyenv.envs[0].env._df_positions

| Timestamp | SPY | Cash | NAV |
|---|---|---|---|
| 2010-01-04 15:59:59 | 0.0 | 100000.0 | 100000.0 |
| 2010-01-05 15:59:59 | 0.0 | 100000.0 | 100000.0 |
| 2010-01-06 15:59:59 | 0.0 | 100000.0 | 100000.0 |
| 2010-01-07 15:59:59 | 0.0 | 100000.0 | 100000.0 |
| 2010-01-08 15:59:59 | 0.0 | 100000.0 | 100000.0 |
| ... | ... | ... | ... |
| 2020-12-23 15:59:59 | 0.0 | 0.0 | 0.0 |
| 2020-12-24 15:59:59 | 0.0 | 0.0 | 0.0 |
| 2020-12-28 15:59:59 | 0.0 | 0.0 | 0.0 |
| 2020-12-29 15:59:59 | 0.0 | 0.0 | 0.0 |


## Reinforcement Trader

### Recruit a Trader

Here we recruit a DQN trader, give her $100,000, and let her trade SPY.

Hopefully, after 1 million simulated training steps, she will find a good quantitative trading rule for SPY.

Her trading rule is a black box. We don't care how she trades, as long as she keeps bringing in profits.

She is then expected to apply her deep neural network trading rule to other stocks via so-called transfer learning.

In [325]:
learning_rate = 1e-3 
num_eval_episodes = 10
replay_buffer_max_length = 100000

In [326]:
fc_layer_params = (100, 50)
action_tensor_spec = tensor_spec.from_spec(train_env.action_spec())
num_actions = action_tensor_spec.maximum - action_tensor_spec.minimum + 1

# Define a helper function to create Dense layers configured with the right
# activation and kernel initializer.
def dense_layer(num_units):
  return tf.keras.layers.Dense(
      num_units,
      activation=tf.keras.activations.relu,
      kernel_initializer=tf.keras.initializers.VarianceScaling(
          scale=2.0, mode='fan_in', distribution='truncated_normal'))

# QNetwork consists of a sequence of Dense layers followed by a dense layer
# with `num_actions` units to generate one q_value per available action as
# its output.
dense_layers = [dense_layer(num_units) for num_units in fc_layer_params]
q_values_layer = tf.keras.layers.Dense(
    num_actions,
    activation=None,
    kernel_initializer=tf.keras.initializers.RandomUniform(
        minval=-0.03, maxval=0.03),
    bias_initializer=tf.keras.initializers.Constant(-0.2))
q_net = sequential.Sequential(dense_layers + [q_values_layer])

In [327]:
optimizer = tf.keras.optimizers.Adam(learning_rate=learning_rate)

train_step_counter = tf.Variable(0)

agent = dqn_agent.DqnAgent(
    train_env.time_step_spec(),
    train_env.action_spec(),
    q_network=q_net,
    optimizer=optimizer,
    td_errors_loss_fn=common.element_wise_squared_loss,
    train_step_counter=train_step_counter)

agent.initialize()

Here we use a three-layer fully connected deep neural network.

The input observation is 10 days x 14 features, flattened to shape 140. The first layer has 100 neurons. Therefore, it requires $140 \times 100+100=14,100$ parameters.

The second layer has 50 neurons, implying $100 \times 50 + 50 = 5,050$ parameters. 

The output layer is a binary decision of either buy or sell, which needs $50 \times 2 + 2 = 102$ parameters.
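
The parameter arithmetic above can be reproduced with a few lines of plain Python. The helper name `dense_params` is made up for this sketch; the layer sizes are the ones stated in the text.

```python
def dense_params(n_in, n_out):
    """Weights (n_in * n_out) plus one bias per output unit."""
    return n_in * n_out + n_out

layer_sizes = [140, 100, 50, 2]  # flattened input, two hidden layers, two actions
counts = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
total = sum(counts)
print(counts, total)  # [14100, 5050, 102] 19252
```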

In [328]:
q_net.summary()

Model: "sequential_3"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_9 (Dense)              multiple                  14100     
_________________________________________________________________
dense_10 (Dense)             multiple                  5050      
_________________________________________________________________
dense_11 (Dense)             multiple                  102       
Total params: 19,252
Trainable params: 19,252
Non-trainable params: 0
_________________________________________________________________


In [329]:
eval_policy = agent.policy
collect_policy = agent.collect_policy
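
The two policies differ in how they turn Q-values into actions. In TF-Agents, `agent.policy` acts greedily on the Q-network, while `agent.collect_policy` is, to my understanding, epsilon-greedy by default so that data collection keeps exploring. The sketch below shows the idea in plain Python; the function names and the `rng` parameter are illustrative and are not the TF-Agents implementation.

```python
import random

def greedy_action(q_values):
    """Pick the action with the highest estimated Q-value."""
    return max(range(len(q_values)), key=lambda a: q_values[a])

def epsilon_greedy_action(q_values, epsilon, rng=random):
    """With probability epsilon pick a uniformly random action, else act greedily."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return greedy_action(q_values)

q = [0.1, 0.7]  # made-up Q-values for the two actions
print(greedy_action(q), epsilon_greedy_action(q, epsilon=0.1))
```

Evaluation uses the greedy policy so that the reported return reflects what the network has learned rather than exploration noise.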

In [346]:
random_policy = random_tf_policy.RandomTFPolicy(train_env.time_step_spec(), train_env.action_spec())

In [331]:
time_step = train_env.reset()

In [332]:
random_policy.action(time_step)

PolicyStep(action=<tf.Tensor: shape=(1,), dtype=int64, numpy=array([0])>, state=(), info=())

In [333]:
def compute_avg_return(environment, policy, num_episodes=5):

  total_return = 0.0
  for _ in range(num_episodes):

    time_step = environment.reset()
    episode_return = 0.0

    while not time_step.is_last():
      action_step = policy.action(time_step)
      time_step = environment.step(action_step.action)
      episode_return += time_step.reward
    total_return += episode_return

  avg_return = total_return / num_episodes
  return avg_return.numpy()[0]

In [334]:
compute_avg_return(eval_env, random_policy, num_episodes=1)

42119.504

In [335]:
replay_buffer = TFUniformReplayBuffer(
    data_spec=agent.collect_data_spec,
    batch_size=train_env.batch_size,
    max_length=replay_buffer_max_length)

In [336]:
# replay_buffer_observer = replay_buffer.add_batch

In [337]:
train_env.reset()

init_driver = DynamicStepDriver(
    train_env,
    random_policy,
    observers=[replay_buffer.add_batch],
    num_steps=2_500)
final_time_step, final_policy_state = init_driver.run()

In [338]:
trajectories, buffer_info = replay_buffer.get_next(sample_batch_size=2, num_steps=3)

In [339]:
trajectories.observation.shape

TensorShape([2, 3, 140])

In [340]:
from tf_agents.trajectories.trajectory import to_transition
time_steps, action_steps, next_time_steps = to_transition(trajectories)
time_steps.observation.shape

TensorShape([2, 2, 140])
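
The time dimension shrinks from 3 to 2 because a trajectory of N consecutive steps yields N - 1 (current, next) pairs. A plain-Python sketch of that pairing, with a hypothetical helper name `to_pairs` and placeholder step labels:

```python
def to_pairs(steps):
    """Pair each step with its successor; N steps give N - 1 transitions."""
    return list(zip(steps[:-1], steps[1:]))

steps = ["s0", "s1", "s2"]   # three consecutive steps, like num_steps=3 above
pairs = to_pairs(steps)
print(pairs)  # [('s0', 's1'), ('s1', 's2')]
```

This is also why the training dataset below is sampled with `num_steps=2`: two adjacent steps are the minimum needed to form one transition.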

In [341]:
dataset = replay_buffer.as_dataset(
    num_parallel_calls=3,
    sample_batch_size=64,
    num_steps=2).prefetch(3)

In [342]:
iterator = iter(dataset)

In [343]:
num_iterations = 1_000_000   # less intelligence, more persistence; 24x7 player
save_interval = 100_000
eval_interval = 50_000
log_interval = 5_000

In [344]:
# Create a driver to collect experience.
collect_driver = DynamicStepDriver(
    train_env,
    agent.collect_policy,
    observers=[replay_buffer.add_batch],
    num_steps=4) # collect 4 steps for each training iteration
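
The replay buffer that the driver writes into behaves like a bounded queue with uniform random reads: new experience pushes out the oldest once capacity is reached, and training batches are drawn uniformly from whatever is stored. The class below is a simplified stand-in written for illustration, not the `TFUniformReplayBuffer` implementation; the class name and the `seed` parameter are assumptions.

```python
import random
from collections import deque

class UniformReplayBuffer:
    """Fixed-capacity buffer: oldest items are evicted, sampling is uniform."""

    def __init__(self, max_length, seed=None):
        self._items = deque(maxlen=max_length)  # deque drops the oldest item when full
        self._rng = random.Random(seed)

    def add(self, item):
        self._items.append(item)

    def sample(self, batch_size):
        # Uniform sampling with replacement from the stored items.
        return [self._rng.choice(self._items) for _ in range(batch_size)]

    def __len__(self):
        return len(self._items)

buf = UniformReplayBuffer(max_length=3, seed=0)
for step in range(5):      # steps 0 and 1 are evicted once capacity is reached
    buf.add(step)
batch = buf.sample(64)
print(len(buf), sorted(set(batch)))
```

Sampling from a large buffer rather than training on the latest steps only breaks the strong correlation between consecutive trading days in a batch.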

### Training the Trader

In [345]:
# (Optional) Optimize by wrapping some of the code in a graph using TF function.
collect_driver.run = common.function(collect_driver.run)
agent.train = common.function(agent.train)

# Reset the train step.
agent.train_step_counter.assign(0)

# Evaluate the agent's policy once before training.
avg_return = compute_avg_return(train_env, agent.policy, num_episodes=1)
returns = np.array([avg_return])

# Reset the environment.
time_step = None
policy_state = agent.collect_policy.get_initial_state(train_env.batch_size)

step = 5000: loss = 2167260.0
step = 10000: loss = 1368180.125
step = 15000: loss = 2198302.0
step = 20000: loss = 2775266.5
step = 25000: loss = 1159157.125
step = 30000: loss = 1105965.375
step = 35000: loss = 1888510.5
step = 40000: loss = 3168719.5
step = 45000: loss = 851866.375
step = 50000: loss = 6141768.0
step = 50000: Average Return = 0.0
step = 55000: loss = 3820841.0
step = 60000: loss = 1095066.125
step = 65000: loss = 1502168.75
step = 70000: loss = 7292247.5
step = 75000: loss = 674870.75
step = 80000: loss = 635257.3125
step = 85000: loss = 4569862.5
step = 90000: loss = 672412.5
step = 95000: loss = 642349.625
step = 100000: loss = 1482314.125
step = 100000: Average Return = 19049.134765625
step = 105000: loss = 1826036.875
 step 110000step 

In [386]:
num_iterations = 2_000_000
while True:
    # Collect a few steps using collect_policy and save to the replay buffer.
    time_step, policy_state = collect_driver.run(time_step, policy_state)

    # Sample a batch of data from the buffer and update the agent's network.
    experience, unused_info = next(iterator)
    train_loss = agent.train(experience).loss

    step = agent.train_step_counter.numpy()
    print(f'\r step {step}', end='')

    if step % log_interval == 0:
        print('step = {0}: loss = {1}'.format(step, train_loss))

    if step % eval_interval == 0:
        avg_return = compute_avg_return(train_env, agent.policy, num_episodes=1)
        print('step = {0}: Average Return = {1}'.format(step, avg_return))
        returns = np.append(returns, avg_return)

    # if step % save_interval == 0:
    #     save_checkpoint_to_local()

    if step > num_iterations:
        break

step = 1005000: loss = 1266627.0
step = 1010000: loss = 390454.6875
step = 1015000: loss = 417311.375
step = 1020000: loss = 687286.4375
step = 1025000: loss = 1044038.5
step = 1030000: loss = 537503.75
step = 1035000: loss = 784378.625
step = 1040000: loss = 1072486.125
step = 1045000: loss = 969355.0625
step = 1050000: loss = 464003.75
step = 1050000: Average Return = 3481.649169921875
step = 1055000: loss = 646623.0625
step = 1060000: loss = 1622306.0
step = 1065000: loss = 491992.90625
step = 1070000: loss = 770378.8125
step = 1075000: loss = 412003.6875
step = 1080000: loss = 1062068.75
step = 1085000: loss = 966958.875
step = 1090000: loss = 2013997.125
step = 1095000: loss = 1142090.75
step = 1100000: loss = 1249168.25
step

KeyboardInterrupt: ignored
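
The loss printed in the log is a squared temporal-difference (TD) error. As a rough sketch of what one DQN update measures, the code below computes the one-step target `r + gamma * discount * max_a Q(s', a)` and the element-wise squared error against the Q-value of the action actually taken. It is plain Python with made-up numbers and illustrative function names, it ignores the separate target network, and `gamma=1.0` is assumed here to match what I believe is the `DqnAgent` default.

```python
def td_targets(rewards, next_q_values, discounts, gamma=1.0):
    """One-step Q-learning targets; discount 0.0 marks the end of an episode."""
    return [r + gamma * d * max(q_next)
            for r, q_next, d in zip(rewards, next_q_values, discounts)]

def squared_td_errors(q_taken, targets):
    """Element-wise squared difference between target and current estimate."""
    return [(t - q) ** 2 for q, t in zip(q_taken, targets)]

rewards = [10.0, -5.0]                  # made-up dollar rewards
next_q = [[1.0, 3.0], [0.0, -2.0]]      # Q-values of both actions in the next state
discounts = [1.0, 0.0]                  # second transition ends its episode
q_taken = [12.0, -4.0]                  # current Q-value of the action that was taken

targets = td_targets(rewards, next_q, discounts)
errors = squared_td_errors(q_taken, targets)
print(targets, errors)  # [13.0, -5.0] [1.0, 1.0]
```

Because the rewards in this notebook appear to be in dollars on a $100,000 account, squared errors in the hundreds of thousands or millions are plausible and do not by themselves indicate a problem.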

Save the model.

The policy is saved to GitHub. It can be downloaded from there with wget and unzipped.

In [347]:
def create_zip_file(dirname, base_filename):
  return shutil.make_archive(base_filename, 'zip', dirname)

In [350]:
tempdir = os.getenv("TEST_TMPDIR", tempfile.gettempdir())

policy_dir = os.path.join(tempdir, 'policy')
tf_policy_saver = policy_saver.PolicySaver(agent.policy)

In [351]:
tf_policy_saver.save(policy_dir)
policy_zip_filename = create_zip_file(policy_dir, os.path.join(tempdir, 'exported_policy'))

INFO:tensorflow:Assets written to: /tmp/policy/assets


In [357]:
!ls /tmp/exported_policy.zip

/tmp/exported_policy.zip
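
For the reverse direction, a downloaded archive can be unpacked with the standard library before the policy is loaded. The sketch below performs a round trip on a stand-in directory containing a placeholder file, so it runs without TensorFlow; the helper name `unpack_policy` and the placeholder contents are assumptions made for illustration.

```python
import os
import shutil
import tempfile

def unpack_policy(zip_path, dest_dir):
    """Unzip an exported policy archive into dest_dir and return that directory."""
    shutil.unpack_archive(zip_path, dest_dir, "zip")
    return dest_dir

# Stand-in for a saved policy directory (placeholder file, not a real SavedModel).
src_dir = tempfile.mkdtemp()
with open(os.path.join(src_dir, "saved_model.pb"), "wb") as f:
    f.write(b"placeholder")

# Zip it the same way create_zip_file does, then unpack it somewhere else.
zip_base = os.path.join(tempfile.mkdtemp(), "exported_policy")
zip_path = shutil.make_archive(zip_base, "zip", src_dir)
restored_dir = unpack_policy(zip_path, tempfile.mkdtemp())
print(sorted(os.listdir(restored_dir)))  # ['saved_model.pb']
```

With a real export, the unpacked directory could then be passed to `tf.saved_model.load`, which is the loading route I understand the TF-Agents documentation to describe for `PolicySaver` output.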


In [358]:
files.download(policy_zip_filename)

### Evaluate Trader Performance 

In [361]:
create_policy_eval_video(train_env, agent.policy, "trained-agent", num_episodes=1)

In [None]:
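def create_policy_eval_video(policy, filename, num_episodes=5, fps=30):
  """Variant of the earlier helper, bound to the training environment.

  Note: this redefinition drops the `env` argument and shadows the earlier helper.
  """
  filename = filename + ".mp4"
  with imageio.get_writer(filename, fps=fps) as video:
    for _ in range(num_episodes):
      # Reset through the batched TF environment so the time step matches what
      # the TF policy expects; it wraps train_py_env, which is used for rendering.
      time_step = train_env.reset()
      video.append_data(train_py_env.render())
      while not time_step.is_last():
        action_step = policy.action(time_step)
        time_step = train_env.step(action_step.action)
        video.append_data(train_py_env.render())
  return embed_mp4(filename)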

In [None]:
train_env.pyenv.envs[0].env._df_positions

| Timestamp | SPY | Cash | NAV |
|---|---|---|---|
| 2010-01-04 15:59:59 | 0.0 | 100000.0 | 100000.0 |
| 2010-01-05 15:59:59 | 0.0 | 100000.0 | 100000.0 |
| 2010-01-06 15:59:59 | 0.0 | 100000.0 | 100000.0 |
| 2010-01-07 15:59:59 | 0.0 | 100000.0 | 100000.0 |
| 2010-01-08 15:59:59 | 0.0 | 100000.0 | 100000.0 |
| ... | ... | ... | ... |
| 2020-12-23 15:59:59 | 0.0 | 0.0 | 0.0 |
| 2020-12-24 15:59:59 | 0.0 | 0.0 | 0.0 |
| 2020-12-28 15:59:59 | 0.0 | 0.0 | 0.0 |
| 2020-12-29 15:59:59 | 0.0 | 0.0 | 0.0 |


In [383]:
time_step = train_env.reset()
ones = 0
zeros = 0
while not time_step.is_last():
  action_step = agent.policy.action(time_step)
  time_step = train_env.step(action_step.action)
  if action_step.action.numpy()[0] == 1:
    ones+=1
  else:
    zeros+=1

print(zeros, ones)

0 250


In [385]:
agent.train_step_counter.numpy()

1000001