# Building a Reinforcement Learning Environment

This notebook creates a reinforcement learning environment. The environment is designed to work with a generic time series, and is applied here to traditional shares.

In [1]:
import logging
import imp
import sys
from six import StringIO

In [2]:
import gym
from gym import spaces, envs
import pandas as pd
import numpy as np

In [3]:
sys.path.append('..')

from helpers.dataset import read_quote_dataset, preprocess_quotes

In [4]:
# Configure the logging module for the Jupyter notebook
imp.reload(logging)
# logging_format = '%(asctime)s - %(levelname)s - %(process)s - %(message)s'
logging_format = '%(message)s'
logging.basicConfig(level=logging.DEBUG, format=logging_format)

# Disable backtesting logs
# logging.getLogger('helpers.backtest').setLevel(level=logging.WARNING)

# Load the Dataset

In [5]:
PARAM_DATASET = '../../data/SPY_postprocess_adj.csv.gz'

In [6]:
df = read_quote_dataset(PARAM_DATASET)

In [7]:
# Set the date as index
df.set_index('date', drop=False, inplace=True)

In [8]:
df.head()

date (index),date,open,high,low,close,close_adj,volume,open_adj,low_adj,high_adj,...,ratio_close_adj_000_close_adj_005_norm,ratio_close_adj_000_close_adj_020_norm,ratio_close_adj_000_ema_005_norm,ratio_close_adj_000_ema_010_norm,ratio_close_adj_000_ema_020_norm,ratio_close_adj_000_ema_050_norm,ratio_close_adj_000_sma_005_norm,ratio_close_adj_000_sma_010_norm,ratio_close_adj_000_sma_020_norm,ratio_close_adj_000_sma_050_norm
2000-01-03,2000-01-03,148.25,148.25,143.875,145.4375,101.425385,8164300,103.38677,100.335727,103.38677,...,,,,,,,,,,
2000-01-04,2000-01-04,143.531204,144.0625,139.640594,139.75,97.459068,8089800,100.09601,97.38277,100.466526,...,,,,,,,,,,
2000-01-05,2000-01-05,139.9375,141.531204,137.25,140.0,97.633377,12177900,97.589791,95.715579,98.70121,...,,,,,,,,,,
2000-01-06,2000-01-06,139.625,141.5,137.75,137.75,96.064301,6227200,97.371891,96.064301,98.679482,...,,,0.48663,,,,,,,
2000-01-07,2000-01-07,140.3125,145.75,140.0625,145.75,101.643333,8066500,97.851322,97.676977,101.643333,...,,,0.815422,,,,0.740588,,,


In reinforcement learning there is no need to create the class (the target label), because there is none. **Remember that this is not supervised learning**. The equivalent of the supervised-learning class is the reward obtained at every step.

So no preprocessing is needed.

# Create the Reinforcement Learning Environment

The environment class is generic and implements the `gym.Env` interface. It receives as input any time series, with any set of columns that the reinforcement learning algorithm will later use. A minimum and a maximum value must be specified for each column. There are two special entries:
- The date, which must be the index of the input dataframe.
- A column that holds the tradable price (in general the close price). Its name is passed in the `price_col` parameter.

Some basic features of the environment created:
- Trades long, short or both, as specified by the `with_long` and `with_short` parameters.
- Constrains the training/testing period through `date_start` and `date_end`.
- Computes the reward either as a nominal price difference (`reward_type='price'`) or as a percentage (`reward_type='pct'`).
- Optionally standardizes the input columns internally (`std=True`) to help the neural network converge.
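
The two reward types can be sketched in isolation as follows. This is a minimal illustration rather than the environment's own code: the function name `step_reward` and its arguments are hypothetical, and it assumes a position of -1, 0 or 1 held from one price to the next.

```python
def step_reward(position, price_now, price_next, reward_type='pct'):
    """Reward for holding `position` (-1, 0 or 1) from price_now to price_next.

    Hypothetical helper, for illustration only.
    """
    change = price_next - price_now
    if reward_type == 'price':
        # Nominal reward: price difference, signed by the position
        return position * change
    if reward_type == 'pct':
        # Percentage reward: relative change, signed by the position
        return position * change / price_now
    raise ValueError('reward_type %s not implemented.' % reward_type)


# A long position gains when the price rises; a short position loses.
long_pct = step_reward(1, 100.0, 102.0, reward_type='pct')
short_pct = step_reward(-1, 100.0, 102.0, reward_type='pct')
long_price = step_reward(1, 100.0, 102.0, reward_type='price')
flat = step_reward(0, 100.0, 102.0)
```

With no position the reward is always 0, whichever reward type is selected.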

The `gym.Env` methods implemented are:
- reset
- step
- render
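
These methods are usually driven by a loop of the following shape. The sketch uses a tiny stand-in class, `CountdownEnv`, which is hypothetical and only mimics the `reset`/`step`/`render` signatures of the classic `gym` API (where `step` returns four values). It is not the trading environment.

```python
import random


class CountdownEnv:
    """Hypothetical stand-in: the episode ends after `n_steps` steps."""

    def __init__(self, n_steps=3):
        self.n_steps = n_steps
        self.remaining = n_steps

    def reset(self):
        self.remaining = self.n_steps
        return [self.remaining]  # initial state

    def step(self, action):
        self.remaining -= 1
        reward = 1.0 if action == 1 else 0.0
        done = self.remaining == 0
        return [self.remaining], reward, done, {}

    def render(self):
        return 'remaining: %d' % self.remaining


def run_episode(env, policy):
    """Run one episode; return the total reward and the number of steps."""
    state = env.reset()
    total_reward, n_steps, done = 0.0, 0, False
    while not done:
        action = policy(state)
        state, reward, done, _info = env.step(action)
        total_reward += reward
        n_steps += 1
    return total_reward, n_steps


random.seed(0)
total, steps = run_episode(CountdownEnv(n_steps=3),
                           lambda state: random.choice([0, 1]))
always_one_total, _ = run_episode(CountdownEnv(n_steps=3), lambda state: 1)
```

Stopping the loop as soon as `done` is true avoids calling `step` on a finished episode.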

In [9]:
class AssetTimeSerieEnv(gym.Env):
    '''gym interface that implements a stock time series'''
    metadata = {'render.modes': ['human', 'ansi']}
    reward_range = (-np.inf, np.inf)

    all_actions = {
        'long_only': {
            0: 'buy',
            1: 'hold',
            2: 'sell',
        },
        'short_only': {
            0: 'short_sell',
            1: 'hold',
            2: 'buy_to_cover',
        },
        'long_short': {
            0: 'buy',
            1: 'hold',
            2: 'sell',
            3: 'short_sell',
            4: 'buy_to_cover',
        },
    }

    def __init__(self, data_df, input_cols, price_col, low_cols, high_cols,
                 with_long=True, with_short=False,
                 date_start=None, date_end=None,
                 reward_type='pct', sort_index=False, std=False):
        super(AssetTimeSerieEnv, self).__init__()

        if with_long and with_short:
            self.action_key = 'long_short'
            position_min = -1
            position_max = 1
        elif with_long and not with_short:
            self.action_key = 'long_only'
            position_min = 0
            position_max = 1
        elif with_short and not with_long:
            self.action_key = 'short_only'
            position_min = -1
            position_max = 0
        else:
            raise ValueError('Invalid mode: enable with_long, with_short or both')

        self.df = data_df
        if sort_index:
            self.df = self.df.sort_index()

        self.input_cols = input_cols
        self.price_col = price_col
        self.with_long = with_long
        self.with_short = with_short
        self.reward_type = reward_type
        self.std = std

        if self.std:
            new_colnames = ['%s_std' % col for col in input_cols]
            df_std = (self.df[input_cols] - self.df[input_cols].mean()) / \
                self.df[input_cols].std()
            self.df[new_colnames] = df_std
            self.input_cols = new_colnames

        if date_start is None:
            self.date_start = self.df.iloc[0].name
        else:
            self.date_start = self.df.index[self.df.index >= date_start][0]

        if date_end is None:
            self.date_end = self.df.iloc[-1].name
        else:
            self.date_end = self.df.index[self.df.index <= date_end][-1]

        if self.df.loc[self.date_start:self.date_end].shape[0] < 2:
            raise Exception('At least 2 rows are needed')

        self.actions = self.all_actions[self.action_key]
        self.action_space = spaces.Discrete(len(self.actions))
        # self.observation_space = spaces.Box(
        #     low=-np.inf, high=np.inf, shape=(1,), dtype=np.float32)
        self.observation_space = spaces.Box(
            low=np.array([position_min] + low_cols),
            high=np.array([position_max] + high_cols),
            dtype=np.float32,
        )

        self.current_date = None
        self.current_data = None
        self.position = 0
        self.buy_price = None
        self.acum = 0
        self.df_iter = None  # Row iterator, created in reset()

    def reset(self):
        self.position = 0
        self.buy_price = None
        self.df_iter = self.df.loc[self.date_start:self.date_end].iterrows()

        self.current_date, self.current_data = next(self.df_iter)

        new_status = np.array(
            [0] +  # Current position on asset
            [self.current_data[col] for col in self.input_cols]
        )
        return new_status


    def step(self, action):
        action_str = self.actions[action]
        next_date, next_data = next(self.df_iter)

        if action_str == 'hold':
            # Nothing to do
            pass
        elif action_str == 'buy':
            self.position = 1
            self.buy_price = self.current_data[self.price_col]
        elif action_str == 'sell':
            if self.position == 1:
                self.position = 0
                self.buy_price = None
            else:
                # Not possible
                pass
        elif action_str == 'short_sell':
            self.position = -1
            self.buy_price = self.current_data[self.price_col]
        elif action_str == 'buy_to_cover':
            if self.position == -1:
                self.position = 0
                self.buy_price = None

        reward = self._compute_reward(next_data)
        self.current_date, self.current_data = next_date, next_data
        done = self.current_date.date() == self.date_end.date()

        new_status = np.array(
            [self.position] +  # Current position on asset
            [self.current_data[col] for col in self.input_cols]
        )
        return new_status, reward, done, {}

    def _compute_reward(self, next_data):
        if self.position == 0:
            reward = 0
        elif self.reward_type == 'price':
            reward = self.position * (
                next_data[self.price_col] - self.current_data[self.price_col])
        elif self.reward_type == 'pct':
            reward = self.position * (
                next_data[self.price_col] - self.current_data[self.price_col]
            ) / self.current_data[self.price_col]
        else:
            raise Exception('reward_type %s not implemented.' % self.reward_type)

        return reward

    def render(self, mode='human'):
        outfile = StringIO() if mode == 'ansi' else sys.stdout

        render_text_01 = ['date: %s' % self.current_date, 'position: %d' % self.position]
        render_text_02 = ['%s: %f' % (col, self.current_data[col]) for col in self.input_cols]
        render_text_03 = ['%s: %s' % (self.price_col, self.current_data[self.price_col])]
        final_text = ' - '.join(render_text_01 + render_text_02 + render_text_03)

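        outfile.write(final_text)
        outfile.write('\n')

        # In 'ansi' mode, return the rendered text instead of printing it
        if mode == 'ansi':
            return outfile.getvalue()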

# Test our reinforcement learning environment

In [10]:
input_cols = ['open_adj', 'low_adj', 'high_adj', 'close_adj', 'volume']

Create our time series environment with the following features:
- Use the OHLCV quotes (passed in the order open, low, high, close and volume).
- Allow long trades.
- Allow short selling.
- Working period: from 2018-01-01 to 2018-01-10.
- Reward type: percentage.
- No input standardization.
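
With `std=True` the environment would instead z-score each input column. The sketch below shows that transformation on made-up values (the frame `toy_df` is hypothetical). It follows the expression used in `__init__`, which computes the mean and standard deviation over the whole dataframe before the date range is applied.

```python
import pandas as pd

# Illustrative values only (not taken from the dataset)
toy_df = pd.DataFrame({
    'close_adj': [10.0, 12.0, 14.0],
    'volume': [100.0, 300.0, 200.0],
})
cols = ['close_adj', 'volume']

# z-score per column: subtract the column mean and divide by the column
# standard deviation (pandas defaults to the sample deviation, ddof=1)
standardized = (toy_df[cols] - toy_df[cols].mean()) / toy_df[cols].std()
standardized.columns = ['%s_std' % col for col in cols]
```

After this step, columns with very different scales (prices versus volume) share a comparable range.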

In [11]:
ts_env = AssetTimeSerieEnv(
        data_df=df,
        input_cols=input_cols,
        price_col='close_adj',
        low_cols=[0] * len(input_cols),
        high_cols=[np.inf] * len(input_cols),
        with_long=True,
        with_short=True,
        date_start=pd.to_datetime('2018-01-01'),
        date_end=pd.to_datetime('2018-01-10'),
        reward_type='pct',
        std=False,
)

Reset our environment

In [12]:
ts_env.reset()

array([0.00000000e+00, 2.61694586e+02, 2.61264680e+02, 2.62642332e+02,
       2.62603241e+02, 8.66557000e+07])

`reset` returns the initial state. The first value of the state represents the current position on the given asset. If both long and short positions are allowed, this first value can take 3 possible values:
- 0: no position on the asset
- 1: long position on the asset
- -1: short position on the asset.

The remaining items of the state are the values of each input column, which should be the quotes of the first tradable day on or after 2018-01-01.
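
To make the layout of the state vector explicit, a small helper can map it back to named fields. This is a sketch under assumptions: `describe_state` and `POSITION_LABELS` are hypothetical names that are not part of the environment, and the values are the ones returned by `reset` above.

```python
POSITION_LABELS = {0: 'flat', 1: 'long', -1: 'short'}


def describe_state(state, input_cols):
    """Map a state vector to a readable dict (hypothetical helper)."""
    position = int(state[0])
    described = {'position': POSITION_LABELS[position]}
    # The remaining entries follow the order of input_cols
    described.update(zip(input_cols, state[1:]))
    return described


input_cols = ['open_adj', 'low_adj', 'high_adj', 'close_adj', 'volume']
initial_state = [0.0, 261.694586, 261.264680, 262.642332, 262.603241,
                 86655700.0]
described = describe_state(initial_state, input_cols)
```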

Let's verify the state by calling the `render` method.

In [13]:
ts_env.render()

date: 2018-01-02 00:00:00 - position: 0 - open_adj: 261.694586 - low_adj: 261.264680 - high_adj: 262.642332 - close_adj: 262.603241 - volume: 86655700.000000 - close_adj: 262.603241


As no operation has been executed yet, the current state shows no position and the quotes available on 2018-01-02, which is the first tradable day.
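
The requested start date (2018-01-01) is not in the index, so the environment moves to the first index value on or after it. The sketch below reproduces that selection with the same boolean-mask expressions used in `__init__`. The calendar is illustrative: the dates outside the 2018-01-02 to 2018-01-10 range are assumptions added to show the clipping.

```python
import pandas as pd

# Illustrative trading calendar: weekends and 2018-01-01 are absent
index = pd.to_datetime([
    '2017-12-29', '2018-01-02', '2018-01-03', '2018-01-04', '2018-01-05',
    '2018-01-08', '2018-01-09', '2018-01-10', '2018-01-11',
])

requested_start = pd.to_datetime('2018-01-01')
requested_end = pd.to_datetime('2018-01-10')

# First index value on or after the requested start
effective_start = index[index >= requested_start][0]
# Last index value on or before the requested end
effective_end = index[index <= requested_end][-1]
```

Here `effective_start` is 2018-01-02 and `effective_end` is 2018-01-10, matching the dates shown by `render`.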

Let's execute an operation, which in reinforcement learning is called a step.

In [14]:
actions = {value: key for key, value in AssetTimeSerieEnv.all_actions['long_short'].items()}
actions

{'buy': 0, 'hold': 1, 'sell': 2, 'short_sell': 3, 'buy_to_cover': 4}

There are 5 possible actions when short selling is allowed:
- Buy
- Hold the current position
- Sell (only applies when there was a previous buy)
- Short sell
- Buy to cover (only applies when there was a previous short sell).
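
The effect of each action on the position can be summarised as a small transition function. This is a sketch that mirrors the branches of `step` in the class above, where `buy` and `short_sell` set the position unconditionally. The name `next_position` is hypothetical.

```python
def next_position(position, action):
    """Position after applying `action` (hypothetical helper).

    Positions are 1 (long), 0 (flat) and -1 (short).
    """
    if action == 'buy':
        return 1
    if action == 'short_sell':
        return -1
    if action == 'sell' and position == 1:
        return 0
    if action == 'buy_to_cover' and position == -1:
        return 0
    # 'hold', or a 'sell' / 'buy_to_cover' with nothing to close
    return position


transitions = {
    (pos, act): next_position(pos, act)
    for pos in (-1, 0, 1)
    for act in ('buy', 'hold', 'sell', 'short_sell', 'buy_to_cover')
}
```

For example, `transitions[(0, 'sell')]` is `0`: selling without a long position leaves the position unchanged.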

In [15]:
[new_state, reward, done, params] = ts_env.step(actions['buy'])
logging.info('new_state: %s', new_state)
logging.info('reward: %f', reward)
logging.info('done: %s', done)

new_state: [1.00000000e+00 2.62788857e+02 2.62788857e+02 2.64430334e+02
 2.64264221e+02 9.00704000e+07]
reward: 0.006325
done: False


After executing the buy, `step` returns:
- the new state
- the reward
- whether the environment has finished
- an info dictionary (not used in this project)

The first value of the state is 1, which confirms that the buy was executed. The other values are the new quotes available.

The `done` variable is `False`, which means the time series hasn't ended.

The reward is 0.006325, that is, 0.6325%. The new close price is 264.264221 and the previous close price was 262.603241 (see the `close_adj` value after the `reset` call).

Check that the reward was properly computed:

In [16]:
(264.264221 - 262.603241) / 262.603241

0.006325055218949087

Let's verify the current state by calling the `render` method.

In [17]:
ts_env.render()

date: 2018-01-03 00:00:00 - position: 1 - open_adj: 262.788857 - low_adj: 262.788857 - high_adj: 264.430334 - close_adj: 264.264221 - volume: 90070400.000000 - close_adj: 264.264221


On the next available day, keep the same position.

In [18]:
[new_state, reward, done, params] = ts_env.step(actions['hold'])
logging.info('new_state: %s', new_state)
logging.info('reward: %f', reward)
logging.info('done: %s', done)

new_state: [1.00000000e+00 2.64977486e+02 2.64332626e+02 2.65915451e+02
 2.65378052e+02 8.06364000e+07]
reward: 0.004215
done: False


In [19]:
ts_env.render()

date: 2018-01-04 00:00:00 - position: 1 - open_adj: 264.977486 - low_adj: 264.332626 - high_adj: 265.915451 - close_adj: 265.378052 - volume: 80636400.000000 - close_adj: 265.378052


The last step reward is 0.4215%.

Sell the position

In [20]:
[new_state, reward, done, params] = ts_env.step(actions['sell'])
logging.info('new_state: %s', new_state)
logging.info('reward: %f', reward)
logging.info('done: %s', done)

new_state: [0.00000000e+00 2.66257422e+02 2.65710272e+02 2.67283318e+02
 2.67146545e+02 8.35240000e+07]
reward: 0.000000
done: False


In [21]:
ts_env.render()

date: 2018-01-05 00:00:00 - position: 0 - open_adj: 266.257422 - low_adj: 265.710272 - high_adj: 267.283318 - close_adj: 267.146545 - volume: 83524000.000000 - close_adj: 267.146545


As the position was sold, the new `position` value is 0.

Keep uninvested (using the `hold` action).

In [22]:
[new_state, reward, done, params] = ts_env.step(actions['hold'])
logging.info('new_state: %s', new_state)
logging.info('reward: %f', reward)
logging.info('done: %s', done)

new_state: [0.00000000e+00 2.67039052e+02 2.66716637e+02 2.67810934e+02
 2.67635071e+02 5.73192000e+07]
reward: 0.000000
done: False


In [23]:
ts_env.render()

date: 2018-01-08 00:00:00 - position: 0 - open_adj: 267.039052 - low_adj: 266.716637 - high_adj: 267.810934 - close_adj: 267.635071 - volume: 57319200.000000 - close_adj: 267.635071


As we are uninvested, the action's reward is 0 and the position is also 0.

Now let's go short.

In [24]:
[new_state, reward, done, params] = ts_env.step(actions['short_sell'])
logging.info('new_state: %s', new_state)
logging.info('reward: %f', reward)
logging.info('done: %s', done)

new_state: [-1.00000000e+00  2.68104073e+02  2.67791408e+02  2.68934576e+02
  2.68240875e+02  5.72540000e+07]
reward: -0.002264
done: False


In [25]:
ts_env.render()

date: 2018-01-09 00:00:00 - position: -1 - open_adj: 268.104073 - low_adj: 267.791408 - high_adj: 268.934576 - close_adj: 268.240875 - volume: 57254000.000000 - close_adj: 268.240875


As we went short, the new `position` value is `-1`.

As the asset appreciated, going from 267.635071 to 268.240875, the reward is negative.
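
The sign can be checked by hand in the same way as the long reward, multiplying the relative price change by the position. This is a minimal check using the two `close_adj` values quoted above. The variable names are illustrative.

```python
prev_close = 267.635071  # close_adj on 2018-01-08
next_close = 268.240875  # close_adj on 2018-01-09
position = -1            # short

# Percentage reward, signed by the position
reward = position * (next_close - prev_close) / prev_close
```

The result is negative and agrees with the logged reward of -0.002264 to within rounding.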

Keep the short position

In [26]:
[new_state, reward, done, params] = ts_env.step(actions['hold'])
logging.info('new_state: %s', new_state)
logging.info('reward: %f', reward)
logging.info('done: %s', done)

new_state: [-1.00000000e+00  2.67400569e+02  2.66658026e+02  2.68123609e+02
  2.67830475e+02  6.95743000e+07]
reward: 0.001530
done: True


In [27]:
ts_env.render()

date: 2018-01-10 00:00:00 - position: -1 - open_adj: 267.400569 - low_adj: 266.658026 - high_adj: 268.123609 - close_adj: 267.830475 - volume: 69574300.000000 - close_adj: 267.830475


The reward is positive because the asset depreciated, going from 268.240875 to 267.830475.

On the other hand, the `step` method now returns `done` as `True`. This is because the date is 2018-01-10, which is the last valid date, so the episode is finished.

As the episode has finished, it is not possible to call the `step` method again. Let's see...

In [28]:
ts_env.step(actions['hold'])

StopIteration: 

The only way to call `step` again is to reset the environment.
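
The `StopIteration` arises because `step` calls `next` on the row iterator that `reset` creates (`self.df_iter`), and that iterator is exhausted once the last row has been consumed. The sketch below reproduces the mechanism with a hypothetical stand-in class, `RowWalker`, over a plain list instead of a dataframe.

```python
class RowWalker:
    """Hypothetical stand-in that walks over a fixed list of rows."""

    def __init__(self, rows):
        self.rows = rows
        self.row_iter = None

    def reset(self):
        # A fresh iterator restarts the walk from the first row
        self.row_iter = iter(self.rows)
        return next(self.row_iter)

    def step(self):
        # Raises StopIteration once every row has been consumed
        return next(self.row_iter)


walker = RowWalker(['2018-01-09', '2018-01-10'])
first = walker.reset()
last = walker.step()

try:
    walker.step()
    exhausted = False
except StopIteration:
    exhausted = True

# Resetting builds a new iterator, so stepping is possible again
restarted = walker.reset()
```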

In [29]:
ts_env.reset()

array([0.00000000e+00, 2.61694586e+02, 2.61264680e+02, 2.62642332e+02,
       2.62603241e+02, 8.66557000e+07])

In [30]:
ts_env.render()

date: 2018-01-02 00:00:00 - position: 0 - open_adj: 261.694586 - low_adj: 261.264680 - high_adj: 262.642332 - close_adj: 262.603241 - volume: 86655700.000000 - close_adj: 262.603241


In [31]:
ts_env.step(actions['short_sell'])

(array([-1.00000000e+00,  2.62788857e+02,  2.62788857e+02,  2.64430334e+02,
         2.64264221e+02,  9.00704000e+07]), -0.006325055218949087, False, {})