# Stock Trading Using RL
Using DQN to make buy/sell/hold decisions for a given stock.

In this chapter, we’ll implement our own OpenAI Gym environment, which simulates the stock market, and apply the DQN method that we’ve just learned in Chapter 6, Deep Q-Networks, and Chapter 7, DQN Extensions, to train an agent that will trade stocks to maximize profit.

## Data
In our example, we’ll use Russian stock market prices for the period 2015-2016, which are stored in Chapter08/data/ch08-small-quotes.tgz. The archive has to be unpacked before model training.
Inside the archive, we have CSV files with M1 bars, which means that every row in the CSV corresponds to a single minute in time, and the price movement during this minute is captured with four prices: open, high, low, and close. Here, the open price is the price at the beginning of the minute, high is the maximum price during the interval, low is the minimum price, and the close price is the last price of the minute interval. Every minute interval is called a bar and gives us an idea of the price movement within the interval. For example, in the YNDX_160101_161231.csv file (which contains Yandex company stock quotes for 2016), we have 130k lines of this form:

| DATE  |  TIME |  OPEN |  HIGH |    LOW |  CLOSE |  VOL |
| ------ | ------- | ------- | ------- | -------- | -------- | ------ |
| 20160104 | 100100 | 1148.9000000 | 1148.9000000 | 1148.9000000 | 1148.9000000 | 0 |

   
The first two columns are the date and time of the minute; the next four columns are the open, high, low, and close prices; and the last value represents the number of buy and sell orders performed during the bar. The exact interpretation of this number is stock and market-dependent, but usually, volumes give you an idea about how active the market was.
The typical way to represent those prices is called a candlestick chart, where every bar is shown as a candle. Part of Yandex’s quotes for one day in February 2016 is shown in the following chart. Every file in the archive contains the M1 data for one year and will be used in this chapter’s example.
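
To make the row layout described above concrete, here is a minimal sketch that parses one M1 bar into a small record. It assumes comma-separated fields in the column order of the table above; the `Bar` class and the `parse_bar` helper are hypothetical names used only for illustration and are not part of the chapter's `data` module.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Bar:
    """One M1 bar: the minute's timestamp, four prices and the volume."""
    timestamp: datetime
    open: float
    high: float
    low: float
    close: float
    volume: float


def parse_bar(row: str) -> Bar:
    """Parse a row of the form DATE,TIME,OPEN,HIGH,LOW,CLOSE,VOL (assumed layout)."""
    date, time, o, h, l, c, vol = [field.strip() for field in row.split(",")]
    # DATE is YYYYMMDD and TIME is HHMMSS, as in the sample row
    timestamp = datetime.strptime(date + time, "%Y%m%d%H%M%S")
    return Bar(timestamp, float(o), float(h), float(l), float(c), float(vol))


bar = parse_bar("20160104,100100,1148.9000000,1148.9000000,1148.9000000,1148.9000000,0")
print(bar.timestamp, bar.open, bar.volume)
```

In the sample row all four prices are equal and the volume is zero, so nothing was traded during that minute.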

## Problem statements and key decisions 

In our example, we’ll just scratch the surface a bit with our RL tools and our problem will be formulated as simply as possible, using price as an observation. We will investigate whether it will be possible for our agent to learn when the best time is to buy one single share and then close the position to maximize the profit. The purpose of this example is to show how flexible the RL model can be and what the first steps are that you usually need to take to apply RL to a real-life use case.
As you already know, three things are needed to formulate an RL problem:
* observation of the environment
* possible actions
* reward system

In previous chapters, all three were already given to us and the internal machinery of the environment was hidden. Now we’re in a different situation, so **we need to decide ourselves what our agent will see and what set of actions it can take**. The reward system is also not given as a strict set of rules; rather, it is guided by our feelings and knowledge of the domain, which leaves us lots of flexibility here.

Flexibility, in this case, is good and bad at the same time. It’s good that we have the freedom to pass the agent any information that we feel will help it learn efficiently. For example, you can pass to the trading agent not only prices but also news or important statistics that are about to be published (which are known to influence financial markets a lot). The bad part is that this flexibility usually means that, to find a good agent, you need to try lots of variants of data representation, and it’s not always obvious which will work better. In our case, we’ll implement the basic trading agent in its simplest form. The observation will include the following information:
* $N$ past bars, each of which has open, high, low, and close prices
* An indication that the share was bought some time ago (it will be possible to have only one share at a time)
* Profit or loss we currently have from our current position (the share bought)

At every step, which will be after every minute’s bar, the agent can take one of the following actions:
* **Do nothing**: skip the bar without taking an action
* **Buy a share**: if the agent already holds a share, nothing will be bought; otherwise, we’ll pay the commission, which is usually a small percentage of the current price
* **Close the position**: if we hold no previously bought share, nothing will happen; otherwise, we’ll pay the commission for the trade

The reward that the agent receives could be expressed in various ways. On the one hand, we can split the reward into multiple steps during our ownership of the share. In that case, the reward on every step will be equal to the last bar’s movement. On the other hand, the agent can receive the reward only after the close action and receive the full reward at once. At first sight, both variants should have the same final result, but maybe with different convergence speeds. However, in practice, the difference could be dramatic. We’ll implement both variants to have a chance to compare them.
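
The two reward schemes can be compared on a toy example. The following is a minimal sketch under stated assumptions: the list of close prices is made up, commissions are ignored, and the helper names `per_step_rewards` and `reward_on_close` are hypothetical. Rewards are expressed in percent. The per-bar rewards add up to roughly, but not exactly, the single reward received on close, because each per-bar percentage is taken relative to a different price.

```python
def per_step_rewards(closes):
    """Reward for every bar while the share is held: the bar's movement in percent."""
    return [100.0 * (cur - prev) / prev for prev, cur in zip(closes, closes[1:])]


def reward_on_close(closes):
    """Single reward at the close action: total movement since the buy, in percent."""
    return 100.0 * (closes[-1] - closes[0]) / closes[0]


# Made-up close prices: bought at 100.0, position closed at 102.0
closes = [100.0, 101.0, 99.0, 102.0]

step_rewards = per_step_rewards(closes)
total_stepwise = sum(step_rewards)
total_on_close = reward_on_close(closes)

print("per-step rewards:", [round(r, 3) for r in step_rewards])
print("sum of per-step rewards: %.3f" % total_stepwise)
print("reward on close: %.3f" % total_on_close)
```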
One last decision to make is how to represent the prices in our environment observation. Ideally, we would like our agent to be independent of actual price values and to take into account relative movement, such as “the stock has grown 1% during the last bar” or “the stock has lost 5%.” This makes sense, as different stocks’ prices can vary, but they can have similar movement patterns. In finance, there is a branch of analytics called “technical analysis,” which studies such patterns to help make predictions from them. We would like our system to be able to discover them (if they exist). To achieve this, we’ll convert every bar’s “open, high, low, and close” prices to three numbers showing the high, low, and close prices expressed as a fraction of the open price (for example, 0.01 for a 1% rise).
This representation has its own drawbacks, as we’re potentially losing the information about key price levels. For example, it’s known that markets have a tendency to bounce from round price numbers (like 8000 per bitcoin) and levels which were turning points in the past. However, as already stated, we’re not implementing “Wall Street Killer” here, but playing with the data and checking the concept. The representation in the form of relative price movement will help the system to find repeating patterns in the price level (if they exist, of course), regardless of the absolute price position. Potentially, the neural network (NN) could learn this on its own (it’s just the mean price which needs to be subtracted from the absolute price values), but relative representation simplifies the NN’s task.
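
The relative representation described above can be sketched in a few lines. This is an illustration rather than the code of the chapter's `data` module: the helper names `to_relative` and `to_absolute` and the sample prices are assumptions. The inverse transformation is the same arithmetic the environment needs whenever it wants an absolute close price back.

```python
def to_relative(open_price, high, low, close):
    """Express high, low and close as fractions of the bar's open price."""
    return (
        (high - open_price) / open_price,
        (low - open_price) / open_price,
        (close - open_price) / open_price,
    )


def to_absolute(open_price, rel_value):
    """Invert the transformation for a single relative value."""
    return open_price * (1.0 + rel_value)


# Made-up bar: opens at 100.0, peaks at 101.0, dips to 99.5, closes at 100.5
rel_high, rel_low, rel_close = to_relative(100.0, 101.0, 99.5, 100.5)
print(rel_high, rel_low, rel_close)
print(to_absolute(100.0, rel_close))
```

Two stocks trading at very different absolute prices produce the same three numbers whenever their bars have the same shape, which is what lets the network look for patterns regardless of the price level.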

## The trading environment 
As we have lots of code that is supposed to work with OpenAI Gym, we’ll implement the trading functionality following Gym’s Env class API, which should be familiar to you. Our environment is implemented in the StocksEnv class in the Chapter08/lib/environ.py module. It uses several internal classes to keep its state and encode observations. Let’s first look at the public API class.

In [66]:
import gym
import gym.spaces
from gym.utils import seeding
import enum
import numpy as np

from lib import data  # data is a Python module that reads the CSV files

DEFAULT_BARS_COUNT = 10
DEFAULT_COMMISSION_PERC = 0.1

# Enumerating the possible actions
class Actions(enum.Enum):
    Hold = 0  # do nothing, skip the bar
    Buy = 1   # buy one share
    Sell = 2  # close the position


class State:
    def __init__(self, bars_count, commission_perc, reset_on_close, reward_on_close=True, volumes=True):
        # Check the validity of the input arguments
        assert isinstance(bars_count, int)
        assert bars_count > 0
        assert isinstance(commission_perc, float)
        assert commission_perc >= 0.0
        assert isinstance(reset_on_close, bool)
        assert isinstance(reward_on_close, bool)
        self.bars_count = bars_count  # number of past bars in the observation
        self.commission_perc = commission_perc  # commission, in percent of the price
        self.reset_on_close = reset_on_close  # boolean flag: end the episode when the position is closed
        self.reward_on_close = reward_on_close  # boolean flag: give the whole reward on close instead of per bar
        self.volumes = volumes  # boolean flag: include bar volumes in the observation

    def reset(self, prices, offset):
        # Start a new episode on the given price series; offset is the index of the current bar
        assert isinstance(prices, data.Prices)
        assert offset >= self.bars_count-1
        self.have_position = False
        self.open_price = 0.0
        self._prices = prices
        self._offset = offset

    @property
    def shape(self):
        # [h, l, c (, volume)] * bars + position_flag + rel_profit (since open)
        if self.volumes:
            return (4 * self.bars_count + 1 + 1, )
        else:
            return (3*self.bars_count + 1 + 1, )

    def encode(self):
        """
        Convert current state into numpy array.
        """
        res = np.ndarray(shape=self.shape, dtype=np.float32)
        shift = 0
        for bar_idx in range(-self.bars_count+1, 1):
            res[shift] = self._prices.high[self._offset + bar_idx]
            shift += 1
            res[shift] = self._prices.low[self._offset + bar_idx]
            shift += 1
            res[shift] = self._prices.close[self._offset + bar_idx]
            shift += 1
            if self.volumes:
                res[shift] = self._prices.volume[self._offset + bar_idx]
                shift += 1
        res[shift] = float(self.have_position)
        shift += 1
        if not self.have_position:
            res[shift] = 0.0
        else:
            res[shift] = (self._cur_close() - self.open_price) / self.open_price
        return res

    def _cur_close(self):
        """
        Calculate real close price for the current bar
        """
        # close prices are stored relative to the open price, so convert back to an absolute price
        bar_open = self._prices.open[self._offset]
        rel_close = self._prices.close[self._offset]
        return bar_open * (1.0 + rel_close)

    def step(self, action):
        """
        Perform one step in our price, adjust offset, check for the end of prices
        and handle position change
        :param action:
        :return: reward, done
        """
        assert isinstance(action, Actions)
        reward = 0.0
        done = False
        close = self._cur_close()
        if action == Actions.Buy and not self.have_position:
            self.have_position = True
            self.open_price = close
            reward -= self.commission_perc
        elif action == Actions.Sell and self.have_position:
            reward -= self.commission_perc  # subtract the commission for the sell trade
            done |= self.reset_on_close
            if self.reward_on_close:  # calculate the reward on close (Sell)
                reward += 100.0 * (close - self.open_price) / self.open_price
            self.have_position = False
            self.open_price = 0.0

        self._offset += 1
        prev_close = close
        close = self._cur_close()
        done |= self._offset >= self._prices.close.shape[0]-1
        if self.have_position and not self.reward_on_close:
            reward += 100.0 * (close - prev_close) / prev_close
#         print('State.step: ', 'action=', action, 'prev.price', round(prev_close,1), ' open.price=', round(self.open_price,1),
#               'curr.price=',  round(close,1), ' reward=', round(reward,1), ' offset=', self._offset ,  ' done=', done)

        return reward, done


class State1D(State):
    """
    State with shape suitable for 1D convolution
    """
    @property
    def shape(self):
        if self.volumes:
            return (6, self.bars_count)
        else:
            return (5, self.bars_count)

    def encode(self):
        res = np.zeros(shape=self.shape, dtype=np.float32)
        ofs = self.bars_count-1
        res[0] = self._prices.high[self._offset-ofs:self._offset+1]
        res[1] = self._prices.low[self._offset-ofs:self._offset+1]
        res[2] = self._prices.close[self._offset-ofs:self._offset+1]
        if self.volumes:
            res[3] = self._prices.volume[self._offset-ofs:self._offset+1]
            dst = 4
        else:
            dst = 3
        if self.have_position:
            res[dst] = 1.0
            res[dst+1] = (self._cur_close() - self.open_price) / self.open_price
        return res


class StocksEnv(gym.Env):
    metadata = {'render.modes': ['human']}

    def __init__(self, prices, bars_count=DEFAULT_BARS_COUNT,
                 commission=DEFAULT_COMMISSION_PERC, reset_on_close=True, state_1d=False,
                 random_ofs_on_reset=True, reward_on_close=False, volumes=False):
        assert isinstance(prices, dict)
        self._prices = prices
        if state_1d:
            self._state = State1D(bars_count, commission, reset_on_close, reward_on_close=reward_on_close,
                                  volumes=volumes)
        else:
            self._state = State(bars_count, commission, reset_on_close, reward_on_close=reward_on_close,
                                volumes=volumes)
        self.action_space = gym.spaces.Discrete(n=len(Actions))
        self.observation_space = gym.spaces.Box(low=-np.inf, high=np.inf, shape=self._state.shape, dtype=np.float32)
        self.random_ofs_on_reset = random_ofs_on_reset
        self.seed()

    def reset(self):
        # select the instrument and its offset, then reset the state
        self._instrument = self.np_random.choice(list(self._prices.keys()))
        prices = self._prices[self._instrument]
        bars = self._state.bars_count
        if self.random_ofs_on_reset:
            # on every reset of the environment, choose a random offset in the time series
            offset = self.np_random.choice(prices.high.shape[0]-bars*10) + bars
        else:
            # otherwise, start from the beginning of the data
            offset = bars
        self._state.reset(prices, offset)
        return self._state.encode()

    def step(self, action_idx):
        action = Actions(action_idx)
        reward, done = self._state.step(action)
        obs = self._state.encode()
        info = {"instrument": self._instrument, "offset": self._state._offset}
#        print('StockEnv: obs=', obs)
        return obs, reward, done, info

    def render(self, mode='human', close=False):
        pass

    def close(self):
        pass

    def seed(self, seed=None):
        self.np_random, seed1 = seeding.np_random(seed)
        seed2 = seeding.hash_seed(seed1 + 1) % 2 ** 31
        return [seed1, seed2]

    @classmethod
    def from_dir(cls, data_dir, **kwargs):
        prices = {file: data.load_relative(file) for file in data.price_files(data_dir)}
        return cls(prices, **kwargs)
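
As a quick sanity check of the two `shape` properties above, the following minimal sketch recomputes the observation sizes in plain Python. The helper names `flat_obs_len` and `conv_obs_shape` are hypothetical and are not part of the chapter's code; only the arithmetic is taken from `State.shape` and `State1D.shape`.

```python
def flat_obs_len(bars_count, volumes):
    """Length of the flat observation vector built by State.encode()."""
    per_bar = 4 if volumes else 3   # high, low, close (+ volume)
    return per_bar * bars_count + 2  # + position flag + relative profit


def conv_obs_shape(bars_count, volumes):
    """Shape of the (channels, bars) observation built by State1D.encode()."""
    channels = 6 if volumes else 5  # price rows (+ volume) + position flag + relative profit
    return (channels, bars_count)


print(flat_obs_len(10, volumes=False))
print(flat_obs_len(10, volumes=True))
print(conv_obs_shape(10, volumes=False))
```

With 10 bars and no volumes, the flat observation has 32 values, which matches the `in_features=32` of the first linear layer printed in the training output.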


## Models
In this example, two architectures of DQN are used:
* a simple feed-forward network with three layers 
* a network with 1D convolution as a feature extractor, followed by two fully connected layers to output Q-values
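
To get a feel for how many features the convolutional extractor hands to the fully connected layers, the sketch below applies the standard output-length formula for a 1D convolution. The window of 50 bars is an assumption made for illustration; the channel count and kernel size are those of the `DQNConv1D` class in the models cell, and `conv1d_out_len` is a hypothetical helper.

```python
def conv1d_out_len(length, kernel_size, stride=1, padding=0, dilation=1):
    """Output length of a 1D convolution (the formula given in the PyTorch Conv1d docs)."""
    return (length + 2 * padding - dilation * (kernel_size - 1) - 1) // stride + 1


BARS = 50        # assumed observation window, for illustration only
CHANNELS = 128   # output channels of both Conv1d layers in DQNConv1D
KERNEL = 5       # kernel size of both Conv1d layers in DQNConv1D

after_first = conv1d_out_len(BARS, KERNEL)
after_second = conv1d_out_len(after_first, KERNEL)
flat_features = CHANNELS * after_second

print(after_first, after_second, flat_features)  # 46 42 5376
```

The model code obtains the same number empirically by pushing a tensor of zeros through the convolution in `_get_conv_out`, which avoids hard-coding this arithmetic.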

Both of them use the dueling architecture described in the previous chapter, together with Double DQN and two-step Bellman unrolling.
The rest of the process is the same as in classical DQN (from Chapter 6, Deep Q-Networks).
Both models are in Chapter08/lib/models.py and are very simple.
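
The dueling combination of a state value and per-action advantages can be illustrated without any deep learning library. The following is a minimal sketch with made-up numbers; `dueling_q` is a hypothetical helper, not a function from the chapter's code. The point it shows is that the advantage mean is taken over the actions of each state separately, which in PyTorch corresponds to `adv.mean(dim=1, keepdim=True)`.

```python
def dueling_q(values, advantages):
    """Combine V(s) and A(s, a) into Q(s, a) for a batch of states.

    values: list of floats, one V(s) per state
    advantages: list of lists, one A(s, a) per action for every state
    """
    q_batch = []
    for value, adv in zip(values, advantages):
        mean_adv = sum(adv) / len(adv)  # mean over the actions of this state only
        q_batch.append([value + a - mean_adv for a in adv])
    return q_batch


# Made-up batch of two states with three actions each
values = [1.0, 2.0]
advantages = [[1.0, 2.0, 3.0], [0.0, 0.0, 6.0]]

q = dueling_q(values, advantages)
print(q)  # [[0.0, 1.0, 2.0], [0.0, 0.0, 6.0]]
```

After the subtraction, the mean Q-value of every state equals its V(s), so the value head and the advantage head keep separate roles.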

In [45]:
import math
import numpy as np

import torch
import torch.nn as nn
import torch.nn.functional as F


class NoisyLinear(nn.Linear):
    def __init__(self, in_features, out_features, sigma_init=0.017, bias=True):
        super(NoisyLinear, self).__init__(in_features, out_features, bias=bias)
        self.sigma_weight = nn.Parameter(torch.full((out_features, in_features), sigma_init))
        self.register_buffer("epsilon_weight", torch.zeros(out_features, in_features))
        if bias:
            self.sigma_bias = nn.Parameter(torch.full((out_features,), sigma_init))
            self.register_buffer("epsilon_bias", torch.zeros(out_features))
        self.reset_parameters()

    def reset_parameters(self):
        std = math.sqrt(3 / self.in_features)
        self.weight.data.uniform_(-std, std)
        if self.bias is not None:  # the layer may be created with bias=False
            self.bias.data.uniform_(-std, std)

    def forward(self, input):
        self.epsilon_weight.normal_()
        bias = self.bias
        if bias is not None:
            self.epsilon_bias.normal_()
            bias = bias + self.sigma_bias * self.epsilon_bias
        return F.linear(input, self.weight + self.sigma_weight * self.epsilon_weight, bias)


class SimpleFFDQN(nn.Module):
    def __init__(self, obs_len, actions_n):
        super(SimpleFFDQN, self).__init__()

        self.fc_val = nn.Sequential(
            nn.Linear(obs_len, 512),
            nn.ReLU(),
            nn.Linear(512, 512),
            nn.ReLU(),
            nn.Linear(512, 1)
        )

        self.fc_adv = nn.Sequential(
            nn.Linear(obs_len, 512),
            nn.ReLU(),
            nn.Linear(512, 512),
            nn.ReLU(),
            nn.Linear(512, actions_n)
        )

    def forward(self, x):
        val = self.fc_val(x)
        adv = self.fc_adv(x)
        # subtract the mean advantage of each sample, not the mean over the whole batch
        return val + adv - adv.mean(dim=1, keepdim=True)


class DQNConv1D(nn.Module):
    def __init__(self, shape, actions_n):
        super(DQNConv1D, self).__init__()

        self.conv = nn.Sequential(
            nn.Conv1d(shape[0], 128, 5),
            nn.ReLU(),
            nn.Conv1d(128, 128, 5),
            nn.ReLU(),
        )

        out_size = self._get_conv_out(shape)

        self.fc_val = nn.Sequential(
            nn.Linear(out_size, 512),
            nn.ReLU(),
            nn.Linear(512, 1)
        )

        self.fc_adv = nn.Sequential(
            nn.Linear(out_size, 512),
            nn.ReLU(),
            nn.Linear(512, actions_n)
        )

    def _get_conv_out(self, shape):
        o = self.conv(torch.zeros(1, *shape))
        return int(np.prod(o.size()))

    def forward(self, x):
        conv_out = self.conv(x).view(x.size()[0], -1)
        val = self.fc_val(conv_out)
        adv = self.fc_adv(conv_out)
        # subtract the mean advantage of each sample, not the mean over the whole batch
        return val + adv - adv.mean(dim=1, keepdim=True)


class DQNConv1DLarge(nn.Module):
    def __init__(self, shape, actions_n):
        super(DQNConv1DLarge, self).__init__()

        self.conv = nn.Sequential(
            nn.Conv1d(shape[0], 32, 3),
            nn.MaxPool1d(3, 2),
            nn.ReLU(),
            nn.Conv1d(32, 32, 3),
            nn.MaxPool1d(3, 2),
            nn.ReLU(),
            nn.Conv1d(32, 32, 3),
            nn.MaxPool1d(3, 2),
            nn.ReLU(),
            nn.Conv1d(32, 32, 3),
            nn.MaxPool1d(3, 2),
            nn.ReLU(),
            nn.Conv1d(32, 32, 3),
            nn.ReLU(),
            nn.Conv1d(32, 32, 3),
            nn.ReLU(),
        )

        out_size = self._get_conv_out(shape)

        self.fc_val = nn.Sequential(
            nn.Linear(out_size, 512),
            nn.ReLU(),
            nn.Linear(512, 1)
        )

        self.fc_adv = nn.Sequential(
            nn.Linear(out_size, 512),
            nn.ReLU(),
            nn.Linear(512, actions_n)
        )

    def _get_conv_out(self, shape):
        o = self.conv(torch.zeros(1, *shape))
        return int(np.prod(o.size()))

    def forward(self, x):
        conv_out = self.conv(x).view(x.size()[0], -1)
        val = self.fc_val(conv_out)
        adv = self.fc_adv(conv_out)
        # subtract the mean advantage of each sample, not the mean over the whole batch
        return val + adv - adv.mean(dim=1, keepdim=True)


## Training code

We have two very similar training modules in this example: one for the feed-forward model and one for 1D convolutions. Neither of them adds anything new to our examples from Chapter 7, DQN Extensions:
* They use epsilon-greedy action selection to perform exploration. Epsilon decays linearly from 1.0 to 0.1 over the first 1M steps.
* A simple experience replay buffer of size 100k is used, which is initially populated with 10k transitions.
* Every 1,000 steps, we calculate the mean value for a fixed set of states to check the dynamics of the Q-values during training.
* Every 100k steps, we perform validation: 100 episodes are played on the training data and on previously unseen quotes. Characteristics of the orders, such as the mean profit, the mean count of bars, and share held, are recorded in TensorBoard. This step allows us to check for overfitting.
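
The exploration schedule from the first bullet can be written as a small function. This sketch reuses the constants and the expression from the training loop of the feed-forward module; the wrapper name `epsilon_at` is hypothetical.

```python
EPSILON_START = 1.0
EPSILON_STOP = 0.1
EPSILON_STEPS = 1_000_000


def epsilon_at(step_idx):
    """Linearly decayed epsilon, clipped from below at EPSILON_STOP."""
    return max(EPSILON_STOP, EPSILON_START - step_idx / EPSILON_STEPS)


for step in (0, 250_000, 500_000, 900_000, 1_000_000, 2_000_000):
    print(step, round(epsilon_at(step), 3))
```

Because the slope is `1 / EPSILON_STEPS` and the floor is 0.1, this particular expression already hits the floor at about 900k steps and stays there afterwards.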


To start the training, you need to pass training data with the --data option, which could be an individual CSV file or a whole directory of files. By default, the training module uses Yandex quotes for 2016 (file data/YNDX_160101_161231.csv). For the validation data, there is an option --valdata, which takes Yandex 2015 quotes by default. Another required option is -r, which is used to pass the name of the run. This name will be used in the TensorBoard run name and to create directories with saved models.
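
The command-line options described above can be tried in isolation. The sketch below rebuilds the parser from the lines that are commented out in the training cell and feeds it an explicit argument list, so it also runs inside a notebook; the `build_parser` wrapper and the run name `run_1` are only for illustration.

```python
import argparse

DEFAULT_STOCKS = "data/YNDX_160101_161231.csv"
DEFAULT_VAL_STOCKS = "data/YNDX_150101_151231.csv"


def build_parser():
    """Argument parser with the options of the training module."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--cuda", default=False, action="store_true", help="Enable cuda")
    parser.add_argument("--data", default=DEFAULT_STOCKS,
                        help="Stocks file or dir to train on, default=" + DEFAULT_STOCKS)
    parser.add_argument("--year", type=int,
                        help="Year to be used for training, if specified, overrides --data option")
    parser.add_argument("--valdata", default=DEFAULT_VAL_STOCKS,
                        help="Stocks data for validation, default=" + DEFAULT_VAL_STOCKS)
    parser.add_argument("-r", "--run", required=True, help="Run name")
    return parser


# Passing the list explicitly avoids reading sys.argv, which belongs to the notebook kernel
args = build_parser().parse_args(["-r", "run_1"])
print(args.run, args.data, args.valdata, args.year, args.cuda)
```

Only the run name has to be supplied; every other option falls back to its default.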

### Train Feed Forward Model


In [67]:
#!/usr/bin/env python3
import os
import gym
import ptan
import argparse
import numpy as np

import torch
import torch.optim as optim

from lib import common_stocks, validation

from tensorboardX import SummaryWriter

BATCH_SIZE = 32
BARS_COUNT = 10
TARGET_NET_SYNC = 1000
DEFAULT_STOCKS = "data/YNDX_160101_161231.csv"
DEFAULT_VAL_STOCKS = "data/YNDX_150101_151231.csv"

GAMMA = 0.99

REPLAY_SIZE = 100000
REPLAY_INITIAL = 10000

REWARD_STEPS = 2

LEARNING_RATE = 0.0001

STATES_TO_EVALUATE = 1000
EVAL_EVERY_STEP = 1000

EPSILON_START = 1.0
EPSILON_STOP = 0.1
EPSILON_STEPS = 1000000

CHECKPOINT_EVERY_STEP = 1000000
VALIDATION_EVERY_STEP = 100000

if __name__ == "__main__":
#     parser = argparse.ArgumentParser()
#     parser.add_argument("--cuda", default=False, action="store_true", help="Enable cuda")
#     parser.add_argument("--data", default=DEFAULT_STOCKS, help="Stocks file or dir to train on, default=" + DEFAULT_STOCKS)
#     parser.add_argument("--year", type=int, help="Year to be used for training, if specified, overrides --data option")
#     parser.add_argument("--valdata", default=DEFAULT_VAL_STOCKS, help="Stocks data for validation, default=" + DEFAULT_VAL_STOCKS)
#     parser.add_argument("-r", "--run", required=True, help="Run name")
#     args = parser.parse_args()

    # control arguments (hard-coded replacements for the argparse options above)
    args_cuda = False
    args_data = DEFAULT_STOCKS
    args_year = 2016
    args_valdata = DEFAULT_VAL_STOCKS
    args_run = 'run_1'
    device = torch.device("cuda" if args_cuda else "cpu")
    
    saves_path = os.path.join("saves", args_run)
    os.makedirs(saves_path, exist_ok=True)

    if args_year is not None or os.path.isfile(args_data):
        if args_year is not None:
            stock_data = data.load_year_data(args_year)
        else:
            stock_data = {"YNDX": data.load_relative(args_data)}
        env = StocksEnv(stock_data, bars_count=BARS_COUNT, reset_on_close=True, state_1d=False, volumes=False)
        env_tst = StocksEnv(stock_data, bars_count=BARS_COUNT, reset_on_close=True, state_1d=False)
    elif os.path.isdir(args_data):
        # StocksEnv is defined in an earlier cell; the environ module is not imported here
        env = StocksEnv.from_dir(args_data, bars_count=BARS_COUNT, reset_on_close=True, state_1d=False)
        env_tst = StocksEnv.from_dir(args_data, bars_count=BARS_COUNT, reset_on_close=True, state_1d=False)
    else:
        raise RuntimeError("No data to train on")
#    env = gym.wrappers.TimeLimit(env, max_episode_steps=1000)

    val_data = {"YNDX": data.load_relative(args_valdata)}
    env_val = StocksEnv(val_data, bars_count=BARS_COUNT, reset_on_close=True, state_1d=False)

    writer = SummaryWriter(comment="-simple-" + args_run)
    net = SimpleFFDQN(env.observation_space.shape[0], env.action_space.n).to(device)
    print(net)
    tgt_net = ptan.agent.TargetNet(net)
    selector = ptan.actions.EpsilonGreedyActionSelector(EPSILON_START)
    agent = ptan.agent.DQNAgent(net, selector, device=device)
    exp_source = ptan.experience.ExperienceSourceFirstLast(env, agent, GAMMA, steps_count=REWARD_STEPS)
    buffer = ptan.experience.ExperienceReplayBuffer(exp_source, REPLAY_SIZE)
    optimizer = optim.Adam(net.parameters(), lr=LEARNING_RATE)

    # main training loop
    step_idx = 0
    eval_states = None
    best_mean_val = None

    with common_stocks.RewardTracker(writer, np.inf, group_rewards=100) as reward_tracker:
        while True:
            step_idx += 1
            buffer.populate(1)
            selector.epsilon = max(EPSILON_STOP, EPSILON_START - step_idx / EPSILON_STEPS)

            new_rewards = exp_source.pop_rewards_steps()
            if new_rewards:
                reward_tracker.reward(new_rewards[0], step_idx, selector.epsilon)

            if len(buffer) < REPLAY_INITIAL:
                continue

            if eval_states is None:
                print("Initial buffer populated, start training")
                eval_states = buffer.sample(STATES_TO_EVALUATE)
                # np.asarray avoids a copy where possible and also works with NumPy 2.x
                eval_states = [np.asarray(transition.state) for transition in eval_states]
                eval_states = np.asarray(eval_states)

            if step_idx % EVAL_EVERY_STEP == 0:
                mean_val = common_stocks.calc_values_of_states(eval_states, net, device=device)
                writer.add_scalar("values_mean", mean_val, step_idx)
                if best_mean_val is None or best_mean_val < mean_val:
                    if best_mean_val is not None:
                        print("%d: Best mean value updated %.3f -> %.3f" % (step_idx, best_mean_val, mean_val))
                    best_mean_val = mean_val
                    torch.save(net.state_dict(), os.path.join(saves_path, "mean_val-%.3f.data" % mean_val))

            optimizer.zero_grad()
            batch = buffer.sample(BATCH_SIZE)
            loss_v = common_stocks.calc_loss(batch, net, tgt_net.target_model, GAMMA ** REWARD_STEPS, device=device)
            loss_v.backward()
            optimizer.step()

            if step_idx % TARGET_NET_SYNC == 0:
                tgt_net.sync()

            if step_idx % CHECKPOINT_EVERY_STEP == 0:
                idx = step_idx // CHECKPOINT_EVERY_STEP
                torch.save(net.state_dict(), os.path.join(saves_path, "checkpoint-%03d.data" % idx))

            if step_idx % VALIDATION_EVERY_STEP == 0:
                res = validation.validation_run(env_tst, net, device=device)
                for key, val in res.items():
                    writer.add_scalar(key + "_test", val, step_idx)
                res = validation.validation_run(env_val, net, device=device)
                for key, val in res.items():
                    writer.add_scalar(key + "_val", val, step_idx)


Reading data/YNDX_160101_161231.csv
Read done, got 131542 rows, 99752 filtered, 0 open prices adjusted
Reading data/YNDX_150101_151231.csv
Read done, got 130566 rows, 104412 filtered, 0 open prices adjusted
SimpleFFDQN(
  (fc_val): Sequential(
    (0): Linear(in_features=32, out_features=512, bias=True)
    (1): ReLU()
    (2): Linear(in_features=512, out_features=512, bias=True)
    (3): ReLU()
    (4): Linear(in_features=512, out_features=1, bias=True)
  )
  (fc_adv): Sequential(
    (0): Linear(in_features=32, out_features=512, bias=True)
    (1): ReLU()
    (2): Linear(in_features=512, out_features=512, bias=True)
    (3): ReLU()
    (4): Linear(in_features=512, out_features=3, bias=True)
  )
)
732: done 100 games, mean reward -0.115, mean steps 6.40, speed 961.42 f/s, eps 1.00
1372: done 200 games, mean reward -0.141, mean steps 5.96, speed 930.43 f/s, eps 1.00
2041: done 300 games, mean reward -0.152, mean steps 5.90, speed 818.35 f/s, eps 1.00
2720: done 400 games, mean reward -

KeyboardInterrupt: 

In [21]:
stock_data

{'data/YNDX_160101_161231.csv': Prices(open=array([ 1156.90002441,  1150.59997559,  1150.19995117, ...,  1245.5       ,
         1246.        ,  1244.        ], dtype=float32), high=array([ 0.00086438,  0.        ,  0.        , ...,  0.        ,
         0.        ,  0.00361736], dtype=float32), low=array([-0.0033711 , -0.00017378, -0.00060855, ..., -0.00080289,
        -0.00160514, -0.00040193], dtype=float32), close=array([-0.0033711 , -0.00017378, -0.00043471, ..., -0.00080289,
        -0.00080257,  0.00361736], dtype=float32), volume=array([  43.,    5.,  165., ...,  200.,  231.,  191.], dtype=float32))}
