In [17]:
%matplotlib inline


Reinforcement Learning (DQN) tutorial
=====================================
**Author**: `Adam Paszke <https://github.com/apaszke>`_


This tutorial shows how to use PyTorch to train a Deep Q Learning (DQN) agent
on the CartPole-v0 task from the `OpenAI Gym <https://gym.openai.com/>`__.

**Task**

The agent has to decide between two actions - moving the cart left or
right - so that the pole attached to it stays upright. You can find an
official leaderboard with various algorithms and visualizations at the
`Gym website <https://gym.openai.com/envs/CartPole-v0>`__.

.. figure:: /_static/img/cartpole.gif
   :alt: cartpole

   cartpole

As the agent observes the current state of the environment and chooses
an action, the environment *transitions* to a new state, and also
returns a reward that indicates the consequences of the action. In this
task, the environment terminates if the pole falls over too far.

The CartPole task is designed so that the inputs to the agent are 4 real
values representing the environment state (position, velocity, etc.).
However, neural networks can solve the task purely by looking at the
scene, so we'll use a patch of the screen centered on the cart as an
input. Because of this, our results aren't directly comparable to the
ones from the official leaderboard - our task is much harder.
Unfortunately this does slow down the training, because we have to
render all the frames.

Strictly speaking, we will present the state as the difference between
the current screen patch and the previous one. This will allow the agent
to take the velocity of the pole into account from one image.

**Packages**


First, let's import the packages we need. We need
`gym <https://gym.openai.com/docs>`__ for the environment
(install it with `pip install gym`).
We'll also use the following from PyTorch:

-  neural networks (``torch.nn``)
-  optimization (``torch.optim``)
-  automatic differentiation (``torch.autograd``)
-  utilities for vision tasks (``torchvision`` - `a separate
   package <https://github.com/pytorch/vision>`__).




In [18]:
import gym
import math
import random
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
from collections import namedtuple
from itertools import count
from copy import deepcopy
from PIL import Image
import pandas as pd

import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
from torch.autograd import Variable
import torch.autograd
import torchvision.transforms as T


env = gym.make('CartPole-v0').unwrapped

# set up matplotlib
is_ipython = 'inline' in matplotlib.get_backend()
if is_ipython:
    from IPython import display

plt.ion()

# if gpu is to be used
use_cuda = torch.cuda.is_available()
FloatTensor = torch.cuda.FloatTensor if use_cuda else torch.FloatTensor
LongTensor = torch.cuda.LongTensor if use_cuda else torch.LongTensor
ByteTensor = torch.cuda.ByteTensor if use_cuda else torch.ByteTensor
Tensor = FloatTensor

[2017-11-25 00:09:05,108] Making new env: CartPole-v0


Replay Memory
-------------

We'll be using experience replay memory for training our DQN. It stores
the transitions that the agent observes, allowing us to reuse this data
later. By sampling from it randomly, the transitions that build up a
batch are decorrelated. It has been shown that this greatly stabilizes
and improves the DQN training procedure.

For this, we're going to need two classes:

-  ``Transition`` - a named tuple representing a single transition in
   our environment
-  ``ReplayMemory`` - a cyclic buffer of bounded size that holds the
   transitions observed recently. It also implements a ``.sample()``
   method for selecting a random batch of transitions for training.




In [19]:
Transition = namedtuple('Transition',
                        ('state', 'action', 'next_state', 'reward'))


class ReplayMemory(object):

    def __init__(self, capacity):
        self.capacity = capacity
        self.memory = []
        self.position = 0

    def push(self, *args):
        """Saves a transition."""
        if len(self.memory) < self.capacity:
            self.memory.append(None)
        self.memory[self.position] = Transition(*args)
        self.position = (self.position + 1) % self.capacity

    def sample(self, batch_size):
        return random.sample(self.memory, batch_size)

    def __len__(self):
        return len(self.memory)
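
To see the cyclic behaviour in isolation, here is a minimal sketch using a hypothetical ``DequeReplayMemory``. It is not part of this notebook; it is a stand-in built on ``collections.deque`` with ``maxlen``, and is assumed to behave like the list-based ``ReplayMemory``: once capacity is reached, each new transition evicts the oldest one.

```python
import random
from collections import deque, namedtuple

Transition = namedtuple('Transition',
                        ('state', 'action', 'next_state', 'reward'))


class DequeReplayMemory:
    """Hypothetical stand-in for ReplayMemory, backed by a bounded deque."""

    def __init__(self, capacity):
        # deque(maxlen=...) drops the oldest item when a new one is appended
        self.memory = deque(maxlen=capacity)

    def push(self, *args):
        """Saves a transition."""
        self.memory.append(Transition(*args))

    def sample(self, batch_size):
        # random.sample needs a sequence, so copy the deque into a list
        return random.sample(list(self.memory), batch_size)

    def __len__(self):
        return len(self.memory)


demo_memory = DequeReplayMemory(capacity=3)
for step in range(5):
    # toy transitions: the state is just the step index
    demo_memory.push(step, 0, step + 1, 1.0)

stored_states = sorted(t.state for t in demo_memory.memory)
demo_batch = demo_memory.sample(2)
```

After five pushes into a buffer of capacity three, only the three most recent transitions remain, and ``sample`` draws from those without replacement.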

Now, let's define our model. But first, let's quickly recap what a DQN is.

DQN algorithm
-------------

Our environment is deterministic, so all equations presented here are
also formulated deterministically for the sake of simplicity. In the
reinforcement learning literature, they would also contain expectations
over stochastic transitions in the environment.

Our aim will be to train a policy that tries to maximize the discounted,
cumulative reward
$R_{t_0} = \sum_{t=t_0}^{\infty} \gamma^{t - t_0} r_t$, where
$R_{t_0}$ is also known as the *return*. The discount,
$\gamma$, should be a constant between $0$ and $1$
that ensures the sum converges. It makes rewards from the uncertain far
future less important for our agent than the ones in the near future
that it can be fairly confident about.
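
As a small worked example of the return, the sketch below sums a finite reward sequence with discounting. The helper name ``discounted_return`` and the toy rewards are assumptions made for illustration only.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**k * r_k over a finite reward sequence (k = t - t_0)."""
    total = 0.0
    # Walk backwards so each step applies R_t = r_t + gamma * R_{t+1}
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


# Three rewards of 1 with gamma = 0.5: 1 + 0.5 + 0.25
example_return = discounted_return([1.0, 1.0, 1.0], gamma=0.5)
```

With $\gamma = 0.5$ the third reward contributes only a quarter of its value, which is the sense in which distant rewards matter less.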

The main idea behind Q-learning is that if we had a function
$Q^*: State \times Action \rightarrow \mathbb{R}$, that could tell
us what our return would be, if we were to take an action in a given
state, then we could easily construct a policy that maximizes our
rewards:

\begin{align}\pi^*(s) = \arg\!\max_a \ Q^*(s, a)\end{align}
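
For a finite set of states and actions, this greedy policy is just an argmax over a table. The sketch below assumes a hypothetical hand-written ``q_table``; the state names and values are made up for illustration.

```python
# Hypothetical tabular Q-values: q_table[state][action]
q_table = {
    'pole_left': {'left': 0.8, 'right': 0.3},
    'pole_right': {'left': 0.1, 'right': 0.9},
}


def greedy_policy(q_values, state):
    """Return the action with the largest Q-value in the given state."""
    action_values = q_values[state]
    return max(action_values, key=action_values.get)


action_when_left = greedy_policy(q_table, 'pole_left')
action_when_right = greedy_policy(q_table, 'pole_right')
```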

However, we don't know everything about the world, so we don't have
access to $Q^*$. But, since neural networks are universal function
approximators, we can simply create one and train it to resemble
$Q^*$.

For our training update rule, we'll use the fact that every $Q$
function for some policy obeys the Bellman equation:

\begin{align}Q^{\pi}(s, a) = r + \gamma Q^{\pi}(s', \pi(s'))\end{align}

The difference between the two sides of the equality is known as the
temporal difference error, $\delta$:

\begin{align}\delta = Q(s, a) - (r + \gamma \max_a Q(s', a))\end{align}

To minimise this error, we will use the `Huber
loss <https://en.wikipedia.org/wiki/Huber_loss>`__. The Huber loss acts
like the mean squared error when the error is small, but like the mean
absolute error when the error is large - this makes it more robust to
outliers when the estimates of $Q$ are very noisy. We calculate
this over a batch of transitions, $B$, sampled from the replay
memory:

\begin{align}\mathcal{L} = \frac{1}{|B|}\sum_{(s, a, s', r) \ \in \ B} \mathcal{L}(\delta)\end{align}

\begin{align}\text{where} \quad \mathcal{L}(\delta) = \begin{cases}
     \frac{1}{2}{\delta^2}  & \text{for } |\delta| \le 1, \\
     |\delta| - \frac{1}{2} & \text{otherwise.}
   \end{cases}\end{align}
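
The two formulas can be checked numerically with plain Python. This is a minimal sketch; the helper names ``td_error``, ``huber`` and ``batch_loss`` and the toy numbers are assumptions, not part of the notebook.

```python
def td_error(q_sa, reward, gamma, next_q_values):
    """delta = Q(s, a) - (r + gamma * max_a Q(s', a))."""
    return q_sa - (reward + gamma * max(next_q_values))


def huber(delta):
    """Quadratic for |delta| <= 1, linear otherwise."""
    if abs(delta) <= 1.0:
        return 0.5 * delta ** 2
    return abs(delta) - 0.5


def batch_loss(deltas):
    """Mean Huber loss over a batch of temporal difference errors."""
    return sum(huber(d) for d in deltas) / len(deltas)


small_error_loss = huber(0.5)    # quadratic branch
large_error_loss = huber(-3.0)   # linear branch
delta = td_error(q_sa=1.0, reward=1.0, gamma=0.5, next_q_values=[2.0, 4.0])
mean_loss = batch_loss([0.5, -3.0])
```

The large error is penalised linearly rather than quadratically, which is what limits the influence of outliers on the update.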

Q-network
^^^^^^^^^

Our model will be a convolutional neural network that takes in the
difference between the current and previous screen patches. It has two
outputs, representing $Q(s, \mathrm{left})$ and
$Q(s, \mathrm{right})$ (where $s$ is the input to the
network). In effect, the network is trying to predict the *quality* of
taking each action given the current input.




In [20]:
class DQN(nn.Module):
    # input_dim is the number of features per time step fed to the LSTM
    def __init__(self, input_dim, hidden_dim, valset_size):
        super(DQN, self).__init__()
        self.hidden_dim = hidden_dim

        self.lstm = nn.LSTM(input_dim, hidden_dim)

        self.hidden2val = nn.Linear(hidden_dim, valset_size)
        self.hidden = self.init_hidden()

    def init_hidden(self):
        # Before we've done anything, we don't have any hidden state.
        # Refer to the PyTorch documentation to see exactly
        # why they have this dimensionality.
        # The axes semantics are (num_layers, minibatch_size, hidden_dim)
        return (Variable(torch.zeros(1, 1, self.hidden_dim)),
                Variable(torch.zeros(1, 1, self.hidden_dim)))

    def forward(self, traffic):
        #traffic = torch.unsqueeze(traffic, 1)
        
        lstm_out, self.hidden = self.lstm(traffic, self.hidden)
        #output.append(hx)
        
        val_space = self.hidden2val(lstm_out)
        #maybe hook dollar value to 1 at some point
        val_score = val_space[-1]
        return val_score

testdata = Variable(torch.ones((10, 2)))

#print(testdata)   
modeltest = DQN(2, 10, 2)
print(modeltest.forward(testdata))
#print(modeltest.forward(testdata[1]))
        

Variable containing:
 0.1001  0.0841
[torch.FloatTensor of size 1x2]



Training
--------

Hyperparameters and utilities
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
This cell instantiates our model and its optimizer, and defines some
utilities:

-  ``Variable`` - this is a simple wrapper around
   ``torch.autograd.Variable`` that will automatically send the data to
   the GPU every time we construct a Variable.
-  ``select_action`` - will select an action according to an
   epsilon-greedy policy. Simply put, we'll sometimes use our model for choosing
   the action, and sometimes we'll just sample one uniformly. The
   probability of choosing a random action will start at ``EPS_START``
   and will decay exponentially towards ``EPS_END``. ``EPS_DECAY``
   controls the rate of the decay.
-  ``plot_durations`` - a helper for plotting the durations of episodes,
   along with an average over the last 100 episodes (the measure used in
   the official evaluations). The plot will be underneath the cell
   containing the main training loop, and will update after every
   episode.
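
To get a feel for the decay, the sketch below evaluates the exploration probability at a few step counts. It assumes the constants used in this notebook's hyperparameter cell (``EPS_START = 0.9``, ``EPS_END = 0.05``, ``EPS_DECAY = 200``); the helper name ``epsilon_at`` is hypothetical.

```python
import math

EPS_START = 0.9
EPS_END = 0.05
EPS_DECAY = 200


def epsilon_at(step):
    """Probability of taking a random action after `step` action selections."""
    return EPS_END + (EPS_START - EPS_END) * math.exp(-1.0 * step / EPS_DECAY)


# Exploration probability at the start, after EPS_DECAY steps, and much later
schedule = [(step, round(epsilon_at(step), 3)) for step in (0, 200, 1000)]
```

After ``EPS_DECAY`` steps the gap between the current value and ``EPS_END`` has shrunk by a factor of $e$, so a larger ``EPS_DECAY`` means slower decay and more exploration.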




In [32]:
#may wanna change end to lower or something 
BATCH_SIZE = 16
GAMMA = 0.999
EPS_START = 0.9
EPS_END = 0.05
EPS_DECAY = 200


input_dim = 2
hidden_state_dim = 1
valset_dim = 2
model = DQN(input_dim, hidden_state_dim, valset_dim)
if use_cuda:
    model.cuda()

#think more about what best optimizer to use is
optimizer = optim.RMSprop(model.parameters())
memory = ReplayMemory(10000)


steps_done = 0
            

def valuation(time, USD, ETH, buy_price, sell_price):
    context = Variable(state_to_data(time, data))
    model.hidden = model.init_hidden()
    #case where I own ETH
    if (ETH != 0):
        new_USD, _, _ = trade(0, USD, ETH, buy_price, sell_price)
        scale = Variable(torch.FloatTensor([[new_USD], [ETH * sell_price]]))
    #case where I own USD
    if (USD != 0):
        _, new_ETH, _ = trade(1, USD, ETH, buy_price, sell_price)
        scale = Variable(torch.FloatTensor([[USD], [new_ETH * sell_price]]))
    #print(scale)
    #print(model(context[len(context)-1]))
    return scale * model(context).t()
    #return scale * model(Variable(context, volatile=True).type(FloatTensor))

def select_action(time, USD, ETH, buy_price, sell_price):
    global steps_done
    sample = random.random()
    eps_threshold = EPS_END + (EPS_START - EPS_END) * \
        math.exp(-1. * steps_done / EPS_DECAY)
    steps_done += 1
    if sample > eps_threshold:
        options = valuation(time, USD, ETH, buy_price, sell_price)
        if options.data.numpy()[0][0] > options.data.numpy()[1][0]:
            return LongTensor([[0]])
        else:
            return LongTensor([[1]])
    else:
        return LongTensor([[random.randrange(2)]])




Training loop
^^^^^^^^^^^^^

Finally, the code for training our model.

Here, you can find an ``optimize_model`` function that performs a
single step of the optimization. It first samples a batch, concatenates
all the tensors into a single one, computes $Q(s_t, a_t)$ and
$V(s_{t+1}) = \max_a Q(s_{t+1}, a)$, and combines them into our
loss. By definition we set $V(s) = 0$ if $s$ is a terminal
state.
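
The handling of terminal states can be sketched without tensors. In the sketch below, the helper ``td_targets`` is hypothetical, and a terminal next state is represented by ``None``, mirroring how ``next_state`` is stored in this notebook.

```python
GAMMA = 0.999


def td_targets(rewards, next_state_q_values, gamma=GAMMA):
    """Compute r + gamma * V(s') per transition, with V(s') = 0 if terminal.

    next_state_q_values[i] is a list of Q(s', a) values, or None when the
    transition ended the episode.
    """
    targets = []
    for reward, next_q in zip(rewards, next_state_q_values):
        next_value = 0.0 if next_q is None else max(next_q)
        targets.append(reward + gamma * next_value)
    return targets


# First transition continues, second one is terminal
demo_targets = td_targets([1.0, 1.0], [[2.0, 3.0], None])
```

For the terminal transition the target reduces to the reward alone, which is what the mask of non-final states achieves in the batched version.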



In [33]:
last_sync = 0


def optimize_model():
    global last_sync
    if len(memory) < BATCH_SIZE:
        return
    transitions = memory.sample(BATCH_SIZE)
    # Transpose the batch (see http://stackoverflow.com/a/19343/3343043 for
    # detailed explanation).
    batch = Transition(*zip(*transitions))
    
    # Compute a mask of non-final states and concatenate the batch elements
    non_final_mask = ByteTensor(tuple(map(lambda s: s is not None,
                                          batch.next_state)))

    next_statelist = [s for s in batch.next_state if s is not None]
    
    #print(torch.cat([s for s in batch.next_state if s is not None]))
    
    state_batch = batch.state
    action_batch = Variable(torch.cat(batch.action))
    reward_batch = Variable(torch.cat(batch.reward))
    
    #valuation(state, USD, ETH, buy_price, sell_price)

    # Compute Q(s_t, a) - the model computes Q(s_t), then we select the
    # columns of actions taken
    
    state_values = Variable(torch.zeros(BATCH_SIZE, 2))
    for i in range(BATCH_SIZE):
        time, USD, ETH = state_batch[i]
        buy_price, sell_price = state_to_prices(time, prices)
        state_values[i] = valuation(time, USD, ETH, buy_price, sell_price).t()
    
    state_action_values = state_values.gather(1, action_batch)
    
    # Compute V(s_{t+1}) for all non-final next states.
    # next_statelist excludes terminal transitions, so size by its length.
    next_state_temp = Variable(torch.zeros(len(next_statelist), 2))
    for i, (time, USD, ETH) in enumerate(next_statelist):
        buy_price, sell_price = state_to_prices(time, prices)
        next_state_temp[i] = valuation(time, USD, ETH, buy_price, sell_price).t()
    
    next_state_values = Variable(torch.zeros(BATCH_SIZE).type(Tensor))
    next_state_values[non_final_mask] = next_state_temp.max(1)[0]
    
    # Now, we don't want to mess up the loss with a volatile flag, so let's
    # clear it. After this, we'll just end up with a Variable that has
    # requires_grad=False
    
    # next_state_values.volatile = False
    
    # Compute the expected Q values
    
    expected_state_action_values = (next_state_values * GAMMA) + reward_batch
  
    # Compute Huber loss
    loss = F.smooth_l1_loss(state_action_values, expected_state_action_values)
   
    # Optimize the model
    optimizer.zero_grad()
    loss.backward()
    for param in model.parameters():
        param.grad.data.clamp_(-1, 1)
    optimizer.step()

Below, you can find the main training loop. At the beginning we reset
the environment and initialize the ``state`` variable. Then, we sample
an action, execute it, observe the next screen and the reward (always
1), and optimize our model once. When the episode ends (our model
fails), we restart the loop.

Below, `num_episodes` is set small. You should download
the notebook and run a lot more episodes.



In [34]:

USD = 100 #amount in USD 
ETH = 0 #amount in ETH 
#0 is action to go USD
#1 is action to go ETH

#all fees use poloniex taker fee. could try a strat that learns live and acts as a maker

buy_pen = 0.25/100 #proportion lost when buying
sell_pen = 0.25/100 #proportion lost when selling

#trade takes in an action, amount and outputs a reward
def trade(action, USD, ETH, buy_price, sell_price):
    if USD > 0:
        currency = 0
    else:
        currency = 1
    if (action == currency):
        # we already hold the requested currency, so nothing changes
        return USD, ETH, 0
    elif (action == 1):
        new_ETH = USD * buy_price * (1 - buy_pen)
        new_USD = 0
    elif (action == 0):
        new_USD = ETH * sell_price * (1 - sell_pen)
        new_ETH = 0
    #value of ETH is considered to be whatever you could sell it for to get USD
    return new_USD, new_ETH, 0
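
To illustrate the cost of the fees, the sketch below converts USD to ETH and straight back at an unchanged price. It assumes the same 0.25% proportion as ``buy_pen`` and ``sell_pen`` and the notebook's convention that the sell price is the reciprocal of the buy price; the helper ``round_trip`` and the price of 300 are hypothetical.

```python
FEE = 0.25 / 100  # same proportion as buy_pen and sell_pen


def round_trip(usd, price, fee=FEE):
    """Convert USD to ETH and back again at an unchanged price."""
    eth = usd * price * (1 - fee)             # buy: multiply by the buy price
    usd_back = eth * (1 / price) * (1 - fee)  # sell: multiply by 1 / buy price
    return usd_back


usd_after = round_trip(100.0, price=300.0)
round_trip_loss = 100.0 - usd_after
```

The price cancels out, so each round trip keeps a factor of $(1 - 0.0025)^2$ of the starting amount, a loss of roughly half a percent regardless of the price level.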
    
        

In [35]:
#takes in the state number and gives the context of that trade point
#used in the optimization 
#also needs to give us what currency we own so that can be fed into neural net as third feature 
def state_to_data(time, data):
    context = data[:time]
    return context

#takes in state and returns buy and sell price
def state_to_prices(time, prices):
    buy_price = prices[time]
    sell_price = 1 / buy_price
    return buy_price, sell_price

def state_to_old_prices(time, prices):
    old_buy_price = prices[time-1]
    old_sell_price = 1 / old_buy_price
    return old_buy_price, old_sell_price
                    



In [None]:
df = pd.read_csv('october_array.csv')
arr = df.values
data = torch.from_numpy(arr[1:100][1:100]).float()
print(data)

#there is an issue with this data
df = pd.read_csv('october_data_prices.csv')
arr = df.values
arr = arr[1:99, 2].astype(float)
prices = torch.from_numpy(arr).float()
print(prices)



In [39]:
num_episodes = 10
prep_time = 10 #number of states we don't use at start, to give time for hidden state to init
sim_time = 98 - prep_time #total number of time periods
total_profit = [] #stores the profit at the end of each finished episode

#we call our data set history

init_money = 100
for i_episode in range(num_episodes):
    
    
    #reset currency amounts
    USD = init_money
    ETH = 0
    #initialize state.
    state = (prep_time, USD, ETH)
    for t in range(sim_time - 1):
        print(state)
        time, USD, ETH = state
        
        #find a way to set done by seeing if we are at the end of the training time (or if 0 money left)
        done = (sim_time - 1 == time)
        
        
        #get buy_price and sell_price by indexing into data 
        buy_price, sell_price = state_to_prices(time, prices)
        old_buy_price, old_sell_price = state_to_old_prices(time, prices)
        # Select and perform an action
        action = select_action(time, USD, ETH, buy_price, sell_price)
        #print(action)
        old_USD = USD
        old_ETH = ETH
        
        USD, ETH, _ = trade(action[0,0], USD, ETH, buy_price, sell_price)
        reward = (USD + sell_price*ETH) - (old_USD + old_sell_price*old_ETH)
        
        reward = Tensor([reward])
        
        
        
        #the episode is done at the end of the training time or if 0 money is left
        done = (sim_time - 1 == time) or (USD == 0 and ETH == 0)
        if not done:
            next_state = (time + 1, USD, ETH)
        else:
            next_state = None 

        # Store the transition in memory
        memory.push(state, action, next_state, reward)

        # Move to the next state
        state = next_state

        # Perform one step of the optimization on the model
        optimize_model()
        
        if done:
            print(USD + ETH * sell_price - init_money)
            total_profit.append(USD + ETH * sell_price - init_money)
            break

print('Complete')
print(total_profit)

(10, 100, 0)
(11, 0, 30144.451217651367)
(12, 0, 30144.451217651367)
(13, 100.09596537120645, 0)
(14, 0, 29979.678501008468)
(15, 0, 29979.678501008468)
(16, 99.45037609165979, 0)
(17, 99.45037609165979, 0)
(18, 0, 29736.71759415994)
(19, 98.4513824616939, 0)
(20, 0, 29412.473574659132)
(21, 97.97501512452547, 0)
(22, 97.97501512452547, 0)
(23, 0, 29323.553671151403)
(24, 0, 29323.553671151403)
(25, 97.48679350316722, 0)
(26, 0, 29117.704391223284)
(27, 97.0654983252074, 0)
(28, 0, 29046.383515668185)
(29, 96.53739074902555, 0)
(30, 96.53739074902555, 0)
(31, 0, 28874.064735196345)
(32, 96.25942801021145, 0)
(33, 0, 28836.190589624206)
(34, 96.13283103485351, 0)
(35, 0, 28773.494223784317)
(36, 0, 28773.494223784317)
(37, 0, 28773.494223784317)
(38, 0, 28773.494223784317)
(39, 95.35402155556432, 0)
(40, 95.35402155556432, 0)
(41, 0, 28624.09697833124)
(42, 0, 28624.09697833124)
(43, 0, 28624.09697833124)
(44, 94.82018337541146, 0)
(45, 94.82018337541146, 0)
(46, 94.82018337541146, 0)
(

KeyboardInterrupt: 

In [50]:
valuation(2, 100, 0, 1, 1)

Variable containing:
 86.4197
 87.6682
[torch.FloatTensor of size 2x1]