Reinforcement Learning for Financial Data Series
================================================

In this project, our aim is to implement a Reinforcement Learning (RL)
strategy for trading stocks. Specifically, we use the Deep Q-Network
(DQN) algorithm to train an agent that trades Brent Crude Oil (BCOUSD),
with the goal of maximizing long-term profit. Finally, we compare the
results of the RL agent with those of a rule-based agent that uses the
Trend Calculus predictive algorithm to make decisions.

Group members:
--------------

-   Fabian Sinzinger
-   Karl Bäckström
-   Rita Laezza

In [None]:
// Scala imports
import org.lamastex.spark.trendcalculus._
import spark.implicits._
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
import java.sql.Timestamp
import org.apache.spark.sql.expressions._

  

>     import org.lamastex.spark.trendcalculus._
>     import spark.implicits._
>     import org.apache.spark.sql._
>     import org.apache.spark.sql.functions._
>     import java.sql.Timestamp
>     import org.apache.spark.sql.expressions._

  

Brent Crude Oil Dataset
-----------------------

The dataset consists of historical data from the *14th of November
2010* to the *21st of June 2019*. Since the data for the first and last
day is incomplete, we remove both days from the dataset. The BCOUSD
data is sampled approximately every minute, each sample carries a
timestamp, and prices are recorded in US dollars.

To read the BCOUSD dataset, we use the parsers provided by the
[TrendCalculus](https://github.com/lamastex/spark-trend-calculus)
library. This allows us to load the FX data into a Spark Dataset. The
original dataset contains **TickerPoint** objects, which are made up of a
**ticker**, a **time** and a **close** value. The first is the name of
the stock, the second is the timestamp of the data point, and the third
is the value of the stock at the end of each 1-minute bin.

Finally, we add the **index** and the **diff\_close** columns. The
latter is the difference between the **close** value at the current and
the previous **time**. Note that since **ticker** is always the same, we
remove that column.
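
As a small illustration of the **diff\_close** column, the sketch below
computes the same kind of first difference with NumPy on a few made-up
prices. The helper name `first_difference` and the sample values are
hypothetical and only serve the example. Unlike the Scala cell, which
subtracts 0 for a row without a predecessor, this sketch sets the first
difference to 0.

```python
import numpy as np

def first_difference(close):
    """Return close[t] - close[t-1]; the first entry has no predecessor and is set to 0."""
    close = np.asarray(close, dtype=float)
    # prepend=close[0] makes the first difference close[0] - close[0] = 0
    return np.diff(close, prepend=close[0])

prices = [86.60, 86.60, 86.63, 86.61]  # made-up close values
diffs = first_difference(prices)
print(np.round(diffs, 2))
```

The result has the same length as the input, so it can be stored as an
additional column next to **close**.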

In [None]:
// Load dataset
val oilDS = spark.read.fx1m("dbfs:/FileStore/shared_uploads/fabiansi@kth.se/*csv.gz").toDF.withColumn("ticker", lit("BCOUSD")).select($"ticker", $"time" as "x", $"close" as "y").as[TickerPoint].orderBy("time")

// Add column with difference from previous close value (expected 'x', 'y' column names)
val windowSpec = Window.orderBy("x")
val oilDS1 = oilDS 
.withColumn("diff_close", $"y" - when((lag("y", 1).over(windowSpec)).isNull, 0).otherwise(lag("y", 1).over(windowSpec)))

// Rename variables
val oilDS2 = oilDS1.withColumnRenamed("x","time").withColumnRenamed("y","close")

// Remove incomplete data from first day (2010-11-14) and last day (2019-06-21)
val oilDS3 = oilDS2.filter(to_date(oilDS2("time")) >= lit("2010-11-15") && to_date(oilDS2("time")) <= lit("2019-06-20"))

// Add index column
val windowSpec1 = Window.orderBy("time")
val oilDS4 = oilDS3
.withColumn("index", row_number().over(windowSpec1))

// Drop ticker column
val oilDS5 = oilDS4.drop("ticker")

// Store loaded data as temp view, to be accessible in Python
oilDS5.createOrReplaceTempView("temp")

  

>     oilDS: org.apache.spark.sql.Dataset[org.lamastex.spark.trendcalculus.TickerPoint] = [ticker: string, x: timestamp ... 1 more field]
>     windowSpec: org.apache.spark.sql.expressions.WindowSpec = org.apache.spark.sql.expressions.WindowSpec@5d67736f
>     oilDS1: org.apache.spark.sql.DataFrame = [ticker: string, x: timestamp ... 2 more fields]
>     oilDS2: org.apache.spark.sql.DataFrame = [ticker: string, time: timestamp ... 2 more fields]
>     oilDS3: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [ticker: string, time: timestamp ... 2 more fields]
>     windowSpec1: org.apache.spark.sql.expressions.WindowSpec = org.apache.spark.sql.expressions.WindowSpec@63b5e6bd
>     oilDS4: org.apache.spark.sql.DataFrame = [ticker: string, time: timestamp ... 3 more fields]
>     oilDS5: org.apache.spark.sql.DataFrame = [time: timestamp, close: double ... 2 more fields]

  

### Preparing the data in Python

Because the
[TrendCalculus](https://github.com/lamastex/spark-trend-calculus)
library we use is implemented in Scala and we want to do our
implementation in Python, we have to make sure that the data loaded in
Scala is correctly read in Python, before moving on. To that end, we
select the first 10 datapoints and show them in a table.

We can see that there are roughly **2.5 million data points** in the
BCOUSD dataset.

In [None]:
import datetime
import numpy as np

# Create Dataframe from temp data
oilDF_py = spark.table("temp")

# Select the 10 first Rows of data and print them
ten_oilDF_py = oilDF_py.limit(10)
ten_oilDF_py.show()

# Check number of data points
last_index = oilDF_py.count()
print("Number of data points: {}".format(last_index))

# Select the date of the last data point
print("Last data point: {}".format(np.array(oilDF_py.where(oilDF_py.index == last_index).select('time').collect()).item()))

  

>     +-------------------+-----+--------------------+-----+
>     |               time|close|          diff_close|index|
>     +-------------------+-----+--------------------+-----+
>     |2010-11-15 00:00:00| 86.6|-0.01000000000000...|    1|
>     |2010-11-15 00:01:00| 86.6|                 0.0|    2|
>     |2010-11-15 00:02:00|86.63|0.030000000000001137|    3|
>     |2010-11-15 00:03:00|86.61|-0.01999999999999602|    4|
>     |2010-11-15 00:05:00|86.61|                 0.0|    5|
>     |2010-11-15 00:07:00| 86.6|-0.01000000000000...|    6|
>     |2010-11-15 00:08:00|86.58|-0.01999999999999602|    7|
>     |2010-11-15 00:09:00|86.58|                 0.0|    8|
>     |2010-11-15 00:10:00|86.58|                 0.0|    9|
>     |2010-11-15 00:12:00|86.57|-0.01000000000000...|   10|
>     +-------------------+-----+--------------------+-----+
>
>     Number of data points: 2523078
>     Last data point: 2019-06-20 23:59:00

  

Creating the Environment
------------------------

In order to train RL agents, we first need to create the environment
with which the agent will interact to gather experience. In our case,
that consists of a stock market simulation which replays historical
data from the BCOUSD dataset. This is valid under the assumption that
our agent's trades have no effect on the stock market. An RL problem
can be formally defined by a Markov Decision Process (MDP).

For our application, we have the following MDP:

-   s = (HOLDING, NOT HOLDING)
-   a = (LONG, SHORT)
-   r = profit or loss realized when a position is closed, and 0
    otherwise
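
To make the reward concrete, here is a minimal sketch of how a long
position could be tracked and turned into a reward when it is closed. It
assumes that every holding starts at a value of 1.0 and is scaled by the
price change at each tick. The helper names `update_long` and
`close_long` and the tick changes are hypothetical.

```python
def update_long(holdings, change):
    """Scale every long holding by (1 + change) after one tick."""
    return [b * (1.0 + change) for b in holdings]

def close_long(holdings):
    """Reward realized when long holdings (each opened at value 1.0) are closed."""
    return sum(b - 1.0 for b in holdings)

holdings = [1.0]                     # one LONG action opens a holding worth 1.0
for change in (0.01, -0.02, 0.03):   # made-up per-tick changes
    holdings = update_long(holdings, change)

reward = close_long(holdings)        # a SHORT action would realize this reward
print(reward)
```

While the position stays open, the reward is 0. The profit or loss only
shows up at the step where the position is closed.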

In [None]:
import gym
import math
import numpy as np
import random

PENALTY = 1  # 0.999756079


class MarketEnv(gym.Env):
    def __init__(self, full_data, start_date, end_date, episode_size=30*24*60, scope=60, cumulative_reward=False):

        self.cumulative_reward = cumulative_reward
        self.actions = ["LONG", "SHORT"]  # pass not required, if already owned, buy is hold
        self.action_space = gym.spaces.Discrete(len(self.actions))
        self.state = None

        self.diff_close = np.array(full_data.filter(full_data["time"] > start_date).filter(full_data["time"] <= end_date).select('diff_close').collect())
        self.close = np.array(full_data.filter(full_data["time"] > start_date).filter(full_data["time"] <= end_date).select('close').collect())
        
        self.num_ticks_train = np.shape(self.diff_close)[0]
        self.episode_size = episode_size

        self.scope = scope
        self.time_index = self.scope  # start 60 steps in, to ensure that we have hist. values ?
        self.episode_init_time = self.time_index  # initial time index of the episode
        #self.reset() this should always be called after initilization, any way

    def step(self, action):
        self.reward = 0
        if self.actions[action] == "LONG":
            if sum(self.boughts) < 0:
                for b in self.boughts:
                    self.reward += -(b + 1) 
                if self.cumulative_reward:
                    self.reward = self.reward / max(1, len(self.boughts))
                self.boughts = []
            self.boughts.append(1.0)
        elif self.actions[action] == "SHORT":
            if sum(self.boughts) > 0:
                for b in self.boughts:
                    self.reward += b - 1
                if self.cumulative_reward:
                    self.reward = self.reward / max(1, len(self.boughts))
                self.boughts = []
            self.boughts.append(-1.0)
        else:
            raise ValueError

        # Price change at the current tick (self.time_range was never defined)
        vari = self.diff_close[self.time_index].item()

        for i in range(len(self.boughts)):
            self.boughts[i] = self.boughts[i] * PENALTY * (1 + vari * (-1 if sum(self.boughts) < 0 else 1))

        self.define_state()
        self.time_index += 1
        
        # Check if done
        if self.time_index - self.episode_init_time > self.episode_size:
            self.done = True
        if self.time_index > self.diff_close.shape[0] - self.scope:
            self.done = True
            
        if self.done:
            for b in self.boughts:
                self.reward += (b * (1 if sum(self.boughts) > 0 else -1)) - 1
            if self.cumulative_reward:
                self.reward = self.reward / max(1, len(self.boughts))

            self.boughts = []

        info = {'index': int(self.time_index), 'close': float(self.close[self.time_index][0]), 'boughts': self.boughts}
        return self.state, self.reward, self.done, info

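    def reset(self, random_starttime=True):
        self.boughts = []
        self.done = False
        self.reward = 0.
        self.time_index = self.scope

        if random_starttime:
            # Leave room for the look-back window and at least one step before the data ends
            self.time_index += random.randint(0, max(0, self.num_ticks_train - 2 * self.scope - 1))

        self.episode_init_time = self.time_index
        # Build the state only after the start index is fixed, so that it matches the episode start
        self.define_state()

        return self.state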

    def define_state(self):
        tmp_state = []

        budget = (sum(self.boughts) / len(self.boughts)) if len(self.boughts) > 0 else 1.
        size = math.log(max(1., len(self.boughts)), 100)
        position = 1. if sum(self.boughts) > 0 else 0.
        tmp_state.append([[budget, size, position]])

        # df_back = self.diff_close.filter(self.diff_close.index.between(self.time_index, self.time_index + self.scope - 1))
        np_back = self.diff_close[self.time_index - self.scope:self.time_index]  # verify that we dont provide the actual value that we want to predict here
        # TODO check if we go out of range

        # np_back = np.array(df_back.select('diff_close').collect())

        tmp_state.append(np_back)

        tmp_state = [np.array(i) for i in tmp_state]
        self.state = tmp_state

    def seed(self):
        pass

In [None]:
start = datetime.datetime(2010, 11, 14, 20, 19)
end = datetime.datetime(2012, 11, 14, 20, 32)

env = MarketEnv(oilDF_py, start, end)
env.reset(random_starttime=False)  # reset() must be called before the first step()

print(env.step(0))

  

>     ([array([[ 0.99,  0.  ,  1.  ]]), array([[-0.09],
>            [ 0.05],
>            [-0.03],
>            [-0.01],
>            [-0.06],
>            [-0.02],
>            [ 0.15],
>            [ 0.01],
>            [ 0.06],
>            [-0.05],
>            [-0.09],
>            [ 0.04],
>            [-0.02],
>            [ 0.02],
>            [ 0.03],
>            [ 0.08],
>            [ 0.04],
>            [ 0.01],
>            [ 0.  ],
>            [-0.02],
>            [-0.05],
>            [ 0.03],
>            [ 0.04],
>            [ 0.03],
>            [ 0.  ],
>            [ 0.01],
>            [-0.04],
>            [-0.06],
>            [ 0.  ],
>            [ 0.06],
>            [ 0.04],
>            [-0.01],
>            [ 0.  ],
>            [ 0.06],
>            [ 0.  ],
>            [ 0.02],
>            [ 0.  ],
>            [-0.03],
>            [-0.1 ],
>            [ 0.07],
>            [-0.08],
>            [ 0.04],
>            [-0.03],
>            [ 0.03],
>            [ 0.07],
>            [ 0.06],
>            [ 0.02],
>            [ 0.01],
>            [-0.01],
>            [ 0.02],
>            [-0.01],
>            [-0.01],
>            [ 0.  ],
>            [ 0.04],
>            [-0.02],
>            [-0.03],
>            [-0.01],
>            [ 0.  ],
>            [-0.01],
>            [ 0.21]])], 0.0, False, {'close': -0.020000000000010232, 'boughts': [0.9900000000000091]})

  

DQN Algorithm
-------------

<img src="https://imgur.com/mvopoh8.png" width=800>
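
The core of DQN is the one-step Q-learning target that the network is
trained to regress on. The sketch below computes this target for a
single transition. The function name `q_target` and the sample Q-values
are hypothetical, and the default discount of 0.9 mirrors the default of
the `ExperienceReplay` class in this notebook.

```python
import numpy as np

def q_target(reward, q_next, done, discount=0.9):
    """One-step Q-learning target: r if the episode ended, else r + gamma * max_a' Q(s', a')."""
    if done:
        return float(reward)
    return float(reward + discount * np.max(q_next))

q_next = np.array([0.2, 0.5])                   # made-up Q-values for (LONG, SHORT) in the next state
t_running = q_target(0.1, q_next, done=False)   # bootstraps from the best next action
t_terminal = q_target(0.1, q_next, done=True)   # no bootstrapping at the end of an episode
print(t_running, t_terminal)
```

Only the entry of the action that was actually taken is overwritten with
this target. The other entries keep the network's current prediction, so
they contribute no error.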

In [None]:
class ExperienceReplay:
    def __init__(self, max_memory=100, discount=.9):
        self.max_memory = max_memory
        self.memory = list()
        self.discount = discount

    def remember(self, states, done):
        # memory[i] = [[state_t, action_t, reward_t, state_t+1], done?]
        self.memory.append([states, done])
        if len(self.memory) > self.max_memory:
            del self.memory[0]

    def get_batch(self, model, batch_size=10):
        len_memory = len(self.memory)
        num_actions = model.output_shape[-1]

        env_dim = self.memory[0][0][0].shape[1]
        inputs = np.zeros((min(len_memory, batch_size), env_dim, 1))
        targets = np.zeros((inputs.shape[0], num_actions))
        for i, idx in enumerate(np.random.randint(0, len_memory, size=inputs.shape[0])):
            state_t, action_t, reward_t, state_tp1 = self.memory[idx][0]
            done = self.memory[idx][1]

            inputs[i:i + 1] = state_t
            # There should be no target values for actions not taken.
            targets[i] = model.predict(state_t)[0]
            Q_sa = np.max(model.predict(state_tp1)[0])
            if done: # if done is True
                targets[i, action_t] = reward_t
            else:
                # reward_t + gamma * max_a' Q(s', a')
                targets[i, action_t] = reward_t + self.discount * Q_sa
        return inputs, targets

In [None]:
# This is the model used in https://github.com/kh-kim/stock_market_reinforcement_learning
# It targets the legacy Keras 1.x API (Convolution2D, border_mode, merge, init)
# and is kept for reference; it is not the network trained in this notebook.
from keras.models import Model
from keras.layers import merge, Convolution2D, Input, Dense, Flatten, Dropout
from keras.layers.advanced_activations import LeakyReLU

def build_model():
    dr_rate = 0.0

    B = Input(shape = (3,))
    b = Dense(5, activation = "relu")(B)

    inputs = [B]
    merges = [b]

    for i in range(1):
        S = Input(shape=[2, 60, 1])
        inputs.append(S)

        h = Convolution2D(64, 3, 1, border_mode = 'valid')(S)
        h = LeakyReLU(0.001)(h)
        h = Convolution2D(128, 5, 1, border_mode = 'valid')(S)
        h = LeakyReLU(0.001)(h)
        h = Convolution2D(256, 10, 1, border_mode = 'valid')(S)
        h = LeakyReLU(0.001)(h)
        h = Convolution2D(512, 20, 1, border_mode = 'valid')(S)
        h = LeakyReLU(0.001)(h)
        h = Convolution2D(1024, 40, 1, border_mode = 'valid')(S)
        h = LeakyReLU(0.001)(h)

        h = Flatten()(h)
        h = Dense(2048)(h)
        h = LeakyReLU(0.001)(h)
        h = Dropout(dr_rate)(h)
        merges.append(h)

        h = Convolution2D(2048, 60, 1, border_mode = 'valid')(S)
        h = LeakyReLU(0.001)(h)

        h = Flatten()(h)
        h = Dense(4096)(h)
        h = LeakyReLU(0.001)(h)
        h = Dropout(dr_rate)(h)
        merges.append(h)

    m = merge(merges, mode = 'concat', concat_axis = 1)
    m = Dense(1024)(m)
    m = LeakyReLU(0.001)(m)
    m = Dropout(dr_rate)(m)
    m = Dense(512)(m)
    m = LeakyReLU(0.001)(m)
    m = Dropout(dr_rate)(m)
    m = Dense(256)(m)
    m = LeakyReLU(0.001)(m)
    m = Dropout(dr_rate)(m)
    V = Dense(2, activation = 'linear', init = 'zero')(m)
    model = Model(input = inputs, output = V)

    return model

In [None]:
import json
import numpy as np
from keras.models import Sequential
from keras.layers import Dense, Conv1D, MaxPool1D, Flatten, BatchNormalization
import collections
import datetime

# RL parameters
epsilon = .5  # exploration
min_epsilon = 0.1
max_memory = 5000
batch_size = 128
discount = 0.8

# Environment parameters
num_actions = 2  # [long, short]
episodes = 1000 # 100000
episode_size = 1 * 1 * 60  # roughly an hour's worth of data in each training episode

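# Define state sequence scope (approx. 1 hour)
sequence_scope = 60
input_shape = (batch_size, sequence_scope, 1)

# Width of the dense layer of the Q-network.
# ASSUMPTION: hidden_size is used below but never defined in the original; 128 is a placeholder value.
hidden_size = 128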

# Create Q Network
model = Sequential()
model.add(Conv1D(32, (5), strides=2, input_shape=input_shape[1:], activation='relu'))
model.add(MaxPool1D(pool_size=2, strides=1))
model.add(BatchNormalization())
model.add(Conv1D(32, (5), strides=1, activation='relu'))
model.add(MaxPool1D(pool_size=2, strides=1))
model.add(BatchNormalization())
model.add(Flatten())
model.add(Dense(hidden_size, activation='relu'))
model.add(BatchNormalization())
model.add(Dense(num_actions))
model.compile(loss='mse', optimizer='adam')

# Define training interval
start = datetime.datetime(2010, 11, 15, 0, 0)
end = datetime.datetime(2018, 12, 31, 23, 59)

# Initialize Environment
env = MarketEnv(oilDF_py, start, end, episode_size=episode_size, scope=sequence_scope)

# Initialize experience replay object
exp_replay = ExperienceReplay(max_memory=max_memory, discount=discount)

# Train
returns = []
for e in range(1, episodes + 1):
    loss = 0.
    counter = 0
    reward_sum = 0.
    done = False
    
    state = env.reset()
    input_t = state[1].reshape(1, sequence_scope, 1)
    
    while not done:     
        counter += 1
        input_tm1 = input_t
        # get next action
        if np.random.rand() <= epsilon:
            action = np.random.randint(0, num_actions)  # scalar int, usable as a list index in env.step
        else:
            q = model.predict(input_tm1)
            action = np.argmax(q[0])

        # apply action, get rewards and new state
        state, reward, done, info = env.step(action)
        reward_sum += reward
        input_t = state[1].reshape(1, sequence_scope, 1)         

        # store experience
        exp_replay.remember([input_tm1, action, reward, input_t], done)

        # adapt model
        inputs, targets = exp_replay.get_batch(model, batch_size=batch_size)

        loss += model.train_on_batch(inputs, targets)
        
    returns.append(reward_sum)
    print("Episode {:03d}/{:d} | Average Loss {:.4f} | Cumulative Reward {:.4f}".format(e, episodes, loss / counter, reward_sum))
    epsilon = max(min_epsilon, epsilon * 0.99)

  

>     Episode 001/400 | Average Loss 0.3689 | Cumulative Reward 0.1637
>     Episode 002/400 | Average Loss 0.2221 | Cumulative Reward -0.0233
>     Episode 003/400 | Average Loss 0.2839 | Cumulative Reward -0.3709
>     Episode 004/400 | Average Loss 0.3601 | Cumulative Reward -1.2132
>     Episode 005/400 | Average Loss 0.2441 | Cumulative Reward 0.8580
>     Episode 006/400 | Average Loss 0.1887 | Cumulative Reward 0.3902
>     Episode 007/400 | Average Loss 0.1640 | Cumulative Reward 1.4007
>     Episode 008/400 | Average Loss 0.1131 | Cumulative Reward 0.6776
>     Episode 009/400 | Average Loss 0.0888 | Cumulative Reward 0.6964
>     Episode 010/400 | Average Loss 0.1072 | Cumulative Reward -0.1378
>     Episode 011/400 | Average Loss 0.1203 | Cumulative Reward -0.4680
>     Episode 012/400 | Average Loss 0.0950 | Cumulative Reward 0.1480
>     Episode 013/400 | Average Loss 0.0773 | Cumulative Reward -0.8090
>     Episode 014/400 | Average Loss 0.1827 | Cumulative Reward -0.2950
>     Episode 015/400 | Average Loss 227231.1264 | Cumulative Reward -1.7342
>     Episode 016/400 | Average Loss 898787.8186 | Cumulative Reward 0.8652
>     Episode 017/400 | Average Loss 309887.6135 | Cumulative Reward 0.5920
>     Episode 018/400 | Average Loss 42025.0600 | Cumulative Reward 0.7633
>     Episode 019/400 | Average Loss 23240.6485 | Cumulative Reward -0.6439
>     Episode 020/400 | Average Loss 1548.2308 | Cumulative Reward 0.1562
>     Episode 021/400 | Average Loss 3022.9790 | Cumulative Reward 0.1391
>     Episode 022/400 | Average Loss 928.1853 | Cumulative Reward 0.0965
>     Episode 023/400 | Average Loss 335.2204 | Cumulative Reward -0.0647
>     Episode 024/400 | Average Loss 92.0703 | Cumulative Reward -1.0509
>     Episode 025/400 | Average Loss 62.6662 | Cumulative Reward 2.6331
>     Episode 026/400 | Average Loss 141.5263 | Cumulative Reward 0.7501
>     Episode 027/400 | Average Loss 151.6588 | Cumulative Reward -0.7282
>     Episode 028/400 | Average Loss 88.7003 | Cumulative Reward -0.4138
>     Episode 029/400 | Average Loss 70.3339 | Cumulative Reward -0.3277
>     Episode 030/400 | Average Loss 71.3869 | Cumulative Reward 0.2297
>     Episode 031/400 | Average Loss 63.7683 | Cumulative Reward -1.2043
>     Episode 032/400 | Average Loss 52.0525 | Cumulative Reward 0.3404
>     Episode 033/400 | Average Loss 139.2886 | Cumulative Reward 0.5432
>     Episode 034/400 | Average Loss 125.1808 | Cumulative Reward -0.0532
>     Episode 035/400 | Average Loss 75581.6705 | Cumulative Reward -0.2716
>     Episode 036/400 | Average Loss 1499093.0702 | Cumulative Reward -0.0629
>     Episode 037/400 | Average Loss 142075.2642 | Cumulative Reward 0.6731
>     Episode 038/400 | Average Loss 42704.7186 | Cumulative Reward -1.2040
>     Episode 039/400 | Average Loss 4600.6392 | Cumulative Reward -2.4596
>     Episode 040/400 | Average Loss 3533.7538 | Cumulative Reward -2.1036
>     Episode 041/400 | Average Loss 3485.4257 | Cumulative Reward -0.3009
>     Episode 042/400 | Average Loss 1754.1580 | Cumulative Reward -0.0897
>     Episode 043/400 | Average Loss 4854.8793 | Cumulative Reward 0.6749
>     Episode 044/400 | Average Loss 32073.8605 | Cumulative Reward 0.4181
>     Episode 045/400 | Average Loss 27062.1981 | Cumulative Reward 10.9193
>     Episode 046/400 | Average Loss 5357.5187 | Cumulative Reward 0.6953
>     Episode 047/400 | Average Loss 1618.4097 | Cumulative Reward -0.5666
>     Episode 048/400 | Average Loss 527.3351 | Cumulative Reward -1.1120
>     Episode 049/400 | Average Loss 238.1634 | Cumulative Reward -1.3369
>     Episode 050/400 | Average Loss 175.4740 | Cumulative Reward -0.6742
>     Episode 051/400 | Average Loss 123.2425 | Cumulative Reward -0.6937
>     Episode 052/400 | Average Loss 104.5643 | Cumulative Reward -0.2996
>     Episode 053/400 | Average Loss 110.9646 | Cumulative Reward -0.2320
>     Episode 054/400 | Average Loss 133.3820 | Cumulative Reward 0.1100
>     Episode 055/400 | Average Loss 116.8288 | Cumulative Reward 1.5496
>     Episode 056/400 | Average Loss 97.1488 | Cumulative Reward -0.5533
>     Episode 057/400 | Average Loss 97.8561 | Cumulative Reward -2.1378
>     Episode 058/400 | Average Loss 142.2611 | Cumulative Reward -0.3099
>     Episode 059/400 | Average Loss 223.6259 | Cumulative Reward -0.2781
>     Episode 060/400 | Average Loss 1084.1126 | Cumulative Reward 3.4933
>     Episode 061/400 | Average Loss 23224.0857 | Cumulative Reward 32.5585
>     Episode 062/400 | Average Loss 25930.6092 | Cumulative Reward 0.4708
>     Episode 063/400 | Average Loss 322392.9633 | Cumulative Reward -2.5420
>     Episode 064/400 | Average Loss 334215.3544 | Cumulative Reward -1.8296
>     Episode 065/400 | Average Loss 35868.0845 | Cumulative Reward 0.6106
>     Episode 066/400 | Average Loss 117355.0819 | Cumulative Reward 0.2297
>     Episode 067/400 | Average Loss 22205.1672 | Cumulative Reward 1.5464
>     Episode 068/400 | Average Loss 2638.6969 | Cumulative Reward 1.9100
>     Episode 069/400 | Average Loss 853.5517 | Cumulative Reward 6.6405
>     Episode 070/400 | Average Loss 554.5541 | Cumulative Reward -1.4309
>     Episode 071/400 | Average Loss 428.0672 | Cumulative Reward 1.3316
>     Episode 072/400 | Average Loss 357.3235 | Cumulative Reward 0.2123
>     Episode 073/400 | Average Loss 414.5044 | Cumulative Reward -2.1793
>     Episode 074/400 | Average Loss 333.6023 | Cumulative Reward -0.2519
>     Episode 075/400 | Average Loss 311.4634 | Cumulative Reward -0.5926
>     Episode 076/400 | Average Loss 315.3541 | Cumulative Reward -0.5030
>     Episode 077/400 | Average Loss 308.6532 | Cumulative Reward 1.1239
>     Episode 078/400 | Average Loss 312.1528 | Cumulative Reward 1.9444
>     Episode 079/400 | Average Loss 330.1110 | Cumulative Reward -1.8888
>     Episode 080/400 | Average Loss 445.1022 | Cumulative Reward 0.9536
>     Episode 081/400 | Average Loss 331.2435 | Cumulative Reward 1.0538
>     Episode 082/400 | Average Loss 276.0205 | Cumulative Reward 0.4262
>     Episode 083/400 | Average Loss 255.5548 | Cumulative Reward 4.3867
>     Episode 084/400 | Average Loss 350.6699 | Cumulative Reward -0.3012
>     Episode 085/400 | Average Loss 315.2827 | Cumulative Reward -1.3592
>     Episode 086/400 | Average Loss 428.5230 | Cumulative Reward -3.0848
>     Episode 087/400 | Average Loss 426.4023 | Cumulative Reward -2.3325
>     Episode 088/400 | Average Loss 323.8480 | Cumulative Reward -0.7839
>     Episode 089/400 | Average Loss 288.2806 | Cumulative Reward 0.4595
>     Episode 090/400 | Average Loss 375.2701 | Cumulative Reward 6.6938
>     Episode 091/400 | Average Loss 327.0411 | Cumulative Reward -3.4743
>     Episode 092/400 | Average Loss 307.1827 | Cumulative Reward 3.0540
>     Episode 093/400 | Average Loss 547.2796 | Cumulative Reward -2.0188
>     Episode 094/400 | Average Loss 1411.6391 | Cumulative Reward 0.6463
>     Episode 095/400 | Average Loss 8877.2380 | Cumulative Reward -0.9047
>     Episode 096/400 | Average Loss 78876.4797 | Cumulative Reward 0.3335
>     Episode 097/400 | Average Loss 35666.4564 | Cumulative Reward -2.8158
>     Episode 098/400 | Average Loss 60985.4187 | Cumulative Reward -1.1599
>     Episode 099/400 | Average Loss 30969.0067 | Cumulative Reward 0.3606
>     Episode 100/400 | Average Loss 120682.6666 | Cumulative Reward 0.2651
>     Episode 101/400 | Average Loss 50795.0507 | Cumulative Reward 0.6701
>     Episode 102/400 | Average Loss 24420.6597 | Cumulative Reward -2.4215
>     Episode 103/400 | Average Loss 33171.9955 | Cumulative Reward -0.8510
>     Episode 104/400 | Average Loss 25231.0958 | Cumulative Reward -0.8302
>     Episode 105/400 | Average Loss 7064.6491 | Cumulative Reward 5.9196
>     Episode 106/400 | Average Loss 2441.4632 | Cumulative Reward 2.9671
>     Episode 107/400 | Average Loss 3972.6634 | Cumulative Reward 2.2185
>     Episode 108/400 | Average Loss 4525.4060 | Cumulative Reward -0.5613
>     Episode 109/400 | Average Loss 4010.8900 | Cumulative Reward 0.3562
>     Episode 110/400 | Average Loss 4406.7336 | Cumulative Reward -0.2574
>     Episode 111/400 | Average Loss 2514.7808 | Cumulative Reward 0.9526
>     Episode 112/400 | Average Loss 1741.7351 | Cumulative Reward -0.4360
>     Episode 113/400 | Average Loss 1994.2122 | Cumulative Reward -0.6628
>     Episode 114/400 | Average Loss 2122.6169 | Cumulative Reward 1.4274
>     Episode 115/400 | Average Loss 2412.2641 | Cumulative Reward -2.7171
>     Episode 116/400 | Average Loss 2157.2323 | Cumulative Reward -2.2203
>     Episode 117/400 | Average Loss 1763.6523 | Cumulative Reward 5.8168
>     Episode 118/400 | Average Loss 2157.3896 | Cumulative Reward 0.7164
>     Episode 119/400 | Average Loss 2271.7217 | Cumulative Reward -0.4404
>     Episode 120/400 | Average Loss 1806.5632 | Cumulative Reward 4.1580
>     Episode 121/400 | Average Loss 1592.2903 | Cumulative Reward -1.7653
>     Episode 122/400 | Average Loss 1865.5214 | Cumulative Reward 1.0669
>     Episode 123/400 | Average Loss 1590.0795 | Cumulative Reward -5.7212
>     Episode 124/400 | Average Loss 1290.1997 | Cumulative Reward -2.0139
>     Episode 125/400 | Average Loss 6849.0803 | Cumulative Reward 10.0472
>     Episode 126/400 | Average Loss 4108.5280 | Cumulative Reward -0.4549
>     Episode 127/400 | Average Loss 2033.6164 | Cumulative Reward -1.9177
>     Episode 128/400 | Average Loss 3314.3956 | Cumulative Reward -1.2552
>     Episode 129/400 | Average Loss 1554.2960 | Cumulative Reward 1.3309
>     Episode 130/400 | Average Loss 1695.8567 | Cumulative Reward -0.6389
>     Episode 131/400 | Average Loss 4964.5786 | Cumulative Reward -0.4334
>     Episode 132/400 | Average Loss 30351.7264 | Cumulative Reward 3.4232
>     Episode 133/400 | Average Loss 292663.9331 | Cumulative Reward -6.1486
>     Episode 134/400 | Average Loss 39913.3923 | Cumulative Reward 0.8041
>     Episode 135/400 | Average Loss 4525.7708 | Cumulative Reward 0.5128
>     Episode 136/400 | Average Loss 5816.5796 | Cumulative Reward 1.0164
>     Episode 137/400 | Average Loss 3372.1565 | Cumulative Reward -2.7557
>     Episode 138/400 | Average Loss 1255.0318 | Cumulative Reward -1.4759
>     Episode 139/400 | Average Loss 1227.6646 | Cumulative Reward -4.4638
>     Episode 140/400 | Average Loss 1490.9701 | Cumulative Reward -0.9443
>     Episode 141/400 | Average Loss 2175.9060 | Cumulative Reward -0.6079
>     Episode 142/400 | Average Loss 3145.8265 | Cumulative Reward -4.0675
>     Episode 143/400 | Average Loss 1102.6005 | Cumulative Reward -0.1042
>     Episode 144/400 | Average Loss 350.4612 | Cumulative Reward -1.6198
>     Episode 145/400 | Average Loss 316.4280 | Cumulative Reward 0.6322
>     Episode 146/400 | Average Loss 436.2378 | Cumulative Reward 0.9417
>     Episode 147/400 | Average Loss 470.8193 | Cumulative Reward -0.7965
>     Episode 148/400 | Average Loss 497.6323 | Cumulative Reward 1.0542
>     Episode 149/400 | Average Loss 330.6520 | Cumulative Reward 0.1210
>     Episode 150/400 | Average Loss 464.6532 | Cumulative Reward -0.8083
>     Episode 151/400 | Average Loss 606.0201 | Cumulative Reward -0.3361
>     Episode 152/400 | Average Loss 415.2449 | Cumulative Reward -2.2024
>     Episode 153/400 | Average Loss 390.6160 | Cumulative Reward 8.2336
>     Episode 154/400 | Average Loss 379.7443 | Cumulative Reward 2.7249
>     Episode 155/400 | Average Loss 415.5202 | Cumulative Reward -3.8253
>     Episode 156/400 | Average Loss 349.0971 | Cumulative Reward 0.5322
>     Episode 157/400 | Average Loss 327.8658 | Cumulative Reward -3.0669
>     Episode 158/400 | Average Loss 452.8055 | Cumulative Reward 1.8531
>     Episode 159/400 | Average Loss 358.1489 | Cumulative Reward 0.6118
>     Episode 160/400 | Average Loss 285.6239 | Cumulative Reward -0.2853
>     Episode 161/400 | Average Loss 211.8839 | Cumulative Reward 1.3377
>     Episode 162/400 | Average Loss 199.9443 | Cumulative Reward 8.9061
>     Episode 163/400 | Average Loss 266.2299 | Cumulative Reward -0.5084
>     Episode 164/400 | Average Loss 278.4291 | Cumulative Reward -1.9366
>     Episode 165/400 | Average Loss 237.7808 | Cumulative Reward 2.3392
>     Episode 166/400 | Average Loss 316.0515 | Cumulative Reward 1.7630
>     Episode 167/400 | Average Loss 259.7458 | Cumulative Reward 1.6817
>     Episode 168/400 | Average Loss 157.4775 | Cumulative Reward -5.0146
>     Episode 169/400 | Average Loss 158.3143 | Cumulative Reward -0.2332
>     Episode 170/400 | Average Loss 171.3159 | Cumulative Reward 0.1229
>     Episode 171/400 | Average Loss 225.3762 | Cumulative Reward -0.2566
>     Episode 172/400 | Average Loss 246.0206 | Cumulative Reward 1.2274
>     Episode 173/400 | Average Loss 246.9212 | Cumulative Reward -6.2713
>     Episode 174/400 | Average Loss 172.3311 | Cumulative Reward -7.8041
>     Episode 175/400 | Average Loss 159.6695 | Cumulative Reward -0.1459
>     Episode 176/400 | Average Loss 133.5186 | Cumulative Reward 12.6934
>     Episode 177/400 | Average Loss 203.6588 | Cumulative Reward 0.6016
>     Episode 178/400 | Average Loss 180.3377 | Cumulative Reward -5.1401
>     Episode 179/400 | Average Loss 156.4804 | Cumulative Reward -2.9022
>     Episode 180/400 | Average Loss 256.7832 | Cumulative Reward -0.4108
>     Episode 181/400 | Average Loss 248.6069 | Cumulative Reward 2.9433
>     Episode 182/400 | Average Loss 209.6588 | Cumulative Reward -9.2684
>     Episode 183/400 | Average Loss 170.3557 | Cumulative Reward -2.3097
>     Episode 184/400 | Average Loss 137.4371 | Cumulative Reward 3.2238
>     Episode 185/400 | Average Loss 173.6806 | Cumulative Reward 0.6079
>     Episode 186/400 | Average Loss 156.5687 | Cumulative Reward -0.9450
>     Episode 187/400 | Average Loss 127.6977 | Cumulative Reward -3.8308
>     Episode 188/400 | Average Loss 319.2198 | Cumulative Reward -2.4003
>     Episode 189/400 | Average Loss 288.2685 | Cumulative Reward 1.8500
>     Episode 190/400 | Average Loss 190.4522 | Cumulative Reward -3.9517
>     Episode 191/400 | Average Loss 135.5635 | Cumulative Reward 1.3865
>     Episode 192/400 | Average Loss 117.3449 | Cumulative Reward -0.0848
>     Episode 193/400 | Average Loss 144.4432 | Cumulative Reward -4.9818
>     Episode 194/400 | Average Loss 120.0152 | Cumulative Reward 3.4097
>     Episode 195/400 | Average Loss 132.2422 | Cumulative Reward -0.7065
>     Episode 196/400 | Average Loss 202.1392 | Cumulative Reward 1.6419
>     Episode 197/400 | Average Loss 114.8075 | Cumulative Reward -0.8801
>     Episode 198/400 | Average Loss 153.8085 | Cumulative Reward -0.5935
>     Episode 199/400 | Average Loss 109.2495 | Cumulative Reward 0.0568
>     Episode 200/400 | Average Loss 82.4544 | Cumulative Reward 0.7314
>     Episode 201/400 | Average Loss 149.3786 | Cumulative Reward -3.0157
>     Episode 202/400 | Average Loss 272.0143 | Cumulative Reward 0.4744
>     Episode 203/400 | Average Loss 132.8172 | Cumulative Reward -0.1198
>     Episode 204/400 | Average Loss 70.0116 | Cumulative Reward 1.8903
>     Episode 205/400 | Average Loss 67.3756 | Cumulative Reward -1.8030
>     Episode 206/400 | Average Loss 270.3828 | Cumulative Reward -4.3223
>     Episode 207/400 | Average Loss 399.5093 | Cumulative Reward -12.4752
>     Episode 208/400 | Average Loss 450.2183 | Cumulative Reward -0.0156
>     Episode 209/400 | Average Loss 1671.8967 | Cumulative Reward 2.7647
>     Episode 210/400 | Average Loss 16700.3337 | Cumulative Reward 9.8480
>     Episode 211/400 | Average Loss 13446.1835 | Cumulative Reward -1.1438
>     Episode 212/400 | Average Loss 137932.9261 | Cumulative Reward 0.4590
>     Episode 213/400 | Average Loss 921923.0748 | Cumulative Reward -15.9503
>     Episode 214/400 | Average Loss 1615567.8120 | Cumulative Reward 1.3957
>     Episode 215/400 | Average Loss 3153049.2869 | Cumulative Reward -7.7295
>     Episode 216/400 | Average Loss 2272427.4549 | Cumulative Reward 6.7439
>     Episode 217/400 | Average Loss 1095519.9990 | Cumulative Reward 11.3729
>     Episode 218/400 | Average Loss 432235.9467 | Cumulative Reward 1.4778
>     Episode 219/400 | Average Loss 161949.9376 | Cumulative Reward 0.3419
>     Episode 220/400 | Average Loss 58845.4062 | Cumulative Reward 2.7168
>     Episode 221/400 | Average Loss 21230.0694 | Cumulative Reward -1.6147
>     Episode 222/400 | Average Loss 9440.9599 | Cumulative Reward -0.7419
>     Episode 223/400 | Average Loss 7721.9410 | Cumulative Reward -1.6850
>     Episode 224/400 | Average Loss 6988.2630 | Cumulative Reward -2.1230
>     Episode 225/400 | Average Loss 15801.3750 | Cumulative Reward 0.1460
>     Episode 226/400 | Average Loss 26985.7430 | Cumulative Reward 0.8148
>     Episode 227/400 | Average Loss 33871.8176 | Cumulative Reward 0.7159
>     Episode 228/400 | Average Loss 64896.3301 | Cumulative Reward 2.6142
>     Episode 229/400 | Average Loss 58708.7660 | Cumulative Reward 1.5416
>     Episode 230/400 | Average Loss 23921.1229 | Cumulative Reward -1.3302
>     Episode 231/400 | Average Loss 19383.0976 | Cumulative Reward 0.3096
>     Episode 232/400 | Average Loss 21257.7983 | Cumulative Reward 1.6749
>     Episode 233/400 | Average Loss 9605.4634 | Cumulative Reward 2.6378
>     Episode 234/400 | Average Loss 9249.7002 | Cumulative Reward 9.3712
>     Episode 235/400 | Average Loss 10296.6081 | Cumulative Reward 1.5610
>     Episode 236/400 | Average Loss 15917.5678 | Cumulative Reward -2.1748
>     Episode 237/400 | Average Loss 20750.8658 | Cumulative Reward -2.5459
>     Episode 238/400 | Average Loss 38039.9481 | Cumulative Reward 8.1172
>     Episode 239/400 | Average Loss 32984.9469 | Cumulative Reward -0.0595
>     Episode 240/400 | Average Loss 16785.4156 | Cumulative Reward 0.3051
>     Episode 241/400 | Average Loss 19140.9990 | Cumulative Reward -31.6967
>     Episode 242/400 | Average Loss 50914.1964 | Cumulative Reward -0.2164
>     Episode 243/400 | Average Loss 43226.3390 | Cumulative Reward 0.7469
>     Episode 244/400 | Average Loss 142370.8437 | Cumulative Reward -0.7410
>     Episode 245/400 | Average Loss 271693.5746 | Cumulative Reward 2.9429
>     Episode 246/400 | Average Loss 372235.5745 | Cumulative Reward -0.9499
>     Episode 247/400 | Average Loss 554116.7420 | Cumulative Reward -2.4131
>     Episode 248/400 | Average Loss 689604.6347 | Cumulative Reward 3.9210
>     Episode 249/400 | Average Loss 17052161.8473 | Cumulative Reward 1.8883
>     Episode 250/400 | Average Loss 13790480.7408 | Cumulative Reward -0.9414
>     Episode 251/400 | Average Loss 553313.4969 | Cumulative Reward 1.9731
>     Episode 252/400 | Average Loss 193172.5233 | Cumulative Reward 0.4359
>     Episode 253/400 | Average Loss 70787.7497 | Cumulative Reward -3.2308
>     Episode 254/400 | Average Loss 31558.5159 | Cumulative Reward -0.5023
>     Episode 255/400 | Average Loss 20880.9573 | Cumulative Reward -3.3874
>     Episode 256/400 | Average Loss 14961.8871 | Cumulative Reward -0.5209
>     Episode 257/400 | Average Loss 12658.6910 | Cumulative Reward -1.0665
>     Episode 258/400 | Average Loss 10553.0673 | Cumulative Reward -1.0583
>     Episode 259/400 | Average Loss 11647.9965 | Cumulative Reward -1.1289
>     Episode 260/400 | Average Loss 10819.5231 | Cumulative Reward -0.9052
>     Episode 261/400 | Average Loss 9811.8621 | Cumulative Reward 0.6868
>     Episode 262/400 | Average Loss 9734.9689 | Cumulative Reward -0.1421
>     Episode 263/400 | Average Loss 16806.1076 | Cumulative Reward -0.9311
>     Episode 264/400 | Average Loss 51381.4164 | Cumulative Reward 0.5244
>     Episode 265/400 | Average Loss 52715.5890 | Cumulative Reward -3.1132
>     Episode 266/400 | Average Loss 147370.8256 | Cumulative Reward 6.7689
>     Episode 267/400 | Average Loss 349701.5862 | Cumulative Reward -5.3704
>     Episode 268/400 | Average Loss 453258.1571 | Cumulative Reward 1.9217
>     Episode 269/400 | Average Loss 469045.7159 | Cumulative Reward -1.1183
>     Episode 270/400 | Average Loss 216574.8130 | Cumulative Reward -0.2209
>     Episode 271/400 | Average Loss 63084.3598 | Cumulative Reward -0.5959
>     Episode 272/400 | Average Loss 43544.3739 | Cumulative Reward 2.8877
>     Episode 273/400 | Average Loss 54201.8135 | Cumulative Reward -0.7229
>     Episode 274/400 | Average Loss 44982.9422 | Cumulative Reward -7.6026
>     Episode 275/400 | Average Loss 49190.5107 | Cumulative Reward -2.3663
>     Episode 276/400 | Average Loss 38168.9935 | Cumulative Reward 1.3036
>     Episode 277/400 | Average Loss 33624.0186 | Cumulative Reward 9.4910
>     Episode 278/400 | Average Loss 52808.2118 | Cumulative Reward -2.5743
>     Episode 279/400 | Average Loss 99931.7062 | Cumulative Reward 3.7482
>     Episode 280/400 | Average Loss 52280.1315 | Cumulative Reward -1.1176
>     Episode 281/400 | Average Loss 51300.1439 | Cumulative Reward 0.0196
>     Episode 282/400 | Average Loss 46517.2936 | Cumulative Reward -7.0006
>     Episode 283/400 | Average Loss 75258.0231 | Cumulative Reward -2.2092
>     Episode 284/400 | Average Loss 124144.1463 | Cumulative Reward 2.3907
>     Episode 285/400 | Average Loss 166405.0758 | Cumulative Reward 17.9376
>     Episode 286/400 | Average Loss 109845.9581 | Cumulative Reward 4.2551
>     Episode 287/400 | Average Loss 86368.8667 | Cumulative Reward 0.0941
>     Episode 288/400 | Average Loss 48403.3659 | Cumulative Reward -0.6673
>     Episode 289/400 | Average Loss 41796.8802 | Cumulative Reward -1.6662
>     Episode 290/400 | Average Loss 46337.5346 | Cumulative Reward -3.5903
>     Episode 291/400 | Average Loss 131393.1431 | Cumulative Reward -0.0253
>     Episode 292/400 | Average Loss 495484.4769 | Cumulative Reward 1.9893
>     Episode 293/400 | Average Loss 175873.1958 | Cumulative Reward 2.5618
>     Episode 294/400 | Average Loss 179162.6928 | Cumulative Reward 2.0417
>     Episode 295/400 | Average Loss 119478.1822 | Cumulative Reward 4.6492
>     Episode 296/400 | Average Loss 58801.1919 | Cumulative Reward 0.6851
>     Episode 297/400 | Average Loss 32332.3374 | Cumulative Reward -3.0626
>     Episode 298/400 | Average Loss 60964.8619 | Cumulative Reward -2.7876
>     Episode 299/400 | Average Loss 102951.7667 | Cumulative Reward -0.6980
>     Episode 300/400 | Average Loss 219870.4704 | Cumulative Reward -2.9908
>     Episode 301/400 | Average Loss 205059.5425 | Cumulative Reward 0.4130
>     Episode 302/400 | Average Loss 121851.6144 | Cumulative Reward -6.5875
>     Episode 303/400 | Average Loss 55374.6726 | Cumulative Reward -0.7004
>     Episode 304/400 | Average Loss 40714.0962 | Cumulative Reward -0.6966
>     Episode 305/400 | Average Loss 616077.5248 | Cumulative Reward -6.7208
>     Episode 306/400 | Average Loss 1291013.0000 | Cumulative Reward 3.3525
>     Episode 307/400 | Average Loss 287649.6938 | Cumulative Reward -1.0102
>     Episode 308/400 | Average Loss 26063.9600 | Cumulative Reward -2.2521
>     Episode 309/400 | Average Loss 13097.6914 | Cumulative Reward -10.6769
>     Episode 310/400 | Average Loss 10210.9897 | Cumulative Reward 1.4032
>     Episode 311/400 | Average Loss 9341.3356 | Cumulative Reward 0.5489
>     Episode 312/400 | Average Loss 8890.5654 | Cumulative Reward -1.5771
>     Episode 313/400 | Average Loss 9969.8120 | Cumulative Reward -6.8167
>     Episode 314/400 | Average Loss 16652.6395 | Cumulative Reward 4.5195
>     Episode 315/400 | Average Loss 59403.0649 | Cumulative Reward -1.7285
>     Episode 316/400 | Average Loss 245777.2312 | Cumulative Reward 2.5883
>     Episode 317/400 | Average Loss 817190.2299 | Cumulative Reward -1.8446
>     Episode 318/400 | Average Loss 323813.5505 | Cumulative Reward -10.0864
>     Episode 319/400 | Average Loss 214917.4482 | Cumulative Reward 3.1694
>     Episode 320/400 | Average Loss 64451.5863 | Cumulative Reward -0.7796
>     Episode 321/400 | Average Loss 219139.1056 | Cumulative Reward 5.6409
>     Episode 322/400 | Average Loss 166843.1720 | Cumulative Reward 1.1437
>     Episode 323/400 | Average Loss 643758.2579 | Cumulative Reward 0.9259
>     Episode 324/400 | Average Loss 273197.3924 | Cumulative Reward 0.6936
>     Episode 325/400 | Average Loss 87513.6069 | Cumulative Reward -1.1678
>     Episode 326/400 | Average Loss 37614.0828 | Cumulative Reward 0.2960
>     Episode 327/400 | Average Loss 19445.4082 | Cumulative Reward 8.9279
>     Episode 328/400 | Average Loss 9239.1169 | Cumulative Reward 0.9224
>     Episode 329/400 | Average Loss 7564.7177 | Cumulative Reward -1.6945
>     Episode 330/400 | Average Loss 7219.3443 | Cumulative Reward 0.7301
>     Episode 331/400 | Average Loss 6740.1247 | Cumulative Reward -0.7489
>     Episode 332/400 | Average Loss 6888.5497 | Cumulative Reward 0.7859
>     Episode 333/400 | Average Loss 6260.5688 | Cumulative Reward 1.2887
>     Episode 334/400 | Average Loss 6881.4800 | Cumulative Reward 0.8997
>     Episode 335/400 | Average Loss 6491.5882 | Cumulative Reward 1.2076
>     Episode 336/400 | Average Loss 6970.5306 | Cumulative Reward 8.2423
>     Episode 337/400 | Average Loss 6692.2639 | Cumulative Reward -10.7015
>     Episode 338/400 | Average Loss 10528.1149 | Cumulative Reward -14.2128
>     Episode 339/400 | Average Loss 2749960.3616 | Cumulative Reward 3.9547
>     Episode 340/400 | Average Loss 6293915.8740 | Cumulative Reward 0.5112
>     Episode 341/400 | Average Loss 4181966.3648 | Cumulative Reward 7.5671
>     Episode 342/400 | Average Loss 1521715.4903 | Cumulative Reward -0.7421
>     Episode 343/400 | Average Loss 259878.5100 | Cumulative Reward -3.9238
>     Episode 344/400 | Average Loss 223847.7131 | Cumulative Reward -0.8457
>     Episode 345/400 | Average Loss 178520.7246 | Cumulative Reward 9.3477
>     Episode 346/400 | Average Loss 91922.6007 | Cumulative Reward 10.6711
>     Episode 347/400 | Average Loss 92341.3508 | Cumulative Reward 4.6715
>     Episode 348/400 | Average Loss 481404.8008 | Cumulative Reward -0.6806
>     Episode 349/400 | Average Loss 619327.0096 | Cumulative Reward -2.5698
>     Episode 350/400 | Average Loss 200402.9201 | Cumulative Reward 4.9615
>     Episode 351/400 | Average Loss 73401.7965 | Cumulative Reward -1.9998
>     Episode 352/400 | Average Loss 353985.6604 | Cumulative Reward -3.0569
>     Episode 353/400 | Average Loss 83143597.3859 | Cumulative Reward 2.1511
>     Episode 354/400 | Average Loss 6398529.5102 | Cumulative Reward 0.7794
>     Episode 355/400 | Average Loss 3473207.1696 | Cumulative Reward -2.7955
>     Episode 356/400 | Average Loss 1885632.9462 | Cumulative Reward -2.8542
>     Episode 357/400 | Average Loss 1584695.8876 | Cumulative Reward -0.8653
>     Episode 358/400 | Average Loss 1803479.0510 | Cumulative Reward 15.1597
>     Episode 359/400 | Average Loss 2213418.5752 | Cumulative Reward -2.5831
>     Episode 360/400 | Average Loss 2749705.8786 | Cumulative Reward 1.0120
>     Episode 361/400 | Average Loss 1289897.7182 | Cumulative Reward 1.4205
>     Episode 362/400 | Average Loss 1813762.0330 | Cumulative Reward 1.5898
>     Episode 363/400 | Average Loss 1324749.2118 | Cumulative Reward 1.6278
>     Episode 364/400 | Average Loss 1762172.3248 | Cumulative Reward -2.4109
>     Episode 365/400 | Average Loss 2590903.5681 | Cumulative Reward -5.4126
>     Episode 366/400 | Average Loss 2224000.0000 | Cumulative Reward -0.1538
>     Episode 367/400 | Average Loss 1441616.1888 | Cumulative Reward -1.6342
>     Episode 368/400 | Average Loss 3022001.7933 | Cumulative Reward -5.8340
>     Episode 369/400 | Average Loss 2972922.5484 | Cumulative Reward 5.4439
>     Episode 370/400 | Average Loss 2003531.0430 | Cumulative Reward -2.6320
>     Episode 371/400 | Average Loss 2709109.8594 | Cumulative Reward -12.0559
>     Episode 372/400 | Average Loss 3228692.9416 | Cumulative Reward -6.7218
>     Episode 373/400 | Average Loss 3700596.4360 | Cumulative Reward 0.5078
>     Episode 374/400 | Average Loss 3377426.7874 | Cumulative Reward 0.7276
>     Episode 375/400 | Average Loss 3365048.7382 | Cumulative Reward 3.1090
>     Episode 376/400 | Average Loss 4587485.8284 | Cumulative Reward -6.7627
>     Episode 377/400 | Average Loss 5680297.4426 | Cumulative Reward 3.3688
>     Episode 378/400 | Average Loss 3493340.6926 | Cumulative Reward -2.9359
>     Episode 379/400 | Average Loss 2484035.3117 | Cumulative Reward 3.1205
>     Episode 380/400 | Average Loss 5507217.9449 | Cumulative Reward 3.1457
>     Episode 381/400 | Average Loss 3075229.6434 | Cumulative Reward 0.5471
>     Episode 382/400 | Average Loss 6119780.0567 | Cumulative Reward -3.7341
>     Episode 383/400 | Average Loss 3379319.6055 | Cumulative Reward -0.7432
>     Episode 384/400 | Average Loss 4744977.0717 | Cumulative Reward -1.1412
>     Episode 385/400 | Average Loss 5937114.8721 | Cumulative Reward -3.9135
>     Episode 386/400 | Average Loss 3936486.9326 | Cumulative Reward 1.8648
>     Episode 387/400 | Average Loss 4697161.7754 | Cumulative Reward -0.6459
>     Episode 388/400 | Average Loss 3918769.9928 | Cumulative Reward -1.3641
>     Episode 389/400 | Average Loss 3720405.1218 | Cumulative Reward 5.0283
>     Episode 390/400 | Average Loss 3203617.5284 | Cumulative Reward 0.5877
>     Episode 391/400 | Average Loss 4206256.2672 | Cumulative Reward -2.0365
>     Episode 392/400 | Average Loss 5004207.2419 | Cumulative Reward -0.4567
>     Episode 393/400 | Average Loss 3484851.8288 | Cumulative Reward -2.2568
>     Episode 394/400 | Average Loss 5117577.1432 | Cumulative Reward -0.3933
>     Episode 395/400 | Average Loss 3649000.0143 | Cumulative Reward -1.9124
>     Episode 396/400 | Average Loss 3255186.4273 | Cumulative Reward 2.8319
>     Episode 397/400 | Average Loss 3177039.6798 | Cumulative Reward -1.6385
>     Episode 398/400 | Average Loss 3807617.9805 | Cumulative Reward 1.1477
>     Episode 399/400 | Average Loss 2519941.9805 | Cumulative Reward 3.4947

In [None]:
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot(returns)
ax.set_ylabel("Return")
ax.set_xlabel("Episode")
display(fig)
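
The per-episode return fluctuates strongly from one episode to the next, so a smoothed curve can make any trend easier to see. The sketch below shows one possible way to do this with a simple moving average. The helper name `moving_average`, the toy data, and the window size are illustrative assumptions and are not part of the original notebook.

```python
import numpy as np

def moving_average(values, window=20):
    """Return the simple moving average of `values` over `window` points.

    Hypothetical helper (not from the notebook). Only full windows are
    averaged, so the result has len(values) - window + 1 entries.
    """
    values = np.asarray(values, dtype=float)
    if window < 1 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    kernel = np.ones(window) / window
    return np.convolve(values, kernel, mode="valid")

# Toy data standing in for the notebook's `returns` list.
example_returns = [1.0, 3.0, 5.0, 7.0, 9.0]
smoothed = moving_average(example_returns, window=2)
print(smoothed)
```

In the notebook, the smoothed series could be drawn on the same axes as the raw curve, for example with `ax.plot(moving_average(returns))`, assuming `returns` holds at least `window` episodes.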

In [None]:
done = False
states = []
actions = []
rewards = []
reward_sum = 0.

# Define testing interval
start = datetime.datetime(2019, 1, 1, 0, 0)
end = datetime.datetime(2019, 6, 20, 23, 59)

# Test the learned model with a greedy policy (no exploration)
env = MarketEnv(oilDF_py, start, end, episode_size=np.inf, scope=sequence_scope)
state = env.reset(random_starttime=False)
# The second element of the state is fed to the network as one sequence
input_t = state[1].reshape(1, sequence_scope, 1)
while not done:
    states.append(state)
    q = model.predict(input_t)
    action = np.argmax(q[0])  # pick the action with the highest Q-value
    actions.append(action)
    state, reward, done, info = env.step(action)
    rewards.append(reward)
    reward_sum += reward
    input_t = state[1].reshape(1, sequence_scope, 1)
print("Return = {}".format(reward_sum))
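
The test loop above always takes the action with the highest predicted Q-value and sums the rewards into a single return. The sketch below reproduces that pattern on a toy problem, so it can be run without the trained model or `MarketEnv`. `ToyEnv`, `toy_q_values`, `greedy_rollout`, and the two-action reward rule are hypothetical stand-ins chosen for illustration. They do not describe how `MarketEnv` computes rewards, and the toy `step` returns three values instead of four.

```python
import numpy as np

class ToyEnv:
    """Hypothetical stand-in for MarketEnv: a fixed series of price changes."""

    def __init__(self, diffs):
        self.diffs = list(diffs)
        self.t = 0

    def reset(self):
        self.t = 0
        return self.diffs[self.t]

    def step(self, action):
        # Assumed rule: action 0 = stay out (reward 0),
        # action 1 = hold (reward equals the next price change).
        self.t += 1
        reward = self.diffs[self.t] if action == 1 else 0.0
        done = self.t == len(self.diffs) - 1
        return self.diffs[self.t], reward, done

def toy_q_values(state):
    """Hypothetical Q-function: prefer holding after a positive change."""
    return np.array([0.0, state])

def greedy_rollout(env, q_fn):
    """Run one episode, always taking the argmax action, and sum the rewards."""
    state = env.reset()
    done = False
    total = 0.0
    actions = []
    while not done:
        action = int(np.argmax(q_fn(state)))
        state, reward, done = env.step(action)
        actions.append(action)
        total += reward
    return total, actions

total, actions = greedy_rollout(ToyEnv([0.5, -1.0, 2.0, 1.0]), toy_q_values)
print(total, actions)
```

Because the policy is deterministic, repeated runs on the same data give the same return, which is why the notebook's test starts from a fixed time (`random_starttime=False`).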

In [None]:
fig, ax = plt.subplots(1, 2, figsize=(11, 9))
ax[0].plot(states,label='s')
ax[0].plot(actions,label='a')
ax[0].set_ylabel("(s,a)")
ax[0].set_xlabel("Timestep")
ax[0].legend()
ax[1].plot(rewards)
ax[1].set_ylabel("r")
ax[1].set_xlabel("Timestep")
display(fig)

  

Read Data as a Structured Stream
================================

In [None]:
val oil_path = "dbfs:/FileStore/shared_uploads/fabiansi@kth.se/*csv.gz"

val input = spark
  .readStream
  .format("delta")
  .load(oil_path)
  .as[TickerPoint]


In [None]:
joinedDS = spark.read.parquet("dbfs:/FileStore/shared_uploads/fabiansi@kth.se/joinedDSWithMaxRev").orderBy("x")
