# The Markov Chain Model

The Markov chain (MC) model is the second model used in our thesis. It is significantly more complex than the simple probabilistic model, but it is still quite rudimentary in comparison with real-life markets. The model was developed by Hult and Kiessling in their paper *[Algorithmic trading with Markov Chains](https://www.researchgate.net/publication/268032734_ALGORITHMIC_TRADING_WITH_MARKOV_CHAINS)*.

In this model, the limit order book (LOB) is modelled explicitly. There are six event types:

> 1. Buy limit orders 
> 2. Sell limit orders
> 3. Cancel buy orders
> 4. Cancel sell orders
> 5. Buy market orders
> 6. Sell market orders

The arrival of an order results in a state transition in the Markov chain. The transition rates are described in our **[report](https://github.com/KodAgge/Reinforcement-Learning-for-Market-Making/blob/main/Reinforcement%20Learning%20for%20Market%20Making.pdf)**. An example of how the arrival of different orders affects the LOB is shown in the image below.

<div>
    <img src="images/LOBDynamics.png" width=800/>
</div>

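To make the six event types concrete, here is a minimal sketch of how each of them could change a toy order book. The signed-volume list (negative entries for resting buy volume, positive entries for resting sell volume), the event names and the helper functions are assumptions made for illustration. They are not taken from the repository, and the sketch ignores the transition rates entirely.

```python
# Toy LOB: position i is a price level in ticks. Negative numbers are resting
# buy volume, positive numbers are resting sell volume (an assumed encoding).

def best_bid(lob):
    """Highest level holding buy volume, or None if the bid side is empty."""
    levels = [i for i, volume in enumerate(lob) if volume < 0]
    return levels[-1] if levels else None


def best_ask(lob):
    """Lowest level holding sell volume, or None if the ask side is empty."""
    levels = [i for i, volume in enumerate(lob) if volume > 0]
    return levels[0] if levels else None


def apply_event(lob, event, level=None, size=1):
    """Return a new LOB after one event (hypothetical event names).

    Limit orders are assumed to be placed on their own side of the book.
    """
    lob = list(lob)  # leave the caller's book untouched
    if event == "buy_limit":
        lob[level] -= size
    elif event == "sell_limit":
        lob[level] += size
    elif event == "cancel_buy":
        if lob[level] < 0:
            lob[level] = min(lob[level] + size, 0)
    elif event == "cancel_sell":
        if lob[level] > 0:
            lob[level] = max(lob[level] - size, 0)
    elif event == "buy_market":
        # A buy market order consumes sell volume, starting at the best ask.
        while size > 0 and best_ask(lob) is not None:
            ask = best_ask(lob)
            traded = min(size, lob[ask])
            lob[ask] -= traded
            size -= traded
    elif event == "sell_market":
        # A sell market order consumes buy volume, starting at the best bid.
        while size > 0 and best_bid(lob) is not None:
            bid = best_bid(lob)
            traded = min(size, -lob[bid])
            lob[bid] += traded
            size -= traded
    else:
        raise ValueError(f"unknown event type: {event}")
    return lob


book = [-3, -2, 0, 0, 4, 5]
book = apply_event(book, "buy_limit", level=2, size=1)  # best bid moves up to level 2
book = apply_event(book, "buy_market", size=5)          # empties level 4, takes 1 from level 5
```

Every call returns a new list, so each event corresponds to one transition from the old LOB state to the new one.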

As in the simple probabilistic model, we also have:

> * The time _t_ can take integer values between _0_ and _T_.
>
> * The market maker has to quote bid and ask prices every second.
>
> * The market maker can put the bid and ask depths at *max\_quote\_depth* different levels, from _1_ to *max\_quote\_depth* ticks away from the best ask and best bid price respectively.
>
> * The cash process _X<sub>t</sub>_ denotes the market maker's cash at time _t_.
>
> * The inventory process _Q<sub>t</sub>_ denotes the market maker's inventory at time _t_.
>
> * The value process _V<sub>t</sub>_ denotes the value of the market maker's position at time _t_, that is its cash plus the value of its current inventory.
>
> * The market maker can see the current time _t_, its inventory _Q<sub>t</sub>_, the spread and the *full LOB* before taking an action.
>
> * At time _t = T_ the market maker is forced to liquidate its position.
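The cash, inventory and value processes in the list above can be sketched as simple bookkeeping. The `Position` class, its method names, the use of a single mark price for valuing the inventory and the single liquidation price are all assumptions for illustration. In the actual model a forced liquidation would trade against the LOB. Prices are kept as integer ticks to avoid rounding issues.

```python
from dataclasses import dataclass


@dataclass
class Position:
    """Hypothetical bookkeeping for the market maker (prices in ticks)."""

    cash: int = 0       # X_t
    inventory: int = 0  # Q_t

    def buy(self, price, size):
        """The market maker's bid is filled: cash decreases, inventory increases."""
        self.cash -= price * size
        self.inventory += size

    def sell(self, price, size):
        """The market maker's ask is filled: cash increases, inventory decreases."""
        self.cash += price * size
        self.inventory -= size

    def value(self, mark_price):
        """V_t = X_t + Q_t * mark_price (the mark price is an assumed choice)."""
        return self.cash + self.inventory * mark_price

    def liquidate(self, price):
        """Forced liquidation at t = T, simplified to one price for the whole inventory."""
        self.cash += self.inventory * price
        self.inventory = 0


position = Position()
position.buy(price=99, size=5)    # cash = -495, inventory = 5
position.sell(price=101, size=5)  # cash = 10, inventory = 0
```

Buying five units at 99 and selling them at 101 leaves the market maker flat with a cash gain of 10 ticks, which is the spread it tries to capture.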

The _tick_ is the smallest price increment of the underlying, for instance $0.01 for AAPL.

In contrast to the simple probabilistic model, it is not possible to derive an analytically optimal strategy in the Markov chain model.

## The deep reinforcement learning

After that short introduction, it's time for some deep reinforcement learning in the form of a double deep Q-network (DDQN).

We start by importing the functions we need.

In [4]:
import logging
from mc_model_mm_deep_rl_batch import (
    train_multiple_agents_batch,
    evaluate_DDQN_batch
)

logging.basicConfig(level=logging.INFO)

Now we have to decide on the parameters we want to use for the environment and the hyperparameters we want to use for the DDQN.

The Markov chain model has some additional parameters, which are explained in the code snippet below. Importantly, we choose a longer episode (trading window) of *T = 100*.

Since DDQN involves neural networks, even more parameters are added. The reward is scaled to try to keep it within *\[-1,1\]*. Also, since the market maker can view the full LOB, we reset it to a random state at the start of every episode so that the agent is exposed to more LOB states.

In [5]:
model_params = {
                "dt": 1,                    # the length of the time steps
                "T": 100,                   # the length of the episode
                "num_levels": 10,           # how many depth levels that should be included in the LOB
                "default_order_size": 5,    # the size of the orders the MM places
                "max_quote_depth": 5,       # how deep the MM can put its quotes
                "reward_scale": 0.1,        # a factor all rewards will be multiplied with
                "randomize_reset": True     # should a random LOB state  be chosen at the start of every episode?
}
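Since the market maker picks one bid depth and one ask depth per step, each between 1 and *max\_quote\_depth*, a discrete agent could enumerate the joint actions as follows. This is a sketch under the assumption that bid and ask depths are chosen independently. The names `ACTIONS`, `action_to_depths` and `depths_to_action` are hypothetical, and the repository may encode actions differently.

```python
from itertools import product

max_quote_depth = 5  # same value as model_params["max_quote_depth"]

# Every (bid_depth, ask_depth) pair, with depths measured in ticks.
ACTIONS = list(product(range(1, max_quote_depth + 1), repeat=2))


def action_to_depths(action_index):
    """Map a discrete action index to a (bid_depth, ask_depth) pair."""
    return ACTIONS[action_index]


def depths_to_action(bid_depth, ask_depth):
    """Inverse mapping, matching the order in which ACTIONS is generated."""
    return (bid_depth - 1) * max_quote_depth + (ask_depth - 1)


print(len(ACTIONS))  # 25 joint actions when max_quote_depth is 5
```

With `dt = 1` and `T = 100`, the agent would choose one of these actions 100 times per episode.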

We now have to decide which hyperparameter values we want to use.

The DDQN algorithm involves a lot of hyperparameters. We will not explain them here; instead, we direct the interested reader to our **[report](https://github.com/KodAgge/Reinforcement-Learning-for-Market-Making/blob/main/Reinforcement%20Learning%20for%20Market%20Making.pdf)**. However, they do involve network architecture, experience replay and epsilon-greedy policy parameters.
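For readers who want a quick reminder of what separates double Q-learning from plain deep Q-learning, the sketch below computes the standard double Q-learning target for a single transition: the online network selects the next action, and the target network evaluates it. The function name and the plain-list inputs are hypothetical, and the discount factor `gamma` is an assumed argument that is not among the parameters listed in this notebook.

```python
def double_dqn_target(reward, done, q_online_next, q_target_next, gamma):
    """Standard double Q-learning target for one transition.

    q_online_next and q_target_next hold the Q-values of every action in the
    next state, as estimated by the online and the target network respectively.
    gamma is an assumed discount factor.
    """
    if done:
        # No bootstrapping from a terminal state.
        return reward
    # The online network picks the action...
    best_action = max(range(len(q_online_next)), key=q_online_next.__getitem__)
    # ...and the target network supplies its value.
    return reward + gamma * q_target_next[best_action]
```

Separating action selection from action evaluation is what reduces the overestimation of Q-values compared with a single-network target. The `target_update_interval` parameter controls how often the target network is synchronised with the online one.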

In [6]:
hyperparams = {
                "n_train": int(2e5),    # the number of steps the agents will be trained for
                "n_test": int(1e2),     # the number of episodes the agents will be evaluated for
                "n_runs": 4             # the number of agents that will be trained
}

DDQN_params = {
                # network params
                "hidden_size": 64,                                          # the hidden size of the network
                "buffer_size": hyperparams["n_train"] // 200,               # the size of the experience replay bank
                "replay_start_size": hyperparams["n_train"] // 200,         # after how many steps the experience replay is started
                "target_update_interval": hyperparams["n_train"] // 100,    # how often the target network is updated
                "update_interval": 2,                                       # how often the online network is updated
                "minibatch_size": 16,                                       # the size of the minibatches used

                # epsilon greedy (linear decay)
                "exploration_initial_eps": 1,                               # the starting value of the exploration rate
                "exploration_final_eps": 0.05,                              # the final value of the exploration rate
                "exploration_fraction": 0.5,                                # when the final value is reached

                # learning rate
                "learning_rate_dqn": 1e-4,                                  # the learning rate used (Adam)
                
                # other params
                "num_envs": 10,                                             # how many parallelized environments
                "n_train": hyperparams["n_train"], 
                "n_runs": hyperparams["n_runs"],
                "reward_scale": model_params["reward_scale"],

                # logging params
                "log_interval": hyperparams["n_train"] // 100,              # the frequency of saving information
                "num_estimate": 10000,                                      # how many states that should be used for estimating q_values
                "n_states": 10                                              # the number of states heatmaps are averaged over
                
}
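The three epsilon-greedy parameters describe a linearly decaying exploration rate. A minimal sketch of such a schedule is shown below, assuming that `exploration_fraction` is the fraction of the `n_train` steps after which the final value is reached. The function `linear_epsilon` is hypothetical and not part of the repository.

```python
def linear_epsilon(step, n_train, initial_eps=1.0, final_eps=0.05, fraction=0.5):
    """Exploration rate after `step` training steps under linear decay."""
    decay_steps = fraction * n_train
    if step >= decay_steps:
        # The decay is finished, so the rate stays at its final value.
        return final_eps
    return initial_eps + (final_eps - initial_eps) * step / decay_steps


n_train = int(2e5)
for step in (0, 50_000, 100_000, 200_000):
    print(step, linear_epsilon(step, n_train))
```

With the values above, the agent would act fully at random at the start, explore about half of the time after 50,000 steps, and settle at a 5 % exploration rate from step 100,000 onwards.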

For this model, emulating the market is the bottleneck, so training runs faster on a CPU than on a GPU. This holds even when multithreading is used for the emulation, as it is in this example.

In [7]:
gpu = -1  # -1: train on the CPU instead of a GPU

Finally we decide where to save our results.

In [8]:
# naming the folder where the results will be saved
folder_name = "mc_deep_example"

outdir = f"results/mc_model_deep/{folder_name}/"

We're now ready for the deep reinforcement learning!

This is easily done with the function *train\_multiple\_agents\_batch*.

In [13]:
train_multiple_agents_batch(
    DDQN_params, 
    model_params, 
    hyperparams["n_train"], 
    outdir, 
    hyperparams["n_runs"], 
    gpu=gpu
)

INFO:mc_model_mm_deep_rl_batch:The folder results/mc_model_deep/mc_deep_example/estimate_folder already exists.
INFO:mc_model_mm_deep_rl_batch:Run 1 in progress.


AttributeError: Can't get local object 'setup_batch_env.<locals>.make_env'

## Evaluating the strategies

Now that the training is complete, we can continue with evaluating the agents.

This is easily done with the function *evaluate\_DDQN\_batch*.

In [6]:
evaluate_DDQN_batch(
    outdir, 
    n_test=hyperparams["n_test"],                  
    Q=10,       # how many depths that should be displayed in the heatmaps
    randomize_start=model_params["randomize_reset"]
)

INFO:mc_model_mm_deep_rl_batch:The folder results/mc_model_deep/mc_deep_example/image_folder already exists.
INFO:mc_model_mm_deep_rl_batch:Plotting training.
INFO:mc_model_mm_deep_rl_batch:Plotting strategies.
INFO:mc_model_mm_deep_rl_batch:Evaluating agents.
INFO:mc_model_mm_deep_rl_batch:Evaluating benchmarks...
INFO:mc_model_mm_deep_rl_batch:...best agent
INFO:mc_model_mm_deep_rl_batch:...mean agent
INFO:mc_model_mm_deep_rl_batch:...constant strategy
INFO:mc_model_mm_deep_rl_batch:...random_strategy
INFO:mc_model_mm_deep_rl_batch:Visualizing the strategies.
INFO:mc_model_mm_deep_rl_batch:The folder results/mc_model_deep/mc_deep_example/image_folder already exists.


Evaluation complete! Let's take a peek at the images that were saved when running *evaluate\_DDQN\_batch*.

Let's first have a look at the reward, the estimated state-value at (0,0) and the network loss during training.

<div>
    <img src="results/mc_model_deep/mc_deep_example/image_folder/training_graph.png"/>
</div>

In this image it looks like the algorithm hasn't converged. Indeed, it has to be trained for much longer. It probably also needs hyperparameter tuning, since the Q-estimate and the loss seem to be diverging.

We can also have a look at the learnt strategies. The figure below shows the learnt bid depths.

<div>
    <img src="results/mc_model_deep/mc_deep_example/image_folder/bid_heat_randomized_10.png" width="500"/>
</div>

Furthermore, we can compare the average rewards of the DDQN strategies with those of some benchmark strategies. These are displayed in the boxplot below.

<div>
    <img src="results/mc_model_deep/mc_deep_example/image_folder/box_plot_benchmarking.png"/>
</div>

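The two benchmarks in the comparison are simple to state: a constant strategy that always quotes at the same depth, and a random strategy. A minimal sketch of such policies could look as follows. The factory functions, the `observation` argument and the `(bid_depth, ask_depth)` return format are assumptions for illustration and may differ from how the repository implements its benchmarks.

```python
import random


def constant_strategy(depth=1):
    """Policy that always quotes the same bid and ask depth."""
    def policy(observation):
        return depth, depth
    return policy


def random_strategy(max_quote_depth=5, seed=None):
    """Policy that draws both depths uniformly from 1 to max_quote_depth."""
    rng = random.Random(seed)

    def policy(observation):
        return rng.randint(1, max_quote_depth), rng.randint(1, max_quote_depth)
    return policy


benchmarks = {
    "constant (d=1)": constant_strategy(depth=1),
    "random": random_strategy(max_quote_depth=5, seed=0),
}
```

Neither policy looks at the observation, which is what makes them useful baselines: a trained agent should at least match them.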
We can also view these results in table form.


In [7]:
# read and print the saved benchmarking table
with open(f"{outdir}image_folder/table_benchmarking") as f:
    print(f.read())

strategy           mean reward    std reward    reward per action    reward per second
---------------  -------------  ------------  -------------------  -------------------
constant (d=1)          0.0402     0.104441              0.000402             0.000402
random                 -0.013      0.075743             -0.00013             -0.00013
DDQN (best run)         0.0254     0.0751188             0.000254             0.000254
DDQN (mean)             0.014      0.0967264             0.00014              0.00014


It could also be interesting to see how the mean strategy and the individual strategies act. The figures below show the average inventory, cash and value processes of the different strategies when evaluated for *n\_test* episodes.

<div>
    <img src="results/mc_model_deep/mc_deep_example/image_folder/visualization_mean.png"/>
</div>

<div>
    <img src="results/mc_model_deep/mc_deep_example/image_folder/visualization_all.png"/>
</div>

## More results?

There are many more figures and tables to explore in the **[mc_deep_example](https://github.com/KodAgge/Reinforcement-Learning-for-Market-Making/tree/main/code/results/mc_model_deep/mc_deep_example/image_folder)** folder.