**Advanced Machine Learning (Semester 1 2023)**
# 8 Reinforcement Learning


*N. Hernitschek, 2023*


This Jupyter notebook gives an intro to Reinforcement Learning.



---
## Contents
* [Reinforcement Learning with `keras`/`tensorflow`](#first-bullet)
* [Water Flow Control](#second-bullet)
* [Summary](#third-bullet)

## 1. Reinforcement Learning with `keras`/ `tensorflow` <a class="anchor" id="first-bullet"></a>

The machine learning methods we have seen so far fall into the category of either supervised or unsupervised algorithms.
Reinforcement Learning stands out because the model is trained by interacting with a live environment.


`keras-rl` implements some state-of-the-art deep reinforcement learning algorithms in Python and integrates seamlessly with the Deep Learning library `keras`.

You can extend `keras-rl` according to your own needs, e.g. by building metrics and callbacks in addition to the built-in ones. It is also easy to implement your own environments, and even algorithms, by extending a few simple abstract classes.
You can find more information at 
https://github.com/keras-rl/keras-rl

OpenAI `gym` is an open-source Python library for developing and comparing reinforcement learning algorithms by providing a standard API to communicate between learning algorithms and environments, as well as a standard set of environments compliant with that API. Since its release, this API has become the field standard.
Furthermore, `keras-rl` works with OpenAI Gym out of the box. This means that evaluating and playing around with different algorithms is easy.
You can find more information at 
https://github.com/openai/gym



OpenAI `gym` comes with standard test environments for Reinforcement Learning, such as simple computer games like "Space Invaders", or simulations like balancing a pole on a moving cart. While such environments are fun to try out and give an idea of the performance of various Reinforcement Learning algorithms, relying only on them can be limiting.
For this reason, this tutorial will show you how to build a custom Reinforcement Learning environment using OpenAI `gym`. Specifically, we will build a Reinforcement Learning model to automatically regulate temperature and get it to an optimal range.


First we install the libraries:


In [17]:

#Keras-rl2 gives us several pre-defined agents to build Reinforcement Learning models.
#!pip install keras-rl2

#OpenAI gym provides environments for Reinforcement Learning
#!pip install gym


Defaulting to user installation because normal site-packages is not writeable
Collecting keras-rl2
  Downloading keras_rl2-1.0.5-py3-none-any.whl (52 kB)
Installing collected packages: keras-rl2
Successfully installed keras-rl2-1.0.5


## 2. Application: Water Flow Control <a class="anchor" id="second-bullet"></a>

We want to build a Reinforcement Learning model to automatically regulate temperature and get it to an optimal range.


**Goals:**
    
1. We want our optimal temperature to be between 37 and 39 degrees Celsius.

2. The shower length will be 60 seconds. This means that the **episode length** is 60 seconds, during which the model tries to reach and hold the optimal temperature range.

3. Our model will perform three actions: turn up, keep, and turn down the temperature. 

    
    

The base class `Env` from OpenAI `gym` allows us to build our custom environment.

The `Discrete` and `Box` spaces from `gym.spaces` allow us to define the actions we can take and the state we observe in our environment.


In [51]:
import numpy as np
from gym import Env
from gym.spaces import Box, Discrete
import random

import tensorflow as tf

from rl.agents import DQNAgent
from rl.policy import BoltzmannQPolicy
from rl.memory import SequentialMemory

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Flatten



## 2.1 Building the custom RL environment with OpenAI Gym





We begin by creating a `CustomEnv` class. By passing `Env` to the `CustomEnv` class, we **inherit** the methods and properties from the OpenAI `gym` environment class.

Within the `CustomEnv` class, we implement the `__init__` function to initialize the actions, observations, and episode length.
The actions are: down (`0`), keep (`1`), up (`2`).

The `observation_space` holds an array with our current temperature. Next, we set our start temperature to 38 degrees plus a random integer between -3 and 3. Finally, we set the shower length to 60 seconds.

Unlike `Discrete` spaces, `Box` spaces are much more flexible and allow us to pass continuous values, here between 0 and 100. In addition, they can be used to hold other data such as images, audio, and data frames.

The `step` function defines what happens after we take an action. We subtract 1 from the action value, so that the actions `0`, `1`, and `2` map to temperature changes of `-1`, `0`, and `+1`:

 *   For action `0`, we get `0 - 1 = -1`. This action lowers the temperature by 1.
 *   For action `1`, we get `1 - 1 = 0`. This action maintains the current temperature.
 *   For action `2`, we get `2 - 1 = 1`. This action increases the temperature by 1.

At each step, we also reduce the remaining shower length by 1.
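
The mapping from actions to temperature changes can be checked in isolation. The following is a minimal sketch in plain Python; the helper name `temperature_delta` and the dictionary `ACTION_NAMES` are hypothetical and only serve this illustration.

```python
# Labels for the three actions, as listed in the text above.
ACTION_NAMES = {0: "down", 1: "keep", 2: "up"}


def temperature_delta(action):
    """Return the temperature change for a discrete action (0, 1 or 2)."""
    return action - 1


temperature = 38
for action in (0, 1, 2):
    new_temperature = temperature + temperature_delta(action)
    print(f"action {action} ({ACTION_NAMES[action]}): {temperature} -> {new_temperature}")
```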

When calculating the **reward**:

* If the temperature is within the optimal range of 37 to 39 degrees, we give a reward of `1`.
* If the temperature is not in the optimal range, we give a reward of `-1`. 

During training, the agent tries to maximize the total reward, which means keeping the temperature within the optimal range.
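
As a small sketch of the reward rule (the function name `reward_for` is hypothetical and not part of `gym`), we can also work out the bounds on the total reward of one episode: with 60 steps and a reward of `1` or `-1` per step, the score lies between -60 and 60.

```python
def reward_for(temperature, low=37, high=39):
    """Return +1 inside the optimal range [low, high], otherwise -1."""
    return 1 if low <= temperature <= high else -1


EPISODE_LENGTH = 60  # shower length in seconds, one step per second

# Best case: every step is inside the range; worst case: no step is.
best_return = EPISODE_LENGTH * reward_for(38)
worst_return = EPISODE_LENGTH * reward_for(50)
print(best_return, worst_return)
```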


We use the `reset` function to reset our environment at the start of each episode. It resets the shower temperature and the remaining time.


In [52]:
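class CustomEnv(Env):
    """Shower environment: the agent adjusts the water temperature once per second."""

    def __init__(self):
        # Three discrete actions: 0 = down, 1 = keep, 2 = up
        self.action_space = Discrete(3)
        # The observation is a single temperature value between 0 and 100
        self.observation_space = Box(low=np.array([0]), high=np.array([100]))
        # Start temperature: 38 degrees plus a random integer between -3 and 3
        self.state = 38 + random.randint(-3,3)
        # Episode length in seconds
        self.shower_length = 60
        
    def step(self, action):
        # Map the action (0, 1, 2) to a temperature change (-1, 0, +1)
        self.state += action - 1
        # One second of shower time has passed
        self.shower_length -= 1
        
        # Reward +1 inside the optimal range [37, 39], otherwise -1
        if 37 <= self.state <= 39:
            reward = 1
        else:
            reward = -1
        
        # The episode ends when the shower time is used up
        done = self.shower_length <= 0
        
        # Placeholder for additional diagnostic information
        info = {}
        
        # Return step information
        return self.state, reward, done, info
    
    def reset(self):
        # Start a new episode with a fresh temperature and the full shower length
        self.state = 38 + random.randint(-3,3)
        self.shower_length = 60
        return self.state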

In [53]:
env = CustomEnv()

In [25]:
env.observation_space.sample()

array([84.878075], dtype=float32)

In [55]:
env.action_space.sample()

1

Let’s play around with our environment without doing any training. We're just sampling:

In [56]:
episodes = 20 #20 shower episodes
for episode in range(1, episodes+1):
    state = env.reset()
    done = False
    score = 0 
    
    while not done:
        action = env.action_space.sample()
        n_state, reward, done, info = env.step(action)
        score+=reward
    print('Episode:{} Score:{}'.format(episode, score))

Episode:1 Score:10
Episode:2 Score:8
Episode:3 Score:2
Episode:4 Score:-38
Episode:5 Score:-60
Episode:6 Score:-60
Episode:7 Score:-42
Episode:8 Score:-48
Episode:9 Score:-10
Episode:10 Score:8
Episode:11 Score:-10
Episode:12 Score:4
Episode:13 Score:-60
Episode:14 Score:-46
Episode:15 Score:-24
Episode:16 Score:-38
Episode:17 Score:-56
Episode:18 Score:-50
Episode:19 Score:-30
Episode:20 Score:30


After running through 20 different showers, we get different scores. 
Remember, for every step at which the temperature is not within the optimal range of 37 to 39 degrees, we get a reward of `-1`.

Most of the scores are negative, which indicates that the temperature was outside our optimal range most of the time.
The best score is typically between 25 and 30 (30 in the run above), which indicates that many of the steps in that episode were within the optimal range.
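
To get a more stable estimate of this random baseline than 20 episodes provide, we can simulate many episodes. The sketch below is a plain-Python stand-in for the environment (it does not use `gym`); the function name `run_random_episode`, the seed, and the number of episodes are assumptions made for this illustration.

```python
import random
from statistics import mean


def run_random_episode(rng, length=60):
    """Simulate one shower with uniformly random actions and return its score."""
    temperature = 38 + rng.randint(-3, 3)  # same start rule as in the environment
    score = 0
    for _ in range(length):
        temperature += rng.randrange(3) - 1  # random action 0, 1 or 2, minus 1
        score += 1 if 37 <= temperature <= 39 else -1
    return score


rng = random.Random(0)  # fixed seed so that the run is repeatable
scores = [run_random_episode(rng) for _ in range(1000)]
print("mean:", mean(scores), "min:", min(scores), "max:", max(scores))
```

Because every episode sums 60 rewards of `1` or `-1`, each score is an even number between -60 and 60, as in the output above.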

Let’s go ahead and use `keras` to build a **Deep Learning model**.

## 2.2 Creating a Deep Learning model using Keras


The first step involves defining our states and actions:


In [57]:
states = env.observation_space.shape
actions = env.action_space.n

In [40]:
actions

3

In [41]:
states

(1,)

In the model, we pass the temperature (`input_shape=states`) to the input layer of our Deep Learning model. The output layer has one unit per action.

In [58]:
def build_model(states, actions):
    model = Sequential()    
    model.add(Dense(24, activation='relu', input_shape=states))
    model.add(Dense(24, activation='relu'))
    model.add(Dense(actions, activation='linear'))
    return model

In [59]:
model = build_model(states, actions)

In [60]:
model.summary()

Model: "sequential_3"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 dense_9 (Dense)             (None, 24)                48        
                                                                 
 dense_10 (Dense)            (None, 24)                600       
                                                                 
 dense_11 (Dense)            (None, 3)                 75        
                                                                 
Total params: 723
Trainable params: 723
Non-trainable params: 0
_________________________________________________________________
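
The parameter counts in the summary can be reproduced by hand: a `Dense` layer has one weight per input-unit pair plus one bias per unit. A short sketch (the helper name `dense_params` is hypothetical):

```python
def dense_params(n_inputs, n_units):
    """Number of trainable parameters of a fully connected layer."""
    return n_inputs * n_units + n_units  # weights + biases


# Input size 1 (the temperature), two hidden layers of 24 units, 3 actions.
layer_sizes = [1, 24, 24, 3]
per_layer = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
total = sum(per_layer)
print(per_layer, total)
```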


## 2.3 Building the agent with `keras-rl`

We can then pass this model to the `keras-rl` agent.

We use the Boltzmann Q policy. It turns the Q-values into a probability distribution over the actions (a softmax) and selects an action at random according to this distribution.
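
The idea behind this policy can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the `keras-rl` implementation: the function names, the temperature parameter `tau=1.0`, and the example Q-values are made up for this sketch.

```python
import math
import random


def boltzmann_probabilities(q_values, tau=1.0):
    """Softmax over Q-values; tau (assumed 1.0 here) controls the randomness."""
    scaled = [q / tau for q in q_values]
    largest = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - largest) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_action(q_values, tau=1.0, rng=random):
    """Draw an action index at random according to the Boltzmann probabilities."""
    probs = boltzmann_probabilities(q_values, tau)
    return rng.choices(range(len(q_values)), weights=probs, k=1)[0]


q_values = [-1.0, 0.5, 2.0]  # made-up Q-values for down, keep, up
probs = boltzmann_probabilities(q_values)
action = sample_action(q_values)
print(probs, action)
```

Actions with higher Q-values are chosen more often, but every action keeps a non-zero probability, which gives the agent a way to explore.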


We build a `DQNAgent` using the model we created in the section above.
A DQN agent is a value-based reinforcement learning agent that trains a critic (here, our neural network) to estimate the return, i.e. the expected future rewards. DQN is a variant of Q-learning. The DQN agent uses the `SequentialMemory` to store the states, actions, and rewards it has experienced.
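
To make the two ingredients more concrete, here is a minimal sketch of a replay memory and of the Q-learning target that the critic is trained towards. It is plain Python and not the `keras-rl` code: the names `remember` and `td_target`, the discount factor `GAMMA = 0.99`, and the stored example transitions are assumptions for this illustration.

```python
import random
from collections import deque

GAMMA = 0.99  # assumed discount factor for future rewards

# Bounded memory: once it is full, the oldest transitions are dropped.
memory = deque(maxlen=50000)


def remember(state, action, reward, next_state, done):
    """Store one transition for later training."""
    memory.append((state, action, reward, next_state, done))


def td_target(reward, next_q_values, done, gamma=GAMMA):
    """Q-learning target: reward plus discounted best Q-value of the next state."""
    if done:
        return reward  # no future rewards after the last step of an episode
    return reward + gamma * max(next_q_values)


# A few made-up transitions (temperature, action, reward, next temperature, done).
remember(38, 1, 1, 38, False)
remember(39, 2, -1, 40, False)
remember(37, 1, 1, 37, True)

# Training samples random mini-batches from the memory.
batch = random.sample(list(memory), 2)
print(batch, td_target(1, [0.0, 2.0, 1.0], False))
```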






In [61]:
def build_agent(model, actions):
    policy = BoltzmannQPolicy()
    memory = SequentialMemory(limit=50000, window_length=1)
    dqn = DQNAgent(model=model, memory=memory, policy=policy, 
                  nb_actions=actions, nb_steps_warmup=10, target_model_update=1e-2)
    return dqn

With our custom Reinforcement Learning environment, we can now train the `dqn` agent to keep the temperature in the optimal range.

We train the agent for 60000 steps; you can adjust the `nb_steps` parameter to train it for longer, which may produce better results.



In [63]:
dqn = build_agent(model, actions)

dqn.compile(tf.keras.optimizers.legacy.Adam(learning_rate=1e-3), metrics=['mae'])

dqn.fit(env, nb_steps=60000, visualize=False, verbose=1)

2023-06-07 15:27:07.671126: W tensorflow/c/c_api.cc:300] Operation '{name:'dense_10_2/kernel/Assign' id:1004 op device:{requested: '', assigned: ''} def:{{{node dense_10_2/kernel/Assign}} = AssignVariableOp[_has_manual_control_dependencies=true, dtype=DT_FLOAT, validate_shape=false](dense_10_2/kernel, dense_10_2/kernel/Initializer/stateless_random_uniform)}}' was changed by setting attribute after it was run by a session. This mutation will have no effect, and will trigger an error in the future. Either don't modify nodes after running them or create a new session.


Training for 60000 steps ...
Interval 1 (0 steps performed)
    1/10000 [..............................] - ETA: 17:59 - reward: -1.0000

  updates=self.state_updates,
2023-06-07 15:27:08.025049: W tensorflow/c/c_api.cc:300] Operation '{name:'dense_11/BiasAdd' id:775 op device:{requested: '', assigned: ''} def:{{{node dense_11/BiasAdd}} = BiasAdd[T=DT_FLOAT, _has_manual_control_dependencies=true, data_format="NHWC"](dense_11/MatMul, dense_11/BiasAdd/ReadVariableOp)}}' was changed by setting attribute after it was run by a session. This mutation will have no effect, and will trigger an error in the future. Either don't modify nodes after running them or create a new session.
2023-06-07 15:27:08.060000: W tensorflow/c/c_api.cc:300] Operation '{name:'total_3/Assign' id:1191 op device:{requested: '', assigned: ''} def:{{{node total_3/Assign}} = AssignVariableOp[_has_manual_control_dependencies=true, dtype=DT_FLOAT, validate_shape=false](total_3, total_3/Initializer/zeros)}}' was changed by setting attribute after it was run by a session. This mutation will have no effect, and will trigger an error in the future. Either don't

166 episodes - episode_reward: -29.000 [-60.000, 40.000] - loss: 0.801 - mae: 4.947 - mean_q: -5.620

Interval 2 (10000 steps performed)
167 episodes - episode_reward: -30.180 [-60.000, 36.000] - loss: 1.296 - mae: 7.799 - mean_q: -10.993

Interval 3 (20000 steps performed)
167 episodes - episode_reward: -30.443 [-60.000, 30.000] - loss: 2.104 - mae: 10.180 - mean_q: -14.573

Interval 4 (30000 steps performed)
166 episodes - episode_reward: -23.434 [-60.000, 34.000] - loss: 1.983 - mae: 9.731 - mean_q: -13.899

Interval 5 (40000 steps performed)
167 episodes - episode_reward: -24.168 [-60.000, 34.000] - loss: 1.825 - mae: 9.315 - mean_q: -13.270

Interval 6 (50000 steps performed)
done, took 495.196 seconds


<keras.callbacks.History at 0x7f0c4b063430>



After 60000 steps, we typically get a mean reward per step between -0.1 and -0.3. In the initial 10000 steps, we begin with a reward of -0.6412. This improved to -0.3908 at the end. The exact values vary from run to run: in the log shown above, the mean episode reward improves from -29.000 in the first interval to -24.168 in the fifth, i.e. from about -0.48 to about -0.40 per step.

The model is not perfect, but increasing the number of training steps should give better results (positive rewards).

A positive reward means that the temperature is within its optimal range. You can try adding some random figures when creating the model and see how your agent will behave after training.

## 2.4 Testing our custom Reinforcement Learning environment

After training our model, we can test it out. 

This is an idealized example and might not represent a real-world scenario, e.g. one in which something else also influences the temperature. It is thus always important to build a model that is as close as possible to the scenario in question.

In [64]:
results = dqn.test(env, nb_episodes=150, visualize=False)
print(np.mean(results.history['episode_reward']))

Testing for 150 episodes ...
Episode 1: reward: 60.000, steps: 60
Episode 2: reward: 60.000, steps: 60
Episode 3: reward: 60.000, steps: 60
Episode 4: reward: 60.000, steps: 60
Episode 5: reward: 60.000, steps: 60
Episode 6: reward: 60.000, steps: 60
Episode 7: reward: 60.000, steps: 60
Episode 8: reward: 60.000, steps: 60
Episode 9: reward: 60.000, steps: 60
Episode 10: reward: 60.000, steps: 60
Episode 11: reward: 60.000, steps: 60
Episode 12: reward: 58.000, steps: 60
Episode 13: reward: 60.000, steps: 60
Episode 14: reward: 60.000, steps: 60
Episode 15: reward: 60.000, steps: 60
Episode 16: reward: 60.000, steps: 60
Episode 17: reward: 60.000, steps: 60
Episode 18: reward: 58.000, steps: 60
Episode 19: reward: 60.000, steps: 60
Episode 20: reward: 58.000, steps: 60
Episode 21: reward: 60.000, steps: 60
Episode 22: reward: 60.000, steps: 60
Episode 23: reward: 60.000, steps: 60
Episode 24: reward: 60.000, steps: 60
Episode 25: reward: 60.000, steps: 60
Episode 26: reward: 60.000, st
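
The values of 58 and 60 in the test log are consistent with what a simple hand-written controller achieves. The sketch below is not the trained agent: it is a plain-Python rule (the function name `run_rule_based_episode` and the target of 38 degrees are assumptions) that always moves the temperature towards the middle of the optimal range.

```python
def run_rule_based_episode(start_temperature, length=60, target=38):
    """Return the total reward of one episode when always steering towards target."""
    temperature = start_temperature
    total = 0
    for _ in range(length):
        if temperature < target:
            action = 2  # up
        elif temperature > target:
            action = 0  # down
        else:
            action = 1  # keep
        temperature += action - 1
        total += 1 if 37 <= temperature <= 39 else -1
    return total


# All possible start temperatures: 38 plus a random integer between -3 and 3.
returns = {start: run_rule_based_episode(start) for start in range(35, 42)}
print(returns)
```

Starting at 35 or 41 degrees, the first step still ends outside the optimal range, so one reward of `-1` is unavoidable and the total is 58; for all other start temperatures the total is 60.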

## 3. Summary <a class="anchor" id="third-bullet"></a>

At this point, all of you should have:
* seen how to use Reinforcement Learning with `keras` and `keras-rl`
* seen how to build custom environments with OpenAI `gym`.





