# AI-LAB SESSION 5: Deep Reinforcement Learning

In this lesson we will use the CartPole environment and see how to create and work with a neural network using Keras on top of TensorFlow.

## CartPole
The environment used is **CartPole** (taken from the book by Sutton and Barto, as shown in the figure).

![Cartpole](images/cartpole.jpg)

A pole is attached by an un-actuated joint to a cart, which moves along a frictionless track. The system is controlled by applying a force of +1 or -1 to the cart. The pendulum starts upright, and the goal is to prevent it from falling over. A reward of +1 is provided for every timestep that the pole remains upright. The episode ends when the pole is more than 12 degrees from vertical, or the cart moves more than 2.4 units from the center.

In [2]:
import os, sys, keras, random
import numpy as np
module_path = os.path.abspath(os.path.join('../tools'))
if module_path not in sys.path:
    sys.path.append(module_path)

import gym, envs
from utils.ai_lab_functions import *
from timeit import default_timer as timer
from tqdm import tqdm as tqdm
from collections import deque
from keras.models import Sequential
from keras.layers import Dense, Dropout
from keras.optimizers import Adam

The **state** of the environment is represented as a tuple of 4 values:
- *Cart Position* ranges from -4.8 to 4.8
- *Cart Velocity* ranges from -inf to +inf
- *Pole Angle* ranges from -24 deg to 24 deg
- *Pole Velocity* ranges from -inf to +inf

The **actions** allowed in the environment are 2:
- *action 0*: push cart to left
- *action 1*: push cart to right

The **reward** is 1 for every step taken, including the termination step.

In [4]:
env = gym.make("CartPole-v1")
state = env.reset()
print("STARTING STATE: {}".format(state))
print("\tCart Position: {}\n\tCart Velocity {}\n\tPole Angle {} \n\tPole Velocity {}".format(state[0], state[1], state[2], state[3]))

print("\nPOSSIBLE ACTIONS: ", env.action_space.n)

STARTING STATE: [-0.02159284  0.0218142   0.03271236  0.02024597]
	Cart Position: -0.021592838289049787
	Cart Velocity 0.021814203645092456
	Pole Angle 0.0327123590808205 
	Pole Velocity 0.02024597093067808

POSSIBLE ACTIONS:  2


Finally, we still have the standard functionalities of a Gym environment:
- step(action): the agent performs action from the current state. Returns a tuple (new_state, reward, done, info) where:
    - new_state: is the new state reached as a consequence of the agent's last action
    - reward: the reward obtained by the agent in this step
    - done: True if the episode is terminal, False otherwise
    - info: not used, you can safely discard it

- reset(): the environment is reset and the agent goes back to the starting position. Returns the initial state
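The interaction loop built on `reset()` and `step(action)` can be sketched as follows. This is a minimal illustration, not the lab's code: `ToyEnv` and `run_random_episode` are hypothetical names, and `ToyEnv` is only a stand-in that mimics the `(new_state, reward, done, info)` interface described above so that the snippet runs without Gym installed.

```python
import random


class ToyEnv:
    """Hypothetical stand-in that mimics the reset()/step() interface of a Gym environment."""

    def __init__(self, max_steps=5):
        self.max_steps = max_steps
        self.steps = 0

    def reset(self):
        # Start a new episode and return the initial state.
        self.steps = 0
        return [0.0, 0.0, 0.0, 0.0]

    def step(self, action):
        # Advance one timestep; this toy episode ends after max_steps steps.
        self.steps += 1
        new_state = [float(self.steps), 0.0, 0.0, 0.0]
        reward = 1.0
        done = self.steps >= self.max_steps
        return new_state, reward, done, {}


def run_random_episode(env, n_actions=2):
    """Play one episode with uniformly random actions and return the total reward."""
    state = env.reset()
    score = 0.0
    done = False
    while not done:
        action = random.randrange(n_actions)
        state, reward, done, _info = env.step(action)
        score += reward
    return score


print(run_random_episode(ToyEnv(max_steps=5)))  # 5.0: one reward per step
```

With a real environment, the same loop should work by passing the object returned by `gym.make("CartPole-v1")` instead of `ToyEnv()`, assuming a Gym version whose `step` returns the four-element tuple described above.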

## Neural Network with Keras
**Keras** is an open-source neural-network library written in Python. It is capable of running on top of TensorFlow, Microsoft Cognitive Toolkit, R, Theano, or PlaidML. Designed to enable fast experimentation with deep neural networks, it focuses on being user-friendly, modular, and extensible.

![Network](images/neural_networks.png)

With Keras you can easily create a neural network with the **Sequential** module. Before training a neural network you must compile it, selecting the loss function and the optimizer. In our experiment we will use *mean_squared_error* as the loss function and the *adam* optimizer, which is a standard configuration for a DQN problem.

In [6]:
input_layer = 3
layer_size = 5
output_layer = 2

model = Sequential()
model.add(Dense(layer_size, input_dim=input_layer, activation="relu")) #input layer + hidden layer #1
model.add(Dense(layer_size, activation="relu")) #hidden layer #2
model.add(Dense(layer_size, activation="relu")) #hidden layer #3
model.add(Dense(layer_size, activation="relu")) #hidden layer #4
model.add(Dense(layer_size, activation="relu")) #hidden layer #5
model.add(Dense(output_layer, activation="linear")) #output layer

model.compile(loss="mean_squared_error", optimizer='adam') #loss function and optimizer definition

In Keras you can compute the output of a network with the **predict** function, which takes the values of the input layer nodes and returns the corresponding values of the output layer.

In [8]:
input_network = [random.uniform(0, 1), random.uniform(0, 1), random.uniform(0, 1)]
output_network = model.predict(np.array([input_network]))
print("Input Network: {}".format(input_network))
print("Network Prediction: {}".format(output_network[0]))

Input Network: [0.8987786622886936, 0.036971218190691824, 0.2312485655346025]
Network Prediction: [-0.15208383 -0.12403072]
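To make the role of the layers more concrete, the following sketch computes the forward pass of a tiny dense network by hand, in plain Python. It is only an illustration of the idea (weighted sum, bias, activation); the helper names `relu` and `dense` and the hand-picked weights are assumptions, and Keras computes the same kind of operation with its own optimized routines and randomly initialized weights.

```python
def relu(values):
    """Rectified linear unit applied element-wise."""
    return [max(0.0, v) for v in values]


def dense(inputs, weights, biases, activation=None):
    """Fully connected layer: weights[j][i] links input i to output unit j."""
    outputs = [
        sum(w * x for w, x in zip(row, inputs)) + b
        for row, b in zip(weights, biases)
    ]
    return activation(outputs) if activation else outputs


# Hypothetical hand-picked weights: 3 inputs -> 2 hidden units -> 2 outputs.
hidden = dense(
    [1.0, 2.0, 3.0],
    weights=[[1.0, 0.0, -1.0], [0.5, 0.5, 0.0]],
    biases=[0.0, 1.0],
    activation=relu,
)
output = dense(hidden, weights=[[1.0, 1.0], [2.0, -1.0]], biases=[0.0, 0.5])

print(hidden)  # [0.0, 2.5]
print(output)  # [2.5, -2.0]
```

The last layer uses no activation, which corresponds to the `"linear"` activation of the output layer: the outputs can take any real value.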


To train a network in Keras we must use the **fit** function, which takes as input:
- *input*: the input of the network that we want to train on
- *expected_output*: the output that we consider correct for that input
- *epochs*: the number of backpropagation iterations over the data (in DQN this value is always 1).

In [10]:
input_network = [random.uniform(0, 1), random.uniform(0, 1), random.uniform(0, 1)]
expected_output = [0, 0]

print("Prediction 'before' training:")
print(model.predict(np.array([input_network])))

model.fit(np.array([input_network]), np.array([expected_output]), epochs=1000, verbose=0)

print("\nPrediction 'after' training:")
print(model.predict(np.array([input_network])))

Prediction 'before' training:
[[-0.10590489 -0.09018652]]

Prediction 'after' training:
[[0.000000e+00 5.820766e-11]]
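The effect of repeated epochs can be illustrated with a much smaller model. The sketch below trains a single linear unit on one sample by plain gradient descent on the squared error. It is a simplification under stated assumptions: `train_linear_unit` is a hypothetical helper, and plain gradient descent is swapped in for the *adam* optimizer used in the Keras cells.

```python
def train_linear_unit(x, target, w=0.8, b=-0.3, learning_rate=0.1, epochs=100):
    """Fit prediction = w * x + b to a single target by gradient descent on the squared error."""
    history = []
    for _ in range(epochs):
        prediction = w * x + b
        error = prediction - target
        history.append(error ** 2)  # loss before this update
        # Gradient of (prediction - target)^2 with respect to w and b.
        w -= learning_rate * 2 * error * x
        b -= learning_rate * 2 * error
    return w, b, history


x, target = 0.5, 0.0
print("Prediction before training:", 0.8 * x - 0.3)
w, b, history = train_linear_unit(x, target)
print("Prediction after training:", w * x + b)
print("First and last loss:", history[0], history[-1])
```

Each epoch moves the prediction a little closer to the expected output, so the loss shrinks over the iterations, which is the same trend visible in the Keras output above.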


Finally, remember that for all the methods (*fit*, *predict*, ...) Keras requires as input a numpy array of arrays, so you must convert your state to the correct **shape**. In the same way, Keras returns an array of arrays, so to extract the values of the output layer you must select the first element.

In [12]:
state = np.array([0, 0, 0])
# calling model.predict(state) at this point would raise a shape error, because state is 1-D
state = state.reshape(1, 3)
print("Prediction:", model.predict(state)[0])

Prediction: [ 4.3242984e-04 -2.9583927e-05]


## Assignment: Q-Learning

Your first assignment is to implement all the functions necessary for a deep Q-learning algorithm. In particular you must implement the following functions: *create_model*, *train_model* and *DQN*.

#### Hint:
For the experience replay buffer you can use the Python data structure *deque*, defining the maximum length allowed. With the *random.sample(replay_buffer, size)* function you can sample *size* elements from the queue:

In [14]:
replay_buffer = deque(maxlen=10000)
for _ in range(100): replay_buffer.append(random.uniform(0, 1))
    
samples = random.sample(replay_buffer, 3) 
print("Get 3 elements from replay_buffer:", samples)

Get 3 elements from replay_buffer: [0.21813980204636396, 0.21343985912814512, 0.8259263480587257]
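In a DQN the buffer stores transitions rather than plain numbers. The sketch below, which uses made-up integer states purely for illustration, shows two properties of a bounded *deque*: once `maxlen` is reached the oldest transitions are discarded, and `random.sample` returns a minibatch of stored tuples.

```python
import random
from collections import deque

# Small maxlen so that the eviction of old entries is easy to see.
replay_buffer = deque(maxlen=3)

for step in range(5):
    # Transition layout: (state, action, reward, next_state, done).
    # The integer states are placeholders, not real CartPole observations.
    transition = (step, 0, 1.0, step + 1, False)
    replay_buffer.append(transition)

# Only the 3 most recent transitions survive.
print([t[0] for t in replay_buffer])  # [2, 3, 4]

minibatch = random.sample(replay_buffer, 2)
for state, action, reward, next_state, done in minibatch:
    print(state, action, reward, next_state, done)
```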


In [16]:
def create_model(input_size, output_size, hidden_layer_size, hidden_layer_number):
    """
    Create the neural network model with the given parameters

    Args:
        input_size: the number of nodes for the input layer
        output_size: the number of nodes for the output layer
        hidden_layer_size: the number of nodes for each hidden layer
        hidden_layer_number: the number of hidden layers

    Returns:
        model: the corresponding neural network
    """

    model = Sequential()
    total = hidden_layer_size

    # Input layer + first hidden layer.
    model.add(Dense(total, input_dim = input_size, activation = "relu"))

    # Remaining hidden layers (the first one was added above).
    for _ in range(hidden_layer_number - 1):
        model.add(Dense(total, activation = "relu"))
    
    # Output layer.
    model.add(Dense(output_size, activation = "linear"))
    
    # Algorithms to use for optimization.
    model.compile(loss = "mean_squared_error", optimizer = 'adam')

    return model

In [18]:
def train_model(model, memory, gamma=0.95):
    """
    Trains the model on a random minibatch of transitions sampled from the replay memory

    Args:
        model: the neural network model to train
        memory: the replay memory of (state, action, reward, next_state, done) tuples to sample from
        gamma: gamma value, the discount factor for the Bellman equation

    Returns:
        model: the trained neural network
    """

    batch_size = 32

    if len(memory) >= batch_size:
        minibatch = random.sample(memory, batch_size)
        
        # For every (s, a, r, s', done) tuple in the minibatch.
        for state, action, reward, next_state, done in minibatch:
            Q = reward

            if not done:
                Q = (reward + gamma * np.amax(model.predict(next_state)[0]))

            Q_values = model.predict(state)
            Q_values[0][action] = Q

            model.fit(state, Q_values, verbose = 0)

        # The memory is not cleared: experience replay reuses past transitions in later updates.
    
    return model
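The target value used inside `train_model` follows the Bellman equation: the target is the reward alone for a terminal transition, and the reward plus the discounted best Q-value of the next state otherwise. The sketch below isolates that rule in plain Python; `q_target` is a hypothetical helper and the Q-values are made-up numbers standing in for `model.predict(next_state)[0]`.

```python
def q_target(reward, next_q_values, done, gamma=0.95):
    """Bellman target for one transition.

    next_q_values stands in for the network prediction on the next state.
    """
    if done:
        # No future reward can be collected after a terminal step.
        return reward
    return reward + gamma * max(next_q_values)


next_q_values = [0.5, 2.0]  # made-up Q-values for the two actions
print(q_target(1.0, next_q_values, done=False))  # 1 + 0.95 * 2.0
print(q_target(1.0, next_q_values, done=True))   # reward only
```

Only the entry of the chosen action is replaced by this target before calling `fit`, so the error on the other action is zero and its output is left unchanged by the update.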

In [20]:
def DQN(environment, neural_network, trials, epsilon_decay = 0.995):
    """
    Performs the Deep Q-Learning algorithm for a specific environment on a specific neural network model

    Args:
        environment: OpenAI Gym environment
        neural_network: the neural network to train
        trials: the number of iterations for the training phase
        epsilon_decay: the decay value of epsilon for the eps-greedy exploration

    Returns:
        neural_network: the trained neural network
        score_queue: 1-d array of the total reward obtained in each trial
    """

    epsilon = 1.0
    epsilon_min = 0.01

    # Bounded replay memory: the oldest transitions are dropped once it is full.
    memory = deque(maxlen=10000)
    score_queue = []

    observation_space = environment.observation_space.shape[0]

    for trial in range(trials):
        score = 0

        state = environment.reset()
        state = state.reshape(1, observation_space)

        while True:
            action = 0
            
            # Eps-greedy action selection: explore with probability epsilon.
            if random.random() < epsilon:
                action = random.randrange(environment.action_space.n)
            else:
                action = np.argmax(neural_network.predict(state)[0])

            next_state, reward, done, info = environment.step(action)
            next_state = next_state.reshape(1, observation_space)
            memory.append((state, action, reward, next_state, done))

            # Train neural network (gamma keeps its default value).
            neural_network = train_model(neural_network, memory)
            score += reward
            
            # Break condition.
            if done:
                break
            
            # Update epsilon with a multiplicative decay, bounded below by epsilon_min.
            epsilon = max(epsilon_min, epsilon * epsilon_decay)

            state = next_state

        score_queue.append(score)
        print("Trial {:>5} - Score {:>5}.".format(trial, score))
        
        if score > 130: 
            break

    return neural_network, score_queue
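The eps-greedy exploration used by `DQN` can be isolated in two small functions. This is a sketch under assumptions: `epsilon_greedy` and `decay_epsilon` are hypothetical helpers, the Q-values are passed in as a plain list instead of a network prediction, and the multiplicative decay is one common choice for a decay factor such as 0.995.

```python
import random


def epsilon_greedy(q_values, epsilon):
    """Pick a random action with probability epsilon, otherwise the best-valued one."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


def decay_epsilon(epsilon, epsilon_decay=0.995, epsilon_min=0.01):
    """Shrink epsilon multiplicatively, never going below epsilon_min."""
    return max(epsilon_min, epsilon * epsilon_decay)


epsilon = 1.0
for _ in range(2000):
    epsilon = decay_epsilon(epsilon)

print(epsilon)                                    # 0.01: the lower bound is reached
print(epsilon_greedy([0.1, 0.9], epsilon=0.0))    # 1: always greedy when epsilon is 0
```

Starting from `epsilon = 1.0` the agent acts almost entirely at random, and as epsilon shrinks it relies more and more on the network's own estimates.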

In [23]:
env = gym.make("CartPole-v1")
neural_network = create_model(4, 2, 32, 2)
neural_network, score = DQN(env, neural_network, trials=1000)

Trial     0 - Score  11.0.
Trial     1 - Score  17.0.
Trial     2 - Score  21.0.
Trial     3 - Score  11.0.



## Execution
The following code plots the rewards obtained by the DQN; the training itself can take up to 10 minutes on some computers. Reference results for comparison are shown below. Since the executions are stochastic, your chart may differ: what matters is the overall trend and a visible improvement of the reward by the end of training.

In [None]:
rewser = []
window = 10

score = rolling(np.array(score), window)
rewser.append({"x": np.arange(1, len(score) + 1), "y": score, "ls": "-", "label": "DQN"})
plot(rewser, "Rewards", "Episodes", "Rewards")
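The scores are smoothed before plotting because single-episode rewards are very noisy. The course's `rolling` helper lives in the lab utilities and its exact implementation is not shown here, so the following is only an assumed equivalent: a plain moving average over a fixed window, with the hypothetical name `rolling_mean`.

```python
def rolling_mean(values, window):
    """Average of each run of `window` consecutive values (assumed behaviour of the smoothing step)."""
    if window <= 0:
        raise ValueError("window must be positive")
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


scores = [11.0, 17.0, 21.0, 11.0]
print(rolling_mean(scores, window=2))  # [14.0, 19.0, 16.0]
```

A larger window gives a smoother curve but hides short-lived changes, and the result has `window - 1` fewer points than the input.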

**Standard DQN on CartPole results:**
<img src="images/results-standard.png" width="600">