<div style="width: 100%; clear: both;">
<div style="float: left; width: 50%;">
<img src="http://www.uoc.edu/portal/_resources/common/imatges/marca_UOC/UOC_Masterbrand.jpg" align="left">
</div>
<div style="float: right; width: 50%;">
<p style="margin: 0; padding-top: 22px; text-align:right;">M2.883 · Reinforcement Learning</p>
<p style="margin: 0; text-align:right;">University Master's Degree in Data Science</p>
<p style="margin: 0; text-align:right; padding-bottom: 100px;">Faculty of Computer Science, Multimedia and Telecommunications</p>
</div>
</div>
<div style="width:100%;">&nbsp;</div>


# Module 1: OpenAI Gym examples

In this _notebook_ we will load some of the OpenAI Gym scenarios and look at how a few agents interact with these scenarios, or environments.

## 1. CartPole
In this first example we will load the CartPole environment and run some tests.

### 1.1. Loading the environment

The following code imports the packages needed for the example, creates the environment with the `make` method and prints the action space (two actions: 0 = left and 1 = right), the observation space (four values: cart position, cart velocity, pole angle and pole velocity at the tip) and the range of the reward variable (from minus infinity to plus infinity).

In [1]:
import gym
import numpy as np

env = gym.make('CartPole-v0')
print("Action space is {} ".format(env.action_space))
print("Observation space is {} ".format(env.observation_space))
print("Reward range is {} ".format(env.reward_range))

Action space is Discrete(2) 
Observation space is Box(-3.4028234663852886e+38, 3.4028234663852886e+38, (4,), float32) 
Reward range is (-inf, inf) 


Next, we reset the environment (a step that must always be performed after creating it) and initialise the variables that store the number of steps executed (`t`), the accumulated reward (`total_reward`) and the flag that tells us when an episode has ended (`done`).

In [2]:
# Environment reset
obs = env.reset()
t, total_reward, done = 0, 0, False

### 1.2. Running an episode

Next, we will run one episode of the CartPole environment using an agent that selects its actions at random.

The following code runs one episode of the environment, which ends when the variable `done` becomes `True`. The agent is implemented with the `env.action_space.sample()` method, which picks an action at random. For each time step, the code prints the observation produced by the environment (the four values described above), the action selected and the reward obtained in that step (+1 for every action until the episode ends).

In [3]:
while not done:
    
    # Render the environment
    env.render()
    
    # Get random action (this is the implementation of the agent)
    action = env.action_space.sample()
    
    # Execute action and get response
    new_obs, reward, done, info = env.step(action)
    print("Obs: {} -> Action: {} and reward: {}".format(np.round(obs, 3), action, reward))
    
    obs = new_obs
    total_reward += reward
    t += 1
    
# Show the final observation (rewards and steps were already counted inside the loop)
print("Final obs: {}".format(np.round(obs, 3)))

Obs: [ 0.007 -0.042 -0.049  0.002] -> Action: 1 and reward: 1.0
Obs: [ 0.006  0.153 -0.049 -0.306] -> Action: 1 and reward: 1.0
Obs: [ 0.009  0.349 -0.055 -0.613] -> Action: 0 and reward: 1.0
Obs: [ 0.016  0.155 -0.067 -0.338] -> Action: 1 and reward: 1.0
Obs: [ 0.02   0.351 -0.074 -0.651] -> Action: 1 and reward: 1.0
Obs: [ 0.027  0.547 -0.087 -0.966] -> Action: 1 and reward: 1.0
Obs: [ 0.037  0.743 -0.106 -1.285] -> Action: 0 and reward: 1.0
Obs: [ 0.052  0.549 -0.132 -1.027] -> Action: 0 and reward: 1.0
Obs: [ 0.063  0.356 -0.152 -0.779] -> Action: 1 and reward: 1.0
Obs: [ 0.07   0.553 -0.168 -1.115] -> Action: 1 and reward: 1.0
Obs: [ 0.082  0.75  -0.19  -1.455] -> Action: 0 and reward: 1.0
Final obs: [ 0.097  0.558 -0.219 -1.228]


Finally, we print the results and close the environment.

In [4]:
print("Episode finished after {} timesteps and reward was {} ".format(t, total_reward))
env.close()

Episode finished after 11 timesteps and reward was 11.0 
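
The episode above ends as soon as the environment returns `done = True`. As a rough sketch of why that happens, the function below reproduces the two failure conditions documented for Gym's classic CartPole task (cart position beyond ±2.4 and pole angle beyond ±12°). The name `cartpole_failed` and the two constants are introduced here for illustration only, and CartPole-v0 additionally stops an episode after 200 steps, which this check does not cover.

```python
import math

# Thresholds as documented for Gym's classic CartPole task (assumed here,
# not read from the environment object).
X_THRESHOLD = 2.4                         # cart position limit
THETA_THRESHOLD = 12 * 2 * math.pi / 360  # 12 degrees in radians (about 0.209)

def cartpole_failed(obs):
    """Return True if the observation violates a CartPole failure condition.

    obs = (cart position, cart velocity, pole angle, pole velocity at the tip)
    """
    x, _, theta, _ = obs
    return abs(x) > X_THRESHOLD or abs(theta) > THETA_THRESHOLD

# Last two observations printed in the episode above.
print(cartpole_failed([0.082, 0.75, -0.19, -1.455]))    # False: still balanced
print(cartpole_failed([0.097, 0.558, -0.219, -1.228]))  # True: pole past 12 degrees
```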


### 1.3. Simulating several episodes

The following code fragment repeats the process of the previous section for the number of episodes defined in the variable `num_episodes`.

In [5]:
num_episodes = 10

for episode in range(num_episodes):

    # Environment reset
    obs = env.reset()
    t, total_reward, done = 0, 0, False
    
    print('Running episode {} '.format(episode+1))
    
    while not done:
    
        # Render the environment
        env.render()
    
        # Get random action (this is the implementation of the agent)
        action = env.action_space.sample()
    
        # Execute action and get response
        new_obs, reward, done, info = env.step(action)
        print("Obs: {} -> Action: {} and reward: {}".format(np.round(obs, 3), action, reward))
    
        obs = new_obs
        total_reward += reward
        t += 1
        
    # Show the final observation (rewards and steps were already counted inside the loop)
    print("Final obs: {}".format(np.round(obs, 3)))
    print("Episode {} finished after {} timesteps and reward was {} ".format(episode+1, t, total_reward))
    print('')
    
env.close()

Running episode 1 
Obs: [-0.026  0.045 -0.023 -0.049] -> Action: 1 and reward: 1.0
Obs: [-0.025  0.24  -0.024 -0.349] -> Action: 0 and reward: 1.0
Obs: [-0.02   0.045 -0.031 -0.064] -> Action: 0 and reward: 1.0
Obs: [-0.019 -0.149 -0.032  0.218] -> Action: 1 and reward: 1.0
Obs: [-0.022  0.046 -0.028 -0.084] -> Action: 0 and reward: 1.0
Obs: [-0.021 -0.149 -0.03   0.2  ] -> Action: 1 and reward: 1.0
Obs: [-0.024  0.047 -0.026 -0.102] -> Action: 1 and reward: 1.0
Obs: [-0.023  0.242 -0.028 -0.403] -> Action: 0 and reward: 1.0
Obs: [-0.018  0.048 -0.036 -0.119] -> Action: 1 and reward: 1.0
Obs: [-0.017  0.243 -0.038 -0.423] -> Action: 1 and reward: 1.0
Obs: [-0.013  0.439 -0.046 -0.727] -> Action: 1 and reward: 1.0
Obs: [-0.004  0.635 -0.061 -1.034] -> Action: 1 and reward: 1.0
Obs: [ 0.009  0.831 -0.082 -1.345] -> Action: 0 and reward: 1.0
Obs: [ 0.026  0.637 -0.109 -1.079] -> Action: 0 and reward: 1.0
Obs: [ 0.038  0.443 -0.13  -0.823] -> Action: 0 and reward: 1.0
Obs: [ 0.047  0.25  -

Obs: [-0.034 -0.4    0.049  0.67 ] -> Action: 1 and reward: 1.0
Obs: [-0.042 -0.206  0.062  0.393] -> Action: 0 and reward: 1.0
Obs: [-0.046 -0.402  0.07   0.705] -> Action: 1 and reward: 1.0
Obs: [-0.054 -0.208  0.084  0.435] -> Action: 1 and reward: 1.0
Obs: [-0.058 -0.014  0.093  0.17 ] -> Action: 0 and reward: 1.0
Obs: [-0.059 -0.21   0.096  0.491] -> Action: 1 and reward: 1.0
Obs: [-0.063 -0.016  0.106  0.23 ] -> Action: 1 and reward: 1.0
Obs: [-0.063  0.177  0.111 -0.028] -> Action: 1 and reward: 1.0
Obs: [-0.06   0.37   0.11  -0.283] -> Action: 1 and reward: 1.0
Obs: [-0.052  0.564  0.104 -0.539] -> Action: 1 and reward: 1.0
Obs: [-0.041  0.757  0.094 -0.797] -> Action: 1 and reward: 1.0
Obs: [-0.026  0.951  0.078 -1.059] -> Action: 0 and reward: 1.0
Obs: [-0.007  0.755  0.057 -0.743] -> Action: 1 and reward: 1.0
Obs: [ 0.008  0.949  0.042 -1.018] -> Action: 0 and reward: 1.0
Obs: [ 0.027  0.754  0.021 -0.712] -> Action: 1 and reward: 1.0
Obs: [ 0.042  0.948  0.007 -0.998] -> Ac
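
The loops in this section repeat the same steps for every episode, so they can be wrapped in a helper. The sketch below assumes the classic Gym interface used in this notebook (`reset()` returns an observation and `step()` returns four values). The names `run_episode`, `CountdownEnv` and `random_policy` are hypothetical and exist only so that the example runs without Gym installed.

```python
import random

class CountdownEnv:
    """Hypothetical stand-in environment: the episode ends after `length` steps."""

    def __init__(self, length=5):
        self.length = length
        self.remaining = length

    def reset(self):
        self.remaining = self.length
        return self.remaining                 # initial observation

    def step(self, action):
        self.remaining -= 1
        done = self.remaining == 0
        return self.remaining, 1.0, done, {}  # obs, reward, done, info

def run_episode(env, policy, max_steps=1000):
    """Run one episode and return (number of steps, accumulated reward)."""
    obs = env.reset()
    t, total_reward, done = 0, 0.0, False
    while not done and t < max_steps:
        action = policy(obs)
        obs, reward, done, info = env.step(action)
        total_reward += reward                # each reward is counted exactly once
        t += 1
    return t, total_reward

def random_policy(obs):
    """Pick one of two actions at random, mirroring env.action_space.sample()."""
    return random.choice([0, 1])

steps, total = run_episode(CountdownEnv(length=5), random_policy)
print(steps, total)  # 5 5.0
```

With a real environment the call would look like `run_episode(env, lambda obs: env.action_space.sample())`.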

## 2. Frozen Lake
In this second example we will load the FrozenLake environment and again run some tests.

### 2.1. Loading the environment

As in the first example, the following code imports the packages needed, creates the environment with the `make` method and prints the action space (four actions: 0 = left, 1 = down, 2 = right and 3 = up), the observation space (an integer from 0 to 15 that indicates the agent's position in the grid) and the range of the reward variable (0 for every action except the one that reaches the goal cell, for which the reward is 1).

In [6]:
import time

env = gym.make('FrozenLake-v0')
print("Action space is {} ".format(env.action_space))
print("Observation space is {} ".format(env.observation_space))
print("Reward range is {} ".format(env.reward_range))

Action space is Discrete(4) 
Observation space is Discrete(16) 
Reward range is (0, 1) 
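
Since the observation is a single integer, it can help to translate it back into grid coordinates. This is a minimal sketch, assuming the default 4x4 map numbered row by row from the top-left corner and the FrozenLake action codes 0 = left, 1 = down, 2 = right, 3 = up. The names `state_to_cell`, `cell_to_state` and `ACTION_NAMES` are illustrative and not part of Gym.

```python
N_COLS = 4  # width of the default 4x4 FrozenLake map (assumed)

# Action codes used by FrozenLake.
ACTION_NAMES = {0: "Left", 1: "Down", 2: "Right", 3: "Up"}

def state_to_cell(state, n_cols=N_COLS):
    """Convert a state index (0..15) into a (row, column) pair."""
    return divmod(state, n_cols)

def cell_to_state(row, col, n_cols=N_COLS):
    """Inverse mapping: (row, column) back to the state index."""
    return row * n_cols + col

print(state_to_cell(0))    # (0, 0): start cell S
print(state_to_cell(15))   # (3, 3): goal cell G
print(state_to_cell(5))    # (1, 1): a hole in the default map
print(ACTION_NAMES[2])     # Right
```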


### 2.2. Running an episode

Next, we will run one episode of the FrozenLake environment using an agent that selects its actions at random.

In the following code we reset the environment, set the maximum number of steps per episode (`max_steps`) and run one episode, which ends when the variable `done` becomes `True` or when the maximum number of steps is reached. Again, we use an agent that follows a completely random policy (`env.action_space.sample()`). The `env.render()` method lets us follow the agent's progress through the environment from the start cell S until it reaches the goal cell G or falls into a hole H.

In [7]:
# Environment reset
obs = env.reset()
t, total_reward, done = 0, 0, False
max_steps = 100

# Render the environment
env.render()
print('')
time.sleep(0.1)

while t < max_steps:
    # Get random action (this is the implementation of the agent)
    action = env.action_space.sample()
    
    # Execute action and get response
    obs, reward, done, info = env.step(action)
    
    # Render the environment
    env.render()
    print('')
        
    t += 1
    if done:
        break
    time.sleep(0.1)

print("Episode finished after {} timesteps and reward was {} ".format(t, reward))
env.close()


[S]FFF
FHFH
FFFH
HFFG

  (Down)
[S]FFF
FHFH
FFFH
HFFG

  (Right)
SFFF
[F]HFH
FFFH
HFFG

  (Right)
SFFF
FHFH
[F]FFH
HFFG

  (Left)
SFFF
FHFH
FFFH
[H]FFG

Episode finished after 4 timesteps and reward was 0.0 
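
In the run above the agent does not always move in the direction shown in parentheses: for instance, the first `(Down)` leaves it on the start cell. This is consistent with the default slippery dynamics of FrozenLake, where the chosen action and its two perpendicular directions are each taken with probability 1/3. The sketch below enumerates those outcomes for the default 4x4 map. `move` and `slippery_outcomes` are hypothetical helpers written for this illustration, not Gym functions.

```python
N_ROWS, N_COLS = 4, 4                 # default 4x4 map (assumed)
LEFT, DOWN, RIGHT, UP = 0, 1, 2, 3    # FrozenLake action codes

def move(state, action):
    """Deterministic move on the grid; moving into a wall leaves the agent in place."""
    row, col = divmod(state, N_COLS)
    if action == LEFT:
        col = max(col - 1, 0)
    elif action == DOWN:
        row = min(row + 1, N_ROWS - 1)
    elif action == RIGHT:
        col = min(col + 1, N_COLS - 1)
    elif action == UP:
        row = max(row - 1, 0)
    return row * N_COLS + col

def slippery_outcomes(state, action):
    """Next states reachable on slippery ice, each with probability 1/3."""
    candidates = [(action - 1) % 4, action, (action + 1) % 4]
    return [move(state, a) for a in candidates]

# Choosing Down on the start cell can slip Left (stay on 0), go Down (4) or slip Right (1).
print(slippery_outcomes(0, DOWN))   # [0, 4, 1]
# Choosing Left on state 8 can slip Up (4), stay on 8 against the wall,
# or slip Down into the hole at 12.
print(slippery_outcomes(8, LEFT))   # [4, 8, 12]
```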


### 2.3. Simulating several episodes

The following code fragment repeats the process of the previous section for the number of episodes defined in the variable `num_episodes`.

In [8]:
num_episodes = 10

for episode in range(num_episodes):

    # Environment reset
    obs = env.reset()
    t, done = 0, False
    
    print('Running episode {} '.format(episode+1))

    # Render the environment
    env.render()
    print('')
    time.sleep(0.1)
    
    while t < max_steps:
        # Get random action (this is the implementation of the agent)
        action = env.action_space.sample()
    
        # Execute action and get response
        obs, reward, done, info = env.step(action)
        
        # Render the environment
        env.render()
        print('')
        
        t += 1
        if done:
            break
        time.sleep(0.1)
      
    print("Episode {} finished after {} timesteps and reward was {} ".format(episode+1, t, reward))
    print('')

Running episode 1 

[S]FFF
FHFH
FFFH
HFFG

  (Right)
S[F]FF
FHFH
FFFH
HFFG

  (Left)
SFFF
F[H]FH
FFFH
HFFG

Episode 1 finished after 2 timesteps and reward was 0.0 

Running episode 2 

[S]FFF
FHFH
FFFH
HFFG

  (Left)
[S]FFF
FHFH
FFFH
HFFG

  (Down)
SFFF
[F]HFH
FFFH
HFFG

  (Up)
SFFF
F[H]FH
FFFH
HFFG

Episode 2 finished after 3 timesteps and reward was 0.0 

Running episode 3 

[S]FFF
FHFH
FFFH
HFFG

  (Left)
[S]FFF
FHFH
FFFH
HFFG

  (Left)
[S]FFF
FHFH
FFFH
HFFG

  (Down)
SFFF
[F]HFH
FFFH
HFFG

  (Left)
[S]FFF
FHFH
FFFH
HFFG

  (Left)
SFFF
[F]HFH
FFFH
HFFG

  (Up)
SFFF
[F]HFH
FFFH
HFFG

  (Down)
SFFF
FHFH
[F]FFH
HFFG

  (Up)
SFFF
FHFH
[F]FFH
HFFG

  (Left)
SFFF
FHFH
[F]FFH
HFFG

  (Up)
SFFF
FHFH
[F]FFH
HFFG

  (Down)
SFFF
FHFH
[F]FFH
HFFG

  (Left)
SFFF
[F]HFH
FFFH
HFFG

  (Right)
SFFF
FHFH
[F]FFH
HFFG

  (Down)
SFFF
FHFH
F[F]FH
HFFG


### 2.4. Computing the total reward over several episodes

To measure how well the agent performs, we can compute the total reward over several episodes. Since the accumulated reward of an episode is 0 if the goal cell is not reached and 1 if it is, the total reward over a number of episodes, divided by that number, gives the agent's success rate.

The following code fragment repeats the process of the previous section for the number of episodes defined in the variable `num_episodes` and computes the agent's success rate. Rendering is omitted to speed up execution.

In [9]:
num_episodes = 1000
total_reward = 0

for episode in range(num_episodes):

    # Environment reset
    obs = env.reset()
    t, done = 0, False
    
    #env.render() --- Uncomment if you want to see the path of the agent

    while t < max_steps:
        # Get random action (this is the implementation of the agent)
        action = env.action_space.sample()
    
        # Execute action and get response
        obs, reward, done, info = env.step(action)
        
        # Render the environment
        #env.render() --- Uncomment if you want to see the path of the agent
        
        total_reward += reward
        t += 1
        if done:
            break
    
success_rate = total_reward*100/num_episodes
print("{} successes in {} episodes: {} % of success".format(total_reward, num_episodes, success_rate))

11.0 successes in 1000 episodes: 1.1 % of success
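
Because the success rate is estimated from a finite number of random episodes, it varies from run to run. As a rough guide to that variability, the sketch below computes the rate together with a binomial standard error. `success_rate_with_error` is a hypothetical helper, and the normal approximation behind the standard error is only approximate for rates this close to zero.

```python
import math

def success_rate_with_error(successes, episodes):
    """Return (rate, standard error), both in percent, for a binomial proportion."""
    p = successes / episodes
    std_err = math.sqrt(p * (1 - p) / episodes)
    return 100 * p, 100 * std_err

# Figures from the run above: 11 successes in 1000 episodes.
rate, err = success_rate_with_error(11, 1000)
print("{:.1f} % +/- {:.1f} %".format(rate, err))   # 1.1 % +/- 0.3 %
```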


### 2.5. Training an agent

As we saw in the previous section, an agent that picks its actions at random almost never reaches the goal cell G (the success rate is around 1 % or 2 %). We will now train an agent with the Q-learning method. This method, which will be studied in later modules, can be implemented with a table that is updated as the agent interacts with the environment.
The following code implements this method and trains the agent by running a large number of episodes.

__Note__: remember that the simulations have a random component, so the percentages may vary from one run to another.

We start by importing some packages:

In [10]:
import pickle

We initialise some variables of the method we want to implement, including the number of episodes (`num_episodes`) and the maximum number of steps per episode (`max_steps`).

In [11]:
epsilon = 0.9
num_episodes = 100000
max_steps = 100

learning_rate = 0.81
gamma = 0.96
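
In the agent defined later in this section, `epsilon` is the probability of taking a random action instead of the best one in the table, so 0.9 means that roughly nine out of ten training steps are exploratory. Here is a small sketch of that rule using only the standard library. The function `epsilon_greedy`, the sample row of values and the fixed seed are assumptions made for the illustration.

```python
import random

def epsilon_greedy(q_row, epsilon, rng):
    """Return (action, explored) for one row of Q-values."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_row)), True    # explore: uniform random action
    best = max(range(len(q_row)), key=q_row.__getitem__)
    return best, False                            # exploit: greedy action

rng = random.Random(0)          # fixed seed so the sketch is repeatable
q_row = [0.1, 0.5, 0.2, 0.0]    # illustrative values, not taken from the notebook
trials = 10000
explored = sum(epsilon_greedy(q_row, 0.9, rng)[1] for _ in range(trials))
print("explored in {:.1f} % of steps".format(100 * explored / trials))
```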

We initialise to zero every value of the Q-function table (sixteen states by four actions per state), which will end up telling us which action is best in each state.

In [12]:
Q = np.zeros((env.observation_space.n, env.action_space.n))
print(Q)

[[0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]]


The following code defines the functions that characterise the agent (they will be studied in later modules of this course).

In [13]:
def choose_action(state):
    # Epsilon-greedy policy: explore with probability epsilon, otherwise exploit
    if np.random.uniform(0, 1) < epsilon:
        action = env.action_space.sample()
    else:
        action = np.argmax(Q[state, :])
    return action

def learn(state, new_state, reward, action):
    # Q-learning update: move Q[state, action] towards the bootstrapped target
    predict = Q[state, action]
    target = reward + gamma * np.max(Q[new_state, :])
    Q[state, action] = Q[state, action] + learning_rate * (target - predict)
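
To make the update in `learn` concrete, the sketch below applies the same rule by hand to two transitions, using plain Python lists instead of the NumPy table. The helper name `q_update` and the example transitions are assumptions for illustration.

```python
learning_rate = 0.81
gamma = 0.96

def q_update(q, state, action, reward, new_state):
    """Apply one Q-learning update in place and return the new value."""
    target = reward + gamma * max(q[new_state])     # best value reachable next
    q[state][action] += learning_rate * (target - q[state][action])
    return q[state][action]

# 16 states x 4 actions, all zeros, as in the table initialised above.
q = [[0.0] * 4 for _ in range(16)]

# Moving Right (2) from state 14 into the goal (15) with reward 1:
print(q_update(q, 14, 2, 1.0, 15))              # 0.81

# Moving Right (2) from state 13 to state 14 with reward 0 then propagates
# part of that value one cell back: 0.81 * 0.96 * 0.81
print(round(q_update(q, 13, 2, 0.0, 14), 6))    # 0.629856
```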

The following code plays as many games as indicated in the variable `num_episodes`. In each game (episode), the agent interacts with the environment and, as a result, updates the values of the _Q_ table. The call to `env.render()` is commented out to avoid flooding the screen. In addition, the code prints the episodes in which the agent reaches the goal cell.

In [14]:
# Start
for episode in range(num_episodes):
    state = env.reset()
    t = 0
    
    while t < max_steps:
        #env.render() --- Uncomment if you want to see the path of the agent
        action = choose_action(state)  
        state2, reward, done, info = env.step(action)  
        learn(state, state2, reward, action)
        state = state2
        t += 1
       
        if done:
            break

    if reward == 1:
        print("Episode {} finished after {} timesteps and reward was {} ".format(episode+1, t, reward)) 

Episode 28 finished after 11 timesteps and reward was 1.0 
Episode 81 finished after 24 timesteps and reward was 1.0 
Episode 250 finished after 15 timesteps and reward was 1.0 
Episode 290 finished after 10 timesteps and reward was 1.0 
Episode 321 finished after 7 timesteps and reward was 1.0 
Episode 501 finished after 12 timesteps and reward was 1.0 
Episode 506 finished after 10 timesteps and reward was 1.0 
Episode 606 finished after 17 timesteps and reward was 1.0 
Episode 750 finished after 22 timesteps and reward was 1.0 
Episode 763 finished after 8 timesteps and reward was 1.0 
Episode 780 finished after 8 timesteps and reward was 1.0 
Episode 812 finished after 13 timesteps and reward was 1.0 
Episode 976 finished after 8 timesteps and reward was 1.0 
Episode 1012 finished after 19 timesteps and reward was 1.0 
Episode 1120 finished after 16 timesteps and reward was 1.0 
Episode 1211 finished after 10 timesteps and reward was 1.0 
Episode 1227 finished after 13 timesteps an

Episode 9290 finished after 12 timesteps and reward was 1.0 
Episode 9366 finished after 11 timesteps and reward was 1.0 
Episode 9487 finished after 8 timesteps and reward was 1.0 
Episode 9638 finished after 24 timesteps and reward was 1.0 
Episode 9670 finished after 8 timesteps and reward was 1.0 
Episode 9777 finished after 18 timesteps and reward was 1.0 
Episode 9890 finished after 39 timesteps and reward was 1.0 
Episode 9916 finished after 15 timesteps and reward was 1.0 
Episode 9934 finished after 17 timesteps and reward was 1.0 
Episode 9979 finished after 21 timesteps and reward was 1.0 
Episode 10095 finished after 12 timesteps and reward was 1.0 
Episode 10130 finished after 21 timesteps and reward was 1.0 
Episode 10131 finished after 31 timesteps and reward was 1.0 
Episode 10202 finished after 9 timesteps and reward was 1.0 
Episode 10222 finished after 9 timesteps and reward was 1.0 
Episode 10329 finished after 9 timesteps and reward was 1.0 
Episode 10389 finished 

Episode 17393 finished after 13 timesteps and reward was 1.0 
Episode 17491 finished after 8 timesteps and reward was 1.0 
Episode 17496 finished after 11 timesteps and reward was 1.0 
Episode 17500 finished after 7 timesteps and reward was 1.0 
Episode 17557 finished after 20 timesteps and reward was 1.0 
Episode 17571 finished after 14 timesteps and reward was 1.0 
Episode 17700 finished after 12 timesteps and reward was 1.0 
Episode 17933 finished after 10 timesteps and reward was 1.0 
Episode 18154 finished after 21 timesteps and reward was 1.0 
Episode 18177 finished after 11 timesteps and reward was 1.0 
Episode 18200 finished after 11 timesteps and reward was 1.0 
Episode 18369 finished after 10 timesteps and reward was 1.0 
Episode 18372 finished after 23 timesteps and reward was 1.0 
Episode 18387 finished after 17 timesteps and reward was 1.0 
Episode 18428 finished after 12 timesteps and reward was 1.0 
Episode 18436 finished after 8 timesteps and reward was 1.0 
Episode 184

Episode 25531 finished after 8 timesteps and reward was 1.0 
Episode 25583 finished after 14 timesteps and reward was 1.0 
Episode 25664 finished after 9 timesteps and reward was 1.0 
Episode 25686 finished after 13 timesteps and reward was 1.0 
Episode 25723 finished after 12 timesteps and reward was 1.0 
Episode 25784 finished after 13 timesteps and reward was 1.0 
Episode 25858 finished after 25 timesteps and reward was 1.0 
Episode 25866 finished after 8 timesteps and reward was 1.0 
Episode 25916 finished after 13 timesteps and reward was 1.0 
Episode 25926 finished after 9 timesteps and reward was 1.0 
Episode 25963 finished after 8 timesteps and reward was 1.0 
Episode 26096 finished after 7 timesteps and reward was 1.0 
Episode 26110 finished after 13 timesteps and reward was 1.0 
Episode 26140 finished after 9 timesteps and reward was 1.0 
Episode 26198 finished after 7 timesteps and reward was 1.0 
Episode 26253 finished after 10 timesteps and reward was 1.0 
Episode 26262 fi

Episode 34487 finished after 21 timesteps and reward was 1.0 
Episode 34564 finished after 12 timesteps and reward was 1.0 
Episode 34587 finished after 9 timesteps and reward was 1.0 
Episode 34603 finished after 6 timesteps and reward was 1.0 
Episode 34644 finished after 18 timesteps and reward was 1.0 
Episode 34701 finished after 7 timesteps and reward was 1.0 
Episode 34787 finished after 11 timesteps and reward was 1.0 
Episode 34843 finished after 16 timesteps and reward was 1.0 
Episode 34893 finished after 6 timesteps and reward was 1.0 
Episode 34897 finished after 11 timesteps and reward was 1.0 
Episode 34898 finished after 7 timesteps and reward was 1.0 
Episode 34906 finished after 20 timesteps and reward was 1.0 
Episode 34941 finished after 12 timesteps and reward was 1.0 
Episode 35014 finished after 6 timesteps and reward was 1.0 
Episode 35079 finished after 8 timesteps and reward was 1.0 
Episode 35129 finished after 7 timesteps and reward was 1.0 
Episode 35144 fi

Episode 41719 finished after 7 timesteps and reward was 1.0 
Episode 41760 finished after 8 timesteps and reward was 1.0 
Episode 41878 finished after 15 timesteps and reward was 1.0 
Episode 41902 finished after 12 timesteps and reward was 1.0 
Episode 42047 finished after 6 timesteps and reward was 1.0 
Episode 42158 finished after 9 timesteps and reward was 1.0 
Episode 42480 finished after 10 timesteps and reward was 1.0 
Episode 42529 finished after 10 timesteps and reward was 1.0 
Episode 42598 finished after 6 timesteps and reward was 1.0 
Episode 42725 finished after 9 timesteps and reward was 1.0 
Episode 42867 finished after 10 timesteps and reward was 1.0 
Episode 42916 finished after 26 timesteps and reward was 1.0 
Episode 42924 finished after 10 timesteps and reward was 1.0 
Episode 42973 finished after 16 timesteps and reward was 1.0 
Episode 43011 finished after 7 timesteps and reward was 1.0 
Episode 43029 finished after 8 timesteps and reward was 1.0 
Episode 43077 fi

Episode 50310 finished after 25 timesteps and reward was 1.0 
Episode 50392 finished after 10 timesteps and reward was 1.0 
Episode 50446 finished after 33 timesteps and reward was 1.0 
Episode 50475 finished after 19 timesteps and reward was 1.0 
Episode 50483 finished after 9 timesteps and reward was 1.0 
Episode 50539 finished after 9 timesteps and reward was 1.0 
Episode 50543 finished after 7 timesteps and reward was 1.0 
Episode 50569 finished after 9 timesteps and reward was 1.0 
Episode 50580 finished after 13 timesteps and reward was 1.0 
Episode 50749 finished after 11 timesteps and reward was 1.0 
Episode 50819 finished after 9 timesteps and reward was 1.0 
Episode 50884 finished after 7 timesteps and reward was 1.0 
Episode 50975 finished after 12 timesteps and reward was 1.0 
Episode 51004 finished after 8 timesteps and reward was 1.0 
Episode 51069 finished after 6 timesteps and reward was 1.0 
Episode 51130 finished after 12 timesteps and reward was 1.0 
Episode 51217 fi

Episode 58468 finished after 17 timesteps and reward was 1.0 
Episode 58507 finished after 11 timesteps and reward was 1.0 
Episode 58625 finished after 12 timesteps and reward was 1.0 
Episode 58660 finished after 8 timesteps and reward was 1.0 
Episode 58667 finished after 14 timesteps and reward was 1.0 
Episode 58674 finished after 20 timesteps and reward was 1.0 
Episode 58787 finished after 8 timesteps and reward was 1.0 
Episode 58796 finished after 17 timesteps and reward was 1.0 
Episode 58810 finished after 23 timesteps and reward was 1.0 
Episode 58829 finished after 8 timesteps and reward was 1.0 
Episode 58873 finished after 12 timesteps and reward was 1.0 
Episode 58958 finished after 6 timesteps and reward was 1.0 
Episode 59032 finished after 16 timesteps and reward was 1.0 
Episode 59051 finished after 8 timesteps and reward was 1.0 
Episode 59099 finished after 9 timesteps and reward was 1.0 
Episode 59107 finished after 28 timesteps and reward was 1.0 
Episode 59170 

Episode 66363 finished after 19 timesteps and reward was 1.0 
Episode 66529 finished after 26 timesteps and reward was 1.0 
Episode 66608 finished after 18 timesteps and reward was 1.0 
Episode 66633 finished after 11 timesteps and reward was 1.0 
Episode 66636 finished after 12 timesteps and reward was 1.0 
Episode 66643 finished after 11 timesteps and reward was 1.0 
Episode 66655 finished after 13 timesteps and reward was 1.0 
Episode 66762 finished after 12 timesteps and reward was 1.0 
Episode 66788 finished after 14 timesteps and reward was 1.0 
Episode 66809 finished after 8 timesteps and reward was 1.0 
Episode 66871 finished after 12 timesteps and reward was 1.0 
Episode 66884 finished after 15 timesteps and reward was 1.0 
Episode 66891 finished after 13 timesteps and reward was 1.0 
Episode 66935 finished after 17 timesteps and reward was 1.0 
Episode 66953 finished after 21 timesteps and reward was 1.0 
Episode 67032 finished after 31 timesteps and reward was 1.0 
Episode 6

Episode 73524 finished after 22 timesteps and reward was 1.0 
Episode 73530 finished after 15 timesteps and reward was 1.0 
Episode 73761 finished after 9 timesteps and reward was 1.0 
Episode 73867 finished after 6 timesteps and reward was 1.0 
Episode 73909 finished after 17 timesteps and reward was 1.0 
Episode 73929 finished after 7 timesteps and reward was 1.0 
Episode 73954 finished after 12 timesteps and reward was 1.0 
Episode 74014 finished after 7 timesteps and reward was 1.0 
Episode 74093 finished after 17 timesteps and reward was 1.0 
Episode 74145 finished after 8 timesteps and reward was 1.0 
Episode 74192 finished after 8 timesteps and reward was 1.0 
Episode 74250 finished after 11 timesteps and reward was 1.0 
Episode 74281 finished after 8 timesteps and reward was 1.0 
Episode 74287 finished after 16 timesteps and reward was 1.0 
Episode 74597 finished after 12 timesteps and reward was 1.0 
Episode 74695 finished after 27 timesteps and reward was 1.0 
Episode 74733 f

Episode 80989 finished after 16 timesteps and reward was 1.0 
Episode 80992 finished after 10 timesteps and reward was 1.0 
Episode 81019 finished after 8 timesteps and reward was 1.0 
Episode 81026 finished after 21 timesteps and reward was 1.0 
Episode 81038 finished after 10 timesteps and reward was 1.0 
Episode 81224 finished after 8 timesteps and reward was 1.0 
Episode 81243 finished after 15 timesteps and reward was 1.0 
Episode 81288 finished after 14 timesteps and reward was 1.0 
Episode 81347 finished after 14 timesteps and reward was 1.0 
Episode 81369 finished after 12 timesteps and reward was 1.0 
Episode 81782 finished after 19 timesteps and reward was 1.0 
Episode 81864 finished after 26 timesteps and reward was 1.0 
Episode 81969 finished after 14 timesteps and reward was 1.0 
Episode 82094 finished after 19 timesteps and reward was 1.0 
Episode 82100 finished after 19 timesteps and reward was 1.0 
Episode 82160 finished after 16 timesteps and reward was 1.0 
Episode 82

Episode 89635 finished after 10 timesteps and reward was 1.0 
Episode 89638 finished after 11 timesteps and reward was 1.0 
Episode 89666 finished after 16 timesteps and reward was 1.0 
Episode 89680 finished after 12 timesteps and reward was 1.0 
Episode 89783 finished after 20 timesteps and reward was 1.0 
Episode 89819 finished after 8 timesteps and reward was 1.0 
Episode 89959 finished after 22 timesteps and reward was 1.0 
Episode 90098 finished after 28 timesteps and reward was 1.0 
Episode 90234 finished after 14 timesteps and reward was 1.0 
Episode 90461 finished after 8 timesteps and reward was 1.0 
Episode 90479 finished after 9 timesteps and reward was 1.0 
Episode 90483 finished after 7 timesteps and reward was 1.0 
Episode 90484 finished after 6 timesteps and reward was 1.0 
Episode 90573 finished after 8 timesteps and reward was 1.0 
Episode 90601 finished after 11 timesteps and reward was 1.0 
Episode 90621 finished after 12 timesteps and reward was 1.0 
Episode 90691 

Episode 97474 finished after 18 timesteps and reward was 1.0 
Episode 97498 finished after 22 timesteps and reward was 1.0 
Episode 97513 finished after 10 timesteps and reward was 1.0 
Episode 97538 finished after 14 timesteps and reward was 1.0 
Episode 97545 finished after 9 timesteps and reward was 1.0 
Episode 97629 finished after 13 timesteps and reward was 1.0 
Episode 97710 finished after 15 timesteps and reward was 1.0 
Episode 97764 finished after 10 timesteps and reward was 1.0 
Episode 97851 finished after 6 timesteps and reward was 1.0 
Episode 97909 finished after 8 timesteps and reward was 1.0 
Episode 98002 finished after 36 timesteps and reward was 1.0 
Episode 98037 finished after 19 timesteps and reward was 1.0 
Episode 98044 finished after 8 timesteps and reward was 1.0 
Episode 98069 finished after 8 timesteps and reward was 1.0 
Episode 98112 finished after 11 timesteps and reward was 1.0 
Episode 98136 finished after 9 timesteps and reward was 1.0 
Episode 98156 
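
Printing every successful episode produces a very long log. One alternative, sketched below under the assumption that each episode's final reward has been appended to a list, is to report the number of successes per block of episodes. `successes_per_block` is a hypothetical helper and the sample list is made up purely to show the output format.

```python
def successes_per_block(final_rewards, block_size):
    """Count successful episodes (final reward 1) in consecutive blocks."""
    counts = []
    for start in range(0, len(final_rewards), block_size):
        block = final_rewards[start:start + block_size]
        counts.append(sum(1 for r in block if r == 1))
    return counts

# Made-up final rewards for 10 episodes, only to show the output format.
sample = [0, 0, 1, 0, 0, 0, 1, 1, 0, 1]
print(successes_per_block(sample, block_size=5))   # [1, 3]
```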

We can now inspect the final values of the _Q_ table after training.

In [15]:
print(Q)

[[0.59443218 0.71169225 0.57442415 0.59972985]
 [0.54378006 0.68181537 0.63588329 0.76263833]
 [0.73044534 0.63939679 0.75364971 0.71064473]
 [0.0200028  0.02034774 0.49212556 0.50702884]
 [0.57064149 0.08786751 0.66042586 0.52530131]
 [0.         0.         0.         0.        ]
 [0.76854039 0.02456945 0.03168803 0.57726613]
 [0.         0.         0.         0.        ]
 [0.5775804  0.57772213 0.02934505 0.80989684]
 [0.76899126 0.87265298 0.85835759 0.79383454]
 [0.85274067 0.84953287 0.75282606 0.00807286]
 [0.         0.         0.         0.        ]
 [0.         0.         0.         0.        ]
 [0.85283394 0.17923375 0.91565933 0.86023702]
 [0.92092484 0.90454517 0.96007081 0.9765358 ]
 [0.         0.         0.         0.        ]]
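
One way to read this table is to take, for each state, the action with the largest value. The sketch below does that for the values printed above (rounded to three decimals) and lays the result out on the 4x4 grid, assuming the action codes 0 = left, 1 = down, 2 = right, 3 = up. Rows that are all zero correspond to the holes and the goal, which are never updated because the episode ends there, and are shown as a dot. `greedy_policy` and `ARROWS` are names introduced for this illustration.

```python
# Q-values printed above, rounded to three decimals.
Q_TABLE = [
    [0.594, 0.712, 0.574, 0.600],
    [0.544, 0.682, 0.636, 0.763],
    [0.730, 0.639, 0.754, 0.711],
    [0.020, 0.020, 0.492, 0.507],
    [0.571, 0.088, 0.660, 0.525],
    [0.0, 0.0, 0.0, 0.0],
    [0.769, 0.025, 0.032, 0.577],
    [0.0, 0.0, 0.0, 0.0],
    [0.578, 0.578, 0.029, 0.810],
    [0.769, 0.873, 0.858, 0.794],
    [0.853, 0.850, 0.753, 0.008],
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [0.853, 0.179, 0.916, 0.860],
    [0.921, 0.905, 0.960, 0.977],
    [0.0, 0.0, 0.0, 0.0],
]
ARROWS = "<v>^"   # index = action code: left, down, right, up

def greedy_policy(q_table):
    """Return the best action per state, or None for all-zero rows."""
    policy = []
    for row in q_table:
        if max(row) == 0.0:
            policy.append(None)
        else:
            policy.append(max(range(len(row)), key=row.__getitem__))
    return policy

policy = greedy_policy(Q_TABLE)
for r in range(4):
    cells = policy[4 * r:4 * r + 4]
    print("".join("." if a is None else ARROWS[a] for a in cells))
```

On slippery ice the preferred action does not necessarily point towards the goal, since the agent may do better by steering away from neighbouring holes.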


### 2.6. Checking the improvement
In this last section we will check that the trained agent performs better than the random agent.

The code is very similar to the one used to train the agent, but the learning step is omitted. We will simulate several episodes using the values of the _Q_ table obtained during training. Specifically, in each state the agent selects the action with the highest value in the _Q_ table:

In [16]:
def choose_action_max(state):
    action = np.argmax(Q[state, :])
    return action

Again, we add up the reward over several episodes and compute the success rate which, as can be seen, is higher than that of the random agent.

The code also makes it possible to visualise (in a different way from the one used so far) the last episodes of the simulation, whose number is set in the variable `num_shows`.

In [17]:
from IPython.display import clear_output

num_episodes = 1000
total_reward = 0
num_shows = 5
show_episode = False

# start
for episode in range(num_episodes):

    if (num_episodes - episode) <= num_shows:
        show_episode = True
        
    state = env.reset()
    
    if show_episode == True:
        print('')
        print('')
        print("*** Episode: ", episode+1)
        print('')
        print('')
        time.sleep(0.8)
        clear_output(wait=True)
        env.render()
    
    t = 0
    while t < max_steps:
        action = choose_action_max(state)  
        state, reward, done, info = env.step(action)  
        t += 1  # count the step so the loop is bounded by max_steps
        
        if show_episode == True:
            time.sleep(0.5)
            clear_output(wait=True)
            env.render()
        if done:
            break

    if show_episode == True:
        time.sleep(0.8)
        clear_output(wait=True)
        print('')
        print('')
        print('Reward = {}'.format(reward))
        print('')
        print('')
        time.sleep(0.8)
        clear_output(wait=True)
    
    total_reward += reward
    
success_rate = total_reward*100/num_episodes
print("{} successes in {} episodes: {} % of success".format(total_reward, num_episodes, success_rate))

176.0 successes in 1000 episodes: 17.6 % of success
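
To put the two results side by side, the sketch below compares the success count of the random agent (11 of 1000 episodes) with that of the trained agent (176 of 1000) and computes a two-proportion z statistic as a rough check that the gap is larger than run-to-run noise. The helper name `two_proportion_z` is an assumption, and the figures are the ones printed in this notebook.

```python
import math

def two_proportion_z(successes_a, n_a, successes_b, n_b):
    """z statistic for the difference between two success rates (pooled variance)."""
    p_a, p_b = successes_a / n_a, successes_b / n_b
    pooled = (successes_a + successes_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return (p_b - p_a) / std_err

random_successes, trained_successes, episodes = 11, 176, 1000

factor = trained_successes / random_successes
z = two_proportion_z(random_successes, episodes, trained_successes, episodes)
print("improvement factor: {:.0f}x".format(factor))   # improvement factor: 16x
print("z = {:.1f}".format(z))
```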
