### Visual environment setup

In [13]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [14]:
!pip install gym pyvirtualdisplay > /dev/null 2>&1
!apt-get update > /dev/null 2>&1
!apt-get install -y xvfb python-opengl ffmpeg > /dev/null 2>&1
!apt-get install -y cmake > /dev/null 2>&1
!pip install --upgrade setuptools 2>&1
!pip install ez_setup > /dev/null 2>&1
!pip install gym[atari] > /dev/null 2>&1



In [15]:
import gym
from gym.wrappers import Monitor
import glob
import io
import base64
from IPython.display import HTML
from pyvirtualdisplay import Display
from IPython import display as ipythondisplay

display = Display(visible=0, size=(1400, 900))
display.start()

"""
Utility functions to enable video recording of a gym environment
and to display the recording.
To enable video, call "env = wrap_env(env, modelo)".
"""
# Path where the videos are saved
video_path = '/content/drive/MyDrive/MIOTI/RL/SESION_4/video/'

def show_video(video_path):
  mp4list = glob.glob(video_path+'*.mp4')
  if len(mp4list) > 0:
    mp4 = mp4list[0]
    video = io.open(mp4, 'r+b').read()
    encoded = base64.b64encode(video)
    ipythondisplay.display(HTML(data='''<video alt="test" autoplay 
                loop controls style="height: 400px;">
                <source src="data:video/mp4;base64,{0}" type="video/mp4" />
             </video>'''.format(encoded.decode('ascii'))))
  else: 
    print("Could not find video")
    

def wrap_env(env,modelo):
  env = Monitor(env, video_path+modelo, force=True)
  return env

![image.png](attachment:image.png)
<center style="color:#888">Advanced Data Science Module<br/>Reinforcement Learning Course</center>


# S4. Challenge. Deep Q-Network Implementation

We are going to implement DQN, an approximate Q-learning algorithm, using *experience replay* and *target networks*. We will use the `keras-rl2` library, which offers a high-level abstraction for training Deep Learning models on RL environments. **Start by installing the library indicated, and be careful not to install keras-rl**, because it may not work with TensorFlow 2.

If you need to consult the library documentation, see: *https://keras-rl.readthedocs.io/en/latest/*

**Don't forget to answer the questions at the end**

In [16]:
!pip install keras-rl2 ## YOUR CODE HERE ##



We saw that with the network implemented in class the agent learns, but it seems biased toward drifting to one side. Now it is your turn to design a more efficient network. It is a very simple network: apart from the initial Flatten layer, 4 more dense layers are enough. I used 658 parameters for the agent to learn.

In this example we do not pass the network the observation as the image we see, but as a list containing: *[position of cart, velocity of cart, angle of pole, rotation rate of pole]*. For more information about the game, see the description here: *https://github.com/openai/gym/blob/master/gym/envs/classic_control/cartpole.py#L16*

As you will see, you have to define the variable *window_length*; review the concept of **stacking** in the S4 Worksheet. If we do not pass the observation to this network as a frame (image), think: what is the minimum this *window_length* could be?
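
To make the stacking idea concrete, here is a minimal sketch of how a window of the last `window_length` observations could be assembled and flattened into one input vector. The `stack_observations` and `flatten` helpers, the zero padding at the start of an episode, and the sample values are assumptions made for illustration; in the notebook this work is left to the keras-rl2 memory object.

```python
from collections import deque

OBS_SIZE = 4  # CartPole observation: position, velocity, angle, rotation rate

def stack_observations(history, window_length):
    """Return the last `window_length` observations, zero-padded at the start."""
    window = deque(maxlen=window_length)
    for obs in history:
        window.append(obs)
    padding = [[0.0] * OBS_SIZE] * (window_length - len(window))
    return padding + list(window)

def flatten(window):
    """Concatenate the stacked observations into one input vector."""
    return [value for obs in window for value in obs]

# Two hypothetical observations collected so far in an episode.
history = [[0.01, 0.02, 0.03, 0.04], [0.02, 0.21, 0.03, -0.25]]

single = flatten(stack_observations(history, window_length=1))
stacked = flatten(stack_observations(history, window_length=4))
print(len(single), len(stacked))
```

With a window of 1 the network sees only the latest observation; with a window of 4 it sees four observations side by side, so the input to the first dense layer is four times wider.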

In [17]:
import numpy as np
import gym

import tensorflow as tf

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Activation, Flatten
from tensorflow.keras.optimizers import Adam

from rl.agents.dqn import DQNAgent
from rl.policy import BoltzmannQPolicy, EpsGreedyQPolicy
from rl.memory import SequentialMemory

We will do this exercise on the same environment, `CartPole-v0`

In [18]:
# Global variables.
ENV_NAME = 'CartPole-v0'
env = gym.make(ENV_NAME) # Instantiate the environment
env._max_episode_steps = 300 # Search online for how to raise the environment's maximum number of steps from 200 (the default) to 300.
np.random.seed(123)
env.seed(123)
nb_actions = env.action_space.n # Get the number of actions of the environment
# Window lengths to train with.
window_lengths = [1,4]
# Policies to train with
policies=[BoltzmannQPolicy(),EpsGreedyQPolicy()]
# Path where the weights are saved.
weights_path = '/content/drive/MyDrive/MIOTI/RL/SESION_4/weights/'

Now we have to declare the architecture of the DQN that we will use to solve the environment. Keep in mind that the input will be the environment observation and the output will be the value of $Q$ for each of the actions.

In [19]:
# Define the network. Remember that 658 parameters are enough
# This function builds the DNN model.
# Its input parameter is the length of the input window.
def create_model(window_length):
  model = Sequential()
  model.add(Flatten(input_shape=(window_length,) + env.observation_space.shape))
  ## YOUR CODE HERE ## 

  model.add(Dense(16))
  model.add(Activation('relu'))
  model.add(Dense(16))
  model.add(Activation('relu'))
  model.add(Dense(16))
  model.add(Activation('relu'))

  model.add(Dense(nb_actions)) # Remember: what must the number of neurons in the last layer match?
  model.add(Activation('linear'))
  # print(model.summary())
  return model
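
The 658-parameter figure can be checked by hand: a dense layer has `inputs * units` weights plus `units` biases, and Flatten adds none. The sketch below assumes the layer sizes used in `create_model` (three hidden layers of 16 units), a 4-value observation, and 2 actions; `dense_params` and `count_params` are hypothetical helper names, not part of Keras.

```python
def dense_params(n_inputs, n_units):
    """Weights plus biases of a fully connected layer."""
    return n_inputs * n_units + n_units

def count_params(window_length, obs_size=4, hidden=(16, 16, 16), nb_actions=2):
    """Total trainable parameters of a Flatten + Dense stack."""
    n_inputs = window_length * obs_size  # Flatten has no parameters
    total = 0
    for n_units in hidden + (nb_actions,):
        total += dense_params(n_inputs, n_units)
        n_inputs = n_units
    return total

print(count_params(window_length=1))  # 658
print(count_params(window_length=4))  # 850
```

Only the first dense layer depends on the window length, which is why a window of 4 adds 192 parameters (850 versus 658) under these assumptions.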

Now that our DQN and environment are ready, we have to make the agent learn to play `CartPole`. To do so we will declare the typical components of a DQN model: the memory and the policy (*try both!*).
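
The two policies differ in how they turn the network's Q-values into an action. The following is a minimal sketch of both ideas in plain Python, not keras-rl2's own implementation; the function names, the `eps=0.1` and `tau=1.0` values, and the sample Q-values are assumptions chosen for illustration.

```python
import math
import random

def eps_greedy_action(q_values, eps=0.1, rng=random):
    """With probability eps pick a random action, otherwise the greedy one."""
    if rng.random() < eps:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

def boltzmann_probs(q_values, tau=1.0):
    """Softmax over Q-values scaled by temperature tau."""
    scaled = [q / tau for q in q_values]
    shift = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def boltzmann_action(q_values, tau=1.0, rng=random):
    """Sample an action in proportion to its Boltzmann probability."""
    probs = boltzmann_probs(q_values, tau)
    return rng.choices(range(len(q_values)), weights=probs, k=1)[0]

q = [1.0, 2.0]  # hypothetical Q-values for "push left" and "push right"
print(boltzmann_probs(q))
print(eps_greedy_action(q, eps=0.0))
```

Epsilon-greedy explores uniformly at random regardless of how bad an action looks, whereas the Boltzmann policy explores more often among actions whose Q-values are close to the best one.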

In [20]:
# Training: a nested loop that first runs two trainings for the
# BoltzmannQPolicy policy (windows 1 and 4) and then the same for the EpsGreedyQPolicy policy (windows 1 and 4).
model=None
dqn = None
for policy in policies:
  print (f'Policy: {type(policy).__name__}')
  for window_length in window_lengths:
    
    print(f'Window_length: {window_length}')
    model=create_model(window_length)
    memory = SequentialMemory(limit=50000, window_length=window_length) # Create the SequentialMemory you imported with a limit of 50000 steps and the window length of the current run
    dqn = DQNAgent(model=model, nb_actions=nb_actions, memory=memory, nb_steps_warmup=10,
                  target_model_update=1e-2, policy=policy)
    
    optimizer = Adam(learning_rate=0.001)
    dqn.compile(optimizer, metrics=['mse']) # Compile the agent with the Adam optimizer (learning_rate=1e-3) and the 'mse' metric

    dqn.fit(env, nb_steps=50000, visualize=False, verbose=2)

    # After training is done, we save the final weights.
    weights_file=f'dqn_{ENV_NAME}_{type(policy).__name__}_{window_length}_weights.h5f'

    dqn.save_weights(weights_path+weights_file, overwrite=True)
    # dqn.save_weights(f'dqn_{ENV_NAME}_weights.h5f', overwrite=True)

    # Finally, evaluate our algorithm for 5 episodes.
    dqn.test(env, nb_episodes=5, visualize=False)


Policy: BoltzmannQPolicy
Window_length: 1
Training for 50000 steps ...


  updates=self.state_updates,


    19/50000: episode: 1, duration: 2.859s, episode steps:  19, steps per second:   7, episode reward: 19.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.579 [0.000, 1.000],  loss: 0.553204, mse: 0.555155, mean_q: 0.059912




    55/50000: episode: 2, duration: 0.547s, episode steps:  36, steps per second:  66, episode reward: 36.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.611 [0.000, 1.000],  loss: 0.422739, mse: 0.438737, mean_q: 0.212234
    75/50000: episode: 3, duration: 0.326s, episode steps:  20, steps per second:  61, episode reward: 20.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.600 [0.000, 1.000],  loss: 0.212814, mse: 0.366967, mean_q: 0.668391
    94/50000: episode: 4, duration: 0.299s, episode steps:  19, steps per second:  64, episode reward: 19.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.421 [0.000, 1.000],  loss: 0.080693, mse: 0.528432, mean_q: 1.104121
   105/50000: episode: 5, duration: 0.175s, episode steps:  11, steps per second:  63, episode reward: 11.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.727 [0.000, 1.000],  loss: 0.052493, mse: 0.848875, mean_q: 1.335739
   124/50000: episode: 6, duration: 0.324s, episode steps:  19, step

  updates=self.state_updates,


    31/50000: episode: 1, duration: 3.257s, episode steps:  31, steps per second:  10, episode reward: 31.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.645 [0.000, 1.000],  loss: 0.441886, mse: 0.450916, mean_q: -0.001558




    51/50000: episode: 2, duration: 0.343s, episode steps:  20, steps per second:  58, episode reward: 20.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.400 [0.000, 1.000],  loss: 0.322765, mse: 0.335538, mean_q: 0.110414
    67/50000: episode: 3, duration: 0.270s, episode steps:  16, steps per second:  59, episode reward: 16.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.375 [0.000, 1.000],  loss: 0.257872, mse: 0.317981, mean_q: 0.338624
    84/50000: episode: 4, duration: 0.284s, episode steps:  17, steps per second:  60, episode reward: 17.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.471 [0.000, 1.000],  loss: 0.198336, mse: 0.376641, mean_q: 0.567846
    95/50000: episode: 5, duration: 0.196s, episode steps:  11, steps per second:  56, episode reward: 11.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.636 [0.000, 1.000],  loss: 0.148077, mse: 0.510622, mean_q: 0.763263
   106/50000: episode: 6, duration: 0.190s, episode steps:  11, step

  updates=self.state_updates,


    40/50000: episode: 1, duration: 3.807s, episode steps:  40, steps per second:  11, episode reward: 40.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.450 [0.000, 1.000],  loss: 0.432797, mse: 0.436813, mean_q: 0.092385
    49/50000: episode: 2, duration: 0.163s, episode steps:   9, steps per second:  55, episode reward:  9.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.000 [0.000, 0.000],  loss: 0.306216, mse: 0.340292, mean_q: 0.310529
    61/50000: episode: 3, duration: 0.219s, episode steps:  12, steps per second:  55, episode reward: 12.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.167 [0.000, 1.000],  loss: 0.214071, mse: 0.320088, mean_q: 0.557900
    71/50000: episode: 4, duration: 0.178s, episode steps:  10, steps per second:  56, episode reward: 10.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.000 [0.000, 0.000],  loss: 0.151401, mse: 0.361324, mean_q: 0.847342
    81/50000: episode: 5, duration: 0.170s, episode steps:  10, step

  updates=self.state_updates,


    10/50000: episode: 1, duration: 0.733s, episode steps:  10, steps per second:  14, episode reward: 10.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.100 [0.000, 1.000],  loss: --, mse: --, mean_q: --


  updates=self.state_updates,


    19/50000: episode: 2, duration: 2.996s, episode steps:   9, steps per second:   3, episode reward:  9.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.000 [0.000, 0.000],  loss: 0.524034, mse: 0.594394, mean_q: 0.651873
    28/50000: episode: 3, duration: 0.169s, episode steps:   9, steps per second:  53, episode reward:  9.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.111 [0.000, 1.000],  loss: 0.363264, mse: 0.496213, mean_q: 0.983008




    37/50000: episode: 4, duration: 0.172s, episode steps:   9, steps per second:  52, episode reward:  9.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.000 [0.000, 0.000],  loss: 0.286177, mse: 0.564956, mean_q: 1.242197
    47/50000: episode: 5, duration: 0.190s, episode steps:  10, steps per second:  53, episode reward: 10.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.000 [0.000, 0.000],  loss: 0.264803, mse: 0.556139, mean_q: 1.477296
    58/50000: episode: 6, duration: 0.196s, episode steps:  11, steps per second:  56, episode reward: 11.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.091 [0.000, 1.000],  loss: 0.232300, mse: 0.471558, mean_q: 1.515074
    72/50000: episode: 7, duration: 0.255s, episode steps:  14, steps per second:  55, episode reward: 14.000, mean reward:  1.000 [ 1.000,  1.000], mean action: 0.143 [0.000, 1.000],  loss: 0.243182, mse: 0.614428, mean_q: 1.644238
    82/50000: episode: 8, duration: 0.181s, episode steps:  10, step
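
The agents above are created with `target_model_update=1e-2`. As far as the keras-rl documentation describes it, a value below 1 is treated as a soft-update rate: after each training step the target network moves a small fraction of the way toward the online network. The sketch below illustrates that rule on plain lists of numbers; `soft_update` and the sample weights are hypothetical and stand in for the real network weights.

```python
def soft_update(target_weights, online_weights, tau=0.01):
    """Move each target weight a fraction tau toward the online weight."""
    return [tau * w + (1.0 - tau) * t
            for t, w in zip(target_weights, online_weights)]

target = [0.0, 0.0]
online = [1.0, -2.0]  # hypothetical online-network weights
for _ in range(100):
    target = soft_update(target, online)
print(target)
```

With a rate of 0.01 the target lags well behind the online weights, which keeps the Q-learning targets from shifting abruptly between updates.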

### Testing the model

If loading the model raises an error about a string, run the following cell.



In [21]:
# pip install 'h5py==2.10.0' --force-reinstall 


If you are in a different session, run the previous cells and don't forget to comment out the line that calls `dqn.fit()` so that it does not train again.

In [22]:
# weight_file=f'./dqn_CartPole-v0_{type(policy).__name__}_{window_length}_weights.h5f'
# print(weight_file)

In [25]:
for policy in policies:
  print (f'Policy: {type(policy).__name__}')
  for window_length in window_lengths:
    
    print(f'Policy: {type(policy).__name__} Window_length: {window_length}')
    model=create_model(window_length)
    memory = SequentialMemory(limit=50000, window_length=window_length) # Create the SequentialMemory you imported with a limit of 50000 steps and the window length of the current run
    dqn = DQNAgent(model=model, nb_actions=nb_actions, memory=memory, nb_steps_warmup=10,
                  target_model_update=1e-2, policy=policy)
    optimizer = Adam(learning_rate=0.001)
    dqn.compile(optimizer, metrics=['mse']) # Compile the agent with the Adam optimizer (learning_rate=1e-3) and the 'mse' metric

    # Build the name of the weights file to load
    weights_file=f'dqn_{ENV_NAME}_{type(policy).__name__}_{window_length}_weights.h5f'
    print(weights_path+weights_file)
    dqn.load_weights(weights_path+weights_file) # Path where the weights were saved
    # Create a video subdirectory for each policy and window length
    sub_path_video=f'{type(policy).__name__}_{window_length}'
    dqn.test(wrap_env(env,sub_path_video), nb_episodes=5, visualize=True)
    show_video(video_path+sub_path_video+'/')

    

Policy: BoltzmannQPolicy
Policy: BoltzmannQPolicy Window_length: 1
/content/drive/MyDrive/MIOTI/RL/SESION_4/weights/dqn_CartPole-v0_BoltzmannQPolicy_1_weights.h5f
Testing for 5 episodes ...


  updates=self.state_updates,


Episode 1: reward: 300.000, steps: 300
Episode 2: reward: 300.000, steps: 300
Episode 3: reward: 300.000, steps: 300
Episode 4: reward: 300.000, steps: 300
Episode 5: reward: 300.000, steps: 300


Policy: BoltzmannQPolicy Window_length: 4
/content/drive/MyDrive/MIOTI/RL/SESION_4/weights/dqn_CartPole-v0_BoltzmannQPolicy_4_weights.h5f
Testing for 5 episodes ...


  updates=self.state_updates,


Episode 1: reward: 300.000, steps: 300
Episode 2: reward: 300.000, steps: 300
Episode 3: reward: 300.000, steps: 300
Episode 4: reward: 266.000, steps: 266
Episode 5: reward: 300.000, steps: 300


Policy: EpsGreedyQPolicy
Policy: EpsGreedyQPolicy Window_length: 1
/content/drive/MyDrive/MIOTI/RL/SESION_4/weights/dqn_CartPole-v0_EpsGreedyQPolicy_1_weights.h5f
Testing for 5 episodes ...


  updates=self.state_updates,


Episode 1: reward: 289.000, steps: 289
Episode 2: reward: 247.000, steps: 247
Episode 3: reward: 248.000, steps: 248
Episode 4: reward: 229.000, steps: 229
Episode 5: reward: 207.000, steps: 207


Policy: EpsGreedyQPolicy Window_length: 4
/content/drive/MyDrive/MIOTI/RL/SESION_4/weights/dqn_CartPole-v0_EpsGreedyQPolicy_4_weights.h5f
Testing for 5 episodes ...


  updates=self.state_updates,


Episode 1: reward: 124.000, steps: 124
Episode 2: reward: 122.000, steps: 122
Episode 3: reward: 116.000, steps: 116
Episode 4: reward: 115.000, steps: 115
Episode 5: reward: 120.000, steps: 120


**Did you observe any difference between applying one policy or the other?**
**Is there any difference if window_length is 1 or 4? Why?**

* **Answer 1:** Yes, there are differences. With **BoltzmannQPolicy** the agent obtains the maximum rewards, and in most episodes it completes all the steps without the pole tilting more than 15º or the cart moving more than 2.4 units, as stated in the rules of the CartPole game [https://gym.openai.com/envs/CartPole-v0/](https://gym.openai.com/envs/CartPole-v0/).
With **EpsGreedyQPolicy** the results are somewhat worse, perhaps because the exploration phase penalizes it. Its rewards are lower than those of the other policy for both window lengths.

* **Answer 2:** Within each policy, the results are better when the **stacking** window length is 1 (a single observation as input) than when it is 4 (four stacked observations that add context). In CartPole the extra observations seem to hurt. A plausible reason is that each observation already contains the cart velocity and the pole rotation rate, so stacking adds little new information while enlarging the network input.
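
To compare the four configurations at a glance, the episode rewards printed in the test output can be averaged. This is a small sketch that copies those numbers by hand; the `test_rewards` and `summary` names are introduced here for illustration, and five episodes per configuration is too small a sample to draw firm conclusions.

```python
from statistics import mean

# Episode rewards copied from the test output above.
test_rewards = {
    ("BoltzmannQPolicy", 1): [300, 300, 300, 300, 300],
    ("BoltzmannQPolicy", 4): [300, 300, 300, 266, 300],
    ("EpsGreedyQPolicy", 1): [289, 247, 248, 229, 207],
    ("EpsGreedyQPolicy", 4): [124, 122, 116, 115, 120],
}

summary = {config: mean(rewards) for config, rewards in test_rewards.items()}
for (policy, window_length), avg in summary.items():
    print(f"{policy:18s} window_length={window_length}: mean reward {avg:.1f}")
```

The means come out at 300.0 and 293.2 for BoltzmannQPolicy and 244.0 and 119.4 for EpsGreedyQPolicy, for windows 1 and 4 respectively, which matches the ordering described in both answers.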