### Practical project

Points to keep in mind:

- We will work with the _SpaceInvaders-v0_ environment and use the _DQN_ algorithm.

- For this exercise, a solution counts as optimal when the agent achieves a **mean reward above 20 points in test mode**. This mean reward is computed by the test code in the last cell of the notebook.

This practical project has three parts:

   1) Implement the neural network used in the solution
    
   2) Implement the different pieces of the DQN solution
    
   3) Justify the answer in light of the results obtained

IMPORTANT:

- If an optimal score is not reached, answer based on the best score obtained.

- For long training runs, remember that you can use checkpoints of your models to resume training. In that case, remember to adjust the parameters accordingly (especially those related to the exploration process).

- Keep in mind that the recommended library versions are Tensorflow==1.13.1, Keras==2.2.4 and keras-rl==0.4.2
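
The checkpoint advice above implies that a resumed run should continue the exploration schedule from where it stopped instead of restarting at full randomness. A minimal sketch of that arithmetic, assuming a linear epsilon schedule with the values configured later in this notebook (`value_max=1.0`, `value_min=0.1`, `nb_steps=2000000`). The helper name `resumed_schedule` and the example checkpoint step are hypothetical:

```python
def resumed_schedule(steps_done, value_max=1.0, value_min=0.1, nb_steps=2_000_000):
    """Return (starting epsilon, remaining annealing steps) after steps_done steps."""
    # Fraction of the annealing period already consumed, capped at 1.
    progress = min(steps_done / nb_steps, 1.0)
    # Epsilon that the original linear schedule had reached at the checkpoint.
    eps_now = value_max - (value_max - value_min) * progress
    # Steps left until epsilon reaches value_min.
    remaining = max(nb_steps - steps_done, 0)
    return eps_now, remaining


# Hypothetical example: resuming from a checkpoint saved at step 500000.
eps_start, steps_left = resumed_schedule(500_000)
print(eps_start, steps_left)
```

Under these assumptions, the two values would replace `value_max` and `nb_steps` in the annealed policy of the resumed run. If no annealing steps remain, a fixed epsilon equal to `value_min` is the natural choice.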

#### Set up the environment on TF2

In [2]:
# Connect to our Google Drive
from google.colab import drive
drive.mount('/content/drive')

# Set an absolute path to an existing directory in my Google Drive (change it to your own drive)
BASE_FOLDER = "/content/drive/Othercomputers/My MacBook Pro/08_aprendizaje_por_refuerzo/proyecto/"

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [3]:
# install keras-rl2 that works with tensorflow 2.x
!pip install keras-rl2 > /dev/null 2>&1

# install gym and atari ROMs
# (the quotes keep the shell from treating ">=" as an output redirection)
!pip install -U "gym>=0.21.0"
!pip install -U "gym[atari,accept-rom-license]"



In [4]:
# install the relevant libraries to make rendering possible
!pip install pyvirtualdisplay > /dev/null 2>&1
!apt-get update
!apt-get install -y xvfb python-opengl ffmpeg > /dev/null 2>&1


In [5]:
# import the relevant libraries 
import gym
from gym import logger as gymlogger
from gym.wrappers import Monitor
gymlogger.set_level(40) #error only
import tensorflow as tf
import numpy as np
import math
import glob
import io
import base64
from IPython.display import HTML

from IPython import display as ipythondisplay




In [6]:

"""
Utility functions to enable video recording of gym environment and displaying it
To enable video, just do "env = wrap_env(env)"
"""

def show_video():
  mp4list = glob.glob('video/*.mp4')
  if len(mp4list) > 0:
    mp4 = mp4list[0]
    video = io.open(mp4, 'r+b').read()
    encoded = base64.b64encode(video)
    ipythondisplay.display(HTML(data='''<video alt="test" autoplay 
                loop controls style="height: 400px;">
                <source src="data:video/mp4;base64,{0}" type="video/mp4" />
             </video>'''.format(encoded.decode('ascii'))))
  else: 
    print("Could not find video")
    

def wrap_env(env):
  env = Monitor(env, './video', force=True)
  return env

#### Import libraries

In [7]:
from __future__ import division

from PIL import Image
import numpy as np
import gym

from keras.models import Sequential
from keras.layers import Dense, Activation, Flatten, Convolution2D, Permute, Dropout, BatchNormalization
#from keras.optimizers import Adam
from tensorflow.keras.optimizers import Adam
import keras.backend as K

from rl.agents.dqn import DQNAgent
from rl.policy import LinearAnnealedPolicy, BoltzmannQPolicy, EpsGreedyQPolicy
from rl.memory import SequentialMemory
from rl.core import Processor
from rl.callbacks import FileLogger, ModelIntervalCheckpoint

#### Base configuration

In [8]:
INPUT_SHAPE = (84, 84)
WINDOW_LENGTH = 4

env_name = 'SpaceInvaders-v0'
#env = wrap_env(gym.make(env_name))
env = gym.make(env_name)

np.random.seed(123)
env.seed(123)
nb_actions = env.action_space.n

In [9]:
class AtariProcessor(Processor):
    def process_observation(self, observation):
        assert observation.ndim == 3  # (height, width, channel)
        img = Image.fromarray(observation)
        img = img.resize(INPUT_SHAPE).convert('L') # resize and convert to grayscale
        processed_observation = np.array(img)
        assert processed_observation.shape == INPUT_SHAPE
        return processed_observation.astype('uint8') # cast to 8 bits to save memory

    def process_state_batch(self, batch):
        processed_batch = batch.astype('float32') / 255. # normalize to [0, 1]
        return processed_batch

    def process_reward(self, reward):
        return np.clip(reward, -1., 1.) # clip so that training is less sensitive to extreme values
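
Storing frames as `uint8` and normalizing only when a batch is sampled keeps the replay memory small. The sketch below is a rough estimate: it assumes one 84x84 frame per stored step and the 1,000,000-step memory limit configured later in the notebook, and it ignores the actions, rewards and terminal flags that the real memory also stores. The reward values passed to `np.clip` are illustrative only.

```python
import numpy as np

INPUT_SHAPE = (84, 84)
MEMORY_LIMIT = 1_000_000  # same limit as the replay memory configured later

# One processed frame as stored (uint8) and as fed to the network (float32).
frame_uint8 = np.zeros(INPUT_SHAPE, dtype=np.uint8)
frame_float32 = frame_uint8.astype('float32') / 255.

# Approximate size of a full memory in gigabytes for each dtype.
gb_uint8 = frame_uint8.nbytes * MEMORY_LIMIT / 1e9
gb_float32 = frame_float32.nbytes * MEMORY_LIMIT / 1e9
print(f"uint8: {gb_uint8:.3f} GB, float32: {gb_float32:.3f} GB")

# Reward clipping maps every positive raw reward to 1 and every negative one to -1.
clipped = np.clip(np.array([0., 5., 30., -2.]), -1., 1.)
print(clipped)
```

Because of the clipping, the episode rewards reported during training and testing count rewarded events and are not raw game points.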

In [10]:
# Number of actions available in the environment
nb_actions = env.action_space.n
nb_actions

6

In [11]:
# Action names
env.unwrapped.get_action_meanings()

['NOOP', 'FIRE', 'RIGHT', 'LEFT', 'RIGHTFIRE', 'LEFTFIRE']

In [None]:
# Dimensions of the environment observations
env.observation_space.shape

(210, 160, 3)

1) Neural network implementation

In [12]:
# Next, we build our model. We use the same model that was described by Mnih et al. (2015).
input_shape = (WINDOW_LENGTH,) + INPUT_SHAPE
model = Sequential()

# the dimension ordering depends on the backend (TensorFlow or Theano)
#if K.image_dim_ordering() == 'tf':
if K.image_data_format() == 'channels_last':
    # (width, height, channels)
    model.add(Permute((2, 3, 1), input_shape=input_shape))
#elif K.image_dim_ordering() == 'th':
elif K.image_data_format() == 'channels_first':
    # (channels, width, height)
    model.add(Permute((1, 2, 3), input_shape=input_shape))
else:
    raise RuntimeError('Unknown image_data_format.')
# conv_1
model.add(Convolution2D(32, (8, 8), strides=(4, 4)))
model.add(Activation('relu'))
# conv_2
model.add(Convolution2D(64, (4, 4), strides=(2, 2)))
model.add(Activation('relu'))
# conv_3
model.add(Convolution2D(64, (3, 3), strides=(1, 1)))
model.add(Activation('relu'))
# FC
model.add(Flatten())
model.add(Dense(512))
model.add(Activation('relu'))
model.add(Dense(nb_actions))
model.add(Activation('linear')) # linear because the outputs are the expected future rewards (Q-values) for each action (nb_actions)
print(model.summary())

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 permute (Permute)           (None, 84, 84, 4)         0         
                                                                 
 conv2d (Conv2D)             (None, 20, 20, 32)        8224      
                                                                 
 activation (Activation)     (None, 20, 20, 32)        0         
                                                                 
 conv2d_1 (Conv2D)           (None, 9, 9, 64)          32832     
                                                                 
 activation_1 (Activation)   (None, 9, 9, 64)          0         
                                                                 
 conv2d_2 (Conv2D)           (None, 7, 7, 64)          36928     
                                                                 
 activation_2 (Activation)   (None, 7, 7, 64)          0
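
The output shapes and parameter counts in the summary above can be checked by hand. A short sketch, assuming unpadded ("valid") convolutions, which is the Keras default used by the layers above; the helper names `conv_out` and `conv_params` are illustrative:

```python
def conv_out(size, kernel, stride):
    """Output length of an unpadded convolution along one axis."""
    return (size - kernel) // stride + 1


def conv_params(kernel, in_channels, filters):
    """Kernel weights plus one bias per filter, for a square kernel."""
    return kernel * kernel * in_channels * filters + filters


size, channels = 84, 4  # 84x84 input with WINDOW_LENGTH = 4 stacked frames
layers = [(32, 8, 4), (64, 4, 2), (64, 3, 1)]  # (filters, kernel, stride)

shapes, params = [], []
for filters, kernel, stride in layers:
    params.append(conv_params(kernel, channels, filters))
    size = conv_out(size, kernel, stride)
    shapes.append((size, size, filters))
    channels = filters

flat_size = size * size * channels  # length of the vector produced by Flatten()

print(shapes)     # [(20, 20, 32), (9, 9, 64), (7, 7, 64)]
print(params)     # [8224, 32832, 36928]
print(flat_size)  # 3136
```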

2) DQN solution implementation

In [13]:
memory = SequentialMemory(limit=1000000, window_length=WINDOW_LENGTH)
processor = AtariProcessor()

In [14]:
# nb_steps is the number of steps over which eps is annealed from value_max to value_min (exploration) before the agent mostly exploits
# value_test keeps a small amount of randomness during testing to reduce the risk of getting stuck in local minima
# eps starts at 1 (100% random actions) and decreases toward more 'predicted/learned' actions
policy = LinearAnnealedPolicy(EpsGreedyQPolicy(), attr='eps',
                              value_max=1., value_min=.1, value_test=.05,
                              nb_steps=2000000)
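
To make the schedule concrete, here is a small sketch of a linear annealing rule with a floor, using the same `value_max`, `value_min` and `nb_steps` as the policy above. It is written from scratch for illustration (the function name `linear_eps` is hypothetical) and is not copied from the library.

```python
def linear_eps(step, value_max=1.0, value_min=0.1, nb_steps=2_000_000):
    """Epsilon after `step` training steps: linear decay, never below value_min."""
    slope = -(value_max - value_min) / nb_steps
    return max(value_min, slope * step + value_max)


for step in (0, 55_000, 1_000_000, 2_000_000, 3_000_000):
    print(step, round(linear_eps(step), 3))
```

The value at step 55000 rounds to 0.975, which agrees with the `mean_eps: 0.975` that the training log reports for the interval covering steps 50000 to 60000. With 3,000,000 training steps, the last million steps run at the floor of 0.1.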

In [15]:
# remember that the model is 'duplicated': one copy for the target and one for prediction, both identical
# gamma is the discount factor applied to future rewards
# the target model is updated every 10000 steps
# the model weights are updated every 20 steps
# we lower the learning rate
dqn = DQNAgent(model=model, nb_actions=nb_actions, policy=policy,
               memory=memory, processor=processor,
               nb_steps_warmup=50000, gamma=.9,
               target_model_update=10000,
               train_interval=20)
dqn.compile(Adam(learning_rate=.00025), metrics=['mae'])
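
The role of `gamma` and of the target model can be illustrated with the standard one-step DQN target, r + gamma * max_a Q_target(s', a). The NumPy sketch below is a plain (non-double) version on a made-up mini-batch. It is an illustration of the idea, not a copy of the library's internals, and the function name `dqn_targets` is hypothetical.

```python
import numpy as np

GAMMA = 0.9  # same discount factor as in the agent above


def dqn_targets(rewards, next_q_target, terminals, gamma=GAMMA):
    """One-step targets; terminal transitions do not bootstrap from the next state."""
    # Best Q-value of the next state according to the (frozen) target network.
    max_next_q = next_q_target.max(axis=1)
    return rewards + gamma * max_next_q * (1.0 - terminals)


# Made-up mini-batch: 3 transitions, 2 actions each.
rewards = np.array([1.0, 0.0, 1.0])
next_q = np.array([[0.5, 2.0], [1.0, 0.0], [3.0, 4.0]])
terminals = np.array([0.0, 0.0, 1.0])

targets = dqn_targets(rewards, next_q, terminals)
print(targets)
```

Because the target network only changes every 10000 steps, these targets stay fixed between updates while the online model is trained toward them.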

In [None]:
# Training part
# on-screen log every 10,000 steps
# model versions are saved every 500000 steps (callbacks)
# the written log is saved every 100 episodes
weights_filename = BASE_FOLDER+'dqn2_{}_weights.h5f'.format(env_name)
checkpoint_weights_filename = BASE_FOLDER+'dqn2_' + env_name + '_weights_{step}.h5f'
log_filename = BASE_FOLDER+'dqn2_{}_log.json'.format(env_name)
callbacks = [ModelIntervalCheckpoint(checkpoint_weights_filename, interval=500000)]
callbacks += [FileLogger(log_filename, interval=100)]

dqn.fit(env, callbacks=callbacks, nb_steps=3000000, log_interval=10000, visualize=False)

dqn.save_weights(weights_filename, overwrite=True)

Training for 3000000 steps ...
Interval 1 (0 steps performed)




14 episodes - episode_reward: 9.000 [4.000, 18.000] - lives: 2.223

Interval 2 (10000 steps performed)
16 episodes - episode_reward: 8.938 [5.000, 16.000] - lives: 2.090

Interval 3 (20000 steps performed)
14 episodes - episode_reward: 9.929 [5.000, 17.000] - lives: 2.034

Interval 4 (30000 steps performed)
14 episodes - episode_reward: 8.929 [5.000, 14.000] - lives: 2.113

Interval 5 (40000 steps performed)
13 episodes - episode_reward: 9.923 [3.000, 18.000] - lives: 2.102

Interval 6 (50000 steps performed)


16 episodes - episode_reward: 8.562 [3.000, 21.000] - loss: 0.006 - mae: 0.022 - mean_q: 0.034 - mean_eps: 0.975 - lives: 2.215

Interval 7 (60000 steps performed)
14 episodes - episode_reward: 9.357 [5.000, 19.000] - loss: 0.007 - mae: 0.039 - mean_q: 0.052 - mean_eps: 0.971 - lives: 2.153

Interval 8 (70000 steps performed)
14 episodes - episode_reward: 11.000 [6.000, 23.000] - loss: 0.007 - mae: 0.053 - mean_q: 0.069 - mean_eps: 0.966 - lives: 2.080

Interval 9 (80000 steps performed)
15 episodes - episode_reward: 9.067 [4.000, 15.000] - loss: 0.006 - mae: 0.068 - mean_q: 0.088 - mean_eps: 0.962 - lives: 2.089

Interval 10 (90000 steps performed)
12 episodes - episode_reward: 11.917 [3.000, 26.000] - loss: 0.007 - mae: 0.085 - mean_q: 0.108 - mean_eps: 0.957 - lives: 2.155

Interval 11 (100000 steps performed)
14 episodes - episode_reward: 8.857 [4.000, 15.000] - loss: 0.006 - mae: 0.086 - mean_q: 0.110 - mean_eps: 0.953 - lives: 1.996

Interval 12 (110000 steps performed)
15 episod

In [16]:
# use PyvirtualDisplay to create a “virtual display” that we will send our rendered frames to
from pyvirtualdisplay import Display
display = Display(visible=0, size=(1400, 900))
display.start()


<pyvirtualdisplay.display.Display at 0x7f19a048b410>

In [17]:
# wrap a Monitor around environment
env_monitor = wrap_env(env)

In [19]:
# Testing part to calculate the mean reward
weights_filename = BASE_FOLDER+'dqn1_{}_weights.h5f'.format(env_name)
dqn.load_weights(weights_filename)
dqn.test(env_monitor, nb_episodes=10, visualize=True)
show_video()

Testing for 10 episodes ...
Episode 1: reward: 24.000, steps: 967
Episode 2: reward: 12.000, steps: 386
Episode 3: reward: 14.000, steps: 627
Episode 4: reward: 13.000, steps: 536
Episode 5: reward: 19.000, steps: 799
Episode 6: reward: 10.000, steps: 351
Episode 7: reward: 25.000, steps: 980
Episode 8: reward: 17.000, steps: 593
Episode 9: reward: 11.000, steps: 536
Episode 10: reward: 14.000, steps: 575
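
The success criterion of the project is the mean of these episode rewards. A short sketch that computes it from the ten values printed above; the rewards are typed in by hand here, and the names `episode_rewards` and `TARGET_MEAN` are illustrative:

```python
from statistics import mean

# Episode rewards copied from the test output above.
episode_rewards = [24, 12, 14, 13, 19, 10, 25, 17, 11, 14]
TARGET_MEAN = 20  # threshold stated in the project description

mean_reward = mean(episode_rewards)
print(f"mean reward: {mean_reward:.1f} (target > {TARGET_MEAN})")
print("target reached:", mean_reward > TARGET_MEAN)
```

For this run the mean is 15.9, below the target of 20, so the justification in part 3 has to be based on the best score obtained.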


3) Justification of the selected parameters and the results obtained