<div style="width: 100%; clear: both;">
<div style="float: left; width: 50%;">
<img src="http://www.uoc.edu/portal/_resources/common/imatges/marca_UOC/UOC_Masterbrand.jpg" align="left">
</div>
<p style="margin: 0; padding-top: 22px; text-align:right;">M2.883 · Reinforcement Learning</p>
<p style="margin: 0; text-align:right;">University Master's Degree in Data Science</p>
<p style="margin: 0; text-align:right; padding-bottom: 100px;">Faculty of Computer Science, Multimedia and Telecommunications</p>
</div>
<div style="width:100%;">&nbsp;</div>


# Module 1: OpenAI Gym examples

In this _notebook_ we load some of the OpenAI Gym scenarios and look at how a few agents interact with these scenarios, or environments.

## 1. CartPole
In this first example we load the CartPole environment and run a few tests.

### 1.1. Data loading

The following code loads the packages needed for the example, creates the environment with the `make` method, and prints the action space (two actions: 0 = left and 1 = right), the observation space (four values: cart position, cart velocity, pole angle and pole velocity at the tip) and the range of the reward variable (from minus infinity to plus infinity).

In [1]:
import gym
import numpy as np

env = gym.make('CartPole-v0')
print("Action space is {} ".format(env.action_space))
print("Observation space is {} ".format(env.observation_space))
print("Reward range is {} ".format(env.reward_range))

Action space is Discrete(2) 
Observation space is Box(-3.4028234663852886e+38, 3.4028234663852886e+38, (4,), float32) 
Reward range is (-inf, inf) 


Next, we reset the environment (this must always be done after creating it, before the first step) and initialise the variables that store the number of steps executed (`t`), the accumulated reward (`total_reward`) and the flag that indicates when an episode ends (`done`).

In [2]:
# Environment reset
obs = env.reset()
t, total_reward, done = 0, 0, False

### 1.2. Running an episode

Next, we run one episode of the CartPole environment using an agent that selects its actions at random.

The following code runs one episode of the environment, which ends when the variable `done` becomes `True`. The agent is implemented with the method `env.action_space.sample()`, which picks an action at random. At each _time step_, the code prints the observation produced by the environment (the four values mentioned above), the selected action and the reward obtained in that step (+1 for every action until the episode ends).

In [3]:
while not done:
    
    # Render the environment
    env.render()
    
    # Get random action (this is the implementation of the agent)
    action = env.action_space.sample()
    
    # Execute action and get response
    new_obs, reward, done, info = env.step(action)
    print("Obs: {} -> Action: {} and reward: {}".format(np.round(obs, 3), action, reward))
    
    obs = new_obs
    total_reward += reward
    t += 1
    
# The loop has already counted the final step; only the terminal observation remains to be shown
print("Final obs: {}".format(np.round(obs, 3)))

Obs: [ 0.031  0.002 -0.048 -0.042] -> Action: 1 and reward: 1.0
Obs: [ 0.031  0.198 -0.048 -0.349] -> Action: 1 and reward: 1.0
Obs: [ 0.035  0.394 -0.055 -0.657] -> Action: 1 and reward: 1.0
Obs: [ 0.043  0.59  -0.069 -0.966] -> Action: 0 and reward: 1.0
Obs: [ 0.055  0.396 -0.088 -0.696] -> Action: 1 and reward: 1.0
Obs: [ 0.063  0.592 -0.102 -1.015] -> Action: 0 and reward: 1.0
Obs: [ 0.075  0.398 -0.122 -0.756] -> Action: 1 and reward: 1.0
Obs: [ 0.083  0.595 -0.137 -1.084] -> Action: 1 and reward: 1.0
Obs: [ 0.095  0.791 -0.159 -1.417] -> Action: 1 and reward: 1.0
Obs: [ 0.11   0.988 -0.187 -1.755] -> Action: 1 and reward: 1.0
Final obs: [ 0.13   1.185 -0.222 -2.099]


Finally, we print the results and close the environment.

In [4]:
print("Episode finished after {} timesteps and reward was {} ".format(t, total_reward))
env.close()

Episode finished after 10 timesteps and reward was 10.0 
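
To see why the episode above ended, it helps to compare the last observations with CartPole's termination conditions. The sketch below assumes the limits documented for the classic CartPole environment (cart position beyond ±2.4 or pole angle beyond ±12°); the helper name `is_terminal` is ours and is not part of Gym. An episode can also be cut short by the environment's step limit, which this check does not cover.

```python
import math

# Limits assumed from the classic CartPole environment
X_THRESHOLD = 2.4                          # cart position limit
THETA_THRESHOLD = 12 * 2 * math.pi / 360   # 12 degrees in radians (about 0.209)

def is_terminal(obs):
    """Return True if the observation violates a CartPole limit (hypothetical helper)."""
    cart_position, _, pole_angle, _ = obs
    return abs(cart_position) > X_THRESHOLD or abs(pole_angle) > THETA_THRESHOLD

# Last two observations printed in the episode above
print(is_terminal([0.11, 0.988, -0.187, -1.755]))  # False
print(is_terminal([0.13, 1.185, -0.222, -2.099]))  # True
```

In the last observation the pole angle (-0.222 rad) exceeds the 12° limit, which is what ends the episode.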


### 1.3. Simulating several episodes

The following code fragment repeats the process of the previous section for the number of episodes defined in the variable `num_episodes`.

In [5]:
num_episodes = 10

for episode in range(num_episodes):

    # Environment reset
    obs = env.reset()
    t, total_reward, done= 0, 0, False
    
    print('Running episode {} '.format(episode+1))
    
    while not done:
    
        # Render the environment
        env.render()
    
        # Get random action (this is the implementation of the agent)
        action = env.action_space.sample()
    
        # Execute action and get response
        new_obs, reward, done, info = env.step(action)
        print("Obs: {} -> Action: {} and reward: {}".format(np.round(obs, 3), action, reward))
    
        obs = new_obs
        total_reward += reward
        t += 1
        
    # The loop has already counted the final step; only the terminal observation remains to be shown
    print("Final obs: {}".format(np.round(obs, 3)))
    print("Episode {} finished after {} timesteps and reward was {} ".format(episode+1, t, total_reward))
    print('')
    
env.close()

Running episode 1 
Obs: [-0.045 -0.033 -0.034  0.034] -> Action: 1 and reward: 1.0
Obs: [-0.046  0.163 -0.033 -0.269] -> Action: 0 and reward: 1.0
Obs: [-0.042 -0.032 -0.038  0.013] -> Action: 0 and reward: 1.0
Obs: [-0.043 -0.226 -0.038  0.293] -> Action: 0 and reward: 1.0
Obs: [-0.048 -0.421 -0.032  0.574] -> Action: 0 and reward: 1.0
Obs: [-0.056 -0.616 -0.021  0.856] -> Action: 0 and reward: 1.0
Obs: [-0.068 -0.81  -0.004  1.142] -> Action: 1 and reward: 1.0
Obs: [-0.085 -0.615  0.019  0.848] -> Action: 1 and reward: 1.0
Obs: [-0.097 -0.42   0.036  0.561] -> Action: 0 and reward: 1.0
Obs: [-0.105 -0.616  0.047  0.865] -> Action: 0 and reward: 1.0
Obs: [-0.118 -0.812  0.065  1.172] -> Action: 1 and reward: 1.0
Obs: [-0.134 -0.617  0.088  0.901] -> Action: 1 and reward: 1.0
Obs: [-0.146 -0.424  0.106  0.637] -> Action: 0 and reward: 1.0
Obs: [-0.155 -0.62   0.119  0.961] -> Action: 0 and reward: 1.0
Obs: [-0.167 -0.817  0.138  1.289] -> Action: 0 and reward: 1.0
Obs: [-0.183 -1.013  

Obs: [-0.026 -0.251  0.086  0.602] -> Action: 1 and reward: 1.0
Obs: [-0.031 -0.058  0.098  0.337] -> Action: 1 and reward: 1.0
Obs: [-0.032  0.136  0.105  0.077] -> Action: 0 and reward: 1.0
Obs: [-0.029 -0.06   0.106  0.401] -> Action: 0 and reward: 1.0
Obs: [-0.031 -0.257  0.114  0.725] -> Action: 0 and reward: 1.0
Obs: [-0.036 -0.453  0.129  1.052] -> Action: 1 and reward: 1.0
Obs: [-0.045 -0.26   0.15   0.802] -> Action: 0 and reward: 1.0
Obs: [-0.05  -0.457  0.166  1.138] -> Action: 0 and reward: 1.0
Obs: [-0.059 -0.654  0.189  1.478] -> Action: 0 and reward: 1.0
Final obs: [-0.072 -0.851  0.218  1.823]
Episode 6 finished after 38 timesteps and reward was 38.0 

Running episode 7 
Obs: [-0.     0.039 -0.015 -0.016] -> Action: 1 and reward: 1.0
Obs: [ 0.     0.234 -0.015 -0.313] -> Action: 1 and reward: 1.0
Obs: [ 0.005  0.43  -0.021 -0.61 ] -> Action: 0 and reward: 1.0
Obs: [ 0.014  0.235 -0.033 -0.324] -> Action: 1 and reward: 1.0
Obs: [ 0.018  0.43  -0.04
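
The loop used in sections 1.2 and 1.3 can also be wrapped in a small function that works with any environment exposing `reset()` and `step()` in the way this notebook calls them. The sketch below is only an illustration: `run_episode` and the toy `CountdownEnv` are hypothetical names, and the toy environment stands in for Gym so that the example runs on its own.

```python
import random

class CountdownEnv:
    """Toy stand-in for a Gym environment (hypothetical): ends after `length` steps."""
    def __init__(self, length=5):
        self.length = length
        self.remaining = length

    def reset(self):
        self.remaining = self.length
        return self.remaining

    def step(self, action):
        self.remaining -= 1
        done = self.remaining == 0
        # Same 4-tuple as the `env.step` calls in this notebook
        return self.remaining, 1.0, done, {}

def run_episode(env, policy):
    """Run one episode and return (number of steps, accumulated reward)."""
    obs = env.reset()
    t, total_reward, done = 0, 0.0, False
    while not done:
        action = policy(obs)
        obs, reward, done, info = env.step(action)
        total_reward += reward  # each step is counted exactly once
        t += 1
    return t, total_reward

random_policy = lambda obs: random.choice([0, 1])
print(run_episode(CountdownEnv(length=5), random_policy))  # (5, 5.0)
```

Keeping the counters inside the loop makes it harder to count the last step twice, and the same function can be reused for every episode of a simulation.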

## 2. FrozenLake
In this second example we load the FrozenLake environment and again run a few tests.

### 2.1. Data loading

As in the first example, the following code loads the packages needed for the example, creates the environment with the `make` method, and prints the action space (four actions: 0 = left, 1 = down, 2 = right and 3 = up), the observation space (a number from 0 to 15 that gives the agent's position in the environment) and the range of the reward variable (0 for every action except the one that reaches the goal cell, in which case the reward is 1).

In [6]:
import time

env = gym.make('FrozenLake-v0')
print("Action space is {} ".format(env.action_space))
print("Observation space is {} ".format(env.observation_space))
print("Reward range is {} ".format(env.reward_range))

Action space is Discrete(4) 
Observation space is Discrete(16) 
Reward range is (0, 1) 
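
Each of the 16 observations is the index of a cell in the 4×4 grid, counted row by row from the top-left corner. A minimal sketch of the conversion, using the map that `env.render()` prints in the next section (the helper name `state_to_cell` is hypothetical):

```python
# Default 4x4 FrozenLake map: S = start, F = frozen, H = hole, G = goal
LAKE_MAP = ["SFFF",
            "FHFH",
            "FFFH",
            "HFFG"]

def state_to_cell(state, n_cols=4):
    """Convert a state index into (row, column) and the tile found there."""
    row, col = divmod(state, n_cols)
    return row, col, LAKE_MAP[row][col]

print(state_to_cell(0))   # (0, 0, 'S')
print(state_to_cell(5))   # (1, 1, 'H')
print(state_to_cell(15))  # (3, 3, 'G')
```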


### 2.2. Running an episode

Next, we run one episode of the FrozenLake environment using an agent that selects its actions at random.

In the following code we reset the environment, define the maximum number of steps per episode (`max_steps`) and run one episode, which ends when the variable `done` becomes `True` or when the maximum number of steps is reached. Again, the agent follows a completely random policy (`env.action_space.sample()`). With the method `env.render()` we can follow the agent through the environment from the start cell S until it reaches the goal cell G or falls into a hole H. The label in parentheses is the action selected in that step; because the ice is slippery by default, the agent does not always move in that direction. The agent's current cell is highlighted in the rendering (in a plain-text export the highlight shows up as the escape codes `[41m` and `[0m` around the cell).

In [7]:
# Environment reset
obs = env.reset()
t, total_reward, done = 0, 0, False
max_steps = 100

# Render the environment
env.render()
print('')
time.sleep(0.1)

while t < max_steps:
    # Get random action (this is the implementation of the agent)
    action = env.action_space.sample()
    
    # Execute action and get response
    obs, reward, done, info = env.step(action)
    
    # Render the environment
    env.render()
    print('')
        
    t += 1
    if done:
        break
    time.sleep(0.1)

print("Episode finished after {} timesteps and reward was {} ".format(t, reward))
env.close()


[41mS[0mFFF
FHFH
FFFH
HFFG

  (Left)
[41mS[0mFFF
FHFH
FFFH
HFFG

  (Up)
S[41mF[0mFF
FHFH
FFFH
HFFG

  (Left)
SFFF
F[41mH[0mFH
FFFH
HFFG

Episode finished after 3 timesteps and reward was 0.0 


### 2.3. Simulating several episodes

The following code fragment repeats the process of the previous section for the number of episodes defined in the variable `num_episodes`.

In [8]:
num_episodes = 10

for episode in range(num_episodes):

    # Environment reset
    obs = env.reset()
    t, done = 0, False
    
    print('Running episode {} '.format(episode+1))

    # Render the environment
    env.render()
    print('')
    time.sleep(0.1)
    
    while t < max_steps:
        # Get random action (this is the implementation of the agent)
        action = env.action_space.sample()
    
        # Execute action and get response
        obs, reward, done, info = env.step(action)
        
        # Render the environment
        env.render()
        print('')
        
        t += 1
        if done:
            break
        time.sleep(0.1)
      
    print("Episode {} finished after {} timesteps and reward was {} ".format(episode+1, t, reward))
    print('')

Running episode 1 

[41mS[0mFFF
FHFH
FFFH
HFFG

  (Down)
S[41mF[0mFF
FHFH
FFFH
HFFG

  (Right)
S[41mF[0mFF
FHFH
FFFH
HFFG

  (Right)
SF[41mF[0mF
FHFH
FFFH
HFFG

  (Down)
SFF[41mF[0m
FHFH
FFFH
HFFG

  (Left)
SFF[41mF[0m
FHFH
FFFH
HFFG

  (Left)
SFF[41mF[0m
FHFH
FFFH
HFFG

  (Up)
SFF[41mF[0m
FHFH
FFFH
HFFG

  (Right)
SFF[41mF[0m
FHFH
FFFH
HFFG

  (Up)
SF[41mF[0mF
FHFH
FFFH
HFFG

  (Right)
SFF[41mF[0m
FHFH
FFFH
HFFG

  (Right)
SFFF
FHF[41mH[0m
FFFH
HFFG

Episode 1 finished after 11 timesteps and reward was 0.0 

Running episode 2 

[41mS[0mFFF
FHFH
FFFH
HFFG

  (Right)
SFFF
[41mF[0mHFH
FFFH
HFFG

  (Left)
[41mS[0mFFF
FHFH
FFFH
HFFG

  (Up)
[41mS[0mFFF
FHFH
FFFH
HFFG

  (Left)
[41mS[0mFFF
FHFH
FFFH
HFFG

  (Right)
[41mS[0mFFF
FHFH
FFFH
HFFG

  (Left)
[41mS[0mFFF
FHFH
FFFH
HFFG

  (Up)
[41mS[0mFFF
FHFH
FFFH
HFFG

  (Up)
S[41mF[0mFF
FHFH
FFFH
HFFG

  (Left)
S[41mF[0mFF
FHFH
FFFH
HFFG

  (Left)
S[41mF[0mFF
FHFH
FFFH
HFFG

  (Right)
S[41mF[0mFF
F

### 2.4. Computing the total reward over several episodes

To measure how well the agent performs, we can compute the total reward over several episodes. Since the accumulated reward of an episode is 0 if the goal cell is not reached and 1 if it is, the total reward accumulated over a number of episodes equals the number of successful episodes, and dividing it by the number of episodes gives the agent's success rate.

The following code fragment repeats the process of the previous section for the number of episodes defined in the variable `num_episodes` and computes the agent's success rate. Rendering of the environment is omitted to speed up execution.

In [9]:
num_episodes = 1000
total_reward = 0

for episode in range(num_episodes):

    # Environment reset
    obs = env.reset()
    t, done = 0, False
    
    #env.render() --- Uncomment if you want to see the path of the agent  

    while t < max_steps:
        # Get random action (this is the implementation of the agent)
        action = env.action_space.sample()
    
        # Execute action and get response
        obs, reward, done, info = env.step(action)
        
        # Render the environment
        #env.render() --- Uncomment if you want to see the path of the agent
        
        total_reward += reward
        t += 1
        if done:
            break
    
success_rate = total_reward*100/num_episodes
print("{} successes in {} episodes: {} % of success".format(total_reward, num_episodes, success_rate))

16.0 successes in 1000 episodes: 1.6 % of success
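
Because the success rate is estimated from a finite number of random episodes, it carries sampling error. As a rough guide, and assuming the episodes are independent, the normal approximation to the binomial distribution gives an interval around the estimate. The sketch below applies it to the 16 successes in 1000 episodes shown above.

```python
import math

successes, episodes = 16, 1000
p = successes / episodes                   # estimated success probability
se = math.sqrt(p * (1 - p) / episodes)     # standard error (binomial, normal approximation)
low, high = p - 1.96 * se, p + 1.96 * se   # approximate 95% interval

print("success rate: {:.1f} % (approx. 95% interval: {:.1f} % to {:.1f} %)".format(
    100 * p, 100 * low, 100 * high))
```

With these numbers the interval runs from roughly 0.8 % to 2.4 %, so differences of a percentage point between two runs of the random agent are to be expected.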


### 2.5. Training an agent

As we saw in the previous section, because the agent chooses its actions at random, it rarely reaches the goal cell G with this policy (the success rate is around 1 % or 2 %). We will now train an agent with the Q-Learning method. This method (which is studied in later modules) can be implemented with a table that is updated as the agent interacts with the environment.
The following code implements this method and trains the agent by running several episodes.

__Note__: remember that the simulations have a random component, so the percentages may vary from one run to another.

We start by importing some packages:

In [10]:
import pickle

We initialise some variables of the method we want to implement: the exploration probability (`epsilon`), the number of episodes (`num_episodes`), the maximum number of steps per episode (`max_steps`), the learning rate (`learning_rate`) and the discount factor (`gamma`).

In [11]:
epsilon = 0.9
num_episodes = 100000
max_steps = 100

learning_rate = 0.81
gamma = 0.96
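
With `epsilon = 0.9` the agent explores heavily: in the ε-greedy rule used by `choose_action` below, a random action is taken with probability `epsilon` and the currently best action otherwise. Since a random draw can also land on the best action, the overall chance of taking it is (1 − ε) + ε/4 for four actions, assuming a single best action. A small check of this arithmetic (the function name is hypothetical):

```python
def greedy_action_probability(epsilon, n_actions):
    """Probability that epsilon-greedy selection ends up taking the greedy action."""
    return (1 - epsilon) + epsilon / n_actions

print(round(greedy_action_probability(0.9, 4), 3))  # 0.325
print(round(greedy_action_probability(0.1, 4), 3))  # 0.925
```

So with the value used here, the agent follows its current best guess in only about a third of the steps during training.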

We initialise to zero every value of the table of the Q function (sixteen states by four actions per state), which will end up indicating the best action for each state.

In [12]:
Q = np.zeros((env.observation_space.n, env.action_space.n))
print(Q)

[[0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]]


The following code defines the functions that characterise the agent (they are studied in later modules of this course).

In [13]:
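def choose_action(state):
    # Epsilon-greedy selection: explore with probability epsilon, otherwise exploit
    if np.random.uniform(0, 1) < epsilon:
        action = env.action_space.sample()   # random action
    else:
        action = np.argmax(Q[state, :])      # best known action for this state
    return action

def learn(state, new_state, reward, action):
    # Q-Learning update: move Q[state, action] towards the bootstrapped target
    predict = Q[state, action]
    target = reward + gamma * np.max(Q[new_state, :])
    Q[state, action] = predict + learning_rate * (target - predict)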
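
The `learn` function applies the tabular Q-Learning update $Q(s,a) \leftarrow Q(s,a) + \alpha\,[r + \gamma \max_{a'} Q(s',a') - Q(s,a)]$, where $\alpha$ is `learning_rate` and $\gamma$ is `gamma`. The sketch below replays this rule by hand on a toy two-state table in plain Python (the table and the transitions are made up for illustration) to show how the reward obtained at the goal propagates back to earlier states.

```python
learning_rate, gamma = 0.81, 0.96

# Toy table: 2 states x 2 actions, all zeros (illustrative, not the FrozenLake table)
Q = [[0.0, 0.0],
     [0.0, 0.0]]

def q_update(Q, state, action, reward, next_values):
    """One Q-Learning update; `next_values` are the Q values of the next state."""
    target = reward + gamma * max(next_values)
    Q[state][action] += learning_rate * (target - Q[state][action])

# Transition 1: from state 1, action 0 reaches a terminal goal with reward 1
q_update(Q, 1, 0, 1.0, [0.0, 0.0])
print(round(Q[1][0], 4))  # 0.81

# Transition 2: from state 0, action 1 leads to state 1 with reward 0
q_update(Q, 0, 1, 0.0, Q[1])
print(round(Q[0][1], 4))  # 0.6299
```

Although the second transition carries no reward of its own, its value becomes positive because the next state is already known to lead to the goal.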

The following code plays as many games as indicated by the variable `num_episodes`. In each game (episode), the agent interacts with the environment and, as a result of this interaction, updates the values of the _Q_ table. The call to `env.render()` is commented out so as not to flood the screen. The code also prints the episodes in which the agent reaches the goal cell.

In [14]:
# Start
for episode in range(num_episodes):
    state = env.reset()
    t = 0
    
    while t < max_steps:
        #env.render() --- Uncomment if you want to see the path of the agent
        action = choose_action(state)  
        state2, reward, done, info = env.step(action)  
        learn(state, state2, reward, action)
        state = state2
        t += 1
       
        if done:
            break

    if reward == 1:
        print("Episode {} finished after {} timesteps and reward was {} ".format(episode+1, t, reward)) 

Episode 29 finished after 9 timesteps and reward was 1.0 
Episode 41 finished after 16 timesteps and reward was 1.0 
Episode 84 finished after 6 timesteps and reward was 1.0 
Episode 175 finished after 7 timesteps and reward was 1.0 
Episode 187 finished after 15 timesteps and reward was 1.0 
Episode 277 finished after 13 timesteps and reward was 1.0 
Episode 322 finished after 8 timesteps and reward was 1.0 
Episode 430 finished after 10 timesteps and reward was 1.0 
Episode 457 finished after 8 timesteps and reward was 1.0 
Episode 493 finished after 12 timesteps and reward was 1.0 
Episode 532 finished after 16 timesteps and reward was 1.0 
Episode 547 finished after 16 timesteps and reward was 1.0 
Episode 559 finished after 7 timesteps and reward was 1.0 
Episode 643 finished after 27 timesteps and reward was 1.0 
Episode 667 finished after 15 timesteps and reward was 1.0 
Episode 668 finished after 8 timesteps and reward was 1.0 
Episode 678 finished after 16 timesteps and reward

Episode 7712 finished after 8 timesteps and reward was 1.0 
Episode 7794 finished after 21 timesteps and reward was 1.0 
Episode 7840 finished after 8 timesteps and reward was 1.0 
Episode 7945 finished after 9 timesteps and reward was 1.0 
Episode 8029 finished after 10 timesteps and reward was 1.0 
Episode 8092 finished after 8 timesteps and reward was 1.0 
Episode 8095 finished after 12 timesteps and reward was 1.0 
Episode 8146 finished after 9 timesteps and reward was 1.0 
Episode 8161 finished after 6 timesteps and reward was 1.0 
Episode 8175 finished after 23 timesteps and reward was 1.0 
Episode 8182 finished after 22 timesteps and reward was 1.0 
Episode 8210 finished after 13 timesteps and reward was 1.0 
Episode 8215 finished after 8 timesteps and reward was 1.0 
Episode 8244 finished after 11 timesteps and reward was 1.0 
Episode 8275 finished after 6 timesteps and reward was 1.0 
Episode 8279 finished after 8 timesteps and reward was 1.0 
Episode 8346 finished after 17 ti

Episode 15270 finished after 6 timesteps and reward was 1.0 
Episode 15281 finished after 22 timesteps and reward was 1.0 
Episode 15344 finished after 7 timesteps and reward was 1.0 
Episode 15409 finished after 6 timesteps and reward was 1.0 
Episode 15467 finished after 11 timesteps and reward was 1.0 
Episode 15483 finished after 17 timesteps and reward was 1.0 
Episode 15577 finished after 13 timesteps and reward was 1.0 
Episode 15596 finished after 6 timesteps and reward was 1.0 
Episode 15671 finished after 9 timesteps and reward was 1.0 
Episode 15726 finished after 10 timesteps and reward was 1.0 
Episode 15743 finished after 14 timesteps and reward was 1.0 
Episode 15765 finished after 8 timesteps and reward was 1.0 
Episode 15793 finished after 7 timesteps and reward was 1.0 
Episode 15884 finished after 15 timesteps and reward was 1.0 
Episode 15950 finished after 10 timesteps and reward was 1.0 
Episode 15995 finished after 8 timesteps and reward was 1.0 
Episode 16021 fi

Episode 23718 finished after 11 timesteps and reward was 1.0 
Episode 23742 finished after 15 timesteps and reward was 1.0 
Episode 23913 finished after 13 timesteps and reward was 1.0 
Episode 23943 finished after 12 timesteps and reward was 1.0 
Episode 23953 finished after 10 timesteps and reward was 1.0 
Episode 24035 finished after 9 timesteps and reward was 1.0 
Episode 24075 finished after 29 timesteps and reward was 1.0 
Episode 24199 finished after 16 timesteps and reward was 1.0 
Episode 24260 finished after 10 timesteps and reward was 1.0 
Episode 24573 finished after 11 timesteps and reward was 1.0 
Episode 25225 finished after 15 timesteps and reward was 1.0 
Episode 25227 finished after 7 timesteps and reward was 1.0 
Episode 25266 finished after 10 timesteps and reward was 1.0 
Episode 25466 finished after 26 timesteps and reward was 1.0 
Episode 25485 finished after 12 timesteps and reward was 1.0 
Episode 25590 finished after 11 timesteps and reward was 1.0 
Episode 25

Episode 33979 finished after 17 timesteps and reward was 1.0 
Episode 34021 finished after 11 timesteps and reward was 1.0 
Episode 34068 finished after 12 timesteps and reward was 1.0 
Episode 34320 finished after 13 timesteps and reward was 1.0 
Episode 34474 finished after 17 timesteps and reward was 1.0 
Episode 34496 finished after 11 timesteps and reward was 1.0 
Episode 34553 finished after 6 timesteps and reward was 1.0 
Episode 34696 finished after 16 timesteps and reward was 1.0 
Episode 34757 finished after 24 timesteps and reward was 1.0 
Episode 34850 finished after 13 timesteps and reward was 1.0 
Episode 34866 finished after 11 timesteps and reward was 1.0 
Episode 34890 finished after 24 timesteps and reward was 1.0 
Episode 34908 finished after 14 timesteps and reward was 1.0 
Episode 35004 finished after 10 timesteps and reward was 1.0 
Episode 35029 finished after 33 timesteps and reward was 1.0 
Episode 35048 finished after 10 timesteps and reward was 1.0 
Episode 3

Episode 42161 finished after 19 timesteps and reward was 1.0 
Episode 42307 finished after 8 timesteps and reward was 1.0 
Episode 42354 finished after 14 timesteps and reward was 1.0 
Episode 42476 finished after 6 timesteps and reward was 1.0 
Episode 42502 finished after 12 timesteps and reward was 1.0 
Episode 42557 finished after 18 timesteps and reward was 1.0 
Episode 42597 finished after 9 timesteps and reward was 1.0 
Episode 42727 finished after 13 timesteps and reward was 1.0 
Episode 42789 finished after 12 timesteps and reward was 1.0 
Episode 42817 finished after 7 timesteps and reward was 1.0 
Episode 42835 finished after 12 timesteps and reward was 1.0 
Episode 42905 finished after 18 timesteps and reward was 1.0 
Episode 43072 finished after 13 timesteps and reward was 1.0 
Episode 43111 finished after 13 timesteps and reward was 1.0 
Episode 43125 finished after 7 timesteps and reward was 1.0 
Episode 43184 finished after 9 timesteps and reward was 1.0 
Episode 43367 

Episode 51955 finished after 7 timesteps and reward was 1.0 
Episode 52018 finished after 30 timesteps and reward was 1.0 
Episode 52024 finished after 8 timesteps and reward was 1.0 
Episode 52044 finished after 19 timesteps and reward was 1.0 
Episode 52119 finished after 8 timesteps and reward was 1.0 
Episode 52242 finished after 15 timesteps and reward was 1.0 
Episode 52293 finished after 9 timesteps and reward was 1.0 
Episode 52462 finished after 9 timesteps and reward was 1.0 
Episode 52651 finished after 10 timesteps and reward was 1.0 
Episode 52705 finished after 20 timesteps and reward was 1.0 
Episode 52710 finished after 8 timesteps and reward was 1.0 
Episode 52820 finished after 14 timesteps and reward was 1.0 
Episode 52910 finished after 6 timesteps and reward was 1.0 
Episode 52924 finished after 11 timesteps and reward was 1.0 
Episode 53016 finished after 6 timesteps and reward was 1.0 
Episode 53053 finished after 18 timesteps and reward was 1.0 
Episode 53091 fi

Episode 60840 finished after 6 timesteps and reward was 1.0 
Episode 61051 finished after 12 timesteps and reward was 1.0 
Episode 61059 finished after 8 timesteps and reward was 1.0 
Episode 61140 finished after 11 timesteps and reward was 1.0 
Episode 61157 finished after 6 timesteps and reward was 1.0 
Episode 61234 finished after 12 timesteps and reward was 1.0 
Episode 61333 finished after 13 timesteps and reward was 1.0 
Episode 61343 finished after 16 timesteps and reward was 1.0 
Episode 61383 finished after 10 timesteps and reward was 1.0 
Episode 61474 finished after 21 timesteps and reward was 1.0 
Episode 61498 finished after 17 timesteps and reward was 1.0 
Episode 61500 finished after 18 timesteps and reward was 1.0 
Episode 61510 finished after 9 timesteps and reward was 1.0 
Episode 61511 finished after 12 timesteps and reward was 1.0 
Episode 61614 finished after 23 timesteps and reward was 1.0 
Episode 61663 finished after 19 timesteps and reward was 1.0 
Episode 6170

Episode 69753 finished after 23 timesteps and reward was 1.0 
Episode 69782 finished after 10 timesteps and reward was 1.0 
Episode 69910 finished after 10 timesteps and reward was 1.0 
Episode 69955 finished after 10 timesteps and reward was 1.0 
Episode 70022 finished after 10 timesteps and reward was 1.0 
Episode 70085 finished after 10 timesteps and reward was 1.0 
Episode 70147 finished after 7 timesteps and reward was 1.0 
Episode 70153 finished after 7 timesteps and reward was 1.0 
Episode 70163 finished after 7 timesteps and reward was 1.0 
Episode 70227 finished after 9 timesteps and reward was 1.0 
Episode 70269 finished after 25 timesteps and reward was 1.0 
Episode 70278 finished after 11 timesteps and reward was 1.0 
Episode 70331 finished after 8 timesteps and reward was 1.0 
Episode 70335 finished after 16 timesteps and reward was 1.0 
Episode 70438 finished after 15 timesteps and reward was 1.0 
Episode 70453 finished after 12 timesteps and reward was 1.0 
Episode 70481

Episode 77364 finished after 21 timesteps and reward was 1.0 
Episode 77484 finished after 11 timesteps and reward was 1.0 
Episode 77592 finished after 8 timesteps and reward was 1.0 
Episode 77663 finished after 8 timesteps and reward was 1.0 
Episode 77686 finished after 13 timesteps and reward was 1.0 
Episode 77735 finished after 8 timesteps and reward was 1.0 
Episode 77750 finished after 8 timesteps and reward was 1.0 
Episode 77762 finished after 21 timesteps and reward was 1.0 
Episode 77765 finished after 16 timesteps and reward was 1.0 
Episode 77820 finished after 17 timesteps and reward was 1.0 
Episode 77989 finished after 8 timesteps and reward was 1.0 
Episode 77991 finished after 6 timesteps and reward was 1.0 
Episode 78076 finished after 7 timesteps and reward was 1.0 
Episode 78292 finished after 16 timesteps and reward was 1.0 
Episode 78364 finished after 10 timesteps and reward was 1.0 
Episode 78564 finished after 11 timesteps and reward was 1.0 
Episode 78664 f

Episode 85603 finished after 17 timesteps and reward was 1.0 
Episode 85682 finished after 29 timesteps and reward was 1.0 
Episode 85699 finished after 15 timesteps and reward was 1.0 
Episode 85949 finished after 6 timesteps and reward was 1.0 
Episode 86043 finished after 25 timesteps and reward was 1.0 
Episode 86052 finished after 12 timesteps and reward was 1.0 
Episode 86068 finished after 9 timesteps and reward was 1.0 
Episode 86082 finished after 10 timesteps and reward was 1.0 
Episode 86113 finished after 17 timesteps and reward was 1.0 
Episode 86301 finished after 16 timesteps and reward was 1.0 
Episode 86441 finished after 7 timesteps and reward was 1.0 
Episode 86485 finished after 21 timesteps and reward was 1.0 
Episode 86494 finished after 21 timesteps and reward was 1.0 
Episode 86539 finished after 28 timesteps and reward was 1.0 
Episode 86559 finished after 8 timesteps and reward was 1.0 
Episode 86563 finished after 6 timesteps and reward was 1.0 
Episode 86567

Episode 94003 finished after 14 timesteps and reward was 1.0 
Episode 94023 finished after 6 timesteps and reward was 1.0 
Episode 94040 finished after 14 timesteps and reward was 1.0 
Episode 94050 finished after 6 timesteps and reward was 1.0 
Episode 94095 finished after 16 timesteps and reward was 1.0 
Episode 94132 finished after 7 timesteps and reward was 1.0 
Episode 94156 finished after 7 timesteps and reward was 1.0 
Episode 94200 finished after 7 timesteps and reward was 1.0 
Episode 94217 finished after 11 timesteps and reward was 1.0 
Episode 94339 finished after 16 timesteps and reward was 1.0 
Episode 94376 finished after 20 timesteps and reward was 1.0 
Episode 94434 finished after 9 timesteps and reward was 1.0 
Episode 94463 finished after 20 timesteps and reward was 1.0 
Episode 94470 finished after 17 timesteps and reward was 1.0 
Episode 94513 finished after 16 timesteps and reward was 1.0 
Episode 94530 finished after 17 timesteps and reward was 1.0 
Episode 94538 

We can look at the final values of the _Q_ table after training.

In [15]:
print(Q)

[[6.69248816e-01 6.56106214e-01 6.62835070e-01 6.50601117e-01]
 [6.01607319e-01 6.62811260e-01 1.98059159e-02 6.76492289e-01]
 [6.72699169e-01 6.60354253e-01 6.80655441e-01 6.57379174e-01]
 [1.24824086e-01 6.76385418e-01 5.51405262e-01 6.80961553e-01]
 [6.87201864e-01 5.62493031e-01 2.11837913e-02 1.12144967e-01]
 [0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00]
 [3.06795053e-02 1.37942598e-01 1.57777244e-01 6.82141775e-01]
 [0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00]
 [1.21505058e-01 6.53841948e-01 7.73904139e-01 8.21092327e-01]
 [8.57700710e-01 8.55795155e-01 8.40740029e-01 1.70050581e-01]
 [8.48389375e-01 8.73349694e-01 9.28839563e-01 1.65134694e-04]
 [0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00]
 [0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00]
 [6.97187100e-01 1.42029240e-01 9.32262976e-01 2.45882633e-02]
 [8.91436763e-01 8.23173829e-01 9.10819995e-01 9.76199594e-01]
 [0.00000000e+00 0.00000000e+00 0.00000000e+00 0.000000

### 2.6. Checking the improvement
In this last section we check that the trained agent performs better than the random agent.

The code is very similar to the one used to train the agent, but the learning step is omitted. We simulate several episodes using the values of the _Q_ table obtained during training. Specifically, in each state the agent selects the action with the highest value in the _Q_ table:

In [16]:
def choose_action_max(state):
    action = np.argmax(Q[state, :])
    return action
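
Applied to every state, this rule turns the _Q_ table into a policy. As an illustration, the sketch below takes the first four rows of the table printed above (rounded to three decimals) and maps each greedy action to its name, assuming FrozenLake's action encoding 0 = Left, 1 = Down, 2 = Right, 3 = Up. It uses plain Python instead of `np.argmax` so that it runs on its own.

```python
ACTION_NAMES = ["Left", "Down", "Right", "Up"]  # assumed FrozenLake action encoding

# First four rows of the trained Q table shown above, rounded to three decimals
Q_rows = [
    [0.669, 0.656, 0.663, 0.651],
    [0.602, 0.663, 0.020, 0.676],
    [0.673, 0.660, 0.681, 0.657],
    [0.125, 0.676, 0.551, 0.681],
]

def greedy_action(q_values):
    """Index of the largest value (the first one in case of ties, like np.argmax)."""
    return max(range(len(q_values)), key=lambda a: q_values[a])

policy = [ACTION_NAMES[greedy_action(row)] for row in Q_rows]
print(policy)  # ['Left', 'Up', 'Right', 'Up']
```

Rows that are still all zeros (holes and the goal cell, where no action is ever taken) fall back to action 0 under this rule.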

Again, we compute the total reward over several episodes and the success rate, which is higher than that of the random agent (16.7 % versus 1.6 % in the runs shown in this notebook).

The code also makes it possible to visualise (in a different way from the one used so far) the last episodes of the simulation; how many is set in the variable `num_shows`.

In [17]:
from IPython.display import clear_output

num_episodes = 1000
total_reward = 0
num_shows = 5
show_episode = False

# start
for episode in range(num_episodes):

    if (num_episodes - episode) <= num_shows:
        show_episode = True
        
    state = env.reset()
    
    if show_episode:
        print('')
        print('')
        print("*** Episode: ", episode+1)
        print('')
        print('')
        time.sleep(0.8)
        clear_output(wait=True)
        env.render()
    
    t = 0
    while t < max_steps:
        action = choose_action_max(state)  
        state, reward, done, info = env.step(action)  
        t += 1  # count the step so that the loop is really bounded by max_steps
        
        if show_episode:
            time.sleep(0.5)
            clear_output(wait=True)
            env.render()
        if done:
            break

    if show_episode:
        time.sleep(0.8)
        clear_output(wait=True)
        print('')
        print('')
        print('Reward = {}'.format(reward))
        print('')
        print('')
        time.sleep(0.8)
        clear_output(wait=True)
    
    total_reward += reward
    
success_rate = total_reward*100/num_episodes
print("{} successes in {} episodes: {} % of success".format(total_reward, num_episodes, success_rate))

167.0 successes in 1000 episodes: 16.7 % of success
