<a href="https://colab.research.google.com/github/medinavi/2DTS/blob/main/RL_Exerc%C3%ADcio_1_Q_Learning_Cliff_Walking.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## In this notebook, you will code from scratch your second Reinforcement Learning agent, which plays Cliff Walking using Q-Learning

Adapted from HuggingFace

<img src="https://www.gymlibrary.dev/_images/cliff_walking.gif" alt="Environments"/>

### 🎮 Environments:

- [CliffWalking-v0](https://www.gymlibrary.dev/environments/toy_text/cliff_walking/)


### 📚 RL-Library:

- Python and NumPy
- [Gym](https://www.gymlibrary.dev/)

## Install dependencies and create a virtual display 🔽


In [None]:
!pip install gym==0.24
!pip install pygame
!pip install numpy

!pip install huggingface_hub
!pip install pickle5
!pip install pyyaml==6.0
!pip install imageio
!pip install imageio_ffmpeg
!pip install pyglet==1.5.1
!pip install tqdm

In [None]:
%%capture
!sudo apt-get update
!apt install python-opengl ffmpeg xvfb
!pip3 install pyvirtualdisplay

To make sure the newly installed libraries are used, **it is sometimes necessary to restart the notebook runtime**. The next cell, if uncommented, forces the **runtime to crash, so you will need to reconnect and run the code from this point on**.

In [None]:
#import os
#os.kill(os.getpid(), 9)

In [None]:
# Virtual display
from pyvirtualdisplay import Display

virtual_display = Display(visible=0, size=(1400, 900))
virtual_display.start()

## Import the packages 📦

In addition to the installed libraries, we also use:

- `random`: To generate random numbers (useful for the epsilon-greedy policy).
- `imageio`: To generate a replay video.

In [None]:
import numpy as np
import gym
import random
import imageio
import os
import tqdm
import time

import pickle5 as pickle
from tqdm.notebook import tqdm

# Part 1: CliffWalking

## Creating the CliffWalking-v0 environment (https://www.gymlibrary.dev/environments/toy_text/cliff_walking/)
---

💡 A good habit when you start using an environment is to check its documentation

👉 https://www.gymlibrary.dev/environments/toy_text/cliff_walking/

---


In [None]:
# Create the environment; the cells below rely on the `env` variable
env = gym.make("CliffWalking-v0")

### Inspect the environment:


In [None]:
print("_____OBSERVATION SPACE_____ \n")
print("Observation Space", env.observation_space)
print("Sample observation", env.observation_space.sample()) # Get a random observation

In [None]:
print("\n _____ACTION SPACE_____ \n")
print("Action Space Shape", env.action_space.n)
print("Action Space Sample", env.action_space.sample()) # Take a random action

## Defining the hyperparameters ⚙️
The exploration-related hyperparameters are some of the most important ones.

- We need to make sure that our agent **explores the state space** enough to learn a good value approximation. To do that, epsilon must decay progressively.
- If you decrease epsilon too fast (decay_rate too high), **you risk your agent getting stuck**, because it has not explored the state space enough and therefore cannot solve the problem.
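
To get a feel for how fast exploration fades, the sketch below evaluates the exponential decay schedule at a few episode numbers. It assumes the same formula and values that this notebook uses further down (`min_epsilon = 0.05`, `max_epsilon = 1.0`, `decay_rate = 0.0005`); the helper name `epsilon_at` is hypothetical and exists only for this illustration.

```python
import numpy as np

def epsilon_at(episode, min_epsilon=0.05, max_epsilon=1.0, decay_rate=0.0005):
    """Exploration probability at a given episode (hypothetical helper).

    Assumes the exponential schedule used by the training loop in this notebook.
    """
    return min_epsilon + (max_epsilon - min_epsilon) * np.exp(-decay_rate * episode)

# Print epsilon at a few points of a 10,000-episode run
for episode in (0, 1000, 5000, 10000):
    print(f"episode {episode:>5}: epsilon = {epsilon_at(episode):.3f}")

# A ten times larger decay_rate reaches the floor much earlier
print(f"episode  1000 with decay_rate=0.005: {epsilon_at(1000, decay_rate=0.005):.3f}")
```

Epsilon starts at `max_epsilon` and approaches `min_epsilon` without ever dropping below it; raising `decay_rate` shortens the stretch of training in which the agent still explores often.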

In [None]:
state_space = env.observation_space.n
print("There are ", state_space, " possible states")

action_space = env.action_space.n
print("There are ", action_space, " possible actions")

In [None]:
# Create our Qtable of size (state_space, action_space) and initialize every value to 0 using np.zeros
def initialize_q_table(state_space, action_space):
  Qtable = np.zeros((state_space, action_space))
  return Qtable

In [None]:
Qtable_CliffWalking = initialize_q_table(state_space, action_space)

In [None]:
Qtable_CliffWalking

In [None]:
def greedy_policy(Qtable, state):
  # Exploitation: take the action with the highest state, action value
  action = np.argmax(Qtable[state][:])
  
  return action

In [None]:
def epsilon_greedy_policy(Qtable, state, epsilon):
  # Randomly generate a number between 0 and 1
  random_int = random.uniform(0,1)
  # if random_int > epsilon --> exploitation
  if random_int > epsilon:
    # Take the action with the highest value given a state
    # np.argmax can be useful here
    action = greedy_policy(Qtable, state)
  # else --> exploration
  else:
    action = env.action_space.sample()
  
  return action
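
The two extremes of epsilon make the behaviour of this policy easy to check. The sketch below is a self-contained variant that works on a single Q-table row and does not need `env`; the names `epsilon_greedy_sketch`, `q_row` and `rng` are hypothetical and are not part of the notebook's code.

```python
import random
import numpy as np

def epsilon_greedy_sketch(q_row, epsilon, rng):
    """Pick an action from one Q-table row (hypothetical helper for illustration)."""
    if rng.random() > epsilon:
        # Exploitation: index of the largest Q-value in the row
        return int(np.argmax(q_row))
    # Exploration: uniformly random action index
    return rng.randrange(len(q_row))

rng = random.Random(0)                    # seeded so the run is repeatable
q_row = np.array([0.0, -1.0, 2.0, 0.5])   # made-up Q-values for one state

# epsilon = 0.0: exploration is never chosen, so the best action (index 2) is picked
greedy_picks = [epsilon_greedy_sketch(q_row, 0.0, rng) for _ in range(100)]
# epsilon = 1.0: exploitation is never chosen, so actions are sampled at random
random_picks = [epsilon_greedy_sketch(q_row, 1.0, rng) for _ in range(1000)]

print("distinct actions with epsilon=0.0:", sorted(set(greedy_picks)))
print("distinct actions with epsilon=1.0:", sorted(set(random_picks)))
```

Values of epsilon between these extremes mix the two behaviours, which is what the decaying schedule exploits during training.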

In [None]:
# Training parameters
n_training_episodes = 10000 # Total number of training episodes
learning_rate = 0.7 # Learning rate

# Evaluation parameters
n_eval_episodes = 100 # Total number of test episodes

# Environment parameters
env_id = "CliffWalking-v0" # Name of the environment
max_steps = 99 # Max steps per episode
gamma = 0.95 # Discount rate
eval_seed = [] # The evaluation seeds of the environment

# Exploration parameters
max_epsilon = 1.0 # Exploration probability at the start
min_epsilon = 0.05 # Minimum exploration probability
decay_rate = 0.0005 # Exponential decay rate for the exploration probability

In [None]:
def train(n_training_episodes, min_epsilon, max_epsilon, decay_rate, env, max_steps, Qtable):
  for episode in tqdm(range(n_training_episodes)):
    # Reduce epsilon (because we need less and less exploration)
    epsilon = min_epsilon + (max_epsilon - min_epsilon)*np.exp(-decay_rate*episode)
    # Reset the environment
    state = env.reset()
    step = 0
    done = False

    # Repeat for at most max_steps steps
    for step in range(max_steps):
      # Choose the action At using the epsilon-greedy policy
      action = epsilon_greedy_policy(Qtable, state, epsilon)

      # Take action At and observe Rt+1 and St+1
      # Take the action (a) and observe the outcome state(s') and reward (r)
      new_state, reward, done, info = env.step(action)

      # Update Q(s,a):= Q(s,a) + lr [R(s,a) + gamma * max Q(s',a') - Q(s,a)]
      Qtable[state][action] = Qtable[state][action] + learning_rate * (reward + gamma * np.max(Qtable[new_state]) - Qtable[state][action])   

      # If done, finish the episode
      if done:
        break
      
      # Our next state is the new state
      state = new_state
  return Qtable
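
As a worked instance of the update rule in the training loop, the sketch below applies one Q-Learning update by hand. It assumes the hyperparameters defined above (`learning_rate = 0.7`, `gamma = 0.95`), a fresh all-zero table, and one illustrative transition: from the start state 36, action 0 (up) leads to state 24 with reward -1, following the CliffWalking documentation. The variable names `q_table`, `td_target` and `td_error` are hypothetical.

```python
import numpy as np

learning_rate = 0.7   # same value as in the hyperparameters cell
gamma = 0.95          # same value as in the hyperparameters cell

# Fresh table for the 4 x 12 grid: 48 states, 4 actions
q_table = np.zeros((48, 4))

# One illustrative transition: start state 36, action 0 (up), reward -1, next state 24
state, action, reward, new_state = 36, 0, -1, 24

# Target: immediate reward plus the discounted best value of the next state
td_target = reward + gamma * np.max(q_table[new_state])
# Error: how far the current estimate is from that target
td_error = td_target - q_table[state, action]
# Move the estimate a fraction learning_rate of the way toward the target
q_table[state, action] += learning_rate * td_error

print(q_table[state, action])  # 0 + 0.7 * (-1 + 0.95 * 0 - 0) = -0.7
```

Because every entry starts at 0, the first updates only propagate the step penalty; the values near the goal become informative first and then spread backward through `np.max(q_table[new_state])` over later episodes.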

## Training the Q-Learning agent 🏃

In [None]:
Qtable_CliffWalking = train(n_training_episodes, min_epsilon, max_epsilon, decay_rate, env, max_steps, Qtable_CliffWalking)

## Inspecting the Q-Learning table 👀

In [None]:
Qtable_CliffWalking

## Evaluation method 📝

- We define the evaluation method that we will use to test our Q-Learning agent.

In [None]:
def evaluate_agent(env, max_steps, n_eval_episodes, Q, seed):
  """
   Evaluate the agent for ``n_eval_episodes`` episodes and return the mean reward and the standard deviation of the reward.
   :param env: The evaluation environment
   :param max_steps: Maximum number of steps per episode
   :param n_eval_episodes: Number of episodes to evaluate the agent
   :param Q: The Q-table
   :param seed: The list of evaluation seeds (for taxi-v3)
   """
  episode_rewards = []
  for episode in tqdm(range(n_eval_episodes)):
    if seed:
      state = env.reset(seed=seed[episode])
    else:
      state = env.reset()
    step = 0
    done = False
    total_rewards_ep = 0
    
    for step in range(max_steps):
      # Take the action (index) that has the maximum expected future reward given that state
      action = greedy_policy(Q, state)
      new_state, reward, done, info = env.step(action)
      total_rewards_ep += reward
        
      if done:
        break
      state = new_state
    episode_rewards.append(total_rewards_ep)
  mean_reward = np.mean(episode_rewards)
  std_reward = np.std(episode_rewards)

  return mean_reward, std_reward

## Evaluating our Q-Learning agent 📈


In [None]:
# Evaluate our Agent
mean_reward, std_reward = evaluate_agent(env, max_steps, n_eval_episodes, Qtable_CliffWalking, eval_seed)
print(f"Mean_reward={mean_reward:.2f} +/- {std_reward:.2f}")

#### Do not modify this part


In [None]:
def play_actions(env, Qtable, delay = 1):
  """
   Play one episode with the greedy policy, printing each action and rendering each step
   :param env: The environment
   :param Qtable: Our agent's Qtable
   :param delay: Seconds to wait between steps
   :return: The list of values returned by env.render at each step
   """

  sequences = []
  done = False
  state = env.reset(seed=random.randint(0,500))
  txt = env.render(mode='human')

  sequences.append(txt)
  while not done:
    # Take the action (index) that has the maximum expected future reward given that state
    action = np.argmax(Qtable[state][:])
    print(action)
    state, reward, done, info = env.step(action) # We assign the next state directly to state
    txt = env.render(mode='human')
    sequences.append(txt)
    print(txt)
    time.sleep(delay)

  return sequences
 

In [None]:
play_actions(env, Qtable_CliffWalking,  1)

In [None]:
def record_video(env, Qtable, out_directory, fps=1):
  """
   Generate a replay video of the agent
   :param env: The environment
   :param Qtable: Our agent's Qtable
   :param out_directory: Path of the output video file
   :param fps: How many frames per second (with taxi-v3 and frozenlake-v1 we use 1)
   """
  images = []  
  done = False
  state = env.reset(seed=random.randint(0,500))
  img = env.render(mode='rgb_array')
  images.append(img)
  while not done:
    # Take the action (index) that has the maximum expected future reward given that state
    action = np.argmax(Qtable[state][:])
    state, reward, done, info = env.step(action) # We assign the next state directly to state for the recording logic
    img = env.render(mode='rgb_array')
    images.append(img)
  imageio.mimsave(out_directory, [np.array(img) for i, img in enumerate(images)], fps=fps)