<a href="https://colab.research.google.com/github/marrej/ML-projects/blob/main/DeepQLearning_With_Pacman.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Deep Q learning on Atari games

This notebook tries out the basics of deep Q-learning (implementing the neural network and the algorithm from scratch) on Atari games (e.g. Space Invaders; the code below uses Ms. Pac-Man).
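
For reference, the training loop further down regresses the online network toward a one-step target computed by a separate target network. As a sketch in notation assumed here (not taken from the notebook: $r$ is the clipped reward, $d$ the done flag, $\gamma$ the `gamma` argument, $\tau$ the `tau` argument, $\theta$ the `q_net` weights, $\theta^-$ the `target_net` weights, $B$ the batch size), the quantities it computes are roughly:

$$
y_i = r_i + \gamma \,(1 - d_i)\, \max_{a'} Q_{\theta^-}(s'_i, a'),
\qquad
L(\theta) = \frac{1}{B} \sum_{i=1}^{B} \big(y_i - Q_{\theta}(s_i, a_i)\big)^2,
\qquad
\theta^- \leftarrow \tau\,\theta + (1 - \tau)\,\theta^-
$$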

In [None]:
!pip3 install gymnasium gymnasium[atari] gymnasium[accept-rom-license]
!pip3 install imageio
!pip3 install stable_baselines3
!pip3 install tqdm


In [None]:
!pip3 install ale_py==0.8.1

In [None]:
import gymnasium as gym
from stable_baselines3.common.atari_wrappers import (
    ClipRewardEnv,
    EpisodicLifeEnv,
    FireResetEnv,
    MaxAndSkipEnv,
    NoopResetEnv,
)

def make_env():
  def thunk():
    # Use grayscale so that we reduce the amount of data in the net
    game_env = gym.make("MsPacmanNoFrameskip-v4", obs_type="grayscale", render_mode="rgb_array")
    # Should resize to reduce the amount of consumed space
    game_env = gym.wrappers.ResizeObservation(game_env, (84, 84))
    game_env = ClipRewardEnv(game_env)
    # This bugs the game_env
    game_env = NoopResetEnv(game_env, noop_max=30)
    game_env = MaxAndSkipEnv(game_env, skip=4)
    game_env = EpisodicLifeEnv(game_env)
    # # Use 4 image states
    game_env = gym.wrappers.FrameStack(game_env, 4)
    game_env = gym.wrappers.RecordEpisodeStatistics(game_env)

    return game_env
  return thunk

game_env = make_env()()

In [None]:
from PIL import Image
import numpy as np

game_env.reset()
Image.fromarray(np.uint8(game_env.render()))

In [None]:
import imageio
import random

def record_video(env, QAgent, out_directory, fps=15):
    """
    Generate a replay video of the agent
    :param env
    :param QAgent: QAgent of our agent
    :param out_directory
    :param fps: frames per second (with taxi-v3 and frozenlake-v1 we use 1)
    """
    images = []
    for i in range(3):
      terminated = False
      truncated = False
      state, info = env.reset(seed=random.randint(0, 500))

      img = env.render()
      images.append(img)
      # Stop as soon as the episode is either terminated or truncated
      while not (terminated or truncated):
          # Take the action (index) that has the maximum expected future reward given that state
          action = QAgent.get_action(np.array(state), epsilon=0)

          state, reward, terminated, truncated, info = env.step(
              action
          )
          # We directly put next_state = state for recording logic
          img = env.render()
          images.append(img)
    imageio.mimsave(out_directory, [np.array(img) for i, img in enumerate(images)], fps=fps)

class DummyQAgent():
  def get_action(self, state, epsilon):
    return random.randint(0, 8)

# Generates a dummy video for a random traversal
ag = DummyQAgent()
record_video(make_env()(), ag, "replay.mp4")

In [None]:
# mount G drive
from google.colab import drive
drive.mount('/content/drive')

## NN definition
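
The `nn.Linear(3136, 512)` layer in the network below depends on the 84x84 input size and on the three convolution settings. As a quick sanity check (a sketch; `conv_out` is a hypothetical helper, and it assumes unpadded convolutions as used in the `QNet` definition), the flattened feature count can be derived by hand:

```python
def conv_out(size, kernel, stride):
    """Output length of an unpadded convolution along one axis."""
    return (size - kernel) // stride + 1


side = 84  # the observations are resized to 84x84
# (kernel, stride) pairs of the three Conv2d layers in QNet
for kernel, stride in [(8, 4), (4, 2), (3, 1)]:
    side = conv_out(side, kernel, stride)

# The last convolution has 64 output channels
flat_features = 64 * side * side
print(side, flat_features)
```

If the observation size or the number of stacked frames changes, rerunning this check gives the new input size for the first linear layer.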

In [None]:
import torch
import torch.nn as nn
from stable_baselines3.common.buffers import ReplayBuffer
import torch.nn.functional as F
import torch.optim as optim
from tqdm import tqdm

class QNet(nn.Module):
  def __init__(self, env):
    super().__init__()
    self.network = nn.Sequential(
        ## We use 4 images as a state, but could increase this
        nn.Conv2d(4, 32, 8, stride=4),
        nn.ReLU(),
        nn.Conv2d(32, 64, 4, stride=2),
        nn.ReLU(),
        nn.Conv2d(64, 64, 3, stride=1),
        nn.ReLU(),
        nn.Flatten(),
        nn.Linear(3136, 512),
        nn.ReLU(),
        nn.Linear(512, env.single_action_space.n),
    )

  def forward(self, x):
    return self.network(x / 255.0)

class DeepQLearning():
  def __init__(self, is_cuda=False):
    """
    params
    is_cuda = False: sets the cuda as main processing device if available
    """
    self.is_cuda = is_cuda

  def init_nets(self, env):
    # initialize a training model (target)
    # initialize a Qmodel
    self.target_net = QNet(env).to(self.get_device())
    self.q_net = QNet(env).to(self.get_device())
    self.target_net.load_state_dict(self.q_net.state_dict())

  def get_epsilon(self, duration, time_step, e_min=0.05, e_max=1.0):
    # Linear decay: epsilon drops by `slope` per step until it reaches e_min
    slope = (e_max - e_min)/duration
    return max(e_min, e_max - slope*time_step)

  def get_device(self):
    return torch.device("cuda" if torch.cuda.is_available() and self.is_cuda else "cpu")

  def fine_tune(self,
            q_net,
            ens,
            start_training_after_steps=20000,
            max_epsilon=1.0,
            min_epsilon=0.05,
            max_steps=3000000,
            batch_size=32,
            learning_rate=0.0001,
            # At which timesteps do we update the target network
            target_network_frequency_update=5000,
            # Update rate of the target network
            tau = 0.9,
            gamma=0.99,
            # how often should we  run eval
            eval_frequency=50000,
            checkpoint_frequency=50000,
            # How often should we train
            training_frequency=4):
    print('Fine tuning')
    env = gym.vector.SyncVectorEnv(ens)

    # Initialize new nets
    self.target_net = QNet(env).to(self.get_device())
    self.q_net = QNet(env).to(self.get_device())

    # Copy params from the loaded net
    # Open question: should the target_net start from random weights and then be blended with the q_net via tau, the same way as in training, so the two are not equal from the start? Otherwise we might hinder training.
    self.q_net.load_state_dict(q_net.state_dict())
    # Blend the loaded q_net weights into the target_net
    self.update_target_params(tau)

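    # fine_tuning=True keeps train() from re-initializing the nets loaded above;
    # preloaded_envs reuses the vector env instead of building a second one
    return self.train(ens,
            start_training_after_steps=start_training_after_steps,
            max_epsilon=max_epsilon,
            min_epsilon=min_epsilon,
            max_steps=max_steps,
            batch_size=batch_size,
            learning_rate=learning_rate,
            target_network_frequency_update=target_network_frequency_update,
            tau=tau,
            gamma=gamma,
            eval_frequency=eval_frequency,
            checkpoint_frequency=checkpoint_frequency,
            training_frequency=training_frequency,
            fine_tuning=True,
            preloaded_envs=env)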

  def train(self,
            ens,
            start_training_after_steps=100000,
            max_epsilon=1.0,
            min_epsilon=0.05,
            max_steps=5000000,
            batch_size=32,
            learning_rate=0.001,
            # At which timesteps do we update the target network
            target_network_frequency_update=5000,
            # Update rate of the target network
            tau = 0.9,
            gamma=0.9,
            # how often should we run eval
            eval_frequency=50000,
            checkpoint_frequency=50000,
            # How often should we train
            training_frequency=4,
            fine_tuning=False,
            preloaded_envs = None
            ):
    print('Training initialized')
    if preloaded_envs is None:
      env = gym.vector.SyncVectorEnv(ens)
    else:
      env = preloaded_envs

    if not fine_tuning:
      self.init_nets(env)
      print('Nets initialized')

    # Initialize optimizer
    optimizer = optim.Adam(self.q_net.parameters(), lr=learning_rate)

    # initialize a ActionReplay

    ## How do we store the rewards?
    replay_buffer = ReplayBuffer(
        n_envs=len(ens),
        action_space=env.single_action_space,
        observation_space=env.single_observation_space,
        buffer_size=80000,
        device=self.get_device(),
        handle_timeout_termination=False
    )

    # History of loss/rewards through steps
    history = []

    # initialize the base state
    observation, info = env.reset()

    # do a Sample/Train cycle
    for global_step in tqdm(range(max_steps)):

      epsilon = self.get_epsilon(e_min=min_epsilon, e_max=max_epsilon, time_step=global_step, duration=max_steps)

      ## Do n sampling cycles

      ### Sample an action from the training model (using the epsilon-greedy policy)
      action = self.get_epsilon_greedy_action(env, len(ens), observation, epsilon)

      next_observations, reward, termination, truncation, info = env.step(action)

      ### Add the state,action,reward to the ReplayAction stack
      replay_buffer.add(observation, next_observations.copy(), action, reward, termination, info)

      # Use the next observation as base for next run
      observation = next_observations


      if global_step >= start_training_after_steps and global_step % training_frequency == 0:
        ## Retrieve N random state sequences (j, j+n)
        sequence = replay_buffer.sample(batch_size)

        with torch.no_grad():
          # grab the target values (will contain batchsize*amt of actions)
          target_max, arg_max = self.target_net(sequence.next_observations).to(float).max(dim=1)
          target_val = sequence.rewards.flatten().to(float) + gamma * target_max * (1.0 - sequence.dones.flatten().to(float))

        ### Run the Q model on the sampled observations and keep the Q-values of the actions that were actually taken
        # gather() picks the executed actions' values and squeeze() reduces the dimensions (from 32x1 to 32)
        # https://stackoverflow.com/questions/50999977/what-does-the-gather-function-do-in-pytorch-in-layman-terms
        old_val = self.q_net(sequence.observations).gather(1, sequence.actions).squeeze()

        # Converting the values to float to avoid problems with model being Float but values being double
        # see https://stackoverflow.com/questions/67456368/pytorch-getting-runtimeerror-found-dtype-double-but-expected-float
        loss = F.mse_loss(target_val.float(), old_val.float())

        # calculate the loss and update the Q_net
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        # So that the Q network doesn't play cat and mouse with a constantly moving target, we only update the target network every few timesteps
        if global_step % target_network_frequency_update == 0:
          self.update_target_params(tau)

      # Do checkpoints while training
      if global_step % checkpoint_frequency == 0 and global_step >= start_training_after_steps:
        checkpoint_name = 'rl_checkpoint_4_agents___steps_at_{0}'
        torch.save(self.q_net.state_dict(), checkpoint_name.format(global_step))

      # If termination + truncation are all true then we should restart all envs
      # if np.mean(termination+truncation) > 0.0:
        # print("\n!restarting at step {0}".format(global_step))
        # not needed due to episodic life from atari wrappers
        # observation, info = env.reset()

      # Run eval only after training the model
      if global_step % eval_frequency == 0 and global_step >= start_training_after_steps:
        mean_rewards, mse, max_rewards, avg_steps = eval(self.q_net, make_env(), eval_episodes=2, target_net=self.target_net)
        history.append({"step": global_step, "rewards": mean_rewards, "mse": mse, "avg_steps": avg_steps, "max_rewards": max_rewards})
        print('\nStep: ', global_step, ', mean rewards: ', mean_rewards, ', mse:', mse, ', epsilon: ', epsilon, ', avg_steps: ',avg_steps,'\n' )

    return self.q_net, history

  def update_target_params(self, tau):
    target_params_copy = self.target_net.state_dict().copy()
    for q_key, target_key in zip(self.q_net.state_dict(), self.target_net.state_dict()):
      ## Walk the state dicts of both nets in parallel; matching keys let us blend each parameter using tau
      if q_key != target_key:
        raise Exception("Sorry, uneven nets with unassignable parameters")

      # Update the target network with a mixture of q_params + target_params (weight + bias), using tau as the ratio of q params to target params
      target_params_copy[q_key] = self.q_net.state_dict()[q_key] * tau + self.target_net.state_dict()[target_key] * (1.0 - tau)

    # Update the target network parameters with the updated parameters
    self.target_net.load_state_dict(target_params_copy)

  def get_epsilon_greedy_action(self, env, env_n, state, epsilon):
    if random.random() > epsilon:
      # Question this piece. I am not fully sure that this is correct, although the network seems to train
      q_vals = self.q_net(torch.Tensor(state).to(self.get_device()))
      return torch.argmax(q_vals, dim=1).cpu().numpy()
    else:
      # Generate a random action for each environment
      return np.array([env.single_action_space.sample() for _ in range(env_n)])

# Detached from the main object (generic functions)

def get_greedy_action(q_net, observation):
    # Q-values for every action, shape (1, n_actions); take the argmax once
    q_val = q_net(torch.Tensor(observation).to('cpu').reshape(1, 4, 84, 84))
    return torch.argmax(q_val, dim=1).cpu().numpy()

# Gets the value of the best action
def get_greedy_val(q_net, observation):
    q_val = q_net(torch.Tensor(observation).to('cpu').reshape(1, 4, 84, 84)).max(dim=1)
    return q_val.values.cpu().detach().numpy()

def eval(q_net, env_func, eval_episodes=100, max_eval_steps=100000, target_net=None, lives_in_episode=3, save_video=False):
  env_to_eval = env_func()
  rewards = []
  loss = []
  steps = []
  max_rewards_per_episode = 0
  for episode in tqdm(range(eval_episodes)):
    images = []
    episode_rewards = 0
    episode_steps = 0
    # Due to 3 lives
    for i in range(lives_in_episode):
      observation, info = env_to_eval.reset()
      step = 0
      truncation = False
      termination = False
      life_rewards = []

      img = env_to_eval.render()
      images.append(img)

      while not truncation and not termination and step < max_eval_steps:
        # Greedy action as a plain int, which is what a single (non-vector) env expects
        with torch.no_grad():
          action = q_net(torch.Tensor(observation).to('cpu').reshape(1, 4, 84, 84)).argmax(dim=1).item()
        next_observations, reward, termination, truncation, info = env_to_eval.step(action)

        img = env_to_eval.render()
        images.append(img)

        life_rewards.append(np.array(reward).sum())

        # If a target_net is given, track the squared gap between the greedy values of the two nets
        if target_net is not None:
          expected_val = get_greedy_val(target_net, observation)
          used_val = get_greedy_val(q_net, observation)
          squared_loss = (used_val - expected_val) ** 2
          loss.append(squared_loss)

        observation = next_observations
        step=step+1
        episode_steps=episode_steps+1

      episode_rewards = episode_rewards + np.array(life_rewards).sum()

    steps.append(episode_steps)
    rewards.append(episode_rewards)
    max_rewards_per_episode = max(max_rewards_per_episode, np.array(episode_rewards).sum())

    # Save the episode video only when requested
    if save_video:
      imageio.mimsave("eval_ep{0}.mp4".format(episode), [np.array(img) for img in images], fps=20)
  return np.array(rewards).mean(), np.array(loss).mean(), max_rewards_per_episode, np.array(steps).mean()

agents = [make_env() for i in range(4)]
trained_q, history = DeepQLearning(False).train(agents)

In [None]:
# Loading and saving the checkpoint https://pytorch.org/tutorials/beginner/saving_loading_models.html#saving-loading-model-for-inference
# torch.save(trained_q.state_dict(), 'rl_checkpoint_4_agents')

eval_env_single = make_env()
eval_env = gym.vector.SyncVectorEnv([eval_env_single])
loaded_q = QNet(eval_env)
loaded_q.load_state_dict(torch.load('rl_checkpoint_800k_steps'))

In [None]:
agents = [make_env() for i in range(2)]
trained_q, history = DeepQLearning(False).fine_tune(loaded_q, agents, max_epsilon=0.7)
torch.save(trained_q.state_dict(), 'rl_checkpoint_2_agents_fine_tuned')

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
eval(loaded_q, eval_env_single, eval_episodes=2, max_eval_steps=1000000, save_video=True)

In [None]:
# QAgent used for video recording
class QAgent():
  def __init__(self, q_net):
    self.q_net = q_net

  def get_action(self, state, epsilon):
    # Return a plain int so the single (non-vector) env can consume the action directly
    return self.q_net(torch.Tensor(state).to('cpu').reshape(1, 4, 84, 84)).argmax(dim=1).item()


agent = QAgent(loaded_q)
record_video(eval_env_single(), agent, "replay.mp4")

Notes:

The best-performing agent was the one at 10k steps. So maybe the problem is not that the agents are misbehaving, but that I am using too few envs (e.g. I should be running hundreds of agents at the same time). Or my epsilon goes crazy because it is step related rather than slope related. Maybe it should become less aggressive based on the mean error of a batch rather than following a constant slope?

- Add it to the eval
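
One way to probe the epsilon question is to compare schedules offline before spending training time on them. The notebook's `get_epsilon` decays linearly over the whole of `max_steps`, starting at step 0, so part of the decay happens during the warm-up phase in which nothing is learned yet. The sketch below is an assumption rather than the notebook's method: `linear_epsilon`, `warmup_steps` and `decay_steps` are hypothetical names, and the 500k decay length is only an illustrative value. It holds epsilon at its maximum during warm-up and only then decays it.

```python
def linear_epsilon(step, warmup_steps, decay_steps, e_max=1.0, e_min=0.05):
    """Hold e_max during warm-up, then decay linearly to e_min over decay_steps."""
    if step < warmup_steps:
        return e_max
    # Fraction of the decay window that has elapsed, capped at 1.0
    progress = min(1.0, (step - warmup_steps) / decay_steps)
    return e_max + progress * (e_min - e_max)


# Print the schedule at a few steps (hypothetical warm-up and decay lengths)
for step in (0, 100_000, 350_000, 600_000, 5_000_000):
    eps = linear_epsilon(step, warmup_steps=100_000, decay_steps=500_000)
    print(step, round(eps, 3))
```

Plotting or printing such a schedule next to the eval history makes it easier to tell whether a plateau in rewards coincides with epsilon still being high.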

# TODOS:

- Add evaluation in batches (which returns average rewards + video if requested)
- Rework the structure to support N environments (of any kind)
- Add eval data that will be exported once training ends so we can plot steps, mean_error, rewards

Blog post notes:

- How do we do the epsilon? Is a step-based schedule correct?
- How many agents should we use? More agents, fewer global steps?
- How does starting training sooner (sampling from the batch) affect the training? Do we first need more data to be sampled? Or do we delay it just to save processing time on the first N steps?
  - Starting training after N steps should give the agents enough room to experiment before learning and adjusting, and avoids oversampling what was already learned. E.g. a batch of 32 is too much if we start learning after 100 steps.
- Should we experiment with the restarting? Can that overfit? When should we restart, e.g. when half of the environments are dead?
- Although the MSE is improving, we don't see an improvement in rewards (updating to use slope instead of step decay) -> Should we consider using only the duration after we start training, so that epsilon decays over the steps that are actually learned from?
- Did I really make an error in the epsilon?! -> Always double-check it in eval rounds.
- Strive to understand the underlying pieces. Why are some properties updated in a simple way, and why do others require slices?
- Validate. Don't just blindly copy and paste things, or copy and rewrite them. You need to get in and play around to get a better understanding.
- Added a fine_tune method so we can continue from where the training last stopped -> This works around dropouts in Colab. What we could do as well is store the *target_net*, which would really allow us to continue from where we left off. It would also be interesting to append the history so we can keep observing the improvements.
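
The last note (storing the target net and the history so training can truly resume) could look roughly like the sketch below. It is a minimal illustration under assumptions: `save_training_state` and `load_training_state` are hypothetical helpers that do not exist in the notebook, the tiny `nn.Linear` layers stand in for `QNet` so the example runs without the Atari environment, and the history entry is a dummy value. It only uses `torch.save`, `torch.load` and `state_dict`, which the notebook already relies on.

```python
import os
import tempfile

import torch
import torch.nn as nn


def save_training_state(path, q_net, target_net, history, global_step):
    """Bundle everything needed to resume training into a single file."""
    torch.save(
        {
            "q_net": q_net.state_dict(),
            "target_net": target_net.state_dict(),
            "history": history,
            "global_step": global_step,
        },
        path,
    )


def load_training_state(path, q_net, target_net):
    """Restore both networks in place and return (history, global_step)."""
    state = torch.load(path, map_location="cpu")
    q_net.load_state_dict(state["q_net"])
    target_net.load_state_dict(state["target_net"])
    return state["history"], state["global_step"]


# Stand-in networks (assumption: the real code would pass QNet instances)
q_net, target_net = nn.Linear(4, 2), nn.Linear(4, 2)
history = [{"step": 50000, "rewards": 0.0}]  # dummy entry, not a real result

path = os.path.join(tempfile.mkdtemp(), "resume.pt")
save_training_state(path, q_net, target_net, history, global_step=50000)

# Fresh networks receive the saved weights
restored_q, restored_target = nn.Linear(4, 2), nn.Linear(4, 2)
restored_history, restored_step = load_training_state(path, restored_q, restored_target)
```

Keeping the history entries as plain Python numbers (rather than NumPy scalars) should keep the file loadable with the stricter default loading mode of recent PyTorch versions.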

Leader board:

| Steps | MaxScore | MeanScore | Info |
| --- | --- | --- | --- |
| 840k | 1140 | 75.0 | direct training with lr 0.001 |
| 300k fine-tune | | | trained 300k steps directly with lr 0.001, then fine-tuned for an additional 5M steps and increased the actors to 20 |