<a href="https://colab.research.google.com/github/marrej/ML-projects/blob/main/RL_Q_learning_with_frozen_lake.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Q Learning introduction

Utilize Q learning to solve deterministic & non deterministic puzzles (e.g. Icy field).

In [65]:
# Retrieve virtual display & video capture to be able to user replay of the session

!sudo apt-get update
!sudo apt-get install -y python3-opengl
!apt install ffmpeg xvfb
!pip3 install pyvirtualdisplay
!pip3 install gymnasium pickle5

0% [Working]            Hit:1 https://cloud.r-project.org/bin/linux/ubuntu jammy-cran40/ InRelease
Hit:2 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  InRelease
Hit:3 http://security.ubuntu.com/ubuntu jammy-security InRelease
Hit:4 http://archive.ubuntu.com/ubuntu jammy InRelease
Hit:5 http://archive.ubuntu.com/ubuntu jammy-updates InRelease
Hit:6 http://archive.ubuntu.com/ubuntu jammy-backports InRelease
Hit:7 https://ppa.launchpadcontent.net/c2d4u.team/c2d4u4.0+/ubuntu jammy InRelease
Hit:8 https://ppa.launchpadcontent.net/deadsnakes/ppa/ubuntu jammy InRelease
Hit:9 https://ppa.launchpadcontent.net/graphics-drivers/ppa/ubuntu jammy InRelease
Hit:10 https://ppa.launchpadcontent.net/ubuntugis/ppa/ubuntu jammy InRelease
Reading package lists... Done
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
python3-opengl is already the newest version (3.1.5+dfsg-1).
0 upgraded, 0 newly installed, 0 to remove and 29 

In [66]:
# Restart of the OS might be required to get make sure that all the updates are running (pyvirtual display...)
# import os
# os.kill(os.getpid(), 9)

In [67]:
# Virtual display
from pyvirtualdisplay import Display

virtual_display = Display(visible=0, size=(1400, 900))
virtual_display.start()

<pyvirtualdisplay.display.Display at 0x7b47b7707760>

In [68]:
import numpy as np
import gymnasium as gym # To retrieve the game (frozen lake)
import random
import imageio # To store the replays
import os
import tqdm

import pickle5 as pickle # To save the model
from tqdm.notebook import tqdm

## Initialize the game env

In [69]:
# https://gymnasium.farama.org/environments/toy_text/frozen_lake/#frozen-lake

# Define custom map layout
desc = ["SFFH", "FHFH", "FFFH", "HHFG"]

# Make sure that its deterministic by not being slippery
# Use max 10 steps, then end an episode
env = gym.make('FrozenLake-v1', desc=desc, map_name="4x4", is_slippery=False, render_mode="rgb_array")

In [70]:
print('Possible states in the Space:\n')
print(env.observation_space)
print('All spaces to observe:',env.observation_space.n)
print(env.observation_space.sample())

Possible states in the Space:

Discrete(16)
All spaces to observe: 16
7


In [71]:
print('Possible actions in the Space:\n')
print(env.action_space)
print('All actions:', env.action_space.n)
print(env.action_space.sample())

Possible actions in the Space:

Discrete(4)
All actions: 4
2


## Define the Q table

In [72]:
def init_q_table(observation_space, action_space):
  return np.zeros((action_space, observation_space), dtype=float)

Qtable = init_q_table(env.action_space.n, env.observation_space.n)
Qtable

array([[0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.]])

# Define the policy

We will use the Greedy epsilon for training.
Note that Q learning is off-policy, meaning that it uses different policy for training & infernce. For inference it will use Epsilon Greedy (To allow exploration not only exploitation). For training it will use the Greedy. Meaning that when we are trying to predict which path the agent should take, it will try to also explore options (depending on the epsilon), but when the Agent should should update the values, it will use the best it already learned.

In [73]:
def greedy_policy(q_table, state):
  return np.argmax(q_table[state][:])

def greedy_epsilon_policy(q_table, state, epsilon = 0):
  q_action_row = q_table[state]
  random_epsilon = random.uniform(0, 1)
  if (1-epsilon)<random_epsilon:
    return random.randrange(len(q_action_row))
  else:
    return greedy_policy(q_table, state)

greedy_epsilon_policy([[0.1, 0.6, 0.3, 0.2]], 0, 0.9)

1

## Define the hyperparameters

One of the most interesting is the epsilon decay, so that going in to the later iterations the agent relies more on the Exploitation than exploration.
Though if the decay is too fast, the agen't won't be able to learn the best approach and might not converge.
Note that this is somewhat aligned with learning rates in classic algorithms. If the learning rate is too high, the ML algo won't converge, but if too low, it will take too long to get to a good result (and might not converge as well).

We might need to tweak these, mainly the Greedy epsilon (min epsilon & decay) & discount_rate to get better results

In [74]:
# # Training params

NTrainingEpisodes = 1000
LearningRate = 0.7

# # Eval params
NEvalEpisodes = 100
BatchSize = 50

# # Env params
EnvName = 'FrozenLake-v1'
MaxSteps = 50 # Note that not enough steps and the Agent may not converge
Gamma = 0.9 # discounting rate. The higher it is, the more the agent relies on Future steps as a reward
EvalSeed = []

# # Greedy epsilon params
MaxEpsilon = 1 # At start do only exploration
MinEpsilon = 0.5 # at the end 50% of cases should be exploration
EpsilonDecay = 0.0005

## Training loop

In [75]:
def eval_func(env, q_table, max_steps = 99, epsilon = 0, n_eval_episodes=100, seed=[]):
  eval_episodes = []
  for episode in tqdm(range(n_eval_episodes)):
    ## init
    if len(seed) > 0:
      position, info = env.reset(seed=seed[episode])
    else:
      position, info = env.reset()

    ## Try to solve the puzzle
    for l in range(max_steps):
      ### pick action
      ### Epsilon is negative removal, so if 0 it will always pick action from the table
      action = greedy_epsilon_policy(q_table, position, epsilon)

      ### run the action
      next_position, reward, termination, truncation, info = env.step(action)
      position = next_position

      if termination == True or truncation == True:
        eval_episodes.append(reward)
        break

  return np.mean(eval_episodes), np.std(eval_episodes)


def train(
    env,
    q_table,
    n_training_episodes = 1000,
    learning_rate = 0.7,
    n_eval_episodes = 100,
    batch_size = 50,
    max_steps = 99,
    gamma = 0.9,
    max_epsilon = 1,
    min_epsilon = 0.05,
    epsilon_decay = 0.0005):
  epsilon = max_epsilon
  for episode in tqdm(range(n_training_episodes)):

    ## inference
    position, info = env.reset()

    for j in range(max_steps):
      ### calculate state https://gymnasium.farama.org/environments/toy_text/frozen_lake/#observation-space
      ### pick action
      action = greedy_epsilon_policy(q_table, position, epsilon)

      ### run the action
      next_position, reward, termination, truncation, info = env.step(action)

      ### asign reward -> using TD
      last_value = 0
      q_table[position][action] = q_table[position][action] +  learning_rate*(reward + gamma * (np.max(q_table[next_position])) - q_table[position][action])

      ### update position
      position = next_position

      if termination == True or truncation == True:
        break

    ## calculate the new epsilon, after decay
    epsilon = min_epsilon + (max_epsilon - min_epsilon) * np.exp(-epsilon_decay * episode)

    if (episode%batch_size == 0):
      # EVAL
      eval_episodes = eval_func(env, q_table, max_steps, epsilon, n_eval_episodes)

      print('Batch:',episode,'Mean eval:',np.mean(eval_episodes))

  return q_table

# Stitching it all together

In [76]:
Qtable = init_q_table(env.action_space.n, env.observation_space.n)

TrainedQTable = train(
    env,
    q_table=Qtable,
    n_training_episodes=NTrainingEpisodes,
    learning_rate=LearningRate,
    n_eval_episodes=NEvalEpisodes,
    batch_size=BatchSize,
    max_steps=MaxSteps,
    gamma=Gamma,
    max_epsilon=MaxEpsilon,
    min_epsilon=MinEpsilon,
    epsilon_decay=EpsilonDecay
    )

  0%|          | 0/1000 [00:00<?, ?it/s]

  0%|          | 0/100 [00:00<?, ?it/s]

Batch: 0 Mean eval: 0.0


  0%|          | 0/100 [00:00<?, ?it/s]

Batch: 50 Mean eval: 0.05474937185533099


  0%|          | 0/100 [00:00<?, ?it/s]

Batch: 100 Mean eval: 0.0


  0%|          | 0/100 [00:00<?, ?it/s]

Batch: 150 Mean eval: 0.0


  0%|          | 0/100 [00:00<?, ?it/s]

Batch: 200 Mean eval: 0.05474937185533099


  0%|          | 0/100 [00:00<?, ?it/s]

Batch: 250 Mean eval: 0.0


  0%|          | 0/100 [00:00<?, ?it/s]

Batch: 300 Mean eval: 0.0


  0%|          | 0/100 [00:00<?, ?it/s]

Batch: 350 Mean eval: 0.05474937185533099


  0%|          | 0/100 [00:00<?, ?it/s]

Batch: 400 Mean eval: 0.10029361054615991


  0%|          | 0/100 [00:00<?, ?it/s]

Batch: 450 Mean eval: 0.1002936105461599


  0%|          | 0/100 [00:00<?, ?it/s]

Batch: 500 Mean eval: 0.11797958971132715


  0%|          | 0/100 [00:00<?, ?it/s]

Batch: 550 Mean eval: 0.07999999999999999


  0%|          | 0/100 [00:00<?, ?it/s]

Batch: 600 Mean eval: 0.14874342087037917


  0%|          | 0/100 [00:00<?, ?it/s]

Batch: 650 Mean eval: 0.07999999999999999


  0%|          | 0/100 [00:00<?, ?it/s]

Batch: 700 Mean eval: 0.16257350822173072


  0%|          | 0/100 [00:00<?, ?it/s]

Batch: 750 Mean eval: 0.14874342087037917


  0%|          | 0/100 [00:00<?, ?it/s]

Batch: 800 Mean eval: 0.11797958971132712


  0%|          | 0/100 [00:00<?, ?it/s]

Batch: 850 Mean eval: 0.10029361054615991


  0%|          | 0/100 [00:00<?, ?it/s]

Batch: 900 Mean eval: 0.07999999999999999


  0%|          | 0/100 [00:00<?, ?it/s]

Batch: 950 Mean eval: 0.16257350822173072


In [77]:
eval_func(env, Qtable, max_steps=MaxSteps, epsilon=0)

  0%|          | 0/100 [00:00<?, ?it/s]

(1.0, 0.0)

# Publish the model to the hub

- records video
- calculates statistics
- uploads to huggingface


In [78]:
from huggingface_hub import HfApi, snapshot_download
from huggingface_hub.repocard import metadata_eval_result, metadata_save

from pathlib import Path
import datetime
import json

In [79]:
def record_video(env, Qtable, out_directory, fps=1):
    """
    Generate a replay video of the agent
    :param env
    :param Qtable: Qtable of our agent
    :param out_directory
    :param fps: how many frame per seconds (with taxi-v3 and frozenlake-v1 we use 1)
    """
    images = []
    terminated = False
    truncated = False
    state, info = env.reset(seed=random.randint(0, 500))
    img = env.render()
    images.append(img)
    while not terminated or truncated:
        # Take the action (index) that have the maximum expected future reward given that state
        action = np.argmax(Qtable[state][:])
        state, reward, terminated, truncated, info = env.step(
            action
        )  # We directly put next_state = state for recording logic
        img = env.render()
        images.append(img)
    imageio.mimsave(out_directory, [np.array(img) for i, img in enumerate(images)], fps=fps)

In [80]:
def push_to_hub(repo_id, model, env, video_fps=1, local_repo_path="hub"):
    """
    Evaluate, Generate a video and Upload a model to Hugging Face Hub.
    This method does the complete pipeline:
    - It evaluates the model
    - It generates the model card
    - It generates a replay video of the agent
    - It pushes everything to the Hub

    :param repo_id: repo_id: id of the model repository from the Hugging Face Hub
    :param env
    :param video_fps: how many frame per seconds to record our video replay
    (with taxi-v3 and frozenlake-v1 we use 1)
    :param local_repo_path: where the local repository is
    """
    _, repo_name = repo_id.split("/")

    eval_env = env
    api = HfApi()

    # Step 1: Create the repo
    repo_url = api.create_repo(
        repo_id=repo_id,
        exist_ok=True,
    )

    # Step 2: Download files
    repo_local_path = Path(snapshot_download(repo_id=repo_id))

    # Step 3: Save the model
    if env.spec.kwargs.get("map_name"):
        model["map_name"] = env.spec.kwargs.get("map_name")
        if env.spec.kwargs.get("is_slippery", "") == False:
            model["slippery"] = False

    # Pickle the model
    with open((repo_local_path) / "q-learning.pkl", "wb") as f:
        pickle.dump(model, f)

    # Step 4: Evaluate the model and build JSON with evaluation metrics
    mean_reward, std_reward = eval_func(
        env=eval_env, q_table=model["qtable"], max_steps=model["max_steps"], epsilon=0, n_eval_episodes=model["n_eval_episodes"], seed=model["eval_seed"]
    )

    evaluate_data = {
        "env_id": model["env_id"],
        "mean_reward": mean_reward,
        "n_eval_episodes": model["n_eval_episodes"],
        "eval_datetime": datetime.datetime.now().isoformat(),
    }

    # Write a JSON file called "results.json" that will contain the
    # evaluation results
    with open(repo_local_path / "results.json", "w") as outfile:
        json.dump(evaluate_data, outfile)

    # Step 5: Create the model card
    env_name = model["env_id"]
    if env.spec.kwargs.get("map_name"):
        env_name += "-" + env.spec.kwargs.get("map_name")

    if env.spec.kwargs.get("is_slippery", "") == False:
        env_name += "-" + "no_slippery"

    metadata = {}
    metadata["tags"] = [env_name, "q-learning", "reinforcement-learning", "custom-implementation"]

    # Add metrics
    eval = metadata_eval_result(
        model_pretty_name=repo_name,
        task_pretty_name="reinforcement-learning",
        task_id="reinforcement-learning",
        metrics_pretty_name="mean_reward",
        metrics_id="mean_reward",
        metrics_value=f"{mean_reward:.2f} +/- {std_reward:.2f}",
        dataset_pretty_name=env_name,
        dataset_id=env_name,
    )

    # Merges both dictionaries
    metadata = {**metadata, **eval}

    model_card = f"""
  # **Q-Learning** Agent playing1 **{EnvName}**
  This is a trained model of a **Q-Learning** agent playing **{EnvName}** .

  ## Usage

  model = load_from_hub(repo_id="{repo_id}", filename="q-learning.pkl")

  # Don't forget to check if you need to add additional attributes (is_slippery=False etc)
  env = gym.make(model["env_id"])
  """

    eval_func(
        env=eval_env, q_table=model["qtable"], max_steps=model["max_steps"], epsilon=0, n_eval_episodes=model["n_eval_episodes"], seed=model["eval_seed"]
    )

    readme_path = repo_local_path / "README.md"
    readme = ""
    print(readme_path.exists())
    if readme_path.exists():
        with readme_path.open("r", encoding="utf8") as f:
            readme = f.read()
    else:
        readme = model_card

    with readme_path.open("w", encoding="utf-8") as f:
        f.write(readme)

    # Save our metrics to Readme metadata
    metadata_save(readme_path, metadata)

    # Step 6: Record a video
    video_path = repo_local_path / "replay.mp4"
    record_video(env, model["qtable"], video_path, video_fps)

    # Step 7. Push everything to the Hub
    api.upload_folder(
        repo_id=repo_id,
        folder_path=repo_local_path,
        path_in_repo=".",
    )

    print("Your model is pushed to the Hub. You can view your model here: ", repo_url)

## HF login

In [81]:
# grab token from https://huggingface.co/settings/tokens

from huggingface_hub import notebook_login

# notebook_login()

VBox(children=(HTML(value='<center> <img\nsrc=https://huggingface.co/front/assets/huggingface_logo-noborder.sv…

## Set the model

In [82]:
model = {
    "env_id": EnvName,
    "max_steps": MaxSteps,
    "n_training_episodes": NTrainingEpisodes,
    "n_eval_episodes": NEvalEpisodes,
    "eval_seed": EvalSeed,
    "learning_rate": LearningRate,
    "gamma": Gamma,
    "max_epsilon": MaxEpsilon,
    "min_epsilon": MinEpsilon,
    "decay_rate": EpsilonDecay,
    "qtable": Qtable,
}

In [83]:
username = "Marcuso"
repo_name = "q-FrozenLake-v1-4x4-noSlippery"
push_to_hub(repo_id=f"{username}/{repo_name}", model=model, env=env)

Fetching 1 files:   0%|          | 0/1 [00:00<?, ?it/s]

  0%|          | 0/100 [00:00<?, ?it/s]

  0%|          | 0/100 [00:00<?, ?it/s]

True


q-learning.pkl:   0%|          | 0.00/907 [00:00<?, ?B/s]

Your model is pushed to the Hub. You can view your model here:  https://huggingface.co/Marcuso/q-FrozenLake-v1-4x4-noSlippery


In [86]:
# record local video to watch

#saved to "content/replayX.mp4"
record_video(env, Qtable, "replay2.mp4", 1)