# Reinforcement Learning

| Date | User | Change Type | Remarks |  
| ---- | ---- | ----------- | ------- |
| 13/03/2025   | Martin | Created   | Created notebook for reinforcement learning design patterns. | 

# Content

* [Introduction](#introduction)

# Introduction

`TF-Agents` is the reinforcement learning (RL) library for TensorFlow.

__Components__

* The __Agent__ operates in an __environment__ and learns by processing the signals it receives every time it chooses an action. The environment is implemented in Python and wrapped in a TF wrapper for efficient parallelization
* __Policy__ maps an observation from the environment to a distribution over actions
* __Driver__ executes the policy in an environment for a specified number of steps or _episodes_
* __Replay Buffer__ stores the experience (agent trajectories in action space and associated rewards) gathered by executing a policy in an environment; during training, the buffer is queried for a subset of trajectories
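
To show how the components listed above fit together, here is a minimal plain-Python sketch. It does not use the TF-Agents API: `ToyEnv`, `random_policy`, and `run_driver` are hypothetical stand-ins invented for illustration, and the corridor task is an assumption chosen only to keep the example small.

```python
import random
from collections import deque


class ToyEnv:
    """Hypothetical environment: a 1-D corridor from position 0 to a goal at 3."""

    def reset(self):
        self.pos = 0
        return self.pos

    def step(self, action):
        # action is -1 (left) or +1 (right); the position is clamped to [0, 3]
        self.pos = min(3, max(0, self.pos + action))
        done = self.pos == 3
        reward = 1.0 if done else 0.0
        return self.pos, reward, done


def random_policy(observation, rng):
    """Policy: maps an observation to an action (here, uniformly at random)."""
    return rng.choice([-1, 1])


def run_driver(env, policy, buffer, num_steps, rng):
    """Driver: runs the policy for num_steps and records every transition."""
    obs = env.reset()
    for _ in range(num_steps):
        action = policy(obs, rng)
        next_obs, reward, done = env.step(action)
        buffer.append((obs, action, reward, next_obs, done))
        # Start a new episode whenever the goal is reached
        obs = env.reset() if done else next_obs


rng = random.Random(0)
replay_buffer = deque(maxlen=1000)  # Replay buffer: bounded store of experience
run_driver(ToyEnv(), random_policy, replay_buffer, num_steps=50, rng=rng)
batch = rng.sample(list(replay_buffer), 8)  # training would query a subset like this
```

In TF-Agents each of these roles is filled by a dedicated class; the sketch only mirrors the data flow between them.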

💡 __IDEA:__ Cast each problem as an RL problem and map its parameters onto the corresponding TF-Agents components

---

# GridWorld

__The Game:__ 6x6 square board, where the agent starts at (0,0), the finish is at (5,5), and the goal of the agent is to find the path from the start to the finish

__Actions:__ Possible actions are moves up/down/left/right

__Rewards:__ If the agent lands on the finish, it receives a reward of 100 and the game ends. If the agent has not reached the finish after 100 steps, the game terminates
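
The rules above can be sketched in plain Python before any TF-Agents code is involved. This is an illustration under assumptions, not the notebook's environment: the helper `play` and the constants are hypothetical names, and the action numbering follows the `move` method defined further down (0 and 1 change the row, 2 and 3 change the column).

```python
GRID_SIZE = 6
FINISH = (5, 5)
MAX_STEPS = 100
# Assumed action encoding for this sketch: action -> (row delta, column delta)
MOVES = {0: (-1, 0), 1: (1, 0), 2: (0, -1), 3: (0, 1)}


def play(actions):
    """Apply actions starting at (0, 0); return (final position, reward, steps taken)."""
    row, col = 0, 0
    steps = 0
    # Only the first MAX_STEPS actions are played
    for steps, action in enumerate(actions[:MAX_STEPS], start=1):
        d_row, d_col = MOVES[action]
        # A move that would leave the board is ignored, so the agent stays put
        if 0 <= row + d_row < GRID_SIZE:
            row += d_row
        if 0 <= col + d_col < GRID_SIZE:
            col += d_col
        if (row, col) == FINISH:
            return (row, col), 100, steps
    return (row, col), 0, steps


shortest = play([1] * 5 + [3] * 5)  # ((5, 5), 100, 10)
stuck = play([0] * 200)             # ((0, 0), 0, 100)
```

The shortest route takes 10 moves, so the 100-step cap leaves an untrained agent plenty of room to stumble onto the finish.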

In [1]:
%load_ext watermark
import tensorflow as tf
import numpy as np

from tf_agents.environments import (
    py_environment,
    tf_environment,
    tf_py_environment,
    utils,
    wrappers,
    suite_gym,
)
from tf_agents.specs import array_spec
from tf_agents.trajectories import trajectory, time_step as ts

from tf_agents.agents.dqn import dqn_agent
from tf_agents.networks import q_network
from tf_agents.drivers import dynamic_step_driver
from tf_agents.metrics import tf_metrics, py_metrics
from tf_agents.policies import random_tf_policy
from tf_agents.replay_buffers import tf_uniform_replay_buffer
from tf_agents.utils import common
from tf_agents.drivers import py_driver, dynamic_episode_driver

import matplotlib.pyplot as plt


Define the environment the agent will operate in: the conditions for resetting and terminating an episode, and the mechanics for moving.

In [None]:
class GridWorldEnv(py_environment.PyEnvironment):
  def __init__(self):
    # A BoundedArraySpec describes the shape, dtype, and min/max values an array may take
    # Action: a scalar integer in [0, 3], one value per move direction
    self._action_spec = array_spec.BoundedArraySpec(
      shape=(),
      dtype=np.int32,
      minimum=0,
      maximum=3,
      name='action'
    )
    # Observation: (row, col, frow, fcol), each coordinate in [0, 5]
    self._observation_spec = array_spec.BoundedArraySpec(
      shape=(4, ),
      dtype=np.int32,
      minimum=[0, 0, 0, 0],
      maximum=[5, 5, 5, 5],
      name='observation'
    )
    self._state = [0, 0, 5, 5] # (row, col, frow, fcol): positions of the player and the finish
    self._episode_ended = False
  
  def action_spec(self):
    return self._action_spec
  
  def observation_spec(self):
    return self._observation_spec

  def _reset(self):
    """
    Reset the environment to the start of a new episode.

    Returns:
        TimeStep: the first time step, with the player back at (0, 0)
    """
    self._state = [0, 0, 5, 5]
    self._episode_ended = False
    return ts.restart(np.array(self._state, dtype=np.int32))
  
  def _step(self, action):
    """
    Each step:
      1. If the previous step ended the episode, ignore the action and reset
      2. Otherwise, move the player using the current action
      3. If the player is now on the finish, end the episode with a reward of 100
      4. Otherwise, continue the episode with a reward of 0

    Note: the 100-step limit is not enforced here; it can be added by
    wrapping the environment in wrappers.TimeLimit.

    Args:
        action: integer in [0, 3] selecting the move (see move())

    Returns:
        TimeStep: a termination step on reaching the finish, otherwise a transition step
    """
    if self._episode_ended:
      return self.reset()
    
    self.move(action)
    observation = np.array(self._state, dtype=np.int32)

    if self.game_over():
      self._episode_ended = True
      return ts.termination(observation, reward=100) # End of game
    return ts.transition(observation, reward=0, discount=0.9) # Transition to next step of game
    
  def move(self, action):
    row, col = self._state[0], self._state[1]
    # Moves that would leave the 6x6 board are ignored
    if action == 0: # down
      if row - 1 >= 0:
        self._state[0] -= 1
    elif action == 1: # up
      if row + 1 < 6:
        self._state[0] += 1
    elif action == 2: # left
      if col - 1 >= 0:
        self._state[1] -= 1
    elif action == 3: # right
      if col + 1 < 6:
        self._state[1] += 1
  
  def game_over(self):
    # The game is over once the player stands on the finish
    row, col, frow, fcol = self._state
    return row == frow and col == fcol

In [None]:
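def compute_avg_return(environment, policy, num_episodes=10):
  # Average the undiscounted return of `policy` over `num_episodes` episodes
  total_return = 0.0
  for _ in range(num_episodes):
    time_step = environment.reset()
    episode_return = 0.0

    # Step until the environment emits the last time step of the episode
    while not time_step.is_last():
      action_step = policy.action(time_step)
      time_step = environment.step(action_step.action)
      episode_return += time_step.reward
    # Add the finished episode's return once, after the episode ends
    total_return += episode_return
  
  avg_return = total_return / num_episodes
  # Assumes a TF environment with batch size 1, so the result is a 1-element tensor
  return avg_return.numpy()[0]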
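
As a quick check of the arithmetic behind the average return, the same calculation can be sketched without TF-Agents. The helper `average_return` and the reward lists are hypothetical, made up for illustration; each inner list holds the per-step rewards of one episode.

```python
def average_return(episodes):
    """Mean of per-episode returns; each episode is a list of step rewards."""
    total_return = 0.0
    for rewards in episodes:
        # Sum the episode's rewards first, then add that sum exactly once
        total_return += sum(rewards)
    return total_return / len(episodes)


# Two episodes reach the finish (reward 100 on the last step), two do not
episodes = [[0, 0, 100], [0, 0, 0], [0, 100], [0, 0]]
avg = average_return(episodes)  # (100 + 0 + 100 + 0) / 4 = 50.0
```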

In [3]:
%watermark

Last updated: 2025-03-13T18:14:08.783191+08:00

Python implementation: CPython
Python version       : 3.10.12
IPython version      : 8.33.0

Compiler    : GCC 11.4.0
OS          : Linux
Release     : 5.15.167.4-microsoft-standard-WSL2
Machine     : x86_64
Processor   : x86_64
CPU cores   : 20
Architecture: 64bit

