<a href="https://colab.research.google.com/github/poweredbypigeon/Blackjack-Text-Based-/blob/main/Technical_Guide_Notebook.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## 0. Environment API

**Welcome to the Environment API**
This section explains part of the API for interacting with the game environment. API stands for Application Programming Interface; it describes the structure of the code so you know what information you can obtain from the environment and use in your agent's solution and reward functions.





### **0.1 Map**

The environment map coordinate system is as follows. Some key things to note:

1. The center of the map is the origin
2. The positive X direction points to the right
3. The positive Y direction points upward

You can use the image as a scale reference to work out regions of the stage, which you can then use to define your agent's behaviour and/or reward functions!
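
Because the origin sits at the center of the stage, with +x to the right and +y upward, the sign of a coordinate difference tells you where one object is relative to another. The helper below is a hypothetical sketch (`relative_direction` is not part of the environment) that turns two (x, y) positions into a short description, which can help when reasoning about regions for behaviours or rewards:

```python
def relative_direction(from_xy: tuple[float, float], to_xy: tuple[float, float]) -> str:
    """Describe where `to_xy` lies relative to `from_xy`.

    Assumes the map conventions above: +x points right and +y points up.
    """
    dx = to_xy[0] - from_xy[0]
    dy = to_xy[1] - from_xy[1]
    horizontal = "right" if dx > 0 else "left" if dx < 0 else "same x"
    vertical = "above" if dy > 0 else "below" if dy < 0 else "same y"
    return f"{horizontal}, {vertical}"

# Example: an opponent at (2.0, -1.5) seen from a player standing at the origin
print(relative_direction((0.0, 0.0), (2.0, -1.5)))  # right, below
```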

<p align="center">
  <img src="https://raw.githubusercontent.com/lightningminted/UTMIST_AI2_TechnicalGuide/main/env_stage.png" alt="RL Schema" width="900"/>
</p>


### **0.2 Environment and Signals**

To access variables about the environment (including objects, the agents (player and opponent), etc.), use the following format:

In [1]:
# Agents
env.objects["player"]
env.objects["opponent"]

# Agent Position
env.objects["player"].body.position.x   # X position during frame
env.objects["player"].body.position.y   # Y position during frame

env.objects["player"].body.position.x_change  # Change in x position since the previous frame
env.objects["player"].body.position.y_change  # Change in y position since the previous frame

env.objects["player"].body.velocity.x # X velocity of agent
env.objects["player"].body.velocity.y # Y velocity of agent

# Agent Characteristics
env.objects["player"].DamageTakenTotal      # Integer value of total damage taken
env.objects["player"].DamageTakenThisStock  # Integer value of damage taken during the current stock (life)
env.objects["player"].DamageTakenThisFrame  # Integer value of damage taken during the current frame
env.objects["player"].WeaponHeldThisFrame   # True or False: whether the agent holds a weapon this frame

# Time
env.time_elapsed  # Time that has elapsed since start of game
env.current_frame # Current frame number

# Platforms
env.objects['ground']
env.objects['platform1']
env.objects['platform2']

# Signals
knockout_signal = Signal()
knockout_signal.connect(knockout_reward)
knockout_signal.emit(agent="player") # Triggered when an agent is knocked out

win_signal = Signal()
win_signal.connect(win_reward)
win_signal.emit(agent="player") # Triggered when the player wins

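
The attribute paths above only exist while the game environment is loaded. To try out reward functions offline, one option is a stand-in object that mimics the same paths. The sketch below builds one with `types.SimpleNamespace`; `make_stub_player`, `make_stub_env`, and their parameters are hypothetical helpers, not part of the environment, and only a few of the attributes listed above are mocked.

```python
from types import SimpleNamespace

def make_stub_player(x=0.0, y=0.0, dx=0.0, dy=0.0, vx=0.0, vy=0.0, damage_this_frame=0):
    """Hypothetical stand-in for env.objects["player"] or env.objects["opponent"]."""
    position = SimpleNamespace(x=x, y=y, x_change=dx, y_change=dy)
    velocity = SimpleNamespace(x=vx, y=vy)
    return SimpleNamespace(
        body=SimpleNamespace(position=position, velocity=velocity),
        # The API listing above and damage_interaction_reward spell the per-frame
        # damage attribute differently, so this stub exposes both names.
        DamageTakenThisFrame=damage_this_frame,
        damage_taken_this_frame=damage_this_frame,
    )

def make_stub_env(player, opponent, current_frame=0):
    """Hypothetical stand-in for the environment: only `objects` and `current_frame`."""
    return SimpleNamespace(
        objects={"player": player, "opponent": opponent},
        current_frame=current_frame,
    )

# Player left of center, opponent right of center, both at the same height
env = make_stub_env(make_stub_player(x=-1.0, y=0.5), make_stub_player(x=2.0, y=0.5))
gap_x = env.objects["opponent"].body.position.x - env.objects["player"].body.position.x
print(gap_x)  # 3.0
```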

## 1. Reward Function Library

**Welcome to the Reward Function Library**

This section contains some basic, pre-implemented reward functions for training your agents. They correspond to some of the reward functions in the technical guide, and we recommend using them as starting points for tweaking and designing your own!

As a starting point/hint, we’ve also included some ideas for reward functions that might be worth implementing yourself! If you struggle to implement any of them, feel free to ask questions in the question hub on the Discord!





### 1.1 Existential State/Env Rewards

In [2]:
from enum import Enum

# Named DamageRewardMode so it is not overwritten by the RewardMode enum
# defined later for knockout rewards
class DamageRewardMode(Enum):
    ASYMMETRIC_OFFENSIVE = 0
    SYMMETRIC = 1
    ASYMMETRIC_DEFENSIVE = 2

def damage_interaction_reward(
    env: WarehouseBrawl,
    mode: DamageRewardMode = DamageRewardMode.SYMMETRIC,
) -> float:
    """
    Computes the reward based on damage interactions between players.

    Modes:
    - ASYMMETRIC_OFFENSIVE (0): Reward is based only on damage dealt to the opponent
    - SYMMETRIC (1): Reward is based on both dealing damage to the opponent and avoiding damage
    - ASYMMETRIC_DEFENSIVE (2): Reward is based only on avoiding damage

    Args:
        env (WarehouseBrawl): The game environment
        mode (DamageRewardMode): Reward mode, one of DamageRewardMode

    Returns:
        float: The computed reward.
    """
    # Getting player and opponent from the environment
    player: Player = env.objects["player"]
    opponent: Player = env.objects["opponent"]

    # Damage exchanged during the current frame
    damage_taken = player.damage_taken_this_frame
    damage_dealt = opponent.damage_taken_this_frame

    # Reward dependent on the mode
    if mode == DamageRewardMode.ASYMMETRIC_OFFENSIVE:
        reward = damage_dealt
    elif mode == DamageRewardMode.SYMMETRIC:
        reward = damage_dealt - damage_taken
    elif mode == DamageRewardMode.ASYMMETRIC_DEFENSIVE:
        reward = -damage_taken
    else:
        raise ValueError(f"Invalid mode: {mode}")

    return reward

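
The mode logic above only needs two numbers per frame, so one way to check it outside the game is to factor the arithmetic into a pure function. The sketch below is a hypothetical refactor (`damage_reward_from_values` is not part of the notebook) that mirrors the three modes; the example frame, in which the player deals 10 damage and takes 4, is made up to show how the modes differ.

```python
from enum import Enum

class DamageRewardMode(Enum):
    ASYMMETRIC_OFFENSIVE = 0
    SYMMETRIC = 1
    ASYMMETRIC_DEFENSIVE = 2

def damage_reward_from_values(damage_dealt: float, damage_taken: float,
                              mode: DamageRewardMode = DamageRewardMode.SYMMETRIC) -> float:
    """Same mode logic as damage_interaction_reward, without the environment lookup."""
    if mode == DamageRewardMode.ASYMMETRIC_OFFENSIVE:
        return float(damage_dealt)
    if mode == DamageRewardMode.SYMMETRIC:
        return float(damage_dealt - damage_taken)
    if mode == DamageRewardMode.ASYMMETRIC_DEFENSIVE:
        return float(-damage_taken)
    raise ValueError(f"Invalid mode: {mode}")

# One frame in which the player deals 10 damage and takes 4
for mode in DamageRewardMode:
    print(mode.name, damage_reward_from_values(10, 4, mode))
# ASYMMETRIC_OFFENSIVE 10.0
# SYMMETRIC 6.0
# ASYMMETRIC_DEFENSIVE -4.0
```

The symmetric reward is the sum of the offensive and defensive rewards (10 + (-4) = 6), so it only rewards trading damage when the agent comes out ahead.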

In [None]:
def danger_zone_reward(
    env: WarehouseBrawl,
    zone_penalty: int = 1,
    zone_height: float = 4.2
) -> float:
    """
    Applies a penalty for every frame in which the player is at or above a certain height threshold.

    Args:
        env (WarehouseBrawl): The game environment.
        zone_penalty (int): The penalty applied when the player is in the danger zone.
        zone_height (float): The height threshold defining the danger zone.

    Returns:
        float: The computed penalty.
    """
    # Get player object from the environment
    player: Player = env.objects["player"]

    # Apply penalty if the player is in the danger zone
    reward = -zone_penalty if player.body.position.y >= zone_height else 0.0

    return reward


NOTE: The default `zone_height` of 4.2 corresponds to the following danger zone:

<p align="center">
  <img src="https://raw.githubusercontent.com/lightningminted/UTMIST_AI2_TechnicalGuide/main/env_stage_dz.png" alt="RL Schema" width="900"/>
</p>
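
Because the penalty is applied on every frame, its total over an episode is the number of frames spent at or above `zone_height`, multiplied by `zone_penalty`. The sketch below checks this on a made-up sequence of y positions; `danger_zone_penalty` is a hypothetical pure-function version of the check inside `danger_zone_reward`, and the trajectory is illustrative, not game data.

```python
def danger_zone_penalty(y: float, zone_penalty: float = 1, zone_height: float = 4.2) -> float:
    """Per-frame penalty: -zone_penalty when y is at or above zone_height, else 0."""
    return float(-zone_penalty) if y >= zone_height else 0.0

# Illustrative y positions over five frames (+y points up)
trajectory = [0.0, 3.9, 4.2, 5.0, 4.1]
penalties = [danger_zone_penalty(y) for y in trajectory]
print(penalties)       # [0.0, 0.0, -1.0, -1.0, 0.0]
print(sum(penalties))  # -2.0
```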


In [None]:
# TODO: This reward function has not been written and is left as an exercise to try and implement
#       yourself. Think about the following before implementing:
#
#       - While having a stock lead is generally good in fighting games,
#         how would this reward influence agent behaviour?
#       - Is this behaviour even desirable?
#       - Is this behaviour more valuable near the beginning or the end of the match,
#         and, based on that answer, how could you change the reward to account for time?


def stock_advantage_reward(
    env: WarehouseBrawl,
    success_value: float = 0, #TODO
) -> float:

    """
    Computes the reward given for every time step your agent has a stock advantage over the opponent.

    Args:
        env (WarehouseBrawl): The game environment
        success_value (float): Reward value for having a stock advantage (however you define it)
    Returns:
        float: The computed reward.
    """
    reward = 0.0
    # TODO: Write the function

    return reward


### 1.2 Modulo Existential Reward

In [None]:
import numpy as np

def move_to_opponent_reward(
    env: WarehouseBrawl,
) -> float:
    """
    Computes the reward based on whether the agent is moving toward the opponent.
    The reward is calculated by taking the dot product of the agent's normalized velocity
    with the normalized direction vector toward the opponent.

    Args:
        env (WarehouseBrawl): The game environment

    Returns:
        float: The computed reward
    """
    # Getting agent and opponent from the environment
    player: Player = env.objects["player"]
    opponent: Player = env.objects["opponent"]

    # Per-frame change in the player's position (used as a proxy for velocity)
    player_position_dif = np.array([player.body.position.x_change, player.body.position.y_change])

    direction_to_opponent = np.array([opponent.body.position.x - player.body.position.x,
                                      opponent.body.position.y - player.body.position.y])

    # Prevent division by zero or extremely small values
    direc_to_opp_norm = np.linalg.norm(direction_to_opponent)
    player_pos_dif_norm = np.linalg.norm(player_position_dif)

    if direc_to_opp_norm < 1e-6 or player_pos_dif_norm < 1e-6:
        return 0.0

    # Compute the dot product of the normalized vectors to measure how well the
    # current movement (a proxy for velocity) aligns with the direction to the opponent
    reward = np.dot(player_position_dif / player_pos_dif_norm, direction_to_opponent / direc_to_opp_norm)

    return float(reward)
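
With both vectors normalized, the reward is the cosine of the angle between the agent's per-frame movement and the direction to the opponent, so it ranges from 1 (moving straight at the opponent) to -1 (moving straight away). A small numeric check with NumPy, using made-up vectors; `alignment` is a hypothetical stand-alone version of the calculation, not part of the environment:

```python
import numpy as np

def alignment(movement: np.ndarray, to_opponent: np.ndarray) -> float:
    """Cosine similarity between a movement vector and the direction to the opponent."""
    movement_norm = np.linalg.norm(movement)
    direction_norm = np.linalg.norm(to_opponent)
    if movement_norm < 1e-6 or direction_norm < 1e-6:
        return 0.0  # same small-norm guard as move_to_opponent_reward
    return float(np.dot(movement / movement_norm, to_opponent / direction_norm))

to_opponent = np.array([3.0, 4.0])  # opponent is 3 units right and 4 units up

print(alignment(np.array([0.5, 0.0]), to_opponent))    # about 0.6: moving right, partly toward
print(alignment(np.array([-0.3, -0.4]), to_opponent))  # about -1.0: moving directly away
print(alignment(np.array([6.0, 8.0]), to_opponent))    # about 1.0: speed does not matter
```

Because the movement vector is normalized, a fast step and a slow step in the same direction earn the same reward; if speed should matter, the movement vector would need to be left unnormalized.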

In [None]:
# TODO: This reward function has not been written and is left as an exercise to try and implement
#       yourself. Think about the following before implementing:
#
#       - When does "edge-guarding" happen?
#       - Where does the opponent have to be or be moving?
#       - Where does your player have to be or be moving?
#       - Where does the ledge have to be relative to the agents?

def edge_guard_reward(
    env: WarehouseBrawl,
    success_value: float = 0, #TODO
    fail_value: float = 0,    #TODO
) -> float:

    """
    Computes the reward given for every time step your agent is edge guarding the opponent.

    Args:
        env (WarehouseBrawl): The game environment
        success_value (float): Reward value for each time step the agent is edge guarding (however you define it)
        fail_value (float): Penalty for a failed edge guard (however you define it)

    Returns:
        float: The computed reward.
    """
    reward = 0.0
    # TODO: Write the function

    return reward


### 1.3 Single Event/Sparse Reward

**How to Implement an Event/Sparse Reward: Signals**

The other types of rewards check at every timestep whether certain conditions are true (a method known as **polling**).

Events tied to sparse rewards happen only rarely, so checking for them at every time step would be inefficient. For this reason we introduce **signals**.

A signal is a way to trigger a response when an event occurs, without needing to constantly check for the event. This is much more efficient because the system only responds when something relevant happens, rather than checking continuously.

In this setup, a signal serves as a message or notification that something has occurred in the environment (e.g., a player achieving a knockout).

When a signal is emitted, it notifies all functions that have been connected to it (known as **handlers**), and those functions can then take the appropriate action, such as rewarding the agent for the achievement. The reward functions will be handlers in this case.

**How to implement a signal and a corresponding sparse reward**

To create a sparse reward, you need to write the reward function and set up the signal. The code below shows how the `Signal` class is organized; you won't have to add it, as it is already implemented in your notebook. What you will have to do is:

1.   Create an instance of this signal class
2.   Write a reward function
3.   Connect the reward function to the signal
4.   Put the signal emission whenever you want it to activate

In [None]:
# Signal class (DO NOT ADD TO NOTEBOOK - already implemented)

class Signal:
    def __init__(self):
        self._handlers = []

    def connect(self, handler):
        self._handlers.append(handler)

    def emit(self, *args, **kwargs):
        for handler in self._handlers:
            handler(*args, **kwargs)

# 1. Create instance of signal (in this case it's knockout)
knockout_signal = Signal()

# 2. Define the knockout reward (note it is defined in the next cell)

# 3. Connect reward functions to signal using .connect
knockout_signal.connect(knockout_reward)

# 4. Wherever in the code you want the signal to activate, call emit on the signal,
#    passing along whatever values your reward function handler needs

knockout_signal.emit(agent="player")    # Signal passing argument "player", indicating player was knocked out
knockout_signal.emit(agent="opponent")  # Signal passing argument "opponent", indicating opponent was knocked out
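
`Signal.emit` discards whatever its handlers return, and the reward functions in this section also expect an `env` argument that the `emit(agent=...)` calls above do not pass. How the training framework actually collects these rewards isn't shown here, so the following is only a self-contained sketch under those assumptions: it pre-fills a placeholder `env` with `functools.partial` and records each handler's return value in a list. `RewardCollector` and the simplified `knockout_reward` are hypothetical.

```python
from functools import partial

class Signal:
    """Same interface as the Signal class above."""
    def __init__(self):
        self._handlers = []

    def connect(self, handler):
        self._handlers.append(handler)

    def emit(self, *args, **kwargs):
        for handler in self._handlers:
            handler(*args, **kwargs)

class RewardCollector:
    """Hypothetical handler: calls a reward function and stores what it returns."""
    def __init__(self, reward_fn, env):
        self._reward_fn = partial(reward_fn, env)  # pre-fill the `env` argument
        self.rewards = []

    def __call__(self, **kwargs):
        self.rewards.append(self._reward_fn(**kwargs))

def knockout_reward(env, agent: str = "player", value: float = 50.0) -> float:
    """Simplified symmetric knockout reward (stand-in for the full version)."""
    return -value if agent == "player" else value

env = object()  # placeholder; the real game environment would go here
collector = RewardCollector(knockout_reward, env)

knockout_signal = Signal()
knockout_signal.connect(collector)
knockout_signal.emit(agent="opponent")  # opponent knocked out
knockout_signal.emit(agent="player")    # player knocked out
print(collector.rewards)       # [50.0, -50.0]
print(sum(collector.rewards))  # 0.0
```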

In [None]:
from enum import Enum

class RewardMode(Enum):
    ASYMMETRIC_OPPONENT = 0
    SYMMETRIC = 1
    ASYMMETRIC_PLAYER = 2


def knockout_reward(
    env: WarehouseBrawl,
    agent: str = "player",
    mode: RewardMode = RewardMode.SYMMETRIC,
    knockout_value_opponent: float = 50.0,
    knockout_value_player: float = 50.0,
) -> float:
    """
    Computes the reward based on knockouts.

    Modes:
    - ASYMMETRIC_OPPONENT (0): Reward is based only on the opponent being knocked out
    - SYMMETRIC (1): Reward is based on both agents being knocked out (reward for the opponent, penalty for the player)
    - ASYMMETRIC_PLAYER (2): Reward is based only on your own player being knocked out

    Args:
        env (WarehouseBrawl): The game environment
        agent (str): The agent that was knocked out
        mode (RewardMode): Reward mode, one of RewardMode
        knockout_value_opponent (float): Reward value for knocking out opponent
        knockout_value_player (float): Reward penalty for player being knocked out

    Returns:
        float: The computed reward.
    """
    reward = 0.0

    # Mode logic to compute reward
    if mode == RewardMode.ASYMMETRIC_OPPONENT:
        if agent == "opponent":
            reward = knockout_value_opponent # Reward for opponent being knocked out
    elif mode == RewardMode.SYMMETRIC:
        if agent == "player":
            reward = -knockout_value_player  # Penalty for player getting knocked out
        elif agent == "opponent":
            reward = knockout_value_opponent # Reward for opponent being knocked out
    elif mode == RewardMode.ASYMMETRIC_PLAYER:
        if agent == "player":
            reward = -knockout_value_player  # Penalty for player getting knocked out

    return reward

In [None]:
def win_reward(
    env: WarehouseBrawl,
    agent: str = "player",
    win_value: float = 300.0,
    lose_value: float = 200.0,
) -> float:

    """
    Computes the reward based on who won the match.

    Args:
        env (WarehouseBrawl): The game environment
        agent (str): The agent that won
        win_value (float): Reward value for the player winning
        lose_value (float): Penalty for the player losing

    Returns:
        float: The computed reward.
    """

    reward = win_value if agent == "player" else -lose_value
    return reward

In [None]:
# TODO: The signal for this reward has not been implemented; this is left as an exercise for
#       you to implement yourself. Think about when this signal should be activated, and
#       how to prevent it from being activated again once the first hit has happened.

def first_hit(
    env: WarehouseBrawl,
    agent: str = "player",
    success_value: float = 20.0,
    fail_value: float = 10.0,
) -> float:

    """
    Computes the reward based on who lands the first hit

    Args:
        env (WarehouseBrawl): The game environment
        agent (str): The agent that hit first ("player" or "opponent")
        success_value (float): Reward value for the player hitting first
        fail_value (float): Penalty for the opponent hitting first

    Returns:
        float: The computed reward.
    """

    reward = success_value if agent == "player" else -fail_value
    return reward
