**MALVI BID - 20187945**

This is my coursework submission for COMP3071 Designing Intelligent Agents (Spring 2022/2023) at the School of Computer Science, University of Nottingham Malaysia.

Please check the submitted `README.md` file for instructions on how to set up the Python environment needed to run this code.

**Useful links:**

Project repository: https://github.com/malvibid/COMP3071-Designing-Intelligent-Agents/tree/coursework/COMP3071-DIA-CW/src

Best trained model: https://github.com/malvibid/COMP3071-Designing-Intelligent-Agents/tree/coursework/COMP3071-DIA-CW/src/best_trained_model

Video featuring a selection of test rounds demonstrating the best trained agent playing the game: https://youtu.be/QFf0_4FCh0w

All trained models: https://github.com/malvibid/COMP3071-Designing-Intelligent-Agents/tree/coursework/COMP3071-DIA-CW/src/trained_models


# Custom Dino Environment

## Import Dependencies

In [1]:
import base64
import cv2
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Environment Components
from gymnasium import Env
from gymnasium.spaces import Box, Discrete

# Selenium for automatically loading and playing the game
from selenium import webdriver
from selenium.common.exceptions import WebDriverException
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
from webdriver_manager.chrome import ChromeDriverManager

## DinoEnvironment Class

In [12]:
# Create Dino Game Environment
class DinoEnvironment(Env):

    def __init__(self):

        # Initialise the parent Env class
        super().__init__()

        self.driver = self._create_driver()

        # Setup spaces
        low_values = np.array(
            [0, 0, 0, 6, -1, -1, -1, -1, -1, -1], dtype=np.float32)  # Initial speed is 6, while max speed is 13
        high_values = np.array(
            [150, 1, 1, 13, 600, 3, 600, 150, 50, 50], dtype=np.float32)  # Canvas dimensions are 600x150
        self.observation_space = Box(
            low=low_values, high=high_values, shape=(10,), dtype=np.float32)

        # Start jumping, Start ducking, Stop ducking, Do nothing - Ducking has been divided into two actions because the agent should also learn the correct ducking duration
        self.action_space = Discrete(4)

        self.actions_map = [
            (Keys.ARROW_UP, "key_down"),  # Start jumping
            (Keys.ARROW_DOWN, "key_down"),  # Start ducking
            (Keys.ARROW_DOWN, "key_up"),  # Stop ducking
            (Keys.ARROW_RIGHT, "key_down")  # Do nothing
        ]

        # Keep track of number of obstacles the agent has passed
        self.passed_obstacles = 0

    # Create and return an instance of the Chrome Driver
    def _create_driver(self):

        # Set options for the WebDriver
        options = Options()

        # Turn off logging to keep terminal clean
        options.add_experimental_option('excludeSwitches', ['enable-logging'])

        # Keep the browser running after the code finishes executing
        options.add_experimental_option("detach", True)

        # Create a Service instance for running the ChromeDriver executable
        service = Service(executable_path=ChromeDriverManager().install())

        # Create an instance of the Chrome WebDriver with the specified service and options - The driver object can be used to automate interactions with the Chrome browser
        driver = webdriver.Chrome(service=service, options=options)

        # Maximize the Chrome window
        driver.maximize_window()

        return driver

    # Encode the obstacle type as an integer
    def _encode_obstacle_type(self, obstacle_type):
        if obstacle_type == 'CACTUS_SMALL':
            return 0
        elif obstacle_type == 'CACTUS_LARGE':
            return 1
        elif obstacle_type == 'PTERODACTYL':
            return 2
        else:
            raise ValueError(f"Unknown obstacle type: {obstacle_type}")

    # Get obstacles that are currently on the screen
    def _get_obstacles(self):
        obstacles = self.driver.execute_script(
            "return Runner.instance_.horizon.obstacles")
        obstacle_info = []
        for obstacle in obstacles:
            obstacle_type = obstacle['typeConfig']['type']
            # Encode the obstacle type as an integer
            encoded_obstacle_type = self._encode_obstacle_type(obstacle_type)
            obstacle_x = obstacle['xPos']
            obstacle_y = obstacle['yPos']
            obstacle_width = obstacle['typeConfig']['width']
            obstacle_height = obstacle['typeConfig']['height']
            obstacle_info.append(
                (encoded_obstacle_type, obstacle_x, obstacle_y, obstacle_width, obstacle_height))
        return obstacle_info

    # Get Trex's state (Jumping, Ducking or Running/Do nothing)
    def _get_trex_info(self):
        trex = self.driver.execute_script("return Runner.instance_.tRex")
        # xpos remains the same throughout the game - don't need it
        trex_y = trex['yPos']
        trex_is_jumping = trex['jumping']
        trex_is_ducking = trex['ducking']
        return trex_y, trex_is_jumping, trex_is_ducking

    # Get current game speed
    def _get_game_speed(self):
        game_speed = self.driver.execute_script(
            "return Runner.instance_.currentSpeed")
        return game_speed

    # Get the distance between the Trex and the next obstacle
    def _get_distance_to_next_obstacle(self):
        trex_x = self.driver.execute_script(
            "return Runner.instance_.tRex.xPos")  # xpos of trex
        obstacles = self._get_obstacles()
        if obstacles:
            next_obstacle = obstacles[0]
            obstacle_x = next_obstacle[1]  # xpos of next obstacle
            distance_to_next_obstacle = obstacle_x - trex_x
        else:
            distance_to_next_obstacle = None
        return distance_to_next_obstacle

    # Check if the agent has passed an obstacle
    def _passed_obstacle(self):
        obstacles = self._get_obstacles()
        if obstacles:
            # next_obstacle: [encoded_obstacle_type, obstacle_x, obstacle_y, obstacle_width, obstacle_height]
            next_obstacle = obstacles[0]
            trex_x = self.driver.execute_script(
                "return Runner.instance_.tRex.xPos")
            obstacle_x = next_obstacle[1]  # Next obstacle's xpos
            obstacle_width = next_obstacle[3]  # Next obstacle's width
            return obstacle_x + obstacle_width < trex_x
        else:
            return False

    # Get and return the score of the current game
    def _get_current_score(self):
        try:
            score = int(''.join(self.driver.execute_script(
                "return Runner.instance_.distanceMeter.digits")))
        except Exception:
            # Fall back to 0 if the score digits cannot be read
            score = 0
        return score

    # Get and return the high score for all games played in current browser session
    def _get_high_score(self):
        try:
            score = int(''.join(self.driver.execute_script(
                "return Runner.instance_.distanceMeter.highScore.slice(-5)")))  # MaxScore=99999, MaxScoreUnits=5
        except Exception:
            score = 0
        return score

    # Capture screenshot of current game state and return the image captured for rendering
    def _get_image(self):
        # Capture a screenshot of the game canvas as a data URL - string that represents the image in base64-encoded format
        data_url = self.driver.execute_script(
            "return document.querySelector('canvas.runner-canvas').toDataURL()")

        # Remove the leading text from the data URL using string slicing and decode the remaining base64-encoded data
        LEADING_TEXT = "data:image/png;base64,"
        image_data = base64.b64decode(data_url[len(LEADING_TEXT):])

        # Convert the binary data in 'image_data' to a 1D NumPy array
        image_array = np.frombuffer(image_data, dtype=np.uint8)

        # Decode the image data and create an OpenCV image object - OpenCV Image Shape format (H, W, C) ( rows, columns, and channels )
        image = cv2.imdecode(image_array, cv2.IMREAD_COLOR)

        return image

    # Load and Reset the game environment
    def reset(self):
        try:
            # Navigate to the Chrome Dino website
            self.driver.get("chrome://dino/")

        except WebDriverException as e:
            # Ignore "ERR_INTERNET_DISCONNECTED" error thrown because this game is available offline
            if "ERR_INTERNET_DISCONNECTED" in str(e):
                pass  # Ignore the exception.
            else:
                raise  # Re-raise any other WebDriverException

        # Avoid errors that can arise due to the 'runner-canvas' element not being present - Using WebDriverWait and EC together ensures that the code does not proceed until the required element is present
        timeout = 10
        WebDriverWait(self.driver, timeout).until(
            EC.presence_of_element_located((By.CLASS_NAME, "runner-canvas")))

        # Start game
        self.driver.find_element(By.TAG_NAME, "body").send_keys(Keys.SPACE)

        return self.get_observation()

    # Get the current state of the game and return it as the observation
    def get_observation(self):
        obstacles = self._get_obstacles()
        trex_y, trex_is_jumping, trex_is_ducking = self._get_trex_info()
        game_speed = self._get_game_speed()
        distance_to_next_obstacle = self._get_distance_to_next_obstacle()

        state = (
            trex_y,
            trex_is_jumping,
            trex_is_ducking,
            game_speed,
            distance_to_next_obstacle,
            # Unpack the tuple of the first obstacle
            *(obstacles[0] if obstacles else (None, None, None, None, None))
        )

        # Set dtype for state to float32 for consistency and compatibility with the RL algorithm
        state = np.array(state, dtype=np.float32)

        # Replace NaN values with -1
        state[np.isnan(state)] = -1

        return state

    # Check if the game is over and return True or False
    def is_game_over(self):
        # Done if either Trex crashed into an obstacle or reached max score which is 99999
        # Check if Trex crashed
        crashed = self.driver.execute_script("return Runner.instance_.crashed")

        # Get the maximum score from the game
        max_score = self.driver.execute_script(
            "return Runner.instance_.distanceMeter.maxScore")
        current_score = self._get_current_score()

        return crashed or (current_score >= max_score)

    # Calculate and return the reward for the current state of the game
    def get_reward(self, done):
        # Must maintain the relative importance of different rewards so that the agent can differentiate between the various outcomes and is encouraged to learn a good policy
        reward = 0
        if done:
            # Penalize for crashing into an obstacle
            reward -= 10
        else:
            if self._passed_obstacle():
                # Reward for passing an obstacle
                reward += 1
                self.passed_obstacles += 1
            else:
                # Small reward for staying alive
                reward += 0.1

        current_score = self._get_current_score()
        high_score = self._get_high_score()

        if current_score > high_score:
            # Bonus reward for surpassing the high score
            reward += 1

        return reward

    # Take a step in the game environment based on the given action
    def step(self, action):

        # Take action
        # Get key and action mapping
        key, action_type = self.actions_map[action]

        # Create a new ActionChains object
        action_chains = ActionChains(self.driver)

        # Perform the key press action
        if action_type == "key_down":
            action_chains.key_down(key).perform()
        # Perform the key release action
        elif action_type == "key_up":
            action_chains.key_up(key).perform()

        # Get next observation
        obs = self.get_observation()

        # Check whether game is over
        done = self.is_game_over()

        # Get reward
        reward = self.get_reward(done)

        info = {
            'current_score': self._get_current_score(),
            'high_score': self._get_high_score()
        }

        return obs, reward, done, info

    # Visualise the game
    def render(self, mode: str = 'human'):
        img = cv2.cvtColor(self._get_image(), cv2.COLOR_BGR2RGB)
        if mode == 'rgb-array':
            return img
        elif mode == 'human':
            cv2.imshow('Dino Game', img)
            cv2.waitKey(1)

    # Close the game environment and the driver
    def close(self):
        self.driver.quit()
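
The missing-obstacle handling in `get_observation` can be checked without launching a browser. The sketch below is a hypothetical stand-alone helper (`build_observation` is not part of the class above) that assumes the same field order and relies on NumPy converting `None` to NaN when casting to `float32`, which is then replaced by the `-1` sentinel.

```python
import numpy as np


def build_observation(trex_y, jumping, ducking, speed, distance, obstacle):
    """Pack game readings into a float32 vector, using -1 for missing values."""
    # `obstacle` is a 5-tuple (type, x, y, width, height), or None when the screen is clear
    obstacle_fields = obstacle if obstacle is not None else (None,) * 5
    state = np.array(
        (trex_y, jumping, ducking, speed, distance, *obstacle_fields),
        dtype=np.float32,
    )
    # None becomes NaN during the float conversion; swap it for the -1 sentinel
    state[np.isnan(state)] = -1
    return state


# No obstacle on screen: the last six entries fall back to -1
empty = build_observation(83, True, False, 6.011, None, None)
# A small cactus (type 0) ahead of the Trex
with_cactus = build_observation(93, False, False, 7.719, 99, (0, 105, 105, 17, 35))
print(empty)
print(with_cactus)
```

Using `-1` as the sentinel keeps every entry inside the lower bounds declared for `observation_space`, since no real distance, position, or size in the game is negative.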

## Test the Custom Game Environment

This section is for testing the Game Environment to ensure it is defined correctly before using it with the Agent for RL. 

In [146]:
# Helper function to format and print observations properly
def print_formatted_obs(observations):
    obs_titles = ["trex_y", "trex_jumping", "trex_ducking", "game_speed", "obst_dist", "obst_type", "obst_x", "obst_y", "obst_width", "obst_height"]
    # Create a pandas DataFrame
    df = pd.DataFrame(observations, columns=obs_titles)

    # Set the pandas display options for better readability (optional)
    pd.set_option("display.width", 140)
    # pd.set_option("display.precision", 2)

    # Print the DataFrame
    print(df)

In [140]:
env = DinoEnvironment()
env.reset() # returns an observation from the env

array([83.   ,  1.   ,  0.   ,  6.011, -1.   , -1.   , -1.   , -1.   ,
       -1.   , -1.   ], dtype=float32)

In [141]:
env.observation_space

Box([ 0.  0.  0.  6. -1. -1. -1. -1. -1. -1.], [150.   1.   1.  13. 600.   3. 600. 150.  50.  50.], (10,), float32)

In [142]:
env.observation_space.shape[0]

10

In [143]:
env.action_space

Discrete(4)

In [144]:
env.action_space.n

4

**Note:** The `render` function works better when the code is run from a `.py` Python file than from the `.ipynb` notebook.

In [147]:
# Test loop - Play 1 game
env = DinoEnvironment()
for episode in range(1):
    obs = env.reset()
    done = False
    total_reward = 0
    all_observations = []
    # images = []

    while not done:
        action = env.action_space.sample()  # Take random actions
        obs, reward, done, info = env.step(action)
        # print(obs)
        all_observations.append(obs)  # Collect obs so they can be printed as a table afterwards
        total_reward += reward

        # env.render(mode='human')
        # img = env.render(mode='rgb-array')
        # images.append(img) # Can use some image library to create a gif using collected images

    print_formatted_obs(all_observations)
    print(f"Episode: {episode}, Total Reward: {total_reward}, Current Score: {info['current_score']}, High Score: {info['high_score']}")

     trex_y  trex_jumping  trex_ducking  game_speed  obst_dist  obst_type  obst_x  obst_y  obst_width  obst_height
0      73.0           1.0           0.0       6.017       -1.0       -1.0    -1.0    -1.0        -1.0         -1.0
1      51.0           1.0           0.0       6.029       -1.0       -1.0    -1.0    -1.0        -1.0         -1.0
2      33.0           1.0           0.0       6.041       -1.0       -1.0    -1.0    -1.0        -1.0         -1.0
3      19.0           1.0           0.0       6.051       -1.0       -1.0    -1.0    -1.0        -1.0         -1.0
4      13.0           1.0           0.0       6.063       -1.0       -1.0    -1.0    -1.0        -1.0         -1.0
..      ...           ...           ...         ...        ...        ...     ...     ...         ...          ...
137    93.0           0.0           0.0       7.719       99.0        0.0   105.0   105.0        17.0         35.0
138    93.0           0.0           0.0       7.731       84.0        0.0    87.

## ModifiedDinoEnvironment Class

In this version of the environment I added information about the Trex to the state: its standing height and width, and its ducking height and width. This increased the observation space shape from 10 to 14. My goal was to give the agent enough information to decide whether to jump or duck and to avoid unnecessary actions. To achieve this, I experimented with different reward structures.
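
As a reading aid, the 14-element layout can be written out with named fields. This is only an illustrative sketch: `DinoObservation` is a hypothetical name that the notebook does not define, and the field order is assumed to match the tuple built in `get_observation` of `ModifiedDinoEnvironment`.

```python
from typing import NamedTuple


class DinoObservation(NamedTuple):
    """Named view of the 14-element state vector (hypothetical helper)."""
    trex_y: float
    trex_height: float
    trex_width: float
    trex_duck_height: float
    trex_duck_width: float
    trex_is_jumping: float
    trex_is_ducking: float
    game_speed: float
    distance_to_next_obstacle: float
    obstacle_type: float
    obstacle_x: float
    obstacle_y: float
    obstacle_width: float
    obstacle_height: float


# Example values for a jumping Trex with no obstacle on screen (-1 marks "missing")
raw = [67.0, 47.0, 44.0, 25.0, 59.0, 1.0, 0.0, 6.019, -1.0, -1.0, -1.0, -1.0, -1.0, -1.0]
obs = DinoObservation(*raw)
print(obs.trex_duck_height, obs.obstacle_type)
```

The four new entries sit directly after `trex_y`, so the indices of every later feature shift by four compared with the 10-element version.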

In [2]:
# Create Dino Game Environment
class ModifiedDinoEnvironment(Env):

    def __init__(self):

        # Initialise the parent Env class
        super().__init__()

        self.driver = self._create_driver()

        # Setup spaces
        low_values = np.array(
            [0, -1, -1, -1, -1, 0, 0, 6, -1, -1, -1, -1, -1, -1], dtype=np.float32)  # Initial speed is 6, while max speed is 13
        high_values = np.array(
            [150, 50, 50, 50, 60, 1, 1, 13, 600, 3, 600, 150, 50, 50], dtype=np.float32)  # Canvas dimensions are 600x150
        self.observation_space = Box(
            low=low_values, high=high_values, shape=(14,), dtype=np.float32)

        # Start jumping, Start ducking, Stop ducking, Do nothing - Ducking has been divided into two actions because the agent should also learn the correct ducking duration
        self.action_space = Discrete(4)

        self.actions_map = [
            (Keys.ARROW_UP, "key_down"),  # Start jumping
            (Keys.ARROW_DOWN, "key_down"),  # Start ducking
            (Keys.ARROW_DOWN, "key_up"),  # Stop ducking
            (Keys.ARROW_RIGHT, "key_down")  # Do nothing
        ]

        # Keep track of number of obstacles the agent has passed
        self.passed_obstacles = 0

    # Create and return an instance of the Chrome Driver
    def _create_driver(self):

        # Set options for the WebDriver
        options = Options()

        # Turn off logging to keep terminal clean
        options.add_experimental_option('excludeSwitches', ['enable-logging'])

        # Keep the browser running after the code finishes executing
        options.add_experimental_option("detach", True)

        # Create a Service instance for running the ChromeDriver executable
        service = Service(executable_path=ChromeDriverManager().install())

        # Create an instance of the Chrome WebDriver with the specified service and options - The driver object can be used to automate interactions with the Chrome browser
        driver = webdriver.Chrome(service=service, options=options)

        # Maximize the Chrome window
        driver.maximize_window()

        return driver

    # Encode the obstacle type as an integer
    def _encode_obstacle_type(self, obstacle_type):
        if obstacle_type == 'CACTUS_SMALL':
            return 0
        elif obstacle_type == 'CACTUS_LARGE':
            return 1
        elif obstacle_type == 'PTERODACTYL':
            return 2
        else:
            raise ValueError(f"Unknown obstacle type: {obstacle_type}")

    # Get obstacles that are currently on the screen
    def _get_obstacles(self):
        obstacles = self.driver.execute_script(
            "return Runner.instance_.horizon.obstacles")
        obstacle_info = []
        for obstacle in obstacles:
            obstacle_type = obstacle['typeConfig']['type']
            # Encode the obstacle type as an integer
            encoded_obstacle_type = self._encode_obstacle_type(obstacle_type)
            obstacle_x = obstacle['xPos']
            obstacle_y = obstacle['yPos']
            obstacle_width = obstacle['typeConfig']['width']
            obstacle_height = obstacle['typeConfig']['height']
            obstacle_info.append(
                (encoded_obstacle_type, obstacle_x, obstacle_y, obstacle_width, obstacle_height))
        return obstacle_info

    # Get Trex's state (Jumping, Ducking or Running/Do nothing)
    def _get_trex_info(self):
        trex = self.driver.execute_script("return Runner.instance_.tRex")
        # xpos remains the same throughout the game - don't need it
        trex_y = trex['yPos']
        trex_height = trex['config']['HEIGHT']
        trex_width = trex['config']['WIDTH']
        trex_duck_height = trex['config']['HEIGHT_DUCK']
        trex_duck_width = trex['config']['WIDTH_DUCK']
        trex_is_jumping = trex['jumping']
        trex_is_ducking = trex['ducking']
        return trex_y, trex_height, trex_width, trex_duck_height, trex_duck_width, trex_is_jumping, trex_is_ducking

    # Get current game speed
    def _get_game_speed(self):
        game_speed = self.driver.execute_script(
            "return Runner.instance_.currentSpeed")
        return game_speed

    # Get the distance between the Trex and the next obstacle
    def _get_distance_to_next_obstacle(self):
        trex_x = self.driver.execute_script(
            "return Runner.instance_.tRex.xPos")  # xpos of trex
        obstacles = self._get_obstacles()
        if obstacles:
            next_obstacle = obstacles[0]
            obstacle_x = next_obstacle[1]  # xpos of next obstacle
            distance_to_next_obstacle = obstacle_x - trex_x
        else:
            distance_to_next_obstacle = None
        return distance_to_next_obstacle

    # Check if the agent has passed an obstacle
    def _passed_obstacle(self):
        obstacles = self._get_obstacles()
        if obstacles:
            # next_obstacle: [encoded_obstacle_type, obstacle_x, obstacle_y, obstacle_width, obstacle_height]
            next_obstacle = obstacles[0]
            trex_x = self.driver.execute_script(
                "return Runner.instance_.tRex.xPos")
            obstacle_x = next_obstacle[1]  # Next obstacle's xpos
            obstacle_width = next_obstacle[3]  # Next obstacle's width
            return obstacle_x + obstacle_width < trex_x
        else:
            return False

    # Get and return the score of the current game
    def _get_current_score(self):
        try:
            score = int(''.join(self.driver.execute_script(
                "return Runner.instance_.distanceMeter.digits")))
        except Exception:
            # Fall back to 0 if the score digits cannot be read
            score = 0
        return score

    # Get and return the high score for all games played in current browser session
    def _get_high_score(self):
        try:
            score = int(''.join(self.driver.execute_script(
                "return Runner.instance_.distanceMeter.highScore.slice(-5)")))  # MaxScore=99999, MaxScoreUnits=5
        except Exception:
            score = 0
        return score

    # Capture screenshot of current game state and return the image captured for rendering
    def _get_image(self):
        # Capture a screenshot of the game canvas as a data URL - string that represents the image in base64-encoded format
        data_url = self.driver.execute_script(
            "return document.querySelector('canvas.runner-canvas').toDataURL()")

        # Remove the leading text from the data URL using string slicing and decode the remaining base64-encoded data
        LEADING_TEXT = "data:image/png;base64,"
        image_data = base64.b64decode(data_url[len(LEADING_TEXT):])

        # Convert the binary data in 'image_data' to a 1D NumPy array
        image_array = np.frombuffer(image_data, dtype=np.uint8)

        # Decode the image data and create an OpenCV image object - OpenCV Image Shape format (H, W, C) ( rows, columns, and channels )
        image = cv2.imdecode(image_array, cv2.IMREAD_COLOR)

        return image

    # Load and Reset the game environment
    def reset(self):
        try:
            # Navigate to the Chrome Dino website
            self.driver.get("chrome://dino/")

        except WebDriverException as e:
            # Ignore "ERR_INTERNET_DISCONNECTED" error thrown because this game is available offline
            if "ERR_INTERNET_DISCONNECTED" in str(e):
                pass  # Ignore the exception.
            else:
                raise  # Re-raise any other WebDriverException

        # Avoid errors that can arise due to the 'runner-canvas' element not being present - Using WebDriverWait and EC together ensures that the code does not proceed until the required element is present
        timeout = 10
        WebDriverWait(self.driver, timeout).until(
            EC.presence_of_element_located((By.CLASS_NAME, "runner-canvas")))

        # Start game
        self.driver.find_element(By.TAG_NAME, "body").send_keys(Keys.SPACE)

        return self.get_observation()

    # Get the current state of the game and return it as the observation
    def get_observation(self):
        obstacles = self._get_obstacles()
        trex_y, trex_height, trex_width, trex_duck_height, trex_duck_width, trex_is_jumping, trex_is_ducking = self._get_trex_info()
        game_speed = self._get_game_speed()
        distance_to_next_obstacle = self._get_distance_to_next_obstacle()

        state = (
            trex_y,
            trex_height,
            trex_width,
            trex_duck_height,
            trex_duck_width,
            trex_is_jumping,
            trex_is_ducking,
            game_speed,
            distance_to_next_obstacle,
            # Unpack the tuple of the first obstacle
            *(obstacles[0] if obstacles else (None, None, None, None, None))
        )

        # Set dtype for state to float32 for consistency and compatibility with the RL algorithm
        state = np.array(state, dtype=np.float32)

        # Replace NaN values with -1
        state[np.isnan(state)] = -1

        return state

    # Check if the game is over and return True or False
    def is_game_over(self):
        # Done if either Trex crashed into an obstacle or reached max score which is 99999
        # Check if Trex crashed
        crashed = self.driver.execute_script("return Runner.instance_.crashed")

        # Get the maximum score from the game
        max_score = self.driver.execute_script(
            "return Runner.instance_.distanceMeter.maxScore")
        current_score = self._get_current_score()

        return crashed or (current_score >= max_score)

    # Calculate and return the reward for the current state of the game
    def get_reward(self, obs, done, info):

        reward = 0
        
        current_score = info['current_score']
        high_score = info['high_score']

        if done:
            # Penalize for crashing into an obstacle
            reward -= 10

            if current_score > high_score:
                # Bonus reward for surpassing the high score
                reward += 10
        else:

            # Reward for staying alive
            reward += 1

            trex_y, trex_height, trex_width, trex_duck_height, trex_duck_width, trex_is_jumping, trex_is_ducking, game_speed, distance_to_next_obstacle, obstacle_type, obstacle_x, obstacle_y, obstacle_width, obstacle_height = obs

            # Penalize unnecessary jumps and ducks when there are no obstacles
            if obstacle_type == -1:
                if trex_is_jumping:
                    # Penalize for jumping when there are no obstacles
                    reward -= 0.5
                if trex_is_ducking:
                    # Penalize for ducking when there are no obstacles
                    reward -= 0.5

            # Penalize for taking an incorrect action when there are obstacles
            if obstacle_type != -1:
                if trex_is_jumping and (obstacle_y + obstacle_height) < trex_duck_height:
                    # Penalize for jumping when the obstacle is flying and there's enough space to duck
                    reward -= 0.1

                if trex_is_ducking and (trex_y + trex_height) > (obstacle_y):
                    # Penalize for ducking when the obstacle is on the ground
                    reward -= 0.1

            if self._passed_obstacle():
                # Reward for passing an obstacle
                reward += 1
                self.passed_obstacles += 1

            if current_score > high_score:
                # Small reward for every step the current score surpasses the high score
                reward += 0.1

        return reward

    # Take a step in the game environment based on the given action
    def step(self, action):

        # Take action
        # Get key and action mapping
        key, action_type = self.actions_map[action]

        # Create a new ActionChains object
        action_chains = ActionChains(self.driver)

        # Perform the key press action
        if action_type == "key_down":
            action_chains.key_down(key).perform()
        # Perform the key release action
        elif action_type == "key_up":
            action_chains.key_up(key).perform()

        # Get next observation
        obs = self.get_observation()

        # Check whether game is over
        done = self.is_game_over()

        info = {
            'current_score': self._get_current_score(),
            'high_score': self._get_high_score()
        }

        # Get reward
        reward = self.get_reward(obs, done, info)

        return obs, reward, done, info

    # Visualise the game
    def render(self, mode: str = 'human'):
        img = cv2.cvtColor(self._get_image(), cv2.COLOR_BGR2RGB)
        if mode == 'rgb-array':
            return img
        elif mode == 'human':
            cv2.imshow('Dino Game', img)
            cv2.waitKey(1)

    # Close the game environment and the driver
    def close(self):
        self.driver.quit()
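
The reward rules in `get_reward` can be tried out without a browser by mirroring them in a pure function. The sketch below is a hypothetical helper (`shaped_reward` is not part of the notebook) that takes the result of `_passed_obstacle()` as a plain boolean argument and otherwise follows the same branches and constants as the method above.

```python
def shaped_reward(obs, done, current_score, high_score, passed_obstacle):
    """Browser-free mirror of the reward rules in ModifiedDinoEnvironment.get_reward."""
    if done:
        reward = -10.0  # crash penalty
        if current_score > high_score:
            reward += 10.0  # bonus for ending above the high score
        return reward

    (trex_y, trex_h, _trex_w, duck_h, _duck_w, jumping, ducking,
     _speed, _dist, obst_type, _obst_x, obst_y, _obst_w, obst_h) = obs

    reward = 1.0  # staying alive
    if obst_type == -1:
        # No obstacle on screen: jumping or ducking is unnecessary
        if jumping:
            reward -= 0.5
        if ducking:
            reward -= 0.5
    else:
        # Obstacle on screen: discourage the less suitable action
        if jumping and (obst_y + obst_h) < duck_h:
            reward -= 0.1
        if ducking and (trex_y + trex_h) > obst_y:
            reward -= 0.1
    if passed_obstacle:
        reward += 1.0
    if current_score > high_score:
        reward += 0.1
    return reward


no_obstacle_jump = [67, 47, 44, 25, 59, 1, 0, 6.019, -1, -1, -1, -1, -1, -1]
no_obstacle_run = [93, 47, 44, 25, 59, 0, 0, 6.019, -1, -1, -1, -1, -1, -1]
print(shaped_reward(no_obstacle_jump, False, 10, 50, False))
print(shaped_reward(no_obstacle_run, False, 10, 50, False))
print(shaped_reward(no_obstacle_run, True, 10, 50, False))
```

Separating the arithmetic from the Selenium calls in this way makes it easier to compare candidate reward structures on recorded observations before running a full training session.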


## Test the Modified Custom Game Environment

This section is for testing the Game Environment to ensure it is defined correctly before using it with the Agent for RL. 

In [56]:
# Helper function to format and print observations properly
def print_formatted_obs(observations):
    obs_titles = ["trex_y", "trex_h","trex_w","tr_duck_h","tr_duck_w","trex_jump", "trex_duck", "game_speed", "obst_dist", "obst_type", "obst_x", "obst_y", "obst_w", "obst_h"]
    # Create a pandas DataFrame
    df = pd.DataFrame(observations, columns=obs_titles)

    # Set the pandas display options for better readability (optional)
    pd.set_option("display.width", 140)
    # pd.set_option("display.precision", 2)

    # Print the DataFrame
    print(df)

In [58]:
env = ModifiedDinoEnvironment()
env.observation_space.shape[0]

14

In [64]:
# Test loop - Play 1 game
env = ModifiedDinoEnvironment()
for episode in range(1):
    obs = env.reset()
    done = False
    total_reward = 0
    all_observations = []
    # images = []

    while not done:
        action = env.action_space.sample()  # Take random actions
        obs, reward, done, info = env.step(action)
        # print(obs)
        all_observations.append(obs)  # Collect obs so they can be printed as a table afterwards
        total_reward += reward

        # env.render(mode='human')
        # img = env.render(mode='rgb-array')
        # images.append(img) # Can use some image library to create a gif using collected images

    print_formatted_obs(all_observations)
    print(f"Episode: {episode}, Total Reward: {total_reward}, Current Score: {info['current_score']}, High Score: {info['high_score']}")

     trex_y  trex_h  trex_w  tr_duck_h  tr_duck_w  trex_jump  trex_duck  game_speed  obst_dist  obst_type  obst_x  obst_y  obst_w  obst_h
0      67.0    47.0    44.0       25.0       59.0        1.0        0.0       6.019       -1.0       -1.0    -1.0    -1.0    -1.0    -1.0
1      47.0    47.0    44.0       25.0       59.0        1.0        0.0       6.031       -1.0       -1.0    -1.0    -1.0    -1.0    -1.0
2      55.0    47.0    44.0       25.0       59.0        1.0        0.0       6.041       -1.0       -1.0    -1.0    -1.0    -1.0    -1.0
3      72.0    47.0    44.0       25.0       59.0        1.0        0.0       6.051       -1.0       -1.0    -1.0    -1.0    -1.0    -1.0
4      80.0    47.0    44.0       25.0       59.0        1.0        0.0       6.061       -1.0       -1.0    -1.0    -1.0    -1.0    -1.0
..      ...     ...     ...        ...        ...        ...        ...         ...        ...        ...     ...     ...     ...     ...
144    93.0    47.0    44.0       

# DQN Dino Agent


The DQN algorithm for this agent is adapted from these two papers:

1. Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., & Riedmiller, M. (2013). Playing Atari with Deep Reinforcement Learning. https://doi.org/10.48550/ARXIV.1312.5602
2. Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., ... & Hassabis, D. (2015). Human-level control through deep reinforcement learning. Nature, 518(7540), 529-533. https://doi.org/10.1038/nature14236
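
For reference, the one-step update that the `replay` method of `DinoDQNAgent` performs for a single transition $(s, a, r, s')$ can be summarised as follows. This is a sketch of the update as it is coded in this notebook, not a restatement of the papers: the notebook uses one network with parameters $\theta$ for both the prediction and the target (treated as a constant), whereas the 2015 paper computes the target with a separate, periodically updated target network. $\mathcal{A}$ denotes the action set; the factor $1/|\mathcal{A}|$ appears because `nn.MSELoss` averages over all action outputs while only the chosen action's entry differs from the prediction.

```latex
y =
\begin{cases}
r, & \text{if the game ended at } s' \\
r + \gamma \max_{a'} Q(s', a'; \theta), & \text{otherwise}
\end{cases}
\qquad
L(\theta) = \frac{1}{|\mathcal{A}|} \bigl( y - Q(s, a; \theta) \bigr)^2
```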

## Import Dependencies

In [3]:
import os
import random
import torch
import torch.nn as nn
import torch.optim as optim
from collections import deque
import wandb

## DinoDQNAgent Class

In [4]:
class DinoDQNAgent():
    def __init__(self, env,
                 gamma=0.95,
                 epsilon=1.0,
                 epsilon_min=0.01,
                 epsilon_decay=0.995,
                 learning_rate=0.001,
                 batch_size=32,
                 memory_size=100000):
        self.env = env
        self.state_size = env.observation_space.shape[0]  # 10 for DinoEnvironment, 14 for ModifiedDinoEnvironment
        self.action_size = env.action_space.n  # 4
        self.hidden_sizes = [64, 128]  # number of hidden neurons for the model
        self.memory = deque(maxlen=memory_size)
        self.gamma = gamma  # discounting factor
        self.epsilon = epsilon  # exploration rate
        self.epsilon_min = epsilon_min  # min exploration rate
        self.epsilon_decay = epsilon_decay  # multiplicative decay applied once per replay() call
        self.batch_size = batch_size
        self.model = self._build_model()
        self.optimizer = optim.Adam(self.model.parameters(), lr=learning_rate)
        self.loss_fn = nn.MSELoss()

    # Define the DQN model architecture - This model will be used to approximate the Q-values of the agent's actions given a state.
    def _build_model(self):
        model = nn.Sequential(
            nn.Linear(self.state_size, self.hidden_sizes[0]),
            nn.ReLU(),
            nn.Linear(self.hidden_sizes[0], self.hidden_sizes[1]),
            nn.ReLU(),
            nn.Linear(self.hidden_sizes[1], self.action_size)
        )

        return model

    # Store the agent's experience as a (state, action, reward, next_state, done) tuple
    def remember(self, state, action, reward, next_state, done):
        self.memory.append((state, action, reward, next_state, done))

    # Determine which action to take given a state
    def act(self, state):
        # Explore randomly or exploit given the current epsilon value
        if random.uniform(0, 1) <= self.epsilon:
            return random.randrange(self.action_size)
        else:
            state = torch.tensor(state, dtype=torch.float32)
            q_values = self.model(state)
            action = torch.argmax(q_values).item()
            return action

    # Update the DQN model using a batch of experiences sampled from the memory
    def replay(self):
        # Check if the number of experiences (state, action, reward, next_state, done) in the memory is less than the batch size
        if len(self.memory) < self.batch_size:
            # Don't do anything since there's not enough data to create a minibatch for training
            return

        # Create minibatch from a random sample of experiences from the memory
        minibatch = random.sample(self.memory, self.batch_size)

        for state, action, reward, next_state, done in minibatch:
            # Calculate the expected Q-value for the current state-action pair (q_target)
            # If done, the game has ended, so there is no future reward to estimate
            q_target = reward
            if not done:
                # Calculate the Q-values for the next state using the DQN model, i.e., estimate future reward
                next_state = torch.tensor(next_state, dtype=torch.float32)
                q_values_next = self.model(next_state)
                # Update the target value by adding the discounted maximum Q-value of the next state to the current reward
                q_target = reward + self.gamma * \
                    torch.max(q_values_next).item()

            # Calculate the Q-values for the current state using the DQN model
            state = torch.tensor(state, dtype=torch.float32)
            q_values = self.model(state)

            # Build the training target: copy the predicted Q-values and overwrite the chosen action's entry with q_target
            q_values_expected = q_values.clone().detach()

            q_values_expected[action] = q_target

            # Note: q_values_expected acts as the ground truth for the action the agent took in the current state, while q_values is the model's prediction

            # Reset the gradients of the optimizer before performing backpropagation
            self.optimizer.zero_grad()

            # Calculate the loss using the Mean Squared Error (MSE) between the current Q-values and the expected Q-values
            loss = self.loss_fn(q_values, q_values_expected)

            # Perform backpropagation to calculate the gradients of the model's parameters with respect to the loss
            loss.backward()

            # Update the model's parameters using the calculated gradients and the optimizer's learning rate
            self.optimizer.step()

        # Decrease epsilon over time to reduce exploration and increase exploitation of the model's learned knowledge
        if self.epsilon > self.epsilon_min:
            self.epsilon *= self.epsilon_decay

        # Return the loss of the last transition processed in the minibatch
        return loss.item()

    # Save the current state of the DQN model and optimizer to a file.
    def save_model(self, model_name, model_output_dir, log_to_wandb):
        # Create a dictionary to store the state of the model, optimizer and any other additional information
        state = {
            'model_state_dict': self.model.state_dict(),
            'optimizer_state_dict': self.optimizer.state_dict()
        }

        save_path = os.path.join(
            model_output_dir, model_name)

        # Save the state dictionary to a file
        torch.save(state, save_path)

        if log_to_wandb:
            # Save model as a wandb artifact
            artifact = wandb.Artifact(model_name, type='model')
            artifact.add_file(save_path)
            wandb.log_artifact(artifact)

    # Load the DQN model and optimizer state from a file.
    def load_model(self, file_path, older_model, for_training):

        # Older checkpoints store only the model's state_dict, without the optimizer state
        if older_model:
            self.model.load_state_dict(torch.load(file_path))
        else:
            # Load the state dictionary from the file using the torch.load() function
            state = torch.load(file_path)

            # Restore the state of the model and optimizer
            self.model.load_state_dict(state['model_state_dict'])

            # Set for_training to true if using the model to continue training from a previously saved state
            if for_training:
                self.optimizer.load_state_dict(state['optimizer_state_dict'])
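
The `replay` method above takes one optimizer step per sampled transition. A common alternative is to process the whole minibatch in a single forward and backward pass. The following is a minimal sketch of that variant, not the method used for the training runs in this notebook. The function name `batched_td_step` is hypothetical, and the loss here is averaged over the batch rather than over the action outputs, so its values are not directly comparable to the logged losses.

```python
import numpy as np
import torch
import torch.nn as nn


def batched_td_step(model, optimizer, batch, gamma=0.95):
    """Take one gradient step on a minibatch of
    (state, action, reward, next_state, done) tuples and return the loss."""
    states, actions, rewards, next_states, dones = zip(*batch)
    states = torch.as_tensor(np.array(states), dtype=torch.float32)
    next_states = torch.as_tensor(np.array(next_states), dtype=torch.float32)
    actions = torch.as_tensor(actions, dtype=torch.int64).unsqueeze(1)
    rewards = torch.as_tensor(rewards, dtype=torch.float32)
    dones = torch.as_tensor(dones, dtype=torch.float32)

    # Targets are treated as constants, so no gradients are tracked here.
    # Terminal transitions (done == 1) keep only the immediate reward.
    with torch.no_grad():
        max_next_q = model(next_states).max(dim=1).values
        targets = rewards + gamma * max_next_q * (1.0 - dones)

    # Q-values predicted for the actions that were actually taken
    q_taken = model(states).gather(1, actions).squeeze(1)

    loss = nn.functional.mse_loss(q_taken, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

With the agent defined above, a call could look like `batched_td_step(agent.model, agent.optimizer, random.sample(agent.memory, agent.batch_size), agent.gamma)`.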


# Train and Test Agent

## Import Dependencies

In [5]:
import os
import wandb

## Train Agent

### Train Function

In [6]:
def train(agent, env, episodes, model_output_dir, save_interval=10, log_to_wandb=False, render=False):

    if log_to_wandb:
        wandb.init(project='chrome_dino_dqn_agent', name='train_run')

    total_rewards = []
    total_scores = []
    
    # test_interval = 10  # The interval at which to test the agent's knowledge

    for episode in range(episodes):
        state = env.reset()
        done = False
        episode_reward = 0
        episode_loss = []
        
        # # Set epsilon to 0 for every n episodes and then reset it back to the previous value
        # previous_epsilon = None
        # if episode % test_interval == 0:
        #     previous_epsilon = agent.epsilon
        #     agent.epsilon = 0


        while not done:
            if render:
                env.render(mode='human')

            # Use agent to predict action
            action = agent.act(state)

            # Take a step in the environment
            next_state, reward, done, info = env.step(action)

            # Remember the agent's experience after every step
            agent.remember(state, action, reward, next_state, done)

            state = next_state
            episode_reward += reward

        # Train/Update the model every episode
        loss = agent.replay()
        episode_loss.append(loss)
        


        total_rewards.append(episode_reward)
        total_scores.append(info["current_score"])

        # Calculate overall training metrics
        mean_episode_loss = sum(episode_loss) / len(episode_loss)
        mean_reward = sum(total_rewards) / len(total_rewards)
        mean_score = sum(total_scores) / len(total_scores)

        # Log metrics
        print(
            f"Episode {episode + 1}/{episodes}, Highest Score: {info['high_score']}, Episode Score: {info['current_score']}, Episode Reward: {episode_reward:.4f}, Episode Epsilon: {agent.epsilon:.4f}, Episode Loss: {loss:.4f}, Mean Score: {mean_score:.4f}, Mean Reward {mean_reward:.4f}")
        
        if log_to_wandb:
            wandb.log({
                "episode": (episode + 1)/episodes,
                "highest_score": info["high_score"],
                "episode_score": info["current_score"],
                "episode_reward": episode_reward,
                "episode_epsilon": agent.epsilon,
                "episode_loss": loss,
                "mean_loss": mean_episode_loss,
                "mean_reward": mean_reward,
                "mean_current_score": mean_score
            })
            
        # # Reset epsilon back to its previous value if needed
        # if previous_epsilon is not None:
        #     agent.epsilon = previous_epsilon
        #     previous_epsilon = None

        # Save the model every save_interval episodes
        if (episode + 1) % save_interval == 0:
            model_name = f"dino_dqn_episode_{episode + 1}.pth"
            agent.save_model(model_name, model_output_dir, log_to_wandb)
            print(f"Model saved after episode {episode + 1}")
            
    # Finish wandb logging
    if log_to_wandb:
        wandb.finish()
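
One edge case in `train`: `agent.replay()` returns `None` while the memory holds fewer than `batch_size` transitions. In that case both `sum(episode_loss)` and the `{loss:.4f}` format specifier raise a `TypeError`. This did not occur in the runs logged here, but it could if the first episode were shorter than the batch size. The helpers below are a hypothetical sketch of how the logging could tolerate a skipped update; the names `format_loss` and `mean_of_available` are not part of the original code.

```python
def format_loss(loss, precision=4):
    """Format a loss for logging; a skipped replay (None) is shown as 'n/a'."""
    if loss is None:
        return "n/a"
    return f"{loss:.{precision}f}"


def mean_of_available(losses):
    """Mean of the losses that are not None, or None if every update was skipped."""
    values = [value for value in losses if value is not None]
    return sum(values) / len(values) if values else None


print(format_loss(2.1888))
print(format_loss(None))
print(mean_of_available([None, 1.0, 3.0]))
```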


### Train

#### `train_5`

Reverted to the original reward strategy used in `train_1`.

- Trained for 200 episodes
- Replay after every episode
- Epsilon decay = 0.995
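
Because `replay` is called once per episode and multiplies epsilon by the decay factor on each call, the exploration rate at the end of a run follows directly from the episode count. The sketch below re-applies the decay rule from `DinoDQNAgent.replay` in isolation; the function name `epsilon_after` is hypothetical. For 200 replays it gives roughly 0.367, which agrees with the final `episode_epsilon` of 0.36696 reported for this run.

```python
def epsilon_after(n_replays, epsilon=1.0, epsilon_min=0.01, epsilon_decay=0.995):
    """Apply the decay rule from DinoDQNAgent.replay() n_replays times."""
    for _ in range(n_replays):
        if epsilon > epsilon_min:
            epsilon *= epsilon_decay
    return epsilon


# Exploration rate after the 1st, 2nd and 200th replay
for n in (1, 2, 200):
    print(n, round(epsilon_after(n), 5))
```

With these settings the agent still takes a random action in more than a third of its steps at the end of 200 episodes.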

In [148]:
# Specify directory to save model
OUTPUT_DIR = "trained_models/"

# Create directories if they don't exist on the path
if not os.path.exists(OUTPUT_DIR):
    os.makedirs(OUTPUT_DIR)

In [149]:
# Number of episodes to train the agent
TRAIN_EPISODES = 200

In [150]:
# Instantiate Environment and Agent
env = DinoEnvironment()
agent = DinoDQNAgent(env)

# Train Model
train(agent, env, TRAIN_EPISODES, OUTPUT_DIR, log_to_wandb=True)

Episode 1/200, Highest Score: 107, Episode Score: 107, Episode Reward: 97.3000, Episode Epsilon: 0.9950, Episode Loss: 2.1888, Mean Score: 107.0000, Mean Reward 97.3000
Episode 2/200, Highest Score: 107, Episode Score: 53, Episode Reward: 1.3000, Episode Epsilon: 0.9900, Episode Loss: 16.3613, Mean Score: 80.0000, Mean Reward 49.3000
Episode 3/200, Highest Score: 107, Episode Score: 54, Episode Reward: 2.6000, Episode Epsilon: 0.9851, Episode Loss: 26.8944, Mean Score: 71.3333, Mean Reward 33.7333
Episode 4/200, Highest Score: 107, Episode Score: 54, Episode Reward: 2.5000, Episode Epsilon: 0.9801, Episode Loss: 19.7332, Mean Score: 67.0000, Mean Reward 25.9250
Episode 5/200, Highest Score: 107, Episode Score: 52, Episode Reward: 1.1000, Episode Epsilon: 0.9752, Episode Loss: 46.6583, Mean Score: 64.0000, Mean Reward 20.9600
Episode 6/200, Highest Score: 107, Episode Score: 52, Episode Reward: 2.4000, Episode Epsilon: 0.9704, Episode Loss: 46.9283, Mean Score: 62.0000, Mean Reward 17.8

Run history:
episode,▁▁▁▁▂▂▂▂▂▃▃▃▃▃▃▄▄▄▄▄▅▅▅▅▅▅▆▆▆▆▆▇▇▇▇▇▇███
episode_epsilon,██▇▇▇▇▆▆▆▆▆▅▅▅▅▄▄▄▄▄▄▃▃▃▃▃▃▃▂▂▂▂▂▂▂▁▁▁▁▁
episode_loss,▁▂▁▁▁▁▁▁▁▁▁▁▁▄▂▁▁▁▂▁▁▁▁▁▁▁▁▂▂▁▁▁▁▁▁▂█▂▁▁
episode_reward,▁▂▄▃▂▅▁▁▂▁▂▁▁▁▁▂▂▁▂▄▄▂▂▂▁▄▃█▂▃▂▂▃▂▂▂▂▂▂▂
episode_score,▂▁▄▂▁█▁▁▁▁▁▁▁▁▁▁▁▁▄▇█▁▂▁▁▅▁▆▁▁▁▁▁▂▁▁▁▁▁▁
highest_score,▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁
mean_current_score,█▃▃▂▂▂▂▁▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁
mean_loss,▁▂▁▁▁▁▁▁▁▁▁▁▁▄▂▁▁▁▂▁▁▁▁▁▁▁▁▂▂▁▁▁▁▁▁▂█▂▁▁
mean_reward,█▃▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁

Run summary:
episode,1.0
episode_epsilon,0.36696
episode_loss,1.23786
episode_reward,2.4
episode_score,52.0
highest_score,107.0
mean_current_score,54.965
mean_loss,1.23786
mean_reward,3.6465


#### `train_6`
Original reward strategy, as in `train_1`.

- Trained for 400 episodes
- Replay after every episode
- Epsilon decay = 0.995 with exploration breaks: epsilon is set to 0 every 10 episodes to test the model's knowledge during training
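
The exploration breaks also slow the decay schedule. The sketch below is a reconstruction, assuming the commented-out block in `train` was active for this run: on every tenth episode (starting with the first) epsilon is set to 0, `replay` then skips the decay because `epsilon > epsilon_min` is false, and the previous value is restored afterwards. The function name `simulate_epsilon` is hypothetical. Under these assumptions 400 episodes contain 40 breaks and 360 decays, giving roughly 0.1646, in line with the final `episode_epsilon` of 0.16455 reported for this run.

```python
def simulate_epsilon(episodes, test_interval=10, epsilon=1.0,
                     epsilon_min=0.01, epsilon_decay=0.995):
    """Return the training epsilon after `episodes` episodes with greedy breaks."""
    for episode in range(episodes):
        greedy_break = episode % test_interval == 0
        # During a break the agent runs with epsilon = 0, so replay() does
        # not decay it; the stored training epsilon is restored unchanged.
        if not greedy_break and epsilon > epsilon_min:
            epsilon *= epsilon_decay
    return epsilon


print(round(simulate_epsilon(400), 5))
```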

In [29]:
# Specify directory to save model
OUTPUT_DIR = "trained_models/"

# Create directories if they don't exist on the path
if not os.path.exists(OUTPUT_DIR):
    os.makedirs(OUTPUT_DIR)
    
# Number of episodes to train the agent
TRAIN_EPISODES = 400

# Instantiate Environment and Agent
env = DinoEnvironment()
agent = DinoDQNAgent(env)

# Train Model
train(agent, env, TRAIN_EPISODES, OUTPUT_DIR, log_to_wandb=True)

Episode 1/400, Highest Score: 51, Episode Score: 51, Episode Reward: 125.5000, Episode Epsilon: 0.0000, Episode Loss: 0.0015, Mean Score: 51.0000, Mean Reward 125.5000
Episode 2/400, Highest Score: 51, Episode Score: 51, Episode Reward: 3.2000, Episode Epsilon: 0.9950, Episode Loss: 145.9340, Mean Score: 51.0000, Mean Reward 64.3500
Episode 3/400, Highest Score: 52, Episode Score: 52, Episode Reward: 4.2000, Episode Epsilon: 0.9900, Episode Loss: 0.8471, Mean Score: 51.3333, Mean Reward 44.3000
Episode 4/400, Highest Score: 58, Episode Score: 58, Episode Reward: 21.1000, Episode Epsilon: 0.9851, Episode Loss: 0.0053, Mean Score: 53.0000, Mean Reward 38.5000
Episode 5/400, Highest Score: 58, Episode Score: 52, Episode Reward: 3.7000, Episode Epsilon: 0.9801, Episode Loss: 25535.4844, Mean Score: 52.8000, Mean Reward 31.5400
Episode 6/400, Highest Score: 58, Episode Score: 55, Episode Reward: 8.8000, Episode Epsilon: 0.9752, Episode Loss: 52.1755, Mean Score: 53.1667, Mean Reward 27.7500

Run history:
episode,▁▁▁▁▂▂▂▂▂▃▃▃▃▃▃▄▄▄▄▄▅▅▅▅▅▅▆▆▆▆▆▇▇▇▇▇▇███
episode_epsilon,██▇▇▇▆▆▆▅▅▅▅▄▄▄▄▄▃▃▃▃▃▃▃▂▂▂▂▂▂▂▂▂▁▁▁▁▁▁▁
episode_loss,▁▁▁▁▁▁▁▁▁█▄▁▂▁▁▂▁▁▁▂▁▁▁▁▁▁▁▁▁▂▂▁▁▁▁▁▁▁▁▁
episode_reward,█▃▂▂▃▃▂▂▂▂▂▂▂▄▄▃▂▅▃▃▃▂▂▂▂▄▁▄▂▂▃▂▃▂▂▂▇▂▂▂
episode_score,▂▁▁▁▁▂▁▁▂▁▁▁▁▁▂▃▁▇▁▁▁▁▁▁▁▅▁▁▁▁▂▁▂▁▁▁█▁▁▁
highest_score,▁▁▁▁▂▂▂▂▃▃▃▃▃▃▃▃▄▄██████████████████████
mean_current_score,▂▁▃▂▄▄▄▄▆▆▆▆▆▆▆▆▇▇█▇██▇▇███████▇███████▇
mean_loss,▁▁▁▁▁▁▁▁▁█▄▁▂▁▁▂▁▁▁▂▁▁▁▁▁▁▁▁▁▂▂▁▁▁▁▁▁▁▁▁
mean_reward,█▃▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁

Run summary:
episode,1.0
episode_epsilon,0.16455
episode_loss,0.22567
episode_reward,4.0
episode_score,51.0
highest_score,142.0
mean_current_score,55.1575
mean_loss,0.22567
mean_reward,6.32725


## Test Agent

### Test Function

In [7]:
def test(agent, env, episodes, model_path, log_to_wandb=False, older_model=False, render=False):

    if log_to_wandb:
        wandb.init(project='chrome_dino_dqn_agent', name='test_run')

    total_rewards = []
    total_scores = []

    agent.load_model(model_path, older_model, for_training=False)

    # Set exploration rate (epsilon) to 0 to only choose actions based on the model's predictions (exploit its knowledge)
    agent.epsilon = 0

    for episode in range(episodes):
        state = env.reset()
        done = False
        episode_reward = 0

        while not done:
            if render:
                env.render(mode='human')

            # Use agent to predict action
            action = agent.act(state)

            # Take a step in the environment
            next_state, reward, done, info = env.step(action)

            state = next_state
            episode_reward += reward

        total_rewards.append(episode_reward)
        total_scores.append(info["current_score"])

        # Calculate overall test metrics
        mean_reward = sum(total_rewards) / len(total_rewards)
        mean_score = sum(total_scores) / len(total_scores)

        # Log metrics
        print(
            f"Episode {episode + 1}/{episodes}, Highest Score: {info['high_score']}, Episode Score: {info['current_score']}, Episode Reward: {episode_reward:.4f}, Episode Epsilon: {agent.epsilon:.4f}, Mean Score: {mean_score:.4f}, Mean Reward {mean_reward:.4f}")

        if log_to_wandb:
            wandb.log({
                "episode": (episode + 1)/episodes,
                "highest_score": info["high_score"],
                "episode_score": info["current_score"],
                "episode_reward": episode_reward,
                "episode_epsilon": agent.epsilon,
                "mean_reward": mean_reward,
                "mean_current_score": mean_score
            })
            
    if log_to_wandb:
        # Finish wandb logging        
        wandb.finish()
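
With `agent.epsilon = 0`, `agent.act` always takes the greedy branch and returns the action with the highest predicted Q-value. The sketch below isolates that step and wraps the forward pass in `torch.no_grad()`, which avoids building a computation graph when the model is only used for acting. This is an optional refinement rather than what the notebook's `act` method does, and the function name `greedy_action` is hypothetical.

```python
import torch
import torch.nn as nn


def greedy_action(model, state):
    """Return the index of the highest predicted Q-value for a single state."""
    state = torch.as_tensor(state, dtype=torch.float32)
    with torch.no_grad():  # gradients are not needed when only acting
        q_values = model(state)
    return int(torch.argmax(q_values).item())


# Example with an untrained network of the same shape as the agent's model
example_model = nn.Sequential(
    nn.Linear(10, 64), nn.ReLU(),
    nn.Linear(64, 128), nn.ReLU(),
    nn.Linear(128, 4),
)
print(greedy_action(example_model, [0.0] * 10))
```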

### Test

#### Train_5

In [154]:
# Number of episodes to test the agent
TEST_EPISODES = 5

In [155]:
# Specify path to load a model
MODEL_LOAD_PATH = "trained_models/dino_dqn_episode_200.pth"

In [156]:
# Instantiate Environment and Agent
env = DinoEnvironment()
agent = DinoDQNAgent(env)

# Test model
test(agent, env, TEST_EPISODES, MODEL_LOAD_PATH, log_to_wandb=False)

Episode 1/5, Highest Score: 64, Episode Score: 64, Episode Reward: 110.6000, Episode Epsilon: 0.0000, Mean Score: 64.0000, Mean Reward 110.6000
Episode 2/5, Highest Score: 64, Episode Score: 52, Episode Reward: 1.5000, Episode Epsilon: 0.0000, Mean Score: 58.0000, Mean Reward 56.0500
Episode 3/5, Highest Score: 64, Episode Score: 52, Episode Reward: 1.4000, Episode Epsilon: 0.0000, Mean Score: 56.0000, Mean Reward 37.8333
Episode 4/5, Highest Score: 64, Episode Score: 54, Episode Reward: 2.0000, Episode Epsilon: 0.0000, Mean Score: 55.5000, Mean Reward 28.8750
Episode 5/5, Highest Score: 64, Episode Score: 52, Episode Reward: 0.8000, Episode Epsilon: 0.0000, Mean Score: 54.8000, Mean Reward 23.2600


#### Train_6

In [39]:
# Number of episodes to test the agent
TEST_EPISODES = 5

# Specify path to load a model
MODEL_LOAD_PATH = "trained_models/train_6/dino_dqn_episode_110.pth"

# Instantiate Environment and Agent
env = DinoEnvironment()
agent = DinoDQNAgent(env)

# Test model
test(agent, env, TEST_EPISODES, MODEL_LOAD_PATH, log_to_wandb=False)

Episode 1/5, Highest Score: 53, Episode Score: 53, Episode Reward: 121.2000, Episode Epsilon: 0.0000, Mean Score: 53.0000, Mean Reward 121.2000
Episode 2/5, Highest Score: 60, Episode Score: 60, Episode Reward: 18.6000, Episode Epsilon: 0.0000, Mean Score: 56.5000, Mean Reward 69.9000
Episode 3/5, Highest Score: 68, Episode Score: 68, Episode Reward: 24.1000, Episode Epsilon: 0.0000, Mean Score: 60.3333, Mean Reward 54.6333
Episode 4/5, Highest Score: 83, Episode Score: 83, Episode Reward: 36.4000, Episode Epsilon: 0.0000, Mean Score: 66.0000, Mean Reward 50.0750
Episode 5/5, Highest Score: 83, Episode Score: 53, Episode Reward: 4.5000, Episode Epsilon: 0.0000, Mean Score: 63.4000, Mean Reward 40.9600


# All Model Tests
Testing a model from each training run to determine the best trained model.

## Train_1

In [42]:
# Number of episodes to test the agent
TEST_EPISODES = 5

# Specify path to load a model
MODEL_LOAD_PATH = "trained_models/train_1/episode_100.pth"

# Instantiate Environment and Agent
env = DinoEnvironment()
agent = DinoDQNAgent(env)

# Test model
test(agent, env, TEST_EPISODES, MODEL_LOAD_PATH, log_to_wandb=False, older_model=True)

Episode 1/5, Highest Score: 338, Episode Score: 338, Episode Reward: 581.2000, Episode Epsilon: 0.0000, Mean Score: 338.0000, Mean Reward 581.2000
Episode 2/5, Highest Score: 338, Episode Score: 333, Episode Reward: 59.7000, Episode Epsilon: 0.0000, Mean Score: 335.5000, Mean Reward 320.4500
Episode 3/5, Highest Score: 572, Episode Score: 572, Episode Reward: 361.5000, Episode Epsilon: 0.0000, Mean Score: 414.3333, Mean Reward 334.1333
Episode 4/5, Highest Score: 572, Episode Score: 259, Episode Reward: 42.4000, Episode Epsilon: 0.0000, Mean Score: 375.5000, Mean Reward 261.2000
Episode 5/5, Highest Score: 572, Episode Score: 519, Episode Reward: 82.3000, Episode Epsilon: 0.0000, Mean Score: 404.2000, Mean Reward 225.4200


## Train_2

In [45]:
# Number of episodes to test the agent
TEST_EPISODES = 5

# Specify path to load a model
MODEL_LOAD_PATH = "trained_models/train_2/dino_dqn_episode_400.pth"

# Instantiate Environment and Agent
env = DinoEnvironment()
agent = DinoDQNAgent(env)

# Test model
test(agent, env, TEST_EPISODES, MODEL_LOAD_PATH, log_to_wandb=False)

Episode 1/5, Highest Score: 158, Episode Score: 158, Episode Reward: 322.1000, Episode Epsilon: 0.0000, Mean Score: 158.0000, Mean Reward 322.1000
Episode 2/5, Highest Score: 158, Episode Score: 52, Episode Reward: 3.8000, Episode Epsilon: 0.0000, Mean Score: 105.0000, Mean Reward 162.9500
Episode 3/5, Highest Score: 158, Episode Score: 52, Episode Reward: 3.9000, Episode Epsilon: 0.0000, Mean Score: 87.3333, Mean Reward 109.9333
Episode 4/5, Highest Score: 158, Episode Score: 143, Episode Reward: 26.2000, Episode Epsilon: 0.0000, Mean Score: 101.2500, Mean Reward 89.0000
Episode 5/5, Highest Score: 158, Episode Score: 52, Episode Reward: 3.5000, Episode Epsilon: 0.0000, Mean Score: 91.4000, Mean Reward 71.9000


## Train_3
Used the `ModifiedDinoEnvironment` class for this training run.

In [67]:
# Number of episodes to test the agent
TEST_EPISODES = 5

# Specify path to load a model
MODEL_LOAD_PATH = "trained_models/train_3/dino_dqn_episode_110.pth"

# Instantiate Environment and Agent
env = ModifiedDinoEnvironment()
agent = DinoDQNAgent(env)

# Test model
test(agent, env, TEST_EPISODES, MODEL_LOAD_PATH, log_to_wandb=False)

Episode 1/5, Highest Score: 51, Episode Score: 51, Episode Reward: 147.8000, Episode Epsilon: 0.0000, Mean Score: 51.0000, Mean Reward 147.8000
Episode 2/5, Highest Score: 51, Episode Score: 51, Episode Reward: 136.1000, Episode Epsilon: 0.0000, Mean Score: 51.0000, Mean Reward 141.9500
Episode 3/5, Highest Score: 51, Episode Score: 51, Episode Reward: 137.3000, Episode Epsilon: 0.0000, Mean Score: 51.0000, Mean Reward 140.4000
Episode 4/5, Highest Score: 52, Episode Score: 52, Episode Reward: 144.2000, Episode Epsilon: 0.0000, Mean Score: 51.2500, Mean Reward 141.3500
Episode 5/5, Highest Score: 52, Episode Score: 51, Episode Reward: 141.3000, Episode Epsilon: 0.0000, Mean Score: 51.2000, Mean Reward 141.3400


## Train_4
Used the `ModifiedDinoEnvironment` class with a different reward strategy for this training run.

In [10]:
# Number of episodes to test the agent
TEST_EPISODES = 5

# Specify path to load a model
MODEL_LOAD_PATH = "trained_models/train_4/dino_dqn_episode_100.pth"

# Instantiate Environment and Agent
env = ModifiedDinoEnvironment() 
agent = DinoDQNAgent(env)

# Test model
test(agent, env, TEST_EPISODES, MODEL_LOAD_PATH, log_to_wandb=False)

Episode 1/5, Highest Score: 52, Episode Score: 52, Episode Reward: 147.4000, Episode Epsilon: 0.0000, Mean Score: 52.0000, Mean Reward 147.4000
Episode 2/5, Highest Score: 52, Episode Score: 52, Episode Reward: 151.0000, Episode Epsilon: 0.0000, Mean Score: 52.0000, Mean Reward 149.2000
Episode 3/5, Highest Score: 52, Episode Score: 51, Episode Reward: 143.0000, Episode Epsilon: 0.0000, Mean Score: 51.6667, Mean Reward 147.1333
Episode 4/5, Highest Score: 52, Episode Score: 52, Episode Reward: 148.5000, Episode Epsilon: 0.0000, Mean Score: 51.7500, Mean Reward 147.4750
Episode 5/5, Highest Score: 52, Episode Score: 51, Episode Reward: 150.0000, Episode Epsilon: 0.0000, Mean Score: 51.6000, Mean Reward 147.9800


## Train_5

In [13]:
# Number of episodes to test the agent
TEST_EPISODES = 5

# Specify path to load a model
MODEL_LOAD_PATH = "trained_models/train_5/dino_dqn_episode_110.pth"

# Instantiate Environment and Agent
env = DinoEnvironment()
agent = DinoDQNAgent(env)

# Test model
test(agent, env, TEST_EPISODES, MODEL_LOAD_PATH, log_to_wandb=False)

Episode 1/5, Highest Score: 69, Episode Score: 69, Episode Reward: 167.1000, Episode Epsilon: 0.0000, Mean Score: 69.0000, Mean Reward 167.1000
Episode 2/5, Highest Score: 69, Episode Score: 51, Episode Reward: 3.9000, Episode Epsilon: 0.0000, Mean Score: 60.0000, Mean Reward 85.5000
Episode 3/5, Highest Score: 69, Episode Score: 53, Episode Reward: 4.5000, Episode Epsilon: 0.0000, Mean Score: 57.6667, Mean Reward 58.5000
Episode 4/5, Highest Score: 69, Episode Score: 51, Episode Reward: 4.3000, Episode Epsilon: 0.0000, Mean Score: 56.0000, Mean Reward 44.9500
Episode 5/5, Highest Score: 69, Episode Score: 53, Episode Reward: 4.4000, Episode Epsilon: 0.0000, Mean Score: 55.4000, Mean Reward 36.8400


## Train_6

In [14]:
# Number of episodes to test the agent
TEST_EPISODES = 5

# Specify path to load a model
MODEL_LOAD_PATH = "trained_models/train_6/dino_dqn_episode_110.pth"

# Instantiate Environment and Agent
env = DinoEnvironment()
agent = DinoDQNAgent(env)

# Test model
test(agent, env, TEST_EPISODES, MODEL_LOAD_PATH, log_to_wandb=False)

Episode 1/5, Highest Score: 66, Episode Score: 66, Episode Reward: 161.7000, Episode Epsilon: 0.0000, Mean Score: 66.0000, Mean Reward 161.7000
Episode 2/5, Highest Score: 66, Episode Score: 52, Episode Reward: 4.3000, Episode Epsilon: 0.0000, Mean Score: 59.0000, Mean Reward 83.0000
Episode 3/5, Highest Score: 66, Episode Score: 53, Episode Reward: 4.7000, Episode Epsilon: 0.0000, Mean Score: 57.0000, Mean Reward 56.9000
Episode 4/5, Highest Score: 66, Episode Score: 66, Episode Reward: 8.5000, Episode Epsilon: 0.0000, Mean Score: 59.2500, Mean Reward 44.8000
Episode 5/5, Highest Score: 82, Episode Score: 82, Episode Reward: 43.3000, Episode Epsilon: 0.0000, Mean Score: 63.8000, Mean Reward 44.5000
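
The comparison across runs can be made explicit by ranking the final `Mean Score` values printed after the fifth test episode of each run above. The snippet below is a small sketch that copies those six numbers from the logs. Each value averages only five episodes, so the ranking is indicative rather than statistically robust.

```python
# Final "Mean Score" after 5 test episodes, copied from the logs above
mean_scores = {
    "train_1": 404.2,
    "train_2": 91.4,
    "train_3": 51.2,
    "train_4": 51.6,
    "train_5": 55.4,
    "train_6": 63.8,
}

# Sort runs from highest to lowest mean score
ranking = sorted(mean_scores.items(), key=lambda item: item[1], reverse=True)
best_run, best_score = ranking[0]

for run, score in ranking:
    print(f"{run}: {score:.1f}")
print(f"Best run: {best_run}")
```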


# Best Model Test
Testing the best model (train_1: `episode_100.pth`) for 50 runs to see the highest score it can achieve.

In [162]:
# Specify path to load a model
MODEL_LOAD_PATH = "best_trained_models/episode_100.pth"

In [167]:
# Number of episodes to test the agent
TEST_EPISODES = 50

In [168]:
# Instantiate Environment and Agent
env = DinoEnvironment()
agent = DinoDQNAgent(env)

# Test model
test(agent, env, TEST_EPISODES, MODEL_LOAD_PATH, log_to_wandb=True, older_model=True)

Episode 1/50, Highest Score: 362, Episode Score: 362, Episode Reward: 508.0000, Episode Epsilon: 0.0000, Mean Score: 362.0000, Mean Reward 508.0000
Episode 2/50, Highest Score: 403, Episode Score: 403, Episode Reward: 84.7000, Episode Epsilon: 0.0000, Mean Score: 382.5000, Mean Reward 296.3500
Episode 3/50, Highest Score: 442, Episode Score: 442, Episode Reward: 54.4000, Episode Epsilon: 0.0000, Mean Score: 402.3333, Mean Reward 215.7000
Episode 4/50, Highest Score: 663, Episode Score: 663, Episode Reward: 286.5000, Episode Epsilon: 0.0000, Mean Score: 467.5000, Mean Reward 233.4000
Episode 5/50, Highest Score: 663, Episode Score: 260, Episode Reward: 31.4000, Episode Epsilon: 0.0000, Mean Score: 426.0000, Mean Reward 193.0000
Episode 6/50, Highest Score: 663, Episode Score: 282, Episode Reward: 33.6000, Episode Epsilon: 0.0000, Mean Score: 402.0000, Mean Reward 166.4333
Episode 7/50, Highest Score: 663, Episode Score: 287, Episode Reward: 33.3000, Episode Epsilon: 0.0000, Mean Score: 

Run history:
episode,▁▁▁▁▂▂▂▂▂▃▃▃▃▃▃▄▄▄▄▄▅▅▅▅▅▅▆▆▆▆▆▆▇▇▇▇▇███
episode_epsilon,▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁
episode_reward,▇▂▁▄▁▁▁▁▁▁▂▁▁▁▁▁▁▂▃▂▇▂▂▁▂▂▁█▂▃▂▃▁▁▂▂▁▃▂▁
episode_score,▃▃▃▄▂▂▂▂▂▂▄▃▃▃▂▃▁▃▅▃▆▄▃▂▃▁▁█▅▇▄▇▂▂▅▃▂▆▅▂
highest_score,▁▁▂▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▄▄▆▆▆▆▆▆▆█████████████
mean_current_score,▁▂▃▆▃▂▁▁▂▂▂▃▃▃▃▃▂▂▃▃▄▄▅▄▄▄▃▅▅▆▆▇▇▇▇▇▇███
mean_reward,█▅▃▄▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁

Run summary:
episode,1.0
episode_epsilon,0.0
episode_reward,42.3
episode_score,261.0
highest_score,1467.0
mean_current_score,505.7
mean_reward,105.28
