# Optimizing Reinforcement Algorithms & Comparing Performances with Street Fighter II: Special Champion Edition

*by Josiah Hegarty, Everett Lewark, and Austin Youngren, May 10, 2024*

https://github.com/lewark/cs445-fighters

## Introduction

For this project, we were interested in investigating applications of machine learning within video games. Classic fighting games in particular are an area well supported by libraries, existing documentation, and examples, so we decided to go that route. Using the Gymnasium, Stable-Retro, and Stable Baselines3 libraries, we ran experiments with multiple reinforcement learning algorithms to see which model architecture would perform best: Advantage Actor Critic (A2C), Deep Q Network (DQN), or Proximal Policy Optimization (PPO).

### Initial Hypothesis

With respect to model architectures, we expect the PPO model to have the best average performance on each game tested, followed by the A2C model, and with DQN having the worst average performance. Specifically, the PPO model will be able to win more sets of matches with less health lost in N sets than the other two models, when using the match wins and health percentages as an evaluation metric.

### Overview

Initially, we had hoped to answer the majority of the following questions using the PPO, A2C, and DQN reinforcement algorithms:

- Will one of these reinforcement learning models perform better than the others when training in 2D fighting game environments?
- Will the performance of these models vary across different 2D fighting games, or different characters within the same game, due to the variance of specific inputs needed to perform strings/combos?
- In fighting games, players can choose what side of the screen to start on. Will this positioning affect the model, after being trained on one side, but having to play on the other?
- Will we see lower performance when we introduce models into a semi-3D environment due to the increase of actions the model can take?
- Do we see a different model perform better in this environment when compared to performance with 2D-fighters?

As we worked, we found pursuing these questions was more than we bargained for and that we did not have the time to pursue much more than tuning, analyzing, and comparing the three models on a single game. We were able to gather the information needed to answer:

- Will one of these reinforcement learning models perform better than the others when training in 2D fighting game environments?

As we worked, we were able to find some new questions that fit within the scope of our smaller plan:

- Does a more detailed reward function improve the performance of the models, or can an overly detailed point system make it harder for the model to learn?
- Will one reward function work equally well across all three models? If not, will certain reward properties help one model more than the others?
- Will we see improved performance with non-default hyperparameter values when training these algorithms on Street Fighter II: Special Champion Edition?


## Methods

Before the company’s later shift to focus on proprietary software products like ChatGPT, OpenAI published its own open-source libraries for reinforcement learning, named Gym, Gym Retro, and Baselines. These libraries provide reinforcement learning environments, retro game integration, and model implementations, respectively. However, since these libraries no longer receive feature updates, their communities created the forks and successors Gymnasium, Stable-Retro, and Stable Baselines3 (the successor to Stable Baselines, itself a fork of OpenAI Baselines). These three community-maintained libraries are what we ultimately used for our project.

### Porting Base Code (Everett)

In developing our project, we based our initial implementations for training a model capable of playing games such as Street Fighter II on a YouTube tutorial by Renotte (2022), and I ported much of Renotte's code to the newer Gymnasium and Stable-Retro libraries. This took some work, since the signatures of common methods like `reset` and `step` had added new return values that had to be included in our implementation. I also refactored the code to use more functions and classes to allow reuse across different model architectures and games--or at least, that was the hope. Ultimately, the changes we needed to make for our different experiments were large enough that various internal forks of the module were created. Finally, because the Stable-Retro environments are single-threaded and run on the CPU, I added multi-process training that combined Renotte's code with the ppo.py example from Stable-Retro (Farama Foundation, 2023). This allows training to run at several times its single-process speed:

In [None]:
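from stable_baselines3.common.monitor import Monitor
from stable_baselines3.common.vec_env import DummyVecEnv, SubprocVecEnv, VecEnv, VecFrameStack

def make_env(env_class, n_procs: int = 4, n_stack: int = 4, **kwargs) -> VecEnv:
    # LOG_DIR (the Monitor log directory) is defined elsewhere in the module.
    if n_procs == 0:
        env = DummyVecEnv([lambda: Monitor(env_class(**kwargs), LOG_DIR)])
    else:
        env = SubprocVecEnv([(lambda: Monitor(env_class(**kwargs), LOG_DIR)) for _ in range(n_procs)])
    # Stack the last n_stack frames so the model can observe motion.
    env = VecFrameStack(env, n_stack, channels_order='last')
    return env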

In order to view the performance of our models after the fact, I also added a `run_checkpoint` module that allowed us to load up a previously saved model checkpoint. It would then either display the model running for a single episode to give visual feedback on how well it did, or write the output to a video file using the Gymnasium RecordVideo wrapper.

### Environments

In terms of setup, our team ended up needing to run our models on three different hardware and software platforms, which made getting all three configured more difficult and time-consuming than expected. While I used a machine running Debian GNU/Linux for most of my training, Austin ran the libraries through an Ubuntu environment on Windows Subsystem for Linux 2 after multiple other attempts that encountered library incompatibilities. Josiah also largely developed his part of the code using Windows Subsystem for Linux 1 on Windows 10. Austin’s laptop had persistent issues with rendering the models' gameplay during their training and evaluations, so he was restricted to using his home computer.

### Proximal Policy Optimization Algorithm: Optimization, Training, and Analysis Methods (Austin)

Note: All evaluations described below used 25 evaluation episodes.

In the PPO notebook, I start the experiment by running a baseline test using 0 timesteps, for six different models. Three of these models utilize the CNN policy architecture and the other three utilize the MLP policy architecture. This allows us to see how the model performs using untrained, random inputs, which allows us to analyze if our reward function conveys the information needed for the model to improve its gameplay.

In the second section, I again train six different models, three with the MLP policy and three with the CNN policy. Each model is trained for 500,000 timesteps using the default hyperparameter values. By comparing their performance in this section to the performance in the previous section, we can infer whether our reward function is a proper fit. If the models do worse than the baseline, the reward function needs to be refined. If the models do better, we can continue to the first grid search.

In both grid search experiments, I train with 500,000 timesteps, using a seed of 2 for each model’s pseudo-random number generation. This ensures that all models start with the same initial conditions, allowing for fair comparisons between the models. This also helps to ensure that improvements in performance are due to hyperparameter values, and not a product of random generation.

The first round of grid search attempts to find the best values for the following hyperparameters:

- learning rate
- batch size (mini-batch size)
- n_batches (the number of batches)
- steps (number of steps to run for each environment per update)

The values tested for each hyperparameter can be seen in the snippet below, with the exception of the number of steps. The number of steps is computed as (n_batches * batch_size).


In [None]:
policies = ['CnnPolicy', 'MlpPolicy']
n_learning_rate = [0.0001, 0.00001, 0.000001]
batch_size = [32, 64, 128]
n_batches = [4, 16, 32]
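
Because the number of steps is derived rather than listed, it can help to enumerate it. The following is a minimal sketch, assuming the lists above and the (n_batches * batch_size) relationship; the names `step_counts` and `first_grid` are illustrative and are not taken from the project code.

```python
from itertools import product

policies = ['CnnPolicy', 'MlpPolicy']
n_learning_rate = [0.0001, 0.00001, 0.000001]
batch_size = [32, 64, 128]
n_batches = [4, 16, 32]

# Derived number of steps for each (n_batches, batch_size) pair.
step_counts = {(nb, bs): nb * bs for nb, bs in product(n_batches, batch_size)}

# Every configuration visited by the first grid search.
first_grid = list(product(policies, n_learning_rate, batch_size, n_batches))

print(len(first_grid))            # 54
print(min(step_counts.values()))  # 128
print(max(step_counts.values()))  # 4096
```

The count of 54 configurations matches the number of models trained in the first grid search.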

Once this grid search has finished, the best MLP policy model and the best CNN policy model are picked from the 54 trained models for the second grid search experiment.

The second round of grid search attempts to find the best values for the following hyperparameters:

- Epochs (Number of epochs when optimizing the surrogate loss)
- Gamma (Discount factor)
- GAE Lambda (Factor for trade-off of bias vs variance for Generalized Advantage Estimator)
- Clip Range (Clipping parameter)

Again, the values tested for each hyperparameter can be seen in the snippet below.

In [None]:
n_epochs = [10, 25, 75]
n_gamma = [0.9, 0.999]
n_gae_lambda = [0.95, 0.97, 0.99]
n_clip_range = [0.1, 0.2, 0.3]

This grid search is the most time intensive as we train 108 models. However, the number of models is not the main factor that increases computation time. In fact, epochs had an even larger impact: a 10 epoch model can take roughly 15 minutes to train, while a 75 epoch model can take close to 80 minutes. When this grid search is finished, we pick one best model instead of two. This will be the final model for training and analysis.

During the final analysis, we set the pseudo-random seed to its default value, allowing us to test the consistency of the model given variable starting conditions. Ten copies of the model are trained for one million timesteps. Comparing the performance of the model copies helps determine whether the model can perform consistently in the game environment.

The TensorBoard plots for this model are then compared to the plots of the models in the previous experiments to determine its overall performance.

### Proximal Policy Optimization Algorithm: Refining the Reward Function and Model Selection Methodology (Austin)

Initially, I ran across issues with the libraries needed to set up the basics of our project. Due to this, I had to utilize a large portion of the code that Everett created for the environment. Luckily, this allowed me to dive directly into the experimentation of the PPO algorithm.

When starting the evaluation process for the PPO model, I focused heavily on the grid search process. From a Medium article by user AurelianTactics (2018), I was able to get an idea of the general value ranges used for each of the hyperparameters I chose to test. The process was broken down into two grid search experiments to help in reducing the number of models being trained in the experiment. Had I not broken it down in this way, I would have had to train 2,916 models in a single experiment. 

By reducing the grid search experiment into two sections, I was able to grab the best-performing CNN Policy and MLP Policy models, then train those on the remaining hyperparameters in a separate grid search. This method reduced the total number of models I needed to train in my grid search experiment to 162. The first grid search’s focus is to fine tune the learning_rate, batch_size, and n_steps hyperparameters for both the MLP Policy and CNN Policy versions of the PPO model. 
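
The model counts quoted above follow from the sizes of the two grids. As a quick check, here is a small sketch of the arithmetic, assuming the value lists shown in the grid search snippets; the dictionary names are illustrative only.

```python
from math import prod

# Number of candidate values per hyperparameter in each grid search.
first_stage = {"policy": 2, "learning_rate": 3, "batch_size": 3, "n_batches": 3}
second_stage = {"n_epochs": 3, "gamma": 2, "gae_lambda": 3, "clip_range": 3}

first_count = prod(first_stage.values())         # models in the first grid search
second_per_policy = prod(second_stage.values())  # second-stage models per policy

# One combined grid over all eight hyperparameters.
single_grid = first_count * second_per_policy

# Two-stage search: the first grid, then the second grid for one CNN and one MLP model.
staged = first_count + 2 * second_per_policy

print(first_count, 2 * second_per_policy)  # 54 108
print(single_grid, staged)                 # 2916 162
```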

During my first version of the grid search, I only selected the model with the highest reward mean as the best model. However, I wondered whether the second grid search might produce better results for the policy type that had performed worse than the selected model. So, I adjusted the code to select both the MLP Policy and CNN Policy models with the highest reward mean to be tested in the second grid search.

In my initial full experiment, I found that my final model performed very inconsistently at one million timesteps compared to the versions trained for 25,000 to 50,000 timesteps during the experiment trials. So, from here, we chose to increase the experiment timesteps to 500,000. Again, we found inconsistencies, which turned our attention toward our reward function.

When I first experimented with reward functions, I tested whether the values of the reward were the issue. I went through many versions where I varied the magnitude of the reward value for score, health, and wins. To fully test the performance of these different reward functions, I spent over six hours watching the model’s rendering during evaluation steps. While I watched, I tallied every game win and recorded how far into the game each model was able to progress before losing a game. I then compared the wins and the achievable game depth of the models across all versions of the reward function. The best version I landed on was:

In [None]:
from typing import Any
def compute_score(self, info: dict[str, Any]) -> float:
    return (
        info.get("score", 0) / 10
        + (info.get("health", 0) - info.get("enemy_health", 0))
        - (info.get("enemy_matches_won", 0) * 1300)
    )

Even with this function, I felt I wasn't seeing the level of performance I was hoping for. The plots for loss and explained variance in TensorBoard showed me that this reward function was barely better than the initial version. So, I turned to the internet in hopes of finding more information about reward functions. 

I found many resources on reward functions, the most helpful being a YouTube video from CampusX (2022). The video primarily talks about how consistent rewards are needed for the model to learn properly, along with using lower magnitude reward values. Though there were many resources for the topic of reward functions, ones that were focused on fighting games were scarce. By luck, I came across code for a reward function on GitHub by Hulse (2022). Though I didn’t use this exact reward function, it gave me an idea for a brand-new reward function that was largely different from the previous versions:

In [None]:
def compute_score(self, info: dict[str, Any]) -> float:
    reward = 0

    # Reward increases in the in-game score.
    if info.get("score", 0) > self.score:
        reward += (info.get("score", 0) - self.score) / 1000000
        self.score = info.get("score", 0)

    # Penalize health lost by the player (the difference is negative).
    if info.get("health", 0) < self.health:
        reward += (info.get("health", 0) - self.health) / 100
        self.health = info.get("health", 0)

    # Reward damage dealt to the enemy.
    if info.get("enemy_health", 0) < self.enemy_health:
        reward += (self.enemy_health - info.get("enemy_health", 0)) / 100
        self.enemy_health = info.get("enemy_health", 0)

    # Reward a match won by the player.
    if info.get("matches_won", 0) > self.r_wins:
        self.r_wins = info.get("matches_won", 0)
        reward += 0.5

    # Penalize a match won by the enemy.
    if info.get("enemy_matches_won", 0) > self.enemy_wins:
        self.enemy_wins = info.get("enemy_matches_won", 0)
        reward -= 0.5

    return reward

This reward function became the base for my final reward function and influenced Everett’s A2C reward function. Everett and I experimented with different versions of this. With this version, Everett was able to find bugs in the returned values of certain reward properties that would cause large positive and negative values to be returned. These values likely caused confusion for the model during training. Finding and correcting these issues greatly helped in removing discrepancies seen within the plots of mean reward. 
Shortly after this, Everett was able to integrate a method of retrieving the player and enemy positions, which played a role in my final reward function. Though my current reward function is not refined to the extent that I would like, it is the best one I have been able to produce in the given time:

In [None]:
def compute_reward(self, info: dict[str, Any]) -> float:
    distance = self.get_player_distance(info)
    reward = (distance - self.distance) / 1000
    self.distance = distance
    self.steps += 1

    new_health = info["health"]
    new_enemy_health = info["enemy_health"]
    new_player_wins = info["matches_won"]
    new_enemy_wins = info["enemy_matches_won"]
    new_y = info["player_y"]

    if new_y < 192 and new_health < self.health:
        reward += -0.5
    self.player_y = new_y

    if (self.health - new_health) < 15 and (self.health - new_health) > 0:
        reward += 1

    if new_health == self.health and new_enemy_health == self.enemy_health:
        if new_health > new_enemy_health and (new_health != 0 and new_enemy_health != 0):
            reward += 0.2
        elif new_health < new_enemy_health and (new_health != 0 and new_enemy_health != 0):
            reward += -0.2

    # if self.steps >= 200000:
    if new_health < self.health and new_health > 0:
        reward += -0.8
    self.health = new_health

    if new_enemy_health < self.enemy_health and new_enemy_health > 0:
        reward += 0.8
    self.enemy_health = new_enemy_health

    if new_player_wins > self.player_wins:
        reward += 2 * new_player_wins
    self.player_wins = new_player_wins

    if new_enemy_wins > self.enemy_wins:
        reward += -2 * new_enemy_wins
    self.enemy_wins = new_enemy_wins

    return reward

This reward function gives an extra penalty to the model for taking damage while in the air. The hope was to reduce how often the model jumps, since the air is a dangerous place to be in fighting games. Next, I give a positive reward if the character takes less than 15 damage. When blocking, characters do not take damage unless they block a special attack, and the damage taken in that case is greatly reduced. In all versions of the PPO model, it hardly ever blocked and never intentionally blocked. Ideally, this reward property would help the model learn to block. The third if-block penalizes the model for having less health than the enemy but rewards the model when it has more health than the enemy. The fourth if-block penalizes the model heavily whenever it loses health and isn't blocking. The fifth if-block does the opposite, greatly rewarding the model for dealing damage. Finally, the last two if-blocks reward and punish the model based on the number of rounds won or lost. This value resets with each game (on reaching a new character, or on a game reset after a loss), so the model will at most lose or gain 2 reward values for winning or losing a game.

Then, one of the final adjustments I made to my grid search was how I chose the best models. Instead of choosing the model with the highest reward mean, I take the 27 models with the highest reward mean as a first list and the 27 models with the highest episode length mean as a second list. Using the two lists of best models, I create a third list that consists of the models appearing in both. The CNN Policy model and the MLP Policy model with the best mean reward are then selected from the third list. Given the complexities of how models perform, this was the best method I could think of without having access to additional performance data.
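
The selection procedure described above can be sketched as follows. This is only an illustration under assumptions: the record layout (`name`, `policy`, `reward_mean`, `ep_len_mean`), the function name `select_best`, and the sample numbers are hypothetical and are not taken from the project code or its results.

```python
def select_best(results, top_k=27):
    """Pick the best model per policy from the overlap of two top-k lists."""
    by_reward = sorted(results, key=lambda r: r["reward_mean"], reverse=True)[:top_k]
    by_length = sorted(results, key=lambda r: r["ep_len_mean"], reverse=True)[:top_k]

    # Keep only models that appear in both lists, still ordered by mean reward.
    long_running = {r["name"] for r in by_length}
    shortlist = [r for r in by_reward if r["name"] in long_running]

    # The first entry seen for each policy is the one with the best mean reward.
    best = {}
    for r in shortlist:
        best.setdefault(r["policy"], r)
    return best


# Made-up records for illustration only.
results = [
    {"name": "cnn_a", "policy": "CnnPolicy", "reward_mean": 5.0, "ep_len_mean": 900},
    {"name": "cnn_b", "policy": "CnnPolicy", "reward_mean": 7.0, "ep_len_mean": 400},
    {"name": "cnn_c", "policy": "CnnPolicy", "reward_mean": 2.0, "ep_len_mean": 500},
    {"name": "mlp_a", "policy": "MlpPolicy", "reward_mean": 4.0, "ep_len_mean": 800},
    {"name": "mlp_b", "policy": "MlpPolicy", "reward_mean": 6.0, "ep_len_mean": 300},
    {"name": "mlp_c", "policy": "MlpPolicy", "reward_mean": 1.0, "ep_len_mean": 700},
]

best = select_best(results, top_k=4)
print({policy: r["name"] for policy, r in best.items()})
```

In this toy data, `cnn_b` has the highest mean reward but short episodes, so it falls outside the overlap and `cnn_a` is selected instead. If no model of a given policy appears in both lists, that policy is simply absent from the result, which a real implementation would need to handle.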

### Integrating Position-based Reward (Everett)

The particular variables exposed to the reward function through the "info" dictionary vary by game. Within the default Stable-Retro Street Fighter II environment, the following variables are available:

- player score
- player health
- enemy health
- player round wins
- enemy round wins
- countdown timer

The code from Renotte used the player’s in-game score to determine rewards, but we wanted to make ours a bit more nuanced. As discussed earlier, Austin and I initially experimented with rewarding the model when it does damage to the opponent, and penalizing it when it takes damage. However, this simpler reward function has a problem. For the most part, dealing damage is more complex than just making a single action. In the case of melee attacks, a player must first walk toward the opponent before they can deal damage, and this is a whole task unto itself.

As described in the YouTube guide by CampusX (2022), reinforcement learning models can perform better when they are given some sort of continuous reward function to optimize against. However, in the case of damage, our rewards were much more sparse. The model would have to somehow stumble into the other player and then make a move that deals damage to receive any positive feedback. Furthermore, this would need to occur randomly enough times for the model to learn complex patterns from it, including both the ability to visually recognize where the player is relative to the enemy and the actions needed to move it closer.

To address this problem, I tried introducing another variable into the reward function that rewarded the player for moving closer to the opponent. To do this, the X and Y coordinates of both the player and the enemy had to be exposed from the environment so that they could be integrated into the reward function. The Stable-Retro library provides documentation on the process of integrating a new game environment (Farama Foundation, 2023), and the process of modifying an existing integration is similar. The process involved compiling and running a specialized integration UI, which looked like this:

![stable-retro integration UI](images/everett/integration-ui.png)

Using this interface, which functions similarly to other memory-inspection tools like Cheat Engine, I located variables within the game's RAM using an iterative process. For instance, to locate the player X coordinate, I ran a search for variables that were marked as unchanged. I then moved the player to the right and narrowed the current set of variables by searching for ones that increased in value. By following steps like these repeatedly, the console's entire RAM was gradually reduced to a few candidate memory locations, which I manually checked using the automatically-updating table in the sidebar. The same strategy was then used to locate the memory locations for the player Y coordinate and the enemy coordinates. This resulted in a JSON file containing the memory addresses:

```js
{
  "info": {
    // ... trimmed base variables from Stable-Retro ...
    "enemy_x": {
      "address": 16745094,
      "type": ">u2"
    },
    "enemy_y": {
      "address": 16745098,
      "type": ">u2"
    },
    // ...
    "player_x": {
      "address": 16744454,
      "type": ">u2"
    },
    "player_y": {
      "address": 16744458,
      "type": ">u2"
    }
  }
}
```
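
As a side note on the fields above, the `>u2` type string denotes a big-endian unsigned 16-bit integer, and the decimal addresses are easier to compare in hexadecimal. The snippet below is a small standalone sketch of both points using only the Python standard library; the helper `read_u16_be` and the four-byte `ram` excerpt are hypothetical and are not how Stable-Retro reads memory internally.

```python
import struct

# Addresses copied from the JSON above, printed in hexadecimal.
addresses = {
    "player_x": 16744454,
    "player_y": 16744458,
    "enemy_x": 16745094,
    "enemy_y": 16745098,
}
for name, addr in addresses.items():
    print(f"{name}: {addr:#x}")


def read_u16_be(ram: bytes, offset: int) -> int:
    """Decode a big-endian unsigned 16-bit value (the '>u2' layout)."""
    return struct.unpack_from(">H", ram, offset)[0]


# Hypothetical excerpt of memory holding two consecutive 16-bit values.
ram = bytes([0x01, 0x2C, 0x00, 0xC0])
print(read_u16_be(ram, 0))  # 300
print(read_u16_be(ram, 2))  # 192
```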

Afterward, I loaded the JSON file into Stable-Retro, and modified the reward function to use the distance between the player and opponent as reported by the np.hypot (hypotenuse) function.

In [None]:
from typing import Any

import numpy as np
from gymnasium import Wrapper

class CustomReward(Wrapper):
    ...
    
    def compute_reward(self, info: dict[str, Any]) -> float:
        reward = 0

        if self.use_distance:
            distance = self.get_player_distance(info)
            reward = (distance - self.distance) / 10
            self.distance = distance
            
        ...
        
        return reward

    def get_player_distance(self, info):
        return np.hypot(
            info["enemy_x"] - info["player_x"],
            info["enemy_y"] - info["player_y"]
        )

### Advantage Actor Critic (Everett)

For my analysis within this project, I focused on the Advantage Actor Critic (A2C) algorithm, which Hugging Face (2023) describes as a particular variant of Actor Critic. In Actor Critic, the model is composed of two neural networks, one of which (the actor) selects actions, while the other (the critic) scores those actions. In some ways, this arrangement superficially resembles a Generative Adversarial Network (GAN), including the fact that the critic’s output influences the backpropagation pass for the actor. Advantage Actor Critic makes a further modification using an advantage function that estimates how good a particular action is in comparison to the other options in that scenario.
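
To make the idea of an advantage more concrete, one common one-step estimate compares the reward plus the discounted value of the next state against the critic's value for the current state. The function below is a generic sketch of that estimate, not the Stable Baselines3 implementation; the name `one_step_advantage` and the example numbers (including the unusually small discount factor, chosen to keep the arithmetic exact) are assumptions for illustration.

```python
def one_step_advantage(reward, value_s, value_next, gamma=0.99, done=False):
    """Estimate A(s, a) as r + gamma * V(s') - V(s), with no bootstrap at episode end."""
    bootstrap = 0.0 if done else gamma * value_next
    return reward + bootstrap - value_s


# The action led to a better outcome than the critic expected: positive advantage.
print(one_step_advantage(reward=1.0, value_s=2.5, value_next=4.0, gamma=0.5))  # 0.5

# The episode ended, so only the immediate reward counts: negative advantage.
print(one_step_advantage(reward=1.0, value_s=2.5, value_next=4.0, gamma=0.5, done=True))  # -1.5
```

A positive value nudges the actor toward the chosen action, while a negative value nudges it away.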

Stable Baselines3 provides its own implementation of A2C, which I used for my experiments during this project. Switching PPO out for A2C was trivial, since the library uses a standardized API for all its built-in models. I found that A2C was a bit faster to train when compared to PPO as well, because it involves fewer passes over the reward and action data. The A2C model has several different hyperparameters, including policy model type (convolutional neural network or multi-layer perceptron), learning rate, step count, lambda of the generalized advantage estimator, entropy coefficient, value function coefficient, gradient clipping threshold, optimizer (Adam or RMSProp), and RMSProp epsilon (Raffin et al., 2023a). Due to time constraints, I did not investigate most of these properties. However, given Austin's results with PPO (discussed later), a similarly large grid search may not have been necessary.

### DQN Testing and Analysis (Josiah)

For my portion of the project, I primarily focused on testing and analyzing the performance of a single Deep Q Network (DQN) algorithm for the project task, namely playing Street Fighter II' - Special Champion Edition (USA) for the Sega Genesis console. For the algorithm, I used an implementation of a single Deep Q Network provided by the Stable Baselines3 2.3.2 library (Raffin et al., 2023b). For my experiments, I planned on testing this implementation of the DQN in a custom Gymnasium 3.1.2 environment for Street Fighter II that had been implemented by Everett. However, to be able to set up the environment for the algorithm, there were a few technical challenges that had to be overcome first.

### Setup and Discrete Action Wrapper Class (Josiah)

The main challenge was that the DQN algorithm provided by Stable Baselines3 currently supports only environments with a discrete action space (integer-coded actions). Because it uses the original controller inputs of the Sega Genesis console, the Gymnasium environment for Street Fighter II could only be implemented with a multi-binary action space (a 12-bit sequence).

<center>Figure 1. Compatible DQN action and observation spaces from Stable Baselines 3 DQN documentation (Raffin et al., 2023b)</center>

![Stable-Baselines action spaces](images/josiah/sb-action-spaces.png)

This meant that, out of the three algorithms our team planned to test for our project, DQN would not be able to natively interact with the testing environment. Using a paper published by OpenAI, *Gotta Learn Fast: A New Benchmark for Generalization in RL* (Nichol, et al., 2018a), and the corresponding GitHub page (Nichol, et al., 2018b), I was able to find a case where researchers had successfully created a discrete action space wrapper class for a Gymnasium environment, which allowed models such as DQN to interact with multi-binary environments. In the paper, this was done using Gymnasium's built-in action wrapper class.

To make my own similar implementation, I built a list of all possible legal buttons and combos that a player might input for the game, defining legal combinations as the buttons a player would have actually been able to press together on the original Sega Genesis controller. The buttons consisted of four directional buttons, [Up, Down, Left, Right], and an additional six action buttons, [A, B, C, X, Y, Z]. The legal combinations for player movement consisted of the [Up] and [Down] buttons combined with either [Left] or [Right], e.g., [Up, Right]. However, combinations such as [Up, Down] were excluded, since a player using the original controller wouldn’t have been able to press both of those buttons at the same moment. For action button combinations, any combination of action buttons was allowed, since a player could in theory press all the action buttons at once. The final list of possible button combinations consisted of all directional buttons, and directional combinations with all action buttons and action combinations. This would allow the agent to select a set of buttons similar to those accessible to a human player, for example, to press the [Up, Right] buttons while simultaneously pressing two punch buttons [X, Y].


In [None]:
def make_buttons_and_combos(self):
    action_buttons = ['A', 'B', 'C', 'X', 'Y', 'Z']
    movement_buttons = ['Up', 'Down', 'Left', 'Right']
    movement_combos = []
    all_combos = []

    all_buttons = action_buttons + movement_buttons

    for btn in movement_buttons:
        movement_combos.append([btn])
        
    ...

After making the list of buttons and combinations, and following the example provided in the paper, I used a NumPy array to build a 12-bit binary representation for each button and button combination. This allowed the wrapper class to translate the discrete actions chosen by a model into the multi-binary action space, and allowed the model to ‘see’ its multi-binary actions as a discrete list of buttons and button combinations. This effectively made it possible for the Stable Baselines3 DQN implementation to support multi-binary action spaces and so allowed us to use DQN for the project.

In [None]:
import gymnasium as gym
import numpy as np


class DiscreteWrapper(gym.ActionWrapper):
    """
    Wrap a Gymnasium environment with a MultiBinary action space so that it
    exposes a Discrete action space instead.
    """

    def __init__(self, env, input_buttons=None, input_combos=None):
        super().__init__(env)
        assert isinstance(env.action_space, gym.spaces.MultiBinary)

        if input_buttons is None or input_combos is None:
            # Fall back to the default button and combo lists.
            self.buttons, self.combos = self.make_buttons_and_combos()
        else:
            self.buttons, self.combos = input_buttons, input_combos

        # One boolean mask per discrete action, indexed by action number.
        self.decode_discrete_action = []
        for combo in self.combos:
            arr = np.array([False] * env.action_space.n)
            for button in combo:
                arr[self.buttons.index(button)] = True
            self.decode_discrete_action.append(arr)

        self.action_space = gym.spaces.Discrete(
            len(self.decode_discrete_action))

    def action(self, act):
        # Translate a discrete action index into its multi-binary mask.
        return self.decode_discrete_action[act].copy()

    def make_buttons_and_combos(self):
        action_buttons = ['A', 'B', 'C', 'X', 'Y', 'Z']
        movement_buttons = ['Up', 'Down', 'Left', 'Right']
        movement_combos = []
        all_combos = []

        all_buttons = action_buttons + movement_buttons

        # Single directions.
        for btn in movement_buttons:
            movement_combos.append([btn])

        # Diagonals: a vertical direction paired with a horizontal one.
        for b1 in movement_buttons[:2]:
            for b2 in movement_buttons[2:]:
                movement_combos.append([b1, b2])

        for btn in movement_combos:
            all_combos.append(btn)

        # Single action buttons.
        for btn in action_buttons:
            all_combos.append([btn])

        # Each movement combo paired with one action button.
        for mv_combo in movement_combos:
            for ac_btn in action_buttons:
                all_combos.append(mv_combo + [ac_btn])

        return all_buttons, all_combos
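
To get a feel for the size of the resulting discrete action space, the loops in `make_buttons_and_combos` can be reproduced outside the wrapper. The snippet below is a standalone sketch that mirrors those loops with list comprehensions; it needs neither Gymnasium nor the game, and the intermediate names `singles` and `diagonals` are illustrative.

```python
action_buttons = ['A', 'B', 'C', 'X', 'Y', 'Z']
movement_buttons = ['Up', 'Down', 'Left', 'Right']

# Four single directions plus four diagonals (vertical paired with horizontal).
singles = [[b] for b in movement_buttons]
diagonals = [[v, h] for v in movement_buttons[:2] for h in movement_buttons[2:]]
movement_combos = singles + diagonals

# Same ordering as the wrapper: movement, single actions, then movement plus one action.
all_combos = (
    movement_combos
    + [[b] for b in action_buttons]
    + [m + [a] for m in movement_combos for a in action_buttons]
)

print(len(movement_combos))  # 8
print(len(all_combos))       # 62
print(['Up', 'Down'] in all_combos)  # False
```

With these loops the agent chooses among 62 discrete actions, and opposing directions such as [Up, Down] never appear.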


### DQN Baseline Testing (Josiah)

After setting up the requirements for the custom Gymnasium environment ported by Everett, I established a quantitative ‘random agent’ baseline for the DQN model. This was done as a negative control for the model’s performance. The idea was that we would need our agents to perform consistently better than ones selecting random actions to quantifiably determine if learning was successful, or if an apparent success was due to chance. I did this by evaluating the mean and standard deviation of rewards gained by 100 separate DQN models trained for 0 timesteps and tested over one evaluation episode each. For these tests, 50 of the models were initialized with the multilayer perceptron policy and 50 with a convolutional neural network policy. Each set was also given a unique set of random seeds. The models were only tested for a single episode since, in their untrained state, our team found that they performed a deterministic set of actions in any episode for a given seed.
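
The comparison against a random baseline can be summarized with a mean and a standard deviation. The following is a minimal sketch of one way to do that, assuming a list of per-episode rewards for each group; the function names, the threshold of `k` standard deviations, and the numbers are hypothetical and are not results or criteria from the project.

```python
from statistics import mean, stdev


def summarize(rewards):
    """Return the mean and sample standard deviation of episode rewards."""
    return mean(rewards), stdev(rewards)


def beats_baseline(trained_rewards, baseline_rewards, k=2.0):
    """True if the trained mean exceeds the baseline mean by more than k baseline std devs."""
    base_mean, base_std = summarize(baseline_rewards)
    return mean(trained_rewards) > base_mean + k * base_std


# Made-up numbers for illustration only.
baseline = [10.0, 12.0, 8.0, 11.0, 9.0]
trained = [15.0, 16.0, 14.0]

print(summarize(baseline))
print(beats_baseline(trained, baseline))
```

The choice of `k` is a judgment call; a larger value makes it harder to mistake a lucky run for learning.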


### Initial DQN Hyperparameter Tuning (Original Reward Function) (Josiah)

After establishing a baseline for DQN, I started on the hyperparameter tuning process by performing a grid search with 40 models from each policy and default hyperparameters over training lengths ranging from 10,000 to 1,000,000 timesteps. This was done to gain a better understanding of the number of learning steps that might be needed for the DQN model to converge, and to try to determine whether one of the policies was working better for our problem environment. This was also essentially a “sign of life” test, as suggested in the talk at the 2021 Reinforcement Learning Virtual School, *RL Tips and Tricks and The Challenges of applying RL to Real Robots* (Raffin, 2021a). During this process, I also made several functions to plot the rewards gained by both the baseline agents and the trained agents, to get a better understanding of how the training process was proceeding and to look for a visual difference in performance between trained and random baseline agents. These included a method to plot the total points gained by an agent at every timestep of an episode, as well as ones to plot the cumulative sum of rewards gained by the agent at every timestep.


In [None]:
import matplotlib.pyplot as plt


def eval_plotter(ep_recorder: dict, trial=None, legend=True, plot_each=False, show_all=False, title='title'):
    """Plot cumulative rewards per episode and, optionally, raw per-step rewards."""
    # Cumulative reward curve for each recorded episode
    for k, (steps, rsum, _all_rewards, _ep_info) in ep_recorder.items():
        plt.plot(range(steps), rsum, label=f'episode: {k}')
        plt.title(title)
        plt.ylabel('Cumulative Rewards')
        plt.xlabel('Steps')
        if legend:
            plt.legend()
        if plot_each:
            plt.show()  # one figure per episode

    plt.show()

    if show_all:
        # Fade later episodes so overlapping points stay readable
        alpha = 1.0
        alpha_offset = 0.8 / len(ep_recorder)

        for k, (steps, _rsum, all_rewards, _ep_info) in ep_recorder.items():
            plt.plot(range(steps), all_rewards, '.', label=f'episode: {k}', alpha=alpha)
            plt.title(f'Trial {trial} Rewards per Step: All Episodes')
            plt.ylabel('Rewards')
            plt.xlabel('Steps')
            if legend:
                plt.legend()
            if plot_each:
                plt.show()
            # Never let the points become fully transparent
            alpha = max(alpha - alpha_offset, 0.15)

    plt.show()
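
The `ep_recorder` dictionary that `eval_plotter` consumes maps an episode number to a tuple of (step count, running reward sum, per-step rewards, episode info). A minimal sketch of how one such entry could be assembled from a list of per-step rewards is shown below. The helper name `make_episode_record` and the sample rewards are hypothetical and only illustrate the expected shape of the data.

```python
from itertools import accumulate


def make_episode_record(step_rewards, ep_info=None):
    """Build one ep_recorder entry: (steps, running_sum, rewards, info)."""
    rewards = list(step_rewards)
    running_sum = list(accumulate(rewards))  # cumulative reward at each step
    return len(rewards), running_sum, rewards, ep_info


# Hypothetical rewards for two short episodes
ep_recorder = {
    0: make_episode_record([0, 100, 0, -50]),
    1: make_episode_record([10, 10, 10]),
}
```

A dictionary built this way can be passed straight to `eval_plotter(ep_recorder, show_all=True)`.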

### Preprocessing and Reward Function Analysis (Josiah) 

Because the results of this last test showed extreme variability, and both Everett and Austin were seeing similar results on their more highly tuned models, the other team members and I suspected that the reward function and/or the preprocessing we were using for the test environment might be poorly suited to our algorithms. To determine whether the preprocessing was causing the algorithms to fail, I used a tutorial video by Lower (2023) to set up a TensorBoard UI so that I could assess the DQN’s progress more clearly during training.

I also made several separate test environment modules with and without additional preprocessing techniques. These included stochastic frame skipping, which randomly skips a few of the images shown to the model as observations, and environment clipping, which allows the model’s action and observation spaces to be scaled to the interval [-1, 1]. The first of these was a normalization technique suggested in the aforementioned video by Lower (2023), and the latter was suggested in the previously referenced video by Raffin (2021a), *RL Tips and Tricks and The Challenges of applying RL to Real Robots.* With these modules I performed a grid search over environments with these preprocessing techniques. I also used several of the reward functions being developed by Austin, including some with new positional data gained through Everett’s work, and a few test reward functions of my own. In total, this covered 11 different test environments with different reward functions and 3 different preprocessing techniques: frame delta (provided by the Renotte (2022) code that Everett ported), stochastic frame skipping, and action and observation space clipping.
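
To make the [-1, 1] scaling concrete, here is a minimal sketch of the linear mapping involved, along with its inverse. The function names are hypothetical and the example uses 8-bit pixel intensities as an assumed input range; it is an illustration of the arithmetic, not the wrapper used in our test modules.

```python
def rescale_to_unit_interval(value, low, high):
    """Map value from [low, high] linearly onto [-1, 1]."""
    if high <= low:
        raise ValueError("high must be greater than low")
    return 2.0 * (value - low) / (high - low) - 1.0


def rescale_from_unit_interval(scaled, low, high):
    """Inverse mapping: take a value in [-1, 1] back to [low, high]."""
    return low + (scaled + 1.0) * (high - low) / 2.0


# Assumed example range: 8-bit pixel intensities
pixels = [0, 127.5, 255]
scaled = [rescale_to_unit_interval(p, 0, 255) for p in pixels]
```

Keeping inputs in a small symmetric range like this is a common way to avoid very large activations early in training.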

In [None]:
from typing import Optional

class SF_Default: ...

class SF_FrameDelta(SF_Default):
    def __init__(self, render_mode: Optional[str] = None, random_delay: int = 30, use_delta: bool = True):
        super().__init__(render_mode=render_mode, random_delay=random_delay, use_delta=use_delta)
        self.custom_test_name = 'Default_With_Frame_Delta'

    def __str__(self):
        return self.custom_test_name
    
    def __repr__(self):
        return self.custom_test_name
    

class SF_Default_Scaled_Rewards(StreetFighter):
    def __init__(self, render_mode: Optional[str] = None, random_delay: int = 30, use_delta: bool = False):
        super().__init__(render_mode=render_mode, random_delay=random_delay, use_delta=use_delta)
        self.custom_test_name = 'SF_Default_Scaled_Rewards'

        self.score = 0
        self.enemy_health = 175
        self.health = 175
        self.enemy_wins = 0
        self.player_wins = 0
        self.random_delay = random_delay
    
    ...

In [None]:
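from stable_baselines3.common.atari_wrappers import MaxAndSkipEnv
from stable_baselines3.common.monitor import Monitor
from stable_baselines3.common.vec_env import DummyVecEnv, SubprocVecEnv, VecFrameStack, VecTransposeImage


def skip_env(base_env_class,
             render_mode=None,
             log_dir=LOG_DIR,
             n_procs: int = 4,
             n_stack: int = 4):
    """Build a vectorized environment that repeats each action for 4 frames."""

    def make_env():
        # make_discrete_env and LOG_DIR are defined elsewhere in the project
        env = make_discrete_env(base_env_class=base_env_class, render_mode=render_mode)
        env = MaxAndSkipEnv(env, skip=4)
        return Monitor(env, filename=log_dir)

    if n_procs == 0:
        # Single-process variant (no frame stacking)
        env = DummyVecEnv([make_env])
    else:
        # One subprocess per environment, with the last n_stack frames stacked
        env = SubprocVecEnv([make_env for _ in range(n_procs)])
        env = VecFrameStack(env, n_stack, channels_order='last')
    return VecTransposeImage(env)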
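
`MaxAndSkipEnv` from Stable Baselines3 repeats each chosen action for `skip` frames, sums the rewards collected along the way, and returns the element-wise maximum of the last two frames as the observation. The sketch below imitates that behavior in plain Python so the idea can be seen without an emulator. `CountingEnv` and `max_and_skip_step` are hypothetical stand-ins, and frames are short lists of integers rather than images.

```python
class CountingEnv:
    """Toy stand-in for an emulator: frame t is the list [t, 10 - t]."""

    def __init__(self, length=10):
        self.length = length
        self.t = 0

    def reset(self):
        self.t = 0
        return [0, 10], {}

    def step(self, action):
        self.t += 1
        obs = [self.t, 10 - self.t]
        reward = 1.0
        terminated = self.t >= self.length
        return obs, reward, terminated, False, {}


def max_and_skip_step(env, action, skip=4):
    """Repeat `action` for up to `skip` frames and pool the results."""
    total_reward = 0.0
    recent = []  # at most the last two frames seen
    terminated = truncated = False
    info = {}
    for _ in range(skip):
        obs, reward, terminated, truncated, info = env.step(action)
        recent = (recent + [obs])[-2:]
        total_reward += reward
        if terminated or truncated:
            break
    # Element-wise maximum over the retained frames
    pooled = [max(pixels) for pixels in zip(*recent)]
    return pooled, total_reward, terminated, truncated, info


env = CountingEnv()
env.reset()
obs, reward, terminated, truncated, _ = max_and_skip_step(env, action=0)
```

With `skip=4`, the agent makes one decision per four emulator frames, which shortens the effective episode length seen by the learning algorithm.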

### Second Round DQN Baseline Testing and Hyperparameter Tuning (Josiah) 

After the team felt that they had found the best of the new reward functions, I incorporated it into my test environment and established a new baseline with another random agent. For the baseline testing, I used a specialized version of the evaluate function provided in the Stable Baselines3 tutorial notebook (Raffin et al., 2021b). In this version, no model is involved in the testing, only the environment itself. It works by having a given test environment randomly select an action from its own action space, and then calculating the mean, standard deviation, and sum of total rewards, as well as the mean number of steps taken by the agent, over a given set of episodes. This technique was an improvement in that it allowed environments to be tested with a truly random agent in a way that was also agnostic to any particular model, policy, or algorithm we would want to analyze. As a bonus, it was also much faster than the previous approach, since models didn’t need to be initialized and trained for 0 timesteps for the baseline process; only the environments needed to be initialized. To perform the baseline testing for the new environment, I collected the previously mentioned statistics for a random agent over 50 episodes and saved the results.


In [None]:
from typing import Optional

from stable_baselines3.common.base_class import BaseAlgorithm


def evaluate(
    model: Optional[BaseAlgorithm],
    eval_env,
    n_eval_episodes: int = 25,
    deterministic: bool = False,
    return_episode_rewards: bool = False,
    progress_bar: bool = True,
    get_ep_info: bool = False,
    upper_n_step_bound: int = 10_000_000,
    discrete: bool = True
) -> dict:
    """
    Evaluate an RL agent for `n_eval_episodes`.
    Can perform regular prediction if given a model, or baseline
    prediction of a random agent if model is set to None.

    returns: mean, std of episode rewards, and mean episode steps if return_episode_rewards is False
    otherwise it will return a dict with recorded information from each episode, where the key of the
    dict is the episode number.

    the values of the dict will be:
    [ep_total_num_steps, ep_running_sum_per_step, ep_rewards_per_step, ep_info]

    where ep_info will be a dict containing the information in the episode info object per episode step
    if get_ep_info is set to true.
    """
    ...
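
The model-free baseline described above boils down to sampling actions from the environment's own action space and accumulating statistics per episode. A minimal sketch of that loop follows, assuming a Gymnasium-style `reset`/`step` interface. `ToyActionSpace`, `ToyFightEnv`, and `random_baseline` are hypothetical names used only for illustration, and the real `evaluate` function has more options than this.

```python
import random
import statistics


class ToyActionSpace:
    """Minimal stand-in for a Gymnasium Discrete action space."""

    def __init__(self, n, seed=None):
        self.n = n
        self._rng = random.Random(seed)

    def sample(self):
        return self._rng.randrange(self.n)


class ToyFightEnv:
    """Hypothetical environment: the episode ends after `max_steps` steps."""

    def __init__(self, max_steps=20, seed=None):
        self.action_space = ToyActionSpace(3, seed)
        self.max_steps = max_steps
        self.steps = 0

    def reset(self):
        self.steps = 0
        return 0, {}

    def step(self, action):
        self.steps += 1
        reward = float(action)  # arbitrary reward, just to have numbers
        terminated = self.steps >= self.max_steps
        return self.steps, reward, terminated, False, {}


def random_baseline(env, n_episodes=50):
    """Run a uniformly random agent and summarize its episodes."""
    totals, lengths = [], []
    for _ in range(n_episodes):
        env.reset()
        total, steps, done = 0.0, 0, False
        while not done:
            action = env.action_space.sample()  # no model involved
            _, reward, terminated, truncated, _ = env.step(action)
            total += reward
            steps += 1
            done = terminated or truncated
        totals.append(total)
        lengths.append(steps)
    return {
        "mean_reward": statistics.mean(totals),
        "std_reward": statistics.pstdev(totals),
        "sum_reward": sum(totals),
        "mean_steps": statistics.mean(lengths),
    }


stats = random_baseline(ToyFightEnv(seed=0), n_episodes=50)
```

Because no model is constructed, a loop like this only pays the cost of stepping the environment, which is why it runs faster than evaluating untrained models.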

### Final DQN Hyperparameter Tuning and Analysis (Josiah)

For the final DQN hyperparameter tuning, I used a combination of hyperparameters based on those suggested by Stable Baselines3’s sister repository, RL Baselines3 Zoo (RL Baselines3 Zoo, 2023), as well as those suggested in the research paper *Human-level control through deep reinforcement learning* (Mnih et al., 2015).


In [None]:
h_params = {
    'policy': ['MlpPolicy', 'CnnPolicy'],
    'gamma': [0.9, 0.99],
    'learning_rate': [0.00001],
    'batch_size': [256],
    'buffer_size': [50_000],
    'train_freq': [16, 32],
    'gradient_steps': [1],
    'exploration_fraction': [0.3, 0.5],
    'exploration_final_eps': [0.1, 0.2],
    'target_update_interval': [100, 10000],
    'policy_kwargs': [dict(net_arch=[256, 256])],
    'seed':[2],
}

For each combination of these hyperparameters, I planned on training the model for a total of 500,000 timesteps and collecting the mean, standard deviation, and sum of total rewards gained by an agent over 10 evaluation episodes, as well as the mean number of steps reached by the agent in those episodes. Unfortunately, due to time constraints for the project, I was only able to test these hyperparameters for the multilayer perceptron policy (MlpPolicy), which had shown the best results for DQN so far.
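
If the grid above is taken exactly as written, it expands to 64 combinations, 32 of which use MlpPolicy. The sketch below shows one way to enumerate them with `itertools.product`. `expand_grid` is a hypothetical helper, and the counts are derived from the dictionary itself rather than from reported run counts.

```python
from itertools import product

h_params = {
    'policy': ['MlpPolicy', 'CnnPolicy'],
    'gamma': [0.9, 0.99],
    'learning_rate': [0.00001],
    'batch_size': [256],
    'buffer_size': [50_000],
    'train_freq': [16, 32],
    'gradient_steps': [1],
    'exploration_fraction': [0.3, 0.5],
    'exploration_final_eps': [0.1, 0.2],
    'target_update_interval': [100, 10000],
    'policy_kwargs': [dict(net_arch=[256, 256])],
    'seed': [2],
}


def expand_grid(grid):
    """Yield one dict per combination of the listed values."""
    keys = list(grid)
    for values in product(*(grid[k] for k in keys)):
        yield dict(zip(keys, values))


all_configs = list(expand_grid(h_params))
mlp_configs = [c for c in all_configs if c['policy'] == 'MlpPolicy']
print(len(all_configs), len(mlp_configs))  # 64 32
```

Each resulting dictionary holds one value per hyperparameter, so restricting the search to a single policy halves the number of training runs.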

After the grid search I analyzed the results by saving them in a pandas dataframe and comparing the mean results for a particular DQN architecture to those of the baseline previously established.

As a final step, I also recorded a video of the best performing DQN model so that it could be qualitatively compared to the best results from both A2C and PPO being tested by my teammates. This video is shown in the Results section.

## Results

### A2C (Everett)

During one of my initial experiments, I ran a preliminary grid search using an earlier version of the reward function to investigate the impact of the learning rate and steps per update. The interpretability of results from this run was somewhat limited, and this hinted at issues that were to come.

![preliminary result table](images/everett/prelim-table-2.png)

Since step count did not seem to have a consistent impact on model performance, and smaller batches tended to train more smoothly without pausing for longer periods of time, I ultimately decided to leave the steps-per-update value at the default of 5 for the next experiment.

For my last experiment I compared eighteen trained A2C models in an attempt to determine if there were any visible effects from tweaking the learning rate, policy model type, and reward function. Each successive reward function tested introduced new properties into the calculation, the first adding my distance metric and the second incorporating Austin’s considerations of damage taken while jumping or blocking. The models were all trained for 500,000 timesteps over eight concurrent environment threads. During the evaluation phase, models ran through 25 episodes, each encompassing the time between when the player initiated the fight with Guile and ending when the player got a Game Over.

The average episode reward and length are listed below. I’ve also included baseline measurements I took by running random actions against the same reward function and episode count.


![final result table](images/everett/final-table.png)

Learning-curve graphs taken from TensorBoard also show how episode length and reward evolved over the course of training. Note that the scale on the episode reward graph depends on the particular reward function in question, so different reward functions can’t be compared directly using that metric.

![Mean episode length by model](images/everett/final-ep-length-mean.png)
![Mean episode reward by model](images/everett/final-reward-mean.png)

Ultimately, performance from these models was somewhat disappointing. The models did not score much higher than the baseline (worse in many cases), and the standard deviations of the runs were large enough that there weren’t many obvious patterns. The most complex reward function of the three (with Austin’s blocking and jumping rules) had some of the lowest mean episode lengths, but this is likely mainly due to random variation. The best model out of that category happened to have the longest session when I recorded a video of it playing an episode.

In [9]:
from IPython.display import Video
Video("videos/a2c-video-mlp-all-reward.mp4")

The furthest I've seen a model get is Stage 3 (Chun-Li), but that is a rare occurrence. In this case, the model makes it to Stage 2 but can't make it any further. This is perhaps not a serious achievement, since a random number generator is capable of making just as much progress.

### DQN (Josiah)

Shown below are the results of the initial baseline testing for the DQN algorithm with a random agent. Note that, since each model in this part of the project was evaluated for a single episode, the reported standard deviations of rewards do not reflect the true standard deviation between episode rewards (which, from the graph, we can see was quite high).
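
To illustrate the caveat above: the standard deviation of a single episode's reward is zero by construction, whereas pooling the single-episode rewards of several models recovers the between-episode spread. The reward values below are made up for illustration and are not taken from our results.

```python
import statistics

# Hypothetical total rewards, one evaluation episode per model
episode_rewards = [31000.0, 42000.0, 18000.0, 55000.0]

# Per-model spread: each model contributes exactly one episode
per_model_std = [statistics.pstdev([r]) for r in episode_rewards]

# Spread across models, which is closer to the variability seen in the plots
pooled_std = statistics.pstdev(episode_rewards)
```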

Initial total baseline results for a random agent using both multilayer perceptron (Mlp) and convolutional network (Cnn) policies.

![Combined mean results for both policies](images/josiah/01-combined-mean.png)
<center>Combined mean results for both policies</center>

![Mean results for the Cnn policy](images/josiah/02-mean-cnn.png)
<center>Mean results for the Cnn policy</center>

![Mean results for the Mlp policy](images/josiah/03-mean-mlp.png)
<center>Mean results for the Mlp policy</center>

![Cumulative rewards](images/josiah/04-rewards-step.png)
<center>Graph of the cumulative rewards per step gained by the random agent for both policies, and over 800 evaluation episodes.</center>

The baseline with the initial reward function suggested that we could expect a random agent to earn a mean total reward of approximately 30,000 to 40,000 and to survive in the game for approximately 6,500 timesteps. The timestep metric is important here since, while the total reward varies with the scale of the rewards given by a particular reward function, the number of timesteps is closely tied to the agent’s performance in the game. In this case, 6,000 timesteps is approximately the beginning of the second level of the game. We can also note that there was an extreme amount of variability in the rewards for each episode.

Initial total baseline results for a default DQN agent using both multilayer perceptron (Mlp) and convolutional network (Cnn) policies, trained over a range of total timesteps from 10,000 to 1,000,000.

![Combined mean results for both policies](images/josiah/05-combined-mean.png)
<center>Combined mean results for both policies</center>

![Mean results for the Cnn policy](images/josiah/06-mean-cnn.png)
<center>Mean results for the Cnn policy</center>

![Mean results for the Mlp policy](images/josiah/07-mean-mlp.png)
<center>Mean results for the Mlp policy</center>

![Cumulative rewards](images/josiah/08-cumulative.png)
<center>Graph of the cumulative rewards per step gained by a default DQN agent for both policies, and over 800 evaluation episodes.</center>

Shown above are the results of the initial tests over different policies and timesteps for the Stable Baselines3 implementation of the DQN algorithm. Some important things to note are that, while the algorithm seemed to perform slightly better than a random agent in terms of total timesteps, it also performed significantly worse in terms of mean total rewards. Also, while we can visually see that the algorithm seemed to have slightly less variance in its rewards than a random agent, the amount of variation was still very high. This was one of the main motivations to try to find a reward function that better fit our project's game environment and algorithms.

Below are the results of the baseline testing and evaluation of DQN agents trained with our team’s second reward function.

#### Baseline testing with a random agent over 50 episodes (second reward function)
![Baseline Results](images/josiah/09-baseline.png)

#### Grid Search Results for DQN over 500_000 training timesteps

![Mean episode length](images/josiah/10-episode-length.png)
<center>Mean episode length.</center>

![Mean episode rewards](images/josiah/11-episode-rewards.png)
<center>Mean episode rewards.</center>

![Exploration Rates](images/josiah/12-exploration-rate.png)
<center>Exploration Rates over the course of training.</center>

![Reward results](images/josiah/13-reward-results.png)
<center>Reward results for the highest scoring model (on mean reward and mean test episode length) after 50 evaluation episodes.</center>

From the results, we can see that the variability of the DQN’s performance was somewhat improved by the new reward function. However, the DQN algorithm still was not able to perform better than the random agent, and the learning curves from training suggest that, after a brief initial period, the agents were not learning in the way we would hope to see. This suggests that additional hyperparameter tuning or adjustments to the experiment design would be needed to see better results from the DQN algorithm, such as reward shaping more tailored to DQN, as well as much longer and more extensive hyperparameter tuning than we have tried thus far.
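
For reference when reading the exploration-rate curves, Stable Baselines3's DQN anneals epsilon linearly from an initial value (1.0 by default) down to `exploration_final_eps` over the first `exploration_fraction` of training, then holds it constant. The helper below is a standalone restatement of that schedule, not the library's code. `linear_epsilon` is a hypothetical name, and the values plugged in are one combination from the hyperparameter grid.

```python
def linear_epsilon(step, total_timesteps, exploration_fraction,
                   final_eps, initial_eps=1.0):
    """Epsilon after `step` of `total_timesteps` under a linear decay."""
    progress = step / total_timesteps
    if progress >= exploration_fraction:
        return final_eps  # decay finished, epsilon stays flat
    slope = (final_eps - initial_eps) / exploration_fraction
    return initial_eps + progress * slope


total = 500_000
checkpoints = (0, 75_000, 150_000, 400_000)
schedule = [linear_epsilon(s, total, 0.3, 0.1) for s in checkpoints]
```

With these settings the agent is still acting randomly more than half the time a quarter of the way through the decay window, and at least 10% of the time for the remainder of training.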

Using the method described by Kerkez (2013), a recording of the best model is embedded below:

In [12]:
from IPython.display import Video
Video("videos/dqn-video.mp4")

### PPO (Austin)

#### PPO Baseline Testing: 0 timesteps

![Baseline testing](images/austin/01-cnn-mlp.png)


#### PPO Default Testing: 1,000,000 timesteps:

CNN Policy: Blue, Purple, Green

MLP Policy: Orange, Red, Yellow

(Line Smoothing: Off)

![mean PPO episode length and reward](images/austin/02-ep.png)
![approx_kl, clip_fraction, entropy_loss, explained_variance](images/austin/03-train.png)
![train loss, policy gradient loss, value loss](images/austin/04-train.png)
![individual episode rewards and lengths](images/austin/05-policy.png)

#### Grid Search 1: Learning Rate, Batch Size, Number of Batches, & Number of Update Steps
(Line Smoothing: On)

![mean PPO episode length and reward](images/austin/06-rollout.png)
![approx_kl, clip_fraction, entropy_loss, explained_variance](images/austin/07-train.png)
![train loss, policy gradient loss, value loss](images/austin/08-train.png)

Top 17 Performing Models From Grid Search 1:
![individual episode rewards and lengths](images/austin/09-grid.png)

In previous versions of the experiment, the MLP policy models consistently outperformed the CNN policy models. With the current reward function, both policy types seem to perform roughly the same on average. In this grid search, however, the best CNN policy model outperformed the best MLP policy model by a fairly large margin on both mean reward and mean episode length.

#### Grid Search 2: Epochs, Gamma, GAE Lambda, & Clip Range
(Line Smoothing: On)

![mean PPO episode length and reward](images/austin/10-rollout.png)
![approx_kl, clip_fraction, entropy_loss, explained_variance](images/austin/11-train.png)
![train loss, policy gradient loss, value loss](images/austin/12-train.png)

Models 1 through 54 are MLP policy models, whereas 55 through 100 are CNN models. Training had to be stopped before the final 8 models could finish to ensure we met the turn-in deadline.
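
As a reminder of what three of the parameters in this search control, the sketch below computes generalized advantage estimates (governed by gamma and GAE lambda) and the clipped surrogate term that PPO maximizes (governed by the clip range). It is a plain-Python illustration for a single trajectory with no early termination, not the Stable Baselines3 implementation. The function names and the sample numbers are hypothetical.

```python
def gae_advantages(rewards, values, last_value, gamma, gae_lambda):
    """Generalized advantage estimation for one non-terminating trajectory."""
    advantages = [0.0] * len(rewards)
    next_value, running = last_value, 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_value - values[t]  # TD error
        running = delta + gamma * gae_lambda * running
        advantages[t] = running
        next_value = values[t]
    return advantages


def clipped_surrogate(ratio, advantage, clip_range):
    """PPO's per-sample objective: min of unclipped and clipped terms."""
    clipped_ratio = max(1.0 - clip_range, min(ratio, 1.0 + clip_range))
    return min(ratio * advantage, clipped_ratio * advantage)


advantages = gae_advantages([1.0, 1.0, 1.0], [0.0, 0.0, 0.0], 0.0,
                            gamma=1.0, gae_lambda=1.0)
gain = clipped_surrogate(ratio=1.5, advantage=1.0, clip_range=0.2)
```

A lower gamma or GAE lambda shortens how far ahead rewards influence each advantage, while a larger clip range lets a single update move the policy further from the one that collected the data.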

Best Models from Grid Search 2 Chosen for Further Evaluation:
![individual episode rewards and lengths](images/austin/13-table.png)

Here the CNN policy model has a slightly higher mean reward, but the MLP policy model has a larger mean episode length. This shows that the MLP policy model is able to survive longer, but the CNN is slightly better at performing in accordance with the reward function. The MLP’s higher episode length does not necessarily mean it made it further than the CNN in the game; the increase could instead reflect the model’s ability to maintain its health for longer periods, prolonging its matches. Reviewing recordings of both models will give us a sense of why the MLP mean episode length is larger. Below is a video showing the best MlpPolicy model from the table above:

In [13]:
from IPython.display import Video
Video("videos/ppo-gs2-mlp.mp4")

Additionally, here is the CNN model:

In [16]:
from IPython.display import Video
Video("videos/ppo-gs2-cnn.mp4")

By comparing the rendered performance of the two models, we can see that the larger mean episode length of the MLP policy model is likely due to it being able to venture further into the game’s campaign instead of it being able to avoid taking damage for prolonged periods. This is fairly apparent as the MLP policy model wins against Guile, reaching Ken, and the CNN policy model loses to Guile, ending the episode.
 
When it comes to mean reward and mean episode length, these two models do perform better than the random baseline and the default training models. This shows that some improvement can be made through tuning algorithm hyperparameters. However, hyperparameter tuning alone is unlikely to produce drastic performance improvements.

Due to the extensive complexities of these libraries, there may be an important concept we overlooked that would greatly aid in improving the model’s ability to succeed in the game. The idea that we overlooked a key concept becomes more apparent when we train these models for longer periods.

#### Final Model Training: 1,000,000 Timesteps

CNN: Blue

MLP: Red 

(Line Smoothing: Off)

![mean PPO episode length and reward](images/austin/14-rollout.png)
![approx_kl, clip_fraction, entropy_loss, explained_variance](images/austin/15-train.png)
![train loss, policy gradient loss, value loss](images/austin/16-train.png)

| Policy    | Learning Rate | Batch Size | Seed | Epochs | Gamma | GAE Lambda | Clip Range | Mean Reward | Episode Length Mean | Learning Time | Evaluation Time |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| MlpPolicy | 0.000100 | 256  | 64 | 10 | 0.999 | 0.95 | 0.2 | -21.52 | 7979.32 | 1682.788061 | 104.796033 |
| CnnPolicy | 0.000001 | 1024 | 64 | 25 | 0.900 | 0.99 | 0.3 |  18.92 | 8950.36 | 3633.778498 | 117.024862 |



In these larger training sessions we remove the set seed passed to the model for training during grid search experiments. By removing the seed, we are able to understand if the model is able to consistently learn from different starting conditions. Here we can see that the performance given a randomly generated seed introduces inconsistency in the model’s ability to perform. The video below displays the model’s inability to perform at the same level as its previous versions. 

In [14]:
from IPython.display import Video
Video("videos/ppo-1m-mlp.mp4")

#### Bonus Model Training: 5,000,000 Timesteps

CNN: Blue

MLP: Pink 

(Line Smoothing: Off)

![mean PPO episode length and reward](images/austin/17-rollout.png)
![approx_kl, clip_fraction, entropy_loss, explained_variance](images/austin/18-train.png)
![train loss, policy gradient loss, value loss](images/austin/19-train.png)

Once again, due to time, I was forced to end the training early to ensure we met the deadline. 
We can still see the trajectory of both the models' episode length and reward decreasing, and we have a completed MLP model. This one exhibits particularly strange behavior.

In [15]:
from IPython.display import Video
Video("videos/ppo-5m-mlp.mp4")

The inability to improve could be due to having an insufficient reward function that prevents the model from improving after a certain threshold of time steps, introducing variable starting conditions via random seed generation, preprocessing that does not filter out enough noise, or there may be a piece of information we are missing that is critical to bypassing this barrier. Reading further into documentation and performing additional experiments are needed to know the direct causes of this issue.

## Conclusions

An unexpected challenge with this project was getting all three project participants’ development environments set up. As mentioned previously, running the libraries we used on Windows required a great deal of trial and error. We also had to troubleshoot issues with memory usage and limit ranges for hyperparameters like steps per update, since values that were too large would allocate absurdly large buffers.

Integrating our various codebases also presented a challenge because each of us worked on our own models semi-independently. Everett set up a Git repository to synchronize our code with one another, but by the time we started using it more, it was largely too late to more closely integrate most of our models. As a result, the particular form of evaluation used for each model varied, with Austin’s having the most comprehensive hyperparameter grid-search and Josiah’s model possessing the most domain-specific metrics like round wins.

During the final two weeks of our experiments, we spent much of our time trying and testing new reward functions in the hopes of improving the model’s ability to learn. In certain models we see small areas of improvement, but it is still unclear whether the reward function is the bottleneck on model performance. Given more time, we would have liked to read through all parts of the Stable Baselines and Stable-Retro library documentation to find our “missing link.”

One of our largest roadblocks for this project was time. As seen in prior discussions, performing large grid searches would take half a day to multiple days to finish. For example, the PPO’s two-part grid search took eighty hours to train 154 models. Though we saw some improvements, the changes left us with inconclusive leads for the direction we needed to take to reach the level of performance we were striving for.

Overall, it was challenging to get consistent results from our models. It’s difficult to tell if particular hyperparameters or model architectures had benefits over one another, so ultimately the answer to our hypothesis is inconclusive. In fact, many of our models were comparable to a baseline player that simply performed a random action every frame. This raises the question of how many of our results are meaningful, and how many are noise. It’s possible that grayscale video input to the models may be asking too much, and that a simpler model that operates directly on properties like position, velocity, and health might perform better. Those types of adjustments may be worth investigating in the future.


## References

- AurelianTactics. (2018, July 8). *PPO Hyperparameters and Ranges.* Medium. https://medium.com/aureliantactics/ppo-hyperparameters-and-ranges-6fc2d29bccbe 
- CampusX. (2022, July 28). *Design the Best Reward Function | Reinforcement Learning Part-6.* YouTube. https://www.youtube.com/watch?v=IdJL9rcQrFU
- Farama Foundation. (2023). *Game Integration.* Stable-Retro Documentation. https://stable-retro.farama.org/integration/
- Farama Foundation. (2023). *ppo.py.* Stable-Retro. https://github.com/Farama-Foundation/stable-retro/blob/master/retro/examples/ppo.py
- Hugging Face. (2023, April 19). *Advantage Actor Critic (A2C).* Hugging Face. https://huggingface.co/learn/deep-rl-course/unit6/advantage-actor-critic
- Hulse, C. [corbosiny]. (2022, October 24). *AIVO-StreetFigherReinforcementLearning.* GitHub. https://github.com/corbosiny/AIVO-StreetFigherReinforcementLearning
- Kerkez, V. (2013, August 2). *Just do: from IPython.display import Video Video("test.mp4")* [Comment on the online forum post How can I play a local video in my IPython notebook?] StackOverflow. https://stackoverflow.com/a/18026076
- Lower, J. (2023). *Python Reinforcement Learning using Stable baselines. Mario PPO.* YouTube. https://www.youtube.com/watch?v=PxoG0A2QoFs&t=1959s
- Mnih, V., Kavukcuoglu, K., Silver, D. et al. (2015). *Human-level control through deep reinforcement learning.* Nature 518, 529–533 (2015) https://doi.org/10.1038/nature14236
- Nichol, et al. (2018a). *Fast: A New Benchmark for Generalization in RL.* https://arxiv.org/abs/1804.0372
- Nichol, et al. (2018b). *discretizer.py.* GitHub. https://github.com/openai/retro/blob/master/retro/examples/discretizer.py
- Raffin, A. (2021a). *RL Tips and Tricks and The Challenges of applying RL to Real Robots.* YouTube. https://youtu.be/Ikngt0_DXJg
- Raffin, A., et al. (2021b). *Stable Baselines3 Tutorial - Getting Started.* https://colab.research.google.com/github/Stable-Baselines-Team/rl-colab-notebooks/blob/sb3/stable_baselines_getting_started.ipynb#scrollTo=hyyN-2qyK_T2
- Raffin, A., et al. (2023a, May 8). *A2C.* Stable Baselines3 2.3.2 documentation. https://stable-baselines3.readthedocs.io/en/master/modules/a2c.html
- Raffin, A., et al. (2023b, May 8). *DQN.* Stable Baselines3 2.3.2 documentation. https://stable-baselines3.readthedocs.io/en/master/modules/dqn.html
- Renotte, N. (2022, February 9). *Build a Street Fighter AI Model with Python | Gaming Reinforcement Learning Full Course.* YouTube. https://www.youtube.com/watch?v=rzbFhu6So5U 
- RL Baselines3 Zoo. (2023). *hyperparams_opt.py.* GitHub. https://github.com/DLR-RM/rl-baselines3-zoo/blob/master/rl_zoo3/hyperparams_opt.py
- videogames.ai. (2019, August 31). *MLP vs. CNN after 250M training with PPO2.* YouTube. https://www.youtube.com/watch?v=P14Zn1dSIHY

# Word Count

In [7]:
import io
from nbformat import current
import glob
nbfile = glob.glob('HegartyLewarkYoungren-FinalProject.ipynb')
if len(nbfile) > 1:
    print('More than one ipynb file. Using the first one.  nbfile=', nbfile)
with io.open(nbfile[0], 'r', encoding='utf-8') as f:
    nb = current.read(f, 'json')
word_count = 0
for cell in nb.worksheets[0].cells:
    if cell.cell_type == "markdown":
        word_count += len(cell['source'].replace('#', '').lstrip().split(' '))
print('Word count for file', nbfile[0], 'is', word_count)

Word count for file HegartyLewarkYoungren-FinalProject.ipynb is 7752
