# Platform for Reinforcement Learning Experiments
## Chris Durbin system project
In this notebook I will walk through my system project. I will provide a rationale for my project, a high level overview for what the project does and its components, then delve into examples of how to use the project. Finally I will wrap up the notebook adding more details based on the rubric for the assignment and how everything fits in with the 6Ds framework.

## Rationale
I chose this specific project because I see reinforcement learning showing a lot of promise in the pursuit of general intelligence, and I want to understand it well, especially the latest advances in the research. I've wanted to try out the OpenAI Gym for the last few years, and I've known I wanted a good way to tweak and test experiments without losing track of what I tweaked and what worked best. So I decided to set up a platform for running reinforcement learning experiments with an initial focus on the OpenAI Gym environments, but with a path towards incorporating any environment.

## Overview

Since my project fits into the Platform as a Service (PaaS) model there is a lot that you can do, but not a concrete single demo of "the project". The best way to show what it can do is to walk through a few use case examples of how one might utilize the PaaS for different scenarios. Ideally the environment is robust enough that it can handle scenarios I have not even considered.

I used the rubric as well as the 6Ds framework as a guide to decide what to demo and how to best describe the project. In addition to having them guide the use cases, I also addressed them directly at the end of this notebook. I've also provided links so you can directly jump to those sections if you want more details on the platform before running the examples.

For the platform I am combining several open source projects:

[OpenAI Gym](https://www.gymlibrary.dev/) for the environments, [CleanRL](https://docs.cleanrl.dev/) for the reinforcement learning algorithms implementing the agents, and [Weights and Biases](https://docs.wandb.ai/) for tracking experiments. I'll go into more detail on these later on in the notebook, but figured it's best to jump right in with an example to show them in action.

## Table of Contents
* [Using the PaaS](#using-paas)
* [Use cases](#use-case-1)
    - [Use case 1: Compare different RL algorithms](#use-case-1)
    - [Use case 2: Stochasticity](#use-case-2) Effect of randomly ignoring chosen actions on learning performance.
    - [Use case 3: Resuming training of checkpointed model](#use-case-3)
* [Docker setup](#docker-setup)
* [Experiment tracking](#experiment-tracking)
* [Rubric](#rubric)
* [6Ds Framework](#6ds-framework)


<a id="using-paas"></a>
## Using the PaaS

I will now show three use cases to give an idea of how to use the platform. Each of these use cases trains an agent against the CartPole environment. I picked the CartPole environment for the demos because it trains quickly and it is easy to see how well the agent is performing by watching the saved videos. For the CartPole environment the goal is to move the cart in a way that keeps the pole balanced without falling over. The maximum score for an episode is 200, meaning the pole did not fall over in 200 timesteps.

<img src="images/CartPole2.png" width="300" />

Note that the platform can be run either in Docker using the provided image or locally on the host. Experiment tracking with Weights and Biases can use either a self-hosted server or the cloud-hosted version. I have tested each of the permutations. More instructions can be found in the Docker repository, but my recommendation is to run locally with Weights and Biases configured to use the cloud-hosted version at https://wandb.ai/. You will need to create an account and then log in.

First make sure to install all of the necessary libraries.

**WARNING** On the first install this can take several minutes especially on a slow network connection due to the multiple GB PyTorch dependency.

In [1]:
!pip install -r requirements.txt

Collecting gym==0.23.1
  Downloading gym-0.23.1.tar.gz (626 kB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting numpy==1.23.4
  Downloading numpy-1.23.4-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (17.1 MB)
Collecting pygame==2.1.0
  Downloading pygame-2.1.0-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (18.3 MB)
Collecting torch==1.13.0
  Downloading torch-1.13.0-cp38-cp38-manylinux1_x86_64.whl (890.2 M

#### Log in to wandb
You will need to run `wandb login` once from a terminal and follow the instructions to configure your API key. After that you will remain logged in and will not need to do that again.

In the cell below I am just verifying that I am logged in with my account. If you get any kind of interactive prompt, the key setup needs to be done outside of Jupyter.

In [1]:
!wandb login

wandb: Currently logged in as: jhebeler. Use `wandb login --relogin` to force relogin


<a id="use-case-1"></a>
### Use case 1 - Benchmark multiple reinforcement learning algorithms to compare training time and performance for each algorithm

#### Kicking off the first run using the PPO algorithm

In [3]:
import os
os.environ["SDL_VIDEODRIVER"] = "dummy"

In [5]:
!python code/ppo.py \
    --env-id CartPole-v0 \
    --total-timesteps 50000 \
    --wandb-project-name cart_pole_algo_compare \
    --track \
    --capture-video

2022-12-17 13:16:46.192834: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-12-17 13:16:46.866931: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/lib/python3.8/dist-packages/cv2/../../lib64:/usr/local/cuda/extras/CUPTI/lib64:/usr/local/cuda/compat/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64
2022-12-17 13:16:46.866991: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such fil

#### Kicking off the second run using the Deep Q Learning algorithm

In [1]:
!python code/dqn.py \
    --seed 1 \
    --env-id CartPole-v0 \
    --total-timesteps 50000 \
    --wandb-project-name cart_pole_algo_compare \
    --track \
    --capture-video

2022-12-17 13:28:48.596334: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-12-17 13:28:49.267405: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/lib/python3.8/dist-packages/cv2/../../lib64:/usr/local/cuda/extras/CUPTI/lib64:/usr/local/cuda/compat/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64
2022-12-17 13:28:49.267469: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such fil

#### Kicking off the third and final run using the C51 algorithm

In [5]:
!python code/c51.py \
    --seed 1 \
    --env-id CartPole-v0 \
    --total-timesteps 50000 \
    --wandb-project-name cart_pole_algo_compare \
    --track \
    --capture-video

  if not hasattr(tensorboard, "__version__") or LooseVersion(
wandb: Currently logged in as: chrisatumd. Use `wandb login --relogin` to force relogin
wandb: wandb version 0.13.6 is available!  To upgrade, please run:
wandb:  $ pip install wandb --upgrade
wandb: Tracking run with wandb version 0.13.5
wandb: Run data is saved locally in /Users/cdurbin/JHU/CreatingAIEnabledSystems/dockershare/SystemProject/wandb/run-20221210_225829-cmqg5igt
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run CartPole-v0__c51__1__1670731108
wandb: ⭐️ View project at https://wandb.ai/chrisatumd/cart_pole_algo_compare
wandb: 🚀 View run at https://wandb.ai/chrisatumd/cart_pole_algo_compare/runs/cmqg5igt
global_step=1200, episodic_return=22.0
global_step=2000, episodic_return=13.0
global_step=2200, episodic_r

### Comparing performance
I then bring up the three runs in Weights and Biases (http://localhost:8080 when self hosted, https://wandb.ai/ when cloud hosted) and look at the charts and tables for the project I created called cart_pole_algo_compare. The charts show that PPO takes the least time to run 50,000 total steps and improves quickly, though it seems to have more variance in episode length (my performance metric) than the other two algorithms. There is no slam-dunk winner among the algorithms, so I would continue to test them with different parameters to choose the best one for my case. Note that each run used 50,000 total timesteps, which works out to a different number of episodes per algorithm, so most runs do not cover the entire X axis. Only the run with the most episodes does, which is the one that was slowest to improve: in this scenario the C51 algorithm.

<p float="left">
  <img src="images/AlgorithmPerformanceLearningCurve.png" width="300"/>
  <img src="images/AlgorithmComparisonTimes.png" width="300"/>
</p>


<a id="use-case-2"></a>
### Use case 2 - Test impact of chosen action replaced at random some percent of the time
For this experiment I want to see the effect on the performance of my reinforcement learning algorithm if the chosen action for a step is replaced by a random action. I added support for a flag `--ignore-action-rate` that takes a value between 0 and 1 and replaces the chosen action with a random one that fraction of the time (0.15 means 15 percent). I tested this with a few different configurations (0, 0.15, 0.35, and 0.70). In the notebook I am just demoing 0 and 0.35. I am using the PPO algorithm for the rest of the tests since it is the fastest one to train.
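To make the idea concrete, here is a minimal sketch of how an ignore-action rate could be applied at each step. It is not the code in `code/ppo.py`: the function names `apply_ignore_action_rate` and `replaced_fraction` are hypothetical, and I am assuming the replacement action is drawn uniformly from CartPole's two discrete actions.

```python
import random


def apply_ignore_action_rate(chosen_action, num_actions, ignore_action_rate, rng):
    """Return the action to execute for one environment step.

    With probability `ignore_action_rate` the agent's choice is discarded and
    a uniformly random action is used instead; otherwise the choice is kept.
    """
    if not 0.0 <= ignore_action_rate <= 1.0:
        raise ValueError("ignore_action_rate must be between 0 and 1")
    if rng.random() < ignore_action_rate:
        return rng.randrange(num_actions)
    return chosen_action


def replaced_fraction(ignore_action_rate, steps=10_000, seed=1):
    """Empirically measure how often the chosen action is discarded."""
    rng = random.Random(seed)
    num_actions = 2  # assumption: CartPole's two actions, push left or push right
    sentinel = -1    # out-of-range "chosen" action so every replacement is detectable
    replaced = 0
    for _ in range(steps):
        executed = apply_ignore_action_rate(sentinel, num_actions, ignore_action_rate, rng)
        if executed != sentinel:
            replaced += 1
    return replaced / steps
```

Note that a randomly drawn action can coincide with the action the agent would have chosen anyway, so with two actions the executed action actually differs from the chosen one only about half as often as the configured rate.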

#### Baseline with chosen action always used

In [6]:
!python code/ppo.py \
    --seed 1 \
    --exp-name CartPoleBaseline \
    --env-id CartPole-v0 \
    --total-timesteps 50000 \
    --wandb-project-name cart_pole_random_actions \
    --track \
    --capture-video

wandb: Currently logged in as: chrisatumd. Use `wandb login --relogin` to force relogin
wandb: wandb version 0.13.6 is available!  To upgrade, please run:
wandb:  $ pip install wandb --upgrade
wandb: Tracking run with wandb version 0.13.5
wandb: Run data is saved locally in /Users/cdurbin/JHU/CreatingAIEnabledSystems/dockershare/SystemProject/wandb/run-20221210_225934-xbms5sry
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run CartPole-v0__CartPoleBaseline__1__1670731173
wandb: ⭐️ View project at https://wandb.ai/chrisatumd/cart_pole_random_actions
wandb: 🚀 View run at https://wandb.ai/chrisatumd/cart_pole_random_actions/runs/xbms5sry
  loader = importlib.find_loader(fullname, path)
  deprecation(
  logger.deprecation(
  if distutils.version.LooseVersion(
global_step=52, episodic_ret

#### Experiment with random action used 35% of the time

In [7]:
!python code/ppo.py \
    --seed 1 \
    --exp-name CartPoleRandom35 \
    --env-id CartPole-v0 \
    --total-timesteps 50000 \
    --wandb-project-name cart_pole_random_actions \
    --track \
    --capture-video \
    --ignore-action-rate 0.35

wandb: Currently logged in as: chrisatumd. Use `wandb login --relogin` to force relogin
wandb: wandb version 0.13.6 is available!  To upgrade, please run:
wandb:  $ pip install wandb --upgrade
wandb: Tracking run with wandb version 0.13.5
wandb: Run data is saved locally in /Users/cdurbin/JHU/CreatingAIEnabledSystems/dockershare/SystemProject/wandb/run-20221210_230004-36pzvr1y
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run CartPole-v0__CartPoleRandom35__1__1670731203
wandb: ⭐️ View project at https://wandb.ai/chrisatumd/cart_pole_random_actions
wandb: 🚀 View run at https://wandb.ai/chrisatumd/cart_pole_random_actions/runs/36pzvr1y
  loader = importlib.find_loader(fullname, path)
  deprecation(
  logger.deprecation(
  if distutils.version.LooseVersion(
global_step=52, episodic_ret

**Note**: For the charts below I included two other runs for a total of four runs. You could modify the cells if you wanted to repeat the tests with 0.15 and 0.7 as the values for --ignore-action-rate.
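If you would rather script the four runs than edit cells by hand, the commands could be generated along these lines. This is only a sketch: the helper `build_command` is hypothetical, the experiment names `CartPoleRandom15` and `CartPoleRandom70` are my guess at a naming pattern that matches the two names used above, and the flags are copied from the cells in this section.

```python
import shlex

# The four settings mentioned in this use case.
RATES = [0.0, 0.15, 0.35, 0.70]


def build_command(rate):
    """Build the argument list for one PPO run at the given ignore-action rate."""
    if rate == 0:
        exp_name = "CartPoleBaseline"
    else:
        exp_name = f"CartPoleRandom{round(rate * 100)}"  # assumed naming pattern
    cmd = [
        "python", "code/ppo.py",
        "--seed", "1",
        "--exp-name", exp_name,
        "--env-id", "CartPole-v0",
        "--total-timesteps", "50000",
        "--wandb-project-name", "cart_pole_random_actions",
        "--track",
        "--capture-video",
    ]
    if rate > 0:
        # The baseline cell above omits the flag entirely, so do the same here.
        cmd += ["--ignore-action-rate", str(rate)]
    return cmd


for rate in RATES:
    print(shlex.join(build_command(rate)))
```

Each argument list can be handed to `subprocess.run(cmd, check=True)` to actually launch the run; the sketch only prints the commands so nothing starts training by accident.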

The chart showing the performance over time is useful. The four colors are:
* Blue - all chosen actions taken
* Red - 15% of actions replaced with random action
* Purple - 35% of actions replaced with random action
* Red - 70% of actions replaced with random action

The maximum score for an episode is 200 and each training run had 50,000 total steps, so runs with shorter episodes have more episodes displayed in the chart. We see that 15% random still achieves the maximum score for many episodes, though it takes a little longer in training to get there. With 35% random actions the training never achieves the maximum score, and at 70% it only achieves a single score over 100. The next steps for this use case would be to run additional experiments with more total steps and different percentages to see at which percentage the training fails to converge to around the maximum score.
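The relationship between the fixed step budget and the number of episodes on the X axis is simple arithmetic, sketched below. The function name `episodes_in_budget` is hypothetical, and the mean episode length of 25 is an illustrative value rather than a measurement from my runs.

```python
def episodes_in_budget(total_timesteps, mean_episode_length):
    """Approximate number of complete episodes that fit in a fixed step budget."""
    if mean_episode_length <= 0:
        raise ValueError("mean_episode_length must be positive")
    return int(total_timesteps // mean_episode_length)


# An agent that always reaches the 200-step maximum finishes the fewest episodes.
print(episodes_in_budget(50_000, 200))

# An agent whose pole falls after about 25 steps (illustrative) finishes far more.
print(episodes_in_budget(50_000, 25))
```

So a run that balances the pole for the full 200 steps every time produces 250 episodes from 50,000 steps, while a run that keeps failing early produces many more points on the chart.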

<img src='images/RandomActionsPerformance.png' width='300'/>

You can see your own performance in wandb by selecting the cart_pole_random_actions project.

<a id="use-case-3"></a>
### Use case 3 - Checkpoint runs and resume training
This was one of the most important use cases for me, and I was excited when I was able to get it working successfully.

#### First I'll start a run and tell it to checkpoint the model after every 20 iterations
Note the addition of the parameter --checkpoint-frequency=20.

In [8]:
!python code/ppo.py \
    --seed 1 --env-id CartPole-v0 --total-timesteps 10000 \
    --track --capture-video --wandb-project-name cart_pole_checkpoint_and_resume \
    --checkpoint-frequency=20

wandb: Currently logged in as: chrisatumd. Use `wandb login --relogin` to force relogin
wandb: wandb version 0.13.6 is available!  To upgrade, please run:
wandb:  $ pip install wandb --upgrade
wandb: Tracking run with wandb version 0.13.5
wandb: Run data is saved locally in /Users/cdurbin/JHU/CreatingAIEnabledSystems/dockershare/SystemProject/wandb/run-20221210_230036-2wrh3qoy
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run CartPole-v0__ppo__1__1670731235
wandb: ⭐️ View project at https://wandb.ai/chrisatumd/cart_pole_checkpoint_and_resume
wandb: 🚀 View run at https://wandb.ai/chrisatumd/cart_pole_checkpoint_and_resume/runs/2wrh3qoy
  loader = importlib.find_loader(fullname, path)
  deprecation(
  logger.deprecation(
  if distutils.version.LooseVersion(
global_step=52, episodic_re

There are two things to notice from the run:

1. We're logging "Saving model checkpoint" each time we save the model. This is saving the file locally and then uploading it to W&B.
2. The training did not achieve the max score of 200 during the training and so clearly had not converged yet. 

Rather than start over we'll download the trained model and resume training from that point.
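The checkpoint-and-resume pattern itself can be sketched with nothing but the standard library. This is a toy stand-in, not my actual implementation: the real platform saves a trained model and uploads it to W&B, whereas here `pickle` and a local file stand in for both, a counter stands in for the model weights, and the function names are hypothetical.

```python
import os
import pickle
import tempfile


def save_checkpoint(path, state):
    """Write the training state to disk, replacing any previous checkpoint."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp_path, path)  # rename last so a crash never leaves a partial file


def load_checkpoint(path):
    """Read back the most recently saved training state."""
    with open(path, "rb") as f:
        return pickle.load(f)


def train(total_iterations, checkpoint_frequency, checkpoint_path, resume=False):
    """Toy training loop that checkpoints every `checkpoint_frequency` iterations."""
    if resume and os.path.exists(checkpoint_path):
        state = load_checkpoint(checkpoint_path)
    else:
        state = {"iteration": 0, "weights": 0}

    start = state["iteration"]
    for iteration in range(start + 1, start + total_iterations + 1):
        state["weights"] += 1  # placeholder for one policy update
        state["iteration"] = iteration
        if iteration % checkpoint_frequency == 0:
            save_checkpoint(checkpoint_path, state)
    return state


def demo():
    """Run 50 iterations, then resume from the last checkpoint for 40 more."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "model.pkl")
        first = train(50, 20, path)
        on_disk = load_checkpoint(path)["iteration"]
        resumed = train(40, 20, path, resume=True)
        return first["iteration"], on_disk, resumed["iteration"]
```

Calling `demo()` shows one consequence of the design: the first run ends at iteration 50, but the last checkpoint was written at iteration 40, so the resumed run picks up from 40 and anything after the final checkpoint is repeated.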

**IMPORTANT** For the next test make sure to set the WANDB_RUN_ID below to the value that was printed out as the run identifier when initially running the experiment. For example for my URL: https://wandb.ai/chrisatumd/cart_pole_checkpoint_and_resume/runs/2keuhvm7 the WANDB_RUN_ID to use is 2keuhvm7.

In [9]:
%env WANDB_RUN_ID 2keuhvm7

env: WANDB_RUN_ID=2keuhvm7


Now I use the --resume flag combined with setting the environment variable above to indicate which run to resume.

In [10]:
!python code/ppo.py \
    --seed 1 --env-id CartPole-v0 --total-timesteps 20000 \
    --track --capture-video --wandb-project-name cart_pole_checkpoint_and_resume \
    --checkpoint-frequency=20 \
    --resume

wandb: Currently logged in as: chrisatumd. Use `wandb login --relogin` to force relogin
wandb: wandb version 0.13.6 is available!  To upgrade, please run:
wandb:  $ pip install wandb --upgrade
wandb: Tracking run with wandb version 0.13.5
wandb: Run data is saved locally in /Users/cdurbin/JHU/CreatingAIEnabledSystems/dockershare/SystemProject/wandb/run-20221210_230058-2keuhvm7
wandb: Run `wandb offline` to turn off syncing.
wandb: Resuming run CartPole-v0__ppo__1__1670731257
wandb: ⭐️ View project at https://wandb.ai/chrisatumd/cart_pole_checkpoint_and_resume
wandb: 🚀 View run at https://wandb.ai/chrisatumd/cart_pole_checkpoint_and_resume/runs/2keuhvm7
  loader = importlib.find_loader(fullname, path)
  deprecation(
Resuming prior run.
Attempting to load file /Users/cdurbin/JHU/CreatingAIEnabledSyste

I ran the above cell twice for a total of 3 runs. The initial run had 10000 global steps, the second run had 20000 global steps, and the third run 20000 steps. You can see there is a bit of an issue with the metrics tracking the global steps, but I verified that it is successfully resuming the run each time using the previously trained model and each run executes exactly the number of steps I requested. By the end of the third run the episode score reaches the 200 max score frequently.

<p float='left'>
    <img src='images/ResumedTrainingRun.png' width='300'/>
    <img src='images/ResumedTrainingRunPerformance.png' width='300'/>
</p>

You can see your own performance in wandb by selecting the cart_pole_checkpoint_and_resume project.

<a id="docker-setup"></a>
## Docker setup

I created a docker container as another way to run the experiments. I also set things up so that I could use a locally hosted Weights and Biases rather than the cloud hosted one used by default. This required some sophisticated setup including setting up a network within docker and giving both the W&B container and my environment container access to that network. In addition my container used docker in docker in order to upload the results to W&B. For the detailed instructions on how to run with the docker container see my [README](https://hub.docker.com/repository/registry-1.docker.io/cdurbin/705.603_chris_durbin/general).

I was happy with this setup at first, but by the end I found enough shortcomings that I would not recommend running this way. I included those details in my [findings](#findings) section. 

<a id="experiment-tracking"></a>
## Experiment tracking

I showed some of what is being tracked with the charts in prior cells, but there is much more being captured. Some of what I am capturing and pushing to Weights and Biases includes:

* Training Metrics for generating learning curves
* The trained model
* The source code that was used in order to perform that training run
* The duration of the run
* Several videos of the agent running in the environment at various points in the training run.
* The command line used to kick off the run
* Every configuration option passed in to kick off the run
* Every hyperparameter setting
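As a rough illustration of how the command line options end up as tracked configuration, the sketch below parses the flags used in this notebook and builds a run name plus a config dictionary. The run names printed in the outputs above (for example `CartPole-v0__CartPoleRandom35__1__1670731203`) suggest the pattern env id, experiment name, seed, Unix timestamp. The default values and the helper names `parse_args` and `build_run` are assumptions for illustration, not the platform's actual code.

```python
import argparse
import time


def parse_args(argv=None):
    """Parse the flags used in this notebook (defaults here are assumptions)."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--exp-name", default="ppo")
    parser.add_argument("--seed", type=int, default=1)
    parser.add_argument("--env-id", default="CartPole-v0")
    parser.add_argument("--total-timesteps", type=int, default=50000)
    parser.add_argument("--track", action="store_true")
    parser.add_argument("--capture-video", action="store_true")
    parser.add_argument("--wandb-project-name", default="my_project")
    parser.add_argument("--ignore-action-rate", type=float, default=0.0)
    parser.add_argument("--checkpoint-frequency", type=int, default=0)
    parser.add_argument("--resume", action="store_true")
    return parser.parse_args(argv)


def build_run(args, now=None):
    """Return the run name and the full configuration to record for the run."""
    timestamp = int(time.time() if now is None else now)
    run_name = f"{args.env_id}__{args.exp_name}__{args.seed}__{timestamp}"
    # vars(args) includes defaults, so options the user did not pass are recorded too.
    config = dict(vars(args))
    return run_name, config


args = parse_args(["--exp-name", "CartPoleRandom35", "--ignore-action-rate", "0.35", "--track"])
run_name, config = build_run(args, now=1670731203)
print(run_name)
print(config)
```

Recording the whole parsed namespace, rather than only the flags that were typed, is what makes a run repeatable later even if a default changes in the code.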

Here are a few screenshots from my locally hosted Weights and Biases server.

<p float='left'>
    <img src='images/GeneralViewWandB.png' width='500'/>
    Overview page
    <img src='images/WBTable.png' width='500'/>
    Metrics and Parameters page
    <img src='images/WBFiles.png' width='500'/>
    Files listing from run
</p>

## Tensorboard

The logging and metrics capture is also compatible with Tensorboard. You can start it with `tensorboard --logdir runs` and then open it locally at http://localhost:6006

Here is an example screenshot after running experiments.
<p float='left'>
    <img src='images/Tensorboard.png' width='500'/>
</p>

<a id="rubric"></a>
## Rubric
Given I could not show everything in a demo I documented how the project addressed each category in the rubric.

## Challenge level

For this PaaS I am putting together several open source libraries to use in tandem. Throughout my career I have found plumbing together multiple outside projects to be one of the more difficult tasks in software development. The projects are also all relatively new and changing rapidly, which further increases the integration effort.

### New technical area
The projects I am using are all relatively new projects.

* When I started my system project, cleanrl had not yet released version 1.0.0 (they released it twenty days ago). The first commit for the project was in 2020.
* WandB has not yet released version 1.0 of their software and their first 0.x release was in late 2019.
* The OpenAI gym is slightly more mature with a release date of April 2016, but is still on the leading edge for environments for reinforcement learning.
* Tensorboard has also been around for some time with its first release in 2017.

I set up the project so that it can run natively on a host, but also have a way to install it using Docker. For the docker setup I needed to use more advanced and challenging features than we needed in the class and more advanced than what I have used in the past. A couple of those features included setting up networks for docker containers to reach one another as well as docker in docker (making docker calls from within a docker container).

## Supporting References
I heavily used open source projects for putting together my platform. Here are links to some of the open source projects I used and referenced their documentation to set everything up.

1. Weights and Biases: [Home page](https://wandb.ai/site), [Documentation](https://docs.wandb.ai/)
2. CleanRL: [Repo](https://github.com/vwxyzjn/cleanrl), [Documentation](https://docs.cleanrl.dev/)
3. OpenAI Gym: [Repo](https://github.com/openai/gym), [Documentation](https://www.gymlibrary.dev/)
4. PyTorch: https://pytorch.org/
5. Tensorboard: [Home page](https://www.tensorflow.org/tensorboard), [Repo](https://github.com/tensorflow/tensorboard)
6. Setting up docker networks to communicate between containers: https://docs.docker.com/engine/reference/commandline/network_create/
7. Setting up to run Docker in a Docker container: https://devopscube.com/run-docker-in-docker/

## Overall Design and Architecture
My system project consists of the following components:
* Python code based on the CleanRL project used to execute reinforcement learning algorithms.
* Scripts and Dockerfile for setting up a docker environment in which to run the algorithms.
* Hosted WandB running as a Docker container (cloud hosted WandB can easily be used as well by creating an account at no cost).
* File system for storing experiment logs, metrics, trained models, and videos.
* Tensorboard to visualize runs.

All of these work together to allow for an easy way to run experiments and visualize results during training as well as comparing results of experiments after running. The experiment tracking allows multiple experiments to be run simultaneously with the number of experiments run in parallel limited only by the underlying hardware resources. In a cloud environment the experiments are infinitely scalable for all intents and purposes (obviously cloud vendor hardware and the user budget are constraints).

## Data Collection and Analysis
Since my project was based on reinforcement learning rather than supervised or unsupervised learning I did not need to collect data. I instead evaluated projects to use as the basis for my reinforcement learning environments and chose to focus on OpenAI's gym because it had a good range of projects that could be used for training reinforcement learning algorithms in reasonable times.

I saved all of the data captured during the experiments in a way that lets users effectively analyze algorithms and compare hyperparameter settings, and even code changes, between experiments.

## Model(s) Selection
As a platform as a service my project does not involve choosing a model. However it can be utilized by end users in a way to directly make their model selection choice. A user can set up experiments to run the same training procedure on as many different models as they like and at the end of the experiments make a choice for their best model (similar to what I showed for my first use case example demo). My project will capture multiple metrics to aid in this choice including the end performance of the algorithm as well as the overall experiment training time which are the two most likely factors in choosing a model.

## Code Design
Most of the Python code used for my project was adapted from the CleanRL project which implemented the various machine learning algorithms. One of the core tenets of the CleanRL project is for the code to *NOT* be modular so that people new to the algorithms can see the entire algorithm in just a single file. I adapted their code to make it more modular. I also added cross-cutting functionality for checkpointing models, resuming previous training runs, and testing out stochastic behavior where chosen actions are replaced by random actions a percentage of the time.

<a id="findings"></a>
## Analysis/Findings
Sorry this section may be a little bit long. I'll start with my going in goals.

### Goals
1. Understand how to use the OpenAI Gym to train against several environments including Atari games.
2. Create an environment to track experiments.
3. Resume training from a stopped or crashed run.

I achieved all three of my goals and so the project was a big win for me personally. I am ready to make use of my project for any reinforcement learning in the future.

### OpenAI Gym
The OpenAI Gym is now supported by another group (the Farama Foundation): https://github.com/Farama-Foundation/Gymnasium. In addition to supporting Atari environments it supports several others. I tested out both simple classic control environments like CartPole as well as some of the Atari games, mostly Pong and Breakout. Trying out other environment types is an obvious next step and would work seamlessly with my platform, with just one more library to install with pip.

### Weights and Biases
Going in I assumed I would need to build something custom to track experiments. When I came across the Weights and Biases project I quickly realized it already did everything I was looking to do. I will definitely use the project in the future.

The one issue I had with the project is that with the self-hosted version I lost my experiments one time, and a second time it seems to have become corrupted to the point I could not recover. I could not find much in the way of bug reports (the project is really new), so at this point I would recommend to only use their cloud hosted version.

### Training times
I was excited to see how quickly I could train the various algorithms in the different environments. I had planned on trying to train against as many Atari games as I could. It turned out that I was only able to test training with Pong and Breakout. I tried a few algorithms and trained in a few different ways, but in each of the games I needed to train for more than 24 hours to get any kind of reasonable performance. My trained model still lost to the computer player in Pong on average at 28 hours, but at least it did win some matches and the average loss was under 4 points as opposed to 21 points at the beginning of training.

#### Mac OS with M1 chip
Unfortunately most of the training libraries I looked at only provide significant performance improvements when using an NVIDIA GPU. For example, when using envpool, Pong can be trained in just 10 minutes! However, envpool is only supported on Linux. I would assume in the coming years more libraries will target adding support for the ML cores on the new Mac chips, but in the meantime I think I need to invest in a Linux desktop with a powerful GPU. The performance difference is too compelling at this point in time.

#### Docker
When training in Docker I found that performance dropped drastically. For both Pong and Breakout I was averaging only 1 sample per second, compared to roughly 200 per second when running on my host. I tried a few configuration changes and settings, but could not improve the performance. I'm not entirely sure whether it is only an issue with Docker Desktop for Mac. I have seen that on Linux you can allocate host hardware resources successfully with Docker. In any case this was the most disappointing finding for me because I do like using Docker when possible rather than installing directly to my host.
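To put the gap between roughly 1 sample per second in Docker and roughly 200 on the host in perspective, here is a small sketch of the throughput arithmetic. The helper names are hypothetical, and the one million step budget is an illustrative figure, not a number from my experiments.

```python
def steps_per_second(global_step, elapsed_seconds):
    """Throughput of a training run: environment steps processed per second."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return global_step / elapsed_seconds


def hours_to_reach(target_steps, sps):
    """Wall-clock hours needed to process `target_steps` at a given throughput."""
    if sps <= 0:
        raise ValueError("sps must be positive")
    return target_steps / sps / 3600


ILLUSTRATIVE_BUDGET = 1_000_000  # assumed step budget, for comparison only

host_hours = hours_to_reach(ILLUSTRATIVE_BUDGET, 200)   # roughly what I saw on my host
docker_hours = hours_to_reach(ILLUSTRATIVE_BUDGET, 1)   # roughly what I saw in Docker
print(f"host: {host_hours:.1f} h, docker: {docker_hours:.1f} h")
```

At those rates the same budget that takes under an hour and a half on the host would take more than eleven days in the container, which is why I stopped using Docker for the longer Atari runs.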

#### Reinforcement Learning Algorithms
I learned that there are several other reinforcement learning algorithms beyond just Q-Learning and Sarsa. The CleanRL documentation was great for discovering the algorithms, and it provides detailed code and explanations of how they work.

## Jupyter Notebook documentation
We are reading it now!

## Github and Docker Repository
The code is all available in Github and in Docker hub. I ensured that the images use the Linux platform rather than the ARM architecture for Mac OS. There are READMEs in both locations to help get started.

<a id="6ds-framework"></a>
## 6Ds Framework

For another overview of the project it is useful to look at it through the 6Ds framework. Note that since my project is a platform, it does not always map directly onto the framework; in some cases it makes more sense to look at how a use case built on the platform would apply it.

### Decomposition
The purpose of the platform is to speed up the process of setting up, performing, and analyzing a series of reinforcement learning experiments. By speeding up the process around experimentation, users will be able to achieve their own goals which might fall in the 'reduce workload', 'speed up process', or 'achieve new insight' realm.

### Domain Expertise
There are a few relevant domains for my project:
* Reinforcement Learning
* Platform as a Service
* Atari games

#### Reinforcement Learning
For the reinforcement learning aspect I learned quite a bit while performing research and experimentation for my research paper. I am still a novice though and so I spent a good bit of time reading through the algorithms implemented by CleanRL.

#### Platform as a Service
I have some experience building out a platform as a service on top of AWS and almost 20 years of software development experience. While I had no experience with many of the tools outside of some Docker experience, I could at least apply related experience to have better insights on what to focus on.

#### Atari games
Atari games are simple and I have experience playing some of them, so I had an advantage by choosing to test out my platform with Atari games as opposed to another realm with which I was not familiar.

### Data
Since I am focused on a platform and on reinforcement learning, my project did not rely on collecting data, so the usual concerns such as getting labeled data, making sure the collection was ethical, and minimizing bias did not apply. The data concerns for my project were all on the output side, where I focused on capturing the data from experiments in the way most useful to users of the platform.

### Design
The platform design is to support experiment training for a broad range of use cases. I provided immediate support for running the CleanRL algorithms against environments in the OpenAI gym, but made it simple to add other environments and other ways to integrate different agents while still being able to track experiments, checkpoint the models, and resume training using Weights and Biases.

### Diagnosis
The most important part of my system project was providing an environment where end users could analyze their experiments. Some of the key components to help diagnose experiments include:
* Visualization of key training information (many charts shown using both WandB and tensorboard)
  - Training performance over time
  - Number of steps and episodes to achieve certain performance
* Visualization of games at various points in training (video display)
* Capture of all hyperparameters for an experiment for repeatability and tweaking for future experiments
* Easily comparing metrics for multiple experiments

### Deployment
I wanted deployment of my platform to be as flexible as possible in order to support many use cases. Some of the key features I focused on for deployment were:
* Run locally on Mac OS
* Run locally on Linux
* Run using Docker containers on either Linux or Mac OS
* Possibility to run in the cloud

Note that I did not test out cloud deployment due to cost concerns as well as not having the time to set things up. I did investigate and found Terraform projects for hosting WandB on each of the major cloud platforms. Each platform also has ways to run Docker containers (ECS and EKS in AWS, Cloud Run in Google Cloud, and Azure Container Instances in Azure). That would be a logical next step for this project.