# Project 2: Continuous Control

This project served as an opportunity to explore reinforcement learning frameworks. The framework used to solve the problem is [stable baselines 3](https://stable-baselines3.readthedocs.io/en/master/), which is implemented in PyTorch. The problem was solved using A2C, PPO and TD3 on the 20-agent environment. Below one can find a description of each algorithm with references to the codebase, notes and graphs from the experiments, and finally a comparison of the effectiveness of the three algorithms on the problem at hand.

## Usage of stable baselines 3

### About the library
The library has an interesting story. In an attempt to standardise implementations and make recent reinforcement learning algorithms usable out of the box for benchmarking problems, OpenAI created the library [baselines](https://github.com/openai/baselines). Though the project was successful, it still lacked consistency and ease of use. This led some users to fork the library and create a friendlier version with cleaner code, [stable baselines](https://stable-baselines.readthedocs.io/en/master). That fork is based on TensorFlow, but recently a new version was introduced, [stable baselines 3](https://stable-baselines3.readthedocs.io/en/master/), which uses PyTorch. The library currently implements A2C, PPO and TD3 and lets the user set different hyperparameters and define the architectures of the policy and value networks.
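
As a rough illustration of how hyperparameters and network architectures are passed to the library, here is a minimal sketch. The values are the documented PPO defaults of stable baselines 3, used as placeholders rather than the settings tuned for this project, and `build_ppo` is a hypothetical helper, not part of the codebase.

```python
# Placeholder hyperparameters: these are the documented PPO defaults in
# stable baselines 3, not the values tuned for this project.
PPO_KWARGS = {
    "learning_rate": 3e-4,
    "n_steps": 2048,   # steps collected per environment before each update
    "batch_size": 64,  # minibatch size for each gradient step
    "n_epochs": 10,    # passes over each collected rollout
    "gamma": 0.99,     # discount factor
    # Hidden layer sizes. How this list is split between the policy and
    # value networks depends on the installed SB3 version.
    "policy_kwargs": {"net_arch": [64, 64]},
}


def build_ppo(env, **overrides):
    """Hypothetical helper: create a PPO model, overriding selected defaults."""
    # Imported lazily so the configuration can be inspected without SB3 installed.
    from stable_baselines3 import PPO

    kwargs = {**PPO_KWARGS, **overrides}
    return PPO("MlpPolicy", env, **kwargs)
```

Given a compatible environment `env`, something like `build_ppo(env, learning_rate=1e-4)` followed by `model.learn(total_timesteps=...)` would then start training with a different learning rate.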

Several other libraries have a similar structure. The ones considered were [reagent](https://github.com/facebookresearch/ReAgent), [dopamine](https://github.com/google/dopamine) and [kerasrl](https://keras-rl.readthedocs.io/en/latest/). Aside from those, one can find several implementations of more recent or specialized algorithms via [paperswithcode](https://paperswithcode.com/area/playing-games).

### Wrapping the Unity environments
One of the steps needed in order to use the library was to wrap the Unity environments in an environment class compatible with it. One can find the wrappers' code in `reacher/unity_environment_wrappers.py`. Two wrappers were built:
- One for the single agent environment. This was straightforward, as one only needed to translate the Unity environment to an OpenAI gym environment.
- One for the multi-agent environment. Unfortunately, stable baselines only supports multiple environments with a single agent each, not one environment with multiple agents. Nevertheless, one can define a customized environment inheriting from `VecEnv` and handle the actions of the multiple agents as if they came from different environments.
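
The idea behind the multi-agent wrapper can be sketched as follows. This is not the code in `reacher/unity_environment_wrappers.py`: `FakeMultiAgentEnv` is a hypothetical stand-in for the Unity environment, the observation and action sizes (33 and 4) are assumptions, and `AgentsAsEnvs` only mirrors the `step_async` / `step_wait` split of `VecEnv` without inheriting from it, so that the sketch runs with NumPy alone.

```python
import numpy as np


class FakeMultiAgentEnv:
    """Hypothetical stand-in for a multi-agent Unity environment."""

    def __init__(self, n_agents=20, obs_dim=33, act_dim=4, episode_len=1000, seed=0):
        self.n_agents, self.obs_dim, self.act_dim = n_agents, obs_dim, act_dim
        self.episode_len = episode_len
        self.rng = np.random.default_rng(seed)
        self.t = 0

    def _observe(self):
        # Random observations, one row per agent.
        return self.rng.standard_normal((self.n_agents, self.obs_dim)).astype(np.float32)

    def reset(self):
        self.t = 0
        return self._observe()

    def step(self, actions):
        assert actions.shape == (self.n_agents, self.act_dim)
        self.t += 1
        rewards = np.zeros(self.n_agents, dtype=np.float32)
        # All agents finish their episode at the same time step.
        dones = np.full(self.n_agents, self.t >= self.episode_len)
        return self._observe(), rewards, dones


class AgentsAsEnvs:
    """Treats each agent as if it were a separate single-agent environment."""

    def __init__(self, env):
        self.env = env
        self.num_envs = env.n_agents
        self._actions = None

    def reset(self):
        return self.env.reset()

    def step_async(self, actions):
        # Store one action row per agent until step_wait is called.
        self._actions = np.asarray(actions, dtype=np.float32)

    def step_wait(self):
        obs, rewards, dones = self.env.step(self._actions)
        infos = [{} for _ in range(self.num_envs)]
        if dones.any():
            # Vectorized environments reset automatically: keep the last
            # observation in the info dict and return the first one of the
            # new episode.
            for i in range(self.num_envs):
                infos[i]["terminal_observation"] = obs[i]
            obs = self.env.reset()
        return obs, rewards, dones, infos

    def step(self, actions):
        self.step_async(actions)
        return self.step_wait()
```

A real wrapper would additionally declare the observation and action spaces and implement the remaining abstract methods of `VecEnv`.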

## Learning algorithms
Below is a description of A2C, PPO and TD3, with an attempt to connect each algorithm to its implementation in the codebase.

### Rollout and prioritized replay


### A2C

### PPO

### TD3

## Experiments and solution of the environment

### PPO trained on the single agent environment
- Setting the network architecture
- Picking the right learning rate
- Results

### PPO trained on the multi-agent environment
- Big speed-up
- More robust on learning rate changes
- Results

### A2C
- Also trained on the multi-agent environment
- Sometimes converges really fast
- Not as consistent in the results. Sometimes it converges, sometimes not.
- Results

### TD3
-
-
-

## Ideas for the future

### Learning faster
The solution to the problem is extremely close to the optimal score, 40, so there is not a lot to improve regarding performance. On the other hand, one could explore options to make training converge faster. Here are some ideas in this direction:
- *Tune the size of the rollout and the number of epochs.* Up to now, the default values were used: 2048 environment steps per rollout and 10 epochs of training. One could do an ablation study on both parameters and check how shortening the 'collect data -> train' cycle affects the speed of convergence.
- *Mixed precision training.* This is a generic low-level improvement. In many use cases, especially in computer vision, one can successfully reduce the float precision to 16 bits in specific parts of the trained networks and achieve a speed-up in both training and inference with minimal drops in performance. It would be interesting to see whether this works equally well in the reinforcement learning setting.
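
To get a feeling for the first idea, one can work out how large a 'collect data -> train' cycle is for different rollout sizes. The sketch below assumes that the 2048 steps are counted per agent (as the `n_steps` parameter of PPO in stable baselines 3 is counted per environment), that all 20 agents contribute to the rollout, and that the default minibatch size of 64 is used; `ppo_cycle` is a hypothetical helper.

```python
def ppo_cycle(n_steps, n_envs, n_epochs, batch_size):
    """Return (transitions per rollout, gradient updates per rollout)."""
    transitions = n_steps * n_envs
    minibatches_per_epoch = transitions // batch_size
    return transitions, n_epochs * minibatches_per_epoch


# Compare the default rollout size with two shorter candidates.
for n_steps in (2048, 512, 128):
    transitions, updates = ppo_cycle(n_steps, n_envs=20, n_epochs=10, batch_size=64)
    print(f"n_steps={n_steps}: {transitions} transitions, {updates} updates per cycle")
```

Under these assumptions the default setting collects 40960 transitions and performs 6400 gradient updates before new data is gathered, so shortening the rollout lets the policy be refreshed far more often.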