# Understanding the PushBlock case
___
## 1. Introduction
In this notebook we analyse the PushBlock example from the ml-agents repository and measure how its performance changes when several training parameters are modified. At the end we look at a slightly modified version of the PushBlock case and compare it with the original.

The team responsible for this work is formed by:  

<img src="adria_foto.jpg" style="width: 200px; margin: 10px 0px;  border: 1px solid black;"> 
Adrià Ortiz Navarro  
veric00@gmail.com  
<img src="ferran_foto.jpg" style="width: 200px; margin: 10px 0px;  border: 1px solid black;"> 
Ferran Illa Capellas  
ferran.illa26@gmail.com




## 2. Case analysis
The PushBlock example consists of a small block (the Agent) that has to push a bigger block (the Target) to the goal zone. This takes place on a fairly small square platform surrounded by walls, so neither the Agent nor the Target can fall off.

<img src="map1.jpg" style="width: 700px;">



By default the **rewards** given to the Agent in the PushBlock case are the following:

- +5 for reaching the Goal  
This is the main reward, given when the Agent completes its task.

- -(1/`maxSteps`) for every frame  
This negative reward encourages the Agent to complete its task as fast as possible, because the total reward decreases every frame. A negative reward like this will be necessary in any variation of the PushBlock case.
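
As a rough illustration of how these two rewards combine, the following sketch computes the total reward of a successful episode. It is a standalone calculation, not code from the example; the function name and the `max_steps` value used in the usage note are assumptions.

```python
def episode_return(steps_taken: int, max_steps: int, goal_reward: float = 5.0) -> float:
    """Total reward of an episode in which the goal is reached after `steps_taken` steps.

    Each step costs 1/max_steps, and reaching the goal gives `goal_reward`.
    """
    time_penalty = steps_taken * (1.0 / max_steps)
    return goal_reward - time_penalty


def timeout_return(max_steps: int) -> float:
    """Total reward of an episode that uses all its steps without reaching the goal."""
    return -max_steps * (1.0 / max_steps)
```

With a hypothetical `max_steps` of 5000, a fast episode of 500 steps ends with a total close to 4.9, a slow one of 4000 steps with about 4.2, and an episode that never reaches the goal with about -1.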

In order for the brain to calculate the next Agent action, it needs to receive the **state** of the Agent inside the world. The information that the brain needs in the PushBlock example is the following:

- Distance to Target
- Distance to Goal
- Distance to Walls

It uses raycasts from the Agent to look for the objects mentioned above.
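```csharp
// Maximum length of each ray and the angles (in degrees) at which the rays are cast.
var rayDistance = 12f;
float[] rayAngles = { 0f, 45f, 90f, 135f, 180f, 110f, 70f };
// Tags of the objects that the rays can detect.
var detectableObjects = new[] { "block", "goal", "wall" };
// Two sets of rays; the last two arguments are the start and end height offsets.
AddVectorObs(rayPer.Perceive(rayDistance, rayAngles, detectableObjects, 0f, 0f));
AddVectorObs(rayPer.Perceive(rayDistance, rayAngles, detectableObjects, 1.5f, 0f));
```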
For each ray, the `Perceive` function stores categorical information about the detected object together with its distance. This information is sent to the brain through the `AddVectorObs` method.
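
To make the idea of "categorical information plus distance" more concrete, here is a minimal sketch of one possible encoding. It assumes that each ray produces a one-hot entry per detectable tag, one "nothing hit" flag and one normalised distance; the exact layout used inside `rayPer.Perceive` may differ, and `encode_ray` is a hypothetical helper.

```python
from typing import List, Optional

DETECTABLE = ["block", "goal", "wall"]
RAY_ANGLES = [0.0, 45.0, 90.0, 135.0, 180.0, 110.0, 70.0]
RAY_DISTANCE = 12.0


def encode_ray(hit_tag: Optional[str], hit_distance: float = 0.0) -> List[float]:
    """Encode one ray as [one-hot tag..., miss flag, normalised distance] (assumed layout)."""
    sub = [0.0] * (len(DETECTABLE) + 2)
    if hit_tag is None:
        # Nothing was hit: only the miss flag is set.
        sub[len(DETECTABLE)] = 1.0
    else:
        # Mark the detected tag and store the distance as a fraction of the ray length.
        sub[DETECTABLE.index(hit_tag)] = 1.0
        sub[-1] = hit_distance / RAY_DISTANCE
    return sub


# Two Perceive calls, each casting one ray per angle.
OBSERVATION_SIZE = 2 * len(RAY_ANGLES) * (len(DETECTABLE) + 2)
```

Under this assumed layout, a ray that hits the goal at half its length becomes `[0, 1, 0, 0, 0.5]`, and the two `Perceive` calls together produce 70 values.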

<img src="raycasts.png" style="width: 700px;"> 
&nbsp;



Another important aspect of the process is the **actions**, which are the decisions of the brain. In the PushBlock case we have a discrete action space with only 6 possible actions. These correspond to 4 directions of movement (forward, back, right and left) and 2 directions of rotation (right and left).

Depending on the action received, the corresponding value is stored in the `rotateDir` and `dirToGo` variables, which are then used to move the Agent in the scene.

```csharp
// Rotate the Agent around the chosen axis, then push it in the chosen direction.
transform.Rotate(rotateDir, Time.fixedDeltaTime * 200f);
agentRB.AddForce(dirToGo * academy.agentRunSpeed, ForceMode.VelocityChange);
```
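
The mapping from a discrete action to the two direction vectors could look like the sketch below. The action numbering and the unit vectors are assumptions made for illustration; the Agent script may order the actions differently or scale some directions.

```python
from typing import Dict, Tuple

Vec3 = Tuple[float, float, float]

ZERO: Vec3 = (0.0, 0.0, 0.0)
FORWARD: Vec3 = (0.0, 0.0, 1.0)
RIGHT: Vec3 = (1.0, 0.0, 0.0)
UP: Vec3 = (0.0, 1.0, 0.0)


def scale(v: Vec3, s: float) -> Vec3:
    """Multiply a vector by a scalar."""
    return (v[0] * s, v[1] * s, v[2] * s)


# Hypothetical numbering: action index -> (dir_to_go, rotate_dir).
ACTION_TABLE: Dict[int, Tuple[Vec3, Vec3]] = {
    0: (FORWARD, ZERO),               # move forward
    1: (scale(FORWARD, -1.0), ZERO),  # move back
    2: (RIGHT, ZERO),                 # move right
    3: (scale(RIGHT, -1.0), ZERO),    # move left
    4: (ZERO, UP),                    # rotate right (around the vertical axis)
    5: (ZERO, scale(UP, -1.0)),       # rotate left
}


def decode_action(action: int) -> Tuple[Vec3, Vec3]:
    """Return (dir_to_go, rotate_dir) for one of the 6 discrete actions."""
    if action not in ACTION_TABLE:
        raise ValueError(f"unknown action {action}")
    return ACTION_TABLE[action]
```

Each action sets only one of the two vectors, so the Agent either moves or rotates in a given step.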
<br><br>
To summarize, the learning process goes like this:

The Agent starts by taking random actions and at some point accidentally pushes the Target into the Goal zone, so it gets a high reward. After that, whenever the Agent is in a state similar to one that led to a high reward, it tends to take similar actions. As this situation repeats over time, the policy takes shape. The Agent also slowly learns to complete its task faster because of the per-frame negative reward described above: if it pushes the Target to the Goal quickly, the total reward is higher, so it will repeat those actions in the future.

## 3. Performance analysis

In this section we analyse the performance results that we obtained after testing different configuration parameters. These parameters are found in the `config/trainer_config.yaml` file.

Successfully training a Reinforcement Learning model often involves tuning the training hyperparameters. This [guide](https://github.com/Unity-Technologies/ml-agents/blob/master/docs/Training-PPO.md) explains what every parameter does and gives some best practices for tuning the training process.

The PushBlock example already performs well with the default parameters, so improving it would be very hard. Here we mainly show some parameter changes and analyse how they affect the performance. The parameters that we changed are:


-	`epsilon` - default 0.2 | test 0.5  
Corresponds to the acceptable threshold of divergence between the old and new policies during gradient descent updating. A small value results in more stable updates, but also slows the training process. As the PushBlock case is fairly simple, we increased it to try to make the training faster.
-	`beta` - default 1.0e-2 | test 1.0e-1  
Controls the randomness of the policy, which ensures that agents properly explore the action space during training. Increasing it makes the Agent take more random actions. We increased it so that the Agent finds the best solution and not just one that works, even if this slows the training down a bit.
-	`hidden_units` - default 256 | test 512  
Corresponds to the number of units in each fully connected layer of the neural network. This value should grow with the complexity of the case. As the PushBlock case is not very complex, we increased it to check that more units is not always better.
-	`learning_rate` - default 3.0e-4 | test 3.0e-5  
Corresponds to the strength of each gradient descent update step. It should typically be decreased if training is unstable and the reward does not consistently increase. We decreased it to see whether the training becomes more stable or the reward increases faster.
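
To give an intuition for what `epsilon` does, the sketch below implements the standard PPO clipped surrogate objective for a single sample. It is a simplified illustration, not the ml-agents trainer code; the function name and the sample values are assumptions.

```python
def clipped_surrogate(ratio: float, advantage: float, epsilon: float) -> float:
    """PPO clipped objective for one sample.

    `ratio` is new_policy_probability / old_policy_probability for the action taken.
    The ratio is clipped to [1 - epsilon, 1 + epsilon], and the more pessimistic
    of the clipped and unclipped terms is kept.
    """
    clipped_ratio = max(1.0 - epsilon, min(ratio, 1.0 + epsilon))
    return min(ratio * advantage, clipped_ratio * advantage)
```

For a sample with `ratio = 1.4` and `advantage = 1.0`, the default `epsilon = 0.2` caps the objective at 1.2, while the test value `epsilon = 0.5` leaves it at 1.4. A larger `epsilon` therefore lets each update move the policy further from the old one, which can speed up training but also makes it less stable.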

Let's jump into the results. We changed the parameters one by one, accumulating the changes, to have more results to compare. Every training run is represented by a different colour in the graphs below. The first run uses the default parameters. In the second we added the epsilon change, in the third we added the beta change, and so on until the fifth run, which has all four parameters changed. The colour codes are:

- Default | Dark blue
- epsilon | Red
- beta | Grey
- hidden_units | Orange
- learning_rate | Light Blue
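
The cumulative scheme of the five runs listed above can be sketched as follows. The parameter values are the ones given in this section; the `build_runs` helper and the dictionary layout are assumptions for illustration and do not reproduce the format of `trainer_config.yaml`.

```python
from typing import Dict, List, Tuple

DEFAULTS: Dict[str, float] = {
    "epsilon": 0.2,
    "beta": 1.0e-2,
    "hidden_units": 256,
    "learning_rate": 3.0e-4,
}

# Changes applied one after another; each run keeps all the previous changes.
CHANGES: List[Tuple[str, float]] = [
    ("epsilon", 0.5),
    ("beta", 1.0e-1),
    ("hidden_units", 512),
    ("learning_rate", 3.0e-5),
]


def build_runs(defaults: Dict[str, float],
               changes: List[Tuple[str, float]]) -> List[Tuple[str, Dict[str, float]]]:
    """Return (run name, parameters) pairs: the default run plus one run per added change."""
    runs = [("default", dict(defaults))]
    current = dict(defaults)
    for name, value in changes:
        current[name] = value
        runs.append((name, dict(current)))
    return runs
```

Here the run named `hidden_units`, for example, also carries the changed `epsilon` and `beta` values, which matters when reading the graphs: a curve shows the combined effect of all changes up to that run, not the effect of a single parameter.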

![param_cumulativeReward](analysis_param_cumulativeReward.jpg)

The cumulative reward should increase during a successful training session. Here we see clearly that the default training has the best configuration. With the epsilon change alone, the result already becomes unstable.

![param_episode](analysis_param_episodeLenght.jpg)

This graph should decrease slowly. It does so with the default training and the hidden_units training. This suggests that the hidden_units change affects how quickly the Agent reaches the Goal, because that run recovers from the bad results of the previous changes.

![param_valueLoss](analysis_param_valueLoss.jpg)

The value loss should increase while the Agent is learning and then decrease once the reward stabilises. Again, the default and the hidden_units trainings are the ones that perform best. The learning_rate training also increases its value loss, but it takes more time. The other trainings do not behave as expected.

## 4. New case proposal

Our proposal consists of adding obstacles to the scene so that it is harder for the Agent to reach the Goal and, additionally, changing the reward system to try to improve the training efficiency. For this exercise, we did 2 things:

1. We created 2 new maps, for a total of 3. Each map has more obstacles than the previous one.
2. We trained a total of 4 different brains, each one with different rewards.

This gives a total of 12 different results.

### Brains with different rewards

Default brain  
- The Target arrives at the goal = +5.0
- Every frame that the Target isn’t in the goal = (-1f / agentParameters.maxStep)

BigReward  
- The Target arrives at the goal = +10.0 (a bigger reward for completing the task)
- Every frame that the Target isn’t in the goal = (-1f / agentParameters.maxStep)

BigTime  
- The Target arrives at the goal = +5.0
- Every frame that the Target isn’t in the goal = (-2f / agentParameters.maxStep) (the slower the Agent is, the less reward it gets)

OurLearning  
- The Target arrives at the goal = +5.0
- Every frame that the Target isn’t in the goal = (-1f / agentParameters.maxStep)
- If the Agent is within a distance of 3.0f of the Target, every frame = +0.001f
- If the Target is within a distance of 1.0f of a wall, every frame = -0.01f
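
The four reward schemes above can be compared side by side with a small standalone sketch. The reward values come from the lists above; the `RewardScheme` class, the `step_reward` function and its arguments are hypothetical names, and the real logic lives in the C# Agent script.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RewardScheme:
    goal_reward: float = 5.0          # given when the Target arrives at the goal
    time_penalty_scale: float = 1.0   # per-frame penalty is -scale / max_step
    near_target_bonus: float = 0.0    # per-frame bonus when the Agent is close to the Target
    near_wall_penalty: float = 0.0    # per-frame penalty when the Target is close to a wall


SCHEMES = {
    "Default": RewardScheme(),
    "BigReward": RewardScheme(goal_reward=10.0),
    "BigTime": RewardScheme(time_penalty_scale=2.0),
    "OurLearning": RewardScheme(near_target_bonus=0.001, near_wall_penalty=0.01),
}


def step_reward(scheme: RewardScheme, max_step: int, agent_to_target: float,
                target_to_wall: float, target_in_goal: bool) -> float:
    """Reward for a single frame under the given scheme."""
    if target_in_goal:
        return scheme.goal_reward
    reward = -scheme.time_penalty_scale / max_step
    if agent_to_target <= 3.0:
        reward += scheme.near_target_bonus
    if target_to_wall <= 1.0:
        reward -= scheme.near_wall_penalty
    return reward
```

Note that with a hypothetical `max_step` of 1000, the OurLearning bonus of +0.001 exactly cancels the time penalty while the Agent stays close to the Target, so the relative size of the shaping rewards depends on the value of `maxStep`.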


### Results

We tested these brains in 3 different maps.

#### Map 1 (default scene)

<img src="map1.jpg" style="width: 700px;"> 

Colour code:
- Default | Pink
- BigReward | Green
- BigTime | Grey 
- OurLearning | Orange

![img](analysis_reward_cumulativeReward.jpg)
The best is BigReward and the worst is Default.
<br><br>
![img](analysis_reward_episodeLenght.jpg)
The best is OurLearning and the worst is Default.
<br><br>
![img](analysis_reward_vl.jpg)
The worst by far is BigReward and the best is OurLearning.

#### Map 2

<img src="map2.jpg" style="width: 700px;"> 

Colour code:
- Default | Dark Blue
- BigReward | Red
- BigTime | Light Blue 
- OurLearning | Pink

![img](analysis_reward_cumulativeReward2.jpg)
The best is BigReward and the worst is BigTime.
<br><br>
![img](analysis_reward_episodeLenght2.jpg)
The best is Default and the worst is BigTime.
<br><br>
![img](analysis_reward_vl2.jpg)
The best is BigTime and the worst by far is BigReward.

#### Map 3

<img src="newCase_ss.png" style="width: 700px;"> 

Colour code:
- Default | Green
- BigReward | Grey
- BigTime | Orange 
- OurLearning | Blue

![img](analysis_reward_cumulativeReward3.jpg)
The best is BigReward and the worst is BigTime.
<br><br>
![img](analysis_reward_episodeLenght3.jpg)
The best is BigReward and the worst is OurLearning.
<br><br>
![img](analysis_reward_vl3.jpg)
The best is OurLearning and the worst by far is BigReward.

#### Conclusion

When we ran each trained brain in the maps, the one that seemed to work best was **OurLearning**.

### How to create and train this new case

#### Modify the scene

1. Duplicate the scene
2. Add some walls in the middle of the scene (you can duplicate an existing wall). Do not forget to tag them as "wall".

#### Modify the Agent script

You will have to add these rewards in the Agent script:
- If the Agent is within a distance of 3.0f of the Target, every frame = +0.001f
- If the Target is within a distance of 1.0f of a wall, every frame = -0.01f

#### Brains and start training

1. First, duplicate a brain, rename it if needed, and set its model to None.
2. Assign the brain to the Academy and check the Control box.
3. Assign the same brain to the Agent GameObject, in the Brain field of the PushAgentBasic component.
4. In the Anaconda Prompt, type `activate ml-agents` and press Enter.
5. In the Anaconda Prompt, change to the ml-agents folder.
6. Once you are inside that folder, type `mlagents-learn config/trainer_config.yaml --run-id="NameOfTheLearning" --train`.
7. Press Enter, wait a few seconds, and then press the Play button. The Agent will then start training.

If you prefer to train from an executable instead of from the editor, replace steps 6 and 7 with:

6. Build the project as an executable (.exe).
7. Once you are inside the ml-agents folder, type `mlagents-learn config/trainer_config.yaml --env=FolderOfTheExecutable/Executable --run-id="NameOfTheLearning" --train` and press Enter.
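
If you launch many training runs, the two command variants above can be assembled by a small helper such as the following sketch. The `build_train_command` function is hypothetical; it only uses the arguments that appear in the steps above (`--env`, `--run-id` and `--train`).

```python
from typing import List, Optional


def build_train_command(run_id: str,
                        config: str = "config/trainer_config.yaml",
                        env: Optional[str] = None) -> List[str]:
    """Build the mlagents-learn argument list for editor or executable training."""
    cmd = ["mlagents-learn", config]
    if env is not None:
        # Train against a built executable instead of the editor.
        cmd.append(f"--env={env}")
    cmd += [f"--run-id={run_id}", "--train"]
    return cmd
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` from inside the ml-agents folder, with the `ml-agents` environment activated.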


## 5. Imitation Learning

We also made the Agent learn with an Imitation Learning (IL) algorithm. We trained an IL brain for 50,000 steps on Map 3.

Colour code:
- Default | Blue
- IL | Red

![img](analysis_IL_cumulativeReward.jpg)
The Default brain is a bit better.
<br><br>
![img](analysis_IL_episodeLenght.jpg)
The Default is a bit better than the IL.
<br><br>
![img](analysis_IL_vl.jpg)
The IL training is better than the Default.

#### Conclusion

The IL brain does not perform as well as the Default one, but we consider this positive: an agent that imitates a human is expected to fail occasionally, which makes its behaviour seem more human.

