# Introduction
The aim of this notebook is to study the <b>crawler</b> example of the [ml-agents](https://github.com/Unity-Technologies/ml-agents) repository. The crawler is an agent with a main body and 4 legs, each composed of 2 limbs, that moves on a plane to reach a target.

We will explain how the example works, then train a modified version and compare its results with the unmodified environment using the information that TensorBoard provides. All of the work done here focuses on the `CrawlerDynamicLearning` scene from the Unity project.
## Team Information
### David Pérez Gallego
    Student at ENTI-UB: Interactive digital content
    Email: davidperezgallego@enti.cat  
<img width='150px' align='center' src='img/david.jpg'>

### Eduard Arnau Romeu
    Student at ENTI-UB: Interactive digital content
    Email: eduardarnauromeu@enti.cat    
<img width='150px' align='center' src='img/eduard.jpg'>

# Case analysis
First of all, we will take a look at the crawler as a GameObject and all the parts and scripts involved in it:

<img width='900px' align='center' src='img/crawler-gameobject.png'>

<center><cite>Crawler GameObject in the ml-agents Unity project</cite></center>

As we can see, the crawler is formed by <b>4 legs</b> (each one with its own foreleg) and a <b>body</b>.

### Legs
Each leg part is attached to its parent limb by a `ConfigurableJoint`, which applies an angular constraint on each axis and prevents any change in position between the leg part and the component it is anchored to.

The upper leg part can rotate around the X and Y axes, while the lower part can only rotate around the X axis.

The foreleg is attached to the leg which is attached to the main body.

#### <center> Main Body ←(Joint)← Leg ←(Joint)← Foreleg </center>

Apart from the movement, each leg also has a script called `GroundContact.cs` which checks the collision with the `ground` layer. This script allows us to use the collision with the ground to either punish the agent, set the agent as done or use the collision flag as an observation for the agent.

The <b>forelegs are used as observations</b> when they collide with the ground and <b>the upper parts of the legs are used for punishment</b>, since we don't want the agent to use the upper parts of its legs to move to the target.

### Body
The body also has the `GroundContact.cs` code attached. To prevent the agent from dragging its body while walking, <b>the agent is punished whenever the body collides with the ground.</b>

### Controller
The agent logic lives in a script called `CrawlerAgent.cs`, which handles all the behaviour, while the script `JointDriveController.cs` manages the relevant body parts and stores the information about each of them that is needed for acting and learning.

This last script allows the agent to reset the joints, set their target rotation and their strength in order to achieve the desired behaviour. Each joint has the

The controller overrides certain functions from the `Agent.cs` class. The functionality added by those overrides is the following:

##### InitializeAgent()
This function <b>initializes all the agent body parts</b>.

First, it stores a reference to the `JointDriveController.cs` script attached to the agent and then initializes the agent's body parts. <b>Each body part is stored in a dictionary</b> with the transform of the body part as the key and an instance of the custom class `BodyPart` as the value. 
`BodyPart` is a class which belongs to `JointDriveController.cs`. It contains all the relevant information about the body part and makes it easy to access and modify the `ConfigurableJoint` attached to the GameObject.

##### CollectObservations()
This function <b>collects observations for the agent brain so it can learn</b>.

It starts by observing the agent's current position relative to the target (referenced as `dirToTarget` in the code) and then stores the body orientation (up and forward directions) and its Y position. Remember that the agent is punished whenever the body touches the ground.

The function then proceeds to analyze each body part. For each one, the crawler agent observes whether the part is touching the ground (using the `GroundContact.cs` script explained previously), its velocity and its angular velocity. When the part being observed is not the body (which is the root of the agent), it also stores the position of the part relative to the body, its current rotation (on each axis) and how much strength the joint is applying, as a value between 0 and 1 relative to the maximum force that the joint can apply.
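
The per-part layout described above can be sketched in Python as follows. This is only an illustrative analogue, not the C# code from `CrawlerAgent.cs`: the `Part` class and its field names are hypothetical, and only the number of values per part follows the text.

```python
from dataclasses import dataclass
from typing import List, Tuple

Vec3 = Tuple[float, float, float]


@dataclass
class Part:
    """Hypothetical stand-in for one body part (field names are assumptions)."""
    touching_ground: bool
    velocity: Vec3
    angular_velocity: Vec3
    local_position: Vec3 = (0.0, 0.0, 0.0)  # position relative to the body
    rotation: Vec3 = (0.0, 0.0, 0.0)        # current rotation on each axis
    strength: float = 0.0                   # 0..1, fraction of the maximum joint force


def part_observations(part: Part, is_root: bool) -> List[float]:
    """Flatten one body part into a list of observation values."""
    obs = [1.0 if part.touching_ground else 0.0]
    obs += list(part.velocity)
    obs += list(part.angular_velocity)
    if not is_root:
        # Only parts other than the root body report joint-related values.
        obs += list(part.local_position)
        obs += list(part.rotation)
        obs.append(part.strength)
    return obs


body = Part(False, (0.0, 0.0, 1.0), (0.0, 0.0, 0.0))
leg = Part(True, (0.1, 0.0, 0.9), (0.0, 0.2, 0.0), (0.5, 0.0, 0.5), (0.3, 0.1, 0.0), 0.8)
print(len(part_observations(body, is_root=True)))   # 7
print(len(part_observations(leg, is_root=False)))   # 14
```

The root body contributes 7 values and every other part contributes 14, which matches the per-part counts used later when sizing the observation vector.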

##### AgentAction()
This function <b>checks if the agent has reached the target</b>, <b>updates the joints based on the decision frequency and input action </b> and finally <b>rewards or punishes the agent.</b>

This function starts by checking if any body part has touched the target. In case any is touching it, the agent gets a substantial reward and the target is set to a new random position in the environment.

After checking if the target is reached, the direction to the target (referenced as dirToTarget in the code) is updated.


#### <center>dirToTarget = target.position - body.position</center>
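
As a small worked example of this subtraction, here is a Python sketch with made-up positions. The helper names are hypothetical, and the normalization step is an assumption added for illustration (a unit-length direction is convenient for dot products); it is not claimed to mirror the C# code exactly.

```python
import math
from typing import Tuple

Vec3 = Tuple[float, float, float]


def dir_to_target(target_pos: Vec3, body_pos: Vec3) -> Vec3:
    """Component-wise difference: vector pointing from the body to the target."""
    return tuple(t - b for t, b in zip(target_pos, body_pos))


def normalized(v: Vec3) -> Vec3:
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    length = math.sqrt(sum(c * c for c in v))
    if length == 0.0:
        return (0.0, 0.0, 0.0)
    return tuple(c / length for c in v)


direction = dir_to_target((3.0, 0.0, 4.0), (0.0, 0.0, 0.0))
print(direction)              # vector from body to target
print(normalized(direction))  # same direction with length 1
```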

The agent checks if it has to take a decision in the current step and, if the flag is set to true, it takes action.
The agent proceeds to apply torque on two axes to the upper limbs (X and Y, as stated in the Legs section above) and on one axis to the lower limbs (the X axis). After applying torque, it sets the joint strength for this decision step.

The function now proceeds to reward or punish the agent depending on 3 factors:
    - The agent moving towards the objective.
    - The agent body facing the target.
    - The time taken by the agent to reach the target.
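
The three factors listed above could be combined along these lines. This is a hedged Python sketch rather than the project's C# implementation: using dot products for the first two terms is an assumption, and the weights `move_weight`, `face_weight` and `time_penalty` are illustrative values chosen for this example.

```python
from typing import Tuple

Vec3 = Tuple[float, float, float]


def dot(a: Vec3, b: Vec3) -> float:
    """Dot product of two 3D vectors."""
    return sum(x * y for x, y in zip(a, b))


def step_reward(velocity: Vec3, forward: Vec3, dir_unit: Vec3,
                move_weight: float = 0.03,
                face_weight: float = 0.01,
                time_penalty: float = 0.001) -> float:
    """Reward for one step; dir_unit is the unit vector from body to target."""
    moving_towards = move_weight * dot(velocity, dir_unit)  # speed along the target direction
    facing_target = face_weight * dot(forward, dir_unit)    # 1 when facing it, -1 when facing away
    return moving_towards + facing_target - time_penalty    # small constant cost per step


towards = step_reward((0.0, 0.0, 2.0), (0.0, 0.0, 1.0), (0.0, 0.0, 1.0))
away = step_reward((0.0, 0.0, -2.0), (0.0, 0.0, -1.0), (0.0, 0.0, 1.0))
print(towards > 0, away < 0)  # True True
```

With this shape, an agent that walks towards the target while facing it collects a positive reward on each step, while standing still only accumulates the time penalty.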

Last but not least, the function increments the decision timer, which updates the decision flag and lets the agent know whether it will have to decide in the next step.




# Performance Analysis:
### Critical parameters
#### Gamma:
The gamma parameter is the discount factor for future rewards (how much the agent cares about rewards that lie further in the future).
In this particular problem this factor should be set high (close to 1), since the agent has to coordinate many steps in advance before it gets to the target.
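
To get a feeling for what the discount factor does, the following Python sketch computes how much a reward received some steps in the future is worth today, using the standard definition of discounting (weight = gamma raised to the number of steps). The two gamma values are the ones that appear in this notebook; the helper name is our own.

```python
def discount_weight(gamma: float, steps_ahead: int) -> float:
    """Weight given today to a reward received `steps_ahead` steps in the future."""
    return gamma ** steps_ahead


for gamma in (0.8, 0.995):
    # Show how quickly the weight of a future reward fades for each gamma.
    weights = [discount_weight(gamma, k) for k in (10, 50, 200)]
    print(gamma, ["%.5f" % w for w in weights])
```

With gamma at 0.8 a reward that is 50 steps away is practically invisible to the agent, whereas with 0.995 it still keeps most of its value, which is why a low gamma favours immediate goals such as staying off the ground.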

By setting this value low we get an agent that focuses a lot on staying away from the ground (jumps around really high compared to other trained models) and being well oriented towards the target, but that doesn’t really do a good job at actually getting to that target.
<img width='900px' align='center' src='img/CrawlerLowGammaJump.gif'>
<center><cite>Gamma = 0.8</cite></center>

Also, since the task the agent focuses on is easier than going to the target, it stops improving much earlier than a model with a high gamma.

<img width='650px' align='center' src='img/LowGammaReward.png'>
<center><cite>The learning process peaks at 350,000 steps and then stays almost the same</cite></center>

As a side note, we did see something unique to this model: its ability to walk even on its back. This behavior probably appears only in this model thanks to its jumps away from the ground (which often flip it over) and its focus on looking at the target. Since other models don’t jump that high, we assume they don’t get the chance to learn how to stay off the ground in that position.

#### Max Steps:
This parameter defines the maximum number of simulation steps that are run in a training session.
Since our problem is quite complex in both actions and observations, we need a lot of steps for the agent to start becoming efficient at its task.

In our tests, models needed at least 300,000 steps to get good results. With all the other parameters set to default, we saw that the agent stops improving at around 400,000 steps.

<img width='650px' align='center' src='img/CrawlerDefaultShort.gif'>
<center><cite>100,000 steps model</cite></center>
<img width='650px' align='center' src='img/CrawlerDefaultMedium.gif'>
<center><cite>250,000 steps model</cite></center>
<img width='650px' align='center' src='img/CrawlerDefaultHigh.gif'>
<center><cite>One million steps model</cite></center>

<img width='750px' align='center' src='img/CumulativeReward.png'>
<center><cite>Reward graphic of the default runs</cite></center>


By tweaking some of the parameters (mainly epsilon, which controls the rate of change of the policy), we trained a model that outperformed the default parameters and was still improving at 500,000 steps.
<img width='650px' align='center' src='img/CrawlerEpsilonHigh.gif'>
<center><cite>510,000 steps with modified epsilon</cite></center>
<img width='750px' align='center' src='img/RewardHighEpsilon.png'>

#### Other important settings:
Due to the size and complexity of this problem there are some other really important parameters related to the number of observations and actions:
##### Batch size and buffer size:
The number of experiences used for each gradient descent update and the number collected before each learning step, respectively. With so many observations and continuous actions, we need to set both of them high; otherwise the training wouldn’t be effective, since the agent wouldn’t gather enough experience before each update.
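
The relationship between the two sizes can be made concrete with a short Python sketch. It assumes that the trainer splits the collected buffer into mini-batches of `batch_size` and makes `num_epoch` passes over it, which is a common way of implementing PPO updates, and that the buffer size is a whole multiple of the batch size. The function name is hypothetical; the numbers are the ones used in this notebook's configuration.

```python
def updates_per_buffer(batch_size: int, buffer_size: int, num_epoch: int) -> int:
    """Number of gradient steps performed each time the buffer fills up."""
    if buffer_size % batch_size != 0:
        # This sketch only handles buffers that split into whole mini-batches.
        raise ValueError("buffer_size should be a multiple of batch_size")
    return (buffer_size // batch_size) * num_epoch


print(updates_per_buffer(batch_size=2024, buffer_size=20240, num_epoch=3))  # 30
print(updates_per_buffer(batch_size=4048, buffer_size=40480, num_epoch=3))  # 30
```

Doubling both values keeps the number of gradient steps per learning phase the same, but each step averages over twice as many experiences, which should make every update less noisy.
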
##### Number of layers and hidden units: 
These two parameters define the size of the neural network that the model uses. Since our actions depend on many observations that are related in a complex way, we need to set these two parameters to high values (3 and 512, which are the highest values recommended by the ml-agents documentation).

### Smooth Results:
#### Epsilon:
This value corresponds to the amount of change allowed in the policy between gradient descent updates. The lower this parameter is, the more stable the training becomes (the rewards don’t show peaks and drops caused by changing the policy too much).
The problem is that the lower this value is, the slower the training becomes, since it takes quite a long time at the start to begin getting good results.
Still, we think that in a really long run (over 600,000 steps) this might become a good option, since the stability increases a lot.
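
The effect of epsilon can be illustrated with the clipped surrogate objective from the PPO algorithm. The sketch below is a plain Python version for a single sample and assumes that the trainer uses this standard formulation; `ratio` is the probability of an action under the new policy divided by its probability under the old one, and the function name is our own.

```python
def clipped_objective(ratio: float, advantage: float, epsilon: float) -> float:
    """PPO clipped surrogate objective for a single sample."""
    # Keep the probability ratio inside [1 - epsilon, 1 + epsilon].
    clipped_ratio = max(1.0 - epsilon, min(1.0 + epsilon, ratio))
    # Take the more pessimistic of the clipped and unclipped estimates.
    return min(ratio * advantage, clipped_ratio * advantage)


# A policy change that makes a good action 50% more likely than before:
for epsilon in (0.1, 0.3):
    print(epsilon, clipped_objective(ratio=1.5, advantage=1.0, epsilon=epsilon))
```

With a small epsilon the objective stops rewarding the policy for moving far away from the old one, so each update is more conservative; a larger epsilon lets a single update change the policy much more.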

<img width='900px' align='center' src='img/CrawlerStableRun.gif'>
<center><cite>Agent trained with low epsilon and high batch and buffer sizes</cite></center>

<img width='900px' align='center' src='img/RewardLowEpsilon.png'>
<center><cite>Grey: Default     Orange: Epsilon 0.1 - Batch 4048  - Buffer 40480 </cite></center>




### Train fast
Since this problem is so complex there isn't a really fast way to train it without a lot of steps. 

The best way we found to make the training faster was to increase the epsilon value. Thanks to the big changes in the policy at the start, the agent got better really quickly, but as the training advanced the reward values became more and more noisy, slowing the training process in the late stages.
<img width='900px' align='center' src='img/RewardHighEpsilon.png'>
<center><cite>Grey: Default     Blue: Epsilon 0.3</cite></center>

This line deserves more investigation, for example by tweaking the beta value (to get more entropy so that the agent keeps trying new policies) and maybe increasing the batch size and buffer size to help stabilize the training. 






# New Case Proposal

### The walker

As a new case proposal, we decided to <b>turn the crawler into a walker</b> by removing two legs from the main body and making it stand vertically. <b>We did two main iterations</b>. The first one helped us define the custom rewards and punishments that we had to include to achieve the desired behaviour, while the second one served purely as a test of the new behaviour after a long training period.

<img width='900px' align='center' src='img/modifyed-crawler-0.png'>
<center><cite>New crawler layout</cite></center>

As we can see, it looks as weird as it can get. The constraints on the leg joints have also been modified to fit the needs of the agent's new layout.

<img width='250px' align='center' src='img/modifyed-crawler-1.png'>
<img width='250px' align='center' src='img/modifyed-crawler-2.png'>
<center><cite>The joint constraints now resemble human articulations</cite></center>

When all those modifications were done, we had to adapt the current agent code to fit the new action and observation vector.

### Behaviour

A new script called `CrawlerAgentModifyed.cs` contains the new class that drives the agent. This class is a modification of
`CrawlerAgent.cs` that adapts the observation vector and the action vector and includes 2 new rewards/punishments:
    - The body position does not go above or below the original position by more than a small margin.
    - The up direction of the agent body is aligned with the up direction of the world.
    
The original class has an observation vector of 129 values. Each leg part takes 14 values, the main body takes 7 values and other variables like orientation and position take 10 values, making a total of:

#### <center> 14 values/legPart * 8 legParts + 7 values/body * 1 body + 10 values = 129 values </center>

By removing 2 full legs, we ended up removing 4 leg parts (upper and lower ones), leaving us with:

#### <center> 14 values/legPart * 4 legParts + 7 values/body * 1 body + 10 values = 73 values </center>

We did not add any observation since the ones that were already present after the removal were more than enough.
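
The arithmetic above can be double-checked with a few lines of Python. The constant and function names are our own; the numbers come straight from the text.

```python
VALUES_PER_LEG_PART = 14  # observations taken by each upper or lower leg part
VALUES_FOR_BODY = 7       # observations taken by the main body
OTHER_VALUES = 10         # orientation, position and similar variables
PARTS_PER_LEG = 2         # upper part and lower part


def observation_size(num_legs: int) -> int:
    """Length of the observation vector for a crawler with `num_legs` legs."""
    leg_values = VALUES_PER_LEG_PART * PARTS_PER_LEG * num_legs
    return leg_values + VALUES_FOR_BODY + OTHER_VALUES


print(observation_size(4))  # 129, original crawler
print(observation_size(2))  # 73, two-legged walker
```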

One last thing that we must do is adapt all the actions that the agent can perform. In the original brain, the agent receives an action vector with a length of 20 as the action input. <b>Each upper leg part takes 3 actions</b> (X and Y desired rotation + joint strength) while <b>the lower leg takes 2 actions</b> (X desired rotation + joint strength). Because we are removing 2 full legs, which took 10 values from the action vector in total, <b>the action space is reduced from 20 to 10 values</b>.
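
The same kind of check works for the action vector. Again, the names are our own and only the counts come from the text.

```python
UPPER_LEG_ACTIONS = 3  # X rotation, Y rotation and joint strength
LOWER_LEG_ACTIONS = 2  # X rotation and joint strength


def action_size(num_legs: int) -> int:
    """Length of the action vector for a crawler with `num_legs` legs."""
    return num_legs * (UPPER_LEG_ACTIONS + LOWER_LEG_ACTIONS)


print(action_size(4))  # 20, original crawler
print(action_size(2))  # 10, two-legged walker
```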


When all of those modifications are done, <b>we must set the new brain parameters</b> in order to make it work with the new code.

### Brain

As for the brain, the only modifications relative to the original one are the observation vector and the action vector.

<img width='500px' align='center' src='img/modifyed-crawler-brain.png'>
<center><cite>ModifyedCrawlerDynamicLearning learning brain parameters</cite></center>

### First Training

For the first training, we let the crawler train without any modification to the reward system (as stated before). The training was set to end at <b>1 million iterations</b>. The simulation was run using the <b>configuration file of the dynamic crawler learning</b>. The agent was reset each time the crawler's head touched the ground, but the target was not. We ended up with a kangaroo-wannabe crawler that jumped really well but didn't quite stick the landings.

<img width='900px' align='center' src='img/first-training.gif'>
<center><cite>Crawler derping around</cite></center>

We can now analyze the training using TensorBoard to see how the agent performed.

<img width='700px' align='center' src='img/first-training-graph-0.png'>
<img width='700px' align='center' src='img/first-training-graph-1.png'>

First of all, these are the modified parameters of the YAML configuration file:

    normalize: true
    num_epoch: 3
    time_horizon: 1000
    batch_size: 2024
    buffer_size: 20240
    gamma: 0.995
    max_steps: 3e6
    summary_freq: 3000
    num_layers: 3
    hidden_units: 512
    
The two most obvious changes are the ones in the first two graphs. <b>The agent definitely did learn</b>, since the reward got higher with each iteration, and <b>the simulation lasted longer because the agent learned to balance itself</b>.

Moving on to the second row of graphs, we can clearly see how the policy loss decreases as <b>the agent finds a suitable way to move</b> and starts to refine that way of moving. The value loss increases while the agent explores strategies but then also decreases when the agent starts to focus on one moveset.

At this point we've all noticed the weird stuff that happens in the first 150k iterations. Our interpretation is that the agent found a local maximum of the function that it was optimizing and moved on since it wasn't the global maximum (the best solution to the problem).

In the last graphs we can see how the agent explores less and settles on its behaviour as it keeps learning, and how it estimates a higher reward as the reward curve (first graph in the first row) starts to grow steadily.



### Second Training

After learning from the mistakes we made in the first training, we decided to include the following variations:
    - Custom agent rewards (stated in the previous Behaviour header)
    - The target also resets when the agent fails.
    
This training was set to <b>3 million iterations</b> which took a whole night of calculations in order to complete. It was also run on the same configuration file as the previous one.

<img width='900px' align='center' src='img/night-training.gif'>
<center><cite>It ended up walking sideways</cite></center>

After the changes, the TensorBoard graphs showed us how much the agent struggled to learn a proper strategy.

<img width='700px' align='center' src='img/night-training-graph-0.png'>
<img width='700px' align='center' src='img/night-training-graph-1.png'>

As we can see, <b>the reward graph is an absolute mess but it keeps growing between the spikes</b>. Note that the graph is smoothed by 92% so it can be read properly. We can also see that the time it took the agent to finish a simulation varied hugely throughout the 3 million steps. The agent tried lots of ways to balance itself, which is why those graphs don't have a point at which they start to stabilize.
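
To show what this kind of smoothing does to a noisy curve, here is a minimal Python sketch of an exponential moving average with a weight of 0.92. It is an assumption that this matches what the TensorBoard slider does; the real implementation may differ in details (for example in how it treats the first points), and the sample data is made up.

```python
from typing import List


def smooth(values: List[float], weight: float = 0.92) -> List[float]:
    """Exponential moving average: each point keeps `weight` of the previous smoothed value."""
    if not values:
        return []
    smoothed = []
    last = values[0]
    for value in values:
        last = weight * last + (1.0 - weight) * value
        smoothed.append(last)
    return smoothed


noisy = [0.0, 10.0, 0.0, 10.0, 0.0, 10.0]
print(["%.2f" % v for v in smooth(noisy)])
```

The spikes of the raw series are strongly damped, which makes the overall trend readable but also hides how large the swings between consecutive points really are.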

In the second row we can see a result similar to the first training. The agent found a way of walking that suited it and started mastering it, as in the previous training. <b>This time, though, that choice didn't lead to stable values throughout the run</b>.

The last graphs also show how the agent starts by exploring and experimenting but ends up sticking with the walking technique that fits it best as the entropy and learning rate decrease. As those values decrease, so does the estimated value, which starts to flatten.

### Conclusion

<b>The modifications we made to the agent were heavy</b>, leaving us with an agent that needs a lot of iterations and some constraints beyond the joint movement restrictions in order to work correctly. We're proud that we've made it jump and walk (even if sideways) and reach the target sometimes.

We are aware that there are solvers which are way more efficient for the walker situation. Our testing did reinforce the common knowledge that robots have a hard time balancing themselves when moving on two legs.





