# Motion Planning with Deep Reinforcement Learning

### Introduction
<br>
This project builds a motion planner that helps a robot navigate through an environment to reach a goal. It uses Deep Q-learning, a reinforcement learning algorithm in which a deep neural network approximates the Q value of each state-action pair, so that the robot learns a policy that picks the best action in each state in order to maximize the long-term reward.
Classical robot motion planning methods assume that the environment is known: the locations of the obstacles, the location of the goal, and the physics of the environment. In contrast, the reinforcement learning algorithm assumes nothing about the environment, so the environment could even have physics different from those of our universe and the robot would still be able to learn a suitable policy.

### Environment
<br>
The environment has a continuous state space, but to keep the problem simple and the model easy to train, the actions are discretized: each action is a direction in which the robot moves, and the robot covers a fixed distance in one step. A spherical robot learns how to navigate in the environment to find the goal. Non-holonomic constraints are not considered here, but the same approach should work for them as well, since the agent learns about the environment through interaction and develops a policy that suggests which action to take in each state. For example, a car-like robot that needs to reach a point on its left cannot move sideways because of its non-holonomic constraint. In the initial phase it would try different actions, such as a left turn, which under the environment dynamics has some turning radius. The robot would eventually learn how early it needs to start the turn, given its velocity, to reach the point on its left.

<img src = "https://raw.githubusercontent.com/imishra/MotionPlanningWithDeepQLearning/master/env.png" width ="500"></img>
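
The action discretization described above can be sketched as a single step function. This is only an illustration: the number of headings, the step length, the circular obstacle representation, and all reward values are assumptions, not the values used in this project.

```python
import math

N_ACTIONS = 8        # assumed number of discrete headings
STEP_SIZE = 0.5      # assumed fixed distance covered per step
GOAL_RADIUS = 0.5    # assumed distance at which the goal counts as reached


def step(position, action, goal, obstacles, robot_radius=0.25):
    """Move the robot one fixed-length step along the chosen heading.

    position, goal: (x, y) tuples in continuous coordinates.
    obstacles: list of (x, y, radius) circles (a hypothetical representation).
    Returns (new_position, reward, done).
    """
    angle = 2.0 * math.pi * action / N_ACTIONS
    x = position[0] + STEP_SIZE * math.cos(angle)
    y = position[1] + STEP_SIZE * math.sin(angle)

    # Collision: the robot stays in place and takes a penalty (assumed value).
    for ox, oy, orad in obstacles:
        if math.hypot(x - ox, y - oy) < orad + robot_radius:
            return position, -1.0, False

    # Reaching the goal ends the episode with a positive reward (assumed value).
    if math.hypot(x - goal[0], y - goal[1]) < GOAL_RADIUS:
        return (x, y), 10.0, True

    # A small per-step cost encourages short paths (assumed value).
    return (x, y), -0.1, False
```

The state stays continuous (any `(x, y)` pair is valid), while the agent only ever chooses among `N_ACTIONS` integers.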

### Algorithm
<br>
**Deep Q-Learning:** This motion planner uses Q-learning, a reinforcement learning algorithm in which an agent learns the optimal action-value function Q*(s, a) from its past interactions with the environment. Each interaction is represented as (St, At, R), where St is the state at time t, At is the action taken at time t, and R is the reward the agent received after taking that action. Q*(s, a) is the expected long-term reward of taking action a in state s and acting optimally afterwards; the optimal policy picks the action with the highest Q* value in each state.
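
For reference, the standard definition of Q* can be written as the Bellman optimality equation below. The discount factor γ and the next state S_{t+1} are symbols introduced here for illustration; they are not named in the text above.

$$
Q^*(s, a) = \mathbb{E}\left[\, R + \gamma \max_{a'} Q^*(S_{t+1}, a') \;\middle|\; S_t = s,\ A_t = a \,\right],
\qquad
\pi^*(s) = \arg\max_{a} Q^*(s, a)
$$
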
<br><br>
Q-learning uses the temporal difference error (the difference between the old approximation and the new one) to adjust its approximation. The adjustment is scaled by a small learning rate so that no single new piece of information gets too much weight, since individual observations are not very reliable. We also keep the randomness high at the start, so the robot initially makes mostly random moves, which helps it explore the environment. The randomness goes down as the robot becomes more and more aware of the environment dynamics.
This algorithm uses a deep neural network to predict the approximate Q value for each state-action pair, and the temporal difference error is used to train the neural network.
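
The exploration schedule and the temporal difference update described above can be sketched as follows. This is a minimal illustration with plain Python numbers standing in for the network output; the function names, the multiplicative decay schedule, and all default values are assumptions rather than the settings used in this project.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


def decay_epsilon(epsilon, rate=0.995, floor=0.05):
    """Shrink epsilon multiplicatively, never below a floor (assumed schedule)."""
    return max(floor, epsilon * rate)


def td_target(reward, next_q_values, done, gamma=0.99):
    """Target r + gamma * max_a' Q(s', a'); no bootstrapping at terminal states."""
    if done:
        return reward
    return reward + gamma * max(next_q_values)


def td_update(q_sa, target, learning_rate=0.1):
    """Move the old estimate a small step toward the target."""
    return q_sa + learning_rate * (target - q_sa)
```

With a neural network, `td_update` is replaced by a gradient step that reduces the squared difference between the predicted Q value and `td_target`.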

**Variations**: I tried two variations of this algorithm, which are as follows:
<br>
1.	**Experience Replay**: A neural network tends to forget what it has seen earlier. As training progresses, it sees more and more states that are highly correlated and biased towards the goal, so it fails to generalize well. To tackle this problem we use an experience pool that stores the past experiences of the robot. Instead of training on the sequence of states as they occur, we draw a random sample from the memory pool and train on that sample, which helps the neural network generalize.
 
2.	**Deep Q-Learning with Potential Function**: I tried a hybrid approach that combines Deep Q-learning with a potential function, which I believed would help the model converge faster than plain Deep Q-learning. I used the distance from the agent's current position to the goal as the potential function. This variant seemed to learn faster than the plain model, but I could not gather enough results to support this claim.
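
Both variations can be sketched in a few lines. The code below is an illustration under stated assumptions, not the project's implementation: `ReplayBuffer`, `potential`, and `shaped_reward` are hypothetical names, the potential is taken as the negative distance to the goal, and the shaping term uses the usual potential-based form `gamma * phi(s') - phi(s)`.

```python
import math
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size pool of past transitions, sampled uniformly at random."""

    def __init__(self, capacity=10000, seed=None):
        self.memory = deque(maxlen=capacity)  # oldest transitions drop out first
        self.rng = random.Random(seed)

    def add(self, state, action, reward, next_state, done):
        self.memory.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Random sampling breaks the correlation between consecutive steps.
        batch_size = min(batch_size, len(self.memory))
        return self.rng.sample(list(self.memory), batch_size)

    def __len__(self):
        return len(self.memory)


def potential(position, goal):
    """Negative distance to the goal: larger when the robot is closer."""
    return -math.hypot(goal[0] - position[0], goal[1] - position[1])


def shaped_reward(reward, position, next_position, goal, gamma=0.99):
    """Add the shaping term gamma * phi(s') - phi(s) to the environment reward."""
    return reward + gamma * potential(next_position, goal) - potential(position, goal)
```

In this sketch the shaped reward would be computed before a transition is stored, so steps towards the goal are rewarded immediately instead of only at the end of an episode.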


### Simulation

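```python
# Training sample: embed the recorded simulation video in the notebook
import base64
from IPython.display import HTML

# Read the Ogg video as raw bytes and base64-encode it for a data URI
with open('simulation.ogv', 'rb') as f:
    encoded = base64.b64encode(f.read()).decode('ascii')

# The MIME type must match the container: .ogv is Ogg, not MP4
HTML(data='''<video alt="test" controls>
                <source src="data:video/ogg;base64,{0}" type="video/ogg" />
             </video>'''.format(encoded))
```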

### Learned Policy Visualization

<br>
Below is a visualization of the learned policy, created by making a contour plot of the maximum Q value over a large number of sampled states in the environment.
<img src="https://raw.githubusercontent.com/imishra/MotionPlanningWithDeepQLearning/master/trajectory_and_Q_contour.png"></img>
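
The data behind such a contour plot can be produced by evaluating the maximum Q value on a regular grid of states. The sketch below assumes a hypothetical `q_function` callable that wraps the trained network and returns one Q value per action. The actual plotting code of this project is not shown here.

```python
def max_q_grid(q_function, x_range, y_range, resolution=50):
    """Evaluate max_a Q(s, a) on a regular grid of (x, y) states.

    q_function: hypothetical callable mapping (x, y) to a list of Q values,
    one per action (for example, a wrapper around the trained network).
    Returns (xs, ys, grid) where grid[i][j] is the value at (xs[j], ys[i]).
    """
    x_min, x_max = x_range
    y_min, y_max = y_range
    xs = [x_min + (x_max - x_min) * j / (resolution - 1) for j in range(resolution)]
    ys = [y_min + (y_max - y_min) * i / (resolution - 1) for i in range(resolution)]
    grid = [[max(q_function((x, y))) for x in xs] for y in ys]
    return xs, ys, grid
```

The returned `xs`, `ys`, and `grid` follow the row-major layout that contour routines such as matplotlib's `contourf(xs, ys, grid)` expect.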

As we can see, the path is not optimal. To make it optimal, the model would need to be trained on more data so that it can approximate the Q value of each state-action pair more accurately.

### Limitation
Training a deep neural network takes a lot of computational resources and time, so training a reinforcement learning agent for a continuous environment like this one is computationally expensive. For this reason I could not make the model work for bigger environments, or generalize well enough to handle any given start and goal state. With a more complex neural network and appropriate resources, the model could be trained to work in bigger environments with arbitrary start and goal states.

### Future Work 
I would explore variation 2 described above further by designing other potential functions. It looks like a good tradeoff: training time goes down in exchange for some extra information about the environment (although this means the algorithm is no longer purely model-free). In addition, I would work on including the velocity parameters.

### References 
1.	Mnih et al. (2015). Human-level control through deep reinforcement learning. Nature, 518, 529-533.
2.	Gao, Y. and Toni, F. (2015). Potential Based Reward Shaping for Hierarchical Reinforcement Learning. Department of Computing, Imperial College London.
