# Demo - We're going to land a SpaceX Rocket using Machine Learning

# OpenAI Gym

![alt text](https://upload.wikimedia.org/wikipedia/commons/thumb/4/4d/OpenAI_Logo.svg/1200px-OpenAI_Logo.svg.png "Logo Title Text 1")

- OpenAI was founded in late 2015 as a non-profit with a mission to “build safe artificial general intelligence (AGI) and ensure AGI's benefits are as widely and evenly distributed as possible.” 
- A major contribution that OpenAI made to the machine learning world was developing both the Gym and Universe software platforms.

![alt text](https://venturebeat.com/wp-content/uploads/2016/04/OpenAI-Gym.jpg?fit=578%2C302&strip=all "Logo Title Text 1")

https://gym.openai.com/envs/
https://github.com/openai/gym

- Gym is a collection of environments/problems designed for testing and developing reinforcement learning algorithms; it saves the user from having to create complicated environments.
- Gym is written in Python.
- There are multiple environments, such as robot simulations or Atari games.
- There is also an online leaderboard for people to compare results and code.

https://github.com/openai/gym/wiki/Leaderboard
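
The reset/step loop that Gym environments expose can be sketched without installing Gym. The `CoinFlipEnv` class below is a hypothetical stand-in, not a real Gym environment; it only mimics the classic interface in which `reset()` returns an observation and `step(action)` returns an observation, a reward, a done flag, and an info dict. Newer Gym releases changed these return values, so treat this as an illustration of the idea rather than the current API.

```python
import random


class CoinFlipEnv:
    """Hypothetical toy environment (not part of Gym) with a Gym-like interface."""

    def __init__(self, episode_length=10, seed=0):
        self.episode_length = episode_length
        self.rng = random.Random(seed)
        self.steps = 0
        self.coin = 0

    def reset(self):
        # Start a new episode and return the first observation.
        self.steps = 0
        self.coin = self.rng.randint(0, 1)
        return self.coin

    def step(self, action):
        # Reward 1.0 if the action matches the coin that was just observed.
        reward = 1.0 if action == self.coin else 0.0
        self.steps += 1
        self.coin = self.rng.randint(0, 1)
        done = self.steps >= self.episode_length
        return self.coin, reward, done, {}


def run_episode(env, policy):
    """Run one episode and return the total reward collected."""
    observation = env.reset()
    total_reward, done = 0.0, False
    while not done:
        action = policy(observation)
        observation, reward, done, _info = env.step(action)
        total_reward += reward
    return total_reward
```

A policy that simply copies the observation, `run_episode(CoinFlipEnv(), lambda obs: obs)`, collects the maximum reward of 1.0 per step. A learning algorithm would replace that lambda with something it improves over time.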

# Reinforcement Learning

#### Reinforcement learning is a computational approach in which an agent interacts with an environment by taking actions, and tries to maximize its accumulated reward.


![alt text](https://d3ansictanv2wj.cloudfront.net/image3-5f8cbb1fb6fb9132fef76b13b8687bfc.png "Logo Title Text 1")

![alt text](http://slideplayer.com/4429312/14/images/6/Recap%3A+MDPs+Markov+decision+processes%3A+Quantities%3A+States+S+Actions+A.jpg "Logo Title Text 1")

- A Markov Decision Process (MDP) involves a set of states and actions. 
- Actions connect states with some uncertainty, described by the dynamics P(s′|a,s), i.e. the transition probabilities that underlie the transition function T(s,a,s′).
- In addition, there is a reward function R(s,a,s′) that associates a reward or penalty with each transition.
- Once these are known or learned, the result is a policy π, which prescribes the action to take in a given state.
- We can then value a given state s and a policy π in terms of expected future rewards with a value function Vπ(s).
- The ultimate goal here is to identify an optimal policy.
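
As a minimal sketch of how an optimal policy falls out of a fully known MDP, the snippet below runs value iteration on a tiny, made-up two-state problem. The states, actions, rewards, and the discount factor `GAMMA` are all assumptions chosen for illustration; none of them come from the rocket-landing task.

```python
# Hypothetical two-state MDP: T[s][a] is a list of (probability, next_state, reward).
T = {
    "low": {
        "wait": [(1.0, "low", 0.0)],
        "climb": [(0.8, "high", 1.0), (0.2, "low", 0.0)],
    },
    "high": {
        "wait": [(1.0, "high", 2.0)],
        "climb": [(1.0, "low", 0.0)],
    },
}
GAMMA = 0.9  # discount factor (an assumption; the notes do not fix one)


def q_value(V, s, a):
    # Expected immediate reward plus discounted value of the successor state.
    return sum(p * (r + GAMMA * V[s2]) for p, s2, r in T[s][a])


def value_iteration(tol=1e-9):
    """Repeat the Bellman optimality backup until the values stop changing."""
    V = {s: 0.0 for s in T}
    while True:
        new_V = {s: max(q_value(V, s, a) for a in T[s]) for s in T}
        if max(abs(new_V[s] - V[s]) for s in T) < tol:
            return new_V
        V = new_V


def greedy_policy(V):
    """Pick, in every state, the action with the highest one-step lookahead value."""
    return {s: max(T[s], key=lambda a: q_value(V, s, a)) for s in T}
```

Calling `greedy_policy(value_iteration())` yields a policy that climbs from `low` and waits in `high`, because waiting in `high` pays a reward of 2 on every step.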

![alt text](http://images.slideplayer.com/25/7672501/slides/slide_4.jpg "Logo Title Text 1")

- Often all parameters are known in advance, so we can conduct offline planning, that is, formulate a plan without needing to interact with the world.
- But we may not know the reward function R or even the transition model T, and then we must engage in online planning, in which we must interact with the world to learn more about it to better formulate a plan.
- Online planning involves reinforcement learning, where agents can learn in what states rewards or goals are located without needing to know from the start.
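
When T and R are unknown, the agent can only learn from sampled transitions. The sketch below uses tabular Q-learning, one TD method picked here purely for illustration, on a hypothetical five-cell corridor whose goal is the rightmost cell. The environment, the learning rate, the discount, and the uniformly random exploration are all assumptions; the learner never reads the dynamics directly, it only calls `env_step`.

```python
import random

N_STATES = 5             # corridor cells 0..4; cell 4 is the goal
ACTIONS = (-1, +1)       # move left or right
ALPHA, GAMMA = 0.5, 0.9  # learning rate and discount (illustrative values)


def env_step(state, action):
    """Hidden dynamics: the learner only sees the sampled outcome."""
    next_state = min(max(state + action, 0), N_STATES - 1)
    done = next_state == N_STATES - 1
    reward = 1.0 if done else 0.0
    return next_state, reward, done


def q_learning(episodes=500, seed=0):
    rng = random.Random(seed)
    Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            action = rng.choice(ACTIONS)  # explore uniformly at random
            next_state, reward, done = env_step(state, action)
            # Bootstrap from the best action value in the next state.
            best_next = 0.0 if done else max(Q[(next_state, a)] for a in ACTIONS)
            target = reward + GAMMA * best_next
            Q[(state, action)] += ALPHA * (target - Q[(state, action)])
            state = next_state
    return Q


def greedy_policy(Q):
    """Best known action in every non-terminal cell."""
    return {s: max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(N_STATES - 1)}
```

After training, `greedy_policy(q_learning())` moves right in every cell, even though the agent was never told where the reward is.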

#### A summary of RL

![alt text](https://www.ibm.com/developerworks/library/cc-reinforcement-learning-train-software-agent/fig02.png "Logo Title Text 1")

- The agent interacts with its environment and receives feedback in the form of reward
- The agent's utility is defined by the reward function
- The agent must learn to act so as to maximize expected rewards
- Learning is based on observed samples of outcomes
- The ultimate goal of reinforcement learning is to learn a policy which returns an action to take given a state. 
- To form a good policy we need to know the value of a given state; we do so by learning a value function, which is the expected sum of rewards from the current state to some terminal state when following a fixed policy.
- This value function can be learned and approximated by any learning and approximation approach, e.g. neural networks.
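
The "sum of rewards from the current state to some terminal state" can be computed for one recorded episode as follows. This is a small sketch; the optional discount `gamma` is an assumption added for generality, and with `gamma=1.0` it is the plain sum described above.

```python
def returns_to_go(rewards, gamma=1.0):
    """Return G_t = r_t + gamma * r_{t+1} + ... for every step t of one episode."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each return reuses the one computed for the next step.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns
```

For example, `returns_to_go([1.0, 0.0, 2.0])` gives `[3.0, 2.0, 2.0]`. Averaging such returns over many episodes that pass through the same state gives a sample-based estimate of that state's value under the policy that generated the episodes.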

#### We can broadly categorize various RL methods into three groups:

![alt text](https://www.e-sciencecentral.org/upload/ijfis/thumb/ijfis-16-270f3.gif "Logo Title Text 1")

- critic-only methods, which first learn a value function and then use that to define a policy, e.g. TD learning.
- actor-only methods, which directly search the policy space. An example is an evolutionary approach where different policies are evolved.
- actor-critic methods, where a critic and an actor are both included and learned separately. The critic observes the actor and evaluates its policy, determining when it needs to change.

# Proximal Policy Optimization

![alt text](http://karpathy.github.io/assets/rl/rl.png "Logo Title Text 1")

![alt text](https://morvanzhou.github.io/static/results/reinforcement-learning/6-4-1.png "Logo Title Text 1")

- Policy gradient methods are a class of reinforcement learning techniques that optimize parametrized policies with respect to the expected return (long-term cumulative reward) by gradient ascent, or equivalently by gradient descent on the negated return.
- Trust Region Policy Optimization (TRPO) has achieved excellent results in challenging continuous control tasks. 
- The high-level idea is to take steps in directions that improve the policy, while simultaneously not straying too far from the old policy. 
- The constraint to stay “near” the old policy is the primary difference between TRPO and the vanilla policy gradient method.
- Making too large a change from the previous policy, especially in high-dimensional, nonlinear environments, can lead to a dramatic decrease in performance. 
- A naive solution is to take minuscule policy steps. The questions become: How small of a step? And how do you take small steps? 
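
To make the vanilla policy gradient step concrete, here is a sketch on a hypothetical three-action, single-state problem with a softmax policy. The reward values and step size are made up, and the gradient is computed exactly instead of being estimated from samples, which keeps the example deterministic. The `step_size` argument is the knob the questions above are about.

```python
import math

REWARDS = [1.0, 2.0, 5.0]  # hypothetical expected reward of each action


def softmax(theta):
    m = max(theta)  # subtract the max for numerical stability
    exps = [math.exp(t - m) for t in theta]
    total = sum(exps)
    return [e / total for e in exps]


def expected_return(theta):
    return sum(p * r for p, r in zip(softmax(theta), REWARDS))


def policy_gradient(theta):
    # For a softmax policy, dJ/dtheta_a = pi(a) * (r(a) - J).
    pi = softmax(theta)
    J = expected_return(theta)
    return [p * (r - J) for p, r in zip(pi, REWARDS)]


def train(step_size, steps=200):
    theta = [0.0, 0.0, 0.0]  # start from a uniform policy
    for _ in range(steps):
        grad = policy_gradient(theta)
        # Gradient ascent: move the parameters uphill on the expected return.
        theta = [t + step_size * g for t, g in zip(theta, grad)]
    return theta
```

Running `expected_return(train(0.1))` shows the policy shifting probability toward the best action. In this toy problem any step size works; in high-dimensional, nonlinear settings the choice is much more delicate, which is the motivation for TRPO.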

![alt text](https://camo.githubusercontent.com/f0872e487fe5a62274acbe83b2f28ba90db63a38/68747470733a2f2f6d6f7276616e7a686f752e6769746875622e696f2f7374617469632f726573756c74732f7265696e666f7263656d656e742d6c6561726e696e672f362d342d332e706e67 "Logo Title Text 1")

- TRPO takes a principled approach to controlling the rate of policy change. 
- The algorithm places a constraint on the average KL-divergence (DKL) between the new and old policy after each update: DKL(πold‖πnew).
- The learning algorithm then chooses directions that lead to the biggest policy improvement under a budget of not changing the policy beyond the DKL limit. 
- PPO is a simpler relative of TRPO; in the variant described here, it adds a DKL penalty term to the training loss function instead of enforcing a hard constraint. With this loss function in place, we train the policy with gradient descent like a typical neural network.
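
The KL term can be sketched for a single state with a handful of discrete actions. Everything below is an illustrative assumption: the function names, the penalty weight `beta`, and the advantage values are hypothetical, and a real implementation would average over sampled states and use a neural network policy.

```python
import math


def kl_divergence(p, q):
    """D_KL(p || q) for two discrete distributions given as probability lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def kl_penalized_objective(pi_old, pi_new, advantages, beta=1.0):
    """Objective to maximize: E_old[(pi_new / pi_old) * A] - beta * D_KL(old || new)."""
    # Importance-weighted advantage, averaged under the old policy.
    surrogate = sum(
        po * (pn / po) * adv for po, pn, adv in zip(pi_old, pi_new, advantages)
    )
    return surrogate - beta * kl_divergence(pi_old, pi_new)
```

With `pi_old = [0.5, 0.5]` and advantages `[1.0, -1.0]`, a cautious update to `[0.6, 0.4]` scores higher than a drastic jump to `[0.99, 0.01]`, because the KL penalty outweighs the extra surrogate gain. Negating this objective gives a loss that can be minimized with ordinary gradient descent.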

