# Cartpole: for newcomers to RL

We will be working through the methods described in the [OpenAI Requests for Research][1] for the [Cartpole environment][2]. 

Specifically, we will start with a simple linear model that has only four parameters, one weight per observation value, and use the sign of the weighted sum to choose between the two actions. We will then look at three methods for finding the best parameters:
  1. The random guessing algorithm
  2. The hill-climbing algorithm
  3. The policy gradient algorithm
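
To make the model concrete, here is a minimal sketch of the linear policy described above. The function name `choose_action` and the sample numbers are illustrative assumptions. Mapping a negative weighted sum to action 0 follows the `run_episode` function defined later in this notebook.

```python
import numpy as np

def choose_action(params, observation):
    """Return 0 if the weighted sum is negative, otherwise 1."""
    weighted_sum = np.dot(params, observation)
    return 0 if weighted_sum < 0 else 1

# Hypothetical values, for illustration only.
params = np.array([0.1, -0.2, 0.5, 0.3])
observation = np.array([0.02, 0.01, -0.03, 0.04])

action = choose_action(params, observation)

# Only the sign of the weighted sum matters, so rescaling the parameters
# by a positive constant leaves the chosen action unchanged.
scaled_action = choose_action(10.0 * params, observation)
```

Because the action depends only on the sign, what matters is the direction of the parameter vector, not its length.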


[1]: https://openai.com/requests-for-research/#cartpole
[2]: https://gym.openai.com/envs/CartPole-v0/

## The Environment
> The system is controlled by applying a force of +1 or -1 to the cart. The pendulum starts upright, and the goal is to prevent it from falling over. A reward of +1 is provided for every timestep that the pole remains upright. The episode ends when the pole is more than 15 degrees from vertical, or the cart moves more than 2.4 units from the center. 
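
The termination rule in the description can be sketched as a small predicate. The function name `episode_over` and its arguments are hypothetical. The thresholds are copied from the quoted description rather than read from the environment's source, which may use different values.

```python
def episode_over(cart_position, pole_angle_degrees,
                 max_angle_degrees=15.0, max_position=2.4):
    """Return True once either limit from the description is exceeded."""
    pole_fell = abs(pole_angle_degrees) > max_angle_degrees
    cart_left_track = abs(cart_position) > max_position
    return pole_fell or cart_left_track

# Hypothetical states, for illustration only.
upright_and_centered = episode_over(cart_position=0.0, pole_angle_degrees=0.0)
pole_too_far = episode_over(cart_position=0.0, pole_angle_degrees=16.0)
cart_too_far = episode_over(cart_position=-2.5, pole_angle_degrees=1.0)
```

In the notebook itself this check is done by the environment, which reports it through the `done` flag returned by `env.step`.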

> CartPole-v0 defines "solving" as getting average reward of 195.0 over 100 consecutive trials.
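
As a rough sketch, the "solved" criterion can be checked over a list of per-episode total rewards. The helper name `is_solved` is hypothetical, and treating an average of exactly 195.0 as solved (`>=`) is an assumption about how the definition should be read.

```python
def is_solved(episode_rewards, threshold=195.0, window=100):
    """Return True if any `window` consecutive episodes average >= threshold."""
    if len(episode_rewards) < window:
        return False

    # Sum of the first window, then slide it one episode at a time.
    window_sum = sum(episode_rewards[:window])
    if window_sum / float(window) >= threshold:
        return True
    for i in range(window, len(episode_rewards)):
        window_sum += episode_rewards[i] - episode_rewards[i - window]
        if window_sum / float(window) >= threshold:
            return True
    return False

# Hypothetical reward histories, for illustration only.
always_perfect = is_solved([200.0] * 100)
too_few_episodes = is_solved([200.0] * 99)
improves_later = is_solved([10.0] * 50 + [196.0] * 100)
```

Note that a single episode with a reward of 200 is not enough to meet this definition; the average has to hold over 100 consecutive episodes.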

## The Random Guessing Algorithm
> Generate 10,000 random configurations of the model's parameters, and pick the one that achieves the best cumulative reward. It is important to choose the distribution over the parameters correctly.
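
The idea can be sketched without the environment by scoring each guess with a stand-in function. Everything named here (`random_search`, `toy_score`, `target`) is hypothetical: `toy_score` replaces the total reward of an episode purely for illustration. Sampling uniformly from [-1, 1) is an assumption that matches the choice made in the code below; it is symmetric around zero, so both signs of each weight are equally likely.

```python
import numpy as np

def random_search(evaluate, num_params, num_trials, rng):
    """Sample parameter vectors uniformly from [-1, 1) and keep the best."""
    best_score = -np.inf
    best_params = None
    for _ in range(num_trials):
        params = rng.uniform(-1.0, 1.0, size=num_params)
        score = evaluate(params)
        if score > best_score:
            best_score = score
            best_params = params
    return best_params, best_score

# Hypothetical stand-in for an episode's total reward: the score is higher
# the closer the parameters are to a fixed target vector.
target = np.array([0.5, -0.5, 0.25, 0.0])

def toy_score(params):
    return -np.sum((params - target) ** 2)

rng = np.random.RandomState(0)  # fixed seed so the run is repeatable
best_params, best_score = random_search(toy_score, num_params=4,
                                        num_trials=1000, rng=rng)
```

Each guess is independent of the previous ones, which is what distinguishes random guessing from hill climbing.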

### Find optimal parameters

In [1]:
import gym
import numpy as np

In [2]:
def run_episode(env, params):
    total_reward = 0
    num_timesteps = 200
    
    observation = env.reset()
    for _ in range(num_timesteps):
        action = 0 if np.matmul(params, observation) < 0 else 1
        observation, reward, done, info = env.step(action)
        total_reward += reward
    
        if done:
            break
    
    return total_reward

In [3]:
# Find optimal parameters
env = gym.make('CartPole-v0')

num_episodes = 10000
num_params = 4

best_total_reward = 0
best_params = None

for _ in range(num_episodes):
    params = 2 * np.random.random(size=num_params) - 1
    total_reward = run_episode(env, params)
    
    if total_reward > best_total_reward:
        best_total_reward = total_reward
        best_params = params
        
print("best total reward: {}\nbest parameters: {}\n".format(best_total_reward, best_params))

WARN: gym.spaces.Box autodetected dtype as <type 'numpy.float32'>. Please provide explicit dtype.
best total reward: 200.0
best parameters: [-0.15868992  0.53567141  0.58937563  0.62118454]



### Watch result

In [4]:
import os
import os.path
from time import sleep

from gym.wrappers import Monitor
from IPython.display import HTML

In [5]:
# Log an episode with the best params.
# https://discuss.openai.com/t/how-to-capture-video-feed-from-a-universe-game/954/2
dir_path = 'random/'
env = Monitor(env, dir_path, force=True)
run_episode(env, best_params)
env.close()

In [6]:
# Get the video file path.
for f in os.listdir(dir_path):
    _, extension = os.path.splitext(f)
    if extension == '.mp4':
        video_path = os.path.join(dir_path, f)

In [7]:
# Render the video inline.
# https://gist.github.com/thanasi/ad31f798b747629e717bcebd2cad15cf
html_str = """
<div align="middle">
<video width="80%" controls>
    <source src="{}" type="video/mp4">
</video></div>
"""
html_str = html_str.format(video_path)
HTML(html_str)

## References

  * [OpenAI docs][gym-docs]
  * [OpenAI ROR: Cartpole][ror-cartpole]
  * [KVFrans Cartpole blog post][kvfrans]
  * [OpenAI Gym repo][gym-repo]
  * [Cartpole-v0 environment doc][cartpole-env]


[gym-docs]: https://gym.openai.com/docs/
[ror-cartpole]: https://openai.com/requests-for-research/#cartpole
[kvfrans]: http://kvfrans.com/simple-algoritms-for-solving-cartpole/
[gym-repo]: https://github.com/openai/gym
[cartpole-env]: https://gym.openai.com/envs/CartPole-v0/