# Tutorial: Policy optimization algorithms

## Monte Carlo Policy Gradient

This is an example implementation of REINFORCE, as explained in the lecture.
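
For reference, the update that REINFORCE performs for the softmax policy used below can be written as follows. This is the standard rule for a softmax over linear action preferences. The notation is an assumption here and may differ from the lecture's: $\phi(s,a)$ is the feature vector computed by `featurize`, $r_k$ is the reward received after taking action $a_k$, and $T$ is the episode length.

$$
\pi_\theta(a \mid s) = \frac{\exp\big(\theta^\top \phi(s,a)\big)}{\sum_b \exp\big(\theta^\top \phi(s,b)\big)},
\qquad
\nabla_\theta \log \pi_\theta(a \mid s) = \phi(s,a) - \sum_b \pi_\theta(b \mid s)\,\phi(s,b)
$$

$$
G_t = \sum_{k=t}^{T-1} \gamma^{\,k-t}\, r_k,
\qquad
\theta \leftarrow \theta + \alpha\, G_t\, \nabla_\theta \log \pi_\theta(a_t \mid s_t)
$$

Note that the gradient of the log-policy is a vector with the same shape as $\theta$: the feature vector of the chosen action minus the policy-weighted average of the feature vectors.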

In [1]:
import gym
import numpy as np

In [2]:
env = gym.make('CartPole-v0')

WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.


In [3]:
def featurize(s,a):
    return (2*a-1)*np.tanh(s)

def softmax_policy(theta, actions):
    def policy_fn(state):
        exp = np.exp([np.dot(featurize(state,a),theta) for a in actions])
        probs = exp/np.sum(exp)
        return probs
    return policy_fn

def run_episode(env, random_policy): 
    done = False
    total_reward = 0
    state = env.reset()
    episode = []
    actions = range(env.action_space.n)
    while not done:
        probs = random_policy(state)
        action = np.random.choice(actions, p=probs)
        new_state, reward, done, _ = env.step(action)
        episode.append((state,action,reward))
        state = new_state
        total_reward += reward
    return episode, total_reward
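
The REINFORCE update needs the gradient of the log-probability of the chosen action, so it can help to check that gradient numerically before training. The following is a minimal sketch, not part of the original notebook: `action_probs`, `grad_log_pi`, and `numerical_grad` are hypothetical helper names, the state is made up rather than sampled from CartPole, and `featurize` is copied from the cell above so the block runs on its own.

```python
import numpy as np

def featurize(s, a):
    # Same feature map as in the cell above: the sign flips with the action (0 or 1).
    return (2 * a - 1) * np.tanh(s)

def action_probs(theta, s, actions):
    # Softmax over linear preferences; subtracting the max logit avoids overflow.
    logits = np.array([np.dot(featurize(s, a), theta) for a in actions])
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

def grad_log_pi(theta, s, a, actions):
    # Score function: phi(s, a) minus the policy-weighted average feature vector.
    probs = action_probs(theta, s, actions)
    expected = sum(p * featurize(s, b) for p, b in zip(probs, actions))
    return featurize(s, a) - expected

def numerical_grad(theta, s, a, actions, eps=1e-6):
    # Central finite differences on log pi(a|s), one coordinate at a time.
    grad = np.zeros_like(theta)
    for k in range(len(theta)):
        step = np.zeros_like(theta)
        step[k] = eps
        hi = np.log(action_probs(theta + step, s, actions)[a])
        lo = np.log(action_probs(theta - step, s, actions)[a])
        grad[k] = (hi - lo) / (2 * eps)
    return grad

rng = np.random.default_rng(0)
theta = 2 * rng.random(4) - 1
s = rng.normal(size=4)           # made-up state, not sampled from CartPole
actions = [0, 1]

analytic = grad_log_pi(theta, s, 1, actions)
numeric = numerical_grad(theta, s, 1, actions)
print(np.allclose(analytic, numeric, atol=1e-6))
```

Because the features of the two actions differ only in sign, this policy is equivalent to a logistic model: the probability of action 1 is the sigmoid of `2 * np.dot(np.tanh(s), theta)`.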


In [9]:
theta = 2*np.random.rand(4)-1
n_episodes = 10000
actions = range(env.action_space.n)
gamma = 1.
alpha = 1e-3

In [10]:
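for ep in range(n_episodes):
    policy = softmax_policy(theta, actions)
    episode, ep_reward = run_episode(env, policy)

    # CartPole-v0 caps episodes at 200 steps, so a reward of 200 means the pole never fell.
    if ep_reward == 200:
        print("Won! this happened in episode ", ep)
        break

    if (ep+1) % 1000 == 0:
        print("Reward in episode {}: {}".format(ep+1, ep_reward))

    for i, (s, a, r) in enumerate(episode):
        # Discounted return from step i to the end of the episode.
        G = sum(x[2]*(gamma**j) for j, x in enumerate(episode[i:]))
        s = np.array(s)
        # Score function of the softmax policy: phi(s,a) - sum_b pi(b|s)*phi(s,b).
        # axis=0 keeps the expectation a feature vector instead of collapsing it to a scalar.
        probs = policy(s)
        expected_features = np.sum([probs[b]*featurize(s, b) for b in actions], axis=0)
        theta = theta + alpha*(featurize(s, a) - expected_features)*G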


Reward in episode 1000: 99.0
Won! this happened in episode  1244
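
The training loop recomputes the return from scratch at every step, which costs time quadratic in the episode length. The sketch below shows one way to get the same values in a single backward pass, using the recursion G_t = r_t + gamma * G_{t+1}. The function names are hypothetical and the reward lists are illustrative, not taken from a real run.

```python
import numpy as np

def discounted_returns(rewards, gamma):
    # Walk backwards so each return reuses the one after it.
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

def discounted_returns_naive(rewards, gamma):
    # Direct translation of the definition, one full sum per time step.
    return np.array([
        sum(r * gamma**j for j, r in enumerate(rewards[t:]))
        for t in range(len(rewards))
    ])

rewards = [1.0, 1.0, 1.0, 1.0]   # CartPole gives +1 per step; this short list is illustrative
print(discounted_returns(rewards, gamma=1.0))   # values: 4, 3, 2, 1
print(discounted_returns(rewards, gamma=0.5))   # values: 1.875, 1.75, 1.5, 1
```

With `gamma = 1.` and a reward of +1 per step, as in the CartPole run above, the return at step `t` is simply the number of steps remaining in the episode.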


## Your turn:

- Implement REINFORCE and REINFORCE with baseline for `CliffWalking`.
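
As a hint for the baseline variant, here is a minimal sketch of the per-episode update only; collecting episodes from the environment is left to you. It assumes a tabular softmax policy (one preference per state-action pair) and a table of state values as the baseline, which is one common choice rather than the only one. The names `reinforce_baseline_update`, `alpha_theta`, and `alpha_v` are hypothetical, the transitions are made up, and, like the CartPole loop above, the sketch omits the extra `gamma**t` factor that some textbook formulations include.

```python
import numpy as np

def softmax(x):
    z = np.exp(x - np.max(x))
    return z / z.sum()

def reinforce_baseline_update(theta, v, episode, gamma, alpha_theta, alpha_v):
    """One REINFORCE-with-baseline update from a finished episode.

    theta:   (n_states, n_actions) table of action preferences.
    v:       (n_states,) table of state-value estimates used as the baseline.
    episode: list of (state, action, reward) tuples with integer states and actions.
    Both tables are modified in place and also returned.
    """
    G = 0.0
    # Iterate backwards so the return can be accumulated incrementally.
    for s, a, r in reversed(episode):
        G = r + gamma * G
        delta = G - v[s]                  # return relative to the baseline
        v[s] += alpha_v * delta           # move the baseline toward the observed return
        grad = -softmax(theta[s])         # gradient of log pi(a|s) w.r.t. theta[s]:
        grad[a] += 1.0                    # onehot(a) - pi(.|s)
        theta[s] += alpha_theta * delta * grad
    return theta, v

n_states, n_actions = 48, 4               # assumed sizes: a 4x12 grid with 4 moves
theta = np.zeros((n_states, n_actions))
v = np.zeros(n_states)
toy_episode = [(36, 0, -1.0), (24, 1, -1.0), (25, 2, -100.0)]   # made-up transitions
theta, v = reinforce_baseline_update(
    theta, v, toy_episode, gamma=1.0, alpha_theta=0.1, alpha_v=0.1
)
print(theta[25], v[25])
```

Setting `alpha_v = 0` and leaving `v` at zero recovers plain REINFORCE, which makes it easy to compare the two variants with the same code.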

## Additional material

If you feel ready for a more interesting challenge, you can try `Pong`, which is one of the simplest Atari environments. To speed up the computation, you should preprocess the video frames: crop or resize them, downsample them, and remove the background or the color information altogether.
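
The preprocessing described above might look like the sketch below. The crop rows and the two background pixel values follow the approach in Karpathy's gist linked below; treat them as assumptions about the frame layout and verify them against the frames your environment actually returns. `preprocess_frame` is a hypothetical name, and the test frame is synthetic.

```python
import numpy as np

def preprocess_frame(frame):
    """Turn a 210x160x3 uint8 Pong frame into a flat 80x80 binary vector."""
    img = frame[35:195]              # crop to the playing field (assumed rows)
    img = img[::2, ::2, 0].copy()    # downsample by 2, keep one color channel
    img[img == 144] = 0              # erase background (assumed pixel value)
    img[img == 109] = 0              # erase second background shade (assumed pixel value)
    img[img != 0] = 1                # everything left (paddles, ball) becomes 1
    return img.astype(np.float32).ravel()

# Synthetic frame: uniform background with a small bright rectangle as the "ball".
fake = np.full((210, 160, 3), 144, dtype=np.uint8)
fake[100:104, 80:82, :] = 236
vec = preprocess_frame(fake)
print(vec.shape, int(vec.sum()))
```

A single frame carries no motion information, so a common follow-up is to feed the policy the difference between two consecutive preprocessed frames.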





Try to solve this yourself first! If you get stuck or want some inspiration, look at Andrej Karpathy's blog post and the accompanying gist:

- http://karpathy.github.io/2016/05/31/rl/
- https://gist.github.com/karpathy/a4166c7fe253700972fcbc77e4ea32c5