# Finding the best policy by brute force
In this example we use the simplest way of finding a policy for the cart-pole problem: brute force. This works because the state space and the action space are both small.

#### State Space (1x4 vector)

Num | Observation | Min | Max
---|---|---|---
0 | Cart Position | -2.4 | 2.4
1 | Cart Velocity | -Inf | Inf
2 | Pole Angle | ~ -41.8&deg; | ~ 41.8&deg;
3 | Pole Velocity At Tip | -Inf | Inf

#### Action space (1x2 vector)

Num | Action
--- | ---
0 | Push cart to the left
1 | Push cart to the right

#### Search Space
In this problem the search space has size 8, because the action space has 2 elements and the state space has 4 elements:
$$\text{len}(S_{1 \times 4}) \cdot \text{len}(A_{1 \times 2}) = 4 \cdot 2 = 8$$

#### Episode Termination
* Pole Angle is more than ±12°
* Cart Position is more than ±2.4 (center of the cart reaches the edge of the display)
* Episode length is greater than 200
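
The three termination conditions above can be written as a single predicate. This is a minimal sketch for illustration, not Gym's own implementation: the name `is_episode_over` and its arguments are hypothetical, and it assumes the pole angle is given in radians and that the episode is cut off once 200 steps have been taken (which matches the maximum score of 200).

```python
import math

# Thresholds taken from the termination list above.
MAX_POLE_ANGLE_RAD = math.radians(12)  # +/- 12 degrees, converted to radians
MAX_CART_POSITION = 2.4
MAX_EPISODE_STEPS = 200


def is_episode_over(state, steps_taken):
    """Return True if any termination condition holds.

    `state` is (cart position, cart velocity, pole angle in radians,
    pole velocity at tip). Hypothetical helper, not part of Gym's API.
    """
    cart_position, _, pole_angle, _ = state
    return (
        abs(pole_angle) > MAX_POLE_ANGLE_RAD
        or abs(cart_position) > MAX_CART_POSITION
        or steps_taken >= MAX_EPISODE_STEPS
    )
```

For example, `is_episode_over((2.5, 0.0, 0.0, 0.0), 10)` is true because the cart has left the ±2.4 range, even though the pole is upright.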

### References
* https://github.com/openai/gym/wiki/CartPole-v0
* https://medium.com/@m.alzantot/deep-reinforcement-learning-demystified-episode-0-2198c05a6124
* https://ray.readthedocs.io/en/latest/index.html
* https://bair.berkeley.edu/blog/2018/01/09/ray/

### Load Libraries and parameters

In [1]:
import time
import gym
import numpy as np

# Only log errors
gym.logger.set_level(40)

# Policies to generate
n_policy = 50000

### Policy
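In this case our parametrized policy receives the state $S_{1 \times 4}$ and takes its dot product with the policy's internal parameters $\theta_{1 \times 4}$. If the result is positive the cart is pushed to the right (action 1); otherwise it is pushed to the left (action 0).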
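
Written as a formula, with $s$ denoting the state vector and $\theta$ the parameter vector (the symbol $\pi_\theta$ for the policy is notation introduced here, not taken from the code), the decision rule is:

$$\pi_\theta(s) = \begin{cases} 1 \ (\text{push right}) & \text{if } \theta \cdot s > 0 \\ 0 \ (\text{push left}) & \text{otherwise} \end{cases} \qquad \text{where } \theta \cdot s = \sum_{i=0}^{3} \theta_i s_i$$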

In [2]:
# Generate a random 1x4 vector of policy parameters, uniform in [-1, 1)
def gen_random_policy_params():
    return np.random.uniform(-1, 1, size=4).astype(np.float32)

# Evaluate a policy given its parameters and a state
def policy(policy_params, state):
    # Linear model: the sign of the dot product selects the action
    if np.dot(policy_params, state) > 0:
        return 1  # Push cart to the right
    else:
        return 0  # Push cart to the left


### Generate a list of possible policies

In [3]:
policy_params_list = [gen_random_policy_params() for _ in range(n_policy)]
print('Generated list of %d policies parameters' % len(policy_params_list))

Generated list of 50000 policies parameters


### Evaluate Policy
This function runs a policy for one episode and returns the total reward collected.

In [4]:
def evaluate_policy_episode(policy_params, render=False):
    # Create the environment
    env = gym.make('CartPole-v0')
    state = env.reset()
    # Total reward accumulated over the episode
    total_reward_episode = 0
    while True:
        if render:
            env.render()
        # Pick an action with the linear policy and advance one step
        selected_action = policy(policy_params, state)
        state, reward, done, _ = env.step(selected_action)
        total_reward_episode += reward
        if done:
            break
    # Release the environment (and the render window, if one was opened)
    env.close()
    return total_reward_episode

### Search for Policies

In [5]:
%%time
# Evaluate every policy in the environment
scores_list = [evaluate_policy_episode(p) for p in policy_params_list]

CPU times: user 41.1 s, sys: 281 ms, total: 41.4 s
Wall time: 41.5 s
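
The search above scores every candidate and picks the maximum afterwards. The same brute-force idea can also be written as a loop that tracks the best candidate as it goes and optionally stops once a target score is reached. The sketch below is only an illustration under stated assumptions: `random_search`, `toy_score`, and `target_score` are hypothetical names, it uses the standard `random` module instead of NumPy, and `toy_score` is a stand-in for `evaluate_policy_episode` so that the example runs without Gym.

```python
import random


def random_search(score_fn, n_candidates, n_params=4, target_score=None, seed=None):
    """Sample parameter vectors uniformly from [-1, 1] and keep the best one.

    Hypothetical helper: `score_fn` maps a parameter list to a number,
    where higher is better.
    """
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_candidates):
        params = [rng.uniform(-1.0, 1.0) for _ in range(n_params)]
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
        # Optional early exit once the target score has been reached
        if target_score is not None and best_score >= target_score:
            break
    return best_params, best_score


# Toy stand-in for an episode score: higher when every parameter is near 1.
def toy_score(params):
    return -sum((p - 1.0) ** 2 for p in params)


best_params, best_score = random_search(toy_score, n_candidates=1000, seed=0)
print("Best toy score:", best_score)
```

With the notebook's environment, passing `evaluate_policy_episode` as `score_fn` and `target_score=200` would stop at the first policy that reaches the maximum score rather than evaluating all 50000 candidates.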


### Evaluate Best Policy

In [6]:
# Select the best policy from the score list.
best_score = max(scores_list)
print('Best policy score = %d' % best_score)

best_policy_params = policy_params_list[np.argmax(scores_list)]
print('Found best policy with params:', best_policy_params)

# Run the best policy once more, this time rendering the episode
total_rewards_best_policy = evaluate_policy_episode(best_policy_params, render=True)
print('Total rewards on best policy:', total_rewards_best_policy)

# Each episode starts from a random initial state, so a re-run can score differently
if total_rewards_best_policy != best_score:
    print('Re-run score differs from the score found during the search')

Best policy score = 200
Found best policy with params: [-0.7618115   0.62916756  0.8967531   0.8483351 ]
Total rewards on best policy: 200.0
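
Because each episode starts from a slightly different random initial state, a single re-run does not always reproduce the score found during the search. One way to get a steadier estimate is to average the score over several episodes. The sketch below assumes an episode function with the same signature as `evaluate_policy_episode`; the names `mean_episode_score` and `n_episodes` are hypothetical.

```python
from statistics import mean


def mean_episode_score(run_episode, policy_params, n_episodes=10):
    """Average the total reward of `n_episodes` independent episodes.

    `run_episode` is any callable that takes the policy parameters and
    returns the total reward of one episode.
    """
    scores = [run_episode(policy_params) for _ in range(n_episodes)]
    return mean(scores)
```

In the notebook this could be called as `mean_episode_score(evaluate_policy_episode, best_policy_params)`; a policy that scores 200 on average is more convincing than one that reached 200 once.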
