In [4]:
import gym

# Reinforcement Learning

cartpole

As the agent observes the current state of the environment and chooses an action, the environment transitions to a new state and returns a reward that indicates the consequences of the action. In this task, the reward is +1 for every timestep, and the episode terminates if the pole tilts too far or the cart moves more than 2.4 units away from the center. Better-performing policies therefore keep the episode running longer and accumulate a larger return.
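The reward and termination rule described above can be sketched without gym at all. This is only an illustration, not the library's implementation: `episode_return` and the `(cart_position, pole_angle)` trajectory format are hypothetical, and the 12-degree angle limit is an assumption based on gym's CartPole defaults (the text only says "too far").

```python
import math

X_LIMIT = 2.4                   # cart position limit, from the text above
THETA_LIMIT = math.radians(12)  # assumed angle limit; the text only says "too far"

def episode_return(trajectory):
    """Return the total reward for a sequence of (cart_position, pole_angle) pairs.

    Each step earns +1. The episode stops after the first step that leaves
    the allowed region, or when the trajectory runs out.
    """
    total = 0.0
    for position, angle in trajectory:
        total += 1.0  # +1 for every timestep, including the terminating one
        if abs(position) > X_LIMIT or abs(angle) > THETA_LIMIT:
            break
    return total

# The third step moves the cart past 2.4, so the fourth step is never reached.
steps = [(0.0, 0.0), (1.0, 0.05), (2.5, 0.0), (0.0, 0.0)]
print(episode_return(steps))
```

Because every step is worth the same +1, the return is simply the number of steps the episode lasted.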

The CartPole task is designed so that the inputs to the agent are 4 real values representing the environment state (position, velocity, etc.). However, neural networks can solve the task purely by looking at the scene, so we'll use a patch of the screen centered on the cart as an input. Because of this, our results aren't directly comparable to the ones from the official leaderboard: our task is much harder. Unfortunately, this also slows down training, because we have to render every frame.

Strictly speaking, we will present the state as the difference between the current screen patch and the previous one. This will allow the agent to take the velocity of the pole into account from one image.
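To make the frame-difference idea concrete, here is a minimal NumPy sketch. The helper name `difference_state` and the toy 1x5 "screen" are assumptions for illustration; a real screen patch would be a larger image array.

```python
import numpy as np

def difference_state(current_patch, previous_patch):
    """Subtract the previous screen patch from the current one."""
    # Cast to float so that unsigned image dtypes cannot wrap around on subtraction.
    return current_patch.astype(np.float32) - previous_patch.astype(np.float32)

# Toy 1x5 "screen": a single bright pixel moves one column to the right.
previous_patch = np.array([[0, 0, 255, 0, 0]], dtype=np.uint8)
current_patch = np.array([[0, 0, 0, 255, 0]], dtype=np.uint8)

state = difference_state(current_patch, previous_patch)
print(state)
```

The negative entry marks where the object was and the positive entry marks where it is now, so a single array carries the direction of motion.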

In [2]:
# Create an environment using gym
env = gym.make("CartPole-v1")
obs = env.reset()
obs

array([ 0.00530656, -0.02173368, -0.00133809,  0.01859455])
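The four numbers in the observation printed above are easier to read with names attached. In gym's CartPole they are, in order, cart position, cart velocity, pole angle (in radians), and pole angular velocity. The `CartPoleState` type below is a hypothetical helper, not part of gym, and the observation is copied by hand so the sketch runs without an environment.

```python
from collections import namedtuple

import numpy as np

# Hypothetical helper type; the field order follows gym's CartPole observation layout.
CartPoleState = namedtuple(
    "CartPoleState",
    ["cart_position", "cart_velocity", "pole_angle", "pole_angular_velocity"],
)

# The observation printed above, copied by hand.
example_obs = np.array([0.00530656, -0.02173368, -0.00133809, 0.01859455])

state = CartPoleState(*example_obs)
print(state.pole_angle)  # a small negative angle: the pole leans slightly to the left
```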

In [4]:
# Render the current state of the environment
env.render()

True

In [9]:
# Inspect the action space
env.action_space

Discrete(2)

In [28]:
action = 1
obs, reward, done, info = env.step(action)
obs

array([ 0.10998358,  1.60569308, -0.14018582, -2.36941021])

In [29]:
reward

1.0

In [30]:
done

False

In [31]:
info

{}

## Create a Policy for CartPole

In [5]:
def basic_policy(obs):
    # obs[2] is the pole angle: push left (0) if the pole leans left, otherwise push right (1)
    angle = obs[2]
    action = 0 if angle < 0 else 1
    return action

totals = []
env = gym.make("CartPole-v1")

for episode in range(500):
    episode_rewards = 0
    obs = env.reset()
    for step in range(200):
        action = basic_policy(obs)
        obs, reward, done, info = env.step(action)
        episode_rewards += reward
        if done:
            break
    totals.append(episode_rewards)

In [6]:
import numpy as np
np.mean(totals), np.std(totals), np.min(totals), np.max(totals)

(42.07, 8.59448078710983, 25.0, 68.0)
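The evaluation loop above can be wrapped in a reusable function so that other policies can be scored the same way. This is a sketch under stated assumptions: `evaluate_policy` is a hypothetical helper, it assumes the older gym API used in this notebook (`reset()` returns the observation, `step()` returns four values), and `FixedLengthEnv` is a made-up stand-in so the example runs without gym installed.

```python
import numpy as np

def basic_policy(obs):
    # Same rule as above: push left (0) if the pole leans left, else right (1).
    return 0 if obs[2] < 0 else 1

def evaluate_policy(env, policy, n_episodes=500, max_steps=200):
    """Run `policy` for `n_episodes` and return the list of episode returns.

    Assumes the older gym API: reset() -> obs,
    step(action) -> (obs, reward, done, info).
    """
    totals = []
    for _ in range(n_episodes):
        episode_rewards = 0.0
        obs = env.reset()
        for _ in range(max_steps):
            obs, reward, done, info = env.step(policy(obs))
            episode_rewards += reward
            if done:
                break
        totals.append(episode_rewards)
    return totals

class FixedLengthEnv:
    """Hypothetical stand-in for a gym environment; every episode ends after `length` steps."""

    def __init__(self, length):
        self.length = length
        self.t = 0

    def reset(self):
        self.t = 0
        return np.zeros(4)

    def step(self, action):
        self.t += 1
        done = self.t >= self.length
        return np.zeros(4), 1.0, done, {}

totals = evaluate_policy(FixedLengthEnv(length=30), basic_policy, n_episodes=5)
print(np.mean(totals), np.std(totals), np.min(totals), np.max(totals))
```

With the real environment, the call would presumably be `evaluate_policy(gym.make("CartPole-v1"), basic_policy)`. Note that `max_steps=200` caps each episode's return at 200, just as the inner loop above does.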