In [42]:
import gym
# TensorFlow ≥2.0 is required
import tensorflow as tf
from tensorflow import keras
assert tf.__version__ >= "2.0"

ModuleNotFoundError: No module named 'tensorflow'

# Reinforcement Learning


As the agent observes the current state of the environment and chooses an action, the environment transitions to a new state and returns a reward that indicates the consequences of the action. In this task, the reward is +1 for every timestep, and the episode terminates if the pole falls over too far or the cart moves more than 2.4 units away from the center. This means better-performing episodes run longer and accumulate a larger return.
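The reward and termination rules described above can be sketched in a few lines of plain Python. This is only an illustration, not the environment's actual implementation: the 2.4 limit comes from the text, while the 12-degree angle limit is an assumption based on the standard CartPole definition, and `is_terminal`, `episode_return`, and the trajectory are hypothetical names and values.

```python
import math

# 2.4 comes from the text above; the 12-degree angle limit is an assumption
# taken from the standard CartPole definition.
X_LIMIT = 2.4
ANGLE_LIMIT = math.radians(12)

def is_terminal(cart_position, pole_angle):
    """Return True when the cart or the pole leaves the allowed range."""
    return abs(cart_position) > X_LIMIT or abs(pole_angle) > ANGLE_LIMIT

def episode_return(states):
    """Add +1 for every step taken, stopping at the first terminal state."""
    total = 0.0
    for cart_position, pole_angle in states:
        total += 1.0  # the step that led to this state earns +1
        if is_terminal(cart_position, pole_angle):
            break
    return total

# Hypothetical trajectory: (cart position, pole angle in radians) after each step.
trajectory = [(0.0, 0.01), (0.1, 0.05), (0.2, 0.15), (0.3, 0.25), (0.4, 0.30)]
print(episode_return(trajectory))
```

Because every step earns the same reward, the return is simply the number of steps the episode survived, which is why longer episodes score higher.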

The CartPole task is designed so that the inputs to the agent are 4 real values representing the environment state (position, velocity, etc.). However, neural networks can solve the task purely by looking at the scene, so we'll use a patch of the screen centered on the cart as an input. Because of this, our results aren't directly comparable to the ones from the official leaderboard: our task is much harder. Unfortunately, this also slows down training, because we have to render all the frames.

Strictly speaking, we will present the state as the difference between the current screen patch and the previous one. This will allow the agent to take the velocity of the pole into account from one image.
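As a rough sketch of the frame-difference idea, the snippet below subtracts two tiny synthetic patches with NumPy. The patch size, the pixel values, and the helper name `difference_state` are assumptions made for illustration; real patches would come from the rendered frames.

```python
import numpy as np

def difference_state(current_patch, previous_patch):
    """Return current minus previous as float32 so negative changes survive."""
    return current_patch.astype(np.float32) - previous_patch.astype(np.float32)

# Two hypothetical 4x6 grayscale patches: a bright pixel moves one column right.
previous_patch = np.zeros((4, 6), dtype=np.uint8)
current_patch = np.zeros((4, 6), dtype=np.uint8)
previous_patch[2, 2] = 255
current_patch[2, 3] = 255

state = difference_state(current_patch, previous_patch)
print(state[2])
```

The difference is negative where the object used to be and positive where it is now, so a single array carries the direction of motion. Casting to a float type first matters: subtracting `uint8` arrays directly would wrap around instead of producing negative values.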

In [2]:
# Create an environment using gym
env = gym.make("CartPole-v1")
obs = env.reset()
obs

array([ 0.04004176, -0.01955164, -0.01786848,  0.0244276 ])
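To make the four values easier to read, here is a small sketch that gives each one a name. `CartPoleObs` is a hypothetical helper, not part of gym; the field order (cart position, cart velocity, pole angle, pole angular velocity) follows the standard CartPole observation layout, and the numbers are copied from the output above.

```python
from typing import NamedTuple

class CartPoleObs(NamedTuple):  # hypothetical helper, not part of gym
    cart_position: float
    cart_velocity: float
    pole_angle: float  # radians, 0 means upright
    pole_angular_velocity: float

# Values copied from the observation printed above.
obs_values = [0.04004176, -0.01955164, -0.01786848, 0.0244276]
named = CartPoleObs(*obs_values)
print(named.pole_angle)
```

Index 2 is the pole angle, which is the value the hard-coded policy later in this notebook reads via `obs[2]`.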

In [10]:
try:
    import pyvirtualdisplay
    display = pyvirtualdisplay.Display(visible=0, size=(1400, 900)).start()
except ImportError:
    pass

In [11]:
img = env.render(mode="rgb_array")
img.shape

(800, 1200, 3)

In [6]:
# Plot the state of the environment
env.render()

True

In [3]:
# What is the space of actions
env.action_space

Discrete(2)

In [4]:
action = 1
obs, reward, done, info = env.step(action)
obs

array([-0.03506609,  0.20399354, -0.03754943, -0.33613728])

In [5]:
reward

1.0

In [6]:
done

False

In [7]:
info

{}

## Create a Policy for CartPole

A simple hard-coded policy

Let's hard-code a simple strategy: if the pole is tilting to the left, push the cart to the left, and vice versa. Let's see if that works:


In [3]:
def basic_policy(obs):
    angle = obs[2]
    action = 0 if angle < 0 else 1
    return action

totals = []
env = gym.make("CartPole-v1")

for episode in range(500):
    episode_rewards = 0
    obs = env.reset()
    for step in range(200):
        action = basic_policy(obs)
        obs, reward, done, info = env.step(action)
        episode_rewards += reward
        if done:
            break
    totals.append(episode_rewards)

In [4]:
import numpy as np
np.mean(totals), np.std(totals), np.min(totals), np.max(totals)

(42.118, 8.914935557815323, 24.0, 71.0)

In [35]:
env.seed(42)

frames = []

obs = env.reset()
for step in range(200):
    img = env.render(mode="rgb_array")
    frames.append(img)
    action = basic_policy(obs)

    obs, reward, done, info = env.step(action)
    if done:
        print(step)
        print("DONE")
        break

54
DONE


In [36]:

import matplotlib as mpl
import matplotlib.pyplot as plt
import matplotlib.animation as animation
mpl.rc('animation', html='jshtml')

def update_scene(num, frames, patch):
    patch.set_data(frames[num])
    return patch,

def plot_animation(frames, repeat=False, interval=40):
    fig = plt.figure()
    patch = plt.imshow(frames[0])
    plt.axis('off')
    anim = animation.FuncAnimation(
        fig, update_scene, fargs=(frames, patch),
        frames=len(frames), repeat=repeat, interval=interval)
    plt.close()
    return anim

In [37]:
plot_animation(frames)

## Neural Network Policies
Let's create a neural network that takes observations as inputs and outputs the action to take for each observation. To choose an action, the network estimates a probability for each action, and we then select an action randomly according to the estimated probabilities. In the CartPole environment, there are just two possible actions (left or right), so we only need one output neuron: it outputs the probability p of action 0 (left), and the probability of action 1 (right) is then 1 - p.
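The sampling step described above can be sketched without any network at all. In the snippet below, `p_left` stands in for the network's single output, and both `sample_action` and the value 0.8 are hypothetical choices for illustration.

```python
import random

def sample_action(p_left, rng):
    """Return 0 (left) with probability p_left, otherwise 1 (right)."""
    return 0 if rng.random() < p_left else 1

rng = random.Random(42)
p_left = 0.8  # hypothetical network output
actions = [sample_action(p_left, rng) for _ in range(10_000)]
left_fraction = actions.count(0) / len(actions)
print(round(left_fraction, 2))
```

Sampling rather than always picking the most likely action lets the agent occasionally try the less favored move, which is what allows it to explore.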

In [38]:


keras.backend.clear_session()
tf.random.set_seed(42)
np.random.seed(42)

n_inputs = 4 # == env.observation_space.shape[0]

model = keras.models.Sequential([
    keras.layers.Dense(5, activation="elu", input_shape=[n_inputs]),
    keras.layers.Dense(1, activation="sigmoid"),
])

NameError: name 'keras' is not defined

In [41]:
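# `%run` executes a Python file, so `%run pip install keras` fails with
# "File `'pip.py'` not found". Use the `%pip` magic instead. The imports at the
# top of this notebook need TensorFlow, which bundles Keras as `tf.keras`.
%pip install tensorflow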