# Reinforcement Learning: CartPole Game
The objective of this project is to train an agent to play the CartPole game using reinforcement learning.

CartPole is a very simple environment composed of a cart that can move left or right and a pole attached upright to the top of the frictionless cart. The agent needs to balance the pole by adjusting the cart position.
The goal is for the agent to learn a policy that keeps the pole balanced by moving the cart according to the position of the pole.

## Workflow:
First, we will write a simple hard-coded policy, run the agent with that policy, and examine its performance.
Then we will train a neural network and observe its performance.
Lastly, we will implement the policy gradient algorithm and observe how it improves the agent's performance.

# Importing the Modules

In [2]:
import sys
assert sys.version_info >= (3, 5)

import numpy as np
import tensorflow as tf
from tensorflow import keras

import sklearn
assert sklearn.__version__ >= "0.20"

np.random.seed(42)
tf.random.set_seed(42)

import matplotlib as mpl
import matplotlib.pyplot as plt
%matplotlib inline
mpl.rc('axes', labelsize=14)
mpl.rc('xtick', labelsize=12)
mpl.rc('ytick', labelsize=12)

# Getting the CartPole environment from OpenAI Gym

In [4]:
import gym

In [5]:
env = gym.make('CartPole-v1')
env.seed(42)

[42]

In [6]:
# Initializing the environment by calling reset() method. This returns an observation.
obs = env.reset()
print(obs)

[-0.01258566 -0.00156614  0.04207708 -0.00180545]


##### An observation in CartPole has 4 values
The first value is the position of the cart, the second is the velocity of the cart, the third is the angle of the pole, and the fourth is the angular velocity of the pole.
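
As a small illustration, the four values can be unpacked into named variables. The sketch below reuses the observation printed above instead of a live environment. The variable names are illustrative only (Gym returns a plain array), and the reading "positive angle means leaning right" follows the hard-coded policy used later in this notebook.

```python
import numpy as np

# Observation copied from the printed output above.
obs = np.array([-0.01258566, -0.00156614, 0.04207708, -0.00180545])

# Illustrative names; the environment itself only returns an array of 4 floats.
cart_position, cart_velocity, pole_angle, pole_angular_velocity = obs

# Assumption (matches basic_policy below): a positive angle means the pole leans right.
leaning = "right" if pole_angle > 0 else "left"
print(f"angle={pole_angle:.4f}, pole leans {leaning}")
```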

In [7]:
print(env.action_space)

Discrete(2)


# Hard Coding Simple Policy

##### Simple Policy
Move the cart to the right if the pole tilts to the right, and push it to the left if the pole tilts to the left.

In [8]:
env.seed(42)

def basic_policy(obs):
    angle = obs[2]
    if angle < 0:
        return 0
    else:
        return 1

We will play 500 episodes of the game, each capped at 200 steps. At each step we call basic_policy to get the action and apply it to the environment. We also accumulate the total reward of each episode.

In [9]:
totals = []
for episode in range(500):
    episode_rewards = 0
    obs = env.reset()
    for step in range(200):
        action = basic_policy(obs)
        obs, reward, done, info = env.step(action)
        episode_rewards += reward
        if done:
            break
    totals.append(episode_rewards)

In [10]:
# Mean, standard deviation, minimum, and maximum number of steps for which basic_policy keeps the pole up:
print(np.mean(totals), np.std(totals), np.min(totals), np.max(totals))

41.718 8.858356280936096 24.0 68.0


This strategy is a bit too basic: the longest the agent kept the pole up is only 68 steps. In this notebook, the game counts as won when the agent keeps the pole up for the full 200 steps.
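
The same evaluation loop is reused several times in this notebook, so it could be wrapped in a helper. The following is a minimal sketch under stated assumptions: `evaluate_policy` and `StubEnv` are hypothetical names, and `StubEnv` is a toy stand-in with no CartPole physics, used only so the example runs without Gym. It follows the same `reset()` / 4-tuple `step()` interface used in the cells above.

```python
import numpy as np

class StubEnv:
    """Hypothetical stand-in for a Gym environment (no real physics).

    Every step gives reward 1.0 and the episode ends after `episode_length` steps.
    """
    def __init__(self, episode_length):
        self.episode_length = episode_length
        self.t = 0

    def reset(self):
        self.t = 0
        return np.zeros(4)

    def step(self, action):
        self.t += 1
        done = self.t >= self.episode_length
        return np.zeros(4), 1.0, done, {}

def evaluate_policy(env, policy, n_episodes, n_max_steps):
    """Run `policy` for several episodes and summarize the total rewards."""
    totals = []
    for _ in range(n_episodes):
        episode_rewards = 0.0
        obs = env.reset()
        for _ in range(n_max_steps):
            obs, reward, done, info = env.step(policy(obs))
            episode_rewards += reward
            if done:
                break
        totals.append(episode_rewards)
    return np.mean(totals), np.std(totals), np.min(totals), np.max(totals)

# A constant "always push left" policy on the stub environment.
stats = evaluate_policy(StubEnv(episode_length=30), lambda obs: 0,
                        n_episodes=5, n_max_steps=200)
print(stats)
```

With a real environment, the call would take `env` and `basic_policy` in place of the stub and the lambda.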

# Adding Neural Network

In [11]:
keras.backend.clear_session()
tf.random.set_seed(42)
np.random.seed(42)

In [12]:
n_inputs = 4
model = keras.models.Sequential([
    keras.layers.Dense(5, activation="elu", input_shape=[n_inputs]),
    keras.layers.Dense(1, activation="sigmoid"),
])

In [13]:
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense (Dense)                (None, 5)                 25        
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 6         
Total params: 31
Trainable params: 31
Non-trainable params: 0
_________________________________________________________________


# Game with Untrained Neural Network

We first observe how the neural network performs with absolutely no training, using randomly initialized weights to choose the agent's action at each step.

In [14]:
env.seed(42)
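def basic_policy_untrained(obs):
    # The network outputs the probability of pushing left (action 0).
    left_proba = model.predict(obs.reshape(1, -1))[0, 0]
    # Sample the action: 0 (left) with probability left_proba, otherwise 1 (right).
    action = int(np.random.rand() > left_proba)
    return action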

In [15]:
totals = []
for episode in range(50):
    episode_rewards = 0
    obs = env.reset()
    for step in range(200):
        action = basic_policy_untrained(obs)
        obs, reward, done, info = env.step(action)
        episode_rewards += reward
        if done:
            break
    totals.append(episode_rewards)

np.mean(totals), np.std(totals), np.min(totals), np.max(totals)

(27.16, 15.652935826866473, 9.0, 88.0)

The CartPole is quite unstable and wobbly.
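
The policy above does not pick the most likely action; it samples one. A uniform random draw is compared against the network output, so action 0 (left) is chosen with probability `left_proba`. The sketch below checks this empirically with NumPy only; the value 0.7 is an assumed network output chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
left_proba = 0.7  # assumed network output, for illustration only

# Same rule as in basic_policy_untrained: action 1 (right) when the random
# draw exceeds left_proba, otherwise action 0 (left).
actions = (rng.random(10_000) > left_proba).astype(int)

left_fraction = np.mean(actions == 0)
print(f"fraction of left actions: {left_fraction:.3f}")
```

The printed fraction should be close to `left_proba`, which is what keeps the agent exploring both actions.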

# Training the Neural Network

In [16]:
np.random.seed(42)
n_environments = 50
n_iterations = 5000

In [17]:
# Initializing 50 different cartpole environments
envs = [gym.make("CartPole-v1") for _ in range(n_environments)]

# Seed each environment with its index in the list above.
for index, env in enumerate(envs):
    env.seed(index)

In [18]:
observations = [env.reset() for env in envs]

In [19]:
optimizer = keras.optimizers.RMSprop()
loss_fn =  keras.losses.binary_crossentropy

In [20]:
for iteration in range(n_iterations):
    # if angle < 0, want proba(left) = 1., or else proba(left) = 0.
    target_probas = np.array([([1.] if obs[2] < 0 else [0.])
                              for obs in observations])

    with tf.GradientTape() as tape:
        left_probas = model(np.array(observations))
        loss = tf.reduce_mean(loss_fn(target_probas, left_probas))
    print("\rIteration: {}, Loss: {:.3f}".format(iteration, loss.numpy()), end="")
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))

    actions = (np.random.rand(n_environments, 1) > left_probas.numpy()).astype(np.int32)
    for env_index, env in enumerate(envs):
        obs, reward, done, info = env.step(actions[env_index][0])
        observations[env_index] = obs if not done else env.reset()

Iteration: 4999, Loss: 0.094
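
To make the training target concrete, here is a minimal NumPy sketch of the loss computed in the loop above. The `binary_crossentropy` function is a simplified re-implementation written for illustration, not the Keras code itself, and the angles and probabilities are made-up values.

```python
import numpy as np

def binary_crossentropy(y_true, p, eps=1e-7):
    """Element-wise binary cross-entropy (illustrative re-implementation)."""
    p = np.clip(p, eps, 1 - eps)  # avoid log(0)
    return -(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

# Made-up pole angles for two environments.
angles = np.array([-0.05, 0.02])

# Same rule as in the training loop: angle < 0 -> target proba(left) = 1, else 0.
target_probas = (angles < 0).astype(float)

# Assumed network outputs for these two observations.
left_probas = np.array([0.9, 0.2])

losses = binary_crossentropy(target_probas, left_probas)
print(losses, losses.mean())
```

The loss shrinks as the predicted probability of going left approaches the target, so the network is being trained to imitate the hard-coded policy.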

The network seems to have learned the policy better. Next, we will work on making the pole less wobbly. One way to do this is to let the agent explore and learn a better policy for itself, so we will now modify the algorithm so that the network learns its own policy.

# Implementing Policy Gradients

## Defining play_one_step function

In [25]:
def play_one_step(env, obs, model, loss_fn):
    with tf.GradientTape() as tape:
        left_proba = model(obs[np.newaxis])
        action = (tf.random.uniform([1, 1]) > left_proba)
        y_target = tf.constant([[1.]]) - tf.cast(action, tf.float32)
        loss = tf.reduce_mean(loss_fn(y_target, left_proba))
    grads = tape.gradient(loss, model.trainable_variables)
    obs, reward, done, info = env.step(int(action[0, 0].numpy()))
    return obs, reward, done, grads
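
The choice of `y_target` in `play_one_step` can be read as follows. The notation here is introduced for illustration: $p$ is the network output (the probability of going left), $a \in \{0, 1\}$ is the sampled action, $y = 1 - a$ is `y_target`, and $\pi(a \mid s)$ is the probability the policy assigns to the action that was actually taken. Ignoring numerical clipping, the binary cross-entropy is

$$
L = -\bigl[\, y \log p + (1 - y) \log (1 - p) \,\bigr]
$$

$$
a = 0 \;\Rightarrow\; y = 1,\; L = -\log p
\qquad\qquad
a = 1 \;\Rightarrow\; y = 0,\; L = -\log (1 - p)
$$

In both cases $L = -\log \pi(a \mid s)$, so the gradients returned by the function are the gradients that would make the chosen action more likely. They are not applied immediately: they are later weighted by how good the action turned out to be.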

## Defining play_multiple_episodes function

In [26]:
def play_multiple_episodes(env, n_episodes, n_max_steps, model, loss_fn):
    all_rewards = []
    all_grads = []
    for episode in range(n_episodes):
        current_rewards = []
        current_grads = []
        obs = env.reset()
        for step in range(n_max_steps):
            obs, reward, done, grads = play_one_step(env, obs, model, loss_fn)
            current_rewards.append(reward)
            current_grads.append(grads)
            if done:
                break
        all_rewards.append(current_rewards)
        all_grads.append(current_grads)
    return all_rewards, all_grads

## Defining the discount function and normalizing function

In [27]:
def discount_rewards(rewards, discount_rate):
    discounted = np.array(rewards, dtype=float)  # float dtype avoids truncation if rewards are ints
    for step in range(len(rewards) - 2, -1, -1):
        discounted[step] += discounted[step + 1] * discount_rate
    return discounted
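
A small worked example helps check the discounting logic. The sketch below repeats the function so it runs on its own, with `dtype=float` added as a precaution against integer truncation; the rewards and the discount rate of 0.8 are made-up values chosen so the arithmetic is easy to follow.

```python
import numpy as np

def discount_rewards(rewards, discount_rate):
    # float dtype so that fractional discounted values are not truncated
    discounted = np.array(rewards, dtype=float)
    # Walk backwards: each step adds the discounted return of the next step.
    for step in range(len(rewards) - 2, -1, -1):
        discounted[step] += discounted[step + 1] * discount_rate
    return discounted

result = discount_rewards([10, 0, -50], discount_rate=0.8)
print(result)
# Last step:   -50
# Middle step:   0 + 0.8 * (-50) = -40
# First step:   10 + 0.8 * (-40) = -22
```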

In [28]:
def discount_and_normalize_rewards(all_rewards, discount_rate):
    all_discounted_rewards = [discount_rewards(rewards, discount_rate)
                            for rewards in all_rewards]
    flat_rewards = np.concatenate(all_discounted_rewards)
    reward_mean = flat_rewards.mean()
    reward_std = flat_rewards.std()
    return [(discounted_rewards - reward_mean) / reward_std
            for discounted_rewards in all_discounted_rewards]
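
Normalization is applied across all episodes at once, so actions from better-than-average episodes end up with positive weights and the rest with negative ones. The sketch below is self-contained (both functions are repeated) and uses two made-up episodes with a discount rate of 0.8.

```python
import numpy as np

def discount_rewards(rewards, discount_rate):
    discounted = np.array(rewards, dtype=float)
    for step in range(len(rewards) - 2, -1, -1):
        discounted[step] += discounted[step + 1] * discount_rate
    return discounted

def discount_and_normalize_rewards(all_rewards, discount_rate):
    all_discounted = [discount_rewards(rewards, discount_rate)
                      for rewards in all_rewards]
    flat = np.concatenate(all_discounted)
    # One shared mean and standard deviation for every step of every episode.
    return [(d - flat.mean()) / flat.std() for d in all_discounted]

# Two made-up episodes: a bad one and a good one.
normalized = discount_and_normalize_rewards([[10, 0, -50], [10, 20]],
                                            discount_rate=0.8)
flat = np.concatenate(normalized)
print(normalized)
print(flat.mean(), flat.std())
```

In this toy case every step of the first episode gets a negative weight and every step of the second a positive one, and the pooled values have mean 0 and standard deviation 1 up to rounding.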

# Training with Policy Gradients

In [29]:
keras.backend.clear_session()
tf.random.set_seed(42)
np.random.seed(42)

In [30]:
n_episodes_per_update = 10
n_iterations = 150
n_max_steps = 200
discount_rate = 0.95
n_inputs = 4

In [31]:
optimizer = keras.optimizers.Adam(learning_rate=0.01)
loss_fn = keras.losses.binary_crossentropy

In [32]:
def nn_policy_gradient(model, n_iterations, n_episodes_per_update, n_max_steps, loss_fn):
    env = gym.make("CartPole-v1")
    env.seed(42)

    for iteration in range(n_iterations):
        all_rewards, all_grads = play_multiple_episodes(
            env, n_episodes_per_update, n_max_steps, model, loss_fn)
        total_rewards = sum(map(sum, all_rewards))
        print("\rIteration: {}, mean rewards: {:.1f}".format(
            iteration, total_rewards / n_episodes_per_update), end="")
        all_final_rewards = discount_and_normalize_rewards(all_rewards,
                                                        discount_rate)
        all_mean_grads = []
        for var_index in range(len(model.trainable_variables)):
            mean_grads = tf.reduce_mean(
                [final_reward * all_grads[episode_index][step][var_index]
                for episode_index, final_rewards in enumerate(all_final_rewards)
                    for step, final_reward in enumerate(final_rewards)], axis=0)
            all_mean_grads.append(mean_grads)
        optimizer.apply_gradients(zip(all_mean_grads, model.trainable_variables))

    env.close()  # release the environment before returning

    return model
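
The nested comprehension that builds the mean gradients can be hard to read. Below is a minimal NumPy sketch of the same indexing pattern on toy data: two episodes, a single "variable" of shape (2,), and made-up gradients and weights. It uses `np.mean` in place of `tf.reduce_mean`, so it illustrates the weighting rather than the TensorFlow code itself.

```python
import numpy as np

# all_grads[episode][step][var_index] -> gradient array (made-up numbers)
all_grads = [
    [[np.array([1.0, 0.0])], [np.array([0.0, 1.0])]],  # episode 0: 2 steps
    [[np.array([2.0, 2.0])]],                          # episode 1: 1 step
]
# all_final_rewards[episode][step] -> normalized discounted reward (made-up)
all_final_rewards = [np.array([0.5, -1.0]), np.array([1.5])]

n_variables = 1
all_mean_grads = []
for var_index in range(n_variables):
    # Weight each step's gradient by that step's final reward ...
    weighted = [final_reward * all_grads[episode_index][step][var_index]
                for episode_index, final_rewards in enumerate(all_final_rewards)
                for step, final_reward in enumerate(final_rewards)]
    # ... then average over every step of every episode.
    all_mean_grads.append(np.mean(weighted, axis=0))

print(all_mean_grads[0])
```

Steps with a negative weight flip the sign of their gradient, so applying the averaged gradient makes those actions less likely and the positively weighted ones more likely.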

In [33]:
model = keras.models.Sequential([
    keras.layers.Dense(5, activation="elu", input_shape=[n_inputs]),
    keras.layers.Dense(1, activation="sigmoid"),
])

In [34]:
model = nn_policy_gradient(model, n_iterations, n_episodes_per_update, n_max_steps, loss_fn)

Iteration: 149, mean rewards: 191.4

In [35]:
totals = []
for episode in range(20):
    print("Episode:",episode)
    episode_rewards = 0
    obs = env.reset()
    for step in range(200):
        # basic_policy_untrained reads the global model, which is now the trained network
        action = basic_policy_untrained(obs)
        obs, reward, done, info = env.step(action)
        episode_rewards += reward
        if done:
            break
    totals.append(episode_rewards)

np.mean(totals), np.std(totals), np.min(totals), np.max(totals)

Episode: 0
Episode: 1
Episode: 2
Episode: 3
Episode: 4
Episode: 5
Episode: 6
Episode: 7
Episode: 8
Episode: 9
Episode: 10
Episode: 11
Episode: 12
Episode: 13
Episode: 14
Episode: 15
Episode: 16
Episode: 17
Episode: 18
Episode: 19


(196.3, 16.12792609110049, 126.0, 200.0)

##### Observations
The maximum is 200 steps, meaning that the agent won the game at least once. The minimum and the average number of steps for which the agent balanced the pole also improved significantly.