# OpenAI Gym
The research company OpenAI released a super useful python package for RL called Gym. OpenAI's Gym is a collection of environments or games which researchers can use to train RL agents with. All of the environments adhere to  the same simple API which is not unlike the one I described grid-world notebook. In this notebook I will walk you through an example of how to use the OpenAI Gym. 

## Installing the gym python package
We install gym by running the following cell.

In [1]:
# Install OpenAi Gym
!pip install gym



## CartPole
The environment we will consider first is a simple game called CartPole. In some sense, CartPole is like the "HelloWorld" of RL, it is a right of passage for all RL researchers. 

The challenge is for the agent to balance a pole on a cart by moving the cart left and right. Below is an image of the environment setup. 

![CartPole](../images/cartpole.png)

### State and Action Space
In CartPole the agent can choose from two discreet actions: "move left" and "move right". The state of the environment is given by a vector with four values which represent the "horisontal position" and "horisontal velocity" of the cart, and "angular position" and "angular velocity" of the pole. It is clear that an agent with access to these 4 values should be able to optimally control the cart.

Lets instantiate the environment and look at the shapes of the states and actions.

In [2]:
import gym

# Instantiate the environment
environment = gym.make("CartPole-v1")

print("State_space:", environment.observation_space)
print("Action Space:", environment.action_space)

State_space: Box(-3.4028234663852886e+38, 3.4028234663852886e+38, (4,), float32)
Action Space: Discrete(2)


A `Box` is an internal Gym object which is used to describe a vector with bounded values. The first two numbers in the `Box` above indicate the maximum and minimum values the entries in the vector can take on. The third entry in the `Box` is the shape of the vector, in this case the state vector has length 4. Finally, the final entry in the `Box` gives the data type in the state vector, in this case simply a floating point number. 

`Discreet` is another Gym object used to discribe a discreet action space. In this case the environment has two discreet actions, usally actions are given by integers starting at zero. So in this case the agent has action `0` and action `1`.

### Reset() and Sampling Random Actions
Lets now look at an example state vector and an example action. All Gym environments have a `reset()` method which resets the environment and returns the first observation. Remember to call this when ever your agent starts a new episode in the environment. Every environment also has a method to sample a valid random action from the action space.

In [3]:
observation = environment.reset()
print("Initial Observation:", observation)

action = environment.action_space.sample()
print("Random action:", action)

Initial Observation: [ 0.01980218 -0.01997417  0.0087345  -0.01244694]
Random action: 1


## The Step() Function
Gym environments has a `step()` method which takes an action as input and returns the resulting next state of the environment, as well as the agents reward and done flag. The done flag is a boolean which indicates if the agent has entered a terminal state or not. The step function alsoreturns some extra info, but we usually don't need to worry about these. Here is an example of how we can step() the environment.

In [4]:
# Sample a valid random action
action = environment.action_space.sample()

# Step the environment
next_observation, reward, done, info = environment.step(action)

print("Next Observation:", observation)
print("Reward:", reward)
print("Done:", done)

Next Observation: [ 0.01980218 -0.01997417  0.0087345  -0.01244694]
Reward: 1.0
Done: False


### Reward Function
Every environment in Gym has a different reward pattern. In CartPole it is very simple, at every timestep where the agent is alive it receives +1 reward. If the agent dies it gets a reward of 0. Thus the reward signal incentivises the agent to keep the pole balancing for as long as possible.

### The Agent-Environment Loop
We now have all of the components we need to make a random agent randomly choose actions in the environment.

In [5]:
# Always reset the environment at the start of an episode
observation = environment.reset()

# Store the sum of rewards in a variable called `episode_return`
episode_return = 0

# Initially set the `done` flag to False
done = False

# Loop until 'done' == True
while not done:
    # Agent chooses action
    # In this case we chose sample a random action
    action = environment.action_space.sample()

    # Step the environment
    next_observation, reward, done, info = environment.step(action)

    # Add the reward to `episode_return`
    episode_return += reward

    # Critical!!!
    # Set current observation to next observation
    observation = next_observation

print("Episode Return:", episode_return)


Episode Return: 18.0


You should try running the cell above a couple of times and see how variable the agents performance can be. You should find the random agent usually gets a episode return of around 20-40. The CartPole environment is considered solved if the agent can consistently get an epsiode return of 500.

To get an agent to solve the environment we could use RL. But it turns out that CartPole is such a simple environment that a simple monticarlo algorithm can solve it in a few hundred steps. So as an exercise lets solve CartPole in the simplest way I know how. 

### The Agent
Our agent is going to have an internal 4-vector of weights which we will use to compute the dot product with the state observation vector. If the result of the dot product is greater than zero the agent will chose to go right. If the result is less than zero the agent will chose to go left. 

We will randomly chose a set of weights and then evaluate it in the environment. Whenever we find a set of weights that are better than the current best set of weights, we will store them. Lets see if our simple agent can find a set of weights that solve the environment.

In [6]:
# We will need numpy for this
import numpy as np

class Agent():
    def __init__(self):
        self.weights = self.new_random_weights()

    def new_random_weights(self):
        # Center weights around zero
        # min value of -1 and max value +1
        self.weights = -1 + 2 * np.random.rand(4)
    
    def choose_action(self, observation):
        dot = np.matmul(self.weights, observation)

        action = 0 if dot < 0 else 1

        return action

### Montecarlo Algorithm
We intantiate the agent and let it interact with the environment. We store the best weights.


In [7]:
# We will need copy
import copy

# Agent
agent = Agent()

# Variable to keep track of best score and weights
best_score = 0
best_weights = None

# Maximum number of random trials
max_num_trials = 100000
for i in range(max_num_trials):
    
    # Get new weights for agent
    agent.new_random_weights()

    # Reset the environment
    observation = environment.reset()

    done = False
    score = 0
    while not done:
        # Agent chooses action
        action = agent.choose_action(observation)

        # Step environment
        next_observation, reward, done, info = environment.step(action)

        #Add reward to score
        score += reward

        # Critical!!!
        observation = next_observation

    # Check if score is new high score
    if score > best_score:
        # Store a copy of the agents weights
        best_weights = copy.deepcopy(agent.weights)
        # Store best score
        best_score = score
        # Print new best score
        print("New best score:", best_score)

    # Break out of loop if we found the optimal weights
    if best_score >= 500:
        print(f"Optimal weights found in {i} steps!!")
        break

New best score: 9.0
New best score: 500.0
Optimal weights found in 2 steps!!


Lets verify that the weights we found do solve the environment. We will let the agent interact with the environment 100 times and see what its average score is over all 100 runs.

In [8]:
# Load best weights into the agent
agent.weights = best_weights

scores = []
for i in range(100):
    done = False
    score = 0
    observation = environment.reset()
    while not done:
        # Agent chooses action
        action = agent.choose_action(observation)

        # Step environment
        next_observation, reward, done, info = environment.step(action)

        #Add reward to score
        score += reward

        # Critical!!!
        observation = next_observation

    scores.append(score)

print("Average Episode Return:", np.average(scores))

Average Episode Return: 428.22


As you can see, our agent is near optimal. So, CartPole is not the most difficult game to master.