# Using reinforcement learning to play the OpenAI CartPole game

OpenAI was founded in late 2015 as a non-profit with a mission to “build safe artificial general intelligence (AGI) and ensure AGI’s benefits are as widely and evenly distributed as possible.” In addition to exploring many issues regarding AGI, one major contribution OpenAI has made to the machine learning world is the development of the Gym and Universe software platforms. Gym is a collection of environments designed for testing and developing reinforcement learning algorithms. In this post we will train a neural network using reinforcement learning to play the CartPole game in a Gym environment.

Before moving forward, let's discuss how reinforcement learning (RL) actually works. In a nutshell, RL is the study of agents and how they learn by trial and error. It formalizes the idea that rewarding or punishing an agent for its behavior makes it more likely to repeat or forego that behavior in the future.

## Key Concepts and Terminology

Reinforcement learning, explained simply, is a computational approach in which an agent interacts with an environment by taking actions and tries to maximize the accumulated reward. The image below shows the basic concept of RL. An agent in the current state $S_t$ takes an action $A_t$, to which the environment responds by returning a new state $S_{t+1}$ and a reward $R_{t+1}$ to the agent. Given the updated state and reward, the agent chooses the next action, and the loop repeats until the environment is solved or terminated.

<img src="reinforcement-learning-fig1-700.jpg" alt="Drawing" style="width: 500px;"/>

##### Agent-environment interaction loop.
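
The loop in the figure can be sketched in a few lines of Python. This is a minimal illustration, not Gym itself: `ToyEnv`, `run_episode`, and the two policies are hypothetical names made up for this sketch, and the toy environment simply pays a reward of 1 whenever the agent picks action 1.

```python
import random


class ToyEnv:
    """Hypothetical stand-in for a Gym-style environment (not part of Gym)."""

    def __init__(self, max_steps=5):
        self.max_steps = max_steps
        self.t = 0

    def reset(self):
        # Start a new episode and return the initial state S_0.
        self.t = 0
        return self.t

    def step(self, action):
        # Advance one time step and return (S_{t+1}, R_{t+1}, done).
        self.t += 1
        reward = 1.0 if action == 1 else 0.0
        done = self.t >= self.max_steps
        return self.t, reward, done


def run_episode(env, policy):
    """Run the agent-environment loop until the episode terminates."""
    state = env.reset()
    total_reward = 0.0
    done = False
    while not done:
        action = policy(state)                  # agent picks A_t from S_t
        state, reward, done = env.step(action)  # environment returns S_{t+1}, R_{t+1}
        total_reward += reward                  # accumulate the reward
    return total_reward


def random_policy(state):
    # Ignores the state and picks 0 or 1 at random.
    return random.choice([0, 1])


def always_one(state):
    # Always picks action 1, the rewarded action in ToyEnv.
    return 1


print(run_episode(ToyEnv(), always_one))     # 5.0
print(run_episode(ToyEnv(), random_policy))  # somewhere between 0.0 and 5.0
```

The same shape (reset, then repeated step calls until done) is what the Gym code later in this post follows.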

## Cartpole from OpenAI Gym

Before going further, let us understand the game. The idea of CartPole is that a pole stands upright on top of a cart. The goal is to keep the pole balanced by moving the cart from side to side.

An episode counts as a success if we can balance the pole for 200 frames in CartPole-v0 or 500 frames in CartPole-v1. It ends in failure when the pole tilts more than 12 degrees from vertical or the cart moves past the edge of the frame.

For every frame in which the pole stays "balanced" (less than 12 degrees from vertical), our "score" increases by 1, and the target is a score of 200 or 500. Here we use CartPole-v1, which makes our target 500.
##### cartpole game
<img src="SmartShortClownanemonefish-size_restricted.gif" />

We're going to start by creating an agent. In the beginning, the agent just chooses actions (left or right) at random when it is placed in the CartPole environment. Our goal is to reach a score of 500 after training the model. First, we store the game information from every game in which the untrained agent scores at least 70 (the score_threshold set in the data-collection code), so that the agent can learn from those games. The input layer is the observation from the environment, an array holding the cart position, cart velocity, pole angle, and pole angular velocity. The output layer is the action: left or right.
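
To make the input and output formats concrete, here is a small sketch. The field names are labels chosen for this illustration (the order follows what Gym documents for the CartPole observation), and `describe_observation` and `one_hot` are hypothetical helpers, not functions from Gym or tflearn.

```python
# Labels for the four observation entries, in the order Gym documents
# for CartPole. The names themselves are chosen for this sketch.
OBSERVATION_FIELDS = (
    "cart_position",
    "cart_velocity",
    "pole_angle",
    "pole_angular_velocity",
)


def describe_observation(observation):
    """Pair each entry of a 4-element observation with a readable label."""
    return dict(zip(OBSERVATION_FIELDS, observation))


def one_hot(action, n_actions=2):
    """Encode an action index as a one-hot list: 0 -> [1, 0], 1 -> [0, 1]."""
    if not 0 <= action < n_actions:
        raise ValueError("action must be in range(n_actions)")
    return [1 if i == action else 0 for i in range(n_actions)]


sample = [0.02346698, 0.04203347, -0.04107528, 0.02749161]
print(describe_observation(sample))
print(one_hot(0), one_hot(1))  # [1, 0] [0, 1]
```

The one-hot vectors are what the network's two-unit output layer is trained to reproduce.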

In [11]:
#Loading required library
import warnings
with warnings.catch_warnings():  
    warnings.filterwarnings("ignore",category=FutureWarning)
    import gym
    import numpy as np
    import tflearn
    from tflearn.layers.core import input_data, dropout, fully_connected
    from tflearn.layers.estimator import regression
    from statistics import median, mean
    from collections import Counter
    import random

In [2]:
#setting up openai environment for cartpole game
cartpole_env = gym.make("CartPole-v1") #loading cartpole version 1
cartpole_env.reset()  #resetting the environment

array([ 0.02346698,  0.04203347, -0.04107528,  0.02749161])

Let's move forward, play the game with an untrained agent, and observe the output. Here the output is the average score over the 10 games played by the agent, with each action chosen at random as 0 or 1 (move left or right). The agent is incapable of completing the game since its actions are taken at random.

In [3]:
def untrained_random_game():
    all_randomscores = []

    # Each episode is a single game; here the agent plays 10 games.
    for episode in range(10):
        cartpole_env.reset()
        randomscores = 0

        # Cap each game at 100 frames; random actions usually
        # end the game well before that.
        for t in range(100):
            # Display the environment (slow; comment out to speed things up).
            cartpole_env.render()

            # Sample a random action from the action space.
            # In this environment the action is 0 (left) or 1 (right).
            action = cartpole_env.action_space.sample()

            # Apply the action. step() returns the new observation, the
            # reward (+1 for every frame the game continues), done (True
            # when the game is over), and extra diagnostic info.
            observation, reward, done, info = cartpole_env.step(action)
            randomscores += reward

            if done:
                break
        all_randomscores.append(randomscores)
    average_score = sum(all_randomscores)/len(all_randomscores)    
    print('Random play average score over 10 games =', average_score)

In [4]:
untrained_random_game()

Random play average score over 10 games = 20.3


### Collecting training data

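Whenever the scene starts over during play, the environment has reported "done", meaning the game was over. To build the training set, we play many random games and keep the observation-action pairs only from the games whose score reaches score_threshold.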
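
The selection step on its own can be sketched on toy data. This is only an illustration of the filtering idea: `select_training_pairs` and the `(score, pairs)` episode layout are hypothetical and chosen for this sketch, and the placeholder strings stand in for real observation arrays.

```python
def select_training_pairs(episodes, score_threshold):
    """Keep (observation, action) pairs only from episodes that scored well.

    `episodes` is a list of (score, pairs) tuples, a layout made up
    for this sketch.
    """
    selected = []
    for score, pairs in episodes:
        if score >= score_threshold:
            selected.extend(pairs)
    return selected


episodes = [
    (12.0, [("obs_a", 0), ("obs_b", 1)]),  # below the threshold, dropped
    (75.0, [("obs_c", 1), ("obs_d", 1)]),  # kept
    (70.0, [("obs_e", 0)]),                # exactly at the threshold, kept
]

kept = select_training_pairs(episodes, score_threshold=70)
print(kept)  # [('obs_c', 1), ('obs_d', 1), ('obs_e', 0)]
```

Nothing here tells the agent how to reach a high score; the filter only decides which randomly played games are worth imitating.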

In [12]:
score_threshold = 70    #minimum score to add in training dataset
number_of_games = 20000   #number of games to play for data collection

# maximum number of frames in each game: an action is taken
# up to 500 times if the agent manages to play the whole game

highest_steps = 500 

# This function collects the data used to train the model

def training_data_collection():

    # [observation, one-hot action] pairs from all accepted games
    training_data = []

    # scores of all games
    scores = []

    # scores that met our threshold
    accepted_scores = []

    # play the number of games specified in number_of_games
    for _ in range(number_of_games):
        score = 0  # score of the individual game

        # [observation, action] pairs of the individual game
        game_memory = []

        # start a fresh game; reset() returns the initial observation
        prev_observation = cartpole_env.reset()

        # play at most highest_steps frames
        for _ in range(highest_steps):
            # choose a random action (0 or 1)
            action = random.randrange(0, 2)
            observation, reward, done, info = cartpole_env.step(action)

            # The observation returned by step() is the RESULT of the
            # action, so we pair the action with the observation the
            # agent saw BEFORE taking it.
            game_memory.append([prev_observation, action])
            prev_observation = observation
            score += reward
            if done:
                break

        # If the score reaches our threshold, save every move we made.
        # NOTE the reinforcement methodology here: all we're doing is
        # reinforcing the score, we're not trying to influence the
        # machine in any way as to HOW that score is reached.
        if score >= score_threshold:
            accepted_scores.append(score)
            for data in game_memory:
                # convert the action to one-hot
                # (this is the output layer for our neural network)
                if data[1] == 1:
                    output = [0, 1]
                elif data[1] == 0:
                    output = [1, 0]

                # save our training data
                training_data.append([data[0], output])

        # save the overall scores
        scores.append(score)

    # Save the training data for later use. dtype=object is needed because
    # each row pairs a 4-element observation with a 2-element action.
    training_data_save = np.array(training_data, dtype=object)
    np.save('saved.npy', training_data_save)

    # print basic stats of the accepted scores
    print('Average accepted score:', mean(accepted_scores))
    print('Median score for accepted scores:', median(accepted_scores))
    print(Counter(accepted_scores))

    return training_data


### Building a neural network architecture

We build a simple multilayer perceptron: the hidden layers use the ReLU activation function, the output layer uses softmax, and Adam is used as the optimizer.
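
As a rough illustration of what these pieces compute, here is a plain-Python sketch of ReLU, softmax, and the categorical cross-entropy loss for a single sample. The function names are chosen for this sketch and are not tflearn's API; tflearn computes the same quantities on whole batches of tensors.

```python
import math


def relu(x):
    """ReLU activation: pass positive values through, clip negatives to 0."""
    return max(0.0, x)


def softmax(logits):
    """Turn raw output scores into probabilities that sum to 1."""
    largest = max(logits)
    exps = [math.exp(z - largest) for z in logits]  # shift for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


def categorical_crossentropy(target, predicted, eps=1e-12):
    """Cross-entropy between a one-hot target and predicted probabilities."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(target, predicted))


probs = softmax([0.0, 0.0])                     # an undecided network
loss = categorical_crossentropy([1, 0], probs)  # target action: left
print(probs, loss)
```

With two actions, an undecided prediction of 0.5 for each gives a loss of ln 2, about 0.693, which is a useful reference point when reading a training log.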

In [6]:
LR = 1e-3 # learning rate
# the learning rate scales how far the optimizer moves the weights on each update

def neural_network_model(input_size):

    network = input_data(shape=[None, input_size, 1], name='input')

    network = fully_connected(network, 128, activation='relu')
    network = dropout(network, 0.8)

    network = fully_connected(network, 256, activation='relu')
    network = dropout(network, 0.8)

    network = fully_connected(network, 512, activation='relu')
    network = dropout(network, 0.8)
    
    network = fully_connected(network, 512, activation='relu')
    network = dropout(network, 0.8)
        
    network = fully_connected(network, 256, activation='relu')
    network = dropout(network, 0.8)

    network = fully_connected(network, 128, activation='relu')
    network = dropout(network, 0.8)

    network = fully_connected(network, 2, activation='softmax')
    network = regression(network, optimizer='adam', learning_rate=LR, loss='categorical_crossentropy', name='targets')
    model = tflearn.DNN(network, tensorboard_dir='log')

    return model

# The optimizer (Adam, a variant of SGD) trains the model by minimizing the loss between the
# actual and the predicted outputs for the given training samples

### Training the neural network model

In [7]:
def train_model(training_data, model=False):

    # the observation of each frame is the input
    X = np.array([i[0] for i in training_data]).reshape(-1, len(training_data[0][0]), 1)

    # the action taken for each observation is the target
    y = [i[1] for i in training_data]

    if not model:
        model = neural_network_model(input_size=len(X[0]))

    # The number of epochs is a hyperparameter that defines the number of times the
    # learning algorithm works through the entire training dataset. One epoch means
    # that each sample in the training dataset has had an opportunity to update the
    # internal model parameters.
    model.fit({'input': X}, {'targets': y}, n_epoch=4, snapshot_step=500,
              show_metric=True, run_id='cartpole-game')
    return model

In [8]:
training_data = training_data_collection()
model = train_model(training_data)
# model.save("cartpolev1")  # This saves the model for later use and retraining

Training Step: 679  | total loss: 0.64349 | time: 1.704s
| Adam | epoch: 004 | loss: 0.64349 - acc: 0.6212 -- iter: 10816/10833
Training Step: 680  | total loss: 0.64992 | time: 1.714s
| Adam | epoch: 004 | loss: 0.64992 - acc: 0.6153 -- iter: 10833/10833
--


### Play cartpole with trained model

We are now going to use our trained agent to play the game and save the resulting stats. Instead of taking random actions, the agent takes the action that the trained neural network predicts for the current observation.
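
Choosing the predicted action comes down to taking the index of the largest output probability. A minimal sketch, assuming the model returns one probability per action: `greedy_action` and `fake_predict` are hypothetical names, and `fake_predict` is only a stand-in for the trained network.

```python
def greedy_action(probabilities):
    """Return the index of the most probable action (first one on ties)."""
    return max(range(len(probabilities)), key=lambda i: probabilities[i])


def fake_predict(observation):
    # Hypothetical stand-in for the trained model: it favors pushing right
    # (action 1) when the pole angle, entry 2 of the observation, is positive.
    pole_angle = observation[2]
    return [0.3, 0.7] if pole_angle > 0 else [0.7, 0.3]


leaning_right = [0.0, 0.0, 0.05, 0.0]
leaning_left = [0.0, 0.0, -0.05, 0.0]
print(greedy_action(fake_predict(leaning_right)))  # 1
print(greedy_action(fake_predict(leaning_left)))   # 0
```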

In [10]:
scores = []
choices = []
for each_game in range(100):
    score = 0
    game_memory = []
    prev_obs = []
    cartpole_env.reset()
    for _ in range(highest_steps):
#         cartpole_env.render()    #this displays the game played by agent

        if len(prev_obs)==0:
            action = random.randrange(0,2)
        else:
            action = np.argmax(model.predict(prev_obs.reshape(-1,len(prev_obs),1))[0])

        choices.append(action)
                
        new_observation, reward, done, info = cartpole_env.step(action)
        prev_obs = new_observation
        game_memory.append([new_observation, action])
        score+=reward
        if done: 
            break
    scores.append(score)

print('Average Score:',sum(scores)/len(scores))
print("Minimum score", min(scores))
print("Maximum score", max(scores))

Average Score: 499.44
Minimum score 475.0
Maximum score 500.0


We can see the difference between the untrained agent and the agent trained with this reinforcement learning technique. To talk more specifically about what RL does, we need to introduce additional terminology. We need to talk about

1. states and observations,
2. action spaces,
3. policies,
4. trajectories,
5. different formulations of return,
6. the RL optimization problem,
7. and value functions.

These terms are explained in detail in OpenAI's Spinning Up introduction to RL:
https://spinningup.openai.com/en/latest/spinningup/rl_intro.html
