## Cart-Pole Balancing using DQNs
In this assignment we will balance a cart-pole using deep learning. We will build an agent that, given the current state of the environment, predicts which action will lead to the best outcome. We are going to implement the two core pieces of DQNs: the epsilon-greedy algorithm and memory replay.

In this assignment we will use the OpenAI Gym library to set up the game environment. Most of the game-playing interface is already provided by Gym. Our task is to implement the agent and fix up the training. As the agent plays the game, you should see its score increase in the training loop. We are aiming for a score of 100 or above.

In [1]:
# If you are running this practice on your machine, make sure to install gym and gym[atari]. Depending on your python 
# env, this could be done using pip install, or conda install etc. 
!pip3 install --user gym gym[atari]

only teacher can use pip3


In [1]:
from collections import deque
import numpy as np
import random

import gym
from keras.layers import Dense
from keras.models import Sequential
from keras.optimizers import Adam

Using TensorFlow backend.


The class below creates a deep Q-network (DQN) with a specific architecture, along with the relevant parameters (epsilon, gamma, etc.). In addition, it sets up the remember, act, and replay methods, which are required during the training phase.

In [2]:
class DQNAgent:
    def __init__(self, state_size, action_size):
        self.state_size = state_size
        self.action_size = action_size
        self.memory = deque(maxlen=2000)
        self.gamma = 0.95   # discount rate
        self.epsilon = 1.0  # exploration rate
        self.epsilon_min = 0.01
        self.epsilon_decay = 0.995
        self.learning_rate = 0.001
        self.model = self._build_model()

    def _build_model(self):
        model = Sequential()        
        # Input layer has dimension of self.state_size
        model.add(Dense(16, activation='relu', input_shape=(self.state_size,)))
        # Intermediate layer
        model.add(Dense(16, activation='relu'))
        # Output layer has dimension of self.action_size
        model.add(Dense(self.action_size, activation='linear'))
        # Compile model
        optimizer = Adam(lr=self.learning_rate)
        model.compile(optimizer=optimizer, loss='mse')
        
        return model

    def remember(self, state, action, reward, next_state, done):
        # Add tuple to queue
        self.memory.append((state, action, reward, next_state, done))

    def act(self, state):
        # Choose the next action with epsilon-greedy logic:
        # with probability epsilon, return a random action;
        # with probability 1 - epsilon, return the action the model predicts.
        if np.random.rand() <= self.epsilon:
            # Return a random action (sampled here so the agent does not depend on the global env)
            return random.randrange(self.action_size)
        else:
            # Return predicted action
            act_values = self.model.predict(state)
            return np.argmax(act_values[0])

    def replay(self, batch_size):
        # We'll sample from our memories and get a handful of them and store them in minibatch 
        minibatch = random.sample(self.memory, batch_size)
        for state, action, reward, next_state, done in minibatch:
            target = reward
            if not done:
                # Calculate the total discounted reward according to the Q-Learning formula
                # target = current_reward + discounted maximum value obtained by next state
                target = reward + self.gamma * np.amax(self.model.predict(next_state)[0])

            target_f = self.model.predict(state)
            target_f[0][action] = target
            self.model.fit(state, target_f, epochs=1, verbose=0)
            
        # Decay the epsilon value 
        if self.epsilon > self.epsilon_min:
            self.epsilon *= self.epsilon_decay

    def load(self, name):
        self.model.load_weights(name)

    def save(self, name):
        self.model.save_weights(name)
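
The `remember` and `replay` methods above rely on two standard-library behaviours: a `deque` with `maxlen` silently drops its oldest entries once it is full, and `random.sample` draws a minibatch without replacement. The following is a minimal sketch of that behaviour in isolation. The capacity of 5 and the dummy transitions are assumptions chosen for illustration only; the agent itself uses a capacity of 2000.

```python
from collections import deque
import random

# Small capacity so the eviction is easy to see (the agent above uses 2000).
memory = deque(maxlen=5)

# Store 8 dummy transitions of the form (state, action, reward, next_state, done).
for step in range(8):
    memory.append((step, 0, 1.0, step + 1, False))

# Only the 5 most recent transitions survive; steps 0-2 have been evicted.
oldest_state = memory[0][0]
newest_state = memory[-1][0]

# Draw a minibatch without replacement, as replay() does.
minibatch = random.sample(list(memory), 3)

print(len(memory), oldest_state, newest_state, len(minibatch))
```

Because old transitions are evicted automatically, the agent trains mostly on recent experience once more than 2000 steps have been played.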

The code below creates an environment for our game and an instance of a DQNAgent object, then trains the model to play the game over a series of 40 episodes. In each episode, the game proceeds for as long as the pole remains upright, up to a maximum of 500 time steps. This is reflected in the 'time' variable, which is the for loop iterator and is reported as the score.

Early on, the actions taken by the agent are mostly random (this corresponds to high values of epsilon). Over time, the value of epsilon decreases, and more of the actions taken by the agent are predicted by the DQN (they become more exploitative and less explorative). In theory, once the model has been trained for a sufficient length of time, these actions should be advantageous; in other words, they should contribute to keeping the pole upright and letting the game play for longer.
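
To get a feel for how quickly exploration fades, the sketch below counts how many calls to `replay` it takes for epsilon to fall from 1.0 to `epsilon_min`, using the same values as the agent above. Note that `replay` runs once per time step (not once per episode) as soon as the memory holds more than `batch_size` entries, which is why epsilon already drops within the first few episodes. The variable names `n_closed` and `n_sim` are introduced here for illustration.

```python
import math

epsilon_start = 1.0
epsilon_min = 0.01
epsilon_decay = 0.995

# Closed form: smallest n with epsilon_start * epsilon_decay**n <= epsilon_min.
n_closed = math.ceil(math.log(epsilon_min / epsilon_start) / math.log(epsilon_decay))

# Same count obtained by simulating the update rule used in replay().
epsilon, n_sim = epsilon_start, 0
while epsilon > epsilon_min:
    epsilon *= epsilon_decay
    n_sim += 1

print(n_closed, n_sim)
```

Both counts come out at a little over 900 replay calls, so with these settings the agent is still exploring noticeably for the first several hundred time steps of training.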

In [8]:
if __name__ == "__main__":
    env = gym.make('CartPole-v1')
    
    # State size for CartPole game
    state_size = env.observation_space.shape[0]
    # Action size for CartPole game
    action_size = env.action_space.n
    
    print('State size: %i' % state_size)
    print('Action size: %i' % action_size)
    
    agent = DQNAgent(state_size, action_size)
    agent.model.summary()
    done = False
    batch_size = 32 # Feel free to play with these 
    EPISODES = 40   # You shouldn't really need more than 100 episodes to get a score of 100

    
    for eps in range(1,EPISODES+1):
        state = env.reset()
        state = np.reshape(state, [1, state_size])
        for time in range(500):
            
            # Get an action from the agent
            action = agent.act(state)
            # Send this action to the env and get the next_state, reward, done values
            next_state, reward, done, _ = env.step(action)
            
            # DO NOT CHANGE THE FOLLOWING 2 LINES 
            reward = reward if not done else -10
            next_state = np.reshape(next_state, [1, state_size])
            
            # Tell the agent to remember this memory
            agent.remember(state, action, reward, next_state, done)
            
            # DO NOT CHANGE BELOW THIS LINE
            state = next_state
            if done:
                print("episode: {}/{}, score: {}, eps: {:.2}".format(eps, EPISODES, time, agent.epsilon))
                break
            if len(agent.memory) > batch_size:
                agent.replay(batch_size)
        if eps % 10 == 0:
            agent.save("./cartpole-dqn.h5")

WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.
State size: 4
Action size: 2
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_12 (Dense)             (None, 16)                80        
_________________________________________________________________
dense_13 (Dense)             (None, 16)                272       
_________________________________________________________________
dense_14 (Dense)             (None, 2)                 34        
Total params: 386
Trainable params: 386
Non-trainable params: 0
_________________________________________________________________
episode: 1/40, score: 28, eps: 1.0
episode: 2/40, score: 17, eps: 0.93
episode: 3/40, score: 34, eps: 0.79
episode: 4/40, score: 14, eps: 0.73
episode: 5/40, score: 12, eps: 0.69
episode: 6/40, score: 9, eps: 0.66
episode: 7/40, score: 21, eps: 0.59
episode: 8/40, score:

The agent was successful, and the desired score of 100 was consistently reached by the end of the 40 training episodes. 

It is worth noting that the agent doesn't always succeed; sometimes the desired score of 100 is not achieved. I was curious about the success rate of the agent, so I set up a simple trial below. The training code above is repeated 10 times, each time with a fresh agent, and the fraction of trials in which the agent reaches 100 points is tracked.

In [3]:
success = 0
n = 10
env = gym.make('CartPole-v1')
for i in range(n):
    print('Starting trial %i...' % i)

    # State size for CartPole game
    state_size = env.observation_space.shape[0]
    # Action size for CartPole game
    action_size = env.action_space.n

    agent = DQNAgent(state_size, action_size)
    done = False
    batch_size = 32 # Feel free to play with these 
    EPISODES = 40   # You shouldn't really need more than 100 episodes to get a score of 100

    proceed = True
    for eps in range(1,EPISODES+1):
        if proceed:
            state = env.reset()
            state = np.reshape(state, [1, state_size])
            for time in range(500):

                # Get an action from the agent
                action = agent.act(state)
                # Send this action to the env and get the next_state, reward, done values
                next_state, reward, done, _ = env.step(action)

                # DO NOT CHANGE THE FOLLOWING 2 LINES 
                reward = reward if not done else -10
                next_state = np.reshape(next_state, [1, state_size])

                # Tell the agent to remember this memory
                agent.remember(state, action, reward, next_state, done)

                # DO NOT CHANGE BELOW THIS LINE
                state = next_state
                if done:
                    if time >= 100:
                        print('Score of at least 100 achieved at episode %i' % eps)
                        success += 1
                        proceed = False
                    break
                if len(agent.memory) > batch_size:
                    agent.replay(batch_size)
        else:
            break
            
    # proceed is still True only if no episode in this trial reached a score of 100
    if proceed:
        print('Maximum episodes reached, no high score...')

print('Percent successful trials: %4.2f' % (success/n * 100))

WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.
Starting trial 0...
Score of at least 100 achieved at episode 17
Starting trial 1...
Maximum episodes reached, no high score...
Starting trial 2...
Score of at least 100 achieved at episode 16
Starting trial 3...
Maximum episodes reached, no high score...
Starting trial 4...
Score of at least 100 achieved at episode 35
Starting trial 5...
Score of at least 100 achieved at episode 27
Starting trial 6...
Score of at least 100 achieved at episode 11
Starting trial 7...
Score of at least 100 achieved at episode 26
Starting trial 8...
Score of at least 100 achieved at episode 28
Starting trial 9...
Score of at least 100 achieved at episode 32
Percent successful trials: 80.00


It appears that 8 out of 10 trials (80%) resulted in a successful outcome. That is, much more often than not, the agent is able to score at least 100 points within 40 episodes.

## Conceptual Overview

The structure of the DQN created in this assignment is very straightforward. It contains an input layer with one node for each observed state variable. In this case there are four observations:
* The position of the cart
* The velocity of the cart
* The angle of the pole
* The rotation rate of the pole

After this, there are two dense layers of 16 nodes each, and each layer uses a ReLU activation. The specific number of nodes was determined through trial and error; using 16 nodes led to the highest success rate for the agent (80%, as described above).

The output layer contains a node for each of the actions that the agent can take. In this case there are only two actions:
* Move the cart to the left
* Move the cart to the right

The output layer uses a linear activation function, and the action with the largest value is taken as the chosen action.
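
The layer sizes described above can be checked against the parameter counts in the model summary. Each dense layer has one weight per input-output pair plus one bias per output unit. The helper `dense_params` below is a hypothetical name used only for this arithmetic check.

```python
def dense_params(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units

# 4 observations -> 16 -> 16 -> 2 actions
layer_sizes = [4, 16, 16, 2]
per_layer = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
total = sum(per_layer)

print(per_layer, total)  # [80, 272, 34] 386
```

These values match the 80, 272, and 34 parameters reported by `agent.model.summary()`, for 386 in total.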

The purpose of the network is to estimate, for each action, the total discounted reward the agent can expect if it takes that action in the current state. The action with the highest estimate is the one most likely to keep the pole upright.

The training of the network proceeds as follows. First, an action is sampled from the agent. Early on, this action is most likely random, but over time the network itself is increasingly used to predict the best action given the current environment. The value of the epsilon parameter is the probability of taking a random rather than a predicted action, and this probability decreases over time. The chosen action is taken, and the environment returns the next state, the reward, and whether or not the game has finished. This entire series of events is remembered by the agent as a 'memory'. If the game is not finished, and enough memories have been stored, a random batch of these memories is replayed to train the DQN, making it 'better' at playing the game over time. It is important to note that during this replay phase, the training target is the immediate reward plus the discounted value of the best action in the next state. With gamma = 0.95, future rewards count slightly less than immediate ones but still contribute, which matters here because the agent's goal is to keep the pole upright for a long time.
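
The target used in the replay step can be written out on its own. The following is a minimal sketch, assuming the same `gamma` as the agent; `q_target` is a hypothetical helper and the Q-values in `next_q` are made-up numbers, not output from the trained model.

```python
import numpy as np

gamma = 0.95  # discount rate used by the agent above

def q_target(reward, next_q_values, done, gamma=gamma):
    """Regression target for the action that was actually taken."""
    if done:
        # Terminal step: there is no next state to look ahead to.
        return reward
    # Immediate reward plus discounted value of the best next action.
    return reward + gamma * np.max(next_q_values)

# Hypothetical Q-value predictions for the next state (one per action).
next_q = np.array([1.2, 2.0])

mid_episode = q_target(1.0, next_q, done=False)  # 1.0 + 0.95 * 2.0
terminal = q_target(-10.0, next_q, done=True)    # reward only

print(mid_episode, terminal)
```

In `replay`, only the entry of the predicted Q-vector corresponding to the taken action is overwritten with this target; the other entry keeps the model's own prediction, so it contributes no error to the fit.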

Overall, I was pleasantly surprised that a simple network could be used to successfully play the CartPole game. The entire DQN had only 386 weights to train, which is minuscule compared to most networks we've seen in this course. I arrived at the network structure using trial and error; I found that using 8 or 24 nodes