# Reinforcement learning 

In this Python notebook, you will implement a simple reinforcement learning agent for the OpenAI Gym mountain car problem. Please first read the description of the task here: <A HREF="https://github.com/openai/gym/wiki/MountainCar-v0" TARGET="_blank">Description</A>

We will first experiment with the original formulation of the mountain car task. The class we will use is `CMC_original`, which is the same as the standard OpenAI Gym version, but with an adapted render function so that the graphics can be shown in a Binder notebook.
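To make the task concrete, here is a minimal, self-contained sketch of one time step of the mountain car dynamics, following the linked MountainCar-v0 description. It is not the `run_cart` code: the function name `mountain_car_step` is hypothetical, and the constants are the ones assumed for the standard Gym environment, so `CMC_original` may differ in details.

```python
import math

# Constants assumed from the standard MountainCar-v0 environment;
# run_cart.CMC_original may use different values.
MIN_POSITION, MAX_POSITION = -1.2, 0.6
MAX_SPEED = 0.07
GOAL_POSITION = 0.5
FORCE, GRAVITY = 0.001, 0.0025


def mountain_car_step(position, velocity, action):
    """Advance one time step. action: 0 = push left, 1 = no push, 2 = push right.

    Returns (position, velocity, reward, done).
    """
    velocity += (action - 1) * FORCE - GRAVITY * math.cos(3 * position)
    velocity = max(-MAX_SPEED, min(MAX_SPEED, velocity))
    position += velocity
    position = max(MIN_POSITION, min(MAX_POSITION, position))
    # The left wall is inelastic: the car stops when it hits it.
    if position == MIN_POSITION and velocity < 0:
        velocity = 0.0
    done = position >= GOAL_POSITION
    reward = -1.0  # every step costs -1, so reaching the goal sooner scores higher
    return position, velocity, reward, done


# Pushing right from rest near the valley floor gains only a little speed.
pos, vel, reward, done = mountain_car_step(-0.5, 0.0, 2)
print(pos, vel, reward, done)
```

With a reward of -1 per step and the usual limit of 200 steps per episode, an agent that never reaches the goal ends with a total reward of -200.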

## Random agent
First run the following block, which runs an agent that takes random actions.

In [1]:
%matplotlib inline

import os
import matplotlib
from matplotlib import pyplot as plt
import run_cart
import gym
import numpy as np
import random

class random_agent(object):
    """Random agent"""

    def act(self, observation, reward, done):
        return random.randint(0,2)
    
agent = random_agent()
reward, rewards = run_cart.run_cart_discrete(agent, env=run_cart.CMC_original(), graphics=True)
print('Reward = ' + str(reward))

Reward = -200.0


## Heuristic agent
Now let's try a heuristic agent, which uses a simple decision tree based on the position and velocity:

NOTE: MAKE THIS AN ASSIGNMENT FOR THE STUDENTS

In [2]:
class heuristic_agent(object):
    """Guido's heuristic agent"""

    def act(self, observation, reward, done):
        position = observation[0]
        velocity = observation[1]
        if position < -0.5:
            if velocity < -0.01:
                action = 0
            else:
                action = 2
        else:
            if velocity > 0.01:
                action = 2
            else:
                action = 0
        return action
    
agent = heuristic_agent()
reward, rewards = run_cart.run_cart_discrete(agent, env=run_cart.CMC_original(), graphics=True)
print('Reward = ' + str(reward))

Episode 0: Success!
Reward = -109.0


## Q-learning
Now all we need to do is have Q-learning find the simple heuristic by itself. We will start with the simplest form of tabular Q-learning in combination with the original mountain car task and reward function.
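Tabular Q-learning needs a finite set of states, so the continuous position and velocity have to be mapped to grid cells first. The following is a minimal sketch of such a mapping, assuming a 10 x 10 grid and the standard MountainCar-v0 bounds; the function name `discretize` and its default arguments are hypothetical and only serve as illustration.

```python
def discretize(position, velocity, n_grid=10,
               min_position=-1.2, max_position=0.6,
               min_speed=-0.07, max_speed=0.07):
    """Map a continuous (position, velocity) pair to a single table row index."""
    def to_bin(value, low, high):
        value = max(low, min(high, value))  # clamp into [low, high]
        step = (high - low) / n_grid
        # A value exactly on the upper bound would fall into bin n_grid,
        # so cap the index at the last valid bin.
        return min(int((value - low) // step), n_grid - 1)

    position_bin = to_bin(position, min_position, max_position)
    velocity_bin = to_bin(velocity, min_speed, max_speed)
    return position_bin * n_grid + velocity_bin


print(discretize(-1.2, -0.07))  # lowest position and speed -> 0
print(discretize(0.6, 0.07))    # highest position and speed -> 99
```

Capping the bin index matters: without it, an observation that sits exactly on an upper bound can produce an index outside the Q-table.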

TODO: HAVE THE STUDENTS FILL IN THE Q-UPDATE FUNCTION

TODO: HAVE THE STUDENTS PLAY WITH THE LEARNING SETUP
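As a worked example of the update rule, the sketch below applies one tabular Q-learning step, Q(s, a) += alpha * (r + gamma * max_a' Q(s', a') - Q(s, a)), to an all-zero table. The helper name `q_update` and the state indices are hypothetical; only the formula itself is the standard Q-learning update.

```python
import numpy as np


def q_update(Q, state, action, reward, next_state, alpha=0.1, gamma=0.9):
    """Apply one tabular Q-learning update in place and return the new value."""
    td_target = reward + gamma * np.max(Q[next_state])
    Q[state, action] += alpha * (td_target - Q[state, action])
    return Q[state, action]


Q = np.zeros((100, 3))  # 100 discrete states, 3 actions
value = q_update(Q, state=35, action=2, reward=-1.0, next_state=36,
                 alpha=0.2, gamma=0.95)
print(value)  # 0 + 0.2 * (-1 + 0.95 * 0 - 0) = -0.2
```

Because every step gives a reward of -1, all visited entries of an initially zero table become negative, which by itself pushes the greedy policy towards state-action pairs it has not tried yet.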

In [3]:
class Q_learning_agent(object):
    """Simple Q-learning agent for the MountainCar-v0 task
       https://en.wikipedia.org/wiki/Q-learning
    """

    n_actions = 3

    def __init__(self, min_speed, max_speed, min_position, max_position, alpha = 0.1, gamma = 0.9, p_explore = 0.1):
        
        # number of grid cells (bins) per state variable
        self.n_grid = 10
        self.min_speed = min_speed
        self.max_speed = max_speed
        self.speed_step = (max_speed - min_speed) / self.n_grid
        self.min_position = min_position
        self.max_position = max_position
        self.position_step = (max_position - min_position) / self.n_grid
        # discretizing the 2-variable state results in this number of states:
        self.n_states = int(self.n_grid**2)
        # make an empty Q-matrix
        self.Q = np.zeros([self.n_states, self.n_actions])
        #self.Q = np.random.rand(self.n_states, self.n_actions)
        # initialize previous state and action
        self.previous_state = 0
        self.previous_action = 0
        # learning rate
        self.alpha = alpha
        # discount factor:
        self.gamma = gamma
        # e-greedy, p_explore results in a random action:
        self.p_explore = p_explore

    def act(self, observation, reward, done, verbose = False):
        
        # Determine the new state by discretizing position and velocity:
        pos = observation[0]
        if(pos > self.max_position):
            pos = self.max_position
        elif(pos < self.min_position):
            pos = self.min_position
        # min(...) keeps a value on the upper bound inside the last bin,
        # so the state index can never run past the end of the Q-matrix
        obs_pos = min(int((pos - self.min_position) // self.position_step), self.n_grid - 1)
        vel = observation[1]
        if(vel > self.max_speed):
            vel = self.max_speed
        elif(vel < self.min_speed):
            vel = self.min_speed
        obs_vel = min(int((vel - self.min_speed) // self.speed_step), self.n_grid - 1)
        new_state = obs_pos * self.n_grid + obs_vel
        
        if(verbose):
            print(f'Velocity {observation[1]}, position {observation[0]}, '
                  f'(grid {self.speed_step}, {self.position_step}), state = {new_state}')
        
        # Update the Q-matrix:
        self.Q[self.previous_state, self.previous_action] +=  self.alpha * \
            (reward + self.gamma * max(self.Q[new_state, :]) - self.Q[self.previous_state, self.previous_action])
        
        # determine the new action:
        if(random.random() < self.p_explore):
            action = random.randint(0, self.n_actions-1)
            #print(f'random action: {action:d}')
        else:
            action = np.argmax(self.Q[new_state, :])
            #print(f'action: {action:d}')
        
        # update previous state and action
        self.previous_state = new_state
        self.previous_action = action        
        
        # return the action
        return action

    
env=run_cart.CMC_original()

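# phase 1: purely random behaviour (p_explore = 1); Q-learning is off-policy,
# so it can still learn the values of the greedy policy from random actions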
max_velocity = env.max_speed
min_velocity = -max_velocity
agent = Q_learning_agent(min_velocity, max_velocity, env.min_position, env.max_position, \
                         alpha = 0.20, gamma = 0.95, p_explore = 1.0)
n_episodes = 1000
reward, rewards = run_cart.run_cart_discrete(agent, env=env, graphics=False, n_episodes=n_episodes)
print('Reward per episode = ' + str(reward / n_episodes))

# phase 2: mostly greedy behaviour, with e-greedy exploration
agent.p_explore = 0.05
reward, rewards = run_cart.run_cart_discrete(agent, env=env, graphics=False, n_episodes=n_episodes)
print('Reward per episode = ' + str(reward / n_episodes))

# phase 3: fewer episodes with a smaller learning rate and less exploration
n_episodes = 100
agent.alpha = 0.05
agent.p_explore = 0.02
reward, rewards = run_cart.run_cart_discrete(agent, env=env, graphics=False, n_episodes=n_episodes)
print('Reward per episode = ' + str(reward / n_episodes))


Reward per episode = -200.0
Episode 231: Success!
Episode 272: Success!
Episode 294: Success!
Episode 298: Success!
Episode 338: Success!
Episode 350: Success!
Episode 351: Success!
Episode 358: Success!
Episode 377: Success!
Episode 392: Success!
Episode 394: Success!
Episode 395: Success!
Episode 396: Success!
Episode 419: Success!
Episode 423: Success!
Episode 427: Success!
Episode 428: Success!
Episode 429: Success!
Episode 430: Success!
Episode 433: Success!
Episode 434: Success!
Episode 435: Success!
Episode 436: Success!
Episode 437: Success!
Episode 438: Success!
Episode 439: Success!
Episode 440: Success!
Episode 441: Success!
Episode 446: Success!
Episode 550: Success!
Episode 551: Success!
Episode 552: Success!
Episode 553: Success!
Episode 573: Success!
Episode 612: Success!
Episode 624: Success!
Episode 625: Success!
Episode 626: Success!
Episode 627: Success!
Episode 628: Success!
Episode 629: Success!
Episode 630: Success!
Episode 631: Success!
Episode 632: Success!
[output truncated]

In [5]:
n_episodes = 1
agent.p_explore = 0
agent.alpha = 0
reward, rewards = run_cart.run_cart_discrete(agent, env=env, graphics=True, n_episodes=n_episodes)
print(f'Reward trained agent {reward}')

Episode 0: Success!
Reward trained agent -142.0
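One way to check whether Q-learning has found something like the heuristic is to look at the greedy action in each grid cell. The sketch below assumes a Q-matrix laid out as in the agent above, with `state = position_bin * n_grid + velocity_bin`; the helper name `greedy_policy_grid` and the toy values are hypothetical.

```python
import numpy as np


def greedy_policy_grid(Q, n_grid):
    """Return an (n_grid, n_grid) array with the greedy action per grid cell.

    Rows index position bins and columns index velocity bins.
    """
    return np.argmax(Q, axis=1).reshape(n_grid, n_grid)


# Toy table with made-up values (not learned): action 2 is best everywhere,
# except in state 0, where action 0 is best.
n_grid = 3
Q = np.zeros((n_grid ** 2, 3))
Q[:, 2] = 1.0
Q[0, 0] = 2.0
policy = greedy_policy_grid(Q, n_grid)
print(policy)
```

For the trained agent this would be called as `greedy_policy_grid(agent.Q, agent.n_grid)`. Cells that were never visited still hold all zeros, for which `np.argmax` returns action 0, so those entries say nothing about what was learned.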
