# Part 12.2: Introduction to Q-Learning

### Single Action Cart

Mountain car actions:

* 0 - Apply left force
* 1 - Apply no force
* 2 - Apply right force

State values:

* state[0] - Position 
* state[1] - Velocity

The following shows a cart that simply applies full force to the right in an attempt to climb the hill.  The cart is not strong enough to do so directly; it needs to build momentum using the hill behind it.
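
To see why full right force alone falls short, it can help to look at the net acceleration on the car at different positions. The sketch below assumes the update rule and constants used by Gym's classic-control implementation of `MountainCar-v0` (engine force 0.001, gravity 0.0025, slope term `cos(3 * position)`). The helper `net_acceleration` is a hypothetical name for illustration and is not part of Gym.

```python
import math

# Assumed constants, mirroring Gym's classic-control MountainCar implementation.
FORCE = 0.001     # velocity change per step contributed by the engine
GRAVITY = 0.0025  # scale of the velocity change contributed by the slope


def net_acceleration(position, action):
    """Per-step change in velocity for an action (0=left, 1=none, 2=right)."""
    engine = (action - 1) * FORCE
    slope = -GRAVITY * math.cos(3 * position)
    return engine + slope


# Push right (action 2) at several positions between the valley and the goal.
for position in (-0.5, -0.3, 0.0, 0.3, 0.5):
    accel = net_acceleration(position, 2)
    print("position={0:+.1f}  net acceleration={1:+.5f}".format(position, accel))
```

Under these assumed constants the net acceleration is negative wherever `cos(3 * position)` exceeds 0.4, which covers the middle of the slope, so a car pushing right from near the valley floor is slowed before it gets to the goal.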

In [1]:
import gym

env = gym.make("MountainCar-v0")
env.reset()
done = False

i = 0
while not done:
    i += 1
    state, reward, done, _ = env.step(2)
    env.render()
    #print(f"Step {i}: State={state}, Reward={reward}")
    print("Step: {0}, State: {1}, Reward = {2}".format(i, state,reward))
env.close()

Step: 1, State: [-0.47707696  0.0006571 ], Reward = -1.0
Step: 2, State: [-0.47576764  0.00130932], Reward = -1.0
Step: 3, State: [-0.47381583  0.00195181], Reward = -1.0
Step: 4, State: [-0.471236    0.00257983], Reward = -1.0
Step: 5, State: [-0.46804728  0.00318872], Reward = -1.0
Step: 6, State: [-0.46427327  0.00377401], Reward = -1.0
Step: 7, State: [-0.45994186  0.00433141], Reward = -1.0
Step: 8, State: [-0.45508497  0.00485688], Reward = -1.0
Step: 9, State: [-0.44973833  0.00534664], Reward = -1.0
Step: 10, State: [-0.44394112  0.00579721], Reward = -1.0
Step: 11, State: [-0.43773568  0.00620545], Reward = -1.0
Step: 12, State: [-0.43116711  0.00656857], Reward = -1.0
Step: 13, State: [-0.42428292  0.00688418], Reward = -1.0
Step: 14, State: [-0.41713263  0.00715029], Reward = -1.0
Step: 15, State: [-0.40976734  0.0073653 ], Reward = -1.0
Step: 16, State: [-0.40223928  0.00752806], Reward = -1.0
Step: 17, State: [-0.39460144  0.00763784], Reward = -1.0
Step: 18, State: [-0.38

Step: 147, State: [-0.46967312 -0.00349252], Reward = -1.0
Step: 148, State: [-0.47256832 -0.0028952 ], Reward = -1.0
Step: 149, State: [-0.47484475 -0.00227643], Reward = -1.0
Step: 150, State: [-0.47648553 -0.00164078], Reward = -1.0
Step: 151, State: [-0.47747849 -0.00099296], Reward = -1.0
Step: 152, State: [-4.77816248e-01 -3.37757488e-04], Reward = -1.0
Step: 153, State: [-4.77496296e-01  3.19952109e-04], Reward = -1.0
Step: 154, State: [-0.47652101  0.00097528], Reward = -1.0
Step: 155, State: [-0.47489764  0.00162337], Reward = -1.0
Step: 156, State: [-0.47263822  0.00225941], Reward = -1.0
Step: 157, State: [-0.46975953  0.0028787 ], Reward = -1.0
Step: 158, State: [-0.46628287  0.00347666], Reward = -1.0
Step: 159, State: [-0.46223397  0.0040489 ], Reward = -1.0
Step: 160, State: [-0.45764271  0.00459126], Reward = -1.0
Step: 161, State: [-0.4525429   0.00509981], Reward = -1.0
Step: 162, State: [-0.44697198  0.00557092], Reward = -1.0
Step: 163, State: [-0.44097071  0.006001

### Programmed Car

This is a car that I hand-programmed.  It uses a simple rule, but it solves the problem. The programmed car always applies force in one direction or the other and never coasts: whichever direction the car is currently rolling, it applies force in that direction.  The car therefore begins to climb a hill, is overpowered, and rolls backward.  As soon as it begins to roll backward, it applies force in this new direction, so each swing builds more momentum.
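
The rule can also be tried without Gym installed. The following is a minimal sketch that assumes the same dynamics and constants as Gym's classic-control `MountainCar-v0`; the functions `step`, `follow_velocity`, `always_right`, and `run` are hypothetical helpers written for this illustration. The 1000-step cap is an arbitrary choice for the sketch and is larger than the 200-step limit of the real environment.

```python
import math

# Assumed dynamics and constants, mirroring Gym's classic-control MountainCar.
FORCE, GRAVITY = 0.001, 0.0025
MIN_POSITION, MAX_POSITION = -1.2, 0.6
MAX_SPEED, GOAL_POSITION = 0.07, 0.5


def step(position, velocity, action):
    """Advance the simplified car by one time step."""
    velocity += (action - 1) * FORCE - GRAVITY * math.cos(3 * position)
    velocity = max(-MAX_SPEED, min(MAX_SPEED, velocity))
    position = max(MIN_POSITION, min(MAX_POSITION, position + velocity))
    if position == MIN_POSITION and velocity < 0:
        velocity = 0.0  # the left wall stops the car
    return position, velocity


def follow_velocity(position, velocity):
    """The hand-programmed rule: push in the direction the car is rolling."""
    return 2 if velocity > 0 else 0


def always_right(position, velocity):
    """The single-action baseline: always push right."""
    return 2


def run(policy, position=-0.5, max_steps=1000):
    """Return (reached_goal, steps_taken) for a policy from a standing start."""
    velocity = 0.0
    for steps in range(1, max_steps + 1):
        position, velocity = step(position, velocity, policy(position, velocity))
        if position >= GOAL_POSITION:
            return True, steps
    return False, max_steps


print("follow velocity:", run(follow_velocity))
print("always right:   ", run(always_right))
```

Each policy receives both position and velocity so that other rules can be swapped in, even though the momentum rule only looks at velocity.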

In [2]:
import gym

env = gym.make("MountainCar-v0")
state = env.reset()
done = False

i = 0
while not done:
    i += 1
    
    if state[1]>0:
        action = 2
    else:
        action = 0
    
    state, reward, done, _ = env.step(action)
    env.render()
    #print(f"Step {i}: State={state}, Reward={reward}")
    print("Step: {0}, State: {1}, Reward = {2}".format(i, state,reward))
env.close()

Step: 1, State: [-0.441596   -0.00162062], Reward = -1.0
Step: 2, State: [-0.44482546 -0.00322945], Reward = -1.0
Step: 3, State: [-0.44964022 -0.00481477], Reward = -1.0
Step: 4, State: [-0.45600514 -0.00636492], Reward = -1.0
Step: 5, State: [-0.46387355 -0.0078684 ], Reward = -1.0
Step: 6, State: [-0.4731875  -0.00931395], Reward = -1.0
Step: 7, State: [-0.48387809 -0.0106906 ], Reward = -1.0
Step: 8, State: [-0.49586589 -0.0119878 ], Reward = -1.0
Step: 9, State: [-0.50906144 -0.01319555], Reward = -1.0
Step: 10, State: [-0.52336599 -0.01430455], Reward = -1.0
Step: 11, State: [-0.53867228 -0.01530629], Reward = -1.0
Step: 12, State: [-0.55486556 -0.01619328], Reward = -1.0
Step: 13, State: [-0.57182469 -0.01695912], Reward = -1.0
Step: 14, State: [-0.58942338 -0.01759869], Reward = -1.0
Step: 15, State: [-0.60753159 -0.01810821], Reward = -1.0
Step: 16, State: [-0.62601693 -0.01848534], Reward = -1.0
Step: 17, State: [-0.64474616 -0.01872924], Reward = -1.0
Step: 18, State: [-0.66

### Reinforcement Learning

![Reinforcement Learning](https://raw.githubusercontent.com/jeffheaton/t81_558_deep_learning/master/images/reinforcement.png "Reinforcement Learning")


### Q-Learning Car

We will now use Q-Learning to produce a car that learns to drive itself.  Look out, Tesla!

Q-Learning works by building a lookup table, the Q-table, that estimates the value of each available action in each state; the agent picks the action with the highest value. The table is refined over many training episodes using the following update rule.

$ Q^{new}(s_{t},a_{t}) \leftarrow (1-\alpha) \cdot \underbrace{Q(s_{t},a_{t})}_{\text{old value}} + \underbrace{\alpha}_{\text{learning rate}} \cdot  \overbrace{\bigg( \underbrace{r_{t}}_{\text{reward}} + \underbrace{\gamma}_{\text{discount factor}} \cdot \underbrace{\max_{a}Q(s_{t+1}, a)}_{\text{estimate of optimal future value}} \bigg) }^{\text{learned value}} $
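
A single update can be worked through by hand. The sketch below applies the formula once; the function name `q_update` and the input numbers are illustrative assumptions, not values taken from a trained table.

```python
def q_update(old_q, reward, max_future_q, learning_rate, discount):
    """Blend the old estimate with the newly learned value."""
    learned_value = reward + discount * max_future_q
    return (1 - learning_rate) * old_q + learning_rate * learned_value


# Illustrative numbers only: old value -1.0, reward -1.0, best future value -0.5.
new_q = q_update(old_q=-1.0, reward=-1.0, max_future_q=-0.5,
                 learning_rate=0.1, discount=0.95)
print(round(new_q, 4))  # -1.0475
```

Here the learned value is -1.0 + 0.95 * (-0.5) = -1.475, and the new estimate moves 10% of the way from the old value toward it: 0.9 * (-1.0) + 0.1 * (-1.475) = -1.0475.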

In [3]:
import gym
import numpy as np

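def calc_discrete_state(state):
    # Map a continuous (position, velocity) state to integer grid indices
    discrete_state = (state - env.observation_space.low)/buckets
    # Use the built-in int; the np.int alias was removed in NumPy 1.24
    return tuple(discrete_state.astype(int))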

def run_game(q_table, render, should_update):
    done = False
    discrete_state = calc_discrete_state(env.reset())
    success = False
    
    while not done:
        # Exploit or explore
        if np.random.random() > epsilon:
            # Exploit - use q-table to take current best action (and probably refine)
            action = np.argmax(q_table[discrete_state])
        else:
            # Explore - take a random action
            action = np.random.randint(0, env.action_space.n)
            
        # Run simulation step
        new_state, reward, done, _ = env.step(action)
        
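        # Convert the new continuous state to its discrete grid cell
        new_state_disc = calc_discrete_state(new_state)

        # Record whether the car reached the goal position on this step
        if new_state[0] >= env.goal_position: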
            success = True
          
        # Update q-table
        if should_update:
            max_future_q = np.max(q_table[new_state_disc])
            current_q = q_table[discrete_state + (action,)]
            new_q = (1 - LEARNING_RATE) * current_q + LEARNING_RATE * (reward + DISCOUNT * max_future_q)
            q_table[discrete_state + (action,)] = new_q

        discrete_state = new_state_disc
        
        if render:
            env.render()
            
    return success
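
The discretization step in `calc_discrete_state` can be examined in isolation. The sketch below hard-codes the observation bounds that `MountainCar-v0` is assumed to report (position from -1.2 to 0.6, velocity from -0.07 to 0.07) so that it runs without Gym. The name `discretize` and the clipping step are additions of this sketch: without clipping, a state sitting exactly on an upper bound would map to an index one past the end of the table.

```python
import numpy as np

# Assumed observation bounds for MountainCar-v0: [position, velocity].
LOW = np.array([-1.2, -0.07])
HIGH = np.array([0.6, 0.07])
GRID_SIZE = np.array([10, 10])
BUCKETS = (HIGH - LOW) / GRID_SIZE  # width of one grid cell per dimension


def discretize(state):
    """Map a continuous state to grid indices, clipped to stay inside the table."""
    indices = ((np.asarray(state) - LOW) / BUCKETS).astype(int)
    indices = np.clip(indices, 0, GRID_SIZE - 1)
    return tuple(int(i) for i in indices)


print(discretize([-0.5, 0.01]))  # (3, 5)
print(discretize([0.6, 0.07]))   # (9, 9): the upper bound is clipped into the last cell
```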


In [4]:
LEARNING_RATE = 0.1
DISCOUNT = 0.95
EPISODES = 200
SHOW_EVERY = 40

DISCRETE_GRID_SIZE = [10, 10]
START_EPSILON_DECAYING = 1
END_EPSILON_DECAYING = EPISODES//2
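
The two `*_EPSILON_DECAYING` constants describe a linear exploration schedule: epsilon starts at 1 (always explore) and is reduced by a fixed amount after each episode in the decay window. The sketch below reproduces that schedule on its own. The function name `epsilon_schedule` is hypothetical, and the clamp at zero is an assumption of this sketch rather than part of the training loop.

```python
def epsilon_schedule(episodes, start=1, end=None, initial=1.0):
    """Return the epsilon value in effect during each episode (1-based order)."""
    end = episodes // 2 if end is None else end
    change = initial / (end - start)  # fixed decrement per decaying episode
    epsilon = initial
    values = []
    for episode in range(1, episodes + 1):
        values.append(epsilon)  # epsilon used while running this episode
        if start <= episode <= end:
            epsilon = max(0.0, epsilon - change)  # clamp so it never goes negative
    return values


values = epsilon_schedule(200)
print(values[0], round(values[49], 3), values[-1])
```

With 200 episodes, the decay window covers the first half of training, so the second half runs purely on the learned table.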

In [5]:
env = gym.make("MountainCar-v0")

epsilon = 1  
epsilon_change = epsilon/(END_EPSILON_DECAYING - START_EPSILON_DECAYING)
buckets = (env.observation_space.high - env.observation_space.low)/DISCRETE_GRID_SIZE
q_table = np.random.uniform(low=-3, high=0, size=(DISCRETE_GRID_SIZE + [env.action_space.n]))
success = False


In [6]:
episode = 0
success_count = 0

while episode<EPISODES:
    episode+=1
    done = False

    if episode % SHOW_EVERY == 0:
        #print(f"Current episode: {episode}, success: {success_count} ({float(success_count)/SHOW_EVERY})")
        print("Current episode: {0}, success: {1}".format(episode, success_count))
        success = run_game(q_table, True, False)
        success_count = 0
    else:
        success = run_game(q_table, False, True)
        
    if success:
        success_count += 1

    # Move epsilon towards its ending value, if it still needs to move
    if END_EPSILON_DECAYING >= episode >= START_EPSILON_DECAYING:
        epsilon -= epsilon_change

print(success)


Current episode: 40, success: 0
Current episode: 80, success: 0
Current episode: 120, success: 0
Current episode: 160, success: 1
Current episode: 200, success: 3
False


In [7]:
run_game(q_table, True, False)

False

In [8]:
env.close() #end the process 

In [9]:
import pandas as pd

df = pd.DataFrame(q_table.argmax(axis=2))

In [13]:
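# Rows are position buckets and columns are velocity buckets
df.columns = ['v-{0}'.format(x) for x in range(DISCRETE_GRID_SIZE[1])]
df.index = ['p-{0}'.format(x) for x in range(DISCRETE_GRID_SIZE[0])]
df

| | v-0 | v-1 | v-2 | v-3 | v-4 | v-5 | v-6 | v-7 | v-8 | v-9 |
|---|---|---|---|---|---|---|---|---|---|---|
| p-0 | 2 | 2 | 0 | 0 | 2 | 1 | 0 | 0 | 2 | 2 |
| p-1 | 0 | 1 | 0 | 1 | 2 | 0 | 1 | 2 | 1 | 1 |
| p-2 | 1 | 1 | 1 | 0 | 2 | 2 | 1 | 2 | 1 | 0 |
| p-3 | 1 | 0 | 0 | 0 | 0 | 2 | 2 | 1 | 0 | 1 |
| p-4 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 2 | 2 | 2 |
| p-5 | 2 | 2 | 0 | 1 | 0 | 2 | 2 | 1 | 1 | 1 |
| p-6 | 1 | 2 | 0 | 1 | 2 | 0 | 1 | 1 | 0 | 2 |
| p-7 | 1 | 1 | 2 | 1 | 2 | 2 | 0 | 1 | 2 | 2 |
| p-8 | 0 | 1 | 2 | 0 | 1 | 2 | 1 | 1 | 1 | 1 |
| p-9 | 1 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 |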


In [None]:
np.argmax(q_table[(2,0)])
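
The lookup `np.argmax(q_table[(2,0)])` returns the greedy action index for a single grid cell (position bucket 2, velocity bucket 0). The sketch below shows one way to turn such indices into readable labels for the whole table. It uses a randomly filled stand-in table, since the trained `q_table` is not reproduced here, and `ACTION_NAMES` is a hypothetical helper based on the action list at the start of this section.

```python
import numpy as np

# Hypothetical labels for actions 0, 1, 2.
ACTION_NAMES = np.array(["left", "none", "right"])

# Stand-in Q-table with shape (position buckets, velocity buckets, actions).
rng = np.random.default_rng(0)
demo_q_table = rng.uniform(low=-3, high=0, size=(10, 10, 3))

greedy_actions = demo_q_table.argmax(axis=2)  # best action index per grid cell
policy_labels = ACTION_NAMES[greedy_actions]  # same shape, with readable labels

# Greedy action for position bucket 2, velocity bucket 0.
print(greedy_actions[2, 0], policy_labels[2, 0])
```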