<h1 align = 'center'>Guessing Games</h1>
<h3 align = 'center'>machine learning, one step at a time</h3>
<h3 align = 'center'>Step 10. Q-learning</h3>

**10. Q-learning**

_In Q-learning, a form of reinforcement learning, an agent develops an optimal policy based on interactions with an environment; the environment provides a series of state-action-reward sequences without any additional descriptions, labels, context, or rules._

Our last attempt at solving the maze is almost a Q-learning algorithm. It developed the policy "don't repeat your mistakes" to find a path through the maze, but it was not nearly the _optimal_ policy. An _optimal_ policy would find the exit in the fewest possible steps. When we consider what we can learn from our states, actions, and rewards, we are missing something.

Our last exercise was good at exploiting __penalties__, but what about exploiting __rewards__?

***Find the reward in this q-table:***
<pre>
 =====  =========================
 state           action
 =====  =========================
            N     S     E     W
 (0,0)  [-903.    0.    0. -830.]
 (0,1)  [-253. -215.    0.    0.]
 (0,2)  [ -58.    0.  -64.    0.]
 (0,3)  [   0.    0.    0.    0.]

 (1,0)  [   0.    0. -236. -231.]
 (1,1)  [   0.    0.    0.    0.]
 (1,2)  [   0.  -11.    0.  -14.]
 (1,3)  [  -3.    0.   -2.    0.]

 (2,0)  [   0.  -62.    0.  -79.]
 (2,1)  [ -15.   -9.  -10.    0.]
 (2,2)  [   0.    0.    0.    0.]
 (2,3)  [   0.   <font color='blue'>+1.</font>   -1.   -3.]   <font color='blue'>< -- we found the exit! from (2,3) -> move S</font>

 (3,0)  [   0.    0.    0.    0.]
 (3,1)  [   0.    0.    0.    0.]
 (3,2)  [   0.    0.    0.    0.]
 (3,3)  [   0.    0.    0.    0.]
 </pre>
In this table, it took 3,000 random attempts to find the exit once.
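
The same search can be done programmatically. The sketch below is a minimal illustration, assuming the q-table is stored as a 4x4x4 NumPy array indexed as `q[row, col, action]` with actions in the column order N, S, E, W shown in the table header. Only the (2,3) row is filled in; the names `ACTIONS` and `rewards` are illustrative, not part of the maze code.

```python
import numpy as np

ACTIONS = ["N", "S", "E", "W"]  # column order taken from the table header

# q[row, col, action]; only the (2,3) row from the table is filled in here
q = np.zeros((4, 4, 4))
q[2, 3] = [0.0, 1.0, -1.0, -3.0]

# np.argwhere returns one [row, col, action] index triple per positive entry
rewards = [(int(r), int(c), ACTIONS[a]) for r, c, a in np.argwhere(q > 0)]
print(rewards)
```

With the (2,3) row above, the only positive entry is the move S from (2,3).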

***There is something critically important hiding in that table.***

The table clearly says that, starting in state (2,3), a move to the South results in the state (3,3) and a reward of +1. Let's make our own notation for that:
<pre>
(2,3)[S] = (3,3){+1}
</pre>
That suggests that __state (2,3) is a pretty good place to be__, because from (2,3) we can achieve a positive reward. How could we use that knowledge to favor actions that put us in (2,3)?
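
One possible first step is to store that notation as plain data, so that we can ask which recorded moves lead into a given state. This is only a sketch: the dictionary `transitions` and the helper `moves_into` are hypothetical names introduced for illustration, and the only transition recorded is the one stated above.

```python
# (state, action) -> (new_state, reward), i.e. (2,3)[S] = (3,3){+1}
transitions = {((2, 3), "S"): ((3, 3), 1)}

def moves_into(target, transitions):
    """Return every recorded (state, action) pair that lands on `target`."""
    return [
        key
        for key, (new_state, _reward) in transitions.items()
        if new_state == target
    ]

print(moves_into((3, 3), transitions))  # moves that reach the exit
print(moves_into((2, 3), transitions))  # moves that reach the "good" state
```

The second lookup comes back empty: with only one transition recorded, we know that (2,3) is valuable but not yet how to get there.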

Let's look again at the maze, marked up to show (2,3)[S] = (3,3){+1}:
<pre>
         ...  ...  ...  +++ 
enter->  (1)  ...  ...  +++ 
         ...  ...  ...  +++ 

         ...  +++  ...  ... 
         ...  +++  ...  ... 
         ...  +++  ...  ... 

         ...  ...  +++  2,3 
         ...  ...  +++  [S]  <-this is a good state
         ...  ...  +++   +1 

         +++  +++  ...  ... 
         +++  +++  ...  3,3  <-exit
         +++  +++  ...  ... 

</pre>
That view of the maze raises an interesting question: if we are exploring, how can we _exploit_ the reward that is available _if and when we arrive at state (2,3)?_

Well... how do we ever arrive at state (2,3)? In this maze, the only way is: (1,3)[S] = (2,3){0}:
<pre>
         ...  ...  ...  +++ 
enter->  (1)  ...  ...  +++ 
         ...  ...  ...  +++ 

         ...  +++  ...  1,3 
         ...  +++  ...  [S]  <-this is a totally boring, uninteresting state
         ...  +++  ...  -0- 

         ...  ...  +++  2,3 
         ...  ...  +++  [S]  <-this is a good state
         ...  ...  +++   +1 

         +++  +++  ...  ... 
         +++  +++  ...  3,3  <-exit
         +++  +++  ...  ... 
</pre>
Wait a minute... how can (1,3) be totally boring, if it can lead us to (2,3)?
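
One way to make that intuition concrete is to score a move by its immediate reward plus a discounted share of the best value available from the state it leads to. The sketch below applies that one-step lookahead, which is the core of the standard Q-learning update, to (1,3)[S]. The discount factor `GAMMA = 0.9` is an assumed value chosen for illustration, and the learning rate is effectively fixed at 1 to keep the arithmetic visible.

```python
import numpy as np

N, S, E, W = range(4)  # column order from the q-table header
GAMMA = 0.9            # assumed discount factor, not taken from the text

q = np.zeros((4, 4, 4))
q[2, 3, S] = 1.0       # (2,3)[S] = (3,3){+1}

# (1,3)[S] = (2,3){0}: the immediate reward is 0, but the next state is valuable
reward, next_state = 0.0, (2, 3)
q[1, 3, S] = reward + GAMMA * q[next_state].max()

print(q[1, 3, S])
```

Under these assumptions, (1,3)[S] scores 0.9 instead of 0, so (1,3) is no longer a boring state.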

Let's take a closer look at (1,3) and (2,3).

```python
import numpy as np
from maze import Maze

maze = Maze()

def sample(maze):
    action = maze.sample()                    # this returns N,S,E,W
    return maze.action_space().index(action)  # this converts to 0,1,2,3

# build a q-table that finds the exit once
q = np.zeros((4, 4, 4))  # indexed as q[row][col][action]
stop = False
while not stop:
    state = maze.reset()
    done = False
    while not done:
        action = sample(maze)
        new_state, reward, done = maze.step(action)
        q[state[0]][state[1]][action] += reward  # accumulate the reward for this state-action pair
        state = new_state
        if reward > 0:
            stop = True  # a positive reward was seen; the outer loop ends after this episode

# Maze has no print_q attribute, so print each state's row of action values directly
for row in range(4):
    for col in range(4):
        print(f"({row},{col})  {q[row][col]}")
```