<h1 align = 'center'>Guessing Games</h1>
<h3 align = 'center'>machine learning, one step at a time</h3>
<h3 align = 'center'>Step 8. A Random Walk While Taking Notes</h3>

**8. A random walk while taking notes.**

When we took our random walk, we ignored everything that happened except for one event: stumbling across the exit.

What if we paid attention, and learned from our mistakes?

Let's start thinking about the maze in terms of machine learning: _the art of accumulating knowledge by learning from mistakes_.

We have already seen that, when we explore the maze, it gives us feedback:

In [1]:
from maze import Maze

maze = Maze()
print(maze)
for i in range(10):           # take several random walks through the maze
    state = maze.reset()      # start each walk at the initial state
    done = False
    print('\n=== walk number ' + str(i) + ' =======================================================')
    while not done:
        action = maze.sample()
        initial_state = state
        state, reward, done = maze.step(action)
        print('started at:', initial_state, '| moved:', action, '| reward =',reward, '| done?', done)

         ...  ...  ...  +++ 
enter->  (1)  ...  ...  +++ 
         ...  ...  ...  +++ 

         ...  +++  ...  ... 
         ...  +++  ...  ... 
         ...  +++  ...  ... 

         ...  ...  +++  ... 
         ...  ...  +++  ... 
         ...  ...  +++  ... 

         +++  +++  ...  ... 
         +++  +++  ...  ...  <-exit
         +++  +++  ...  ... 


started at: [0 0] | moved: S | reward = 0 | done? False
started at: [1 0] | moved: E | reward = -1 | done? True

started at: [0 0] | moved: W | reward = -1 | done? True

started at: [0 0] | moved: N | reward = -1 | done? True

started at: [0 0] | moved: W | reward = -1 | done? True

started at: [0 0] | moved: E | reward = 0 | done? False
started at: [0 1] | moved: E | reward = 0 | done? False
started at: [0 2] | moved: N | reward = -1 | done? True

started at: [0 0] | moved: W | reward = -1 | done? True

started at: [0 0] | moved: N | reward = -1 | done? True

started at: [0 0] | moved: W | reward = -1 | done? True

started at: [0 0

If we wanted to learn from those results, we would need to _remember_ our rewards or penalties. We would need to store something, someplace... like taking notes in class (unless you don't take notes; if you don't take notes, it's like taking notes as if you did take notes).

If we were doing this by hand, what would we write in our notebook?

...well, we only know two things:
- our __state__ (that is, our row, column position)
- the result of taking an __action__ when starting from our __state__.

For example, here is the result of moving North immediately, and the notes we might take:

In [2]:
initial_state = maze.reset()
print('NOTE: initial state =', initial_state, '| attempting to move North...')

new_state, reward, done = maze.step('N')
print('NOTE: new_state =', new_state, '| reward =', reward, '| done ?', done)

print('NOTE: Moving North from (0,0) is a bad idea!')
print('NOTE: Why is it bad? Because I got a penalty!')

NOTE: initial state = [0 0] | attempting to move North...
NOTE: new_state = [-1  0] | reward = -1 | done ? True
NOTE: Moving North from (0,0) is a bad idea!
NOTE: Why is it bad? Because I got a penalty!


OK, so there's that.

Seems like it's important to associate ( state(0,0) + action(N) = bad idea ).

_But equally important... there's no need to consider any context (nothing above refers specifically to a maze). A bad idea is a bad idea. It doesn't matter why it's a bad idea, or what specifically is bad about it. Machine learning does not code for explicit conditions (like stepping out of bounds versus stepping onto a blocked square); we are only concerned with state transitions and outcomes, in order to find our way toward a goal or solution._

That simplifies the problem. Who needs rules? We should be able simply to remember the result of the __transition__ associated with any __current state__ and __available action__ and go from there:
<pre>
== state =====   == action============    == transition =====    == result =============
state is (0,0) + action is: move North -> new state is (-1,0) -> reward = -1, game over!
==============   =====================    ===================    =======================
</pre>
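
One way to picture such a note in code is a small record keyed by the state and the action. This is only a sketch: the `notes` dictionary and the `take_note` helper are made-up names (they are not part of the `maze` module), and the values are copied by hand from the cell output above.

```python
# A hypothetical notebook: (row, col, action) -> (new_state, reward, done)
notes = {}

def take_note(state, action, new_state, reward, done):
    """Remember what happened when `action` was taken from `state`."""
    key = (state[0], state[1], action)
    notes[key] = (tuple(new_state), reward, done)

# the transition we just observed: moving North from (0,0)
take_note([0, 0], 'N', [-1, 0], -1, True)

# later on, we can look the note up again
new_state, reward, done = notes[(0, 0, 'N')]
print('from (0,0), action N -> new state', new_state, '| reward =', reward, '| done?', done)
```

Notice that nothing in the note says "maze"; it is just a state, an action, and what came of it.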

We need a place to put all of our __states__, __actions__, and __rewards__ so that we can store the effect of any given __transition__.

To do that, we just need the _dimensions_ of the problem... like putting a dozen eggs into a carton that is 6x2, or those occasional have-to-be-different cartons that are 4x3.

The maze will reveal two things that help us discover the _dimensions_ of the maze problem: (1) the size of the __action__ space, and (2) the size of the __state__ space. These sizes are __independent of the fact that it's a maze__. We could be solving any two-dimensional problem with multiple actions (like playing tic-tac-toe, or checkers). We don't care what the __actions__ or __states__ represent; we just need to know: how many are there?

Remember: the goal is to solve the problem knowing as little as possible about the context... it's not a maze, it's just an orderly set of states and transitions.

(please re-read that last sentence)

On to finding the _dimensions_: the set of dimensions that we have seen before is the __action space__:

In [3]:
print(maze.action_space())
print('There are',len(maze.action_space()),'possible actions')

['N', 'S', 'E', 'W']
There are 4 possible actions


And the other set of dimensions, which our maze also provides, is the __state space__:


In [4]:
print('Here are the dimensions of all possible states:',maze.state_space())

Here are the dimensions of all possible states: (4, 4)


_NOTE TO THE CURIOUS: it's not strictly necessary that the maze provide the dimensions of the action or state spaces. We could discover those dynamically, by exploring the problem over and over. Those are provided here just to simplify the example._
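
For the curious, here is a rough sketch of what "discovering dynamically" might look like. It assumes nothing about the sizes up front: a dictionary grows as new states and actions turn up. The names `q_notes` and `remember` are hypothetical, and the transitions are copied by hand from the random walks printed earlier rather than produced by the maze.

```python
from collections import defaultdict

# every (state, action) pair starts at 0.0 the first time it is used
q_notes = defaultdict(float)

def remember(state, action, reward):
    """Add a reward to whatever we already know about (state, action)."""
    q_notes[(tuple(state), action)] += reward

# a few transitions copied by hand from the walks shown earlier
observed = [
    ([0, 0], 'S', 0),
    ([1, 0], 'E', -1),
    ([0, 0], 'W', -1),
    ([0, 0], 'E', 0),
    ([0, 1], 'E', 0),
    ([0, 2], 'N', -1),
]
for state, action, reward in observed:
    remember(state, action, reward)

# the "dimensions" are simply whatever we have bumped into so far
states_seen = {state for state, _ in q_notes}
actions_seen = {action for _, action in q_notes}
print('states seen so far:', sorted(states_seen))
print('actions seen so far:', sorted(actions_seen))
```

With enough walks, the set of states and actions stops growing, and that tells us the size of the problem.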

Now we know that we have to deal with:
- 4 actions
- 4 x 4 = 16 states, that is really...
    - 4 rows
    - 4 columns

We need to be able to remember the results of any __action__ taken in any __state__, which makes 4x4x4, like this:

In [5]:
import numpy as np     # this library does all kinds of magical things with numbers
q = np.zeros((4,4,4))  # don't worry about these details, just go with it
print(q)               # everyone calls this a q-table... meaning the 'quality' of every action

[[[0. 0. 0. 0.]
  [0. 0. 0. 0.]
  [0. 0. 0. 0.]
  [0. 0. 0. 0.]]

 [[0. 0. 0. 0.]
  [0. 0. 0. 0.]
  [0. 0. 0. 0.]
  [0. 0. 0. 0.]]

 [[0. 0. 0. 0.]
  [0. 0. 0. 0.]
  [0. 0. 0. 0.]
  [0. 0. 0. 0.]]

 [[0. 0. 0. 0.]
  [0. 0. 0. 0.]
  [0. 0. 0. 0.]
  [0. 0. 0. 0.]]]
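
The 4x4x4 does not have to be typed in by hand. As a small sketch, the shape can be assembled from the two sizes the maze reported. The values below are copied from the outputs above (so the example runs without the `maze` module); the variable names are just for illustration.

```python
import numpy as np

state_space = (4, 4)                  # copied from the maze.state_space() output
action_space = ['N', 'S', 'E', 'W']   # copied from the maze.action_space() output

# rows x columns x number of actions
shape = state_space + (len(action_space),)
q = np.zeros(shape)
print(q.shape, 'holds', q.size, 'values')
```

Built this way, the same line would work for a problem with a different number of rows, columns, or actions.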


It's a little hard to visualize which dimension is which... in detail, it works like this:
<pre>
row  col     --actions--
 0    0   [[[N. S. E. W.]
 0    1     [N. S. E. W.]
 0    2     [N. S. E. W.]
 0    3     [N. S. E. W.]]

 1    0    [[N. S. E. W.]
 1    1     [N. S. E. W.]
 1    2     [N. S. E. W.]
 1    3     [N. S. E. W.]]
 
 2    0    [[N. S. E. W.]
 2    1     [N. S. E. W.]
 2    2     [N. S. E. W.]
 2    3     [N. S. E. W.]]
 
 3    0    [[N. S. E. W.]
 3    1     [N. S. E. W.]
 3    2     [N. S. E. W.]
 3    3     [N. S. E. W.]]]
</pre>

...and yes, you could equivalently do this...

In [6]:
print(np.zeros((16,4)))  # around 1/2 the people find this easier to understand

[[0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]]


But we will stick with 4x4x4 for now.
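
If you want to convince yourself that the two layouts hold the same information, here is a minimal sketch. It relies on NumPy's default row-major ordering, and `flat_index` is just an illustrative name.

```python
import numpy as np

q = np.zeros((4, 4, 4))
q[2][3][1] = -1              # row 2, col 3, action 1 (S)

flat = q.reshape(16, 4)      # the same 64 numbers, seen as 16 states x 4 actions
flat_index = 2 * 4 + 3       # row * number_of_columns + col
print(flat[flat_index])      # the four action values for row 2, col 3
```

Each (row, col) pair in the 4x4x4 table maps to exactly one of the 16 rows in the flat table, so nothing is gained or lost either way.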


Let's say we want to remember that going North right away is a bad idea... start by moving North, then store the results in the __q-table__:

In [7]:
from maze import Maze
maze = Maze()
initial_state = maze.reset()                 # start in state (0,0)
final_state, reward, done = maze.step('N')   # take a step to the North
print('started at:', initial_state, ', moved: N to ', final_state, ', reward =',reward, ', done?', done)

started at: [0 0] , moved: N to  [-1  0] , reward = -1 , done? True


That means _when I was in state (0,0), and chose action 'N', I got a reward of -1, and the game ended._

Or, the _quality_ of the action 'N' from state (0,0) is pretty bad.

Let's make a note of that in a 4x4x4 __q-table__:

In [8]:
q = np.zeros((4,4,4))
row = 0                    # initial row is zero
col = 0                    # initial col is zero
action = 0                 # N,S,E,W = 0,1,2,3... so N = 0
q[row][col][action] = -1   # remember that a bad thing happened
print(q)                   # notice the -1 hiding in the upper left corner

[[[-1.  0.  0.  0.]
  [ 0.  0.  0.  0.]
  [ 0.  0.  0.  0.]
  [ 0.  0.  0.  0.]]

 [[ 0.  0.  0.  0.]
  [ 0.  0.  0.  0.]
  [ 0.  0.  0.  0.]
  [ 0.  0.  0.  0.]]

 [[ 0.  0.  0.  0.]
  [ 0.  0.  0.  0.]
  [ 0.  0.  0.  0.]
  [ 0.  0.  0.  0.]]

 [[ 0.  0.  0.  0.]
  [ 0.  0.  0.  0.]
  [ 0.  0.  0.  0.]
  [ 0.  0.  0.  0.]]]


For convenience, let's convert our actions to numbers (so that later on we can store results in our __q-table__ automatically), like this:

In [9]:
# Here is a helpful function that you may need...
# it converts N,S,E,W to 0,1,2,3

def index_of_action(action):
    return maze.action_space().index(action)

# let's see how that works
for action in maze.action_space():
    print(action, index_of_action(action))

N 0
S 1
E 2
W 3


To store our penalty from moving North in our __q-table__ by hand using __index_of_action()__, we would do something like this:

In [10]:
import numpy as np
q = np.zeros((4,4,4))
q[0][0][index_of_action('N')] = -1   # we can put a function call inside an array index
print(q)

[[[-1.  0.  0.  0.]
  [ 0.  0.  0.  0.]
  [ 0.  0.  0.  0.]
  [ 0.  0.  0.  0.]]

 [[ 0.  0.  0.  0.]
  [ 0.  0.  0.  0.]
  [ 0.  0.  0.  0.]
  [ 0.  0.  0.  0.]]

 [[ 0.  0.  0.  0.]
  [ 0.  0.  0.  0.]
  [ 0.  0.  0.  0.]
  [ 0.  0.  0.  0.]]

 [[ 0.  0.  0.  0.]
  [ 0.  0.  0.  0.]
  [ 0.  0.  0.  0.]
  [ 0.  0.  0.  0.]]]


If we were to store the __rewards__ or __penalties__ from every possible initial move, we would get:

In [11]:
# note that the row & col do not change...
# these are the results of 4 transitions,
# all starting from row = 0, col = 0.
q[0][0][index_of_action('N')] = -1
q[0][0][index_of_action('S')] = 0
q[0][0][index_of_action('E')] = 0
q[0][0][index_of_action('W')] = -1
print(q)

[[[-1.  0.  0. -1.]
  [ 0.  0.  0.  0.]
  [ 0.  0.  0.  0.]
  [ 0.  0.  0.  0.]]

 [[ 0.  0.  0.  0.]
  [ 0.  0.  0.  0.]
  [ 0.  0.  0.  0.]
  [ 0.  0.  0.  0.]]

 [[ 0.  0.  0.  0.]
  [ 0.  0.  0.  0.]
  [ 0.  0.  0.  0.]
  [ 0.  0.  0.  0.]]

 [[ 0.  0.  0.  0.]
  [ 0.  0.  0.  0.]
  [ 0.  0.  0.  0.]
  [ 0.  0.  0.  0.]]]


This q-table says 'hey, if you are just starting out... don't go north or west!'
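
Here is a sketch of how we might read that advice back out of the table. The `actions` list is typed in by hand, in the same order that `maze.action_space()` printed earlier, and `not_yet_bad` is a made-up name.

```python
import numpy as np

actions = ['N', 'S', 'E', 'W']     # same order as the action space shown earlier
q = np.zeros((4, 4, 4))
q[0][0][actions.index('N')] = -1   # the two penalties we recorded at the start
q[0][0][actions.index('W')] = -1

row, col = 0, 0
# keep only the actions that have not earned a penalty from this state
not_yet_bad = [a for a in actions if q[row][col][actions.index(a)] >= 0]
print('from (0,0), actions with no penalty so far:', not_yet_bad)
```

A zero does not mean the action is good, only that nothing bad has been recorded for it yet.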
<hr>
***Exercises***<p>
    
- Build a q-table that learns incrementally from 100,000 walks through the maze.

In [12]:
import numpy as np
from maze import Maze
maze = Maze()

# this converts N,S,E,W to 0,1,2,3
def index_of_action(action):
    return maze.action_space().index(action)

q = np.zeros((4,4,4))
for n in range(100000):
    state = maze.reset()
    done = False
    while not done:
        
        ####################################################################
        #                                                                  #
        #  YOUR CODE HERE:                                                 #
        #    - get a sample action from the maze                           #
        #    - use the action to take a step; capture the return values    #
        #    - update the q table by adding the reward to:                 #
        #        > row = state[0]                                          #
        #        > col = state[1]                                          #
        #        > action = index_of_action(whatever action you took)      #
        #        > q[row][col][action] += reward                           #
        #     - and don't forget to update your state                      #
        #                                                                  #
        ####################################################################

print(q)

IndentationError: expected an indented block (<ipython-input-12-0816f7bbd9ac>, line 29)
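
If you get stuck, here is one possible shape for the loop. It is a sketch, not the official answer, and it does not use the real `maze` module: `ToyMaze` is a hypothetical stand-in with the same four methods used in this notebook. The blocked squares are read off the printed maze, and the +1 reward for reaching square (3,3) is an assumption, since the real exit reward never appears in the output above. It also uses 1,000 walks instead of 100,000 to keep things quick.

```python
import random
import numpy as np

ACTIONS = ['N', 'S', 'E', 'W']
MOVES = {'N': (-1, 0), 'S': (1, 0), 'E': (0, 1), 'W': (0, -1)}
# blocked squares read off the printed maze (the +++ cells), as (row, col)
BLOCKED = {(0, 3), (1, 1), (2, 2), (3, 0), (3, 1)}
EXIT = (3, 3)   # ASSUMPTION: reaching this square ends the walk with reward +1

class ToyMaze:
    """Hypothetical stand-in for maze.Maze, just enough to run the loop."""
    def reset(self):
        self.state = (0, 0)
        return self.state

    def action_space(self):
        return ACTIONS

    def sample(self):
        return random.choice(ACTIONS)

    def step(self, action):
        d_row, d_col = MOVES[action]
        row, col = self.state[0] + d_row, self.state[1] + d_col
        self.state = (row, col)
        out_of_bounds = not (0 <= row < 4 and 0 <= col < 4)
        if out_of_bounds or self.state in BLOCKED:
            return self.state, -1, True    # penalty, and the walk ends
        if self.state == EXIT:
            return self.state, 1, True     # assumed reward for finding the exit
        return self.state, 0, False

random.seed(0)                 # make the random walks repeatable
maze = ToyMaze()
q = np.zeros((4, 4, 4))
for n in range(1000):
    state = maze.reset()
    done = False
    while not done:
        action = maze.sample()
        new_state, reward, done = maze.step(action)
        # add the reward to the notes for the state we started from
        q[state[0]][state[1]][ACTIONS.index(action)] += reward
        state = new_state      # don't forget to update the state

print(q[0][0])                 # what we learned about the very first move
```

The entries for N and W at (0,0) pile up penalties, while S and E stay at zero, which is the same lesson we wrote down by hand earlier, only repeated many times.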