<h1 align = 'center'>Guessing Games</h1>
<h3 align = 'center'>machine learning, one step at a time</h3>
<h3 align = 'center'>Step 8. A Random Walk While Paying Attention</h3>

**8. A random walk while paying attention.**

When we took our random walk, we ignored everything that happened except for one event: stumbling across the exit.

What if we paid attention, and learned from our mistakes?

Let's start thinking about the maze in terms of machine learning: _the art of accumulating knowledge by learning from mistakes_.

We have already seen that, when we explore the maze, it gives us feedback:

In [1]:
from maze import Maze

# take five random walks through the maze
maze = Maze()
for i in range(5):
    maze.reset()      # go back to the initial state
    done = False
    print('\n--- walk number ' + str(i) + ' -----------------------')
    while not done:
        state, reward, done = maze.step(maze.sample())
        print(state, reward, done)


--- walk number 0 -----------------------
[0 1] 0 False
[0 0] 0 False
[ 0 -1] -1 True

--- walk number 1 -----------------------
[ 0 -1] -1 True

--- walk number 2 -----------------------
[1 0] 0 False
[1 1] -1 True

--- walk number 3 -----------------------
[0 1] 0 False
[0 0] 0 False
[ 0 -1] -1 True

--- walk number 4 -----------------------
[1 0] 0 False
[2 0] 0 False
[ 2 -1] -1 True


If we wanted to learn from those results, we would need to remember our rewards or penalties. We would need to store something, someplace.

What would we store? ...well, we only know one thing: the result of taking an __action__ when in a given __state__. For example, here is the result of moving north immediately:

In [2]:
initial_state = maze.reset()
new_state, reward, done = maze.step('N')

print('initial state =', initial_state)
print('new_state =', new_state, 'reward =', reward, 'done =', done)
print('Moving North is a bad idea!...')
print('...I should make a note of that.')

initial state = [0 0]
new_state = [-1  0] reward = -1 done = True
Moving North is a bad idea!...
...I should make a note of that.


OK, so there's that.

Seems like we should associate ( state(0,0) + action(N) = bad idea ).

And looking forward, it seems like we should be able to remember the reward associated with any __current state__ and __subsequent action__ (this is called a __transition__, because the __action__ causes us to transit from one __state__ to another).
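
One way to picture a transition is as a small record that holds the state we were in, the action we chose, and what came back. The sketch below is only an illustration: the `Transition` name and its fields are assumptions made up for this example (they are not part of the `maze` module), and the values are copied from the move north shown earlier.

```python
from collections import namedtuple

# Hypothetical record type, for illustration only (not part of the maze module)
Transition = namedtuple('Transition', ['state', 'action', 'reward', 'new_state', 'done'])

# values copied from the output of moving north from the initial state
t = Transition(state=(0, 0), action='N', reward=-1, new_state=(-1, 0), done=True)

# the part worth remembering: this state, plus this action, gave this reward
print(t.state, '+', t.action, '->', t.reward)
```

Only the state, the action and the reward are needed for the table we are about to build; the other fields are kept here just to show everything the maze hands back.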

We need a place to put all of our __rewards__ and __penalties__. How should we do that?

The maze will reveal two things that will help us to discover the _dimensions_ of that part of the problem.

One set of dimensions, that we have seen before, is the __action space__:

In [3]:
print(maze.action_space())
print('There are',len(maze.action_space()),'possible actions')

['N', 'S', 'E', 'W']
There are 4 possible actions


And the other set of dimensions, which our maze also provides, is the __state space__:


In [4]:
print('Here are the dimensions of all possible states:',maze.state_space())

Here are the dimensions of all possible states: (4, 4)


_NOTE TO THE CURIOUS: it's not strictly necessary that the maze provide the dimensions of the state space. We could discover that just by exploring the space over and over. It's provided here just to simplify the example._

So we need to remember the result of each of 4 actions taken from each cell of a 4x4 grid, which makes a 4x4x4 table (64 values), like this:

In [5]:
import numpy as np     # this library does all kinds of magical things with numbers
q = np.zeros((4,4,4))  # don't worry about these details, just go with it
print(q)               # everyone calls this data structure 'q'... it probably stands for 'quality'

[[[0. 0. 0. 0.]
  [0. 0. 0. 0.]
  [0. 0. 0. 0.]
  [0. 0. 0. 0.]]

 [[0. 0. 0. 0.]
  [0. 0. 0. 0.]
  [0. 0. 0. 0.]
  [0. 0. 0. 0.]]

 [[0. 0. 0. 0.]
  [0. 0. 0. 0.]
  [0. 0. 0. 0.]
  [0. 0. 0. 0.]]

 [[0. 0. 0. 0.]
  [0. 0. 0. 0.]
  [0. 0. 0. 0.]
  [0. 0. 0. 0.]]]
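
The shape does not have to be typed in by hand. As a minimal sketch, the table can be sized from the two spaces themselves. The values below are hard-coded copies of the outputs shown earlier, on the assumption that in the notebook they would come from `maze.state_space()` and `maze.action_space()`.

```python
import numpy as np

# copied from the outputs above; in the notebook these would presumably
# come from maze.state_space() and maze.action_space()
state_space = (4, 4)
action_space = ['N', 'S', 'E', 'W']

# one slot per (row, col, action) combination
q = np.zeros(state_space + (len(action_space),))

print(q.shape)   # (4, 4, 4)
print(q.size)    # 64
```

Building the shape this way means the same line would still work if the maze were a different size.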


Let's say we want to remember that going north right away is a bad idea...

Recall that:

In [6]:
print('state  =', maze.reset())
print('result =', maze.step('N'))

state  = [0 0]
result = (array([-1,  0]), -1, True)


That means _when I was in state (0,0), and chose action 'N', I got a reward of -1._

Or, _the quality of the action 'N' from state (0,0) is, well, pretty bad._

For convenience, let's convert our actions to numbers, like this:

In [8]:
# Here is a helpful function that you may need...
# it converts N,S,E,W to 0,1,2,3

def index_of_action(action):
    return maze.action_space().index(action)

# let's see how that works
for action in maze.action_space():
    print(action, index_of_action(action))

N 0
S 1
E 2
W 3


And store our penalty from moving North like this:

In [9]:
q[0][0][index_of_action('N')] = -1   # the dimensions are [state_row][state_col][action]
print(q)

[[[-1.  0.  0.  0.]
  [ 0.  0.  0.  0.]
  [ 0.  0.  0.  0.]
  [ 0.  0.  0.  0.]]

 [[ 0.  0.  0.  0.]
  [ 0.  0.  0.  0.]
  [ 0.  0.  0.  0.]
  [ 0.  0.  0.  0.]]

 [[ 0.  0.  0.  0.]
  [ 0.  0.  0.  0.]
  [ 0.  0.  0.  0.]
  [ 0.  0.  0.  0.]]

 [[ 0.  0.  0.  0.]
  [ 0.  0.  0.  0.]
  [ 0.  0.  0.  0.]
  [ 0.  0.  0.  0.]]]
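
The maze reports each state as a small numpy array, such as `[0 0]`, so it is worth being careful when using a state as an index. The sketch below assumes states have that form and shows two ways that work, plus one that quietly does something else.

```python
import numpy as np

q = np.zeros((4, 4, 4))
state = np.array([0, 0])   # same form as the states printed by the maze
north = 0                  # index of 'N' in ['N', 'S', 'E', 'W']

# option 1: unpack the row and column explicitly
q[state[0], state[1], north] = -1

# option 2: convert the state to a tuple first
print(q[tuple(state)])     # the four action values stored for state (0, 0)

# pitfall: indexing with the array itself selects whole rows of the table
print(q[state].shape)      # (2, 4, 4), not the (4,) we wanted
```

The exercise hints at the end of this step use the first option, which keeps the row and column visible.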


Now that we can remember past results, we can enforce our one-and-only decision rule: _don't do bad things_.

(or: select from among the moves that have the highest available reward, based on past experience)
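
Here is a rough sketch of that rule, not a definitive implementation. The helper name `best_actions` is made up for this example, and the action list is a hard-coded copy of the maze's action space. It looks up the stored values for one state, keeps every action tied for the highest value, and then picks one of those at random.

```python
import numpy as np

actions = ['N', 'S', 'E', 'W']   # copied from the maze's action space

# hypothetical helper, for illustration only
def best_actions(q, row, col):
    """Return the actions whose stored value is the highest for this state."""
    values = q[row, col]
    best = np.flatnonzero(values == values.max())
    return [actions[i] for i in best]

q = np.zeros((4, 4, 4))
q[0, 0, actions.index('N')] = -1      # the penalty we recorded for going north

candidates = best_actions(q, 0, 0)
print(candidates)                     # ['S', 'E', 'W']
print(np.random.choice(candidates))   # choose one of them at random
```

Keeping the ties matters: with only one penalty recorded, three moves still look equally good, and always taking the first of them would mean never trying the others.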

Recall the shape of the maze:

In [None]:
print(maze)

If we were to store the __rewards__ or __penalties__ from every possible initial move, we would get:

In [10]:
# note that the row & col do not change...
# these are the results of 4 transitions,
# all starting from row = 0, col = 0.
q[0][0][index_of_action('N')] = -1
q[0][0][index_of_action('S')] = 0
q[0][0][index_of_action('E')] = 0
q[0][0][index_of_action('W')] = -1
print(q)

[[[-1.  0.  0. -1.]
  [ 0.  0.  0.  0.]
  [ 0.  0.  0.  0.]
  [ 0.  0.  0.  0.]]

 [[ 0.  0.  0.  0.]
  [ 0.  0.  0.  0.]
  [ 0.  0.  0.  0.]
  [ 0.  0.  0.  0.]]

 [[ 0.  0.  0.  0.]
  [ 0.  0.  0.  0.]
  [ 0.  0.  0.  0.]
  [ 0.  0.  0.  0.]]

 [[ 0.  0.  0.  0.]
  [ 0.  0.  0.  0.]
  [ 0.  0.  0.  0.]
  [ 0.  0.  0.  0.]]]


This q-table says 'hey, if you are just starting out... don't go north or west!'
<hr>
***Exercises***<p>
    
- Build a q-table that learns incrementally from 100,000 walks through the maze.

In [None]:
import numpy as np
from maze import Maze
maze = Maze()

# this converts N,S,E,W to 0,1,2,3
def index_of_action(action):
    return maze.action_space().index(action)

q = np.zeros((4,4,4))
for n in range(100000):
    state = maze.reset()
    done = False
    while not done:
        
        ####################################################################
        #                                                                  #
        #  YOUR CODE HERE:                                                 #
        #    - get a sample action from the maze                           #
        #    - use the action to take a step; capture the return values    #
        #    - update the q table by adding the reward to:                 #
        #        > row = state[0]                                          #
        #        > col = state[1]                                          #
        #        > action = index_of_action(whatever action you took)      #
        #        > q[row][col][action] += reward                           #
        #     - and don't forget to update your state                      #
        #                                                                  #
        ####################################################################

print(q)