<h1 align = 'center'>Guessing Games</h1>
<h3 align = 'center'>machine learning, one step at a time</h3>
<h3 align = 'center'>Step 8. A Random Walk While Taking Notes</h3>

**8. A random walk while taking notes.**

When we took our random walk, we ignored everything that happened except for one event: stumbling across the exit.

What if we paid attention, and learned from our mistakes?

Let's start thinking about the maze in terms of machine learning: _the art of accumulating knowledge by learning from mistakes_.

We have already seen that, when we explore the maze, it gives us feedback:

In [None]:
from maze import Maze

# take several random walks through the maze
maze = Maze()
print(maze)
for i in range(10):
    state = maze.reset()      # go back to the initial state
    done = False
    print('\n=== walk number ' + str(i) + ' =======================================================')
    while not done:
        action = maze.sample()
        initial_state = state
        state, reward, done = maze.step(action)
        print('started at:', initial_state, '| moved:', action, '| reward/penalty =',reward, '| done?', done)

If we wanted to learn from those results, we would need to _remember_ our rewards or penalties. We would need to store something, someplace... like taking notes in class (unless you don't take notes; in that case, imagine that you did).

If we were doing this by hand, what would we write in our notebook?

...well, we only know two things:
- our __state__ (that is, our x,y position)
- the result of taking an __action__ when starting from our __state__.

For example, here is the result of moving north immediately, and the notes we might take:

In [None]:
initial_state = maze.reset()
print('NOTE: initial state =', initial_state, '| attempting to move North...')

new_state, reward, done = maze.step('N')
print('NOTE: new_state =', new_state, '| reward/penalty =', reward, '| done ?', done)

print('NOTE: Moving North from (0,0) is a bad idea!')
print('NOTE: Why is it bad? Because I got a penalty!')

OK, so there's that.

Seems like we should associate ( state(0,0) + action(N) = bad idea ).

_But equally important... there's no need to look for context. A bad idea is a bad idea. It doesn't matter why it's bad, because the only results are: good, bad, indifferent. Machine learning does not code for explicit conditions (like stepping out of bounds); we are only concerned with state transitions and outcomes._

That simplifies the problem. We should be able to remember the outcome associated with any __current state__ and __available action__ (this is called a __transition__, because the __action__ causes us to transit from one __state__ to another):
<pre>
== state =====    == transition =======    == next state =============
state is (0,0) -> action is: move North -> state is (0,-1), game over!
==============    =====================    ===========================
</pre>
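
As a rough sketch of what one such note could look like, here is a plain Python dictionary keyed by the (state, action) pair. The name `notes` is made up for this illustration, and the value of -1 is an assumed penalty for stepping out of bounds; neither comes from the maze itself.

```python
# A hypothetical notebook: one entry per (state, action) pair.
notes = {}

state = (0, 0)    # where we started
action = 'N'      # what we tried
reward = -1       # ASSUMPTION: the penalty for stepping out of bounds

# Write the note down.
notes[(state, action)] = reward

# Later, look up what we learned.
# Pairs we have never tried default to 0 (indifferent).
print(notes.get(((0, 0), 'N'), 0))
print(notes.get(((0, 0), 'S'), 0))
```

Notice that the note says nothing about _why_ the move was bad. It only records the pair and the outcome.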

We need a place to put all of our __states__, __rewards__ and __penalties__ that can store the effect of any given __transition__.

To do that, we just need the _dimensions_... like putting a dozen eggs into a carton that is 6x2, or those occasional have-to-be-different cartons that are 4x3.

The maze will reveal two things that will help us to discover the _dimensions_ of the maze problem, which are __independent of the fact that it's a maze__. A big maze could be 1000x1000, and a maze that lets you move diagonally could have 8 actions (N, NE, E, SE, S, SW, W, NW). We don't care what those things represent; we just need to know: how many are there?

(please re-read that last sentence)
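
To see why only the counts matter, here is a small sketch. The helper `make_table` is hypothetical (it is not part of the maze); it builds a block of zeros from nothing but the number of states along each dimension and the number of actions.

```python
import numpy as np

def make_table(state_dims, n_actions):
    """Hypothetical helper: one slot per (state, action) pair."""
    return np.zeros(tuple(state_dims) + (n_actions,))

# Two egg cartons with different shapes, same number of slots.
carton_a = np.zeros((6, 2))
carton_b = np.zeros((4, 3))
print(carton_a.size, carton_b.size)

# A big maze with diagonal moves: 1000 x 1000 states, 8 actions.
big = make_table((1000, 1000), 8)
print(big.shape)
```

The helper never asks what a state or an action _means_. It only asks how many there are.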

One set of dimensions, which we have seen before, is the __action space__:

In [None]:
print(maze.action_space())
print('There are',len(maze.action_space()),'possible actions')

And the other set of dimensions, which our maze also provides, is the __state space__:


In [None]:
print('Here are the dimensions of all possible states:',maze.state_space())

_NOTE TO THE CURIOUS: it's not strictly necessary that the maze provide the dimensions of the state space. We could discover that just by exploring the space over and over. It's provided here just to simplify the example._
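
For the curious, here is one way that discovery might look. This is only a sketch: `ToyGrid` is a hypothetical stand-in written for this illustration (it is NOT the real `Maze` class, and its rules are assumptions). We wander at random many times, write down every state we visit, and read the dimensions off the largest row and column we ever saw.

```python
import random

class ToyGrid:
    """Hypothetical stand-in for a maze; NOT the real Maze class."""
    MOVES = {'N': (-1, 0), 'S': (1, 0), 'E': (0, 1), 'W': (0, -1)}

    def __init__(self, rows, cols):
        self.rows, self.cols = rows, cols
        self.state = (0, 0)

    def reset(self):
        self.state = (0, 0)
        return self.state

    def step(self, action):
        d_row, d_col = self.MOVES[action]
        row, col = self.state[0] + d_row, self.state[1] + d_col
        if not (0 <= row < self.rows and 0 <= col < self.cols):
            # ASSUMPTION: stepping out of bounds ends the walk with a penalty
            return self.state, -1, True
        self.state = (row, col)
        return self.state, 0, False

rng = random.Random(0)       # fixed seed so the walk is repeatable
grid = ToyGrid(3, 5)         # pretend we do not know these numbers
seen = set()

for _ in range(20000):
    state = grid.reset()
    seen.add(state)
    done = False
    while not done:
        state, reward, done = grid.step(rng.choice('NSEW'))
        seen.add(state)

# The largest row and column we visited tell us the dimensions.
rows = max(row for row, col in seen) + 1
cols = max(col for row, col in seen) + 1
print('discovered dimensions:', (rows, cols))
```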

So we need to be able to remember the results of any of 4 actions taken in a 4x4 space, which makes: 4x4x4, like this:

In [None]:
import numpy as np     # this library does all kinds of magical things with numbers
q = np.zeros((4,4,4))  # don't worry about these details, just go with it
print(q)               # everyone calls this a q-table... will explain later

There's our notebook... it works like this:
<pre>
[[[0. 0. 0. 0.]    row 0, col 0, actions 0,1,2,3
  [0. 0. 0. 0.]    row 0, col 1, actions 0,1,2,3
  [0. 0. 0. 0.]    row 0, col 2, actions 0,1,2,3
  [0. 0. 0. 0.]]   row 0, col 3, actions 0,1,2,3

 [[0. 0. 0. 0.]    row 1, col 0, actions 0,1,2,3
  [0. 0. 0. 0.]
  [0. 0. 0. 0.]
  [0. 0. 0. 0.]]

 [[0. 0. 0. 0.]    row 2, col 0, actions 0,1,2,3
  [0. 0. 0. 0.]
  [0. 0. 0. 0.]
  [0. 0. 0. 0.]]

 [[0. 0. 0. 0.]    row 3, col 0, actions 0,1,2,3
  [0. 0. 0. 0.]
  [0. 0. 0. 0.]
  [0. 0. 0. 0.]]]
</pre>
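
If the layout above is hard to picture, this short sketch pokes at the same table with indexes. It assumes only NumPy and the `(4,4,4)` shape used above.

```python
import numpy as np

q = np.zeros((4, 4, 4))   # the dimensions are [row][col][action]

print(q.shape)        # three dimensions: rows, cols, actions
print(q[0][0])        # the 4 action slots for row 0, col 0
print(q[2][3][1])     # one single slot: row 2, col 3, action 1
print(q.size)         # total number of slots in the notebook
```

Each extra index narrows things down: one index picks a row, two pick a cell, three pick one action within that cell.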

Let's say we want to remember that going north right away is a bad idea...

Recall that:

In [None]:
print('state  =', maze.reset())
print('result =', maze.step('N'))

That means _when I was in state (0,0), and chose action 'N', I got a reward of -1._

Or, the quality of the action 'N' from state (0,0) is, well, pretty bad (the name 'q' stands for 'quality').

For convenience, let's convert our actions to numbers, like this:

In [None]:
# Here is a helpful function that you may need...
# it converts N,S,E,W to 0,1,2,3

def index_of_action(action):
    return maze.action_space().index(action)

# let's see how that works
for action in maze.action_space():
    print(action, index_of_action(action))

And store our penalty from moving North in our __q-table__ like this:

In [None]:
q[0][0][index_of_action('N')] = -1   # the dimensions are [state_row][state_col][action]
print(q)

Now that we can remember past results, we can enforce our one-and-only decision rule: _don't do bad things_.

(or: select from among the moves that have the highest available reward, based on past experience)
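
Here is a minimal sketch of that rule. The function names `best_actions` and `choose_action` are made up for this illustration, and the code assumes that index 0 stands for 'N'. It finds every action tied for the highest value at a state, then picks one of them at random.

```python
import numpy as np

rng = np.random.default_rng()

def best_actions(q, row, col):
    """Indexes of every action tied for the highest value at (row, col)."""
    values = q[row][col]
    return np.flatnonzero(values == values.max())

def choose_action(q, row, col):
    """Pick at random from among the best-known actions."""
    return int(rng.choice(best_actions(q, row, col)))

q = np.zeros((4, 4, 4))
q[0][0][0] = -1    # ASSUMPTION: index 0 is 'N', which earned a penalty

print(best_actions(q, 0, 0))    # every action except the penalized one
print(choose_action(q, 0, 0))   # one of those, chosen at random
```

Breaking ties at random matters here: most of the table is still zeros, so always taking the first best action would send us the same way every time.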

If we were to store the __rewards__ or __penalties__ from every possible initial move, we would get:

In [None]:
# note that the row & col do not change...
# these are the results of 4 transitions,
# all starting from row = 0, col = 0.
q[0][0][index_of_action('N')] = -1
q[0][0][index_of_action('S')] = 0
q[0][0][index_of_action('E')] = 0
q[0][0][index_of_action('W')] = -1
print(q)

This q-table says 'hey, if you are just starting out... don't go north or west!'
<hr>
***Exercises***<p>
    
- Build a q-table that learns incrementally from 100,000 walks through the maze.

In [None]:
import numpy as np
from maze import Maze
maze = Maze()

# this converts N,S,E,W to 0,1,2,3
def index_of_action(action):
    return maze.action_space().index(action)

q = np.zeros((4,4,4))
for n in range(100000):
    state = maze.reset()
    done = False
    while not done:
        
        ####################################################################
        #                                                                  #
        #  YOUR CODE HERE:                                                 #
        #    - get a sample action from the maze                           #
        #    - use the action to take a step; capture the return values    #
        #    - update the q table by adding the reward to:                 #
        #        > row = state[0]                                          #
        #        > col = state[1]                                          #
        #        > action = index_of_action(whatever action you took)      #
        #        > q[row][col][action] += reward                           #
        #     - and don't forget to update your state                      #
        #                                                                  #
        ####################################################################

print(q)