<h1 align = 'center'>Guessing Games</h1>
<h3 align = 'center'>machine learning, one step at a time</h3>
<h3 align = 'center'>Step 8. A Random Walk While Taking Notes</h3>

**8. A random walk while taking notes.**

When we took our random walk, we ignored everything that happened except for one event: stumbling across the exit.

What if we paid attention, and learned from our mistakes?

Let's start thinking about the maze in terms of machine learning: _the art of accumulating knowledge by learning from mistakes_.

We have already seen that, when we explore the maze, it gives us feedback:

In [None]:
from maze import Maze

maze = Maze()
print(maze)
for i in range(10):           # take several random walks through the maze
    state = maze.reset()      # start each walk at the initial state
    done = False
    print('\n=== walk number ' + str(i) + ' =======================================================')
    while not done:
        action = maze.sample()
        initial_state = state
        state, reward, done = maze.step(action)
        print('started at:', initial_state, '| moved:', action, '| reward =',reward, '| done?', done)

If we wanted to learn from those results, we would need to _remember_ our rewards or penalties. We would need to store something, someplace... like taking notes in class (unless you don't take notes; if you don't take notes, it's like taking notes as if you did take notes).

If we were doing this by hand, what would we write in our notebook?

...well, we only know two things:
- our __state__ (that is, our x,y position)
- the result of taking an __action__ when starting from our __state__.

For example, here is the result of moving North immediately, and the notes we might take:

In [None]:
initial_state = maze.reset()
print('NOTE: initial state =', initial_state, '| attempting to move North...')

new_state, reward, done = maze.step('N')
print('NOTE: new_state =', new_state, '| reward =', reward, '| done ?', done)

print('NOTE: Moving North from (0,0) is a bad idea!')
print('NOTE: Why is it bad? Because I got a penalty!')

OK, so there's that.

Seems like it's important to associate ( state(0,0) + action(N) = bad idea ).

_But equally important... there's no need to consider any context (nothing above refers specifically to a maze). A bad idea is a bad idea. It doesn't matter why it's a bad idea, or what specifically is bad about it. Machine learning does not code for explicit conditions (like stepping out of bounds versus stepping onto a blocked square); we are only concerned with state transitions and outcomes, in order to find our way toward a goal or solution._

That simplifies the problem. Who needs rules? We should be able simply to remember the result of the __transition__ associated with any __current state__ and __available action__ and go from there:
<pre>
== state =====   == action============    == transition =====    == result =============
state is (0,0) + action is: move North -> new state is (0,-1) -> reward = -1, game over!
==============   =====================    ===================    =======================
</pre>
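One way to picture a single notebook entry is as a lookup keyed by the state and the action. The snippet below is only a minimal sketch: `notes` is a hypothetical plain Python dictionary (not something the maze provides), and the values are copied by hand from the transition shown above.

```python
# A hypothetical "notebook": one entry per (state, action) pair.
# The values below are copied by hand from the transition shown above.
notes = {}

state = (0, 0)    # where we started
action = 'N'      # what we tried
reward = -1       # what we got back
done = True       # the walk ended

notes[(state, action)] = (reward, done)

# Later, we can look up what happened without knowing anything about mazes.
print(notes[((0, 0), 'N')])
```

Notice that nothing in the entry mentions walls or boundaries; it only records a state, an action, and an outcome.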

We need a place to put all of our __states__, __actions__, and __rewards__ so that we can store the effect of any given __transition__.

To do that, we just need the _dimensions_ of the problem... like putting a dozen eggs into a carton that is 6x2, or those occasional have-to-be-different cartons that are 4x3.

The maze will reveal two things that will help us to discover the _dimensions_ of the maze problem: (1) the size of the __action__ space, and (2) the size of the __state__ space. These sizes are __independent of the fact that it's a maze__. We could be solving any two-dimensional problem with multiple actions (like playing tic-tac-toe, or checkers). We don't care what the __actions__ or __states__ represent; we just need to know: how many are there?

Remember: the goal is to solve the problem knowing as little as possible about the context... it's not a maze, it's just an orderly set of states and transitions.

(please re-read that last sentence)

On to finding the _dimensions_: the set of dimensions that we have seen before is the __action space__:

In [None]:
print('There are',len(maze.action_space()),'possible actions:')
print(maze.action_space())
print('\nWe might also refer to the actions by numerical value:')
for n in range(len(maze.action_space())):
    print(n,maze.action_space()[n])

And the other set of dimensions, which our maze also provides, is the __state space__:


In [None]:
print('Here are the dimensions of all possible states:',maze.state_space())

_NOTE TO THE CURIOUS: it's not strictly necessary that the maze provide the dimensions of the action or state spaces. We could discover those dynamically, by exploring the problem over and over. Those are provided here just to simplify the example._

Now we know that we have to deal with:
- 4 x 4 = 16 states, that is really...
    - 4 rows
    - 4 columns
- 4 actions

We need to be able to remember the results of any __action__ taken in any __state__, which makes 4x4x4, like this:

In [None]:
import numpy as np     # this library does all kinds of magical things with numbers
q = np.zeros((4,4,4))  # don't worry about these details, just go with it
print(q)               # everyone calls this a q-table... meaning the 'quality' of every action

This table (or something like it) is called a __q-table__... 'Q' stands for _quality_, because the table stores a cumulative qualitative measure of each action (rewards, penalties, or nothing).
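To see what "cumulative" means here, consider the small sketch below. It assumes we simply add each reward to one cell of the table; the three penalties are made-up numbers for illustration, not output from the maze.

```python
import numpy as np

q = np.zeros((4, 4, 4))

# Pretend the same action from the same state was penalized
# on three separate walks (made-up rewards, for illustration only).
for reward in [-1, -1, -1]:
    q[0][0][0] += reward

print(q[0][0][0])   # the cell now holds the running total of those rewards
```

The more often an action goes badly, the lower its entry sinks.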

It's a little hard to visualize which dimension of our __q-table__ is which... let's print it differently:

In [None]:
from maze import Maze
Maze.print_q(q)

...and yes, instead of using a 4x4x4 table, you could equivalently represent each of the 16 states as a single row, and do this...

In [None]:
print(np.zeros((16,4)))  # around 1/2 the people find this easier to understand

But we will stick with 4x4x4 for now. If you want to use 16x4 in your exercises, that's OK.
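If you are curious how the two layouts line up, here is a minimal sketch. It assumes the 16 states are numbered left to right, top to bottom, so that `state_index = row * 4 + col`; that numbering is a choice made for this example, not something the maze requires. The value 5 is just an arbitrary marker.

```python
import numpy as np

q = np.zeros((4, 4, 4))        # rows x columns x actions
q[2][3][1] = 5                 # an arbitrary marker value, for illustration only

q_flat = q.reshape(16, 4)      # one row per state, one column per action

row, col, action = 2, 3, 1
state_index = row * 4 + col    # assumed numbering: left to right, top to bottom
print(state_index, q_flat[state_index][action])
```

The marker shows up in row 11 of the 16x4 table, in the same action column as before.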


Let's say we want to remember that going North right away is a bad idea... start by moving North, then store the results in the __q-table__:

In [None]:
from maze import Maze
maze = Maze()
initial_state = maze.reset()                 # start in state (0,0)
final_state, reward, done = maze.step('N')   # take a step to the North
print('started at:', initial_state, ', moved: N to ', final_state, ', reward =',reward, ', done?', done)

That means _when I was in state (0,0), and chose action 'N', I got a reward of -1, and the game ended._

Or, the _quality_ of the action 'N' from state (0,0) is pretty bad.

Let's make a note of that in a 4x4x4 __q-table__:

In [None]:
q = np.zeros((4,4,4))
row = 0                    # initial row is zero
col = 0                    # initial col is zero
action = 0                 # N,S,E,W = 0,1,2,3... so N = 0
q[row][col][action] = -1   # remember that a bad thing happened at (0,0), action = 0 (North)
Maze.print_q(q)            # notice the -1 hiding in the upper left corner

If we were to store the __rewards__ or __penalties__ from every possible initial move, we would get:

In [None]:
# These four entries refer to the initial state: (0,0)

# R  C  A  = Row, Column, Action
q[0][0][0] = -1   # Action = N = 0
q[0][0][1] =  0   # Action = S = 1
q[0][0][2] =  0   # Action = E = 2
q[0][0][3] = -1   # Action = W = 3
Maze.print_q(q)

This __q-table__ says 'hey, if you are just starting out... don't go north or west!'
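Here is one way we might ask the table that question in code. This is only a sketch: `ACTIONS`, `avoid`, and `not_bad` are hypothetical names invented for this example, and the table is rebuilt by hand with the same two penalties as above.

```python
import numpy as np

ACTIONS = ['N', 'S', 'E', 'W']   # same order as the notes above: N,S,E,W = 0,1,2,3

q = np.zeros((4, 4, 4))
q[0][0][0] = -1                  # North from (0,0) was penalized
q[0][0][3] = -1                  # West from (0,0) was penalized

row, col = 0, 0
avoid = [ACTIONS[a] for a in range(4) if q[row][col][a] < 0]
not_bad = [ACTIONS[a] for a in range(4) if q[row][col][a] >= 0]
print('avoid:', avoid, '| not known to be bad:', not_bad)
```

A zero entry does not mean an action is good; it only means we have not recorded a penalty for it yet.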
<hr>
***Exercises***<p>
    
- Build a q-table that learns incrementally from 100,000 walks through the maze.

Before you begin, notice that the maze can provide random sample actions in either of two forms: __sample()__ provides actions as text (N,S,E,W); __sample_n()__ provides actions as numbers (0,1,2,3); the functions are interchangeable (use whichever form is more convenient):

In [None]:
import numpy as np
from maze import Maze

##################################################
#                                                #
#   Run this to see the equivalence between      #
#       maze.sample()                            #
#       maze.sample_n()                          #
#                                                #
##################################################

maze = Maze()
print('Here are sample actions in text form')
for n in range(5):
    print(maze.sample())
print('\nHere are sample actions in numeric form')  
for n in range(5):
    print(maze.sample_n())
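If you end up holding one form and needing the other, a small lookup can translate between them. The sketch below assumes the ordering noted earlier (N,S,E,W = 0,1,2,3); `ACTIONS`, `to_number`, and `to_text` are hypothetical helpers written for this example, not part of the maze.

```python
# Assumed ordering, taken from the earlier note: N,S,E,W = 0,1,2,3
ACTIONS = ['N', 'S', 'E', 'W']

def to_number(action_text):
    """Turn 'N', 'S', 'E' or 'W' into 0, 1, 2 or 3."""
    return ACTIONS.index(action_text)

def to_text(action_number):
    """Turn 0, 1, 2 or 3 back into 'N', 'S', 'E' or 'W'."""
    return ACTIONS[action_number]

print(to_number('E'), to_text(3))
```

The numeric form is the one that can be used directly as the third index into the q-table.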

***Here is the actual exercise... you could complete it using either form of sampling the maze:***

In [None]:
import numpy as np
from maze import Maze

maze = Maze()
q = np.zeros((4,4,4))
for n in range(100000):
    state = maze.reset()
    done = False
    while not done:
        
        ####################################################################
        #                                                                  #
        #  YOUR CODE HERE:                                                 #
        #    - get a sample action from the maze                           #
        #    - use the action to take a step; capture the return values    #
        #    - update the q table by adding the reward to:                 #
        #        > row = state[0]                                          #
        #        > col = state[1]                                          #
        #        > action = numerical value of whatever action you took    #
        #        > q[row][col][action] += reward                           #
        #     - and don't forget to update your state                      #
        #                                                                  #
        ####################################################################

Maze.print_q(q)