<h1 align = 'center'>Guessing Games</h1>
<h3 align = 'center'>machine learning, one step at a time</h3>
<h3 align = 'center'>Step 9. Exploration v. Exploitation</h3>

**9. Exploration v. exploitation**

_Reinforcement learning_ is a balancing act between __exploration__ and __exploitation__:

- __Exploration__: where am I? what actions can I take? what happens when I take an action?
- __Exploitation__: can I use my past knowledge of penalties and rewards to go in the right direction?

We _explored_ the maze by randomly examining it many times. How can we _exploit_ those results to find our way quickly and easily?

Let's go back to building our __q-table__:

In [2]:
import numpy as np
from maze import Maze
maze = Maze()

# for this lesson, it will be easier if our sample
# actions are 0,1,2,3 instead of N,S,E,W
def sample(maze):
    action = maze.sample()                    # this returns N,S,E,W
    return maze.action_space().index(action)  # this converts to 0,1,2,3

# run the maze lots of times; take note of every result in a q-table
q = np.zeros((4,4,4))
for n in range(1000):
    state = maze.reset()
    done = False
    while not done:
        action = sample(maze)                        # return a random action (0,1,2,3)  
        new_state, reward, done = maze.step(action)  # takes a step based upon the random action
        q[state[0]][state[1]][action] += reward      # makes note of the resulting transition
        state = new_state                            # ...and switches to the new state
        
print(q)

[[[-31.   0.   0. -27.]
  [-11.  -7.   0.   0.]
  [ -1.   0.  -1.   0.]
  [  0.   0.   0.   0.]]

 [[  0.   0.  -7. -10.]
  [  0.   0.   0.   0.]
  [  0.   0.   0.  -1.]
  [  0.   0.   0.   0.]]

 [[  0.  -1.   0.  -2.]
  [  0.  -1.   0.   0.]
  [  0.   0.   0.   0.]
  [  0.   0.   0.   0.]]

 [[  0.   0.   0.   0.]
  [  0.   0.   0.   0.]
  [  0.   0.   0.   0.]
  [  0.   0.   0.   0.]]]


Here are typical results for (0,0), marked up, just to make things clear:
<pre>
state |   N   |  S  |  E  |   W
(0,0) | -294. |  0. |  0. | -308.
</pre>
Around half the attempts ended right away, by going out of bounds (by stepping either north or west). Doesn't that seem a little wasteful?

Or put another way: for any given state, if an action has resulted in penalties in the past, can we avoid that action in the future?

We would like to use the q-table to find our way, based on past results... but first, we need the awesome power of _argmax_:

In [None]:
import numpy as np

# here is an array of rewards...
a = [-100,-200,0,-50]

# I wish there was a function to tell me 
# the index of the entry that has the 
# maximum value... in this case, what is 
# the index of the entry with a value of 
# zero, indicating no penalties?

# enter argmax! which returns the index of 
# the maximum value. wow. just wow.
print(np.argmax(a))
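
One detail worth knowing before we lean on _argmax_: when several entries tie for the maximum, `np.argmax` returns the index of the first one. Here is a small sketch using the row for state (0,0) copied from the sample q-table printed earlier (your own numbers will differ from run to run):

```python
import numpy as np

# q-values for state (0,0), copied from the sample q-table printed earlier
# (action order is N, S, E, W, i.e. 0, 1, 2, 3)
row = np.array([-31.0, 0.0, 0.0, -27.0])

best = np.argmax(row)                     # index of the first maximum
tied = np.flatnonzero(row == row.max())   # indices of every maximum

print(best)   # 1 (S), even though 2 (E) is just as good
print(tied)   # [1 2]
```

So argmax never picks at random between equally good actions; it always settles on the lowest index.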

OK good. Now that we have a fully populated __q-table__ and the awesome power of _argmax_, let's traverse the maze:

In [None]:
# here is how we might take the first step...
state = maze.reset()

row = state[0]
col = state[1]
action = np.argmax(q[row][col])         # it's easier to use 0,1,2,3 instead of N,S,E,W for our action...
state, reward, done = maze.step(action) # ...those values are interchangeable when calling step(action)
print(maze.action_space()[action], state, reward, done)
print(maze)

That's a start... we will always take a first step that avoids a penalty.

How about taking 10 steps?

In [None]:
state = maze.reset()
for n in range(10):
    row = state[0]
    col = state[1]
    action = np.argmax(q[row][col])
    state, reward, done = maze.step(action)
    print(maze.action_space()[action], state, reward, done)

Well... we avoided penalties. Unfortunately, we got caught in a loop, and did not go anywhere.
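
Why the loop? Here is a rough sketch that replays the greedy walk without the `Maze` class, using two rows copied from the sample q-table printed earlier. The `moves` list is an assumption about how N, S, E, W change (row, col); it is not taken from the `maze` module, and the numbers come from one sample run only.

```python
import numpy as np

# q-values copied from the sample q-table printed earlier
q = np.zeros((4, 4, 4))
q[0][0] = [-31, 0, 0, -27]   # state (0,0)
q[1][0] = [0, 0, -7, -10]    # state (1,0)

# ASSUMPTION: how each action changes (row, col); order is N, S, E, W
moves = [(-1, 0), (1, 0), (0, 1), (0, -1)]

state = (0, 0)
path = [state]
for n in range(6):
    action = int(np.argmax(q[state[0]][state[1]]))   # always the first best action
    d_row, d_col = moves[action]
    state = (state[0] + d_row, state[1] + d_col)
    path.append(state)

print(path)
```

Under these assumptions, argmax picks S at (0,0) and N at (1,0), so the walk bounces between those two states forever. Pure exploitation has no reason to ever try anything else.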

What if we tried to balance __exploration__ and __exploitation__?

For example, as a first attempt: what if we moved at random, but discarded moves that have penalties? (Or more formally: what if we discarded __transitions__ that have negative __q-values__?)
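
To make the idea concrete for a single state, here is a minimal sketch. The helper `pick_safe_action` is hypothetical (it is not part of the `maze` module), and it filters the row directly instead of re-sampling. It also falls back to any action when every q-value is negative, since a "keep sampling until non-negative" loop would never finish in that case.

```python
import numpy as np

rng = np.random.default_rng(0)

def pick_safe_action(q_row, rng):
    """Hypothetical helper: random action among those with no recorded penalty."""
    q_row = np.asarray(q_row)
    safe = np.flatnonzero(q_row >= 0)     # actions whose q-value is not negative
    if safe.size == 0:                    # every action was penalized...
        safe = np.arange(q_row.size)      # ...so fall back to all actions
    return int(rng.choice(safe))

# q-values for state (0,0), copied from the sample q-table printed earlier
row = [-31, 0, 0, -27]
picks = {pick_safe_action(row, rng) for _ in range(100)}
print(picks)   # only actions 1 (S) and 2 (E) can appear
```

This covers one state only; walking the whole maze this way is left for the exercise.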

<hr>
***Exercises***<p>
- traverse the maze using random actions, but avoid actions with negative q-values
- note! that won't work unless you allow the code, as provided, to populate a q-table in advance
- see the guidance in the code sample

In [None]:
import numpy as np
from maze import Maze
'''
=== DO NOT CHANGE CODE STARTING HERE ====== POLICE LINE DO NOT CROSS ===========================
'''
maze = Maze()

# convenient function to convert sample actions from N,S,E,W to 0,1,2,3
def sample(maze):
    action = maze.sample()                  
    return maze.action_space().index(action)

# build a q-table
q = np.zeros((4,4,4))
for n in range(10000):
    state = maze.reset()
    done = False
    while not done:
        action = sample(maze)                        # return a random action (0,1,2,3)  
        new_state, reward, done = maze.step(action)  # takes a step based upon the random action
        q[state[0]][state[1]][action] += reward      # makes note of the resulting transition
        state = new_state                            # ...and switches to the new state
'''
=== AND ENDING HERE ======================= POLICE LINE DO NOT CROSS ===========================
'''

###############################################################################
#                                                                             #
#  YOUR CODE GOES HERE...                                                     #
#     ...something like                                                       #
#             state = maze.reset()                                            #
#             done = False                                                    #
#             while not done:                                                 #
#                 get a sample action                                         #
#                 while the q-value for the action is negative                #
#                     get a different sample action                           #
#                 use maze.step(action) to take step & update state           #
#                                                                             #
###############################################################################

print(maze)