<h1 align = 'center'>Guessing Games</h1>
<h3 align = 'center'>machine learning, one step at a time</h3>
<h3 align = 'center'>Step 9. Exploration v. Exploitation</h3>

**9. Exploration v. exploitation**

_Reinforcement learning_ requires two steps: __exploration__ and __exploitation__.

- __Exploration__: where am I? what actions can I take? what happens when I take an action?
- __Exploitation__: can I use my past knowledge of penalties and rewards to solve the problem?

We _explored_ the maze by randomly examining it many times. How can we _exploit_ those results to find our way quickly and easily?

Let's go back to building our q table:

In [26]:
import numpy as np
from maze import Maze
maze = Maze()

# this converts N,S,E,W to 0,1,2,3
def index_of_action(action):
    return maze.action_space().index(action)

q = np.zeros((4,4,4))
for n in range(1000):
    state = maze.reset()
    done = False
    while not done:
        action = maze.sample()
        new_state, reward, done = maze.step(action)
        row = state[0]
        col = state[1]
        q[row][col][index_of_action(action)] += reward
        state = new_state
        
print(q)

[[[-294.    0.    0. -308.]
  [ -65.  -76.    0.    0.]
  [ -21.    0.  -19.    0.]
  [   0.    0.    0.    0.]]

 [[   0.    0.  -74.  -77.]
  [   0.    0.    0.    0.]
  [   0.   -3.    0.   -4.]
  [   0.    0.    0.    0.]]

 [[   0.  -25.    0.  -17.]
  [  -4.   -4.   -9.    0.]
  [   0.    0.    0.    0.]
  [   0.    0.    0.    0.]]

 [[   0.    0.    0.    0.]
  [   0.    0.    0.    0.]
  [   0.    0.    0.    0.]
  [   0.    0.    0.    0.]]]


Here are the results for (0,0), marked up, just to make things clear:

<pre>
<font color='blue'>           N     S     E     W</font>
<font color='blue'>(0,0)</font>[[[-294.    0.    0. -308.]</pre>
Around half the attempts ended the exploration by going out of bounds (by stepping either north or west). Doesn't that seem a little wasteful?

Or put another way: for any given state, if an action has resulted in penalties in the past, can we avoid that action in the future?
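
As a rough sketch of that idea, the snippet below takes the row for (0,0) from the q table printed above and lists the actions that have never been penalized. The labels `N`, `S`, `E`, `W` and the name `safe_actions` are assumptions made for illustration; the real labels are whatever `maze.action_space()` returns.

```python
import numpy as np

# q values for state (0,0), copied from the q table printed above
q_row = np.array([-294., 0., 0., -308.])

# assumed labels, in the same order as the q table columns
actions = ['N', 'S', 'E', 'W']

# keep only the actions that have never produced a penalty
safe_actions = [a for a, value in zip(actions, q_row) if value >= 0]
print(safe_actions)  # ['S', 'E']
```

Filtering like this only says which actions look safe so far; it does not say which one to take.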

We would like to use the q table to find our way, based on past results... but first! we need the amazing power of _argmax_:

In [6]:
import numpy as np

# here is an array of rewards...
a = [-100,-200,0,-50]

# I wish there was a function to
# tell me the index of the entry
# that has the maximum value...
# in this case, what is the index
# of the entry with a value of
# zero, indicating no penalties?

# enter argmax! which returns the
# index of the maximum value. wow.
print(np.argmax(a))

2
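
One detail matters for our q table: when several entries tie for the maximum, `np.argmax` returns the index of the first one. The short check below uses the row for (0,0), which has two zeros. The variable names are just for illustration.

```python
import numpy as np

# q values for state (0,0), copied from the q table printed earlier
q_row = [-294., 0., 0., -308.]

# argmax reports only the first index that holds the maximum
first_best = int(np.argmax(q_row))
print(first_best)  # 1

# to see every index that ties for the maximum, compare against the max
all_best = np.flatnonzero(np.array(q_row) == np.max(q_row))
print(all_best)  # [1 2]
```

So at (0,0), argmax will always choose index 1, even though index 2 looks just as good.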


OK, good. Now that we have a populated q table and the awesome power of _argmax_, let's traverse the maze:

In [29]:
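# here is how we might take the first step...

state = maze.reset()   # this returns 0,0
row = state[0]
col = state[1]
print('state', state, 'row =', row, 'col =', col)
print('q[row][col]', q[row][col])

# ...and here is how we might keep going, always taking the best-looking action
state = maze.reset()
done = False
steps = 0
while not done and steps < 100:   # cap the steps: a purely greedy walk may never finish
    row = state[0]
    col = state[1]
    action_index = np.argmax(q[row][col])
    # action_space() takes no arguments, so index into the list it returns
    action = maze.action_space()[action_index]
    # keep the new state and the done flag, or the loop never moves on
    state, reward, done = maze.step(action)
    steps += 1

state [0 0] row = 0 col = 0
q[row][col] [-294.    0.    0. -308.]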
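
To see what always following argmax does with the q table above, here is a toy walk that does not use the `Maze` class at all. It is a sketch under assumptions: the `moves` list assumes N and S change the row by -1 and +1, that E and W change the column by +1 and -1, and that stepping south from (0,0) lands on (1,0). The real maze may behave differently.

```python
import numpy as np

# q values copied from the q table above, for the two states this walk visits
q = {
    (0, 0): np.array([-294., 0., 0., -308.]),
    (1, 0): np.array([0., 0., -74., -77.]),
}

# assumed moves for N, S, E, W as (row change, col change)
moves = [(-1, 0), (1, 0), (0, 1), (0, -1)]

state = (0, 0)
path = [state]
for _ in range(6):   # cap the walk, because it never finishes on its own
    best = int(np.argmax(q[state]))   # always take the best-looking action
    d_row, d_col = moves[best]
    state = (state[0] + d_row, state[1] + d_col)
    path.append(state)

print(path)
# [(0, 0), (1, 0), (0, 0), (1, 0), (0, 0), (1, 0), (0, 0)]
```

Under these assumptions the walk bounces between (0,0) and (1,0): south looks best from (0,0), north looks best from (1,0), and nothing ever changes that.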

(Ever feel like someone is asking you the same question in different ways because they are trying to drive home a point?)

__Here is the point:__
- if we always pick an action at random, we ignore everything that we have learned.
- if we always pick the best action based on history, we never try anything new, so we never build the history we need to choose from.
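
One common way to sit between these two extremes is usually called epsilon-greedy: pick at random some fraction of the time and pick the best-known action otherwise. The sketch below is not taken from this series; the function `choose_action_index` and the parameter `epsilon` are hypothetical names used only to make the trade-off concrete.

```python
import numpy as np

rng = np.random.default_rng(0)

def choose_action_index(q_row, epsilon):
    """Pick a random index with probability epsilon, else the best-known one."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_row)))   # explore: any action
    return int(np.argmax(q_row))               # exploit: best so far

# q values for state (0,0), copied from the q table printed earlier
q_row = [-294., 0., 0., -308.]

# epsilon = 0.0 never explores, so it always returns the argmax
print(choose_action_index(q_row, epsilon=0.0))  # 1
```

With `epsilon=1.0` the function behaves like the random exploration used to build the q table, and with `epsilon=0.0` it behaves like the argmax walk. Values in between mix the two.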
