<h1 align = 'center'>Guessing Games</h1>
<h3 align = 'center'>machine learning, one step at a time</h3>
<h3 align = 'center'>Step 11. A cookie tomorrow, or today?</h3>

**11. A cookie tomorrow, or today?**

What's more valuable: a cookie tomorrow, or a cookie today?

Here is a q-table (showing only the cumulative rewards and excluding the penalties) from an exploration of the maze using the algorithm from the last lesson. In this case, one of the random steps happened to go from east to west, then back again, and in doing so accidentally captured the _future reward_ for a meaningless move, potentially creating a loop in our optimum path:
<pre>
=====  ================================
state         N       S       E       W

(0,0)                     1.000        
(0,1)                     1.000   1.000    < - Why am I indifferent between choosing East or West?
(0,2)             1.000           1.000
(0,3)                                  

(1,0)     1.000                        
(1,1)                                  
(1,2)                     2.000        
(1,3)             2.000           1.000

(2,0)                                  
(2,1)                                  
(2,2)                                  
(2,3)             1.000                

(3,0)                                  
(3,1)                                  
(3,2)                                  
(3,3)                                  
                               

</pre>

Why did that happen? Because in our algorithm, _all rewards are created equal_: we give equal weight to a _current reward_ (finding the exit) and a _future reward_ (finding a step that leads toward a later step that finds the exit). If we happen to move back and forth between two cells, each move adds the other cell's positive reward to its own, falsely creating a loop and potentially amplifying the reward with every round trip.
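To see the amplification in isolation, here is a minimal sketch that is separate from the maze code. It assumes two hypothetical neighboring cells, A and B, where B already knows a move worth 1.0, and it applies the same "add the best future reward at full value" rule each time we bounce between them. The function and variable names are made up for this illustration:

```python
def bounce(round_trips):
    """Bounce between two hypothetical cells, crediting future rewards in full."""
    a_east = 0.0    # reward recorded at A for stepping east to B
    b_south = 1.0   # B's genuinely useful move (assumed to lead toward the exit)
    b_west = 0.0    # reward recorded at B for stepping back west to A
    for _ in range(round_trips):
        # A -> B: add B's best known reward to A's "east" entry
        a_east += max(b_south, b_west)
        # B -> A: add A's best known reward to B's "west" entry
        b_west += a_east
    return a_east, b_south, b_west

print(bounce(1))   # (1.0, 1.0, 1.0)
print(bounce(2))   # (2.0, 1.0, 3.0)
```

After one round trip, B is indifferent between its useful move and the step back west. After two, the step back west looks three times as good as the useful move.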

That problem is easily solved by _discounting future rewards_. __Discounting__ means "giving less than 100% credit to a reward that you expect in the future."

**Or: a cookie tomorrow is a good thing, but it's worth less than a cookie today.**
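As a small illustration (a sketch, not the maze code), suppose every step between you and the reward halves its value. The `discounted_value` function, the `discount` name, and the 0.5 rate are assumptions chosen for this example:

```python
def discounted_value(reward, steps_away, discount=0.5):
    """Value today of a reward that is `steps_away` steps in the future."""
    return reward * discount ** steps_away

# a reward of 1.0 seen from 0, 1, 2, ... 5 steps away
values = [discounted_value(1.0, n) for n in range(6)]
print(values)   # [1.0, 0.5, 0.25, 0.125, 0.0625, 0.03125]
```

The farther away the cookie, the less it is worth today.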

Here is a version of our solution from the prior lesson, changed to keep solving the maze until the best reward recorded at the starting cell (0,0) exceeds 10 (potentially amplifying rewards along the way). Run it several times to see __q-tables__ that have potential loops in the resulting rewards:

In [None]:
%matplotlib inline
import numpy as np
from maze import Maze

maze = Maze()
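q = np.zeros((4,4,4))                        # one value per (row, column, action), all starting at 0
solved = False
while not solved:
    state = maze.reset()
    done = False
    while not done:
        action = maze.sample_n()             # take a random step
        new_state, reward, done = maze.step(action)
        # credit the current reward for this move
        q[state[0]][state[1]][action] += reward
        # if the new state is still on the 4x4 grid, also add its best
        # known reward at full (undiscounted) value
        if max(new_state) < 4 and min(new_state) >= 0:
            q[state[0]][state[1]][action] += max(q[new_state[0]][new_state[1]])
        state = new_state
        solved = max(q[0][0]) > 10           # stop once the best reward at (0,0) exceeds 10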

Maze.print_q(np.maximum(q,0), mode='rewards')

<hr>
***Exercises***<p>
Correct the program to value future rewards at 1/2 the maximum value of the reward available for the future step.

In [None]:
%matplotlib inline
import numpy as np
from maze import Maze

###############################################
#                                             #
#  Correct the program to conform to the      #
#  sample output (below). Hint: you can       #
#  complete this exercise by typing three     #
#  or four characters into an existing line   #
#  of code.                                   #
#                                             #
###############################################

'''
SAMPLE OUTPUT

Here is sample output when the future reward
is valued at 1/2 the maximum value of the next
step:

=====  ================================
state         N       S       E       W

(0,0)                     0.031        
(0,1)                     0.062   0.016
(0,2)             0.125                
(0,3)                                  
-----  --------------------------------
(1,0)                                  
(1,1)                                  
(1,2)                     0.250        
(1,3)             0.500                
-----  --------------------------------
(2,0)                                  
(2,1)                                  
(2,2)                                  
(2,3)             1.000                
-----  --------------------------------
(3,0)                                  
(3,1)                                  
(3,2)                                  
(3,3)                                  
-----  --------------------------------
'''

maze = Maze()
q = np.zeros((4,4,4))
solved = False
while not solved:
    state = maze.reset()
    done = False
    while not done:
        action = maze.sample_n()                         
        new_state, reward, done = maze.step(action)
        q[state[0]][state[1]][action] += reward
        if max(new_state) < 4 and min(new_state) >= 0:
            q[state[0]][state[1]][action] += max(q[new_state[0]][new_state[1]])
        state = new_state
        solved = True if max(q[0][0]) > 0 else False   # run until the rewards pile up at (0,0)

Maze.print_q(np.maximum(q,0), mode='rewards')