<h1 align = 'center'>Guessing Games</h1>
<h3 align = 'center'>machine learning, one step at a time</h3>
<h3 align = 'center'>Step 12. Convergence</h3>

**12. Convergence**<p>

<font face=Times size=3><bold><blockquote>
_In mathematics, computer science and logic, convergence refers to the idea that different sequences of transformations come to a conclusion in a finite amount of time (the transformations are terminating), and that the conclusion reached is independent of the path taken to get to it (they are confluent)._<p>

Franz Baader; Tobias Nipkow (1998). _Term Rewriting and All That._ Cambridge University Press (via Wikipedia)
</blockquote></bold></font>

Let's balance _exploration_ and _exploitation_ by defining a constant called __explore__. Each time we take a step in the maze, we draw a random number between 0 and 1 and compare it to __explore__ to decide whether to _explore_ or _exploit_, something like this:
<pre>
if random_number < explore:
    explore the maze using sample(), as in the prior examples
else:
    use everything that we have learned to pick the best next step
</pre>

If __explore__ = 0.10, we will explore the maze 10% of the time, and rely on our results 90% of the time. The idea is to learn as we go, eventually finding the exit on every attempt. The mathematical journey of exploring many possibilities that, in time, resolve to a single solution is an example of _convergence_.
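
The explore-or-exploit decision can be tried on its own, without the maze. The sketch below is only an illustration: `choose_action`, the sample `q_values`, and the seeded random generator are made up for this example and are not part of the `maze` module.

```python
import numpy as np

def choose_action(q_values, explore, rng):
    """Return (action, explored) for one step.

    With probability `explore`, pick a random action; otherwise
    pick the action with the highest q-value.
    """
    if rng.random() < explore:
        return int(rng.integers(len(q_values))), True
    return int(np.argmax(q_values)), False

rng = np.random.default_rng(0)               # seeded so the run is repeatable
q_values = np.array([0.0, 2.0, 0.5, 1.0])    # made-up values for one square
explore = 0.10

choices = [choose_action(q_values, explore, rng) for _ in range(10_000)]
explored_fraction = sum(explored for _, explored in choices) / len(choices)
print(f"explored on {explored_fraction:.1%} of steps")
```

Over many steps the explored fraction should land close to the value of __explore__, and every exploiting step picks the action with the largest value.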

In the code sample below, try different values for __discount__ and __explore__ to see whether the results begin to converge. What happens to runs that find the exit?

In [None]:
%matplotlib inline
import numpy as np
from maze import Maze

##########################################################################
#                                                                        #   
#  RUN SOME EXPERIMENTS BY CHANGING THE VALUES FOR discount AND explore  #
#  don't get discouraged if the values go a little crazy                 #
#                                                                        #
#  The last lesson effectively used:                                     #
#      discount = 0.5                                                    #
#      explore  = 1.0                                                    #
#                                                                        #
discount = 0.50                                                         
explore  = 0.50                                                         
#                                                                        #
##########################################################################

maze = Maze(record=True)
q = np.zeros((4,4,4))

for n in range(1000):
    state = maze.reset()
    done = False
    while not done:     
        if np.random.random() < explore:
            # explore the maze using a random action
            action = maze.sample_n()
            new_state, reward, done = maze.step(action)
            # credit this action with the reward it just earned
            q[state[0]][state[1]][action] += reward
            # if we are still inside the 4x4 maze, add the discounted
            # value of the best action available from the new square
            if max(new_state) < 4 and min(new_state) >= 0:
                q[state[0]][state[1]][action] += discount * max(q[new_state[0]][new_state[1]])
            state = new_state
        else:
            # exploit past attempts by taking the highest value action
            # (note: the q-table is not updated on this branch)
            state, _, done = maze.step(np.argmax(q[state[0]][state[1]]))

# plot convergence (distance to exit for each successive attempt)
maze.convergence()
Maze.print_q(q,mode='rewards')

Depending on the values of __discount__ and __explore__, the contents of the __q-table__ may get a little out of control. Every update adds to the value already stored rather than replacing it, so the positive rewards keep accumulating (even with a discount < 1.0) and the values can easily grow very large. The whole goal is to find our way from the entrance to the exit, so how about trying this:

_When we receive a positive reward at our initial state, stop exploring! ...we now know the solution._

(A positive value at the initial state means that we found the one-and-only __reward__ at the exit, and that the positive q-value then worked its way backwards across our __q-table__, all the way to the entrance, over a series of subsequent attempts.)
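
To watch a value work its way backwards, here is a minimal sketch that swaps the maze for a hypothetical one-way corridor of four squares with the exit just past the last one. It applies the same update rule as the notebook code, but the corridor, the exit reward of 1.0, and the variable names are assumptions made only for this illustration.

```python
import numpy as np

discount = 0.5
n_cells = 4              # hypothetical corridor: squares 0..3, exit past square 3
q = np.zeros(n_cells)    # one action per square: "step right"

attempts_needed = 0
while q[0] <= 0:         # stop once the entrance has a positive value
    attempts_needed += 1
    for cell in range(n_cells):
        new_cell = cell + 1
        # assumed reward: 1.0 for stepping out of the exit, 0.0 otherwise
        reward = 1.0 if new_cell == n_cells else 0.0
        q[cell] += reward
        # same rule as the maze code: add the discounted value of the next square
        if new_cell < n_cells:
            q[cell] += discount * q[new_cell]
    print(attempts_needed, q)
```

In this corridor the positive value moves back one square per attempt, so the entrance only turns positive on the fourth walk. The printed table also shows the squares nearest the exit growing with every attempt.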

In [None]:
%matplotlib inline
import numpy as np
from maze import Maze

##########################################################################
#                                                                        #   
#  RUN SOME EXPERIMENTS BY CHANGING THE VALUES FOR discount AND explore  #
#  don't get discouraged if the values go a little crazy                 #
#                                                                        #
#  The last lesson effectively used:                                     #
#      discount = 0.5                                                    #
#      explore  = 1.0                                                    #
#                                                                        #
discount =  0.50                                                        
explore  =  0.50
#                                                                        #
##########################################################################

maze = Maze(record=True)
q = np.zeros((4,4,4))

for n in range(1000):
    state = maze.reset()
    done = False
    while not done:     
        if np.random.random() < explore:
            action = maze.sample_n()
            new_state, reward, done = maze.step(action)
            q[state[0]][state[1]][action] += reward
            if max(new_state) < 4 and min(new_state) >= 0:
                q[state[0]][state[1]][action] += discount * max(q[new_state[0]][new_state[1]])
            state = new_state
        else:
            state, _, done = maze.step(np.argmax(q[state[0]][state[1]]))
            
        # if we can find the exit from the starting point, stop exploring!
        if max(q[0][0]) > 0:
            explore = 0

# plot convergence (distance to exit for each successive attempt)
maze.convergence()
Maze.print_q(q,mode='rewards')

<hr>
***Exercises***<p>
Reconfigure this program so that, on average, it converges in as few attempts as possible (don't take 'fewest' literally... it's still a random process, so the count will vary from run to run).

In [None]:
%matplotlib inline
import numpy as np
from maze import Maze

discount =  0.50    # try changing this constant                                             
explore  =  0.50    # try changing this constant

maze = Maze(record=True)
q = np.zeros((4,4,4))
attempts = 0

for n in range(5000):      # it's OK to try running more or fewer attempts
    state = maze.reset()
    done = False
    while not done:     
        if np.random.random() < explore:
            action = maze.sample_n()
            new_state, reward, done = maze.step(action)
            q[state[0]][state[1]][action] += reward
            if max(new_state) < 4 and min(new_state) >= 0:
                q[state[0]][state[1]][action] += discount * max(q[new_state[0]][new_state[1]])
            state = new_state
        else:
            state, _, done = maze.step(np.argmax(q[state[0]][state[1]]))
            
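        # once the starting square has a positive value, stop exploring
        if max(q[0][0]) > 0:
            explore = 0
            
    # count only the attempts that finished while we were still exploring
    if explore > 0:
        attempts += 1

print('attempts required to converge:', attempts)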
maze.convergence()
Maze.print_q(q,mode='rewards')
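
Because a single run is random, one value of `attempts` says little about a given pair of settings. One possible way to compare settings is to repeat the experiment and average the counts. In the sketch below, `summarize_attempts` and `placeholder_run` are hypothetical names: the placeholder just returns a random number, and would need to be replaced by a function that wraps the training loop above and returns its `attempts` count.

```python
import random
import statistics

def summarize_attempts(run_once, trials=20):
    """Call run_once() `trials` times; return (mean, standard deviation)."""
    results = [run_once() for _ in range(trials)]
    return statistics.mean(results), statistics.stdev(results)

# Placeholder standing in for one full training run. It does NOT use the
# maze; swap in a function that runs the loop above and returns `attempts`.
_rng = random.Random(0)

def placeholder_run():
    return _rng.randint(5, 15)

mean_attempts, spread = summarize_attempts(placeholder_run)
print(f"mean attempts: {mean_attempts:.1f} (std dev {spread:.1f})")
```

Comparing the mean (and the spread) across several values of __discount__ and __explore__ gives a fairer picture than comparing single runs.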