<h1 align = 'center'>Guessing Games</h1>
<h3 align = 'center'>machine learning, one step at a time</h3>
<h3 align = 'center'>Step 10. Q-learning</h3>

**10. Q-learning**

<u>Wikipedia</u> says:

_Q-learning is a reinforcement learning technique used in machine learning. The goal of Q-Learning is to learn a policy, which tells an agent what action to take under what circumstances. It does not require a model of the environment and can handle problems with stochastic transitions and rewards, without requiring adaptations._

Translation: 'Q-learning is a way to figure out a maze on the fly, without knowing it's a maze, even if it's like a fun-house maze where somebody occasionally moves the walls around.' The word __stochastic__ means 'like a fun-house', or, less colorfully, at least partly random.
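To make the definition a little more concrete, here is a minimal sketch of the textbook Q-learning update, applied to a q-table shaped like the ones in this lesson. The helper `q_update` and the values of `alpha` (learning rate) and `gamma` (discount) are assumptions chosen for illustration; they are not part of this lesson's `maze` module, and the lesson's own code further down uses a simpler running total instead.

```python
import numpy as np

def q_update(q, state, action, reward, new_state, alpha=0.1, gamma=0.9):
    """One textbook Q-learning step on a (rows, cols, actions) table.

    alpha and gamma are illustrative defaults, not values from the lesson.
    """
    r, c = state
    nr, nc = new_state
    # best value currently known from the state we landed in
    target = reward + gamma * np.max(q[nr, nc])
    # nudge the old estimate part of the way toward the target
    q[r, c, action] += alpha * (target - q[r, c, action])
    return q

q = np.zeros((4, 4, 4))
q[2, 3, 1] = 1.0                    # pretend going S from (2,3) is known to pay off
q_update(q, (1, 3), 1, 0, (2, 3))   # step S from (1,3) into (2,3), no reward yet
print(q[1, 3, 1])                   # a small positive value has leaked back to (1,3)
```

Notice that nothing here looks at the maze layout: the update only uses what the agent just experienced, which is why walls that move around are not fatal.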

The approach in the last lesson was good enough to find a path without failing, but it wasn't quite Q-learning:
- it examines the maze in advance, so if someone moved the walls around, it would fail completely
- it doesn't really develop much of a __policy__, in the sense that it still takes a random path
- it is good at exploiting __penalties__, but what about exploiting __rewards__?
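A policy, in its simplest form, is just a lookup from each state to a preferred action. As a rough sketch, assuming a NumPy q-table of shape (4, 4, 4) and the N, S, E, W action order used in this lesson's q-tables, a "greedy" policy picks the highest-valued action in every state. The helper name `greedy_policy` is hypothetical.

```python
import numpy as np

ACTIONS = ['N', 'S', 'E', 'W']   # assumed action order: index 0,1,2,3

def greedy_policy(q):
    """Return a (rows, cols) array naming the best-valued action per state."""
    best = np.argmax(q, axis=2)          # index of the largest value in each state
    return np.array(ACTIONS)[best]       # translate indices back to letters

q = np.zeros((4, 4, 4))
q[0, 0] = [-839, 0, 0, -866]   # penalties only: N and W hit walls
q[2, 3] = [0, 1, -1, -1]       # a reward: S leads to the exit

policy = greedy_policy(q)
print(policy[0, 0])   # one of the unpenalized actions
print(policy[2, 3])   # the rewarded action
```

One caveat: `np.argmax` breaks ties by taking the first index, so a state with all zeros always comes out as N. With penalties alone there are many such ties, which is one way to see why the path still looks random.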

***Find the reward in this q-table:***
<pre>
=====  ================================
state         N       S       E       W

(0,0)      -839       0       0    -866
(0,1)      -241    -283       0       0
(0,2)       -67       0     -53       0
(0,3)         0       0       0       0

(1,0)         0       0    -224    -227
(1,1)         0       0       0       0
(1,2)         0     -16       0     -10
(1,3)        -3       0      -6       0

(2,0)         0     -73       0     -53
(2,1)       -12     -12     -12       0
(2,2)         0       0       0       0
(2,3)         0       1      -1      -1  <-- there is the +1!

(3,0)         0       0       0       0
(3,1)         0       0       0       0
(3,2)         0       0       0       0
(3,3)         0       0       0       0

 </pre>
***In this table, it took 3,000 tries to find the exit once.***
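Hunting for the +1 by eye gets old quickly. As a small sketch, the table above can be typed into a NumPy array and searched for positive entries with `np.argwhere`. The `rows` list is copied from the table; the variable names are otherwise arbitrary.

```python
import numpy as np

ACTIONS = ['N', 'S', 'E', 'W']

# one row per state, in the order (0,0), (0,1), ... (3,3); columns are N, S, E, W
rows = [
    [-839,    0,    0, -866], [-241, -283,    0,    0],
    [ -67,    0,  -53,    0], [   0,    0,    0,    0],
    [   0,    0, -224, -227], [   0,    0,    0,    0],
    [   0,  -16,    0,  -10], [  -3,    0,   -6,    0],
    [   0,  -73,    0,  -53], [ -12,  -12,  -12,    0],
    [   0,    0,    0,    0], [   0,    1,   -1,   -1],
    [   0,    0,    0,    0], [   0,    0,    0,    0],
    [   0,    0,    0,    0], [   0,    0,    0,    0],
]
q = np.array(rows).reshape(4, 4, 4)   # q[row, col, action]

# every (state, action) pair whose recorded value is positive
hits = [((int(r), int(c)), ACTIONS[a]) for r, c, a in np.argwhere(q > 0)]
print(hits)   # [((2, 3), 'S')]
```

A single positive cell out of 64 is all that 3,000 random runs produced, which is the motivation for spreading that reward backwards to neighboring states.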

import numpy as np
from maze import Maze
maze = Maze()

# for this lesson, it will be easier if our sample
# actions are 0,1,2,3 instead of N,S,E,W
def sample(maze):
    action = maze.sample()                    # this returns N,S,E,W
    return maze.action_space().index(action)  # this converts to 0,1,2,3

# run the maze many times; record every result in a q-table
# q[row][col][action] is a running total for taking `action` in state (row, col)
q = np.zeros((4,4,4))
for n in range(3000):
    state = maze.reset()
    done = False
    b = 0                                            # counts the pauses within this run
    while not done:
        action = sample(maze)                        # pick a random action (0,1,2,3)
        new_state, reward, done = maze.step(action)  # take a step using that action
        q[state[0]][state[1]][action] += reward      # record the reward for this transition
        if max(new_state) < 4 and min(new_state) >= 0:       # only if new_state is on the grid
            new_q = q[new_state[0]][new_state[1]]
            q[state[0]][state[1]][action] += max(new_q)      # add the best value recorded at new_state
            if max(new_q) > 0:                               # new_state has a positive value recorded
                b += 1
                print(b,state,'=== BEFORE ===============================\n',q)
                new_q = q[state[0]][state[1]]                # note: this is the CURRENT state's row
                q[state[0]][state[1]][action] += max(new_q)  # ...so its best value is added once more
                print(b,state,'=== AFTER  ===============================\n',q)
                input('go')                                  # pause for inspection; interrupt to stop
        state = new_state                            # ...and switch to the new state
 

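Output at the first pause: the q-table before the extra update, then after it. The only difference is state (1,3), action S, which goes from 1 to 2.

 [[[-679.    0.    0. -616.]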
  [-160. -194.    0.    0.]
  [ -50.    0.  -42.    0.]
  [   0.    0.    0.    0.]]

 [[   0.    0. -136. -172.]
  [   0.    0.    0.    0.]
  [   0.   -7.    0.  -12.]
  [  -5.    1.   -3.    0.]]

 [[   0.  -44.    0.  -48.]
  [ -13.  -10.  -13.    0.]
  [   0.    0.    0.    0.]
  [   0.    1.   -1.   -2.]]

 [[   0.    0.    0.    0.]
  [   0.    0.    0.    0.]
  [   0.    0.    0.    0.]
  [   0.    0.    0.    0.]]]
 [[[-679.    0.    0. -616.]
  [-160. -194.    0.    0.]
  [ -50.    0.  -42.    0.]
  [   0.    0.    0.    0.]]

 [[   0.    0. -136. -172.]
  [   0.    0.    0.    0.]
  [   0.   -7.    0.  -12.]
  [  -5.    2.   -3.    0.]]

 [[   0.  -44.    0.  -48.]
  [ -13.  -10.  -13.    0.]
  [   0.    0.    0.    0.]
  [   0.    1.   -1.   -2.]]

 [[   0.    0.    0.    0.]
  [   0.    0.    0.    0.]
  [   0.    0.    0.    0.]
  [   0.    0.    0.    0.]]]
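
The raw NumPy dumps above are harder to scan than the state-by-state table at the top of this lesson. Here is a small sketch of a formatter that prints a q-table in roughly that layout. `format_q_table` is a hypothetical helper, and it assumes the values are whole numbers (it truncates with `int`), which holds for the tables shown here.

```python
import numpy as np

def format_q_table(q, actions=('N', 'S', 'E', 'W')):
    """Render a (rows, cols, actions) q-table as one text line per state."""
    lines = ['state ' + ''.join(f'{a:>8}' for a in actions)]
    for r in range(q.shape[0]):
        lines.append('')                                  # blank line between grid rows
        for c in range(q.shape[1]):
            cells = ''.join(f'{int(v):>8}' for v in q[r, c])
            lines.append(f'({r},{c}) ' + cells)
    return '\n'.join(lines)

q = np.zeros((4, 4, 4))
q[1, 3] = [-5, 2, -3, 0]     # the row that changed in the dumps above
q[2, 3] = [0, 1, -1, -2]
print(format_q_table(q))
```

Calling this in place of `print(..., q)` inside the loop would make the before/after difference much easier to spot.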

