<h1 align = 'center'>Guessing Games</h1>
<h3 align = 'center'>machine learning, one step at a time</h3>
<h3 align = 'center'>Step 14. Training</h3>

***14. Training***

Remember the maze? When we figured out how to __explore__ versus __exploit__ the experience represented in a __q-table__? And remember we decided to __discount__ future rewards?
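
As a quick refresher, here is a minimal sketch of those two ideas in plain Python. The helper names `choose_action` and `discounted` are made up for this illustration and do not come from the maze lessons; a plain list stands in for one row of the q-table.

```python
import random

def choose_action(q_row, explore):
    """Pick an action from one row of a q-table (a list of action values)."""
    if random.random() < explore:
        # explore: try any action, chosen uniformly at random
        return random.randrange(len(q_row))
    # exploit: take the action with the highest stored value
    return max(range(len(q_row)), key=lambda a: q_row[a])

def discounted(reward, steps_away, discount):
    """Value today of a reward that arrives `steps_away` moves in the future."""
    return reward * discount ** steps_away

q_row = [0.0, 0.5, 0.2]
print(choose_action(q_row, explore=0.0))   # explore=0 means always exploit
print(discounted(1.0, 2, 0.9))             # a reward two moves away is worth about 0.81
```

With `explore` at 0 the choice is always the best-known action; with `explore` at 1 it is always random. Values in between trade one off against the other.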

Let's build those elements into an algorithm that _trains itself_ to play tic-tac-toe.

__And here's the big idea: there is nothing specific to tic-tac-toe in the algorithm. It may as well be solving a maze, or some other problem. The algorithm just explores & exploits based on states, transitions, and resulting rewards. *It doesn't even know the rules of the game.*__
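
To make that concrete, here is a rough sketch of the same kind of loop learning a completely different problem. `Corridor` is a hypothetical toy environment invented for this example (it is not part of the `tictactoe` module); it only assumes an environment offers `reset`, `step`, `sample`, `state_space` and `action_space`. For simplicity the update overwrites the table entry with the new estimate, and ties are broken at random, both choices made for this sketch only.

```python
import random

class Corridor:
    """Hypothetical 1-D maze: start in cell 0, the exit is the last cell."""
    LENGTH = 5

    @staticmethod
    def state_space():
        return Corridor.LENGTH              # one state per cell

    @staticmethod
    def action_space():
        return 2                            # 0 = step left, 1 = step right

    def reset(self):
        self.position = 0
        return self.position

    def sample(self):
        return random.randrange(self.action_space())

    def step(self, action):
        move = 1 if action == 1 else -1
        # walls at both ends: the position never leaves the corridor
        self.position = min(max(self.position + move, 0), self.LENGTH - 1)
        done = self.position == self.LENGTH - 1
        reward = 1.0 if done else 0.0       # the only reward is for reaching the exit
        return self.position, reward, done


def train(env, episodes=200, discount=0.9, explore=0.1):
    # the loop never looks inside the environment; it only asks for its size
    q = [[0.0] * env.action_space() for _ in range(env.state_space())]
    for _ in range(episodes):
        state, done = env.reset(), False
        while not done:
            if random.random() < explore:
                action = env.sample()
            else:
                # exploit, breaking ties at random so an all-zero row still moves around
                best = max(q[state])
                action = random.choice([a for a, v in enumerate(q[state]) if v == best])
            new_state, reward, done = env.step(action)
            q[state][action] = reward + discount * max(q[new_state])
            state = new_state
    return q

q = train(Corridor())
policy = [row.index(max(row)) for row in q[:-1]]   # best action in each non-exit cell
print(policy)
```

After training, the best action in every cell should be 1 (step right), even though nothing in `train` knows what a corridor is.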

Here is an algorithm that can learn to play tic-tac-toe, solve the maze, or tackle lots of other problems. It will take several minutes to run, and it will stop once it has trained itself to play the game:

In [None]:
import numpy as np
from tictactoe import *

env = Game()                # the environment happens to be a game, but could be a maze or a puzzle...

discount = 0.9              # how much a future reward counts next to an immediate one (as in the maze)
explore = 0.01              # chance of trying a random move instead of the best-known one

#######################################################################################################
#                                                                                                     #
# SUPER IMPORTANT POINT: we don't even need to know the dimensions of the problem (the total number   #
#                        of states or actions... we can ask the environment to tell us those limits)  #
#                                                                                                     #
q = np.zeros((Game.state_space(), Game.action_space()))                                               #                                       
#                                                                                                     #
#######################################################################################################

attempts = 0                # keep track of how many times we have tried to solve the problem
not_yet_trained = True      # and keep trying until we are 'trained', see below

while not_yet_trained:
    total_reward = 0
    attempts += 1
    for n in range(10):                                    # play 10 games, keep track of the total reward
        state = env.reset()
        done = False
        ###############################################################################################
        #                                                                                             #
        # The actual q-learning is exactly like the example with the maze                             #
        #                                                                                             #
        while not done:                                                                               #
            if np.random.random() < explore:                                                          #
                action = env.sample()                                                                 #
            else:                                                                                     #
                action = np.argmax(q[state])                                                          #
            new_state, reward, done = env.step(action)                                                #
            q[state][action] += reward + discount * np.amax(q[new_state])                             #
            state = new_state                                                                         #
        #                                                                                             #
        ###############################################################################################      
        total_reward += reward
    not_yet_trained = total_reward < 0                     # if we never lose in 10 games, we are 'trained'
    print(attempts, total_reward)                          # print a progress report
        
env.replay()                                               # print a replay of the last game
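
The update in the cell above adds each new estimate onto the stored value. A widely used variant of q-learning instead moves the stored value only part of the way toward the new estimate. The sketch below shows that variant; `q_update` and `learning_rate` are names assumed for this illustration and are not part of the lesson's code, and a list of lists stands in for the NumPy table.

```python
def q_update(q, state, action, reward, new_state, discount=0.9, learning_rate=0.5):
    """Move q[state][action] part of the way toward reward + discount * best next value."""
    target = reward + discount * max(q[new_state])
    q[state][action] += learning_rate * (target - q[state][action])
    return q[state][action]

# two states, two actions; the best value available from state 1 is 1.0
q = [[0.0, 0.0],
     [0.0, 1.0]]
print(q_update(q, state=0, action=1, reward=0.0, new_state=1))
```

With `learning_rate=1.0` the entry jumps straight to the target; smaller values average over many games, which can help when the same move does not always lead to the same result.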

<hr>

***Exercises***

Take a good look at the example; you will need it for the final project. Try tuning __discount__ and __explore__. How does the pace of training depend on these variables? Should they be changed within the program as the success rate improves?
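
As a starting point for the last question, here is one possible way to shrink the exploration rate as training goes on. It is only a sketch: `decayed_explore` and its parameters `start`, `floor` and `decay` are made up for this exercise, and whether a schedule like this actually speeds up training is something to test for yourself.

```python
def decayed_explore(attempt, start=0.5, floor=0.01, decay=0.9):
    """Exploration rate for a given training attempt (attempts count from 1)."""
    # shrink by a constant factor each attempt, but never drop below `floor`
    return max(floor, start * decay ** (attempt - 1))

for attempt in (1, 10, 50):
    print(attempt, round(decayed_explore(attempt), 4))
```

Inside the training loop, the idea would be to recompute `explore` from `attempts` at the top of each pass instead of keeping it fixed.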