# Chapter 12: Tabular Q-Learning

In this chapter, you’ll learn the basics of reinforcement learning. Then you'll use one type of reinforcement learning, namely tabular Q-learning, to solve the Frozen Lake game in OpenAI Gym. Along the way, you'll learn the concepts of dynamic programming and the Bellman equation, and how to implement Q-learning.

Machine learning can be classified into three different areas: Supervised learning, unsupervised learning, and reinforcement learning. 

In supervised learning, we show the machine learning model many examples of input-output pairs. The output values are also called target variables (or labels). The model extracts features from the input data (e.g., images) and associates them with the outputs (image labels such as horses, deer, cats, or dogs). We then apply the trained model to new examples and predict what the output should be (is the image a horse or a deer?). In the previous chapters, we discussed deep neural networks, which are examples of supervised learning. In contrast, in unsupervised learning, there are no pre-assigned target variables (labels) for the training data. Unsupervised learning models must find naturally occurring patterns in the training data. Examples of unsupervised learning methods include clustering, principal component analysis, and data visualization (plotting, graphing, and so on). 

In reinforcement learning (RL), an agent operates in an environment through trial and error. The agent learns to achieve the optimal outcome by receiving feedback from the environment in the form of rewards. For the rest of the book, we’ll discuss various types of reinforcement learning methods, which include tabular Q learning, deep Q learning, policy gradients, and the actor-critic method.

In this chapter, you’ll learn how RL works. We’ll use the Frozen Lake game in OpenAI Gym to illustrate the concepts of dynamic programming and the Bellman equation. You’ll learn to train the Q-table by trial and error. Specifically, the agent plays the game many times and adjusts the values in the Q-table based on the rewards: it increases the Q-value if an action leads to a positive reward and decreases the Q-value otherwise. You’ll also learn to use the trained Q-table to solve the Frozen Lake game.

# 1. Basics of Reinforcement Learning
Reinforcement Learning (RL) is one type of Machine Learning (ML). In a typical RL problem, an agent decides, step by step, how to choose among a list of actions in an environment in order to maximize the cumulative payoff from all the steps that it has taken. 

RL is widely used in many different fields, from control theory and operations research to statistics. RL problems are typically formalized as a Markov Decision Process (MDP). We’ll use trial and error to interact with the environment and see what rewards the actions bring. We then adjust the decisions based on the outcome: reward good choices and penalize bad ones. Hence the name reinforcement learning. 

## 1.1. Basic Concepts
Let’s first discuss a few basic concepts related to RL: environment, agent, state, action, reward.
 
* Environment: the world in which agent(s) live and interact with each other or with nature. More important, an environment is where the agent(s) can explore and learn the best strategies. Examples include the Frozen Lake game, the popular Breakout Atari game, or a real-world problem that we need to solve. 
* Agent: the player of the game. In most games, there is one player and the opponent is embedded into the environment. But you have seen two-player games such as Tic Tac Toe or Connect Four earlier in this book. 
* State: the current situation of the game. The current game board in the Connect Four game, for example, is the current state of the game.
* Action: what the player decides to do given the current game situation. 
* Reward: the payoff from the game. You can assign a value to each situation. Positive values are rewards and negative values penalties.

These concepts will become clearer as we move along.

## 1.2. The Bellman Equation and Q-Learning
Q-learning is one way to solve the optimization problem in RL. Q-learning is a value-based approach. Another approach is policy gradients, which is policy-based. We’ll discuss both in this book.

The agent is trying to learn the best strategy in order to maximize its expected payoff over time. A strategy (also called a policy) maps a certain state to a certain action. A strategy is basically a decision rule that tells the agent what to do in a certain situation.

The Q-value, $Q(s, a)$, measures how good it is to take action $a$ in state $s$. You can interpret the letter Q as quality. The better the action, the higher the payoff to the agent, and the higher the Q-value. The agent is trying to find the strategy that, in every state, picks the action with the highest Q value.

An agent’s action now affects not only the reward in this period but also rewards in future periods. Therefore, finding the best strategy can be complicated and involves dynamic programming. For details on dynamic programming and the Bellman equation, see, e.g., https://en.wikipedia.org/wiki/Bellman_equation

In the setting of Q-learning, the Bellman equation is as follows:
$$Q(s,a) = \text{Reward} + \text{discount factor} \times \max_{a'} Q(s', a')$$
where $Q(s, a)$ is the Q value to the agent in the current state $s$ when an action $a$ is taken. Reward is the payoff to the agent as a result of this action. The discount factor is a constant between 0 and 1, and it measures how much the agent discounts future rewards relative to the current reward. Lastly, $\max_{a'} Q(s', a')$ is the highest Q value available in the next state $s'$, that is, the maximum future reward assuming optimal strategies will be applied in the future as well. 
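
As a quick worked example of the equation above, the sketch below computes its right-hand side for a single state-action pair. The numbers are illustrative assumptions, and the helper name `bellman_target` is hypothetical rather than part of any library.

```python
def bellman_target(reward, discount_factor, next_q_values):
    """Return reward + discount_factor * (largest Q value in the next state)."""
    return reward + discount_factor * max(next_q_values)

# Assumed values: no immediate reward, a discount factor of 0.9,
# and four candidate Q values in the next state s'.
next_q_values = [0.81, 0.9, 1.0, 0.81]
q_sa = bellman_target(reward=0.0, discount_factor=0.9,
                      next_q_values=next_q_values)
print(q_sa)  # 0.9
```

Only the best action in the next state matters here: the other three Q values do not enter the result.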

In order to find the Q values, we’ll try different actions in each state multiple times. We’ll adjust the Q values based on the outcome: increase the Q value if the reward is high and decrease it if the reward is low or even negative. Hence the name reinforcement learning.

Rather than providing you with a lot of abstract technical jargon, I will use a simple example to show you how reinforcement learning works. 

# 2. Get Started with OpenAI Gym
OpenAI Gym provides the needed working environment for many simple games. Many machine learning enthusiasts use games in OpenAI Gym to test their algorithms. In this section, you’ll learn how to install the libraries needed in order to access games that we’ll use in this book. After that, you’ll learn how to play a simple game, the Frozen Lake, in this environment. 

Before you get started, install the OpenAI Gym library as follows with your virtual environment activated:

`pip install gym==0.15.7`

You need to restart the Jupyter Notebook app for the installation to take effect.

Or you can simply use the shortcut and run the following line of code in a new cell in this notebook:

In [1]:
!pip install gym==0.15.7

***
$\mathbf{\text{Python package version control}}$<br>
***
There are newer versions of OpenAI Gym, but they are not compatible with Baselines, a package that we need to train Breakout and other Atari games (such as Space Invaders, Seaquest, and Beam Rider).

In case you accidentally installed a different version, run the following lines of code to correct it, with your virtual environment activated.

`pip uninstall gym`

`pip install gym==0.15.7`

***

## 2.1. Basic Elements of A Game Environment
The OpenAI Gym game environments are designed mainly for testing reinforcement learning (RL) algorithms. In this chapter, we'll use the Frozen Lake environment to train and test tabular Q-learning. 

Let’s first discuss a few basic concepts related to the game environment: 
* Environment: the world in which agent(s) live and interact with each other or with nature. More important, an environment is where the agent(s) can explore and learn the best strategies. Examples include the Frozen Lake game we’ll discuss in this chapter, or the popular Breakout Atari game, or a real-world problem that we need to solve. You’ll learn to use environments from OpenAI Gym, and you’ll also learn to create your own game environments later in this book. 
* Agent: the player of the game. In most games, there is one player and the opponent is embedded into the environment. But we'll also discuss two-player games such as Tic Tac Toe and Connect Four later in this book. 
* State: the current situation of the game. The current game board in the Connect Four game, for example, is the current state of the game. We'll explain more as we go along.
* Action: what the player decides to do given the current game situation. In a Tic Tac Toe game, your action is to choose which cell to place your game piece, for example. 
* Reward: the payoff from the game. You can assign a numerical value to each game outcome. For example, in a Tic Tac Toe game, we can assign a reward of 0 to all situations except when the game ends, at which point you can assign a reward of 1 if you win, -1 if you lose the game. 

These concepts will become clearer as we move along.

## 2.2. The Frozen Lake Game 

Let’s start with the Frozen Lake game environment in OpenAI Gym. In short, an
agent moves on the surface of a frozen lake, which is simplified as a four by four
grid. The agent starts at the top left corner and tries to get to the lower right corner
without falling into one of the four holes on the lake surface. The condition of the
lake surface is illustrated in the picture lake_surface.png under /files/ch12/ in the
book’s GitHub repository. If you open the picture, you’ll see four gray circles, which
are the four holes on the lake surface.
The code in the cell below will get you started:

In [2]:
import gym 
 
env=gym.make("FrozenLake-v0",is_slippery=False)
env.reset()                    
env.render()


SFFF
FHFH
FFFH
HFFG


The make() method creates the game environment for us. We set the is_slippery argument to False so that the game is deterministic, meaning the game will always use the action that you choose. The default setting is is_slippery=True, which means that you may not go to your intended location since the frozen lake surface is slippery. For example, when you choose to go left, you may end up going up or down instead. The reset() method starts the game and puts the player at the starting position. The render() method shows the current game state.

If you run the above cell, you’ll see an output with 16 letters in the form of a four by four grid, which represents the lake surface. The letters have the following meanings:

* S: the starting position.
* H: a hole; the player will fall into the hole and lose the game at this position.
* F: frozen, meaning it’s safe to ski on.
* G: goal; the player wins the game upon reaching this point.

The current position of the agent is highlighted in red. The above output shows that the player is at the top left corner of the lake, which is the starting position of the agent at the beginning of the game.

We can also print out all possible actions and states of the game as follows:

In [3]:
# Print out all possible actions in this game
actions=env.action_space
print(f"The action space in Frozen Lake is {actions}")
# Print out all possible states in this game
states=env.observation_space
print(f"The state space in Frozen Lake is {states}")

The action space in Frozen Lake is Discrete(4)
The state space in Frozen Lake is Discrete(16)


The action space in the Frozen Lake game has four values: 0, 1, 2, and 3, where 0 means going left, 1 going down, 2 going right, and 3 going up. The state space has 16 values: 0, 1, 2, ..., 15. The top left square is state 0, the top right is state 3, and so on, with the bottom right corner being state 15, as shown in the picture game_states.png under /files/ch10/ in the book’s GitHub repository.

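
The state numbering described above runs row by row, so a state index can be converted to a grid position with integer division. Here is a minimal sketch; the helper names `state_to_position` and `position_to_state` are hypothetical and not part of Gym.

```python
GRID_SIZE = 4  # the Frozen Lake map in this chapter is four by four

def state_to_position(state):
    """Convert a state index (0 to 15) to a (row, column) pair."""
    return divmod(state, GRID_SIZE)

def position_to_state(row, col):
    """Convert a (row, column) pair back to a state index."""
    return row * GRID_SIZE + col

print(state_to_position(0))     # (0, 0): top left corner
print(state_to_position(3))     # (0, 3): top right corner
print(state_to_position(15))    # (3, 3): bottom right corner
print(position_to_state(1, 0))  # 4: first column in the second row
```
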
You can play a complete game as follows:

In [4]:
while True:
    action=actions.sample()
    print(action)
    new_state,reward,done,info=env.step(action)
    env.render()
    print(new_state,reward,done,info)    
    if done==True:
        break

1
  (Down)
SFFF
FHFH
FFFH
HFFG
4 0.0 False {'prob': 1.0}
2
  (Right)
SFFF
FHFH
FFFH
HFFG
5 0.0 True {'prob': 1.0}


The code cell above uses several methods in the game environment. The sample() method randomly selects an action from the action space. That is, it returns one of the values among {0, 1, 2, 3}. The step() method is where the agent interacts with the environment, and it takes the agent’s action as input. The output from the step() method has four values: the new state, the reward, a variable done indicating whether the game has ended, and a variable info with some description about the game state. In this case, it provides the probability that the agent reaches the intended state. Since we are using the nonslippery version of the game, the probability is always 100%. The render() method shows a diagram of the resulting state.

The game loop is an infinite while loop. If the done variable returns a value True, the game ends, and we stop the infinite while loop.

Note that since the actions are chosen randomly, when you run the above cell, you’ll most likely get a different result.

## 2.3. Play the Frozen Lake Game Manually
Next, you’ll learn how to manually interact with the Frozen Lake game, so that you
have a better understanding of the game environment. This will prepare you to design
winning strategies for the Frozen Lake game.
The following lines of code show you how.

In [5]:
print('''
enter 0 for left, 1 for down
2 for right, and 3 for up
''')
env.reset()                    
env.render()
while True:
    try:
        action=int(input('how do you want to move?\n'))
        new_state,reward,done,_=env.step(action)
        env.render()
        if done:
            if new_state==15:
                print("Congrats, you won!")
            else:
                print("Better luck next time!")
            break
    # Using Exception (rather than a bare except) lets you
    # interrupt the loop from the keyboard or the notebook
    except Exception:
        print('please enter 0, 1, 2, or 3')


enter 0 for left, 1 for down
2 for right, and 3 for up


SFFF
FHFH
FFFH
HFFG
how do you want to move?
1
  (Down)
SFFF
FHFH
FFFH
HFFG
how do you want to move?
1
  (Down)
SFFF
FHFH
FFFH
HFFG
how do you want to move?
2
  (Right)
SFFF
FHFH
FFFH
HFFG
how do you want to move?
1
  (Down)
SFFF
FHFH
FFFH
HFFG
how do you want to move?
2
  (Right)
SFFF
FHFH
FFFH
HFFG
how do you want to move?
2
  (Right)
SFFF
FHFH
FFFH
HFFG
Congrats, you won!


Use your keyboard to play the game a couple of times. After that, play a game by choosing the following actions: 1, 1, 2, 1, 2, 2 (meaning down, down, right, down, right, right sequentially). As a result, you’ll reach the destination without falling into one of the holes and win the game. This is one of the shortest paths that you can take to win the game.

Now, the question is: can you train your computer to win the game by itself? The answer is yes, and you’ll learn how to do that by using tabular Q-learning.

# 3. Use Q Values to Play the Frozen Lake Game

The OpenAI Gym environment is designed for training RL game strategies. In particular, in this chapter, you'll learn how to use Q-learning to play the game.

## 3.1. The Logic Behind Q Learning
What if you have a Q-table to guide you to successfully play the Frozen Lake game? The Q-table is a 16 by 4 matrix, with the rows representing the 16 states: 0 means the top left corner (that is, the starting position), 3 means the top right corner, and 15 means the bottom right corner (that is, the goal position; i.e., the winning position). The four columns represent the four actions that the agent can take in any state: 0 means going left, 1 going down, 2 going right, and 3 going up.

The Q table can be downloaded from my website https://gattonweb.uky.edu/faculty/lium/ml/Qtable.csv

If you open the file, you'll see the following table (I added two rows and one column for explanation purposes):
<img src="https://gattonweb.uky.edu/faculty/lium/ml/Qtable.png" />

With the guidance of the Q-table, reaching the destination (i.e., state 15, or the lower right corner) safely is easy for a computer program. Here are the steps:

1.	The computer starts at state 0.
2.	It looks at the above Q table and consults the row corresponding to state 0, which has four values: 0.531, 0.59, 0.59, and 0.531. The four values tell the computer the total payoff from taking each of the four actions in state 0, respectively. 
3.	The computer chooses the action that leads to the highest Q value: actions 1 and 2 both have a payoff of 0.59, higher than those from taking actions 0 or 3. We have a tie here, so the computer chooses action 1 (that is, going down) in this case (the first of the two tied actions, 1 and 2). 
4.	Since the computer has chosen going down in state 0, the new state is now the first column in the second row based on the map of the frozen lake. Therefore, the new state is state 4.
5.	The computer now chooses the best action in state 4 based on the above Q table, following the same logic in steps 2 and 3. This means the computer takes action 1 again.
6.	The computer repeats the above steps until the game ends (that is, either the agent falls into a hole or reaches the destination).

Based on the numbers in the Q table and the logic in the above steps, the computer will take the following actions sequentially: down, down, right, down, right, and right. It will pass the following states: state 0 to state 4 to state 8 to state 9 to state 13 to state 14 to state 15. 

As you can see, the computer has successfully reached the goal (state 15) without falling into one of the four holes (states 5, 7, 11, and 12).
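
The steps above can be traced without Gym. In the sketch below, the lake map and Q values are copied from this chapter (the Q values from the training output shown later), while the `step` function is a hand-written stand-in for the non-slippery environment, which is an assumption of this example. The helper names are hypothetical.

```python
LAKE = ["SFFF", "FHFH", "FFFH", "HFFG"]
MOVES = {0: (0, -1), 1: (1, 0), 2: (0, 1), 3: (-1, 0)}  # left, down, right, up

Q = [
    [0.531441, 0.59049, 0.59049, 0.531441],
    [0.531441, 0.0, 0.6561, 0.59049],
    [0.59049, 0.729, 0.59049, 0.6561],
    [0.6561, 0.0, 0.59049, 0.59049],
    [0.59049, 0.6561, 0.0, 0.531441],
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.81, 0.0, 0.6561],
    [0.0, 0.0, 0.0, 0.0],
    [0.6561, 0.0, 0.729, 0.59049],
    [0.6561, 0.81, 0.81, 0.0],
    [0.729, 0.9, 0.0, 0.729],
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.81, 0.9, 0.729],
    [0.81, 0.9, 1.0, 0.81],
    [0.0, 0.0, 0.0, 0.0],
]

def step(state, action):
    """Move one cell; a move that would leave the grid keeps the agent in place."""
    row, col = divmod(state, 4)
    d_row, d_col = MOVES[action]
    row = min(max(row + d_row, 0), 3)
    col = min(max(col + d_col, 0), 3)
    return row * 4 + col

def greedy_path(start=0):
    """Follow the highest Q value in each state until a hole or the goal."""
    state, path = start, [start]
    while LAKE[state // 4][state % 4] not in "HG":
        q_row = Q[state]
        action = q_row.index(max(q_row))  # first index on ties, like np.argmax
        state = step(state, action)
        path.append(state)
    return path

print(greedy_path())  # [0, 4, 8, 9, 13, 14, 15]
```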

We’ll code this logic next.

## 3.2. A Python Script to Win the Frozen Lake Game

Run the following short script.

In [6]:
import gym
import numpy as np

env=gym.make('FrozenLake-v0', is_slippery=False)
env.reset()
file=r'https://gattonweb.uky.edu/faculty/lium/ml/Qtable.csv'
Q=np.loadtxt(file, delimiter=",")

def play_game():
    state=env.reset()
    env.render()
    while True:
        action = np.argmax(Q[state, :])
        print(f'state is {state}; action is {action}')
        new_state, reward, done, _ = env.step(action)
        # new_state becomes the state in the next round
        state=new_state
        env.render()
        if done==True:
            if reward ==1:
                print('Congratulations, you won!')
            else:
                print('Sorry, better luck next time.')
            break    
play_game()
env.close()


SFFF
FHFH
FFFH
HFFG
state is 0; action is 1
  (Down)
SFFF
FHFH
FFFH
HFFG
state is 4; action is 1
  (Down)
SFFF
FHFH
FFFH
HFFG
state is 8; action is 2
  (Right)
SFFF
FHFH
FFFH
HFFG
state is 9; action is 1
  (Down)
SFFF
FHFH
FFFH
HFFG
state is 13; action is 2
  (Right)
SFFF
FHFH
FFFH
HFFG
state is 14; action is 2
  (Right)
SFFF
FHFH
FFFH
HFFG
Congratulations, you won!


The Q values are saved in a CSV file. We use the *loadtxt()* function in ***numpy*** to load the Q table.

The state=env.reset() command resets the game so that the initial state is 0. 

We then use a while loop to play the game. In each iteration, the computer chooses the action that leads to the highest Q value in that state. Note here that the *argmax()* function in ***numpy*** returns the index of the highest value (the first such index if there is a tie). This is different from the *max()* function in ***numpy***, which returns the highest value itself. 
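
To make the difference between the two functions concrete, here is a small example that uses the Q values for state 0 quoted earlier in this chapter. The variable names are chosen for illustration only.

```python
import numpy as np

# Q values for state 0, as quoted from the Q-table earlier in this chapter
q_row = np.array([0.531, 0.59, 0.59, 0.531])

best_action = np.argmax(q_row)  # index of the first largest value
best_value = np.max(q_row)      # the largest value itself

print(best_action)  # 1
print(best_value)   # 0.59
```

Actions 1 and 2 are tied, and `np.argmax` returns the first of them, which is why the agent goes down rather than right in state 0.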

We then print out the current state, and the action taken by the computer in that state so that we can keep track of the path taken by the computer.

The *env.step()* method returns the new state and the reward based on the action taken. It also tells us whether the game has ended. If the game has not ended, we set the new state as the current state and go to the next iteration to repeat the process. If the game has ended, the while loop stops and the script ends.

The computer has successfully reached state 15, taking one of the shortest possible paths. 

You can run the script multiple times, and the output will be the same every time, because there is no randomness involved.

Amazing, right? You may wonder: how did we come up with the numbers in the Q-table? That’s what we’ll discuss next.

# 4. Training the Q-Values
In this section, we’ll first discuss what Q-learning is and the logic behind it. We’ll then code in the logic and use a script to generate the Q values that we used in the last section.

## 4.1. What Is Q-Learning?
The Q-values form a table of S rows and A columns, where S is the number of states and A is the number of actions (16 and 4 in Frozen Lake), and we call it the Q-table. We need to find out the Q-values so that the player can use these values to figure out the optimal strategies in every situation. 

Before Q-learning starts, we set all the values in the Q-table as 0.

At each iteration, we’ll use reinforcement learning (i.e., Q-learning) to update Q values as follows:

$$\text{New } Q(s,a) = \text{learning rate} \times [\text{Reward} + \text{discount factor} \times \max_{a'} Q(s', a')] + (1-\text{learning rate}) \times \text{Old } Q(s, a)$$
Here the learning rate, which has a value between 0 and 1, determines how fast you update the Q values. The updated $Q(s, a)$ is a weighted average of the value implied by the Bellman equation and the previous $Q(s, a)$. This is where updating (i.e., learning) happens.

After many rounds of trial and error, the update will be minimal, which means the Q values converge to the equilibrium value. 

If you look at the above equation, when the old value already satisfies
$$Q(s,a) = \text{Reward} + \text{discount factor} \times \max_{a'} Q(s', a')$$
there is no update, and we have 
$$\text{New } Q(s,a) = \text{Old } Q(s, a)$$
That is the equilibrium state we are looking for. 
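
The convergence argument above can be checked numerically. The sketch below applies the update rule repeatedly to a single Q value under an assumed scenario: the reward is 0 and the best Q value in the next state stays fixed at 1.0, so the Bellman target is 0.9. The learning rate of 0.6 and the discount factor of 0.9 match the values used in the training script later in this chapter.

```python
learning_rate = 0.6
discount_factor = 0.9

# Assumed scenario: zero reward, best next-state Q value fixed at 1.0
reward, next_max_q = 0.0, 1.0
target = reward + discount_factor * next_max_q

q = 0.0  # the Q-table starts at zero
history = []
for _ in range(30):
    q = learning_rate * target + (1 - learning_rate) * q
    history.append(q)

print(round(history[0], 4))   # 0.54 after the first update
print(round(history[-1], 4))  # 0.9 once the updates have become negligible
```

Each update closes 60% of the remaining gap to the target, so the changes shrink quickly and the value settles at the point where the new and old Q values coincide.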

## 4.2. Let the Learning Begin
We’ll write a Python script and let the agent randomly select moves to play the game many rounds. Unavoidably, there will be many mistakes along the way. But we’ll assign a low reward if the agent fails so that it assigns a low Q value to actions taken in that state. On the other hand, if the agent makes right choices and successfully reaches the destination, we’ll assign a high reward so that the agent assigns high Q values to actions taken. 

It’s through such repeated rewards and punishments that the agent learns the correct Q values.

The script in the following cell trains the Q-table.

In [7]:
import gym
import numpy as np

env=gym.make('FrozenLake-v0', is_slippery=False)
env.reset()
learning_rate=0.6
discount_rate=0.9
max_exp=0.7
min_exp=0.3
max_steps=50
max_episode=10000

Q = np.zeros((16, 4))
def update_Q(episode):
    # The initial state is the starting position (state 0)
    state=env.reset()      
    # Probability of a random (exploratory) move in this episode
    explore_prob=min_exp+(max_exp-min_exp)*episode/max_episode
    # Play a full game till it ends
    for _ in range(max_steps):
        # Select the best action or the random action
        if np.random.uniform(0,1)>explore_prob:
            action = np.argmax(Q[state, :])
        else:
            action = env.action_space.sample()
        # Use the selected action to make the move
        new_state, reward, done, _ = env.step(action)
        # Update Q values
        if done:
            # The game is over, so there is no future value to add
            Q[state, action]=reward
            break    
        else:
            target=reward+discount_rate*np.max(Q[new_state,:])
            Q[state,action]=learning_rate*target\
                +(1-learning_rate)*Q[state,action]
            state=new_state    
       
for episode in range(max_episode):
    update_Q(episode)
    
print(Q) 

# Save the trained Q for later use
np.savetxt("trained_frozenlake_Qs.csv", Q, delimiter=',')

[[0.531441 0.59049  0.59049  0.531441]
 [0.531441 0.       0.6561   0.59049 ]
 [0.59049  0.729    0.59049  0.6561  ]
 [0.6561   0.       0.59049  0.59049 ]
 [0.59049  0.6561   0.       0.531441]
 [0.       0.       0.       0.      ]
 [0.       0.81     0.       0.6561  ]
 [0.       0.       0.       0.      ]
 [0.6561   0.       0.729    0.59049 ]
 [0.6561   0.81     0.81     0.      ]
 [0.729    0.9      0.       0.729   ]
 [0.       0.       0.       0.      ]
 [0.       0.       0.       0.      ]
 [0.       0.81     0.9      0.729   ]
 [0.81     0.9      1.       0.81    ]
 [0.       0.       0.       0.      ]]
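
If you want to experiment with the training loop without installing Gym, the following is a minimal pure-Python sketch of the same idea. It is not the script above: the `step` function is a hand-written stand-in for the non-slippery environment, the exploration probability is held constant at an assumed 0.5, and ties between equally good actions are broken at random (unlike `np.argmax`) so that early episodes explore more widely. All names are hypothetical.

```python
import random

LAKE = ["SFFF", "FHFH", "FFFH", "HFFG"]
MOVES = [(0, -1), (1, 0), (0, 1), (-1, 0)]  # left, down, right, up

def step(state, action):
    """Hand-written stand-in for env.step() in the non-slippery game."""
    row, col = divmod(state, 4)
    row = min(max(row + MOVES[action][0], 0), 3)  # walls keep the agent in place
    col = min(max(col + MOVES[action][1], 0), 3)
    tile = LAKE[row][col]
    return row * 4 + col, (1.0 if tile == "G" else 0.0), tile in "HG"

def train(episodes=5000, learning_rate=0.6, discount_rate=0.9,
          explore_prob=0.5, max_steps=50, seed=0):
    rng = random.Random(seed)
    Q = [[0.0] * 4 for _ in range(16)]
    for _ in range(episodes):
        state = 0
        for _ in range(max_steps):
            if rng.random() < explore_prob:
                action = rng.randrange(4)  # exploration
            else:
                best = max(Q[state])       # exploitation, random tie-breaking
                action = rng.choice(
                    [a for a in range(4) if Q[state][a] == best])
            new_state, reward, done = step(state, action)
            if done:
                Q[state][action] = reward
                break
            target = reward + discount_rate * max(Q[new_state])
            Q[state][action] = (learning_rate * target
                                + (1 - learning_rate) * Q[state][action])
            state = new_state
    return Q

Q = train()

# Follow the greedy policy from the start to check the learned values
state, path = 0, [0]
while LAKE[state // 4][state % 4] not in "HG" and len(path) < 20:
    state, _, _ = step(state, Q[state].index(max(Q[state])))
    path.append(state)
print(path)
print([round(q, 4) for q in Q[0]])
```

After training, the values in the first row should be close to the first row of the table printed above, and the greedy path should end in state 15.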


We set the learning rate to 0.6. In this simple case, other values work as well: you can set it to a much smaller value such as 0.1, and as long as you train the model for enough episodes, the Q values will converge. 

The discount rate we use here is 0.9. This value directly affects the Q values. Remember from the last subsection that in equilibrium, when the Q values converge, we have 
$$Q(s,a) = \text{Reward} + \text{discount factor} \times \max_{a'} Q(s', a')$$
You can see that the converged Q values are a function of the discount factor. In our case, as long as you set it anywhere between 0.9 and 1, the resulting Q values will work in the sense that they will guide the agent to the destination safely. 
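
To see how the discount factor shapes the converged values, note that the shortest path takes six moves and only the last move earns a reward of 1. Applying the equilibrium equation backward from the goal, the best Q value in state 0 is the discount factor raised to the fifth power. The sketch below checks this for a few values; the helper name is hypothetical, and 0.5 and 0.99 are illustrative choices.

```python
def best_q_at_start(discount_factor, steps_to_goal=6):
    """Best Q value in state 0 when the goal is steps_to_goal moves away
    and only the final move earns a reward of 1."""
    return discount_factor ** (steps_to_goal - 1)

for gamma in (0.5, 0.9, 0.99):
    print(gamma, round(best_q_at_start(gamma), 6))
# 0.5 0.03125
# 0.9 0.59049
# 0.99 0.95099
```

With a discount factor of 0.9 the result is 0.59049, which matches the largest entries in the first row of the trained Q-table.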

***
$\mathbf{\text{Exploration versus Exploitation}}$<br>
***
Another important parameter in the process of training the Q-table is the exploration rate. Exploration means that the agent randomly selects an action in a given state. This is important for training the Q values because without it, the Q values may be trapped in the wrong equilibrium and cannot get out of it. Exploration gives the agent the chance to try new strategies and see if they lead to higher Q values. 

Exploitation is the opposite of exploration: it means the agent chooses the action based on the values in the current Q-table. This ensures that the final Q table converges. 

At each iteration, the Q values are updated. If the agent wins the game, the action earns a reward of 1. If the agent fails, the action earns a reward of 0, because the Frozen Lake environment gives no reward for falling into a hole and the training script uses the reward as is. The update rule follows the equation we specified earlier: 
$$\text{New } Q(s,a) = \text{learning rate} \times [\text{Reward} + \text{discount factor} \times \max_{a'} Q(s', a')] + (1-\text{learning rate}) \times \text{Old } Q(s, a)$$


***
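
The choice between exploring and exploiting in the training script can be sketched on its own as follows. The parameter values are taken from the script, while the helper names `exploration_probability` and `choose_action` are hypothetical. Note that, as the script is written, the chance of a random move rises from 0.3 toward 0.7 over the course of training; many Q-learning implementations instead reduce exploration over time.

```python
import random

# Values taken from the training script in this chapter
min_exp, max_exp, max_episode = 0.3, 0.7, 10000

def exploration_probability(episode):
    """Chance of a random move in a given episode, as implied by the script."""
    return min_exp + (max_exp - min_exp) * episode / max_episode

def choose_action(q_row, episode, rng=random):
    """Exploit the Q row or explore at random (hypothetical helper)."""
    if rng.random() > exploration_probability(episode):
        return q_row.index(max(q_row))  # exploitation: best known action
    return rng.randrange(len(q_row))    # exploration: random action

for episode in (0, 5000, 10000):
    print(episode, round(exploration_probability(episode), 2))
# 0 0.3
# 5000 0.5
# 10000 0.7
```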