# Chapter 13: Introduction to Reinforcement Learning

In this chapter, you’ll learn how reinforcement learning works.

After this chapter, you'll be able to create an animation that shows, step by step, how Q-learning works in the Frozen Lake game. In each state, you'll put the game board on the left and the Q-table on the right. You'll highlight the row corresponding to the current state and compare the Q-values under the four actions. You'll then highlight the best action in red, like so: 
<img src="https://gattonweb.uky.edu/faculty/lium/ml/frozen_q_steps.gif"/>

We’ll use the Frozen Lake game in OpenAI Gym to illustrate the concepts of dynamic programming and the Bellman equation, and to show how to implement Q-learning.

Machine learning can be classified into three different areas: Supervised learning, unsupervised learning, and reinforcement learning. 

In supervised learning, we show the machine learning models many examples of input-output pairs. The output values are also called target variables (or labels). The model extracts features from the input data (e.g., images) and associates them with the output (image labels such as horses, deer, cats, or dogs). We then apply the trained model to new examples and make predictions on what the output should be (is the image a horse or a deer?). 

In the previous chapters, we have discussed deep neural networks, which are examples of supervised learning.

In contrast, in unsupervised learning, there are no pre-assigned target variables (labels) for the training data. The unsupervised learning models must find naturally-occurring patterns from the training data. Examples of unsupervised learning methods include clustering, principal component analysis, and data visualization (plotting, graphing, and so on). 

In reinforcement learning, an agent operates in an environment through trial and error. The agent learns to achieve the optimal outcome by receiving feedback from the environment in the form of rewards. For the rest of the book, we’ll discuss various types of reinforcement learning methods, which include tabular Q learning, deep Q learning, policy gradients, and double deep Q learning. 

***
$\mathbf{\text{Create a subfolder for files in Chapter 13}}$<br>
***
We'll put all files in Chapter 13 in a subfolder /files/ch13. The code in the cell below will create the subfolder.

***

In [2]:
import os

os.makedirs("files/ch13", exist_ok=True)

## 1. Basics of Reinforcement Learning
Reinforcement Learning (RL) is one type of Machine Learning (ML). In a typical RL problem, an agent decides how to choose among a list of actions in an environment step by step in order to maximize the cumulative payoff from all the steps that he or she has taken. 

RL is widely used in many different fields, from control theory and operations research to statistics. The problem is usually modeled as a Markov Decision Process (MDP). The agent interacts with the environment through trial and error to see what rewards its actions lead to. The agent then adjusts its decisions based on the outcome: good choices are rewarded and bad ones are penalized. Hence the name reinforcement learning. 

### 1.1. Basic Concepts
Let’s first discuss a few basic concepts related to RL: environment, agent, state, action, reward.
 
* Environment: the world in which agent(s) live and interact with each other or with nature. More important, an environment is where the agent(s) can explore and learn the best strategies. Examples include the Frozen Lake game, the popular Breakout Atari game, or a real-world problem that we need to solve. 
* Agent: the player of the game. In most games, there is one player and the opponent is embedded in the environment. But you have seen two-player games such as Tic Tac Toe or Connect Four earlier in this book. 
* State: the current situation of the game. The current game board in the Connect Four game, for example, is the current state of the game.
* Action: what the player decides to do given the current game situation. 
* Reward: the payoff from the game. You can assign a value to each situation. Positive values are rewards and negative values penalties.

These concepts will become clearer as we move along.
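
To make these five terms concrete, here is a minimal sketch of the loop in which an agent and an environment interact. The corridor environment, the `CorridorEnv` class, and the always-go-right agent are hypothetical and exist only for illustration; they are not part of OpenAI Gym or of the Frozen Lake game.

```python
class CorridorEnv:
    """A toy environment: a corridor of 4 cells, with the goal in the last cell."""

    def __init__(self):
        self.n_states = 4
        self.state = 0

    def reset(self):
        # The state is the agent's current cell; every game starts in cell 0
        self.state = 0
        return self.state

    def step(self, action):
        # Actions: 0 means move left, 1 means move right (walls block both ends)
        move = -1 if action == 0 else 1
        self.state = min(max(self.state + move, 0), self.n_states - 1)
        done = self.state == self.n_states - 1
        # Reward: 1 for reaching the goal, 0 otherwise
        reward = 1 if done else 0
        return self.state, reward, done


def always_right(state):
    """A trivial agent: its policy maps every state to the action 'right'."""
    return 1


env = CorridorEnv()
state = env.reset()
total_reward = 0
visited = [state]
done = False
while not done:
    action = always_right(state)            # the agent picks an action
    state, reward, done = env.step(action)  # the environment responds
    total_reward += reward
    visited.append(state)

print(visited, total_reward)
```

The agent here never learns anything; the point is only to show where the state, the action, and the reward appear in code.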

### 1.2. The Bellman Equation and Q-Learning
Q-learning is one way to solve the optimization problem in RL. It is a value-based approach. Another approach is policy gradients, which is policy-based. We’ll discuss both in this book.

The agent is trying to learn the best strategy in order to maximize his or her expected payoff over time. A strategy (also called a policy) maps a certain state to a certain action. A strategy is basically a decision rule that tells the agent what to do in a certain situation.

The Q-value, $Q(s, a)$, measures how good it is to take action $a$ in state $s$. You can interpret the letter Q as quality. The better the action, the higher the eventual payoff to the agent, and the higher the Q-value. In each state, the agent looks for the action with the highest Q-value.

An agent’s action now affects not only the reward in this period, but also rewards in future periods. Therefore, finding the best strategy can be complicated and involves dynamic programming. For details of dynamic programming and the Bellman equation, see, e.g., https://en.wikipedia.org/wiki/Bellman_equation

In the setting of Q-learning, the Bellman equation is as follows:
$$Q(s,a) = Reward + Discount\ Factor * \max_{a'} Q(s', a')$$
where $Q(s, a)$ is the Q-value to the agent in the current state $s$ when an action $a$ is taken. Reward is the payoff to the agent as a result of this action. Discount factor is a constant between 0 and 1, and it measures how much the agent discounts future reward as opposed to current reward. Lastly, $\max_{a'} Q(s', a')$ is the maximum future reward in the next state $s'$, assuming optimal strategies will be applied in the future as well. 
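
As a worked example of this equation, the sketch below assumes a discount factor of 0.9, a reward of 1 for the move that reaches the goal, and a reward of 0 for every other move (the values used later in this chapter). It also assumes a path that reaches the goal in six moves, with each move being the best one available. The variable names are illustrative.

```python
discount_factor = 0.9
n_moves = 6  # number of moves on the path; the last one reaches the goal

# Work backward from the goal. The final move ends the game,
# so there is no future term: Q = reward = 1.
q_values = [1.0]
for _ in range(n_moves - 1):
    reward = 0  # intermediate moves earn nothing
    # Bellman equation: Q(s, a) = reward + discount factor * max Q(s', a'),
    # where max Q(s', a') is the Q-value of the next move on the path
    q_values.append(reward + discount_factor * q_values[-1])

# Reverse so that the first entry belongs to the first move of the game
q_values.reverse()
print([round(q, 6) for q in q_values])
```

Each move further from the goal multiplies the Q-value by the discount factor once more, so the first move is worth 0.9 to the fifth power, or about 0.59.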

In order to find the Q-values, we’ll try different actions in each state multiple times. We’ll adjust the Q-values based on the outcome: we increase the Q-value if the reward is high and decrease it if the reward is low or even negative. Hence the name reinforcement learning.

Rather than providing you with a lot of abstract technical jargon, I will use a simple example to show you how reinforcement learning works. 

## 2. Use Q Values to Play the Frozen Lake Game

You have learned how to play the Frozen Lake game using deep learning in Chapter 8. So I assume you know how the game works. If not, check Chapter 8 for details. 

The OpenAI Gym environment is designed for training RL game strategies. In particular, in this chapter, you'll learn how to use Q-learning to play the Frozen Lake game.

### 2.1. The Logic Behind Q Learning
What if you have a Q-table to guide you to successfully play the Frozen Lake game? The Q-table is a 16 by 4 matrix, with the rows representing the 16 states: 0 means the top left corner (that is, the starting position), 3 means the top right corner, and 15 means the bottom right corner (that is, the goal position; i.e., the winning position). The four columns represent the four actions that the agent can take in any state: 0 means going left, 1 going down, 2 going right, and 3 going up.
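
The numbering of states and actions described above can be sketched in a few lines. The helper names `state_to_row_col`, `row_col_to_state`, and `ACTION_NAMES` are hypothetical and are not part of OpenAI Gym.

```python
N_COLS = 4  # the Frozen Lake board is 4 by 4
ACTION_NAMES = {0: "left", 1: "down", 2: "right", 3: "up"}


def state_to_row_col(state):
    """Convert a state number (0 to 15) to a (row, column) pair, both starting at 0."""
    return divmod(state, N_COLS)


def row_col_to_state(row, col):
    """Convert a (row, column) pair back to a state number."""
    return row * N_COLS + col


print(state_to_row_col(0))    # top left corner (the starting position)
print(state_to_row_col(3))    # top right corner
print(state_to_row_col(15))   # bottom right corner (the goal)
print(ACTION_NAMES[1])
```

Each row of the Q-table corresponds to one state number, and each column to one key of `ACTION_NAMES`.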

The Q table can be downloaded from my website https://gattonweb.uky.edu/faculty/lium/ml/Qtable.csv

<br> If you open the file, you'll see a table as follows (I added two rows and one column for explanatory purposes):
<img src="https://gattonweb.uky.edu/faculty/lium/ml/Qtable.png" />

With the guidance of the Q-table, reaching the destination (i.e., state 15, or the lower right corner) safely is easy for a computer program. Here are the steps:

1.	The computer starts at state 0.
2.	It looks at the above Q table and consults the row corresponding to state 0, which has four values: 0.531, 0.59, 0.59, and 0.531. The four values tell the computer the total payoff from taking the four actions in state 0. 
3.	The computer chooses the action that leads to the highest Q value: actions 1 and 2 both have a payoff of 0.59, higher than those from taking actions 0 or 3. We have a tie here, so the computer chooses action 1 (that is, going down) in this case (the first of the two tied actions, 1 and 2). 
4.	Since the computer has chosen going down in state 0, the new state is now the first column in the second row based on the map of the frozen lake. Therefore, the new state is state 4.
5.	The computer now chooses the best action in state 4 based on the above Q table, following the same logic in steps 2 and 3. This means the computer takes action 1 again.
6.	The computer repeats the above steps until the game ends (that is, either the agent falls into a hole or reaches the destination).

Based on the numbers in the Q table and the logic in the above steps, the computer will take the following actions sequentially: down, down, right, down, right, and right. It will pass the following states: state 0 to state 4 to state 8 to state 9 to state 13 to state 14 to state 15. 

As you can see, the computer has successfully reached the goal (state 15) without falling into one of the four holes (states 5, 7, 11, and 12).
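
The steps above can be checked without OpenAI Gym. The sketch below is a simplified stand-in for the game: the `move` function assumes the non-slippery 4 by 4 map with holes in states 5, 7, 11, and 12, and the Q-values are copied from the trained table printed in Section 3.2, on the assumption that it matches the downloadable table. The helper names are hypothetical.

```python
# Q-table rows for states 0 to 15; columns are left, down, right, up
Q = [
    [0.531441, 0.59049, 0.59049, 0.531441],
    [0.531441, 0.0, 0.6561, 0.59049],
    [0.59049, 0.729, 0.59049, 0.6561],
    [0.6561, 0.0, 0.59049, 0.59049],
    [0.59049, 0.6561, 0.0, 0.531441],
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.81, 0.0, 0.6561],
    [0.0, 0.0, 0.0, 0.0],
    [0.6561, 0.0, 0.729, 0.59049],
    [0.6561, 0.81, 0.81, 0.0],
    [0.729, 0.9, 0.0, 0.729],
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.81, 0.9, 0.729],
    [0.81, 0.9, 1.0, 0.81],
    [0.0, 0.0, 0.0, 0.0],
]
HOLES = {5, 7, 11, 12}
GOAL = 15


def move(state, action):
    """Deterministic move on the 4x4 board; the edges keep the agent in place."""
    row, col = divmod(state, 4)
    if action == 0:      # left
        col = max(col - 1, 0)
    elif action == 1:    # down
        row = min(row + 1, 3)
    elif action == 2:    # right
        col = min(col + 1, 3)
    else:                # up
        row = max(row - 1, 0)
    return row * 4 + col


def greedy_action(q_row):
    """Index of the largest Q-value; ties go to the first (lowest) index."""
    return max(range(len(q_row)), key=lambda a: q_row[a])


state = 0
path = [state]
actions = []
while state not in HOLES and state != GOAL:
    action = greedy_action(Q[state])
    actions.append(action)
    state = move(state, action)
    path.append(state)

print("actions:", actions)
print("states: ", path)
```

Under these assumptions, the printed states and actions follow the same sequence as the one listed above.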

We’ll code this logic next.

### 2.2. A Python Script to Win the Frozen Lake Game

Run the following short script.

In [2]:
import gym
import numpy as np

env = gym.make('FrozenLake-v0', is_slippery=False)
env.reset()

Q = np.loadtxt(r'https://gattonweb.uky.edu/faculty/lium/ml/Qtable.csv', delimiter=",")

def play_game():
    state=env.reset()
    env.render()
    while True:
        action = np.argmax(Q[state, :])
        print(f'current state is {state} and the action is {action}')
        new_state, reward, done, _ = env.step(action)
        # new_state becomes the state in the next round
        state=new_state
        env.render()
        if done==True:
            if reward ==1:
                print('Congratulations, you have reached the destination!')
            else:
                print('Sorry, better luck next time.')
            break    

play_game()
env.close()


[41mS[0mFFF
FHFH
FFFH
HFFG
current state is 0 and the action is 1
  (Down)
SFFF
[41mF[0mHFH
FFFH
HFFG
current state is 4 and the action is 1
  (Down)
SFFF
FHFH
[41mF[0mFFH
HFFG
current state is 8 and the action is 2
  (Right)
SFFF
FHFH
F[41mF[0mFH
HFFG
current state is 9 and the action is 1
  (Down)
SFFF
FHFH
FFFH
H[41mF[0mFG
current state is 13 and the action is 2
  (Right)
SFFF
FHFH
FFFH
HF[41mF[0mG
current state is 14 and the action is 2
  (Right)
SFFF
FHFH
FFFH
HFF[41mG[0m
Congratulations, you have reached the destination!


The Q-values are saved in a CSV file. We use the *loadtxt()* function in ***numpy*** to load the Q-table.

The state=env.reset() command resets the game so that the initial state is 0. 

We then use a while loop to play the game. In each iteration, the computer chooses the action that leads to the highest Q-value in that state. Note here that the *argmax()* function in ***numpy*** returns the index of the highest value (the first such index if there is a tie). This is different from the *max()* function in ***numpy***, which returns the highest value itself. 
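
A short illustration of the difference, using the four Q-values for state 0 listed in Section 2.1 (the variable names are illustrative):

```python
import numpy as np

# Q-values in state 0 for the actions left, down, right, up
q_row = np.array([0.531, 0.59, 0.59, 0.531])

best_action = np.argmax(q_row)  # position of the largest value (first one on ties)
best_value = np.max(q_row)      # the largest value itself

print(best_action, best_value)  # prints: 1 0.59
```

Because actions 1 and 2 are tied, `np.argmax()` returns 1, which is why the computer goes down rather than right in state 0.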

We then print out the current state, and the action taken by the computer in that state so that we can keep track of the path taken by the computer.

The *env.step()* method returns the new state and the reward based on the action taken. It also tells us whether the game has ended. If the game has not ended, we set the new state as the current state and go to the next iteration to repeat the process. If the game has ended, the while loop stops and the script ends.

The computer has successfully reached state 15, taking one of the shortest possible paths. 

You can run the script multiple times, and the output will be the same every time, because there is no randomness involved.

Amazing, right? You may wonder: where do the numbers in the Q-table come from? That’s what we’ll discuss next.

## 3. Training the Q-Values
In this section, we’ll first discuss what Q-learning is and the logic behind it. We'll then code the logic in a script that generates the Q-values we used in the last section.

### 3.1. What Is Q-Learning?
The Q-values form a table with S rows (one per state) and A columns (one per action), and we call it the Q-table. In the Frozen Lake game, S is 16 and A is 4. We need to find the Q-values so that the player can use them to figure out the best action in every situation. 

Before Q-learning starts, we set all the values in the Q-table as 0.

At each iteration, we’ll use reinforcement learning (Q-learning, to be exact) to update Q values as follows:

$$New\ Q(s,a) = learning\ rate * [Reward + discount\ factor * \max_{a'} Q(s', a')] + (1-learning\ rate) * Old\ Q(s, a)$$
Here the learning rate, which has a value between 0 and 1, controls how fast you update the Q-values. The updated $Q(s, a)$ is a weighted average of the value implied by the Bellman equation and the previous $Q(s, a)$. This is where updating (i.e., learning) happens.

After many rounds of trial and error, the update will be minimal, which means the Q values converge to the equilibrium value. 

If you look at the above equation, when 
$$Q(s,a) = Reward + discount\ factor * \max_{a'} Q(s', a')$$
there is no update, and we have 
$$New\ Q(s,a) = Old\ Q(s, a)$$
That is the equilibrium state we are looking for. 
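
The update rule and its equilibrium can be sketched as a small function. The name `q_update` and its arguments are hypothetical; the default learning rate of 0.6 and discount factor of 0.9 are the values used in the training script later in this chapter.

```python
def q_update(old_q, reward, max_next_q, learning_rate=0.6, discount_factor=0.9):
    """Return the new Q(s, a): a weighted average of the Bellman target and the old value."""
    target = reward + discount_factor * max_next_q
    return learning_rate * target + (1 - learning_rate) * old_q


# One update from an initial Q-value of 0, when the best next Q-value is 1:
# the new value moves 60% of the way from 0 toward the target of 0.9
first = q_update(old_q=0.0, reward=0, max_next_q=1.0)
print(first)

# At equilibrium the old value already equals the target, so nothing changes
settled = q_update(old_q=0.9, reward=0, max_next_q=1.0)
print(settled)
```

The first call gives a value of about 0.54, and the second leaves the Q-value at about 0.9, which is the "no update" case described above.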

### 3.2. Let the Learning Begin
We’ll write a Python script and let the agent play the game for many rounds, mixing random moves with moves suggested by the current Q-table. Unavoidably, there will be many mistakes along the way. But the agent receives a low reward if it fails, so it assigns a low Q-value to the action taken in that state. On the other hand, if the agent makes the right choices and successfully reaches the destination, it receives a high reward, so it assigns high Q-values to the actions taken. 

It’s through such repeated rewards and punishments that the agent learns the correct Q values.

The script in the following cell trains the Q-table.

In [4]:
import gym
import numpy as np


env = gym.make('FrozenLake-v0', is_slippery=False)
env.reset()

learning_rate=0.6
discount_rate=0.9

max_exp=0.7
min_exp=0.3
max_steps=50
max_episode=10000

Q = np.zeros((16, 4))

def update_Q(episode):
    # The initial state is the starting position (state 0)
    state=env.reset()      
    # Play a full game till it ends
    for _ in range(max_steps):
        # Select the best action or the random action
        if np.random.uniform(0,1,1)>min_exp+(max_exp-min_exp)*episode/max_episode:
            action = np.argmax(Q[state, :])
        else:
            action = env.action_space.sample()
        # Use the selected action to make the move
        new_state, reward, done, _ = env.step(action)
        # Update Q values
        if done==True:
            Q[state, action] = reward
            break    
        else:
            Q[state, action] = learning_rate*(reward+discount_rate*np.max(Q[new_state, :]))\
                + (1-learning_rate)*Q[state, action]
            state=new_state    
       
for episode in range(max_episode):
    update_Q(episode)
    
print(Q) 

# Save the trained Q for later use
np.savetxt("files/ch13/trained_frozenlake_Qs.csv", Q, delimiter=',')

[[0.531441 0.59049  0.59049  0.531441]
 [0.531441 0.       0.6561   0.59049 ]
 [0.59049  0.729    0.59049  0.6561  ]
 [0.6561   0.       0.59049  0.59049 ]
 [0.59049  0.6561   0.       0.531441]
 [0.       0.       0.       0.      ]
 [0.       0.81     0.       0.6561  ]
 [0.       0.       0.       0.      ]
 [0.6561   0.       0.729    0.59049 ]
 [0.6561   0.81     0.81     0.      ]
 [0.729    0.9      0.       0.729   ]
 [0.       0.       0.       0.      ]
 [0.       0.       0.       0.      ]
 [0.       0.81     0.9      0.729   ]
 [0.81     0.9      1.       0.81    ]
 [0.       0.       0.       0.      ]]


We set the learning rate to 0.6. In this simple case, other values work too: you can set it to a much smaller value such as 0.1, and as long as you train the model for enough episodes, the Q-values will still converge. 
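
To get a feel for how the learning rate affects the speed of convergence, the sketch below counts how many updates a single Q-value needs to get within 1% of its target. It is a simplification: it assumes the target stays fixed at 0.9, whereas in real training the target itself moves as other Q-values are updated. The function name `updates_needed` is hypothetical.

```python
def updates_needed(learning_rate, target=0.9, tolerance=0.01):
    """Count the updates needed for a Q-value starting at 0 to get within
    `tolerance` (as a fraction of the target) of a fixed target."""
    q = 0.0
    count = 0
    while abs(target - q) > tolerance * target:
        # Same weighted-average rule as in the training script, with a fixed target
        q = learning_rate * target + (1 - learning_rate) * q
        count += 1
    return count


for lr in (0.6, 0.1):
    print(lr, updates_needed(lr))
```

Under these assumptions, a learning rate of 0.6 needs 6 updates and a learning rate of 0.1 needs 44, which is why a smaller learning rate calls for more training.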

The discount rate we use here is 0.9. This value will directly affect the Q-values. Remember in the last subsection, we discussed that in equilibrium when the Q-values converge, we have 
$$Q(s,a) = Reward + discount\ factor * \max_{a'} Q(s', a')$$
You can see that the converged Q-value is a function of the discount factor. In our case, as long as you set it anywhere from 0.9 up to (but not including) 1, the resulting Q-values will work in the sense that they will guide the agent to the destination safely. With a discount factor of exactly 1, every action that can still reach the goal would end up with the same Q-value, so the agent could no longer tell a short path from a long one. 
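
To see how the converged values depend on the discount factor, one can solve the Bellman equation directly on the 4 by 4 map. The sketch below is not the Q-learning script above: it sweeps over all state-action pairs and applies the Bellman equation repeatedly, assuming the non-slippery map with holes in states 5, 7, 11, and 12, a reward of 1 for reaching state 15, and a reward of 0 otherwise. The function names are hypothetical.

```python
HOLES = {5, 7, 11, 12}
GOAL = 15


def move(state, action):
    """Deterministic move; actions are 0 left, 1 down, 2 right, 3 up."""
    row, col = divmod(state, 4)
    d_row, d_col = {0: (0, -1), 1: (1, 0), 2: (0, 1), 3: (-1, 0)}[action]
    row = min(max(row + d_row, 0), 3)  # the edges keep the agent on the board
    col = min(max(col + d_col, 0), 3)
    return row * 4 + col


def solve_q(discount_factor, sweeps=100):
    """Apply the Bellman equation repeatedly until the Q-values settle."""
    Q = [[0.0] * 4 for _ in range(16)]
    for _ in range(sweeps):
        for state in range(16):
            if state in HOLES or state == GOAL:
                continue  # the game is over in these states, so their rows stay 0
            for action in range(4):
                new_state = move(state, action)
                if new_state == GOAL:
                    Q[state][action] = 1.0   # reward 1, no future term
                elif new_state in HOLES:
                    Q[state][action] = 0.0   # reward 0, no future term
                else:
                    Q[state][action] = discount_factor * max(Q[new_state])
    return Q


for gamma in (0.9, 0.5):
    Q = solve_q(gamma)
    print(gamma, [round(q, 6) for q in Q[0]])
```

With a discount factor of 0.9, the row for state 0 matches the first row of the trained Q-table printed above. With 0.5, the numbers are much smaller, but action 1 is still among the best actions in state 0.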

***
$\mathbf{\text{Exploration versus Exploitation}}$<br>
***
Another important parameter in the process of training the Q-table is the exploration rate. Exploration means that the agent randomly selects an action in a given state. This is important for training the Q-values because without it, the Q-values may be trapped in the wrong equilibrium and cannot get out of it. Exploration gives the agent the chance to try new strategies and see if they lead to higher Q-values. In the script above, the chance of a random move in each step rises linearly from min_exp (0.3) in the first episode to max_exp (0.7) in the last. 

Exploitation is the opposite of exploration: it means the agent chooses the action based on the values in the current Q-table. This ensures that the final Q table converges. 
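
The choice between exploration and exploitation can be sketched as follows. The threshold formula is the one used in the training script above; the function names, and the use of Python's `random` module in place of ***numpy***, are choices made only for this illustration.

```python
import random


def exploration_probability(episode, max_episode=10000, min_exp=0.3, max_exp=0.7):
    """Chance of a random move in a given episode, using the same linear
    formula as the threshold in the training script."""
    return min_exp + (max_exp - min_exp) * episode / max_episode


def choose_action(q_row, explore_prob, rng):
    """Exploit (take the best known action) or explore (take a random action)."""
    if rng.uniform(0, 1) > explore_prob:
        # Exploitation: first index of the largest Q-value
        return max(range(len(q_row)), key=lambda a: q_row[a])
    # Exploration: any of the actions with equal probability
    return rng.randrange(len(q_row))


rng = random.Random(0)  # seeded only to make the sketch repeatable
q_row = [0.0, 0.81, 0.9, 0.729]  # Q-values in state 13 from the trained table
for episode in (0, 5000, 9999):
    p = exploration_probability(episode)
    print(episode, round(p, 3), choose_action(q_row, p, rng))
```

Many implementations let the exploration probability fall over time instead; that is a different design choice from the one in the script.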

At each iteration, the Q-values are updated. If the agent wins the game, the action earns a reward of 1. If the agent fails, the action earns a reward of 0. When a move ends the game, the script sets the Q-value of that move directly to the reward, because there is no future state to consider. Otherwise, the update rule follows the equation we specified earlier: 
$$New\ Q(s,a) = learning\ rate * [Reward + discount\ factor * \max_{a'} Q(s', a')] + (1-learning\ rate) * Old\ Q(s, a)$$


***

## 4. Test the Trained Q-Values
Now, you can test whether the trained Q-table works. You'll first use the OpenAI Gym environment to test it. You'll then use the self-made Frozen Lake game environment to test the Q-values. 

### 4.1. Test in the OpenAI Gym environment
The following script is the same as you just ran in Section 2.2, except that you are using your own trained Q-table, instead of a Q-table provided by me.

In [5]:
import gym
import numpy as np

env = gym.make('FrozenLake-v0', is_slippery=False)
env.reset()

# Use the Q-table you just trained
Q = np.loadtxt('files/ch13/trained_frozenlake_Qs.csv', delimiter=",")

def play_game():
    state=env.reset()
    env.render()
    while True:
        action = np.argmax(Q[state, :])
        print(f'current state is {state} and the action is {action}')
        new_state, reward, done, _ = env.step(action)
        # new_state becomes the state in the next round
        state=new_state
        env.render()
        if done==True:
            if reward ==1:
                print('Congratulations, you have reached the destination!')
            else:
                print('Sorry, better luck next time.')
            break    

play_game()
env.close()


[41mS[0mFFF
FHFH
FFFH
HFFG
current state is 0 and the action is 1
  (Down)
SFFF
[41mF[0mHFH
FFFH
HFFG
current state is 4 and the action is 1
  (Down)
SFFF
FHFH
[41mF[0mFFH
HFFG
current state is 8 and the action is 2
  (Right)
SFFF
FHFH
F[41mF[0mFH
HFFG
current state is 9 and the action is 1
  (Down)
SFFF
FHFH
FFFH
H[41mF[0mFG
current state is 13 and the action is 2
  (Right)
SFFF
FHFH
FFFH
HF[41mF[0mG
current state is 14 and the action is 2
  (Right)
SFFF
FHFH
FFFH
HFF[41mG[0m
Congratulations, you have reached the destination!


You get exactly the same results as those in Section 2.2. So our training of the Q-table works. 

### 4.2. Apply the Q-Table in the Self-Made Game Environment
You learned how to create a game environment from scratch in Chapter 10. Recall that the custom-made game environment works exactly the same as the OpenAI Gym environment, except that it uses a graphical rendering instead of a print-out rendering. Therefore, the Q-table should work in the custom-made environment as well. 

Let's check.

In [6]:
from utils.frozenlake_env import Frozen
import numpy as np
import turtle as t
import time

env = Frozen()

# Use the Q-table you just trained
Q = np.loadtxt('files/ch13/trained_frozenlake_Qs.csv', delimiter=",")

def play_game():
    state=env.reset()
    env.render()
    while True:
        # Slow down the game so that you can see the graphical rendering
        time.sleep(1)
        action = np.argmax(Q[state, :])
        print(f'current state is {state} and the action is {action}')
        new_state, reward, done, _ = env.step(action)
        # new_state becomes the state in the next round
        state=new_state
        env.render()
        if done==True:
            if reward ==1:
                print('Congratulations, you have reached the destination!')
            else:
                print('Sorry, better luck next time.')
            break    

play_game()
time.sleep(5)
env.close()

current state is 0 and the action is 1
current state is 4 and the action is 1
current state is 8 and the action is 2
current state is 9 and the action is 1
current state is 13 and the action is 2
current state is 14 and the action is 2
Congratulations, you have reached the destination!


The trained Q-table guided the program to choose the following actions: 1, 1, 2, 1, 2, 2 (down, down, right, down, right, right) and successfully reached the destination. This is one of the shortest paths that can win the game.

There is also a game board at each step, showing you where the player is. At the end of the game, the graphical rendering looks as follows:<br>
<img src="https://gattonweb.uky.edu/faculty/lium/ml/frozen6.png" />



## 5. Animate the Q-Learning Process
In this section, we'll create an animation to show how the agent makes a decision by consulting the Q-table at each step.

### 5.1. Draw the Q-table and Highlight Values and Actions
We'll first draw a Q-table. At each step, we'll highlight the corresponding row in blue based on which state the agent is in. We'll then highlight in red the action with the highest Q-value in that row, and use that as the best action. We'll repeat this step by step until the game ends. 

For that purpose, we create a list *states* to contain all the states that the agent has been to; we also create a list *actions* to contain all actions taken by the agent,  as follows

```python
states = [0,4,8,9,13,14]
actions = [1,1,2,1,2,2]
```

In each state, we draw three pictures: picture a is the Q table; picture b is the Q table with the line corresponding to the current state highlighted in blue; picture c is the Q table with the action corresponding to the highest Q-value highlighted in red. 

The code in the cell below accomplishes the above tasks. 

In [1]:
from matplotlib import pyplot as plt
import numpy as np
from matplotlib.patches import Rectangle

# Use the Q-table you just trained
Q = np.loadtxt('files/ch13/trained_frozenlake_Qs.csv', delimiter=",")

# states and actions in each step
states = [0,4,8,9,13,14]
actions = [1,1,2,1,2,2]
xs = [0,0,2,0,2,2]
ys = [6,2,-2,-3,-7,-8]

for stepi in range(6):
    fig, ax=plt.subplots(figsize=(10,9), dpi=200)
    # table grid
    for x in range(-2,4,1):
        ax.plot([2*x,2*x],[-4.5,4.5],color='gray',linewidth=3)
    for y in range(-9,10,1):
        if y != 8:
            ax.plot([-6,6],[y/2,y/2],color='gray',linewidth=3)
    
    # four actions and 16 states
    plt.text(-3.5, 8.1/2, "action=",fontsize=18)
    for i in range(16):
        plt.text(-3.5, (6.1-i)/2, f"state {i}",fontsize=18)
    # Use a separate name so the list of action numbers above is not overwritten
    action_names = ["left", "down", "right", "up"]
    for i in range(4):
        plt.text(-1.3+2*i, 8.1/2, f"{i}",fontsize=18)
        plt.text(-1.5+2*i, 7.2/2, f"{action_names[i]}",fontsize=18)
    # write the 64 Q-values onto the graph
    for i in range(16):
        for j in range(4):
            plt.text(-1.8+2*j, (6.2-i)/2, f"{Q[i,j]:.3f}",fontsize=18)
    
    ax.set_xlim(-4,6)
    ax.set_ylim(-4.5,4.5)
    plt.savefig("files/ch13/qtableplt.png")
    plt.axis("off")
    plt.grid()
    plt.savefig(f"files/ch13/plt_Qs_stepa{stepi}.png")
  
    # highlight state row
    ax.add_patch(Rectangle((-4,ys[stepi]/2), 12,0.5,
                 facecolor = 'b',alpha=0.2))
    plt.savefig(f"files/ch13/plt_Qs_stepb{stepi}.png")
    # highlight action cell
    ax.add_patch(Rectangle((xs[stepi], ys[stepi]/2), 2,0.5,facecolor='r',alpha=0.8))
    plt.savefig(f"files/ch13/plt_Qs_stepc{stepi}.png")
    plt.close(fig)

If you open, for example, the picture plt_Qs_stepc2.png, you'll see the following: 
<img src="https://gattonweb.uky.edu/faculty/lium/ml/plt_Qs_stepc2.png"/>

In step 3, the agent is in state 8. Therefore, the row corresponding to state 8 in the Q-table is highlighted in light blue. The agent compares the four Q-values under the four actions. The values are 0.656, 0.000, 0.729, 0.590, respectively. Obviously, the Q-value under action=2 is the largest among the four numbers. Therefore, the agent chooses action 2 in this state. You can see that the number 0.729 is highlighted in red in the above picture. 
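
The lists *xs* and *ys* in the plotting code are typed in by hand, but they can also be derived from *states* and *actions*. The sketch below shows one way to do so, based on the layout used in the plot: each Q-value column is 2 units wide with action 1 starting at x = 0, and each row is half a unit tall with state 0 at y = 6/2.

```python
states = [0, 4, 8, 9, 13, 14]
actions = [1, 1, 2, 1, 2, 2]

# Left edge of the highlighted cell: action 0 starts at x = -2, action 1 at x = 0, ...
xs = [2 * (action - 1) for action in actions]
# Bottom edge of the highlighted row, in half-units: state 0 is at 6, state 1 at 5, ...
ys = [6 - state for state in states]

print(xs)
print(ys)
```

Deriving the coordinates this way keeps them in sync if the path changes, for example after retraining with different settings.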

### 5.2. Animate the Use of the Q-Table
Next, you'll combine the pictures created in the last subsection into an animation. As a result, you'll see step by step how the best actions were taken with the guidance of the Q-table.

In [2]:
import turtle as t
import time
import random
import matplotlib
from PIL import Image
import numpy as np
import imageio
import os


frames=[]

for i in range(6):
    for letter in ["a", "b", "c"]:
        im = Image.open(f"files/ch13/plt_Qs_step{letter}{i}.png")
        f1=np.asarray(im)
        frames.append(f1)
imageio.mimsave('files/ch13/plt_Qs_steps.gif', frames, fps=2)

If you open the file plt_Qs_steps.gif, you'll see the following: 
<img src="https://gattonweb.uky.edu/faculty/lium/ml/plt_Qs_steps.gif"/>

In each state, you see three frames: the Q-table, the Q-table with the row corresponding to the current state highlighted in blue, and the Q-table with the best action highlighted in red. 

### 5.3. Animate Game Board Positions and the Best Actions
We'll add the game board positions in each step to the above animation, and put them side by side with the Q-table in the animation.

First, we'll need to record all the game positions, as follows. 

In [3]:
from utils.frozenlake_env import Frozen
import numpy as np
import turtle as t
import time

env = Frozen()

# Use the Q-table you just trained
Q = np.loadtxt('files/ch13/trained_frozenlake_Qs.csv', delimiter=",")


state=env.reset()
env.render()
step = 0
try:
    ts = t.getscreen() 
except t.Terminator:
    ts = t.getscreen()
env.render()
# Save the starting position in the same folder as the later steps
ts.getcanvas().postscript(file=f"files/ch13/myenv{step}.ps")

while True:
    # Slow down the game so that you can see the graphical rendering
    #time.sleep(1)
    action = np.argmax(Q[state, :])
    print(f'current state is {state} and the action is {action}')
    new_state, reward, done, _ = env.step(action)
    # new_state becomes the state in the next round
    state=new_state
    env.render()
    step += 1      
    ts.getcanvas().postscript(file=f"files/ch13/myenv{step}.ps")

    
    if done==True:
        if reward ==1:
            print('Congratulations, you have reached the destination!')
        else:
            print('Sorry, better luck next time.')
        break    


time.sleep(5)
env.close()

current state is 0 and the action is 1
current state is 4 and the action is 1
current state is 8 and the action is 2
current state is 9 and the action is 1
current state is 13 and the action is 2
current state is 14 and the action is 2
Congratulations, you have reached the destination!


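The above program saves seven files in the ps format: one for the starting position and one after each of the six moves. Next, we'll convert the files from the ps format to the png format. Note that the Pillow library relies on Ghostscript to open ps files, so Ghostscript needs to be installed on your computer.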

In [5]:
import turtle as t
import time
import random
from matplotlib import pyplot as plt
from PIL import Image
import numpy as np
import imageio
import os

for i in range(7):
    im = Image.open(f"files/ch13/myenv{i}.ps")
    fig, ax=plt.subplots(figsize=(9,9), dpi=200)
    newax = fig.add_axes([0,0,1,1], anchor='NE', zorder=1)
    newax.imshow(im)
    newax.axis('off')
    ax.set_xlim(-4.5,4.5)
    ax.set_ylim(-4.5,4.5)
    plt.axis("off")
    #plt.grid()
    plt.savefig(f"files/ch13/myenv{i}plt.png")
    plt.close(fig)

You now have seven game board positions in the png format. 

Next, we'll put the game board on the left and the Q-table on the right to form one frame. Each of the six moves gets three frames (the plain Q-table, the highlighted row, and the highlighted action), and the final board position is repeated three times. Finally, we'll combine them into an animation, like so:

In [6]:
frames=[]

for i in range(6):
    for letter in ["a", "b", "c"]:
        im = Image.open(f"files/ch13/myenv{i}plt.png")
        f0=np.asarray(im)
        im = Image.open(f"files/ch13/plt_Qs_step{letter}{i}.png")
        f1=np.asarray(im)
        fs = np.concatenate([f0,f1],axis=1)
        frames.append(fs)
im = Image.open(f"files/ch13/myenv6plt.png")
f0=np.asarray(im)
im = Image.open(f"files/ch13/plt_Qs_stepa5.png")
f1=np.asarray(im)
fs = np.concatenate([f0,f1],axis=1)
frames.append(fs)
frames.append(fs)
frames.append(fs)

imageio.mimsave('files/ch13/frozen_q_steps.gif', frames, fps=2)

If you open the gif file, you'll see the following animation:
<img src="https://gattonweb.uky.edu/faculty/lium/ml/frozen_q_steps.gif"/>
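
The side-by-side frames in this animation rely on the two images having the same height. The sketch below illustrates the idea with small stand-in arrays instead of the real images. The pixel sizes in the comments assume the figure sizes and dpi used above with default save settings, and the four color channels are also an assumption.

```python
import numpy as np

# Stand-ins for the two images, scaled down by a factor of 100:
# game board (1800 by 1800 pixels) and Q-table (1800 tall, 2000 wide)
board = np.zeros((18, 18, 4), dtype=np.uint8)
qtable = np.ones((18, 20, 4), dtype=np.uint8)

# axis=1 joins the arrays left to right; the heights (axis 0) must match
frame = np.concatenate([board, qtable], axis=1)
print(frame.shape)

# 6 moves with 3 frames each, plus the final board repeated 3 times
n_frames = 6 * 3 + 3
print(n_frames, "frames,", n_frames / 2, "seconds at fps=2")
```

If the two images had different heights, `np.concatenate()` would raise an error, so the figure heights need to be kept equal when the pictures are created.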