# Chapter 8: Deep Learning Game Strategies

In the next five chapters, you’ll learn how to apply deep neural networks to real-life situations. In particular, you'll use deep learning to train intelligent game strategies in various games. This chapter walks you through the whole process of training a game strategy, from start to finish.

In this chapter, you'll learn how to play games in the OpenAI Gym environment. Even though games in the OpenAI Gym environment are designed for reinforcement learning, you'll learn how to creatively design a deep learning game strategy and win the game. 

We'll use the Frozen Lake game in the OpenAI Gym environment as the example in this chapter. You’ll learn how to generate data for training purposes. Once you have a trained model, you’ll learn how to use the model to design a best-move strategy and play the game. Finally, you’ll test the effectiveness of the game strategy.

At the end of this chapter, you'll create an animation to show how the agent uses the trained model to decide on the best next move. We'll first draw a game board with the current position of the agent. The agent then hypothetically plays each of the four possible next moves and lets the trained DNN model predict the probability of winning for each action. The agent picks the action with the highest probability of winning. We'll highlight the best action in the animation at each stage of the game, like so:
<img src="https://gattonweb.uky.edu/faculty/lium/ml/frozen_stages.gif" />

***
$\mathbf{\text{Create a subfolder for files in Chapter 8}}$<br>
***
We'll put all files in Chapter 8 in a subfolder /files/ch08. The code in the cell below will create the subfolder.

***

In [2]:
import os

os.makedirs("files/ch08", exist_ok=True)

## 1. Get Started with the OpenAI Gym Environment
OpenAI Gym provides the needed working environment for many simple games. Many machine learning enthusiasts use games in OpenAI Gym to test their algorithms. In this section, you’ll learn how to install the libraries needed in order to access games that we’ll use in this book. After that, you’ll learn how to play a simple game, the Frozen Lake, in this environment. 

Before you get started, install the OpenAI Gym library as follows with your virtual environment activated:

`pip install gym==0.15.7`

Or you can simply use the shortcut and run the following line of code in a new cell in this notebook:

In [1]:
!pip install gym==0.15.7



***
$\mathbf{\text{Python package version control}}$<br>
***
There is a newer version of OpenAI Gym, but the newer version is not compatible with Baselines, a package that we need to train the Breakout game.

In case you accidentally installed a newer version, run the following lines of code to correct it.

`pip uninstall gym`

`pip install gym==0.15.7`

***

### 1.1. Basic Elements of A Game Environment
The OpenAI Gym game environments are designed mainly for testing reinforcement learning (RL) algorithms. But we'll use them to test deep learning game strategies first, before testing RL algorithms in later chapters.

Let’s first discuss a few basic concepts related to the game environment: 
* Environment: the world in which agent(s) live and interact with each other or with nature. More importantly, an environment is where the agent(s) can explore and learn the best strategies. Examples include the Frozen Lake game we’ll discuss in this chapter, the popular Atari game Breakout, or a real-world problem that we need to solve. You’ll learn to use pre-created environments from OpenAI, and you’ll also learn to create your own environment later in this book. 
* Agent: the player of the game. In most games, there is one player and the opponent is embedded in the environment. But we'll also discuss two-player games such as Tic Tac Toe and Connect Four later in this book. 
* State: the current situation of the game. The current game board in the Connect Four game, for example, is the current state of the game. We'll explain more as we go along.
* Action: what the player decides to do given the current game situation. In a Tic Tac Toe game, your action is to choose which cell to place your piece, for example. 
* Reward: the payoff from the game. You can assign a numerical value to each game outcome. For example, in a Tic Tac Toe game, we can assign a reward of 0 to all situations except when the game ends, at which point you can assign a reward of 1 if you win, -1 if you lose the game. 

These concepts will become clearer as we move along.
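
To make the reward idea concrete, here is a minimal sketch of the Tic Tac Toe reward scheme described in the list above. The function name `terminal_reward` and its arguments are hypothetical and chosen only for illustration; treating a tie as a reward of 0 is an assumption.

```python
def terminal_reward(game_over, winner, player):
    """Reward for `player`: 0 while the game is running or after a tie,
    1 for a win, and -1 for a loss (illustrative scheme only)."""
    if not game_over or winner is None:
        return 0
    return 1 if winner == player else -1


print(terminal_reward(False, None, "X"))  # game still in progress
print(terminal_reward(True, "X", "X"))    # player X won
print(terminal_reward(True, "O", "X"))    # player X lost
```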

### 1.2. The Frozen Lake Game 

If you go to the official Frozen Lake game website https://gym.openai.com/envs/FrozenLake-v0/, you'll see an explanation similar to the following:
<img src="https://gattonweb.uky.edu/faculty/lium/ml/frozenlake.png" />

The agent moves on the surface of a frozen lake, which is simplified as a 4 by 4 grid. The agent starts at the top left corner and tries to get to the lower right corner, without falling into one of the four holes on the frozen lake. The conditions of the lake surface are as follows:
<img src="https://gattonweb.uky.edu/faculty/lium/ml/frozen_t.png" />

There are four holes on the lake surface, indicated by the four gray circles in the picture above.
We’ll write a Python script to access the OpenAI Gym environment and learn its features.  


The code in the cell below will get you started:

In [3]:
import gym 
 
env = gym.make("FrozenLake-v0", is_slippery=False)
env.reset()                    
env.render()


[41mS[0mFFF
FHFH
FFFH
HFFG


The *make()* method creates the game environment for us. We set the *is_slippery* argument to *False* so that the game is deterministic, meaning the agent always moves in the direction you choose. That is, we use the nonslippery version of the game. The default setting is *is_slippery=True*, in which case you may not end up at your intended location because the frozen lake surface is slippery. For example, when you choose to go left, you may end up going up or down instead. 
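
To illustrate what a slippery surface means, here is a minimal sketch in plain Python. It assumes the rule used by this version of the Frozen Lake environment, in which the executed direction is either the intended one or one of the two perpendicular directions, each with equal probability. The function names `possible_moves` and `slippery_move` are hypothetical and not part of the gym API.

```python
import random

ACTION_NAMES = {0: "left", 1: "down", 2: "right", 3: "up"}


def possible_moves(intended):
    """Actions that may actually be executed on a slippery surface
    (assumed rule: the intended action or a perpendicular one)."""
    return sorted({(intended - 1) % 4, intended, (intended + 1) % 4})


def slippery_move(intended, rng=random):
    """Randomly pick the executed action among the possible ones."""
    return rng.choice(possible_moves(intended))


# Choosing "left" can result in left, down, or up, but never right
print([ACTION_NAMES[a] for a in possible_moves(0)])
print(ACTION_NAMES[slippery_move(0)])
```

With *is_slippery=False*, as used throughout this chapter, none of this randomness applies.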

The *reset()* method starts the game and puts the player at the starting position. The *render()* method shows the current game state. Note that the Frozen Lake game doesn't provide a graphical interface. Instead, it uses letters to indicate the game environment. The 16 letters form a 4 by 4 grid, which represents the lake surface. The letters have the following meanings:
* S: starting position;
* F: frozen, meaning it's safe to ski on;
* H: hole, the player will fall into the hole and lose the game at this position;
* G: goal, the player wins the game if reaching this point. 

The current position is highlighted in red. In plain-text output, the highlight shows up as the color codes `[41m` and `[0m` around the letter. The above output shows that the player is at the top left corner of the lake. 

We can also print out all possible actions and states of the game as follows:

In [4]:
# Print out all possible actions in this game
actions = env.action_space
print(f"The action space in the Frozenlake game is {actions}")

# Print out all possible states in this game
states = env.observation_space
print(f"The state space in the Frozenlake game is {states}")

The action space in the Frozenlake game is Discrete(4)
The state space in the Frozenlake game is Discrete(16)


The action space in the Frozen Lake game has four values: 0, 1, 2, and 3, with the following meanings:
* 0: Going left
* 1: Going down
* 2: Going right
* 3: Going up

The state space in the Frozen Lake game has 16 values: 0, 1, 2, …, 15. The top left square is state 0, the top right is state 3, and the bottom right corner is state 15, as shown in the following picture:
<img src="https://gattonweb.uky.edu/faculty/lium/ml/lakesurface.png" />
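
The numbering of states and actions can be sketched in plain Python as follows. This is an illustrative stand-in for the nonslippery game, not the gym implementation; the helper names `to_row_col`, `to_state`, and `move` are hypothetical.

```python
N_ROWS, N_COLS = 4, 4
LAKE_MAP = ["SFFF", "FHFH", "FFFH", "HFFG"]


def to_row_col(state):
    """Convert a state number (0 to 15) into (row, column)."""
    return divmod(state, N_COLS)


def to_state(row, col):
    """Convert (row, column) back into a state number."""
    return row * N_COLS + col


def move(state, action):
    """Nonslippery move; the agent stays put when it would leave the grid."""
    row, col = to_row_col(state)
    if action == 0:                      # left
        col = max(col - 1, 0)
    elif action == 1:                    # down
        row = min(row + 1, N_ROWS - 1)
    elif action == 2:                    # right
        col = min(col + 1, N_COLS - 1)
    elif action == 3:                    # up
        row = max(row - 1, 0)
    return to_state(row, col)


# States that contain a hole, read off the letter map
holes = [s for s in range(16) if LAKE_MAP[s // N_COLS][s % N_COLS] == "H"]

print(to_row_col(15))   # bottom right corner
print(move(0, 3))       # going up from the top left corner
print(move(0, 1))       # going down from the top left corner
print(holes)
```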

You can play a complete game as follows:

In [4]:
import gym

env = gym.make('FrozenLake-v0', is_slippery=False)
env.reset()

while True:
    # Pick one of the four actions at random
    action = env.action_space.sample()
    print(action)
    # Apply the action and observe the result
    new_state, reward, done, info = env.step(action)
    env.render()
    print(new_state, reward, done, info)
    # Stop once the game has ended
    if done:
        break

0
  (Left)
[41mS[0mFFF
FHFH
FFFH
HFFG
0 0.0 False {'prob': 1.0}
0
  (Left)
[41mS[0mFFF
FHFH
FFFH
HFFG
0 0.0 False {'prob': 1.0}
2
  (Right)
S[41mF[0mFF
FHFH
FFFH
HFFG
1 0.0 False {'prob': 1.0}
0
  (Left)
[41mS[0mFFF
FHFH
FFFH
HFFG
0 0.0 False {'prob': 1.0}
3
  (Up)
[41mS[0mFFF
FHFH
FFFH
HFFG
0 0.0 False {'prob': 1.0}
1
  (Down)
SFFF
[41mF[0mHFH
FFFH
HFFG
4 0.0 False {'prob': 1.0}
1
  (Down)
SFFF
FHFH
[41mF[0mFFH
HFFG
8 0.0 False {'prob': 1.0}
3
  (Up)
SFFF
[41mF[0mHFH
FFFH
HFFG
4 0.0 False {'prob': 1.0}
0
  (Left)
SFFF
[41mF[0mHFH
FFFH
HFFG
4 0.0 False {'prob': 1.0}
1
  (Down)
SFFF
FHFH
[41mF[0mFFH
HFFG
8 0.0 False {'prob': 1.0}
3
  (Up)
SFFF
[41mF[0mHFH
FFFH
HFFG
4 0.0 False {'prob': 1.0}
1
  (Down)
SFFF
FHFH
[41mF[0mFFH
HFFG
8 0.0 False {'prob': 1.0}
1
  (Down)
SFFF
FHFH
FFFH
[41mH[0mFFG
12 0.0 True {'prob': 1.0}


The script above uses several methods in the game environment:
* The sample() method randomly selects an action from the action space. That is, it will return one of the values among {0, 1, 2, 3}. 
* The step() method is where the agent interacts with the environment. It takes the agent’s action as input and returns four values: the new state, the reward, a variable *done* indicating whether the game has ended, and a variable *info* that provides some information about the game. In this case, *info* contains the probability that the agent reaches the intended state. Since we are using the nonslippery version of the game, the probability is always 100%. 
* The render() method shows a diagram of the resulting state. 

The game loop is an infinite *while* loop. If the *done* variable returns a value *True*, the game ends, and we stop the infinite while loop.

Note that since the actions are chosen randomly, when you run the script, you’ll most likely get different results.

### 1.3. Play the Frozen Lake Game Manually
Next, you’ll learn how to manually interact with the Frozen Lake game, so that you have a better understanding of the game environment. This will prepare you to design winning strategies for the Frozen Lake game.

The following lines of code show you how.

In [5]:
import gym

env = gym.make('FrozenLake-v0', is_slippery=False)

print('''
enter 0 for left, 1 for down
2 for right, and 3 for up
''')
# Start a new game and show the board
env.reset()
env.render()

while True:
    try:
        action = int(input('how do you want to move?\n'))
    except ValueError:
        # Ask again instead of stepping with an invalid action
        print('please enter 0, 1, 2, or 3')
        continue
    # Reject integers outside the action space
    if action not in (0, 1, 2, 3):
        print('please enter 0, 1, 2, or 3')
        continue
    new_state, reward, done, _ = env.step(action)
    env.render()
    if done:
        if new_state==15:
            print("Congrats, you have made to the destination!")
        else:
            print("Game over. Better luck next time!")
        break  


enter 0 for left, 1 for down
2 for right, and 3 for up


[41mS[0mFFF
FHFH
FFFH
HFFG
how do you want to move?
1
  (Down)
SFFF
[41mF[0mHFH
FFFH
HFFG
how do you want to move?
1
  (Down)
SFFF
FHFH
[41mF[0mFFH
HFFG
how do you want to move?
2
  (Right)
SFFF
FHFH
F[41mF[0mFH
HFFG
how do you want to move?
1
  (Down)
SFFF
FHFH
FFFH
H[41mF[0mFG
how do you want to move?
2
  (Right)
SFFF
FHFH
FFFH
HF[41mF[0mG
how do you want to move?
2
  (Right)
SFFF
FHFH
FFFH
HFF[41mG[0m
Congrats, you have made to the destination!


I chose the following actions: 1, 1, 2, 1, 2, 2 (meaning down, down, right, down, right, and right, in that order) and successfully reached the destination. This is one of the shortest paths to win the game.
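
As a sanity check on the path length, the following sketch runs a breadth-first search on a plain-Python copy of the lake map. It is an illustration that does not use gym; the names `move`, `shortest_path`, and `replay` are hypothetical helpers.

```python
from collections import deque

LAKE_MAP = ["SFFF", "FHFH", "FFFH", "HFFG"]


def move(state, action):
    """Nonslippery move on the 4x4 grid (0 left, 1 down, 2 right, 3 up)."""
    row, col = divmod(state, 4)
    if action == 0:
        col = max(col - 1, 0)
    elif action == 1:
        row = min(row + 1, 3)
    elif action == 2:
        col = min(col + 1, 3)
    else:
        row = max(row - 1, 0)
    return row * 4 + col


def shortest_path(start=0, goal=15):
    """Breadth-first search over states, never stepping into a hole."""
    queue = deque([(start, [])])
    visited = {start}
    while queue:
        state, path = queue.popleft()
        if state == goal:
            return path
        for action in range(4):
            nxt = move(state, action)
            if nxt in visited or LAKE_MAP[nxt // 4][nxt % 4] == "H":
                continue
            visited.add(nxt)
            queue.append((nxt, path + [action]))
    return None


def replay(path, start=0):
    """Apply a list of actions and return the final state."""
    state = start
    for action in path:
        state = move(state, action)
    return state


path = shortest_path()
print(len(path), path, replay(path))
```

Since the goal is three rows down and three columns to the right of the start, no winning path can be shorter than six moves.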

Now, the question is: can you train your computer to win the game by itself? 

The answer is yes, and you’ll learn how to do that by using the deep learning method via deep neural networks. 

## 2. Deep Learning Game Strategies: Generating Data

In the next few sections, you’ll learn how to use deep neural networks to train intelligent game strategies.

You’ll go through the whole process of training a game strategy, using the Frozen Lake game as an example. You'll apply the same approach to other games later in the book. 

First, you’ll learn how to generate simulated game data for training purposes. Once you have a trained model, you’ll learn how to use the model to design a best-move game strategy and play the game. Finally, you’ll test the effectiveness of the game strategy.

### 2.2. Prepare Data for the Neural Network
How do we use a neural network to train a game strategy in this case? Here is a summary of what we’ll do:
1. We’ll let the player randomly choose actions to complete a game, and record the whole game history. The game history will contain all the intermediate states and actions from the very first move to the very last move.
2. We then associate each state-action pair with a game outcome (win or lose). The state-action pair is similar to X (i.e., image pixels) in our image classification problem, and the outcome is similar to y (i.e., image labels such as horse, deer, airplane and so on) in the image classification problem.
3. We’ll simulate a large number of games, say, 10,000 of them. We'll use the histories of the games and the corresponding outcomes as X and y pairs to feed into a deep neural network model. After training is done, we have a trained model.
4. At each move of the game, we look at all possible next moves, and feed each hypothetical state-action pair into the trained model. The model will tell us the probability of winning the game if that state-action pair were chosen.
5. We select the move with the highest chance of winning based on the model's predictions.
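
The idea behind these steps can be sketched without a neural network. The example below is an illustration only, not the method used in this chapter: it replaces gym with a small plain-Python simulator and replaces the deep neural network with a simple table of win frequencies per state-action pair. The names `move`, `random_game`, and `win_rate`, as well as the 100-step cap, are assumptions made for the sketch.

```python
import random
from collections import defaultdict

LAKE_MAP = ["SFFF", "FHFH", "FFFH", "HFFG"]


def move(state, action):
    """Nonslippery move on the 4x4 grid (0 left, 1 down, 2 right, 3 up)."""
    row, col = divmod(state, 4)
    if action == 0:
        col = max(col - 1, 0)
    elif action == 1:
        row = min(row + 1, 3)
    elif action == 2:
        col = min(col + 1, 3)
    else:
        row = max(row - 1, 0)
    return row * 4 + col


def random_game(rng, max_steps=100):
    """Steps 1 and 2: play randomly, return the (state, action) pairs
    visited and the final outcome (1 for a win, 0 for a loss)."""
    state, pairs = 0, []
    for _ in range(max_steps):
        action = rng.randrange(4)
        pairs.append((state, action))
        state = move(state, action)
        cell = LAKE_MAP[state // 4][state % 4]
        if cell in "HG":
            return pairs, int(cell == "G")
    return pairs, 0  # ran out of steps: counted as a loss


# Step 3: simulate many games and count wins per state-action pair
rng = random.Random(0)
wins, visits = defaultdict(int), defaultdict(int)
for _ in range(5000):
    pairs, outcome = random_game(rng)
    for pair in pairs:
        visits[pair] += 1
        wins[pair] += outcome

# The table plays the role that the trained model plays in the chapter
win_rate = {pair: wins[pair] / visits[pair] for pair in visits}

# Steps 4 and 5: at state 0, compare the four actions and pick the best
best = max(range(4), key=lambda a: win_rate.get((0, a), 0.0))
print(best, [round(win_rate.get((0, a), 0.0), 4) for a in range(4)])
```

A table works here because the game has only 16 states and 4 actions. The neural network approach in this chapter carries over to games where such a table would be too large.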

#### Simulate One Game

First we’ll simulate one game and record the whole game history and the game outcome. The script below accomplishes that:

In [1]:
import gym
import numpy as np

env = gym.make('FrozenLake-v0', is_slippery=False)
# create lists to record game history and outcome 
history = []
winlose = [0]
# start the game
state = env.reset()
while True:
    # randomly choose an action  
    action = env.action_space.sample()
    # make a move
    new_state, reward, done, _ = env.step(action)
    # record game history
    history.append([state, action, new_state, reward, done])
    # prepare for the next round
    state = new_state
    # stop if the game is over
    if done==True:
        # if end up in state 15, change outcome to 1
        if new_state==15:
            winlose[0] = 1
        break  
    
print(history) 
print(winlose)  

[[0, 2, 1, 0.0, False], [1, 1, 5, 0.0, True]]
[0]


In the game above, the player made two moves, and lost the game. So the winlose list has a value of 0 in it (meaning loss). If the player wins, then the value in the list is 1. The history list records all intermediate steps. In each step, we have values of current state, move, next state, reward, and whether the game is over. 

#### Simulate Many Games
Next, we'll simulate 10,000 games and record all the intermediate steps and outcomes. 

In [2]:
import gym
import numpy as np

env = gym.make('FrozenLake-v0', is_slippery=False)
# create lists to record all game histories and outcomes 
histories = []
outcomes = []

# Define one_game() function
def one_game():
    history=[]
    outcome=0
    state = env.reset()
    while True:
        # randomly choose an action  
        action = env.action_space.sample()
        # make a move
        new_state, reward, done, _ = env.step(action)
        # record game history
        history.append([state, action, new_state, reward, done])
        # prepare for the next round
        state = new_state
        # stop if the game is over
        if done==True:
            # Once the game is over, manually add in the four end state
            # and action combinations, to cover all hypothetical situations
            history.append([state, 0, new_state, reward, done])
            history.append([state, 1, new_state, reward, done])
            history.append([state, 2, new_state, reward, done])
            history.append([state, 3, new_state, reward, done])
            # if end up in state 15, change outcome to 1
            if new_state==15:
                outcome = 1
            break  

    return history, outcome

# Play 10,000 games
for i in range(10000):
    history, outcome = one_game()
    # record history and outcome
    histories.append(history)
    outcomes.append(outcome)

Next, we'll save the simulated data on the computer for later use. We can use the ***pickle*** library to do that. The ***pickle*** library is in the Python Standard Library, so it comes with the Python installation on your computer. No installation is needed. 

In [3]:
import pickle
# save the simulation data on your computer
with open('frozen_games.pickle', 'wb') as fp:
    pickle.dump((histories, outcomes), fp)

You can load up the saved simulation data from your computer, and print out the first five games. 

In [5]:
# read the data and print out the first five games
with open('frozen_games.pickle', 'rb') as fp:
    histories, outcomes=pickle.load(fp)
    
from pprint import pprint
pprint(histories[:5])    
pprint(outcomes[:5])  

[[[0, 3, 0, 0.0, False],
  [0, 3, 0, 0.0, False],
  [0, 1, 4, 0.0, False],
  [4, 1, 8, 0.0, False],
  [8, 2, 9, 0.0, False],
  [9, 1, 13, 0.0, False],
  [13, 2, 14, 0.0, False],
  [14, 1, 14, 0.0, False],
  [14, 1, 14, 0.0, False],
  [14, 3, 10, 0.0, False],
  [10, 1, 14, 0.0, False],
  [14, 1, 14, 0.0, False],
  [14, 3, 10, 0.0, False],
  [10, 1, 14, 0.0, False],
  [14, 3, 10, 0.0, False],
  [10, 2, 11, 0.0, True]],
 [[0, 1, 4, 0.0, False],
  [4, 3, 0, 0.0, False],
  [0, 2, 1, 0.0, False],
  [1, 2, 2, 0.0, False],
  [2, 3, 2, 0.0, False],
  [2, 1, 6, 0.0, False],
  [6, 3, 2, 0.0, False],
  [2, 0, 1, 0.0, False],
  [1, 2, 2, 0.0, False],
  [2, 2, 3, 0.0, False],
  [3, 2, 3, 0.0, False],
  [3, 2, 3, 0.0, False],
  [3, 1, 7, 0.0, True]],
 [[0, 3, 0, 0.0, False],
  [0, 1, 4, 0.0, False],
  [4, 1, 8, 0.0, False],
  [8, 1, 12, 0.0, True]],
 [[0, 0, 0, 0.0, False],
  [0, 3, 0, 0.0, False],
  [0, 1, 4, 0.0, False],
  [4, 3, 0, 0.0, False],
  [0, 3, 0, 0.0, False],
  [0, 1, 4, 0.0, False],
  [

Next we'll train the deep neural network using the simulated data.

## 3. Deep Learning Game Strategies: Train the Deep Neural Network

We'll train the deep neural network so that it can learn from the simulated data. To do that, we'll first prepare the data so that we can feed them into the neural network.

### 3.1. Preparing the Data

In the next few sections, you’ll learn how to convert the game history and outcome data into a form that the computer understands before you feed them into the deep neural network.

#### Associate Game Outcome with Each Step
We'll associate each state-action pair with the final game outcome so that the model can estimate the probability of winning for each state-action combination. We'll use the first game above as an example. In the first game, the outcome is 0, meaning the player lost the game. There are 16 steps in game 1, so we'll create 16 values of X and y, as follows. 

In [6]:
game1_history = histories[0]  
game1_outcome = outcomes[0]     
# Print out each state-action pair
for i in game1_history:
    print(i)

[0, 3, 0, 0.0, False]
[0, 3, 0, 0.0, False]
[0, 1, 4, 0.0, False]
[4, 1, 8, 0.0, False]
[8, 2, 9, 0.0, False]
[9, 1, 13, 0.0, False]
[13, 2, 14, 0.0, False]
[14, 1, 14, 0.0, False]
[14, 1, 14, 0.0, False]
[14, 3, 10, 0.0, False]
[10, 1, 14, 0.0, False]
[14, 1, 14, 0.0, False]
[14, 3, 10, 0.0, False]
[10, 1, 14, 0.0, False]
[14, 3, 10, 0.0, False]
[10, 2, 11, 0.0, True]


The output above shows that the player starts at state 0 (top left corner), chooses action 3 (going up), and ends up in the same position (state 0). In the second round, the player starts at state 0, chooses action 3 again, and ends up in state 0 again. In the third round, the player starts at state 0, chooses action 1 (going down), and ends up in state 4, and so on. So we'll record X and y for game 1 as follows:

In [11]:
# Create empty X and y lists
game1_X, game1_y = [], []    
# Create X and y for each game step
for i in game1_history:
    game1_X.append([i[0], i[1]])
    game1_y.append(game1_outcome)
    
# Print out X, y
pprint(game1_X)
pprint(game1_y)

[[0, 3],
 [0, 3],
 [0, 1],
 [4, 1],
 [8, 2],
 [9, 1],
 [13, 2],
 [14, 1],
 [14, 1],
 [14, 3],
 [10, 1],
 [14, 1],
 [14, 3],
 [10, 1],
 [14, 3],
 [10, 2]]
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]


The above output reflects the 16 steps in game 1. Each X is a state-action pair, while y is the ultimate game outcome. 

However, if we feed the data into a neural network as is, the algorithm will mistakenly treat the numbers as magnitudes, as if state 14 were greater than state 13 and action 3 were greater than action 2. To avoid such confusion, we use one-hot encoding to convert each value into a vector of 1s and 0s.

In [36]:
# Define a onehot_encoder() function
def onehot_encoder(value, length):
    onehot=np.zeros((1,length))
    onehot[0,value]=1
    return onehot
# Change both state value and action value into onehot_encoder
game1onehot_X = []
for s, a in game1_X:
    print(s,a)

    onehot_s = onehot_encoder(s, 16)
    onehot_a = onehot_encoder(a, 4)
    sa = np.concatenate([onehot_s, onehot_a], axis=1)
    game1onehot_X.append(sa.reshape(-1,))
    
# Print out new X
pprint(game1onehot_X)


0 3
0 3
0 1
4 1
8 2
9 1
13 2
14 1
14 1
14 3
10 1
14 1
14 3
10 1
14 3
10 2
[array([1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 1.]),
 array([1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 1.]),
 array([1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       1., 0., 0.]),
 array([0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       1., 0., 0.]),
 array([0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 1., 0.]),
 array([0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0.,
       1., 0., 0.]),
 array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0.,
       0., 1., 0.]),
 array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0.,
       1., 0., 0.]),
 array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0.,
       1., 0., 0.]),
 array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0

That's the format we want. We now apply that to the whole dataset and save it in the local folder.

In [37]:
# Create empty X and y lists
X, y = [], []    
# Create X and y for each game step
for gamei, yi in zip(histories, outcomes):
    for step in gamei:
        s, a = step[0], step[1]
        onehot_s = onehot_encoder(s, 16)
        onehot_a = onehot_encoder(a, 4)
        sa = np.concatenate([onehot_s, onehot_a], axis=1)
        X.append(sa.reshape(-1,))
        y.append(yi)

Next, we'll save the dataset on your computer.

In [38]:
# save the simulation data on your computer
with open('gameXy.pickle', 'wb') as fp:
    pickle.dump((X,y),fp)

You can load up the dataset again and print out the first five observations to make sure they are correct.

In [39]:
# read the data and print out the first five observations
with open('gameXy.pickle', 'rb') as fp:
    myX, myy=pickle.load(fp)
from pprint import pprint
pprint(myX[:5])
pprint(myy[:5])

[array([1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 1.]),
 array([1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 1.]),
 array([1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       1., 0., 0.]),
 array([0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       1., 0., 0.]),
 array([0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 1., 0.])]
[0, 0, 0, 0, 0]


To summarize, the code below generates and prepares the data for training in the next section.

In [None]:
import gym
import numpy as np

env = gym.make('FrozenLake-v0', is_slippery=False)
# create lists to record all game histories and outcomes 
histories = []
outcomes = []

# Define one_game() function
def one_game():
    history=[]
    outcome=0
    state = env.reset()
    while True:
        # randomly choose an action  
        action = env.action_space.sample()
        # make a move
        new_state, reward, done, _ = env.step(action)
        # record game history
        history.append([state, action, new_state, reward, done])
        # prepare for the next round
        state = new_state
        # stop if the game is over
        if done==True:
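            # Once the game is over, manually add in the four end state
            # and action combinations, to cover all hypothetical situations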
            history.append([state, 0, new_state, reward, done])
            history.append([state, 1, new_state, reward, done])
            history.append([state, 2, new_state, reward, done])
            history.append([state, 3, new_state, reward, done])
            # if end up in state 15, change outcome to 1
            if new_state==15:
                outcome = 1
            break  

    return history, outcome

# Play 10,000 games
for i in range(10000):
    history, outcome = one_game()
    # record history and outcome
    histories.append(history)
    outcomes.append(outcome)

import pickle
# save the simulation data on your computer
with open('frozen_games.pickle', 'wb') as fp:
    pickle.dump((histories, outcomes), fp)
    
# read the data and print out the first 10 games
with open('frozen_games.pickle', 'rb') as fp:
    histories, outcomes=pickle.load(fp)
    
from pprint import pprint
pprint(histories[:10])    
pprint(outcomes[:10])  

# Define a onehot_encoder() function
def onehot_encoder(value, length):
    onehot=np.zeros((1,length))
    onehot[0,value]=1
    return onehot
# Create empty X and y lists
X, y = [], []    
# Create X and y for each game step
for gamei, yi in zip(histories, outcomes):
    for step in gamei:
        s, a = step[0], step[1]
        onehot_s = onehot_encoder(s, 16)
        onehot_a = onehot_encoder(a, 4)
        sa = np.concatenate([onehot_s, onehot_a], axis=1)
        X.append(sa.reshape(-1,))
        y.append(yi)
        
# save the simulation data on your computer
with open('gameXy.pickle', 'wb') as fp:
    pickle.dump((X,y),fp)
# read the data and print out the first 5 observations
with open('gameXy.pickle', 'rb') as fp:
    myX, myy=pickle.load(fp)
from pprint import pprint
pprint(myX[:5])
pprint(myy[:5])

### 3.2. Train the Deep Neural Network
Now that the dataset is ready, we are ready to train our deep neural network. 

Here we are essentially performing a binary classification. We classify each state-action pair into win or lose. The output layer has one neuron with sigmoid activation. So we can think of the output as the probability of winning the game. 
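
As a reminder of what the output layer and the loss function compute, here is a minimal sketch of the standard sigmoid and binary cross-entropy formulas in plain Python. The function names are illustrative, and the small `eps` clipping is an assumption added only to avoid taking the logarithm of zero.

```python
import math


def sigmoid(z):
    """Squash any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def binary_cross_entropy(y_true, p_pred, eps=1e-7):
    """Loss for one observation: y_true is 0 or 1, p_pred is the
    predicted probability of winning."""
    p = min(max(p_pred, eps), 1.0 - eps)
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1.0 - p))


print(sigmoid(0.0))                    # 0.5
print(binary_cross_entropy(1, 0.9))    # small loss: confident and correct
print(binary_cross_entropy(1, 0.1))    # large loss: confident and wrong
```

The loss is small when the predicted probability agrees with the actual outcome and large when it does not, which is what pushes the network toward sensible win probabilities during training.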

There are four hidden layers in the model, with 128, 64, 32, and 16 neurons, respectively. But we do have a lot of freedom here. Fewer layers with various numbers of neurons will generate similar results. 
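
To get a feel for the size of this network, the sketch below counts its trainable parameters. It assumes fully connected layers with bias terms and a 20-value input (16 values for the one-hot state plus 4 for the one-hot action); the helper name `dense_params` is hypothetical.

```python
# Input size followed by the sizes of the four hidden layers and the output
layer_sizes = [20, 128, 64, 32, 16, 1]


def dense_params(n_in, n_out):
    """A dense layer has one weight per input-output pair plus one bias
    per output neuron."""
    return n_in * n_out + n_out


per_layer = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
total = sum(per_layer)
print(per_layer, total)
```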

Later, we'll use the trained model to play the Frozen Lake game. When playing, at each state, we'll ask the following questions:
1. If I choose action 0 (i.e., move left), what's the probability of winning the game? We'll combine the current state and action 0 and feed this state-action pair to the trained deep neural network and get a probability; let's call it p(win|s,a0).
2. If I choose action 1 (i.e., move down), what's the probability of winning the game? We'll use the trained neural network and get p(win|s,a1).
3. If I choose action 2 (i.e., move right), what's the probability of winning the game? We'll use the trained neural network and get p(win|s,a2).
4. If I choose action 3 (i.e., move up), what's the probability of winning the game? We'll consult the trained neural network and get p(win|s,a3).

We then compare p(win|s,a0), p(win|s,a1), p(win|s,a2), and p(win|s,a3), and pick the action that leads to the highest p(win|s,a). 
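
The comparison in the last step boils down to an argmax over four numbers. Here is a minimal sketch; the probabilities in `example_probs` are made-up values for illustration, not predictions from the trained model, and the function name `best_action` is hypothetical.

```python
ACTION_NAMES = {0: "left", 1: "down", 2: "right", 3: "up"}


def best_action(win_probs):
    """Return the index of the largest probability; on a tie, the
    lowest-numbered action is chosen."""
    return max(range(len(win_probs)), key=lambda a: win_probs[a])


# Hypothetical values of p(win|s,a0), ..., p(win|s,a3) at some state s
example_probs = [0.02, 0.31, 0.12, 0.01]
action = best_action(example_probs)
print(action, ACTION_NAMES[action])
```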

In [None]:
# import needed modules
import pickle
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

# load the data       
with open('gameXy.pickle', 'rb') as fp:
    X, y = pickle.load(fp)

# 20 input features: 16 for the state plus 4 for the action
X = np.array(X).reshape((-1, 20))
y = np.array(y).reshape((-1, 1))

# Create a model with four hidden layers and a sigmoid output
model = Sequential()
model.add(Dense(128, activation='relu', input_shape=(20,)))
model.add(Dense(64, activation='relu'))
model.add(Dense(32, activation='relu'))
model.add(Dense(16, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy',
              optimizer='adam', 
              metrics=['accuracy'])

# Train for 50 epochs, then save the trained model
model.fit(X, y, epochs=50)

model.save('files/ch08/trained_frozen.h5')

Now that the model is trained, we can use it to play the game.

## 4. Play the Game with the Trained Model
To play the game with the trained model, we'll look at the state at each move, and hypothetically take actions 0, 1, 2, and 3. Then we use the trained model to predict the probability of winning with each of the four state-action pairs. We'll pick the action that leads to the highest probability of winning. We repeat at each step until the game ends.

In [5]:
# import needed modules
import numpy as np
import tensorflow as tf

# Load the trained model saved in the previous section
reload = tf.keras.models.load_model("files/ch08/trained_frozen.h5")

# Define a onehot_encoder() function
def onehot_encoder(value, length):
    onehot=np.zeros((1,length))
    onehot[0,value]=1
    return onehot

action0=onehot_encoder(0, 4)
action1=onehot_encoder(1, 4)
action2=onehot_encoder(2, 4)
action3=onehot_encoder(3, 4)

import gym

env = gym.make('FrozenLake-v0', is_slippery=False)
state = env.reset()   

# save the predictions in each step
predictions = []
while True:
    # Convert state and action into onehots 
    state_arr = onehot_encoder(state, 16)
    # Use the trained model to predict the prob of winning 
    sa0 = np.concatenate([state_arr, action0], axis=1)    
    sa1 = np.concatenate([state_arr, action1], axis=1)  
    sa2 = np.concatenate([state_arr, action2], axis=1)  
    sa3 = np.concatenate([state_arr, action3], axis=1)
    sa = np.concatenate([sa0, sa1, sa2, sa3], axis=0)
    prediction = reload.predict(sa)
    action = np.argmax(prediction)
    predictions.append(prediction)
    print(action)
    new_state, reward, done, info = env.step(action)
    env.render()
    print(new_state, reward, done, info) 
    state = new_state
    if done == True:
        break


1
  (Down)
SFFF
[41mF[0mHFH
FFFH
HFFG
4 0.0 False {'prob': 1.0}
1
  (Down)
SFFF
FHFH
[41mF[0mFFH
HFFG
8 0.0 False {'prob': 1.0}
2
  (Right)
SFFF
FHFH
F[41mF[0mFH
HFFG
9 0.0 False {'prob': 1.0}
1
  (Down)
SFFF
FHFH
FFFH
H[41mF[0mFG
13 0.0 False {'prob': 1.0}
2
  (Right)
SFFF
FHFH
FFFH
HF[41mF[0mG
14 0.0 False {'prob': 1.0}
2
  (Right)
SFFF
FHFH
FFFH
HFF[41mG[0m
15 1.0 True {'prob': 1.0}


The player wins the game in six moves, which is the shortest possible path. So the deep learning game strategy works!

Next, we save the predictions in each stage of the game for later use. We'll use these probabilities later when we create animations.

In [6]:
import pickle

with open('files/ch08/frozen_predictions.p', 'wb') as fp:
    pickle.dump(predictions,fp)

## 5. Test the Efficacy of the Game Strategy
Winning one game can be a coincidence. We need a scientific way of testing the efficacy. For that, we'll let the trained model play the game 1000 times, and record how many times the model wins and how many times the model loses.  

In [None]:
# import needed modules
import numpy as np
import tensorflow as tf

# Load the trained model from disk; load_model() restores the architecture
# together with the weights, so the network does not need to be rebuilt here
reload = tf.keras.models.load_model("files/ch08/trained_frozen.h5")

# Define a onehot_encoder() function
def onehot_encoder(value, length):
    onehot=np.zeros((1,length))
    onehot[0,value]=1
    return onehot

action0=onehot_encoder(0, 4)
action1=onehot_encoder(1, 4)
action2=onehot_encoder(2, 4)
action3=onehot_encoder(3, 4)

import gym

env = gym.make('FrozenLake-v0', is_slippery=False)

def test():
    state = env.reset()
    winlose=0
    while True:
        # Convert state and action into onehots 
        state_arr = onehot_encoder(state, 16)
        # Use the trained model to predict the prob of winning 
        sa0 = np.concatenate([state_arr, action0], axis=1)    
        sa1 = np.concatenate([state_arr, action1], axis=1)  
        sa2 = np.concatenate([state_arr, action2], axis=1)  
        sa3 = np.concatenate([state_arr, action3], axis=1)
        sa = np.concatenate([sa0, sa1, sa2, sa3], axis=0)
        # Pick the action with the highest predicted probability of winning
        action = np.argmax(reload.predict(sa, verbose=0))
        new_state, reward, done, info = env.step(action)
        state = new_state
        if done:
            # change winlose to 1 if the last state is 15
            if state==15:
                winlose=1
            break
    return winlose

winloses = []
for i in range(1000):
    winlose = test()
    winloses.append(winlose)

# Print out the number of winning games
print("the number of winning games is", winloses.count(1))

# Print out the number of losing games
print("the number of losing games is", winloses.count(0))

The strategy has won all 1000 games. 
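
The tally can also be reported as a win rate. The sketch below is a minimal illustration and not part of the original notebook: `run_trials` and `summarize` are made-up helper names, and `always_win` is a hypothetical stand-in for `test()` that mirrors the result reported above, so the example runs without the trained model or the Gym environment.

```python
def run_trials(play_one_game, n_games=1000):
    """Call play_one_game() n_games times and collect the 1/0 outcomes."""
    return [play_one_game() for _ in range(n_games)]

def summarize(outcomes):
    """Return (wins, losses, win_rate) for a list of 1/0 game results."""
    wins = sum(1 for outcome in outcomes if outcome == 1)
    losses = len(outcomes) - wins
    return wins, losses, wins / len(outcomes)

# Hypothetical stand-in for test(): it always reports a win
def always_win():
    return 1

wins, losses, rate = summarize(run_trials(always_win, n_games=1000))
print(f"wins={wins}, losses={losses}, win rate={rate:.1%}")
```

To use it with the real strategy, pass `test` in place of `always_win`.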


## 6. Animate the Decision Process
We'll create an animation to show how the agent uses the trained model to decide on the best next move. 

We'll first draw a game board with the current position of the agent. We'll then hypothetically play all four next moves and let the trained DNN model tell us the probability of winning for each action. The agent picks the action with the highest probability of winning, and we highlight that action in the animation at each stage of the game.

In [8]:
import tensorflow as tf
from matplotlib import pyplot as plt
from matplotlib.patches import Rectangle
import pickle
import os

import numpy as np

# Load the predictions saved while the agent played the game
with open('files/ch08/frozen_predictions.p', 'rb') as fp:
    preds = pickle.load(fp)

states = [(-6,2),(-6,0),(-6,-2),(-5,-2),(-5,-4),(-4,-4),(-3,-4)]
actions = [1,1,2,1,2,2]

grid = [["S", "F", "F", "F"],
        ["F", "H", "F", "H"],
        ["F", "F", "F", "H"],
        ["H", "F", "F", "G"]]

hs = [3.1,1.3,-1.4,-3.2]

# Generate pictures for all seven stages (six moves plus the final position)
for stage in range(7):

    fig = plt.figure(figsize=(14,10), dpi=100)
    ax = fig.add_subplot(111) 
    ax.set_xlim(-7, 7)
    ax.set_ylim(-5, 5)
    #plt.grid()
    plt.axis("off")

    # table grid
    for x in range(-6,-1,1):
        ax.plot([x,x],[-4,4],color='gray',linewidth=3)
    for y in range(-4,5,2):
        ax.plot([-6,-2],[y,y],color='gray',linewidth=3)
    for row in range(4):
        for col in range(4):
            plt.text(col-5.8,2.6-2*row,grid[row][col],fontsize=60) 

    # highlight current state
    ax.add_patch(Rectangle(states[stage], 1,2,facecolor='r',alpha=0.8))             
    plt.savefig(f"files/ch08/frozen_stage{stage}step1.png")        
            
    if stage<=5:
        # win probabilities of the four actions at this stage
        ps = preds[stage].reshape(4,)
        # Draw arrows from the game board to the four DNN boxes
        xys = [[(0,-3.2),(-2,0)],                   
               [(0,-1.4),(-2,0)],
               [(0,1.3),(-2,0)],
               [(0,3.1),(-2,0)]]
        for xy in xys:
            ax.annotate("",xy=xy[0],xytext=xy[1],
            arrowprops=dict(arrowstyle = '->', color = 'g', linewidth = 2))  
        # Put explanation texts on the graph
        plt.text(-1.5,1.25,"left",fontsize=20,color='g',rotation=55) 
        plt.text(-1.25,0.25,"down",fontsize=20,color='g',rotation=35) 
        plt.text(-1.25,-0.85,"right",fontsize=20,color='g',rotation=-35)
        plt.text(-1.5,-1.8,"up",fontsize=20,color='g',rotation=-55)     
        # add rectangle to plot
        for i in range(4):
            ax.add_patch(Rectangle((0,-0.6+hs[i]), 2, 1.3,
                         facecolor = 'b',alpha=0.1)) 
            plt.text(0.2,hs[i]-0.5,"Deep\nNeural\nNetwork",fontsize=20) 
            plt.text(2.6, hs[i]-0.15, f"Prob(win)={ps[i]:.4f}", fontsize=25, color="r")  
            ax.annotate("",xy=(2.5,hs[i]),xytext=(2,hs[i]),
            arrowprops=dict(arrowstyle = '->', color = 'g', linewidth = 2))   

        plt.savefig(f"files/ch08/frozen_stage{stage}step2.png") 
        
        # highlight the best action
        ax.add_patch(Rectangle((2.5,hs[actions[stage]]-0.4), 4.25, 1,
                     facecolor = 'b',alpha=0.5))     
        plt.savefig(f"files/ch08/frozen_stage{stage}step3.png")     
    plt.close(fig)
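
The `states` and `actions` lists in the cell above are typed in by hand. As a sketch of where those numbers come from, the block below replays the action sequence on a small stand-in for the non-slippery board (it is not the Gym environment itself) and converts each visited state into the lower-left corner of its cell in the plot's coordinates. The names `replay` and `corner` are made up for this illustration.

```python
GRID = ["SFFF",
        "FHFH",
        "FFFH",
        "HFFG"]

# Frozen Lake action codes: 0 = left, 1 = down, 2 = right, 3 = up
MOVES = {0: (0, -1), 1: (1, 0), 2: (0, 1), 3: (-1, 0)}

def replay(actions, start=0, size=4):
    """Return the states visited when the actions are applied without slipping."""
    row, col = divmod(start, size)
    visited = [start]
    for action in actions:
        d_row, d_col = MOVES[action]
        # Moves that would leave the board keep the agent in place
        row = min(max(row + d_row, 0), size - 1)
        col = min(max(col + d_col, 0), size - 1)
        visited.append(row * size + col)
        if GRID[row][col] in "HG":   # a hole or the goal ends the game
            break
    return visited

def corner(state, size=4):
    """Lower-left corner of a cell that is 1 unit wide and 2 units tall."""
    row, col = divmod(state, size)
    return (col - 6, 2 - 2 * row)

path = replay([1, 1, 2, 1, 2, 2])
corners = [corner(s) for s in path]
print(path)
print(corners)
```

The printed corners match the hand-typed `states` list, and the visited states match the ones printed when the agent played the game.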

If you open the file frozen_stage0step1.png, you'll see the starting position of the agent:
<img src="https://gattonweb.uky.edu/faculty/lium/ml/frozen_stage0step1.png" />
The file frozen_stage0step2.png shows the probabilities of winning the game for the four possible actions: left, down, right, and up.
<img src="https://gattonweb.uky.edu/faculty/lium/ml/frozen_stage0step2.png" />
The file frozen_stage0step3.png highlights the action with the greatest probability of winning the game, which is down. That's the action the agent chooses in the first stage of the game.
<img src="https://gattonweb.uky.edu/faculty/lium/ml/frozen_stage0step3.png" />

We repeat this process in every stage of the game until the game ends.

We can combine the above pictures into an animation.

In [10]:
import imageio
import numpy as np
from PIL import Image

frames=[]
for stage in range(6):
    for step in range(3):
        frame=Image.open(f"files/ch08/frozen_stage{stage}step{step+1}.png")
        frames.append(np.asarray(frame))
# the final stage has only one picture: the agent at the goal
frame=np.asarray(Image.open("files/ch08/frozen_stage6step1.png"))
# put three frames at the end to highlight success
frames.extend([frame, frame, frame])

imageio.mimsave('files/ch08/frozen_stages.gif', frames, fps=2)

The animation looks as follows:
<img src="https://gattonweb.uky.edu/faculty/lium/ml/frozen_stages.gif" />
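
To see how the frames are ordered without loading any images, the sketch below only builds the list of file names. The helper `frame_files` is a made-up name for this illustration; the naming pattern and the counts follow the cells above.

```python
def frame_files(n_stages=6, steps_per_stage=3, final_repeats=3,
                folder="files/ch08"):
    """List the picture files in the order they appear in the animation."""
    names = [f"{folder}/frozen_stage{stage}step{step}.png"
             for stage in range(n_stages)
             for step in range(1, steps_per_stage + 1)]
    # The final position is repeated to hold the last picture on screen longer
    names += [f"{folder}/frozen_stage{n_stages}step1.png"] * final_repeats
    return names

files = frame_files()
fps = 2
print(len(files), "frames,", len(files) / fps, "seconds at", fps, "frames per second")
```

Six stages with three pictures each, plus three copies of the final picture, give 21 frames.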