# Chapter 14: Q-Learning with Continuous States

The Frozen Lake game we solved with Q-learning has 16 different states and 4 possible actions in each state. Therefore, it is easy to create a Q-table with $16\times4=64$ values. 

However, in many real-world problems, the number of state-action combinations is either infinite or very large. Therefore, it is infeasible to build a Q-table. 

To solve this problem, we can approximate the infinite state space with a finite number of discrete states. The number of discrete states cannot be too small, or else the true state is represented too coarsely and Q-learning fails. It cannot be too large either, or else training the Q values becomes prohibitively costly and time-consuming. 

At the end of this chapter, you'll be able to create an animation to compare the mountain car game before and after the Q-learning as follows: 
<img src="https://gattonweb.uky.edu/faculty/lium/ml/mountain_car_compares.gif"/>

On the left, you can see that without Q-learning, the mountain car stays in the valley most of the time without much movement. After Q-learning, the mountain car reaches the mountain top in every episode. This shows how powerful Q-learning is.

Specifically, in this chapter you’ll learn how to create a finite state space for Q-learning in the context of the Mountain Car game in the OpenAI Gym library. The state variables are 

* the position of the car, which is a continuous variable with values between -1.2 and 0.6;
* the speed of the car, which is a continuous variable with values from -0.07 to 0.07. 

We'll use 190 discrete values of the position and 150 discrete values of the speed to form $190\times150=28500$ different states. After 100,000 episodes of training, the model reaches the mountain top in every one of the 10 test episodes. 
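To make the size trade-off concrete, the sketch below counts Q-table entries for three grid resolutions. The helper name `q_table_entries` and the coarse and fine grids are illustrative assumptions; only the 190 by 150 grid comes from this chapter, and the default of 3 actions matches the action space described in Section 1.1.

```python
def q_table_entries(n_positions, n_speeds, n_actions=3):
    """Number of Q values in a table with one entry per (position, speed, action)."""
    return n_positions * n_speeds * n_actions

# Hypothetical coarse and fine grids next to the grid used in this chapter
grids = [("coarse", 19, 15), ("chapter", 190, 150), ("fine", 1900, 1500)]
for label, n_pos, n_spd in grids:
    print(label, n_pos * n_spd, "states,", q_table_entries(n_pos, n_spd), "Q values")
```

Making the grid ten times finer in each dimension multiplies the number of Q values to train by one hundred, which is why the resolution has to be chosen with care.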

***
$\mathbf{\text{Create a subfolder for files in Chapter 14}}$<br>
***
We'll put all files in Chapter 14 in a subfolder /files/ch14. The code in the cell below will create the subfolder.

***

In [2]:
import os

os.makedirs("files/ch14", exist_ok=True)

## 1. Get Started with the Mountain Car Environment
You’ll first learn how to control the Mountain Car game in the OpenAI Gym environment. You’ll learn the parameter values and how to interact with the environment, so that later you can train the model to learn the Q-table.

### 1.1. The Mountain Car Game
If you go to the Mountain Car game site via the link https://gym.openai.com/envs/MountainCar-v0/, you’ll see a picture similar to the following:
<img src="https://gattonweb.uky.edu/faculty/lium/ml/mountain_car.png" />

The agent tries to drive the car up to the mountain top on the right. You can see a flag at the mountain top in the picture, and you win the game if you reach it within 200 attempts (time steps). 

The car starts at the bottom of the valley, and you need to swing the car back and forth to build up enough momentum in order to reach the goal. 
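To see why momentum matters, here is a minimal sketch that does not use Gym at all. It assumes the update equations and constants of the classic Mountain Car implementation as we recall them (`FORCE`, `GRAVITY`, and the bounds below), so check them against your installed version. The actions are encoded as 0 (push left), 1 (no push), and 2 (push right), and `momentum_policy` is a hypothetical hand-written rule, not the Q-learning agent trained later in this chapter.

```python
import math

# Assumed constants, recalled from the classic Mountain Car dynamics
FORCE, GRAVITY = 0.001, 0.0025
MIN_POS, MAX_POS, MAX_SPEED, GOAL = -1.2, 0.6, 0.07, 0.5

def step(position, velocity, action):
    """Advance the assumed dynamics by one time step."""
    velocity += (action - 1) * FORCE - GRAVITY * math.cos(3 * position)
    velocity = max(-MAX_SPEED, min(MAX_SPEED, velocity))
    position = max(MIN_POS, min(MAX_POS, position + velocity))
    if position == MIN_POS and velocity < 0:
        velocity = 0.0  # the car stops at the left wall
    return position, velocity

def momentum_policy(velocity):
    """Hypothetical heuristic: keep pushing in the direction the car is moving."""
    return 2 if velocity >= 0 else 0

def run_episode(position=-0.5, velocity=0.0, max_steps=200):
    """Return the number of steps needed to reach the flag, or None."""
    for t in range(1, max_steps + 1):
        position, velocity = step(position, velocity, momentum_policy(velocity))
        if position >= GOAL:
            return t
    return None

print(run_episode())
```

Under these assumed dynamics, pushing right all the time is not enough, because gravity on the slope outweighs the engine. Pushing with the current direction of travel adds energy on every swing until the car clears the hill.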

We’ll write a Python script to access the OpenAI Gym environment and learn its features. 

In [2]:
import gym
env = gym.make('MountainCar-v0')

#obs = env.reset()
# check the action space
number_actions = env.action_space.n
print("the number of possible actions are", number_actions)
# sample the action space ten times
print("the following are ten sample actions")
for i in range(10):
   print(env.action_space.sample())
# check the shape of the observation space
print("the shape of the observation space is", env.observation_space.shape)   



the number of possible actions are 3
the following are ten sample actions
0
1
2
2
2
0
0
2
2
0
the shape of the observation space is (2,)


There are 3 possible actions: 0, 1, and 2, with the following meanings:
* 0: move to the left
* 1: no movement
* 2: move to the right

There are two state variables: 

* the position of the car, which is a continuous variable with values between -1.2 and 0.6;
* the speed of the car, which is a continuous variable with values from -0.07 to 0.07. A negative value means the car is moving to the left, while a positive value means the car is moving to the right. 

To win the game, you need to reach the top of the mountain: that is, the position of the car needs to be greater than or equal to 0.5. 
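As a small convenience, the action codes and the winning condition above can be captured in a lookup table and a helper. The names `ACTION_NAMES`, `GOAL_POSITION`, and `describe` are hypothetical and are not part of the Gym API.

```python
# Action codes and goal position as described in the text above
ACTION_NAMES = {0: "move to the left", 1: "no movement", 2: "move to the right"}
GOAL_POSITION = 0.5

def describe(action, position):
    """Hypothetical helper: summarize one action and whether the car has won."""
    status = "goal reached" if position >= GOAL_POSITION else "still climbing"
    return f"action {action} ({ACTION_NAMES[action]}), position {position:.2f}: {status}"

print(describe(2, -0.42))
print(describe(1, 0.51))
```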

Below, you'll try to reach the mountain top in 200 attempts by randomly selecting actions. The graphical rendering will show you what the game looks like.

In [3]:
import gym
import time

env = gym.make('MountainCar-v0')

obs = env.reset()
for i in range(200):
    action = env.action_space.sample()
    obs, reward, done, info = env.step(action)
    env.render()
time.sleep(5)
env.close()    

This is a difficult game. With random actions, the car stays at the bottom of the valley without much movement, let alone reaching the mountain top.

Next, you'll print out the state variables of the game, and learn how to convert them into discrete numbers.

### 1.2. Convert a Continuous State into Discrete Values
Let's first have a look at the state values. 

In [2]:
import gym
import time

env = gym.make('MountainCar-v0')

obs = env.reset()
# print out the state
print(obs)
action = env.action_space.sample()
# print the new state
obs, reward, done, info = env.step(action)
print(obs)  

[-0.41920721  0.        ]
[-0.41997741 -0.0007702 ]


Next, we define a variable *state*, which has discrete values. We multiply the first element in the observation, the car position, by 100, and take the integer part. The result is roughly between -120 and 60. We add 125 to it so that it is never negative and can be used as an index in the Q-table.

We then multiply the second element in the observation, the car speed, by 1000, and take the integer part as well. The result is roughly between -70 and 70. We add 75 to it so that it is never negative and can be used as an index in the Q-table.

The code below prints out ten samples of the variable *state*.

In [4]:
def obs_to_state(obs):
    state = [int(obs[0]*100)+125, int(obs[1]*1000)+75]
    return state

for i in range(10):
    action = env.action_space.sample()
    obs, reward, done, info = env.step(action)
    state = obs_to_state(obs)
    print(state)

[79, 70]
[78, 69]
[78, 68]
[77, 67]
[76, 68]
[75, 67]
[74, 66]
[73, 67]
[72, 66]
[72, 67]


The variable *state*, which is a list, now has two integers in it. 

Since the car position is between -1.2 and 0.6, and the speed of the car is between -0.07 and 0.07, *state* has roughly a maximum of $180\times140=25200$ values. We create a Q-table with dimensions 190 by 150 by 3. The first dimension of the Q-table corresponds to the 180 or so discrete car positions; the second dimension corresponds to the 140 or so discrete car speeds; the third dimension corresponds to the three possible actions. We use 190 and 150 instead of 180 and 140 to leave a margin of safety against a possible IndexError. 
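A quick way to check the margin of safety is to feed the extreme observations through `obs_to_state` and confirm that every index fits inside the table. The sketch assumes the speed is limited to the range from -0.07 to 0.07, which is consistent with the multiplier of 1000 in `obs_to_state`; the name `Q_SHAPE` and the list of corner cases are ours.

```python
def obs_to_state(obs):
    # Same conversion as in the chapter
    return [int(obs[0] * 100) + 125, int(obs[1] * 1000) + 75]

Q_SHAPE = (190, 150, 3)

# Corner cases of the assumed bounds: position in [-1.2, 0.6], speed in [-0.07, 0.07]
corners = [(-1.2, -0.07), (-1.2, 0.07), (0.6, -0.07), (0.6, 0.07)]
for obs in corners:
    pos_idx, spd_idx = obs_to_state(obs)
    assert 0 <= pos_idx < Q_SHAPE[0] and 0 <= spd_idx < Q_SHAPE[1]
    print(obs, "->", [pos_idx, spd_idx])

# int() truncates toward zero, so speeds just below and just above zero share one index
print(obs_to_state((-0.5, -0.0004)), obs_to_state((-0.5, 0.0004)))
```

One design detail worth knowing: because `int()` truncates toward zero rather than rounding down, the bin around zero is twice as wide as the others. Using `math.floor` instead would give bins of equal width, but that would be a change to the chapter's conversion.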

### 1.3. Understand the Reward Structure of the Game
Next, we'll play a game until it's finished, printing the reward in every step and whether the episode is considered finished.

In [2]:
import gym
import time

env = gym.make('MountainCar-v0')

def obs_to_state(obs):
    state = [int(obs[0]*100)+125, int(obs[1]*1000)+75]
    return state

obs = env.reset()
# count the time steps within this single episode
step = 0
while True:
    step += 1 
    action = env.action_space.sample()
    obs, reward, done, info = env.step(action)
    state = obs_to_state(obs)
    print(step, state, reward, done, info)
    env.render()
    # play till a full episode is finished
    if done:
        break
time.sleep(5)
env.close()  

1 [70, 75] -1.0 False {}
2 [70, 75] -1.0 False {}
3 [70, 76] -1.0 False {}
4 [71, 77] -1.0 False {}
5 [71, 79] -1.0 False {}
6 [72, 80] -1.0 False {}
7 [72, 81] -1.0 False {}
8 [73, 82] -1.0 False {}
9 [74, 82] -1.0 False {}
10 [75, 83] -1.0 False {}
11 [76, 84] -1.0 False {}
12 [76, 84] -1.0 False {}
13 [77, 83] -1.0 False {}
14 [78, 82] -1.0 False {}
15 [79, 82] -1.0 False {}
16 [79, 80] -1.0 False {}
17 [80, 79] -1.0 False {}
18 [80, 78] -1.0 False {}
19 [80, 77] -1.0 False {}
20 [80, 75] -1.0 False {}
21 [81, 76] -1.0 False {}
22 [81, 75] -1.0 False {}
23 [81, 75] -1.0 False {}
24 [81, 75] -1.0 False {}
25 [80, 73] -1.0 False {}
26 [80, 72] -1.0 False {}
27 [80, 71] -1.0 False {}
28 [79, 72] -1.0 False {}
29 [79, 71] -1.0 False {}
30 [78, 70] -1.0 False {}
31 [78, 71] -1.0 False {}
32 [77, 71] -1.0 False {}
33 [77, 72] -1.0 False {}
34 [77, 72] -1.0 False {}
35 [76, 72] -1.0 False {}
36 [76, 71] -1.0 False {}
37 [76, 72] -1.0 False {}
38 [75, 73] -1.0 False {}
39 [75, 72] -1.0 Fals

Each episode has a maximum of 200 steps. The episode is considered finished when the mountain car reaches the mountain top or when the number of steps has reached 200, whichever comes first. 

In each step, the reward is -1, including the step in which the mountain car reaches the mountain top. The total reward is therefore higher the sooner the car reaches the top. 
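The effect of a per-step penalty on the total reward can be sketched in a few lines. The helper `discounted_return` is a hypothetical name, and the default discount of 0.99 matches the `discount_rate` used in the training code later in this chapter.

```python
def discounted_return(rewards, discount=0.99):
    """Sum of rewards, each weighted by discount**t for step t."""
    return sum(r * discount**t for t, r in enumerate(rewards))

# An episode that collects a reward of -1.0 on each of its steps
for n_steps in (100, 200):
    rewards = [-1.0] * n_steps
    print(n_steps, "steps:", sum(rewards), round(discounted_return(rewards), 2))
```

Whether the rewards are discounted or not, the shorter episode earns the higher total, so an agent that maximizes its return learns to reach the flag quickly.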

## 2. Train the Q-Table for the Mountain Car Game

We now train the Q-table for the mountain car game. 

We first populate the 190 by 150 by 3 Q table with zeros. In each step, unless the mountain car reaches the top, we use Q-learning to update the Q values as follows:

 $$ \text{New } Q(s,a) = \text{learning rate} \times \left[\text{Reward} + \text{discount factor} \times \max_{a'} Q(s', a')\right] + (1-\text{learning rate}) \times \text{Old } Q(s, a) $$
 
If the mountain car reaches the top in the step, we update the Q value as follows.

 $$ \text{New } Q(s,a) = \text{Reward} $$

After many rounds of trial and error, the updates become minimal, which means the Q values have converged. 
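A worked example of the non-terminal update may help. The sketch below applies the rule to a single table entry, using the learning rate of 0.2 and the discount rate of 0.99 from the training code in this section. The helper name `q_update` is ours, and the value of the next state is held at zero purely for illustration.

```python
learning_rate = 0.2
discount_rate = 0.99

def q_update(old_q, reward, max_next_q):
    """One application of the non-terminal update rule above."""
    target = reward + discount_rate * max_next_q
    return learning_rate * target + (1 - learning_rate) * old_q

# Start from a zero-initialized entry and repeat the same transition three times
q = 0.0
for i in range(3):
    q = q_update(q, reward=-1.0, max_next_q=0.0)
    print(i + 1, round(q, 4))
```

The entry moves from 0 to -0.2, then -0.36, then -0.488. Each update closes one fifth of the remaining gap to the target of -1, which is the sense in which the changes shrink as the values converge.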

Here is the training process:

In [None]:
import gym, time
import numpy as np
env = gym.make('MountainCar-v0')
#env.reset()

learning_rate=0.2
discount_rate=0.99

max_exp=0.9
min_exp=0.01

max_episode=100000
max_steps=200
Q = np.zeros((190, 150, 3))

def obs_to_state(obs):
    state = [int(obs[0]*100)+125, int(obs[1]*1000)+75]
    return state

outcome=[]
def update_Q(episode):
    # Initial state is the starting position
    
    obs=env.reset()      
    # Play a full game till it ends
    for _ in range(max_steps):
        state = obs_to_state(obs)
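        # Select the best action or a random action. The threshold on the right,
        # which is the chance of a random action, grows linearly with the episode
        # number from min_exp toward max_exp
        if np.random.uniform(0,1,1)>min_exp+(max_exp-min_exp)*episode/max_episode: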
            action = np.argmax(Q[state[0], state[1], :])
        else:
            action = env.action_space.sample()
        # Use the selected action to make the move
        new_obs, reward, done, _ = env.step(action)
        new_state = obs_to_state(new_obs)
        # Update Q values
        if done==True:
            if new_obs[0]>=0.5:
                Q[state[0], state[1], action] = reward
                outcome.append(1)
                break    
            else:
                outcome.append(0)
                break      
        else:
            Q[state[0], state[1], action] = learning_rate*\
            (reward+discount_rate*np.max(Q[new_state[0], new_state[1], :]))\
            + (1-learning_rate)*Q[state[0], state[1], action]
            obs=new_obs 
            continue
   
for episode in range(max_episode):
    update_Q(episode)
    if episode%1000 == 0:
        print("this is episode", episode)
    
print(Q)   
import pickle
with open('files/ch14/mountain_car_Qs.pickle', 'wb') as fp:
    pickle.dump((Q, outcome),fp)
# Read the data back and print the Q-table shape and the number of recorded outcomes
with open('files/ch14/mountain_car_Qs.pickle', 'rb') as fp:
    (myQ, myoutcome)=pickle.load(fp)
print(myQ.shape, len(myoutcome))

The training takes about 30 minutes. The exact amount of time depends on the speed of your computer. The model is saved as a pickle file on your computer for later use. I put it in this GitHub repository as well, so use the file files/ch14/mountain_car_Qs.pickle if you have trouble training the Q-table yourself.
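It is worth looking at how the training code above balances greedy and random actions. Reading the condition in `update_Q` literally, the agent takes the greedy action when a uniform draw exceeds a threshold and a random action otherwise. The sketch below evaluates that threshold at a few episode numbers; the function name `random_action_probability` is ours, while the constants are copied from the training code.

```python
min_exp, max_exp, max_episode = 0.01, 0.9, 100000

def random_action_probability(episode):
    """Threshold in update_Q: a random action is taken when the draw does not exceed it."""
    return min_exp + (max_exp - min_exp) * episode / max_episode

for episode in (0, 50000, 99999):
    print(episode, round(random_action_probability(episode), 3))
```

With these settings, the share of random actions starts at 1% and rises toward 90% by the end of training. Early exploration still happens because the Q-table starts at zero while every reward is negative, so untried actions look better than tried ones.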

## 3. Test the Trained Q-Values
Now, you can test if the trained Q-table works. 

In [3]:
import gym, time
import numpy as np
env = gym.make('MountainCar-v0')

import pickle
# Read the trained Q-table and the training outcomes
with open('files/ch14/mountain_car_Qs.pickle', 'rb') as fp:
    (Q, outcome)=pickle.load(fp)
print(Q.shape)
def obs_to_state(obs):
    state = [int(obs[0]*100)+125, int(obs[1]*1000)+75]
    return state

num_success=0
def test_Q():
    global num_success
    obs=env.reset()   
    steps = 0
    # Play a full game till it ends
    while True:
        steps += 1
        state = obs_to_state(obs)
        # Always select the action with the highest Q value
        action = np.argmax(Q[state[0], state[1], :])
        # Use the selected action to make the move
        new_obs, reward, done, _ = env.step(action)
        obs=new_obs
        env.render()
        # Check whether the episode has ended
        if done==True:
            if new_obs[0]>=0.5:
                print(f"congrats, you made it in {steps} attempts")
                num_success += 1
            else:
                print("sorry, better luck next time")
            env.close()
            break    
  
for _ in range(10):
    test_Q()

print("success rate", num_success/10)

(190, 150, 3)
congrats, you made it in 86 attempts
congrats, you made it in 155 attempts
congrats, you made it in 112 attempts
congrats, you made it in 83 attempts
congrats, you made it in 88 attempts
congrats, you made it in 92 attempts
congrats, you made it in 155 attempts
congrats, you made it in 114 attempts
congrats, you made it in 155 attempts
congrats, you made it in 162 attempts
success rate 1.0


We tested the game for 10 episodes, and the mountain car reached the mountain top in every episode. 
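The pickle file also stores the `outcome` list, which records a 1 for every training episode that reached the top and a 0 otherwise. A minimal sketch for summarizing such a list follows. The helper `success_rate` is a hypothetical name, and the toy list is made up for illustration; it is not taken from an actual training run.

```python
def success_rate(outcome, last_n=None):
    """Share of episodes that reached the top; last_n restricts it to the most recent episodes."""
    recent = outcome[-last_n:] if last_n else outcome
    return sum(recent) / len(recent) if recent else 0.0

# Made-up outcomes for illustration only; load the real list from the pickle file
toy_outcome = [0, 0, 0, 1, 0, 1, 1, 1, 1, 1]
print(success_rate(toy_outcome))
print(success_rate(toy_outcome, last_n=5))
```

Comparing the overall rate with the rate over the most recent episodes gives a rough picture of how the agent improved during training.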

## 4. Animate the Game Before and After the Q-Learning
We'll first animate the mountain car game before Q-learning. You'll see that the mountain car stays in the valley without much movement. After Q-learning, the mountain car makes it to the top in every episode. We'll put the animations before and after Q-learning side by side to show what difference Q-learning makes. 

First, let's animate the game before Q-learning.

In [None]:
import gym
from gym import wrappers

env = gym.make('MountainCar-v0')

frames=[]
for _ in range(5):
    env.reset()
    frames.append(env.render(mode='rgb_array'))
    while True:      
        action = env.action_space.sample()
        observation, reward, done, info = env.step(action)
        frames.append(env.render(mode='rgb_array'))
        if done:
            break
env.close()

import imageio
import numpy as np

# Now double the size.
frames4=[]
for frame in frames:
    frame4 = frame.repeat(2, axis=0).repeat(2, axis=1)
    frames4.append(frame4)
imageio.mimsave('files/ch14/mountain_car_beforeQ.gif', frames4, fps=24)

Here we double the height and width of each frame, so each frame has four times as many pixels as before, using the *repeat()* method in the ***numpy*** library. 
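To see what the two chained *repeat()* calls do, it helps to try them on a tiny array. In the sketch below, a 2 by 2 array of made-up values stands in for a rendered frame.

```python
import numpy as np

# A tiny 2x2 single-channel "frame" stands in for a rendered image
frame = np.array([[1, 2],
                  [3, 4]])
# Repeat every row twice, then every column twice
doubled = frame.repeat(2, axis=0).repeat(2, axis=1)
print(doubled)
print(frame.shape, "->", doubled.shape)
```

Each original pixel becomes a 2 by 2 block of identical pixels, so the image is enlarged without any smoothing or interpolation.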

If you open the gif file in your local folder, you should see an animation as follows:
<img src="https://gattonweb.uky.edu/faculty/lium/ml/mountain_car_beforeQ.gif" />

The above animation shows that the mountain car stays in the valley most of the time without much movement. 

Now, let's create an animation of the mountain car with the help of Q-learning. 

In [1]:
import gym
from gym import wrappers
import imageio
import numpy as np
import pickle

# Read the trained Q-table and the training outcomes
with open('files/ch14/mountain_car_Qs.pickle', 'rb') as fp:
    (Q, outcome)=pickle.load(fp)
print(Q.shape)
def obs_to_state(obs):
    state = [int(obs[0]*100)+125, int(obs[1]*1000)+75]
    return state

env = gym.make('MountainCar-v0')

Q_frames=[]
for _ in range(10):
    obs=env.reset() 
    Q_frames.append(env.render(mode='rgb_array'))
    while True:    
        state = obs_to_state(obs)
        # Select the action with the highest Q value
        action = np.argmax(Q[state[0], state[1], :])        
        # Use the selected action to make the move (one step per frame)
        new_obs, reward, done, _ = env.step(action)
        Q_frames.append(env.render(mode='rgb_array'))
        obs=new_obs
        if done:
            break
env.close()

# Now double the size.
Q_frames4=[]
for Q_frame in Q_frames:
    Q_frame4 = Q_frame.repeat(2, axis=0).repeat(2, axis=1)
    Q_frames4.append(Q_frame4)
imageio.mimsave('files/ch14/mountain_car_withQ.gif', Q_frames4, fps=24)

If you open the file mountain_car_withQ.gif, you'll see the following: 
<img src="https://gattonweb.uky.edu/faculty/lium/ml/mountain_car_withQ.gif"/>

The mountain car can now reach the mountain top in every episode without much effort.

Next, we'll put the frames before Q-learning and after Q-learning side by side. You can see the comparison in the same animation.

In [2]:

frames = []
num_frames = min(len(frames4), len(Q_frames4))
for i in range(num_frames):
    frame4 = frames4[i]
    Q_frame4 = Q_frames4[i]
    frame = np.concatenate([frame4, Q_frame4], axis=1)
    frames.append(frame)
imageio.mimsave('files/ch14/mountain_car_compares.gif', frames, fps=24)



If you open the file mountain_car_compares.gif, you'll see the following: 
<img src="https://gattonweb.uky.edu/faculty/lium/ml/mountain_car_compares.gif"/>

Now you can compare the mountain car game with and without Q-learning. This shows how powerful Q-learning is.
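As a closing note on the side-by-side step, the sketch below shows what `np.concatenate` with `axis=1` does to the frame shape. The two placeholder frames and their sizes are arbitrary choices for illustration, not the real rendered frames.

```python
import numpy as np

# Placeholder frames; the sizes are arbitrary and chosen only for illustration
left = np.zeros((4, 6, 3), dtype=np.uint8)        # a black frame
right = np.full((4, 6, 3), 255, dtype=np.uint8)   # a white frame
# Joining along axis=1 keeps the height and adds up the widths
side_by_side = np.concatenate([left, right], axis=1)
print(side_by_side.shape)
```

Both frames must have the same height and the same number of color channels. The combined animation also uses only as many frames as the shorter of the two recordings, which is why the code takes the minimum of the two list lengths.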