## CartPole Skating

> **Problem**: If Peter wants to escape from the wolf, he needs to be able to move faster than the wolf. We will see how Peter can learn to skate and, in particular, to keep his balance, using Q-Learning.

First, let's install Gym and import the required libraries:

In [1]:
import sys
!pip install gym 

import gym
import matplotlib.pyplot as plt
import numpy as np
import random



## Create a cartpole environment

In [2]:
env = gym.make("CartPole-v1")
print(env.action_space)
print(env.observation_space)
print(env.action_space.sample())

Discrete(2)
Box([-4.8000002e+00 -3.4028235e+38 -4.1887903e-01 -3.4028235e+38], [4.8000002e+00 3.4028235e+38 4.1887903e-01 3.4028235e+38], (4,), float32)
0


To see how the environment works, let's run a short simulation for 100 steps.

In [3]:
env.reset()

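for i in range(100):
   env.render()
   # step() returns (observation, reward, terminated, truncated, info)
   _, _, terminated, truncated, _ = env.step(env.action_space.sample())
   if terminated or truncated:
      env.reset()  # start a new episode instead of stepping a finished one
env.close()

  "You are calling render method without specifying any render mode. "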


During the simulation, we need observations in order to decide how to act. In fact, the `step` function returns the current observation, the reward, and a `done` flag that indicates whether it makes sense to continue the simulation:

In [4]:
env.reset()

done = False
while not done:
   env.render()
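   # step() returns (observation, reward, terminated, truncated, info)
   obs, rew, terminated, truncated, info = env.step(env.action_space.sample())
   done = terminated or truncated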
   print(f"{obs} -> {rew}")
env.close()

[ 0.01878172  0.15108874 -0.03655479 -0.29923922] -> 1.0
[ 0.02180349 -0.0434936  -0.04253957 -0.0183054 ] -> 1.0
[ 0.02093362 -0.23798048 -0.04290568  0.26065814] -> 1.0
[ 0.01617401 -0.4324645  -0.03769252  0.5395053 ] -> 1.0
[ 0.00752472 -0.23683354 -0.02690241  0.2351883 ] -> 1.0
[ 0.00278805 -0.431561   -0.02219864  0.5192655 ] -> 1.0
[-0.00584317 -0.23613368 -0.01181334  0.21967083] -> 1.0
[-0.01056584 -0.04084487 -0.00741992 -0.07671498] -> 1.0
[-0.01138274 -0.23585968 -0.00895422  0.21361773] -> 1.0
[-0.01609994 -0.04061086 -0.00468186 -0.08187626] -> 1.0
[-0.01691215  0.1545779  -0.00631939 -0.37603265] -> 1.0
[-0.01382059 -0.04045373 -0.01384004 -0.08534893] -> 1.0
[-0.01462967  0.15486385 -0.01554702 -0.3823661 ] -> 1.0
[-0.01153239 -0.04003394 -0.02319434 -0.09462536] -> 1.0
[-0.01233307 -0.23481591 -0.02508685  0.19065048] -> 1.0
[-0.01702939 -0.42957017 -0.02127384  0.47531515] -> 1.0
[-0.02562079 -0.62438536 -0.01176754  0.7612178 ] -> 1.0
[-0.0381085  -0.42910326  0.003

We can also get the minimum and maximum values of those numbers:

In [5]:
print(env.observation_space.low)
print(env.observation_space.high)

[-4.8000002e+00 -3.4028235e+38 -4.1887903e-01 -3.4028235e+38]
[4.8000002e+00 3.4028235e+38 4.1887903e-01 3.4028235e+38]


## State Discretization

In [6]:
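def discretize(x):
    # scale each observation component, then truncate toward zero to get an integer state
    # (np.int was removed in NumPy 1.24; the built-in int works the same way here)
    return tuple((x/np.array([0.25, 0.25, 0.01, 0.1])).astype(int))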
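
To make the effect of `discretize` concrete, here is a minimal pure-Python sketch of the same scale-and-truncate rule. The helper name `discretize_plain` and the sample observation are illustrative assumptions, not part of the notebook; the step sizes are the ones used above.

```python
# Step sizes copied from discretize(); everything else is illustrative.
STEPS = (0.25, 0.25, 0.01, 0.1)

def discretize_plain(obs):
    # int() truncates toward zero, like .astype(int) on a float array
    return tuple(int(v / step) for v, step in zip(obs, STEPS))

# hypothetical observation: (position, velocity, angle, angular velocity)
sample = (0.0188, 0.151, -0.0366, -0.299)
print(discretize_plain(sample))  # (0, 0, -3, -2)
```

Because truncation goes toward zero, every value strictly between `-step` and `step` maps to 0, so the bin around zero is twice as wide as the others.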

Let's also explore another discretization method that uses bins:

In [8]:
def create_bins(i,num):
    return np.arange(num+1)*(i[1]-i[0])/num+i[0]

print("Sample bins for interval (-5,5) with 10 bins\n",create_bins((-5,5),10))

ints = [(-5,5),(-3.5,3.5),(-0.5,0.5),(-2,2)] # intervals of values for each parameter
nbins = [20,20,10,10] # number of bins for each parameter
bins = [create_bins(ints[i],nbins[i]) for i in range(4)]

def discretize_bins(x):
    return tuple(np.digitize(x[i],bins[i]) for i in range(4))

Sample bins for interval (-5,5) with 10 bins
 [-5. -4. -3. -2. -1.  0.  1.  2.  3.  4.  5.]
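
`np.digitize` returns, for each value, the index of the bin it falls into. As a rough sketch of that behavior without NumPy, the standard-library `bisect_right` gives the same index for increasing bin edges. The helper names `make_edges` and `bin_index` are assumptions introduced for illustration.

```python
from bisect import bisect_right

def make_edges(interval, num):
    # num+1 evenly spaced edges from interval[0] to interval[1], like create_bins
    lo, hi = interval
    return [lo + k * (hi - lo) / num for k in range(num + 1)]

def bin_index(x, edges):
    # number of edges that are <= x; matches np.digitize for increasing edges
    return bisect_right(edges, x)

edges = make_edges((-5, 5), 10)
print(bin_index(0.3, edges))   # 6: 0.3 lies between the edges 0 and 1
print(bin_index(-7.0, edges))  # 0: below the lowest edge
print(bin_index(9.0, edges))   # 11: above the highest edge
```

Values outside the interval land in the extra indices 0 and `num+1`, so each parameter can take `num+2` discrete values.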


Let's now run a short simulation and observe those discrete environment values.

In [9]:
env.reset()

done = False
while not done:
   #env.render()
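   # step() returns (observation, reward, terminated, truncated, info)
   obs, rew, terminated, truncated, info = env.step(env.action_space.sample())
   done = terminated or truncated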
   #print(discretize_bins(obs))
   print(discretize(obs))
env.close()

(0, 0, 3, -2)
(0, 1, 2, -5)
(0, 0, 1, -2)
(0, 1, 1, -5)
(0, 2, 0, -8)
(0, 3, -1, -10)
(0, 4, -3, -13)
(0, 3, -6, -11)
(0, 2, -8, -8)
(0, 1, -10, -5)
(0, 2, -11, -8)
(0, 1, -12, -6)
(0, 0, -14, -3)
(0, 0, -14, -1)
(0, 0, -15, -4)
(0, 1, -16, -8)
(0, 0, -17, -5)
(0, 1, -19, -9)
(0, 0, -20, -6)
(0, 1, -22, -10)


## Q-Table Structure

In [13]:
Q = {}
actions = (0,1)

def qvalues(state):
    return [Q.get((state,a),0) for a in actions]
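
A dictionary keyed by `(state, action)` pairs grows only as states are visited, and `Q.get(..., 0)` treats unseen pairs as 0. The sketch below illustrates this on a throwaway table; the names `Q_demo` and `qvalues_demo` and the state tuple are made up for the example.

```python
Q_demo = {}          # hypothetical table, separate from Q above
actions = (0, 1)

def qvalues_demo(state):
    # unseen (state, action) pairs default to 0
    return [Q_demo.get((state, a), 0) for a in actions]

state = (0, 1, -3, -2)        # an example discretized state
print(qvalues_demo(state))    # [0, 0]
Q_demo[(state, 1)] = 0.3      # record a value for action 1 only
print(qvalues_demo(state))    # [0, 0.3]
print(len(Q_demo))            # 1
```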

## Let's Start Q-Learning!

In [14]:
# hyperparameters
alpha = 0.3
gamma = 0.9
epsilon = 0.

def probs(v,eps=1e-4):
    v = v-v.min()+eps
    v = v/v.sum()
    return v
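
`probs` shifts the Q-values so that the smallest becomes `eps`, then normalizes them to sum to 1, so actions with higher Q-values are sampled more often. Here is a small pure-Python sketch of the same arithmetic; `probs_plain` is a hypothetical stand-in for the NumPy version.

```python
def probs_plain(values, eps=1e-4):
    # shift so the minimum becomes eps (keeps every weight positive), then normalize
    lowest = min(values)
    shifted = [v - lowest + eps for v in values]
    total = sum(shifted)
    return [s / total for s in shifted]

print(probs_plain([0, 0]))      # equal Q-values give [0.5, 0.5]
print(probs_plain([1.0, 3.0]))  # the better action gets almost all the weight
```

The weights depend only on the differences between Q-values, so when the gap is much larger than `eps` the lower-valued action is almost never sampled.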

In [None]:


Qmax = 0
cum_rewards = []
rewards = []
for epoch in range(100000):
    obs, _ = env.reset()  # reset() returns (observation, info)
    done = False
    cum_reward=0
    
    # == do the simulation ==
    while not done:
        s = discretize(obs)
        # note: epsilon here is the probability of exploiting, so epsilon = 0 makes every action random
        if random.random()<epsilon:
            # exploitation - choose the action according to Q-Table probabilities
            v = probs(np.array(qvalues(s)))
            a = random.choices(actions,weights=v)[0]
        else:
            # exploration - choose the action at random
            a = np.random.randint(env.action_space.n)

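        # step() returns (observation, reward, terminated, truncated, info)
        obs, rew, terminated, truncated, info = env.step(a)
        done = terminated or truncated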
        cum_reward+=rew
        ns = discretize(obs)
        Q[(s,a)] = (1 - alpha) * Q.get((s,a),0) + alpha * (rew + gamma * max(qvalues(ns)))
    
    cum_rewards.append(cum_reward)
    rewards.append(cum_reward)
    
    # == Periodically print results and calculate average reward ==
    if epoch%5000==0:
        print(f"{epoch}: {np.average(cum_rewards)}, alpha={alpha}, epsilon={epsilon}")
        if np.average(cum_rewards) > Qmax:
            Qmax = np.average(cum_rewards)
            Qbest = Q.copy()  # copy, so later updates to Q do not change the saved table
        cum_rewards=[]
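
The update inside the loop is the Q-learning rule: the new value is a blend of the old estimate and the target `rew + gamma * max(qvalues(ns))`, weighted by `alpha`. The following sketch applies that rule to hand-picked numbers; `q_update` is an illustrative helper, and the inputs are assumptions chosen to make the arithmetic easy to follow.

```python
def q_update(old_q, reward, best_next_q, alpha=0.3, gamma=0.9):
    # blend the old estimate with the bootstrapped target
    target = reward + gamma * best_next_q
    return (1 - alpha) * old_q + alpha * target

# first visit: nothing is known yet, so only the reward contributes
q1 = q_update(old_q=0.0, reward=1.0, best_next_q=0.0)
# a later visit, assuming the next state's best value is also q1
q2 = q_update(old_q=q1, reward=1.0, best_next_q=q1)
print(round(q1, 3), round(q2, 3))  # 0.3 0.591
```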

## Plotting Training Progress

In [None]:
plt.plot(rewards)

This graph tells us very little: because the training process is stochastic, the length of training sessions varies greatly. To make more sense of it, we can calculate a **running average** over a series of experiments, say 100. This can be done conveniently using `np.convolve`:

In [None]:
def running_average(x,window):
    return np.convolve(x,np.ones(window)/window,mode='valid')

plt.plot(running_average(rewards,100))
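
With a uniform kernel and `mode='valid'`, `np.convolve` produces the mean of every full window of `window` consecutive values. A pure-Python sketch of the same computation, under the hypothetical name `running_average_plain`, makes this explicit:

```python
def running_average_plain(xs, window):
    # one output per full window, so the result has len(xs) - window + 1 entries
    return [sum(xs[i:i + window]) / window for i in range(len(xs) - window + 1)]

print(running_average_plain([1, 2, 3, 4, 5], 3))  # [2.0, 3.0, 4.0]
```

The smoothed curve is therefore `window - 1` points shorter than the raw one.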

## Varying Hyperparameters and Seeing the Result in Action

Now it would be interesting to see how the trained model actually behaves. Let's run the simulation, following the same action selection strategy as during training: sampling according to the probability distribution in the Q-Table:

In [None]:
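obs, _ = env.reset()  # reset() returns (observation, info)
done = False
while not done:
   s = discretize(obs)
   env.render()
   v = probs(np.array(qvalues(s)))
   a = random.choices(actions,weights=v)[0]
   # step() returns (observation, reward, terminated, truncated, info)
   obs, _, terminated, truncated, _ = env.step(a)
   done = terminated or truncated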
env.close()


## Saving result to an animated GIF

If you want to impress your friends, you may want to send them an animated GIF of the balancing pole. To do this, we can invoke `env.render` to produce an image frame, and then save the frames to an animated GIF using the PIL library:

In [15]:
from PIL import Image
env = gym.make("CartPole-v1", render_mode='rgb_array')
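obs, _ = env.reset()  # reset() returns (observation, info)
done = False
i=0
ims = []
while not done:
   s = discretize(obs)
   img=env.render()
   ims.append(Image.fromarray(img))
   # Qbest is the best Q-Table saved by the training loop, which must be run first
   v = probs(np.array([Qbest.get((s,a),0) for a in actions]))
   a = random.choices(actions,weights=v)[0]
   # step() returns (observation, reward, terminated, truncated, info)
   obs, _, terminated, truncated, _ = env.step(a)
   done = terminated or truncated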
   i+=1
env.close()
ims[0].save('images/cartpole-balance.gif',save_all=True,append_images=ims[1::2],loop=0,duration=5)
print(i)

NameError: name 'Qbest' is not defined