# Train Mountain Car

[OpenAI Gym](http://gym.openai.com) is designed so that all environments provide the same API, i.e. the same methods `reset`, `step` and `render`, and the same abstractions of **action space** and **observation space**. Thus it should be possible to adapt the same reinforcement learning algorithms to different environments with minimal code changes.
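
As a rough illustration of that shared interface, the sketch below runs one episode against any object that exposes `reset` and `step`. The `CountdownEnv` class and the `run_episode` helper are hypothetical stand-ins written for this example, not part of Gym, and the five values returned by `step` assume the newer Gym API (0.26 and later); older versions return four.

```python
import random

class CountdownEnv:
    """Hypothetical toy environment that mimics the Gym-style API (not part of Gym)."""
    def __init__(self, length=5):
        self.length = length
        self.t = 0

    def reset(self):
        self.t = 0
        return self.t, {}  # (observation, info)

    def step(self, action):
        self.t += 1
        terminated = self.t >= self.length
        # observation, reward, terminated, truncated, info
        return self.t, -1.0, terminated, False, {}

def run_episode(env, policy):
    """Run one episode with any environment exposing reset() and step()."""
    obs, _ = env.reset()
    total, done = 0.0, False
    while not done:
        obs, reward, terminated, truncated, _ = env.step(policy(obs))
        total += reward
        done = terminated or truncated
    return total

print(run_episode(CountdownEnv(), lambda obs: random.choice([0, 1, 2])))  # prints -5.0
```

Because `run_episode` only relies on `reset` and `step`, the same function should work unchanged when a real environment is passed in, provided it follows the same return convention.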

## A Mountain Car Environment

The [Mountain Car environment](https://gym.openai.com/envs/MountainCar-v0/) contains a car stuck in a valley.

The goal is to get out of the valley and capture the flag by taking one of the following actions at each step:

| Value | Meaning |
|---|---|
| 0 | Accelerate to the left |
| 1 | Do not accelerate |
| 2 | Accelerate to the right |

The main trick of this problem is, however, that the car's engine is not strong enough to scale the mountain in a single pass. Therefore, the only way to succeed is to drive back and forth to build up momentum.
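
The momentum argument can be checked with a small stand-alone simulation. This is a simplified sketch, not the Gym implementation: the force constant `0.001` and the gravity constant `0.0025` are assumptions based on the classic formulation of the problem, and the helper names `step`, `simulate`, `always_right` and `follow_momentum` are made up for this example.

```python
import math

def step(position, velocity, action):
    """One step of simplified Mountain Car dynamics (action: 0=left, 1=none, 2=right)."""
    # assumed constants: engine force 0.001, gravity 0.0025
    velocity += (action - 1) * 0.001 - 0.0025 * math.cos(3 * position)
    velocity = max(-0.07, min(0.07, velocity))
    position = max(-1.2, min(0.6, position + velocity))
    if position == -1.2 and velocity < 0:
        velocity = 0.0  # the left wall stops the car
    return position, velocity

def simulate(policy, position=-0.5, max_steps=200):
    """Return (reached_flag, steps_taken) for a policy mapping (position, velocity) to an action."""
    velocity = 0.0
    for t in range(1, max_steps + 1):
        position, velocity = step(position, velocity, policy(position, velocity))
        if position >= 0.5:
            return True, t
    return False, max_steps

always_right = lambda p, v: 2                      # push towards the flag all the time
follow_momentum = lambda p, v: 2 if v >= 0 else 0  # push in the direction of travel

print("always right:   ", simulate(always_right))
print("follow momentum:", simulate(follow_momentum))
```

Under these assumptions, pushing right all the time leaves the car oscillating near the bottom of the valley, while pushing in the direction of travel builds up enough speed to reach the flag within the 200-step limit.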

The observation space consists of just two values:

| Num | Observation  | Min | Max |
|-----|--------------|-----|-----|
|  0  | Car Position | -1.2| 0.6 |
|  1  | Car Velocity | -0.07 | 0.07 |

The reward system for the mountain car is rather tricky:

 * A reward of 0 is awarded if the agent reaches the flag (position = 0.5) on top of the mountain.
 * A reward of -1 is awarded if the position of the agent is less than 0.5.

The episode terminates if the car position is more than 0.5, or if the episode length is greater than 200.

## Instructions

Adapt our reinforcement learning algorithm to solve the mountain car problem. Start with the existing [notebook.ipynb](notebook.ipynb) code, substitute the new environment, change the state discretization functions, and try to make the existing algorithm train with minimal code modifications. Optimize the result by adjusting hyperparameters.

> **Note**: Hyperparameter adjustment is likely to be needed to make the algorithm converge.


## Solution

In [1]:
import sys
#!pip install gym 

import gym
import matplotlib.pyplot as plt
import numpy as np
import random

Create a cartpole environment

In [2]:
env = gym.make("CartPole-v1")
print(env.action_space)
print(env.observation_space)
print(env.action_space.sample())

Discrete(2)
Box([-4.8000002e+00 -3.4028235e+38 -4.1887903e-01 -3.4028235e+38], [4.8000002e+00 3.4028235e+38 4.1887903e-01 3.4028235e+38], (4,), float32)
1


In [3]:
env.reset()

for i in range(100):
   env.render()
   env.step(env.action_space.sample())
env.close()



During simulation, we need to get observations in order to decide how to act. The `step` function returns the current observation, the reward, and a flag that indicates whether the episode has finished, i.e. whether it makes sense to continue the simulation:

In [5]:
env.reset()

done = False
while not done:
   env.render()
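   # newer Gym versions return: observation, reward, terminated, truncated, info
   obs, rew, terminated, truncated, info = env.step(env.action_space.sample())
   done = terminated or truncated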
   print(f"{obs} -> {rew}")
env.close()

[ 0.02290763 -0.20757952 -0.01469015  0.31233308] -> 1.0
[ 0.01875604 -0.0122514  -0.00844349  0.01505378] -> 1.0
[ 0.01851101 -0.20725125 -0.00814241  0.30506077] -> 1.0
[ 0.01436599 -0.40225622 -0.0020412   0.5951647 ] -> 1.0
[ 0.00632086 -0.5973495   0.0098621   0.88720393] -> 1.0
[-0.00562613 -0.40236285  0.02760618  0.59763753] -> 1.0
[-0.01367339 -0.20763783  0.03955892  0.3137765 ] -> 1.0
[-0.01782614 -0.01310109  0.04583446  0.03382697] -> 1.0
[-0.01808817  0.18133463  0.04651099 -0.2440497 ] -> 1.0
[-0.01446147  0.37576246  0.04163    -0.52170676] -> 1.0
[-0.00694622  0.18008001  0.03119587 -0.2162017 ] -> 1.0
[-0.00334462  0.37474242  0.02687183 -0.49888316] -> 1.0
[ 0.00415023  0.5694754   0.01689417 -0.7829778 ] -> 1.0
[ 0.01553973  0.37412542  0.00123461 -0.48502797] -> 1.0
[ 0.02302224  0.5692299  -0.00846595 -0.7773215 ] -> 1.0
[ 0.03440684  0.3742254  -0.02401238 -0.4873142 ] -> 1.0
[ 0.04189135  0.17945035 -0.03375866 -0.2022948 ] -> 1.0
[ 0.04548036  0.37503842 -0.037



We can also get the minimum and maximum values of those numbers from `env.observation_space.low` and `env.observation_space.high`.

## State Discretization

In [6]:
def discretize(x):
    # np.int was deprecated in NumPy 1.20; the built-in int behaves the same here
    return tuple((x/np.array([0.25, 0.25, 0.01, 0.1])).astype(int))

Let's also explore another discretization method using bins:

In [7]:
def create_bins(i,num):
    return np.arange(num+1)*(i[1]-i[0])/num+i[0]

print("Sample bins for interval (-5,5) with 10 bins\n",create_bins((-5,5),10))

ints = [(-5,5),(-2,2),(-0.5,0.5),(-2,2)] # intervals of values for each parameter
nbins = [20,20,10,10] # number of bins for each parameter
bins = [create_bins(ints[i],nbins[i]) for i in range(4)]

def discretize_bins(x):
    return tuple(np.digitize(x[i],bins[i]) for i in range(4))

Sample bins for interval (-5,5) with 10 bins
 [-5. -4. -3. -2. -1.  0.  1.  2.  3.  4.  5.]
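
The same binning idea can be adapted to the two Mountain Car observations. The sketch below is one possible starting point, not a tuned solution: the bounds come from the observation table in the task description, while the bin counts in `nbins` and the function name `discretize_mountain_car` are hypothetical choices made for this example.

```python
import numpy as np

# Observation bounds from the task description: (position, velocity)
low = np.array([-1.2, -0.07])
high = np.array([0.6, 0.07])
nbins = np.array([18, 14])  # hypothetical bin counts, meant as a starting point for tuning

def discretize_mountain_car(obs):
    """Map a (position, velocity) observation to a tuple of integer bin indices."""
    ratio = (np.asarray(obs, dtype=float) - low) / (high - low)
    idx = np.floor(ratio * nbins).astype(int)
    # clip so that values on the upper bound fall into the last bin
    return tuple(int(i) for i in np.clip(idx, 0, nbins - 1))

print(discretize_mountain_car([-0.25, 0.012]))  # (9, 8)
```

Finer bins give the agent a more precise view of the state but enlarge the Q-Table, so the bin counts are themselves hyperparameters worth adjusting.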


Let's now run a short simulation and observe those discrete environment values.

In [9]:
env.reset()

done = False
while not done:
   #env.render()
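   # newer Gym versions return: observation, reward, terminated, truncated, info
   obs, rew, terminated, truncated, info = env.step(env.action_space.sample())
   done = terminated or truncated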
   #print(discretize_bins(obs))
   print(discretize(obs))
env.close()

(0, 0, 0, -3)
(0, 0, 0, 0)
(0, 0, -1, 2)
(0, 0, 0, 0)
(0, 0, 0, -3)
(0, 0, -1, 0)
(0, 0, -1, 2)
(0, 0, 0, 0)
(0, 0, 0, 2)
(0, 0, 0, 0)
(0, 0, 0, 2)
(0, -1, 0, 5)
(0, -2, 1, 8)
(0, -1, 2, 5)
(0, -2, 3, 8)
(0, -1, 5, 5)
(0, -2, 6, 8)
(0, -1, 8, 5)
(0, 0, 9, 3)
(0, 0, 10, 0)
(0, 0, 10, 3)
(0, -1, 11, 7)
(0, -2, 12, 10)
(0, -1, 14, 7)
(0, 0, 16, 5)
(0, -1, 17, 8)
(0, -2, 18, 12)
(0, -1, 21, 9)




**Q-Table Structure**

In [10]:
Q = {}
actions = (0,1)

def qvalues(state):
    return [Q.get((state,a),0) for a in actions]
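
For Mountain Car, the same dictionary layout would need three actions instead of two. The sketch below applies a single Q-Learning update to such a table; the names `Q_demo`, `mc_actions`, `qvalues_demo` and `update` are hypothetical, and the state tuples and the values of `alpha` and `gamma` are purely illustrative.

```python
Q_demo = {}
mc_actions = (0, 1, 2)  # Mountain Car: left, no acceleration, right
alpha, gamma = 0.3, 0.9  # illustrative learning rate and discount factor

def qvalues_demo(state):
    """Q-values of all actions in a state; unseen pairs default to 0."""
    return [Q_demo.get((state, a), 0) for a in mc_actions]

def update(state, action, reward, next_state):
    """One Q-Learning update: blend the old value with the bootstrapped target."""
    target = reward + gamma * max(qvalues_demo(next_state))
    Q_demo[(state, action)] = (1 - alpha) * Q_demo.get((state, action), 0) + alpha * target

update((7, 7), 2, -1, (7, 8))
print(Q_demo)  # {((7, 7), 2): -0.3}
```

Since every step is rewarded with -1 and unseen entries default to 0, actions that have not been tried yet look better than tried ones, which nudges the agent towards exploring them.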

*Let's Start Q-Learning!*

In [11]:
# hyperparameters
alpha = 0.3
gamma = 0.9
epsilon = 0.90

In [12]:
def probs(v,eps=1e-4):
    # shift values to be positive, then normalize them into probabilities
    v = v-v.min()+eps
    v = v/v.sum()
    return v

Qmax = 0
cum_rewards = []
rewards = []
for epoch in range(100000):
    obs, _ = env.reset()  # newer Gym versions return (observation, info)
    done = False
    cum_reward=0
    # == do the simulation ==
    while not done:
        s = discretize(obs)
        if random.random()<epsilon:
            # exploitation - choose the action according to Q-Table probabilities
            v = probs(np.array(qvalues(s)))
            a = random.choices(actions,weights=v)[0]
        else:
            # exploration - randomly choose the action
            a = np.random.randint(env.action_space.n)

        # newer Gym versions return: observation, reward, terminated, truncated, info
        obs, rew, terminated, truncated, info = env.step(a)
        done = terminated or truncated
        cum_reward+=rew
        ns = discretize(obs)
        Q[(s,a)] = (1 - alpha) * Q.get((s,a),0) + alpha * (rew + gamma * max(qvalues(ns)))
    cum_rewards.append(cum_reward)
    rewards.append(cum_reward)
    # == Periodically print results and calculate average reward ==
    if epoch%5000==0:
        print(f"{epoch}: {np.average(cum_rewards)}, alpha={alpha}, epsilon={epsilon}")
        if np.average(cum_rewards) > Qmax:
            Qmax = np.average(cum_rewards)
            Qbest = dict(Q)  # copy, so that later updates do not alter the best table
        cum_rewards=[]

## Plotting Training Progress

In [None]:
plt.plot(rewards)

From this graph, it is hard to tell anything, because the stochastic nature of the training process makes the length of training sessions vary greatly. To make more sense of this graph, we can calculate a running average over a series of experiments, say 100. This can be done conveniently using `np.convolve`.
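
A minimal sketch of such a running average is shown below. The helper name `running_average` is an assumption made for this example; the short list stands in for the `rewards` collected during training.

```python
import numpy as np

def running_average(x, window=100):
    """Average over a sliding window; mode='valid' drops the partially filled edges."""
    return np.convolve(x, np.ones(window) / window, mode="valid")

print(running_average([1, 2, 3, 4, 5], window=2))  # [1.5 2.5 3.5 4.5]
```

With the training data in place, `plt.plot(running_average(rewards, 100))` should give a much smoother curve than plotting `rewards` directly.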

## Varying Hyperparameters and Seeing the Result in Action

Now it would be interesting to see how the trained model actually behaves. Let's run the simulation using the same action selection strategy as during training, sampling according to the probability distribution in the Q-Table:

In [None]:
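obs, _ = env.reset()  # newer Gym versions return (observation, info)
done = False
while not done:
   s = discretize(obs)
   env.render()
   v = probs(np.array(qvalues(s)))
   a = random.choices(actions,weights=v)[0]
   obs, _, terminated, truncated, _ = env.step(a)
   done = terminated or truncated
env.close()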
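
To see what this sampling strategy does, the sketch below converts a set of made-up Q-values into probabilities and compares a sampled action with the greedy choice. The Q-values are invented for illustration, and the greedy variant is shown only as a possible alternative, not as what the notebook does.

```python
import random
import numpy as np

def probs(v, eps=1e-4):
    """Shift values to be positive, then normalize them into probabilities."""
    v = v - v.min() + eps
    return v / v.sum()

q = np.array([-3.0, -1.0, -2.0])  # made-up Q-values for one state
p = probs(q)

sampled = random.choices(range(len(q)), weights=p)[0]  # stochastic, as in the cell above
greedy = int(np.argmax(q))                             # deterministic alternative

print(p, sampled, greedy)
```

Sampling keeps some randomness in the demonstration run, whereas the greedy choice always picks the action with the highest Q-value.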

## Saving result to an animated GIF

If you want to impress your friends, you may want to send them an animated GIF of the balancing pole. To do this, we can call `env.render` to produce an image frame, and then save the frames to an animated GIF using the PIL library:

In [None]:
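from PIL import Image
# newer Gym versions select the render mode when the environment is created
env = gym.make("CartPole-v1", render_mode="rgb_array")
obs, _ = env.reset()
done = False
i=0
ims = []
while not done:
   s = discretize(obs)
   img=env.render()  # returns the current frame as an RGB array
   ims.append(Image.fromarray(img))
   v = probs(np.array([Qbest.get((s,a),0) for a in actions]))
   a = random.choices(actions,weights=v)[0]
   obs, _, terminated, truncated, _ = env.step(a)
   done = terminated or truncated
   i+=1
env.close()
# the images/ directory must already exist
ims[0].save('images/cartpole-balance.gif',save_all=True,append_images=ims[1::2],loop=0,duration=5)
print(i)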