# Large Notebook 3: Reinforcement Learning
## Ex. 3: Control problem MountainCar-v0

In this exercise class we cover the control problem of a car at the bottom of a valley that has to pick up enough momentum to get over the hill. We will use the environment from Gymnasium (the maintained successor of OpenAI Gym), which allows you to play and visualize the 'game'. Use RL to train a policy that gets the car over the hill in the least amount of time.

**Before you can start this exercise you have to install the Gymnasium package. Start your anaconda environment with python3 and install:**

* pip install gymnasium[classic-control]


In [1]:
# UNCOMMENT IF GYMNASIUM FAILS TO IMPORT
#%pip install gymnasium[classic-control]

import gymnasium as gym
import numpy as np

If Gymnasium is installed properly you should be able to import it. We will now run a DEMO to see if everything is working. The code can already simulate the MountainCar problem for the case that its actions are **random**. To view the rendered video of the poor and helpless car, desperately trying to drive up the hill, you should run the code on your own computer.
For more info on this particular environment see e.g. the website: https://gymnasium.farama.org/environments/classic_control/mountain_car/



In [None]:
def demo():
    """run the MountainCar environment with random actions"""

    env = gym.make('MountainCar-v0', render_mode='human')  # create an instance of the environment

    state, info = env.reset()  # reset the current game; returns (observation, info)

    for _ in range(200):  # play at most 200 random actions
        # with render_mode='human' every step is drawn to screen automatically
        a = env.action_space.sample()  # get a random action
        state, reward, terminated, truncated, info = env.step(a)  # take the action and return the outcome
        if terminated or truncated:  # stop once the episode is over
            break

    env.close()

# run the demo
demo()

## Building your RL player, i.e. training your policy
We have to start by creating the game environment and checking some properties of the state and action space:

In [2]:
env = gym.make('MountainCar-v0')  # no rendering!

# get useful information about the environment:
state = env.reset()  # reset() returns a tuple (observation, info)
print('start state:', state)
print('Number of actions in the action space:', env.action_space.n)
print('Lowest state in the state space:', env.observation_space.low)
print('Highest state in the state space:', env.observation_space.high)

# perform one step of the game for action a=1
a = 1
state, reward, terminated, truncated, info = env.step(a)
print('After the step with (a=1):', state, reward, terminated, truncated, info)


start state: (array([-0.56116825,  0.        ], dtype=float32), {})
Number of actions in the action space: 3
Lowest state in the state space: [-1.2  -0.07]
Highest state in the state space: [0.6  0.07]
After the step with (a=1): [-5.608871e-01  2.811749e-04] -1.0 False False {}


We see that the car starts out in a state with two floats, here [-0.561, 0] (as it is initialized randomly, these numbers will differ each time you reset). You can perform any of 3 actions (a = 0, 1 or 2). We don't know what the numbers in the state mean; they could be the $x$, $y$ coordinates of the car or the velocity and height, but **we also don't have to know!** We will let the RL algorithm learn how to drive the car regardless of the exact meaning of the state.

You should now code a function `s2q(s)` that links state `s` to a location in the Q-matrix. This can quickly be done by discretizing the state space into bins and determining the bin number that corresponds to a given value. The function should return a tuple (or list) `loc` that holds the two bin numbers.

In [3]:
def s2q(s):
    # convert continuous state values to discrete location indices inside the Q matrix
    #----------ADD CODE HERE---------#
    n_bins_pos = 10
    n_bins_vel = 10

    # bounds of the two state components (env.observation_space.low / .high)
    pos_min, pos_max = -1.2, 0.6
    vel_min, vel_max = -0.07, 0.07

    # accept a state of shape (1, 2) as well as (2,)
    if s.shape == (1, 2):
        s = s[0]  # now shape (2,)

    pos, vel = s

    # bin width along each state dimension
    bin_width_pos = (pos_max - pos_min) / n_bins_pos
    bin_width_vel = (vel_max - vel_min) / n_bins_vel

    # raw indices
    pos_index = int((pos - pos_min) / bin_width_pos)
    vel_index = int((vel - vel_min) / bin_width_vel)

    # clipping so that indexing doesn't go out of range
    pos_index = min(max(pos_index, 0), n_bins_pos - 1)
    vel_index = min(max(vel_index, 0), n_bins_vel - 1)

    loc = (pos_index, vel_index)

    return loc
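
As a cross-check of the arithmetic in `s2q`, the same binning can be sketched with `np.digitize` on explicit bin edges. This is only an illustrative alternative, not part of the exercise template: the helper name `s2q_digitize` is made up here, and the bounds and the 10 bins per dimension are assumed to match the values used above.

```python
import numpy as np

# assumed to match s2q: state bounds and 10 bins per dimension
LOW = np.array([-1.2, -0.07])
HIGH = np.array([0.6, 0.07])
N_BINS = (10, 10)

# interior bin edges per dimension (n_bins - 1 edges separate n_bins bins)
EDGES = [np.linspace(lo, hi, n + 1)[1:-1] for lo, hi, n in zip(LOW, HIGH, N_BINS)]

def s2q_digitize(s):
    """hypothetical helper: map a continuous state to a tuple of bin indices"""
    s = np.asarray(s).reshape(-1)  # accepts shape (2,) or (1, 2)
    # np.digitize counts how many edges lie at or below the value,
    # so out-of-range values land in the first or last bin automatically
    return tuple(int(np.digitize(x, e)) for x, e in zip(s, EDGES))

print(s2q_digitize([-0.5, 0.01]))   # a state in the interior
print(s2q_digitize([-1.2, -0.07]))  # lower corner of the state space
print(s2q_digitize([0.6, 0.07]))    # upper corner of the state space
```

Because only the interior edges are passed to `np.digitize`, no separate clipping step is needed.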


The next function `qLearn()` should train your Q-matrix by playing `num_games` games according to an 'epsilon-greedy' strategy (Google it!) and update the Q-matrix according to the following Bellman equation:

$$ \mathbf{Q}^{\rm new}[s_t,a_t]=(1-\alpha)\mathbf{Q}[s_t,a_t]+\alpha\left(R_t+\gamma\, \text{max}_a  \mathbf{Q}[s_{t+1},a]\right). $$

Here, $\alpha$ is the learning rate and $\gamma$ is the discount factor; both are bounded by $\alpha,\gamma\in[0,1]$. These parameters have to be set with care, as they influence the speed of convergence of the Q-matrix. The discount factor lets you weigh the importance of future rewards against immediate ones. This is done by mixing in the term $\text{max}_a  \mathbf{Q}[s_{t+1},a]$, which gives the maximum Q value in the future state.
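
To make the update rule concrete, here is a minimal sketch of a single update on a small zero-initialized Q-matrix. The state indices, the action, the reward of -1 and the values $\alpha = 0.1$, $\gamma = 0.9$ are picked for illustration only, and `q_update` is a hypothetical helper name.

```python
import numpy as np

def q_update(Q, s, a, r, s_next, alpha, gamma):
    """hypothetical helper: apply one Bellman update to Q in place"""
    target = r + gamma * np.max(Q[s_next])  # R_t + gamma * max_a Q[s_{t+1}, a]
    Q[s][a] = (1 - alpha) * Q[s][a] + alpha * target
    return Q[s][a]

Q = np.zeros((10, 10, 3))          # 10 x 10 state bins, 3 actions
s, a, s_next = (3, 5), 2, (3, 6)   # illustrative indices

first = q_update(Q, s, a, r=-1.0, s_next=s_next, alpha=0.1, gamma=0.9)
second = q_update(Q, s, a, r=-1.0, s_next=s_next, alpha=0.1, gamma=0.9)
print(first, second)
```

With a reward of -1 and a next state whose Q values are still zero, each update moves the entry a fraction $\alpha$ of the way toward -1: first to -0.1, then to -0.19.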

In [4]:
def qLearn(Q, α, γ, ϵ, ϵ_min, num_games):
    """
    learns the Q table by interacting with the environment and applying the Bellman equation
    Q: q-table (n-dimensional ndarray), updated in place
    α: learning rate
    γ: discount factor
    ϵ: probability of taking a random action in the epsilon-greedy policy
    ϵ_min: minimum value ϵ can take when applying a reduction algorithm to ϵ
    num_games: number of games (episodes) to play
    returns the average total reward per game
    """

    wins = 0

    #----------ADD CODE HERE---------#
    total_reward_for_all_episodes = 0

    for _ in range(num_games):
        state, _ = env.reset()  # reset the current game
        done = False
        episode_reward = 0

        while not done:
            s_index = s2q(state)

            if np.random.uniform(0, 1) < ϵ:  # epsilon-greedy action
                action = env.action_space.sample()  # take a random action
            else:  # otherwise choose the best known action from the Q table
                action = np.argmax(Q[s_index])

            next_state, reward, terminated, truncated, info = env.step(action)
            episode_reward += reward
            done = terminated or truncated

            next_s_index = s2q(next_state)

            # Bellman equation
            best_next_value = np.max(Q[next_s_index])
            Q[s_index][action] = (1 - α) * Q[s_index][action] + α * (reward + γ * best_next_value)

            # move on to the next state
            state = next_state

            # the reward is -1 on every step, so a win is signalled by
            # `terminated` (goal reached), not by a positive reward
            if terminated:
                wins += 1

        total_reward_for_all_episodes += episode_reward

        # decay ϵ, but never below ϵ_min
        if ϵ > ϵ_min:
            ϵ = max(ϵ * 0.99, ϵ_min)

    print(f'Training ended. Number of wins: {wins}')
    avg_reward = total_reward_for_all_episodes / num_games

    return avg_reward

Finally you put everything together. It is almost completely finished for you. What values for the hyperparameters do you choose?
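
When choosing ϵ and ϵ_min it helps to know how long the exploration phase lasts. The sketch below assumes the multiplicative decay by 0.99 per game used in `qLearn` above; `games_until_min` is a hypothetical helper, and the start and floor values 0.9 and 0.1 are example choices.

```python
import math

def games_until_min(eps, eps_min, decay):
    """hypothetical helper: number of decay steps until eps <= eps_min"""
    n = 0
    while eps > eps_min:
        eps *= decay
        n += 1
    return n

n = games_until_min(0.9, 0.1, 0.99)
# closed form for comparison: smallest n with eps * decay**n <= eps_min
n_closed = math.ceil(math.log(0.1 / 0.9) / math.log(0.99))
print(n, n_closed)
```

With these example values ϵ reaches its floor after 219 games, so in a run of 1000 games roughly the last 780 are played with ϵ at about ϵ_min.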

In [5]:
# initialize the Q matrix as a numpy array with zeros
Qdim = (10,10,env.action_space.n)
Q = np.zeros(shape=Qdim)

# set the hyperparameters
α = 0.1  # learning rate [0,1]
γ = 0.90  # discount rate [0,1]
ϵ = 0.9  # epsilon greedy strategy [0,1]
ϵ_min = 0.1  # minimum value of epsilon [0,1]
num_games = 1000 # number of training games

# train the agent and store results
qLearn(Q, α, γ, ϵ, ϵ_min, num_games)
np.save('qrun1.npy', Q)


Training ended. Number of wins: 0


Once a Q-matrix has been trained we can use it as a policy and play a game. Write code that performs actions according to the input Q-matrix to play a single episode.

In [None]:
# replay the game using the trained Q matrix
Q = np.load('qrun1.npy')

# create and reset the environment with render mode on
env = gym.make('MountainCar-v0', render_mode='human')
state,info = env.reset()  
    
# play a single episode (MountainCar-v0 truncates an episode after 200 steps)
for _ in range(1000):
    loc = s2q(state)
    a = np.argmax(Q[loc])
    next_state, reward, terminated, truncated, info = env.step(a)
    state = next_state

    if terminated or truncated:  # stop stepping once the episode is over
        print('Qplay Output:', reward, terminated, truncated, info)
        break

env.close()
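
The greedy policy used above can also be inspected without rendering. The sketch below uses a randomly filled stand-in for a trained Q-matrix (illustrative data, not real training results). It extracts the greedy action for every discrete state and counts the states whose Q-values were never updated, assuming the matrix was initialized with zeros. To inspect a real run, `Q_demo` would be replaced by the array loaded from the saved file.

```python
import numpy as np

rng = np.random.default_rng(0)

# stand-in for a trained Q-matrix: zeros everywhere, with negative
# values written into a few "visited" states (illustrative data only)
Q_demo = np.zeros((10, 10, 3))
Q_demo[2:8, 3:7, :] = -rng.random((6, 4, 3))

policy = np.argmax(Q_demo, axis=-1)       # greedy action for each discrete state
unvisited = np.all(Q_demo == 0, axis=-1)  # states still at their initial value

print(policy)
print('unvisited states:', int(unvisited.sum()), 'of', unvisited.size)
```

Note that `np.argmax` returns 0 for an all-zero row, so in unvisited states the greedy policy always picks action 0.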

# Experiments
Set up a couple of experiments to figure out the following things:
* How do $\alpha$ and $\gamma$ affect your learning performance?
* Are both elements of the state vector equally important and can we reduce the dimensions of the Q-matrix of one (or both) of them?
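
For the second question it is convenient to make the number of bins per state dimension a parameter. The following is one possible sketch, not the exercise's reference solution: `make_s2q` is a hypothetical factory, and the state bounds are assumed to be those printed from `env.observation_space` earlier. Setting a bin count to 1 effectively removes that state component from the Q-matrix.

```python
def make_s2q(n_bins, low=(-1.2, -0.07), high=(0.6, 0.07)):
    """hypothetical factory: returns a state-to-index function for the given bin counts"""
    def s2q_n(s):
        loc = []
        for x, lo, hi, n in zip(s, low, high, n_bins):
            idx = int((x - lo) / ((hi - lo) / n))  # raw bin index
            loc.append(min(max(idx, 0), n - 1))    # clip into the valid range
        return tuple(loc)
    return s2q_n

state = (-0.5, 0.01)  # illustrative state
for n_bins in [(10, 10), (20, 5), (10, 1), (1, 10)]:
    q_shape = n_bins + (3,)  # matching Q-matrix shape for 3 actions
    print(n_bins, '->', make_s2q(n_bins)(state), 'Q shape', q_shape)
```

Training with each variant and comparing the average reward then gives an indication of how much each state component contributes.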

In [5]:
# set the hyperparameters
α = [0.1,0.5,0.7]  # learning rates to compare [0,1]
γ = [0.6,0.8,0.9]  # discount rates to compare [0,1]
ϵ = 0.9  # epsilon greedy strategy [0,1]
ϵ_min = 0.1  # minimum value of epsilon [0,1]
num_games = 10  # number of training games per call to qLearn
episodes = 500  # total number of training games per (alpha, gamma) pair


train_loops = episodes // num_games


# train the agent and store results
for alpha in α:
    for gamma in γ:
        Q = np.zeros((10, 10, env.action_space.n))  # fresh Q matrix for every pair

        returns_history = []
        current_ϵ = ϵ

        for _ in range(train_loops):

            avg_reward = qLearn(Q, alpha, gamma, current_ϵ, ϵ_min, num_games)
            print("avg_reward",avg_reward)

            if current_ϵ > ϵ_min:
                current_ϵ *= 0.99

            returns_history.append(avg_reward)

            filename = f"q_alpha{alpha}_gamma{gamma}.npy"
            np.save(filename, Q)

            final_avg = np.mean(returns_history[-5:])  # last 5 calls => last 50 episodes
            print(f"Finished training with alpha={alpha}, gamma={gamma}. "
              f"Approx final average reward = {final_avg:.2f}")

env.close()

Training ended. Number of wins: 0
avg_reward -200.0
Finished training with alpha=0.1, gamma=0.6. Approx final average reward = -200.00
Training ended. Number of wins: 0
avg_reward -200.0
Finished training with alpha=0.1, gamma=0.6. Approx final average reward = -200.00
Training ended. Number of wins: 0
avg_reward -200.0
Finished training with alpha=0.1, gamma=0.6. Approx final average reward = -200.00
Training ended. Number of wins: 0
avg_reward -200.0
Finished training with alpha=0.1, gamma=0.6. Approx final average reward = -200.00
Training ended. Number of wins: 0
avg_reward -200.0
Finished training with alpha=0.1, gamma=0.6. Approx final average reward = -200.00
Training ended. Number of wins: 0
avg_reward -200.0
Finished training with alpha=0.1, gamma=0.6. Approx final average reward = -200.00
Training ended. Number of wins: 0
avg_reward -200.0
Finished training with alpha=0.1, gamma=0.6. Approx final average reward = -200.00
Training ended. Number of wins: 0
avg_reward -200.0
Fin

In [None]:
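best_alpha = 0.1  # todo: fill in the best pair found in the sweep above
best_gamma = 0.9  # must be one of the trained values (0.6, 0.8 or 0.9), otherwise the file does not exist
best_Q = np.load(f"q_alpha{best_alpha}_gamma{best_gamma}.npy")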

env = gym.make('MountainCar-v0', render_mode='human')
state, info = env.reset()

for _ in range(1000):
    env.render()
    loc = s2q(state)             # convert to discrete index
    action = np.argmax(best_Q[loc])   # pick best action
    next_state, reward, terminated, truncated, info = env.step(action)
    state = next_state

    if terminated or truncated:
        print("Episode finished with reward:", reward)
        break

env.close()

Episode finished with reward: -1.0
