<hr/>

# Foundations of Reinforcement Learning

<hr/>

<h1><font color="darkblue">Lab 4: Monte Carlo Method  </font></h1>





##  Content
1. Monte Carlo Method


Import Gym and other necessary libraries

In [1]:
%pylab inline
import numpy as np
import matplotlib.pyplot as plt
import gym
from IPython import display
import random

Populating the interactive namespace from numpy and matplotlib


## 1. Monte Carlo Method  (CartPole-v1 environment)

### 1.1 CartPole Introduction

We now apply the Monte Carlo method to the CartPole problem.


1. A pole is attached by an un-actuated joint to a cart, which moves along a frictionless track. 

0. The system is controlled by applying a force of +1 or -1 to the cart. 

0. The pendulum starts upright, and the goal is to prevent it from falling over. 

0. A reward of +1 is provided for every timestep that the pole remains up. 

0. The episode ends when the pole is more than 12 degrees from vertical, when the cart moves more than 2.4 units from the center, or when the time limit is reached (200 steps for CartPole-v0, 500 steps for CartPole-v1).

0. For more information, see the [source on GitHub](https://github.com/openai/gym/blob/master/gym/envs/classic_control/cartpole.py).

The following examples show the basic usage of this testing environment: 



### 1.1.1 Episode initialization and Initial Value

In [3]:
env = gym.make('CartPole-v1')
observation = env.reset()  # initialize an episode

# gym >= 0.26 returns (observation, info) from reset(); compare version numbers, not strings
if tuple(map(int, gym.__version__.split('.')[:2])) >= (0, 26):
    observation = observation[0]

print("Initial observation is {}".format(observation))

print("\nThis means the cart's current position is {}".format(observation[0]), end = '')
print(" with velocity {},".format(observation[1]))

print("and the pole's current angular position is {}".format(observation[2]), end = '')
print(" with angular velocity {}.".format(observation[3]))


Initial observation is [ 0.03085803 -0.04887363 -0.02494748  0.026118  ]

This means the cart's current position is 0.030858034268021584 with velocity -0.04887362942099571,
and the pole's current angular position is -0.024947484955191612 with angular velocity 0.026118000969290733.


### 1.1.2 Take actions


Use env.step(action) to take an action.

The action is an integer, either 0 or 1:

0: push the cart to the left; 1: push the cart to the right

In [4]:
print("Current observation is {}".format(observation))

action = 0 # push the cart to the left

#################### simulate one step
# gym >= 0.26 returns five values from step(); compare version numbers, not strings
if tuple(map(int, gym.__version__.split('.')[:2])) >= (0, 26):
    observation, reward, terminated, truncated, info = env.step(action)
    done = terminated or truncated
else:
    observation, reward, done, info = env.step(action)
####################



print("\nNew observation is {}".format(observation))
print("Step reward is {}".format(reward))
print("Did the episode just end? {}".format(done)) # the episode ends when a termination condition listed in Section 1.1 is met



Current observation is [ 0.03085803 -0.04887363 -0.02494748  0.026118  ]

New observation is [ 0.02988056 -0.2436291  -0.02442512  0.3108265 ]
Step reward is 1.0
Did the episode just end? False
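
The version branch above can also be replaced by a small helper that looks at the shape of the returned values instead of the version string. The following is a minimal sketch, not part of Gym: the names `unpack_reset` and `unpack_step` are made up for illustration, and the sketch assumes that the older API returns a bare observation from `reset()` and four values from `step()`, while the newer API returns `(observation, info)` and five values.

```python
def unpack_reset(result):
    """Return only the observation from env.reset().

    Hypothetical helper: assumes the observation itself is not a tuple
    (true for CartPole, where it is an array of 4 numbers).
    """
    if isinstance(result, tuple) and len(result) == 2:
        observation, _info = result          # newer API: (observation, info)
        return observation
    return result                            # older API: observation only


def unpack_step(result):
    """Return (observation, reward, done, info) from env.step(action)."""
    if len(result) == 5:                     # newer API: five values
        observation, reward, terminated, truncated, info = result
        return observation, reward, terminated or truncated, info
    observation, reward, done, info = result  # older API: four values
    return observation, reward, done, info


# Usage with stand-in tuples instead of a real environment
old_style = ([0.0, 0.0, 0.0, 0.0], 1.0, False, {})
new_style = ([0.0, 0.0, 0.0, 0.0], 1.0, False, True, {})
print(unpack_step(old_style)[2])   # False
print(unpack_step(new_style)[2])   # True (truncated)
```

With these helpers, a simulation step would read `observation, reward, done, info = unpack_step(env.step(action))` under either API.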


### 1.1.3 Simulate multiple episodes

(You may uncomment the rendering lines below to see an animation. However, this will not work on JupyterHub, since the animation requires OpenGL rather than WebGL. If you run Jupyter Notebook locally on your computer, this version of the code will work through a virtual frame.)

In [None]:
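env = gym.make('CartPole-v1')
# gym >= 0.26 changed the return values of reset() and step(); compare version numbers, not strings
new_api = tuple(map(int, gym.__version__.split('.')[:2])) >= (0, 26)
observation = env.reset()
total_reward = 0      # reward summed over completed episodes only
episode_reward = 0    # reward collected so far in the current episode
ep_num = 0            # number of completed episodes
# note: with gym >= 0.26, pass render_mode='rgb_array' to gym.make and call env.render() without arguments
# img = plt.imshow(env.render(mode='rgb_array')) 


for _ in range(1000):
    #     img.set_data(env.render(mode='rgb_array')) 
    #     display.display(plt.gcf())
    #     display.clear_output(wait=True)
    
    action = env.action_space.sample()     # take a random action
    #################### simulate one step
    if new_api:
        observation, reward, terminated, truncated, info = env.step(action)
        done = terminated or truncated
    else:
        observation, reward, done, info = env.step(action) 
    ####################
       
    episode_reward += reward

    if done:                               # the episode has just ended
        total_reward += episode_reward     # count only completed episodes
        episode_reward = 0
        ep_num += 1
        observation = env.reset()          # start a new episode
        if new_api:
            observation = observation[0]

print("Average reward per episode is {}".format(total_reward/ep_num))
env.close()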


### 1.1.4 State simplification

For convenience, we consider only the cart position and the pole angular position (i.e. state dimension = 2).

Note that the observed cart position satisfies $P \in [-4.8, 4.8]$ and the observed pole angular position satisfies $\theta \in [-0.418, 0.418]$ (in radians) at all times. We can therefore divide each of these two intervals evenly to form a finite number of states.
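
As a quick check of these numbers: 0.418 is an angle in radians (roughly 24 degrees), and placing $n$ evenly spaced grid points on an interval gives a spacing of (interval length)/(n-1). The sketch below works this out for n = 100; the variable names are chosen here for illustration only.

```python
import math

n_points = 100                      # number of grid points per dimension

pos_bound = 4.8                     # |P| <= 4.8
angle_bound = 0.418                 # |theta| <= 0.418 rad

angle_bound_deg = math.degrees(angle_bound)
pos_step = 2 * pos_bound / (n_points - 1)       # spacing of the position grid
angle_step = 2 * angle_bound / (n_points - 1)   # spacing of the angle grid

print(round(angle_bound_deg, 1))    # 23.9
print(round(pos_step, 4))           # 0.097
print(round(angle_step, 5))         # 0.00844
```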

In [None]:
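def find_state_idx(ob,ls0,ls1):
    # Map an observation to a pair of grid indices
    #
    # input:  ob,   (array) observation; ob[0] is the cart position, ob[2] the pole angular position
    #         ls0,  (int)   number of grid points for the cart position
    #         ls1,  (int)   number of grid points for the pole angular position
    # output: [cart position index, pole angular position index]
    pos_diff = ob[0] + 4.8         # shift so that the lower bound maps to 0
    a_pos_diff = ob[2] + 0.418
    
    step_size_1 = 4.8*2/(ls0-1)    # spacing between neighbouring grid points
    step_size_2 = 0.418*2/(ls1-1)
    
    # round to the nearest grid point; clip and cast so the result is a valid integer index
    d_1 = int(np.clip(np.round(pos_diff/step_size_1), 0, ls0-1))
    d_3 = int(np.clip(np.round(a_pos_diff/step_size_2), 0, ls1-1))
     
    return [d_1,d_3]


ls_cart = 100 # divide the cart position into 100 states
ls_pole = 100 # divide the pole angular position into 100 states

# There are 100 * 100 = 10000 different states in total

env = gym.make('CartPole-v1')   # the environment from the previous cell was closed
observation = env.reset()
if tuple(map(int, gym.__version__.split('.')[:2])) >= (0, 26):
    observation = observation[0]
state_idx = find_state_idx(observation,ls_cart,ls_pole)

print("\nThe cart's current position is {}".format(observation[0]), end = '')
print(" and the pole's current angular position is {},".format(observation[2]))

print("which is projected to state {}".format(state_idx))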
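
For plotting or debugging it can help to go the other way, from a grid index back to the value it represents. The following is a minimal sketch in plain Python; `idx_to_value` and `value_to_idx` are names chosen here, not part of the lab code, and the sketch assumes the same evenly spaced grid as `find_state_idx` (index 0 at the lower bound, index n-1 at the upper bound).

```python
def value_to_idx(value, bound, n_points):
    """Nearest grid index for value on n_points points spanning [-bound, bound]."""
    step = 2 * bound / (n_points - 1)
    return int(round((value + bound) / step))


def idx_to_value(idx, bound, n_points):
    """Value located at grid index idx on the same grid."""
    step = 2 * bound / (n_points - 1)
    return -bound + idx * step


print(value_to_idx(2.4, 4.8, 100))              # 74
print(round(idx_to_value(74, 4.8, 100), 3))     # 2.376
print(value_to_idx(0.1, 0.418, 100))            # 61
```

The round trip is not exact: a cart position of 2.4 maps to index 74, which represents about 2.376, so the discretization error is at most half a grid step.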

### 1.2 On-policy first-visit MC control
1. Implement the "On-policy first-visit MC control" algorithm in [Ch 5.4 Sutton] to choose optimal actions.
2. Simulate this algorithm for 30000 episodes.
3. Divide the 30000 episodes into 15 sets and plot the average reward for each set (i.e. plot the average reward for the first 2000 episodes, the second 2000 episodes, ..., and the 15th 2000 episodes).
4. Plot the heatmap of Q for each action.
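
The two ingredients of step 1 can be sketched in plain Python as follows. This is an illustration under stated assumptions rather than a reference solution: states are assumed to be hashable tuples such as `(position index, angle index)`, Q and the visit counts are kept in dictionaries instead of arrays, the return is undiscounted by default (gamma = 1), and the names `epsilon_greedy`, `first_visit_update`, and `N_ACTIONS` are chosen here.

```python
import random
from collections import defaultdict

N_ACTIONS = 2  # CartPole: 0 = left, 1 = right


def epsilon_greedy(state, Q, epsilon):
    """With probability epsilon pick a random action, otherwise a greedy one."""
    if random.random() < epsilon:
        return random.randrange(N_ACTIONS)
    values = [Q[(state, a)] for a in range(N_ACTIONS)]
    best = max(values)
    # break ties at random so that no action is systematically ignored
    return random.choice([a for a, v in enumerate(values) if v == best])


def first_visit_update(Q, counts, states, actions, rewards, gamma=1.0):
    """Update Q with the return following the first visit to each (state, action)."""
    # time step of the first occurrence of each (state, action) pair
    first_visit = {}
    for t, pair in enumerate(zip(states, actions)):
        first_visit.setdefault(pair, t)

    G = 0.0
    for t in reversed(range(len(states))):
        G = gamma * G + rewards[t]            # return following time step t
        pair = (states[t], actions[t])
        if first_visit[pair] == t:            # later visits are skipped
            counts[pair] += 1
            Q[pair] += (G - Q[pair]) / counts[pair]   # incremental average
    return Q


Q = defaultdict(float)
counts = defaultdict(int)
# A toy 3-step episode with reward +1 per step, as in CartPole
states = [(50, 50), (50, 51), (50, 50)]
actions = [0, 1, 0]
rewards = [1.0, 1.0, 1.0]
first_visit_update(Q, counts, states, actions, rewards)
print(Q[((50, 50), 0)])   # 3.0 (return from the first visit at t = 0)
print(Q[((50, 51), 1)])   # 2.0
```

Because every CartPole step yields a reward of +1, the undiscounted return after step t is simply the number of remaining steps in the episode, which is why the pair visited at t = 0 receives 3.0 in the toy episode.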


In [None]:
## Suggested functions (Feel free to modify existing and add new functions)

def get_action(current_state, Q, epsilon):
    
    # Choose an action based on the current state and Q
    #
    # input:  current_state,  (array) 
    #         Q,              (array)  
    #         epsilon,        (float)  exploration rate
    # output: action
    #         
    # TODO: compute `action` here
    return action



def update_Q(Q, obs, acts):
    # Update Q at the end of each episode
    #
    # input:  current Q, (array) 
    #         obs,       (array)  states observed in this episode
    #         acts,      (array)  actions taken in this episode
    # output: updated Q
    #         
    # TODO: update `Q` here
        
    return Q





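## Suggested flow (Feel free to modify and add)


set_num = 15                  # number of sets of episodes
set_size = 2000               # number of episodes per set
s = 0                         # number of completed sets
ep_num = 0                    # number of completed episodes
epsilon = 0.1                 # exploration rate (example value, tune as needed)
Q = np.zeros((ls_cart, ls_pole, 2))  # one possible layout: Q[position index, angle index, action]
obs, acts = [], []            # states and actions of the current episode

env = gym.make('CartPole-v1')
# gym >= 0.26 changed the return values of reset() and step(); compare version numbers, not strings
new_api = tuple(map(int, gym.__version__.split('.')[:2])) >= (0, 26)
observation = env.reset()
if new_api:
    observation = observation[0]
while True:
    
    current_state = find_state_idx(observation, ls_cart, ls_pole)  # discretized state (Section 1.1.4)
    action = get_action(current_state,Q,epsilon)
    obs.append(current_state)
    acts.append(action)
    
    #################### simulate one step
    if new_api:
        observation, reward, terminated, truncated, info = env.step(action)
        done = terminated or truncated
    else:
        observation, reward, done, info = env.step(action) 
    ####################
    
    if done:  # end of episode
        Q = update_Q(Q, obs, acts)
        obs, acts = [], []
        
        ep_num += 1
        
        if  np.mod(ep_num,set_size)==0: # end of every set of episodes
            s+=1
            
            if s == set_num:
                break

        observation = env.reset()  # start the next episode
        if new_api:
            observation = observation[0]
env.close()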
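
For step 3 of the task, the per-set averages can be computed from a list of episode rewards collected during training. The following is a minimal sketch: `average_per_set` is a name chosen here, `episode_rewards` is a hypothetical list holding one total reward per episode (the loop above does not record it), and the numbers are synthetic placeholders rather than results.

```python
def average_per_set(episode_rewards, set_size):
    """Average reward of each consecutive block of set_size episodes."""
    averages = []
    # an incomplete trailing block, if any, is ignored
    for start in range(0, len(episode_rewards) - set_size + 1, set_size):
        block = episode_rewards[start:start + set_size]
        averages.append(sum(block) / set_size)
    return averages


# Synthetic placeholder data: 6 episodes split into 3 sets of 2
episode_rewards = [10, 20, 30, 50, 40, 80]
print(average_per_set(episode_rewards, set_size=2))   # [15.0, 40.0, 60.0]
```

With 30000 episodes and `set_size=2000` this yields 15 values, which can be passed directly to `plt.plot`.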