<a href="https://colab.research.google.com/github/monicasjsu/Reinforcement-Learning/blob/master/Sarsa_Mountain_Car.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Import the necessary libraries**

In [1]:
!apt-get install python-opengl -y

!apt install xvfb -y

!pip install pyvirtualdisplay

!pip install pyglet


from pyvirtualdisplay import Display
Display().start()

import gym
from IPython import display
import matplotlib.pyplot as plt
%matplotlib inline


**Create the Gym environment**




In [2]:
env_name = 'MountainCar-v0'
env = gym.make(env_name)

In [3]:
print("Action space:", env.action_space)
print("Observation space:", env.observation_space)
print("Highest state feature values:", env.observation_space.high)
print("Lowest state feature values:", env.observation_space.low)
print("Observation shape:", env.observation_space.shape)

Action space: Discrete(3)
Observation space: Box(2,)
Highest state feature values: [0.6  0.07]
Lowest state feature values: [-1.2  -0.07]
Observation shape: (2,)


**Set the hyperparameters**

In [4]:
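n_states = 40  # number of bins per state dimension (position and velocity)
episodes = 10  # number of training episodes

initial_lr = 1.0  # initial learning rate
min_lr = 0.005  # minimum learning rate
gamma = 0.99  # discount factor (also reused as the learning-rate decay base)
max_steps = 300  # per-episode step cap; defined here but not enforced by the training loop
epsilon = 0.05  # exploration probability for epsilon-greedy action selection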
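
The training loop further down derives the learning rate from these values as `max(min_lr, initial_lr * gamma**(episode // 100))`. The sketch below isolates that schedule so its behaviour is easier to see. The names `learning_rate`, `decay`, and `every` are hypothetical and do not appear in the notebook; the defaults mirror the hyperparameters above.

```python
def learning_rate(episode, initial_lr=1.0, min_lr=0.005, decay=0.99, every=100):
    """Step decay: multiply by `decay` once per `every` episodes, floored at `min_lr`."""
    return max(min_lr, initial_lr * decay ** (episode // every))


# The rate only changes every 100 episodes, so a 10-episode run never decays it.
for ep in (0, 9, 99, 100, 1000, 100000):
    print(ep, learning_rate(ep))
```

With `episodes = 10` the exponent `episode // 100` is always 0, so the learning rate stays at `initial_lr` for the whole run; the floor `min_lr` is only reached after tens of thousands of episodes.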

In [6]:
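import numpy as np

# Strip the TimeLimit wrapper so an episode only ends when the car reaches the goal
# (by default MountainCar-v0 is truncated after 200 steps).
env = env.unwrapped
env.seed(0)  # seed the environment (older Gym API)
np.random.seed(0)  # seed NumPy's global RNG used for epsilon-greedy sampling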

**Define a function for discretization**

In [7]:
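def discretization(env, obs):
    """Map a continuous (position, velocity) observation to a pair of bin indices."""
    env_low = env.observation_space.low
    env_high = env.observation_space.high

    # Width of one bin along each dimension
    env_den = (env_high - env_low) / n_states
    pos_den = env_den[0]
    vel_den = env_den[1]

    pos_low = env_low[0]
    vel_low = env_low[1]

    # Clamp to n_states - 1 so an observation at the upper bound stays inside the Q-table
    pos_scaled = min(int((obs[0] - pos_low) / pos_den), n_states - 1)
    vel_scaled = min(int((obs[1] - vel_low) / vel_den), n_states - 1)

    return pos_scaled, vel_scaled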
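
To make the binning concrete, here is a minimal standalone sketch that does not need a Gym environment. It hard-codes the bounds printed earlier (`[-1.2, -0.07]` and `[0.6, 0.07]`) and 40 bins per dimension. The names `to_bins`, `LOW`, `HIGH`, and `N_BINS` are hypothetical, and the sketch clamps each index so that an observation exactly at an upper bound maps to the last bin rather than to index 40.

```python
N_BINS = 40            # same value as n_states in the notebook
LOW = (-1.2, -0.07)    # observation_space.low printed earlier
HIGH = (0.6, 0.07)     # observation_space.high printed earlier


def to_bins(obs, low=LOW, high=HIGH, n_bins=N_BINS):
    """Map a (position, velocity) pair to integer bin indices in [0, n_bins - 1]."""
    indices = []
    for value, lo, hi in zip(obs, low, high):
        width = (hi - lo) / n_bins                       # size of one bin
        index = int((value - lo) / width)                # raw bin index
        indices.append(min(max(index, 0), n_bins - 1))   # clamp boundary values
    return tuple(indices)


print(to_bins((-0.5, 0.01)))    # car near the valley floor, drifting right: (15, 22)
print(to_bins((0.6, 0.07)))     # both upper bounds: (39, 39) after clamping
print(to_bins((-1.2, -0.07)))   # both lower bounds: (0, 0)
```

Each position bin is 0.045 wide and each velocity bin is 0.0035 wide, so the 40 x 40 grid gives 1600 discrete states, each with 3 action values.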

**Train the agent with SARSA**

In [None]:
q_table = np.zeros((n_states, n_states, env.action_space.n))
for episode in range(episodes):
    obs = env.reset()
    total_reward = 0
    # Step-decayed learning rate; with fewer than 100 episodes alpha stays at initial_lr
    alpha = max(min_lr, initial_lr * (gamma ** (episode // 100)))
    steps = 0

    # Choose the action for the initial state with epsilon-greedy
    if np.random.uniform(low=0, high=1) < epsilon:
        a = np.random.choice(env.action_space.n)
    else:
        pos, vel = discretization(env, obs)
        a = np.argmax(q_table[pos][vel])

    while True:
        env.render()
        pos, vel = discretization(env, obs)

        obs, reward, terminate, _ = env.step(a)
        # Progress measure (distance from the valley floor near x = -0.5), not the environment return
        total_reward += abs(obs[0] + 0.5)
        pos_, vel_ = discretization(env, obs)

        # Choose the action for the next state with epsilon-greedy
        if np.random.uniform(low=0, high=1) < epsilon:
            a_ = np.random.choice(env.action_space.n)
        else:
            a_ = np.argmax(q_table[pos_][vel_])

        # SARSA update: bootstrap from the action actually selected for the next state
        q_table[pos][vel][a] = (1 - alpha) * q_table[pos][vel][a] + alpha * (reward + gamma * q_table[pos_][vel_][a_])
        steps += 1
        if terminate:
            break
        a = a_
    print("Episode {} completed with total reward {} in {} steps".format(episode + 1, total_reward, steps))

# Hold the final frame (car past the flag) until the kernel is interrupted
try:
    while True:
        env.render()
except KeyboardInterrupt:
    env.close()

Episode 1 completed with total reward 10681.592878022653 in 32927 steps
Episode 2 completed with total reward 787.9200319408282 in 2908 steps
Episode 3 completed with total reward 1726.4446335156654 in 5812 steps
Episode 4 completed with total reward 1093.8120666928066 in 3982 steps
Episode 5 completed with total reward 2037.6759841203943 in 5999 steps
Episode 6 completed with total reward 740.6977100558029 in 2894 steps
Episode 7 completed with total reward 860.4562672146648 in 3157 steps
Episode 8 completed with total reward 581.5362914429438 in 2693 steps
Episode 9 completed with total reward 3582.4670596187384 in 8886 steps
Episode 10 completed with total reward 1531.109979650578 in 5667 steps
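
The action selection and Q-table update used in the training loop can be factored into two small helpers, sketched below under a few assumptions. The names `epsilon_greedy` and `sarsa_update` are hypothetical, the sketch uses NumPy's `Generator` API rather than the global `np.random` functions used in the notebook, and the `done` flag is an addition: the notebook's loop bootstraps from the next state on every step, whereas this version optionally drops the bootstrap term on a terminal transition.

```python
import numpy as np


def epsilon_greedy(q_values, epsilon, rng):
    """Return a random action with probability epsilon, otherwise the greedy one."""
    if rng.uniform() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))


def sarsa_update(q_table, s, a, reward, s_next, a_next, alpha, gamma, done=False):
    """Apply one on-policy TD update in place and return the new Q-value.

    `s` and `s_next` are (position_bin, velocity_bin) tuples.
    """
    # Assumption: skip bootstrapping when the transition ends the episode.
    target = reward if done else reward + gamma * q_table[s_next][a_next]
    q_table[s][a] = (1 - alpha) * q_table[s][a] + alpha * target
    return q_table[s][a]


# Tiny demonstration on a 2 x 2 grid with 3 actions.
rng = np.random.default_rng(0)
q = np.zeros((2, 2, 3))
q[1][1][2] = 10.0                                  # pretend this value was learned
a_next = epsilon_greedy(q[1][1], epsilon=0.0, rng=rng)  # greedy pick: action 2
new_value = sarsa_update(q, (0, 0), 1, -1.0, (1, 1), a_next, alpha=0.5, gamma=0.9)
print(a_next, new_value)
```

Because the target uses `a_next`, the action the behaviour policy actually selected, the update is on-policy; replacing it with the maximum over `q_table[s_next]` would turn the same loop into Q-learning.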
