<a href="https://colab.research.google.com/github/poudel-bibek/Intro-to-AI-Assignments/blob/main/A9_Task.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

![Artboard](https://user-images.githubusercontent.com/96804013/153448807-fb099682-bac7-4254-986c-1d54ce6e534d.png)


# Assignment 9: Q-Learning (Task)
---
In this assignment, we will use OpenAI Gym ([link](https://gym.openai.com/envs/MountainCar-v0/)) to solve a classic control problem of driving a car uphill by rocking back and forth to gain momentum. To do so, we will be training a Reinforcement Learning agent using the Q-Learning algorithm ([Link](https://en.wikipedia.org/wiki/Q-learning)).

<p align="center">
  <img src="https://user-images.githubusercontent.com/96804013/153467131-a630edb2-aeb9-4694-bd69-28382b49c25a.gif"/>
</p>

<p align="center">
  <em>Figure 1: Performance of an untrained agent on MountainCar-v0</em>
</p>

<p align="center">
  <img src="https://user-images.githubusercontent.com/96804013/153467239-0e681df2-0f79-4453-81e0-b2ac1e9fd378.gif"/>
</p>

<p align="center">
  <em>Figure 2: Performance of a trained agent on MountainCar-v0</em>
</p>
 

Let's start by importing the necessary packages and creating the environment.
                    
                    import time
                    import gym 
                    import random
                    import numpy as np 
                    import matplotlib.pyplot as plt 
                    env = gym.make("MountainCar-v0")

<p align="center">
  <img src="https://user-images.githubusercontent.com/96804013/152835304-fa8af20e-d6c5-4b41-b36e-21b1ebe66240.png"/>
</p>

<p align="center">
  <em>Figure 3: The agent's interaction with the environment</em>
</p>


Before we jump into the task, let's get familiar with the OpenAI Gym terminology and the MountainCar-v0 environment.

### In the MountainCar-v0 environment:
---
__Actions:__ 
  - 0 =	push left
  - 1 =	no push
  - 2 =	push right

__Observations:__ 
  - Position
  - Velocity

__Reward:__ The agent collects a reward of `-1` for every timestep it spends on the task, so the goal is to finish as quickly as possible (i.e., with the least negative total reward).
  
__Terminologies:__

- Step = an agent taking one action in the environment
- Observation = the agent's view of the environment state (for this assignment, the terms `states` and `observations` are used interchangeably)
- Reward = a value assigned by the environment indicating how "good" the last action taken by the agent was
- Done = whether or not the current episode is terminated

Run the snippet below to print various exploratory values. 

                    random_action = env.action_space.sample() 
                    env.reset() 
                    observation, reward, done, info = env.step(random_action)

                    print(f"Action = {random_action}")
                    print(f"Observation = {observation}, shape = {observation.shape}")
                    print(f"Reward = {reward}")
                    print(f"Done = {done}")

                    print(f"Number of actions that can be taken = {env.action_space.n}")
                    print(f"Limits of the observation: \n min = {env.observation_space.low} \n max = {env.observation_space.high}")

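The agent stores its action-value estimates in a Q-table, with one Q-value for every combination of discretized position, discretized velocity, and action. Visually, our Q-table looks like: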

<p align="center">
  <img src="https://user-images.githubusercontent.com/96804013/153719045-4da1d0a1-25ba-4b69-94e9-b0ee67a288fa.jpg"/>
</p>

<p align="center">
  <em>Figure 4: Q-table for MountainCar-v0 with 20 discrete position and velocity values. (Q-values randomly populated)</em>
</p>
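
The helper functions and training loop in this assignment rely on `q_table`, `pos_space`, and `vel_space`. One possible way to set them up is sketched here; it is not necessarily the assignment's own setup. The bin edges span the MountainCar-v0 observation limits (position in [-1.2, 0.6], velocity in [-0.07, 0.07]). The name `n_bins`, the dictionary layout keyed by `((pos_bin, vel_bin), action)`, and the zero initial values are assumptions made for illustration.

```python
import numpy as np

n_bins = 20  # hypothetical name; matches the 20 discrete values per dimension

# Bin edges spanning the MountainCar-v0 observation limits.
pos_space = np.linspace(-1.2, 0.6, n_bins)
vel_space = np.linspace(-0.07, 0.07, n_bins)

# np.digitize returns indices 0..n_bins for n_bins edges, so each
# dimension needs n_bins + 1 slots to cover every possible index.
actions = [0, 1, 2]
q_table = {}
for pos_bin in range(n_bins + 1):
    for vel_bin in range(n_bins + 1):
        for action in actions:
            # Key layout matches lookups of the form q_table[observation, a],
            # where observation is a (pos_bin, vel_bin) tuple.
            q_table[(pos_bin, vel_bin), action] = 0.0  # assumed initial value

print(len(q_table))  # 21 * 21 * 3 = 1323 entries
```

A dictionary is used here because indexing with `q_table[observation, a]`, where `observation` is a tuple, maps directly onto a `((pos_bin, vel_bin), action)` key.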

Next, we will implement 2 helper functions: 

__1. make_discrete:__ 
  - The observation (position, velocity) values that we get from the environment are continuous (see the ranges you explored above). Since our Q-table allows 20 possible values each for position and velocity, we will map each continuous value to one of these discrete values.

__2. get_greedy_action:__
  - Make use of the Q-table learned so far to get the best possible action to take given an observation.

Use the code below to implement them: 

                    def make_discrete(observation):
                      pos, vel =  observation
                      pos_bin = int(np.digitize(pos, pos_space))
                      vel_bin = int(np.digitize(vel, vel_space))
                      return (pos_bin, vel_bin)

                    def get_greedy_action(q_table, observation, actions=[0, 1, 2]):
                      # For the current observation, get the most favorable action
                      values = np.array([q_table[observation,a] for a in actions])
                      greedy_action = np.argmax(values)
                      return greedy_action
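
To see what the two helpers return, the sketch below runs them on one hand-picked observation. It is self-contained, so it redefines both helpers; the bin edges (`np.linspace` over the MountainCar-v0 observation limits) and the values in `toy_q_table` are illustrative assumptions, not values from the assignment.

```python
import numpy as np

# Assumed bin edges: 20 evenly spaced values over each observation range.
pos_space = np.linspace(-1.2, 0.6, 20)
vel_space = np.linspace(-0.07, 0.07, 20)

def make_discrete(observation):
    pos, vel = observation
    # np.digitize returns the index of the bin that each value falls into.
    return (int(np.digitize(pos, pos_space)), int(np.digitize(vel, vel_space)))

def get_greedy_action(q_table, observation, actions=(0, 1, 2)):
    # Pick the action with the largest Q-value for this observation.
    values = np.array([q_table[observation, a] for a in actions])
    return int(np.argmax(values))

state = make_discrete((-0.5, 0.0))

# Made-up Q-values for this one state, for illustration only.
toy_q_table = {(state, 0): -3.0, (state, 1): -2.0, (state, 2): -1.5}

print(state)                                  # (8, 10)
print(get_greedy_action(toy_q_table, state))  # 2
```

Because the actions are `0`, `1`, and `2`, the index returned by `np.argmax` is also the action itself.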


---
## Exercise 1

__select_action function__:
  - Apply exploration (i.e., with a certain probability, apply a random action instead of the best action that we learned from the Q-table). 

In the implementation of the `select_action` function below, fill in the two marked placeholders with your code.

                    def select_action(current_exploration_rate, greedy_action):
                      rand_num = ######## YOUR CODE HERE (1) #########

                      if rand_num < current_exploration_rate:
                        action = ######## YOUR CODE HERE (2) #########
                      else: 
                        action = greedy_action

                      return action 

- Hints: 
    1. Sample a random number between 0 and 1 (from all possible numbers between 0 and 1) with uniform probability
    2. Randomly choose either of actions `0`, `1` or `2`, with equal probability.

- References: 
  - random.random ([link](https://www.w3schools.com/python/ref_random_random.asp))
  - random.choice ([link](https://www.w3schools.com/python/ref_random_choice.asp))

Next, we will set some parameters that influence how "exploratory" our agent is.
The exploration strategy we will use is an exponentially decaying $\epsilon$-greedy strategy. For episode $n$, the current exploration probability $\epsilon_n$ is:
$$\epsilon_n = \max\left[\epsilon_{min},~\epsilon_{start}\,(\epsilon_{decay~rate})^n\right]$$

The agent starts out taking exploratory actions `100%` of the time. This probability then decays exponentially with the episode number until it reaches a minimum of `5%`.

Set the exploration-specific parameters using:

                  eps_start = 1.0
                  eps_decay_rate = 0.95
                  eps_min = 0.05
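
As a quick check on this schedule, the sketch below evaluates the formula for a few episodes and finds the first episode at which the rate reaches its floor. The function name `exploration_rate` is hypothetical and used only for illustration.

```python
eps_start = 1.0
eps_decay_rate = 0.95
eps_min = 0.05

def exploration_rate(episode):
    # Closed form of the schedule: max(eps_min, eps_start * eps_decay_rate ** n)
    return max(eps_min, eps_start * eps_decay_rate ** episode)

for n in (0, 10, 30, 100):
    print(n, round(exploration_rate(n), 3))

# First episode at which the decayed value is clipped to eps_min.
first_floor_episode = next(n for n in range(1000) if exploration_rate(n) == eps_min)
print(first_floor_episode)  # 59
```

With these values the exploration rate reaches its `5%` floor at episode 59, so a long training run spends nearly all of its episodes at the minimum exploration rate.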

Further, set the environment- and learning-specific parameters using:

                  env._max_episode_steps = 1000
                  max_episodes = 50000
                  learning_rate = 0.1
                  gamma = 0.99


Below is the Q-Learning algorithm that we will be implementing in the main training loop. 

For further reading, see `Sutton and Barto 6.5` ([link](http://www.incompleteideas.net/book/RLbook2020.pdf#%5B%7B%22num%22%3A1800%2C%22gen%22%3A0%7D%2C%7B%22name%22%3A%22XYZ%22%7D%2C79.2%2C415.572%2Cnull%5D)).


<p align="center">
  <img src="https://user-images.githubusercontent.com/96804013/153653226-b68e09a6-b8db-48f1-80cf-42ad7425dd76.png"/>
</p>

<p align="center">
  <em>Figure 5: Q-Learning algorithm</em>
</p>

---
## Exercise 2

In the main training loop code below, implement the following equation from the algorithm above. 


$$Q(S,A) \leftarrow Q(S,A) + \alpha[ R + \gamma Q(S', a) - Q(S,A)]$$

where `S` = `current_obs`, `A` = `action`, `R` = `reward`, `S'` = `obs_next`, and `a` = `action_next` (the greedy action in `S'`). `α` is named `learning_rate` and `γ` is named `gamma`.
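
As a worked example of one update, suppose $Q(S,A) = -10$ and $Q(S',a) = -8$. Both Q-values are made up purely for illustration; $\alpha = 0.1$, $\gamma = 0.99$, and $R = -1$ match the settings used in this assignment.

$$Q(S,A) \leftarrow -10 + 0.1\,[\,-1 + 0.99 \times (-8) - (-10)\,] = -10 + 0.1 \times 1.08 = -9.892$$

The estimate moves one tenth of the way from its old value of $-10$ toward the target $R + \gamma Q(S', a) = -8.92$.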

                    current_exploration_rate = eps_start
                    reward_collected = []
                    start = time.time()

                    for i in range(max_episodes):
                      timestep = 0 
                      score = 0
                      done = False
                      current_obs = env.reset()
                      current_obs = make_discrete(current_obs)
                      
                      while not done:
                        greedy_action = get_greedy_action(q_table, current_obs)

                        action = select_action (current_exploration_rate, greedy_action)
                        obs_next, reward, done, info = env.step(action)
                        obs_next = make_discrete(obs_next)

                        action_next = get_greedy_action(q_table, obs_next)

                        q_table[current_obs, action] = ########### YOUR CODE HERE ##########
                        timestep += 1
                        score += reward
                        current_obs = obs_next
                        
                      if i % 100 == 0:
                        print(f"Episode {i}: Exploration rate = {round(current_exploration_rate,3)}, completed with {timestep} timesteps, score = {score}")
                        reward_collected.append(score)
                      
                      current_exploration_rate = max(eps_min, current_exploration_rate*eps_decay_rate)
      
                    print(f"total training time = {time.time() - start} seconds")



Use the following code to generate the reward plot:

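                    fig, ax = plt.subplots(figsize=(10, 5))
                    x = len(reward_collected)
                    # One score was logged every 100 episodes, so plot against the actual episode number
                    episodes = [100 * k for k in range(x)]
                    ax.plot(episodes, reward_collected)
                    ax.set_title(f"Reward Collected over {x*100} episodes")
                    ax.set_xlabel("Episodes")
                    ax.set_ylabel("Reward per episode")

                    # Place a tick every 5,000 episodes and label it as 0k, 5k, 10k, ...
                    ticks = list(range(0, 100*x + 1, 5000))
                    ax.set_xticks(ticks)
                    ax.set_xticklabels([f"{t // 1000}k" for t in ticks]);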

<p align="center">
  <img src="https://user-images.githubusercontent.com/96804013/153732989-c888162b-e0b0-4053-853e-6aaed5c7a93f.png"/>
</p>

<p align="center">
  <em>Figure 6: Reward plot</em>
</p>

- From the plot above: the agent learns to complete the task in fewer than 200 timesteps.

- The increase in the `reward collected per episode` as the episodes progress is a good indicator that the agent is learning to perform its task successfully.

- Now, use the code below to render a rollout of the agent performing the MountainCar task.


                    !pip install gym pyvirtualdisplay > /dev/null 2>&1
                    !apt-get install -y xvfb python-opengl ffmpeg > /dev/null 2>&1
                    !pip install colabgymrender==1.0.2

---

                    from colabgymrender.recorder import Recorder 

                    env = gym.make("MountainCar-v0")
                    env = Recorder(env, './video')

                    done = False 
                    current_observation = env.reset() 
                    
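                    while not done:
                      # Discretize the raw observation and look up the greedy action,
                      # matching the select_action(current_exploration_rate, greedy_action) signature
                      greedy_action = get_greedy_action(q_table, make_discrete(current_observation))
                      action = select_action(current_exploration_rate, greedy_action)
                      observation_next, reward, done, info = env.step(action)
                      current_observation = observation_next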
