<a href="https://colab.research.google.com/github/poudel-bibek/Intro-to-AI-Assignments/blob/main/A10_Task.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

![bhi](https://user-images.githubusercontent.com/96804013/152689231-de4db6bd-e653-4dfc-8881-dc3feef3389d.png)


# Assignment 10: Deep Q-Networks (Task)
---

In this assignment, we will use OpenAI Gym ([link](https://gym.openai.com/envs/CartPole-v1/)) to solve a classic control problem: balancing a CartPole. To do so, we will train a reinforcement learning agent using the DQN algorithm ([link](https://openai.com/blog/openai-baselines-dqn/)).


<p align="center">
  <img src="https://user-images.githubusercontent.com/96804013/152690560-12ee45f2-69bb-422c-b230-90e4430f15b3.gif"/>
</p>

<p align="center">
  <em>Figure 1: Performance of an untrained agent on Cartpole-v1</em>
</p>


<p align="center">
  <img src="https://user-images.githubusercontent.com/96804013/152689694-f72560fd-cbc1-4f55-87fc-c5e21655d667.gif"/>
</p>

<p align="center">
  <em>Figure 2: Performance of agent during training</em>
</p>

Let's start by importing necessary packages, libraries and specifying the environment. 

                    import time
                    import gym
                    import random
                    import numpy as np
                    from collections import deque

                    import matplotlib.pyplot as plt  # used later for the reward plot

                    import tensorflow as tf
                    from tensorflow import keras
                    from tensorflow.keras.models import Sequential
                    from tensorflow.keras.layers import Dense

                    env = gym.make("CartPole-v1")


<p align="center">
  <img src="https://user-images.githubusercontent.com/96804013/152835304-fa8af20e-d6c5-4b41-b36e-21b1ebe66240.png"/>
</p>

<p align="center">
  <em>Figure 3: The agent's interaction with the environment</em>
</p>


Before we jump into the task, let's get familiar with the OpenAI Gym terminology and the CartPole-v1 environment.

### In the CartPole-v1 environment: 
---
__Actions:__
  - 0 = Push cart to the left
  - 1 = Push cart to the right 

__Observations:__
  - Cart Position 
  - Cart Velocity 
  - Pole Angle 
  - Pole Velocity At Tip

__Reward:__ For each timestep in which the agent does NOT fail, it collects a reward of `+1`.

__Terminologies:__

- Step = an agent taking one action in the environment
- Observation = the agent's view of the environment state (for this assignment, the terms `states` and `observations` are used interchangeably)
- Reward = a value assigned by the environment indicating how "good" the last action taken by the agent was
- Done = whether or not the current episode is terminated

Run the snippet below to print various exploratory values. 

                    random_action = env.action_space.sample()
                    env.reset()
                    observation, reward, done, info = env.step(random_action)

                    print(f"Action = {random_action}")
                    print(f"Observation = {observation}, shape = {observation.shape}")
                    print(f"Reward = {reward}")
                    print(f"Done = {done}")

                    print(f"Number of actions that can be taken = {env.action_space.n}")
                    print(f"Limits of the observation: \n min = {env.observation_space.low} \n max = {env.observation_space.high}")


Now, we will write 2 helper functions: 

__1. select_action()__

- The agent can either use the action it has learned to be the best one (exploitation) or take a random action instead (exploration).
- We use the current exploration rate (which in the main loop decays exponentially from 100% exploration to a minimum of 5%) to decide which kind of action to take.
- Use the snippet below to implement the action selection function.

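                    def select_action(min_exploration_rate, current_exploration_rate, observation, dqn_agent):
                      # min_exploration_rate is kept in the signature but is not needed to pick an action
                      rand_num = random.random() 
                      if rand_num <= current_exploration_rate:
                        # Explore: sample a uniformly random action from the environment
                        action = env.action_space.sample()
                      else: 
                        # Exploit: pick the action with the highest predicted Q-value
                        action = np.argmax(dqn_agent.predict(observation)[0])
                      return action 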
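
The exploration/exploitation choice described above can be sketched without a neural network. The following is a minimal illustration, assuming a hypothetical helper `epsilon_greedy` and made-up Q-values for the two CartPole actions; it is not the assignment's `select_action` function.

```python
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """Return a random action with probability epsilon, else the greedy action."""
    if rng.random() < epsilon:
        # Explore: every action is equally likely.
        return rng.randrange(len(q_values))
    # Exploit: index of the largest Q-value.
    return max(range(len(q_values)), key=lambda a: q_values[a])

# Hypothetical Q-values for the actions (0 = left, 1 = right).
example_q = [0.2, 0.7]
rng = random.Random(0)

# With epsilon = 0 the agent always exploits; with epsilon = 1 it always explores.
greedy_picks = [epsilon_greedy(example_q, 0.0, rng) for _ in range(100)]
explore_picks = [epsilon_greedy(example_q, 1.0, rng) for _ in range(100)]
print("Actions seen when exploiting:", sorted(set(greedy_picks)))
print("Actions seen when exploring:", sorted(set(explore_picks)))
```

Intermediate values of `epsilon` mix the two behaviours, which is what the decaying exploration rate in the main loop relies on.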

__2. replay()__

  - The two major contributions of the DQN algorithm are:
    - The use of a target network: there are two copies of the agent, a policy network and a target network. The Q-values the agent currently assigns to every (state, action) pair come from the policy network, whereas the aspirational Q-values for every (state, action) pair (what the Q-values should be) come from the target network. In the main training loop, the target network receives a copy of the policy network's weights every N episodes (N = 10 here).

    - The concept of a replay memory: the agent's transition experiences are collected into a replay memory. After every episode, a batch of experiences is sampled from the replay memory, target values are computed for them, and the agent is trained on the result.


- For every experience sampled, the target Q-value is the sum of the reward received after taking action `a` in state `s` and the discounted maximum `Q` value over all possible actions from the next state `s'`. The `Q` on the right-hand side of the equation below is computed by the target network.

<br>

$$Q(s,a) = r(s,a) + \gamma \cdot \max_{a'} Q(s', a') \qquad \text{(equation 1)}$$
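
As a worked example of equation 1, assume (hypothetically) that the reward is $1$, $\gamma = 0.97$, and the target network predicts Q-values of $0.5$ (push left) and $2.0$ (push right) for the next state $s'$. These numbers are made up for illustration only:

$$Q(s,a) = 1 + 0.97 \cdot \max(0.5,\ 2.0) = 1 + 0.97 \cdot 2.0 = 2.94$$

If the episode had ended on this step, there would be no next state to look ahead to, so the target would simply be the reward, $1$.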

<br>
<br>

---
## Exercise 1

- In the snippet below for the replay function, fill in the code to implement equation 1.

- Hints: 
    - $r(s,a) = reward$
    - $\gamma$ is passed as an argument (`gamma`) to the function
    - For the max operation, use `np.amax()` from NumPy, reference ([link](https://numpy.org/doc/stable/reference/generated/numpy.amax.html))
    - Use `target_agent.predict(observation_next)[0]` to get the Q-values

                    def replay(memory, dqn_agent, target_agent, batch_size, gamma):
                      states_batch = [] 
                      q_values_batch = []

                      minibatch = random.sample(memory, batch_size)
                      for current_observation,action, reward, observation_next, done in minibatch: 
                        if not done:
                          target = ############ YOUR CODE HERE ###############
                        else: 
                          target = reward 

                        current_q = dqn_agent.predict(current_observation)
                        current_q[0][action] = target 

                        states_batch.append(current_observation[0])
                        q_values_batch.append(current_q[0]) 

                      states_batch = np.array(states_batch)
                      q_values_batch = np.array(q_values_batch)
                      dqn_agent.fit(states_batch, q_values_batch, epochs=1, verbose=0) 
                      return dqn_agent 

Now we are going to set some learning parameters:
                
                    max_episodes = 75 
                    gamma = 0.97 
                    memory = deque(maxlen=256)
                    target_update_steps = 10

                    num_actions = env.action_space.n
                    print(f"Number of actions that can be taken = {num_actions}")
                    num_observations = env.observation_space.shape[0]


                    batch_size = 64
                    learning_rate = 1e-4 #0.0001
                    Adam = keras.optimizers.Adam 
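
The replay memory configured above is a `deque` with a fixed `maxlen`, so the oldest experiences are discarded automatically once it is full. The sketch below shows that behaviour with a deliberately tiny capacity and dummy transitions; the name `demo_memory` and the values stored in it are assumptions for illustration only.

```python
import random
from collections import deque

# Tiny capacity so that eviction is easy to see (the assignment uses maxlen=256).
demo_memory = deque(maxlen=4)

for step in range(6):
    # Dummy transition: (observation, action, reward, next_observation, done)
    demo_memory.append((step, 0, 1.0, step + 1, False))

# Only the 4 most recent transitions remain; steps 0 and 1 were evicted.
oldest_kept = demo_memory[0][0]
print("Transitions stored:", len(demo_memory), "| oldest step kept:", oldest_kept)

# Sample a minibatch without replacement, as replay() does with random.sample.
rng = random.Random(0)
minibatch = rng.sample(list(demo_memory), 2)
print("Sampled steps:", [transition[0] for transition in minibatch])
```

Sampling at random from the memory, rather than training on the latest transitions in order, is what breaks up the correlation between consecutive timesteps.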

<br><br>
The DQN agent is a 4-layer neural network that takes observations from the environment as input and outputs one Q-value per action; the action with the highest Q-value is the one to take. 

---
## Exercise 2



Initialize a Sequential model with following code: 

                  dqn_agent = Sequential()

Use `dqn_agent.add()` function to build a DQN agent with the following architecture.
  - Dense layer with `64` output units and `relu` activation, remember to specify `input_dim`
  - Dense layer with `64` output units and `relu` activation
  - Dense layer with `24` output units and `relu` activation
  - Dense layer with `num_actions` output units and `linear` activation

- Compile the `dqn_agent` with `mse` loss and `Adam` optimizer (make sure you specify the `learning_rate` here)

- Print the `dqn_agent` model summary
- References: 
  - Dense ([link](https://keras.io/api/layers/core_layers/dense/))
  - Compile ([link](https://keras.io/api/models/model_training_apis/))


<br><br>

Next, we will set some parameters that influence how "exploratory" our agent is.
The exploration strategy that we will use is an exponentially decaying $\epsilon$-greedy strategy given by: 
$$\text{For episode } n:$$
$$\text{current exploration probability} = \max\left[\epsilon_{min},\ \epsilon_{start} \cdot (\epsilon_{decay~rate})^{n}\right]$$

The agent starts out taking `100%` exploratory actions. The exploration probability then decreases exponentially (as a function of the episode number) down to a minimum of `5%`.

Set params using: 

                  eps_start = 1.0
                  eps_decay_rate = 0.95
                  eps_min = 0.05
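
To get a feel for how quickly exploration decays with these parameters, the schedule can be tabulated directly from the formula above. This is a small sketch; the helper name `exploration_rate` is hypothetical, and the 75 episodes match `max_episodes` set earlier.

```python
eps_start = 1.0
eps_decay_rate = 0.95
eps_min = 0.05

def exploration_rate(n):
    """Exploration probability for episode n, using the closed-form schedule."""
    return max(eps_min, eps_start * eps_decay_rate ** n)

# Exploration probability for each of the 75 training episodes.
schedule = [exploration_rate(n) for n in range(75)]

# First episode at which the schedule is clamped to the minimum.
first_floor_episode = next(n for n, eps in enumerate(schedule) if eps == eps_min)

for n in (0, 10, 25, 50):
    print(f"Episode {n}: exploration rate = {schedule[n]:.3f}")
print(f"Minimum rate first reached at episode {first_floor_episode}")
```

The main training loop computes the same values incrementally by multiplying the current rate by `eps_decay_rate` after every episode and clamping at `eps_min`.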

---
## Exercise 3

__Main training loop:__

In the code snippet below, please fill in the empty spaces.

Hints: 

1. __Select action:__ make use of the select_action function defined above
2. __Take a step in the environment:__ make use of the step() function
3. __Gather experience:__ what constitutes a single experience?

                    update_counter = 0 
                    reward_collected = []
                   
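                    # The target network must be a separate model, not an alias of dqn_agent
                    target_agent = keras.models.clone_model(dqn_agent)
                    target_agent.set_weights(dqn_agent.get_weights())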
                    current_exploration_rate = eps_start 
                    
                    start = time.time()
                    for episode in range(max_episodes): 

                      print(f"Episode {episode}:", end = ' ')
                      current_observation = env.reset().reshape(1,-1) 

                      done = False 
                      score = 0
                      timestep = 0

                      while not done:
                        action = ##########YOUR CODE HERE (1) ##############
                       
                        observation_next , reward, done, info  = ##########YOUR CODE HERE (2) ##############
                        observation_next = observation_next.reshape(1,-1)

                        experience = (##########YOUR CODE HERE (3) ##############)
                        memory.append(experience)

                        current_observation = observation_next
                        score += reward
                        timestep+=1 

                      # Out of while loop
                      print(f"Exploration rate = {round(current_exploration_rate,3)},", end = "\t")
                      current_exploration_rate = max(eps_min, current_exploration_rate*eps_decay_rate) 
                      
                      print(f"completed with {timestep} timesteps, score = {score}", end = '\n')
                      reward_collected.append(score)

                      if len(memory) >= batch_size: 
                        dqn_agent  = replay(memory, dqn_agent, target_agent, batch_size, gamma)

                      update_counter+= 1
                      if update_counter% target_update_steps ==0: 
                        target_agent.set_weights(dqn_agent.get_weights()) 

                    env.close()
                    print(f"Total training time taken = {time.time() - start} seconds")
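
The periodic target update at the end of each episode can be pictured with plain Python lists standing in for network weights. This is only an illustrative sketch: the lists, the fake "training step", and the 25-episode horizon are assumptions, whereas real Keras models exchange weights through `get_weights()` and `set_weights()`.

```python
target_update_steps = 10

policy_weights = [0.0, 0.0]            # stand-in for the policy network's weights
target_weights = list(policy_weights)  # separate copy for the target network
sync_episodes = []

update_counter = 0
for episode in range(25):
    # Pretend training step: the policy weights change in place every episode.
    for i in range(len(policy_weights)):
        policy_weights[i] += 0.1

    update_counter += 1
    if update_counter % target_update_steps == 0:
        # Hard update: the target takes a snapshot of the policy weights.
        target_weights = list(policy_weights)
        sync_episodes.append(episode)

print("Target synced after episodes:", sync_episodes)
print("Policy weights:", policy_weights)
print("Target weights:", target_weights)
```

Between syncs the target stays frozen while the policy keeps changing. This only holds when the two are distinct objects; if the target were merely another name for the policy, it would follow every update immediately.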

Next, use the snippet below to plot the reward collected during training: 

                    fig, ax = plt.subplots(figsize = (8,5))
                    x = len(reward_collected)
                    ax.plot(range(x), reward_collected)
                    ax.set_title(f"Reward Collected over {x} episodes")
                    ax.set_xlabel("Episodes")
                    ax.set_ylabel("Reward per episode")
                    ax.set_xticks(range(0,x+1,10));
                    

<p align="center">
  <img src="https://user-images.githubusercontent.com/96804013/153685218-326ad9a1-e6f9-49a8-aab0-358e76bf236e.png"/>
</p>

<p align="center">
  <em>Figure 4: Reward plot (Your plot may look different) </em>
</p>



- The increase in the `reward collected per episode` as the episodes progress is a good indicator for the agent learning to perform its task successfully. 

- Now, use the code below to render a rollout of the agent performing the CartPole balancing task.


                    !pip install gym pyvirtualdisplay > /dev/null 2>&1
                    !apt-get install -y xvfb python-opengl ffmpeg > /dev/null 2>&1
                    !pip install colabgymrender==1.0.2

---

                    from colabgymrender.recorder import Recorder 

                    env = gym.make("CartPole-v1")
                    env = Recorder(env, './video')

                    observation = env.reset().reshape(1,-1)
                    done = False
                    while not done:
                      learned_action = np.argmax(dqn_agent.predict(observation)[0])
                      observation, reward, done, info = env.step(learned_action)  
                      observation = observation.reshape(1,-1)
                    env.play()

## Optional: 

- Since fully training the agent can take a long time, you have the option to __terminate the training cell__ above, load an agent that was trained for 1000 episodes, and see its performance.

- To do so, use the following code: 

                    !wget -q https://github.com/poudel-bibek/Intro-to-AI-Assignments/files/8027697/my_agent.zip
                    !unzip -o -q ./my_agent.zip -d unzipped/ 
                    trained_agent = tf.keras.models.load_model('./unzipped/my_agent_50')
---

                    env = gym.make("CartPole-v1")
                    env = Recorder(env, './video')

                    observation = env.reset().reshape(1,-1)
                    done = False
                    while not done:
                      # Get best action from agent 
                      learned_action = np.argmax(trained_agent.predict(observation)[0])

                      # Apply the best action in the environment
                      observation, reward, done, info = env.step(learned_action)  
                      observation = observation.reshape(1,-1)
                    env.play()