<a href="https://colab.research.google.com/github/poudel-bibek/Intro-to-AI-Assignments/blob/main/A10_Task.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

![bhi](https://user-images.githubusercontent.com/96804013/152689231-de4db6bd-e653-4dfc-8881-dc3feef3389d.png)


# Assignment 10: Deep Q-Networks (Task)
---

In this assignment, we will use OpenAI Gym ([link](https://gym.openai.com/envs/CartPole-v1/)) to solve a classic control problem: balancing a CartPole. To do so, we will train a reinforcement learning agent using the DQN algorithm ([link](https://openai.com/blog/openai-baselines-dqn/)).


<p align="center">
  <img src="https://user-images.githubusercontent.com/96804013/152690560-12ee45f2-69bb-422c-b230-90e4430f15b3.gif"/>
</p>

<p align="center">
  <em>Figure 1: Performance of an untrained agent on CartPole-v1</em>
</p>


<p align="center">
  <img src="https://user-images.githubusercontent.com/96804013/152689694-f72560fd-cbc1-4f55-87fc-c5e21655d667.gif"/>
</p>

<p align="center">
  <em>Figure 2: Performance of agent during training</em>
</p>


Let's start by importing the necessary packages and libraries and specifying the environment.

                    import time
                    import gym
                    import random
                    import numpy as np
                    from collections import deque

                    import tensorflow as tf
                    from tensorflow import keras
                    from tensorflow.keras.models import Sequential # Use tf.keras consistently
                    from tensorflow.keras.layers import Dense

                    env = gym.make("CartPole-v1")

Next, set the random seeds so that experiments are reproducible.

                    SEED = 42
                    random.seed(SEED)
                    np.random.seed(SEED)
                    tf.random.set_seed(SEED)

<p align="center">
  <img src="https://user-images.githubusercontent.com/96804013/152835304-fa8af20e-d6c5-4b41-b36e-21b1ebe66240.png"/>
</p>

<p align="center">
  <em>Figure 3: The agent's interaction with the environment</em>
</p>


Before we jump into the task, let's get familiar with OpenAI Gym terminology and the CartPole-v1 environment.

### Terminology
---
- Step = the agent taking one action in the environment
- Observation = the agent's view of the environment state
- Reward = a value assigned by the environment indicating how "good" the last action taken by the agent was
- Done = whether or not the current episode has terminated

Run the snippet below to print shapes and values. 

                    random_action = env.action_space.sample()
                    env.reset()
                    observation, reward, done, info = env.step(random_action)

                    print(f"Action = {random_action}")
                    print(f"Observation = {observation}, shape = {observation.shape}")
                    print(f"Reward = {reward}")
                    print(f"Done = {done}")

Now, we will write 2 helper functions: 

__1. select_action()__

- The agent chooses between the action that it has learned is the best one to take (exploitation) and a random action (exploration)
- We use the exploration rate (which, in the main loop, decays linearly from 50% to a floor of 2.5%) to decide which kind of action to take
- Use the snippet below to implement the action selection function

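                    def select_action(exploration_rate, observation, dqn_agent):
                      rand_num = random.random() # Draw a uniform random number in [0, 1)
                      if rand_num <= exploration_rate:
                        # Explore: pick one of the two CartPole actions (0 or 1) uniformly at random
                        action = random.randint(0, 1)
                      else: 
                        # Exploit: pick the action with the highest predicted Q-value
                        action = np.argmax(dqn_agent.predict(observation)[0])

                      return action 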
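
The exploration schedule mentioned above can be previewed on its own before training. The following is a minimal sketch that reuses the decay expression from the main training loop; the helper name `exploration_rate_for` and the sampled episode numbers are illustrative assumptions, not part of the assignment code.

```python
def exploration_rate_for(episode, max_episodes=100):
    """Exploration rate for one episode (hypothetical helper, same expression as the training loop)."""
    return 0.5 * max(0.05, (max_episodes - 5 * episode) / max_episodes)

# Print the rate at a few illustrative episodes
for episode in (0, 5, 10, 19, 50):
    print(f"episode {episode:3d}: exploration rate = {exploration_rate_for(episode):.3f}")
```

With `max_episodes = 100`, the rate starts at 0.5 and reaches its floor of 0.025 by episode 19, so most of the training run is spent mainly exploiting what the agent has learned.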

__2. replay()__

- One of the major contributions of the DQN algorithm is the concept of a replay memory: the agent trains on stored transitions from its past experience.

- For every experience sampled, the target Q-value is the sum of the reward received after taking action `a` in state `s` and the discounted maximum Q-value over all possible actions from the next state `s'`.

<br>

$$Q(s,a) = r(s,a) + \gamma \cdot \max_{a'} Q(s', a') \qquad \text{(equation 1)}$$

<br>
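
As a worked illustration of equation 1, assume (hypothetically) that a sampled transition has reward $r(s,a) = 1$, that $\gamma = 0.95$, and that the network predicts Q-values $0.4$ and $0.7$ for the two actions available in the next state $s'$. These numbers are made up for illustration only:

$$Q(s,a) = 1 + 0.95 \cdot \max(0.4,\ 0.7) = 1 + 0.95 \cdot 0.7 = 1.665$$

For a terminal transition there is no next state to look ahead to, so the target reduces to the reward alone.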

---
## Exercise 1

- In the snippet below for the replay function, fill in the code to implement equation 1.

- Hints: 
    - $r(s,a) = reward$
    - $\gamma$ is passed as an argument (`gamma`) to the function
    - For the max operation, use `np.amax()` from NumPy, reference ([link](https://numpy.org/doc/stable/reference/generated/numpy.amax.html))
    - Use `dqn_agent.predict(observation_next)[0]` to get the Q-values of the next state

                    def replay(memory, dqn_agent, batch_size, gamma):
                      minibatch = random.sample(memory, batch_size)
                      for current_observation, action, reward, observation_next, done in minibatch: 
                        if not done:
                          target = *******************YOUR CODE HERE *******************

                        else: 
                          target = reward 

                        current_q = dqn_agent.predict(current_observation) 
                        current_q[0][action] = target 
                        dqn_agent.fit(current_observation, current_q, epochs=1, verbose=0)

Now we are going to set some learning parameters:

                    max_episodes = 100 
                    max_timesteps = 200 
                    gamma = 0.95 # Discount factor
                    memory = deque(maxlen=800)

                    num_actions = env.action_space.n
                    print(f"Number of actions that can be taken = {num_actions}")
                    num_observations = env.observation_space.shape[0]

                    batch_size = 16 
                    learning_rate = 0.001
                    Adam = keras.optimizers.Adam 
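
The replay memory above is a bounded `deque`, so once it is full, each new experience pushes out the oldest one. The sketch below illustrates that behavior and the minibatch sampling used in `replay()`, with a tiny capacity and made-up placeholder transitions (both are assumptions chosen for readability, not values from the assignment).

```python
import random
from collections import deque

# Tiny capacity for illustration; the assignment uses maxlen=800
memory = deque(maxlen=3)

for step in range(5):
    # Hypothetical transition: (observation, action, reward, next_observation, done)
    memory.append((step, 0, 1.0, step + 1, False))

# Only the three most recent transitions remain
print([transition[0] for transition in memory])  # [2, 3, 4]

# Draw a minibatch of stored transitions without replacement
minibatch = random.sample(memory, 2)
print(len(minibatch))  # 2
```

Sampling at random from the memory, rather than always training on the latest transition, mixes older and newer experiences in each update.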

The DQN agent is a neural network model that takes observations from the environment as input and outputs one Q-value per action; the action with the highest Q-value is the one the agent prefers.
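
To make the output format concrete, here is a small sketch of how an action is read off the network's prediction. The array `q_values` is a hypothetical stand-in for what `dqn_agent.predict(observation)` might return for a single observation; the numbers are invented.

```python
import numpy as np

# Hypothetical prediction for one observation: shape (1, num_actions)
q_values = np.array([[0.12, 0.48]])

# The greedy action is the index of the largest Q-value in the first (only) row
greedy_action = int(np.argmax(q_values[0]))
print(f"Q-values = {q_values[0]}, greedy action = {greedy_action}")
```

Here the second entry is larger, so the greedy action is `1`.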

---
## Exercise 2


Initialize a Sequential model with the following code: 

                  dqn_agent = Sequential()

Use the `dqn_agent.add()` function to build a DQN agent with the following architecture:
  - Dense layer with `24` output units and `relu` activation; remember to specify `input_dim`
  - Dense layer with `24` output units and `relu` activation
  - Dense layer with `num_actions` output units and `linear` activation

- Compile the `dqn_agent` with `mse` loss and the `Adam` optimizer (make sure you specify the `learning_rate` here)

- Print the `dqn_agent` model summary
- References: 
  - Dense ([link](https://keras.io/api/layers/core_layers/dense/))

---
## Exercise 3

In the code snippet below, please fill in the empty spaces.

Hints: 

1. __Select action:__ make use of the `select_action()` function defined above
2. __Take a step in the environment:__ make use of the `step()` function
3. __Gather experience:__ what constitutes a single experience?


                    start = time.time()
                    for episode in range(max_episodes): 
                      print(f"Training episode {episode}", end = '\t')

                      # Start each episode by initializing the environment (reset)
                      current_observation = env.reset().reshape(1,-1) # Make shape compatible

                      for timestep in range(max_timesteps): 

                        # Linearly decay from 50% exploration to 2.5%
                        current_exploration_rate = 0.5*max(0.05, (max_episodes- 5*episode)/max_episodes)

                        # select an action according to epsilon greedy strategy 
                        action = *******************YOUR CODE HERE(1)*******************
                        
                        # Take a step in the environment
                        observation_next , reward, done, info  = *******************YOUR CODE HERE(2)*******************
                        observation_next = observation_next.reshape(1,-1)

                        # Gather the experience and append it to memory 
                        experience = (*******************YOUR CODE HERE(3)*******************)
                        memory.append(experience)

                        # If we have enough memories collected
                        if len(memory) > batch_size: 
                          # Replay the memory of experiences collected so far (to train agent)
                          replay(memory, dqn_agent, batch_size, gamma)

                        # We are done with this timestep, so current observation <-- next observation
                        current_observation = observation_next

                        if done: 
                          # If the agent fails, episodes are terminated before max_timesteps
                          print(f"completed with {timestep + 1} timesteps", end = '\n')
                          break # Get out of the loop to next episode if this episode is finished 
                      
                    # Save a trained agent (at the end of all episodes)
                    dqn_agent.save("my_agent")

                    env.close()
                    print(f"Total training time taken = {time.time() - start} seconds")
                    trained_agent = tf.keras.models.load_model("my_agent")

---
- An increase in the number of timesteps per episode as training progresses is a good indicator that the agent is learning to perform its task. 

- Since fully training the agent can take a long time, you have the option to __terminate the training cell__ above and instead load an agent that was trained for 500 episodes to see its performance.

- To do so, use the following code: 

                    !wget -q https://github.com/poudel-bibek/Intro-to-AI-Assignments/files/8027697/my_agent.zip
                    !unzip -o -q ./my_agent.zip -d unzipped/ 
                    trained_agent = tf.keras.models.load_model('./unzipped/my_agent')
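
Because per-episode timestep counts are noisy, the learning trend noted above is often easier to see after smoothing. The following sketch computes a simple moving average over episode lengths; the `moving_average` helper and the `episode_lengths` values are hypothetical, and in practice you would record `timestep + 1` for each episode during training.

```python
import numpy as np

def moving_average(values, window):
    """Average each run of `window` consecutive values (hypothetical helper)."""
    values = np.asarray(values, dtype=float)
    kernel = np.ones(window) / window
    return np.convolve(values, kernel, mode="valid")

# Made-up episode lengths, for illustration only
episode_lengths = [12, 15, 11, 30, 42, 55, 80, 120]
smoothed = moving_average(episode_lengths, window=4)
print(smoothed)
```

A smoothed curve that rises over episodes suggests the agent is keeping the pole balanced for longer.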

- Now, use the code below (with either the hosted agent or the agent that you trained) to render a rollout of the agent performing the CartPole balancing task.


                    !pip install gym pyvirtualdisplay > /dev/null 2>&1
                    !apt-get install -y xvfb python-opengl ffmpeg > /dev/null 2>&1
                    !pip install colabgymrender==1.0.2

---

                    from colabgymrender.recorder import Recorder 

                    env = gym.make("CartPole-v1")
                    env = Recorder(env, './video')
                    observation = env.reset().reshape(1,-1)

                    for timestep in range(max_timesteps):
                      learned_action = np.argmax(trained_agent.predict(observation)[0])
                      observation, reward, done, info = env.step(learned_action)  
                      observation = observation.reshape(1,-1)

                      if done: 
                        break 

                    env.play()