# Machine Learning and AI for Autonomous Systems
## A program by IISc and TalentSprint
### Assignment: Deep Q-Learning (DQN)

## Learning Objectives

At the end of the experiment, you will be able to:

* understand Q-learning
* differentiate between Q-learning and Deep Q-learning
* implement Deep Q-learning to solve the Atari Ms-Pacman environment

### Setup Steps:

In [1]:
#@title Please enter your registration id to start: { run: "auto", display-mode: "form" }
Id = "2302794" #@param {type:"string"}

In [2]:
#@title Please enter your password (your registered phone number) to continue: { run: "auto", display-mode: "form" }
password = "9008710123" #@param {type:"string"}

In [3]:
#@title Run this cell to complete the setup for this Notebook
from IPython import get_ipython

ipython = get_ipython()

notebook= "AIAS_B2_M4_AST_06_Deep_Q_Learning_C" #name of the notebook

def setup():
#  ipython.magic("sx pip3 install torch")
    ipython.magic("sx wget https://files.grouplens.org/datasets/movielens/ml-25m.zip")
    ipython.magic("sx unzip ml-25m.zip")
    from IPython.display import HTML, display
    display(HTML('<script src="https://dashboard.talentsprint.com/aiml/record_ip.html?traineeId={0}&recordId={1}"></script>'.format(getId(),submission_id)))
    print("Setup completed successfully")
    return

def submit_notebook():
    ipython.magic("notebook -e "+ notebook + ".ipynb")

    import requests, json, base64, datetime

    url = "https://dashboard.talentsprint.com/xp/app/save_notebook_attempts"
    if not submission_id:
      data = {"id" : getId(), "notebook" : notebook, "mobile" : getPassword()}
      r = requests.post(url, data = data)
      r = json.loads(r.text)

      if r["status"] == "Success":
          return r["record_id"]
      elif "err" in r:
        print(r["err"])
        return None
      else:
        print ("Something is wrong, the notebook will not be submitted for grading")
        return None

    elif getAnswer() and getComplexity() and getAdditional() and getConcepts() and getComments() and getMentorSupport():
      f = open(notebook + ".ipynb", "rb")
      file_hash = base64.b64encode(f.read())

      data = {"complexity" : Complexity, "additional" :Additional,
              "concepts" : Concepts, "record_id" : submission_id,
              "answer" : Answer, "id" : Id, "file_hash" : file_hash,
              "notebook" : notebook,
              "feedback_experiments_input" : Comments,
              "feedback_mentor_support": Mentor_support}
      r = requests.post(url, data = data)
      r = json.loads(r.text)
      if "err" in r:
        print(r["err"])
        return None
      else:
        print("Your submission is successful.")
        print("Ref Id:", submission_id)
        print("Date of submission: ", r["date"])
        print("Time of submission: ", r["time"])
        print("View your submissions: https://aias-iisc.talentsprint.com/notebook_submissions")
        #print("For any queries/discrepancies, please connect with mentors through the chat icon in LMS dashboard.")
        return submission_id
    else:
      return submission_id


def getAdditional():
  try:
    if not Additional:
      raise NameError
    else:
      return Additional
  except NameError:
    print ("Please answer Additional Question")
    return None

def getComplexity():
  try:
    if not Complexity:
      raise NameError
    else:
      return Complexity
  except NameError:
    print ("Please answer Complexity Question")
    return None

def getConcepts():
  try:
    if not Concepts:
      raise NameError
    else:
      return Concepts
  except NameError:
    print ("Please answer Concepts Question")
    return None


# def getWalkthrough():
#   try:
#     if not Walkthrough:
#       raise NameError
#     else:
#       return Walkthrough
#   except NameError:
#     print ("Please answer Walkthrough Question")
#     return None

def getComments():
  try:
    if not Comments:
      raise NameError
    else:
      return Comments
  except NameError:
    print ("Please answer Comments Question")
    return None


def getMentorSupport():
  try:
    if not Mentor_support:
      raise NameError
    else:
      return Mentor_support
  except NameError:
    print ("Please answer Mentor support Question")
    return None

def getAnswer():
  try:
    if not Answer:
      raise NameError
    else:
      return Answer
  except NameError:
    print ("Please answer Question")
    return None


def getId():
  try:
    return Id if Id else None
  except NameError:
    return None

def getPassword():
  try:
    return password if password else None
  except NameError:
    return None

submission_id = None
### Setup
if getPassword() and getId():
  submission_id = submit_notebook()
  if submission_id:
    setup()
else:
  print ("Please complete Id and Password cells before running setup")



Setup completed successfully


## Information

**Q-Learning**

The agent will perform the sequence of actions that will eventually generate the maximum total reward. This total reward is also called the **Q-value** and we will formalize our strategy as:

$$Q(s, a) = r(s, a) + \gamma \max_{a'} Q(s', a')$$

The above equation states that the Q-value yielded from being at state $s$ and performing action $a$ is the immediate reward $r(s, a)$ plus the discounted highest Q-value possible from the next state $s'$. Gamma ($\gamma$) here is the discount factor, which controls the contribution of rewards further in the future.

$Q(s', a')$ in turn depends on $Q(s'', a'')$, which then carries a coefficient of gamma squared. So, the Q-value depends on the Q-values of future states as shown here:

$$Q(s, a) \rightarrow \gamma\, Q(s', a') + \gamma^2\, Q(s'', a'') + \dots + \gamma^n\, Q(s^{(n)}, a^{(n)})$$

Adjusting the value of gamma will diminish or increase the contribution of future rewards.
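As a small illustration of how gamma scales future rewards, the sketch below sums a made-up reward sequence under three discount factors. The reward values and the helper name `discounted_return` are assumptions chosen for illustration; they are not part of the notebook.

```python
# Hypothetical reward sequence: a reward of 1.0 at each of 5 consecutive steps.
rewards = [1.0, 1.0, 1.0, 1.0, 1.0]

def discounted_return(rewards, gamma):
    """Sum of gamma**k * r_k, where k is the number of steps into the future."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

for gamma in (0.0, 0.5, 0.99):
    # gamma = 0 keeps only the immediate reward; gamma near 1 keeps almost all of them
    print(f"gamma={gamma}: discounted return = {discounted_return(rewards, gamma):.4f}")
```

With gamma close to 0 the agent is effectively short-sighted, while values close to 1 (the notebook later uses 0.99) weight distant rewards almost as much as immediate ones.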

Since this is a recursive equation, we can start by making arbitrary assumptions for all Q-values. With experience, the estimates converge toward the optimal Q-values, from which the optimal policy follows.
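The idea of starting from arbitrary Q-values and repeatedly applying the recursive equation can be sketched on a toy problem. The example below assumes a hypothetical 4-state deterministic chain (not an environment from this notebook) whose transition model is known, so the equation is applied as a direct assignment; the sample-based Q-learning update with a learning rate is not shown here.

```python
import numpy as np

n_states, n_actions, gamma = 4, 2, 0.9
TERMINAL = 3  # reaching state 3 ends the episode with reward 1

def step(s, a):
    """Hypothetical deterministic chain: action 0 moves left, action 1 moves right."""
    s_next = max(s - 1, 0) if a == 0 else min(s + 1, TERMINAL)
    reward = 1.0 if s_next == TERMINAL else 0.0
    return s_next, reward

rng = np.random.default_rng(0)
Q = rng.uniform(-1, 1, size=(n_states, n_actions))  # arbitrary initial guesses

for sweep in range(50):
    for s in range(TERMINAL):          # the terminal state has no actions to update
        for a in range(n_actions):
            s_next, r = step(s, a)
            # No future value once the episode has ended
            future = 0.0 if s_next == TERMINAL else Q[s_next].max()
            Q[s, a] = r + gamma * future

print(np.round(Q[:TERMINAL], 3))       # rows: states 0..2, columns: left, right
greedy_policy = Q[:TERMINAL].argmax(axis=1)
print(greedy_policy)                   # greedy action per non-terminal state
```

Regardless of the random starting values, the non-terminal rows settle to the same fixed point, and the greedy policy moves right toward the rewarding state.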

To know more about Q-Learning, click [here](https://github.com/rishal-hurbans/Grokking-Artificial-Intelligence-Algorithms/tree/master/ch10-reinforcement_learning).




**Approximate Q-Learning and Deep Q-Learning**

The main problem with Q-Learning is that it does not scale well to large (or even medium) Markov Decision Processes with many states and actions, and it is hard to keep track of an estimate for every single Q-Value.
<br><br>
<center>
<img src="https://cdn.analyticsvidhya.com/wp-content/uploads/2019/04/Screenshot-2019-04-16-at-5.46.01-PM-670x440.png" width=650px />
</center>
<br><br>

The solution is to find a function $Q_θ(s, a)$ that approximates the Q-Value of any state-action pair (s, a) using a manageable number of parameters (given by the parameter vector θ). This is called **Approximate Q-Learning**.

$$Q_{target}(s, a) = r + \gamma \max_{a'} Q_{\theta}(s', a')$$

For years it was recommended to use linear combinations of handcrafted features extracted from the state to estimate Q-Values, but in 2013, DeepMind showed that deep neural networks can work much better, especially for complex problems, and it does not require any feature engineering. A DNN used to estimate Q-Values is called a Deep Q-Network (DQN), and using a DQN for Approximate Q-Learning is called **Deep Q-Learning**.

In deep Q-learning, we use a neural network to approximate the Q-value function. The state is given as the input and the Q-value of all possible actions is generated as the output.
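To make the "state in, one Q-value per action out" idea concrete, here is a minimal NumPy sketch of a forward pass through a tiny fully connected network. The layer sizes, the random weights, and the function name `q_values` are assumptions for illustration only; the network actually used in this notebook is the convolutional Keras model defined later.

```python
import numpy as np

rng = np.random.default_rng(0)
state_dim, hidden_dim, num_actions = 8, 16, 9  # hypothetical sizes

# Randomly initialised parameters theta = (W1, b1, W2, b2); untrained
W1 = rng.normal(scale=0.1, size=(state_dim, hidden_dim))
b1 = np.zeros(hidden_dim)
W2 = rng.normal(scale=0.1, size=(hidden_dim, num_actions))
b2 = np.zeros(num_actions)

def q_values(state):
    """Approximate Q_theta(state, a) for every action a in one forward pass."""
    hidden = np.maximum(0.0, state @ W1 + b1)  # ReLU hidden layer
    return hidden @ W2 + b2                    # linear output, one value per action

state = rng.normal(size=state_dim)             # stand-in for an observed state
q = q_values(state)
greedy_action = int(np.argmax(q))              # action with the largest predicted Q-value
print(q.shape, greedy_action)
```

A single forward pass yields all action values at once, which is why the greedy action can be picked with one `argmax` instead of one network evaluation per action.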

### Implementing Deep Q-Learning for Atari Ms-Pacman

#### **Atari Ms-Pacman**

<center>
<img src="https://cdn.iisc.talentsprint.com/AIAS/Pacman.png" width=450px height=350px/>
</center>
<br><br>

In this environment, a player controls the Pacman, who attempts to collect all of the pellets while avoiding ghosts that pursue him. The agent must learn to control the Pacman by moving left, right, up, and down, collecting all the pellets without being caught by any of the ghosts.

Let's look at the different aspects, such as rewards, states, and actions, that need to be considered while modeling an RL solution for this problem.


#### **Actions:**

There are 9 discrete deterministic actions:

- 0: NOOP (no operation)
- 1: UP
- 2: RIGHT
- 3: LEFT
- 4: DOWN
- 5: UPRIGHT
- 6: UPLEFT
- 7: DOWNRIGHT
- 8: DOWNLEFT

#### **States**

By default, the environment returns the RGB image, of shape (210, 160, 3), that is displayed to human players as an observation.

#### **Rewards**

Points are obtained by eating pellets, while avoiding ghosts (contact with one causes Ms. Pac-Man to lose a life). Eating one of the special power pellets turns the ghosts blue for a small duration, allowing them to be eaten for extra points.

To learn more about the Ms. Pacman environment, see [here](https://www.gymlibrary.dev/environments/atari/ms_pacman/).

### Install dependencies

In [4]:
!pip3 -q install PyVirtualDisplay
!sudo apt-get install xvfb
!sudo apt-get install python-opengl
!sudo apt-get install ffmpeg
!pip -q install gym-notebook-wrapper
!pip -q install gym[atari]
!pip -q install gym[accept-rom-license]
!pip -q install pyglet
!sudo apt install freeglut3-dev freeglut3 libgl1-mesa-dev libglu1-mesa-dev libxext-dev libxt-dev
!sudo apt install python3-opengl libgl1-mesa-glx libglu1-mesa

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
The following additional packages will be installed:
  libfontenc1 libxfont2 libxkbfile1 x11-xkb-utils xfonts-base xfonts-encodings
  xfonts-utils xserver-common
The following NEW packages will be installed:
  libfontenc1 libxfont2 libxkbfile1 x11-xkb-utils xfonts-base xfonts-encodings
  xfonts-utils xserver-common xvfb
0 upgraded, 9 newly installed, 0 to remove and 45 not upgraded.
Need to get 7,813 kB of archives.
After this operation, 11.9 MB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu jammy/main amd64 libfontenc1 amd64 1:1.1.4-1build3 [14.7 kB]
Get:2 http://archive.ubuntu.com/ubuntu jammy/main amd64 libxfont2 amd64 1:2.0.5-1build1 [94.5 kB]
Get:3 http://archive.ubuntu.com/ubuntu jammy/main amd64 libxkbfile1 amd64 1:1.1.0-1build3 [71.8 kB]
Get:4 http://archive.ubuntu.com/ubuntu jammy/main amd64 x11-xkb-utils amd64 7.7+5build4 [172 kB]
Get:5 http://archiv

### Import required packages

In [5]:
import numpy as np
import gym
import gnwrapper
import glob
import io
import base64
from IPython.display import HTML
from pyvirtualdisplay import Display
from IPython import display as ipythondisplay
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
import urllib.request
from IPython.display import clear_output

import warnings
warnings.filterwarnings("ignore")

In [6]:
# List of available environments

print(gym.envs.registry.all())

dict_values([EnvSpec(id='ALE/Tetris-v5', entry_point='gym.envs.atari:AtariEnv', reward_threshold=None, nondeterministic=False, max_episode_steps=27000, order_enforce=True, autoreset=False, disable_env_checker=False, new_step_api=False, kwargs={'game': 'tetris', 'obs_type': 'rgb', 'repeat_action_probability': 0.25, 'full_action_space': False, 'frameskip': 4}, namespace='ALE', name='Tetris', version=5), EnvSpec(id='ALE/Tetris-ram-v5', entry_point='gym.envs.atari:AtariEnv', reward_threshold=None, nondeterministic=False, max_episode_steps=27000, order_enforce=True, autoreset=False, disable_env_checker=False, new_step_api=False, kwargs={'game': 'tetris', 'obs_type': 'ram', 'repeat_action_probability': 0.25, 'full_action_space': False, 'frameskip': 4}, namespace='ALE', name='Tetris-ram', version=5), EnvSpec(id='ALE/Galaxian-v5', entry_point='gym.envs.atari:AtariEnv', reward_threshold=None, nondeterministic=False, max_episode_steps=27000, order_enforce=True, autoreset=False, disable_env_check

### Configure parameters

We will use an epsilon-greedy strategy for action selection: with probability epsilon the agent samples a random action from the action space, and otherwise it takes the action with the highest predicted Q-value. Instead of using a multiplicative epsilon decay, we linearly anneal epsilon from 1 to 0.1 over 1 million frames, following DeepMind's specification.
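The linear annealing schedule described above can be written as a closed-form function of the frame count. This is a sketch under the stated values (1.0 down to 0.1 over 1 million frames); the function name `linear_epsilon` and its parameter names are hypothetical, and the training cell in this notebook implements the same schedule incrementally by subtracting a fixed amount each frame.

```python
def linear_epsilon(frame, eps_max=1.0, eps_min=0.1, anneal_frames=1_000_000):
    """Epsilon after `frame` frames: a straight line from eps_max to eps_min, then flat."""
    fraction = min(frame / anneal_frames, 1.0)  # progress through the schedule, capped at 1
    return eps_max - fraction * (eps_max - eps_min)

for frame in (0, 250_000, 500_000, 1_000_000, 2_000_000):
    print(f"frame {frame:>9}: epsilon = {linear_epsilon(frame):.3f}")
```

After the annealing window ends, epsilon stays at its minimum, so the agent keeps a small amount of exploration for the rest of training.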

In [7]:
seed = 42

# Discount factor for future rewards
gamma = 0.99

# Epsilon greedy parameter
epsilon = 1.0

# Minimum epsilon greedy parameter
epsilon_min = 0.1

# Maximum epsilon greedy parameter
epsilon_max = 1.0

# Rate at which to reduce chance of random action being taken
epsilon_interval = (epsilon_max - epsilon_min)

# Size of batch taken from replay buffer
batch_size = 32

# Maximum number of steps (frames) per episode
max_steps_per_episode = 10000

Next, we define the functions used to record the video and display it in the Colab notebook.

In [8]:
display = Display(visible=0, size=(1400, 900))
display.start()

""" Utility functions to enable video recording of gym environment and displaying it.
To enable video, we just do "env = wrap_env(env) """

def show_video():
  mp4list = glob.glob('video/*.mp4')
  if len(mp4list) > 0:
    mp4 = mp4list[0]
    video = io.open(mp4, 'r+b').read()
    encoded = base64.b64encode(video)
    ipythondisplay.display(HTML(data='''<video alt="test" autoplay
                loop controls style="height: 400px;">
                <source src="data:video/mp4;base64,{0}" type="video/mp4" />
             </video>'''.format(encoded.decode('ascii'))))
  else:
    print("Could not find video")


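def wrap_env(env):
    # Wrap the environment so that its frames are recorded under ./video
    try:
        env = gnwrapper.Monitor(env, './video', "recording")
    except Exception:
        # Retry the same call once if the first attempt raises
        env = gnwrapper.Monitor(env, './video', "recording")

    clear_output(wait=True)
    return env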


### Instantiate the environment




In [9]:
# Create Pacman environment
# Initialize the environment 'ALE/MsPacman-v5'.

env = wrap_env(gym.make('ALE/MsPacman-v5', render_mode="rgb_array"))
env.seed(seed)   # setting seed for reproducibility

(3444837047, 2669555309)

In [10]:
print('State shape: ', env.observation_space.shape)
print('Number of actions: ', env.action_space.n)

State shape:  (210, 160, 3)
Number of actions:  9


### Create the Deep Q-Network Model

The Deep Q-Network learns an approximation of the Q-table, which maps each state-action pair to a Q-value. For every state in the Pacman environment, there are nine actions that can be taken.

The environment provides the state, and the action is chosen by selecting the largest of the nine Q-values predicted in the output layer.

In [11]:
# Resets the environment to an initial state and returns an initial observation.
# Shape of observation or state
env.reset().shape

(210, 160, 3)

Refer to the following [DeepMind paper](https://arxiv.org/pdf/1312.5602v1.pdf) on playing Atari with deep reinforcement learning.

In [12]:
# Write a function for model creation
num_actions = 9

def create_q_model():

    # Network defined by the Deepmind paper
    inputs = layers.Input(shape=(210, 160, 3))

    # Convolutions on the frames of the screen
    layer1 = layers.Conv2D(32, 8, strides=4, activation="relu")(inputs)
    layer2 = layers.Conv2D(64, 4, strides=2, activation="relu")(layer1)
    layer3 = layers.Conv2D(64, 3, strides=1, activation="relu")(layer2)

    layer4 = layers.Flatten()(layer3)

    layer5 = layers.Dense(512, activation="relu")(layer4)
    action = layers.Dense(num_actions, activation="linear")(layer5)

    return keras.Model(inputs=inputs, outputs=action)


The first model makes the predictions for Q-values which are used to take an action.

In [13]:
# Create model
model = create_q_model()
model.summary()

Model: "model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_1 (InputLayer)        [(None, 210, 160, 3)]     0         
                                                                 
 conv2d (Conv2D)             (None, 51, 39, 32)        6176      
                                                                 
 conv2d_1 (Conv2D)           (None, 24, 18, 64)        32832     
                                                                 
 conv2d_2 (Conv2D)           (None, 22, 16, 64)        36928     
                                                                 
 flatten (Flatten)           (None, 22528)             0         
                                                                 
 dense (Dense)               (None, 512)               11534848  
                                                                 
 dense_1 (Dense)             (None, 9)                 4617  
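The output shapes and parameter counts in the summary above can be checked by hand. The sketch below assumes Keras's default `padding="valid"` for `Conv2D` (no padding is specified in `create_q_model`), and the helper name `conv_out` is hypothetical.

```python
def conv_out(size, kernel, stride):
    """Output length along one axis of a convolution without padding."""
    return (size - kernel) // stride + 1

h, w, c = 210, 160, 3                       # input frame: height, width, channels
conv_params = []
for filters, kernel, stride in [(32, 8, 4), (64, 4, 2), (64, 3, 1)]:
    # weights (kernel * kernel * in_channels * filters) plus one bias per filter
    conv_params.append(kernel * kernel * c * filters + filters)
    h, w, c = conv_out(h, kernel, stride), conv_out(w, kernel, stride), filters
    print("conv output:", (h, w, c), "params:", conv_params[-1])

flat = h * w * c                            # size after Flatten
dense_params = flat * 512 + 512             # Dense(512)
output_params = 512 * 9 + 9                 # Dense(9), one unit per action
print("flatten:", flat, "dense:", dense_params, "output:", output_params)
```

Almost all of the parameters sit in the first dense layer, because the raw 210 x 160 frame is still 22 x 16 x 64 values wide when it is flattened.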

Now build a target model for the prediction of future rewards. Since the same network is calculating the predicted value and the target value, there could be a lot of divergence between these two. So, instead of using one neural network for learning, we can use two.

The weights of the target model are updated only every 10000 steps, so the target Q-value stays stable while the loss between the predicted and target Q-values is calculated.
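The periodic synchronisation can be illustrated without any deep learning library. In this sketch the "networks" are just dictionaries holding one NumPy array, the gradient step is replaced by a dummy increment, and the sync interval is shortened to 4 steps; all of these are assumptions made for illustration.

```python
import numpy as np

online = {"w": np.zeros(3)}                              # stand-in for the online model's weights
target = {k: v.copy() for k, v in online.items()}        # target starts as a copy
sync_every = 4                                           # the notebook uses 10000; small here for illustration
sync_steps = []

for step in range(1, 11):
    online["w"] += 1.0                                   # stand-in for one gradient update
    if step % sync_every == 0:
        # Copy (not alias) the online weights into the target
        target = {k: v.copy() for k, v in online.items()}
        sync_steps.append(step)

print("synced at steps:", sync_steps)
print("online:", online["w"], "target:", target["w"])
```

Between synchronisations the target weights stay frozen while the online weights keep changing, which is what keeps the regression target from moving at every update.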

In [15]:
# Create target model
model_target = create_q_model()

### Train the Model

The following pseudo-algorithm implements deep Q-learning with experience replay.


<br><br>

<center>
<img src="https://cdn.iisc.talentsprint.com/DLFA/Experiment_related_data/deep_Q_learning.png" width=480px, height=480px/>
</center>
<br><br>




**Note:** The code cell below may take some time to run, so using a GPU is recommended. Refer to the following [link](https://towardsdatascience.com/reinforcement-learning-explained-visually-part-5-deep-q-networks-step-by-step-5a5317197f4b) for more details on training Deep Q-Networks.
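Experience replay, as used in the algorithm above, can be sketched with a bounded buffer. The `ReplayBuffer` class below is a hypothetical helper, not part of this notebook: the training cell keeps five parallel Python lists and trims them by hand, whereas this sketch uses `collections.deque` with `maxlen` so the oldest transitions are dropped automatically. It also samples without replacement, while `np.random.choice` with its default settings samples with replacement.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity store of (state, action, reward, next_state, done) transitions."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)   # oldest entries are discarded when full

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Draw distinct indices uniformly at random, then regroup by field
        indices = random.sample(range(len(self.buffer)), batch_size)
        batch = [self.buffer[i] for i in indices]
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.buffer)

# Usage with integer stand-ins for states
buffer = ReplayBuffer(capacity=3)
for t in range(5):
    buffer.add(state=t, action=0, reward=1.0, next_state=t + 1, done=False)

states, actions, rewards, next_states, dones = buffer.sample(batch_size=2)
print(len(buffer), states)
```

Sampling uniformly from past transitions breaks the strong correlation between consecutive frames, which is the motivation for replaying experience instead of training on each step as it arrives.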

In [16]:
# The DeepMind paper uses RMSProp; the Adam optimizer is used here
# because it improves training time
optimizer = keras.optimizers.Adam(learning_rate=0.00025, clipnorm=1.0)

# Experience replay buffers
# Reinforcement learning algorithms use replay buffers to store trajectories of experience
# when executing a policy in an environment. During training, replay buffers are queried for a
# subset of the trajectories (either a sequential subset or a sample) to "replay" the agent's experience
action_history = []
state_history = []
state_next_history = []
rewards_history = []
done_history = []
episode_reward_history = []
running_reward = 0
episode_count = 0
frame_count = 0

# Number of frames to take random action and observe output
epsilon_random_frames = 50000

# Number of frames for exploration
epsilon_greedy_frames = 1000000.0

# Maximum replay length
max_memory_length = 100000

# Train the model after 4 actions
update_after_actions = 4

# How often to update the target network
update_target_network = 10000

# Using Huber loss for stability and to avoid exploding gradients.
# Huber loss combines quadratic and linear scoring. It has a hyperparameter
# delta (δ): the loss is quadratic for errors below delta and linear above it.
# Compared with MSE, Huber loss is less sensitive to outliers, because large errors
# are penalized linearly rather than quadratically; it acts like MSE for small errors and like MAE for large ones.
loss_function = keras.losses.Huber()

while True:  # Run until solved
    state = np.array(env.reset())
    episode_reward = 0

    # Each pass of the outer while-loop is one episode (one game played by the agent);
    # each iteration of this inner loop is one timestep within that episode
    for timestep in range(1, max_steps_per_episode):
        frame_count += 1

        # Use epsilon-greedy for exploration to select an action
        if frame_count < epsilon_random_frames or epsilon > np.random.rand(1)[0]:
            # With the probability epsilon, we take random action
            action = np.random.choice(num_actions)
        else:
            # Predict action Q-values
            # From environment state
            state_tensor = tf.convert_to_tensor(state)
            state_tensor = tf.expand_dims(state_tensor, 0)
            action_q_values = model(state_tensor, training=False)
            # Take best action
            # with probability 1-epsilon, we select an action that has a maximum Q-value
            action = tf.argmax(action_q_values[0]).numpy()

        # Decay probability of taking random action
        # Early in training the Q-value predictions are unreliable, so the agent mostly
        # explores with random actions; as epsilon decays and the Q-function converges to
        # more consistent predictions, the agent increasingly exploits its own predictions.
        epsilon -= epsilon_interval / epsilon_greedy_frames
        epsilon = max(epsilon, epsilon_min)

        # Is used to display the environment image
        env.render()

        # Apply the sampled action in our environment
        # env.step - executes the given action and returns four values
        state_next, reward, done, _ = env.step(action)
        state_next = np.array(state_next)

        episode_reward += reward

        # Save actions and states in replay buffer
        action_history.append(action)
        state_history.append(state)
        state_next_history.append(state_next)
        done_history.append(done)
        rewards_history.append(reward)
        state = state_next

        # Update every fourth frame and once batch size is over 32
        if frame_count % update_after_actions == 0 and len(done_history) > batch_size:

            # Get indices of samples for replay buffers
            indices = np.random.choice(range(len(done_history)), size=batch_size)

            # Using list comprehension to sample from replay buffer
            state_sample = np.array([state_history[i] for i in indices])
            state_next_sample = np.array([state_next_history[i] for i in indices])
            rewards_sample = [rewards_history[i] for i in indices]
            action_sample = [action_history[i] for i in indices]
            done_sample = tf.convert_to_tensor([float(done_history[i]) for i in indices])

            # Build the updated Q-values for the sampled future states
            # Use the target model for stability
            # The Target network takes the next state from each data sample and predicts
            # the best (max predicted Q value) out of all actions that can be taken from that state. This is the ‘Target Q Value’
            future_rewards = model_target.predict(state_next_sample, verbose=0)

            # Q value = reward + discount factor * expected future reward
            # Compute Q value
            updated_q_values = rewards_sample + gamma * tf.reduce_max(future_rewards, axis=1)

            # For terminal transitions (done == 1), replace the target value with -1
            updated_q_values = updated_q_values * (1 - done_sample) - done_sample

            # Create a mask so we only calculate loss on the updated Q-values
            masks = tf.one_hot(action_sample, num_actions)

            with tf.GradientTape() as tape:

                # Train the model on the states and updated Q-values
                # The Q network takes the current state and action from each data sample
                # and predicts the Q value for that particular action. This is the ‘Predicted Q Value’.
                q_values = model(state_sample)

                # Apply the masks to the Q-values to get the Q-value for action taken
                q_action = tf.reduce_sum(tf.multiply(q_values, masks), axis=1)

                # Calculate loss between new Q-value and old Q-value
                loss = loss_function(updated_q_values, q_action)

            # Backpropagation
            # After computing the loss, use the tape to compute the gradient of the loss
            # with respect to the model's trainable variables, then let the optimizer
            # apply those gradients (clipped by norm via clipnorm=1.0) to update the weights.
            grads = tape.gradient(loss, model.trainable_variables)
            optimizer.apply_gradients(zip(grads, model.trainable_variables))

        if frame_count % update_target_network == 0:
            # update the target network with new weights
            model_target.set_weights(model.get_weights())
            # Log details
            template = "running reward: {:.2f} at episode {}, frame count {}"
            print(template.format(running_reward, episode_count, frame_count))

        # Limit the state and reward history
        if len(rewards_history) > max_memory_length:
            del rewards_history[:1]
            del state_history[:1]
            del state_next_history[:1]
            del action_history[:1]
            del done_history[:1]

        if done:
            break

        if timestep%50 == 0:
            print(f"Episode: {episode_count}, Reward: {running_reward}, Timestep: {timestep} !")

    # Update running reward to check condition for solving
    episode_reward_history.append(episode_reward)
    if len(episode_reward_history) > 100:
        del episode_reward_history[:1]
    running_reward = np.mean(episode_reward_history)

    episode_count += 1

    # Condition to consider the task solved
    # Note that this execution may take more than 15 minutes
    if running_reward > 400 or episode_count > 50:
        print(f"Reward: {running_reward} \nStopped at episode {episode_count}!")
        break




Episode: 0, Reward: 0, Timestep: 50 !
Episode: 0, Reward: 0, Timestep: 100 !
Episode: 0, Reward: 0, Timestep: 150 !
Episode: 0, Reward: 0, Timestep: 200 !
Episode: 0, Reward: 0, Timestep: 250 !
Episode: 0, Reward: 0, Timestep: 300 !
Episode: 0, Reward: 0, Timestep: 350 !
Episode: 0, Reward: 0, Timestep: 400 !
Episode: 0, Reward: 0, Timestep: 450 !
Episode: 1, Reward: 180.0, Timestep: 50 !
Episode: 1, Reward: 180.0, Timestep: 100 !
Episode: 1, Reward: 180.0, Timestep: 150 !
Episode: 1, Reward: 180.0, Timestep: 200 !
Episode: 1, Reward: 180.0, Timestep: 250 !
Episode: 1, Reward: 180.0, Timestep: 300 !
Episode: 1, Reward: 180.0, Timestep: 350 !
Episode: 1, Reward: 180.0, Timestep: 400 !
Episode: 1, Reward: 180.0, Timestep: 450 !
Episode: 1, Reward: 180.0, Timestep: 500 !
Episode: 2, Reward: 235.0, Timestep: 50 !
Episode: 2, Reward: 235.0, Timestep: 100 !
Episode: 2, Reward: 235.0, Timestep: 150 !
Episode: 2, Reward: 235.0, Timestep: 200 !
Episode: 2, Reward: 235.0, Timestep: 250 !
Episode

**Here**, one stopping criterion is `running_reward > 400`; raising this threshold lets the agent train longer and can improve learning. The other criterion is `episode_count > 50`; raising this value also increases the training time.

**Note:** The DeepMind paper trained for "a total of 50 million frames (that is 38 days of game experience in total)". However, good results can be obtained at around 10 million frames, which are processed in less than 24 hours on a modern machine.
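The target computation, action masking, and Huber loss used in the training cell above can be traced by hand on a tiny batch. The NumPy sketch below uses two made-up transitions and three actions instead of nine; all numbers are assumptions for illustration, and `huber` is a hand-written helper that follows the usual Huber definition with delta = 1 and a mean over the batch.

```python
import numpy as np

gamma, num_actions = 0.99, 3                   # 3 actions for brevity; the notebook uses 9

rewards = np.array([1.0, 0.0])
dones = np.array([0.0, 1.0])                   # the second transition ends the episode
actions = np.array([2, 0])                     # actions that were actually taken
next_q_target = np.array([[0.5, 2.0, 1.0],     # target-network Q-values for the next states
                          [3.0, 1.0, 0.0]])
q_online = np.array([[0.0, 0.0, 2.5],          # online-network Q-values for the current states
                     [0.5, 0.0, 0.0]])

# Target: reward + gamma * max over next actions; terminal transitions are set to -1
targets = rewards + gamma * next_q_target.max(axis=1)
targets = targets * (1 - dones) - dones

# One-hot mask keeps only the Q-value of the action that was taken
masks = np.eye(num_actions)[actions]
q_action = (q_online * masks).sum(axis=1)

def huber(y_true, y_pred, delta=1.0):
    """Quadratic for small errors, linear for large ones, averaged over the batch."""
    err = np.abs(y_true - y_pred)
    quadratic = 0.5 * err ** 2
    linear = delta * (err - 0.5 * delta)
    return np.where(err <= delta, quadratic, linear).mean()

loss = huber(targets, q_action)
print(targets, q_action, loss)
```

Only the Q-value of the action taken contributes to the loss, so the gradient leaves the predictions for the other actions in that state untouched.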


### Visualizations

The agent's progress can be seen in the video below.

In [17]:
# Visualize training
env.close()
show_video()

### Please answer the questions below to complete the experiment:




In [18]:
# @title  What is the significance of the discount factor (γ) in Q-learning? { run: "auto", form-width: "500px", display-mode: "form" }
Answer = "It determines the impact of future rewards on the Q-values" #@param ["","It controls the exploration rate","It is irrelevant in Q-learning", "It influences the learning rate", "It determines the impact of future rewards on the Q-values"]

In [24]:
#@title How was the experiment? { run: "auto", form-width: "500px", display-mode: "form" }
Complexity = "Good and Challenging for me" #@param ["","Too Simple, I am wasting time", "Good, But Not Challenging for me", "Good and Challenging for me", "Was Tough, but I did it", "Too Difficult for me"]

In [19]:
#@title If it was too easy, what more would you have liked to be added? If it was very difficult, what would you have liked to have been removed? { run: "auto", display-mode: "form" }
Additional = "good and challenging for me" #@param {type:"string"}

In [20]:
#@title Can you identify the concepts from the lecture which this experiment covered? { run: "auto", vertical-output: true, display-mode: "form" }
Concepts = "Yes" #@param ["","Yes", "No"]

In [21]:
#@title  Text and image description/explanation and code comments within the experiment: { run: "auto", vertical-output: true, display-mode: "form" }
Comments = "Very Useful" #@param ["","Very Useful", "Somewhat Useful", "Not Useful", "Didn't use"]

In [22]:
#@title Mentor Support: { run: "auto", vertical-output: true, display-mode: "form" }
Mentor_support = "Very Useful" #@param ["","Very Useful", "Somewhat Useful", "Not Useful", "Didn't use"]

In [25]:
#@title Run this cell to submit your notebook for grading { vertical-output: true }
try:
  if submission_id:
      return_id = submit_notebook()
      if return_id : submission_id = return_id
  else:
      print("Please complete the setup first.")
except NameError:
  print ("Please complete the setup first.")

Your submission is successful.
Ref Id: 1021
Date of submission:  06 Jul 2024
Time of submission:  09:15:43
View your submissions: https://aias-iisc.talentsprint.com/notebook_submissions
