
# Discount factor effect and Q-Learning

## 1. Based on the probabilities on the arrows above, model your MDP (transition probabilities and rewards) in the notebook. You can use the [s, a, s’] format for each part, or your own way of defining the MDP. The transition probabilities, rewards, and possible actions should all be visible.

In [1]:
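# (state, action) -> list of (next_state, probability) pairs
transition_probabilities = {
    (0, 0): [(0, 0.7), (1, 0.3)],
    (0, 1): [(0, 1.0)],
    (0, 2): [(0, 0.8), (1, 0.2)],
    (1, 0): [(1, 1.0)],
    (1, 2): [(2, 1.0)],
    (2, 1): [(2, 0.1), (1, 0.1), (0, 0.8)],
}

# (state, action, next_state) -> reward; any triple not listed has reward 0
rewards = {
    (0, 0, 0): 10,
    (1, 2, 2): -50,
    (2, 1, 0): 40,
}

# possible_actions[s] lists the actions available in state s
possible_actions = [[0, 1, 2], [0, 2], [1]]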
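
Before running any updates, it can help to confirm that the MDP definition is internally consistent. The sketch below is one possible check, not part of the original assignment: `check_mdp` is a hypothetical helper that verifies that every listed action has transitions, that each outcome distribution sums to 1, and that every reward refers to a reachable transition.

```python
import math

# Copied from the cell above so this block runs on its own.
transition_probabilities = {
    (0, 0): [(0, 0.7), (1, 0.3)],
    (0, 1): [(0, 1.0)],
    (0, 2): [(0, 0.8), (1, 0.2)],
    (1, 0): [(1, 1.0)],
    (1, 2): [(2, 1.0)],
    (2, 1): [(2, 0.1), (1, 0.1), (0, 0.8)],
}
rewards = {(0, 0, 0): 10, (1, 2, 2): -50, (2, 1, 0): 40}
possible_actions = [[0, 1, 2], [0, 2], [1]]


def check_mdp(transition_probabilities, rewards, possible_actions):
    """Hypothetical helper: return a list of problems (empty if none found)."""
    problems = []
    n_states = len(possible_actions)
    for state, actions in enumerate(possible_actions):
        for action in actions:
            outcomes = transition_probabilities.get((state, action))
            if outcomes is None:
                problems.append(f"no transitions for {(state, action)}")
                continue
            total = sum(prob for _, prob in outcomes)
            if not math.isclose(total, 1.0):
                problems.append(f"{(state, action)} sums to {total}")
            for next_state, _ in outcomes:
                if not 0 <= next_state < n_states:
                    problems.append(f"{(state, action)} leads to unknown state {next_state}")
    for state, action, next_state in rewards:
        reachable = {s for s, _ in transition_probabilities.get((state, action), [])}
        if next_state not in reachable:
            problems.append(f"reward on unreachable transition {(state, action, next_state)}")
    return problems


problems = check_mdp(transition_probabilities, rewards, possible_actions)
print(problems or "MDP definition looks consistent")
```

An empty list means every check passed; any message points at the `(state, action)` pair to fix.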

In [2]:
import numpy as np
import random

In [3]:
Q_values = np.full((3, 3), -np.inf)  # -np.inf for impossible actions
for state, actions in enumerate(possible_actions):
    Q_values[state, actions] = 0.0  # for all possible actions

## 2. Take your discount factor to be 0.9. Perform Q-learning and report the Q-values for each (state, action) pair. Based on that, what is the optimal policy?

In [4]:
gamma = 0.9  # the discount factor
# Q-value iteration: the known transition model is used directly (no sampling).
for iteration in range(50):
    for state in range(3):
        for action in possible_actions[state]:
            updated_q_value = 0
            for next_state, prob in transition_probabilities[(state, action)]:
                reward = rewards.get((state, action, next_state), 0)
                max_q_next = max(Q_values[next_state])
                updated_q_value += prob * (reward + gamma * max_q_next)
            # In-place update: later pairs in the same sweep already see this value.
            Q_values[state, action] = updated_q_value
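
For reference, the loop above can be read as repeated application of the Q-value iteration update. The symbols $T$ (transition probability) and $R$ (reward) are notation assumed here; in the code they correspond to `transition_probabilities` and `rewards`.

```latex
Q_{k+1}(s, a) \;=\; \sum_{s'} T(s, a, s') \,\bigl[\, R(s, a, s') + \gamma \max_{a'} Q_k(s', a') \,\bigr]
```

Because the code writes each new value straight back into `Q_values`, some terms on the right-hand side already use values from the current sweep. This in-place variant should reach the same fixed point, but intermediate values can differ slightly from a fully synchronous update.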

In [5]:
Q_values

array([[18.91891892, 17.02702703, 13.62162162],
       [ 0.        ,        -inf, -4.87971488],
       [       -inf, 50.13365013,        -inf]])

In [6]:
Q_values.argmax(axis=1)  # optimal action for each state

array([0, 0, 1])
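
The cell above relies on the known transition model, which makes it Q-value iteration rather than sample-based Q-learning. For comparison, here is a minimal sketch of tabular Q-learning on the same MDP. Several pieces are assumptions not found in the original: the hypothetical `step` helper that samples a transition, the purely random exploration policy, the learning-rate schedule (`alpha0`, `decay`), the iteration count, and the fixed seed.

```python
import numpy as np

# Copied from the cells above so this block runs on its own.
transition_probabilities = {
    (0, 0): [(0, 0.7), (1, 0.3)],
    (0, 1): [(0, 1.0)],
    (0, 2): [(0, 0.8), (1, 0.2)],
    (1, 0): [(1, 1.0)],
    (1, 2): [(2, 1.0)],
    (2, 1): [(2, 0.1), (1, 0.1), (0, 0.8)],
}
rewards = {(0, 0, 0): 10, (1, 2, 2): -50, (2, 1, 0): 40}
possible_actions = [[0, 1, 2], [0, 2], [1]]

rng = np.random.default_rng(42)  # fixed seed (assumption) for repeatable runs


def step(state, action):
    """Hypothetical environment: sample (next_state, reward) for one move."""
    outcomes = transition_probabilities[(state, action)]
    next_states = [s for s, _ in outcomes]
    probs = [p for _, p in outcomes]
    next_state = int(rng.choice(next_states, p=probs))
    return next_state, rewards.get((state, action, next_state), 0)


Q_learned = np.full((3, 3), -np.inf)  # -np.inf for impossible actions
for state, actions in enumerate(possible_actions):
    Q_learned[state, actions] = 0.0

gamma = 0.9
alpha0 = 0.05  # initial learning rate (assumed value)
decay = 0.005  # learning-rate decay (assumed value)

state = 0
for iteration in range(10_000):
    action = int(rng.choice(possible_actions[state]))  # random exploration
    next_state, reward = step(state, action)
    alpha = alpha0 / (1 + iteration * decay)
    # Move the estimate toward the sampled one-step target.
    target = reward + gamma * Q_learned[next_state].max()
    Q_learned[state, action] = (1 - alpha) * Q_learned[state, action] + alpha * target
    state = next_state

print(Q_learned)
print(Q_learned.argmax(axis=1))
```

The learner never reads `transition_probabilities` directly; it only sees sampled transitions through `step`. With enough iterations the estimates should move toward the Q-value iteration results, although the exact numbers depend on the seed and the learning-rate schedule.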

## 3. Perform the same procedure but this time with a discount factor of 0.95. Did your optimal policy change? Explain your results.



In [7]:
Q_values = np.full((3, 3), -np.inf)  # -np.inf for impossible actions
for state, actions in enumerate(possible_actions):
    Q_values[state, actions] = 0.0  # for all possible actions

In [8]:
Q_values

array([[  0.,   0.,   0.],
       [  0., -inf,   0.],
       [-inf,   0., -inf]])

In [9]:
gamma = 0.95  # the discount factor
# Same Q-value iteration as before; only gamma has changed.
for iteration in range(50):
    for state in range(3):
        for action in possible_actions[state]:
            updated_q_value = 0
            for next_state, prob in transition_probabilities[(state, action)]:
                reward = rewards.get((state, action, next_state), 0)
                max_q_next = max(Q_values[next_state])
                updated_q_value += prob * (reward + gamma * max_q_next)
            # In-place update: later pairs in the same sweep already see this value.
            Q_values[state, action] = updated_q_value

In [10]:
Q_values

array([[21.79615996, 20.70635196, 16.76923123],
       [ 1.02074831,        -inf,  1.08097586],
       [       -inf, 53.77587186,        -inf]])

In [11]:
Q_values.argmax(axis=1)  # optimal action for each state

array([0, 2, 1])
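
Comparing the two runs, the policy changes only in state 1. With a discount factor of 0.9, Q(1, 0) = 0 beats Q(1, 2) ≈ -4.88, so the agent stays put. With 0.95, Q(1, 2) ≈ 1.08 edges out Q(1, 0) ≈ 1.02, so the agent accepts the -50 penalty to reach state 2 and collect the +40 reward afterwards. A plausible reading is that a larger discount factor weights future rewards more heavily, which makes the short-term loss worth paying. The sketch below wraps the same update in a hypothetical helper, `q_value_iteration`, so both settings can be compared side by side.

```python
import numpy as np

# Copied from the cells above so this block runs on its own.
transition_probabilities = {
    (0, 0): [(0, 0.7), (1, 0.3)],
    (0, 1): [(0, 1.0)],
    (0, 2): [(0, 0.8), (1, 0.2)],
    (1, 0): [(1, 1.0)],
    (1, 2): [(2, 1.0)],
    (2, 1): [(2, 0.1), (1, 0.1), (0, 0.8)],
}
rewards = {(0, 0, 0): 10, (1, 2, 2): -50, (2, 1, 0): 40}
possible_actions = [[0, 1, 2], [0, 2], [1]]


def q_value_iteration(gamma, n_iterations=50):
    """Hypothetical helper: in-place Q-value iteration, as in the cells above."""
    Q = np.full((3, 3), -np.inf)  # -np.inf for impossible actions
    for state, actions in enumerate(possible_actions):
        Q[state, actions] = 0.0
    for _ in range(n_iterations):
        for state in range(3):
            for action in possible_actions[state]:
                Q[state, action] = sum(
                    prob * (rewards.get((state, action, next_state), 0)
                            + gamma * Q[next_state].max())
                    for next_state, prob in transition_probabilities[(state, action)]
                )
    return Q


for gamma in (0.9, 0.95):
    Q = q_value_iteration(gamma)
    print(f"gamma={gamma}: policy={Q.argmax(axis=1)}, "
          f"Q(1,0)={Q[1, 0]:.2f}, Q(1,2)={Q[1, 2]:.2f}")
```

The printed policies should match the `argmax` outputs shown above for each discount factor.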