### Lab Objectives

We will practice understanding Markov Reward Process and Markov Decision Process. Also we will try to apply the direct solution to solve the Bellman Expectation Equation.

***

### Introduction 

Before we begin, let's refresh our memory on this question:

**What is Markov Chain?**

To summarise a Markov chain is defined as follows:
* S = set of states
* P = state transition probabilities
* $S_0$ = starting state

<img src="https://drive.google.com/uc?id=12HMJqQHwH7Q1-CP1trhRMpFJqc4ddtUq">

[Open this link!](https://setosa.io/blog/2014/07/26/markov-chains/index.html)



**what is the core problem that is being solved for in reinforcement learning?** 

At the simplest level, the problem we are solving for is to teach the agent to behave *optimally* in a specific environment. An example might be to teach a robot to bounce a ball for some period of time; or program a helicopter to keep the same altitude in unpredictable windy conditions.

The numerical definition of what is *optimal* is defined by the **objective function**, which is typically maximized in the context of MDPs (Markov Decision Processes). In the context of MDPs , this objective function is known as the **value function**, aka the *total expected reward*, which is received over sequential state transitions.

The goal then, is to teach the agent to maximize the *total expected reward* it earns over some time horizon (theoretically, it is an infinite time horizon) by selecting the best action as dictated by the value function. The agent learns to maximize rewards as it transitions from state to state, taking actions in each state. In a deterministic case, the transitioning process is diagrammed as follows:

$$\text{State 1}\xrightarrow[Action]{}\text{State 2} \xrightarrow[Action]{}\text{State 3}\xrightarrow[Action]{}\dots$$



### Markov Decision Processes

Let's formalize the key components of the RL problem in the context of MDPs:

An MDP is defined by: $(S, A, P, R, S_0, \gamma)$
* S = set of states (state-space)
* A = set of actions (action-space)
* P = state transition probabilities
* R = reward for taking an action $a\in\text{A}$ in state $s\in\text{S}$
* $S_0$ = starting state
* $\gamma$ = discount rate
    
In more detail:
* **States** - states can be discrete/finite (imagine cells in a grid world) or continuous/infinite (position on a road).
    * Referred to as the *state space* (i.e. discrete state space or continuous state space)
* **Actions** - actions can also be discrete (moving up/down/left/right in a grid world cell) or continuous (how many degrees to turn a steering wheel when driving a car).
    * Referred to as the *action space* (i.e. discrete action space or continuous action space)
* **Rewards** - rewards are issued by a reward function $\rho : S_t \times A_t \rightarrow R$. The reward function is a property of the environment.
* **Transition probabilities**. In MDPs, this is denoted by $P_{s,a}$. The transition probability is the probability that, for example, some action $A$ in state $S$ leads to state $S^\prime$ (prime denotes the next time step) - represented notationally as $P(s^\prime|s,a)$.
* **Discount factor** - the discount factor is a number greater than 0 and less than 1 that is used to discount rewards received over sequential time-steps. It is denoted as $\gamma \in [0, 1)$
* **Value function** - one of the primary functions learned by the agent: the value function dictates either the value of a state or the value of action. More on this below.
* **Policy function** - one of the primary functions learned by the agent: the policy maps states to actions. More below.

#### Other useful definitions
* **Experience** - $\big(\text{State}_{t}$, $\text{Action}_{t}$, $\text{Reward}_{t}\big)$ tuple
* **Trajectory** - A sequence of *experiences* through time, represented as: $\tau$ (tau)
* **Episode** - A trajectory that ends in a terminal state


### Policy

The process of learning for the agent can be thought of as a sequence of mapping states to actions $a = \pi(s)$ to maximize expected reward over an episode. This is known as the **policy:** $\pi: S \rightarrow A$. 

Note:
* The agent needs to explore and interact with its environment to learn where actions earn the maximum rewards.
* Actions in the current time step effect rewards in future time steps
* There is a trade off between the frequency of sampling the environment and frequency of taking actions
* It can be based on discrete state-spaces or continuous state-spaces (and same for action-spaces)

### Value functions

The **worthiness** of a policy is calculated by the aforementioned *value function*. There are various forms of value functions. First, the value of a state:

<img src="https://drive.google.com/uc?id=1W6LE42KbtNoOs1LKkFxlWhOF6-7wzWxS" />

Second, the value of action:

<img src="https://drive.google.com/uc?id=1I0Df6jgFDH7gBU1djylJ8ZzmJUO36P9_" />



#### <font color="#DE008A">Bellman's optimal ($*$) action-value function</font> :

$$Q^*{(s,a)} = \ R{(s,a)} + \gamma \max_{a^\prime} \sum_{s^\prime\in\text{S}} \big[P{(s^\prime|s,a)}  Q^*{(s^\prime,a^\prime)}\big]$$


## Cleaning Robot Problem

The main characteristics of this world are the following:

- Discrete time and space
- Fully observable
- Infinite horizon
- Known Transition Model

<img src="https://drive.google.com/uc?id=1Yz6xnDuo6StlKzmj4eDqVYvZfOORBlBT"/>

The main goal for the robot in this task is to find the best way to reach the charging station.

What does the "best way" even mean?
It depends on the reward that the robot receives in each intermediate state -> that leads to multiple optimal policies.

<img src="https://drive.google.com/uc?id=1vef0Xhpy5OMwTxKLgBBqxrxNTp9bhw9S"/>

**The Bellman equation**

$$Q^*{(s,a)} = \ R{(s,a)} + \gamma \max_{a^\prime} \sum_{s^\prime\in\text{S}} \big[P{(s^\prime|s,a)}  Q^*{(s^\prime,a^\prime)}\big]$$

In this example the reward for each non-terminal state is 
$$R(s) = −0.04$$


BUT, before we begin let's assume that somehow we got the optima optimal policy and utility values generated by the optimal value-function just to help us understand the idea.

<img src="https://drive.google.com/uc?id=1C-hx0MwCi9hNQcM9Q-QIvVJzalbVWypa"/>

In our example we suppose the robot starts from the state (1,1). Using the Bellman equation we have to find the action with the highest utility between UP, LEFT, DOWN and RIGHT. We do not have the optimal policy, but we have the transition model and the utility values for each state. You have to recall the two main rules of our environment: (i) if the robot bounce on the wall it goes back to the previous state, and (ii) the selected action is executed only with a probability of 80% in accordance with the transition model. Instead of dealing with those ugly numbers I want to show you a visual representaion of the possible outcomes:
<img src="https://drive.google.com/uc?id=1_X65joHfWYkFNEIorjSNXmMR0rRGNTjX"/>

For each possible outcome I reported the utility and the probability given by the transition model. This corresponds to the first part of the Bellman equation. The next step is to calculate the product between the utility and the transition probability, then sum up the value for each action.

<img src="https://drive.google.com/uc?id=1lmDRxIXOs0h6cGP6VzzsRjauu-nGJYbO"/>

We found out that for state (1,1) the action UP has the highest value. This is in accordance with the optimal policy we magically got.

Now we have all the elements and we can plug the values in the Bellman equation finding the utility of the state (1,1):

$$U(s11)=−0.04 + 1.0 × 0.7456= 0.7056$$

In [5]:
!wget -q https://github.com/mhd-medfa/IU-Reinforcement-Learning-22-lab/raw/main/week03-mdp/T.npy  -O T.npy

In [10]:
import numpy as np

class MDP:
  def __init__(self):
        #Starting state vector
        #The agent starts from (1, 1)
        self.state = np.array([[0.0, 0.0, 0.0, 0.0, 
                                    0.0, 0.0, 0.0, 0.0, 
                                    1.0, 0.0, 0.0, 0.0]])
        self.rewards = np.array([-0.04, -0.04, -0.04,  +1.0,
                                 -0.04,   0.0, -0.04,  -1.0,
                                 -0.04, -0.04, -0.04, -0.04])
        
        # Probabilities Transition matrix loaded from file
        # 12x12x4 matrix (12 starting states, 12 next states, 4 actions)
        self.transits = np.load("T.npy")  
        #Utility vector genereated by the ((assumed)) optimal value-function
        self.values = np.array([[0.812, 0.868, 0.918,   1.0,
                                 0.762,   0.0, 0.660,  -1.0,
                                 0.705, 0.655, 0.611, 0.388]])
        self.gamma = 1.0 #Assuming that the discount factor is equal to 1.0

        # self.epsilon = 0.0005

  def return_state_utility(self):
      """Return the state utility (return).

      @return the utility (return) of the state
      """
      action_array = np.zeros(4)
      for action in range(0, 4):
          action_array[action] = np.sum(np.multiply(self.values, np.dot(self.state, self.transits[:,:,action])))
      return self.rewards[8] + self.gamma * np.max(action_array)

In [12]:
mdp = MDP()
u_11 = mdp.return_state_utility()
print("Utility of state (1,1): " + str(u_11))

Utility of state (1,1): 0.7056


**Direct solution**

<img src="https://drive.google.com/uc?id=1CbWE_KdxuQibsZccX7W7ZWdeni48Vffh"/>

In [41]:
import numpy as np

class MDP:
  def __init__(self):
        #Starting state vector
        #The agent starts from (1, 1)
        self.state = np.array([[0.0, 0.0, 0.0, 0.0, 
                                    0.0, 0.0, 0.0, 0.0, 
                                    1.0, 0.0, 0.0, 0.0]])
        self.rewards = np.array([-0.04, -0.04, -0.04,  +1.0,
                                 -0.04,   0.0, -0.04,  -1.0,
                                 -0.04, -0.04, -0.04, -0.04])
        
        # Probabilities Transition matrix loaded from file
        #(It is too big to write here)
        self.transits = np.load("T.npy")
        #Generate the first policy randomly
        # Nan=Nothing, -1=Terminal, 0=Up, 1=Left, 2=Down, 3=Right
        self.policy = np.random.randint(0, 4, size=(12)).astype(np.float32)
        self.policy[5] = np.NaN
        self.policy[3] = self.policy[7] = -1

        #Utility vector
        self.values = np.array([0.0, 0.0, 0.0,  0.0,
                                0.0, 0.0, 0.0,  0.0,
                                0.0, 0.0, 0.0,  0.0])
        self.gamma = 0.999

        self.epsilon = 0.0001
        self.iteration = 0
  def policy_evaluation(self, shape=(3,4)):
        length = shape[0]*shape[1]
        self.values = np.zeros(length)
        for s in range(12):
            if not np.isnan(self.policy[s]):
                action = int(self.policy[s])
                self.values[s] = np.linalg.solve(np.identity(length) - self.gamma\
                                                 *self.transits[:,:,action]\
                                                 , self.rewards)[s]
                # self.values[s] = np.dot(np.linalg.pinv(np.identity(length) - self.gamma\
                #                                  *self.transits[:,:,action])\
                #                                  , self.rewards)[s]
                # self.values[s] = np.dot(np.linalg.pinv(np.identity(length) - self.gamma\
                #                                  *self.transits[:,:,action])\
                #                                  , self.rewards)[s]
        return self.values

  def expected_action(self):
      """Return the expected action.
      It returns an action based on the
      expected utility of doing a in state s, 
      according to T and u. This action is
      the one that maximize the expected
      utility.

      @return expected action (int)
      """
      actions_array = np.zeros(4)
      for action in range(4):
          #Expected utility of doing a in state s, according to T and u.
          actions_array[action] = np.sum(np.multiply(self.values, np.dot(self.state, self.transits[:,:,action])))
      return np.argmax(actions_array)

def print_policy(p, shape):
    """Print the policy on the terminal
    Using the symbol:
    * Terminal state
    ^ Up
    > Right
    v Down
    < Left
    # Obstacle
    """
    counter = 0
    policy_string = ""
    for row in range(shape[0]):
        for col in range(shape[1]):
            if(p[counter] == -1): policy_string += " *  "            
            elif(p[counter] == 0): policy_string += " ^  "
            elif(p[counter] == 1): policy_string += " <  "
            elif(p[counter] == 2): policy_string += " v  "           
            elif(p[counter] == 3): policy_string += " >  "
            elif(np.isnan(p[counter])): policy_string += " #  "
            counter += 1
        policy_string += '\n'
    print(policy_string)


In [42]:
mdp = MDP()

while True:
    mdp.iteration += 1
    #1- Policy evaluation
    u_old = u.copy()
    u = mdp.policy_evaluation()

    unchanged = True
    delta = np.absolute(u - u_old).max()
    if (delta < mdp.epsilon * (1 - mdp.gamma) / mdp.gamma) or mdp.iteration > 10000: break
    for s in range(12):
        if not np.isnan(mdp.policy[s]) and not mdp.policy[s]==-1:
            mdp.state = np.zeros((1,12))
            mdp.state[0,s] = 1.0
            #2- Policy improvement
            a = mdp.expected_action()  
            if a != mdp.policy[s]: 
                mdp.policy[s] = a
                unchanged = False
    # print_policy(mdp.policy, shape=(3,4))
    if unchanged: break

print("=================== FINAL RESULT ==================")
print("Iterations: " + str(mdp.iteration))
#print("Delta: " + str(delta))
print("Gamma: " + str(mdp.gamma))
print("Epsilon: " + str(mdp.epsilon))
print("===================================================")
print(u[0:4])
print(u[4:8])
print(u[8:12])
print("===================================================")
print_policy(mdp.policy, shape=(3,4))
print("===================================================")


Iterations: 10001
Gamma: 0.999
Epsilon: 0.0001
[-1.3412535  0.692393   0.7433094  1.       ]
[-1.38956486  0.         -0.3121032  -1.        ]
[-40.          -1.4239544   -1.37568607 -35.67148095]
 ^   >   >   *  
 ^   #   ^   *  
 <   >   >   <  



In [44]:
mdp.values

array([ -1.3412535 ,   0.692393  ,   0.7433094 ,   1.        ,
        -1.38956486,   0.        ,  -0.3121032 ,  -1.        ,
       -40.        ,  -1.4239544 ,  -1.37568607, -35.67148095])

In [45]:
np.dot(mdp.state, mdp.transits[:,:,0])

array([[0. , 0. , 0. , 0. , 0. , 0. , 0. , 0.8, 0. , 0. , 0.1, 0.1]])

In [46]:
mdp.state

array([[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1.]])

In [47]:
mdp.transits[:,:,0]

array([[0.9, 0.1, 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. ],
       [0.1, 0.8, 0.1, 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. ],
       [0. , 0.1, 0.8, 0.1, 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. ],
       [0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. ],
       [0.8, 0. , 0. , 0. , 0.2, 0. , 0. , 0. , 0. , 0. , 0. , 0. ],
       [0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. ],
       [0. , 0. , 0.8, 0. , 0. , 0. , 0.1, 0.1, 0. , 0. , 0. , 0. ],
       [0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. ],
       [0. , 0. , 0. , 0. , 0.8, 0. , 0. , 0. , 0.1, 0.1, 0. , 0. ],
       [0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. , 0.1, 0.8, 0.1, 0. ],
       [0. , 0. , 0. , 0. , 0. , 0. , 0.8, 0. , 0. , 0.1, 0. , 0.1],
       [0. , 0. , 0. , 0. , 0. , 0. , 0. , 0.8, 0. , 0. , 0.1, 0.1]])