# Markov Decision Processes (MDPs) in Reinforcement Learning

A Markov Decision Process (MDP) provides a mathematical framework for modeling decision-making in environments where outcomes are partly random and partly under the control of a decision maker. In reinforcement learning, the MDP is the standard model of the environment that the agent interacts with.

## Key Concepts in Markov Decision Processes

### Definition of an MDP

An MDP is defined by a tuple $(S, A, P, R, \gamma)$, where:
- $S$ is a finite set of states.
- $A$ is a finite set of actions.
- $P(s'|s, a)$ is the state transition probability function, representing the probability of moving to state $s'$ from state $s$ after taking action $a$.
- $R(s, a)$ is the reward function, which gives the immediate reward received after taking action $a$ in state $s$.
- $\gamma \in [0, 1]$ is the discount factor, which determines the present value of future rewards. With bounded rewards, choosing $\gamma < 1$ keeps the infinite-horizon return finite.
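
One way to hold the tuple in code is a pair of dense arrays plus the discount factor. The sketch below is only an illustration: the class name `TabularMDP`, the array layout, and the two-state toy numbers are assumptions made for this example, not part of the definition above.

```python
from dataclasses import dataclass

import numpy as np


@dataclass(frozen=True)
class TabularMDP:
    """Finite MDP (S, A, P, R, gamma) stored as dense arrays (hypothetical helper)."""

    P: np.ndarray  # P[s, a, s_next] = P(s_next | s, a), shape (|S|, |A|, |S|)
    R: np.ndarray  # R[s, a] = immediate reward, shape (|S|, |A|)
    gamma: float   # discount factor in [0, 1]

    def __post_init__(self):
        n_states, n_actions, n_next = self.P.shape
        if n_next != n_states or self.R.shape != (n_states, n_actions):
            raise ValueError("P must be (S, A, S) and R must be (S, A)")
        if np.any(self.P < 0) or not np.allclose(self.P.sum(axis=2), 1.0):
            raise ValueError("each P[s, a, :] must be a probability distribution")
        if not 0.0 <= self.gamma <= 1.0:
            raise ValueError("gamma must lie in [0, 1]")


# Made-up two-state, two-action MDP, used only to exercise the container.
toy = TabularMDP(
    P=np.array([[[0.8, 0.2], [0.1, 0.9]],
                [[0.5, 0.5], [0.0, 1.0]]]),
    R=np.array([[1.0, 0.0],
                [0.0, 2.0]]),
    gamma=0.9,
)
print(toy.P.shape, toy.R.shape, toy.gamma)
```

States and actions are identified by their integer indices here, so $S$ and $A$ are implicit in the array shapes.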

### State Transition Probability

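The state transition probability function $P(s'|s, a)$ captures the dynamics of the environment. It is the probability of transitioning to state $s'$ given that the current state is $s$ and the action taken is $a$, so for every state-action pair the probabilities over next states sum to one: $\sum_{s' \in S} P(s'|s, a) = 1$.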

### Reward Function

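The reward function $R(s, a)$ provides the immediate reward received after taking action $a$ in state $s$. When the reward also depends on the next state, written $R(s, a, s')$, $R(s, a)$ denotes its expectation $\sum_{s' \in S} P(s'|s, a) R(s, a, s')$. The reward signal guides the agent toward the optimal policy.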

### Discount Factor

The discount factor $\gamma$ weighs future rewards against immediate ones: a reward received $t$ steps in the future is multiplied by $\gamma^t$. It lies between 0 and 1. With $\gamma = 0$ the agent is myopic and considers only the immediate reward, while values close to 1 give distant rewards almost as much weight as immediate ones.
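
The effect of $\gamma$ is easy to see on a fixed reward sequence. The following sketch computes the discounted sum for a few values of $\gamma$; the function name `discounted_return` and the reward sequence are made up for illustration.

```python
def discounted_return(rewards, gamma):
    """Return sum_t gamma**t * rewards[t], accumulated from the last reward backward."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g


# Hypothetical reward sequence: three small rewards followed by a large one.
rewards = [1.0, 1.0, 1.0, 10.0]
for gamma in (0.0, 0.5, 0.9):
    print(f"gamma={gamma}: return={discounted_return(rewards, gamma):.3f}")
```

With these rewards, $\gamma = 0$ gives a return of 1 because only the first reward counts, while $\gamma = 0.5$ gives $1 + 0.5 + 0.25 + 10 \cdot 0.125 = 3$.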

### Policy

A policy $\pi(a|s)$ is a mapping from states to probabilities of selecting each possible action. The goal in an MDP is to find an optimal policy $\pi^*$ that maximizes the expected return from each state.
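
For a finite MDP, a policy can be stored as a table with one row per state. The sketch below assumes three states and two actions with made-up probabilities; `sample_action` is a hypothetical helper, not an established API.

```python
import numpy as np

# Stochastic policy: pi[s, a] = pi(a | s); each row is a distribution over actions.
pi_stochastic = np.array([
    [0.7, 0.3],
    [0.5, 0.5],
    [0.0, 1.0],
])

# Deterministic policy as one action index per state, and its one-hot equivalent.
pi_deterministic = np.array([0, 1, 1])
pi_one_hot = np.eye(pi_stochastic.shape[1])[pi_deterministic]


def sample_action(pi, state, rng):
    """Draw an action index a ~ pi(. | state)."""
    return int(rng.choice(pi.shape[1], p=pi[state]))


rng = np.random.default_rng(seed=0)
actions = [sample_action(pi_stochastic, 0, rng) for _ in range(1000)]
print("empirical frequency of action 0 in state 0:", actions.count(0) / 1000)
```

Writing a deterministic policy in one-hot form lets the same code handle both kinds of policy.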

### Value Functions

Value functions give the expected return (the discounted cumulative reward) from a given state or state-action pair under a policy.

- **State-Value Function** $V^\pi(s)$: The expected return starting from state $s$ and following policy $\pi$.

  $$
  V^\pi(s) = \mathbb{E}^\pi \left[ \sum_{t=0}^\infty \gamma^t R(s_t, a_t) \mid s_0 = s \right]
  $$

- **Action-Value Function** $Q^\pi(s, a)$: The expected return starting from state $s$, taking action $a$, and following policy $\pi$ thereafter.

  $$
  Q^\pi(s, a) = \mathbb{E}^\pi \left[ \sum_{t=0}^\infty \gamma^t R(s_t, a_t) \mid s_0 = s, a_0 = a \right]
  $$
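
The two definitions differ only in whether the first action is fixed or drawn from the policy, so averaging $Q^\pi$ over the policy's action choice recovers $V^\pi$:

$$
V^\pi(s) = \sum_{a \in A} \pi(a|s) \, Q^\pi(s, a)
$$

For a deterministic policy that selects action $\pi(s)$ in state $s$, this reduces to $V^\pi(s) = Q^\pi(s, \pi(s))$.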

### Bellman Equations

The Bellman equations provide a recursive decomposition for the value functions.

- **Bellman Expectation Equation for $V^\pi(s)$**:

  $$
  V^\pi(s) = \sum_{a \in A} \pi(a|s) \sum_{s' \in S} P(s'|s, a) \left[ R(s, a) + \gamma V^\pi(s') \right]
  $$

- **Bellman Expectation Equation for $Q^\pi(s, a)$**:

  $$
  Q^\pi(s, a) = \sum_{s' \in S} P(s'|s, a) \left[ R(s, a) + \gamma \sum_{a' \in A} \pi(a'|s') Q^\pi(s', a') \right]
  $$
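
Because the Bellman expectation equation is linear in $V^\pi$, a small finite MDP can be evaluated exactly by solving a linear system when $\gamma < 1$. The sketch below assumes the dense array layout `P[s, a, s']`, `R[s, a]`, `pi[s, a]`; the toy MDP and the name `evaluate_policy` are made up for illustration.

```python
import numpy as np


def evaluate_policy(P, R, pi, gamma):
    """Solve the Bellman expectation equation exactly for V^pi (requires gamma < 1).

    P[s, a, t] = P(t | s, a), R[s, a] = reward, pi[s, a] = pi(a | s).
    """
    n_states = P.shape[0]
    P_pi = np.einsum("sa,sat->st", pi, P)  # state-to-state transitions under pi
    R_pi = np.sum(pi * R, axis=1)          # expected one-step reward under pi
    # V = R_pi + gamma * P_pi @ V  <=>  (I - gamma * P_pi) V = R_pi
    return np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)


# Made-up two-state, two-action MDP and a uniform random policy.
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
pi = np.full((2, 2), 0.5)
gamma = 0.9

V = evaluate_policy(P, R, pi, gamma)
Q = R + gamma * P @ V  # Q^pi(s, a) = R(s, a) + gamma * sum_t P(t | s, a) V^pi(t)
print("V^pi:", V)
print("Q^pi:", Q)
```

A direct solve costs roughly cubic time in the number of states, so for larger problems the same equation is usually applied iteratively instead.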

### Optimal Value Functions

The optimal value functions $V^*(s) = \max_\pi V^\pi(s)$ and $Q^*(s, a) = \max_\pi Q^\pi(s, a)$ give the maximum expected return achievable by any policy from a state or state-action pair, respectively.

- **Bellman Optimality Equation for $V^*(s)$**:

  $$
  V^*(s) = \max_{a \in A} \sum_{s' \in S} P(s'|s, a) \left[ R(s, a) + \gamma V^*(s') \right]
  $$

- **Bellman Optimality Equation for $Q^*(s, a)$**:

  $$
  Q^*(s, a) = \sum_{s' \in S} P(s'|s, a) \left[ R(s, a) + \gamma \max_{a' \in A} Q^*(s', a') \right]
  $$
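
Unlike the expectation equations, the optimality equations contain a $\max$ and are therefore nonlinear, so they are typically solved by repeated substitution. As a rough sketch, the code below applies the right-hand side of the equation for $Q^*$ to a Q-table until it stops changing; the toy MDP, the tolerance, and the name `q_backup` are assumptions for this example.

```python
import numpy as np


def q_backup(Q, P, R, gamma):
    """Apply the right-hand side of the Bellman optimality equation for Q once."""
    return R + gamma * P @ Q.max(axis=1)


# Made-up two-state, two-action MDP.
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
gamma = 0.9

Q = np.zeros_like(R)
for _ in range(1000):
    Q_next = q_backup(Q, P, R, gamma)
    gap = np.max(np.abs(Q_next - Q))
    Q = Q_next
    if gap < 1e-10:  # stop once a backup no longer changes the table
        break

V = Q.max(axis=1)  # V*(s) = max_a Q*(s, a)
print("Q*:", Q)
print("V*:", V)
```

For $\gamma < 1$ each backup shrinks the distance to the fixed point by at least a factor of $\gamma$, which is why the loop terminates.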

### Finding the Optimal Policy

The optimal policy $\pi^*$ can be derived from the optimal value functions.

- From $V^*(s)$:

  $$
  \pi^*(s) = \text{argmax}_{a \in A} \sum_{s' \in S} P(s'|s, a) \left[ R(s, a) + \gamma V^*(s') \right]
  $$

- From $Q^*(s, a)$:

  $$
  \pi^*(s) = \text{argmax}_{a \in A} Q^*(s, a)
  $$
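
The practical difference between the two forms is that the second needs no transition model: once $Q^*$ is available, the policy is a row-wise argmax. A minimal sketch, using a made-up Q-table and a hypothetical helper name:

```python
import numpy as np


def greedy_from_q(Q):
    """Return one greedy action index per state; ties go to the lowest index."""
    return np.argmax(Q, axis=1)


# Made-up Q-table for three states and two actions (not derived from any MDP here).
Q_star = np.array([
    [1.0, 3.0],
    [2.5, 0.5],
    [4.0, 4.0],  # tie: both actions are equally good in this state
])
policy = greedy_from_q(Q_star)
print(policy)
```

When several actions tie, any of them is optimal; the tie-breaking rule here is just NumPy's default.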

## Numerical Example

Consider a simple MDP with 3 states and 2 actions.

- **States**: $S = \{s_1, s_2, s_3\}$
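- **Actions**: $A = \{a_1, a_2\}$
- **Transition Probabilities** and **Rewards** (each row is one possible transition, and the reward is received when that transition occurs, so for $(s_1, a_1)$ the expected immediate reward is $R(s_1, a_1) = 0.5 \cdot 5 + 0.5 \cdot 10 = 7.5$):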

  | State | Action | Next State | Probability | Reward |
  |-------|--------|------------|-------------|--------|
  | $s_1$ | $a_1$  | $s_2$      | 0.5         | 5      |
  | $s_1$ | $a_1$  | $s_3$      | 0.5         | 10     |
  | $s_1$ | $a_2$  | $s_2$      | 1.0         | 0      |
  | $s_2$ | $a_1$  | $s_1$      | 1.0         | -1     |
  | $s_2$ | $a_2$  | $s_3$      | 1.0         | 2      |
  | $s_3$ | $a_1$  | $s_1$      | 1.0         | 0      |
  | $s_3$ | $a_2$  | $s_3$      | 1.0         | 1      |

- **Discount Factor**: $\gamma = 0.9$
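
The table can be turned into arrays for computation. The sketch below is one possible encoding, assuming 0-based indices ($s_1 \to 0$, $a_1 \to 0$, and so on); the variable names are illustrative.

```python
import numpy as np

# Rows of the table as (state, action, next_state, probability, reward),
# with 0-based indices: s1 -> 0, s2 -> 1, s3 -> 2 and a1 -> 0, a2 -> 1.
TRANSITIONS = [
    (0, 0, 1, 0.5, 5.0),
    (0, 0, 2, 0.5, 10.0),
    (0, 1, 1, 1.0, 0.0),
    (1, 0, 0, 1.0, -1.0),
    (1, 1, 2, 1.0, 2.0),
    (2, 0, 0, 1.0, 0.0),
    (2, 1, 2, 1.0, 1.0),
]
N_STATES, N_ACTIONS, GAMMA = 3, 2, 0.9

P = np.zeros((N_STATES, N_ACTIONS, N_STATES))
R_next = np.zeros((N_STATES, N_ACTIONS, N_STATES))  # reward per (s, a, s')
for s, a, s_next, prob, reward in TRANSITIONS:
    P[s, a, s_next] = prob
    R_next[s, a, s_next] = reward

# Expected immediate reward R(s, a) = sum_{s'} P(s' | s, a) * reward(s, a, s').
R = np.sum(P * R_next, axis=2)

assert np.allclose(P.sum(axis=2), 1.0)  # every (s, a) row is a distribution
print(R)
```

Collapsing the per-transition rewards into `R[s, a]` loses nothing for value computation, because the Bellman equations only use the expected reward.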

### Value Iteration Example

1. **Initialize $V(s)$**: Initialize $V(s) = 0$ for all $s \in S$.

2. **Value Iteration Update**:

   Update $V(s)$ using the Bellman Optimality Equation for $V^*(s)$ until convergence:

   $$
   V(s) \leftarrow \max_{a \in A} \sum_{s' \in S} P(s'|s, a) \left[ R(s, a) + \gamma V(s') \right]
   $$

   For state $s_1$:

   $$
   V(s_1) = \max \left\{ 0.5 \left[ 5 + 0.9 V(s_2) \right] + 0.5 \left[ 10 + 0.9 V(s_3) \right], \left[ 0 + 0.9 V(s_2) \right] \right\}
   $$

   For state $s_2$:

   $$
   V(s_2) = \max \left\{ -1 + 0.9 V(s_1), 2 + 0.9 V(s_3) \right\}
   $$

   For state $s_3$:

   $$
   V(s_3) = \max \left\{ 0 + 0.9 V(s_1), 1 + 0.9 V(s_3) \right\}
   $$

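   Repeat the updates until $V(s)$ converges, for example until the largest change in any $V(s)$ between sweeps falls below a small tolerance. Starting from $V(s) = 0$ and updating all three states from the previous sweep's values, the first sweep gives $V(s_1) = 7.5$, $V(s_2) = 2$, and $V(s_3) = 1$.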
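
The updates above can be run to convergence in a few lines. This is a sketch under stated assumptions: states and actions use 0-based indices, rewards are stored as expected rewards per state-action pair, all states are updated synchronously, and the tolerance `tol` is an arbitrary choice.

```python
import numpy as np

# Example MDP from the table, with s1 -> 0, s2 -> 1, s3 -> 2 and a1 -> 0, a2 -> 1.
P = np.zeros((3, 2, 3))
P[0, 0, 1] = P[0, 0, 2] = 0.5
P[0, 1, 1] = 1.0
P[1, 0, 0] = 1.0
P[1, 1, 2] = 1.0
P[2, 0, 0] = 1.0
P[2, 1, 2] = 1.0
R = np.array([[7.5, 0.0], [-1.0, 2.0], [0.0, 1.0]])  # expected rewards R(s, a)
gamma = 0.9


def value_iteration(P, R, gamma, tol=1e-10, max_sweeps=10_000):
    """Synchronous value iteration; returns the value estimate and sweeps used."""
    V = np.zeros(P.shape[0])
    for sweep in range(1, max_sweeps + 1):
        V_new = np.max(R + gamma * P @ V, axis=1)  # Bellman optimality backup
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, sweep
        V = V_new
    return V, max_sweeps


V_star, sweeps = value_iteration(P, R, gamma)
print(np.round(V_star, 3), "after", sweeps, "sweeps")
```

Under these assumptions the iteration settles near $V^*(s_1) \approx 37.11$, $V^*(s_2) \approx 32.39$, and $V^*(s_3) \approx 33.39$, which matches solving the three equations by hand with $a_1$ chosen in every state.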

### Optimal Policy Derivation

Once $V(s)$ converges to $V^*(s)$, derive the optimal policy $\pi^*$:

For each state $s$, choose the action that maximizes the expected return:

$$
\pi^*(s) = \text{argmax}_{a \in A} \sum_{s' \in S} P(s'|s, a) \left[ R(s, a) + \gamma V^*(s') \right]
$$
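
Applied to the example, the extraction step is a one-step lookahead over the converged values. The sketch below recomputes an approximation of $V^*$ with a fixed number of sweeps (500 is an arbitrary, generous choice) and then takes the argmax; indices are 0-based and the array layout is an assumption of this example.

```python
import numpy as np

# Example MDP from the table, with s1 -> 0, s2 -> 1, s3 -> 2 and a1 -> 0, a2 -> 1.
P = np.zeros((3, 2, 3))
P[0, 0, 1] = P[0, 0, 2] = 0.5
P[0, 1, 1] = 1.0
P[1, 0, 0] = 1.0
P[1, 1, 2] = 1.0
P[2, 0, 0] = 1.0
P[2, 1, 2] = 1.0
R = np.array([[7.5, 0.0], [-1.0, 2.0], [0.0, 1.0]])  # expected rewards R(s, a)
gamma = 0.9

# Approximate V* with a fixed number of synchronous value-iteration sweeps.
V = np.zeros(3)
for _ in range(500):
    V = np.max(R + gamma * P @ V, axis=1)

Q = R + gamma * P @ V          # one-step lookahead value of each (s, a)
policy = np.argmax(Q, axis=1)  # 0 -> a1, 1 -> a2
for s, a in enumerate(policy):
    print(f"pi*(s{s + 1}) = a{a + 1}   lookahead values: {np.round(Q[s], 3)}")
```

For this example the greedy action is $a_1$ in every state. The margin is smallest in $s_2$, where $a_1$ is worth about 32.39 and $a_2$ about 32.06.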

### Conclusion

MDPs provide a fundamental framework for modeling decision-making processes in uncertain environments. By defining states, actions, transition probabilities, rewards, and discount factors, MDPs enable the formulation and solution of complex decision problems. Value iteration and policy iteration are key algorithms used to compute optimal policies in MDPs, forming the basis for many reinforcement learning methods.


## Key Properties of Markov Decision Processes

1. **Markov Property**: The next state depends only on the current state and action, not on the sequence of states and actions that preceded them. This is also known as the memoryless property.

   $$
   P(s_{t+1} | s_t, a_t) = P(s_{t+1} | s_0, a_0, s_1, a_1, \ldots, s_t, a_t)
   $$

2. **Stationarity**: The transition probabilities and rewards are stationary over time. That is, they do not change with time.

3. **Finite State and Action Spaces**: MDPs typically assume a finite set of states $S$ and actions $A$, making them suitable for discrete decision-making problems.

4. **Policy**: A policy $\pi$ defines the agent's way of behaving at a given time. It can be deterministic or stochastic.
   - **Deterministic Policy**: Maps each state to a specific action.
   - **Stochastic Policy**: Maps each state to a probability distribution over actions.
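
These properties show up directly in how a trajectory is simulated. In the sketch below, each next state is drawn from `P[s, a]` alone (Markov property), the same arrays are used at every step (stationarity), and the policy is a table of action probabilities. The toy MDP and the function name `rollout` are made up for illustration.

```python
import numpy as np


def rollout(P, R, pi, start_state, horizon, rng):
    """Sample (state, action, reward) triples for `horizon` steps.

    The next state is drawn from P[s, a] only, so no earlier history
    enters the transition.
    """
    n_states, n_actions, _ = P.shape
    s, trajectory = start_state, []
    for _ in range(horizon):
        a = int(rng.choice(n_actions, p=pi[s]))
        s_next = int(rng.choice(n_states, p=P[s, a]))
        trajectory.append((s, a, float(R[s, a])))
        s = s_next
    return trajectory


# Made-up two-state, two-action MDP with a uniform random policy.
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
pi = np.full((2, 2), 0.5)

trajectory = rollout(P, R, pi, start_state=0, horizon=5,
                     rng=np.random.default_rng(seed=0))
for step in trajectory:
    print(step)
```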

## Important Notes on Using MDPs

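- **Model Requirements**: Solving an MDP with dynamic programming (Value Iteration or Policy Iteration) requires a complete model of the environment: the state transition probabilities $P$ and the reward function $R$.
- **Computational Complexity**: Solving MDPs exactly using methods like Value Iteration or Policy Iteration can be computationally expensive for large state or action spaces. With a dense transition model, a single Value Iteration sweep already costs $O(|S|^2 |A|)$ operations.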
- **Bellman Equations**: The Bellman equations are central to understanding and solving MDPs, providing a recursive decomposition for value functions.
- **Optimality**: The goal is to find the optimal policy $\pi^*$ that maximizes the expected cumulative reward from any given state.
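
Policy Iteration, mentioned above, alternates exact policy evaluation with greedy improvement. The following is a minimal sketch rather than a reference implementation: it assumes $\gamma < 1$, dense arrays `P[s, a, s']` and `R[s, a]`, and it reuses the three-state example from this document with 0-based indices. The starting policy and the iteration cap are arbitrary choices.

```python
import numpy as np


def policy_iteration(P, R, gamma, initial_policy, max_iterations=1000):
    """Alternate exact evaluation and greedy improvement until the policy is stable."""
    n_states = P.shape[0]
    states = np.arange(n_states)
    policy = np.asarray(initial_policy, dtype=int)
    for _ in range(max_iterations):
        # Policy evaluation: solve (I - gamma * P_pi) V = R_pi exactly.
        P_pi = P[states, policy]  # shape (S, S)
        R_pi = R[states, policy]  # shape (S,)
        V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)
        # Policy improvement: act greedily with respect to V.
        new_policy = np.argmax(R + gamma * P @ V, axis=1)
        if np.array_equal(new_policy, policy):
            break
        policy = new_policy
    return policy, V


# Three-state example: s1 -> 0, s2 -> 1, s3 -> 2 and a1 -> 0, a2 -> 1.
P = np.zeros((3, 2, 3))
P[0, 0, 1] = P[0, 0, 2] = 0.5
P[0, 1, 1] = 1.0
P[1, 0, 0] = 1.0
P[1, 1, 2] = 1.0
P[2, 0, 0] = 1.0
P[2, 1, 2] = 1.0
R = np.array([[7.5, 0.0], [-1.0, 2.0], [0.0, 1.0]])  # expected rewards R(s, a)

# Start from "always a2" so that the improvement step has work to do.
policy, V = policy_iteration(P, R, gamma=0.9, initial_policy=[1, 1, 1])
print(policy, np.round(V, 3))
```

Each iteration is more expensive than a Value Iteration sweep because of the linear solve, but the number of iterations is usually small; on this example the policy stabilizes after a handful of improvement steps.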

## Advantages of Using MDPs

1. **Framework**: Provides a comprehensive mathematical framework for modeling sequential decision-making problems.
2. **Optimal Solutions**: Can compute optimal policies that maximize expected rewards, providing a benchmark for performance.
3. **Theoretical Foundation**: Strong theoretical foundation, leveraging concepts from probability theory and dynamic programming.
4. **Versatility**: Applicable to a wide range of problems in fields such as robotics, operations research, economics, and artificial intelligence.

## Disadvantages of Using MDPs

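1. **Model Dependency**: Exact solution methods require a known model of the environment (transition probabilities and rewards), which may not always be available in practice.
2. **Scalability**: Solving MDPs for large state or action spaces can be computationally prohibitive. When a state is described by several variables, the number of states grows exponentially with the number of variables, which is known as the "curse of dimensionality."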
3. **Stationarity Assumption**: Assumes stationary transition probabilities and rewards, which may not hold in dynamic environments.
4. **Memoryless Property**: The Markov property may be too restrictive for some real-world problems where history or long-term dependencies matter.

MDPs provide a powerful and rigorous framework for tackling decision-making problems under uncertainty. However, solving them exactly is often limited in practice by the need for a complete model and by the computational cost of large-scale problems.
