# Reinforcement Learning
### Bellman equations, Bellman optimality equations, and optimal policy
The **Bellman equations** lead to the **Bellman optimality equations**, which in turn let us estimate the **optimal value functions**. The **optimal policy** can then be obtained from the optimal value functions. This is one way to solve an RL problem.
<hr>

The Bellman equation for the **state-value function** $v_\pi(s)$:
<br>$v_\pi(s)=\sum_a \pi(a|s)\sum_{s',r} p(s',r|s,a)(r+\gamma v_\pi(s'))$
<br>This states that the value of a state $s$ under policy $\pi$ is the expected immediate reward plus the discounted value of the next state $s'$, averaged over all possible actions, next states, and rewards.
<br><br> The Bellman equation for the **action-value function** $q_\pi(s,a)$:
<br>$q_\pi(s,a)=\sum_{s',r} p(s',r|s,a)(r+\gamma \sum_{a'} \pi(a'|s') q_\pi(s',a'))$
<br>This states that the value of taking action $a$ in state $s$ is the expected immediate reward plus the discounted value of the next state $s'$, averaged over all possible next states and next actions.
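
The two Bellman equations above can be applied repeatedly to evaluate a fixed policy. The following is a minimal sketch of iterative policy evaluation, assuming a small two-state MDP with deterministic transitions and a uniform random policy. The names `P`, `R`, `pi`, `q_from_v`, and `policy_evaluation` are illustrative choices, not part of the original text. It uses the fact that $v_\pi(s')=\sum_{a'} \pi(a'|s') q_\pi(s',a')$, so $q_\pi$ can be computed from $v_\pi$.

```python
# Illustrative two-state MDP (assumed for this sketch).
# P[s][a] maps next states to probabilities; R[s][a] maps next states to rewards.
P = {
    's1': {'a1': {'s2': 1.0}, 'a2': {'s1': 1.0}},
    's2': {'a1': {'s1': 1.0}, 'a2': {'s2': 1.0}},
}
R = {
    's1': {'a1': {'s2': 5}, 'a2': {'s1': 1}},
    's2': {'a1': {'s1': 3}, 'a2': {'s2': 2}},
}
gamma = 0.9

# Assumed policy: each action is chosen with probability 0.5 in every state.
pi = {s: {'a1': 0.5, 'a2': 0.5} for s in P}

def q_from_v(s, a, v):
    """Bellman equation for q_pi(s, a), written in terms of v_pi."""
    return sum(p * (R[s][a][s_next] + gamma * v[s_next])
               for s_next, p in P[s][a].items())

def policy_evaluation(pi, theta=1e-10):
    """Apply the Bellman equation for v_pi until the values stop changing."""
    v = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            new_v = sum(pi[s][a] * q_from_v(s, a, v) for a in P[s])
            delta = max(delta, abs(new_v - v[s]))
            v[s] = new_v
        if delta < theta:
            return v

v_pi = policy_evaluation(pi)
q_pi = {(s, a): q_from_v(s, a, v_pi) for s in P for a in P[s]}

for s in P:
    print(f"v_pi({s}) = {v_pi[s]:.2f}")
```

For this MDP and policy, the iteration settles at approximately $v_\pi(s_1)=27.75$ and $v_\pi(s_2)=27.25$, which can be confirmed by solving the two linear Bellman equations by hand.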
<hr>

Specifically, the optimal value functions $v_*$ or $q_*$ satisfy the Bellman optimality equations, which are recursive relationships that define the optimal values in terms of each other.
<br>a. **Bellman Optimality Equation** for $v_*(s)$:
<br>$v_*(s)=\max_a \sum_{s',r} p(s',r|s,a)(r+\gamma v_*(s'))$
<br>b. **Bellman Optimality Equation** for $q_*(s,a)$:
<br>$q_*(s,a)=\sum_{s',r} p(s',r|s,a)(r+\gamma \max_{a'} q_*(s',a'))$
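<br><br>As a brief aside (a standard identity, not spelled out in the original text), the two optimal value functions can be written in terms of each other:
<br>$v_*(s)=\max_a q_*(s,a)$
<br>$q_*(s,a)=\sum_{s',r} p(s',r|s,a)(r+\gamma v_*(s'))$
<br>Substituting the second line into the first recovers equation (a), and substituting the first line into the second recovers equation (b).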
<hr>

An **optimal policy** $\pi_*$ is a policy that maximizes the expected return (cumulative discounted reward) from every state $s$. Given the optimal value functions, we can obtain an optimal policy $\pi_*$ by choosing, in each state, an action that attains the maximum:
<br>$\pi_*(s)=\arg\max_a \sum_{s',r} p(s',r|s,a)(r+\gamma v_*(s'))$
<br> Or from the action-value function:
<br>$\pi_*(s)=\arg\max_a q_*(s,a)$
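
The Bellman optimality equation for $q_*$ and the argmax rule above can also be combined directly. The following is a rough sketch, assuming the same kind of two-state MDP with deterministic transitions; the names `P`, `R`, `q_value_iteration`, and `greedy_policy` are hypothetical and chosen only for illustration.

```python
# Illustrative two-state MDP (assumed for this sketch).
# P[s][a] maps next states to probabilities; R[s][a] maps next states to rewards.
P = {
    's1': {'a1': {'s2': 1.0}, 'a2': {'s1': 1.0}},
    's2': {'a1': {'s1': 1.0}, 'a2': {'s2': 1.0}},
}
R = {
    's1': {'a1': {'s2': 5}, 'a2': {'s1': 1}},
    's2': {'a1': {'s1': 3}, 'a2': {'s2': 2}},
}
gamma = 0.9

def q_value_iteration(theta=1e-10):
    """Iterate the Bellman optimality equation for q_* until convergence."""
    q = {(s, a): 0.0 for s in P for a in P[s]}
    while True:
        delta = 0.0
        for (s, a) in q:
            new_q = sum(
                p * (R[s][a][s_next]
                     + gamma * max(q[(s_next, a_next)] for a_next in P[s_next]))
                for s_next, p in P[s][a].items()
            )
            delta = max(delta, abs(new_q - q[(s, a)]))
            q[(s, a)] = new_q
        if delta < theta:
            return q

q_star = q_value_iteration()

# pi_*(s) = argmax_a q_*(s, a): no transition model is needed at this step.
greedy_policy = {s: max(P[s], key=lambda a: q_star[(s, a)]) for s in P}

for (s, a), value in q_star.items():
    print(f"q*({s}, {a}) = {value:.2f}")
print(greedy_policy)
```

One reason to work with $q_*$ rather than $v_*$ is visible in the last step: the greedy action is found by comparing stored values, without summing over $p(s',r|s,a)$ again.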
<hr>
In the following, we use the Bellman optimality equation to compute the optimal state-value function $v_*(s)$ by value iteration. Next, we derive the optimal policy from $v_*$. The example is slightly advanced at this stage, but it illustrates how the formulae above are used.
<hr>
https://github.com/ostad-ai/Reinforcement-Learning
<br> Explanation: https://www.pinterest.com/HamedShahHosseini/Reinforcement-Learning

In [1]:
# Example of an MDP with two states and two actions
# We use the three-argument transition probabilities
# Define the MDP parameters
states = ['s1', 's2']
actions = ['a1', 'a2']

# Transition probabilities P(s' | s, a)
transition_probs = {
    's1': {'a1': {'s2': 1.0}, 'a2': {'s1': 1.0}},
    's2': {'a1': {'s1': 1.0}, 'a2': {'s2': 1.0}},
}

# Rewards R(s, a, s')
rewards = {
    's1': {'a1': {'s2': 5}, 'a2': {'s1': 1}},
    's2': {'a1': {'s1': 3}, 'a2': {'s2': 2}},
}

# Discount factor
gamma = 0.9

# Value iteration to compute V*(s)
def value_iteration(states, actions, transition_probs, rewards, gamma, theta=1e-6):
    V = {s: 0 for s in states}  # Initialize V(s) to 0
    while True:
        delta = 0
        for s in states:
            v = V[s]
            # Compute the new value for V[s] using the Bellman optimality equation
            V[s] = max(
                sum(
                    transition_probs[s][a].get(s_next, 0)
                    * (rewards[s][a].get(s_next, 0) + gamma * V[s_next])
                    for s_next in states
                )
                for a in actions
            )
            delta = max(delta, abs(v - V[s]))
        if delta < theta:
            break
    return V

# Derive the optimal policy pi*(s) from V*(s)
def get_optimal_policy(states, actions, transition_probs, rewards, gamma, V):
    policy = {}
    for s in states:
        # Choose the action that maximizes the expected return
        policy[s] = max(
            actions,
            key=lambda a: sum(
                transition_probs[s][a].get(s_next, 0)
                * (rewards[s][a].get(s_next, 0) + gamma * V[s_next])
                for s_next in states
            )
        )
    return policy

In [2]:
# Compute the optimal value function V*(s)
V_optimal = value_iteration(states, actions, transition_probs, rewards, gamma)

# Get the optimal policy using the optimal state-value function V_optimal
optimal_policy = get_optimal_policy(states, actions, transition_probs, rewards, gamma, V_optimal)

# Print the results
print("Optimal Value Function:")
for s in states:
    print(f"V*({s}) = {V_optimal[s]:.2f}")

print("\nOptimal Policy:")
for s in states:
    print(f"pi*({s}) = {optimal_policy[s]}")

Optimal Value Function:
V*(s1) = 40.53
V*(s2) = 39.47

Optimal Policy:
pi*(s1) = a1
pi*(s2) = a1
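
The printed values can be cross-checked by hand. As a sketch, assuming the reported policy (a1 in both states) is followed, the Bellman equations reduce to two linear equations, v(s1) = 5 + gamma * v(s2) and v(s2) = 3 + gamma * v(s1), which can be solved directly. The variable names below are illustrative.

```python
gamma = 0.9

# Under the policy a1/a1: v1 = 5 + gamma*v2 and v2 = 3 + gamma*v1.
# Substituting the second into the first gives v1 = (5 + 3*gamma) / (1 - gamma**2).
v1 = (5 + 3 * gamma) / (1 - gamma ** 2)
v2 = 3 + gamma * v1

# Value of deviating to a2 for one step (reward 1 in s1, reward 2 in s2, staying in place).
alt_s1 = 1 + gamma * v1
alt_s2 = 2 + gamma * v2

print(f"v(s1) = {v1:.2f}, v(s2) = {v2:.2f}")
print(f"a2 in s1 would give {alt_s1:.2f}, a2 in s2 would give {alt_s2:.2f}")
```

The closed-form values agree with the value-iteration output (40.53 and 39.47), and in both states the alternative action a2 yields a lower value, so the Bellman optimality equation is satisfied by choosing a1.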
