# Reinforcement Learning
### Value iteration

**Value Iteration** is a *dynamic programming* algorithm (a model-based method) used in **Reinforcement Learning** (RL) to compute an optimal policy for a **Markov Decision Process** (MDP). The algorithm repeatedly updates the value function until it converges to the optimal value function, from which an optimal policy can be derived.
<br>The key steps in **Value Iteration**:
 - **Initialize:** Start with an arbitrary value function $v(s)$ for all states $s$.
 - **Iterate:** Update the value function using the **Bellman optimality equation**:
    <br> $v_{k+1}(s)=\max_a \sum_{s'} p(s'|s,a)[r(s,a,s')+\gamma v_k(s')]$
    <br> where:
    <br> $p(s'|s,a)$ is the transition probability.
    <br> $r(s,a,s')$ is the reward function.
    <br> $\gamma$ is the discount factor.
 - **Convergence:** Repeat the update until the value function converges, i.e., until the largest change across all states between two successive iterations, $\max_s |v_{k+1}(s)-v_k(s)|$, is smaller than a threshold $\theta$.
 - **Extract Policy:** Once the value function has converged, extract the optimal policy by choosing, in each state, the action that maximizes the expected return: $\pi(s)=\arg\max_a \sum_{s'} p(s'|s,a)[r(s,a,s')+\gamma v(s')]$.
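The four steps can be sketched on a very small problem before moving to the grid. The two-state MDP below is hypothetical and exists only for illustration; the names `toy_model`, `toy_states`, `toy_actions`, `toy_gamma`, `q_value`, and `toy_value_iteration` are assumptions, not part of this Notebook's code. The sketch uses synchronous updates, so $v_{k+1}$ is computed entirely from $v_k$, as in the equation above.

```python
# Hypothetical two-state MDP, used only for illustration.
# toy_model[(state, action)] -> list of (probability, next_state, reward)
toy_model = {
    ("A", "stay"): [(1.0, "A", 0.0)],
    ("A", "go"):   [(0.8, "B", 1.0), (0.2, "A", 0.0)],
    ("B", "stay"): [(1.0, "B", 2.0)],
    ("B", "go"):   [(1.0, "A", 0.0)],
}
toy_states = ["A", "B"]
toy_actions = ["stay", "go"]
toy_gamma = 0.5  # Discount factor (assumed value)


def q_value(v, s, a):
    """Expected reward plus discounted value of taking action a in state s."""
    return sum(p * (r + toy_gamma * v[s_next])
               for p, s_next, r in toy_model[(s, a)])


def toy_value_iteration(theta=1e-8):
    v = {s: 0.0 for s in toy_states}  # Initialize
    while True:
        # Iterate: Bellman optimality update, computed from the old values only
        v_next = {s: max(q_value(v, s, a) for a in toy_actions)
                  for s in toy_states}
        delta = max(abs(v_next[s] - v[s]) for s in toy_states)
        v = v_next
        if delta < theta:  # Convergence
            break
    # Extract policy: greedy action with respect to the converged values
    policy = {s: max(toy_actions, key=lambda a: q_value(v, s, a))
              for s in toy_states}
    return v, policy


toy_v, toy_policy = toy_value_iteration()
print({s: round(x, 3) for s, x in toy_v.items()})
print(toy_policy)
```

For this toy problem the fixed point can be checked by hand: $v(B)=2+0.5\,v(B)$ gives $v(B)=4$, and $v(A)=0.8\,(1+0.5\cdot 4)+0.2\cdot 0.5\,v(A)$ gives $v(A)=8/3$, with the greedy policy choosing `go` in A and `stay` in B.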

<hr>

In the example in this Notebook, we use a **Grid World** environment:
 - **States:** A 3x3 grid (9 states), labeled (row, column) from (0,0) in the top-left corner to (2,2) in the bottom-right corner.
 - **Actions:** Up, Down, Left, Right.
 - **Rewards:**
    - Reaching the goal state (2,2) gives a reward of +10.
    - Reaching a "pit" state (1,1) gives a reward of −10.
    - All other transitions give a reward of −1.
 - **Terminal States:** (2,2) (goal) and (1,1) (pit).
 - **Transition Probabilities:**
    - The agent moves in the intended direction with probability 0.8; a move that would leave the grid keeps the agent in place.
    - The remaining probability 0.2 is spread evenly over the cells adjacent to the current cell.
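A transition model is a valid probability distribution only if, for every state-action pair, the probabilities over next states sum to 1. The sketch below shows one way to build such a distribution for a 3x3 grid. It assumes the slip is modeled as picking one of the four directions uniformly at random, with moves off the grid leaving the agent in place; this is an assumption made for illustration and may differ from the model used in the code of this Notebook. The names `GRID`, `MOVES`, `step`, and `next_state_distribution` are hypothetical.

```python
from collections import defaultdict

GRID = 3  # Grid side length
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}


def step(s, a):
    """Deterministic move from s in direction a, clamped to the grid."""
    i, j = s
    di, dj = MOVES[a]
    return (min(max(i + di, 0), GRID - 1), min(max(j + dj, 0), GRID - 1))


def next_state_distribution(s, a, p_intended=0.8):
    """Return {next_state: probability} for taking action a in state s."""
    dist = defaultdict(float)
    dist[step(s, a)] += p_intended
    # Assumed slip model: the remaining mass is shared by all four directions
    for b in MOVES:
        dist[step(s, b)] += (1.0 - p_intended) / len(MOVES)
    return dict(dist)


# Every distribution should sum to 1 (up to floating-point error)
for s in [(i, j) for i in range(GRID) for j in range(GRID)]:
    for a in MOVES:
        assert abs(sum(next_state_distribution(s, a).values()) - 1.0) < 1e-9

print(next_state_distribution((0, 0), "down"))
```

Under these assumptions, taking `down` from (0,0) leads to (1,0) with probability about 0.85, stays in (0,0) with about 0.10 (the `up` and `left` slips hit the wall), and reaches (0,1) with about 0.05.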

<hr>
https://github.com/ostad-ai/Reinforcement-Learning
<br> Explanation: https://www.pinterest.com/HamedShahHosseini/Reinforcement-Learning

In [2]:
# Define the grid world
states = [(i, j) for i in range(3) for j in range(3)]  # 3x3 grid
actions = ["up", "down", "left", "right"]              # Possible actions
gamma = 0.9                                           # Discount factor
theta = 1e-6                                          # Convergence threshold

# Terminal states
terminal_states = {(2, 2): 10, (1, 1): -10}  # (state: reward)

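# Transition probability p(s_ | s, a):
# 0.8 for the intended next state; the remaining 0.2 is split evenly
# among the in-grid neighbors of s.
# Note: when the intended state is itself a neighbor of s, its 0.2/n share
# is not added to the 0.8, so the probabilities over s_ sum to less than 1.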
def transition_probability(s, a, s_):
    if s in terminal_states:  # Terminal states have no transitions
        return 0
    intended_s = get_intended_state(s, a)
    if s_ == intended_s:
        return 0.8
    elif s_ in get_neighbors(s):
        return 0.2 / len(get_neighbors(s))
    else:
        return 0

# Reward function
def reward(s, a, s_):
    if s_ in terminal_states:
        return terminal_states[s_]
    return -1  # Default reward for non-terminal transitions

# State reached from s by taking action a if the move were deterministic.
# A move that would leave the grid keeps the agent in place.
def get_intended_state(s, a):
    i, j = s
    if a == "up":
        return (max(i - 1, 0), j)
    elif a == "down":
        return (min(i + 1, 2), j)
    elif a == "left":
        return (i, max(j - 1, 0))
    elif a == "right":
        return (i, min(j + 1, 2))

# Immediate neighbors of the current state s 
def get_neighbors(s):
    i, j = s
    neighbors = []
    for di, dj in [(-1, 0), (1, 0), (0, -1), (0, 1)]:
        ni, nj = i + di, j + dj
        if 0 <= ni < 3 and 0 <= nj < 3:
            neighbors.append((ni, nj))
    return neighbors

# Value function initialization
v = {s: 0 for s in states}

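# Value Iteration (in-place updates: each new v[s] is used immediately
# by the states updated later in the same sweep)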
def value_iteration():
    while True:
        delta = 0
        for s in states:
            if s in terminal_states:  # Skip terminal states
                continue
            v_current = v[s]
            v[s] = max(
                sum(transition_probability(s, a, s_)
                    * (reward(s, a, s_) + gamma * v[s_])
                    for s_ in states)
                for a in actions)
            delta = max(delta, abs(v_current - v[s]))
        if delta < theta:
            break

# Extract optimal policy
def extract_policy():
    policy = {}
    for s in states:
        if s in terminal_states:  # No action for terminal states
            policy[s] = None
            continue
        policy[s] = max(
            actions,
            key=lambda a: sum(transition_probability(s, a, s_)
                              * (reward(s, a, s_) + gamma * v[s_])
                              for s_ in states))
    return policy

In [3]:
# Run value iteration
value_iteration()
# Extract and print the optimal policy
optimal_policy = extract_policy()
print('Environment: A 3*3 Grid World')
print('Algorithm: Value Iteration')
print(30*'-')
print("Optimal Value Function v(s):")
for row in range(3):
    for col in range(3):
        print(f"State {(row,col)}: {v[(row,col)]:.2f}",end=', ')
    print('')
print("\nOptimal Policy \u03c0(s):")
for row in range(3):
    for col in range(3):
        print(f"State {(row,col)}: {optimal_policy[(row,col)]}",end=', ')
    print('')

Environment: A 3*3 Grid World
Algorithm: Value Iteration
------------------------------
Optimal Value Function v(s):
State (0, 0): 0.63, State (0, 1): 1.89, State (0, 2): 4.71, 
State (1, 0): 1.89, State (1, 1): 0.00, State (1, 2): 7.55, 
State (2, 0): 4.71, State (2, 1): 7.55, State (2, 2): 0.00, 

Optimal Policy π(s):
State (0, 0): down, State (0, 1): right, State (0, 2): down, 
State (1, 0): down, State (1, 1): None, State (1, 2): down, 
State (2, 0): right, State (2, 1): right, State (2, 2): None, 
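
The printed policy is easier to read when laid out as a grid. Below is a minimal sketch, with the policy copied by hand from the output above; `printed_policy`, `ARROWS`, and `render_policy` are hypothetical names, and terminal states are shown as `.`.

```python
# Policy copied from the printed output; None marks a terminal state
printed_policy = {
    (0, 0): "down",  (0, 1): "right", (0, 2): "down",
    (1, 0): "down",  (1, 1): None,    (1, 2): "down",
    (2, 0): "right", (2, 1): "right", (2, 2): None,
}

# ASCII symbol for each action
ARROWS = {"up": "^", "down": "v", "left": "<", "right": ">", None: "."}


def render_policy(policy, size=3):
    """Return the policy as one text row per grid row."""
    return "\n".join(
        " ".join(ARROWS[policy[(row, col)]] for col in range(size))
        for row in range(size)
    )


print(render_policy(printed_policy))
```

Read row by row, every non-terminal state steers toward the goal at (2,2), and none of the chosen actions points into the pit at (1,1).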
