### **Policies and State Value Functions**

RL objective &rarr; Formulate effective policies.\
A policy specifies which action to take in each state to maximize return.

### **Consider Grid World Example**
* Agent aims to reach the diamond while avoiding mountains.
* Nine states.
* Deterministic movements.

**Grid World example: rewards**
* Rewards are given based on the state reached:
    - Diamond: +10
    - Mountains: -2
    - Other states: -1

**Grid world example: policy**
```
# Actions: 0 = left, 1 = down, 2 = right, 3 = up
policy = {
    0: 1, 1: 2, 2: 1, 3: 1, 4: 3, 5: 1, 6: 2, 7: 3
}
# Initialize the environment
state, info = env.reset()
terminated = False
while not terminated:
    # Follow the policy until the episode terminates
    action = policy[state]
    state, reward, terminated, _, _ = env.step(action)
```

**To evaluate the policy, we use state-value functions**
* Estimate a state's worth
* Expected return starting from the state and following the policy\
$V(s)=r_{s+1}+\gamma r_{s+2}+\gamma^2 r_{s+3}+\dots+\gamma^{n-1} r_{s+n}$\
Sum of discounted rewards collected by
- starting in state $s$
- and following the policy

Each reward is discounted by a factor $\gamma$ for every step it lies in the future, and the discounted rewards are then summed.

**Grid world example: State-values**\
In our example there are nine states, so we need to compute nine state values. For simplicity, we use a discount factor $\gamma$ of 1.\
**Value of goal state**
- Starting in the goal state, the agent does not move.
- V(goal state) = 0
- Starting in state 5, the agent moves to the goal.
- V(5) = 10
- Starting in state 2, the rewards are -1 and 10.
- V(2) = -1 + 10 = 9
- And so on, until all the state values are computed.
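
The worked values above can be checked with a few lines of Python. This is a minimal sketch rather than part of the original example: `discounted_return` is a hypothetical helper name, and the reward sequences are the ones listed above for states 5 and 2.

```python
def discounted_return(rewards, gamma):
    """Sum rewards, discounting the k-th reward (0-indexed) by gamma**k."""
    return sum(gamma ** k * r for k, r in enumerate(rewards))


# Rewards collected when following the policy, as listed above.
rewards_from_5 = [10]       # state 5 -> goal
rewards_from_2 = [-1, 10]   # state 2 -> state 5 -> goal

gamma = 1  # discount factor used in this example
print(discounted_return(rewards_from_5, gamma))  # 10
print(discounted_return(rewards_from_2, gamma))  # 9
```

With a discount factor below 1, for example 0.9, the return from state 2 would shrink to -1 + 0.9 × 10 = 8, because the later reward counts for less.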


**Bellman equation**\
In practice, the Bellman equation, a recursive formula, computes state values by combining the immediate reward of the current state with the discounted value of the next state, thereby connecting each state's value to its successors. In deterministic environments like ours, this standard formula suffices, whereas non-deterministic environments require modifications to incorporate transition probabilities.\
$V(s)=r_{s+1}+\gamma V(s+1)$
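
For the non-deterministic case mentioned above, the value of a state becomes an expectation over the possible transitions: $V(s)=\sum_{s'} p(s' \mid s)\,[r+\gamma V(s')]$. The sketch below illustrates this on a made-up three-state chain. The transition table `P`, its probabilities, and the function name `evaluate_policy` are assumptions for illustration and are not part of the grid world example. The table uses the `(probability, next_state, reward, terminated)` layout of Gymnasium's toy-text environments. Because the agent can revisit a state, the values are computed by repeated sweeps instead of plain recursion.

```python
# Hypothetical 3-state chain: states 0 and 1 are non-terminal, state 2 is terminal.
# P[state][action] -> list of (probability, next_state, reward, terminated)
P = {
    0: {0: [(0.8, 1, -1, False), (0.2, 0, -1, False)]},
    1: {0: [(0.8, 2, 10, True), (0.2, 1, -1, False)]},
}
policy = {0: 0, 1: 0}   # always take action 0
terminal_state = 2
gamma = 1


def evaluate_policy(P, policy, gamma, terminal_state, sweeps=100):
    """Iteratively apply the expected Bellman update for a fixed policy."""
    V = {state: 0.0 for state in policy}
    V[terminal_state] = 0.0
    for _ in range(sweeps):
        for state, action in policy.items():
            # Weight each outcome's (reward + discounted next value) by its probability
            V[state] = sum(
                prob * (reward + gamma * V[next_state])
                for prob, next_state, reward, _ in P[state][action]
            )
    return V


V = evaluate_policy(P, policy, gamma, terminal_state)
print({s: round(v, 2) for s, v in V.items()})  # {0: 8.5, 1: 9.75, 2: 0.0}
```

When every transition has probability 1, the sum collapses to a single term and the update reduces to the deterministic formula above.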

### **Computing state-values**
```
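def compute_state_value(state):
    # The terminal state has no future rewards
    if state == terminal_state:
        return 0
    action = policy[state]
    # Assumes P[state][action] is a list of (probability, next_state, reward, terminated)
    # tuples, as in Gymnasium's toy-text environments; a deterministic move has one entry
    _, next_state, reward, _ = env.unwrapped.P[state][action][0]
    return reward + gamma * compute_state_value(next_state)
```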

```
terminal_state = 8
gamma = 1
num_states = 9  # nine states in the grid world
V = {state: compute_state_value(state) for state in range(num_states)}
```

To compare policies, we define a new policy, compute its state values, and keep the policy with the higher state values.
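
The comparison can be sketched without the environment by rebuilding the grid by hand. The block below is an illustration under stated assumptions: states are numbered row by row on a 3×3 grid with the diamond at state 8, rewards depend on the state entered, and the mountains are placed at states 4 and 7. The notes do not give the mountain positions, so that placement is hypothetical. `reward_for`, `next_state`, and `state_values` are names made up for this sketch; the two policies are the ones used in this section.

```python
# Action encoding from the example: 0 = left, 1 = down, 2 = right, 3 = up
MOVES = {0: (0, -1), 1: (1, 0), 2: (0, 1), 3: (-1, 0)}
SIZE = 3
TERMINAL_STATE = 8            # diamond (assumed bottom-right corner)
MOUNTAINS = {4, 7}            # hypothetical positions, not given in the notes
GAMMA = 1


def reward_for(state):
    """Reward received on entering a state."""
    if state == TERMINAL_STATE:
        return 10
    return -2 if state in MOUNTAINS else -1


def next_state(state, action):
    """Deterministic move; the agent stays put if it would leave the grid."""
    row, col = divmod(state, SIZE)
    d_row, d_col = MOVES[action]
    new_row, new_col = row + d_row, col + d_col
    if 0 <= new_row < SIZE and 0 <= new_col < SIZE:
        return new_row * SIZE + new_col
    return state


def state_values(policy):
    """Bellman recursion; assumes the policy reaches the goal from every state."""
    def value(state):
        if state == TERMINAL_STATE:
            return 0
        successor = next_state(state, policy[state])
        return reward_for(successor) + GAMMA * value(successor)
    return {state: value(state) for state in range(SIZE * SIZE)}


policy_1 = {0: 1, 1: 2, 2: 1, 3: 1, 4: 3, 5: 1, 6: 2, 7: 3}
policy_2 = {0: 2, 1: 2, 2: 1, 3: 1, 4: 0, 5: 0, 6: 2, 7: 2}
V1, V2 = state_values(policy_1), state_values(policy_2)

# States where each policy has the strictly higher value
better_under_1 = [s for s in V1 if V1[s] > V2[s]]
better_under_2 = [s for s in V1 if V2[s] > V1[s]]
print(V1)
print(V2)
print(better_under_1, better_under_2)  # [1, 2, 4, 5] [3, 6, 7]
```

Under these assumptions neither policy is better everywhere: the first has higher values in states 1, 2, 4, and 5, the second in states 3, 6, and 7, and they tie in state 0. A comparison therefore has to be made state by state.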

```
import gymnasium as gym
import matplotlib.pyplot as plt
```

```
def render():
    state_image=env.render()
    plt.imshow(state_image)
    plt.show()
```

```
# Create the environment
env = gym.make('MyGridWorld', render_mode='rgb_array')
state, info = env.reset()

# Define the policy
policy = {0:2, 1:2, 2:1, 3:1, 4:0, 5:0, 6:2, 7:2}

terminated = False
while not terminated:
    # Select action based on policy 
    action = policy[state]
    state, reward, terminated, truncated, info = env.step(action)
    # Render the environment
    render()
```

```
# Q (a dict keyed by (state, action)) and num_actions are assumed to be defined elsewhere
improved_policy = {}

for state in range(num_states-1):
    # Find the best action for each state based on Q-values
    max_action = max(range(num_actions), key=lambda action: Q[(state, action)])
    improved_policy[state] = max_action

# Reset the environment before running the improved policy
state, info = env.reset()
terminated = False
while not terminated:
    # Select action based on policy
    action = improved_policy[state]
    # Execute the action
    state, reward, terminated, truncated, info = env.step(action)
    render()
```