### **Action-value functions**

**Q-values**

* Expected return of:
    - Starting at a state s
    - Taking an action a
    - Following a policy

Action-value functions, also known as Q-values, estimate the expected return of starting in a state, taking a certain action, and then following a policy thereafter. The action value is therefore the sum of the immediate reward received after performing the action and the discounted value of the resulting state, computed for that specific policy. While state-value functions give a broad overview of how desirable each state is, action-value functions break this down further and show how desirable each action is within a state.

$Q(s,a)=r_{a}+\gamma V(s')$, where $s'$ is the state reached by taking action a in state s\
Action-value of state s, action a &rarr; sum of:
- reward received after performing action a in state s
- discounted value of next state resulting from action a
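
As a minimal sketch of the formula above, the hypothetical helper `q_value` (not part of the lecture code) adds the immediate reward to the discounted value of the next state. The numbers in the usage line are illustrative only.

```python
def q_value(reward, next_state_value, gamma=1.0):
    """Return r_a + gamma * V(s') for a single deterministic transition."""
    return reward + gamma * next_state_value

# Illustrative numbers: reward -1, next-state value 10, discount factor 0.9
example = q_value(-1, 10, gamma=0.9)
print(example)
```

With a discount factor of 1, the result is simply the reward plus the next state's value.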

**Grid World**\
Recall the nine states and the policy dictating the agent's deterministic movements. We previously evaluated this policy using state-value functions. Now we need to compute action values, which means computing four values for each state, one per action. We'll keep the state values on the right, as we will need them for the action-value computation.

![image](1.png)

**Q-values - state 4**\
Suppose the agent starts in state 4, where it can move in four directions.\
**State 4 - action down**\
If the agent moves down from state 4, it receives a reward of -2 and lands in state 5, whose value of 5 we calculated previously. The Q-value for moving down is therefore -2 + 5 = 3, assuming a discount factor of 1.\
**State 4 - action left**\
Moving left yields a Q-value of 1, calculated by adding a reward of -1 to the resulting state's value of 2.\
**State 4 - action up**\
Similarly, moving up results in a Q-value of 7, derived from a reward of -1 and the new state's value of 8.\
**State 4 - action right**\
Finally, when the agent moves right, it receives a reward of -1 and reaches a state of value 10, leading to a Q-value of 9.
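
The arithmetic for state 4 can be checked with a short sketch. The dictionary name `transitions_from_state_4` is made up for illustration; the rewards and next-state values are the ones quoted in the text, and the discount factor is assumed to be 1.

```python
gamma = 1.0  # discount factor assumed in the worked example

# (reward, value of the resulting state) for each action from state 4,
# taken from the numbers in the text
transitions_from_state_4 = {
    "down": (-2, 5),
    "left": (-1, 2),
    "up": (-1, 8),
    "right": (-1, 10),
}

# Q(s, a) = reward + gamma * V(next state)
q_state_4 = {
    action: reward + gamma * next_value
    for action, (reward, next_value) in transitions_from_state_4.items()
}
print(q_state_4)  # {'down': 3.0, 'left': 1.0, 'up': 7.0, 'right': 9.0}

# The action with the largest Q-value in state 4
best_action = max(q_state_4, key=q_state_4.get)
print(best_action)  # right
```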

The process is repeated for all the states until all the Q-values have been computed.

```
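def compute_q_value(state, action):
    # No action is taken from the terminal state, so it has no Q-value
    if state == terminal_state:
        return None
    # Deterministic environment: each (state, action) has a single transition,
    # assumed here to be a (probability, next_state, reward, terminated) tuple
    _, next_state, reward, _ = env.unwrapped.P[state][action][0]
    return reward + gamma * compute_state_value(next_state)

# Q-value for every (state, action) pair
Q = {(state, action): compute_q_value(state, action)
     for state in range(num_states)
     for action in range(num_actions)}
print(Q)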
```
Remember that this code builds on the code from the previous lectures, which defines `env`, `gamma`, `terminal_state`, `num_states`, `num_actions`, and `compute_state_value(state)`.
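
Because the snippet above depends on objects from earlier lectures, the following self-contained sketch may help when experimenting. It is not the course environment: the three-state corridor, its rewards, the `policy` dictionary, and the plain `P` table (standing in for `env.unwrapped.P`) are all made up for illustration, and each transition is assumed to be a `(probability, next_state, reward, terminated)` tuple.

```python
# Hypothetical 3-state corridor: 0 -> 1 -> 2, where state 2 is terminal.
# Actions: 0 = left, 1 = right. Each entry mirrors the assumed
# (probability, next_state, reward, terminated) transition layout.
P = {
    0: {0: [(1.0, 0, -1, False)], 1: [(1.0, 1, -1, False)]},
    1: {0: [(1.0, 0, -1, False)], 1: [(1.0, 2, 10, True)]},
    2: {0: [(1.0, 2, 0, True)], 1: [(1.0, 2, 0, True)]},
}
num_states, num_actions, terminal_state = 3, 2, 2
gamma = 1.0
policy = {0: 1, 1: 1}  # made-up policy: always move right

def compute_state_value(state):
    """Value of a state when following `policy` until the terminal state."""
    if state == terminal_state:
        return 0
    _, next_state, reward, _ = P[state][policy[state]][0]
    return reward + gamma * compute_state_value(next_state)

def compute_q_value(state, action):
    """Reward for `action` in `state` plus the discounted value of the next state."""
    if state == terminal_state:
        return None
    _, next_state, reward, _ = P[state][action][0]
    return reward + gamma * compute_state_value(next_state)

Q = {(s, a): compute_q_value(s, a)
     for s in range(num_states)
     for a in range(num_actions)}
print(Q)
```

In this toy setup, moving right from state 1 has the highest Q-value (10.0), since it collects the terminal reward immediately.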

Now we can improve the policy by selecting, for each state, the action with the highest Q-value.

```
improved_policy = {}
# Skip the terminal state (assumed to be the last one), which has no Q-values
for state in range(num_states - 1):
    # Among all actions, pick the one whose Q[(state, action)] is largest
    max_action = max(range(num_actions), key=lambda action: Q[(state, action)])
    improved_policy[state] = max_action
```
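
To see the greedy selection in isolation, here is a small sketch on a hand-written Q-table. The table values and the three-state layout are hypothetical, chosen only so the result is easy to verify by eye; the terminal state is skipped explicitly rather than assumed to be the last index.

```python
# Hypothetical Q-table for a 3-state problem with 2 actions;
# state 2 is terminal, so its Q-values are None
Q = {
    (0, 0): 8.0, (0, 1): 9.0,
    (1, 0): 8.0, (1, 1): 10.0,
    (2, 0): None, (2, 1): None,
}
num_states, num_actions, terminal_state = 3, 2, 2

improved_policy = {}
for state in range(num_states):
    if state == terminal_state:
        continue  # no action to choose in the terminal state
    # Greedy choice: the action with the largest Q-value in this state
    improved_policy[state] = max(range(num_actions),
                                 key=lambda action: Q[(state, action)])
print(improved_policy)  # {0: 1, 1: 1}
```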
