# Value Iteration Methods

## Tabular Q-Learning

Given an MDP $(S, A, P, R, \gamma)$, define $V^{*}(s)$ as the optimal "value" of a state $s$: the expected sum of (discounted) rewards an agent receives when it starts in $s$ and executes optimal actions from then on. 
Similarly, define $Q^{*}(s, a)$ as the expected sum of discounted rewards when the agent starts in state $s$, executes action $a$, and acts optimally afterwards. 

Define $\pi^{*}$ as the "optimal policy": given a state $s$, $\pi^{*}(s)$ is by definition the best action to execute in that state. It is a mapping from states to actions, and it is easiest to express in terms of the functions defined above:


$\pi^{*}: S \rightarrow A$


If we know $Q^{*}$, we can easily extract the policy, because we can maximize the Q-value over all possible actions.

$\pi^{*}(s) = \text{argmax}_{a}(Q^{*}(s, a))$
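
As a minimal sketch of this extraction step, assume $Q^{*}$ is stored as a NumPy array with one row per state and one column per action. The table values and the helper name `greedy_policy` are made up for illustration.

```python
import numpy as np

# Hypothetical Q-table for 3 states and 2 actions (values are made up).
q_star = np.array([
    [0.1, 0.5],
    [0.7, 0.2],
    [0.0, 0.3],
])

def greedy_policy(q):
    """Return the greedy action for every state (argmax over the action axis)."""
    return np.argmax(q, axis=1)

policy = greedy_policy(q_star)
print(policy)  # [1 0 1]
```

No model of the environment is needed here: the table alone determines the action.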

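In the case of $V^{*}$, we need to do one ply of expectimax (a one-step lookahead), where $T(s, a, s')$ is the transition probability given by $P$ in the MDP and $R(s, a, s')$ is the reward for that transition:

$\pi^{*}(s) = \text{argmax}_{a}\sum_{s'}T(s,a,s')[R(s,a,s') + \gamma V^{*}(s')]$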
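
The one-ply lookahead can be sketched as follows, assuming $T$ and $R$ are available as dense arrays of shape `(n_states, n_actions, n_states)`. The two-state MDP, the value vector and the function name `policy_from_values` are hypothetical and serve only as an illustration.

```python
import numpy as np

def policy_from_values(T, R, V, gamma):
    """One-step lookahead: argmax_a sum_s' T[s,a,s'] * (R[s,a,s'] + gamma * V[s'])."""
    # Broadcast V over the last (next-state) axis, then sum that axis out.
    q = np.sum(T * (R + gamma * V[None, None, :]), axis=2)
    return np.argmax(q, axis=1)

# Hypothetical deterministic MDP with 2 states and 2 actions:
# action 0 always leads to state 0, action 1 always leads to state 1.
T = np.zeros((2, 2, 2))
T[:, 0, 0] = 1.0
T[:, 1, 1] = 1.0

R = np.zeros((2, 2, 2))
R[:, :, 1] = 1.0  # reward 1 for landing in state 1

V = np.array([0.0, 10.0])  # assumed state values, not computed here
policy = policy_from_values(T, R, V, gamma=0.9)
print(policy)  # [1 1]
```

Unlike extraction from $Q^{*}$, this version cannot run without the transition and reward arrays.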

Note that we need access to $T$ and $R$ (or, at least, we need to learn them as well) to use value iteration naively. Q-learning does not require this, because it updates its estimates from sampled transitions only.

To solve for $Q$, we need some kind of iterative method that will work its way up in a Dynamic Programming fashion, converging to the optimal Q values.

This is achieved by iteration on the Q-values.

Initialize $Q^0(s, a) = 0$ for all possible $(s, a)$ pairs.  
Define an update of Q, with learning rate $\alpha$, as:  
$Q^{k+1}(s_t, a_t) \leftarrow Q^{k}(s_t, a_t) + \alpha [ R(s_t, a_t, s_{t+1}) + \gamma \max_{a} Q^{k}(s_{t+1}, a) - Q^{k}(s_t, a_t) ]$

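This "nudges" the value of Q toward the sampled one-step target, and provably converges to the optimal $Q^{*}$, provided every state-action pair keeps being visited and the learning rate is decayed appropriately.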
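
A single application of this update, worked through with made-up numbers, may make the "nudge" concrete. The function name `q_update` and the table contents are assumptions for illustration only.

```python
import numpy as np

def q_update(q, s, a, r, s_next, alpha, gamma):
    """Apply one tabular Q-learning update in place and return the table."""
    td_target = r + gamma * np.max(q[s_next])  # sampled one-step target
    td_error = td_target - q[s, a]             # how far the estimate is off
    q[s, a] += alpha * td_error                # move a fraction alpha toward it
    return q

q = np.zeros((2, 2))
q[1] = [0.0, 1.0]  # pretend state 1 already has a known good action

q = q_update(q, s=0, a=1, r=0.5, s_next=1, alpha=0.4, gamma=0.95)
# target = 0.5 + 0.95 * 1.0 = 1.45, so Q(0, 1) becomes 0.4 * 1.45 = 0.58
print(q[0, 1])
```

Only the visited entry changes; all other entries of the table are left untouched.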

### How to program this  
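Creating and initializing all the Q values is trivial: a table of zeros with one row per state and one column per action.  
The example environment is FrozenLake, where the agent has to find its path across an icy lake.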

In [10]:
import gym
import numpy as np
from numpy import random

In [11]:
env = gym.make('FrozenLake-v0')

In [18]:
print(f'{env.observation_space.n}, {env.action_space.n}')
print(env.action_space.sample())


16, 4
0


In [None]:
alpha = 0.4            # learning rate
epsilon = 0.99         # initial probability of taking a random action
epsilon_decay = 0.97   # multiplicative decay applied after each episode
gamma = 0.95           # discount factor

states = env.observation_space.n
actions = env.action_space.n

qtable = np.zeros((states, actions))
for episode in range(100000):
    state, done = env.reset(), False
    
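    while not done:
        #env.render()
        # Epsilon-greedy: explore with probability epsilon, otherwise act greedily.
        if random.rand() <= epsilon:
            action = env.action_space.sample()
        else:
            action = np.argmax(qtable[state])
        next_state, reward, done, _ = env.step(action)
        # Q-learning update toward the one-step target.
        old_qval = qtable[state][action]
        qtable[state][action] = old_qval + alpha * (reward + gamma * np.max(qtable[next_state]) - old_qval)
        # Advance to the next state; otherwise every update would use the episode's start state.
        state = next_state

    # Decay exploration after each episode until epsilon falls below 0.11.
    if epsilon > 0.11:
        epsilon *= epsilon_decay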

print(qtable)
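
To see how quickly exploration fades under these settings, the decay rule from the training loop can be replayed on its own. This is only a sketch of the schedule; it reuses the same constants and does not touch the environment, and `floor` and `decays` are names introduced here.

```python
epsilon = 0.99
epsilon_decay = 0.97
floor = 0.11  # threshold used in the training loop

decays = 0
while epsilon > floor:
    epsilon *= epsilon_decay
    decays += 1

# Number of episodes with a decay step, and the value epsilon settles at.
print(decays, round(epsilon, 4))
```

With these constants epsilon reaches its final value within the first hundred episodes, so nearly all of the 100000 episodes run with a small, fixed amount of exploration.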
        