# Function approximation 


Up to now, we have been calculating $Q(s,a)$ by different methods. The core idea has always been the same: store in a lookup table the statistics of how good each action is in every state.


What could possibly go wrong here? A lookup table needs one entry per state-action pair, so it does not scale to really **large** problems:

- Backgammon: $10^{20}$ states.
- Go: $10^{170}$ states.
- Chess: $10^{120}$ states.
- Helicopter flying: continuous state space.
- For comparison, atoms in the observable universe: roughly $10^{81}$.
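To get a feel for these numbers, the following back-of-the-envelope sketch estimates the memory a lookup table would need. The 8 bytes per entry and the single action for the backgammon-sized case are illustrative assumptions, not figures from the text.

```python
# Rough memory estimate for a tabular Q function.
# Assumption: one 8-byte float per (state, action) entry.
BYTES_PER_ENTRY = 8


def q_table_bytes(n_states, n_actions):
    """Memory in bytes for a dense table with one entry per (state, action)."""
    return n_states * n_actions * BYTES_PER_ENTRY


# A toy problem with 16 states and 4 actions fits trivially.
small = q_table_bytes(16, 4)

# A backgammon-sized state space, pretending there is a single action.
huge = q_table_bytes(10**20, 1)

print(small)           # bytes for the toy problem
print(huge / 10**18)   # exabytes for the backgammon-sized table
```

Even under these generous assumptions the large table is far beyond any storage we could fill, let alone visit often enough to learn from.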

The way out of this is through **generalization**: how can we learn about the rest of the world from limited experience? To do generalization in our context, we will approximate the $Q$ function by a parameterized version. The goal is then to find a parameter vector $\theta$, living in a suitable parameter space, such that

$$\hat{Q}(s,a,\theta) \approx Q(s,a)$$

Our parameter $\theta$ might be, for instance, the weights of a neural network or the coefficients of a linear regression model.

We focus on differentiable function approximators, as this allows us to define "good" search directions to look at, but in principle we could try anything else (random forests, wavelets, Fourier basis expansion).

This is almost supervised learning, except that the data is **not stationary**: changing the parameter $\theta$ changes the policy derived from $\hat{Q}$, and therefore the rest of the trajectory and the data we collect from it.

Let us recall first gradient descent. Let $J(\theta)$ be a differentiable function of $\theta$. The **gradient** of $J(\theta)$ is defined as:
	$$ \nabla_\theta J(\theta) = \left( \frac{\partial J(\theta)}{\partial \theta_1}, \ldots  , \frac{\partial J(\theta)}{\partial \theta_n} \right)^T$$

This vector points in the direction in which the function grows fastest. To find a local minimum, we repeatedly move $\theta$ in the opposite direction, that is, $\Delta \theta := -\frac{1}{2} \alpha\nabla_\theta J(\theta)$, where $\alpha$ is a hyperparameter called the **step size**. The factor $\frac{1}{2}$ is a convention that cancels the 2 produced by differentiating a squared error.
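As a minimal sketch of the update rule above, the following code runs gradient descent on a simple quadratic. The objective, the starting point and the step size are illustrative assumptions.

```python
# Plain gradient descent on J(theta) = (theta_1 - 3)^2 + (theta_2 + 1)^2.
# The objective, starting point and step size are illustrative assumptions.

def grad_J(theta):
    """Gradient of J: the vector of partial derivatives."""
    return [2.0 * (theta[0] - 3.0), 2.0 * (theta[1] + 1.0)]


def gradient_descent(theta, alpha=0.1, n_steps=200):
    for _ in range(n_steps):
        g = grad_J(theta)
        # Delta theta = -1/2 * alpha * grad J(theta): step against the gradient.
        theta = [t - 0.5 * alpha * gi for t, gi in zip(theta, g)]
    return theta


theta_star = gradient_descent([0.0, 0.0])
print(theta_star)  # close to the minimiser (3, -1)
```

With this step size the distance to the minimiser shrinks by a constant factor at every step, which is why a few hundred iterations are plenty here.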


**What does it have to do with RL?** Our goal is to find a parameter $\theta^*$ that minimizes the mean squared error:
$$J(\theta) := \mathbb E\left[ (Q(s,a)-\hat{Q}(s,a,\theta))^2 \right]$$

where $Q(s,a)$ is the true value function, $\hat{Q}(s, a, \theta)$ is the approximation, and the expectation runs over the state-action pairs $(s,a) \in \mathcal S \times \mathcal A$.
The gradient descent update is then:

$$\begin{aligned}
    \Delta \theta  &= -\frac{1}{2} \alpha\nabla_\theta J(\theta) \\
    &= \alpha\, \mathbb E \left[ (Q(s,a)-\hat{Q}(s,a,\theta))\nabla_\theta \hat{Q}(s,a,\theta)\right]
    \end{aligned} $$
    
There is, however, a drawback: we still need to calculate the expectation, which means a pass over all states. We can solve this issue by doing **stochastic gradient descent** instead of the full gradient descent update: we sample one or a few state-action pairs and replace the expectation by the average over those samples.
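The difference between the two updates can be sketched as follows. This is a toy setup under stated assumptions: the features, the "true" $Q$ (chosen to be exactly linear so that the fit can succeed), the batch size and the step size are all made up for illustration, and we are still "cheating" by using the true $Q$ as the target.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup (all assumptions): 50 state-action pairs, each described by
# 3 features, and a "true" Q that happens to be exactly linear in them.
n_pairs, n_features = 50, 3
X = rng.normal(size=(n_pairs, n_features))   # one feature row per (s, a)
theta_true = np.array([1.0, -2.0, 0.5])
Q_true = X @ theta_true


def full_gradient_step(theta, alpha):
    """Uses every pair: Delta theta = alpha * E[(Q - Q_hat) * grad Q_hat]."""
    errors = Q_true - X @ theta
    return theta + alpha * (errors[:, None] * X).mean(axis=0)


def sgd_step(theta, alpha, batch_size=4):
    """Same update, with the expectation replaced by a small sample average."""
    idx = rng.integers(0, n_pairs, size=batch_size)
    errors = Q_true[idx] - X[idx] @ theta
    return theta + alpha * (errors[:, None] * X[idx]).mean(axis=0)


theta = np.zeros(n_features)
for _ in range(5000):
    theta = sgd_step(theta, alpha=0.05)

print(theta)  # should end up near theta_true
```

Each stochastic step touches only four pairs instead of all fifty, which is the whole point once the number of states is too large to enumerate.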


So the previous steps can help us find a value function approximation in the event we have discrete states and actions. What happens if the state itself is continuous? In this case we represent the full state as **features**: These could be, for instance, the distance from a robot to (each) wall.

The feature map sends each state to a vector of $m$ numbers, where $m$ is typically much smaller than the number of states:

$$ s \mapsto \mathbf x(s) := (\mathbf x_1(s), \mathbf x_2(s), \ldots, \mathbf x_m(s)) $$

and we can represent the value function as a linear combination of features:
$$ \hat{v}(s,\theta) := \sum_{i =1}^m\mathbf x_i(s)\theta_i $$

Since $\nabla_\theta \hat{v}(s,\theta) = \mathbf x(s)$, the stochastic gradient update becomes:
$$ \Delta \theta = \alpha (v(s)-\hat{v}(s,\theta))\mathbf x(s)$$

The same holds for $\hat{Q}(s,a,\theta)$, with $Q(s,a)$ in place of $v(s)$ and, for example, a separate weight vector for each action.
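A minimal sketch of the linear case, using the robot example from the text. The room size, the bias feature, the state and the target value are hypothetical choices made only for illustration.

```python
import numpy as np


def features(state):
    """Hypothetical feature map x(s): distances from a robot at (px, py)
    to the four walls of a 10 x 10 room, plus a constant bias feature."""
    px, py = state
    return np.array([px, 10.0 - px, py, 10.0 - py, 1.0])


def v_hat(state, theta):
    """Linear value estimate: sum_i x_i(s) * theta_i."""
    return features(state) @ theta


def linear_update(state, target, theta, alpha=0.005):
    """Delta theta = alpha * (target - v_hat(s, theta)) * x(s)."""
    x = features(state)
    return theta + alpha * (target - x @ theta) * x


theta = np.zeros(5)
state, target = (2.0, 7.0), 1.0   # made-up state and target value
before = abs(target - v_hat(state, theta))
theta = linear_update(state, target, theta)
after = abs(target - v_hat(state, theta))
print(before, after)  # the error on this state shrinks after one update
```

Five weights cover every position in the room, so an update at one position also changes the estimates at nearby positions. That sharing is the generalization we were after.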



As a final remark, note that table lookup is a special case of value function approximation: with one-hot features, each weight is a table entry. Note also that we have been assuming that we know the true value function. That is, of course, *cheating*: we do not know the value function (there is no supervisor), we only have rewards. How do we stop cheating then?


We can think of each method we have seen as mapping a state-action pair to a target value:

- In DP: $(s,a) \mapsto \mathcal R^a_s + \gamma\sum_{s' \in \mathcal S}\mathcal P^a_{ss'}\cdot  \max_{a' \in \mathcal A}\hat{Q}(s',a',\theta)$

- In Monte-Carlo: $(S_t,A_t) \mapsto G_t = R_{t+1}+\gamma R_{t+2}+\gamma^2R_{t+3}+\ldots$

- In TD: $(S_t,A_t) \mapsto R_{t+1}+\gamma \hat{Q}(S_{t+1},A_{t+1},\theta)$

In general, we have something of the form $(s,a) \mapsto g$, where $g$ is some target value. Up to now, we have been doing rather simple updates: moving the estimated value "a bit more" towards $g$, as in SARSA or Q-learning. However, viewing each backup as a *training example*, we can use any **supervised learning** method to estimate the value function.
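The Monte-Carlo and TD targets can be computed from sampled experience alone. The sketch below does so for the first step of a short, made-up episode; the rewards, the discount factor and the value of `q_next` are illustrative assumptions.

```python
# Targets g for the first step of a short, made-up episode.
# rewards[k] is R_{k+1}; q_next stands in for Q_hat(S_1, A_1, theta).
gamma = 0.9
rewards = [0.0, 0.0, 1.0]
q_next = 0.5   # hypothetical current estimate for the next state-action pair


def mc_return(rewards, gamma):
    """Monte-Carlo target: G_t = R_{t+1} + gamma R_{t+2} + gamma^2 R_{t+3} + ..."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g


def td_target(reward, gamma, q_next):
    """TD target: R_{t+1} + gamma * Q_hat(S_{t+1}, A_{t+1}, theta)."""
    return reward + gamma * q_next


g_mc = mc_return(rewards, gamma)
g_td = td_target(rewards[0], gamma, q_next)
# Each (state, action) -> g pair is one supervised training example.
print(g_mc, g_td)
```

The Monte-Carlo target needs the whole episode, while the TD target needs only one step plus the current estimate.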
 	
How do we stop cheating, then? Instead of the true value function $v(s)$, or the action-value function $Q(s,a)$, we plug one of the targets $g$ listed above into the update.


With the TD target, this means **bootstrapping**: updating the value function from other estimates. Because the target itself depends on $\theta$ but is treated as a constant, the update is no longer a true gradient descent step. For off-policy methods such as Q-learning, where the target is computed with a different policy than the one generating the data, convergence with function approximation is not guaranteed in theory. In practice, it often does converge.
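A single bootstrapped update with a linear approximator can be sketched as follows. This is the Q-learning variant, often called a semi-gradient update because the target is held fixed while differentiating. The weight layout (one column per action), the step size and the tiny example are assumptions made for illustration.

```python
import numpy as np


def q_values(w, x):
    """Q_hat(s, ., theta) for all actions; w has shape (n_features, n_actions)."""
    return x @ w


def q_learning_update(w, x, a, reward, x_next, done, alpha=0.1, gamma=1.0):
    """One semi-gradient Q-learning step (modifies w in place).

    The target bootstraps from max_a' Q_hat(s', a', theta) and is treated as
    a constant: only Q_hat(s, a, theta) is differentiated.
    """
    target = reward if done else reward + gamma * np.max(q_values(w, x_next))
    error = target - q_values(w, x)[a]
    w[:, a] += alpha * error * x   # gradient of x @ w[:, a] w.r.t. w[:, a] is x
    return error


# Tiny made-up example: 3 one-hot states, 2 actions.
w = np.zeros((3, 2))
x0, x1 = np.eye(3)[0], np.eye(3)[1]
w[1] = [0.2, 0.8]   # pretend we already learned something about state 1
err = q_learning_update(w, x0, a=1, reward=0.0, x_next=x1, done=False)
print(err, w[0])
```

Replacing the `max` by the value of the action actually chosen in the next state would give the SARSA target instead.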



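The following code applies these ideas to FrozenLake: a SARSA-style agent whose action values come from a linear estimator over one-hot state features.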

import numpy as np
import gym


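class Estimator:
    """Linear action-value estimator: Q_hat(s, a) = x(s) . w[:, a]."""

    def __init__(self):
        np.random.seed(1)  # make the exploration noise reproducible
        self.w = np.ones((16, 4))  # 16 states x 4 actions, all weights start at 1
        self.alpha = 0.01  # step size

    def featurize(self, s):
        # One-hot encoding: with these features the model is a lookup table.
        x = np.zeros(16)
        x[s] = 1
        return x

    def predict(self, s):
        # Estimated values of all actions in state s.
        return np.matmul(self.featurize(s), self.w)

    def update(self, s, a, td_target):
        # Gradient step on Q_hat(s, a): its gradient w.r.t. w[:, a] is x(s).
        x = self.featurize(s)
        error = td_target - self.predict(s)[a]
        self.w[:, a] += self.alpha * error * x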
        
    
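def epsilon_greedy_policy(estimator, epsilon, actions):
    """Return an epsilon-greedy policy for the given estimator.

    With probability epsilon pick a random action from `actions`,
    otherwise pick the action with the highest estimated value."""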
    
    def policy_fn(state):
        if np.random.rand()>epsilon:
            action = np.argmax(estimator.predict(state))
        else:
            action = np.random.choice(actions)
        return action
    return policy_fn


estimator = Estimator()

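# Assumes the classic gym API: reset() returns the state, step() returns 4 values.
# Newer gym/gymnasium releases register this environment as "FrozenLake-v1".
env = gym.make("FrozenLake-v0")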

gamma = 1

n_episodes = 10000


actions = range(env.action_space.n)

score = []    
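for j in range(n_episodes):
    done = False
    state = env.reset()
    # Exploration decays with the episode index
    # (fully random for the first 100 episodes).
    policy = epsilon_greedy_policy(estimator,
                                   epsilon=100. / (j + 1), actions=actions)
    # SARSA: choose the first action before entering the loop.
    action = policy(state)

    ### Generate sample episode
    while not done:

        # Take the action chosen earlier, then choose the next one.
        new_state, reward, done, _ = env.step(action)
        new_action = policy(new_state)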
                       
        #Calculate the td_target
        if done:
            td_target = reward
        else:
            new_q_val = estimator.predict(new_state)[new_action]
            td_target = reward + gamma * new_q_val
        
        estimator.update(state,action, td_target)    
        state, action = new_state, new_action
            
        if done:
            if len(score) < 100:
                score.append(reward)
            else:
                score[j % 100] = reward
            # print("\rEpisode {} / {}. Avg score: {}".format(
            #     j + 1, n_episodes, np.mean(score)), end="")

env.close()

## Exercises
- As usual, try to beat the benchmark. You can use different estimators, instead of linear functions. 
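
As one possible starting point for the exercise, here is a sketch of a non-linear estimator with the same `predict`/`update` interface as `Estimator`. The class name `MLPEstimator`, the hidden layer size, the initialisation and the step size are all hypothetical and untuned; whether it beats the linear estimator is left to experiment.

```python
import numpy as np


class MLPEstimator:
    """One-hidden-layer network with the same predict/update interface as
    the linear Estimator. Sizes and step size are untuned guesses."""

    def __init__(self, n_states=16, n_actions=4, n_hidden=32,
                 alpha=0.01, seed=1):
        rng = np.random.default_rng(seed)
        self.n_states = n_states
        self.alpha = alpha
        self.W1 = rng.normal(scale=0.1, size=(n_states, n_hidden))
        self.b1 = np.zeros(n_hidden)
        self.W2 = rng.normal(scale=0.1, size=(n_hidden, n_actions))
        self.b2 = np.zeros(n_actions)

    def featurize(self, s):
        x = np.zeros(self.n_states)
        x[s] = 1.0
        return x

    def _forward(self, s):
        x = self.featurize(s)
        h = np.tanh(x @ self.W1 + self.b1)
        return x, h, h @ self.W2 + self.b2

    def predict(self, s):
        return self._forward(s)[2]

    def update(self, s, a, td_target):
        x, h, q = self._forward(s)
        error = td_target - q[a]
        # Gradient of q[a] w.r.t. the hidden pre-activations (tanh derivative).
        grad_z = self.W2[:, a] * (1.0 - h ** 2)
        # Move every parameter along error * gradient of q[a].
        self.W2[:, a] += self.alpha * error * h
        self.b2[a] += self.alpha * error
        self.W1 += self.alpha * error * np.outer(x, grad_z)
        self.b1 += self.alpha * error * grad_z
```

Because the interface is unchanged, swapping `Estimator()` for `MLPEstimator()` in the training loop should be the only change needed to try it.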