# Q-learning - Linear Regression Attempt
In Q-learning, the update rule is:
$$Q(s,a) \leftarrow Q(s,a) + \alpha \left(r+\gamma \max_{a'} Q(s',a') - Q(s,a)\right)$$
If the Q function had converged to the optimum, there would be nothing more to learn, so the temporal-difference error $r+\gamma \max_{a'} Q(s',a') - Q(s,a)$ would be close to 0. Learning Q can therefore be posed as a fitting problem: find a Q such that $Q(s,a)$ matches the target $r+\gamma \max_{a'} Q(s',a')$. Every environment step yields one data point, with input $(s,a)$ and target $r+\gamma \max_{a'} Q(s',a')$, and these points accumulate into a dataset for fitting.
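Written out as a least-squares objective, the fitting problem above reads as follows. This is only a sketch: the transition index $t$, the target symbol $y_t$, and the parameter vector $c$ are notation introduced here for illustration, and the target is assumed to be held fixed while $c$ is fitted.

$$y_t = r_t + \gamma \max_{a'} Q_c(s'_t, a'), \qquad \min_{c} \sum_t \left(y_t - Q_c(s_t,a_t)\right)^2$$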
The most basic fitting problem is linear regression, so the first attempt is to fit the dataset with a linear model: $$Q(s,a)=\sum_{i}{c_i s_i}+\sum_{j}{c_j a_j},$$ where $s_i$ and $a_j$ are the components of the state and the action, and $c_i$ and $c_j$ are the coefficients to be fitted.
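The linear model above is a dot product between the concatenated vector $[s, a]$ and the stacked coefficients, so fitting it is ordinary least squares without an intercept. The sketch below illustrates this on synthetic data. The sizes, the random data, and the `true_coeff` vector are hypothetical and unrelated to the actual environment, and `np.linalg.lstsq` is used here only as a stand-in for the regression routine.

```python
import numpy as np

# Hypothetical toy sizes, not the L2RunEnv dimensions.
dim_obs, dim_act = 3, 2
rng = np.random.RandomState(0)

# Assumed "true" coefficients, used only to generate synthetic targets.
true_coeff = rng.uniform(-1, 1, dim_obs + dim_act)

# Each row of X is one concatenated (state, action) pair.
states = rng.uniform(-1, 1, (50, dim_obs))
actions = rng.uniform(-0.1, 0.1, (50, dim_act))
X = np.hstack((states, actions))

# The linear Q model is a dot product with the stacked coefficients.
y = X @ true_coeff

# Least-squares fit without an intercept term.
fitted_coeff, *_ = np.linalg.lstsq(X, y, rcond=None)
```

With noise-free synthetic targets the fit should recover `true_coeff`. In the Q-learning setting the targets themselves depend on the current coefficients, which is what makes the real problem harder than this toy case.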

In [9]:
from osim.env import L2RunEnv
import numpy as np
from scipy.optimize import minimize, Bounds
from sklearn.linear_model import LinearRegression

DEFAULT_SEED = 20180101
rng = np.random.RandomState(DEFAULT_SEED)

env = L2RunEnv(visualize=False)
# Obtain the dimensions of the observation space and the action space
dim_obs = env.get_observation_space_size()
dim_act = env.get_action_space_size()

# Set the range of action values
action_low = -0.1
action_high = 0.1

# Set hyperparameters
discount = 0.0001
epsilon = 0.9
episode = 2000

WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.
WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.


In [10]:
class qfunction:
    # random initialization
    def __init__(self, dim_obs, dim_act, rng=None):
        if rng is None:
            rng = np.random.RandomState(DEFAULT_SEED)
        self.rng = rng
        
        self.dim_obs = dim_obs
        self.dim_act = dim_act
        self.obs_coeff = rng.uniform(-1, 1, dim_obs)
        self.act_coeff = rng.uniform(-1, 1, dim_act)
        self.dim = dim_obs + dim_act

    def __call__(self, obs, act):
        obs_term = obs * self.obs_coeff
        act_term = act * self.act_coeff
        res = np.sum(obs_term) + np.sum(act_term)
        return res

    def update(self, coeff):
        self.obs_coeff = coeff[:self.dim_obs]
        self.act_coeff = coeff[self.dim_obs:]

In [11]:
# Initialize Q function
qf = qfunction(dim_obs, dim_act)
model = LinearRegression(fit_intercept=False)
# Initialize the dataset:(xdata, ydata)
xdata = np.zeros((qf.dim,))
ydata = np.zeros((1,))

In [12]:
action0 = np.zeros(dim_act)
for i in range(episode):
    # Initialize a new simulation
    state = np.array(env.reset())
    reward = 0

    # Run the simulation until the framework stops it
    done = False
    while not done:
        # get the action based on Q function and epsilon greedy
        if (rng.rand() < epsilon) :
            # exploration: randomly choose an action
            action = rng.uniform(action_low, action_high, dim_act)
        else:
            # exploitation: choose the action maximizing the Q function
            action_func = lambda x: -qf(state, x)
            bnds = Bounds(action_low, action_high)
            res = minimize(action_func, action0, method='SLSQP', bounds=bnds)
            action = res.x

        # evolve the system to the next time step
        state_, reward, done, info = env.step(action)
        state_ = np.array(state_)

        # build the dataset
        action_func = lambda x: -qf(state_, x)
        bnds = Bounds(action_low, action_high)
        res = minimize(action_func, action0, method='SLSQP', bounds=bnds)
        max_q = -res.fun
        # input (s, a) and target r + gamma * max_a' Q(s', a')
        xx = np.concatenate((state, action))
        yy = np.array(reward + discount * max_q)
        # put the data point into dataset
        xdata = np.vstack((xdata, xx))
        ydata = np.vstack((ydata, yy))

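        # Do linear regression and update Q function
        model.fit(xdata, ydata)
        # coef_ has shape (1, n_features) because ydata is 2-D; flatten it so
        # the element-wise products in qfunction.__call__ stay 1-D
        qf.update(model.coef_.ravel())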

        # Update state
        state = state_

ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

## Result Discussion
The fitting process raised an error (the input to the regression contained NaN or infinite values), and the fitted coefficients did not converge during the run. The cause is the linear model itself.
When $Q(s,a)$ is maximized over the action for a given state, a linear model always attains its maximum on the boundary of the allowed action range (here $a_j = \pm 0.1$, with the sign set by the sign of $c_j$), no matter what the state $s$ is. When epsilon-greedy picks a random action, the target $r+\gamma \max_{a'}Q(s',a')$ can be much larger than the value the current fit assigns to $Q(s,a)$, and each such newly added data point increases the slope. Eventually the fitted plane becomes extremely steep, leading to the overflow error.
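The claim that the maximizing action sits on the boundary regardless of the state can be checked directly, since each term $c_j a_j$ is maximized independently. The sketch below is illustrative only: `greedy_action` and `linear_q` are hypothetical helpers, and the coefficients and states are made-up numbers, not values from the run above.

```python
import numpy as np

def linear_q(obs, act, obs_coeff, act_coeff):
    """Linear Q model: sum_i c_i s_i + sum_j c_j a_j."""
    return np.dot(obs, obs_coeff) + np.dot(act, act_coeff)

def greedy_action(act_coeff, low, high):
    """Maximizer of sum_j c_j a_j over the box low <= a_j <= high."""
    # Take the upper bound where c_j > 0 and the lower bound otherwise.
    return np.where(act_coeff > 0, high, low)

rng = np.random.RandomState(0)
obs_coeff = rng.uniform(-1, 1, 4)        # made-up state coefficients
act_coeff = np.array([0.5, -0.2, 0.7])   # made-up action coefficients
low, high = -0.1, 0.1

# The greedy action depends only on the signs of act_coeff.
best = greedy_action(act_coeff, low, high)

# Two unrelated states: the same action is greedy for both of them.
state_a = rng.uniform(-1, 1, 4)
state_b = rng.uniform(-1, 1, 4)
q_a = linear_q(state_a, best, obs_coeff, act_coeff)
q_b = linear_q(state_b, best, obs_coeff, act_coeff)
```

Because the state never enters `greedy_action`, the maximization picks the same corner of the action box for every state, which is the behavior described above.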