# Model-Based Output Feedback Synthesis

This document summarizes the various discussions among Lucas, Pete and Russ regarding output feedback controller synthesis.

## Research Question/Goal

- Use model-based techniques to synthesize output feedback controllers.
- Offer an alternative to the end-to-end deep learning approaches?
- Simple/interpretable controllers?
- Controllers with performance/stability guarantees?

# Technical Approaches

Here we list the different technical approaches we have discussed.

## Optimization over output feedback policies

We would like to directly optimize an output feedback policy $\pi$, i.e.

\begin{align*}
        \min_{x(\cdot), u(\cdot), \pi} \quad & \int_{t_0}^{t_f} g(x(t),u(t)) \, dt \\
        \text{s.t.} \quad & \dot{x}(t) = f(x(t),u(t)) \quad \forall t, \\
        & x(t_0) = x_0, \\
        & u(t) = \pi(y(x(t)))
\end{align*}

Here $y(x)$ is the renderer, i.e. the map from the state $x$ to the observation $y$ that the policy sees.

We would like to solve this optimization from many sampled initial conditions $x_{0,i}$ while simultaneously optimizing a feedback policy of a fixed parametric form, for example linear output feedback $u = Ky(x)$.

With many initial samples, the objective is replaced by the sum of the per-sample costs:

\begin{equation*}
 \min_{x_i(\cdot), u_i(\cdot), \pi} \ \sum_i \int_{t_0}^{t_f} g(x_i(t),u_i(t)) \, dt
\end{equation*}

The key is that the same control policy $u = \pi(y)$ is used for all samples; only the initial condition $x_{0,i}$ differs from one trajectory to the next.
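One way to make the shared-policy objective concrete is to discretize time and eliminate the trajectories by forward simulation. The following is only a sketch under assumptions not fixed in these notes: a forward-Euler step of size $h$, a horizon of $T$ steps, and a policy $\pi_K$ with parameters $K$ are all illustrative choices.

\begin{align*}
  x_{i,0} &= x_{0,i}, \\
  u_{i,k} &= \pi_K(y(x_{i,k})), \\
  x_{i,k+1} &= x_{i,k} + h \, f(x_{i,k}, u_{i,k}), \\
  J(K) &= \sum_i \sum_{k=0}^{T-1} h \, g(x_{i,k}, u_{i,k})
\end{align*}

In this form the only decision variable is $K$: every $x_{i,k}$ is determined by $K$ and $x_{0,i}$, so $J(K)$ can be differentiated through the rollout.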

### Technical Details

- This is really o-policy learning
- Use GPU + PyTorch to do forward simulation in "batch", i.e. forward simulate from many individual initial conditions at once, then use this batch simulation to take a gradient step on the policy parameters.
- Is there a better way to do 
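The batch-simulation idea above can be sketched on the double integrator. This is a minimal plain-Python sketch, not the GPU + PyTorch setup: it swaps automatic differentiation for a central finite-difference gradient. The identity renderer $y(x) = x$, the quadratic running cost, the Euler step, the step sizes, and all function names are assumptions made for illustration.

```python
def rollout_cost(K, x0, dt=0.05, steps=100):
    """Euler-simulate the double integrator under u = K . y(x); return the running cost."""
    p, v = x0
    cost = 0.0
    for _ in range(steps):
        y = (p, v)                            # assumed renderer: identity observation
        u = K[0] * y[0] + K[1] * y[1]         # linear output feedback u = K y
        cost += (p * p + v * v + u * u) * dt  # assumed quadratic running cost g(x, u)
        p, v = p + dt * v, v + dt * u         # forward-Euler step of p' = v, v' = u
    return cost


def batch_cost(K, x0_batch):
    """Sum of rollout costs over all initial conditions, with one shared gain K."""
    return sum(rollout_cost(K, x0) for x0 in x0_batch)


def finite_difference_grad(K, x0_batch, eps=1e-5):
    """Central-difference gradient of the batch cost with respect to K."""
    grad = []
    for j in range(len(K)):
        K_plus = list(K)
        K_minus = list(K)
        K_plus[j] += eps
        K_minus[j] -= eps
        diff = batch_cost(K_plus, x0_batch) - batch_cost(K_minus, x0_batch)
        grad.append(diff / (2.0 * eps))
    return grad


def gradient_step(K, x0_batch, lr=0.01):
    """One gradient-descent update of the shared gain K."""
    grad = finite_difference_grad(K, x0_batch)
    return [k - lr * g for k, g in zip(K, grad)]


# Usage: one update from a stabilizing but suboptimal initial gain.
x0_batch = [(1.0, 0.0), (0.0, 1.0), (-1.0, 1.0), (1.0, 1.0)]
K0 = [-1.0, -1.0]
K1 = gradient_step(K0, x0_batch)
print(batch_cost(K0, x0_batch), batch_cost(K1, x0_batch))
```

With autograd, `finite_difference_grad` would be replaced by backpropagation through the rollout, and the loop over `x0_batch` by a single batched tensor operation.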


### Questions

- How does this relate to standard reinforcement learning?
- What optimization techniques should we use to tackle this problem (SNOPT, PyTorch, etc.)?

### Examples

- Double integrator stabilization
- Pendulum swing up


## Simultaneous search for policy and verifying Lyapunov function

### Idea
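A sampling-based approach in the spirit of LQR-trees and SOS funnels: parameterize both the policy and a candidate Lyapunov function, and check the Lyapunov decrease condition on sampled states.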

$$ \dot{x} = f(x,u) $$

$$ u = \pi_\theta(x) $$

$$ V(x) \ \text{is p.s.d. by construction, parameterized by } \psi $$

$$ \dot{V} = \left[ \frac{dV}{dx} \right]^T \dot{x} = \left[ \frac{dV}{dx} \right]^T f(x,\pi_{\theta}(x)) $$

### Loss function

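$X = \{x_1, x_2, \ldots, x_N\}$ is a set of $N$ sampled states. Since $\dot{V}$ depends on both the policy parameters $\theta$ and the Lyapunov parameters $\psi$, the loss is a function of both:

$$ L(\theta, \psi) = \sum_{i} l(x_i, \theta, \psi) $$

\begin{equation}
  l(x_i,\theta,\psi) =
  \begin{cases}
    \dot{V}(x_i, \theta, \psi) & \text{if $\dot{V}(x_i, \theta, \psi) > 0$} \\
    0 & \text{otherwise}
  \end{cases}
\end{equation}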
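The hinge loss on $\dot{V}$ can be sketched on a small example. This is only an illustration under assumptions not made in these notes: a double integrator as the system, a linear state-feedback policy $u = -\theta^T x$, and a quadratic $V(x) = x^T L L^T x$ with $L$ lower triangular and built from $\psi$, which is one simple way to get a p.s.d. $V$ by construction. All function names are hypothetical.

```python
import math


def lyapunov_matrix(psi):
    """P = L L^T with L = [[l11, 0], [l21, l22]], so V(x) = x^T P x is p.s.d."""
    l11, l21, l22 = psi
    return [[l11 * l11, l11 * l21],
            [l11 * l21, l21 * l21 + l22 * l22]]


def policy(x, theta):
    """Assumed linear state feedback u = -theta . x."""
    return -(theta[0] * x[0] + theta[1] * x[1])


def dynamics(x, u):
    """Double integrator with state x = (position, velocity)."""
    return (x[1], u)


def v_dot(x, theta, psi):
    """Vdot = (dV/dx)^T f(x, pi_theta(x)), where dV/dx = 2 P x."""
    P = lyapunov_matrix(psi)
    grad_v = (2.0 * (P[0][0] * x[0] + P[0][1] * x[1]),
              2.0 * (P[1][0] * x[0] + P[1][1] * x[1]))
    f = dynamics(x, policy(x, theta))
    return grad_v[0] * f[0] + grad_v[1] * f[1]


def loss(samples, theta, psi):
    """Sum over samples of max(0, Vdot): only violations of Vdot <= 0 are penalized."""
    return sum(max(0.0, v_dot(x, theta, psi)) for x in samples)


# Usage: a 3x3 grid of sampled states and two candidate policies.
samples = [(p, v) for p in (-1.0, 0.0, 1.0) for v in (-1.0, 0.0, 1.0)]
psi = (math.sqrt(1.5), 0.5 / math.sqrt(1.5), math.sqrt(5.0 / 6.0))
print(loss(samples, (1.0, 1.0), psi))  # stabilizing gain
print(loss(samples, (0.0, 0.0), psi))  # zero control
```

The `psi` used here is the Cholesky factor of $P = \begin{bmatrix} 1.5 & 0.5 \\ 0.5 & 1 \end{bmatrix}$, which solves the Lyapunov equation for the closed loop under $\theta = (1, 1)$, so $\dot{V} = -x^T x$ and no sample is penalized. With zero control the same $V$ increases at some samples and the loss is positive.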


### Questions

- Is enforcing that $V$ be p.s.d. by construction too restrictive? That is, can we get function approximators that are sufficiently rich without using something like a deep neural network?

- What are the advantages/disadvantages of this method compared to the direct policy search?
    - No longer optimal control, since we have no objective function. It's really about stability more than optimality.
    - We are optimizing two things simultaneously (Lyapunov function and policy). Does this make the optimization harder and/or more unstable?

### Examples
- Pendulum swing up

# Related Work

- How does this compare to standard RL?
    - Sergey Levine: Guided Policy Search
    - Chelsea Finn: PR2 robot w/ spatula, etc.
- What are we solving that isn't solved by LQR trees?