In [1]:
%pylab inline
import pandas as pd
from pyro.contrib.brm import defm, makedesc, makedata
from pyro.contrib.brm.design import metadata_from_df
from pyro.contrib.brm.family import Normal
from pyro.contrib.brm.formula import parse
from pyro.contrib.brm.model import model_repr
from pprint import pprint as pp

Populating the interactive namespace from numpy and matplotlib


## Defining the model

As an example, how might we use brmp to specify a model similar to the one used for the turk experiment? Start by considering the data we'll eventually collect. It might look something like:

In [2]:
df = pd.DataFrame(dict(
    y = np.array([0.5, 0.6, 0.4, 0.1]),
    design = pd.Categorical(np.array(['D1','D2','D1','D3'])),
    participant = pd.Categorical(np.array([0,1,2,3]))
))
df.head()

|   | y   | design | participant |
|---|-----|--------|-------------|
| 0 | 0.5 | D1     | 0           |
| 1 | 0.6 | D2     | 1           |
| 2 | 0.4 | D1     | 2           |
| 3 | 0.1 | D3     | 3           |


The primary brmp interface requires the user to provide a formula and a data frame when defining a model. For OED, however, we want to use the model *before* we have any data (i.e. sample from the prior to pick the first design), so we need a way to describe the data we will collect in order to complete the model specification.

brmp already makes use of an abstract description of the data internally. Usually this is derived from the data frame given by the user, but we can get hold of it manually:

In [3]:
meta = metadata_from_df(df)
meta.columns

[RealValued(name='y'),
 Categorical(name='design', levels=['D1', 'D2', 'D3']),
 Categorical(name='participant', levels=[0, 1, 2, 3])]

So perhaps we need to ask a user to supply this, or something like it?

This would entail specifying e.g. the number of participants ahead of time. (In this case the number of participants corresponds to the number of levels of the participant column.) This seems OK, but I wonder if it's possible/desirable to relax this.
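To make the idea of a user-supplied description concrete, here is a minimal sketch of building the same column list by hand, without a data frame. The `RealValued` and `Categorical` tuples below are hypothetical stand-ins that only mirror the reprs printed above (they are not imported from brmp), and `make_meta` is an illustrative helper name, not an existing function.

```python
from collections import namedtuple

# Hypothetical stand-ins mirroring the printed reprs; NOT brmp's own classes.
RealValued = namedtuple('RealValued', ['name'])
Categorical = namedtuple('Categorical', ['name', 'levels'])


def make_meta(num_participants, designs):
    """Describe the columns of a data frame that has not been collected yet."""
    return [
        RealValued(name='y'),
        Categorical(name='design', levels=list(designs)),
        # One level per participant, so the count must be fixed up front.
        Categorical(name='participant', levels=list(range(num_participants))),
    ]


meta_columns = make_meta(4, ['D1', 'D2', 'D3'])
for col in meta_columns:
    print(col)
```

Note that the number of participants appears here as an explicit argument, which is exactly the commitment discussed above.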

If the user also provides a formula (and response distribution) to describe the structure of the model, then we have enough information to generate a model. For example:

In [4]:
formula = parse('y ~ design + (design || participant)')
model = makedesc(formula, meta, Normal, []) # Normal is a placeholder; the real model would use zero_one_inflated_beta or similar
print(model_repr(model))

Population
----------------------------------------
Coef Priors:
design[D1]      | Cauchy(loc=0.0, scale=1.0)
design[D2]      | Cauchy(loc=0.0, scale=1.0)
design[D3]      | Cauchy(loc=0.0, scale=1.0)
Group 0
----------------------------------------
Factors: participant
Num Levels: 4
Corr. Prior: None
S.D. Priors:
design[D1]      | HalfCauchy(scale=3.0)
design[D2]      | HalfCauchy(scale=3.0)
design[D3]      | HalfCauchy(scale=3.0)
Response
----------------------------------------
Family: Normal()
Link:
  Parameter: mu
  Function:  identity
Priors:
sigma           | HalfCauchy(scale=3.0)


For reference, here's how brmp would code the design matrices for this model:

In [5]:
pp(makedata(formula, df))

{'J_0': array([0, 1, 2, 3]),
 'X': array([[1., 0., 0.],
       [0., 1., 0.],
       [1., 0., 0.],
       [0., 0., 1.]]),
 'Z_0': array([[1., 0., 0.],
       [0., 1., 0.],
       [1., 0., 0.],
       [0., 0., 1.]]),
 'y_obs': array([0.5, 0.6, 0.4, 0.1])}
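The `X` and `Z_0` matrices above are consistent with plain indicator (one-hot) coding of the `design` column, and `J_0` with the index of each row's participant level. The following sketch reproduces those values in pure Python for the example data. It is an illustration of the coding only, not brmp's implementation, and `indicator_code` is a hypothetical helper.

```python
def indicator_code(values, levels):
    """Return one row per value, with 1.0 in the column of its level."""
    return [[1.0 if v == level else 0.0 for level in levels] for v in values]


design = ['D1', 'D2', 'D1', 'D3']
design_levels = ['D1', 'D2', 'D3']
participant = [0, 1, 2, 3]
participant_levels = [0, 1, 2, 3]

# Population-level design matrix: one indicator column per design level.
X = indicator_code(design, design_levels)
# The group-level terms use the same `design` coding in this model.
Z_0 = indicator_code(design, design_levels)
# Each row's group membership, as an index into the participant levels.
J_0 = [participant_levels.index(p) for p in participant]
```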


## Optimizing over designs

Assuming we ask users to supply `formula` & `meta`, does that give us everything we need to determine the space of possible designs? We know from the formula that `y` is the response, so I think the answer is "yes" assuming that:

* Everything in the data frame other than the response corresponds to part of the design space.
* Every possible row of the data frame that's consistent with the metadata corresponds to a valid design.

If these don't hold, we need to collect more information.

Are there cases where we will want to optimize over only part of the design space? e.g. We might want the system to pick the next question to ask, but we want to manually specify the participant?
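Under the two assumptions listed above, the design space is the Cartesian product of the levels of every non-response column. Here is a minimal sketch, using plain `(name, levels)` pairs as a hypothetical stand-in for the metadata (with `levels=None` for a real-valued column). Optimizing over only part of the space then amounts to filtering on the columns that are fixed by hand.

```python
from itertools import product

# Hypothetical stand-in for the metadata: (name, levels) pairs.
columns = [
    ('y', None),
    ('design', ['D1', 'D2', 'D3']),
    ('participant', [0, 1, 2, 3]),
]


def design_space(columns, response):
    """Enumerate every row consistent with the metadata, minus the response."""
    factors = [(name, levels) for name, levels in columns if name != response]
    for name, levels in factors:
        if levels is None:
            # A real-valued design column has no finite set of levels.
            raise ValueError('cannot enumerate non-categorical column: ' + name)
    names = [name for name, _ in factors]
    level_lists = [levels for _, levels in factors]
    return [dict(zip(names, combo)) for combo in product(*level_lists)]


designs = design_space(columns, response='y')

# Partial optimization: fix the participant and let only `design` vary.
designs_for_p2 = [d for d in designs if d['participant'] == 2]
```

With three designs and four participants this yields 12 candidate rows, of which 3 remain once the participant is fixed.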

## Generating the guide

The guide will be a big ANN that takes $y$ and $d$ as input and has a (perhaps multi-way) classifier on top. It seems like we can figure out the sizes of the inputs and outputs from `formula` and `meta`. The user just needs to say something about what happens in between.

## TODO

Some things we (might) need but don't yet have:

* The ability to sample from the prior.
* A response distribution suitable for responses from a slider. e.g. zero/one inflated beta? (Assuming we want to repeat the friendliness experiment or similar.)
* Support for [distributional models](https://cran.r-project.org/web/packages/brms/vignettes/brms_distreg.html). (I think this is how one would handle the $k_i$ parameters in the turk model. This is not urgent.)
* A way for users to specify `meta`.
* Interface tweaks to support defining models (and subsequently performing inference) this way.
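To illustrate what sampling from the prior might involve for the example model, here is a hand-written sketch of the generative process implied by the model description printed earlier: Cauchy(0, 1) population coefficients, HalfCauchy(3) group-level standard deviations and `sigma`, uncorrelated per-participant offsets, and a Normal response. This is an assumption about the structure rather than brmp functionality, and all function names are hypothetical.

```python
import math
import random


def cauchy(rng, loc=0.0, scale=1.0):
    # Inverse-CDF sampling for the Cauchy distribution.
    return loc + scale * math.tan(math.pi * (rng.random() - 0.5))


def half_cauchy(rng, scale):
    # The absolute value of a zero-centred Cauchy draw is half-Cauchy.
    return abs(cauchy(rng, 0.0, scale))


def sample_prior(rng, num_designs=3, num_participants=4):
    """Draw one set of parameters from the priors shown in the model summary."""
    b = [cauchy(rng) for _ in range(num_designs)]
    sd = [half_cauchy(rng, 3.0) for _ in range(num_designs)]
    # Uncorrelated group-level offsets: one per (participant, design).
    r = [[rng.gauss(0.0, sd[k]) for k in range(num_designs)]
         for _ in range(num_participants)]
    sigma = half_cauchy(rng, 3.0)
    return dict(b=b, sd=sd, r=r, sigma=sigma)


def sample_response(rng, params, design_index, participant_index):
    """Draw y for one design row, assuming indicator coding of `design`."""
    mu = (params['b'][design_index]
          + params['r'][participant_index][design_index])
    return rng.gauss(mu, params['sigma'])


rng = random.Random(0)
params = sample_prior(rng)
y = sample_response(rng, params, design_index=0, participant_index=2)
```

The heavy tails of the Cauchy priors mean that individual draws can be very large, which is one more reason a bounded response distribution would suit slider data better than Normal.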