In [1]:
import numpy as np
import plotly.express as px
import pandas as pd
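import scipy as sp
import scipy.optimize  # make sp.optimize available on every SciPy version
import scipy.stats     # make sp.stats available on every SciPy version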

from collections import Counter
from IPython.display import clear_output

# Introduction

Instrumentation is expensive, so we want to do as little of it as possible. Yet we also want meaningful and robust models. So how do we go about achieving both of these aims at once? 

Well, it starts with recognizing that we rarely know *nothing* about a problem - we usually have some kind of educated guess. This might take the form of a class of model or of specific parameter values for that model. Regardless, we have some prior belief about what could be sensible, along with some sense of how uncertain that belief is. 

With these two things in hand we can do something rather interesting - we can predict what the world would look like if we were right. At first this doesn't seem particularly useful, because the whole point is that we don't know whether we are right. However, imagine for a moment that we are. Furthermore, imagine that we know we're going to collect only 25 samples. We can then ask: for any given 25 samples, what would a model fit to data from our assumed world look like? Which 25 samples would get us closest to our underlying model? Put another way - which samples would minimize the expected variance in our model parameters? 

We can ask this because we can, more or less, sample from our hypothesis, fit our model to that sampled data, and then do it over and over again until we're confident we understand the variance in our estimates for those specific samples. 

Now at this point you may be saying - if I just pick the same 25 sample points over and over, aren't I always going to get the same results? Nope! Remember that part of our model is an encoding of the uncertainty, or error, in its predictions. So when we draw those 25 samples over and over we will indeed get something new each time, and it is from this that our variance arises. 

Okay so what has this gotten us? Well what you can imagine is that we have a function $V(X, A)$ that gives us our expected coefficient variance given our sample $X$ and whatever set of assumptions are required to get an actual, sortable $V$. For example we might have to assume some coefficients of our model (like variance of our error term). And given we have $V$ we can now create the hyper-surface of $V$ while varying $X$. The lowest valley in this surface represents the 25 samples that would, if we're right about our assumptions, give us the most confident estimates of our model parameters. 
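
To make $V(X, A)$ concrete, here is a minimal sketch that swaps the fish model for a straight line so it stays short. The function name `expected_spread`, the assumed slope, intercept and noise level, and the two candidate designs are all illustrative choices, not part of the analysis that follows. It simulates data from the assumed world, refits, and reports how much the fitted coefficients move around.

```python
import numpy as np

def expected_spread(x, slope, intercept, sigma, trials=500, seed=0):
    """Monte Carlo estimate of V(X, A) for a straight-line model.

    x is the design X; (slope, intercept, sigma) are the assumptions A.
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    fitted = []
    for _ in range(trials):
        # Simulate the world we believe in, including its noise.
        y = intercept + slope * x + rng.normal(0, sigma, size=x.size)
        # Refit the model to the simulated data; polyfit returns [slope, intercept].
        fitted.append(np.polyfit(x, y, 1))
    fitted = np.array(fitted)
    # Sum of the standard deviations of the fitted coefficients.
    return fitted.std(axis=0).sum()

# Two candidate designs with the same number of points.
clustered = expected_spread([4, 5, 5, 5, 6], slope=2.0, intercept=1.0, sigma=1.0)
spread_out = expected_spread([0, 0, 5, 10, 10], slope=2.0, intercept=1.0, sigma=1.0)
print(clustered, spread_out)
```

Under these assumptions the spread-out design should score noticeably lower, which is the sense in which one choice of $X$ sits in a lower part of the surface than another.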

We can then go ahead and sample this data, refit our assumptions and repeat the process. What's going to happen is that our understanding of the space is going to improve over time, always encoded in $A$ and we're also going to be targeting the data we have the best reason to believe is valuable. Robust models on the cheap (hopefully). 


## A Fishy Example

Now there are much more sophisticated ways of doing this kind of learning (most allow you to minimize the size of $A$) but let's just do a simple bootstrapping example to prove the point. We'll look at the length of fish at age. 

Typically this is modeled using the von Bertalanffy growth function:

$$L=L_{\infty}(1-e^{-Kt})$$

We'll go ahead and assume the error here is normal. 
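
As a quick sketch of the model itself, the curve and its noise-free inverse can be written as below. The helper names `von_bertalanffy` and `expected_age_at_length` and the parameter values are illustrative assumptions. The inverse is handy because we will be choosing fish by length and only learning their age afterwards.

```python
import numpy as np

def von_bertalanffy(t, L_inf, K):
    """Expected length at age t."""
    return L_inf * (1 - np.exp(-K * t))

def expected_age_at_length(L, L_inf, K):
    """Invert the noise-free curve; only defined for 0 <= L < L_inf."""
    return -np.log(1 - L / L_inf) / K

rng = np.random.default_rng(0)
ages = np.array([1.0, 2.0, 5.0])

# Illustrative parameter values, chosen only for this sketch.
mean_lengths = von_bertalanffy(ages, L_inf=1000, K=0.3)

# Normal error around the curve, as assumed above.
observed = mean_lengths + rng.normal(0, 25, size=ages.size)

recovered_ages = expected_age_at_length(mean_lengths, L_inf=1000, K=0.3)
print(mean_lengths, observed, recovered_ages)
```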

Now one thing that's important here is setting up how the problem is going to work. Let's assume we are trying to learn the age-to-length relationship of Bluefish (from Dr. Seuss), but it's a rather rare species, so we want to age as few fish as possible (because aging requires killing the fish). Therefore we're going to pick fish based on their length and age them from there. The question is: which lengths should we grab to best learn this model? Suppose we can only catch 3 at a time. 

To make our lives easier we're going to bin the lengths to be integer units of measurement. 

In [48]:
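def build_pool(L_inf, K, sigma, ages, N=1000):
    """Simulate N fish per age from a von Bertalanffy curve with normal error."""
    pool = []
    for age in ages:
        # Expected length at this age plus N independent normal errors.
        lengths = L_inf * (1-np.exp(-K * age)) + np.random.normal(0, sigma, N)
        for length in lengths:
            pool.append({
                'age': age,
                # Bin to whole units and keep every length at least 1.
                'length': max(np.floor(length), 1)
            })
    return pd.DataFrame(pool)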

pool = build_pool(1000, 0.3, 0.0001, np.arange(0, 10, 0.25))
px.scatter(pool, x='age', y='length')

In [3]:
grouped_pool = pool.groupby('length').agg(list).reset_index()
grouped_pool.head()

Unnamed: 0,length,age
0,1.0,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
1,2.0,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
2,3.0,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
3,4.0,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
4,5.0,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."


In [4]:
Counter([1, 2, 2, 3]).items()

dict_items([(1, 1), (2, 2), (3, 1)])

In [5]:
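def sample_from_grouped_pool(grouped_pool, lengths):
    """Draw one aged fish per requested length (repeats allowed) from the pool."""
    sample = []
    # Counter collapses repeated lengths so each length bin is looked up once.
    for length, count in Counter(lengths).items():
        # Pick `count` ages at random from the fish in this length bin.
        # Raises IndexError if the pool holds no fish of this length.
        ages = np.random.choice(
            grouped_pool[grouped_pool['length'] == length]['age'].values[0], 
            count, replace=True
        )
        for age in ages:
            sample.append({
                'age': age,
                'length': length
            })
    return pd.DataFrame(sample)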

lengths = [100, 200, 300, 400, 500, 600, 700, 800, 900]
sample = sample_from_grouped_pool(grouped_pool, lengths)
sample.head()

Unnamed: 0,age,length
0,0.25,100
1,1.0,200
2,1.25,300
3,1.75,400
4,2.25,500


## Setup

We begin by setting up the real world. 

In [6]:
real_pool = build_pool(1000, 0.3, 25, np.arange(0, 10, 0.25), N=1000)
px.scatter(real_pool, x='age', y='length', title='Real World')

In [7]:
real_grouped_pool = real_pool.groupby('length').agg(list).reset_index()
real_grouped_pool.head()

Unnamed: 0,length,age
0,1.0,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
1,2.0,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
2,3.0,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
3,4.0,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
4,5.0,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."


Next we're going to create our initial estimate (perhaps we got these parameter estimates from a study on redfish).

In [8]:
L_inf = 1000
K = 0.5
sigma = 50
ages = np.arange(0, 10, 0.25)

With these guesses in hand we can build a pool that treats them as if they were the real world. 

In [9]:
guess_pool = build_pool(L_inf, K, sigma, ages, N=1000)
px.scatter(guess_pool, x='age', y='length', title='Guess')

In [10]:
guess_grouped_pool = guess_pool.groupby('length').agg(list).reset_index()
guess_grouped_pool.head()

Unnamed: 0,length,age
0,1.0,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
1,2.0,"[0.0, 0.0, 0.0, 0.0, 0.25, 0.25, 0.25]"
2,3.0,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0]"
3,4.0,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]"
4,5.0,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.25]"


Now the trick is we need a way to guess how much variance exists for a given set of 5 lengths that we want to measure. To do this we need a way to fit our model repeatedly on new samples from our present hypothesis. 

In [11]:
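def objective(x, sample):
    """Negative log-likelihood of the sample under a von Bertalanffy curve with normal error."""
    L_inf, K, sigma = x
    # Intermediate columns are stored on the sample so they can be inspected afterwards.
    sample['predicted_length'] = L_inf * (1-np.exp(-K * sample['age']))
    # logpdf avoids the underflow of log(pdf(...)) when a residual is many sigmas out.
    sample['neg_log_likelihood'] = -sp.stats.norm.logpdf(sample['length'] - sample['predicted_length'], 0, sigma)
    return sample['neg_log_likelihood'].sum()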

lengths = [1, 200, 400, 600, 800]
sample = sample_from_grouped_pool(guess_grouped_pool, lengths)
NLL = objective((L_inf, K, sigma), sample)
print(NLL)
sample

28.587967641453808


Unnamed: 0,age,length,predicted_length,neg_log_likelihood
0,0.0,1,0.0,4.831162
1,0.75,200,312.710721,7.371703
2,1.0,400,393.46934,4.839491
3,1.5,600,527.633447,5.878345
4,4.0,800,864.664717,5.667267
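
As a sanity check on the per-row values, the normal negative log-density has the closed form $\frac{1}{2}\ln(2\pi\sigma^2) + \frac{r^2}{2\sigma^2}$ for a residual $r$. The sketch below (the helper name `normal_nll` is an assumption for illustration) evaluates it for the first row of the table, where the observed length is 1, the predicted length is 0, and $\sigma = 50$, and compares it with SciPy's `norm.logpdf`.

```python
import numpy as np
from scipy import stats

def normal_nll(residual, sigma):
    """Closed-form negative log-density of a N(0, sigma^2) residual."""
    return 0.5 * np.log(2 * np.pi * sigma**2) + residual**2 / (2 * sigma**2)

# First row of the sample: observed length 1, predicted length 0, sigma 50.
by_hand = normal_nll(1.0 - 0.0, 50.0)
via_scipy = -stats.norm.logpdf(1.0, loc=0.0, scale=50.0)
print(by_hand, via_scipy)
```

Both values should agree with the 4.831162 shown in the first row.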


In [12]:
sp.optimize.minimize(
    objective,
    (L_inf, K, sigma),
    args=(sample,),
    bounds=((100, None), (0.1, None), (10, None))
)

  message: CONVERGENCE: REL_REDUCTION_OF_F_<=_FACTR*EPSMCH
  success: True
   status: 0
      fun: 27.78840891578647
        x: [ 9.193e+02  5.488e-01  6.273e+01]
      nit: 41
      jac: [ 5.684e-06  7.387e-03  9.237e-06]
     nfev: 188
     njev: 47
 hess_inv: <3x3 LbfgsInvHessProduct with dtype=float64>

In [13]:
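def relative_deviation_of_parameters(grouped_pool, lengths, L_inf, K, sigma, trials=25):
    """Refit the model on `trials` fresh samples and return the spread of each
    fitted parameter relative to its guessed value."""
    L_inf_arr = []
    K_arr = []
    sigma_arr = []
    for _ in range(trials):
        # New draw of ages for the same set of lengths.
        sample = sample_from_grouped_pool(grouped_pool, lengths)
        result = sp.optimize.minimize(
            objective,
            # Start each fit from the guess, jittered by about 10%.
            (
                L_inf+np.random.normal(0, L_inf / 10, 1)[0], 
                K+np.random.normal(0, K / 10, 1)[0], 
                sigma+np.random.normal(0, sigma / 10, 1)[0]
            ),
            args=(sample,),
            # Only require the parameters to be non-negative.
            bounds=((0, None), (0, None), (0, None))
        )
        L_inf_arr.append(result.x[0])
        K_arr.append(result.x[1])
        sigma_arr.append(result.x[2])
        # Clear any optimizer warnings printed during this fit.
        clear_output()
    # Dividing by the guess makes the three spreads unitless and comparable.
    return {
        'L_inf': np.std(L_inf_arr)/L_inf,
        'K': np.std(K_arr)/K,
        'sigma': np.std(sigma_arr)/sigma
    }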

lengths = [1, 200, 400, 600, 800]
dop = relative_deviation_of_parameters(guess_grouped_pool, lengths, L_inf, K, sigma)
print(sum(dop.values()))
dop

1.7743796324569223


{'L_inf': 0.7594513059582284,
 'K': 0.5972108979535181,
 'sigma': 0.41771742854517563}

In [14]:
lengths = [400, 400, 400, 400, 400]
dop = relative_deviation_of_parameters(guess_grouped_pool, lengths, L_inf, K, sigma)
print(sum(dop.values()))
dop

4.126243459063968


{'L_inf': 0.38389978028559946,
 'K': 3.386627424487119,
 'sigma': 0.3557162542912497}

In [15]:
lengths = [200, 200, 400, 400, 400]
dop = relative_deviation_of_parameters(guess_grouped_pool, lengths, L_inf, K, sigma)
print(sum(dop.values()))
dop

3.381950461854448


{'L_inf': 0.7227761640174449,
 'K': 2.171845381945127,
 'sigma': 0.48732891589187566}

Now let's go ahead and pick some random combinations of lengths and look for the best ones.

In [16]:
allowed_lengths = [100, 200, 300, 400, 500, 600, 700, 800, 900]
best_score = float('inf')
for _ in range(10):
    lengths = np.random.choice(allowed_lengths, 3)
    dop = relative_deviation_of_parameters(guess_grouped_pool, lengths, L_inf, K, sigma)
    score = sum(dop.values())
    if score < best_score:
        best_score = score
        best_lengths = lengths

best_lengths

array([600, 700, 400])
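
Ten random draws only scratch the surface of the possible designs. With 9 allowed lengths and 3 fish there are just 165 combinations (order does not matter and repeats are allowed), so an exhaustive search is also an option. The sketch below is one way to set that up; `best_design` and `toy_score` are hypothetical names, and `toy_score` is a stand-in that is not the scoring used in this notebook.

```python
from itertools import combinations_with_replacement

def best_design(allowed_lengths, n_fish, score):
    """Score every multiset of n_fish lengths and return the lowest-scoring one."""
    candidates = list(combinations_with_replacement(allowed_lengths, n_fish))
    scored = [(score(list(c)), c) for c in candidates]
    best_score, best = min(scored)
    return best, best_score, len(candidates)

allowed_lengths = [100, 200, 300, 400, 500, 600, 700, 800, 900]

# Stand-in score so the sketch runs on its own: it simply rewards designs
# whose lengths are far apart. It is NOT the notebook's scoring function.
def toy_score(lengths):
    return -(max(lengths) - min(lengths))

best, best_score, n_candidates = best_design(allowed_lengths, 3, toy_score)
print(n_candidates, best, best_score)
```

In the notebook, `score` could instead wrap `relative_deviation_of_parameters` and sum the returned values, at the cost of 25 model fits per candidate with the default `trials`.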

With these best lengths in hand we now sample from the real pool and look at our new model estimates.

In [17]:
sample = sample_from_grouped_pool(real_grouped_pool, best_lengths)
sample.head()

Unnamed: 0,age,length
0,3.0,600
1,4.5,700
2,1.5,400


In [34]:
results = []
for _ in range(50):
    sample = sample_from_grouped_pool(real_grouped_pool, best_lengths)
    result = sp.optimize.minimize(
        objective,
        (L_inf, K, sigma),
        args=(sample,),
        bounds=((100, None), (0.1, None), (10, None))
    )
    results.append({
        'L_inf': result.x[0],
        'K': result.x[1],
        'sigma': result.x[2]
    })
clear_output()
results = pd.DataFrame(results)
results.describe()

Unnamed: 0,L_inf,K,sigma
count,50.0,50.0,50.0
mean,1039.396382,0.319899,15.896759
std,324.982837,0.083018,8.420897
min,797.173331,0.1,10.0
25%,895.818047,0.279116,10.0
50%,956.53833,0.323836,12.144212
75%,1045.847252,0.369843,19.546256
max,2259.176232,0.482049,45.134032


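And boom! With just three samples the average estimates ($L_{\infty} \approx 1039$, $K \approx 0.32$) already land close to the real-world values of 1000 and 0.3, although the spread across refits is still wide. We could now repeat this several times to really home in on the values we're looking for. 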

In [35]:
results = []
for _ in range(50):
    lengths = np.random.choice(allowed_lengths, 3)
    sample = sample_from_grouped_pool(real_grouped_pool, lengths)
    result = sp.optimize.minimize(
        objective,
        (L_inf, K, sigma),
        args=(sample,),
        bounds=((100, None), (0.1, None), (10, None))
    )
    results.append({
        'L_inf': result.x[0],
        'K': result.x[1],
        'sigma': result.x[2]
    })
clear_output()
results = pd.DataFrame(results)
results.describe()

Unnamed: 0,L_inf,K,sigma
count,50.0,50.0,50.0
mean,1057.059637,0.369523,16.962039
std,339.535964,0.45717,7.803419
min,600.005453,0.1,10.0
25%,954.517744,0.282086,10.0
50%,999.217471,0.308594,15.200005
75%,1005.445302,0.343536,20.540237
max,2549.659139,3.492414,44.404054


In [49]:
results = []
for _ in range(50):
    lengths = np.array([600, 600, 900])
    sample = sample_from_grouped_pool(real_grouped_pool, lengths)
    result = sp.optimize.minimize(
        objective,
        (L_inf, K, sigma),
        args=(sample,),
        bounds=((100, None), (0.1, None), (10, None))
    )
    results.append({
        'L_inf': result.x[0],
        'K': result.x[1],
        'sigma': result.x[2]
    })
clear_output()
results = pd.DataFrame(results)
results.describe()

Unnamed: 0,L_inf,K,sigma
count,50.0,50.0,50.0
mean,1003.049596,0.299771,16.050487
std,50.892274,0.030093,7.674551
min,929.376007,0.228411,10.0
25%,959.853714,0.276427,10.0
50%,998.13693,0.304284,12.380186
75%,1038.751064,0.32015,23.1182
max,1145.039711,0.36129,36.34244


$$L = L_{\infty}(1 - e^{-Kt})$$

$$\partial_K L = tL_{\infty}e^{-Kt}$$

In [47]:
t = np.arange(0, 10, 0.25)
y = t * 1000 * np.exp(-0.3 * t)
px.scatter(x=t, y=y)
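
One way to read this curve: setting $\partial_t \left(t L_{\infty} e^{-Kt}\right) = (1 - Kt) L_{\infty} e^{-Kt} = 0$ puts the peak sensitivity to $K$ at $t = 1/K$, where the expected length is $L_{\infty}(1 - e^{-1})$. The sketch below checks this numerically for the real-world values $L_{\infty} = 1000$ and $K = 0.3$; the finer age grid and the variable names are assumptions made for the check.

```python
import numpy as np

L_inf, K = 1000.0, 0.3  # the "real world" values used earlier in the notebook

# Finer age grid than the plot, so the numeric peak is well resolved.
t = np.linspace(0.0, 10.0, 100_001)
sensitivity = t * L_inf * np.exp(-K * t)  # dL/dK at each age

t_peak_numeric = t[np.argmax(sensitivity)]
t_peak_analytic = 1.0 / K  # from d/dt (t e^{-Kt}) = 0
length_at_peak = L_inf * (1 - np.exp(-K * t_peak_analytic))
print(t_peak_numeric, t_peak_analytic, length_at_peak)
```

The peak falls at an age of about 3.3, or an expected length of about 632. That may be part of why designs containing lengths near 600 pinned down $K$ well, though this is an interpretation rather than something the runs above establish.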