In [14]:
import pandas as pd
import numpy as np
import plotly.express as px
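import scipy as sp
import scipy.optimize  # makes sp.optimize available on older SciPy versions
import scipy.stats  # makes sp.stats available on older SciPy versions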

# Introduction

Abundance by age (also known as age structure) is a significant part of any stock assessment. Today, age structure is established through stratified sampling of ages by length, and this process is reasonably intensive. More or less, you have to catch loads of fish, measure them all, and then take samples from each length grouping (say, by centimeter). Each sampled fish then needs an otolith removed and preserved. All those otoliths are sent back to the lab, where they are individually aged. Finally you have the information you need to create an age structure (which is just, per length group, how many fish of each age are expected). Then you apply the age structure against all the lengths in a specific sample to get the full distribution of ages.
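
To make the procedure concrete, here is a minimal sketch of that per-length-group table (often called an age-length key) on made-up data. The numbers and the names `aged`, `measured`, and `alk` are hypothetical and are not taken from the Griffin data used later.

```python
import pandas as pd

# Hypothetical aged subsample: length bin (cm) and otolith age for each fish.
aged = pd.DataFrame({
    "length_cm": [10, 10, 10, 11, 11, 12, 12, 12],
    "age":       [0,  0,  1,  1,  1,  1,  2,  2],
})
# Hypothetical full catch: every fish measured, none aged.
measured = pd.DataFrame({"length_cm": [10] * 30 + [11] * 20 + [12] * 10})

# Age-length key: within each length bin, the proportion of each age.
alk = pd.crosstab(aged["length_cm"], aged["age"], normalize="index")

# Apply the key: fish per bin times the age proportions in that bin, summed over bins.
bin_counts = measured["length_cm"].value_counts().sort_index()
age_counts = alk.mul(bin_counts, axis=0).sum()
print(age_counts)
```

Every aged fish here costs an otolith, which is why the question of how few we can get away with matters.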

So here's the question: can we use optimized query design (OQD) to pick fewer otoliths and get just as good a result? Well, OQD requires first and foremost a hypothesis as encoded by a model. So let's start there.

# The Model

A pretty common relationship between age and length in fisheries is a growth model, of which the von Bertalanffy is the classic. It looks something like this:

$$L = V(t) + \epsilon, \quad V(t) = L_{\infty}(1 - e^{-k(t-t_0)})$$

We're going to make our lives easier (just for the sake of argument) by assuming that $\epsilon$ follows a normal distribution $N(0, \sigma)$ that is independent of length.
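
As a quick sketch of what this model says, the snippet below evaluates the growth curve and adds normal noise. The parameter values are hypothetical, picked only for illustration, and `von_bertalanffy` is a helper name introduced here.

```python
import numpy as np

def von_bertalanffy(t, L_inf, k, t_0):
    """Mean length at age t under the von Bertalanffy growth curve."""
    return L_inf * (1.0 - np.exp(-k * (np.asarray(t, dtype=float) - t_0)))

# Hypothetical parameter values, chosen only for illustration.
L_inf, k, t_0, sigma = 300.0, 0.25, -1.5, 20.0

rng = np.random.default_rng(0)
ages = np.arange(0, 8)
mean_lengths = von_bertalanffy(ages, L_inf, k, t_0)
# One simulated length per age: the mean plus N(0, sigma) noise.
simulated = mean_lengths + rng.normal(0.0, sigma, size=ages.size)
print(np.round(mean_lengths, 1))
```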

Now this model is great and all, but it's also in the wrong direction. Why? Because we have no idea what ages we're going to sample! The whole point is that finding ages is hard but finding lengths is easy. So we need to invert this whole thing.

However, we also don't really care about the specific age per length. What we care about is the relative probabilities of getting any specific age at a specific length. Call that $P(t|L)$. Applied across all of our lengths, this would tell us the relative abundance of each of our ages $t$. What's nice is that for a specific age $t$ we know the mean length (that's just given by our von Bertalanffy curve), and we also know (or rather are assuming) that $L$ follows a normal distribution around that mean. In other words, we have the probability of a given length *given* an age $t$.

$$P(L_\delta|t)=\int_{L-\delta}^{L+\delta}\frac{1}{\sigma \sqrt{2\pi}}e^{-\frac{1}{2}(\frac{l-V(t)}{\sigma})^2}dl$$

Where the integrand is just the probability density function of a normal distribution $N(V(t), \sigma)$. Note that because this is a probability density function, we've had to integrate over a bin (of half-width $\delta$ around $L$) to get an actual probability.
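
In practice the integral doesn't need to be computed numerically, because it is just a difference of two normal CDF values. The sketch below checks that against direct quadrature; the numbers are hypothetical and `prob_length_bin` is a name made up for this example.

```python
from scipy import integrate, stats

def prob_length_bin(length, delta, mean, sigma):
    """P(L - delta <= l <= L + delta) for l ~ N(mean, sigma), via the CDF."""
    return (stats.norm.cdf(length + delta, loc=mean, scale=sigma)
            - stats.norm.cdf(length - delta, loc=mean, scale=sigma))

# Hypothetical values: a 20 mm wide bin centred at 150 mm, mean length 140 mm.
length, delta, mean, sigma = 150.0, 10.0, 140.0, 25.0

by_cdf = prob_length_bin(length, delta, mean, sigma)
# Integrate the normal density over the same bin for comparison.
by_quad, _ = integrate.quad(stats.norm.pdf, length - delta, length + delta,
                            args=(mean, sigma))
print(by_cdf, by_quad)
```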

To get $P(t|L)$ we can just bring Bayes' theorem into the mix!

$$P(t|L_\delta) = \frac{P(L_\delta|t)P(t)}{P(L_\delta)}$$

Now for a leap of faith. You're going to look at that equation and say: hold on, my ultimate goal here is to estimate $P(t)$, so why is it up there? Well, for now let's just consider it a parameter $\alpha_t$. Furthermore, we're going to assume that $P(L_\delta)$ is also just known (we measured a representative sample of fish). Therefore we have:

$$P(t|L_\delta) = \frac{\alpha_t}{P(L_\delta)}\int_{L-\delta}^{L+\delta}\frac{1}{\sigma \sqrt{2\pi}}e^{-\frac{1}{2}(\frac{l-V(t)}{\sigma})^2}dl$$

The next leap of faith we're going to make is that $V(t)$ has already been estimated. Also note that our big integral does not depend on $\alpha_t$ at all. Therefore we can represent it by:

$$I_t(L_\delta) = \int_{L-\delta}^{L+\delta}\frac{1}{\sigma \sqrt{2\pi}}e^{-\frac{1}{2}(\frac{l-V(t)}{\sigma})^2}dl$$

At which point we get:

$$P(t|L_\delta) = \frac{I_t(L_\delta)}{P(L_\delta)}\alpha_t$$
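
Here is a small numerical sketch of that last expression. The growth parameters and the $\alpha_t$ values are hypothetical. One assumption to flag: rather than plugging in a measured $P(L_\delta)$, the sketch computes it from the law of total probability, $\sum_t \alpha_t I_t(L_\delta)$, so that the result sums to one by construction.

```python
import numpy as np
from scipy import stats

def bin_prob(length, delta, mean, sigma):
    """I_t(L_delta): normal probability mass in [length - delta, length + delta]."""
    return (stats.norm.cdf(length + delta, loc=mean, scale=sigma)
            - stats.norm.cdf(length - delta, loc=mean, scale=sigma))

# Hypothetical growth parameters and age proportions alpha_t = P(t).
L_inf, k, t_0, sigma = 300.0, 0.25, -1.5, 20.0
ages = np.arange(0, 8)
alpha = np.array([0.30, 0.25, 0.20, 0.10, 0.06, 0.04, 0.03, 0.02])
means = L_inf * (1 - np.exp(-k * (ages - t_0)))  # V(t) for every age

length, delta = 150.0, 5.0
I = bin_prob(length, delta, means, sigma)  # I_t(L_delta) for every age
# Assumption: P(L_delta) from total probability instead of a measured value.
p_length = np.sum(alpha * I)
posterior = I * alpha / p_length  # P(t | L_delta)
print(np.round(posterior, 3))
```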

Now there's a pretty big problem here. We will never actually measure $P(t|L_\delta)$ because... well... it isn't something you can observe directly (probabilities are a bit of a human construct). So this is not a very good model: it's rather hard to fit. However, what do most models measure? They measure central tendencies! So we could instead look for the expected $t$ at each $L_\delta$. But that's just each age multiplied by its probability, summed!

$$t(L_\delta) = \sum_\tau{\tau P(\tau|L_\delta)} = \sum_\tau {\tau\frac{I_\tau(L_\delta)}{P(L_\delta)}\alpha_\tau}$$

Now we're getting somewhere! What we still need for OQD is the noise component of all of this. Well remember:

$$\sigma^2 = \sum (x-\mu)^2 P(x)$$

So we have:

$$\sigma^2 = \sum_\tau (\tau - t(L_\delta))^2 P(\tau | L_\delta)= \sum_\tau (\tau - t(L_\delta))^2\frac{I_\tau(L_\delta)}{P(L_\delta)}\alpha_\tau$$
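
As a quick worked example, take a single length bin with a hypothetical $P(\tau|L_\delta)$ over four ages. The expected age is the probability-weighted mean, and the variance is the probability-weighted squared deviation from it.

```python
import numpy as np

ages = np.array([0, 1, 2, 3])
# Hypothetical P(tau | L_delta) for one length bin; must sum to 1.
posterior = np.array([0.10, 0.60, 0.25, 0.05])

expected_age = np.sum(ages * posterior)                     # t(L_delta)
variance = np.sum((ages - expected_age) ** 2 * posterior)   # sigma^2 at L_delta

# Without the square the deviations cancel, so this is always ~0.
unsquared = np.sum((ages - expected_age) * posterior)
print(expected_age, variance, unsquared)
```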

And that gives us our model! 

# OQD

Alright, now to apply OQD. The first question is: what's our signal? Because we always measure signal as the derivative with respect to one of our parameters across a difference, we can learn much of what we need to know by just taking the derivative of our model. Because we're assuming that $V(t)$, and therefore $I_t(L_\delta)$, is known, our only parameters here are the $\alpha_\tau$. So let's take that derivative:

$$\partial_{\alpha_\tau}t(L_\delta) = \tau\frac{I_\tau(L_\delta)}{P(L_\delta)}= \tau\frac{P(L_\delta | \tau)}{P(L_\delta)}$$

This is a very nice, intuitive result. It says that our signal will be maximized (relative to a zero probability case for $\tau$) where the ratio of how often we'd find an age $\tau$ over how often we'd even find that length is maximized. In other words if we sample where we'd expect to find $\tau$ most frequently we'll amplify the signal of $\alpha_\tau$ - which definitely feels right.
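
A finite-difference check is a cheap way to convince ourselves of that derivative. The sketch below uses hypothetical growth parameters, hypothetical $\alpha_\tau$ values, and a hypothetical fixed `p_length` standing in for the measured $P(L_\delta)$. Because the model is linear in each $\alpha_\tau$, the two numbers should agree up to floating-point error.

```python
import numpy as np
from scipy import stats

# Hypothetical growth parameters and age proportions.
L_inf, k, t_0, sigma = 300.0, 0.25, -1.5, 20.0
ages = np.arange(0, 8)
alpha = np.array([0.30, 0.25, 0.20, 0.10, 0.06, 0.04, 0.03, 0.02])
means = L_inf * (1 - np.exp(-k * (ages - t_0)))

def I_t(length, delta):
    """I_tau(L_delta) for every age at once."""
    return (stats.norm.cdf(length + delta, loc=means, scale=sigma)
            - stats.norm.cdf(length - delta, loc=means, scale=sigma))

def expected_age(alpha, length, delta, p_length):
    """t(L_delta) = sum_tau tau * I_tau(L_delta) * alpha_tau / P(L_delta)."""
    return np.sum(ages * I_t(length, delta) * alpha) / p_length

# p_length is treated as a known, measured quantity (hypothetical value).
length, delta, p_length = 150.0, 5.0, 0.08
tau = 2

analytic = tau * I_t(length, delta)[tau] / p_length

eps = 1e-6
bumped = alpha.copy()
bumped[tau] += eps
numeric = (expected_age(bumped, length, delta, p_length)
           - expected_age(alpha, length, delta, p_length)) / eps
print(analytic, numeric)
```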

Now of course you may be wondering: didn't we make a really big assumption about $V(t)$? Actually, not really. $V(t)$ should be reasonably stable year over year, so we don't need to collect that much data to fit it. In addition, we could just apply OQD to fitting it in a relatively straightforward fashion.

Alright, before we even consider noise let's give this approach a whirl. 

# Signal Alone

In [12]:
rows = []
with open('age_structure/data/griffin age data.txt', 'r') as fh:
    for i, line in enumerate(fh):
        # Fields are separated by runs of spaces; drop the empty pieces.
        elements = [e.strip() for e in line.strip().split(' ') if e.strip()]
        if i == 0:
            cols = elements  # header row
        else:
            rows.append(dict(zip(cols, elements)))
df = pd.DataFrame(rows)
df = df[~df['TL'].isin(['.'])]
df['length'] = df['TL'].astype(int)
df['age'] = df['Rings'].astype(int)
df

Unnamed: 0,TL,Gear,Rings,length,age
0,197,trawl,2,197,2
1,166,trawl,2,166,2
2,141,trawl,1,141,1
3,172,trawl,1,172,1
4,103,trawl,0,103,0
...,...,...,...,...,...
182,204,trapnet,1,204,1
183,196,trapnet,1,196,1
184,171,trapnet,1,171,1
185,191,trapnet,2,191,2


In [18]:
px.scatter(df, x='age', y='length')

In [22]:
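def bert_objective(x, sample):
    # Negative log-likelihood of the observed lengths under von Bertalanffy + N(0, sigma).
    # Note: this writes helper columns onto `sample` (here `df`) on every call.
    L_inf, K, t_0, sigma = x
    sample['predicted_length'] = L_inf * (1-np.exp(-K * (sample['age'] - t_0)))
    # The pdf can underflow to 0 when the optimizer tries extreme parameters, which is
    # what triggers the "divide by zero encountered in log" warning; norm.logpdf avoids it.
    sample['neg_log_likelihood'] = -np.log(sp.stats.norm.pdf(sample['length'] - sample['predicted_length'], 0, sigma))
    return sample['neg_log_likelihood'].sum()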

sample = df
L_inf, K, t_0, sigma = 300, 0.1, 1, 100
sol = sp.optimize.minimize(
    bert_objective,
    (L_inf, K, t_0, sigma),
    args=(sample,),
    bounds=((100, None), (0.1, None), (-5, None), (10, None))
)
sol


divide by zero encountered in log


invalid value encountered in subtract



  message: CONVERGENCE: REL_REDUCTION_OF_F_<=_FACTR*EPSMCH
  success: True
   status: 0
      fun: 981.1488255866923
        x: [ 3.000e+02  2.482e-01 -1.467e+00  6.114e+01]
      nit: 16
      jac: [-8.321e-01 -6.174e+02  5.123e+01  1.564e+00]
     nfev: 125
     njev: 25
 hess_inv: <4x4 LbfgsInvHessProduct with dtype=float64>
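
The warnings above come from taking `np.log` of a density that has underflowed to zero. A variant of the objective that works in log space sidesteps that. This is only a sketch: `bert_nll` is a name introduced here, the data are synthetic (generated from hypothetical parameters), and the upper bounds are an added assumption to keep the optimizer away from overflow.

```python
import numpy as np
from scipy import optimize, stats

def bert_nll(params, ages, lengths):
    """Negative log-likelihood of lengths given ages (von Bertalanffy + N(0, sigma))."""
    L_inf, K, t_0, sigma = params
    predicted = L_inf * (1 - np.exp(-K * (ages - t_0)))
    # logpdf stays finite where log(pdf) would hit log(0).
    return -stats.norm.logpdf(lengths, loc=predicted, scale=sigma).sum()

# Synthetic data from known (hypothetical) parameters, just to exercise the function.
rng = np.random.default_rng(0)
ages = rng.integers(0, 8, size=200).astype(float)
lengths = 300 * (1 - np.exp(-0.25 * (ages + 1.5))) + rng.normal(0, 20, size=200)

x0 = (250.0, 0.2, -1.0, 30.0)
fit = optimize.minimize(
    bert_nll,
    x0=x0,
    args=(ages, lengths),
    # Hypothetical bounds; the upper limits are an assumption, not from the notebook.
    bounds=((100, 1000), (0.01, 5), (-5, 5), (1, 200)),
)
print(fit.x, fit.fun)
```

Unlike `bert_objective`, this version does not write columns back onto the data frame, so fitting it leaves the data untouched.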

In [23]:
L_inf, K, t_0, sigma = sol.x
df['predicted_length'] = df.apply(
    lambda row: L_inf * (1-np.exp(-K * (row['age'] - t_0))),
    axis=1
)
px.scatter(df, x='age', y='predicted_length')

In [36]:
def prob_length_given_t(length, delta, age):
    return (
        sp.stats.norm.cdf(length + delta, loc=(L_inf * (1-np.exp(-K * (age - t_0)))), scale=sigma)
        - sp.stats.norm.cdf(length - delta, loc=(L_inf * (1-np.exp(-K * (age - t_0)))), scale=sigma)
    )

prob_length_given_t(100, 10, 1)

0.10795003712558682

In [82]:
rows = []
with open('age_structure/data/griffin trawl data.txt', 'r') as fh:
    for i, line in enumerate(fh):
        # Fields are separated by runs of spaces; drop the empty pieces.
        elements = [e.strip() for e in line.strip().split(' ') if e.strip()]
        if i == 0:
            cols = elements  # header row
        else:
            rows.append(dict(zip(cols, elements)))
full_df = pd.DataFrame(rows)
full_df = full_df[~full_df['TL'].isin(['.', np.nan])]
full_df['length'] = full_df['TL'].astype(int)
full_df['length_cm'] = np.ceil(full_df['length'] / 10)
int_df = full_df.groupby('length_cm').count().rename({'length': 'count'}, axis=1)[['count']].reset_index()
int_df['count'] = int_df['count'] / int_df['count'].sum()
int_df = int_df.rename({'count': 'p'}, axis=1)
probs = {
    row['length_cm']: row['p']
    for _, row in int_df.iterrows()
}
probs

{7.0: 0.002857142857142857,
 8.0: 0.03428571428571429,
 9.0: 0.04285714285714286,
 10.0: 0.06428571428571428,
 11.0: 0.07714285714285714,
 12.0: 0.03857142857142857,
 13.0: 0.008571428571428572,
 14.0: 0.018571428571428572,
 15.0: 0.06142857142857143,
 16.0: 0.09,
 17.0: 0.09714285714285714,
 18.0: 0.08714285714285715,
 19.0: 0.08,
 20.0: 0.05857142857142857,
 21.0: 0.05142857142857143,
 22.0: 0.03857142857142857,
 23.0: 0.04,
 24.0: 0.027142857142857142,
 25.0: 0.03428571428571429,
 26.0: 0.01,
 27.0: 0.014285714285714285,
 28.0: 0.011428571428571429,
 30.0: 0.005714285714285714,
 31.0: 0.0014285714285714286,
 32.0: 0.0014285714285714286,
 35.0: 0.002857142857142857}

In [76]:
df[(df['length'] >= 130) & (df['length'] < 140)].sort_values('length')

Unnamed: 0,TL,Gear,Rings,length,age,predicted_length,neg_log_likelihood
133,130,trapnet,0,130,0,91.556567,5.229792
69,131,trawl,1,131,1,137.382211,5.037564
90,132,trawl,0,132,0,91.556567,5.250895
101,132,trawl,1,132,1,137.382211,5.03599
155,134,trapnet,0,134,0,91.556567,5.273068
146,135,trapnet,0,135,0,91.556567,5.284556
74,136,trawl,1,136,1,137.382211,5.032371
130,136,trapnet,1,136,1,137.382211,5.032371
32,137,trawl,1,137,1,137.382211,5.032135
95,137,trawl,1,137,1,137.382211,5.032135


In [67]:
def ratio(length, delta, age):
    cm = length / 10
    best_key = None
    best_diff = float('inf')
    for key in probs.keys():
        if abs(key - cm) < best_diff:
            best_key = key
            best_diff = abs(key - cm)
    return (
        prob_length_given_t(length, delta, age) / max(round(probs[best_key], 2), 0.01)
    )

ratio(200, 10, 1)

1.2876012255487033

In [78]:
# For each age, find the 10 mm step whose +/-10 mm window has the highest P(L_delta | age).
for age in range(0, df['age'].max() + 1):
    best_prob = 0
    best_length = None
    for length in range(100, int(np.ceil(L_inf)) + 10, 10):
        p = prob_length_given_t(length, 10, age)
        if p > best_prob:
            best_prob = p
            best_length = length
    print(age, best_length, best_prob)

0 100 0.1286982496538741
1 140 0.1298023530174387
2 170 0.12975111954325724
3 200 0.1299020523213259
4 220 0.1297859894824056
5 240 0.12991944171027026
6 250 0.12976322622995617
7 260 0.12972638150363458


In [79]:
int_df

Unnamed: 0,length_cm,p
0,7.0,0.002857
1,8.0,0.034286
2,9.0,0.042857
3,10.0,0.064286
4,11.0,0.077143
5,12.0,0.038571
6,13.0,0.008571
7,14.0,0.018571
8,15.0,0.061429
9,16.0,0.09


In [69]:
for length in range(100, int(np.ceil(L_inf)) + 10, 10):
    print(length, ratio(length, 10, 0))

100 2.1449708275645687
110 1.552398068910664
120 2.9176883358393413
130 10.680461054555689
140 4.759225883955809
150 1.3768177354342346
160 0.7757673565166903
170 0.5746521740483712
180 0.5117738731131662
190 0.4493973223853212
200 0.4554629731725768
210 0.40457766040577603
220 0.3645548496482215
230 0.25591567697083595
240 0.23326646316135236
250 0.15529277327551716
260 0.3020322227750505
270 0.1906845218874298
280 0.11723533124022856
290 0.07019109171079174
300 0.04092468705046981
310 0.023236359114375826


In [88]:
def get_bin(length, bins):
    # Return the bin centre closest to `length` (the first one wins on ties).
    return min(bins, key=lambda b: abs(b - length), default=None)

In [127]:
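bins = [100, 140, 170, 200, 220, 240, 250, 260]
df['bin'] = df.apply(lambda row: get_bin(row['length'], bins), axis=1)
fdfs = []
for b in df['bin'].unique():
    # Note: this draws 12 fish from all of `df`, not from bin `b`,
    # so the result is 12 random fish per pass rather than a per-bin sample.
    fdfs.append(df.sample(12))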
fdf = pd.concat(fdfs)
age_structure = fdf.groupby(['bin', 'age']).count().rename({'length': 'count'}, axis=1)[['count']].reset_index()
full_df['bin'] = full_df.apply(lambda row: get_bin(row['length'], bins), axis=1)
bin_structure = fdf.groupby(['bin']).count().rename({'length': 'count'}, axis=1)[['count']].reset_index()
age_structure = age_structure.merge(bin_structure, on='bin', how='left', suffixes=('', '_bin'))
age_structure['p'] = age_structure['count'] / age_structure['count_bin']
age_structure.sort_values(['bin', 'age'])
ndf = full_df.groupby('bin').count().rename({'length': 'count'}, axis=1)[['count']].reset_index().merge(age_structure[['bin', 'age', 'p']], on='bin', how='left')
ndf['count'] = ndf['count'] * ndf['p']
ndf = ndf[['age', 'count']].groupby('age').sum().reset_index()
print(fdf.shape)
ndf

(96, 9)


Unnamed: 0,age,count
0,0,211.1
1,1,196.066667
2,2,243.56746
3,3,23.43254
4,4,3.222222
5,5,14.555556
6,6,8.055556


In [128]:
df['bin'] = np.ceil(df['length'] / 10)
age_structure = df.groupby(['bin', 'age']).count().rename({'length': 'count'}, axis=1)[['count']].reset_index()
full_df['bin'] = np.ceil(full_df['length'] / 10)
bin_structure = df.groupby(['bin']).count().rename({'length': 'count'}, axis=1)[['count']].reset_index()
age_structure = age_structure.merge(bin_structure, on='bin', how='left', suffixes=('', '_bin'))
age_structure['p'] = age_structure['count'] / age_structure['count_bin']
age_structure.sort_values(['bin', 'age'])
ndf = full_df.groupby('bin').count().rename({'length': 'count'}, axis=1)[['count']].reset_index().merge(age_structure[['bin', 'age', 'p']], on='bin', how='left')
ndf['count'] = ndf['count'] * ndf['p']
ndf = ndf[['age', 'count']].groupby('age').sum().reset_index()
print(df.shape)
ndf

(186, 9)


Unnamed: 0,age,count
0,0.0,201.833333
1,1.0,194.296212
2,2.0,245.850758
3,3.0,40.561364
4,4.0,5.333333
5,5.0,3.25
6,6.0,4.666667
7,7.0,2.208333


I have two takeaways from this. First, the fact that $P(L_\delta)$ is in the denominator doesn't make a whole lot of sense until you consider that a length bin with few fish in it yields few samples, which is going to make this whole thing a lot noisier (and by the same degree). Therefore I think you can more or less ignore that aspect of things.

Second, I need a way to estimate these parameters given the data I've collected, because this stuff on its own is not quite enough (I don't think) to estimate the overall age structure.