Model Descriptions
------------------


## Heuristic Model

The first model developed is referred to as the
*heuristic model* and was derived by observing that
fishing behaviour correlates with several of the
values present in AIS messages. In particular, the
likelihood that a vessel is fishing tends to increase with
the standard deviation of the speed ($\sigma_s$) and course
($\sigma_c$), but to decrease with mean speed. These features
were used to develop the *heuristic model*:
$$
\begin{aligned}
\text{fishing\_score} &= \frac{2}{3}\left(\sigma_{s_m} + \sigma_{c_m} + \overline{s_m}\right) \\
s_m &= 1.0 - \min\left(1, \text{speed}\,/\,17\right) \\
c_m &= \text{course}\,/\,360
\end{aligned}
$$
where speed is in knots, course is in degrees, and the means and standard deviations are computed over a one-hour window.
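
As a rough illustration, the formula can be evaluated for a single window of samples as sketched below. The function name `heuristic_fishing_score` is hypothetical, and the use of the population standard deviation is an assumption: the text does not say which estimator the production code uses. The sketch also takes the standard deviation of the course naively, ignoring wraparound at 360 degrees.

```python
import statistics


def heuristic_fishing_score(speeds, courses, max_speed=17.0):
    """Score one window of AIS samples (speeds in knots, courses in degrees)."""
    # s_m: inverted speed, so anything at or above max_speed maps to 0.
    s_m = [1.0 - min(1.0, s / max_speed) for s in speeds]
    # c_m: course scaled to the range [0, 1).
    c_m = [c / 360.0 for c in courses]
    # Assumption: population standard deviation over the window.
    return (2.0 / 3.0) * (
        statistics.pstdev(s_m) + statistics.pstdev(c_m) + statistics.mean(s_m)
    )


# A vessel holding 12 knots on a fixed heading versus one moving slowly
# and erratically; the second should receive the higher score.
steady = heuristic_fishing_score([12.0] * 4, [90.0] * 4)
erratic = heuristic_fishing_score([1.0, 3.0, 0.5, 2.5], [10.0, 200.0, 95.0, 310.0])
print(steady, erratic)
```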

The heuristic model performs reasonably well for trawlers and
longliners, but poorly for purse seiners.

ADD TAG/REVISION INFO TO SEE SPECIFIC CODE.


## Generic Model

A series of logistic regression models were then developed
using the same three features found in the *heuristic
model*. In order to increase the expressiveness of the
logistic model, powers of the three base features are added to
the features. Thus, the full feature vector consists of:
$$
\sigma_{s_m}, \sigma_{s_m}^2,\ldots, \sigma_{s_m}^n, \sigma_{c_m}, \sigma_{c_m}^2,\ldots, \sigma_{c_m}^n,
\overline{s_m},\overline{s_m}^2, \ldots, \overline{s_m}^n
$$
where $n$ is what we shall refer to as the *feature order*.
Note that despite the odd form of
$s_m$, from the point of view of the
logistic model it is equivalent to the speed
capped at 17 knots, since $s_m$ is an affine function of the capped speed.
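
A minimal sketch of how the feature vector could be assembled from the three base features follows. The helper name `expand_powers` is hypothetical and is not taken from the project's code.

```python
def expand_powers(base_features, order):
    """Return [x, x**2, ..., x**order] for each base feature, in sequence."""
    expanded = []
    for x in base_features:
        expanded.extend(x ** k for k in range(1, order + 1))
    return expanded


# Three base features (std of s_m, std of c_m, mean of s_m) with a
# feature order of 3 give a vector of 3 * 3 = 9 entries.
features = expand_powers([0.5, 0.2, 0.9], order=3)
print(len(features), features[:3])
```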

The first of the logistic models, referred to as the
*generic model*, is the model currently in use. It
is a logistic model using a 12-hour time window
and a feature order of 6. One model is trained for all gear
types. This model generally performs better than the
heuristic model, but still performs rather poorly on purse
seiners. The 12-hour window was arrived at by plotting the model
accuracy versus window size. There is a different optimal
window size for each gear type, but 12 hours performed
well for a model trained and tested on all gear types.

ADD TAG/REVISION INFO TO SEE SPECIFIC CODE.

## Multi-Window Model

The multi-window model, which is on the verge of being deployed,
is a logistic model similar to the *generic model* except that
it uses multiple time windows, ranging in duration from one-half to
twenty-four hours. Using multiple window sizes both provides
a richer feature set and avoids the need to optimize over
window size. In addition, separate models are trained for each of the three
primary gear types: longliners, trawlers and purse seiners.
We are also experimenting with adding other features.
In particular, whether it is currently daylight appears to
be a very useful feature for predicting purse seine fishing.
These changes, taken together, dramatically improve
performance, particularly for purse seiners.
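
The idea of stacking features from several windows can be sketched as follows. The window sizes (in seconds) match the multi-window configuration used in the comparison notebook; everything else is an assumption made for illustration, including the helper names and the choice of a trailing window that ends at the scored time `t` (the production code may centre its windows instead).

```python
import statistics

# Window sizes in seconds, from half an hour to twenty-four hours.
WINDOWS = [1800, 3600, 10800, 21600, 43200, 86400]


def window_features(timestamps, speeds, courses, t, window):
    """Base features over the samples in the trailing interval (t - window, t]."""
    idx = [i for i, ts in enumerate(timestamps) if t - window < ts <= t]
    s_m = [1.0 - min(1.0, speeds[i] / 17.0) for i in idx]
    c_m = [courses[i] / 360.0 for i in idx]
    # Assumption: population standard deviation, as in a per-window score.
    return [statistics.pstdev(s_m), statistics.pstdev(c_m), statistics.mean(s_m)]


def multi_window_features(timestamps, speeds, courses, t, windows=WINDOWS):
    """Concatenate the three base features for every window size."""
    feats = []
    for w in windows:
        feats.extend(window_features(timestamps, speeds, courses, t, w))
    return feats


# Synthetic track: one sample every ten minutes for a day.
timestamps = list(range(0, 86400 + 1, 600))
speeds = [2.0 + (i % 3) for i in range(len(timestamps))]
courses = [(37 * i) % 360 for i in range(len(timestamps))]
feats = multi_window_features(timestamps, speeds, courses, t=timestamps[-1])
print(len(feats))
```

Each window contributes three base features, so six windows give eighteen values before any powers are added.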


## Future Models

It is straightforward to use the multi-window logistic
model features described above with a random forest or neural net
model. In early experiments, both of these model types offer
slightly improved performance relative to the logistic model while at
the same time eliminating the need to augment the feature vector
with powers of the base features.

We eventually plan to experiment with using convolutional or
recurrent neural networks to find features in the AIS data
directly rather than hand engineering the features.


General Notes
-------------

The precision of the models varies by gear type: longliners are easiest to
predict, even for a model trained on all gear types,
followed by trawlers; purse seiners are the worst.

We have evaluated the models on a separate test set (and,
for window-size and feature-order optimization, on
separate training, validation and test sets) by plotting
precision/recall and ROC curves.

We have also evaluated the generic model on each gear type
separately as well as on the combined data set. In addition,
for longliners we have cross-trained and validated between
two separately labelled datasets with slightly different
labelling methods (Kristina's and Alex's data).


# Compare Models

Compares the heuristic model that was previously used with the logistic model that is
currently used. Two different versions of the logistic model are shown: the vanilla
model, which is what is currently in production, and the multi-window model, which looks
at features at multiple window sizes and is likely to replace the vanilla model in the
near future.

In [1]:
%matplotlib inline
import numpy as np
from vessel_scoring import data
from vessel_scoring.models import train_model_on_data
from vessel_scoring.evaluate_model import compare_auc, compare_metrics
from IPython.display import display, HTML, Markdown
from sklearn import metrics

In [2]:
# Load training and test data
_, train_lline,  valid_lline, test_lline = data.load_dataset_by_vessel(
        'datasets/kristina_longliner.measures.npz')
_, train_trawl,  valid_trawl, test_trawl = data.load_dataset_by_vessel(
        'datasets/kristina_trawl.measures.npz')
_, train_pseine, valid_pseine, test_pseine = data.load_dataset_by_vessel(
        'datasets/kristina_ps.measures.npz')

test_lline_crowd, _, _, _ = data.load_dataset_by_vessel(
        "datasets/classified-filtered.measures.npz")

train = np.concatenate([train_trawl, train_lline, train_pseine, valid_lline, valid_trawl, valid_pseine])

## How much test data do we have

Our initial test and training data consisted of roughly a dozen different vessels of each type 
classified over a multi-year period by Kristina Boerder of Dalhousie University. One quarter of 
those are used for testing, so there is a relatively small number of different vessels in the test
sets. 
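
Because the split is made by vessel rather than by point, every point from a given vessel falls on one side only. A minimal sketch of such a split is shown below; `split_by_vessel` is a hypothetical helper written for illustration and is not the project's `load_dataset_by_vessel`, whose exact behaviour is not described here.

```python
import random


def split_by_vessel(points, test_fraction=0.25, seed=0):
    """Split points so that each vessel (MMSI) lands entirely in one set."""
    mmsis = sorted({p['mmsi'] for p in points})
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    rng.shuffle(mmsis)
    n_test = max(1, round(len(mmsis) * test_fraction))
    test_ids = set(mmsis[:n_test])
    train = [p for p in points if p['mmsi'] not in test_ids]
    test = [p for p in points if p['mmsi'] in test_ids]
    return train, test


# Synthetic example: 12 vessels with 5 points each.
points = [{'mmsi': 1000 + v, 'speed': float(k)} for v in range(12) for k in range(5)]
train_pts, test_pts = split_by_vessel(points)
print(len({p['mmsi'] for p in test_pts}), len(test_pts))
```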

In addition, we are beginning to collect crowd-sourced data for both testing and training. Some of the
early crowd-sourced data, available for longliners only, is used as an additional test set in the examples
below.

In [3]:
for name, test_data in [("trawlers", test_trawl),
                        ("purse seiners", test_pseine),
                        ("longliners", test_lline),
                        ("crowd sourced longliners", test_lline_crowd)]:
    mmsi_count = len(set(test_data['mmsi']))
    pt_count = len(test_data)
    print("For {0} we have {1} test vessels with {2} test points".format(name, mmsi_count, pt_count))

For trawlers we have 3 test vessels with 5000 test points
For purse seiners we have 3 test vessels with 5000 test points
For longliners we have 2 test vessels with 5000 test points
For crowd sourced longliners we have 118 test vessels with 324166 test points


In [1]:
# Prepare the models

from vessel_scoring.legacy_heuristic_model import LegacyHeuristicModel
from vessel_scoring.logistic_model import LogisticModel

uniform_training_data = {'longliner': train, 
                         'longliner crowd' : train,
                         'trawler': train, 
                         'purse seiner': train}

test_data = {'longliner': test_lline, 
             'longliner crowd': test_lline_crowd,
             'trawler': test_trawl, 
             'purse seiner': test_pseine}

untrained_models = [
    ("Legacy", LegacyHeuristicModel(window=3600), 
         uniform_training_data),
    ('Logistic', LogisticModel(windows=[43200],
                                    order=6),
         uniform_training_data),
    ('Logistic (MW)', LogisticModel(windows=[1800, 3600, 10800, 21600, 43200, 86400],
                                    order=6), 
         uniform_training_data),
]


## Discrete Comparisons

The models output a number between 0 and 1 that corresponds to how 
confident they are that fishing is occurring. For
this first set of comparisons we treat predictions `>0.5`
as fishing and those `<=0.5` as nonfishing. This allows us to use
*precision*, *recall* and *f1-score* as metrics.
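
For reference, the three metrics can be computed from thresholded scores as in the sketch below. This is a plain-Python illustration with made-up scores and labels; it is not the implementation behind `compare_metrics_table`.

```python
def precision_recall_f1(scores, labels, threshold=0.5):
    """Treat scores above the threshold as fishing and compare with labels."""
    predicted = [s > threshold for s in scores]
    tp = sum(1 for p, y in zip(predicted, labels) if p and y)
    fp = sum(1 for p, y in zip(predicted, labels) if p and not y)
    fn = sum(1 for p, y in zip(predicted, labels) if not p and y)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


# Illustrative values only: 1 marks a point labelled as fishing.
scores = [0.9, 0.8, 0.6, 0.5, 0.3, 0.1]
labels = [1, 1, 0, 1, 0, 0]
precision, recall, f1 = precision_recall_f1(scores, labels)
print(precision, recall, f1)
```

Note that the point scored exactly 0.5 counts as nonfishing, matching the `<=0.5` rule above.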

In [None]:
import importlib, vessel_scoring.evaluate_model; importlib.reload(vessel_scoring.evaluate_model)
from vessel_scoring.evaluate_model import (train_model, compare_pr,
                                           compare_metrics, compare_metrics_table)

for vessel_class in ["longliner", "longliner crowd", "trawler", "purse seiner"]:
    display(HTML("<h3>Comparison for {0}</h3>".format(vessel_class)))
    models = []
    for name, mdl, train_data in untrained_models:
        models.append((name, train_model(mdl, train_data[vessel_class])))
    display(Markdown(compare_metrics_table(models, test_data[vessel_class])))

## Precision - Recall Comparisons

One way to compare the models without picking a specific threshold is
to plot the precision versus recall of each model. 
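
The points on such a curve come from sweeping the threshold over the observed scores. The sketch below shows the idea in plain Python with made-up data; the function name `pr_curve` is hypothetical, and `compare_pr` may compute its curve differently (for example via `sklearn.metrics.precision_recall_curve`).

```python
def pr_curve(scores, labels):
    """Return (threshold, precision, recall) for each distinct score."""
    positives = sum(1 for y in labels if y)
    points = []
    for threshold in sorted(set(scores), reverse=True):
        # Everything scored at or above the threshold is predicted fishing.
        predicted = [s >= threshold for s in scores]
        tp = sum(1 for p, y in zip(predicted, labels) if p and y)
        fp = sum(1 for p, y in zip(predicted, labels) if p and not y)
        precision = tp / (tp + fp)
        recall = tp / positives if positives else 0.0
        points.append((threshold, precision, recall))
    return points


# Illustrative values only: 1 marks a point labelled as fishing.
pr_points = pr_curve([0.9, 0.8, 0.6, 0.5, 0.3, 0.1], [1, 1, 0, 1, 0, 0])
for threshold, precision, recall in pr_points:
    print(threshold, round(precision, 3), round(recall, 3))
```

Lowering the threshold can only raise recall, while precision typically falls as more nonfishing points are let in.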

In [None]:
for vessel_class in ["longliner", "longliner crowd", "trawler", "purse seiner"]:
    display(HTML("<h3>Comparison for {0}</h3>".format(vessel_class)))
    models = []
    for name, mdl, train_data in untrained_models:
        models.append((name, train_model(mdl, train_data[vessel_class])))
    compare_pr(models, test_data[vessel_class])

## ROC Comparisons

Another approach to comparing continuous output is the Receiver Operating Characteristic (ROC) curve.
This curve plots *true positive rate* versus *false positive rate* and is useful for evaluating
what is possible with different threshold values.  The Area Under the Curve (AUC) is used as 
a metric in this case, with a larger AUC being better.
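
To make the metric concrete, the sketch below builds the ROC points by lowering the threshold one distinct score at a time and integrates them with the trapezoid rule. It is a plain-Python illustration on made-up data, assuming both classes are present; the name `roc_auc` is hypothetical, and `compare_auc` may instead rely on `sklearn.metrics`.

```python
def roc_auc(scores, labels):
    """Return the ROC points (fpr, tpr) and the area under them."""
    positives = sum(1 for y in labels if y)
    negatives = len(labels) - positives
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        # Consume all samples sharing this score so ties form one step.
        j = i
        while j < len(ranked) and ranked[j][0] == ranked[i][0]:
            if ranked[j][1]:
                tp += 1
            else:
                fp += 1
            j += 1
        points.append((fp / negatives, tp / positives))
        i = j
    # Trapezoid rule over consecutive ROC points.
    auc = sum((x1 - x0) * (y0 + y1) / 2.0
              for (x0, y0), (x1, y1) in zip(points, points[1:]))
    return points, auc


# Illustrative values only: 1 marks a point labelled as fishing.
roc_points, auc = roc_auc([0.9, 0.8, 0.6, 0.5, 0.3, 0.1], [1, 1, 0, 1, 0, 0])
print(round(auc, 4))
```

An AUC of 0.5 corresponds to random ranking and 1.0 to a perfect separation of fishing from nonfishing points.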

In [None]:
for vessel_class in ["longliner", "longliner crowd", "trawler", "purse seiner"]:
    display(HTML("<h3>Comparison for {0}</h3>".format(vessel_class)))
    models = []
    for name, mdl, train_data in untrained_models:
        models.append((name, train_model(mdl, train_data[vessel_class])))
    compare_auc(models, test_data[vessel_class])