## Proposed models for the matching problem: one or two vessels?

**Global Fishing Watch** (GFW) has obtained tracking data of fishing vessels from 1) Automatic Identification System (AIS) devices and 2) Vessel Monitoring Systems (VMS) that are regulated by national government authorities. The overall goal is to use these positional data to predict fishing activities in the world's oceans.

*Automatic Identification System (AIS)*: AIS trackers broadcast vessel identification information and location. The speed and direction of the vessel can also be calculated from AIS data at various time points. GFW has built models to predict fishing activity of vessels based on AIS tracking data. A major drawback of using AIS data to model fishing activity is that only a small fraction of fishing vessels are equipped with AIS devices. Moreover, many of the vessels lacking AIS trackers may be engaging in illegal fishing activities.

*Vessel Monitoring Systems (VMS)*: VMS are established by national government authorities to monitor vessel movements with GPS. Vessels broadcast their positions to satellites and this information is captured by the authorities. VMS data are typically not made available to the public. Through collaborations with national governments such as Indonesia, Panama, Peru, and Chile, GFW has been able to obtain VMS data.

**The matching problem:** To get a better picture of fishing activity, we could overlay AIS and VMS location data (below, I refer to VMS positions as GPS data). AIS devices broadcast vessel location every three minutes (Taconet et al., 2019), resulting in accurate positions given by longitude/latitude coordinates. In contrast, VMS devices broadcast less frequently. I do not know the rate of VMS broadcasting, but I suspect that it is between once every few hours and once per day. This means that VMS positional data could be less accurate, depending on the speed of the vessel and when the GPS receiver transmits a signal. This creates several problems. One problem is that the same vessel could have different (non-overlapping) AIS and GPS coordinates. As shown in the diagram below (left), for example, a vessel is moving at 10 knots (~18.5 km/hr). If the GPS signal is transmitted every hour, then the AIS and GPS positions for the same vessel could be up to 18.5 km apart. Another potential problem is that overlapping AIS and GPS tracks could represent two vessels rather than one (diagram, right).

![potential problems](GPS_schematic_4.png)
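
The 18.5 km figure above follows from simple arithmetic, and the same calculation shows how quickly the worst-case gap grows with longer reporting intervals. The sketch below is only illustrative: the helper names `max_separation_km` and `haversine_km` are hypothetical (not GFW code), the reporting intervals are made-up examples, and the distance formula assumes a spherical Earth.

```python
import math

KNOT_TO_KMH = 1.852           # one knot is defined as 1.852 km/h
EARTH_RADIUS_KM = 6371.0088   # mean Earth radius (spherical-Earth assumption)

def max_separation_km(speed_knots, interval_hours):
    """Farthest a vessel can travel between two consecutive broadcasts."""
    return speed_knots * KNOT_TO_KMH * interval_hours

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

# Worst-case gap for a 10-knot vessel under a few illustrative intervals
for hours in (0.05, 1, 6, 24):
    gap = max_separation_km(10, hours)
    print(f"{hours:>5} h between broadcasts -> up to {gap:.1f} km")
```

A distance helper like `haversine_km` could later be used to check whether the observed gap between an AIS position and a GPS position is physically possible for one vessel given its speed and the time between the two reports.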

**Goal:** *Using the AIS and GPS data available, construct a model to predict whether overlapping AIS and GPS tracks come from one or two vessels.*

Below, I propose four ideas for possible models. AIS data can be obtained from [GFW's website](https://globalfishingwatch.org/data-download/datasets/public-training-data-v1). I do not have access to vessel GPS data and therefore need to make some assumptions when describing the models. I could not find evidence that AIS and VMS use the same identifier naming/numbering scheme, so I assume that vessel identifiers differ such that it is not possible to match vessels simply by their ID (this would be too easy!).


In [1]:
import pandas as pd
import numpy as np
import io

#### What data do we have to build our models?

**AIS data**: The following data is provided with AIS data (downloaded [here](https://globalfishingwatch.org/data-download/datasets/public-training-data-v1))
 - mmsi: Anonymized vessel identifier
 - timestamp: Unix timestamp
 - distance_from_shore: Distance from shore (meters)
 - distance_from_port: Distance from port (meters)
 - speed: Vessel speed (knots)
 - course: Vessel course
 - lat: Latitude in decimal degrees
 - lon: Longitude in decimal degrees
 - is_fishing: Label indicating fishing activity.
   - 0 = Not fishing
   - greater than 0 = Fishing. Data values between 0 and 1 indicate the average score for the position if scored by multiple people.
   - -1 = No data
 - source: The training data batch. Data was prepared by GFW, Dalhousie, and a crowd sourcing campaign. False positives are marked as false_positives.

**GPS data**: The [GFW website](https://globalfishingwatch.org/our-map/) mentions that VMS tracking data provided by national partners include information on "vessel identities, gear type, location, speed, direction and more" and that GFW is now building models with GPS data, similar to those built with AIS data, to predict fishing activity. Therefore, I assume that **VMS tracking data has the same types of data listed above for AIS tracking data.** I also assume that a set of AIS and GPS tracks with known overlapping and non-overlapping status exists.

**Assumptions about the data for model training:**
1. GPS data has the same types of data (features) listed above for AIS data.
2. A data set for vessels with known overlapping and non-overlapping AIS and GPS tracks exists. 

**Given the discussion above, here is the training data I am using for modeling:**

*Label:*
 - AIS and GPS overlapping, i.e. corresponding to one vessel? 1 Yes, 0 No

*Features:*
 - AIS mmsi (string)
 - AIS timestamp (date/time)
 - AIS distance_from_shore (meters)
 - AIS distance_from_port (meters)
 - AIS vessel speed (knots)
 - AIS vessel course
 - AIS latitude (decimal degrees)
 - AIS longitude (decimal degrees)
 - AIS is_fishing? (0, >0, -1)
 - AIS source (string)
 - GPS vessel ID number (string)
 - GPS timestamp (date/time)
 - GPS distance_from_shore (meters)
 - GPS distance_from_port (meters)
 - GPS vessel speed (knots)
 - GPS vessel course
 - GPS latitude (decimal degrees)
 - GPS longitude (decimal degrees)
 - GPS is_fishing? (0, >0, -1)
 - GPS source (string)
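
One possible way to assemble rows with the paired features listed above is to match each GPS record to the AIS record nearest in time. The sketch below uses `pandas.merge_asof` on toy data. Everything about the GPS table is an assumption (the column names, the `VMS-001` identifier, and the positions are hypothetical), and the 30-minute tolerance is an arbitrary illustrative choice.

```python
import pandas as pd

# Toy AIS rows (only a few columns kept for brevity)
ais = pd.DataFrame({
    "timestamp": pd.to_datetime(["2013-09-19 14:34:34",
                                 "2013-09-19 14:53:15",
                                 "2013-09-19 15:20:30"]),
    "mmsi": [9924005022437] * 3,
    "lat": [8.8615, 8.861506, 8.861511],
    "lon": [-79.668427, -79.668442, -79.668488],
}).add_prefix("ais_")

# Hypothetical GPS (VMS) rows: names and values are made up for illustration
gps = pd.DataFrame({
    "timestamp": pd.to_datetime(["2013-09-19 15:00:00",
                                 "2013-09-19 18:00:00"]),
    "vessel_id": ["VMS-001", "VMS-001"],
    "lat": [8.8616, 8.9000],
    "lon": [-79.6685, -79.7000],
}).add_prefix("gps_")

# For every GPS record, attach the AIS record closest in time.
# Both inputs must be sorted on their time keys.
pairs = pd.merge_asof(
    gps.sort_values("gps_timestamp"),
    ais.sort_values("ais_timestamp"),
    left_on="gps_timestamp",
    right_on="ais_timestamp",
    direction="nearest",
    tolerance=pd.Timedelta("30min"),  # arbitrary cutoff; unmatched rows get NaN/NaT
)
pairs["time_gap"] = (pairs["gps_timestamp"] - pairs["ais_timestamp"]).abs()
print(pairs[["gps_vessel_id", "gps_timestamp", "ais_mmsi", "ais_timestamp", "time_gap"]])
```

The one-vessel/two-vessel label would still have to come from the assumed data set of known matches; this step only lines up candidate AIS and GPS records side by side.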

Let's get a sense of what the AIS (and presumably GPS) data look like, using purse seine vessels as an example.

In [2]:
# Load and view the Automatic Identification System (AIS) data for purse seine vessels
purse_seines = pd.read_csv('AIS_data/purse_seines.csv')
cols = purse_seines.columns.tolist()
purse_seines['mmsi'] = purse_seines['mmsi'].astype(int)
purse_seines['timestamp'] = purse_seines['timestamp'].astype(int)

# Convert UNIX timestamp to date-time and calculate intervals between signal transmissions
purse_seines['UNIX_timestamp'] = purse_seines['timestamp']
purse_seines['timestamp'] = pd.to_datetime(purse_seines['UNIX_timestamp'],unit='s')
purse_seines['time_diff'] = purse_seines.groupby('mmsi')['timestamp'].diff() # time since the previous time point of the same vessel (NaT for each vessel's first row)

purse_seines.head(3)

Unnamed: 0,mmsi,timestamp,distance_from_shore,distance_from_port,speed,course,lat,lon,is_fishing,source,UNIX_timestamp,time_diff
0,9924005022437,2013-09-19 14:34:34,0.0,1414.178833,0.0,298.5,8.8615,-79.668427,-1.0,false_positives,1379601274,NaT
1,9924005022437,2013-09-19 14:53:15,0.0,1414.178833,0.0,298.5,8.861506,-79.668442,-1.0,false_positives,1379602395,00:18:41
2,9924005022437,2013-09-19 15:20:30,0.0,1414.178833,0.1,128.399994,8.861511,-79.668488,-1.0,false_positives,1379604030,00:27:15


In [3]:
# How many time points does each AIS-tracked purse seine vessel have?
ps_timepoint_count = purse_seines.groupby(['mmsi']).size()
min_count = ps_timepoint_count.min()
max_count = ps_timepoint_count.max()

print("Number of time points per vessel ranges from", min_count, "to", max_count, ".")
print("Here are 10 vessels with mmsi ID and the number of AIS time points:", "\n")
print(ps_timepoint_count[:10])

Number of time points per vessel ranges from 6560 to 204261 .
Here are 10 vessels with mmsi ID and the number of AIS time points: 

mmsi
9924005022437      55933
10880510825243    117980
11170005450471     14360
18199244904065     17129
26616040923734     29073
36212632719018    106659
38322969102051    170686
38992105566132     10302
39005622580143    122160
43935946737362     86465
dtype: int64
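
Counts alone do not show how often each vessel reports. A short sketch of how the typical reporting interval per vessel could be summarized is given below. It runs on a toy table with two hypothetical vessels (the `mmsi` values 1 and 2 and their timestamps are made up), not on the real purse seine file.

```python
import pandas as pd

# Toy stand-in for the AIS table: Unix timestamps in seconds
toy = pd.DataFrame({
    "mmsi": [1, 1, 1, 2, 2, 2],
    "timestamp": [0, 180, 360, 1000, 2200, 3400],
})
toy["timestamp"] = pd.to_datetime(toy["timestamp"], unit="s")
toy = toy.sort_values(["mmsi", "timestamp"])

# Interval to the previous report of the same vessel, in seconds
toy["interval_s"] = toy.groupby("mmsi")["timestamp"].diff().dt.total_seconds()

# Median interval per vessel (the first report of each vessel is NaN and is skipped)
median_interval_s = toy.groupby("mmsi")["interval_s"].median()
print(median_interval_s)
```

Running the same summary on the real AIS and GPS tables would indicate how large the time gap between paired AIS and GPS positions is likely to be.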


#### Model 1: Logistic regression

We could train a logistic regression model to classify a pair of AIS and GPS tracks as overlapping or not overlapping with some probability. I would use maximum likelihood to estimate the coefficient for each feature. I would also check for collinearity between features (removing one of each pair of correlated features) and test whether regularization of the coefficients improves model performance on the validation and test sets. When predicting whether an AIS track and a GPS track overlap, we could initially treat a probability > 0.5 as overlapping, but this cutoff can be tuned to reduce false positives or false negatives. I would evaluate the model's performance with metrics such as AUC-ROC and F1-score.
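
To make the idea concrete, here is a minimal NumPy sketch of maximum-likelihood logistic regression with an optional L2 penalty, fitted by plain gradient ascent. It is not a GFW method. The data are synthetic, and the two features (`dist_km`, the distance between paired AIS and GPS positions, and `speed_gap`, the absolute speed difference) are hypothetical engineered features chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic labelled pairs: 1 = one vessel, 0 = two vessels (no real VMS data)
n = 600
y = rng.integers(0, 2, size=n)
dist_km = np.where(y == 1, rng.exponential(5.0, n), rng.exponential(40.0, n))
speed_gap = np.abs(np.where(y == 1, rng.normal(0, 1, n), rng.normal(0, 4, n)))
X = np.column_stack([dist_km, speed_gap])

# Simple train/test split and standardization using training statistics only
train = np.arange(n) < 450
test = ~train
Xs = (X - X[train].mean(axis=0)) / X[train].std(axis=0)

def add_intercept(X):
    return np.hstack([np.ones((X.shape[0], 1)), X])

def fit_logistic(X, y, l2=0.0, lr=0.5, n_iter=3000):
    """Gradient ascent on the mean log-likelihood with an L2 penalty."""
    Xb = add_intercept(X)
    w = np.zeros(Xb.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(Xb @ w)))
        grad = Xb.T @ (y - p) / len(y)
        grad[1:] -= l2 * w[1:]          # the intercept is not penalized
        w += lr * grad
    return w

def predict_proba(X, w):
    return 1.0 / (1.0 + np.exp(-(add_intercept(X) @ w)))

w = fit_logistic(Xs[train], y[train], l2=0.01)
pred = (predict_proba(Xs[test], w) > 0.5).astype(int)   # initial 0.5 cutoff

tp = np.sum((pred == 1) & (y[test] == 1))
fp = np.sum((pred == 1) & (y[test] == 0))
fn = np.sum((pred == 0) & (y[test] == 1))
accuracy = np.mean(pred == y[test])
f1 = 2 * tp / (2 * tp + fp + fn)
print(f"coefficients={w.round(2)}, accuracy={accuracy:.2f}, F1={f1:.2f}")
```

Moving the 0.5 cutoff up or down trades false positives against false negatives, which is the tuning step described above.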

#### Model 2: Support vector machine (non-linear decision boundary)

We could also train an SVM model to classify a pair of AIS and GPS tracks as overlapping or not overlapping. I would limit overfitting, at the cost of possibly more false negatives, by including a regularization tuning parameter (a constant) chosen by cross-validation. The performance of the SVM model would be evaluated using ROC curves and AUC.
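
A hedged sketch of this setup, assuming scikit-learn is installed: an RBF-kernel `SVC` whose regularization constant `C` is chosen by 5-fold cross-validation and scored by AUC-ROC. The data are synthetic. The two features are hypothetical east/north offsets (in km) between paired AIS and GPS positions, and the 18.5 km radius simply reuses the hourly worst-case gap discussed earlier.

```python
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(1)

# Synthetic pairs: a pair is "one vessel" when the offset falls inside a noisy
# circle, so the true decision boundary is non-linear
n = 500
offsets_km = rng.normal(0.0, 20.0, size=(n, 2))
radius = np.hypot(offsets_km[:, 0], offsets_km[:, 1])
y = (radius + rng.normal(0.0, 3.0, n) < 18.5).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    offsets_km, y, test_size=0.25, stratify=y, random_state=0
)

# Scale features, then fit an RBF SVM; C is the tuning constant
model = make_pipeline(StandardScaler(), SVC(kernel="rbf", gamma="scale"))
search = GridSearchCV(
    model,
    param_grid={"svc__C": [0.1, 1, 10, 100]},
    scoring="roc_auc",
    cv=5,
)
search.fit(X_train, y_train)

# Held-out AUC computed from the signed distance to the decision boundary
auc = roc_auc_score(y_test, search.decision_function(X_test))
print("best C:", search.best_params_["svc__C"], "test AUC:", round(auc, 3))
```

Smaller values of `C` give a smoother boundary that tolerates more training errors, while larger values fit the training data more closely.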

#### Model 3: Forecasting locations using time series

For the two models proposed above, what if there is insufficient information in the features to build an accurate classifier? Or what if a rich training/validation/test set with positive (overlapping, one vessel) and negative (non-overlapping, two vessels) examples of AIS and GPS locations is not available?

One way to deal with this is to restructure the GPS data as a time series and forecast (fill in) longitude/latitude locations. This is shown in the figure below, where we use the known GPS locations (green vessels) to predict GPS locations in the "gaps" (pink vessels). Then we can apply statistical methods to determine whether a certain number (determined by optimization) of forecasted and known GPS locations for a vessel are each highly correlated (overlap) with known AIS time points. If so, we would consider that the AIS and GPS tracks correspond to the same vessel.

![time series](time_series_2.png)
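
As a rough illustration of the gap-filling idea, the sketch below uses plain linear interpolation between known GPS fixes as a simple stand-in for a trained forecasting model, then measures how far the filled-in positions lie from the AIS positions. All tracks are hypothetical toy data, and the "same vessel" AIS track is deliberately generated from the GPS path plus jitter.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius (spherical-Earth assumption)

def haversine_km(lat1, lon1, lat2, lon2):
    """Vectorized great-circle distance in km (inputs in decimal degrees)."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))

# Hypothetical sparse GPS track: one fix every 6 hours
gps_t = np.array([0.0, 6.0, 12.0])            # hours since start
gps_lat = np.array([8.80, 9.10, 9.40])
gps_lon = np.array([-79.70, -79.40, -79.10])

# Hypothetical dense AIS track of the same vessel: same path plus small jitter
rng = np.random.default_rng(0)
ais_t = np.linspace(0.0, 12.0, 25)
ais_lat = np.interp(ais_t, gps_t, gps_lat) + rng.normal(0, 0.005, ais_t.size)
ais_lon = np.interp(ais_t, gps_t, gps_lon) + rng.normal(0, 0.005, ais_t.size)

# Fill the GPS "gaps" at the AIS timestamps (interpolation, not a learned model)
gps_lat_hat = np.interp(ais_t, gps_t, gps_lat)
gps_lon_hat = np.interp(ais_t, gps_t, gps_lon)

# Distance between each AIS point and the filled-in GPS position
same_gap_km = haversine_km(ais_lat, ais_lon, gps_lat_hat, gps_lon_hat)

# A second, different vessel sailing a parallel course 0.5 degrees further north
other_gap_km = haversine_km(ais_lat + 0.5, ais_lon, gps_lat_hat, gps_lon_hat)

print("median gap, same vessel (km): ", round(float(np.median(same_gap_km)), 2))
print("median gap, other vessel (km):", round(float(np.median(other_gap_km)), 2))
```

A threshold on a summary such as the median gap (or on the number of time points within some distance) would then play the role of the "certain number determined by optimization" mentioned above.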


How do we restructure the data? Data from the previous GPS time point (*n-1*) serve as input variables (features) and the next time point (*n*) serves as the output variable (label). The window size of the lag is one because we only include one previous time point (*n-1*) in the example below. Note that we could increase the window size by including more previous time points.

*Labels:*
 - GPS longitude for time point *n* \*
 - GPS latitude for time point *n* \*

\*Only if the previous time point(s) have the same vessel ID, i.e. belong to the same vessel.

*Features:*
 - GPS timestamp for time point *n-1* 
 - GPS distance_from_shore for time point *n-1* 
 - GPS distance_from_port for time point *n-1* 
 - GPS vessel speed for time point *n-1* 
 - GPS vessel course for time point *n-1*
 - GPS latitude for time point *n-1* 
 - GPS longitude for time point *n-1* 
 - GPS is_fishing for time point *n-1*?
 - GPS source for time point *n-1* 
 
 
 - GPS timestamp for time point *n*
 - GPS distance_from_shore for time point *n*
 - GPS distance_from_port for time point *n*
 - GPS vessel speed for time point *n* 
 - GPS vessel course for time point *n*
 - GPS is_fishing for time point *n*?
 - GPS source for time point *n*
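
The restructuring described above can be sketched with a per-vessel `shift` in pandas. The helper name `make_lagged`, the `vessel_id` column, and all values in the toy GPS table are hypothetical. Grouping by vessel before shifting enforces the footnote's rule that a lagged row must come from the same vessel.

```python
import pandas as pd

# Hypothetical GPS (VMS) table with two vessels
gps = pd.DataFrame({
    "vessel_id": ["A", "A", "A", "B", "B"],
    "timestamp": pd.to_datetime([
        "2013-09-19 00:00", "2013-09-19 06:00", "2013-09-19 12:00",
        "2013-09-19 01:00", "2013-09-19 07:00",
    ]),
    "lat": [8.80, 8.90, 9.00, 10.00, 10.10],
    "lon": [-79.70, -79.60, -79.50, -80.00, -79.90],
    "speed": [9.5, 10.0, 10.2, 4.0, 4.5],
})

def make_lagged(df, feature_cols, window=1, group_col="vessel_id"):
    """Add columns holding each feature at time points n-1 ... n-window."""
    df = df.sort_values([group_col, "timestamp"])
    out = df.copy()
    grouped = df.groupby(group_col)
    lag_cols = []
    for k in range(1, window + 1):
        for col in feature_cols:
            name = f"{col}_lag{k}"
            out[name] = grouped[col].shift(k)   # never crosses vessel boundaries
            lag_cols.append(name)
    # Rows without a full history for the same vessel cannot be used
    return out.dropna(subset=lag_cols).reset_index(drop=True)

supervised = make_lagged(gps, ["timestamp", "lat", "lon", "speed"], window=1)

labels = supervised[["lat", "lon"]]               # position at time point n
features = supervised.drop(columns=["lat", "lon"])
print(supervised)
```

Calling `make_lagged(..., window=2)` would add the *n-2* columns as well, at the cost of dropping one more row per vessel.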

#### Model 4: Neural networks for time series classification

We could also build a model using convolutional neural networks (convnets), which can learn spatially or temporally invariant features. To train a convnet model, we will assume that we have a data set with known overlapping (one vessel) and non-overlapping (two vessels) AIS and GPS tracks. Similar to audio data, which is formatted as a 1-D time series, we could restructure our data into a time series of combined AIS and GPS tracks in a set time window. The data example below has a window size (lag) of 2. We could either use transfer learning or train our own convnet (depending on how much data we have) to classify tracks into two categories (Yes/No), using ReLU activation functions and tuning various hyperparameters.

*Labels:*
 - AIS and GPS tracks correspond to one vessel? 1 Yes, 0 No

*Features:*
 - AIS mmsi at time point *n*
 - AIS timestamp at time point *n*
 - AIS distance_from_shore at time point *n*
 - AIS distance_from_port at time point *n*
 - AIS vessel speed at time point *n*
 - AIS vessel course at time point *n*
 - AIS latitude at time point *n*
 - AIS longitude at time point *n*
 - AIS is_fishing at time point *n*? 
 - AIS source at time point *n*
 - GPS vessel ID number at time point *n*
 - GPS timestamp at time point *n*
 - GPS distance_from_shore at time point *n*
 - GPS distance_from_port at time point *n*
 - GPS vessel speed at time point *n*
 - GPS vessel course at time point *n*
 - GPS latitude at time point *n*
 - GPS longitude at time point *n*
 - GPS is_fishing at time point *n*?
 - GPS source at time point *n*


 - all AIS features at time point *n-1*
 - all AIS features at time point *n-2*
 - etc.
 
 
 - all GPS features at time point *n-1*
 - all GPS features at time point *n-2*
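
As a sketch of the data layout only (no network is trained here), the windowed features above can be stacked into a 3-D array with one axis each for samples, time steps, and features. The helper name `make_windows` is hypothetical, the numbers are random placeholders, and the AIS and GPS rows are assumed to be already aligned to common time points.

```python
import numpy as np

def make_windows(track, lag):
    """Turn a (T, F) array into windows of shape (T - lag, lag + 1, F).

    Each window holds time points n-lag ... n, oldest first.
    """
    return np.stack([track[i - lag:i + 1] for i in range(lag, len(track))])

rng = np.random.default_rng(0)
T = 10                                  # number of aligned time points
ais = rng.normal(size=(T, 4))           # placeholder AIS features per time point
gps = rng.normal(size=(T, 4))           # placeholder GPS features per time point

# Combine AIS and GPS features side by side, then cut into windows with lag 2
combined = np.concatenate([ais, gps], axis=1)   # shape (T, 8)
X = make_windows(combined, lag=2)               # shape (8, 3, 8)
print(X.shape)
```

An array in this `(samples, time steps, channels)` order matches the default input layout of Keras `Conv1D`; PyTorch's `Conv1d` expects channels before time, so the last two axes would need to be swapped there.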
