ERIN O'CONNELL 
#  AIS Workflow using 1 day of sample data

Initial Data Cleaning Notes: 

Draught Column: many 0 values

Change in Draught: all 0 values once subset to the dark vessels, making the binary change column redundant

Lat/lon Column: many 0 values, some values over 90 and over 180

MMSI: I used this as a unique_id field for all the ships rather than the ‘name’ field

Lat/lon: also had issues with many trailing decimals

Column types: I made sure columns are recognized as numeric or as POSIX date-time. Normally I would start by clipping rows to the geographic area of interest, but because the clip wasn't working I knew I needed to focus on cleaning the data first.
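
The issues listed above can be counted before any filtering with a quick audit. The sketch below is a minimal illustration on a small made-up frame, not the sample data. The helper name `audit_ais` is hypothetical, and the column names `mmsi`, `latitude`, `longitude`, and `draught` are the ones used in the cells later in this notebook.

```python
import pandas as pd

def audit_ais(df):
    """Count common data-quality problems in an AIS table (hypothetical helper)."""
    # Coerce to numeric so blanks and bad strings become NaN instead of raising
    lat = pd.to_numeric(df["latitude"], errors="coerce")
    lon = pd.to_numeric(df["longitude"], errors="coerce")
    draught = pd.to_numeric(df["draught"], errors="coerce")
    return {
        "rows": len(df),
        "missing_coords": int((lat.isna() | lon.isna()).sum()),
        "zero_coords": int(((lat == 0) & (lon == 0)).sum()),
        "out_of_range_coords": int(((lat.abs() > 90) | (lon.abs() > 180)).sum()),
        "zero_draught": int((draught == 0).sum()),
    }

# Made-up rows, stored as strings the way a CSV read without dtypes may arrive
toy = pd.DataFrame({
    "mmsi": ["111", "111", "222", "333"],
    "latitude": ["26.5", "0", "91.2", ""],
    "longitude": ["54.1", "0", "181.0", "55.0"],
    "draught": ["0", "7.5", "0", "0"],
})
report = audit_ais(toy)
print(report)
```

Running the same helper on the real table before and after cleaning gives a rough check that each filter removed what it was meant to remove.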


Uncomment Installation if Needed

In [1]:
!pip freeze > requirements.txt

Using Python 3.11.7

In [2]:
#pip install rasterio

Collecting rasterio
  Downloading rasterio-1.3.10-cp311-cp311-macosx_11_0_arm64.whl.metadata (14 kB)
Collecting affine (from rasterio)
  Downloading affine-2.4.0-py3-none-any.whl.metadata (4.0 kB)
Collecting cligj>=0.5 (from rasterio)
  Downloading cligj-0.7.2-py3-none-any.whl.metadata (5.0 kB)
Collecting snuggs>=1.4.1 (from rasterio)
  Downloading snuggs-1.4.7-py3-none-any.whl.metadata (3.4 kB)
Collecting click-plugins (from rasterio)
  Downloading click_plugins-1.1.1-py2.py3-none-any.whl.metadata (6.4 kB)
Downloading rasterio-1.3.10-cp311-cp311-macosx_11_0_arm64.whl (18.2 MB)
Downloading cligj-0.7.2-py3-none-any.whl (7.1 kB)
Downloading snuggs-1.4.7-py3-none-any.whl (5.4 kB)
Downloading affine-2.4.0-py3-none-any.whl (15 kB)
Downloading click_plugins-1.1.1-py2.py3-none-any.whl (7.5 kB)
Installing collected packages: snuggs, cligj, click-plugins

In [6]:
#pip install geopandas

Collecting geopandas
  Downloading geopandas-0.14.4-py3-none-any.whl.metadata (1.5 kB)
Collecting fiona>=1.8.21 (from geopandas)
  Downloading fiona-1.9.6-cp311-cp311-macosx_11_0_arm64.whl.metadata (50 kB)
Downloading geopandas-0.14.4-py3-none-any.whl (1.1 MB)
Downloading fiona-1.9.6-cp311-cp311-macosx_11_0_arm64.whl (13.9 MB)
Installing collected packages: fiona, geopandas
Successfully installed fiona-1.9.6 geopandas-0.14.4
Note: you may need to restart the kernel to use updated packages.


In [1]:
# Get your packages
import os
import rasterio
from rasterio.plot import show
import numpy as np
import matplotlib.pyplot as plt
import geopandas as gpd
import pandas as pd
from shapely.geometry import Point
from datetime import datetime, timedelta

Load Data for Analysis

In [3]:
# (NOTE: this is currently hardcoded but could be changed to access a GCS bucket using the gcloud API)
# Uncomment these lines if data is not already in project directory

#ais = gpd.read_file('/Users/kale/Downloads/ais_test/ais_sample.csv')

# Load EEZ
#eez = gpd.read_file('/Users/kale/Downloads/ais_test/iran_eez.geojson')
#print(type(eez))

# Load TTW
#ttw = gpd.read_file('/Users/kale/Downloads/ais_test/iran_ttw.geojson')

<class 'geopandas.geodataframe.GeoDataFrame'>


In [4]:
### Calculate Change in Draught Column ####

# Convert 'draught' column to numeric; unparseable values become NaN
ais['draught'] = pd.to_numeric(ais['draught'], errors='coerce')

# Parse timestamps first so rows sort chronologically rather than as text
ais['timestamp'] = pd.to_datetime(ais['timestamp'])

# Arrange dataframe by 'mmsi' and 'timestamp'
ais = ais.sort_values(by=['mmsi', 'timestamp'])

# Change in draught between consecutive transmissions of the same vessel,
# stored in a new column, change_in_draught
ais['change_in_draught'] = ais.groupby('mmsi')['draught'].diff()

# Reset the row index after sorting
ais = ais.reset_index(drop=True)

In [None]:
### Calculate lag time between AIS transmissions ####

# Parse timestamps, then sort by mmsi and timestamp
ais['timestamp'] = pd.to_datetime(ais['timestamp'])
ais = ais.sort_values(by=['mmsi', 'timestamp'])

# Time since the previous transmission from the same vessel (a Timedelta)
ais['time_diff'] = ais.groupby('mmsi')['timestamp'].diff()

# Convert the Timedelta to hours as a float
ais['time_diff_hours'] = ais['time_diff'].dt.total_seconds() / 3600

# Subset rows where the gap is 6 hours or more and store them in darkvessels
darkvessels = ais[ais['time_diff_hours'] >= 6]
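
As a quick check of the gap logic, here is a minimal sketch on made-up transmissions for two vessels. The values are illustrative only, and it assumes the gap is measured in hours via `Series.dt.total_seconds()`.

```python
import pandas as pd

# Made-up transmissions; only vessel 111 goes silent for more than 6 hours
toy = pd.DataFrame({
    "mmsi": [111, 111, 111, 222, 222],
    "timestamp": [
        "2024-01-01 00:00:00", "2024-01-01 01:00:00", "2024-01-01 09:30:00",
        "2024-01-01 00:00:00", "2024-01-01 05:00:00",
    ],
})
toy["timestamp"] = pd.to_datetime(toy["timestamp"])
toy = toy.sort_values(["mmsi", "timestamp"])

# diff() on a datetime column gives Timedelta values; the first row per vessel is NaT
toy["time_diff"] = toy.groupby("mmsi")["timestamp"].diff()
toy["time_diff_hours"] = toy["time_diff"].dt.total_seconds() / 3600

# Rows that follow a silence of 6 hours or more
gaps = toy[toy["time_diff_hours"] >= 6]
print(gaps[["mmsi", "timestamp", "time_diff_hours"]])
```

Each flagged row marks the first transmission after the silence, so the period of darkness runs from the previous row's timestamp to the flagged one.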

In [None]:
### Checking IDs of dark vessels ####

# Vector of unique vessel IDs, excluding NA (can be useful later)
ids = darkvessels['mmsi'].dropna().unique()

# Subset based on the ID vector just created
sub_df = darkvessels[darkvessels['mmsi'].isin(ids)]

# Cross-check: keeping rows with a non-missing mmsi should give the same subset
t = darkvessels[darkvessels['mmsi'].notna()].copy()

In [None]:
### Data Filtering and Cleaning ####

# Work on a copy so the assignments below do not touch darkvessels
t = t.copy()

# Coerce coordinates to numeric; blanks and bad values become NaN
t['latitude'] = pd.to_numeric(t['latitude'], errors='coerce')
t['longitude'] = pd.to_numeric(t['longitude'], errors='coerce')

# Remove rows without a value for lat or lon
t = t[t['latitude'].notna() & t['longitude'].notna()]

# Remove rows with invalid latitude and longitude values
t = t[t['latitude'].between(-90, 90) & t['longitude'].between(-180, 180)]

# Brute-force removal of one specific outlier vessel
t = t[t['mmsi'] != 352003189].copy()

# Round coordinates to 5 decimal places to trim trailing decimals
t['latitude'] = t['latitude'].round(5)
t['longitude'] = t['longitude'].round(5)

# Binary flag: 1 where draught changed since the vessel's previous transmission, else 0
t['draught_change'] = (t['change_in_draught'].fillna(0) != 0).astype(int)

# Create a new dataframe with the specific columns we want
cleaned_csv = t[['mmsi', 'timestamp', 'latitude', 'longitude', 'draught_change', 'time_diff_hours']].copy()

In [None]:
### Subset Dark Vessels based on Location ####

# Filter based on input vector data using geopandas
eez = gpd.read_file("/Users/kale/Downloads/ais_test/iran_eez.geojson")
ttw = gpd.read_file("/Users/kale/Downloads/ais_test/iran_ttw.geojson")

# Convert cleaned_csv to a GeoDataFrame
gdf = gpd.GeoDataFrame(cleaned_csv, geometry=gpd.points_from_xy(cleaned_csv.longitude, cleaned_csv.latitude), crs="EPSG:4326")

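# Are the layers in the same CRS? Check this for both ttw and eez
for name, layer in [("eez", eez), ("ttw", ttw)]:
    if layer.crs == gdf.crs:
        print(name, "and the vessel points are in the same CRS:", layer.crs)
    else:
        print(name, "CRS differs from the vessel points:", layer.crs, gdf.crs)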

# Clip vessels to boundary 
v_eez = gpd.overlay(gdf, eez, how='intersection')
v_ttw = gpd.overlay(gdf, ttw, how='intersection')

### Notes:
# Can use clip, intersection, or contains, depending on desired output
# Always make sure the CRS, datum, extent, and resolution match between layers
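
One of the alternatives mentioned in the notes, a point-in-polygon test, can be sketched with a spatial join. This is a minimal illustration, not the method used above: the square polygon and the three positions are made up and do not represent the real EEZ or TTW layers.

```python
import geopandas as gpd
from shapely.geometry import Polygon

# Made-up square standing in for a boundary polygon (not the real EEZ)
zone = gpd.GeoDataFrame(
    {"zone_name": ["toy_zone"]},
    geometry=[Polygon([(50, 24), (56, 24), (56, 28), (50, 28)])],
    crs="EPSG:4326",
)

# Made-up vessel positions: the first two fall inside the square, the third outside
pts = gpd.GeoDataFrame(
    {"mmsi": [111, 222, 333]},
    geometry=gpd.points_from_xy([52.0, 55.5, 60.0], [26.0, 27.0, 26.0]),
    crs="EPSG:4326",
)

# Reproject if the layers disagree, then keep only the points within the polygon
if pts.crs != zone.crs:
    pts = pts.to_crs(zone.crs)
inside = gpd.sjoin(pts, zone, how="inner", predicate="within")
print(inside[["mmsi", "zone_name"]])
```

A join like this keeps the point geometry and appends the polygon's attributes, which can be convenient when the boundary layer carries a name or ID worth exporting.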

Final Cleaning and Export of CSV

In [None]:
# Extract coordinates and format for export
eez_csv = v_eez.assign(Latitude=v_eez.geometry.y, Longitude=v_eez.geometry.x).drop(columns=['geometry'])
ttw_csv = v_ttw.assign(Latitude=v_ttw.geometry.y, Longitude=v_ttw.geometry.x).drop(columns=['geometry'])

# Write csv to filepath
eez_csv.to_csv("eez_vessels.csv", index=False)
ttw_csv.to_csv("ttw_vessels.csv", index=False)


# NOTES:
# Many of the draught values are 0, so when I subsetted down to the
# dark vessels none of them had draught values. The draught difference
# is therefore also 0, which makes the binary draught column all 0.
    

# Part 3: How would you expand your solution to predict where a dark vessel went during its period of darkness?

To solve this, I would interpolate from the vessel's previous movements, as I have done with animal movement data. For that work I used the distribution of previous movement metrics (turning angle, speed, position) together with environmental predictors. I could also build tracks for each vessel.
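
The movement metrics mentioned here can be derived from consecutive positions. The sketch below is a minimal illustration, assuming a track is a time-sorted list of `(hours, lat, lon)` tuples. The function names and the sample track are hypothetical, and the distance is the standard haversine great-circle approximation.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def bearing_deg(lat1, lon1, lat2, lon2):
    """Initial bearing from point 1 to point 2, in degrees clockwise from north."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dlmb = math.radians(lon2 - lon1)
    y = math.sin(dlmb) * math.cos(p2)
    x = math.cos(p1) * math.sin(p2) - math.sin(p1) * math.cos(p2) * math.cos(dlmb)
    return math.degrees(math.atan2(y, x)) % 360

def movement_metrics(track):
    """track: list of (hours, lat, lon) sorted by time -> one dict per step."""
    steps = []
    for (t0, la0, lo0), (t1, la1, lo1) in zip(track, track[1:]):
        dist = haversine_km(la0, lo0, la1, lo1)
        steps.append({
            "speed_kmh": dist / (t1 - t0),
            "bearing": bearing_deg(la0, lo0, la1, lo1),
        })
    # Turning angle = change in bearing between consecutive steps, wrapped to [-180, 180)
    for prev, cur in zip(steps, steps[1:]):
        cur["turn"] = (cur["bearing"] - prev["bearing"] + 180) % 360 - 180
    return steps

# Made-up track: due north for an hour, then due east for an hour
track = [(0.0, 26.0, 54.0), (1.0, 26.1, 54.0), (2.0, 26.1, 54.1)]
steps = movement_metrics(track)
print(steps)
```

The per-vessel distributions of `speed_kmh` and `turn` are the kind of inputs an interpolation or simulation of the dark period could draw from.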

I would also incorporate ML, with a 70/30 training/validation split.
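
A minimal sketch of such a split follows. Splitting by vessel rather than by row is an assumption added here, so that no `mmsi` appears in both subsets. The helper name `split_by_vessel` and the toy table are hypothetical.

```python
import numpy as np
import pandas as pd

def split_by_vessel(df, train_frac=0.7, seed=0):
    """Split rows by vessel so that about train_frac of the vessels go to training."""
    rng = np.random.default_rng(seed)
    # Shuffle the unique vessel IDs, then cut the shuffled list at train_frac
    ids = rng.permutation(df["mmsi"].unique())
    n_train = int(round(train_frac * len(ids)))
    is_train = df["mmsi"].isin(ids[:n_train])
    return df[is_train], df[~is_train]

# Made-up table: 10 vessels with 3 rows each
toy = pd.DataFrame({"mmsi": np.repeat(np.arange(100, 110), 3), "ping": range(30)})
train, valid = split_by_vessel(toy)
print(train["mmsi"].nunique(), valid["mmsi"].nunique())
```

Keeping each vessel on one side of the split avoids scoring the model on tracks it has partly seen during training.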

Ideas: regression trees, CNNs (neural nets), and tuning the hyperparameters or our predictors of vessel movement


I would need data on movement metrics and other environmental predictors.
