# Notebook 1: Data Exploration and Preprocessing

This notebook covers the initial steps of exploring the AIS data and visualizing the generated sea graph. The core, repeatable processes have been moved to executable `.py` scripts.

## 1. Install Dependencies

In [1]:
!pip install -q numpy pandas geopandas matplotlib seaborn plotly scikit-learn networkx notebook



## 2. Imports

In [2]:
import os
import pickle
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import networkx as nx

## 3. Configuration

Define the paths to the processed data files. Run the `1_create_graph.py` and `2_prepare_data.py` scripts first to generate them.

In [3]:
CLEANED_AIS_PATH = "../data/processed/cleaned_guam_ais.csv"
GRAPH_PATH = "../data/processed/sea_graph_guam.pkl"
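
Because both files are produced by separate scripts, it can help to confirm they exist before loading anything. The following is a minimal sketch, not part of the original pipeline; the helper name `missing_inputs` is hypothetical.

```python
import os

# Paths copied from the configuration cell above.
CLEANED_AIS_PATH = "../data/processed/cleaned_guam_ais.csv"
GRAPH_PATH = "../data/processed/sea_graph_guam.pkl"


def missing_inputs(paths):
    """Return the subset of `paths` that do not exist on disk."""
    return [p for p in paths if not os.path.exists(p)]


missing = missing_inputs([CLEANED_AIS_PATH, GRAPH_PATH])
if missing:
    print("Missing input files (run the preparation scripts first):")
    for p in missing:
        print(f"  - {p}")
else:
    print("All processed inputs found.")
```

Relative paths are resolved against the notebook's working directory, so a "missing" report may also mean the notebook was started from a different folder.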

## 4. Load and Inspect Cleaned AIS Data

Load the `cleaned_guam_ais.csv` file that was created by the `2_prepare_data.py` script.

In [4]:
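# Parse timestamps on load so BaseDateTime is a datetime column rather than plain strings
df = pd.read_csv(CLEANED_AIS_PATH, parse_dates=["BaseDateTime"])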

In [5]:
df.head()

Unnamed: 0,MMSI,BaseDateTime,LAT,LON,SOG,COG,Heading,VesselName,IMO,CallSign,VesselType,Status,Length,Width,Draft,Cargo,TransceiverClass
0,368920000,2024-01-01 00:09:13,13.42058,144.66585,0.0,290.0,326.0,HENSON,IMO9132129,NENB,35.0,5.0,100.0,18.0,0.0,35.0,A
1,338926439,2024-01-01 00:10:52,13.42453,144.6632,0.0,0.0,150.0,CGC MYRTLE HAZARD,,NMHD,51.0,0.0,46.0,7.0,3.0,51.0,A
2,338926439,2024-01-01 00:13:13,13.42454,144.66318,0.0,0.0,150.0,CGC MYRTLE HAZARD,,NMHD,51.0,0.0,46.0,7.0,3.0,51.0,A
3,338926439,2024-01-01 00:14:23,13.42452,144.66319,0.0,0.0,150.0,CGC MYRTLE HAZARD,,NMHD,51.0,0.0,46.0,7.0,3.0,51.0,A
4,338926439,2024-01-01 00:21:13,13.42453,144.66319,0.0,0.0,150.0,CGC MYRTLE HAZARD,,NMHD,51.0,0.0,46.0,7.0,3.0,51.0,A


In [None]:
df.info()

In [None]:
# Check for missing values in the cleaned dataset
df.isna().sum()

## 5. Analyze Vessel Data

In [None]:
print(f"Unique MMSI (vessels) in the dataset: {df['MMSI'].nunique()}")
print("\nTop 5 most frequent vessels:")
print(df['VesselName'].value_counts().head())
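
Vessel names can be missing (or shared between vessels), while MMSI is the key used for grouping later in this notebook, so counting reports per MMSI may give a more robust ranking. The sketch below runs on a tiny made-up frame; `reports_per_vessel` is a hypothetical helper, and the rows are illustrative, not real data.

```python
import pandas as pd


def reports_per_vessel(frame: pd.DataFrame) -> pd.DataFrame:
    """Count AIS reports per MMSI and attach the first non-null vessel name."""
    counts = frame.groupby("MMSI").size().rename("reports")
    names = frame.groupby("MMSI")["VesselName"].first().rename("name")
    summary = pd.concat([counts, names], axis=1)
    return summary.sort_values("reports", ascending=False)


# Tiny illustrative frame (made-up rows)
demo = pd.DataFrame({
    "MMSI": [111, 111, 222, 111, 222, 333],
    "VesselName": ["ALPHA", "ALPHA", None, "ALPHA", "BRAVO", None],
})

top = reports_per_vessel(demo)
print(top.head())
```

On the real data, `reports_per_vessel(df).head()` would play the same role as the `value_counts()` call above, but vessels without a name are still counted.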

## 6. Load and Visualize the Sea Graph

Load the graph generated by `1_create_graph.py` and plot the sea nodes.

In [None]:
with open(GRAPH_PATH, "rb") as f:
    G = pickle.load(f)

print("Graph loaded with:")
print(f"- {G.number_of_nodes()} nodes")
print(f"- {G.number_of_edges()} edges")
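
Node and edge counts alone do not show whether the graph is usable for routing. A few extra checks (connected components, isolated nodes, degree range) can be sketched as follows. This assumes the sea graph is an undirected `networkx.Graph`, which the notebook does not state; `nx.connected_components` is not defined for directed graphs. The helper `describe_graph` and the small demo graph are hypothetical stand-ins.

```python
import networkx as nx


def describe_graph(graph: nx.Graph) -> dict:
    """Summarize size, connectivity, and degree range of an undirected graph."""
    components = list(nx.connected_components(graph))
    degrees = [d for _, d in graph.degree()]
    return {
        "nodes": graph.number_of_nodes(),
        "edges": graph.number_of_edges(),
        "components": len(components),
        "largest_component": max((len(c) for c in components), default=0),
        "isolated_nodes": sum(1 for d in degrees if d == 0),
        "max_degree": max(degrees, default=0),
    }


# Hypothetical stand-in: a 3x3 grid of (lat, lon) nodes plus one isolated node
step = 0.05
coords = {
    (i, j): (round(13.0 + i * step, 4), round(144.0 + j * step, 4))
    for i in range(3)
    for j in range(3)
}
demo = nx.Graph()
for (i, j), node in coords.items():
    demo.add_node(node)
    # Connect each node to its neighbor one row up and one column right
    if (i + 1, j) in coords:
        demo.add_edge(node, coords[(i + 1, j)])
    if (i, j + 1) in coords:
        demo.add_edge(node, coords[(i, j + 1)])
demo.add_node((14.0, 145.0))  # isolated node

stats = describe_graph(demo)
print(stats)
```

More than one component, or any isolated nodes, would mean some sea nodes cannot be reached from others.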

In [None]:
# Setup figure and axis
plt.figure(figsize=(10, 10))

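# Plot sea graph nodes. Nodes are (lat, lon) tuples, so xs holds latitudes
# and ys holds longitudes; longitude goes on the horizontal axis.
xs, ys = zip(*G.nodes)
plt.scatter(ys, xs, s=5, color='blue', label='Sea Nodes')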

# Label and style
plt.title('Sea Graph: Guam Region')
plt.xlabel('Longitude')
plt.ylabel('Latitude')
plt.legend()
plt.grid(True)

# Set axis limits to focus on the Guam area
plt.xlim(144.0, 145.0)
plt.ylim(13.0, 14.0)
plt.gca().set_aspect('equal', adjustable='box')
plt.show()

## 7. Visualize a Sample Trajectory

To understand the nature of the training data, let's extract the path of a single vessel and plot it on the graph.
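
The next cell maps each AIS position to a graph node by rounding latitude and longitude to the nearest multiple of a grid step. As a worked example, the function is repeated here and applied to two positions from the `df.head()` output in section 4. The 0.05 degree step is the cell's default; whether it matches the spacing used by `1_create_graph.py` is an assumption.

```python
def snap_to_grid(lat, lon, step=0.05):
    """Round a coordinate pair to the nearest multiple of `step` on each axis."""
    return (round(round(lat / step) * step, 4), round(round(lon / step) * step, 4))


# First row of the df.head() output in section 4
node = snap_to_grid(13.42058, 144.66585)
print(node)  # (13.4, 144.65)

# A nearby report from the second row lands on the same node
same = snap_to_grid(13.42453, 144.6632)
print(same == node)  # True
```

With a 0.05 degree step, one cell spans roughly 5.5 km in latitude, so many consecutive reports from a slow or stationary vessel map to the same node.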

In [None]:
# Convert lat/lon to grid nodes (this is also done in the scripts).
# `step` must match the grid spacing used to build the graph, or no points will match.
def snap_to_grid(lat, lon, step=0.05):
    return (round(round(lat / step) * step, 4), round(round(lon / step) * step, 4))

df['grid_node'] = df.apply(lambda row: snap_to_grid(row['LAT'], row['LON']), axis=1)

# Filter out any points that aren't in our graph
df_filtered = df[df['grid_node'].isin(G.nodes)]

# Group by vessel and extract trajectories
trajectories = {}
for mmsi, group in df_filtered.groupby("MMSI"):
    sorted_group = group.sort_values("BaseDateTime")
    path = list(sorted_group["grid_node"])
    if len(path) > 20:  # Keep only reasonably long paths for visualization
        trajectories[mmsi] = path

# Pick one vessel to plot (raises StopIteration if no vessel has more than 20 points)
sample_mmsi = next(iter(trajectories))
path = trajectories[sample_mmsi]

# Plot the trajectory; as with the graph nodes, the first tuple element is latitude
xs_path, ys_path = zip(*path)

plt.figure(figsize=(10, 10))
plt.scatter(ys, xs, s=5, color='lightblue', label='All Sea Nodes') # All nodes
plt.plot(ys_path, xs_path, marker='o', markersize=3, color='red', label=f'Vessel {sample_mmsi}')
plt.title(f"Sample Trajectory of Vessel {sample_mmsi}")
plt.xlabel("Longitude")
plt.ylabel("Latitude")
plt.legend()
plt.grid(True)
plt.xlim(144.0, 145.0)
plt.ylim(13.0, 14.0)
plt.gca().set_aspect('equal', adjustable='box')
plt.show()
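
Note that the `len(path) > 20` filter counts reports, not distinct nodes. The sample rows in section 4 show a vessel reporting from nearly the same position several times, so a path can be long while containing very few moves. One possible refinement, sketched below, collapses consecutive repeats before measuring path length. Whether the preprocessing scripts already do this is not stated; `collapse_repeats` and the sample path are hypothetical.

```python
from itertools import groupby


def collapse_repeats(path):
    """Drop consecutive duplicate nodes, keeping the order of distinct visits."""
    return [node for node, _ in groupby(path)]


# Made-up path: four reports at one node, then a short loop back to it
raw_path = [(13.4, 144.65)] * 4 + [
    (13.45, 144.65),
    (13.45, 144.7),
    (13.45, 144.7),
    (13.4, 144.65),
]

moves = collapse_repeats(raw_path)
print(len(raw_path), "->", len(moves))  # 8 -> 4
```

Only consecutive duplicates are removed, so a vessel that returns to an earlier node still shows that node twice in the collapsed path.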