### 0. Thoughts and ideas 

#### 0.1 General

One possible strategy:
- Treat the prediction of future AIS data as a prediction task itself (X: Historic AIS and positions, Y: AIS data in next timestep) and create a model for this
- Use the predicted AIS data as well as historic AIS data and positions to predict new position. 


Another:
- Let a model use the previous timesteps to predict all information about the next timestep.
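
The second strategy amounts to an autoregressive rollout: each predicted timestep is fed back in as input for the next one. Below is a minimal sketch of that loop. The names `rollout` and `constant_velocity` are hypothetical, and `constant_velocity` is only a stand-in for a trained model, not something from the project.

```python
from typing import Callable, List, Sequence

State = List[float]  # e.g. [latitude, longitude, ...]; the layout is an assumption


def rollout(history: Sequence[State],
            predict_next: Callable[[Sequence[State]], State],
            n_steps: int,
            window: int = 4) -> List[State]:
    """Feed each prediction back in as input for the next step."""
    buffer = [list(s) for s in history[-window:]]
    predictions = []
    for _ in range(n_steps):
        nxt = predict_next(buffer)
        predictions.append(nxt)
        buffer = buffer[1:] + [nxt]  # drop the oldest state, append the newest
    return predictions


def constant_velocity(buffer: Sequence[State]) -> State:
    """Hypothetical stand-in for a trained model: repeat the last observed change."""
    prev, last = buffer[-2], buffer[-1]
    return [l + (l - p) for p, l in zip(prev, last)]


history = [[0.0, 0.0], [1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]
future = rollout(history, constant_velocity, n_steps=3)
print(future)  # [[4.0, 8.0], [5.0, 10.0], [6.0, 12.0]]
```

Errors accumulate in such a loop, since every step builds on earlier predictions rather than on observed data.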

#### 0.2 About the datasets

#### 0.3 Research


##### Article evaluating several models to predict ship trajectories

Definitions:
- A ship trajectory is the sequence of timestamped points Pi = {Ti, LATi, LONi, SOGi, COGi}


Methodology:
- Information from the first four timestamps is used to predict the next.
- Implemented using the PyTorch framework
- Uses Adam as optimizer
- Uses the following hyperparameters: learning rate: 0.0001, epochs: 100, dropout: 0.5, hidden size: 128 (15), input/output dimensions: 2, hidden layers: 1
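
The "first four timestamps predict the next" setup can be sketched as a sliding window over one trajectory. This is an illustration only: `make_windows` is a hypothetical helper, and the toy trajectory values are made up.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def make_windows(trajectory: Sequence[Point], n_inputs: int = 4
                 ) -> Tuple[List[List[Point]], List[Point]]:
    """Split one ship's trajectory into (inputs, target) pairs."""
    X, Y = [], []
    for i in range(len(trajectory) - n_inputs):
        X.append(list(trajectory[i:i + n_inputs]))  # n_inputs consecutive points
        Y.append(trajectory[i + n_inputs])          # the point that follows them
    return X, Y


# Toy trajectory of (lat, lon) points; the values are made up for illustration.
trajectory = [(float(i), float(10 * i)) for i in range(6)]
X, Y = make_windows(trajectory)
print(len(X), len(Y))  # 2 2
print(X[0], "->", Y[0])
```

A trajectory of N points yields N - 4 training pairs, so ships with four or fewer points contribute none.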


Interesting points
- "Deep learning exhibits remarkable performance in AIS data-driven ship trajectory prediction"
- "Deep learning are in general better than machine learning for this application"
- Transformer, Bi-GRU and GRU perform the best; the Transformer only outperforms the others on medium-sized datasets
- SVR is the best machine learning algorithm

##### Brainstorming - 18.09.2024

AIS - data:
- Parameters that intuitively determine the next position: the kinematic ones (COG, SOG and ROT), plus current position, ETARAW and PortID
- Should try merging navstat codes used to describe the same activity
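
To make the intuition about COG and SOG concrete, here is a rough dead-reckoning sketch. It assumes SOG is in knots and COG is in degrees clockwise from north (the usual AIS convention), and it uses a flat-earth approximation that breaks down near the poles and across the antimeridian. ROT is ignored. The function name `dead_reckon` is hypothetical.

```python
import math
from typing import Tuple


def dead_reckon(lat_deg: float, lon_deg: float, sog_knots: float,
                cog_deg: float, dt_seconds: float) -> Tuple[float, float]:
    """Estimate the next position assuming constant speed and course.

    Approximation: one nautical mile is about one arc minute of latitude.
    """
    distance_nm = sog_knots * dt_seconds / 3600.0
    cog = math.radians(cog_deg)
    dlat = distance_nm * math.cos(cog) / 60.0
    # Longitude degrees shrink with latitude, hence the cos(lat) correction.
    dlon = distance_nm * math.sin(cog) / (60.0 * math.cos(math.radians(lat_deg)))
    return lat_deg + dlat, lon_deg + dlon


# 12 knots due east from the equator for one hour: about 0.2 degrees of longitude.
lat, lon = dead_reckon(lat_deg=0.0, lon_deg=0.0, sog_knots=12.0,
                       cog_deg=90.0, dt_seconds=3600)
print(round(lat, 6), round(lon, 6))
```

A baseline like this could also serve as a sanity check on a learned model's output.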



General:
- Should somehow allow the algorithm to keep the last values - research different strategies (CNN or LSTM?)
- Might want to use a classifier to predict features
- Ship-ID is probably a pointless input for the classifier


##### Other


- Gustav & co used AutoGluon: https://auto.gluon.ai/stable/index.html


In [10]:
# IMPORTS

import numpy as np
import pandas as pd
import xgboost as xgb

### 1. Data 

#### 1.1 Load data into dataframes

In [11]:



ais_train_data_path = '../../Project materials/ais_train.csv'


ais_data_train = pd.read_csv(ais_train_data_path, sep='|')


ais_data_train.head()

Unnamed: 0,time,cog,sog,rot,heading,navstat,etaRaw,latitude,longitude,vesselId,portId
0,2024-01-01 00:00:25,284.0,0.7,0,88,0,01-09 23:00,-34.7437,-57.8513,61e9f3a8b937134a3c4bfdf7,61d371c43aeaecc07011a37f
1,2024-01-01 00:00:36,109.6,0.0,-6,347,1,12-29 20:00,8.8944,-79.47939,61e9f3d4b937134a3c4bff1f,634c4de270937fc01c3a7689
2,2024-01-01 00:01:45,111.0,11.0,0,112,0,01-02 09:00,39.19065,-76.47567,61e9f436b937134a3c4c0131,61d3847bb7b7526e1adf3d19
3,2024-01-01 00:03:11,96.4,0.0,0,142,1,12-31 20:00,-34.41189,151.02067,61e9f3b4b937134a3c4bfe77,61d36f770a1807568ff9a126
4,2024-01-01 00:03:51,214.0,19.7,0,215,0,01-25 12:00,35.88379,-5.91636,61e9f41bb937134a3c4c0087,634c4de270937fc01c3a74f3


In [12]:
#Create dataframes for each ship-id:

ship_train_groups = ais_data_train.groupby('vesselId')
ship_train_dataframes = {ship_id: group for ship_id, group in ship_train_groups}

#Split data into input and output. Input can then be accessed as ship_train_dataframes[shipID][0] and output as ship_train_dataframes[shipID][1]

# for key in ship_train_dataframes:
#     ship_train_dataframes[key] = [ship_train_dataframes[key].drop(columns=['latitude', 'longitude']), ship_train_dataframes[key][['latitude', 'longitude']]]


# print(ship_train_dataframes)

#### 1.2 Split data into X and Y

### 2. Try to create predictions using simple models:

#### 2.1 XG-boost

In [13]:
# xgb_simple = xgb.XGBClassifier()


# for key in ship_train_dataframes:
#     xgb_simple.fit(ship_train_dataframes[key][0], ship_train_dataframes[key][1])

### 3. Attempting to implement an approach similar to the one in the article:

In [28]:
#Imports

import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, Dataset
import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from collections import defaultdict

#### 3.1 Preprocess data into timeseries

In [15]:
all_timeseries = []
scaler = MinMaxScaler()

sequence_length = 5


for ship_id, df in ship_train_dataframes.items():
    df['time'] = pd.to_datetime(df['time'])

    df['hour'] = df['time'].dt.hour
    df['minute'] = df['time'].dt.minute
    df['second'] = df['time'].dt.second

    features = df[['hour', 'minute', 'second', 'longitude', 'latitude', 'sog', 'cog']].values

    # Note: the scaler is refit for every ship, so each ship is scaled to its own min/max
    features_normalized = scaler.fit_transform(features)

    for i in range(len(features_normalized) - sequence_length):
        timeseries = features_normalized[i:i+sequence_length+1]
        all_timeseries.append(timeseries)


all_timeseries = np.array(all_timeseries)

X_data = all_timeseries[:, :-1, :]
Y_data = all_timeseries[:, -1, :]




print(all_timeseries.shape)
print(X_data.shape)
print(Y_data.shape)

(1518629, 6, 7)
(1518629, 5, 7)
(1518629, 7)
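
For reference, this is the mapping that min-max scaling with the default (0, 1) range applies, written in plain Python. Because the cell above refits the scaler for each ship, the same normalized value corresponds to different physical values for different ships. The helper names `fit_min_max`, `scale` and `unscale` are hypothetical.

```python
from typing import List, Sequence, Tuple


def fit_min_max(rows: Sequence[Sequence[float]]) -> Tuple[List[float], List[float]]:
    """Return the per-column minimum and maximum."""
    columns = list(zip(*rows))
    return [min(c) for c in columns], [max(c) for c in columns]


def scale(row: Sequence[float], lo: Sequence[float], hi: Sequence[float]) -> List[float]:
    """Map each value to [0, 1]; constant columns map to 0."""
    return [(x - l) / (h - l) if h > l else 0.0 for x, l, h in zip(row, lo, hi)]


def unscale(row: Sequence[float], lo: Sequence[float], hi: Sequence[float]) -> List[float]:
    """Invert scale() for columns that are not constant."""
    return [x * (h - l) + l for x, l, h in zip(row, lo, hi)]


rows = [[10.0, 100.0], [20.0, 300.0], [30.0, 200.0]]
lo, hi = fit_min_max(rows)
scaled = scale(rows[1], lo, hi)
print(scaled)  # [0.5, 1.0]
restored = unscale(scaled, lo, hi)
print(restored)  # [20.0, 300.0]
```

Predictions must be un-scaled with the same minimum and maximum that scaled the inputs, which is why the scaler used for a ship at training time and at prediction time should match.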


#### 3.2 GRU - model

In [16]:
class GRUnet(nn.Module):
    def __init__(self, input_size, hidden_size, output_size, num_layers):
        super(GRUnet, self).__init__()
        self.hidden_size = hidden_size
        self.num_layers = num_layers

        self.gru = nn.GRU(input_size, hidden_size, num_layers, batch_first = True)
        self.fc = nn.Linear(hidden_size, output_size)

    def forward(self, X):

        hidden_initialize = torch.zeros(self.num_layers, X.size(0), self.hidden_size).to(X.device)

        out, _ = self.gru(X, hidden_initialize)

        out = self.fc(out[:, -1, :])

        return out
    

#Model parameters:
input_size = 7
hidden_size = 64   #Random guess on what is best
output_size = 7     
num_layers = 2 

GRU_model = GRUnet(input_size=input_size, hidden_size=hidden_size, output_size=output_size, num_layers=num_layers)

print(GRU_model)

GRUnet(
  (gru): GRU(7, 64, num_layers=2, batch_first=True)
  (fc): Linear(in_features=64, out_features=7, bias=True)
)


#### 3.3 Train Model

In [17]:
# Hyperparameters 

learning_rate = 0.001
num_epochs = 10
batch_size = 64


optimizer = optim.Adam(GRU_model.parameters(), lr = learning_rate)
loss_function = nn.MSELoss()


#Preprocess data:

X_tensor = torch.tensor(X_data, dtype=torch.float32)
Y_tensor = torch.tensor(Y_data, dtype=torch.float32)

train_dataset = torch.utils.data.TensorDataset(X_tensor, Y_tensor)
data_loader = torch.utils.data.DataLoader(dataset=train_dataset, batch_size=batch_size, shuffle = True)

for epoch in range(num_epochs):
    for inputs, targets in data_loader:

        optimizer.zero_grad()

        outputs = GRU_model(inputs)

        loss = loss_function(outputs, targets)

        loss.backward()
        optimizer.step()

    # Note: this reports the loss of the last mini-batch only, not the epoch average
    print(f'Epoch [{epoch+1}/{num_epochs}], Loss: {loss.item():.4f}')




Epoch [1/10], Loss: 0.0161
Epoch [2/10], Loss: 0.0156
Epoch [3/10], Loss: 0.0136
Epoch [4/10], Loss: 0.0193
Epoch [5/10], Loss: 0.0098
Epoch [6/10], Loss: 0.0167
Epoch [7/10], Loss: 0.0118
Epoch [8/10], Loss: 0.0149
Epoch [9/10], Loss: 0.0184
Epoch [10/10], Loss: 0.0135
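
The values printed above come from the final mini-batch of each epoch, which may be why they fluctuate rather than decrease steadily. One option is to track a sample-weighted mean over the whole epoch. The sketch below uses a hypothetical `RunningMean` helper and made-up batch losses.

```python
class RunningMean:
    """Sample-weighted running mean, e.g. of per-batch losses."""

    def __init__(self) -> None:
        self.total = 0.0
        self.count = 0

    def update(self, value: float, n: int = 1) -> None:
        # Weight each batch by its size so a short final batch does not skew the mean.
        self.total += value * n
        self.count += n

    @property
    def mean(self) -> float:
        return self.total / self.count if self.count else float("nan")


# Hypothetical (batch_loss, batch_size) pairs standing in for one epoch.
batches = [(0.020, 64), (0.010, 64), (0.030, 32)]

epoch_loss = RunningMean()
for batch_loss, batch_size in batches:
    epoch_loss.update(batch_loss, n=batch_size)

print(f"mean loss: {epoch_loss.mean:.4f}")  # mean loss: 0.0180
```

In the training loop this would correspond to calling `epoch_loss.update(loss.item(), n=inputs.size(0))` after each batch and printing `epoch_loss.mean` once per epoch.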


In [18]:
##Save model

torch.save(GRU_model.state_dict(), "gru_model_test.pth")

#### 3.4 Make Predictions

In [20]:
loaded_GRU_model = GRUnet(input_size=input_size, hidden_size=hidden_size, output_size=output_size, num_layers=num_layers)
loaded_GRU_model.load_state_dict(torch.load("gru_model_test.pth"))

loaded_GRU_model.eval()

GRUnet(
  (gru): GRU(7, 64, num_layers=2, batch_first=True)
  (fc): Linear(in_features=64, out_features=7, bias=True)
)

In [22]:


# def predict_ship_position(ship_id, time, model, sequence_length = 5):

#     if ship_id not in ship_train_dataframes:
#         print(f"No training data available for ship_id: {ship_id}")
#         return None
    
#     ship_df = ship_train_dataframes[ship_id]

#     ship_df['time'] = pd.to_datetime(ship_df['time'])

#     ship_df = ship_df.sort_values(by='time').reset_index(drop=True)

#     prediction_time = pd.to_datetime(time)

#     last_known_time = ship_df['time'].iloc[-1]
#     time_diff = (prediction_time - last_known_time).total_seconds()
    
#     if time_diff <= 0:
#         print(f"Prediction time {prediction_time} is before or equal to the last known time {last_known_time}")
#         return None
    
#     steps_needed = int(time_diff // (20 * 60))  # assumes one model step is 20 minutes

#     ship_df['hour'] = ship_df['time'].dt.hour
#     ship_df['minute'] = ship_df['time'].dt.minute
#     ship_df['second'] = ship_df['time'].dt.second

#     features = ship_df[['hour', 'minute', 'second', 'longitude', 'latitude', 'sog', 'cog']].values

#     if len(features) < sequence_length:
#         print(f"Not enough historical data to predict for ship_id: {ship_id}")
#         return None

#     input_sequence = features[-sequence_length:]

#     input_sequence_normalized = scaler.transform(input_sequence)
#     input_tensor = torch.tensor(input_sequence_normalized, dtype=torch.float32).unsqueeze(0)

#     for _ in range(steps_needed):
#         with torch.no_grad():
           
#             next_position = model(input_tensor)

        
#         next_position_tensor = next_position.unsqueeze(0)
#         input_tensor = torch.cat((input_tensor[:, 1:, :], next_position_tensor), dim=1)
    
#     next_position_np = next_position.numpy()
#     next_position_inverse = scaler.inverse_transform(next_position_np)

#     return next_position_inverse


In [27]:
ais_test_data_path = '../../Project materials/ais_test.csv'
ais_data_test = pd.read_csv(ais_test_data_path)
unique_ship_ids = ais_data_test['vesselId'].unique()

In [29]:

# for idx, row in ais_data_test.iterrows():
#     ship_id_sample = row['vesselId']
#     prediction_time = row['time']

#     # Predict the position at the specified future time
#     predicted_position = predict_ship_position(ship_id=ship_id_sample, time=prediction_time, model=loaded_GRU_model)

#     if predicted_position is not None:
#         print(f"Predicted position for ship_id {ship_id_sample} at {prediction_time}: {predicted_position}")

In [45]:
##Attempt at a faster approach:



# Dictionary to store predictions for each ship

ship_predictions = defaultdict(dict)

# Predict for each unique ship_id and store results
for ship_id in unique_ship_ids:
    if ship_id not in ship_train_dataframes:
        print(f"No training data available for ship_id: {ship_id}")
        continue

    df = ship_train_dataframes[ship_id]
    
    # Ensure 'time' is in datetime format
    df['time'] = pd.to_datetime(df['time'])
    
    # Extract the last known sequence for this ship
    df['hour'] = df['time'].dt.hour
    df['minute'] = df['time'].dt.minute
    df['second'] = df['time'].dt.second

    features = df[['hour', 'minute', 'second', 'longitude', 'latitude', 'sog', 'cog']].values
    if len(features) < sequence_length:
        print(f"Not enough historical data to predict for ship_id: {ship_id}")
        continue

    # Prepare the input sequence
    input_sequence = features[-sequence_length:]
    scaler = MinMaxScaler().fit(features)
    input_sequence_normalized = scaler.transform(input_sequence)

    # Convert to tensor and add a batch dimension
    input_tensor = torch.tensor(input_sequence_normalized, dtype=torch.float32).unsqueeze(0)

    # Determine the farthest time in the test data for this ship
    ship_test_times = pd.to_datetime(ais_data_test[ais_data_test['vesselId'] == ship_id]['time'])
    farthest_time = ship_test_times.max()

    # Calculate how many steps are needed to reach the farthest time
    last_known_time = df['time'].iloc[-1]
    # Assume each model step corresponds to 20 minutes, so divide the total seconds by (20 * 60)
    total_steps_needed = int((farthest_time - last_known_time).total_seconds() // (20 * 60))

    # Predict all future steps up to the farthest time
    current_input = input_tensor.clone()

    for step in range(1, total_steps_needed + 1):
        with torch.no_grad():
            # Predict the next step
            next_position = loaded_GRU_model(current_input)

        # Store the prediction with its corresponding timestamp
        prediction_time = last_known_time + pd.Timedelta(minutes=20 * step)
        prediction_np = next_position.cpu().numpy()
        prediction_original_scale = scaler.inverse_transform(prediction_np.reshape(1, -1))
        
        # Add the prediction to the dictionary for the ship ID and timestamp
        ship_predictions[ship_id][prediction_time] = prediction_original_scale[0]

        # Update the input tensor by removing the oldest step and adding the predicted next step
        current_input = torch.cat((current_input[:, 1:, :], next_position.unsqueeze(0)), dim=1)

# Create a DataFrame to store the final results for each test point
predictions_list = []

# Iterate over the test data and extract the prediction from the stored dictionary
for idx, row in ais_data_test.iterrows():
    ship_id = row['vesselId']
    target_time = pd.to_datetime(row['time']).round('min')

    # The stored keys are pd.Timestamp objects, so look up with a Timestamp rather than a string
    if ship_id in ship_predictions:
        if target_time in ship_predictions[ship_id]:
            # Exact match found
            prediction = ship_predictions[ship_id][target_time]
        else:
            # No exact match, find the closest timestamp
            available_times = list(ship_predictions[ship_id].keys())
            closest_time = min(available_times, key=lambda t: abs(t - target_time))
            prediction = ship_predictions[ship_id][closest_time]

        predictions_list.append({
            'ship_id': ship_id,
            'time': target_time,
            'predicted_latitude': prediction[4],  # assuming columns are ['hour', 'minute', 'second', 'longitude', 'latitude', 'sog', 'cog']
            'predicted_longitude': prediction[3],
            'predicted_sog': prediction[5],
            'predicted_cog': prediction[6]
        })

# Convert predictions to a DataFrame
predictions_df = pd.DataFrame(predictions_list)

print(predictions_df.head())

                    ship_id                time  predicted_latitude  \
0  61e9f3aeb937134a3c4bfe3d 2024-05-08 00:03:00           31.613583   
1  61e9f473b937134a3c4c02df 2024-05-08 00:06:00           14.981131   
2  61e9f469b937134a3c4c029b 2024-05-08 00:10:00           38.514565   
3  61e9f45bb937134a3c4c0221 2024-05-08 00:11:00          -42.319794   
4  61e9f38eb937134a3c4bfd8d 2024-05-08 00:12:00           48.373402   

   predicted_longitude  predicted_sog  predicted_cog  
0           -87.975906       0.676197     195.341293  
1           120.511551       0.643126      24.003500  
2            10.919418      18.040033      88.664772  
3           171.528503       0.095447     200.180939  
4            -6.027080       1.329763     238.518478  
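
The closest-timestamp fallback in the cell above scans every stored prediction for each test row. Since the prediction times for a ship are generated in increasing order, a binary search would do the same job in logarithmic time. Below is a sketch using only the standard library; `nearest_time` is a hypothetical helper. It should also work with `pd.Timestamp` keys, since those subclass `datetime`, but that has not been tested here.

```python
from bisect import bisect_left
from datetime import datetime, timedelta
from typing import List


def nearest_time(sorted_times: List[datetime], target: datetime) -> datetime:
    """Return the element of sorted_times closest to target (ties go to the earlier one)."""
    if not sorted_times:
        raise ValueError("sorted_times must not be empty")
    i = bisect_left(sorted_times, target)
    if i == 0:
        return sorted_times[0]
    if i == len(sorted_times):
        return sorted_times[-1]
    before, after = sorted_times[i - 1], sorted_times[i]
    return before if target - before <= after - target else after


start = datetime(2024, 5, 8, 0, 0)
times = [start + timedelta(minutes=20 * k) for k in range(1, 4)]  # 00:20, 00:40, 01:00
print(nearest_time(times, datetime(2024, 5, 8, 0, 33)))  # 2024-05-08 00:40:00
```

The list of times per ship only needs to be built once, outside the loop over test rows.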


In [46]:
# .size counts elements (rows x columns); len(predictions_df) gives the number of rows
print(predictions_df.size)

310434


In [47]:
csv_data = predictions_df[['predicted_longitude', 'predicted_latitude']]

# Write to CSV file with a header
csv_file_path = 'submission.csv'

# Save the DataFrame to CSV
csv_data.to_csv(csv_file_path, index=True, index_label = 'ID', header=['longitude_predicted','latitude_predicted'])
