## 1. Important to note

### 1.1 Submission

Each group submits three things:
- Two selected predictions on Kaggle
- Two short notebooks that contain everything needed to reproduce the two selected Kaggle predictions
- A report (this document) that summarizes all steps in our group work


Need to remember:
- Begin all notebooks with full names, student IDs (the ones on the student cards) and the Kaggle team name
- The project deadline is 22:00 on 10 November
- The two short notebooks must be able to reproduce the Kaggle scores
- Each short notebook must take less than 12 hours to run
- Use clear section titles in the report so that it is easy to find all parts under 1.3 Possible deductions


### 1.2 Need to remember

- Include the Blackboard group number in the Kaggle team name

### 1.3 Possible deductions

- Late delivery: -30 points
- No exploratory data analysis: -3 points (four of the following are required: search domain knowledge, check if data is intuitive, understand how the data was generated, explore individual features, clean up features)
- Only one type of predictor used (does not apply to the short notebooks): -3 points
- No feature engineering: -3 points
- No model interpretation: -3 points

### 1.5 Tips and Tricks

- Choose a more generalized model for the second submission, even if it (maybe) performs worse on the public Kaggle leaderboard. The better of the two submissions is used to calculate the grade.
- Notebooks can store temporary results (e.g. after feature engineering) as files on disk
- Constant hyperparameters etc. are allowed in the short notebooks as long as the report shows how we obtained them (e.g. through hyperparameter tuning)
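
The tip about storing temporary results on disk could look like the sketch below. The helper name `cached`, the file name and the choice of `pickle` are assumptions made for illustration; any serialization format would work as well.

```python
import pickle
import tempfile
from pathlib import Path


def cached(path, compute):
    """Return the object stored at `path`, computing and saving it on first use.

    `cached` is a hypothetical helper, not part of the project code.
    """
    path = Path(path)
    if path.exists():
        with path.open("rb") as fh:
            return pickle.load(fh)
    result = compute()
    with path.open("wb") as fh:
        pickle.dump(result, fh)
    return result


# Usage: the expensive step runs only once; the second call reads the file.
calls = []


def expensive_feature_engineering():
    calls.append(1)
    return {"vessel_a": [1.0, 2.0, 3.0]}


cache_file = Path(tempfile.mkdtemp()) / "features.pkl"
first = cached(cache_file, expensive_feature_engineering)
second = cached(cache_file, expensive_feature_engineering)
```

A cache like this has to be deleted by hand whenever the feature engineering code changes, otherwise stale results are reused.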

### 1.6 Code/model related tips and tricks

List of all code/model-related tips and tricks:

- Extract meaningful features from the additional datasets
- Create a feature for whether the ship is moored or not
- Handle missing values suitably; there can be gaps in the data (e.g. fill them by interpolation)
- Tune to find the best hyperparameters
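
As a rough illustration of the moored feature and the gap handling in the list above, the sketch below works on a toy DataFrame. The column name `navstat`, the code `5` for "moored" (taken from the AIS navigational status convention) and both helper names are assumptions, not something the project code defines.

```python
import numpy as np
import pandas as pd

MOORED_NAVSTAT = 5  # assumption: AIS navigational status 5 means "moored"


def add_moored_flag(df, navstat_col="navstat"):
    """Add a boolean `is_moored` column (column names are assumptions)."""
    out = df.copy()
    out["is_moored"] = out[navstat_col] == MOORED_NAVSTAT
    return out


def fill_gaps(df, cols=("latitude", "longitude"), time_col="time"):
    """Fill missing values by interpolating linearly in time."""
    out = df.sort_values(time_col).set_index(time_col)
    out[list(cols)] = out[list(cols)].interpolate(method="time")
    return out.reset_index()


# Toy data: the middle position is missing and gets interpolated.
demo = pd.DataFrame({
    "time": pd.to_datetime(["2024-01-01 00:00", "2024-01-01 00:20", "2024-01-01 01:00"]),
    "latitude": [10.0, np.nan, 13.0],
    "longitude": [5.0, np.nan, 8.0],
    "navstat": [0, 5, 5],
})
demo = fill_gaps(add_moored_flag(demo))
```

Time-based interpolation weights the gap by elapsed time, so the missing row 20 minutes into a 60-minute gap lands one third of the way between its neighbours. Plain linear interpolation of coordinates is only a reasonable approximation over short gaps.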

## 2. Exploratory data analysis

### 2.1 Search Domain Knowledge

### 2.2 Check if data is intuitive

### 2.3 Understand how the data was generated

### 2.4 Explore individual features

### 2.5 Clean up features

## 3. Feature Engineering

## 4. Model interpretation

## 5. Predictors and Submissions

In [1]:
# IMPORTS
from collections import defaultdict

import numpy as np
import pandas as pd
import torch
import torch.nn as nn
import torch.optim as optim
from sklearn.preprocessing import MinMaxScaler
from torch.utils.data import DataLoader, Dataset

### 5.1 GRU model

#### 5.1.1 Version 1 - Submission 1 (701.553)

In [None]:
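# The training file is pipe-separated
ais_train_data_path = '../../Project materials/ais_train.csv'
ais_data_train = pd.read_csv(ais_train_data_path, sep='|')

# One DataFrame per vessel; copy each group so new columns can be added safely later
ship_train_groups = ais_data_train.groupby('vesselId')
ship_train_dataframes = {ship_id: group.copy() for ship_id, group in ship_train_groups}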

In [None]:
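all_timeseries = []
scaler = MinMaxScaler()

# Number of past time steps the model sees before predicting the next one
sequence_length = 5

for ship_id, df in ship_train_dataframes.items():
    # Split the timestamp into hour/minute/second features
    df['time'] = pd.to_datetime(df['time'])
    df['hour'] = df['time'].dt.hour
    df['minute'] = df['time'].dt.minute
    df['second'] = df['time'].dt.second

    features = df[['hour', 'minute', 'second', 'longitude', 'latitude', 'sog', 'cog']].values

    # The scaler is refitted for every ship, so each ship is scaled to its own [0, 1] range
    features_normalized = scaler.fit_transform(features)

    # Sliding windows: sequence_length input steps followed by one target step
    for i in range(len(features_normalized) - sequence_length):
        timeseries = features_normalized[i:i+sequence_length+1]
        all_timeseries.append(timeseries)

all_timeseries = np.array(all_timeseries)

# Inputs are the first sequence_length steps; the target is the final step of each window
X_data = all_timeseries[:, :-1, :]
Y_data = all_timeseries[:, -1, :]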
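
To make the shapes produced by the windowing in the cell above easier to check, here is a minimal sketch on a toy track. The array `track` and the `_demo` names are made up for illustration; only the slicing pattern matches the cell.

```python
import numpy as np

sequence_length = 5

# Toy track: 8 time steps with 2 features each (values are arbitrary).
track = np.arange(16, dtype=float).reshape(8, 2)

# Same windowing as in the cell above: each window holds 5 inputs plus 1 target.
windows = np.array([
    track[i:i + sequence_length + 1]
    for i in range(len(track) - sequence_length)
])

X_demo = windows[:, :-1, :]   # shape (3, 5, 2): model inputs
Y_demo = windows[:, -1, :]    # shape (3, 2): the step right after each input window

print(X_demo.shape, Y_demo.shape)
```

A track with `n` rows yields `n - sequence_length` windows, so ships with `sequence_length` rows or fewer contribute no training samples.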

In [None]:
class GRUnet(nn.Module):
    def __init__(self, input_size, hidden_size, output_size, num_layers):
        super().__init__()
        self.hidden_size = hidden_size
        self.num_layers = num_layers

        self.gru = nn.GRU(input_size, hidden_size, num_layers, batch_first=True)
        self.fc = nn.Linear(hidden_size, output_size)

    def forward(self, X):
        # X has shape (batch, sequence_length, input_size); start from a zero hidden state
        hidden_initialize = torch.zeros(self.num_layers, X.size(0), self.hidden_size).to(X.device)

        out, _ = self.gru(X, hidden_initialize)

        # Map the GRU output at the last time step to the next feature vector
        out = self.fc(out[:, -1, :])

        return out


# Model parameters:
input_size = 7     # hour, minute, second, longitude, latitude, sog, cog
hidden_size = 64   # initial guess, not tuned
output_size = 7    # the model predicts all seven features for the next step
num_layers = 2

GRU_model = GRUnet(input_size=input_size, hidden_size=hidden_size, output_size=output_size, num_layers=num_layers)


In [None]:
learning_rate = 0.001
num_epochs = 10
batch_size = 64

optimizer = optim.Adam(GRU_model.parameters(), lr=learning_rate)
loss_function = nn.MSELoss()

# Preprocess data: convert the NumPy windows to tensors and batch them
X_tensor = torch.tensor(X_data, dtype=torch.float32)
Y_tensor = torch.tensor(Y_data, dtype=torch.float32)

train_dataset = torch.utils.data.TensorDataset(X_tensor, Y_tensor)
data_loader = torch.utils.data.DataLoader(dataset=train_dataset, batch_size=batch_size, shuffle=True)

GRU_model.train()
for epoch in range(num_epochs):
    epoch_loss = 0.0
    for inputs, targets in data_loader:
        optimizer.zero_grad()

        outputs = GRU_model(inputs)
        loss = loss_function(outputs, targets)

        loss.backward()
        optimizer.step()

        epoch_loss += loss.item() * inputs.size(0)

    # Mean training loss per sample, printed to monitor convergence
    print(f"Epoch {epoch + 1}/{num_epochs}, loss: {epoch_loss / len(train_dataset):.6f}")

In [None]:
torch.save(GRU_model.state_dict(), "gru_model_test.pth")

In [None]:
loaded_GRU_model = GRUnet(input_size=input_size, hidden_size=hidden_size, output_size=output_size, num_layers=num_layers)
loaded_GRU_model.load_state_dict(torch.load("gru_model_test.pth"))

loaded_GRU_model.eval()

In [None]:
ais_test_data_path = '../../Project materials/ais_test.csv'
ais_data_test = pd.read_csv(ais_test_data_path)
unique_ship_ids = ais_data_test['vesselId'].unique()

In [None]:
# ship_predictions[ship_id][timestamp] -> predicted feature vector in original units
ship_predictions = defaultdict(dict)

for ship_id in unique_ship_ids:
    if ship_id not in ship_train_dataframes:
        print(f"No training data available for ship_id: {ship_id}")
        continue

    df = ship_train_dataframes[ship_id]

    df['time'] = pd.to_datetime(df['time'])
    df['hour'] = df['time'].dt.hour
    df['minute'] = df['time'].dt.minute
    df['second'] = df['time'].dt.second

    features = df[['hour', 'minute', 'second', 'longitude', 'latitude', 'sog', 'cog']].values
    if len(features) < sequence_length:
        print(f"Not enough historical data to predict for ship_id: {ship_id}")
        continue

    # Scale with a scaler fitted on this ship's full history, as during training
    input_sequence = features[-sequence_length:]
    scaler = MinMaxScaler().fit(features)
    input_sequence_normalized = scaler.transform(input_sequence)

    input_tensor = torch.tensor(input_sequence_normalized, dtype=torch.float32).unsqueeze(0)

    ship_test_times = pd.to_datetime(ais_data_test[ais_data_test['vesselId'] == ship_id]['time'])
    farthest_time = ship_test_times.max()

    # Roll forward in fixed 20-minute steps until the last requested test time
    last_known_time = df['time'].iloc[-1]
    total_steps_needed = int((farthest_time - last_known_time).total_seconds() // (20 * 60))

    current_input = input_tensor.clone()

    for step in range(1, total_steps_needed + 1):
        with torch.no_grad():
            next_position = loaded_GRU_model(current_input)

        prediction_time = last_known_time + pd.Timedelta(minutes=20 * step)
        prediction_np = next_position.cpu().numpy()
        prediction_original_scale = scaler.inverse_transform(prediction_np.reshape(1, -1))

        ship_predictions[ship_id][prediction_time] = prediction_original_scale[0]

        # Drop the oldest step and append the prediction along the time axis
        current_input = torch.cat((current_input[:, 1:, :], next_position.unsqueeze(1)), dim=1)

predictions_list = []

for idx, row in ais_data_test.iterrows():
    ship_id = row['vesselId']
    target_time = pd.to_datetime(row['time']).round('min')

    # Test rows for ships without any prediction are skipped
    if ship_id in ship_predictions:
        ship_times = ship_predictions[ship_id]

        # Keys are pd.Timestamp objects: use the exact time if it exists,
        # otherwise fall back to the closest predicted time
        if target_time in ship_times:
            prediction = ship_times[target_time]
        else:
            closest_time = min(ship_times, key=lambda t: abs(t - target_time))
            prediction = ship_times[closest_time]

        # Feature order: hour, minute, second, longitude, latitude, sog, cog
        predictions_list.append({
            'ship_id': ship_id,
            'time': target_time,
            'predicted_latitude': prediction[4],
            'predicted_longitude': prediction[3],
            'predicted_sog': prediction[5],
            'predicted_cog': prediction[6]
        })

predictions_df = pd.DataFrame(predictions_list)
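
The prediction loop above is an autoregressive rollout: each prediction is appended to the input window and the oldest step is dropped. The sketch below shows only that mechanism in NumPy. The function `rollout` and the toy `constant_velocity` "model" are stand-ins invented for this example, not the trained GRU.

```python
import numpy as np


def rollout(predict_next, window, n_steps):
    """Repeatedly predict one step ahead and feed the prediction back in.

    `predict_next` is a stand-in for the trained model (an assumption for
    this sketch): it maps a (seq_len, n_features) window to one
    (n_features,) row.
    """
    window = np.asarray(window, dtype=float).copy()
    predictions = []
    for _ in range(n_steps):
        nxt = predict_next(window)
        predictions.append(nxt)
        # Drop the oldest row and append the prediction as the newest row.
        window = np.vstack([window[1:], nxt])
    return np.array(predictions)


# Toy "model": continue the last observed change (constant velocity).
def constant_velocity(window):
    return window[-1] + (window[-1] - window[-2])


history = np.array([[0.0, 0.0], [1.0, 2.0], [2.0, 4.0]])
future = rollout(constant_velocity, history, n_steps=3)
```

Because every step consumes earlier predictions, errors can accumulate over long horizons, which is worth keeping in mind when the farthest test time is many steps away.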

In [None]:
# Keep only the predicted coordinates; the row index becomes the submission ID
csv_data = predictions_df[['predicted_longitude', 'predicted_latitude']]

csv_file_path = 'submission.csv'

# Write the submission with the columns ID, longitude_predicted, latitude_predicted
csv_data.to_csv(csv_file_path, index=True, index_label='ID', header=['longitude_predicted', 'latitude_predicted'])
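
Since the submission ID is just the row position, the file only lines up with the test set if no test row was dropped along the way. A small sanity check before uploading might look like the sketch below; `check_submission` is a hypothetical helper, and the expected column names are the ones written by the cell above.

```python
import pandas as pd


def check_submission(submission, n_test_rows):
    """Return a list of problems found in a submission DataFrame."""
    problems = []
    expected_cols = ["ID", "longitude_predicted", "latitude_predicted"]
    if list(submission.columns) != expected_cols:
        problems.append(f"unexpected columns: {list(submission.columns)}")
    if len(submission) != n_test_rows:
        problems.append(f"expected {n_test_rows} rows, found {len(submission)}")
    if submission.isna().any().any():
        problems.append("submission contains missing values")
    return problems


# Toy data: one complete submission and one that is a row short.
good = pd.DataFrame({
    "ID": [0, 1],
    "longitude_predicted": [10.1, 10.2],
    "latitude_predicted": [59.9, 60.0],
})
short = good.iloc[:1]

print(check_submission(good, n_test_rows=2))
print(check_submission(short, n_test_rows=2))
```

In the notebook this could be run as `check_submission(pd.read_csv(csv_file_path), len(ais_data_test))`.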
