## 1. Important to note

### 1.1 Submission

Each group submits three things:
- Two selected predictions on Kaggle
- Two short notebooks that contain everything needed to reproduce the two selected Kaggle predictions
- A report (this document) that summarizes all steps in our group work


Need to remember:
- Begin all notebooks with full names, student IDs (the ones on the student cards) and the Kaggle team name
- The project deadline is 22:00 on 10 November
- The two short notebooks must be able to reproduce the Kaggle scores
- The short notebooks must take less than 12 hours to run
- Use clear section titles in the report so that it is easy to find every part listed under 1.3 Possible deductions


### 1.2 Need to remember

- Include the Blackboard group number in the Kaggle team name

### 1.3 Possible deductions

- Late delivery: -30 points
- No exploratory data analysis: -3 points (Need to do four of: Search domain knowledge, Check if data is intuitive, Understand how the data was generated, Explore individual features, Clean up features)
- Only one type of predictor used (does not apply to the short notebooks): -3 points
- No feature engineering: -3 points
- No model interpretation: -3 points

### 1.5 Tips and Tricks

- Choose the second submission model to be a more generalized version, even if it (maybe) performs worse on the public Kaggle leaderboard. The best of the two selected submissions is used to calculate the grade.
- Notebooks can store temporary results (e.g. after feature engineering) as files on disk
- It is allowed to use constant hyperparameters etc. in the short notebooks, as long as the report shows how we obtained them (e.g. found by hyperparameter tuning)

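The last two tips can be combined in the short notebooks: cache expensive intermediate results on disk and hard-code hyperparameters that were tuned elsewhere. The following is only a minimal sketch; `load_or_compute`, `build_features`, the cache file name and the values in `BEST_PARAMS` are hypothetical and not taken from our actual notebooks.

```python
import pickle
import tempfile
from pathlib import Path

# Hypothetical constants: in a short notebook these would be the values
# found by the tuning that is documented in the report.
BEST_PARAMS = {"depth": 6, "learning_rate": 0.05}


def load_or_compute(cache_path, compute_fn):
    """Return the cached result if the file exists, otherwise compute and store it."""
    cache_path = Path(cache_path)
    if cache_path.exists():
        with cache_path.open("rb") as f:
            return pickle.load(f)
    result = compute_fn()
    with cache_path.open("wb") as f:
        pickle.dump(result, f)
    return result


def build_features():
    # Stand-in for an expensive feature-engineering step (hypothetical).
    return [{"vesselId": "a", "sog": 1.5}, {"vesselId": "b", "sog": 0.0}]


# Demo in a temporary directory so no project files are touched.
with tempfile.TemporaryDirectory() as tmp_dir:
    cache_file = Path(tmp_dir) / "features.pkl"
    first = load_or_compute(cache_file, build_features)   # computed and written
    second = load_or_compute(cache_file, build_features)  # read back from disk
    cache_hit_matches = first == second
```

If the cached file is used in a short notebook, the notebook still has to be able to regenerate it from the raw data, otherwise the Kaggle score cannot be reproduced.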
### 1.6 Code/model related tips and tricks

List of all code/model-related tips and tricks:

- Extract meaningful features from the additional datasets
- Create a feature for whether the ship is moored or not
- Missing values must be handled suitably; there can be holes in the data (e.g. fill them by interpolation)
- Tune to find the best hyperparameters

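Two of the tips above, the moored flag and interpolation of holes, can be sketched on a toy table. This is an illustration under assumptions, not our final feature code: the rows are made up, the column names `is_moored` and `sog_filled` are hypothetical, and we assume that `navstat == 5` means "moored" as in the AIS standard.

```python
import numpy as np
import pandas as pd

# Toy AIS rows (made-up values). navstat 5 is assumed to mean "moored".
toy = pd.DataFrame({
    "vesselId": ["a", "a", "a", "a", "b", "b", "b"],
    "navstat":  [0, 0, 5, 5, 5, 0, 0],
    "sog":      [10.0, np.nan, 0.0, 0.0, 0.0, np.nan, 8.0],
})

# Binary feature: 1 when the vessel reports itself as moored.
toy["is_moored"] = (toy["navstat"] == 5).astype(int)

# Fill holes by linear interpolation within each vessel, never across vessels.
toy["sog_filled"] = toy.groupby("vesselId")["sog"].transform(
    lambda s: s.interpolate(method="linear", limit_direction="both")
)
```

Grouping by `vesselId` before interpolating matters: without it, a gap at the start of one vessel's track would be filled from the previous vessel's last value.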
## 2. Exploratory data analysis

### 2.1 Search Domain Knowledge

- Vessels usually visit the same ports -> historic port visits can be used to infer the most likely destination port

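A minimal sketch of that idea, assuming we already have a list of (vessel, port) visits; the IDs below are made up and `most_likely_port` is a hypothetical helper, not something from the project materials:

```python
from collections import Counter, defaultdict

# Made-up (vesselId, portId) visit history; IDs are purely illustrative.
visits = [
    ("vessel_a", "port_1"), ("vessel_a", "port_2"), ("vessel_a", "port_1"),
    ("vessel_b", "port_3"),
]

# Count how often each vessel has visited each port.
port_counts = defaultdict(Counter)
for vessel_id, port_id in visits:
    port_counts[vessel_id][port_id] += 1


def most_likely_port(vessel_id):
    """Return the most frequently visited port, or None for unseen vessels."""
    counts = port_counts.get(vessel_id)
    if not counts:
        return None
    return counts.most_common(1)[0][0]
```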
TODO:
- Research what all navstat values mean

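As a starting point for this TODO, the navigational status codes of the AIS standard can be kept in a small lookup table. The meanings below are written from our understanding of the standard and still need to be checked against the dataset documentation; `describe_navstat` and `STATIONARY_CODES` are hypothetical names, and treating codes 1, 5 and 6 as "stationary" is our own assumption.

```python
# Navigational status codes as we understand them from the AIS standard.
# Codes that are not listed are reserved or for regional use.
NAVSTAT_MEANING = {
    0: "under way using engine",
    1: "at anchor",
    2: "not under command",
    3: "restricted manoeuvrability",
    4: "constrained by draught",
    5: "moored",
    6: "aground",
    7: "engaged in fishing",
    8: "under way sailing",
    14: "AIS-SART / MOB / EPIRB active",
    15: "undefined (default)",
}

# Assumption: statuses where the vessel is not expected to move.
STATIONARY_CODES = {1, 5, 6}


def describe_navstat(code):
    """Translate a navstat code into a readable label."""
    return NAVSTAT_MEANING.get(int(code), "reserved or regional use")
```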
### 2.2 Check if data is intuitive

### 2.3 Understand how the data was generated

### 2.4 Explore individual features

### 2.5 Clean up features

## 3. Feature Engineering

## 4. Model interpretation

## 5. Predictors and Submissions

In [1]:
# IMPORTS
import numpy as np
import pandas as pd
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, Dataset
from sklearn.preprocessing import MinMaxScaler
from collections import defaultdict

### 5.1 GRU model

#### 5.1.1 Version 1 - Submission 1 (701.553)

In [None]:
ais_train_data_path = '../../Project materials/ais_train.csv'
ais_data_train = pd.read_csv(ais_train_data_path, sep='|')

ship_train_groups = ais_data_train.groupby('vesselId')
ship_train_dataframes = {ship_id: group for ship_id, group in ship_train_groups}

In [None]:
all_timeseries = []
scaler = MinMaxScaler()

sequence_length = 5


for ship_id, df in ship_train_dataframes.items():
    df['time'] = pd.to_datetime(df['time'])

    df['hour'] = df['time'].dt.hour
    df['minute'] = df['time'].dt.minute
    df['second'] = df['time'].dt.second

    features = df[['hour', 'minute', 'second', 'longitude', 'latitude', 'sog', 'cog']].values

    # The scaler is re-fitted for every ship, so each ship is scaled to its own min/max range
    features_normalized = scaler.fit_transform(features)

    for i in range(len(features_normalized) - sequence_length):
        timeseries = features_normalized[i:i+sequence_length+1]
        all_timeseries.append(timeseries)


all_timeseries = np.array(all_timeseries)

X_data = all_timeseries[:, :-1, :]
Y_data = all_timeseries[:, -1, :]

In [None]:
class GRUnet(nn.Module):
    def __init__(self, input_size, hidden_size, output_size, num_layers):
        super(GRUnet, self).__init__()
        self.hidden_size = hidden_size
        self.num_layers = num_layers

        self.gru = nn.GRU(input_size, hidden_size, num_layers, batch_first = True)
        self.fc = nn.Linear(hidden_size, output_size)

    def forward(self, X):

        hidden_initialize = torch.zeros(self.num_layers, X.size(0), self.hidden_size).to(X.device)

        out, _ = self.gru(X, hidden_initialize)

        out = self.fc(out[:, -1, :])

        return out
    

#Model parameters:
input_size = 7
hidden_size = 64   #Random guess on what is best
output_size = 7     
num_layers = 2 

GRU_model = GRUnet(input_size=input_size, hidden_size=hidden_size, output_size=output_size, num_layers=num_layers)


In [None]:
learning_rate = 0.001
num_epochs = 10
batch_size = 64


optimizer = optim.Adam(GRU_model.parameters(), lr = learning_rate)
loss_function = nn.MSELoss()


#Preprocess data:

X_tensor = torch.tensor(X_data, dtype=torch.float32)
Y_tensor = torch.tensor(Y_data, dtype=torch.float32)

train_dataset = torch.utils.data.TensorDataset(X_tensor, Y_tensor)
data_loader = torch.utils.data.DataLoader(dataset=train_dataset, batch_size=batch_size, shuffle = True)

for epoch in range(num_epochs):
    for inputs, targets in data_loader:

        optimizer.zero_grad()

        outputs = GRU_model(inputs)

        loss = loss_function(outputs, targets)

        loss.backward()
        optimizer.step()

In [None]:
torch.save(GRU_model.state_dict(), "gru_model_test.pth")

In [None]:
loaded_GRU_model = GRUnet(input_size=input_size, hidden_size=hidden_size, output_size=output_size, num_layers=num_layers)
loaded_GRU_model.load_state_dict(torch.load("gru_model_test.pth"))

loaded_GRU_model.eval()

In [None]:
ais_test_data_path = '../../Project materials/ais_test.csv'
ais_data_test = pd.read_csv(ais_test_data_path)
unique_ship_ids = ais_data_test['vesselId'].unique()

In [None]:
ship_predictions = defaultdict(dict)

for ship_id in unique_ship_ids:
    if ship_id not in ship_train_dataframes:
        print(f"No training data available for ship_id: {ship_id}")
        continue

    df = ship_train_dataframes[ship_id]
    
    df['time'] = pd.to_datetime(df['time'])
    
    df['hour'] = df['time'].dt.hour
    df['minute'] = df['time'].dt.minute
    df['second'] = df['time'].dt.second

    features = df[['hour', 'minute', 'second', 'longitude', 'latitude', 'sog', 'cog']].values
    if len(features) < sequence_length:
        print(f"Not enough historical data to predict for ship_id: {ship_id}")
        continue

    input_sequence = features[-sequence_length:]
    scaler = MinMaxScaler().fit(features)
    input_sequence_normalized = scaler.transform(input_sequence)

    input_tensor = torch.tensor(input_sequence_normalized, dtype=torch.float32).unsqueeze(0)

    ship_test_times = pd.to_datetime(ais_data_test[ais_data_test['vesselId'] == ship_id]['time'])
    farthest_time = ship_test_times.max()

    last_known_time = df['time'].iloc[-1]
    total_steps_needed = int((farthest_time - last_known_time).total_seconds() // (20 * 60))

    current_input = input_tensor.clone()

    for step in range(1, total_steps_needed + 1):
        with torch.no_grad():
            next_position = loaded_GRU_model(current_input)

        prediction_time = last_known_time + pd.Timedelta(minutes=20 * step)
        prediction_np = next_position.cpu().numpy()
        prediction_original_scale = scaler.inverse_transform(prediction_np.reshape(1, -1))
        
        ship_predictions[ship_id][prediction_time] = prediction_original_scale[0]

        current_input = torch.cat((current_input[:, 1:, :], next_position.unsqueeze(0)), dim=1)

predictions_list = []

for idx, row in ais_data_test.iterrows():
    ship_id = row['vesselId']
    target_time = pd.to_datetime(row['time']).round('min')

    if ship_id in ship_predictions:
        # The keys of ship_predictions[ship_id] are Timestamps, so look up with the Timestamp itself
        if target_time in ship_predictions[ship_id]:
            prediction = ship_predictions[ship_id][target_time]
        else:
            # No exact match: fall back to the closest predicted time
            available_times = list(ship_predictions[ship_id].keys())
            closest_time = min(available_times, key=lambda x: abs(pd.Timestamp(x) - target_time))
            prediction = ship_predictions[ship_id][closest_time]

        predictions_list.append({
            'ship_id': ship_id,
            'time': target_time,
            'predicted_latitude': prediction[4],  
            'predicted_longitude': prediction[3],
            'predicted_sog': prediction[5],
            'predicted_cog': prediction[6]
        })

predictions_df = pd.DataFrame(predictions_list)

In [None]:
csv_data = predictions_df[['predicted_longitude', 'predicted_latitude']]

csv_file_path = 'submission.csv'

csv_data.to_csv(csv_file_path, index=True, index_label = 'ID', header=['longitude_predicted','latitude_predicted'])

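The windowing step that builds `X_data` and `Y_data` in this version can be illustrated on a toy array. This is a sketch with a hypothetical helper `make_windows` and made-up numbers; it mirrors the loop over `features_normalized` but is not the code that produced the submission.

```python
import numpy as np


def make_windows(features, sequence_length):
    """Split a (T, F) array into inputs (N, sequence_length, F) and targets (N, F)."""
    windows = [
        features[i:i + sequence_length + 1]
        for i in range(len(features) - sequence_length)
    ]
    if not windows:
        # Too few rows for even one window (fewer than sequence_length + 1).
        n_features = features.shape[1]
        return np.empty((0, sequence_length, n_features)), np.empty((0, n_features))
    windows = np.stack(windows)
    # First sequence_length steps are the input, the final step is the target.
    return windows[:, :-1, :], windows[:, -1, :]


toy = np.arange(16, dtype=float).reshape(8, 2)   # 8 time steps, 2 features
X_toy, Y_toy = make_windows(toy, sequence_length=5)
```

With 8 time steps and `sequence_length = 5` this gives 3 windows, and the first target is row 5 of the toy array. A ship with fewer than 6 rows contributes no training windows at all.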

#### 5.1.2 Version 2 - Submission 2 (749.9)

In [None]:
ais_train_data_path = '../../Project materials/ais_train.csv'
#ports_data_path = '../../Project materials/ports.csv'
vessels_data_path = '../../Project materials/vessels.csv'


ais_data_train = pd.read_csv(ais_train_data_path, sep='|')
#ports = pd.read_csv(ports_data_path, sep='|')
vessels = pd.read_csv(vessels_data_path, sep='|')

ship_train_ais_groups = ais_data_train.groupby('vesselId')
ship_train_dataframes = {ship_id: group for ship_id, group in ship_train_ais_groups}


vessels = vessels.set_index('vesselId')
#ports = ports.set_index('portId')


#Handle NAN values:
columns_to_fill = ['GT', 'length', 'breadth', 'enginePower']

for column in columns_to_fill:
    median_value = vessels[column].median()  # Calculate the median value of the column
    vessels[column] = vessels[column].fillna(median_value)

In [None]:
# Define scalers
time_series_scaler = MinMaxScaler()
vessel_features_scaler = MinMaxScaler()

sequence_length = 5

# Collect all vessel features for scaling
all_vessel_features = []
all_features = []

for ship_id, df in ship_train_dataframes.items():
    if ship_id in vessels.index:
        vessel_features = vessels.loc[ship_id][['length', 'breadth', 'enginePower', 'GT']].values
        all_vessel_features.append(vessel_features)

    # Collect time-series features for scaling
    df['time'] = pd.to_datetime(df['time'])
    #df['etaRaw'] = df['etaRaw'].apply(lambda x: f"2024-{x}" if pd.notna(x) and isinstance(x, str) else np.nan)
    #df['etaRaw'] = pd.to_datetime(df['etaRaw'], format='%Y-%m-%d %H:%M', errors='coerce')
    df['hour'] = df['time'].dt.hour
    df['minute'] = df['time'].dt.minute
    df['second'] = df['time'].dt.second
    #df['time_to_eta'] = (df['etaRaw'] - df['time']).dt.total_seconds() / 3600

    for idx, row in df.iterrows():
        ais_features = [
            row['hour'], row['minute'], row['second'], 
            row['longitude'], row['latitude'], row['sog'], 
            row['cog'], row['rot'], row['heading'], row['navstat']
        ]
        all_features.append(ais_features)

# Convert collected features to numpy arrays and fit the scalers
all_vessel_features = np.array(all_vessel_features, dtype=np.float32)
all_features = np.array(all_features, dtype=np.float32)

vessel_features_scaler.fit(all_vessel_features)
time_series_scaler.fit(all_features)

# Transform each ship's data consistently using the fitted scalers
all_timeseries = []
vessel_features_list = []

for ship_id, df in ship_train_dataframes.items():
    if ship_id in vessels.index:
        # Extract and scale vessel-specific features
        vessel_features = vessels.loc[ship_id][['length', 'breadth', 'enginePower', 'GT']].values.reshape(1, -1)
        vessel_features = vessel_features_scaler.transform(vessel_features).flatten()
    else:
        continue

    # Extract and transform time-series features using the same fitted scaler
    df['time'] = pd.to_datetime(df['time'])
    # df['etaRaw'] = df['etaRaw'].apply(lambda x: f"2024-{x}" if pd.notna(x) and isinstance(x, str) else np.nan)
    # df['etaRaw'] = pd.to_datetime(df['etaRaw'], format='%Y-%m-%d %H:%M', errors='coerce')
    df['hour'] = df['time'].dt.hour
    df['minute'] = df['time'].dt.minute
    df['second'] = df['time'].dt.second
    #df['time_to_eta'] = (df['etaRaw'] - df['time']).dt.total_seconds() / 3600

    features_list = []
    for idx, row in df.iterrows():
        ais_features = [
            row['hour'], row['minute'], row['second'], 
            row['longitude'], row['latitude'], row['sog'], 
            row['cog'], row['rot'], row['heading'], row['navstat']
        ]
        features_list.append(ais_features)

    # Convert to numpy array and transform using the scaler
    features = np.array(features_list, dtype=np.float32)
    features_normalized = time_series_scaler.transform(features)

    # Create time-series data for training
    for i in range(len(features_normalized) - sequence_length):
        timeseries = features_normalized[i:i + sequence_length + 1]
        all_timeseries.append(timeseries)

        # Store vessel-specific features for each sequence
        vessel_features_list.append(vessel_features)

# Convert lists to numpy arrays
all_timeseries = np.array(all_timeseries, dtype=np.float32)
vessel_features_list = np.array(vessel_features_list, dtype=np.float32)

# Split time-series into X (input sequences) and Y (target values)
X_data = all_timeseries[:, :-1, :]  # Shape: (num_samples, sequence_length, num_features)
Y_data = all_timeseries[:, -1, :]  # Shape: (num_samples, num_features)


In [None]:
class GRUnetExtended(nn.Module):
    def __init__(self, input_size, hidden_size, output_size, num_layers, non_timestep_features_size):
        super(GRUnetExtended, self).__init__()
        self.hidden_size = hidden_size
        self.num_layers = num_layers

        self.gru = nn.GRU(input_size, hidden_size, num_layers, batch_first = True)
        self.fc = nn.Linear(hidden_size + non_timestep_features_size, output_size)

    def forward(self, X, non_timestep_features):

        hidden_initialize = torch.zeros(self.num_layers, X.size(0), self.hidden_size).to(X.device)

        out, _ = self.gru(X, hidden_initialize)

        out = out[:, -1, :]

        combined = torch.cat((out, non_timestep_features), dim=1)

        out = self.fc(combined)

        out = torch.tanh(out)

        return out
    



#Model parameters:
input_size = 10
hidden_size = 64   #Random guess on what is best
output_size = 10     
num_layers = 2 
non_timestep_features_size = 4

extended_GRU_model = GRUnetExtended(input_size=input_size, hidden_size=hidden_size, output_size=output_size, num_layers=num_layers, non_timestep_features_size=non_timestep_features_size)


In [None]:
class ShipDataset(Dataset):
    def __init__(self, X_data, vessel_features_list, Y_data):
        self.X_data = torch.tensor(X_data, dtype=torch.float32)
        self.vessel_features = torch.tensor(vessel_features_list, dtype=torch.float32)
        self.Y_data = torch.tensor(Y_data, dtype=torch.float32)

    def __len__(self):
        return len(self.X_data)

    def __getitem__(self, idx):
        return {
            'time_series': self.X_data[idx],
            'vessel_features': self.vessel_features[idx],
            'target': self.Y_data[idx]
        }

In [None]:
all_timeseries = np.array(all_timeseries, dtype=np.float32)
vessel_features_list = np.array(vessel_features_list, dtype=np.float32)
Y_data = np.array(Y_data, dtype=np.float32)



dataset = ShipDataset(X_data, vessel_features_list, Y_data)
train_loader = DataLoader(dataset, batch_size=64, shuffle=True)

criterion = nn.MSELoss()
optimizer = optim.Adam(extended_GRU_model.parameters(), lr=0.0001)

# Training the model
num_epochs = 10
for epoch in range(num_epochs):
    extended_GRU_model.train()
    running_loss = 0.0
    for batch in train_loader:
        # Extract the features from the batch
        time_series = batch['time_series']
        vessel_features = batch['vessel_features']
        target = batch['target']


        # Forward pass
        outputs = extended_GRU_model(time_series, vessel_features)

        # Calculate loss
        loss = criterion(outputs, target)

        # Backward pass and optimization
        optimizer.zero_grad()
        loss.backward()


        # Check the gradient norm of every parameter and warn about unusually large ones
        for name, param in extended_GRU_model.named_parameters():
            if param.grad is not None:
                grad_norm = param.grad.norm()
        
                if grad_norm > 10:  # Threshold for detecting unusually large gradients
                    print(f"Warning: Unusually large gradient detected for {name}: {grad_norm}")


        #torch.nn.utils.clip_grad_norm_(extended_GRU_model.parameters(), max_norm=1.0)

        optimizer.step()

        # Track the loss
        running_loss += loss.item()

    # Print the average loss for each epoch
    print(f"Epoch [{epoch+1}/{num_epochs}], Loss: {running_loss/len(train_loader):.4f}")


In [None]:
ais_test_data_path = '../../Project materials/ais_test.csv'
ais_data_test = pd.read_csv(ais_test_data_path)
unique_ship_ids = ais_data_test['vesselId'].unique()

In [None]:
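# Dictionary to store predictions for each ship
ship_predictions = defaultdict(dict)

# This version has no separate save/load cell, so the trained model is reused directly in evaluation mode
loaded_extended_GRU_model = extended_GRU_model
loaded_extended_GRU_model.eval()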

for ship_id in unique_ship_ids:
    if ship_id not in ship_train_dataframes:
        print(f"No training data available for ship_id: {ship_id}")
        continue

    df = ship_train_dataframes[ship_id]

    # Convert time column to datetime
    df['time'] = pd.to_datetime(df['time'])

    # Extract time features
    df['hour'] = df['time'].dt.hour
    df['minute'] = df['time'].dt.minute
    df['second'] = df['time'].dt.second

    # Calculate `time_to_eta`
    # df['etaRaw'] = df['etaRaw'].apply(lambda x: f"2024-{x}" if pd.notna(x) else x)
    # df['etaRaw'] = pd.to_datetime(df['etaRaw'], format='%Y-%m-%d %H:%M', errors='coerce')
    # df['time_to_eta'] = (df['etaRaw'] - df['time']).dt.total_seconds() / 3600  # Time to ETA in hours

    # # Extract schedule-specific features (e.g., port latitude and longitude)
    # df['port_lat'] = df['portId'].apply(lambda x: ports.loc[x]['latitude'] if pd.notna(x) and x in ports.index else np.nan)
    # df['port_lon'] = df['portId'].apply(lambda x: ports.loc[x]['longitude'] if pd.notna(x) and x in ports.index else np.nan)

    # Extract features used during training
    features = df[['hour', 'minute', 'second', 'longitude', 'latitude', 'sog', 'cog', 'rot', 'heading', 'navstat']]

    # Handle missing values with different strategies based on feature context using .loc[]
    # features.loc[:, 'longitude'] = features['longitude'].fillna(features['longitude'].mean())
    # features.loc[:, 'latitude'] = features['latitude'].fillna(features['latitude'].mean())
    # features.loc[:, 'sog'] = features['sog'].fillna(features['sog'].median())
    # features.loc[:, 'cog'] = features['cog'].fillna(features['cog'].median())
    # features.loc[:, 'time_to_eta'] = features['time_to_eta'].fillna(features['time_to_eta'].median())
    # features.loc[:, ['port_lat', 'port_lon']] = features[['port_lat', 'port_lon']].fillna(features[['port_lat', 'port_lon']].median())

    # # Convert features to numpy array after handling NaN values
    features = features.values

    if len(features) < sequence_length:
        print(f"Not enough historical data to predict for ship_id: {ship_id}")
        continue

    # Use the already fitted time-series scaler to transform features
    input_sequence_normalized = time_series_scaler.transform(features[-sequence_length:])
    input_tensor = torch.tensor(input_sequence_normalized, dtype=torch.float32).unsqueeze(0)  # Shape: (1, sequence_length, num_features)

    # Extract and normalize vessel-specific features using the already fitted scaler
    if ship_id in vessels.index:
        vessel_features = vessels.loc[ship_id][['length', 'breadth', 'enginePower', 'GT']].values.reshape(1, -1)
        vessel_features_normalized = vessel_features_scaler.transform(vessel_features).flatten()
    else:
        print(f"No vessel-specific features available for ship_id: {ship_id}")
        continue

    vessel_features_tensor = torch.tensor(vessel_features_normalized, dtype=torch.float32).unsqueeze(0)  # Shape: (1, vessel_feature_size)

    # Get the farthest prediction time needed from test data
    ship_test_times = pd.to_datetime(ais_data_test[ais_data_test['vesselId'] == ship_id]['time'])
    farthest_time = ship_test_times.max()

    # Last known time in the training data
    last_known_time = df['time'].iloc[-1]
    total_steps_needed = int((farthest_time - last_known_time).total_seconds() // (20 * 60))  # Time difference in steps of 20 minutes

    # Make predictions recursively for the required number of steps
    current_input = input_tensor.clone()

    for step in range(1, total_steps_needed + 1):
        with torch.no_grad():
            # Make the prediction using the model
            next_position = loaded_extended_GRU_model(current_input, vessel_features_tensor)

        # Prediction time for the next step
        prediction_time = last_known_time + pd.Timedelta(minutes=20 * step)

        # Convert the prediction back to the original scale for storing
        prediction_np = next_position.cpu().numpy()
        prediction_original_scale = time_series_scaler.inverse_transform(prediction_np.reshape(1, -1))

        # Store the prediction
        ship_predictions[ship_id][prediction_time] = prediction_original_scale[0]

        # Re-normalize the prediction to feed it back into the model
        prediction_normalized = time_series_scaler.transform(prediction_original_scale)

        # Update the input tensor by removing the oldest time step and appending the new normalized prediction
        next_position_tensor = torch.tensor(prediction_normalized, dtype=torch.float32).unsqueeze(0)
        current_input = torch.cat((current_input[:, 1:, :], next_position_tensor), dim=1)


# Create a list to store the final predictions for the test data
predictions_list = []

for idx, row in ais_data_test.iterrows():
    ship_id = row['vesselId']
    target_time = pd.to_datetime(row['time']).round('min')

    if ship_id in ship_predictions:
        if target_time in ship_predictions[ship_id]:
            prediction = ship_predictions[ship_id][target_time]
        else:
            # If the exact target time is not found, find the closest prediction time
            available_times = list(ship_predictions[ship_id].keys())
            closest_time = min(available_times, key=lambda x: abs(pd.Timestamp(x) - target_time))
            prediction = ship_predictions[ship_id][closest_time]

        predictions_list.append({
            'ship_id': ship_id,
            'time': target_time,
            'predicted_latitude': prediction[4],  # Assuming latitude is at index 4
            'predicted_longitude': prediction[3],  # Assuming longitude is at index 3
            'predicted_sog': prediction[5],  # Assuming speed over ground (sog) is at index 5
            'predicted_cog': prediction[6]  # Assuming course over ground (cog) is at index 6
        })

# Convert the predictions list to a DataFrame
predictions_df = pd.DataFrame(predictions_list)


In [None]:
csv_data = predictions_df[['predicted_longitude', 'predicted_latitude']]

# Write to CSV file with a header
csv_file_path = 'submission2.csv'

# Save the DataFrame to CSV
csv_data.to_csv(csv_file_path, index=True, index_label = 'ID', header=['longitude_predicted','latitude_predicted'])

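Both GRU versions map every test row to the closest predicted time by scanning all predicted times with `min(...)`. Because the predicted times are generated in increasing 20-minute steps, the same lookup could be done with a binary search. The sketch below uses a hypothetical helper `nearest_time` and made-up dates from the standard library; we expect `pd.Timestamp` keys to behave the same way, but have not tested that here.

```python
from bisect import bisect_left
from datetime import datetime, timedelta


def nearest_time(sorted_times, target):
    """Return the element of sorted_times closest to target (earlier one on ties)."""
    if not sorted_times:
        raise ValueError("sorted_times must not be empty")
    pos = bisect_left(sorted_times, target)
    if pos == 0:
        return sorted_times[0]
    if pos == len(sorted_times):
        return sorted_times[-1]
    before, after = sorted_times[pos - 1], sorted_times[pos]
    return before if target - before <= after - target else after


# Made-up prediction grid: 20-minute steps after a last known time.
start = datetime(2024, 5, 8, 0, 0)
grid = [start + timedelta(minutes=20 * k) for k in range(1, 4)]  # 00:20, 00:40, 01:00
hit = nearest_time(grid, datetime(2024, 5, 8, 0, 47))
```

Here the target 00:47 is 7 minutes after 00:40 and 13 minutes before 01:00, so the earlier grid point is chosen. Targets before the first or after the last prediction fall back to the nearest end of the grid.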
### 5.2 CatBoost with RandomizedSearchCV

#### 5.2.1 Version 1 - Submission 2 (686)

In [None]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import RandomizedSearchCV
from catboost import CatBoostRegressor
from sklearn.metrics import mean_squared_error
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, Dataset
from collections import defaultdict

In [None]:
ais_data_train = pd.read_csv('ais_train.csv', sep='|')

ship_train_groups = ais_data_train.groupby('vesselId')
ship_train_dataframes = {ship_id: group for ship_id, group in ship_train_groups}

all_timeseries = []
scaler = MinMaxScaler()
scaler_y = MinMaxScaler()
raw_targets = []

sequence_length = 5

for ship_id, df in ship_train_dataframes.items():
    df['time'] = pd.to_datetime(df['time'])

    df['hour'] = df['time'].dt.hour
    df['minute'] = df['time'].dt.minute
    df['second'] = df['time'].dt.second

    features = df[['hour', 'minute', 'second', 'longitude', 'latitude', 'sog', 'cog']].values

    for i in range(sequence_length, len(features)):
        raw_targets.append(features[i, [3, 4, 5, 6]])

    features_normalized = scaler.fit_transform(features)

    for i in range(len(features_normalized) - sequence_length):
        timeseries = features_normalized[i:i+sequence_length+1]
        all_timeseries.append(timeseries)

all_timeseries = np.array(all_timeseries)
raw_targets = np.array(raw_targets)

X_data = all_timeseries[:, :-1, :].reshape(-1, sequence_length * 7)
Y_data = raw_targets   # Targets: the next time step's longitude, latitude, sog and cog (unscaled)

scaler_y.fit(Y_data)
Y_data_normalized = scaler_y.transform(Y_data)

X_data = X_data.astype('float32')
Y_data_normalized = Y_data_normalized.astype('float32')
print(all_timeseries.shape)
print(X_data.shape)
print(Y_data_normalized.shape)

In [None]:
# Initialize the CatBoost model
catboost_model = CatBoostRegressor(
    loss_function='MultiRMSE',
    verbose=0
)
# Define the hyperparameter grid for RandomizedSearchCV
param_dist = {
    'depth': [4, 6],  # Candidate tree depths
    'learning_rate': [0.01, 0.05],  # Candidate learning rates
    'iterations': [50, 100],  # Candidate numbers of boosting iterations
    'l2_leaf_reg': [3, 5]
}

# Run RandomizedSearchCV for hyperparameter tuning
random_search = RandomizedSearchCV(
    estimator=catboost_model,
    param_distributions=param_dist,
    n_iter=5,  # Number of parameter combinations to try
    scoring='neg_mean_squared_error',
    cv=3,  # Cross-validation splits
    verbose=2,
    random_state=42,
    n_jobs=1
)

# Fit RandomizedSearchCV on the data
random_search.fit(X_data, Y_data_normalized)

# Best model from RandomizedSearchCV
best_catboost_model = random_search.best_estimator_



In [None]:
# Load the test data
ais_data_test = pd.read_csv('ais_test.csv', sep=',')

unique_ship_ids = ais_data_test['vesselId'].unique()

ship_predictions = {}

# Predict for each unique ship_id
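for ship_id in unique_ship_ids:
    # Skip ships without training data instead of failing with a KeyError
    if ship_id not in ship_train_dataframes:
        print(f"No training data available for ship_id: {ship_id}")
        continue

    df = ship_train_dataframes[ship_id]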
    
    # Ensure 'time' is in datetime format
    df['time'] = pd.to_datetime(df['time'])
    
    # Extract the last known sequence for this ship
    df['hour'] = df['time'].dt.hour
    df['minute'] = df['time'].dt.minute
    df['second'] = df['time'].dt.second

    features = df[['hour', 'minute', 'second', 'longitude', 'latitude', 'sog', 'cog']].values
    if len(features) < sequence_length:
        print(f"Not enough historical data to predict for ship_id: {ship_id}")
        continue
    

    # Prepare the input sequence
    # Note: `scaler` here is the one last fitted in the training loop, i.e. on the final ship's data
    input_sequence = features[-sequence_length:]
    input_sequence_normalized = scaler.transform(input_sequence)
    input_sequence_flattened = input_sequence_normalized.reshape(1, -1)

    # Predict the next step using the best CatBoost model
    prediction = best_catboost_model.predict(input_sequence_flattened)
    prediction_original_scale = scaler_y.inverse_transform(prediction.reshape(1, -1))

    # Store the prediction
    ship_predictions[ship_id] = prediction_original_scale[0]


In [None]:
predictions_list = []

for idx, row in ais_data_test.iterrows():
    ship_id = row['vesselId']

    if ship_id in ship_predictions:
        prediction = ship_predictions[ship_id]
        predictions_list.append({
            'ship_id': ship_id,
            'time': row['time'],
            'predicted_longitude': prediction[0],  # 'longitude' is the first value
            'predicted_latitude': prediction[1],   # 'latitude' is the second value
            'predicted_sog': prediction[2],        # 'sog' is the third value
            'predicted_cog': prediction[3]         # 'cog' is the fourth value
        })

# Convert predictions to a DataFrame
predictions_df = pd.DataFrame(predictions_list)

print(predictions_df.head())

# Write to CSV file with a header
csv_data = predictions_df[['predicted_longitude', 'predicted_latitude']]

csv_file_path = 'submission_catboost.csv'
csv_data.to_csv(csv_file_path, index=True, index_label='ID', header=['longitude_predicted', 'latitude_predicted'])

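The CatBoost input flattens each window of 5 time steps with 7 features into 35 columns. To interpret the model later (e.g. feature importances), it helps to know which column belongs to which time step and feature. A small sketch on made-up numbers; `flat_index` is a hypothetical helper, and it assumes the default row-major order of `reshape`:

```python
import numpy as np

sequence_length, n_features = 5, 7

# Two made-up windows of shape (sequence_length, n_features).
windows = np.arange(2 * sequence_length * n_features, dtype=float).reshape(
    2, sequence_length, n_features
)

# Same flattening as for the CatBoost input: one row per window.
flat = windows.reshape(-1, sequence_length * n_features)


def flat_index(step, feature, n_features=n_features):
    """Column in the flattened row that holds `feature` at time step `step`."""
    return step * n_features + feature


# Column of the most recent longitude (step 4, feature index 3 in our feature list).
last_longitude_column = flat_index(4, 3)
```

With this layout the first 7 columns hold the oldest time step and the last 7 columns the most recent one.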

### 5.3 LSTM (800)

In [None]:
ais_data_train = pd.read_csv('ais_train.csv', sep='|')

ship_train_groups = ais_data_train.groupby('vesselId')
ship_train_dataframes = {ship_id: group for ship_id, group in ship_train_groups}

all_timeseries = []
scaler = MinMaxScaler()

sequence_length = 5


for ship_id, df in ship_train_dataframes.items():
    df['time'] = pd.to_datetime(df['time'])

    df['hour'] = df['time'].dt.hour
    df['minute'] = df['time'].dt.minute
    df['second'] = df['time'].dt.second

    features = df[['hour', 'minute', 'second', 'longitude', 'latitude', 'sog', 'cog']].values

    features_normalized = scaler.fit_transform(features)

    for i in range(len(features_normalized) - sequence_length):
        timeseries = features_normalized[i:i+sequence_length+1]
        all_timeseries.append(timeseries)


all_timeseries = np.array(all_timeseries)

X_data = all_timeseries[:, :-1, :]
Y_data = all_timeseries[:, -1, :]

print(all_timeseries.shape)
print(X_data.shape)
print(Y_data.shape)

In [None]:

# Define the LSTM model
class LSTMnet(nn.Module):
    def __init__(self, input_size, hidden_size, output_size, num_layers):
        super(LSTMnet, self).__init__()
        self.hidden_size = hidden_size
        self.num_layers = num_layers

        # Define the LSTM layers
        self.lstm = nn.LSTM(input_size, hidden_size, num_layers, batch_first=True)

        # Fully connected output layer
        self.fc = nn.Linear(hidden_size, output_size)

    def forward(self, X):
        # Initialize the hidden state and cell state with zeros
        hidden_initialize = torch.zeros(self.num_layers, X.size(0), self.hidden_size).to(X.device)
        cell_initialize = torch.zeros(self.num_layers, X.size(0), self.hidden_size).to(X.device)

        # Pass the input through the LSTM
        out, _ = self.lstm(X, (hidden_initialize, cell_initialize))

        # The fully connected layer produces the output from the last time step
        out = self.fc(out[:, -1, :])

        return out

# Model parameters:
input_size = 7      # Number of features per data point
hidden_size = 64    # Guess at a suitable size
output_size = 7     # Number of outputs
num_layers = 2      # Number of LSTM layers

# Create the LSTM model
LSTM_model = LSTMnet(input_size=input_size, hidden_size=hidden_size, output_size=output_size, num_layers=num_layers)

print(LSTM_model)


In [None]:
from torch.utils.data import DataLoader, TensorDataset

# Hyperparametere
learning_rate = 0.001
num_epochs = 10
batch_size = 64

# Optimaliseringsfunksjon og tapfunksjon
optimizer = optim.Adam(LSTM_model.parameters(), lr=learning_rate)
loss_function = nn.MSELoss()

# Simulert treningsdata (eksempel)
# X_data og Y_data bør være forberedt på samme måte som tidligere nevnt
X_data = np.random.rand(1000, 360, input_size)  # Dummy data: 1000 sekvenser med 360 tidssteg hver
Y_data = np.random.rand(1000, output_size)      # Dummy labels

# Konverter til PyTorch tensorer
X_tensor = torch.tensor(X_data, dtype=torch.float32)
Y_tensor = torch.tensor(Y_data, dtype=torch.float32)

# Lag dataloaders
train_dataset = TensorDataset(X_tensor, Y_tensor)
data_loader = DataLoader(dataset=train_dataset, batch_size=batch_size, shuffle=True)

# Treningsløkke
for epoch in range(num_epochs):
    for inputs, targets in data_loader:

        # Reset the gradients
        optimizer.zero_grad()

        # Forward pass through the LSTM model
        outputs = LSTM_model(inputs)

        # Compute the loss
        loss = loss_function(outputs, targets)

        # Backward pass (backpropagation)
        loss.backward()

        # Update the model parameters
        optimizer.step()

    # Print the loss of the last batch in each epoch
    print(f'Epoch [{epoch+1}/{num_epochs}], Loss: {loss.item():.4f}')
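
The dummy arrays above stand in for real sequences. One way to build `X_data` and `Y_data` from a single ship's feature history is a sliding window, sketched below. It assumes a next-step target (the row immediately after each window), which fits `output_size == input_size` but is not confirmed by the notebook. The helper name `make_windows` and the toy array are hypothetical.

```python
import numpy as np

def make_windows(series, sequence_length):
    """Split a (T, F) array into inputs (N, sequence_length, F) and next-step targets (N, F)."""
    series = np.asarray(series, dtype=np.float32)
    n_windows = len(series) - sequence_length
    if n_windows <= 0:
        raise ValueError("series must contain at least sequence_length + 1 rows")
    # Window i covers rows i .. i + sequence_length - 1
    X = np.stack([series[i:i + sequence_length] for i in range(n_windows)])
    # The target for window i is the row that directly follows it
    Y = series[sequence_length:]
    return X, Y

# Toy example (hypothetical data): 10 time steps, 7 features, windows of length 4
toy = np.arange(70, dtype=np.float32).reshape(10, 7)
X_toy, Y_toy = make_windows(toy, sequence_length=4)
print(X_toy.shape, Y_toy.shape)  # (6, 4, 7) (6, 7)
```

Windows from several ships can be concatenated along the first axis before the conversion to tensors, provided all ships are scaled consistently.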

In [None]:
torch.save(LSTM_model.state_dict(), "lstm_model_test.pth")
loaded_LSTM_model = LSTMnet(input_size=input_size, hidden_size=hidden_size, output_size=output_size, num_layers=num_layers)

# Load the saved parameters from file
loaded_LSTM_model.load_state_dict(torch.load("lstm_model_test.pth"))

# Set the model to evaluation mode
loaded_LSTM_model.eval()

print("The LSTM model has been loaded and set to evaluation mode.")

ais_data_test = pd.read_csv('ais_test.csv', sep=',')
unique_ship_ids = ais_data_test['vesselId'].unique()

In [None]:
ship_predictions = defaultdict(dict)

# Predict for each unique ship_id and store results
for ship_id in unique_ship_ids:
    if ship_id not in ship_train_dataframes:
        print(f"No training data available for ship_id: {ship_id}")
        continue

    df = ship_train_dataframes[ship_id]
    
    # Ensure 'time' is in datetime format
    df['time'] = pd.to_datetime(df['time'])
    
    # Derive time-of-day features from the timestamp
    df['hour'] = df['time'].dt.hour
    df['minute'] = df['time'].dt.minute
    df['second'] = df['time'].dt.second

    features = df[['hour', 'minute', 'second', 'longitude', 'latitude', 'sog', 'cog']].values
    if len(features) < sequence_length:
        print(f"Not enough historical data to predict for ship_id: {ship_id}")
        continue

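    # Prepare the input sequence: the last `sequence_length` observations
    input_sequence = features[-sequence_length:]
    # NOTE: the scaler is fitted on this ship's history only; this assumes the model
    # was trained with the same per-ship min-max scaling
    scaler = MinMaxScaler().fit(features)
    input_sequence_normalized = scaler.transform(input_sequence)

    # Convert to a tensor and add a batch dimension
    input_tensor = torch.tensor(input_sequence_normalized, dtype=torch.float32).unsqueeze(0)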

    # Determine the farthest time in the test data for this ship
    ship_test_times = pd.to_datetime(ais_data_test[ais_data_test['vesselId'] == ship_id]['time'])
    farthest_time = ship_test_times.max()

    # Calculate how many steps are needed to reach the farthest time
    last_known_time = df['time'].iloc[-1]
    # Each step is 20 minutes; round up so the last prediction reaches the farthest time,
    # and always predict at least one step
    total_steps_needed = max(1, int(np.ceil((farthest_time - last_known_time).total_seconds() / (20 * 60))))

    # Predict all future steps up to the farthest time
    current_input = input_tensor.clone()

    for step in range(1, total_steps_needed + 1):
        with torch.no_grad():
            # Predict the next step
            next_position = loaded_LSTM_model(current_input)

        # Store the prediction with its corresponding timestamp
        prediction_time = last_known_time + pd.Timedelta(minutes=20 * step)
        prediction_np = next_position.cpu().numpy()
        prediction_original_scale = scaler.inverse_transform(prediction_np.reshape(1, -1))
        
        # Add the prediction to the dictionary for the ship ID and timestamp
        ship_predictions[ship_id][prediction_time] = prediction_original_scale[0]

        # Slide the window: drop the oldest step and append the (still normalized) prediction
        # next_position has shape (1, features); unsqueeze(1) makes it (1, 1, features)
        current_input = torch.cat((current_input[:, 1:, :], next_position.unsqueeze(1)), dim=1)

# Create a DataFrame to store the final results for each test point
predictions_list = []

# Iterate over the test data and look up the stored prediction for each row
for idx, row in ais_data_test.iterrows():
    ship_id = row['vesselId']
    target_time = pd.to_datetime(row['time']).round('min')

    # Predictions are keyed by pd.Timestamp, so look them up with a Timestamp as well
    ship_preds = ship_predictions.get(ship_id, {})
    if target_time in ship_preds:
        # Exact match found
        prediction = ship_preds[target_time]
    elif ship_preds:
        # No exact match, use the prediction closest in time
        closest_time = min(ship_preds, key=lambda t: abs(t - target_time))
        prediction = ship_preds[closest_time]
    else:
        # No prediction for this ship: keep the row (as NaN) so row IDs stay aligned with the test set
        prediction = np.full(7, np.nan)

    # Feature order: ['hour', 'minute', 'second', 'longitude', 'latitude', 'sog', 'cog']
    predictions_list.append({
        'ship_id': ship_id,
        'time': target_time,
        'predicted_latitude': prediction[4],
        'predicted_longitude': prediction[3],
        'predicted_sog': prediction[5],
        'predicted_cog': prediction[6]
    })

# Convert predictions to a DataFrame
predictions_df = pd.DataFrame(predictions_list)

print(predictions_df.head())
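
The closest-timestamp fallback above scans every stored prediction for each test row. Because the predictions lie on a regular 20-minute grid, the timestamps can be sorted once per ship and searched with binary search instead. The sketch below uses only the standard library. The function name `nearest_time` and the toy timestamps are hypothetical. `pd.Timestamp` subclasses `datetime`, so the same function should also work on `sorted(ship_predictions[ship_id])`.

```python
from bisect import bisect_left
from datetime import datetime, timedelta

def nearest_time(sorted_times, target):
    """Return the element of sorted_times closest to target (ties go to the earlier time)."""
    if not sorted_times:
        raise ValueError("no predictions available")
    pos = bisect_left(sorted_times, target)
    if pos == 0:
        return sorted_times[0]
    if pos == len(sorted_times):
        return sorted_times[-1]
    before, after = sorted_times[pos - 1], sorted_times[pos]
    return before if target - before <= after - target else after

# Toy grid (arbitrary date): three predictions spaced 20 minutes apart
start = datetime(2024, 5, 8, 0, 0)
grid = [start + timedelta(minutes=20 * k) for k in range(1, 4)]
print(nearest_time(grid, datetime(2024, 5, 8, 0, 47)))  # 2024-05-08 00:40:00
```

Sorting once per ship and reusing the list for all of that ship's test rows avoids a linear scan per row.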

In [None]:
csv_data = predictions_df[['predicted_longitude', 'predicted_latitude']]

# Write to CSV file with a header
csv_file_path = 'submission2.csv'

# Save the DataFrame to CSV
csv_data.to_csv(csv_file_path, index=True, index_label = 'ID', header=['longitude_predicted','latitude_predicted'])