# Final Project - Group M

Team Members:
1. Akshay Augustine Sheby - 5123774
2. Krishnapriya Krishnan Santhadevi - 5123779
3. Megha Eldho - 5123773
4. Ranjitha Umesh - 5123734

# LSTM model
We tested three algorithms: GRU, LSTM and CNN. Of these, LSTM performed best, giving higher accuracy than the other two. We created separate Jupyter notebooks for data preprocessing and for the LSTM, GRU and CNN models.

We tried several approaches for labelling the start step and end step.
1. We created a single label column named 'step_labels'. From the first row of the .csv.stepMixed file we took the start step index, located the row with that index in the .csv file, and set 'step_labels' to 1. We then took the end step index, located that row, and also set 'step_labels' to 1. All rows whose indexes are not listed in the .csv.stepMixed file were set to 0. The maximum precision score on the test dataset was 0.44944.
2. We created two separate label columns, 'start_step_label' and 'end_step_label'. From the first row of the .csv.stepMixed file we took the start step index, located the row with that index in the .csv file, and set 'start_step_label' to 1. We then took the end step index, located that row, and set 'end_step_label' to 1. All remaining unknown rows were set to 0 in both columns. The maximum precision score on the test dataset was 0.58124.
3. For the columns labelled in approach 2, instead of setting all unknown values to 0, we tried linear interpolation, but this did not improve the score or the accuracy.
4. We then filled the unknown values in the label columns with the mean value, but this did not work well either.
5. We then changed the labelling scheme: the 10 rows after each valid start step index and the 10 rows before each valid end step index were all labelled 1. This also did not give good accuracy.
6. We then labelled 1 row before and 1 row after each valid start step index and valid end step index as 1. The maximum precision score on the test dataset was 0.62397, which is the best score we obtained.
7. We then labelled 2, 3, 4 and 5 rows before and after each valid start step index and valid end step index as 1. These variants also did not perform well.
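
The best-scoring scheme (approach 6) can be sketched as follows. This is a minimal illustration, not the preprocessing code from our notebook: the function name `label_steps`, its arguments and the `window` parameter are hypothetical, the step indexes are made-up values standing in for entries of a .csv.stepMixed file, and the sketch assumes the indexed row itself is labelled 1 along with its neighbours.

```python
def label_steps(n_rows, start_indexes, end_indexes, window=1):
    """Build two 0/1 label columns of length n_rows.

    Each listed index and the `window` rows on either side of it are set to 1.
    All names here are hypothetical; window=1 mirrors approach 6.
    """
    start_labels = [0] * n_rows
    end_labels = [0] * n_rows
    for labels, indexes in ((start_labels, start_indexes), (end_labels, end_indexes)):
        for idx in indexes:
            # Clip the window to the valid row range.
            lo = max(0, idx - window)
            hi = min(n_rows - 1, idx + window)
            for row in range(lo, hi + 1):
                labels[row] = 1
    return start_labels, end_labels


# Made-up example: 10 sensor rows, one step starting at row 2 and ending at row 7.
start, end = label_steps(10, start_indexes=[2], end_indexes=[7], window=1)
print(start)  # [0, 1, 1, 1, 0, 0, 0, 0, 0, 0]
print(end)    # [0, 0, 0, 0, 0, 0, 1, 1, 1, 0]
```

Setting `window=0` would reproduce approach 2, where only the indexed row is labelled.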

Of the algorithms we tried, LSTM performed best. Our observations for each model:
- LSTM gave better results than the others. It handles sequential data well: its memory cells store and retrieve information over time, which lets it learn patterns and long-term dependencies over long time horizons.
- GRU gave results similar to LSTM. GRU is a simplified variant of LSTM that aims to strike a balance between complexity and performance. It also captures temporal dependencies in time series data well, and it is easier to implement and train.
- CNN results were not good. CNNs are primarily designed for image processing tasks; they can be adapted for time series prediction and can capture local patterns and temporal correlations, but in our experiments the CNN did not handle this large sequential dataset well.
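
To make the "simplified variant" point concrete, the sketch below counts the trainable parameters of one recurrent layer. It assumes PyTorch's parameter layout (an input weight matrix, a hidden weight matrix and two bias vectors, each stacked over the gate blocks); the helper name `recurrent_layer_params` is introduced here only for illustration. The sizes match the first layer of the model in this notebook (6 input features, 32 hidden units).

```python
def recurrent_layer_params(input_size, hidden_size, num_gates):
    """Parameter count of a single-layer, unidirectional recurrent layer.

    Assumes weight_ih, weight_hh, bias_ih and bias_hh, each stacked over
    `num_gates` gate blocks (4 for LSTM, 3 for GRU).
    """
    weights = num_gates * hidden_size * (input_size + hidden_size)
    biases = 2 * num_gates * hidden_size
    return weights + biases


lstm_params = recurrent_layer_params(6, 32, num_gates=4)
gru_params = recurrent_layer_params(6, 32, num_gates=3)
print(lstm_params, gru_params)  # 5120 3840
```

Under these assumptions a GRU layer has three quarters of the parameters of an LSTM layer of the same size.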

In [1]:
#Import required libraries
import pandas as pd
import matplotlib.pyplot as plt
import csv
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, Dataset

In [2]:
# 1. Data Preparation

# Load input features from a CSV file
file_path = 'Kaggle competition dataset/data_full.csv'
datasets = pd.read_csv(file_path)
print(datasets)

        Unnamed: 0  AccelX_5  AccelY_5  AccelZ_5   GyroX_5   GyroY_5   
0                0  1.370639  3.077730 -9.138201  0.026021 -0.025069  \
1                1  1.380689  3.039416 -9.200333  0.038649 -0.038450   
2                2  1.378264  2.981465 -9.305405  0.043459 -0.038100   
3                3  1.423814  2.944719 -9.343213  0.042548 -0.028578   
4                4  1.422443  2.946009 -9.392369  0.027376 -0.014168   
...            ...       ...       ...       ...       ...       ...   
722577      722577 -0.572854  7.180082  6.513024  0.001732  0.005325   
722578      722578 -0.538156  7.221120  6.618960  0.001231  0.003183   
722579      722579 -0.520193  7.248638  6.627628  0.001357  0.013642   
722580      722580 -0.527089  7.316613  6.646155  0.008508  0.025486   
722581      722581 -0.503076  7.346765  6.589603  0.011767  0.025111   

         GyroZ_5  start_step_labels  end_step_labels  
0       0.026772                0.0              0.0  
1       0.035676         

In [3]:
#Specify the device
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

#Convert input features and labels to PyTorch tensors
X = torch.tensor(datasets.values[:,1:7], dtype=torch.float64)
y = torch.tensor(datasets.values[:,7:9], dtype=torch.int64)

# Move the input features and labels to the specified device
X = X.to(device)
y = y.to(device)


# 2. LSTM Model Definition

#Define the LSTMModel class inheriting from nn.Module
class LSTMModel(nn.Module):
    def __init__(self, input_size, hidden_sizes, output_size):
        super(LSTMModel, self).__init__()
        self.num_layers = len(hidden_sizes)
        self.hidden_layers = nn.ModuleList()
        self.hidden_layers.append(nn.LSTM(input_size, hidden_sizes[0], batch_first=True))
        for i in range(self.num_layers - 1):
            self.hidden_layers.append(nn.LSTM(hidden_sizes[i], hidden_sizes[i+1], batch_first=True))
        self.fc = nn.Linear(hidden_sizes[-1], output_size)

    #Define the forward pass of the model, where the input passes through the LSTM layers
    def forward(self, x):
        output, _ = self.hidden_layers[0](x)
        for i in range(1, self.num_layers):
            output, _ = self.hidden_layers[i](output)
        output = self.fc(output)
        return output

    
# 3. Hyperparameters and Model Initialization
input_size = 6
hidden_sizes = [32, 16, 8]
output_size = 2  # one output each for the start step and the end step
learning_rate = 0.001
batch_size = 64
num_epochs = 10

#Create an instance of the LSTM model and move it to the specified device
model = LSTMModel(input_size, hidden_sizes, output_size).to(device)

#Ensure that the data type of the model's parameters match the data type of the input features(float64)
model.to(torch.float64)

LSTMModel(
  (hidden_layers): ModuleList(
    (0): LSTM(6, 32, batch_first=True)
    (1): LSTM(32, 16, batch_first=True)
    (2): LSTM(16, 8, batch_first=True)
  )
  (fc): Linear(in_features=8, out_features=2, bias=True)
)
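
`nn.LSTM` with `batch_first=True` accepts batched input of shape [batch, seq_len, features]. One possible way to build such sequences from consecutive sensor rows is a sliding window, sketched here in plain Python. The function name `make_windows`, the window length and the toy rows are assumptions for illustration; the notebook itself passes batches of individual rows to the model without this step.

```python
def make_windows(rows, seq_len):
    """Return overlapping windows of `seq_len` consecutive rows.

    `rows` is a list of feature lists; the result has shape
    [len(rows) - seq_len + 1, seq_len, num_features].
    """
    if seq_len < 1 or seq_len > len(rows):
        raise ValueError("seq_len must be between 1 and len(rows)")
    return [rows[i:i + seq_len] for i in range(len(rows) - seq_len + 1)]


# Toy data: 5 time steps with 2 features each (the notebook uses 6 features).
rows = [[t, 10 * t] for t in range(5)]
windows = make_windows(rows, seq_len=3)
print(len(windows), len(windows[0]), len(windows[0][0]))  # 3 3 2
print(windows[0])  # [[0, 0], [1, 10], [2, 20]]
```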

In [None]:
# Create a data loader
dataset = torch.utils.data.TensorDataset(X, y)
dataloader = torch.utils.data.DataLoader(dataset, batch_size=batch_size, shuffle=True)

# Define the loss function and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=learning_rate)

# 4. Training loop

train_accuracy_list = []
train_loss_list = []

#Iterate over the specified number of epochs
for epoch in range(num_epochs):
    #Initialize variables for tracking loss and accuracy
    train_loss = 0.0
    correct = 0
    total = 0
    #Data is iterated through the data loader in batches
    for inputs, targets in dataloader: #batch inputs and targets
        #Move inputs and targets to the device (X and y were already moved, so this is a no-op)
        inputs = inputs.to(device) #shape: [64, 6] [batch_size, num_inp_features]
        targets = targets.to(device) #shape: [64, 2] [batch_size, num_labels]

        #Perform forward pass through the model
        outputs = model(inputs) #shape: [64, 2] [batch_size, num_outputs]
        
        # Split the outputs and targets into start-step and end-step columns
        start_pred = outputs[:, 0]
        end_pred = outputs[:, 1]
        
        start_target = targets[:, 0]
        end_target = targets[:, 1]
        
        # Cast the outputs from float64 to float32
        start_pred = start_pred.float()
        end_pred = end_pred.float()
        
        # Cast the integer targets to float32 so they match the outputs
        start_target = start_target.float()
        end_target = end_target.float()
        
        #Calculate the loss using loss function
        start_loss = criterion(start_pred, start_target)
        end_loss = criterion(end_pred, end_target)

        # Clear the gradients before backward pass, it ensures that the gradients from previous iterations are cleared before calculating new gradients
        optimizer.zero_grad()
        
        #Perform a single backward pass by taking average of start step loss and end step loss
        loss = (start_loss + end_loss) / 2
        loss.backward()
        
        #Update the neural network parameters based on the gradients
        optimizer.step()

        #Update training loss
        train_loss += loss.item() * inputs.size(0)
        
        #Map the outputs to the range [0, 1] with a sigmoid, then round them to 0/1 predictions
        start_pred = torch.sigmoid(start_pred)
        end_pred = torch.sigmoid(end_pred)
        
        start_pred = start_pred.round()
        end_pred = end_pred.round()

        #Update training accuracy     
        total += start_target.size(0)
        total += end_target.size(0)
        
        correct += (start_pred == start_target).sum().item()  
        correct += (end_pred == end_target).sum().item()

    #Calculate average training loss
    train_loss = train_loss / len(dataset)
    
    #Calculate average training accuracy
    train_accuracy = correct / total
    
    #Append the training loss and accuracy to separate lists to plot the graph
    train_accuracy_list.append(train_accuracy)
    train_loss_list.append(train_loss)

    # Print the training loss and accuracy for the current epoch
    print(f"Epoch [{epoch+1}/{num_epochs}], Train Loss: {train_loss:.4f}, Train Accuracy: {train_accuracy*100}")

Epoch [1/10], Train Loss: 4.1210, Train Accuracy: 64.19873453808704
Epoch [2/10], Train Loss: 4.0136, Train Accuracy: 67.4896413140654
Epoch [3/10], Train Loss: 3.9780, Train Accuracy: 68.62155437029985
Epoch [4/10], Train Loss: 3.9513, Train Accuracy: 68.33217544859961
Epoch [5/10], Train Loss: 3.9335, Train Accuracy: 68.94463188952949
Epoch [6/10], Train Loss: 3.9189, Train Accuracy: 68.9068507103692
Epoch [7/10], Train Loss: 3.9069, Train Accuracy: 69.12731011843638
Epoch [8/10], Train Loss: 3.8964, Train Accuracy: 68.20769130700737
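
Because the start and end labels are two independent 0/1 targets, a binary cross-entropy on the raw outputs (logits) is a common alternative to the loss used in the training loop; PyTorch provides it as `nn.BCEWithLogitsLoss`. The plain-Python sketch below shows the computation. The function name and the sample values are illustrative assumptions, not output from our model.

```python
import math


def bce_with_logits(logits, targets):
    """Mean binary cross-entropy computed directly from logits.

    Uses the numerically stable form max(x, 0) - x*y + log(1 + exp(-|x|)).
    """
    losses = [
        max(x, 0.0) - x * y + math.log1p(math.exp(-abs(x)))
        for x, y in zip(logits, targets)
    ]
    return sum(losses) / len(losses)


# Illustrative logits for the "start" output and their 0/1 targets.
start_logits = [2.0, -1.0, 0.0]
start_targets = [1.0, 0.0, 1.0]
print(round(bce_with_logits(start_logits, start_targets), 4))
```

The same function would be applied to the end-step column, and the two losses averaged as in the loop above.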


In [None]:
# 5. Visualization

#Plot the training accuracy graph
plt.plot(range(1, len(train_accuracy_list)+1), train_accuracy_list)  # x-axis follows the epochs actually recorded
plt.xlabel('Epoch')
plt.ylabel('Train Accuracy')
plt.title('Training Accuracy vs. Epochs')
plt.show()

#Plot the training loss graph
plt.plot(range(1, len(train_loss_list)+1), train_loss_list)  # x-axis follows the epochs actually recorded
plt.xlabel('Epoch')
plt.ylabel('Train Loss')
plt.title('Training Loss vs. Epochs')
plt.show()

In [None]:
# 6. Test Data and Prediction

df_test_data = pd.read_csv("testdata.csv")
print(df_test_data)

In [None]:
#Prediction
#Convert the test data to a PyTorch tensor
test_inputs = torch.tensor(df_test_data.values, dtype=torch.float64)

#Move the test inputs to the specified device
test_inputs = test_inputs.to(device)

#Set the model to evaluation mode
model.eval()

#Perform forward pass on the test data and obtain predictions for start step and end step probabilities
with torch.no_grad():
    out = model(test_inputs)
    print(out)
    start_prob = out[:, 0]
    end_prob = out[:, 1]
    start_prob = torch.sigmoid(start_prob)
    end_prob = torch.sigmoid(end_prob)

print(len(out))

In [None]:
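# Move the rounded predictions to the CPU and convert them to NumPy arrays before building the DataFrame
outputs = pd.DataFrame({'index': range(len(test_inputs)), 'start': start_prob.round().cpu().numpy(), 'end': end_prob.round().cpu().numpy()})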

# Save the predictions to a CSV file
outputs.to_csv('output7_lstm_8.csv', index=False)
print(outputs)
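
The scores quoted in this report are precision values on the test dataset. As a reminder of what that metric measures, here is a small sketch; the function name and the sample label lists are made up for illustration, and the competition's exact scoring procedure is not reproduced here.

```python
def precision(predicted, actual):
    """Precision = true positives / all predicted positives (0.0 if none predicted)."""
    true_pos = sum(1 for p, a in zip(predicted, actual) if p == 1 and a == 1)
    pred_pos = sum(1 for p in predicted if p == 1)
    return true_pos / pred_pos if pred_pos else 0.0


# Made-up rounded predictions and ground-truth labels for one output column.
predicted = [1, 0, 1, 1, 0, 0]
actual = [1, 0, 0, 1, 1, 0]
print(precision(predicted, actual))  # 0.6666666666666666
```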