
# Part A

## Task 1

### Datasheet:

Dataset name + link: UCI HAR (Human Activity Recognition Using Smartphones) Dataset, https://www.kaggle.com/code/asmahwimli/human-activity-recognition

License/terms:
* Davide Anguita, Alessandro Ghio, Luca Oneto, Xavier Parra and Jorge L. Reyes-Ortiz. A Public Domain Dataset for Human Activity Recognition Using Smartphones. 21st European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning, ESANN 2013. Bruges, Belgium 24-26 April 2013. This dataset is distributed AS-IS and no responsibility implied or explicit can be addressed to the authors or their institutions for its use or misuse. Any commercial use is prohibited.

* Davide Anguita, Alessandro Ghio, Luca Oneto, Xavier Parra, Jorge L. Reyes-Ortiz. Energy Efficient Smartphone-Based Activity Recognition using Fixed-Point Arithmetic. Journal of Universal Computer Science. Special Issue in Ambient Assisted Living: Home Care. Volume 19, Issue 9. May 2013.

* Davide Anguita, Alessandro Ghio, Luca Oneto, Xavier Parra and Jorge L. Reyes-Ortiz. Human Activity Recognition on Smartphones using a Multiclass Hardware-Friendly Support Vector Machine. 4th International Workshop of Ambient Assisted Living, IWAAL 2012, Vitoria-Gasteiz, Spain, December 3-5, 2012. Proceedings. Lecture Notes in Computer Science 2012, pp 216-223.

* Jorge Luis Reyes-Ortiz, Alessandro Ghio, Xavier Parra-Llanas, Davide Anguita, Joan Cabestany, Andreu Català. Human Activity and Motion Disorder Recognition: Towards Smarter Interactive Cognitive Environments. 21st European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning, ESANN 2013. Bruges, Belgium 24-26 April 2013.

Prediction task + target definition:
* The task is to classify human activities using data collected from a smartphone's accelerometer and gyroscope.
* Target: Activity Label (Walking, Walking Upstairs, Walking Downstairs, Sitting, Standing, Laying)

Units of prediction (window length T, channels C, sampling rate):
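* T - 128 samples (time steps) per window
* C - 9 channels in the raw inertial signals (x/y/z of total acceleration, body acceleration, and body gyroscope); this notebook loads the 6 body acceleration and body gyroscope channels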
* Sampling rate - 50 Hz

Label set and class balance (counts per class over the full dataset, 10,299 windows):
* Walking - 1,722
* Walking Upstairs - 1,544
* Walking Downstairs - 1,406
* Sitting - 1,777
* Standing - 1,906
* Laying - 1,944

Participants/subjects (how many, and how split is done):
* Participants: 30 volunteers (subjects) aged between 19 and 48 years
* Split - Subject-independent split (training: 21 subjects, test: 9 subjects)

Intended use / decision context:
* Developing and evaluating machine learning algorithms that classify basic physical activities in real time.
* Context: Health monitoring and fitness tracking

Non-intended use / known risks:
* Not intended to detect concurrent activities; each window carries a single activity label.
* The phone was strapped at the waist, so a model trained on this data may not transfer to other placements such as wrist-worn watches.
* Participants were 19 to 48 years old, so children and elderly people are not represented in the data.

Preprocessing steps (normalization, filtering, windowing):
* Filtering (Noise): A median filter and a 3rd order low-pass Butterworth filter with a 20 Hz cutoff were applied to remove high-frequency digital noise.

* Gravity Separation: The acceleration signal was separated into Body Acceleration and Gravity using another low-pass Butterworth filter (cutoff of 0.3 Hz). Gravity is assumed to have only low-frequency components.

* Windowing: The continuous stream was sliced into fixed-width sliding windows of 128 readings (2.56 seconds).

* Overlap: A 50% overlap (64 samples) was used between windows to ensure transitional movements between activities weren't missed.
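
The windowing and overlap steps above can be sketched with NumPy. This is a minimal illustration, not the dataset authors' code: the function name `sliding_windows` and the synthetic `stream` array are assumptions made for the example. Only the window length (128), the hop of 64 samples (50% overlap), and the 50 Hz sampling rate come from the datasheet.

```python
import numpy as np

def sliding_windows(signal, window=128, hop=64):
    """Slice a (num_samples, channels) array into overlapping windows.

    Returns an array of shape (num_windows, window, channels).
    Trailing samples that do not fill a whole window are dropped.
    """
    signal = np.asarray(signal)
    num_windows = (len(signal) - window) // hop + 1
    if num_windows <= 0:
        return np.empty((0, window) + signal.shape[1:], dtype=signal.dtype)
    starts = np.arange(num_windows) * hop
    return np.stack([signal[s:s + window] for s in starts])

fs = 50  # sampling rate in Hz (from the datasheet)

# Hypothetical 10-second, 6-channel recording used only for illustration.
stream = np.arange(10 * fs * 6, dtype=float).reshape(10 * fs, 6)

windows = sliding_windows(stream)
print("Window array shape:", windows.shape)  # (num_windows, 128, 6)
print("Seconds per window:", 128 / fs)

# With a hop of 64, the second half of one window equals the first half of the next.
print("Halves match:", np.array_equal(windows[0][64:], windows[1][:64]))
```

The last line shows why overlapping windows are a leakage concern: adjacent windows share half of their samples, so they must stay in the same split.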

Leakage risks + mitigation (subject split, normalization fit, duplicates):
* Subject split: Consecutive windows from the same person overlap by 50%, so if one subject's windows are spread across sets, the model can memorize near-duplicate data. The mitigation is to split by subject ID so that each person's data stays in a single set (train, validation, or test).
* Normalization fit: The risk is calculating the mean and standard deviation on the entire dataset, which leaks test statistics into training. The mitigation is to split the data first and then calculate the statistics on the training data only.
* Duplicate windows: The risk is that overlapping windows share samples and carry redundant information. The mitigation is to maintain the subject split so that overlapping windows never end up in different sets.
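
The normalization-fit mitigation can be sketched as follows. Note that this notebook does not normalize its inputs; the helper names `fit_channel_stats` and `apply_channel_stats` and the random arrays are hypothetical and only illustrate the idea of fitting statistics on the training windows and reusing them elsewhere.

```python
import numpy as np

def fit_channel_stats(X_fit, eps=1e-8):
    """Per-channel mean and std computed from training windows only.

    X_fit has shape (num_windows, T, C); statistics are taken over
    the window and time axes, leaving one value per channel.
    """
    mean = X_fit.mean(axis=(0, 1), keepdims=True)
    std = X_fit.std(axis=(0, 1), keepdims=True)
    return mean, std + eps  # eps guards against division by zero

def apply_channel_stats(X, mean, std):
    """Standardize any split with statistics fitted on the training split."""
    return (X - mean) / std

# Synthetic stand-ins for the real splits (assumption for illustration).
rng = np.random.default_rng(0)
X_fit_demo = rng.normal(loc=2.0, scale=3.0, size=(100, 128, 6))
X_eval_demo = rng.normal(loc=2.0, scale=3.0, size=(40, 128, 6))

mean, std = fit_channel_stats(X_fit_demo)           # fit on training data only
X_fit_norm = apply_channel_stats(X_fit_demo, mean, std)
X_eval_norm = apply_channel_stats(X_eval_demo, mean, std)  # reuse training stats

print("Stats shape:", mean.shape)
print("Train channel means after scaling:", X_fit_norm.mean(axis=(0, 1)).round(6))
```

The evaluation split will generally not have exactly zero mean and unit variance after scaling, which is expected: its statistics were never used.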

In [28]:
import os

import kagglehub
import numpy as np
import pandas as pd
import torch
import torch.nn as nn
import torch.optim as optim
from sklearn.metrics import f1_score, confusion_matrix, classification_report
from sklearn.model_selection import train_test_split
from torch.utils.data import DataLoader, TensorDataset

In [29]:
# Dataset location in the Kaggle input directory
path = '/kaggle/input/ucihar-dataset/UCI-HAR Dataset'

print("Path to dataset files:", path)

subject_train_file = os.path.join(path, 'train', 'subject_train.txt')
subjects_train = np.loadtxt(subject_train_file, dtype=int)

subjects_test_file = os.path.join(path, 'test', 'subject_test.txt')
subjects_test = np.loadtxt(subjects_test_file, dtype=int)

# Load raw sensor data
body_acc_x = np.loadtxt(f'{path}/train/Inertial Signals/body_acc_x_train.txt')
body_acc_y = np.loadtxt(f'{path}/train/Inertial Signals/body_acc_y_train.txt')
body_acc_z = np.loadtxt(f'{path}/train/Inertial Signals/body_acc_z_train.txt')
body_gyro_x = np.loadtxt(f'{path}/train/Inertial Signals/body_gyro_x_train.txt')
body_gyro_y = np.loadtxt(f'{path}/train/Inertial Signals/body_gyro_y_train.txt')
body_gyro_z = np.loadtxt(f'{path}/train/Inertial Signals/body_gyro_z_train.txt')

# Stack features along the third dimension
X_train = np.stack([body_acc_x, body_acc_y, body_acc_z, body_gyro_x, body_gyro_y, body_gyro_z], axis=2)

print("Training data shape:", X_train.shape)

y_train = np.loadtxt(f'{path}/train/y_train.txt')
y_train = y_train - 1
y_train = y_train.astype(int)

print("Training labels shape:", y_train.shape)

# Test data
body_acc_x = np.loadtxt(f'{path}/test/Inertial Signals/body_acc_x_test.txt')
body_acc_y = np.loadtxt(f'{path}/test/Inertial Signals/body_acc_y_test.txt')
body_acc_z = np.loadtxt(f'{path}/test/Inertial Signals/body_acc_z_test.txt')
body_gyro_x = np.loadtxt(f'{path}/test/Inertial Signals/body_gyro_x_test.txt')
body_gyro_y = np.loadtxt(f'{path}/test/Inertial Signals/body_gyro_y_test.txt')
body_gyro_z = np.loadtxt(f'{path}/test/Inertial Signals/body_gyro_z_test.txt')

# Stack features along the third dimension
X_test = np.stack([body_acc_x, body_acc_y, body_acc_z, body_gyro_x, body_gyro_y, body_gyro_z], axis=2)

print("Testing data shape:", X_test.shape)

y_test = np.loadtxt(f'{path}/test/y_test.txt')
y_test = y_test - 1
y_test = y_test.astype(int)

print("Testing labels shape:", y_test.shape)

Path to dataset files: /kaggle/input/ucihar-dataset/UCI-HAR Dataset
Training data shape: (7352, 128, 6)
Training labels shape: (7352,)
Testing data shape: (2947, 128, 6)
Testing labels shape: (2947,)


## Task 2

X_train.shape: (7352, 128, 6)
- num_windows = 7352 (number of 2.56-second sliding windows in the train set)
- T (window length / timesteps) = 128 (at 50 Hz sampling rate → 2.56 seconds per window)
- C (channels) = 6 (x/y/z body acceleration + x/y/z body gyroscope)

y_train.shape (post-remapping): (7352,)
- Values are integers in the range 0 to 5


X_test.shape: (2947, 128, 6)


y_test.shape: (2947,)

In [30]:
activity_names = [
    "WALKING",
    "WALKING_UPSTAIRS",
    "WALKING_DOWNSTAIRS",
    "SITTING",
    "STANDING",
    "LAYING"
]

label_counts = pd.Series(y_train).value_counts().sort_index()
total_windows = len(y_train)
percentages = (label_counts / total_windows * 100).round(2)

df_dist = pd.DataFrame({
    'Label': label_counts.index,
    'Activity': [activity_names[i] for i in label_counts.index],
    'Count': label_counts.values,
    'Percentage': percentages.values.astype(str) + '%'
})

print(df_dist.to_string(index=False))

 Label           Activity  Count Percentage
     0            WALKING   1226     16.68%
     1   WALKING_UPSTAIRS   1073     14.59%
     2 WALKING_DOWNSTAIRS    986     13.41%
     3            SITTING   1286     17.49%
     4           STANDING   1374     18.69%
     5             LAYING   1407     19.14%


In [31]:
print("NaN in X_train:", np.isnan(X_train).any())
print("inf in X_train:", np.isinf(X_train).any())
print("NaN in y_train:", np.isnan(y_train).any())

NaN in X_train: False
inf in X_train: False
NaN in y_train: False


## Task 3

In [32]:
assert len(subjects_train) == len(X_train)

unique_subjects = np.unique(subjects_train)

train_subjects, val_subjects = train_test_split(
    unique_subjects,
    test_size=0.2,
    random_state=45
)


train_mask = np.isin(subjects_train, train_subjects)
val_mask  = np.isin(subjects_train, val_subjects)



X_val  = X_train[val_mask]
X_train = X_train[train_mask]


y_val = y_train[val_mask]
y_train = y_train[train_mask]

subjects_train_split = subjects_train[train_mask]
subjects_val_split  = subjects_train[val_mask]


overlap = np.intersect1d(subjects_train_split, subjects_val_split)
print("Subject overlap:", len(overlap))

Subject overlap: 0


## Task 4

In [33]:
# Find majority class in training set
majority_class = np.bincount(y_train).argmax()
print("Majority Class:", majority_class)

# Predict majority class for test set
y_pred_majority = np.full_like(y_test, fill_value=majority_class)

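# Metrics (zero_division=0 scores never-predicted classes as 0 without emitting warnings)
macro_f1_majority = f1_score(y_test, y_pred_majority, average='macro', zero_division=0)
cm_majority = confusion_matrix(y_test, y_pred_majority)
report_majority = classification_report(y_test, y_pred_majority, output_dict=True, zero_division=0)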
per_class_f1_majority = {label: report_majority[str(label)]['f1-score']
                         for label in report_majority.keys() if label.isdigit()}

f1_table_majority = pd.DataFrame.from_dict(per_class_f1_majority, orient='index', columns=['F1-score'])

print("\nMajority-Class Predictor")
print("Macro-F1:", macro_f1_majority)
print("Confusion Matrix:\n", cm_majority)
print("Per-Class F1:\n", f1_table_majority)


Majority Class: 5

Majority-Class Predictor
Macro-F1: 0.05137772675086108
Confusion Matrix:
 [[  0   0   0   0   0 496]
 [  0   0   0   0   0 471]
 [  0   0   0   0   0 420]
 [  0   0   0   0   0 491]
 [  0   0   0   0   0 532]
 [  0   0   0   0   0 537]]
Per-Class F1:
    F1-score
0  0.000000
1  0.000000
2  0.000000
3  0.000000
4  0.000000
5  0.308266
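
As a cross-check on the reported numbers, macro-F1 can be recomputed by hand from the confusion matrix. The sketch below is illustrative: `f1_from_confusion` is a hypothetical helper, and `cm_majority_reported` is typed in from the printed matrix rather than taken from the notebook's variables.

```python
import numpy as np

def f1_from_confusion(cm):
    """Per-class F1 from a confusion matrix (rows = true, columns = predicted)."""
    cm = np.asarray(cm, dtype=float)
    tp = np.diag(cm)
    fp = cm.sum(axis=0) - tp  # predicted as the class but belonging to another
    fn = cm.sum(axis=1) - tp  # belonging to the class but predicted as another
    denom = 2 * tp + fp + fn
    # F1 = 2TP / (2TP + FP + FN); classes with an empty denominator get 0.
    return np.divide(2 * tp, denom, out=np.zeros_like(tp), where=denom > 0)

# Majority-class predictor: every test window is assigned to class 5.
cm_majority_reported = np.zeros((6, 6), dtype=int)
cm_majority_reported[:, 5] = [496, 471, 420, 491, 532, 537]

per_class = f1_from_confusion(cm_majority_reported)
print("Per-class F1:", per_class.round(6))
print("Macro-F1:", per_class.mean())
```

Only class 5 has a non-zero F1 (perfect recall, but precision of 537/2947), and averaging it with five zeros gives the low macro-F1 of the baseline.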




## Task 5

In [34]:
def flatten_windows(X):
    """Flatten (num_windows, T, C) windows into (num_windows, T*C) vectors for the MLP."""
    return X.reshape(X.shape[0], -1)

X_train_f = flatten_windows(X_train)
X_val_f   = flatten_windows(X_val)
X_test_f  = flatten_windows(X_test)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

X_train_t = torch.tensor(X_train_f, dtype=torch.float32)
Y_train_t = torch.tensor(y_train, dtype=torch.long)

X_val_t = torch.tensor(X_val_f, dtype=torch.float32)
Y_val_t = torch.tensor(y_val, dtype=torch.long)

X_test_t = torch.tensor(X_test_f, dtype=torch.float32)
Y_test_t = torch.tensor(y_test, dtype=torch.long)

batch_size = 64

train_loader = DataLoader(TensorDataset(X_train_t, Y_train_t), batch_size=batch_size, shuffle=True)
val_loader   = DataLoader(TensorDataset(X_val_t, Y_val_t), batch_size=batch_size)
test_loader  = DataLoader(TensorDataset(X_test_t, Y_test_t), batch_size=batch_size)

class MLP(nn.Module):
    def __init__(self, input_dim, num_classes):
        super().__init__()
        self.model = nn.Sequential(
            nn.Linear(input_dim, 256),
            nn.ReLU(),
            nn.BatchNorm1d(256),
            nn.Dropout(0.3),
            nn.Linear(256, 128),
            nn.ReLU(),
            nn.BatchNorm1d(128),
            nn.Dropout(0.3),

            nn.Linear(128, num_classes)
        )

    def forward(self, x):
        return self.model(x)

input_dim = X_train_f.shape[1]
num_classes = len(np.unique(y_train))

model = MLP(input_dim, num_classes).to(device)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=1e-3)

def evaluate(loader):
    model.eval()
    all_preds = []
    all_targets = []

    with torch.no_grad():
        for x, y in loader:
            x, y = x.to(device), y.to(device)
            outputs = model(x)
            preds = torch.argmax(outputs, dim=1)

            all_preds.extend(preds.cpu().numpy())
            all_targets.extend(y.cpu().numpy())

    return np.array(all_targets), np.array(all_preds)


epochs = 30
best_val_f1 = 0

for epoch in range(epochs):
    model.train()

    for x, y in train_loader:
        x, y = x.to(device), y.to(device)

        optimizer.zero_grad()
        outputs = model(x)
        loss = criterion(outputs, y)
        loss.backward()
        optimizer.step()

    # Validation
    y_val_true, y_val_pred = evaluate(val_loader)
    val_macro_f1 = f1_score(y_val_true, y_val_pred, average='macro')

    if val_macro_f1 > best_val_f1:
        best_val_f1 = val_macro_f1
        torch.save(model.state_dict(), "best_mlp.pt")

    print(f"Epoch {epoch+1}/{epochs} | Val Macro-F1: {val_macro_f1:.4f}")


Epoch 1/30 | Val Macro-F1: 0.5955
Epoch 2/30 | Val Macro-F1: 0.6401
Epoch 3/30 | Val Macro-F1: 0.6350
Epoch 4/30 | Val Macro-F1: 0.6911
Epoch 5/30 | Val Macro-F1: 0.6754
Epoch 6/30 | Val Macro-F1: 0.7048
Epoch 7/30 | Val Macro-F1: 0.7354
Epoch 8/30 | Val Macro-F1: 0.7051
Epoch 9/30 | Val Macro-F1: 0.7186
Epoch 10/30 | Val Macro-F1: 0.7776
Epoch 11/30 | Val Macro-F1: 0.7768
Epoch 12/30 | Val Macro-F1: 0.7796
Epoch 13/30 | Val Macro-F1: 0.6966
Epoch 14/30 | Val Macro-F1: 0.7915
Epoch 15/30 | Val Macro-F1: 0.7598
Epoch 16/30 | Val Macro-F1: 0.8068
Epoch 17/30 | Val Macro-F1: 0.8160
Epoch 18/30 | Val Macro-F1: 0.7642
Epoch 19/30 | Val Macro-F1: 0.7348
Epoch 20/30 | Val Macro-F1: 0.8280
Epoch 21/30 | Val Macro-F1: 0.8279
Epoch 22/30 | Val Macro-F1: 0.8005
Epoch 23/30 | Val Macro-F1: 0.7484
Epoch 24/30 | Val Macro-F1: 0.8316
Epoch 25/30 | Val Macro-F1: 0.8035
Epoch 26/30 | Val Macro-F1: 0.8362
Epoch 27/30 | Val Macro-F1: 0.8416
Epoch 28/30 | Val Macro-F1: 0.7966
Epoch 29/30 | Val Macro-F1: 0

## Task 6

In [None]:
# Restore the checkpoint with the best validation macro-F1
model.load_state_dict(torch.load("best_mlp.pt", map_location=device))
y_test_true, y_test_pred = evaluate(test_loader)

# Primary Metric
test_macro_f1 = f1_score(y_test_true, y_test_pred, average='macro')
print("\nPrimary Metric:")
print(f"Test Macro-F1: {test_macro_f1:.4f}")

cm = confusion_matrix(y_test_true, y_test_pred)
print("\nConfusion Matrix:")
print(cm)

report = classification_report(y_test_true, y_test_pred, output_dict=True)
per_class_f1 = {
    label: report[str(label)]['f1-score']
    for label in report.keys()
    if label.isdigit()
}

f1_table = pd.DataFrame.from_dict(per_class_f1, orient='index', columns=['F1-score'])
print("\nPer-Class F1 Scores:")
print(f1_table)



Primary Metric:
Test Macro-F1: 0.8326

Confusion Matrix:
[[447  17  27   2   1   2]
 [  1 461   9   0   0   0]
 [  0   5 414   0   1   0]
 [  0   0   0 392  63  36]
 [  2   3   0 154 334  39]
 [  0   0   0 121  34 382]]

Per-Class F1 Scores:
   F1-score
0  0.945032
1  0.963427
2  0.951724
3  0.675862
4  0.692228
5  0.767068
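
To see which activities drive the lower F1 scores, the confusion matrix can be row-normalized and its largest off-diagonal entries listed. This is a rough sketch: `cm_reported` is typed in from the printed matrix, and the variable names are assumptions for the example. The label order follows the activity table in Task 2.

```python
import numpy as np

names = ["WALKING", "WALKING_UPSTAIRS", "WALKING_DOWNSTAIRS",
         "SITTING", "STANDING", "LAYING"]

# Test-set confusion matrix as printed above (rows = true, columns = predicted).
cm_reported = np.array([
    [447,  17,  27,   2,   1,   2],
    [  1, 461,   9,   0,   0,   0],
    [  0,   5, 414,   0,   1,   0],
    [  0,   0,   0, 392,  63,  36],
    [  2,   3,   0, 154, 334,  39],
    [  0,   0,   0, 121,  34, 382],
])

# Each row becomes the fraction of that true class sent to each prediction.
row_norm = cm_reported / cm_reported.sum(axis=1, keepdims=True)

# Zero the diagonal so only misclassifications remain, then rank them.
off_diag = row_norm.copy()
np.fill_diagonal(off_diag, 0.0)
order = np.argsort(off_diag, axis=None)[::-1][:3]
top_pairs = [tuple(int(i) for i in np.unravel_index(k, off_diag.shape)) for k in order]

for true_idx, pred_idx in top_pairs:
    share = off_diag[true_idx, pred_idx]
    print(f"{names[true_idx]} predicted as {names[pred_idx]}: {share:.1%}")
```

The largest errors all fall among the three static activities (SITTING, STANDING, LAYING), which matches their lower per-class F1 scores, while the three walking classes are rarely confused with the static ones.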
