## Summarized Report
## Hongyu Guo
## May, 2024

### Dataset

The dataset used in this project is sourced from the Autism Brain Imaging Data Exchange (ABIDE). The data has been preprocessed using Cameron Craddock's 200 ROI parcellation atlas, which includes the following key steps:

1. **Atlas and ROIs**:
   - Utilized Cameron Craddock's 200 ROI parcellation atlas.
   - Included 200 Regions of Interest (ROIs).

2. **Preprocessing Pipeline**:
   - Applied the CPAC preprocessing pipeline.
   - Implemented band-pass filtering (0.01 - 0.1 Hz) after nuisance variable regression.
   - Incorporated global mean signal correction during nuisance variable regression for strategies that included global signal correction.

### Data Overview

- **Sample Distribution**:
  - ASD (Autism Spectrum Disorder): 408 samples
  - TD (Typically Developing): 486 samples
- **Time Steps**: Ranged from 78 to 316
- **ROIs**: 200

### Data Cleaning

To ensure data quality, the following cleaning steps were applied:

- Samples with any Pearson Correlation Coefficient (PCC) value as NaN were removed.
- Post-cleaning Sample Counts:
  - ASD: 391
  - TD: 468

### Network Generation

The network features were generated as follows:

- Calculated the PCC between pairs of ROIs.
- Saved the upper diagonal of the PCC matrix (excluding the diagonal values) as the features for each subject.

# Model Development
### Model 1. Semi-supervised Autocoder classification model

### Model 2. Pre-train a VAE model followed by adding the MLP for ASD classification.

1. Train the VAE to reconstruct the input data.
2. Integrate the MLP into the Pre-trained VAE for ASD Classification.


### Results

The classification accuracy was around 66%. This suggests that further data cleaning, preprocessing, and model optimization are needed, potentially indicating some important steps or techniques were missed from the literature.




In [4]:
from google.colab import drive
drive.mount("/content/drive/")

Drive already mounted at /content/drive/; to attempt to forcibly remount, call drive.mount("/content/drive/", force_remount=True).


In [3]:
import tensorflow as tf
device_name = tf.test.gpu_device_name()
if device_name != '/device:GPU:0':
  raise SystemError('GPU device not found')
print('Found GPU at: {}'.format(device_name))
import time
import numpy as np

Found GPU at: /device:GPU:0


In [5]:
import pickle
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import torch
from torch.utils.data import DataLoader, TensorDataset
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

# Load the data
td_path = '/content/drive/My Drive/ASD_project_2024/pcc_TD_cleaned.pkl'
asd_path = '/content/drive/My Drive/ASD_project_2024/pcc_ASD_cleaned.pkl'

with open(td_path, 'rb') as file:
    td_data = pickle.load(file)

with open(asd_path, 'rb') as file:
    asd_data = pickle.load(file)

# Transform the data to [0, 1] range
td_data = (td_data + 1) / 2
asd_data = (asd_data + 1) / 2

# Create labels
td_labels = np.zeros(td_data.shape[0])
asd_labels = np.ones(asd_data.shape[0])

# Combine the data
X = np.vstack((td_data, asd_data))
y = np.concatenate((td_labels, asd_labels))

# Split the data into training, testing, and validation sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
X_test, X_val, y_test, y_val = train_test_split(X_test, y_test, test_size=0.15, random_state=42)

# Print dataset sizes for debugging
print(f"Training dataset size: {X_train.shape[0]}")
print(f"Validation dataset size: {X_val.shape[0]}")
print(f"Testing dataset size: {X_test.shape[0]}")

# Convert data to PyTorch tensors
train_x = torch.from_numpy(X_train).float()
train_y = torch.from_numpy(y_train).float().view(-1, 1)
train_dataset = TensorDataset(train_x, train_y)

test_x = torch.from_numpy(X_test).float()
test_y = torch.from_numpy(y_test).float().view(-1, 1)
test_dataset = TensorDataset(test_x, test_y)

val_x = torch.from_numpy(X_val).float()
val_y = torch.from_numpy(y_val).float().view(-1, 1)
val_dataset = TensorDataset(val_x, val_y)

# Create data loaders
bs = 32  # Example batch size

train_loader = DataLoader(train_dataset, batch_size=bs, shuffle=True, drop_last=True)
test_loader = DataLoader(test_dataset, batch_size=bs, shuffle=True, drop_last=True)
val_loader = DataLoader(val_dataset, batch_size=bs, shuffle=True, drop_last=True)


Training dataset size: 644
Validation dataset size: 33
Testing dataset size: 182


In [6]:
def to_var(x):
    if torch.cuda.is_available():
        x = x.cuda()
    return Variable(x)

def flatten(x):
    return to_var(x.view(x.size(0), -1))

In [7]:
feature_size = X_train.shape[1]
feature_size


19900

In [23]:
td_data.shape, asd_data.shape

((468, 19900), (391, 19900))

# Method 1
# Semi-supervised Autocoder classification model
## Different from the paper, VAE is used in this code.


In [17]:
### May, 2024

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.autograd import Variable
import numpy as np

# Define the VAE model
class VAE(nn.Module):
    def __init__(self, input_dim, h_dim, z_dim, mlp_dim):
        super(VAE, self).__init__()
        # Encoder
        self.encoder1 = nn.Sequential(
            nn.Linear(input_dim, h_dim),
            nn.BatchNorm1d(h_dim),
            nn.LeakyReLU(0.2),
            nn.Dropout(0.3))

        self.encoder2 = nn.Sequential(
            nn.Linear(h_dim, z_dim * 2))

        self.decoder = nn.Sequential(
            nn.Linear(z_dim, h_dim),
            nn.BatchNorm1d(h_dim),
            nn.LeakyReLU(0.2),
            nn.Dropout(0.3),
            nn.Linear(h_dim, input_dim),
            nn.Sigmoid())

        self.task2 = nn.Sequential(
            nn.Linear(z_dim, mlp_dim),
            nn.BatchNorm1d(mlp_dim),
            nn.LeakyReLU(0.2),
            nn.Dropout(0.3),
            nn.Linear(mlp_dim, 1),
            nn.Sigmoid())

    def reparameterize(self, mu, logvar):
        std = logvar.mul(0.5).exp_()
        esp = Variable(torch.randn(*mu.size())).to(mu.device)
        z = mu + std * esp
        return z

    def forward(self, x):
        h = self.encoder1(x)
        h = self.encoder2(h)
        mu, logvar = torch.chunk(h, 2, dim=1)
        z = self.reparameterize(mu, logvar)
        pre_ASD = self.task2(z)
        return self.decoder(z), mu, logvar, pre_ASD



# Set the hyperparameters
input_dim = X_train.shape[1]
h_dim = 1000
z_dim = 600
mlp_dim = 60

num_epochs = 10
learning_rate = 0.0001
weights = [0.6]

model = VAE(input_dim, h_dim, z_dim, mlp_dim)
optimizer = optim.Adam(model.parameters(), lr=learning_rate)
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)

# Function to calculate accuracy
def calculate_accuracy(predicted, target):
    predicted = (predicted > 0.5).float()
    correct = (predicted == target).sum().item()
    return correct / target.size(0)

# Grid search for the best weight
for weight in weights:
    print(f"Testing weight: {weight:.2f}")

    for epoch in range(num_epochs):
        model.train()
        train_loss_VAE = 0.0
        train_loss_ASD = 0.0
        correct = 0
        total = 0

        for batch_idx, (data, target) in enumerate(train_loader):
            data = data.to(device)
            target = target.to(device).float().view(-1, 1)
            optimizer.zero_grad()

            recon_x, mu, logvar, pre_ASD = model(data)

            # Reconstruction loss
            recon_loss = F.binary_cross_entropy(recon_x, data, reduction='mean')

            # KL divergence
            kld = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
            loss_task1 = recon_loss + kld

            # Classification loss
            loss_task2 = F.binary_cross_entropy(pre_ASD, target, reduction='mean')

            # Total loss with weight
            loss = (1 - weight) * loss_task1 + weight * loss_task2

            loss.backward()
            optimizer.step()

            train_loss_VAE += loss_task1.item()
            train_loss_ASD += loss_task2.item()

            # Calculate accuracy
            correct += calculate_accuracy(pre_ASD.detach(), target) * target.size(0)
            total += target.size(0)

        train_accuracy = 100 * correct / total
        print(f"Training --- Epoch {epoch + 1}/{num_epochs}, VAE Loss: {train_loss_VAE / len(train_loader)}, ASD Loss: {train_loss_ASD / len(train_loader)}, Accuracy: {train_accuracy:.2f}%")

        # Validation
        model.eval()
        val_loss_VAE = 0.0
        val_loss_ASD = 0.0
        val_correct = 0
        val_total = 0

        with torch.no_grad():
            for batch_idx, (data, target) in enumerate(val_loader):
                data = data.to(device)
                target = target.to(device).float().view(-1, 1)

                recon_x, mu, logvar, pre_ASD = model(data)

                # Reconstruction loss
                recon_loss = F.binary_cross_entropy(recon_x, data, reduction='mean')

                # KL divergence
                kld = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
                loss_task1 = recon_loss + kld

                # Classification loss
                loss_task2 = F.binary_cross_entropy(pre_ASD, target, reduction='mean')

                val_loss_VAE += loss_task1.item()
                val_loss_ASD += loss_task2.item()

                # Calculate accuracy
                val_correct += calculate_accuracy(pre_ASD, target) * target.size(0)
                val_total += target.size(0)

        if val_total > 0:
            val_accuracy = 100 * val_correct / val_total
            print(f"Validation - Epoch {epoch + 1}/{num_epochs}, VAE Loss: {val_loss_VAE / len(val_loader)}, ASD Loss: {val_loss_ASD / len(val_loader)}, Accuracy: {val_accuracy:.2f}%")
        else:
            print(f"Validation - Epoch {epoch + 1}/{num_epochs}, No validation samples processed.")

# Testing after training
model.eval()
test_loss_ASD = 0.0
test_correct = 0
test_total = 0

with torch.no_grad():
    for batch_idx, (data, target) in enumerate(test_loader):
        data = data.to(device)
        target = target.to(device).float().view(-1, 1)

        _, _, _, pre_ASD = model(data)

        # Classification loss
        loss_task2 = F.binary_cross_entropy(pre_ASD, target, reduction='mean')
        test_loss_ASD += loss_task2.item()

        # Calculate accuracy
        test_correct += calculate_accuracy(pre_ASD, target) * target.size(0)
        test_total += target.size(0)

if test_total > 0:
    test_accuracy = 100 * test_correct / test_total
    print("----------------")
    print(f"Test Results - ASD Loss: {test_loss_ASD / len(test_loader)}, Accuracy: {test_accuracy:.2f}%")
else:
    print("No test samples processed.")


Testing weight: 0.60
Training --- Epoch 1/10, VAE Loss: 0.9095546483993531, ASD Loss: 0.660964635014534, Accuracy: 61.41%
Validation - Epoch 1/10, VAE Loss: 0.7109365463256836, ASD Loss: 0.6686631441116333, Accuracy: 56.25%
Training --- Epoch 2/10, VAE Loss: 0.9122982919216156, ASD Loss: 0.4238467141985893, Accuracy: 86.25%
Validation - Epoch 2/10, VAE Loss: 0.7453729510307312, ASD Loss: 0.574348509311676, Accuracy: 78.12%
Training --- Epoch 3/10, VAE Loss: 0.9073170572519302, ASD Loss: 0.31150773391127584, Accuracy: 94.38%
Validation - Epoch 3/10, VAE Loss: 0.7780651450157166, ASD Loss: 0.625944197177887, Accuracy: 59.38%
Training --- Epoch 4/10, VAE Loss: 0.9023516118526459, ASD Loss: 0.21313233226537703, Accuracy: 99.06%
Validation - Epoch 4/10, VAE Loss: 0.7989895939826965, ASD Loss: 0.8695805668830872, Accuracy: 46.88%
Training --- Epoch 5/10, VAE Loss: 0.8951648026704788, ASD Loss: 0.15752620473504067, Accuracy: 100.00%
Validation - Epoch 5/10, VAE Loss: 0.77845299243927, ASD Los

# Method 2:
# The semi-supervised learning by adding the MLP for ASD classification into the pre-trained VAE model.

1. Train the VAE to reconstruct the input data.
2. Integrate the MLP into the Pre-trained VAE for ASD Classification.

# Step 1: Train the VAE

In [19]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.autograd import Variable

# Define the VAE model
class VAE(nn.Module):
    def __init__(self, input_dim, h_dim, z_dim):
        super(VAE, self).__init__()
        # Encoder
        self.encoder1 = nn.Sequential(
            nn.Linear(input_dim, h_dim),
            nn.BatchNorm1d(h_dim),
            nn.LeakyReLU(0.2),
            nn.Dropout(0.3))

        self.encoder2 = nn.Sequential(
            nn.Linear(h_dim, z_dim * 2))

        self.decoder = nn.Sequential(
            nn.Linear(z_dim, h_dim),
            nn.BatchNorm1d(h_dim),
            nn.LeakyReLU(0.2),
            nn.Dropout(0.3),
            nn.Linear(h_dim, input_dim),
            nn.Sigmoid())

    def reparameterize(self, mu, logvar):
        std = logvar.mul(0.5).exp_()
        esp = Variable(torch.randn(*mu.size())).to(mu.device)
        z = mu + std * esp
        return z

    def forward(self, x):
        h = self.encoder1(x)
        h = self.encoder2(h)
        mu, logvar = torch.chunk(h, 2, dim=1)
        z = self.reparameterize(mu, logvar)
        return self.decoder(z), mu, logvar

# Set the hyperparameters
input_dim = X_train.shape[1]
h_dim = 1000
z_dim = 600

num_epochs = 20
learning_rate = 0.0001

vae = VAE(input_dim, h_dim, z_dim)
optimizer = optim.Adam(vae.parameters(), lr=learning_rate)
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
vae.to(device)

# Train the VAE
for epoch in range(num_epochs):
    vae.train()
    train_loss = 0.0

    for batch_idx, (data, _) in enumerate(train_loader):
        data = data.to(device)
        optimizer.zero_grad()

        recon_x, mu, logvar = vae(data)

        # Reconstruction loss
        recon_loss = F.binary_cross_entropy(recon_x, data, reduction='mean')

        # KL divergence
        kld = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
        loss = recon_loss + kld

        loss.backward()
        optimizer.step()

        train_loss += loss.item()

    print(f"Epoch {epoch + 1}/{num_epochs}, Loss: {train_loss / len(train_loader)}")


Epoch 1/20, Loss: 0.8618254005908966
Epoch 2/20, Loss: 0.8046551197767258
Epoch 3/20, Loss: 0.778367668390274
Epoch 4/20, Loss: 0.7667982012033463
Epoch 5/20, Loss: 0.7586871802806854
Epoch 6/20, Loss: 0.7530590772628785
Epoch 7/20, Loss: 0.748719933629036
Epoch 8/20, Loss: 0.7450314700603485
Epoch 9/20, Loss: 0.7412677139043808
Epoch 10/20, Loss: 0.7382235944271087
Epoch 11/20, Loss: 0.735998558998108
Epoch 12/20, Loss: 0.7336918354034424
Epoch 13/20, Loss: 0.7315237104892731
Epoch 14/20, Loss: 0.7292782127857208
Epoch 15/20, Loss: 0.7274185061454773
Epoch 16/20, Loss: 0.725600180029869
Epoch 17/20, Loss: 0.7239365667104721
Epoch 18/20, Loss: 0.7222286820411682
Epoch 19/20, Loss: 0.7205595374107361
Epoch 20/20, Loss: 0.7196021974086761


# Step 2: Integrate the MLP into the Pre-trained VAE for ASD Classification

In [21]:
# Add MLP for ASD classification into the pre-trained VAE
class VAEWithMLP(nn.Module):
    def __init__(self, vae, z_dim, mlp_dim):
        super(VAEWithMLP, self).__init__()
        self.vae = vae
        self.task2 = nn.Sequential(
            nn.Linear(z_dim, mlp_dim),
            nn.BatchNorm1d(mlp_dim),
            nn.LeakyReLU(0.2),
            nn.Dropout(0.3),
            nn.Linear(mlp_dim, 1),
            nn.Sigmoid())

    def forward(self, x):
        recon_x, mu, logvar = self.vae(x)
        z = self.vae.reparameterize(mu, logvar)
        pre_ASD = self.task2(z)
        return recon_x, mu, logvar, pre_ASD

mlp_dim = 60
vae_with_mlp = VAEWithMLP(vae, z_dim, mlp_dim)
optimizer = optim.Adam(vae_with_mlp.parameters(), lr=learning_rate)
vae_with_mlp.to(device)

# Function to calculate accuracy
def calculate_accuracy(predicted, target):
    predicted = (predicted > 0.5).float()
    correct = (predicted == target).sum().item()
    return correct / target.size(0)

# Training the VAE with auxiliary learning for ASD classification
for epoch in range(num_epochs):
    vae_with_mlp.train()
    train_loss_VAE = 0.0
    train_loss_ASD = 0.0
    correct = 0
    total = 0

    for batch_idx, (data, target) in enumerate(train_loader):
        data = data.to(device)
        target = target.to(device).float().view(-1, 1)
        optimizer.zero_grad()

        recon_x, mu, logvar, pre_ASD = vae_with_mlp(data)

        # Reconstruction loss
        recon_loss = F.binary_cross_entropy(recon_x, data, reduction='mean')

        # KL divergence
        kld = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
        loss_task1 = recon_loss + kld

        # Classification loss
        loss_task2 = F.binary_cross_entropy(pre_ASD, target, reduction='mean')


        # Total loss with weight
        loss = (1 - weight) * loss_task1 + weight * loss_task2

        loss.backward()
        optimizer.step()

        train_loss_VAE += loss_task1.item()
        train_loss_ASD += loss_task2.item()

        # Calculate accuracy
        correct += calculate_accuracy(pre_ASD.detach(), target) * target.size(0)
        total += target.size(0)

    train_accuracy = 100 * correct / total
    print(f"Training --- Epoch {epoch + 1}/{num_epochs}, VAE Loss: {train_loss_VAE / len(train_loader)}, ASD Loss: {train_loss_ASD / len(train_loader)}, Accuracy: {train_accuracy:.2f}%")

    # Validation
    vae_with_mlp.eval()
    val_loss_VAE = 0.0
    val_loss_ASD = 0.0
    val_correct = 0
    val_total = 0

    with torch.no_grad():
        for batch_idx, (data, target) in enumerate(val_loader):
            data = data.to(device)
            target = target.to(device).float().view(-1, 1)

            recon_x, mu, logvar, pre_ASD = vae_with_mlp(data)

            # Reconstruction loss
            recon_loss = F.binary_cross_entropy(recon_x, data, reduction='mean')

            # KL divergence
            kld = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
            loss_task1 = recon_loss + kld

            # Classification loss
            loss_task2 = F.binary_cross_entropy(pre_ASD, target, reduction='mean')

            # Apply scaling factor to VAE loss
            loss_task1 *= vae_weight

            val_loss_VAE += loss_task1.item()
            val_loss_ASD += loss_task2.item()

            # Calculate accuracy
            val_correct += calculate_accuracy(pre_ASD, target) * target.size(0)
            val_total += target.size(0)

    if val_total > 0:
        val_accuracy = 100 * val_correct / val_total
        print(f"Validation - Epoch {epoch + 1}/{num_epochs}, VAE Loss: {val_loss_VAE / len(val_loader)}, ASD Loss: {val_loss_ASD / len(val_loader)}, Accuracy: {val_accuracy:.2f}%")
    else:
        print(f"Validation - Epoch {epoch + 1}/{num_epochs}, No validation samples processed.")

# Testing after training
vae_with_mlp.eval()
test_loss_ASD = 0.0
test_correct = 0
test_total = 0

with torch.no_grad():
    for batch_idx, (data, target) in enumerate(test_loader):
        data = data.to(device)
        target = target.to(device).float().view(-1, 1)

        _, _, _, pre_ASD = vae_with_mlp(data)

        # Classification loss
        loss_task2 = F.binary_cross_entropy(pre_ASD, target, reduction='mean')
        test_loss_ASD += loss_task2.item()

        # Calculate accuracy
        test_correct += calculate_accuracy(pre_ASD, target) * target.size(0)
        test_total += target.size(0)

if test_total > 0:
    test_accuracy = 100 * test_correct / test_total
    print(f"Test Results - ASD Loss: {test_loss_ASD / len(test_loader)}, Accuracy: {test_accuracy:.2f}%")
else:
    print("No test samples processed.")


Training --- Epoch 1/20, VAE Loss: 0.8798136830329895, ASD Loss: 0.3315301775932312, Accuracy: 94.06%
Validation - Epoch 1/20, VAE Loss: 0.07620536535978317, ASD Loss: 0.5184755921363831, Accuracy: 78.12%
Training --- Epoch 2/20, VAE Loss: 0.8538363665342331, ASD Loss: 0.17242252826690674, Accuracy: 100.00%
Validation - Epoch 2/20, VAE Loss: 0.07477930188179016, ASD Loss: 0.5550687909126282, Accuracy: 68.75%
Training --- Epoch 3/20, VAE Loss: 0.8199673473834992, ASD Loss: 0.16685220524668692, Accuracy: 99.84%
Validation - Epoch 3/20, VAE Loss: 0.07494974136352539, ASD Loss: 0.4892743229866028, Accuracy: 81.25%
Training --- Epoch 4/20, VAE Loss: 0.8069587856531143, ASD Loss: 0.15332339107990264, Accuracy: 99.84%
Validation - Epoch 4/20, VAE Loss: 0.07308287173509598, ASD Loss: 0.5494983196258545, Accuracy: 71.88%
Training --- Epoch 5/20, VAE Loss: 0.7977668821811676, ASD Loss: 0.1439684960991144, Accuracy: 100.00%
Validation - Epoch 5/20, VAE Loss: 0.07251542806625366, ASD Loss: 0.52770