# <font color = 'blue'>  **Assignment 3: Transductive Learning with Transformers**

## What is Semi-Supervised Learning?

Most machine learning you’ve seen is supervised (all data has labels).
But consider this: we might have thousands of labeled images for
training an image classifier, but millions or billions of unlabeled
images sitting on the internet or in databases. When developing a model
for understanding images, we should try to use those unlabeled
images rather than throw them away. Semi-supervised learning is
an extension of supervised learning that simply tries not to discard
existing unlabeled data. The key insight is that even without labels,
unlabeled data contains valuable information about the structure and
distribution of the data that can improve our model’s performance.

## Transductive Learning: A Special Case

In this assignment, we’ll focus on a special case of semi-supervised
learning called transductive learning. Instead of training a model to
classify any future data that might come along (inductive learning),
transductive learning has a more focused goal: classify the specific
unlabeled data we already have in our dataset. Think of it this way -
you have a fixed collection of images, some labeled and some not, and
you want to predict labels for the unlabeled ones in that same
collection. Since the model can see all the data (labeled and unlabeled)
during training, it can learn how they relate to each other, often
leading to better performance on those specific unlabeled examples.

## Overview

In this assignment, you will implement a transductive learning approach
using a transformer-based model. This will teach you how to apply
transformers beyond sequential data and how to leverage both labeled and
unlabeled data simultaneously during training.

## Learning Objectives

By completing this assignment, you will:


* Understand transductive vs. inductive learning
* Apply transformer attention mechanisms to
non-sequential data
* Implement semi-supervised learning with partial
labels
* Gain experience with PyTorch’s MultiheadAttention module



## Problem Setup

You are given:
* **X**: A matrix of d-dimensional data points (features
for each point)
*  **Labels**: Some points have known class labels,
others are unlabeled
* **Goal**: Classify all points using a
transformer-based model that sees both labeled and unlabeled data during
training
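
As a rough illustration of this setup, the sketch below builds a toy version of the inputs. The sizes, the `labeled_mask` name, and the use of `-1` for "label unknown" are assumptions made for the example only; the assignment itself identifies labeled and unlabeled rows through index arrays.

```python
import torch

torch.manual_seed(0)

n_points, d, n_classes = 10, 4, 3          # toy sizes, not the assignment's
X = torch.randn(n_points, d)               # feature matrix: one row per point
y_true = torch.randint(0, n_classes, (n_points,))

# Assumption: -1 marks an unknown label (illustrative convention only)
labeled_mask = torch.zeros(n_points, dtype=torch.bool)
labeled_mask[:6] = True
y_observed = torch.where(labeled_mask, y_true, torch.full_like(y_true, -1))

# The model sees both groups during training; only the labeled rows carry targets
X_labeled, y_labeled = X[labeled_mask], y_observed[labeled_mask]
X_unlabeled = X[~labeled_mask]

print(X_labeled.shape, X_unlabeled.shape)
```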


## <font color = 'blue'> **Q1. Obtain the dataset** </font>


**Q1: [4 points]**

We will work with this [tabular dataset](https://drive.google.com/open?id=1WLnWBThCYZ25pReI5DCwk2bgDaCrJxI_&authuser=ikoutis%40njit.edu&usp=drive_fs) that you can download and place in your Google Drive. Mount your drive and load the dataset following the steps in the cell below.


In [2]:
#from google.colab import drive
#drive.mount('/content/gdrive', force_remount=True)

import scipy.io
import numpy as np

#mat = scipy.io.loadmat('/content/gdrive/MyDrive/data/tabularDataset.mat')
#
#This assumes that you have the file in a folder named 'data' in your google drive

The file contains:

* Two matrices $X$ and $X_1$ of numerical features. These datasets have the same dimensions (169343x80). These contain different numerical features for the same points.
*  **Use $X_1$ throughout the assignment. Whenever $X$ is mentioned below, please use $X_1$.**
* An array $y$ of labels, ranging from 0-39.
* Training indices $otrain$ specifying which rows contain training data.
* Validation indices $ovalid$ specifying which rows contain validation data.
* Test indices $otest$ specifying which rows contain test data.
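
Before indexing with these arrays it can help to inspect how they are stored. The sketch below writes a tiny stand-in `.mat` file (the path and all values are made up; only the variable names mirror the assignment) and shows that SciPy returns 1-D arrays as 2-D row vectors, which is why an index like `otrain[0]` is needed. Whether the real indices are 0-based or 1-based is not stated here, so checking their minimum and maximum against the number of rows is a reasonable precaution.

```python
import os
import tempfile

import numpy as np
import scipy.io

# Write a tiny stand-in file; names mirror the assignment, values are invented
path = os.path.join(tempfile.mkdtemp(), "toy.mat")
scipy.io.savemat(path, {
    "X1": np.random.rand(5, 3),
    "y": np.array([0, 1, 2, 1, 0]),
    "otrain": np.array([0, 1, 2]),
})

# whosmat lists each variable's name, shape, and MATLAB class without loading the data
for name, shape, matlab_class in scipy.io.whosmat(path):
    print(name, shape, matlab_class)

mat = scipy.io.loadmat(path)
otrain = mat["otrain"]                 # stored as a row vector of shape (1, n)
train_rows = mat["X1"][otrain[0]]      # take row 0 to get a flat index array
print(otrain.shape, train_rows.shape)
print(otrain.min(), otrain.max())      # sanity check against the number of rows
```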

The following cell shows how to access these arrays and assign them to local numpy objects.

In [154]:
import torch
from torch.utils.data import Dataset
# Load the dataset
mat = scipy.io.loadmat('./tabularDataset.mat')

#print(mat) - helped in figuring out that it's X1 and not X_1.

# Extract arrays (example - complete this)
X1 = mat.get('X1')  # Feature matrix - use this throughout
y = mat.get('y')     # Labels (0-39)
otrain = mat.get('otrain')  # Training indices
ovalid = mat.get('ovalid')  # Validation indices
otest = mat.get('otest')    # Test indices

# Create training, validation, and test splits
# Your code here
xTrain = X1[otrain[0]]
xValid = X1[ovalid[0]]
xTest = X1[otest[0]]

yTrain = y[otrain[0]]
yValid = y[ovalid[0]]
yTest = y[otest[0]]

#print(yTrain[0])
#print(yTrain.shape)
#print(yTest.shape)
#print(yTest[0])
#print(xTest[0])
print(f"Dataset shape: {X1.shape}")
print(f"Number of classes: {len(np.unique(y))}")
print(f"Training samples: {len(xTrain)}, and shape : {xTrain.shape}")
print(f"Validation samples: {len(xValid)} and shape : {xValid.shape}")
print(f"Test samples: {len(xTest)} and shape : {xTest.shape}")

# creating a custom dataset for train, val and test while converting them to tensors.
class CustomDataset(Dataset):
    def __init__(self, x, y):
        self.x = torch.tensor(x, dtype=torch.float32)
        self.y = torch.tensor(y, dtype=torch.long)  # CrossEntropyLoss expects int64 class indices
    
    def __getitem__(self, index):
        return (self.x[index], self.y[index])
    
    def __len__(self):
        return len(self.x)
    

trainDataset = CustomDataset(xTrain,yTrain)
valDataset = CustomDataset(xValid,yValid)
testDataset = CustomDataset(xTest,yTest)

#print(f"{trainDataset[0]}")
#print(f"{valDataset[0]}")
#print(f"{testDataset[0]}")


Dataset shape: (169343, 80)
Number of classes: 40
Training samples: 90941, and shape : (90941, 80)
Validation samples: 29799 and shape : (29799, 80)
Test samples: 48603 and shape : (48603, 80)


## <font color = 'blue'>  **Q2. Create a Standard MLP Baseline**

**Q2: [16 points]**

In what follows, we set $k=40$. Implement and train a standard MLP with the following architecture:

    Linear(2k, 2k) → ReLU → Linear(2k, k)

**Requirements**:
* Create a DataLoader with batch size 256
* Train on `X_train` for 30 epochs
* Select the model from epoch `i` where validation accuracy is maximized
* Report final test accuracy
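
The model-selection requirement can be sketched independently of any training code. The helper below is hypothetical (`select_best_epoch` is not part of the assignment) and uses made-up accuracies and plain dictionaries in place of `model.state_dict()`. The point it illustrates is that the running best accuracy must be updated together with the snapshot, otherwise later epochs overwrite the checkpoint.

```python
import copy

def select_best_epoch(val_accuracies, states):
    """Return (best_epoch, best_acc, best_state) for the highest validation accuracy."""
    best_acc, best_epoch, best_state = float("-inf"), -1, None
    for epoch, (acc, state) in enumerate(zip(val_accuracies, states)):
        if acc > best_acc:                      # strict '>' keeps the earliest epoch on ties
            best_acc, best_epoch = acc, epoch   # update the running best ...
            best_state = copy.deepcopy(state)   # ... and snapshot the weights with it
    return best_epoch, best_acc, best_state

# Made-up numbers standing in for per-epoch validation accuracy and state dicts
val_acc = [43.7, 47.4, 50.7, 47.3]
states = [{"w": [float(i)]} for i in range(len(val_acc))]

best_epoch, best_acc, best_state = select_best_epoch(val_acc, states)
print(best_epoch, best_acc, best_state)
```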

In [107]:
# your code goes here
from torch.utils.data import DataLoader
import torch.nn as nn
import torch.optim as optim
import copy

# a dataloader with 256 batch size
trainLoader = DataLoader(trainDataset, batch_size=256, shuffle=True)
valLoader = DataLoader(valDataset, batch_size=256, shuffle=True)
k = 40

class MLP(nn.Module):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.linear1 = nn.Linear(2*k,2*k)
        self.relu = nn.ReLU()
        self.linear2 = nn.Linear(2*k, k)
    
    def forward(self,x):
        x1 = self.linear1(x)
        act_x1 = self.relu(x1)
        return self.linear2(act_x1)

#test
#model = MLP()
#print(model(trainDataset[0][0]))

#placing the model on GPU


torch.manual_seed(33)

def tuning(learningRate, modelClass):

    model = modelClass()

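    # Prefer Apple MPS, then CUDA, and fall back to CPU so the cell runs on any machine
    device = torch.device('mps' if torch.backends.mps.is_available() else 'cuda' if torch.cuda.is_available() else 'cpu')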
    model = model.to(device) 

    print("----------------------------------------------------------------------------------------")
    print("----------------------------------------------------------------------------------------")
    print(f"Hyperparameter learning rate = {learningRate}")
    print("----------------------------------------------------------------------------------------")
    optimizer = optim.SGD(model.parameters(),lr=learningRate)
    loss = nn.CrossEntropyLoss()

    def makeTrainStep(model, lossFunction, optimizer):
        def trainStep(x,y):
            model.train()
            logits = model(x)
            #print(f'logits in Train Step {logits.shape} {logits}')
            #print(f'Y is {y.squeeze().shape}')
            loss = lossFunction(logits,y.squeeze())
            #print(f'Loss in Train Step {loss.shape} {loss}')
            pred = logits.argmax(dim=1)
            numOfCorrectPred = (pred == y.squeeze()).sum().item()
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
            return loss.item(), numOfCorrectPred
        return trainStep

    def makeValStep(model, lossFunction):
        def valStep(x,y):
            model.eval()
            logits = model(x)
            y = y.squeeze()
            loss = lossFunction(logits,y)
            #since we are concerned with validation accuracy only
            pred = logits.argmax(dim=1)
            numOfCorrectPred = (pred == y).sum().item()
            return loss.item(), numOfCorrectPred
        return valStep

    def trainMLP(numEpochs, trainLoader, valLoader, trainStep, valStep):
        
        trainLosses = []
        valLosses = []
        bestValAcc = 0.0
        trainAcc = []
        valAcc = []
        bestWeights = copy.deepcopy(model.state_dict())

        for epoch in range(numEpochs): 
            trainLoss = 0.0
            sampleSize = 0
            avgTrainLoss = 0.0
            correctTrainPredictions = 0.0
            for xb, yb in trainLoader:
                
                xb = xb.to(device)
                yb = yb.to(device)

                batchSize = xb.size(0)
                sampleSize += batchSize
                
                loss, correctPredictions = trainStep(xb,yb)
                
                trainLoss += loss * batchSize
                correctTrainPredictions += correctPredictions

            avgTrainLoss = trainLoss / sampleSize
            accuracyTrain = (correctTrainPredictions / sampleSize) * 100

            trainLosses.append(avgTrainLoss)
            trainAcc.append(accuracyTrain)

            valLoss = 0.0
            sampleSize = 0
            avgValLoss = 0.0
            correctValPredictions = 0.0
            with torch.no_grad():
                for xb, yb in valLoader:
                    xb = xb.to(device)
                    yb = yb.to(device)

                    batchSize = xb.size(0)
                    #print('Batch size ', batchSize)
                    sampleSize += batchSize
                    
                    loss, correctPredictions = valStep(xb, yb)
                    valLoss += loss * batchSize
                    correctValPredictions += correctPredictions
                    # calculating validation accuracy


            avgValLoss = valLoss / sampleSize
            accuracy = (correctValPredictions / sampleSize) * 100 

            valLosses.append(avgValLoss)
            valAcc.append(accuracy)

            print(f"Epoch {epoch+1}/{numEpochs}, Train Loss: {avgTrainLoss:.6f}, Train Acc : {accuracyTrain:.6f}, Val Loss: {avgValLoss:.6f}, Val acc: {accuracy:.6f}")

            # save params for best validation accuracy
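            if accuracy > bestValAcc:
                bestValAcc = accuracy  # update the running best so later, worse epochs do not overwrite the checkpoint
                bestWeights = copy.deepcopy(model.state_dict())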

        return trainLosses, valLosses, trainAcc, valAcc, bestWeights
                
        
    trainStep = makeTrainStep(model, loss, optimizer)
    valStep = makeValStep(model,loss)

    trainLosses, valLosses, trainAcc, valAcc, weights = trainMLP(30, trainLoader, valLoader, trainStep, valStep)

    return trainLosses, valLosses, trainAcc, valAcc, weights

params = {
    'lr' : [1, 1e-1, 1e-2, 1e-3, 1e-4]
}


results = []
bestValLoss = 0
bestLr = 0
bestTrainLosses = []
bestValLosses = []
bestValAcc = 0
bestWeights = None
bestTrainAccs = []
bestValAccs = []
for lr in params['lr']:
    trainLosses, valLosses, trainAcc, valAcc, weights = tuning(lr, MLP)
    leastValLoss = min(valLosses)
    maxValAcc = max(valAcc)
    results.append(
        {   
            'lr': lr,
            'trainLosses' : trainLosses,
            'valLosses' : valLosses,
            'valAcc': valAcc,
            'leastValLoss' : leastValLoss,
            'highestValAcc' : maxValAcc
        }
    )
    if maxValAcc > bestValAcc:
        bestValAcc = maxValAcc
        bestValLoss = leastValLoss
        bestLr = lr
        bestTrainLosses = trainLosses
        bestValLosses = valLosses
        bestWeights = weights
        bestTrainAccs = trainAcc
        bestValAccs = valAcc

----------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------
Hyperparameter learning rate = 1
----------------------------------------------------------------------------------------
Epoch 1/30, Train Loss: 2.346027, Train Acc : 35.657184, Val Loss: 2.003752, Val acc: 43.686030
Epoch 2/30, Train Loss: 1.919831, Train Acc : 45.796725, Val Loss: 1.818539, Val acc: 47.404275
Epoch 3/30, Train Loss: 1.832056, Train Acc : 48.081723, Val Loss: 1.831667, Val acc: 47.038491
Epoch 4/30, Train Loss: 1.791597, Train Acc : 48.915231, Val Loss: 1.920693, Val acc: 44.139065
Epoch 5/30, Train Loss: 1.763622, Train Acc : 49.535413, Val Loss: 1.769315, Val acc: 48.004967
Epoch 6/30, Train Loss: 1.744172, Train Acc : 50.112710, Val Loss: 1.716125, Val acc: 50.713111
Epoch 7/30, Train Loss: 1.731063, Train Acc : 50.406307, Val Loss: 1.826378, Val acc: 47.256619
Epoch 8/30, Train Loss: 1.71

In [108]:
for result in results:
    # Double quotes outside so the single-quoted keys also parse on Python versions before 3.12
    print(f"Lr = {result['lr']} \n valAcc = {result['highestValAcc']} || valLoss = {result['leastValLoss']}")
    print('------------------------------------------------------------')

Lr = 1 
 valAcc = 52.72995738112017 || valLoss = 1.6331573304739881
------------------------------------------------------------
Lr = 0.1 
 valAcc = 51.48159334205846 || valLoss = 1.6966648200537646
------------------------------------------------------------
Lr = 0.01 
 valAcc = 40.175844827007616 || valLoss = 2.249120347247579
------------------------------------------------------------
Lr = 0.001 
 valAcc = 7.627772744051814 || valLoss = 2.997708225983126
------------------------------------------------------------
Lr = 0.0001 
 valAcc = 7.691533272928622 || valLoss = 3.563907101768876
------------------------------------------------------------


In [109]:
print(f'Best metrics are for lr = {bestLr}: \n valLoss = {bestValLoss} || valAcc = {bestValAcc} ')

Best metrics are for lr = 1: 
 valLoss = 1.6331573304739881 || valAcc = 52.72995738112017 


In [110]:
bestModel = MLP()
bestModel.load_state_dict(bestWeights)
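# Prefer Apple MPS, then CUDA, and fall back to CPU so the cell runs on any machine
device = torch.device('mps' if torch.backends.mps.is_available() else 'cuda' if torch.cuda.is_available() else 'cpu')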
bestModel.to(device)
testLoader = DataLoader(testDataset, batch_size=256, shuffle=True)

correctPreds = 0
sampleSize = 0
with torch.no_grad():
    for xb, yb in testLoader:
        xb = xb.to(device)
        yb = yb.to(device)
        bestModel.eval()
        yb = yb.squeeze()
        logits = bestModel(xb)
        preds = logits.argmax(dim=1)
        correctPreds += (preds == yb).sum().item()
        sampleSize += xb.size(0)
    print(f'Test Accuracy is {(correctPreds/sampleSize)*100}')

Test Accuracy is 47.74808139415263


## <font color = 'blue'> **Q3. Non-Parametric Layer Normalization**

**Q3: [4 points]**

Write a 1-liner function that implements non-parametric layer
normalization. This function should take a matrix and divide each row by its L2 norm plus epsilon (eps) for numerical stability.

``` python
def fixed_layer_norm(x, eps=1e-8):
    # Your 1-liner here
    pass
```

In [111]:
def fixedLayerNorm(x, eps=1e-8):
    # ref : https://docs.pytorch.org/docs/stable/generated/torch.nn.LayerNorm.html
    return x / (torch.norm(x, p = 2, dim = 1, keepdim=True) + eps)
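
A quick sanity check of this kind of row normalization is sketched below. The function is re-declared under the hypothetical name `row_l2_normalize` so that the snippet runs on its own; the input values are chosen so the norms are easy to verify by hand. Note that an all-zero row stays at zero rather than producing NaNs, which is what the `eps` term is for.

```python
import torch

def row_l2_normalize(x, eps=1e-8):
    # Same formula as the one-liner above: each row divided by (its L2 norm + eps)
    return x / (x.norm(p=2, dim=1, keepdim=True) + eps)

x = torch.tensor([[3.0, 4.0],    # norm 5
                  [0.0, 0.0],    # norm 0: eps prevents a division by zero
                  [1.0, 1.0]])   # norm sqrt(2)
out = row_l2_normalize(x)
row_norms = out.norm(dim=1)

print(out)
print(row_norms)   # non-zero rows should have norm close to 1
```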

## <font color = 'blue'> **Q4. MLP with Fixed Layer Normalization**

**Q4: [16 points]**

Now let’s use the fixed normalization layer (fln) in an MLP-like network
with the following architecture:

    X → fln → Linear(2k, 2k) → fln → ReLU → Linear(2k, k)

**Requirements**:
* Train this network using the same setup as Q2 (batch
size 256, 30 epochs)
* Select the model from the epoch with maximum
validation accuracy
* Report final test accuracy
* Create plots comparing the training and validation accuracy curves between the standard MLP (Q2) and this normalized version

In [112]:
# your code here

class MLP2(nn.Module):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.linear1 = nn.Linear(2*k,2*k)
        self.relu = nn.ReLU()
        self.linear2 = nn.Linear(2*k,k)
    
    def forward(self,x):
        xNorm = fixedLayerNorm(x)
        x1 = self.linear1(xNorm)
        x1Norm = fixedLayerNorm(x1)
        xAct = self.relu(x1Norm)
        return self.linear2(xAct)


results2 = []
bestValLoss2 = 0
bestLr2 = 0
bestTrainLosses2 = []
bestValLosses2 = []
bestValAcc2 = 0
bestWeights2 = None
bestTrainAccs2 = []
bestValAccs2 = []
for lr in params['lr']:
    trainLosses, valLosses, trainAcc, valAcc, weights = tuning(lr, MLP2)
    leastValLoss = min(valLosses)
    maxValAcc = max(valAcc)
    results2.append(
        {   
            'lr': lr,
            'trainLosses' : trainLosses,
            'valLosses' : valLosses,
            'valAcc': valAcc,
            'leastValLoss' : leastValLoss,
            'highestValAcc' : maxValAcc
        }
    )
    if maxValAcc > bestValAcc2:
        bestValAcc2 = maxValAcc
        bestValLoss2 = leastValLoss
        bestLr2 = lr
        bestTrainLosses2 = trainLosses
        bestValLosses2 = valLosses
        bestWeights2 = weights
        bestTrainAccs2 = trainAcc
        bestValAccs2 = valAcc

for result in results2:
    # Double quotes outside so the single-quoted keys also parse on Python versions before 3.12
    print(f"Lr = {result['lr']} \n valAcc = {result['highestValAcc']} || valLoss = {result['leastValLoss']}")
    print('------------------------------------------------------------')

print(f'Best metrics are for lr = {bestLr2}: \n valLoss = {bestValLoss2} || valAcc = {bestValAcc2} ')

bestModel = MLP2()
bestModel.load_state_dict(bestWeights2)
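# Prefer Apple MPS, then CUDA, and fall back to CPU so the cell runs on any machine
device = torch.device('mps' if torch.backends.mps.is_available() else 'cuda' if torch.cuda.is_available() else 'cpu')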
bestModel.to(device)
testLoader = DataLoader(testDataset, batch_size=256, shuffle=True)

correctPreds = 0
sampleSize = 0
with torch.no_grad():
    for xb, yb in testLoader:
        xb = xb.to(device)
        yb = yb.to(device)
        bestModel.eval()
        yb = yb.squeeze()
        logits = bestModel(xb)
        preds = logits.argmax(dim=1)
        correctPreds += (preds == yb).sum().item()
        sampleSize += xb.size(0)
print(f'Test Accuracy is {(correctPreds/sampleSize)*100}')

----------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------
Hyperparameter learning rate = 1
----------------------------------------------------------------------------------------
Epoch 1/30, Train Loss: 2.669100, Train Acc : 29.012217, Val Loss: 2.230194, Val acc: 39.907379
Epoch 2/30, Train Loss: 2.229689, Train Acc : 39.986365, Val Loss: 2.106615, Val acc: 43.904158
Epoch 3/30, Train Loss: 2.071791, Train Acc : 43.177445, Val Loss: 2.003806, Val acc: 44.269942
Epoch 4/30, Train Loss: 1.978500, Train Acc : 45.261213, Val Loss: 2.245581, Val acc: 38.575120
Epoch 5/30, Train Loss: 1.926792, Train Acc : 46.456494, Val Loss: 1.848321, Val acc: 47.689520
Epoch 6/30, Train Loss: 1.880243, Train Acc : 47.395564, Val Loss: 1.822731, Val acc: 47.991543
Epoch 7/30, Train Loss: 1.847313, Train Acc : 48.091620, Val Loss: 1.799393, Val acc: 48.384174
Epoch 8/30, Train Loss: 1.82

##### Comparison

In [113]:
import plotly.graph_objects as go

fig = go.Figure()
x = list(range(len(bestTrainAccs)))  
fig.add_trace(go.Scatter(x=x,y=bestTrainAccs, mode='lines', name='Train Acc for MLP'))
fig.add_trace(go.Scatter(x=x,y=bestTrainAccs2, mode='lines', name='Train Acc for MLP2'))
fig.add_trace(go.Scatter(x=x,y=bestValAccs, mode='lines', name='Validation Accuracy for MLP'))
fig.add_trace(go.Scatter(x=x,y=bestValAccs2, mode='lines', name='Validation Accuracy for MLP2'))
fig.update_layout(
    title=f'Training and Validation Accuracies',
    xaxis_title='Epoch',
    yaxis_title='Accuracy (in %)',
    legend_title='Legend'
)
fig.show()

Although the model with fixed layer normalization performs better on the test set, its training accuracy converges to a lower value than that of the standard MLP, and the validation accuracy shows a similar trend. One possible explanation is that the normalization removes useful information, or that it changes the distribution of the activations in a way that the rest of the architecture, which has no learnable scale or shift, cannot compensate for.
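
One concrete way in which row-wise L2 normalization can remove information is sketched below in plain Python: two feature vectors that point in the same direction but differ in magnitude become indistinguishable after normalization. The vectors and the helper name `l2_normalize` are made up for the illustration, and this is only one candidate explanation, not a diagnosis of the results above.

```python
import math

def l2_normalize(row, eps=1e-8):
    # Divide every entry by (L2 norm of the row + eps)
    norm = math.sqrt(sum(v * v for v in row))
    return [v / (norm + eps) for v in row]

a = [1.0, 2.0, 2.0]        # norm 3
b = [10.0, 20.0, 20.0]     # same direction, norm 30

na, nb = l2_normalize(a), l2_normalize(b)
max_gap = max(abs(p - q) for p, q in zip(na, nb))

print(na)
print(nb)
print(max_gap)   # tiny: the 10x difference in magnitude is gone after normalization
```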

## <font color = 'blue'> **Q5. Transductive Transformer Model**

**Q5: [20 points]**

Now we’ll implement the core transductive learning model. Unlike the previous MLPs that processed individual data points, this model will take two matrices as input and learn relationships between all points simultaneously.

**Input Structure**:
- `X_labeled`: matrix of shape (128, 2k) containing
labeled data points
- `X_unlabeled`: matrix of shape (128, 2k)
containing unlabeled data points
- `y_labeled`: vector of shape (128,) containing labels for the labeled data

**Architecture** (following this flow):

    X_labeled (128, 2k) + X_unlabeled (128, 2k)
                    ↓
             Concatenate along rows
                    ↓
             X_combined (256, 2k)
                    ↓
          Fixed Layer Normalization
                    ↓
        Self-Attention Block (1 head)
                    ↓
          Fixed Layer Normalization
                    ↓
             Linear(2k, k)
                    ↓
             Output (256, k)
             ↓              ↓
       Labeled preds   Unlabeled preds
         (128, k)        (128, k)


**Requirements**:
* Implement a model that follows this exact architecture
* Use `nn.MultiheadAttention` with `num_heads=1` and no positional encodings
* Apply the `fixed_layer_norm` function before and after self-attention
* Compute loss only on the labeled portion of the output (first 128 predictions)
* The model should make predictions for both labeled and unlabeled points

**Key Implementation Notes**:
- The model processes both labeled and unlabeled data together, allowing attention to learn relationships across all points
- Loss is computed only on the labeled portion, but the model can make predictions for all 256 points
- No positional encodings are needed since the order of points doesn't matter
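
The last note can be checked numerically. The sketch below is not the Q5 model; it only exercises a bare `nn.MultiheadAttention` layer on an unbatched `(256, 80)` input and compares the output for a shuffled copy of the rows. Without positional encodings, shuffling the input rows should simply shuffle the output rows in the same way. The tolerance value is an assumption chosen to absorb floating-point reordering effects.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
embed_dim, n_labeled, n_unlabeled = 80, 128, 128

attn = nn.MultiheadAttention(embed_dim=embed_dim, num_heads=1)
attn.eval()

# Unbatched input: (sequence_length, embed_dim), one "token" per data point
x = torch.randn(n_labeled + n_unlabeled, embed_dim)

with torch.no_grad():
    out, weights = attn(x, x, x)                 # self-attention: query = key = value
    perm = torch.randperm(x.size(0))
    out_perm, _ = attn(x[perm], x[perm], x[perm])

print(out.shape, weights.shape)                  # (256, 80) and (256, 256)

# Permuting the rows of the input permutes the rows of the output identically
equivariant = torch.allclose(out[perm], out_perm, atol=1e-5)
print(equivariant)
```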

In [181]:
#your code here

class TransductiveModel(nn.Module):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.attention = nn.MultiheadAttention(embed_dim=2*k, num_heads=1, batch_first=True)
        self.linear = nn.Linear(2*k, k)
    
    def forward(self, X_labeled, X_unlabeled):
        #print(X_labeled.shape, X_unlabeled.shape)
        X_combined = torch.cat([X_labeled,X_unlabeled],dim=0)
        xNorm = fixedLayerNorm(X_combined)
        attentionOutput, _ = self.attention(xNorm,xNorm,xNorm) # q,k,v ref: https://docs.pytorch.org/docs/stable/generated/torch.nn.MultiheadAttention.html
        xNorm = fixedLayerNorm(attentionOutput)
        return self.linear(xNorm) 
        
tm = TransductiveModel()
x1 = torch.randn((128,80))
x2 = torch.randn((128,80))
output = tm(x1,x2)

labeledPreds, unlabeledPreds = output[:128,:], output[128:,:]

print(f'Shape of the \n output is {output.shape}, \n labeledPreds is {labeledPreds.shape}, \n unlabeledPreds is {unlabeledPreds.shape}')



Shape of the 
 output is torch.Size([256, 40]), 
 labeledPreds is torch.Size([128, 40]), 
 unlabeledPreds is torch.Size([128, 40])


## <font color = 'blue'> **Q6. Creating Training Data for Transductive Learning**

**[16 points]**

**Problem**: Our transductive model expects pairs of matrices
(X_labeled, X_unlabeled), but we need to efficiently generate *multiple transductive training batches* with full coverage of the training set and systematic handling of unequal dataset sizes. In this question we will generate batches that comprise one epoch.

**Side Note**: By `X_test`, we mean both validation and test data
combined. In true transductive learning (similar to how GNNs use all
node features), the model should see all available unlabeled data during
training - both what we’ll use for validation and final testing.

**Requirements**:
* Use `torch.randperm()` for training data to ensure
every point appears once in the batches (i.e. once per epoch)
* Use `torch.randint()` for test data to handle any test set size
* Pad the final training batch if needed to reach exactly 128 points
* Generate ALL batches using vectorized operations (no loops)
* Test the function and verify output shapes
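
The index bookkeeping behind these requirements can be tried on toy sizes first. The sketch below uses a batch size of 4 instead of 128 and invented dataset sizes, and it checks that every training index is covered at least once. For the sizes reported earlier in this notebook (90,941 training rows, batches of 128), the same arithmetic gives 711 batches with 67 padded slots.

```python
import torch

torch.manual_seed(0)
n_train, n_test, batch_size = 10, 7, 4                     # toy sizes; the assignment uses 128
num_batches = (n_train + batch_size - 1) // batch_size     # ceiling division

train_perm = torch.randperm(n_train)                       # each training index exactly once
padding = num_batches * batch_size - n_train
if padding > 0:
    extra = torch.randint(0, n_train, (padding,))          # pad with random repeats
    train_perm = torch.cat([train_perm, extra])
train_idx = train_perm.view(num_batches, batch_size)

test_idx = torch.randint(0, n_test, (num_batches, batch_size))  # sampled with replacement

X_train = torch.randn(n_train, 5)
X_test = torch.randn(n_test, 5)

# Indexing with a (B, batch_size) index tensor yields a (B, batch_size, features) tensor
X_labeled_all = X_train[train_idx]
X_unlabeled_all = X_test[test_idx]
print(X_labeled_all.shape, X_unlabeled_all.shape)
```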


**Implementation Guide**:


##### Converting to tensors

In [None]:
xTrain = torch.from_numpy(xTrain)
xTest = torch.from_numpy(xTest)
xValid = torch.from_numpy(xValid)

#combining val and test set to get X test for transductive learning
xTest = torch.cat((xTest,xValid),dim=0)
xTrain.shape, xTest.shape


(torch.Size([90941, 80]), torch.Size([78402, 80]))

In [156]:
yTrain = torch.from_numpy(yTrain)
yTrain.shape

torch.Size([90941, 1])

In [None]:
yTest = torch.from_numpy(yTest)
yValid = torch.from_numpy(yValid)

# Y test for transductive learning
yTest = torch.cat((yTest, yValid), dim=0)
yTest.shape

torch.Size([78402, 1])

In [185]:
yTest = yTest.squeeze()
yTest.shape

torch.Size([78402])

In [None]:
def create_transductive_epoch(X_train, y_train, X_test):
    """
    Generate an entire epoch of transductive batches with full training coverage

    Args:
        X_train: (n_train, 2k) - training data
        y_train: (n_train,) - training labels
        X_test: (n_test, 2k) - combined validation + test data

    Returns:
        X_labeled_all: (num_batches, 128, 2k) - labeled training data
        X_unlabeled_all: (num_batches, 128, 2k) - unlabeled test data
        y_labeled_all: (num_batches, 128) - labels for training data
    """
    n_train, n_test = len(X_train), len(X_test)
    num_batches = (n_train + 127) // 128  # Ceiling division for full coverage

    # SHUFFLE training indices for full coverage (no replacement)
    train_perm = torch.randperm(n_train)

    # PAD if needed to make complete batches of 128
    if n_train % 128 != 0:
        padding_needed = 128 - (n_train % 128)
        extra_indices = torch.randint(0, n_train, (padding_needed,))
        train_perm = torch.cat([train_perm, extra_indices])

    # Reshape into batches: (num_batches, 128)
    train_indices = train_perm.view(num_batches, 128)

    # RANDOM sampling for test data (with replacement)
    test_indices = torch.randint(0, n_test, (num_batches, 128))

    # Use vectorized indexing to get all batches at once
    # Your implementation here

    X_labeled_all = X_train[train_indices]
    #print('X labeled shape ', X_labeled_all.shape)

    X_unlabeled_all = X_test[test_indices]
    #print('X unlabeled shape ', X_unlabeled_all.shape)

    y_labeled_all = y_train[train_indices]
    #print('y labeled shape ', y_labeled_all.shape)

    return X_labeled_all, X_unlabeled_all, y_labeled_all

X_labeled_all, X_unlabeled_all, y_labeled_all = create_transductive_epoch(xTrain, yTrain, xTest)


X labeled shape  torch.Size([711, 128, 80])
X unlabeled shape  torch.Size([711, 128, 80])
y labeled shape  torch.Size([711, 128, 1])


In [165]:
X_labeled_all[710].shape

torch.Size([128, 80])


## <font color = 'blue'>  **Q7. Training Loop for Transductive Learning**

**[16 points]**

**Problem**: Now we need to implement and run a complete training loop that uses our transductive batch generation and trains the model from Q5.

**Requirements**:
- Use 30 epochs (as in previous questions)
- Use Adam optimizer with learning rate 0.001
- Try to move entire epoch to GPU, fall back to batch-by-batch if memory insufficient
- For each batch, compute loss only on the labeled portion (first 128 predictions)
- Calculate and print training accuracy for each individual batch
- Calculate and print average training accuracy for each epoch
- Return both per-epoch and per-batch accuracy lists for analysis

**Key Implementation Details**:
- Model forward pass: `logits = model(X_labeled_batch, X_unlabeled_batch)` returns (256, k)
- Training loss: `loss = criterion(logits[:128], y_labeled_batch)`
- Training accuracy: compare `logits[:128].argmax(dim=1)` with `y_labeled_batch`
- Keep running totals for epoch-level statistics
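
To see what "loss only on the labeled portion" implies, the sketch below runs one forward and backward pass on toy sizes. It is a simplified stand-in, not the Q5 model: the fixed layer normalization steps are left out, and `requires_grad=True` is set on the unlabeled inputs purely to inspect gradients. Even though no unlabeled row appears in the loss, the unlabeled inputs still receive a non-zero gradient, because the labeled rows attend to them.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d, k, n = 16, 5, 8                       # toy sizes; the assignment uses 2k=80, k=40, n=128

attn = nn.MultiheadAttention(embed_dim=d, num_heads=1)
head = nn.Linear(d, k)
criterion = nn.CrossEntropyLoss()

X_labeled = torch.randn(n, d)
X_unlabeled = torch.randn(n, d, requires_grad=True)   # tracked only so we can inspect .grad
y_labeled = torch.randint(0, k, (n,))                 # int64 class indices, shape (n,)

x = torch.cat([X_labeled, X_unlabeled], dim=0)        # (2n, d)
attended, _ = attn(x, x, x)
logits = head(attended)                               # (2n, k)

loss = criterion(logits[:n], y_labeled)               # loss on the labeled rows only
loss.backward()

accuracy = (logits[:n].argmax(dim=1) == y_labeled).float().mean().item()
unlabeled_grad_norm = X_unlabeled.grad.norm().item()
print(loss.item(), accuracy, unlabeled_grad_norm)
```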

**Implementation Guide:**

In [None]:
def train_transductive_model(model, X_train, y_train, X_test, num_epochs=30, device='cuda'):
    """
    Train the transductive transformer model

    Args:
        model: TransductiveTransformer from Q5
        X_train: (n_train, 2k) - training data
        y_train: (n_train,) - training labels
        X_test: (n_test, 2k) - combined validation + test data
        num_epochs: number of training epochs
        device: 'cuda' or 'cpu'

    Returns:
        epoch_accuracies: list of average training accuracy per epoch
        batch_accuracies: list of lists - training accuracy for each batch in each epoch
    """
    model = model.to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
    criterion = torch.nn.CrossEntropyLoss()

    epoch_accuracies = []
    batch_accuracies = []

    for epoch in range(num_epochs):
        print(f"\nEpoch {epoch+1}/{num_epochs}")

        # Generate batches for this epoch
        X_labeled_all, X_unlabeled_all, y_labeled_all = create_transductive_epoch(X_train, y_train, X_test)
        num_batches = X_labeled_all.size(0)

        # Move entire epoch to GPU if possible (memory permitting)
        try:
            X_labeled_all = X_labeled_all.to(device)
            X_unlabeled_all = X_unlabeled_all.to(device)
            y_labeled_all = y_labeled_all.to(device)
            print(f"Moved entire epoch ({num_batches} batches) to GPU")
        except RuntimeError as e:
            print(f"Cannot fit entire epoch on GPU: {e}")
            print("Will transfer batches individually")
            # Keep on CPU, transfer batch by batch

        # Training statistics for this epoch
        epoch_batch_accuracies = []
        total_correct = 0
        total_samples = 0

        for batch_idx in range(num_batches):
            # Get batch (either from GPU or transfer from CPU)
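            # Compare device types: a tensor on the GPU reports e.g. 'cuda:0', which is not equal to torch.device('cuda')
            if X_labeled_all.device.type == torch.device(device).type: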
                # Already on GPU
                X_labeled_batch = X_labeled_all[batch_idx]
                X_unlabeled_batch = X_unlabeled_all[batch_idx]
                y_labeled_batch = y_labeled_all[batch_idx]
            else:
                # Transfer batch to GPU
                X_labeled_batch = X_labeled_all[batch_idx].to(device)
                X_unlabeled_batch = X_unlabeled_all[batch_idx].to(device)
                y_labeled_batch = y_labeled_all[batch_idx].to(device)

            # Forward pass and training
            # Your implementation here
            model.train()
            logits = model(X_labeled_batch, X_unlabeled_batch) # through self attention, xTest data points are leveraged to enhance the model's capabilities.
            #print('Logits shape ',logits.shape)
            loss = criterion(logits[:128], y_labeled_batch.squeeze()) # backprop only on X_labeled (xTrain set), using this context, we evaluate on xTest.
            loss.backward() 
            optimizer.step()
            optimizer.zero_grad()

            # Calculate training accuracy (first 128 predictions - labeled portion)
            train_predictions = logits[:128]
            batch_accuracy = (train_predictions.argmax(dim=1) == y_labeled_batch.squeeze()).float().mean()

            # Update statistics
            epoch_batch_accuracies.append(batch_accuracy.item())
            total_correct += (train_predictions.argmax(dim=1) == y_labeled_batch.squeeze()).sum().item()
            total_samples += len(y_labeled_batch)

            print(f"  Batch {batch_idx+1}/{num_batches}: Training Accuracy = {batch_accuracy:.4f}")

        # Calculate epoch average accuracy
        epoch_accuracy = total_correct / total_samples
        epoch_accuracies.append(epoch_accuracy)
        batch_accuracies.append(epoch_batch_accuracies)

        print(f"Epoch {epoch+1} Average Training Accuracy: {epoch_accuracy:.4f}")

    return epoch_accuracies, batch_accuracies

tdModel = TransductiveModel()
epoch_acc, batch_acc = train_transductive_model(tdModel, xTrain, yTrain, xTest, 30, device=device)


Epoch 1/30
X labeled shape  torch.Size([711, 128, 80])
X unlabeled shape  torch.Size([711, 128, 80])
y labeled shape  torch.Size([711, 128, 1])
Moved entire epoch (711 batches) to GPU
  Batch 1/711: Training Accuracy = 0.0078
  Batch 2/711: Training Accuracy = 0.0391
  Batch 3/711: Training Accuracy = 0.0625
  Batch 4/711: Training Accuracy = 0.1562
  Batch 5/711: Training Accuracy = 0.0625
  Batch 6/711: Training Accuracy = 0.0625
  Batch 7/711: Training Accuracy = 0.0938
  Batch 8/711: Training Accuracy = 0.1641
  Batch 9/711: Training Accuracy = 0.1562
  Batch 10/711: Training Accuracy = 0.2031
  Batch 11/711: Training Accuracy = 0.1172
  Batch 12/711: Training Accuracy = 0.2109
  Batch 13/711: Training Accuracy = 0.2109
  Batch 14/711: Training Accuracy = 0.1875
  Batch 15/711: Training Accuracy = 0.2031
  Batch 16/711: Training Accuracy = 0.1953
  Batch 17/711: Training Accuracy = 0.1406
  Batch 18/711: Training Accuracy = 0.1641
  Batch 19/711: Training Accuracy = 0.1797
  Batch

In [183]:
fig = go.Figure()
x = list(range(1, len(epoch_acc) + 1))  # 1-indexed to match the printed epoch numbers
fig.add_trace(go.Scatter(x=x, y=epoch_acc, mode='lines', name='Epoch Accuracies'))
fig.update_layout(
    title='Epoch Accuracies',
    xaxis_title='Epoch',
    yaxis_title='Accuracy',
    legend_title='Legend'
)
fig.show()

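* Training accuracy converges. Whether it beats random guessing depends on the number of classes: with k balanced classes, chance level is 1/k (2.5% if the 80 input features correspond to 2k with k = 40), and the per-batch accuracies logged in the first epoch already reach about 20%.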
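
To put this curve in context, it helps to compare it against simple baselines. The sketch below uses a hypothetical helper, `chance_baselines`, which is not part of the assignment code. It assumes the labels are available as a flat list of class indices (for a tensor such as `yTrain`, something like `yTrain.reshape(-1).tolist()` would be needed) and returns the uniform-guessing accuracy and the majority-class accuracy.

```python
from collections import Counter

def chance_baselines(labels):
    """Return (uniform_chance, majority_class_accuracy) for a list of class labels."""
    counts = Counter(labels)
    # Guessing uniformly among the observed classes is right 1/k of the time.
    uniform = 1.0 / len(counts)
    # Always predicting the most frequent class is right this often.
    majority = max(counts.values()) / len(labels)
    return uniform, majority

# Toy labels with 4 classes, for illustration only (not the assignment data).
toy_labels = [0, 1, 1, 2, 3, 1, 0, 1]
uniform, majority = chance_baselines(toy_labels)
print(f"uniform chance: {uniform:.3f}, majority class: {majority:.3f}")
```

A model is only "worse than random guessing" if its accuracy falls below the first number; the second is the stricter baseline when classes are imbalanced.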

In [180]:
for i in range(len(batch_acc)):
    fig = go.Figure()
    x = list(range(1, len(batch_acc[i]) + 1))  # batch number within epoch i+1
    fig.add_trace(go.Scatter(x=x, y=batch_acc[i], mode='lines', name='Batch Accuracies'))
    fig.update_layout(
        title=f'Batch Accuracies (Epoch {i+1})',
        xaxis_title='Batch',
        yaxis_title='Accuracy',
        legend_title='Legend'
    )
    fig.show()

* As the number of epochs increases, the whole range of per-batch accuracies shifts upward.
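
The upward shift can also be checked numerically rather than by reading 30 separate plots. A minimal sketch, assuming `batch_acc` is a list with one list of per-batch accuracies (plain floats) per epoch, as returned by the training function; `summarize_batch_accuracies` is a hypothetical helper and the numbers below are toy values, not results from this run.

```python
from statistics import mean

def summarize_batch_accuracies(batch_acc):
    """Return one (min, mean, max) tuple per epoch."""
    return [(min(accs), mean(accs), max(accs)) for accs in batch_acc]

# Toy values for two epochs, for illustration only.
toy_batch_acc = [[0.10, 0.20, 0.15], [0.30, 0.40, 0.35]]
summary = summarize_batch_accuracies(toy_batch_acc)
for epoch, (lo, avg, hi) in enumerate(summary, start=1):
    print(f"Epoch {epoch}: min={lo:.4f} mean={avg:.4f} max={hi:.4f}")
```

If both the minimum and the maximum rise from epoch to epoch, the range is moving up as a whole rather than just widening.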

## <font color = 'blue'> **Q8. Test Accuracy Evaluation (Inductive)**

**Q8: [16 points]**

**Problem**: Now we need to evaluate our trained transductive model in an inductive setting: making predictions on held-out test data by pairing each batch of test points with a fresh random sample of training data.

**Requirements**:
- Use `model.eval()` and `torch.no_grad()` for proper evaluation
- Ensure full coverage of test data (every test point appears exactly once)
- For each test batch, sample 128 random training points as context
- The model sees training points as "labeled" and test points as "unlabeled"
- Predictions for test points come from the unlabeled branch (second 128 outputs)
- Calculate and print accuracy for each test batch
- Return overall test accuracy and per-batch accuracies

**Key Insight**: Each test point gets classified using a random sample of training data as context, simulating how the model would perform on truly unlabeled data.
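
The coverage requirement above interacts with fixed-size batches: when the test set size is not a multiple of 128, the last batch has to be padded, and the padded duplicates should not be counted in the accuracy. The sketch below shows one way to track this with a boolean mask. `make_test_batches` is a hypothetical helper, not part of the provided guideline, and the test set size of 300 is an arbitrary toy value.

```python
import torch

def make_test_batches(n_test, batch_size=128):
    """Return (indices, valid), each of shape (num_batches, batch_size).

    `indices` covers every test point exactly once where `valid` is True;
    entries where `valid` is False are random padding duplicates.
    """
    num_batches = (n_test + batch_size - 1) // batch_size  # ceiling division
    perm = torch.randperm(n_test)
    pad = num_batches * batch_size - n_test
    valid = torch.ones(num_batches * batch_size, dtype=torch.bool)
    if pad > 0:
        filler = torch.randint(0, n_test, (pad,))
        perm = torch.cat([perm, filler])
        valid[n_test:] = False  # mark the padded positions
    return perm.view(num_batches, batch_size), valid.view(num_batches, batch_size)

indices, valid = make_test_batches(n_test=300, batch_size=128)
covered = indices[valid]  # the real (non-padded) test indices
print(indices.shape, int(valid.sum()))
```

Inside an evaluation loop, the per-batch mask could then restrict the count to real test points, for example by summing a correctness tensor only where the mask is True and adding the number of True entries to the sample total.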

**Implementation Guideline:**

In [None]:
def evaluate_test_accuracy(model, X_train, y_train, X_test, y_test, device='cuda'):
    """
    Evaluate the model on test data using an inductive approach

    Args:
        model: trained TransductiveTransformer from Q5
        X_train: (n_train, 2k) - training data
        y_train: (n_train,) - training labels
        X_test: (n_test, 2k) - test data
        y_test: (n_test,) - test labels (for evaluation)
        device: 'cuda' or 'cpu'

    Returns:
        test_accuracy: overall accuracy on test set
        batch_accuracies: list of accuracy for each test batch
    """
    model.eval()  # Set to evaluation mode

    n_train, n_test = len(X_train), len(X_test)
    num_test_batches = (n_test + 127) // 128  # Ceiling division for full coverage

    # SHUFFLE test indices for full coverage (like training in Q6)
    test_perm = torch.randperm(n_test)

    # PAD if needed to make complete batches of 128
    if n_test % 128 != 0:
        padding_needed = 128 - (n_test % 128)
        extra_indices = torch.randint(0, n_test, (padding_needed,))
        test_perm = torch.cat([test_perm, extra_indices])

    # Reshape into batches: (num_test_batches, 128)
    test_indices = test_perm.view(num_test_batches, 128)

    # RANDOM sampling for training data (with replacement)
    train_indices = torch.randint(0, n_train, (num_test_batches, 128))

    # Vectorized indexing to get all batches
    X_test_all = X_test[test_indices]      # (num_test_batches, 128, 2k)
    y_test_all = y_test[test_indices]      # (num_test_batches, 128)
    X_train_all = X_train[train_indices]   # (num_test_batches, 128, 2k)
    y_train_all = y_train[train_indices]   # (num_test_batches, 128)

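    # Move to GPU if possible
    # Your implementation here

    try:
        # Move all four tensors together so a failure cannot leave them
        # split across devices
        moved = [t.to(device) for t in (X_test_all, y_test_all, X_train_all, y_train_all)]
        X_test_all, y_test_all, X_train_all, y_train_all = moved
    except RuntimeError:
        # Typically a CUDA out-of-memory error: leave the data where it is and
        # move one batch at a time inside the loop below
        print("Processing batches individually due to memory constraints")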


    batch_accuracies = []
    total_correct = 0
    total_samples = 0

    with torch.no_grad():  # No gradients needed for evaluation
        for batch_idx in range(num_test_batches):
            # Get batch data
            X_labeled_batch = X_train_all[batch_idx]  # Training points as "labeled"
            X_unlabeled_batch = X_test_all[batch_idx] # Test points as "unlabeled"
            y_labeled_batch = y_train_all[batch_idx]  # Training labels
            y_test_batch = y_test_all[batch_idx]      # Test labels (for evaluation)

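            # .to(device) is a no-op when the tensor is already on the device,
            # so no device check is needed (comparing tensor.device with a
            # string such as 'cuda' is not a reliable check)
            X_labeled_batch = X_labeled_batch.to(device)
            X_unlabeled_batch = X_unlabeled_batch.to(device)
            # Flatten labels to (128,) so the comparison below stays element-wise
            # even if y_test is stored with shape (n_test, 1)
            y_test_batch = y_test_batch.to(device).reshape(-1)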

            # Forward pass: model predicts on test points using training context
            logits = model(X_labeled_batch, X_unlabeled_batch)  # (256, k)

            # Test predictions are in the second half (unlabeled branch)
            test_predictions = logits[128:]  # (128, k)

            # Calculate batch accuracy
            correct = test_predictions.argmax(dim=1) == y_test_batch
            batch_accuracy = correct.float().mean()
            batch_accuracies.append(batch_accuracy.item())

            # Update totals
            total_correct += correct.sum().item()
            total_samples += len(y_test_batch)

            print(f"Test Batch {batch_idx+1}/{num_test_batches}: Accuracy = {batch_accuracy:.4f}")

    # Calculate overall test accuracy
    test_accuracy = total_correct / total_samples
    print(f"\nOverall Test Accuracy: {test_accuracy:.4f}")

    return test_accuracy, batch_accuracies

testAccuracy, batch_accs = evaluate_test_accuracy(tdModel, xTrain, yTrain, xTest, yTest, device=device)

Test Batch 1/613: Accuracy = 0.3672
Test Batch 2/613: Accuracy = 0.4766
Test Batch 3/613: Accuracy = 0.3594
Test Batch 4/613: Accuracy = 0.4766
Test Batch 5/613: Accuracy = 0.4531
Test Batch 6/613: Accuracy = 0.3594
Test Batch 7/613: Accuracy = 0.4531
Test Batch 8/613: Accuracy = 0.4531
Test Batch 9/613: Accuracy = 0.4453
Test Batch 10/613: Accuracy = 0.4062
Test Batch 11/613: Accuracy = 0.3750
Test Batch 12/613: Accuracy = 0.3359
Test Batch 13/613: Accuracy = 0.4297
Test Batch 14/613: Accuracy = 0.4609
Test Batch 15/613: Accuracy = 0.4062
Test Batch 16/613: Accuracy = 0.4062
Test Batch 17/613: Accuracy = 0.4219
Test Batch 18/613: Accuracy = 0.4141
Test Batch 19/613: Accuracy = 0.4375
Test Batch 20/613: Accuracy = 0.3906
Test Batch 21/613: Accuracy = 0.3906
Test Batch 22/613: Accuracy = 0.4531
Test Batch 23/613: Accuracy = 0.5156
Test Batch 24/613: Accuracy = 0.3594
Test Batch 25/613: Accuracy = 0.4531
Test Batch 26/613: Accuracy = 0.3984
Test Batch 27/613: Accuracy = 0.3906
Test Batch

In [188]:
fig = go.Figure()
x = list(range(1, len(batch_accs) + 1))  # 1-indexed to match the printed batch numbers
fig.add_trace(go.Scatter(x=x, y=batch_accs, mode='lines', name='Batch Accuracies'))
fig.update_layout(
    title='Batch Accuracies',
    xaxis_title='Batch',
    yaxis_title='Accuracy',
    legend_title='Legend'
)
fig.show()

One test batch exceeds 55% accuracy, which is the highest per-batch accuracy observed in this experiment.
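
A single batch of 128 points gives a noisy accuracy estimate, so the best batch out of several hundred is expected to sit well above the average. The sketch below makes that concrete under a simplifying assumption: each prediction is treated as an independent Bernoulli trial. Both helpers are hypothetical, and `toy_batch_accs` holds only the first five values printed above, not the full list.

```python
import math

def binomial_standard_error(p, n):
    """Standard error of an accuracy estimate p measured on n independent samples."""
    return math.sqrt(p * (1.0 - p) / n)

def best_batch(batch_accs):
    """Return (index, accuracy) of the highest-accuracy batch (first one on ties)."""
    best_idx = max(range(len(batch_accs)), key=batch_accs.__getitem__)
    return best_idx, batch_accs[best_idx]

# First five per-batch accuracies from the printed output above.
toy_batch_accs = [0.3672, 0.4766, 0.3594, 0.4766, 0.4531]
mean_acc = sum(toy_batch_accs) / len(toy_batch_accs)
se = binomial_standard_error(mean_acc, 128)
idx, acc = best_batch(toy_batch_accs)
print(f"mean={mean_acc:.4f}, standard error per batch={se:.4f}")
print(f"best batch: #{idx + 1} with accuracy {acc:.4f}")
```

Comparing the gap between the best batch and the overall accuracy with this standard error indicates whether the peak reflects anything beyond sampling noise.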