# Multi-Task Learning (MTL)
The task is to take a student's data as input and predict two targets:
1. Regression: the student's final grade (G3), an integer from 0 to 20.
2. Classification: whether the student is in a romantic relationship (romantic), a "yes" or "no" value.

The hypothesis is that a shared "student profile" (learned by the network's body) can help predict both academic performance and personal life.


## 1. Loading Data

In [None]:
import pandas as pd

# Load Data
file_path = "./data/student/student-mat.csv"

data = pd.read_csv(file_path, sep=';')
print('Loaded:', file_path)
print(data.head())
print(data.info())

### 1.1 Define Targets
We will use:
- Regression target: G3 (final grade, 0-20)
- Classification target: romantic ("yes"/"no") transformed to binary 1/0

In [None]:
# regression target
data['y_grade'] = data['G3'].astype(int)

# classification target: map yes/no to 1/0
data['y_romantic'] = data['romantic'].map({'yes': 1, 'no': 0}).astype(int)

print(data[['G3', 'y_grade', 'romantic', 'y_romantic']])

### 1.2 Quick Checks on Targets

In [None]:
# grade distribution
print('Grade (G3) distribution:')
print(data['y_grade'].describe())

# romantic balance
print('\nRomantic class balance:')
print(data['y_romantic'].value_counts(normalize=True))

## 2. Handle Categorical Features
Convert non-numeric columns into numeric form using one-hot encoding, so a binary yes/no column becomes a pair of 0/1 indicator columns. Numeric columns are standardized to zero mean and unit variance.

In [None]:
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
import numpy as np

# Separate categorical and numeric feature columns.
# Exclude the targets and their source columns (G3, romantic) so no target leaks into the features.
cat_cols = [c for c in data.columns if data[c].dtype == 'object' and c not in ['romantic']]
num_cols = [c for c in data.columns if data[c].dtype != 'object' and c not in ['G3', 'y_grade', 'y_romantic']]

print('Categorical columns:', cat_cols)
print('Numeric columns:', num_cols)

# Define preprocessing pipeline
preprocessor = ColumnTransformer([
    ('num', StandardScaler(), num_cols),
    ('cat', OneHotEncoder(handle_unknown='ignore'), cat_cols)
])

# Fit and transform data
X = preprocessor.fit_transform(data)
y_grade = data['y_grade'].values.astype(np.int8)
y_romantic = data['y_romantic'].values.astype(np.int8)

print('Processed feature shape:', X.shape)
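
To make the preprocessing step concrete, here is a minimal sketch on a made-up three-column frame. The column values and the names `train_df`, `new_df`, `toy_pre`, and `to_dense` are illustrative and not taken from the dataset. It fits the transformer on one frame and applies it to another, which also shows what `handle_unknown='ignore'` does with a category it has not seen.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data (hypothetical values, not rows from student-mat.csv)
train_df = pd.DataFrame({
    "age": [15, 16, 17, 18],
    "school": ["GP", "MS", "GP", "MS"],
    "sex": ["F", "M", "F", "F"],
})
new_df = pd.DataFrame({"age": [16], "school": ["XX"], "sex": ["M"]})  # "XX" is unseen

toy_pre = ColumnTransformer([
    ("num", StandardScaler(), ["age"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["school", "sex"]),
])

def to_dense(m):
    """Return a plain ndarray whether the transformer produced sparse or dense output."""
    return m.toarray() if hasattr(m, "toarray") else np.asarray(m)

X_toy_train = to_dense(toy_pre.fit_transform(train_df))  # statistics learned here only
X_toy_new = to_dense(toy_pre.transform(new_df))          # reused, not refit

print(X_toy_train.shape)  # 1 scaled column + 2 school + 2 sex indicators
print(X_toy_new)          # school indicators are all zero for the unseen "XX"
```

Fitting on the training frame and calling only `transform` on held-out rows keeps the scaler's mean and variance free of validation and test information.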

## 3. Split Data
Splitting the data ensures the model is trained on one subset (train), tuned on a second held-out subset (validation), and finally evaluated on a third subset it has never seen (test). This makes overfitting visible and gives an estimate of generalization performance.

We split into train, validation, and test sets in roughly 70/15/15 proportions.

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_temp, y_grade_train, y_grade_temp, y_romantic_train, y_romantic_temp = train_test_split(
    X, y_grade, y_romantic, test_size=0.3, random_state=42)

X_val, X_test, y_grade_val, y_grade_test, y_romantic_val, y_romantic_test = train_test_split(
    X_temp, y_grade_temp, y_romantic_temp, test_size=0.5, random_state=42)

print(f'Train: {X_train.shape}, Val: {X_val.shape}, Test: {X_test.shape}')
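
As a quick arithmetic check on the two chained splits, the sketch below reproduces the subset sizes without touching the data. It assumes the file has 395 rows (the usual size of `student-mat.csv`; adjust `n_rows` if yours differs) and that scikit-learn rounds the held-out portion up. The helper name `split_sizes` is made up for this illustration.

```python
import math

def split_sizes(n_rows, temp_frac=0.3, test_frac_of_temp=0.5):
    """Sizes of (train, val, test) for the two chained train_test_split calls.

    Assumes the held-out part of each split is rounded up and the
    remainder goes to the other part.
    """
    n_temp = math.ceil(n_rows * temp_frac)           # first call: test_size=0.3
    n_train = n_rows - n_temp
    n_test = math.ceil(n_temp * test_frac_of_temp)   # second call: test_size=0.5
    n_val = n_temp - n_test
    return n_train, n_val, n_test

n_rows = 395  # assumption: row count of student-mat.csv
train_n, val_n, test_n = split_sizes(n_rows)
print(train_n, val_n, test_n)  # 276 59 60
print([round(n / n_rows, 3) for n in (train_n, val_n, test_n)])
```

The printed sizes should match the first dimension of the shapes reported by the cell above.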

## 4. PyTorch Dataset Class
Create a custom dataset returning (x, y_grade, y_romantic) tuples.

In [None]:
import torch
from torch.utils.data import Dataset, DataLoader

class StudentDataset(Dataset):
    def __init__(self, X, y_grade, y_romantic):
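        # Densify sparse matrices (ColumnTransformer may return one) before building the tensor
        self.X = torch.tensor(X.toarray() if hasattr(X, 'toarray') else X, dtype=torch.float32)
        # Regression target as float32 with shape (N, 1), matching the regression head's output
        self.y_grade = torch.tensor(y_grade, dtype=torch.float32).unsqueeze(1)
        # Class indices as int64, the dtype CrossEntropyLoss expects
        self.y_romantic = torch.tensor(y_romantic, dtype=torch.long)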

    def __len__(self):
        return len(self.X)

    def __getitem__(self, idx):
        return self.X[idx], self.y_grade[idx], self.y_romantic[idx]

# Create datasets
dset_train = StudentDataset(X_train, y_grade_train, y_romantic_train)
dset_val = StudentDataset(X_val, y_grade_val, y_romantic_val)
dset_test = StudentDataset(X_test, y_grade_test, y_romantic_test)

# DataLoaders
train_loader = DataLoader(dset_train, batch_size=32, shuffle=True)
val_loader = DataLoader(dset_val, batch_size=32)
test_loader = DataLoader(dset_test, batch_size=32)

print('DataLoaders ready!')

In [None]:
# Verify one batch.
for xb, yb_grade, yb_romantic in train_loader:
    print('Batch X:', xb.shape)
    print('Batch y_grade:', yb_grade.shape)
    print('Batch y_romantic:', yb_romantic.shape)
    break

## 5. Building the Multi-Head Model
Define a neural network that shares a common body and branches into two separate heads: one for regression (predicting grades) and one for classification (predicting romantic status).

The shared body learns a general representation of the student data. Two heads specialize for their respective tasks:
- Regression head outputs a single scalar (predicted grade).
- Classification head outputs two values (for romantic status yes/no).

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiTaskModel(nn.Module):
    def __init__(self, input_dim, shared_dim=128, dropout_p=0.3):
        super(MultiTaskModel, self).__init__()
        
        # Shared body
        self.shared = nn.Sequential(
            nn.Linear(input_dim, shared_dim),
            nn.BatchNorm1d(shared_dim),
            nn.ReLU(),
            nn.Dropout(dropout_p),
            nn.Linear(shared_dim, shared_dim // 2),
            nn.BatchNorm1d(shared_dim // 2),
            nn.ReLU()
        )
        
        # Regression head (Grade prediction)
        self.head_regression = nn.Sequential(
            nn.Linear(shared_dim // 2, 64),
            nn.ReLU(),
            nn.Linear(64, 1)
        )
        
        # Classification head (Romantic status)
        self.head_classification = nn.Sequential(
            nn.Linear(shared_dim // 2, 64),
            nn.ReLU(),
            nn.Linear(64, 2)  # two values for binary classification
        )

    def forward(self, x):
        shared_features = self.shared(x)
        out_grade = self.head_regression(shared_features)
        out_romantic = self.head_classification(shared_features)
        return out_grade, out_romantic

In [None]:
# Model Initialization
input_dim = X_train.shape[1]
model = MultiTaskModel(input_dim=input_dim)
print(model)

In [None]:
# Forward Pass Check

xb, yb_grade, yb_romantic = next(iter(train_loader))

out_grade, out_romantic = model(xb)
print('Grade output shape:', out_grade.shape)
print('Romantic output shape:', out_romantic.shape)

## 6. Custom Training Loop
In this section, we train our multi-task network using two different loss functions — one for regression (MSELoss) and one for classification (CrossEntropyLoss). We combine them into a single total loss for optimization.

In [None]:
# Define Losses and Optimizer

import torch.optim as optim

# Define loss functions
criterion_regression = nn.MSELoss()          # for grade prediction
criterion_classification = nn.CrossEntropyLoss()  # for romantic status

# Optimizer
optimizer = optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-5)

In [None]:
# Training Loop

def train_epoch(model, loader, optimizer, criterion_reg, criterion_cls, device):
    model.train()
    total_loss, total_reg, total_cls = 0, 0, 0

    for xb, yb_grade, yb_romantic in loader:
        xb, yb_grade, yb_romantic = xb.to(device), yb_grade.to(device), yb_romantic.to(device)

        optimizer.zero_grad()

        # Forward pass
        out_grade, out_romantic = model(xb)

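        # Compute individual losses.
        # out_grade and yb_grade are both (batch, 1); keeping the shapes identical
        # avoids silent broadcasting to (batch, batch) inside MSELoss.
        loss_reg = criterion_reg(out_grade, yb_grade.float())
        loss_cls = criterion_cls(out_romantic, yb_romantic.long())

        # Combine losses. Two approaches:
        #   unweighted: loss = loss_reg + loss_cls
        #   weighted:   loss = alpha * loss_reg + (1 - alpha) * loss_cls  (alpha = 0.8 here)
        loss = 0.8 * loss_reg + (1 - 0.8) * loss_cls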

        # Backpropagation
        loss.backward()
        optimizer.step()

        total_loss += loss.item()
        total_reg += loss_reg.item()
        total_cls += loss_cls.item()

    n = len(loader)
    return total_loss/n, total_reg/n, total_cls/n
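
One thing worth verifying in a loop like this is that the regression prediction and target have identical shapes before they reach `MSELoss`. The sketch below uses made-up tensors (`pred` and `target` are illustrative names) to show how a `(batch,)` prediction paired with a `(batch, 1)` target is broadcast to a `(batch, batch)` grid, which yields a different loss value than the element-wise comparison.

```python
import warnings
import torch
import torch.nn as nn

mse = nn.MSELoss()

pred = torch.tensor([[10.0], [12.0], [8.0]])    # model output, shape (3, 1)
target = torch.tensor([[11.0], [12.0], [5.0]])  # targets, shape (3, 1)

# Matching shapes: element-wise squared error, averaged over 3 pairs
loss_ok = mse(pred, target)

# Mismatched shapes: (3,) vs (3, 1) broadcasts to (3, 3), so every
# prediction is compared against every target (PyTorch emits a warning)
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    loss_broadcast = mse(pred.squeeze(), target)

print(loss_ok.item())         # (1 + 0 + 9) / 3
print(loss_broadcast.item())  # mean over all 9 prediction/target pairs
```

A loss computed on the broadcast grid can still decrease during training, so the mismatch is easy to miss unless the shapes are checked explicitly.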

In [None]:
# Validation Loop

def validate_epoch(model, loader, criterion_reg, criterion_cls, device):
    model.eval()
    total_loss, total_reg, total_cls = 0, 0, 0

    with torch.no_grad():
        for xb, yb_grade, yb_romantic in loader:
            xb, yb_grade, yb_romantic = xb.to(device), yb_grade.to(device), yb_romantic.to(device)
            
            out_grade, out_romantic = model(xb)
            
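            # Both tensors are (batch, 1), so MSELoss compares element-wise without broadcasting
            loss_reg = criterion_reg(out_grade, yb_grade.float())
            loss_cls = criterion_cls(out_romantic, yb_romantic.long())

            # Note: unweighted sum, unlike the weighted training loss,
            # so the two totals are on different scales
            loss = loss_reg + loss_cls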

            total_loss += loss.item()
            total_reg += loss_reg.item()
            total_cls += loss_cls.item()

    n = len(loader)
    return total_loss/n, total_reg/n, total_cls/n

In [None]:
# Train & Validate Over Epochs

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)

num_epochs = 50

train_losses, val_losses = [], []
train_reg_losses, val_reg_losses = [], []
train_cls_losses, val_cls_losses = [], []

for epoch in range(num_epochs):
    tr_loss, tr_reg, tr_cls = train_epoch(model, train_loader, optimizer, criterion_regression, criterion_classification, device)
    val_loss, val_reg, val_cls = validate_epoch(model, val_loader, criterion_regression, criterion_classification, device)

    train_losses.append(tr_loss)
    val_losses.append(val_loss)
    train_reg_losses.append(tr_reg)
    val_reg_losses.append(val_reg)
    train_cls_losses.append(tr_cls)
    val_cls_losses.append(val_cls)

    print(f"Epoch {epoch+1}/{num_epochs} | Train Loss: {tr_loss:.4f} | Val Loss: {val_loss:.4f} | Val Reg: {val_reg:.4f} | Val Cls: {val_cls:.4f}")

In [None]:
# Plotting (optional)
import matplotlib.pyplot as plt

# --- Plot 1: Total Loss ---
plt.figure(figsize=(14, 4))
plt.subplot(1, 3, 1)
plt.plot(train_losses, label='Train Total Loss', color='blue')
plt.plot(val_losses, label='Val Total Loss', color='orange')
plt.title('Total Loss over Epochs')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.legend()

# --- Plot 2: Grade (Regression) Loss ---
plt.subplot(1, 3, 2)
plt.plot(train_reg_losses, label='Train Grade Loss', color='green')
plt.plot(val_reg_losses, label='Val Grade Loss', color='red')
plt.title('Grade (Regression) Loss')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.legend()

# --- Plot 3: Romantic (Classification) Loss ---
plt.subplot(1, 3, 3)
plt.plot(train_cls_losses, label='Train Romantic Loss', color='purple')
plt.plot(val_cls_losses, label='Val Romantic Loss', color='brown')
plt.title('Romantic (Classification) Loss')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.legend()

plt.tight_layout()
plt.show()

## 7. Evaluation & Analysis

In [None]:
import torch
from sklearn.metrics import mean_absolute_error, accuracy_score, f1_score

# Set model to evaluation mode
model.eval()

# Disable gradient computation
with torch.no_grad():
    # Forward pass on test set (inputs must live on the same device as the model)
    X_test_dense = X_test.toarray() if hasattr(X_test, 'toarray') else X_test
    x_test_tensor = torch.tensor(X_test_dense, dtype=torch.float32).to(device)
    y_grade_pred, y_romantic_pred = model(x_test_tensor)

    # --- Grade Prediction (Regression) ---
    # Convert predictions and targets to numpy (move to CPU first)
    y_grade_pred_np = y_grade_pred.squeeze(1).cpu().numpy()
    y_grade_true_np = y_grade_test

    # Calculate Mean Absolute Error
    mae = mean_absolute_error(y_grade_true_np, y_grade_pred_np)

    # --- Romantic Status (Classification) ---
    # Get predicted class (0 or 1)
    y_romantic_pred_class = torch.argmax(y_romantic_pred, dim=1).cpu().numpy()
    y_romantic_true_np = y_romantic_test

    # Calculate Accuracy
    accuracy = accuracy_score(y_romantic_true_np, y_romantic_pred_class)

    # Calculate F1-Score for 'yes' class (1 = 'yes')
    f1 = f1_score(y_romantic_true_np, y_romantic_pred_class, pos_label=1)

# --- Final Report ---
print("=== Test Set Evaluation ===")

print("-"*50)
for name, param in model.named_parameters():
    print(name, param.detach().cpu().numpy().flatten())
print("-"*50)

print(f"Grade Prediction (MAE): {mae:.4f}")
print(f"Romantic Status (Accuracy): {accuracy:.4f}")
print(f"Romantic Status (F1-Score): {f1:.4f}")


## 8. Results For First Pass

```
=== Test Set Evaluation ===
Grade Prediction (MAE): 3.7891
Romantic Status (Accuracy): 0.5333
Romantic Status (F1-Score): 0.3333
```

The grade prediction is off by about 3.8 points on average, roughly 19% of the maximum grade of 20. The model correctly predicts romantic status for only about 53% of the students, which is barely better than random guessing; a fairer yardstick is the majority-class rate printed in Section 1.2. The F1-score of 0.33 tells us it's performing poorly on the minority class ("yes"), missing many true positives.
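
To judge numbers like these, it helps to compare against trivial baselines: always predicting the training-set mean grade, and always predicting the most frequent romantic class. The sketch below is one way to compute them with NumPy. `baseline_scores` is a hypothetical helper and the small arrays are made up for illustration; in the notebook you would pass `y_grade_train`, `y_grade_test`, `y_romantic_train`, and `y_romantic_test` instead.

```python
import numpy as np

def baseline_scores(y_grade_train, y_grade_test, y_rom_train, y_rom_test):
    """MAE of a constant mean-grade predictor and accuracy of a majority-class predictor."""
    mean_grade = np.mean(y_grade_train)
    mae = np.mean(np.abs(np.asarray(y_grade_test) - mean_grade))

    majority = np.bincount(np.asarray(y_rom_train)).argmax()
    acc = np.mean(np.asarray(y_rom_test) == majority)
    return float(mae), float(acc)

# Made-up example values, not results from the dataset
mae_base, acc_base = baseline_scores(
    y_grade_train=[10, 12, 8, 14],  # mean = 11
    y_grade_test=[11, 15, 9],
    y_rom_train=[0, 0, 1, 0],       # majority class = 0
    y_rom_test=[0, 1, 0],
)
print(mae_base, acc_base)
```

A model only adds value on a task if it beats the corresponding baseline.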


## 9. How to Improve?

The main idea is to give a different “importance” to each task when computing the total loss. The first pass used an unweighted sum: `loss = loss_reg + loss_cls`.

Here, loss_reg (MSE) might be 25 and loss_cls (CrossEntropy) might be 0.6, so the network focuses mostly on grade prediction. To fix this, you can introduce a weighting hyperparameter alpha (between 0.0 and 1.0) and compute: `loss = alpha * loss_reg + (1 - alpha) * loss_cls`.
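
To see what the weighting does to the magnitudes quoted above (25 for MSE and 0.6 for cross-entropy, both illustrative), the following sketch computes the weighted total and the share contributed by the regression term for a few values of alpha. The function name `weighted_loss` is made up here; the same expression works on PyTorch loss tensors.

```python
def weighted_loss(loss_reg, loss_cls, alpha):
    """Convex combination of the two task losses; alpha is the regression weight."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * loss_reg + (1 - alpha) * loss_cls

loss_reg, loss_cls = 25.0, 0.6  # illustrative magnitudes from the text

for alpha in (0.2, 0.5, 0.8):
    total = weighted_loss(loss_reg, loss_cls, alpha)
    reg_share = alpha * loss_reg / total
    print(f"alpha={alpha}: total={total:.2f}, regression share={reg_share:.1%}")

# Alpha at which both terms contribute equally: alpha * reg == (1 - alpha) * cls
alpha_equal = loss_cls / (loss_reg + loss_cls)
print(f"equal contribution at alpha={alpha_equal:.4f}")
```

With losses on such different scales, the regression term still supplies most of the total even at alpha = 0.2, so values well below 0.2 may be needed before the classification term carries comparable weight.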

How alpha affected MAE, accuracy, and F1:

| Alpha | MAE    | Accuracy | F1     |
| ----- | ------ | -------- | ------ |
| 0.2   | 3.8320 | 0.5667   | 0.3500 |
| 0.5   | 3.6765 | 0.5667   | 0.3158 |
| 0.8   | 3.7729 | 0.4833   | 0.2439 |

Interpretation:

- Alpha = 0.2 → more weight on romantic status (classification):
    - F1 is highest (0.3500), while MAE is the highest of the three runs (3.8320).
    - Classification improves because the network “cares more” about this task.

- Alpha = 0.5 → equal weights on the two loss terms:
    - MAE is the lowest (3.6765), and F1 is slightly lower than at alpha = 0.2 (0.3158).
    - Accuracy is unchanged from alpha = 0.2 (0.5667), so this is the most balanced of the three settings.

- Alpha = 0.8 → more weight on grade prediction (regression):
    - MAE (3.7729) sits between the other two runs, but F1 drops to its lowest value (0.2439) and accuracy falls to 0.4833.
    - The network prioritizes reducing regression error at the expense of classification performance.

Conclusion:

Weighted loss lets you control which task the model prioritizes, and you can tune alpha depending on whether you care more about grades or romantic status. This is exactly why multi-task learning often requires careful loss balancing.
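
When comparing alphas, it is worth keeping the size of the test set in mind. As a rough sketch, the snippet below estimates the binomial standard error of an accuracy measured on 60 test students. The figure 60 is an assumption derived from a 395-row file and the 70/15/15 split, and `accuracy_std_error` is a hypothetical helper.

```python
import math

def accuracy_std_error(accuracy, n_samples):
    """Binomial standard error of an accuracy estimated from n_samples predictions."""
    return math.sqrt(accuracy * (1 - accuracy) / n_samples)

n_test = 60  # assumption: test-set size
for alpha, acc in [(0.2, 0.5667), (0.5, 0.5667), (0.8, 0.4833)]:
    se = accuracy_std_error(acc, n_test)
    print(f"alpha={alpha}: accuracy={acc:.4f} +/- {se:.4f}")
```

Under this assumption each accuracy carries an uncertainty of about 0.06, comparable to the gaps between rows in the table, so repeating the runs with several random seeds would make the comparison more reliable.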