
# ENEL 645 - Assignment 2
### Multi-Modal Garbage Classification using Image and Text Data

## Group 13 Team Members
- Lana Oreoluwa (30270508)
- Laxmi Paudel (30243739)
- Ayodele Oluwabusola (30228072)
- Taiwo Oyedele (30224753)

## 1. Introduction
This project builds a multi-modal classification system for garbage classification that combines image and text data. It integrates the visual content of each image with the textual description embedded in the image's file name to predict one of four class labels. The objective is a deep learning model that leverages both image and text features for accurate predictions.

### Load Necessary Libraries

In [None]:
import os
import re
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
from torchvision import models, transforms
from transformers import DistilBertModel, DistilBertTokenizer
from PIL import Image
from sklearn.metrics import confusion_matrix, classification_report
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
# Set device
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")

### Load Dataset
The dataset is loaded from the cluster directory "/work/TALC/enel645_2025w/garbage_data", which holds separate training, validation, and test splits.

In [None]:
# Paths (Replace with actual dataset paths)
TRAIN_PATH = "/work/TALC/enel645_2025w/garbage_data/CVPR_2024_dataset_Train"
VAL_PATH = "/work/TALC/enel645_2025w/garbage_data/CVPR_2024_dataset_Val"
TEST_PATH = "/work/TALC/enel645_2025w/garbage_data/CVPR_2024_dataset_Test"

### Text Preprocessing

The function below extracts text from an image file name: it removes the file extension, replaces underscores with spaces, and deletes all digits. The resulting string is the text input to the model.

In [None]:
# Text Preprocessing
def extract_text_from_path(file_path):
    file_name = os.path.basename(file_path)
    file_name_no_ext, _ = os.path.splitext(file_name)
    text = file_name_no_ext.replace('_', ' ')
    return re.sub(r'\d+', '', text)  
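
To make the behaviour concrete, the sketch below applies the same steps to a made-up path (the path and file name are illustrative assumptions, not taken from the dataset). Note that deleting the digits leaves a trailing space, which a `.strip()` would remove if that matters downstream.

```python
import os
import re


def extract_text_from_path(file_path):
    # Same steps as the cell above: drop the directory and extension,
    # turn underscores into spaces, then delete every run of digits.
    file_name = os.path.basename(file_path)
    file_name_no_ext, _ = os.path.splitext(file_name)
    text = file_name_no_ext.replace('_', ' ')
    return re.sub(r'\d+', '', text)


# Hypothetical path, used only for illustration.
example_path = "/tmp/Blue/plastic_bottle_123.png"
raw_text = extract_text_from_path(example_path)
clean_text = raw_text.strip()

print(repr(raw_text))    # 'plastic bottle '
print(repr(clean_text))  # 'plastic bottle'
```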

### Load Dataset Paths

This function loads image paths, the associated text (extracted from each image's file name), and labels from a given directory. It iterates through the class subdirectories, collects the image files, extracts text from each file name, and assigns an integer label based on the class folder.

In [None]:
# Load dataset paths
def load_data_from_path(root_path):
    image_paths, texts, labels = [], [], []
    class_folders = sorted(os.listdir(root_path))
    label_map = {class_name: idx for idx, class_name in enumerate(class_folders)}

    for class_name in class_folders:
        class_path = os.path.join(root_path, class_name)
        if os.path.isdir(class_path):
            for file_name in os.listdir(class_path):
                file_path = os.path.join(class_path, file_name)
                if file_name.lower().endswith(('.png', '.jpg', '.jpeg')):
                    image_paths.append(file_path)
                    texts.append(extract_text_from_path(file_path))
                    labels.append(label_map[class_name])
    return image_paths, texts, labels
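
The label assignment can be sketched on a throwaway directory tree. The class folder names and file names below are placeholders chosen for illustration; the real names come from the dataset on the cluster.

```python
import os
import tempfile

# Placeholder class names, deliberately listed out of order.
class_names = ["Green", "Black", "TTR", "Blue"]

with tempfile.TemporaryDirectory() as root:
    for name in class_names:
        os.makedirs(os.path.join(root, name))
        # One empty placeholder file per class so there is something to find.
        open(os.path.join(root, name, f"item_{name.lower()}_1.jpg"), "w").close()

    # Same rule as load_data_from_path: sort the folder names, then enumerate.
    class_folders = sorted(os.listdir(root))
    label_map = {class_name: idx for idx, class_name in enumerate(class_folders)}

    files_per_class = {
        name: sorted(os.listdir(os.path.join(root, name))) for name in class_folders
    }

print(label_map)
print(files_per_class["Blue"])
```

Because the labels follow the sorted folder names, the training, validation, and test splits receive the same mapping only if each split contains the same set of class folders.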

### Image Transformation Pipeline for Preprocessing

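Defines two image transformation pipelines using torchvision.transforms. Both resize images to 224x224 pixels, convert them to tensors, and normalize the pixel values with the ImageNet mean and standard deviation. The training pipeline additionally applies random horizontal flips, random rotations of up to 20 degrees, and color jitter for data augmentation.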

In [None]:
# Image Transformations
train_transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.RandomHorizontalFlip(),
    transforms.RandomRotation(20),
    transforms.ColorJitter(0.2, 0.2, 0.2, 0.1),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
])

In [None]:
test_transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
])
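
As a small worked example of what the normalization step computes, the sketch below applies `(value - mean) / std` to single RGB pixels in plain Python. The helper `normalize_pixel` is a hypothetical name introduced here; it only mirrors the per-channel arithmetic and is not part of torchvision.

```python
# Channel statistics passed to transforms.Normalize in the cells above.
MEAN = [0.485, 0.456, 0.406]
STD = [0.229, 0.224, 0.225]


def normalize_pixel(rgb):
    """Apply (value - mean) / std per channel to one RGB pixel in [0, 1]."""
    return [(v - m) / s for v, m, s in zip(rgb, MEAN, STD)]


black = normalize_pixel([0.0, 0.0, 0.0])
white = normalize_pixel([1.0, 1.0, 1.0])
mean_pixel = normalize_pixel(MEAN)

print([round(v, 3) for v in black])
print([round(v, 3) for v in white])
print(mean_pixel)  # [0.0, 0.0, 0.0]
```

A pixel equal to the channel means maps to zero, so after normalization the inputs are roughly centered, which matches what the pretrained ResNet-50 weights expect.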

### Dataset Class for MultiModal Input

This class handles the loading and preprocessing of images and text for a multi-modal model. It applies transformations to images and tokenizes the text, returning them alongside the corresponding labels for each sample in the dataset.

In [None]:
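# Dataset class
# Note: the class name shadows the imported torch.utils.data.Dataset. This still works
# because the base class is resolved before the name is rebound.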
class Dataset(Dataset):
    def __init__(self, image_paths, texts, labels, tokenizer, transform, max_len=24):
        self.image_paths = image_paths
        self.texts = texts
        self.labels = labels
        self.tokenizer = tokenizer
        self.transform = transform
        self.max_len = max_len

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        # Load image
        image = Image.open(self.image_paths[idx]).convert('RGB')
        image = self.transform(image)

        # Process text
        encoding = self.tokenizer(
            self.texts[idx], padding='max_length', truncation=True,
            max_length=self.max_len, return_tensors='pt'
        )

        return {
            'image': image,
            'input_ids': encoding['input_ids'].squeeze(0),
            'attention_mask': encoding['attention_mask'].squeeze(0),
            'label': torch.tensor(self.labels[idx], dtype=torch.long)
        }
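
The tokenizer call uses `padding='max_length'` and `truncation=True`, so every sample yields `input_ids` and `attention_mask` of length `max_len`. The sketch below mimics only that padding and truncation logic in plain Python; it is not the Hugging Face implementation. The word-piece ids are placeholders, while 101, 102, and 0 are the `[CLS]`, `[SEP]`, and `[PAD]` ids of the uncased BERT vocabulary that DistilBERT shares.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # special-token ids in the uncased BERT vocabulary


def pad_or_truncate(token_ids, max_len=24):
    """Mimic padding='max_length' with truncation=True for a single sequence."""
    # Reserve two positions for [CLS] and [SEP], then cut the rest if needed.
    body = token_ids[: max_len - 2]
    input_ids = [CLS_ID] + body + [SEP_ID]
    attention_mask = [1] * len(input_ids)
    # Pad on the right up to max_len; padded positions get mask 0.
    pad = max_len - len(input_ids)
    return input_ids + [PAD_ID] * pad, attention_mask + [0] * pad


# Placeholder word-piece ids (not real vocabulary lookups).
short_ids, short_mask = pad_or_truncate([2001, 2002, 2003])
long_ids, long_mask = pad_or_truncate(list(range(3000, 3040)))

print(len(short_ids), sum(short_mask))  # 24 5
print(len(long_ids), sum(long_mask))    # 24 24
```

Fixed-length outputs are what allow the default `DataLoader` collation to stack samples into a batch without a custom collate function.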

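### Fusion Module

This module fuses the image and text feature vectors with a learned gate. A linear layer maps the concatenated features to a single score per sample; the sigmoid of that score is a weight in (0, 1) that scales the image features, and its complement scales the text features.

In [None]: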
# Fusion Module
class Fusion(nn.Module):
    def __init__(self, img_dim=1024, text_dim=1024):
        super().__init__()
        self.attn = nn.Linear(img_dim + text_dim, 1)

    def forward(self, img_features, text_features):
        weights = torch.sigmoid(self.attn(torch.cat([img_features, text_features], dim=1)))
        return weights * img_features + (1 - weights) * text_features
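
The gating arithmetic can be checked by hand on tiny vectors. The following is a dependency-free sketch with hand-picked gate weights, not the trained module; `gated_fusion` is a hypothetical helper that mirrors the forward pass for a single sample.

```python
import math


def gated_fusion(img_features, text_features, weights, bias=0.0):
    """Scalar-gated fusion of two equally sized feature vectors.

    gate  = sigmoid(weights . [img; text] + bias)
    fused = gate * img + (1 - gate) * text
    """
    concat = img_features + text_features  # list concatenation
    score = sum(w * x for w, x in zip(weights, concat)) + bias
    gate = 1.0 / (1.0 + math.exp(-score))
    fused = [gate * i + (1.0 - gate) * t for i, t in zip(img_features, text_features)]
    return gate, fused


# Toy 3-dimensional features (illustrative only).
img = [1.0, 2.0, 3.0]
txt = [3.0, 2.0, 1.0]

# Zero weights and zero bias give a gate of 0.5, i.e. a plain average.
gate_even, fused_even = gated_fusion(img, txt, weights=[0.0] * 6)
# A large positive bias pushes the gate towards 1, i.e. towards the image features.
gate_img, fused_img = gated_fusion(img, txt, weights=[0.0] * 6, bias=10.0)

print(gate_even, fused_even)  # 0.5 [2.0, 2.0, 2.0]
print(round(gate_img, 4))
```

Since the linear layer has one output, a single gate value per sample is broadcast across all feature dimensions; a per-dimension gate would need an output size equal to the feature size.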

### MultiModalClassifier Model

This class defines a multi-modal classifier that uses ResNet-50 for image feature extraction and DistilBERT for text encoding. Each modality is first projected to 1024 dimensions by its own fully connected layer with batch normalization and ReLU. The two projected feature vectors are then fused and passed through a final fully connected layer to predict the class.

In [None]:
# Model
class MultiModalClassifier(nn.Module):
    def __init__(self, num_classes=4):
        super().__init__()
        self.image_model = models.resnet50(weights="IMAGENET1K_V2")
        self.image_model = nn.Sequential(*list(self.image_model.children())[:-1])
        self.image_fc = nn.Sequential(
            nn.Linear(2048, 1024),
            nn.BatchNorm1d(1024),
            nn.ReLU()
        )

        self.text_model = DistilBertModel.from_pretrained('distilbert-base-uncased')
        self.text_fc = nn.Sequential(
            nn.Linear(768, 1024),
            nn.BatchNorm1d(1024), 
            nn.ReLU()
        )

        self.fusion = Fusion(img_dim=1024, text_dim=1024)
        self.classifier = nn.Linear(1024, num_classes)

    def forward(self, images, input_ids, attention_mask):
        # Flatten the pooled (N, 2048, 1, 1) output to (N, 2048); unlike squeeze(),
        # this keeps the batch dimension even when N == 1
        img_features = torch.flatten(self.image_model(images), 1)
        img_features = self.image_fc(img_features)
        
        text_output = self.text_model(input_ids=input_ids, attention_mask=attention_mask)[0]
        text_features = self.text_fc(text_output[:, 0, :])
        combined_features = self.fusion(img_features, text_features)
        return self.classifier(combined_features)
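
To get a feel for the size of the layers added on top of the two pretrained backbones, the sketch below counts their learnable parameters from the layer dimensions in the class above. The helper functions are hypothetical names introduced for this calculation; the backbones themselves are not counted.

```python
def linear_params(in_features, out_features):
    """Weights plus biases of a fully connected layer."""
    return in_features * out_features + out_features


def batchnorm_params(num_features):
    """Learnable scale and shift of a BatchNorm1d layer."""
    return 2 * num_features


heads = {
    "image_fc": linear_params(2048, 1024) + batchnorm_params(1024),
    "text_fc": linear_params(768, 1024) + batchnorm_params(1024),
    "fusion.attn": linear_params(1024 + 1024, 1),
    "classifier": linear_params(1024, 4),
}
total_head_params = sum(heads.values())

for name, count in heads.items():
    print(f"{name:12s} {count:>9,d}")
print(f"{'total':12s} {total_head_params:>9,d}")
```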

### Model Training and Validation

Trains the model for a specified number of epochs and reports the training and validation loss and accuracy after each one. Whenever the validation loss improves, the model weights are saved, and the best checkpoint is reloaded once training finishes. The learning-rate scheduler is stepped once per epoch.

In [None]:
# Training and Evaluation Functions
def train(model, train_loader, val_loader, criterion, optimizer, scheduler, epochs=5):
    best_loss = float('inf')

    for epoch in range(epochs):
        model.train()
        train_loss, correct_train, total_train = 0, 0, 0

        for batch in train_loader:
            images, input_ids, attn_mask, labels = batch['image'].to(device), batch['input_ids'].to(device), batch['attention_mask'].to(device), batch['label'].to(device)

            optimizer.zero_grad()
            outputs = model(images, input_ids, attn_mask)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()

            train_loss += loss.item()
            predictions = outputs.argmax(dim=1)
            correct_train += (predictions == labels).sum().item()
            total_train += labels.size(0)

        train_acc = correct_train / total_train
        train_loss /= len(train_loader)

        model.eval()
        val_loss, correct_val, total_val = 0, 0, 0

        with torch.no_grad():
            for batch in val_loader:
                images, input_ids, attn_mask, labels = batch['image'].to(device), batch['input_ids'].to(device), batch['attention_mask'].to(device), batch['label'].to(device)
                outputs = model(images, input_ids, attn_mask)
                loss = criterion(outputs, labels)
                val_loss += loss.item()

                predictions = outputs.argmax(dim=1)
                correct_val += (predictions == labels).sum().item()
                total_val += labels.size(0)

        val_acc = correct_val / total_val
        val_loss /= len(val_loader)

        print(f"Epoch {epoch+1}: Train Loss: {train_loss:.4f}, Train Acc: {train_acc:.4f} | Val Loss: {val_loss:.4f}, Val Acc: {val_acc:.4f}")

        scheduler.step()

        if val_loss < best_loss:
            best_loss = val_loss
            torch.save(model.state_dict(), 'best_multimodal_output.pth')
            print("Saved Best Model!")

    model.load_state_dict(torch.load('best_multimodal_output.pth'))
    return model
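
The checkpointing rule can be traced without running any training. The sketch below replays the rule on the five validation losses reported in the Output section of this notebook; the variable names are illustrative.

```python
# Validation losses copied from the training log in the Output section.
val_losses = [0.3742, 0.3232, 0.2978, 0.3145, 0.3227]

best_loss = float("inf")
best_epoch = None
saved_epochs = []

for epoch, val_loss in enumerate(val_losses, start=1):
    # Same rule as in train(): keep a checkpoint only on a strict improvement.
    if val_loss < best_loss:
        best_loss = val_loss
        best_epoch = epoch
        saved_epochs.append(epoch)

print(saved_epochs)           # [1, 2, 3]
print(best_epoch, best_loss)  # 3 0.2978
```

This matches the log, where "Saved Best Model!" appears after epochs 1 to 3 only, so the weights evaluated on the test set are those from epoch 3.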

### Model Evaluation
Evaluates the model on the test dataset by calculating accuracy, generating a classification report, and displaying a confusion matrix. It predicts labels, compares them to actual labels, and visualizes the results with a heatmap.

In [None]:
def evaluate(model, test_loader):
    model.eval()
    all_predictions, all_labels = [], []

    with torch.no_grad():
        for batch in test_loader:
            images, input_ids, attn_mask, labels = batch['image'].to(device), batch['input_ids'].to(device), batch['attention_mask'].to(device), batch['label'].to(device)
            outputs = model(images, input_ids, attn_mask)
            preds = outputs.argmax(dim=1)

            all_predictions.extend(preds.cpu().numpy())
            all_labels.extend(labels.cpu().numpy())

    accuracy = (np.array(all_predictions) == np.array(all_labels)).mean()
    print(f"Test Accuracy: {accuracy:.4f}")
    print("Classification Report:")

    print(classification_report(all_labels, all_predictions))

    cm = confusion_matrix(all_labels, all_predictions)
    plt.figure(figsize=(8, 6))
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
    plt.xlabel('Predicted')
    plt.ylabel('Actual')
    plt.title('Confusion Matrix')
    plt.show()

    return all_predictions, all_labels
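
The per-class numbers in the classification report follow directly from the confusion matrix. As a minimal sketch, the code below derives precision, recall, and accuracy from a made-up 3-class matrix; the matrix values and the helper `per_class_metrics` are illustrative assumptions, not results from this model.

```python
def per_class_metrics(cm):
    """Return (precision, recall) lists from a square confusion matrix.

    Rows are true classes and columns are predicted classes, which is the
    layout produced by sklearn.metrics.confusion_matrix.
    """
    n = len(cm)
    precision, recall = [], []
    for k in range(n):
        true_pos = cm[k][k]
        predicted_k = sum(cm[r][k] for r in range(n))  # column sum
        actual_k = sum(cm[k])                          # row sum
        precision.append(true_pos / predicted_k if predicted_k else 0.0)
        recall.append(true_pos / actual_k if actual_k else 0.0)
    return precision, recall


# Made-up 3-class matrix, for illustration only.
cm = [
    [8, 2, 0],
    [1, 7, 2],
    [1, 1, 8],
]
precision, recall = per_class_metrics(cm)
accuracy = sum(cm[k][k] for k in range(3)) / sum(map(sum, cm))

print([round(p, 3) for p in precision])  # [0.8, 0.7, 0.8]
print([round(r, 3) for r in recall])     # [0.8, 0.7, 0.8]
print(round(accuracy, 3))                # 0.767
```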

### Main Execution and Model Training

This section loads datasets, initializes the multi-modal model, sets up data loaders, defines the optimizer, scheduler, and loss function, and runs training and evaluation on the test set.

In [None]:
# Main Execution
if __name__ == "__main__":
    # Load tokenizer
    tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')

    # Load datasets
    train_image_paths, train_texts, train_labels = load_data_from_path(TRAIN_PATH)
    val_image_paths, val_texts, val_labels = load_data_from_path(VAL_PATH)
    test_image_paths, test_texts, test_labels = load_data_from_path(TEST_PATH)

    train_dataset = Dataset(train_image_paths, train_texts, train_labels, tokenizer, train_transform)
    val_dataset = Dataset(val_image_paths, val_texts, val_labels, tokenizer, test_transform)
    test_dataset = Dataset(test_image_paths, test_texts, test_labels, tokenizer, test_transform)

    # Create data loaders
    train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True, num_workers=4)
    val_loader = DataLoader(val_dataset, batch_size=32, shuffle=False, num_workers=4)
    test_loader = DataLoader(test_dataset, batch_size=32, shuffle=False, num_workers=4)

    # Initialize model
    model = MultiModalClassifier().to(device)

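    # Define optimizer and scheduler
    # NOTE: only the two backbones and the classifier are passed to the optimizer;
    # image_fc, text_fc, and fusion are not listed, so their weights are never updated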
    optimizer = optim.AdamW([
        {"params": model.image_model.parameters(), "lr": 1e-4},
        {"params": model.text_model.parameters(), "lr": 5e-6},
        {"params": model.classifier.parameters(), "lr": 1e-4},
    ])
    scheduler = optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=10, eta_min=1e-6)

    # Define loss function
    criterion = nn.CrossEntropyLoss()

    # Train the model
    print("Training...")
    model = train(model, train_loader, val_loader, criterion, optimizer, scheduler, epochs=5)

    # Evaluate on test set
    print("Evaluating on test set...")
    evaluate(model, test_loader)
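
The effect of the scheduler settings can be estimated with the closed-form cosine annealing formula from the PyTorch documentation. The sketch below evaluates that formula in plain Python for the base learning rates used above; `cosine_annealing_lr` is a hypothetical helper, and the values are computed from the formula rather than read from a real scheduler.

```python
import math


def cosine_annealing_lr(base_lr, epoch, t_max=10, eta_min=1e-6):
    """Closed-form cosine annealing: learning rate after `epoch` scheduler steps."""
    return eta_min + (base_lr - eta_min) * (1 + math.cos(math.pi * epoch / t_max)) / 2


# Base rates of the image/classifier groups and of the text group.
for base_lr in (1e-4, 5e-6):
    schedule = [cosine_annealing_lr(base_lr, epoch) for epoch in range(5)]
    print(base_lr, [f"{lr:.2e}" for lr in schedule])

lr_after_five_steps = cosine_annealing_lr(1e-4, 5)
print(f"{lr_after_five_steps:.3e}")  # 5.050e-05
```

With `T_max=10` and only 5 epochs, the schedule covers half of the cosine curve, so the learning rate ends near the midpoint between the base rate and `eta_min` instead of reaching `eta_min`.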

### Output

In [None]:
Using device: cuda
Training...
Epoch 1: Train Loss: 0.6203, Train Acc: 0.7689 | Val Loss: 0.3742, Val Acc: 0.8711
Saved Best Model!
Epoch 2: Train Loss: 0.3300, Train Acc: 0.8782 | Val Loss: 0.3232, Val Acc: 0.8828
Saved Best Model!
Epoch 3: Train Loss: 0.2556, Train Acc: 0.9082 | Val Loss: 0.2978, Val Acc: 0.8972
Saved Best Model!
Epoch 4: Train Loss: 0.1896, Train Acc: 0.9337 | Val Loss: 0.3145, Val Acc: 0.8906
Epoch 5: Train Loss: 0.1451, Train Acc: 0.9501 | Val Loss: 0.3227, Val Acc: 0.8944
Evaluating on test set...
Test Accuracy: 0.8540
Classification Report:
              precision    recall  f1-score   support

           0       0.80      0.73      0.76       695
           1       0.81      0.90      0.85      1086
           2       0.92      0.94      0.93       799
           3       0.89      0.81      0.85       852

    accuracy                           0.85      3432
   macro avg       0.86      0.85      0.85      3432
weighted avg       0.85      0.85      0.85      3432
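
The summary rows of the report can be cross-checked against the per-class rows. The sketch below recomputes the macro and weighted averages from the two-decimal values printed above, so agreement with the reported averages is only expected to within about 0.01.

```python
# Per-class rows copied from the classification report above.
precision = [0.80, 0.81, 0.92, 0.89]
recall = [0.73, 0.90, 0.94, 0.81]
support = [695, 1086, 799, 852]

total = sum(support)

# Macro average: unweighted mean over classes.
macro_precision = sum(precision) / len(precision)
macro_recall = sum(recall) / len(recall)

# Weighted average: mean over classes weighted by support.
weighted_precision = sum(p * s for p, s in zip(precision, support)) / total
weighted_recall = sum(r * s for r, s in zip(recall, support)) / total

print(total)  # 3432
print(f"{macro_precision:.3f} {macro_recall:.3f}")
print(f"{weighted_precision:.3f} {weighted_recall:.3f}")
```

The support-weighted recall is, by definition, the overall accuracy, which is why it lands close to the reported test accuracy of 0.8540.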

