**Introducting**

This tutorial on pseudo-labeling for domain adaptation using the DomainNet dataset! You'll learn how to leverage pseudo-labeling techniques to improve model performance across different domains within the diverse DomainNet dataset, enhancing your model's ability to generalize and adapt.



*   **Domain Adapation:**
    Domain adaptation refers to the process of transferring knowledge from one domain (source) to another (target) where the data distributions are different. It's often used in machine learning and AI to improve the performance of models when applied to new, unseen environments.


*   **CCN:**
    Convolutional Neural Networks (CNNs), a type of deep learning model especially effective for tasks involving spatial data, like images.



*   **Pseduo-labeling:**
    Pseudo-labeling is a powerful semi-supervised learning technique where a model trained on labeled data generates labels for unlabeled data. This approach can iteratively improve the model's performance by incorporating the newly labeled data. Here’s a high-level overview of a pseudo-labeling cycle
  

*   **DomainNet Dataset**
    is a collection of common objects across six different domains: Clipart, Infograph, Real, Painting, Quickdraw, and Sketch. Each domain includes 345 categories (classes) of objects such as bracelets, planes, birds, and cellos. This dataset is often used for tasks like domain adaptation and domain generalization


**Prerequisites**

*   Basic understanding of Pytorch, Machine learning and neural networks
*   Python, CNN, Pytorch libraries



In [1]:
import os

import numpy as np

import torch
import torch.nn as nn
from torch.utils.data import SubsetRandomSampler


from torchvision import transforms, datasets, models
from torchvision.datasets import ImageFolder
from torch.utils.data import DataLoader, Dataset, ConcatDataset

from PIL import Image

In [2]:
# Setup device-agnostic code
device = "cuda" if torch.cuda.is_available() else "cpu"
device

'cuda'

In [3]:
import kagglehub

# Download latest version
path = kagglehub.dataset_download("mei1963/domainnet")

print("Path to dataset files:", path)

Path to dataset files: /root/.cache/kagglehub/datasets/mei1963/domainnet/versions/1


In [4]:
real = os.path.join(path, "DomainNet/real")
sketch = os.path.join(path, "DomainNet/sketch")

In [5]:
# Define your transforms
transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor()
])

In [6]:
# create source dataset with full labels
source_domain = datasets.ImageFolder(sketch, transform=transform)

# Create DataLoader FULL SIZE
source_loader = torch.utils.data.DataLoader(source_domain, batch_size=32, shuffle=True, drop_last=True)

In [7]:
# Helper fucntions
def create_subset_loader(dataset, subset_fraction, batch_size):
    num_samples = len(dataset)
    num_subset = int(num_samples * subset_fraction)

    # Generate random indices for the subset
    indices = np.random.choice(num_samples, num_subset, replace=False)

    # Create SubsetRandomSampler
    subset_sampler = SubsetRandomSampler(indices)

    # Create DataLoader
    subset_loader = DataLoader(dataset, batch_size=batch_size, sampler=subset_sampler)

    return subset_loader

def generate_pseudo_labels(model, dataloader, device, threshold=0.9):
  model.eval()
  pseudo_labels = []
  with torch.no_grad():
    for inputs in dataloader:
      inputs = inputs.to(device)
      outputs = model(inputs)
      probs = torch.softmax(outputs, dim=1)
      max_probs, labels = torch.max(probs, dim=1)
      # print(max_probs)
      mask = max_probs >= threshold
      pseudo_labels.extend([(input, label) for input, label, m in zip(inputs, labels, mask) if m])
  return pseudo_labels

# Function to move dataset to the specified device
def move_dataset_to_device(dataset, device):
    images, labels = [], []
    for img, label in dataset:
        img = img.to(device)
        label = torch.tensor(label).to(device)
        images.append(img)
        labels.append(label)
    return torch.stack(images), torch.stack(labels)


# Extend Image class to produce Unlabeled dataset
class UnlabeledDataset(datasets.ImageFolder):
  def __init__(self, root, transform=None):
    super(UnlabeledDataset, self).__init__(root, transform=transform)

  def __getitem__(self, index):
    image_path, _ = self.imgs[index] #get the image path
    image = Image.open(image_path).convert('RGB')
    if self.transform is not None:
      image = self.transform(image)
    return image

unlabeled_target_dataset = UnlabeledDataset(real, transform=transform)


class PseudoLabeledDataset(Dataset):
    def __init__(self, pseudo_labels, transform=None):
        self.pseudo_labels = pseudo_labels
        self.transform = transform

    def __len__(self):
        return len(self.pseudo_labels)

    def __getitem__(self, index):
        image, label = self.pseudo_labels[index]
        return image, label


# Create custom datasets on the same device
class CustomDataset(torch.utils.data.Dataset):
    def __init__(self, images, labels):
        self.images = images
        self.labels = labels

    def __len__(self):
        return len(self.images)

    def __getitem__(self, idx):
        return self.images[idx], self.labels[idx]

In [8]:
#create subset of the datasets
source_loader_10 = create_subset_loader(source_domain, 0.1, 32)
target_loader_10 = create_subset_loader(unlabeled_target_dataset, 0.1, 32)

source_loader_25 = create_subset_loader(source_domain, 0.25, 32)
target_loader_25 = create_subset_loader(unlabeled_target_dataset, 0.25, 32)

**Models**

In [9]:
# Load pre-trained EfficientNet-B0 model
efficientNetB0 = models.efficientnet_b0(pretrained=True)

num_classes = len(source_domain.classes)
efficientNetB0.classifier[1] = nn.Linear(efficientNetB0.classifier[1].in_features, num_classes)

efficientNetB0 = efficientNetB0.to(device)



In [10]:
optimizer = torch.optim.Adam(efficientNetB0.parameters(), lr=0.001)
loss_fn = nn.CrossEntropyLoss()

**Training Loop**

In [11]:
def train_model(model, data_loader, pseudo_loader, loss_fn, optimizer, device, epochs=3):
    model.to(device)
    model.train()
    for epoch in range(epochs):
        print(f'Epoch {epoch+1}/{epochs}')

        # Training on the source domain data
        running_loss = 0.0
        for inputs, labels in data_loader:
            inputs, labels = inputs.to(device), labels.to(device)

            outputs = model(inputs)
            loss = loss_fn(outputs, labels)

            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

            running_loss += loss.item()
        print(f'Loss: {running_loss/len(data_loader)}')

        # If pseudo_loader is provided, train on pseudo-labeled target domain data
        if pseudo_loader:
            running_loss = 0.0
            for batch in pseudo_loader:
                inputs, pseudo_labels = batch
                inputs, pseudo_labels = inputs.to(device), pseudo_labels.to(device)

                outputs = model(inputs)
                loss = loss_fn(outputs, pseudo_labels)

                optimizer.zero_grad()
                loss.backward()
                optimizer.step()

                running_loss += loss.item()
            print(f'Target Domain Loss: {running_loss/len(pseudo_loader)}')

        print(f'Epoch {epoch+1} completed.\n')


**Initial Training**: First we train the efficientNetB0 model on the source dataset

In [12]:
train_model(efficientNetB0, source_loader_10, None, loss_fn, optimizer, device, 10)

Epoch 1/10
Loss: 4.864010580019517
Epoch 1 completed.

Epoch 2/10
Loss: 3.1891665827144275
Epoch 2 completed.

Epoch 3/10
Loss: 2.119454925710505
Epoch 3 completed.

Epoch 4/10
Loss: 1.392784386331385
Epoch 4 completed.

Epoch 5/10
Loss: 0.8942305392839692
Epoch 5 completed.

Epoch 6/10
Loss: 0.5775178066708825
Epoch 6 completed.

Epoch 7/10
Loss: 0.39115909426049755
Epoch 7 completed.

Epoch 8/10
Loss: 0.3272264300422235
Epoch 8 completed.

Epoch 9/10
Loss: 0.2285071876238693
Epoch 9 completed.

Epoch 10/10
Loss: 0.18726907836442644
Epoch 10 completed.



In [13]:
# # save the Initial model after training the source domain
torch.save(efficientNetB0.state_dict(), '/content/drive/MyDrive/CNN/efficientNetB0_10_percent.pth')

**Generate Pseudo-Labels** : Use the trained model to predict labels for the unlabeled target domain data.

In [14]:
pseudo_labels = generate_pseudo_labels(efficientNetB0, target_loader_10, device)

In [15]:
# pseudo_labels[0]

(tensor([[[1., 1., 1.,  ..., 1., 1., 1.],
          [1., 1., 1.,  ..., 1., 1., 1.],
          [1., 1., 1.,  ..., 1., 1., 1.],
          ...,
          [1., 1., 1.,  ..., 1., 1., 1.],
          [1., 1., 1.,  ..., 1., 1., 1.],
          [1., 1., 1.,  ..., 1., 1., 1.]],
 
         [[1., 1., 1.,  ..., 1., 1., 1.],
          [1., 1., 1.,  ..., 1., 1., 1.],
          [1., 1., 1.,  ..., 1., 1., 1.],
          ...,
          [1., 1., 1.,  ..., 1., 1., 1.],
          [1., 1., 1.,  ..., 1., 1., 1.],
          [1., 1., 1.,  ..., 1., 1., 1.]],
 
         [[1., 1., 1.,  ..., 1., 1., 1.],
          [1., 1., 1.,  ..., 1., 1., 1.],
          [1., 1., 1.,  ..., 1., 1., 1.],
          ...,
          [1., 1., 1.,  ..., 1., 1., 1.],
          [1., 1., 1.,  ..., 1., 1., 1.],
          [1., 1., 1.,  ..., 1., 1., 1.]]], device='cuda:0'),
 tensor(183, device='cuda:0'))

In [16]:
len(pseudo_labels), type(pseudo_labels)

(4088, list)

In [17]:
pseudo_labeled_dataset = PseudoLabeledDataset(pseudo_labels)

**Combine Datasets** : Merge the labeled source domain data with the pseudo-labeled target domain data to form a larger training dataset.

Source dataset 10 percent

In [18]:
# preparing source dataset for concat
source_dataset_10 = source_loader_10.dataset

source_images = []
source_labels = []

for image, label in source_dataset_10:
    source_images.append(image)
    source_labels.append(label)

In [19]:
source_images_tensor = [img.cpu() if isinstance(img, torch.Tensor) else transform(img).cpu() for img in source_images]
source_labels_tensor = [torch.tensor(label).cpu() for label in source_labels ]

In [20]:
new_source_dataset_10 = CustomDataset(source_images_tensor, source_labels_tensor)

Pseudo dataset

In [21]:
# Split images and labels into separate lists
pseudo_images = []
pseudo_labels = []


# Iterate through the DataLoader
for pseudo_image, pseudo_label in pseudo_labeled_dataset:
    # print(f'Images are on device: {pseudo_image.device}')
    # print(f'Labels are on device: {pseudo_label.device}')

    pseudo_images.append(pseudo_image)
    pseudo_labels.append(pseudo_label)



In [22]:
# pseudo_images[0]

In [23]:
# Convert images to tensors and move to CPU if needed
pseudo_images_tensor = []
for img in pseudo_images:
    # print("Processing image type:", type(img))
    if isinstance(img, list):
        # Flatten list if necessary (e.g., list of lists)
        img = np.array(img)
        # print("Flattened list to array:", img.shape)
    if isinstance(img, np.ndarray):
        img = Image.fromarray(img)
        # print("Converted array to image")
    if not isinstance(img, torch.Tensor):
        img_tensor = transform(img)  # Convert image to tensor
        # print("Converted image to tensor:", img_tensor.shape)
    else:
        img_tensor = img
    pseudo_images_tensor.append(img_tensor.cpu())
    # print("Moved tensor to CPU:", img_tensor.device)

In [24]:
pseudo_images_tensor = [img.cpu() if img.is_cuda else img for img in pseudo_images_tensor]
pseudo_labels_tensor = [torch.tensor(label).cpu() if torch.is_tensor(label) and label.is_cuda else torch.tensor(label) for label in pseudo_labels]

  pseudo_labels_tensor = [torch.tensor(label).cpu() if torch.is_tensor(label) and label.is_cuda else torch.tensor(label) for label in pseudo_labels]


In [25]:
new_pseudo_labeled_dataset = CustomDataset(pseudo_images_tensor, pseudo_labels_tensor)

Combine datasets for retraining

In [26]:
combined_dataset = ConcatDataset([new_source_dataset_10, new_pseudo_labeled_dataset])

combined_dataloader = DataLoader(combined_dataset, batch_size=32, shuffle=True)


In [27]:
# Iterate through the DataLoader
for image, label in new_source_dataset_10:
    print(type(image), type(label))
    break

# <class 'torch.Tensor'> <class 'int'>

<class 'torch.Tensor'> <class 'torch.Tensor'>


Retrain the model :
    Train the model again using the combined dataset (both original and pseudo-labeled data).

In [28]:
train_model(efficientNetB0, combined_dataloader ,None, loss_fn, optimizer, device, epochs=10)

Epoch 1/10
Loss: 2.434154198341763
Epoch 1 completed.

Epoch 2/10
Loss: 1.7881761828495055
Epoch 2 completed.

Epoch 3/10
Loss: 1.4956327249708863
Epoch 3 completed.

Epoch 4/10
Loss: 1.2729815321740825
Epoch 4 completed.

Epoch 5/10
Loss: 1.0892650301590614
Epoch 5 completed.

Epoch 6/10
Loss: 0.9288182519376278
Epoch 6 completed.

Epoch 7/10
Loss: 0.8004968537472162
Epoch 7 completed.

Epoch 8/10
Loss: 0.7006382982461006
Epoch 8 completed.

Epoch 9/10
Loss: 0.6140884771987898
Epoch 9 completed.

Epoch 10/10
Loss: 0.548223829328937
Epoch 10 completed.



**Conclusion**




*   The initial loss when using source domain only is significantly higher.
*   Adding pseudo labels starts with a much lower initial loss, indicating the
    additional data helps the model start off in a better position.
*   This lower initial loss would reflect the fact that the model has already   
    learned from the source data before incorporating pseudo-labeled data.
*   Pseudo-labeled data introduces some uncertainty, as the pseudo-labels might
    not be as accurate as true labels. This can introduce noise, leading to a slightly higher loss during training. Possibly explaing why the 10th epoch for the inital source domain training was lower


