In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
! mkdir data

In [3]:
! cp '/content/drive/MyDrive/image_text_dataset.zip' .

In [4]:
! unzip --qq image_text_dataset.zip -d data

In [5]:
  ! pip install torch torchvision transformers timm Pillow accelerate

Collecting nvidia-cuda-nvrtc-cu12==12.4.127 (from torch)
  Downloading nvidia_cuda_nvrtc_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-runtime-cu12==12.4.127 (from torch)
  Downloading nvidia_cuda_runtime_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-cupti-cu12==12.4.127 (from torch)
  Downloading nvidia_cuda_cupti_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cudnn-cu12==9.1.0.70 (from torch)
  Downloading nvidia_cudnn_cu12-9.1.0.70-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cublas-cu12==12.4.5.8 (from torch)
  Downloading nvidia_cublas_cu12-12.4.5.8-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cufft-cu12==11.2.1.3 (from torch)
  Downloading nvidia_cufft_cu12-11.2.1.3-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-curand-cu12==10.3.5.147 (from torch)
  Downloading nvidia_curand_cu12-10.3.5

2. Data Preparation:
Dataset Format: Your dataset should consist of image files and corresponding text descriptions. Organize it into a structure that PyTorch Dataset can handle. A common format is a CSV or JSON file listing image paths and text captions.
Custom Dataset Class: Create a PyTorch Dataset class to load and preprocess your data.

In [6]:
    import torch
    from torch.utils.data import Dataset
    from PIL import Image
    from transformers import AutoTokenizer

    class ImageTextDataset(Dataset):
        def __init__(self, data_file, image_dir, image_transform, text_tokenizer, max_length=128):
            """
            Args:
                data_file (str): Path to the CSV/JSON file containing image paths and text captions.
                image_dir (str):  Path to the directory containing the images.
                image_transform (callable): Image transformation to apply.
                text_tokenizer (callable): Text tokenizer (e.g., from Hugging Face Transformers).
                max_length (int): Maximum length of the text sequence.
            """
            self.data = self.load_data(data_file)  # Implement your data loading
            self.image_dir = image_dir
            self.image_transform = image_transform
            self.text_tokenizer = text_tokenizer
            self.max_length = max_length

        def load_data(self, data_file):
            """Loads data from CSV/JSON.  Implement this based on your data format."""
            # Example (CSV):
            import pandas as pd
            df = pd.read_csv(data_file)
            return df.to_dict('records')  # List of dictionaries

        def __len__(self):
            return len(self.data)

        def __getitem__(self, idx):
            item = self.data[idx]
            image_path = os.path.join(self.image_dir, item['image'])  # Adjust key name
            text = item['answer']  # Adjust key name

            image = Image.open(image_path).convert("RGB")
            image = self.image_transform(image)

            text_encoded = self.text_tokenizer(text,
                                               max_length=self.max_length,
                                               padding='max_length',
                                               truncation=True,
                                               return_tensors='pt')  # PyTorch tensors

            return {
                'image': image,
                'text_input_ids': text_encoded['input_ids'].squeeze(),
                'text_attention_mask': text_encoded['attention_mask'].squeeze()
            }


*   **Image Transformations:** Define image transformations using `torchvision.transforms`.  Common transformations include resizing, normalization, and data augmentation.

    ```python
    
    ```


In [7]:
from torchvision import transforms

image_transform = transforms.Compose([
      transforms.Resize((224, 224)),  # Adjust size as needed
      transforms.ToTensor(),
      transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])  # ImageNet stats
])


*   **Text Tokenizer:** Load the Phi-3 tokenizer using `AutoTokenizer`.

   

In [8]:
    from transformers import AutoTokenizer

    phi3_model_name = "microsoft/Phi-3-mini-4k-instruct"  # Or your specific Phi-3 variant
    text_tokenizer = AutoTokenizer.from_pretrained(phi3_model_name)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


tokenizer_config.json:   0%|          | 0.00/3.44k [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/500k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.94M [00:00<?, ?B/s]

added_tokens.json:   0%|          | 0.00/306 [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/599 [00:00<?, ?B/s]

3. Model Definition:
SigLIP Image Encoder: Use a pre-trained SigLIP image encoder from timm

In [9]:
    import timm
    import torch.nn as nn

    class SigLIPImageEncoder(nn.Module):
        def __init__(self, model_name='resnet50', pretrained=True, embed_dim=512): # Adjust model_name and embed_dim
            super().__init__()
            self.model = timm.create_model(model_name, pretrained=pretrained, num_classes=0, global_pool='avg') # No classification head
            self.embed_dim = embed_dim
            self.projection = nn.Linear(self.model.num_features, embed_dim) # Project to the desired embedding dimension

        def forward(self, image):
            features = self.model(image)
            embedding = self.projection(features)
            return embedding


*   **Phi-3 Text Encoder (Frozen):** Load the pre-trained Phi-3 model using `AutoModel`.  **Crucially, freeze its parameters.**

    

In [10]:
    from transformers import AutoModel

    class Phi3TextEncoder(nn.Module):
        def __init__(self, model_name="microsoft/Phi-3-mini-4k-instruct", embed_dim=512): # Adjust model_name and embed_dim
            super().__init__()
            self.model = AutoModel.from_pretrained(model_name)
            self.embed_dim = embed_dim
            # Freeze Phi-3 parameters
            for param in self.model.parameters():
                param.requires_grad = False
            # Add a projection layer
            self.projection = nn.Linear(self.model.config.hidden_size, embed_dim)

        def forward(self, input_ids, attention_mask):
            outputs = self.model(input_ids=input_ids, attention_mask=attention_mask)
            # Use the last hidden state as the text representation
            last_hidden_state = outputs.last_hidden_state
            # Average pooling over the sequence length
            pooled_output = last_hidden_state.mean(dim=1)
            embedding = self.projection(pooled_output)
            return embedding

SigLIP Loss: Implement the SigLIP loss function. This loss encourages similar embeddings for matching image-text pairs and dissimilar embeddings for non-matching pairs. A simplified version is InfoNCE (contrastive loss).

In [11]:
    import torch
    import torch.nn.functional as F

    def info_nce_loss(image_embeddings, text_embeddings, temperature=0.07):
        """
        Computes the InfoNCE loss (contrastive loss).

        Args:
            image_embeddings (torch.Tensor): Image embeddings.
            text_embeddings (torch.Tensor): Text embeddings.
            temperature (float): Temperature scaling factor.
        """
        # Normalize embeddings
        #image_embeddings = F.normalize(image_embeddings, dim=1)
        #text_embeddings = F.normalize(text_embeddings, dim=1)
        image_embeddings = F.normalize(image_embeddings, dim=-1)
        text_embeddings = F.normalize(text_embeddings, dim=-1)

        # Compute similarity scores
        #logits = torch.matmul(image_embeddings, text_embeddings.T) / temperature

        logits = image_embeddings @ text_embeddings.T
        logits = logits/temperature

        # Create labels for the diagonal (matching pairs)
        #labels = torch.arange(logits.shape[0], device=image_embeddings.device)
        #labels = torch.arange(logits.shape[0], device=image_embeddings.device).long()
        batch_size = logits.size(0)
        targets = torch.eye(batch_size).to(logits.device)
        # Compute cross-entropy loss
        #loss = F.cross_entropy(logits, labels)
        #loss = F.binary_cross_entropy_with_logits(logits, labels)
        loss = F.binary_cross_entropy_with_logits(logits, targets)
        return loss

Complete Model: Combine the image encoder, text encoder, and loss function.

In [12]:
    class SigLIPModel(nn.Module):
        def __init__(self, image_encoder, text_encoder):
            super().__init__()
            self.image_encoder = image_encoder
            self.text_encoder = text_encoder

        def forward(self, image, text_input_ids, text_attention_mask):
            image_embeddings = self.image_encoder(image)
            text_embeddings = self.text_encoder(text_input_ids, text_attention_mask)
            return image_embeddings, text_embeddings

4. Training Loop:
Initialization: Create instances of the dataset, data loaders, model, optimizer, and learning rate scheduler. Use torch.utils.data.DataLoader for efficient data loading. Since Phi-3 is frozen, only the image encoder and projection layers will be trained.

In [13]:
    from torch.utils.data import DataLoader
    import torch.optim as optim

    # Model
    num_epochs = 5
    image_encoder = SigLIPImageEncoder()
    text_encoder = Phi3TextEncoder()
    model = SigLIPModel(image_encoder, text_encoder)

    # Dataset and DataLoader
    dataset = ImageTextDataset(data_file='/content/data/image_text_dataset.csv',
                                 image_dir='/content/data',
                                 image_transform=image_transform,
                                 text_tokenizer=text_tokenizer)
    dataloader = DataLoader(dataset, batch_size=8, shuffle=True, num_workers=4)

    # Optimizer (only train image encoder and projection layers)
    trainable_params = list(image_encoder.parameters()) + list(text_encoder.projection.parameters())
    optimizer = optim.AdamW(trainable_params, lr=1e-4)

    # Learning Rate Scheduler (optional)
    lr_scheduler = optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=len(dataloader) * num_epochs)

    # Device (GPU if available)
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model.to(device)

model.safetensors:   0%|          | 0.00/102M [00:00<?, ?B/s]

config.json:   0%|          | 0.00/967 [00:00<?, ?B/s]

model.safetensors.index.json:   0%|          | 0.00/16.5k [00:00<?, ?B/s]

Fetching 2 files:   0%|          | 0/2 [00:00<?, ?it/s]

model-00002-of-00002.safetensors:   0%|          | 0.00/2.67G [00:00<?, ?B/s]

model-00001-of-00002.safetensors:   0%|          | 0.00/4.97G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

SigLIPModel(
  (image_encoder): SigLIPImageEncoder(
    (model): ResNet(
      (conv1): Conv2d(3, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False)
      (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (act1): ReLU(inplace=True)
      (maxpool): MaxPool2d(kernel_size=3, stride=2, padding=1, dilation=1, ceil_mode=False)
      (layer1): Sequential(
        (0): Bottleneck(
          (conv1): Conv2d(64, 64, kernel_size=(1, 1), stride=(1, 1), bias=False)
          (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
          (act1): ReLU(inplace=True)
          (conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
          (bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
          (drop_block): Identity()
          (act2): ReLU(inplace=True)
          (aa): Identity()
          (conv3): Conv2d(64, 256, kernel_size=(1, 1), s

Training Loop: Iterate over the data loader, compute the loss, update the model parameters, and log the training progress. Use accelerate for multi-GPU training if needed.

In [14]:
    from accelerate import Accelerator
    import os

    accelerator = Accelerator()
    model, optimizer, dataloader, lr_scheduler = accelerator.prepare(
        model, optimizer, dataloader, lr_scheduler
    )

    num_epochs = 5
    for epoch in range(num_epochs):
        model.train()
        total_loss = 0
        for step, batch in enumerate(dataloader):
            image = batch['image'].to(device)
            text_input_ids = batch['text_input_ids'].to(device)
            text_attention_mask = batch['text_attention_mask'].to(device)

            image_embeddings, text_embeddings = model(image, text_input_ids, text_attention_mask)
            loss = info_nce_loss(image_embeddings, text_embeddings)

            total_loss += loss.item()

            accelerator.backward(loss)
            optimizer.step()
            lr_scheduler.step()
            optimizer.zero_grad()

            if step % 100 == 0:
                print(f"Epoch {epoch+1}/{num_epochs}, Step {step}, Loss: {loss.item()}")

        print(f"Epoch {epoch+1}/{num_epochs}, Average Loss: {total_loss / len(dataloader)}")

Epoch 1/5, Step 0, Loss: 1.0193607807159424
Epoch 1/5, Average Loss: 0.48543866258114576
Epoch 2/5, Step 0, Loss: 0.40336906909942627
Epoch 2/5, Average Loss: 0.38083961606025696
Epoch 3/5, Step 0, Loss: 0.37829697132110596
Epoch 3/5, Average Loss: 0.3772946549579501
Epoch 4/5, Step 0, Loss: 0.37287402153015137
Epoch 4/5, Average Loss: 0.3756789341568947
Epoch 5/5, Step 0, Loss: 0.37297382950782776
Epoch 5/5, Average Loss: 0.37561523262411356


Evaluation (Optional):
Define an evaluation function to assess the performance of the trained model. Common evaluation metrics for image-text retrieval include recall@k and mean average precision (mAP).
6. Saving the Model:
Save the trained image encoder and projection layers. You don't need to save the frozen Phi-3 model.


In [19]:
    torch.save(image_encoder.state_dict(), "/content/drive/MyDrive/image_encoder.pth")
    torch.save(text_encoder.projection.state_dict(), "/content/drive/MyDrive/text_projection.pth")


Important Considerations:
Hardware: Training large models requires significant GPU resources. Consider using a cloud platform like Google Colab, AWS, or Azure.
Dataset Size: The performance of the model depends heavily on the size and quality of your dataset.
Hyperparameter Tuning: Experiment with different hyperparameters, such as learning rate, batch size, temperature, and image size.
Data Augmentation: Use data augmentation techniques to improve the generalization ability of the model.
Regularization: Use regularization techniques, such as weight decay and dropout, to prevent overfitting.
Gradient Clipping: Use gradient clipping to prevent exploding gradients during training.
Mixed Precision Training: Use mixed precision training (e.g., with torch.cuda.amp) to reduce memory usage and speed up training. accelerate simplifies this.
Distributed Training: Use distributed training (e.g., with torch.distributed or accelerate) to train the model on multiple GPUs. accelerate handles much of the complexity.
Window Title: Double-check the PowerPoint window title. The code uses "Presentation1 - PowerPoint". Adjust if necessary.
Text Tokenizer: Ensure the text tokenizer is compatible with Phi-3.
Model Compatibility: Ensure the SigLIP image encoder is compatible with the input image size.
Memory Management: Monitor GPU memory usage and adjust the batch size accordingly.
This is a complex project, but by breaking it down into smaller steps and carefully addressing each component, you can successfully train a SigLIP model with a frozen Phi-3 language model on your custom dataset. Remember to start with a small dataset and gradually increase the size as you gain confidence. Good luck!