<div style="text-align: center;">
    <a href="https://www.dataia.eu/">
        <img border="0" src="https://github.com/ramp-kits/template-kit/raw/main/img/DATAIA-h.png" width="90%"></a>
</div>

# Template Kit for RAMP challenge

<i> Thomas Moreau (Inria) </i>

## Introduction

Describe the challenge, in particular:

- Where does the data come from?
- What is the task this challenge aims to solve?
- Why does it matter?

# Exploratory data analysis

The goal of this section is to show what's in the data, and how to play with it.
This is the first step in any data science project, and here you should give a sense of the data the participants will be working with.

You can first load and describe the data, and then show some interesting properties of it.
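As a starting point for describing the data, the sketch below summarizes an array's shape, dtype, and value range. The helper name `describe_array` and the synthetic arrays are illustrative assumptions; inside the kit you would pass the arrays returned by `problem.get_train_data()` instead. The shapes mirror what the rest of this notebook suggests (768-dimensional condition vectors and 3-channel 128x128 images), but they are not checked against the real data.

```python
import numpy as np


def describe_array(name, arr):
    """Return a one-line summary of an array: shape, dtype and value range."""
    arr = np.asarray(arr)
    return (
        f"{name}: shape={arr.shape}, dtype={arr.dtype}, "
        f"min={arr.min():.3f}, max={arr.max():.3f}, mean={arr.mean():.3f}"
    )


# Synthetic stand-ins (assumption): replace with the arrays from
# problem.get_train_data() when running inside the kit.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(8, 768)).astype(np.float32)  # condition vectors
y_demo = rng.uniform(size=(8, 3, 128, 128)).astype(np.float32)  # images in [0, 1]

print(describe_array("X", X_demo))
print(describe_array("y", y_demo))
```

A summary like this makes it easy to spot early whether images are stored channels-first or channels-last and whether pixel values are in [0, 1] or [0, 255], both of which matter for the metrics later in this notebook.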

In [6]:
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Load the data

import problem
X, y = problem.get_train_data()

  from .autonotebook import tqdm as notebook_tqdm
100%|███████████████████████████████████████| 338M/338M [03:30<00:00, 1.68MiB/s]
!!!!!!!!!!!!megablocks not available, using torch.matmul instead
<All keys matched successfully>


In [7]:
X_test, y_test = problem.get_test_data()

<All keys matched successfully>


In [8]:
# Convert the arrays to float32 tensors
import torch
X = torch.tensor(X, dtype=torch.float32)
y = torch.tensor(y, dtype=torch.float32)
X_test = torch.tensor(X_test, dtype=torch.float32)
y_test = torch.tensor(y_test, dtype=torch.float32)

In [9]:
from torch.utils.data import DataLoader
from torch.utils.data import TensorDataset

dataset = TensorDataset(X, y)

# Create DataLoader
batch_size = 64
dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=True)


In [10]:
from submissions.starting_kit import CondImageGenerator
model = CondImageGenerator.ConditionalVAE(input_channels=3, hidden_dim_enc=128, hidden_dim_dec=128,
                 latent_dim=128, n_layers_enc=4, n_layers_dec=4,
                 condition_dim=768, image_size=128, cond_new_dim=768, device='cuda')
model.fit(dataloader)

Epoch 1/1: 100%|██████████| 175/175 [00:11<00:00, 15.52it/s]

Loss: 0.2303





In [11]:
dataset_test = TensorDataset(X_test, y_test)

# Create DataLoader
batch_size = 64
dataloader_test = DataLoader(dataset_test, batch_size=batch_size, shuffle=False)

In [12]:
result = model.predict(dataloader_test)

100%|██████████| 26/26 [00:00<00:00, 34.09it/s]


In [13]:
from ramp_custom.clip_score import CLIPScore
score_type = CLIPScore()
score = score_type(X_test, result)
print(score)

AttributeError: 'Tensor' object has no attribute 'find'

In [None]:
!ramp-test --submission submissions/starting_kit --quick-test

Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "c:\Users\Akshita Kumar\capstone\Scripts\ramp-test.exe\__main__.py", line 7, in <module>
  File "C:\Users\Akshita Kumar\capstone\Lib\site-packages\rampwf\utils\cli\testing.py", line 117, in start
    main()
  File "C:\Users\Akshita Kumar\capstone\Lib\site-packages\click\core.py", line 1161, in __call__
    return self.main(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Akshita Kumar\capstone\Lib\site-packages\click\core.py", line 1082, in main
    rv = self.invoke(ctx)
         ^^^^^^^^^^^^^^^^
  File "C:\Users\Akshita Kumar\capstone\Lib\site-packages\click\core.py", line 1443, in invoke
    return ctx.invoke(self.callback, **ctx.params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Akshita Kumar\capstone\Lib\site-packages\click\core.py", line 788, in invoke
    return __callback(*args, **kwargs

# Challenge evaluation

A particularly important point in a challenge is to describe how it is evaluated. This is the section where you should describe the metric that will be used to evaluate the participants' submissions, as well as your evaluation strategy, in particular if there is some complexity in the way the data should be split to ensure valid results.

In [45]:
import numpy as np
from numpy.linalg import norm
from rampwf.score_types.base import BaseScoreType

import torch
import torch.nn as nn
import torchvision.transforms as T
from torchvision.models import resnet18

class CLIPScore(BaseScoreType):
    """
    CLIP-like Score for text-conditioned image generation.

    This metric:
      1. Accepts pre-computed text embeddings (shape: (n_samples, 768)).
      2. Computes image embeddings from generated images (numpy array of shape (n_samples, C, H, W))
         using a ResNet-18 backbone.
      3. Projects the ResNet features (512-dim) to 768 dimensions so they match the text embeddings.
      4. Normalizes both embeddings and computes the cosine similarity.
      5. Returns the inverse of the mean cosine similarity (so that lower scores are better).
    """
    is_lower_the_better = True
    minimum = 0.0
    maximum = float('inf')
    
    def __init__(self, name='clip score', precision=2):
        self.name = name
        self.precision = precision

        # 1) Initialize a ResNet-18 backbone and remove its final classification layer.
        self.image_model = resnet18(pretrained=True)
        self.image_model.fc = nn.Identity()  # Now outputs a 512-dim feature vector

        # 2) Define a linear projection from 512 -> 768 to match text embeddings.
        self.projection = nn.Linear(512, 768, bias=False)

        # Set models to evaluation mode and freeze parameters.
        self.image_model.eval()
        for param in self.image_model.parameters():
            param.requires_grad = False
        for param in self.projection.parameters():
            param.requires_grad = False

        # # 3) Define a transform that normalizes images as expected by ResNet.
        # self.transform = T.Compose([
        #     T.ConvertImageDtype(torch.float),  # Convert image to float (0,1)
        #     T.Normalize(mean=[0.485, 0.456, 0.406],
        #                 std=[0.229, 0.224, 0.225]),
        # ])

    def __call__(self, text_embeddings: np.ndarray, generated_images: np.ndarray) -> float:
        """
        Compute a CLIP-like score between text embeddings and generated images.

        Parameters
        ----------
        text_embeddings : np.ndarray
            Pre-computed text embeddings. Expected shape: (n_samples, 768).
        generated_images : np.ndarray
            Generated images. Expected shape: (n_samples, channels, height, width) with 3 channels (RGB).

        Returns
        -------
        float
            The inverse of the mean cosine similarity between text and image embeddings.
        """

        # -------------------------------------------------------
        # 1) Preprocess images and convert to Torch tensors.
        # -------------------------------------------------------
        images_tensor = torch.from_numpy(generated_images)
        images_tensor = images_tensor.float() / 255.0  # Scale pixel values from [0,255] to [0,1]
        #images_tensor = self.transform(images_tensor)

        # -------------------------------------------------------
        # 2) Extract features with the ResNet-18 backbone.
        # -------------------------------------------------------
        with torch.no_grad():
            features = self.image_model(images_tensor)  # (n_samples, 512)

        # -------------------------------------------------------
        # 3) Project features to match the text embedding dimension (768).
        # -------------------------------------------------------
        with torch.no_grad():
            image_embeddings_torch = self.projection(features)  # (n_samples, 768)

        # Convert the image embeddings to a NumPy array.
        image_embeddings = image_embeddings_torch.cpu().numpy()

        # -------------------------------------------------------
        # 4) Normalize embeddings and compute cosine similarity.
        # -------------------------------------------------------
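        # Accept torch tensors as well: convert to NumPy before normalizing.
        if isinstance(text_embeddings, torch.Tensor):
            text_embeddings = text_embeddings.detach().cpu().numpy()
        text_norm = text_embeddings / (norm(text_embeddings, axis=1, keepdims=True) + 1e-8)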
        image_norm = image_embeddings / (norm(image_embeddings, axis=1, keepdims=True) + 1e-8)
        cos_sim = np.abs(np.sum(text_norm * image_norm, axis=1))  # Cosine similarity for each sample
        mean_cos_sim = np.mean(cos_sim)

        # -------------------------------------------------------
        # 5) Return the inverse of the mean similarity (lower is better).
        # -------------------------------------------------------
        clip_score = 1.0 / (mean_cos_sim + 1e-8)
        return clip_score
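
Steps 4 and 5 of the metric can be illustrated in isolation. The sketch below is a NumPy-only illustration, not the class itself: the function name `inverse_mean_abs_cosine` and the toy vectors are assumptions, and the ResNet-18 features and projection layer are left out entirely.

```python
import numpy as np


def inverse_mean_abs_cosine(a, b, eps=1e-8):
    """Inverse of the mean absolute cosine similarity between paired rows."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    a_unit = a / (np.linalg.norm(a, axis=1, keepdims=True) + eps)
    b_unit = b / (np.linalg.norm(b, axis=1, keepdims=True) + eps)
    cos_sim = np.abs(np.sum(a_unit * b_unit, axis=1))
    return 1.0 / (cos_sim.mean() + eps)


# Toy vectors (assumption): two pairs in a 2-D embedding space.
text = np.array([[1.0, 0.0], [0.0, 2.0]])
aligned = np.array([[3.0, 0.0], [0.0, 5.0]])  # same directions as `text`
orthogonal = np.array([[0.0, 1.0], [4.0, 0.0]])  # perpendicular to `text`

print(inverse_mean_abs_cosine(text, aligned))  # close to 1, the best value
print(inverse_mean_abs_cosine(text, orthogonal))  # very large, similarity near 0
```

Because the absolute cosine similarity is at most 1, a score computed this way cannot go meaningfully below 1, and it grows without bound as the embeddings become orthogonal.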


In [46]:
score_type = CLIPScore()
score = score_type(X_test, result)
print(score)

36.90282


  text_norm = text_embeddings / (norm(text_embeddings, axis=1, keepdims=True) + 1e-8)


In [23]:
import numpy as np
from scipy.linalg import sqrtm
from rampwf.score_types.base import BaseScoreType

class FID(BaseScoreType):
    """
    Fréchet Inception Distance (FID) for image generation.
    
    This metric computes the difference between the statistics of the
    generated images and the real images. Lower values indicate that the
    generated images are closer to the real ones.
    
    Note: This implementation averages each image over axes 1 and 2 and
    computes the mean and covariance of the resulting vectors directly. For
    more robust evaluations, it is common to extract features using a
    pre-trained network such as Inception.
    """
    is_lower_the_better = True
    minimum = 0.0
    maximum = float('inf')
    
    def __init__(self, name='FID', precision=2):
        self.name = name
        self.precision = precision

    def __call__(self, y_true, y_pred):
        """
        Compute the FID between ground truth images and generated images.
        
        Parameters
        ----------
        y_true : numpy array
            Ground truth images. Shape: (n_samples, height, width, channels).
        y_pred : numpy array
            Generated images. Shape: (n_samples, height, width, channels).
        
        Returns
        -------
        float
            The computed FID score.
        """
        # Average each image over axes 1 and 2 to get one vector per image
        y_true_flat = y_true.mean(axis=(1,2))
        y_pred_flat = y_pred.mean(axis=(1,2))
        
        # Compute mean vectors
        mu_true = np.mean(y_true_flat, axis=0)
        mu_pred = np.mean(y_pred_flat, axis=0)
        
        # Compute covariance matrices
        sigma_true = np.cov(y_true_flat, rowvar=False)
        sigma_pred = np.cov(y_pred_flat, rowvar=False)
        
        # Compute squared difference between means
        diff = mu_true - mu_pred
        diff_squared = np.sum(diff**2)
        
        # Compute the square root of the product of covariance matrices
        covmean = sqrtm(sigma_true.dot(sigma_pred))
        if np.iscomplexobj(covmean):
            covmean = covmean.real
        
        fid = diff_squared + np.trace(sigma_true + sigma_pred - 2 * covmean)
        return fid
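
To check the Fréchet distance formula on data where the answer is known, the sketch below applies it to synthetic feature vectors. It is an illustration under stated assumptions rather than the class above: the function name `frechet_distance` is hypothetical, and instead of `scipy.linalg.sqrtm` it obtains the trace of the matrix square root from the eigenvalues of the covariance product, which avoids the SciPy dependency.

```python
import numpy as np


def frechet_distance(feats_a, feats_b):
    """Fréchet distance between Gaussians fitted to two feature sets.

    Each input has shape (n_samples, n_features).
    """
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    sigma_a = np.cov(feats_a, rowvar=False)
    sigma_b = np.cov(feats_b, rowvar=False)

    # Tr(sqrt(sigma_a @ sigma_b)) equals the sum of the square roots of the
    # eigenvalues of the product, which are real and non-negative in theory.
    eigvals = np.linalg.eigvals(sigma_a @ sigma_b)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()

    mean_term = np.sum((mu_a - mu_b) ** 2)
    return mean_term + np.trace(sigma_a) + np.trace(sigma_b) - 2.0 * trace_sqrt


# Synthetic features (assumption): 500 samples with 4 features each.
rng = np.random.default_rng(0)
feats = rng.normal(size=(500, 4))

print(frechet_distance(feats, feats))  # identical sets: distance near 0
print(frechet_distance(feats, feats + 1.0))  # pure mean shift of 1 per feature
```

Shifting every feature by 1 leaves the covariance unchanged, so the distance reduces to the squared norm of the mean difference, which is 4 for four features.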


In [25]:
FID_score = FID()
FID_score(y_test.numpy(), result)

np.float64(2.190257105446081)

In [None]:
y

In [27]:
y_test[100]

tensor([[[0.5255, 0.5333, 0.5373,  ..., 0.6471, 0.6431, 0.6353],
         [0.5333, 0.5412, 0.5451,  ..., 0.6510, 0.6471, 0.6353],
         [0.5412, 0.5490, 0.5529,  ..., 0.6510, 0.6510, 0.6392],
         ...,
         [0.1961, 0.2275, 0.2745,  ..., 0.1647, 0.1529, 0.1529],
         [0.2314, 0.2588, 0.3020,  ..., 0.1647, 0.1529, 0.1490],
         [0.2471, 0.2706, 0.3059,  ..., 0.1647, 0.1569, 0.1608]],

        [[0.3608, 0.3647, 0.3725,  ..., 0.5216, 0.5255, 0.5216],
         [0.3647, 0.3686, 0.3725,  ..., 0.5255, 0.5255, 0.5216],
         [0.3686, 0.3725, 0.3765,  ..., 0.5294, 0.5294, 0.5294],
         ...,
         [0.0471, 0.0745, 0.1176,  ..., 0.0588, 0.0510, 0.0549],
         [0.0784, 0.1098, 0.1412,  ..., 0.0588, 0.0510, 0.0510],
         [0.0941, 0.1176, 0.1451,  ..., 0.0588, 0.0588, 0.0627]],

        [[0.2118, 0.2196, 0.2235,  ..., 0.4549, 0.4549, 0.4510],
         [0.2196, 0.2235, 0.2275,  ..., 0.4549, 0.4549, 0.4510],
         [0.2235, 0.2275, 0.2353,  ..., 0.4510, 0.4510, 0.

# Submission format

Here, you should describe the submission format. This is the format the participants should follow to submit their predictions on the RAMP platform.

This section also shows how to use the `ramp-workflow` library to test the submission locally.

## The pipeline workflow

The input data are stored in a dataframe. To go from a dataframe to a NumPy array, we will use a scikit-learn column transformer. The first example simply selects the subset of columns we want to work with.
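
A minimal sketch of such a column selection is shown below. The dataframe and its column names (`feature_a`, `feature_b`, `notes`) are placeholders invented for illustration and do not come from the challenge data.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer

# Toy dataframe (assumption): column names are placeholders.
df = pd.DataFrame(
    {
        "feature_a": [0.1, 0.2, 0.3],
        "feature_b": [1.0, 2.0, 3.0],
        "notes": ["x", "y", "z"],
    }
)

# Keep two columns unchanged and drop everything else.
selector = ColumnTransformer(
    transformers=[("select", "passthrough", ["feature_a", "feature_b"])],
    remainder="drop",
)

X_array = selector.fit_transform(df)
print(type(X_array).__name__, X_array.shape)
```

Using `"passthrough"` keeps the selected columns as they are; a preprocessing step such as a scaler could later take its place without changing the rest of the pipeline.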

## Submission

To submit your code, you can refer to the [online documentation](https://paris-saclay-cds.github.io/ramp-docs/ramp-workflow/stable/using_kits.html).