# Evaluation Notebook

I create a standardized `.jsonl` format. Each line of the `.jsonl` file is a dictionary with the following information:
 - `image`: A (potentially relative) path to the target image. Relative paths are used for standard benchmarks, since absolute paths vary between users.
 - `width`, `height`: The dimensions of the image.
 - `references`: A list of entries, each with:
    - `caption`: A caption describing the object.
    - `xyxy`: A bounding box, in xyxy format, in pixels.
 - `known_absent_captions`: A list of object text descriptions that are known to not be in the image.
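
As a minimal sketch of this format, the record below round-trips a single entry through `json`. The path, captions, and box coordinates are made up for illustration, and `load_jsonl` is a hypothetical helper name, not part of the notebook's code.

```python
import json

# Illustrative record; every value here is made up for the example.
record = {
    "image": "images/000001.jpg",
    "width": 640,
    "height": 480,
    "references": [
        {"caption": "a person in a white shirt", "xyxy": [120.0, 60.0, 260.0, 400.0]},
    ],
    "known_absent_captions": ["a red bicycle"],
}


def load_jsonl(text):
    """Parse .jsonl content: one JSON dictionary per non-empty line."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


# Each record is serialized onto a single line of the file.
jsonl_text = json.dumps(record) + "\n"
records = load_jsonl(jsonl_text)
```

Keeping one dictionary per line means a dataset can be streamed or appended to without parsing the whole file.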

Each of the detection APIs will follow a standardized format.

Input:
 - `image`: An image.
 - `captions`: A batch of captions to test for.

Output: A list of object detections, each with:
 - `xyxy`: A bounding box, in xyxy format, in pixels.
 - `logits`: A list of binary logits (representing log\[P(true)/P(false)\]), one per caption.
 - `scores`: A list of normalized (0...1) scores, one per caption. These may not sum to 1 if none of the captions match, or if multiple captions are not mutually exclusive. They are calculated directly by applying the sigmoid function to the `logits` output.
 - `caption`: An optional string containing the caption generated for the detected object.
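
To make the relationship between `logits` and `scores` concrete, here is a small sketch of one detection for three captions. The numbers are made up, and the `sigmoid` helper is defined locally for the example (NumPy itself does not provide one).

```python
import numpy as np


def sigmoid(logits):
    """Map binary logits log[P(true)/P(false)] to probabilities in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-np.asarray(logits, dtype=float)))


# Illustrative detection for three captions; the logit values are made up.
logits = np.array([2.0, 0.0, -3.0])
detection = {
    "xyxy": [120.0, 60.0, 260.0, 400.0],
    "logits": logits,
    "scores": sigmoid(logits),
    "caption": None,  # optional; only filled in by captioning-based selectors
}
```

Because each caption is scored independently, the scores in this example sum to more than 1; a softmax over captions would instead force them to compete.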

Providing logit and score outputs for each detection allows us to calculate average precision for false negatives.
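
One way these scores could feed into average precision is sketched below. This is an assumption about the eventual metric, not the notebook's implementation: it computes the plain (non-interpolated) AP, whereas COCO-style evaluation uses an interpolated variant. The function name and arguments are hypothetical. Dividing by the number of references, rather than by the number of matched detections, is what makes missed references (false negatives) lower the result.

```python
import numpy as np


def average_precision(scores, is_match, num_references):
    """Non-interpolated AP for one caption.

    scores: confidence of each detection.
    is_match: whether each detection matched a reference box.
    num_references: total number of reference boxes, including missed ones.
    """
    scores = np.asarray(scores, dtype=float)
    if num_references == 0 or scores.size == 0:
        return 0.0
    # Rank detections from most to least confident.
    order = np.argsort(-scores)
    matches = np.asarray(is_match, dtype=bool)[order]
    # Precision after each ranked detection.
    precision = np.cumsum(matches) / np.arange(1, matches.size + 1)
    # Sum precision at the ranks where a match occurs.
    return float(precision[matches].sum() / num_references)


# Three detections, two of which match; made-up values.
ap_all_found = average_precision([0.9, 0.8, 0.3], [True, False, True], num_references=2)
# Same detections, but one reference was never detected (a false negative).
ap_one_missed = average_precision([0.9, 0.8, 0.3], [True, False, True], num_references=3)
```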


In [1]:
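import os

import numpy as np
import torch
from transformers import PaliGemmaForConditionalGeneration, PaliGemmaProcessor

from detect_cas import owlv2, get_caption_logit, generate_caption_phi3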

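# This wrapper makes it easy to generate captions.
class PaliGemmaWrapper:
    def __init__(self, base_model_id, finetune_checkpoint=None, device="cuda"):
        token = os.environ['HUGGINGFACE_ACCESS_TOKEN']
        # Load the fine-tuned weights when a checkpoint is given; otherwise fall back to the base model.
        model = PaliGemmaForConditionalGeneration.from_pretrained(
            finetune_checkpoint or base_model_id,
            torch_dtype=torch.bfloat16,
            token=token,
            device_map=device,
        )
        # The processor (tokenizer and image preprocessing) comes from the base model.
        processor = PaliGemmaProcessor.from_pretrained(base_model_id, token=token)
        
        self.model = model
        self.processor = processor
        
    def generate_caption(self, image, bbox):
        # Tokenize the bounding box.
        (x1, y1, x2, y2) = bbox
        # Scale to integer coordinates on a 1024x1024 grid, clamped to the valid <loc0000>..<loc1023> range.
        # Note: stock PaliGemma checkpoints order box tokens as y_min, x_min, y_max, x_max;
        # the x, y order below assumes the fine-tuned checkpoint was trained that way.
        x1_quantized = min(int((x1 / image.width) * 1024), 1023)
        y1_quantized = min(int((y1 / image.height) * 1024), 1023)
        x2_quantized = min(int((x2 / image.width) * 1024), 1023)
        y2_quantized = min(int((y2 / image.height) * 1024), 1023)
        bbox_tokenized = f"<loc{x1_quantized:04d}><loc{y1_quantized:04d}><loc{x2_quantized:04d}><loc{y2_quantized:04d}>"
        
        prefix = f"Describe {bbox_tokenized}"
        
        inputs = self.processor(text=prefix, images=image.convert("RGB"), return_tensors="pt").to(self.model.device)
        with torch.no_grad():
            output = self.model.generate(**inputs, max_new_tokens=32)
            
        # generate() returns the prompt tokens followed by the new tokens; decode only the new ones.
        prompt_length = inputs["input_ids"].shape[-1]
        return self.processor.decode(output[0][prompt_length:], skip_special_tokens=True)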

# Uses OwlV2's raw text embedding similarity, and chooses the object detection with the highest similarity score.
class OwlV2Selector:
    def __init__(self, owlv2):
        self.owlv2 = owlv2
        
    def detect(self, image, captions):
        return self.owlv2(image, captions)

# Uses object detections from OwlV2, and physical attributes from phi-3-vision, to select the target object.
class CropAndCaptionSelector:
    def __init__(self, owlv2):
        self.owlv2 = owlv2
        
    def detect(self, image, targets):
        preliminary_detections = self.owlv2(image, targets)
        generated_captions = [generate_caption_phi3(image, detection['xyxy']) for detection in preliminary_detections]
        detections = []
        for i in range(len(preliminary_detections)):
            logits = np.array([get_caption_logit(generated_captions[i], target) for target in targets])
            detection = {
                "xyxy": preliminary_detections[i]["xyxy"],
                "logits": logits,
                # NumPy has no np.sigmoid; apply the logistic function directly.
                "scores": 1.0 / (1.0 + np.exp(-logits)),
                "caption": generated_captions[i],
            }
            detections.append(detection)
            
        return detections

"""
Uses object detections from OwlV2, and a fine-tuned PaliGemma model to generate captions for the corresponding object detections.

In the longer term, a nice paper would be to do this (object detection and reasoning) end-to-end. For example:

```
<image>

<user> Locate the person in the white shirt. </user>

<agent>

Let's first locate all the people in the image.
<bbox 1>: A person with a white shirt.
<bbox 2>: ...

The most likely match for the criteria is <bbox 1>.

</agent>

```
"""

class ChainOfThoughtSelector:
    """
    Very similar to the crop-and-caption selector, just uses a different captioning backend.
    """
    def __init__(self, owlv2, vlm: PaliGemmaWrapper):
        self.owlv2 = owlv2
        self.vlm = vlm
        
    def detect(self, image, targets):
        preliminary_detections = self.owlv2(image, targets)
        generated_captions = [self.vlm.generate_caption(image, detection['xyxy']) for detection in preliminary_detections]
        detections = []
        for i in range(len(preliminary_detections)):
            logits = np.array([get_caption_logit(generated_captions[i], target) for target in targets])
            detection = {
                "xyxy": preliminary_detections[i]["xyxy"],
                "logits": logits,
                # NumPy has no np.sigmoid; apply the logistic function directly.
                "scores": 1.0 / (1.0 + np.exp(-logits)),
                "caption": generated_captions[i],
            }
            detections.append(detection)
            
        return detections


## Processing `.jsonl` datasets

Now, let's create a pipeline to process validation datasets and generate metrics (e.g., AP, AP50, AP75). I don't expect AP, AP50, and AP75 to vary much, because OwlV2 is already pretty good; I would be more concerned about false positives and false negatives.
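
For context, AP50 and AP75 count a detection as correct when its intersection-over-union (IoU) with a reference box is at least 0.5 or 0.75 respectively, while COCO-style AP averages over IoU thresholds from 0.5 to 0.95. The sketch below shows the IoU check for boxes in the `xyxy` format used above; `iou_xyxy` is a hypothetical helper name and the boxes are made up.

```python
def iou_xyxy(box_a, box_b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2) in pixels."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap region (zero if the boxes do not overlap).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    intersection = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - intersection
    return intersection / union if union > 0 else 0.0


# Made-up boxes: the prediction covers the top 60% of the reference.
reference = [0.0, 0.0, 100.0, 100.0]
predicted = [0.0, 0.0, 100.0, 60.0]

iou = iou_xyxy(reference, predicted)
counts_at_50 = iou >= 0.5    # correct under the AP50 threshold
counts_at_75 = iou >= 0.75   # not correct under the AP75 threshold
```

A detection like this one would help AP50 but hurt AP75, which is why the gap between the two metrics reflects localization quality rather than caption selection.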