### Base Florence-2 Model

Florence-2 is a vision foundation model trained by Microsoft in 2023. It performs well out of the box on a variety of image tasks. It produces predictions by outputting text tokens, with a limit of 1024 per image. To designate parts of the image, the model produces normalised x, y coordinates.
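
As a rough illustration of the coordinate output, the sketch below assumes the model writes each coordinate as a location token of the form `<loc_N>`, with `N` between 0 and 999, and maps each token to the centre of its bin. The helper name `loc_tokens_to_box`, the bin count and the bin-centre rule are assumptions made for illustration; in practice `processor.post_process_generation` performs this conversion.

```python
import re

NUM_BINS = 1000  # assumed number of coordinate bins per axis


def loc_tokens_to_box(text, image_width, image_height, num_bins=NUM_BINS):
    """Convert the first four <loc_N> tokens in `text` to a pixel box [x1, y1, x2, y2]."""
    bins = [int(n) for n in re.findall(r"<loc_(\d+)>", text)][:4]
    if len(bins) != 4:
        raise ValueError("expected four location tokens")
    # Tokens are assumed to be ordered x1, y1, x2, y2.
    sizes = [image_width, image_height, image_width, image_height]
    # Map each bin index to the centre of its bin, scaled to the image size.
    return [(b + 0.5) * size / num_bins for b, size in zip(bins, sizes)]


box = loc_tokens_to_box("car<loc_100><loc_200><loc_500><loc_600>", 1000, 500)
print(box)
```

Because the coordinates are normalised, the same token sequence maps to different pixel boxes for different image sizes.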

This notebook shows how to use the Florence-2 base model. Each call needs a task type and, where the task requires it, a specific prompt carrying further instructions.

Task Types (a sample of the main tasks):
- `<CAPTION>` Short description of the image
- `<DETAILED_CAPTION>` More detailed description of the image
- `<REGION_TO_DESCRIPTION>` Description, given a bounding box
- `<OD>` Object detection
- `<REGION_TO_CATEGORY>` Object category, given a bounding box
- `<REFERRING_EXPRESSION_SEGMENTATION>` Segmentation, given text
- `<REGION_TO_SEGMENTATION>` Segmentation, given bounding box
- `<OCR>` Optical Character Recognition
- `<OCR_WITH_REGION>` Optical Character Recognition, given bounding box
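
A minimal sketch of how a task token and its optional extra input can be combined into the single text string passed to the processor. The helper `build_prompt` and the set `TASKS_WITH_TEXT_INPUT` are hypothetical names, and the set is an assumed subset of the tasks that take a region or phrase after the task token.

```python
# Tasks assumed to take extra input (a region or a phrase) after the task token.
TASKS_WITH_TEXT_INPUT = {
    "<REGION_TO_DESCRIPTION>",
    "<REFERRING_EXPRESSION_SEGMENTATION>",
    "<REGION_TO_SEGMENTATION>",
}


def build_prompt(task, text_input=None):
    """Return the text to pass to the processor for a given task."""
    if task in TASKS_WITH_TEXT_INPUT:
        if not text_input:
            raise ValueError(f"{task} needs a text input")
        # The extra input is appended directly after the task token.
        return task + text_input
    if text_input:
        raise ValueError(f"{task} does not take a text input")
    return task


print(build_prompt("<OD>"))
print(build_prompt("<REFERRING_EXPRESSION_SEGMENTATION>", "a green car"))
```

Rejecting a missing or unexpected text input early makes prompt mistakes easier to spot than a malformed model output would.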

Example prompts when adding a task:
- VQA: What does the image describe?
- VQA: What does the region {region} describe?
- Object Detection: Locate the objects in the image.
- Object Detection: Locate the phrases in the caption: {caption}.
- Segmentation: What is the polygon mask of region {region}?
- OCR: Extract text with region {region}.
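
The `{region}` placeholder in these prompts stands for a bounding box. A minimal sketch of filling it, assuming the region is written as four `<loc_N>` tokens with pixel coordinates quantised into 1000 bins per axis. The helper `box_to_region`, the bin count and the floor-based rounding are assumptions made for illustration.

```python
def box_to_region(box, image_width, image_height, num_bins=1000):
    """Convert a pixel box [x1, y1, x2, y2] to a string of four <loc_N> tokens."""
    sizes = [image_width, image_height, image_width, image_height]
    tokens = []
    for value, size in zip(box, sizes):
        # Scale to the bin range, round down, and clamp to a valid bin index.
        bin_index = int(value * num_bins / size)
        bin_index = max(0, min(num_bins - 1, bin_index))
        tokens.append(f"<loc_{bin_index}>")
    return "".join(tokens)


template = "What does the region {region} describe?"
prompt = template.format(region=box_to_region([50, 100, 400, 300], 800, 600))
print(prompt)
```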

For full details of the training, see the paper: https://arxiv.org/abs/2311.06242

In [None]:
# Import necessary libraries
from PIL import Image, ImageDraw, ImageOps
import torch
from transformers import AutoProcessor, AutoModelForCausalLM 
import matplotlib.pyplot as plt
import numpy as np

# Set up the device and dtype (float16 on GPU, float32 on CPU)
DEVICE = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
print(f"DEVICE: {DEVICE} \nTORCH DTYPE: {torch_dtype}")

# Set the path to the data directory
data_path = "./data/"

# Load model and processor from Hugging Face
# microsoft/Florence-2-large-ft is microsoft/Florence-2-large fine-tuned on a variety of downstream tasks.
model_name = "microsoft/Florence-2-large-ft"  # alternative: microsoft/Florence-2-large
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch_dtype, trust_remote_code=True).to(DEVICE)
processor = AutoProcessor.from_pretrained(model_name, trust_remote_code=True)

model.eval()  # Set the model to evaluation mode

In [None]:
# create basic functions for a variety of image processing tasks

def obj_det_Florence(classes, image, model, processor):
    """
    Return all the bounding boxes and labels for the classes in the image.
    """
    # Several classes can be combined in a single prompt, e.g.
    # prompts = "detect cars and motorbikes in the image"
    # but the model may then miss objects, particularly in complex scenes with multiple objects.
    # It is therefore more reliable to query one class at a time and collect the boxes per class.
    prompts = [f"Locate {i}" for i in classes]

    all_boxes = []
    all_labels = []
    for prompt in prompts:
        # Process the input
        prompt = prompt.lower()
        # Move the inputs to the model's own device and dtype instead of relying on globals
        inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device, dtype=model.dtype)

        # Generate predictions
        generated_ids = model.generate(
            input_ids=inputs["input_ids"],
            pixel_values=inputs["pixel_values"],
            max_new_tokens=1024,  # upper limit on generated tokens
            early_stopping=False,
            do_sample=False,
            num_beams=3,
        )

        # Decode the predictions
        generated_text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]

        # Post-process the output
        parsed_answer = processor.post_process_generation(generated_text, task="<OD>", image_size=(image.width, image.height))
        
        # Collect results
        all_boxes.extend(parsed_answer['<OD>']["bboxes"])
        all_labels.extend(parsed_answer['<OD>']["labels"])

    combined_results = {"bboxes": all_boxes, "labels": all_labels}
    
    return combined_results
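
The returned dictionary keeps `bboxes` and `labels` as parallel lists. A small sketch of regrouping them by label, which can be convenient when counting or drawing detections per class. `group_boxes_by_label` is a hypothetical helper, and the sample values are made up for illustration rather than real model output.

```python
from collections import defaultdict


def group_boxes_by_label(results):
    """Group parallel `bboxes` and `labels` lists into {label: [bbox, ...]}."""
    grouped = defaultdict(list)
    for box, label in zip(results["bboxes"], results["labels"]):
        grouped[label].append(box)
    return dict(grouped)


# Made-up example in the same shape as the function's return value.
example_results = {
    "bboxes": [
        [10.0, 20.0, 110.0, 220.0],
        [5.0, 5.0, 50.0, 50.0],
        [200.0, 40.0, 300.0, 140.0],
    ],
    "labels": ["car", "motorbike", "car"],
}

grouped = group_boxes_by_label(example_results)
for label, boxes in grouped.items():
    print(label, len(boxes))
```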