# Using Florence2 as a Remotely Sourced Zoo Model

In [2]:
import fiftyone as fo
import fiftyone.zoo as foz

# Load the quickstart dataset and keep a random 3-sample subset
dataset = foz.load_zoo_dataset("quickstart", overwrite=True)
dataset = dataset.take(3)

Overwriting existing directory '/home/harpreet/fiftyone/quickstart'
Downloading dataset to '/home/harpreet/fiftyone/quickstart'
Downloading dataset...
 100% |████|  187.5Mb/187.5Mb [895.2ms elapsed, 0s remaining, 209.5Mb/s]      
Extracting dataset...
Parsing dataset metadata
Found 200 samples
Dataset info written to '/home/harpreet/fiftyone/quickstart/info.json'
Loading existing dataset 'quickstart'. To reload from disk, either delete the existing dataset or provide a custom `dataset_name` to use


# Set Up the Zoo Model

In [3]:
# Register the GitHub repository as a remote source of zoo models
foz.register_zoo_model_source("https://github.com/harpreetsahota204/florence2", overwrite=True)

Downloading https://github.com/harpreetsahota204/florence2...
  138.5Kb [45.4ms elapsed, ? remaining, 3.0Mb/s] 
Overwriting existing model source '/home/harpreet/fiftyone/__models__/florence2'


In [4]:
# Download the Florence-2 base fine-tuned checkpoint from the registered source
foz.download_zoo_model(
    "https://github.com/harpreetsahota204/florence2",
    model_name="microsoft/Florence-2-base-ft",
)

(<fiftyone.zoo.models.RemoteZooModel at 0x76fc2d534f90>,
 '/home/harpreet/fiftyone/__models__/florence2/Florence-2-base-ft')

In [5]:
model = foz.load_zoo_model("microsoft/Florence-2-base-ft")

Florence2LanguageForConditionalGeneration has generative capabilities, as `prepare_inputs_for_generation` is explicitly overwritten. However, it doesn't directly inherit from `GenerationMixin`. From 👉v4.50👈 onwards, `PreTrainedModel` will NOT inherit from `GenerationMixin`, and this model will lose the ability to call `generate` and other related functions.
  - If you are the owner of the model architecture code, please modify your model class such that it inherits from `GenerationMixin` (after `PreTrainedModel`, otherwise you'll get an exception).
  - If you are not the owner of the model architecture class, please contact the model code owner to update it.


# Use Florence2 for Captions

Captioning needs no arguments beyond the operation type and a `detail_level`, which selects one of three levels of caption verbosity.

Supported `detail_level` values:

* `basic`
* `detailed`
* `more_detailed`

In [6]:
model.set_operation(operation="caption", detail_level="more_detailed")

dataset.apply_model(
    model, 
    label_field="captions", 
)

 100% |█████████████████████| 3/3 [1.4s elapsed, 0s remaining, 2.2 samples/s]      


In [7]:
dataset.first()['captions']

'A plane is sitting on a runway. The plane is gray and has a yellow propeller on the front. The runway is surrounded by green grass. The sky is blue with white clouds.'
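
To compare the three detail levels side by side, one option is to run the caption operation once per level and write each result to its own field. The helper below is a minimal sketch: `caption_at_all_levels`, the `prefix` argument, and the generated field names are hypothetical and not part of the plugin. It only relies on the `set_operation` and `apply_model` calls shown above.

```python
DETAIL_LEVELS = ("basic", "detailed", "more_detailed")


def caption_at_all_levels(model, dataset, prefix="captions"):
    """Apply the caption operation once per detail level.

    `model` must expose `set_operation(...)` and `dataset` must expose
    `apply_model(...)`, as in the cells above. The field naming scheme
    (`<prefix>_<level>`) is an illustrative choice.
    """
    fields = []
    for level in DETAIL_LEVELS:
        field = f"{prefix}_{level}"
        # Switch the model to the requested captioning mode
        model.set_operation(operation="caption", detail_level=level)
        # Store this level's captions in a dedicated field
        dataset.apply_model(model, label_field=field)
        fields.append(field)
    return fields
```

Calling `caption_at_all_levels(model, dataset)` would then populate `captions_basic`, `captions_detailed`, and `captions_more_detailed` on each sample.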

# Use Florence2 for Detection

The `detection`, `dense_region_caption`, and `region_proposal` detection types need no additional parameters for general use.

`open_vocabulary_detection`, however, requires a `text_prompt` parameter that names the objects to look for.
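
The rule above can be captured in a small guard that builds the keyword arguments for `model.set_operation(...)`. This is a sketch under assumptions: `detection_kwargs` is a hypothetical helper, and it assumes the four names mentioned in the text are all valid `detection_type` values.

```python
# Detection types that work without a prompt, and the one that needs it
NO_PROMPT_TYPES = {"detection", "dense_region_caption", "region_proposal"}
PROMPT_TYPES = {"open_vocabulary_detection"}


def detection_kwargs(detection_type, text_prompt=None):
    """Build keyword arguments for `model.set_operation(...)`.

    Raises ValueError early if a prompt-driven type has no prompt or the
    type is unknown, instead of failing later during inference.
    """
    if detection_type in PROMPT_TYPES:
        if not text_prompt:
            raise ValueError(f"{detection_type!r} requires a non-empty text_prompt")
        return {
            "operation": "detection",
            "detection_type": detection_type,
            "text_prompt": text_prompt,
        }
    if detection_type in NO_PROMPT_TYPES:
        # Any prompt passed for these types is simply not forwarded
        return {"operation": "detection", "detection_type": detection_type}
    raise ValueError(f"Unknown detection_type: {detection_type!r}")
```

A call such as `model.set_operation(**detection_kwargs("open_vocabulary_detection", "people"))` then behaves like the explicit form used in this notebook.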


The results are stored as `Detections` objects containing bounding boxes and labels:

In [8]:
model.set_operation(
    operation="detection",
    detection_type="open_vocabulary_detection",
    text_prompt="people"
)

dataset.apply_model(
    model,
    label_field="people_detections",
)

 100% |█████████████████████| 3/3 [281.9ms elapsed, 0s remaining, 10.6 samples/s]     


In [9]:
dataset.first()['people_detections']

<Detections: {
    'detections': [
        <Detection: {
            'id': '67e71a24821cdb2eb6f8e2cc',
            'attributes': {},
            'tags': [],
            'label': 'people',
            'bounding_box': [
                0.05349999666213989,
                0.3434999988564842,
                0.7720000147819519,
                0.4550000085763686,
            ],
            'mask': None,
            'mask_path': None,
            'confidence': None,
            'index': None,
        }>,
    ],
}>
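
FiftyOne stores each `bounding_box` as `[top-left-x, top-left-y, width, height]` in relative coordinates between 0 and 1. To draw or crop a box with another library, it usually has to be converted to pixels. The function below is a minimal sketch; `to_pixel_box` is a hypothetical name, and the image width and height are assumed to be known (for example from the sample's metadata).

```python
def to_pixel_box(bounding_box, image_width, image_height):
    """Convert a relative [x, y, width, height] box to pixel corners.

    Returns (x1, y1, x2, y2) as integers, where (x1, y1) is the top-left
    corner and (x2, y2) is the bottom-right corner.
    """
    x, y, w, h = bounding_box
    x1 = round(x * image_width)
    y1 = round(y * image_height)
    x2 = round((x + w) * image_width)
    y2 = round((y + h) * image_height)
    return x1, y1, x2, y2


# A box covering the middle half of the width and a quarter of the height
print(to_pixel_box([0.25, 0.5, 0.5, 0.25], 640, 480))  # (160, 240, 480, 360)
```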

In [10]:
model.set_operation(
    operation="detection",
    detection_type="dense_region_caption",
)

dataset.apply_model(
    model,
    label_field="dense_detections",
)

 100% |█████████████████████| 3/3 [1.0s elapsed, 0s remaining, 2.9 samples/s]         


In [11]:
dataset.first()['dense_detections']

<Detections: {
    'detections': [
        <Detection: {
            'id': '67e71a2d821cdb2eb6f8e2cf',
            'attributes': {},
            'tags': [],
            'label': 'propeller',
            'bounding_box': [
                0.3125,
                0.3484999859919313,
                0.2440000057220459,
                0.3589999625498573,
            ],
            'mask': None,
            'mask_path': None,
            'confidence': None,
            'index': None,
        }>,
    ],
}>

# Use Florence2 for Phrase Grounding

Phrase grounding localizes the phrases of a caption in the image. Supply the caption either directly as a string (`caption`) or by naming a sample field that holds it (`caption_field`). To pass the caption directly:

In [12]:
model.set_operation(
    operation="phrase_grounding",
    caption="people",
)

# Apply the model with the phrase grounding operation
dataset.apply_model(
    model,
    label_field="cap_phrase_groundings",
)

 100% |█████████████████████| 3/3 [262.1ms elapsed, 0s remaining, 11.4 samples/s]     


In [13]:
dataset.first()['cap_phrase_groundings']

<Detections: {
    'detections': [
        <Detection: {
            'id': '67e71a43821cdb2eb6f8e2ff',
            'attributes': {},
            'tags': [],
            'label': 'people',
            'bounding_box': [
                0.05349999666213989,
                0.34049998513429447,
                0.7709999918937683,
                0.45800002229855824,
            ],
            'mask': None,
            'mask_path': None,
            'confidence': None,
            'index': None,
        }>,
    ],
}>
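
The box returned here nearly coincides with the open-vocabulary `people` box from the detection section. One way to quantify that agreement is intersection over union (IoU) on the relative `[x, y, width, height]` boxes. The function below is a minimal sketch; `box_iou` is a hypothetical helper, not part of FiftyOne or the plugin.

```python
def box_iou(box_a, box_b):
    """Intersection over union of two [x, y, width, height] boxes."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    # Overlap extent along each axis, clamped at zero for disjoint boxes
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


# Boxes copied from the two outputs shown in this notebook
open_vocab_box = [0.05349999666213989, 0.3434999988564842,
                  0.7720000147819519, 0.4550000085763686]
grounding_box = [0.05349999666213989, 0.34049998513429447,
                 0.7709999918937683, 0.45800002229855824]
print(box_iou(open_vocab_box, grounding_box))
```

An IoU close to 1 indicates that both operations localized essentially the same region for the prompt `people`.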

To ground a caption that is stored in a field of each sample, pass the field name via `caption_field` instead:

In [14]:
model.set_operation(
    operation="phrase_grounding",
    caption_field="captions",
)

dataset.apply_model(
    model,
    label_field="cap_field_phrase_groundings",
    caption_field="captions",
)

 100% |█████████████████████| 3/3 [1.1s elapsed, 0s remaining, 2.6 samples/s]         


In [15]:
dataset.first()['cap_field_phrase_groundings']

<Detections: {
    'detections': [
        <Detection: {
            'id': '67e71a76821cdb2eb6f8e302',
            'attributes': {},
            'tags': [],
            'label': 'A plane',
            'bounding_box': [
                0.05349999666213989,
                0.34249999428242095,
                0.7689999461174011,
                0.4529999994282421,
            ],
            'mask': None,
            'mask_path': None,
            'confidence': None,
            'index': None,
        }>,
        <Detection: {
            'id': '67e71a76821cdb2eb6f8e303',
            'attributes': {},
            'tags': [],
            'label': 'runway',
            'bounding_box': [
                0.0004999999888241291,
                0.7854999479700308,
                0.9979999656789005,
                0.05600004173832699,
            ],
            'mask': None,
            'mask_path': None,
            'confidence': None,
            'index': None,
        }>,
        <Detection

# Use Florence2 for Segmentation

Segmentation requires either an expression passed directly (`expression`) or the name of a field containing expressions (`expression_field`).

As with phrase grounding, start with the direct form:

In [16]:
model.set_operation(
    operation="segmentation",
    expression="people",
)

dataset.apply_model(
    model,
    label_field="exp_segmentations",
)

   0% ||--------------------| 0/3 [3.0ms elapsed, ? remaining, ? samples/s] 

This is a friendly reminder - the current text generation call will exceed the model's predefined maximum length (1024). Depending on the model, you may observe exceptions, performance degradation, or nothing at all.


 100% |█████████████████████| 3/3 [8.4s elapsed, 0s remaining, 0.4 samples/s]   


In [17]:
dataset.first()['exp_segmentations']

<Polylines: {
    'polylines': [
        <Polyline: {
            'id': '67e71a96821cdb2eb6f8e32b',
            'attributes': {},
            'tags': [],
            'label': 'object_1',
            'points': [
                [
                    [0.05849999785423279, 0.6024999685533153],
                    [0.3085000038146973, 0.6214999839907787],
                    [0.39049999713897704, 0.6214999839907787],
                    [0.39049999713897704, 0.6144999519723361],
                    [0.39049999713897704, 0.6024999685533153],
                    [0.39049999713897704, 0.588499975986168],
                    [0.39049999713897704, 0.5744999834190208],
                    [0.39049999713897704, 0.5604999908518735],
                    [0.39049999713897704, 0.5534999945682999],
                    [0.39049999713897704, 0.5464999982847263],
                    [0.39049999713897704, 0.5414999754144101],
                    [0.39049999713897704, 0.5414999754144101],
                 

To segment using an expression that is stored in a field of each sample, pass the field name via `expression_field` instead:

In [18]:
model.set_operation(
    operation="segmentation",
    expression_field="captions",
)

dataset.apply_model(
    model,
    label_field="exp_field_segmentations",
    expression_field="captions",
)

 100% |█████████████████████| 3/3 [438.5ms elapsed, 0s remaining, 6.8 samples/s]      


In [19]:
dataset.first()['exp_field_segmentations']

<Polylines: {
    'polylines': [
        <Polyline: {
            'id': '67e71ace821cdb2eb6f8e32e',
            'attributes': {},
            'tags': [],
            'label': 'object_1',
            'points': [
                [
                    [0.0004999999888241291, 0.7894999662662837],
                    [0.9994999885559082, 0.7964999982847263],
                    [0.9994999885559082, 0.838499975986168],
                    [0.0004999999888241291, 0.838499975986168],
                ],
            ],
            'confidence': None,
            'index': None,
            'closed': True,
            'filled': True,
        }>,
    ],
}>
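
Each `Polyline` stores `points` as a list of shapes, and each shape as a list of relative `(x, y)` vertices. A quick sanity check on a segmentation is the fraction of the image it covers, which the shoelace formula gives directly from the vertices. The helpers below are a minimal sketch with hypothetical names; summing shape areas assumes the shapes do not overlap and contain no holes.

```python
def polygon_area(vertices):
    """Area of a simple polygon via the shoelace formula.

    `vertices` is a sequence of (x, y) pairs; with relative coordinates
    the result is the fraction of the image covered.
    """
    total = 0.0
    n = len(vertices)
    for i in range(n):
        x1, y1 = vertices[i]
        x2, y2 = vertices[(i + 1) % n]  # wrap around to close the polygon
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0


def polyline_area(points):
    """Sum the areas of all shapes in a Polyline-style `points` list."""
    return sum(polygon_area(shape) for shape in points if len(shape) >= 3)


# Vertices copied from the `exp_field_segmentations` output above
runway = [[
    [0.0004999999888241291, 0.7894999662662837],
    [0.9994999885559082, 0.7964999982847263],
    [0.9994999885559082, 0.838499975986168],
    [0.0004999999888241291, 0.838499975986168],
]]
print(polyline_area(runway))
```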

# Use Florence2 for OCR

Basic OCR (`ocr`) requires no additional parameters and returns the recognized text as a string. For OCR with region information (`ocr_with_region`), set `store_region_info=True` to get a bounding box for each text region:

In [21]:
model.set_operation(operation="ocr", store_region_info=True)

dataset.apply_model(model, label_field="text_regions")

 100% |█████████████████████| 3/3 [339.3ms elapsed, 0s remaining, 8.8 samples/s]      


In [22]:
dataset.first()['text_regions']

<Detections: {
    'detections': [
        <Detection: {
            'id': '67e71b18821cdb2eb6f8e330',
            'attributes': {},
            'tags': [],
            'label': '</s>VX2801',
            'bounding_box': [
                0.05649999976158142,
                0.6014999639792521,
                0.2169999897480011,
                0.06300000228703162,
            ],
            'mask': None,
            'mask_path': None,
            'confidence': None,
            'index': None,
        }>,
    ],
}>
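
The label above begins with `</s>`, which looks like a leftover sequence token from decoding rather than text in the image. If such tokens appear in your region labels, they can be stripped after the fact. The snippet below is a minimal sketch: `clean_ocr_label` is a hypothetical helper, and the set of tokens it removes (`<s>`, `</s>`, `<pad>`) is an assumption about what may leak through.

```python
import re

# Assumed set of special tokens that may leak into OCR labels
_SPECIAL_TOKENS = re.compile(r"</?s>|<pad>")


def clean_ocr_label(label):
    """Remove leaked special tokens and surrounding whitespace."""
    return _SPECIAL_TOKENS.sub("", label).strip()


print(clean_ocr_label("</s>VX2801"))  # VX2801
```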

In [23]:
model.set_operation(operation="ocr", store_region_info=False)

dataset.apply_model(model, label_field="text_regions_no_region_info")

 100% |█████████████████████| 3/3 [224.3ms elapsed, 0s remaining, 13.4 samples/s]     


In [24]:
dataset.first()['text_regions_no_region_info']

'VX2001'