In [None]:
!pip install fiftyone "transformers<=4.49" torch einops timm accelerate

In [1]:
import os
os.environ['FIFTYONE_ALLOW_LEGACY_ORCHESTRATORS'] = 'true'

In [1]:
import fiftyone.zoo as foz
import fiftyone as fo

dataset = foz.load_zoo_dataset("quickstart")

smol_ds = dataset.take(10).clone(name="smol_qs")

Dataset already downloaded
Loading existing dataset 'quickstart'. To reload from disk, either delete the existing dataset or provide a custom `dataset_name` to use


# Captioning

The Florence2 captioning operator can generate captions at three detail levels:

1. **Basic** - This is the default mode, generating simple captions for images
2. **Detailed** - Provides more comprehensive image descriptions
3. **More Detailed** - Offers the most extensive and detailed image captions


Each detail level progressively provides more information in the generated captions, allowing you to choose the appropriate level of description for your use case.


The operator uses the Microsoft `Florence-2-base-ft` model and allows you to specify the output field name where the captions will be stored in your FiftyOne dataset. You can run the operator like this:


In [2]:
import fiftyone.operators as foo

MODEL_PATH = "microsoft/Florence-2-base-ft"

florence2_captioning = foo.get_operator("@jacobmarks/florence2/caption_with_florence2")

Python version is above 3.10, patching the collections module.




When running in a notebook, follow these steps:

1) Ensure that you have set the environment variable `os.environ['FIFTYONE_ALLOW_LEGACY_ORCHESTRATORS'] = 'true'`

2) Kick off a [Delegated Service](https://beta-docs.voxel51.com/cli/#fiftyone-delegated-operations) by opening your terminal and executing `fiftyone delegated launch`

3) Use the `await` syntax and pass `delegate=True` into your operator calls.

In [None]:
await florence2_captioning(
    smol_ds,
    model_path=MODEL_PATH,
    detail_level="basic",
    output_field="basic_caption",
    delegate=True
    )

In [4]:
smol_ds.first()['basic_caption']

'A herd of zebra standing on top of a lush green field.'

In [None]:
await florence2_captioning(
    smol_ds,
    model_path=MODEL_PATH,
    detail_level="detailed",
    output_field="detailed_caption",
    delegate=True
    )

In [6]:
smol_ds.first()['detailed_caption']

'In this image we can see zebras on the ground. In the background there are trees.'

In [None]:
await florence2_captioning(
    smol_ds,
    model_path=MODEL_PATH,
    detail_level="more_detailed",
    output_field="more_detailed_caption",
    delegate=True
    )

In [8]:
smol_ds.first()['more_detailed_caption']

'There are several zebras standing in a field. There is tall grass under them. There are trees behind them in the distance. '
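
The three calls above differ only in `detail_level` and `output_field`, so their keyword arguments can be generated in a loop. The sketch below is one way to do that. `caption_kwargs` is a hypothetical helper, not part of FiftyOne or the Florence2 plugin, and the `<level>_caption` naming simply mirrors the field names used in this notebook.

```python
# Detail levels accepted by the captioning operator, as listed above.
DETAIL_LEVELS = ("basic", "detailed", "more_detailed")


def caption_kwargs(model_path, detail_level, delegate=True):
    """Build keyword arguments for one captioning call (hypothetical helper)."""
    if detail_level not in DETAIL_LEVELS:
        raise ValueError(f"Unknown detail level: {detail_level!r}")
    return {
        "model_path": model_path,
        "detail_level": detail_level,
        # Mirrors the field names used in this notebook, e.g. "basic_caption".
        "output_field": f"{detail_level}_caption",
        "delegate": delegate,
    }


all_kwargs = [
    caption_kwargs("microsoft/Florence-2-base-ft", level)
    for level in DETAIL_LEVELS
]
```

In a notebook cell, each entry could then be passed along as `await florence2_captioning(smol_ds, **kwargs)`.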

# Phrase Grounding

The Florence2 phrase grounding operator allows you to locate specific phrases within an image. It can be run in two ways:

1. Using an existing caption field in your dataset by specifying the `caption_field` parameter

2. Using a direct caption input by providing the `caption` parameter

The operator analyzes the image and attempts to ground each phrase from the caption, that is, to locate it within the image. This is particularly useful when you need to find the objects or elements mentioned in an image caption.

You can run it like this:


In [9]:
import fiftyone.operators as foo

MODEL_PATH = "microsoft/Florence-2-base-ft"

florence2_phrase_grounding = foo.get_operator("@jacobmarks/florence2/caption_to_phrase_grounding_with_florence2")

In [None]:
# Using a caption field
await florence2_phrase_grounding(
    smol_ds,
    model_path=MODEL_PATH,
    caption_field="detailed_caption",  # must be a field on your dataset
    output_field="phrase_grounding_caption_field",
    delegate=True
)

In [11]:
smol_ds.first()['phrase_grounding_caption_field']

<Detections: {
    'detections': [
        <Detection: {
            'id': '67e1cd03b161a758312e0ce9',
            'attributes': {},
            'tags': [],
            'label': 'zebras',
            'bounding_box': [
                0.3014999866485596,
                0.22850000063578288,
                0.696999979019165,
                0.6650000095367432,
            ],
            'mask': None,
            'mask_path': None,
            'confidence': None,
            'index': None,
        }>,
        <Detection: {
            'id': '67e1cd03b161a758312e0cea',
            'attributes': {},
            'tags': [],
            'label': 'zebras',
            'bounding_box': [
                0.47149996757507323,
                0.2634999910990397,
                0.2740000247955322,
                0.6249999682108561,
            ],
            'mask': None,
            'mask_path': None,
            'confidence': None,
            'index': None,
        }>,
        <Detection: {
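
FiftyOne stores each `bounding_box` as `[top-left-x, top-left-y, width, height]` in relative coordinates between 0 and 1. The following is a minimal sketch of converting such a box to pixel corners. `to_pixel_corners` is a hypothetical helper, and the 640x480 image size is an assumed example, not the size of the quickstart image.

```python
def to_pixel_corners(bounding_box, img_width, img_height):
    """Convert a relative [x, y, w, h] box to pixel (x1, y1, x2, y2) corners."""
    x, y, w, h = bounding_box
    return (
        round(x * img_width),
        round(y * img_height),
        round((x + w) * img_width),
        round((y + h) * img_height),
    )


# Assumed example values: a box covering the middle half of a 640x480 image.
corners = to_pixel_corners([0.25, 0.5, 0.5, 0.25], 640, 480)
print(corners)  # (160, 240, 480, 360)
```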
  

In [None]:
await florence2_phrase_grounding(
    smol_ds,
    model_path=MODEL_PATH,
    caption="A cat sitting on a couch",  # open text caption
    output_field="phrase_grounding_open_text",
    delegate=True
)

In [13]:
smol_ds.first()['phrase_grounding_open_text']

# Detection

The Florence2 detection operator can run in four different modes:

1. **Detection** (default) - Standard object detection
2. **Dense Region Caption** - Generates captions for different regions in the image
3. **Region Proposal** - Suggests regions of interest in the image
4. **Open Vocabulary Detection** - Allows detection of objects specified by a text prompt

Each mode serves a different purpose. Open vocabulary detection is the most flexible, since you describe the objects to detect with a text prompt, but note that the model only supports *detecting one class through open vocabulary detection*.

You can run the operator like this:

In [14]:
import fiftyone.operators as foo

MODEL_PATH = "microsoft/Florence-2-base-ft"

florence2_detection = foo.get_operator("@jacobmarks/florence2/detect_with_florence2")

In [None]:
await florence2_detection(
    smol_ds,
    model_path=MODEL_PATH,
    detection_type="detection",
    output_field="standard_detection",
    delegate=True
)

In [16]:
smol_ds.first()['standard_detection']

<Detections: {
    'detections': [
        <Detection: {
            'id': '67e1cd20b161a758312e0d4f',
            'attributes': {},
            'tags': [],
            'label': 'zebra',
            'bounding_box': [
                0.3044999837875366,
                0.2654999891916911,
                0.44100000858306887,
                0.6289999802907308,
            ],
            'mask': None,
            'mask_path': None,
            'confidence': None,
            'index': None,
        }>,
        <Detection: {
            'id': '67e1cd20b161a758312e0d50',
            'attributes': {},
            'tags': [],
            'label': 'zebra',
            'bounding_box': [
                0.47149996757507323,
                0.2654999891916911,
                0.27300000190734863,
                0.6269999980926514,
            ],
            'mask': None,
            'mask_path': None,
            'confidence': None,
            'index': None,
        }>,
        <Detection: {
  

In [None]:
await florence2_detection(
    smol_ds,
    model_path=MODEL_PATH,
    detection_type="dense_region_caption",
    output_field="dense_region_detection",
    delegate=True
)

In [18]:
smol_ds.first()['dense_region_detection']

<Detections: {
    'detections': [
        <Detection: {
            'id': '67e1cd3eb161a758312e0dd7',
            'attributes': {},
            'tags': [],
            'label': 'zebra',
            'bounding_box': [
                0.3044999837875366,
                0.26749998728434243,
                0.37299997806549073,
                0.621999994913737,
            ],
            'mask': None,
            'mask_path': None,
            'confidence': None,
            'index': None,
        }>,
        <Detection: {
            'id': '67e1cd3eb161a758312e0dd8',
            'attributes': {},
            'tags': [],
            'label': 'zebra',
            'bounding_box': [
                0.47049999237060547,
                0.2654999891916911,
                0.2739999771118164,
                0.625000015894572,
            ],
            'mask': None,
            'mask_path': None,
            'confidence': None,
            'index': None,
        }>,
        <Detection: {
    

In [None]:
await florence2_detection(
    smol_ds,
    model_path=MODEL_PATH,
    detection_type="region_proposal",
    output_field="region_proposals",
    delegate=True
)

In [20]:
smol_ds.first()['region_proposals']

<Detections: {
    'detections': [
        <Detection: {
            'id': '67e1cd57b161a758312e0e37',
            'attributes': {},
            'tags': [],
            'label': 'object_1',
            'bounding_box': [
                0.3044999837875366,
                0.23649999300638835,
                0.693999981880188,
                0.6579999764760335,
            ],
            'mask': None,
            'mask_path': None,
            'confidence': None,
            'index': None,
        }>,
        <Detection: {
            'id': '67e1cd57b161a758312e0e38',
            'attributes': {},
            'tags': [],
            'label': 'object_2',
            'bounding_box': [
                0.3044999837875366,
                0.26649999618530273,
                0.37299997806549073,
                0.6229999860127767,
            ],
            'mask': None,
            'mask_path': None,
            'confidence': None,
            'index': None,
        }>,
        <Detection:

In [None]:
await florence2_detection(
    smol_ds,
    model_path=MODEL_PATH,
    detection_type="open_vocabulary_detection",
    text_prompt="pedestrians",  # the object you want to detect
    output_field="open_detection",
    delegate=True
)

In [22]:
smol_ds.first()['open_detection'] 

<Detections: {'detections': []}>
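
Because open vocabulary detection handles one class per call, looking for several classes means issuing several calls, each with its own output field. The sketch below builds the keyword arguments for that. `detection_kwargs` is a hypothetical helper, the requirement that `text_prompt` be non-empty is an assumption enforced only by this helper, and the `open_detection_<class>` field names and example classes are illustrative.

```python
# Detection modes used in the calls above.
DETECTION_TYPES = (
    "detection",
    "dense_region_caption",
    "region_proposal",
    "open_vocabulary_detection",
)


def detection_kwargs(model_path, detection_type, output_field,
                     text_prompt=None, delegate=True):
    """Build keyword arguments for one detection call (hypothetical helper)."""
    if detection_type not in DETECTION_TYPES:
        raise ValueError(f"Unknown detection type: {detection_type!r}")
    kwargs = {
        "model_path": model_path,
        "detection_type": detection_type,
        "output_field": output_field,
        "delegate": delegate,
    }
    if detection_type == "open_vocabulary_detection":
        # Assumption: this helper insists on a prompt naming a single class.
        if not text_prompt:
            raise ValueError("open_vocabulary_detection needs a text_prompt")
        kwargs["text_prompt"] = text_prompt
    return kwargs


# One call per class, each writing to its own (illustrative) field.
classes = ["zebra", "tree"]
per_class_kwargs = [
    detection_kwargs(
        "microsoft/Florence-2-base-ft",
        "open_vocabulary_detection",
        output_field=f"open_detection_{name}",
        text_prompt=name,
    )
    for name in classes
]
```

Each entry could then be passed along as `await florence2_detection(smol_ds, **kwargs)`.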

# Referring Segmentation

The Florence2 segmentation operator performs referring expression segmentation: it segments the region of an image that matches a natural language description, which is useful when you need to isolate a specific part of an image by describing it.

It can be used in two ways:

1. Using a direct expression through the `expression` parameter

2. Using an existing expression field in your dataset via the `expression_field` parameter

You can run the operator like this:

In [23]:
import fiftyone.operators as foo

MODEL_PATH = "microsoft/Florence-2-base-ft"

florence2_referring_segmentation = foo.get_operator("@jacobmarks/florence2/referring_expression_segmentation_with_florence2")

In [None]:
await florence2_referring_segmentation(
    smol_ds,
    model_path=MODEL_PATH,
    expression_field="basic_caption",  # must be a field on your dataset
    output_field="expression_field_segmentations",
    delegate=True
)

In [25]:
smol_ds.first()['expression_field_segmentations']

<Polylines: {
    'polylines': [
        <Polyline: {
            'id': '67e1cd9cb161a758312e0e73',
            'attributes': {},
            'tags': [],
            'label': 'object_1',
            'points': [
                [
                    [0.49049997329711914, 0.2684999783833822],
                    [0.4994999885559082, 0.2684999783833822],
                    [0.4994999885559082, 0.2684999783833822],
                    [0.5105000019073487, 0.2684999783833822],
                    [0.5105000019073487, 0.2684999783833822],
                    [0.5204999923706055, 0.2684999783833822],
                    [0.5204999923706055, 0.2684999783833822],
                    [0.5295000076293945, 0.2684999783833822],
                    [0.5295000076293945, 0.2684999783833822],
                    [0.5384999752044678, 0.2684999783833822],
                    [0.5384999752044678, 0.2684999783833822],
                    [0.5484999656677246, 0.2684999783833822],
                    [0.548

In [None]:
await florence2_referring_segmentation(
    smol_ds,
    model_path=MODEL_PATH,
    expression="people in the background",
    output_field="open_expression_segmentations",
    delegate=True
)

In [27]:
smol_ds.first()['open_expression_segmentations']

<Polylines: {
    'polylines': [
        <Polyline: {
            'id': '67e1cdb6b161a758312e0e7d',
            'attributes': {},
            'tags': [],
            'label': 'object_1',
            'points': [
                [
                    [0.5804999828338623, 0.25849998792012535],
                    [0.5914999961853027, 0.25849998792012535],
                    [0.5914999961853027, 0.24249998728434244],
                    [0.6054999828338623, 0.24249998728434244],
                    [0.6054999828338623, 0.23849999109903972],
                    [0.6224999904632569, 0.23849999109903972],
                    [0.6224999904632569, 0.23849999109903972],
                    [0.6375, 0.23849999109903972],
                    [0.6375, 0.24549999237060546],
                    [0.6524999618530274, 0.24549999237060546],
                    [0.6524999618530274, 0.2535000006357829],
                    [0.6674999713897705, 0.2535000006357829],
                    [0.6674999713897705, 
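
As the outputs show, each `Polyline` stores `points` as a list of shapes, where each shape is a list of `[x, y]` pairs in relative coordinates. One way to get a rough sense of how much of the image a segmentation covers is the shoelace formula, sketched below. `polygon_area_fraction` is a hypothetical helper, it assumes the shape is a simple (non-self-intersecting) closed polygon, and the square used here is made-up example data.

```python
def polygon_area_fraction(shape):
    """Area of one polygon given as [[x, y], ...] in relative coordinates.

    Uses the shoelace formula. Because the coordinates are relative,
    the result is the fraction of the image area the polygon covers.
    """
    total = 0.0
    n = len(shape)
    for i in range(n):
        x1, y1 = shape[i]
        x2, y2 = shape[(i + 1) % n]  # wrap around to close the polygon
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0


# Made-up example: a square covering one quarter of the image.
square = [[0.0, 0.0], [0.5, 0.0], [0.5, 0.5], [0.0, 0.5]]
print(polygon_area_fraction(square))  # 0.25
```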

# OCR

The Florence2 OCR (Optical Character Recognition) operator can extract text from images in two modes:

1. **Basic OCR** - Extracts text from the image

2. **Region-based OCR** - Extracts text and stores the region information for where the text was found by setting `store_region_info=True`

This operator is useful for extracting text content from images, with the option to also capture the location information of where the text appears in the image when needed.

> **Note:** If an image contains no text, the model will still output values, and they will be irrelevant!

You can run the operator like this:


In [35]:
import fiftyone.operators as foo

MODEL_PATH = "microsoft/Florence-2-base-ft"

florence2_ocr = foo.get_operator("@jacobmarks/florence2/ocr_with_florence2")

In [None]:
await florence2_ocr(
    smol_ds,
    model_path=MODEL_PATH,
    store_region_info=True,
    output_field="ocr_regions",
    delegate=True
)

In [32]:
smol_ds.first()['ocr_regions']

<Detections: {
    'detections': [
        <Detection: {
            'id': '67e1ce70b161a758312e0e87',
            'attributes': {},
            'tags': [],
            'label': '</s>٢٠١٠ ٠٠ ا٠ ب٠اى٠ ما١ ٠ ١١ بواكاف٠ماق با٢ برى باءى ي٠جى للى مانى الجمان بجان يجوان ماماران الحمامة الصوارية يولام الخامية المحانية لحلاية ماولية والحى والمايل الأحة اللهارة العلان والله ماروة الكلار الملاور الشرام ماير الفية علاه الزاموة فيلا مللا كامل المعامد الطايمة ادامادة الاسلا المولله الماعليم المدام ييلل القطاديةة امجادلة الوجلا عامفة ال الماوية',
            'bounding_box': [
                0.0004999999888241291,
                0.9994999567667643,
                0.998999988567084,
                0.0,
            ],
            'mask': None,
            'mask_path': None,
            'confidence': None,
            'index': None,
        }>,
    ],
}>
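
The output above illustrates the note: the image contains no text, yet the label begins with a `</s>` token and the box has zero height. One possible heuristic for flagging such results is sketched below. It operates on plain dicts for illustration, `looks_spurious` is a hypothetical helper, and the list of special tokens is an assumption rather than something the plugin documents.

```python
def looks_spurious(detection, special_tokens=("<s>", "</s>", "<pad>")):
    """Heuristically flag an OCR result that is unlikely to be real text.

    `detection` is a plain dict with "label" and "bounding_box" keys,
    where the box is [x, y, width, height] in relative coordinates.
    """
    label = detection.get("label") or ""
    _, _, width, height = detection["bounding_box"]
    has_special_token = any(token in label for token in special_tokens)
    has_empty_box = width <= 0 or height <= 0
    return has_special_token or has_empty_box or not label.strip()


# Box values copied from the output above; the label is shortened.
suspect = {
    "label": "</s>...",
    "bounding_box": [0.0004999999888241291, 0.9994999567667643, 0.998999988567084, 0.0],
}
# Made-up example of a plausible result.
plausible = {"label": "STOP", "bounding_box": [0.1, 0.1, 0.2, 0.1]}

flags = [looks_spurious(suspect), looks_spurious(plausible)]
```

A filter like this cannot confirm that text is real; it only removes results that are obviously degenerate.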

In [36]:
await florence2_ocr(
    smol_ds,
    model_path=MODEL_PATH,
    store_region_info=False,
    output_field="ocr",
    delegate=True
)

<fiftyone.operators.executor.ExecutionResult at 0x7d0d2fbc46d0>

In [37]:
smol_ds.first()['ocr']

'1'