# 🚀 Customize and Deploy `microsoft/Florence-2-large` on Amazon SageMaker AI
---
In this notebook, we explore **Florence-2-large**, Microsoft's vision-language model that takes an image plus a text prompt and generates text. You'll learn how to fine-tune it on a multimodal dataset, evaluate the result, and deploy it to a SageMaker endpoint.

**What is Florence-2-large?**

Microsoft's **Florence-2-large** is a prompt-based vision-language model: one sequence-to-sequence network takes an image and a text prompt and returns text, and the prompt (for example `<CAPTION>` or `<OD>`) selects the task. The same weights cover image captioning, visual question answering, object detection, segmentation, and OCR, so no task-specific heads are needed.  
🔗 Model card: [microsoft/Florence-2-large on Hugging Face](https://huggingface.co/microsoft/Florence-2-large)

---

**Key Specifications**

| Feature | Details |
|---|---|
| **Parameters** | ~770 million |
| **Architecture** | DaViT vision encoder + transformer encoder-decoder (sequence-to-sequence) |
| **Modalities** | Image + Text input → Text output |
| **Vision Encoder** | DaViT (Dual Attention Vision Transformer) |
| **Tasks Supported** | Captioning, VQA, OCR, Object Detection, Segmentation |
| **License** | MIT License |
| **Image Resolution** | Images are resized to 768 × 768 by the model's processor |
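
Each supported task is selected with a task token in the text prompt. The sketch below maps a few of the tokens listed on the model card to friendlier names and builds the prompt string. The dictionary keys and the `build_prompt` helper are hypothetical names introduced here for illustration; only the token strings come from the model card.

```python
# Task tokens as listed on the Florence-2 model card. The short keys
# (e.g. "caption") are hypothetical names chosen for this sketch.
TASK_PROMPTS = {
    "caption": "<CAPTION>",
    "detailed_caption": "<DETAILED_CAPTION>",
    "object_detection": "<OD>",
    "ocr": "<OCR>",
    "ocr_with_region": "<OCR_WITH_REGION>",
    "phrase_grounding": "<CAPTION_TO_PHRASE_GROUNDING>",
    "referring_segmentation": "<REFERRING_EXPRESSION_SEGMENTATION>",
}


def build_prompt(task, text_input=None):
    """Return the prompt for a task, appending extra text when it is given."""
    try:
        token = TASK_PROMPTS[task]
    except KeyError:
        raise ValueError(f"Unknown task: {task!r}") from None
    # Tasks such as phrase grounding take additional text after the token.
    return token if text_input is None else token + text_input


print(build_prompt("caption"))
print(build_prompt("phrase_grounding", "a green car parked in front of a yellow building"))
```

Keeping the tokens in one place makes it harder to mistype a prompt, which would otherwise silently change the task the model attempts.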

---

**Benchmarks & Behavior**

- The model card reports competitive **zero-shot results** on captioning (COCO, NoCaps, TextCaps), object detection (COCO), and grounding (Flickr30k, RefCOCO) benchmarks.  
- Excellent **image understanding** with detailed scene analysis and object recognition.  
- Strong **OCR capabilities** for text extraction from images and documents.  
- Versatile **multimodal reasoning** combining visual and textual information effectively.  

---

**Using This Notebook**

Here's what you'll cover:

* Load multimodal datasets and prepare them for vision-language fine-tuning  
* Fine-tune with SageMaker Training Jobs using vision-optimized configurations  
* Evaluate the fine-tuned model on vision-language benchmarks  
* Deploy to SageMaker Endpoints for multimodal inference  
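
As a rough illustration of the first step, the sketch below shows one way to serialize image-text examples as JSON Lines before uploading them for training. The field names (`image`, `prompt`, `target`), the helper names, and the sample rows are all assumptions made for this example; the format expected by the training script used later may differ.

```python
import json
import tempfile
from pathlib import Path


def to_record(image_path, prompt, target):
    """Bundle one example: image reference, input prompt, expected output text."""
    return {"image": str(image_path), "prompt": prompt, "target": target}


def write_jsonl(records, out_path):
    """Write one JSON object per line and return the output path."""
    out_path = Path(out_path)
    out_path.parent.mkdir(parents=True, exist_ok=True)
    with out_path.open("w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
    return out_path


# Hypothetical examples; file names and target texts are placeholders.
records = [
    to_record("images/0001.jpg", "<CAPTION>", "A dog running on a beach."),
    to_record("images/0002.jpg", "<OCR>", "EXIT"),
]

with tempfile.TemporaryDirectory() as tmp:
    path = write_jsonl(records, Path(tmp) / "train" / "data.jsonl")
    lines = path.read_text(encoding="utf-8").splitlines()
    print(len(lines), "records written")
```

Storing image paths rather than pixel data keeps the file small; the images themselves would be uploaded alongside it.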

---

Let's begin by exploring `microsoft/Florence-2-large` and testing its vision-language capabilities.


In [1]:
%pip install -Uq sagemaker datasets pillow

In [2]:
import boto3
import sagemaker
from PIL import Image

In [3]:
region = boto3.Session().region_name

sess = sagemaker.Session(boto3.Session(region_name=region))

# Set this to an existing bucket name to override the session default.
sagemaker_session_bucket = None
if sagemaker_session_bucket is None and sess is not None:
    # Fall back to the SageMaker default bucket for this account and region.
    sagemaker_session_bucket = sess.default_bucket()

role = sagemaker.get_execution_role()

In [4]:
print(f"sagemaker role arn: {role}")
print(f"sagemaker bucket: {sagemaker_session_bucket}")
print(f"sagemaker session region: {sess.boto_region_name}")
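
With the bucket resolved, it can help to fix the S3 locations for datasets and model artifacts up front. The helper below is a minimal sketch that joins a bucket and key parts into an `s3://` URI. The `s3_uri` function, the `example-bucket` name, and the prefix names are hypothetical and not part of the SageMaker SDK.

```python
def s3_uri(bucket, *parts):
    """Join a bucket name and key parts into an s3:// URI without stray slashes."""
    cleaned = [p.strip("/") for p in parts if p and p.strip("/")]
    key = "/".join(cleaned)
    return f"s3://{bucket}/{key}" if key else f"s3://{bucket}"


# Placeholder bucket for illustration; in the notebook you would pass
# sagemaker_session_bucket instead.
bucket = "example-bucket"
train_uri = s3_uri(bucket, "florence-2-large", "datasets", "train")
output_uri = s3_uri(bucket, "florence-2-large", "output/")
print(train_uri)
print(output_uri)
```

Building the URIs in one helper avoids doubled or missing slashes when prefixes are concatenated by hand.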