Image segmentation models partition an image into regions that correspond to different areas of interest. They work by assigning a label to each pixel. There are several types of segmentation: semantic segmentation, instance segmentation, and panoptic segmentation.

In this guide, we will:

1. Take a look at different types of segmentation.
2. Walk through an end-to-end fine-tuning example for semantic segmentation.

# Libraries

In [None]:
pip install -q datasets transformers evaluate accelerate

In [None]:

import json
import requests
from PIL import Image
from transformers import pipeline, AutoImageProcessor
from datasets import load_dataset
from huggingface_hub import cached_download, hf_hub_url

# Types of Segmentation

## Semantic Segmentation

Semantic segmentation assigns a label or class to every pixel in an image. A semantic segmentation model assigns the same class to every instance of an object it finds in an image. For example, all cats are labeled “cat” rather than “cat-1”, “cat-2”. We can use the transformers image segmentation pipeline to quickly run inference with a semantic segmentation model. Let’s take a look at the example image.

The model we will use is NVIDIA's SegFormer: nvidia/segformer-b1-finetuned-cityscapes-1024-1024.

In [None]:
# Get the image
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/segmentation_input.jpg"
image = Image.open(requests.get(url, stream=True).raw)
image

In [None]:
# Get segmentation pipeline output results
semantic_segmentation = pipeline("image-segmentation", "nvidia/segformer-b1-finetuned-cityscapes-1024-1024")
results = semantic_segmentation(image)
results

In [None]:
# Taking a look at the mask for the building class, we can see every building is classified with the same mask.
labels = [seg_dict['label'] for seg_dict in results]
required_label = 'building'
results[labels.index(required_label)]["mask"]
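
The pipeline returns one binary mask per class. As a minimal sketch of how such masks relate to a single per-pixel label map, the example below uses small synthetic NumPy arrays as stand-ins for the real masks. The `toy_results` list and the `masks_to_label_map` helper are hypothetical and not part of transformers.

```python
import numpy as np

# Synthetic stand-in for the pipeline output: one boolean mask per class.
# Assumption: a real PIL mask would first be converted, e.g. np.array(mask) > 0.
toy_results = [
    {"label": "road", "mask": np.array([[0, 0, 0, 0], [1, 1, 1, 1]], dtype=bool)},
    {"label": "building", "mask": np.array([[1, 0, 0, 1], [0, 0, 0, 0]], dtype=bool)},
    {"label": "sky", "mask": np.array([[0, 1, 1, 0], [0, 0, 0, 0]], dtype=bool)},
]

def masks_to_label_map(results):
    """Collapse per-class boolean masks into one integer label map (hypothetical helper)."""
    names = [r["label"] for r in results]
    # -1 marks pixels that no mask claims
    label_map = np.full(results[0]["mask"].shape, -1, dtype=np.int64)
    for class_id, r in enumerate(results):
        label_map[r["mask"]] = class_id
    return names, label_map

names, label_map = masks_to_label_map(toy_results)
print(names)
print(label_map)
```

In this toy map the two separate building regions in the top row share the same class id, which is the defining property of semantic segmentation.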

## Instance Segmentation

In instance segmentation, the goal is not to classify every pixel, but to predict a mask for every instance of an object in a given image. It works much like object detection, except that each instance gets a segmentation mask instead of a bounding box.

We will use Facebook's facebook/mask2former-swin-large-cityscapes-instance for this.

In [None]:
instance_segmentation = pipeline("image-segmentation", "facebook/mask2former-swin-large-cityscapes-instance")
results = instance_segmentation(image)
results

In [None]:
# Check out one of the car instances
results[2]["mask"]
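
Because each entry in the output describes one instance, the same label can appear several times. The sketch below counts instances per class on a hypothetical `toy_instances` list that mimics the list-of-dicts shape of the output, assuming each entry carries `label` and `score` keys (masks are omitted for brevity). The `count_instances` helper and its `min_score` threshold are illustrative assumptions.

```python
from collections import Counter

# Hypothetical stand-in for instance segmentation output (masks left out).
toy_instances = [
    {"label": "person", "score": 0.99},
    {"label": "car", "score": 0.98},
    {"label": "car", "score": 0.97},
    {"label": "person", "score": 0.95},
    {"label": "car", "score": 0.42},
]

def count_instances(results, min_score=0.5):
    """Count predicted instances per class, skipping entries below min_score."""
    return Counter(r["label"] for r in results if r["score"] >= min_score)

instance_counts = count_instances(toy_instances)
print(dict(instance_counts))
```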

## Panoptic Segmentation

Panoptic segmentation combines semantic segmentation and instance segmentation: every pixel is assigned both a class and an instance of that class, and each instance of a class gets its own mask. We'll use Facebook's facebook/mask2former-swin-large-cityscapes-panoptic for panoptic segmentation.

In [None]:
panoptic_segmentation = pipeline("image-segmentation", "facebook/mask2former-swin-large-cityscapes-panoptic")
results = panoptic_segmentation(image)

# The results show more classes than before.
# We will illustrate later that every pixel is assigned to one of the classes.
results
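
One way to check that every pixel belongs to exactly one segment is to count how many masks claim each pixel. The following is a sketch on synthetic 2x4 boolean masks, not on the real model output. The `toy_panoptic` list and the `coverage_counts` helper are hypothetical, and a real prediction may leave some pixels unassigned.

```python
import numpy as np

# Synthetic stand-ins for panoptic masks: two car instances plus road and sky.
toy_panoptic = [
    {"label": "car", "mask": np.array([[1, 1, 0, 0], [0, 0, 0, 0]], dtype=bool)},
    {"label": "car", "mask": np.array([[0, 0, 1, 0], [0, 0, 0, 0]], dtype=bool)},
    {"label": "road", "mask": np.array([[0, 0, 0, 0], [1, 1, 1, 1]], dtype=bool)},
    {"label": "sky", "mask": np.array([[0, 0, 0, 1], [0, 0, 0, 0]], dtype=bool)},
]

def coverage_counts(results):
    """Return, for each pixel, the number of masks that include it."""
    return np.sum([r["mask"] for r in results], axis=0)

coverage = coverage_counts(toy_panoptic)
every_pixel_labeled = bool((coverage >= 1).all())  # no pixel left out
no_overlap = bool((coverage <= 1).all())           # no pixel claimed twice
print(coverage)
print(every_pixel_labeled, no_overlap)
```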

# Fine-tuning a model for Semantic Segmentation

Having seen all the types of segmentation, let’s take a deep dive into fine-tuning a model for semantic segmentation. We will now:

a. Fine-tune SegFormer on the SceneParse150 dataset.<br>
b. Use the fine-tuned model for inference.

Common real-world applications of semantic segmentation include training self-driving cars to identify pedestrians and important traffic information, identifying cells and abnormalities in medical imagery, and monitoring environmental changes from satellite imagery.

In [None]:
# Load SceneParse150 dataset
ds = load_dataset("scene_parse_150", split="train[:50]") # Load subset first, for experimentation

ds = ds.train_test_split(test_size=0.2)
train_ds = ds["train"]
test_ds = ds["test"]

In [None]:
# Inspect the data set
# image: a PIL image of the scene.
# annotation: a PIL image of the segmentation map, which is also the model’s target.
# scene_category: a category id that describes the image scene like “kitchen” or “office”.
train_ds[0]

In [None]:
# In this guide, you’ll only need image and annotation, both of which are PIL images.
train_ds[0]["image"]

In [None]:
# Create a dictionary that maps a label id to a label class
# Download the mappings from the Hub and create the id2label and label2id dictionaries:
# hf_hub_download is used here because cached_download is deprecated
from huggingface_hub import hf_hub_download

repo_id = "huggingface/label-files"
filename = "ade20k-id2label.json"
with open(hf_hub_download(repo_id, filename, repo_type="dataset"), "r") as f:
    id2label = json.load(f)
id2label = {int(k): v for k, v in id2label.items()}  # JSON keys are strings, so convert them to int
label2id = {v: k for k, v in id2label.items()}
num_labels = len(id2label)

# Preprocess

In [None]:
# Load a SegFormer image processor to prepare the images and annotations for the model
# Some datasets, like this one, use the zero-index as the background class
# However, the background class isn’t actually included in the 150 classes...
# so you’ll need to set do_reduce_labels=True to subtract one from all the labels
# The zero-index is replaced by 255 so it’s ignored by SegFormer’s loss function

checkpoint = "nvidia/mit-b0"
image_processor = AutoImageProcessor.from_pretrained(checkpoint, do_reduce_labels=True)
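
To make the label reduction concrete, here is a NumPy sketch of the remapping described in the comments above. It is written from that description, not copied from the library, and the `reduce_labels` function and `toy_annotation` array are hypothetical.

```python
import numpy as np

def reduce_labels(label_map):
    """Map background 0 to 255 (ignored) and shift classes 1..150 down to 0..149."""
    label_map = np.asarray(label_map, dtype=np.uint8).copy()
    label_map[label_map == 0] = 255    # mark background before shifting
    label_map = label_map - 1          # every class id moves down by one; 255 becomes 254
    label_map[label_map == 254] = 255  # restore the ignore index
    return label_map

# Toy annotation: 0 is background, 1..150 are the real classes.
toy_annotation = np.array([[0, 1, 2], [150, 0, 3]], dtype=np.uint8)
reduced = reduce_labels(toy_annotation)
print(reduced)
```

After the remapping, the class ids line up with the zero-based keys of `id2label`, and background pixels carry the value 255 that the loss function ignores.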

## Using the model on a custom dataset


In [None]:
# You could also create and use your own dataset if you prefer
# You can train using the run_semantic_segmentation.py script instead of a notebook instance
# The script requires 2 things
# 1. a DatasetDict with two Image columns, “image” and “label”, and...
# 2. an id2label dictionary mapping the class integers to their class names

In [None]:

# Example DatasetDict with two Image columns, “image” and “label”
# The datasets Image feature is imported under an alias so it does not shadow PIL's Image imported earlier
from datasets import Dataset, DatasetDict
from datasets import Image as ImageFeature

image_paths_train = ["path/to/image_1.jpg", "path/to/image_2.jpg", ..., "path/to/image_n.jpg"]
label_paths_train = ["path/to/annotation_1.png", "path/to/annotation_2.png", ..., "path/to/annotation_n.png"]

image_paths_validation = [...]
label_paths_validation = [...]

def create_dataset(image_paths, label_paths):
    # Sorting both lists pairs each image with its annotation, provided the file names sort in the same order
    dataset = Dataset.from_dict({"image": sorted(image_paths),
                                "label": sorted(label_paths)})
    dataset = dataset.cast_column("image", ImageFeature())
    dataset = dataset.cast_column("label", ImageFeature())
    return dataset

# step 1: create Dataset objects
train_dataset = create_dataset(image_paths_train, label_paths_train)
validation_dataset = create_dataset(image_paths_validation, label_paths_validation)

# step 2: create DatasetDict
dataset = DatasetDict({
     "train": train_dataset,
     "validation": validation_dataset,
     }
)

# step 3: push to Hub (assumes you have run the huggingface-cli login command in a terminal/notebook)
dataset.push_to_hub("your-name/dataset-repo")

# optionally, you can push to a private repo on the Hub
# dataset.push_to_hub("name of repo on the hub", private=True)
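
Since the dataset is built by sorting the image and annotation paths independently, the pairing is only correct when both lists sort into the same order. The sketch below adds a sanity check under the assumption that file names end in a shared id, as in `image_1.jpg` and `annotation_1.png`. The `paired_paths` helper and the `data/` paths are hypothetical.

```python
from pathlib import Path

def paired_paths(image_paths, label_paths):
    """Sort both lists and verify each pair shares the same trailing id (hypothetical check)."""
    images, labels = sorted(image_paths), sorted(label_paths)
    if len(images) != len(labels):
        raise ValueError(f"{len(images)} images but {len(labels)} annotations")
    for image, label in zip(images, labels):
        # "image_1.jpg" -> "1", "annotation_1.png" -> "1"
        image_id = Path(image).stem.rsplit("_", 1)[-1]
        label_id = Path(label).stem.rsplit("_", 1)[-1]
        if image_id != label_id:
            raise ValueError(f"Mismatched pair: {image} vs {label}")
    return list(zip(images, labels))

pairs = paired_paths(
    ["data/image_2.jpg", "data/image_1.jpg"],
    ["data/annotation_1.png", "data/annotation_2.png"],
)
print(pairs)
```

Running a check like this before building the dataset would surface a missing or misnamed annotation early, instead of silently training on misaligned pairs.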

In [None]:

# Example id2label dictionary mapping the class integers to their class names
import json
# simple example
id2label = {0: 'cat', 1: 'dog'}
with open('id2label.json', 'w') as fp:
    json.dump(id2label, fp)
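
JSON object keys are always strings, so the integer ids need to be restored after the file is read back. The following sketch round-trips the same two-class mapping through a temporary file (the temporary directory is only there to keep the example self-contained) and rebuilds `label2id` from it.

```python
import json
import os
import tempfile

id2label = {0: "cat", 1: "dog"}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "id2label.json")
    with open(path, "w") as fp:
        json.dump(id2label, fp)
    with open(path) as fp:
        raw = json.load(fp)  # keys come back as strings

# Convert the keys back to integers and build the inverse mapping
restored = {int(k): v for k, v in raw.items()}
label2id = {v: k for k, v in restored.items()}
print(raw, restored, label2id)
```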