# Object Detection with Detection Transformer (DETR)

This notebook shows how to use a pre‑trained DETR model from the Hugging Face `transformers` library to detect objects in images. We load an image, run inference, and visualize the predicted bounding boxes and labels.

DETR (Detection Transformer) reframes object detection as a direct set prediction problem, eliminating the need for anchor boxes, region proposals, and non‑maximum suppression. It uses a CNN backbone to extract features, a transformer encoder–decoder to reason about objects globally, and a set‑based loss (Hungarian matching) during training.

## 1. Setup and Imports

In [None]:
import torch
import requests
from PIL import Image
import matplotlib.pyplot as plt
import matplotlib.patches as patches
import transformers  # needed for transformers.__version__ below
from transformers import DetrImageProcessor, DetrForObjectDetection
import numpy as np

print("PyTorch version:", torch.__version__)
print("Transformers version:", transformers.__version__)

## 2. Load Pre‑trained DETR Model and Processor

We'll use the `facebook/detr-resnet-50` checkpoint, which pairs DETR with a ResNet‑50 backbone and was trained on the COCO 2017 dataset. The processor handles image resizing, normalization, and conversion to tensors.

In [None]:
processor = DetrImageProcessor.from_pretrained("facebook/detr-resnet-50")
model = DetrForObjectDetection.from_pretrained("facebook/detr-resnet-50")
model.eval()
print("Model loaded successfully.")

## 3. Load an Image

We'll download a sample image from the web. To use your own image instead, open a local file directly with `Image.open("path/to/image.jpg")`.

In [None]:
# URL of a sample image from the COCO val2017 set (two cats on a couch)
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
print("Image size:", image.size)
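
If you switch between web images and local files often, a small helper can hide the difference. The sketch below is one possible way to do it; `load_image` and its `timeout` parameter are hypothetical names introduced here, not part of `transformers` or PIL. It converts every image to RGB, since the processor expects three-channel input.

```python
from io import BytesIO

import requests
from PIL import Image


def load_image(source, timeout=10):
    """Load an image from an http(s) URL or a local path and return it as RGB.

    Hypothetical helper for this notebook; not a library function.
    """
    source = str(source)
    if source.startswith(("http://", "https://")):
        response = requests.get(source, timeout=timeout)
        response.raise_for_status()  # fail loudly on 404s and similar errors
        with Image.open(BytesIO(response.content)) as img:
            return img.convert("RGB")
    # Anything else is treated as a path on the local filesystem.
    with Image.open(source) as img:
        return img.convert("RGB")


# Example usage (the path is a placeholder):
# image = load_image("path/to/your_image.jpg")
```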

## 4. Run Inference

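We preprocess the image, pass it through the model, and obtain predictions. The model returns `logits` (unnormalized class scores for each object query, including a "no object" class) and `pred_boxes` (one box per query as normalized `(center_x, center_y, width, height)`). Post‑processing turns these into confidence scores, class labels, and boxes in pixel coordinates.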

In [None]:
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Model outputs are in raw format; we need to post-process them to get final boxes and labels.
target_sizes = torch.tensor([image.size[::-1]])  # (height, width)
results = processor.post_process_object_detection(outputs, target_sizes=target_sizes, threshold=0.7)[0]

print(f"Detected {len(results['labels'])} objects with confidence > 0.7:")
for score, label, box in zip(results["scores"], results["labels"], results["boxes"]):
    box = [round(i, 2) for i in box.tolist()]
    print(f"  {model.config.id2label[label.item()]}: {round(score.item(), 3)} at {box}")
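
To make the post-processing step less of a black box, here is a rough sketch of what it amounts to for a single image: softmax over the class logits, drop the "no object" column, convert boxes from normalized center format to pixel corner format, and keep the confident ones. The function name `decode_detr_outputs` and the toy tensors are made up for illustration; the library's own implementation may differ in details.

```python
import torch


def decode_detr_outputs(logits, pred_boxes, image_height, image_width, threshold=0.7):
    """Turn raw DETR-style outputs for ONE image into scores, labels and pixel boxes.

    logits:     (num_queries, num_classes + 1), last column is "no object"
    pred_boxes: (num_queries, 4), normalized (center_x, center_y, width, height)
    """
    # Class probabilities, ignoring the trailing "no object" column.
    probs = logits.softmax(-1)[..., :-1]
    scores, labels = probs.max(-1)

    # (cx, cy, w, h) -> (x_min, y_min, x_max, y_max), still normalized to [0, 1].
    cx, cy, w, h = pred_boxes.unbind(-1)
    boxes = torch.stack([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2], dim=-1)

    # Scale to absolute pixel coordinates.
    scale = torch.tensor([image_width, image_height, image_width, image_height],
                         dtype=boxes.dtype)
    boxes = boxes * scale

    keep = scores > threshold
    return scores[keep], labels[keep], boxes[keep]


# Toy stand-in for model outputs: 2 queries, 2 real classes + "no object".
toy_logits = torch.tensor([[4.0, 0.0, 0.0],   # confident about class 0
                           [0.0, 0.0, 4.0]])  # confident about "no object"
toy_pred_boxes = torch.tensor([[0.5, 0.5, 0.2, 0.4],
                               [0.1, 0.1, 0.1, 0.1]])
toy_scores, toy_labels, toy_boxes = decode_detr_outputs(
    toy_logits, toy_pred_boxes, image_height=480, image_width=640)
print(toy_labels.tolist(), toy_boxes.tolist())
```

Only the first query survives the threshold; the second one puts most of its probability on "no object" and is discarded.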

## 5. Visualize the Detections

We'll draw the bounding boxes and labels on the image using matplotlib.

In [None]:
def plot_detections(image, results, threshold=0.7):
    fig, ax = plt.subplots(1, figsize=(12, 8))
    ax.imshow(image)
    ax.axis('off')

    for score, label, box in zip(results["scores"], results["labels"], results["boxes"]):
        if score < threshold:
            continue
        x1, y1, x2, y2 = box.tolist()
        rect = patches.Rectangle((x1, y1), x2 - x1, y2 - y1,
                                 linewidth=2, edgecolor='lime', facecolor='none')
        ax.add_patch(rect)
        caption = f"{model.config.id2label[label.item()]}: {score.item():.2f}"
        ax.text(x1, y1, caption, fontsize=10, color='white',
                bbox=dict(facecolor='green', alpha=0.5))

    plt.tight_layout()
    plt.show()

plot_detections(image, results, threshold=0.7)

## 6. How DETR Works – A Quick Recap

DETR consists of four main parts:

1. **CNN Backbone** (e.g., ResNet‑50) – extracts a feature map from the image.
2. **Transformer Encoder** – the feature map is flattened, positional encodings are added, and a standard transformer encoder processes the sequence, allowing global context to be captured.
3. **Transformer Decoder** – a fixed set of **object queries** (learned embeddings) interact with the encoder output via cross‑attention. The decoder outputs a representation for each query.
4. **Prediction Heads** – two heads, shared across all queries, are applied to each query's output: a linear layer for class scores (including a special “no object” class) and a small feed‑forward network for normalized box coordinates `(center_x, center_y, width, height)`.
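
The four parts above can be wired together in a few lines of PyTorch. The sketch below is a deliberately tiny, untrained toy, not the `facebook/detr-resnet-50` architecture: a single strided convolution stands in for the ResNet backbone, positional encodings are added only once at the encoder input, and the box head is a single linear layer. The class name `TinyDETR` and all sizes are assumptions chosen to keep the example fast.

```python
import torch
from torch import nn


class TinyDETR(nn.Module):
    """Toy DETR-style model: backbone -> encoder/decoder -> prediction heads."""

    def __init__(self, num_classes, hidden_dim=32, num_queries=5):
        super().__init__()
        # 1. Backbone (stand-in for a ResNet): image -> coarse feature map.
        self.backbone = nn.Sequential(
            nn.Conv2d(3, hidden_dim, kernel_size=7, stride=8, padding=3),
            nn.ReLU(),
        )
        # 2 + 3. Transformer encoder and decoder.
        self.transformer = nn.Transformer(
            d_model=hidden_dim, nhead=4,
            num_encoder_layers=1, num_decoder_layers=1,
            dim_feedforward=64, batch_first=True,
        )
        # Learned object queries and learned 2D positional encodings.
        self.query_embed = nn.Parameter(torch.rand(num_queries, hidden_dim))
        self.row_embed = nn.Parameter(torch.rand(50, hidden_dim // 2))
        self.col_embed = nn.Parameter(torch.rand(50, hidden_dim // 2))
        # 4. Prediction heads, shared across queries (+1 for "no object").
        self.class_head = nn.Linear(hidden_dim, num_classes + 1)
        self.box_head = nn.Linear(hidden_dim, 4)

    def forward(self, images):
        feats = self.backbone(images)                      # (B, C, H, W)
        B, C, H, W = feats.shape
        pos = torch.cat([
            self.col_embed[:W].unsqueeze(0).repeat(H, 1, 1),
            self.row_embed[:H].unsqueeze(1).repeat(1, W, 1),
        ], dim=-1).flatten(0, 1)                           # (H*W, C)
        src = feats.flatten(2).permute(0, 2, 1) + pos      # (B, H*W, C)
        queries = self.query_embed.unsqueeze(0).repeat(B, 1, 1)
        hs = self.transformer(src, queries)                # (B, num_queries, C)
        return {
            "logits": self.class_head(hs),                 # (B, num_queries, num_classes + 1)
            "pred_boxes": self.box_head(hs).sigmoid(),     # normalized (cx, cy, w, h)
        }


tiny_model = TinyDETR(num_classes=3)
with torch.no_grad():
    tiny_out = tiny_model(torch.rand(2, 3, 64, 64))
print(tiny_out["logits"].shape, tiny_out["pred_boxes"].shape)
```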

**Training** uses the Hungarian algorithm to match predictions to ground truth and a composite loss (classification + L1 box + GIoU). **Inference** simply takes the predictions with confidence above a threshold – no anchor boxes or NMS required.
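
As a minimal illustration of the matching step, the sketch below builds a cost matrix between predictions and ground-truth objects and solves the assignment with SciPy's `linear_sum_assignment`. It is a simplification: only the classification and L1 box terms are used (the GIoU term is left out), the weights are illustrative assumptions, and `match_predictions` is a hypothetical helper name.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def match_predictions(pred_probs, pred_boxes, gt_labels, gt_boxes,
                      class_weight=1.0, l1_weight=5.0):
    """Return (query_idx, gt_idx) arrays for the minimum-cost one-to-one matching."""
    # Classification cost: negative predicted probability of each ground-truth class.
    cost_class = -pred_probs[:, gt_labels]                      # (num_queries, num_gt)
    # Box cost: L1 distance between every predicted and every ground-truth box.
    cost_l1 = np.abs(pred_boxes[:, None, :] - gt_boxes[None, :, :]).sum(-1)
    cost = class_weight * cost_class + l1_weight * cost_l1
    return linear_sum_assignment(cost)


# Toy example: 3 queries, 2 ground-truth objects, 2 real classes + "no object".
toy_probs = np.array([
    [0.05, 0.05, 0.90],  # query 0: mostly "no object"
    [0.10, 0.80, 0.10],  # query 1: class 1
    [0.85, 0.05, 0.10],  # query 2: class 0
])
toy_query_boxes = np.array([
    [0.50, 0.50, 0.10, 0.10],
    [0.70, 0.70, 0.20, 0.20],
    [0.30, 0.30, 0.20, 0.30],
])
toy_gt_labels = np.array([0, 1])
toy_gt_boxes = np.array([
    [0.30, 0.30, 0.20, 0.30],
    [0.70, 0.72, 0.20, 0.20],
])

query_idx, gt_idx = match_predictions(toy_probs, toy_query_boxes,
                                      toy_gt_labels, toy_gt_boxes)
for q, g in zip(query_idx.tolist(), gt_idx.tolist()):
    print(f"query {q} <-> ground-truth object {g}")
```

Query 0 stays unmatched here; during training, unmatched queries are supervised to predict the "no object" class.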

## 7. Next Steps

- Try the model on your own images (open a local file with `Image.open` instead of downloading from a URL).
- Fine‑tune the model on a custom dataset.
- Explore the DETR source code in the Hugging Face `transformers` library.
- Implement the Hungarian matching and loss yourself for deeper understanding.