In [None]:
import torch

import numpy as np
import matplotlib.pyplot as plt
from PIL import Image
import imageio
import cv2

import time
import requests

In [None]:
from ultralytics import YOLO

In [None]:
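def read_image_from_url(url):
    """Download an image and shrink it in place to fit within 256x256."""
    response = requests.get(url, stream=True, timeout=10)
    response.raise_for_status()  # fail early on HTTP errors instead of on a bad decode
    img = Image.open(response.raw).convert("RGB")  # the model expects 3 channels
    img.thumbnail((256, 256), Image.LANCZOS)  # thumbnail() preserves aspect ratio
    return img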

In [None]:
model = YOLO("yolov8n.pt")  # auto-downloads weights

Let's look at what is inside a YOLO model.

In [None]:
print(model)

In [None]:
from torchinfo import summary

summary(model=model.model, 
        input_size=(16, 3, 640, 640),
        col_names=["input_size", "output_size", "num_params", "trainable"],
        col_width=20,
        row_settings=["var_names"]
)

## What to notice in this YOLO model summary

### 1) End-to-end I/O: from pixels → dense predictions
- **Input:** `[B, 3, 640, 640]` (here `B=16`)
- **Output:** `[B, 84, 8400]`
  - `8400` is the number of prediction locations: one per grid cell, summed over the three detection feature maps (`80×80 + 40×40 + 20×20 = 6400 + 1600 + 400`).
  - `84 = 4 + 80` (typical COCO setup): **4 box values** + **80 class scores** per location.
  - The model is **fully convolutional**, so the input resolution can change, and the `8400` term changes with it.
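
The output shape can be checked with a little arithmetic. The sketch below assumes three detection strides of 8, 16 and 32 (an assumption consistent with a 640 input producing 80, 40 and 20 cell grids); `num_locations` is a helper name made up for this example, not an Ultralytics function.

```python
def num_locations(height, width, strides=(8, 16, 32)):
    """Count prediction locations: one per cell on each detection feature map."""
    return sum((height // s) * (width // s) for s in strides)


num_box_values, num_classes = 4, 80  # COCO setup described above
channels = num_box_values + num_classes

print(num_locations(640, 640))  # 6400 + 1600 + 400 = 8400
print(channels)                 # 84
print(num_locations(320, 320))  # 1600 + 400 + 100 = 2100
```

Halving the input side cuts the number of locations by four, which is why smaller inputs run faster but see fewer candidate positions.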

---

### 2) The "backbone → neck → head" story is visible in the shapes
The shapes in the summary trace the classic resolution pyramid:

- Early layers downsample quickly, halving the spatial size five times:
  - `640×640 → 320×320 → 160×160 → 80×80 → 40×40 → 20×20`
- Channels increase as the spatial size shrinks:
  - `3 → 16 → 32 → 64 → 128 → 256`

This is the tradeoff: **lower resolution, richer features**.
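
The halving pattern follows from the standard convolution output-size formula. This sketch assumes each downsampling layer is a 3×3 convolution with stride 2 and padding 1; check those values against the `print(model)` output, since they are not stated in the summary table. `conv_out` is a name introduced here for illustration.

```python
def conv_out(size, kernel=3, stride=2, padding=1):
    """Standard convolution output-size formula for one spatial dimension."""
    return (size + 2 * padding - kernel) // stride + 1


size = 640
sizes = [size]
for _ in range(5):  # five stride-2 convolutions (assumed)
    size = conv_out(size)
    sizes.append(size)

print(sizes)  # [640, 320, 160, 80, 40, 20]
```

Five halvings give a total stride of `2**5 = 32`, so one cell of the `20×20` map covers a 32×32 pixel patch of the input.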

---

### 3) C2f blocks = the workhorse (efficient feature reuse)
You’ll see many `C2f` modules repeated.
- These are **lightweight blocks that mix residual bottlenecks with a split-and-concatenate (partial connection) path**, designed to keep accuracy while reducing compute.
- Repeating `C2f` at multiple scales builds **feature richness without exploding parameters**.
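
To make the "feature reuse" idea concrete, here is a rough channel-accounting sketch. It assumes the block works as follows: a 1×1 convolution produces two halves of `hidden` channels each, `n` bottlenecks are chained on the last half, every intermediate output is kept and concatenated, and a final 1×1 convolution maps back to the output width. The function name, the `e=0.5` ratio, and the example numbers are illustrative assumptions to verify against `print(model)`.

```python
def c2f_channels(c_out, n, e=0.5):
    """Channel counts at each stage of a C2f-style block (assumed layout)."""
    hidden = int(c_out * e)
    return {
        "after_first_conv": 2 * hidden,    # split into two halves
        "after_concat": (2 + n) * hidden,  # both halves + one output per bottleneck
        "output": c_out,                   # final 1x1 conv restores the width
    }


print(c2f_channels(c_out=64, n=2))
# {'after_first_conv': 64, 'after_concat': 128, 'output': 64}
```

Each bottleneck only processes `hidden` channels, yet its output is reused in the final concatenation, which is where the parameter savings come from.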

---

### 4) SPPF = “cheap global context”
`SPPF` (Spatial Pyramid Pooling - Fast) shows up near the deepest layer (`20×20`).
- It chains several max-pooling operations and concatenates their outputs to inject **multi-scale context** at little extra cost.
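
Why is chaining pools "fast"? Applying a small stride-1 max pool repeatedly gives the same result as one larger pool. The 1D sketch below assumes a window of 5 applied three times (the window size is an assumption, and `max_pool_1d` is a toy helper, not the library layer).

```python
def max_pool_1d(xs, k):
    """Stride-1 max pool with window k; edges use the part of the window that fits."""
    r = k // 2
    return [max(xs[max(0, i - r): i + r + 1]) for i in range(len(xs))]


signal = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5, 8, 9, 7, 9, 3]

p1 = max_pool_1d(signal, 5)
p2 = max_pool_1d(p1, 5)
p3 = max_pool_1d(p2, 5)

print(p2 == max_pool_1d(signal, 9))   # True: two 5-wide pools act like one 9-wide pool
print(p3 == max_pool_1d(signal, 13))  # True: three act like one 13-wide pool
```

Concatenating the input with `p1`, `p2` and `p3` therefore yields context at several window sizes while only ever running the cheap 5-wide pool.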

---

### 5) Upsample + Concat = the neck (feature fusion)
Look for this pattern:
- `Upsample` (e.g. `20×20 → 40×40`, then `40×40 → 80×80`)
- `Concat` (channel dimension grows, e.g. `256 + 128 → 384`)

This is the **FPN/PAN-style fusion idea**:
- **Deep features** (semantic) are merged with **shallow features** (detail)
- Helps detect **both small and large objects**.
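
The shape bookkeeping for one fusion step can be sketched with plain tuples of `(channels, height, width)`. The helper names are made up for this example; the numbers are the `256 + 128 → 384` case quoted above.

```python
def upsample(shape, scale=2):
    """Upsampling enlarges height and width; channels are unchanged."""
    c, h, w = shape
    return (c, h * scale, w * scale)


def concat_channels(a, b):
    """Concatenate along channels; spatial sizes must already match."""
    if a[1:] != b[1:]:
        raise ValueError(f"spatial sizes differ: {a[1:]} vs {b[1:]}")
    return (a[0] + b[0], a[1], a[2])


deep = (256, 20, 20)     # semantic features from the deepest stage
shallow = (128, 40, 40)  # finer-grained features from an earlier stage

fused = concat_channels(upsample(deep), shallow)
print(fused)  # (384, 40, 40)
```

The upsample step exists only to make the spatial sizes agree; without it the concatenation would fail.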

---

### 6) Detect head is multi-scale (and appears “recursive” in the summary)
`Detect (22)` appears many times because the head (layer 22) consumes several feature maps and holds separate box and class branches for each one.
- YOLO heads typically predict at **multiple scales** (commonly `80×80`, `40×40`, and `20×20` for a 640 input).
- That’s why the model can detect **small objects** (high-res head) and **big objects** (low-res head).
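
To connect the scales back to the flat prediction axis, the sketch below maps a column index to the grid cell it came from. It assumes the per-scale outputs are flattened row by row and concatenated from the highest-resolution map to the lowest (strides 8, 16, 32); treat that ordering as an assumption to confirm against the library source. `locate` is a hypothetical helper.

```python
def locate(index, image_size=640, strides=(8, 16, 32)):
    """Map a column index on the prediction axis to (stride, row, col)."""
    offset = 0
    for s in strides:
        side = image_size // s
        count = side * side
        if index < offset + count:
            local = index - offset
            return s, local // side, local % side
        offset += count
    raise IndexError("index outside the prediction axis")


print(locate(0))     # (8, 0, 0): first cell of the 80x80 map
print(locate(6400))  # (16, 0, 0): first cell of the 40x40 map
print(locate(8399))  # (32, 19, 19): last cell of the 20x20 map
```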

---

### 7) DFL at the end = Distribution Focal Loss (a modern box trick)
At the bottom you see:
- `DFL (dfl) ... Trainable: False`
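- This YOLO variant predicts a **distribution over distances** (from a location to each side of the box) instead of regressing box coordinates directly.
- This often improves localization quality.
- It’s listed as non-trainable because it’s a **fixed transformation layer** used during decoding: it turns each predicted distribution into a single distance by taking its expected value.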
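
The decoding step can be sketched in a few lines: softmax the bin logits for one box side, then take the expected bin index. The 16-bin count is an assumption for illustration, and `dfl_expectation` is a made-up name, not the library layer.

```python
import math


def dfl_expectation(logits):
    """Softmax over the bins, then the expected bin index (a distance in grid cells)."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return sum(i * e / total for i, e in enumerate(exps))


uniform = [0.0] * 16  # no preference: expectation sits in the middle
print(dfl_expectation(uniform))  # 7.5

peaked = [0.0] * 16
peaked[3] = 20.0  # strong preference for bin 3
print(round(dfl_expectation(peaked), 3))  # 3.0
```

Because the weights `0, 1, ..., 15` never change, the layer has nothing to learn, which matches the `Trainable: False` entry in the summary.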


In [None]:

# url = "http://cdn.cnn.com/cnnnext/dam/assets/200130092551-02-market-street-san-francisco-car-free-now.jpg"
# url = "https://m.media-amazon.com/images/I/71xybHPToQL._AC_SL1500_.jpg"

url = "https://www.monash.edu/__data/assets/image/0006/959496/clayton-campus-green-chemical-futures-building-exterior2017.jpg"

results = model(read_image_from_url(url))

# plot() returns a BGR array (OpenCV convention); reverse the channels to RGB for matplotlib
plt.imshow(results[0].plot()[:, :, ::-1])
plt.show()

The next cell requires a webcam. Press ESC in the video window to stop.

In [None]:
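cap = cv2.VideoCapture(0)  # 0 = default camera
if not cap.isOpened():
    raise RuntimeError("Could not open the webcam")

try:
    while True:
        ret, frame = cap.read()
        if not ret:  # no frame available (camera unplugged or busy)
            break

        # stream=True returns a generator of results instead of a list
        results = model(frame, stream=True)

        for r in results:
            frame = r.plot()  # BGR array with boxes and labels drawn

        cv2.imshow("YOLO Live Demo", frame)
        if cv2.waitKey(1) & 0xFF == 27:  # ESC
            break
finally:
    # release the camera and close the window even if the loop raises
    cap.release()
    cv2.destroyAllWindows()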
