# SAM3 Football Video Analysis - Standalone Test

This notebook demonstrates using **SAM3 (Segment Anything Model 3)** to detect and segment objects in football match videos from text prompts.

**Features:**
- Extract frames from video
- Detect objects using text prompts (e.g., "player", "soccer ball")
- Run detection on every frame independently (object identities are not linked across frames)
- Generate annotated output video

**References:**
- SAM3 GitHub: https://github.com/facebookresearch/sam3
- SAM3 Blog: https://blog.roboflow.com/fine-tune-sam3/
- Roboflow Tutorial: https://lnkd.in/gBN4ir2M

## 1. Environment Setup

Install SAM3 and dependencies. This will take a few minutes.

In [None]:
# Check if running in Colab
try:
    import google.colab
    IN_COLAB = True
    print("✓ Running in Google Colab")
except ImportError:
    IN_COLAB = False
    print("✓ Running locally")

# Clone SAM3 repository (skipped if a previous run already cloned it)
![ -d sam3 ] || git clone https://github.com/facebookresearch/sam3.git
%cd sam3
!pip install -e ".[notebooks]" -q
%cd ..

# Install additional dependencies
!pip install -q supervision opencv-python-headless pillow tqdm matplotlib

print("\n✓ Installation complete!")

✓ Running in Google Colab
fatal: destination path 'sam3' already exists and is not an empty directory.
/content/sam3
  Installing build dependencies ... done
  Checking if build backend supports build_editable ... done
  Getting requirements to build editable ... done
  Preparing editable metadata (pyproject.toml) ... done
  Building editable for sam3 (pyproject.toml) ... done
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
jaxlib 0.7.2 requires numpy>=2.0, but you have numpy 1.26.0 which is incompatible.
music21 9.9.1 requires numpy>=1.26.4, but you have numpy 1.26.0 which is incompatible.
shap 0.50.0 requires numpy>=2, but you have numpy 1.26.0 which is incompatible.
pytensor 2.35.1 requires numpy>=2.0, but you have numpy 1.26.0 which is incompatible.
opencv-contrib-python 4.12.0.88 requires num

## 2. Configure HuggingFace Token

SAM3 requires a HuggingFace token to download model weights.

**Steps:**
1. Go to https://huggingface.co/settings/tokens
2. Create a new token (read access is sufficient)
3. Request access to SAM3 models at https://huggingface.co/facebook/sam3
4. Enter your token below

In [None]:
import os
from getpass import getpass

# Get HuggingFace token
if IN_COLAB:
    from google.colab import userdata
    try:
        HF_TOKEN = userdata.get("HF_TOKEN")
        print("✓ Using HF_TOKEN from Colab secrets")
    except Exception:  # secret missing or notebook access not granted
        HF_TOKEN = getpass("Enter your HuggingFace token: ")
else:
    HF_TOKEN = getpass("Enter your HuggingFace token: ")

os.environ["HF_TOKEN"] = HF_TOKEN
print("✓ Token configured")

✓ Using HF_TOKEN from Colab secrets
✓ Token configured


## 3. GPU Check and Model Loading

In [None]:
import torch
import torchvision

print("PyTorch version:", torch.__version__)
print("Torchvision version:", torchvision.__version__)
print("CUDA available:", torch.cuda.is_available())

if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"CUDA version: {torch.version.cuda}")
else:
    print("⚠ No GPU detected. Processing will be slower.")

PyTorch version: 2.9.0+cu126
Torchvision version: 0.24.0+cu126
CUDA available: True
GPU: Tesla T4
CUDA version: 12.6


In [None]:
# Run inference under bfloat16 autocast for the rest of the session
# (entered manually so it stays active across cells)
if torch.cuda.is_available():
    torch.autocast(device_type="cuda", dtype=torch.bfloat16).__enter__()
    # TF32 matmuls need compute capability >= 8.0 (Ampere or newer);
    # a Tesla T4 is 7.5, so this branch is skipped there
    if torch.cuda.get_device_properties(0).major >= 8:
        torch.backends.cuda.matmul.allow_tf32 = True
        torch.backends.cudnn.allow_tf32 = True
        print("✓ TF32 enabled for faster inference")

In [None]:
# Load SAM3 model
from sam3.model_builder import build_sam3_image_model
from sam3.model.sam3_image_processor import Sam3Processor

print("Loading SAM3 model...")
model = build_sam3_image_model()
processor = Sam3Processor(model, confidence_threshold=0.3)
print("✓ SAM3 model loaded successfully!")

Loading SAM3 model...


config.json:   0%|          | 0.00/25.8k [00:00<?, ?B/s]

sam3.pt:   0%|          | 0.00/3.45G [00:00<?, ?B/s]

✓ SAM3 model loaded successfully!


## 4. Utility Functions

In [None]:
import supervision as sv
import cv2
import numpy as np
from PIL import Image
from pathlib import Path
from tqdm.notebook import tqdm

# Color palette for annotations
COLOR = sv.ColorPalette.from_hex([
    "#ffff00", "#ff9b00", "#ff8080", "#ff66b2", "#ff66ff", "#b266ff",
    "#9999ff", "#3399ff", "#66ffff", "#33ff99", "#66ff66", "#99ff00"
])

def from_sam(sam_result: dict) -> sv.Detections:
    """Convert SAM3 results to supervision Detections format."""
    xyxy = sam_result["boxes"].to(torch.float32).cpu().numpy()
    confidence = sam_result["scores"].to(torch.float32).cpu().numpy()

    mask = sam_result["masks"].to(torch.bool)
    mask = mask.reshape(mask.shape[0], mask.shape[2], mask.shape[3]).cpu().numpy()

    return sv.Detections(
        xyxy=xyxy,
        confidence=confidence,
        mask=mask
    )

def annotate(image: Image.Image, detections: sv.Detections, label: str = None) -> Image.Image:
    """Annotate image with detections."""
    text_scale = sv.calculate_optimal_text_scale(resolution_wh=image.size)

    mask_annotator = sv.MaskAnnotator(
        color=COLOR,
        color_lookup=sv.ColorLookup.INDEX,
        opacity=0.6
    )
    box_annotator = sv.BoxAnnotator(
        color=COLOR,
        color_lookup=sv.ColorLookup.INDEX,
        thickness=2
    )
    label_annotator = sv.LabelAnnotator(
        color=COLOR,
        color_lookup=sv.ColorLookup.INDEX,
        text_scale=0.5,
        text_padding=5,
        text_color=sv.Color.BLACK,
        text_thickness=1
    )

    annotated_image = image.copy()
    annotated_image = mask_annotator.annotate(annotated_image, detections)
    annotated_image = box_annotator.annotate(annotated_image, detections)

    if label:
        labels = [
            f"{label} {i+1} ({confidence:.2f})"
            for i, confidence in enumerate(detections.confidence)
        ]
        annotated_image = label_annotator.annotate(annotated_image, detections, labels)

    return annotated_image

print("✓ Utility functions defined")

✓ Utility functions defined


## 5. Video Processing Functions

In [None]:
def extract_frames(video_path: str, output_dir: str, max_frames: int = None, fps: int = None):
    """Extract frames from video.

    Args:
        video_path: Path to input video
        output_dir: Directory to save frames
        max_frames: Maximum number of frames to extract (None = all)
        fps: Target FPS (None = use video's original FPS)
    """
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)

    cap = cv2.VideoCapture(video_path)
    if not cap.isOpened():
        raise OSError(f"Could not open video: {video_path}")
    video_fps = cap.get(cv2.CAP_PROP_FPS)
    total_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))

    # Calculate frame skip if target FPS is specified
    frame_skip = 1
    if fps and fps < video_fps:
        frame_skip = int(video_fps / fps)

    frames_extracted = 0
    frame_idx = 0

    print(f"Video FPS: {video_fps:.2f}")
    print(f"Total frames: {total_frames}")
    print(f"Frame skip: {frame_skip}")

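    # Frames that will actually be written after skipping (ceiling division)
    expected_frames = (total_frames + frame_skip - 1) // frame_skip
    pbar = tqdm(total=min(max_frames or expected_frames, expected_frames), desc="Extracting frames")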

    while True:
        ret, frame = cap.read()
        if not ret:
            break

        if frame_idx % frame_skip == 0:
            frame_path = output_dir / f"frame_{frames_extracted:05d}.jpg"
            cv2.imwrite(str(frame_path), frame)
            frames_extracted += 1
            pbar.update(1)

            if max_frames and frames_extracted >= max_frames:
                break

        frame_idx += 1

    cap.release()
    pbar.close()

    print(f"✓ Extracted {frames_extracted} frames to {output_dir}")
    return frames_extracted, video_fps

def process_frames_with_sam3(frames_dir: str, prompt: str, output_dir: str, confidence_threshold: float = 0.5):
    """Process frames with SAM3 detection.

    Args:
        frames_dir: Directory containing input frames
        prompt: Text prompt for detection (e.g., "player", "soccer ball")
        output_dir: Directory to save annotated frames
        confidence_threshold: Minimum confidence for detections
    """
    frames_dir = Path(frames_dir)
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)

    frame_files = sorted(frames_dir.glob("*.jpg"))

    detection_stats = []

    for frame_path in tqdm(frame_files, desc=f"Processing with prompt '{prompt}'"):
        # Load frame
        image = Image.open(frame_path).convert("RGB")

        # Run SAM3 detection
        inference_state = processor.set_image(image)
        inference_state = processor.set_text_prompt(state=inference_state, prompt=prompt)

        # Convert to detections
        detections = from_sam(sam_result=inference_state)

        # Filter by confidence
        detections = detections[detections.confidence > confidence_threshold]

        # Annotate
        annotated_image = annotate(image, detections, label=prompt)

        # Save
        output_path = output_dir / frame_path.name
        annotated_image.save(output_path)

        # Track stats
        detection_stats.append({
            'frame': frame_path.name,
            'count': len(detections),
            'avg_confidence': detections.confidence.mean() if len(detections) > 0 else 0
        })

    print(f"✓ Processed {len(frame_files)} frames")
    if detection_stats:
        print(f"  Average detections per frame: {np.mean([s['count'] for s in detection_stats]):.1f}")
    # Average confidence only over frames that had at least one detection
    confident = [s['avg_confidence'] for s in detection_stats if s['count'] > 0]
    if confident:
        print(f"  Average confidence: {np.mean(confident):.2f}")

    return detection_stats

def create_video_from_frames(frames_dir: str, output_path: str, fps: float = 30.0):
    """Create video from frames.

    Args:
        frames_dir: Directory containing frames
        output_path: Path for output video
        fps: Frames per second for output video
    """
    frames_dir = Path(frames_dir)
    frame_files = sorted(frames_dir.glob("*.jpg"))

    if not frame_files:
        print("⚠ No frames found!")
        return

    # Get frame dimensions
    first_frame = cv2.imread(str(frame_files[0]))
    height, width = first_frame.shape[:2]

    # Create video writer
    fourcc = cv2.VideoWriter_fourcc(*'mp4v')
    out = cv2.VideoWriter(output_path, fourcc, fps, (width, height))

    for frame_path in tqdm(frame_files, desc="Creating video"):
        frame = cv2.imread(str(frame_path))
        out.write(frame)

    out.release()
    print(f"✓ Video saved to {output_path}")

print("✓ Video processing functions defined")

✓ Video processing functions defined
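
As a rough illustration of how `extract_frames` subsamples a video, the sketch below reproduces its frame-skip arithmetic in plain Python. The helper name `plan_frame_sampling` is hypothetical and not part of the notebook; it only mirrors the `int(video_fps / fps)` rule used above. Because the skip is truncated to an integer, the effective rate can be higher than the requested one, and that effective rate is the value `output_fps` would need to match for real-time playback.

```python
from typing import Optional


def plan_frame_sampling(video_fps: float, target_fps: Optional[float], total_frames: int):
    """Mirror the frame-skip rule used in extract_frames (hypothetical helper)."""
    frame_skip = 1
    if target_fps and target_fps < video_fps:
        frame_skip = int(video_fps / target_fps)  # truncates, e.g. 3.75 -> 3
    kept = list(range(0, total_frames, frame_skip))  # indices where idx % frame_skip == 0
    effective_fps = video_fps / frame_skip
    return frame_skip, kept, effective_fps


skip, kept, effective_fps = plan_frame_sampling(video_fps=30.0, target_fps=8, total_frames=10)
print(skip, kept, effective_fps)  # 3 [0, 3, 6, 9] 10.0
```

Here a request for 8 FPS on a 30 FPS video keeps every third frame, so the extracted sequence is effectively 10 FPS.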


## 6. Upload Your Video

Upload a football match video clip (for example through the Colab file browser) and set `video_path` to its location.

In [None]:
video_path = "/content/RMA_BAR_video2.mp4"  # Change this to your video path
print(f"Using video: {video_path}")

Using video: /content/RMA_BAR_video2.mp4


## 7. Configure Processing Parameters

In [None]:
CONFIG = {
    'max_frames': 500,  # Limit frames for faster testing (None = process all)
    'target_fps': None,    # Extract frames at this FPS (None = use video's FPS)
    'prompts': ['football', 'barcelona player', 'real madrid player'],  # Multiple prompts!
    'confidence_threshold': 0.5,  # Minimum confidence for detections
    'output_fps': 30.0  # FPS for output video
}

print("Configuration:")
for key, value in CONFIG.items():
    print(f"  {key}: {value}")

Configuration:
  max_frames: 500
  target_fps: None
  prompts: ['football', 'barcelona player', 'real madrid player']
  confidence_threshold: 0.5
  output_fps: 30.0


## 8. Process Video

This will:
1. Extract frames from video
2. Run SAM3 detection on each frame
3. Annotate frames with detections
4. Create output video

In [None]:
# Create working directories
FRAMES_DIR = "frames_input2"
ANNOTATED_DIR = "frames_annotated"
OUTPUT_VIDEO = "output_annotated.mp4"

# Step 1: Extract frames (only once)
print("\n" + "="*60)
print("STEP 1: Extracting frames")
print("="*60)
num_frames, video_fps = extract_frames(
    video_path=video_path,
    output_dir=FRAMES_DIR,
    max_frames=CONFIG['max_frames'],
    fps=CONFIG['target_fps']
)

# Step 2: Process with SAM3 for EACH PROMPT
print("\n" + "="*60)
print("STEP 2: Processing with SAM3 (Multiple Prompts)")
print("="*60)

all_detections = {}  # Store detections for each prompt

for prompt in CONFIG['prompts']:
    print(f"\n--- Processing prompt: '{prompt}' ---")
    stats = process_frames_with_sam3(
        frames_dir=FRAMES_DIR,
        prompt=prompt,
        output_dir=f"frames_{prompt.replace(' ', '_')}",  # Separate dir per prompt
        confidence_threshold=CONFIG['confidence_threshold']
    )
    all_detections[prompt] = stats

# Step 3: Combine all detections and create final video
print("\n" + "="*60)
print("STEP 3: Combining detections and creating video")
print("="*60)

# Process frames with ALL prompts combined
frames_dir = Path(FRAMES_DIR)
output_dir = Path(ANNOTATED_DIR)
output_dir.mkdir(parents=True, exist_ok=True)

frame_files = sorted(frames_dir.glob("*.jpg"))

for frame_path in tqdm(frame_files, desc="Combining all prompts"):
    image = Image.open(frame_path).convert("RGB")

    # Run detection for each prompt and combine
    all_frame_detections = []

    for prompt in CONFIG['prompts']:
        inference_state = processor.set_image(image)
        inference_state = processor.set_text_prompt(state=inference_state, prompt=prompt)
        detections = from_sam(sam_result=inference_state)
        detections = detections[detections.confidence > CONFIG['confidence_threshold']]
        all_frame_detections.append((prompt, detections))

    # Annotate with all detections
    annotated_image = image.copy()
    for prompt, detections in all_frame_detections:
        if len(detections) > 0:
            annotated_image = annotate(annotated_image, detections, label=prompt)

    # Save
    output_path = output_dir / frame_path.name
    annotated_image.save(output_path)

print(f"✓ Combined detections from {len(CONFIG['prompts'])} prompts")

# Create output video
create_video_from_frames(
    frames_dir=ANNOTATED_DIR,
    output_path=OUTPUT_VIDEO,
    fps=CONFIG['output_fps']
)

print("\n" + "="*60)
print("✓ PROCESSING COMPLETE!")
print("="*60)
print(f"Processed {len(CONFIG['prompts'])} prompts: {', '.join(CONFIG['prompts'])}")


STEP 1: Extracting frames
Video FPS: 30.00
Total frames: 2479
Frame skip: 1


Extracting frames:   0%|          | 0/500 [00:00<?, ?it/s]

✓ Extracted 500 frames to frames_input2

STEP 2: Processing with SAM3 (Multiple Prompts)

--- Processing prompt: 'football' ---


Processing with prompt 'football':   0%|          | 0/500 [00:00<?, ?it/s]

✓ Processed 500 frames
  Average detections per frame: 1.2
  Average confidence: 0.67

--- Processing prompt: 'barcelona player' ---


Processing with prompt 'barcelona player':   0%|          | 0/500 [00:00<?, ?it/s]
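
Step 2 above names each per-prompt output directory with `prompt.replace(' ', '_')`. That works for the three prompts in `CONFIG`, but a prompt containing slashes or punctuation would yield an awkward or invalid path. A minimal sketch of a safer naming helper follows; `prompt_to_dirname` is a hypothetical name, not something the notebook or SAM3 provides.

```python
import re


def prompt_to_dirname(prompt: str, prefix: str = "frames_") -> str:
    """Turn a free-text prompt into a filesystem-friendly directory name (hypothetical helper)."""
    # Collapse every run of characters outside a-z / 0-9 into a single underscore.
    slug = re.sub(r"[^a-z0-9]+", "_", prompt.lower()).strip("_")
    return f"{prefix}{slug or 'prompt'}"


for p in ["barcelona player", "Player #10 / captain"]:
    print(prompt_to_dirname(p))
# frames_barcelona_player
# frames_player_10_captain
```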

## 9. Visualize Results

Display sample annotated frames

In [None]:
import matplotlib.pyplot as plt

# Display sample frames
annotated_frames = sorted(Path(ANNOTATED_DIR).glob("*.jpg"))
# Pick up to four evenly spread frames: first, quarter, middle, last
n = len(annotated_frames)
sample_indices = sorted({0, n // 4, n // 2, n - 1}) if n else []

fig, axes = plt.subplots(2, 2, figsize=(16, 12))
axes = axes.flatten()

for ax in axes:
    ax.axis('off')  # also hides panels that stay empty

for ax, frame_idx in zip(axes, sample_indices):
    img = Image.open(annotated_frames[frame_idx])
    ax.imshow(img)
    ax.set_title(f"Frame {frame_idx} - {annotated_frames[frame_idx].name}")

plt.tight_layout()
plt.show()

print(f"\nShowing {len(sample_indices)} sample frames from {len(annotated_frames)} total frames")

## 10. Detection Statistics

In [None]:
import pandas as pd

# Create DataFrame from the stats of one prompt
# (all_detections holds one list of per-frame stats per prompt)
stats_prompt = CONFIG['prompts'][0]
df = pd.DataFrame(all_detections[stats_prompt])

print(f"Detection Statistics (prompt: '{stats_prompt}'):")
print(f"  Total frames processed: {len(df)}")
print(f"  Frames with detections: {(df['count'] > 0).sum()}")
print(f"  Average detections per frame: {df['count'].mean():.2f}")
print(f"  Max detections in a frame: {df['count'].max()}")
print(f"  Average confidence: {df[df['avg_confidence'] > 0]['avg_confidence'].mean():.2f}")

# Plot detection counts over time
plt.figure(figsize=(14, 4))
plt.plot(df['count'], marker='o', markersize=3)
plt.xlabel('Frame Number')
plt.ylabel('Number of Detections')
plt.title(f'Detections Over Time (Prompt: "{stats_prompt}")')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
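
The cell above looks at one prompt at a time. To compare prompts side by side, the per-frame records returned by `process_frames_with_sam3` (dicts with `frame`, `count`, and `avg_confidence`) can be summarized per prompt. The sketch below uses made-up sample records in place of the real `all_detections` dictionary, and `summarize_prompts` is a hypothetical helper.

```python
from statistics import mean


def summarize_prompts(all_detections: dict) -> dict:
    """Summarize per-frame stats for each prompt (hypothetical helper)."""
    summary = {}
    for prompt, records in all_detections.items():
        counts = [r["count"] for r in records]
        # Only frames with at least one detection contribute to the confidence mean.
        confs = [r["avg_confidence"] for r in records if r["count"] > 0]
        summary[prompt] = {
            "frames": len(records),
            "frames_with_detections": sum(c > 0 for c in counts),
            "mean_count": mean(counts) if counts else 0.0,
            "mean_confidence": mean(confs) if confs else None,
        }
    return summary


# Made-up records, shaped like the output of process_frames_with_sam3.
sample = {
    "football": [
        {"frame": "frame_00000.jpg", "count": 1, "avg_confidence": 0.5},
        {"frame": "frame_00001.jpg", "count": 0, "avg_confidence": 0},
    ],
}
print(summarize_prompts(sample))
```

With the real data, `summarize_prompts(all_detections)` would return one summary row per entry in `CONFIG['prompts']`.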

## 11. Download Output Video

Download the annotated video to your computer.

In [None]:
if IN_COLAB:
    from google.colab import files
    print("Downloading output video...")
    files.download(OUTPUT_VIDEO)
    print("✓ Download started!")
else:
    print(f"Output video saved at: {OUTPUT_VIDEO}")

## 12. Try Different Prompts (Optional)

Test with different prompts to detect various objects.

In [None]:
# Example: Detect soccer ball
BALL_CONFIG = {
    'prompt': 'soccer ball',
    'confidence_threshold': 0.4,
    'output_dir': 'frames_ball_annotated',
    'output_video': 'output_ball_annotated.mp4'
}

print(f"Processing with prompt: '{BALL_CONFIG['prompt']}'")

# Process frames (reusing already extracted frames)
ball_stats = process_frames_with_sam3(
    frames_dir=FRAMES_DIR,
    prompt=BALL_CONFIG['prompt'],
    output_dir=BALL_CONFIG['output_dir'],
    confidence_threshold=BALL_CONFIG['confidence_threshold']
)

# Create video
create_video_from_frames(
    frames_dir=BALL_CONFIG['output_dir'],
    output_path=BALL_CONFIG['output_video'],
    fps=CONFIG['output_fps']
)

print("\n✓ Ball detection complete!")

# Download if in Colab
if IN_COLAB:
    from google.colab import files
    files.download(BALL_CONFIG['output_video'])

## Notes

**Example Prompts** (SAM3 accepts free-form text, so these are starting points rather than a fixed list):
- `"player"` - Detect all players
- `"soccer ball"` - Detect the ball
- `"goalkeeper"` - Detect goalkeepers
- `"referee"` - Detect referees
- `"player in blue"` - Detect players wearing blue
- `"player in white"` - Detect players wearing white

**Tips:**
- Start with a small number of frames (`max_frames=50`) for quick testing
- Adjust `confidence_threshold` if you get too many/few detections
- Use `target_fps=5` for faster processing, increase for smoother video; set `output_fps` to the extraction rate, otherwise the output plays faster than real time
- SAM3 works best with clear, high-quality video footage
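
One way to choose `confidence_threshold` is to check how many detections would survive at several candidate values before rerunning the whole video. The sketch below does this for the scores of a single frame. The scores are invented for illustration, and `count_survivors` is a hypothetical helper that mirrors the `confidence > threshold` filter used in this notebook. If `Sam3Processor` already applies its own `confidence_threshold=0.3` before returning results, as the argument name suggests, lowering the later filter below 0.3 would have no effect.

```python
def count_survivors(scores, thresholds):
    """Count scores strictly above each threshold (hypothetical helper)."""
    return {t: sum(s > t for s in scores) for t in thresholds}


frame_scores = [0.92, 0.81, 0.55, 0.47, 0.36, 0.31]  # invented example scores
print(count_survivors(frame_scores, [0.3, 0.4, 0.5, 0.7]))
# {0.3: 6, 0.4: 4, 0.5: 3, 0.7: 2}
```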

**Performance** (rough estimates that vary with GPU, resolution, and number of prompts):
- GPU: ~5-10 FPS processing speed
- CPU: ~0.5-1 FPS processing speed (much slower)
- For a 30-second clip at 5 FPS: ~2-5 minutes on GPU
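
These figures can be turned into a back-of-the-envelope estimate. The sketch below assumes one full SAM3 pass over every extracted frame per prompt and ignores frame extraction, annotation, and video encoding. `estimate_minutes` is a hypothetical helper, and the throughput values are simply the rough GPU range quoted in this list.

```python
def estimate_minutes(clip_seconds, target_fps, num_prompts, frames_per_second):
    """Rough inference time in minutes, assuming one pass per prompt (hypothetical helper)."""
    frames = clip_seconds * target_fps
    return frames * num_prompts / frames_per_second / 60


# 30-second clip sampled at 5 FPS, three prompts, GPU at 5-10 frames per second.
slow = estimate_minutes(30, 5, num_prompts=3, frames_per_second=5)
fast = estimate_minutes(30, 5, num_prompts=3, frames_per_second=10)
print(f"{fast:.2f}-{slow:.2f} minutes")  # 0.75-1.50 minutes
```

Because Section 8 runs every prompt in both Step 2 and Step 3, doubling `num_prompts` gives a closer estimate for that cell.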

**Troubleshooting:**
- If you get "Out of Memory" errors, reduce `max_frames` or use lower resolution video
- If detections are poor, try adjusting the prompt or confidence threshold
- Make sure your HuggingFace token has access to SAM3 models