# X Mark Detection using GLIP

## Overview
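This notebook detects x marks in video frames using a text-prompted (open-vocabulary) object detector.
The notebook and its function names refer to GLIP (Grounded Language-Image Pre-training), but the checkpoint loaded below is Grounding DINO, a closely related vision-language model that likewise detects objects from text descriptions.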

## Advantages of GLIP
- **No manual feature engineering**: Uses deep learning to understand visual patterns
- **Text-based queries**: Simply describe what to detect ("x mark", "cross", "printed x")
- **Robust to variations**: Often tolerates changes in lighting, viewing angle, and partial occlusion
- **Pre-trained**: Leverages knowledge from large-scale training data

## Requirements
```bash
pip install torch torchvision transformers opencv-python matplotlib pillow
pip install groundingdino-py  # Optional: standalone GroundingDINO package (the code below uses transformers)
```

Note: The code below loads the model through the Hugging Face transformers implementation, so groundingdino-py is optional. The official GroundingDINO repository is another alternative.

In [24]:
# Imports
import cv2
import numpy as np
import matplotlib.pyplot as plt
from pathlib import Path
from PIL import Image
import torch
from typing import List, Tuple, Dict
import warnings
warnings.filterwarnings('ignore')

# Check if CUDA is available
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")
print(f"PyTorch version: {torch.__version__}")
print(f"OpenCV version: {cv2.__version__}")

# Set up matplotlib
plt.rcParams['figure.figsize'] = (16, 10)
plt.rcParams['image.cmap'] = 'gray'

Using device: cpu
PyTorch version: 2.9.1
OpenCV version: 4.12.0


In [25]:
# Load GLIP/GroundingDINO Model
# Option 1: Using transformers (if available)
try:
    from transformers import AutoProcessor, AutoModelForZeroShotObjectDetection
    
    model_id = "IDEA-Research/grounding-dino-tiny"  # Or "IDEA-Research/grounding-dino-base"
    processor = AutoProcessor.from_pretrained(model_id)
    model = AutoModelForZeroShotObjectDetection.from_pretrained(model_id).to(device)
    model.eval()
    
    print(f"✓ Loaded model: {model_id}")
    print(f"  Model on device: {next(model.parameters()).device}")
    USE_TRANSFORMERS = True
    
except ImportError:
    # Only a missing package is handled here; a failed model download raises a different error
    print("transformers library not available")
    print("Please install: pip install transformers")
    print("\nAlternatively, you can use the official GroundingDINO repository:")
    print("git clone https://github.com/IDEA-Research/GroundingDINO.git")
    USE_TRANSFORMERS = False
    raise

✓ Loaded model: IDEA-Research/grounding-dino-tiny
  Model on device: cpu


In [26]:
# Configuration
VIDEO_PATH = "../../videos/example.mp4"
OUTPUT_CSV = "glip_x_detection_results.csv"

# Detection parameters
TEXT_PROMPT = "x mark . cross . printed x"  # Text description (use periods to separate multiple terms)
CONFIDENCE_THRESHOLD = 0.43  # Minimum confidence score (0-1)
SAMPLE_EVERY_N_FRAMES = 30  # Process every Nth frame (30 = 1 frame per second at 30fps)

# Processing parameters
MAX_IMAGE_SIZE = 800  # Maximum length of the longer side in pixels (aspect ratio preserved, never upscaled)

print("✓ Configuration loaded")
print(f"  Text prompt: '{TEXT_PROMPT}'")
print(f"  Confidence threshold: {CONFIDENCE_THRESHOLD}")
print(f"  Sampling rate: every {SAMPLE_EVERY_N_FRAMES} frames")

✓ Configuration loaded
  Text prompt: 'x mark . cross . printed x'
  Confidence threshold: 0.38
  Sampling rate: every 30 frames
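
The period-separated prompt format can also be built programmatically. The sketch below is a hypothetical helper (`build_prompt` is not part of the notebook or of any library). It lowercases each phrase, joins the phrases with ` . `, and appends a final period, which the Hugging Face Grounding DINO documentation recommends. Whether the trailing period changes results for this video has not been checked here.

```python
from typing import Iterable


def build_prompt(phrases: Iterable[str]) -> str:
    """Join phrases into a lowercase, period-separated prompt ending with a period."""
    cleaned = [p.strip().lower() for p in phrases if p.strip()]
    if not cleaned:
        raise ValueError("At least one non-empty phrase is required")
    return " . ".join(cleaned) + " ."


# Same three phrases as TEXT_PROMPT, plus the trailing period
print(build_prompt(["x mark", "cross", "printed x"]))
```

Keeping the phrases in a list makes it easy to add or remove descriptions while experimenting, without editing the separator characters by hand.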


In [27]:
# Load video metadata
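cap = cv2.VideoCapture(VIDEO_PATH)
if not cap.isOpened():
    raise FileNotFoundError(f"Could not open video: {VIDEO_PATH}")
fps = cap.get(cv2.CAP_PROP_FPS)
total_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
cap.release()

# Guard against a zero or missing FPS before dividing
if fps <= 0:
    raise ValueError("Video reports an invalid FPS; cannot compute duration")
duration_sec = total_frames / fps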

frames_to_process = total_frames // SAMPLE_EVERY_N_FRAMES

print("Video Metadata:")
print(f"  Resolution: {width}x{height}")
print(f"  FPS: {fps:.2f}")
print(f"  Total Frames: {total_frames}")
print(f"  Duration: {duration_sec:.2f} seconds")
print(f"\nProcessing:")
print(f"  Frames to process: ~{frames_to_process}")
print(f"  Estimated time: {frames_to_process * 0.5:.1f}-{frames_to_process * 2:.1f} seconds (depends on hardware)")

Video Metadata:
  Resolution: 1280x720
  FPS: 30.00
  Total Frames: 31211
  Duration: 1040.37 seconds

Processing:
  Frames to process: ~1040
  Estimated time: 520.0-2080.0 seconds (depends on hardware)
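
The frame estimate above uses floor division, which gives 1040 for this video. The processing loop later in the notebook takes frame 0 and every 30th frame after it, so the exact count is one higher whenever the total is not a multiple of the sampling interval. A minimal sketch of that arithmetic follows; `sampled_frame_count` is a hypothetical helper, and the 0.5 to 2 seconds per frame is the notebook's own rough assumption rather than a measurement.

```python
import math


def sampled_frame_count(total_frames: int, every_n: int) -> int:
    """Number of indices in range(total_frames) that are multiples of every_n."""
    if every_n < 1:
        raise ValueError("every_n must be at least 1")
    return math.ceil(total_frames / every_n)


count = sampled_frame_count(31211, 30)
print(count)  # 1041
print(f"Estimated time: {count * 0.5:.1f}-{count * 2:.1f} seconds")
```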


In [None]:
# Detection Functions

def resize_image(image: np.ndarray, max_size: int = 800) -> Tuple[np.ndarray, float]:
    """
    Resize image while maintaining aspect ratio.
    Returns: (resized_image, scale_factor)
    """
    h, w = image.shape[:2]
    scale = min(max_size / max(h, w), 1.0)  # Don't upscale
    
    if scale < 1.0:
        new_w = int(w * scale)
        new_h = int(h * scale)
        resized = cv2.resize(image, (new_w, new_h), interpolation=cv2.INTER_LINEAR)
        return resized, scale
    
    return image, 1.0


def detect_x_with_glip(image: np.ndarray, text_prompt: str, 
                       confidence_threshold: float = 0.25) -> Dict:
    """
    Detect x marks in image using GLIP.
    
    Returns:
        Dictionary containing:
        - x_detected: bool
        - confidence: float (max confidence if multiple detections)
        - num_detections: int
        - boxes: list of [x1, y1, x2, y2] in original image coordinates
        - scores: list of confidence scores
        - labels: list of detected label strings
    """
    # Resize image for faster processing
    resized_img, scale = resize_image(image, MAX_IMAGE_SIZE)
    
    # Convert to PIL Image (RGB)
    if len(resized_img.shape) == 2:  # Grayscale
        pil_image = Image.fromarray(resized_img).convert('RGB')
    else:
        pil_image = Image.fromarray(resized_img)
    
    # Run detection
    with torch.no_grad():
        inputs = processor(images=pil_image, text=text_prompt, return_tensors="pt").to(device)
        outputs = model(**inputs)
    
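    # Post-process results. Note: this call applies its own default box and text
    # thresholds, so a confidence_threshold below those defaults has no further
    # effect unless the thresholds are passed explicitly (the keyword names
    # differ between transformers versions).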
    results = processor.post_process_grounded_object_detection(
        outputs,
        inputs.input_ids,
        target_sizes=[pil_image.size[::-1]]  # (height, width)
    )[0]
    
    # Extract detections
    boxes = results['boxes'].cpu().numpy()  # [x1, y1, x2, y2] format
    scores = results['scores'].cpu().numpy()
    labels = results['labels']
    
    # Filter by confidence threshold
    valid_indices = scores >= confidence_threshold
    boxes = boxes[valid_indices]
    scores = scores[valid_indices]
    labels = [label for i, label in enumerate(labels) if valid_indices[i]]
    
    # Scale boxes back to original image size
    if scale < 1.0 and len(boxes) > 0:
        boxes = boxes / scale
    
    # Prepare result
    num_detections = len(boxes)
    x_detected = num_detections > 0
    max_confidence = float(scores.max()) if x_detected else 0.0
    
    return {
        'x_detected': x_detected,
        'confidence': max_confidence,
        'num_detections': num_detections,
        'boxes': boxes.tolist(),
        'scores': scores.tolist(),
        'labels': labels
    }


def visualise_detections(image: np.ndarray, detection_result: Dict) -> np.ndarray:
    """
    Draw detection boxes and labels on image.
    """
    vis_image = image.copy()
    
    if len(vis_image.shape) == 2:  # Convert grayscale to RGB for visualisation
        vis_image = cv2.cvtColor(vis_image, cv2.COLOR_GRAY2RGB)
    
    if not detection_result['x_detected']:
        # No detections - add "NO X" label
        cv2.putText(vis_image, 'NO X DETECTED', (20, 50),
                   cv2.FONT_HERSHEY_SIMPLEX, 1.5, (255, 0, 0), 3)
        return vis_image
    
    # Draw each detection
    for box, score, label in zip(detection_result['boxes'], 
                                  detection_result['scores'], 
                                  detection_result['labels']):
        x1, y1, x2, y2 = map(int, box)
        
        # Draw bounding box
        cv2.rectangle(vis_image, (x1, y1), (x2, y2), (0, 255, 0), 3)
        
        # Draw label with confidence
        label_text = f"{label}: {score:.2f}"
        
        # Background for text
        (text_w, text_h), baseline = cv2.getTextSize(
            label_text, cv2.FONT_HERSHEY_SIMPLEX, 0.6, 2
        )
        cv2.rectangle(vis_image, (x1, y1 - text_h - baseline - 5), 
                     (x1 + text_w, y1), (0, 255, 0), -1)
        
        # Draw text
        cv2.putText(vis_image, label_text, (x1, y1 - 5),
                   cv2.FONT_HERSHEY_SIMPLEX, 0.6, (0, 0, 0), 2)
    
    # Add summary text
    summary = f"Detected: {detection_result['num_detections']} x mark(s)"
    cv2.putText(vis_image, summary, (20, 50),
               cv2.FONT_HERSHEY_SIMPLEX, 1.2, (0, 255, 0), 3)
    
    return vis_image


print("✓ Detection functions defined")
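
The coordinate handling in `resize_image` and `detect_x_with_glip` can be checked by hand. The sketch below reproduces only the arithmetic in plain Python, with no OpenCV or model involved. The helper names `downscale_factor` and `to_original_coords` and the example box are illustrative assumptions, not part of the notebook.

```python
from typing import List


def downscale_factor(height: int, width: int, max_size: int = 800) -> float:
    """Scale that brings the longer side down to max_size, never above 1.0."""
    return min(max_size / max(height, width), 1.0)


def to_original_coords(box: List[float], scale: float) -> List[float]:
    """Map an [x1, y1, x2, y2] box from the resized frame back to the original frame."""
    return [coord / scale for coord in box]


scale = downscale_factor(720, 1280)                    # 0.625 for a 1280x720 frame
resized_size = (int(1280 * scale), int(720 * scale))   # (800, 450)
print(scale, resized_size)

# A hypothetical box found in the 800x450 frame, mapped back to 1280x720
print(to_original_coords([100.0, 50.0, 200.0, 150.0], scale))  # [160.0, 80.0, 320.0, 240.0]
```

Dividing by the scale is the inverse of the resize, which is why the detection function only rescales boxes when the frame was actually shrunk.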

In [None]:
# Test on Sample Frames

print("Testing GLIP on sample frames...\n")

# Sample 100 frames evenly distributed across the video
sample_frame_indices = np.linspace(0, total_frames - 1, 100, dtype=int)

cap = cv2.VideoCapture(VIDEO_PATH)

for idx in sample_frame_indices:
    cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
    ret, frame = cap.read()
    
    if not ret:
        continue
    
    # Convert BGR to RGB
    frame_rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
    
    # Detect
    result = detect_x_with_glip(frame_rgb, TEXT_PROMPT, CONFIDENCE_THRESHOLD)
    
    # Visualise
    vis_frame = visualise_detections(frame_rgb, result)
    
    # Display
    fig, axes = plt.subplots(1, 2, figsize=(16, 6))
    
    axes[0].imshow(frame_rgb)
    axes[0].set_title('Original Frame')
    axes[0].axis('off')
    
    axes[1].imshow(vis_frame)
    axes[1].set_title('Detection Result')
    axes[1].axis('off')
    
    timestamp = idx / fps
    plt.suptitle(f"Frame {idx} (t={timestamp:.2f}s) - X Detected: {result['x_detected']} (conf={result['confidence']:.3f})", 
                fontsize=14)
    plt.tight_layout()
    plt.show()
    
    # Print details
    print(f"Frame {idx} (t={timestamp:.2f}s):")
    print(f"  X Detected: {result['x_detected']}")
    print(f"  Num Detections: {result['num_detections']}")
    if result['x_detected']:
        print(f"  Max Confidence: {result['confidence']:.3f}")
        for i, (score, label) in enumerate(zip(result['scores'], result['labels'])):
            print(f"    Detection {i+1}: {label} ({score:.3f})")
    print()

cap.release()
print("✓ Sample frame testing complete")

In [30]:
# Process Full Video

import pandas as pd
import time

print(f"Processing video: {VIDEO_PATH}")
print(f"Sampling every {SAMPLE_EVERY_N_FRAMES} frames\n")

results = []
cap = cv2.VideoCapture(VIDEO_PATH)
frame_idx = 0
start_time = time.time()

print("Progress: ", end='', flush=True)

while True:
    ret, frame = cap.read()
    if not ret:
        break
    
    # Sample frames
    if frame_idx % SAMPLE_EVERY_N_FRAMES == 0:
        # Convert BGR to RGB
        frame_rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
        
        # Detect
        result = detect_x_with_glip(frame_rgb, TEXT_PROMPT, CONFIDENCE_THRESHOLD)
        
        # Store result
        timestamp = frame_idx / fps
        results.append({
            'frame_number': frame_idx,
            'timestamp_sec': timestamp,
            'x_present': result['x_detected'],
            'confidence_score': result['confidence'],
            'num_detections': result['num_detections'],
            'detection_method': 'GLIP'
        })
        
        # Progress indicator
        if len(results) % 10 == 0:
            progress = (frame_idx / total_frames) * 100
            print(f"{progress:.1f}%", end='... ', flush=True)
    
    frame_idx += 1

cap.release()
elapsed_time = time.time() - start_time

print("100% - Done!\n")

# Convert to DataFrame
results_df = pd.DataFrame(results)

# Display summary
print(f"✓ Processing complete in {elapsed_time:.2f} seconds")
print(f"  Processed {len(results_df)} frames")
print(f"  Average: {len(results_df)/elapsed_time:.2f} frames/second\n")

print("--- Detection Summary ---")
x_present_count = results_df['x_present'].sum()
print(f"Frames with X detected: {x_present_count} / {len(results_df)} ({x_present_count/len(results_df)*100:.1f}%)")

if x_present_count > 0:
    print(f"Average confidence when X detected: {results_df[results_df['x_present']]['confidence_score'].mean():.3f}")
    print(f"Average detections per frame (when X present): {results_df[results_df['x_present']]['num_detections'].mean():.2f}")

if x_present_count < len(results_df):
    print(f"Average confidence when X not detected: {results_df[~results_df['x_present']]['confidence_score'].mean():.3f}")

# Show first few results
print("\nFirst 10 results:")
print(results_df.head(10).to_string())

Processing video: ../../videos/example.mp4
Sampling every 30 frames

Progress: 0.9%... 

KeyboardInterrupt: 

In [None]:
# Visualise Results Over Time

fig, axes = plt.subplots(3, 1, figsize=(16, 10))

# Plot 1: X presence over time
axes[0].plot(results_df['timestamp_sec'], results_df['x_present'].astype(int), 
             'o-', markersize=3, linewidth=0.8)
axes[0].fill_between(results_df['timestamp_sec'], 0, results_df['x_present'].astype(int), 
                      alpha=0.3, label='X Present')
axes[0].set_xlabel('Time (seconds)')
axes[0].set_ylabel('X Present (1=Yes, 0=No)')
axes[0].set_title('X Detection Over Time (GLIP)')
axes[0].grid(True, alpha=0.3)
axes[0].set_ylim(-0.1, 1.1)
axes[0].legend()

# Plot 2: Confidence scores over time
axes[1].plot(results_df['timestamp_sec'], results_df['confidence_score'], 
             'o-', markersize=3, linewidth=0.8, alpha=0.7, color='blue')
axes[1].axhline(y=CONFIDENCE_THRESHOLD, color='r', linestyle='--', 
               label=f'Confidence Threshold ({CONFIDENCE_THRESHOLD})')
axes[1].set_xlabel('Time (seconds)')
axes[1].set_ylabel('Confidence Score')
axes[1].set_title('Detection Confidence Over Time')
axes[1].grid(True, alpha=0.3)
axes[1].legend()

# Plot 3: Number of detections per frame
axes[2].plot(results_df['timestamp_sec'], results_df['num_detections'], 
             'o-', markersize=3, linewidth=0.8, alpha=0.7, color='green')
axes[2].set_xlabel('Time (seconds)')
axes[2].set_ylabel('Number of Detections')
axes[2].set_title('Number of X Marks Detected Per Frame')
axes[2].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("✓ Visualisation complete")

In [None]:
# Save Results to CSV

results_df.to_csv(OUTPUT_CSV, index=False)
print(f"✓ Results saved to: {OUTPUT_CSV}")

# Create time ranges summary when X is present
x_present_df = results_df[results_df['x_present']].copy()

if len(x_present_df) > 0:
    # Group consecutive detections (allowing gaps of up to 2 sampling intervals)
    gap_threshold = SAMPLE_EVERY_N_FRAMES * 2 / fps  # Convert to seconds
    x_present_df['group'] = (x_present_df['timestamp_sec'].diff() > gap_threshold).cumsum()
    
    time_ranges = []
    for group_id in x_present_df['group'].unique():
        group_data = x_present_df[x_present_df['group'] == group_id]
        start_time = group_data['timestamp_sec'].min()
        end_time = group_data['timestamp_sec'].max()
        duration = end_time - start_time
        avg_confidence = group_data['confidence_score'].mean()
        total_detections = group_data['num_detections'].sum()
        
        time_ranges.append({
            'start_sec': start_time,
            'end_sec': end_time,
            'duration_sec': duration,
            'avg_confidence': avg_confidence,
            'total_detections': total_detections
        })
    
    ranges_df = pd.DataFrame(time_ranges)
    print(f"\n--- Time Ranges When X is Present ---")
    print(ranges_df.to_string(index=False))
    
    # Save time ranges
    ranges_csv = OUTPUT_CSV.replace('.csv', '_time_ranges.csv')
    ranges_df.to_csv(ranges_csv, index=False)
    print(f"\n✓ Time ranges saved to: {ranges_csv}")
else:
    print("\nNo X detected in any frame.")

print("\n" + "="*60)
print("PROCESSING COMPLETE")
print("="*60)
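
The grouping rule used for the time ranges (start a new range whenever the gap to the previous detection exceeds the threshold) can be illustrated without pandas. This is a minimal pure-Python sketch of the same rule; `group_timestamps` and the sample timestamps are hypothetical.

```python
from typing import List, Tuple


def group_timestamps(timestamps: List[float], gap_threshold: float) -> List[Tuple[float, float]]:
    """Merge timestamps into (start, end) ranges, splitting where the gap exceeds gap_threshold."""
    ranges: List[Tuple[float, float]] = []
    for t in sorted(timestamps):
        if ranges and t - ranges[-1][1] <= gap_threshold:
            ranges[-1] = (ranges[-1][0], t)   # extend the current range
        else:
            ranges.append((t, t))             # start a new range
    return ranges


# Hypothetical detection times in seconds, sampled once per second
print(group_timestamps([0.0, 1.0, 3.0, 10.0, 11.0], gap_threshold=2.0))
# [(0.0, 3.0), (10.0, 11.0)]
```

With a threshold of two sampling intervals, a single missed sample (the gap from 1.0 to 3.0) does not split a range, while a longer gap does. A range built from a single detection has zero duration.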

## Notes and Tips

### Adjusting Detection Parameters

1. **Text Prompt**: Experiment with different descriptions:
   - `"x mark . cross . printed x"` - Multiple terms increase recall
   - `"printed x mark on laminate sheet"` - More specific context
   - `"dark x mark . black cross"` - Color/appearance descriptors

2. **Confidence Threshold**: 
   - Lower (0.15-0.25): More detections, higher false positives
   - Higher (0.30-0.40): Fewer false positives, might miss some x marks

3. **Sampling Rate**:
   - Process every frame: `SAMPLE_EVERY_N_FRAMES = 1` (slower but complete)
   - Process 1 per second: `SAMPLE_EVERY_N_FRAMES = 30` for 30 fps video (faster)
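
To see how the confidence threshold trades recall against false positives, it can help to count how many detections survive at several candidate thresholds. The sketch below uses made-up scores and a hypothetical helper, `count_above_thresholds`. It assumes the raw scores were collected with a post-processing threshold low enough that weak detections were not already discarded.

```python
from typing import Dict, List


def count_above_thresholds(scores: List[float], thresholds: List[float]) -> Dict[float, int]:
    """For each threshold, count how many scores are at or above it."""
    return {t: sum(s >= t for s in scores) for t in thresholds}


# Hypothetical confidence scores gathered from a handful of frames
scores = [0.12, 0.22, 0.31, 0.38, 0.47, 0.55]
for threshold, kept in count_above_thresholds(scores, [0.15, 0.25, 0.40]).items():
    print(f"threshold {threshold:.2f}: {kept} detection(s) kept")
```

Inspecting a few frames near each cut-off is usually more informative than the counts alone, since the counts do not say which detections are correct.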

### Model Selection

- **grounding-dino-tiny**: Faster, less accurate (good for quick testing)
- **grounding-dino-base**: Slower, more accurate (better for production)

### Performance Tips

- Use GPU if available (much faster)
- Adjust `MAX_IMAGE_SIZE` to balance speed vs accuracy
- Process in batches if you have multiple videos
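
For the multiple-videos case, one simple arrangement is to give each video its own results file and loop over the videos. The helper below is a hypothetical sketch: `plan_outputs`, the `results` directory, and the file-naming scheme are illustrative assumptions, and the per-video detection loop itself is left out.

```python
from pathlib import Path
from typing import Dict, Iterable


def plan_outputs(video_paths: Iterable[str], output_dir: str = "results") -> Dict[str, Path]:
    """Map each video path to its own results CSV path (naming scheme is an assumption)."""
    out = Path(output_dir)
    return {str(p): out / f"{Path(p).stem}_x_detection_results.csv" for p in video_paths}


# Hypothetical input paths
for video, csv_path in plan_outputs(["videos/a.mp4", "videos/b.mov"]).items():
    print(f"{video} -> {csv_path}")
```

Loading the model once and reusing it across all videos avoids paying the model-loading cost for each file.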

### Troubleshooting

- **Too many false positives**: Increase confidence threshold or refine text prompt
- **Missing detections**: Lower confidence threshold or try different text descriptions
- **Slow processing**: Sample fewer frames (increase `SAMPLE_EVERY_N_FRAMES`), reduce `MAX_IMAGE_SIZE`, or run on a GPU