# Model Exploration

I tried several different approaches for handling real-time object detection. Although YOLO was the obvious first choice, I wanted to see whether other models could deliver similar performance without fine-tuning. If game objects could be recognized without fine-tuning, it would open the door to a far more generalizable voice-to-action interface.

## YOLO

The full implementation and testing of the YOLO model can be found in the `model_train.ipynb` file.

Here I just want to mention my initial findings from when I was first exploring options. Without fine-tuning, the YOLO model was incapable of recognizing cards. However, it only took a handful of annotated screenshots to start seeing very high reliability in its predictions. In fact, it was so good that I ended up expanding the classes so that it detects not just cards but also the state each card is in (for instance, whether a card is tapped or sick). This is important because in Magic multiple cards often share the same name. If the player is trying to interact with a card, it helps to know which specific instance of that card they are referring to, and the state information can narrow this down. For instance, if they are trying to attack with a card, it is probably not a sick or tapped instance, since those cards cannot be used until the next turn.
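As a rough sketch of how the state classes could narrow down which instance a player means: the class names `card`, `card_tapped`, and `card_sick`, the `Detection` record, and the `attack_candidates` helper below are all hypothetical and are not taken from `model_train.ipynb`. The sketch also assumes an earlier step has already matched these detections to the spoken card name.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    label: str         # hypothetical class name: "card", "card_tapped" or "card_sick"
    confidence: float  # detector confidence in [0, 1]
    box: tuple         # (x1, y1, x2, y2) in pixels


# Hypothetical state classes that cannot attack this turn.
UNAVAILABLE_STATES = {"card_tapped", "card_sick"}


def attack_candidates(detections, min_confidence=0.5):
    """Return detections that could plausibly attack, most confident first."""
    usable = [
        d for d in detections
        if d.label not in UNAVAILABLE_STATES and d.confidence >= min_confidence
    ]
    return sorted(usable, key=lambda d: d.confidence, reverse=True)


# Three on-screen copies of the same card; only the untapped, non-sick one remains.
same_name_cards = [
    Detection("card_tapped", 0.93, (100, 600, 220, 770)),
    Detection("card_sick", 0.88, (240, 600, 360, 770)),
    Detection("card", 0.91, (380, 600, 500, 770)),
]
print(attack_candidates(same_name_cards))
```

If more than one candidate survives the filter, the interface would still need another signal (position, or a follow-up question) to pick between them.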

## GroundingDINO

One of the first options I came across was GroundingDINO. It is capable of delivering real-time performance without having to be fine-tuned.

In [None]:
import cv2
import numpy as np
from PIL import Image
from mss import mss
from groundingdino.util.inference import load_model, load_image, predict, annotate
import groundingdino.datasets.transforms as T
import torch
from torchvision import transforms

### GroundingDINO Testing

In [None]:
model = load_model("GroundingDINO/groundingdino/config/GroundingDINO_SwinT_OGC.py", "GroundingDINO/weights/groundingdino_swint_ogc.pth")

In [None]:
IMAGE_PATH = "../mtga_data/z_screen_16.png"
#IMAGE_PATH = "shot.png"

TEXT_PROMPT = "card"
BOX_TRESHOLD = 0.25
TEXT_TRESHOLD = 0.25

image_source, image = load_image(IMAGE_PATH)

print(type(image))
print(image.shape)
print(type(image_source))

boxes, logits, phrases = predict(
    model=model,
    image=image,
    caption=TEXT_PROMPT,
    box_threshold=BOX_TRESHOLD,
    text_threshold=TEXT_TRESHOLD
)

annotated_frame = annotate(image_source=image_source, boxes=boxes, logits=logits, phrases=phrases)
cv2.imwrite("annotated_image2.jpg", annotated_frame)

### Real-time GroundingDINO Model

In [None]:
# Configure the model (reloaded here so this cell can run on its own)

model = load_model("GroundingDINO/groundingdino/config/GroundingDINO_SwinT_OGC.py", "GroundingDINO/weights/groundingdino_swint_ogc.pth")
TEXT_PROMPT = "orange button"
BOX_TRESHOLD = 0.35
TEXT_TRESHOLD = 0.25

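def load_cv2_image(cv2_image):
    """Mirror load_image() for an in-memory RGB array instead of a file path.

    Returns the original image as a NumPy array and the normalized tensor for the model.
    """
    transform = T.Compose(
        [
            T.RandomResize([800], max_size=1333),
            T.ToTensor(),
            T.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
        ]
    )
    image_source = Image.fromarray(cv2_image)
    image = np.asarray(image_source)
    # The transform takes an (image, target) pair; there is no target here
    image_transformed, _ = transform(image_source, None)
    return image, image_transformed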


monitor = {"top": 0, "left": 0, "width": 1920, "height": 1080}
sct = mss()

def process_frame(frame):
    image = cv2.cvtColor(np.array(frame), cv2.COLOR_BGRA2RGB)
    image_source, processed_image = load_cv2_image(image)

    boxes, logits, phrases = predict(
        model=model,
        image=processed_image,
        caption=TEXT_PROMPT,
        box_threshold=BOX_TRESHOLD,
        text_threshold=TEXT_TRESHOLD
    )

    annotated_frame = annotate(image_source=image_source, boxes=boxes, logits=logits, phrases=phrases)

    return annotated_frame

def main():
    cv2.namedWindow("Live Labels", cv2.WINDOW_NORMAL)
    cv2.resizeWindow("Live Labels", 960, 540)

    while True:
        screen = sct.grab(monitor)
        
        labeled_frame = process_frame(screen)

        cv2.imshow("Live Labels", labeled_frame)

        if cv2.waitKey(1) & 0xFF == ord("q"):
            break

    cv2.destroyAllWindows()

if __name__ == "__main__":
    main()

![GroundingDINO Output](images/GroundingDINO_output.jpg)
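To put a number on how close a loop like the one above gets to real time, a small timing harness can help. This is only a sketch: `measure_fps` is a hypothetical helper that is not part of the notebook, and `fake_process` is a stand-in that sleeps instead of running a model.

```python
import time


def measure_fps(process, frames, warmup=1):
    """Average frames per second of `process` over `frames`, skipping warm-up calls."""
    frames = list(frames)
    # The first calls are often slower (model warm-up, caches), so run them untimed.
    for frame in frames[:warmup]:
        process(frame)
    timed = frames[warmup:]
    if not timed:
        raise ValueError("need more frames than warm-up iterations")
    start = time.perf_counter()
    for frame in timed:
        process(frame)
    elapsed = time.perf_counter() - start
    return len(timed) / max(elapsed, 1e-9)


def fake_process(frame):
    """Stand-in for process_frame: pretends inference takes about 10 ms."""
    time.sleep(0.01)
    return frame


fps = measure_fps(fake_process, range(6))
print(f"{fps:.1f} frames per second")
```

In the notebook, the same helper could be pointed at `process_frame` with a list of frames captured up front, for example `[sct.grab(monitor) for _ in range(20)]`, so that screen capture time is not counted as inference time.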

### GroundingDINO Conclusion

GroundingDINO is reasonably good at zero-shot object identification. However, it is much less consistent than a tuned model. As can be seen in the image above, it is not very confident in its predictions, and it often cuts off the mana cost of cards, since the cost tends to float slightly above the card (at the top right). It is also significantly slower than YOLO and struggles with multi-class detection. For instance, I would like both UI elements and cards to be detected, but GroundingDINO's results degrade as the number of classes in the prompt increases, and it usually fails to identify either class consistently.
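One possible workaround for the clipped mana cost is to pad each box upward before using it. The sketch below assumes the boxes are normalized `(cx, cy, w, h)` values, which is how the `annotate` helper appears to treat them. The function name `to_padded_pixel_boxes` and the `top_pad` fraction are hypothetical and would need tuning against real screenshots.

```python
import numpy as np


def to_padded_pixel_boxes(boxes_cxcywh, image_width, image_height, top_pad=0.08):
    """Convert normalized (cx, cy, w, h) boxes to pixel (x1, y1, x2, y2),
    extending the top edge by `top_pad` times the box height."""
    boxes = np.asarray(boxes_cxcywh, dtype=float).reshape(-1, 4)
    cx, cy, w, h = boxes.T
    x1 = (cx - w / 2) * image_width
    x2 = (cx + w / 2) * image_width
    y1 = (cy - h / 2 - top_pad * h) * image_height  # padded top edge
    y2 = (cy + h / 2) * image_height
    xyxy = np.stack([x1, y1, x2, y2], axis=1)
    # Keep every box inside the image.
    xyxy[:, [0, 2]] = np.clip(xyxy[:, [0, 2]], 0, image_width)
    xyxy[:, [1, 3]] = np.clip(xyxy[:, [1, 3]], 0, image_height)
    return xyxy


# One box centered in a 1000x500 image, padded upward by 10% of its height.
print(to_padded_pixel_boxes([[0.5, 0.5, 0.2, 0.4]], 1000, 500, top_pad=0.1))
```

With the tensors returned by `predict`, the call would look something like `to_padded_pixel_boxes(boxes.numpy(), width, height)`. Padding only fixes the cropping; it does nothing for the low confidence or the multi-class problem.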

## Segment Anything

I also tried segmentation with a SAM model. I thought that image segmentation might let me detect objects, and then a later step could classify them or run OCR to identify the discrete objects.

In [None]:
from segment_anything import sam_model_registry, SamAutomaticMaskGenerator
import matplotlib.pyplot as plt

In [None]:
sam_checkpoint = "sam_vit_b_01ec64.pth"  # Used the lightest model for speed
model_type = "vit_b"  # other options were vit_l and vit_h
device = "cuda" if torch.cuda.is_available() else "cpu"

sam = sam_model_registry[model_type](checkpoint=sam_checkpoint)
sam.to(device)

In [None]:
mask_generator = SamAutomaticMaskGenerator(
    sam,
    points_per_side=16,
    pred_iou_thresh=0.8,
    stability_score_thresh=0.95,
    min_mask_region_area=100
)

print("Model loaded successfully")

# Load and preprocess the image
image_path = "z_screen_13.png"  # Replace with your image path
image = cv2.imread(image_path)
image_rgb = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)  # Convert BGR to RGB for SAM

# Resize for faster processing
image_resized = cv2.resize(image_rgb, (0, 0), fx=0.5, fy=0.5)

print(f"Image shape: {image_resized.shape}")

# Generate masks
masks = mask_generator.generate(image_resized)

print(f"Number of masks: {len(masks)}")

# Visualize and save the segmentation results
def overlay_masks(image, masks):
    """Overlay masks on the image with random colors."""
    overlay = image.copy()
    for mask in masks:
        segmentation = mask["segmentation"]
        color = np.random.randint(0, 255, size=(3,), dtype=np.uint8)
        overlay[segmentation] = 0.5 * overlay[segmentation] + 0.5 * color
        contours, _ = cv2.findContours(segmentation.astype(np.uint8), cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
        cv2.drawContours(overlay, contours, -1, color.tolist(), thickness=2)
    return overlay

# Apply the overlay
segmented_image = overlay_masks(image_resized, masks)

# Save the result
output_path = "segmented_image.jpg"
cv2.imwrite(output_path, cv2.cvtColor(segmented_image, cv2.COLOR_RGB2BGR))

![SAM Output](images/SAM_output.png)

### Segment Anything Conclusion

This segmentation model was sometimes very accurate at detecting card objects but could not do so consistently. Sometimes it identified shapes within cards rather than the card itself, and it often focused on background objects rather than the cards. It also proved far too slow to be used for real-time inference. I still think it would be an interesting option to explore in the future, as I believe there are faster segmentation models. I also think a fine-tuned segmentation model could be more accurate than YOLO for clicking game objects, since it finds exact edges rather than bounding boxes, which could help with overlapping or rotated cards.
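To illustrate why exact edges could matter for clicking, here is a minimal sketch that picks a click target from a boolean mask such as the `mask["segmentation"]` arrays used above. The `click_point` helper is hypothetical. It uses the mask centroid and, when the centroid falls outside the shape (as it can for a card partly covered by another), snaps to the nearest pixel that is actually part of the mask.

```python
import numpy as np


def click_point(mask):
    """Return an (x, y) pixel that is guaranteed to lie inside a boolean mask."""
    ys, xs = np.nonzero(mask)
    if len(xs) == 0:
        raise ValueError("mask is empty")
    cx, cy = xs.mean(), ys.mean()
    # The centroid of a concave or partly hidden shape may not be on the shape,
    # so choose the mask pixel closest to it.
    nearest = np.argmin((xs - cx) ** 2 + (ys - cy) ** 2)
    return int(xs[nearest]), int(ys[nearest])


# An L-shaped mask, roughly what a card half covered by another might look like.
# The center of its bounding box is not part of the mask.
l_shape = np.zeros((10, 10), dtype=bool)
l_shape[:, 0:2] = True
l_shape[8:10, :] = True

x, y = click_point(l_shape)
print((x, y), bool(l_shape[y, x]))
```

The center of a bounding box around the same shape would land on whatever is covering the card, which is the case where a mask-based target should do better than a box-based one.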