## Testing Magi on Japanese Manga/Comics

The Manga Whisperer (Magi), developed by Ragav Sachdeva and Andrew Zisserman, aims to automatically generate transcripts for comics. It impressively detects panels, text blocks, and characters, and organizes the transcript by attributing each line to a character.

- Paper: https://arxiv.org/pdf/2401.10224
- Hugging Face: https://huggingface.co/ragavsachdeva/magi

The goal of this notebook is to see how suitable Magi is for detecting Japanese text in comics.

I had trouble running the provided example code, which loads Magi directly through Hugging Face's transformers library, on my M3 Mac. Instead, I cloned the repo into this project and imported the model from the local copy.


In [1]:
# Imports and configuration
import numpy as np
from PIL import Image
import torch
import os
import json

import sys
sys.path.append("./magi")  # Add the magi directory to Python path

img_location = "./test_manga"



In [2]:
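def read_image_as_np_array(image_path):
    with open(image_path, "rb") as file:
        # Convert to grayscale, then back to RGB, so the array has three identical channels
        image = Image.open(file).convert("L").convert("RGB")
        image = np.array(image)
    return image

# Collect the image paths in sorted order (os.listdir returns them in arbitrary order)
image_paths = sorted(f"{img_location}/{x}" for x in os.listdir(img_location))
images = [read_image_as_np_array(path) for path in image_paths]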
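
On macOS, an image folder often contains hidden files such as `.DS_Store`, which `Image.open` cannot read. A minimal sketch of a stricter listing step follows; `list_image_paths` and the `IMAGE_EXTENSIONS` set are assumptions for illustration and are not part of Magi.

```python
from pathlib import Path

# Assumed set of extensions; adjust to whatever the test folder actually holds
IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg", ".webp"}

def list_image_paths(folder):
    """Return sorted paths of image files in `folder`, skipping everything else."""
    return sorted(
        str(p)
        for p in Path(folder).iterdir()
        if p.is_file() and p.suffix.lower() in IMAGE_EXTENSIONS
    )
```

The result could be passed to `read_image_as_np_array` in place of the raw `os.listdir` output.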

In [3]:
# Check if MPS (Metal Performance Shaders) is available for Apple Silicon
device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")

device

device(type='mps')

In [4]:
from magi.configuration_magi import MagiConfig
from magi.modelling_magi import MagiModel

# Read the config file
with open("./magi/config.json", "r") as f:
    config_dict = json.load(f)

# Create the MagiConfig instance
config = MagiConfig(**config_dict)

# Create the model directly using MagiModel
model = MagiModel(config).to(device)

# Load the locally stored weights into the model
state_dict = torch.load("./magi/pytorch_model.bin", map_location=device)
model.load_state_dict(state_dict)



<All keys matched successfully>

In [5]:
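# A directly constructed nn.Module starts in training mode; switch to eval mode to disable dropout
model.eval()
with torch.no_grad():
    results = model.predict_detections_and_associations(images)
    text_bboxes_for_all_images = [x["texts"] for x in results]
    ocr_results = model.predict_ocr(images, text_bboxes_for_all_images)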

In [6]:
for i in range(len(images)):
    model.visualise_single_image_prediction(images[i], results[i], filename=f"image_{i}.png")
    model.generate_transcript_for_single_image(results[i], ocr_results[i], filename=f"transcript_{i}.txt")

<img src="./test_magi_results/image_1.png" style="width:300px" />

As shown, the model impressively detected the text regions. However, as the transcript below shows, the OCR step outputs alphanumeric (English) text instead of the Japanese characters on the page, so it is not particularly helpful for this project.

```
 ### Transcript ###
<1>: This week is
<1>: 30.5.7.4% of the amount
<?>: About 10,000%
<?>: 1.7.3.7D The
<1>: 27(1) The Council:
<1>: “I think it’s a good thing,” he said.
<4>: “But
<?>: SEME!
<?>: #1: "All right here!"
<?>: It is difficult to be
<?>: I'm sure that
<?>: Too the best

```
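
A quick way to confirm this without reading every transcript is to measure how much of each line is written in Japanese script. The sketch below is not part of Magi: `japanese_ratio` and `parse_transcript` are hypothetical helpers, and the Unicode ranges cover only hiragana, katakana, and the basic CJK ideograph block, which is an assumption that ignores rarer kanji extensions and full-width punctuation.

```python
import re

# Hiragana, katakana, and CJK Unified Ideographs (assumed to be enough for a rough check)
JAPANESE_CHAR = re.compile(r"[\u3040-\u309F\u30A0-\u30FF\u4E00-\u9FFF]")

# Transcript lines look like "<1>: some text" or "<?>: some text"
TRANSCRIPT_LINE = re.compile(r"^<([^>]+)>:\s*(.*)$")

def japanese_ratio(text):
    """Return the share of non-whitespace characters that are Japanese script."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum(1 for c in chars if JAPANESE_CHAR.match(c)) / len(chars)

def parse_transcript(transcript):
    """Return (speaker, text) pairs, skipping headers and blank lines."""
    pairs = []
    for line in transcript.splitlines():
        match = TRANSCRIPT_LINE.match(line.strip())
        if match:
            pairs.append((match.group(1), match.group(2)))
    return pairs
```

Every line of the transcript above scores `0.0` with `japanese_ratio`, since none of them contains a single kana or kanji character.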

Edit: I later realized that this model was trained on English comics, so it is not a great fit for my needs.