# [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/harpreetsahota204/document_visual_ai_with_fiftyone_workshop/blob/main/04_evaluation.ipynb)


In [None]:
!pip install fiftyone python-Levenshtein

Let's install some plugins to help us along the way. Run the following in your terminal:

1. `fiftyone plugins download https://github.com/jacobmarks/keyword-search-plugin`

2. `fiftyone plugins download https://github.com/harpreetsahota204/caption-viewer`

3. `fiftyone plugins download https://github.com/voxel51/fiftyone-plugins --plugin-names @voxel51/dashboard`

This plugin is the main one for this notebook: `fiftyone plugins download https://github.com/harpreetsahota204/text_evaluation_metrics`

In [None]:
!fiftyone plugins download https://github.com/harpreetsahota204/text_evaluation_metrics

### Load local dataset

You can load the dataset we created in the first notebook as follows:

In [None]:
import fiftyone as fo

dataset = fo.load_dataset("neurips-2025-vision-papers")

### (Alternatively) Load dataset from Hugging Face Hub

If you're picking up in a fresh Colab notebook or didn't go through the first notebook, you can download the [Visual AI at NeurIPS 2025 dataset with the embeddings from the Jina models we used in the previous notebook](https://huggingface.co/datasets/harpreetsahota/visual_ai_at_neurips2025_jina_with_ocr), hosted on Hugging Face.

Note that this dataset already includes the parsed OCR results, which saves you from having to run inference yourself. These models take a painfully long time to run.

In [None]:
import fiftyone as fo
from fiftyone.utils.huggingface import load_from_hub

dataset = load_from_hub("harpreetsahota/visual_ai_at_neurips2025_jina_with_ocr", persistent=True)

In [None]:
import fiftyone as fo

# In later sessions, reload the persistent dataset by name instead of downloading it again
dataset = fo.load_dataset("harpreetsahota/visual_ai_at_neurips2025_jina_with_ocr")

## Text Evaluation Metrics

## Best Practices

1. **Start with ANLS**: It's the standard metric for VLM OCR tasks

2. **Use Exact Match as a secondary metric**: Provides a strict accuracy baseline


### Compute ANLS (Average Normalized Levenshtein Similarity)

**Average Normalized Levenshtein Similarity** - Primary metric for VLM OCR evaluation. It:

- Normalizes edit distance by string length
- Applies a configurable threshold (typically 0.5)
- Returns 1.0 if similarity ≥ threshold, otherwise returns the similarity score
- Is robust to minor OCR errors

**Use case**: Primary evaluation metric for OCR tasks, VLM document understanding
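
To make the ingredients concrete, here is a minimal pure-Python sketch of a per-sample score. It follows the formulation commonly used in the ANLS literature, where a sample whose normalized edit distance reaches the threshold scores 0. The plugin's own thresholding, as described in the bullets above, may differ, so the function names (`levenshtein`, `anls_score`, `mean_anls`) and the threshold rule are illustrative assumptions, not the plugin's implementation.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or substitutions."""
    if len(a) < len(b):
        a, b = b, a  # iterate over the longer string, keep rows short
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def anls_score(pred: str, gt: str, threshold: float = 0.5,
               case_sensitive: bool = False) -> float:
    """Per-sample score (assumed rule: 0 once the normalized distance reaches the threshold)."""
    if not case_sensitive:
        pred, gt = pred.lower(), gt.lower()
    longest = max(len(pred), len(gt))
    if longest == 0:
        return 1.0  # assumption: two empty strings count as a perfect match
    normalized_distance = levenshtein(pred, gt) / longest
    return 1.0 - normalized_distance if normalized_distance < threshold else 0.0


def mean_anls(pairs, threshold: float = 0.5) -> float:
    """Average the per-sample scores over (prediction, ground truth) pairs."""
    scores = [anls_score(pred, gt, threshold) for pred, gt in pairs]
    return sum(scores) / len(scores) if scores else 0.0


# Toy strings, not taken from the dataset
print(anls_score("Visual AI at NeurlPS", "Visual AI at NeurIPS"))
print(anls_score("completely different", "Visual AI at NeurIPS"))
```

Normalizing by the longer of the two strings keeps the score in the 0.0 to 1.0 range regardless of which string is longer.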


In [None]:
import fiftyone as fo
import fiftyone.operators as foo

anls_op = foo.get_operator("@harpreetsahota/text-evaluation-metrics/compute_anls")

result = anls_op(
    dataset, 
    pred_field="md_abstract", 
    gt_field="abstract", 
    output_field="md_anls_score",
    threshold=0.5,
    case_sensitive=False,
    delegate=False
    )

In [None]:
dataset

### Compute Exact Match

**Binary exact match accuracy** between prediction and ground truth. 

- Case-sensitive option
- Whitespace stripping option
- Returns 1.0 for perfect match, 0.0 otherwise

**Use case**: Strict evaluation where partial credit isn't appropriate (e.g., form field extraction)
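
As a rough illustration of how the two options interact, here is a small pure-Python sketch. The function name `exact_match` and the default values of both flags are assumptions made for this example; check the plugin for its actual defaults.

```python
def exact_match(pred: str, gt: str, case_sensitive: bool = True,
                strip_whitespace: bool = True) -> float:
    """Return 1.0 if the strings match after optional normalization, else 0.0."""
    if strip_whitespace:
        # Only leading and trailing whitespace is removed here
        pred, gt = pred.strip(), gt.strip()
    if not case_sensitive:
        pred, gt = pred.lower(), gt.lower()
    return 1.0 if pred == gt else 0.0


# Toy strings, not taken from the dataset
print(exact_match("  Invoice 42 ", "Invoice 42"))
print(exact_match("invoice 42", "Invoice 42"))
print(exact_match("invoice 42", "Invoice 42", case_sensitive=False))
```

Because a single differing character drops the score to 0.0, this metric is usually too strict for long fields like abstracts and more useful for short, structured values.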

In [None]:
import fiftyone as fo
import fiftyone.operators as foo

em_op = foo.get_operator("@harpreetsahota/text-evaluation-metrics/compute_exact_match")

result = em_op(
    dataset, 
    pred_field="md_abstract", 
    gt_field="abstract",
    output_field="md_exact_match_score",
    delegate=False
    )

### Compute Normalized Similarity

**Continuous similarity score** (0.0-1.0) without threshold

- No threshold applied
- Full range of similarity values
- Useful for ranking and analysis

**Use case**: Fine-grained analysis, ranking samples by similarity

In [None]:
import fiftyone as fo
import fiftyone.operators as foo

sim_op = foo.get_operator("@harpreetsahota/text-evaluation-metrics/compute_normalized_similarity")

result = sim_op(
    dataset, 
    pred_field="md_abstract", 
    gt_field="abstract",
    output_field="md_norm_sim_score",
    delegate=False
    )

### Compute CER

**Character Error Rate** - Ratio of character-level edits needed to transform prediction into ground truth.

- Based on Levenshtein distance at character level
- Lower is better (0.0 = perfect)
- Case-sensitive by default

**Use case**: Detailed character-level error analysis, language-agnostic evaluation

In [None]:
import fiftyone as fo
import fiftyone.operators as foo

cer_op = foo.get_operator("@harpreetsahota/text-evaluation-metrics/compute_cer")

result = cer_op(
    dataset, 
    pred_field="md_abstract", 
    gt_field="abstract",
    output_field="md_cer_score",
    delegate=False
    )


### Compute WER

**Word Error Rate** - Ratio of word-level edits needed to transform prediction into ground truth.

- Based on Levenshtein distance at word level
- Lower is better (0.0 = perfect)
- Case-sensitive by default

**Use case**: Speech recognition, word-level accuracy analysis
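
CER and WER can be seen as the same computation applied to different tokens: characters for CER, whitespace-separated words for WER. The sketch below assumes the conventional definition, where the edit distance is divided by the length of the ground truth. The helper names and the handling of an empty ground truth are assumptions for illustration, not the plugin's code.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance between two token sequences."""
    previous = list(range(len(hyp) + 1))
    for i, ref_token in enumerate(ref, start=1):
        current = [i]
        for j, hyp_token in enumerate(hyp, start=1):
            cost = 0 if ref_token == hyp_token else 1
            current.append(min(
                previous[j] + 1,         # token missing from the prediction
                current[j - 1] + 1,      # extra token in the prediction
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def error_rate(ref_tokens: Sequence, hyp_tokens: Sequence) -> float:
    """Edits divided by ground truth length; can exceed 1.0 with many insertions."""
    if not ref_tokens:
        # Assumption: empty ground truth scores 0.0 only if the prediction is empty too
        return 0.0 if not hyp_tokens else 1.0
    return edit_distance(ref_tokens, hyp_tokens) / len(ref_tokens)


def cer(pred: str, gt: str) -> float:
    return error_rate(list(gt), list(pred))


def wer(pred: str, gt: str) -> float:
    return error_rate(gt.split(), pred.split())


# Toy strings: one OCR slip ("w" read as "vv") inside a single word
gt = "the quick brown fox"
pred = "the quick brovvn fox"
print(f"CER: {cer(pred, gt):.3f}")
print(f"WER: {wer(pred, gt):.3f}")
```

In this toy case two character edits out of 19 characters give a small CER, while one wrong word out of four gives a WER of 0.25, which is why the two metrics can diverge on the same sample.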

In [None]:
import fiftyone as fo
import fiftyone.operators as foo

wer_op = foo.get_operator("@harpreetsahota/text-evaluation-metrics/compute_wer")

result = wer_op(
    dataset, 
    pred_field="md_abstract", 
    gt_field="abstract",
    output_field="md_wer_score",
    delegate=False
    )

### Launch the App and build some dashboards

You can launch the App as follows:


In [None]:
import fiftyone as fo
session = fo.launch_app(dataset, auto=False)
session.url



#### Create the following scatter plots:

**1. ANLS vs. Character Error Rate (CER)**  
   - **Why**: This is the **most informative** pairing. ANLS is a threshold-based similarity score (close to binary, with some partial credit), while CER is a continuous error metric that measures character-level edits
   - **What it tells us**: 
     - Shows the relationship between high-level OCR quality and character-level error details
     - Reveals whether high ANLS scores correlate with low CER (as expected)
     - Can identify cases where strings look "similar enough" (high ANLS) but have significant character-level problems
     - Helps validate metric consistency

**2. Word Error Rate (WER) vs. Character Error Rate (CER)** 
   - **Why**: Both are continuous error metrics at different levels of granularity
   - **What it tells us**:
     - Shows whether word-level and character-level errors scale together
     - Identifies cases where a few character errors affect entire words vs. minor character mistakes
     - Reveals if the OCR errors are more systematic (affecting whole words) or scattered (individual characters)
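
If you want a number to go with a scatter plot, a correlation coefficient is one option. The sketch below is a plain-Python Pearson correlation run on toy values. In the notebook, the two lists could instead come from calls such as `dataset.values("md_cer_score")` and `dataset.values("md_wer_score")`; the helper name `pearson` and the handling of missing values are assumptions for this example.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length score lists, skipping missing values."""
    # Drop pairs where either score is None (e.g. samples without a computed metric)
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    if len(xs) != len(ys) or len(pairs) < 2:
        raise ValueError("need two equal-length lists with at least two valid pairs")
    mean_x = sum(x for x, _ in pairs) / len(pairs)
    mean_y = sum(y for _, y in pairs) / len(pairs)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    var_y = sum((y - mean_y) ** 2 for _, y in pairs)
    if var_x == 0 or var_y == 0:
        return float("nan")  # correlation is undefined for a constant list
    return covariance / math.sqrt(var_x * var_y)


# Toy values standing in for per-sample scores
cer_scores = [0.00, 0.02, 0.05, 0.10, 0.30]
wer_scores = [0.00, 0.05, 0.12, 0.20, 0.55]
r = pearson(cer_scores, wer_scores)
print(f"Pearson r between CER and WER: {r:.3f}")
```

A value near 1 suggests word-level and character-level errors scale together. Samples far from the trend line in the scatter plot are the ones worth opening in the App.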


## Evaluate Classifications

FiftyOne has a nice [evaluation API](https://docs.voxel51.com/user_guide/evaluation.html) that you can use to assess how well a model performs.

By default, `evaluate_classifications` will treat your classifications as generic multiclass predictions, and it will evaluate each prediction by directly comparing its label to the associated ground truth label.
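
Conceptually, a direct label comparison boils down to counting how often the predicted label equals the ground truth label. The pure-Python sketch below illustrates that idea on toy labels. It is not how FiftyOne implements evaluation internally, and the function name and example labels are made up for illustration.

```python
from collections import Counter


def simple_classification_eval(gt_labels, pred_labels):
    """Return accuracy and a Counter of (ground truth, prediction) pairs."""
    if len(gt_labels) != len(pred_labels):
        raise ValueError("ground truth and predictions must have the same length")
    confusion = Counter(zip(gt_labels, pred_labels))
    total = sum(confusion.values())
    correct = sum(count for (gt, pred), count in confusion.items() if gt == pred)
    accuracy = correct / total if total else 0.0
    return accuracy, confusion


# Toy labels, not taken from the dataset
gt_labels = ["cs.CV", "cs.CV", "cs.LG", "cs.CL"]
pred_labels = ["cs.CV", "cs.LG", "cs.LG", "cs.CL"]

accuracy, confusion = simple_classification_eval(gt_labels, pred_labels)
print(f"accuracy: {accuracy:.2f}")
for (gt, pred), count in sorted(confusion.items()):
    print(f"gt={gt} pred={pred} count={count}")
```

The off-diagonal pairs (where `gt` and `pred` differ) are the confusions you would inspect to see which categories the model mixes up.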

In [None]:
results = dataset.evaluate_classifications(
    "arxiv_category_predictions",
    gt_field="arxiv_category_mapped",
    eval_key="mapped_eval",
)

In [None]:
results = dataset.evaluate_classifications(
    "unmapped_arxiv_category_predictions",
    gt_field="arxiv_category",
    eval_key="unmapped_eval",
)

Let's parse the string output we got from Moondream3 as actual FiftyOne Classifications:

In [None]:
import fiftyone as fo

# Get all values from your string field
string_values = dataset.values("md_mapped_categories")

# Create Classification objects from the string values
classifications = []
for value in string_values:
    if value:  # Only create classification if value exists
        classifications.append(fo.Classification(label=value))
    else:
        classifications.append(None)

# Set the classifications to a new field
dataset.set_values("md_mapped_categories_classes", classifications)

And then run the evaluation method:

In [None]:
results = dataset.evaluate_classifications(
    "md_mapped_categories_classes",
    gt_field="arxiv_category_mapped",
    eval_key="md_mapped_eval",
)

## Evaluate Detections

Although this dataset doesn't have ground truth detections, I still want to illustrate how we would perform this task.

Let's assume that the `miner_text_detections` are the ground truth, and that the field on the Dataset with the predictions is `text_detections`.
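
Detection evaluation works by matching predicted boxes to ground truth boxes based on how much they overlap, typically measured as intersection over union (IoU). Here is a minimal sketch of IoU for two boxes in FiftyOne's relative `[top-left-x, top-left-y, width, height]` format. The function name `iou` and the toy boxes are illustrative; FiftyOne computes the overlaps for you during evaluation.

```python
def iou(box_a, box_b):
    """Intersection over union of two [x, y, width, height] boxes."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b

    # Width and height of the overlapping region (zero if the boxes are disjoint)
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    intersection = inter_w * inter_h

    union = aw * ah + bw * bh - intersection
    return intersection / union if union > 0 else 0.0


# Toy boxes in relative coordinates, not taken from the dataset
gt_box = [0.0, 0.0, 0.5, 0.5]
pred_box = [0.25, 0.0, 0.5, 0.5]
print(f"IoU: {iou(gt_box, pred_box):.3f}")
```

A predicted box generally counts as a match only when its IoU with a ground truth box meets a threshold, so boxes that are shifted by half their width, as in this toy case, would often be scored as misses.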

We can use the [`evaluate_detections`](https://docs.voxel51.com/api/fiftyone.core.collections.html#fiftyone.core.collections.SampleCollection.evaluate_detections) method to assess model performance:

In [None]:
dataset.evaluate_detections(
    "text_detections",
    gt_field="miner_text_detections",
    eval_key="eval_detections",
    classwise=False,  # allow matches between classes; set to True to only match objects with the same class label
)

In [22]:
dataset

Name:        harpreetsahota/visual_ai_at_neurips2025_jina_with_ocr
Media type:  image
Num samples: 1134
Persistent:  True
Tags:        []
Sample fields:
    id:                                  fiftyone.core.fields.ObjectIdField
    filepath:                            fiftyone.core.fields.StringField
    tags:                                fiftyone.core.fields.ListField(fiftyone.core.fields.StringField)
    metadata:                            fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.metadata.ImageMetadata)
    created_at:                          fiftyone.core.fields.DateTimeField
    last_modified_at:                    fiftyone.core.fields.DateTimeField
    type:                                fiftyone.core.fields.StringField
    name:                                fiftyone.core.fields.StringField
    virtualsite_url:                     fiftyone.core.fields.StringField
    abstract:                            fiftyone.core.fields.StringField
    arxiv_id:        

### My assignment to you

Compute the various metrics we introduced in this notebook using the OCR outputs from the other models. You can compare two model outputs against one another to see how much they differ!
