# FiftyOne Text Evaluation Metrics Plugin

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/harpreetsahota204/text_evaluation_metrics/blob/main/text_evaluation_demo.ipynb)

This notebook demonstrates the **Text Evaluation Metrics** plugin for FiftyOne.

## Available Metrics

1. **ANLS** - Average Normalized Levenshtein Similarity (primary VLM OCR metric)
2. **Exact Match** - Binary exact match accuracy
3. **Normalized Similarity** - Continuous similarity score with no threshold applied
4. **CER** - Character Error Rate
5. **WER** - Word Error Rate

🔗 [GitHub Repository](https://github.com/harpreetsahota204/text_evaluation_metrics)

## Installation

First, install FiftyOne and the required dependencies.

In [None]:
!pip install -q fiftyone python-Levenshtein

### Install the Plugin

Download and install the plugin directly from GitHub.

In [None]:
!fiftyone plugins download https://github.com/harpreetsahota204/text_evaluation_metrics

## Setup

Import libraries and check versions.

In [None]:
import fiftyone as fo
import fiftyone.operators as foo

print(f"FiftyOne version: {fo.__version__}")

### Create Sample Dataset
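
The cells that follow expect a `dataset` whose samples carry `description`, `prediction`, and `ground_truth` string fields. The sketch below is one way to build such a dataset; the text pairs, the placeholder file paths, and the helper name `build_dataset` are all hypothetical and chosen only for illustration.

```python
# Hypothetical text pairs for illustration only; these are not real model outputs.
SAMPLE_PAIRS = [
    {"description": "Perfect match", "prediction": "Invoice #12345", "ground_truth": "Invoice #12345"},
    {"description": "Case difference", "prediction": "total amount", "ground_truth": "Total Amount"},
    {"description": "Two wrong chars", "prediction": "Recieved", "ground_truth": "Received"},
    {"description": "Missing word", "prediction": "the quick fox", "ground_truth": "the quick brown fox"},
    {"description": "Unrelated text", "prediction": "hello", "ground_truth": "Invoice #12345"},
]


def build_dataset(pairs, name=None):
    """Build a FiftyOne dataset with one sample per text pair (hypothetical helper)."""
    # Imported here so the sample data above can be inspected without FiftyOne installed.
    import fiftyone as fo

    dataset = fo.Dataset(name)
    samples = [
        # Every FiftyOne sample needs a filepath; these placeholder files need not exist.
        fo.Sample(filepath=f"/tmp/sample_{i}.png", **pair)
        for i, pair in enumerate(pairs)
    ]
    dataset.add_samples(samples)
    return dataset
```

Calling `dataset = build_dataset(SAMPLE_PAIRS)` would then provide the `dataset` variable used in the remaining cells.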

---
## 1. Compute ANLS

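**ANLS** is the primary metric for VLM OCR evaluation. It averages a per-sample normalized Levenshtein similarity and, in its standard definition, scores a sample as 0 once its normalized edit distance reaches the threshold (0.5 in the call below).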
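
To make the metric concrete, here is a minimal pure-Python sketch of the standard ANLS definition. It is not the plugin's implementation: the function names are hypothetical, and it assumes no text normalization (case, whitespace) and normalization by the longer string's length.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions, and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # iterate over the longer string, keep the DP row short
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def anls_score(pred: str, gt: str, threshold: float = 0.5) -> float:
    """Per-sample score: 1 - NL when NL < threshold, otherwise 0."""
    longest = max(len(pred), len(gt))
    if longest == 0:
        return 1.0  # assumption: two empty strings count as a perfect match
    nl = levenshtein(pred, gt) / longest  # normalized Levenshtein distance
    return 1.0 - nl if nl < threshold else 0.0


def mean_anls(pairs, threshold: float = 0.5) -> float:
    """Average the per-sample scores over (prediction, ground_truth) pairs."""
    scores = [anls_score(p, g, threshold) for p, g in pairs]
    return sum(scores) / len(scores)


print(anls_score("Recieved", "Received"))     # 2 edits over 8 chars -> 0.75
print(anls_score("hello", "Invoice #12345"))  # too dissimilar -> 0.0
```

The threshold is what separates ANLS from a plain similarity score: predictions that are mostly wrong receive no partial credit.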

In [None]:
anls_op = foo.get_operator("@harpreetsahota/text-evaluation-metrics/compute_anls")

result = anls_op(dataset, pred_field="prediction", gt_field="ground_truth", threshold=0.5)

print(f"Mean ANLS: {result['mean_anls']:.3f}")
print("\nPer-Sample ANLS:")
for s in dataset:
    print(f"{s['description']:20s} | {s['prediction_anls']:.3f}")

---
## 2. Compute Exact Match

**Exact Match** returns 1.0 when the prediction matches the ground truth exactly and 0.0 otherwise.
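
A minimal sketch of the idea, independent of the plugin. The `case_sensitive` and `strip` flags are hypothetical and shown only to illustrate how normalization choices change the result; whether the plugin exposes similar options is not assumed here.

```python
def exact_match(pred: str, gt: str, case_sensitive: bool = True, strip: bool = False) -> float:
    """Return 1.0 if the strings are identical after optional normalization, else 0.0."""
    if strip:
        pred, gt = pred.strip(), gt.strip()
    if not case_sensitive:
        pred, gt = pred.casefold(), gt.casefold()
    return 1.0 if pred == gt else 0.0


def accuracy(pairs, **kwargs) -> float:
    """Fraction of (prediction, ground_truth) pairs that match exactly."""
    scores = [exact_match(p, g, **kwargs) for p, g in pairs]
    return sum(scores) / len(scores)


print(exact_match("Total Amount", "total amount"))                        # 0.0
print(exact_match("Total Amount", "total amount", case_sensitive=False))  # 1.0
```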

In [None]:
em_op = foo.get_operator("@harpreetsahota/text-evaluation-metrics/compute_exact_match")

result = em_op(dataset, pred_field="prediction", gt_field="ground_truth")

print(f"Accuracy: {result['accuracy']:.1%}")
print("\nPer-Sample Exact Match:")
for s in dataset:
    match = "✓" if s['prediction_exact_match'] == 1.0 else "✗"
    print(f"{s['description']:20s} | {match}")

---
## 3. Compute Normalized Similarity

**Normalized Similarity** gives a continuous score with no threshold applied, so partially correct predictions keep partial credit.
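
Written out, and assuming the score is one minus the Levenshtein distance divided by the longer string's length (the usual convention, not confirmed against the plugin source), the relationship to ANLS with threshold $\tau$ is:

```latex
s(p, g) = 1 - \frac{\operatorname{lev}(p, g)}{\max(|p|, |g|)},
\qquad
\mathrm{ANLS}_{\tau}(p, g) =
\begin{cases}
  s(p, g) & \text{if } 1 - s(p, g) < \tau \\
  0       & \text{otherwise}
\end{cases}
```

For example, a prediction of "Recieved" against "Received" needs 2 substitutions over 8 characters, so $s = 0.75$ under both metrics. A pair with $s = 0.4$ keeps that value as its similarity but scores 0 under ANLS at $\tau = 0.5$.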

In [None]:
sim_op = foo.get_operator("@harpreetsahota/text-evaluation-metrics/compute_normalized_similarity")

result = sim_op(dataset, pred_field="prediction", gt_field="ground_truth")

print(f"Mean Similarity: {result['mean_similarity']:.3f}")
print("\nPer-Sample Similarity (sorted):")
for s in sorted(dataset, key=lambda x: x['prediction_similarity'], reverse=True):
    bar = "█" * int(s['prediction_similarity'] * 30)
    print(f"{s['description']:20s} | {s['prediction_similarity']:.3f} | {bar}")

---
## 4. Compute CER (Character Error Rate)

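**CER** measures the character-level edits (substitutions, deletions, insertions) needed to turn the prediction into the ground truth, relative to the ground-truth length. Lower is better.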
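
The following sketch shows the conventional CER computation: edit distance divided by the number of reference characters. The function names are hypothetical, and the handling of an empty reference is a convention chosen for this example rather than something taken from the plugin.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance between two sequences (characters or tokens)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion from the reference
                curr[j - 1] + 1,         # insertion into the reference
                prev[j - 1] + (r != h),  # substitution (free when items match)
            ))
        prev = curr
    return prev[-1]


def cer(pred: str, gt: str) -> float:
    """Character error rate: edits divided by the ground-truth length."""
    if not gt:
        # Assumption for this sketch: empty reference scores 0 only if pred is empty too.
        return 0.0 if not pred else 1.0
    return edit_distance(gt, pred) / len(gt)


print(cer("Recieved", "Received"))  # 2 edits / 8 chars -> 0.25
```

Because the denominator is the reference length, CER can exceed 1.0 when the prediction is much longer than the ground truth.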

In [None]:
cer_op = foo.get_operator("@harpreetsahota/text-evaluation-metrics/compute_cer")

result = cer_op(dataset, pred_field="prediction", gt_field="ground_truth")

print(f"Mean CER: {result['mean_cer']:.3f} (lower is better)")
print("\nPer-Sample CER (sorted):")
for s in sorted(dataset, key=lambda x: x['prediction_cer']):
    score = s['prediction_cer']
    quality = "Excellent" if score < 0.1 else "Good" if score < 0.2 else "Fair" if score < 0.5 else "Poor"
    print(f"{s['description']:20s} | {score:.3f} | {quality}")

---
## 5. Compute WER (Word Error Rate)

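**WER** measures the word-level edits (substitutions, deletions, insertions) needed to turn the prediction into the ground truth, relative to the number of ground-truth words. Lower is better.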
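
In its conventional form, and assuming words are obtained by splitting on whitespace (the plugin's tokenization is not confirmed here), WER counts substitutions $S$, deletions $D$, and insertions $I$ against the $N$ words of the ground truth:

```latex
\mathrm{WER} = \frac{S + D + I}{N}
\qquad\text{e.g.}\qquad
\underbrace{\text{the quick brown fox}}_{\text{ground truth},\; N = 4}
\;\text{vs.}\;
\underbrace{\text{the quick fox}}_{\text{prediction}}
\;\Rightarrow\;
\mathrm{WER} = \frac{0 + 1 + 0}{4} = 0.25
```

A single wrong character inside a word counts as a full word substitution, which is why WER is usually harsher than CER on the same pair.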

In [None]:
wer_op = foo.get_operator("@harpreetsahota/text-evaluation-metrics/compute_wer")

result = wer_op(dataset, pred_field="prediction", gt_field="ground_truth")

print(f"Mean WER: {result['mean_wer']:.3f} (lower is better)")
print("\nPer-Sample WER (sorted):")
for s in sorted(dataset, key=lambda x: x['prediction_wer']):
    score = s['prediction_wer']
    quality = "Excellent" if score == 0.0 else "Good" if score < 0.3 else "Fair" if score < 0.6 else "Poor"
    print(f"{s['description']:20s} | {score:.3f} | {quality}")

---
## 6. Comprehensive Comparison

Compare all metrics side-by-side.

In [None]:
print("=" * 100)
print("COMPREHENSIVE METRIC COMPARISON")
print("=" * 100)
print(f"{'Description':20s} | {'ANLS':>6s} | {'Exact':>5s} | {'Sim':>6s} | {'CER':>6s} | {'WER':>6s}")
print("-" * 100)
for s in dataset:
    print(f"{s['description']:20s} | {s['prediction_anls']:6.3f} | {s['prediction_exact_match']:5.0f} | {s['prediction_similarity']:6.3f} | {s['prediction_cer']:6.3f} | {s['prediction_wer']:6.3f}")

print("\n" + "=" * 100)
print("SUMMARY STATISTICS")
print("=" * 100)
print(f"Mean ANLS:       {dataset.mean('prediction_anls'):.3f}")
print(f"Accuracy:        {dataset.mean('prediction_exact_match'):.1%}")
print(f"Mean Similarity: {dataset.mean('prediction_similarity'):.3f}")
print(f"Mean CER:        {dataset.mean('prediction_cer'):.3f}")
print(f"Mean WER:        {dataset.mean('prediction_wer'):.3f}")

---
## Conclusion

This notebook demonstrated all five text evaluation metrics provided by the plugin.

### Key Takeaways
- **ANLS** is the primary metric for VLM OCR tasks
- **Exact Match** provides a strict accuracy baseline
- **Normalized Similarity** shows how errors are distributed, since no score is cut off by a threshold
- **CER** and **WER** break errors down at the character and word level

### Resources
- 📚 [Plugin Documentation](https://github.com/harpreetsahota204/text_evaluation_metrics)
- 🌐 [FiftyOne Docs](https://docs.voxel51.com/)

**Author:** Harpreet Sahota | **License:** Apache 2.0