Yes, you're most likely thinking of the **multilingual wav2vec 2.0 phoneme recognizer** built on **XLSR-53** (pretrained checkpoint `xlsr_53_56k.pt`), possibly evaluated with a **SUPERB-style** phoneme recognition setup. An XLSR-based model is a **more directly competitive baseline** than Allosaurus because:

* It uses the **same architecture** (wav2vec 2.0)
* Its pretraining data includes **Swedish** (via Common Voice; XLSR-53 was pretrained on MLS, Common Voice, and BABEL)
* It can be fine-tuned with a CTC head to predict phoneme sequences (the pretrained checkpoint alone does not output phonemes)

---

## 🔎 Likely Reference Models

### 1. **XLSR-53**

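* Pretrained (self-supervised) on 53 languages, about 56k hours of audio
* Often used as the starting point for downstream phoneme or ASR fine-tuning
* Fairseq checkpoint: `xlsr_53_56k.pt` (the `56k` refers to hours of pretraining audio)
* Phoneme labels for fine-tuning are typically derived from transcripts via G2P or forced alignment (e.g., on BABEL or Common Voice)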

### 2. **SUPERB Benchmarks**

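* SUPERB's phoneme recognition task is defined on English (LibriSpeech), so it gives you an evaluation recipe rather than a Swedish-ready model
* Recipe: keep the pretrained backbone frozen, train a light CTC head for phoneme prediction, and report PER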

---

## ✅ Why This Is a Good Baseline

| Your Model                     | XLSR-Based Baseline                    |
| ------------------------------ | -------------------------------------- |
| Trained on Waxholm (Swedish)   | Trained on multilingual data           |
| Outputs phonetic words         | Outputs flat phoneme sequence          |
| Designed for dialect variation | Not dialect-aware or language-specific |
| Integrated into full lexicon   | Standalone recognizer                  |

> Comparing to XLSR-style phoneme models gives you a **head-to-head**: same architecture, different training strategy.

---

## 🧪 How to Use for Comparison

### 1. **Download a model checkpoint**

If using Fairseq:

```bash
wget https://dl.fbaipublicfiles.com/fairseq/wav2vec/xlsr_53_56k.pt
```

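### 2. **Fine-tune it on your phoneme data**

`xlsr_53_56k.pt` is a pretrained-only checkpoint with no phoneme output layer, so it needs CTC fine-tuning before it can be scored. You can skip this step only if you start from a checkpoint that has already been fine-tuned for phoneme recognition.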

### 3. **Use consistent decoding**

* Apply the same phoneme decoding (CTC decoder settings) to the baseline as to your model
* Evaluate:

  * Phone error rate (PER)
  * Segment-level accuracy
  * Coverage of prosodic and dialectal variants
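
The decoding and PER steps above can be sketched in plain Python. This is a minimal sketch, not the scoring code of any particular toolkit: it assumes greedy CTC decoding (collapse repeated frames, then drop blanks), a blank index of `0`, and PER defined as total edit distance over total reference length. The function names are hypothetical.

```python
def ctc_greedy_collapse(frame_ids, blank_id=0):
    """Collapse repeated frame labels, then drop CTC blanks.

    blank_id=0 is an assumption; use the blank index of your own vocabulary.
    """
    out = []
    prev = None
    for i in frame_ids:
        if i != prev and i != blank_id:
            out.append(i)
        prev = i
    return out


def edit_distance(ref, hyp):
    """Levenshtein distance between two phone sequences (lists)."""
    prev_row = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur_row = [i]
        for j, h in enumerate(hyp, 1):
            cur_row.append(min(
                prev_row[j] + 1,             # deletion
                cur_row[j - 1] + 1,          # insertion
                prev_row[j - 1] + (r != h),  # substitution or match
            ))
        prev_row = cur_row
    return prev_row[-1]


def phone_error_rate(refs, hyps):
    """Corpus-level PER in percent: summed edits / summed reference length."""
    if len(refs) != len(hyps):
        raise ValueError("refs and hyps must have the same number of utterances")
    total = sum(len(r) for r in refs)
    if total == 0:
        raise ValueError("reference is empty")
    errors = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    return 100.0 * errors / total
```

Running both your model's output and the baseline's output through the same two functions keeps the comparison fair, since differences in decoding or scoring conventions can otherwise shift PER by several points.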

---

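## 📊 Suggested Comparison Table

The numbers below are illustrative placeholders that show the layout, not measured results; replace them with values from your own evaluation.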

| Model            | PER (%) | Phonetic Word Match | Notes                         |
| ---------------- | ------- | ------------------- | ----------------------------- |
| Your model       | 21.3    | 72.5%               | Swedish, word-segmented       |
| XLSR-53 baseline | 30.2    | 58.7%               | Multilingual, no segmentation |
| Allosaurus       | 34.7    | 52.1%               | Symbol set mismatch           |
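
The "Symbol set mismatch" note matters because PER is only comparable when every system is scored in one phone inventory. Here is a minimal sketch of normalizing a model's output before scoring. The pairs in `SYMBOL_MAP` and the decision to strip stress marks are placeholder assumptions, not the real correspondences for any of these models; fill them in from your own phone set.

```python
import unicodedata

# Hypothetical mapping from baseline symbols to the reference inventory.
# These pairs are placeholders; replace them with the real correspondences.
SYMBOL_MAP = {
    "\u0261": "g",  # IPA script g -> ASCII g
    "\u0279": "r",  # alveolar approximant -> plain r (assumed merge)
}

# Marks removed before scoring in this sketch: primary and secondary stress.
STRIP_MARKS = {"\u02c8", "\u02cc"}


def normalize_phones(phones):
    """Map a list of phone symbols onto the shared inventory."""
    out = []
    for p in phones:
        p = unicodedata.normalize("NFC", p)  # unify composed/decomposed forms
        p = "".join(ch for ch in p if ch not in STRIP_MARKS)
        if not p:
            continue  # the token was only a stress mark
        out.append(SYMBOL_MAP.get(p, p))
    return out
```

Apply the same normalization to the reference and to every system's hypotheses, so that no model is penalized for notation alone.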

---

## ✅ Action Steps

| Task                                      | Time Estimate | Description                            |
| ----------------------------------------- | ------------- | -------------------------------------- |
| \[ ] Download XLSR checkpoint             | 5 min         | From Fairseq or HuggingFace            |
| \[ ] Fine-tune with a CTC head            | Varies        | Needed for pretrained-only checkpoints |
| \[ ] Run inference on test audio          | 30 min        | Via Fairseq or direct script           |
| \[ ] Align output with phonetic reference | 30 min        | Match to phonetic words or IPA         |
| \[ ] Evaluate                             | 30 min        | Compute PER, analyze errors            |
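
For the "analyze errors" part of the last step, a rough breakdown of substitutions, deletions, and insertions is often enough to spot systematic problems such as a vowel the baseline never produces. The sketch below uses Python's standard `difflib.SequenceMatcher`; its alignment is not guaranteed to be the minimum-edit alignment, so treat the counts as an inspection aid rather than as the PER numerator. The function name `error_breakdown` is hypothetical.

```python
from collections import Counter
from difflib import SequenceMatcher


def error_breakdown(ref, hyp):
    """Count substitutions, deletions, and insertions between phone lists."""
    subs, dels, ins = Counter(), Counter(), Counter()
    matcher = SequenceMatcher(a=ref, b=hyp, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        r, h = ref[i1:i2], hyp[j1:j2]
        n = min(len(r), len(h))
        # Pair up overlapping positions as substitutions.
        for a, b in zip(r[:n], h[:n]):
            subs[(a, b)] += 1
        # Leftover reference phones were deleted, leftover output inserted.
        for a in r[n:]:
            dels[a] += 1
        for b in h[n:]:
            ins[b] += 1
    return subs, dels, ins
```

Summing the counters over the whole test set and printing `subs.most_common(10)` gives a quick list of the most frequent confusions.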

---

Would you like:

* A minimal evaluation script for `xlsr_53_56k.pt` in Fairseq?
* Help decoding its output into phones (and mapping to your IPA set)?
* A suggestion on how to handle dialectal forms it might miss?

Let me know how deep you want to go with this comparison.

