[Chat](https://chatgpt.com/g/g-p-684dc8075e508191ae070a06c599469d-current-things-14-june-2025/c/685942a9-bec0-8011-b909-3d2c8bd9e1d6)

Great, then let’s **break down the recognizer retraining task into Docker-based action steps**, from environment setup to training. This will help you not just finish the task, but also deliver a reusable container for your supervisor.

---

## 🐳 ACTION PLAN: Re-training the Phonetic-Word Recognizer in Docker

---

### 🔹 PHASE 1: Prepare Docker Environment

#### ✅ 1. Create Dockerfile

* [ ] Base image: `pytorch/pytorch`, pinned to a specific tag rather than `latest` so rebuilds stay reproducible (pick a CUDA-enabled tag if GPU support is needed)
* [ ] Install dependencies:

  * `transformers`
  * `datasets`
  * `torchaudio`
  * `librosa`
  * `jiwer` (for error rates)
  * `sox`, `ffmpeg` (system packages, for audio handling)
  * any forced-aligner tools, if relevant (e.g., the Montreal Forced Aligner CLI)
* [ ] Install your recognizer training code and scripts
* [ ] Set up working directory and entrypoint

✅ **Goal:** Self-contained environment to fine-tune wav2vec2 on phonetic data.
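
Once the image builds, a quick check run inside the container can confirm that everything in the list above is actually present. This is only a minimal sketch: `find_missing` is a hypothetical helper, and the package and tool names are simply copied from the list, so adjust them to whatever your Dockerfile ends up installing.

```python
import importlib.util
import shutil

# Names copied from the dependency list above; adjust to match your image.
PYTHON_PACKAGES = ("torch", "transformers", "datasets", "torchaudio", "librosa", "jiwer")
SYSTEM_TOOLS = ("sox", "ffmpeg")


def find_missing(packages=PYTHON_PACKAGES, tools=SYSTEM_TOOLS):
    """Return (missing_packages, missing_tools) for the current environment."""
    # find_spec returns None when a top-level package cannot be imported.
    missing_packages = [p for p in packages if importlib.util.find_spec(p) is None]
    # shutil.which returns None when an executable is not on PATH.
    missing_tools = [t for t in tools if shutil.which(t) is None]
    return missing_packages, missing_tools
```

Calling `find_missing()` as the last step of a build, or as the first step of the entrypoint, makes a broken image fail early instead of halfway through training.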

---

### 🔹 PHASE 2: Define Volumes / Inputs

#### ✅ 2. Organize Required Inputs

* [ ] Waxholm audio files (already converted to 16 kHz WAV)
* [ ] Corrected phonetic transcriptions, space-separated per utterance
* [ ] A manifest in CSV, TSV, or JSONL with the fields `utt_id`, `wav_path`, `phonetic_label`
* [ ] (Optional) Riksdag segments for silver data or later evaluation

✅ **Tip:** Keep data and output outside the container and mount with `-v`.
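
Before training, it helps to validate the manifest so that format problems show up immediately. The sketch below assumes the CSV variant with a header row naming the three fields; `validate_manifest` is a hypothetical helper, and the specific checks (unique IDs, `.wav` extension, non-empty label) are my assumptions about what you would want to catch.

```python
import csv

REQUIRED_FIELDS = ("utt_id", "wav_path", "phonetic_label")


def validate_manifest(handle, delimiter=","):
    """Read a manifest from an open text handle and return (rows, problems).

    rows is a list of dicts; problems is a list of human-readable strings.
    """
    reader = csv.DictReader(handle, delimiter=delimiter, skipinitialspace=True)
    missing = [f for f in REQUIRED_FIELDS if f not in (reader.fieldnames or [])]
    if missing:
        return [], [f"missing column(s): {', '.join(missing)}"]

    rows, problems, seen = [], [], set()
    # Line 1 is the header; this numbering assumes no multi-line fields.
    for line_no, row in enumerate(reader, start=2):
        utt_id = (row["utt_id"] or "").strip()
        wav_path = (row["wav_path"] or "").strip()
        label = (row["phonetic_label"] or "").strip()
        if not utt_id:
            problems.append(f"line {line_no}: empty utt_id")
        elif utt_id in seen:
            problems.append(f"line {line_no}: duplicate utt_id {utt_id!r}")
        seen.add(utt_id)
        if not wav_path.lower().endswith(".wav"):
            problems.append(f"line {line_no}: wav_path is not a .wav file")
        if not label:
            problems.append(f"line {line_no}: empty phonetic_label")
        rows.append({"utt_id": utt_id, "wav_path": wav_path, "phonetic_label": label})
    return rows, problems
```

For a TSV manifest, pass `delimiter="\t"`. Paths are not checked for existence here, because inside the container they depend on where the volume is mounted.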

---

### 🔹 PHASE 3: Train Inside Docker

#### ✅ 3. Launch Training

* [ ] Use HuggingFace `Trainer` or PyTorch directly
* [ ] Confirm:

  * Label set fully matches phoneme inventory
  * Output predictions are space-separated phonetic words
* [ ] Save:

  * Model checkpoint
  * Config + tokenizer
  * Training log
  * Error rates

✅ **Goal:** Reproducible, containerized training pipeline with traceable outputs.
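
The two "Confirm" checks and the error rates can be sketched without any dependencies. `jiwer` from the dependency list would be the usual tool for error rates; the version below is a plain-Python stand-in that computes Levenshtein distance over whitespace-separated tokens. All three function names are hypothetical, and whether a token is a single phone or a whole phonetic word depends on how your labels are written.

```python
def unknown_labels(transcripts, inventory):
    """Return the labels used in transcripts that are absent from the inventory."""
    used = {token for text in transcripts for token in text.split()}
    return sorted(used - set(inventory))


def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    previous = list(range(len(hyp) + 1))
    for i, ref_token in enumerate(ref, start=1):
        current = [i]
        for j, hyp_token in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_token != hyp_token)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current
    return previous[-1]


def token_error_rate(references, hypotheses):
    """Total edit distance divided by total reference length, over all pairs."""
    pairs = list(zip(references, hypotheses))
    errors = sum(edit_distance(r.split(), h.split()) for r, h in pairs)
    total = sum(len(r.split()) for r, _ in pairs)
    return errors / total if total else 0.0
```

An empty result from `unknown_labels` means every label in the training data is covered by the inventory. Writing the error rate to a file next to the checkpoint keeps the outputs traceable.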

---

### 🔹 PHASE 4: Package and Share

#### ✅ 4. Clean and Share Container

* [ ] Create a versioned tag for your container: `swedish-phonrec:v1`
* [ ] Write a short `README.md`:

  * Usage
  * Expected input format
  * Entrypoint to train/test
* [ ] Push to Docker Hub, GHCR, or send `.tar.gz` if needed

✅ **Bonus:** This becomes your base for other tasks your supervisor needs.

---

## ✅ Today's Concrete Subtasks

If you want *just a day's worth of progress*:

| Task                               | Estimated Time | Notes                  |
| ---------------------------------- | -------------- | ---------------------- |
| Write Dockerfile skeleton          | 30–60 min      | Use PyTorch base image |
| Set up phoneme label list          | 15–30 min      | Include missing phones |
| Format Waxholm into training CSV   | 1 hr           | Columns: `utt_id`, `wav_path`, `phonetic_label` |
| Add phonetic-word decoder to model | 30 min         | If not already set     |
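
For the decoder row, here is a minimal sketch of greedy CTC decoding into space-separated phonetic words. In a Hugging Face setup the tokenizer's decoding normally does this for you, so this only illustrates the logic. It assumes blank ID 0, `|` as the word delimiter, and one symbol per label. `ctc_greedy_decode` is a hypothetical name, and the symbols in the example are made up rather than taken from the Waxholm inventory.

```python
def ctc_greedy_decode(frame_ids, id_to_label, blank_id=0, word_delimiter="|"):
    """Collapse repeated frame labels, drop blanks, and join into phonetic words."""
    labels = []
    previous = None
    for frame_id in frame_ids:
        # Emit a label only when it changes and is not the CTC blank.
        if frame_id != previous and frame_id != blank_id:
            labels.append(id_to_label[frame_id])
        previous = frame_id
    words = "".join(labels).split(word_delimiter)
    return " ".join(word for word in words if word)


# Made-up label map for illustration only.
example_labels = {0: "<pad>", 1: "|", 2: "h", 3: "E", 4: "j"}
decoded = ctc_greedy_decode([2, 2, 0, 3, 3, 4, 1, 1, 0, 2, 0, 2], example_labels)
print(decoded)
```

A blank between two identical labels keeps both of them, which is why the second word in the example comes out with a doubled symbol.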

---

Would you like a working Dockerfile starter and a training data format example for this? I can write that next so you don’t have to think about it.


```
python ../transformers/src/transformers/models/wav2vec2/convert_wav2vec2_original_pytorch_checkpoint_to_pytorch.py \
  --pytorch_dump_folder /home/joregan/wav2vec2-swedish-phonetic-waxholm/ \
  --checkpoint_path $(find $HOME/outputs/ -name '*_best.pt' | tail -n 1) \
  --dict_path /home/joregan/waxholm_fairseq/dict.ltr.txt \
  --config_path /home/joregan/wav2vec2-swedish-phonetic-waxholm/config.json
```
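
One caveat about the command: `find` lists matches in directory order, not by time, so `tail -n 1` picks an arbitrary `*_best.pt` when several runs exist. If the intent is the most recent checkpoint, a small helper can choose by modification time. This is a sketch under that assumption, and `newest_checkpoint` is a hypothetical name.

```python
from pathlib import Path


def newest_checkpoint(root, pattern="*_best.pt"):
    """Return the most recently modified file under root matching pattern, or None."""
    candidates = [p for p in Path(root).rglob(pattern) if p.is_file()]
    return max(candidates, key=lambda p: p.stat().st_mtime, default=None)
```

The returned path can then be passed to `--checkpoint_path` in place of the `find` substitution.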

Needs:
- [X] Data
- [ ] Output model path
- [ ] KB base model