This notebook works through running the [DataPerf Speech](https://www.dataperf.org/training-set-selection-speech) challenge evaluation with a [baseline selection algorithm](https://github.com/harvard-edge/dataperf-speech-example/blob/main/selection/implementations/baseline_selection.py).

We start by cloning our example selection algorithm repository and installing some additional dependencies not preinstalled in Colab environments:

In [None]:
!pip install -q fire wget
!git clone https://github.com/harvard-edge/dataperf-speech-example/
import sys
sys.path.append("/content/dataperf-speech-example/")
import os
os.chdir("/content/dataperf-speech-example/")

Next, we download the spoken-word embeddings, which are used both for training-set (coreset) selection and for evaluation.

In [None]:
!python utils/download_data.py --output_path workspace/data 1> /dev/null
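
Because the download cell discards standard output, it can be worth confirming that the data landed where the later cells expect it. The sketch below is only a convenience check, not part of the challenge tooling: the helper name `missing_paths` is hypothetical, and the list of expected entries is inferred from the paths passed to the selection and evaluation commands in this notebook.

```python
from pathlib import Path

# Entries inferred from the command-line arguments used later in this notebook.
EXPECTED = [
    "allowed_training_set.yaml",
    "eval.yaml",
    "train_embeddings",
    "eval_embeddings",
]


def missing_paths(data_root, languages=("en", "id", "pt")):
    """Return the expected files/directories that are absent under data_root."""
    root = Path(data_root)
    missing = []
    for lang in languages:
        lang_dir = root / f"dataperf_{lang}_data"
        for name in EXPECTED:
            path = lang_dir / name
            if not path.exists():
                missing.append(str(path))
    return missing


# Prints the list of missing paths, or a confirmation if nothing is missing.
print(missing_paths("workspace/data") or "all expected paths found")
```

An empty result only means the paths exist; it does not validate their contents.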

Below, we select a training set of 25 samples per language from the available embeddings, using our default selection algorithm (which simply performs cross-fold validation). The evaluation strategy can be changed by editing `dataperf-speech-example/workspace/dataperf_speech_config.yaml`.

The goal of this challenge is to add your own selection algorithm and outperform the provided baselines' macro F1 scores.

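The selection algorithm writes one training file per language to `/content/`. The evaluation cell at the end of this notebook reads them as `{lang}_{TRAIN_SIZE}_train.json` (for example, `en_25_train.json`, `id_25_train.json`, and `pt_25_train.json`).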

These are the files you would upload to Dynabench for official evaluation. In this notebook, we instead run an unofficial local evaluation on the provided evaluation data once selection has finished.

In [None]:
TRAIN_SIZE = 25 # or 60
for lang in ["en", "id", "pt"]:
  !python -m selection.main \
     --language "{lang}" \
     --allowed_training_set "workspace/data/dataperf_{lang}_data/allowed_training_set.yaml" \
     --train_embeddings_dir "workspace/data/dataperf_{lang}_data/train_embeddings/" \
     --train_size {TRAIN_SIZE} \
     --outdir "/content/"
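
Before moving on to evaluation, a quick look at the generated files can catch a failed selection run early. The sketch below assumes the file naming used by the evaluation cell (`/content/{lang}_{TRAIN_SIZE}_train.json`) and makes no assumption about the JSON schema: the hypothetical helper `summarize_selection` only reports the top-level type, size, and keys.

```python
import json
from pathlib import Path


def summarize_selection(path):
    """Load a selection JSON file and describe its top-level structure."""
    with open(path) as f:
        data = json.load(f)
    size = len(data) if isinstance(data, (dict, list)) else None
    summary = {"type": type(data).__name__, "size": size}
    if isinstance(data, dict):
        summary["keys"] = sorted(data)
    return summary


TRAIN_SIZE = 25  # keep in sync with the selection cell
for lang in ["en", "id", "pt"]:
    # File name pattern assumed from the --train_file argument of the eval cell.
    path = Path("/content") / f"{lang}_{TRAIN_SIZE}_train.json"
    if path.exists():
        print(lang, summarize_selection(path))
    else:
        print(lang, "missing:", path)
```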

Finally, let's run a local unofficial evaluation on the output of the training set selection algorithm (the `{lang}_{TRAIN_SIZE}_train.json` file written to `/content/` for each of `en`, `id`, and `pt`).

For each language, the evaluation prints a macro F1 score computed on the evaluation samples specified in `eval.yaml`. The output looks like this:

```
validating selected IDs
loading selected training data
Loading targets: 100% 5/5 [00:00<00:00, 18.82it/s]
Loading nontargets: 100% 9/9 [00:00<00:00, 175.78it/s]
loading eval data
Loading targets: 100% 5/5 [00:00<00:00, 86.93it/s]
Loading nontargets: 100% 200/200 [00:12<00:00, 15.48it/s]
Score:  0.33143968916284644
```
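
To make the metric concrete, here is a minimal sketch of the standard macro F1 definition: compute F1 separately for each class, then take the unweighted mean, so rare classes count as much as frequent ones. This illustrates the general metric only. The official number is whatever `eval.py` computes, and the labels below are made up for the example.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over the classes seen in y_true or y_pred."""
    labels = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 if the class never occurs.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


# Toy labels, invented for illustration only.
y_true = ["a", "a", "b", "b", "other", "other"]
y_pred = ["a", "b", "b", "b", "other", "a"]
print(macro_f1(y_true, y_pred))
```

In this toy case the per-class F1 values are 1/2, 4/5, and 2/3, and the macro score is their plain average.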


In [None]:
for lang in ["en", "id", "pt"]:
  !python eval.py \
    --language "{lang}" \
    --eval_embeddings_dir "workspace/data/dataperf_{lang}_data/eval_embeddings/" \
    --train_embeddings_dir "workspace/data/dataperf_{lang}_data/train_embeddings/" \
    --allowed_training_set "workspace/data/dataperf_{lang}_data/allowed_training_set.yaml" \
    --eval_file "workspace/data/dataperf_{lang}_data/eval.yaml" \
    --train_file "/content/{lang}_{TRAIN_SIZE}_train.json" \
    --train_size {TRAIN_SIZE} \
    --config_file workspace/dataperf_speech_config.yaml
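
To compare languages or selection algorithms, it helps to collect the printed scores rather than read them off the log. The sketch below parses the `Score:` line shown in the sample output above. The helper `parse_score` is hypothetical, and it assumes the score keeps that exact `Score: <float>` format.

```python
import re

# Matches lines such as "Score:  0.33143968916284644".
SCORE_RE = re.compile(r"^Score:\s*([-+0-9.eE]+)\s*$")


def parse_score(lines):
    """Return the last 'Score: <float>' value found in lines, or None."""
    score = None
    for line in lines:
        match = SCORE_RE.match(line.strip())
        if match:
            score = float(match.group(1))
    return score


# Lines copied from the sample output shown earlier in this notebook.
sample = [
    "validating selected IDs",
    "loading eval data",
    "Score:  0.33143968916284644",
]
print(parse_score(sample))
```

In a notebook, the lines could come from capturing a shell command with IPython's `output = !python eval.py ...` syntax, which returns the command's output as a list of strings.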