# Hugging Face scripts

## Validate BIGOS dataset

The script [`huggingface/tests.py`](huggingface/tests.py) provides several tests that validate the dataset hosted on the Hugging Face website.

To run all available tests, execute the following command in a terminal:

```bash
python -m pytest huggingface/tests.py
```

To run a single test, append its name after a double colon, for example:

```bash
python -m pytest huggingface/tests.py::test_if_subset_uploaded
```

An example of the output of one of the tests can be seen below.

In [1]:
! python -m pytest -v huggingface/tests.py::test_if_subset_uploaded

platform linux -- Python 3.10.11, pytest-7.3.1, pluggy-1.0.0 -- /home/michal/.pyenv/versions/3.10.11/envs/bigos-venv-3.10/bin/python
cachedir: .pytest_cache
rootdir: /home/michal/Development/github/pl-asr-bigos-tools/scripts
plugins: dash-2.14.2, anyio-3.7.0, typeguard-4.0.0
collected 12 items

huggingface/tests.py::test_if_subset_uploaded[polyai-minds14-21] PASSED  [  8%]
huggingface/tests.py::test_if_subset_uploaded[mozilla-common_voice_15-23] PASSED [ 16%]
huggingface/tests.py::test_if_subset_uploaded[pwr-viu-unk] PASSED        [ 25%]
huggingface/tests.py::test_if_subset_uploaded[pjatk-clarin_studio-15] PASSED [ 33%]
huggingface/tests.py::test_if_subset_uploaded[mailabs-corpus_librivox-19] PASSED [ 41%]
huggingface/tests.py::test_if_subset_uploaded[google-fleurs-22] PASSED   [ 50%]
huggingface/tests.py::test_if_s

### Available tests

- `test_if_subset_uploaded` - iterates over the subsets listed in the `BIGOS_SUBSETS` constant and checks whether each subset has been uploaded to the repository on Hugging Face.

- `test_for_extra_subsets` - checks whether the repository on Hugging Face contains any subsets that are not listed in the `BIGOS_SUBSETS` constant.

- `test_if_file_exists` - iterates over the subsets listed in the `BIGOS_SUBSETS` constant and checks whether all expected files exist for each subset, i.e. `train.tsv`, `train.tar.gz`, `validation.tsv`, `validation.tar.gz`, `test.tsv`, `test.tar.gz`.

- `test_for_extra_files` - iterates over the subsets listed in the `BIGOS_SUBSETS` constant and checks whether any unexpected files have been uploaded to the repository on Hugging Face for each subset.
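
The four checks above amount to set comparisons between the expected and the actual repository contents. The following is a minimal sketch of that idea, not the code from `tests.py`. It assumes the list of repository paths has already been fetched (for example with `huggingface_hub.list_repo_files`) and that files are laid out as `data/<subset>/<file>`. That layout, the `EXPECTED_FILES` name, and both function names are hypothetical and used only for illustration.

```python
from collections import defaultdict

# Expected files per subset, as listed for `test_if_file_exists`.
EXPECTED_FILES = {
    "train.tsv", "train.tar.gz",
    "validation.tsv", "validation.tar.gz",
    "test.tsv", "test.tar.gz",
}


def group_by_subset(repo_files, prefix="data/"):
    """Map subset name -> set of file names, assuming `<prefix><subset>/<file>` paths."""
    grouped = defaultdict(set)
    for path in repo_files:
        if not path.startswith(prefix):
            continue  # ignore files outside the assumed data directory
        parts = path[len(prefix):].split("/", 1)
        if len(parts) == 2:
            subset, name = parts
            grouped[subset].add(name)
    return dict(grouped)


def compare_with_expected(repo_files, expected_subsets):
    """Return missing/extra subsets and, per expected subset, missing/extra files."""
    grouped = group_by_subset(repo_files)
    uploaded = set(grouped)
    expected = set(expected_subsets)
    report = {
        "missing_subsets": sorted(expected - uploaded),
        "extra_subsets": sorted(uploaded - expected),
        "missing_files": {},
        "extra_files": {},
    }
    # File-level checks only make sense for subsets that are actually uploaded.
    for subset in sorted(expected & uploaded):
        files = grouped[subset]
        report["missing_files"][subset] = sorted(EXPECTED_FILES - files)
        report["extra_files"][subset] = sorted(files - EXPECTED_FILES)
    return report
```

Each key of the returned report corresponds to one of the tests: a test passes when the matching entry is empty.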

## Analyze BIGOS dataset

The file `huggingface/data_analyzer.py` defines the `DataAnalyzer` class, which can be used to generate statistics for a dataset.

The output depends on the selected subsets and splits:

- `noSamples` (`int`) - total number of samples in the dataset, subset, or split. Available in all cases.

- `noEmptyReferences` (`int`) - number of empty reference sentences. Available for the `train` and `validation` splits.

- `noDifferentLanguage` (`int`) - number of reference sentences most likely written in a language other than Polish. Available for the `train` and `validation` splits.

- `noDoubleWhitespaces` (`int`) - number of reference sentences containing double whitespace. Available for the `train` and `validation` splits.

- `noWithoutPunctuation` (`int`) - number of reference sentences without punctuation marks. Available for the `train` and `validation` splits.

- `punctuationCounter` (`collections.Counter`) - counter of punctuation marks in reference sentences. Available for the `train` and `validation` splits.

- `emptyReference` (`list`) - list of file IDs for which the reference sentence is empty. Available for the `train` and `validation` splits. Not available for the `all` subset.

- `differentLanguage` (`list`) - list of file IDs for which a language other than Polish was detected. Available for the `train` and `validation` splits. Not available for the `all` subset.

- `noNonEmptyReferences` (`int`) - number of non-empty reference sentences. Available for the `test` split.
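
To make the text-based statistics concrete, here is a rough sketch of how a few of them could be computed from a mapping of file IDs to reference sentences. It is not the `DataAnalyzer` implementation. The function name `reference_stats` is hypothetical, punctuation is approximated with the ASCII-only `string.punctuation`, empty references are assumed to be excluded from the whitespace and punctuation counts, and language detection is left out because it needs an external library.

```python
import re
import string
from collections import Counter


def reference_stats(references):
    """Compute a few of the per-split statistics for {file_id: reference} pairs."""
    punctuation = Counter()
    stats = {
        "noSamples": len(references),
        "noEmptyReferences": 0,
        "noDoubleWhitespaces": 0,
        "noWithoutPunctuation": 0,
        "emptyReference": [],
    }
    for file_id, text in references.items():
        if not text.strip():
            # Assumption: empty references are only counted as empty.
            stats["noEmptyReferences"] += 1
            stats["emptyReference"].append(file_id)
            continue
        if re.search(r"\s{2,}", text):
            stats["noDoubleWhitespaces"] += 1
        marks = [ch for ch in text if ch in string.punctuation]
        if marks:
            punctuation.update(marks)
        else:
            stats["noWithoutPunctuation"] += 1
    # Convert to a plain dict so the result serializes cleanly with `json`.
    stats["punctuationCounter"] = dict(punctuation)
    return stats
```

For example, `reference_stats({"a": "Dzień dobry.", "b": ""})` reports one empty reference (`"b"`) and a punctuation counter containing a single full stop.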

An example of the output generated with `DataAnalyzer` is shown below.

In [3]:
from huggingface.data_analyzer import DataAnalyzer
import json
import datetime

analyzer = DataAnalyzer(subsets=['pwr-shortwords-unk', 'pwr-azon_spont-20'], splits=['train', 'validation', 'test'])
print(json.dumps(analyzer.analyze(), indent=2))


100%|██████████| 761/761 [00:01<00:00, 474.35it/s]
100%|██████████| 86/86 [00:00<00:00, 383.48it/s]
100%|██████████| 92/92 [00:00<00:00, 1372.49it/s]
100%|██████████| 357/357 [00:01<00:00, 307.20it/s]
100%|██████████| 51/51 [00:00<00:00, 265.06it/s]
100%|██████████| 48/48 [00:00<00:00, 706.01it/s]


{
  "pwr-shortwords-unk": {
    "train": {
      "noSamples": 761,
      "noEmptyReferences": 0,
      "noDifferentLanguage": 2,
      "noDoubleWhitespaces": 5,
      "noWithoutPunctuation": 759,
      "punctuationCounter": {
        ".": 2
      },
      "emptyReference": [],
      "differentLanguage": [
        "pwr-shortwords-unk-train-0001-00839",
        "pwr-shortwords-unk-train-0001-00849"
      ]
    },
    "validation": {
      "noSamples": 86,
      "noEmptyReferences": 0,
      "noDifferentLanguage": 0,
      "noDoubleWhitespaces": 0,
      "noWithoutPunctuation": 85,
      "punctuationCounter": {
        ".": 1
      },
      "emptyReference": [],
      "differentLanguage": []
    },
    "test": {
      "noSamples": 92,
      "noNonEmptyReferences": 92
    },
    "all": {
      "noSamples": 939
    }
  },
  "pwr-azon_spont-20": {
    "train": {
      "noSamples": 357,
      "noEmptyReferences": 0,
      "noDifferentLanguage": 0,
      "noDoubleWhitespaces": 2,
      "noWith

In [4]:
# Save the analysis to a file named with today's date
today = datetime.datetime.now().strftime('%Y%m%d')
print(today)
analyzer.save_output('../data/analyzer-reports/bigos-{}.json'.format(today))

20240425
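
Writing to a path whose parent directory does not exist raises `FileNotFoundError` in Python, and whether `save_output` creates missing directories is not documented here. The sketch below shows one way to write a dated JSON report that creates the directory first. The helper `save_report` and its parameters are hypothetical and not part of `DataAnalyzer`.

```python
import datetime
import json
from pathlib import Path


def save_report(report, directory, prefix="bigos"):
    """Write `report` as JSON to `<directory>/<prefix>-YYYYMMDD.json` and return the path."""
    today = datetime.datetime.now().strftime("%Y%m%d")
    target = Path(directory) / f"{prefix}-{today}.json"
    # Create the target directory (and any parents) if it is missing.
    target.parent.mkdir(parents=True, exist_ok=True)
    with target.open("w", encoding="utf-8") as handle:
        json.dump(report, handle, indent=2, ensure_ascii=False)
    return target
```

With this helper, a call such as `save_report(analyzer.analyze(), '../data/analyzer-reports')` would produce a file named like the one in the cell above.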
