# English STELLA Transcriptions Dataset

The STELLA dataset...



# Data preparation

Reorganise the dataset into the layout shown below. This step extracts the audiobook transcriptions from the original dataset and sorts them into the same splits as the audio files.

```
txt
├── LANG
│   ├── HOUR_SPLIT
│   │   ├── SECTION_SPLIT
│   │   │   ├── books.txt
│   │   │   ├── meta.json
│   │   │   └── transcription.txt
│   │   ├── ...
│   ├── ...
│   ...

```
- txt: folder containing the transcriptions
- LANG: the language of the transcriptions
- HOUR_SPLIT: the size of the section splits in hours of speech,
              formatted as (50h, 100h, ..., 3200h)
- SECTION_SPLIT: the content divided into sections containing equal amounts of speech.
- books.txt: the list of books used for this split
- meta.json: metadata generated during clean-up, used to measure the effectiveness of the cleaning.
- transcription.txt: the aggregated transcripts of the audiobooks in the list.
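
The layout above can be traversed with a few lines of `pathlib`. This is a minimal sketch, assuming only the directory structure described here; the helper name `iter_section_splits` is hypothetical and is not part of `lexical_benchmark`.

```python
from pathlib import Path
from typing import Iterator, Tuple


def iter_section_splits(root: Path) -> Iterator[Tuple[str, str, str, Path]]:
    """Yield (lang, hour_split, section_split, transcription_path) tuples.

    Hypothetical helper: it relies only on the documented layout
    root/LANG/HOUR_SPLIT/SECTION_SPLIT/transcription.txt.
    """
    for path in sorted(root.glob("*/*/*/transcription.txt")):
        section_dir = path.parent        # SECTION_SPLIT
        hour_dir = section_dir.parent    # HOUR_SPLIT
        lang_dir = hour_dir.parent       # LANG
        yield lang_dir.name, hour_dir.name, section_dir.name, path
```

Calling `iter_section_splits(Path("txt"))` then gives one tuple per section split, which is convenient for checking that every split received a transcription file.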

In [1]:
from lexical_benchmark.datasets.machine import stella
from lexical_benchmark.utils import timed_status

prep = stella.STELAPrepTranscripts(lang="EN")
with timed_status(status="Preparing STELA transcriptions...", complete_status="Successfully built STELA transcript dataset!"):
    prep.build_transcript()

Output()

## Data Cleaning

Clean up the text to keep only clean words that can be piped through the dictionary.

RULES (order matters):
1) Illustration tag removal
2) URL removal
3) TextNormalisation: correct accents & remove non-printable characters
4) Transcribe numbers
5) Remove roman numerals
6) Fix symbols ($, €, etc.)
7) AZFilter

    * replaces '-' with a space to extract hyphenated words (fifty-five -> fifty five)

    * keeps the apostrophe character (') to protect contractions (e.g. ain't)

    * purges everything outside [A-Z]

    * lowercases everything
8) Fix words by removing leading and trailing quote characters (')
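
To make rules 7 and 8 concrete, here is a minimal sketch of how they could be written with the standard library. It is not the code used by `StelaCleaner`; the function names are hypothetical. Two assumptions are made: the [A-Z] filter is applied case-insensitively (so lowercase letters survive until the lowercasing step), and purged characters are replaced by a space rather than deleted.

```python
import re

# Assumption: anything that is not a letter, an apostrophe or whitespace is purged.
_NOT_AZ = re.compile(r"[^A-Za-z'\s]")


def az_filter(text: str) -> str:
    """Rule 7 (sketch): split hyphens, keep apostrophes, purge the rest, lowercase."""
    text = text.replace("-", " ")    # fifty-five -> fifty five
    text = _NOT_AZ.sub(" ", text)    # purged characters become spaces (assumption)
    return text.lower()


def strip_quotes(word: str) -> str:
    """Rule 8 (sketch): remove leading and trailing quote characters."""
    return word.strip("'")


def clean_line(text: str) -> list:
    """Apply rule 7, then rule 8, and drop words that end up empty."""
    words = (strip_quotes(w) for w in az_filter(text).split())
    return [w for w in words if w]


print(clean_line("Fifty-five 'apples', ain't it?"))
# → ['fifty', 'five', 'apples', "ain't", 'it']
```

The order matters here too: stripping quotes before the filter would miss quotes that only become word-initial once the surrounding punctuation is removed.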

In [2]:
from lexical_benchmark.datasets.machine import stella
from lexical_benchmark.utils import timed_status

cleaner = stella.StelaCleaner()
with timed_status(status="Cleaning STELA Transcripts", complete_status="Successfully cleaned up STELA Transcript dataset!"):
    cleaner.cleanup_raw(compute_word_freqs=True)

Output()

# Word Filtering

Using a pre-selected lexicon, we filter the corpus to separate known words from unknown words.
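
Conceptually, the filtering amounts to a set-membership test per token. The sketch below illustrates the idea with a plain Python `set` standing in for the lexicon; `split_by_lexicon` is a hypothetical helper and does not reflect how `DictionairyCleaner` is implemented.

```python
from collections import Counter
from typing import Iterable, Set, Tuple


def split_by_lexicon(tokens: Iterable[str], lexicon: Set[str]) -> Tuple[Counter, Counter]:
    """Return (known, unknown) word-frequency counters for the given tokens."""
    known, unknown = Counter(), Counter()
    for token in tokens:
        # Route each token to the counter matching its lexicon membership.
        target = known if token in lexicon else unknown
        target[token] += 1
    return known, unknown


# Toy lexicon and tokens, for illustration only.
known, unknown = split_by_lexicon(
    ["the", "potato", "frezdo", "the"],
    lexicon={"the", "potato"},
)
print(known)    # → Counter({'the': 2, 'potato': 1})
print(unknown)  # → Counter({'frezdo': 1})
```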

In [1]:
from lexical_benchmark.datasets.machine import stella
from lexical_benchmark.datasets import utils
from lexical_benchmark.utils import timed_status

dataset_cleaner = stella.StelaCleaner()

with timed_status(status="Word Filtering", complete_status="Successfully completed word filtering!"):
    dataset_cleaner.mk_clean(
        word_cleaners={"EN": utils.DictionairyCleaner(lang="EN")},
        compute_word_freqs=True,
    )

Output()

## Computing cleanup stats

In [6]:
from lexical_benchmark.datasets.machine import stella
from lexical_benchmark.utils import timed_status
from pathlib import Path
# ------------------
DEBUG = True
SAVE_CSVs = True
PRINT = True
# ------------------
# Configure inputs
clean_dataset = stella.STELLADatasetClean()
raw_dataset = stella.STELLADatasetRaw()
meta_root = Path("stela_wf") if DEBUG else clean_dataset.wf_meta_dir
(meta_root / "raw").mkdir(exist_ok=True, parents=True)
(meta_root / "clean").mkdir(exist_ok=True, parents=True)
(meta_root / "bad").mkdir(exist_ok=True, parents=True)

stats_dict = {}

with timed_status(status="Extracting Stats...", complete_status="Successfully extracted statistics!"):
    for lang, hour_split in clean_dataset.iter_hour_splits('EN'):
        clean_stats = clean_dataset.word_stats_by_split(lang, hour_split)
        raw_stats = raw_dataset.word_stats_by_split(lang, hour_split)

        stats_dict[f"{lang}-{hour_split}"] = {
            "raw": raw_stats,
            "clean": clean_stats
        }

        # Save Files
        if SAVE_CSVs:
            raw_stats.freq_map.to_csv(meta_root / f"raw/{lang}_{hour_split}.word-freq.csv", index=False)
            clean_stats["good"].freq_map.to_csv(meta_root / f"clean/{lang}_{hour_split}.word-freq.csv", index=False)
            clean_stats["bad"].freq_map.to_csv(meta_root / f"bad/{lang}_{hour_split}.word-freq.csv", index=False)


# Print results
if PRINT:
    for key, value in stats_dict.items():
        print("-" * 10)
        print(f"==> {key}")
        # ALL
        print(f"RAW:: > Number of tokens : {value['raw'].token_nb:_}, Number of types : {value['raw'].type_nb:_}")

        # Clean
        print(f"CLEAN:: > Number of tokens : {value['clean']['good'].token_nb:_}, Number of types : {value['clean']['good'].type_nb:_}")

        # Rejected
        print(f"REJECT:: > Number of tokens : {value['clean']['bad'].token_nb:_}, Number of types : {value['clean']['bad'].type_nb:_}")

        print("-" * 10)

Output()

----------
==> EN-100h
RAW:: > Number of tokens : 42_030_077, Number of types : 336_023
CLEAN:: > Number of tokens : 41_394_999, Number of types : 132_199
REJECT:: > Number of tokens : 635_078, Number of types : 203_824
----------
----------
==> EN-1600h
RAW:: > Number of tokens : 42_030_077, Number of types : 336_023
CLEAN:: > Number of tokens : 41_394_999, Number of types : 132_199
REJECT:: > Number of tokens : 635_078, Number of types : 203_824
----------
----------
==> EN-200h
RAW:: > Number of tokens : 41_928_401, Number of types : 335_905
CLEAN:: > Number of tokens : 41_293_444, Number of types : 132_128
REJECT:: > Number of tokens : 634_957, Number of types : 203_777
----------
----------
==> EN-3200h
RAW:: > Number of tokens : 42_030_077, Number of types : 336_023
CLEAN:: > Number of tokens : 41_394_999, Number of types : 132_199
REJECT:: > Number of tokens : 635_078, Number of types : 203_824
----------
----------
==> EN-400h
RAW:: > Number of tokens : 42_030_077, Number of ty
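
In the figures printed above, the clean and rejected counts add up to the raw counts for both tokens and types, which is what one would expect if filtering partitions the vocabulary. The sketch below shows how token and type counts can be derived from a frequency table and checks that identity on the EN-100h and EN-200h figures. The helper names are hypothetical; how `token_nb` and `type_nb` are actually computed in `lexical_benchmark` is an assumption.

```python
from collections import Counter
from typing import Dict, Tuple


def token_and_type_counts(freqs: Counter) -> Tuple[int, int]:
    """Tokens = total occurrences, types = number of distinct words."""
    return sum(freqs.values()), len(freqs)


# (tokens, types) copied from the EN-100h and EN-200h outputs above.
reported: Dict[str, Dict[str, Tuple[int, int]]] = {
    "EN-100h": {
        "raw": (42_030_077, 336_023),
        "clean": (41_394_999, 132_199),
        "reject": (635_078, 203_824),
    },
    "EN-200h": {
        "raw": (41_928_401, 335_905),
        "clean": (41_293_444, 132_128),
        "reject": (634_957, 203_777),
    },
}


def is_partition(stats: Dict[str, Tuple[int, int]]) -> bool:
    """True when clean + reject equals raw for both tokens and types."""
    return all(
        clean + reject == raw
        for raw, clean, reject in zip(stats["raw"], stats["clean"], stats["reject"])
    )


for split, stats in reported.items():
    print(f"{split}: clean + reject == raw -> {is_partition(stats)}")
```

A check like this is a cheap way to spot words that were dropped or double-counted between the raw and filtered frequency files.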

In [None]:
""" Test Lexicon"""
from lexical_benchmark.datasets import utils
from lexical_benchmark.datasets.utils import lexicon

en_cleaner = utils.DictionairyCleaner(lang="EN")
lexique = lexicon.Lexicon(lang="EN")

"potato" in lexique.words, "frezdo" in lexique.words