# STELLA Transcription Dataset

The STELLA dataset...



# Data preparation

Format the dataset into the target directory layout. This procedure extracts the audiobook transcriptions from the original dataset and sorts them into the same splits as the audio files.

```
txt
├── LANG
│   ├── HOUR_SPLIT
│   │   ├── SECTION_SPLIT
│   │   │   ├── books.txt
│   │   │   ├── meta.json
│   │   │   └── transcription.txt
│   │   ├── ...
│   ├── ...
│   ...

```
- txt: folder containing the transcriptions.
- LANG: the language of the transcriptions.
- HOUR_SPLIT: the size of the section splits, in hours of speech,
              formatted as (50h, 100h, ..., 3200h).
- SECTION_SPLIT: separation of the content into sections with an equal amount of speech.
- books.txt: the list of books used for this split.
- meta.json: metadata generated during clean-up, used to measure the effectiveness of the cleaning.
- transcription.txt: the aggregated transcripts of the audiobooks in the list.
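
As an illustration of the layout above, the following sketch walks a `txt` folder and yields one entry per section. It is not part of `lexical_benchmark`: the helper name `iter_sections` is hypothetical, and it assumes only the `LANG/HOUR_SPLIT/SECTION_SPLIT/transcription.txt` nesting shown in the tree.

```python
from pathlib import Path
from typing import Iterator


def iter_sections(root: Path) -> Iterator[tuple[str, str, str, Path]]:
    """Yield (lang, hour_split, section, transcription_path) for each section.

    Hypothetical helper: it relies only on the three-level nesting
    LANG/HOUR_SPLIT/SECTION_SPLIT described in the tree above.
    """
    for transcription in sorted(root.glob("*/*/*/transcription.txt")):
        section_dir = transcription.parent
        hour_dir = section_dir.parent
        lang_dir = hour_dir.parent
        yield lang_dir.name, hour_dir.name, section_dir.name, transcription


if __name__ == "__main__":
    # Prints nothing if no "txt" folder exists in the working directory.
    for lang, hour_split, section, path in iter_sections(Path("txt")):
        print(lang, hour_split, section, path)
```

Keeping the traversal in a single generator means the same loop can feed both the cleaning and the filtering steps without hard-coding split names.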

In [1]:
from lexical_benchmark.datasets.machine import stella
from rich.console import Console

out = Console()

prep = stella.STELAPrepTranscripts(lang="EN")
with out.status("Processing STELA source..."):
    prep.build_transcript()

out.print("Successfully built STELA Transcript dataset!", style="bold green")


## Data Cleaning

Clean up the text so that only clean words remain, which can then be checked against the dictionary.

RULES (order matters):
1) Illustration tag removal
2) URL removal
3) TextNormalisation: correct accents & remove non-printable characters
4) Transcribe numbers
5) Remove Roman numerals
6) Fix symbols ($, €, etc.)
7) AZFilter

    * replaces '-' with a space to split hyphenated words (fifty-five -> fifty five)

    * keeps the apostrophe character (') to protect contractions (e.g. ain't)

    * purges every character outside [A-Z]

    * lowercases everything
8) [TODO]: Add a word normalisation (maybe?)
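
To make the "order matters" point concrete, here is a minimal sketch of two of the rules (URL removal and the AZ filter) chained in the listed order. It is not the `StelaCleaner` implementation. The function names, the URL pattern, and the choice to replace purged characters with a space (rather than delete them) are all assumptions made for illustration.

```python
import re

# Assumed pattern: anything starting with http(s):// or www. up to the next whitespace.
URL_PATTERN = re.compile(r"(?:https?://|www\.)\S+")


def remove_urls(text: str) -> str:
    """Rule 2 (sketch): drop URLs before any character-level filtering."""
    return URL_PATTERN.sub(" ", text)


def az_filter(text: str) -> str:
    """Rule 7 (sketch): split hyphens, keep apostrophes, purge the rest, lowercase."""
    text = text.replace("-", " ")
    # Assumption: purged characters become spaces so neighbouring words stay apart.
    text = re.sub(r"[^A-Za-z'\s]", " ", text)
    return " ".join(text.lower().split())


def clean(text: str) -> str:
    """Apply the rules in the documented order."""
    for rule in (remove_urls, az_filter):
        text = rule(text)
    return text


print(clean("See http://example.com for fifty-five reasons it ain't so!"))
# → see for fifty five reasons it ain't so

# Skipping URL removal leaves URL fragments behind as pseudo-words:
print(az_filter("See http://example.com"))
# → see http example com
```

The second call shows why URL removal has to run before the AZ filter: once punctuation is purged, the remains of a URL can no longer be told apart from ordinary tokens.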

In [1]:
from lexical_benchmark.datasets.machine import stella
from rich.console import Console

out = Console()

cleaner = stella.StelaCleaner()
cleaner.mk_clean(show_progress=True)
out.print("Successfully cleaned up STELA Transcript dataset!", style="bold green")


# Word Filtering

Using a pre-selected lexicon, we filter the corpus to separate known words from unknown ones.

In [2]:
from lexical_benchmark.datasets.machine import stella
from lexical_benchmark.datasets import utils
from rich.console import Console

console = Console()
dataset = stella.STELLADatasetClean()
en_cleaner = utils.DictionairyCleaner(lang="EN")
print(f'Working on:: {dataset.root_dir}')

with console.status("Computing clean words"):
    for lang, hour_split, section in dataset.iter_all():
        # Create clean, unclean subsets of transcriptions
        dataset.filter_words(lang, hour_split, section, cleaner=en_cleaner)

        # Compute word frequencies of both items
        _ = dataset.clean_words_freq(lang, hour_split, section)
        _ = dataset.rejected_words_freq(lang, hour_split, section)
console.print("Completed cleaning of words!!")


Working on:: /scratch1/projects/lexical-benchmark/v2/datasets/clean/StelaTrainDataset
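
Conceptually, the filtering step partitions the tokens of a transcription by lexicon membership and counts each group. The sketch below shows that idea with the standard library only. It does not reproduce `filter_words` or `DictionairyCleaner`, and both the function name `split_by_lexicon` and the toy lexicon are hypothetical.

```python
from collections import Counter
from typing import Iterable


def split_by_lexicon(
    tokens: Iterable[str], lexicon: set[str]
) -> tuple[Counter, Counter]:
    """Return (accepted, rejected) word-frequency counters.

    A token is accepted when it appears in the lexicon and rejected otherwise.
    """
    accepted: Counter = Counter()
    rejected: Counter = Counter()
    for token in tokens:
        if token in lexicon:
            accepted[token] += 1
        else:
            rejected[token] += 1
    return accepted, rejected


# Toy data, for illustration only.
toy_lexicon = {"the", "potato", "is", "hot"}
toy_tokens = "the potato is hot the frezdo is zzx".split()

accepted, rejected = split_by_lexicon(toy_tokens, toy_lexicon)
print("accepted:", dict(accepted))
print("rejected:", dict(rejected))
```

Using a `set` for the lexicon keeps each membership test constant-time, which matters when the corpus holds millions of tokens.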


## Computing cleanup stats

In [3]:
from lexical_benchmark.datasets.machine import stella
from lexical_benchmark.datasets import utils
from rich.console import Console

console = Console()
dataset = stella.STELLADatasetClean()
cleaning_stats = dataset.global_word_cleaning_stats()

# Print counts
all_count = cleaning_stats["all"].sum()
bad_count = cleaning_stats["bad"].sum()
good_count = cleaning_stats["good"].sum()
percent_bad = (bad_count / all_count)
percent_good = (good_count / all_count)

print(f"Kept {good_count:_}/{all_count:_} words")
print(f"Rejected {bad_count:_}/{all_count:_} words")
print("-"*10)
print(f"We have a {percent_bad:.2%} of word elimination !!")
print(f"We have a {percent_good:.2%} of word retention !!")
print(f"Average removal per set is   {cleaning_stats['bad_percent'].mean():.4}%")
print(f"Average retention per set is {cleaning_stats['good_percent'].mean():.4}%")

Kept 4_401_030/6_436_566 words
Rejected 2_031_433/6_436_566 words
----------
We have a 31.56% of word elimination !!
We have a 68.38% of word retention !!
Average removal per set is   21.02%
Average retention per set is 78.88%
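
The global elimination rate (31.56%) and the per-set average (21.02%) are computed differently in the cell above: the first pools all counts before dividing, while the second averages the `bad_percent` column. Assuming `bad_percent` is each set's own rejection rate, larger sets weigh more in the pooled figure, which would explain the gap. The sketch below reproduces this effect on made-up counts and re-derives the pooled rate from the totals printed above.

```python
# Made-up per-set counts, for illustration only.
toy_sets = [
    {"all": 1_000, "bad": 100},     # small set, 10% rejected
    {"all": 1_000, "bad": 200},     # small set, 20% rejected
    {"all": 10_000, "bad": 4_000},  # large set, 40% rejected
]

# Pooled rate: sum the counts first, then divide (as done for percent_bad).
pooled_rate = sum(s["bad"] for s in toy_sets) / sum(s["all"] for s in toy_sets)

# Per-set average: compute each set's rate, then take the mean
# (assumed to mirror cleaning_stats['bad_percent'].mean()).
per_set_rates = [s["bad"] / s["all"] for s in toy_sets]
mean_rate = sum(per_set_rates) / len(per_set_rates)

print(f"pooled rate:      {pooled_rate:.2%}")  # → 35.83%
print(f"per-set average:  {mean_rate:.2%}")    # → 23.33%

# Re-deriving the pooled figure from the totals printed above.
reported_rate = 2_031_433 / 6_436_566
print(f"reported pooled:  {reported_rate:.2%}")  # → 31.56%
```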


### Export Global Word Frequencies

In [4]:
from lexical_benchmark.datasets.machine import stella
from rich.console import Console

console = Console()
dataset = stella.STELLADatasetClean()

with console.status("Extracting frequencies from sets..."):
    raw_word_freq = dataset.global_all_word_freq()
    clean_word_freq = dataset.global_clean_word_freq()
    rejected_word_freq = dataset.global_bad_word_freq()
console.print("Successfully extracted global word frequencies!", style="bold green")


In [5]:
""" Save word frequencies """
from pathlib import Path
Path("stela_wf").mkdir(exist_ok=True, parents=True)
rejected_word_freq.to_csv("stela_wf/rejected.words.csv", index=False)
clean_word_freq.to_csv("stela_wf/accepted.words.csv", index=False)
rejected_word_freq

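| Unnamed: 0 | word | freq |
|---|---|---|
| 0 | ' | 138449 |
| 1 | '' | 7466 |
| 2 | ''' | 483 |
| 3 | '''' | 98 |
| 4 | ''''' | 56 |
| ... | ... | ... |
| 236912 | zzx | 7 |
| 236913 | zzxl | 7 |
| 236914 | zzzl | 7 |
| 236915 | zzztts | 7 |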


In [14]:
""" Test Lexicon"""
from lexical_benchmark.datasets import utils
from lexical_benchmark.datasets.utils import lexicon

en_cleaner = utils.DictionairyCleaner(lang="EN")
lexique = lexicon.Lexicon(lang="EN")

"potato" in lexique.words, "frezdo" in lexique.words