# STELLA Transcription Dataset

The STELLA dataset...



# Data preparation

Format the dataset into the target directory layout. This procedure extracts the audiobook transcriptions from the original dataset and sorts them into the same splits as the audio files.

```
txt
├── LANG
│   ├── HOUR_SPLIT
│   │   ├── SECTION_SPLIT
│   │   │   ├── books.txt
│   │   │   ├── meta.json
│   │   │   └── transcription.txt
│   │   ├── ...
│   ├── ...
│   ...

```
- txt: folder containing the transcriptions
- LANG: the language of the transcriptions
- HOUR_SPLIT: the size of the section splits in hours of speech,
              formatted as (50h, 100h, ..., 3200h)
- SECTION_SPLIT: separation of the content into sections with an equal amount of speech content
- books.txt: the list of books used for this split
- meta.json: metadata generated during clean-up, used to measure the effectiveness of the cleaning
- transcription.txt: the aggregated transcripts of the audiobooks in the list
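As a rough illustration of how this layout can be traversed, the sketch below walks a `txt/LANG/HOUR_SPLIT/SECTION_SPLIT` tree with `pathlib`. The helper `iter_sections` is hypothetical and is not part of `lexical_benchmark`; it only assumes the folder structure described above, and the demo builds a small temporary tree rather than reading real data.

```python
import tempfile
from pathlib import Path
from typing import Iterator, Tuple


def iter_sections(root: Path) -> Iterator[Tuple[str, str, str, Path]]:
    """Yield (lang, hour_split, section, transcription_path) for each section.

    Hypothetical helper: relies only on the txt/LANG/HOUR_SPLIT/SECTION_SPLIT
    layout, not on any lexical_benchmark API.
    """
    for transcription in sorted(root.glob("*/*/*/transcription.txt")):
        section_dir = transcription.parent
        hour_dir = section_dir.parent
        lang_dir = hour_dir.parent
        yield lang_dir.name, hour_dir.name, section_dir.name, transcription


# Demo on a throwaway tree with two sections of a single 50h split.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "txt"
    for section in ("00", "01"):
        section_dir = root / "EN" / "50h" / section
        section_dir.mkdir(parents=True)
        (section_dir / "transcription.txt").write_text("hello world\n", encoding="utf-8")
    found = [(lang, hours, section) for lang, hours, section, _ in iter_sections(root)]

print(found)
```

Yielding the three path components together mirrors the `(lang, hour_split, section)` keys used by the dataset helpers later in this notebook.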

In [4]:
from lexical_benchmark.datasets.machine import stella
from rich.console import Console

out = Console()

prep = stella.STELAPrepTranscripts(lang="EN")
with out.status("Processing STELA source..."):
    prep.build_transcript()

out.print("Successfully built the STELA transcript dataset!", style="bold green")


## Data Cleaning

Clean up the text to keep only clean words that can be piped through the dictionary.

Rules (order matters):
1) Illustration tag removal
2) URL removal
3) TextNormalisation: correct accents and remove non-printable characters
4) Transcribe numbers
5) Remove Roman numerals
6) Fix symbols ($, €, etc.)
7) AZFilter
    * replaces '-' with a space to split hyphenated words (fifty-five -> fifty five)
    * keeps the apostrophe character (') to protect contractions (e.g. ain't)
    * purges everything that is not a letter in [A-Z]
    * lowercases everything
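To make the last rule concrete, here is a minimal sketch of what an AZ-style filter could look like. It is not the `AZFilter` implementation used by `StelaCleaner`; the function name `az_filter` is hypothetical, and two details are assumptions: letters are matched case-insensitively before lowercasing, and purged characters are deleted rather than replaced with a space.

```python
import re


def az_filter(text: str) -> str:
    """Hypothetical AZ-style filter following the four sub-rules above."""
    # Split hyphenated words: "fifty-five" -> "fifty five"
    text = text.replace("-", " ")
    # Purge everything except ASCII letters, apostrophes and whitespace
    # (assumption: purged characters are deleted, not replaced by a space)
    text = re.sub(r"[^A-Za-z'\s]", "", text)
    # Lowercase everything and collapse runs of whitespace
    return " ".join(text.lower().split())


sample = az_filter("Fifty-five dogs ain't HERE!")
print(sample)
```

The order of the steps matters here as well: hyphens have to be turned into spaces before the purge, otherwise "fifty-five" would collapse into a single token.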


In [1]:
from lexical_benchmark.datasets.machine import stella
from rich.console import Console

out = Console()

cleaner = stella.StelaCleaner()
cleaner.mk_clean(show_progress=True)
out.print("Successfully cleaned up the STELA transcript dataset!", style="bold green")


# Word Filtering

Using a pre-selected lexicon, we filter the corpus to separate known words from unknown words.

In [1]:
from lexical_benchmark.datasets.machine import stella
from lexical_benchmark.datasets import utils
from rich.console import Console

console = Console()
dataset = stella.STELLADatasetClean()
# BUG: only EN exists for now; the loop needs to change if more languages are added
en_cleaner = utils.DictionairyCleaner(lang="EN")
print(f'Working on:: {dataset.root_dir}')

with console.status("Computing clean words"):
    for lang, hour_split, section in dataset.iter_all():
        # Create clean, unclean subsets of transcriptions
        dataset.filter_words(lang, hour_split, section, cleaner=en_cleaner)
    
        # Compute word frequencies of both items
        _ = dataset.clean_words_freq(lang, hour_split, section)
        _ = dataset.rejected_words_freq(lang, hour_split, section)
console.print("Completed cleaning of words!!")

Working on:: /scratch1/projects/lexical-benchmark/v2/datasets/clean/StelaTrainDataset


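For readers without access to the dataset classes, the idea behind the filtering step can be sketched in a few lines: each token is counted either as known (present in the lexicon) or as rejected. The function `split_by_lexicon`, the toy lexicon, and the toy sentence are illustrative assumptions and do not reflect how `filter_words` or `DictionairyCleaner` are actually implemented.

```python
from collections import Counter
from typing import Iterable, Set, Tuple


def split_by_lexicon(words: Iterable[str], lexicon: Set[str]) -> Tuple[Counter, Counter]:
    """Count tokens found in the lexicon and tokens rejected by it (hypothetical helper)."""
    known: Counter = Counter()
    unknown: Counter = Counter()
    for word in words:
        if word in lexicon:
            known[word] += 1
        else:
            unknown[word] += 1
    return known, unknown


# Toy lexicon and text, for illustration only.
lexicon = {"the", "cat", "sat"}
tokens = "the cat sat on the mat".split()
known, unknown = split_by_lexicon(tokens, lexicon)
print(known)
print(unknown)
```

Keeping both counters makes it possible to report word frequencies for the retained and the rejected subsets, as the cell above does.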

## Computing cleanup stats

In [1]:
from lexical_benchmark.datasets.machine import stella
from lexical_benchmark.datasets import utils
from rich.console import Console

console = Console()
dataset = stella.STELLADatasetClean()
all_stats = {}
for lang, hour_split, section in dataset.iter_all():
    all_stats[(lang, hour_split, section)] = dataset.word_cleaning_stats(lang, hour_split, section)

In [19]:
import pandas as pd

def m(lang, hour_split, section):
    return f"{lang}_{hour_split}_{section}"

data = {m(*k): v for k, v in all_stats.items()}
df = pd.DataFrame.from_dict(data, orient='index')
df

| | all | good | bad | bad_percent | good_percent |
|---|---|---|---|---|---|
| EN_100h_00 | 32092 | 27967 | 4102 | 12.782002 | 87.146329 |
| EN_100h_01 | 39002 | 31161 | 7778 | 19.942567 | 79.895903 |
| EN_100h_02 | 34277 | 28818 | 5381 | 15.698573 | 84.073869 |
| EN_100h_03 | 31719 | 29064 | 2602 | 8.203285 | 91.629623 |
| EN_100h_04 | 33701 | 29953 | 3730 | 11.067921 | 88.878668 |
| ... | ... | ... | ... | ... | ... |
| EN_50h_63 | 21496 | 20320 | 1176 | 5.470785 | 94.529215 |
| EN_800h_00 | 115890 | 73803 | 42087 | 36.316334 | 63.683666 |
| EN_800h_01 | 211138 | 86743 | 124395 | 58.916443 | 41.083557 |
| EN_800h_02 | 149307 | 81359 | 67948 | 45.508918 | 54.491082 |


In [29]:
all_count = df["all"].sum()
bad_count = df["bad"].sum()
good_count = df["good"].sum()
percent_bad = (bad_count / all_count)
percent_good = (good_count / all_count)

print(f"We have a {percent_bad:.4%} of word elimination !!")
print(f"We have a {percent_good:.4%} of word retention !!")
print(f"Average removal per set is   {df['bad_percent'].mean():.4}%")
print(f"Average retention per set is {df['good_percent'].mean():.4}%")

We have a 31.6198% of word elimination !!
We have a 68.3714% of word retention !!
Average removal per set is   21.12%
Average retention per set is 78.87%
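The pooled elimination rate (about 31.6%) is higher than the average per-set rate (about 21.1%) because the two figures weight the sets differently: the pooled rate sums word counts across sets, so large sets dominate, while the per-set average gives every set the same weight. The sketch below illustrates this with just two rows copied from the table above; it is a toy computation on a hand-picked subset, not a reproduction of the full statistics.

```python
# Two rows taken from the table above: a small set and a large set.
rows = {
    "EN_50h_63": {"all": 21496, "bad": 1176},
    "EN_800h_01": {"all": 211138, "bad": 124395},
}

# Pooled rate: total rejected words over total words (large sets dominate).
total_all = sum(r["all"] for r in rows.values())
total_bad = sum(r["bad"] for r in rows.values())
pooled_bad = total_bad / total_all

# Per-set mean: every set contributes equally, regardless of its size.
per_set_rates = [r["bad"] / r["all"] for r in rows.values()]
mean_bad = sum(per_set_rates) / len(per_set_rates)

print(f"pooled: {pooled_bad:.2%}, per-set mean: {mean_bad:.2%}")
```

On these two rows the pooled rate is well above the per-set mean, which is consistent with the larger splits in the table showing higher rejection percentages.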
