# English STELLA Transcriptions Dataset

The STELLA dataset...



# Data preparation

Arrange the dataset into the directory layout shown below. This step extracts the audiobook transcriptions from the original dataset and sorts them into the same splits as the audio files.

```
txt
├── LANG
│   ├── HOUR_SPLIT
│   │   ├── SECTION_SPLIT
│   │   │   ├── books.txt
│   │   │   ├── meta.json
│   │   │   └── transcription.txt
│   │   ├── ...
│   ├── ...
│   ...

```
- txt: folder containing the transcriptions
- LANG: the language of the transcriptions
- HOUR_SPLIT: size of the section splits in hours of speech,
              formatted as (50h, 100h, ..., 3200h)
- SECTION_SPLIT: one section of the split; all sections hold an equal amount of speech content.
- books.txt: the list of books used for this split
- meta.json: metadata generated during clean-up, used to measure the effectiveness of cleaning.
- transcription.txt: the aggregated transcripts of the audiobooks in the list.
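
As an illustration of the layout above, the sketch below enumerates every section folder under a `txt` root and reports which of the three expected files are missing. The helper names `iter_sections` and `missing_files` are assumptions made for this example; they are not part of `lexical_benchmark`.

```python
from pathlib import Path
from typing import Iterator

# File names taken from the layout described above.
EXPECTED_FILES = ("books.txt", "meta.json", "transcription.txt")


def iter_sections(root: Path) -> Iterator[tuple[str, str, str, Path]]:
    """Yield (lang, hour_split, section, path) for every section folder.

    Hypothetical helper: assumes the root/LANG/HOUR_SPLIT/SECTION_SPLIT layout.
    """
    for section_dir in sorted(root.glob("*/*/*")):
        if section_dir.is_dir():
            lang, hour_split, section = section_dir.relative_to(root).parts
            yield lang, hour_split, section, section_dir


def missing_files(section_dir: Path) -> list[str]:
    """Return the expected files that are absent from a section folder."""
    return [name for name in EXPECTED_FILES if not (section_dir / name).is_file()]
```

Calling `missing_files(path)` on each path yielded by `iter_sections(Path("txt"))` gives a quick sanity check that the preparation step produced a complete tree.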

In [1]:
from lexical_benchmark.datasets.machine import stella
from lexical_benchmark.utils import timed_status

prep = stella.STELAPrepTranscripts(lang="EN")
with timed_status(status="Preparing STELA transcriptions...", complete_status="Successfully built STELA transcript dataset!"):
    prep.build_transcript()

Output()

## Data Cleaning

Clean up the text so that only clean words remain; these can then be piped through the dictionary.

RULES (order matters):
1) Illustration tag removal
2) URL removal
3) TextNormalisation: correct accents & remove non-printable characters
4) Transcribe numbers
5) Remove roman numerals
6) Fix symbols ($, €, etc.)
7) AZFilter

    * replaces '-' with a space to split hyphenated words (fifty-five -> fifty five)

    * keeps the apostrophe character (') to protect contractions (e.g. ain't)

    * purges every character outside [A-Z]

    * lowercases everything
8) Fix words by removing leading and trailing quote characters (')
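
To make rules 7 and 8 concrete, here is a minimal sketch of what such a filter could look like. It is not the `StelaCleaner` implementation: the function names are hypothetical, and replacing purged characters with a space (rather than deleting them) and lowercasing before purging are assumptions of this example.

```python
import re

# Anything that is not a lowercase letter, an apostrophe or whitespace.
_NOT_AZ = re.compile(r"[^a-z'\s]")


def az_filter(text: str) -> str:
    """Sketch of rule 7: split hyphens, lowercase, keep letters and apostrophes."""
    text = text.replace("-", " ")   # fifty-five -> fifty five
    text = text.lower()
    text = _NOT_AZ.sub(" ", text)   # purge every other character (assumption: -> space)
    return " ".join(text.split())   # collapse repeated whitespace


def strip_quotes(text: str) -> str:
    """Sketch of rule 8: drop leading/trailing apostrophes on each word."""
    words = (word.strip("'") for word in text.split())
    return " ".join(word for word in words if word)
```

Because the apostrophe survives rule 7, a word such as `ain't` stays intact, while quotation-style apostrophes around a word are only removed afterwards by rule 8. This is one reason the order of the rules matters.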

In [2]:
from lexical_benchmark.datasets.machine import stella
from lexical_benchmark.utils import timed_status

cleaner = stella.StelaCleaner()
with timed_status(status="Cleaning STELA Transcripts", complete_status="Successfully cleaned up STELA transcript dataset!"):
    cleaner.cleanup_raw(compute_word_freqs=True)

Output()

# Word Filtering

Using a pre-selected lexicon, we filter the corpus to separate known words from unknown words.
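
The idea can be sketched with a plain Python `set` standing in for the lexicon. The function `split_by_lexicon` is a hypothetical helper written for this example; the actual `DictionairyCleaner` may work differently.

```python
from collections import Counter


def split_by_lexicon(words: list[str], lexicon: set[str]) -> tuple[Counter, Counter]:
    """Count known ("good") and unknown ("bad") words separately."""
    good, bad = Counter(), Counter()
    for word in words:
        # Route each token to the counter matching its lexicon membership.
        (good if word in lexicon else bad)[word] += 1
    return good, bad


# Toy corpus: one unknown word among six tokens.
words = ["the", "cat", "the", "xqz", "the", "dog"]
good, bad = split_by_lexicon(words, lexicon={"the", "cat", "dog"})
```

Keeping frequency counts for both groups makes it possible to report rejection both per token (occurrences) and per type (distinct words).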

In [1]:
from lexical_benchmark.datasets.machine import stella
from lexical_benchmark.datasets import utils
from lexical_benchmark.utils import timed_status

dataset_cleaner = stella.StelaCleaner()

with timed_status(status="Word Filtering", complete_status="Successfully completed word filtering!"):
    dataset_cleaner.mk_clean(
        word_cleaners={"EN": utils.DictionairyCleaner(lang="EN")},
        compute_word_freqs=True,
    )

Output()

## Computing cleanup stats

In [None]:
from lexical_benchmark.datasets.machine import stella
from lexical_benchmark.utils import timed_status
from pathlib import Path
# ------------------
DEBUG = False

SAVE_CSVs = False
PRINT = True
# ------------------
# Configure inputs
clean_dataset = stella.STELLADatasetClean()
raw_dataset = stella.STELLADatasetRaw()
meta_root = Path("stela_wf") if DEBUG else clean_dataset.wf_meta_dir
(meta_root / "raw").mkdir(exist_ok=True, parents=True)
(meta_root / "clean").mkdir(exist_ok=True, parents=True)
(meta_root / "bad").mkdir(exist_ok=True, parents=True)

stats_dict = {}

with timed_status(status="Extracting Stats...", complete_status="Successfully extracted statistics!"):
    for lang, hour_split, section in clean_dataset.iter_all():
        clean_stats = clean_dataset.word_stats_by_split(lang, hour_split)
        raw_stats = raw_dataset.word_stats_by_split(lang, hour_split)

        stats_dict[f"{lang}-{hour_split}"] = {
            "raw": raw_stats,
            "clean": clean_stats
        }

        # Save Files
        if SAVE_CSVs:
            raw_stats.freq_map.to_csv(meta_root / f"raw/{lang}_{hour_split}.word-freq.csv", index=False)
            clean_stats["good"].freq_map.to_csv(meta_root / f"clean/{lang}_{hour_split}.word-freq.csv", index=False)
            clean_stats["bad"].freq_map.to_csv(meta_root / f"bad/{lang}_{hour_split}.word-freq.csv", index=False)


In [16]:
# Print results
if PRINT:
    for key, value in stats_dict.items():
        print("-" * 10 + f"\n==> {key}")
        print(f"Token rejection : {value['clean']['bad'].token_nb / value['raw'].token_nb:.2%}")
        print(f"Type rejection : {value['clean']['bad'].type_nb / value['raw'].type_nb:.2%} ")
        print("-" * 10)

----------
==> EN-100h
Token rejection : 1.51%
Type rejection : 60.66% 
----------
----------
==> EN-1600h
Token rejection : 1.51%
Type rejection : 60.66% 
----------
----------
==> EN-200h
Token rejection : 1.51%
Type rejection : 60.67% 
----------
----------
==> EN-3200h
Token rejection : 1.51%
Type rejection : 60.66% 
----------
----------
==> EN-400h
Token rejection : 1.51%
Type rejection : 60.66% 
----------
----------
==> EN-50h
Token rejection : 1.51%
Type rejection : 60.66% 
----------
----------
==> EN-800h
Token rejection : 1.51%
Type rejection : 60.66% 
----------


### Computing Block Averaging

To estimate the word rejection rate in the dataset, we cut each section split into chunks
of a fixed size (16k tokens per chunk), compute the rejection rate of every chunk, and then
summarise the per-chunk rates by their average and median.
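
Written out, with notation introduced here for illustration only: let $c_i$ be the $i$-th chunk (a sequence of tokens), $V(c_i)$ the set of distinct word types in it, and $L$ the lexicon. The per-chunk rates, in percent, and their block average over $N$ chunks are then

$$
r^{\text{tok}}_i = 100 \cdot \frac{\left|\{\, j : c_{i,j} \notin L \,\}\right|}{|c_i|},
\qquad
r^{\text{type}}_i = 100 \cdot \frac{\left| V(c_i) \setminus L \right|}{\left| V(c_i) \right|},
\qquad
\bar{r} = \frac{1}{N} \sum_{i=1}^{N} r_i .
$$

The median of the $r_i$ is reported alongside the average as a summary that is less sensitive to a few unusual chunks.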

In [2]:
import random

import numpy as np
import matplotlib.pyplot as plt

from lexical_benchmark.datasets.utils import lexicon
from lexical_benchmark.datasets.machine import stella
from lexical_benchmark.utils import timed_status


# TODO make it per split not per hour_split

# TODO chunk size of 1600
# TODO chunk size of 160 000
def split_and_fill_chunks(word_list: list[str], chunk_size: int = 16_000) -> list[list[str]]:
    """Split the given word list into consecutive chunks of `chunk_size` words.

    The trailing chunk is dropped when it is shorter than `chunk_size`,
    so that every chunk carries the same weight in the stats.
    """
    # Step 1: split the list into chunks of chunk_size
    chunks = [word_list[i:i + chunk_size] for i in range(0, len(word_list), chunk_size)]
    # Step 2: drop the incomplete trailing chunk, if any
    return [c0 for c0 in chunks if len(c0) == chunk_size]



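def calculate_rejection_rates(chunk_list, lexicon: lexicon.Lexicon):
    """Compute the token and type rejection rates (in %) of each chunk.

    Note: the returned token and type counts are those of the last chunk only
    (0 when `chunk_list` is empty), not totals over all chunks.
    """
    rejection_rates = []
    # Defaults keep the return statement valid when chunk_list is empty
    total_tokens, total_token_types = 0, 0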

    for chunk in chunk_list:
        total_tokens = len(chunk)  # Total tokens (words)
        total_token_types = len(set(chunk))  # Unique token types

        # Check which tokens are valid
        invalid_tokens = 0
        invalid_token_types = set()  # To track invalid token types

        for token in chunk:
            if not lexicon(token):
                invalid_tokens += 1  # Count invalid tokens
                invalid_token_types.add(token)  # Add to invalid token types set

        # Calculate rejection rates
        token_rejection_rate = (invalid_tokens / total_tokens) * 100
        token_type_rejection_rate = (len(invalid_token_types) / total_token_types) * 100

        # Store rejection rates for the chunk
        rejection_rates.append({
            'chunk': chunk,
            'token_rejection_rate': token_rejection_rate,
            'token_type_rejection_rate': token_type_rejection_rate
        })

    return rejection_rates, total_tokens, total_token_types


def plot_rejection_rate_trends(rejection_rates, set_label):
    chunks = np.arange(len(rejection_rates)) + 1  # Chunk numbers
    token_rejection_rates = [rate['token_rejection_rate'] for rate in rejection_rates]
    token_type_rejection_rates = [rate['token_type_rejection_rate'] for rate in rejection_rates]

    plt.plot(chunks, token_rejection_rates, label='Token Rejection Rate', marker='o')
    plt.plot(chunks, token_type_rejection_rates, label='Token Type Rejection Rate', marker='o')

    plt.xlabel('Chunks')
    plt.ylabel('Rejection Rate (%)')
    plt.title(f'Rejection Rate Trends Across Chunks ({set_label})')
    plt.legend()
    plt.grid(True)
    plt.show()


def calculate_avg_and_median_rejection_rates(rejection_rates):
    # Extract token and token type rejection rates into lists
    token_rejection_rates = [rate['token_rejection_rate'] for rate in rejection_rates]
    token_type_rejection_rates = [rate['token_type_rejection_rate'] for rate in rejection_rates]

    # Calculate averages
    avg_token_rejection_rate = np.mean(token_rejection_rates)
    avg_token_type_rejection_rate = np.mean(token_type_rejection_rates)

    # Calculate medians
    median_token_rejection_rate = np.median(token_rejection_rates)
    median_token_type_rejection_rate = np.median(token_type_rejection_rates)

    return {
        'avg_token_rejection_rate': avg_token_rejection_rate,
        'avg_token_type_rejection_rate': avg_token_type_rejection_rate,
        'median_token_rejection_rate': median_token_rejection_rate,
        'median_token_type_rejection_rate': median_token_type_rejection_rate
    }


#### Computation of rejection rates

In [3]:
rejection_rates = {}
raw_dataset = stella.STELLADatasetRaw()
CHUNK_SIZE = 16_000
with timed_status(status="Calculating Rejection Rates of StellaRaw ...", complete_status="Completed !"):
    for lang in raw_dataset.get_languages():
        lexique = lexicon.Lexicon(lang=lang)
        for _, split in raw_dataset.iter_hour_splits(lang):
            for _, _, section in raw_dataset.iter_sections(lang, split):
                word_list = raw_dataset.get_all_raw_words_from_section(language=lang, hour_split=split, section=section)
                word_chunk_list = split_and_fill_chunks(word_list, chunk_size=CHUNK_SIZE)
                rj_rate, token_count, type_count = calculate_rejection_rates(word_chunk_list, lexique)
                rejection_rates[f"{lang}_{split}_{section}"] = {
                    "rates": rj_rate,
                    "token_count": token_count,
                    "type_count": type_count
                }

Output()

#### Computation of Averages

In [5]:
import collections

table = collections.defaultdict(list)
with timed_status(status="Computing Averages ...", complete_status="Completed !"):
    for lang in raw_dataset.get_languages():
        for _, split in raw_dataset.iter_hour_splits(lang):
            for _, _, section in raw_dataset.iter_sections(lang, split):
                avgs = calculate_avg_and_median_rejection_rates(
                    rejection_rates[f"{lang}_{split}_{section}"]["rates"]
                )
                table[f"{lang}/{split}"].append(
                    {
                        "Section": f"{section}",
                        "Tokens": rejection_rates[f"{lang}_{split}_{section}"]['token_count'],
                        "Token Rejection (avg)": avgs['avg_token_rejection_rate'],
                        "Token Rejection (median)": avgs['median_token_rejection_rate'],
                        "Types": rejection_rates[f"{lang}_{split}_{section}"]['type_count'],
                        "Type Rejection (avg)": avgs['avg_token_type_rejection_rate'],
                        "Type Rejection (median)": avgs['median_token_type_rejection_rate']
                    }
                )

Output()

##### Print Results

In [9]:
import pandas as pd
from IPython.display import display

table = dict(table)
for lang in raw_dataset.get_languages():
    for _, split in raw_dataset.iter_hour_splits(lang):
        print("-"*5 + f"{lang}/{split}" + "-"*5)
        df = pd.DataFrame(table[f"{lang}/{split}"])
        df = df.style.format({
            "Token Rejection (avg)": "{:.2f}%",
            "Token Rejection (median)": "{:.2f}%",
            "Type Rejection (avg)": "{:.2f}%",
            "Type Rejection (median)": "{:.2f}%",
        })
        display(df)

| Section | Tokens | Token Rejection (avg) | Token Rejection (median) | Types | Type Rejection (avg) | Type Rejection (median) |
|---|---|---|---|---|---|---|
| 0 | 16000 | 16.64% | 16.38% | 3761 | 41.27% | 41.41% |
| 1 | 16000 | 17.03% | 16.86% | 5551 | 41.10% | 41.13% |
| 2 | 16000 | 17.17% | 16.29% | 4271 | 41.90% | 41.01% |
| 3 | 16000 | 16.39% | 15.60% | 4762 | 40.78% | 40.55% |
| 4 | 16000 | 16.86% | 16.95% | 5283 | 40.81% | 40.81% |
| 5 | 16000 | 18.17% | 18.32% | 3391 | 46.95% | 44.97% |
| 6 | 16000 | 15.63% | 15.76% | 4831 | 39.02% | 39.15% |
| 7 | 16000 | 17.21% | 17.00% | 4645 | 40.67% | 40.71% |
| 8 | 16000 | 18.72% | 17.29% | 3638 | 44.53% | 41.49% |
| 9 | 16000 | 15.34% | 15.14% | 4568 | 38.81% | 38.12% |


| Section | Tokens | Token Rejection (avg) | Token Rejection (median) | Types | Type Rejection (avg) | Type Rejection (median) |
|---|---|---|---|---|---|---|
| 0 | 16000 | 17.76% | 17.09% | 2924 | 42.78% | 41.35% |
| 1 | 16000 | 17.54% | 16.64% | 3616 | 42.82% | 41.26% |


| Section | Tokens | Token Rejection (avg) | Token Rejection (median) | Types | Type Rejection (avg) | Type Rejection (median) |
|---|---|---|---|---|---|---|
| 0 | 16000 | 16.78% | 16.46% | 4072 | 41.18% | 41.08% |
| 1 | 16000 | 16.73% | 15.93% | 5112 | 41.32% | 40.40% |
| 2 | 16000 | 17.75% | 17.94% | 4828 | 45.01% | 42.20% |
| 3 | 16000 | 16.36% | 16.11% | 5555 | 39.74% | 39.61% |
| 4 | 16000 | 17.17% | 16.01% | 3726 | 41.92% | 40.38% |
| 5 | 16000 | 18.36% | 18.26% | 4367 | 44.74% | 42.93% |
| 6 | 16000 | 19.78% | 17.53% | 3831 | 43.59% | 39.85% |
| 7 | 16000 | 18.16% | 17.65% | 4198 | 42.78% | 42.28% |
| 8 | 16000 | 15.97% | 16.10% | 3920 | 40.76% | 40.14% |
| 9 | 16000 | 16.17% | 15.62% | 4654 | 40.69% | 40.45% |


| Section | Tokens | Token Rejection (avg) | Token Rejection (median) | Types | Type Rejection (avg) | Type Rejection (median) |
|---|---|---|---|---|---|---|
| 0 | 16000 | 17.65% | 16.89% | 3528 | 42.79% | 41.31% |


| Section | Tokens | Token Rejection (avg) | Token Rejection (median) | Types | Type Rejection (avg) | Type Rejection (median) |
|---|---|---|---|---|---|---|
| 0 | 16000 | 16.79% | 16.50% | 4773 | 41.30% | 40.78% |
| 1 | 16000 | 17.11% | 16.81% | 4272 | 42.59% | 41.03% |
| 2 | 16000 | 17.87% | 17.61% | 4539 | 43.62% | 42.04% |
| 3 | 16000 | 19.11% | 17.59% | 4354 | 43.29% | 41.36% |
| 4 | 16000 | 16.04% | 15.85% | 4809 | 40.72% | 39.98% |
| 5 | 16000 | 19.40% | 17.41% | 4776 | 46.66% | 41.97% |
| 6 | 16000 | 17.42% | 17.37% | 5297 | 41.55% | 42.18% |
| 7 | 16000 | 16.76% | 16.52% | 3467 | 41.07% | 40.58% |


| Section | Tokens | Token Rejection (avg) | Token Rejection (median) | Types | Type Rejection (avg) | Type Rejection (median) |
|---|---|---|---|---|---|---|
| 0 | 16000 | 17.84% | 17.44% | 3861 | 42.33% | 42.17% |
| 1 | 16000 | 15.82% | 15.62% | 3522 | 40.62% | 40.31% |
| 2 | 16000 | 17.78% | 17.10% | 5288 | 42.51% | 43.10% |
| 3 | 16000 | 16.14% | 16.80% | 3959 | 39.89% | 39.99% |
| 4 | 16000 | 18.68% | 16.85% | 3788 | 44.62% | 42.33% |
| 5 | 16000 | 15.19% | 15.31% | 4398 | 38.26% | 38.85% |
| 6 | 16000 | 18.43% | 18.57% | 4936 | 42.21% | 41.54% |
| 7 | 16000 | 15.25% | 14.43% | 3527 | 39.98% | 38.79% |
| 8 | 16000 | 17.81% | 17.90% | 4435 | 41.88% | 41.51% |
| 9 | 16000 | 15.61% | 15.93% | 4933 | 39.46% | 40.02% |


| Section | Tokens | Token Rejection (avg) | Token Rejection (median) | Types | Type Rejection (avg) | Type Rejection (median) |
|---|---|---|---|---|---|---|
| 0 | 16000 | 16.97% | 16.76% | 3845 | 42.02% | 40.89% |
| 1 | 16000 | 18.46% | 17.65% | 4130 | 43.47% | 41.80% |
| 2 | 16000 | 17.89% | 16.49% | 4739 | 43.95% | 41.08% |
| 3 | 16000 | 17.08% | 16.74% | 3463 | 41.31% | 41.39% |
