# Wordlist filtering

We filter words to separate valid from invalid speech, using a set of lexicon dictionaries.

Aggregated, the dictionaries contain 1,534,810 unique words, which gives us good coverage of the English language.
Ideally we would use the Google Books corpus, with more than 2M words, but as far as I know it is not open-source.
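
The filtering step described above can be sketched as a simple membership test against a lexicon set. This is a minimal illustration, not the project's actual pipeline: the function name `split_by_lexicon`, the toy lexicon, and the choice to lowercase tokens before lookup are all assumptions.

```python
def split_by_lexicon(tokens, lexicon):
    """Partition tokens into (valid, invalid) lists by lexicon membership."""
    valid, invalid = [], []
    for tok in tokens:
        # Assumption: lookups are case-insensitive, so tokens are lowercased first
        if tok.lower() in lexicon:
            valid.append(tok)
        else:
            invalid.append(tok)
    return valid, invalid


# Toy lexicon standing in for the aggregated word lists
lexicon = {"the", "dog", "barks"}
valid, invalid = split_by_lexicon(["The", "dog", "barks", "wuzzle"], lexicon)
print(valid)
print(invalid)
```

Using a `set` keeps each lookup at constant time on average, which matters with a lexicon of about 1.5 million entries.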


### SCOWLv2 - ~676k words

[Aspell](http://wordlist.aspell.net) is a spellchecker created by open-source communities, and
variations of it are used by Mozilla, OpenOffice and various Linux distributions.

Their site has various dictionaries & tools to make spellcheckers.

We use the tool on this page to generate a wordlist => [Create wordlist](http://app.aspell.net/create)

The following parameters are used:

- diacritic=strip (replaces words like café with cafe)
- max_size=95 (the largest available, with **~676k words**)
- max_variant=3 (the largest available; allows multiple variants of the same word)
- special=hacker (includes computer-science-related vocabulary; we excluded roman numerals)
- spelling: US, GB (s & z variants)
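
To illustrate what the `diacritic=strip` option does to an entry, here is a minimal sketch using Python's standard `unicodedata` module. It is not the SCOWL tool's own implementation; the helper name `strip_diacritics` is hypothetical and only shows the effect on a single word.

```python
import unicodedata


def strip_diacritics(word: str) -> str:
    """Remove combining accent marks, e.g. turning 'café' into 'cafe'."""
    # NFD splits an accented letter into its base letter plus combining marks
    decomposed = unicodedata.normalize("NFD", word)
    # Keep only the characters that are not combining marks
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(strip_diacritics("café"))
```

The same normalisation may be useful on the input side, so that accented tokens match the stripped lexicon entries.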


**CITATION**: None. The copyright notice on GitHub gives thanks to the various open-source projects that contributed to the lexicon.


### Kaikki.org - ~1.2M words

Kaikki.org describes itself as a digital archive and a data mining group that aims to make "our digital heritage more
accessible and useful for people, researchers, linguists, software developers, and artificial
intelligence (AI)".

We used the All-English forms dictionary, which is their largest English dictionary available; you can find it [here](https://kaikki.org/dictionary/English/index.html).

We did some post-processing to extract only the words from the dictionary.


**CITATION**:  If you use this data in academic research, please cite [Tatu Ylonen: Wiktextract: Wiktionary as Machine-Readable Structured Data](http://www.lrec-conf.org/proceedings/lrec2022/pdf/2022.lrec-1.140.pdf), Proceedings of the 13th Conference on Language Resources and Evaluation (LREC), pp. 1317-1325, Marseille, 20-25 June 2022. Linking to the relevant page(s) under https://kaikki.org would also be greatly appreciated.


### Yet Another Word List (YAWL) - ~264k words

Open-source word list, found here: https://github.com/elasticdog/yawl


### CHILDES extras

Using the tags from CHILDES, we create a lexicon of non-words.
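
The source does not say which CHILDES tags are used, so the following is only a sketch under stated assumptions. It assumes tokens carry CHAT-style special-form markers of the shape `form@tag`, and it treats `@b` (babbling), `@c` (child-invented form) and `@o` (onomatopoeia) as marking non-words. Both the tag subset and the helper name `extract_non_words` are hypothetical.

```python
import re

# Assumed subset of CHAT special-form markers; the tags actually used
# to build this lexicon are not stated in the source.
NON_WORD_TAGS = {"b", "c", "o"}

# Matches tokens of the shape "form@tag", e.g. "woof@o"
TAGGED_TOKEN = re.compile(r"^(?P<form>[^\s@]+)@(?P<tag>[a-z]+)$")


def extract_non_words(tokens):
    """Collect the forms of tokens whose tag marks them as non-words."""
    non_words = set()
    for tok in tokens:
        match = TAGGED_TOKEN.match(tok)
        if match and match.group("tag") in NON_WORD_TAGS:
            non_words.add(match.group("form").lower())
    return non_words


print(sorted(extract_non_words(["the", "woof@o", "gaga@b", "casa@s", "dog"])))
```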

## Data Organisation

Each subset has been processed into a lexicon folder with the following layout:

```
lexicon/
├── kaikki
│   ├── english-dictionairy.jsonl
│   ├── README.md
│   └── words.list
├── SCOWLv2
│   ├── README
│   ├── README_SCOWL
│   └── words.list
└── yawl
    ├── README.md
    └── words.list
```

The word lists are stored as text files named `words.list`, with one entry per line.
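
Given this layout, every subset can be loaded in the same way. The sketch below is one possible loader: the function name `load_lexicons` is hypothetical, and skipping blank lines and reading the files as UTF-8 are assumptions rather than documented behaviour.

```python
from pathlib import Path


def load_lexicons(root):
    """Return {subset_name: set_of_words} for every <root>/<subset>/words.list."""
    lexicons = {}
    for word_file in sorted(Path(root).glob("*/words.list")):
        lines = word_file.read_text(encoding="utf-8").splitlines()
        # One entry per line; blank lines are ignored (assumption)
        lexicons[word_file.parent.name] = {ln.strip() for ln in lines if ln.strip()}
    return lexicons
```

Calling `load_lexicons("lexicon/")` on the tree above would return a dictionary keyed by `kaikki`, `SCOWLv2` and `yawl`.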


# Kaikki.org post-processing

```python
import json
from pathlib import Path
from rich.console import Console

console = Console()

dict_file = Path("/scratch1/projects/lexical-benchmark/v2/datasets/lexicon/kaikki/english-dictionairy.jsonl")
# Written as `words.list` to match the lexicon layout described above
words_file = Path("/scratch1/projects/lexical-benchmark/v2/datasets/lexicon/kaikki/words.list")


def get_word(line: str) -> str | None:
    """Return the "word" field of one JSONL entry, or None if it is missing."""
    data = json.loads(line)
    return data.get("word")


with dict_file.open() as f_dict, words_file.open("w") as f_words, console.status("Extracting kaikki.org English word list..."):
    words = (get_word(item) for item in f_dict)
    # Drop entries without a word and keep only unique items
    words = {w for w in words if w is not None}
    print(f"{len(words)=}")
    # Write one word per line
    for w in words:
        f_words.write(f"{w}\n")
```

## Testing the resulting lexicon

```python
from pathlib import Path

lexicon_root = Path("/scratch1/projects/lexical-benchmark/v2/datasets/lexicon/")
kaikki_file = lexicon_root / "kaikki" / "words.list"
scowl_file = lexicon_root / "SCOWLv2" / "words.list"
yawl_file = lexicon_root / "yawl" / "words.list"

# Load each word list as a set of unique entries (one entry per line)
kaikki = set(kaikki_file.read_text().splitlines())
scowl = set(scowl_file.read_text().splitlines())
yawl = set(yawl_file.read_text().splitlines())

print(f"{len(kaikki)=:_} words")
print(f"{len(scowl)=:_} words")
print(f"{len(yawl)=:_} words")

print('-'*5)
# Union of all three lexicons
all_words = kaikki | scowl | yawl
print(f"{len(all_words)=:_} words")
print('-'*5)
# Pairwise unions, to see how much each pair of lexicons overlaps
kaikki_scowl = kaikki | scowl
print(f"{len(kaikki_scowl)=:_} words")
print('-'*5)
kaikki_yawl = kaikki | yawl
print(f"{len(kaikki_yawl)=:_} words")
print('-'*5)
scowl_yawl = scowl | yawl
print(f"{len(scowl_yawl)=:_} words")
```

```
len(kaikki)=1_226_594 words
len(scowl)=675_611 words
len(yawl)=264_097 words
-----
len(all_words)=1_534_810 words
-----
len(kaikki_scowl)=1_534_524 words
-----
len(kaikki_yawl)=1_251_196 words
-----
len(scowl_yawl)=676_076 words
```
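
The counts above are enough to work out how much the lexicons overlap, using inclusion-exclusion: the size of an intersection is `|A| + |B| - |A ∪ B|`. The sketch below applies this to the reported numbers only; the helper name `overlap` and the dictionary names are illustrative.

```python
def overlap(size_a: int, size_b: int, size_union: int) -> int:
    """Size of the intersection of two sets, from their sizes and their union's size."""
    return size_a + size_b - size_union


# Counts copied from the output above
counts = {"kaikki": 1_226_594, "scowl": 675_611, "yawl": 264_097}
unions = {
    ("kaikki", "scowl"): 1_534_524,
    ("kaikki", "yawl"): 1_251_196,
    ("scowl", "yawl"): 676_076,
}

shared = {pair: overlap(counts[pair[0]], counts[pair[1]], size) for pair, size in unions.items()}
for (a, b), n in shared.items():
    print(f"{a} & {b}: {n:_} shared words")

# Words that only yawl contributes to the full union
yawl_only = 1_534_810 - unions[("kaikki", "scowl")]
print(f"{yawl_only=:_}")
```

By this arithmetic, yawl shares 263,632 of its 264,097 words with SCOWL and adds only 286 words to the combined lexicon.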
