Overview

This release exists to distribute the wordfreq-rs model files as release Assets. (Please note that the code at this tag is still under construction.)

The model files {large,small}_xx.txt (where xx is a language code) list words and their frequencies in a plain-text format, one entry per line:

<word1> <freq1>
<word2> <freq2>
<word3> <freq3>
...
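
As a rough illustration of the format above, the following Rust sketch reads such a file into a map. It is not the wordfreq-rs loader: the function name parse_model is hypothetical, and it assumes that the input has already been decompressed (for example with the zstd command-line tool), that the frequency is the last space-separated field on each line, and that it parses as a 64-bit float.

```rust
use std::collections::HashMap;
use std::io::{self, BufRead};

/// Reads `<word> <freq>` lines into a word-to-frequency map.
///
/// Hypothetical helper, not part of wordfreq-rs. Assumptions: the input is
/// already decompressed text, and the frequency is the last space-separated
/// field on each line and parses as an f64.
pub fn parse_model<R: BufRead>(reader: R) -> io::Result<HashMap<String, f64>> {
    let mut model = HashMap::new();
    for line in reader.lines() {
        let line = line?;
        let line = line.trim_end();
        // Skip blank lines, such as a trailing newline at the end of the file.
        if line.is_empty() {
            continue;
        }
        // Split at the last space so the final field is taken as the frequency.
        let (word, freq) = line.rsplit_once(' ').ok_or_else(|| {
            io::Error::new(
                io::ErrorKind::InvalidData,
                format!("missing frequency: {line:?}"),
            )
        })?;
        let freq: f64 = freq.parse().map_err(|e| {
            io::Error::new(
                io::ErrorKind::InvalidData,
                format!("bad frequency in {line:?}: {e}"),
            )
        })?;
        model.insert(word.to_string(), freq);
    }
    Ok(model)
}
```

The function accepts any BufRead, so it can be fed a std::io::BufReader wrapped around an opened file or an in-memory byte slice. Malformed lines are reported as InvalidData errors rather than silently skipped.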

Credits

The model files were obtained by extracting the contents of the original model files {large,small}_xx.msgpack.gz distributed with wordfreq v3.0.2 (https://doi.org/10.5281/zenodo.7199437). The files in this release are compressed with Zstandard.

The model files are licensed under CC BY-SA 4.0. In accordance with the NOTICE, the original data sources are listed below:

Google Books Ngrams (https://books.google.com/ngrams/)
Google Books Syntactic Ngrams (http://commondatastorage.googleapis.com/books/syntactic-ngrams/index.html)
The Leeds Internet Corpus, from the University of Leeds Centre for Translation Studies (http://corpus.leeds.ac.uk/list.html)
Wikipedia, the free encyclopedia (http://www.wikipedia.org/)
ParaCrawl, a multilingual Web crawl (https://paracrawl.eu/)
OPUS OpenSubtitles 2018 (http://opus.nlpl.eu/OpenSubtitles.php)
OpenSubtitles project (http://www.opensubtitles.org/)
SUBTLEX word lists (http://crr.ugent.be/programs-data/subtitle-frequencies) created by Marc Brysbaert, Xiaodong Liu, Emmanuel Keuleers, Paweł Mandera, Michaël Stevens, Qing Cai, Eline Liekens, Joke Lauwers, Wim Tops, Ark Verma, Lise Van der Haegen, Maaike Callens, Heleen Vander Beken, Myrthe Princen, and Marco Marelli

If you redistribute the models, please include these credits as well.

Distributing wordfreq-rs models (v1)