Distributing wordfreq-rs models (v1)
Overview
This release exists to distribute wordfreq-rs models via Assets. (Please note that the code at this tag is still under construction.)
The model files {large,small}_xx.txt (where xx is a language code) list words and their frequencies in a plain-text format, one entry per line:
<word1> <freq1>
<word2> <freq2>
<word3> <freq3>
...
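The format above can be read with a few lines of Rust. The following is only a minimal sketch, not the loader used by wordfreq-rs: the function name parse_model is hypothetical, and it assumes that each line holds exactly two whitespace-separated fields, that the frequency parses as a floating-point number, and that the file has already been decompressed into a string.

```rust
use std::collections::HashMap;

/// Parses model text in the `<word> <freq>` line format into a map.
/// Assumption: exactly two whitespace-separated fields per line, with the
/// second field parseable as f64. The real wordfreq-rs loader may differ.
fn parse_model(text: &str) -> Result<HashMap<String, f64>, String> {
    let mut model = HashMap::new();
    for (i, line) in text.lines().enumerate() {
        let line = line.trim();
        if line.is_empty() {
            continue; // tolerate blank lines
        }
        let mut fields = line.split_whitespace();
        // Accept a line only if it has a word, a frequency, and nothing else.
        let (word, freq) = match (fields.next(), fields.next(), fields.next()) {
            (Some(w), Some(f), None) => (w, f),
            _ => return Err(format!("line {}: expected `<word> <freq>`", i + 1)),
        };
        let freq: f64 = freq
            .parse()
            .map_err(|e| format!("line {}: bad frequency: {}", i + 1, e))?;
        model.insert(word.to_string(), freq);
    }
    Ok(model)
}
```

Returning a Result with the 1-based line number makes a malformed entry easy to locate. If a word appears more than once, the later entry overwrites the earlier one in this sketch.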
Credits
Copyright 2022 Robyn Speer
Copyright 2023 Shunsuke Kanda
The model files are obtained by extracting the contents of the original model files {large,small}_xx.msgpack.gz distributed with wordfreq v3.0.2 (https://doi.org/10.5281/zenodo.7199437). Our files are compressed with Zstandard.
The model files are licensed under CC BY-SA 4.0. The original data sources, as listed in the NOTICE, are:
- Google Books Ngrams (https://books.google.com/ngrams/)
- Google Books Syntactic Ngrams (http://commondatastorage.googleapis.com/books/syntactic-ngrams/index.html)
- The Leeds Internet Corpus, from the University of Leeds Centre for Translation Studies (http://corpus.leeds.ac.uk/list.html)
- Wikipedia, the free encyclopedia (http://www.wikipedia.org/)
- ParaCrawl, a multilingual Web crawl (https://paracrawl.eu/)
- OPUS OpenSubtitles 2018 (http://opus.nlpl.eu/OpenSubtitles.php)
- OpenSubtitles project (http://www.opensubtitles.org/)
- SUBTLEX word lists (http://crr.ugent.be/programs-data/subtitle-frequencies) created by Marc Brysbaert, Xiaodong Liu, Emmanuel Keuleers, Paweł Mandera, Michaël Stevens, Qing Cai, Eline Liekens, Joke Lauwers, Wim Tops, Ark Verma, Lise Van der Haegen, Maaike Callens, Heleen Vander Beken, Myrthe Princen, and Marco Marelli
If you redistribute the models, please include these credits as well.