pt_PT digrams (n-grams) and word frequency files.
This small project is an efficient way of generating a digram file and a word frequency file for pt_PT (European Portuguese) from the European Parliament Proceedings Parallel Corpus 1996-2011 (see the references below for details).
The processing steps that I applied:
- The text file was split into sentences, and each sentence into space-delimited words.
- Each word was checked for valid Portuguese characters with a regular expression.
- Each word was checked against the Hunspell pt_PT dictionary from 2021.12.25 to confirm that it is a valid Portuguese word (see Projeto Natura, below).
- Because the text uses the spelling from before the orthographic agreement, I tried to save the words that had one extra mute 'c' or 'p', and the words that should start with or be in uppercase, like "alemanha" vs "Alemanha" or "opec" vs "OPEC".
- Then I mapped misspelled words that differ from a valid word by just one accent mark to the correct word. This was possible because Hunspell has a suggest() function in its API that returns lexically close valid words. This similarity mapping seemed to me a "safe" and simple process to do.
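As an illustration only, the splitting, character check, and counting described in the steps above could be sketched as follows. This README does not show the project's code, so everything here is an assumption: Python is used only for the example, the regular expression and the sentence delimiters are guesses, and `is_valid_word` is a hypothetical callable standing in for the Hunspell dictionary lookup.

```python
import re
from collections import Counter

# Assumption: the letters accepted in a Portuguese word. The regular
# expression actually used by the project is not shown in this README.
LETTERS = "a-zA-ZáàâãéêíóôõúüçÁÀÂÃÉÊÍÓÔÕÚÜÇ"
VALID_WORD = re.compile(rf"[{LETTERS}]+(?:-[{LETTERS}]+)*")

# Simplification: any run of these characters ends a sentence, so
# abbreviations and decimal numbers are split too.
SENTENCE_END = re.compile(r"[.!?;:]+")


def count_words_and_digrams(text, is_valid_word):
    """Return (word_freq, digram_freq) as two Counter objects.

    is_valid_word is a hypothetical callable (word -> bool) that stands
    in for the dictionary check.
    """
    word_freq = Counter()
    digram_freq = Counter()
    for sentence in SENTENCE_END.split(text):
        previous = None
        for token in sentence.split():
            # Drop punctuation glued to the word.
            word = token.strip(",()\"'«»")
            if VALID_WORD.fullmatch(word) and is_valid_word(word):
                word_freq[word] += 1
                if previous is not None:
                    digram_freq[(previous, word)] += 1
                previous = word
            else:
                # Design choice of this sketch: never pair two words
                # that were separated by a rejected token.
                previous = None
    return word_freq, digram_freq
```

Digrams are counted only inside a sentence, so the last word of one sentence is never paired with the first word of the next. Case is kept as found, because the steps above treat "alemanha" and "Alemanha" as different words.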
Note:
The suggest() function in Hunspell is rather slow because it has to try every permutation within a predetermined edit distance. The SymSpell algorithm is faster in this regard. So I implemented a cache over the correct and incorrect words, to lower the number of calls made to the suggest() function from Hunspell. It worked: the processing time went down from 3 or 4 hours to 20 minutes, on a single core.
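A minimal sketch of how such a cache, combined with the one-accent mapping, might look. It is not the project's implementation: Python is used only for illustration, `CachedCorrector` and `differs_by_one_accent` are hypothetical names, and `is_valid_word` and `suggest` are callables supplied by the caller that stand in for the Hunspell dictionary lookup and its suggest() function.

```python
import unicodedata


def strip_accents(text):
    """Remove combining marks, so 'í' becomes 'i' and 'ç' becomes 'c'."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def differs_by_one_accent(a, b):
    """True when a and b differ in exactly one character and that
    character has the same base letter in both words.
    Note: the cedilla is treated like an accent in this sketch."""
    a = unicodedata.normalize("NFC", a)
    b = unicodedata.normalize("NFC", b)
    if len(a) != len(b):
        return False
    diffs = [(x, y) for x, y in zip(a, b) if x != y]
    return len(diffs) == 1 and strip_accents(diffs[0][0]) == strip_accents(diffs[0][1])


class CachedCorrector:
    """Remembers the outcome for every word already seen, valid or not,
    so the slow suggest call runs at most once per distinct word."""

    def __init__(self, is_valid_word, suggest):
        self._is_valid_word = is_valid_word  # hypothetical: word -> bool
        self._suggest = suggest              # hypothetical: word -> list of str
        self._cache = {}                     # word -> accepted form, or None

    def correct(self, word):
        if word in self._cache:
            return self._cache[word]
        if self._is_valid_word(word):
            result = word
        else:
            # Keep only a suggestion that differs by one accent mark.
            result = next(
                (c for c in self._suggest(word) if differs_by_one_accent(word, c)),
                None,
            )
        self._cache[word] = result
        return result
```

Because a corpus of this size repeats the same words many times, most calls to `correct` are answered from the dictionary lookup in `_cache` and never reach `suggest`.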
- Projeto Natura - Hunspell - dictionary pt_PT in Portuguese
  https://natura.di.uminho.pt/wiki/doku.php?id=dicionarios:main
- European Parliament Proceedings Parallel Corpus 1996-2011 - Portuguese
  https://www.statmt.org/europarl/
- SymSpell crate in Rust
  https://github.com/reneklacan/symspell
- Original SymSpell by Wolfgarbe in C#
  It doesn't have a pt_PT word frequency dictionary or digrams.
  https://github.com/wolfgarbe/SymSpell
I place my code under the MIT Open Source License.
But the digram file and the word frequency file derived from the text remain under their respective licenses. The second reference above (the European Parliament Proceedings Parallel Corpus 1996-2011) says, about the big 320 MB pt_PT text, "We are not aware of any copyright restrictions of the material.".
Best regards,
João Nuno Carvalho