Generation of word digrams 2 grams pt_PT in Rust

Word digram (2-gram) and word frequency files for pt_PT.

Description

This small project is an efficient way of generating a digram file and a word frequency file for pt_PT (European Portuguese) from the European Parliament Proceedings Parallel Corpus 1996-2011 (see the references below for details).

The processing steps I applied:

The text file was split into phrases, and each phrase into space-delimited words.
Each word was checked against a regex of valid Portuguese characters.
Each word was then checked against the HunSpell pt_PT dictionary, version 2021.12.25 (see Projeto Natura, below), to confirm that it is a valid Portuguese word.
Because the text was written under the spelling rules in use before the orthographic agreement, I tried to recover words that had an extra mute 'c' or 'p', and words that should be capitalized, such as "alemanha" vs "Alemanha" or "opec" vs "OPEC".
Finally, I mapped misspelled words that differ from a valid word by a single accent mark to that valid word.
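The first two steps, together with the counting itself, can be sketched roughly as follows. This is a minimal illustration using only the standard library, not the code in this repository: the function names, the phrase delimiters, the accepted character set (a plain function standing in for the regex), and the choice to let an invalid token break the digram chain are all assumptions.

```rust
use std::collections::HashMap;

/// Returns true if every character is a letter used in Portuguese or a hyphen.
/// Stands in for the regex check; the exact character set is an assumption.
fn has_valid_pt_chars(word: &str) -> bool {
    const ACCENTED: &str = "áàâãéêíóôõúüçÁÀÂÃÉÊÍÓÔÕÚÜÇ";
    !word.is_empty()
        && word
            .chars()
            .all(|c| c.is_ascii_alphabetic() || c == '-' || ACCENTED.contains(c))
}

/// Counts single words and adjacent word pairs (digrams) inside each phrase.
fn count_words_and_digrams(
    text: &str,
) -> (HashMap<String, u64>, HashMap<(String, String), u64>) {
    let mut words: HashMap<String, u64> = HashMap::new();
    let mut digrams: HashMap<(String, String), u64> = HashMap::new();

    // Phrases end at sentence punctuation, so a digram never spans two phrases.
    for phrase in text.split(|c: char| matches!(c, '.' | '!' | '?' | ';' | ':')) {
        let mut previous: Option<&str> = None;
        for raw in phrase.split_whitespace() {
            // Drop punctuation glued to the start or end of the token.
            let tok = raw.trim_matches(|c: char| !c.is_alphabetic());
            if !has_valid_pt_chars(tok) {
                previous = None; // assumption: an invalid token breaks the chain
                continue;
            }
            *words.entry(tok.to_string()).or_insert(0) += 1;
            if let Some(prev) = previous {
                *digrams
                    .entry((prev.to_string(), tok.to_string()))
                    .or_insert(0) += 1;
            }
            previous = Some(tok);
        }
    }
    (words, digrams)
}
```

Calling count_words_and_digrams on a text returns two maps, one keyed by word and one keyed by (first word, second word). In the real pipeline the dictionary check and the corrections described above would sit between the character check and the counting.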

This was possible because HunSpell has a suggest() function in its API that returns lexically close valid words. Restricting the corrections to these near matches seemed to me a "safe" and simple approach.
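As an illustration of what a "safe" match could look like, the sketch below filters a list of suggestions (such as one returned by suggest()) down to candidates that differ from the rejected word only in case, in one extra mute 'c' or 'p', or in a single accent mark. All function names are hypothetical and the exact rules used in this project may differ; the suggestions are passed in as a plain slice so the example does not depend on any HunSpell binding.

```rust
/// Maps an accented Portuguese vowel to its unaccented base letter.
fn strip_accent(c: char) -> char {
    match c {
        'á' | 'à' | 'â' | 'ã' => 'a',
        'é' | 'ê' => 'e',
        'í' => 'i',
        'ó' | 'ô' | 'õ' => 'o',
        'ú' | 'ü' => 'u',
        other => other,
    }
}

/// True when both words have the same length and differ in exactly one
/// position, where the two letters share the same unaccented base.
fn differs_by_one_accent(a: &str, b: &str) -> bool {
    let a: Vec<char> = a.chars().collect();
    let b: Vec<char> = b.chars().collect();
    if a.len() != b.len() {
        return false;
    }
    let mut diffs = a.iter().zip(b.iter()).filter(|(x, y)| x != y);
    match (diffs.next(), diffs.next()) {
        (Some((x, y)), None) => strip_accent(*x) == strip_accent(*y),
        _ => false,
    }
}

/// True when removing a single 'c' or 'p' from `word` yields `suggestion`,
/// e.g. the older spelling "acção" against "ação".
fn has_extra_mute_consonant(word: &str, suggestion: &str) -> bool {
    let w: Vec<char> = word.chars().collect();
    (0..w.len()).any(|i| {
        matches!(w[i], 'c' | 'p')
            && w.iter()
                .enumerate()
                .filter(|(j, _)| *j != i)
                .map(|(_, c)| *c)
                .eq(suggestion.chars())
    })
}

/// True when the words are different but equal once lowercased.
fn differs_only_in_case(a: &str, b: &str) -> bool {
    a != b && a.to_lowercase() == b.to_lowercase()
}

/// Returns the first suggestion that counts as a "safe" correction, if any.
fn pick_safe_correction<'a>(word: &str, suggestions: &'a [String]) -> Option<&'a str> {
    suggestions.iter().map(|s| s.as_str()).find(|s| {
        differs_only_in_case(word, s)
            || has_extra_mute_consonant(word, s)
            || differs_by_one_accent(word, s)
    })
}
```

A word with no safe candidate returns None and would simply be left out of the counts, which keeps the correction step conservative.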

Note:
The suggest() function in HunSpell is rather slow because it has to try every single permutation within a predetermined distance; the SymSpell algorithm is faster in this regard. So I implemented a cache over the correct and incorrect words to reduce the number of calls made to HunSpell's suggest() function. It worked: the processing time went down from 3 or 4 hours to 20 minutes, on a single core.
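The caching idea can be sketched as a simple memoization layer. This is an assumed shape, not the actual implementation: SuggestCache and its fields are hypothetical names, and slow_suggest is a stand-in closure for whatever expensive spell checker call resolves a word to its accepted form (or to None when the word is rejected).

```rust
use std::collections::HashMap;

/// Memoizes an expensive lookup so each distinct word is resolved only once.
/// `slow_suggest` is a hypothetical stand-in for the spell checker call.
struct SuggestCache<F: Fn(&str) -> Option<String>> {
    slow_suggest: F,
    /// Stores the outcome for every word seen, accepted or rejected.
    cache: HashMap<String, Option<String>>,
    /// Number of times the slow path was actually taken.
    slow_calls: usize,
}

impl<F: Fn(&str) -> Option<String>> SuggestCache<F> {
    fn new(slow_suggest: F) -> Self {
        SuggestCache {
            slow_suggest,
            cache: HashMap::new(),
            slow_calls: 0,
        }
    }

    /// Returns the cached outcome, calling the slow function only on a miss.
    fn resolve(&mut self, word: &str) -> Option<String> {
        if let Some(hit) = self.cache.get(word) {
            return hit.clone();
        }
        self.slow_calls += 1;
        let result = (self.slow_suggest)(word);
        self.cache.insert(word.to_string(), result.clone());
        result
    }
}
```

Caching the rejected words as well as the accepted ones matters here, since a corpus repeats the same tokens many times and a repeated unknown word would otherwise trigger the slow path on every occurrence.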

References

1. Projeto Natura - Hunspell - dictionary pt_PT in Portuguese
https://natura.di.uminho.pt/wiki/doku.php?id=dicionarios:main
2. European Parliament Proceedings Parallel Corpus 1996-2011 - Portuguese
https://www.statmt.org/europarl/
3. SymSpell crate in Rust
https://github.com/reneklacan/symspell
4. Original SymSpell by Wolfgarbe in C#
It doesn't have a pt_PT word frequency dictionary and digrams.
https://github.com/wolfgarbe/SymSpell

License

I place my code under the MIT Open Source License.
However, the digram and word frequency files derived from the text remain under the licenses of their respective sources. Reference 2 says of the large 320 MB pt_PT text: "We are not aware of any copyright restrictions of the material.". See the reference European Parliament Proceedings Parallel Corpus 1996-2011 above.

Have fun!

Best regards,
João Nuno Carvalho
