[nl] add potentially missing words from Tatoeba to speller / POS tagger #7016

Open
danielnaber opened this issue Aug 14, 2022 · 1 comment
Labels
Dutch Especially for Dutch

Comments

danielnaber commented Aug 14, 2022

Below are all words from Tatoeba with >= 5 occurrences that are unknown to LT's spell checker and/or POS tagger (see org.languagetool.dev.UnknownWordFinder). Should they be added to spelling.txt and added.txt?
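
For readers who want to reproduce the idea outside LT, here is a minimal Python sketch of the selection criterion (count unknown words, keep those with at least 5 occurrences). It is not the code of org.languagetool.dev.UnknownWordFinder; the function name `unknown_word_counts`, the `is_known` callback, and the whitespace tokenization are all assumptions made for illustration.

```python
import string
from collections import Counter

# Punctuation stripped from both ends of a token (ASCII plus a few typographic marks).
EDGE_PUNCTUATION = string.punctuation + "“”‘’…"


def unknown_word_counts(sentences, is_known, min_count=5):
    """Count words for which is_known(word) is False.

    `is_known` is a hypothetical callback standing in for a real
    spell checker or POS tagger lookup. Returns (word, count) pairs
    with count >= min_count, most frequent first.
    """
    counts = Counter()
    for sentence in sentences:
        # Naive whitespace tokenization; a real tokenizer would differ.
        for token in sentence.split():
            word = token.strip(EDGE_PUNCTUATION)
            if word and not is_known(word):
                counts[word] += 1
    return [(word, n) for word, n in counts.most_common() if n >= min_count]


if __name__ == "__main__":
    known = {"ziet", "Tom"}  # toy dictionary, for the demo only
    corpus = ["Yanni ziet Tom."] * 5 + ["Ziri ziet Tom."] * 2
    for word, n in unknown_word_counts(corpus, known.__contains__):
        print(n, word)
```

In the toy run only "Yanni" reaches the threshold, because "Ziri" occurs just twice.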

Unknown to spell checker

146 Yanni
130 Tatoeba
101 Mennad
60 Ziri
32 Kyoto
29 Skura
28 Al-Sayib
28 Yidir
27 godsnaam
23 Tokyo
23 ziens
22 Hokkaido
20 nature
18 Fuji
17 vredesnaam
17 Mayuko
14 Tom's
13 Tiziri
13 hijaab
12 Schots-Gaelisch
12 Narita
10 Klingon
10 lk
10 Muiriel
10 Yumi
10 harte
9 koste
9 Yamada
9 Taninna
9 Keiko
9 zegene
8 Dr
8 St
8 én
7 the
7 Nagoya
7 Nahuatl
7 Volapük
7 São
6 Sadako
6 Saint
6 Pona
6 onrechte
6 gestresst
6 and
6 Guarani
6 Buenos
6 Toki
5 Masao
5 Indo-Arische
5 sneeuwig
5 COVID-19-vaccin
5 Toshio
5 Amhaars
5 journalistiekstudent
5 Lojban
5 dr
5 Occitaans
5 Patrick's
5 Berbertaal
5 Contratiempo
5 American
5 Quechua
5 Interlingua
5 XXX
5 advokaat
5 afweet
5 Monato
5 óf

No POS tags

wie
zo'n
ter
Yanni
Tatoeba
ten
degene
Mennad
mijne
mezelf
zulke
Ziri
uzelf
Kyoto
Skura
Al-Sayib
Yidir
godsnaam
der
ziens
Tokyo
Hokkaido
nature
uwe
Fuji
Algerijnse
vredesnaam
zulk
Mayuko
Ten
zo’n
Tom's
Tiziri
hijaab
Zo'n
Schots-Gaelisch
Narita
Zulke
Zamenhof
Muiriel
enz
Klingon
Yumi
harte
doorheen
lk
koste
Yamada
Taninna
Keiko
zegene
hare
Dr
St
Nagoya
Nahuatl
Volapük
São
hetgeen
Nijl
IJsberen
Sadako
onrechte
gestresst
ieders
Semitische
Neptunus
Tataars
and
Saint
Algerijns
Guarani
Buenos
Pona
Toki
American
Masao
Quechua
Indo-Arische
sneeuwig
COVID-19-vaccin
Esperantosprekers
Interlingua
advokaat
afweet
Toshio
Amhaars
journalistiekstudent
Lojban
dr
Occitaans
Monato
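
The two lists overlap heavily, so it may help to separate the words that only lack POS tags from those the spell checker also rejects. The sketch below does this with a set difference on short excerpts copied from the lists above; the function name `only_missing_pos_tags` is hypothetical, and the full lists would have to be loaded for a real comparison.

```python
# Excerpts copied from the two lists in this issue (not the full lists).
unknown_to_speller = ["Yanni", "Tatoeba", "Mennad", "godsnaam"]
no_pos_tags = ["wie", "zo'n", "ter", "Yanni", "Tatoeba", "ten", "degene", "Mennad"]


def only_missing_pos_tags(no_pos, unknown_spelling):
    """Return words that lack POS tags but are not reported by the speller.

    The order of the `no_pos` list is preserved.
    """
    unknown = set(unknown_spelling)
    return [word for word in no_pos if word not in unknown]


if __name__ == "__main__":
    print(only_missing_pos_tags(no_pos_tags, unknown_to_speller))
```

For this excerpt the result is wie, zo'n, ter, ten and degene. Assuming added.txt is the file that feeds the POS tagger, such words would presumably need an entry there only, not in spelling.txt.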
danielnaber added the "Dutch" label (Especially for Dutch) Aug 14, 2022
danielnaber changed the title from "[nl] add potentially missing words from Tatoeba to speller" to "[nl] add potentially missing words from Tatoeba to speller / POS tagger" Aug 14, 2022
ghost commented Oct 1, 2022

POS tags for some words are quite uncertain and hard to determine.
Many words in the list are not Dutch at all, or use an unofficial spelling. Check woordenlijst.org. Be careful with compound verbs like 'afweet'; those are very common mistakes.
