Spoken languages #4
Comments
I think even in written Cantonese, there are still some similarities between formal written Chinese and how people from Hong Kong write. I'm not sure we need to fully distinguish them. I come from mainland China with a Cantonese-speaking background, and I also speak another dialect. I think most people on the mainland still write the same way. I do sometimes write in Cantonese, but it is happening less and less.
@pickfire I haven't differentiated between Simplified Chinese and Traditional Chinese or other dialects so far. The reason is that I could not find proper and large enough text corpora written in only one of those variants. That's why I used a mixed corpus instead and only added I might work on this in the future, but I cannot tell you exactly when yet. That's why I will close this issue for now. If you can point me to a good source offering corpora in Mandarin, Cantonese, etc., I will be happy to take a look. But I will only add those dialects if they have their own ISO 639-1 and ISO 639-3 codes. If not, they do not properly qualify as separate languages. If you are interested, feel free to make a pull request and add the missing dialects yourself. Thank you.
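For readers following along, here is a rough sketch of the criterion described above (a variety is only added if it has both an ISO 639-1 and an ISO 639-3 code). The names `IsoCodes`, `CODES`, and `qualifies` are hypothetical and are not part of this project's API. The table holds only a handful of ISO code assignments for illustration. Hokkien and Teochew are commonly classified under Min Nan, which is why they do not appear as separate entries.

```python
from typing import NamedTuple, Optional


class IsoCodes(NamedTuple):
    """Hypothetical container for the two ISO 639 codes of a language variety."""
    iso_639_1: Optional[str]  # two-letter code, None if not assigned
    iso_639_3: Optional[str]  # three-letter code, None if not assigned


# Illustrative subset of ISO 639 assignments; not taken from this project.
CODES = {
    "Chinese": IsoCodes("zh", "zho"),
    "Mandarin": IsoCodes(None, "cmn"),
    "Cantonese": IsoCodes(None, "yue"),
    "Min Nan": IsoCodes(None, "nan"),
}


def qualifies(name: str) -> bool:
    """Return True only if the variety has both an ISO 639-1 and an ISO 639-3 code."""
    codes = CODES.get(name)
    if codes is None:
        return False
    return codes.iso_639_1 is not None and codes.iso_639_3 is not None


if __name__ == "__main__":
    for name in CODES:
        print(f"{name}: {'qualifies' if qualifies(name) else 'does not qualify'}")
```

Under this reading of the rule, only the umbrella entry "Chinese" passes, since Mandarin, Cantonese, and Min Nan have ISO 639-3 codes but no ISO 639-1 codes of their own.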
Chinese is a written language, but it is spoken in dialects. For example, people write Chinese but speak Mandarin. Other (spoken) dialects exist as well, such as Cantonese, Hokkien, Teochew, and many more. Is there a way to detect them?