Spoken languages #4

Closed
pickfire opened this issue Nov 25, 2020 · 2 comments

pickfire commented Nov 25, 2020

Chinese is a written language, but it is spoken in dialects: for example, people write Chinese but speak Mandarin. Other spoken dialects exist as well, such as Cantonese, Hokkien, Teochew and many more. Is there a way to detect them?

Contributor

lhr0909 commented Dec 9, 2020

> Chinese is a written language but it is spoken in dialects, for example people write Chinese but speak Mandarin. Other (spoken) dialects exists as well, such as Cantonese, Hokien, Teochew and a bunch of them, is there a way to detect them?

I think even written Cantonese still shares some similarities with formal written Chinese in how people from Hong Kong write. I'm not sure we need to fully distinguish the two.

I come from mainland China with a Cantonese-speaking background, and I also speak another dialect. I think most people in the mainland still write the same way. I do sometimes write in Cantonese, but that is happening less and less.

@pemistahl
Owner

@pickfire So far I have not differentiated between Simplified Chinese and Traditional Chinese, or between the spoken dialects. The reason is that I could not find suitable, sufficiently large text corpora written in only one of those variants. That's why I used a mixed corpus instead and added only CHINESE as a language, without any further differentiation.

I might work on this in the future, but I cannot yet tell you exactly when. That's why I will close this issue for now.

If you can point me to a good source offering corpora in Mandarin, Cantonese etc., I will be happy to take a look. But I will only add those dialects if they have their own ISO 639-1 and ISO 639-3 codes. If not, they do not qualify as separate languages.

If you are up for it, feel free to open a pull request and add the missing dialects yourself. Thank you.
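
For illustration only, the ISO-code criterion described above could be sketched roughly as follows. This is not the library's API: `Variety`, `VARIETIES`, `qualifies` and `qualifying_names` are hypothetical names made up for this example, and the small table is hand-written from the ISO 639 code lists as I understand them (Hokkien and Teochew are covered there by the Min Nan code), so it should be checked against the official registry before relying on it.

```python
from typing import List, NamedTuple, Optional


class Variety(NamedTuple):
    """A language variety with its ISO codes (hypothetical helper type)."""
    name: str
    iso_639_1: Optional[str]  # two-letter code, or None if none is assigned
    iso_639_3: Optional[str]  # three-letter code, or None if none is assigned


# Hand-written illustration table, not data taken from the library.
VARIETIES = [
    Variety("Chinese", "zh", "zho"),
    Variety("Mandarin", None, "cmn"),
    Variety("Cantonese", None, "yue"),
    Variety("Min Nan (includes Hokkien and Teochew)", None, "nan"),
]


def qualifies(variety: Variety) -> bool:
    """Return True only if the variety has both an ISO 639-1 and an ISO 639-3 code."""
    return variety.iso_639_1 is not None and variety.iso_639_3 is not None


def qualifying_names(varieties: List[Variety]) -> List[str]:
    """Names of all varieties that pass the two-code criterion."""
    return [v.name for v in varieties if qualifies(v)]


if __name__ == "__main__":
    for v in VARIETIES:
        status = "qualifies" if qualifies(v) else "does not qualify"
        print(f"{v.name}: {status}")
```

Under these assumptions, only the umbrella entry for Chinese carries both codes; the individual spoken varieties have a three-letter code but no two-letter one, so a contribution adding them would need that requirement to be discussed first.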
