Add (Modern Standard) Arabic language #3

Open
behnam opened this issue Oct 23, 2017 · 9 comments

Comments

@behnam
Collaborator

behnam commented Oct 23, 2017

Is there any work being done regarding any Arabic dialects?

We can start with http://www.dw.com/ar/, which is Modern Standard Arabic. I think MSA is a good start, and we can add regional dialects later.

Please list here any source you think we should add, for MSA or regional dialects.

@brawer
Collaborator

brawer commented Oct 24, 2017

Are these Modern Standard Arabic, too? Adding them would be a matter of 1 line each.
http://www.bbc.com/arabic
https://arabic.sputniknews.com/

@brawer
Collaborator

brawer commented Oct 24, 2017

Seeds for crawling a language corpus in Moroccan Arabic (BCP47 language code ary):

@brawer
Collaborator

brawer commented Oct 24, 2017

For Algerian Arabic (BCP47 language code arq), see http://www.onlinenewspapers.com/algeria.htm, but I wouldn’t know whether any of these are in Standard Arabic.

@behnam
Collaborator Author

behnam commented Oct 24, 2017

Yes, @brawer. These two are definitely both Standard Arabic:
http://www.bbc.com/arabic
https://arabic.sputniknews.com/

About the country-specific news services, I can't tell whether they are in local varieties of Standard Arabic or in regional Arabic. So I think we need to ask for help reviewing them one by one.

@behnam
Collaborator Author

behnam commented Oct 24, 2017

Actually, ar/ara is the macrolanguage, and we had better use arb for Standard Arabic. That's not what websites do, but I think it's safe to assume that the content here is arb. What do you think?

@brawer
Collaborator

brawer commented Oct 25, 2017

So far, I’ve tried to follow the BCP47 language tags as per Unicode conventions. There, macrolanguage codes stand for the individual language that “everyone” (a typical webmaster or programmer who isn’t deeply rooted in the internationalization scene) means when they see that code. For example, according to Unicode/ICU/CLDR, the code for Estonian is et instead of ekk; the code for Modern Standard Arabic is ar instead of arb; the code for Uzbek is uz instead of uzn; the code for Mandarin is zh; etc. For the full list, see the languageAlias data in CLDR.
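
To make the convention concrete, here is a minimal sketch of how such aliases could be applied to a tag. It is only an illustration: the table holds just the pairs mentioned above (plus cmn for Mandarin), and the names LANGUAGE_ALIASES and canonical_language are hypothetical, not part of this project or of ICU. A real implementation would load the complete languageAlias data from CLDR.

```python
# Hypothetical alias table with only the pairs discussed in this thread.
# The authoritative list is the languageAlias data in CLDR.
LANGUAGE_ALIASES = {
    "ekk": "et",  # Standard Estonian -> Estonian
    "arb": "ar",  # Modern Standard Arabic -> Arabic
    "uzn": "uz",  # Northern Uzbek -> Uzbek
    "cmn": "zh",  # Mandarin -> Chinese
}


def canonical_language(tag: str) -> str:
    """Replace an aliased primary language subtag; keep the other subtags."""
    subtags = tag.replace("_", "-").split("-")
    primary = subtags[0].lower()
    subtags[0] = LANGUAGE_ALIASES.get(primary, primary)
    return "-".join(subtags)


if __name__ == "__main__":
    for tag in ("arb", "ekk", "uzn-Latn", "ary", "arq"):
        print(tag, "->", canonical_language(tag))
```

Codes for regional varieties such as ary and arq have no entry in this sketch's table, so they pass through unchanged.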

@behnam
Collaborator Author

behnam commented Oct 25, 2017

Cool! Yeah, that's what I thought was happening here, but I wasn't sure.

About the other links, which are hopefully regional Arabic sources, I'll send an update as soon as I get more info.

@brawer
Collaborator

brawer commented Oct 25, 2017

Regarding ary: According to an Arabic speaker, https://www.hespress.com/ (sitemap) might be a source for building a language corpus in Moroccan Arabic. My contact said that the Moroccan newspapers listed earlier on this bug are in Modern Standard Arabic, whereas some (but not all) comments on these sites are in Moroccan dialect.

@khaledhosny

khaledhosny commented Oct 28, 2017

hespress.com is definitely MSA (including most comments on the random articles I checked). Actually, you are unlikely to find any newspapers in local dialects; your best bet would be forums and the like, which are considered less “formal”.


3 participants