Problem with Arabic transliteration #7

Open
ronaldtse opened this issue Aug 1, 2021 · 4 comments
@ronaldtse

From @gilgameshjw's run using GNDB data:

  • ara is the source
  • ara_diacri is the diacriticized Arabic produced with rababa
  • DEST_FULL_NAME_RO is the manual transliteration provided in GNDB
  • ara_latinised is the output of Interscript
ara ara_diacri ara_latinised DEST_FULL_NAME_RO index dist_edit dist_jaro_winkler
0 گرجان گرِجانَ grijna 0 girjān 0.666667
1 چم كورك چمَ كُوَرِكَ chma kūarika 1 cham kūrik 0.400000
2 وادي نوباندي وَادِي نُوبَانْدِي wādī nūbāndī 2 wādī nūbāndī 0.000000
3 وادي خازيانلي وَادِي خَازِيَانْلِيٍّ wādī khāziyānlīyin 3 wādī khāzyānlī 0.285714
4 وادي ام بطمة وَادِي امْ بُطْمَةَ wādī am buṭmata 4 wādī umm buţmah 0.333333
... ... ... ... ... ... ...
89 القباقب القَبَاقِبُ al-qabāqibu 89 al qabāqib 0.200000
90 العِقلة العَقْلَةِ al-‘aqlahi 90 al ‘iqlah 0.333333
91 الظهرور الظُّهْرُورُ al-ẓẓuhrūru 91 az̧ z̧ahrūr 0.636364
92 أم الدنانير أَمْ الدَّنَانِيرَ am al-ddanānīra 92 umm ad danānīr 0.428571
93 أرض الرجوم أَرْضِ الرُّجُومِ arḍi al-rrujūmi 93 arḑ ar rujūm 0.500000

Some entries clearly differ. In rows 91 and 93, for example, the transliteration system itself is different (al-ẓẓuhrūru vs. az̧ z̧ahrūr, and arḍi al-rrujūmi vs. arḑ ar rujūm).
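
For anyone re-checking the distance column: the values listed above look consistent with a Levenshtein distance divided by the length of the GNDB reference string (e.g. 4 edits / 6 characters for row 0). That is an inference from the numbers, not something confirmed in this thread. A minimal sketch under that assumption (the function names are made up for illustration):

```python
import unicodedata


def edit_distance(a, b):
    """Plain Levenshtein distance (insert, delete, substitute all cost 1)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def normalized_edit_distance(hypothesis, reference):
    """Edit distance divided by the reference length (assumed normalization)."""
    # NFC so that precomposed and decomposed diacritics compare equal.
    hyp = unicodedata.normalize("NFC", hypothesis)
    ref = unicodedata.normalize("NFC", reference)
    if not ref:
        return 0.0 if not hyp else 1.0
    return edit_distance(hyp, ref) / len(ref)


if __name__ == "__main__":
    # Row 0 and row 3 from the table above.
    print(normalized_edit_distance("grijna", "girjān"))
    print(normalized_edit_distance("wādī khāziyānlīyin", "wādī khāzyānlī"))
```

Note that this metric penalizes a pure difference in romanization convention (dot below vs. cedilla, hyphenated vs. assimilated article) as heavily as a real diacritization error, so high values alone do not say which side is wrong.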

@gilgameshjw can you help confirm:

  • which GNDB dataset are you using?
  • which transliteration system are you using?

Is there an easy way to reproduce this output? 😉 Thanks!

@gilgameshjw

Analysis and Answers:

  • GNDB dataset?
    I am using:
    GNDBdataset/ara_Arab2Latn_ALA_1997.csv and GNDBdataset/ara_Arab2Latn_BGN_1956.csv

  • transliteration systems:

interscript ../GNDB/ara_ALA_1997_dia.txt --system=alalc-ara-Arab-Latn-1997 --output=../GNDB/ara_ALA_1997_dia_2_lat.txt
interscript ../GNDB/ara_BGN_1956_dia.txt --system=bgnpcgn-ara-Arab-Latn-1956 --output=../GNDB/ara_BGN_1
  • reproduce output:
    We attach the Jupyter analysis (in .txt format, because .ipynb and .py attachments are not allowed), which contains some of the commands we ran, partly under /python in rababa.
    We also added the analysis files.

analysis_ALA_1997.csv
analysis_BGN_1956.csv
AnalysisGNDB.txt
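
To triage the attached analysis files, one option is to list the rows with the largest distance first. The sketch below assumes the CSVs carry the column names shown in the table header above (`ara_latinised`, `DEST_FULL_NAME_RO`, `dist_edit`); the actual layout of the attachments has not been checked here, and `worst_rows` is a hypothetical helper. It uses an in-memory sample built from rows 0, 2 and 3 so that it runs without the files.

```python
import csv
import io


def worst_rows(csv_file, column="dist_edit", top=5):
    """Return the `top` rows with the highest value in `column`."""
    scored = []
    for row in csv.DictReader(csv_file):
        try:
            score = float(row[column])
        except (KeyError, TypeError, ValueError):
            continue  # skip rows without a usable distance value
        scored.append((score, row))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [row for _, row in scored[:top]]


# Sample rows copied from the table in this issue; column layout is assumed.
SAMPLE = """ara_latinised,DEST_FULL_NAME_RO,dist_edit
grijna,girjān,0.666667
wādī nūbāndī,wādī nūbāndī,0.000000
wādī khāziyānlīyin,wādī khāzyānlī,0.285714
"""

if __name__ == "__main__":
    for row in worst_rows(io.StringIO(SAMPLE), top=2):
        print(row["dist_edit"], row["ara_latinised"], "vs", row["DEST_FULL_NAME_RO"])
```

With the real files, `open("analysis_ALA_1997.csv", newline="", encoding="utf-8")` would replace the `io.StringIO` sample, assuming the files are UTF-8 encoded.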

@AhMohsen46

AhMohsen46 commented Aug 1, 2021 via email

@gilgameshjw

@AhMohsen46: @ronaldtse mentioned on Skype (which you might not be able to access) that the GNDB diacritized data itself could be bad.

Maybe we can investigate that?

@ronaldtse

@AhMohsen46 as mentioned by @gilgameshjw, the GNDB datasets may contain mis-tagged transliterations. For example, most of the Arabic transliterations should actually be BGN/PCGN, but some may be mis-tagged as ALA-LC.

Part of our work here is also to find a good way to detect mis-tagged transliterations. The new "detect" feature in Interscript should help.
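
As a quick sanity check that is independent of Interscript's "detect" feature, the emphatic consonants already hint at the system: ALA-LC writes them with a dot below (ḍ, ṭ, ẓ), while BGN/PCGN uses a cedilla (ḑ, ţ, z̧), as rows 91 and 93 above show. A rough heuristic sketch follows; the function name and the returned labels are made up for illustration, and names without emphatic consonants cannot be classified this way.

```python
import unicodedata

COMBINING_DOT_BELOW = "\u0323"  # ALA-LC style: ḍ, ṭ, ẓ
COMBINING_CEDILLA = "\u0327"    # BGN/PCGN style: ḑ, ţ, z̧


def guess_romanization_system(text):
    """Guess the romanization system from the diacritics on consonants."""
    # NFD splits precomposed letters into base letter + combining mark.
    decomposed = unicodedata.normalize("NFD", text)
    dots = decomposed.count(COMBINING_DOT_BELOW)
    cedillas = decomposed.count(COMBINING_CEDILLA)
    if cedillas > dots:
        return "bgnpcgn"
    if dots > cedillas:
        return "alalc"
    return "unknown"  # no distinguishing marks, or a tie


if __name__ == "__main__":
    for name in ["arḑ ar rujūm", "arḍi al-rrujūmi", "wādī nūbāndī"]:
        print(name, "->", guess_romanization_system(name))
```

Comparing this guess against the system tag of the GNDB file a row came from would flag candidates for manual review; it is only a first filter, not a replacement for a proper detector.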
