Problem with Arabic transliteration #7

ronaldtse · 2021-08-01T04:21:06Z

From @gilgameshjw's run using GNDB data.

ara is the source
ara_diacri is the diacriticized Arabic produced with rababa
DEST_FULL_NAME_RO is the manual transliteration provided in GNDB
ara_latinised is the output of Interscript

ara	ara_diacri	ara_latinised	DEST_FULL_NAME_RO	index	dist_edit	dist_jaro_winkler
0	گرجان	گرِجانَ	grijna	0	girjān	0.666667
1	چم كورك	چمَ كُوَرِكَ	chma kūarika	1	cham kūrik	0.400000
2	وادي نوباندي	وَادِي نُوبَانْدِي	wādī nūbāndī	2	wādī nūbāndī	0.000000
3	وادي خازيانلي	وَادِي خَازِيَانْلِيٍّ	wādī khāziyānlīyin	3	wādī khāzyānlī	0.285714
4	وادي ام بطمة	وَادِي امْ بُطْمَةَ	wādī am buṭmata	4	wādī umm buţmah	0.333333
...	...	...	...	...	...	...
89	القباقب	القَبَاقِبُ	al-qabāqibu	89	al qabāqib	0.200000
90	العِقلة	العَقْلَةِ	al-‘aqlahi	90	al ‘iqlah	0.333333
91	الظهرور	الظُّهْرُورُ	al-ẓẓuhrūru	91	az̧ z̧ahrūr	0.636364
92	أم الدنانير	أَمْ الدَّنَانِيرَ	am al-ddanānīra	92	umm ad danānīr	0.428571
93	أرض الرجوم	أَرْضِ الرُّجُومِ	arḍi al-rrujūmi	93	arḑ ar rujūm	0.500000

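Clearly, some entries differ. In entries 91 and 93, for example, the transliteration system appears to be different: the GNDB reference assimilates the article before sun letters (az̧ z̧ahrūr, ar rujūm), while the Interscript output keeps al- (al-ẓẓuhrūru, al-rrujūmi).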
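For anyone checking the numbers: the distance values shown in the table are consistent with a Levenshtein edit distance divided by the length of the GNDB reference string (for example, 4 edits over 6 characters gives 0.666667 for grijna vs girjān). The thread does not say how the column was actually computed, so the following is only a minimal sketch under that assumption. The function names are hypothetical, and the NFC normalization step is an added assumption, not something confirmed by the analysis files.

```python
import unicodedata


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance over Unicode code points."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]


def normalized_edit_distance(candidate: str, reference: str) -> float:
    """Edit distance divided by the reference length (assumed normalization)."""
    # Assumption: compare NFC-normalized strings so that precomposed and
    # decomposed spellings of the same letter are not counted as edits.
    candidate = unicodedata.normalize("NFC", candidate)
    reference = unicodedata.normalize("NFC", reference)
    if not reference:
        # Convention chosen for this sketch only.
        return 0.0 if not candidate else 1.0
    return levenshtein(candidate, reference) / len(reference)


# Entry 0 of the table: Interscript output vs GNDB reference.
print(round(normalized_edit_distance("grijna", "girjān"), 6))
```

Dividing by the reference length rather than the longer of the two strings is a guess that happens to match the rows shown; a different normalization would need to be checked against the attached analysis files.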

@gilgameshjw can you help confirm:

which GNDB dataset are you using?
which transliteration system are you using?

Method to easily reproduce this output? 😉 Thanks!

gilgameshjw · 2021-08-01T15:20:03Z

Analysis and Answers:

GNDB dataset?
I am using:
GNDBdataset/ara_Arab2Latn_ALA_1997.csv and GNDBdataset/ara_Arab2Latn_BGN_1956.csv
Transliteration systems:

interscript ../GNDB/ara_ALA_1997_dia.txt --system=alalc-ara-Arab-Latn-1997 --output=../GNDB/ara_ALA_1997_dia_2_lat.txt
interscript ../GNDB/ara_BGN_1956_dia.txt --system=bgnpcgn-ara-Arab-Latn-1956 --output=../GNDB/ara_BGN_1

Reproduce output:
We attach the Jupyter analysis (in .txt format, because .ipynb and .py attachments are not allowed), which contains some of the commands we ran, in part under /python in rababa.
We also added the analysis files.

analysis_ALA_1997.csv
analysis_BGN_1956.csv
AnalysisGNDB.txt

AhMohsen46 · 2021-08-01T16:53:41Z

Hello,

Here's what I'm thinking: I'll take the file Jair created and comment on every entry that may contain an error, noting whether it's a mapping issue or a pointing (diacritization) issue.

To start with:

1. I have a strong feeling that the map used is not the best match for this dataset. In the examples Ronald provided, the sun-letter rules are applied in the results, while the map used doesn't have sun-letter rules.
2. The output contains the diacritic on the final letter, which may or may not be omitted depending on the sentence, much like how final letters in French are sometimes dropped in pronunciation. This can be handled in the maps with a rule that optionally omits the three main diacritics (fatha, damma, kasra) when they fall on the last letter of a word.
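The second point, dropping a word-final short vowel, could be prototyped outside the maps before changing any map rules. The snippet below is only a rough sketch: it preprocesses the diacritized Arabic text rather than editing an Interscript map, the function name is hypothetical, and it deliberately handles only fatha, damma and kasra (not tanwin, shadda or sukun).

```python
# Unicode code points for the three main short-vowel diacritics.
FATHA, DAMMA, KASRA = "\u064E", "\u064F", "\u0650"
FINAL_SHORT_VOWELS = {FATHA, DAMMA, KASRA}


def strip_final_short_vowel(text: str) -> str:
    """Remove a fatha, damma or kasra that ends a space-separated word.

    Hypothetical helper: other marks (tanwin, shadda, sukun) are left alone.
    """
    words = text.split(" ")
    stripped = [
        word[:-1] if word and word[-1] in FINAL_SHORT_VOWELS else word
        for word in words
    ]
    return " ".join(stripped)


# Entry 89 of the table ends in a damma, which is removed here.
print(strip_final_short_vowel("القَبَاقِبُ"))
```

Whether the vowel should be dropped still depends on context, as noted above, so a real rule would need to be optional rather than applied to every word.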


gilgameshjw · 2021-08-01T17:02:09Z

@AhMohsen46, @ronaldtse mentioned on Skype (which you might not be able to access) that the GNDB diacritized data itself could be bad.

Maybe we can investigate that?

ronaldtse · 2021-08-02T02:13:36Z

@AhMohsen46, as mentioned by @gilgameshjw, the GNDB datasets may contain mis-tagged transliterations. For example, most of the Arabic transliterations should actually be BGN/PCGN, but some may be mis-tagged as ALA-LC.

Part of our work here is also to find a good way to detect mis-tagged transliterations. The new "detect" feature in Interscript should help.
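Independently of Interscript's own detect feature (whose interface is not shown in this thread), one simple way to flag possibly mis-tagged rows is to transliterate the same source with each candidate system and see which output is closest to the GNDB reference. The sketch below assumes the candidate outputs have already been produced. It uses `difflib.SequenceMatcher` as a stand-in similarity measure, and the function name, system labels and sample strings are all hypothetical placeholders, not real Interscript output.

```python
from difflib import SequenceMatcher


def closest_system(reference: str, candidates: dict[str, str]) -> tuple[str, float]:
    """Return the system whose output is most similar to the reference.

    `candidates` maps a system name to that system's transliteration of the
    same source string. Similarity is difflib's ratio in the range [0, 1].
    """
    scores = {
        name: SequenceMatcher(None, output, reference).ratio()
        for name, output in candidates.items()
    }
    best = max(scores, key=scores.get)
    return best, scores[best]


# Placeholder strings for illustration only.
system, score = closest_system(
    "ar rujum",
    {"system-a": "ar rujum", "system-b": "al-rrujumi"},
)
print(system, score)
```

A row whose tagged system is not the closest one would be a candidate for manual review rather than proof of a mis-tag, since diacritization errors also move these scores.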

ronaldtse assigned AhMohsen46 and gilgameshjw Aug 1, 2021
