Problem with Arabic transliteration #7
Comments
Hello,
Here’s what I’m thinking:
I’ll take the file Jair created and comment on every entry that may contain an error, noting whether it is a mapping issue or a pointing (diacritization) issue.
To start with:
1- I have a strong feeling that the map used is not the best match for this dataset: in the examples provided by Ronald, the sun-letter rules are applied to the results, while the map used has no sun-letter rules.
2- I can see that the output contains the final-letter diacritic, which may or may not be omitted, depending on the sentence.
This is similar to how final letters in French are sometimes dropped in pronunciation. It can be handled in the maps with a rule that optionally omits the three main diacritics (fatha, damma, kasra) when they fall on the last letter of a word.
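The optional omission rule from point 2 could be prototyped outside the maps before it is encoded as a map rule. The following is only a minimal Python sketch under stated assumptions: the function name `strip_final_short_vowels` is hypothetical, words are assumed to be separated by whitespace, and only the three short-vowel marks (fatha U+064E, damma U+064F, kasra U+0650) are handled. It is not how Interscript maps implement rules.

```python
import re

# Unicode code points for the three short-vowel marks named above.
FATHA, DAMMA, KASRA = "\u064E", "\u064F", "\u0650"

# Matches one short vowel that is the last character of a word,
# i.e. followed by whitespace or by the end of the string.
_FINAL_SHORT_VOWEL = re.compile(f"[{FATHA}{DAMMA}{KASRA}](?=\\s|$)")


def strip_final_short_vowels(text: str) -> str:
    """Drop a word-final fatha, damma, or kasra from every word in `text`.

    Hypothetical helper: tanwin, sukun, shadda, and words followed
    directly by punctuation are deliberately left untouched.
    """
    return _FINAL_SHORT_VOWEL.sub("", text)


if __name__ == "__main__":
    word = "\u0643\u064E\u062A\u064E\u0628\u064E"  # kaf+fatha, ta+fatha, ba+fatha
    print(strip_final_short_vowels(word))
```

Running both the original and the stripped text through the same transliteration system would show whether the final diacritic accounts for a given mismatch.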
… On 1 Aug 2021, at 5:20 PM, gilgameshjw ***@***.***> wrote:
Analysis and Answers:
GNDB dataset?
I am using:
GNDBdataset/ara_Arab2Latn_ALA_1997.csv & GNDBdataset/ara_Arab2Latn_BGN_1956.csv
transliteration systems:
interscript ../GNDB/ara_ALA_1997_dia.txt --system=alalc-ara-Arab-Latn-1997 --output=../GNDB/ara_ALA_1997_dia_2_lat.txt
interscript ../GNDB/ara_BGN_1956_dia.txt --system=bgnpcgn-ara-Arab-Latn-1956 --output=../GNDB/ara_BGN_1
reproduce output:
We attach the Jupyter analysis (in .txt format, because .ipynb and .py attachments are not allowed), which contains some of the commands we ran; part of it is under /python in rababa.
We also added analysis files.
analysis_ALA_1997.csv
analysis_BGN_1956.csv
AnalysisGNDB.txt
@AhMohsen46, @ronaldtse mentioned on Skype (which you might not be able to access) that the GNDB diacritized data itself could be bad. Maybe we can investigate that?
@AhMohsen46 as mentioned by @gilgameshjw, the GNDB datasets may contain mis-tagged transliterations; e.g. most of the Arabic transliterations should actually be BGN/PCGN, but some may be mis-tagged as ALA-LC. Part of our work here is also to find a good way to detect mis-tagged transliterations. The new "detect" feature in Interscript should help.
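One simple way to flag possibly mis-tagged entries, independent of Interscript's "detect" feature and not a description of how that feature works, is to transliterate the Arabic name with each candidate system and check which output is closest to the romanization stored in the dataset. The sketch below assumes the per-system outputs have already been produced (for example with the interscript commands quoted above). The function name `closest_system` and the sample strings are hypothetical and are not taken from GNDB.

```python
from difflib import SequenceMatcher


def closest_system(reference: str, candidates: dict[str, str]) -> tuple[str, float]:
    """Return (system_name, similarity) for the output closest to `reference`.

    `candidates` maps a system name to that system's transliteration of
    the same Arabic source string. Similarity is difflib's ratio in [0, 1].
    """
    scored = {
        name: SequenceMatcher(None, reference.lower(), output.lower()).ratio()
        for name, output in candidates.items()
    }
    best = max(scored, key=scored.get)
    return best, scored[best]


if __name__ == "__main__":
    # Illustrative strings only: one output assimilates the article, one does not.
    outputs = {
        "alalc-ara-Arab-Latn-1997": "al-shams",
        "bgnpcgn-ara-Arab-Latn-1956": "ash-shams",
    }
    system, score = closest_system("ash-shams", outputs)
    print(system, round(score, 3))
```

An entry whose closest system differs from its tag would then be a candidate for manual review rather than an automatic correction, since diacritization errors can also lower the similarity.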
From @gilgameshjw's run using GNDB data.
Clearly there is some difference in certain entries: if you look at 91 and 93, the transliteration system is different.
@gilgameshjw can you help confirm:
Method to easily reproduce this output? 😉 Thanks!