[Orthography] Ulithian: many duplicate forms from two different sources #16
The problem with this outlier in the data (45 consonants!) is that it seems to have been compiled from two different sources using two different alphabets. It seems this can only be resolved manually. |
Should we exclude it for now from the data? |
Do we have two separate lists for it or are these sources combined into one set? |
It's all from a single source: https://abvd.shh.mpg.de/austronesian/language.php?id=1180. I suspect what has happened is that there is a mix of IPA and orthographic forms. I don't have this dictionary on hand, but Wikipedia says:
...so maybe forms with 'oe', 'ȯ', 'ae' could be removed, e.g. the first entry here:
It would be good not to exclude it (the more languages covered the better for the phylogeny). |
I see, so there are essentially double forms for every concept? In that case, it probably wouldn't take too long to just manually exclude one of the two, since the IPA forms will be pretty obvious. @LinguList do you think that would solve the issue? |
@maryewal, I just discussed a potential way to proceed with @antipodite, which consists in adding a new file What would this entail? It is in fact not very difficult, as I think:
@antipodite, would you be able to do a quick check of this very language to see how long this takes? If it takes no more than 20-30 minutes, I think it is worth it, and we could later even outsource this work to student assistants. |
Ah, just to add this: since we discussed this with @antipodite on another matter, it means that this is not the only case, so worth checking how well that workflow works. |
OK, on it. So placing the non-IPA forms in |
Not obvious to me which is which. @maryewal can you have a look at this? |
yep, not as obvious as I'd hoped without clear IPA. Simon is definitely right that one entry is probably orthography and the other some sort of pronunciation guide. This is because the data is based on a 2010 dictionary for students. It is partially online at https://www.yumpu.com/en/document/read/11736907/ulithian-english-dictionary-habele, where we can see that a pronunciation guide is the second entry, in parentheses. So, perhaps get rid of all values that correspond to the orth. entries in the dictionary. The full book does have a section on "orthography" and another on "spelling and pronunciation", but I can't get access to it. In many cases, we will probably be able to guess the right sound from what is written for pronunciation (e.g. ngal is probably ŋal), but I'm not totally comfortable making assumptions for the whole set... |
I'd say that even if you make wrong decisions, as long as you preserve
only one form, you enhance the data in many ways. There are obvious
markers of certain pronunciation distinctions, like two vowels oo or th
etc., so singling out these cases at first, then reordering, etc.,
should help to narrow this down.
|
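A minimal sketch of the "single out, then reorder" idea above, assuming the forms are available as plain strings. The marker list (doubled vowels plus `th`) and the names `MARKERS`, `has_marker` and `sort_candidates` are hypothetical, and the example forms in the usage note are made up for illustration rather than taken from the dictionary.

```python
import re

# Hypothetical marker list: doubled vowels and the digraph "th".
# It would need to be adapted after looking at the real data.
MARKERS = re.compile(r"aa|ee|ii|oo|uu|th")


def has_marker(form):
    """Return True if the form contains one of the assumed markers."""
    return bool(MARKERS.search(form.lower()))


def sort_candidates(forms):
    """Sort so that forms with a marker come first, alphabetically within each group."""
    return sorted(forms, key=lambda form: (not has_marker(form), form))
```

Calling `sort_candidates(["ŋal", "thool", "mook"])` would group the two marked forms at the top of the list, so that they can be reviewed together before the remaining forms.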
Noted, @LinguList - let's see how far we can get, then! @antipodite do you want to do a first removal of the "orthographic" forms, based on the dictionary? Meanwhile, I can come up with likely distinctions. |
OK, I filter |
Here it is. 1 in the "Orth." column means this is orthography, empty cell means either pronunciation guide or orth and pronunciation guide are the same. I cross checked with the dictionary. Note that some have multiple guide pronunciations |
So which is the one you'd retain? The orth?
|
Ah, @antipodite, in order to get this rolling, can you now post the words to REMOVE to a file
This TSV file would then serve as the basis to exclude entries from being listed as "normal" entries. |
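A rough sketch of how such a TSV file could be written, assuming four columns per row (language ID, concept ID, value, note). That column layout is an assumption based on the lookup discussed further down in this thread, and `write_ignore_tsv` as well as the example row are hypothetical.

```python
import csv
import io


def write_ignore_tsv(rows, handle):
    """Write (language_id, concept_id, value, note) rows as tab-separated lines."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    for language_id, concept_id, value, note in rows:
        writer.writerow([language_id, concept_id, value, note])


# Made-up row for illustration; real rows would come from the checked spreadsheet.
buffer = io.StringIO()
write_ignore_tsv([("Ulithian", "1", "form-b", "pronunciation guide")], buffer)
tsv_text = buffer.getvalue()
```

In practice the handle would be an open file in the dataset's `etc` directory rather than an in-memory buffer.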
@mattis: Done, check pull requests. I put the pronunciation guide forms in for now. I think these are the better option to ignore, as English words are often used as part of the pronunciation guide, which would probably screw with the orth profile algo. Regardless, I think the orthography profile will need quite a bit of manual correction, as the orthography of this language is somewhat quirky: d -> [θ] or [ð], e -> [i], some consonants seem to be written but not pronounced, etc. |
want me to have a go at plumbing in |
@antipodite, based on what you say, it seems sensible to remove the pronunciation guide forms. Do you still want me to have a look at this? |
So we insert an if-else check before this line: abvdoceanic/lexibank_abvdoceanic.py Lines 105 to 117 in 8fb24c8
@antipodite, before this line, you can check for the same value, language ID and concept ID:
    lid = slug(wl.language.name, lowercase=False)
    # dict.get takes a single key, so the three fields are passed as one tuple
    if ignored.get((lid, cid, entry.name)):
        continue
Before that, the lookup has to be defined, e.g.:
    ignored = {(row[0], row[1], row[2]): row[3] for row in self.etc_dir.read_csv("ignore.tsv", delimiter="\t")}
You have to play around with this, as I did not test it, but along those lines it should work, and I will gladly review this if you make another PR and assign me as a reviewer. |
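The same idea as a standalone sketch that can be tried outside the dataset script. It is untested against the real data and makes several assumptions: the TSV has the four columns language ID, concept ID, value and note; `load_ignored` and `keep_entries` are hypothetical helpers; and the entries are made-up tuples rather than the objects used in `lexibank_abvdoceanic.py`.

```python
import csv
import io


def load_ignored(handle):
    """Map (language_id, concept_id, value) to the note column of the TSV."""
    reader = csv.reader(handle, delimiter="\t")
    return {(row[0], row[1], row[2]): row[3] for row in reader if len(row) >= 4}


def keep_entries(entries, ignored):
    """Drop every (language_id, concept_id, value) entry listed in the lookup."""
    return [entry for entry in entries if tuple(entry[:3]) not in ignored]


# Made-up example data; the real file would live in the dataset's etc directory.
IGNORE_TSV = "Ulithian\t1\tform-b\tpronunciation guide\n"
ignored = load_ignored(io.StringIO(IGNORE_TSV))
entries = [("Ulithian", "1", "form-a"), ("Ulithian", "1", "form-b")]
kept = keep_entries(entries, ignored)
```

The sketch tests membership with `in` instead of relying on the truthiness of `ignored.get(...)`, so a row with an empty note column would still be skipped. Whether that matters depends on how the fourth column ends up being used.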
@maryewal I think we should just go ahead with removing the guide forms. No need to look at it again. @mattis cool, I'll have a look some time tomorrow. |
Can we close this? |
yup, we sorted this iirc |