[Orthography] Ulithian: many duplicate forms from two different sources #16

LinguList · 2021-08-25T11:49:46Z

LinguList · 2021-08-25T11:50:30Z

The problem with this outlier in the data (45 consonants!) is that it seems to have been compiled from two different sources using two different alphabets. It seems this can only be resolved manually.

LinguList · 2021-08-25T11:50:51Z

Should we exclude it for now from the data?

maryewal · 2021-08-25T12:23:13Z

Do we have two separate lists for it or are these sources combined into one set?
If there is just one set, it is probably okay to simply exclude it - we have decent coverage of other Micronesian languages and, as far as I can tell, this one is not super unique phonologically.

SimonGreenhill · 2021-08-25T22:11:02Z

It's all from a single source: https://abvd.shh.mpg.de/austronesian/language.php?id=1180. I suspect what has happened is that there is a mix of IPA and orthographic forms. I don't have this dictionary on hand, but Wikipedia says:

Ulithian has eight vowels which is a large amount for a Pacific language. They are /i/, /u/, /e/, /ə/, /ɔ/, /æ/, /ɐ/, /a/. They are spelled i, u, e, oe or ȯ, o, ae or ė, oa or a, a or ȧ.

...so maybe forms with 'oe', 'ȯ', 'ae' could be removed, e.g. the first entry here:

48 | to sleep | maesoer |   | 1 |  
48 | to sleep | mawsur |   | 1 |

It would be good not to exclude it (the more languages covered the better for the phylogeny).
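
A minimal sketch of the heuristic suggested above: flag forms that contain one of the spellings the Wikipedia quote lists as orthographic. The marker list and the name `looks_orthographic` are assumptions made for illustration, not part of ABVD or the dataset code, and plain 'a' and 'o' are left out because they are ambiguous between the two systems.

```python
import unicodedata

# Spellings quoted from Wikipedia; treating them as a sign of an
# orthographic form is an assumption, not a verified rule.
ORTH_MARKERS = tuple(
    unicodedata.normalize("NFC", marker)
    for marker in ("oe", "ȯ", "ae", "ė", "oa", "ȧ")
)


def looks_orthographic(form: str) -> bool:
    """Return True if the form contains one of the orthographic markers."""
    # Normalise to NFC so that a base letter plus a combining dot
    # compares equal to the precomposed character.
    normalised = unicodedata.normalize("NFC", form.lower())
    return any(marker in normalised for marker in ORTH_MARKERS)


forms = ["maesoer", "mawsur"]
flagged = [form for form in forms if looks_orthographic(form)]
print(flagged)
```

Forms flagged this way would still need a manual check, since the markers only cover the vowel spellings.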

maryewal · 2021-08-26T11:28:08Z

I see, so there are essentially double forms for every concept? In that case, it probably wouldn't take too long to just manually exclude one of the two, since the IPA forms will be pretty obvious. @LinguList do you think that would solve the issue?

LinguList · 2021-08-26T11:32:27Z

@maryewal, I just discussed a potential way to proceed with @antipodite, which consists in adding a new file etc/ignore.tsv, in which we list the language (by ID), and the forms with their original VALUE to be excluded. We can then blacklist the entries.

What would this entail? It is in fact not very difficult, I think:

open cldf/forms.csv in Excel or LibreOffice
copy-paste only the forms for Ulithian
extract only the two columns Value and Language_ID
go over the data manually and remove from this list the word forms we want to retain (the good ones), so that only the forms to be excluded remain
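
The first three steps could also be scripted rather than done in a spreadsheet. This is only a sketch under assumptions: the column names follow the ones mentioned above, the sample rows (including the `Parameter_ID` values and the placeholder form `xxx`) are made up, and `candidate_rows` is a hypothetical helper.

```python
import csv
import io

# Stand-in for cldf/forms.csv; the real file has more columns and rows.
SAMPLE = """ID,Language_ID,Parameter_ID,Value
1,Ulithian,sleep,maesoer
2,Ulithian,sleep,mawsur
3,Woleaian,sleep,xxx
"""


def candidate_rows(handle, language_id):
    """Yield (Language_ID, Value) pairs for a single language."""
    for row in csv.DictReader(handle):
        if row["Language_ID"] == language_id:
            yield row["Language_ID"], row["Value"]


candidates = list(candidate_rows(io.StringIO(SAMPLE), "Ulithian"))
for language_id, value in candidates:
    print(language_id, value, sep="\t")
```

The manual step stays manual: someone still has to decide which of the listed values are the good ones.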

@antipodite, would you be able to do a quick check of this very language to see how long this takes? If it takes no more than 20-30 minutes, I think it is worth it, and we could later even outsource this work to student assistants.

LinguList · 2021-08-26T11:33:28Z

Ah, just to add this: since we discussed this with @antipodite on another matter, it means that this is not the only case, so worth checking how well that workflow works.

antipodite · 2021-08-26T12:00:40Z

OK, on it. So placing the non-IPA forms in ignore.tsv as we discussed

antipodite · 2021-08-26T12:05:08Z

Not obvious to me which is which. @maryewal can you have a look at this?

maryewal · 2021-08-26T12:40:47Z

yep, not as obvious as I'd hoped without clear IPA. Simon is definitely right that one entry is probably orthography and the other some sort of pronunciation guide. This is because the data is based on a 2010 dictionary for students. It is partially online https://www.yumpu.com/en/document/read/11736907/ulithian-english-dictionary-habele, where we can see a pronunciation guide is the second entry in parentheses. So, perhaps get rid of all values that correspond to the orth. entries in the dictionary.

The full book does have a section on "orthography" and another on "spelling and pronunciation", but I can't find access to it. In many cases, we will probably be able to guess the right sound from what is written for pronunciation (e.g. ngal is probably ŋal), but I'm not totally comfortable making assumptions for the whole set...

LinguList · 2021-08-27T10:54:11Z

I'd say that even if you make wrong decisions, as long as you preserve only one form, you enhance the data in many ways. There are obvious markers of certain pronunciation distinctions, like two vowels (oo) or th etc., so singling out these cases first, then reordering, etc., should help to narrow this down.

maryewal · 2021-08-27T11:19:00Z

Noted, @LinguList - let's see how far we can get, then! @antipodite do you want to do a first removal of the "orthographic" forms, based on the dictionary? Meanwhile, I can come up with likely distinctions.

antipodite · 2021-08-27T12:13:30Z

OK, I filter forms.csv to Ulithian and then sort ascending by ID. Now it looks like we have pairs (mostly, some triples) where the first element is the orthographic form and the second is the pronunciation guide. I will shortly attach a modified Ulithian spreadsheet with my judgement of orth. vs pronunciation guide forms marked in a new column, so you can check them too, @maryewal. Then I can just filter out the ones we don't want and put them in ignore.tsv.
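
A rough sketch of that pairing step, assuming that after sorting by ID the first form per concept is the orthographic one and any further forms are pronunciation guides. The IDs and concept labels are made up, and `split_by_concept` is a hypothetical helper, not something in the repository.

```python
# (ID, concept, Value) triples; IDs and concept labels are illustrative.
rows = [
    (2, "sleep", "mawsur"),
    (1, "sleep", "maesoer"),
    (3, "hand", "gumchiu"),
]


def split_by_concept(rows):
    """Group values per concept in ID order: first is orth, rest are guides."""
    grouped = {}
    for _, concept, value in sorted(rows, key=lambda row: row[0]):
        grouped.setdefault(concept, []).append(value)
    return {
        concept: {"orth": values[0], "guide": values[1:]}
        for concept, values in grouped.items()
    }


pairs = split_by_concept(rows)
print(pairs)
```

A concept with a single form ends up with an empty guide list, which matches the case where orthography and pronunciation guide coincide. The positional assumption would still need the manual cross-check against the dictionary.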

antipodite · 2021-08-27T13:00:39Z

Here it is. 1 in the "Orth." column means this is orthography, empty cell means either pronunciation guide or orth and pronunciation guide are the same. I cross checked with the dictionary. Note that some have multiple guide pronunciations

ulithian-orthography-judgments.csv

LinguList · 2021-08-27T13:20:52Z

So which is the one you'd retain? The orth?

LinguList · 2021-08-28T10:13:38Z

Ah, @antipodite, in order to get this rolling, can you now post the words to REMOVE to a file etc/ignore.tsv, where you give me three values plus a comment, as discussed before, e.g., for a form paththba (which I just invented)

Language_ID	Parameter_Name	Value	Comment
Ulithian	hand	gumchiu	duplicate

This TSV file would then serve as the basis to exclude entries from being listed as "normal" entries.
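
For illustration, a file in this layout could be written with the standard `csv` module. The header matches the example above; the function name `write_ignore_tsv` and the in-memory buffer are assumptions, and a real script would open etc/ignore.tsv instead.

```python
import csv
import io

HEADER = ["Language_ID", "Parameter_Name", "Value", "Comment"]


def write_ignore_tsv(handle, entries):
    """Write (language, concept, value, comment) tuples as tab-separated rows."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(HEADER)
    for language_id, concept, value, comment in entries:
        writer.writerow([language_id, concept, value, comment])


# An in-memory buffer stands in for open("etc/ignore.tsv", "w", newline="").
buffer = io.StringIO()
write_ignore_tsv(buffer, [("Ulithian", "hand", "gumchiu", "duplicate")])
print(buffer.getvalue(), end="")
```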

antipodite · 2021-08-28T10:29:43Z

@mattis: Done, check pull requests. I put the pronunciation guide forms in for now. I think these are the better option to ignore, as English words are often used as part of the pronunciation guide, which would probably screw with the orth profile algo. Regardless, I think the orthography profile will need quite a bit of manual correction, as the orthography of this language is somewhat quirky: d -> [θ] or [ð], e -> [i], some consonants seem to be written but not pronounced, etc.
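
To make the idea of a manually corrected profile concrete, here is a toy grapheme-to-sound mapping with greedy longest-match segmentation. It is a plain-Python sketch, not the orthography profile machinery the dataset uses. The mapping is an assumption: it picks [θ] for d although the sound may also be [ð], and it adds ng -> ŋ from the guess made earlier in this thread.

```python
# Toy profile; every entry is an assumption taken from guesses in this thread.
PROFILE = {"ng": "ŋ", "d": "θ", "e": "i"}


def segment(form, profile=PROFILE):
    """Split a form into sounds, preferring the longest matching grapheme."""
    graphemes = sorted(profile, key=len, reverse=True)
    sounds = []
    position = 0
    while position < len(form):
        for grapheme in graphemes:
            if form.startswith(grapheme, position):
                sounds.append(profile[grapheme])
                position += len(grapheme)
                break
        else:
            # Unknown characters are passed through unchanged.
            sounds.append(form[position])
            position += 1
    return sounds


print(segment("ngal"))
```

Silent consonants and context-dependent readings are not handled here; those are the cases that would need the manual correction mentioned above.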

antipodite · 2021-08-28T10:36:39Z

want me to have a go at plumbing in ignore.tsv? Seems like you would just filter against ignore.tsv in the def cmd_makecldf(self, args): fn definition in lexibank_abvdoceanic.py

maryewal · 2021-08-28T17:04:40Z

@antipodite, based on what you say, it seems sensible to remove the pronunciation guide forms. Do you still want me to have a look at this?

LinguList · 2021-08-28T17:48:52Z

So we insert an if-else check before this line:

abvdoceanic/lexibank_abvdoceanic.py

Lines 105 to 117 in 8fb24c8

    
    try:
        lex = args.writer.add_forms_from_value(
            Local_ID=entry.id,
            Language_ID=slug(wl.language.name, lowercase=False),
            Parameter_ID=cid,
            Value=entry.name,
            # set source to entry-level sources if they exist, otherwise use
            # the language level source.
            #Source=[entry.source] if entry.source else source,
            Cognacy=entry.cognacy,
            Comment=entry.comment or '',
            Loan=True if entry.loan and len(entry.loan) else False,
        )

@antipodite, before this line, you can check for same value, language ID and concept id:

lid = slug(wl.language.name, lowercase=False)
# the keys of `ignored` are (language ID, concept, value) tuples
if (lid, cid, entry.name) in ignored:
    continue

Before that, e.g., right after def cmd_makecldf, you load the ignored list:

ignored = {(row[0], row[1], row[2]): row[3] for row in self.etc_dir.read_csv("ignore.tsv", delimiter="\t")}

You have to play around with this, as I did not test it, but along those lines it should work, and I will gladly review this if you make another PR and assign me as a reviewer.
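
A self-contained sketch of the same idea outside of pylexibank, for trying out the logic in isolation. It reads the TSV with `csv.DictReader`, so the header row is not stored as an entry, and it keys on the concept name as written in ignore.tsv; whether the real check should use the concept name or the concept ID (`cid`) is left open here. `load_ignored` and the sample entries are hypothetical.

```python
import csv
import io

# Stand-in for etc/ignore.tsv.
IGNORE_TSV = (
    "Language_ID\tParameter_Name\tValue\tComment\n"
    "Ulithian\thand\tgumchiu\tduplicate\n"
)


def load_ignored(handle):
    """Map (language, concept, value) tuples to their comment."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {
        (row["Language_ID"], row["Parameter_Name"], row["Value"]): row["Comment"]
        for row in reader
    }


ignored = load_ignored(io.StringIO(IGNORE_TSV))

# Made-up entries standing in for the forms seen while building the CLDF data.
entries = [
    ("Ulithian", "hand", "gumchiu"),
    ("Ulithian", "hand", "paththba"),
]
kept = [entry for entry in entries if entry not in ignored]
print(kept)
```

Matching on the exact original value means that any whitespace or encoding difference between ignore.tsv and the source data would make an entry slip through, so a count of matched entries is a useful sanity check.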

antipodite · 2021-08-28T19:01:05Z

@maryewal I think we should just go ahead with removing the guide forms. No need to look at it again. @mattis cool, I'll have a look some time tomorrow.
I guess it would be worth checking the generated profiles for other Micronesian languages too, as I recall Pohnpeian, Woleaian etc. have similarly quirky orthographies.

maryewal · 2021-11-23T14:30:04Z

Can we close this?

antipodite · 2021-11-23T15:47:10Z

yup, we sorted this iirc

LinguList added the transcription label Aug 25, 2021

maryewal self-assigned this Aug 27, 2021

maryewal removed their assignment Aug 28, 2021

antipodite closed this as completed Nov 23, 2021

maryewal mentioned this issue Jun 10, 2022

Should long vowels be counted for Xaracuu / xara1243 (and other New Caledonian languages) #43

Closed
