[Orthography] Ulithian: many duplicate forms from two different sources #16

Closed
LinguList opened this issue Aug 25, 2021 · 23 comments

Comments

@LinguList
Contributor

Screenshot 2021-08-25 at 13-51-26 EDICTOR

@LinguList
Contributor Author

The problem with this outlier in the data (45 consonants!) is that it seems to have been composed from two different sources using two different alphabets. It seems this can only be resolved manually.

@LinguList
Contributor Author

Should we exclude it for now from the data?

@maryewal
Collaborator

maryewal commented Aug 25, 2021

Do we have two separate lists for it or are these sources combined into one set?
If there is just one set, it probably is okay to simply exclude - we have decent coverage of other Micronesian languages and, as far as I can tell, this one is not super unique phonologically.

@SimonGreenhill
Collaborator

It's all from a single source: https://abvd.shh.mpg.de/austronesian/language.php?id=1180. I suspect what has happened is that there is a mix of IPA and orthographic forms. I don't have this dictionary on hand, but Wikipedia says:

Ulithian has eight vowels which is a large amount for a Pacific language. They are /i/, /u/, /e/, /ə/, /ɔ/, /æ/, /ɐ/, /a/. They are spelled i, u, e, oe or ȯ, o, ae or ė, oa or a, a or ȧ. 

...so maybe forms with 'oe', 'ȯ', 'ae' could be removed, e.g. the first entry here:

48 | to sleep | maesoer |   | 1 |  
48 | to sleep | mawsur |   | 1 |  
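
As a rough sketch of that idea, the snippet below flags forms that contain one of the spellings 'oe', 'ȯ' or 'ae'. The function name `looks_orthographic` and the marker list are assumptions made for illustration; they are not part of the dataset code, and any hit would still need a manual check.

```python
import unicodedata

# Spellings taken from the Wikipedia quote above. Treating them as markers
# of orthographic forms is an assumption, not a tested rule.
ORTHOGRAPHIC_MARKERS = tuple(
    unicodedata.normalize("NFC", marker) for marker in ("oe", "ȯ", "ae")
)


def looks_orthographic(form):
    """Return True if the form contains one of the marker spellings."""
    # Normalize so a precomposed "ȯ" and "o" + combining dot compare equal.
    normalized = unicodedata.normalize("NFC", form.lower())
    return any(marker in normalized for marker in ORTHOGRAPHIC_MARKERS)


forms = ["maesoer", "mawsur"]
flagged = [form for form in forms if looks_orthographic(form)]
print(flagged)
```

For the two "to sleep" entries this flags only the first one, which matches the suggestion above.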

It would be good not to exclude it (the more languages covered the better for the phylogeny).

@maryewal
Collaborator

I see, so there are essentially double forms for every concept? In that case, it probably wouldn't take too long to just manually exclude one of the two, since the IPA forms will be pretty obvious. @LinguList do you think that would solve the issue?

@LinguList
Contributor Author

@maryewal, I just discussed a potential way to proceed with @antipodite, which consists in adding a new file etc/ignore.tsv, in which we list the language (by ID), and the forms with their original VALUE to be excluded. We can then blacklist the entries.

What would this entail? I think it is in fact not very difficult:

  1. open cldf/forms.csv in Excel or LibreOffice
  2. copy-paste only the forms for Ulithian
  3. extract only the two columns with Value and Language_ID
  4. quickly go over the data manually and remove from the list those word forms which we want to retain (the good ones), so that only the forms to ignore remain
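
Steps 1 to 3 could also be done with a short script instead of a spreadsheet. This is a minimal sketch using only the standard library; the helper name `extract_language_rows` is hypothetical, and the sample table stands in for the real cldf/forms.csv (its IDs and the non-Ulithian row are made up).

```python
import csv
import io


def extract_language_rows(handle, language_id):
    """Yield (Language_ID, Value) pairs for one language from a forms table."""
    for row in csv.DictReader(handle):
        if row["Language_ID"] == language_id:
            yield row["Language_ID"], row["Value"]


# Tiny in-memory stand-in for cldf/forms.csv; the real file has more columns.
# The IDs, Parameter_IDs and the non-Ulithian row are invented placeholders.
sample = io.StringIO(
    "ID,Language_ID,Parameter_ID,Value\n"
    "1,Ulithian,sleep,maesoer\n"
    "2,Ulithian,sleep,mawsur\n"
    "3,OtherLanguage,sleep,placeholder\n"
)

rows = list(extract_language_rows(sample, "Ulithian"))
print(rows)
```

With the real data, `sample` would be replaced by an open file handle on cldf/forms.csv; step 4 remains manual.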

@antipodite, would you be able to do a quick check of this very language to see how long this takes? If it takes no more than 20-30 minutes, I think it is worth it, and we could later even outsource this work to student assistants.

@LinguList
Contributor Author

Ah, just to add this: since we discussed this with @antipodite on another matter, it means that this is not the only case, so worth checking how well that workflow works.

@antipodite
Collaborator

antipodite commented Aug 26, 2021

OK, on it. So placing the non-IPA forms in ignore.tsv as we discussed

@antipodite
Collaborator

Not obvious to me which is which. @maryewal can you have a look at this?
Screenshot 2021-08-26 at 14 04 16

@maryewal
Collaborator

yep, not as obvious as I'd hoped without clear IPA. Simon is definitely right that one entry is probably orthography and the other some sort of pronunciation guide. This is because the data is based on a 2010 dictionary for students. It is partially online https://www.yumpu.com/en/document/read/11736907/ulithian-english-dictionary-habele, where we can see a pronunciation guide is the second entry in parentheses. So, perhaps get rid of all values that correspond to the orth. entries in the dictionary.

The full book does have a section on "orthography" and another on "spelling and pronunciation", but I can't find access to it. In many cases, we will probably be able to guess the right sound from what is written for pronunciation (e.g. ngal is probably ŋal), but I'm not totally comfortable making assumptions for the whole set...

@LinguList
Contributor Author

LinguList commented Aug 27, 2021 via email

@maryewal
Collaborator

Noted, @LinguList - let's see how far we can get, then! @antipodite do you want to do a first removal of the "orthographic" forms, based on the dictionary? Meanwhile, I can come up with likely distinctions.

@antipodite
Collaborator

antipodite commented Aug 27, 2021

OK, I filtered forms.csv to Ulithian and then sorted ascending by ID. Now it looks like we have pairs (mostly, some triples) where the first element is the orthographic form and the second is the pronunciation guide. I will shortly attach a modified Ulithian spreadsheet with my judgement of orth. vs pronunciation guide forms marked in a new column, so you can check them too, @maryewal. Then I can just filter out the ones we don't want and put them in ignore.tsv.
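
The pairing described here can be sketched roughly as follows. The function name `split_pairs`, the row layout and the sample IDs are assumptions for illustration; the rule "first form is orthographic" is only the impression stated above and still needs to be cross-checked against the dictionary.

```python
from collections import defaultdict


def split_pairs(rows):
    """Split each concept's forms into a presumed orthographic form and the rest."""
    by_concept = defaultdict(list)
    # Sort ascending by ID so forms appear in their source order.
    for row in sorted(rows, key=lambda row: row["ID"]):
        by_concept[row["Parameter_ID"]].append(row["Value"])
    # Assumption: the first form is orthographic, any remaining forms
    # are pronunciation guides (pairs, sometimes triples).
    return {
        concept: {"orthographic": values[0], "guide": values[1:]}
        for concept, values in by_concept.items()
    }


# Made-up IDs; the two values are the "to sleep" forms quoted earlier.
rows = [
    {"ID": 2, "Parameter_ID": "sleep", "Value": "mawsur"},
    {"ID": 1, "Parameter_ID": "sleep", "Value": "maesoer"},
]
pairs = split_pairs(rows)
print(pairs)
```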

@antipodite
Collaborator

antipodite commented Aug 27, 2021

Here it is. 1 in the "Orth." column means this is orthography, empty cell means either pronunciation guide or orth and pronunciation guide are the same. I cross checked with the dictionary. Note that some have multiple guide pronunciations

ulithian-orthography-judgments.csv

@maryewal maryewal self-assigned this Aug 27, 2021
@LinguList
Contributor Author

LinguList commented Aug 27, 2021 via email

@LinguList
Contributor Author

Ah, @antipodite, in order to get this rolling, can you now post the words to REMOVE to a file etc/ignore.tsv, where you give me three values plus a comment, as discussed before, e.g., for a form gumchiu (which I just invented):

Language_ID Parameter_Name Value Comment
Ulithian hand gumchiu duplicate

This TSV file would then serve as the basis to exclude entries from being listed as "normal" entries.

@antipodite
Collaborator

antipodite commented Aug 28, 2021

@mattis: Done, check pull requests. I put the pronunciation guide forms in for now, I think this is the better option to ignore as often English words are used as part of the pronunciation guide which would probably screw with the orth profile algo. Regardless I think the orthography profile will need quite a bit of manual correction, as the orthography of this language is somewhat quirky: d -> [θ] or [ð], e -> [i], some consonants seem to be written but not pronounced, etc

@antipodite
Collaborator

want me to have a go at plumbing in ignore.tsv? Seems like you would just filter against ignore.tsv in the def cmd_makecldf(self, args): fn definition in lexibank_abvdoceanic.py

@maryewal
Collaborator

@antipodite, based on what you say, it seems sensible to remove the pronunciation guide forms. Do you still want me to have a look at this?

@maryewal maryewal removed their assignment Aug 28, 2021
@LinguList
Contributor Author

So we insert an if-else check before this line:

try:
    lex = args.writer.add_forms_from_value(
        Local_ID=entry.id,
        Language_ID=slug(wl.language.name, lowercase=False),
        Parameter_ID=cid,
        Value=entry.name,
        # set source to entry-level sources if they exist, otherwise use
        # the language level source.
        #Source=[entry.source] if entry.source else source,
        Cognacy=entry.cognacy,
        Comment=entry.comment or '',
        Loan=True if entry.loan and len(entry.loan) else False,
    )

@antipodite, before this line, you can check for the same value, language ID and concept ID:

lid = slug(wl.language.name, lowercase=False)
if (lid, cid, entry.name) in ignored:
    continue

Before that, e.g., right after def cmd_makecldf, you load the ignored list:

ignored = {(row[0], row[1], row[2]): row[3] for row in self.etc_dir.read_csv("ignore.tsv", delimiter="\t")}

You have to play around with this, as I did not test it, but along those lines it should work, and I will gladly review this if you make another PR and assign me as a reviewer.
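
For testing the idea outside the dataset script, here is a self-contained sketch of the same lookup using only the standard library csv module rather than the etc_dir helper. The function name `load_ignored` and the sample entries are assumptions; the column names follow the ignore.tsv layout proposed earlier in this thread.

```python
import csv
import io


def load_ignored(handle):
    """Map (Language_ID, Parameter_Name, Value) to the Comment column."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {
        (row["Language_ID"], row["Parameter_Name"], row["Value"]): row["Comment"]
        for row in reader
    }


# In-memory stand-in for etc/ignore.tsv with the invented example row.
ignore_tsv = io.StringIO(
    "Language_ID\tParameter_Name\tValue\tComment\n"
    "Ulithian\thand\tgumchiu\tduplicate\n"
)
ignored = load_ignored(ignore_tsv)

# Hypothetical entries as (language ID, concept name, value) tuples.
entries = [
    ("Ulithian", "hand", "gumchiu"),
    ("Ulithian", "to sleep", "maesoer"),
]
# Membership test on the tuple key; skipped entries are never written out.
kept = [entry for entry in entries if entry not in ignored]
print(kept)
```

One thing to watch: the proposed file stores Parameter_Name, so the lookup key has to be built from the concept name rather than the concept ID, unless the file is changed to store IDs.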

@antipodite
Collaborator

antipodite commented Aug 28, 2021

@maryewal I think we should just go ahead with removing the guide forms. No need to look at it again. @mattis cool, I'll have a look some time tomorrow.
I guess it would be worth checking the generated profiles for other Micronesian languages too, as I recall Pohnpeian, Woleaian etc. have similarly quirky orthographies.

@maryewal
Collaborator

Can we close this?

@antipodite
Collaborator

yup, we sorted this iirc
