MRG: Autobuild features#198
Conversation
c477be9 to
c977f65
Compare
|
@bambooforest this is getting very close. I've thrown a few different changes into this PR (which normally I don't like to do, but it made things a lot easier), but the commit messages should give you a fairly clear picture. The # of unique segments with NA features is down to 13; if you run |
|
@drammock this is great. And it's only 5 languages. FYI: I get and error execution halted, although the file $ Rscript add-features.R Attaching package: ‘dplyr’ The following objects are masked from ‘package:stats’: The following objects are masked from ‘package:base’: Error: The list that it creates, are some of these errors that we can clean up in the input? E.g. tsɦ, t̪s̪ɦ|tsɦ in UPSID, e.g. "voiceless aspirated alveolar sibilant affricate with breathy release" https://github.com/phoible/dev/blob/master/raw-data/UPSID/UPSID_IPA_correspondences.tsv#L386 ɲd̠ʒ should be n̠d̠ʒ according to our conventions as reported as prenasalized post-alveolar / palatal, eh? nɖʐ in 2231 Xumi should be ɳɖʐ for the pre-nasalized retroflex, as I understand it. This is from !Xun d̠ʒxʼ which we don't seem to be able to encode with features for all of click consonants anyway... ɡbr, kpr in 1424 Morokodo are reported in the source. d̠ʒɾ, ɡbɾ, kpɾ, ŋmkpɾ, n̠t̠ʃɾ, t̠ʃɾ in 1630 Mbembe are reported in the source. |
c977f65 to
4673a73
Compare
4673a73 to
e746dae
Compare
|
@bambooforest sorry for the late commit... realized when I woke up that I forgot to re-run the agg and feat-builder scripts after the rebase. also just found another minor tweak to do. |
|
@drammock i updated the space modifier order here: should i push directly somewhere else? to your branch? to this PR? this still doesn't put the length marker after the tones. it wasn't immediately clear to me where these are getting reordered. here? also, should we reorganize the diacritics also according to the conventions? here's one suggestion just going from the top down: diacritics <- c( but I noticed that we don't mention these in the order part of the conventions (or at all, including stuff, denasalized, and nasal emission): "͊", # denasalized (combining not tilde above) |
|
I suggest reviewing / merging this one and doing the diacritic reordering in a different PR. This one already addresses too many extra issues that aren't part of its explicit goal of just making the feature building work, and I guess I don't see unpreferred diacritic ordering as a critical, time-sensitive flaw on the same level as missing features. I should note that the orderIPA function does more than we need it to (i.e., it handles cases where base glyphs, diacritics, modifiers, and tone are all present in a single segment). Given that we rigorously segregate tonemes from phonemes, the section of code you link to is probably not needed, and probably deleting it would give you the result you want (tones before diacritics in toneme segments). As for the length marker, I'm fine with moving it to the end. I don't think it makes a meaningful difference; I originally put it early in the ordering because typographically I think it looks better. But I disagree with some of the other ordering choices in 08bdd37. Please open a new PR (preferably after merging this one) and we can hash out those decisions there. |
|
@drammock -- I'm ok with leaving the diacritic ordering as is here if you believe the semantics are OK -- e.g. qʷʰ vs now qʰʷ. I was only trying to make it reflect our conventions (e.g. length marker to the end). We can ship this off for 2.0 as is, I think. (But then we might want to update our conventions site.) @xrotwang the current CLDF dump reflects this PR |
|
i agree with you that it should change. i agree that the conventions website should match the data. i don't want it done in this PR, please start a new one. i don't feel strongly that it has to be fixed before 2.0, but if you do i can make time to review it this weekend.
…On March 9, 2019 1:14:57 AM AKST, Steven Moran ***@***.***> wrote:
@drammock -- I'm ok with leaving the diacritic ordering as is here if
you believe the semantics are OK -- e.g. qʷʰ vs now qʰʷ.>
>
I was only trying to make it reflect our conventions (e.g. length
marker to the end). We can ship this off for 2.0 as is, I think. (But
then we might want to update our conventions site.)>
>
@xrotwang the current CLDF dump reflects this PR>
>
-- >
You are receiving this because you were mentioned.>
Reply to this email directly or view it on GitHub:>
#198 (comment)
|
|
thanks @drammock. i looked at the basics here and doubled-checked the stuff that @xrotwang found. looks good. agree there's some issues to discuss some specifics to look into in detail. here's some cheap checks: https://github.com/bambooforest/phoible-scripts/blob/master/tests/test-features.md one thing i don't understand is how you tested against the previous segment-feature vectors (given the the reordering of characters, otherwise i would have included that check here) |
closes #192
closes #6
closes #7