MRG: Autobuild features by drammock · Pull Request #198 · phoible/dev

drammock · 2019-02-16T01:46:44Z

closes #192
closes #6
closes #7

drammock · 2019-03-03T11:07:03Z

@bambooforest this is getting very close. I've thrown a few different changes into this PR (which normally I don't like to do, but it made things a lot easier), but the commit messages should give you a fairly clear picture. The # of unique segments with NA features is down to 13; if you run the scripts/add-features.R script it'll generate data/glyphs-with-na-feats.csv if you wanna see what remains. I'll try to knock out those last few in the next day or two and push to this PR; I'll ping you when that happens but feel free to have a look now if you want since there are a lot of changes.
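For readers following along, here is a minimal sketch of the kind of check described above: listing the unique segments whose feature columns are still NA and writing them to a CSV. It is not the code in scripts/add-features.R; the toy table, its column names, and the temp-file output path are all assumptions made for illustration.

```r
# Toy feature table; the column names and values are illustrative
# assumptions, not the real PHOIBLE schema.
feats <- data.frame(
    segment  = c("p", "ts\u0266", "a", "ts\u0266"),
    syllabic = c("-", NA, "+", NA),
    nasal    = c("-", NA, "-", NA),
    stringsAsFactors = FALSE
)

# A row counts as "missing" if any of its feature columns is NA.
feature_cols <- setdiff(names(feats), "segment")
has_na <- !complete.cases(feats[, feature_cols, drop = FALSE])

# Unique segments that still lack features.
na_glyphs <- unique(feats$segment[has_na])

# Write them out for inspection (to a temp file, so the sketch is harmless).
out_file <- tempfile(fileext = ".csv")
write.csv(data.frame(segment = na_glyphs), out_file, row.names = FALSE)
```

Counting `length(na_glyphs)` on the real table would give the "unique segments with NA features" figure mentioned above.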

bambooforest · 2019-03-03T22:50:32Z

@drammock this is great. And it's only 5 languages.

FYI: I get an error ("Execution halted"), although the file data/glyphs-with-na-feats.csv is created:

$ Rscript add-features.R
Warning in FUN(X[[i]], ...) : dʒx is not in special_feats table.
Warning in FUN(X[[i]], ...) : ɡbr is not in special_feats table.
Warning in FUN(X[[i]], ...) : kpr is not in special_feats table.
Warning in FUN(X[[i]], ...) : dʒɾ is not in special_feats table.
Warning in FUN(X[[i]], ...) : ɡbɾ is not in special_feats table.
Warning in FUN(X[[i]], ...) : kpɾ is not in special_feats table.
Warning in FUN(X[[i]], ...) : ntʃɾ is not in special_feats table.
Warning in FUN(X[[i]], ...) : ŋmkpɾ is not in special_feats table.
Warning in FUN(X[[i]], ...) : tʃɾ is not in special_feats table.
Warning in FUN(X[[i]], ...) : ɲdʒ is not in special_feats table.
Warning in FUN(X[[i]], ...) : nɖʐ is not in special_feats table.
Warning in FUN(X[[i]], ...) : tsɦ is not in special_feats table.
Warning in FUN(X[[i]], ...) : tsɦ is not in special_feats table.
Warning in FUN(X[[i]], ...) : tsɦ is not in special_feats table.

Attaching package: ‘dplyr’

The following objects are masked from ‘package:stats’:

filter, lag

The following objects are masked from ‘package:base’:

intersect, setdiff, setequal, union

Error:
Execution halted

As for the list that it creates: are some of these errors that we can clean up in the input? E.g.

tsɦ, t̪s̪ɦ|tsɦ in UPSID, e.g. "voiceless aspirated alveolar sibilant affricate with breathy release"

https://github.com/phoible/dev/blob/master/raw-data/UPSID/UPSID_IPA_correspondences.tsv#L386

ɲd̠ʒ should be n̠d̠ʒ according to our conventions, since it's reported as prenasalized post-alveolar / palatal, eh?

nɖʐ in 2231 Xumi should be ɳɖʐ for the pre-nasalized retroflex, as I understand it.

dʒx: this is from !Xun d̠ʒxʼ, which we don't seem to be able to encode with features for all of the click consonants anyway...

ɡbr, kpr in 1424 Morokodo are reported in the source.

d̠ʒɾ, ɡbɾ, kpɾ, ŋmkpɾ, n̠t̠ʃɾ, t̠ʃɾ in 1630 Mbembe are reported in the source.

drammock · 2019-03-06T18:14:18Z

@bambooforest sorry for the late commit... realized when I woke up that I forgot to re-run the agg and feat-builder scripts after the rebase. also just found another minor tweak to do.

bambooforest · 2019-03-08T15:41:59Z

@drammock i updated the space modifier order here:

08bdd37

should i push directly somewhere else? to your branch? to this PR?

this still doesn't put the length marker after the tones. it wasn't immediately clear to me where these are getting reordered. here?

https://github.com/drammock/phoible/blob/autobuild-features/scripts/aggregation-helper-functions.R#L289

also, should we reorganize the diacritics according to the conventions? here's one suggestion just going from the top down:

diacritics <- c(
    "̴", # velarized/pharyngealized (combining tilde overlay)
    "̼", # linguolabial (combining seagull below)
    "̪", # dental (combining bridge below)
    "̺", # apical (combining inverted bridge below)
    "̻", # laminal (combining square below)
    "̟", # advanced (combining plus sign below)
    "̠", # retracted (combining minus sign below)
    "̝", # raised (combining up tack below)
    "̞", # lowered (combining down tack below)
    "̘", # advanced tongue root (combining left tack below)
    "̙", # retracted tongue root (combining right tack below)
    "͓", # frictionalized (combining x below)
    "̹", # more round (combining right half ring below)
    "̜", # less round (combining left half ring below)
    "̮", # derhoticized (combining breve below)
    "̰", # creaky (combining tilde below)
    "̤", # breathy (combining diaeresis below)
    "̥", # devoiced (combining ring below)
    "̊", # devoiced (combining ring above)
    "͇", # non-sibilant (combining equals sign below)
    "͈", # fortis (combining double vertical line below)
    "͉", # lenis (combining left angle below)
    "̬", # stiff (combining caron below)
    "̩", # syllabic (combining vertical line below)
    "̯", # non-syllabic (combining inverted breve below)
    "̃", # nasalized (combining tilde)
    "͊", # denasalized (combining not tilde above)
    "͋", # nasal emission (combining homothetic above)
    "̈", # centralized (combining diaeresis)
    "̽", # mid-centralized (combining x above)
    "̆", # short (combining breve)
    "̚" # unreleased (combining left angle above)
)

but I noticed that we don't mention these in the ordering part of the conventions (or at all, in the case of stiff, denasalized, and nasal emission):

"͊", # denasalized (combining not tilde above)
"͋", # nasal emission (combining homothetic)
"̬", # stiff (combining caron below)
"͇", # non-sibilant (combining equals sign below)
"͈", # fortis (combining double vertical line below)
"͉", # lenis (combining left angle below)
"̆", # short (combining breve)

drammock · 2019-03-08T18:56:50Z

I suggest reviewing / merging this one and doing the diacritic reordering in a different PR. This one already addresses too many extra issues that aren't part of its explicit goal of just making the feature building work, and I guess I don't see unpreferred diacritic ordering as a critical, time-sensitive flaw on the same level as missing features.

I should note that the orderIPA function does more than we need it to (i.e., it handles cases where base glyphs, diacritics, modifiers, and tone are all present in a single segment). Given that we rigorously segregate tonemes from phonemes, the section of code you link to is probably not needed, and probably deleting it would give you the result you want (tones before diacritics in toneme segments).

As for the length marker, I'm fine with moving it to the end. I don't think it makes a meaningful difference; I originally put it early in the ordering because typographically I think it looks better. But I disagree with some of the other ordering choices in 08bdd37. Please open a new PR (preferably after merging this one) and we can hash out those decisions there.
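To make the ordering question concrete, here is a minimal sketch of rank-based reordering with the length marker placed last. It is not the project's orderIPA function: `reorder_marks` and the short `preferred` vector are hypothetical, and the sketch assumes a segment consists of one base glyph followed by its marks.

```r
# Hypothetical preferred order (a small subset, for illustration only),
# written with \u escapes so the combining marks stay legible.
preferred <- c(
    "\u032A",  # dental (combining bridge below)
    "\u0330",  # creaky (combining tilde below)
    "\u0324",  # breathy (combining diaeresis below)
    "\u0303",  # nasalized (combining tilde)
    "\u02D0"   # length marker, deliberately last
)

# Assumption: the first code point is the base glyph; everything after it
# is a diacritic or modifier to be reordered.
reorder_marks <- function(segment, preferred_order = preferred) {
    chars <- intToUtf8(utf8ToInt(segment), multiple = TRUE)
    base <- chars[1]
    marks <- chars[-1]
    rank <- match(marks, preferred_order)
    # Marks missing from the preferred vector sort after all known ones,
    # keeping their original relative order (order() is stable for ties).
    rank[is.na(rank)] <- length(preferred_order) + 1L
    paste0(c(base, marks[order(rank)]), collapse = "")
}

# Length marker typed before the nasalization tilde: it is moved to the end.
reorder_marks("a\u02D0\u0303")
```

With this shape, changing the convention means editing only the `preferred` vector, which is why the ordering choices can be settled separately from the feature builder.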

bambooforest · 2019-03-09T10:14:56Z

@drammock -- I'm ok with leaving the diacritic ordering as is here if you believe the semantics are OK -- e.g. qʷʰ vs now qʰʷ.

I was only trying to make it reflect our conventions (e.g. length marker to the end). We can ship this off for 2.0 as is, I think. (But then we might want to update our conventions site.)

@xrotwang the current CLDF dump reflects this PR

drammock · 2019-03-09T16:58:31Z

i agree with you that it should change. i agree that the conventions website should match the data. i don't want it done in this PR, please start a new one. i don't feel strongly that it has to be fixed before 2.0, but if you do i can make time to review it this weekend.

bambooforest · 2019-03-10T22:38:49Z

thanks @drammock. i looked at the basics here and double-checked the stuff that @xrotwang found. looks good. agree there are some issues to discuss and some specifics to look into in detail. here are some cheap checks:

https://github.com/bambooforest/phoible-scripts/blob/master/tests/test-features.md

one thing i don't understand is how you tested against the previous segment-feature vectors (given the reordering of characters; otherwise i would have included that check here)
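One possible way to run such a comparison despite the reordering, offered only as a sketch and not as a description of how the PR was actually tested: join the old and new tables on an order-insensitive key (the segment's code points, sorted) and compare feature values. The helper `order_free_key`, the toy tables, and the `voiced` column are all hypothetical.

```r
# Order-insensitive key: the segment's code points, sorted.
# Assumption: segments differing only in character order are "the same".
order_free_key <- function(segments) {
    vapply(segments,
           function(s) intToUtf8(sort(utf8ToInt(s))),
           character(1), USE.NAMES = FALSE)
}

# Toy "previous" and "current" tables; the feature column is made up.
old <- data.frame(segment = c("q\u02B7\u02B0", "a"),
                  voiced = c("-", "+"), stringsAsFactors = FALSE)
new <- data.frame(segment = c("q\u02B0\u02B7", "a"),
                  voiced = c("-", "+"), stringsAsFactors = FALSE)

old$key <- order_free_key(old$segment)
new$key <- order_free_key(new$segment)

# Join on the key, then keep rows whose feature value changed.
both <- merge(old, new, by = "key", suffixes = c("_old", "_new"))
changed <- both[both$voiced_old != both$voiced_new, ]
```

A sorted key also conflates segments that are genuinely distinct but share the same characters (for example ts and st), so any hits or misses from a check like this would still need a manual look.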

drammock mentioned this pull request Feb 19, 2019

ressurect tie-bars? #148

Closed

drammock force-pushed the autobuild-features branch from c477be9 to c977f65 Compare March 3, 2019 10:53

drammock force-pushed the autobuild-features branch from c977f65 to 4673a73 Compare March 6, 2019 03:34

drammock added 12 commits March 5, 2019 21:26

WIP: auto-build features

80226ea

WIP: autobuild working

1b96f16

ENH: add SegmentClass

71db1b9

FIX: diphthong conventions; remove blank lines

c550254

minor code cleanup

f28742c

WIP: handle contextual diacritics

7c5aec1

regen data

50e1324

add new data file (w/ features)

07d87dc

nɖʐ -> ɳɖʐ

7ec465a

ɲd̠ʒ -> n̠d̠ʒ

45918f4

finish feature builder

3145373

regen data

e746dae

drammock force-pushed the autobuild-features branch from 4673a73 to e746dae Compare March 6, 2019 06:28

drammock changed the title ~~WIP: Autobuild features~~ MRG: Autobuild features Mar 6, 2019

drammock requested a review from bambooforest March 6, 2019 06:28

re-gen data after rebase (oops)

f2c779a

drammock removed the request for review from bambooforest March 6, 2019 18:38

drammock added 7 commits March 6, 2019 10:59

ᵊ -> ə

a2d61ec

feats for unreleased diacritic

7671176

fix: feat assignment for centralized / midcentralized diacritics

7dc5ced

cast InventoryID as integer instead of double

843f6b5

doc: clean up comments

b06561c

change ouput varnames and filenames

806a097

bugfixes, comment cleanup

81714de

regen everything

4aff202

drammock requested a review from bambooforest March 6, 2019 22:41

drammock added 3 commits March 6, 2019 14:36

rename LanguageCode -> ISO6393 and keep in output

792913e

feature assignment bugfixes

adb5d3f

regen data

ac3e94b

bambooforest added a commit to bambooforest/phoible-scripts that referenced this pull request Mar 8, 2019

runs data through on current branch phoible/dev#198

88942d8

bambooforest added a commit to bambooforest/phoible-scripts that referenced this pull request Mar 9, 2019

aligned to phoible/dev#198

68b0a48

bambooforest merged commit 0ec7201 into phoible:master Mar 10, 2019

drammock deleted the autobuild-features branch March 11, 2019 17:32

drammock mentioned this pull request Jan 17, 2022

Fix EA breathy #347

Merged

bambooforest merged 24 commits into phoible:master from drammock:autobuild-features