Grouped Sounds Notation in Lexibank and other libraries #258

Open
LinguList opened this issue Jul 3, 2022 · 0 comments

LinguList commented Jul 3, 2022

From past work on automatic reconstruction, and from tests in EDICTOR on individual datasets, I have realized that we can avoid problematic alignments by introducing a "grouped-sounds notation" for sequences. If I want to say that two sounds should form a unit, I no longer separate them by a space, but by a dot. This allows me to match, e.g., k.j vs. ts. We can also circumvent many diphthong vs. monophthong decisions if we allow a u to be written as a.u where we are not sure. I am writing a short article that shows how helpful this can be in many approaches, specifically in alignments, where it avoids gaps. Gaps are always a problem, as they are often unmotivated: consider k j a ŋ vs. ts ã, which involves two gaps, but none if we resort to k.j a.ŋ for the former.
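To make the gap argument concrete, here is a minimal sketch that counts the fewest gaps any pairwise alignment of two space-separated sequences must contain, which is simply the difference in their number of units. The helper name `min_gaps` is made up for illustration; it gives a lower bound only and is not an alignment algorithm.

```python
def min_gaps(seq_a, seq_b):
    """Lower bound on gaps in a pairwise alignment: the difference in unit counts."""
    # Units are separated by spaces; a dot keeps grouped sounds inside one unit.
    return abs(len(seq_a.split()) - len(seq_b.split()))


ungrouped = min_gaps("k j a ŋ", "ts ã")  # four units vs. two units
grouped = min_gaps("k.j a.ŋ", "ts ã")    # two units vs. two units
print(ungrouped, grouped)  # 2 0
```

With grouping, both sequences have the same number of units, so they can be matched position by position.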

In orthography profiles this notation can be introduced with the profile. We can even introduce it only implicitly by (ab)using the slash notation, writing k .j/j a .ŋ/ŋ, which can be converted to k.j a.ŋ with a very short function:

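def group_sounds(segments):
    """Convert slash-annotated segments (e.g. ".j/j") to grouped-sounds notation."""
    out = []
    for segment in segments:
        if "/" in segment:
            # Keep the part before the first slash; the part after it is the plain sound.
            one, _ = segment.split("/", 1)
            if one.startswith(".") and out:
                # A leading dot attaches the sound to the preceding unit.
                out[-1] += one
            else:
                # Nothing to attach to (or no dot): start a new unit without a leading dot.
                out.append(one.lstrip("."))
        else:
            out.append(segment)
    return out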

In Lexibank, we can add a GroupedSegments column to the FormTable that holds the grouped version of the ungrouped Segments. Grouping can even be done later, on the fly.
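Because the dot is the only thing the grouping adds, the two columns stay interconvertible. A minimal sketch of the reverse direction, assuming a form is available as a plain dictionary (the `row` dictionary and the helper `ungroup_sounds` are hypothetical and not part of pylexibank):

```python
def ungroup_sounds(grouped):
    """Split grouped units such as "k.j" back into their individual sounds."""
    out = []
    for unit in grouped:
        # A unit without a dot is returned unchanged as a single sound.
        out.extend(unit.split("."))
    return out


# Hypothetical stand-in for one FormTable row, not the pylexibank API.
row = {"GroupedSegments": ["k.j", "a.ŋ"]}
row["Segments"] = ungroup_sounds(row["GroupedSegments"])
print(row["Segments"])  # ['k', 'j', 'a', 'ŋ']
```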

In the long run, once this has been properly tested, I would however suggest making this part of the normal Segments and checking CLTS compatibility for each element of a group individually rather than for the group as a whole, which would require modifying the pylexibank code.
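Checking the elements individually could look roughly like the sketch below. It assumes the set of valid sounds is available as a plain Python set; `KNOWN_SOUNDS` and `unknown_sounds` are hypothetical placeholders and do not reflect the actual CLTS or pylexibank API.

```python
# Hypothetical stand-in for a sound inventory, not real CLTS data.
KNOWN_SOUNDS = {"k", "j", "a", "ŋ", "ts"}


def unknown_sounds(segments, known=KNOWN_SOUNDS):
    """Return sounds missing from `known`, checking each element of a group on its own."""
    missing = []
    for segment in segments:
        # "k.j" is checked as "k" and "j", never as the combined string "k.j".
        for sound in segment.split("."):
            if sound not in known:
                missing.append(sound)
    return missing


print(unknown_sounds(["k.j", "a.ŋ"]))  # []
print(unknown_sounds(["k.q", "a"]))    # ['q']
```

Looking up "k.j" as one string would fail even though both of its parts are valid, which is why the check splits on the dot first.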

LinguList added the "question" (Further information is requested) label Jul 3, 2022