Add structure from existing alignment #41

FredericBlum · 2023-06-05T12:33:14Z

I am currently running into trouble while trying to create a CV structure based on pre-existing alignments with manually trimmed data. Take the following data as an example:

ID	DOCULECT	CONCEPT	VALUE	FORM	TOKEN	COGID	ALIGNMENT
1	Marubo	type of bank	kɨnã	kɨnã	k ɨ n ã	60	['k', 'ɨ', 'n', 'ã', '-', '(', '-', '-', ')']
2	Chakobo	type of bank	kɨˈnanɨ	kɨˈnanɨ	k ɨ n a n ɨ	60	['k', 'ɨ', 'n', 'a', '-', '(', 'n', 'ɨ', ')']

Is there any pre-defined way of doing this, or would I need to create a new tokens-column based on the alignments, removing all content within brackets in a dynamic way? @LinguList Maybe you can help me on this.
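For reference, the "dynamic" removal asked about here can be sketched in plain Python for a single row. This is only an illustration, not a lingpy function: the helper name strip_bracketed is hypothetical, and it assumes the ALIGNMENT cell holds the text of a Python list, as in the sample data above.

```python
import ast


def strip_bracketed(row):
    """Return the row without the segments enclosed in '(' ... ')'."""
    kept, inside = [], False
    for segment in row:
        if segment == "(":
            inside = True
        elif segment == ")":
            inside = False
        elif not inside:
            kept.append(segment)
    return kept


# The ALIGNMENT cell of row 2 above, stored as the text of a Python list.
cell = "['k', 'ɨ', 'n', 'a', '-', '(', 'n', 'ɨ', ')']"
row = ast.literal_eval(cell)            # turn the text into a real list
trimmed = strip_bracketed(row)          # alignment without the bracketed part
tokens = [s for s in trimmed if s != "-"]  # tokens are the alignment minus gaps
print(trimmed)
print(tokens)
```

Working row by row like this ignores the other rows of the same cognate set, so it only makes sense if the brackets sit in the same columns for every row.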


LinguList · 2023-06-05T12:55:45Z

I think you need to create the intermediate strings, all of them trimmed.

LinguList · 2023-06-05T12:57:01Z

But in order to make them, you can iterate over the alignment:

for cogid, msa in alms.msa["cogid"].items():
    print(msa["alignment"])
    print(msa)

The msa object that is generated for you here stores the alignments. I think it even automatically retains only the parts that are not inside brackets!

LinguList · 2023-06-05T12:57:11Z

But I'd ask you to test for this.

LinguList · 2023-06-05T12:57:48Z

Please check also this function, that could otherwise be applied to the alignment object: https://lingpy.org/reference/lingpy.align.html#lingpy.align.sca.Alignments.reduce_alignments

LinguList · 2023-06-05T12:59:13Z

So, please check alms.reduce_alignments and how it behaves, as I do not remember if it returns a list of reduced alignments or if it reduces alignments in the alms.msa["cogid"] attribute, etc.

FredericBlum · 2023-06-05T13:10:39Z

Not much happens. It applies directly to alms and adds an _alignment entry to the dictionary, but nothing is reduced. Even after looking at the code, I do not understand where the reduction is supposed to happen.

LinguList · 2023-06-05T13:11:45Z

Please use a minimal example and share it here (you can zip it). We only need 3 cogids with 3 alignments that are "to be reduced".

LinguList · 2023-06-05T13:13:37Z

And please check the import statement in the script on align/sca.py in lingpy:

from lingpy.read.qlc import read_msa, normalize_alignment, reduce_alignment

So you have the function in read.qlc!

LinguList · 2023-06-05T13:14:01Z

def reduce_alignment(alignment):
    """
    Function reduces a given alignment.
    
    Notes
    -----
    Reduction here means that the output alignment consists only of those parts
    which have not been marked to be ignored by the user (parts in brackets).
    It requires that all data is properly coded. If reduction fails, this will
    throw a warning, and all brackets are simply removed in the output
    alignment.
    """

    # check for bracket indices in all columns
    cols = misc.transpose(alignment)

    ignore_indices = []
    ignore = False
    for i, col in enumerate(cols):
        reduced_col = sorted(set(col))

        if '(' in reduced_col:
            if len(reduced_col) == 1:
                ignore_indices += [i]
                ignore = True
            else:
                ignore = False
        elif ')' in reduced_col:
            if len(reduced_col) == 1:
                ignore_indices += [i]
                ignore = False
            else:
                ignore_indices = []
        elif ignore:
            ignore_indices += [i]

    if ignore_indices:
        new_cols = []
        for i, col in enumerate(cols):
            if i not in ignore_indices:
                new_cols += [col]
    else:
        new_cols = cols

    new_alm = misc.transpose(new_cols)

    for i, alm in enumerate(new_alm):
        for j, char in enumerate(alm):
            if char in '()':
                new_alm[i][j] = '-'

    return new_alm

LinguList · 2023-06-05T13:15:26Z

The alignment here is a simple array, so you can pass it something like msa["alignment"]. This should reduce the alignment, and from each reduced alignment you can then also store the tokens as [x for x in alm if x != "-"].

It is probably easier to integrate all of this later into the Sites class in lingrex, but it is not that difficult to make a small preprocessing with pure lingpy here.
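As a rough illustration of what column-wise reduction does to a whole alignment matrix, here is a simplified pure-Python stand-in. It is not lingpy's reduce_alignment: the name reduce_columns is hypothetical, and the sketch only handles bracket columns in which every row carries the bracket.

```python
def reduce_columns(matrix):
    """Drop all columns from an all-'(' column up to an all-')' column."""
    keep, inside = [], False
    for col in zip(*matrix):          # iterate over the columns of the matrix
        symbols = set(col)
        if symbols == {"("}:
            inside = True             # opening bracket column: start skipping
        elif symbols == {")"}:
            inside = False            # closing bracket column: stop skipping
        elif not inside:
            keep.append(col)
    return [list(row) for row in zip(*keep)]  # back to one list per row


# The two aligned rows from the first comment of this thread.
matrix = [
    ["k", "ɨ", "n", "ã", "-", "(", "-", "-", ")"],
    ["k", "ɨ", "n", "a", "-", "(", "n", "ɨ", ")"],
]
reduced = reduce_columns(matrix)
tokens = [[seg for seg in row if seg != "-"] for row in reduced]
print(reduced)
print(tokens)
```

The reduced rows still contain gaps and are what goes into the alignment column; the gap-free lists are what goes into the tokens column.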

LinguList · 2023-06-05T13:18:37Z

from lingpy.read.qlc import reduce_alignment
from lingpy import basictypes

dct = {}
for idx, msa in alms.msa["cogid"].items():
    reduced = reduce_alignment(msa["alignment"])
    for i, row in enumerate(reduced):
        dct[msa["ID"][i]] = row

alms.add_entries("old_tokens", "tokens", lambda x: x)
alms.add_entries("tokens", dct, lambda x: basictypes.lists([y for y in x if y != "-"]), override=True)

# same for alignments

FredericBlum · 2023-06-05T13:39:14Z

minimum_example.zip

Here comes the minimum example based on your code, showcasing the problem. There is no reduction apparently.

LinguList · 2023-06-06T10:51:37Z

You DO realize that you have python lists typed into the alignment column, not space-segmented strings?

LinguList · 2023-06-06T10:51:58Z

['m', 'ã', 'n', '(', '-', ')', 'ã', '-', '(', '-', '-', '-', '-', ')']

LinguList · 2023-06-06T10:52:09Z

That is the first alignment ;-)
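To make the difference concrete, the snippet below contrasts the text of a Python list with a space-segmented string. It is a minimal sketch in plain Python; that the alignment column is split on whitespace when read is an assumption here, not something verified against lingpy's code.

```python
import ast

# What the ALIGNMENT cell contained: the text of a Python list.
cell = "['m', 'ã', 'n', '(', '-', ')']"

# Splitting that text on whitespace yields quoted fragments, not segments.
naive = cell.split()

# Parsing the text first and re-joining gives a space-segmented string.
parsed = ast.literal_eval(cell)
fixed_cell = " ".join(parsed)

print(naive)
print(fixed_cell)
```

Only the second form splits back into the intended segments.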

FredericBlum · 2023-06-06T11:51:38Z

But isn't that how reduce_alignment works, taking the whole msa["alignment"] as input? If I run the following code:

for idx, msa in alms.msa["cogid"].items():
    for alg in msa["alignment"]:
        print("Alignment:", alg)
        reduce_alignment(alg)

I receive an error like this:

(lingreg) blum@lingn45 example % python regularity.py
Alignment: ['m', 'ã', 'n', '(', '-', ')', 'ã', '-', '(', '-', '-', '-', '-', ')']
Traceback (most recent call last):
  File "/Users/blum/Projects/lingreg/example/regularity.py", line 42, in <module>
    reduce_alignment(alg)
  File "/Users/blum/Projects/venv/lingreg/lib/python3.11/site-packages/lingpy/read/qlc.py", line 23, in reduce_alignment
    cols = misc.transpose(alignment)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/blum/Projects/venv/lingreg/lib/python3.11/site-packages/lingpy/algorithm/cython/_misc.py", line 20, in transpose
    out = [[matrix[i][j] for i in range(lA)] for j in range(lB)]
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/blum/Projects/venv/lingreg/lib/python3.11/site-packages/lingpy/algorithm/cython/_misc.py", line 20, in <listcomp>
    out = [[matrix[i][j] for i in range(lA)] for j in range(lB)]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/blum/Projects/venv/lingreg/lib/python3.11/site-packages/lingpy/algorithm/cython/_misc.py", line 20, in <listcomp>
    out = [[matrix[i][j] for i in range(lA)] for j in range(lB)]
            ~~~~~~~~~^^^
IndexError: string index out of range

This does not happen if I use the whole msa as input.

LinguList · 2023-06-06T16:57:39Z

    cols = misc.transpose(alignment)

is a function that only pertains to one matrix (2-dim list). The requirement is that the length of all rows is identical per alignment. If it throws an index error, it means one of your rows (your strings, your aligned words) is of different length.

The function clearly does not take an msa-dictionary as input.
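The failure mode can be reproduced without lingpy. The transpose below is a stand-in modelled on the line shown in the traceback; taking the column count from the first row is an assumption about the real implementation, and try_transpose is a hypothetical helper used only to capture the error message.

```python
def transpose(matrix):
    """Swap rows and columns; every row must be as long as the first one."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    return [[matrix[i][j] for i in range(n_rows)] for j in range(n_cols)]


def try_transpose(matrix):
    """Return the transposed matrix, or the error message if it fails."""
    try:
        return transpose(matrix)
    except IndexError as error:
        return "IndexError: {0}".format(error)


# A proper matrix: all rows have the same length.
ok = try_transpose([["k", "a"], ["k", "-"]])

# Rows of different length: the shorter row runs out of cells.
ragged = try_transpose([["k", "a", "r"], ["k", "a"]])

# A single row passed instead of a matrix: each segment is treated as a
# row of characters, so "tʃ" (two characters) and "a" (one) do not line up.
flat = try_transpose(["tʃ", "a"])

print(ok)
print(ragged)
print(flat)
```

In this sketch, passing one aligned row instead of a list of rows produces the same "string index out of range" message as in the traceback above.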

LinguList · 2023-06-06T16:59:12Z

The error is again in your data.

In [1]: from lingpy.read.qlc import reduce_alignment

In [3]: reduce_alignment([["1", "(", "-", ")"], ["2", "(", "2", ")"]])
Out[3]: [['1'], ['2']]

LinguList · 2023-06-06T17:00:36Z

To prevent such errors, you must test:

if len(set(len(row) for row in msa["alignment"])) != 1:
    print("problem in alignment {0}".format(cogid))
else:
    ...
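The same check can be run over all cognate sets at once to collect the offending ones before any reduction is attempted. This is a sketch on a hand-made dictionary: the name ragged_alignments is hypothetical, and msas only imitates the shape of alms.msa["cogid"] as far as it is visible in this thread (a mapping from cognate ID to a dictionary with an "alignment" key).

```python
def ragged_alignments(msas):
    """Return the keys of all alignments whose rows differ in length."""
    return [
        cogid
        for cogid, msa in msas.items()
        if len({len(row) for row in msa["alignment"]}) != 1
    ]


# Hypothetical stand-in for alms.msa["cogid"].
msas = {
    60: {"alignment": [["k", "ɨ", "n", "ã"], ["k", "ɨ", "n", "a"]]},
    81: {"alignment": [["k", "a", "r", "i"], ["k", "a", "r"]]},
}
problems = ragged_alignments(msas)
print(problems)
```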

FredericBlum · 2023-06-07T08:06:24Z

You DO realize that you have python lists typed into the alignment column, not space-segmented strings?

Now I understood what you meant! And why I was so confused. I adapted the CLDF conversion of the dataset so that now space-segmented strings are added, not a python list. Adding the structure based on the reduced alignment works now:

dct = {}
for idx, msa in alms.msa["cogid"].items():
    msa_reduced = []
    for site in msa["alignment"]:
        # print("Alignment:", site)
        reduced = reduce_alignment([site])[0]
        msa_reduced.append(reduced)
    for i, row in enumerate(msa_reduced):
        dct[msa["ID"][i]] = "".join(row)

alms.add_entries("old_tokens", "tokens", lambda x: x)
alms.add_entries("tokens", dct, lambda x: [y for y in x], override=True)
alms.add_entries("structure", "tokens", lambda x: " ".join(Sites([x]).soundclasses))

FredericBlum · 2023-06-07T08:16:40Z

However, I now have problems matching alignment, tokens, and structure. I played around with the lambda expressions and came up with the following:

alms.add_entries("tokens", dct, lambda x: [y for y in x if y != "-"], override=True)
alms.add_entries("alignment", dct, lambda x: [y for y in x], override=True)
alms.add_entries("structure", "alignment", lambda x: " ".join(Sites([x]).soundclasses))

Now I get errors that the alignment and the structure do not match:

2023-06-07 10:15:51,820 [WARNING] alignment and structure do not match in 6245
6251 C V + C V C V | ʂ o - t o k o | ʂ o t o k o
2023-06-07 10:15:51,820 [WARNING] alignment and structure do not match in 6251
6258 C V + C V C V | ʃ u - t a k u | ʃ u t a k u
2023-06-07 10:15:51,820 [WARNING] alignment and structure do not match in 6258
3601 C V + + | m ã - - | m ã
2023-06-07 10:15:51,820 [WARNING] alignment and structure do not match in 3601
3603 C V + + | m ã - - | m ã
2023-06-07 10:15:51,820 [WARNING] alignment and structure do not match in 3603
3605 C V + + | m ã - - | m ã

I strongly suspect that this is due to the "+" in the structure column. I do not understand how I need to adapt the "add_entries" command to successfully match both. Any hints on what could solve this?

FredericBlum · 2023-06-07T08:17:57Z

Adding "0" instead of "+" results in the same error.

LinguList · 2023-06-07T08:18:30Z

Yes, you need to re-compute the structure, sorry.

LinguList · 2023-06-07T08:20:18Z

But since structure is just CV, it is not that difficult:

struc = tokens2class(tokens, "cv")

FredericBlum · 2023-06-07T08:33:43Z

alms.add_entries("tokens", dct, lambda x: "".join(y for y in x if y != "-"), override=True)
alms.add_entries("alignment", dct, lambda x: "".join(y for y in x), override=True)
alms.add_entries("structure", "alignment", lambda x: " ".join([token2class(y, "cv") for y in x if y != " "]))

cop = get_copar(alms, ref="cogid", structure="structure", min_refs=3)

This keeps throwing the same errors, for all data points.

2023-06-07 10:32:21,037 [WARNING] alignment and structure do not match in 6258
2558 C V C C V | k a tʃ i | k a tʃ i
2023-06-07 10:32:21,037 [WARNING] alignment and structure do not match in 2558
2567 C C V 0 V C C V 0 V | k !á/a r !í/i | k !á/a r !í/i
2023-06-07 10:32:21,037 [WARNING] alignment and structure do not match in 2567
3601 C V 0 0 | m ã - - | m ã
2023-06-07 10:32:21,037 [WARNING] alignment and structure do not match in 3601
3603 C V 0 0 | m ã - - | m ã
2023-06-07 10:32:21,037 [WARNING] alignment and structure do not match in 3603
3605 C V 0 0 | m ã - - | m ã

The data itself looks fine to me:

['Yaminawa', 'woman, wife', 'ʂotokoɸakɨ̃', 'ʂotokoɸakɨ̃', 'ʂ o  t o k o', 'None', 375, 'ʂ o - t o k o', 'C V 0 C V C V']
['Yawanawa', 'young woman', 'ʃutaku_βakɨ', 'ʃutaku_βakɨ', 'ʃ u  t a k u', 'None', 375, 'ʃ u - t a k u', 'C V 0 C V C V']
['Amawaka', 'yam', 'kari', 'kari', 'k a r i', 'II', 81, 'k a r i', 'C V C V']
['Chakobo', 'yam', 'ˈkari', 'ˈkari', 'k a r i', 'None', 81, 'k a r i', 'C V C V']
['Chaninawa', 'yam', 'kaɾi', 'kaɾi', 'k a ɾ i', 'None', 81, 'k a ɾ i', 'C V C V']

All three of tokens, alignment, and structure are space-segmented strings, not lists.

FredericBlum · 2023-06-07T08:39:46Z

The current setup has another fundamental problem that has not even surfaced yet: segments such as "ts" are split apart when adding the structure. And it seems like the algorithm has problems with the slash annotation as well:

2023-06-07 10:32:21,037 [WARNING] alignment and structure do not match in 6258
2558 C V C C V | k a tʃ i | k a tʃ i
2023-06-07 10:32:21,037 [WARNING] alignment and structure do not match in 2558
2567 C C V 0 V C C V 0 V | k !á/a r !í/i | k !á/a r !í/i

Should I get back to lists?

LinguList · 2023-06-07T08:46:42Z

@tarotis, if the warning says that alignment and structure do not match, it means they don't match. So even if the data looks fine to you, it is wrong, and you should have a look where the problem lies.

LinguList · 2023-06-07T08:47:48Z

And the data are not fine. I mean, check this string: 'ʂ o  t o k o'. It has two spaces! If you have a 0 in the CV sound class conversion, it means lingpy does not know the sound. This points to a problem in the data.
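Why a doubled space matters can be seen with plain string splitting. The snippet is only a sketch: whether lingpy splits on a single space internally is an assumption, but the effect would be the same for any code that does.

```python
# Token string from the data above, with two spaces after the second segment.
tokens = "ʂ o  t o k o"

by_single_space = tokens.split(" ")  # keeps an empty "segment" in the middle
by_whitespace = tokens.split()       # collapses any run of whitespace

print(by_single_space)
print(by_whitespace)
```

The empty string in the first result is a "segment" that no sound class model can know, which would be one way for an unexpected 0 to show up in a CV string.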

LinguList · 2023-06-07T08:49:05Z

And your problem is this line:

alms.add_entries("tokens", dct, lambda x: "".join(y for y in x if y != "-"), override=True)

It should be:

alms.add_entries("tokens", dct, lambda x: " ".join([y for y in x.split() if y != "-"]), override=True)

Assuming that your alignment is a string, space-segmented!
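The difference between the two joins is easy to check in isolation. The following is a plain-Python sketch on a made-up row, independent of lingpy:

```python
# A reduced alignment row with a multi-character segment and a gap.
row = ["k", "a", "tʃ", "i", "-"]

glued = "".join(s for s in row if s != "-")    # segment boundaries are lost
spaced = " ".join(s for s in row if s != "-")  # boundaries are kept as spaces

print(list(glued))      # iterating the glued string goes character by character
print(spaced.split())   # splitting the spaced string recovers the segments
```

With the empty-string join, "tʃ" can only be recovered as two separate characters, which matches the observation above that such segments get separated when the structure is added.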

LinguList · 2023-06-07T08:49:49Z

    for i, row in enumerate(msa_reduced):
        dct[msa["ID"][i]] = "".join(row)

This should be:

        dct[msa["ID"][i]] = row

Then you have a list and not a string, which also does not really make sense.

FredericBlum · 2023-06-07T09:05:16Z

The spaces were in the tokens, not in the alignments or the structure, so they should not have caused any problems. I have removed them; thanks for pointing that out. They crept in because my joining got out of hand while I was trying to fix this.

The 0's in the structure are inserted for the gaps, not for any sound, which I guess I did not communicate well. The data produced by this command:

alms.add_entries("structure", dct, lambda x: [token2class(y, "cv") for y in x])

Alignment: ['m', 'ã', '-', '-']
Structure: ['C', 'V', '0', '0']

returns the following error:

2023-06-07 10:58:33,449 [WARNING] alignment and structure do not match in 3603
3605 C V 0 0 | m ã - - | m ã

So, the basic question is: Assuming that the presence of "-" in the alignments is correct (which I do) - what is the correct way of representing them in the structures column?

LinguList · 2023-06-07T10:12:44Z

The misunderstanding here is that the structure mirrors the tokens, not the alignment. The alignment can always be changed, so the tokens are the reference point for the structure, and the check whether the structure fits the alignment is done by adding gaps internally where they are needed.
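The idea of adding gaps internally can be sketched as follows. This is not lingrex's actual code; project_structure is a hypothetical helper that copies the gaps of an aligned row into a structure that has one label per token.

```python
def project_structure(alignment, structure, gap="-"):
    """Copy the gaps of an aligned row into its token-level structure."""
    n_segments = sum(1 for seg in alignment if seg != gap)
    if n_segments != len(structure):
        raise ValueError(
            "structure has {0} labels, alignment has {1} segments".format(
                len(structure), n_segments
            )
        )
    labels = iter(structure)
    return [gap if seg == gap else next(labels) for seg in alignment]


alignment = ["m", "ã", "-", "-"]

# One label per token: the gaps are filled in from the alignment.
good = project_structure(alignment, ["C", "V"])

# One label per alignment column, gaps included: the counts no longer agree.
try:
    project_structure(alignment, ["C", "V", "0", "0"])
    mismatch = None
except ValueError as error:
    mismatch = str(error)

print(good)
print(mismatch)
```

Under this reading, a structure such as C V 0 0 for the tokens m ã fails because it carries labels for the gaps as well.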

LinguList · 2023-06-07T10:13:52Z

So the error in the line is:

alms.add_entries("structure", "alignment", lambda x: " ".join([token2class(y, "cv") for y in x if y != " "]))

which should be

alms.add_entries("structure", "tokens", lambda x: " ".join([token2class(y, "cv") for y in x if y != " "]))

assuming that tokens is already the new entry.

LinguList · 2023-06-07T10:15:10Z

But even there, the lambda is a bit problematic, better use:

alms.add_entries("structure", "tokens", lambda x: tokens2class(x, "cv"))

Structure is expected to be a list internally, and this will be explicitly checked.

LinguList · 2023-06-07T10:20:53Z

I thought I made this clear with my example, where I used tokens and not alignment.

FredericBlum · 2023-06-07T12:23:25Z

Spent another two hours on this, without success (though I made some small progress). Please let's go through this step by step and make sure that I am using the correct formats. I am starting to seriously doubt myself, spending so many hours on this rather small problem that nonetheless has some fundamental concepts behind it.

0. I have space-segmented strings for my reduced alignments, stored in a dictionary.
1. My tokens are space-segmented strings. As I build them from the dictionary, I get a list when calling dict() that I have to join. I also eliminate gaps.

alms.add_entries("tokens", dct, lambda x: " ".join([y for y in x if y != "-"]), override=True)

2. I create the structure based on those new tokens. The structure is a list. As tokens2class requires a list as input, I split the tokens based on space.

alms.add_entries("structure", "tokens", lambda x: tokens2class(x.split(" "), "cv"))

3. I use the new, reduced alignments.

alms.add_entries("alignment", dct, lambda x: " ".join([y for y in x]), override=True)

Tokens: a j a
Alignment: - a j a
Structure: ['V', 'C', 'V']
---
Tokens: a j a
Alignment: - a j a
Structure: ['V', 'C', 'V']

Which leads to the rather new error:

cop = CoPaR(alms, segments="tokens", transcription="ipa", ref="cogid", structure="structure", min_refs=2)

Traceback (most recent call last):                                                                                                                                                           
  File "/Users/blum/Projects/lingreg/example/regularity.py", line 77, in <module>
    cop.get_sites()
  File "/Users/blum/Projects/venv/lingreg/lib/python3.11/site-packages/lingrex/copar.py", line 298, in get_sites
    positions = self.positions_from_prostrings(cogid, _wlid, _alms, _strucs)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/blum/Projects/venv/lingreg/lib/python3.11/site-packages/lingrex/copar.py", line 219, in positions_from_prostrings
    row = [x[i] for x in strucs if x[i] != "-"]
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/blum/Projects/venv/lingreg/lib/python3.11/site-packages/lingrex/copar.py", line 219, in <listcomp>
    row = [x[i] for x in strucs if x[i] != "-"]
                                   ~^^^
IndexError: list index out of range

But we are some 16 code lines advanced, so I assume this is the correct way to progress. But where to now?

I have tried to add print-statements to the copar.py code to see where the error comes from, but I do not understand this part of the code. Some list index cannot be accessed - so there seems to be something wrong with the formats of my data. But where? Did I turn the wrong way with any of my assumptions 0-3?

I attach a new minimum example.

example.zip
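One thing worth checking in the lambdas of this comment is what x actually is, since iterating over a string and iterating over a list behave very differently. It is not certain from the thread which of the two dct holds at this point, so the following is only a sketch with hypothetical helper names (as_segments, tokens_from) that accepts both:

```python
def as_segments(value):
    """Return a list of segments for a space-segmented string or a list."""
    if isinstance(value, str):
        return value.split()
    return list(value)


def tokens_from(value, gap="-"):
    """Build a space-segmented token string without gaps."""
    return " ".join(seg for seg in as_segments(value) if seg != gap)


as_string = "- a j a"
as_list = ["-", "a", "j", "a"]

# Iterating over the string directly walks through characters, spaces included.
naive = " ".join([y for y in as_string if y != "-"])

print(repr(naive))
print(repr(tokens_from(as_string)))
print(repr(tokens_from(as_list)))
```

If the stored value were a string, the direct iteration would produce leading and repeated spaces, whereas normalising to a list first gives the same result for both input types.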

LinguList · 2023-06-07T12:50:25Z

But this example is again with erroneous alignments, which are preceded by a space!

LinguList · 2023-06-07T12:50:41Z

If you check the file in edictor, you would see this directly.

FredericBlum · 2023-06-07T12:53:43Z

This is visually misleading due to the way I print the output. There are no spaces, and the alignments look fine in Edictor.

LinguList · 2023-06-07T13:00:57Z

from lingpy import Wordlist, Alignments
from lingrex.copar import CoPaR
from lingrex.util import prep_wordlist
from lingpy.read.qlc import reduce_alignment
from lingpy.sequence.sound_classes import tokens2class
from lingpy import basictypes
from lingrex.util import add_structure

data = Wordlist("minimum_data.tsv")
wordlist = prep_wordlist(data)
alms = Alignments(wordlist, ref="cogid", transcription="tokens")


dct = {}
for idx, msa in alms.msa["cogid"].items():
    msa_reduced = []
    for site in msa["alignment"]:
        reduced = reduce_alignment([site])[0]
        msa_reduced.append(reduced)
    for i, row in enumerate(msa_reduced):
        dct[msa["ID"][i]] = row

alms.add_entries("tokens", dct, lambda x: basictypes.lists([y for y in x if y != "-"]), override=True)
#alms.add_entries("ipa", dct, lambda x: "".join([y for y in x if y != "-"]), override=True)
alms.add_entries("alignment", dct, lambda x: basictypes.lists(x), override=True)
#alms.add_entries("structure", "tokens", lambda x:
#                 basictypes.lists(tokens2class(x, "cv")))
add_structure(alms)
alms.add_alignments()

for x in alms:
    print("Tokens:", alms[x, "tokens"])
    print("Alignment:", alms[x, "alignment"])
    print("Structure:", alms[x, "structure"])
    print(alms[x])
    print("---")

alms.output("tsv", filename="tmp")

cop = CoPaR(alms, segments="tokens", transcription="tokens", ref="cogid", structure="structure")
cop.get_sites()

cop2 = CoPaR("tmp.tsv", segments="tokens", transcription="tokens", ref="cogid",
             structure="structure")
cop2.get_sites()

This works and illustrates the problem.

LinguList · 2023-06-07T13:01:35Z

you MUST save the file before loading in copar. We have always done this, since the internal representation of alignments also needs to be recalculated here, but it is not, so saving triggers this.

LinguList · 2023-06-07T13:02:25Z

You also had wrong representations of things as strings. And I think my remark on your file holds, even if lingpy does strip off spaces at the end and the beginning.

FredericBlum · 2023-06-07T13:27:42Z

you MUST save the file before loading in copar. We have always done this, since the internal representation of alignments also needs to be recalculated here, but it is not, so saving triggers this.

Thanks for sticking with me through this, that did the trick. If you agree, I'd propose to create a PR modifying the docstring of CoPaR, which currently reads as this:

class CoPaR(Alignments):
    """Correspondence Pattern Recognition class

    Parameters
    ----------
    wordlist : ~lingpy.basic.wordlist.Wordlist
        A wordlist object which should have a column for segments and a column
        for cognate sets. Since the class inherits from LingPy's
        Alignments-class, the same kind of data should be submitted.

For me, this reads as if it takes a python object (Wordlist/Alignment), not a file. Is this due to my reading, or due to a potentially confusing description?

LinguList · 2023-06-07T19:12:15Z

This should definitely be changed, but I should also see if I cannot fix this internally, since it should then either throw an error if one does not load from file, or one should make sure to fix the problem with the types.

LinguList · 2023-06-07T19:13:46Z

So it is an issue that is annoying in lingrex and we should find ways to avoid it in general. The passing of a wordlist to classes derived from wordlists like Alignments and CoPaR is generally difficult and has been questioned, specifically since we have one init-function for all.

LinguList · 2023-06-07T19:15:48Z

So we can say: lingrex should for now at least fix the issue above and get the example file working for both cases, not just for one ;-)

FredericBlum mentioned this issue Nov 7, 2024

Edit Correspondence patterns not functional digling/edictor#223
