Add structure from existing alignment #41

Open
FredericBlum opened this issue Jun 5, 2023 · 46 comments

Comments

@FredericBlum
Collaborator

I am currently running into trouble while trying to create a CV structure based on pre-existing alignments with manually trimmed data. Take, for example, the following data:

ID	DOCULECT	CONCEPT	VALUE	FORM	TOKEN	COGID	ALIGNMENT
1	Marubo	type of bank	kɨnã	kɨnã	k ɨ n ã	60	['k', 'ɨ', 'n', 'ã', '-', '(', '-', '-', ')']
2	Chakobo	type of bank	kɨˈnanɨ	kɨˈnanɨ	k ɨ n a n ɨ	60	['k', 'ɨ', 'n', 'a', '-', '(', 'n', 'ɨ', ')']

Is there any pre-defined way of doing this, or would I need to create a new tokens-column based on the alignments, removing all content within brackets in a dynamic way? @LinguList Maybe you can help me on this.
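A minimal sketch of the "remove everything within brackets" step for a single aligned row, in plain Python and independent of lingpy. The function name `strip_bracketed` is hypothetical, and the sketch assumes the row is already a list of segments with `(` and `)` as separate items:

```python
def strip_bracketed(row):
    """Return the segments of one aligned row that lie outside brackets."""
    kept = []
    inside = False
    for seg in row:
        if seg == "(":
            inside = True        # start of the part marked as ignored
        elif seg == ")":
            inside = False       # end of the part marked as ignored
        elif not inside:
            kept.append(seg)
    return kept


# The Chakobo row from the example data above
row = ["k", "ɨ", "n", "a", "-", "(", "n", "ɨ", ")"]
trimmed = strip_bracketed(row)                    # aligned row, brackets dropped
tokens = [seg for seg in trimmed if seg != "-"]   # same row without gaps
print(trimmed, tokens)
```

This works row by row, so it does not check that the brackets sit in the same columns across all rows of a cognate set.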

@LinguList
Contributor

I think you need to create intermediate strings in which everything is already trimmed.

@LinguList
Contributor

But in order to make them, you can iterate over the alignment:

for cogid, msa in alms.msa["cogid"].items():
    print(msa["alignment"])
    print(msa)

The msa-object that is generated for you here stores the alignments, I think it even automatically only retains those that are not inside brackets!

@LinguList
Contributor

But I'd ask you to test for this.

@LinguList
Contributor

Please check also this function, that could otherwise be applied to the alignment object: https://lingpy.org/reference/lingpy.align.html#lingpy.align.sca.Alignments.reduce_alignments

@LinguList
Contributor

So, please check alms.reduce_alignments and how it behaves, as I do not remember if it returns a list of reduced alignments or if it reduces alignments in the alms.msa["cogid"] attribute, etc.

@FredericBlum
Collaborator Author

Not much happens. It applies directly to alms and adds a _alignment entry in the Dictionary, but nothing is reduced. Even after looking at the code, I do not understand where this would proceed.

@LinguList
Contributor

Please use a minimal example and share it here (you can zip it). We only need 3 cogids with 3 alignments that are "to be reduced".

@LinguList
Contributor

And please check the import statement in align/sca.py in lingpy:

from lingpy.read.qlc import read_msa, normalize_alignment, reduce_alignment

So you have the function in read.qlc!

@LinguList
Contributor

def reduce_alignment(alignment):
    """
    Function reduces a given alignment.
    
    Notes
    -----
    Reduction here means that the output alignment consists only of those parts
    which have not been marked to be ignored by the user (parts in brackets).
    It requires that all data is properly coded. If reduction fails, this will
    throw a warning, and all brackets are simply removed in the output
    alignment.
    """

    # check for bracket indices in all columns
    cols = misc.transpose(alignment)

    ignore_indices = []
    ignore = False
    for i, col in enumerate(cols):
        reduced_col = sorted(set(col))

        if '(' in reduced_col:
            if len(reduced_col) == 1:
                ignore_indices += [i]
                ignore = True
            else:
                ignore = False
        elif ')' in reduced_col:
            if len(reduced_col) == 1:
                ignore_indices += [i]
                ignore = False
            else:
                ignore_indices = []
        elif ignore:
            ignore_indices += [i]

    if ignore_indices:
        new_cols = []
        for i, col in enumerate(cols):
            if i not in ignore_indices:
                new_cols += [col]
    else:
        new_cols = cols

    new_alm = misc.transpose(new_cols)

    for i, alm in enumerate(new_alm):
        for j, char in enumerate(alm):
            if char in '()':
                new_alm[i][j] = '-'

    return new_alm

@LinguList
Contributor

The alignment here is a simple array, so something like msa["alignment"] is what you can pass to it. This should reduce the alignment, and from the reduced alignment you can then also store the tokens, [x for x in alm if x != "-"], for each reduced alignment.

It is probably easier to integrate all of this later into the Sites class in lingrex, but it is not that difficult to make a small preprocessing with pure lingpy here.

@LinguList
Contributor

from lingpy.read.qlc import reduce_alignment
from lingpy import basictypes

dct = {}
for idx, msa in alms.msa["cogid"].items():
    reduced = reduce_alignment(msa["alignment"])
    for i, row in enumerate(reduced):
        dct[msa["ID"][i]] = row

alms.add_entries("old_tokens", "tokens", lambda x: x)
alms.add_entries("tokens", dct, lambda x: basictypes.lists([y for y in x if y != "-"]), override=True)

# same for alignments

@FredericBlum
Collaborator Author

minimum_example.zip

Here comes the minimum example based on your code, showcasing the problem. There is no reduction apparently.

@LinguList
Contributor

You DO realize that you have python lists typed into the alignment column, not space-segmented strings?

@LinguList
Contributor

['m', 'ã', 'n', '(', '-', ')', 'ã', '-', '(', '-', '-', '-', '-', ')']

@LinguList
Contributor

That is the first alignment ;-)
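To illustrate the difference between a Python list typed into a cell and a space-segmented string, here is a small sketch that converts the former into the latter with the standard library. The helper name `list_literal_to_segments` is hypothetical, and it assumes the cell holds a valid Python list literal:

```python
import ast


def list_literal_to_segments(cell):
    """Convert a cell like "['k', 'ɨ']" into the string "k ɨ"."""
    segments = ast.literal_eval(cell)   # safely parses the literal, no eval()
    if not isinstance(segments, list):
        raise ValueError("expected a list literal, got {!r}".format(cell))
    return " ".join(str(seg) for seg in segments)


cell = "['m', 'ã', 'n', '(', '-', ')', 'ã', '-', '(', '-', '-', '-', '-', ')']"
print(list_literal_to_segments(cell))
```

Fixing the export that writes the file is usually the cleaner option; a conversion like this is only a stopgap for data that already exists in the wrong shape.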

@FredericBlum
Collaborator Author

FredericBlum commented Jun 6, 2023

But isn't that how reduce_alignment works, taking the whole msa["alignment"] as input? If I run the following code:

for idx, msa in alms.msa["cogid"].items():
    for alg in msa["alignment"]:
        print("Alignment:", alg)
        reduce_alignment(alg)

I receive an error like this:

(lingreg) blum@lingn45 example % python regularity.py
Alignment: ['m', 'ã', 'n', '(', '-', ')', 'ã', '-', '(', '-', '-', '-', '-', ')']
Traceback (most recent call last):
  File "/Users/blum/Projects/lingreg/example/regularity.py", line 42, in <module>
    reduce_alignment(alg)
  File "/Users/blum/Projects/venv/lingreg/lib/python3.11/site-packages/lingpy/read/qlc.py", line 23, in reduce_alignment
    cols = misc.transpose(alignment)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/blum/Projects/venv/lingreg/lib/python3.11/site-packages/lingpy/algorithm/cython/_misc.py", line 20, in transpose
    out = [[matrix[i][j] for i in range(lA)] for j in range(lB)]
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/blum/Projects/venv/lingreg/lib/python3.11/site-packages/lingpy/algorithm/cython/_misc.py", line 20, in <listcomp>
    out = [[matrix[i][j] for i in range(lA)] for j in range(lB)]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/blum/Projects/venv/lingreg/lib/python3.11/site-packages/lingpy/algorithm/cython/_misc.py", line 20, in <listcomp>
    out = [[matrix[i][j] for i in range(lA)] for j in range(lB)]
            ~~~~~~~~~^^^
IndexError: string index out of range

This does not happen if I use the whole msa as input.

@LinguList
Contributor

    cols = misc.transpose(alignment)

is a function that only pertains to one matrix (2-dim list). The requirement is that the length of all rows is identical per alignment. If it throws an index error, it means one of your rows (your strings, your aligned words) is of different length.

The function clearly does not take an msa-dictionary as input.

@LinguList
Contributor

The error is again in your data.

In [1]: from lingpy.read.qlc import reduce_alignment

In [3]: reduce_alignment([["1", "(", "-", ")"], ["2", "(", "2", ")"]])
Out[3]: [['1'], ['2']]
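For readers without lingpy at hand, the following is a self-contained transcription of the reduce_alignment logic quoted earlier in this thread. It is a sketch, not the library code: `transpose` is a zip-based stand-in for `misc.transpose` (assumed to behave the same for rectangular input), and the name `reduce_alignment_sketch` is hypothetical.

```python
def transpose(matrix):
    """Turn rows into columns; assumes all rows have the same length."""
    return [list(col) for col in zip(*matrix)]


def reduce_alignment_sketch(alignment):
    """Drop the columns enclosed by all-bracket columns."""
    cols = transpose(alignment)
    ignore_indices = []
    ignore = False
    for i, col in enumerate(cols):
        values = set(col)
        if "(" in values:
            if len(values) == 1:        # every row opens a bracket here
                ignore_indices.append(i)
                ignore = True
            else:
                ignore = False
        elif ")" in values:
            if len(values) == 1:        # every row closes the bracket here
                ignore_indices.append(i)
                ignore = False
            else:                       # inconsistent coding: give up reducing
                ignore_indices = []
        elif ignore:
            ignore_indices.append(i)
    kept = [col for i, col in enumerate(cols) if i not in ignore_indices]
    rows = transpose(kept)
    # brackets that survived (inconsistently coded data) become gaps
    return [["-" if seg in "()" else seg for seg in row] for row in rows]


print(reduce_alignment_sketch([["1", "(", "-", ")"], ["2", "(", "2", ")"]]))
```

Because the brackets are detected per column, the input has to be the full matrix of one cognate set, with the brackets in the same columns in every row.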

@LinguList
Contributor

To avoid that such errors occur, you must test:

if len(set([len(row) for row in msa["alignment"]])) != 1:
    print("problem in alignment {0}".format(cogid))
else:
    ...
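The same length check can be sketched as a standalone helper that collects all problematic cognate sets at once. Here `msas` is a plain dictionary from cognate ID to a list of aligned rows, a hypothetical stand-in for the alignment matrices stored in alms.msa["cogid"]:

```python
def ragged_alignments(msas):
    """Yield (cogid, row_lengths) for alignments whose rows differ in length."""
    for cogid, rows in msas.items():
        lengths = [len(row) for row in rows]
        if len(set(lengths)) != 1:
            yield cogid, lengths


msas = {
    60: [["k", "ɨ", "n", "ã", "-"], ["k", "ɨ", "n", "a", "-"]],
    81: [["k", "a", "r", "i"], ["k", "a", "ɾ"]],  # deliberately one segment short
}
problems = list(ragged_alignments(msas))
for cogid, lengths in problems:
    print("problem in alignment {0}: row lengths {1}".format(cogid, lengths))
```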

@FredericBlum
Collaborator Author

You DO realize that you have python lists typed into the alignment column, not space-segmented strings?

Now I understood what you meant! And why I was so confused. I adapted the CLDF conversion of the dataset so that now space-segmented strings are added, not a python list. Adding the structure based on the reduced alignment works now:

dct = {}
for idx, msa in alms.msa["cogid"].items():
    msa_reduced = []
    for site in msa["alignment"]:
        # print("Alignment:", site)
        reduced = reduce_alignment([site])[0]
        msa_reduced.append(reduced)
    for i, row in enumerate(msa_reduced):
        dct[msa["ID"][i]] = "".join(row)

alms.add_entries("old_tokens", "tokens", lambda x: x)
alms.add_entries("tokens", dct, lambda x: [y for y in x], override=True)
alms.add_entries("structure", "tokens", lambda x: " ".join(Sites([x]).soundclasses))

@FredericBlum
Collaborator Author

However, I now have problems matching alignment, tokens, and structure. I played around with the lambda expressions and came up with the following:

alms.add_entries("tokens", dct, lambda x: [y for y in x if y != "-"], override=True)
alms.add_entries("alignment", dct, lambda x: [y for y in x], override=True)
alms.add_entries("structure", "alignment", lambda x: " ".join(Sites([x]).soundclasses))

Now I get errors that the alignment and the structure do not match:

2023-06-07 10:15:51,820 [WARNING] alignment and structure do not match in 6245
6251 C V + C V C V | ʂ o - t o k o | ʂ o t o k o
2023-06-07 10:15:51,820 [WARNING] alignment and structure do not match in 6251
6258 C V + C V C V | ʃ u - t a k u | ʃ u t a k u
2023-06-07 10:15:51,820 [WARNING] alignment and structure do not match in 6258
3601 C V + + | m ã - - | m ã
2023-06-07 10:15:51,820 [WARNING] alignment and structure do not match in 3601
3603 C V + + | m ã - - | m ã
2023-06-07 10:15:51,820 [WARNING] alignment and structure do not match in 3603
3605 C V + + | m ã - - | m ã

I strongly suspect that this is due to the "+" in the structure column. I do not understand how I need to adapt the "add_entries" command to successfully match both. Any hints at what could solve this?

@FredericBlum
Collaborator Author

Adding "0" instead of "+" results in the same error.

@LinguList
Contributor

Yes, you need to re-compute the structure, sorry.

@LinguList
Contributor

But since structure is just CV, it is not that difficult:

struc = tokens2class(tokens, "cv")

@FredericBlum
Collaborator Author

alms.add_entries("tokens", dct, lambda x: "".join(y for y in x if y != "-"), override=True)
alms.add_entries("alignment", dct, lambda x: "".join(y for y in x), override=True)
alms.add_entries("structure", "alignment", lambda x: " ".join([token2class(y, "cv") for y in x if y != " "]))

cop = get_copar(alms, ref="cogid", structure="structure", min_refs=3)

keeps throwing me the same errors, for all data points.

2023-06-07 10:32:21,037 [WARNING] alignment and structure do not match in 6258
2558 C V C C V | k a tʃ i | k a tʃ i
2023-06-07 10:32:21,037 [WARNING] alignment and structure do not match in 2558
2567 C C V 0 V C C V 0 V | k !á/a r !í/i | k !á/a r !í/i
2023-06-07 10:32:21,037 [WARNING] alignment and structure do not match in 2567
3601 C V 0 0 | m ã - - | m ã
2023-06-07 10:32:21,037 [WARNING] alignment and structure do not match in 3601
3603 C V 0 0 | m ã - - | m ã
2023-06-07 10:32:21,037 [WARNING] alignment and structure do not match in 3603
3605 C V 0 0 | m ã - - | m ã

The data itself looks fine to me:

['Yaminawa', 'woman, wife', 'ʂotokoɸakɨ̃', 'ʂotokoɸakɨ̃', 'ʂ o  t o k o', 'None', 375, 'ʂ o - t o k o', 'C V 0 C V C V']
['Yawanawa', 'young woman', 'ʃutaku_βakɨ', 'ʃutaku_βakɨ', 'ʃ u  t a k u', 'None', 375, 'ʃ u - t a k u', 'C V 0 C V C V']
['Amawaka', 'yam', 'kari', 'kari', 'k a r i', 'II', 81, 'k a r i', 'C V C V']
['Chakobo', 'yam', 'ˈkari', 'ˈkari', 'k a r i', 'None', 81, 'k a r i', 'C V C V']
['Chaninawa', 'yam', 'kaɾi', 'kaɾi', 'k a ɾ i', 'None', 81, 'k a ɾ i', 'C V C V']

All three of tokens, alignment, and structure are space-segmented strings, not lists.

@FredericBlum
Collaborator Author

The current setup has another fundamental problem that does not even surface yet: segments such as "ts" are split apart when adding the structure. And it seems like the algorithm has problems with the slash annotation as well:

2023-06-07 10:32:21,037 [WARNING] alignment and structure do not match in 6258
2558 C V C C V | k a tʃ i | k a tʃ i
2023-06-07 10:32:21,037 [WARNING] alignment and structure do not match in 2558
2567 C C V 0 V C C V 0 V | k !á/a r !í/i | k !á/a r !í/i

Should I get back to lists?

@LinguList
Contributor

@tarotis, if the warning says that alignment and structure do not match, it means they don't match. So even if the data looks fine to you, it is wrong, and you should have a look where the problem lies.

@LinguList
Contributor

And the data are not fine. I mean, check this string 'ʂ o  t o k o': it has two spaces! If you have a 0 in the CV sound class conversion, it means lingpy does not know the sound. This points to a problem in the data.
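A quick way to spot this kind of problem is to split on a single space and look for empty segments. The sketch below is plain Python; both helper names are hypothetical:

```python
def find_empty_segments(segmented):
    """Return positions of empty segments caused by doubled or stray spaces."""
    return [i for i, seg in enumerate(segmented.split(" ")) if seg == ""]


def normalize_spaces(segmented):
    """Collapse runs of whitespace and strip leading or trailing spaces."""
    return " ".join(segmented.split())


tokens = "ʂ o  t o k o"              # note the doubled space after "o"
print(find_empty_segments(tokens))   # position of the empty segment
print(normalize_spaces(tokens))
```

An empty segment is not a known sound, so it is a plausible source of an unexpected 0 when the string is converted to sound classes segment by segment.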

@LinguList
Contributor

And your problem is this line:

alms.add_entries("tokens", dct, lambda x: "".join(y for y in x if y != "-"), override=True)

It should be:

alms.add_entries("tokens", dct, lambda x: " ".join([y for y in x.split() if y != "-"]), override=True)

Assuming that your alignment is a string, space-segmented!
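The difference between the two lambdas comes down to iterating over characters versus iterating over segments. A short plain-Python illustration, using a made-up alignment string:

```python
alignment = "k a tʃ - i"

# Iterating over the string itself yields single characters, so the
# affricate "tʃ" is split into two separate items.
by_character = [ch for ch in alignment if ch not in " -"]

# Splitting on whitespace first keeps every segment intact.
by_segment = [seg for seg in alignment.split() if seg != "-"]

print(by_character)
print(by_segment)
print(" ".join(by_segment))   # space-segmented tokens without gaps
```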

@LinguList
Contributor

    for i, row in enumerate(msa_reduced):
        dct[msa["ID"][i]] = "".join(row)

This should be:

        dct[msa["ID"][i]] = row

Then you have a list and not a string, which also does not really make sense.

@FredericBlum
Collaborator Author

The spaces were in the tokens, not in alignments/structure, so they should not have caused any problems. I have removed them, thanks for highlighting that. They were introduced when the joining got out of hand while I was trying to fix this.

The 0's in the structure get inserted based on the gaps, not from any sound, which I guess I did not communicate well. The data produced by this command:

alms.add_entries("structure", dct, lambda x: [token2class(y, "cv") for y in x])
Alignment: ['m', 'ã', '-', '-']
Structure: ['C', 'V', '0', '0']

returns the following error:

2023-06-07 10:58:33,449 [WARNING] alignment and structure do not match in 3603
3605 C V 0 0 | m ã - - | m ã

So, the basic question is: Assuming that the presence of "-" in the alignments is correct (which I do) - what is the correct way of representing them in the structures column?

@LinguList
Contributor

The misunderstanding here is that the structure mimics the tokens, not the alignment. Since the alignment can always be changed, the tokens are the point of orientation for the structure, and the check whether the structure fits the alignment is done by adding gaps internally where they are needed.
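As a rough illustration of that idea (not lingrex's actual implementation), the sketch below spreads a token-level structure over an aligned row by copying the gaps from the alignment. The function name `structure_on_alignment` is hypothetical:

```python
def structure_on_alignment(structure, alignment, gap="-"):
    """Spread a token-level structure over an aligned row, copying gaps."""
    n_segments = sum(1 for seg in alignment if seg != gap)
    if n_segments != len(structure):
        raise ValueError(
            "structure has {0} items, alignment has {1} non-gap segments".format(
                len(structure), n_segments
            )
        )
    labels = iter(structure)
    # take the next structure label for every real segment, keep gaps as gaps
    return [gap if seg == gap else next(labels) for seg in alignment]


print(structure_on_alignment(["C", "V"], ["m", "ã", "-", "-"]))
```

Under this reading, a structure such as C V 0 0 for the tokens m ã has four items for two segments, which is why it cannot be matched against the alignment m ã - -.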

@LinguList
Contributor

So the error in the line is:

alms.add_entries("structure", "alignment", lambda x: " ".join([token2class(y, "cv") for y in x if y != " "]))

which should be

alms.add_entries("structure", "tokens", lambda x: " ".join([token2class(y, "cv") for y in x if y != " "]))

assuming that tokens is already the new entry.

@LinguList
Contributor

But even there, the lambda is a bit problematic, better use:

alms.add_entries("structure", "tokens", lambda x: tokens2class(x, "cv"))

Structure is expected to be a list internally, and it will be explicitly checked.

@LinguList
Contributor

I thought I made this clear with my example, where I used tokens and not alignment.

@FredericBlum
Collaborator Author

Spent another two hours on this without success (though I made small progress). Please let's go through this step by step and make sure that I am using the correct formats. I am starting to seriously doubt myself, having spent so many hours on this rather small problem that nevertheless has some fundamental concepts behind it.

  1. I have space-segmented strings for my reduced alignments, stored in a dictionary.
  2. My tokens are space-segmented strings. As I build them from the dictionary, I get a list when calling dict() that I have to join. I also eliminate gaps.

alms.add_entries("tokens", dct, lambda x: " ".join([y for y in x if y != "-"]), override=True)

  3. I create the structure based on those new tokens. The structure is a list. As tokens2class requires a list as input, I split the tokens based on space.

alms.add_entries("structure", "tokens", lambda x: tokens2class(x.split(" "), "cv"))

  4. I use the new, reduced alignments.

alms.add_entries("alignment", dct, lambda x: " ".join([y for y in x]), override=True)

Tokens: a j a
Alignment: - a j a
Structure: ['V', 'C', 'V']
---
Tokens: a j a
Alignment: - a j a
Structure: ['V', 'C', 'V']

Which leads to the rather new error:

cop = CoPaR(alms, segments="tokens", transcription="ipa", ref="cogid", structure="structure", min_refs=2)
Traceback (most recent call last):
  File "/Users/blum/Projects/lingreg/example/regularity.py", line 77, in <module>
    cop.get_sites()
  File "/Users/blum/Projects/venv/lingreg/lib/python3.11/site-packages/lingrex/copar.py", line 298, in get_sites
    positions = self.positions_from_prostrings(cogid, _wlid, _alms, _strucs)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/blum/Projects/venv/lingreg/lib/python3.11/site-packages/lingrex/copar.py", line 219, in positions_from_prostrings
    row = [x[i] for x in strucs if x[i] != "-"]
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/blum/Projects/venv/lingreg/lib/python3.11/site-packages/lingrex/copar.py", line 219, in <listcomp>
    row = [x[i] for x in strucs if x[i] != "-"]
                                   ~^^^
IndexError: list index out of range

But we got some 16 lines of code further, so I assume this is the correct way to progress. But where to go from here?

  1. I have tried to add print-statements to the copar.py code to see where the error comes from, but I do not understand this part of the code. Some list index cannot be accessed - so there seems to be something wrong with the formats of my data. But where? Did I turn the wrong way with any of my assumptions 0-3?

I attach a new minimum example.

example.zip

@LinguList
Contributor

But this example is again with erroneous alignments, which are preceded by a space!

@LinguList
Contributor

If you check the file in edictor, you would see this directly.

@FredericBlum
Collaborator Author

FredericBlum commented Jun 7, 2023

This is visually misleading due to the way I print the output. There are no spaces, and the alignments look fine in Edictor.

@LinguList
Contributor

from lingpy import Wordlist, Alignments
from lingrex.copar import CoPaR
from lingrex.util import prep_wordlist
from lingpy.read.qlc import reduce_alignment
from lingpy.sequence.sound_classes import tokens2class
from lingpy import basictypes
from lingrex.util import add_structure

data = Wordlist("minimum_data.tsv")
wordlist = prep_wordlist(data)
alms = Alignments(wordlist, ref="cogid", transcription="tokens")


dct = {}
for idx, msa in alms.msa["cogid"].items():
    msa_reduced = []
    for site in msa["alignment"]:
        reduced = reduce_alignment([site])[0]
        msa_reduced.append(reduced)
    for i, row in enumerate(msa_reduced):
        dct[msa["ID"][i]] = row

alms.add_entries("tokens", dct, lambda x: basictypes.lists([y for y in x if y != "-"]), override=True)
#alms.add_entries("ipa", dct, lambda x: "".join([y for y in x if y != "-"]), override=True)
alms.add_entries("alignment", dct, lambda x: basictypes.lists(x), override=True)
#alms.add_entries("structure", "tokens", lambda x:
#                 basictypes.lists(tokens2class(x, "cv")))
add_structure(alms)
alms.add_alignments()

for x in alms:
    print("Tokens:", alms[x, "tokens"])
    print("Alignment:", alms[x, "alignment"])
    print("Structure:", alms[x, "structure"])
    print(alms[x])
    print("---")

alms.output("tsv", filename="tmp")

cop = CoPaR(alms, segments="tokens", transcription="tokens", ref="cogid", structure="structure")
cop.get_sites()

cop2 = CoPaR("tmp.tsv", segments="tokens", transcription="tokens", ref="cogid",
             structure="structure")
cop2.get_sites()

This works and illustrates the problem.

@LinguList
Contributor

you MUST save the file before loading in copar. We have always done this, since the internal representation of alignments also needs to be recalculated here, but it is not, so saving triggers this.

@LinguList
Contributor

You also had wrong representations of things as strings. And I think my remark on your file holds, even if lingpy does strip off spaces at the end and the beginning.
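A small sanity check along these lines can catch such representation problems before the data reach CoPaR. This is a plain-Python sketch with a hypothetical helper name, assuming that tokens, alignment, and structure are each meant to be lists of segments:

```python
def check_row(tokens, alignment, structure, gap="-"):
    """Return a list of human-readable problems found in one wordlist row."""
    problems = []
    columns = (("tokens", tokens), ("alignment", alignment), ("structure", structure))
    for name, value in columns:
        if isinstance(value, str):
            problems.append("{0} is a string, expected a list of segments".format(name))
        elif any(seg == "" or seg != seg.strip() for seg in value):
            problems.append("{0} contains empty or padded segments".format(name))
    if problems:
        return problems   # the remaining checks need well-formed lists
    if gap in tokens:
        problems.append("tokens contain gaps")
    if [seg for seg in alignment if seg != gap] != list(tokens):
        problems.append("alignment without gaps differs from tokens")
    if len(structure) != len(tokens):
        problems.append("structure and tokens differ in length")
    return problems


print(check_row(["a", "j", "a"], ["-", "a", "j", "a"], ["V", "C", "V"]))
print(check_row("a j a", ["-", "a", "j", "a"], ["V", "C", "V"]))
```

An empty result only means the three columns are mutually consistent in shape; it says nothing about whether the alignment itself is linguistically sensible.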

@FredericBlum
Collaborator Author

you MUST save the file before loading in copar. We have always done this, since the internal representation of alignments also needs to be recalculated here, but it is not, so saving triggers this.

Thanks for sticking with me through this, that did the trick. If you agree, I'd propose to create a PR modifying the docstring of CoPaR, which currently reads as follows:

class CoPaR(Alignments):
    """Correspondence Pattern Recognition class

    Parameters
    ----------
    wordlist : ~lingpy.basic.wordlist.Wordlist
        A wordlist object which should have a column for segments and a column
        for cognate sets. Since the class inherits from LingPy's
        Alignments-class, the same kind of data should be submitted.

For me, this reads as if it takes a python object (Wordlist/Alignment), not a file. Is this due to my reading, or due to a potentially confusing description?

@LinguList
Contributor

This should definitely be changed, but I should also see if I cannot fix this internally, since it should then either throw an error if one does not load from file, or one should make sure to fix the problem with the types.

@LinguList
Contributor

So it is an issue that is annoying in lingrex and we should find ways to avoid it in general. The passing of a wordlist to classes derived from wordlists like Alignments and CoPaR is generally difficult and has been questioned, specifically since we have one init-function for all.

@LinguList
Contributor

So we can say: lingrex should for now at least fix the issue above and get the example file working for both cases, not just for one ;-)
