Add structure from existing alignment #41
Comments
I think you need to make the immediate strings that are all trimmed. |
But in order to make them, you can iterate over the alignment:
for cogid, msa in alms.msa["cogid"].items():
    print(msa["alignment"])
    print(msa)
The msa-object that is generated for you here stores the alignments, I think it even automatically retains only those that are not inside brackets! |
But I'd ask you to test for this. |
Please check also this function, that could otherwise be applied to the alignment object: https://lingpy.org/reference/lingpy.align.html#lingpy.align.sca.Alignments.reduce_alignments |
So, please check |
Not much happens. It applies directly to |
Please use a minimal example and share it here (you can zip it). We only need 3 cogids with 3 alignments that are "to be reduced". |
And please check the import statement in the script on
So you have the function in read.qlc! |
def reduce_alignment(alignment):
    """
    Function reduces a given alignment.

    Notes
    -----
    Reduction here means that the output alignment consists only of those parts
    which have not been marked to be ignored by the user (parts in brackets).
    It requires that all data is properly coded. If reduction fails, this will
    throw a warning, and all brackets are simply removed in the output
    alignment.
    """
    # check for bracket indices in all columns
    cols = misc.transpose(alignment)
    ignore_indices = []
    ignore = False
    for i, col in enumerate(cols):
        reduced_col = sorted(set(col))
        if '(' in reduced_col:
            if len(reduced_col) == 1:
                ignore_indices += [i]
                ignore = True
            else:
                ignore = False
        elif ')' in reduced_col:
            if len(reduced_col) == 1:
                ignore_indices += [i]
                ignore = False
            else:
                ignore_indices = []
        elif ignore:
            ignore_indices += [i]
    if ignore_indices:
        new_cols = []
        for i, col in enumerate(cols):
            if i not in ignore_indices:
                new_cols += [col]
    else:
        new_cols = cols
    new_alm = misc.transpose(new_cols)
    for i, alm in enumerate(new_alm):
        for j, char in enumerate(alm):
            if char in '()':
                new_alm[i][j] = '-'
    return new_alm |
The alignment here is a simple array, so it is like the It is probably easier to integrate all of this later into the Sites class in lingrex, but it is not that difficult to make a small preprocessing with pure lingpy here. |
from lingpy.read.qlc import reduce_alignment
from lingpy import basictypes
dct = {}
for idx, msa in alms.msa["cogid"].items():
    reduced = reduce_alignment(msa["alignment"])
    for i, row in enumerate(reduced):
        dct[msa["ID"][i]] = row
alms.add_entries("old_tokens", "tokens", lambda x: x)
alms.add_entries("tokens", dct, lambda x: basictypes.lists([y for y in x if y != "-"]), override=True)
# same for alignments |
Here comes the minimum example based on your code, showcasing the problem. There is no reduction apparently. |
You DO realize that you have python lists typed into the alignment column, not space-segmented strings? |
|
That is the first alignment ;-) |
But isn't that how
I receive an error like this:
This does not happen if I use the whole msa as input. |
is a function that only pertains to one matrix (2-dim list). The requirement is that the length of all rows is identical per alignment. If it throws an index error, it means one of your rows (your strings, your aligned words) is of different length. The function clearly does not take an msa-dictionary as input. |
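The equal-length requirement described above can be checked before calling the function. The following is a minimal sketch in plain Python; `find_ragged_rows` is a hypothetical helper written for illustration, not part of lingpy, and it assumes each alignment is a list of rows where every row is a list of segments.

```python
def find_ragged_rows(alignment):
    """Return the indices of rows whose length differs from the first row.

    Hypothetical helper (not a lingpy function). `alignment` is assumed to be
    a two-dimensional list: one list of segments per aligned word.
    """
    if not alignment:
        return []
    expected = len(alignment[0])
    return [i for i, row in enumerate(alignment) if len(row) != expected]


# Both rows have four columns, so nothing is reported.
good = [["1", "(", "-", ")"], ["2", "(", "2", ")"]]
# The second row is missing its closing bracket and is one column short.
bad = [["1", "(", "-", ")"], ["2", "(", "2"]]

print(find_ragged_rows(good))  # []
print(find_ragged_rows(bad))   # [1]
```

Running such a check over every alignment first should point to the offending cognate set directly, instead of surfacing as an index error inside the reduction.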
The error is again in your data.
In [1]: from lingpy.read.qlc import reduce_alignment
In [3]: reduce_alignment([["1", "(", "-", ")"], ["2", "(", "2", ")"]])
Out[3]: [['1'], ['2']] |
To avoid that such errors occur, you must test:
|
Now I understood what you meant! And why I was so confused. I adapted the CLDF conversion of the dataset so that now space-segmented strings are added, not a python list. Adding the structure based on the reduced alignment works now:
|
However, I now have problems matching alignment, tokens, and structure. I played around with the lambda expressions and came up with the following:
Now I get errors that the alignment and the structure do not match:
I strongly suspect that this is due to the "+" in the structure column. I do not understand how I need to adapt the "add_entries" command to successfully match both. Any hints at what could solve this? |
Adding "0" instead of "+" results in the same error. |
Yes, you need to re-compute the structure, sorry. |
But since structure is just CV, it is not that difficult:
|
keeps throwing me the same errors, for all data points.
The data itself looks fine to me:
All three of tokens, alignment, and structure are space-segmented strings, not lists. |
The current setup has another fundamental problem that does not even surface as of yet: segments such as "ts" are separated when adding the structure. And it seems like the algorithm has problems with the slash annotation as well:
Should I get back to lists? |
@tarotis, if the warning says that alignment and structure do not match, it means they don't match. So even if the data looks fine to you, it is wrong, and you should have a look where the problem lies. |
And the data are not fine, I mean, check this string |
And your problem is this line:
alms.add_entries("tokens", dct, lambda x: "".join(y for y in x if y != "-"), override=True)
It should be:
alms.add_entries("tokens", dct, lambda x: " ".join([y for y in x.split() if y != "-"]), override=True)
assuming that your alignment is a space-segmented string! |
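The difference between the two lambdas can be seen in isolation, without lingpy. This is a small sketch in plain Python. The function names and the example string are made up for illustration, and the only assumption is that the alignment is a space-segmented string with "-" as the gap symbol.

```python
def alignment_to_tokens(alignment):
    """Split on whitespace, drop gaps, and re-join with single spaces."""
    return " ".join(seg for seg in alignment.split() if seg != "-")


def alignment_to_tokens_charwise(alignment):
    """Iterate character by character, as the first lambda does on a string."""
    return "".join(ch for ch in alignment if ch != "-")


aligned = "ts a - k"  # made-up example: four columns, one gap

print(repr(alignment_to_tokens(aligned)))           # 'ts a k'
print(repr(alignment_to_tokens_charwise(aligned)))  # 'ts a  k'
```

The character-wise version removes the gap symbol but keeps both spaces around it, which leaves a doubled space in the tokens. If the same lambda receives a list of segments instead of a string, `"".join` glues everything together and the segment boundaries are lost.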
for i, row in enumerate(msa_reduced):
    dct[msa["ID"][i]] = "".join(row)
This should be:
Then you have a list and not a string, which also does not really make sense. |
The spaces were in the tokens, not in alignments/structure, so they should not have caused any problems. I have removed them, thanks for highlighting that. They were introduced because the joining got out of hand while I was trying to fix this. The 0's in the structure get inserted based on the gaps, not from any sound, which I guess I did not communicate well. The data produced by this command:
returns the following error :
So, the basic question is: Assuming that the presence of "-" in the alignments is correct (which I do) - what is the correct way of representing them in the structures column? |
The misunderstanding here is that the structure mirrors the tokens: since the alignment can always be changed, the tokens are the reference point for the structure, and the check whether the structure fits the alignment is done by adding gaps internally where they are needed. |
So the error is in this line:
alms.add_entries("structure", "alignment", lambda x: " ".join([token2class(y, "cv") for y in x if y != " "]))
which should be
alms.add_entries("structure", "tokens", lambda x: " ".join([token2class(y, "cv") for y in x if y != " "]))
assuming that tokens is already the new entry. |
But even there, the lambda is a bit problematic; better use:
alms.add_entries("structure", "tokens", lambda x: tokens2class(x, "cv"))
Structure is expected to be a list internally, and this will be explicitly checked. |
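The relation between tokens, structure, and alignment described here can be sketched in plain Python. `pad_structure` is a hypothetical helper and not the check that lingrex actually runs. It only illustrates the idea that the structure has one entry per token and that gaps are added afterwards, at the positions where the alignment has them.

```python
def pad_structure(structure, alignment, gap="-"):
    """Insert gap symbols into `structure` wherever `alignment` has a gap.

    Hypothetical helper for illustration. `structure` must have exactly one
    entry per non-gap segment of `alignment`, otherwise the two do not match.
    """
    ungapped = [seg for seg in alignment if seg != gap]
    if len(ungapped) != len(structure):
        raise ValueError(
            "structure has {0} entries, alignment has {1} segments".format(
                len(structure), len(ungapped)
            )
        )
    entries = iter(structure)
    return [gap if seg == gap else next(entries) for seg in alignment]


# Made-up example: structure follows the tokens, not the alignment.
tokens = ["ts", "a", "k"]
structure = ["C", "V", "C"]
alignment = ["ts", "a", "-", "k"]

print(pad_structure(structure, alignment))  # ['C', 'V', '-', 'C']
```

Under this view, a structure computed from the alignment column already contains entries for the gaps, so it is longer than the tokens and the comparison fails.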
I thought I made this clear with my example, where I used |
Spent another two hours on this, without success (but I made small progress). Please let's go through this step by step and make sure that I am using the correct formats. I am starting to seriously doubt myself, spending so many hours on this rather small problem, but with some fundamental concepts behind it.
Which leads to the rather new error:
But we are some 16 code lines advanced, so I assume this is the correct way to progress. But where to now?
I attach a new minimum example. |
But this example is again with erroneous alignments, which are preceded by a space! |
If you check the file in edictor, you would see this directly. |
This is visually misleading due to the way I print the output. There are no spaces, and the alignments look fine in Edictor. |
from lingpy import Wordlist, Alignments
from lingrex.copar import CoPaR
from lingrex.util import prep_wordlist
from lingpy.read.qlc import reduce_alignment
from lingpy.sequence.sound_classes import tokens2class
from lingpy import basictypes
from lingrex.util import add_structure
data = Wordlist("minimum_data.tsv")
wordlist = prep_wordlist(data)
alms = Alignments(wordlist, ref="cogid", transcription="tokens")
dct = {}
for idx, msa in alms.msa["cogid"].items():
    msa_reduced = []
    for site in msa["alignment"]:
        reduced = reduce_alignment([site])[0]
        msa_reduced.append(reduced)
    for i, row in enumerate(msa_reduced):
        dct[msa["ID"][i]] = row
alms.add_entries("tokens", dct, lambda x: basictypes.lists([y for y in x if y != "-"]), override=True)
#alms.add_entries("ipa", dct, lambda x: "".join([y for y in x if y != "-"]), override=True)
alms.add_entries("alignment", dct, lambda x: basictypes.lists(x), override=True)
#alms.add_entries("structure", "tokens", lambda x:
#        basictypes.lists(tokens2class(x, "cv")))
add_structure(alms)
alms.add_alignments()
for x in alms:
    print("Tokens:", alms[x, "tokens"])
    print("Alignment:", alms[x, "alignment"])
    print("Structure:", alms[x, "structure"])
    print(alms[x])
    print("---")
alms.output("tsv", filename="tmp")
cop = CoPaR(alms, segments="tokens", transcription="tokens", ref="cogid", structure="structure")
cop.get_sites()
cop2 = CoPaR("tmp.tsv", segments="tokens", transcription="tokens", ref="cogid",
        structure="structure")
cop2.get_sites()
This works and illustrates the problem. |
You MUST save the file before loading it in CoPaR. We have always done this: the internal representation of the alignments needs to be recalculated here, but it is not, and saving and reloading triggers this. |
You also had wrong representations of things as strings. And I think my remark on your file holds, even if lingpy does strip off spaces at the end and the beginning. |
Thanks for sticking with me through this, that did the trick. If you agree, I'd propose to create a PR modifying the docstring of CoPaR, which currently reads as this:
For me, this reads as if it takes a python object (Wordlist/Alignment), not a file. Is this due to my reading, or due to a potentially confusing description? |
This should definitely be changed, but I should also see if I cannot fix this internally, since it should then either throw an error if one does not load from file, or one should make sure to fix the problem with the types. |
So it is an issue that is annoying in lingrex and we should find ways to avoid it in general. The passing of a wordlist to classes derived from wordlists like Alignments and CoPaR is generally difficult and has been questioned, specifically since we have one init-function for all. |
So we can say: lingrex should for now at least fix the issue above and get the example file working for both cases, not just for one ;-) |
I am currently running into trouble while trying to create a CV structure based on pre-existing alignments with manually trimmed data. Let's use for example the following data:
Is there any pre-defined way of doing this, or would I need to create a new tokens-column based on the alignments, removing all content within brackets in a dynamic way? @LinguList Maybe you can help me on this.
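The "dynamic" removal of bracketed content asked about here can be sketched for a single row in plain Python. This is an illustration only: `strip_bracketed` is a hypothetical helper, the example row is made up, and it works on one aligned word at a time, whereas `reduce_alignment` in lingpy operates column-wise on a whole alignment.

```python
def strip_bracketed(segments):
    """Drop the brackets and everything between them from one aligned row.

    Hypothetical helper for illustration. `segments` is a list of segments
    in which "(" and ")" are separate items marking the part to ignore.
    """
    kept = []
    depth = 0
    for seg in segments:
        if seg == "(":
            depth += 1
        elif seg == ")":
            depth = max(depth - 1, 0)
        elif depth == 0:
            kept.append(seg)
    return kept


row = "( b a ) t - a k".split()  # made-up row with a bracketed prefix

trimmed = strip_bracketed(row)
tokens = [seg for seg in trimmed if seg != "-"]

print(trimmed)  # ['t', '-', 'a', 'k']
print(tokens)   # ['t', 'a', 'k']
```

The trimmed row could serve as the new alignment and the gap-free list as the new tokens, assuming the brackets are placed consistently across all rows of a cognate set.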