Some thoughts for a future version using nltk #29
Some good news! It seems something like that does exist, I knew it! And it even works with regex, so this will make our lives much easier.

```python
# %%
from nltk.chunk import RegexpParser
from nltk.tree import Tree

# %%
# From here: https://stackoverflow.com/questions/24975573/how-to-parse-custom-tags-using-nltk-regexp-parser/25005857?iemail=1&noredirect=1#25005857
tags = [
    Tree('GENE', ['ase1']),
    Tree('SPACER', ['::']),
    Tree('MARKER', ['NatMx'])
]

# It even accepts regex, note the `?`
grammar = """ GENE_DELETION: {<GENE><SPACER>?<MARKER>} """

# Create an instance of your custom parser.
custom_tag_parser = RegexpParser(grammar)

# Parse!
custom_tag_parser.parse(tags).draw()

# In this second case, we have omitted the spacer, since it is optional in the regex
tags2 = [
    Tree('GENE', ['ase1']),
    Tree('MARKER', ['NatMx'])
]
custom_tag_parser.parse(tags2).draw()
```
There is one thing we might still have to figure out how to do. I posted the question on Stack Overflow: https://stackoverflow.com/questions/72956382/chunk-structures-with-untagged-tokens
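For reference, one workaround for untagged tokens, consistent with what later comments in this thread do, is to wrap every bare string in a placeholder `other` tag before chunking, so the parser only ever sees tagged subtrees. A minimal sketch (the helper name `tag_unknowns` is made up for illustration):

```python
from nltk.chunk import RegexpParser
from nltk.tree import Tree

def tag_unknowns(tokens):
    """Wrap any bare string in a placeholder 'other' tag, so that
    RegexpParser never encounters an untagged token."""
    return [t if isinstance(t, Tree) else Tree('other', [t]) for t in tokens]

tokens = [Tree('GENE', ['ase1']), 'delta', Tree('MARKER', ['NatMx'])]
wrapped = tag_unknowns(tokens)

parser = RegexpParser("GENE_DELETION: {<GENE><other>?<MARKER>}")
result = parser.parse(wrapped)  # the chunk now includes the 'delta' token
```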
Hi @manulera, I'm working on this now.
Hi @manulera! Here's something that I wrote. I can't think of what we can do with the rule name. We have the tree, and we know the rule name.
Hi Anamika, you were on the right track, but I think something slightly different would make more sense:

```python
#%%
from nltk.tree import Tree
from nltk.chunk import RegexpParser

#%%
# A variable with all the single-letter amino acids
aa = '[GPAVLIMCFYWHKRQNEDST]'.lower()

# We can write a pseudo-grammar in json.
# We use the hash symbol (#), which has no meaning in regex, to delimit the
# parts that are meant to match text; let's call this the "hash pattern".
pseudo_grammar = {
    'GENE_DELETION': '<GENE><SPACER>?#^(delta|δ)$#?<SPACER>?<MARKER>',
    'GENE_PROMOTER': '#(?<![a-z])p$#<GENE><SPACER>?<GENE>',
    'ALLELE_AA_SUBSTITUTION': rf'<GENE><SPACER>?#\(?({aa}\d+{aa})+\)?#'
}

# Some info about the regex:
#
# ^(delta|δ)$          a delta or δ that represents the whole 'other' tag;
#                      it will match ase1deltanat but won't match ase1delta2nat
#
# (?<![a-z])p$         p at the end of the string, not preceded by a letter
#
# \(?({aa}\d+{aa})+\)? amino acids in the form A123V, which may be flanked
#                      by parentheses

# For the regex parser to work, we need to substitute the parts between # by
# <other>; let's call that the "other pattern".
# Now we write a function that takes this json as an input and returns two things:
# 1. A dictionary where keys are the "other pattern" and the values are the keys
#    in the first json that match that "other pattern".
# 2. A grammar where the keys and values are the same.
example_output_1 = {
    '<GENE><SPACER>?<other>?<SPACER>?<MARKER>': ['GENE_DELETION'],
    '<other><GENE><SPACER>?<GENE>': ['PROMOTER'],  # etc.
}
grammar = """
<GENE><SPACER>?<other>?<SPACER>?<MARKER>: {<GENE><SPACER>?<other>?<SPACER>?<MARKER>}
"""
custom_tag_parser = RegexpParser(grammar)
result = custom_tag_parser.parse([
    Tree('GENE', ['mph1']),
    Tree('other', ['δ']),
    Tree('SPACER', ['::']),
    Tree('MARKER', ['kanr'])
])
# The tree is nasty, but now we can easily see if any of the keys in
# example_output_1 is present as a tag, and if so, check whether the <other>
# tags match any pattern in the pseudo_grammar.
# E.g. if we find a tag <GENE><SPACER>?<other>?<SPACER>?<MARKER>, we check all
# the patterns in example_output_1['<GENE><SPACER>?<other>?<SPACER>?<MARKER>'],
# in this case GENE_DELETION.
result
```
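The function described in the comments above, which turns the pseudo-grammar into the lookup dict and the grammar string, can be sketched without nltk at all. The name `compile_pseudo_grammar` is made up for illustration, and it assumes the hash-delimited sections never nest:

```python
import re

def compile_pseudo_grammar(pseudo_grammar):
    """Turn a pseudo-grammar into (other_patterns, grammar).

    other_patterns maps each "other pattern" (hash-delimited parts replaced
    by <other>) to the list of pseudo-grammar rule names that produce it;
    grammar is a chunk grammar whose rule names equal their patterns.
    """
    other_patterns = {}
    for rule_name, pattern in pseudo_grammar.items():
        # Replace every #...# section by the placeholder tag <other>.
        other_pattern = re.sub(r'#[^#]*#', '<other>', pattern)
        other_patterns.setdefault(other_pattern, []).append(rule_name)
    grammar = '\n'.join(f'{p}: {{{p}}}' for p in other_patterns)
    return other_patterns, grammar

pseudo_grammar = {
    'GENE_DELETION': '<GENE><SPACER>?#^(delta|δ)$#?<SPACER>?<MARKER>',
    'GENE_PROMOTER': '#(?<![a-z])p$#<GENE><SPACER>?<GENE>',
}
other_patterns, grammar = compile_pseudo_grammar(pseudo_grammar)
```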
If you define the pseudo_grammar like this:

```python
pseudo_grammar = {
    'GENE_DELETION': {
        'pattern': '<GENE><SPACER>?<other>?<SPACER>?<MARKER>',
        'other_regex': ['^(delta|δ)$']  # This is a list because there may be more than one <other>
    },
}
```

Like this you don't have to deal with the list `['GENE_DELETION']`; for each rule you directly get:

```python
pattern = '<GENE><SPACER>?<other>?<SPACER>?<MARKER>'
other_regex = ['^(delta|δ)$']  # This is a list because there may be more than one <other>
```
What we discussed on Zoom:

```python
pseudo_grammar = {
    'GENE_DELETION': {
        'pattern': '<GENE><SPACER>?<other>?<SPACER>?<MARKER>',
        'other_regex': ['^(delta|δ)$']  # This is a list because there may be more than one <other>
    },
    'BLAH': {
        'pattern': '<GENE><SPACER>?<other>?<SPACER>?<MARKER>',
        'other_regex': ['^BLAH$']  # This is a list because there may be more than one <other>
    },
}

# A function that takes pseudo_grammar and returns grammar uses this kind of dict:
example_output_1 = {
    '<GENE><SPACER>?<other>?<SPACER>?<MARKER>': ['GENE_DELETION'],
    '<other><GENE><SPACER>?<GENE>': ['PROMOTER'],  # etc.
}

# ase1deltanat -> GENE_DELETION|BLAH -> GENE_DELETION
# ase1hellonat -> GENE_DELETION|BLAH -> remove the tree
grammar = """
GENE_DELETION|BLAH: {<GENE><SPACER>?<other>?<SPACER>?<MARKER>}
"""
# Another rule would follow the same scheme, e.g. HELLO: {pattern}
# Rule names so far: 'HELLO', 'GENE_DELETION|BLAH'
custom_tag_parser = RegexpParser(grammar)
result = custom_tag_parser.parse([
    Tree('GENE', ['mph1']),
    Tree('other', ['hello']),
    Tree('SPACER', ['::']),
    Tree('MARKER', ['kanr'])
])

# Pseudocode for the verification step:
# result.check_match()
#   -> replace the tag name if the <other> tokens match the rule's other_regex
#   -> otherwise undo the tree, e.g. back to Tree('GENE_DELETION', ['ase1', 'hello'])
```
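The `check_match` step above is pseudocode (there is no such nltk method). A rough sketch of that verification pass, assuming chunks are labelled with their "other pattern" as in the earlier grammar where rule names equal their patterns:

```python
import re
from nltk.tree import Tree

def check_match(result, pseudo_grammar, pattern_lookup):
    """Rename chunks whose <other> leaves match a rule's other_regex;
    flatten (undo) chunks that match no rule.

    Note: this simple version requires one <other> token per regex, so an
    optional <other> that is absent will not match; a fuller version would
    handle optional regexes."""
    checked = []
    for child in result:
        if not isinstance(child, Tree) or child.label() not in pattern_lookup:
            checked.append(child)  # pass through non-chunked material
            continue
        others = [t.leaves()[0] for t in child
                  if isinstance(t, Tree) and t.label() == 'other']
        matched = None
        for rule_name in pattern_lookup[child.label()]:
            regexes = pseudo_grammar[rule_name]['other_regex']
            if len(regexes) == len(others) and all(
                    re.search(r, o) for r, o in zip(regexes, others)):
                matched = rule_name
                break
        if matched is not None:
            checked.append(Tree(matched, list(child)))  # replace the tag name
        else:
            checked.extend(list(child))  # undo the tree
    return Tree(result.label(), checked)

pseudo_grammar = {
    'GENE_DELETION': {
        'pattern': '<GENE><SPACER>?<other>?<SPACER>?<MARKER>',
        'other_regex': ['^(delta|δ)$'],
    },
}
chunk_label = '<GENE><SPACER>?<other>?<SPACER>?<MARKER>'
pattern_lookup = {chunk_label: ['GENE_DELETION']}

good = Tree('S', [Tree(chunk_label, [Tree('GENE', ['mph1']), Tree('other', ['δ']),
                                     Tree('SPACER', ['::']), Tree('MARKER', ['kanr'])])])
bad = Tree('S', [Tree(chunk_label, [Tree('GENE', ['mph1']), Tree('other', ['hello']),
                                    Tree('SPACER', ['::']), Tree('MARKER', ['kanr'])])])

renamed = check_match(good, pseudo_grammar, pattern_lookup)
flattened = check_match(bad, pseudo_grammar, pattern_lookup)
```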
You can start with deletion:
Then move on to the promoter:
Nltk trees manu (#39):

* WIP: nltk trees
* readme and doc strings
* more addition to readme
* alleles_pattern_nltk.json
* tests updated
* fixed the failing tests
* updating psuedo grammar
* more to tests
* readme updated
* bug fix in nltk_trees
* fix fpbase things
* intermediate fix
* simplified version, new grammar, does not handle split
* half way
* added poetry dependencies
* simple version working
* fix tests
* update gitignore
* Ci workflow (#38)
* ci_yaml and docker
* updating ci.yaml
* dockerfile updated
* fixing ci
* fix ci line
* remove docker action
* make action run at each push
* download tags in CI
* silly mistake CI fixed
* fixed error

Co-authored-by: Anamika Yadav <anamika310.yadav@gmail.com>
Co-authored-by: Manuel Lera Ramirez <manulera14@gmail.com>
Done in #35
Hi @anamika-yadav99
I have been giving some thought to how to do the semantic patterns, as shown in allele_components/other.toml.

Ultimately, the goal is to understand what the elements in each allele are. For an allele of the form GENE-TAG-MARKER, the allele is simply the sequence of the GENE, the sequence of the TAG and the sequence of the MARKER one after the other, so there's not much to do in such a case.

In other cases, however, such as gene deletion, the situation can be different. Let's take the simple example of the promoter: MARKER::pGENE-GENE means that the promoter of the second gene has been replaced by the promoter of the first gene. E.g. in KanMX::pase1-klp9, we have replaced the promoter of klp9 by the promoter of ase1. The pipeline should understand that this is the promoter of a gene, and not the gene itself, based on finding pGENE.

As you said, it is not obvious what the best way to go about this is, nor how to deal with the more nested patterns that we will surely find. I was digging a bit into what Anika sent us, and I think some of the objects from nltk might be useful for what we are trying to do; see the small example below. This produces the following graph:

There must be some object in the nltk library where we can define rules to identify those semantic patterns once we have done the first round of substitutions. We can discuss this in a call.
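The example and graph referred to above are not reproduced here. Purely for illustration, a promoter pattern of the MARKER::pGENE-GENE form could be chunked along these lines (the tag names and the rule name PROMOTER_REPLACEMENT are assumptions, not taken from the original example):

```python
from nltk.chunk import RegexpParser
from nltk.tree import Tree

# MARKER::pGENE-GENE, e.g. KanMX::pase1-klp9:
# the promoter of the second gene is replaced by the promoter of the first.
tags = [
    Tree('MARKER', ['KanMX']),
    Tree('SPACER', ['::']),
    Tree('other', ['p']),   # the 'p' that marks a promoter
    Tree('GENE', ['ase1']),
    Tree('SPACER', ['-']),
    Tree('GENE', ['klp9']),
]
grammar = "PROMOTER_REPLACEMENT: {<MARKER><SPACER><other><GENE><SPACER><GENE>}"
parser = RegexpParser(grammar)
result = parser.parse(tags)
```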