Some thoughts for a future version using nltk #29

manulera · 2022-07-12T16:12:07Z

I have been giving some thought to how to do the semantic patterns, as shown in allele_components/other.toml.

Ultimately, the goal is to understand which elements make up each allele. For an allele of the form GENE-TAG-MARKER, the allele is simply the sequence of the GENE, the sequence of the TAG and the sequence of the MARKER one after the other, so there is not much to do in that case.

In other cases, however, such as gene deletion, the situation can be different. Let's take the simple example of the promoter: MARKER::pGENE-GENE means that the promoter of the second gene has been replaced by the promoter of the first gene. For example, in KanMX::pase1-klp9 the promoter of klp9 has been replaced by the promoter of ase1. Based on finding pGENE, the pipeline should understand that this is the promoter of a gene, and not the gene itself.

As you said, it is not obvious what the best way to go about this is, or how to deal with the more nested patterns that we will surely find. I was digging a bit into what Anika sent us, and I think some of the objects from nltk might be useful for what we are trying to do, see the small example below:

from nltk.tree import Tree

# An example for ase1::NatMx klp9-mCherry::KanMx


allele_1 = Tree('ALLELE', [
                Tree('GENE_DELETION', [
                    Tree('GENE', ['ase1']),
                    Tree('SPACER', ['::']),
                    Tree('MARKER', ['NatMx'])
                ]
                )])

allele_2 = Tree('ALLELE', [
                Tree('GENE', ['klp9']),
                Tree('SPACER', ['-']),
                Tree('TAG', ['mCherry']),
                Tree('SPACER', ['::']),
                Tree('MARKER', ['KanMx']),
                ])

genotype = Tree('GENOTYPE',
                [allele_1, Tree('SPACER', [' ']), allele_2]
                )

genotype.draw()

This produces the following graph:

There must be some object in the nltk library where we can define rules to identify those semantic patterns once we have done the first round of substitutions. We can discuss this in a call.

manulera · 2022-07-12T16:35:09Z

Some good news! It seems something like that does exist, I knew it! And it even works with regex, so this will make our lives much easier.

# %%
from nltk.chunk import RegexpParser
from nltk.tree import Tree
# %%
# From here: https://stackoverflow.com/questions/24975573/how-to-parse-custom-tags-using-nltk-regexp-parser/25005857?iemail=1&noredirect=1#25005857

tags = [
    Tree('GENE', ['ase1']),
    Tree('SPACER', ['::']),
    Tree('MARKER', ['NatMx'])
]

# It even accepts regex, note the `?`
grammar = """ GENE_DELETION: {<GENE><SPACER>?<MARKER>} """

# Create an instance of your custom parser.
custom_tag_parser = RegexpParser(grammar)

# Parse!
custom_tag_parser.parse(tags).draw()

# In this second case, we have omitted the spacer, since the `?` in the regex makes it optional

tags2 = [
    Tree('GENE', ['ase1']),
    Tree('MARKER', ['NatMx'])
]

custom_tag_parser.parse(tags2).draw()
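Since `.draw()` opens a window, here is a minimal sketch of how the same result could be checked in plain text, for instance in a test. The helper name `top_level_labels` is made up for this example; it only relies on `RegexpParser.parse` and `Tree.label`, and it assumes the tokens are passed as `Tree` objects like above.

```python
from nltk.chunk import RegexpParser
from nltk.tree import Tree

grammar = "GENE_DELETION: {<GENE><SPACER>?<MARKER>}"
custom_tag_parser = RegexpParser(grammar)

tags = [
    Tree('GENE', ['ase1']),
    Tree('SPACER', ['::']),
    Tree('MARKER', ['NatMx'])
]

# Same allele without the optional spacer
tags2 = [
    Tree('GENE', ['ase1']),
    Tree('MARKER', ['NatMx'])
]


def top_level_labels(tokens):
    """Hypothetical helper: labels of the direct children of the parsed tree.

    A chunk shows up under the name of its grammar rule, tokens that were
    not chunked keep their original label.
    """
    parsed = custom_tag_parser.parse(tokens)
    return [child.label() for child in parsed]


print(top_level_labels(tags))
print(top_level_labels(tags2))
```

Both calls should report a single `GENE_DELETION` chunk, while a token list that does not fit the rule comes back with its original labels.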

manulera · 2022-07-12T17:43:46Z

There is one thing we might still have to figure out how to do. I posted the question on Stack Overflow:

https://stackoverflow.com/questions/72956382/chunk-structures-with-untagged-tokens

anamika-yadav99 · 2022-08-15T12:03:58Z

Hi @manulera I'm working on this now.

anamika-yadav99 · 2022-08-23T15:23:41Z

Hi @manulera! Here's something that I wrote. I can't think of what we can do with the rule name: we have the tree and we know the rule name.

# %%
from nltk.tree import Tree
from nltk.chunk import RegexpParser

# %%
#tags1 = ['p', ('ase1', 'GENE'), ('-', '-'), ('sad1', 'GENE')]

tags = [[
    Tree('other', ['p']),
    Tree('GENE', ['ase1']),
    Tree('SPACER', ['-']),
    Tree('GENE', ['sad1'])],
    [
        Tree('GENE', ['mph1']),
        Tree('other', ['δ']),
        Tree('SPACER', ['::']),
        Tree('MARKER', ['kanr'])
    ]
]
# It even accepts regex, note the `?`
grammar = """
            GENE_DELETION: {<GENE><SPACER>?<MARKER>}
            unknown_promoter_substitution : {<other><GENE><SPACER><GENE>}
            unknown_gene_deletion : {<GENE><other>?<SPACER>?<MARKER>}
            """
# For each rule, map the string form of the <other> subtree to the rule name it stands for
pattern_dict = {
    "unknown_promoter_substitution": {"(other p)": "p_substitution"},
    "unknown_gene_deletion": {"(other δ)": "GENE_DELETION"}
    }

# Create an instance of your custom parser.
custom_tag_parser = RegexpParser(grammar)

for tag in tags:
    outcome = custom_tag_parser.parse(tag)
    for subtree_top in outcome.subtrees():
        if subtree_top.label() in pattern_dict:
            for subtree in subtree_top.subtrees(filter = lambda x: x.label() == 'other'):
                subtree_str = str(subtree)
                if subtree_str in pattern_dict[subtree_top.label()]:
                    rule_name = pattern_dict[subtree_top.label()][subtree_str]
                    print(rule_name)

manulera · 2022-08-23T16:45:34Z

Hi Anamika,

You were on the right track, but I think something slightly different would make more sense:

#%%
from nltk.tree import Tree
from nltk.chunk import RegexpParser
#%%
# A variable with all the single-letter amino acids
aa = '[GPAVLIMCFYWHKRQNEDST]'.lower()

# We can write a pseudo-grammar in json
# We use the hash symbol (#), which has no special meaning in regex, to delimit the parts
# that are meant to match text, let's call this the "hash pattern"
pseudo_grammar = {
    'GENE_DELETION': '<GENE><SPACER>?#^(delta|δ)$#?<SPACER>?<MARKER>',
    'GENE_PROMOTER': '#(?<![a-z])p$#<GENE><SPACER>?<GENE>',
    # Raw f-string, so that \( and \d reach the regex engine unchanged
    'ALLELE_AA_SUBSTITUTION': rf'<GENE><SPACER>?#\(?({aa}\d+{aa})+\)?#'
}

# Some info about the regex
# ^(delta|δ)$ a delta or δ that represents the whole 'other tag'
# it will match ase1deltanat but won't match ase1delta2nat
#
# (?<![a-z])p$ p at the end of the string, not preceded by a letter
#
# \(?({aa}\d+{aa})+\)? amino acids in the form A123V, which may be flanked by parentheses

# For the regex parser to work, we need to substitute the things between # by <other>
# let's call the resulting pattern the "other pattern"

# Now we write a function that takes this json as an input, and returns two things:
# 1. A dictionary where the keys are the "other patterns" and the values are the keys in the first json that match
# that "other pattern"
# 2. A grammar where the keys and the values are the same (each rule is named after its own "other pattern")

example_output_1 = {
    '<GENE><SPACER>?<other>?<SPACER>?<MARKER>': ['GENE_DELETION'],
    '<other><GENE><SPACER>?<GENE>': ['GENE_PROMOTER'], # etc.
}

grammar = """
    <GENE><SPACER>?<other>?<SPACER>?<MARKER>: {<GENE><SPACER>?<other>?<SPACER>?<MARKER>}
"""


custom_tag_parser = RegexpParser(grammar)

result = custom_tag_parser.parse(   [
        Tree('GENE', ['mph1']),
        Tree('other', ['δ']),
        Tree('SPACER', ['::']),
        Tree('MARKER', ['kanr'])
    ])

# The tree is nasty but now we can easily see if any of the keys in example_output_1
# is present as a tag, and if so, see if the <other> tags match any pattern in the pseudo_grammar.
# E.g. if we find a tag with <GENE><SPACER>?<other>?<SPACER>?<MARKER>
# we check all the patterns in example_output_1['<GENE><SPACER>?<other>?<SPACER>?<MARKER>']
# in this case, `GENE_DELETION`,
result
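The conversion function described in the comments above could look roughly like the sketch below. The name `split_pseudo_grammar` and its three return values are made up for this example. It assumes that the regexes between `#` never contain a `#` themselves, and it labels each grammar rule with the rule names joined by `|` instead of with the pattern itself, which is a swapped-in choice to keep angle brackets out of the chunk labels.

```python
import re
from collections import defaultdict

# Text between two '#' characters is a regex meant for an <other> tag.
HASH_PATTERN = re.compile(r'#(.+?)#')

aa = '[GPAVLIMCFYWHKRQNEDST]'.lower()

pseudo_grammar = {
    'GENE_DELETION': '<GENE><SPACER>?#^(delta|δ)$#?<SPACER>?<MARKER>',
    'GENE_PROMOTER': '#(?<![a-z])p$#<GENE><SPACER>?<GENE>',
    'ALLELE_AA_SUBSTITUTION': rf'<GENE><SPACER>?#\(?({aa}\d+{aa})+\)?#',
}


def split_pseudo_grammar(pseudo_grammar):
    """Hypothetical helper, returns (names_by_pattern, other_regexes, grammar).

    names_by_pattern: "other pattern" -> rule names that share it
    other_regexes:    rule name -> regexes for its <other> tags, in order
    grammar:          one line per "other pattern", usable as a chunk grammar
    """
    names_by_pattern = defaultdict(list)
    other_regexes = {}
    for rule_name, hash_pattern in pseudo_grammar.items():
        other_regexes[rule_name] = HASH_PATTERN.findall(hash_pattern)
        other_pattern = HASH_PATTERN.sub('<other>', hash_pattern)
        names_by_pattern[other_pattern].append(rule_name)
    grammar = '\n'.join(
        f"{'|'.join(names)}: {{{pattern}}}"
        for pattern, names in names_by_pattern.items()
    )
    return dict(names_by_pattern), other_regexes, grammar


names_by_pattern, other_regexes, grammar = split_pseudo_grammar(pseudo_grammar)
print(grammar)
```

Keeping the regexes in a separate dictionary means the grammar only ever sees `<other>`, and the text of each `other` token can be checked against the regexes after chunking.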

manulera · 2022-08-23T16:52:10Z

If you find the pseudo_grammar format above too confusing, you can also use something like this instead:

pseudo_grammar = {
    'GENE_DELETION': {
        'pattern': '<GENE><SPACER>?<other>?<SPACER>?<MARKER>',
        'other_regex': ['^(delta|δ)$'] # This is a list because there may be more than one <other>
    },
}

This way you don't have to deal with the # characters, and we could also write this kind of grammar as TOML in another file:

[GENE_DELETION]
pattern = '<GENE><SPACER>?<other>?<SPACER>?<MARKER>'
other_regex = ['^(delta|δ)$'] # This is a list because there may be more than one <other>

manulera · 2022-08-25T08:58:29Z

What we discussed on Zoom:

pseudo_grammar = {
    'GENE_DELETION': {
        'pattern': '<GENE><SPACER>?<other>?<SPACER>?<MARKER>',
        'other_regex': ['^(delta|δ)$'] # This is a list because there may be more than one <other>
    },
    'BLAH': {
        'pattern': '<GENE><SPACER>?<other>?<SPACER>?<MARKER>',
        'other_regex': ['^BLAH$'] # This is a list because there may be more than one <other>
    },
}

# A function that takes pseudo_grammar and returns grammar, uses this kind of dict:
example_output_1 = {
    '<GENE><SPACER>?<other>?<SPACER>?<MARKER>': ['GENE_DELETION'],
    '<other><GENE><SPACER>?<GENE>': ['PROMOTER'], # etc.
}



# ase1deltanat > GENE_DELETION|BLAH > GENE_DELETION
# ase1hellonat > GENE_DELETION|BLAH > Remove the tree

grammar = """
    GENE_DELETION|BLAH: {<GENE><SPACER>?<other>?<SPACER>?<MARKER>}
    HELLO: {pattern}
"""

# Labels that can appear in the tree: 'HELLO', 'GENE_DELETION|BLAH'

custom_tag_parser = RegexpParser(grammar)

result = custom_tag_parser.parse(   [
        Tree('GENE', ['mph1']),
        Tree('other', ['hello']),
        Tree('SPACER', ['::']),
        Tree('MARKER', ['kanr'])
    ])
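# Pseudocode: check_match is not an nltk method, it is the step we still have to write
# result.check_match()
# > if one of the other_regex matches: replace the tag name
# > if none matches: undo the tree Tree('GENE_DELETION', ['ase1, hello'])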
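A rough sketch of what that `check_match` step could look like is below. It is written as a standalone function, because nltk trees have no such method, and `build_grammar` and `resolve_label` are names made up for this example. It assumes every `other_regex` entry has to match one `other` token in order, so a chunk where the optional `<other>` is missing gets undone as well, which may not be what we want in the end.

```python
import re

from nltk.chunk import RegexpParser
from nltk.tree import Tree

pseudo_grammar = {
    'GENE_DELETION': {
        'pattern': '<GENE><SPACER>?<other>?<SPACER>?<MARKER>',
        'other_regex': ['^(delta|δ)$'],
    },
    'BLAH': {
        'pattern': '<GENE><SPACER>?<other>?<SPACER>?<MARKER>',
        'other_regex': ['^BLAH$'],
    },
}


def build_grammar(pseudo_grammar):
    """Group rule names that share a pattern under one 'A|B' chunk label."""
    names_by_pattern = {}
    for name, rule in pseudo_grammar.items():
        names_by_pattern.setdefault(rule['pattern'], []).append(name)
    return '\n'.join(
        f"{'|'.join(names)}: {{{pattern}}}"
        for pattern, names in names_by_pattern.items()
    )


def resolve_label(chunk, pseudo_grammar):
    """Return the first candidate rule whose other_regex list matches, else None."""
    others = [''.join(child.leaves()) for child in chunk if child.label() == 'other']
    for name in chunk.label().split('|'):
        regexes = pseudo_grammar[name]['other_regex']
        if len(others) == len(regexes) and all(
            re.search(regex, text) for regex, text in zip(regexes, others)
        ):
            return name
    return None


def check_match(tokens, pseudo_grammar):
    """Chunk the tokens, then relabel each chunk or undo it."""
    parsed = RegexpParser(build_grammar(pseudo_grammar)).parse(tokens)
    resolved = []
    for child in parsed:
        # A chunk is a child whose label is made only of rule names
        if set(child.label().split('|')) <= set(pseudo_grammar):
            name = resolve_label(child, pseudo_grammar)
            if name is None:
                resolved.extend(child)  # undo the tree, keep its tokens
            else:
                resolved.append(Tree(name, list(child)))  # replace the tag name
        else:
            resolved.append(child)
    return Tree(parsed.label(), resolved)


def example_tokens(other_text):
    return [
        Tree('GENE', ['mph1']),
        Tree('other', [other_text]),
        Tree('SPACER', ['::']),
        Tree('MARKER', ['kanr']),
    ]


print(check_match(example_tokens('δ'), pseudo_grammar))
print(check_match(example_tokens('hello'), pseudo_grammar))
```

With `δ` the chunk should be relabelled `GENE_DELETION`, and with `hello` the four tokens should come back unchunked.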

manulera · 2022-08-25T09:10:29Z

You can start by deletion with:

example that matches: ase1deltanat
example that doesn't: blah
example that matches grammar but not pseudogrammar: ase1hellonat

Then move on to promoter:

example with p only: p-ase1
example where you have to split the other tag into two: 13p-ase1 > p-ase1 goes to promoter, 13 remains in
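For the split case, one possible approach is to search the text of the `other` tag with the promoter regex and keep whatever is left on either side of the match. This is only a sketch: `split_other` is a made-up helper, and the regex is the `(?<![a-z])p$` one discussed earlier in this thread.

```python
import re

# p at the end of the string, not preceded by a letter
PROMOTER_REGEX = re.compile(r'(?<![a-z])p$')


def split_other(text, regex):
    """Hypothetical helper: split the text of an `other` tag around a match.

    Returns (before, matched, after), or None if the regex does not match.
    """
    match = regex.search(text)
    if match is None:
        return None
    return text[:match.start()], match.group(), text[match.end():]


print(split_other('13p', PROMOTER_REGEX))
print(split_other('p', PROMOTER_REGEX))
print(split_other('cap', PROMOTER_REGEX))
```

The non-empty leftovers (here `13`) would stay as their own `other` token, while the matched `p` goes into the promoter chunk.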

* WIP:nltk trees
* readme and doc strings
* more addition to readme
* alleles_pattern_nltk.json
* tests updated
* fixed the failing tests
* updating psuedo grammar
* more to tests
* readme updated
* bug fix in nltk_trees
* Nltk trees manu (#39)
* fix fpbase things
* intermediate fix
* simplified version, new grammar, does not handle split
* half way
* added poetry dependencies
* simple version working
* fix tests
* update gitignore
* Ci workflow (#38)
* ci_yaml and docker
* updating ci.yaml
* updating ci.yaml
* dockerfile updated
* fixing ci
Co-authored-by: Anamika Yadav <anamika310.yadav@gmail.com>
* fix ci line
* remove docker action
* make action run at each push
* download tags in CI
* silly mistake CI fixed
* fixed error
Co-authored-by: Anamika Yadav <anamika310.yadav@gmail.com>
Co-authored-by: Manuel Lera Ramirez <manulera14@gmail.com>

manulera · 2022-09-15T12:15:58Z

Done in #35

manulera closed this as completed Sep 15, 2022
