Some thoughts for a future version using nltk #29

Closed
manulera opened this issue Jul 12, 2022 · 9 comments

@manulera
Owner

Hi @anamika-yadav99

I have been giving some thought to how to do the semantic patterns, as shown in allele_components/other.toml.

Ultimately, the goal is to understand what the elements of each allele are. For an allele of the form GENE-TAG-MARKER, the allele is simply the sequence of the GENE, the sequence of the TAG and the sequence of the MARKER one after the other, so there is not much to do in that case.

In other cases, however, such as gene deletion, the situation can be different. Let's take the simple example of the promoter: MARKER::pGENE-GENE means that the promoter of the second gene has been replaced by the promoter of the first gene. For example, in KanMX::pase1-klp9 we have replaced the promoter of klp9 with the promoter of ase1. Based on finding pGENE, the pipeline should understand that this is the promoter of a gene and not the gene itself.
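To make the pGENE idea concrete, here is a minimal sketch using only the standard `re` module. The gene set and the function name `find_promoters` are made up for illustration and are not part of the pipeline; the only thing taken from the text above is that a `p` directly in front of a known gene name marks a promoter.

```python
import re

# Hypothetical gene list; the real pipeline would take these from its gene dictionary
KNOWN_GENES = {"ase1", "klp9"}


def find_promoters(allele, genes=KNOWN_GENES):
    """Return the genes that appear as 'p<gene>' (promoter) in the allele name."""
    # Longest names first so that a longer gene name wins over a shorter prefix
    alternation = "|".join(sorted(map(re.escape, genes), key=len, reverse=True))
    # The 'p' must not be preceded by a letter, and the gene name must end there
    pattern = re.compile(rf"(?<![A-Za-z])p({alternation})(?![A-Za-z0-9])")
    return pattern.findall(allele)


print(find_promoters("KanMX::pase1-klp9"))
```

Here only `ase1` is reported as a promoter: the `p` inside `klp9` is preceded by a letter, so it is ignored.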

As you said, it is not obvious what the best way to go about this is, or how to deal with the more nested patterns that we will surely find. I was digging a bit into what Anika sent us, and I think some of the objects from nltk might be useful for what we are trying to do, see the small example below:

from nltk.tree import Tree

# An example for ase1::NatMx klp9-mCherry::KanMx


allele_1 = Tree('ALLELE', [
                Tree('GENE_DELETION', [
                    Tree('GENE', ['ase1']),
                    Tree('SPACER', ['::']),
                    Tree('MARKER', ['NatMx'])
                ]
                )])

allele_2 = Tree('ALLELE', [
                Tree('GENE', ['klp9']),
                Tree('SPACER', ['-']),
                Tree('TAG', ['mCherry']),
                Tree('SPACER', ['::']),
                Tree('MARKER', ['KanMx']),
                ])

genotype = Tree('GENOTYPE',
                [allele_1, Tree('SPACER', [' ']), allele_2]
                )

genotype.draw()

This produces the following graph:

[Image: output]

There must be some object in the nltk library that lets us define rules to identify those semantic patterns once we have done the first round of substitutions. We can discuss this in a call.

@manulera
Owner Author

manulera commented Jul 12, 2022

Some good news! It seems something like that does exist, I knew it! And it even works with regex, so this will make our lives much easier.

# %%
from nltk.chunk import RegexpParser
from nltk.tree import Tree
# %%
# From here: https://stackoverflow.com/questions/24975573/how-to-parse-custom-tags-using-nltk-regexp-parser/25005857?iemail=1&noredirect=1#25005857

tags = [
    Tree('GENE', ['ase1']),
    Tree('SPACER', ['::']),
    Tree('MARKER', ['NatMx'])
]

# It even accepts regex, note the `?`
grammar = """ GENE_DELETION: {<GENE><SPACER>?<MARKER>} """

# Create an instance of your custom parser.
custom_tag_parser = RegexpParser(grammar)

# Parse!
custom_tag_parser.parse(tags).draw()

# In this second case, we have omitted the spacer, since the regex makes it optional

tags2 = [
    Tree('GENE', ['ase1']),
    Tree('MARKER', ['NatMx'])
]

custom_tag_parser.parse(tags2).draw()

@manulera
Owner Author

There is one thing we might still have to figure out how to do, so I posted the question on Stack Overflow:

https://stackoverflow.com/questions/72956382/chunk-structures-with-untagged-tokens

@anamika-yadav99
Collaborator

> One thing we might have to figure out how to do. I posted the question on stackoverflow:
>
> https://stackoverflow.com/questions/72956382/chunk-structures-with-untagged-tokens

Hi @manulera I'm working on this now.

@anamika-yadav99
Collaborator

anamika-yadav99 commented Aug 23, 2022

Hi @manulera! Here's something that I wrote. I can't think of what we can do with the rule name. We have the tree and we know the rule name.

# %%
from nltk.tree import Tree
from nltk.chunk import RegexpParser

# %%
#tags1 = ['p', ('ase1', 'GENE'), ('-', '-'), ('sad1', 'GENE')]

tags = [[
    Tree('other', ['p']),
    Tree('GENE', ['ase1']),
    Tree('SPACER', ['-']),
    Tree('GENE', ['sad1'])],
    [
        Tree('GENE', ['mph1']),
        Tree('other', ['δ']),
        Tree('SPACER', ['::']),
        Tree('MARKER', ['kanr'])
    ]
]
# It even accepts regex, note the `?`
grammar = """
            GENE_DELETION: {<GENE><SPACER>?<MARKER>}
            unknown_promoter_substitution : {<other><GENE><SPACER><GENE>}
            unknown_gene_deletion : {<GENE><other>?<SPACER>?<MARKER>}
            """
# Maps each chunk label to {str(<other> subtree): rule name}
pattern_dict = {
    "unknown_promoter_substitution": {"(other p)": "p_substitution"},
    "unknown_gene_deletion": {"(other δ)": "GENE_DELETION"}
    }

# Create an instance of your custom parser.
custom_tag_parser = RegexpParser(grammar)

for tag in tags:
    outcome = custom_tag_parser.parse(tag)
    for subtree_top in outcome.subtrees():
        if subtree_top.label() in pattern_dict:
            for subtree in subtree_top.subtrees(filter = lambda x: x.label() == 'other'):
                subtree_str = str(subtree)
                if subtree_str in pattern_dict[subtree_top.label()]:
                    rule_name = pattern_dict[subtree_top.label()][subtree_str]
                    print(rule_name)


@manulera
Owner Author

manulera commented Aug 23, 2022

Hi Anamika,

You were on the right track, but I think something slightly different would make more sense:

#%%
from nltk.tree import Tree
from nltk.chunk import RegexpParser
#%%
# A variable with all the single-letter amino acids
aa = '[GPAVLIMCFYWHKRQNEDST]'.lower()

# We can write a pseudo-grammar in json
# We use the hash symbol (#), which has no special meaning in regex (outside verbose mode),
# to delimit the parts that are meant to match text, let's call this the "hash pattern"
pseudo_grammar = {
    'GENE_DELETION': '<GENE><SPACER>?#^(delta|δ)$#?<SPACER>?<MARKER>',
    'GENE_PROMOTER': '#(?<![a-z])p$#<GENE><SPACER>?<GENE>',
    # Raw f-string so that \( and \d reach the regex engine unchanged
    'ALLELE_AA_SUBSTITUTION': rf'<GENE><SPACER>?#\(?({aa}\d+{aa})+\)?#'
}

# Some info about the regex
# ^(delta|δ)$ a delta or δ that represents the whole 'other' tag
# it will match ase1deltanat but won't match ase1delta2nat
#
# (?<![a-z])p$ p at the end of the string, not preceded by a letter
#
# \(?({aa}\d+{aa})+\)? amino acids in the form A123V, which may be flanked by parentheses

# For the regex parser to work, we need to substitute the things between # with <other>
# let's call the resulting pattern the "other pattern"

# Now we write a function that takes this json as an input, and returns two things:
# 1. A dictionary where the keys are the "other patterns" and the values are the keys in the first json
# that produce that "other pattern"
# 2. A grammar where the keys and values are the same (each chunk label is the pattern itself)

example_output_1 = {
    '<GENE><SPACER>?<other>?<SPACER>?<MARKER>': ['GENE_DELETION'],
    '<other><GENE><SPACER>?<GENE>': ['GENE_PROMOTER'], # etc.
}

grammar = """
    <GENE><SPACER>?<other>?<SPACER>?<MARKER>: {<GENE><SPACER>?<other>?<SPACER>?<MARKER>}
"""


custom_tag_parser = RegexpParser(grammar)

result = custom_tag_parser.parse(   [
        Tree('GENE', ['mph1']),
        Tree('other', ['δ']),
        Tree('SPACER', ['::']),
        Tree('MARKER', ['kanr'])
    ])

# The tree is nasty but now we can easily see if any of the keys in example_output_1
# is present as a tag, and if so, see if the <other> tags match any pattern in the pseudo_grammar.
# E.g. if we find a tag with <GENE><SPACER>?<other>?<SPACER>?<MARKER>
# we check all the patterns in example_output_1['<GENE><SPACER>?<other>?<SPACER>?<MARKER>']
# in this case, `GENE_DELETION`,
result
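The function described in the comments above could look roughly like the sketch below. It only uses the standard library. The names `split_hash_pattern` and `build_grammar` are hypothetical, and it assumes that `#` never appears inside one of the regexes.

```python
aa = "[gpavlimcfywhkrqnedst]"
pseudo_grammar = {
    "GENE_DELETION": "<GENE><SPACER>?#^(delta|δ)$#?<SPACER>?<MARKER>",
    "GENE_PROMOTER": "#(?<![a-z])p$#<GENE><SPACER>?<GENE>",
    "ALLELE_AA_SUBSTITUTION": rf"<GENE><SPACER>?#\(?({aa}\d+{aa})+\)?#",
}


def split_hash_pattern(hash_pattern):
    """Return (other_pattern, other_regexes) for one pseudo-grammar value."""
    parts = hash_pattern.split("#")
    if len(parts) % 2 == 0:
        raise ValueError(f"unbalanced '#' in {hash_pattern!r}")
    # Even positions are tag text, odd positions are the regexes between hashes
    other_regexes = parts[1::2]
    other_pattern = "<other>".join(parts[0::2])
    return other_pattern, other_regexes


def build_grammar(pseudo_grammar):
    """Return (rules_by_pattern, grammar) where each chunk label is the pattern itself."""
    rules_by_pattern = {}
    for rule_name, hash_pattern in pseudo_grammar.items():
        other_pattern, _ = split_hash_pattern(hash_pattern)
        rules_by_pattern.setdefault(other_pattern, []).append(rule_name)
    grammar = "\n".join(f"{p}: {{{p}}}" for p in rules_by_pattern)
    return rules_by_pattern, grammar


rules_by_pattern, grammar = build_grammar(pseudo_grammar)
print(rules_by_pattern)
print(grammar)
```

The regexes returned by `split_hash_pattern` would then be used to test the text of the `<other>` tags once a chunk with that label has been found.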

@manulera
Owner Author

If you find the pseudo_grammar format above too confusing, you can also use something like this instead:

pseudo_grammar = {
    'GENE_DELETION': {
        'pattern': '<GENE><SPACER>?<other>?<SPACER>?<MARKER>',
        'other_regex': ['^(delta|δ)$'] # This is a list because there may be more than one <other>
    },
}

This way you don't have to deal with the # characters, and we could also write this kind of grammar as TOML in a separate file:

[GENE_DELETION]
pattern = '<GENE><SPACER>?<other>?<SPACER>?<MARKER>'
other_regex = ['^(delta|δ)$'] # This is a list because there may be more than one <other>
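Since `other_regex` is a list with (presumably) one entry per `<other>` in the pattern, a small sanity check on the grammar could catch mistakes early. This is only a sketch: `validate_pseudo_grammar` is a hypothetical helper, and the one-regex-per-`<other>` rule is my reading of the comment above, not something that is implemented yet.

```python
import re

pseudo_grammar = {
    "GENE_DELETION": {
        "pattern": "<GENE><SPACER>?<other>?<SPACER>?<MARKER>",
        "other_regex": ["^(delta|δ)$"],
    },
}


def validate_pseudo_grammar(pseudo_grammar):
    """Return a list of human-readable problems; the list is empty if all is fine."""
    problems = []
    for rule_name, rule in pseudo_grammar.items():
        # Assumption: exactly one regex per <other> tag in the pattern
        n_other = rule["pattern"].count("<other>")
        n_regex = len(rule["other_regex"])
        if n_other != n_regex:
            problems.append(
                f"{rule_name}: {n_other} <other> tag(s) but {n_regex} regex(es)"
            )
        for regex in rule["other_regex"]:
            try:
                re.compile(regex)
            except re.error as err:
                problems.append(f"{rule_name}: invalid regex {regex!r} ({err})")
    return problems


print(validate_pseudo_grammar(pseudo_grammar))
```

The same check works unchanged on the dictionary obtained from loading the TOML version, since both have the same shape.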

@manulera
Owner Author

manulera commented Aug 25, 2022

What we discussed on Zoom:

pseudo_grammar = {
    'GENE_DELETION': {
        'pattern': '<GENE><SPACER>?<other>?<SPACER>?<MARKER>',
        'other_regex': ['^(delta|δ)$'] # This is a list because there may be more than one <other>
    },
    'BLAH': {
        'pattern': '<GENE><SPACER>?<other>?<SPACER>?<MARKER>',
        'other_regex': ['^BLAH$'] # This is a list because there may be more than one <other>
    },
}

# A function that takes pseudo_grammar and returns the grammar would build this kind of dict,
# grouping the rule names that share a pattern:
example_output_1 = {
    '<GENE><SPACER>?<other>?<SPACER>?<MARKER>': ['GENE_DELETION', 'BLAH'],
    '<other><GENE><SPACER>?<GENE>': ['PROMOTER'], # etc.
}

# ase1deltanat > chunked as GENE_DELETION|BLAH > resolved to GENE_DELETION
# ase1hellonat > chunked as GENE_DELETION|BLAH > Remove the tree

# Rules that share a pattern get a single chunk label, joined with |
# Other patterns get their own line, e.g. HELLO: {pattern}
grammar = """
    GENE_DELETION|BLAH: {<GENE><SPACER>?<other>?<SPACER>?<MARKER>}
"""

# Possible chunk labels: 'HELLO', 'GENE_DELETION|BLAH'

custom_tag_parser = RegexpParser(grammar)

result = custom_tag_parser.parse([
        Tree('GENE', ['mph1']),
        Tree('other', ['hello']),
        Tree('SPACER', ['::']),
        Tree('MARKER', ['kanr'])
    ])

# Pseudocode for the next step (check_match does not exist yet):
# result.check_match()
# > replace the tag name
# > undo the tree Tree('GENE_DELETION', ['ase1, hello'])
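A rough sketch of how the check discussed above might work with nltk is given below. The helpers `build_grammar`, `resolve` and `check_matches` are hypothetical names, not existing nltk methods. It also assumes that a chunk where the optional `<other>` is absent counts as "no match", which may not be what we want in the end.

```python
import re
from nltk.chunk import RegexpParser
from nltk.tree import Tree

pseudo_grammar = {
    "GENE_DELETION": {
        "pattern": "<GENE><SPACER>?<other>?<SPACER>?<MARKER>",
        "other_regex": ["^(delta|δ)$"],
    },
    "BLAH": {
        "pattern": "<GENE><SPACER>?<other>?<SPACER>?<MARKER>",
        "other_regex": ["^BLAH$"],
    },
}


def build_grammar(pseudo_grammar):
    """Group rule names by pattern and label each chunk 'RULE_A|RULE_B'."""
    names_by_pattern = {}
    for name, rule in pseudo_grammar.items():
        names_by_pattern.setdefault(rule["pattern"], []).append(name)
    lines = []
    for pattern, names in names_by_pattern.items():
        label = "|".join(names)
        lines.append(f"{label}: {{{pattern}}}")
    return "\n".join(lines)


def resolve(chunk, pseudo_grammar):
    """Return the first candidate rule whose regexes match the <other> texts, else None."""
    others = [
        sub.leaves()[0]
        for sub in chunk
        if isinstance(sub, Tree) and sub.label() == "other"
    ]
    for name in chunk.label().split("|"):
        regexes = pseudo_grammar[name]["other_regex"]
        # Assumption: one regex per <other>, so a missing <other> never matches
        if len(regexes) == len(others) and all(
            re.search(rx, text) for rx, text in zip(regexes, others)
        ):
            return name
    return None


def check_matches(parsed, pseudo_grammar):
    """Rename chunks that resolve to a rule and undo the chunks that do not."""
    out = []
    for child in parsed:
        is_chunk = isinstance(child, Tree) and all(
            name in pseudo_grammar for name in child.label().split("|")
        )
        if not is_chunk:
            out.append(child)  # plain tag such as GENE or SPACER
            continue
        rule = resolve(child, pseudo_grammar)
        if rule is None:
            out.extend(child)  # undo the tree: put its children back
        else:
            out.append(Tree(rule, list(child)))  # replace the tag name
    return Tree(parsed.label(), out)


def parse_tokens(other_text):
    tokens = [
        Tree("GENE", ["mph1"]),
        Tree("other", [other_text]),
        Tree("SPACER", ["::"]),
        Tree("MARKER", ["kanr"]),
    ]
    parser = RegexpParser(build_grammar(pseudo_grammar))
    return check_matches(parser.parse(tokens), pseudo_grammar)


print(parse_tokens("δ"))      # chunk kept and renamed to GENE_DELETION
print(parse_tokens("hello"))  # chunk undone, the four tags sit under the root again
```

Keeping the resolution step separate from the nltk grammar means the grammar only has to know about tag order, while the text of the `<other>` tags is checked afterwards with plain `re`.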

@manulera
Owner Author

manulera commented Aug 25, 2022

You can start with deletion, using:

  • example that matches: ase1deltanat
  • example that doesn't: blah
  • example that matches grammar but not pseudogrammar: ase1hellonat

Then move on to promoter:

  • example with p only: p-ase1
  • example where you have to split the other tag into two: 13p-ase1 > p-ase1 goes to promoter, 13 remains in
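For the last promoter case, splitting the `other` tag could be sketched as below with the standard `re` module. The regex is the promoter one from the pseudo-grammar earlier in the thread; the function name `split_other` and its return shape are only an illustration, and the input is assumed to be lowercase.

```python
import re

# A 'p' at the end of the token that is not preceded by a letter
PROMOTER_RE = re.compile(r"(?<![a-z])p$")


def split_other(text, regex=PROMOTER_RE):
    """Split an `other` token into (remainder, matched).

    `matched` is None when the regex does not match, and the token is
    returned unchanged as the remainder.
    """
    m = regex.search(text)
    if m is None:
        return text, None
    remainder = text[:m.start()] + text[m.end():]
    return remainder, m.group(0)


for token in ["p", "13p", "hello"]:
    print(token, split_other(token))
```

For `13p` the matched part `p` would go into the promoter chunk and `13` would stay as an `other` tag, while for `p` alone the remainder is empty and the whole tag is consumed.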

manulera added a commit that referenced this issue Sep 15, 2022
* WIP:nltk trees

* readme and doc strings

* more addition to readme

* alleles_pattern_nltk.json

* tests updated

* fixed the failing tests

* updating psuedo grammar

* more to tests

* readme updated

* bug fix in nltk_trees

* Nltk trees manu (#39)

* fix fpbase things

* intermediate fix

* simplified version, new grammar, does not handle split

* half way

* added poetry dependencies

* simple version working

* fix tests

* update gitignore

* Ci workflow (#38)

* ci_yaml and docker

* updating ci.yaml

* updating ci.yaml

* dockerfile updated

* fixing ci

Co-authored-by: Anamika Yadav <anamika310.yadav@gmail.com>

* fix ci line

* remove docker action

* make action run at each push

* download tags in CI

* silly mistake CI fixed

* fixed error

Co-authored-by: Anamika Yadav <anamika310.yadav@gmail.com>

Co-authored-by: Manuel Lera Ramirez <manulera14@gmail.com>
@manulera
Owner Author

Done in #35
