Fourth version (using nltk) #30
Comments
Some more things to add to this version:
Hello Anamika, I noticed that to read the trees into nltk, we would need to change the output. For example, if we have ase1::NatR:
# The output currently looks like this:
[['ase1', 'GENE'], ['NatR', 'marker']]
# But it should look like this (note how the value of the strings is in a list):
[['GENE', ['ase1']], ['marker', ['NatR']]]
# So that we can make a tree from it:
Tree(*['GENE', ['ase1']])
# or
Tree('GENE', ['ase1'])
You will need to fix this in the pipeline and tests, sorry that we did not pick it up before. Note that the reason for this is to be able to capture nested things like p-Gene:
from nltk.tree import Tree
Tree(*["GENE", ["ase1"]])
Tree(*['PROMOTER', ['p', Tree(*['GENE', ['ase1']])]])
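A minimal sketch of the conversion described above, in plain Python so it runs without nltk installed. The function name `to_tree_format` is hypothetical and not part of the existing pipeline; it only assumes the current output is a list of `[value, tag]` pairs.

```python
def to_tree_format(tagged):
    """Convert [[value, tag], ...] into [[tag, [value]], ...].

    Each resulting item has the (label, children) shape, so it could be
    passed to nltk as Tree(*item), as in the comment above.
    """
    return [[tag, [value]] for value, tag in tagged]


# Example for the allele ase1::NatR
old_output = [['ase1', 'GENE'], ['NatR', 'marker']]
new_output = to_tree_format(old_output)
print(new_output)  # [['GENE', ['ase1']], ['marker', ['NatR']]]
```

Wrapping the value in a list is what leaves room for nested children later, such as a GENE inside a PROMOTER.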
Hi @manulera, I have made the necessary changes to the code and readme. You can review them now.
Hi @anamika-yadav99 wow that was fast! Good job! Also good job with the readme, it looks much better now!
In this next version, let's use the tagging syntax used by nltk. For example, for pfus1-ase1-mCherry, we would now generate pGENE-GENE-TAG; in their format it should be:
Initially I thought we could store this as plain text, but I think it's not worth the headache of escaping characters (the default separator is \ but I am sure we will encounter every thinkable separator). I would stick with json, something like this for each allele:
So for this version, write some code that exports the alleles to this format, and a test that verifies that it works on a few examples. To restore our previous patterns, which are also useful for generating the occurrences file, you can do something like this:
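The JSON export and the pattern restoration could look roughly like the sketch below. The exact structure intended in the issue is not shown here, so everything in it is an assumption: the field names `name` and `pattern`, the helper names `export_allele` and `restore_pattern`, and the flat `pGENE` tag. It relies only on the fact that nltk represents tagged tokens as `(token, tag)` pairs.

```python
import json


def export_allele(name, tagged_tokens):
    """Serialise one allele as a JSON string (hypothetical layout)."""
    # JSON has no tuples, so each (token, tag) pair is stored as a list.
    return json.dumps({"name": name, "pattern": [list(pair) for pair in tagged_tokens]})


def restore_pattern(record_json):
    """Rebuild the old-style pattern string by concatenating the tags."""
    record = json.loads(record_json)
    return "".join(tag for _token, tag in record["pattern"])


# Assumed tagging of pfus1-ase1-mCherry; separators are tagged "-".
tagged = [("pfus1", "pGENE"), ("-", "-"), ("ase1", "GENE"),
          ("-", "-"), ("mCherry", "TAG")]
record = export_allele("pfus1-ase1-mCherry", tagged)
print(restore_pattern(record))  # pGENE-GENE-TAG
```

Because the tokens are stored as JSON strings, no separator character needs escaping, whatever appears in the allele name.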
For the separators, make a list of known separators in a text file. For now I imagine it's mostly :-, but we will have to add more in the future. Any consecutive separators in an allele name should be counted as one, so use regex. They will have to be replaced after all other features, as some allele names contain separators in them. We could name the feature "SEPARATOR", but I think for readability it is better to use "-", and this is supported by nltk.
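One way the separator handling could be sketched is shown below. It assumes the text file holds one separator per line and reads ":-" as the two separators ":" and "-". The helper names (`load_separators`, `build_separator_regex`, `split_on_separators`) are hypothetical. In the pipeline this would run only on the text left over after all other features have been matched.

```python
import re


def load_separators(path):
    """Read one separator per line from a text file (assumed layout)."""
    with open(path, encoding="utf-8") as handle:
        return [line.rstrip("\n") for line in handle if line.strip()]


def build_separator_regex(separators):
    """Compile a regex matching any run of consecutive known separators."""
    # Longest first, so a multi-character separator wins over its prefix.
    ordered = sorted(separators, key=len, reverse=True)
    alternatives = "|".join(re.escape(sep) for sep in ordered)
    return re.compile(f"(?:{alternatives})+")


def split_on_separators(text, sep_regex):
    """Split text into (token, tag) pairs; separator runs get the tag "-"."""
    pieces = []
    pos = 0
    for match in sep_regex.finditer(text):
        if match.start() > pos:
            pieces.append((text[pos:match.start()], None))  # not tagged here
        pieces.append((match.group(), "-"))
        pos = match.end()
    if pos < len(text):
        pieces.append((text[pos:], None))
    return pieces


sep_regex = build_separator_regex([":", "-"])
print(split_on_separators("ase1::NatR", sep_regex))
# [('ase1', None), ('::', '-'), ('NatR', None)]
```

The trailing `+` in the regex is what makes "::" count as a single separator rather than two.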