Fourth version (using nltk) #30

Closed
manulera opened this issue Jul 15, 2022 · 6 comments

@manulera
Owner

In this next version, let's use the tagging syntax used by nltk.

For example, for pfus1-ase1-mCherry we currently generate pGENE-GENE-TAG. In nltk's format it should be:

['p',('fus1', 'GENE'),('-','-'),('ase1', 'GENE'),('-','-'),('mCherry','TAG')]

Initially I thought we could store this as plain text, but I think it's not worth the headache of escaping characters (nltk's default separator is / but I am sure we will encounter every thinkable separator). I would stick with json, something like this for each allele:

[
{
  "name": "pfus1-ase1-mCherry",
  "pattern": ["p",["fus1", "GENE"],["-","-"],["ase1", "GENE"],["-","-"],["mCherry","TAG"]]
}
]

So for this version, write some code that exports the alleles to this format, and a test that verifies that it works on a few examples. To restore our previous patterns, which are also useful to generate the occurrences file, you can do something like this:

"".join(i if isinstance(i, str) else i[1] for i in pattern)
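
A minimal sketch of the export and of the round trip back to the old pattern string, assuming the alleles are already tagged as in the example above. The names `export_alleles` and `pattern_to_string` are hypothetical, not functions from the repository.

```python
import json


def pattern_to_string(pattern):
    """Rebuild the old-style pattern, e.g. 'pGENE-GENE-TAG'."""
    # Untagged text is a plain string; tagged tokens are (value, tag) pairs.
    return "".join(i if isinstance(i, str) else i[1] for i in pattern)


def export_alleles(alleles):
    """Serialise a list of {'name': ..., 'pattern': ...} dicts to JSON text."""
    return json.dumps(alleles, indent=2)


alleles = [
    {
        "name": "pfus1-ase1-mCherry",
        "pattern": [
            "p",
            ("fus1", "GENE"),
            ("-", "-"),
            ("ase1", "GENE"),
            ("-", "-"),
            ("mCherry", "TAG"),
        ],
    }
]

exported = export_alleles(alleles)
# JSON has no tuple type, so the pairs come back as lists after loading.
restored = json.loads(exported)
print(pattern_to_string(restored[0]["pattern"]))  # pGENE-GENE-TAG
```

Because the join only looks at the second element of each pair, it works on both the tuple form and the list form that comes back from json.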

For the separators, make a list of known separators in a text file. For now I imagine it's mostly : and -, but we will have to add more in the future. Any run of consecutive separators in an allele name should be counted as one, so use regex. Separators will have to be replaced after all other features, as some allele names contain separators in them. We could name the feature "SEPARATOR", but I think for readability it is better to use "-", and this is supported by nltk.
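
One way the separator step could look, as a sketch only. It assumes the pattern is a list in which already-tagged features are [value, tag] pairs and the remaining untagged text is plain strings. The separator list is hard-coded here instead of being read from the text file, and `tag_separators` is a hypothetical name.

```python
import re

# In the pipeline these would be read from the separators text file;
# hard-coded here to keep the sketch self-contained.
SEPARATORS = [":", "-"]

# The trailing '+' makes any run of consecutive separators match as one.
SEPARATOR_REGEX = re.compile(
    "(?:" + "|".join(re.escape(s) for s in SEPARATORS) + ")+"
)


def tag_separators(pattern):
    """Tag separator runs in the untagged (string) parts of a pattern."""
    out = []
    for item in pattern:
        if not isinstance(item, str):
            # Already tagged by an earlier feature: leave it untouched,
            # even if its value contains a separator character.
            out.append(item)
            continue
        pos = 0
        for match in SEPARATOR_REGEX.finditer(item):
            if match.start() > pos:
                out.append(item[pos:match.start()])
            # Keep the matched text as the value and use "-" as the tag.
            out.append([match.group(), "-"])
            pos = match.end()
        if pos < len(item):
            out.append(item[pos:])
    return out


print(tag_separators([["ase1", "GENE"], "::", ["NatR", "marker"]]))
# [['ase1', 'GENE'], ['::', '-'], ['NatR', 'marker']]
```

Running this step last means that a separator character inside an already-tagged feature is never split.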

@manulera
Owner Author

manulera commented Jul 16, 2022

Some more things to add to this version:

  • Create a folder in the root directory of the repo called analysis, in it:

    • Add the script that runs this new pipeline: fourth_version_pipeline.py such that it can be run like this:

      python fourth_version_pipeline.py ../Lab_strains/dey_lab/strains.tsv
      

      It should still save the output files in the directory where strains.tsv is.

      To handle the command-line arguments, use the if __name__ == "__main__": idiom in the script. To write the tests, simply import the main function (or other functions) from the script.

  • In the new file, put all function definitions at the top, and then the code that calls them in the main function. This helps readability. The main function should start with strain_list = build_strain_list('strains.tsv') (more or less).
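
The layout described in the points above could be sketched as follows. This is not the actual pipeline: the body of `build_strain_list` and the output file name `strains.json` are placeholders chosen for illustration, and only the overall structure (definitions first, then main, then the `__main__` guard) is the point.

```python
import argparse
import csv
import json
import os


def build_strain_list(strains_file):
    """Read the TSV into a list of row dicts (hypothetical stand-in)."""
    with open(strains_file, newline="") as handle:
        return list(csv.DictReader(handle, delimiter="\t"))


def main(strains_file):
    strain_list = build_strain_list(strains_file)
    # Outputs go next to the input file, not in the current directory.
    output_dir = os.path.dirname(os.path.abspath(strains_file))
    output_file = os.path.join(output_dir, "strains.json")  # placeholder name
    with open(output_file, "w") as handle:
        json.dump(strain_list, handle, indent=2)
    return output_file


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Run the pipeline on a strains file")
    # nargs="?" so that running without arguments prints the usage line
    # instead of exiting with an error.
    parser.add_argument("strains_file", nargs="?", help="path to strains.tsv")
    args = parser.parse_args()
    if args.strains_file is None:
        parser.print_usage()
    else:
        main(args.strains_file)
```

Because the argument parsing sits under the guard, a test can do `from fourth_version_pipeline import main` and call `main` on a small fixture file without touching the command line.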

@manulera
Owner Author

manulera commented Jul 16, 2022

  • Fix the typo in occurrences (it's two c and two r) in function names and filenames.
  • Write format.py for all labs. Some will require further editing; you can ask me about them once you check their strain lists. If you need to do some pre-processing, do it like in the example in tran_lab/format.py and store the result in an intermediary excel sheet called post_processed.xlsx, like in the example. That way you can call excel_to_tsv on that file.
  • Move data/strains.tsv to Lab_strains/nbrp_strains, rename it to strains_raw.tsv and write a format.py for that one as well. Commit the strains_raw.tsv file, since the data in it is public and we will use it for the documentation below.
  • Delete the trans_lab folder (the correct name is tran_lab).
  • Run your pipeline in all the Lab_strains folders and check that it does not fail for any of them.
  • Have a look at the common occurrences across labs and see if you can identify more patterns. When you do, add them to the appropriate allele_components toml file. If some don't fit anywhere, add them to the previous google doc and we can discuss.

@manulera
Owner Author

manulera commented Jul 16, 2022

  • Revise your code, and add some comments in the parts that you think will be harder to follow for someone looking at them for the first time. If you struggle to understand what a part of the code does, see if you can improve variable names, or add some concise comments.
  • Add a section to the readme called running the pipeline to explain what the scripts do, and the outputs they produce. Taking Lab_strains/nbrp_strains as an example, it should cover:
    • What format.py does and what you would have to do to write your own format.py for your spreadsheet.
    • What the analysis script does, and for each output file, what is in the file.

@manulera
Owner Author

manulera commented Aug 23, 2022

Hello Anamika, I noticed that to read the trees into nltk, we would need to change the output. For example, if we have ase1::NatR:

# The output currently looks like this:
[['ase1', 'GENE'], ['NatR','marker']]
# But it should look like this (note how the value of the strings is in a list):
[['GENE', ['ase1']], ['marker', ['NatR']]]
# So that we can make a tree from it:
Tree(*['GENE', ['ase1']])
# or
Tree('GENE', ['ase1'])

You will need to fix this in the pipeline and tests, sorry that we did not pick it up before. Note that the reason for this is to be able to capture nested things like p-Gene:

from nltk.tree import Tree
Tree(*["GENE", ["ase1"]])
Tree(*['PROMOTER', ['p', Tree(*['GENE', ['ase1']])]])
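
The change from the old output to the new one can be sketched as a small conversion, assuming tagged items are [value, tag] pairs and any untagged text is a plain string. `to_tree_format` is a hypothetical name, and the sketch deliberately avoids importing nltk so that it runs on its own.

```python
def to_tree_format(pattern):
    """Convert [value, tag] pairs into [tag, [value]] pairs."""
    converted = []
    for item in pattern:
        if isinstance(item, str):
            # Untagged text stays as it is.
            converted.append(item)
        else:
            value, tag = item
            converted.append([tag, [value]])
    return converted


old = [["ase1", "GENE"], ["NatR", "marker"]]
new = to_tree_format(old)
print(new)  # [['GENE', ['ase1']], ['marker', ['NatR']]]
```

Each converted item can then be unpacked into nltk as Tree(*item), as in the snippet above.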

@anamika-yadav99
Collaborator

anamika-yadav99 commented Aug 23, 2022

Hi @manulera, I have made the necessary changes to the code and readme. You can review now

@manulera
Owner Author

Hi @anamika-yadav99 wow that was fast! Good job! Also good job with the readme, it looks much better now!
