Forth version (using nltk) #30

manulera · 2022-07-15T08:23:06Z

In this next version, let's use the tagging syntax used by nltk.

For example, for pfus1-ase1-mCherry we currently generate pGENE-GENE-TAG. In nltk's format it should be:

['p',('fus1', 'GENE'),('-','-'),('ase1', 'GENE'),('-','-'),('mCherry','TAG')]

Initially I thought we could store this as plain text, but I think it's not worth the headache of escaping characters: the default separator is \, but I am sure we will encounter every thinkable separator. I would stick with JSON, something like this for each allele:

[
{
  "name": "pfus1-ase1-mCherry",
  "pattern": ["p",["fus1", "GENE"],["-","-"],["ase1", "GENE"],["-","-"],["mCherry","TAG"]]
}
]
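
As a rough illustration of this layout, here is a minimal sketch using Python's standard json module. The function name alleles_to_json and the dict-based input are assumptions made for the example, not part of the existing pipeline. Note that Python tuples are written as JSON arrays, which is why the pattern above uses lists.

```python
import json


def alleles_to_json(tagged_alleles):
    """Serialize {allele name: tagged pattern} into the JSON layout shown above.

    Hypothetical helper: tuples such as ('fus1', 'GENE') become
    JSON arrays, i.e. ["fus1", "GENE"].
    """
    records = [
        {"name": name, "pattern": pattern}
        for name, pattern in tagged_alleles.items()
    ]
    return json.dumps(records, indent=2)


example = {
    "pfus1-ase1-mCherry": [
        "p", ("fus1", "GENE"), ("-", "-"),
        ("ase1", "GENE"), ("-", "-"), ("mCherry", "TAG"),
    ]
}

# Reading the text back gives plain lists instead of tuples.
exported = json.loads(alleles_to_json(example))
```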

So for this version, write some code that exports the alleles to this format, and a test that verifies that it works on a few examples. To restore our previous patterns, which are also useful to generate the occurrences file, you can do something like this:

"".join(i if isinstance(i, str) else i[1] for i in pattern)

For the separators, make a list of known separators in a text file. For now I imagine it's mostly :-, but we will have to add more in the future. Any consecutive separators in an allele name should be counted as one, so use regex. They will have to be replaced after all other features, as some allele names contain separators in them. We could name the feature "SEPARATOR", but I think for readability it is better to use "-", and this is supported by nltk.
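
A possible sketch of the separator step, under some assumptions: the separator list here is hard-coded as ":" and "-" rather than read from the text file, the names build_separator_regex and tag_separators are made up for the example, and the pattern uses the [value, tag] pairs from the JSON above. Only the pieces that are still plain strings are scanned, so separators inside already-tagged features are left alone.

```python
import re

# Assumed contents of the separators text file (one separator per line).
KNOWN_SEPARATORS = [":", "-"]


def build_separator_regex(separators):
    """Compile a regex matching a run of one or more known separators."""
    # Longest first, so multi-character separators are tried before shorter ones.
    ordered = sorted(separators, key=len, reverse=True)
    alternatives = "|".join(re.escape(sep) for sep in ordered)
    return re.compile(f"(?:{alternatives})+")


def tag_separators(pattern, separator_regex):
    """Tag separator runs in the still-untagged string pieces of a pattern."""
    tagged = []
    for item in pattern:
        if not isinstance(item, str):
            tagged.append(item)  # already tagged by an earlier feature
            continue
        position = 0
        for match in separator_regex.finditer(item):
            if match.start() > position:
                tagged.append(item[position:match.start()])
            # A whole run of separators counts as a single "-" feature.
            tagged.append([match.group(), "-"])
            position = match.end()
        if position < len(item):
            tagged.append(item[position:])
    return tagged


separator_regex = build_separator_regex(KNOWN_SEPARATORS)
tagged_example = tag_separators(
    [["ase1", "GENE"], "::", ["NatR", "marker"]], separator_regex
)
```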


manulera · 2022-07-16T16:04:18Z

Some more things to add to this version:

Create a folder in the root directory of the repo called analysis, in it:
- Add the script that runs this new pipeline: fourth_version_pipeline.py such that it can be run like this:
```
python fourth_version_pipeline.py ../Lab_strains/dey_lab/strains.tsv
```
  It should still save the output files in the directory where strains.tsv is.
  
  To read the arguments, use the if __name__ == "__main__": syntax in the script. To write the tests, simply import the main function (or other functions) from the script.
- In the new file, put all function definitions at the top, and then the code that calls them in the main function. This helps readability. The main function should start with strain_list = build_strain_list('strains.tsv') (more or less).
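
The layout described above might look roughly like the sketch below. The body of build_strain_list is a placeholder invented for the example (it just splits the TSV into rows), and returning the output directory from main is only one way to make the function easy to test.

```python
import os
import sys


def build_strain_list(strains_file):
    """Placeholder for the real function: read the TSV into a list of rows."""
    with open(strains_file, encoding="utf-8") as handle:
        return [line.rstrip("\n").split("\t") for line in handle]


def main(strains_file):
    strain_list = build_strain_list(strains_file)
    # Output files go in the same directory as the input strains.tsv.
    output_dir = os.path.dirname(os.path.abspath(strains_file))
    # ... the remaining pipeline steps would write their files into output_dir
    return strain_list, output_dir


if __name__ == "__main__":
    if len(sys.argv) == 2:
        main(sys.argv[1])
    else:
        print("usage: python fourth_version_pipeline.py path/to/strains.tsv")
```

Because nothing runs at import time, a test can do from fourth_version_pipeline import main and call it on a small example file.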

manulera · 2022-07-16T16:17:57Z

- Fix the typo in "occurrences" (it has two c's and two r's) in function names and filenames.
- Write format.py for all labs. Some will require further editing; you can ask me about them once you have checked their strain lists. If you need to do some pre-processing, do it as in the example in tran_lab/format.py and store the result in an intermediate Excel sheet called post_processed.xlsx, as in the example. That way you can call excel_to_tsv on that one.
- Move data/strains.tsv to Lab_strains/nbrp_strains, rename it to strains_raw.tsv and write a format.py for that one as well. Commit the strains_raw.tsv file, since the data in it is public and we will use it for the documentation below.
- Delete the trans_lab folder (the correct name is tran_lab).
- Run your pipeline in all the Lab_strains folders and check that it does not fail for any of them.
- Have a look at the common occurrences across labs and see if you can identify more patterns. When you do, add them to the appropriate allele_components toml file. If some don't fit anywhere, add them to the previous Google doc and we can discuss.

manulera · 2022-07-16T16:24:03Z

Revise your code, and add some comments in the parts that you think will be harder to follow for someone looking at them for the first time. If you struggle to understand what a part of the code does, see if you can improve the variable names or add some concise comments.
Add a section to the readme called "Running the pipeline" to explain what the scripts do and the outputs they produce. Taking Lab_strains/nbrp_strains as the example, it should cover:
- What format.py does and what you would have to do to write your own format.py for your spreadsheet.
- What the analysis script does, and for each output file, what is in the file.

manulera · 2022-08-23T13:44:53Z

Hello Anamika, I noticed that to read the trees into nltk, we would need to change the output. For example, if we have ase1::NatR:

# The output currently looks like this:
[['ase1', 'GENE'], ['NatR','marker']]
# But it should look like this (note how the value of the strings is in a list):
[['GENE', ['ase1']], ['marker', ['NatR']]]
# So that we can make a tree from it:
Tree(*['GENE', ['ase1']])
# or
Tree('GENE', ['ase1'])

You will need to fix this in the pipeline and tests, sorry that we did not pick it up before. Note that the reason for this is to be able to capture nested things like p-Gene:

from nltk.tree import Tree
Tree(*["GENE", ["ase1"]])
Tree(*['PROMOTER', ['p', Tree(*['GENE', ['ase1']])]])
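
For the fix itself, a small conversion along these lines might be enough for the flat case. The name to_tree_args is hypothetical, and the sketch only handles the non-nested [value, tag] pairs the pipeline currently outputs; it uses plain Python, so it does not need nltk to run.

```python
def to_tree_args(pattern):
    """Convert [value, tag] pairs into [tag, [value]] pairs.

    Plain strings (untagged pieces) are kept unchanged.
    """
    return [
        item if isinstance(item, str) else [item[1], [item[0]]]
        for item in pattern
    ]


old_output = [["ase1", "GENE"], ["NatR", "marker"]]
new_output = to_tree_args(old_output)
# Each converted item can then be unpacked into nltk's Tree,
# e.g. Tree(*new_output[0]).
```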

anamika-yadav99 · 2022-08-23T16:36:18Z

Hi @manulera, I have made the necessary changes to the code and readme. You can review it now.

manulera · 2022-08-23T18:14:57Z

Hi @anamika-yadav99 wow that was fast! Good job! Also good job with the readme, it looks much better now!

manulera mentioned this issue Aug 5, 2022

Retreiving fluorescent protein data from a public API #32

Closed

5 tasks

anamika-yadav99 mentioned this issue Aug 15, 2022

Fourth version #33

Merged

manulera closed this as completed Aug 23, 2022
