scGPT padding and tokenization component #754
Conversation
* add script to download scgpt test resources
* Update resources_test_scripts/scgpt.sh
* add drive folders containing data and model
* chmod +x

Co-authored-by: Dries Schaumont <5946712+DriesSchaumont@users.noreply.github.com>
vocab_file = f"{meta['resources_dir']}/scgpt/source/vocab.json"
input_file = mu.read(input)

## START TEMPORARY WORKAROUND DATA PREPROCESSING
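For context, the snippet above loads the scGPT gene vocabulary (a JSON file mapping gene symbols and special tokens to integer ids, resolved relative to the viash `meta['resources_dir']`) and reads the input `.h5mu` with muon. A minimal, self-contained sketch of the vocabulary-loading half; the toy token names and ids are illustrative stand-ins for scGPT's actual `vocab.json` contents:

```python
import json
import tempfile

# Toy stand-in for scGPT's vocab.json: token -> integer id.
# "<pad>" and "<cls>" are special tokens used later during padding/tokenization.
toy_vocab = {"<pad>": 0, "<cls>": 1, "GENE_A": 2, "GENE_B": 3}

with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
    json.dump(toy_vocab, f)
    vocab_file = f.name

# Mirrors what the component does with meta['resources_dir'] + '/scgpt/source/vocab.json'
with open(vocab_file) as f:
    vocab = json.load(f)

pad_token_id = vocab["<pad>"]
print(pad_token_id)  # 0
```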
I think you copied this START TEMPORARY... tag from other scripts. It was used there because it was not possible to specify shared code (in the config, using test_resources:) to be used across multiple components via sys.path.append(meta['resources_dir']) on run environments that use Nextflow Fusion.
I would not use this tag here, because we are probably going to use a regex to remove those lines of code automatically once we fix this in viash.
I think here you want to either:
(1) save the intermediary result of the preprocessing to the test resources (and s3) and use it directly, or
(2) add some dummy output to be used as a test.
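For readers unfamiliar with the workaround being discussed above: a viash component can share helper code across components by shipping it as a resource file and appending the resource directory to the Python path at runtime. A hypothetical sketch of that pattern; the `meta` dict mimics what viash injects, and the `helpers` module name is illustrative:

```python
import os
import sys
import tempfile

# Simulate a viash-style resources directory containing shared helper code.
resources_dir = tempfile.mkdtemp()
with open(os.path.join(resources_dir, "helpers.py"), "w") as f:
    f.write("def greet():\n    return 'shared helper loaded'\n")

meta = {"resources_dir": resources_dir}  # viash provides this dict at runtime

# The workaround under discussion: make the shared code importable.
sys.path.append(meta["resources_dir"])
import helpers

print(helpers.greet())  # prints "shared helper loaded"
```

As noted in the review, this trick does not work on run environments that use Nextflow Fusion, which is why an alternative is being sought.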
Had to implement this because padding and tokenizing can only be done once certain other scGPT-specific pre-processing steps have been performed, but those components are not merged into the scGPT branch yet. My idea would be to merge all of these components into the scGPT branch and then update the scGPT_test_resources.sh script to save the intermediary data, so we avoid pushing too many files to s3 (#780)... WDYT?
Sounds good!
accidental review
LGTM!
* add module for scgpt padding and tokenization
* remove base requirement
* update changelog
* update component name
* expand unit tests, update script with loggers and todo
* fix unit tests
* remove annotation script
* run tests with subsampled data
* use specific model input files instead of directory
* remove unused binning script
* update layer names and handling
* Add script to download scgpt test resources (#750)
  * add script to download scgpt test resources
  * Update resources_test_scripts/scgpt.sh
  * add drive folders containing data and model
  * chmod +x
  Co-authored-by: Dries Schaumont <5946712+DriesSchaumont@users.noreply.github.com>
* preproc script
* preproc script
* tokenize and pad script
* tokenize and pad script
* embedding script
* test resourcers and evaluation script
* cross check gene set
* Fix retag for viash-hub not using correct namespace separator (#745)
* CI - Build: Fix second occurance of namespace separator (#746)
* script to download scgpt test data
* remove test resources script
* pad_tokenize module
* updat image
* remove test resources, update inputs
* use pytorch image
* remove integration component
* remove nvidia reqs
* remove load_model option
* adjust preprocessing script
* add scgpt full preproc module
* integration submodule
* integration submodule and add normalize_total flag
* add params
* update scanpy version
* remove branch irrelevant scripts
* update output handling
* update unit tests, add output compression
* update key name input output
* fix test
* update unit tests
* Update CHANGELOG.md
* add pars to logging

Co-authored-by: Dries Schaumont <5946712+DriesSchaumont@users.noreply.github.com>
Changelog
* module for padding + tokenizing data for scGPT integration (zero-shot or fine-tuning)
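The component described by this changelog entry maps each cell's expressed genes to vocabulary ids and pads the resulting sequences to a common batch length. A simplified sketch of that idea in plain Python, not scGPT's actual tokenizer; the gene names, the `<pad>`/`<cls>` tokens, and the max length are illustrative:

```python
# Simplified padding + tokenization, illustrating what the component does;
# the real implementation delegates to scGPT's own tokenizer utilities.
vocab = {"<pad>": 0, "<cls>": 1, "GENE_A": 2, "GENE_B": 3, "GENE_C": 4}

def tokenize_and_pad(cells, vocab, max_len):
    """Map each cell's gene symbols to ids, prepend <cls>, pad with <pad>."""
    batch = []
    for genes in cells:
        ids = [vocab["<cls>"]] + [vocab[g] for g in genes if g in vocab]
        ids = ids[:max_len]                             # truncate long cells
        ids += [vocab["<pad>"]] * (max_len - len(ids))  # pad short cells
        batch.append(ids)
    return batch

cells = [["GENE_A", "GENE_B"], ["GENE_C"]]
print(tokenize_and_pad(cells, vocab, max_len=4))
# [[1, 2, 3, 0], [1, 4, 0, 0]]
```

Padding to a fixed length is what lets a batch of cells with different numbers of expressed genes be stacked into one tensor for the model.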
Checklist before requesting a review
* I have performed a self-review of my code
* Conforms to the Contributor's guide
* Check the correct box. Does this PR contain:
* Proposed changes are described in the CHANGELOG.md
* CI tests succeed!