
scGPT padding and tokenization component #754

Merged
merged 45 commits into scgpt from tokenize-pad on Apr 19, 2024
Conversation

@dorien-er dorien-er commented Mar 20, 2024

Changelog

Add a module for padding and tokenizing data for scGPT integration (zero-shot or fine-tuning).
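
For context, a minimal sketch of the tokenize-and-pad step such a component wraps, using scGPT's public tokenizer API. The file names, the "rna" modality, the "binned" layer, the `gene_name` column, and `max_len` are illustrative assumptions, not the component's actual defaults:

```python
import numpy as np
import mudata as mu
from scgpt.tokenizer import GeneVocab, tokenize_and_pad_batch

vocab = GeneVocab.from_file("vocab.json")   # scGPT gene vocabulary
vocab.set_default_index(vocab["<pad>"])     # map unknown genes to <pad>

adata = mu.read("input.h5mu").mod["rna"]    # assumes an "rna" modality
genes = adata.var["gene_name"].tolist()     # assumes this var column exists
gene_ids = np.array(vocab(genes), dtype=int)

tokenized = tokenize_and_pad_batch(
    adata.layers["binned"],   # binned counts from scGPT preprocessing (dense)
    gene_ids,
    max_len=1200,             # illustrative sequence length
    vocab=vocab,
    pad_token="<pad>",
    pad_value=-2,             # value used for padded positions
    append_cls=True,          # prepend the <cls> token used for cell embeddings
    include_zero_gene=False,  # tokenize only nonzero genes per cell
)
# tokenized["genes"] and tokenized["values"] are padded torch tensors
```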

Checklist before requesting a review

  • I have performed a self-review of my code

  • Conforms to the Contributor's guide

  • Check the correct box. Does this PR contain:

    • Breaking changes
    • New functionality
    • Major changes
    • Minor changes
    • Documentation
    • Bug fixes
  • Proposed changes are described in the CHANGELOG.md

  • CI tests succeed!

@dorien-er dorien-er changed the title add module for scgpt padding and tokenization scGPT padding and tokenization component Mar 22, 2024
CHANGELOG.md (resolved)
src/scgpt/pad_tokenize/script.py (resolved)
src/scgpt/pad_tokenize/script.py (outdated, resolved)
vocab_file = f"{meta['resources_dir']}/scgpt/source/vocab.json"
input_file = mu.read(input)

## START TEMPORARY WORKAROUND DATA PREPROCESSING
Member

I think you copied this START TEMPORARY... tag from other scripts. It was used there because it was not possible to specify shared code (declared via `test_resources:` in the config) for use across multiple components with `sys.path.append(meta['resources_dir'])` on run environments that use Nextflow Fusion.

I would not use this tag here because we are probably going to use a regex to remove the lines of code automatically when we fix this in viash.

I think here you want to either:
(1) Save the intermediate result of the preprocessing to the test resources (and S3) and use it directly.
(2) Add some dummy output to be used as a test.
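
For reference, a minimal sketch of the shared-test-code pattern referred to above; `helpers` and `make_dummy_mudata` are hypothetical names for shared test resources declared under `test_resources:` in the component config:

```python
import sys

# `meta` is the dict viash injects into test scripts at run time
sys.path.append(meta["resources_dir"])

from helpers import make_dummy_mudata  # hypothetical shared helper

mdata = make_dummy_mudata()  # build a small dummy MuData object for the test
```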

Collaborator Author

I had to implement this because padding and tokenizing can only be done once certain other scGPT-specific preprocessing steps have been performed, but those components are not merged into the scGPT branch yet. So my idea would be to merge all of these components into scGPT first, and then update the scGPT_test_resources.sh script to save intermediate temporary data and avoid pushing too many files to S3 (#780)... WDYT?

Member

Sounds good!

dorien-er and others added 2 commits April 18, 2024 13:24
Co-authored-by: Dries Schaumont <5946712+DriesSchaumont@users.noreply.github.com>
Collaborator Author

@dorien-er dorien-er left a comment


accidental review

Member

@DriesSchaumont DriesSchaumont left a comment


LGTM!

@DriesSchaumont DriesSchaumont merged commit 5bec37a into scgpt Apr 19, 2024
7 checks passed
@DriesSchaumont DriesSchaumont deleted the tokenize-pad branch April 24, 2024 09:02
dorien-er added a commit that referenced this pull request Apr 25, 2024
* add module for scgpt padding and tokenization

* remove base requirement

* update changelog

* update component name

* expand unit tests, update script with loggers and todo

* fix unit tests

* remove annotation script

* run tests with subsampled data

* use specific model input files instead of directory

* remove unused binning script

* update layer names and handling

* Add script to download scgpt test resources (#750)

* add script to download scgpt test resources

* Update resources_test_scripts/scgpt.sh

Co-authored-by: Dries Schaumont <5946712+DriesSchaumont@users.noreply.github.com>

* add drive folders containing data and model

* chmod +x

---------

Co-authored-by: Dries Schaumont <5946712+DriesSchaumont@users.noreply.github.com>

* preproc script

* preproc script

* tokenize and pad script

* tokenize and pad script

* embedding script

* test resources and evaluation script

* cross check gene set

* Fix retag for viash-hub not using correct namespace separator (#745)

* CI - Build: Fix second occurance of namespace separator (#746)

* script to download scgpt test data

* remove test resources script

* pad_tokenize module

* update image

* remove test resources, update inputs

* use pytorch image

* remove integration component

* remove nvidia reqs

* remove load_model option

* adjust preprocessing script

* add scgpt full preproc module

* integration submodule

* integration submodule and add normalize_total flag

* add params

* update scanpy version

* remove branch irrelevant scripts

* update output handling

* update unit tests, add output compression

* update key name input output

* fix test

* update unit tests

* Update CHANGELOG.md

Co-authored-by: Dries Schaumont <5946712+DriesSchaumont@users.noreply.github.com>

* add pars to logging

---------

Co-authored-by: Dries Schaumont <5946712+DriesSchaumont@users.noreply.github.com>