
scGPT padding and tokenization component #754

Merged
merged 45 commits into scgpt from tokenize-pad on Apr 19, 2024
Conversation

@dorien-er dorien-er commented Mar 20, 2024

Changelog

Add a module for padding and tokenizing data for scGPT integration (zero-shot or fine-tuning).
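
For context, a minimal sketch of the tokenize-and-pad step such a component wraps, using scGPT's public tokenizer API. The file names, the "rna" modality, the "binned" layer, the `gene_name` column, and `max_len` are illustrative assumptions, not the component's actual defaults:

```python
import numpy as np
import mudata as mu
from scgpt.tokenizer import GeneVocab, tokenize_and_pad_batch

vocab = GeneVocab.from_file("vocab.json")   # scGPT gene vocabulary
vocab.set_default_index(vocab["<pad>"])     # map unknown genes to <pad>

adata = mu.read("input.h5mu").mod["rna"]    # assumes an "rna" modality
genes = adata.var["gene_name"].tolist()     # assumes this var column exists
gene_ids = np.array(vocab(genes), dtype=int)

tokenized = tokenize_and_pad_batch(
    adata.layers["binned"],   # binned counts from scGPT preprocessing (dense)
    gene_ids,
    max_len=1200,             # illustrative sequence length
    vocab=vocab,
    pad_token="<pad>",
    pad_value=-2,             # value used for padded positions
    append_cls=True,          # prepend the <cls> token used for cell embeddings
    include_zero_gene=False,  # tokenize only nonzero genes per cell
)
# tokenized["genes"] and tokenized["values"] are padded torch tensors
```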

Checklist before requesting a review

  • I have performed a self-review of my code

  • Conforms to the Contributor's guide

  • Check the correct box. Does this PR contain:

    • Breaking changes
    • New functionality
    • Major changes
    • Minor changes
    • Documentation
    • Bug fixes
  • Proposed changes are described in the CHANGELOG.md

  • CI tests succeed!

@dorien-er dorien-er changed the title add module for scgpt padding and tokenization scGPT padding and tokenization component Mar 22, 2024
CHANGELOG.md (resolved)
src/scgpt/pad_tokenize/script.py (resolved)
src/scgpt/pad_tokenize/script.py (outdated, resolved)
vocab_file = f"{meta['resources_dir']}/scgpt/source/vocab.json"
input_file = mu.read(input)

## START TEMPORARY WORKAROUND DATA PREPROCESSING
Member

I think you copied this START TEMPORARY... tag from other scripts. It was used there because it was not possible to specify shared code (declared via `test_resources:` in the config) for use across multiple components with `sys.path.append(meta['resources_dir'])` on run environments that use Nextflow Fusion.

I would not use this tag here because we are probably going to use a regex to remove the lines of code automatically when we fix this in viash.

I think here you want to either:
(1) Save the intermediate result of the preprocessing to the test resources (and S3) and use it directly.
(2) Add some dummy output to be used as a test.
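
For reference, a minimal sketch of the shared-test-code pattern referred to above; `helpers` and `make_dummy_mudata` are hypothetical names for shared test resources declared under `test_resources:` in the component config:

```python
import sys

# `meta` is the dict viash injects into test scripts at run time
sys.path.append(meta["resources_dir"])

from helpers import make_dummy_mudata  # hypothetical shared helper

mdata = make_dummy_mudata()  # build a small dummy MuData object for the test
```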

Collaborator Author

I had to implement this because padding and tokenizing can only be done once certain other scGPT-specific preprocessing steps have been performed, but those components are not merged into the scGPT branch yet. So my idea would be to merge all of these components into scGPT first, and then update the scGPT_test_resources.sh script to save intermediate temporary data and avoid pushing too many files to S3 (#780)... WDYT?

Member

Sounds good!

dorien-er and others added 2 commits April 18, 2024 13:24
Co-authored-by: Dries Schaumont <5946712+DriesSchaumont@users.noreply.github.com>
Collaborator Author

@dorien-er dorien-er left a comment


accidental review

Member

@DriesSchaumont DriesSchaumont left a comment


LGTM!

@DriesSchaumont DriesSchaumont merged commit 5bec37a into scgpt Apr 19, 2024
7 checks passed
@DriesSchaumont DriesSchaumont deleted the tokenize-pad branch April 24, 2024 09:02
dorien-er added a commit that referenced this pull request Apr 25, 2024
* add module for scgpt padding and tokenization

* remove base requirement

* update changelog

* update component name

* expand unit tests, update script with loggers and todo

* fix unit tests

* remove annotation script

* run tests with subsampled data

* use specific model input files instead of directory

* remove unused binning script

* update layer names and handling

* Add script to download scgpt test resources (#750)

* add script to download scgpt test resources

* Update resources_test_scripts/scgpt.sh

Co-authored-by: Dries Schaumont <5946712+DriesSchaumont@users.noreply.github.com>

* add drive folders containing data and model

* chmod +x

---------

Co-authored-by: Dries Schaumont <5946712+DriesSchaumont@users.noreply.github.com>

* preproc script

* preproc script

* tokenize and pad script

* tokenize and pad script

* embedding script

* test resources and evaluation script

* cross check gene set

* Fix retag for viash-hub not using correct namespace separator (#745)

* CI - Build: Fix second occurance of namespace separator (#746)

* script to download scgpt test data

* remove test resources script

* pad_tokenize module

* update image

* remove test resources, update inputs

* use pytorch image

* remove integration component

* remove nvidia reqs

* remove load_model option

* adjust preprocessing script

* add scgpt full preproc module

* integration submodule

* integration submodule and add normalize_total flag

* add params

* update scanpy version

* remove branch irrelevant scripts

* update output handling

* update unit tests, add output compression

* update key name input output

* fix test

* update unit tests

* Update CHANGELOG.md

Co-authored-by: Dries Schaumont <5946712+DriesSchaumont@users.noreply.github.com>

* add pars to logging

---------

Co-authored-by: Dries Schaumont <5946712+DriesSchaumont@users.noreply.github.com>