
Feature/improve speed and limit memory #11

Merged
merged 153 commits into master from feature/improve_speed_and_limit_memory on Apr 12, 2023

Conversation

sambenfredj
Collaborator

No description provided.

sambenfredj and others added 30 commits February 4, 2022 15:40
- add max_workers to cli arguments
- add poetry
max_workers arg:

See merge request msaid/chimerys/mokapot!1
…nto 'master'

Feature/multithreaded grid search and triqler update

See merge request msaid/chimerys/mokapot!2
- add weights as argument to cmd
- add save_weights function
- save weights if cmd is set
save weights:

See merge request msaid/chimerys/mokapot!5
* expose max workers parameter to cli

* update the triqler package to fix an issue where calculating PEPs was extremely slow

* make grid search run with multithreading (see the sketch after this entry)

* test max_workers cli argument

* fix format

* update triqler

Co-authored-by: Siegfried Gessulat <s.gessulat@gmail.com>
Co-authored-by: Tobias <tobias.schmidt@msaid.de>
sync upstream fork

See merge request msaid/chimerys/mokapot!3
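
A minimal sketch, assuming the grid search is scikit-learn based, of how a max_workers CLI value could be wired through to parallel grid search; the helper name and estimator are illustrative, not mokapot's actual internals.

    # illustrative wiring of a max_workers CLI value into sklearn
    from sklearn.model_selection import GridSearchCV
    from sklearn.svm import LinearSVC

    def make_grid_search(param_grid, max_workers=1):
        # n_jobs controls how many parameter settings are fit in parallel
        return GridSearchCV(LinearSVC(), param_grid, n_jobs=max_workers)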
- add init_weights cmd argument to use the load_model function, which imports a model from saved weights or a saved model
- if the init_weights argument is set, import a pretrained model instead of initializing a new one (see the sketch after this entry)
- if the model is already trained, skip the training process
- use psms.features.values instead of psms.features because the latter fails for the decision function
- define and calculate the variable feat_pass to fix a failure in the _get_starting_labels function when using a trained model
- replace the dummy scaler with StandardScaler
- fit the scaler before predictions
import weights:

See merge request msaid/chimerys/mokapot!4
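
A rough sketch of the import-weights flow described above, assuming joblib-persisted models; load_or_train is a hypothetical helper, not mokapot's load_model.

    import joblib

    def load_or_train(init_weights, train_fn):
        if init_weights:                       # --init_weights was passed
            model = joblib.load(init_weights)  # reuse the pretrained model
            return model, True                 # True: skip the training loop
        return train_fn(), False               # otherwise train from scratch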
update version

See merge request msaid/chimerys/mokapot!6
update dependencies

See merge request msaid/chimerys/mokapot!7
update version to 1.0.2

See merge request msaid/chimerys/mokapot!8
- load all saved models from training folds
- check the input model in the brew function: if it is a list, we don't run training
- remove saving weights, as we use only saved models
- remove changes related to using saved weights from _get_starting_labels function
sort models after training to have deterministic results with multithreading

See merge request msaid/chimerys/mokapot!9
make importing weights produce the same results as a training run with mokapot:

Closes #4

See merge request msaid/chimerys/mokapot!10
update version

See merge request msaid/chimerys/mokapot!11
Fix random state for SVM

See merge request msaid/chimerys/mokapot!12
sambenfredj and others added 26 commits March 7, 2023 16:22
- add new flag rescale to activate rescaling
- add new flag subset_max_rescale to limit the max number of psms for rescaling
- subset data to subset_max_rescale and rescale after loading each model
- add new flag to activate ensemble mode
- add new function to predict all psms using all trained models and average the results for the final scores (see the sketch after this entry)
rescale data when loading pretrained models

See merge request msaid/chimerys/mokapot!39
ensemble prediction

Closes #47

See merge request msaid/chimerys/mokapot!40
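
A minimal sketch of the ensemble idea from the entry above, assuming each trained model exposes a scikit-learn style decision_function; the helper name is illustrative.

    import numpy as np

    def ensemble_scores(models, features):
        # score every psm with every model, then average for the final score
        per_model = np.stack([m.decision_function(features) for m in models])
        return per_model.mean(axis=0)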
- fix deprecation warnings for pandas
fix warnings

Closes #48

See merge request msaid/chimerys/mokapot!41
move some log messages from info to debug level

Closes #44 and #46

See merge request msaid/chimerys/mokapot!42
skip deduplication version for markus

See merge request msaid/chimerys/mokapot!44
skip deduplication version for markus

See merge request msaid/chimerys/mokapot!46
bump mokapot version

See merge request msaid/chimerys/mokapot!47
The result files were only created when they did not exist, which
could lead to duplicated results, or to invalid result files when two
searches with different column counts were concatenated.
Now the result files are always overwritten.
There was a warning when the subsampling option was chosen but the data
was not big enough to be subsampled. It makes more sense to log that not
all of the data was used; this now happens at the info level. It is not a
warning, since the config option was actively chosen, so there is no real
reason to warn about it.
The logic was changed so that sampling only takes place when the training
data size is strictly bigger than the sample size; before, it would also
have sampled on equality. This can change some results (when
subset_size == data_size), since before the indices were sampled and now
the indices are sorted. (A sketch of the new rule follows this entry.)
Resolve "Mokapot should not append to its result and log files"

Closes wfondrie#53

See merge request msaid/chimerys/mokapot!48
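
A sketch of the revised subsampling rule, assuming a numpy Generator as rng and illustrative names: subsample only when the data is strictly larger than subset_size, and keep the chosen indices sorted.

    import numpy as np

    def subsample_idx(n_rows, subset_size, rng):
        if n_rows > subset_size:  # strictly bigger: subsample
            return np.sort(rng.choice(n_rows, size=subset_size, replace=False))
        return np.arange(n_rows)  # equal or smaller: use all rows, in order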
- fix use of multiple file input
- fix use of aggregate flag
- revert gridsearch to use one thread because it uses a lot of memory
…one values (the current value led to high RAM consumption, and even with a reduced chunk size the runtime is the same)
Sync with upstream

Closes #19

See merge request msaid/chimerys/mokapot!50
- create new class OnDiskPsmDataset to stream data from disk and move associated functions to this class (see the sketch after this entry)
- fix bug of file_root flag and prefixes for output files
- fix bug of group_column and GroupedConfidence
- fix docstrings for refactored functions and classes
- refactor CrossLinkedConfidence to work with the new class OnDiskPsmDataset
- remove `concat_chunks` function and use pd.concat directly
- log subsetting psms for training only when the total psms count is bigger than the subset size
- replace variable p with _psms for iteration over psms objects
- import only required functions from the util file
…t' into 'develop'

refactor implementation of psms streaming object

Closes wfondrie#62

See merge request msaid/chimerys/mokapot!51
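
A sketch of the streaming pattern behind OnDiskPsmDataset: read the PIN file in fixed-size chunks instead of loading it whole. The chunk size and helper name are illustrative, not mokapot's actual API.

    import pandas as pd

    def iter_psm_chunks(pin_path, columns, chunk_size=100_000):
        # pandas yields DataFrames of at most chunk_size rows at a time
        for chunk in pd.read_csv(
            pin_path, sep="\t", usecols=columns, chunksize=chunk_size
        ):
            yield chunk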
@sambenfredj sambenfredj merged commit 2f879e5 into master Apr 12, 2023
@sambenfredj sambenfredj deleted the feature/improve_speed_and_limit_memory branch July 18, 2023 12:40
gessulat pushed a commit that referenced this pull request Feb 27, 2024
Improve speed and limit memory consumption

- stream input files for inference
- add feature: skip deduplication
- add feature: ensemble model
- add feature: rescale input before inference with pre-trained models
gessulat added a commit that referenced this pull request Sep 18, 2024
…ation (wfondrie#119)

* 💄 lint mokapot

* 💄 lints tests

* 💄 fixes format with ruff

- adds line break in dataset.py
- updates call of ruff in CI
- updates pyproject.toml according to new ruff api

* 💄 fixes format with ruff

- adds line break in dataset.py
- updates call of ruff in CI
- updates pyproject.toml according to new ruff api

* 💄 make ruff and black happy together

* Fix problems with nnls

* Feature/improve speed and limit memory (#11)

Improve speed and limit memory consumption

- stream input files for inference
- add feature: skip deduplication
- add feature: ensemble model
- add feature: rescale input before inference with pre-trained models

* 💄 linting (#12)

:lipstick: fix linting

* Fix bugs (#17)

- fix bug where member variables were not assigned when the model is not trained
- allow throwing when the input file is malformed: remove skipping of bad lines from the pandas read function

* fix test model: remove subset_max_train from percolator model (#18)

* Fix test brew: (#20)

- Create new object of OnDiskPsmDataset to use for brew tests
- Update brew function outputs and assert statements

* fix test datasets: (#19)

- remove assign confidence tests because datasets don't have assign confidence methods anymore
- add eval_fdr value to the _update_labels function

* Fix test confidence (#22)

* Fix test confidence:
- fix bugs for grouped confidence
- fix test_one_group: create file using psm_df_1000 to create OnDiskPsmDataset.
- remove test_pickle because confidence does not return dataframe results anymore.
- add test_multi_groups to test that different group results are saved correctly.

* fix bugs:
- overwrite default fdr for update_labels function
- return dataframe for psm_df_1000 to use with LinearPsmDataset

* Fix cli tests: (#28)

- Remove test_cli_pepxml because xml files don't work with streaming
- Replace old output file names
- Add random generator 'rng' variable to confidence since it is required for proteins
- Remove subset_max_train from PluginModel
- Fix bug: convert targets column after reading in chunks
- Fix peptide column name for confidence
- Fix test cli plugins: replace DecisionTreeClassifier with LinearSVC because DecisionTreeClassifier returns scores as 0 or 1

* Fix system tests: (#29)

- Refactor test structure : Separate brew and confidence functions, read results from output files.
- Fix bugs: fix output columns for proteins, sort proteins data by score

* Fix parser pin test: (#30)

- Add label value to initial direction because it has to be a numerical value
- The read_pin function does not return a dataframe anymore
- Compare output of read_pin function to example dataframe

* Add tests: (#31)

- Add skip_deduplication flag test
- Add ensemble flag test
- Add rescale flag test
- Fix bug: remove target_column variable from read file for read_data_for_rescale

* Fix writer tests: (#32)

- Remove writer tests with confidence object because LinearPsmDataset does not have an assign_confidence method anymore and results are streamed to output files while computing confidence

* fix "no psms found" error during training: if no psms passed the fdr threshold, raise an error that the model performed worse (#33)

* Introduce new executable and bug fixes

* Create new executable to aggregate psms to peptides.
* Fix bugs:
- fix "no psms found" error during training: if no psms passed the fdr threshold, raise an error that the model performed worse
- raise an error when pep values are all equal to 1
- prefix paths with dest_dir to not pollute the workdir
- catch all errors to prevent error traces from breaking structured logging
- fix parallelism in parse_in_chunks to respect max_workers
- fix nondeterminism
- fix small column chunk bug
- fix bug when using multiple input files
* Fix and add tests:
- remove writer tests with confidence object because LinearPsmDataset does not have an assign_confidence method anymore and results are streamed to output files while computing confidence
- add test for the new function "get_unique_peptides_from_psms"
- add cli test for aggregatePsmsToPeptides

* ✨ force ci re-run

* 💄 lint mokapot

* 💄 lints tests

* 💄 fixes format with ruff

- adds line break in dataset.py
- updates call of ruff in CI
- updates pyproject.toml according to new ruff api

* 💄 fixes format with ruff

- adds line break in dataset.py
- updates call of ruff in CI
- updates pyproject.toml according to new ruff api

* 💄 make ruff and black happy together

* ✨ removed deprecated error ignore

* Fix two boolean conditions in nnls algorithm

* Set tolerance to fixed value in fit_nnls to avoid non-convergence

* Adjust unittest for hist_nnls to new error cases
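
For context on the nnls commits above, a hedged illustration of the underlying primitive only: non-negative least squares via SciPy. mokapot's fit_nnls wrapper and its tolerance handling are not reproduced here.

    import numpy as np
    from scipy.optimize import nnls

    A = np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 2.0]])
    b = np.array([1.0, 2.0, 3.0])
    coef, residual = nnls(A, b)  # coef is constrained to be >= 0 elementwise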

* Add documentation and test for create_chunks

* Make cli unit tests for aggregateP2P easier debuggable

* Improve test for peptide_csv in test_utils

* Improve and test convert_targets_column function

And fix some minor pep8 issues

* Enable switching in system tests from subprocess to direct calls

* Fix cli system and utils tests

* Fix unit tests

* Add documentation and test for create_chunks

* Make cli unit tests for aggregateP2P easier debuggable

* Improve test for peptide_csv in test_utils

* Improve and test convert_targets_column function

And fix some minor pep8 issues

* Enable switching in system tests from subprocess to direct calls

* Fix cli system and utils tests

* Fix unit tests

* parquet reader for mokapot

* merge sort function adapted for parquet

* brew function adapted for parquet input

* confidence assignment modified for parquet format

* merge sort chunk size added as constant

* update label func modified for parquet

* main function uses format arg to choose between csv and parquet

* pyarrow added to dependencies
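
A minimal sketch of the k-way merge idea behind the merge_sort commits above, assuming each intermediate chunk file is already sorted by the key column; loading whole chunks via pandas is a simplification, the real reader streams rows.

    import heapq
    import pandas as pd

    def merged_rows(chunk_paths, key="score"):
        iterators = [
            (row for _, row in pd.read_parquet(p).iterrows())
            for p in chunk_paths
        ]
        # heapq.merge lazily interleaves the pre-sorted iterators
        yield from heapq.merge(*iterators, key=lambda row: row[key])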

* Change conversion of target column values

Revert it to "old style" because otherwise some tests break

* fixed failing tests

* added new tests for parquet

* refactor: Add type hinting to tuplize function

* refactor: Refactor find_column(s) functions and use it in read_percolator

* refactor: Insert some newlines for improved readability

* refactor: Insert some newlines for improved readability

* refactor: Insert some newlines for improved readability
refactor: Insert some newlines for improved readability

* refactor: Remove redundant test case for case-sensitive column matching.

* Add typeguard

* refactored unchunked file reader for parquet and csv

* Add map_columns_to_indices function and more type checking

* Make debugging dataframe issues easier in unit test

by adding some pandas config options to pytest_sessionstart in
conftest.py

* Fix test_utils

* Add level_columns to OnDiskPsmDataset

* Rename deduplication to do_rollup

* Change deduplication to do_rollup

* Fix pin reading by adding level_columns

* Revert "Merge branch 'feature/parquet_parser' into 'main'"

This reverts merge request !39

* Move peptides.csv to dest_dir and remove it where it's created

* Save changes

* Get rid of path manipulation via strings

* Clean up more path related stuff

* Correct documentation of return values of brew function

* Simplify and generalize path definitions in confidence.py

* Move confidence related functions to confidence

* Fix problem with parameter lists in cli tests

* Add checking of column names for OnDiskPsmDataset

* Fix column index stuff

* Disable parallel unit tests when debugging

* Refactor the confidence.to_txt function

* Fix chunked reader to read in column order as passed to the function

* Add comments and put temp file naming for merge-sorting in one place

* Add test for chunked reader

* Fix warning in read_file_in_chunks test

* Remove unnecessary conversion (back and fro) in confidence.py

* Add pyarrow as a dependency

* Remove superfluous conversions and some more superfluous stuff

* Improve find_column* and use consistently in pin parser

* Introduce tabbed reader and writers (for csv for now)

* Correct buggy import statement

* Fix bug and add test related to --aggregate flag

* Remove ignoring of warnings

Warnings should not be ignored, except when the verbosity is set to the
error level. If there are annoying, known warnings that are handled
appropriately, they should be ignored locally on a case-by-case basis.

* Comment regarding to_txt function (can be removed)

* Add file type detection to readers and writers

* Ignore warning in PIN reader (locally now)

* Improve column ordering/mapping and add unit test

* Correct targets conversion function and fix offending unit tests

* Improve on label updating and type safety

* Correct output capturing in test helper

* Use TabbedWriter in save_sorted_metadata_chunks

* Progress towards rollup

* Improve map_columns_to_indices for dicts

* Add stringify methods to tabbed readers

* Change assign_confidence for rollup

* Fix column ordering problem (temp)

* Add tests for the rollup

* Fix a bug in the rollup unit test

* Rename pcms to precursors

* Remove a now superfluous function and comments

* squashed sqlite writer branch

* failing test fixed

* unused function get_unique_peptides_from_psms removed

* format interpreted implicitly using filename

* Confidencewriter class implemented

* sqlite path changed to Path type

* parquet module deleted and integrated into read_pin

* failing tests fixed

* instantiation done before returning object

* PSM_PEP column name changed to POSTERIOR_ERROR_PROBABILITY

* add pipeline status for main branch

* ✨ fixes wfondrie#54

* Do some cosmetics

* Remove rescale stuff

Was broken and unused anyway

* Renamed TabbedFileReader and Writer to TabularDataReader and Writer

* Separate general tabular data and confidence writer stuff

* Fix import bug in confidence

* Fix another import bug

Due to pycharm refactoring... sigh...

* Add better unit test for (chunked) confidence

* Make chunked confidence unit test fail with small chunk size

* Fix confidence chunk size bug

* test case added for sqlite writer

* test data added for sqlite writer

* Add option for suppressing warnings

* Revert the change in confidence and adapt unit test

CONFIDENCE_CHUNK_SIZE is now directly modified from the unit test.

* prepare tables sqlite db added as helper func

* Fix problems with sqlite after the merge

* Remove all group related stuff

* Remove crosslink stuff

* Remove plugins

* Remove skipped tests and skip marks

* Do some minor cleanup

* Add type inference and tests for tabular data

* Add reader and tests for in memory dataframe reader

* Add streaming module

* Add checks to merged reader

* Add creation of dataframe reader from series and arrays

* Add JoinedTabularData and tests

* Add column renaming for tabular data readers

* Add context manager to tabular data writer and make confidence writer a function

* Get rid of all kinds of warnings during tests

* Fix another warning

* Add context manager to TabularDataWriter

* Fix a problem with indexing in the merged reader

* Add buffering to writers

* Correct problem in unit test with log output and typechecking

* Add functionality to add computed columns to TabularData

* Add method to get an associated reader from a writer

* Add cli and test for the rollup

* Fix underscore problem

* Fix problem with typechecked/contextmanager order

* Remove typechecking from auto_finalizer for the moment

* Fix path problem in rollup unit test

* Show rollup levels not found only if non-empty

* Simplify sqlite connection in unit tests

* Add new suffixes for csv

* Change options: add src_dir and remove keep_decoys

* Add files for rollup testing

* buffered write for parquet intermediary files implemented

* tests updated for parquet writing

* fixed aggregatePsmsToPeptides cli

* test data structure changed to list of dicts to match merge sort output structure

* test data updated to be dataframe readable

* Remove aggregatePsmsToPeptides

* Move remove_columns function to tabular_data

* Fix program name in cli output

* Let brew_rollup also search for parquet files

* Use csv or parquet suffix also for temp and output files

* Filter a warning in the system tests

* Make the column types a bit more lenient

* fixed rollup app for parquets

* Fix unclosed files problem

* Remove unused parameter target_column from merge_sort

* Remove superfluous passing of sep

* Change tabs to colons in protein(s) column of pin file

* Test parquet merge_sort more extensively

* Unify csv and parquet methods in merge_sort

* Simplify get_row_iterator

* Make brew rollup faster

* Fix bug in MergedTabularReader

* Fix problem with last line in buffering

* Fix problem with type conversions in merge_sort

(And make it a bit less ugly...)

* ✨ addresses @jspaezp suggestions from PR wfondrie#119

- removes MSAID internal status banner in README
- removes @TypeChecked for brew() (we can reintroduce it when we have a
  typed version of brew())
- uses yield from pattern
- uses dictionaries to assert for correct column-name and type mappings
- removes installation of sample-plugin from tests

* ✨ addresses review suggestions

- adds back typechecked for brew
- uses parameterized tests for test_peps_hist_nnls

* Add check for length of mokapot output file

* Make test for output file length more "elastic"

* Revert documentation on psms parameter for brew function.

And add some additional info.

* Change f-string to normal string where unnecessary

* Remove types_from_dataframe function, since unnecessary

* (chore) updated cicd, ruff, black and tests (#42)

* (chore) updated cicd, ruff, black and tests
* resolved comment question
by @jspaezp

* ✨ fix setting scores when training failed

* Readd previously commented out check for feature columns

* ✨ proper docstring for `sqlite_db_path`

* Add doc strings for write_confidence (and improve types)

* Improve documentation and type hints of assign_confidence

* Add class and module documentation for the tabular data classes

* Feature/remove nnls patch (#43)

* ✨ remove patched nnls (using fixed scipy version now)
* 💄 linting; line breaks

Co-authored-by: Elmar Zander <elmar@zandere.de>

* Fix/windows tests (#44)

* Remove nnls patch and fix scipy version (with fixed nnls)

* ✨ remove patched nnls (using fixed scipy version now)

* 💄 linting; line breaks

* 🚑 fix dtypes to ensure windows ci to pass

---------

Co-authored-by: Elmar Zander <elmar@zandere.de>

* Fix/windows tests (#45)

* Remove nnls patch and fix scipy version (with fixed nnls)

* ✨ remove patched nnls (using fixed scipy version now)

* 💄 linting; line breaks

* 🚑 fix dtypes to ensure windows ci to pass

* 💄  linting

---------

Co-authored-by: Elmar Zander <elmar@zandere.de>

* ✨ draft of pin to tsv converter

* ✨ adds is_valid_tsv

* ✨ adds tsv verification for pin files and conversion

* 📝 remove print

* 🔥 add required default for --dest_dir

- using Mokapot without --dest_dir was broken before this new default

---------

Co-authored-by: Siegfried Gessulat <siegfried@msaid.de>
Co-authored-by: Elmar Zander <elmar@zandere.de>
Co-authored-by: sambenfredj <100685091+sambenfredj@users.noreply.github.com>
Co-authored-by: Elmar Zander <elmar.zander@toptal.com>
Co-authored-by: Vishal Sukumar <vishal.sukumar@msaid.de>
Co-authored-by: Florian Seefried <florian.seefried@msaid.de>
Co-authored-by: Graber Michael <michael.graber@msaid.de>
Co-authored-by: Tobias Schmidt <tobias.schmidt@msaid.de>
Co-authored-by: J. Sebastian Paez <jspaezp@users.noreply.github.com>