
Feature/improve speed and limit memory #11

Merged
merged 153 commits into master from feature/improve_speed_and_limit_memory on Apr 12, 2023

Conversation

sambenfredj
Collaborator

No description provided.

sambenfredj and others added 30 commits February 4, 2022 15:40
- add max_workers to cli arguments
- add poetry
max_workers arg:

See merge request msaid/chimerys/mokapot!1
…nto 'master'

Feature/multithreaded grid search and triqler update

See merge request msaid/chimerys/mokapot!2
- add weights as argument to cmd
- add save_weights function
- save weights if cmd is set
save weights:

See merge request msaid/chimerys/mokapot!5
* expose max workers parameter to cli

* update the triqler package to fix an issue where calculating PEPs was extremely slow

* make grid search run with multithreading (see the sketch after this entry)

* test max_workers cli argument

* fix format

* update triqler

Co-authored-by: Siegfried Gessulat <s.gessulat@gmail.com>
Co-authored-by: Tobias <tobias.schmidt@msaid.de>
sync upstream fork

See merge request msaid/chimerys/mokapot!3
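
A minimal sketch, assuming the grid search is scikit-learn based, of how a max_workers CLI value could be wired through to parallel grid search; the helper name and estimator are illustrative, not mokapot's actual internals.

    # illustrative wiring of a max_workers CLI value into sklearn
    from sklearn.model_selection import GridSearchCV
    from sklearn.svm import LinearSVC

    def make_grid_search(param_grid, max_workers=1):
        # n_jobs controls how many parameter settings are fit in parallel
        return GridSearchCV(LinearSVC(), param_grid, n_jobs=max_workers)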
- add init_weights cmd argument to use the load_model function, which imports a model from saved weights or a saved model
- if the init_weights argument is set, import a pretrained model instead of initializing a new one (see the sketch after this entry)
- if the model is already trained, skip the training process
- use psms.features.values instead of psms.features because the latter fails for the decision function
- define and calculate the variable feat_pass to fix a failure in the _get_starting_labels function when using a trained model
- replace the dummy scaler with StandardScaler
- fit the scaler before predictions
import weights:

See merge request msaid/chimerys/mokapot!4
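
A rough sketch of the import-weights flow described above, assuming joblib-persisted models; load_or_train is a hypothetical helper, not mokapot's load_model.

    import joblib

    def load_or_train(init_weights, train_fn):
        if init_weights:                       # --init_weights was passed
            model = joblib.load(init_weights)  # reuse the pretrained model
            return model, True                 # True: skip the training loop
        return train_fn(), False               # otherwise train from scratch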
update version

See merge request msaid/chimerys/mokapot!6
update dependencies

See merge request msaid/chimerys/mokapot!7
update version to 1.0.2

See merge request msaid/chimerys/mokapot!8
- load all saved models from training folds
- check the input model in the brew function: if it is a list, we don't run training
- remove saving weights, as we use only saved models
- remove changes related to using saved weights from _get_starting_labels function
sort models after training to have deterministic results with multithreading

See merge request msaid/chimerys/mokapot!9
make importing weights produce the same results as a training run with mokapot:

Closes #4

See merge request msaid/chimerys/mokapot!10
update version

See merge request msaid/chimerys/mokapot!11
Fix random state for SVM

See merge request msaid/chimerys/mokapot!12
sambenfredj and others added 26 commits March 7, 2023 16:22
- add new flag rescale to activate rescaling
- add new flag subset_max_rescale to limit the max number of psms for rescaling
- subset data to subset_max_rescale and rescale after loading each model
- add new flag to activate ensemble mode
- add new function to predict all psms using all trained models and average the results for the final scores (see the sketch after this entry)
rescale data when loading pretrained models

See merge request msaid/chimerys/mokapot!39
ensemble prediction

Closes #47

See merge request msaid/chimerys/mokapot!40
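
A minimal sketch of the ensemble idea from the entry above, assuming each trained model exposes a scikit-learn style decision_function; the helper name is illustrative.

    import numpy as np

    def ensemble_scores(models, features):
        # score every psm with every model, then average for the final score
        per_model = np.stack([m.decision_function(features) for m in models])
        return per_model.mean(axis=0)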
- fix deprecation warnings for pandas
fix warnings

Closes #48

See merge request msaid/chimerys/mokapot!41
move some log messages from info to debug level

Closes #44 and #46

See merge request msaid/chimerys/mokapot!42
skip deduplication version for markus

See merge request msaid/chimerys/mokapot!44
skip deduplication version for markus

See merge request msaid/chimerys/mokapot!46
bump mokapot version

See merge request msaid/chimerys/mokapot!47
The result files were only created when they did not exist, which
could lead to duplicated results, or to invalid result files when two
searches with different column counts were concatenated.
Now the result files are always overwritten.
There was a warning when the subsampling option was chosen but the data
was not big enough to be subsampled. It makes more sense to log that not
all of the data was used; this now happens at the info level. It is not a
warning, since the config option was actively chosen, so there is no real
reason to warn about it.
The logic was changed so that sampling only takes place when the training
data size is strictly bigger than the sample size; before, it would also
have sampled on equality. This can change some results (when
subset_size == data_size), since before the indices were sampled and now
the indices are sorted. (A sketch of the new rule follows this entry.)
Resolve "Mokapot should not append to its result and log files"

Closes wfondrie#53

See merge request msaid/chimerys/mokapot!48
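
A sketch of the revised subsampling rule, assuming a numpy Generator as rng and illustrative names: subsample only when the data is strictly larger than subset_size, and keep the chosen indices sorted.

    import numpy as np

    def subsample_idx(n_rows, subset_size, rng):
        if n_rows > subset_size:  # strictly bigger: subsample
            return np.sort(rng.choice(n_rows, size=subset_size, replace=False))
        return np.arange(n_rows)  # equal or smaller: use all rows, in order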
- fix use of multiple file input
- fix use of aggregate flag
- revert gridsearch to use one thread because it uses a lot of memory
…one values (the current value led to high RAM consumption, and even with a reduced chunk size the runtime is the same)
Sync with upstream

Closes #19

See merge request msaid/chimerys/mokapot!50
- create new class OnDiskPsmDataset to stream data from disk and move associated functions to this class (see the sketch after this entry)
- fix bug of file_root flag and prefixes for output files
- fix bug of group_column and GroupedConfidence
- fix docstrings for refactored functions and classes
- refactor CrossLinkedConfidence to work with the new class OnDiskPsmDataset
- remove `concat_chunks` function and use pd.concat directly
- log subsetting psms for training only when the total psms count is bigger than the subset size
- replace variable p with _psms for iteration over psms objects
- import only required functions from the util file
…t' into 'develop'

refactor implementation of psms streaming object

Closes wfondrie#62

See merge request msaid/chimerys/mokapot!51
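
A sketch of the streaming pattern behind OnDiskPsmDataset: read the PIN file in fixed-size chunks instead of loading it whole. The chunk size and helper name are illustrative, not mokapot's actual API.

    import pandas as pd

    def iter_psm_chunks(pin_path, columns, chunk_size=100_000):
        # pandas yields DataFrames of at most chunk_size rows at a time
        for chunk in pd.read_csv(
            pin_path, sep="\t", usecols=columns, chunksize=chunk_size
        ):
            yield chunk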
@sambenfredj sambenfredj merged commit 2f879e5 into master Apr 12, 2023
@sambenfredj sambenfredj deleted the feature/improve_speed_and_limit_memory branch July 18, 2023 12:40
gessulat pushed a commit that referenced this pull request Feb 27, 2024
Improve speed and limit memory consumption

- stream input files for inference
- add feature: skip deduplication
- add feature: ensemble model
- add feature: rescale input before inference with pre-trained models
gessulat added a commit that referenced this pull request Sep 18, 2024
…ation (wfondrie#119)

* 💄 lint mokapot

* 💄 lints tests

* 💄 fixes format with ruff

- adds line break in dataset.py
- updates call of ruff in CI
- updates pyproject.toml according to new ruff api

* 💄 fixes format with ruff

- adds line break in dataset.py
- updates call of ruff in CI
- updates pyproject.toml according to new ruff api

* 💄 make ruff and black happy together

* Fix problems with nnls

* Feature/improve speed and limit memory (#11)

Improve speed and limit memory consumption

- stream input files for inference
- add feature: skip deduplication
- add feature: ensemble model
- add feature: rescale input before inference with pre-trained models

* 💄 linting (#12)

:lipstick: fix linting

* Fix bugs (#17)

- fix bug where member variables were not assigned when the model is not trained
- allow throwing when the input file is malformed: remove skipping of bad lines from the pandas read function

* fix test model: remove subset_max_train from percolator model (#18)

* Fix test brew: (#20)

- Create new object of OnDiskPsmDataset to use for brew tests
- Update brew function outputs and assert statements

* fix test datasets: (#19)

- remove assign confidence tests because datasets don't have assign confidence methods anymore
- add eval_fdr value to the _update_labels function

* Fix test confidence (#22)

* Fix test confidence:
- fix bugs for grouped confidence
- fix test_one_group: create file using psm_df_1000 to create OnDiskPsmDataset.
- remove test_pickle because confidence does not return dataframe results anymore.
- add test_multi_groups to test that different group results are saved correctly.

* fix bugs:
- overwrite default fdr for update_labels function
- return dataframe for psm_df_1000 to use with LinearPsmDataset

* Fix cli tests: (#28)

- Remove test_cli_pepxml because xml files don't work with streaming
- Replace old output file names
- Add random generator 'rng' variable to confidence since it is required for proteins
- Remove subset_max_train from PluginModel
- Fix bug: convert targets column after reading in chunks
- Fix peptide column name for confidence
- Fix test cli plugins: replace DecisionTreeClassifier with LinearSVC because DecisionTreeClassifier returns scores as 0 or 1

* Fix system tests: (#29)

- Refactor test structure : Separate brew and confidence functions, read results from output files.
- Fix bugs: fix output columns for proteins, sort proteins data by score

* Fix parser pin test: (#30)

- Add label value to initial direction because it has to be a numerical value
- The read_pin function does not return a dataframe anymore
- Compare output of read_pin function to example dataframe

* Add tests: (#31)

- Add skip_deduplication flag test
- Add ensemble flag test
- Add rescale flag test
- Fix bug: remove target_column variable from read file for read_data_for_rescale

* Fix writer tests: (#32)

- Remove writer tests with confidence object because LinearPsmDataset does not have an assign_confidence method anymore and results are streamed to output files while computing confidence

* fix "no psms found" error during training: if no psms passed the fdr threshold, raise an error that the model performed worse (#33)

* Introduce new executable and bug fixes

* Create new executable to aggregate psms to peptides.
* Fix bugs:
- fix "no psms found" error during training: if no psms passed the fdr threshold, raise an error that the model performed worse
- raise an error when pep values are all equal to 1
- prefix paths with dest_dir to not pollute the workdir
- catch all errors to prevent error traces from breaking structured logging
- fix parallelism in parse_in_chunks to respect max_workers
- fix nondeterminism
- fix small column chunk bug
- fix bug when using multiple input files
* Fix and add tests:
- remove writer tests with confidence object because LinearPsmDataset does not have an assign_confidence method anymore and results are streamed to output files while computing confidence
- add test for the new function "get_unique_peptides_from_psms"
- add cli test for aggregatePsmsToPeptides

* ✨ force ci re-run

* 💄 lint mokapot

* 💄 lints tests

* 💄 fixes format with ruff

- adds line break in dataset.py
- updates call of ruff in CI
- updates pyproject.toml according to new ruff api

* 💄 fixes format with ruff

- adds line break in dataset.py
- updates call of ruff in CI
- updates pyproject.toml according to new ruff api

* 💄 make ruff and black happy together

* ✨ removed deprecated error ignore

* Fix two boolean conditions in nnls algorithm

* Set tolerance to fixed value in fit_nnls to avoid non-convergence

* Adjust unittest for hist_nnls to new error cases
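
For context on the nnls commits above, a hedged illustration of the underlying primitive only: non-negative least squares via SciPy. mokapot's fit_nnls wrapper and its tolerance handling are not reproduced here.

    import numpy as np
    from scipy.optimize import nnls

    A = np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 2.0]])
    b = np.array([1.0, 2.0, 3.0])
    coef, residual = nnls(A, b)  # coef is constrained to be >= 0 elementwise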

* Add documentation and test for create_chunks

* Make cli unit tests for aggregateP2P easier debuggable

* Improve test for peptide_csv in test_utils

* Improve and test convert_targets_column function

And fix some minor pep8 issues

* Enable switching in system tests from subprocess to direct calls

* Fix cli system and utils tests

* Fix unit tests

* Add documentation and test for create_chunks

* Make cli unit tests for aggregateP2P easier debuggable

* Improve test for peptide_csv in test_utils

* Improve and test convert_targets_column function

And fix some minor pep8 issues

* Enable switching in system tests from subprocess to direct calls

* Fix cli system and utils tests

* Fix unit tests

* parquet reader for mokapot

* merge sort function adapted for parquet

* brew function adapted for parquet input

* confidence assignment modified for parquet format

* merge sort chunk size added as constant

* update label func modified for parquet

* main function uses format arg to choose between csv and parquet

* pyarrow added to dependencies
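
A minimal sketch of the k-way merge idea behind the merge_sort commits above, assuming each intermediate chunk file is already sorted by the key column; loading whole chunks via pandas is a simplification, the real reader streams rows.

    import heapq
    import pandas as pd

    def merged_rows(chunk_paths, key="score"):
        iterators = [
            (row for _, row in pd.read_parquet(p).iterrows())
            for p in chunk_paths
        ]
        # heapq.merge lazily interleaves the pre-sorted iterators
        yield from heapq.merge(*iterators, key=lambda row: row[key])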

* Change conversion of target column values

Revert it to "old style" because otherwise some tests break

* fixed failing tests

* added new tests for parquet

* refactor: Add type hinting to tuplize function

* refactor: Refactor find_column(s) functions and use it in read_percolator

* refactor: Insert some newlines for improved readability

* refactor: Insert some newlines for improved readability

* refactor: Insert some newlines for improved readability
refactor: Insert some newlines for improved readability

* refactor: Remove redundant test case for case-sensitive column matching.

* Add typeguard

* refactored unchunked file reader for parquet and csv

* Add map_columns_to_indices function and more type checking

* Make debugging dataframe issues easier in unit test

by adding some pandas config options to pytest_sessionstart in
conftest.py

* Fix test_utils

* Add level_columns to OnDiskPsmDataset

* Rename deduplication to do_rollup

* Change deduplication to do_rollup

* Fix pin reading by adding level_columns

* Revert "Merge branch 'feature/parquet_parser' into 'main'"

This reverts merge request !39

* Move peptides.csv to dest_dir and remove it where it's created

* Save changes

* Get rid of path manipulation via strings

* Clean up more path related stuff

* Correct documentation of return values of brew function

* Simplify and generalize path definitions in confidence.py

* Move confidence related functions to confidence

* Fix problem with parameter lists in cli tests

* Add checking of column names for OnDiskPsmDataset

* Fix column index stuff

* Disable parallel unit tests when debugging

* Refactor the confidence.to_txt function

* Fix chunked reader to read in column order as passed to the function

* Add comments and put temp file naming for merge-sorting in one place

* Add test for chunked reader

* Fix warning in read_file_in_chunks test

* Remove unnecessary conversion (back and fro) in confidence.py

* Add pyarrow as a dependency

* Remove superfluous conversions and some more superfluous stuff

* Improve find_column* and use consistently in pin parser

* Introduce tabbed reader and writers (for csv for now)

* Correct buggy import statement

* Fix bug and add test related to --aggregate flag

* Remove ignoring of warnings

Warnings should not be ignored, except when the verbosity is set to the
error level. If there are annoying, known warnings that are handled
appropriately, they should be ignored locally on a case-by-case basis.

* Comment regarding to_txt function (can be removed)

* Add file type detection to readers and writers

* Ignore warning in PIN reader (locally now)

* Improve column ordering/mapping and add unit test

* Correct targets conversion function and fix offending unit tests

* Improve on label updating and type safety

* Correct output capturing in test helper

* Use TabbedWriter in save_sorted_metadata_chunks

* Progress towards rollup

* Improve map_columns_to_indices for dicts

* Add stringify methods to tabbed readers

* Change assign_confidence for rollup

* Fix column ordering problem (temp)

* Add tests for the rollup

* Fix a bug in the rollup unit test

* Rename pcms to precursors

* Remove a now superfluous function and comments

* squashed sqlite writer branch

* failing test fixed

* unused function get_unique_peptides_from_psms removed

* format interpreted implicitly using filename

* Confidencewriter class implemented

* sqlite path changed to Path type

* parquet module deleted and integrated into read_pin

* failing tests fixed

* instantiation done before returning object

* PSM_PEP column name changed to POSTERIOR_ERROR_PROBABILITY

* add pipeline status for main branch

* ✨ fixes wfondrie#54

* Do some cosmetics

* Remove rescale stuff

Was broken and unused anyway

* Renamed TabbedFileReader and Writer to TabularDataReader and Writer

* Separate general tabular data and confidence writer stuff

* Fix import bug in confidence

* Fix another import bug

Due to pycharm refactoring... sigh...

* Add better unit test for (chunked) confidence

* Make chunked confidence unit test fail with small chunk size

* Fix confidence chunk size bug

* test case added for sqlite writer

* test data added for sqlite writer

* Add option for suppressing warnings

* Revert the change in confidence and adapt unit test

CONFIDENCE_CHUNK_SIZE is now directly modified from the unit test.

* prepare tables sqlite db added as helper func

* Fix problems with sqlite after the merge

* Remove all group related stuff

* Remove crosslink stuff

* Remove plugins

* Remove skipped tests and skip marks

* Do some minor cleanup

* Add type inference and tests for tabular data

* Add reader and tests for in memory dataframe reader

* Add streaming module

* Add checks to merged reader

* Add creation of dataframe reader from series and arrays

* Add JoinedTabularData and tests

* Add column renaming for tabular data readers

* Add context manager to tabular data writer and make confidence writer a function

* Get rid of all kinds of warnings during tests

* Fix another warning

* Add context manager to TabularDataWriter

* Fix a problem with indexing in the merged reader

* Add buffering to writers

* Correct problem in unit test with log output and typechecking

* Add functionality to add computed columns to TabularData

* Add method to get an associated reader from a writer

* Add cli and test for the rollup

* Fix underscore problem

* Fix problem with typechecked/contextmanager order

* Remove typechecking from auto_finalizer for the moment

* Fix path problem in rollup unit test

* Show rollup levels not found only if non-empty

* Simplify sqlite connection in unit tests

* Add new suffixes for csv

* Change options: add src_dir and remove keep_decoys

* Add files for rollup testing

* buffered write for parquet intermediary files implemented

* tests updated for parquet writing

* fixed aggregatePsmsToPeptides cli

* test data structure changed to list of dicts to match merge sort output structure

* test data updated to be dataframe readable

* Remove aggregatePsmsToPeptides

* Move remove_columns function to tabular_data

* Fix program name in cli output

* Let brew_rollup also search for parquet files

* Use csv or parquet suffix also for temp and output files

* Filter a warning in the system tests

* Make the column types a bit more lenient

* fixed rollup app for parquets

* Fix unclosed files problem

* Remove unused parameter target_column from merge_sort

* Remove superfluous passing of sep

* Change tabs to colons in protein(s) column of pin file

* Test parquet merge_sort more extensively

* Unify csv and parquet methods in merge_sort

* Simplify get_row_iterator

* Make brew rollup faster

* Fix bug in MergedTabularReader

* Fix problem with last line in buffering

* Fix problem with type conversions in merge_sort

(And make it a bit less ugly...)

* ✨ addresses @jspaezp suggestions from PR wfondrie#119

- removes MSAID internal status banner in README
- removes @TypeChecked for brew() (we can reintroduce it when we have a
  typed version of brew())
- uses yield from pattern
- uses dictionaries to assert for correct column-name and type mappings
- removes installation of sample-plugin from tests

* ✨ addresses review suggestions

- adds back typechecked for brew
- uses parameterized tests for test_peps_hist_nnls

* Add check for length of mokapot output file

* Make test for output file length more "elastic"

* Revert documentation on psms parameter for brew function.

And add some additional info.

* Change f-string to normal string where unnecessary

* Remove types_from_dataframe function, since unnecessary

* (chore) updated cicd, ruff, black and tests (#42)

* (chore) updated cicd, ruff, black and tests
* resolved comment question
by @jspaezp

* ✨ fix setting scores when training failed

* Readd previously commented out check for feature columns

* ✨ proper docstring for `sqlite_db_path`

* Add doc strings for write_confidence (and improve types)

* Improve documentation and type hints of assign_confidence

* Add class and module documentation for the tabular data classes

* Feature/remove nnls patch (#43)

* ✨ remove patched nnls (using fixed scipy version now)
* 💄 linting; line breaks

Co-authored-by: Elmar Zander <elmar@zandere.de>

* Fix/windows tests (#44)

* Remove nnls patch and fix scipy version (with fixed nnls)

* ✨ remove patched nnls (using fixed scipy version now)

* 💄 linting; line breaks

* 🚑 fix dtypes to ensure windows ci to pass

---------

Co-authored-by: Elmar Zander <elmar@zandere.de>

* Fix/windows tests (#45)

* Remove nnls patch and fix scipy version (with fixed nnls)

* ✨ remove patched nnls (using fixed scipy version now)

* 💄 linting; line breaks

* 🚑 fix dtypes to ensure windows ci to pass

* 💄  linting

---------

Co-authored-by: Elmar Zander <elmar@zandere.de>

* ✨ draft of pin to tsv converter

* ✨ adds is_valid_tsv

* ✨ adds tsv verification for pin files and conversion

* 📝 remove print

* 🔥 add required default for --dest_dir

- using Mokapot without --dest_dir was broken before this new default

---------

Co-authored-by: Siegfried Gessulat <siegfried@msaid.de>
Co-authored-by: Elmar Zander <elmar@zandere.de>
Co-authored-by: sambenfredj <100685091+sambenfredj@users.noreply.github.com>
Co-authored-by: Elmar Zander <elmar.zander@toptal.com>
Co-authored-by: Vishal Sukumar <vishal.sukumar@msaid.de>
Co-authored-by: Florian Seefried <florian.seefried@msaid.de>
Co-authored-by: Graber Michael <michael.graber@msaid.de>
Co-authored-by: Tobias Schmidt <tobias.schmidt@msaid.de>
Co-authored-by: J. Sebastian Paez <jspaezp@users.noreply.github.com>