Added ability to perform stratified data splits #831

Conversation

scap3yvt
Collaborator

Fixes #829

Proposed Changes

  • Added a new module under utils, data_splitter, that lets a user perform either a stratified k-fold split or a normal (non-stratified) one.
  • Used the new module in training_manager.
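
To illustrate the difference between the two split modes mentioned above, here is a minimal standard-library sketch. It is not the code added in this PR: the function name `k_fold_indices`, its parameters, and the example labels are all hypothetical, and the actual data_splitter module may work quite differently. The stratified branch deals each class out across the folds so that every fold keeps roughly the same class proportions, while the normal branch ignores labels entirely.

```python
import random
from collections import defaultdict
from typing import Hashable, List, Sequence, Tuple


def k_fold_indices(
    labels: Sequence[Hashable], k: int, stratified: bool = True, seed: int = 0
) -> List[Tuple[List[int], List[int]]]:
    """Return k (train_indices, validation_indices) pairs.

    Hypothetical helper for illustration; not the GaNDLF data_splitter API.
    """
    if k < 2:
        raise ValueError("k must be at least 2")
    rng = random.Random(seed)
    folds: List[List[int]] = [[] for _ in range(k)]

    if stratified:
        # Group row indices by class label, then deal each class out
        # round-robin so every fold gets a near-equal share of each class.
        by_label = defaultdict(list)
        for index, label in enumerate(labels):
            by_label[label].append(index)
        offset = 0
        for label in sorted(by_label, key=repr):
            members = by_label[label]
            rng.shuffle(members)
            for position, index in enumerate(members):
                folds[(offset + position) % k].append(index)
            # Continue from where the previous class stopped so that
            # overall fold sizes stay balanced as well.
            offset += len(members)
    else:
        # Normal k-fold: shuffle all rows and deal them out, ignoring labels.
        indices = list(range(len(labels)))
        rng.shuffle(indices)
        for position, index in enumerate(indices):
            folds[position % k].append(index)

    splits = []
    for held_out in range(k):
        validation = sorted(folds[held_out])
        train = sorted(
            i for f, fold in enumerate(folds) if f != held_out for i in fold
        )
        splits.append((train, validation))
    return splits


# Made-up example: 8 rows of one class and 4 of another, split 4 ways.
labels = ["benign"] * 8 + ["malignant"] * 4
splits = k_fold_indices(labels, k=4, stratified=True)
for train, validation in splits:
    counts = {
        c: sum(labels[i] == c for i in validation)
        for c in ("benign", "malignant")
    }
    print(counts)  # each fold holds 2 benign and 1 malignant row
```

With `stratified=False` the same call still yields four folds of three rows each, but the class mix per fold depends on the shuffle, which is the situation a stratified split is meant to avoid on imbalanced data.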

Checklist

  • CONTRIBUTING guide has been followed.
  • PR is based on the current GaNDLF master.
  • Non-breaking change (does not break existing functionality): provide as many details as possible for any breaking change.
  • Function/class source code documentation added/updated (ensure typing is used to provide type hints, including, but not limited to, using Optional if a variable has a pre-defined value).
  • Code has been blacked for style consistency and linting.
  • If applicable, version information has been updated in GANDLF/version.py.
  • If adding a git submodule, add to list of exceptions for black styling in pyproject.toml file.
  • Usage documentation has been updated, if appropriate.
  • Tests added or modified to cover the changes; if coverage is reduced, please give explanation.
  • If customized dependency installation is required (i.e., a separate pip install step is needed for PR to be functional), please ensure it is reflected in all the files that control the CI, namely: python-test.yml, and all docker files [1,2,3].

Contributor

github-actions bot commented Mar 21, 2024

MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

@scap3yvt scap3yvt marked this pull request as draft March 21, 2024 19:47

codecov bot commented Mar 23, 2024

Codecov Report

Attention: Patch coverage is 96.47887%, with 5 lines in your changes missing coverage. Please review.

Project coverage is 95.09%. Comparing base (0be31c2) to head (b003c3c).
Report is 1 commit behind head on master.

❗ Current head b003c3c differs from pull request most recent head 1d2352b. Consider uploading reports for the commit 1d2352b to get more accurate results.

Files                         Patch %   Lines
GANDLF/training_manager.py    84.84%    5 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #831      +/-   ##
==========================================
+ Coverage   95.01%   95.09%   +0.07%     
==========================================
  Files         120      121       +1     
  Lines        8270     8312      +42     
==========================================
+ Hits         7858     7904      +46     
+ Misses        412      408       -4     
Flag        Coverage Δ
unittests   95.09% <96.47%> (+0.07%) ⬆️


@scap3yvt scap3yvt marked this pull request as ready for review March 23, 2024 02:10
Collaborator

@sarthakpati sarthakpati left a comment

Add another test, please.

GANDLF/utils/data_splitter.py (outdated, resolved)
…iontesting-csv-with-proportional-splits' into 828-feature-add-the-ability-to-split-csvs-for-trainingvalidationtesting-as-a-separate-script
GANDLF/cli/data_split_saver.py (outdated, resolved)
gandlf_splitCSV (outdated, resolved)
@Geeks-Sid
Collaborator

@scap3yvt tag me when it is ready to review.

@sarthakpati
Collaborator

> @scap3yvt tag me when it is ready to review.

This should be ready for review, @Geeks-Sid

…for-trainingvalidationtesting-as-a-separate-script
…for-trainingvalidationtesting-as-a-separate-script
@sarthakpati sarthakpati merged commit 32d70d4 into mlcommons:master Mar 26, 2024
19 checks passed
@github-actions github-actions bot locked and limited conversation to collaborators Mar 26, 2024
Successfully merging this pull request may close these issues.

[FEATURE] Add the ability to generate training/validation/testing CSV with proportional splits
3 participants