Use percent slices for splits #883

Merged
aakashdp6548 merged 3 commits into master from data-splits
Jul 19, 2023

aakashdp6548 (Collaborator) commented Jul 18, 2023

Fixes #880. Parsing full slices ("0:80/80:90/90:100") was easier than parsing relative values ("80/10/10"), so I went with that for now; it also allows for more flexibility. I'm not sure whether we need more sophisticated input validation. Let me know if you think I should add it.
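
To make the relationship between the two formats concrete, here is a minimal sketch of how a relative spec could be rewritten as a full-slice spec. The helper name `relative_to_slices` is hypothetical and is not part of this PR; it only illustrates that every relative spec has an equivalent slice spec.

```python
from itertools import accumulate


def relative_to_slices(relative: str) -> str:
    """Rewrite relative percentages ("80/10/10") as full slices ("0:80/80:90/90:100").

    Hypothetical helper for illustration only; not part of the PR.
    """
    sizes = [int(part) for part in relative.split("/")]
    stops = list(accumulate(sizes))  # cumulative upper bounds, e.g. [80, 90, 100]
    starts = [0] + stops[:-1]  # each split starts where the previous one stops
    return "/".join(f"{start}:{stop}" for start, stop in zip(starts, stops))


print(relative_to_slices("80/10/10"))  # 0:80/80:90/90:100
```

The reverse is not always possible: a slice spec such as "0:50/50:60/90:100" leaves a gap between validation and test, which the relative form cannot express. That is the extra flexibility mentioned above.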

aakashdp6548 marked this pull request as ready for review July 18, 2023 20:36

jeff-regier (Contributor) left a comment

Nice! Thanks.

codecov bot commented Jul 19, 2023

Codecov Report

Merging #883 (8c73c60) into master (355f5db) will decrease coverage by 0.07%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master     #883      +/-   ##
==========================================
- Coverage   95.65%   95.59%   -0.07%     
==========================================
  Files          21       21              
  Lines        2256     2247       -9     
==========================================
- Hits         2158     2148      -10     
- Misses         98       99       +1     
Flag       Coverage Δ
unittests  95.59% <100.00%> (-0.07%) ⬇️

Impacted Files                        Coverage Δ
bliss/api.py                          87.17% <100.00%> (-0.14%) ⬇️
bliss/simulator/simulated_dataset.py  91.20% <100.00%> (-1.23%) ⬇️

zhixiangteoh (Contributor) left a comment

Good stuff! Thanks!

        for idx in self.val_split_file_idxs:
            filename = f"{self.file_prefix}_{idx}.pt"
            self.valid += self.read_file(f"{self.cached_data_path}/{filename}")
    def pct_to_idx(self, x, length):
Contributor

Nit: can we just have this be percent_to_idx?

Collaborator Author

Fixed
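
For readers without the diff at hand, a minimal sketch of what a helper with the suggested name might look like. The body is an assumption: the thread shows only the signature `pct_to_idx(self, x, length)`, and the version below is a free function that truncates toward zero, which may differ from the merged code.

```python
def percent_to_idx(percent: float, length: int) -> int:
    """Map a percentage in [0, 100] to an index into a sequence of `length` items.

    Assumed behavior: truncate toward zero. The real method may round differently.
    """
    return int(length * percent / 100)


items = list(range(50))
train = items[percent_to_idx(0, len(items)):percent_to_idx(80, len(items))]
print(len(train))  # 40
```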

            filename = f"{self.file_prefix}_{idx}.pt"
            self.test += self.read_file(f"{self.cached_data_path}/{filename}")
    def parse_slices(self, splits: str, length: int):
        slices = [slice(0, 0) for _ in range(3)]  # default to empty slice for each split
Contributor

Should we default to 100% for train?

Collaborator Author

The only time I can see that being needed is if a user passes in an empty string for splits, which imo should be considered bad input. In that case I think failing is the right result, instead of silently setting it to 100%.
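
A rough sketch of the fail-loudly behavior described here, written as a standalone function rather than the method in the diff. The explicit checks and error messages are assumptions added for illustration; the merged code may simply let the parsing error propagate.

```python
from typing import List


def parse_slices(splits: str, length: int) -> List[slice]:
    """Parse "0:80/80:90/90:100" into train/val/test index slices.

    Illustrative only: the validation below is an assumption, not the PR's code.
    """
    if not splits.strip():
        # Treat an empty spec as bad input instead of silently using 100% for train.
        raise ValueError("splits must not be empty, e.g. '0:80/80:90/90:100'")
    parts = splits.split("/")
    if len(parts) > 3:
        raise ValueError(f"expected at most 3 splits, got {len(parts)}")

    slices = [slice(0, 0) for _ in range(3)]  # unspecified splits stay empty
    for i, part in enumerate(parts):
        # A malformed part raises ValueError from float() or from the unpacking.
        start_percent, stop_percent = (float(p) for p in part.split(":"))
        if not 0 <= start_percent <= stop_percent <= 100:
            raise ValueError(f"invalid percent range: {part!r}")
        slices[i] = slice(
            int(length * start_percent / 100),
            int(length * stop_percent / 100),
        )
    return slices


print(parse_slices("0:80/80:90/90:100", 50))
```

With this shape, `parse_slices("", 50)` raises `ValueError`, while `parse_slices("0:100", 10)` gives all ten items to train and leaves the other two splits empty.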

    def parse_slices(self, splits: str, length: int):
        slices = [slice(0, 0) for _ in range(3)]  # default to empty slice for each split
        for i, data_split in enumerate(splits.split("/")):
            # map "start_pct:stop_pct" to slice(start_idx, stop_idx)
Contributor

Nit: let's use _percent here (for readability)

Collaborator Author

Fixed

aakashdp6548 merged commit 05dfd35 into master Jul 19, 2023
3 checks passed
aakashdp6548 deleted the data-splits branch July 19, 2023 14:21
Development

Successfully merging this pull request may close these issues.

Specify train/val/test splits as percentages instead of n_batches or file indices
3 participants