Modifications to generation process #762

Closed
aakashdp6548 opened this issue May 15, 2023 · 1 comment · Fixed by #765
Labels: not high priority, refactoring (refactoring code or quick fixes)

Comments

aakashdp6548 (Collaborator) commented May 15, 2023

Making this issue to compile a list of minor tweaks we could/should make to the data generation process:

  • Log the parameters used to generate the cached data to the directory where it is saved
  • Instead of specifying the number of batches in the config, only require the total number of images desired and the batch size; the number of batches can then be calculated at runtime
  • Similarly, remove checks for a certain amount of data - just load whatever is given
    assert len(self.data) >= self.n_batches * self.batch_size, (
        f"Insufficient cached data loaded; "
        f"need at least {self.n_batches * self.batch_size} "
        f"but only have {len(self.data)}. Re-run `generate.py` with "
        f"different generation `n_batches` and/or `batch_size`."
    )
  • Don't require that all three splits be present when loading CachedSimulatedDataset. For example, we might want to generate only a separate test split and evaluate on it without any train/validation data. Instead, raise an exception when the user tries to use data that wasn't loaded (e.g. in train/val/test_dataloader)
    # fix validation set
    assert os.path.exists(
        f"{self.cached_data_path}/dataset_valid.pt"
    ), "No cached validation data found; run `generate.py` first"
aakashdp6548 added the refactoring and not high priority labels on May 15, 2023
aakashdp6548 (Collaborator, Author) commented

Discussed with @zhixiangteoh - the reason for using num_batches instead of a number of images is (1) to keep things in exact multiples of the batch size without ceiling/flooring, and (2) consistency with SimulatedDataset. So the second and third points aren't necessary.
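The rounding mismatch behind point (1) can be seen with a quick arithmetic sketch (the numbers are illustrative, not from the repo's config):

```python
import math

# Requesting a total image count forces a rounding decision:
total_images, batch_size = 1000, 32
n_batches = math.ceil(total_images / batch_size)  # rounds 31.25 up to 32
generated = n_batches * batch_size

# 32 batches of 32 yields 1024 images, not the 1000 requested;
# specifying n_batches directly avoids this ambiguity.
```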
