Modifications to generation process #762

Closed
aakashdp6548 opened this issue May 15, 2023 · 1 comment · Fixed by #765
Labels: not high priority, refactoring (refactoring code or quick fixes)

Comments

aakashdp6548 (Collaborator) commented May 15, 2023

Making this issue to compile a list of minor tweaks we could/should make to the data generation process:

  • Log the parameters used to generate the cached data to the directory where it is saved
  • Instead of specifying the number of batches in the config, only require the total number of images desired and the batch size; the number of batches can then be calculated at runtime
  • Similarly, remove checks for a certain amount of data - just load whatever is given
    assert len(self.data) >= self.n_batches * self.batch_size, (
        f"Insufficient cached data loaded; "
        f"need at least {self.n_batches * self.batch_size} "
        f"but only have {len(self.data)}. Re-run `generate.py` with "
        f"different generation `n_batches` and/or `batch_size`."
    )
  • Don't require that all three splits be present when loading CachedSimulatedDataset. For example, we might want to generate only a separate test split and evaluate on it without any train/validation data. Instead, raise an exception when the user tries to use data that wasn't loaded (e.g. in train/val/test_dataloader)
    # fix validation set
    assert os.path.exists(
        f"{self.cached_data_path}/dataset_valid.pt"
    ), "No cached validation data found; run `generate.py` first"
aakashdp6548 added the refactoring and not high priority labels on May 15, 2023
aakashdp6548 (Collaborator, Author) commented

Discussed with @zhixiangteoh - the reason for using num_batches instead of a number of images is (1) to keep things in exact multiples of the batch size without ceiling/flooring, and (2) consistency with SimulatedDataset. So the second and third points aren't necessary.
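The rounding mismatch behind point (1) can be seen with a quick arithmetic sketch (the numbers are illustrative, not from the repo's config):

```python
import math

# Requesting a total image count forces a rounding decision:
total_images, batch_size = 1000, 32
n_batches = math.ceil(total_images / batch_size)  # rounds 31.25 up to 32
generated = n_batches * batch_size

# 32 batches of 32 yields 1024 images, not the 1000 requested;
# specifying n_batches directly avoids this ambiguity.
```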
