speed up the ImageDecoder #721

Closed
6 tasks done
jeff-regier opened this issue Apr 19, 2023 · 0 comments · Fixed by #738

jeff-regier commented Apr 19, 2023

ImageDecoder is currently quite slow. Even with 32 workers, it can't keep up with the encoder. GalSim's ScatteredImageBuilder may let us render batches of light sources more efficiently. It could also be useful to look at how imSim interacts with GalSim.

If we can't speed up ImageDecoder by at least 10x, then as an alternative we can generate simulated training images ahead of time and write them to disk. In that case we may also want to use data augmentation to get more out of each image we generate: apply random 90-degree rotations and small translations. Any such augmentation would need to be applied to the tile catalog too, so that the catalog still matches the transformed image.
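To make the point about keeping the tile catalog in sync concrete, here is a minimal sketch of a 90-degree rotation applied to both an image and its catalog. The catalog layout is an assumption for illustration, not the project's actual format: `n_sources` is a per-tile count with at most one source per tile, and `locs` holds `(y, x)` positions in `[0, 1]` within each tile. The function name `rot90_with_catalog` is hypothetical. Translations are not shown.

```python
import numpy as np


def rot90_with_catalog(image, n_sources, locs, k=1):
    """Rotate an image and its tile catalog counterclockwise by k * 90 degrees.

    Assumed (hypothetical) layout:
      image:     (H, W) pixel array
      n_sources: (n_tiles_h, n_tiles_w) source count per tile (0 or 1)
      locs:      (n_tiles_h, n_tiles_w, 2) within-tile (y, x) in [0, 1]
    """
    for _ in range(k % 4):
        # Rotate the pixel grid and the tile grid the same way.
        image = np.rot90(image)
        n_sources = np.rot90(n_sources)
        locs = np.rot90(locs, axes=(0, 1))
        # Within a tile, a counterclockwise quarter turn sends (y, x) to (1 - x, y).
        locs = np.stack([1.0 - locs[..., 1], locs[..., 0]], axis=-1)
        # Keep the locations of empty tiles at zero.
        locs = locs * (n_sources[..., None] > 0)
    return image, n_sources, locs
```

A small translation would need the same care: a shift that moves a source across a tile boundary changes which tile owns it, so the per-tile arrays have to be rebuilt rather than just shifted.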

Steps to using cached images:

  • Modify case_studies/summer_template/main.py to support a new mode: generate
  • Create a new file named bliss/generate.py that is analogous to predict.py and train.py. It would contain a function called generate(...) that takes cfg as an argument.
  • When called, generate(cfg) should create a SimulatedDataset object (using the instantiate function provided by hydra, as in train.py), generate a lot of data, serialize the data, and write it to a file whose name is specified in the cfg object.
  • case_studies/summer_template/config.yaml should probably have a new top-level entry called cached_simulator (analogous to simulator, but with many fewer fields). This is where we'd store the filename (or directory path) that contains the cached images.
  • simulated_dataset.py should contain an additional class called CachedSimulatedDataset, with a constructor that takes a filename (or directory path) of the cached images and loads the file into memory.
  • CachedSimulatedDataset won't use any workers because it's just looking up loaded data. It will provide minibatches that are sampled at random from the available cached images.
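The steps above could be sketched roughly as follows. This is a dependency-free illustration under stated assumptions, not the eventual implementation: `simulator` stands in for the SimulatedDataset that hydra's instantiate would build from cfg, the cache is a single pickle file, and the `sample_batch` method and the `batch_size` and `seed` parameters are hypothetical. A real version would presumably plug into the existing data-loading code instead of using plain Python lists.

```python
import pickle
import random


def generate(simulator, n_examples, cache_path):
    """Draw n_examples from the simulator and write them to cache_path.

    `simulator` is a stand-in: any zero-argument callable that returns one
    simulated example (for instance an image together with its tile catalog).
    """
    examples = [simulator() for _ in range(n_examples)]
    with open(cache_path, "wb") as f:
        pickle.dump(examples, f)
    return len(examples)


class CachedSimulatedDataset:
    """Serves cached examples from memory, so no workers are needed."""

    def __init__(self, cache_path, batch_size, seed=None):
        # Load the whole cache into memory once, at construction time.
        with open(cache_path, "rb") as f:
            self.examples = pickle.load(f)
        self.batch_size = min(batch_size, len(self.examples))
        self.rng = random.Random(seed)

    def __len__(self):
        return len(self.examples)

    def sample_batch(self):
        # Sample a minibatch at random (without replacement within the batch).
        return self.rng.sample(self.examples, self.batch_size)
```

Pickle is used here only to keep the sketch short; for large caches, a format that supports loading shards from a directory may be a better fit for the "filename (or directory path)" option mentioned above.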