
Make cleanup stages of processing multi-threaded #8

Open
mrosseel opened this issue Nov 20, 2019 · 4 comments


@mrosseel

The "removing images" stage and everything that comes after it is not multi-threaded, and it takes a very long time for 35k images.

@kirxkirx
Owner

Yes, processing speed for sets of >10k images can and should be improved. Not writing the multiple log files associated with each image is one way to save a lot of time on input/output - the working directory would then contain a few times 10k fewer files. (These image logs are mostly useful at the debugging stage rather than for mass processing.)

Can you please specify what exactly you mean by the "removing images" stage? What is VaST writing to the terminal at that point?

@mrosseel
Author

Fewer files would certainly help!

It would also help to multi-thread the steps that are now single-threaded, such as the following (a sketch of the sigma-clipping pass follows the list):

  • Searching lightcurves to be removed
  • Removing them from all lightcurves... done! =)
  • Removing lightcurves with less than 2 points... done! =)
  • Removing (7.0 sigma) outliers from lightcurves... done! =)
  • Removing lightcurves with less than 2 points... done! =)
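For context, the "(7.0 sigma) outliers" step above is a per-lightcurve sigma-clipping pass. A minimal sketch of that kind of operation (the function name and interface here are illustrative, not VaST's actual code):

```c
#include <math.h>

/* Illustrative sigma-clipping pass: flag magnitudes deviating from the
   mean by more than threshold*sigma. Hypothetical helper, not VaST's
   actual implementation. Returns the number of flagged points. */
int flag_outliers(const double *mag, int n, double threshold, int *is_outlier) {
    double sum = 0.0, sumsq = 0.0;
    int i, n_flagged = 0;
    if (n < 2)
        return 0;
    for (i = 0; i < n; i++) {
        sum += mag[i];
        sumsq += mag[i] * mag[i];
    }
    double mean = sum / n;
    double var = (sumsq - n * mean * mean) / (n - 1);
    double sigma = sqrt(var > 0.0 ? var : 0.0);
    for (i = 0; i < n; i++) {
        is_outlier[i] = fabs(mag[i] - mean) > threshold * sigma;
        n_flagged += is_outlier[i];
    }
    return n_flagged;
}
```

Each of these passes has to touch every lightcurve file once, which is presumably why they take so long on a 35k-image set.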

@kirxkirx
Owner

I'm sorry, but so far I have been unable to get a speed-up from parallelizing these steps. Basically, all of them are limited by the need to read every lightcurve (out*.dat) file and, depending on its content, either delete the file or replace it with a modified version. The procedure seems to be limited by disk I/O speed rather than by CPU usage. If I read the files in parallel, I get about the same execution time but use all processor cores instead of just one, so there is no speed-up. Maybe the result would be different on a system with very fast disk I/O, but I would need to see that before introducing changes to the code.
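A minimal OpenMP sketch of that kind of parallel read loop (simplified file handling and a hypothetical function name, not the exact code tried) shows why: each iteration is dominated by I/O, so the threads mostly just wait on the disk.

```c
#include <stdio.h>
#include <omp.h>

/* Sketch: process the out*.dat lightcurve files in parallel.
   Each iteration is dominated by fopen()/read/rewrite-or-delete,
   so on an ordinary disk the threads mostly wait on I/O and the
   wall-clock time barely improves over the serial loop. */
void process_all_lightcurves(char **filenames, int n_files) {
    int i;
#pragma omp parallel for schedule(dynamic)
    for (i = 0; i < n_files; i++) {
        FILE *f = fopen(filenames[i], "r");
        if (f == NULL)
            continue;
        /* ... read the lightcurve, filter it, then rewrite
           the file or delete it depending on the result ... */
        fclose(f);
    }
}
```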

In the meantime, I've made some minor improvements to the lightcurve-reading routine (an option not to parse the whole VaST-format lightcurve string when we are interested only in the first three columns, "JD mag err"), which provides a small but measurable speed-up on large sets of lightcurves.
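A sketch of that kind of shortcut (assuming each lightcurve line starts with the "JD mag err" columns; the function name is illustrative):

```c
#include <stdio.h>

/* Sketch of a fast reader that parses only the leading "JD mag err"
   columns of a lightcurve line; sscanf() stops after the third
   conversion, so the trailing columns (coordinates, aperture,
   image name, ...) are never parsed. Illustrative, not VaST's code. */
int read_jd_mag_err(const char *line, double *jd, double *mag, double *err) {
    return sscanf(line, "%lf %lf %lf", jd, mag, err) == 3;
}
```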

@mrosseel
Author

Thanks for looking into this!
Am I correct in assuming that these files are only read once for each filter (number of points, outliers, fewer than 2 points)?

As discussed above, did you also look at limiting the log files? That would help with speed, but directories also get very slow with that many files :). I will leave this open in case you want to add something, but feel free to close it if you feel this issue is done.
