
Make cleanup stages of processing multi-threaded #8

Open
mrosseel opened this issue Nov 20, 2019 · 4 comments


@mrosseel

The "removing images" stage and everything that comes after it is not multi-threaded, and it takes a very long time for 35k images.

@kirxkirx
Owner

Yes, processing speed for sets of >10k images can and should be improved. Not writing the multiple log files associated with each image is one way to save a lot of time on input/output - the working directory would then contain a few times 10k fewer files. (These image logs are mostly useful at the debugging stage rather than for mass processing.)

Can you please specify what exactly you mean by the "removing images" stage? What is VaST writing to the terminal at that point?

@mrosseel
Author

Fewer files would certainly help!

It would also help to multi-thread the steps that are now single-threaded, such as the following (a sketch of the sigma-clipping pass follows the list):

  • Searching lightcurves to be removed
  • Removing them from all lightcurves... done! =)
  • Removing lightcurves with less than 2 points... done! =)
  • Removing (7.0 sigma) outliers from lightcurves... done! =)
  • Removing lightcurves with less than 2 points... done! =)
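For context, the "(7.0 sigma) outliers" step above is a per-lightcurve sigma-clipping pass. A minimal sketch of that kind of operation (the function name and interface here are illustrative, not VaST's actual code):

```c
#include <math.h>

/* Illustrative sigma-clipping pass: flag magnitudes deviating from the
   mean by more than threshold*sigma. Hypothetical helper, not VaST's
   actual implementation. Returns the number of flagged points. */
int flag_outliers(const double *mag, int n, double threshold, int *is_outlier) {
    double sum = 0.0, sumsq = 0.0;
    int i, n_flagged = 0;
    if (n < 2)
        return 0;
    for (i = 0; i < n; i++) {
        sum += mag[i];
        sumsq += mag[i] * mag[i];
    }
    double mean = sum / n;
    double var = (sumsq - n * mean * mean) / (n - 1);
    double sigma = sqrt(var > 0.0 ? var : 0.0);
    for (i = 0; i < n; i++) {
        is_outlier[i] = fabs(mag[i] - mean) > threshold * sigma;
        n_flagged += is_outlier[i];
    }
    return n_flagged;
}
```

Each of these passes has to touch every lightcurve file once, which is presumably why they take so long on a 35k-image set.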

@kirxkirx
Owner

I'm sorry, but so far I have been unable to get a speed-up from parallelizing these steps. Basically, all of them are limited by the need to read every lightcurve (out*.dat) file and, depending on its content, either delete the file or replace it with a modified version. The procedure seems to be limited by disk I/O speed rather than by CPU usage. If I read the files in parallel, I get about the same execution time but use all processor cores instead of just one, so there is no speed-up. Maybe the result would be different on a system with very fast disk I/O, but I would need to see that before introducing changes to the code.
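A minimal OpenMP sketch of that kind of parallel read loop (simplified file handling and a hypothetical function name, not the exact code tried) shows why: each iteration is dominated by I/O, so the threads mostly just wait on the disk.

```c
#include <stdio.h>
#include <omp.h>

/* Sketch: process the out*.dat lightcurve files in parallel.
   Each iteration is dominated by fopen()/read/rewrite-or-delete,
   so on an ordinary disk the threads mostly wait on I/O and the
   wall-clock time barely improves over the serial loop. */
void process_all_lightcurves(char **filenames, int n_files) {
    int i;
#pragma omp parallel for schedule(dynamic)
    for (i = 0; i < n_files; i++) {
        FILE *f = fopen(filenames[i], "r");
        if (f == NULL)
            continue;
        /* ... read the lightcurve, filter it, then rewrite
           the file or delete it depending on the result ... */
        fclose(f);
    }
}
```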

In the meantime, I've made some minor improvements to the lightcurve-reading routine (an option not to parse the whole VaST-format lightcurve string when we are interested only in the first three columns, "JD mag err"), which provides a small but measurable speed-up on large sets of lightcurves.
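A sketch of that kind of shortcut (assuming each lightcurve line starts with the "JD mag err" columns; the function name is illustrative):

```c
#include <stdio.h>

/* Sketch of a fast reader that parses only the leading "JD mag err"
   columns of a lightcurve line; sscanf() stops after the third
   conversion, so the trailing columns (coordinates, aperture,
   image name, ...) are never parsed. Illustrative, not VaST's code. */
int read_jd_mag_err(const char *line, double *jd, double *mag, double *err) {
    return sscanf(line, "%lf %lf %lf", jd, mag, err) == 3;
}
```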

@mrosseel
Author

Thanks for looking into this!
Am I correct in assuming that these files are only read once for each filter (number of points, outliers, fewer than 2 points)?

As discussed above, did you also look at limiting the log files? That would help with speed, but directories also get very slow with that many files :). I will leave this open in case you want to add something, but feel free to close it if you feel this issue is done.
