capture basic performance data from jobs #283

Open
levinas opened this issue Jan 28, 2015 · 6 comments

@levinas (Contributor) commented Jan 28, 2015

Minimally a four-tuple for each assembly job:

  1. Input data size
  2. Assembler/recipe
  3. Peak memory usage
  4. Execution time

This data will be used to plan for regular worker nodes dedicated to small jobs.
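
A minimal sketch of what such a record could look like, with field names and example values that are purely illustrative rather than an agreed-on schema:

```python
from collections import namedtuple

# Hypothetical record for the proposed four-tuple; the field names are
# placeholders, not an agreed-on schema.
JobPerf = namedtuple("JobPerf", [
    "input_size",   # e.g. number of bases, or (filetype, bytes)
    "recipe",       # assembler/recipe string, e.g. "-a velvet"
    "peak_mem_kb",  # peak resident set size, in kilobytes
    "wall_time_s",  # elapsed wall-clock time, in seconds
])

record = JobPerf(input_size=("fastq.bz2", 300000000),
                 recipe="-r smart",
                 peak_mem_kb=24000000,
                 wall_time_s=5400.0)
```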

@sebhtml (Contributor) commented Feb 5, 2015

@levinas For the input data size, is this measured in bytes or in number of reads?

For the time, I suppose this can be captured by the Python code, using the elapsed time between the moment the job starts and the moment it ends.

For the assembler/recipe, this is obviously already available in the Python code. Is it a string?

For memory usage: GNU time and tstime can report peak memory usage and other related metrics, but I don't know whether they capture the information for the children of the main process.
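
A minimal sketch of the elapsed-time part, assuming the assembler is launched with subprocess (the example command is illustrative):

```python
import subprocess
import time

def run_with_timing(cmd):
    """Run an assembler command, returning (elapsed_seconds, return_code).

    This captures wall-clock time only; peak memory is a separate problem.
    """
    start = time.time()
    rc = subprocess.call(cmd)
    return time.time() - start, rc

# Example call (the command itself is illustrative):
# elapsed, rc = run_with_timing(["velveth", "out", "31", "-fastq", "reads.fq"])
```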

@levinas (Contributor, Author) commented Feb 5, 2015

> For the input data size, is this measured in bytes or in number of reads?

Ideally, this should be measured in the number of bases. We have talked about running FastQC on all assembly input; maybe we should just extract this number from there. Otherwise, the file type and the raw file size could be a good proxy (e.g. (fasta, 1G) or (fastq.bz2, 300M)).

> For the time, I suppose this can be captured by the Python code, using the elapsed time between the moment the job starts and the moment it ends.

> For the assembler/recipe, this is obviously already available in the Python code. Is it a string?

Yes. We could probably just capture the method string including the "assembler/recipe/pipeline/wasp" prefix, so something like "-a velvet" or "-r smart". We could postprocess/cluster these strings later.

> For memory usage: GNU time and tstime can report peak memory usage and other related metrics, but I don't know whether they capture the information for the children of the main process.

I don't know how to do that either.
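
One possibility, assuming the assembler is launched as a waited-for child process: resource.getrusage(RUSAGE_CHILDREN) does include waited-for descendants, though its ru_maxrss reflects the single largest child rather than the sum over the whole process tree, so it can understate multi-process assemblers. A rough sketch:

```python
import resource
import subprocess

def run_and_peak_child_rss(cmd):
    """Run a command; report peak RSS among waited-for child processes.

    Caveat: for RUSAGE_CHILDREN, ru_maxrss is the largest single waited-for
    descendant, not the sum over the process tree.
    """
    rc = subprocess.call(cmd)
    usage = resource.getrusage(resource.RUSAGE_CHILDREN)
    return rc, usage.ru_maxrss  # kilobytes on Linux

# rc, peak_kb = run_and_peak_child_rss(["velvetg", "out"])  # illustrative
```

Another caveat: RUSAGE_CHILDREN accumulates over every child the interpreter has waited on, so a long-running worker would see the maximum over all jobs so far; running each job from a fresh helper process would keep the numbers per-job.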

@cbun (Contributor) commented Feb 6, 2015

Can we grab the PIDs of the subprocesses and poll their memory usage? Not sure if this is the best way.
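
A rough sketch of that polling idea, assuming the third-party psutil package were available; note that sampling can miss short-lived spikes between polls:

```python
import time

import psutil  # third-party package; assumed available for this sketch

def poll_peak_rss(pid, interval=1.0):
    """Poll a process and its children; return the peak summed RSS in bytes."""
    peak = 0
    try:
        proc = psutil.Process(pid)
        while proc.is_running():
            total = 0
            for p in [proc] + proc.children(recursive=True):
                try:
                    total += p.memory_info().rss
                except psutil.NoSuchProcess:
                    pass  # child exited between listing and sampling
            peak = max(peak, total)
            time.sleep(interval)
    except psutil.NoSuchProcess:
        pass  # main process already gone
    return peak
```

The poller would typically run in a background thread while the main thread waits on the subprocess.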

@levinas (Contributor, Author) commented Feb 12, 2015

Can we implement something like a conditional pull for the compute nodes? If the data set is small, for example, the control node can tag it "small", and it could be consumed by a regular VM with 24 GB of memory. This is what Chris envisioned in the original architectural diagram.
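
A sketch of how the control node might do the tagging; the size threshold and queue names are purely illustrative:

```python
# Threshold and queue names are illustrative, not an agreed configuration.
SMALL_JOB_BYTES = 2 * 1024 ** 3  # e.g. treat inputs under 2 GB as "small"

def routing_queue(input_size_bytes):
    """Pick the work queue for a job based on its input size."""
    if input_size_bytes < SMALL_JOB_BYTES:
        return "jobs.small"  # consumed by regular 24 GB VMs
    return "jobs.large"      # consumed by big-memory nodes
```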

@cbun (Contributor) commented Feb 12, 2015

Yes, I'll have to double check, but the idea is that nodes can subscribe to multiple queues, and the control server would route to the correct ones.


@sebhtml (Contributor) commented Feb 12, 2015

In the callback method in consume.py, the JSON payload is received. Does the tag need to be specified in channel.basic_consume?
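
For reference, a minimal sketch of a worker subscribing to more than one queue, assuming a recent pika API (the existing consume.py may use an older signature). In this scheme the "tag" is expressed by which queue the control server publishes a job to, not by an argument to basic_consume:

```python
import pika  # assuming the existing consumer is built on pika

def start_worker(queues, callback, host="localhost"):
    """Subscribe one worker to several queues and start consuming."""
    connection = pika.BlockingConnection(pika.ConnectionParameters(host=host))
    channel = connection.channel()
    for q in queues:
        channel.queue_declare(queue=q, durable=True)
        channel.basic_consume(queue=q, on_message_callback=callback)
    channel.start_consuming()

# A regular 24 GB VM might subscribe only to the small-job queue
# (queue names illustrative; handle_job stands in for the existing callback):
# start_worker(["jobs.small"], handle_job)
```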
