Suggestion: Improved logging #48
I'm a big fan of HISAT2 but I struggle with the summary logs. They're intuitive to read as a user while the tool is running, but they're very tricky to parse computationally after the run is complete. I'm interested in this because I'm the author of a tool called MultiQC, which creates summary reports describing the output of multiple bioinformatics tools. MultiQC can only rely on the output from each tool, with no additional logs or information from the user (which would vary case by case).
My first request would be to log the HISAT software version number and input parameters (such as input filenames) along with the summary stats. This is especially useful when a pipeline concatenates the stderr from multiple runs into a single file. In the case of HISAT2 it would also help because the stderr log is identical to that of Bowtie2, so it's impossible to tell which tool the log output came from.
Secondly, a machine parseable format such as YAML would be a lot easier to unambiguously interpret, whilst still being fairly easy to read by eye.
Finally, and this one is less important and more subjective, I prefer summary statistics to be printed to a file - this makes them much easier to find. For example, for an input file
So, in summary, instead of / in addition to the current stderr stream:
It would be fantastic to have something like this:
- HISAT2_version: 2.0.4
- input_files:
    - input_R1.fastq.gz
    - input_R2.fastq.gz
- summary_stats:
    - total_reads: 20000
    - unpaired_reads:
        - counts:
            - total: 20000
            - aligned_0_times: 1247
            - aligned_1_time: 18739
            - aligned_gt1_time: 14
        - percentages:
            - total: 100.00%
            - aligned_0_times: 6.24%
            - aligned_1_time: 93.69%
            - aligned_gt1_time: 0.07%
    - overall_alignment_rate: 93.77%
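As a quick sanity check on the example numbers above, the following Python sketch recomputes the percentages and the overall alignment rate from the unpaired-read counts. The dictionary keys mirror the field names in the suggested output. The function names are hypothetical, and the rule that the overall rate is the share of reads aligned one or more times is an assumption that happens to match the 93.77% shown.

```python
# Counts copied from the suggested single-end example output.
counts = {
    "total": 20000,
    "aligned_0_times": 1247,
    "aligned_1_time": 18739,
    "aligned_gt1_time": 14,
}


def percentages(counts):
    """Express every count as a percentage of the total read count."""
    total = counts["total"]
    return {name: 100.0 * value / total for name, value in counts.items()}


def overall_alignment_rate(counts):
    """Percentage of reads aligned at least once (assumed definition)."""
    aligned = counts["aligned_1_time"] + counts["aligned_gt1_time"]
    return 100.0 * aligned / counts["total"]


# The three alignment categories should account for every input read.
categories_sum = sum(v for k, v in counts.items() if k != "total")

if __name__ == "__main__":
    print("categories sum:", categories_sum)
    print("overall alignment rate:", overall_alignment_rate(counts))
```

A downstream parser such as MultiQC could run a check like this to confirm that a log was read correctly before plotting it.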
Does this make sense? Do you have any thoughts on the topic?
Thanks in advance,
It took me such a long time to implement your suggested output format due to multiple (very exciting) projects, job hunting, grant writing, etc.
How about the summary output format?
-- paired-end reads --
I also implemented a new option, --summary-file, to output the summary to a file (in addition to stderr).
No problem - I know the feeling! Thanks for looking into this.
The output you suggest looks great... A couple of minor suggestions:
ps. A question - one of the plots I'd like to make for MultiQC is a stacked bargraph showing how all of the input read pairs are aligned (eg. like this one). So what proportion are not aligned at all, what proportion have > 1 alignment and so on. However, it's not entirely clear to me how the numbers from your paired-end output can be summed:
I assume that this is because read pairs can be assigned to multiple categories. Or are some numbers sub-categories of others (eg. unpaired reads)? Is there a way to put this together into a stacked bar plot that you recommend?
Ok, looking at this a little longer. I guess Aligned discordantly 1 time is part of Aligned concordantly 0 time, which is why the top part doesn't sum to
Still a bit confused about where the
How does the Overall alignment rate take this into account? Presumably you have to come to a total number of aligned reads to calculate this.
Apologies if I'm being slow here..
Thank you - I just changed the log a bit as follows:
HISAT2 summary stats:
Below is a breakdown of some of the numbers. Note that Total unpaired reads is twice the number of unaligned pairs.
Total pairs (1000000) = Aligned concordantly or discordantly 0 time + Aligned concordantly 1 time + Aligned concordantly >1 times + Aligned discordantly 1 time
Total unpaired reads (2130) = 2 * Aligned concordantly or discordantly 0 time
Overall alignment rate is number of aligned reads / number of total reads
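The breakdown above can be sketched in Python as follows. Only the total of 1000000 pairs and the 2130 unpaired reads come from the thread. Every other count is a made-up illustrative value, and the function and parameter names are hypothetical. The sketch also assumes that a read is one mate of a pair, so total reads is twice the number of pairs, and that each aligned pair contributes two aligned reads; the thread only states that the rate is aligned reads divided by total reads.

```python
def paired_end_summary(total_pairs, conc_1, conc_gt1, disc_1, unaligned_pairs,
                       mates_aligned_1, mates_aligned_gt1):
    """Check the pair identity and derive the overall alignment rate."""
    # Total pairs = unaligned + concordant 1 + concordant >1 + discordant 1.
    if conc_1 + conc_gt1 + disc_1 + unaligned_pairs != total_pairs:
        raise ValueError("pair categories do not sum to total pairs")

    # Each unaligned pair contributes two mates that are aligned separately.
    total_unpaired_reads = 2 * unaligned_pairs
    mates_unaligned = total_unpaired_reads - mates_aligned_1 - mates_aligned_gt1
    if mates_unaligned < 0:
        raise ValueError("more aligned mates than unpaired reads")

    # Assumption: an aligned pair counts as two aligned reads.
    aligned_reads = 2 * (conc_1 + conc_gt1 + disc_1) + mates_aligned_1 + mates_aligned_gt1
    total_reads = 2 * total_pairs
    return {
        "total_unpaired_reads": total_unpaired_reads,
        "mates_unaligned": mates_unaligned,
        "aligned_reads": aligned_reads,
        "overall_alignment_rate": aligned_reads / total_reads,
    }


# 1000000 pairs and 1065 unaligned pairs (2130 / 2) follow the thread;
# the remaining counts are illustrative only.
summary = paired_end_summary(
    total_pairs=1000000, conc_1=950000, conc_gt1=45000, disc_1=3935,
    unaligned_pairs=1065, mates_aligned_1=1000, mates_aligned_gt1=130,
)

if __name__ == "__main__":
    print(summary)
```

For a stacked bar plot, the four pair categories can be plotted directly because they sum to the total number of pairs. The unpaired-read counts subdivide the unaligned pairs, so they would need to be halved or shown in a separate plot.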