New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Suggestion: Improved logging #48

Closed
ewels opened this Issue May 27, 2016 · 14 comments

Comments

Projects
None yet
3 participants
@ewels

ewels commented May 27, 2016

Hi there,

I'm a big fan of HISAT2 but I struggle with the summary logs. Whilst they're intuitive to read as a user whilst running the tool, they're very tricky to parse computationally after the run is complete. The reason that I'm interested in this is because I'm the author of a tool called MultiQC, which creates summary reports describing the output of multiple bioinfo tools. MultiQC can only rely on the output from each tool and no additional logs or information from the user (which would vary case by case).

My first request would be to log the HISAT software version number and input parameters along with the summary stats (such as input filenames). This is especially useful when a pipeline concatenates the stderr from multiple runs into a single file. In the case of HISAT2 it would also help as the stderr log is identical to that of Bowtie2 and it's impossible to tell where the log output came from.

Secondly, a machine parseable format such as YAML would be a lot easier to unambiguously interpret, whilst still being fairly easy to read by eye.

Finally, and this one is less important and more subjective, I prefer summary statistics to be printed to a file - this makes them much easier to find. For example, for an input file input_R1.fastq.gz you could create input_R1.fastq.gz.HISAT2_summary.yaml. This is what a lot of other bioinformatics tools do (eg. STAR, Salmon and even Tophat with the align_summary.txt file). Note that if printing summary stats to a file then you can of course keep the stderr messages as they are currently.

So, in summary, instead of / in addition to the current stderr stream:

20000 reads; of these:
  20000 (100.00%) were unpaired; of these:
    1247 (6.24%) aligned 0 times
    18739 (93.69%) aligned exactly 1 time
    14 (0.07%) aligned >1 times
93.77% overall alignment rate

It would be fantastic to have something like this:

- HISAT2_version: 2.0.4
- input_files:
  - input_R1.fastq.gz
  - input_R2.fastq.gz
- summary_stats:
  - total_reads: 20000
  - unpaired_reads:
    - counts:
      - total: 20000
      - aligned_0_times: 1247
      - aligned_1_time: 18739
      - aligned_gt1_time: 14
    - percentages:
      - total: 100.00%
      - aligned_0_times: 6.24%
      - aligned_1_time: 93.69%
      - aligned_gt1_time: 0.07%
  - overall_alignment_rate: 93.77%

Does this make sense? Do you have any thoughts on the topic?

Thanks in advance,

Phil

@infphilo

This comment has been minimized.

Show comment
Hide comment
@infphilo

infphilo May 27, 2016

Owner

Thank you for your suggestion, @ewels

The suggested format looks great to me! I'll try to incorporate it in the next version of HISAT2. I'll be out of the country the rest of week and the next week, sorry for the brief response.

Owner

infphilo commented May 27, 2016

Thank you for your suggestion, @ewels

The suggested format looks great to me! I'll try to incorporate it in the next version of HISAT2. I'll be out of the country the rest of week and the next week, sorry for the brief response.

@ewels

This comment has been minimized.

Show comment
Hide comment
@ewels

ewels May 27, 2016

Fantastic, thanks! Looking forward to it..

ewels commented May 27, 2016

Fantastic, thanks! Looking forward to it..

@lcolladotor

This comment has been minimized.

Show comment
Hide comment
@lcolladotor

lcolladotor commented May 12, 2017

+1 ^^

@infphilo

This comment has been minimized.

Show comment
Hide comment
@infphilo

infphilo May 31, 2017

Owner

It took me such a long time to implement your suggested output format due to multiple (very exciting) projects, job hunting, grant writing, etc.

How about the summary output format?
-- single-end reads --
Summary stats:
Total reads: 1000000
Aligned 0 time: 956 (0.10%)
Aligned 1 time: 957987 (95.80%)
Aligned >1 times: 41057 (4.11%)
Overall alignment rate: 99.90%

-- paired-end reads --
Summary stats:
Total pairs: 1000000
Aligned concordantly 0 time: 1116 (0.11%)
Aligned concordantly 1 time: 965412 (96.54%)
Aligned concordantly >1 times: 33472 (3.35%)
Aligned discordantly 1 time: 51 (4.57%)
Total unpaired reads: 2130
Aligned 0 time: 1057 (49.62%)
Aligned 1 time: 1057 (49.62%)
Aligned >1 times: 16 (0.75%)
Overall alignment rate: 99.95%

I also implemented a new option, --summary-file, to output the summary to a file (in addition to stderr).

Owner

infphilo commented May 31, 2017

It took me such a long time to implement your suggested output format due to multiple (very exciting) projects, job hunting, grant writing, etc.

How about the summary output format?
-- single-end reads --
Summary stats:
Total reads: 1000000
Aligned 0 time: 956 (0.10%)
Aligned 1 time: 957987 (95.80%)
Aligned >1 times: 41057 (4.11%)
Overall alignment rate: 99.90%

-- paired-end reads --
Summary stats:
Total pairs: 1000000
Aligned concordantly 0 time: 1116 (0.11%)
Aligned concordantly 1 time: 965412 (96.54%)
Aligned concordantly >1 times: 33472 (3.35%)
Aligned discordantly 1 time: 51 (4.57%)
Total unpaired reads: 2130
Aligned 0 time: 1057 (49.62%)
Aligned 1 time: 1057 (49.62%)
Aligned >1 times: 16 (0.75%)
Overall alignment rate: 99.95%

I also implemented a new option, --summary-file, to output the summary to a file (in addition to stderr).

@ewels

This comment has been minimized.

Show comment
Hide comment
@ewels

ewels May 31, 2017

Hi @infphilo,

No problem - I know the feeling! Thanks for looking into this.

The output you suggest looks great... A couple of minor suggestions:

  • Could you change Summary stats: to HISAT2 Summary stats:? The addition of the specific HISAT2 string makes the output a lot easier to find programmatically.
  • If it's possible to print the input filenames that would be great. Some users concatenate stderr from multiple samples, then it's nice to have the input sample associated with the summary stats.

Cheers,

Phil

ewels commented May 31, 2017

Hi @infphilo,

No problem - I know the feeling! Thanks for looking into this.

The output you suggest looks great... A couple of minor suggestions:

  • Could you change Summary stats: to HISAT2 Summary stats:? The addition of the specific HISAT2 string makes the output a lot easier to find programmatically.
  • If it's possible to print the input filenames that would be great. Some users concatenate stderr from multiple samples, then it's nice to have the input sample associated with the summary stats.

Cheers,

Phil

@ewels

This comment has been minimized.

Show comment
Hide comment
@ewels

ewels May 31, 2017

ps. A question - one of the plots I'd like to make for MultiQC is a stacked bargraph showing how all of the input read pairs are aligned (eg. like this one). So what proportion are not aligned at all, what proportion have > 1 alignment and so on. However, it's not entirely clear to me how the numbers from your paired-end output can be summed:

Category Number of Reads Running Total
Total pairs 1000000
Aligned concordantly 0 time 1116 1116
Aligned concordantly 1 time 965412 966528
Aligned concordantly >1 times 33472 1000000
Aligned discordantly 1 time 51 1000051
Total unpaired reads 2130
Aligned 0 time 1057 1001108
Aligned 1 time 1057 1002165
Aligned >1 times 16 1002181

I assume that this is because reads pairs can be assigned to multiple categories. Or are some numbers sub-categories of others (eg. unpaired reads?). Is there a way to put this together into a stacked bar plot that your recommend?

Cheers,

Phil

ewels commented May 31, 2017

ps. A question - one of the plots I'd like to make for MultiQC is a stacked bargraph showing how all of the input read pairs are aligned (eg. like this one). So what proportion are not aligned at all, what proportion have > 1 alignment and so on. However, it's not entirely clear to me how the numbers from your paired-end output can be summed:

Category Number of Reads Running Total
Total pairs 1000000
Aligned concordantly 0 time 1116 1116
Aligned concordantly 1 time 965412 966528
Aligned concordantly >1 times 33472 1000000
Aligned discordantly 1 time 51 1000051
Total unpaired reads 2130
Aligned 0 time 1057 1001108
Aligned 1 time 1057 1002165
Aligned >1 times 16 1002181

I assume that this is because reads pairs can be assigned to multiple categories. Or are some numbers sub-categories of others (eg. unpaired reads?). Is there a way to put this together into a stacked bar plot that your recommend?

Cheers,

Phil

@ewels

This comment has been minimized.

Show comment
Hide comment
@ewels

ewels May 31, 2017

Ok, looking at this a little longer. I guess Aligned discordantly 1 time is part of Aligned concordantly 0 time, which is why the top part doesn't sum to 1000000. So I can subtract one from the other to make a new category and everything should add up.

Still a bit confused about where the 2130 Total unpaired reads come from though. Are they part of the 1000000 read pair input? Or did the input FastQ files somehow have 1000000 paired-end reads and 2130 single-end reads mixed together?

How does the Overall alignment rate take this into account? Presumably you have to come to a total number of aligned reads to calculate this.

Apologies if I'm being slow here..

Phil

ewels commented May 31, 2017

Ok, looking at this a little longer. I guess Aligned discordantly 1 time is part of Aligned concordantly 0 time, which is why the top part doesn't sum to 1000000. So I can subtract one from the other to make a new category and everything should add up.

Still a bit confused about where the 2130 Total unpaired reads come from though. Are they part of the 1000000 read pair input? Or did the input FastQ files somehow have 1000000 paired-end reads and 2130 single-end reads mixed together?

How does the Overall alignment rate take this into account? Presumably you have to come to a total number of aligned reads to calculate this.

Apologies if I'm being slow here..

Phil

@infphilo

This comment has been minimized.

Show comment
Hide comment
@infphilo

infphilo May 31, 2017

Owner

Thank you - I just changed the log a bit as follows:

HISAT2 summary stats:
Total pairs: 1000000
Aligned concordantly or discordantly 0 time: 1065 (0.11%)
Aligned concordantly 1 time: 965412 (96.54%)
Aligned concordantly >1 times: 33472 (3.35%)
Aligned discordantly 1 time: 51 (0.01%)
Total unpaired reads: 2130
Aligned 0 time: 1057 (49.62%)
Aligned 1 time: 1057 (49.62%)
Aligned >1 times: 16 (0.75%)
Overall alignment rate: 99.95%

Below is a breakdown of some numbers, and Total unpaired reads are twice the number of unaligned pairs.

Total pairs (1000000) = Aligned concordantly or discordantly 0 time + Aligned concordantly 1 time + Aligned concordantly >1 times + Aligned discordantly 1 time

Total unpaired reads (2130) = 2 * Aligned concordantly or discordantly 0 time

Overall alignment rate is number of aligned reads / number of total reads
= (2 * (Aligned concordantly 1 time + Aligned concordantly >1 times + Aligned discordantly 1 time) + Aligned 1 time + Aligned >1 times) / (2 * Total pairs)

Owner

infphilo commented May 31, 2017

Thank you - I just changed the log a bit as follows:

HISAT2 summary stats:
Total pairs: 1000000
Aligned concordantly or discordantly 0 time: 1065 (0.11%)
Aligned concordantly 1 time: 965412 (96.54%)
Aligned concordantly >1 times: 33472 (3.35%)
Aligned discordantly 1 time: 51 (0.01%)
Total unpaired reads: 2130
Aligned 0 time: 1057 (49.62%)
Aligned 1 time: 1057 (49.62%)
Aligned >1 times: 16 (0.75%)
Overall alignment rate: 99.95%

Below is a breakdown of some numbers, and Total unpaired reads are twice the number of unaligned pairs.

Total pairs (1000000) = Aligned concordantly or discordantly 0 time + Aligned concordantly 1 time + Aligned concordantly >1 times + Aligned discordantly 1 time

Total unpaired reads (2130) = 2 * Aligned concordantly or discordantly 0 time

Overall alignment rate is number of aligned reads / number of total reads
= (2 * (Aligned concordantly 1 time + Aligned concordantly >1 times + Aligned discordantly 1 time) + Aligned 1 time + Aligned >1 times) / (2 * Total pairs)

@infphilo

This comment has been minimized.

Show comment
Hide comment
@infphilo

infphilo May 31, 2017

Owner

I think adding input file names is also a good idea, but we'd like to have minimal summary output for now. We might add the additional info. in a later version of HISAT2.

Owner

infphilo commented May 31, 2017

I think adding input file names is also a good idea, but we'd like to have minimal summary output for now. We might add the additional info. in a later version of HISAT2.

@ewels

This comment has been minimized.

Show comment
Hide comment
@ewels

ewels May 31, 2017

Ok brilliant - thanks for the log changes and explanation 👍

Regarding the input filenames - no problem, I'll just take the log filename for now and hope that people name their logs after their samples :)

Phil

ewels commented May 31, 2017

Ok brilliant - thanks for the log changes and explanation 👍

Regarding the input filenames - no problem, I'll just take the log filename for now and hope that people name their logs after their samples :)

Phil

@infphilo

This comment has been minimized.

Show comment
Hide comment
@infphilo

infphilo May 31, 2017

Owner

Thank you for your suggestion again, and developing MultiQC, a very powerful tool! :-)

Owner

infphilo commented May 31, 2017

Thank you for your suggestion again, and developing MultiQC, a very powerful tool! :-)

@infphilo infphilo closed this May 31, 2017

@ewels

This comment has been minimized.

Show comment
Hide comment
@ewels

ewels May 31, 2017

No problem at all, thanks for HISAT2! Open-source bioinformatics is great 😁 🌟

ewels commented May 31, 2017

No problem at all, thanks for HISAT2! Open-source bioinformatics is great 😁 🌟

@ewels

This comment has been minimized.

Show comment
Hide comment
@ewels

ewels Jul 5, 2017

Hi @infphilo,

I've just written the new HISAT2 MultiQC module to work with the output from --new-summary, so it's now available in v1.1dev.

Thanks again,

Phil

ewels commented Jul 5, 2017

Hi @infphilo,

I've just written the new HISAT2 MultiQC module to work with the output from --new-summary, so it's now available in v1.1dev.

Thanks again,

Phil

@infphilo

This comment has been minimized.

Show comment
Hide comment
@infphilo

infphilo Jul 5, 2017

Owner

Thank you for your great work, @ewels !

Owner

infphilo commented Jul 5, 2017

Thank you for your great work, @ewels !

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment