Additional info in fastq header? #433

Closed
jfnjdoh opened this issue Oct 23, 2023 · 7 comments
Comments

@jfnjdoh

jfnjdoh commented Oct 23, 2023

I received some data that was basecalled live on the machine. The fastq files have lots of information in the header of each read, such as the start time, the flow cell id, etc. However, when I basecalled the same data from the pod5 files using dorado basecaller --emit-fastq <model> <pod5> > <fastq>, the headers only have the read ID. Is that the intended behavior? Can I get that additional info out of dorado, or is it not stored in the pod5?

@tijyojwad
Collaborator

Hi @jfnjdoh - that's intended behavior. The BAM output (default) from dorado contains more information in the header.

Can you share your motivation for using --emit-fastq instead of BAM?

@jfnjdoh
Author

jfnjdoh commented Oct 26, 2023

Hi @tijyojwad, they ran the sample for 24 hours and wanted to know how long they need to run a sample in order to generate a particular level of assembly quality. So I was going to run assemblies for all reads collected after 2 hours, 4 hours, 6 hours, etc, and see how that affected the final assembly. Hence I needed the time a read was collected to do that. If it matters, the assembler is Flye.

The live called data has that info in the fastq headers so I assumed using dorado would more or less mimic that output. Seeing as how that's intended behavior, running dorado basecaller <model> <pod5> | samtools fastq -T '*' - > <fastq> suffices for my purpose.
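For anyone who wants to script the time-based subsetting described above, here is a minimal Python sketch. It assumes the FASTQ was written by samtools fastq -T '*', so each header is the read ID followed by tab-separated tags in SAM form, and it assumes the read start time is carried in an st:Z tag as an ISO 8601 timestamp (check your own headers, this is an assumption about the output rather than something confirmed in this thread). The function names parse_header_tags, read_fastq and reads_within are hypothetical helpers made up for this example.

```python
from datetime import datetime, timedelta


def parse_header_tags(header):
    """Split a FASTQ header into (read_id, {tag: value}).

    Tags are expected as tab-separated TAG:TYPE:VALUE fields after the read ID.
    """
    line = header.rstrip("\n")
    if line.startswith("@"):
        line = line[1:]
    fields = line.split("\t")
    tags = {}
    for field in fields[1:]:
        parts = field.split(":", 2)  # VALUE may itself contain ':'
        if len(parts) == 3:
            tags[parts[0]] = parts[2]
    return fields[0], tags


def read_fastq(lines):
    """Yield (header, sequence, quality) from 4-line FASTQ records."""
    it = iter(lines)
    for header in it:
        seq = next(it)
        next(it)  # '+' separator line
        qual = next(it)
        yield header.rstrip("\n"), seq.rstrip("\n"), qual.rstrip("\n")


def reads_within(records, hours, time_tag="st"):
    """Keep records that started within `hours` of the earliest read.

    `time_tag` defaults to "st", which is an assumption about the tag name.
    Records without that tag are skipped.
    """
    timed = []
    for rec in records:
        _, tags = parse_header_tags(rec[0])
        if time_tag in tags:
            # Older Python versions do not accept a trailing 'Z' in fromisoformat.
            stamp = tags[time_tag].replace("Z", "+00:00")
            timed.append((datetime.fromisoformat(stamp), rec))
    if not timed:
        return []
    cutoff = min(t for t, _ in timed) + timedelta(hours=hours)
    return [rec for t, rec in timed if t <= cutoff]
```

Calling reads_within(records, 2), then 4, 6, and so on gives the cumulative subsets to write back out and hand to the assembler. The earliest read is used as a stand-in for the run start, which may differ slightly from the actual start of the run.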

If that's all there is to it and my approach to doing what I'm trying to do is correct, then you can close this issue.

@tijyojwad
Collaborator

Hi @jfnjdoh - thanks for the details. That's quite an interesting use case!

For now, using samtools fastq -T to keep the tags is the best solution. We will keep such use cases under advisement for future changes.

@jfnjdoh
Author

jfnjdoh commented Oct 26, 2023

If I could add one more comment: knowing how quickly you can get your data seems to be an important factor; see the following paper: https://www.nature.com/articles/s41586-023-06615-2. Our use case is more of a "time is critical, how quickly can we accurately identify what's in this sample?" situation. It's also useful for more mundane things like "it's late, should I start this now or just let it run overnight?"

@NikoLichi

NikoLichi commented Jan 10, 2024

Hi @jfnjdoh,

I am running into the same issue as you. I tried your samtools -T command, but it fails with the error [main] unrecognized command '-T'.

Could you please elaborate a bit more on how you ran it to get the complete fastq headers?

Best,
Niko

@jfnjdoh
Author

jfnjdoh commented Jan 10, 2024

@NikoLichi The full command is samtools fastq -T '*' - > <fastq>. I'm not sure why I didn't type fastq there. I've edited the original comment so it's clear.

@NikoLichi

Thank you for your help and fast reply. I'll give it a try :)
