Working on step 7 ("HISAT2 confirmation of removal of human data") #21
Comments
This was resolved in commit 7b580ba, right?
I believe we wanted confirmation to be through HISAT2 alignment against the human genome? This looks like the removal of non-SARS-CoV-2 reads, but not necessarily the confirmation step.
Considering we are throwing away any reads that don't map to the reference with the new "core" (see current branch), we can probably dispense with this entirely.
There was discussion in one of the groups that step 5 (removal of non-SARS-CoV-2 reads) be done via alignment to the SARS-CoV-2 genome, while step 7 (verification) be done via alignment against the human genome, as this combination would absolutely pass ethics/privacy requirements and provide full confidence to our healthcare colleagues.
Just chatting about that on the CanCOGeN call. Sounds like: map raw sorted reads to the human reference and remove any reads that map, then go ahead with the remaining reads for trimming. The human-removed but otherwise raw reads can then be uploaded to SRA? I can add that and remake the PR.
Are we sure removing based on alignment to human won't exclude some legitimate SARS-CoV-2 data? Don't want to create a region of false low coverage.
I think a useful analysis to inform this discussion would be: a) map human reads to the ncov reference to see what maps (NB: in real amplicon sequencing sets very few reads, if any, will be human, so a) will be a massive overestimate of what happens in a real experiment)
Thinking about it a bit more, human WGS won't be representative of the type of off-target sequences we might see, so I'm going off the idea of a) a bit. b) is worth doing though.
Yeah, not sure about the benefit of a). b) seems useful, as that's the issue we're worried about. Might be good to try b) with a diverse sample of strains from GISAID to get a sense of this.
Yeah, the thought with a) was to test whether mapping to human is necessary to remove any potential host reads. It would be far simpler if we could just map to coronavirus and discard everything else. I wanted to address that with a), but realized it won't really answer the question.
Yes, agree, a) doesn't address that question.
Just throwing this in here so it's all together.
Pending further analysis (re: #21), here is a crude host removal workflow for the raw reads (only sorted and the pools combined):
- Map against pre-made human GRCh38 BWA indexed reference.
- Use samtools and bedtools to pull out all reads that don't map to human reference and gzip for SRA submission.
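For anyone wanting to try the workflow above, here is a rough sketch of how the steps could be chained. It only builds the shell commands and does not run them. The file names, the `host_removal_commands` helper, and the choice of `samtools view -f 12 -F 256` (keep pairs where both mates are unmapped, drop secondary alignments) are my assumptions, not necessarily what the pipeline in this repo does.

```python
import shlex


def host_removal_commands(human_index, reads_1, reads_2, prefix, threads=4):
    """Build shell commands for a crude paired-end host removal.

    All paths and the output naming are hypothetical; adjust to your layout.
    """
    q = shlex.quote
    aligned = q(f"{prefix}.human.bam")
    nonhuman = q(f"{prefix}.nonhuman.namesorted.bam")
    out_1 = q(f"{prefix}_R1.nonhuman.fastq")
    out_2 = q(f"{prefix}_R2.nonhuman.fastq")
    return [
        # 1. Map the raw reads against the BWA-indexed human reference.
        f"bwa mem -t {int(threads)} {q(human_index)} {q(reads_1)} {q(reads_2)}"
        f" | samtools view -b -o {aligned} -",
        # 2. Keep pairs where read and mate are both unmapped (assumption),
        #    then sort by name so mates are adjacent for bedtools.
        f"samtools view -b -f 12 -F 256 {aligned}"
        f" | samtools sort -n -o {nonhuman} -",
        # 3. Convert the remaining reads back to paired FASTQ.
        f"bedtools bamtofastq -i {nonhuman} -fq {out_1} -fq2 {out_2}",
        # 4. Compress for SRA submission.
        f"gzip {out_1} {out_2}",
    ]


if __name__ == "__main__":
    for cmd in host_removal_commands(
        "GRCh38.fa", "sample_R1.fastq.gz", "sample_R2.fastq.gz", "sample"
    ):
        print(cmd)
```

Printing the commands first makes it easy to eyeball them before running anything on real data.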
I took the one Wuhan scheme Illumina sample I had to hand and ran BWA-MEM against a composite human + viral reference. Of those reads which mapped to both viral and human contigs (~200, or 0.02%), the distribution of the respective mapping qualities looked like this:

(plot of viral vs. human mapping qualities not preserved)

So most of the small number of problematic reads are a clear viral hit and a lower quality human hit (as Torsten suggests in the linked thread). We could save a whole 4 of them by comparing the mapping qualities between human and viral. The remaining 11 reads with equally good hits to viral and human aren't likely to majorly affect the viral consensus or variant calling. I could grab a bunch more SRAs and do this across a lot more samples, but honestly, we are probably fine just using BWA-MEM MAPQ>=30 in the host removal stage and calling it a good'un.
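To make the MAPQ>=30 rule concrete, here is a minimal sketch of the filter applied to SAM records from an alignment against a human-only reference. The `reads_to_remove` helper and the toy records are made up for illustration, and flagging by read name (so both mates of a pair go if either one hits) is an assumption on my part.

```python
def reads_to_remove(sam_lines, min_mapq=30):
    """Return names of reads with a confident alignment to the human reference.

    Assumes the SAM records come from mapping against human only, so any
    mapped record at or above min_mapq counts as host.
    """
    flagged = set()
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and SAM header lines
        fields = line.rstrip("\n").split("\t")
        qname, flag, mapq = fields[0], int(fields[1]), int(fields[4])
        if flag & 0x4:
            continue  # read is unmapped, so it is not treated as human
        if mapq >= min_mapq:
            flagged.add(qname)
    return flagged


# Toy records (hypothetical): one confident hit, one unmapped, one weak hit.
EXAMPLE_SAM = [
    "@SQ\tSN:chr1\tLN:248956422",
    "readA\t0\tchr1\t100\t60\t50M\t*\t0\t0\t*\t*",
    "readB\t4\t*\t0\t0\t*\t*\t0\t0\t*\t*",
    "readC\t16\tchr1\t500\t12\t50M\t*\t0\t0\t*\t*",
]

if __name__ == "__main__":
    print(sorted(reads_to_remove(EXAMPLE_SAM)))
```

With the default threshold only the confident hit is flagged; the weak hit at MAPQ 12 is kept, which is the behaviour that protects viral reads with a poor secondary resemblance to human sequence.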
This is excellent; please make sure it is clear in the documentation, including the supporting data.
I started working on this