
Working on step 7 ("HISAT2 confirmation of removal of human data") #21

Closed
kmsmith137 opened this issue Apr 23, 2020 · 14 comments · Fixed by #50

@kmsmith137
Collaborator

I started working on this

kmsmith137 self-assigned this Apr 23, 2020
@fmaguire
Collaborator

fmaguire commented May 14, 2020

This was resolved in commit 7b580ba, right?

@jaleezyy
Owner

I believe we wanted confirmation to be through HISAT2 alignment against the human genome? This looks like the removal of non-SARS-CoV-2 reads, but not necessarily the confirmation step.

@fmaguire
Collaborator

Considering we are throwing away any reads that don't map to the reference with the new "core" (see current branch), we can probably dispense with this entirely.

@agmcarthur
Collaborator

There was discussion in one of the groups that step 5 (removal of non-SARS-CoV-2) should be via alignment to the SARS-CoV-2 genome, while step 7 verification would be alignment against the human genome, as this combination would absolutely pass ethics/privacy requirements and provide full confidence to our healthcare colleagues.

@fmaguire
Collaborator

Just chatting about that on the cancogen call. Sounds like: map raw sorted reads to human reference and remove any reads that map. Then go ahead with the remaining reads for trimming.

The human-removed but otherwise raw reads can then be uploaded to SRA?

I can add that and remake the PR.

@agmcarthur
Collaborator

Are we sure removing based on alignment to human won't exclude some legitimate SARS-CoV-2 data? Don't want to create a region of false low coverage.

@jts
Collaborator

jts commented May 15, 2020

I think a useful analysis to inform this discussion would be:

a) map human reads to the ncov reference to see what maps
b) map coronavirus reads to human to see what gets lost

(NB: in real amplicon sequencing sets very few reads, if any, will be human, so a) will be a massive overestimate of what happens in a real experiment)
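
A rough command-line sketch of what b) could look like (the read/reference file names and thread counts are placeholders, not anything from the pipeline):

    # b) map SARS-CoV-2 reads against the human reference and see how many get lost
    bwa index GRCh38.fa     # skip if a pre-made index already exists
    bwa mem -t 8 GRCh38.fa ncov_R1.fastq.gz ncov_R2.fastq.gz \
        | samtools sort -o ncov_vs_human.bam -
    samtools index ncov_vs_human.bam

    # count all records, then only reads with a primary alignment to human
    # (-F 0x904 drops unmapped, secondary, and supplementary records)
    samtools view -c ncov_vs_human.bam
    samtools view -c -F 0x904 ncov_vs_human.bam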

@jts
Collaborator

jts commented May 15, 2020

Thinking about it a bit more, human WGS won't be representative of the type of off-target sequences we might see, so I'm going off the idea of a) a bit. b) is worth doing, though.

fmaguire reopened this May 15, 2020
@robynslee
Collaborator

Yeah, not sure about the benefit of a). b) seems useful, as that's the issue we're worried about. Might be good to try b) with a diverse sample of strains from GISAID to get a sense of this.

@jts
Collaborator

jts commented May 15, 2020

Yeah, the thought with a) was to test whether mapping to human is necessary to remove any potential host reads. It would be far simpler if we could just map to coronavirus and discard everything else; I wanted to address that with a) but realized it won't really answer the question.

@robynslee
Collaborator

Yes, agree, a) doesn't address that question.

@fmaguire
Collaborator

Just throwing this in here so it's all together.

  • Need to consider mapping scores: Missing Indels galaxyproject/SARS-CoV-2#49
  • Could do with a set of SRA archives across the nextstrain tree for doing this analysis (among other QC): e.g. create a composite reference and see if there is a good threshold for distinguishing host contamination.

fmaguire added a commit that referenced this issue May 22, 2020
Pending further analysis (re: #21) here is a crude host removal workflow
for the raw reads (only sorted and the pools combined).

    - Map against pre-made human GRCh38 BWA indexed reference
    - Use samtools and bedtools to pull out all reads that don't map to
      human reference and gzip for SRA submission.
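
A minimal sketch of that kind of host-removal step, assuming paired-end reads and using samtools fastq for the read extraction (the committed workflow uses bedtools for that part, and all file names here are placeholders):

    # map the sorted, pool-combined raw reads against a pre-made GRCh38 BWA index
    bwa mem -t 8 GRCh38.fa sample_R1.fastq.gz sample_R2.fastq.gz \
        | samtools sort -n -o sample_vs_human.nsorted.bam -

    # keep only pairs where neither read maps to human
    # (-f 12: read and mate unmapped; -F 256: drop secondary alignments)
    samtools view -b -f 12 -F 256 sample_vs_human.nsorted.bam \
        | samtools fastq -1 sample_dehosted_R1.fastq -2 sample_dehosted_R2.fastq -

    # gzip for SRA submission
    gzip sample_dehosted_R1.fastq sample_dehosted_R2.fastq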
@fmaguire
Collaborator

fmaguire commented May 24, 2020

I took the one Wuhan scheme Illumina sample I had to hand and ran BWA-MEM versus a composite human + viral reference.

Of those reads which mapped to both viral and human contigs (~200, or 0.02%), the distribution of the respective mapping qualities looks like this:

[Figure: composite_reference (MAPQ distributions for reads mapping to both human and viral contigs)]

So most of the small number of problematic reads have a clear viral hit and a lower-quality human hit (as Torsten suggests in the linked thread).
If we just take those multi-hit reads with MAPQ>=30 to the human reference, we are left with 13 reads (0.002%):

[Figure: multimaps MAPQ>=30 (mapping qualities of the remaining multi-hit reads)]

We could save a whole 4 of them by comparing the mapping qualities between human and viral. The remaining 11 reads with equally good hits to viral and human aren't likely to majorly affect the viral consensus or variant calling.

I could grab a bunch more SRAs and do this across a lot more samples, but honestly we are probably fine just using BWA-MEM MAPQ>=30 in the host removal stage and calling it a good'un.
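
For anyone repeating this on more samples, a simplified sketch of the composite-reference check (the reference file names and the viral contig name MN908947.3 are assumptions about how the composite was built; the real scripts may differ):

    # build a composite human + SARS-CoV-2 reference and map the sample against it
    cat GRCh38.fa MN908947.3.fa > composite_ref.fa
    bwa index composite_ref.fa
    bwa mem -t 8 composite_ref.fa sample_R1.fastq.gz sample_R2.fastq.gz \
        | samtools sort -o sample_vs_composite.bam -
    samtools index sample_vs_composite.bam

    # count primary alignments that hit a human contig with MAPQ >= 30
    # (i.e. any record whose reference name is not the viral contig)
    samtools view -q 30 -F 0x904 sample_vs_composite.bam \
        | awk '$3 != "MN908947.3"' | wc -l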

@agmcarthur
Collaborator

This is excellent; please make sure it is clear in the documentation, including the supporting data.
