
Working on step 7 ("HISAT2 confirmation of removal of human data") #21

Closed
kmsmith137 opened this issue Apr 23, 2020 · 14 comments · Fixed by #50

@kmsmith137
Collaborator

I started working on this

kmsmith137 self-assigned this Apr 23, 2020
@fmaguire
Collaborator

fmaguire commented May 14, 2020

This was resolved in commit 7b580ba, right?

@jaleezyy
Owner

I believe we wanted confirmation to be through HISAT2 alignment against the human genome? This looks like the removal of non-SARS-CoV-2 reads, but not necessarily the confirmation step.

@fmaguire
Collaborator

Considering we are throwing away any reads that don't map to the reference with the new "core" (see current branch), we can probably dispense with this entirely.

@agmcarthur
Collaborator

There was discussion in one of the groups that step 5 (removal of non-SARS-CoV-2) should be via alignment to the SARS-CoV-2 genome, while step 7 verification would be alignment against the human genome, as this combination would absolutely pass ethics/privacy requirements and provide full confidence to our healthcare colleagues.

@fmaguire
Collaborator

Just chatting about that on the cancogen call. Sounds like: map raw sorted reads to human reference and remove any reads that map. Then go ahead with the remaining reads for trimming.

The human-removed but otherwise raw reads can then be uploaded to SRA?

I can add that and remake the PR.

@agmcarthur
Collaborator

Are we sure removing based on alignment to human won't exclude some legitimate SARS-CoV-2 data? Don't want to create a region of false low coverage.

@jts
Collaborator

jts commented May 15, 2020

I think a useful analysis to inform this discussion would be:

a) map human reads to the ncov reference to see what maps
b) map coronavirus reads to human to see what gets lost

(NB: in real amplicon sequencing sets very few reads, if any, will be human, so a) will be a massive overestimate of what happens in a real experiment)
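
A rough command-line sketch of what b) could look like (the read/reference file names and thread counts are placeholders, not anything from the pipeline):

    # b) map SARS-CoV-2 reads against the human reference and see how many get lost
    bwa index GRCh38.fa     # skip if a pre-made index already exists
    bwa mem -t 8 GRCh38.fa ncov_R1.fastq.gz ncov_R2.fastq.gz \
        | samtools sort -o ncov_vs_human.bam -
    samtools index ncov_vs_human.bam

    # count all records, then only reads with a primary alignment to human
    # (-F 0x904 drops unmapped, secondary, and supplementary records)
    samtools view -c ncov_vs_human.bam
    samtools view -c -F 0x904 ncov_vs_human.bam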

@jts
Collaborator

jts commented May 15, 2020

Thinking about it a bit more, human WGS won't be representative of the type of off-target sequences we might see, so I'm going off the idea of a) a bit. b) is worth doing, though.

fmaguire reopened this May 15, 2020
@robynslee
Collaborator

Yeah, not sure about the benefit of a). b) seems useful, as that's the issue we're worried about. Might be good to try b) with a diverse sample of strains from GISAID to get a sense of this.

@jts
Collaborator

jts commented May 15, 2020

Yeah, the thought with a) was to test whether mapping to human is necessary to remove any potential host reads. It would be far simpler if we could just map to coronavirus and discard everything else; I wanted to address that with a) but realized it won't really answer the question.

@robynslee
Collaborator

Yes, agree, a) doesn't address that question.

@fmaguire
Collaborator

Just throwing this in here so it's all together.

  • Need to consider mapping scores: Missing Indels galaxyproject/SARS-CoV-2#49
  • Could do with a set of SRA archives across the nextstrain tree for doing this analysis (among other QC): e.g. create a composite reference and see if there is a good threshold for distinguishing host contamination.

fmaguire added a commit that referenced this issue May 22, 2020
Pending further analysis (re: #21) here is a crude host removal workflow
for the raw reads (only sorted and the pools combined).

    - Map against pre-made human GRCh38 BWA indexed reference
    - Use samtools and bedtools to pull out all reads that don't map to
      human reference and gzip for SRA submission.
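
A minimal sketch of that kind of host-removal step, assuming paired-end reads and using samtools fastq for the read extraction (the committed workflow uses bedtools for that part, and all file names here are placeholders):

    # map the sorted, pool-combined raw reads against a pre-made GRCh38 BWA index
    bwa mem -t 8 GRCh38.fa sample_R1.fastq.gz sample_R2.fastq.gz \
        | samtools sort -n -o sample_vs_human.nsorted.bam -

    # keep only pairs where neither read maps to human
    # (-f 12: read and mate unmapped; -F 256: drop secondary alignments)
    samtools view -b -f 12 -F 256 sample_vs_human.nsorted.bam \
        | samtools fastq -1 sample_dehosted_R1.fastq -2 sample_dehosted_R2.fastq -

    # gzip for SRA submission
    gzip sample_dehosted_R1.fastq sample_dehosted_R2.fastq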
@fmaguire
Collaborator

fmaguire commented May 24, 2020

I took the one Wuhan scheme Illumina sample I had to hand and ran BWA-MEM versus a composite human + viral reference.

Of those reads which mapped to both viral and human contigs (~200, or 0.02%), the distribution of the respective mapping qualities looks like this:

[Figure: composite_reference (MAPQ distributions for reads mapping to both human and viral contigs)]

So most of the small number of problematic reads have a clear viral hit and a lower-quality human hit (as Torsten suggests in the linked thread).
If we just take those multi-hit reads with MAPQ>=30 to the human reference, we are left with 13 reads (0.002%):

[Figure: multimaps MAPQ>=30 (mapping qualities of the remaining multi-hit reads)]

We could save a whole 4 of them by comparing the mapping qualities between human and viral. The remaining 11 reads with equally good hits to viral and human aren't likely to majorly affect the viral consensus or variant calling.

I could grab a bunch more SRAs and do this across a lot more samples, but honestly we are probably fine just using BWA-MEM MAPQ>=30 in the host removal stage and calling it a good'un.
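
For anyone repeating this on more samples, a simplified sketch of the composite-reference check (the reference file names and the viral contig name MN908947.3 are assumptions about how the composite was built; the real scripts may differ):

    # build a composite human + SARS-CoV-2 reference and map the sample against it
    cat GRCh38.fa MN908947.3.fa > composite_ref.fa
    bwa index composite_ref.fa
    bwa mem -t 8 composite_ref.fa sample_R1.fastq.gz sample_R2.fastq.gz \
        | samtools sort -o sample_vs_composite.bam -
    samtools index sample_vs_composite.bam

    # count primary alignments that hit a human contig with MAPQ >= 30
    # (i.e. any record whose reference name is not the viral contig)
    samtools view -q 30 -F 0x904 sample_vs_composite.bam \
        | awk '$3 != "MN908947.3"' | wc -l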

@agmcarthur
Collaborator

This is excellent; please make sure it is clear in the documentation, including the supporting data.
