Isolate Annotation from Seqr loading pipeline #9

Closed
MattWellie opened this issue Mar 30, 2022 · 5 comments · Fixed by #29

@MattWellie
Collaborator

Currently the annotation step (within Dataproc) is not the sole source of annotations.

As the process begins from an MT instead of a VCF, some of the annotations on that original MT are carried over instead of being updated. These are:

  • gnomAD/ExAC frequencies
  • CADD
  • REVEL
  • SpliceAI
  • ClinVar scores

The source for these annotations:
https://github.com/populationgenomics/hail-elasticsearch-pipelines/tree/main/download_and_create_reference_datasets/v02
https://github.com/populationgenomics/hail-elasticsearch-pipelines/blob/main/download_and_create_reference_datasets/v02/hail_scripts/write_combined_reference_data_ht.py#L39-L48

The annotations are available as a single Hail Table, which is used to annotate the MatrixTable as a join.
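
A minimal sketch of that join in Hail, assuming both the MatrixTable and the reference Table are keyed by `locus` and `alleles`. The function name `annotate_with_reference` and the row field name `ref_data` are hypothetical and not taken from the pipeline; the paths in the usage comment are placeholders.

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:  # only needed for type hints; the function makes no hl.* calls
    import hail as hl


def annotate_with_reference(
    mt: "hl.MatrixTable",
    ref_ht: "hl.Table",
    field: str = "ref_data",  # hypothetical name for the new row field
) -> "hl.MatrixTable":
    """Left-join a keyed reference Table onto the rows of a MatrixTable.

    Assumes both are keyed by (locus, alleles). Variants absent from
    ref_ht receive a missing value for the new field.
    """
    # ref_ht[mt.row_key] is Hail's keyed lookup, evaluated per row of the MT.
    return mt.annotate_rows(**{field: ref_ht[mt.row_key]})


# Usage (placeholder paths, not the real locations):
#   import hail as hl
#   mt = hl.read_matrix_table("gs://<bucket>/cohort.mt")
#   ref_ht = hl.read_table("gs://<bucket>/combined_reference_data.ht")
#   mt = annotate_with_reference(mt, ref_ht)
```

Because the lookup replaces the whole field, re-running it against a fresh reference Table would overwrite stale values rather than carry them over.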

For now: make sure the Annotation stage can supply these annotations, with as little dependence on external libraries as possible.

Future: obtain all annotations from a single source, so that this bespoke preparation isn't required?

@MattWellie MattWellie self-assigned this Mar 30, 2022
@MattWellie MattWellie mentioned this issue Mar 30, 2022
3 tasks
@MattWellie
Collaborator Author

MattWellie commented May 1, 2022

Plan with @vladsaveliev:

  • AIP to take 'ownership' of the VCF annotation component (methods present here)
  • Pipelines will import annotation function from here

This means:

  • AIP will be dependent on cpg_utils & cpg_pipes, but cpg_pipes is a separate sub-package within production-pipelines so no circular dependencies exist
  • AIP doesn't require the stage wrapper framework to use the annotation process, but methods are available as stage logic for other processes
  • Will remain dependent on preparation of input files for annotation, as currently defined in hail-elasticsearch-pipelines
    • @tiboloic has some thoughts on using the gnomAD-generated annotation bundles directly, which are already in use elsewhere in the CPG. This requires checks on size of data, genome build, availability, and frequency of regeneration
  • Should probably generate a package wrapper for AIP once the first draft is polished, so it can be imported from PyPI by other processes

Next:

  • Port over the annotation function(s) and check that they work outside of production pipelines
    • will require generation of a pure VCF from the ACG cohort MT for testing

@MattWellie
Collaborator Author

Note - the current logic mixes 3 things:

  1. annotating a VCF in chunks with VEP
  2. applying those VEP annotations to an MT
  3. applying VQSR annotations to an MT

Instead this will focus on parts 1 & 2, with any references to VQSR removed.
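
The separation of parts 1 and 2 could look roughly like this. It is a plain-Python model, not the pipeline code: `chunked`, `annotate_in_chunks`, `apply_annotations`, and the stand-in `fake_vep` are all hypothetical names, and variants are represented as simple `(contig, pos, ref, alt)` tuples rather than VCF records or MT rows.

```python
from typing import Callable, Dict, Iterator, List, Optional, Tuple

Variant = Tuple[str, int, str, str]  # (contig, position, ref, alt)


def chunked(items: List[Variant], size: int) -> Iterator[List[Variant]]:
    """Yield consecutive slices of at most `size` variants."""
    if size < 1:
        raise ValueError("size must be >= 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def annotate_in_chunks(
    variants: List[Variant],
    annotate_chunk: Callable[[List[Variant]], Dict[Variant, str]],
    size: int,
) -> Dict[Variant, str]:
    """Part 1: annotate each chunk independently, then merge results by variant."""
    merged: Dict[Variant, str] = {}
    for chunk in chunked(variants, size):
        merged.update(annotate_chunk(chunk))
    return merged


def apply_annotations(
    variants: List[Variant],
    annotations: Dict[Variant, str],
) -> List[Tuple[Variant, Optional[str]]]:
    """Part 2: attach annotations to rows; unannotated variants get None."""
    return [(v, annotations.get(v)) for v in variants]


def fake_vep(chunk: List[Variant]) -> Dict[Variant, str]:
    """Stand-in for a VEP run on one chunk (hypothetical; returns a dummy label)."""
    return {v: f"csq_for_{v[0]}:{v[1]}" for v in chunk}
```

Keeping the two parts as separate functions means the chunked annotation can be tested on a small VCF without touching the MT, and nothing in either step refers to VQSR.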

@tiboloic

tiboloic commented May 11, 2022

in reply to this comment

gnomAD has indeed generated various Hail Tables containing all the possible SNPs and annotated them with VEP. Here are the locations of some of them:

  • VEP 85 (GENCODE 19): gs://gnomad-public-requester-pays/resources/context/grch37_context_vep_annotated.ht
  • VEP 95 (GENCODE 29): gs://gnomad-public-requester-pays/resources/context/grch38_context_vep_annotated.ht
  • VEP 101 (GENCODE 35): gs://gnomad-public-requester-pays/resources/context/grch38_context_vep_annotated.v101.ht

The VEP 85 and 95 Tables are also available through various cloud providers' open datasets programs and can be downloaded without paying egress. More detail about how to access those can be found in this blog post and the gnomAD downloads page.

In my experience, performing a join with these tables to get the annotations is extremely fast and convenient. For example, I use it to get the LOFTEE annotations on pLoF variants.
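
Since the three tables listed above differ in VEP version and genome build, a small lookup can guard against joining against the wrong one. This is only a sketch: `ContextTable`, `CONTEXT_TABLES`, and `context_table_for` are hypothetical names, and the genome build of each entry is inferred from the `grch37`/`grch38` prefix in its file name.

```python
from typing import NamedTuple


class ContextTable(NamedTuple):
    genome_build: str  # inferred from the file name prefix
    gencode: int
    path: str


_BASE = "gs://gnomad-public-requester-pays/resources/context"

# Keyed by VEP version, using the paths listed in this comment.
CONTEXT_TABLES = {
    85: ContextTable("GRCh37", 19, f"{_BASE}/grch37_context_vep_annotated.ht"),
    95: ContextTable("GRCh38", 29, f"{_BASE}/grch38_context_vep_annotated.ht"),
    101: ContextTable("GRCh38", 35, f"{_BASE}/grch38_context_vep_annotated.v101.ht"),
}


def context_table_for(vep_version: int, genome_build: str) -> ContextTable:
    """Return the table for a VEP version, refusing a mismatched genome build."""
    try:
        table = CONTEXT_TABLES[vep_version]
    except KeyError:
        raise ValueError(f"no context table listed for VEP {vep_version}") from None
    if table.genome_build.lower() != genome_build.lower():
        raise ValueError(
            f"VEP {vep_version} table is {table.genome_build}, not {genome_build}"
        )
    return table
```

The returned `path` could then be passed to `hl.read_table` before the join. Note that these tables cover SNVs only, so indels would come back with missing annotations from such a join.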

@cassimons
Collaborator

> gnomAD has indeed generated various hail tables containing all the possible SNPs and annotated them with VEP

@tiboloic do you know if they have released the details of how this was generated? My Google-fu is coming up blank.

@tiboloic

> > gnomAD has indeed generated various hail tables containing all the possible SNPs and annotated them with VEP
>
> @tiboloic do you know if they have released the details of how this was generated? My Google-fu is coming up blank.

@cassimons the best I have found is the description in the supplementary material (page 31) of the gnomAD flagship paper, but it is very succinct. I have pasted it below:

> A dataset of every possible SNV in the human genome (2,858,658,098 sites x 3 substitutions at each site = 8,575,974,294 variants) along with 3 bases of genomic context was created using the GRCh37 reference. This dataset was annotated with methylation data for all CpG variants and coverage summaries as described above, and was subsequently used to annotate the exome and genome datasets where required downstream. The number of variants observed at each downsampling, broken down by variant class, is shown in Extended Data Fig. 4a. As previously shown, the CpG sites begin to saturate at a sample size of about 10,000 individuals, which affects the callset-wide transition/transversion (TiTv) ratio (Extended Data Fig. 4b). In order to compute the proportion of possible variants observed, we filtered the dataset of all possible SNVs to the exome calling intervals described previously [4] and considered only bases where exome coverage was >= 30X (Extended Data Fig. 4c).
