
[qob] Fix VEP for GRCh38 #14071

Merged
merged 12 commits into from Jan 12, 2024

Conversation

jigold

@jigold jigold commented Dec 5, 2023

CHANGELOG: Use indexed VEP cache files for GRCh38 on both dataproc and QoB.

Fixes #13989

In this PR, I did the following:

  1. Installed samtools in the Docker image to eliminate the errors in the log output.
  2. Added the --merged flag so that VEP uses the homo_sapiens_merged/ directory for its cache.
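The effect of --merged on cache resolution can be sketched as follows (the mount point and cache version here are illustrative, not the exact values baked into the image):

```shell
# Sketch of how VEP selects its cache directory (assumed paths).
DATA_MOUNT=/vep/data          # hypothetical cache mount point
VERSION=95

# Without --merged, VEP resolves the cache under homo_sapiens/:
echo "${DATA_MOUNT}/homo_sapiens/${VERSION}_GRCh38"

# With --merged, it resolves the cache under homo_sapiens_merged/ instead:
echo "${DATA_MOUNT}/homo_sapiens_merged/${VERSION}_GRCh38"
```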

Outstanding Issues:

  1. The FASTA files in homo_sapiens/ are not present in the merged dataset. Do we keep both the homo_sapiens/ and homo_sapiens_merged/ directories in our bucket, or do we copy the FASTA files into the merged directory?
  2. Once (1) is decided, I can fix this in dataproc. The easiest option is to add the tar file with the merged data to the dataproc VEP folders and use the --merged flag; however, that would double the startup time for VEP on a dataproc worker node.

Before: [screenshot of log output, 2023-12-05 12:42 PM]

After: [screenshot of log output, 2023-12-05 12:46 PM]

@@ -0,0 +1,21 @@
locus alleles
Collaborator

Good test!

dataproc_result = hl.import_table(resource('vep_grch38_input_req_indexed_cache.tsv'),
                                  key=['locus', 'alleles'],
                                  types={'locus': hl.tlocus('GRCh38'), 'alleles': hl.tarray(hl.tstr),
                                         'vep': hl.tstr}, force=True,
Collaborator

The types seem to include columns that don't exist in the TSV.

Also formatting.

@skip_unless_service_backend(clouds=['gcp'])
@set_gcs_requester_pays_configuration(GCS_REQUESTER_PAYS_PROJECT)
@test_timeout(batch=10 * 60)
def test_vep_grch38_using_indexed_cache(self):
Collaborator

While this describes the motivation, this test is really just testing that we can run VEP on loci with large positions. I prefer that we describe what the test is rather than what it aspires to be.

  @skip_unless_service_backend(clouds=['gcp'])
  @set_gcs_requester_pays_configuration(GCS_REQUESTER_PAYS_PROJECT)
- @test_timeout(batch=5 * 60)
+ @test_timeout(batch=10 * 60)
Collaborator

did these tests get slower?

Collaborator Author

No. I kept getting unlucky with the cluster being oversubscribed when I was trying to test this. I can revert back if you'd like.

danking
danking previously requested changes Dec 7, 2023
@@ -836,7 +836,7 @@ def command(self,
--offline \
--minimal \
--assembly GRCh38 \
--fasta {self.data_mount}/homo_sapiens/95_GRCh38/Homo_sapiens.GRCh38.dna.toplevel.fa.gz \
--fasta {self.data_mount}homo_sapiens/95_GRCh38/Homo_sapiens.GRCh38.dna.toplevel.fa.gz \
--plugin "LoF,loftee_path:/vep/ensembl-vep/Plugins/,gerp_bigwig:{self.data_mount}/gerp_conservation_scores.homo_sapiens.GRCh38.bw,human_ancestor_fa:{self.data_mount}/human_ancestor.fa.gz,conservation_file:{self.data_mount}/loftee.sql" \
--dir_plugins /vep/ensembl-vep/Plugins/ \
--dir_cache {self.data_mount} \
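Whether the interpolation above needs a literal `/` depends on whether `self.data_mount` ends with one (the real value is not shown in this excerpt, and the surrounding flags are inconsistent about the slash). A minimal sketch, with hypothetical mount paths, of why `os.path.join`-style construction sidesteps the ambiguity:

```python
import os.path

# Hypothetical mount points; the real value of self.data_mount is not shown
# in this excerpt.
for data_mount in ("/vep/data", "/vep/data/"):
    # Manual interpolation with a literal '/' doubles the slash when
    # data_mount already ends in '/':
    manual = f"{data_mount}/homo_sapiens/95_GRCh38"
    # os.path.join normalizes this case, producing the same result either way:
    joined = os.path.join(data_mount, "homo_sapiens", "95_GRCh38")
    print(manual, "->", joined)
```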
Collaborator

Why do we not need to specify --merged?

Collaborator Author

Because we decided to not use the "merged" data cache and are instead using "homo_sapiens_vep_95_GRCh38".

@danking

danking commented Dec 7, 2023

AFAICT, FASTAs live at:

ftp.ensembl.org/pub/release-95/fasta/homo_sapiens/dna_index/

whereas the VEP cache lives at

ftp.ensembl.org/pub/release-95/variation/indexed_vep_cache/homo_sapiens_merged_vep_95_GRCh38.tar.gz

These seem to be two distinct sources of data, so my inclination is to not move the FASTAs inside the cache folder. That seems likely to cause confusion for ourselves in the future. Seems very reasonable to have gs://bucket/cache/95_GRCh38/homo_sapiens_merged/... and gs://bucket/fasta/95_GRCh38/homo_sapiens/....
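The suggested separation might look like this (the bucket name and exact prefixes are hypothetical placeholders, not the project's real bucket):

```shell
# Hypothetical layout: cache and FASTA data under separate top-level prefixes,
# mirroring their distinct upstream sources on ftp.ensembl.org.
BUCKET=gs://bucket   # placeholder bucket name
echo "${BUCKET}/cache/95_GRCh38/homo_sapiens_merged/"   # indexed VEP cache
echo "${BUCKET}/fasta/95_GRCh38/homo_sapiens/"          # FASTA plus its index files
```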

@danking

danking commented Dec 7, 2023

That said, fixing dataproc with minimal changes seems best to me. If/when we upgrade to a newer VEP version we can change to a more sensible structure then.

@jigold jigold added the WIP label Jan 3, 2024
@jigold

jigold commented Jan 3, 2024

I put the WIP tag on because I haven't yet moved the new indexed tar file to all of the dataproc VEP buckets.

@jigold

jigold commented Jan 3, 2024

I also need to test this on dataproc.

@jigold jigold removed the WIP label Jan 8, 2024
@jigold

jigold commented Jan 8, 2024

The new tar file is now in all of the VEP replicate buckets for dataproc. The only differences are that it uses the indexed cache files and that its name contains "_indexed"; otherwise, it has the same contents and file structure as the non-indexed tar file that is currently there.

I tested this as best I could, but it would be prudent to give ourselves extra time when releasing in case there is a problem in the release script.

@danking

danking commented Jan 9, 2024

AFAICT, you didn't edit the release.sh script; do I misunderstand what you're worried about?

Can you run the dataproc tests via dev deploy and post the batch links here? I think this should do it

hailctl dev deploy --branch jigold/fix-vep-grch38-cache -s test_dataproc-38 -s test_dataproc-37

If those pass then I'm confident vep-GRCh38.sh is correct.

danking
danking previously requested changes Jan 9, 2024

@danking danking left a comment


see comment

@jigold

jigold commented Jan 10, 2024

I was just concerned that I hadn't tested dataproc after the changes and didn't want the release to fail. There wasn't anything about the actual release I changed.


@danking danking left a comment


LGTM

@jigold jigold removed the WIP label Jan 12, 2024
@danking danking merged commit 7ee4141 into hail-is:main Jan 12, 2024
8 checks passed

Successfully merging this pull request may close these issues.

[query] VEP appears to be broken in QoB (and perhaps also dataproc)