
[qob] Fix VEP for GRCh38 #14071

Merged
merged 12 commits into from Jan 12, 2024

Conversation

jigold

@jigold jigold commented Dec 5, 2023

CHANGELOG: Use indexed VEP cache files for GRCh38 on both dataproc and QoB.

Fixes #13989

In this PR, I did the following:

  1. Installed samtools in the Docker image to eliminate the errors in the log output.
  2. Added the --merged flag so that VEP uses the homo_sapiens_merged/ directory for its cache.
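The effect of --merged on cache resolution can be sketched as follows (the mount point and cache version here are illustrative, not the exact values baked into the image):

```shell
# Sketch of how VEP selects its cache directory (assumed paths).
DATA_MOUNT=/vep/data          # hypothetical cache mount point
VERSION=95

# Without --merged, VEP resolves the cache under homo_sapiens/:
echo "${DATA_MOUNT}/homo_sapiens/${VERSION}_GRCh38"

# With --merged, it resolves the cache under homo_sapiens_merged/ instead:
echo "${DATA_MOUNT}/homo_sapiens_merged/${VERSION}_GRCh38"
```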

Outstanding Issues:

  1. The FASTA files in homo_sapiens/ are not present in the merged dataset. Do we keep both the homo_sapiens/ and homo_sapiens_merged/ directories in our bucket, or do we copy the FASTA files into the merged directory?
  2. Once (1) is decided, I can fix this in dataproc. The easiest option is to add the tar file with the merged data to the dataproc VEP folders and use the --merged flag; however, that would double the startup time for VEP on a dataproc worker node.

Before: [screenshot of log output, 2023-12-05 12:42 PM]

After: [screenshot of log output, 2023-12-05 12:46 PM]

@@ -0,0 +1,21 @@
locus alleles
Collaborator

Good test!

dataproc_result = hl.import_table(resource('vep_grch38_input_req_indexed_cache.tsv'),
                                  key=['locus', 'alleles'],
                                  types={'locus': hl.tlocus('GRCh38'), 'alleles': hl.tarray(hl.tstr),
                                         'vep': hl.tstr}, force=True,
Collaborator

The types seem to include columns that don't exist in the TSV.

Also formatting.

@skip_unless_service_backend(clouds=['gcp'])
@set_gcs_requester_pays_configuration(GCS_REQUESTER_PAYS_PROJECT)
@test_timeout(batch=10 * 60)
def test_vep_grch38_using_indexed_cache(self):
Collaborator

While this describes the motivation, this test is really just testing that we can run VEP on loci with large positions. I prefer that we describe what the test is rather than what it aspires to be.

  @skip_unless_service_backend(clouds=['gcp'])
  @set_gcs_requester_pays_configuration(GCS_REQUESTER_PAYS_PROJECT)
- @test_timeout(batch=5 * 60)
+ @test_timeout(batch=10 * 60)
Collaborator

did these tests get slower?

Collaborator Author

No. I kept getting unlucky with the cluster being oversubscribed when I was trying to test this. I can revert back if you'd like.

danking
danking previously requested changes Dec 7, 2023
@@ -836,7 +836,7 @@ def command(self,
--offline \
--minimal \
--assembly GRCh38 \
--fasta {self.data_mount}/homo_sapiens/95_GRCh38/Homo_sapiens.GRCh38.dna.toplevel.fa.gz \
--fasta {self.data_mount}homo_sapiens/95_GRCh38/Homo_sapiens.GRCh38.dna.toplevel.fa.gz \
--plugin "LoF,loftee_path:/vep/ensembl-vep/Plugins/,gerp_bigwig:{self.data_mount}/gerp_conservation_scores.homo_sapiens.GRCh38.bw,human_ancestor_fa:{self.data_mount}/human_ancestor.fa.gz,conservation_file:{self.data_mount}/loftee.sql" \
--dir_plugins /vep/ensembl-vep/Plugins/ \
--dir_cache {self.data_mount} \
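Whether the interpolation above needs a literal `/` depends on whether `self.data_mount` ends with one (the real value is not shown in this excerpt, and the surrounding flags are inconsistent about the slash). A minimal sketch, with hypothetical mount paths, of why `os.path.join`-style construction sidesteps the ambiguity:

```python
import os.path

# Hypothetical mount points; the real value of self.data_mount is not shown
# in this excerpt.
for data_mount in ("/vep/data", "/vep/data/"):
    # Manual interpolation with a literal '/' doubles the slash when
    # data_mount already ends in '/':
    manual = f"{data_mount}/homo_sapiens/95_GRCh38"
    # os.path.join normalizes this case, producing the same result either way:
    joined = os.path.join(data_mount, "homo_sapiens", "95_GRCh38")
    print(manual, "->", joined)
```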
Collaborator

Why do we not need to specify --merged?

Collaborator Author

Because we decided to not use the "merged" data cache and are instead using "homo_sapiens_vep_95_GRCh38".

@danking

danking commented Dec 7, 2023

AFAICT, FASTAs live at:

ftp.ensembl.org/pub/release-95/fasta/homo_sapiens/dna_index/

whereas the VEP cache lives at

ftp.ensembl.org/pub/release-95/variation/indexed_vep_cache/homo_sapiens_merged_vep_95_GRCh38.tar.gz

These seem to be two distinct sources of data, so my inclination is to not move the FASTAs inside the cache folder. That seems likely to cause confusion for ourselves in the future. Seems very reasonable to have gs://bucket/cache/95_GRCh38/homo_sapiens_merged/... and gs://bucket/fasta/95_GRCh38/homo_sapiens/....
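The suggested separation might look like this (the bucket name and exact prefixes are hypothetical placeholders, not the project's real bucket):

```shell
# Hypothetical layout: cache and FASTA data under separate top-level prefixes,
# mirroring their distinct upstream sources on ftp.ensembl.org.
BUCKET=gs://bucket   # placeholder bucket name
echo "${BUCKET}/cache/95_GRCh38/homo_sapiens_merged/"   # indexed VEP cache
echo "${BUCKET}/fasta/95_GRCh38/homo_sapiens/"          # FASTA plus its index files
```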

@danking

danking commented Dec 7, 2023

That said, fixing dataproc with minimal changes seems best to me. If/when we upgrade to a newer VEP version we can change to a more sensible structure then.

@jigold jigold added the WIP label Jan 3, 2024
@jigold

jigold commented Jan 3, 2024

I put the WIP tag on because I haven't yet moved the new indexed tar file to all of the dataproc VEP buckets.

@jigold

jigold commented Jan 3, 2024

I also need to test this on dataproc.

@jigold jigold removed the WIP label Jan 8, 2024
@jigold

jigold commented Jan 8, 2024

The new tar file is now in all of the VEP replicate buckets for dataproc. The only differences are that it uses the indexed cache files and that its name contains "_indexed"; otherwise, it has the same contents and file structure as the non-indexed tar file that is currently there.

I tested this as best I could, but it would be prudent to give ourselves extra time when releasing in case there is a problem in the release script.

@danking

danking commented Jan 9, 2024

AFAICT, you didn't edit the release.sh script; do I misunderstand what you're worried about?

Can you run the dataproc tests via dev deploy and post the batch links here? I think this should do it

hailctl dev deploy --branch jigold/fix-vep-grch38-cache -s test_dataproc-38 -s test_dataproc-37

If those pass then I'm confident vep-GRCh38.sh is correct.

danking
danking previously requested changes Jan 9, 2024

@danking danking left a comment


see comment

@jigold

jigold commented Jan 10, 2024

I was just concerned that I hadn't tested dataproc after the changes and didn't want the release to fail. There wasn't anything about the actual release I changed.


@danking danking left a comment


LGTM

@jigold jigold removed the WIP label Jan 12, 2024
@danking danking merged commit 7ee4141 into hail-is:main Jan 12, 2024
8 checks passed

Successfully merging this pull request may close these issues.

[query] VEP appears to be broken in QoB (and perhaps also dataproc)