@jigold jigold commented Dec 5, 2023

CHANGELOG: Use indexed VEP cache files for GRCh38 on both dataproc and QoB.

Fixes #13989

In this PR, I did the following:

  1. Installed samtools into the Docker image to get rid of errors in the log output
  2. Added the --merged flag so that VEP will use the directory homo_sapiens_merged for the cache

Outstanding Issues:

  1. The FASTA files that are in homo_sapiens/ were not present in the merged dataset. Do we keep both the homo_sapiens/ and homo_sapiens_merged/ directories in our bucket, or do we transfer the FASTA files to the merged directory?
  2. Once we decide the answer to (1), then I can fix this in dataproc. The easiest thing to do is to add the tar file with the _merged data to the dataproc vep folders and use the --merged flag. However, that will double the startup time for VEP on a worker node in dataproc.
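For reference, a minimal sketch of how the --merged flag changes the cache directory VEP looks in, assuming the usual `<dir_cache>/<species>/<version>_<assembly>` layout. The function name and the `/vep_data` mount point are hypothetical and are not taken from this PR.

```python
from pathlib import PurePosixPath


def expected_cache_dir(dir_cache: str, species: str, version: int,
                       assembly: str, merged: bool = False) -> PurePosixPath:
    """Return the directory VEP is expected to read its cache from.

    Assumption: with --merged, VEP uses "<species>_merged" instead of
    "<species>" as the species directory name.
    """
    species_dir = f"{species}_merged" if merged else species
    return PurePosixPath(dir_cache) / species_dir / f"{version}_{assembly}"


# "/vep_data" is a hypothetical mount point used only for illustration.
print(expected_cache_dir("/vep_data", "homo_sapiens", 95, "GRCh38"))
print(expected_cache_dir("/vep_data", "homo_sapiens", 95, "GRCh38", merged=True))
```

Under this layout, a FASTA kept in homo_sapiens/95_GRCh38/ is not found through the merged cache directory, which is why --fasta has to point at it explicitly.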

Before:
[Screenshot 2023-12-05 at 12:42:16 PM]

After:
[Screenshot 2023-12-05 at 12:46:30 PM]

@@ -0,0 +1,21 @@
locus alleles
Contributor

Good test!

dataproc_result = hl.import_table(resource('vep_grch38_input_req_indexed_cache.tsv'),
key=['locus', 'alleles'],
types={'locus': hl.tlocus('GRCh38'), 'alleles': hl.tarray(hl.tstr),
'vep': hl.tstr}, force=True,
Contributor

The types seem to include columns that don't exist in the TSV.

Also formatting.
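A check of this kind can be sketched with the standard library alone, assuming the TSV header is its first line. The helper name, the sample rows, and the string stand-ins for the Hail types are all hypothetical.

```python
import csv
import io


def undeclared_columns(tsv_text: str, types: dict) -> set:
    """Return the keys of `types` that are not columns in the TSV header."""
    header = next(csv.reader(io.StringIO(tsv_text), delimiter="\t"))
    return set(types) - set(header)


# Hypothetical sample data; only the header line matters for the check.
sample = "locus\talleles\nchr1:100\t[\"A\",\"T\"]\n"
# Strings stand in for hl.tlocus('GRCh38'), hl.tarray(hl.tstr) and hl.tstr.
declared = {"locus": "locus<GRCh38>", "alleles": "array<str>", "vep": "str"}

print(undeclared_columns(sample, declared))  # {'vep'}
```

An empty result would mean every declared type corresponds to a real column.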

@skip_unless_service_backend(clouds=['gcp'])
@set_gcs_requester_pays_configuration(GCS_REQUESTER_PAYS_PROJECT)
@test_timeout(batch=10 * 60)
def test_vep_grch38_using_indexed_cache(self):
Contributor

While this describes the motivation, this test is really just testing that we can run VEP on loci with large positions. I prefer that we describe what the test is rather than what it aspires to be.

@skip_unless_service_backend(clouds=['gcp'])
@set_gcs_requester_pays_configuration(GCS_REQUESTER_PAYS_PROJECT)
- @test_timeout(batch=5 * 60)
+ @test_timeout(batch=10 * 60)
Contributor

did these tests get slower?

Contributor Author

No. I kept getting unlucky with the cluster being oversubscribed when I was trying to test this. I can revert back if you'd like.

danking previously requested changes Dec 7, 2023
--fasta {self.data_mount}homo_sapiens/95_GRCh38/Homo_sapiens.GRCh38.dna.toplevel.fa.gz \
--plugin "LoF,loftee_path:/vep/ensembl-vep/Plugins/,gerp_bigwig:{self.data_mount}/gerp_conservation_scores.homo_sapiens.GRCh38.bw,human_ancestor_fa:{self.data_mount}/human_ancestor.fa.gz,conservation_file:{self.data_mount}/loftee.sql" \
--dir_plugins /vep/ensembl-vep/Plugins/ \
--dir_cache {self.data_mount} \
Contributor

Why do we not need to specify --merged?

Contributor Author

Because we decided to not use the "merged" data cache and are instead using "homo_sapiens_vep_95_GRCh38".

danking commented Dec 7, 2023

AFAICT, FASTAs live at:

ftp.ensembl.org/pub/release-95/fasta/homo_sapiens/dna_index/

whereas the VEP cache lives at

ftp.ensembl.org/pub/release-95/variation/indexed_vep_cache/homo_sapiens_merged_vep_95_GRCh38.tar.gz

These seem to be two distinct sources of data, so my inclination is to not move the FASTAs inside the cache folder. That seems likely to cause confusion for ourselves in the future. Seems very reasonable to have gs://bucket/cache/95_GRCh38/homo_sapiens_merged/... and gs://bucket/fasta/95_GRCh38/homo_sapiens/....

danking commented Dec 7, 2023

That said, fixing dataproc with minimal changes seems best to me. If/when we upgrade to a newer VEP version we can change to a more sensible structure then.

@jigold jigold added the WIP label Jan 3, 2024
jigold commented Jan 3, 2024

I put the WIP tag on because I didn't move the new indexed tar file to all of the VEP buckets yet for dataproc.

jigold commented Jan 3, 2024

I also need to test this on dataproc.

@jigold jigold removed the WIP label Jan 8, 2024
@jigold jigold force-pushed the fix-vep-grch38-cache branch from 6b38ffe to 7feda77 Compare January 8, 2024 16:40
jigold commented Jan 8, 2024

The new tar file is now in all VEP replicates for dataproc. The only changes are that it uses the indexed cache files and that its file name contains "_indexed". Otherwise, it should have the same contents / file structure as the non-indexed tar file that is there currently.

I tested this as best as I could, but it would be prudent to give ourselves time when releasing this in case there is a problem in the release script.
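One way to spot-check the "same file structure" claim is to compare the sets of directories in the two archives. This is only a sketch: the helper names and the file names inside the demo archives are hypothetical, and the demo builds tiny in-memory tar files rather than reading the real ones.

```python
import io
import tarfile


def directory_set(tar: tarfile.TarFile) -> set:
    """Collect every directory that is listed in, or directly holds a file of, the archive."""
    dirs = set()
    for member in tar.getmembers():
        name = member.name.rstrip("/")
        if member.isdir():
            dirs.add(name)
        elif "/" in name:
            dirs.add(name.rsplit("/", 1)[0])
    return dirs


def same_layout(old: tarfile.TarFile, new: tarfile.TarFile) -> bool:
    """True when both archives contain the same set of directories."""
    return directory_set(old) == directory_set(new)


def make_tar(names):
    """Build an in-memory tar of empty files (demo helper only)."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name in names:
            info = tarfile.TarInfo(name)
            info.size = 0
            tar.addfile(info, io.BytesIO(b""))
    buf.seek(0)
    return tarfile.open(fileobj=buf, mode="r")


# Hypothetical member names, chosen only to illustrate the comparison.
non_indexed = make_tar(["homo_sapiens/95_GRCh38/1/1-1000000_var.gz"])
indexed = make_tar(["homo_sapiens/95_GRCh38/1/all_vars.gz",
                    "homo_sapiens/95_GRCh38/1/all_vars.gz.csi"])

print(same_layout(non_indexed, indexed))  # True
```

Comparing directories rather than individual files is deliberate here, since the indexed cache is expected to differ in its per-directory variant files.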

danking commented Jan 9, 2024

AFAICT, you didn't edit the release.sh script; do I misunderstand what you're worried about?

Can you run the dataproc tests via dev deploy and post the batch links here? I think this should do it:

hailctl dev deploy --branch jigold/fix-vep-grch38-cache -s test_dataproc-38 -s test_dataproc-37

If those pass then I'm confident vep-GRCh38.sh is correct.

danking previously requested changes Jan 9, 2024
@danking danking left a comment

see comment

jigold commented Jan 10, 2024

jigold commented Jan 10, 2024

I was just concerned that I hadn't tested dataproc after the changes and didn't want the release to fail. There wasn't anything about the actual release I changed.

@danking danking left a comment

LGTM

@jigold jigold force-pushed the fix-vep-grch38-cache branch from e1f4525 to a52e369 Compare January 12, 2024 16:00
@jigold jigold added the WIP label Jan 12, 2024
@jigold jigold force-pushed the fix-vep-grch38-cache branch from a52e369 to 2386c9c Compare January 12, 2024 16:59
@jigold jigold removed the WIP label Jan 12, 2024
@danking danking merged commit 7ee4141 into hail-is:main Jan 12, 2024
Development

Successfully merging this pull request may close these issues.

[query] VEP appears to be broken in QoB (and perhaps also dataproc)
