[qob] Fix VEP for GRCh38 #14071
Conversation
force-pushed from dbdd9c5 to adaaa70
@@ -0,0 +1,21 @@
locus alleles
Good test!
dataproc_result = hl.import_table(resource('vep_grch38_input_req_indexed_cache.tsv'),
key=['locus', 'alleles'],
types={'locus': hl.tlocus('GRCh38'), 'alleles': hl.tarray(hl.tstr),
'vep': hl.tstr}, force=True,
The types seem to include columns that don't exist in the TSV.
Also formatting.
@skip_unless_service_backend(clouds=['gcp'])
@set_gcs_requester_pays_configuration(GCS_REQUESTER_PAYS_PROJECT)
@test_timeout(batch=10 * 60)
def test_vep_grch38_using_indexed_cache(self):
While this describes the motivation, this test is really just testing that we can run VEP on loci with large positions. I prefer that we describe what the test is rather than what it aspires to be.
@skip_unless_service_backend(clouds=['gcp'])
@set_gcs_requester_pays_configuration(GCS_REQUESTER_PAYS_PROJECT)
@test_timeout(batch=5 * 60)
@test_timeout(batch=10 * 60)
did these tests get slower?
No. I kept getting unlucky with the cluster being oversubscribed when I was trying to test this. I can revert back if you'd like.
@@ -836,7 +836,7 @@ def command(self,
--offline \
--minimal \
--assembly GRCh38 \
--fasta {self.data_mount}/homo_sapiens/95_GRCh38/Homo_sapiens.GRCh38.dna.toplevel.fa.gz \
--fasta {self.data_mount}homo_sapiens/95_GRCh38/Homo_sapiens.GRCh38.dna.toplevel.fa.gz \
--plugin "LoF,loftee_path:/vep/ensembl-vep/Plugins/,gerp_bigwig:{self.data_mount}/gerp_conservation_scores.homo_sapiens.GRCh38.bw,human_ancestor_fa:{self.data_mount}/human_ancestor.fa.gz,conservation_file:{self.data_mount}/loftee.sql" \
--dir_plugins /vep/ensembl-vep/Plugins/ \
--dir_cache {self.data_mount} \
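The one-line change in this hunk removes the `/` between `{self.data_mount}` and `homo_sapiens`. As a hypothetical illustration of why that separator can matter (the mount-point value below is invented, and whether a doubled slash was the actual symptom here is my assumption; the diff only shows the separator being dropped):

```python
# Hypothetical mount point; the real value comes from the VEP container config.
data_mount = "/vep_data/"

# If data_mount already ends in "/", interpolating another "/" doubles it:
with_slash = f"{data_mount}/homo_sapiens/95_GRCh38/Homo_sapiens.GRCh38.dna.toplevel.fa.gz"

# Dropping the literal "/" keeps the path clean:
without_slash = f"{data_mount}homo_sapiens/95_GRCh38/Homo_sapiens.GRCh38.dna.toplevel.fa.gz"

print(with_slash)     # contains "//"
print(without_slash)  # single separators throughout
```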
Why do we not need to specify --merged?
Because we decided to not use the "merged" data cache and are instead using "homo_sapiens_vep_95_GRCh38".
AFAICT, FASTAs live at:
whereas the VEP cache lives at
These seem to be two distinct sources of data, so my inclination is to not move the FASTAs inside the cache folder. That seems likely to cause confusion for ourselves in the future. Seems very reasonable to have
That said, fixing dataproc with minimal changes seems best to me. If/when we upgrade to a newer VEP version we can change to a more sensible structure then.
I put the WIP tag on because I didn't move the new indexed tar file to all of the VEP buckets yet for dataproc.
I also need to test this on dataproc.
force-pushed from 6b38ffe to 7feda77
The new tar file is now in all VEP replicates for dataproc. The only change is that it uses the indexed cache files and the tar file has the word "_indexed" in its name. Otherwise, it should have the same contents and file structure as the non-indexed tar file that is there currently. I tested this as best I could, but it would be prudent to give ourselves time when releasing this in case there is a problem in the release script.
AFAICT, you didn't edit the release.sh script; do I misunderstand what you're worried about? Can you run the dataproc tests via dev deploy and post the batch links here? I think this should do it
If those pass then I'm confident
see comment
grch38 - https://ci.hail.is/batches/8104688
I was just concerned that I hadn't tested dataproc after the changes and didn't want the release to fail. There wasn't anything about the actual release process that I changed.
LGTM
force-pushed from e1f4525 to a52e369
force-pushed from a52e369 to 2386c9c
CHANGELOG: Use indexed VEP cache files for GRCh38 on both dataproc and QoB.

Fixes #13989

In this PR, I did the following:

- … --merged flag so that VEP will use the directory homo_sapiens_merged for the cache

Outstanding Issues:

- … homo_sapiens/ were not present in the merged dataset. Do we keep both the homo_sapiens and homo_sapiens_merged/ directories in our bucket or do we transfer the FASTA files to the merged directory?
- … _merged data to the dataproc vep folders and use the --merged flag. However, that will double the startup time for VEP on a worker node in dataproc.

Before:
After:
After: