Merged
30 changes: 30 additions & 0 deletions gcp_variant_transforms/testing/data/vcf/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,30 @@
This file summarizes the contents and purpose of each file/folder within the
current folder.
Member:

When you say "current folder", does it mean that you intend to copy valid-4.* files here as well? They are not copied in this PR, are they?

Contributor Author (@allieychen, Mar 12, 2018):

The new .vcf files are added to the folder testing/data/vcf/, which already includes the valid-4.* files.

Member:

Oh, I see, that's my bad, I should have copied the valid-4.2_VEP.vcf test file to here as well.


`valid-4.0.vcf`, `valid-4.0.vcf.gz`, and `valid-4.0.vcf.bz2` are used to test
Variant Call Format version 4.0 files in uncompressed, gzip, and bzip2 form,
respectively. For more details on the VCF version specifications, please refer
to [VCF Specification](https://samtools.github.io/hts-specs/).
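
As a rough illustration of what the three variants exercise, the following sketch opens a VCF file according to its suffix and reads the `##fileformat` line. The helper names `open_vcf` and `read_fileformat` are hypothetical and are not part of this repository; only the Python standard library is used.

```python
import bz2
import gzip


def open_vcf(path):
    """Open a VCF file as text, picking a decompressor from the file suffix."""
    if path.endswith(".gz"):
        return gzip.open(path, "rt")
    if path.endswith(".bz2"):
        return bz2.open(path, "rt")
    return open(path, "r")


def read_fileformat(path):
    """Return the value of the ##fileformat header line, or None if absent."""
    with open_vcf(path) as handle:
        first_line = handle.readline().strip()
    prefix = "##fileformat="
    if first_line.startswith(prefix):
        return first_line[len(prefix):]
    return None
```

Under these assumptions, calling `read_fileformat` on any of the three `valid-4.0` files should yield the same version string, since only the compression differs.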

`valid-4.1-large.vcf` and `valid-4.1-large.vcf.gz` are used to test version 4.1
uncompressed and gzip VCF files, respectively.

`valid-4.2.vcf` and `valid-4.2.vcf.gz` are used to test version 4.2 uncompressed
and gzip VCF files, respectively.

`invalid-4.0-AF-field-removed.vcf` is created from `valid-4.0.vcf` by removing
the `AF` field definition from the meta-information. It is used to test that the
`AF` field can still be parsed correctly given a `representative_header_file`
containing `AF`.
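
To make the defect in this file concrete, here is a minimal sketch that lists INFO keys used in records but never declared in the header. The function `undefined_info_keys` is a hypothetical helper written for illustration, and it assumes tab-separated records with at least eight columns.

```python
import re

# Matches the ID of a "##INFO=<ID=...," meta-information line.
INFO_DEF = re.compile(r"^##INFO=<ID=([^,>]+)")


def undefined_info_keys(lines):
    """Return the sorted INFO keys that records use but the header omits."""
    defined, used = set(), set()
    for line in lines:
        line = line.rstrip("\n")
        match = INFO_DEF.match(line)
        if match:
            defined.add(match.group(1))
        elif line and not line.startswith("#"):
            info = line.split("\t")[7]  # INFO is the eighth column
            for item in info.split(";"):
                if item and item != ".":
                    used.add(item.split("=", 1)[0])
    return sorted(used - defined)
```

For a file like this one, where the records still carry `AF=...` but no `##INFO=<ID=AF,...>` line exists, the sketch would report `AF`.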

`invalid-4.0-POS-empty.vcf` is created from `valid-4.0.vcf` by removing the POS
value of the first entry. It is used to test that, when `allow_malformed_records`
is enabled, failed VCF record reads do not raise errors and the BigQuery table
can still be generated.
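
The behaviour being tested can be sketched as follows. This is a simplified stand-in, not the pipeline's actual reader: `parse_positions` is a hypothetical helper, and it assumes tab-separated records in which an empty POS column fails integer conversion.

```python
def parse_positions(lines, allow_malformed_records=False):
    """Parse POS from tab-separated VCF data lines.

    Returns (positions, malformed_lines). Raises ValueError on the first bad
    record unless allow_malformed_records is True.
    """
    positions, malformed = [], []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header lines
        fields = line.rstrip("\n").split("\t")
        try:
            positions.append(int(fields[1]))
        except (IndexError, ValueError):
            if not allow_malformed_records:
                raise ValueError("malformed record: %r" % line)
            malformed.append(line)  # keep going, remember the bad record
    return positions, malformed
```

With the flag enabled, the record with the empty POS is set aside and the remaining records are still returned, which mirrors the README's statement that the table can still be generated.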

The folder `merge` is created to test the merge options. It contains three .vcf
files: `merge1.vcf` contains two samples, while `merge2.vcf` and `merge3.vcf`
each contain one other sample. When `MOVE_TO_CALLS` is selected, the variant
call with `POS = 14370` is meant to be merged across all three files, while the
call with `POS = 1234567` is designed to be merged across `merge1.vcf` and
`merge2.vcf`.
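
The intended effect can be sketched with a small grouping example. The merge key used here, `(chrom, pos, ref, alt)`, is an assumption made for this sketch and may not match the exact key the pipeline uses; `move_to_calls` is a hypothetical function name.

```python
from collections import defaultdict


def move_to_calls(variants):
    """Group variants that share a merge key and pool their sample names.

    Each variant is a (chrom, pos, ref, alt, samples) tuple. The key
    (chrom, pos, ref, alt) is an assumption made for this sketch.
    """
    merged = defaultdict(list)
    for chrom, pos, ref, alt, samples in variants:
        merged[(chrom, pos, ref, alt)].extend(samples)
    return dict(merged)


variants = [
    # merge1.vcf
    ("20", 14370, "G", "A", ["NA00001", "NA00002"]),
    ("20", 17290, "T", "A", ["NA00001", "NA00002"]),
    ("19", 1234567, "GTCT", "G,GTACT", ["NA00001", "NA00002"]),
    # merge2.vcf
    ("20", 14370, "G", "A", ["NA00003"]),
    ("20", 17330, "T", "A", ["NA00003"]),
    ("19", 1234567, "GTCT", "G,GTACT", ["NA00003"]),
    # merge3.vcf
    ("20", 14370, "G", "A", ["NA00004"]),
]

merged = move_to_calls(variants)
print(len(variants), len(merged))  # 7 4
```

The seven input records collapse to four merged rows: one for `POS = 14370` with four samples, one for `POS = 1234567` with three samples, and the two unmatched records at 17290 and 17330.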
@@ -0,0 +1,22 @@
##fileformat=VCFv4.0
##fileDate=20090805
##source=myImputationProgramV3.1
##reference=1000GenomesPilot-NCBI36
##phasing=partial
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">
##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">
##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership">
##FILTER=<ID=q10,Description="Quality below 10">
##FILTER=<ID=s50,Description="Less than 50% of samples have data">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002 NA00003
20 14370 rs6054257 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,.
20 17330 . T A 3 q10 NS=3;DP=11;AF=0.017 GT:GQ:DP:HQ 0|0:49:3:58,50 0|1:3:5:65,3 0/0:41:3
20 1110696 rs6040355 A G,T 67 PASS NS=2;DP=10;AF=0.333,0.667;AA=T;DB GT:GQ:DP:HQ 1|2:21:6:23,27 2|1:2:0:18,2 2/2:35:4
20 1230237 . T . 47 PASS NS=3;DP=13;AA=T GT:GQ:DP:HQ 0|0:54:7:56,60 0|0:48:4:51,51 0/0:61:2
19 1234567 microsat1 GTCT G,GTACT 50 PASS NS=3;DP=9;AA=G GT:GQ:DP 0/1:35:4 0/2:17:2 1/1:40:3
23 changes: 23 additions & 0 deletions gcp_variant_transforms/testing/data/vcf/invalid-4.0-POS-empty.vcf
@@ -0,0 +1,23 @@
##fileformat=VCFv4.0
##fileDate=20090805
##source=myImputationProgramV3.1
##reference=1000GenomesPilot-NCBI36
##phasing=partial
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=AF,Number=.,Type=Float,Description="Allele Frequency">
##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">
##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">
##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership">
##FILTER=<ID=q10,Description="Quality below 10">
##FILTER=<ID=s50,Description="Less than 50% of samples have data">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002 NA00003
20 rs6054257 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,.
20 17330 . T A 3 q10 NS=3;DP=11;AF=0.017 GT:GQ:DP:HQ 0|0:49:3:58,50 0|1:3:5:65,3 0/0:41:3
20 1110696 rs6040355 A G,T 67 PASS NS=2;DP=10;AF=0.333,0.667;AA=T;DB GT:GQ:DP:HQ 1|2:21:6:23,27 2|1:2:0:18,2 2/2:35:4
20 1230237 . T . 47 PASS NS=3;DP=13;AA=T GT:GQ:DP:HQ 0|0:54:7:56,60 0|0:48:4:51,51 0/0:61:2
19 1234567 microsat1 GTCT G,GTACT 50 PASS NS=3;DP=9;AA=G GT:GQ:DP 0/1:35:4 0/2:17:2 1/1:40:3
20 changes: 20 additions & 0 deletions gcp_variant_transforms/testing/data/vcf/merge/merge1.vcf
@@ -0,0 +1,20 @@
##fileformat=VCFv4.0
##fileDate=20090805
##source=myImputationProgramV3.1
##reference=1000GenomesPilot-NCBI36
##phasing=partial
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">
##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">
##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">
##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership">
##FILTER=<ID=q10,Description="Quality below 10">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002
20 14370 rs6054257 G A 10 q10 NS=2;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51
20 17290 . T A 3 q10 NS=2;DP=11;AF=0.017 GT:GQ:DP:HQ 0|0:49:3:58,50 0|1:3:5:65,3
19 1234567 microsat1 GTCT G,GTACT 50 PASS NS=2;DP=9;AA=G GT:GQ:DP 0/1:35:4 0/2:17:2
21 changes: 21 additions & 0 deletions gcp_variant_transforms/testing/data/vcf/merge/merge2.vcf
@@ -0,0 +1,21 @@
##fileformat=VCFv4.0
##fileDate=20090805
##source=myImputationProgramV3.1
##reference=1000GenomesPilot-NCBI36
##phasing=partial
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">
##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">
##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">
##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership">
##FILTER=<ID=q10,Description="Quality below 10">
##FILTER=<ID=s50,Description="Less than 50% of samples have data">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00003
20 14370 rs6054257 G A 29 PASS NS=1;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ 1/1:43:5:.,.
20 17330 . T A 3 q10 NS=1;DP=11;AF=0.017 GT:GQ:DP:HQ 0/0:41:3
19 1234567 microsat2 GTCT G,GTACT 50 PASS NS=1;DP=9;AA=G GT:GQ:DP 1/1:40:3
19 changes: 19 additions & 0 deletions gcp_variant_transforms/testing/data/vcf/merge/merge3.vcf
@@ -0,0 +1,19 @@
##fileformat=VCFv4.0
##fileDate=20090805
##source=myImputationProgramV3.1
##reference=1000GenomesPilot-NCBI36
##phasing=partial
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">
##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">
##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">
##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership">
##FILTER=<ID=q10,Description="Quality below 10">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00004
20 14370 rs6054257 G A 30 PASS NS=1;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ 0|0:48:1:51,51

@@ -0,0 +1,58 @@
{
"test_name": "merge-option-copy-filter-to-calls",
"table_name": "merge_option_copy_filter_to_calls",
"input_pattern": "gs://gcp-variant-transforms-testfiles/small_tests/merge/*.vcf",
"variant_merge_strategy": "MOVE_TO_CALLS",
"copy_filter_to_calls": true,
"runner": "DataflowRunner",
"assertion_configs": [
{
"query": ["NUM_ROWS_QUERY"],
"expected_result": {"num_rows": 4}
},
{
"query": ["SUM_START_QUERY"],
"expected_result": {"sum_start": 1283553}
},
{
"query": ["SUM_END_QUERY"],
"expected_result": {"sum_end": 1283560}
},
{
"query": [
"SELECT COUNT(0) AS num_rows ",
"FROM {TABLE_NAME} AS t, t.call as call ",
"WHERE start_position = 14369 AND call.name ='NA00001' ",
Member:

Consider verifying position 1234567 as well, since you expect a different number of calls there.

Contributor Author (@allieychen, Mar 12, 2018):

I am not trying to test the number of calls. Instead, I am trying to test that the BQ table has a call.filter column when copy_filter_to_calls is set to true, and that its value is copied from the original VCF file. Am I heading in the wrong direction?

Do you mean adding those test cases to the move_to_calls test?

Member:

Thanks for clarifying, I misunderstood what you are trying to do; and yes, both for this comment and the next one, I was thinking about testing MOVE_TO_CALLS merge strategy and checking that calls are merged. Up to you whether you want to test call names in that test or not, I am okay either way.

Contributor Author:

Thanks. I have added a test case to merge_option_move_to_calls to check that the calls are merged for one specific position. I also added a similar test case to merge_option_none to validate that the calls are not merged.

"AND 'q10' IN UNNEST (call.filter)"
],
"expected_result": {"num_rows": 1}
Member:

Instead of having 4 queries returning "1" each, what do you think about having a single query that, instead of COUNT, selects call.name with an ORDER BY at the end (to make the output unique)? For the expected_results you would then verify the list of call names, i.e., "NA*".

Contributor Author:

For one thing, I think each query is expected to return only one row. Another concern I have is that if we use call.name instead of queries based on call.filter, the test will pass with or without --copy_filter_to_calls.

},
{
"query": [
"SELECT COUNT(0) AS num_rows ",
"FROM {TABLE_NAME} AS t, t.call as call ",
"WHERE start_position = 14369 AND call.name ='NA00002' ",
"AND 'q10' IN UNNEST (call.filter)"
],
"expected_result": {"num_rows": 1}
},
{
"query": [
"SELECT COUNT(0) AS num_rows ",
"FROM {TABLE_NAME} AS t, t.call as call ",
"WHERE start_position = 14369 AND call.name ='NA00003' ",
"AND 'PASS' IN UNNEST (call.filter)"
],
"expected_result": {"num_rows": 1}
},
{
"query": [
"SELECT COUNT(0) AS num_rows ",
"FROM {TABLE_NAME} AS t, t.call as call ",
"WHERE start_position = 14369 AND call.name ='NA00004' ",
"AND 'PASS' IN UNNEST (call.filter)"
],
"expected_result": {"num_rows": 1}
}
]
}
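
The expectation encoded by the four filter assertions above can be sketched as follows: when records from several files are merged into one row, each call carries the FILTER value of the file it came from. The function `copy_filter_to_calls` below is a hypothetical illustration, not the pipeline's implementation; the data mirrors the `POS = 14370` records in `merge1.vcf`, `merge2.vcf`, and `merge3.vcf`.

```python
def copy_filter_to_calls(records):
    """Return {sample_name: [filters]} for records merged into one row.

    Each record is a (filters, samples) pair taken from one input file.
    """
    call_filters = {}
    for filters, samples in records:
        for sample in samples:
            call_filters[sample] = list(filters)
    return call_filters


records_at_14370 = [
    (["q10"], ["NA00001", "NA00002"]),  # merge1.vcf
    (["PASS"], ["NA00003"]),            # merge2.vcf
    (["PASS"], ["NA00004"]),            # merge3.vcf
]

for name, filters in sorted(copy_filter_to_calls(records_at_14370).items()):
    print(name, filters)
```

This matches the assertions: `q10` for NA00001 and NA00002, `PASS` for NA00003 and NA00004.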
@@ -0,0 +1,54 @@
{
"test_name": "merge-option-copy-quality-to-calls",
"table_name": "merge_option_copy_quality_to_calls",
"input_pattern": "gs://gcp-variant-transforms-testfiles/small_tests/merge/*.vcf",
"variant_merge_strategy": "MOVE_TO_CALLS",
"copy_quality_to_calls": true,
"runner": "DataflowRunner",
"assertion_configs": [
{
"query": ["NUM_ROWS_QUERY"],
"expected_result": {"num_rows": 4}
},
{
"query": ["SUM_START_QUERY"],
"expected_result": {"sum_start": 1283553}
},
{
"query": ["SUM_END_QUERY"],
"expected_result": {"sum_end": 1283560}
},
{
"query": [
"SELECT call.quality AS quality ",
"FROM {TABLE_NAME} AS t, t.call as call ",
"WHERE start_position = 14369 AND call.name ='NA00001'"
],
"expected_result": {"quality": 10.0}
},
{
"query": [
"SELECT call.quality AS quality ",
"FROM {TABLE_NAME} AS t, t.call as call ",
"WHERE start_position = 14369 AND call.name ='NA00002'"
],
"expected_result": {"quality": 10.0}
},
{
"query": [
"SELECT call.quality AS quality ",
"FROM {TABLE_NAME} AS t, t.call as call ",
"WHERE start_position = 14369 AND call.name ='NA00003'"
],
"expected_result": {"quality": 29.0}
},
{
"query": [
"SELECT call.quality AS quality ",
"FROM {TABLE_NAME} AS t, t.call as call ",
"WHERE start_position = 14369 AND call.name ='NA00004'"
],
"expected_result": {"quality": 30.0}
}
]
}
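
Each `query` entry in these configs is a list of string fragments with a `{TABLE_NAME}` placeholder. A minimal sketch of how such a list could be turned into a single SQL string is shown below; `build_query` and the table name passed to it are hypothetical, and the test framework's actual substitution logic may differ.

```python
def build_query(parts, table_name):
    """Join query fragments and substitute the {TABLE_NAME} placeholder."""
    sql = " ".join(part.strip() for part in parts)
    return sql.format(TABLE_NAME=table_name)


parts = [
    "SELECT call.quality AS quality ",
    "FROM {TABLE_NAME} AS t, t.call as call ",
    "WHERE start_position = 14369 AND call.name ='NA00001'",
]

# "my_dataset" is a placeholder dataset name used only for this example.
print(build_query(parts, "my_dataset.merge_option_copy_quality_to_calls"))
```

Stripping each fragment before joining keeps the result single-spaced regardless of whether a fragment ends with a trailing space.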
@@ -0,0 +1,54 @@
{
"test_name": "merge-option-info-keys-to-move-to-calls-regex",
"table_name": "merge_option_info_keys_to_move_to_calls_regex",
"input_pattern": "gs://gcp-variant-transforms-testfiles/small_tests/merge/*.vcf",
"variant_merge_strategy": "MOVE_TO_CALLS",
"info_keys_to_move_to_calls_regex": "^NS$",
"runner": "DataflowRunner",
"assertion_configs": [
{
"query": ["NUM_ROWS_QUERY"],
"expected_result": {"num_rows": 4}
},
{
"query": ["SUM_START_QUERY"],
"expected_result": {"sum_start": 1283553}
},
{
"query": ["SUM_END_QUERY"],
"expected_result": {"sum_end": 1283560}
},
{
"query": [
"SELECT call.NS AS NS ",
"FROM {TABLE_NAME} AS t, t.call as call ",
"WHERE start_position = 14369 AND call.name ='NA00001'"
],
"expected_result": {"NS": 2}
},
{
"query": [
"SELECT call.NS AS NS ",
"FROM {TABLE_NAME} AS t, t.call as call ",
"WHERE start_position = 14369 AND call.name ='NA00002'"
],
"expected_result": {"NS": 2}
},
{
"query": [
"SELECT call.NS AS NS ",
"FROM {TABLE_NAME} AS t, t.call as call ",
"WHERE start_position = 14369 AND call.name ='NA00003'"
],
"expected_result": {"NS": 1}
},
{
"query": [
"SELECT call.NS AS NS ",
"FROM {TABLE_NAME} AS t, t.call as call ",
"WHERE start_position = 14369 AND call.name ='NA00004'"
],
"expected_result": {"NS": 1}
}
]
}
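
The pattern `^NS$` in this config is anchored at both ends, so it selects only the `NS` INFO key. The sketch below shows that selection with Python's `re` module; `keys_to_move` is a hypothetical helper, and whether the pipeline uses `match` or `search` is an assumption that does not matter for a fully anchored pattern.

```python
import re


def keys_to_move(info_keys, pattern):
    """Return the INFO keys whose names match the given regex."""
    regex = re.compile(pattern)
    return [key for key in info_keys if regex.match(key)]


# INFO keys declared in the merge*.vcf headers.
info_keys = ["NS", "DP", "AF", "AA", "DB", "H2"]

print(keys_to_move(info_keys, "^NS$"))            # ['NS']
print(keys_to_move(info_keys + ["NSX"], "^NS"))   # ['NS', 'NSX']
```

The second call uses a made-up key `NSX` to show why the trailing `$` matters: without it, any key that merely starts with `NS` would also be moved.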
@@ -0,0 +1,28 @@
{
"test_name": "merge-option-move-to-calls",
"table_name": "merge_option_move_to_calls",
"input_pattern": "gs://gcp-variant-transforms-testfiles/small_tests/merge/*.vcf",
"runner": "DataflowRunner",
"variant_merge_strategy": "MOVE_TO_CALLS",
"assertion_configs": [
{
"query": ["NUM_ROWS_QUERY"],
"expected_result": {"num_rows": 4}
},
{
"query": ["SUM_START_QUERY"],
"expected_result": {"sum_start": 1283553}
},
{
"query": ["SUM_END_QUERY"],
"expected_result": {"sum_end": 1283560}
},
{
"query": [
"SELECT COUNT(0) AS num_rows FROM {TABLE_NAME} ",
"WHERE start_position = 14369"
],
"expected_result": {"num_rows": 1}
Member:

For the same start_position, you can also count number of calls and verify it.

Contributor Author:

Do you mean SELECT COUNT(0) AS num_rows FROM {TABLE_NAME} AS t, t.call AS call WHERE start_position = 14369?
In this case, there is no difference with or without merge, i.e., the expected rows will be 3 for both cases.

Member:

Correct, they are the same. I meant that in combination with what you already have, i.e., first you are counting that there is only one row in the table with start_position = 14369 and then you verify that on that single row, there are three calls.

Not a big deal either way, so please feel free to submit as is, if you prefer.

Contributor Author:

Got it! Thanks for the details.

}
]
}
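
The `sum_start` and `sum_end` values in these configs can be reproduced by hand. The sketch below assumes a 0-based start (`POS - 1`) and an end equal to the start plus the length of REF; that convention is an assumption inferred from the queries (`start_position = 14369` for `POS = 14370`), and `expected_sums` is a hypothetical helper.

```python
def expected_sums(records):
    """Return (sum_start, sum_end) for (pos, ref) pairs with 1-based POS."""
    sum_start = sum(pos - 1 for pos, _ in records)
    sum_end = sum(pos - 1 + len(ref) for pos, ref in records)
    return sum_start, sum_end


# The four rows left after MOVE_TO_CALLS merges the three merge*.vcf files.
merged_rows = [(14370, "G"), (17290, "T"), (17330, "T"), (1234567, "GTCT")]
print(expected_sums(merged_rows))  # (1283553, 1283560)

# The seven rows when no merge strategy is applied.
unmerged_rows = [
    (14370, "G"), (17290, "T"), (1234567, "GTCT"),  # merge1.vcf
    (14370, "G"), (17330, "T"), (1234567, "GTCT"),  # merge2.vcf
    (14370, "G"),                                   # merge3.vcf
]
print(expected_sums(unmerged_rows))  # (2546857, 2546870)
```

Both results agree with the `expected_result` values in the `merge-option-move-to-calls` and `merge-option-none` configs.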
@@ -0,0 +1,27 @@
{
"test_name": "merge-option-none",
"table_name": "merge_option_none",
"input_pattern": "gs://gcp-variant-transforms-testfiles/small_tests/merge/*.vcf",
"runner": "DataflowRunner",
"assertion_configs": [
{
"query": ["NUM_ROWS_QUERY"],
"expected_result": {"num_rows": 7}
},
{
"query": ["SUM_START_QUERY"],
"expected_result": {"sum_start": 2546857}
},
{
"query": ["SUM_END_QUERY"],
"expected_result": {"sum_end": 2546870}
},
{
"query": [
"SELECT COUNT(0) AS num_rows FROM {TABLE_NAME} ",
"WHERE start_position = 14369"
],
"expected_result": {"num_rows": 3}
}
]
}
@@ -0,0 +1,21 @@
{
"test_name": "option-allow-malformed-records",
"table_name": "option_allow_malformed_records",
"input_pattern": "gs://gcp-variant-transforms-testfiles/small_tests/invalid-4.0-POS-empty.vcf",
"allow_malformed_records": true,
"runner": "DataflowRunner",
"assertion_configs": [
{
"query": ["NUM_ROWS_QUERY"],
"expected_result": {"num_rows": 4}
},
{
"query": ["SUM_START_QUERY"],
"expected_result": {"sum_start": 3592826}
},
{
"query": ["SUM_END_QUERY"],
"expected_result": {"sum_end": 3592833}
}
]
}