Skip to content

Conversation

@allieychen
Copy link
Contributor

@allieychen allieychen commented Mar 5, 2018

Add integration test for the following options:

  1. reference_names: tested on valid-4.2.vcf. Used "20 Y" as the reference_names.
  2. split_alternate_allele_info_fields: set this field to false and validate there is a column AF in BQ.
  3. allow_malformed_records: created a new file "invalid-4.0-POS-empty" based on "valid-4.0.vcf" by removing the POS value for the first entry.
  4. representative_header_file: created a new file "invalid-4.0-AF-field-removed.vcf" by removing "AF" field from "valid-4.0.vcf".
  5. merge_option_none: created a new folder containing three new .vcf files for merge test. By default, there is no merge. Verified that it successfully read three files in the folder and form one BQ Table.
  6. merge_option_move_to_calls: set variant_merge_stragegy to MOVE_TO_CALLS. Verified that the three files are merged on the keys.
  7. info_keys_to_move_to_calls_regex: set this field to be "NS" and verified thatthe call.NS value is copied from each VCF file.
  8. copy_quality_to_calls: set to true. Verified that call.quality is copied from each VCF file.
  9. copy_filter_to_calls: set to true. Verified that call.filter is copied from each VCF file.
  • All the created .vcf files are uploaded to the gcp-variant-transforms-test.
  • Integration tests are passed.

Issue: #106

@allieychen allieychen changed the title Add integration test for some options #106 Add integration tests for some options #106 Mar 5, 2018
@coveralls
Copy link

coveralls commented Mar 5, 2018

Pull Request Test Coverage Report for Build 447

  • 0 of 0 (NaN%) changed or added relevant lines in 0 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage remained the same at 90.252%

Totals Coverage Status
Change from base Build 446: 0.0%
Covered Lines: 3259
Relevant Lines: 3611

💛 - Coveralls

@allieychen allieychen force-pushed the add_integration_test_for_vt_options branch 3 times, most recently from 678c5a3 to 96f2563 Compare March 7, 2018 16:01
1. reference_names: tested on valid-4.2.vcf. Used "20 Y" as the
reference_names.
2. split_alternate_allele_info_fields: set this field to false and validate there is a column AF in BQ.
3. allow_malformed_records: created a new file "invalid-4.0-POS-empty" based on "valid-4.0.vcf" by removing the POS value for the first entry.
4. representative_header_file: created a new file
"invalid-4.0-AF-field-removed.vcf" by removing "AF" field from "valid-4.0.vcf".
5. merge_option_none: created a new folder containing three new .vcf
files for merge test. By default, there is no merge. Verified that it
successfully read three files in the folder and form one BQ Table.
6. merge_option_move_to_calls: set variant_merge_stragegy to
MOVE_TO_CALLS. Verified that the three files are merged on the keys.
7. info_keys_to_move_to_calls_regex: set this field to be "NS" and verified thatthe call.NS value is copied from each VCF file.
8. copy_quality_to_calls: set to true. Verified that call.quality is
copied from each VCF file.
9. copy_filter_to_calls: set to true. Verified that call.filter is
copied from each VCF file.

All the created .vcf files are uploaded to the gcp-variant-transforms-test.
Integration tests are passed.
@allieychen allieychen force-pushed the add_integration_test_for_vt_options branch from 96f2563 to 5f54d7d Compare March 8, 2018 15:35
Copy link
Contributor

@arostamianfar arostamianfar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, Allie! This looks great!
Could you please also add a short README.md file to the data/vcf folder that summarizes the contents and purpose of each file/folder?

"table_name": "merge_option_info_keys_to_move_to_calls_regex",
"input_pattern": "gs://gcp-variant-transforms-testfiles/small_tests/merge/*.vcf",
"variant_merge_strategy": "MOVE_TO_CALLS",
"info_keys_to_move_to_calls_regex": "NS",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: this should be ^NS$ as otherwise it would match anything that has "NS" as a substring.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

},
{
"query": [
"SELECT call.NS AS num_samples ",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: please keep this as "NS" (num_samples can get confused with num_rows or other inferred results).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@allieychen allieychen force-pushed the add_integration_test_for_vt_options branch from 018f6a0 to 704a0a0 Compare March 8, 2018 19:06
@allieychen
Copy link
Contributor Author

Thanks for the comment. PTAL.

@arostamianfar arostamianfar requested a review from bashir2 March 9, 2018 16:00
Copy link
Contributor

@arostamianfar arostamianfar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, Allie! Looks good. I think the readme needs a bit more love, so I'll let Bashir make comments as necessary.

Copy link
Member

@bashir2 bashir2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for adding these tests; some minor comments.


`invalid-4.0-AF-field-removed.vcf` is created by removing `AF` field definition
from the meta-information based on `valid-4.0.vcf`. It is used to test `AF`
field can be parsed correctly given a representative_header_file contains `AF`.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: "contains" -> "that contains" OR "containing"?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Corrected.


`invalid-4.0-POS-empty.vcf` is created based on `valid-4.0.vcf` by removing the
POS value for the first entry. It is used to test when `allow_malformed_records`
is enabled, failed VCF record reads will not raise errors and the BQ can still
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: "BQ can still generate" -> "BigQuery table can still be generated"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Corrected.

current folder.

`valid-4.0.vcf`, `valid-4.0.vcf.gz`, `valid-4.0.vcf.bz2` are used to test
version 4.0 uncompressed, gzip, bzip VCF file, respectively.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please clarify what do you mean by ver. 4.0, i.e., the VCF format version number. Maybe it is good to have a link to this page too.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@@ -0,0 +1,27 @@
This file summarizes the contents and the purpose for each files/folder within
current folder.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When you say "current folder", does it mean that you intend to copy valid-4.* files here as well? They are not copied in this PR, are they?

Copy link
Contributor Author

@allieychen allieychen Mar 12, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new .vcf files are added in the folder testing\data.vcf\, which already inlcudes valid-4.* files.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I see, that's my bad, I should have copied the valid-4.2_VEP.vcf test file to here as well.

The folder `merge` is created to test the merge options. Three .vcf files are
created. `merge1.vcf` contains two samples, while `merge2.vcf` and `merge3.vcf`
contain one other sample, respectively. When MERGE_TO_CALLS is selected, the
variant call where `POS = 14370` is meant to merge across three files, while the
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: "where" -> "with" (also in the next line)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. Corrected.

"WHERE start_position = 14369 AND call.name ='NA00001' ",
"AND 'q10' IN UNNEST (call.filter)"
],
"expected_result": {"num_rows": 1}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of having 4 queries returning "1" each, what do you think about having a single query and instead of COUNT, selects call.name with an ORDER BY at the end (to make the output unique); for the expected_results you then verity the list of call names, i.e., "NA*"?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For one thing, I think it is expected to have only one row in the query result. Another concern I have is that if we use call.name instead of some queries based on call.filter, the test will pass with/without --copy_filter_to_calls.

"query": [
"SELECT COUNT(0) AS num_rows ",
"FROM {TABLE_NAME} AS t, t.call as call ",
"WHERE start_position = 14369 AND call.name ='NA00001' ",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider verifying position 1234567 as well since you expect different number of calls.

Copy link
Contributor Author

@allieychen allieychen Mar 12, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not trying to test the number of calls. Instead, what I try to test is that we have a column call.filter in the BQ table when copy_filter_to_calls is set to true, and the value is copied from the original vcf file. Am I in the wrong direction?

Do you mean add those test cases for move_to_calls test?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for clarifying, I misunderstood what you are trying to do; and yes, both for this comment and the next one, I was thinking about testing MOVE_TO_CALLS merge strategy and checking that calls are merged. Up to you whether you want to test call names in that test or not, I am okay either way.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. I have added a test case for merge_option_move _to_calls to check that the calls are merged for one specific position. Meanwhile, I added a similar test case in merge_option_none to validate that the calls are not merged.

bashir2
bashir2 previously approved these changes Mar 13, 2018
Copy link
Member

@bashir2 bashir2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@@ -0,0 +1,27 @@
This file summarizes the contents and the purpose for each files/folder within
current folder.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I see, that's my bad, I should have copied the valid-4.2_VEP.vcf test file to here as well.

"query": [
"SELECT COUNT(0) AS num_rows ",
"FROM {TABLE_NAME} AS t, t.call as call ",
"WHERE start_position = 14369 AND call.name ='NA00001' ",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for clarifying, I misunderstood what you are trying to do; and yes, both for this comment and the next one, I was thinking about testing MOVE_TO_CALLS merge strategy and checking that calls are merged. Up to you whether you want to test call names in that test or not, I am okay either way.

@bashir2
Copy link
Member

bashir2 commented Mar 13, 2018

BTW, consider referencing Issue #106 in the PR description to indicate under that issue that this PR is for fixing that.

@allieychen allieychen force-pushed the add_integration_test_for_vt_options branch from 6600826 to 078a5f8 Compare March 13, 2018 14:11
@allieychen
Copy link
Contributor Author

PTAL :)

"SELECT COUNT(0) AS num_rows FROM {TABLE_NAME} ",
"WHERE start_position = 14369"
],
"expected_result": {"num_rows": 1}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the same start_position, you can also count number of calls and verify it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean SELECT COUNT(0) AS num_rows FROM {TABLE_NAME} AS t, t.call AS call WHERE start_position = 14369?
In this case, there is no difference with or without merge, i.e, the expected rows will be 3 for both cases.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct, they are the same. I meant that in combination with what you already have, i.e., first you are counting that there is only one row in the table with start_position = 14369 and then you verify that on that single row, there are three calls.

Not a big deal either way, so please feel free to submit as is, if you prefer.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it! Thanks for the details.

"SELECT COUNT(0) AS num_rows FROM {TABLE_NAME} ",
"WHERE start_position = 14369"
],
"expected_result": {"num_rows": 1}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct, they are the same. I meant that in combination with what you already have, i.e., first you are counting that there is only one row in the table with start_position = 14369 and then you verify that on that single row, there are three calls.

Not a big deal either way, so please feel free to submit as is, if you prefer.

@allieychen allieychen merged commit 019c418 into googlegenomics:master Mar 13, 2018
@allieychen allieychen deleted the add_integration_test_for_vt_options branch March 13, 2018 20:39
mhsaul pushed a commit to mhsaul/gcp-variant-transforms that referenced this pull request Mar 29, 2018
…mics#130)

Add integration test for the following options:

- reference_names: tested on valid-4.2.vcf. Used "20 Y" as the reference_names.
- split_alternate_allele_info_fields: set this field to false and validate there is a column AF in BQ.
- allow_malformed_records: created a new file "invalid-4.0-POS-empty" based on "valid-4.0.vcf" by removing the POS value for the first entry.
- representative_header_file: created a new file "invalid-4.0-AF-field-removed.vcf" by removing "AF" field from "valid-4.0.vcf".
- merge_option_none: created a new folder containing three new .vcf files for merge test. By default, there is no merge. Verified that it successfully read three files in the folder and form one BQ Table.
- merge_option_move_to_calls: set variant_merge_stragegy to MOVE_TO_CALLS. Verified that the three files are merged on the keys.
- info_keys_to_move_to_calls_regex: set this field to be "NS" and verified thatthe call.NS value is copied from each VCF file.
- copy_quality_to_calls: set to true. Verified that call.quality is copied from each VCF file.
- copy_filter_to_calls: set to true. Verified that call.filter is copied from each VCF file.

- All the created .vcf files are uploaded to the gcp-variant-transforms-test.
- Integration tests are passed.
Issue: googlegenomics#106
mhsaul pushed a commit to mhsaul/gcp-variant-transforms that referenced this pull request Mar 29, 2018
…mics#130)

Add integration test for the following options:

- reference_names: tested on valid-4.2.vcf. Used "20 Y" as the reference_names.
- split_alternate_allele_info_fields: set this field to false and validate there is a column AF in BQ.
- allow_malformed_records: created a new file "invalid-4.0-POS-empty" based on "valid-4.0.vcf" by removing the POS value for the first entry.
- representative_header_file: created a new file "invalid-4.0-AF-field-removed.vcf" by removing "AF" field from "valid-4.0.vcf".
- merge_option_none: created a new folder containing three new .vcf files for merge test. By default, there is no merge. Verified that it successfully read three files in the folder and form one BQ Table.
- merge_option_move_to_calls: set variant_merge_stragegy to MOVE_TO_CALLS. Verified that the three files are merged on the keys.
- info_keys_to_move_to_calls_regex: set this field to be "NS" and verified thatthe call.NS value is copied from each VCF file.
- copy_quality_to_calls: set to true. Verified that call.quality is copied from each VCF file.
- copy_filter_to_calls: set to true. Verified that call.filter is copied from each VCF file.

- All the created .vcf files are uploaded to the gcp-variant-transforms-test.
- Integration tests are passed.
Issue: googlegenomics#106
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants