**QC of ETL starting with GDC release 24 clinical tables**

This notebook covers QC of the clinical tables (data_category: clinical) for program **HCMI**.

The program has five clinical tables in this release, listed below:

- `isb-project-zero.GDC_Clinical_Data.rel24_clin_HCMI`

- `isb-project-zero.GDC_Clinical_Data.rel24_clin_HCMI_diag`

- `isb-project-zero.GDC_Clinical_Data.rel24_clin_HCMI_diag__treat`

- `isb-project-zero.GDC_Clinical_Data.rel24_clin_HCMI_follow`

- `isb-project-zero.GDC_Clinical_Data.rel24_clin_HCMI_follow__mol_test`




##QC table checklist 

QC checklist for programs with multiple one-to-many tables:

**1. Check schema**

Are all the fields labeled?

Is there a table description?

Do the field labels make sense for all fields?
    
Are the labels correct?

**2. Look at table row number and size**

Do these metrics make sense?

**3. Scroll through table manually**

See if anything stands out - empty columns, etc.

The BigQuery table search user interface is useful for this test. The test tier points to isb-etl-open.

[ISB-CGC BigQuery table search test tier](https://isb-cgc-test.appspot.com/bq_meta_search/)

Run a manual check in the console using the questions listed in step 1.

*Note from developer:
There are some columns which are sparsely populated (so they might look empty if you’re just scrolling through the table in the GUI), but there should be at least one non-null entry for every column in every table.*

**4. Number of case_id versus BigQuery metadata table**

**5. Check for any duplicate rows present in the table**

**6. Verify case_id count of table against master rel_clinical_data table**

##Reference material



*   [NextGenETL](https://github.com/isb-cgc/NextGenETL) GitHub repository
*   [ETL QC SOP draft](https://docs.google.com/document/d/1Wskf3BxJLkMjhIXD62B6_TG9h5KRcSp8jSAGqcCP1lQ/edit)

##Before you begin

You need to load the BigQuery module, authenticate yourself, create a client variable, and load the necessary libraries.


In [0]:
from google.colab import auth
try:
  auth.authenticate_user()
  print('You have been successfully authenticated!')
except Exception as err:  # report why authentication failed
  print('You have not been authenticated:', err)

You have been successfully authenticated!


In [0]:
from google.cloud import bigquery
try:
  project_id = 'isb-project-zero' # Replace with your own project ID
  client = bigquery.Client(project=project_id)
  print('BigQuery client successfully initialized')
except Exception as err:  # e.g. missing credentials or wrong project
  print('Failed to initialize BigQuery client:', err)

BigQuery client successfully initialized


In [0]:
#Install pypika to build a Query 
!pip install pypika
# Import from PyPika
from pypika import Query, Table, Field, Order

import pandas


## READY TO BEGIN TESTING

##Clin HCMI 

**Testing Full ID** `isb-project-zero.GDC_Clinical_Data.rel24_clin_HCMI`

[Table location](https://console.cloud.google.com/bigquery?authuser=1&folder=&organizationId=&project=isb-project-zero&p=isb-project-zero&d=GDC_Clinical_Data&t=rel24_clin_HCMI&page=table)

Source : GDC API

Release version : v24


###test 1 - schema verification

**1. Check schema**

Are all the fields labeled?

Is there a table description?

Do the field labels make sense for all fields?

Are the labels correct?

Google documentation column descriptions for [reference](https://cloud.google.com/bigquery/docs/information-schema-tables#column_field_paths_view).

Google documentation table options for [reference](https://cloud.google.com/bigquery/docs/information-schema-tables#options_table).

In [0]:
#return all table information for rel24_clin_HCMI

clin_table = Table('`isb-project-zero`.GDC_Clinical_Data.INFORMATION_SCHEMA.TABLES')
clin_query = Query.from_(clin_table) \
                  .select(' table_catalog, table_schema, table_name, table_type ') \
                  .where(clin_table.table_name=='rel24_clin_HCMI')

clin_query_clean = str(clin_query).replace('"', "")
clin = client.query(clin_query_clean).to_dataframe()
clin.head()

Unnamed: 0,table_catalog,table_schema,table_name,table_type
0,isb-project-zero,GDC_Clinical_Data,rel24_clin_HCMI,BASE TABLE


In [0]:
#return all table options for rel24_clin_HCMI

clin_table = Table('`isb-project-zero`.GDC_Clinical_Data.INFORMATION_SCHEMA.TABLE_OPTIONS')
clin_query = Query.from_(clin_table) \
                  .select(' table_name, option_name, option_type, option_value ') \
                  .where(clin_table.table_name=='rel24_clin_HCMI')

clin_query_clean = str(clin_query).replace('"', "")
clin = client.query(clin_query_clean).to_dataframe()

# No rows means no friendly name, description, or labels are set on the table.
if clin.empty:
    print('QC of friendly name, table description and labels --- FAILED')
else:
    for i in range(len(clin)):
        print(clin['option_name'][i] + '\n')
        print('\t' + clin['option_value'][i] + '\n')
        print('\t' + clin['option_type'][i] + '\n')

QC of friendly name, table description and labels --- FAILED


In [0]:
#check for empty schemas in dataset rel24_clin_HCMI

clin_table = Table('`isb-project-zero`.GDC_Clinical_Data.INFORMATION_SCHEMA.TABLE_OPTIONS')
clin_query = Query.from_(clin_table) \
                  .select(' table_name, option_name, option_type, option_value ') \
                  .where(clin_table.table_name=='rel24_clin_HCMI')

clin_query_clean = str(clin_query).replace('"', "")
clin = client.query(clin_query_clean).to_dataframe()
print("Are there any empty cells in the table schema?")
# clin.empty is True when the query returned no rows, i.e. no table options are set
clin.empty

Are there any empty cells in the table schema?


True
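
The check above only reports whether the options query returned any rows. A minimal sketch of a more explicit check follows, assuming the rows are passed in as a list of dicts such as `clin.to_dict('records')`. The helper name `missing_table_options` and the list of required options are illustrative assumptions, not part of the ETL code.

```python
# Hypothetical helper: the name and the required-option list are assumptions.
REQUIRED_OPTIONS = ('friendly_name', 'description', 'labels')


def missing_table_options(rows, required=REQUIRED_OPTIONS):
    """Return the required option names that are absent or blank in `rows`.

    `rows` is a list of dicts with 'option_name' and 'option_value' keys,
    for example clin.to_dict('records') from the TABLE_OPTIONS query.
    """
    present = {
        row['option_name']
        for row in rows
        if row.get('option_value') not in (None, '')
    }
    return [name for name in required if name not in present]


# An empty result set is missing every required option.
print(missing_table_options([]))

# A table that only has a description set (made-up value).
only_description = [
    {'option_name': 'description', 'option_value': '"Example description"'},
]
print(missing_table_options(only_description))
```

Listing the missing options by name makes it easier to report exactly which piece of table metadata still needs to be filled in.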

Example of the field descriptions pulled from the schema:


In [0]:
#list of field descriptions for table rel24_clin_HCMI

clin_table = Table('`isb-project-zero`.GDC_Clinical_Data.INFORMATION_SCHEMA.COLUMN_FIELD_PATHS')
clin_query = Query.from_(clin_table) \
                  .select('table_name, column_name, description') \
                  .where(clin_table.table_name=='rel24_clin_HCMI')

clin_query_clean = str(clin_query).replace('"', "")
clin = client.query(clin_query_clean).to_dataframe()


for i in range(len(clin)):
  print(clin['table_name'][i] + '\n')
  print('\t' + clin['column_name'][i] + '\n')
  print('\t' + clin['description'][i] + '\n')

rel24_clin_HCMI

	submitter_id

	

rel24_clin_HCMI

	case_id

	

rel24_clin_HCMI

	diag__count

	Total child record count (located in cases table).

rel24_clin_HCMI

	follow__count

	Total child record count (located in cases table).

rel24_clin_HCMI

	primary_site

	

rel24_clin_HCMI

	disease_type

	

rel24_clin_HCMI

	index_date

	

rel24_clin_HCMI

	proj__name

	Display name for the project

rel24_clin_HCMI

	proj__project_id

	

rel24_clin_HCMI

	demo__demographic_id

	

rel24_clin_HCMI

	demo__gender

	Text designations that identify gender. Gender is described as the assemblage of properties that distinguish people on the basis of their societal roles. [Explanatory Comment 1: Identification of gender is based upon self-report and may come from a form, questionnaire, interview, etc.]

rel24_clin_HCMI

	demo__race

	An arbitrary classification of a taxonomic group that is a division of a species. It usually arises as a consequence of geographical isolation within a species and is 

In [0]:
# check for empty schemas in dataset rel24_clin_HCMI

clin_table = Table('`isb-project-zero`.GDC_Clinical_Data.INFORMATION_SCHEMA.COLUMN_FIELD_PATHS')
clin_query = Query.from_(clin_table) \
                  .select('table_name, column_name, description') \
                  .where(clin_table.table_name=='rel24_clin_HCMI')

clin_query_clean = str(clin_query).replace('"', "")
clin = client.query(clin_query_clean).to_dataframe()
print("Are there any empty cells in the table schema?")
print(clin)

Are there any empty cells in the table schema?
         table_name  ...                                        description
0   rel24_clin_HCMI  ...                                                   
1   rel24_clin_HCMI  ...                                                   
2   rel24_clin_HCMI  ...  Total child record count (located in cases tab...
3   rel24_clin_HCMI  ...  Total child record count (located in cases tab...
4   rel24_clin_HCMI  ...                                                   
5   rel24_clin_HCMI  ...                                                   
6   rel24_clin_HCMI  ...                                                   
7   rel24_clin_HCMI  ...                       Display name for the project
8   rel24_clin_HCMI  ...                                                   
9   rel24_clin_HCMI  ...                                                   
10  rel24_clin_HCMI  ...  Text designations that identify gender. Gender...
11  rel24_clin_HCMI  ...  An arbitrary cl
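
Scanning the printed frame for blank descriptions is error-prone because the output is truncated. The sketch below shows one way to list the columns without a description, assuming the rows are supplied as dicts (for example `clin.to_dict('records')`). The function name `columns_missing_description` is hypothetical, and the sample rows are a small subset copied from the output above.

```python
def columns_missing_description(rows):
    """Return column names whose description is None or blank.

    `rows` is an iterable of dicts with 'column_name' and 'description'
    keys, e.g. clin.to_dict('records') from the COLUMN_FIELD_PATHS query.
    """
    missing = []
    for row in rows:
        description = row.get('description')
        if description is None or not str(description).strip():
            missing.append(row['column_name'])
    return missing


# A few rows copied from the output above, for illustration only.
sample_rows = [
    {'column_name': 'submitter_id', 'description': ''},
    {'column_name': 'diag__count',
     'description': 'Total child record count (located in cases table).'},
    {'column_name': 'proj__name',
     'description': 'Display name for the project'},
]
print(columns_missing_description(sample_rows))
```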

###test 2 - row number verification

**2. Look at table row number and size**

Do these metrics make sense?

In [0]:
%%bigquery --project isb-project-zero
SELECT COUNT(submitter_id)
FROM `isb-project-zero.GDC_Clinical_Data.rel24_clin_HCMI`

Unnamed: 0,f0_
0,23


In [0]:
%%bigquery --project isb-project-zero
SELECT COUNT(case_id)
FROM `isb-project-zero.GDC_Clinical_Data.rel23_clin_HCMI`

Unnamed: 0,f0_
0,23


In [0]:
%%bigquery --project isb-project-zero
SELECT *
FROM `isb-project-zero.GDC_Clinical_Data.rel24_clin_HCMI`

Unnamed: 0,submitter_id,case_id,diag__count,follow__count,primary_site,disease_type,index_date,proj__name,proj__project_id,demo__demographic_id,demo__gender,demo__race,demo__ethnicity,demo__vital_status,demo__days_to_birth,demo__year_of_birth,demo__age_at_index,demo__days_to_death,demo__cause_of_death,demo__state,demo__created_datetime,demo__updated_datetime,expose__exposure_id,expose__tobacco_smoking_status,expose__tobacco_smoking_quit_year,expose__pack_years_smoked,expose__asbestos_exposure,expose__radon_exposure,expose__alcohol_intensity,expose__state,expose__created_datetime,expose__updated_datetime,fam_hist__family_history_id,fam_hist__relative_with_cancer_history,fam_hist__relationship_primary_diagnosis,fam_hist__state,fam_hist__created_datetime,fam_hist__updated_datetime,state,created_datetime,updated_datetime
0,HCM-CSHL-0147-C24,c992b973-299c-49d0-b5b5-3bdabb7ef575,1,3,Other and unspecified parts of biliary tract,Adenomas and Adenocarcinomas,Diagnosis,NCI Cancer Model Development for the Human Can...,HCMI-CMDC,9fdf8886-15e1-4b79-ac1a-5ad5ececb468,male,Unknown,hispanic or latino,Dead,,1946,,251.0,Not Reported,released,2019-04-03T17:50:51.184182-05:00,2020-02-11T15:25:23.744984-06:00,9dfb8f71-3af2-4001-8c53-08014aced656,3,,,,,,released,2019-04-03T17:50:51.184182-05:00,2020-01-13T20:10:25.832065-06:00,9c9a0f73-501e-4766-9b80-0bf15035323a,unknown,,released,2019-04-03T17:50:51.184182-05:00,2020-01-13T20:10:25.832065-06:00,released,2019-02-19T09:29:24.992069-06:00,2020-01-13T20:10:25.832065-06:00
1,HCM-BROD-0005-C41,ddf7c939-5278-4034-b861-36717e11695d,1,5,"Bones, joints and articular cartilage of other...",Miscellaneous Bone Tumors,Diagnosis,NCI Cancer Model Development for the Human Can...,HCMI-CMDC,ecfcba7a-163c-4f6a-b382-68d21642df15,male,white,not hispanic or latino,Dead,,2007,,1218.0,Cancer Related,released,2019-04-03T17:56:08.557154-05:00,2020-02-11T13:29:04.347635-06:00,5196a487-a0f5-402e-9c73-65274ce0d069,1,,,,,,released,2019-04-03T17:56:08.557154-05:00,2020-01-13T20:10:25.832065-06:00,c50fb078-9460-4988-994c-7d106e267827,unknown,,released,2019-04-03T17:56:08.557154-05:00,2020-01-13T20:10:25.832065-06:00,released,2019-04-03T17:56:08.557154-05:00,2020-01-13T20:10:25.832065-06:00
2,HCM-BROD-0036-C41,307c905f-c276-4501-988a-b083f9462a98,1,4,"Bones, joints and articular cartilage of other...",Miscellaneous Bone Tumors,Diagnosis,NCI Cancer Model Development for the Human Can...,HCMI-CMDC,afba7d8a-b15e-422e-ad2c-f7e8810a0f20,male,white,Unknown,Dead,,2002,,4687.0,Cancer Related,released,2019-04-03T17:59:55.017435-05:00,2020-02-11T15:09:26.895929-06:00,eeead869-d75a-453d-8d25-ca78203ad659,1,,,,,,released,2019-04-03T17:59:55.017435-05:00,2020-01-13T20:10:25.832065-06:00,c17736df-c8c9-47c0-becd-ebc2388897bd,unknown,,released,2019-04-03T17:59:55.017435-05:00,2020-01-13T20:10:25.832065-06:00,released,2019-04-03T17:59:55.017435-05:00,2020-01-13T20:10:25.832065-06:00
3,HCM-BROD-0002-C71,3092d72b-75b1-4ae2-ac38-d4c1cd377e4c,1,5,Brain,Gliomas,Diagnosis,NCI Cancer Model Development for the Human Can...,HCMI-CMDC,ac8d893e-3408-44a6-b177-a184f3daaf7d,male,white,not hispanic or latino,Dead,,1947,24213.0,639.0,Cancer Related,released,2019-03-12T10:12:01.634698-05:00,2020-01-13T20:10:25.832065-06:00,ee40720e-9e5d-49e3-9e14-2e148b422b60,1,,,,,,released,2019-03-12T10:12:01.634698-05:00,2020-01-13T20:10:25.832065-06:00,3a8b9e2f-5141-4bd0-b504-ed7fea5be348,no,,released,2019-03-12T10:12:01.634698-05:00,2020-01-13T20:10:25.832065-06:00,released,2018-10-02T15:54:28.009245-05:00,2020-01-13T20:10:25.832065-06:00
4,HCM-BROD-0003-C71,c811d6dd-992f-435a-80ec-b282a2e38aad,1,5,Brain,Gliomas,Diagnosis,NCI Cancer Model Development for the Human Can...,HCMI-CMDC,e63259e2-9043-47d2-9762-83e48398920c,female,white,not hispanic or latino,Dead,,1933,,364.0,Cancer Related,released,2019-04-04T14:40:31.851849-05:00,2020-02-11T13:26:13.297569-06:00,a2a69f51-6800-4885-a500-78cd815e01c3,1,,,,,,released,2019-04-04T14:40:31.851849-05:00,2020-01-13T20:10:25.832065-06:00,1fe51000-8a39-4d5b-b94e-c4e754f268aa,no,,released,2019-04-04T14:40:31.851849-05:00,2020-01-13T20:10:25.832065-06:00,released,2019-02-19T09:25:48.459960-06:00,2020-01-13T20:10:25.832065-06:00
5,HCM-BROD-0011-C71,69eced5b-1e76-45c9-bc9c-2aa71a921c57,1,5,Brain,Gliomas,Diagnosis,NCI Cancer Model Development for the Human Can...,HCMI-CMDC,9dc17f6d-720e-4bf8-8291-6e3060a09f8b,male,white,not hispanic or latino,Dead,,1961,,373.0,Cancer Related,released,2019-03-12T10:13:02.718709-05:00,2020-02-27T12:07:18.437040-06:00,c75fe65a-a362-4727-bd76-5d9ca9276621,3,,,,,,released,2019-03-12T10:13:02.718709-05:00,2019-05-07T12:14:28.332870-05:00,a9c1388d-69cd-40f4-bb3b-e9fe55a25a27,yes,Not Reported,released,2019-03-12T10:13:02.718709-05:00,2019-05-07T12:14:28.332870-05:00,released,2018-10-02T15:54:41.566315-05:00,2019-09-20T15:18:06.383154-05:00
6,HCM-BROD-0012-C71,0f89e089-4a1d-4e66-8537-502a788cfe75,1,5,Brain,Gliomas,Diagnosis,NCI Cancer Model Development for the Human Can...,HCMI-CMDC,33d9f61f-856c-4816-9cc4-6303d2afa4d2,female,white,not hispanic or latino,Dead,,1959,,420.0,Cancer Related,released,2019-03-12T10:15:03.937781-05:00,2020-02-11T14:18:02.138622-06:00,306ea86d-1235-47ec-ae54-5f005a9459de,2,,,,,,released,2019-03-12T10:15:03.937781-05:00,2019-04-29T14:45:04.583966-05:00,d746ede6-e157-4861-9c58-ff137b1b34dc,yes,Not Reported,released,2019-03-12T10:15:03.937781-05:00,2019-04-29T14:45:04.583966-05:00,released,2018-10-02T15:54:54.934990-05:00,2019-09-20T15:18:11.278204-05:00
7,HCM-BROD-0014-C71,5e5684d2-14b0-4823-9956-d69aea6d6beb,1,5,Brain,Gliomas,Diagnosis,NCI Cancer Model Development for the Human Can...,HCMI-CMDC,076eaf08-962f-4225-9af1-572328c3c452,male,white,not hispanic or latino,Dead,,1948,,134.0,Unknown,released,2019-03-12T10:15:41.981178-05:00,2020-02-11T14:19:56.148445-06:00,e8e8fb19-a877-4cc5-a446-6e2a02d92f44,7,,,,,,released,2019-03-12T10:15:41.981178-05:00,2020-01-13T20:10:25.832065-06:00,c525b839-b54d-476a-9f68-30b141c705c8,unknown,,released,2019-03-12T10:15:41.981178-05:00,2020-01-13T20:10:25.832065-06:00,released,2019-02-19T09:26:04.631865-06:00,2020-01-13T20:10:25.832065-06:00
8,HCM-BROD-0028-C71,6375cc59-6cd1-4b1e-8de1-510841c3bebd,1,5,Brain,Gliomas,Diagnosis,NCI Cancer Model Development for the Human Can...,HCMI-CMDC,8acdfe1d-3b7a-4be7-984f-6aa3ba6cc285,female,white,not hispanic or latino,Dead,,1956,,918.0,Cancer Related,released,2019-03-12T10:13:49.767516-05:00,2020-02-11T14:24:50.202554-06:00,9fcea46c-3b18-4931-8234-d66185dce34d,1,,,,,,released,2019-03-12T10:13:49.767516-05:00,2020-01-13T20:10:25.832065-06:00,b3ffb675-213f-42f8-bda1-304da3c5cedf,yes,Not Reported,released,2019-03-12T10:13:49.767516-05:00,2020-01-13T20:10:25.832065-06:00,released,2019-02-19T09:26:17.820896-06:00,2020-01-13T20:10:25.832065-06:00
9,HCM-BROD-0047-C71,babbc3cb-14b7-4894-b91d-c160f4cb48e0,1,5,Brain,Gliomas,Diagnosis,NCI Cancer Model Development for the Human Can...,HCMI-CMDC,2ffabae0-0229-4d9d-8e28-80686fd458b3,male,asian,Unknown,Dead,,1953,,495.0,Cancer Related,released,2019-03-12T10:14:29.676556-05:00,2020-02-11T15:10:28.186938-06:00,d192666e-c5dd-4b8e-b5a3-878334ed08e2,7,,,,,,released,2019-03-12T10:14:29.676556-05:00,2020-01-13T20:10:25.832065-06:00,429b6706-fb74-4720-8f96-b58ce5d873cb,unknown,,released,2019-03-12T10:14:29.676556-05:00,2020-01-13T20:10:25.832065-06:00,released,2019-02-19T09:26:31.634698-06:00,2020-01-13T20:10:25.832065-06:00


###test 3 - manual verification

**3. Scroll through table manually**

See if anything stands out - empty columns, etc.

The BigQuery table search user interface is useful for this test. The test tier points to isb-etl-open.

ISB-CGC BigQuery table search [test tier](https://isb-cgc-test.appspot.com/bq_meta_search/).

BigQuery console [isb-project-zero](https://console.cloud.google.com/bigquery?authuser=1&folder=&organizationId=&project=isb-project-zero&p=isb-project-zero&d=GDC_Clinical_Data&t=rel24_clin_HCMI&page=table).

Run a manual check in the console using the questions from step 1:

Are all the fields labeled?

Is there a table description?

Do the field labels make sense for all fields?
    
Are the labels correct?

*Note from developer:
There are some columns which are sparsely populated (so they might look empty if you’re just scrolling through the table in the GUI), but there should be at least one non-null entry for every column in every table.*
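
The developer note implies a rule that can be checked programmatically: every column should contain at least one non-null entry. A rough sketch of such a check is shown here, assuming the table rows are available as dicts (for example from `DataFrame.to_dict('records')`). The helper names and the sample column names (`sparse_col`, `empty_col`) are made up for illustration.

```python
import math


def is_null(value):
    """Treat None and float NaN (how pandas represents NULL) as null."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def all_null_columns(rows):
    """Return the columns that have no non-null value in any row."""
    columns = []       # column names in first-seen order
    has_value = set()  # columns with at least one non-null entry
    for row in rows:
        for name, value in row.items():
            if name not in columns:
                columns.append(name)
            if not is_null(value):
                has_value.add(name)
    return [name for name in columns if name not in has_value]


# Made-up rows: sparse_col has one value, empty_col has none.
sample_rows = [
    {'case_id': 'case-a', 'sparse_col': None, 'empty_col': None},
    {'case_id': 'case-b', 'sparse_col': 24213.0, 'empty_col': float('nan')},
    {'case_id': 'case-c', 'sparse_col': None, 'empty_col': None},
]
print(all_null_columns(sample_rows))
```

A sparsely populated column passes this check, while a column that is null in every row is reported, which is the distinction that is hard to make by scrolling.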

###test 4 - case_gdc_id file metadata table count verification

**4. Number of case_id versus BigQuery metadata table**



In [0]:
# clinical case_id counts table results below

# The query below displays the number of cases present in this table.

clin_table = Table('`isb-project-zero.GDC_Clinical_Data.rel24_clin_HCMI`')
clin_query = Query.from_(clin_table) \
                  .select(' DISTINCT case_id, count(*) as count') \
                  .groupby('case_id')

clin_query_clean = str(clin_query).replace('"', "")
#print(clin_query_clean)
clin = client.query(clin_query_clean).to_dataframe()
print('number of case from case_id = ' + str(len(clin.index)))


number of case from case_id = 23


In [0]:
%%bigquery --project isb-project-zero
-- GDC file metadata table case_gdc_id count for clinical below
SELECT case_gdc_id, program_name
FROM `isb-project-zero.GDC_metadata.rel24_caseData`
where program_name = 'HCMI'
group by case_gdc_id, program_name

Unnamed: 0,case_gdc_id,program_name
0,c5e9a845-e0ff-4612-aa8e-c310098503a7,HCMI
1,9e615018-8669-4dd6-8265-25e766a34dd0,HCMI
2,d7fc5874-8ae0-4357-9d8d-01af39ee521f,HCMI
3,5e5684d2-14b0-4823-9956-d69aea6d6beb,HCMI
4,56c07b06-c6d3-4c03-9e57-7be636e7cc5c,HCMI
5,6375cc59-6cd1-4b1e-8de1-510841c3bebd,HCMI
6,e00914a1-997e-44ba-9c88-a5e8741b4ee1,HCMI
7,d000dcfe-181c-46fe-93ba-5fce8336b52d,HCMI
8,3b797720-33ef-45e0-9830-5f6007f1d5a7,HCMI
9,6796ddd8-0295-4965-911c-d4ad67232d13,HCMI


###match

In [0]:
%%bigquery --project isb-project-zero

SELECT distinct case_id, count(case_id) as count
FROM `isb-project-zero.GDC_metadata.rel24_fileData_current` as active, `isb-project-zero.GDC_Clinical_Data.rel24_clin_HCMI` as clinical
WHERE program_name = 'HCMI'
AND active.case_gdc_id = clinical.case_id
group by case_id
order by count

Unnamed: 0,case_id,count
0,d7fc5874-8ae0-4357-9d8d-01af39ee521f,55
1,c5e9a845-e0ff-4612-aa8e-c310098503a7,55
2,9e615018-8669-4dd6-8265-25e766a34dd0,55
3,5e5684d2-14b0-4823-9956-d69aea6d6beb,56
4,6375cc59-6cd1-4b1e-8de1-510841c3bebd,56
5,3b797720-33ef-45e0-9830-5f6007f1d5a7,56
6,56c07b06-c6d3-4c03-9e57-7be636e7cc5c,56
7,d000dcfe-181c-46fe-93ba-5fce8336b52d,56
8,e00914a1-997e-44ba-9c88-a5e8741b4ee1,56
9,6796ddd8-0295-4965-911c-d4ad67232d13,56
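
Comparing the two result sets by eye becomes impractical as the number of cases grows. One possible way to make the comparison explicit is a set difference on the IDs, sketched below. The function name `compare_case_ids` and the short placeholder IDs are hypothetical; in practice the inputs could be the `case_id` and `case_gdc_id` columns of the two query results.

```python
def compare_case_ids(clinical_ids, metadata_ids):
    """Summarize how two collections of case IDs overlap."""
    clinical = set(clinical_ids)
    metadata = set(metadata_ids)
    return {
        'only_in_clinical': sorted(clinical - metadata),
        'only_in_metadata': sorted(metadata - clinical),
        'in_both': len(clinical & metadata),
    }


# Placeholder IDs for illustration; real values are GDC case UUIDs.
clinical_ids = ['case-a', 'case-b', 'case-c']
metadata_ids = ['case-b', 'case-c', 'case-d']
result = compare_case_ids(clinical_ids, metadata_ids)
print(result)
```

The tables match on case IDs when both `only_in_clinical` and `only_in_metadata` are empty.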


###test 5 - duplication verification

**5. Check for any duplicate rows present in the table**


In [0]:
%%bigquery --project isb-project-zero

SELECT count(case_id) AS count
FROM `isb-project-zero.GDC_Clinical_Data.rel24_clin_HCMI`
group by submitter_id, case_id, diag__count, follow__count, primary_site, disease_type, index_date, proj__name, proj__project_id, demo__demographic_id, demo__gender, demo__race, demo__ethnicity, demo__vital_status, demo__days_to_birth, demo__year_of_birth, demo__age_at_index, demo__days_to_death, demo__cause_of_death, demo__state, demo__created_datetime, demo__updated_datetime, expose__exposure_id, expose__tobacco_smoking_status, expose__tobacco_smoking_quit_year, expose__pack_years_smoked, expose__asbestos_exposure, expose__radon_exposure, expose__alcohol_intensity, expose__state, expose__created_datetime, expose__updated_datetime, fam_hist__family_history_id, fam_hist__relative_with_cancer_history, fam_hist__relationship_primary_diagnosis, fam_hist__state, fam_hist__created_datetime, fam_hist__updated_datetime, state, created_datetime, updated_datetime
ORDER BY count DESC
LIMIT 10

Unnamed: 0,count
0,1
1,1
2,1
3,1
4,1
5,1
6,1
7,1
8,1
9,1
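
Typing every column into the `GROUP BY` clause by hand is tedious and easy to get wrong. The sketch below builds an equivalent query from a list of column names. It is an illustration under assumptions: the function name `build_duplicate_check_sql` is hypothetical, and the added `HAVING COUNT(*) > 1` filter is a variation on the cell above so that an empty result means no duplicate rows.

```python
def build_duplicate_check_sql(table_id, columns, limit=10):
    """Build a query that returns only groups of fully identical rows."""
    column_list = ', '.join(columns)
    return (
        'SELECT COUNT(*) AS count\n'
        f'FROM `{table_id}`\n'
        f'GROUP BY {column_list}\n'
        'HAVING COUNT(*) > 1\n'
        'ORDER BY count DESC\n'
        f'LIMIT {limit}'
    )


# Only the first few columns are listed here to keep the example short.
sql = build_duplicate_check_sql(
    'isb-project-zero.GDC_Clinical_Data.rel24_clin_HCMI',
    ['submitter_id', 'case_id', 'diag__count', 'follow__count'],
)
print(sql)
```

The full column list could be taken from the table schema, for example `[field.name for field in client.get_table(table_id).schema]`, and the resulting string passed to `client.query(sql).to_dataframe()` as in the earlier cells.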


###test 6 - case_id master clinical data table count verification

**6. Verify case_id count of table against master rel_clinical_data table**

In [0]:
%%bigquery --project isb-project-zero
-- case_id count from the program HCMI clinical table

select distinct case_id, count(case_id) as count
from `isb-project-zero.GDC_Clinical_Data.rel24_clin_HCMI` 
group by case_id
order by count

Unnamed: 0,case_id,count
0,c992b973-299c-49d0-b5b5-3bdabb7ef575,1
1,ddf7c939-5278-4034-b861-36717e11695d,1
2,307c905f-c276-4501-988a-b083f9462a98,1
3,3092d72b-75b1-4ae2-ac38-d4c1cd377e4c,1
4,c811d6dd-992f-435a-80ec-b282a2e38aad,1
5,69eced5b-1e76-45c9-bc9c-2aa71a921c57,1
6,0f89e089-4a1d-4e66-8537-502a788cfe75,1
7,5e5684d2-14b0-4823-9956-d69aea6d6beb,1
8,6375cc59-6cd1-4b1e-8de1-510841c3bebd,1
9,babbc3cb-14b7-4894-b91d-c160f4cb48e0,1


In [0]:
%%bigquery --project isb-project-zero
-- case_id count from the master clinical table

SELECT distinct case_id, count(case_id) as count
FROM `isb-project-zero.GDC_metadata.rel24_fileData_current` as active, `isb-project-zero.GDC_Clinical_Data.rel24_clin_HCMI` as clinical
WHERE program_name = 'HCMI'
AND active.case_gdc_id = clinical.case_id
group by case_id
order by count


Unnamed: 0,case_id,count
0,d7fc5874-8ae0-4357-9d8d-01af39ee521f,55
1,c5e9a845-e0ff-4612-aa8e-c310098503a7,55
2,9e615018-8669-4dd6-8265-25e766a34dd0,55
3,5e5684d2-14b0-4823-9956-d69aea6d6beb,56
4,6375cc59-6cd1-4b1e-8de1-510841c3bebd,56
5,3b797720-33ef-45e0-9830-5f6007f1d5a7,56
6,56c07b06-c6d3-4c03-9e57-7be636e7cc5c,56
7,d000dcfe-181c-46fe-93ba-5fce8336b52d,56
8,e00914a1-997e-44ba-9c88-a5e8741b4ee1,56
9,6796ddd8-0295-4965-911c-d4ad67232d13,56


###match

##Clin HCMI_diag

**Testing Full ID** `isb-project-zero.GDC_Clinical_Data.rel24_clin_HCMI_diag`

[Table location](https://console.cloud.google.com/bigquery?authuser=1&folder=&organizationId=&project=isb-project-zero&p=isb-project-zero&d=GDC_Clinical_Data&t=rel24_clin_HCMI_diag&page=table)

Source : GDC API

Release version : v24


###test 1 - schema verification

**1. Check schema**

Are all the fields labeled?

Is there a table description?

Do the field labels make sense for all fields?

Are the labels correct?

Google documentation column descriptions for [reference](https://cloud.google.com/bigquery/docs/information-schema-tables#column_field_paths_view).

Google documentation table options for [reference](https://cloud.google.com/bigquery/docs/information-schema-tables#options_table).

In [0]:
#return all table information for rel24_clin_HCMI_diag

clin_table = Table('`isb-project-zero`.GDC_Clinical_Data.INFORMATION_SCHEMA.TABLES')
clin_query = Query.from_(clin_table) \
                  .select(' table_catalog, table_schema, table_name, table_type ') \
                  .where(clin_table.table_name=='rel24_clin_HCMI_diag')

clin_query_clean = str(clin_query).replace('"', "")
clin = client.query(clin_query_clean).to_dataframe()
clin.head()

Unnamed: 0,table_catalog,table_schema,table_name,table_type
0,isb-project-zero,GDC_Clinical_Data,rel24_clin_HCMI_diag,BASE TABLE


In [0]:
#return all table options for rel24_clin_HCMI_diag

clin_table = Table('`isb-project-zero`.GDC_Clinical_Data.INFORMATION_SCHEMA.TABLE_OPTIONS')
clin_query = Query.from_(clin_table) \
                  .select(' table_name, option_name, option_type, option_value ') \
                  .where(clin_table.table_name=='rel24_clin_HCMI_diag')

clin_query_clean = str(clin_query).replace('"', "")
clin = client.query(clin_query_clean).to_dataframe()

# No rows means no friendly name, description, or labels are set on the table.
if clin.empty:
    print('QC of friendly name, table description and labels --- FAILED')
else:
    for i in range(len(clin)):
        print(clin['option_name'][i] + '\n')
        print('\t' + clin['option_value'][i] + '\n')
        print('\t' + clin['option_type'][i] + '\n')

QC of friendly name, table description and labels --- FAILED


In [0]:
#check for empty schemas in dataset rel24_clin_HCMI_diag

clin_table = Table('`isb-project-zero`.GDC_Clinical_Data.INFORMATION_SCHEMA.TABLE_OPTIONS')
clin_query = Query.from_(clin_table) \
                  .select(' table_name, option_name, option_type, option_value ') \
                  .where(clin_table.table_name=='rel24_clin_HCMI_diag')

clin_query_clean = str(clin_query).replace('"', "")
clin = client.query(clin_query_clean).to_dataframe()
print("Are there any empty cells in the table schema?")
# clin.empty is True when the query returned no rows, i.e. no table options are set
clin.empty

Are there any empty cells in the table schema?


True

Example of the field descriptions pulled from the schema:


In [0]:
#list of field descriptions for table rel24_clin_HCMI_diag

clin_table = Table('`isb-project-zero`.GDC_Clinical_Data.INFORMATION_SCHEMA.COLUMN_FIELD_PATHS')
clin_query = Query.from_(clin_table) \
                  .select('table_name, column_name, description') \
                  .where(clin_table.table_name=='rel24_clin_HCMI_diag')

clin_query_clean = str(clin_query).replace('"', "")
clin = client.query(clin_query_clean).to_dataframe()


for i in range(len(clin)):
  print(clin['table_name'][i] + '\n')
  print('\t' + clin['column_name'][i] + '\n')
  print('\t' + clin['description'][i] + '\n')

rel24_clin_HCMI_diag

	diag__diagnosis_id

	Reference to ancestor diag__diagnosis_id, located in rel24_clin_HCMI_diag.

rel24_clin_HCMI_diag

	case_id

	Reference to ancestor case_id, located in rel24_clin_HCMI.

rel24_clin_HCMI_diag

	diag__treat__count

	Total child record count (located in cases.diagnoses table).

rel24_clin_HCMI_diag

	diag__primary_diagnosis

	Text term used to describe the patient's histologic diagnosis, as described by the World Health Organization's (WHO) International Classification of Diseases for Oncology (ICD-O).

rel24_clin_HCMI_diag

	diag__days_to_last_known_disease_status

	Time interval from the date of last follow up to the date of initial pathologic diagnosis, represented as a calculated number of days.

rel24_clin_HCMI_diag

	diag__perineural_invasion_present

	a yes/no indicator to ask if perineural invasion or infiltration of tumor or cancer is present.

rel24_clin_HCMI_diag

	diag__peripancreatic_lymph_nodes_tested

	The total number of peripancr

In [0]:
# check for empty schemas in dataset rel24_clin_HCMI_diag

clin_table = Table('`isb-project-zero`.GDC_Clinical_Data.INFORMATION_SCHEMA.COLUMN_FIELD_PATHS')
clin_query = Query.from_(clin_table) \
                  .select('table_name, column_name, description') \
                  .where(clin_table.table_name=='rel24_clin_HCMI_diag')

clin_query_clean = str(clin_query).replace('"', "")
clin = client.query(clin_query_clean).to_dataframe()
print(clin)

              table_name  ...                                        description
0   rel24_clin_HCMI_diag  ...  Reference to ancestor diag__diagnosis_id, loca...
1   rel24_clin_HCMI_diag  ...  Reference to ancestor case_id, located in rel2...
2   rel24_clin_HCMI_diag  ...  Total child record count (located in cases.dia...
3   rel24_clin_HCMI_diag  ...  Text term used to describe the patient's histo...
4   rel24_clin_HCMI_diag  ...  Time interval from the date of last follow up ...
5   rel24_clin_HCMI_diag  ...  a yes/no indicator to ask if perineural invasi...
6   rel24_clin_HCMI_diag  ...  The total number of peripancreatic lymph nodes...
7   rel24_clin_HCMI_diag  ...  The yes/no/unknown indicator used to describe ...
8   rel24_clin_HCMI_diag  ...  Number of days between the date used for index...
9   rel24_clin_HCMI_diag  ...  Yes/No/Unknown indicator to identify whether a...
10  rel24_clin_HCMI_diag  ...  Code to represent the defined absence or prese...
11  rel24_clin_HCMI_diag  ..

###test 2 - row number verification

**2. Look at table row number and size**

Do these metrics make sense?

In [0]:
%%bigquery --project isb-project-zero
SELECT COUNT(case_id)
FROM `isb-project-zero.GDC_Clinical_Data.rel24_clin_HCMI_diag`

Unnamed: 0,f0_
0,337


In [0]:
%%bigquery --project isb-project-zero
SELECT COUNT(case_id)
FROM `isb-project-zero.GDC_Clinical_Data.rel23_clin_HCMI_diag`

Unnamed: 0,f0_
0,24
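
To help answer whether these metrics make sense, the change between releases can be quantified. This is a small sketch, with a hypothetical function name `row_count_change`, that uses the two counts reported above (24 rows in rel23, 337 rows in rel24).

```python
def row_count_change(previous, current):
    """Return (absolute change, ratio) between two row counts.

    The ratio is None when the previous count is zero.
    """
    delta = current - previous
    ratio = current / previous if previous else None
    return delta, ratio


# Counts taken from the rel23 and rel24 queries above.
delta, ratio = row_count_change(24, 337)
print(f'rows added: {delta}, ratio: {ratio:.1f}x')
```

A large ratio is not necessarily an error, but it is a prompt to confirm with the developer that the growth is expected for this release.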


In [0]:
%%bigquery --project isb-project-zero
SELECT *
FROM `isb-project-zero.GDC_Clinical_Data.rel24_clin_HCMI_diag`

Unnamed: 0,diag__diagnosis_id,case_id,diag__treat__count,diag__primary_diagnosis,diag__days_to_last_known_disease_status,diag__perineural_invasion_present,diag__peripancreatic_lymph_nodes_tested,diag__micropapillary_features,diag__days_to_diagnosis,diag__progression_or_recurrence,diag__ajcc_pathologic_m,diag__metastasis_at_diagnosis,diag__gastric_esophageal_junction_involvement,diag__site_of_resection_or_biopsy,diag__ajcc_staging_system_edition,diag__icd_10_code,diag__laterality,diag__age_at_diagnosis,diag__days_to_last_follow_up,diag__lymph_nodes_tested,diag__goblet_cells_columnar_mucosa_present,diag__metastasis_at_diagnosis_site,diag__ajcc_pathologic_stage,diag__esophageal_columnar_metaplasia_present,diag__tumor_grade,diag__lymph_nodes_positive,diag__last_known_disease_status,diag__residual_disease,diag__vascular_invasion_present,diag__esophageal_columnar_dysplasia_degree,diag__ajcc_clinical_stage,diag__synchronous_malignancy,diag__morphology,diag__ajcc_pathologic_t,diag__lymphatic_invasion_present,diag__vascular_invasion_type,diag__classification_of_tumor,diag__tumor_stage,diag__prior_treatment,diag__peripancreatic_lymph_nodes_positive,diag__ajcc_pathologic_n,diag__tissue_or_organ_of_origin,diag__prior_malignancy,diag__state,diag__created_datetime,diag__updated_datetime
0,724929d9-4c63-4b2f-9d65-56cffcb6edcd,c5e9a845-e0ff-4612-aa8e-c310098503a7,0,Not Reported,185.0,,37.0,,28.0,not reported,,,,Not Reported,,C77.0,,,185.0,,,,,,Not Reported,,not reported,,,,,,Not Reported,,,,recurrence,,,4 or More,,"Pancreas, NOS",,released,2020-02-27T13:55:43.042019-06:00,2020-03-26T16:04:47.706904-05:00
1,14d3799c-779e-4d6d-b9a2-0491dca76c52,ddf7c939-5278-4034-b861-36717e11695d,9,Ewing sarcoma,,,,,,not reported,,Distant Metastasis,,"Connective, subcutaneous and other soft tissue...",,C79,,2908.0,1218.0,,,Lung,,,Not Reported,,not reported,,,,,,9260/3,,,,,,No,,,"Bone, NOS",no,released,2019-04-03T17:56:08.557154-05:00,2020-02-11T13:29:04.347635-06:00
2,c45b5acf-9aad-4308-9fd4-acaa0a536bc1,307c905f-c276-4501-988a-b083f9462a98,11,Ewing sarcoma,,,,,,not reported,,No Metastasis,,"Pleura, NOS",,C78.2,,4997.0,4687.0,,,,,,Not Reported,,not reported,,,,,,9260/3,,,,,,Yes,,,"Bone, NOS",yes,released,2019-04-03T17:59:55.017435-05:00,2020-02-11T15:09:25.806884-06:00
3,b8698977-9702-4a4c-96b7-55db868e6c71,56c07b06-c6d3-4c03-9e57-7be636e7cc5c,6,"Carcinoma, NOS",,,,Not Reported,,not reported,,No Metastasis,,"Brain, NOS",,C79.3,Unknown,23897.0,995.0,,,,,,Not Reported,,not reported,,,,,,8010/3,,,,metastasis,,No,,,"Lung, NOS",no,released,2019-04-04T15:11:28.695214-05:00,2020-02-11T14:23:00.566965-06:00
4,f68b1f17-da54-4595-b017-37fd6d6d1e3e,3092d72b-75b1-4ae2-ac38-d4c1cd377e4c,6,Gliosarcoma,,,,,,not reported,,No Metastasis,,"Brain, NOS",,C71.9,,24213.0,639.0,,,,,,Not Reported,,not reported,,,,,,9442/3,,,,,not reported,No,,,"Brain, NOS",no,released,2019-03-12T10:12:01.634698-05:00,2020-01-13T20:10:25.832065-06:00
5,7c0759eb-9a3d-4b98-8412-f156a7f04234,c811d6dd-992f-435a-80ec-b282a2e38aad,6,Glioblastoma,,,,,,not reported,,No Metastasis,,"Brain, NOS",,C71.9,,29995.0,364.0,,,,,,Not Reported,,not reported,,,,,,9440/3,,,,,,No,,,"Brain, NOS",no,released,2019-04-04T14:40:31.851849-05:00,2020-02-11T13:26:13.297569-06:00
6,7b142a98-5022-43e9-bcbb-647d3b28b308,69eced5b-1e76-45c9-bc9c-2aa71a921c57,10,Glioblastoma,,,,,,not reported,,No Metastasis,,"Brain, NOS",,C71.9,,19996.0,373.0,,,,,,Not Reported,,not reported,,,,,,9440/3,,,,,,No,,,"Brain, NOS",no,released,2019-03-12T10:13:02.718709-05:00,2020-02-11T14:11:40.298623-06:00
7,26192cfd-1129-40b7-a99c-7ae133fcb595,0f89e089-4a1d-4e66-8537-502a788cfe75,6,Glioblastoma,,,,,,not reported,,No Metastasis,,"Brain, NOS",,C71.9,,20744.0,420.0,,,,,,Not Reported,,not reported,,,,,,9440/3,,,,,,No,,,"Brain, NOS",no,released,2019-03-12T10:15:03.937781-05:00,2020-02-11T14:18:02.138622-06:00
8,ba1d7f88-da33-4a82-84d1-5c8192c6842d,5e5684d2-14b0-4823-9956-d69aea6d6beb,6,Glioblastoma,,,,,,not reported,,No Metastasis,,"Brain, NOS",,C71.9,,24957.0,134.0,,,,,,Not Reported,,not reported,,,,,,9440/3,,,,,,No,,,"Brain, NOS",no,released,2019-03-12T10:15:41.981178-05:00,2020-02-11T14:19:56.148445-06:00
9,2c497764-9031-4c03-b7f7-45615b605416,6375cc59-6cd1-4b1e-8de1-510841c3bebd,6,Glioblastoma,,,,,,not reported,,No Metastasis,,"Brain, NOS",,C71.9,,21523.0,918.0,,,,,,Not Reported,,not reported,,,,,,9440/3,,,,,,No,,,"Brain, NOS",no,released,2019-03-12T10:13:49.767516-05:00,2020-02-11T14:24:50.202554-06:00


###test 3 - manual verification

**3. Scroll through table manually**

See if anything stands out - empty columns, etc.

The BigQuery table search user interface is useful for this test. The test tier points to isb-etl-open.

ISB-CGC BigQuery table search [test tier](https://isb-cgc-test.appspot.com/bq_meta_search/).

BigQuery console [isb-project-zero](https://console.cloud.google.com/bigquery?authuser=1&folder=&organizationId=&project=isb-project-zero&p=isb-project-zero&d=GDC_Clinical_Data&t=rel24_clin_HCMI_diag&page=table).

Run a manual check in the console with the steps mentioned in step 1.

Are all the fields labeled?

Is there a table description?

Do the field labels make sense for all fields?
    
Are the labels correct?

###test 4 - case_gdc_id file metadata table count verification

**4. Number of case_id versus BigQuery metadata table**



In [0]:
# clinical case_id counts table results below

# The query below displays the number of cases present in this table.

clin_table = Table('`isb-project-zero.GDC_Clinical_Data.rel24_clin_HCMI_diag`')
clin_query = Query.from_(clin_table) \
                  .select(' DISTINCT case_id, count(*) as count') \
                  .groupby('case_id')

clin_query_clean = str(clin_query).replace('"', "")
#print(clin_query_clean)
clin = client.query(clin_query_clean).to_dataframe()
print('number of case from case_id = ' + str(len(clin.index)))


number of case from case_id = 23


###match

In [0]:
# GDC file metadata table case_gdc_id count for clinical below

%%bigquery --project isb-project-zero
SELECT case_gdc_id, program_name
FROM `isb-project-zero.GDC_metadata.rel24_caseData`
where program_name = 'HCMI'
group by case_gdc_id, program_name

Unnamed: 0,case_gdc_id,program_name
0,c5e9a845-e0ff-4612-aa8e-c310098503a7,HCMI
1,9e615018-8669-4dd6-8265-25e766a34dd0,HCMI
2,d7fc5874-8ae0-4357-9d8d-01af39ee521f,HCMI
3,5e5684d2-14b0-4823-9956-d69aea6d6beb,HCMI
4,56c07b06-c6d3-4c03-9e57-7be636e7cc5c,HCMI
5,6375cc59-6cd1-4b1e-8de1-510841c3bebd,HCMI
6,e00914a1-997e-44ba-9c88-a5e8741b4ee1,HCMI
7,d000dcfe-181c-46fe-93ba-5fce8336b52d,HCMI
8,3b797720-33ef-45e0-9830-5f6007f1d5a7,HCMI
9,6796ddd8-0295-4965-911c-d4ad67232d13,HCMI


In [0]:
%%bigquery --project isb-project-zero

SELECT distinct case_id, count(case_id) as count
FROM `isb-project-zero.GDC_metadata.rel24_caseData` as active, `isb-project-zero.GDC_Clinical_Data.rel24_clin_HCMI_diag` as clinical
WHERE program_name = 'HCMI'
AND active.case_gdc_id = clinical.case_id
group by case_id
order by count

Unnamed: 0,case_id,count
0,9e615018-8669-4dd6-8265-25e766a34dd0,1
1,d7fc5874-8ae0-4357-9d8d-01af39ee521f,1
2,5e5684d2-14b0-4823-9956-d69aea6d6beb,1
3,56c07b06-c6d3-4c03-9e57-7be636e7cc5c,1
4,6375cc59-6cd1-4b1e-8de1-510841c3bebd,1
5,e00914a1-997e-44ba-9c88-a5e8741b4ee1,1
6,d000dcfe-181c-46fe-93ba-5fce8336b52d,1
7,3b797720-33ef-45e0-9830-5f6007f1d5a7,1
8,6796ddd8-0295-4965-911c-d4ad67232d13,1
9,c992b973-299c-49d0-b5b5-3bdabb7ef575,1
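
Comparing the two outputs by eye shows whether the counts agree, but not which case IDs differ when they do not. A minimal sketch of the same check done programmatically, assuming both ID columns have already been pulled into Python lists. The helper name `compare_case_ids` and the placeholder IDs are hypothetical and not part of the notebook.

```python
def compare_case_ids(clinical_ids, metadata_ids):
    """Return the case IDs that appear on only one side of the comparison."""
    clinical = set(clinical_ids)
    metadata = set(metadata_ids)
    return {
        "only_in_clinical": sorted(clinical - metadata),
        "only_in_metadata": sorted(metadata - clinical),
    }


# Placeholder IDs for illustration only; real inputs would be the case_id
# column of the clinical table and the case_gdc_id column of the metadata table.
clinical_ids = ["case-a", "case-b", "case-b"]
metadata_ids = ["case-a", "case-b", "case-c"]

result = compare_case_ids(clinical_ids, metadata_ids)
print(result)
```

Two empty lists mean the tables cover the same cases. Repeated IDs on either side are collapsed by the sets, which matches the `DISTINCT` / `GROUP BY` behaviour of the queries above.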


###test 5 - duplication verification

**5. Check for any duplicate rows present in the table**

In [0]:
%%bigquery --project isb-project-zero

SELECT count(case_id) AS count
FROM `isb-project-zero.GDC_Clinical_Data.rel24_clin_HCMI_diag`
group by diag__diagnosis_id, case_id, diag__treat__count, diag__primary_diagnosis, diag__days_to_last_known_disease_status, diag__perineural_invasion_present, diag__peripancreatic_lymph_nodes_tested, diag__micropapillary_features, diag__days_to_diagnosis, diag__progression_or_recurrence, diag__ajcc_pathologic_m, diag__metastasis_at_diagnosis, diag__gastric_esophageal_junction_involvement, diag__site_of_resection_or_biopsy, diag__ajcc_staging_system_edition, diag__icd_10_code, diag__laterality, diag__age_at_diagnosis, diag__days_to_last_follow_up, diag__lymph_nodes_tested, diag__goblet_cells_columnar_mucosa_present, diag__metastasis_at_diagnosis_site, diag__ajcc_pathologic_stage, diag__esophageal_columnar_metaplasia_present, diag__tumor_grade, diag__lymph_nodes_positive, diag__last_known_disease_status, diag__residual_disease, diag__vascular_invasion_present, diag__esophageal_columnar_dysplasia_degree, diag__ajcc_clinical_stage, diag__synchronous_malignancy, diag__morphology, diag__ajcc_pathologic_t, diag__lymphatic_invasion_present, diag__vascular_invasion_type, diag__classification_of_tumor, diag__tumor_stage, diag__prior_treatment, diag__peripancreatic_lymph_nodes_positive, diag__ajcc_pathologic_n, diag__tissue_or_organ_of_origin, diag__prior_malignancy, diag__state, diag__created_datetime, diag__updated_datetime
ORDER BY count DESC
LIMIT 10

Unnamed: 0,count
0,1
1,1
2,1
3,1
4,1
5,1
6,1
7,1
8,1
9,1
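
The query above lists the ten largest group sizes, so a clean table is recognised by every count being 1. A possible variant, sketched here under the assumption that the column names are available as a Python list, adds `HAVING COUNT(*) > 1` so that the result is empty when there are no duplicates. `build_duplicate_query` is a hypothetical helper, and the three-column list is shortened for illustration; a full check would pass every column of the table.

```python
def build_duplicate_query(table_id, columns):
    """Build a BigQuery SQL string that returns only fully duplicated rows."""
    column_list = ", ".join(columns)
    return (
        f"SELECT {column_list}, COUNT(*) AS row_count\n"
        f"FROM `{table_id}`\n"
        f"GROUP BY {column_list}\n"
        "HAVING COUNT(*) > 1\n"
        "ORDER BY row_count DESC"
    )


# Shortened column list for illustration only.
query = build_duplicate_query(
    "isb-project-zero.GDC_Clinical_Data.rel24_clin_HCMI_diag",
    ["diag__diagnosis_id", "case_id", "diag__treat__count"],
)
print(query)
```

`COUNT(*)` is used instead of `count(case_id)` because the latter skips rows where `case_id` is NULL. The resulting string could be passed to `client.query(...)` in the same way as the other queries in this notebook.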


###test 6 - case_id master clinical data table count verification

**6. Verify case_id count of table against master rel_clinical_data table**

In [0]:
# case_id count from the program HCMI clinical table

%%bigquery --project isb-project-zero

select distinct case_id, count(case_id) as count
from `isb-project-zero.GDC_Clinical_Data.rel24_clin_HCMI` 
group by case_id
order by count

Unnamed: 0,case_id,count
0,c992b973-299c-49d0-b5b5-3bdabb7ef575,1
1,ddf7c939-5278-4034-b861-36717e11695d,1
2,307c905f-c276-4501-988a-b083f9462a98,1
3,3092d72b-75b1-4ae2-ac38-d4c1cd377e4c,1
4,c811d6dd-992f-435a-80ec-b282a2e38aad,1
5,69eced5b-1e76-45c9-bc9c-2aa71a921c57,1
6,0f89e089-4a1d-4e66-8537-502a788cfe75,1
7,5e5684d2-14b0-4823-9956-d69aea6d6beb,1
8,6375cc59-6cd1-4b1e-8de1-510841c3bebd,1
9,babbc3cb-14b7-4894-b91d-c160f4cb48e0,1


In [0]:
# case_id count from the program HCMI_diag clinical table

%%bigquery --project isb-project-zero

select distinct case_id, count(case_id) as count
from `isb-project-zero.GDC_Clinical_Data.rel24_clin_HCMI_diag` 
group by case_id
order by count

Unnamed: 0,case_id,count
0,ddf7c939-5278-4034-b861-36717e11695d,1
1,307c905f-c276-4501-988a-b083f9462a98,1
2,56c07b06-c6d3-4c03-9e57-7be636e7cc5c,1
3,3092d72b-75b1-4ae2-ac38-d4c1cd377e4c,1
4,c811d6dd-992f-435a-80ec-b282a2e38aad,1
5,69eced5b-1e76-45c9-bc9c-2aa71a921c57,1
6,0f89e089-4a1d-4e66-8537-502a788cfe75,1
7,5e5684d2-14b0-4823-9956-d69aea6d6beb,1
8,6375cc59-6cd1-4b1e-8de1-510841c3bebd,1
9,babbc3cb-14b7-4894-b91d-c160f4cb48e0,1


###match

##Clin HCMI_diag__treat

**Testing Full ID** `isb-project-zero.GDC_Clinical_Data.rel24_clin_HCMI_diag__treat`

[Table location](https://console.cloud.google.com/bigquery?authuser=1&folder=&organizationId=&project=isb-project-zero&p=isb-project-zero&d=GDC_Clinical_Data&t=rel24_clin_HCMI_diag__treat&page=table)

Source : GDC API

Release version : v24

###test 1 - schema verification

**1. Check schema**

Are all the fields labeled?

Is there a table description?

Do the field labels make sense for all fields?

Are the labels correct?

Google documentation column descriptions for [reference](https://cloud.google.com/bigquery/docs/information-schema-tables#column_field_paths_view).

Google documentation table options for [reference](https://cloud.google.com/bigquery/docs/information-schema-tables#options_table).

In [0]:
#return all table information for rel24_clin_HCMI_diag__treat

clin_table = Table('`isb-project-zero`.GDC_Clinical_Data.INFORMATION_SCHEMA.TABLES')
clin_query = Query.from_(clin_table) \
                  .select(' table_catalog, table_schema, table_name, table_type ') \
                  .where(clin_table.table_name=='rel24_clin_HCMI_diag__treat') \
                  
clin_query_clean = str(clin_query).replace('"', "")
clin = client.query(clin_query_clean).to_dataframe()
clin.head()

Unnamed: 0,table_catalog,table_schema,table_name,table_type
0,isb-project-zero,GDC_Clinical_Data,rel24_clin_HCMI_diag__treat,BASE TABLE


In [0]:
# return the table options (friendly name, description, labels) for rel24_clin_HCMI_diag__treat

clin_table = Table('`isb-project-zero`.GDC_Clinical_Data.INFORMATION_SCHEMA.TABLE_OPTIONS')
clin_query = Query.from_(clin_table) \
                  .select(' table_name, option_name, option_type, option_value ') \
                  .where(clin_table.table_name=='rel24_clin_HCMI_diag__treat')

clin_query_clean = str(clin_query).replace('"', "")
clin = client.query(clin_query_clean).to_dataframe()

for i in range(len(clin)):
    print(clin['option_name'][i] + '\n')
    print('\t' + clin['option_value'][i] + '\n')
    print('\t' + clin['option_type'][i] + '\n')

# an empty result means no friendly name, description or labels are set;
# a for/else would print the failure message on every run, so test explicitly
if clin.empty:
    print('QC of friendly name, table description and labels --- FAILED')

QC of friendly name, table description and labels --- FAILED
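
`INFORMATION_SCHEMA.TABLE_OPTIONS` returns one row per option that is set on a table, so the friendly name, description and labels can each be checked by name rather than as a single pass/fail. A minimal sketch, assuming the query result has been converted to a list of dicts (for example with `clin.to_dict('records')`). `missing_table_options` is a hypothetical helper, and the description value in the second example is invented for illustration.

```python
REQUIRED_OPTIONS = ("friendly_name", "description", "labels")


def missing_table_options(option_rows, required=REQUIRED_OPTIONS):
    """Return the required option names that have no non-empty value."""
    present = {
        row["option_name"]
        for row in option_rows
        if row.get("option_value") not in (None, "")
    }
    return [name for name in required if name not in present]


# No rows at all, as with an empty TABLE_OPTIONS result: everything is missing.
no_rows = missing_table_options([])

# Hypothetical result in which only a description has been set.
partial = missing_table_options(
    [{"option_name": "description", "option_value": '"example description"'}]
)

print(no_rows)
print(partial)
```

An empty return value would indicate that all three options are present; anything else names exactly what still needs to be filled in.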


In [0]:
#check for empty schemas in dataset rel24_clin_HCMI_diag__treat

clin_table = Table('`isb-project-zero`.GDC_Clinical_Data.INFORMATION_SCHEMA.TABLE_OPTIONS')
clin_query = Query.from_(clin_table) \
                  .select(' table_name, option_name, option_type, option_value ') \
                  .where(clin_table.table_name=='rel24_clin_HCMI_diag__treat') \

clin_query_clean = str(clin_query).replace('"', "")
clin = client.query(clin_query_clean).to_dataframe()
pandas.options.display.max_rows
print("Are there any empty cells in the table schema?")
clin.empty

Are there any empty cells in the table schema?


True

Example of the field descriptions pulled below

In [0]:
#list of field descriptions for table rel24_clin_HCMI_diag__treat

clin_table = Table('`isb-project-zero`.GDC_Clinical_Data.INFORMATION_SCHEMA.COLUMN_FIELD_PATHS')
clin_query = Query.from_(clin_table) \
                  .select('table_name, column_name, description') \
                  .where(clin_table.table_name=='rel24_clin_HCMI_diag__treat') \

clin_query_clean = str(clin_query).replace('"', "")
clin = client.query(clin_query_clean).to_dataframe()
pandas.options.display.max_rows


for i in range(len(clin)):
  print(clin['table_name'][i] + '\n')
  print('\t' + clin['column_name'][i] + '\n')
  print('\t' + clin['description'][i] + '\n')

rel24_clin_HCMI_diag__treat

	diag__treat__treatment_id

	

rel24_clin_HCMI_diag__treat

	diag__diagnosis_id

	Reference to ancestor diag__diagnosis_id, located in rel24_clin_HCMI_diag.

rel24_clin_HCMI_diag__treat

	case_id

	Reference to ancestor case_id, located in rel24_clin_HCMI.

rel24_clin_HCMI_diag__treat

	diag__treat__days_to_treatment_start

	Number of days between the date used for index and the date the treatment started.

rel24_clin_HCMI_diag__treat

	diag__treat__treatment_outcome

	Text term that describes the patient's final outcome after the treatment was administered.

rel24_clin_HCMI_diag__treat

	diag__treat__treatment_type

	Text term that describes the kind of treatment administered.

rel24_clin_HCMI_diag__treat

	diag__treat__treatment_or_therapy

	A yes/no/unknown/not applicable indicator related to the administration of therapeutic agents received.

rel24_clin_HCMI_diag__treat

	diag__treat__therapeutic_agents

	Text identification of the individual agent(s) u

In [0]:
# check for empty schemas in dataset rel24_clin_HCMI_diag__treat

clin_table = Table('`isb-project-zero`.GDC_Clinical_Data.INFORMATION_SCHEMA.COLUMN_FIELD_PATHS')
clin_query = Query.from_(clin_table) \
                  .select('table_name, column_name, description') \
                  .where(clin_table.table_name=='rel24_clin_HCMI_diag__treat') \

clin_query_clean = str(clin_query).replace('"', "")
clin = client.query(clin_query_clean).to_dataframe()
print(clin)

                     table_name  ...                                        description
0   rel24_clin_HCMI_diag__treat  ...                                                   
1   rel24_clin_HCMI_diag__treat  ...  Reference to ancestor diag__diagnosis_id, loca...
2   rel24_clin_HCMI_diag__treat  ...  Reference to ancestor case_id, located in rel2...
3   rel24_clin_HCMI_diag__treat  ...  Number of days between the date used for index...
4   rel24_clin_HCMI_diag__treat  ...  Text term that describes the patient's final o...
5   rel24_clin_HCMI_diag__treat  ...  Text term that describes the kind of treatment...
6   rel24_clin_HCMI_diag__treat  ...  A yes/no/unknown/not applicable indicator rela...
7   rel24_clin_HCMI_diag__treat  ...  Text identification of the individual agent(s)...
8   rel24_clin_HCMI_diag__treat  ...  The text term used to describe the status of t...
9   rel24_clin_HCMI_diag__treat  ...  Number of days between the date used for index...
10  rel24_clin_HCMI_diag__treat 
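
In the listing above, `diag__treat__treatment_id` is printed with a blank description, which is easy to miss when scanning a long output. A minimal sketch of a check that reports only the columns lacking a description, assuming the `COLUMN_FIELD_PATHS` result is available as a list of dicts. `columns_missing_description` is a hypothetical helper name.

```python
def columns_missing_description(field_rows):
    """Return the column names whose description is None or blank."""
    missing = []
    for row in field_rows:
        description = row.get("description")
        if description is None or not description.strip():
            missing.append(row["column_name"])
    return missing


# Two rows mirroring the listing above: the first has a blank description.
rows = [
    {"column_name": "diag__treat__treatment_id", "description": ""},
    {
        "column_name": "case_id",
        "description": "Reference to ancestor case_id, located in rel24_clin_HCMI.",
    },
]

missing = columns_missing_description(rows)
print(missing)
```

Handling `None` separately matters because concatenating `'\t'` with a missing description would raise a `TypeError` in a plain print loop.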

###test 2 - row number verification

**2. Look at table row number and size**

Do these metrics make sense?

In [0]:
%%bigquery --project isb-project-zero
SELECT COUNT(case_id)
FROM `isb-project-zero.GDC_Clinical_Data.rel24_clin_HCMI_diag__treat`

Unnamed: 0,f0_
0,149


In [0]:
%%bigquery --project isb-project-zero
SELECT COUNT(case_id)
FROM `isb-project-zero.GDC_Clinical_Data.rel23_clin_HCMI_diag__treat`

Unnamed: 0,f0_
0,149
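
The release 23 and release 24 counts above come from two separate cells. One possible way to get them side by side, sketched under the assumption that the tables follow the `<release>_<table suffix>` naming seen in this notebook, is to build a single `UNION ALL` query. `build_row_count_query` and the `release_name` / `row_count` aliases are hypothetical.

```python
def build_row_count_query(dataset, table_suffix, releases):
    """Build one query that returns a row count per release of a table."""
    selects = [
        f"SELECT '{release}' AS release_name, COUNT(*) AS row_count "
        f"FROM `{dataset}.{release}_{table_suffix}`"
        for release in releases
    ]
    return "\nUNION ALL\n".join(selects)


query = build_row_count_query(
    "isb-project-zero.GDC_Clinical_Data",
    "clin_HCMI_diag__treat",
    ["rel23", "rel24"],
)
print(query)
```

`COUNT(*)` counts every row, whereas `COUNT(case_id)` as used above ignores rows with a NULL `case_id`; the two agree only when `case_id` is always populated.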


In [0]:
%%bigquery --project isb-project-zero
SELECT *
FROM `isb-project-zero.GDC_Clinical_Data.rel24_clin_HCMI_diag__treat`

Unnamed: 0,diag__treat__treatment_id,diag__diagnosis_id,case_id,diag__treat__days_to_treatment_start,diag__treat__treatment_outcome,diag__treat__treatment_type,diag__treat__treatment_or_therapy,diag__treat__therapeutic_agents,diag__treat__initial_disease_status,diag__treat__days_to_treatment_end,diag__treat__regimen_or_line_of_therapy,diag__treat__treatment_intent_type,diag__treat__state,diag__treat__created_datetime,diag__treat__updated_datetime
0,990cf73c-6491-4530-9c96-4ebb33bcc200,f948667b-7daa-4316-a77d-5cfb3f2c8156,c5e9a845-e0ff-4612-aa8e-c310098503a7,,,,no,,,,,Neoadjuvant,released,2020-02-27T13:55:43.042019-06:00,2020-03-26T16:04:47.706904-05:00
1,c4574c56-946b-49c5-94fc-86759df035d2,3bc3a18d-b3b5-46cc-875e-09545978654a,c992b973-299c-49d0-b5b5-3bdabb7ef575,,,,no,,,,,Neoadjuvant,released,2019-04-03T17:50:51.184182-05:00,2020-01-13T20:10:25.832065-06:00
2,bcfcb96f-cbda-4f86-9d04-952f433ff629,14d3799c-779e-4d6d-b9a2-0491dca76c52,ddf7c939-5278-4034-b861-36717e11695d,,,,no,,,,,Neoadjuvant,released,2019-04-03T17:56:08.557154-05:00,2020-01-13T20:10:25.832065-06:00
3,4b918f22-a15a-4f24-b962-418839960e93,b8698977-9702-4a4c-96b7-55db868e6c71,56c07b06-c6d3-4c03-9e57-7be636e7cc5c,,,,no,,,,,Neoadjuvant,released,2019-04-04T15:11:28.695214-05:00,2020-01-13T20:10:25.832065-06:00
4,07c5c0b5-735e-44f0-9124-897ff844ea9e,f68b1f17-da54-4595-b017-37fd6d6d1e3e,3092d72b-75b1-4ae2-ac38-d4c1cd377e4c,,,,no,,,,,Neoadjuvant,released,2019-03-12T10:12:01.634698-05:00,2020-01-13T20:10:25.832065-06:00
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
144,edd6b471-04b2-4606-9bf1-ace36786e526,e0fffc04-b694-41a0-8d9d-425504b92784,8b3b1f24-419e-4043-82be-2bd41268bb0e,,,"Pharmaceutical Therapy, NOS",unknown,,Residual Disease,,,Adjuvant,released,2019-05-15T13:02:25.351730-05:00,2019-05-24T12:11:41.511797-05:00
145,e85a1205-3ae9-4d6a-a94f-10c304ae6762,e0fffc04-b694-41a0-8d9d-425504b92784,8b3b1f24-419e-4043-82be-2bd41268bb0e,,,"Radiation Therapy, NOS",unknown,,Residual Disease,,,Adjuvant,released,2019-05-15T13:02:25.351730-05:00,2019-05-24T12:11:41.511797-05:00
146,30e0efc8-2571-49f0-beee-8ac4213ad6a1,807dc915-5b1b-48a1-b21f-0e986f33a347,9e615018-8669-4dd6-8265-25e766a34dd0,,,Chemotherapy,unknown,,Residual Disease,,,Adjuvant,released,2020-02-11T14:38:31.889814-06:00,2020-03-26T16:04:47.706904-05:00
147,2b885b52-de48-423c-bb85-062c9e42850f,14d3799c-779e-4d6d-b9a2-0491dca76c52,ddf7c939-5278-4034-b861-36717e11695d,,,Surgery,unknown,,Initial Diagnosis,,,,released,2019-04-16T11:15:52.632540-05:00,2020-01-13T20:10:25.832065-06:00


###test 3 - manual verification

**3. Scroll through table manually**

See if anything stands out - empty columns, etc.

The BigQuery table search user interface is useful for this test. The test tier points to isb-etl-open.

ISB-CGC BigQuery table search [test tier](https://isb-cgc-test.appspot.com/bq_meta_search/).

BigQuery console [isb-project-zero](https://console.cloud.google.com/bigquery?project=high-transit-276919&authuser=2&p=isb-project-zero&d=GDC_metadata&t=rel24_clin_HCMI_diag__treat&page=table).

Run a manual check in the console with the steps mentioned in step 1.

Are all the fields labeled?

Is there a table description?

Do the field labels make sense for all fields?
    
Are the labels correct?

###test 4 - case_gdc_id file metadata table count verification

**4. Number of case_id versus BigQuery metadata table**

In [0]:
# clinical case_id counts table results below

# The query below displays the number of cases present in this table.

clin_table = Table('`isb-project-zero.GDC_Clinical_Data.rel24_clin_HCMI_diag__treat`')
clin_query = Query.from_(clin_table) \
                  .select(' DISTINCT case_id, count(*) as count') \
                  .groupby('case_id')

clin_query_clean = str(clin_query).replace('"', "")
#print(clin_query_clean)
clin = client.query(clin_query_clean).to_dataframe()
print('number of case from case_id = ' + str(len(clin.index)))

number of case from case_id = 23


In [0]:
# GDC file metadata table case_gdc_id count for clinical below

%%bigquery --project isb-project-zero
SELECT case_gdc_id, program_name
FROM `isb-project-zero.GDC_metadata.rel24_caseData`
where program_name = 'HCMI'
group by case_gdc_id, program_name

Unnamed: 0,case_gdc_id,program_name
0,c5e9a845-e0ff-4612-aa8e-c310098503a7,HCMI
1,9e615018-8669-4dd6-8265-25e766a34dd0,HCMI
2,d7fc5874-8ae0-4357-9d8d-01af39ee521f,HCMI
3,5e5684d2-14b0-4823-9956-d69aea6d6beb,HCMI
4,56c07b06-c6d3-4c03-9e57-7be636e7cc5c,HCMI
5,6375cc59-6cd1-4b1e-8de1-510841c3bebd,HCMI
6,e00914a1-997e-44ba-9c88-a5e8741b4ee1,HCMI
7,d000dcfe-181c-46fe-93ba-5fce8336b52d,HCMI
8,3b797720-33ef-45e0-9830-5f6007f1d5a7,HCMI
9,6796ddd8-0295-4965-911c-d4ad67232d13,HCMI


### match

In [0]:
%%bigquery --project isb-project-zero

SELECT distinct case_id, count(case_id) as count
FROM `isb-project-zero.GDC_metadata.rel24_caseData` as active, `isb-project-zero.GDC_Clinical_Data.rel24_clin_HCMI_diag__treat` as clinical
WHERE program_name = 'HCMI'
AND active.case_gdc_id = clinical.case_id
group by case_id
order by count

Unnamed: 0,case_id,count
0,c992b973-299c-49d0-b5b5-3bdabb7ef575,4
1,cd6f37d5-d192-48b9-a63f-54d8d1e810c6,4
2,b5fe3fa5-0f46-4e63-8e1f-dc3d28734936,4
3,8b3b1f24-419e-4043-82be-2bd41268bb0e,4
4,e00914a1-997e-44ba-9c88-a5e8741b4ee1,5
5,56c07b06-c6d3-4c03-9e57-7be636e7cc5c,6
6,3092d72b-75b1-4ae2-ac38-d4c1cd377e4c,6
7,c811d6dd-992f-435a-80ec-b282a2e38aad,6
8,0f89e089-4a1d-4e66-8537-502a788cfe75,6
9,5e5684d2-14b0-4823-9956-d69aea6d6beb,6


###test 5 - duplication verification

**5. Check for any duplicate rows present in the table**

In [0]:
%%bigquery --project isb-project-zero

SELECT count(case_id) AS count
FROM `isb-project-zero.GDC_Clinical_Data.rel24_clin_HCMI_diag__treat`
group by diag__treat__treatment_id, diag__diagnosis_id, case_id, diag__treat__days_to_treatment_start, diag__treat__treatment_outcome, diag__treat__treatment_type, diag__treat__treatment_or_therapy, diag__treat__therapeutic_agents, diag__treat__initial_disease_status, diag__treat__days_to_treatment_end, diag__treat__regimen_or_line_of_therapy, diag__treat__treatment_intent_type, diag__treat__state, diag__treat__created_datetime, diag__treat__updated_datetime
ORDER BY count DESC
LIMIT 10

Unnamed: 0,count
0,1
1,1
2,1
3,1
4,1
5,1
6,1
7,1
8,1
9,1


###test 6 - case_id master clinical data table count verification

**6. Verify case_id count of table against master rel_clinical_data table**

In [0]:
# case_id count from the program HCMI clinical table

%%bigquery --project isb-project-zero

select distinct case_id, count(case_id) as count
from `isb-project-zero.GDC_Clinical_Data.rel24_clin_HCMI` 
group by case_id
order by count

Unnamed: 0,case_id,count
0,c992b973-299c-49d0-b5b5-3bdabb7ef575,1
1,ddf7c939-5278-4034-b861-36717e11695d,1
2,307c905f-c276-4501-988a-b083f9462a98,1
3,3092d72b-75b1-4ae2-ac38-d4c1cd377e4c,1
4,c811d6dd-992f-435a-80ec-b282a2e38aad,1
5,69eced5b-1e76-45c9-bc9c-2aa71a921c57,1
6,0f89e089-4a1d-4e66-8537-502a788cfe75,1
7,5e5684d2-14b0-4823-9956-d69aea6d6beb,1
8,6375cc59-6cd1-4b1e-8de1-510841c3bebd,1
9,babbc3cb-14b7-4894-b91d-c160f4cb48e0,1


In [0]:
# case_id count from the program HCMI_diag__treat clinical table

%%bigquery --project isb-project-zero

select distinct case_id, count(case_id) as count
from `isb-project-zero.GDC_Clinical_Data.rel24_clin_HCMI_diag__treat` 
group by case_id
order by count

Unnamed: 0,case_id,count
0,c992b973-299c-49d0-b5b5-3bdabb7ef575,4
1,cd6f37d5-d192-48b9-a63f-54d8d1e810c6,4
2,b5fe3fa5-0f46-4e63-8e1f-dc3d28734936,4
3,8b3b1f24-419e-4043-82be-2bd41268bb0e,4
4,e00914a1-997e-44ba-9c88-a5e8741b4ee1,5
5,56c07b06-c6d3-4c03-9e57-7be636e7cc5c,6
6,3092d72b-75b1-4ae2-ac38-d4c1cd377e4c,6
7,c811d6dd-992f-435a-80ec-b282a2e38aad,6
8,0f89e089-4a1d-4e66-8537-502a788cfe75,6
9,5e5684d2-14b0-4823-9956-d69aea6d6beb,6


### match 


###test 7 - diagnosis_id count verification

**7. QC diagnosis_id count from parent diag table if applicable**

In [0]:
# diag_diagnosis_id count from the program HCMI_diag clinical table

%%bigquery --project isb-project-zero

select distinct diag__diagnosis_id, count(diag__diagnosis_id) as count
from `isb-project-zero.GDC_Clinical_Data.rel24_clin_HCMI_diag` 
group by diag__diagnosis_id
order by count

Unnamed: 0,diag__diagnosis_id,count
0,724929d9-4c63-4b2f-9d65-56cffcb6edcd,1
1,14d3799c-779e-4d6d-b9a2-0491dca76c52,1
2,c45b5acf-9aad-4308-9fd4-acaa0a536bc1,1
3,b8698977-9702-4a4c-96b7-55db868e6c71,1
4,f68b1f17-da54-4595-b017-37fd6d6d1e3e,1
5,7c0759eb-9a3d-4b98-8412-f156a7f04234,1
6,7b142a98-5022-43e9-bcbb-647d3b28b308,1
7,26192cfd-1129-40b7-a99c-7ae133fcb595,1
8,ba1d7f88-da33-4a82-84d1-5c8192c6842d,1
9,2c497764-9031-4c03-b7f7-45615b605416,1


In [0]:
# diag_diagnosis_id count from the program HCMI_diag__treat clinical table

%%bigquery --project isb-project-zero

select distinct diag__diagnosis_id, count(diag__diagnosis_id) as count
from `isb-project-zero.GDC_Clinical_Data.rel24_clin_HCMI_diag__treat` 
group by diag__diagnosis_id
order by count

Unnamed: 0,diag__diagnosis_id,count
0,3bc3a18d-b3b5-46cc-875e-09545978654a,4
1,5dea9a2b-f5da-4f3d-aefa-f53f21a81fde,4
2,027c64ba-cd09-40e9-9495-12142f99aeb6,4
3,e0fffc04-b694-41a0-8d9d-425504b92784,4
4,0a37ccf7-cabd-4589-b04e-2ed8f84399e7,5
5,b8698977-9702-4a4c-96b7-55db868e6c71,6
6,f68b1f17-da54-4595-b017-37fd6d6d1e3e,6
7,7c0759eb-9a3d-4b98-8412-f156a7f04234,6
8,26192cfd-1129-40b7-a99c-7ae133fcb595,6
9,ba1d7f88-da33-4a82-84d1-5c8192c6842d,6


### no match 

Explanation: the parent diag table query returns 24 rows, while the treat table query returns 23 rows.
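
A difference in row counts does not show which side is incomplete. The diag output earlier in this notebook includes a diagnosis with a `diag__treat__count` of 0, so a parent diagnosis without treatment rows may be legitimate, whereas a treatment row whose `diag__diagnosis_id` is absent from the parent table would not be. A minimal sketch that separates the two cases, assuming both ID columns are available as Python lists. `compare_parent_child_ids` and the placeholder IDs are hypothetical.

```python
def compare_parent_child_ids(parent_ids, child_ids):
    """Split a parent/child ID comparison into orphans and childless parents."""
    parents = set(parent_ids)
    children = set(child_ids)
    return {
        # Child rows that point at a diagnosis the parent table does not have.
        "orphans": sorted(children - parents),
        # Diagnoses with no treatment rows; not necessarily an error.
        "childless": sorted(parents - children),
    }


# Placeholder IDs for illustration only.
report = compare_parent_child_ids(["d1", "d2", "d3"], ["d1", "d1", "d3"])
print(report)
```

Under this reading, the test would fail only when `orphans` is non-empty, and the `childless` list could be cross-checked against the rows where `diag__treat__count` is 0.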

##Clin HCMI_follow

**Testing Full ID** `isb-project-zero.GDC_Clinical_Data.rel24_clin_HCMI_follow`

[Table location](https://console.cloud.google.com/bigquery?project=high-transit-276919&authuser=2&p=isb-project-zero&d=GDC_metadata&t=rel24_clin_HCMI_follow&page=table)

Source : GDC API

Release version : v24

###test 1 - schema verification

**1. Check schema**

Are all the fields labeled?

Is there a table description?

Do the field labels make sense for all fields?

Are the labels correct?

Google documentation column descriptions for [reference](https://cloud.google.com/bigquery/docs/information-schema-tables#column_field_paths_view).

Google documentation table options for [reference](https://cloud.google.com/bigquery/docs/information-schema-tables#options_table).

In [0]:
#return all table information for rel24_clin_HCMI_follow

clin_table = Table('`isb-project-zero`.GDC_Clinical_Data.INFORMATION_SCHEMA.TABLES')
clin_query = Query.from_(clin_table) \
                  .select(' table_catalog, table_schema, table_name, table_type ') \
                  .where(clin_table.table_name=='rel24_clin_HCMI_follow') \
                  
clin_query_clean = str(clin_query).replace('"', "")
clin = client.query(clin_query_clean).to_dataframe()
clin.head()

Unnamed: 0,table_catalog,table_schema,table_name,table_type
0,isb-project-zero,GDC_Clinical_Data,rel24_clin_HCMI_follow,BASE TABLE


In [0]:
# return the table options (friendly name, description, labels) for rel24_clin_HCMI_follow

clin_table = Table('`isb-project-zero`.GDC_Clinical_Data.INFORMATION_SCHEMA.TABLE_OPTIONS')
clin_query = Query.from_(clin_table) \
                  .select(' table_name, option_name, option_type, option_value ') \
                  .where(clin_table.table_name=='rel24_clin_HCMI_follow')

clin_query_clean = str(clin_query).replace('"', "")
clin = client.query(clin_query_clean).to_dataframe()

for i in range(len(clin)):
    print(clin['option_name'][i] + '\n')
    print('\t' + clin['option_value'][i] + '\n')
    print('\t' + clin['option_type'][i] + '\n')

# an empty result means no friendly name, description or labels are set;
# a for/else would print the failure message on every run, so test explicitly
if clin.empty:
    print('QC of friendly name, table description and labels --- FAILED')

QC of friendly name, table description and labels --- FAILED


In [0]:
#check for empty schemas in dataset rel24_clin_HCMI_follow

clin_table = Table('`isb-project-zero`.GDC_Clinical_Data.INFORMATION_SCHEMA.TABLE_OPTIONS')
clin_query = Query.from_(clin_table) \
                  .select(' table_name, option_name, option_type, option_value ') \
                  .where(clin_table.table_name=='rel24_clin_HCMI_follow') \

clin_query_clean = str(clin_query).replace('"', "")
clin = client.query(clin_query_clean).to_dataframe()
pandas.options.display.max_rows
print("Are there any empty cells in the table schema?")
clin.empty

Are there any empty cells in the table schema?


True

Example of the field descriptions pulled below




In [0]:
#list of field descriptions for table rel24_clin_HCMI_follow

clin_table = Table('`isb-project-zero`.GDC_Clinical_Data.INFORMATION_SCHEMA.COLUMN_FIELD_PATHS')
clin_query = Query.from_(clin_table) \
                  .select('table_name, column_name, description') \
                  .where(clin_table.table_name=='rel24_clin_HCMI_follow') \

clin_query_clean = str(clin_query).replace('"', "")
clin = client.query(clin_query_clean).to_dataframe()
pandas.options.display.max_rows


for i in range(len(clin)):
  print(clin['table_name'][i] + '\n')
  print('\t' + clin['column_name'][i] + '\n')
  print('\t' + clin['description'][i] + '\n')

rel24_clin_HCMI_follow

	follow__follow_up_id

	Reference to ancestor follow__follow_up_id, located in rel24_clin_HCMI_follow.

rel24_clin_HCMI_follow

	case_id

	Reference to ancestor case_id, located in rel24_clin_HCMI.

rel24_clin_HCMI_follow

	follow__mol_test__count

	Total child record count (located in cases.follow_ups table).

rel24_clin_HCMI_follow

	follow__days_to_follow_up

	Number of days between the date used for index and the date of the patient's last follow-up appointment or contact.

rel24_clin_HCMI_follow

	follow__height

	The height of the patient in centimeters.

rel24_clin_HCMI_follow

	follow__weight

	The weight of the patient measured in kilograms.

rel24_clin_HCMI_follow

	follow__bmi

	A calculated numerical quantity that represents an individual's weight to height ratio.

rel24_clin_HCMI_follow

	follow__progression_or_recurrence_type

	The text term used to describe the type of progressive or recurrent disease or relapsed disease.

rel24_clin_HCMI_follow



In [0]:
# check for empty schemas in dataset rel24_clin_HCMI_follow
clin_table = Table('`isb-project-zero`.GDC_Clinical_Data.INFORMATION_SCHEMA.COLUMN_FIELD_PATHS')
clin_query = Query.from_(clin_table) \
                  .select('table_name, column_name, description') \
                  .where(clin_table.table_name=='rel24_clin_HCMI_follow') \

clin_query_clean = str(clin_query).replace('"', "")
clin = client.query(clin_query_clean).to_dataframe()
print(clin)

                table_name  ...                                        description
0   rel24_clin_HCMI_follow  ...  Reference to ancestor follow__follow_up_id, lo...
1   rel24_clin_HCMI_follow  ...  Reference to ancestor case_id, located in rel2...
2   rel24_clin_HCMI_follow  ...  Total child record count (located in cases.fol...
3   rel24_clin_HCMI_follow  ...  Number of days between the date used for index...
4   rel24_clin_HCMI_follow  ...          The height of the patient in centimeters.
5   rel24_clin_HCMI_follow  ...   The weight of the patient measured in kilograms.
6   rel24_clin_HCMI_follow  ...  A calculated numerical quantity that represent...
7   rel24_clin_HCMI_follow  ...  The text term used to describe the type of pro...
8   rel24_clin_HCMI_follow  ...  Number of days between the date used for index...
9   rel24_clin_HCMI_follow  ...  The text term used to describe a comorbidity d...
10  rel24_clin_HCMI_follow  ...  Number of days between the date used for index...
11  

###test 2 - row number verification

**2. Look at table row number and size**

Do these metrics make sense?

In [0]:
%%bigquery --project isb-project-zero
SELECT COUNT(case_id)
FROM `isb-project-zero.GDC_Clinical_Data.rel24_clin_HCMI_follow`

Unnamed: 0,f0_
0,101


In [0]:
%%bigquery --project isb-project-zero
SELECT COUNT(case_id)
FROM `isb-project-zero.GDC_Clinical_Data.rel23_clin_HCMI_follow`

Unnamed: 0,f0_
0,101


In [0]:
%%bigquery --project isb-project-zero
SELECT *
FROM `isb-project-zero.GDC_Clinical_Data.rel24_clin_HCMI_follow`

Unnamed: 0,follow__follow_up_id,case_id,follow__mol_test__count,follow__days_to_follow_up,follow__height,follow__weight,follow__bmi,follow__progression_or_recurrence_type,follow__days_to_progression,follow__comorbidity,follow__days_to_comorbidity,follow__diabetes_treatment_type,follow__karnofsky_performance_status,follow__disease_response,follow__reflux_treatment_type,follow__ecog_performance_status,follow__progression_or_recurrence,follow__progression_or_recurrence_anatomic_site,follow__risk_factor,follow__state,follow__created_datetime,follow__updated_datetime
0,bf8357ad-40a5-4edb-ae68-2af7c907d997,c992b973-299c-49d0-b5b5-3bdabb7ef575,0,,182.8,104.32,31.0,,,,,,,,,,,,,released,2019-04-03T17:50:51.184182-05:00,2020-01-13T20:10:25.832065-06:00
1,555ab977-3414-4e18-86d7-5da7ab009a80,c992b973-299c-49d0-b5b5-3bdabb7ef575,0,,,,,Not Reported,,,,,,,,,,,,released,2019-04-03T17:50:51.184182-05:00,2020-01-13T20:10:25.832065-06:00
2,b9b4acd8-4f9c-4166-9763-04eaa0dd5584,ddf7c939-5278-4034-b861-36717e11695d,0,,,,,,,,,,,,,,,"Bone, NOS",,released,2020-02-11T13:29:04.347635-06:00,2020-03-26T16:04:47.706904-05:00
3,c4bc895c-00bb-4ac2-b9f0-e769802a546d,ddf7c939-5278-4034-b861-36717e11695d,0,22.0,,,,,22.0,,,,,,,,,"Connective, subcutaneous and other soft tissue...",,released,2019-04-16T11:15:52.632540-05:00,2020-02-11T13:29:04.347635-06:00
4,0c9b89f2-ff48-44b9-b123-79aa0123d1c6,ddf7c939-5278-4034-b861-36717e11695d,0,,,,,Not Reported,,,,,,,,,,,,released,2019-04-03T17:56:08.557154-05:00,2020-01-13T20:10:25.832065-06:00
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
96,189970f9-b04e-4d36-b77f-252165fc4f30,d000dcfe-181c-46fe-93ba-5fce8336b52d,7,,172.0,90.70,30.0,,,,,,,,,,,,,released,2020-02-11T14:43:34.823831-06:00,2020-03-26T16:04:47.706904-05:00
97,d84e2b5a-8b10-451f-8e51-847c93809c12,3b797720-33ef-45e0-9830-5f6007f1d5a7,7,,185.4,127.91,37.0,,,,,,,,,,,,,released,2020-02-11T14:36:35.999470-06:00,2020-03-26T16:04:47.706904-05:00
98,101294f6-a41d-4442-baa0-d87c4378a09e,cd6f37d5-d192-48b9-a63f-54d8d1e810c6,7,,182.0,79.37,24.0,,,,,,,,,,,,,released,2019-05-15T13:04:39.907559-05:00,2019-05-24T12:11:41.511797-05:00
99,2650764a-22df-4e21-a792-c60bb61cda92,b5fe3fa5-0f46-4e63-8e1f-dc3d28734936,7,,162.5,61.20,23.0,,,,,,,,,,,,,released,2019-05-15T15:01:01.539372-05:00,2019-05-24T12:11:41.511797-05:00


###test 3 - manual verification

**3. Scroll through table manually**

See if anything stands out - empty columns, etc.

The BigQuery table search user interface is useful for this test. The test tier points to isb-etl-open.

ISB-CGC BigQuery table  search [test tier](https://isb-cgc-test.appspot.com/bq_meta_search/).

BigQuery console [isb-project-zero](https://console.cloud.google.com/bigquery?project=high-transit-276919&authuser=2&p=isb-project-zero&d=GDC_metadata&t=rel24_clin_HCMI_followt&page=table).

Run a manual check in the console using the questions from step 1:

Are all the fields labeled?

Is there a table description?

Do the field labels make sense for all fields?
    
Are the labels correct?
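
Scrolling can miss sparsely populated columns, so the "at least one non-null entry per column" expectation can also be checked with a query. This is a sketch under assumptions: `build_non_null_count_sql` and `empty_columns` are hypothetical helpers, and the column list is only a sample of the table's fields.

```python
def build_non_null_count_sql(table_id, columns):
    """Build a query that counts non-NULL values per column.

    COUNT(col) ignores NULLs, so a result of 0 marks an entirely empty column.
    """
    counts = ",\n  ".join(f"COUNT({col}) AS {col}" for col in columns)
    return f"SELECT\n  {counts}\nFROM `{table_id}`"


def empty_columns(non_null_counts):
    """Return the column names whose non-NULL count is zero."""
    return [col for col, n in non_null_counts.items() if n == 0]


sql = build_non_null_count_sql(
    "isb-project-zero.GDC_Clinical_Data.rel24_clin_HCMI_follow",
    ["follow__height", "follow__weight", "follow__bmi"],
)
print(sql)

# Illustrative counts only, not real query results.
flagged = empty_columns({"follow__height": 12, "follow__comorbidity": 0})
print(flagged)

# With the notebook's existing client, the single result row can be
# turned into a dict and passed to empty_columns:
# row = client.query(sql).to_dataframe().iloc[0].to_dict()
```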

###test 4 - case_gdc_id file metadata table count verification

**4. Number of case_id versus BigQuery metadata table**

In [0]:
# Clinical case_id counts; table results below.

# The query below displays the number of cases present in this table.

clin_table = Table('`isb-project-zero.GDC_Clinical_Data.rel24_clin_HCMI_follow`')
clin_query = Query.from_(clin_table) \
                  .select(' DISTINCT case_id, count(*) as count') \
                  .groupby('case_id')

clin_query_clean = str(clin_query).replace('"', "")
#print(clin_query_clean)
clin = client.query(clin_query_clean).to_dataframe()
print('number of case from case_id = ' + str(len(clin.index)))

number of case from case_id = 23


In [0]:
%%bigquery --project isb-project-zero
-- GDC file metadata table case_gdc_id count for clinical below
SELECT case_gdc_id, program_name
FROM `isb-project-zero.GDC_metadata.rel24_caseData`
where program_name = 'HCMI'
group by case_gdc_id, program_name

Unnamed: 0,case_gdc_id,program_name
0,c5e9a845-e0ff-4612-aa8e-c310098503a7,HCMI
1,9e615018-8669-4dd6-8265-25e766a34dd0,HCMI
2,d7fc5874-8ae0-4357-9d8d-01af39ee521f,HCMI
3,5e5684d2-14b0-4823-9956-d69aea6d6beb,HCMI
4,56c07b06-c6d3-4c03-9e57-7be636e7cc5c,HCMI
5,6375cc59-6cd1-4b1e-8de1-510841c3bebd,HCMI
6,e00914a1-997e-44ba-9c88-a5e8741b4ee1,HCMI
7,d000dcfe-181c-46fe-93ba-5fce8336b52d,HCMI
8,3b797720-33ef-45e0-9830-5f6007f1d5a7,HCMI
9,6796ddd8-0295-4965-911c-d4ad67232d13,HCMI


### match

###test 5 - duplication verification

**5. Check for any duplicate rows present in the table**

In [0]:
%%bigquery --project isb-project-zero

SELECT count(case_id) AS count
FROM `isb-project-zero.GDC_Clinical_Data.rel24_clin_HCMI_follow`
group by follow__follow_up_id, case_id, follow__mol_test__count, follow__days_to_follow_up, follow__height, follow__weight, follow__bmi, follow__progression_or_recurrence_type, follow__days_to_progression, follow__comorbidity, follow__days_to_comorbidity, follow__diabetes_treatment_type, follow__karnofsky_performance_status, follow__disease_response, follow__reflux_treatment_type, follow__ecog_performance_status, follow__progression_or_recurrence, follow__progression_or_recurrence_anatomic_site, follow__risk_factor, follow__state, follow__created_datetime, follow__updated_datetime
ORDER BY count DESC
LIMIT 10

Unnamed: 0,count
0,1
1,1
2,1
3,1
4,1
5,1
6,1
7,1
8,1
9,1
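
The query above shows the ten largest group sizes, so a clean table returns all ones. An alternative sketch that returns only the offending rows uses `HAVING COUNT(*) > 1`; an empty result then means no duplicates. `build_duplicate_rows_sql` is a hypothetical helper, and the column list passed to it here is shortened for illustration (a full check would list every column in the table).

```python
def build_duplicate_rows_sql(table_id, columns):
    """Build a query that returns only rows appearing more than once.

    Rows are grouped on every listed column; groups of size 1 are dropped.
    """
    cols = ", ".join(columns)
    return (
        f"SELECT {cols}, COUNT(*) AS dup_count\n"
        f"FROM `{table_id}`\n"
        f"GROUP BY {cols}\n"
        f"HAVING COUNT(*) > 1"
    )


sql = build_duplicate_rows_sql(
    "isb-project-zero.GDC_Clinical_Data.rel24_clin_HCMI_follow",
    ["follow__follow_up_id", "case_id", "follow__state"],
)
print(sql)

# With the notebook's existing client:
# dups = client.query(sql).to_dataframe()
# print('duplicate rows found' if not dups.empty else 'no duplicates')
```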


###test 6 - case_id master clinical data table count verification

**6. Verify case_id count of table against master rel_clinical_data table**

In [0]:
%%bigquery --project isb-project-zero
-- case_id count from the program HCMI clinical table

select distinct case_id, count(case_id) as count
from `isb-project-zero.GDC_Clinical_Data.rel24_clin_HCMI` 
group by case_id
order by count

Unnamed: 0,case_id,count
0,c992b973-299c-49d0-b5b5-3bdabb7ef575,1
1,ddf7c939-5278-4034-b861-36717e11695d,1
2,307c905f-c276-4501-988a-b083f9462a98,1
3,3092d72b-75b1-4ae2-ac38-d4c1cd377e4c,1
4,c811d6dd-992f-435a-80ec-b282a2e38aad,1
5,69eced5b-1e76-45c9-bc9c-2aa71a921c57,1
6,0f89e089-4a1d-4e66-8537-502a788cfe75,1
7,5e5684d2-14b0-4823-9956-d69aea6d6beb,1
8,6375cc59-6cd1-4b1e-8de1-510841c3bebd,1
9,babbc3cb-14b7-4894-b91d-c160f4cb48e0,1


In [0]:
%%bigquery --project isb-project-zero
-- case_id count from the program HCMI_follow clinical table

select distinct case_id, count(case_id) as count
from `isb-project-zero.GDC_Clinical_Data.rel24_clin_HCMI_follow` 
group by case_id
order by count

Unnamed: 0,case_id,count
0,c992b973-299c-49d0-b5b5-3bdabb7ef575,3
1,118bf677-caed-4f7f-a837-dd966dce5ca7,3
2,9e615018-8669-4dd6-8265-25e766a34dd0,3
3,e00914a1-997e-44ba-9c88-a5e8741b4ee1,3
4,c5e9a845-e0ff-4612-aa8e-c310098503a7,3
5,307c905f-c276-4501-988a-b083f9462a98,4
6,56c07b06-c6d3-4c03-9e57-7be636e7cc5c,4
7,e802e579-5293-465c-a867-74e290268299,4
8,8b3b1f24-419e-4043-82be-2bd41268bb0e,4
9,ddf7c939-5278-4034-b861-36717e11695d,5


### match

##Clin HCMI_follow__mol_test

**Testing Full ID** `isb-project-zero.GDC_Clinical_Data.rel24_clin_HCMI_follow__mol_test`

[Table location](https://console.cloud.google.com/bigquery?project=high-transit-276919&authuser=2&p=isb-project-zero&d=GDC_metadata&t=rel24_clin_HCMI_follow__mol_test&page=table)

Source : GDC API

Release version : v24

###test 1 - schema verification

**1. Check schema**

Are all the fields labeled?

Is there a table description?

Do the field labels make sense for all fields?
    
Are the labels correct?

Google documentation column descriptions for [reference](https://cloud.google.com/bigquery/docs/information-schema-tables#column_field_paths_view).

Google documentation table options for [reference](https://cloud.google.com/bigquery/docs/information-schema-tables#options_table).

In [0]:
#return all table information for rel24_clin_HCMI_follow__mol_test

clin_table = Table('`isb-project-zero`.GDC_Clinical_Data.INFORMATION_SCHEMA.TABLES')
clin_query = Query.from_(clin_table) \
                  .select(' table_catalog, table_schema, table_name, table_type ') \
                  .where(clin_table.table_name=='rel24_clin_HCMI_follow__mol_test')

clin_query_clean = str(clin_query).replace('"', "")
clin = client.query(clin_query_clean).to_dataframe()
clin.head()

Unnamed: 0,table_catalog,table_schema,table_name,table_type
0,isb-project-zero,GDC_Clinical_Data,rel24_clin_HCMI_follow__mol_test,BASE TABLE


In [0]:
# return all table options for rel24_clin_HCMI_follow__mol_test

clin_table = Table('`isb-project-zero`.GDC_Clinical_Data.INFORMATION_SCHEMA.TABLE_OPTIONS')
clin_query = Query.from_(clin_table) \
                  .select(' table_name, option_name, option_type, option_value ') \
                  .where(clin_table.table_name=='rel24_clin_HCMI_follow__mol_test')

clin_query_clean = str(clin_query).replace('"', "")
clin = client.query(clin_query_clean).to_dataframe()

# An empty result means no friendly name, description, or labels are set.
# (The original for/else printed FAILED on every run, even when options existed.)
if clin.empty:
    print('QC of friendly name, table description and labels --- FAILED')
else:
    for i in range(len(clin)):
        print(clin['option_name'][i] + '\n')
        print('\t' + clin['option_value'][i] + '\n')
        print('\t' + clin['option_type'][i] + '\n')

QC of friendly name, table description and labels --- FAILED


In [0]:
# check whether any table options are set for rel24_clin_HCMI_follow__mol_test

clin_table = Table('`isb-project-zero`.GDC_Clinical_Data.INFORMATION_SCHEMA.TABLE_OPTIONS')
clin_query = Query.from_(clin_table) \
                  .select(' table_name, option_name, option_type, option_value ') \
                  .where(clin_table.table_name=='rel24_clin_HCMI_follow__mol_test')

clin_query_clean = str(clin_query).replace('"', "")
clin = client.query(clin_query_clean).to_dataframe()
print("Are there any empty cells in the table schema?")
# DataFrame.empty is True when the query returned no rows,
# i.e. no table options (friendly name, description, labels) are set.
clin.empty

Are there any empty cells in the table schema?


True

Example of pulled field descriptions below


In [0]:
# list of field descriptions for table rel24_clin_HCMI_follow__mol_test

clin_table = Table('`isb-project-zero`.GDC_Clinical_Data.INFORMATION_SCHEMA.COLUMN_FIELD_PATHS')
clin_query = Query.from_(clin_table) \
                  .select('table_name, column_name, description') \
                  .where(clin_table.table_name=='rel24_clin_HCMI_follow__mol_test')

clin_query_clean = str(clin_query).replace('"', "")
clin = client.query(clin_query_clean).to_dataframe()


for i in range(len(clin)):
  print(clin['table_name'][i] + '\n')
  print('\t' + clin['column_name'][i] + '\n')
  # a missing description is printed as a blank line instead of raising a TypeError
  print('\t' + (clin['description'][i] or '') + '\n')

rel24_clin_HCMI_follow__mol_test

	follow__mol_test__molecular_test_id

	

rel24_clin_HCMI_follow__mol_test

	follow__follow_up_id

	Reference to ancestor follow__follow_up_id, located in rel24_clin_HCMI_follow.

rel24_clin_HCMI_follow__mol_test

	case_id

	Reference to ancestor case_id, located in rel24_clin_HCMI.

rel24_clin_HCMI_follow__mol_test

	follow__mol_test__variant_type

	The text term used to describe the type of genetic variation.

rel24_clin_HCMI_follow__mol_test

	follow__mol_test__test_result

	The text term used to describe the result of the molecular test. If the test result was a numeric value see test_value.

rel24_clin_HCMI_follow__mol_test

	follow__mol_test__aa_change

	Alphanumeric value used to describe the amino acid change for a specific genetic variant. Example: R116Q.

rel24_clin_HCMI_follow__mol_test

	follow__mol_test__blood_test_normal_range_upper

	Numeric value used to describe the upper limit of the normal range used to describe a healthy individual a

In [0]:
# check for empty schemas in dataset rel24_clin_HCMI_follow__mol_test

clin_table = Table('`isb-project-zero`.GDC_Clinical_Data.INFORMATION_SCHEMA.COLUMN_FIELD_PATHS')
clin_query = Query.from_(clin_table) \
                  .select('table_name, column_name, description') \
                  .where(clin_table.table_name=='rel24_clin_HCMI_follow__mol_test')

clin_query_clean = str(clin_query).replace('"', "")
clin = client.query(clin_query_clean).to_dataframe()
print(clin)

                          table_name  ...                                        description
0   rel24_clin_HCMI_follow__mol_test  ...                                                   
1   rel24_clin_HCMI_follow__mol_test  ...  Reference to ancestor follow__follow_up_id, lo...
2   rel24_clin_HCMI_follow__mol_test  ...  Reference to ancestor case_id, located in rel2...
3   rel24_clin_HCMI_follow__mol_test  ...  The text term used to describe the type of gen...
4   rel24_clin_HCMI_follow__mol_test  ...  The text term used to describe the result of t...
5   rel24_clin_HCMI_follow__mol_test  ...  Alphanumeric value used to describe the amino ...
6   rel24_clin_HCMI_follow__mol_test  ...  Numeric value used to describe the upper limit...
7   rel24_clin_HCMI_follow__mol_test  ...  The text term used to describe an antigen incl...
8   rel24_clin_HCMI_follow__mol_test  ...  The text term or numeric value used to describ...
9   rel24_clin_HCMI_follow__mol_test  ...  The text term used to descr
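
The printed frame above shows a blank description in row 0, which is easy to miss by eye. A minimal sketch for listing such columns automatically follows; `columns_missing_description` is a hypothetical helper, and the three sample rows are copied from the field description output earlier in this test.

```python
def columns_missing_description(rows):
    """Return column names whose description is None, empty, or whitespace.

    rows: iterable of (column_name, description) pairs.
    """
    return [name for name, desc in rows if desc is None or not desc.strip()]


# Sample rows taken from the COLUMN_FIELD_PATHS output shown above.
rows = [
    ("follow__mol_test__molecular_test_id", ""),
    ("follow__follow_up_id",
     "Reference to ancestor follow__follow_up_id, located in rel24_clin_HCMI_follow."),
    ("case_id", "Reference to ancestor case_id, located in rel24_clin_HCMI."),
]
missing = columns_missing_description(rows)
print(missing)

# With the DataFrame returned by the query above:
# missing = columns_missing_description(zip(clin['column_name'], clin['description']))
```

A non-empty list answers "Are all the fields labeled?" with a concrete set of columns to report back to the developer.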

###test 2 - row number verification

**2. Look at table row number and size**

Do these metrics make sense?

In [0]:
%%bigquery --project isb-project-zero
SELECT COUNT(case_id)
FROM `isb-project-zero.GDC_Clinical_Data.rel24_clin_HCMI_follow__mol_test`

Unnamed: 0,f0_
0,85


In [0]:
%%bigquery --project isb-project-zero
SELECT COUNT(case_id)
FROM `isb-project-zero.GDC_Clinical_Data.rel23_clin_HCMI_follow__mol_test`

Unnamed: 0,f0_
0,85


In [0]:
%%bigquery --project isb-project-zero
SELECT *
FROM `isb-project-zero.GDC_Clinical_Data.rel24_clin_HCMI_follow__mol_test`

Unnamed: 0,follow__mol_test__molecular_test_id,follow__follow_up_id,case_id,follow__mol_test__variant_type,follow__mol_test__test_result,follow__mol_test__aa_change,follow__mol_test__blood_test_normal_range_upper,follow__mol_test__antigen,follow__mol_test__test_value,follow__mol_test__molecular_analysis_method,follow__mol_test__gene_symbol,follow__mol_test__second_gene_symbol,follow__mol_test__mismatch_repair_mutation,follow__mol_test__state,follow__mol_test__created_datetime,follow__mol_test__updated_datetime
0,6176220f-c86c-4a56-a53a-daaecd85d4c0,d84e2b5a-8b10-451f-8e51-847c93809c12,3b797720-33ef-45e0-9830-5f6007f1d5a7,,Unknown,,,,,Microsatellite Analysis,Not Applicable,,,released,2020-02-11T14:36:35.999470-06:00,2020-03-26T16:04:47.706904-05:00
1,74fbb7cb-c80f-41f9-ae29-1512df278d54,9f9e8563-87e4-4198-8cc6-48597a37e95b,118bf677-caed-4f7f-a837-dd966dce5ca7,,Unknown,,,,,Microsatellite Analysis,Not Applicable,,,released,2019-05-15T15:00:21.508948-05:00,2019-05-24T12:11:41.511797-05:00
2,d4635280-56ea-47c9-bb74-a963f5b98691,bce629c8-e143-4f17-9f79-3535ce0fc0b3,e802e579-5293-465c-a867-74e290268299,,Unknown,,,,,Microsatellite Analysis,Not Applicable,,,released,2019-05-15T14:59:45.171654-05:00,2019-05-24T12:11:41.511797-05:00
3,6e94ccca-f2f6-4a72-b3c9-13245319a31c,101294f6-a41d-4442-baa0-d87c4378a09e,cd6f37d5-d192-48b9-a63f-54d8d1e810c6,,Unknown,,,,,Microsatellite Analysis,Not Applicable,,,released,2019-05-15T13:04:39.907559-05:00,2019-05-24T12:11:41.511797-05:00
4,34929b31-1d2f-4c24-80a3-c043e1d774f8,2650764a-22df-4e21-a792-c60bb61cda92,b5fe3fa5-0f46-4e63-8e1f-dc3d28734936,,Unknown,,,,,Microsatellite Analysis,Not Applicable,,,released,2019-05-15T15:01:01.539372-05:00,2019-05-24T12:11:41.511797-05:00
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
80,595b3df4-16d9-41c0-90e9-55a38a7084c1,976c1ad7-6d5d-42b2-9137-0c7442aec917,69eced5b-1e76-45c9-bc9c-2aa71a921c57,,Not Reported,,,,,Not Reported,Not Applicable,,Yes,released,2019-04-04T14:14:00.900831-05:00,2019-05-07T12:14:28.332870-05:00
81,d643f08d-1682-444e-aa06-e8346b971426,1609b50e-c1a8-4752-93b1-7c7555bf5133,0f89e089-4a1d-4e66-8537-502a788cfe75,,Not Reported,,,,,Not Reported,Not Applicable,,Yes,released,2019-04-04T14:16:25.317982-05:00,2019-04-29T14:45:04.583966-05:00
82,73d765df-47fa-48b7-9021-9c7dd603f1d1,0743a2fc-035a-427b-8076-fb5bc4d09c70,5e5684d2-14b0-4823-9956-d69aea6d6beb,,Not Reported,,,,,Not Reported,Not Applicable,,Yes,released,2019-04-04T14:17:46.995307-05:00,2020-01-13T20:10:25.832065-06:00
83,c817d7d4-176d-4ab3-9e51-51a8673f602d,c78a59e6-a3a7-411f-be7c-2a6d927159ed,6375cc59-6cd1-4b1e-8de1-510841c3bebd,,Not Reported,,,,,Not Reported,Not Applicable,,Yes,released,2019-04-04T14:18:18.835652-05:00,2020-01-13T20:10:25.832065-06:00


###test 3 - manual verification

**3. Scroll through table manually**

See if anything stands out - empty columns, etc.

The BigQuery table search user interface is useful for this test. The test tier points to isb-etl-open.

ISB-CGC BigQuery table  search [test tier](https://isb-cgc-test.appspot.com/bq_meta_search/).

BigQuery console [isb-project-zero](https://console.cloud.google.com/bigquery?project=high-transit-276919&authuser=2&p=isb-project-zero&d=GDC_metadata&t=rel24_clin_HCMI_follow__mol_test&page=table).

Run a manual check in the console using the questions from step 1:

Are all the fields labeled?

Is there a table description?

Do the field labels make sense for all fields?
    
Are the labels correct?

###test 4 - case_gdc_id file metadata table count verification

**4. Number of case_id versus BigQuery metadata table**

In [0]:
# Clinical case_id counts; table results below.

# The query below displays the number of cases present in this table.

clin_table = Table('`isb-project-zero.GDC_Clinical_Data.rel24_clin_HCMI_follow__mol_test`')
clin_query = Query.from_(clin_table) \
                  .select(' DISTINCT case_id, count(*) as count') \
                  .groupby('case_id')

clin_query_clean = str(clin_query).replace('"', "")
#print(clin_query_clean)
clin = client.query(clin_query_clean).to_dataframe()
print('number of case from case_id = ' + str(len(clin.index)))

number of case from case_id = 20


In [0]:
%%bigquery --project isb-project-zero
-- GDC file metadata table case_gdc_id count for clinical below
SELECT case_gdc_id, program_name
FROM `isb-project-zero.GDC_metadata.rel24_caseData`
where program_name = 'HCMI'
group by case_gdc_id, program_name

Unnamed: 0,case_gdc_id,program_name
0,c5e9a845-e0ff-4612-aa8e-c310098503a7,HCMI
1,9e615018-8669-4dd6-8265-25e766a34dd0,HCMI
2,d7fc5874-8ae0-4357-9d8d-01af39ee521f,HCMI
3,5e5684d2-14b0-4823-9956-d69aea6d6beb,HCMI
4,56c07b06-c6d3-4c03-9e57-7be636e7cc5c,HCMI
5,6375cc59-6cd1-4b1e-8de1-510841c3bebd,HCMI
6,e00914a1-997e-44ba-9c88-a5e8741b4ee1,HCMI
7,d000dcfe-181c-46fe-93ba-5fce8336b52d,HCMI
8,3b797720-33ef-45e0-9830-5f6007f1d5a7,HCMI
9,6796ddd8-0295-4965-911c-d4ad67232d13,HCMI


### no match 

Explanation: The case counts don't match because not every case contains every field group.
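
If the mismatch is only due to cases lacking this field group, every `case_id` in the child table should still appear in the parent. The sketch below checks that with plain sets; `compare_case_ids` is a hypothetical helper and the IDs are placeholders, not real case IDs.

```python
def compare_case_ids(child_ids, parent_ids):
    """Compare two collections of case IDs.

    missing_from_child: parent cases with no rows in the child table
                        (expected when a case lacks the field group).
    orphans_in_child:   child cases absent from the parent (a real QC problem).
    """
    child, parent = set(child_ids), set(parent_ids)
    return {
        "missing_from_child": sorted(parent - child),
        "orphans_in_child": sorted(child - parent),
    }


# Placeholder IDs for illustration only.
result = compare_case_ids(
    child_ids=["case-a", "case-b", "case-b"],
    parent_ids=["case-a", "case-b", "case-c"],
)
print(result)

# With DataFrames from the queries above (column names as in this notebook):
# result = compare_case_ids(clin['case_id'], meta['case_gdc_id'])
```

Under this reading, a non-empty `missing_from_child` is acceptable, while any entry in `orphans_in_child` would need follow-up.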

###test 5 - duplication verification

**5. Check for any duplicate rows present in the table**

In [0]:
%%bigquery --project isb-project-zero

SELECT count(case_id) AS count
FROM `isb-project-zero.GDC_Clinical_Data.rel24_clin_HCMI_follow__mol_test`
group by follow__mol_test__molecular_test_id, follow__follow_up_id, case_id, follow__mol_test__variant_type, follow__mol_test__test_result, follow__mol_test__aa_change, follow__mol_test__blood_test_normal_range_upper, follow__mol_test__antigen, follow__mol_test__test_value, follow__mol_test__molecular_analysis_method, follow__mol_test__gene_symbol, follow__mol_test__second_gene_symbol, follow__mol_test__mismatch_repair_mutation, follow__mol_test__state, follow__mol_test__created_datetime, follow__mol_test__updated_datetime
ORDER BY count DESC
LIMIT 10

Unnamed: 0,count
0,1
1,1
2,1
3,1
4,1
5,1
6,1
7,1
8,1
9,1


###test 6 - case_id master clinical data table count verification

**6. Verify case_id count of table against master rel_clinical_data table**

In [0]:
%%bigquery --project isb-project-zero
-- case_id count from the program HCMI clinical table

select distinct case_id, count(case_id) as count
from `isb-project-zero.GDC_Clinical_Data.rel24_clin_HCMI` 
group by case_id
order by count

Unnamed: 0,case_id,count
0,1a81e2c4-b637-4028-ab28-4c64bc8bab76,1
1,60acb146-43b3-44a3-9791-79284dae285e,1
2,5fadbb10-de7f-482f-8460-54d513b5de89,1
3,3adab9b3-da32-44df-9525-5f81e8cbfc02,1
4,ef5d356c-89c9-48b1-a567-519aa478f166,1
...,...,...
327,18bce2c3-5060-4282-b7a0-11ada9a3eebf,1
328,f07f2302-2276-4d5c-b52b-0b900464d39c,1
329,9d05b62c-6948-4a15-8599-503436349fbc,1
330,89c3d38f-8eda-4f62-bb9d-8d3b818c969f,1


In [0]:
%%bigquery --project isb-project-zero
-- case_id count from the program HCMI_follow__mol_test clinical table

select distinct case_id, count(case_id) as count
from `isb-project-zero.GDC_Clinical_Data.rel24_clin_HCMI_follow__mol_test` 
group by case_id
order by count

Unnamed: 0,case_id,count
0,d7fc5874-8ae0-4357-9d8d-01af39ee521f,1
1,ddf7c939-5278-4034-b861-36717e11695d,2
2,307c905f-c276-4501-988a-b083f9462a98,2
3,c5e9a845-e0ff-4612-aa8e-c310098503a7,2
4,6796ddd8-0295-4965-911c-d4ad67232d13,2
5,3092d72b-75b1-4ae2-ac38-d4c1cd377e4c,3
6,c811d6dd-992f-435a-80ec-b282a2e38aad,3
7,69eced5b-1e76-45c9-bc9c-2aa71a921c57,3
8,0f89e089-4a1d-4e66-8537-502a788cfe75,3
9,5e5684d2-14b0-4823-9956-d69aea6d6beb,3


### no match 

Explanation: The case counts don't match because not every case contains every field group.

###test 7 - follow__follow_up_id count verification

**7. QC follow__follow_up_id count against parent follow table if applicable**

In [0]:
%%bigquery --project isb-project-zero
-- follow__follow_up_id count from the program HCMI_follow clinical table

select distinct follow__follow_up_id, count(follow__follow_up_id) as count
from `isb-project-zero.GDC_Clinical_Data.rel24_clin_HCMI_follow` 
group by follow__follow_up_id
order by count

Unnamed: 0,follow__follow_up_id,count
0,bf8357ad-40a5-4edb-ae68-2af7c907d997,1
1,555ab977-3414-4e18-86d7-5da7ab009a80,1
2,b9b4acd8-4f9c-4166-9763-04eaa0dd5584,1
3,c4bc895c-00bb-4ac2-b9f0-e769802a546d,1
4,0c9b89f2-ff48-44b9-b123-79aa0123d1c6,1
...,...,...
96,189970f9-b04e-4d36-b77f-252165fc4f30,1
97,d84e2b5a-8b10-451f-8e51-847c93809c12,1
98,101294f6-a41d-4442-baa0-d87c4378a09e,1
99,2650764a-22df-4e21-a792-c60bb61cda92,1


In [0]:
%%bigquery --project isb-project-zero
-- follow__follow_up_id count from the program HCMI_follow__mol_test clinical table

select distinct follow__follow_up_id, count(follow__follow_up_id) as count
from `isb-project-zero.GDC_Clinical_Data.rel24_clin_HCMI_follow__mol_test` 
group by follow__follow_up_id
order by count

Unnamed: 0,follow__follow_up_id,count
0,39a6aa78-4491-4db6-b784-645669fd45cc,1
1,a8bfd279-1ebd-4508-a4c9-94165b3f4596,1
2,4e19d259-80a5-4f64-a99c-ab05283a5d94,1
3,62089acc-cde0-477d-a372-b195b417590b,2
4,cb64c6d7-4293-4ef4-8f60-80b78bdb0cf8,2
5,dc1f3d31-6143-41a0-a6c4-24c716062f52,2
6,815f7ee9-436e-4699-ab21-2252e88bac6d,3
7,71239033-f0c1-428c-b818-0bf7d8426b90,3
8,976c1ad7-6d5d-42b2-9137-0c7442aec917,3
9,1609b50e-c1a8-4752-93b1-7c7555bf5133,3
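
Per-ID counts in a child table can differ from the parent whenever one follow-up record has several molecular tests, so comparing counts alone may not be conclusive. This is stated as an assumption about the one-to-many layout, not a confirmed explanation. A complementary check is whether every `follow__follow_up_id` in the child table exists in the parent. The sketch below builds an anti-join for that; `build_orphan_key_sql` is a hypothetical helper.

```python
def build_orphan_key_sql(child_table, parent_table, key):
    """Build an anti-join returning child keys with no match in the parent."""
    return (
        f"SELECT DISTINCT child.{key}\n"
        f"FROM `{child_table}` AS child\n"
        f"LEFT JOIN `{parent_table}` AS parent\n"
        f"  ON child.{key} = parent.{key}\n"
        f"WHERE parent.{key} IS NULL"
    )


sql = build_orphan_key_sql(
    "isb-project-zero.GDC_Clinical_Data.rel24_clin_HCMI_follow__mol_test",
    "isb-project-zero.GDC_Clinical_Data.rel24_clin_HCMI_follow",
    "follow__follow_up_id",
)
print(sql)

# With the notebook's existing client, an empty result means every
# child follow__follow_up_id is present in the parent table:
# orphans = client.query(sql).to_dataframe()
```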


### no match 

Explanation: not sure of reason. Will confirm with Lauren W of the possible 