QC of ETL starting with GDC release 24 clinical tables.

This notebook covers QC of the clinical data_category for program BEATAML.

##QC table checklist 


**1. Check schema**

Are all the fields labeled?

Is there a table description?

Do the field labels make sense for all fields?
    
Are the labels correct?

**2. Look at table row number and size**

Do these metrics make sense?

**3. Scroll through table manually**

See if anything stands out - empty columns, etc.

*Note from developer:
There are some columns which are sparsely populated (so they might look empty if you’re just scrolling through the table in the GUI), but there should be at least one non-null entry for every column in every table.*

**4. Number of submitter_id versus BigQuery metadata table**

**5. Number of case_id versus BigQuery metadata table**

**6. Check for any duplicate rows present in the table**

**7. Verify count of table against main program clinical table if available**

**8. Verify submitter_id count of table against master rel_clinical_data table**

**9. Verify case_id count of table against master rel_clinical_data table**

##Reference material



*   [NextGenETL](https://github.com/isb-cgc/NextGenETL) GitHub repository
*   [ETL QC SOP draft](https://docs.google.com/document/d/1Wskf3BxJLkMjhIXD62B6_TG9h5KRcSp8jSAGqcCP1lQ/edit)

##Before you begin

You need to load the BigQuery module, authenticate yourself, create a client variable, and load the necessary libraries.


In [1]:
from google.colab import auth
try:
  auth.authenticate_user()
  print('You have been successfully authenticated!')
except Exception as err:  # report the reason instead of hiding it
  print('You have not been authenticated:', err)

You have been successfully authenticated!


In [2]:
from google.cloud import bigquery
try:
  project_id = 'isb-project-zero' # Replace with your own project ID
  client = bigquery.Client(project=project_id)
  print('BigQuery client successfully initialized')
except Exception as err:  # report the reason instead of hiding it
  print('Failed to initialize the BigQuery client:', err)

BigQuery client successfully initialized


In [3]:
#Install pypika to build a Query 
!pip install pypika
# Import from PyPika
from pypika import Query, Table, Field, Order

import pandas

Collecting pypika
  Downloading https://files.pythonhosted.org/packages/ea/22/63a4b2194462c54de8450de3d61eb44eddc2e7a85b06792603af09c606e1/PyPika-0.37.7.tar.gz (53kB)
Building wheels for collected packages: pypika
  Building wheel for pypika (setup.py) ... done
  Created wheel for pypika: filename=PyPika-0.37.7-py2.py3-none-any.whl size=42747 sha256=b692011b3cf656c9b8a8bfa2640f159fb6b8896f425263801b710476e719f0be
  Stored in directory: /root/.cache/pip/wheels/40/b2/20/cf67d3c67186b46241b5069c93da2c9beedbb3f08dba75fffe
Successfully built pypika
Installing collected packag

## READY TO BEGIN TESTING

##Program BEATAML 

**Testing Full ID** `isb-project-zero.GDC_Clinical_Data.rel24_clin_BEATAML1_0`

[Table location](https://console.cloud.google.com/bigquery?authuser=1&folder=&organizationId=&project=isb-project-zero&p=isb-project-zero&d=GDC_Clinical_Data&t=rel24_clin_BEATAML1_0&page=table)

Source : GDC API

Date Created : 	Apr 1, 2020, 7:06:22 PM

Release version : v24


##test 1 - schema verification

**1. Check schema**

Are all the fields labeled?

Is there a table description?

Do the field labels make sense for all fields?

Are the labels correct?

Google documentation column descriptions for [reference](https://cloud.google.com/bigquery/docs/information-schema-tables#column_field_paths_view).

Google documentation table options for [reference](https://cloud.google.com/bigquery/docs/information-schema-tables#options_table).

In [0]:
#return table information for rel24_clin_BEATAML1_0

clin_table = Table('`isb-project-zero`.GDC_Clinical_Data.INFORMATION_SCHEMA.TABLES')
clin_query = Query.from_(clin_table) \
                  .select(' table_catalog, table_schema, table_name, table_type ') \
                  .where(clin_table.table_name=='rel24_clin_BEATAML1_0')

# PyPika wraps identifiers in double quotes; strip them for BigQuery
clin_query_clean = str(clin_query).replace('"', "")
clin = client.query(clin_query_clean).to_dataframe()
clin.head()

Unnamed: 0,table_catalog,table_schema,table_name,table_type
0,isb-project-zero,GDC_Clinical_Data,rel24_clin_BEATAML1_0,BASE TABLE


In [0]:
#return table options (friendly name, description, labels) for rel24_clin_BEATAML1_0

clin_table = Table('`isb-project-zero`.GDC_Clinical_Data.INFORMATION_SCHEMA.TABLE_OPTIONS')
clin_query = Query.from_(clin_table) \
                  .select(' table_name, option_name, option_type, option_value ') \
                  .where(clin_table.table_name=='rel24_clin_BEATAML1_0')

clin_query_clean = str(clin_query).replace('"', "")
clin = client.query(clin_query_clean).to_dataframe()

# An empty result means no table options are set, so the check fails
if clin.empty:
    print('QC of friendly name, table description and labels --- FAILED')
else:
    for i in range(len(clin)):
        print(clin['option_name'][i] + '\n')
        print('\t' + clin['option_value'][i] + '\n')
        print('\t' + clin['option_type'][i] + '\n')

QC of friendly name, table description and labels --- FAILED


In [0]:
#check whether any table options are set for rel24_clin_BEATAML1_0

clin_table = Table('`isb-project-zero`.GDC_Clinical_Data.INFORMATION_SCHEMA.TABLE_OPTIONS')
clin_query = Query.from_(clin_table) \
                  .select(' table_name, option_name, option_type, option_value ') \
                  .where(clin_table.table_name=='rel24_clin_BEATAML1_0')

clin_query_clean = str(clin_query).replace('"', "")
clin = client.query(clin_query_clean).to_dataframe()
print("Is the table options result empty (no friendly name, description or labels set)?")
clin.empty

Is the table options result empty (no friendly name, description or labels set)?


True
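
The empty result above means that no table-level options are set at all. A minimal sketch of turning that into an explicit per-option check follows. It assumes that `friendly_name`, `description` and `labels` are the option names of interest; `missing_table_options` is a hypothetical helper and not part of the notebook.

```python
# Option names assumed to be required for a published table.
REQUIRED_OPTIONS = ("friendly_name", "description", "labels")

def missing_table_options(option_names, required=REQUIRED_OPTIONS):
    """Return the required option names that do not appear in option_names."""
    present = set(option_names)
    return [name for name in required if name not in present]

# With the notebook's DataFrame this could be called as
# missing_table_options(clin['option_name']).
print(missing_table_options(["description"]))  # only a description is set
print(missing_table_options([]))               # nothing is set, as in the result above
```

Listing the missing options by name makes the failure message more specific than a single pass/fail line.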

Example of the field descriptions pulled from the table:


In [17]:
#list of field descriptions for table rel24_clin_BEATAML1_0

clin_table = Table('`isb-project-zero`.GDC_Clinical_Data.INFORMATION_SCHEMA.COLUMN_FIELD_PATHS')
clin_query = Query.from_(clin_table) \
                  .select('table_name, column_name, description') \
                  .where(clin_table.table_name=='rel24_clin_BEATAML1_0')

clin_query_clean = str(clin_query).replace('"', "")
clin = client.query(clin_query_clean).to_dataframe()


for i in range(len(clin)):
  print(clin['table_name'][i] + '\n')
  print('\t' + clin['column_name'][i] + '\n')
  print('\t' + clin['description'][i] + '\n')

rel24_clin_BEATAML1_0

	submitter_id

	

rel24_clin_BEATAML1_0

	case_id

	

rel24_clin_BEATAML1_0

	primary_site

	

rel24_clin_BEATAML1_0

	disease_type

	

rel24_clin_BEATAML1_0

	index_date

	

rel24_clin_BEATAML1_0

	proj__name

	Display name for the project

rel24_clin_BEATAML1_0

	proj__project_id

	

rel24_clin_BEATAML1_0

	demo__demographic_id

	

rel24_clin_BEATAML1_0

	demo__gender

	Text designations that identify gender. Gender is described as the assemblage of properties that distinguish people on the basis of their societal roles. [Explanatory Comment 1: Identification of gender is based upon self-report and may come from a form, questionnaire, interview, etc.]

rel24_clin_BEATAML1_0

	demo__race

	An arbitrary classification of a taxonomic group that is a division of a species. It usually arises as a consequence of geographical isolation within a species and is characterized by shared heredity, physical attributes and behavior, and in the case of humans, by common histo

In [21]:
# check for empty field descriptions; note that this cell queries rel23_clin_BEATAML1_0

clin_table = Table('`isb-project-zero`.GDC_Clinical_Data.INFORMATION_SCHEMA.COLUMN_FIELD_PATHS')
clin_query = Query.from_(clin_table) \
                  .select('table_name, column_name, description') \
                  .where(clin_table.table_name=='rel23_clin_BEATAML1_0')

clin_query_clean = str(clin_query).replace('"', "")
clin = client.query(clin_query_clean).to_dataframe()

print(clin)

               table_name  ...                                        description
0   rel23_clin_BEATAML1_0  ...                                                   
1   rel23_clin_BEATAML1_0  ...                                                   
2   rel23_clin_BEATAML1_0  ...                                                   
3   rel23_clin_BEATAML1_0  ...                                                   
4   rel23_clin_BEATAML1_0  ...                                                   
5   rel23_clin_BEATAML1_0  ...                                                   
6   rel23_clin_BEATAML1_0  ...  Text designations that identify gender. Gender...
7   rel23_clin_BEATAML1_0  ...  An arbitrary classification of a taxonomic gro...
8   rel23_clin_BEATAML1_0  ...  An individual's self-described social and cult...
9   rel23_clin_BEATAML1_0  ...  The survival state of the person registered on...
10  rel23_clin_BEATAML1_0  ...  The patient's age (in years) on the reference ...
11  rel23_clin_B
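
To avoid reading the printed output by eye, the columns without a description can be collected directly. This is a small sketch; `undocumented_columns` is a hypothetical helper, and the sample rows are copied from the rel24 field description output earlier in this test.

```python
def undocumented_columns(rows):
    """Return column names whose description is missing or blank.

    rows is an iterable of (column_name, description) pairs.
    """
    return [name for name, description in rows
            if description is None or not str(description).strip()]

# Sample rows copied from the field description output for rel24_clin_BEATAML1_0.
sample_rows = [
    ("submitter_id", ""),
    ("case_id", ""),
    ("proj__name", "Display name for the project"),
    ("proj__project_id", ""),
]
print(undocumented_columns(sample_rows))
```

With the notebook's DataFrame, the pairs could be produced with `zip(clin['column_name'], clin['description'])`.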

##test 2 - row number verification

**2. Look at table row number and size**

Do these metrics make sense?
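
Row count and storage size can also be read from the table metadata instead of the console. The sketch below assumes a table object exposing `num_rows` and `num_bytes`, which is what `client.get_table()` returns in the `google-cloud-bigquery` library. `summarize_table` is a hypothetical helper, and the stand-in object exists only so the example runs without credentials.

```python
from types import SimpleNamespace

def summarize_table(table):
    """Format the row count and size of a BigQuery table object."""
    size_mib = table.num_bytes / (1024 * 1024)
    return f"{table.num_rows} rows, {size_mib:.2f} MiB"

# Stand-in for
# client.get_table('isb-project-zero.GDC_Clinical_Data.rel24_clin_BEATAML1_0').
# The byte count is a placeholder, not a measured value.
fake_table = SimpleNamespace(num_rows=639, num_bytes=1024 * 1024)
print(summarize_table(fake_table))
```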

### pass

In [0]:
%%bigquery --project isb-project-zero
SELECT COUNT(submitter_id)
FROM `isb-project-zero.GDC_Clinical_Data.rel24_clin_BEATAML1_0`

Unnamed: 0,f0_
0,639


In [0]:
%%bigquery --project isb-project-zero
SELECT COUNT(case_id)
FROM `isb-project-zero.GDC_Clinical_Data.rel23_clin_BEATAML1_0`

Unnamed: 0,f0_
0,639


In [0]:
%%bigquery --project isb-project-zero
SELECT *
FROM `isb-project-zero.GDC_Clinical_Data.rel24_clin_BEATAML1_0`

Unnamed: 0,submitter_id,case_id,primary_site,disease_type,index_date,demo__demographic_id,demo__gender,demo__race,demo__ethnicity,demo__vital_status,demo__age_at_index,demo__state,demo__created_datetime,demo__updated_datetime,diag__diagnosis_id,diag__primary_diagnosis,diag__progression_or_recurrence,diag__site_of_resection_or_biopsy,diag__age_at_diagnosis,diag__tumor_grade,diag__last_known_disease_status,diag__morphology,diag__tumor_stage,diag__tissue_or_organ_of_origin,diag__state,diag__created_datetime,diag__updated_datetime,diag__anno__annotation_id,diag__anno__category,diag__anno__classification,diag__anno__notes,diag__anno__status,diag__anno__state,diag__anno__created_datetime,diag__anno__updated_datetime,diag__anno__legacy_created_datetime,state,created_datetime,updated_datetime
0,2196,21e1419b-5b51-4d3a-8aeb-33b42eb22b05,Hematopoietic and reticuloendothelial systems,Myeloid Leukemias,,de8d2015-54ed-4afc-beef-0f0beab5630f,male,asian,Unknown,Alive,,released,2019-03-12T13:55:26.655310-05:00,2019-08-21T12:42:58.829706-05:00,2b20775e-da88-448c-b9b3-b3172a8e42ce,"Acute myeloid leukemia, NOS",no,Not Reported,20433.0,Not Reported,not reported,9861/3,Adverse,Bone marrow,released,2019-03-12T12:10:30.725957-05:00,2019-08-21T12:42:58.829706-05:00,,,,,,,,,,released,2018-08-21T13:17:47.034310-05:00,2019-09-11T07:01:10.489780-05:00
1,2546,61cd4099-3b46-4a83-982b-98ec2d3bed06,Hematopoietic and reticuloendothelial systems,Myeloid Leukemias,,eac1ddc6-f541-491c-b113-dd96fa4a8a0d,male,asian,Unknown,Dead,,released,2019-03-12T13:55:26.655310-05:00,2019-08-21T12:42:58.829706-05:00,80d9b16d-15ce-41e6-9b64-867e4cfc0a5d,"Acute myeloid leukemia, NOS",no,Not Reported,28787.0,Not Reported,not reported,9861/3,Intermediate,Bone marrow,released,2019-03-12T12:10:30.725957-05:00,2019-08-21T12:42:58.829706-05:00,,,,,,,,,,released,2018-08-21T13:17:47.034310-05:00,2019-09-11T07:01:10.489780-05:00
2,2374,a8e0b40a-1158-4332-978a-3b9a620a4c4d,Hematopoietic and reticuloendothelial systems,Myeloid Leukemias,,ed249f49-b658-4fed-b80e-6fb3a337f465,male,asian,not hispanic or latino,Dead,,released,2019-03-12T13:55:26.655310-05:00,2019-08-21T12:42:58.829706-05:00,2b05c636-23f7-4075-9b77-f278636568de,Acute myeloid leukemia with mutated NPM1,yes,Not Reported,20567.0,Not Reported,not reported,9861/3,Intermediate,Bone marrow,released,2019-03-12T12:10:30.725957-05:00,2019-08-21T12:42:58.829706-05:00,,,,,,,,,,released,2018-08-21T13:17:47.034310-05:00,2019-09-11T07:01:10.489780-05:00
3,2274,71153ee9-9eed-4bf7-ad10-beb4f728ef22,Hematopoietic and reticuloendothelial systems,Myeloid Leukemias,,7c9c79ac-ae6f-4fd8-9a2a-f112ca4681e3,male,asian,not hispanic or latino,Alive,,released,2019-03-12T13:55:26.655310-05:00,2019-08-21T12:42:58.829706-05:00,a2e8ba6d-3cd1-489c-8ce9-0ef7b8891793,Acute myeloid leukemia with mutated CEBPA,yes,Not Reported,11668.0,Not Reported,not reported,9861/3,Favorable,Bone marrow,released,2019-03-12T12:10:30.725957-05:00,2019-08-21T12:42:58.829706-05:00,,,,,,,,,,released,2018-08-21T13:17:47.034310-05:00,2019-09-11T07:01:10.489780-05:00
4,2204,a0d7b280-8133-420f-af41-038c22d2a034,Hematopoietic and reticuloendothelial systems,Myeloid Leukemias,,eab75341-23a9-4b59-a944-eab9e9c7bd8a,female,asian,not hispanic or latino,Alive,,released,2019-03-12T13:55:26.655310-05:00,2019-08-21T12:42:58.829706-05:00,1f9af446-c7eb-419e-a22e-ae74b5bc6658,"Atypical chronic myeloid leukemia, BCR/ABL neg...",no,Not Reported,17394.0,Not Reported,not reported,9876/3,Intermediate,Bone marrow,released,2019-03-12T12:10:30.725957-05:00,2019-08-21T12:42:58.829706-05:00,,,,,,,,,,released,2018-08-21T13:17:47.034310-05:00,2019-09-11T07:01:10.489780-05:00
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
634,2018,3e0a9baf-dc38-405c-87ba-af489c4a869b,Hematopoietic and reticuloendothelial systems,Myeloid Leukemias,,b26c4d2f-1e8d-4af7-b756-3e9da2c06f92,female,black or african american,Unknown,Unknown,,released,2019-03-12T13:55:26.655310-05:00,2019-08-21T12:42:58.829706-05:00,7abf174b-cd58-4114-a2e3-29a799f183e9,"Acute myeloid leukemia, minimal differentiation",no,Not Reported,16274.0,Not Reported,not reported,9872/3,Adverse,Bone marrow,released,2019-03-12T12:10:30.725957-05:00,2019-08-21T12:42:58.829706-05:00,,,,,,,,,,released,2018-08-21T13:17:47.034310-05:00,2019-09-11T07:01:10.489780-05:00
635,2125,baf61533-bb00-42e3-805c-3d24e735eeb8,Hematopoietic and reticuloendothelial systems,Myeloid Leukemias,,d3ed4d28-3880-4baf-bcc6-f23b21c3c561,male,black or african american,not hispanic or latino,Dead,,released,2019-03-12T13:55:26.655310-05:00,2019-08-21T12:42:58.829706-05:00,d263476a-6f7b-4ba3-be74-1d30a176b96a,Acute myeloid leukemia with myelodysplasia-rel...,no,Not Reported,22809.0,Not Reported,not reported,9895/3,Adverse,Bone marrow,released,2019-03-12T12:10:30.725957-05:00,2019-08-21T12:42:58.829706-05:00,,,,,,,,,,released,2018-08-21T13:17:47.034310-05:00,2019-09-11T07:01:10.489780-05:00
636,2158,f2273e2b-41e6-4509-aecd-1683ccefffc2,Hematopoietic and reticuloendothelial systems,Myeloid Leukemias,,a2b4ca94-96f4-41e0-8021-459d3f9636dc,female,black or african american,Unknown,Dead,,released,2019-03-12T13:55:26.655310-05:00,2019-08-21T12:42:58.829706-05:00,331f81b9-1faf-4ff5-8c3d-8abf7e40ae0e,Acute myeloid leukemia with myelodysplasia-rel...,no,Not Reported,19276.0,Not Reported,not reported,9895/3,Adverse,Bone marrow,released,2019-03-12T12:10:30.725957-05:00,2019-08-21T12:42:58.829706-05:00,,,,,,,,,,released,2018-08-21T13:17:47.034310-05:00,2019-09-11T07:01:10.489780-05:00
637,2330,f12b28ba-d793-4cea-bfc9-e5b0c0d86730,Hematopoietic and reticuloendothelial systems,Myeloid Leukemias,,62a19c29-4e00-4c1d-9e2c-bc69b3bec409,male,black or african american,not hispanic or latino,Alive,,released,2019-03-12T13:55:26.655310-05:00,2019-08-21T12:42:58.829706-05:00,90c25d1d-4381-4fb8-82b5-927226abe263,Acute myeloid leukemia with myelodysplasia-rel...,no,Not Reported,26585.0,Not Reported,not reported,9895/3,Adverse,Bone marrow,released,2019-03-12T12:10:30.725957-05:00,2019-08-21T12:42:58.829706-05:00,,,,,,,,,,released,2018-08-21T13:17:47.034310-05:00,2019-09-11T07:01:10.489780-05:00


##test 3 - manual verification

**3. Scroll through table manually**

See if anything stands out - empty columns, etc.

The BigQuery table search user interface is useful for this test. The test tier points to isb-etl-open.

ISB-CGC BigQuery table search [test tier](https://isb-cgc-test.appspot.com/bq_meta_search/).

BigQuery console [isb-project-zero](https://console.cloud.google.com/bigquery?authuser=1&folder=&organizationId=&project=isb-project-zero&p=isb-project-zero&d=GDC_Clinical_Data&t=rel24_clin_BEATAML1_0&page=table).

Run a manual check in the console using the questions from test 1:

Are all the fields labeled?

Is there a table description?

Do the field labels make sense for all fields?
    
Are the labels correct?

*Note from developer:
There are some columns which are sparsely populated (so they might look empty if you’re just scrolling through the table in the GUI), but there should be at least one non-null entry for every column in every table.*
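
The developer's note can be checked with a query instead of by scrolling, because `COUNT(column)` in SQL counts only non-null values. The sketch below builds such a query from a list of column names. `build_non_null_query` and `empty_columns` are hypothetical helpers, and the column list is a shortened example.

```python
def build_non_null_query(table_id, columns):
    """Build a query returning the non-null count of every listed column."""
    counts = ",\n  ".join(f"COUNT({col}) AS {col}" for col in columns)
    return f"SELECT\n  {counts}\nFROM `{table_id}`"

def empty_columns(counts):
    """Return the columns whose non-null count is zero."""
    return [col for col, n in counts.items() if n == 0]

query = build_non_null_query(
    "isb-project-zero.GDC_Clinical_Data.rel24_clin_BEATAML1_0",
    ["submitter_id", "index_date", "diag__anno__notes"],
)
print(query)

# The single result row could be passed on as a dict, for example
# empty_columns(client.query(query).to_dataframe().iloc[0].to_dict()).
# Illustrative counts with made-up column names, not real results:
print(empty_columns({"col_a": 3, "col_b": 0}))
```

Any column returned by `empty_columns` would contradict the expectation of at least one non-null entry per column.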


##test 4 - submitter_id file metadata table count verification

**4. Number of submitter_id cases versus BigQuery metadata table**

In [0]:
# clinical submitter_id count table results below

# The query below returns the number of cases present in this table.

clin_table = Table('`isb-project-zero.GDC_Clinical_Data.rel24_clin_BEATAML1_0`')
clin_query = Query.from_(clin_table) \
                  .select(' DISTINCT submitter_id, count(*) as count') \
                  .groupby('submitter_id')

clin_query_clean = str(clin_query).replace('"', "")
#print(clin_query_clean)
clin = client.query(clin_query_clean).to_dataframe()
print('number of cases from submitter_id = ' + str(len(clin.index)))


number of cases from submitter_id = 639


In [0]:
%%bigquery --project isb-project-zero
-- GDC file metadata table submitter_id count for clinical below

SELECT case_barcode, program_name
FROM `isb-project-zero.GDC_metadata.rel24_caseData`
where program_name = 'BEATAML1.0'
and active_file_count != 0
group by case_barcode, program_name

Unnamed: 0,case_barcode,program_name
0,2040,BEATAML1.0
1,2051,BEATAML1.0
2,2086,BEATAML1.0
3,2091,BEATAML1.0
4,2098,BEATAML1.0
...,...,...
634,2284,BEATAML1.0
635,2301,BEATAML1.0
636,2453,BEATAML1.0
637,2238,BEATAML1.0
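
Comparing the two results by row count alone can hide cases where the totals match but the identifiers differ. A minimal sketch of a set-based comparison follows; `compare_ids` is a hypothetical helper and the short ID lists are examples, not full query results.

```python
def compare_ids(clinical_ids, metadata_ids):
    """Report identifiers present on only one side of the comparison."""
    clinical, metadata = set(clinical_ids), set(metadata_ids)
    return {
        "only_in_clinical": sorted(clinical - metadata),
        "only_in_metadata": sorted(metadata - clinical),
        "match": clinical == metadata,
    }

# Example IDs only; in the notebook these would come from the two query
# results, e.g. the submitter_id column of the clinical table and the
# case_barcode column of the metadata table.
report = compare_ids(["2040", "2051", "2196"], ["2040", "2051", "2086"])
print(report)
```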


##test 5 - case_gdc_id file metadata table count verification

**5. Number of case_id versus BigQuery metadata table**

In [0]:
# clinical case_id count table results below

# The query below returns the number of cases present in this table.

clin_table = Table('`isb-project-zero.GDC_Clinical_Data.rel24_clin_BEATAML1_0`')
clin_query = Query.from_(clin_table) \
                  .select(' DISTINCT case_id, count(*) as count') \
                  .groupby('case_id')

clin_query_clean = str(clin_query).replace('"', "")
#print(clin_query_clean)
clin = client.query(clin_query_clean).to_dataframe()
print('number of cases from case_id = ' + str(len(clin.index)))


number of cases from case_id = 639


In [0]:
%%bigquery --project isb-project-zero
-- GDC file metadata table case_gdc_id count for clinical below
SELECT case_gdc_id, program_name
FROM `isb-project-zero.GDC_metadata.rel24_caseData`
where program_name = 'BEATAML1.0'
and active_file_count != 0
group by case_gdc_id, program_name

Unnamed: 0,case_gdc_id,program_name
0,d65a1730-ff79-4115-b087-98a871bc325a,BEATAML1.0
1,cbbd299d-b8a3-4cd4-a73b-b8e568bbc58f,BEATAML1.0
2,f6ca403f-0f94-496b-ad0a-ad48995eef4f,BEATAML1.0
3,738b69a3-7100-4b43-9a53-a53d2e2f6092,BEATAML1.0
4,87268ac4-9c03-4ddc-b9ff-32b86b4a394b,BEATAML1.0
...,...,...
634,38a7758c-c96b-4d0d-b3e0-29aa77fd7014,BEATAML1.0
635,13fc70d1-3574-4651-9d15-e4bcc4bb6382,BEATAML1.0
636,1b3392f0-c9f5-4e86-be22-021eb88f5611,BEATAML1.0
637,bed1a005-667d-4223-806e-5b74ef6b8254,BEATAML1.0


##test 6 - duplication verification

**6. Check for any duplicate rows present in the table**



### pass

In [0]:
%%bigquery --project isb-project-zero

SELECT count(submitter_id) as count
FROM `isb-project-zero.GDC_Clinical_Data.rel24_clin_BEATAML1_0` 
GROUP BY submitter_id, case_id, primary_site, disease_type, index_date, state, created_datetime, updated_datetime, demo__demographic_id, demo__gender, demo__race, demo__ethnicity, demo__vital_status, demo__age_at_index, demo__state, demo__created_datetime, demo__updated_datetime, diag__diagnosis_id, diag__primary_diagnosis, diag__progression_or_recurrence, diag__site_of_resection_or_biopsy, diag__age_at_diagnosis, diag__tumor_grade, diag__last_known_disease_status, diag__morphology, diag__tumor_stage, diag__tissue_or_organ_of_origin, diag__state, diag__created_datetime, diag__updated_datetime, diag__anno__annotation_id, diag__anno__category, diag__anno__classification, diag__anno__notes, diag__anno__status, diag__anno__state, diag__anno__created_datetime, diag__anno__updated_datetime, diag__anno__legacy_created_datetime
order by count desc
limit 10

Unnamed: 0,count
0,1
1,1
2,1
3,1
4,1
5,1
6,1
7,1
8,1
9,1
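
The same check can be run on a DataFrame that is already loaded, such as the `SELECT *` result from test 2. This sketch uses `DataFrame.duplicated` from pandas; the tiny frame below is example data, and `find_duplicate_rows` is a hypothetical helper.

```python
import pandas as pd

def find_duplicate_rows(df):
    """Return every row that appears more than once, across all columns."""
    return df[df.duplicated(keep=False)]

# Example data: the second and third rows are identical.
example = pd.DataFrame({
    "submitter_id": ["2196", "2546", "2546"],
    "demo__gender": ["male", "male", "male"],
})
dupes = find_duplicate_rows(example)
print(len(dupes))  # 0 would mean the table passes this check
```

Passing `keep=False` marks all copies of a repeated row, which makes it easier to inspect them side by side.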


##test 7 - one-to-many tables count verification

**7. Verify count of table against main program clinical table if available**

In [0]:
#no one-to-many tables available for program BEATAML data_category clinical

##test 8 - submitter_id master clinical data table count verification

**8. Verify submitter_id count of table against master rel_clinical_data table**

In [0]:
%%bigquery --project isb-project-zero
-- submitter_id count from the program BEATAML clinical table

select distinct submitter_id, count(submitter_id) as count
from `isb-project-zero.GDC_Clinical_Data.rel23_clin_BEATAML1_0` 
group by submitter_id
order by count

Unnamed: 0,submitter_id,count
0,2196,1
1,2546,1
2,2374,1
3,2274,1
4,2204,1
...,...,...
634,2018,1
635,2125,1
636,2158,1
637,2330,1


In [0]:
%%bigquery --project isb-project-zero
-- submitter_id count from the master clinical table

SELECT distinct submitter_id, count(submitter_id) as count
FROM `isb-cgc.GDC_metadata.rel23_caseData` as caseData, `isb-project-zero.GDC_Clinical_Data.rel23_clinical_data` as clinical
WHERE active_file_count != 0 and program_name = 'BEATAML1.0' 
AND caseData.case_barcode = clinical.submitter_id
group by submitter_id
order by count

Unnamed: 0,submitter_id,count
0,2040,1
1,2086,1
2,2091,1
3,2098,1
4,2113,1
...,...,...
634,2358,2
635,2363,2
636,2432,2
637,2493,2
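
Counts greater than 1 in the joined result can come from a fan-out in the join rather than from duplicated clinical rows: if either table holds more than one row for a key, every matching pair is returned. The sketch below reproduces that effect with pandas on made-up data. The frames and IDs are illustrative and say nothing about the actual cause in `rel23_clinical_data`.

```python
import pandas as pd

# Made-up data: the metadata side has two rows for case '2358'.
case_data = pd.DataFrame({"case_barcode": ["2040", "2358", "2358"]})
clinical = pd.DataFrame({"submitter_id": ["2040", "2358"]})

# Inner join on the identifier, mirroring the WHERE clause of the query above.
joined = clinical.merge(case_data, left_on="submitter_id", right_on="case_barcode")
counts = joined.groupby("submitter_id").size()
print(counts[counts > 1])  # keys that were multiplied by the join
```

Checking each input table for repeated keys before joining would show which side introduces the extra rows.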


##test 9 - case_id master clinical data table count verification

**9. Verify case_id count of table against master rel_clinical_data table**

### no match rel23_clin_BEATAML1_0: 639 and rel23_caseData: 592

In [0]:
%%bigquery --project isb-project-zero
-- case_id count from the program BEATAML clinical table

select distinct case_id, count(case_id) as count
from `isb-project-zero.GDC_Clinical_Data.rel23_clin_BEATAML1_0` 
group by case_id
order by count

Unnamed: 0,case_id,count
0,21e1419b-5b51-4d3a-8aeb-33b42eb22b05,1
1,61cd4099-3b46-4a83-982b-98ec2d3bed06,1
2,a8e0b40a-1158-4332-978a-3b9a620a4c4d,1
3,71153ee9-9eed-4bf7-ad10-beb4f728ef22,1
4,a0d7b280-8133-420f-af41-038c22d2a034,1
...,...,...
634,3e0a9baf-dc38-405c-87ba-af489c4a869b,1
635,baf61533-bb00-42e3-805c-3d24e735eeb8,1
636,f2273e2b-41e6-4509-aecd-1683ccefffc2,1
637,f12b28ba-d793-4cea-bfc9-e5b0c0d86730,1


In [0]:
%%bigquery --project isb-project-zero
-- case_id count from the master clinical table

SELECT distinct case_id, count(case_id) as count
FROM `isb-project-zero.GDC_metadata.rel24_caseData` as casedata, `isb-project-zero.GDC_Clinical_Data.rel23_clinical_data` as clinical
WHERE active_file_count != 0 and program_name = 'BEATAML1.0'
AND casedata.case_gdc_id = clinical.case_id
group by case_id
order by count


Unnamed: 0,case_id,count
0,83ff6b6d-d954-4874-81ef-e545d61a6b8b,1
1,d88e0be1-525e-4f25-9207-0bbcda50cdf7,1
2,5ab4209e-e03e-4022-a3bb-f55f8194ffd0,1
3,e17710bc-613f-4bea-ba8f-084529e68d26,1
4,b7e83797-9d3b-4dcd-ad5c-1f2dd2f9b8e7,1
...,...,...
634,951d1275-7e3c-41a3-b718-ae08156179ca,1
635,a9f6f4c1-75ae-471d-94ff-5014c02350f1,1
636,b61514ec-cdef-4713-8481-cb9995520b70,1
637,72bcbe21-7fa1-4654-b3cb-7f39b5d7c905,1
