QC of ETL starting with GDC release 24 metadata tables.


##QC table checklist 

**Change Log Check**

i. Aliquot changes

ii. Slide changes

iii. Case data changes

iv. Current file changes

v. Legacy file changes


**QC per table**



**1. Check schema**

Are all the fields labeled?

Is there a table description?

Do the field labels make sense for all fields?
    
Are the labels correct?

**2. Look at table row number and size**

Do these metrics make sense?

**3. Scroll through table manually**

See if anything stands out - empty columns, etc.

##Reference material



*   [NextGenETL](https://github.com/isb-cgc/NextGenETL) GitHub repository
*   [ETL QC SOP draft](https://docs.google.com/document/d/1Wskf3BxJLkMjhIXD62B6_TG9h5KRcSp8jSAGqcCP1lQ/edit)

##Before you begin

You need to load the BigQuery module, authenticate yourself, create a client variable, and load the necessary libraries.


In [0]:
from google.colab import auth
try:
  auth.authenticate_user()
  print('You have been successfully authenticated!')
except Exception as error:  # a bare except would also swallow KeyboardInterrupt
  print('You have not been authenticated:', error)

You have been successfully authenticated!


In [0]:
from google.cloud import bigquery
try:
  project_id = 'isb-project-zero'  # Replace with your own Google Cloud project ID
  client = bigquery.Client(project=project_id)
  print('BigQuery client successfully initialized')
except Exception as error:
  print('Failed to initialize the BigQuery client:', error)

BigQuery client successfully initialized


In [0]:
#Install pypika to build a Query 
!pip install pypika
# Import from PyPika
from pypika import Query, Table, Field, Order

import pandas

Collecting pypika
  Downloading https://files.pythonhosted.org/packages/ea/22/63a4b2194462c54de8450de3d61eb44eddc2e7a85b06792603af09c606e1/PyPika-0.37.7.tar.gz (53kB)
     |████████████████████████████████| 61kB 1.9MB/s 
Building wheels for collected packages: pypika
  Building wheel for pypika (setup.py) ... done
  Created wheel for pypika: filename=PyPika-0.37.7-py2.py3-none-any.whl size=42747 sha256=f1fefab06b8cefb17d4229b5a405016362811d078aea6f8f5fe9744f1378bb4d
  Stored in directory: /root/.cache/pip/wheels/40/b2/20/cf67d3c67186b46241b5069c93da2c9beedbb3f08dba75fffe
Successfully built pypika
Installing collected packag

## READY TO BEGIN TESTING

## Change Log Check

This section compares the change summary printed to the console during ETL with the GDC release notes.

GDC Release notes: https://docs.gdc.cancer.gov/Data/Release_Notes/Data_Release_Notes/#data-release-240

### ALIQUOT Changes


In [0]:
Current aliquot count: 262792
Previous aliquot count: 261199
Difference: 1593
 
Removed aliquot count: 188
    188 ORGANOID-PANCREATIC     Next Generation Cancer Model
 
Added aliquot count: 1781
    210 CGCI-HTMCP-CC   Blood Derived Normal
    484 CGCI-HTMCP-CC   Primary Tumor
    111 CPTAC-3 Blood Derived Normal
    477 CPTAC-3 Primary Tumor
    123 CPTAC-3 Solid Tissue Normal
    376 ORGANOID-PANCREATIC     Next Generation Cancer Model
 
Changed aliquot count: 2761
    341       1 CPTAC-2 Blood Derived Normal
    682       1 CPTAC-2 Primary Tumor
      1       1 CPTAC-2 Solid Tissue Normal
    429       1 CPTAC-3 Blood Derived Normal
    898       1 CPTAC-3 Primary Tumor
    410       1 CPTAC-3 Solid Tissue Normal
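The totals in a change summary can be cross-checked with simple arithmetic: the difference between the current and previous counts should equal the added count minus the removed count. A minimal sketch using the aliquot numbers printed above; the function name `check_summary_totals` is hypothetical and not part of NextGenETL.

```python
def check_summary_totals(current, previous, added, removed):
    """Return True when the net change equals added minus removed."""
    difference = current - previous
    return difference == added - removed


# Aliquot counts copied from the change summary above.
aliquot_ok = check_summary_totals(
    current=262792, previous=261199, added=1781, removed=188
)
print("Aliquot totals consistent:", aliquot_ok)
```

The same call can be repeated with the slide, case data, and file counts from the other summaries.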

In [0]:
##### DETAILED ALIQUOT CHANGES ######
portion_gdc_id: 2761
  Example: 3702ef55-3362-4242-80da-af74129de5b0 -> a3c097ae-04bd-5b8e-8dfb-2da98e54a52e
  Example: ef140c27-2d96-408a-a226-c615d577e6bf -> 8f1f09f9-b7f7-55eb-80e1-7e728712fdf3
  Example: a677461e-8829-4c24-b605-4caa0f68f156 -> 8f6f05bf-e234-5164-968a-d850c8005816
  Example: 2dc883d9-d625-4359-845f-87516b209f97 -> 0c67235d-2cc7-52be-96e2-5d02ea5e451f
  Example: a39adf77-d9c6-49aa-8b38-8c3ee4c7f250 -> 164c2583-a9af-5e42-9060-2d305a92772b
analyte_gdc_id: 2761
  Example: 904a34d9-f0db-4482-a360-0a3f132fa348 -> bdd653c4-c837-513a-9503-8f33d434b0cf
  Example: 8322d3fb-1b35-418d-9bfe-e8d212272c9e -> 7be861c9-dfe4-573c-9b43-21b2c678fc5e
  Example: c3015e84-8b19-4246-80c2-a664afeaa498 -> 1edb00ee-1135-555e-acb7-9e599f64d752
  Example: f6a26ed5-fa45-4327-8d25-6dbdbd40c3eb -> 89d5b2dd-281f-52d9-b4f9-c6b5d08992e9
  Example: c9e41da1-7606-43db-ba88-953aa2bd81eb -> c4a139e5-b12e-5415-abc4-4f5416f6e90c

In [0]:
# Query to look up one changed analyte ID in both releases
analyte_query = """
SELECT *
FROM `isb-project-zero.GDC_metadata.rel23_aliquot2caseIDmap`
WHERE analyte_gdc_id = '904a34d9-f0db-4482-a360-0a3f132fa348'
UNION ALL
SELECT *
FROM `isb-project-zero.GDC_metadata.rel24_aliquot2caseIDmap`
WHERE analyte_gdc_id = 'bdd653c4-c837-513a-9503-8f33d434b0cf'
ORDER BY analyte_gdc_id
"""

In [0]:
# Query to look up one changed portion ID in both releases
portion_query = """
SELECT *
FROM `isb-project-zero.GDC_metadata.rel23_aliquot2caseIDmap`
WHERE portion_gdc_id = '3702ef55-3362-4242-80da-af74129de5b0'
UNION ALL
SELECT *
FROM `isb-project-zero.GDC_metadata.rel24_aliquot2caseIDmap`
WHERE portion_gdc_id = 'a3c097ae-04bd-5b8e-8dfb-2da98e54a52e'
ORDER BY analyte_gdc_id
"""

In [0]:
analyte = client.query(analyte_query).to_dataframe()
analyte

Unnamed: 0,program_name,project_id,case_gdc_id,case_barcode,sample_gdc_id,sample_barcode,sample_type,sample_type_name,sample_is_ffpe,sample_preservation_method,portion_gdc_id,portion_barcode,analyte_gdc_id,analyte_barcode,aliquot_gdc_id,aliquot_barcode
0,CPTAC,CPTAC-3,579139e1-e8f9-4260-baf1-dbe7435c8aff,C3N-00738,730e9961-2e11-4b35-82de-0f2d00091d00,C3N-00738-01,,Primary Tumor,False,Snap Frozen,3702ef55-3362-4242-80da-af74129de5b0,,904a34d9-f0db-4482-a360-0a3f132fa348,,acedee26-c758-4bf5-91b1-78bc39abf825,CPT0112060006
1,CPTAC,CPTAC-3,579139e1-e8f9-4260-baf1-dbe7435c8aff,C3N-00738,730e9961-2e11-4b35-82de-0f2d00091d00,C3N-00738-01,,Primary Tumor,False,Snap Frozen,a3c097ae-04bd-5b8e-8dfb-2da98e54a52e,,bdd653c4-c837-513a-9503-8f33d434b0cf,,acedee26-c758-4bf5-91b1-78bc39abf825,CPT0112060006


In [0]:
portion = client.query(portion_query).to_dataframe()
portion

Unnamed: 0,program_name,project_id,case_gdc_id,case_barcode,sample_gdc_id,sample_barcode,sample_type,sample_type_name,sample_is_ffpe,sample_preservation_method,portion_gdc_id,portion_barcode,analyte_gdc_id,analyte_barcode,aliquot_gdc_id,aliquot_barcode
0,CPTAC,CPTAC-3,579139e1-e8f9-4260-baf1-dbe7435c8aff,C3N-00738,730e9961-2e11-4b35-82de-0f2d00091d00,C3N-00738-01,,Primary Tumor,False,Snap Frozen,3702ef55-3362-4242-80da-af74129de5b0,,904a34d9-f0db-4482-a360-0a3f132fa348,,acedee26-c758-4bf5-91b1-78bc39abf825,CPT0112060006
1,CPTAC,CPTAC-3,579139e1-e8f9-4260-baf1-dbe7435c8aff,C3N-00738,730e9961-2e11-4b35-82de-0f2d00091d00,C3N-00738-01,,Primary Tumor,False,Snap Frozen,a3c097ae-04bd-5b8e-8dfb-2da98e54a52e,,bdd653c4-c837-513a-9503-8f33d434b0cf,,acedee26-c758-4bf5-91b1-78bc39abf825,CPT0112060006


The ORGANOID-PANCREATIC aliquot changes, additions, and removals stem from an ongoing issue with data coming from GDC. The code has been run to update the aliquot identifiers. The added aliquots for CGCI-HTMCP-CC and CPTAC-3 are expected from the release notes. The changed aliquots (CPTAC-2 and CPTAC-3) are probably due to the random generation of pseudo-identifiers and to CPTAC-2 and CPTAC-3 data being updated in this release.
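One quick way to look at the changed identifiers is to read the UUID version of the example pairs listed in the detailed aliquot changes. The sketch below uses Python's standard `uuid` module; the helper name `uuid_versions` is hypothetical. For these examples the previous IDs parse as version 4 and the new IDs as version 5, which suggests the identifiers themselves were regenerated while the aliquot stayed the same. The sketch only inspects the ID format and says nothing about how GDC produced the new values.

```python
import uuid

# Example (old, new) pairs copied from the detailed aliquot changes.
changed_ids = [
    ("3702ef55-3362-4242-80da-af74129de5b0", "a3c097ae-04bd-5b8e-8dfb-2da98e54a52e"),
    ("904a34d9-f0db-4482-a360-0a3f132fa348", "bdd653c4-c837-513a-9503-8f33d434b0cf"),
]


def uuid_versions(pairs):
    """Return the (old_version, new_version) of each UUID pair."""
    return [(uuid.UUID(old).version, uuid.UUID(new).version) for old, new in pairs]


for old_version, new_version in uuid_versions(changed_ids):
    print(f"version {old_version} -> version {new_version}")
```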

### SLIDE Changes

In [0]:
SLIDE
 
Current slide count: 48782
Previous slide count: 48782
Difference: 0
 
Removed slide count: 0
 
Added slide count: 0
 
Changed slide count: 0

In [0]:
##### DETAILED SLIDE CHANGES ######
No changes to analyze


No slide changes were found in this release.

### CASE DATA Changes

In [0]:
CASEDATA
 
Current caseData count: 85340
Previous caseData count: 85018
Difference: 322
 
Removed caseData count: 0
 
Added caseData count: 322
    212 CGCI-HTMCP-CC
    110 CPTAC-3
 
Changed caseData count: 772
    342 CPTAC-2
    430 CPTAC-3

In [0]:
##### DETAILED CASE CHANGES ######
active_file_count: 748
  Example: 23 -> 30
  Example: 23 -> 30
  Example: 38 -> 45
  Example: 38 -> 45
  Example: 23 -> 30
project__name: 772
  Example:  -> CPTAC-Breast, Colon, Ovary
  Example:  -> CPTAC-Breast, Colon, Ovary
  Example:  -> CPTAC-Brain, Head and Neck, Kidney, Lung, Uterus
  Example:  -> CPTAC-Brain, Head and Neck, Kidney, Lung, Uterus
  Example:  -> CPTAC-Breast, Colon, Ovary
project__disease_type: 430
  Example: Adenomas and Adenocarcinomas;Gliomas;Not Applicable -> Adenomas and Adenocarcinomas;Gliomas;Not Applicable;Squamous Cell Neoplasms
  Example: Adenomas and Adenocarcinomas;Gliomas;Not Applicable -> Adenomas and Adenocarcinomas;Gliomas;Not Applicable;Squamous Cell Neoplasms
  Example: Adenomas and Adenocarcinomas;Gliomas;Not Applicable -> Adenomas and Adenocarcinomas;Gliomas;Not Applicable;Squamous Cell Neoplasms
  Example: Adenomas and Adenocarcinomas;Gliomas;Not Applicable -> Adenomas and Adenocarcinomas;Gliomas;Not Applicable;Squamous Cell Neoplasms
  Example: Adenomas and Adenocarcinomas;Gliomas;Not Applicable -> Adenomas and Adenocarcinomas

The active file count is going up, which is good because files should have been added with this release. The project name is now populated and the project disease type was updated.
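Fields such as `project__disease_type` hold semicolon-separated lists, so the exact change is easier to read as a set difference. A minimal sketch, using the example values from the detailed case changes; `list_field_diff` is a hypothetical helper name.

```python
def list_field_diff(old_value, new_value, sep=";"):
    """Return (added, removed) items between two delimited list fields."""
    old_items = {item.strip() for item in old_value.split(sep) if item.strip()}
    new_items = {item.strip() for item in new_value.split(sep) if item.strip()}
    return sorted(new_items - old_items), sorted(old_items - new_items)


# Example values copied from the detailed case changes.
old = "Adenomas and Adenocarcinomas;Gliomas;Not Applicable"
new = "Adenomas and Adenocarcinomas;Gliomas;Not Applicable;Squamous Cell Neoplasms"

added, removed = list_field_diff(old, new)
print("added:", added)
print("removed:", removed)
```

The same helper works for other list-valued fields, for example `downstream_analyses__output_files__file_id` in the current file changes.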

###CURRENT FILES Changes

In [0]:
CURRENTFILES
 
Current currentFiles count: 570845
Previous currentFiles count: 559346
Difference: 11499
 
Removed currentFiles count: 35
     13 CPTAC-3 BAM
     10 CPTAC-3 TSV
     12 CPTAC-3 TXT
 
Added currentFiles count: 11534
    903 CGCI-HTMCP-CC   BAM
    212 CGCI-HTMCP-CC   BCR XML
    482 CGCI-HTMCP-CC   TSV
    369 CGCI-HTMCP-CC   TXT
      2 CGCI-HTMCP-CC   XLSX
     12 CPTAC-2 BAM
   2294 CPTAC-2 MAF
     60 CPTAC-2 VCF
   1174 CPTAC-3 BAM
   3731 CPTAC-3 MAF
    682 CPTAC-3 TSV
    513 CPTAC-3 TXT
   1100 CPTAC-3 VCF
 
Changed currentFiles count: 20746
   2016 CPTAC-2 BAM
   1360 CPTAC-2 TSV
   1020 CPTAC-2 TXT
   3216 CPTAC-2 VCF
   4556 CPTAC-3 BAM
   2482 CPTAC-3 TSV
   1866 CPTAC-3 TXT
   4230 CPTAC-3 VCF
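The per-project rows under each heading should add up to the count stated in that heading. The sketch below parses a pasted summary and compares the two; it assumes the layout shown above (a `Removed`/`Added`/`Changed` header line followed by indented rows that start with a count), and the names `HEADER`, `ROW`, and `check_group_sums` are hypothetical.

```python
import re

HEADER = re.compile(r"^(Removed|Added|Changed) \S+ count: (\d+)\s*$")
ROW = re.compile(r"^\s+(\d+)\s+\S")

# Removed and Changed sections copied from the summary above.
summary_text = """\
Removed currentFiles count: 35
     13 CPTAC-3 BAM
     10 CPTAC-3 TSV
     12 CPTAC-3 TXT

Changed currentFiles count: 20746
   2016 CPTAC-2 BAM
   1360 CPTAC-2 TSV
   1020 CPTAC-2 TXT
   3216 CPTAC-2 VCF
   4556 CPTAC-3 BAM
   2482 CPTAC-3 TSV
   1866 CPTAC-3 TXT
   4230 CPTAC-3 VCF
"""


def check_group_sums(text):
    """Map each section name to (stated_total, sum_of_its_rows)."""
    totals = {}
    section = None
    for line in text.splitlines():
        header = HEADER.match(line)
        if header:
            section = header.group(1)
            totals[section] = [int(header.group(2)), 0]
            continue
        row = ROW.match(line)
        if row and section:
            totals[section][1] += int(row.group(1))
    return {name: (stated, summed) for name, (stated, summed) in totals.items()}


for name, (stated, summed) in check_group_sums(summary_text).items():
    status = "OK" if stated == summed else "MISMATCH"
    print(f"{name}: stated {stated}, rows sum to {summed} ({status})")
```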

In [0]:
##### DETAILED CURRENT FILE CHANGES ######
updated_datetime: 227
  Example: 2019-10-24T07:59:21.887408-05:00 -> 2019-12-02T08:34:28.180009-06:00
  Example: 2019-10-24T07:59:21.887408-05:00 -> 2019-12-02T08:34:36.671857-06:00
  Example: 2019-10-24T07:59:21.887408-05:00 -> 2019-12-02T08:34:11.319762-06:00
  Example: 2019-10-24T07:59:21.887408-05:00 -> 2019-12-02T08:34:18.457334-06:00
  Example: 2019-10-24T07:59:21.887408-05:00 -> 2019-12-02T08:35:21.063526-06:00
cases__project__name: 20746
  Example:  -> CPTAC-Brain, Head and Neck, Kidney, Lung, Uterus
  Example:  -> CPTAC-Breast, Colon, Ovary
  Example:  -> CPTAC-Breast, Colon, Ovary
  Example:  -> CPTAC-Brain, Head and Neck, Kidney, Lung, Uterus
  Example:  -> CPTAC-Breast, Colon, Ovary
downstream_analyses__output_files__file_id: 3723
  Example: 2583ab8d-4eac-4067-a187-8f480f0dd46f -> 1b1ec5aa-4d7d-4e4c-b4ef-81f4257fdddf;2583ab8d-4eac-4067-a187-8f480f0dd46f
  Example: 63a9c78e-f4dd-4a2c-a238-3141edbf679d -> 0f481547-8f0a-4e48-a10b-6c927d8802e7;63a9c78e-f4dd-4a2c-a238-3141edbf679d
  Example: 0f12ef9f-9cd3-4fd8-8052-d978d901cd6c -> 0f12ef9f-9cd3-4fd8-8052-d978d901cd6c;2925a485-045d-40f3-8a76-a4b301ca80d3
  Example: 60d4b399-fce8-45e5-9044-bc718cbb8f0a -> 60d4b399-fce8-45e5-9044-bc718cbb8f0a;cfaed0b2-a7da-4ed1-905b-687e26c1e5ae
  Example: 488c74c3-5bc4-462e-a0eb-10285291e7b1 -> 488c74c3-5bc4-462e-a0eb-10285291e7b1;495a4ff1-9b6d-4140-9488-ae826111b7c2

In [0]:
file_removed_query = """
WITH
  rel23 AS (
  SELECT
    file_gdc_id,
    case_gdc_id,
    project_short_name,
    data_category,
    data_format,
    data_type,
    file_name,
    file_type
  FROM
    `isb-project-zero.GDC_metadata.rel23_fileData_current`),
  rel24 AS (
  SELECT
    file_gdc_id
  FROM
    `isb-project-zero.GDC_metadata.rel24_fileData_current`)
SELECT
    rel23.file_gdc_id,
    rel23.case_gdc_id,
    rel23.project_short_name,
    rel23.data_category,
    rel23.data_format,
    rel23.data_type,
    rel23.file_name,
    rel23.file_type
FROM rel23
LEFT OUTER JOIN rel24
ON rel23.file_gdc_id = rel24.file_gdc_id
WHERE rel24.file_gdc_id IS NULL
ORDER BY rel23.file_type"""

In [0]:
file_removed_sum_query = """
WITH
  rel23 AS (
  SELECT
    file_gdc_id,
    project_short_name,
    data_category,
    data_format,
    data_type,
    file_name,
    file_type
  FROM
    `isb-project-zero.GDC_metadata.rel23_fileData_current`),
  rel24 AS (
  SELECT
    file_gdc_id
  FROM
    `isb-project-zero.GDC_metadata.rel24_fileData_current`)
SELECT
  rel23.project_short_name,
  rel23.data_category,
  rel23.data_format,
  rel23.data_type,
  rel23.file_type,
  COUNT(rel23.file_gdc_id) AS n
FROM
  rel23
LEFT OUTER JOIN
  rel24
ON
  rel23.file_gdc_id = rel24.file_gdc_id
WHERE
  rel24.file_gdc_id IS NULL
GROUP BY
  rel23.project_short_name,
  rel23.data_category,
  rel23.data_format,
  rel23.data_type,
  rel23.file_type
ORDER BY
  rel23.file_type
  """

In [0]:
file_removed = client.query(file_removed_query).to_dataframe()
file_removed.head()

Unnamed: 0,file_gdc_id,case_gdc_id,project_short_name,data_category,data_format,data_type,file_name,file_type
0,398129c7-e54f-4423-b5d4-2a3dfb374ddc,3c055f6e-751a-430d-8549-ce873c0e64b5,CPTAC-3,Sequencing Reads,BAM,Aligned Reads,64d66d0c-9441-4c38-b8cf-08c40c0467bc.rna_seq.t...,aligned_reads
1,d7b5179d-63cb-47f5-9dac-faf1fdbc19f1,3c055f6e-751a-430d-8549-ce873c0e64b5,CPTAC-3,Sequencing Reads,BAM,Aligned Reads,64d66d0c-9441-4c38-b8cf-08c40c0467bc.rna_seq.c...,aligned_reads
2,e4740b8e-a683-40b8-9122-27a7f2b796f5,3c055f6e-751a-430d-8549-ce873c0e64b5,CPTAC-3,Sequencing Reads,BAM,Aligned Reads,64d66d0c-9441-4c38-b8cf-08c40c0467bc.rna_seq.g...,aligned_reads
3,62283599-ec19-46e6-b259-52a38fd516fe,98bed41e-96ac-434b-90ad-cc7223998602,CPTAC-3,Sequencing Reads,BAM,Aligned Reads,2553a2f4-4e4d-4c12-90d6-1a3cc5f05d2a.rna_seq.c...,aligned_reads
4,7942613a-c7ef-448e-a62c-994bcf263ea6,98bed41e-96ac-434b-90ad-cc7223998602,CPTAC-3,Sequencing Reads,BAM,Aligned Reads,2553a2f4-4e4d-4c12-90d6-1a3cc5f05d2a.rna_seq.t...,aligned_reads


In [0]:
file_removed_sum = client.query(file_removed_sum_query).to_dataframe()
file_removed_sum

Unnamed: 0,project_short_name,data_category,data_format,data_type,file_type,n
0,CPTAC-3,Sequencing Reads,BAM,Aligned Reads,aligned_reads,13
1,CPTAC-3,Transcriptome Profiling,TSV,Splice Junction Quantification,gene_expression,4
2,CPTAC-3,Transcriptome Profiling,TXT,Gene Expression Quantification,gene_expression,12
3,CPTAC-3,Transcriptome Profiling,TSV,Gene Expression Quantification,gene_expression,4
4,CPTAC-3,Transcriptome Profiling,TSV,Isoform Expression Quantification,mirna_expression,1
5,CPTAC-3,Transcriptome Profiling,TSV,miRNA Expression Quantification,mirna_expression,1


The removed files seem to have been replaced with new files during the update to CPTAC-3. The other changes appear to be normal.
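The two queries above find removed files with an anti-join: a `LEFT OUTER JOIN` from the previous release to the current one, keeping only rows with no match. The sketch below reproduces the pattern on an in-memory SQLite database so it can be tried without BigQuery access. The table contents and file IDs are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE rel23 (file_gdc_id TEXT, file_type TEXT);
CREATE TABLE rel24 (file_gdc_id TEXT, file_type TEXT);
INSERT INTO rel23 VALUES ('file-a', 'aligned_reads'), ('file-b', 'gene_expression');
INSERT INTO rel24 VALUES ('file-b', 'gene_expression'), ('file-c', 'aligned_reads');
""")

# Rows present in rel23 with no matching file_gdc_id in rel24 were removed.
removed = conn.execute("""
SELECT rel23.file_gdc_id
FROM rel23
LEFT OUTER JOIN rel24 ON rel23.file_gdc_id = rel24.file_gdc_id
WHERE rel24.file_gdc_id IS NULL
""").fetchall()

# Swapping the two tables gives the files that were added instead.
added = conn.execute("""
SELECT rel24.file_gdc_id
FROM rel24
LEFT OUTER JOIN rel23 ON rel24.file_gdc_id = rel23.file_gdc_id
WHERE rel23.file_gdc_id IS NULL
""").fetchall()

conn.close()
print("removed:", removed)
print("added:", added)
```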

### LEGACY FILE Changes

In [0]:
LEGACYFILES
 
Current legacyFiles count: 837960
Previous legacyFiles count: 837960
Difference: 0
 
Removed legacyFiles count: 0
 
Added legacyFiles count: 0
 
Changed legacyFiles count: 0

In [0]:
##### DETAILED LEGACY FILE CHANGES ######
No changes to analyze

There are no changes to the legacy files, which is expected.

##GDC aliquot2caseIDmap

**Testing Full ID** `isb-project-zero.GDC_metadata.rel24_aliquot2caseIDmap`

[Table location](https://console.cloud.google.com/bigquery?p=isb-project-zero&d=GDC_metadata&t=rel24_aliquot2caseIDmap&page=table&project=high-transit-276919&authuser=2)

Source : GDC API

Date Created : 	May 18, 2020, 1:20:58 PM

Release version : v24


###test 1 - schema verification

**1. Check schema**

Are all the fields labeled?

Is there a table description?

Do the field labels make sense for all fields?

Are the labels correct?

Google documentation column descriptions for [reference](https://cloud.google.com/bigquery/docs/information-schema-tables#column_field_paths_view).

Google documentation table options for [reference](https://cloud.google.com/bigquery/docs/information-schema-tables#options_table).

In [0]:
#return all table information for dataset rel24_aliquot2caseIDmap 

meta_table = Table('`isb-project-zero`.GDC_metadata.INFORMATION_SCHEMA.TABLES')
meta_query = Query.from_(meta_table) \
                  .select(' table_catalog, table_schema, table_name, table_type ') \
                  .where(meta_table.table_name=='rel24_aliquot2caseIDmap') \
                  
meta_query_clean = str(meta_query).replace('"', "")
meta = client.query(meta_query_clean).to_dataframe()
meta.head()

Unnamed: 0,table_catalog,table_schema,table_name,table_type
0,isb-project-zero,GDC_metadata,rel24_aliquot2caseIDmap,BASE TABLE


In [0]:
#return all table information for dataset rel24_aliquot2caseIDmap

meta_table = Table('`isb-project-zero`.GDC_metadata.INFORMATION_SCHEMA.TABLE_OPTIONS')
meta_query = Query.from_(meta_table) \
                  .select(' table_name, option_name, option_type, option_value ') \
                  .where(meta_table.table_name=='rel24_aliquot2caseIDmap') \

meta_query_clean = str(meta_query).replace('"', "")
meta = client.query(meta_query_clean).to_dataframe()
pandas.options.display.max_rows


for i in range(len(meta)):
    print(meta['option_name'][i] + '\n')
    print('\t' + meta['option_value'][i] + '\n')
    print('\t' + meta['option_type'][i] + '\n')

friendly_name

	"RELEASE 24 ALIQUOT IDS TO CASE IDS"

	STRING

description

	"Data was generated from file metadata information from the GDC release 24, downloaded May 2020. Aliquot barcodes are mapped to case information.\nMore details: https://docs.gdc.cancer.gov/Data/Release_Notes/Data_Release_Notes/#data-release-240"

	STRING

labels

	[STRUCT("data_type", "file_metadata"), STRUCT("category", "file_metadata"), STRUCT("access", "open"), STRUCT("status", "current"), STRUCT("source", "gdc")]

	ARRAY<STRUCT<STRING, STRING>>



In [0]:
#check for empty schemas in dataset rel24_aliquot2caseIDmap 
meta_table = Table('`isb-project-zero`.GDC_metadata.INFORMATION_SCHEMA.TABLE_OPTIONS')
meta_query = Query.from_(meta_table) \
                  .select(' table_name, option_name, option_type, option_value ') \
                  .where(meta_table.table_name=='rel24_aliquot2caseIDmap') \

meta_query_clean = str(meta_query).replace('"', "")
meta = client.query(meta_query_clean).to_dataframe()
pandas.options.display.max_rows
print("Are there any empty cells in the table schema?")
pandas.isnull(meta).values.any()

Are there any empty cells in the table schema?


False
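`pandas.isnull` flags only missing values, so an option stored as an empty string would still pass the check above. The sketch below is a stricter check over plain row dictionaries; `find_blank_cells` is a hypothetical helper and the sample rows are made up for illustration.

```python
def find_blank_cells(rows):
    """Return (row_index, column) pairs whose value is None or whitespace-only."""
    blanks = []
    for index, row in enumerate(rows):
        for column, value in row.items():
            if value is None or (isinstance(value, str) and not value.strip()):
                blanks.append((index, column))
    return blanks


# Made-up rows in the shape of the TABLE_OPTIONS result.
sample_rows = [
    {"option_name": "friendly_name", "option_value": '"RELEASE 24 ALIQUOT IDS TO CASE IDS"'},
    {"option_name": "description", "option_value": ""},
    {"option_name": "labels", "option_value": None},
]
print(find_blank_cells(sample_rows))
```

With a DataFrame such as `meta`, `meta.to_dict("records")` produces rows in this shape. Values that pandas represents as NaN are not caught by this helper, so it complements the `isnull` check; it does not replace it.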

Example of the field descriptions pulled from the schema:


In [0]:
#list of field descriptions for table 

#return all table information for dataset rel24_aliquot2caseIDmap 
meta_table = Table('`isb-project-zero`.GDC_metadata.INFORMATION_SCHEMA.COLUMN_FIELD_PATHS')
meta_query = Query.from_(meta_table) \
                  .select('table_name, column_name, description') \
                  .where(meta_table.table_name=='rel24_aliquot2caseIDmap') \

meta_query_clean = str(meta_query).replace('"', "")
meta = client.query(meta_query_clean).to_dataframe()
pandas.options.display.max_rows

for i in range(len(meta)):
  print(meta['table_name'][i] + '\n')
  print('\t' + meta['column_name'][i] + '\n')
  print('\t' + meta['description'][i] + '\n')

rel24_aliquot2caseIDmap

	program_name

	Program name, eg TCGA or TARGET

rel24_aliquot2caseIDmap

	project_id

	ID of the Project -- e.g. TCGA-BRCA, TARGET-AML, etc.

rel24_aliquot2caseIDmap

	case_gdc_id

	GDC unique identifier

rel24_aliquot2caseIDmap

	case_barcode

	Submitter identifier, eg HCC1954, TCGA-ZX-AA5X, etc

rel24_aliquot2caseIDmap

	sample_gdc_id

	Unique GDC identifier for this sample (corresponds to the sample_barcode), eg a1ec9279-c1a6-4e58-97ed-9ec1f36187c5  --  this can be used to access more information from the GDC data portal

rel24_aliquot2caseIDmap

	sample_barcode

	Original/submitter sample ID

rel24_aliquot2caseIDmap

	sample_type

	Two-digit sample_type code which forms part of the sample barcode eg. 01,11,06, etc.

rel24_aliquot2caseIDmap

	sample_type_name

	The longer name of the sample type; eg. Primary Tumor, Recurrent Tumor, etc.

rel24_aliquot2caseIDmap

	sample_is_ffpe

	Whether a smaple is FFPE, either  False,  None ,  True, or null

rel24_aliquot

In [0]:
#list of field descriptions for table 

#check for empty schemas in dataset rel24_aliquot2caseIDmap
meta_table = Table('`isb-project-zero`.GDC_metadata.INFORMATION_SCHEMA.COLUMN_FIELD_PATHS')
meta_query = Query.from_(meta_table) \
                  .select('table_name, column_name, description') \
                  .where(meta_table.table_name=='rel24_aliquot2caseIDmap') \

meta_query_clean = str(meta_query).replace('"', "")
meta = client.query(meta_query_clean).to_dataframe()
pandas.options.display.max_rows
print("Are there any empty cells in the table schema?")
pandas.isnull(meta).values.any()

Are there any empty cells in the table schema?


False
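The schema cells in this test build their queries with PyPika and then strip the double quotes from the generated SQL. As a possible alternative, the sketch below builds the same kind of `INFORMATION_SCHEMA` query as a plain string with a named `@table_name` parameter. The helper `build_metadata_query` is hypothetical; the commented lines show how the string could be run with the notebook's `client` and are not executed here.

```python
def build_metadata_query(project, dataset, view, columns):
    """Build an INFORMATION_SCHEMA query filtered by a @table_name parameter."""
    column_list = ", ".join(columns)
    return (
        f"SELECT {column_list}\n"
        f"FROM `{project}`.{dataset}.INFORMATION_SCHEMA.{view}\n"
        f"WHERE table_name = @table_name"
    )


sql = build_metadata_query(
    "isb-project-zero",
    "GDC_metadata",
    "COLUMN_FIELD_PATHS",
    ["table_name", "column_name", "description"],
)
print(sql)

# With the authenticated `client` from the setup cells, the query could be run as:
# job_config = bigquery.QueryJobConfig(
#     query_parameters=[
#         bigquery.ScalarQueryParameter("table_name", "STRING", "rel24_aliquot2caseIDmap")
#     ]
# )
# meta = client.query(sql, job_config=job_config).to_dataframe()
```

Passing the table name as a query parameter keeps the SQL text identical across tables, so only the parameter value changes from one QC section to the next.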

###test 2 - row number verification

**2. Look at table row number and size**

Do these metrics make sense?

In [0]:
%%bigquery --project isb-project-zero
SELECT COUNT(project_id)
FROM `isb-project-zero.GDC_metadata.rel24_aliquot2caseIDmap`

Unnamed: 0,f0_
0,262603


In [0]:
%%bigquery --project isb-project-zero
SELECT COUNT(project_id)
FROM `isb-project-zero.GDC_metadata.rel23_aliquot2caseIDmap`

Unnamed: 0,f0_
0,261198
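One way to judge whether these counts make sense is to line them up against the aliquot counts in the change summary earlier in this notebook. The sketch below only reports the gap for each release and does not explain it. Note that `COUNT(project_id)` skips rows where `project_id` is NULL, so the two numbers need not match exactly. The name `count_gap` is hypothetical.

```python
def count_gap(table_rows, summary_count):
    """Return how far the change-summary count exceeds the table row count."""
    return summary_count - table_rows


# Table counts from the two queries above; summary counts from the aliquot change log.
checks = {
    "rel24": count_gap(table_rows=262603, summary_count=262792),
    "rel23": count_gap(table_rows=261198, summary_count=261199),
}
for release, gap in checks.items():
    print(f"{release}: change summary exceeds table count by {gap}")
```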


###test 3 - manual verification

**3. Scroll through table manually**

See if anything stands out - empty columns, etc.

The BigQuery table search user interface is useful for this test. The test tier points to isb-etl-open.

ISB-CGC BigQuery table search [test tier](https://isb-cgc-test.appspot.com/bq_meta_search/).

BigQuery console [isb-project-zero](https://console.cloud.google.com/bigquery?p=isb-project-zero&d=GDC_metadata&t=rel24_aliquot2caseIDmap&page=table&project=high-transit-276919&authuser=2).

Run a manual check in the console using the questions from test 1:

Are all the fields labeled?

Is there a table description?

Do the field labels make sense for all fields?
    
Are the labels correct?
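Scrolling for empty columns can be backed up with a small programmatic check. The sketch below works on plain row dictionaries; `empty_columns` is a hypothetical helper, and the sample rows are a trimmed copy of the two CPTAC-3 rows shown in the aliquot change check, with the blank cells written as `None`.

```python
def empty_columns(rows):
    """Return the columns whose value is None or '' in every row."""
    if not rows:
        return []
    columns = list(rows[0].keys())
    return [
        column
        for column in columns
        if all(row.get(column) in (None, "") for row in rows)
    ]


# Trimmed copy of the two CPTAC-3 rows from the aliquot change check.
sample_rows = [
    {"project_id": "CPTAC-3", "sample_type": None, "sample_type_name": "Primary Tumor",
     "portion_barcode": None, "aliquot_barcode": "CPT0112060006"},
    {"project_id": "CPTAC-3", "sample_type": None, "sample_type_name": "Primary Tumor",
     "portion_barcode": None, "aliquot_barcode": "CPT0112060006"},
]
print(empty_columns(sample_rows))
```

A column that is empty in a two-row sample is not necessarily empty in the full table, so the check is most useful when run on the whole table or on a large sample.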

##GDC caseData

**Testing Full ID** `isb-project-zero.GDC_metadata.rel24_caseData`

[Table location](https://console.cloud.google.com/bigquery?project=high-transit-276919&authuser=2&p=isb-project-zero&d=GDC_metadata&t=rel24_caseData&page=table)

Source : GDC API

Date Created : 	May 18, 2020, 1:21:08 PM

Release version : v24

###test 1 - schema verification

**1. Check schema**

Are all the fields labeled?

Is there a table description?

Do the field labels make sense for all fields?

Are the labels correct?

Google documentation column descriptions for [reference](https://cloud.google.com/bigquery/docs/information-schema-tables#column_field_paths_view).

Google documentation table options for [reference](https://cloud.google.com/bigquery/docs/information-schema-tables#options_table).

In [0]:
#return all table information for dataset rel24_caseData 

meta_table = Table('`isb-project-zero`.GDC_metadata.INFORMATION_SCHEMA.TABLES')
meta_query = Query.from_(meta_table) \
                  .select(' table_catalog, table_schema, table_name, table_type ') \
                  .where(meta_table.table_name=='rel24_caseData') \
                  
meta_query_clean = str(meta_query).replace('"', "")
meta = client.query(meta_query_clean).to_dataframe()
meta.head()


Unnamed: 0,table_catalog,table_schema,table_name,table_type
0,isb-project-zero,GDC_metadata,rel24_caseData,BASE TABLE


In [0]:
#return all table information for dataset rel24_caseData

meta_table = Table('`isb-project-zero`.GDC_metadata.INFORMATION_SCHEMA.TABLE_OPTIONS')
meta_query = Query.from_(meta_table) \
                  .select(' table_name, option_name, option_type, option_value ') \
                  .where(meta_table.table_name=='rel24_caseData') \

meta_query_clean = str(meta_query).replace('"', "")
meta = client.query(meta_query_clean).to_dataframe()
pandas.options.display.max_rows


for i in range(len(meta)):
    print(meta['option_name'][i] + '\n')
    print('\t' + meta['option_value'][i] + '\n')
    print('\t' + meta['option_type'][i] + '\n')

friendly_name

	"RELEASE 24 CASE DATA"

	STRING

description

	"Data was generated from file metadata information from the GDC release 24, downloaded May, 2020. \nMore details: https://docs.gdc.cancer.gov/Data/Release_Notes/Data_Release_Notes/#data-release-240"

	STRING

labels

	[STRUCT("access", "open"), STRUCT("source", "gdc"), STRUCT("data_type", "file_metadata"), STRUCT("status", "current"), STRUCT("category", "file_metadata")]

	ARRAY<STRUCT<STRING, STRING>>



In [0]:
#check for empty schemas in dataset rel24_caseData 
meta_table = Table('`isb-project-zero`.GDC_metadata.INFORMATION_SCHEMA.TABLE_OPTIONS')
meta_query = Query.from_(meta_table) \
                  .select(' table_name, option_name, option_type, option_value ') \
                  .where(meta_table.table_name=='rel24_caseData') \

meta_query_clean = str(meta_query).replace('"', "")
meta = client.query(meta_query_clean).to_dataframe()
pandas.options.display.max_rows
print("Are there any empty cells in the table schema?")
pandas.isnull(meta).values.any()

Are there any empty cells in the table schema?


False
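The `labels` option is returned as a single string, which makes it hard to check by eye. The sketch below parses that string into a dictionary and reports any missing keys. The regular expression assumes the `STRUCT("key", "value")` layout printed above, and `expected_keys` is an assumption based on the labels seen on the tables in this notebook; it is not an official list of required labels.

```python
import re

LABEL_PAIR = re.compile(r'STRUCT\("([^"]*)",\s*"([^"]*)"\)')


def parse_labels(option_value):
    """Turn a BigQuery labels option string into a {key: value} dict."""
    return dict(LABEL_PAIR.findall(option_value))


# option_value for `labels`, copied from the rel24_caseData output above.
labels_value = (
    '[STRUCT("access", "open"), STRUCT("source", "gdc"), '
    'STRUCT("data_type", "file_metadata"), STRUCT("status", "current"), '
    'STRUCT("category", "file_metadata")]'
)

labels = parse_labels(labels_value)
expected_keys = {"access", "source", "data_type", "status", "category"}
missing = expected_keys - labels.keys()
print(labels)
print("missing label keys:", sorted(missing))
```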

Example of the field descriptions pulled from the schema:


In [0]:
#list of field descriptions for table 

#return all table information for dataset rel24_caseData 
meta_table = Table('`isb-project-zero`.GDC_metadata.INFORMATION_SCHEMA.COLUMN_FIELD_PATHS')
meta_query = Query.from_(meta_table) \
                  .select('table_name, column_name, description') \
                  .where(meta_table.table_name=='rel24_caseData') \

meta_query_clean = str(meta_query).replace('"', "")
meta = client.query(meta_query_clean).to_dataframe()
pandas.options.display.max_rows

for i in range(len(meta)):
  print(meta['table_name'][i] + '\n')
  print('\t' + meta['column_name'][i] + '\n')
  print('\t' + meta['description'][i] + '\n')

rel24_caseData

	case_gdc_id

	GDC unique identifier

rel24_caseData

	project_dbgap_accession_number

	Program dbGaP accession number, eg phs000187

rel24_caseData

	project_disease_type

	The type of disease studied by the project; e.g. Sarcoma,  Breast Invasive Carcinoma,Acute Lymphoblastic Leukemia, etc

rel24_caseData

	project_name

	Project Name; frequently the same as or similar to the project disease type eg   Acute Lymphoblastic Leukemia - Phase II   or   Breast Invasive Carcinoma   etc

rel24_caseData

	program_dbgap_accession_number

	Program dbGaP accession number, eg phs000178

rel24_caseData

	program_name

	Program name, eg TCGA or TARGET

rel24_caseData

	project_id

	ID of the Project -- e.g. TCGA-BRCA, TARGET-AML, etc

rel24_caseData

	case_barcode

	Submitter identifier, eg HCC1954, TCGA-ZX-AA5X, etc

rel24_caseData

	legacy_file_count

	Total number of files for this case in the GDC legacy archive

rel24_caseData

	active_file_count

	Total number of files for this

In [0]:
#list of field descriptions for table 

#check for empty schemas in dataset rel24_caseData
meta_table = Table('`isb-project-zero`.GDC_metadata.INFORMATION_SCHEMA.COLUMN_FIELD_PATHS')
meta_query = Query.from_(meta_table) \
                  .select('table_name, column_name, description') \
                  .where(meta_table.table_name=='rel24_caseData') \

meta_query_clean = str(meta_query).replace('"', "")
meta = client.query(meta_query_clean).to_dataframe()
pandas.options.display.max_rows
print("Are there any empty cells in the table schema?")
pandas.isnull(meta).values.any()

Are there any empty cells in the table schema?


False

###test 2 - row number verification

**2. Look at table row number and size**

Do these metrics make sense?

In [0]:
%%bigquery --project isb-project-zero
SELECT COUNT(project_id)
FROM `isb-project-zero.GDC_metadata.rel24_caseData`

Unnamed: 0,f0_
0,85339


In [0]:
%%bigquery --project isb-project-zero
SELECT COUNT(project_id)
FROM `isb-project-zero.GDC_metadata.rel23_caseData`

Unnamed: 0,f0_
0,85017
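`COUNT(project_id)` counts only the rows where `project_id` is not NULL, whereas `COUNT(*)` counts every row, so the two can differ when a column has gaps. The sketch below illustrates the difference with an in-memory SQLite table; the table name and rows are made up, and the same SQL semantics apply in BigQuery.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE case_data (case_gdc_id TEXT, project_id TEXT)")
conn.executemany(
    "INSERT INTO case_data VALUES (?, ?)",
    [("case-1", "CPTAC-3"), ("case-2", "CPTAC-2"), ("case-3", None)],
)

# COUNT(*) counts all rows; COUNT(project_id) skips the row with a NULL project_id.
all_rows, non_null_projects = conn.execute(
    "SELECT COUNT(*), COUNT(project_id) FROM case_data"
).fetchone()
conn.close()

print("COUNT(*):", all_rows)
print("COUNT(project_id):", non_null_projects)
```

Running both counts against the release tables would show whether any rows lack a `project_id`.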


###test 3 - manual verification

**3. Scroll through table manually**

See if anything stands out - empty columns, etc.

The BigQuery table search user interface is useful for this test. The test tier points to isb-etl-open.

ISB-CGC BigQuery table search [test tier](https://isb-cgc-test.appspot.com/bq_meta_search/).

BigQuery console [isb-project-zero](https://console.cloud.google.com/bigquery?project=high-transit-276919&authuser=2&p=isb-project-zero&d=GDC_metadata&t=rel24_caseData&page=table).

Run a manual check in the console using the questions from test 1:

Are all the fields labeled?

Is there a table description?

Do the field labels make sense for all fields?
    
Are the labels correct?

##GDC fileData_current

**Testing Full ID** `isb-project-zero.GDC_metadata.rel24_fileData_current`

[Table location](https://console.cloud.google.com/bigquery?project=high-transit-276919&authuser=2&p=isb-project-zero&d=GDC_metadata&t=rel23_fileData_current&page=table)

Source : GDC API

Date Created : 	Apr 17, 2020, 6:20:59 PM

Release version : v24

###test 1 - schema verification

**1. Check schema**

Are all the fields labeled?

Is there a table description?

Do the field labels make sense for all fields?

Are the labels correct?

Google documentation column descriptions for [reference](https://cloud.google.com/bigquery/docs/information-schema-tables#column_field_paths_view).

Google documentation table options for [reference](https://cloud.google.com/bigquery/docs/information-schema-tables#options_table).

In [0]:
#return all table information for dataset rel24_fileData_current

meta_table = Table('`isb-project-zero`.GDC_metadata.INFORMATION_SCHEMA.TABLES')
meta_query = Query.from_(meta_table) \
                  .select(' table_catalog, table_schema, table_name, table_type ') \
                  .where(meta_table.table_name=='rel24_fileData_current') \
                  
meta_query_clean = str(meta_query).replace('"', "")
meta = client.query(meta_query_clean).to_dataframe()
meta.head()

Unnamed: 0,table_catalog,table_schema,table_name,table_type
0,isb-project-zero,GDC_metadata,rel24_fileData_current,BASE TABLE


In [0]:
#return all table information for dataset rel24_fileData_current

meta_table = Table('`isb-project-zero`.GDC_metadata.INFORMATION_SCHEMA.TABLE_OPTIONS')
meta_query = Query.from_(meta_table) \
                  .select(' table_name, option_name, option_type, option_value ') \
                  .where(meta_table.table_name=='rel24_fileData_current') \

meta_query_clean = str(meta_query).replace('"', "")
meta = client.query(meta_query_clean).to_dataframe()
pandas.options.display.max_rows


for i in range(len(meta)):
    print(meta['option_name'][i] + '\n')
    print('\t' + meta['option_value'][i] + '\n')
    print('\t' + meta['option_type'][i] + '\n')

friendly_name

	"RELEASE 24 FILE DATA ACTIVE"

	STRING

description

	"Data was generated from file metadata information from the GDC active archive release 24, downloaded May 2020. \nMore details: https://docs.gdc.cancer.gov/Data/Release_Notes/Data_Release_Notes/#data-release-240"

	STRING

labels

	[STRUCT("source", "gdc"), STRUCT("access", "open"), STRUCT("category", "file_metadata"), STRUCT("status", "current"), STRUCT("data_type", "file_metadata")]

	ARRAY<STRUCT<STRING, STRING>>



In [0]:
#check for empty schemas in dataset rel24_fileData_current 
meta_table = Table('`isb-project-zero`.GDC_metadata.INFORMATION_SCHEMA.TABLE_OPTIONS')
meta_query = Query.from_(meta_table) \
                  .select(' table_name, option_name, option_type, option_value ') \
                  .where(meta_table.table_name=='rel24_fileData_current') \

meta_query_clean = str(meta_query).replace('"', "")
meta = client.query(meta_query_clean).to_dataframe()
pandas.options.display.max_rows
print("Are there any empty cells in the table schema?")
pandas.isnull(meta).values.any()

NameError: ignored
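The `NameError` above means the cell referenced a name that was not defined in the session when it ran. A common cause in Colab is a runtime restart without re-running the setup cells, though the output here does not show which name was missing. The sketch below is a minimal guard that could be run before the QC cells; `missing_names` is a hypothetical helper and the list of required names is taken from the setup cells in this notebook.

```python
def missing_names(required, namespace):
    """Return the required names that are not defined in the given namespace."""
    return [name for name in required if name not in namespace]


# Names created by the setup cells at the top of the notebook.
required = ["client", "Table", "Query", "pandas"]
missing = missing_names(required, globals())
if missing:
    print("Re-run the setup cells; undefined names:", ", ".join(missing))
else:
    print("All setup names are defined.")
```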

Example of the field descriptions pulled from the schema:

In [0]:
# List the field descriptions for table rel24_fileData_current

meta_table = Table('`isb-project-zero`.GDC_metadata.INFORMATION_SCHEMA.COLUMN_FIELD_PATHS')
meta_query = Query.from_(meta_table) \
                  .select('table_name, column_name, description') \
                  .where(meta_table.table_name=='rel24_fileData_current')

# pypika double-quotes identifiers, which BigQuery reads as string literals, so strip the quotes
meta_query_clean = str(meta_query).replace('"', "")
meta = client.query(meta_query_clean).to_dataframe()

for i in range(len(meta)):
  print(meta['table_name'][i] + '\n')
  print('\t' + meta['column_name'][i] + '\n')
  print('\t' + meta['description'][i] + '\n')

rel24_fileData_current

	dbName

	Referring to the default/main database at the GDC. In this table, this field is always current.

rel24_fileData_current

	file_gdc_id

	Unique GDC identifier for this file (corresponds to the file_barcode)  --  this can be used to access more information from the GDC data portal like this: https://portal.gdc.cancer.gov/files/c21b332c-06c6-4403-9032-f91c8f407ba4

rel24_fileData_current

	access

	Data accessibility policy (open or controlled)

rel24_fileData_current

	acl

	Access Control List -- if access is controlled, this field contains the dbGaP accession, eg phs000179

rel24_fileData_current

	analysis_input_file_gdc_ids

	GDC file UUIDs for input file(s)

rel24_fileData_current

	analysis_workflow_link

	Link to GDC workflow on github

rel24_fileData_current

	analysis_workflow_type

	Workflow name eg  STAR 2-Pass  

rel24_fileData_current

	archive_gdc_id

	GDC archive UUID

rel24_fileData_current

	archive_revision

	Archive revision number

re

In [0]:
# Check for empty (null) cells in the field descriptions of rel24_fileData_current

meta_table = Table('`isb-project-zero`.GDC_metadata.INFORMATION_SCHEMA.COLUMN_FIELD_PATHS')
meta_query = Query.from_(meta_table) \
                  .select('table_name, column_name, description') \
                  .where(meta_table.table_name=='rel24_fileData_current')

# pypika double-quotes identifiers, which BigQuery reads as string literals, so strip the quotes
meta_query_clean = str(meta_query).replace('"', "")
meta = client.query(meta_query_clean).to_dataframe()
print("Are there any empty cells in the table schema?")
pandas.isnull(meta).values.any()

Are there any empty cells in the table schema?


False
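
The boolean above only says whether any cell is null. A minimal sketch of a more detailed check, assuming `meta` has the `column_name` and `description` columns selected above. The helper name `find_missing_descriptions` and the sample rows are hypothetical, and treating whitespace-only strings as missing is an assumption:

```python
import pandas

def find_missing_descriptions(meta):
    """Return the column_name values whose description is null or blank."""
    desc = meta['description']
    # Null cells, or strings that are empty after stripping whitespace
    missing = desc.isna() | (desc.fillna('').astype(str).str.strip() == '')
    return meta.loc[missing, 'column_name'].tolist()

# Hypothetical sample rows, for illustration only (not query output)
sample = pandas.DataFrame({
    'table_name': ['t', 't', 't'],
    'column_name': ['access', 'acl', 'file_size'],
    'description': ['Data accessibility policy', None, '   '],
})
print(find_missing_descriptions(sample))
```

Passing the real `meta` dataframe instead of `sample` would list the fields that still need a description.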

###test 2 - row number verification

**2. Look at table row number and size**

Do these metrics make sense?

In [0]:
%%bigquery --project isb-project-zero
SELECT COUNT(project_short_name)
FROM `isb-project-zero.GDC_metadata.rel24_fileData_current`

Unnamed: 0,f0_
0,570844


In [0]:
%%bigquery --project isb-project-zero
SELECT COUNT(project_short_name)
FROM `isb-project-zero.GDC_metadata.rel23_fileData_current`

Unnamed: 0,f0_
0,559345
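
One way to answer "Do these metrics make sense?" is to compare the two counts directly. A minimal sketch using the counts returned above; the helper name `count_change` is hypothetical, and what size of change is acceptable is left to the reviewer:

```python
def count_change(new_count, old_count):
    """Return (absolute change, percent change) between two releases."""
    delta = new_count - old_count
    pct = 100.0 * delta / old_count if old_count else float('nan')
    return delta, pct

# Counts returned by the rel24 and rel23 queries above
delta, pct = count_change(570844, 559345)
print(f"rel24 - rel23 = {delta} ({pct:.2f}% change)")
```

Note that `COUNT(project_short_name)` counts only non-null values of that column, whereas `COUNT(*)` counts every row.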


###test 3 - manual verification

**3. Scroll through table manually**

See if anything stands out - empty columns, etc.

The BigQuery table search user interface is useful for this test. The test tier points to isb-etl-open.

ISB-CGC BigQuery table search [test tier](https://isb-cgc-test.appspot.com/bq_meta_search/).

BigQuery console [isb-project-zero](https://console.cloud.google.com/bigquery?project=high-transit-276919&authuser=2&p=isb-project-zero&d=GDC_metadata&t=rel23_fileData_current&page=table).

Run a manual check in the console, using the questions from test 1:

Are all the fields labeled?

Is there a table description?

Do the field labels make sense for all fields?
    
Are the labels correct?
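
Scanning for empty columns can also be done programmatically on a sample of rows. A minimal sketch, assuming a pandas DataFrame of table rows is already loaded (for example with `client.query(...).to_dataframe()`); the helper name `empty_columns` and the sample data are hypothetical:

```python
import pandas

def empty_columns(df):
    """Return the names of columns in which every value is null."""
    return df.columns[df.isna().all()].tolist()

# Hypothetical sample rows, for illustration only (not query output)
sample = pandas.DataFrame({
    'file_gdc_id': ['a', 'b'],
    'access': ['open', 'controlled'],
    'archive_revision': [None, None],
})
print(empty_columns(sample))
```

A column that is empty in a sample may still hold values elsewhere in the table, so this supplements the manual scroll rather than replacing it.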

##GDC fileData_legacy

**Testing Full ID** `isb-project-zero.GDC_metadata.rel24_fileData_legacy`

[Table location](https://console.cloud.google.com/bigquery?project=high-transit-276919&authuser=2&p=isb-project-zero&d=GDC_metadata&t=rel23_fileData_legacy&page=table)

Source : GDC API

Date Created : Apr 17, 2020, 6:19:47 PM

Release version : v24

###test 1 - schema verification

**1. Check schema**

Are all the fields labeled?

Is there a table description?

Do the field labels make sense for all fields?
    
Are the labels correct?

For reference, see the Google documentation on [column descriptions](https://cloud.google.com/bigquery/docs/information-schema-tables#column_field_paths_view).

For reference, see the Google documentation on [table options](https://cloud.google.com/bigquery/docs/information-schema-tables#options_table).

In [0]:
# Confirm that table rel24_fileData_legacy exists and check its type

meta_table = Table('`isb-project-zero`.GDC_metadata.INFORMATION_SCHEMA.TABLES')
meta_query = Query.from_(meta_table) \
                  .select(' table_catalog, table_schema, table_name, table_type ') \
                  .where(meta_table.table_name=='rel24_fileData_legacy')
# pypika double-quotes identifiers, which BigQuery reads as string literals, so strip the quotes
meta_query_clean = str(meta_query).replace('"', "")
meta = client.query(meta_query_clean).to_dataframe()
meta.head()

Unnamed: 0,table_catalog,table_schema,table_name,table_type
0,isb-project-zero,GDC_metadata,rel24_fileData_legacy,BASE TABLE
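
The cells in this section all follow the same pattern: pick an `INFORMATION_SCHEMA` view, select a few columns, and filter on `table_name`. A minimal sketch of that pattern as a plain-string helper, which avoids having to strip the double quotes that pypika adds. The function name `info_schema_query` is hypothetical, and `client` is the BigQuery client created earlier in the notebook:

```python
def info_schema_query(project, dataset, view, columns, table_name):
    """Build a query against a dataset-level INFORMATION_SCHEMA view."""
    return (
        f"SELECT {', '.join(columns)} "
        f"FROM `{project}`.{dataset}.INFORMATION_SCHEMA.{view} "
        f"WHERE table_name = '{table_name}'"
    )

sql = info_schema_query(
    'isb-project-zero', 'GDC_metadata', 'COLUMN_FIELD_PATHS',
    ['table_name', 'column_name', 'description'],
    'rel24_fileData_legacy',
)
print(sql)
# To run it: meta = client.query(sql).to_dataframe()
```

String interpolation is acceptable here because the names are fixed; for user-supplied values, query parameters would be safer.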


In [0]:
# Return the table options (friendly name, description, labels) for table rel24_fileData_legacy

meta_table = Table('`isb-project-zero`.GDC_metadata.INFORMATION_SCHEMA.TABLE_OPTIONS')
meta_query = Query.from_(meta_table) \
                  .select(' table_name, option_name, option_type, option_value ') \
                  .where(meta_table.table_name=='rel24_fileData_legacy')

# pypika double-quotes identifiers, which BigQuery reads as string literals, so strip the quotes
meta_query_clean = str(meta_query).replace('"', "")
meta = client.query(meta_query_clean).to_dataframe()


for i in range(len(meta)):
    print(meta['option_name'][i] + '\n')
    print('\t' + meta['option_value'][i] + '\n')
    print('\t' + meta['option_type'][i] + '\n')

friendly_name

	"RELEASE 24 FILE DATA LEGACY"

	STRING

description

	"Data was generated from file metadata information from the GDC legacy archive release 24, downloaded May, 2020. \nMore details: https://docs.gdc.cancer.gov/Data/Release_Notes/Data_Release_Notes/#data-release-240"

	STRING

labels

	[STRUCT("access", "open"), STRUCT("data_type", "file_metadata"), STRUCT("status", "current"), STRUCT("source", "gdc"), STRUCT("category", "file_metadata")]

	ARRAY<STRUCT<STRING, STRING>>
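
To check the labels against expected values rather than by eye, the `option_value` string can be parsed. A minimal sketch, assuming the labels keep the `STRUCT("key", "value")` form printed above; the helper name `parse_labels` is hypothetical:

```python
import re

def parse_labels(option_value):
    """Parse '[STRUCT("k", "v"), ...]' into a dict of label keys to values."""
    pairs = re.findall(r'STRUCT\("([^"]*)",\s*"([^"]*)"\)', option_value)
    return dict(pairs)

# The labels value printed above for rel24_fileData_legacy
labels = parse_labels(
    '[STRUCT("access", "open"), STRUCT("data_type", "file_metadata"), '
    'STRUCT("status", "current"), STRUCT("source", "gdc"), '
    'STRUCT("category", "file_metadata")]'
)
print(labels['status'])
```

The parsed `status` value can then be compared with what is expected for a legacy table.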



In [0]:
# Check for empty (null) cells in the table options of rel24_fileData_legacy

meta_table = Table('`isb-project-zero`.GDC_metadata.INFORMATION_SCHEMA.TABLE_OPTIONS')
meta_query = Query.from_(meta_table) \
                  .select(' table_name, option_name, option_type, option_value ') \
                  .where(meta_table.table_name=='rel24_fileData_legacy')

# pypika double-quotes identifiers, which BigQuery reads as string literals, so strip the quotes
meta_query_clean = str(meta_query).replace('"', "")
meta = client.query(meta_query_clean).to_dataframe()
print("Are there any empty cells in the table schema?")
pandas.isnull(meta).values.any()

Are there any empty cells in the table schema?


False

Example of the field descriptions pulled from the table:




In [0]:
# List the field descriptions for table rel24_fileData_legacy

meta_table = Table('`isb-project-zero`.GDC_metadata.INFORMATION_SCHEMA.COLUMN_FIELD_PATHS')
meta_query = Query.from_(meta_table) \
                  .select('table_name, column_name, description') \
                  .where(meta_table.table_name=='rel24_fileData_legacy')

# pypika double-quotes identifiers, which BigQuery reads as string literals, so strip the quotes
meta_query_clean = str(meta_query).replace('"', "")
meta = client.query(meta_query_clean).to_dataframe()

for i in range(len(meta)):
  print(meta['table_name'][i] + '\n')
  print('\t' + meta['column_name'][i] + '\n')
  print('\t' + meta['description'][i] + '\n')

rel24_fileData_legacy

	dbName

	Referring to the default/main database at the GDC. In this table, this field is always legacy.

rel24_fileData_legacy

	file_gdc_id

	Unique GDC identifier for this file (corresponds to the file_barcode)  --  this can be used to access more information from the GDC data portal like this: https://portal.gdc.cancer.gov/files/c21b332c-06c6-4403-9032-f91c8f407ba5

rel24_fileData_legacy

	access

	Data accessibility policy (open or controlled)

rel24_fileData_legacy

	acl

	Access Control List -- if access is controlled, this field contains the dbGaP accession, eg phs000179

rel24_fileData_legacy

	archive_gdc_id

	GDC archive UUID

rel24_fileData_legacy

	archive_revision

	Archive revision number

rel24_fileData_legacy

	archive_state

	Archive state

rel24_fileData_legacy

	archive_submitter_id

	Original archive sumbitter name, eg  nationwidechildrens.org_BRCA.bio.Level_1.56

rel24_fileData_legacy

	associated_entities__case_gdc_id

	GDC case uuid for as

In [0]:
# Check for empty (null) cells in the field descriptions of rel24_fileData_legacy

meta_table = Table('`isb-project-zero`.GDC_metadata.INFORMATION_SCHEMA.COLUMN_FIELD_PATHS')
meta_query = Query.from_(meta_table) \
                  .select('table_name, column_name, description') \
                  .where(meta_table.table_name=='rel24_fileData_legacy')

# pypika double-quotes identifiers, which BigQuery reads as string literals, so strip the quotes
meta_query_clean = str(meta_query).replace('"', "")
meta = client.query(meta_query_clean).to_dataframe()
print("Are there any empty cells in the table schema?")
pandas.isnull(meta).values.any()

Are there any empty cells in the table schema?


False

###test 2 - row number verification

**2. Look at table row number and size**

Do these metrics make sense?

In [0]:
%%bigquery --project isb-project-zero
SELECT COUNT(project_short_name)
FROM `isb-project-zero.GDC_metadata.rel24_fileData_legacy`

Unnamed: 0,f0_
0,761543


In [0]:
%%bigquery --project isb-project-zero
SELECT COUNT(project_short_name)
FROM `isb-project-zero.GDC_metadata.rel23_fileData_legacy`

Unnamed: 0,f0_
0,761543


###test 3 - manual verification

**3. Scroll through table manually**

See if anything stands out - empty columns, etc.

The BigQuery table search user interface is useful for this test. The test tier points to isb-etl-open.

ISB-CGC BigQuery table search [test tier](https://isb-cgc-test.appspot.com/bq_meta_search/).

BigQuery console [isb-project-zero](https://console.cloud.google.com/bigquery?project=high-transit-276919&authuser=2&p=isb-project-zero&d=GDC_metadata&t=rel23_fileData_legacy&page=table).

Run a manual check in the console, using the questions from test 1:

Are all the fields labeled?

Is there a table description?

Do the field labels make sense for all fields?
    
Are the labels correct?

##GDC slide2caseIDmap

**Testing Full ID** `isb-project-zero.GDC_metadata.rel24_slide2caseIDmap`

[Table location](https://console.cloud.google.com/bigquery?project=high-transit-276919&authuser=2&p=isb-project-zero&d=GDC_metadata&t=rel23_fileData_slide2caseIDmap&page=table)

Source : GDC API

Date Created : May 18, 2020, 1:20:41 PM

Release version : v24

###test 1 - schema verification

**1. Check schema**

Are all the fields labeled?

Is there a table description?

Do the field labels make sense for all fields?
    
Are the labels correct?

For reference, see the Google documentation on [column descriptions](https://cloud.google.com/bigquery/docs/information-schema-tables#column_field_paths_view).

For reference, see the Google documentation on [table options](https://cloud.google.com/bigquery/docs/information-schema-tables#options_table).

In [0]:
# Confirm that table rel24_slide2caseIDmap exists and check its type

meta_table = Table('`isb-project-zero`.GDC_metadata.INFORMATION_SCHEMA.TABLES')
meta_query = Query.from_(meta_table) \
                  .select(' table_catalog, table_schema, table_name, table_type ') \
                  .where(meta_table.table_name=='rel24_slide2caseIDmap')
# pypika double-quotes identifiers, which BigQuery reads as string literals, so strip the quotes
meta_query_clean = str(meta_query).replace('"', "")
meta = client.query(meta_query_clean).to_dataframe()
meta.head()

Unnamed: 0,table_catalog,table_schema,table_name,table_type
0,isb-project-zero,GDC_metadata,rel24_slide2caseIDmap,BASE TABLE


In [0]:
# Return the table options (friendly name, description, labels) for table rel24_slide2caseIDmap

meta_table = Table('`isb-project-zero`.GDC_metadata.INFORMATION_SCHEMA.TABLE_OPTIONS')
meta_query = Query.from_(meta_table) \
                  .select(' table_name, option_name, option_type, option_value ') \
                  .where(meta_table.table_name=='rel24_slide2caseIDmap')

# pypika double-quotes identifiers, which BigQuery reads as string literals, so strip the quotes
meta_query_clean = str(meta_query).replace('"', "")
meta = client.query(meta_query_clean).to_dataframe()


for i in range(len(meta)):
    print(meta['option_name'][i] + '\n')
    print('\t' + meta['option_value'][i] + '\n')
    print('\t' + meta['option_type'][i] + '\n')

friendly_name

	"RELEASE 24 SLIDE IMAGES TO CASE IDS"

	STRING

description

	"Data was generated from file metadata information from the GDC release 24, downloaded May, 2020. Image slides are mapped to case information.\nMore details: https://docs.gdc.cancer.gov/Data/Release_Notes/Data_Release_Notes/#data-release-240"

	STRING

labels

	[STRUCT("access", "open"), STRUCT("source", "gdc"), STRUCT("data_type", "file_metadata"), STRUCT("category", "file_metadata"), STRUCT("status", "current")]

	ARRAY<STRUCT<STRING, STRING>>



In [0]:
# Check for empty (null) cells in the table options of rel24_slide2caseIDmap

meta_table = Table('`isb-project-zero`.GDC_metadata.INFORMATION_SCHEMA.TABLE_OPTIONS')
meta_query = Query.from_(meta_table) \
                  .select(' table_name, option_name, option_type, option_value ') \
                  .where(meta_table.table_name=='rel24_slide2caseIDmap')

# pypika double-quotes identifiers, which BigQuery reads as string literals, so strip the quotes
meta_query_clean = str(meta_query).replace('"', "")
meta = client.query(meta_query_clean).to_dataframe()
print("Are there any empty cells in the table schema?")
pandas.isnull(meta).values.any()

Are there any empty cells in the table schema?


False

Example of the field descriptions pulled from the table:


In [0]:
# List the field descriptions for table rel24_slide2caseIDmap

meta_table = Table('`isb-project-zero`.GDC_metadata.INFORMATION_SCHEMA.COLUMN_FIELD_PATHS')
meta_query = Query.from_(meta_table) \
                  .select('table_name, column_name, description') \
                  .where(meta_table.table_name=='rel24_slide2caseIDmap')

# pypika double-quotes identifiers, which BigQuery reads as string literals, so strip the quotes
meta_query_clean = str(meta_query).replace('"', "")
meta = client.query(meta_query_clean).to_dataframe()

for i in range(len(meta)):
  print(meta['table_name'][i] + '\n')
  print('\t' + meta['column_name'][i] + '\n')
  print('\t' + meta['description'][i] + '\n')

rel24_slide2caseIDmap

	program_name

	Program name, eg TCGA or TARGET

rel24_slide2caseIDmap

	project_id

	ID of the Project -- e.g. TCGA-BRCA, TARGET-AML, etc.

rel24_slide2caseIDmap

	case_gdc_id

	GDC unique identifier

rel24_slide2caseIDmap

	case_barcode

	Submitter identifier, eg HCC1954, TCGA-ZX-AA5X, etc

rel24_slide2caseIDmap

	sample_gdc_id

	Unique GDC identifier for this sample (corresponds to the sample_barcode), eg a1ec9279-c1a6-4e58-97ed-9ec1f36187c5  --  this can be used to access more information from the GDC data portal

rel24_slide2caseIDmap

	sample_barcode

	Original/submitter sample ID

rel24_slide2caseIDmap

	sample_type

	Two-digit sample_type code which forms part of the sample barcode eg. 01,11,06, etc. (OR NA)

rel24_slide2caseIDmap

	sample_type_name

	The longer name of the sample type; eg. Primary Tumor, Recurrent Tumor, etc.

rel24_slide2caseIDmap

	portion_gdc_id

	Original/submitter portion ID

rel24_slide2caseIDmap

	portion_barcode

	Original/submit

In [0]:
# Check for empty (null) cells in the field descriptions of rel24_slide2caseIDmap

meta_table = Table('`isb-project-zero`.GDC_metadata.INFORMATION_SCHEMA.COLUMN_FIELD_PATHS')
meta_query = Query.from_(meta_table) \
                  .select('table_name, column_name, description') \
                  .where(meta_table.table_name=='rel24_slide2caseIDmap')

# pypika double-quotes identifiers, which BigQuery reads as string literals, so strip the quotes
meta_query_clean = str(meta_query).replace('"', "")
meta = client.query(meta_query_clean).to_dataframe()
print("Are there any empty cells in the table schema?")
pandas.isnull(meta).values.any()

Are there any empty cells in the table schema?


False
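
A null check does not catch a description that was copied onto the wrong field. A minimal sketch that flags descriptions shared by more than one column, assuming `meta` has the `column_name` and `description` columns selected above; the helper name `shared_descriptions` and the sample rows are hypothetical:

```python
import pandas

def shared_descriptions(meta):
    """Map each description used by more than one column to those columns."""
    grouped = meta.groupby('description')['column_name'].apply(list)
    return {desc: cols for desc, cols in grouped.items() if len(cols) > 1}

# Hypothetical sample rows, for illustration only (not query output)
sample = pandas.DataFrame({
    'column_name': ['col_a', 'col_b', 'col_c'],
    'description': ['Shared text', 'Shared text', 'Unique text'],
})
print(shared_descriptions(sample))
```

A shared description is not necessarily wrong, so each hit still needs a manual look.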

###test 2 - row number verification

**2. Look at table row number and size**

Do these metrics make sense?

In [0]:
%%bigquery --project isb-project-zero
SELECT COUNT(project_id)
FROM `isb-project-zero.GDC_metadata.rel24_slide2caseIDmap`

Unnamed: 0,f0_
0,48781


In [0]:
%%bigquery --project isb-project-zero
SELECT COUNT(project_id)
FROM `isb-project-zero.GDC_metadata.rel23_slide2caseIDmap`

Unnamed: 0,f0_
0,48781
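
Both counts can also be fetched in one query, using `COUNT(*)` so that rows with a null `project_id` are included. A minimal sketch; the function name `release_count_query` and the output column names are hypothetical, and `client` is the BigQuery client created earlier in the notebook:

```python
def release_count_query(project, dataset, table_suffix, releases):
    """Build one query returning COUNT(*) per release of a table."""
    parts = [
        f"SELECT '{rel}' AS release_name, COUNT(*) AS row_count "
        f"FROM `{project}.{dataset}.{rel}_{table_suffix}`"
        for rel in releases
    ]
    return "\nUNION ALL\n".join(parts)

sql = release_count_query(
    'isb-project-zero', 'GDC_metadata', 'slide2caseIDmap', ['rel24', 'rel23'])
print(sql)
# To run it: counts = client.query(sql).to_dataframe()
```

Having both releases in one result makes an unchanged count, as seen above, easy to spot.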


###test 3 - manual verification

**3. Scroll through table manually**

See if anything stands out - empty columns, etc.

The BigQuery table search user interface is useful for this test. The test tier points to isb-etl-open.

ISB-CGC BigQuery table search [test tier](https://isb-cgc-test.appspot.com/bq_meta_search/).

BigQuery console [isb-project-zero](https://console.cloud.google.com/bigquery?project=high-transit-276919&authuser=2&p=isb-project-zero&d=GDC_metadata&t=rel23_fileData_slide2caseIDmap&page=table).

Run a manual check in the console, using the questions from test 1:

Are all the fields labeled?

Is there a table description?

Do the field labels make sense for all fields?
    
Are the labels correct?