QC of ETL starting with GDC release 22 for programs TARGET, ORGANOID, and BEATAML. 


This notebook focuses on the QC of program BEATAML data_type RNA-Seq

##QC table checklist 

**1. Check schema**

Are all the fields labeled?

Is there a table description?

Do the field labels make sense for all fields?
    
Are the labels correct?

**2. Look at table row number and size**

Do these metrics make sense?

**3. Scroll through table manually**

See if anything stands out - empty columns, etc.

**4. Number of cases on GDC portal versus table?**

**5. Number of cases / aliquots versus BigQuery metadata table**

**6. Number of entries per gene - should equal aliquot count**

##Reference material



*   [NextGenETL](https://github.com/isb-cgc/NextGenETL) GitHub repository
*   [ETL QC SOP draft](https://docs.google.com/document/d/1Wskf3BxJLkMjhIXD62B6_TG9h5KRcSp8jSAGqcCP1lQ/edit)

##Before you begin

Before running any tests, load the BigQuery module, authenticate yourself, create a client variable, and load the necessary libraries.


In [2]:
from google.colab import auth
try:
  auth.authenticate_user()
  print('You have been successfully authenticated!')
except Exception as err:
  # Report the reason so a failed login can be diagnosed.
  print('You have not been authenticated: ' + str(err))

You have been successfully authenticated!


In [9]:
from google.cloud import bigquery
try:
  project_id = 'isb-project-zero' # Replace with your own Google Cloud project ID
  client = bigquery.Client(project=project_id)
  print('BigQuery client successfully initialized')
except Exception as err:
  print('Failed to initialize the BigQuery client: ' + str(err))

BigQuery client successfully initialized


In [4]:
#Install pypika to build a Query 
!pip install pypika
# Import from PyPika
from pypika import Query, Table, Field, Order

import pandas


## READY TO BEGIN TESTING

##Program BEATAML

**Testing Full ID** `isb-project-zero.fs_scratch.TARGET_hg38_data_v0_RNAseq_Gene_Expression`

[Table location](https://console.cloud.google.com/bigquery?project=isb-project-zero&p=isb-project-zero&d=fs_scratch&t=TARGET_hg38_data_v0_RNAseq_Gene_Expression&page=table)

Source : GDC buckets

Date Created : 	May 4, 2020, 6:32:13 PM

Release version : v22


###test 1 - schema verification

**1. Check schema**

Are all the fields labeled?

Is there a table description?

Do the field labels make sense for all fields?

Are the labels correct?

For column descriptions, see the Google documentation on the [COLUMN_FIELD_PATHS view](https://cloud.google.com/bigquery/docs/information-schema-tables#column_field_paths_view).

For table options, see the Google documentation on the [TABLE_OPTIONS view](https://cloud.google.com/bigquery/docs/information-schema-tables#options_table).
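
The cells in this test all query an `INFORMATION_SCHEMA` view filtered by table name. As a rough sketch, that pattern could be factored into one helper that builds the SQL string directly, so the double quotes added by PyPika do not need to be stripped afterwards. The function name `information_schema_sql` and its parameters are hypothetical and not part of NextGenETL; the project, dataset, and table names are the ones already used in this notebook.

```python
def information_schema_sql(project, dataset, view, columns, table_name):
    """Build a SELECT against a dataset-level INFORMATION_SCHEMA view.

    Hypothetical helper for illustration. It only builds a string and does
    not validate its inputs, so pass trusted values only.
    """
    column_list = ", ".join(columns)
    return (
        f"SELECT {column_list}\n"
        f"FROM `{project}`.{dataset}.INFORMATION_SCHEMA.{view}\n"
        f"WHERE table_name = '{table_name}'"
    )


sql = information_schema_sql(
    project="isb-project-zero",
    dataset="fs_scratch",
    view="COLUMN_FIELD_PATHS",
    columns=["table_name", "column_name", "description"],
    table_name="TARGET_hg38_data_v0_RNAseq_Gene_Expression",
)
print(sql)
```

The resulting string could then be passed to `client.query(sql).to_dataframe()` in the same way as the cleaned PyPika output.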

In [5]:
# Return basic table information for the RNA-seq gene expression table
rnaseq_table = Table('`isb-project-zero`.fs_scratch.INFORMATION_SCHEMA.TABLES')
rnaseq_query = Query.from_(rnaseq_table) \
                  .select(' table_catalog, table_schema, table_name, table_type ') \
                  .where(rnaseq_table.table_name=='TARGET_hg38_data_v0_RNAseq_Gene_Expression')

# PyPika wraps identifiers in double quotes, which BigQuery standard SQL
# treats as string literals, so strip them before running the query.
rnaseq_query_clean = str(rnaseq_query).replace('"', "")
rnaseq = client.query(rnaseq_query_clean).to_dataframe()
rnaseq.head()

NameError: ignored

In [0]:
# Return the table options (friendly name, description, labels) for the table
rnaseq_table = Table('`isb-project-zero`.fs_scratch.INFORMATION_SCHEMA.TABLE_OPTIONS')
rnaseq_query = Query.from_(rnaseq_table) \
                  .select(' table_name, option_name, option_type, option_value ') \
                  .where(rnaseq_table.table_name=='TARGET_hg38_data_v0_RNAseq_Gene_Expression')

# Strip the double quotes PyPika adds around identifiers
rnaseq_query_clean = str(rnaseq_query).replace('"', "")
rnaseq = client.query(rnaseq_query_clean).to_dataframe()

# Print each option name followed by its value and type
for i in range(len(rnaseq)):
    print(rnaseq['option_name'][i] + '\n')
    print('\t' + rnaseq['option_value'][i] + '\n')
    print('\t' + rnaseq['option_type'][i] + '\n')

friendly_name

	"TARGET HG38 RNASEQ GENE EXPRESSION"

	STRING

description

	"Data was extracted from active archive of the GDC on December 2018 for RNAseq data for TARGET samples."

	STRING

labels

	[STRUCT("category", "processed_-omics_data"), STRUCT("experimental_strategy", "rnaseq"), STRUCT("source", "gdc"), STRUCT("status", "current"), STRUCT("program", "target"), STRUCT("access", "open"), STRUCT("reference_genome_0", "hg38"), STRUCT("data_type", "gene_expression")]

	ARRAY<STRUCT<STRING, STRING>>



In [0]:
# Check for empty cells in the table options of the table
rnaseq_table = Table('`isb-project-zero`.fs_scratch.INFORMATION_SCHEMA.TABLE_OPTIONS')
rnaseq_query = Query.from_(rnaseq_table) \
                  .select(' table_name, option_name, option_type, option_value ') \
                  .where(rnaseq_table.table_name=='TARGET_hg38_data_v0_RNAseq_Gene_Expression')

# Strip the double quotes PyPika adds around identifiers
rnaseq_query_clean = str(rnaseq_query).replace('"', "")
rnaseq = client.query(rnaseq_query_clean).to_dataframe()
print("Are there any empty cells in the table schema?")
pandas.isnull(rnaseq).values.any()

Are there any empty cells in the table schema?


False

An example of pulling the field descriptions is below.


In [0]:
# List the field descriptions for the table
rnaseq_table = Table('`isb-project-zero`.fs_scratch.INFORMATION_SCHEMA.COLUMN_FIELD_PATHS')
rnaseq_query = Query.from_(rnaseq_table) \
                  .select('table_name, column_name, description') \
                  .where(rnaseq_table.table_name=='TARGET_hg38_data_v0_RNAseq_Gene_Expression')

# Strip the double quotes PyPika adds around identifiers
rnaseq_query_clean = str(rnaseq_query).replace('"', "")
rnaseq = client.query(rnaseq_query_clean).to_dataframe()

# Print each column name followed by its description
for i in range(len(rnaseq)):
  print(rnaseq['column_name'][i] + '\n')
  print('\t' + rnaseq['description'][i] + '\n')

project_short_name

	Project name abbreviation, eg TARGET-AML

case_barcode

	Original TARGET case barcode, eg TARGET-20-PASCGR

sample_barcode

	TARGET sample barcode, eg TARGET-20-PASWAT-09A

aliquot_barcode

	TARGET aliquot barcode, eg TARGET-20-PAJLIP-01A-01R

gene_name

	Gene name e.g. TTN, DDR1, etc

gene_type

	The type of genetic element the reads mapped to, eg protein_coding, ribozyme

Ensembl_gene_id

	The Ensembl gene ID from the underlying file, but stripped of the version suffix  --  eg ENSG00000185028

Ensembl_gene_id_v

	The Ensembl gene ID from the underlying file, including the version suffix  --  eg ENSG00000235943.1

HTSeq__Counts

	Number of mapped reads to each gene as calculated by the Python package HTSeq. https://docs.gdc.cancer.gov/Encyclopedia/pages/HTSeq-Counts/

HTSeq__FPKM

	FPKM is implemented at the GDC on gene-level read counts that are produced by HTSeq1 and generated using custom. scripts https://docs.gdc.cancer.gov/Encyclopedia/pages/HTSeq-FPKM/

HTSe

In [0]:
# Check for empty cells in the field descriptions of the table
rnaseq_table = Table('`isb-project-zero`.RNAseq_Gene_Expression.INFORMATION_SCHEMA.COLUMN_FIELD_PATHS')
rnaseq_query = Query.from_(rnaseq_table) \
                  .select('table_name, column_name, description') \
                  .where(rnaseq_table.table_name=='BEATAML1_0_RNAseq_Gene_Expression')

# Strip the double quotes PyPika adds around identifiers
rnaseq_query_clean = str(rnaseq_query).replace('"', "")
rnaseq = client.query(rnaseq_query_clean).to_dataframe()
print("Are there any empty cells in the table schema?")
pandas.isnull(rnaseq).values.any()

Are there any empty cells in the table schema?


False
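
`pandas.isnull` flags only missing values, so a description stored as an empty or whitespace-only string would still pass the check above. A minimal sketch of a stricter check follows, written against plain dictionaries so it runs without BigQuery access. The name `undescribed_columns` and the sample rows are hypothetical.

```python
def undescribed_columns(rows):
    """Return column names whose description is missing or blank.

    `rows` is an iterable of dicts with 'column_name' and 'description' keys.
    """
    missing = []
    for row in rows:
        description = row.get("description")
        is_nan = description != description  # NaN is the only value unequal to itself
        if description is None or is_nan or not str(description).strip():
            missing.append(row["column_name"])
    return missing


# Made-up rows for illustration; this is not real query output.
sample_rows = [
    {"column_name": "case_barcode", "description": "Original case barcode"},
    {"column_name": "gene_name", "description": "   "},
    {"column_name": "platform", "description": None},
]
print(undescribed_columns(sample_rows))
```

An empty list from a check like this would mean every field carries a non-blank description.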

###test 2 - row number verification

**2. Look at table row number and size**

Do these metrics make sense?

In [0]:
%%bigquery --project isb-project-zero
SELECT COUNT(project_short_name)
FROM `isb-project-zero.RNAseq_Gene_Expression.BEATAML1_0_RNAseq_Gene_Expression`

Unnamed: 0,f0_
0,30846330


###test 3 - manual verification

**3. Scroll through table manually**

See if anything stands out - empty columns, etc.

The BigQuery table search user interface is useful for this test. The test tier points to isb-etl-open.

ISB-CGC BigQuery table search [test tier](https://isb-cgc-test.appspot.com/bq_meta_search/).

BigQuery console [isb-project-zero](https://console.cloud.google.com/bigquery?authuser=1&folder=&organizationId=&project=isb-project-zero&p=isb-project-zero&d=RNAseq_Gene_Expression&t=BEATAML1_0_RNAseq_Gene_Expression&page=table).

Run a manual check in the console, using the questions from test 1:

Are all the fields labeled?

Is there a table description?

Do the field labels make sense for all fields?
    
Are the labels correct?

###test 4 - GDC Data Portal count verification


**4. Number of cases on GDC portal versus table?**

In [0]:
# The query below returns one row per case, so the row count is the number of cases present in this table.

rnaseq_table = Table('`isb-project-zero.fs_scratch.TARGET_hg38_data_v0_RNAseq_Gene_Expression`')
rnaseq_query = Query.from_(rnaseq_table) \
                  .select(' DISTINCT case_barcode, count(*) as count') \
                  .groupby('case_barcode')

rnaseq_query_clean = str(rnaseq_query).replace('"', "")
#print(rnaseq_query_clean)
rnaseq = client.query(rnaseq_query_clean).to_dataframe()
print('number of cases = ' + str(len(rnaseq.index)))

# pandas.set_option("display.max_rows", None, "display.max_columns", None)

number of cases = 1188


To compare against the GDC Data Portal, go to the portal and filter on program TARGET; workflow_type HTSeq - Counts, HTSeq - FPKM, and HTSeq - FPKM-UQ; and experimental_strategy RNA-Seq. The number of cases returned is 1192, compared with 1188 in the table.

[GDC Data portal](https://portal.gdc.cancer.gov/repository?facetTab=files&filters=%7B%22op%22%3A%22and%22%2C%22content%22%3A%5B%7B%22op%22%3A%22in%22%2C%22content%22%3A%7B%22field%22%3A%22cases.project.program.name%22%2C%22value%22%3A%5B%22TARGET%22%5D%7D%7D%2C%7B%22op%22%3A%22in%22%2C%22content%22%3A%7B%22field%22%3A%22files.analysis.workflow_type%22%2C%22value%22%3A%5B%22HTSeq%20-%20Counts%22%2C%22HTSeq%20-%20FPKM%22%2C%22HTSeq%20-%20FPKM-UQ%22%5D%7D%7D%2C%7B%22op%22%3A%22in%22%2C%22content%22%3A%7B%22field%22%3A%22files.data_category%22%2C%22value%22%3A%5B%22transcriptome%20profiling%22%5D%7D%7D%2C%7B%22op%22%3A%22in%22%2C%22content%22%3A%7B%22field%22%3A%22files.experimental_strategy%22%2C%22value%22%3A%5B%22RNA-Seq%22%5D%7D%7D%5D%7D) filter results. 
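
Because the table and the portal report different case counts, the next step is to find out which cases differ. A sketch of that comparison follows, assuming the case identifiers from both sources have been exported to Python lists. The name `diff_ids` and the sample barcodes are hypothetical.

```python
def diff_ids(table_ids, reference_ids):
    """Compare two collections of IDs.

    Returns (missing_from_table, extra_in_table) as sorted lists.
    """
    table_set = set(table_ids)
    reference_set = set(reference_ids)
    missing_from_table = sorted(reference_set - table_set)
    extra_in_table = sorted(table_set - reference_set)
    return missing_from_table, extra_in_table


# Made-up barcodes for illustration; these are not real TARGET cases.
table_cases = ["CASE-A", "CASE-B", "CASE-C"]
portal_cases = ["CASE-A", "CASE-B", "CASE-C", "CASE-D"]

missing, extra = diff_ids(table_cases, portal_cases)
print("missing from table:", missing)
print("extra in table:", extra)
```

The same comparison could be reused for the aliquot counts in test 5 by passing aliquot IDs instead of case identifiers.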

###test 5 - file metadata table count verification

**5. Number of cases / aliquots versus BigQuery metadata table**

RNA-Seq case count from the table is below.

In [10]:
# The query below returns one row per case, so the row count is the number of cases present in this table.

rnaseq_table = Table('`isb-project-zero.fs_scratch.TARGET_hg38_data_v0_RNAseq_Gene_Expression`')
rnaseq_query = Query.from_(rnaseq_table) \
                  .select(' DISTINCT case_barcode, count(*) as count') \
                  .groupby('case_barcode')

rnaseq_query_clean = str(rnaseq_query).replace('"', "")
#print(rnaseq_query_clean)
rnaseq = client.query(rnaseq_query_clean).to_dataframe()
print('number of cases = ' + str(len(rnaseq.index)))

# pandas.set_option("display.max_rows", None, "display.max_columns", None)

number of cases = 1188


GDC file metadata table case count for RNA-Seq is below.

In [11]:
%%bigquery --project isb-project-zero
SELECT case_gdc_id, program_name
FROM `isb-project-zero.GDC_metadata.rel23_fileData_current`
where program_name = 'TARGET'
and experimental_strategy = 'RNA-Seq'
and analysis_workflow_type IN ('HTSeq - Counts', 'HTSeq - FPKM-UQ', 'HTSeq - FPKM')
group by case_gdc_id, program_name

Unnamed: 0,case_gdc_id,program_name
0,74443fab-cc7e-59f4-b1af-a109095cc83e,TARGET
1,87c9e381-e555-5b87-9bc4-f0f541693d0f,TARGET
2,e27acb51-a523-509a-9e17-5ba6b3852e17,TARGET
3,db87ac4a-3d99-5641-92d5-69cf148b0594,TARGET
4,8c00d2ef-6816-5cd5-81a4-407d4a4903b3,TARGET
...,...,...
1187,a830f857-f5b6-50ce-922a-6182a7d0a272,TARGET
1188,45b8bab9-8628-5a6d-83bf-18202c126e1e,TARGET
1189,26079d5f-b061-5a39-968c-199e017cb880,TARGET
1190,e0815ea1-1ecd-5c27-81cf-e8ed93fa5fbf,TARGET


RNA-Seq aliquot count from the table is below.

In [12]:
# The query below returns one row per aliquot, so the row count is the number of aliquots present in this table.

rnaseq_table = Table('`isb-project-zero.fs_scratch.TARGET_hg38_data_v0_RNAseq_Gene_Expression`')
rnaseq_query = Query.from_(rnaseq_table) \
                  .select(' distinct aliquot_gdc_id, count(*) as count') \
                  .groupby('aliquot_gdc_id')

rnaseq_query_clean = str(rnaseq_query).replace('"', "")
#print(rnaseq_query_clean)
rnaseq = client.query(rnaseq_query_clean).to_dataframe()
print('number of aliquots = ' + str(len(rnaseq.index)))

# pandas.set_option("display.max_rows", None, "display.max_columns", None)

number of aliquots = 1330


GDC file metadata table aliquot count for RNA-seq below.

In [13]:
%%bigquery --project isb-project-zero
select distinct associated_entities__entity_gdc_id 
from `isb-cgc.GDC_metadata.rel22_fileData_active` 
where program_name = "TARGET"
and experimental_strategy = "RNA-Seq"
and analysis_workflow_type IN ('HTSeq - Counts', 'HTSeq - FPKM-UQ', 'HTSeq - FPKM')

Unnamed: 0,associated_entities__entity_gdc_id
0,5873e46e-0538-502a-b846-f3635573ad84
1,d7be27d1-9a01-59c1-b82f-0a5aecb61646
2,03cacdd1-52c7-5f4f-a662-a5072541f18c
3,545c90ee-a93d-55d6-ad00-97394fae70b3
4,64084dc2-9b92-5cf9-bc34-58afb1649e38
...,...
1329,fff7e5aa-3242-52da-98aa-db8d7d526b24
1330,332a9f73-0395-5a86-80b6-73b217309f02
1331,8b95b757-43b3-585f-9585-a7b48ecdd944
1332,87d9fdce-e130-596e-9757-55ff3994cc34


### test 6 - gene entry verification

**6. Number of entries per gene - should equal aliquot count**

In [0]:
%%bigquery --project isb-project-zero

select distinct Ensembl_gene_id_v, count(Ensembl_gene_id_v) as count
from `isb-project-zero.fs_scratch.TARGET_hg38_data_v0_RNAseq_Gene_Expression` 
group by Ensembl_gene_id_v 
order by count

Unnamed: 0,Ensembl_gene_id_v,count
0,ENSG00000279730.1,1330
1,ENSG00000244682.6,1330
2,ENSG00000211934.3,1330
3,ENSG00000211689.5,1330
4,ENSG00000262185.1,1330
...,...,...
60478,ENSG00000271977.1,1330
60479,ENSG00000263887.5,1330
60480,ENSG00000223972.5,1330
60481,ENSG00000230578.3,1330
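
If every gene has exactly one row per aliquot, the total row count of the table follows from two numbers. The sketch below shows that arithmetic, plus a check that flags any gene whose count differs from the aliquot count. The figures are read from the outputs in this notebook (1330 aliquots; 60482 genes, inferred from the result index running from 0 to 60481). The helper `genes_with_unexpected_counts` and the outlier gene are hypothetical.

```python
def genes_with_unexpected_counts(gene_counts, aliquot_count):
    """Return {gene_id: count} for genes whose row count differs from aliquot_count."""
    return {
        gene: count
        for gene, count in gene_counts.items()
        if count != aliquot_count
    }


n_aliquots = 1330
n_genes = 60482  # inferred from the output index (0 to 60481)
expected_rows = n_aliquots * n_genes
print(f"expected rows: {expected_rows:,}")

# Two rows from the output above plus one made-up outlier for illustration.
sample_counts = {
    "ENSG00000279730.1": 1330,
    "ENSG00000244682.6": 1330,
    "HYPOTHETICAL_GENE.1": 1329,
}
print(genes_with_unexpected_counts(sample_counts, n_aliquots))
```

Comparing `expected_rows` with the row count from test 2 for the same table gives a quick consistency check between the two tests.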


###test 7 - duplication verification



In [0]:
%%bigquery --project isb-project-zero

SELECT count(project_short_name) as count
FROM `isb-project-zero.fs_scratch.TARGET_hg38_data_v0_RNAseq_Gene_Expression` 
GROUP BY project_short_name, case_barcode, sample_barcode, aliquot_barcode, gene_name, gene_type, Ensembl_gene_id, Ensembl_gene_id_v, HTSeq__Counts, HTSeq__FPKM, HTSeq__FPKM_UQ, case_gdc_id, sample_gdc_id, aliquot_gdc_id, file_gdc_id_counts, file_gdc_id_fpkm, file_gdc_id_fpkm_uq, platform
order by count desc
LIMIT 10

#query to be run manually for duplication verification of QC

Unnamed: 0,count
0,1
1,1
2,1
3,1
4,1
5,1
6,1
7,1
8,1
9,1
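
The query above groups on every column and sorts the group sizes in descending order, so a top value of 1 means no row is duplicated in full. The same idea is sketched below in plain Python with `collections.Counter` on row tuples. The name `duplicated_rows` and the sample rows are hypothetical.

```python
from collections import Counter


def duplicated_rows(rows):
    """Return {row: count} for rows that occur more than once.

    Each row must be hashable, e.g. a tuple of column values in a fixed order.
    """
    counts = Counter(rows)
    return {row: n for row, n in counts.items() if n > 1}


# Made-up rows for illustration; this is not real table content.
sample = [
    ("TARGET-AML", "CASE-A", "GENE1", 10),
    ("TARGET-AML", "CASE-A", "GENE2", 0),
    ("TARGET-AML", "CASE-A", "GENE1", 10),
]
print(duplicated_rows(sample))
```

An empty dictionary corresponds to the SQL result above, where every group has a count of 1.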
