QC of ETL starting with GDC release 22 for programs TARGET, ORGANOID, and BEATAML. 


This notebook focuses on the QC of program ORGANOID, data_type RNA-Seq.

##QC table checklist 

**1. Check schema**

Are all the fields labeled?

Is there a table description?

Do the field labels make sense for all fields?
    
Are the labels correct?

**2. Look at table row number and size**

Do these metrics make sense?

**3. Scroll through table manually**

See if anything stands out - empty columns, etc.

**4. Number of cases on GDC portal versus table?**

**5. Number of cases / aliquots versus BigQuery metadata table**

**6. Number of entries per gene - should equal aliquot count**

**7. Check for any duplicate rows present in the table**

##Reference material



*   [NextGenETL](https://github.com/isb-cgc/NextGenETL) GitHub repository
*   [ETL QC SOP draft](https://docs.google.com/document/d/1Wskf3BxJLkMjhIXD62B6_TG9h5KRcSp8jSAGqcCP1lQ/edit)

##Before you begin

You need to authenticate, load the BigQuery module, create a client variable, and load the necessary libraries.


In [1]:
from google.colab import auth
try:
  auth.authenticate_user()
  print('You have been successfully authenticated!')
except Exception as error:
  print('You have not been authenticated: ' + str(error))

You have been successfully authenticated!


In [6]:
from google.cloud import bigquery
try:
  project_id = 'isb-project-zero' # Replace with your own Google Cloud project ID
  client = bigquery.Client(project=project_id)
  print('BigQuery client successfully initialized')
except Exception as error:
  print('Failed to initialize the BigQuery client: ' + str(error))

BigQuery client successfully initialized


In [3]:
#Install pypika to build a Query 
!pip install pypika
# Import from PyPika
from pypika import Query, Table, Field, Order

import pandas

Collecting pypika
  Downloading https://files.pythonhosted.org/packages/5d/12/09a36b1e4891433ea4ae0e75c87dd1ff038b19ab33d679aab3538d800cd8/PyPika-0.37.6.tar.gz (53kB)
Building wheels for collected packages: pypika
  Building wheel for pypika (setup.py) ... done
  Created wheel for pypika: filename=PyPika-0.37.6-py2.py3-none-any.whl size=42748 sha256=8153701e55c1b0e31c1451b63d8527f4fd7fd76b2d457e0678f4136e7c61b74e
  Stored in directory: /root/.cache/pip/wheels/7e/39/df/d08ca9b40bba9f6d626a32c2e49c1ba61441eaa166f2cc8eb5
Successfully built pypika
Installing collected packag

## READY TO BEGIN TESTING

##Program ORGANOID

**Testing Full ID** `isb-project-zero.RNAseq_Gene_Expression.ORGANOID_RNAseq_Gene_Expression`

[Table location](https://console.cloud.google.com/bigquery?authuser=2&project=isb-project-zero&p=isb-project-zero&d=RNAseq_Gene_Expression&t=ORGANOID_RNAseq_Gene_Expression&page=table)

Source : GDC API

Date Created : 	Apr 1, 2020, 7:06:22 PM

Release version : v22


##test 1 - schema verification

**1. Check schema**

Are all the fields labeled?

Is there a table description?

Do the field labels make sense for all fields?

Are the labels correct?

For reference, see the Google documentation on the [COLUMN_FIELD_PATHS view](https://cloud.google.com/bigquery/docs/information-schema-tables#column_field_paths_view).

For reference, see the Google documentation on the [TABLE_OPTIONS view](https://cloud.google.com/bigquery/docs/information-schema-tables#options_table).

In [7]:
#return the catalog, schema, name, and type of the ORGANOID table in dataset RNAseq_Gene_Expression
rnaseq_table = Table('`isb-project-zero`.RNAseq_Gene_Expression.INFORMATION_SCHEMA.TABLES')
rnaseq_query = Query.from_(rnaseq_table) \
                  .select(' table_catalog, table_schema, table_name, table_type ') \
                  .where(rnaseq_table.table_name=='ORGANOID_RNAseq_Gene_Expression')

# pypika wraps identifiers in double quotes, which BigQuery reads as string literals; strip them before running
rnaseq_query_clean = str(rnaseq_query).replace('"', "")
rnaseq = client.query(rnaseq_query_clean).to_dataframe()
rnaseq.head()

Unnamed: 0,table_catalog,table_schema,table_name,table_type
0,isb-project-zero,RNAseq_Gene_Expression,ORGANOID_RNAseq_Gene_Expression,BASE TABLE


In [0]:
#return the table options (friendly name, description, labels) for the ORGANOID table
rnaseq_table = Table('`isb-project-zero`.RNAseq_Gene_Expression.INFORMATION_SCHEMA.TABLE_OPTIONS')
rnaseq_query = Query.from_(rnaseq_table) \
                  .select(' table_name, option_name, option_type, option_value ') \
                  .where(rnaseq_table.table_name=='ORGANOID_RNAseq_Gene_Expression')

rnaseq_query_clean = str(rnaseq_query).replace('"', "")
rnaseq = client.query(rnaseq_query_clean).to_dataframe()

for i in range(len(rnaseq)):
    print(rnaseq['option_name'][i] + '\n')
    print('\t' + rnaseq['option_value'][i] + '\n')
    print('\t' + rnaseq['option_type'][i] + '\n')

friendly_name

	"ORGANOID RNASEQ GENE EXPRESSION"

	STRING

description

	"Data was extracted from GDC on March 2020. mRNA expression data was generated using Illumina GA or HiSeq sequencing platforms with information from each of the three files (HTSeq Counts, HTSeq FPKM, HTSeq FPKM-UQ) from the GDC's RNAseq pipeline was combine for each aliquot."

	STRING

labels

	[STRUCT("data_type", "gene_expression"), STRUCT("status", "current"), STRUCT("reference_genome_0", "hg38"), STRUCT("source", "gdc"), STRUCT("category", "processed_-omics_data"), STRUCT("access", "open"), STRUCT("program", "organoid"), STRUCT("experimental_strategy", "rnaseq")]

	ARRAY<STRUCT<STRING, STRING>>



In [0]:
#check the table options of the ORGANOID table for empty (null) cells
rnaseq_table = Table('`isb-project-zero`.RNAseq_Gene_Expression.INFORMATION_SCHEMA.TABLE_OPTIONS')
rnaseq_query = Query.from_(rnaseq_table) \
                  .select(' table_name, option_name, option_type, option_value ') \
                  .where(rnaseq_table.table_name=='ORGANOID_RNAseq_Gene_Expression')

rnaseq_query_clean = str(rnaseq_query).replace('"', "")
rnaseq = client.query(rnaseq_query_clean).to_dataframe()
print("Are there any empty cells in the table schema?")
pandas.isnull(rnaseq).values.any()

Are there any empty cells in the table schema?


False

Example of the field descriptions pulled from the table schema:


In [0]:
#list the field descriptions for the ORGANOID table
rnaseq_table = Table('`isb-project-zero`.RNAseq_Gene_Expression.INFORMATION_SCHEMA.COLUMN_FIELD_PATHS')
rnaseq_query = Query.from_(rnaseq_table) \
                  .select('table_name, column_name, description') \
                  .where(rnaseq_table.table_name=='ORGANOID_RNAseq_Gene_Expression')

rnaseq_query_clean = str(rnaseq_query).replace('"', "")
rnaseq = client.query(rnaseq_query_clean).to_dataframe()
for i in range(len(rnaseq)):
  print(rnaseq['table_name'][i] + '\n')
  print('\t' + rnaseq['column_name'][i] + '\n')
  print('\t' + rnaseq['description'][i] + '\n')

ORGANOID_RNAseq_Gene_Expression

	project_short_name

	Project name abbreviation; the program name appended with a project name abbreviation; eg. TCGA-OV, etc.

ORGANOID_RNAseq_Gene_Expression

	case_barcode

	Original case barcode

ORGANOID_RNAseq_Gene_Expression

	sample_barcode

	sample barcode, eg TCGA-12-1089-01A. One sample may have multiple sets of CN segmentations corresponding to multiple aliquots; use GROUP BY appropriately in queries

ORGANOID_RNAseq_Gene_Expression

	aliquot_barcode

	TCGA aliquot barcode, eg TCGA-12-1089-01A-01D-0517-31

ORGANOID_RNAseq_Gene_Expression

	gene_name

	Gene name e.g. TTN, DDR1, etc.

ORGANOID_RNAseq_Gene_Expression

	gene_type

	The type of genetic element the reads mapped to, eg protein_coding, ribozyme

ORGANOID_RNAseq_Gene_Expression

	Ensembl_gene_id

	The Ensembl gene ID from the underlying file, but stripped of the version suffix -- eg ENSG00000185028

ORGANOID_RNAseq_Gene_Expression

	Ensembl_gene_id_v

	The Ensembl gene ID from the un

In [8]:
#check the field descriptions of the ORGANOID table for empty (null) cells
rnaseq_table = Table('`isb-project-zero`.RNAseq_Gene_Expression.INFORMATION_SCHEMA.COLUMN_FIELD_PATHS')
rnaseq_query = Query.from_(rnaseq_table) \
                  .select('table_name, column_name, description') \
                  .where(rnaseq_table.table_name=='ORGANOID_RNAseq_Gene_Expression')

rnaseq_query_clean = str(rnaseq_query).replace('"', "")
rnaseq = client.query(rnaseq_query_clean).to_dataframe()
print("Are there any empty cells in the table schema?")
pandas.isnull(rnaseq).values.any()

Are there any empty cells in the table schema?


False
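
One limitation of the null check above: `pandas.isnull` flags missing values only, so a description that is an empty or whitespace-only string still passes. A minimal sketch of a stricter check follows. The helper name `find_blank_descriptions` and the sample rows are illustrative assumptions, not part of the ETL code.

```python
import pandas as pd

def find_blank_descriptions(df, column="description"):
    """Return the rows whose description is null, empty, or whitespace only."""
    text = df[column].fillna("").astype(str).str.strip()
    return df[text == ""]

# Made-up rows standing in for the COLUMN_FIELD_PATHS query result
sample = pd.DataFrame({
    "column_name": ["case_barcode", "gene_name", "platform"],
    "description": ["Original case barcode", None, "   "],
})
blank = find_blank_descriptions(sample)
print(blank["column_name"].tolist())
```

In the notebook, the same helper could be applied to the `rnaseq` DataFrame returned by the query; an empty result would mean every field has a non-blank description.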

##test 2 - row number verification

**2. Look at table row number and size**

Do these metrics make sense?

In [0]:
%%bigquery --project isb-project-zero
SELECT COUNT(project_short_name)
FROM `isb-project-zero.RNAseq_Gene_Expression.ORGANOID_RNAseq_Gene_Expression`

Unnamed: 0,f0_
0,3326565
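
The count above covers the row number only. For the table size, the `google-cloud-bigquery` client exposes `num_rows` and `num_bytes` on the object returned by `client.get_table`. A minimal sketch follows; `summarize_table` is a hypothetical helper, and the demo object uses made-up numbers so the block runs without a BigQuery connection.

```python
from types import SimpleNamespace

def summarize_table(table):
    """Return row count, size in MiB, and average bytes per row for a table-like object."""
    rows = table.num_rows
    size_bytes = table.num_bytes
    bytes_per_row = size_bytes / rows if rows else 0.0
    return {
        "rows": rows,
        "size_mib": size_bytes / 2**20,
        "bytes_per_row": bytes_per_row,
    }

# In the notebook, the real table object would come from:
# table = client.get_table("isb-project-zero.RNAseq_Gene_Expression.ORGANOID_RNAseq_Gene_Expression")

# Stand-in object with made-up numbers (not the real table metrics)
demo = SimpleNamespace(num_rows=1000, num_bytes=250000)
summary = summarize_table(demo)
print(summary)
```

An average row size far from what the column types suggest can be a hint that columns are unexpectedly empty or oversized.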


##test 3 - manual verification

**3. Scroll through table manually**

See if anything stands out - empty columns, etc.

The BigQuery table search user interface is useful for this test. The test tier points to isb-etl-open.

ISB-CGC BigQuery table search [test tier](https://isb-cgc-test.appspot.com/bq_meta_search/).

BigQuery console [isb-project-zero](https://console.cloud.google.com/bigquery?authuser=1&folder=&organizationId=&project=isb-project-zero&p=isb-project-zero&d=RNAseq_Gene_Expression&t=ORGANOID_RNAseq_Gene_Expression&page=table).

Run a manual check in the console using the questions from step 1:

Are all the fields labeled?

Is there a table description?

Do the field labels make sense for all fields?
    
Are the labels correct?

##test 4 - GDC Data Portal count verification


**4. Number of cases on GDC portal versus table?**

In [0]:
# The query below displays the number of cases present in this table.

rnaseq_table = Table('`isb-project-zero.RNAseq_Gene_Expression.ORGANOID_RNAseq_Gene_Expression`')
rnaseq_query = Query.from_(rnaseq_table) \
                  .select(' DISTINCT case_barcode, count(*) as count') \
                  .groupby('case_barcode')

rnaseq_query_clean = str(rnaseq_query).replace('"', "")
#print(rnaseq_query_clean)
rnaseq = client.query(rnaseq_query_clean).to_dataframe()
print('number of cases = ' + str(len(rnaseq.index)))

# pandas.set_option("display.max_rows", None, "display.max_columns", None)

number of cases = 49


To compare against the GDC Data Portal, go to the portal and search for program ORGANOID and experimental_strategy RNA-Seq. The number of cases returned is 49, which matches the table.

[GDC Data portal](https://portal.gdc.cancer.gov/repository?facetTab=cases&filters=%7B%22op%22%3A%22and%22%2C%22content%22%3A%5B%7B%22op%22%3A%22in%22%2C%22content%22%3A%7B%22field%22%3A%22cases.project.program.name%22%2C%22value%22%3A%5B%22ORGANOID%22%5D%7D%7D%2C%7B%22op%22%3A%22in%22%2C%22content%22%3A%7B%22field%22%3A%22files.experimental_strategy%22%2C%22value%22%3A%5B%22RNA-Seq%22%5D%7D%7D%5D%7D&searchTableTab=cases) filter results. 

##test 5 - file metadata table count verification

**5. Number of cases / aliquots versus BigQuery metadata table**

RNA-Seq case counts from the table are shown below.

In [0]:
# The query below displays the number of cases present in this table.

rnaseq_table = Table('`isb-project-zero.RNAseq_Gene_Expression.ORGANOID_RNAseq_Gene_Expression`')
rnaseq_query = Query.from_(rnaseq_table) \
                  .select(' DISTINCT case_barcode, count(*) as count') \
                  .groupby('case_barcode')

rnaseq_query_clean = str(rnaseq_query).replace('"', "")
#print(rnaseq_query_clean)
rnaseq = client.query(rnaseq_query_clean).to_dataframe()
print('number of cases = ' + str(len(rnaseq.index)))

# pandas.set_option("display.max_rows", None, "display.max_columns", None)

number of cases = 49


GDC file metadata table case count for RNA-Seq is shown below.

In [0]:
%%bigquery --project isb-project-zero
SELECT case_gdc_id, program_name
FROM `isb-project-zero.GDC_metadata.rel23_fileData_current`
where program_name = 'ORGANOID'
and experimental_strategy = 'RNA-Seq'
and analysis_workflow_type IN ('HTSeq - Counts', 'HTSeq - FPKM-UQ', 'HTSeq - FPKM')
group by case_gdc_id, program_name

Unnamed: 0,case_gdc_id,project_short_name
0,0199d671-0b2d-4673-9a3a-8df12e0a17cd,ORGANOID-PANCREATIC
1,f7a6e992-9f97-4387-85c6-18da46a97aac,ORGANOID-PANCREATIC
2,63f87552-091c-4c27-bce0-a37e6657777f,ORGANOID-PANCREATIC
3,ece88485-dae8-4f8e-b922-e2c6f8bffd7c,ORGANOID-PANCREATIC
4,f5393eea-abad-4f59-a19d-4ac65bd0ad65,ORGANOID-PANCREATIC
5,eee1adb4-f965-43a2-8037-ff706c37e0cc,ORGANOID-PANCREATIC
6,d277d0dd-4be9-484f-ba57-2d9ac16c6736,ORGANOID-PANCREATIC
7,c6f03bd6-e8e1-4e41-a2ec-464188a3277f,ORGANOID-PANCREATIC
8,9c5b064f-99bd-4178-ad75-2b98c61c761c,ORGANOID-PANCREATIC
9,08f63445-b236-4f97-b76a-2e5f4b3c5ff5,ORGANOID-PANCREATIC


RNA-Seq aliquot counts from the table are shown below.

In [0]:
# The query below displays the number of aliquots present in this table.

rnaseq_table = Table('`isb-project-zero.RNAseq_Gene_Expression.ORGANOID_RNAseq_Gene_Expression`')
rnaseq_query = Query.from_(rnaseq_table) \
                  .select(' distinct aliquot_gdc_id, count(*) as count') \
                  .groupby('aliquot_gdc_id')

rnaseq_query_clean = str(rnaseq_query).replace('"', "")
#print(rnaseq_query_clean)
rnaseq = client.query(rnaseq_query_clean).to_dataframe()
print('number of aliquots = ' + str(len(rnaseq.index)))

# rnaseq.set_option("display.max_rows", None, "display.max_columns", None)

number of aliquots = 55


GDC file metadata table aliquot count for RNA-seq below.

In [0]:
%%bigquery --project isb-project-zero
select distinct associated_entities__entity_gdc_id 
from `isb-cgc.GDC_metadata.rel22_fileData_active` 
where program_name = "ORGANOID"
and experimental_strategy = "RNA-Seq"
and analysis_workflow_type IN ('HTSeq - Counts', 'HTSeq - FPKM-UQ', 'HTSeq - FPKM')

Unnamed: 0,associated_entities__entity_gdc_id
0,06d92431-2da0-48f8-ad82-1157fde84b5a
1,02a8663f-1c57-45b1-ab55-b72adb7bd03c
2,0086338c-08d2-4cd5-8f33-63c35de4a5ba
3,02b94bad-b3e5-4f62-a92c-4fcaddf91f6a
4,02d1a414-8d17-49b8-991c-13ff4182c6bb
...,...
505,fd9ed915-e052-476f-9b5b-b74fb00891b8
506,fee6fcd3-e2d0-471e-a08d-b13204831746
507,fddf18a3-f787-4547-a5b8-576f11a29532
508,fdab16d0-151b-4428-80fa-e9c0da63099c
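
Comparing counts alone can hide mismatched IDs, since two sets of the same size may still contain different members. A sketch of an ID-level comparison follows. It assumes both queries return the same kind of identifier (for example `aliquot_gdc_id` from the table and `associated_entities__entity_gdc_id` from the metadata table); `compare_id_sets` and the sample IDs are hypothetical.

```python
def compare_id_sets(table_ids, metadata_ids):
    """Report the IDs found on only one side, and whether the two sets match."""
    table_ids = set(table_ids)
    metadata_ids = set(metadata_ids)
    return {
        "only_in_table": sorted(table_ids - metadata_ids),
        "only_in_metadata": sorted(metadata_ids - table_ids),
        "match": table_ids == metadata_ids,
    }

# Made-up IDs standing in for the two query results
result = compare_id_sets(["a1", "a2", "a3"], ["a2", "a3", "a4"])
print(result)
```

In the notebook, each argument could be a DataFrame column from the corresponding query result. The same helper works for the case comparison if both sides use `case_gdc_id`.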


## test 6 - gene entry verification

**6. Number of entries per gene - should equal aliquot count**

In [0]:
%%bigquery --project isb-project-zero

select distinct Ensembl_gene_id_v, count(Ensembl_gene_id_v) as count
from `isb-project-zero.RNAseq_Gene_Expression.ORGANOID_RNAseq_Gene_Expression` 
group by Ensembl_gene_id_v 
order by count

Unnamed: 0,Ensembl_gene_id_v,count
0,ENSG00000177738.3,55
1,ENSG00000261254.2,55
2,ENSG00000253588.1,55
3,ENSG00000253759.1,55
4,ENSG00000228668.1,55
...,...,...
60478,ENSG00000204622.9,55
60479,ENSG00000185495.9,55
60480,ENSG00000204663.8,55
60481,ENSG00000180747.14,55
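
Rather than scanning the output by eye, the per-gene counts can be checked programmatically. The sketch below assumes the query result is available as a DataFrame with a `count` column; `genes_with_unexpected_counts` is a hypothetical helper and the sample frame, including the ID `ENSG_EXAMPLE.1`, is made up. It also applies the same reasoning to the totals reported in this notebook: if every gene appears once per aliquot, the row count must divide evenly by the aliquot count.

```python
import pandas as pd

def genes_with_unexpected_counts(df, expected, count_column="count"):
    """Return the rows whose per-gene count differs from the expected aliquot count."""
    return df[df[count_column] != expected]

# Made-up sample; the last ID is a placeholder, not a real Ensembl ID
sample = pd.DataFrame({
    "Ensembl_gene_id_v": ["ENSG00000177738.3", "ENSG00000261254.2", "ENSG_EXAMPLE.1"],
    "count": [55, 55, 54],
})
bad = genes_with_unexpected_counts(sample, expected=55)
print(bad["Ensembl_gene_id_v"].tolist())

# Totals reported in this notebook: 3326565 rows and 55 aliquots
genes, remainder = divmod(3326565, 55)
print(genes, remainder)  # 60483 0
```

A remainder of zero is consistent with each gene having exactly one entry per aliquot, though it does not prove it; the per-gene check is the stronger test.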


##test 7 - duplication verification

**7. Check for any duplicate rows present in the table**



In [0]:
%%bigquery --project isb-project-zero

SELECT count(project_short_name) as count
FROM `isb-project-zero.RNAseq_Gene_Expression.ORGANOID_RNAseq_Gene_Expression` 
GROUP BY project_short_name, case_barcode, sample_barcode, aliquot_barcode, gene_name, gene_type, Ensembl_gene_id, Ensembl_gene_id_v, HTSeq__Counts, HTSeq__FPKM, HTSeq__FPKM_UQ, case_gdc_id, sample_gdc_id, aliquot_gdc_id, file_gdc_id_counts, file_gdc_id_fpkm, file_gdc_id_fpkm_uq, platform
order by count desc
limit 10
#query to be run manually for duplication verification QC

Unnamed: 0,count
0,1
1,1
2,1
3,1
4,1
...,...
3326560,1
3326561,1
3326562,1
3326563,1
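
The query above returns one row per distinct group, so the result can be as large as the table itself. Adding a `HAVING COUNT(*) > 1` clause would return only the groups that occur more than once, and an empty result would then mean no duplicates. A sketch of building such a query follows; `build_duplicate_query` is a hypothetical helper, and the two-column list is for illustration only (the full column list is in the cell above).

```python
def build_duplicate_query(table_id, columns):
    """Build a query that returns only the column combinations occurring more than once."""
    column_list = ", ".join(columns)
    return (
        f"SELECT {column_list}, COUNT(*) AS row_count\n"
        f"FROM `{table_id}`\n"
        f"GROUP BY {column_list}\n"
        "HAVING COUNT(*) > 1"
    )

query = build_duplicate_query(
    "isb-project-zero.RNAseq_Gene_Expression.ORGANOID_RNAseq_Gene_Expression",
    ["aliquot_barcode", "Ensembl_gene_id_v"],  # shortened list, for illustration
)
print(query)
```

The resulting string could be passed to `client.query(query).to_dataframe()`. Grouping on fewer columns, as in this illustration, tests a stricter condition (one row per aliquot and gene) than grouping on every column.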
