Commit
smrgit committed Jan 11, 2016 (1 parent a6e39b6, commit ca88dce)
Showing 14 changed files with 488 additions and 0 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,42 @@ | ||
*********************** | ||
ETL for BigQuery Tables | ||
*********************** | ||
|
||
Data Quality and General Formatting | ||
################################### | ||
|
||
- All data uploaded into ISB-CGC BigQuery tables use a consistent | |
UTF-8 character encoding. If the encoding of a character in the raw | |
files cannot be detected, that character is simply ignored. The | |
character encodings are detected using the Python | |
library `Chardet <https://www.google.com/url?q=https://pypi.python.org/pypi/chardet&sa=D&usg=AFQjCNEqIpFiwf3f-ynJmNtP1ZqXe-TvRg>`__. | ||
- All missing-value strings such as 'none', 'None', 'NONE', 'null', | |
'Null', 'NULL', 'NA', '\_\_UNKNOWN\_\_', <empty spaces>, and '?' are | |
represented as NULL values in the BigQuery tables. | |
- Numbers are stored in integer or float columns whenever possible. | |
Scientific notation (e.g. 10E2) and thousands-separator commas are | |
not used in any of the number columns. | |
- The End of File (EOF) and End of Line (EOL) delimiters, including | ||
CTRL-M characters, are removed while loading data into BigQuery. | ||
- Single and double quotes around the values are removed. The quotes | ||
within a value are not changed. | ||
- The SDRF file was parsed to find the correct association between the | ||
aliquot barcode and the Level-3 data file(s), wherever needed and | ||
possible. | ||
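
The value-level rules above can be sketched in a few lines of Python. This is only an illustration, not the pipeline's actual code: the function name ``normalize_value`` and the two regular expressions are assumptions, and the missing-value set contains only the strings listed above.

```python
import re

# Missing-value markers listed in the text; matching is exact.
MISSING = {"none", "None", "NONE", "null", "Null", "NULL", "NA", "__UNKNOWN__", "?"}

# Plain or scientific-notation number, e.g. "42", "-0.5", "10E2".
NUMBER = re.compile(r"^[+-]?(\d+(\.\d*)?|\.\d+)([eE][+-]?\d+)?$")
# Number written with thousands-separator commas, e.g. "1,234,567.8".
THOUSANDS = re.compile(r"^[+-]?\d{1,3}(,\d{3})+(\.\d+)?$")


def normalize_value(raw):
    """Return None, an int, a float, or a cleaned string for one raw field."""
    # Drop CTRL-M (carriage return) and newline characters, then trim spaces.
    text = raw.replace("\r", "").replace("\n", "").strip()
    # Remove one pair of matching quotes around the value; inner quotes stay.
    if len(text) >= 2 and text[0] == text[-1] and text[0] in "'\"":
        text = text[1:-1].strip()
    if text == "" or text in MISSING:
        return None
    # Strip commas only when the whole value looks like a grouped number.
    candidate = text.replace(",", "") if THOUSANDS.match(text) else text
    if NUMBER.match(candidate):
        number = float(candidate)
        # Values written without a decimal point that are whole become integers.
        if number.is_integer() and "." not in candidate:
            return int(number)
        return number
    return text
```

Calling ``normalize_value`` on every field before loading means no scientific notation, grouping commas, or wrapping quotes reach the table, while free-text values pass through unchanged.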
|
||
Major Data Types | ||
################ | ||
|
||
.. toctree:: | ||
:maxdepth: 1 | ||
|
||
data2/ETL_Clinical | ||
data2/ETL_Biospecimen | ||
data2/ETL_somaticMutations | ||
data2/ETL_DNAcopyNumber | ||
data2/ETL_DNAmethylation | ||
data2/ETL_mRNAexpression | ||
data2/ETL_microRNAexpression | ||
data2/ETL_proteinExpression | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,36 @@ | ||
Biospecimen | ||
=========== | ||
|
||
Parsing Biospecimen XML | ||
----------------------- | ||
|
||
Similarly, selected biospecimen fields from the biospecimen XML files | ||
were extracted and loaded into a “biospecimen” table in BigQuery. | ||
The Biospecimen BigQuery | |
\ `table <https://www.google.com/url?q=https://bigquery.cloud.google.com/table/isb-cgc:tcga_201510_alpha.Biospecimen_data&sa=D&usg=AFQjCNFWq7NUA2BkQ2br8PFG6VNySeFcxw>`__\ is | |
a sample-level table. | |
|
||
In the first step, while iterating through the sample block, the | |
elements (XML tags) and their values are collected. While iterating | |
over the portions block, the slide values are averaged across | |
portions, and their maximum and minimum values are also calculated. | |
The total numbers of slides (num\_slides) and portions | |
(num\_portions) are calculated for each sample, along with the | |
average, maximum, and minimum values observed. All calculated and | |
derived values are added as new columns in the BigQuery | |
tables. | |
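
As a rough illustration of the per-sample aggregation described above, the sketch below summarizes one slide measurement across all portions of a sample. The input layout (a list of portion dicts, each holding a ``slides`` list) and the ``avg_``/``max_``/``min_`` output column names are assumptions made for this example; only ``num_slides`` and ``num_portions`` come from the text.

```python
def summarize_slides(portions, field):
    """Aggregate one slide-level field over every portion of a sample."""
    slides = [slide for portion in portions for slide in portion.get("slides", [])]
    # Slides that do not report the field are counted but not averaged.
    values = [slide[field] for slide in slides if slide.get(field) is not None]
    summary = {"num_portions": len(portions), "num_slides": len(slides)}
    if values:
        summary["avg_" + field] = sum(values) / len(values)
        summary["max_" + field] = max(values)
        summary["min_" + field] = min(values)
    else:
        # No usable values: the derived columns stay NULL.
        summary["avg_" + field] = None
        summary["max_" + field] = None
        summary["min_" + field] = None
    return summary
```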
|
||
Filters | ||
------- | ||
|
||
- Samples where "is\_ffpe" is True are removed. | |
- Patients/Samples where the "Project" is "null" are removed. | ||
|
||
Formatting | ||
---------- | ||
|
||
- "pregnancies" and "total_number_of_pregnancies" are merged into a | |
single "pregnancies" field. Counts of four or more are represented as | |
"4+" (e.g. [0, 1, 2, 3, 4+]). | |
- "number\_of\_lymphnodes\_examined" and "lymph\_node\_examined\_count" are | ||
merged into a single "number\_of\_lymphnodes\_examined" column. | ||
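
A minimal sketch of the two column merges, assuming that the first non-null source value wins and that a count of exactly four falls into the "4+" bucket; neither detail is spelled out above, and the function names are hypothetical.

```python
def first_non_null(*values):
    """Return the first value that is not None, or None if all are missing."""
    for value in values:
        if value is not None:
            return value
    return None


def merge_pregnancies(pregnancies, total_number_of_pregnancies):
    """Merge the two source fields and bucket large counts as '4+'."""
    value = first_non_null(pregnancies, total_number_of_pregnancies)
    if value is None:
        return None
    if str(value).strip() == "4+":
        return "4+"
    count = int(value)
    return "4+" if count >= 4 else str(count)
```

The lymph-node columns can be merged the same way with ``first_non_null(number_of_lymphnodes_examined, lymph_node_examined_count)``.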
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,61 @@ | ||
Clinical | ||
======== | ||
|
||
Selection of Clinical Metadata Fields | ||
------------------------------------- | ||
|
||
XML features tagged with “procurement\_status=Completed” that exist in at | |
least 20% of the patients in each study are considered for the metadata. | |
A few important features, such as smoking and pregnancy, were added to the | |
list as necessary. Selected clinical fields from the clinical and | |
auxiliary XML files were extracted and loaded into a “clinical data” | |
\ `table <https://www.google.com/url?q=https://bigquery.cloud.google.com/table/isb-cgc:tcga_201510_alpha.Clinical_data&sa=D&usg=AFQjCNHP0Em9YewAXdL_vgIpbRzGiF2Dgg>`__\ in | ||
BigQuery. Each row in the table contains all the information for a | |
single patient, with only the most recent follow-up information included | |
(for patients where multiple follow-up sections exist in the | |
clinical XML file). The Clinical BigQuery table is patient-level. | |
|
||
Parsing Clinical XML | ||
-------------------- | ||
|
||
A clinical XML file is divided into admin and patient blocks, and each | ||
of them is processed separately. | ||
|
||
Patient block iteration | ||
|
||
In the first step, while iterating through the patient block, the | |
elements (XML tags) and their values are collected. While parsing the | |
follow-up block, only the elements of the most recent follow-up | |
sub-block (the one with the highest sequence number) are collected. | |
Because the clinical XML is nested and repeats element tags, care is | |
taken not to replace upper-block element values with lower-block | |
element values. In the last step, patient elements and | |
follow-up elements are merged, with preference given to | |
follow-up elements. | |
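
The final merge step might look like the sketch below. It assumes the XML has already been parsed into plain dictionaries; the ``sequence`` and ``fields`` keys are hypothetical names for this example, and the choice to let a NULL follow-up value leave the patient-level value in place is an assumption rather than documented behavior.

```python
def merge_patient_record(patient_fields, follow_ups):
    """Merge patient-level fields with the most recent follow-up block.

    follow_ups is a list of dicts such as
    {"sequence": 2, "fields": {"vital_status": "Dead"}}.
    """
    merged = dict(patient_fields)
    if not follow_ups:
        return merged
    # The most recent follow-up is the one with the highest sequence number.
    latest = max(follow_ups, key=lambda block: block["sequence"])
    for name, value in latest["fields"].items():
        # Follow-up values take preference over patient-level values.
        if value is not None:
            merged[name] = value
    return merged
```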
|
||
Formatting | ||
---------- | ||
|
||
- for all patients who are "Alive", | ||
|
||
- days\_to\_last\_known\_alive is set to days\_to\_last\_followup | ||
- days\_to\_death is set to NULL | ||
|
||
- for all patients who are "Dead", | |
|
||
- days\_to\_last\_known\_alive is set to days\_to\_death | ||
- days\_to\_last\_followup is set to NULL | ||
- if days\_to\_last\_followup or is available , vital\_status is set | ||
to 'Alive'. | ||
|
||
- The following fields are extracted from the cqcf block of the XML | |
file: 'gleason\_score\_combined', 'country', | |
'history\_of\_prior\_malignancy', 'frozen\_specimen\_anatomic\_site'. | |
- hpv\_calls, hpv\_status, | |
mononucleotide\_and\_dinucleotide\_marker\_panel\_analysis\_status, | |
and mononucleotide\_marker\_panel\_analysis\_status from the | |
Auxiliary XML are added to the Clinical metadata table if the batch | |
numbers of the Clinical and Auxiliary XML files match. | |
- The BMI column is calculated from the height and weight columns. | |
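
The survival-field rules and the BMI calculation can be illustrated as follows. The sketch assumes height is recorded in centimetres and weight in kilograms, and rounds BMI to two decimal places; the units and rounding are assumptions, since the text only says that BMI is derived from the two columns.

```python
def format_survival(record):
    """Apply the Alive/Dead rules for the days_to_* columns to one patient."""
    out = dict(record)
    status = out.get("vital_status")
    if status == "Alive":
        out["days_to_last_known_alive"] = out.get("days_to_last_followup")
        out["days_to_death"] = None
    elif status == "Dead":
        out["days_to_last_known_alive"] = out.get("days_to_death")
        out["days_to_last_followup"] = None
    return out


def bmi(height_cm, weight_kg):
    """Body mass index in kg/m^2, or None if either input is missing."""
    if not height_cm or not weight_kg:
        return None
    height_m = height_cm / 100.0
    return round(weight_kg / (height_m ** 2), 2)
```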
|
||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,19 @@ | ||
DNA Copy-Number Segments | ||
======================== | ||
|
||
Each individual CNV Level-3 data archive has 4 output files - two based on the hg18 reference, and two based on the hg19 reference. | ||
The BigQuery `table <https://www.google.com/url?q=https://bigquery.cloud.google.com/table/isb-cgc:tcga_201510_alpha.Copy_Number_segments&sa=D&usg=AFQjCNHs3vCBx_G7ls1NlgFYHwoBj1-xfw>`__ is populated only with the files ending with "nocnv\_hg19.seg.txt". | ||
The "num_probes" and "segment_mean" values in the raw files are sometimes represented in exponential scientific notation (e.g. 8.7E+07) and are converted to INT or FLOAT values. | |
|
||
Formatting | ||
---------- | ||
|
||
- "num_probes" column values are stored as integer values in BigQuery | ||
tables. Exponential Scientific notation is not used to represent the | ||
integers. | ||
- "segment_mean" column values are formatted as float values with four | |
decimal places. Values represented in exponential scientific notation | |
in the raw files are converted to float values. | |
- The aliquot barcode information was obtained from the SDRF file | ||
associated with the Level-3 data file. | ||
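
A small sketch of the file selection and numeric formatting described in this section. The helper names are hypothetical; the suffix and the four-decimal rounding follow the text above.

```python
def is_selected_segment_file(filename):
    """Keep only the hg19 segment files with germline CNV probes removed."""
    return filename.endswith("nocnv_hg19.seg.txt")


def format_segment(num_probes, segment_mean):
    """Convert raw strings (possibly in scientific notation) to int and float."""
    # Going through float() first accepts inputs such as "8.7E+07".
    probes = int(float(num_probes))
    mean = round(float(segment_mean), 4)
    return probes, mean
```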
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,31 @@ | ||
DNA Methylation | ||
=============== | ||
|
||
The BigQuery | ||
\ `table <https://www.google.com/url?q=https://bigquery.cloud.google.com/table/isb-cgc:tcga_201510_alpha.DNA_Methylation_betas&sa=D&usg=AFQjCNFuAXrnRbAzG0U4-f1uPmY8xC6gSQ>`__ \ is | ||
populated only with the files matching the pattern | |
"%.HumanMethylation%.txt". The data from the 27k and 450k platforms are | |
merged together into a single table. If there are samples that were run | |
on both platforms, then the 450k data takes precedence and the duplicate | ||
27k data is removed from the table. The table has a platform column | ||
indicating the name of the platform for each sample. | ||
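
The platform precedence rule can be sketched as below. The row layout (dicts with ``sample`` and ``platform`` keys) and the platform labels ``HumanMethylation450`` and ``HumanMethylation27`` are assumptions for this example.

```python
def drop_duplicate_27k(rows, preferred="HumanMethylation450"):
    """Remove 27k rows for any sample that also has 450k data."""
    samples_with_450k = {row["sample"] for row in rows if row["platform"] == preferred}
    return [
        row
        for row in rows
        if row["platform"] == preferred or row["sample"] not in samples_with_450k
    ]
```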
|
||
Filters | ||
------- | ||
|
||
- Rows where "Beta\_value" is NA or NULL are filtered out. | |
|
||
Formatting | ||
---------- | ||
|
||
- "Beta\_value" is rounded to two decimal places (e.g. 0.88). The original | |
beta\_value is a floating-point number with 14 digits of precision. | |
|
||
Output | ||
------ | ||
|
||
- Only "Probe\_Id" and "Beta\_Value" from the data file are stored in the | |
BigQuery table. | ||
- The aliquot barcode information was obtained from the SDRF file | ||
associated with the Level-3 data file. | ||
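
Taken together, the filter and formatting steps for one data file amount to something like this sketch, which assumes each input row is a ``(probe_id, beta_value)`` pair of raw strings.

```python
def clean_beta_rows(rows):
    """Yield (probe_id, beta) pairs, skipping missing values and rounding."""
    for probe_id, beta in rows:
        # Rows whose beta value is NA or NULL are filtered out.
        if beta is None or beta.strip() in ("", "NA", "NULL"):
            continue
        yield probe_id, round(float(beta), 2)
```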
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,46 @@ | ||
mRNA bcgsc | ||
---------- | ||
|
||
The mRNA BCGSC BigQuery | |
\ `table <https://www.google.com/url?q=https://bigquery.cloud.google.com/table/isb-cgc:tcga_201510_alpha.mRNA_BCGSC_HiSeq_RPKM&sa=D&usg=AFQjCNHGoaSTTA93ZnPTHDJzcN0VREmvWg>`__\ is | ||
populated only with the files matching the pattern | |
'%.gene.quantification.txt'. The raw “gene quantification” files have | |
four columns: gene, raw\_counts, median\_length\_normalized, and RPKM. | |
The information in the gene and RPKM columns is stored in a BigQuery | |
table. The gene string contains either two or three parts, | |
separated by a “\|”, e.g. “TP53\|7157\_calculated” or | |
“Mir\_1302\|?\|3of7\_calculated”. | |
|
||
Formatting | ||
^^^^^^^^^^ | ||
|
||
- The ‘gene’ column value is split into ‘original\_gene\_symbol’, | |
‘gene\_id’, and ‘gene\_addenda’ columns. | |
- The “\_calculated” string is stripped off from the “gene\_id” value. | |
“?” is replaced with a null value. | |
- Based on the ‘gene\_id’ column, the HGNC-approved gene symbol is added | |
as a new column “HGNC\_gene\_symbol”. The HGNC-approved symbols were | |
obtained from the following URL: | |
\ `http://rest.genenames.org/fetch/status/Approved <https://www.google.com/url?q=http://rest.genenames.org/fetch/status/Approved&sa=D&usg=AFQjCNHVRPnQGE0KLpbqF7KUePUWqr9uPg>`__\ . | |
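
Splitting the BCGSC gene string could be done along these lines. The function name is hypothetical, and removing the "_calculated" suffix from whichever part carries it (the gene id in the two-part form, the addenda in the three-part form) is an assumption based on the two examples above.

```python
SUFFIX = "_calculated"


def split_bcgsc_gene(gene):
    """Return (original_gene_symbol, gene_id, gene_addenda) for one gene string."""
    parts = gene.split("|")
    # Pad to three parts so two-part strings get a NULL addenda.
    parts += [None] * (3 - len(parts))
    cleaned = []
    for part in parts[:3]:
        if part is not None and part.endswith(SUFFIX):
            part = part[: -len(SUFFIX)]
        # "?" and empty strings are stored as NULL.
        cleaned.append(None if part in ("?", "") else part)
    return tuple(cleaned)
```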
|
||
mRNA unc | ||
-------- | ||
|
||
The mRNA UNC BigQuery | |
\ `table <https://www.google.com/url?q=https://bigquery.cloud.google.com/table/isb-cgc:tcga_201510_alpha.mRNA_UNC_HiSeq_RSEM&sa=D&usg=AFQjCNFDandkapnU15Btk5cnsxT2q9I2uw>`__\ is | ||
populated only with the files matching the pattern | |
'%.rsem.genes.normalized\_results'. The raw “RSEM genes normalized | |
results” files have two columns, the contents of which are stored in | |
a BigQuery table. The first column contains the gene\_id, and the | |
second the normalized\_count. The gene\_id string contains two parts, | |
the gene symbol and the gene id, separated by a “\|”, e.g. “TP53\|7157”. | |
|
||
Formatting | ||
^^^^^^^^^^ | ||
|
||
- The ‘gene\_id’ column is split into ‘original\_gene\_symbol’ and | |
‘gene\_id’; both are stored as separate columns in BigQuery. | |
- Based on the ‘gene\_id’, the HGNC-approved gene symbol is added as a new | |
column “HGNC\_gene\_symbol”. | |
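
For the UNC files, the split and the HGNC annotation might be combined as in the sketch below. The lookup table ``hgnc_by_gene_id`` is a hypothetical dictionary from gene id to approved symbol that would be built beforehand from the HGNC download; how the pipeline actually stores that mapping is not described here.

```python
def annotate_unc_gene(raw_gene_id, hgnc_by_gene_id):
    """Split 'SYMBOL|ID' and attach the HGNC-approved symbol, if known."""
    symbol, _, gene_id = raw_gene_id.partition("|")
    return {
        # "?" symbols are stored as NULL, following the general formatting rules.
        "original_gene_symbol": None if symbol in ("?", "") else symbol,
        "gene_id": gene_id or None,
        "HGNC_gene_symbol": hgnc_by_gene_id.get(gene_id),
    }
```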
|
||
|
24 changes: 24 additions & 0 deletions
docs/source/sections/data/data2/ETL_microRNAexpression.rst
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,24 @@ | ||
microRNA Expression | ||
=================== | ||
|
||
The current ISB TCGA data pipeline uses a Perl script | ||
(expression\_matrix\_mimat.pl) from Andy Chu at BCGSC which reads the | ||
isoform data files and outputs expression values for "mature microRNAs". | ||
It outputs a matrix with a consistent number of mature microRNAs, in | ||
which the microRNAs are referred to using a combination of the microRNA | ||
gene name and the unique accession number, e.g. | |
"hsa-mir-21.MIMAT0000076". Both the microRNA name and accession number | |
are stored as separate columns in the BigQuery | |
\ `table <https://www.google.com/url?q=https://bigquery.cloud.google.com/table/isb-cgc:tcga_201510_alpha.miRNA_expression&sa=D&usg=AFQjCNGPgJ1sAHyrdUV6jqHeNs5ZTjc2KQ>`__\ . | ||
The entire matrix is melted into a flat structure (tidy data) and loaded | |
into the table. The isoform files were searched with the following | ||
pattern - "%.isoform.quantification.txt". The aliquot barcode | ||
information was obtained from the SDRF file associated with the Level-3 | ||
isoform data file. | ||
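
Melting the expression matrix into tidy rows can be pictured with the sketch below. It assumes the matrix has one header row whose first cell labels the microRNA column and whose remaining cells identify the samples; the output key names are hypothetical and do not necessarily match the BigQuery schema.

```python
def melt_matrix(header, rows):
    """Yield one dict per (mature microRNA, sample) cell of the matrix."""
    sample_ids = header[1:]
    for row in rows:
        # "hsa-mir-21.MIMAT0000076" -> name "hsa-mir-21", accession "MIMAT0000076"
        name, _, accession = row[0].partition(".")
        for sample_id, value in zip(sample_ids, row[1:]):
            yield {
                "mirna_id": name,
                "mirna_accession": accession,
                "sample_id": sample_id,
                "value": float(value),
            }
```

Each matrix cell becomes exactly one output row, so a matrix with M microRNAs and N samples produces M x N rows.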
|
||
Filters | ||
------- | ||
|
||
- The pipeline is run only on the hg19 isoform files and others are | ||
filtered out. | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,55 @@ | ||
Protein | ||
======= | ||
|
||
The raw protein data file contains just two columns: the "Composite Element REF", which corresponds to the third column in the antibody | |
annotation file, and the estimated expression value for that particular | |
protein. The "Composite Element REF" was parsed to generate additional | |
information (see details in the Formatting section). The BigQuery | |
`table <https://www.google.com/url?q=https://bigquery.cloud.google.com/table/isb-cgc:tcga_201510_alpha.Protein_RPPA_data>`_ | ||
was populated with all TCGA Level-3 RPPA data matching the pattern - | ||
"%\_RPPA\_Core.protein\_expression%.txt". | ||
|
||
The antibody annotation files are parsed to get the relationship between | |
the antibody name and the associated proteins and genes. The generation | |
of the antibody, gene, and protein map is explained in detail | |
below. | |
|
||
Generation of Composite\_element\_ref, gene, and protein name map | ||
----------------------------------------------------------------- | ||
|
||
(Manual Curation of the gene and protein names) | ||
|
||
- Check the antibody annotation files for missing columns. | ||
|
||
- If “protein\_name” is missing, generate one from | ||
“composite\_element\_ref” | ||
|
||
- Make a map of ‘composite\_element\_ref’, ‘gene\_name’, and | |
‘protein\_name’ values. | |
- Check any other variant of the gene and protein symbols in the table. | ||
- HGNC Validation | ||
|
||
- If the gene symbol is in the HGNC approved symbols, ‘Approved’. | ||
Gene\_symbol = Gene\_symbol. | ||
- If not, check the Alias symbols. If found, Gene\_symbol = | ||
Alias\_symbol. | ||
- If not, check the Previous symbols. If found, Gene\_symbol = | ||
“Approved” Gene\_symbol. | ||
- If not, Gene\_symbol = Gene\_symbol | ||
- The file generated is manually curated and fed back into the | ||
algorithm. | ||
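
The HGNC validation cascade can be sketched as a chain of lookups. The three lookup structures are hypothetical inputs that would be built from the HGNC data. How an alias match is resolved is ambiguous in the list above, so this sketch keeps the symbol as written in that case and only reports which step matched, leaving the decision to the manual curation pass.

```python
def validate_symbol(symbol, approved, alias_symbols, previous_to_approved):
    """Return (gene_symbol, match_type) following the validation order above."""
    if symbol in approved:
        return symbol, "approved"
    if symbol in alias_symbols:
        # Assumption: the alias is kept as-is and flagged for curation.
        return symbol, "alias"
    if symbol in previous_to_approved:
        # A previous symbol is replaced by its current approved symbol.
        return previous_to_approved[symbol], "previous"
    # No match anywhere: the symbol is left unchanged.
    return symbol, "unmatched"
```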
|
||
Formatting | ||
---------- | ||
|
||
- Duplicate the rows if there are multiple genes concatenated in the | ||
"gene\_name" value. For example: ‘gene\_name’ with value like ‘AKT1 | ||
AKT2 AKT3’ is stored as three separate rows with each gene in a row. | ||
- 'Protein\_Name' is split into 'Protein\_Basename' and 'Phospho', which | |
are stored as separate columns. | |
- ‘Composite element ref’ is parsed to get 'validationStatus' and | ||
'antibodySource' – both are stored as separate columns in the | ||
BigQuery table. | ||
- Data from both Illumina GA and HiSeq platforms are stored in the same | ||
table. | ||
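
The row duplication for concatenated gene names, illustrated with a minimal sketch. It assumes each row is a dictionary and that multiple genes are separated by whitespace, as in the 'AKT1 AKT2 AKT3' example; the function name is hypothetical.

```python
def explode_genes(row):
    """Return one copy of the row per gene listed in its gene_name value."""
    genes = (row.get("gene_name") or "").split()
    if not genes:
        # Nothing to split: keep the row as a single copy.
        return [dict(row)]
    return [dict(row, gene_name=gene) for gene in genes]
```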
|