Skip to content

Commit

Permalink
updates
Browse files Browse the repository at this point in the history
  • Loading branch information
smrgit committed Jan 11, 2016
1 parent a6e39b6 commit ca88dce
Show file tree
Hide file tree
Showing 14 changed files with 488 additions and 0 deletions.
42 changes: 42 additions & 0 deletions docs/source/sections/data/BQ_ETL.rst
Original file line number Diff line number Diff line change
@@ -0,0 +1,42 @@
***********************
ETL for BigQuery Tables
***********************

Data Quality and General Formatting
###################################

- All data uploaded into ISB-CGC BigQuery tables have a consistent
UTF-8 character set formatting. If the encoding of a character from
the raw files could not be detected, the characters are simply
ignored. The character encodings are detected using the Python
library `Chardet <https://www.google.com/url?q=https://pypi.python.org/pypi/chardet&sa=D&usg=AFQjCNEqIpFiwf3f-ynJmNtP1ZqXe-TvRg>`__.
- All missing information value strings such as: 'none', 'None',
'NONE', 'null', 'Null', 'NULL', , 'NA', '\_\_UNKNOWN\_\_', <empty
spaces>, and '?'; are represented as NULL values in the BigQuery
tables.
- The numbers are stored as integer or float value columns, whenever
possible. The scientific number format (e.g. 10E2) and thousand
separator comma is not used in any of the number columns.
- The End of File (EOF) and End of Line (EOL) delimiters, including
CTRL-M characters, are removed while loading data into BigQuery.
- Single and double quotes around the values are removed. The quotes
within a value are not changed.
- The SDRF file was parsed to find the correct association between the
aliquot barcode and the Level-3 data file(s), wherever needed and
possible.

Major Data Types
################

.. toctree::
:maxdepth: 1

data2/ETL_Clinical
data2/ETL_Biospecimen
data2/ETL_somaticMutations
data2/ETL_DNAcopyNumber
data2/ETL_DNAmethylation
data2/ETL_mRNAexpression
data2/ETL_microRNAexpression
data2/ETL_proteinExpression

36 changes: 36 additions & 0 deletions docs/source/sections/data/data2/ETL_Biospecimen.rst
Original file line number Diff line number Diff line change
@@ -0,0 +1,36 @@
Biospecimen
===========

Parsing Biospecimen XML
-----------------------

Similarly, selected biospecimen fields from the biospecimen XML files
were extracted and loaded into a “biospecimen” table in BigQuery.
 Biospecimen BigQuery
\ `table <https://www.google.com/url?q=https://bigquery.cloud.google.com/table/isb-cgc:tcga_201510_alpha.Biospecimen_data&sa=D&usg=AFQjCNFWq7NUA2BkQ2br8PFG6VNySeFcxw>`__\  is
a sample-level.

In the first step, while iterating through the sample block the elements
(XML tags) and their values are collected. The slides’ info is averaged
across portions while iterating over the portions block. Also the
slides’ max and min values are calculated. The total number of slides
(num\_slides) and portions (num\_portions) is calculated for each
sample, along with the average, max and min values observed. All the
calculated and derived values are added as new columns in the BigQuery
tables.

Filters
-------

- Samples with "is\_ffpe: is True are removed.
- Patients/Samples where the "Project" is "null" are removed.

Formatting
----------

- "pregnancies" and "total_number_of_pregnancies" are be merged into a
single "pregnancies" field. The counts above four are represented as
"4+" (e.g: [0,1,2,3,4+])
- "number\_of\_lymphnodes\_examined" and "lymph\_node\_examined\_count" are
merged into a single "number\_of\_lymphnodes\_examined" column.

61 changes: 61 additions & 0 deletions docs/source/sections/data/data2/ETL_Clinical.rst
Original file line number Diff line number Diff line change
@@ -0,0 +1,61 @@
Clinical
========

Selection of Clinical Metadata Fields
-------------------------------------

XML features with tag  “procurement\_status=Competed” which exist in at
least 20% of the patients in each Study are considered for the metadata.
A few important features like smoking, pregnancy etc were added to the
list as necessary. Selected clinical fields from the clinical and
auxiliary XML files were extracted and loaded into a “clinical data”
\ `table <https://www.google.com/url?q=https://bigquery.cloud.google.com/table/isb-cgc:tcga_201510_alpha.Clinical_data&sa=D&usg=AFQjCNHP0Em9YewAXdL_vgIpbRzGiF2Dgg>`__\  in
BigQuery.  Each row in the table contains all the information for a
single patient, with only the most recent follow-up information included
(for the patients where multiple follow-up sections exist in the
clinical XML file). Clinical BigQuery table is at patient-level.

Parsing Clinical XML
--------------------

A clinical XML file is divided into admin and patient blocks, and each
of them is processed separately.

Patient block iteration

In the first step, while iterating through the patient block, the
elements (XML tags) and their values are collected. While parsing the
follow-up block, only the most recent follow-up sub-block elements info
is obtained (that which have the highest sequence number). Since the
clinical XML is nested along with element tag repetitions, care is taken
not to replace the upper block element values with the lower block
element values. In the last step, patient elements and
follow-up elements are merged with preference given to
follow-up elements.

Formatting
----------

- for all patients who are "Alive",

- days\_to\_last\_known\_alive  is set to days\_to\_last\_followup
- days\_to\_death is set to NULL

- for all patients who are "Dead", we should have

- days\_to\_last\_known\_alive  is set to days\_to\_death
- days\_to\_last\_followup is set to NULL
- if days\_to\_last\_followup or is available , vital\_status  is set
to 'Alive'.

- The following fields are extracted from the cqcf block of the XML
file: ‘gleason\_score\_combined', 'country',
'history\_of\_prior\_malignancy', 'frozen\_specimen\_anatomic\_site'
- hpv\_calls, hpv\_status,
mononucleotide\_and\_dinucleotide\_marker\_panel\_analysis\_status,
and mononucleotide\_marker\_panel\_analysis\_status from the
Auxiliary XML are added to the Clinical metadata table, if the batch
numbers of the both Clinical and Auxiliary XML files matches.
- BMI column is calculated based on the height and weight column.


19 changes: 19 additions & 0 deletions docs/source/sections/data/data2/ETL_DNAcopyNumber.rst
Original file line number Diff line number Diff line change
@@ -0,0 +1,19 @@
DNA Copy-Number Segments
========================

Each individual CNV Level-3 data archive has 4 output files - two based on the hg18 reference, and two based on the hg19 reference.
The BigQuery `table <https://www.google.com/url?q=https://bigquery.cloud.google.com/table/isb-cgc:tcga_201510_alpha.Copy_Number_segments&sa=D&usg=AFQjCNHs3vCBx_G7ls1NlgFYHwoBj1-xfw>`__ is populated only with the files ending with "nocnv\_hg19.seg.txt".
The "num_probes: and "segment_mean" in the raw files is sometimes represented in Exponential Scientific Notation (8.7E+07) and are converted to INT or FLOAT values.

Formatting
----------

- "num_probes" column values are stored as integer values in BigQuery
tables. Exponential Scientific notation is not used to represent the
integers.
- "segment_mean" column values are formatted to 4 point float values.
Values represented in Exponential Scientific notation in the raw
files are converted to float values.
- The aliquot barcode information was obtained from the SDRF file
associated with the Level-3 data file.

31 changes: 31 additions & 0 deletions docs/source/sections/data/data2/ETL_DNAmethylation.rst
Original file line number Diff line number Diff line change
@@ -0,0 +1,31 @@
DNA Methylation
===============

The BigQuery
\ `table <https://www.google.com/url?q=https://bigquery.cloud.google.com/table/isb-cgc:tcga_201510_alpha.DNA_Methylation_betas&sa=D&usg=AFQjCNFuAXrnRbAzG0U4-f1uPmY8xC6gSQ>`__ \ is
populated only with the files matching the pattern -
%.HumanMethylation%.txt. The data from both 27k and 450k platform are
merged together into a single table. If there are samples that were run
on both platforms, then the 450k data takes precedence and the duplicate
27k data is removed from the table. The table has a platform column
indicating the name of the platform for each sample.

Filters
-------

- Filter out rows with "Beta\_value" is NA or NULL.

Formatting
----------

- Round "Beta\_value" to two digit float (e.g: 0.88). The original
beta\_value is 14 digit precision floating number.

Output
------

- Only "Probe\_Id", "Beta\_Value" from the data file are stored in the
BigQuery table.
- The aliquot barcode information was obtained from the SDRF file
associated with the Level-3 data file.

46 changes: 46 additions & 0 deletions docs/source/sections/data/data2/ETL_mRNAexpression.rst
Original file line number Diff line number Diff line change
@@ -0,0 +1,46 @@
mRNA bcgsc
----------

The mrna bcgsc  BigQuery
\ `table <https://www.google.com/url?q=https://bigquery.cloud.google.com/table/isb-cgc:tcga_201510_alpha.mRNA_BCGSC_HiSeq_RPKM&sa=D&usg=AFQjCNHGoaSTTA93ZnPTHDJzcN0VREmvWg>`__\  is
populated only with the files matching the pattern
-%.gene.quantification.txt'. The raw “gene quantification” files have
four columns: gene, raw\_counts, median\_length\_normalized, and RPKM.
 The information in the gene and RPKM columns is stored in a BigQuery
table.  The gene string contains either two or three parts, similarly
separated by a “\|”, eg “TP53\|7157\_calculated” or
“Mir\_1302\|?\|3of7\_calculated”.

Formatting
^^^^^^^^^^

- ‘gene’ column field value is split into ‘original\_gene\_symbol',
‘gene\_id’ , and ‘gene\_addenda’ columns.
- The “\_calculated” string is stripped off from the “gene\_id” value.
“?” is replaced with a null value.
- Based on the ‘gene\_id’ columns, HGNC approved gene symbol is added
as a new column “HGNC\_gene\_symbol”. The HGNC approved symbols were
obtained from the following url:
\ `http://rest.genenames.org/fetch/status/Approve <https://www.google.com/url?q=http://rest.genenames.org/fetch/status/Approved&sa=D&usg=AFQjCNHVRPnQGE0KLpbqF7KUePUWqr9uPg>`__\ d.

mRNA unc
--------

The mrna UNC BigQuery
\ `table <https://www.google.com/url?q=https://bigquery.cloud.google.com/table/isb-cgc:tcga_201510_alpha.mRNA_UNC_HiSeq_RSEM&sa=D&usg=AFQjCNFDandkapnU15Btk5cnsxT2q9I2uw>`__\  is
populated only with the files matching the pattern -
'%.rsem.genes.normalized\_results'. The raw “RSEM genes normalized
results” files have two columns, the contents of which will be stored in
a BigQuery table.  The first column contains the gene\_id, and the
second the normalized\_count.  The gene\_id string contains two parts:
the gene symbol, and the gene id, separated by a “\|”, eg: “TP53\|7157”.

Formatting
^^^^^^^^^^

- The ‘gene\_id’ column is split into 'original\_gene\_symbol' and
'gene\_id'- both are stored as separate columns in BigQuery.
- Based on the ‘gene\_id’, HGNC approved gene symbol is added as a new
column “HGNC\_gene\_symbol”.


24 changes: 24 additions & 0 deletions docs/source/sections/data/data2/ETL_microRNAexpression.rst
Original file line number Diff line number Diff line change
@@ -0,0 +1,24 @@
microRNA Expression
===================

The current ISB TCGA data pipeline uses a Perl script
(expression\_matrix\_mimat.pl) from Andy Chu at BCGSC which reads the
isoform data files and outputs expression values for "mature microRNAs".
It outputs a matrix with a consistent number of mature microRNAs, in
which the microRNAs are referred to using a combination of the microRNA
gene name and the unique accession number, eg:
"hsa-mir-21.MIMAT0000076" - both the microRNA name and accession number
are stored as separate columns in the BigQuery
\ `table <https://www.google.com/url?q=https://bigquery.cloud.google.com/table/isb-cgc:tcga_201510_alpha.miRNA_expression&sa=D&usg=AFQjCNGPgJ1sAHyrdUV6jqHeNs5ZTjc2KQ>`__\ .
The entire matrix is melted into a flat structure(tidy data) and loaded
into the table. The isoform files were searched with the following
pattern - "%.isoform.quantification.txt". The aliquot barcode
information was obtained from the SDRF file associated with the Level-3
isoform data file.

Filters
-------

-  The pipeline is run only on the hg19 isoform files and others are
filtered out.

55 changes: 55 additions & 0 deletions docs/source/sections/data/data2/ETL_proteinExpression.rst
Original file line number Diff line number Diff line change
@@ -0,0 +1,55 @@
Protein
=======

The raw protein data file contains just two columns: The "Composite Element REF", which corresponds to the third column in the antibody
annotation file, and the estimated expression value for that particular
protein. The "Composite Element REF" was parsed to generate additional
information(see details in the formatting section). The BigQuery
`table <https://www.google.com/url?q=https://bigquery.cloud.google.com/table/isb-cgc:tcga_201510_alpha.Protein_RPPA_data>`_
was populated with all TCGA Level-3 RPPA data matching the pattern -
"%\_RPPA\_Core.protein\_expression%.txt".

The antibody annotation files are parsed to get the relationship between
the antibody name and the associated proteins, and genes. Below is the
detailed explanation about the generation of the antibody, gene, protein
map.

Generation of Composite\_element\_ref, gene, and protein name map
-----------------------------------------------------------------

      (Manual Curation of the gene and protein names)

- Check the antibody annotation files for missing columns.

- If “protein\_name” is missing, generate one from
“composite\_element\_ref”

- Make a map of ‘composite\_element\_ref’,’ gene\_name’,
‘protein\_name’ values.
- Check any other variant of the gene and protein symbols in the table.
- HGNC Validation

- If the gene symbol is in the HGNC approved symbols, ‘Approved’.
 Gene\_symbol = Gene\_symbol.
- If not, check the Alias symbols. If found,  Gene\_symbol =
Alias\_symbol.
- If not, check the Previous symbols. If found, Gene\_symbol =
“Approved” Gene\_symbol.
- If not, Gene\_symbol = Gene\_symbol
- The file generated is manually curated and fed back into the
algorithm.

Formatting
----------

- Duplicate the rows if there are multiple genes concatenated in the
"gene\_name" value. For example: ‘gene\_name’ with value like ‘AKT1
AKT2 AKT3’ is stored as three separate rows with each gene in a row.
- 'Protein\_Name' is split into 'Protein\_Basename', Phospho' and are
stored as separate columns.
- ‘Composite element ref’ is parsed to get 'validationStatus' and
'antibodySource' – both are stored as separate columns in the
BigQuery table.
- Data from both Illumina GA and HiSeq platforms are stored in the same
table.

0 comments on commit ca88dce

Please sign in to comment.