Skip to content
Permalink
Branch: master
Find file Copy path
Find file Copy path
Fetching contributors…
Cannot retrieve contributors at this time
220 lines (175 sloc) 8.17 KB

How to make a heatmap using BigQuery

ISB-CGC Community Notebooks

Check out more notebooks at our Community Notebooks Repository!

Title:   How to make a heatmap using BigQuery 
Author:  Lauren Hagen
Created: 2020-02-21
Purpose: Demonstrate how to make a correlation matrix and create a heatmap visualization.
Notes:   Adapted from the How to make a heatmap using BigQuery python notebook by David L Gibbs
Repo:    https://github.com/isb-cgc/Community-Notebooks/blob/master/Notebooks/How_to_make_a_heatmap_using_BigQuery.md

In this notebook, we will cover how to use BigQuery to pull some data, make a correlation (or distance) matrix, and visualize that result in a heatmap.

It is also possible to compute the correlations inside BigQuery, which is a good use of the technology, but for simplicity’s sake, we will use R.

The methods are used in cancer data analysis (see practically all TCGA studies) but can apply to any data. Here we will be using TCGA gene expression data from two studies, KIRC (kidney cancer) and GBM (brain cancer).

In this work, we will see if the two tissue types separate into clusters.

Google Auth & Load packages

Before we can begin working with BigQuery, we will need to load the BigQuery library and create a GCP variable.

library(bigrquery)
project <- 'your_project_number' # Insert your project ID in the ''
if (project == 'your_project_number') {
  print('Please update the project number with your Google Cloud Project')
}

We will also need to load the heatmap packages and dplyr.

library(tidyverse)
library(stringr)
library(pheatmap)
library(RColorBrewer)

Query for data

OK, we’re going to query for some data using a preselected gene set from MSigDB.

http://software.broadinstitute.org/gsea/msigdb/cards/HALLMARK_TGF_BETA_SIGNALING.html

genes <- read.delim('https://www.gsea-msigdb.org/gsea/msigdb/download_geneset.jsp?geneSetName=HALLMARK_TGF_BETA_SIGNALING&fileType=txt', skip = 1)

So, now that we have our list of genes, we will create a SQL string to communicate to the Google mothership.

Here, the SQL is captured as a string. Using the Google BigQuery web interface is a good place to prototype SQL.

sql = str_c("
SELECT project_short_name, sample_barcode, HGNC_gene_symbol, normalized_count 
FROM `isb-cgc.TCGA_hg19_data_v0.RNAseq_Gene_Expression_UNC_RSEM`
WHERE project_short_name IN ('TCGA-KIRC', 'TCGA-GBM')
AND HGNC_gene_symbol IN ('", str_c(genes[1:10,1], collapse = "', '"), "') 
GROUP BY 1,2,3,4")

Now we use the client, send off the SQL, and convert the results to a dataframe.

result <- bq_project_query(project, sql, quiet = TRUE)
dat <- bq_table_download(result, quiet = TRUE)
head(dat)
## # A tibble: 6 x 4
##   project_short_name sample_barcode   HGNC_gene_symbol normalized_count
##   <chr>              <chr>            <chr>                       <dbl>
## 1 TCGA-GBM           TCGA-12-0616-01A ACVR1                       575. 
## 2 TCGA-KIRC          TCGA-BP-4343-01A ACVR1                      1229. 
## 3 TCGA-KIRC          TCGA-B0-5117-01A APC                         573. 
## 4 TCGA-KIRC          TCGA-AK-3453-01A CDKN1C                       42.5
## 5 TCGA-KIRC          TCGA-CJ-4644-01A BMPR1A                      497. 
## 6 TCGA-KIRC          TCGA-CW-5587-01A BMPR2                      4710.

Converting tidy data to a matrix, and computing a correlation matrix

mat <- dat %>%
  select(-project_short_name) %>%
  group_by(sample_barcode, HGNC_gene_symbol) %>%
  summarise(genes = max(normalized_count)) %>%
  pivot_wider(names_from = HGNC_gene_symbol, values_from = genes)
head(mat)
## # A tibble: 6 x 11
## # Groups:   sample_barcode [6]
##   sample_barcode   ACVR1   APC ARID4B BCAR3  BMP2 BMPR1A BMPR2   CDH1  CDK9 CDKN1C
##   <chr>            <dbl> <dbl>  <dbl> <dbl> <dbl>  <dbl> <dbl>  <dbl> <dbl>  <dbl>
## 1 TCGA-02-0047-01A  989. 3389.   687.  122.  287.   903. 2020. 265.    954.   269.
## 2 TCGA-02-0055-01A 1309.  683.   630. 1501.  986.   423. 1219.   2.24 1166.   227.
## 3 TCGA-02-2483-01A  603. 1136.   885.  134.  287.   622. 1447.  19.0  1939.   274.
## 4 TCGA-02-2485-01A  801. 2499.   711.  624.  194.   394. 1628. 112.   1120    117.
## 5 TCGA-02-2486-01A  693. 1296.   509.  270.  199.   352. 1331.   9.15 1351.   313.
## 6 TCGA-06-0125-01A  629. 3077.   838.  171.  435.   727. 1984.  17.5   895.   229.

Let’s choose a smaller set of samples, to speed things up.

set.seed(10)
matSample <- sample_n(data.frame(mat), 60, replace = FALSE)
rownames(matSample) <- matSample[,1]
matSample <- matSample[-1]
head(matSample)
##                      ACVR1       APC    ARID4B    BCAR3     BMP2   BMPR1A    BMPR2      CDH1
## TCGA-BP-4770-01A 1995.5634  337.6180  844.0450 462.4515 326.9788 461.0330 3130.059   19.1506
## TCGA-CJ-5684-01A  862.2079  970.5882  780.0242 631.9380 602.7397 441.9823 3170.024 1786.8654
## TCGA-B0-5110-01A  910.8398 1968.5535 1325.2793 575.5901 980.7621 620.7917 4931.927  908.2501
## TCGA-B0-5707-01A 1009.6061  457.7329  536.9837 665.6388 811.7195 378.4822 1372.718 2036.0231
## TCGA-BP-4331-01A  650.9347  913.2700  860.1318 292.0625 334.0484 336.5002 2280.417 1412.8103
## TCGA-BP-4158-01A  720.0864  993.9525 1103.4212 520.5184 274.7300 420.3024 2824.190 2347.3002
##                       CDK9   CDKN1C
## TCGA-BP-4770-01A 1382.7442 167.0358
## TCGA-CJ-5684-01A 1051.5713 458.0983
## TCGA-B0-5110-01A 1015.5383 174.9908
## TCGA-B0-5707-01A 1078.7704 421.7099
## TCGA-BP-4331-01A 1051.1799 288.9979
## TCGA-BP-4158-01A  967.6026 490.7127

Now we’ll compute the correlations:

cormat <- cor(t(matSample), use = "pairwise.complete.obs")

and make a heatmap with the basic built-in stats package:

# Unsorted heatmap with basic built-in stats package
heatmap(cormat, Rowv = NA, Colv = NA)

We can also use heatmap() to create clustered heatmaps with colored side labels.

# create a data frame to label each sample in the matrix
label <- unique.data.frame(dat[,1:2])
# Create a new data frame for the labeled data and add a column with the sample_barcodes
cormat_lab <- as.data.frame(cormat)
cormat_lab$sample_barcode <- as.character(rownames(cormat))
# Join the project names to label data
cormat_lab <- left_join(cormat_lab, label, key = "sample_barcode")
# Create a vector with the project names as numeric
my_group <- as.numeric(as.factor(cormat_lab$project_short_name))
# Set the color pallet for the project names
colSide <- brewer.pal(9, "Set1")[my_group]
# Create the heatmap
heatmap(cormat, ColSideColors = colSide)

Cool! The tissue types separated out, and it looks like KIRC (blue bar) clusters into two groups.

There are several packages to give more options for creating heatmaps. We are going to explore pheatmap from the pheatmap package, but there is also heatmap.2 from the gplots package and superheat from the superheat package.

First, we will make a basic heat map with pheatmap:

pheatmap(cormat, fontsize_col = 5, fontsize_row = 4, cluster_row = FALSE, cluster_cols = FALSE)

Now, lets add some clustering to the heatmap:

# Create a dataframe with the study as factors and the sample barcodes as the row names
annot <- data.frame(Study = factor(cormat_lab$project_short_name))
rownames(annot) <- cormat_lab$sample_barcode
# Define a list for the annotation labels
ann_colors <- list(
  Study = c("TCGA-KIRC" = "#377EB8", "TCGA-GBM" = "#E41A1C")
)
# Create heatmap
pheatmap(cormat, fontsize_col = 5, fontsize_row = 4, 
         annotation_row = annot, annotation_names_row = FALSE,
         annotation_colors = ann_colors,
         cutree_rows = 3, border_color = NA)

That’s it! Please let us know if you have any questions at feedback@isb-cgc.org.

You can’t perform that action at this time.