Skip to content
Permalink
master
Switch branches/tags
Go to file
 
 
Cannot retrieve contributors at this time

How to plot BigQuery results

# ISB-CGC Community Notebooks
# Check out more notebooks at our [Community Notebooks Repository](https://github.com/isb-cgc/Community-Notebooks)!
Title:   How to visualize results from BigQuery
Author:  David L Gibbs
Created: 2019-07-17
Purpose: Demonstrate how visualize the results from a query.
URL:     https://github.com/isb-cgc/Community-Notebooks/blob/master/Notebooks/How_to_plot_BigQuery_results.Rmd
Notes:   
***

#How to visualize results from BigQuery

In this example, we’ll perform a few easy queries, and use the ggplot library to visualize the results.

We will use theh bigrquery package to execute our SQL: https://github.com/r-dbi/bigrquery

It is expected that you’re running the notebook in an interactive mode, where you have access to a web browser for authorization with Google.

Library documentation: https://ggplot2.tidyverse.org/

There’s many tidyverse tutorials. For example, http://www.storybench.org/getting-started-with-tidyverse-in-r/

#Function to check whether package is installed
#https://stackoverflow.com/questions/9341635/check-for-installed-packages-before-running-install-packages
is.installed <- function(mypkg){
    is.element(mypkg, installed.packages()[,1])
} 
  
if (!is.installed("bigrquery")){
    install.packages("bigrquery")
}

if (!is.installed("ggplot2")){
    install.packages("ggplot2")
}


if (!is.installed("dbplyr")){
  install.packages("dbplyr")
}

if (!is.installed("dplyr")){
  install.packages(dplyr)
}

if (!is.installed("dbplyr")){
  install.packages(DBI)
}

Getting setup

library(bigrquery)
library(ggplot2)
billing <- 'your_project_number' ### Replace with your project ID!! ### 
if (billing == 'your_project_number') {
  print('Please update the project number with your Google Cloud Project')
}

Making barplots

The first time you run bq_project_query, you will be asked if you want to authorize locally in the R session. By selecting ‘yes’, a browser tab will open allowing google auth.

# We define queries as strings #
sql = "
SELECT
   icd_10,
   COUNT(*) as Count
FROM
   `isb-cgc.TCGA_bioclin_v0.Clinical`
GROUP BY
   1  -- this is the same as 'group by icd_10'
ORDER BY
   Count
"

tb <- bq_project_query(billing, sql)
## Complete

## Billed: 0 B
df <- bq_table_download(tb, max_results = 200)
## Downloading 168 rows in 1 pages.
df[1:5,]
## # A tibble: 5 x 2
##   icd_10 Count
##   <chr>  <int>
## 1 C71.8      1
## 2 C34.10     1
## 3 C34.90     1
## 4 C16.5      1
## 5 C80.1      1

Now we can make a bar plot with ggplot2.

df2 <- df[df$Count > 200,]
g <- ggplot(data=df2, aes(x=icd_10, y=Count))
g + geom_bar(stat="identity")

Making scatter plots

sql = "
SELECT
   avg_percent_neutrophil_infiltration,
   avg_percent_lymphocyte_infiltration
FROM
   `isb-cgc.TCGA_bioclin_v0.Biospecimen`
GROUP BY
   1,2
"

tb3 <- bq_project_query(billing, sql)
## Complete

## Billed: 0 B
df3 <- bq_table_download(tb3, max_results = 200)
## Downloading 200 rows in 1 pages.
df3[1:5,]
## # A tibble: 5 x 2
##   avg_percent_neutrophil_infiltration avg_percent_lymphocyte_infiltration
##                                 <dbl>                               <dbl>
## 1                                  NA                                  NA
## 2                                   2                                   5
## 3                                   3                                   2
## 4                                   0                                   2
## 5                                   0                                   1

And now we make plot those results.

Note: It is always best practice to summarize within BigQuery, and plot summarized results.

g <- ggplot(data=df3, aes(x=avg_percent_lymphocyte_infiltration, y=avg_percent_neutrophil_infiltration))
g + geom_point()
## Warning: Removed 22 rows containing missing values (geom_point).

Using dplyr

You can use dplyr to create SQL! We’ll recreate the first query.

sql = "
SELECT
   icd_10,
   COUNT(*) as Count
FROM
   `isb-cgc.TCGA_bioclin_v0.Clinical`
GROUP BY
   1  -- this is the same as 'group by icd_10'
ORDER BY
   Count
"

library(dplyr)
library(DBI)

con <- dbConnect(
  bigrquery::bigquery(),
  project = "isb-cgc",
  dataset = "TCGA_bioclin_v0",
  billing = billing
)
con
## <BigQueryConnection>
##   Dataset: isb-cgc.TCGA_bioclin_v0
##   Billing: isb-cgc-02-0001
tcga_clinical <- tbl(con, "Clinical")

res4 <- tcga_clinical %>% select(icd_10) %>% group_by(icd_10) %>% tally()

head(res4)
## Complete

## Billed: 10.49 MB

## Downloading 6 rows in 1 pages.

## # Source:   lazy query [?? x 2]
## # Database: BigQueryConnection
##   icd_10     n
##   <chr>  <int>
## 1 <NA>     355
## 2 C71.9    652
## 3 C56.9    582
## 4 C34.3    347
## 5 C34.1    577
## 6 C34.2     37

And now we can plot the results!

df3 <- res4 %>% filter(n > 200)
g <- ggplot(data=df3, aes(x=icd_10, y=n))
## Complete

## Billed: 10.49 MB

## Downloading 15 rows in 1 pages.
g + geom_bar(stat="identity")