Skip to content

Latest commit

 

History

History
336 lines (311 loc) · 9.31 KB

How_to_perform_an_ANOVA_test_in_BigQuery.md

File metadata and controls

336 lines (311 loc) · 9.31 KB

One-Way ANOVA Test in BigQuery

ISB-CGC Community Notebooks

Check out more notebooks at our Community Notebooks Repository!

Title:   One-Way ANOVA Test in BigQuery
Author:  Lauren Hagen
Created: 2019-08-02
Purpose: Demonstrate an ANOVA test within BigQuery
URL:     https://github.com/isb-cgc/Community-Notebooks/blob/master/Notebooks/How_to_perform_an_ANOVA_test_in_BigQuery.md
Notes:   This notebook was adapted from work by David L Gibbs, September 2017 Query of the Month

In this notebook, we will cover how to do a one-way ANOVA test in BigQuery. This statistical test can be used to determine whether there is a statistically significant difference between the means of two or more independent groups. Although in this example, I’m only looking at two groups, it would not be difficult to extend this to any number of groups, assuming there is a reasonable number of samples within each group.

Consider the model yij = m + ai + eij, where yij is a continuous variable over samples j, in groups i, and ai is a constant for each group i, and eij is a gaussian error term with mean 0.

Using this model, we are describing the data as being sampled from groups, with each group having a mean value equal to m + ai. The null hypothesis is that each of the group means is the same (ie that the ai terms are zero), while the alternative hypothesis is that at least one of the ai terms is not zero.

We use the F-test to compare these two hypotheses. To compute the test statistic, we compute the within-group variation and the between-group variation. Recall that sample variance is defined as the sum of squared differences between observations and the mean, divided by the number of samples (normalized).

First, we will need to import the required libraries and create a client variable. For more information see ‘Quick Start Guide to ISB-CGC’ and alternative authentication methods can be found here.

library(bigrquery)
library(dplyr)

Then let us set up the billing variables we will be using in this notebook:

billing <- 'your_project_number' # Insert your project ID in the ''
if (billing == 'your_project_number') {
  print('Please update the project number with your Google Cloud Project')
}

Let’s look at the query:

anova_query = "
WITH
  -- using standard SQL,
  -- we'll select our cohort and gene expression
  --
  cohortExpr AS (
  SELECT
    sample_barcode,
    LOG10(normalized_count) AS expr
  FROM
    `isb-cgc.TCGA_hg19_data_v0.RNAseq_Gene_Expression_UNC_RSEM`
  WHERE
    project_short_name = 'TCGA-BRCA'
    AND HGNC_gene_symbol = 'TP53'
    AND normalized_count IS NOT NULL
    AND normalized_count > 0),
  --
  -- And we'll select the variant data for our cohort,
  -- we're going to be comparing variant types (SNP, DEL, etc)
  --
  cohortVar AS (
  SELECT
    Variant_Type,
    sample_barcode_tumor AS sample_barcode
  FROM
    `isb-cgc.TCGA_hg19_data_v0.Somatic_Mutation_MC3`
  WHERE
    SYMBOL = 'TP53' ),
  --
  -- then we join the expression and variant data using sample barcodes
  --
  cohort AS (
  SELECT
    cohortExpr.sample_barcode AS sample_barcode,
    Variant_Type AS group_name,
    expr
  FROM
    cohortExpr
  JOIN
    cohortVar
  ON
    cohortExpr.sample_barcode = cohortVar.sample_barcode ),
  --
  -- First part of the calculation, the grand mean (over everything)
  --
  grandMeanTable AS (
  SELECT
    AVG(expr) AS grand_mean
  FROM
    cohort ),
  --
  -- Then we need a mean per group, and we can get a count of samples
  -- per group.
  --
  groupMeansTable AS (
  SELECT
    AVG(expr) AS group_mean,
    group_name,
    COUNT(sample_barcode) AS n
  FROM
    cohort
  GROUP BY
    group_name),
  --
  -- To get the between-group variance
  -- we take the difference between the grand mean
  -- and the means for each group and sum over all samples
  -- ... a short cut being taking the product with n.
  -- Later we'll sum over the n_sq_diff
  --
  ssBetween AS (
  SELECT
    group_name,
    group_mean,
    grand_mean,
    n,
    n*POW(group_mean - grand_mean,2) AS n_diff_sq
  FROM
    groupMeansTable
  CROSS JOIN
    grandMeanTable ),
  --
  -- Then, to get the variance within each group
  -- we have to build a table matching up the group mean
  -- with the values for each group. So we join the group
  -- means to the values on group name. We are going to
  -- sum over this table just like ssBetween
  --
  ssWithin AS (
  SELECT
    a.group_name AS group_name,
    expr,
    group_mean,
    b.n AS n,
    POW(expr - group_mean, 2) AS s2
  FROM
    cohort a
  JOIN
    ssBetween b
  ON
    a.group_name = b.group_name ),
  --
  -- The F stat comes from a ratio, the numerator is
  -- calculated using the between group variance, and
  -- dividing by the number of groups (k) minus 1.
  --
  numerator AS (
  SELECT
    'dummy' AS dummy,
    SUM(n_diff_sq) / (count(group_name) - 1) AS mean_sq_between
  FROM
    ssBetween ),
  --
  -- The denominator of the F stat ratio is found using the
  -- variance within groups. We divide the sum of the within
  -- group variance and divide it by (n-k).
  --
  denominator AS (
  SELECT
    'dummy' AS dummy,
    COUNT(distinct(group_name)) AS k,
    COUNT(group_name) AS n,
    SUM(s2)/(COUNT(group_name)-COUNT(distinct(group_name))) AS mean_sq_within
  FROM
    ssWithin),
  --
  -- Now we're ready to calculate F!
  --
  Ftable AS (
  SELECT
    n,
    k,
    mean_sq_between,
    mean_sq_within,
    mean_sq_between / mean_sq_within AS F
  FROM
    numerator
  JOIN
    denominator
  ON
    numerator.dummy = denominator.dummy)

SELECT
  *
FROM
  Ftable
"
# To see the R console output with query processing information, turn queit to FALSE
anova_result <- bq_project_query(billing, anova_query, quiet = TRUE)
# Transform the query result into a tibble
anova_result <- bq_table_download(anova_result, quiet = TRUE)
anova_result
## # A tibble: 1 x 5
##       n     k mean_sq_between mean_sq_within     F
##   <int> <int>           <dbl>          <dbl> <dbl>
## 1   375     3            6.23         0.0956  65.1

OK, so let’s check our work. Using the BRCA cohort and TP53 as our gene, we have 375 samples with a variant in this gene. We’re going to look at whether the type of variant is related to the gene expression we observe. If we just pull down the data using the ‘cohort’ subtable (as above), we can get a small data frame, which let’s us do the standard F stat table in R.

dat_query = "
WITH
  -- using standard SQL,
  -- we'll select our cohort and gene expression
  --
  cohortExpr AS (
  SELECT
    sample_barcode,
    LOG10(normalized_count) AS expr
  FROM
    `isb-cgc.TCGA_hg19_data_v0.RNAseq_Gene_Expression_UNC_RSEM`
  WHERE
    project_short_name = 'TCGA-BRCA'
    AND HGNC_gene_symbol = 'TP53'
    AND normalized_count IS NOT NULL
    AND normalized_count > 0),
  --
  -- And we'll select the variant data for our cohort,
  -- we're going to be comparing variant types (SNP, DEL, etc)
  --
  cohortVar AS (
  SELECT
    Variant_Type,
    sample_barcode_tumor AS sample_barcode
  FROM
    `isb-cgc.TCGA_hg19_data_v0.Somatic_Mutation_MC3`
  WHERE
    SYMBOL = 'TP53' ),
  --
  -- then we join the expression and variant data using sample barcodes
  --
  cohort AS (
  SELECT
    cohortExpr.sample_barcode AS sample_barcode,
    Variant_Type AS group_name,
    expr
  FROM
    cohortExpr
  JOIN
    cohortVar
  ON
    cohortExpr.sample_barcode = cohortVar.sample_barcode ),
  --
  -- First part of the calculation, the grand mean (over everything)
  --
  grandMeanTable AS (
  SELECT
    AVG(expr) AS grand_mean
  FROM
    cohort )
SELECT sample_barcode, group_name, expr
FROM cohort"
# To see the R console output with query processing information, turn queit to FALSE
dat <- bq_project_query(billing, dat_query, quiet = TRUE)
# Transform the query result into a tibble
dat <- bq_table_download(dat, quiet = TRUE)
dat
## # A tibble: 375 x 3
##    sample_barcode   group_name  expr
##    <chr>            <chr>      <dbl>
##  1 TCGA-AR-A24P-01A SNP         3.12
##  2 TCGA-BH-A0RX-01A SNP         3.03
##  3 TCGA-BH-A1FD-01A SNP         3.36
##  4 TCGA-A8-A08R-01A INS         2.74
##  5 TCGA-LL-A73Y-01A SNP         2.77
##  6 TCGA-GM-A2DD-01A SNP         3.37
##  7 TCGA-E2-A155-01A SNP         2.98
##  8 TCGA-E2-A1LG-01A DEL         3.41
##  9 TCGA-BH-A1FE-01A SNP         3.23
## 10 TCGA-PE-A5DE-01A SNP         3.14
## # ... with 365 more rows
dat %>% group_by(group_name) %>% summarize(mean=mean(expr), sd=sd(expr))
## # A tibble: 3 x 3
##   group_name  mean    sd
##   <chr>      <dbl> <dbl>
## 1 DEL         2.79 0.322
## 2 INS         2.64 0.116
## 3 SNP         3.22 0.313
anova(lm(data=dat, expr~group_name))
## Analysis of Variance Table
## 
## Response: expr
##             Df Sum Sq Mean Sq F value    Pr(>F)    
## group_name   2 12.460  6.2302  65.147 < 2.2e-16 ***
## Residuals  372 35.576  0.0956                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1