# Lecture 2: Working with gene expression data

## Scientific Background

One of the most important challenges in all of science is **understanding the genetic architecture of complex traits**.

![mendelian trait](https://upload.wikimedia.org/wikipedia/commons/1/17/Punnett_square_mendel_flowers.svg)
(Wikipedia)


## Classical genetics
Background reading: [Probabilities in genetics (Khan Academy)](https://www.khanacademy.org/science/high-school-biology/hs-classical-genetics/hs-introduction-to-heredity/a/probabilities-in-genetics)
![image.png](attachment:image.png)

## Mendelian traits
- The simplest traits behave like Mendel's peas:
    - Lactase persistence
    - Albinism
    - Huntington's disease
    - ([Full list](https://en.wikipedia.org/wiki/Mendelian_traits_in_humans))

- However, many traits have a much more complex "architecture", and are affected by thousands of genetic variants scattered throughout the genome.


- Height is one such example; schizophrenia is another.
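The Mendelian case at the top of this list can be made concrete with a Punnett square. The sketch below is a minimal illustration, assuming a single gene with a dominant allele `A` and a recessive allele `a`, two heterozygous parents, and complete dominance; the variable names are made up for this example.

```r
# Each heterozygous (Aa) parent passes on "A" or "a" with probability 1/2.
alleles <- c("A", "a")

# Enumerate the four equally likely cells of the Punnett square.
cross <- expand.grid(mother = alleles, father = alleles, stringsAsFactors = FALSE)

# Count copies of the recessive allele "a" in each offspring genotype.
cross$n_recessive <- (cross$mother == "a") + (cross$father == "a")

# Genotype probabilities for 0 copies (AA), 1 copy (Aa), and 2 copies (aa).
geno_prob <- prop.table(table(cross$n_recessive))
print(geno_prob)

# Under complete dominance, only aa offspring show the recessive phenotype.
p_recessive <- as.numeric(geno_prob["2"])
```

A complex trait such as height has no comparably small table: with thousands of contributing variants, the genotype combinations cannot be enumerated this way.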

![image-2.png](attachment:image-2.png)

## GWAS
- If we have genotypes from many different people ($\mathbf{X}$), as well as the corresponding trait information ($y$), we might use regression to understand the genetic architecture:

$$\underbrace{y}_{\text{height, disease status, etc.}} = \underbrace{\mathbf{X}}_{\text{genotype}}\cdot \beta + \epsilon$$

- This is called a **genome-wide association study** (GWAS).
- GWAS estimates the association between DNA variants and a phenotype of interest.
- However, the picture is more complicated...
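As a rough illustration of the regression above, the sketch below simulates genotypes and a trait, then tests each variant separately with a simple linear model. Everything here is an assumption made for illustration: the sample size, the allele frequency, the two "causal" variants, and the Bonferroni threshold. Real GWAS pipelines also adjust for covariates such as ancestry, which this sketch omits.

```r
set.seed(1)

n <- 500   # number of individuals (hypothetical)
p <- 20    # number of genetic variants (hypothetical)
maf <- 0.3 # allele frequency shared by all variants (hypothetical)

# Genotype matrix X: each entry counts copies (0, 1, or 2) of the minor allele.
X <- matrix(rbinom(n * p, size = 2, prob = maf), nrow = n, ncol = p)

# True effects beta: only variants 3 and 7 influence the trait in this simulation.
beta <- rep(0, p)
beta[c(3, 7)] <- c(0.5, -0.4)

# Trait y = X beta + noise, matching the regression equation above.
y <- drop(X %*% beta) + rnorm(n)

# Test each variant on its own and keep the p-value of its slope.
pvals <- apply(X, 2, function(x) {
  summary(lm(y ~ x))$coefficients["x", "Pr(>|t|)"]
})

# Bonferroni correction for testing p variants.
hits <- which(pvals < 0.05 / p)
print(hits)
```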

## DNA, RNA, Proteins

How does the genetic code translate into phenotype (an observable trait)? The pathway from one to the other is described by the **central dogma of molecular biology**:


![complex traits](https://cdn.kastatic.org/ka-perseus-images/53b7ece60303244264411d03bfbe55d36312b64e.png)

## The central dogma

Notice in the above figure that there is a step (in fact, multiple steps) "between" the genetic code and the ultimate phenotype. The process by which "your code" becomes "you" is quite complicated. Its core is summarized by the **central dogma of molecular biology**:

$$\text{DNA} \longrightarrow \text{RNA} \longrightarrow \text{protein} (\longrightarrow \cdots \longrightarrow \text{trait})$$

### Variation in the steps along this pathway generates all<sup>*</sup> the diversity of life.


<span style="font-size: 80%">*: Environmental factors also play a role.</span>

### Cells

<img src="images/Animal_cell.png" width="80%" class="center">

### DNA, RNA, Proteins

<img src="images/central_dogma.jpg" width="30%" class="center">

### DNA, RNA, Proteins

<img src="images/protein_synthesis.png" width="65%" class="center">

Summarizing:

* Almost every cell has two full copies of your DNA (one from each parent).
* Some, **but not all**, genes will get transcribed to mRNA
  * The **amount** (*expression*) of mRNA can differ dramatically from gene to gene.
  * Gene expression varies from cell to cell.
  * Gene expression varies over time within a cell.
* The mRNA gets translated to proteins:
  * The proteins "do stuff".
  * Structural: pores in the cell membrane, microfilaments, etc.
  * Chemical: catalyze reactions, etc.
  
**All of these steps together govern the final outcome.**

### Regulation

* The process of 
$\text{DNA} \longrightarrow \text{RNA} \longrightarrow \text{protein}$
is highly regulated.
* Gene expression can be upregulated or downregulated, e.g., via various feedback loops within the cell.
* Understanding how gene expression works is key to understanding how complex traits arise and are governed.
    - Many diseases, including certain cancers, are driven by changes in gene expression.
    - By knowing which genes are turned on or off by a particular treatment, researchers can develop more targeted and effective drugs.
* Despite two centuries of progress, **we still know surprisingly little about the architecture of the genome**.

![image.png](attachment:image.png)

## 🤔 Discussion

- What is the point of this paper? Why is it in Science?
- Did you find it easy or hard to read?
- What are their main results? Do you find their conclusions convincing?
- How was the data collected? What are some potential limitations of the study design?

![image.png](attachment:image.png)

## Working with the GTEx data

- Most of the data is freely available at the [GTEx Portal](https://www.gtexportal.org/home/).
- For privacy reasons, access to the raw data is controlled. We'll work with the summarized data.

### Individual-level phenotypes
(restricted due to privacy)

In [24]:
library(tidyverse)
base_url <- "https://storage.googleapis.com/gtex_analysis_v8"
pheno_url <- str_c(base_url, "/annotations/GTEx_Analysis_v8_Annotations_SubjectPhenotypesDS.txt")
download.file(pheno_url, "phenotypes.txt")
donors_df <- read_delim("phenotypes.txt") %>% print

Rows: 980 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: "\t"
chr (2): SUBJID, AGE
dbl (2): SEX, DTHHRDY

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


# A tibble: 980 × 4
   SUBJID       SEX AGE   DTHHRDY
   <chr>      <dbl> <chr>   <dbl>
 1 GTEX-1117F     2 60-69       4
 2 GTEX-111CU     1 50-59       0
 3 GTEX-111FC     1 60-69       1
 4 GTEX-111VG     1 60-69       3
 5 GTEX-111YS     1 60-69       0
 6 GTEX-1122O     2 60-69       0
 7 GTEX-1128S     2 60-69       2
 8 GTEX-113IC     1 60-69      NA
 9 GTEX-113JC     2 50-59       2
10 GTEX-117XS     1 60-69       2
# … with 970 more rows


In [15]:
samples_url <- str_c(base_url, "/annotations/GTEx_Analysis_v8_Annotations_SampleAttributesDS.txt")
download.file(samples_url, "samples.txt")
samples_df <- read_delim("samples.txt") %>% print

Rows: 22951 Columns: 63
── Column specification ────────────────────────────────────────────────────────
Delimiter: "\t"
chr (14): SAMPID, SMCENTER, SMPTHNTS, SMTS, SMTSD, SMUBRID, SMNABTCH, SMNABT...
dbl (41): SMATSSCR, SMRIN, SMTSISCH, SMTSPAX, SME2MPRT, SMCHMPRS, SMNTRART, ...
lgl  (8): SMNUMGPS, SM550NRM, SM350NRM, SMMNCPB, SMMNCV, SMCGLGTH, SMGAPPCT,...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


# A tibble: 22,951 × 63
   SAMPID      SMATS…¹ SMCEN…² SMPTH…³ SMRIN SMTS  SMTSD SMUBRID SMTSI…⁴ SMTSPAX
   <chr>         <dbl> <chr>   <chr>   <dbl> <chr> <chr> <chr>     <dbl>   <dbl>
 1 GTEX-1117F…      NA B1      NA       NA   Blood Whol… 0013756    1188      NA
 2 GTEX-1117F…      NA B1      NA       NA   Blood Whol… 0013756    1188      NA
 3 GTEX-1117F…      NA B1      NA       NA   Blood Whol… 0013756    1188      NA
 4 GTEX-1117F…      NA B1, A1  NA       NA   Brain Brai… 0009834    1193      NA
 5 GTEX-1117F…      NA B1, A1  …

For this lab we'll focus on the RNA-seq samples:

In [17]:
rnaseq_df <- samples_df %>% filter(SMAFRZE == "RNASEQ") %>% print

# A tibble: 17,382 × 63
   SAMPID      SMATS…¹ SMCEN…² SMPTH…³ SMRIN SMTS  SMTSD SMUBRID SMTSI…⁴ SMTSPAX
   <chr>         <dbl> <chr>   <chr>   <dbl> <chr> <chr> <chr>     <dbl>   <dbl>
 1 GTEX-1117F…       0 B1      2 piec…   6.8 Adip… Adip… 0002190    1214    1125
 2 GTEX-1117F…       0 B1      2 piec…   7.1 Musc… Musc… 0011907    1220    1119
 3 GTEX-1117F…       0 B1      2 piec…   8   Bloo… Arte… 0007610    1221    1120
 4 GTEX-1117F…       1 B1      2 piec…   6.9 Bloo… Arte… 0001621    1243    1098
 5 GTEX-1117F…       1 B1      2 piec…   6.3 Heart Hear… 0006631    1244    1097
 6 GTEX-1117F…       1 B1      2 piec…   5.9 Adip… Adip… 00…

`SMTSD` stands for 'Tissue Site Detail'. It tells us which tissue each sample was collected from. The five most frequently sampled tissues are:

In [23]:
rnaseq_df %>% count(SMTSD) %>% top_n(5)

Selecting by n


SMTSD,n
<chr>,<int>
Adipose - Subcutaneous,663
Artery - Tibial,663
Muscle - Skeletal,803
Skin - Sun Exposed (Lower leg),701
Whole Blood,755


The first two hyphen-separated components of the sample ID form the donor ID. So, to find, e.g., all whole blood samples from male subjects, we could use the query:

In [69]:
rnaseq_df %>% 
    mutate(SUBJID = map_chr(SAMPID, \(s) str_c(str_split(s, "-", simplify = TRUE)[1:2], collapse="-"))) %>% 
    left_join(donors_df) %>% 
    filter(SEX == 1, SMTSD == "Whole Blood") %>% 
    print

Joining with `by = join_by(SUBJID)`


# A tibble: 501 × 67
   SAMPID      SMATS…¹ SMCEN…² SMPTH…³ SMRIN SMTS  SMTSD SMUBRID SMTSI…⁴ SMTSPAX
   <chr>         <dbl> <chr>   <chr>   <dbl> <chr> <chr> <chr>     <dbl>   <dbl>
 1 GTEX-111YS…      NA B1      NA        8.2 Blood Whol… 0013756    -121      NA
 2 GTEX-113IC…      NA B1      NA        8.6 Blood Whol… 0013756    -351      NA
 3 GTEX-117XS…      NA B1      NA        6.4 Blood Whol… 0013756     802      NA
 4 GTEX-117YW…      NA B1      NA        8.5 Blood Whol… 0013756     771      NA
 5 GTEX-1192W…      NA B1      NA        8   Blood Whol… 0013756     645      …

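The donor ID extraction in the query above can also be written as a single vectorized regular expression. This is a minimal sketch, assuming sample IDs follow the hyphen-separated pattern described above; the IDs below are made up for illustration and are not real GTEx samples.

```r
library(stringr)

# Hypothetical sample IDs in the same hyphen-separated style (not real GTEx IDs).
sample_ids <- c("GTEX-AAAAA-0001-SM-XXXXX", "GTEX-BBBBB-0002-SM-YYYYY")

# Keep everything up to, but not including, the second hyphen.
subj_ids <- str_extract(sample_ids, "^[^-]+-[^-]+")
print(subj_ids)
```

Because `str_extract()` is vectorized, this form can be used directly inside `mutate()` without `map_chr()`.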
## Expression data
Expression levels are reported in [TPM](https://academic.oup.com/bioinformatics/article/26/4/493/243395?login=true) (transcripts per million); higher values mean more expression. The file loaded below contains the median TPM of each gene within each tissue.
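To make the unit concrete, here is a minimal sketch of the usual TPM calculation for a single sample: divide each gene's read count by its length, then rescale so the values sum to one million. The counts and gene lengths are made-up numbers, and this is not a description of the GTEx processing pipeline.

```r
# Hypothetical read counts and gene lengths (in kilobases) for one sample.
counts    <- c(geneA = 100, geneB = 300, geneC = 600)
length_kb <- c(geneA = 1,   geneB = 3,   geneC = 2)

# Step 1: reads per kilobase, which removes the advantage of longer genes.
rate <- counts / length_kb

# Step 2: rescale so that the values in this sample sum to one million.
tpm <- rate / sum(rate) * 1e6
print(tpm)
```

Note that `geneA` and `geneB` end up with the same TPM even though `geneB` has three times as many reads, because it is also three times as long.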

In [67]:
#  rna_seq <- str_c(base_url, '/rna_seq_data/GTEx_Analysis_2017-06-05_v8_RNASeQCv1.1.9_gene_tpm.gct.gz')
#  download.file(rna_seq, 'gene_tpm.gct.gz')  # warning: large

rna_seq <- str_c(base_url, '/rna_seq_data/GTEx_Analysis_2017-06-05_v8_RNASeQCv1.1.9_gene_median_tpm.gct.gz')
download.file(rna_seq, 'gene_tpm.gct.gz')
gene_tpm_df <- read_delim("gene_tpm.gct.gz", skip=2)

Rows: 56200 Columns: 56
── Column specification ────────────────────────────────────────────────────────
Delimiter: "\t"
chr  (2): Name, Description
dbl (54): Adipose - Subcutaneous, Adipose - Visceral (Omentum), Adrenal Glan...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

In [68]:
gene_tpm_df %>% head

Name,Description,Adipose - Subcutaneous,Adipose - Visceral (Omentum),Adrenal Gland,Artery - Aorta,Artery - Coronary,Artery - Tibial,Bladder,Brain - Amygdala,⋯,Skin - Not Sun Exposed (Suprapubic),Skin - Sun Exposed (Lower leg),Small Intestine - Terminal Ileum,Spleen,Stomach,Testis,Thyroid,Uterus,Vagina,Whole Blood
<chr>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,⋯,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
ENSG00000223972.5,DDX11L1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,⋯,0.0,0.0,0.0,0.0,0.0,0.166403,0.0,0.0,0.0,0.0
ENSG00000227232.5,WASH7P,4.06403,3.37111,2.68549,4.04762,3.90076,3.63963,5.16375,1.43859,⋯,5.93298,6.13265,4.19378,5.92631,3.06248,4.70253,6.27255,7.19001,5.74554,2.64743
ENSG00000278267.1,MIR6859-1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,⋯,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
ENSG00000243485.5,MIR1302-2HG,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,⋯,0.0,0.0,0.0,0.0,0.0,0.0542228,0.0,0.0,0.0,0.0
ENSG00000237613.2,FAM138A,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,⋯,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
ENSG00000268020.3,OR4G4P,0.0,0.0,0.0363952,0.0,0.0,0.0,0.0354698,0.0496721,⋯,0.0267156,0.0,0.0354079,0.0,0.0325904,0.0,0.0,0.0,0.0,0.0
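A wide table with one column per tissue is awkward to plot or summarize, so a common first step is to reshape it to one row per gene and tissue. The sketch below does this on a toy table; its values are copied from the first two rows shown above, restricted to three of the tissue columns for brevity. The names `toy_tpm`, `long_tpm`, and `top_tissue` are made up for this example.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Toy excerpt: two genes, three tissue columns, values taken from the output above.
toy_tpm <- tibble(
  Name = c("ENSG00000223972.5", "ENSG00000227232.5"),
  Description = c("DDX11L1", "WASH7P"),
  `Adipose - Subcutaneous` = c(0, 4.06403),
  Testis = c(0.166403, 4.70253),
  `Whole Blood` = c(0, 2.64743)
)

# Reshape to long format: one row per (gene, tissue) pair.
long_tpm <- toy_tpm %>%
  pivot_longer(-c(Name, Description), names_to = "tissue", values_to = "median_tpm")

# For each gene, keep the tissue with the highest median TPM.
top_tissue <- long_tpm %>%
  group_by(Description) %>%
  slice_max(median_tpm, n = 1, with_ties = FALSE) %>%
  ungroup()

print(top_tissue)
```

The same two steps should apply to the full `gene_tpm_df`, where the answer can differ because all 54 tissue columns are considered.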


##  Project 1

This project is designed to practice **exploratory data analysis**. Imagine that:

* You are the statistical collaborator to the lab that produced this data.
* The experiment just completed, and the data just became available.
* You have been given the data, and been asked to take a look at it.
* You will be presenting at group meeting in two weeks.

As you poke around this data set, keep an eye out for:
* Potential problems with the data.
* Anything odd or unexpected.
* Challenges the data will pose (possibly unanticipated).
* Anything especially interesting.
* Initial findings.
* Suggestions for future runs of the experiment.

Additional notes:

* Insightful data visualization is important.
* Fancy models (or any models) are not needed.
* The overall goal is *insight* -- into the data, into the experimental methodology, into the science.

## Deliverables
- An 8-10 page writeup describing the data, what analyses you performed, and your findings, due two weeks from yesterday.
- A brief presentation (not to exceed 15 minutes) summarizing your results, given in class in two weeks.
- Groups of three students have been randomly assigned on Canvas.
- It's okay to divide up the work, as long as everyone does roughly an equal amount of work.