---
title: "Scripts for RNA-seq and ChIP-seq analysis primer"
author: Jianhong Ou
output: rmarkdown::html_vignette
vignette: >
%\VignetteIndexEntry{Scripts for RNA-seq and ChIP-seq analysis primer}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, eval=FALSE)
```
## Learning Objectives:
1. pipeline: kallisto and Salmon + tximport + DESeq2 for RNA-seq
2. pipeline: bwa + MACS2 for ChIP-seq
For more information, please refer to the [slides](https://github.com/jianhong/genomictools/blob/master/inst/extdata/BasicBioinformaticsRNAseqChIPseqATACseqPrimer.pdf).
[![Youtube Video](https://img.youtube.com/vi/N_ul4CwM0ZU/0.jpg)](https://youtu.be/N_ul4CwM0ZU)
## Set-up
All the materials used in this workshop are included in the
[Docker](https://www.docker.com/) image
[jianhong/genomictools](https://hub.docker.com/repository/docker/jianhong/genomictools).
Before we start, please download and install Docker Desktop.
### Install docker
Once Docker is installed on your system, run the following code in a terminal.
This script pulls the latest version of the Docker container for this workshop.
```{bash, eval=FALSE}
which docker
#docker pull jianhong/genomictools:latest
docker pull ghcr.io/jianhong/genomictools:latest
```
In Docker's Preferences, set the memory resources to at least 5 GB.
Then run the following code in a terminal.
```{bash, eval=FALSE}
## set docker memory to > 5G ; !important
## RStudio Server will not work on M1-chip MacBooks.
cd ~
mkdir tmp4genomictools
docker run --memory 5g -e PASSWORD=123456 -p 8787:8787 \
-v ${PWD}/tmp4genomictools:/home/rstudio/data \
jianhong/genomictools:latest
```
"docker run" will open port 8787 on your system. Now you can login rstudio in
the container at http://localhost:8787 with username "rstudio" and password "123456".
All the following steps are running in Rstudio "Terminal" or "Console".
You may want to open the source of "scripts.Rmd" at the "Source Panes" by
```{r, eval=FALSE}
rstudioapi::documentOpen("/usr/local/lib/R/site-library/basicBioinformaticsDRC2024/doc/scripts.Rmd")
```
## Run kallisto and Salmon for RNA-seq
The sample files are packaged
in the basicBioinformaticsDRC2024 package and the Docker container.
We will now download the zebrafish cDNA file from Ensembl in order to build the
kallisto and Salmon transcript indices. If you are working with an rRNA-depletion
library, please download and merge the cDNA and ncRNA files from Ensembl to make
the full transcriptome.
```{bash, eval=FALSE}
cd RNAseq
wget ftp://ftp.ensembl.org/pub/release-105/fasta/danio_rerio/cdna/Danio_rerio.GRCz11.cdna.all.fa.gz
```
Now we can build the transcriptome index. This takes some time and memory for the
full dataset, which is why we need to set the Docker memory to more than 5 GB.
```{bash, eval=FALSE}
kallisto index -i danRer.GRCz11_transcrits.idx Danio_rerio.GRCz11.cdna.all.fa.gz
salmon index -i danRer.GRCz11_transcrits.salmon.idx -t Danio_rerio.GRCz11.cdna.all.fa.gz
```
The data are only a subset of the whole genome. We will only use genes on
chromosomes 4, 13, 16, and 21 to speed up the test run.
```{bash, eval=FALSE}
mkdir data/RNAseq
cd data/RNAseq
# wget https://raw.githubusercontent.com/jianhong/genomictools/master/inst/extdata/Danio_rerio.GRCz11.cdna.toy.fa.gz
ln -s /usr/local/lib/R/site-library/basicBioinformaticsDRC2024/extdata/Danio_rerio.GRCz11.cdna.toy.fa.gz ./
```
It will take several seconds to build the toy index.
```{bash, eval=FALSE}
kallisto index -i danRer.GRCz11_transcrits.toy.idx Danio_rerio.GRCz11.cdna.toy.fa.gz
salmon index -i danRer.GRCz11_transcrits.toy.salmon.idx -t Danio_rerio.GRCz11.cdna.toy.fa.gz
```
It's time to quantify the FASTQ files against our kallisto and Salmon
indices. We run both for comparison; for real data, you can
pick one of them.
Because our FASTQ files contain single-end reads, we need to set `--single` for
kallisto, and we also need to supply the estimated fragment length and its
standard deviation.
For Salmon, we can set the library type to "A" (automatic detection).
In this example, we pass `-t 2` and `-p 2` so we can use up to two threads
in the bootstrapping, and we request 30 bootstrap rounds with `-b 30` and
`--numBootstraps 30`, respectively.
It will take several minutes for the sample data.
```{bash, eval=FALSE}
# the toy RNA-seq data are saved in the /home/rstudio/RNAseq folder
# create a link in the current folder
ln -s /home/rstudio/RNAseq/fastq ./
# mapping
# Ablated and Uninjured correspond to the FASTQ file names
mkdir -p kallisto_quant
mkdir -p salmon_quant
for rep in 1 2
do
for cond in Ablated Uninjured
do
kallisto quant -i danRer.GRCz11_transcrits.toy.idx \
-o kallisto_quant/$cond.rep$rep \
-b 30 -t 2 fastq/$cond.rep$rep.fastq.gz \
--single -l 200 -s 50
salmon quant -i danRer.GRCz11_transcrits.toy.salmon.idx -l A \
-r fastq/$cond.rep$rep.fastq.gz \
--validateMappings -p 2 \
-o salmon_quant/$cond.rep$rep \
--numBootstraps 30 --seqBias --gcBias
done
done
```
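Before moving to R, it can help to eyeball the quantification output. kallisto writes one `abundance.tsv` per sample with the columns `target_id`, `length`, `eff_length`, `est_counts`, and `tpm` (Salmon's `quant.sf` is analogous). The sketch below builds a tiny mock file with made-up values so the commands run anywhere; on the real output, point `awk` at `kallisto_quant/<sample>/abundance.tsv` instead.

```shell
## build a tiny mock abundance.tsv (made-up IDs and values, for illustration only)
printf 'target_id\tlength\teff_length\test_counts\ttpm\n' >  mock_abundance.tsv
printf 'ENSDART0000000001\t1500\t1301\t10\t3.2\n'         >> mock_abundance.tsv
printf 'ENSDART0000000002\t900\t701\t25\t8.1\n'           >> mock_abundance.tsv
## total estimated counts for the sample (column 4, skipping the header line)
awk 'NR > 1 { s += $4 } END { print s }' mock_abundance.tsv
# prints 35
```

A quick check like this can catch samples with unexpectedly low total counts before they reach the DE analysis.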
## R scripts for RNAseq
Here we give an example workflow for a differential expression (DE) analysis.
The following code should be run in the RStudio Console.
### prepare the transcript-to-gene mapping table
We want to perform gene-level DE analysis, so we need to aggregate the
transcript-level counts to the gene level. The following helper, borrowed from the
sleuth documentation, retrieves the Ensembl transcript-ID-to-gene-ID mapping.
```{r, eval=FALSE}
setwd("data/RNAseq")
library(biomaRt)
library(dplyr)
tx2gene <- function(species="hsapiens", ...){
mart <- biomaRt::useMart(biomart = "ensembl",
dataset = paste0(species, "_gene_ensembl"), ...)
t2g <- biomaRt::getBM(attributes = c("ensembl_transcript_id",
"ensembl_gene_id",
"external_gene_name"), mart = mart)
t2g <- dplyr::rename(t2g, target_id = ensembl_transcript_id,
ens_gene = ensembl_gene_id, ext_gene = external_gene_name)
return(t2g)
}
t2g <- tx2gene("drerio")
head(t2g, n=3)
```
If you have trouble downloading the data from Ensembl, load the pre-saved
object for the toy data.
```{r, eval=FALSE}
# t2g <- readRDS(url("https://raw.githubusercontent.com/jianhong/genomictools/master/inst/extdata/t2g.rds"))
t2g <- readRDS(system.file("extdata", "t2g.rds", package = "basicBioinformaticsDRC2024"))
head(t2g, n=3)
```
The following code uses tximport to import the transcript counts and aggregate
them to the gene level. Downstream, DESeq2 performs the gene-level DE analysis.
### run tximport + DESeq2 for Salmon results
```{r, eval=FALSE}
library(tximport)
library(DESeq2)
setwd("~/data/RNAseq") ## RNAseq is the folder containing salmon_quant and kallisto_quant
(salmon_files <- dir("salmon_quant", "sf$",
recursive = TRUE,
full.names = TRUE))
txi.salmon <- tximport(salmon_files, type = "salmon",
tx2gene = t2g, ignoreTxVersion=TRUE)
sampleTable <- data.frame(condition = sub(".rep.", "",
basename(dirname(salmon_files))))
rownames(sampleTable) <- colnames(txi.salmon$counts)
dds.salmon <- DESeqDataSetFromTximport(txi.salmon, sampleTable, ~condition)
dds.salmon <- DESeq(dds.salmon)
res.salmon <- results(dds.salmon)
res.salmon[!is.na(res.salmon$padj), ]
```
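The `condition` column above is derived by stripping the replicate suffix from the sample directory names via `sub(".rep.", "", ...)`. Purely as an illustration of that pattern (not a pipeline step), the same transformation in the shell looks like:

```shell
## strip the ".rep<N>" suffix to recover the condition label,
## mirroring sub(".rep.", "", basename(dirname(files))) in the R code
for s in Ablated.rep1 Ablated.rep2 Uninjured.rep1 Uninjured.rep2
do
  echo "$s" | sed -E 's/\.rep[0-9]+$//'
done
# prints Ablated, Ablated, Uninjured, Uninjured (one per line)
```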
### run tximport + DESeq2 for kallisto results
```{r, eval=FALSE}
(kallisto_files <- dir("kallisto_quant", "abundance.tsv",
recursive = TRUE, full.names = TRUE))
txi.kallisto <- tximport(kallisto_files, type = "kallisto",
tx2gene = t2g, ignoreTxVersion=TRUE)
sampleTable <- data.frame(condition = sub(".rep.", "",
basename(dirname(kallisto_files))))
rownames(sampleTable) <- colnames(txi.kallisto$counts)
dds.kallisto <- DESeqDataSetFromTximport(txi.kallisto,
sampleTable, ~condition)
dds.kallisto <- DESeq(dds.kallisto)
res.kallisto <- results(dds.kallisto)
res.kallisto[!is.na(res.kallisto$padj), ]
```
### run Sleuth
The following code shows how to use sleuth for DE analysis.
However, the sleuth package is no longer actively maintained and may be broken.
```{r, eval=FALSE}
#BiocManager::install("pachterlab/sleuth")
library(sleuth)
samples <- dir("kallisto_quant")
s2c <- data.frame(path=dir("kallisto_quant", full.names = TRUE),
sample=samples,
condition=sub(".rep.", "", samples))
s2c
so <- sleuth_prep(s2c, ~-1+condition, target_mapping = t2g)
so <- sleuth_fit(so)
## which_beta must be in the colname of design table.
so <- sleuth_wt(so, which_beta = "conditionAblated")
sleuth_live(so)
```
## scripts for ChIPseq
### prepare index file for bwa
Run the following code in the Terminal panel of RStudio.
The sample data, a zebrafish H3 ChIP-seq dataset, is located in the
"/home/rstudio/ChIPseq" folder.
We will map the reads with BWA (Burrows-Wheeler Aligner). Compared to Bowtie2,
BWA is a little slower but a bit more accurate, and it provides information on
which alignments are trustworthy. The bwa-mem mode is generally recommended
for high-quality queries and, unlike bwa-backtrack and bwa-sw, it is not limited
by read length. Aligning reads with bwa-mem takes two steps:
build the index, then do the alignment. BWA indexes the genome with an FM-index
based on the Burrows-Wheeler Transform to keep memory requirements low for
the alignment process.
```{bash, eval=FALSE}
## change your working directory
mkdir -p $HOME/data/ChIPseq
ln -s $HOME/ChIPseq/fastq $HOME/data/ChIPseq/
cd $HOME/data/ChIPseq
## download the zebrafish GRCz11 genome from Ensembl
wget ftp://ftp.ensembl.org/pub/release-105/fasta/danio_rerio/dna/Danio_rerio.GRCz11.dna.primary_assembly.fa.gz
## build the index
## -p: prefix for all index files
bwa index -p GRCz11 Danio_rerio.GRCz11.dna.primary_assembly.fa.gz
```
The data are only a subset of the whole genome. We will only use genes on
chromosomes 4, 13, 16, and 21 to speed up the test run.
```{bash, eval=FALSE}
# wget https://raw.githubusercontent.com/jianhong/genomictools/master/inst/extdata/Danio_rerio.GRCz11.dna.toy.fa.gz
ln -s /usr/local/lib/R/site-library/basicBioinformaticsDRC2024/extdata/Danio_rerio.GRCz11.dna.toy.fa.gz ./
## the following step will take about 3min
bwa index -p GRCz11.toy Danio_rerio.GRCz11.dna.toy.fa.gz
```
### Run FastQC, mapping and MACS3
We will run the scripts in a bash loop. The samples are from the heart tissue of
two uninjured and two ablated fish.
We use FastQC to check the quality of the raw reads. The results will be saved to
the "fastqc" folder; you can browse them by clicking the index.html file
in each sub-folder.
We will perform alignment on the single-end reads. For more information on BWA
and its functionality, please refer to the [user manual](http://bio-bwa.sourceforge.net/bwa.shtml).
[MACS3](https://github.com/macs3-project/MACS) will be used to call the peaks.
```{bash, eval=FALSE}
group=(Ablated Ablated Uninjured Uninjured)
tag=(Ablated.rep1 Ablated.rep2 Uninjured.rep1 Uninjured.rep2)
species=danRer11
prefix=bwa
for i in {0..3}
do
## fastQC
mkdir -p fastqc/${group[$i]}
fastqc -o fastqc/${group[$i]} -t 2 \
fastq/${tag[$i]}.fastq.gz
## trim adapters; requires trim_galore to be installed. We skip this step here.
# mkdir -p fastq.trimmed
# trim_galore -q 15 --fastqc -o fastq.trimmed/${group[$i]} fastq/${tag[$i]}.fastq.gz
## mapping by bwa
mkdir -p sam
## -t: number of threads
## -M: mark shorter split hits as secondary, this is optional for Picard compatibility.
## >: save alignment to a SAM file
## 2>: save standard error to log file
bwa mem -M -t 2 GRCz11.toy \
fastq/${tag[$i]}.fastq.gz \
> sam/$prefix.$species.${tag[$i]}.sam \
2> bwa.$prefix.$species.${tag[$i]}.log.txt
## convert sam file to bam and clean-up
mkdir -p bam
## -q: skip alignments with MAPQ smaller than 30.
samtools view -bhS -q 30 sam/$prefix.$species.${tag[$i]}.sam > bam/$prefix.$species.${tag[$i]}.bam
rm sam/$prefix.$species.${tag[$i]}.sam
## sort and index the bam file for quick access.
samtools sort bam/$prefix.$species.${tag[$i]}.bam -o bam/$prefix.$species.${tag[$i]}.srt.bam
samtools index bam/$prefix.$species.${tag[$i]}.srt.bam
## remove un-sorted bam file.
rm bam/$prefix.$species.${tag[$i]}.bam
## remove duplicated reads with picard MarkDuplicates.
mkdir -p bam/picard
picard MarkDuplicates \
INPUT=bam/$prefix.$species.${tag[$i]}.srt.bam \
OUTPUT=bam/$prefix.$species.${tag[$i]}.srt.markDup.bam \
METRICS_FILE=bam/picard/$prefix.$species.${tag[$i]}.srt.fil.picard_info.txt \
REMOVE_DUPLICATES=true ASSUME_SORTED=true VALIDATION_STRINGENCY=LENIENT
samtools index bam/$prefix.$species.${tag[$i]}.srt.markDup.bam
## use deeptools::bamCoverage to generate bigwig files
## the bw file can be viewed in IGV
mkdir -p bw
bamCoverage -b bam/$prefix.$species.${tag[$i]}.srt.markDup.bam -o bw/$prefix.$species.${tag[$i]}.bw --normalizeUsing CPM
## call peaks by macs2
mkdir -p macs3/${tag[$i]}
## -g: mappable genome size
## -q: use minimum FDR 0.05 cutoff to call significant regions.
## -B: ask MACS3 to output bedGraph files for experiment.
## --nomodel --extsize 150: the subset data are not big enough (<1000 peaks) for
## macs3 to build a shifting model, so we supply the fragment size manually.
## because we use the toy genome, we set the genome size (-g) to 10M.
macs3 callpeak -t bam/${prefix}.$species.${tag[$i]}.srt.markDup.bam \
-f BAM -g 10e6 -n ${prefix}.$species.${tag[$i]} \
--outdir macs3/${tag[$i]} -q 0.05 \
-B --nomodel --extsize 150
done
```
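MACS3 reports peaks in `*_peaks.narrowPeak` files, a BED6+4 format whose ten columns are chrom, start, end, name, score, strand, signalValue, -log10(p-value), -log10(q-value), and summit offset. The sketch below uses two made-up peak lines so it runs anywhere; on the real output, substitute the files under `macs3/<sample>/` for the mock file.

```shell
## two mock narrowPeak lines (made-up coordinates and scores, for illustration only)
printf 'chr4\t100\t600\tpeak_1\t150\t.\t5.1\t6.2\t4.3\t250\n'  >  mock.narrowPeak
printf 'chr13\t900\t1400\tpeak_2\t80\t.\t3.0\t2.1\t0.9\t210\n' >> mock.narrowPeak
## number of peaks in the file
wc -l < mock.narrowPeak
## keep peaks with -log10(q-value) > 2, i.e. q < 0.01 (column 9 holds -log10(q))
awk '$9 > 2' mock.narrowPeak | wc -l
```

Peak counts per sample are a useful first check that the replicates behave similarly before moving on to DiffBind.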
### Differential analysis
We will use the `DiffBind` package to perform the differential binding analysis.
```{r, eval=FALSE}
setwd("/home/rstudio/data/ChIPseq")
library("DiffBind")
(bams <- dir("bam", "markDup.bam$", full.names = TRUE))
(Peaks <- dir("macs3", "narrowPeak$", full.names = TRUE, recursive = TRUE))
(SampleID <- basename(dirname(Peaks)))
(Condition <- sub(".rep.", "", SampleID))
(Replicate <- sub("^.*rep", "", SampleID))
Peakcaller <- "macs3"
PeakFormat <- "narrowPeak"
samples <- data.frame(SampleID=SampleID,
Condition=Condition,
Replicate=Replicate,
bamReads=bams,
Peaks=Peaks,
Peakcaller=Peakcaller,
PeakFormat=PeakFormat,
ScoreCol=5)
pf <- "DiffBind"
out <- "sample.csv"
dir.create(pf)
write.csv(samples, file.path(pf, out))
chip <- dba(sampleSheet = file.path(pf, out))
pdf(file.path(pf, "DiffBind.sample.correlation.pdf"), width = 9, height = 9)
plot(chip)
dev.off()
pdf(file.path(pf, "DiffBind.PCA.plot.pdf"))
dba.plotPCA(chip, DBA_CONDITION, label=DBA_ID)
dev.off()
chip <- dba.count(chip)
chip <- dba.normalize(chip) ## add for DiffBind 3.0
BLACKLIST <- FALSE ## fish data, no blacklist yet.
chip <- dba.blacklist(chip, blacklist=BLACKLIST, greylist=FALSE)
chip <- dba.contrast(chip, minMembers=2, categories = DBA_CONDITION)
chip <- dba.analyze(chip, bBlacklist = FALSE, bGreylist = FALSE)
chip.DB <- dba.report(chip, th=1)
head(chip.DB)
write.csv(chip.DB, file.path(pf, "DiffBind.results.csv"), row.names=FALSE)
```