# Sequencing data analysis

### IMPORTANT: Please make sure that you are using the bash kernel to run this notebook.
#### (Do this at the beginning of every session) ###

### This notebook covers analysis of DNA sequencing data from raw files to processed signals.

Although this analysis is for ATAC-seq data, many of the steps (especially the first section) are the same for other types of DNA sequencing experiments.

We'll be doing the analysis in Bash, which is the standard language for UNIX command-line scripting.

The steps in the analysis pipeline that are covered in this notebook are indicated below:
![Sequencing Data Analysis 1](images/part1.png)

## Part 1: Setting up the data

We start with raw `.fastq.gz` files, which are provided by the sequencing instrument. For each DNA molecule (read) that was sequenced, they provide the nucleotide sequence and a quality score for each base call in that sequence.

In [1]:
### Set up variables storing the location of our data
### The proper way to load these variables is to define them in ~/.bashrc and source that file, but this is very slow in IPython
export SUNETID="$(whoami)"
export WORK_DIR="/srv/scratch/training_camp/work/${SUNETID}"
export DATA_DIR="${WORK_DIR}/data"
mkdir -p "${DATA_DIR}"
export SRC_DIR="${WORK_DIR}/src"
[[ ! -d ${WORK_DIR}/src ]] && mkdir -p "${WORK_DIR}/src"
export METADATA_DIR="/srv/scratch/training_camp/metadata"
export AGGREGATE_DATA_DIR="/srv/scratch/training_camp/data"
export AGGREGATE_ANALYSIS_DIR="/srv/scratch/training_camp/aggregate_analysis"
export YEAST_DIR="/srv/scratch/training_camp/saccer3"
export TMP="${WORK_DIR}/tmp"
export TEMP=$TMP
export TMPDIR=$TMP
[[ ! -d ${TMP} ]] && mkdir -p "${TMP}"





Now, let's check exactly which FASTQ files we have (we copied these from `$AGGREGATE_DATA_DIR` to your personal `$DATA_DIR` in the last tutorial):

(recall that the `ls` command lists the contents of a directory)

In [2]:
ls $DATA_DIR

atac.bds.20180908_171658_871.dag.js	  cln3-SCE-Rep2_R2_001.fastq.gz
atac.bds.20180908_171658_871.report.html  run.sh
atac.bds.20180908_172210_278.dag.js	  run.sh~
atac.bds.20180908_172210_278.report.html  whi5-cln3-SCE-Rep1_R1_001.fastq.gz
atac.bds.20180908_173819_316.dag.js	  whi5-cln3-SCE-Rep1_R2_001.fastq.gz
atac.bds.20180908_173819_316.report.html  whi5-cln3-SCE-Rep2_R1_001.fastq.gz
atac.bds.20180908_174431_590.dag.js	  whi5-cln3-SCE-Rep2_R2_001.fastq.gz
atac.bds.20180908_174431_590.report.html  whi5-SCE-Rep1_R1_001.fastq.gz
atac.bds.20180908_183721_333.dag.js	  whi5-SCE-Rep1_R2_001.fastq.gz
atac.bds.20180908_183721_333.report.html  whi5-SCE-Rep2_R1_001.fastq.gz
atac.bds.20180908_183721_438.dag.js	  whi5-SCE-Rep2_R2_001.fastq.gz
atac.bds.20180908_183721_438.report.html  WT-SCD-0_6MNaCl-Rep1_R1_001.fastq.gz
cln3-SCD-0_6MNaCl-Rep1_R1_001.fastq.gz	  WT-SCD-0_6MNaCl-Rep1_R2_001.fastq.gz
cln3-SCD-0_6MNaCl-Rep1_R2_001.fastq.gz	  WT-SCD-0_6MNaCl-Rep2_R1_001.fastq.gz
cln3-SCD

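As a sanity check, we can also look at the size and last-modified time of the fastqs by adding `-lrth` to the `ls` command (`-l` gives a long listing, `-t` sorts by modification time, `-r` reverses the order so the newest files come last, and `-h` prints human-readable sizes):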

In [3]:
ls -lrth $DATA_DIR

total 5.4G
-rwxrwxrwx 1 ubuntu ubuntu  81M Sep  8 17:03 WT-SCD-0_6MNaCl-Rep1_R1_001.fastq.gz
-rwxrwxrwx 1 ubuntu ubuntu  73M Sep  8 17:04 WT-SCD-0_6MNaCl-Rep1_R2_001.fastq.gz
-rwxrwxrwx 1 ubuntu ubuntu 159M Sep  8 17:04 WT-SCD-0_6MNaCl-Rep2_R1_001.fastq.gz
-rwxrwxrwx 1 ubuntu ubuntu 150M Sep  8 17:04 WT-SCD-0_6MNaCl-Rep2_R2_001.fastq.gz
-rwxrwxrwx 1 ubuntu ubuntu  32M Sep  8 17:04 WT-SCD-Rep1_R1_001.fastq.gz
-rwxrwxrwx 1 ubuntu ubuntu  28M Sep  8 17:04 WT-SCD-Rep1_R2_001.fastq.gz
-rwxrwxrwx 1 ubuntu ubuntu 168M Sep  8 17:04 WT-SCD-Rep2_R1_001.fastq.gz
-rwxrwxrwx 1 ubuntu ubuntu 157M Sep  8 17:04 WT-SCD-Rep2_R2_001.fastq.gz
-rwxrwxrwx 1 ubuntu ubuntu 203M Sep  8 17:04 WT-SCE-0_6MNaCl-Rep1_R1_001.fastq.gz
-rwxrwxrwx 1 ubuntu ubuntu 183M Sep  8 17:04 WT-SCE-0_6MNaCl-Rep1_R2_001.fastq.gz
-rwxrwxrwx 1 ubuntu ubuntu 212M Sep  8 17:04 WT-SCE-0_6MNaCl-Rep2_R1_001.fastq.gz
-rwxrwxrwx 1 ubuntu ubuntu 199M Sep  8 17:04 WT-SCE-0_6MNaCl-Rep2_R2_001.fastq.gz
-rwxrwxrwx 1 ubuntu ubuntu 9

Let's also inspect the format of one of the fastqs. Notice that each read takes up 4 lines:
1. the read name
2. the read's nucleotide sequence
3. a '+' separator line between the sequence and its quality string
4. a quality score for each base (a number encoded as a letter)
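
To make the "number encoded as a letter" idea concrete, here is a minimal sketch that decodes a quality character into its score. It assumes the Phred+33 encoding (score = ASCII code minus 33), which the notebook does not state explicitly; the helper name `phred33` is made up for this illustration.

```bash
# Convert one quality character to its Phred score.
# Assumption: the file uses Phred+33 encoding.
phred33() {
    local ascii
    ascii=$(printf '%d' "'$1")   # a leading quote makes printf emit the character code
    echo $(( ascii - 33 ))
}

# Decode a few quality characters.
for ch in '/' '6' '<' 'A' 'E'; do
    echo "${ch} -> Q$(phred33 "${ch}")"
done
```

Under this encoding, a score Q corresponds to an estimated base-call error probability of 10^(-Q/10), so higher letters mean more reliable bases.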

In [4]:
zcat $(ls $DATA_DIR/*gz | head -n 1) | head -n 8

@NS500418:691:HTFJ7AFXX:1:11101:13781:5936 1:N:0:CGAGGCTG+GCGATCTA
CTTATACACATCTCCGAGCCCACGAGACCGAGGCTGATCTCGTATGCCGTCTTCTGCTGGAAAAAAAAAAGGGGGG
+
AAAAAAEAEEEEEAEE//EEEEEE/EEEEE/EEEEE6E</EA/EEEEEEE/AEAEE<E/E/EEEE6E6///EEEEE
@NS500418:691:HTFJ7AFXX:1:11101:23098:6041 1:N:0:CGAGGCTG+GCGATCTA
CCCAGATATTGGGCGACAGCCAGGTTTTCAGCCAGACGACAGGCGAACTTTTGTTGACCTCAACGCGCACCTCCGT
+
AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEAEEEEEEAAEEEEEEEEEAEEEEEEEEEEEEEAEEEAA

gzip: stdout: Broken pipe
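
Because every read occupies exactly four lines, we can count the reads in a FASTQ file by dividing its line count by four. The sketch below does this on a tiny made-up file so that it runs on its own; the file name and its contents are hypothetical, and on the real data you would point `toy_fastq` at one of the files in `$DATA_DIR`.

```bash
# Build a tiny, made-up gzipped FASTQ containing two reads.
toy_dir=$(mktemp -d)
toy_fastq="${toy_dir}/toy.fastq.gz"
printf '@read1\nACGT\n+\nAAAA\n@read2\nTTGA\n+\nEEEE\n' | gzip -c > "${toy_fastq}"

# Each read spans 4 lines, so reads = lines / 4.
n_lines=$(gzip -dc "${toy_fastq}" | wc -l)
n_reads=$(( n_lines / 4 ))
echo "${n_reads} reads"
```

For paired-end data, the R1 and R2 files of a sample should contain the same number of reads, which makes this a quick consistency check.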


## Part 2: ATAC-seq data processing

The ENCODE consortium (https://www.encodeproject.org/) uses a standard ATAC-seq data processing pipeline, which can be downloaded here: https://github.com/ENCODE-DCC/atac-seq-pipeline

This pipeline is pre-installed on this computer and can be executed by running the **atac.bds** script. 



In [5]:
#/opt/atac_dnase_pipelines/atac.bds --help




Though the pipeline is highly customizable and the many options might seem a bit confusing at first, do not worry: for our purposes, the default settings will suffice. You will run the pipeline on your two experiments. Fill in the names of the FASTQ files corresponding to your two experiments below, as well as the name of the output directory to store the processed data.

In [2]:
#You can find the experiment names in the file $METADATA_DIR/TC2018_samples.tsv.
#Look under the column labeled "ID"
#example: 

#export experiment1="WT_ethanol_1"
#export experiment2="asf1_ethanol_1"

export experiment1="WT-SCD-Rep1"
export experiment2="WT-SCD-Rep2"

#Create directories to store outputs from the pipeline
#We will store the outputs in the $WORK_DIR
export outdir1="${WORK_DIR}/${experiment1}_out"
export outdir2="${WORK_DIR}/${experiment2}_out"
#mkdir -p does not complain if the directory already exists
mkdir -p "$outdir1"
mkdir -p "$outdir2"


Now, kick off the pipeline! 

In [15]:
#first experiment:
echo "bds_scr $experiment1 $experiment1.log atac.bds -out_dir $outdir1 -species saccer3 -fastq1_1 $DATA_DIR/${experiment1}_R1_001.fastq.gz -fastq1_2 $DATA_DIR/${experiment1}_R2_001.fastq.gz -nth 4"
bds_scr $experiment1 $outdir1/log.txt /opt/atac_dnase_pipelines/atac.bds -out_dir $outdir1 -species saccer3 -fastq1_1 $DATA_DIR/${experiment1}_R1_001.fastq.gz -fastq1_2 $DATA_DIR/${experiment1}_R2_001.fastq.gz -nth 4

bds_scr WT-SCD-Rep1 WT-SCD-Rep1.log atac.bds -out_dir /srv/scratch/training_camp/work/ubuntu/WT-SCD-Rep1_out -species saccer3 -fastq1_1 /srv/scratch/training_camp/work/ubuntu/data/WT-SCD-Rep1_R1_001.fastq.gz -fastq1_2 /srv/scratch/training_camp/work/ubuntu/data/WT-SCD-Rep1_R2_001.fastq.gz -nth 4
[SCR_NAME] : WT-SCD-Rep1.BDS
[HOST] : ip-172-31-26-41.us-west-1.compute.internal
[LOG_FILE_NAME] : /srv/scratch/training_camp/work/ubuntu/WT-SCD-Rep1_out/log.txt
[BDS_PARAM] :  /opt/atac_dnase_pipelines/atac.bds -out_dir /srv/scratch/training_camp/work/ubuntu/WT-SCD-Rep1_out -species saccer3 -fastq1_1 /srv/scratch/training_camp/work/ubuntu/data/WT-SCD-Rep1_R1_001.fastq.gz -fastq1_2 /srv/scratch/training_camp/work/ubuntu/data/WT-SCD-Rep1_R2_001.fastq.gz -nth 4



In [16]:
#second experiment:
echo "bds_scr $experiment2 $experiment2.log atac.bds -out_dir $outdir2 -species saccer3 -fastq1_1 $DATA_DIR/${experiment2}_R1_001.fastq.gz -fastq1_2 $DATA_DIR/${experiment2}_R2_001.fastq.gz -nth 4"
bds_scr $experiment2 $outdir2/log.txt /opt/atac_dnase_pipelines/atac.bds -out_dir $outdir2 -species saccer3 -fastq1_1 $DATA_DIR/${experiment2}_R1_001.fastq.gz -fastq1_2 $DATA_DIR/${experiment2}_R2_001.fastq.gz -nth 4

bds_scr WT-SCD-Rep2 WT-SCD-Rep2.log atac.bds -out_dir /srv/scratch/training_camp/work/ubuntu/WT-SCD-Rep2_out -species saccer3 -fastq1_1 /srv/scratch/training_camp/work/ubuntu/data/WT-SCD-Rep2_R1_001.fastq.gz -fastq1_2 /srv/scratch/training_camp/work/ubuntu/data/WT-SCD-Rep2_R2_001.fastq.gz -nth 4
[SCR_NAME] : WT-SCD-Rep2.BDS
[HOST] : ip-172-31-26-41.us-west-1.compute.internal
[LOG_FILE_NAME] : /srv/scratch/training_camp/work/ubuntu/WT-SCD-Rep2_out/log.txt
[BDS_PARAM] :  /opt/atac_dnase_pipelines/atac.bds -out_dir /srv/scratch/training_camp/work/ubuntu/WT-SCD-Rep2_out -species saccer3 -fastq1_1 /srv/scratch/training_camp/work/ubuntu/data/WT-SCD-Rep2_R1_001.fastq.gz -fastq1_2 /srv/scratch/training_camp/work/ubuntu/data/WT-SCD-Rep2_R2_001.fastq.gz -nth 4



The pipeline may run for an hour or so; meanwhile, we will learn more about what it's doing under the hood.
If you want to check on the progress, you can examine the latest entries in the log file generated by the pipeline with the *tail* command. The log file for each run is given by the [LOG_FILE_NAME] entry above.


In [10]:
tail $outdir1/log.txt

00:09:05.526	Wait: Waiting for task to finish: atac.bds.20180908_052955_292/task.callpeak_macs2_atac.macs2_n_s_rep1_pr1.line_74.id_30, state: FINISHED
00:09:05.526	Wait: Task 'atac.bds.20180908_052955_292/task.callpeak_macs2_atac.macs2_n_s_rep1_pr1.line_74.id_30' finished.
00:09:05.526	Wait: Waiting for task to finish: atac.bds.20180908_052955_292/task.callpeak_idr.FRiP_rep1_pr.line_159.id_33, state: FINISHED
00:09:05.526	Wait: Task 'atac.bds.20180908_052955_292/task.callpeak_idr.FRiP_rep1_pr.line_159.id_33' finished.
00:09:05.526	Waiting for all 'parrallel' to finish.
00:09:05.526	Waiting for parallel 'atac.bds.20180908_052955_292_parallel_29' to finish. RunState: FINISHED
00:09:05.526	Writing report file 'atac.bds.20180908_052955_292.report.html'
00:09:05.556	Program 'atac.bds.20180908_052955_292' finished, exit value: 0, tasks executed: 26, tasks failed: 0, tasks failed names: .
00:09:05.560	Finished. Exit code: 0
00:09:05.560	ExecutionerLocal 'Local[30]': Killed


In [11]:
tail  $outdir2/log.txt

00:25:12.050	Wait: Waiting for task to finish: atac.bds.20180908_053005_965/task.callpeak_macs2_atac.macs2_n_s_rep1.line_74.id_29, state: FINISHED
00:25:12.050	Wait: Task 'atac.bds.20180908_053005_965/task.callpeak_macs2_atac.macs2_n_s_rep1.line_74.id_29' finished.
00:25:12.050	Wait: Waiting for task to finish: atac.bds.20180908_053005_965/task.graphviz.report.line_98.id_54, state: FINISHED
00:25:12.050	Wait: Task 'atac.bds.20180908_053005_965/task.graphviz.report.line_98.id_54' finished.
00:25:12.050	Waiting for all 'parrallel' to finish.
00:25:12.050	Waiting for parallel 'atac.bds.20180908_053005_965_parallel_29' to finish. RunState: FINISHED
00:25:12.050	Writing report file 'atac.bds.20180908_053005_965.report.html'
00:25:12.060	Program 'atac.bds.20180908_053005_965' finished, exit value: 0, tasks executed: 26, tasks failed: 0, tasks failed names: .
00:25:12.060	Finished. Exit code: 0
00:25:12.060	ExecutionerLocal 'Local[30]': Killed
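
Rather than reading the tail of each log by eye, we can search it for the completion message. The sketch below is a minimal example under two assumptions: the helper name `pipeline_finished_ok` is made up, and the pattern it looks for is simply the `Finished. Exit code: 0` line seen in the log tails above. It runs on a small made-up log so that it is self-contained.

```bash
# Hypothetical helper: succeed if the last lines of a log record a clean finish.
pipeline_finished_ok() {
    # -F treats the pattern as a fixed string, -q suppresses output.
    tail -n 5 "$1" | grep -qF 'Finished. Exit code: 0'
}

# Toy log (made-up content) standing in for $outdir1/log.txt.
toy_log=$(mktemp)
printf '00:00:01.000\tWriting report file\n00:00:02.000\tFinished. Exit code: 0\n00:00:02.000\tExecutionerLocal: Killed\n' > "${toy_log}"

if pipeline_finished_ok "${toy_log}"; then
    status="done"
else
    status="still running or failed"
fi
echo "${status}"
```

On the real runs you would call `pipeline_finished_ok "$outdir1/log.txt"` and the same for `$outdir2`.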


## Part 3: Examining the pipeline output

The pipeline consists of multiple modules, with output files that include the following: 

```
out                               # root dir. of outputs
│
├ *report.html                    #  HTML report
├ *tracks.json                    #  Tracks datahub (JSON) for WashU browser
├ ENCODE_summary.json             #  Metadata of all datafiles and QC results
│
├ align                           #  mapped alignments
│ ├ rep1                          #   for true replicate 1 
│ │ ├ *.trim.fastq.gz             #    adapter-trimmed fastq
│ │ ├ *.bam                       #    raw bam
│ │ ├ *.nodup.bam (E)             #    filtered and deduped bam
│ │ ├ *.tagAlign.gz               #    tagAlign (bed6) generated from filtered bam
│ │ ├ *.tn5.tagAlign.gz           #    TN5 shifted tagAlign for ATAC pipeline (not for DNase pipeline)
│ │ └ *.*M.tagAlign.gz            #    subsampled tagAlign for cross-corr. analysis
│ ├ rep2                          #   for true replicate 2
│ ...
│ ├ pooled_rep                    #   for pooled replicate
│ ├ pseudo_reps                   #   for self pseudo replicates
│ │ ├ rep1                        #    for replicate 1
│ │ │ ├ pr1                       #     for self pseudo replicate 1 of replicate 1
│ │ │ ├ pr2                       #     for self pseudo replicate 2 of replicate 1
│ │ ├ rep2                        #    for replicate 2
│ │ ...                           
│ └ pooled_pseudo_reps            #   for pooled pseudo replicates
│   ├ ppr1                        #    for pooled pseudo replicate 1 (rep1-pr1 + rep2-pr1 + ...)
│   └ ppr2                        #    for pooled pseudo replicate 2 (rep1-pr2 + rep2-pr2 + ...)
│
├ peak                             #  peaks called
│ └ macs2                          #   peaks generated by MACS2
│   ├ rep1                         #    for replicate 1
│   │ ├ *.narrowPeak.gz            #     narrowPeak (p-val threshold = 0.01)
│   │ ├ *.filt.narrowPeak.gz (E)   #     blacklist filtered narrowPeak 
│   │ ├ *.narrowPeak.bb (E)        #     narrowPeak bigBed
│   │ ├ *.narrowPeak.hammock.gz    #     narrowPeak track for WashU browser
│   │ ├ *.pval0.1.narrowPeak.gz    #     narrowPeak (p-val threshold = 0.1)
│   │ └ *.pval0.1.*K.narrowPeak.gz #     narrowPeak (p-val threshold = 0.1) with top *K peaks
│   ├ rep2                         #    for replicate 2
│   ...
│   ├ pseudo_reps                          #   for self pseudo replicates
│   ├ pooled_pseudo_reps                   #   for pooled pseudo replicates
│   ├ overlap                              #   naive-overlapped peaks
│   │ ├ *.naive_overlap.narrowPeak.gz      #     naive-overlapped peak
│   │ └ *.naive_overlap.filt.narrowPeak.gz #     naive-overlapped peak after blacklist filtering
│   └ idr                           #   IDR thresholded peaks
│     ├ true_reps                   #    for replicate 1
│     │ ├ *.narrowPeak.gz           #     IDR thresholded narrowPeak
│     │ ├ *.filt.narrowPeak.gz (E)  #     IDR thresholded narrowPeak (blacklist filtered)
│     │ └ *.12-col.bed.gz           #     IDR thresholded narrowPeak track for WashU browser
│     ├ pseudo_reps                 #    for self pseudo replicates
│     │ ├ rep1                      #    for replicate 1
│     │ ...
│     ├ optimal_set                 #    optimal IDR thresholded peaks
│     │ └ *.filt.narrowPeak.gz (E)  #     IDR thresholded narrowPeak (blacklist filtered)
│     ├ conservative_set            #    conservative IDR thresholded peaks
│     │ └ *.filt.narrowPeak.gz (E)  #     IDR thresholded narrowPeak (blacklist filtered)
│     ├ pseudo_reps                 #    for self pseudo replicates
│     └ pooled_pseudo_reps          #    for pooled pseudo replicate
│
│   
│ 
├ qc                              #  QC logs
│ ├ *IDR_final.qc                 #   Final IDR QC
│ ├ rep1                          #   for true replicate 1
│ │ ├ *.align.log                 #    Bowtie2 mapping stat log
│ │ ├ *.dup.qc                    #    Picard (or sambamba) MarkDuplicate QC log
│ │ ├ *.pbc.qc                    #    PBC QC
│ │ ├ *.nodup.flagstat.qc         #    Flagstat QC for filtered bam
│ │ ├ *M.cc.qc                    #    Cross-correlation analysis score for tagAlign
│ │ ├ *M.cc.plot.pdf/png          #    Cross-correlation analysis plot for tagAlign
│ │ └ *_qc.html/txt               #    ATAQC report
│ ...
│
├ signal                          #  signal tracks
│ ├ macs2                         #   signal tracks generated by MACS2
│ │ ├ rep1                        #    for true replicate 1 
│ │ │ ├ *.pval.signal.bigwig (E)  #     signal track for p-val
│ │ │ └ *.fc.signal.bigwig   (E)  #     signal track for fold change
│ ...
│ └ pooled_rep                    #   for pooled replicate
│ 
├ report                          # files for HTML report
└ meta                            # text files containing md5sum of output files and other metadata
```

Let's examine how well the reads aligned to the reference saccer3 genome. We'd like to see an overall alignment rate of at least 90%.

In [12]:
cat $outdir1/qc/rep1/*align.log

601654 reads; of these:
  601654 (100.00%) were paired; of these:
    157957 (26.25%) aligned concordantly 0 times
    275151 (45.73%) aligned concordantly exactly 1 time
    168546 (28.01%) aligned concordantly >1 times
    ----
    157957 pairs aligned concordantly 0 times; of these:
      60787 (38.48%) aligned discordantly 1 time
    ----
    97170 pairs aligned 0 times concordantly or discordantly; of these:
      194340 mates make up the pairs; of these:
        106902 (55.01%) aligned 0 times
        948 (0.49%) aligned exactly 1 time
        86490 (44.50%) aligned >1 times
91.12% overall alignment rate


In [13]:
cat $outdir2/qc/rep1/*align.log

3277012 reads; of these:
  3277012 (100.00%) were paired; of these:
    664721 (20.28%) aligned concordantly 0 times
    1652322 (50.42%) aligned concordantly exactly 1 time
    959969 (29.29%) aligned concordantly >1 times
    ----
    664721 pairs aligned concordantly 0 times; of these:
      259947 (39.11%) aligned discordantly 1 time
    ----
    404774 pairs aligned 0 times concordantly or discordantly; of these:
      809548 mates make up the pairs; of these:
        350719 (43.32%) aligned 0 times
        6069 (0.75%) aligned exactly 1 time
        452760 (55.93%) aligned >1 times
94.65% overall alignment rate
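
The 90% check can also be automated. The sketch below pulls the percentage out of the summary line of an alignment log and compares it to the threshold. It uses a made-up two-line log so that it runs on its own; the variable names are hypothetical, and on real data you would read `$outdir1/qc/rep1/*align.log` instead.

```bash
# Toy alignment log (abbreviated) with the same summary line format as above.
toy_align_log=$(mktemp)
printf '601654 reads; of these:\n91.12%% overall alignment rate\n' > "${toy_align_log}"

# Take the first field of the summary line and strip the trailing '%'.
rate=$(awk '/overall alignment rate/ { sub(/%/, "", $1); print $1 }' "${toy_align_log}")

# Let awk do the comparison, since shell arithmetic is integer-only.
verdict=$(awk -v r="${rate}" 'BEGIN { if (r >= 90) print "PASS"; else print "FAIL" }')
echo "${rate}% -> ${verdict}"
```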


Now, let's examine how many peaks were called for each sample. We use the *zcat* command to print the contents of a gzip-compressed file and `wc -l` to count its lines (one peak per line).

In [14]:
zcat $outdir1/peak/macs2/overlap/optimal_set/*narrowPeak.gz | wc -l 

1520


In [15]:
zcat $outdir2/peak/macs2/overlap/optimal_set/*narrowPeak.gz | wc -l 

3641


## Part 4: Creating a merged peak set across all samples for downstream analysis 

Finally, we merge the peaks across all conditions to create a master list of peaks for analysis. To do this, we concatenate the optimal-set peaks from all experiments, sort them, and merge them.

We take the output of the processing pipeline from the `$AGGREGATE_ANALYSIS_DIR` directory. This is the same analysis you performed above, but gathered in one location for all experiments conducted.

In [6]:
cd $WORK_DIR



In [7]:

#Use the "find" command to identify all optimal-set narrowPeak output files and write their paths to a file.
find $AGGREGATE_ANALYSIS_DIR  -wholename "*peak/macs2/overlap/optimal_set/*narrowPeak.gz" > narrowPeak_files.txt

#sanity check the file 
head narrowPeak_files.txt


/srv/scratch/training_camp/aggregate_analysis/whi5-SCE-Rep1/peak/macs2/overlap/optimal_set/whi5-SCE-Rep1_rep1-pr.naive_overlap.narrowPeak.gz
/srv/scratch/training_camp/aggregate_analysis/WT-SCD-Rep2_out/peak/macs2/overlap/optimal_set/WT-SCD-Rep2_out_rep1-pr.naive_overlap.narrowPeak.gz
/srv/scratch/training_camp/aggregate_analysis/cln3-SCD-Rep2/peak/macs2/overlap/optimal_set/cln3-SCD-Rep2_rep1-pr.naive_overlap.narrowPeak.gz
/srv/scratch/training_camp/aggregate_analysis/cln3-SCE-Rep2/peak/macs2/overlap/optimal_set/cln3-SCE-Rep2_rep1-pr.naive_overlap.narrowPeak.gz
/srv/scratch/training_camp/aggregate_analysis/whi5-cln3-SCE-Rep2/peak/macs2/overlap/optimal_set/whi5-cln3-SCE-Rep2_rep1-pr.naive_overlap.narrowPeak.gz
/srv/scratch/training_camp/aggregate_analysis/WT-SCE-0_6MNaCl-Rep1/peak/macs2/overlap/optimal_set/WT-SCE-0_6MNaCl-Rep1_rep1-pr.naive_overlap.narrowPeak.gz
/srv/scratch/training_camp/aggregate_analysis/cln3-SCD-Rep1/peak/macs2/overlap/optimal_set/cln3-SCD-Rep1_rep1-pr.naive_o

In [8]:
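#Now, iterate through the list of narrowPeak files and concatenate them into a single master peak list.
#Redirecting the whole loop with ">" overwrites all.peaks.bed, so re-running this cell does not duplicate peaks.
for f in $(cat narrowPeak_files.txt)
do
    zcat "$f"
done > all.peaks.bed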

#sanity check the all.peaks.bed file 
head all.peaks.bed


chrI	108652	109044	Peak_3176	94	.	2.16203	9.42043	8.13851	129
chrI	108652	109044	Peak_4811	22	.	1.43200	2.27093	1.31249	377
chrI	113337	114374	Peak_1054	369	.	3.76250	36.94333	35.13755	145
chrI	113337	114374	Peak_3483	73	.	1.99356	7.39529	6.17374	708
chrI	113337	114374	Peak_3694	61	.	1.88125	6.16174	4.98351	411
chrI	114606	114837	Peak_1041	375	.	3.79058	37.53549	35.72075	111
chrI	129068	129274	Peak_2465	148	.	2.55513	14.88796	13.47084	84
chrI	130158	130752	Peak_3177	94	.	2.16203	9.42043	8.13851	228
chrI	130158	130752	Peak_3281	87	.	2.10588	8.72283	7.46047	424
chrI	138986	139388	Peak_1672	241	.	3.01769	24.11693	22.52225	239


In [9]:
#sort the concatenated file 
bedtools sort -i all.peaks.bed > all.peaks.sorted.bed 

head all.peaks.sorted.bed 

chrI	0	638	Peak_557	168	.	8.54555	16.87634	14.86262	203
chrI	0	638	Peak_740	112	.	6.51090	11.29750	9.43357	515
chrI	12	638	Peak_1260	184	.	3.49583	18.41892	16.80362	169
chrI	12	638	Peak_1118	213	.	3.74915	21.35185	19.68702	495
chrI	18	779	Peak_941	704	.	4.31909	70.49263	68.69950	179
chrI	18	779	Peak_811	803	.	4.60832	80.38562	78.52775	481
chrI	33	649	Peak_1500	424	.	3.67656	42.43305	40.86024	156
chrI	33	649	Peak_1170	576	.	4.25102	57.68048	55.99532	471
chrI	37	779	Peak_2004	418	.	3.60449	41.88854	40.40920	169
chrI	37	779	Peak_1423	633	.	4.38323	63.32488	61.68840	464


In [10]:
#merge the sorted, concatenated file to join overlapping peaks
bedtools merge -i all.peaks.sorted.bed > all_merged.peaks.bed 

head all_merged.peaks.bed

chrI	0	781
chrI	6332	6549
chrI	9138	9609
chrI	20611	21197
chrI	28155	29092
chrI	29173	30197
chrI	31527	31972
chrI	32456	36256
chrI	39017	39243
chrI	42035	42993
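
To see what the sort and merge steps accomplish, here is a minimal stand-in written with `sort` and `awk` on a made-up set of intervals. It is only an illustration of the idea (sweep through sorted intervals and extend the current one while the next interval starts before it ends), not how bedtools itself is implemented, and the toy coordinates are hypothetical.

```bash
# Toy, unsorted BED intervals (made-up coordinates).
toy_bed=$(mktemp)
printf 'chrI\t12\t638\nchrI\t0\t638\nchrI\t700\t781\nchrI\t600\t710\nchrII\t5\t50\n' > "${toy_bed}"

# Sort by chromosome, then numerically by start, then merge in one pass.
merged=$(sort -k1,1 -k2,2n "${toy_bed}" | awk -v OFS='\t' '
    # Start a new interval on a new chromosome or when there is a gap.
    NR == 1 || $1 != chrom || $2 + 0 > end {
        if (NR > 1) print chrom, start, end
        chrom = $1; start = $2 + 0; end = $3 + 0
        next
    }
    # Otherwise the interval overlaps the current one: extend it if needed.
    $3 + 0 > end { end = $3 + 0 }
    END { if (NR > 0) print chrom, start, end }
')
echo "${merged}"
```

The four chrI intervals chain together into a single region from 0 to 781, while the chrII interval is kept separate.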


In [11]:
#Finally, we use the awk command to add row numbers to the merged peak file, such that each peak has a unique identifier. 

#We cannot do this 'in place', so we use an intermediate output file 
awk  -v OFS='\t' '{print $0,NR}' all_merged.peaks.bed > o.tmp
mv o.tmp all_merged.peaks.bed

head all_merged.peaks.bed

chrI	0	781	1
chrI	6332	6549	2
chrI	9138	9609	3
chrI	20611	21197	4
chrI	28155	29092	5
chrI	29173	30197	6
chrI	31527	31972	7
chrI	32456	36256	8
chrI	39017	39243	9
chrI	42035	42993	10


## Part 5: Creating read count and fold change matrices

We would like to calculate the signal strength in each sample at the genomic regions in **all_merged.peaks.bed**. The ATAC-seq pipeline generates genome-wide fold change signal tracks for each sample that can be used for this calculation:

In [12]:
ls $outdir1/signal/macs2/rep1/
echo ""
ls $outdir2/signal/macs2/rep1/

ls: cannot access /signal/macs2/rep1/: No such file or directory

ls: cannot access /signal/macs2/rep1/: No such file or directory


We use the **bigWigAverageOverBed** utility to compute the mean signal for each genomic region in each sample. Below we do this for the fold change tracks; the p-value tracks can be processed the same way.

In [13]:
bigWigAverageOverBed

bigWigAverageOverBed v2 - Compute average score of big wig over each bed, which may have introns.
usage:
   bigWigAverageOverBed in.bw in.bed out.tab
The output columns are:
   name - name field from bed, which should be unique
   size - size of bed (sum of exon sizes
   covered - # bases within exons covered by bigWig
   sum - sum of values over all bases covered
   mean0 - average over bases with non-covered bases counting as zeroes
   mean - average over just covered bases
Options:
   -stats=stats.ra - Output a collection of overall statistics to stat.ra file
   -bedOut=out.bed - Make output bed that is echo of input bed but with mean column appended
   -sampleAroundCenter=N - Take sample at region N bases wide centered around bed item, rather
                     than the usual sample in the bed item.
   -minMax - include two additional columns containing the min and max observed in the area.
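
The difference between the `mean0` and `mean` columns matters when a region is only partly covered by signal. Following the column descriptions in the usage text above, `mean0` divides the summed signal by the full region size, while `mean` divides it by the covered bases only. A small worked example with made-up numbers:

```bash
# One made-up row with the first four output columns: name, size, covered, sum.
row=$(printf 'peak1\t100\t80\t160')

# mean0 = sum / size (uncovered bases count as zero).
mean0=$(echo "${row}" | awk -F'\t' '{ printf "%.2f", $4 / $2 }')
# mean = sum / covered (uncovered bases are ignored).
mean=$(echo "${row}" | awk -F'\t' '{ printf "%.2f", $4 / $3 }')

echo "mean0=${mean0} mean=${mean}"
```

Here `mean0` is 1.60 and `mean` is 2.00. In the output table `mean0` is the fifth column, which is the one extracted with `cut -f5` further down.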



In [15]:
#First, we find all the fold change bigWig files
cd $WORK_DIR
find $AGGREGATE_ANALYSIS_DIR  -name "*fc*bigwig" > all.fc.bigwig
head all.fc.bigwig


/srv/scratch/training_camp/aggregate_analysis/whi5-SCE-Rep1/signal/macs2/rep1/whi5-SCE-Rep1_R1_001.PE2SE.nodup.tn5.pf.fc.signal.bigwig
/srv/scratch/training_camp/aggregate_analysis/WT-SCD-Rep2_out/signal/macs2/rep1/WT-SCD-Rep2_R1_001.PE2SE.nodup.tn5.pf.fc.signal.bigwig
/srv/scratch/training_camp/aggregate_analysis/cln3-SCD-Rep2/signal/macs2/rep1/cln3-SCD-Rep2_R1_001.PE2SE.nodup.tn5.pf.fc.signal.bigwig
/srv/scratch/training_camp/aggregate_analysis/cln3-SCE-Rep2/signal/macs2/rep1/cln3-SCE-Rep2_R1_001.PE2SE.nodup.tn5.pf.fc.signal.bigwig
/srv/scratch/training_camp/aggregate_analysis/whi5-cln3-SCE-Rep2/signal/macs2/rep1/whi5-cln3-SCE-Rep2_R1_001.PE2SE.nodup.tn5.pf.fc.signal.bigwig
/srv/scratch/training_camp/aggregate_analysis/WT-SCE-0_6MNaCl-Rep1/signal/macs2/rep1/WT-SCE-0_6MNaCl-Rep1_R1_001.PE2SE.nodup.tn5.pf.fc.signal.bigwig
/srv/scratch/training_camp/aggregate_analysis/cln3-SCD-Rep1/signal/macs2/rep1/cln3-SCD-Rep1_R1_001.PE2SE.nodup.tn5.pf.fc.signal.bigwig
/srv/scratch/training_ca

In [16]:
#Iterate through all bigWig fold change tracks to compute mean signal strength at each genomic region 
for f in `cat all.fc.bigwig`
do

    #we extract the part of the filename that corresponds to the sample name and write it as the header in the fc.signal file
    sample_name=`basename $f | awk -F'[.]' '{print $1}'`
    echo "$sample_name"
    echo $sample_name > $sample_name.fc.signal.tmp 
    
    
    bigWigAverageOverBed $f all_merged.peaks.bed $sample_name.fc.signal.data.tmp 
    cut -f5 $sample_name.fc.signal.data.tmp >> $sample_name.fc.signal.tmp

    #cleanup the intermediate file 
    rm $sample_name.fc.signal.data.tmp 
done
paste *fc.signal.tmp > all.fc.txt
#cleanup intermediate files that were generated 
rm *.tmp

#examine the output 
head all.fc.txt

whi5-SCE-Rep1_R1_001
processing chromosomes................
WT-SCD-Rep2_R1_001
processing chromosomes................
cln3-SCD-Rep2_R1_001
processing chromosomes................
cln3-SCE-Rep2_R1_001
processing chromosomes................
whi5-cln3-SCE-Rep2_R1_001
processing chromosomes................
WT-SCE-0_6MNaCl-Rep1_R1_001
processing chromosomes................
cln3-SCD-Rep1_R1_001
processing chromosomes................
WT-SCD-0_6MNaCl-Rep2_R1_001
processing chromosomes................
cln3-SCE-0_6MNaCl-Rep2_R1_001
processing chromosomes................
WT-SCE-Rep2_R1_001
processing chromosomes................
whi5-SCE-Rep2_R1_001
processing chromosomes................
whi5-cln3-SCE-Rep1_R1_001
processing chromosomes................
WT-SCE-Rep1_R1_001
processing chromosomes................
WT-SCE-0_6MNaCl-Rep2_R1_001
processing chromosomes................
WT-SCD-0_6MNaCl-Rep1_R1_001
processing chromosomes................
cln3-SCE-Rep1_R1_001
process
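
The matrix is assembled by `paste`, which glues files together column by column. One detail worth knowing is that the shell expands the `*fc.signal.tmp` glob in sorted order, so the columns of the final matrix follow the sorted file names rather than the order in which the samples were processed. A minimal sketch with two made-up samples:

```bash
work=$(mktemp -d)

# Two hypothetical per-sample files: a header line, then one value per peak.
# They are written in the "wrong" order on purpose.
printf 'sampleB\n1.5\n2.5\n' > "${work}/sampleB.fc.signal.tmp"
printf 'sampleA\n0.5\n3.0\n' > "${work}/sampleA.fc.signal.tmp"

# The glob expands in sorted order, so sampleA becomes the first column.
matrix=$(cd "${work}" && paste *fc.signal.tmp)
echo "${matrix}"
```

Because each file carries its own header line, the header row of the matrix always matches its columns regardless of that ordering.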

In addition to the fold change data matrix, we would also like to know the number of reads that pile up at each peak region. This is useful for determining differential chromatin accessibility across samples. 
To calculate the read count matrix, we will use the **bedtools coverage** command on the *tagAlign* files generated by the processing pipeline. 

In [18]:
#First, we find all the tagAlign files
cd $WORK_DIR
find $AGGREGATE_ANALYSIS_DIR  -name "*nodup.tn5.no_chrM.25M.R1.tagAlign*" > all.tagAlign.files.txt

head all.tagAlign.files.txt

/srv/scratch/training_camp/aggregate_analysis/whi5-SCE-Rep1/align/rep1/whi5-SCE-Rep1_R1_001.PE2SE.nodup.tn5.no_chrM.25M.R1.tagAlign.gz
/srv/scratch/training_camp/aggregate_analysis/cln3-SCD-0_6MNaCl-Rep2/align/rep1/cln3-SCD-0_6MNaCl-Rep2_R1_001.PE2SE.nodup.tn5.no_chrM.25M.R1.tagAlign.gz
/srv/scratch/training_camp/aggregate_analysis/WT-SCD-Rep2_out/align/rep1/WT-SCD-Rep2_R1_001.PE2SE.nodup.tn5.no_chrM.25M.R1.tagAlign.gz
/srv/scratch/training_camp/aggregate_analysis/cln3-SCD-Rep2/align/rep1/cln3-SCD-Rep2_R1_001.PE2SE.nodup.tn5.no_chrM.25M.R1.tagAlign.gz
/srv/scratch/training_camp/aggregate_analysis/cln3-SCE-Rep2/align/rep1/cln3-SCE-Rep2_R1_001.PE2SE.nodup.tn5.no_chrM.25M.R1.tagAlign.gz
/srv/scratch/training_camp/aggregate_analysis/whi5-cln3-SCE-Rep2/align/rep1/whi5-cln3-SCE-Rep2_R1_001.PE2SE.nodup.tn5.no_chrM.25M.R1.tagAlign.gz
/srv/scratch/training_camp/aggregate_analysis/WT-SCE-0_6MNaCl-Rep1/align/rep1/WT-SCE-0_6MNaCl-Rep1_R1_001.PE2SE.nodup.tn5.no_chrM.25M.R1.tagAlign.gz
/srv/s

In [19]:
#Let's see how the bedtools coverage command works
bedtools coverage


Tool:    bedtools coverage (aka coverageBed)
Version: v2.17.0
Summary: Returns the depth and breadth of coverage of features from A
	 on the intervals in B.

Usage:   bedtools coverage [OPTIONS] -a <bed/gff/vcf> -b <bed/gff/vcf>

Options: 
	-abam	The A input file is in BAM format.

	-s	Require same strandedness.  That is, only counts hits in A that
		overlap B on the _same_ strand.
		- By default, overlaps are counted without respect to strand.

	-S	Require different strandedness.  That is, only report hits in A
		that overlap B on the _opposite_ strand.
		- By default, overlaps are counted without respect to strand.

	-hist	Report a histogram of coverage for each feature in B
		as well as a summary histogram for _all_ features in B.

		Output (tab delimited) after each feature in B:
		  1) depth
		  2) # bases at depth
		  3) size of B
		  4) % of B at depth

	-d	Report the depth at each position in each B feature.
		Positions reported are one based.  Eac
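
Conceptually, counting reads per peak means checking, for every peak, how many read intervals share at least one base with it. The sketch below does this by brute force in `awk` on made-up data. It is only meant to show the logic and the shape of the result; it is not how bedtools is implemented and would be far too slow for real files.

```bash
peaks=$(mktemp); reads=$(mktemp)
# Toy peaks (chrom, start, end, ID) and toy reads (chrom, start, end).
printf 'chrI\t0\t100\t1\nchrI\t200\t300\t2\n' > "${peaks}"
printf 'chrI\t10\t60\nchrI\t90\t140\nchrI\t150\t190\nchrI\t250\t260\n' > "${reads}"

counts=$(awk -F'\t' -v OFS='\t' '
    # First file (reads): remember every interval.
    NR == FNR { rc[NR] = $1; rs[NR] = $2 + 0; re[NR] = $3 + 0; n = NR; next }
    # Second file (peaks): count reads that overlap this peak.
    # BED intervals are half-open, so two intervals overlap when each
    # one starts before the other ends.
    {
        c = 0
        for (i = 1; i <= n; i++)
            if (rc[i] == $1 && rs[i] < $3 + 0 && re[i] > $2 + 0) c++
        print $0, c
    }' "${reads}" "${peaks}")
echo "${counts}"
```

The count is appended after the four peak columns, so it lands in column 5, which is why the real command is followed by `cut -f5`.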

In [20]:
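#Iterate through all tagAlign files to compute read count at each peak region.
#Note: this -a/-b order matches bedtools v2.17.0 (shown above), where reads are passed as -a and peaks as -b.
#From bedtools v2.24.0 onward the roles of -a and -b are swapped, so check your version first.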
for f in `cat all.tagAlign.files.txt`
do
    sample_name=`basename $f | awk -F'[.]' '{print $1}'`
    echo "$sample_name"
    echo $sample_name > $sample_name.readcount.tmp 
    zcat $f | bedtools coverage -counts -a stdin -b all_merged.peaks.bed  | cut -f5 >>$sample_name.readcount.tmp 
done
paste *.readcount.tmp > all.readcount.txt
#cleanup the temporary files
rm *.tmp

#examine the output 
head all.readcount.txt

whi5-SCE-Rep1_R1_001
cln3-SCD-0_6MNaCl-Rep2_R1_001
WT-SCD-Rep2_R1_001
cln3-SCD-Rep2_R1_001
cln3-SCE-Rep2_R1_001
whi5-cln3-SCE-Rep2_R1_001
WT-SCE-0_6MNaCl-Rep1_R1_001
cln3-SCD-Rep1_R1_001
WT-SCD-0_6MNaCl-Rep2_R1_001
cln3-SCE-0_6MNaCl-Rep2_R1_001
WT-SCE-Rep2_R1_001
whi5-SCE-Rep2_R1_001
cln3-SCD-0_6MNaCl-Rep1_R1_001
whi5-cln3-SCE-Rep1_R1_001
WT-SCE-Rep1_R1_001
WT-SCE-0_6MNaCl-Rep2_R1_001
WT-SCD-0_6MNaCl-Rep1_R1_001
cln3-SCE-Rep1_R1_001
WT-SCD-Rep1_R1_001
cln3-SCE-0_6MNaCl-Rep1_R1_001
cln3-SCD-0_6MNaCl-Rep1_R1_001	cln3-SCD-0_6MNaCl-Rep2_R1_001	cln3-SCD-Rep1_R1_001	cln3-SCD-Rep2_R1_001	cln3-SCE-0_6MNaCl-Rep1_R1_001	cln3-SCE-0_6MNaCl-Rep2_R1_001	cln3-SCE-Rep1_R1_001	cln3-SCE-Rep2_R1_001	whi5-cln3-SCE-Rep1_R1_001	whi5-cln3-SCE-Rep2_R1_001	whi5-SCE-Rep1_R1_001	whi5-SCE-Rep2_R1_001	WT-SCD-0_6MNaCl-Rep1_R1_001	WT-SCD-0_6MNaCl-Rep2_R1_001	WT-SCD-Rep1_R1_001	WT-SCD-Rep2_R1_001	WT-SCE-0_6MNaCl-Rep1_R1_001	WT-SCE-0_6MNaCl-Rep2_R1_001	WT-SCE-Rep1_R1_001	WT-SCE-Rep2_R1_001
0	0	151

We observe that the counts in different columns are on different scales. This makes sense because if a particular sample had more reads to begin with, the raw counts for each peak will be higher.
We can address this problem with sample normalization, covered in the next section.
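
To illustrate why scale matters, here is a sketch of one simple normalization, counts per million (CPM), which divides each count by its column total and multiplies by one million. This is an assumption chosen for illustration and not necessarily the method used in the next section; the sample names and counts are made up.

```bash
# Toy count matrix: sampleB was sequenced ten times deeper than sampleA.
toy_counts=$(mktemp)
printf 'sampleA\tsampleB\n10\t100\n30\t300\n60\t600\n' > "${toy_counts}"

# Read the file twice: first to get column totals, then to rescale each value.
cpm=$(awk -F'\t' -v OFS='\t' '
    NR == FNR { if (FNR > 1) for (i = 1; i <= NF; i++) total[i] += $i; next }
    FNR == 1  { print; next }                      # keep the header row as-is
    { for (i = 1; i <= NF; i++) $i = sprintf("%.0f", 1e6 * $i / total[i]); print }
' "${toy_counts}" "${toy_counts}")
echo "${cpm}"
```

After scaling, the two columns are identical, showing that the tenfold difference in raw counts came from sequencing depth alone.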


In [21]:
#Finally, we add in the peak names to our counts file and fold change file so we can keep track of which row 
#corresponds to which peak. 


#add a header to the merged peak file 
sed -i '1i\Chrom\tStart\tEnd\tID' all_merged.peaks.bed

#paste the peak bed file region annotation matrix to the signal matrix
paste all_merged.peaks.bed all.fc.txt > o.tmp 
mv o.tmp all.fc.txt 

paste all_merged.peaks.bed all.readcount.txt > o.tmp
mv o.tmp all.readcount.txt



In [22]:
head all.readcount.txt


Chrom	Start	End	ID	cln3-SCD-0_6MNaCl-Rep1_R1_001	cln3-SCD-0_6MNaCl-Rep2_R1_001	cln3-SCD-Rep1_R1_001	cln3-SCD-Rep2_R1_001	cln3-SCE-0_6MNaCl-Rep1_R1_001	cln3-SCE-0_6MNaCl-Rep2_R1_001	cln3-SCE-Rep1_R1_001	cln3-SCE-Rep2_R1_001	whi5-cln3-SCE-Rep1_R1_001	whi5-cln3-SCE-Rep2_R1_001	whi5-SCE-Rep1_R1_001	whi5-SCE-Rep2_R1_001	WT-SCD-0_6MNaCl-Rep1_R1_001	WT-SCD-0_6MNaCl-Rep2_R1_001	WT-SCD-Rep1_R1_001	WT-SCD-Rep2_R1_001	WT-SCE-0_6MNaCl-Rep1_R1_001	WT-SCE-0_6MNaCl-Rep2_R1_001	WT-SCE-Rep1_R1_001	WT-SCE-Rep2_R1_001
chrI	0	781	1	0	0	151	191	226	158	210	127	292	296	232	188	83	246	25	182	241	203	9	244
chrI	6332	6549	2	0	0	537	820	1342	1050	1157	590	1460	1624	1562	713	590	1585	115	732	2227	2032	90	1230
chrI	9138	9609	3	0	0	175	222	366	251	304	160	401	483	410	261	143	379	34	220	379	383	17	344
chrI	20611	21197	4	0	0	249	309	369	282	316	189	394	406	322	314	134	342	60	370	334	310	19	410
chrI	28155	29092	5	0	0	50	50	48	37	42	22	57	65	55	72	12	49	7	47	65	64	1	60
chrI	29173	30197	6	0	0	88	115	215	226	225	1

In [23]:
head all.fc.txt

Chrom	Start	End	ID	cln3-SCD-Rep1_R1_001	cln3-SCD-Rep2_R1_001	cln3-SCE-0_6MNaCl-Rep1_R1_001	cln3-SCE-0_6MNaCl-Rep2_R1_001	cln3-SCE-Rep1_R1_001	cln3-SCE-Rep2_R1_001	whi5-cln3-SCE-Rep1_R1_001	whi5-cln3-SCE-Rep2_R1_001	whi5-SCE-Rep1_R1_001	whi5-SCE-Rep2_R1_001	WT-SCD-0_6MNaCl-Rep1_R1_001	WT-SCD-0_6MNaCl-Rep2_R1_001	WT-SCD-Rep1_R1_001	WT-SCD-Rep2_R1_001	WT-SCE-0_6MNaCl-Rep1_R1_001	WT-SCE-0_6MNaCl-Rep2_R1_001	WT-SCE-Rep1_R1_001	WT-SCE-Rep2_R1_001
chrI	0	781	1	1.2082	1.25879	2.07344	2.9692	2.30141	2.62036	1.86505	2.87168	2.52725	0.965167	2.73239	2.67876	1.90212	1.74266	2.82429	2.64587	6.503	2.1628
chrI	6332	6549	2	1.78946	1.78578	1.71822	1.41723	2.07408	0.991613	2.20631	1.89088	1.90687	1.54817	0.965532	0.875607	1.30748	1.76865	2.39724	2.5723	0.74938	2.00934
chrI	9138	9609	3	0.919425	1.04169	1.42442	1.20006	1.5744	1.22962	1.41871	1.42797	1.24815	1.34926	0.399234	0.432705	0.947398	0.885702	0.990087	1.25556	0.437034	1.92736
chrI	20611	21197	4	1.30906	1.33	1.87316	1.53013	1.78733	1.42063	1.75

We have now generated a read count matrix and a fold change signal matrix for the peak regions in our dataset.
This completes the basic data processing pipeline.
Now, on to drawing conclusions about our data.