# Sequencing data analysis

### IMPORTANT: Please make sure that you are using the bash kernel to run this notebook.
#### (Do this at the beginning of every session) ###

### This notebook covers analysis of DNA sequencing data from raw files to processed signals.

Although this analysis is for ATAC-seq data, many of the steps (especially the first section) are the same for other types of DNA sequencing experiments.

We'll be doing the analysis in Bash, which is the standard language for UNIX command-line scripting.

The steps in the analysis pipeline that are covered in this notebook are indicated below:
![Sequencing Data Analysis 1](images/part1.png)

## Part 1: Setting up the data

We start with raw `.fastq.gz` files, which are produced by the sequencing instrument. For each DNA molecule (read) that was sequenced, they record the nucleotide sequence and a quality score for each base call.

In [1]:
### Set up variables storing the location of our data
### The usual way to load these variables is to source your ~/.bashrc file, but this is very slow in IPython
export SUNETID="$(whoami)"
export WORK_DIR="/srv/scratch/training_camp/work/${SUNETID}"
export DATA_DIR="${WORK_DIR}/data"
[[ ! -d ${DATA_DIR} ]] && mkdir -p "${DATA_DIR}"
export SRC_DIR="${WORK_DIR}/src"
[[ ! -d ${WORK_DIR}/src ]] && mkdir -p "${WORK_DIR}/src"
export METADATA_DIR="/srv/scratch/training_camp/metadata"
export AGGREGATE_DATA_DIR="/srv/scratch/training_camp/data"
export AGGREGATE_ANALYSIS_DIR="/srv/scratch/training_camp/aggregate_analysis"
export YEAST_DIR="/srv/scratch/training_camp/saccer3"
export TMP="${WORK_DIR}/tmp"
export TEMP=$TMP
export TMPDIR=$TMP
[[ ! -d ${TMP} ]] && mkdir -p "${TMP}"





Now, let's check exactly which fastqs we have (we copied these from $AGGREGATE_DATA_DIR to your personal $DATA_DIR in the last tutorial):

(recall that the `ls` command lists the contents of a directory)

In [2]:
ls $DATA_DIR

WT-SCD-Rep1_R1_001.fastq.gz  WT-SCD-Rep2_R1_001.fastq.gz
WT-SCD-Rep1_R2_001.fastq.gz  WT-SCD-Rep2_R2_001.fastq.gz


As a sanity check, we can also look at the size and last-modified time of the fastqs by adding `-lrth` to the `ls` command:

In [3]:
ls -lrth $DATA_DIR

total 383M
-rwxrwxr-x 1 ubuntu ubuntu  32M Sep  8 02:05 WT-SCD-Rep1_R1_001.fastq.gz
-rwxrwxr-x 1 ubuntu ubuntu  28M Sep  8 02:05 WT-SCD-Rep1_R2_001.fastq.gz
-rwxrwxr-x 1 ubuntu ubuntu 168M Sep  8 02:06 WT-SCD-Rep2_R1_001.fastq.gz
-rwxrwxr-x 1 ubuntu ubuntu 157M Sep  8 02:06 WT-SCD-Rep2_R2_001.fastq.gz


Let's also inspect the format of one of the fastqs. Notice that each read takes up 4 lines:
1. the read name
2. the read's nucleotide sequence
3. a '+' separator line (it may optionally repeat the read name)
4. a quality score for each base (a number encoded as a single ASCII character)

In [4]:
zcat $(ls $DATA_DIR/*gz | head -n 1) | head -n 8

@NS500418:691:HTFJ7AFXX:1:11101:13955:1071 1:N:0:CTCTCTAC+GCGATCTA
TCTCTATGATGGTAATAGGCAAACATCGGGCGTACCTTAAAAGTCTTAGACATCACATAAACTGTCTCTTATACAC
+
AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEAEEEAEEEEEEEEE
@NS500418:691:HTFJ7AFXX:1:11101:20718:1090 1:N:0:CTCTCTAC+GCGATCTA
CCCCCTCCCATTACAAACTAAAATCTTACTTTTATTTTCTTTTGCCCTCTCTGTCGCCTGTCTCTTATACACATCT
+
AAAAAEEEEEEEEEEEEAEEEEEEEEEEAEEEEEEEEE<EEEEEEEEAE/EEEEE/EEEEAE<6EEEAEEEAEEAE

gzip: stdout: Broken pipe
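
Because every read occupies exactly four lines, the number of reads in a FASTQ file is its line count divided by four. The sketch below shows this on a tiny made-up file, and also decodes a quality character under the assumption of Phred+33 encoding (quality = ASCII code minus 33), which is what current Illumina instruments write. The file `demo.fastq.gz` and the helper `phred33` are illustrative names, not part of the pipeline.

```bash
# Build a tiny two-read FASTQ file so the example runs anywhere
# (the reads below are made up for illustration).
demo_dir="$(mktemp -d)"
printf '%s\n' \
    '@read1' 'ACGT' '+' 'AE/<' \
    '@read2' 'TTGA' '+' 'EEEE' | gzip > "${demo_dir}/demo.fastq.gz"

# Each read occupies exactly 4 lines, so reads = lines / 4.
# `gzip -dc` is equivalent to `zcat`.
n_lines=$(gzip -dc "${demo_dir}/demo.fastq.gz" | wc -l)
n_reads=$(( n_lines / 4 ))
echo "reads: ${n_reads}"

# Decode one quality character, ASSUMING Phred+33 encoding:
# quality = ASCII code of the character - 33.
phred33() {
    local ascii
    ascii=$(printf '%d' "'$1")
    echo $(( ascii - 33 ))
}
echo "E -> Q$(phred33 E)"
echo "/ -> Q$(phred33 /)"
```

To try this on real data, replace the demo file with one of the files in `$DATA_DIR`.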


## Part 2: ATAC-seq data processing

The ENCODE consortium (https://www.encodeproject.org/) uses a standard ATAC-seq data processing pipeline, which can be downloaded here: https://github.com/ENCODE-DCC/atac-seq-pipeline

This pipeline is pre-installed on this computer and can be executed by running the **atac.bds** script. 



In [8]:
#/opt/atac_dnase_pipelines/atac.bds --help




Though the pipeline is highly customizable and all the customizations might seem a bit confusing at first, do not worry -- for our purposes, the default settings will suffice. You will run the pipeline on your two experiments. Fill in the names of the FASTQ files corresponding to your two experiments below, as well as the name of the output directory to store the processed data.

In [6]:
#You can find the experiment names in the file $METADATA_DIR/TC2018_samples.tsv.
#Look under the column labeled "ID"
#example: 

#export experiment1="WT_ethanol_1"
#export experiment2="asf1_ethanol_1"

export experiment1="WT-SCD-Rep1"
export experiment2="WT-SCD-Rep2"

#Create directories to store outputs from the pipeline
#We will store the outputs in the $WORK_DIR
#Braces mark where each variable name ends; -p makes mkdir safe to rerun
export outdir1="${WORK_DIR}/${experiment1}_out"
export outdir2="${WORK_DIR}/${experiment2}_out"
mkdir -p "$outdir1"
mkdir -p "$outdir2"




Now, kick off the pipeline! 

In [15]:
#first experiment:
echo "bds_scr $experiment1 $experiment1.log atac.bds -out_dir $outdir1 -species saccer3 -fastq1_1 $DATA_DIR/${experiment1}_R1_001.fastq.gz -fastq1_2 $DATA_DIR/${experiment1}_R2_001.fastq.gz -nth 4"
bds_scr $experiment1 $outdir1/log.txt /opt/atac_dnase_pipelines/atac.bds -out_dir $outdir1 -species saccer3 -fastq1_1 $DATA_DIR/${experiment1}_R1_001.fastq.gz -fastq1_2 $DATA_DIR/${experiment1}_R2_001.fastq.gz -nth 4

bds_scr WT-SCD-Rep1 WT-SCD-Rep1.log atac.bds -out_dir /srv/scratch/training_camp/work/ubuntu/WT-SCD-Rep1_out -species saccer3 -fastq1_1 /srv/scratch/training_camp/work/ubuntu/data/WT-SCD-Rep1_R1_001.fastq.gz -fastq1_2 /srv/scratch/training_camp/work/ubuntu/data/WT-SCD-Rep1_R2_001.fastq.gz -nth 4
[SCR_NAME] : WT-SCD-Rep1.BDS
[HOST] : ip-172-31-26-41.us-west-1.compute.internal
[LOG_FILE_NAME] : /srv/scratch/training_camp/work/ubuntu/WT-SCD-Rep1_out/log.txt
[BDS_PARAM] :  /opt/atac_dnase_pipelines/atac.bds -out_dir /srv/scratch/training_camp/work/ubuntu/WT-SCD-Rep1_out -species saccer3 -fastq1_1 /srv/scratch/training_camp/work/ubuntu/data/WT-SCD-Rep1_R1_001.fastq.gz -fastq1_2 /srv/scratch/training_camp/work/ubuntu/data/WT-SCD-Rep1_R2_001.fastq.gz -nth 4



In [16]:
#second experiment:
echo "bds_scr $experiment2 $experiment2.log atac.bds -out_dir $outdir2 -species saccer3 -fastq1_1 $DATA_DIR/${experiment2}_R1_001.fastq.gz -fastq1_2 $DATA_DIR/${experiment2}_R2_001.fastq.gz -nth 4"
bds_scr $experiment2 $outdir2/log.txt /opt/atac_dnase_pipelines/atac.bds -out_dir $outdir2 -species saccer3 -fastq1_1 $DATA_DIR/${experiment2}_R1_001.fastq.gz -fastq1_2 $DATA_DIR/${experiment2}_R2_001.fastq.gz -nth 4

bds_scr WT-SCD-Rep2 WT-SCD-Rep2.log atac.bds -out_dir /srv/scratch/training_camp/work/ubuntu/WT-SCD-Rep2_out -species saccer3 -fastq1_1 /srv/scratch/training_camp/work/ubuntu/data/WT-SCD-Rep2_R1_001.fastq.gz -fastq1_2 /srv/scratch/training_camp/work/ubuntu/data/WT-SCD-Rep2_R2_001.fastq.gz -nth 4
[SCR_NAME] : WT-SCD-Rep2.BDS
[HOST] : ip-172-31-26-41.us-west-1.compute.internal
[LOG_FILE_NAME] : /srv/scratch/training_camp/work/ubuntu/WT-SCD-Rep2_out/log.txt
[BDS_PARAM] :  /opt/atac_dnase_pipelines/atac.bds -out_dir /srv/scratch/training_camp/work/ubuntu/WT-SCD-Rep2_out -species saccer3 -fastq1_1 /srv/scratch/training_camp/work/ubuntu/data/WT-SCD-Rep2_R1_001.fastq.gz -fastq1_2 /srv/scratch/training_camp/work/ubuntu/data/WT-SCD-Rep2_R2_001.fastq.gz -nth 4
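
The FASTQ paths in these commands are built by attaching a suffix such as `_R1_001.fastq.gz` directly to a variable. Because an underscore is a legal character in a variable name, Bash needs to be told where the name ends. The sketch below uses a stand-in variable `experiment` and a relative `data/` path (both chosen only for illustration) to show what happens with and without braces.

```bash
experiment="WT-SCD-Rep2"

# Without braces, Bash looks up a variable named "experiment_R1_001",
# which is unset, so the whole name expands to an empty string.
wrong="data/$experiment_R1_001.fastq.gz"

# Braces mark where the variable name ends.
right="data/${experiment}_R1_001.fastq.gz"

echo "wrong: ${wrong}"
echo "right: ${right}"
```

Escaping the underscore (`$experiment\_R1`) also works on an unquoted command line, but inside double quotes the backslash is kept literally, so braces are the more dependable choice.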



The pipeline may run for an hour or so; in the meantime, we will learn more about what it's doing under the hood. 
If you want to check on the progress, you can examine the latest entries in the log file generated by the pipeline with the *tail* command. The log files are specified by the [LOG_FILE_NAME] entry above.


In [70]:
tail $outdir1/log.txt

00:09:05.526	Wait: Waiting for task to finish: atac.bds.20180908_052955_292/task.callpeak_macs2_atac.macs2_n_s_rep1_pr1.line_74.id_30, state: FINISHED
00:09:05.526	Wait: Task 'atac.bds.20180908_052955_292/task.callpeak_macs2_atac.macs2_n_s_rep1_pr1.line_74.id_30' finished.
00:09:05.526	Wait: Waiting for task to finish: atac.bds.20180908_052955_292/task.callpeak_idr.FRiP_rep1_pr.line_159.id_33, state: FINISHED
00:09:05.526	Wait: Task 'atac.bds.20180908_052955_292/task.callpeak_idr.FRiP_rep1_pr.line_159.id_33' finished.
00:09:05.526	Waiting for all 'parrallel' to finish.
00:09:05.526	Waiting for parallel 'atac.bds.20180908_052955_292_parallel_29' to finish. RunState: FINISHED
00:09:05.526	Writing report file 'atac.bds.20180908_052955_292.report.html'
00:09:05.556	Program 'atac.bds.20180908_052955_292' finished, exit value: 0, tasks executed: 26, tasks failed: 0, tasks failed names: .
00:09:05.560	Finished. Exit code: 0
00:09:05.560	ExecutionerLocal 'Local[30]': Killed


In [71]:
tail  $outdir2/log.txt

00:25:12.050	Wait: Waiting for task to finish: atac.bds.20180908_053005_965/task.callpeak_macs2_atac.macs2_n_s_rep1.line_74.id_29, state: FINISHED
00:25:12.050	Wait: Task 'atac.bds.20180908_053005_965/task.callpeak_macs2_atac.macs2_n_s_rep1.line_74.id_29' finished.
00:25:12.050	Wait: Waiting for task to finish: atac.bds.20180908_053005_965/task.graphviz.report.line_98.id_54, state: FINISHED
00:25:12.050	Wait: Task 'atac.bds.20180908_053005_965/task.graphviz.report.line_98.id_54' finished.
00:25:12.050	Waiting for all 'parrallel' to finish.
00:25:12.050	Waiting for parallel 'atac.bds.20180908_053005_965_parallel_29' to finish. RunState: FINISHED
00:25:12.050	Writing report file 'atac.bds.20180908_053005_965.report.html'
00:25:12.060	Program 'atac.bds.20180908_053005_965' finished, exit value: 0, tasks executed: 26, tasks failed: 0, tasks failed names: .
00:25:12.060	Finished. Exit code: 0
00:25:12.060	ExecutionerLocal 'Local[30]': Killed
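
Rather than reading the tail of each log by eye, you can search for the final status line. The sketch below assumes that a successful run always writes a line ending in `Finished. Exit code: 0`, as both logs above do. The helper name `pipeline_succeeded` and the toy log files are made up for illustration.

```bash
# Return success (0) if the log contains the pipeline's final status line
# with exit code 0, as seen near the end of the logs above.
pipeline_succeeded() {
    grep -q 'Finished\. Exit code: 0$' "$1"
}

# Toy log files standing in for $outdir1/log.txt (contents are illustrative).
demo_dir="$(mktemp -d)"
printf '00:09:05.560\tFinished. Exit code: 0\n' > "${demo_dir}/done.log"
printf '00:01:00.000\tWait: Waiting for task to finish\n' > "${demo_dir}/running.log"

for log in "${demo_dir}/done.log" "${demo_dir}/running.log"; do
    if pipeline_succeeded "$log"; then
        echo "$(basename "$log"): finished successfully"
    else
        echo "$(basename "$log"): still running or failed"
    fi
done
```

On the real outputs, the call would be `pipeline_succeeded "$outdir1/log.txt"`.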


## Part 3: Examining the pipeline output

The pipeline consists of multiple modules, with output files that include the following: 

```
out                               # root dir. of outputs
│
├ *report.html                    #  HTML report
├ *tracks.json                    #  Tracks datahub (JSON) for WashU browser
├ ENCODE_summary.json             #  Metadata of all datafiles and QC results
│
├ align                           #  mapped alignments
│ ├ rep1                          #   for true replicate 1 
│ │ ├ *.trim.fastq.gz             #    adapter-trimmed fastq
│ │ ├ *.bam                       #    raw bam
│ │ ├ *.nodup.bam (E)             #    filtered and deduped bam
│ │ ├ *.tagAlign.gz               #    tagAlign (bed6) generated from filtered bam
│ │ ├ *.tn5.tagAlign.gz           #    TN5 shifted tagAlign for ATAC pipeline (not for DNase pipeline)
│ │ └ *.*M.tagAlign.gz            #    subsampled tagAlign for cross-corr. analysis
│ ├ rep2                          #   for true replicate 2
│ ...
│ ├ pooled_rep                    #   for pooled replicate
│ ├ pseudo_reps                   #   for self pseudo replicates
│ │ ├ rep1                        #    for replicate 1
│ │ │ ├ pr1                       #     for self pseudo replicate 1 of replicate 1
│ │ │ ├ pr2                       #     for self pseudo replicate 2 of replicate 1
│ │ ├ rep2                        #    for replicate 2
│ │ ...                           
│ └ pooled_pseudo_reps            #   for pooled pseudo replicates
│   ├ ppr1                        #    for pooled pseudo replicate 1 (rep1-pr1 + rep2-pr1 + ...)
│   └ ppr2                        #    for pooled pseudo replicate 2 (rep1-pr2 + rep2-pr2 + ...)
│
├ peak                             #  peaks called
│ └ macs2                          #   peaks generated by MACS2
│   ├ rep1                         #    for replicate 1
│   │ ├ *.narrowPeak.gz            #     narrowPeak (p-val threshold = 0.01)
│   │ ├ *.filt.narrowPeak.gz (E)   #     blacklist filtered narrowPeak 
│   │ ├ *.narrowPeak.bb (E)        #     narrowPeak bigBed
│   │ ├ *.narrowPeak.hammock.gz    #     narrowPeak track for WashU browser
│   │ ├ *.pval0.1.narrowPeak.gz    #     narrowPeak (p-val threshold = 0.1)
│   │ └ *.pval0.1.*K.narrowPeak.gz #     narrowPeak (p-val threshold = 0.1) with top *K peaks
│   ├ rep2                         #    for replicate 2
│   ...
│   ├ pseudo_reps                          #   for self pseudo replicates
│   ├ pooled_pseudo_reps                   #   for pooled pseudo replicates
│   ├ overlap                              #   naive-overlapped peaks
│   │ ├ *.naive_overlap.narrowPeak.gz      #     naive-overlapped peak
│   │ └ *.naive_overlap.filt.narrowPeak.gz #     naive-overlapped peak after blacklist filtering
│   └ idr                           #   IDR thresholded peaks
│     ├ true_reps                   #    for replicate 1
│     │ ├ *.narrowPeak.gz           #     IDR thresholded narrowPeak
│     │ ├ *.filt.narrowPeak.gz (E)  #     IDR thresholded narrowPeak (blacklist filtered)
│     │ └ *.12-col.bed.gz           #     IDR thresholded narrowPeak track for WashU browser
│     ├ pseudo_reps                 #    for self pseudo replicates
│     │ ├ rep1                      #    for replicate 1
│     │ ...
│     ├ optimal_set                 #    optimal IDR thresholded peaks
│     │ └ *.filt.narrowPeak.gz (E)  #     IDR thresholded narrowPeak (blacklist filtered)
│     ├ conservative_set            #    conservative IDR thresholded peaks
│     │ └ *.filt.narrowPeak.gz (E)  #     IDR thresholded narrowPeak (blacklist filtered)
│     ├ pseudo_reps                 #    for self pseudo replicates
│     └ pooled_pseudo_reps          #    for pooled pseudo replicate
│
│   
│ 
├ qc                              #  QC logs
│ ├ *IDR_final.qc                 #   Final IDR QC
│ ├ rep1                          #   for true replicate 1
│ │ ├ *.align.log                 #    Bowtie2 mapping stat log
│ │ ├ *.dup.qc                    #    Picard (or sambamba) MarkDuplicate QC log
│ │ ├ *.pbc.qc                    #    PBC QC
│ │ ├ *.nodup.flagstat.qc         #    Flagstat QC for filtered bam
│ │ ├ *M.cc.qc                    #    Cross-correlation analysis score for tagAlign
│ │ ├ *M.cc.plot.pdf/png          #    Cross-correlation analysis plot for tagAlign
│ │ └ *_qc.html/txt               #    ATAQC report
│ ...
│
├ signal                          #  signal tracks
│ ├ macs2                         #   signal tracks generated by MACS2
│ │ ├ rep1                        #    for true replicate 1 
│ │ │ ├ *.pval.signal.bigwig (E)  #     signal track for p-val
│ │ │ └ *.fc.signal.bigwig   (E)  #     signal track for fold change
│ ...
│ └ pooled_rep                    #   for pooled replicate
│ 
├ report                          # files for HTML report
└ meta                            # text files containing md5sum of output files and other metadata
```

Let's examine how well the reads aligned to the reference saccer3 genome. We'd like to see an overall alignment rate of at least 90%.

In [41]:
cat $outdir1/qc/rep1/*align.log

601654 reads; of these:
  601654 (100.00%) were paired; of these:
    157957 (26.25%) aligned concordantly 0 times
    275151 (45.73%) aligned concordantly exactly 1 time
    168546 (28.01%) aligned concordantly >1 times
    ----
    157957 pairs aligned concordantly 0 times; of these:
      60787 (38.48%) aligned discordantly 1 time
    ----
    97170 pairs aligned 0 times concordantly or discordantly; of these:
      194340 mates make up the pairs; of these:
        106902 (55.01%) aligned 0 times
        948 (0.49%) aligned exactly 1 time
        86490 (44.50%) aligned >1 times
91.12% overall alignment rate


In [63]:
cat $outdir2/qc/rep1/*align.log

3277012 reads; of these:
  3277012 (100.00%) were paired; of these:
    664721 (20.28%) aligned concordantly 0 times
    1652322 (50.42%) aligned concordantly exactly 1 time
    959969 (29.29%) aligned concordantly >1 times
    ----
    664721 pairs aligned concordantly 0 times; of these:
      259947 (39.11%) aligned discordantly 1 time
    ----
    404774 pairs aligned 0 times concordantly or discordantly; of these:
      809548 mates make up the pairs; of these:
        350719 (43.32%) aligned 0 times
        6069 (0.75%) aligned exactly 1 time
        452760 (55.93%) aligned >1 times
94.65% overall alignment rate
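
The 90% check can be scripted. The sketch below extracts the percentage from the `overall alignment rate` line and compares it with a threshold. It also recomputes the rate for the first sample from the counts in its log: each of the 601654 pairs contributes two mates, and 106902 mates aligned 0 times. The file `demo.align.log` is a two-line stand-in for a real log, and the variable names are illustrative.

```bash
demo_dir="$(mktemp -d)"
# First and last lines of the first alignment log shown above.
printf '%s\n' '601654 reads; of these:' \
              '91.12% overall alignment rate' > "${demo_dir}/demo.align.log"

# Pull the percentage out of the "overall alignment rate" line.
rate=$(awk '/overall alignment rate/ { sub(/%/, "", $1); print $1 }' \
    "${demo_dir}/demo.align.log")

# Bash arithmetic is integer-only, so let awk do the floating-point comparison.
threshold=90
if awk -v r="$rate" -v t="$threshold" 'BEGIN { exit !(r >= t) }'; then
    echo "PASS: ${rate}% >= ${threshold}%"
else
    echo "LOW: ${rate}% < ${threshold}%"
fi

# Recompute the rate from the counts: aligned mates / total mates.
recomputed=$(awk 'BEGIN { printf "%.2f", (2 * 601654 - 106902) / (2 * 601654) * 100 }')
echo "recomputed: ${recomputed}%"
```

The same arithmetic on the second log, (2 × 3277012 − 350719) / (2 × 3277012), gives the reported 94.65%.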


Now, let's examine how many peaks were called for each sample. The *zcat* command prints the contents of a gzip-compressed file, and *wc -l* counts the lines (one peak per line).

In [72]:
zcat $outdir1/peak/macs2/overlap/optimal_set/*narrowPeak.gz | wc -l 

1520


In [73]:
zcat $outdir2/peak/macs2/overlap/optimal_set/*narrowPeak.gz | wc -l 

3641


## Part 4: Creating a merged peak set across all samples for downstream analysis 

Finally, we merge the peaks across all conditions to create a master list of peaks for analysis. To do this, we concatenate the optimal-set peaks (the naive-overlap narrowPeak files) from all experiments, sort them, and merge them.

We take the output of the processing pipeline from the $AGGREGATE_ANALYSIS_DIR directory. This is the same analysis you performed above, but gathered in one location for all experiments conducted.

In [3]:
cd $WORK_DIR
#Use the "find" command to locate every optimal-set narrowPeak output file and write the paths to a text file.
find "$AGGREGATE_ANALYSIS_DIR" -wholename "*peak/macs2/overlap/optimal_set/*narrowPeak.gz" > narrowPeak_files.txt

#sanity check the file 
head narrowPeak_files.txt


/srv/scratch/training_camp/aggregate_analysis/WT-SCD-Rep2_out/peak/macs2/overlap/optimal_set/WT-SCD-Rep2_out_rep1-pr.naive_overlap.narrowPeak.gz
/srv/scratch/training_camp/aggregate_analysis/WT-SCD-Rep1_out/peak/macs2/overlap/optimal_set/WT-SCD-Rep1_out_rep1-pr.naive_overlap.narrowPeak.gz


In [4]:
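#Now, iterate through the list of narrowPeak files and concatenate them into a single master peak list.
#Redirecting once after "done" overwrites all.peaks.bed, so rerunning this cell does not append duplicate peaks.
while IFS= read -r f
do
    zcat "$f"
done < narrowPeak_files.txt > all.peaks.bed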

#sanity check the all.peaks.bed file 
head all.peaks.bed


chrI	108652	109044	Peak_3176	94	.	2.16203	9.42043	8.13851	129
chrI	108652	109044	Peak_4811	22	.	1.43200	2.27093	1.31249	377
chrI	113337	114374	Peak_1054	369	.	3.76250	36.94333	35.13755	145
chrI	113337	114374	Peak_3483	73	.	1.99356	7.39529	6.17374	708
chrI	113337	114374	Peak_3694	61	.	1.88125	6.16174	4.98351	411
chrI	114606	114837	Peak_1041	375	.	3.79058	37.53549	35.72075	111
chrI	129068	129274	Peak_2465	148	.	2.55513	14.88796	13.47084	84
chrI	130158	130752	Peak_3177	94	.	2.16203	9.42043	8.13851	228
chrI	130158	130752	Peak_3281	87	.	2.10588	8.72283	7.46047	424
chrI	138986	139388	Peak_1672	241	.	3.01769	24.11693	22.52225	239
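
A useful check after concatenating is that the combined file has exactly as many lines as the input files put together. The sketch below runs the same kind of loop on two toy peak files and counts the result. The file names and coordinates are made up for illustration.

```bash
demo_dir="$(mktemp -d)"

# Two toy gzipped peak files (coordinates are made up for illustration).
printf 'chrI\t100\t200\n' | gzip > "${demo_dir}/a.narrowPeak.gz"
printf 'chrI\t150\t300\nchrII\t10\t50\n' | gzip > "${demo_dir}/b.narrowPeak.gz"

# Write the list of files, one path per line.
find "${demo_dir}" -name '*.narrowPeak.gz' | sort > "${demo_dir}/files.txt"

# Read the list line by line and decompress each file.
# Redirecting once after `done` truncates the output file first,
# so rerunning the loop does not append duplicate peaks.
while IFS= read -r f; do
    gzip -dc "$f"
done < "${demo_dir}/files.txt" > "${demo_dir}/all.peaks.bed"

# 1 peak in the first file + 2 in the second.
n_peaks=$(wc -l < "${demo_dir}/all.peaks.bed")
echo "total peaks: $(( n_peaks ))"
```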


In [5]:
#sort the concatenated file 
bedtools sort -i all.peaks.bed > all.peaks.sorted.bed 

head all.peaks.sorted.bed 

chrI	77	637	Peak_3179	94	.	2.16203	9.42043	8.13851	101
chrI	77	637	Peak_2111	184	.	2.77976	18.43291	16.94266	416
chrI	125	573	Peak_2324	33	.	2.35166	3.32030	1.93039	164
chrI	125	573	Peak_2148	38	.	2.50844	3.80537	2.35601	362
chrI	20678	21197	Peak_2417	153	.	2.58321	15.31523	13.88857	387
chrI	20678	21197	Peak_4276	37	.	1.62855	3.75592	2.68746	72
chrI	28377	28921	Peak_1113	351	.	3.67827	35.18621	33.40816	323
chrI	28398	28927	Peak_647	112	.	4.38977	11.27319	9.23240	292
chrI	29730	30101	Peak_1052	370	.	3.55305	37.05812	35.25110	208
chrI	32573	33282	Peak_452	626	.	4.28082	62.61963	60.38811	478


In [6]:
#merge the sorted, concatenated file to join overlapping peaks
bedtools merge -i all.peaks.sorted.bed > all_merged.peaks.bed 

head all_merged.peaks.bed

chrI	77	637
chrI	20678	21197
chrI	28377	28927
chrI	29730	30101
chrI	32573	33282
chrI	34225	35046
chrI	45220	45746
chrI	48290	48723
chrI	58123	58848
chrI	61040	61459
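
To see why the sorted input `chrI 28377 28921` and `chrI 28398 28927` collapses into the single interval `chrI 28377 28927`, here is a minimal awk sketch of interval merging. It is an illustration of the idea, not how bedtools is implemented. It assumes the input is already sorted by chromosome and start, and that intervals which overlap or touch should be joined. The function name `merge_sorted_bed` is made up for this example.

```bash
# Merge overlapping or touching intervals from a sorted 3-column BED stream.
merge_sorted_bed() {
    awk 'BEGIN { FS = OFS = "\t" }
         # First record: start the current interval.
         NR == 1 { chrom = $1; start = $2 + 0; stop = $3 + 0; next }
         # Same chromosome and starts before the current interval ends: extend it.
         $1 == chrom && $2 + 0 <= stop {
             if ($3 + 0 > stop) stop = $3 + 0
             next
         }
         # Otherwise emit the finished interval and start a new one.
         { print chrom, start, stop; chrom = $1; start = $2 + 0; stop = $3 + 0 }
         END { if (NR > 0) print chrom, start, stop }'
}

# First three columns of the first eight sorted peaks shown earlier.
merged=$(printf 'chrI\t%s\t%s\n' \
    77 637  77 637  125 573  125 573 \
    20678 21197  20678 21197  28377 28921  28398 28927 | merge_sorted_bed)
echo "$merged"
```

The three intervals printed here match the first three lines of `all_merged.peaks.bed` above.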


In [7]:
#compress the merged peak file with gzip to save space (-f overwrites an existing .gz file)
gzip -f all_merged.peaks.bed 

