# Sequencing data analysis

### IMPORTANT: Please make sure that you are using the bash kernel to run this notebook.
#### (Do this at the beginning of every session) ###

### This notebook covers analysis of DNA sequencing data from raw files to processed signals.

Although this analysis is for ATAC-seq data, many of the steps (especially the first section) are the same for other types of DNA sequencing experiments.

We'll be doing the analysis in Bash, which is the standard language for UNIX command-line scripting.

The steps in the analysis pipeline that are covered in this notebook are indicated below:
![Sequencing Data Analysis 1](images/part1.png)

## Part 1: Setting up the data

We start with raw `.fastq.gz` files, which are provided by the sequencing instrument. For each DNA molecule (read) that was sequenced, they provide the nucleotide sequence and a quality score for each base call.

In [None]:
### Set up variables storing the location of our data
### The proper way to load these variables is to source your ~/.bashrc file, but this is very slow in IPython
export SUNETID="$(whoami)"
export WORK_DIR="/srv/scratch/training_camp/work/${SUNETID}"
export DATA_DIR="${WORK_DIR}/data"
[[ ! -d ${DATA_DIR} ]] && mkdir -p "${DATA_DIR}"
export SRC_DIR="${WORK_DIR}/src"
[[ ! -d ${WORK_DIR}/src ]] && mkdir -p "${WORK_DIR}/src"
export METADATA_DIR="/srv/scratch/training_camp/metadata"
export AGGREGATE_DATA_DIR="/srv/scratch/training_camp/data"
export AGGREGATE_ANALYSIS_DIR="/srv/scratch/training_camp/aggregate_analysis"
export YEAST_DIR="/srv/scratch/training_camp/saccer3"
export TMP="${WORK_DIR}/tmp"
export TEMP=$TMP
export TMPDIR=$TMP
[[ ! -d ${TMP} ]] && mkdir -p "${TMP}"



Now, let's check exactly which fastqs we have (we copied these from \$AGGREGATE_DATA_DIR to your personal $DATA_DIR in the last tutorial):

(recall that the `ls` command lists the contents of a directory)

In [None]:
ls $DATA_DIR

As a sanity check, we can also look at the size and last edited time of the fastqs by adding `-lrth` to the `ls` command:

In [None]:
ls -lrth $DATA_DIR

Let's also inspect the format of one of the fastqs. Notice that each read takes up 4 lines:
1. the read name
2. the read's nucleotide sequence
3. a '+' separator line (optionally followed by the read name again)
4. a quality score for each base (a number encoded as a letter)

In [None]:
zcat $(ls $DATA_DIR/*gz | head -n 1) | head -n 8
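
Because each read occupies exactly 4 lines, the number of reads in a file is its line count divided by 4. The following is a minimal sketch of that calculation; the function name `count_reads` and the toy two-read file are illustrative only, and the sketch assumes no record is wrapped across extra lines.

```bash
# Count reads in a gzipped FASTQ file: total lines divided by 4.
# Assumes every record is exactly 4 lines (no wrapped sequences).
count_reads() {
    local n_lines
    # gzip -dc is equivalent to zcat; reading from stdin avoids any
    # dependence on the file extension.
    n_lines=$(gzip -dc < "$1" | wc -l)
    echo $(( n_lines / 4 ))
}

# Toy example: a two-read FASTQ written to a temporary file.
toy_fastq="$(mktemp)"
printf '@read1\nACGT\n+\nIIII\n@read2\nTTGA\n+\nII#I\n' | gzip -c > "$toy_fastq"
count_reads "$toy_fastq"   # prints 2
```

On the real data, something like `for f in $DATA_DIR/*gz; do echo "$f $(count_reads $f)"; done` should report the same count for the R1 and R2 files of a paired-end experiment.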

## Part 2: ATAC-seq data processing

The ENCODE consortium (https://www.encodeproject.org/) uses a standard ATAC-seq data processing pipeline, which can be downloaded here: https://github.com/ENCODE-DCC/atac-seq-pipeline

This pipeline is pre-installed on this computer and can be executed by running the **atac.bds** script. 



In [None]:
#/opt/atac_dnase_pipelines/atac.bds --help


Though the pipeline is highly customizable and the options might seem a bit confusing at first, do not worry: for our purposes, the default settings will suffice. You will run the pipeline on your two experiments. Fill in the names of the FASTQ files corresponding to your two experiments below, as well as the name of the output directory to store the processed data.

In [None]:
#You can find the experiment names in the file $METADATA_DIR/TC2018_samples.tsv.
#Look under the column labeled "ID"
#example: 

export experiment1="hrosenbl_WT_YPGE_1"
export experiment2="pgoddard_asf1_YPGE_1"

#Create directories to store outputs from the pipeline

#We will store the outputs in the $WORK_DIR
export outdir1=$WORK_DIR/$experiment1\_out 
export outdir2=$WORK_DIR/$experiment2\_out
mkdir $outdir1
mkdir $outdir2
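
Before launching a long-running job, it can help to confirm that both paired-end FASTQ files exist for each experiment. The sketch below assumes the `<experiment>_R1_001.fastq.gz` / `<experiment>_R2_001.fastq.gz` naming used in this notebook; the function name `check_fastqs` and the toy sample names are hypothetical.

```bash
# Report whether both paired-end FASTQ files exist (and are non-empty)
# for an experiment.
# Usage: check_fastqs DATA_DIR EXPERIMENT
# Returns 0 if both files are present, 1 otherwise.
check_fastqs() {
    local data_dir="$1" experiment="$2" status=0 mate fq
    for mate in R1 R2; do
        fq="${data_dir}/${experiment}_${mate}_001.fastq.gz"
        if [[ -s "$fq" ]]; then
            echo "found:   $fq"
        else
            echo "MISSING: $fq"
            status=1
        fi
    done
    return "$status"
}

# Toy example with a temporary directory standing in for $DATA_DIR.
toy_dir="$(mktemp -d)"
echo "placeholder" > "${toy_dir}/sampleA_R1_001.fastq.gz"
echo "placeholder" > "${toy_dir}/sampleA_R2_001.fastq.gz"
check_fastqs "$toy_dir" sampleA
check_fastqs "$toy_dir" sampleB || echo "sampleB is not ready"
```

With the variables defined above, the equivalent calls would be `check_fastqs $DATA_DIR $experiment1` and `check_fastqs $DATA_DIR $experiment2`.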


Now, kick off the pipeline! 

In [None]:
#first experiment:
echo "bds_scr ${experiment1} ${outdir1}/log.txt /opt/atac_dnase_pipelines/atac.bds -out_dir ${outdir1} -species saccer3 -fastq1_1 ${DATA_DIR}/${experiment1}_R1_001.fastq.gz -fastq1_2 ${DATA_DIR}/${experiment1}_R2_001.fastq.gz -nth 4"
bds_scr ${experiment1} ${outdir1}/log.txt /opt/atac_dnase_pipelines/atac.bds -out_dir ${outdir1} -species saccer3 -fastq1_1 ${DATA_DIR}/${experiment1}_R1_001.fastq.gz -fastq1_2 ${DATA_DIR}/${experiment1}_R2_001.fastq.gz -nth 4

In [None]:
#second experiment:
echo "bds_scr ${experiment2} ${outdir2}/log.txt /opt/atac_dnase_pipelines/atac.bds -out_dir ${outdir2} -species saccer3 -fastq1_1 ${DATA_DIR}/${experiment2}_R1_001.fastq.gz -fastq1_2 ${DATA_DIR}/${experiment2}_R2_001.fastq.gz -nth 4"
bds_scr ${experiment2} ${outdir2}/log.txt /opt/atac_dnase_pipelines/atac.bds -out_dir ${outdir2} -species saccer3 -fastq1_1 ${DATA_DIR}/${experiment2}_R1_001.fastq.gz -fastq1_2 ${DATA_DIR}/${experiment2}_R2_001.fastq.gz -nth 4

The pipeline may run for an hour or so, so meanwhile, we will learn more about what it's doing under the hood. 
If you want to check on the progress, you can examine the latest entries in the log file generated by the pipeline with the *tail* command. The log file is the second argument passed to `bds_scr` above (`log.txt` inside each output directory).


In [None]:
tail $outdir1/log.txt

In [None]:
tail  $outdir2/log.txt

## Part 3: Examining the pipeline output

The pipeline consists of multiple modules, with output files that include the following: 

```
out                               # root dir. of outputs
│
├ *report.html                    #  HTML report
├ *tracks.json                    #  Tracks datahub (JSON) for WashU browser
├ ENCODE_summary.json             #  Metadata of all datafiles and QC results
│
├ align                           #  mapped alignments
│ ├ rep1                          #   for true replicate 1 
│ │ ├ *.trim.fastq.gz             #    adapter-trimmed fastq
│ │ ├ *.bam                       #    raw bam
│ │ ├ *.nodup.bam (E)             #    filtered and deduped bam
│ │ ├ *.tagAlign.gz               #    tagAlign (bed6) generated from filtered bam
│ │ ├ *.tn5.tagAlign.gz           #    TN5 shifted tagAlign for ATAC pipeline (not for DNase pipeline)
│ │ └ *.*M.tagAlign.gz            #    subsampled tagAlign for cross-corr. analysis
│ ├ rep2                          #   for true replicate 2
│ ...
│ ├ pooled_rep                    #   for pooled replicate
│ ├ pseudo_reps                   #   for self pseudo replicates
│ │ ├ rep1                        #    for replicate 1
│ │ │ ├ pr1                       #     for self pseudo replicate 1 of replicate 1
│ │ │ ├ pr2                       #     for self pseudo replicate 2 of replicate 1
│ │ ├ rep2                        #    for replicate 2
│ │ ...                           
│ └ pooled_pseudo_reps            #   for pooled pseudo replicates
│   ├ ppr1                        #    for pooled pseudo replicate 1 (rep1-pr1 + rep2-pr1 + ...)
│   └ ppr2                        #    for pooled pseudo replicate 2 (rep1-pr2 + rep2-pr2 + ...)
│
├ peak                             #  peaks called
│ └ macs2                          #   peaks generated by MACS2
│   ├ rep1                         #    for replicate 1
│   │ ├ *.narrowPeak.gz            #     narrowPeak (p-val threshold = 0.01)
│   │ ├ *.filt.narrowPeak.gz (E)   #     blacklist filtered narrowPeak 
│   │ ├ *.narrowPeak.bb (E)        #     narrowPeak bigBed
│   │ ├ *.narrowPeak.hammock.gz    #     narrowPeak track for WashU browser
│   │ ├ *.pval0.1.narrowPeak.gz    #     narrowPeak (p-val threshold = 0.1)
│   │ └ *.pval0.1.*K.narrowPeak.gz #     narrowPeak (p-val threshold = 0.1) with top *K peaks
│   ├ rep2                         #    for replicate 2
│   ...
│   ├ pseudo_reps                          #   for self pseudo replicates
│   ├ pooled_pseudo_reps                   #   for pooled pseudo replicates
│   ├ overlap                              #   naive-overlapped peaks
│   │ ├ *.naive_overlap.narrowPeak.gz      #     naive-overlapped peak
│   │ └ *.naive_overlap.filt.narrowPeak.gz #     naive-overlapped peak after blacklist filtering
│   └ idr                           #   IDR thresholded peaks
│     ├ true_reps                   #    for true replicates
│     │ ├ *.narrowPeak.gz           #     IDR thresholded narrowPeak
│     │ ├ *.filt.narrowPeak.gz (E)  #     IDR thresholded narrowPeak (blacklist filtered)
│     │ └ *.12-col.bed.gz           #     IDR thresholded narrowPeak track for WashU browser
│     ├ pseudo_reps                 #    for self pseudo replicates
│     │ ├ rep1                      #    for replicate 1
│     │ ...
│     ├ optimal_set                 #    optimal IDR thresholded peaks
│     │ └ *.filt.narrowPeak.gz (E)  #     IDR thresholded narrowPeak (blacklist filtered)
│     ├ conservative_set            #    conservative IDR thresholded peaks
│     │ └ *.filt.narrowPeak.gz (E)  #     IDR thresholded narrowPeak (blacklist filtered)
│     ├ pseudo_reps                 #    for self pseudo replicates
│     └ pooled_pseudo_reps          #    for pooled pseudo replicate
│
│   
│ 
├ qc                              #  QC logs
│ ├ *IDR_final.qc                 #   Final IDR QC
│ ├ rep1                          #   for true replicate 1
│ │ ├ *.align.log                 #    Bowtie2 mapping stat log
│ │ ├ *.dup.qc                    #    Picard (or sambamba) MarkDuplicate QC log
│ │ ├ *.pbc.qc                    #    PBC QC
│ │ ├ *.nodup.flagstat.qc         #    Flagstat QC for filtered bam
│ │ ├ *M.cc.qc                    #    Cross-correlation analysis score for tagAlign
│ │ ├ *M.cc.plot.pdf/png          #    Cross-correlation analysis plot for tagAlign
│ │ └ *_qc.html/txt               #    ATAQC report
│ ...
│
├ signal                          #  signal tracks
│ ├ macs2                         #   signal tracks generated by MACS2
│ │ ├ rep1                        #    for true replicate 1 
│ │ │ ├ *.pval.signal.bigwig (E)  #     signal track for p-val
│ │ │ └ *.fc.signal.bigwig   (E)  #     signal track for fold change
│ ...
│ └ pooled_rep                    #   for pooled replicate
│ 
├ report                          # files for HTML report
└ meta                            # text files containing md5sum of output files and other metadata
```

Let's examine how well the reads aligned to the reference saccer3 genome. We'd like to see an overall alignment rate >=90% 

In [None]:
cat $outdir1/qc/rep1/*align.log

In [None]:
cat $outdir2/qc/rep1/*align.log
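
Bowtie2 ends its mapping summary with a line of the form `NN.NN% overall alignment rate`, so the check against the 90% target can be automated. This is a minimal sketch: the function names are hypothetical, the toy log uses made-up numbers, and it assumes the pipeline's `*align.log` file contains the standard Bowtie2 summary.

```bash
# Print the overall alignment rate (a percentage, without the % sign)
# from a Bowtie2 alignment log.
alignment_rate() {
    grep 'overall alignment rate' "$1" | awk '{sub(/%/, "", $1); print $1}'
}

# Return success if the rate meets a minimum threshold (default 90).
passes_alignment_qc() {
    local rate
    rate="$(alignment_rate "$1")"
    awk -v r="$rate" -v min="${2:-90}" 'BEGIN { exit !(r >= min) }'
}

# Toy log with made-up numbers, in the format Bowtie2 prints.
toy_log="$(mktemp)"
printf '1000 reads; of these:\n96.70%% overall alignment rate\n' > "$toy_log"
alignment_rate "$toy_log"   # prints 96.70
passes_alignment_qc "$toy_log" && echo "alignment rate OK"
```

For the real output, the log path would be `$outdir1/qc/rep1/*align.log` (and likewise for `$outdir2`).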

Now, let's examine how many peaks were called for each sample. We use the *zcat* command to print the contents of a gzip-compressed file and pipe them to *wc -l* to count the lines (one line per peak).

In [None]:
zcat $outdir1/peak/macs2/overlap/optimal_set/*narrowPeak.gz | wc -l 

In [None]:
zcat $outdir2/peak/macs2/overlap/optimal_set/*narrowPeak.gz | wc -l 
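
Beyond the number of peaks, the peak widths are a quick additional sanity check. In narrowPeak (and BED) files the second and third columns hold the start and end coordinates, so the width is simply end minus start. The sketch below summarizes widths from standard input; `peak_width_summary` is a hypothetical helper, and the three toy peaks are made up.

```bash
# Summarize peaks read from standard input in narrowPeak/BED format:
# number of peaks, plus minimum, mean, and maximum width (end - start).
peak_width_summary() {
    awk -F'\t' '
        { w = $3 - $2; sum += w
          if (NR == 1 || w < min) min = w
          if (NR == 1 || w > max) max = w }
        END { if (NR > 0) printf "peaks=%d min=%d mean=%.1f max=%d\n", NR, min, sum / NR, max }'
}

# Toy input: three made-up peaks (only the first three columns matter here).
printf 'chrI\t100\t300\nchrI\t500\t650\nchrII\t40\t290\n' | peak_width_summary
# prints: peaks=3 min=150 mean=200.0 max=250
```

On the pipeline output this would be run as `zcat $outdir1/peak/macs2/overlap/optimal_set/*narrowPeak.gz | peak_width_summary`.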

## Part 4: Visualizing signal tracks in the WashU and UCSC genome browsers ##

The pipeline uses the MACS2 peak caller to generate two types of signal tracks across the yeast genome: 

* P-value Tracks 
* Fold Change Tracks 

In [None]:
ls $outdir1/signal/macs2/rep1/

ls $outdir2/signal/macs2/rep1/

These files are in binary format, so we cannot print their contents to the terminal, but a number of genome browser tools have been developed that allow us to visualize their contents.  Two of the most popular of these are

* UCSC Genome Browser (https://genome.ucsc.edu/cgi-bin/hgGateway) 

* WashU Epigenome Browser (https://epigenomegateway.wustl.edu/) 

Both browsers enable you to upload or link your data for visualization. The most efficient way to do this is to place your bigwig files on a publicly accessible web server and link to them from the browser.

We have uploaded the fold change and pval bigwigs to the mitra server, here: 

mitra.stanford.edu/kundaje/tc2018

In that directory, you see a folder of fc bigwig tracks (http://mitra.stanford.edu/kundaje/tc2018/fc_tracks/) as well as a folder of pval bigwig tracks (http://mitra.stanford.edu/kundaje/tc2018/pvalue_tracks/)

You can visualize the full set of fc or pval bigwigs by following this link: http://mitra.stanford.edu/kundaje/tc2018/saccer3_tracks.html 

We will now go step-by-step through the process used to generate this visualization. To begin, point your browser to 
https://epigenomegateway.wustl.edu/


It's quite inefficient to upload our 35 track files one by one. To visualize files in bulk, the WashU browser allows you to upload "datahubs". A datahub is a file in JSON format, which uses a nested syntax to specify how the files are to be visualized. If you're curious, there's more information about such JSON datahubs here: http://washugb.blogspot.com/2012/04/data-hub.html. 


We have generated datahubs for our fc and pval bigwig files here: 

http://mitra.stanford.edu/kundaje/tc2018/pval.datahub.json and

http://mitra.stanford.edu/kundaje/tc2018/fc.datahub.json

Don't worry about the syntax of these files for now (you can generally copy the syntax and just substitute your own file names and URLs). The main point is to be aware that these hubs can be used to group visualizations of multiple browser tracks. 

## Part 5: Creating a merged peak set across all samples for downstream analysis 

Finally, we merge the peaks across all conditions to create a master list of peaks for analysis. To do this, we concatenate the IDR peaks from all experiments, sort them, and merge them. 

We take the output of the processing pipeline from the $AGGREGATE_ANALYSIS_DIR directory. This is the same analysis you performed above, but gathered in one location for all experiments conducted. 

In [None]:
cd $WORK_DIR

In [None]:

#Use the "find" command to identify all IDR narrowPeak output files and write them to a file. 
find -L $AGGREGATE_ANALYSIS_DIR  -wholename "*peak/macs2/overlap/optimal_set/*narrowPeak.gz" > narrowPeak_files.txt

#sanity check the file 
head narrowPeak_files.txt


In [None]:
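#Now, iterate through the list of narrowPeak files and concatenate them into a single master peak list. 
#Start from an empty file so that re-running this cell does not append duplicate peaks.
: > all.peaks.bed
while read -r f
do 
    zcat "$f" >> all.peaks.bed
done < narrowPeak_files.txt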

#sanity check the all.peaks.bed file 
head all.peaks.bed


In [None]:
#sort the concatenated file 
bedtools sort -i all.peaks.bed > all.peaks.sorted.bed 

head all.peaks.sorted.bed 

In [None]:
#merge the sorted, concatenated file to join overlapping peaks 
bedtools merge -i all.peaks.sorted.bed > all_merged.peaks.bed 

head all_merged.peaks.bed
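
To build intuition for what the merge step does, the sketch below reimplements the core idea in awk on a toy input: walk through intervals sorted by chromosome and start, and extend the current interval whenever the next one overlaps or directly abuts it. It is an illustration only, not a replacement for `bedtools merge`; the function name `merge_sorted_bed` is hypothetical, and extra columns, strand, and all bedtools options are ignored.

```bash
# Merge overlapping or directly adjacent intervals from a BED stream that is
# already sorted by chromosome and then by start position.
# Only the first three columns (chrom, start, end) are used.
merge_sorted_bed() {
    awk -F'\t' -v OFS='\t' '
        NR == 1 { cur_chrom = $1; cur_start = $2 + 0; cur_end = $3 + 0; next }
        $1 == cur_chrom && $2 + 0 <= cur_end {
            # Overlaps (or abuts) the current interval: extend it if needed.
            if ($3 + 0 > cur_end) cur_end = $3 + 0
            next
        }
        {
            # No overlap: emit the finished interval and start a new one.
            print cur_chrom, cur_start, cur_end
            cur_chrom = $1; cur_start = $2 + 0; cur_end = $3 + 0
        }
        END { if (NR > 0) print cur_chrom, cur_start, cur_end }'
}

# Toy input: the first two chrI intervals overlap and collapse into one.
printf 'chrI\t100\t200\nchrI\t150\t300\nchrI\t400\t500\nchrII\t100\t200\n' | merge_sorted_bed
```

The toy input yields three intervals: `chrI 100 300`, `chrI 400 500`, and `chrII 100 200`.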

In [None]:
#Finally, we use the awk command to add row numbers to the merged peak file, such that each peak has a unique identifier. 

#We cannot do this 'in place', so we use an intermediate output file 
awk  -v OFS='\t' '{print $0,NR}' all_merged.peaks.bed > o.tmp
mv o.tmp all_merged.peaks.bed

head all_merged.peaks.bed

## Part 6: Creating read count and fold change matrices.

We would like to calculate the signal strength in each sample at the genomic regions in **all_merged.peaks.bed**. As we saw above, the ATAC-seq pipeline generates genome-wide signal tracks for each sample that can be used for this calculation (the \*fc.signal.bigwig and \*pval.signal.bigwig files). We use the **bigWigAverageOverBed** utility to compute the mean fold change signal for each genomic region in each sample; the same approach can be applied to the pval tracks. 

In [None]:
#Running the command with no arguments prints its usage message
bigWigAverageOverBed

In [None]:
#First, we find all the fold change bigWig files
cd $WORK_DIR
find -L $AGGREGATE_ANALYSIS_DIR  -name "*fc*bigwig" > all.fc.bigwig
head all.fc.bigwig


In [None]:
wc -l all.fc.bigwig

In [None]:
#Iterate through all bigWig fold change tracks to compute mean signal strength at each genomic region 
for f in `cat all.fc.bigwig`
do

    #we extract the part of the filename that corresponds to the sample name and write it as the header in the fc.signal file
    sample_name=`basename $f | awk -F'[.]' '{print $1}'`
    echo "$sample_name"
    echo $sample_name > $sample_name.fc.signal.tmp 
    
    
    bigWigAverageOverBed $f all_merged.peaks.bed $sample_name.fc.signal.data.tmp 
    cut -f5 $sample_name.fc.signal.data.tmp >> $sample_name.fc.signal.tmp

    #cleanup the intermediate file 
    rm $sample_name.fc.signal.data.tmp 
done
paste *fc.signal.tmp > all.fc.txt
#cleanup intermediate files that were generated 
rm *.tmp

#examine the output 
head all.fc.txt
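
Because `paste` silently pads with empty fields when its input files have different numbers of lines, it is worth confirming that the resulting matrix is rectangular with no empty cells. The following is a minimal sketch of such a check; `check_matrix` is a hypothetical helper and the toy matrices are made up.

```bash
# Check that a tab-delimited matrix is rectangular and has no empty cells.
# Prints "rows=<n> cols=<m>" and returns 0 if the matrix looks consistent;
# otherwise reports each problem and returns 1.
check_matrix() {
    awk -F'\t' '
        NR == 1 { ncol = NF }
        NF != ncol { printf "line %d: %d fields, expected %d\n", NR, NF, ncol; bad = 1 }
        { for (i = 1; i <= NF; i++)
              if ($i == "") { printf "line %d: empty value in column %d\n", NR, i; bad = 1 } }
        END { if (!bad) printf "rows=%d cols=%d\n", NR, ncol; exit bad + 0 }' "$1"
}

# Toy example: a consistent matrix (header plus two rows) ...
toy_matrix="$(mktemp)"
printf 'sampleA\tsampleB\n1.5\t2.0\n0.3\t4.1\n' > "$toy_matrix"
check_matrix "$toy_matrix"   # prints rows=3 cols=2

# ... and one with a missing value in the second column.
bad_matrix="$(mktemp)"
printf 'sampleA\tsampleB\n1.5\t\n' > "$bad_matrix"
check_matrix "$bad_matrix" || echo "matrix has problems"
```

Running `check_matrix all.fc.txt` should report one more row than the number of lines in `all_merged.peaks.bed`, the extra row being the header of sample names.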

In addition to the fold change data matrix, we would also like to know the number of reads that pile up at each peak region. This is useful for determining differential chromatin accessibility across samples. 
To calculate the read count matrix, we will use the **bedtools coverage** command on the *tagAlign* files generated by the processing pipeline. 

In [None]:
#First, we find all the tagAlign files
cd $WORK_DIR
find -L $AGGREGATE_ANALYSIS_DIR  -name "*nodup.tn5.no_chrM.25M.R1.tagAlign*" > all.tagAlign.files.txt

head all.tagAlign.files.txt

In [None]:
wc -l all.tagAlign.files.txt

In [None]:
#Let's see how the bedtools coverage command works
bedtools coverage

In [None]:
#Iterate through all tagAlign files to compute read count at each peak region.  
for f in `cat all.tagAlign.files.txt`
do
    sample_name=`basename $f | awk -F'[.]' '{print $1}'`
    echo "$sample_name"
    echo $sample_name > $sample_name.readcount.tmp 
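    #NOTE: this -a/-b order relies on the behavior of bedtools versions before 2.24.0, which report one count
    #per interval in the -b file (here, the peaks). From version 2.24.0 onward, counts are reported per interval
    #in the -a file, so the two arguments would need to be swapped. Check your version with: bedtools --version
    zcat $f | bedtools coverage -counts -a stdin -b all_merged.peaks.bed  | cut -f5 >>$sample_name.readcount.tmp 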
done
paste *.readcount.tmp > all.readcount.txt
#cleanup the temporary files
rm *.tmp

#examine the output 
head all.readcount.txt

We observe that the counts in the first and second columns are on a different scale. This makes sense because if a particular sample had more reads to begin with, the raw counts for each peak will be higher. 
We can address this problem with sample normalization, covered in the next section.
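
To make the scale difference concrete, one simple option is to rescale each sample's column so that it sums to one million. The sketch below does this for a toy matrix; it is only an illustration, and the normalization method used in the next section may differ. The function name `counts_per_million` is hypothetical, the scaling uses reads in peaks rather than total library size, and the input is assumed to contain only sample columns with a one-line header (that is, before peak coordinates are pasted in).

```bash
# Scale each column of a tab-delimited count matrix (first line = header)
# so that the column sums to one million. The file is read twice:
# the first pass accumulates column totals, the second pass rescales.
counts_per_million() {
    awk -F'\t' -v OFS='\t' '
        NR == FNR { if (FNR > 1) for (i = 1; i <= NF; i++) total[i] += $i; next }
        FNR == 1  { print; next }
        { for (i = 1; i <= NF; i++)
              $i = (total[i] > 0) ? sprintf("%.2f", $i * 1e6 / total[i]) : 0
          print }' "$1" "$1"
}

# Toy example: sample B has 10x the reads of sample A at every peak.
toy_counts="$(mktemp)"
printf 'A\tB\n10\t100\n30\t300\n' > "$toy_counts"
counts_per_million "$toy_counts"
```

After rescaling, both toy columns become identical (250000.00 and 750000.00), showing that the apparent difference between the samples was entirely due to sequencing depth.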


In [None]:
#Finally, we add in the peak names to our counts file and fold change file so we can keep track of which row 
#corresponds to which peak. 


#add a header to the merged peak file 
sed -i '1i\Chrom\tStart\tEnd\tID' all_merged.peaks.bed

#paste the peak bed file region annotation matrix to the signal matrix
paste all_merged.peaks.bed all.fc.txt > o.tmp 
mv o.tmp all.fc.txt 

paste all_merged.peaks.bed all.readcount.txt > o.tmp
mv o.tmp all.readcount.txt

In [None]:
head all.readcount.txt


In [None]:
head all.fc.txt

In examining the files, we notice that all the sample names in the header row end with the suffix "\_R1_001". This is an artifact generated by the processing pipeline. This part of the name is not informative for our purposes, since it's shared by all samples, so we can remove it with the **sed** command. The syntax is illustrated below: 

In [None]:
sed -i 's/_R1_001//g' all.fc.txt
sed -i 's/_R1_001//g' all.readcount.txt


In [None]:
head all.fc.txt

In [None]:
head all.readcount.txt

We have now generated a read count matrix and a fold change signal matrix for the peak regions in our dataset. 
This completes the basic data processing pipeline. 
Now, on to drawing conclusions about our data. 