# Sequencing data analysis

### IMPORTANT: Please make sure that you are using the bash kernel to run this notebook.
### IMPORTANT: Run the command below to git pull and make sure you are running the latest code!! ###
#### (Do this at the beginning of every session) ###

## UPDATES
1. A single helper script now runs the pipeline (ATAC-seq pipeline)
2. Students will have to generate the count file from tagAlign & bed


In [1]:
cd /srv/scratch/training_camp/tc2017/`whoami`/src/training_camp
git stash 
git pull 

No local changes to save
remote: Counting objects: 48, done.        

### This notebook covers analysis of DNA sequencing data from raw files to processed signals.

Although this analysis is for ATAC-seq data, many of the steps (especially the first section) are the same for other types of DNA sequencing experiments.

We'll be doing the analysis in Bash, which is the standard language for UNIX command-line scripting.

The steps in the analysis pipeline that are covered in this notebook are indicated below:
![Sequencing Data Analysis 1](part1.png)

## Part 1: Setting up the data

We start with raw `.fastq.gz` files, which are provided by the sequencing instrument. For each DNA molecule (read) that was sequenced, they record the nucleotide sequence and a quality score for each base call.

In [5]:
### Set up variables storing the location of our data
### The proper way to load your variables is to define them in your ~/.bashrc file, but sourcing it is very slow in iPython 
export SUNETID="$(whoami)"
export WORK_DIR="/srv/scratch/training_camp/tc2017/${SUNETID}"
export DATA_DIR="${WORK_DIR}/data"
export FASTQ_DIR="${DATA_DIR}/fastq/"
export SRC_DIR="${WORK_DIR}/src/training_camp/src/"
export ANALYSIS_DIR="${WORK_DIR}/analysis/"
export YEAST_DIR="/srv/scratch/training_camp/saccer3/seq"
export YEAST_INDEX="/srv/scratch/training_camp/saccer3/bowtie2_index/saccer3"
export YEAST_CHR="/srv/scratch/training_camp/saccer3/sacCer3.chrom.sizes"
export TMP="${WORK_DIR}/tmp"
export TEMP=$TMP 
export TMPDIR=$TMP




Now, let's check exactly which fastqs we have:

(recall that the `ls` command lists the contents of a directory)

In [6]:
ls $FASTQ_DIR

WT-SCD-0_6MNaCl-Rep1_R1_001.fastq.gz	cln3-SCD-Rep1_R1_001.fastq.gz
WT-SCD-0_6MNaCl-Rep1_R2_001.fastq.gz	cln3-SCD-Rep1_R2_001.fastq.gz
WT-SCD-0_6MNaCl-Rep2_R1_001.fastq.gz	cln3-SCD-Rep2_R1_001.fastq.gz
WT-SCD-0_6MNaCl-Rep2_R2_001.fastq.gz	cln3-SCD-Rep2_R2_001.fastq.gz
WT-SCD-Rep1_R1_001.fastq.gz		cln3-SCE-0_6MNaCl-Rep1_R1_001.fastq.gz
WT-SCD-Rep1_R2_001.fastq.gz		cln3-SCE-0_6MNaCl-Rep1_R2_001.fastq.gz
WT-SCD-Rep2_R1_001.fastq.gz		cln3-SCE-0_6MNaCl-Rep2_R1_001.fastq.gz
WT-SCD-Rep2_R2_001.fastq.gz		cln3-SCE-0_6MNaCl-Rep2_R2_001.fastq.gz
WT-SCE-0_6MNaCl-Rep1_R1_001.fastq.gz	cln3-SCE-Rep1_R1_001.fastq.gz
WT-SCE-0_6MNaCl-Rep1_R2_001.fastq.gz	cln3-SCE-Rep1_R2_001.fastq.gz
WT-SCE-0_6MNaCl-Rep2_R1_001.fastq.gz	cln3-SCE-Rep2_R1_001.fastq.gz
WT-SCE-0_6MNaCl-Rep2_R2_001.fastq.gz	cln3-SCE-Rep2_R2_001.fastq.gz
WT-SCE-Rep1_R1_001.fastq.gz		whi5-SCE-Rep1_R1_001.fastq.gz
WT-SCE-Rep1_R2_001.fastq.gz		whi5-SCE-Rep1_R2_001.fastq.gz
WT-SCE-Rep2_R1_001.fastq.gz		whi5-SCE-Rep2_R1_001.fastq.gz


As a sanity check, we can also look at the size and last edited time of some of the fastqs by adding `-lrth` to the `ls` command:

In [7]:
ls -lrth $FASTQ_DIR | head

total 5.4G
-rwxrwxr-x 1 user1 user1  32M Sep 21 17:07 WT-SCD-Rep1_R1_001.fastq.gz
-rwxrwxr-x 1 user1 user1  28M Sep 21 17:07 WT-SCD-Rep1_R2_001.fastq.gz
-rwxrwxr-x 1 user1 user1 168M Sep 21 17:07 WT-SCD-Rep2_R1_001.fastq.gz
-rwxrwxr-x 1 user1 user1 157M Sep 21 17:07 WT-SCD-Rep2_R2_001.fastq.gz
-rwxrwxr-x 1 user1 user1 203M Sep 21 17:07 WT-SCE-0_6MNaCl-Rep1_R1_001.fastq.gz
-rwxrwxr-x 1 user1 user1 183M Sep 21 17:07 WT-SCE-0_6MNaCl-Rep1_R2_001.fastq.gz
-rwxrwxr-x 1 user1 user1 212M Sep 21 17:07 WT-SCE-0_6MNaCl-Rep2_R1_001.fastq.gz
-rwxrwxr-x 1 user1 user1 199M Sep 21 17:07 WT-SCE-0_6MNaCl-Rep2_R2_001.fastq.gz
-rwxrwxr-x 1 user1 user1 9.7M Sep 21 17:07 WT-SCE-Rep1_R1_001.fastq.gz


Let's also inspect the format of one of the fastqs. Notice that each read takes up 4 lines:
1. the read name
2. the read's nucleotide sequence
3. a '+' separator line, indicating that the quality line follows
4. a quality score for each base (a number encoded as a letter)
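
The four-line layout can be checked on a tiny example. The sketch below builds a two-read toy file (made-up reads, not course data), counts the reads as lines / 4, and decodes one quality letter. The decoding step assumes the common Phred+33 encoding, which this notebook does not state explicitly.

```shell
# Build a tiny two-read FASTQ file (toy data, not from the course dataset)
toy_fastq="$(mktemp)"
printf '%s\n' \
    '@read1' 'ACGTACGT' '+' 'AAAAEEEE' \
    '@read2' 'TTGCAAGC' '+' 'EEEE/EEA' \
    | gzip -c > "$toy_fastq"

# Each record is 4 lines, so the number of reads is lines / 4
num_lines=$(gzip -dc "$toy_fastq" | wc -l)
num_reads=$(( num_lines / 4 ))
echo "reads: $num_reads"

# Line 2 of every 4-line record is the sequence; report the distinct lengths
read_lengths=$(gzip -dc "$toy_fastq" | awk 'NR % 4 == 2 { print length($0) }' | sort -u)
echo "read length(s): $read_lengths"

# ASSUMPTION: Phred+33 encoding, i.e. quality = ASCII code of the letter - 33
phred_E=$(( $(printf '%d' "'E") - 33 ))
echo "quality letter E decodes to: $phred_E"

rm -f "$toy_fastq"
```

Under that assumption, the letter `E` that dominates the quality lines above corresponds to a score of 36.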

In [8]:
zcat $(ls $FASTQ_DIR* | head -n 1) | head -n 8

@NS500418:691:HTFJ7AFXX:1:11101:11481:1060 1:N:0:AAGAGGCA+GCGATCTA
CTAAGAAGTGGATAACCAGCAAATGCTAGCACCACTATTTAGTAGGTTAAGGTCTCGTTCGTTATCGCAATTAAGC
+
AAAAAEEEEEEEEAEEEEEEEEEEEEE/EE/EEEEEEEEEEEEEEEEEEEEEEEEEEEEAEAEEEEEEEEEEA/EA
@NS500418:691:HTFJ7AFXX:1:11101:12189:1060 1:N:0:AAGAGGCA+GCGATCTA
CCTTCACCCAGGTAGGATAAGGATCAGGCGGAGCGACAGTATTAACAACAACTCGAGAAAAAACGATACATATACT
+
AAAAAEAEEAAE/EEEEEEEEEEEAEEAEAEEEAEAEA/EEEAEAEEEEA/EE<EAE/EEEA/AE//EAEEEEEAE

gzip: stdout: Broken pipe


## Part 2: Adapter trimming

- In many DNA and RNA sequencing experiments, a read can run through the targeted insert and into the sequencing adapter or PCR primer sequence on the end of the fragment. This happens whenever the insert is shorter than the read length, as it is for some of our ATAC-seq reads.

- We need to remove such adapter sequences because they won't align to the genome.

- In ATAC-seq (the data we're analyzing), the fragment length follows a periodic distribution. Some reads have very short inserts (only a few basepairs), while other reads have inserts that are hundreds of basepairs long, much longer than the 77bp reads we're using to read them.

- We know ahead of time that the first part of the adapter sequence is `CTGTCTCTTATA`, since our reads are sequenced using a Nextera sample prep kit.
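
To make the read-through idea concrete, here is a minimal sketch in plain bash. The insert and the trailing bases are made up for illustration, and the trimming uses exact string matching only; `cutadapt` is more flexible, since it tolerates mismatches and partial adapter matches at the end of a read.

```shell
ADAPTER_PREFIX="CTGTCTCTTATA"            # adapter start given in the text
toy_insert="ACGTTGCAAGGCTA"              # hypothetical short insert (toy data)
toy_read="${toy_insert}${ADAPTER_PREFIX}GAGACA"  # read runs past the insert into the adapter

# Remove the first exact adapter match and everything after it
toy_trimmed="${toy_read%%"${ADAPTER_PREFIX}"*}"

echo "raw:     $toy_read"
echo "trimmed: $toy_trimmed"
```

After trimming, only the 14 bp insert remains, which is the part that can align to the genome.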

In [9]:
# Let's sanity check our adapter sequence by seeing
# how many times it occurs in the first 100000 reads.

ADAPTER="CTGTCTCTTATA"

NUM_LINES=400000  # 4 * num_reads, since each fastq entry is 4 lines

zcat $(ls $FASTQ_DIR*R1* | head -n 1) | head -n $NUM_LINES | grep $ADAPTER | wc -l

19440

gzip: stdout: Broken pipe


In [10]:
# Let's also check how often a permutation (rearrangement)
# of the adapter sequence occurs:

NOT_ADAPTER="CGTTCTTCTATA"  # A permutation of the adapter sequence

zcat $(ls $FASTQ_DIR*R1* | head -n 1) | head -n $NUM_LINES | grep $NOT_ADAPTER | wc -l

0

gzip: stdout: Broken pipe


Notice that the correct adapter sequence occurs far more often in the reads than a permutation of the adapter sequence (19440 matches versus 0). This is an important validation that we have the right sequence.

Now, we'll trim the paired-end reads using a tool called `cutadapt`:

In [11]:
#create a directory to store the trimmed data 
export TRIMMED_DIR="${ANALYSIS_DIR}trimmed/"
[[ ! -d $TRIMMED_DIR ]] && mkdir -p "$TRIMMED_DIR"



In [None]:
for R1_fastq in ${FASTQ_DIR}*_R1*fastq.gz; do
    
    # Get the read 2 fastq file from the filename of read 1
    R2_fastq=$(echo "$R1_fastq" | sed -e 's/_R1/_R2/')
    
    # Generate names for the trimmed fastq files
    trimmed_R1_fastq=$TRIMMED_DIR$(basename "$R1_fastq" | sed -e 's/\.fastq\.gz$/.trimmed.fastq.gz/')
    trimmed_R2_fastq=$TRIMMED_DIR$(basename "$R2_fastq" | sed -e 's/\.fastq\.gz$/.trimmed.fastq.gz/')

    # -m 5: minimum read length to keep after trimming
    # -e 0.20: maximum error rate allowed when matching the adapter
    # -a / -A: 3' adapter to remove from read 1 / read 2
    # Print the command first (for the log), then run it
    echo cutadapt -m 5 -e 0.20 -a CTGTCTCTTATA -A CTGTCTCTTATA \
        -o "${trimmed_R1_fastq}" \
        -p "${trimmed_R2_fastq}" \
        "$R1_fastq" \
        "$R2_fastq"
    cutadapt -m 5 -e 0.20 -a CTGTCTCTTATA -A CTGTCTCTTATA \
        -o "${trimmed_R1_fastq}" \
        -p "${trimmed_R2_fastq}" \
        "$R1_fastq" \
        "$R2_fastq"

done
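
The filename handling in the loop above can also be written with bash parameter expansion instead of `sed` and `basename`. This is only a sketch of an alternative; the paths and the `example_*` variable names are hypothetical and chosen so they do not overwrite the notebook's own variables.

```shell
# Hypothetical paths, for illustration only
example_r1="/data/fastq/WT-SCD-Rep1_R1_001.fastq.gz"
example_out_dir="/analysis/trimmed/"

# Swap the read tag to get the mate's file name
example_r2="${example_r1/_R1_/_R2_}"

# Strip the directory (like basename), then swap the extension
example_name="${example_r1##*/}"
example_trimmed="${example_out_dir}${example_name%.fastq.gz}.trimmed.fastq.gz"

echo "$example_r2"
echo "$example_trimmed"
```

Parameter expansion avoids spawning a subshell per file and treats the `.` characters literally, whereas an unescaped `.` in a `sed` pattern matches any character.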

## Part 3: Alignment

Now, we're ready to align our trimmed reads to the yeast SacCer3 reference genome.

We'll use [Bowtie2](http://bowtie-bio.sourceforge.net/bowtie2/manual.shtml), which is a [Burrows-Wheeler](https://en.wikipedia.org/wiki/Burrows%E2%80%93Wheeler_transform) based short-read aligner. (It is not splice-aware, which is fine for DNA reads.)

Bowtie2 outputs a SAM (Sequence Alignment Map) file, which is a standard text encoding. To save space, we'll use `samtools view -b` to convert the output to a BAM file, the binary form of SAM.

In [12]:
#set the bowtie index
export bowtie_index=$YEAST_INDEX
echo $bowtie_index

/srv/scratch/training_camp/saccer3/bowtie2_index/saccer3


In [13]:
#create a directory to store the aligned data 
export ALIGNMENT_DIR="${ANALYSIS_DIR}aligned/"
[[ ! -d $ALIGNMENT_DIR ]] && mkdir -p "$ALIGNMENT_DIR"



In [14]:
for trimmed_fq1 in ${TRIMMED_DIR}*_R1*fastq.gz; do

    trimmed_fq2=$(echo "$trimmed_fq1" | sed -e 's/_R1/_R2/')
    
    bam=$(echo "${ALIGNMENT_DIR}${trimmed_fq1##*/}" | sed -e 's/\.fastq\.gz$/.bam/')
    # -X2000: allow paired-end fragments up to 2000 bp; --mm: memory-map the index
    bowtie2 -X2000 --mm --threads 10 -x "$bowtie_index" -1 "$trimmed_fq1" -2 "$trimmed_fq2" | samtools view -bS - > "$bam"
done

[samopen] SAM header is present: 17 sequences.
1508102 reads; of these:
  1508102 (100.00%) were paired; of these:
    73214 (4.85%) aligned concordantly 0 times
    630015 (41.78%) aligned concordantly exactly 1 time
    804873 (53.37%) aligned concordantly >1 times
    ----
    73214 pairs aligned concordantly 0 times; of these:
      8601 (11.75%) aligned discordantly 1 time
    ----
    64613 pairs aligned 0 times concordantly or discordantly; of these:
      129226 mates make up the pairs; of these:
        90618 (70.12%) aligned 0 times
        8547 (6.61%) aligned exactly 1 time
        30061 (23.26%) aligned >1 times
97.00% overall alignment rate
[samopen] SAM header is present: 17 sequences.
2998028 reads; of these:
  2998028 (100.00%) were paired; of these:
    152546 (5.09%) aligned concordantly 0 times
    1522929 (50.80%) aligned concordantly exactly 1 time
    1322553 (44.11%) aligned concordantly >1 times
    ----
    152546 pairs aligned concordan

[samopen] SAM header is present: 17 sequences.
4376854 reads; of these:
  4376854 (100.00%) were paired; of these:
    169693 (3.88%) aligned concordantly 0 times
    2813168 (64.27%) aligned concordantly exactly 1 time
    1393993 (31.85%) aligned concordantly >1 times
    ----
    169693 pairs aligned concordantly 0 times; of these:
      37019 (21.82%) aligned discordantly 1 time
    ----
    132674 pairs aligned 0 times concordantly or discordantly; of these:
      265348 mates make up the pairs; of these:
        181304 (68.33%) aligned 0 times
        29265 (11.03%) aligned exactly 1 time
        54779 (20.64%) aligned >1 times
97.93% overall alignment rate
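
The "overall alignment rate" line can be reproduced from the counts in the report. As a quick check on the first report above: every pair contributes two mates, and only the mates that "aligned 0 times" count as unaligned.

```shell
pairs=1508102           # read pairs in the first report
unaligned_mates=90618   # mates that "aligned 0 times"

total_mates=$(( pairs * 2 ))
rate=$(awk -v u="$unaligned_mates" -v t="$total_mates" \
    'BEGIN { printf "%.2f", 100 * (1 - u / t) }')

echo "${rate}% overall alignment rate"
```

This prints `97.00% overall alignment rate`, matching the value Bowtie2 reported for that sample.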

## Part 4: Finding duplicate reads and alignment filtering

During library preparation, the DNA fragments are amplified by PCR, which can lead to duplicate reads. In many kinds of DNA sequencing, we want to remove duplicates so that we don't double-count signal originating from the same molecule.

To do so, we use a tool called `sambamba`, which looks for reads that mapped to exactly the same places in the genome. We also need to sort the aligned files before we can mark duplicates, since reads aligned to the same position must be next to each other in the file.
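
To see why sorting matters, consider this conceptual sketch on made-up fragment coordinates. It is not what `sambamba` does internally (real duplicate marking also takes strand and mate positions into account); it only shows that sorting puts identical positions on adjacent lines, where they are easy to count.

```shell
# Toy fragments: chromosome, start, end (hypothetical coordinates)
toy_fragments=$(printf '%s\t%s\t%s\n' \
    chrI 100 250 \
    chrII 40 180 \
    chrI 100 250 \
    chrI 300 420 \
    chrI 100 250)

# After sorting, identical coordinates are adjacent, so uniq -c can count them
echo "$toy_fragments" | sort -k1,1 -k2,2n -k3,3n | uniq -c

# Keeping one copy per position leaves the unique fragments
num_total=$(echo "$toy_fragments" | wc -l)
num_unique=$(echo "$toy_fragments" | sort -u | wc -l)
echo "total: $(( num_total )), unique: $(( num_unique ))"
```

Here three of the five fragments share the same coordinates, so two of them would be treated as duplicates.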

Bowtie2 also sets certain labels (or "flags") in the resulting alignment file to record details such as whether each mate was mapped and the orientation of both mates of the fragment.

We can use these flags as a way to discard low-quality reads. [This website](https://broadinstitute.github.io/picard/explain-flags.html) provides a convenient way to interpret the meaning of these bitwise flags; for convenience, each combination of flags is encoded as a single number.

Here, we want to filter reads that fall into any of the following categories:
- the read wasn't mapped to the genome
- the read's mate wasn't mapped to the genome
- the alignment reported is not the primary alignment (it is a "runner-up" alignment)
- the read was marked as "low-quality" by the sequencer software
- the read was marked as a duplicate
- the read and its mate did not align as a proper pair
- the read has a mapping quality less than 30
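
The number `1804` passed to `samtools view -F` in the next cell is the sum of the individual flag bits to exclude. The sketch below rebuilds that mask and tests a few example flag values against it. The bit values come from the SAM specification; the `passes_flag_filter` helper is hypothetical and covers only the flag test, not the mapping-quality cutoff.

```shell
# SAM flag bits (values from the SAM specification)
UNMAPPED=4; MATE_UNMAPPED=8; NOT_PRIMARY=256; FAILS_QC=512; DUPLICATE=1024
PROPER_PAIR=2

exclude_mask=$(( UNMAPPED | MATE_UNMAPPED | NOT_PRIMARY | FAILS_QC | DUPLICATE ))
echo "exclude mask: $exclude_mask"

# Hypothetical helper: succeeds if a flag would survive "-F <mask> -f 2"
passes_flag_filter() {
    local flag=$1
    (( (flag & exclude_mask) == 0 && (flag & PROPER_PAIR) == PROPER_PAIR ))
}

# 99 = proper pair, first mate; 1123 = 99 + duplicate bit; 77 = unmapped pair
for flag in 99 1123 77; do
    if passes_flag_filter "$flag"; then
        echo "$flag: keep"
    else
        echo "$flag: discard"
    fi
done
```

Only flag 99 is kept: 1123 carries the duplicate bit and 77 has both mates unmapped.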

In [15]:
for bam_file in ${ALIGNMENT_DIR}*.trimmed.bam; do

    bam_file_sorted=$(echo "$bam_file" | sed -e 's/\.bam$/.sorted.bam/')
    bam_file_dup=$(echo "$bam_file" | sed -e 's/\.bam$/.sorted.dup.bam/')
    nodup_bam_file=$(echo "$bam_file" | sed -e 's/\.bam$/.nodup.bam/')
    
    # Sort by coordinate; by default sambamba writes <input>.sorted.bam
    sambamba sort -m 4G -t 40 -u "$bam_file"
    # Mark duplicates (they are flagged here, not yet removed)
    sambamba markdup -l 0 -t 40 "$bam_file_sorted" "$bam_file_dup"
    # Drop flagged reads (-F 1804), keep proper pairs (-f 2) with mapping quality >= 30 (-q 30)
    samtools view -F 1804 -f 2 -q 30 -b "$bam_file_dup" > "$nodup_bam_file"
    samtools index "$nodup_bam_file"
done

finding positions of the duplicate reads in the file...
  sorted 1455280 end pairs
     and 15026 single ends (among them 0 unmatched pairs)
  collecting indices of duplicate reads...   done in 229 ms
  found 956072 duplicates
collected list of positions in 0 min 8 sec
marking duplicates...
total time elapsed: 0 min 15 sec
[bam_index_build2] fail to create the index file.
finding positions of the duplicate reads in the file...
  sorted 2883323 end pairs
     and 24935 single ends (among them 0 unmatched pairs)
  collecting indices of duplicate reads...   done in 422 ms
  found 1728829 duplicates
collected list of positions in 0 min 17 sec
marking duplicates...
total time elapsed: 0 min 30 sec
[bam_index_build2] fail to create the index file.
finding positions of the duplicate reads in the file...
  sorted 543951 end pairs
     and 4254 single ends (among them 0 unmatched pairs)
  collecting indices of duplicate reads...   done in 76 ms
  found 153659 duplicates
c

## Part 5: Peak calling

Now that we've aligned our reads to the genome and filtered the alignments, we want to identify areas of locally enriched signals, or "peaks".

For ATAC-seq, peaks correspond to accessible regions. They can include promoters, enhancers, and other regulatory regions.

We'll call peaks using [MACS2](http://liulab.dfci.harvard.edu/MACS/).

In [None]:
#create a directory to store the tagAlign data 
TAGALIGN_DIR="${ANALYSIS_DIR}tagAlign/"
[[ ! -d $TAGALIGN_DIR ]] && mkdir -p "$TAGALIGN_DIR"

In [None]:
#create a directory to store the MACS peaks 
PEAKS_DIR="${ANALYSIS_DIR}peaks/"
[[ ! -d $PEAKS_DIR ]] && mkdir -p "$PEAKS_DIR"
echo $PEAKS_DIR

In [None]:
SacCer3GenSz=12157105  # The sum of the sizes of the chromosomes in the SacCer3 genome

Macs2PvalThresh="0.05"  # The p-value threshold for calling peaks 

Macs2SmoothWindow=150  # The window size to smooth alignment signal over
Macs2ShiftSize=$(( Macs2SmoothWindow / 2 ))  # Half the window, so the window is centered on the read's 5' end

for nodup_bam_file in ${ALIGNMENT_DIR}*.nodup.bam; do
    
    # First, we need to convert each bam to a .tagAlign,
    # which just contains the start/end positions of each read:
    
    tagalign_file=$TAGALIGN_DIR$(basename "$nodup_bam_file" | sed -e 's/\.bam$/.tagAlign.gz/')
    bedtools bamtobed -i "$nodup_bam_file" | awk 'BEGIN{OFS="\t"}{$4="N";$5="1000";print $0}' | gzip -c > "$tagalign_file"
    
    # Now, we can run MACS:
    output_prefix=$PEAKS_DIR$(basename "$tagalign_file" | sed -e 's/\.tagAlign\.gz$//')
    macs2 callpeak \
        -t "$tagalign_file" -f BED -n "$output_prefix" -g "$SacCer3GenSz" -p $Macs2PvalThresh \
        --nomodel --shift -$Macs2ShiftSize --extsize $Macs2SmoothWindow -B --SPMR --keep-dup all --call-summits

    # We also generate a fold-enrichment (FE) track comparing the sample pileup
    # to the local background (lambda) that MACS2 estimates; there is no separate control sample here
    macs2 bdgcmp -t "${output_prefix}_treat_pileup.bdg" -c "${output_prefix}_control_lambda.bdg" -o "${output_prefix}_FE.bdg" -m FE
done
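
The `--shift` and `--extsize` values work together: each read's 5' end is moved back by half the window and then extended by the full window, so the signal ends up centered on the 5' end. The sketch below works through that arithmetic for one position. It reflects our reading of the MACS2 documentation rather than MACS2's actual code, and the `cut_site_window` helper and the position 1000 are hypothetical.

```shell
smooth_window=150                     # same value as Macs2SmoothWindow
shift_size=$(( smooth_window / 2 ))   # same value as Macs2ShiftSize

# Hypothetical helper: window covered by a read whose 5' end is at position $1
cut_site_window() {
    local five_prime=$1
    local start=$(( five_prime - shift_size ))
    echo "$start $(( start + smooth_window ))"
}

window=$(cut_site_window 1000)
echo "$window"
```

For a 5' end at position 1000 this prints `925 1075`, a 150 bp window centered on the 5' end.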

Finally, we merge the peaks across all conditions to create a master list of peaks for analysis. 

In [None]:
cd $PEAKS_DIR
#concatenate all .narrowPeak files together 
cat *narrowPeak > all.peaks.bed 

#sort the concatenated file 
bedtools sort -i all.peaks.bed > all.peaks.sorted.bed 

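#merge the sorted, concatenated file to join overlapping peaks 
#(each peak is first replaced by a 200 bp window centered on its summit; column 10 is the summit offset from the peak start)
awk -F '\t' 'BEGIN { OFS="\t" } { summit=$2+$10; $2=summit-100; $3=summit+100; if ($2<0) $2=0; print $0 }' all.peaks.sorted.bed | bedtools sort -i stdin | bedtools merge -i stdin > all_merged.peaks.bed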
gzip -f all_merged.peaks.bed 
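
The summit-window step in the cell above can be tried on its own. This sketch applies the same arithmetic to two made-up narrowPeak lines, where column 2 is the peak start and column 10 is the summit offset from that start. The peak names and values are hypothetical.

```shell
# Two toy narrowPeak records (hypothetical values)
toy_peaks=$(printf '%s\n' \
    $'chrI\t1000\t1500\tpeak_a\t100\t.\t5.0\t10.0\t8.0\t250' \
    $'chrI\t20\t300\tpeak_b\t80\t.\t4.0\t9.0\t7.0\t30')

# Replace each peak by a window of +/- 100 bp around its summit
windows=$(echo "$toy_peaks" | awk -F '\t' 'BEGIN { OFS = "\t" } {
    summit = $2 + $10          # absolute summit position
    start = summit - 100
    if (start < 0) start = 0   # do not run off the chromosome start
    print $1, start, summit + 100
}')

echo "$windows"
```

The first peak becomes `chrI 1150 1350`. The second has its summit at position 50, so its start is clipped to 0 and the window becomes `chrI 0 150`. Computing the summit once, before column 2 is overwritten, keeps the end coordinate from being offset twice.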

