# Sequencing data analysis

### IMPORTANT: Please make sure that you are using the bash kernel to run this notebook.
#### (Do this at the beginning of every session) ###

### This notebook covers analysis of DNA sequencing data from raw files to processed signals.

Although this analysis is for ATAC-seq data, many of the steps (especially the first section) are the same for other types of DNA sequencing experiments.

We'll be doing the analysis in Bash, which is the standard language for UNIX command-line scripting.

The steps in the analysis pipeline that are covered in this notebook are indicated below:
![Sequencing Data Analysis 1](images/part1.png)

## Part 1: Setting up the data

We start with raw `.fastq.gz` files, which are provided by the sequencing instrument. For each DNA molecule (read) that was sequenced, they record the nucleotide sequence and a quality score for each base call.

In [None]:
### Set up variables storing the location of our data
### The proper way to load your variables is by sourcing the ~/.bashrc file, but this is very slow in IPython
export SUNETID="$(whoami)"
export WORK_DIR="/scratch/${SUNETID}"
export DATA_DIR="${WORK_DIR}/data"
[[ ! -d ${WORK_DIR}/data ]] && mkdir "${WORK_DIR}/data"
export SRC_DIR="${WORK_DIR}/src"
[[ ! -d ${WORK_DIR}/src ]] && mkdir -p "${WORK_DIR}/src"
export METADATA_DIR="/metadata"
export AGGREGATE_DATA_DIR="/data"
export AGGREGATE_ANALYSIS_DIR="/outputs"
export YEAST_DIR="/saccer3"
export TMP="${WORK_DIR}/tmp"
export TEMP=$TMP
export TMPDIR=$TMP
[[ ! -d ${TMP} ]] && mkdir -p "${TMP}"



Now, let's check exactly which fastqs we have (we copied these from `$AGGREGATE_DATA_DIR` to your personal `$DATA_DIR` in the last tutorial):

(recall that the `ls` command lists the contents of a directory)

In [None]:
cd $WORK_DIR

In [None]:
ls $DATA_DIR

As a sanity check, we can also look at the size and last edited time of the fastqs by adding `-lrth` to the `ls` command:

In [None]:
ls -lrth $DATA_DIR

Let's also inspect the format of one of the fastqs. Notice that each read takes up 4 lines:
1. the read name
2. the read's nucleotide sequence
3. a '+' separator line (it may optionally repeat the read name)
4. a quality score for each base (a number encoded as a letter)

In [None]:
zcat $(ls $DATA_DIR/*gz | head -n 1) | head -n 8
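
Because every read occupies exactly 4 lines, the read count and read lengths can be checked directly from the line structure. The sketch below builds a made-up two-read FASTQ in a temporary directory, so none of the names refer to the course data; `gzip -dc` is used as a portable equivalent of `zcat`.

```bash
# Toy example: build a two-read FASTQ in a temporary directory (made-up reads).
demo_dir=$(mktemp -d)
printf '%s\n' \
    '@read1' 'ACGTACGT' '+' 'IIIIIIII' \
    '@read2' 'TTGCA' '+' 'IIIHH' \
    | gzip -c > "$demo_dir/toy.fastq.gz"

# Each record is 4 lines, so the read count is the line count divided by 4.
n_lines=$(gzip -dc "$demo_dir/toy.fastq.gz" | wc -l)
n_reads=$(( n_lines / 4 ))
echo "reads: $n_reads"

# The sequence is line 2 of every 4-line record; report the mean read length.
mean_len=$(gzip -dc "$demo_dir/toy.fastq.gz" \
    | awk 'NR % 4 == 2 { total += length($0); n++ } END { printf "%.1f", total / n }')
echo "mean read length: $mean_len"

# Remove the toy data.
rm -r "$demo_dir"
```

To apply the same idea to a real file, replace the toy path with one of the files in `$DATA_DIR`.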

## Part 2: ATAC-seq data processing

The ENCODE consortium (https://www.encodeproject.org/) uses a standard ATAC-seq data processing pipeline, which can be downloaded here: https://github.com/ENCODE-DCC/atac-seq-pipeline

This pipeline is pre-installed on this computer and can be executed by running the **atac.wdl** script through the caper (https://github.com/ENCODE-DCC/caper) tool.

We have not submitted jobs yet, so the command `caper list` shows that no jobs are running. 

In [None]:
caper list

Though the pipeline is highly customizable and all the customizations might seem a bit confusing at first, do not worry -- for our purposes, the default settings will suffice. You will run the pipeline on a single sample (i.e. the two replicates for a given strain/timepoint combination). We construct a json file with the parameters needed to run the pipeline. More information about this json file is available here: https://github.com/ENCODE-DCC/atac-seq-pipeline/blob/master/docs/input.md

In [None]:
## the ATAC-seq pipeline accepts a json file containing the 
## input parameters for analysis 
cat ~/cromwell_input_template.json

Replace the placeholders "REP1_R1_PLACEHOLDER", "REP1_R2_PLACEHOLDER", "REP2_R1_PLACEHOLDER", "REP2_R2_PLACEHOLDER" with your files. You can do this with the "sed" command. 

In [None]:
export REP1_R1=$DATA_DIR/0min_HOG1_1_R1.fastq.gz
export REP1_R2=$DATA_DIR/0min_HOG1_1_R2.fastq.gz
export REP2_R1=$DATA_DIR/0min_HOG1_2_R1.fastq.gz
export REP2_R2=$DATA_DIR/0min_HOG1_2_R2.fastq.gz
export experiment=0min_HOG1

In [None]:
cp ~/cromwell_input_template.json $WORK_DIR/cromwell_input.json

In [None]:
sed -i "s|REP1_R1_PLACEHOLDER|$REP1_R1|g" $WORK_DIR/cromwell_input.json
sed -i "s|REP1_R2_PLACEHOLDER|$REP1_R2|g" $WORK_DIR/cromwell_input.json
sed -i "s|REP2_R1_PLACEHOLDER|$REP2_R1|g" $WORK_DIR/cromwell_input.json
sed -i "s|REP2_R2_PLACEHOLDER|$REP2_R2|g" $WORK_DIR/cromwell_input.json


In [None]:
cat $WORK_DIR/cromwell_input.json
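
The placeholder substitution can be tried safely on a toy file first. This is a minimal sketch under assumptions: the two-key template and the `/toy/data/...` paths are made up for illustration, and the real `cromwell_input_template.json` contains more fields than shown here.

```bash
demo_dir=$(mktemp -d)

# Hypothetical two-key template; the real template has more fields.
cat > "$demo_dir/input.json" <<'EOF'
{
  "atac.fastqs_rep1_R1": ["REP1_R1_PLACEHOLDER"],
  "atac.fastqs_rep1_R2": ["REP1_R2_PLACEHOLDER"]
}
EOF

# Made-up paths standing in for the real fastq locations.
toy_r1=/toy/data/sample_1_R1.fastq.gz
toy_r2=/toy/data/sample_1_R2.fastq.gz

# "|" is the sed delimiter because the replacement paths contain "/".
# Writing to a new file (instead of using -i) keeps the template untouched.
sed -e "s|REP1_R1_PLACEHOLDER|$toy_r1|g" \
    -e "s|REP1_R2_PLACEHOLDER|$toy_r2|g" \
    "$demo_dir/input.json" > "$demo_dir/filled.json"

# Count lines that still contain a placeholder; 0 means all were replaced.
n_left=$(grep -c 'PLACEHOLDER' "$demo_dir/filled.json")
echo "placeholders left: $n_left"

filled=$(cat "$demo_dir/filled.json")
echo "$filled"
rm -r "$demo_dir"
```

Running the same `grep -c 'PLACEHOLDER'` check on `$WORK_DIR/cromwell_input.json` is a quick way to confirm that no placeholder was missed before submitting.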

We are now ready to submit the json file to the caper server. 

In [None]:
source activate encode-atac-seq-pipeline
caper submit /opt/atac-seq-pipeline/atac.wdl -i $WORK_DIR/cromwell_input.json -s $experiment --ip localhost --port 8000

#not a typo, run this command twice to prevent the notebook from printing (base) after each cell. 
conda deactivate 
conda deactivate


Run the `caper list` command to check on your submitted job. 

In [None]:
caper list

Store the ID of your workflow:

In [None]:
export caper_id=a88a2a6d-474d-45e8-a923-9d0a0e7bac19 #replace with the id of your workflow

If the status is "Failed", you can use the `caper troubleshoot` command to print out the error message for the job. `caper troubleshoot` will also tell you the command that the pipeline is executing.

In [None]:
caper troubleshoot $caper_id

Caper writes the pipeline outputs to **/scratch/caper/atac**. Run `ls` on your workflow's subdirectory to examine the structure of the outputs.

In [None]:
ls /scratch/caper/atac/$caper_id

The pipeline may run for an hour or so, so meanwhile, we will learn more about what it's doing under the hood. 
To check on the progress, you can use the `caper troubleshoot` command. 

Let's use the `tree` command to examine the full output directory for your workflow ID:

In [None]:
tree /scratch/caper/atac/$caper_id

This is quite a complex directory structure. We can use the croo tool (https://github.com/ENCODE-DCC/croo#installation) to aggregate the pipeline outputs that we care about.

In [None]:
croo --help

We need to find the metadata.json file for the caper run. We can do this with the linux `find` command.

In [None]:
caper_metadata_json=`find  /scratch/caper/atac/$caper_id -name "metadata.json"`

In [None]:
echo $caper_metadata_json
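
Since later steps expect a single metadata file, it can help to confirm that `find` returned exactly one match. The sketch below runs on a made-up directory tree; the `run-id` and `call-align` names are placeholders, not the real caper layout.

```bash
demo_dir=$(mktemp -d)

# Toy directory tree with one metadata.json and one unrelated file.
mkdir -p "$demo_dir/run-id/call-align/shard-0"
touch "$demo_dir/run-id/metadata.json"
touch "$demo_dir/run-id/call-align/shard-0/stdout"

# -type f restricts matches to regular files; -name matches the base name only.
found=$(find "$demo_dir/run-id" -type f -name "metadata.json")

# Count the non-empty lines in the result (0 if nothing was found).
n_found=$(printf '%s\n' "$found" | grep -c .)

if [ "$n_found" -eq 1 ]; then
    echo "metadata file: $(basename "$found")"
else
    echo "expected exactly 1 metadata.json, found $n_found" >&2
fi

found_name=$(basename "$found")
rm -r "$demo_dir"
```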

We aggregate the pipeline outputs with croo to the $AGGREGATE_ANALYSIS_DIR/croo/ folder.

In [None]:
experiment_croo_dir=$AGGREGATE_ANALYSIS_DIR/croo/$experiment
echo $experiment_croo_dir
#We already aggregated the samples with croo, so no need to run this command. 
#croo $caper_metadata_json --out-dir $experiment_croo_dir --out-def-json atac.out_def.json


Let's `ls` the croo output directory to verify the output file organization:

In [None]:
ls $experiment_croo_dir

We can examine the generated report file (`croo.report.$caper_id.html`) in the browser: navigate to the IP address of your machine, then to the `$experiment_croo_dir` directory.

## Part 3: Examining the pipeline output

The pipeline consists of multiple modules, with output files that include the following: 

```
out                               # root dir. of outputs
│
├ croo.report.*.html                  #  HTML report
├ alignment                           #  mapped alignments
│ ├ Replicate 1                          #   for true replicate 1 
│ │
│ │ ├ *.bam                       #    raw bam
│ │ ├ *.nodup.bam (E)             #    filtered and deduped bam
│ │ ├ *.tagAlign.gz               #    tagAlign (bed6) generated from filtered bam
│ │ ├ *.tn5.tagAlign.gz           #    TN5 shifted tagAlign for ATAC pipeline (not for DNase pipeline)
│ │
│ ├ Replicate 2                          #   for true replicate 2
│ ...
│ │ ...                           
│ └ Pooled replicate            #   for pooled pseudo replicates
│   ├ ppr1                        #    for pooled pseudo replicate 1 (rep1-pr1 + rep2-pr1 + ...)
│   └ ppr2                        #    for pooled pseudo replicate 2 (rep1-pr2 + rep2-pr2 + ...)
│
├ peaks                             #  peaks called
│ └ macs2                          #   peaks generated by MACS2
│   ├ rep1                         #    for replicate 1
│   │ ├ *.narrowPeak.gz            #     narrowPeak (p-val threshold = 0.01)
│   │ ├ *.filt.narrowPeak.gz (E)   #     blacklist filtered narrowPeak 
│   │ ├ *.narrowPeak.bb (E)        #     narrowPeak bigBed
│   │ ├ *.narrowPeak.hammock.gz    #     narrowPeak track for WashU browser
│   │ ├ *.pval0.1.narrowPeak.gz    #     narrowPeak (p-val threshold = 0.1)
│   │ └ *.pval0.1.*K.narrowPeak.gz #     narrowPeak (p-val threshold = 0.1) with top *K peaks
│   ├ rep2                         #    for replicate 2
│   ...
│   ├ pseudo_reps                          #   for self pseudo replicates
│   ├ pooled_pseudo_reps                   #   for pooled pseudo replicates
│   ├ overlap                              #   naive-overlapped peaks
│   │ ├ *.naive_overlap.narrowPeak.gz      #     naive-overlapped peak
│   │ └ *.naive_overlap.filt.narrowPeak.gz #     naive-overlapped peak after blacklist filtering
│   └ idr                           #   IDR thresholded peaks
│     ├ true_reps                   #    for replicate 1
│     │ ├ *.narrowPeak.gz           #     IDR thresholded narrowPeak
│     │ ├ *.filt.narrowPeak.gz (E)  #     IDR thresholded narrowPeak (blacklist filtered)
│     │ └ *.12-col.bed.gz           #     IDR thresholded narrowPeak track for WashU browser
│     ├ pseudo_reps                 #    for self pseudo replicates
│     │ ├ rep1                      #    for replicate 1
│     │ ...
│     ├ optimal_set                 #    optimal IDR thresholded peaks
│     │ └ *.filt.narrowPeak.gz (E)  #     IDR thresholded narrowPeak (blacklist filtered)
│     ├ conservative_set            #    conservative IDR thresholded peaks
│     │ └ *.filt.narrowPeak.gz (E)  #     IDR thresholded narrowPeak (blacklist filtered)
│     ├ pseudo_reps                 #    for self pseudo replicates
│     └ pooled_pseudo_reps          #    for pooled pseudo replicate
│
│   
│ 
├ qc                              #  QC logs
│ ├ *IDR_final.qc                 #   Final IDR QC
│ ├ rep1                          #   for true replicate 1
│ │ ├ *.align.log                 #    Bowtie2 mapping stat log
│ │ ├ *.dup.qc                    #    Picard (or sambamba) MarkDuplicate QC log
│ │ ├ *.pbc.qc                    #    PBC QC
│ │ ├ *.nodup.flagstat.qc         #    Flagstat QC for filtered bam
│ │ ├ *M.cc.qc                    #    Cross-correlation analysis score for tagAlign
│ │ ├ *M.cc.plot.pdf/png          #    Cross-correlation analysis plot for tagAlign
│ │ └ *_qc.html/txt               #    ATAQC report
│ ...
│
├ signal                          #  signal tracks
│ ├ macs2                         #   signal tracks generated by MACS2
│ │ ├ rep1                        #    for true replicate 1 
│ │ │ ├ *.pval.signal.bigwig (E)  #     signal track for p-val
│ │ │ └ *.fc.signal.bigwig   (E)  #     signal track for fold change
│ ...
│ └ pooled_rep                    #   for pooled replicate
│ 
├ report                          # files for HTML report
└ meta                            # text files containing md5sum of output files and other metadata
```

Let's examine how well the reads aligned to the reference sacCer3 genome. We'd like to see an overall alignment rate of at least 90%.

In [None]:
cat $experiment_croo_dir/qc/rep2/*align.log

Aggregating across samples, our observed overall alignment rates from bowtie2 were on the low side: 


| Sample     | Rep  | Alignment log file                        | Overall alignment rate |
|------------|------|-------------------------------------------|------------------------|
| 0min_HOG1  | rep1 | 0min_HOG1_1_R1.merged.align.log   | 58.74%                 |
| 0min_HOG1  | rep2 | 0min_HOG1_2_R1.merged.align.log     | 65.05%                 |
| 0min_HOT1  | rep1 | 0min_HOT1_1_R1.merged.align.log   | 61.70%                 |
| 0min_HOT1  | rep2 | 0min_HOT1_2_R1.merged.align.log   | 51.00%                 |
| 0min_MSN1  | rep1 | 0min_MSN1_1_R1.merged.align.log  | 67.29%                 |
| 0min_MSN1  | rep2 | 0min_MSN1_2_R1.merged.align.log   | 45.98%                 |
| 0min_MSN2  | rep1 | 0min_MSN2_1_R1.merged.align.log     | 62.01%                 |
| 0min_MSN2  | rep2 | 0min_MSN2_2_R1.merged.align.log    | 69.57%                 |
| 0min_MSN4  | rep1 | 0min_MSN4_1_R1.merged.align.log   | 80.48%                 |
| 0min_MSN4  | rep2 | 0min_MSN4_2_R1.merged.align.log     | 63.34%                 |
| 0min_SKN7  | rep1 | 0min_SKN7_1_R1.merged.align.log    | 94.59%                 |
| 0min_SKN7  | rep2 | 0min_SKN7_2_R1.merged.align.log  | 82.67%                 |
| 0min_WT    | rep1 | 0min_WT_1_R1.merged.align.log    | 69.60%                 |
| 0min_WT    | rep2 | 0min_WT_2_R1.merged.align.log     | 49.54%                 |
| 0min_YAP1  | rep1 | 0min_YAP1_1_R1.merged.align.log  | 68.01%                 |
| 0min_YAP1  | rep2 | 0min_YAP1_2_R1.merged.align.log  | 43.15%                 |
| 0min_YAP6  | rep1 | 0min_YAP6_1_R1.merged.align.log  | 95.60%                 |
| 0min_YAP6  | rep2 | 0min_YAP6_2_R1.merged.align.log    | 41.45%                 |
| 0min_YAP7  | rep1 | 0min_YAP7_1_R1.merged.align.log  | 93.73%                 |
| 0min_YAP7  | rep2 | 0min_YAP7_2_R1.merged.align.log     | 44.50%                 |
| 45min_HOG1 | rep1 | 45min_HOG1_1_R1.merged.align.log | 89.00%                 |
| 45min_HOG1 | rep2 | 45min_HOG1_2_R1.merged.align.log | 63.16%                 |
| 45min_HOT1 | rep1 | 45min_HOT1_1_R1.merged.align.log    | 53.53%                 |
| 45min_HOT1 | rep2 | 45min_HOT1_2_R1.merged.align.log | 93.57%                 |
| 45min_MSN1 | rep1 | 45min_MSN1_1_R1.merged.align.log | 62.20%                 |
| 45min_MSN1 | rep2 | 45min_MSN1_2_R1.merged.align.log  | 64.73%                 |
| 45min_MSN2 | rep1 | 45min_MSN2_1_R1.merged.align.log  | 15.44%                 |
| 45min_MSN2 | rep2 | 45min_MSN2_2_R1.merged.align.log    | 67.12%                 |
| 45min_MSN4 | rep1 | 45min_MSN4_1_R1.merged.align.log    | 67.52%                 |
| 45min_MSN4 | rep2 | 45min_MSN4_2_R1.merged.align.log   | 65.88%                 |
| 45min_SKN7 | rep1 | 45min_SKN7_1_R1.merged.align.log | 32.87%                 |
| 45min_SKN7 | rep2 | 45min_SKN7_2_R1.merged.align.log | 33.96%                 |
| 45min_WT   | rep1 | 45min_WT_1_R1.merged.align.log    | 77.00%                 |
| 45min_WT   | rep2 | 45min_WT_2_R1.merged.align.log    | 71.12%                 |
| 45min_YAP1 | rep1 | 45min_YAP1_1_R1.merged.align.log  | 69.80%                 |
| 45min_YAP1 | rep2 | 45min_YAP1_2_R1.merged.align.log  | 29.80%                 |
| 45min_YAP6 | rep1 | 45min_YAP6_1_R1.merged.align.log    | 69.54%                 |
| 45min_YAP6 | rep2 | 45min_YAP6_2_R1.merged.align.log | 38.73%                 |
| 45min_YAP7 | rep1 | 45min_YAP7_1_R1.merged.align.log   | 74.35%                 |
| 45min_YAP7 | rep2 | 45min_YAP7_2_R1.merged.align.log   | 34.40%                 |
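
A table like this can be assembled by pulling the summary line out of each alignment log. The sketch below assumes the logs end with bowtie2's usual `NN.NN% overall alignment rate` line and uses two made-up logs in a temporary directory; the `sampleA` and `toy_*` names are illustrative only.

```bash
demo_dir=$(mktemp -d)
mkdir -p "$demo_dir/sampleA/qc/rep1" "$demo_dir/sampleA/qc/rep2"

# Toy logs holding only two lines each; real logs contain more detail.
printf '%s\n' '1000 reads; of these:' '58.74% overall alignment rate' \
    > "$demo_dir/sampleA/qc/rep1/toy_1_R1.merged.align.log"
printf '%s\n' '1000 reads; of these:' '65.05% overall alignment rate' \
    > "$demo_dir/sampleA/qc/rep2/toy_2_R1.merged.align.log"

# For each log, print the file name and the first field of the summary line.
rates=$(for log in "$demo_dir"/sampleA/qc/rep*/*.align.log; do
    rate=$(awk '/overall alignment rate/ { print $1 }' "$log")
    printf '%s\t%s\n' "$(basename "$log")" "$rate"
done)
echo "$rates"

rm -r "$demo_dir"
```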

Now, let's examine how many peaks were called for the sample. We use the `zcat` command to print the contents of a gzip-compressed file.

We have two sets of peak calls -- optimal overlap peaks and IDR peaks. 

* Optimal overlap peak calls are generated by overlapping the peaks called in each replicate.
* IDR (Irreproducible Discovery Rate) measures the consistency between replicates in high-throughput experiments. It uses the reproducibility of peak score rankings across replicates to determine an optimal cutoff for significance.

In [None]:
zcat $experiment_croo_dir/peak/overlap_reproducibility/optimal_peak.narrowPeak.gz | wc -l 

In [None]:
zcat $experiment_croo_dir/peak/idr_reproducibility/optimal_peak.narrowPeak.gz | wc -l 

In [None]:
zcat $experiment_croo_dir/peak/overlap_reproducibility/optimal_peak.narrowPeak.gz | head -n20
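
Beyond counting lines, a narrowPeak file can be summarized with standard text tools, since it is tab-separated with the chromosome, start, and end in columns 1 to 3. The sketch below uses three made-up peaks rather than the pipeline output, and `gzip -dc` stands in for `zcat`.

```bash
demo_dir=$(mktemp -d)

# Three made-up peaks in the 10-column narrowPeak layout:
# chrom, start, end, name, score, strand, signalValue, pValue, qValue, summit offset
printf 'chrI\t100\t350\tpeak_a\t900\t.\t12.5\t30.1\t27.8\t120\n'  > "$demo_dir/toy.narrowPeak"
printf 'chrI\t800\t950\tpeak_b\t400\t.\t5.2\t9.8\t8.1\t60\n'     >> "$demo_dir/toy.narrowPeak"
printf 'chrII\t40\t540\tpeak_c\t700\t.\t9.9\t21.4\t19.0\t250\n'  >> "$demo_dir/toy.narrowPeak"
gzip "$demo_dir/toy.narrowPeak"

# Number of peaks on each chromosome (column 1).
per_chrom=$(gzip -dc "$demo_dir/toy.narrowPeak.gz" \
    | cut -f1 | sort | uniq -c | awk '{ print $2 "\t" $1 }')
echo "$per_chrom"

# Mean peak width, computed as end minus start (column 3 minus column 2).
mean_width=$(gzip -dc "$demo_dir/toy.narrowPeak.gz" \
    | awk -F'\t' '{ total += $3 - $2 } END { printf "%.1f", total / NR }')
echo "mean peak width: $mean_width"

rm -r "$demo_dir"
```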

We can calculate a fragment length distribution from our filtered bam files as follows:

In [None]:
#We extract the fragment length (the TLEN field) from column 9 of the filtered bam files for replicates 1 & 2
samtools view $experiment_croo_dir/align/rep1/*merged.nodup.bam | cut -f9 > $WORK_DIR/fraglength.1.txt
samtools view $experiment_croo_dir/align/rep2/*merged.nodup.bam | cut -f9 > $WORK_DIR/fraglength.2.txt
cat $WORK_DIR/fraglength.1.txt $WORK_DIR/fraglength.2.txt > $WORK_DIR/fraglength.txt

In [None]:
#We now generate a histogram of the fragment lengths
python /opt/get_fragment_length_histogram.py --fraglength_file $WORK_DIR/fraglength.txt --o fraglength.png
cp fraglength.png ~

![fragmentlength](fraglength.png)
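
A rough text version of this histogram can also be produced with `awk`. One detail worth knowing is that column 9 (TLEN) is signed: for a properly paired fragment, one mate carries the positive length and the other the negative one, and 0 means the length is unavailable. The sketch below keeps only positive values so each fragment is counted once; the input values are made up, and how the plotting script above treats the sign is not shown here.

```bash
# Width of each histogram bin, in base pairs (an arbitrary choice for this toy).
bin=50

# Made-up TLEN values, one per read, as `cut -f9` would emit them.
hist=$(printf '%s\n' 75 -75 80 -80 190 -190 210 -210 0 \
    | awk -v bin="$bin" '
        # Keep positive values only, and count them by the lower edge of their bin.
        $1 > 0 { counts[int($1 / bin) * bin]++ }
        END { for (b in counts) print b "\t" counts[b] }' \
    | sort -n)

# Columns: lower edge of the bin, number of fragments in that bin.
echo "$hist"
```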

## Part 4: Visualizing signal tracks in the WashU and UCSC genome browsers ##

The pipeline uses the MACS2 peak caller to generate two types of signal tracks across the yeast genome: 

* P-value Tracks 
* Fold Change Tracks 

In [None]:
ls $experiment_croo_dir/signal/pooled-rep

These files are in binary format, so we cannot print their contents to the terminal, but a number of genome browser tools have been developed that allow us to visualize their contents.  Two of the most popular of these are

* UCSC Genome Browser (https://genome.ucsc.edu/cgi-bin/hgGateway) 

* WashU Epigenome Browser (https://epigenomegateway.wustl.edu/) 

Both browsers let you upload or link your data for visualization. The most efficient way to do this is to place your bigwig files on a publicly accessible web server and link to them from the browser.

Navigate to the `$experiment_croo_dir/signal/pooled-rep` folder in your web browser to get the links for use with the WashU Epigenome Browser. 

You can visualize the full set of fc or pval bigwigs by following these links: 

**P-value tracks: http://epigenomegateway.wustl.edu/browser/?bundle=0b4cbb30-e8c9-11ea-85f3-a78449ece41c**


**Fold change tracks: http://epigenomegateway.wustl.edu/browser/?bundle=ffcc8710-e8da-11ea-aa00-5de4db3892b9**

We will now go step-by-step through the process used to generate this visualization. To begin, point your browser to 
http://epigenomegateway.wustl.edu/


It's quite inefficient to upload our 40 track files one by one. To visualize files in bulk, the WashU browser allows you to upload "datahubs". A datahub is a file in JSON format, which uses a nested syntax to specify how the files are to be visualized. If you're curious, there's more information about such JSON "datahubs" here: http://washugb.blogspot.com/2012/04/data-hub.html.


We have generated datahubs for our fc and pval bigwig files here: 

http://1.gentc.net/outputs/pval_bigwig.json and

http://1.gentc.net/outputs/fc_bigwig.json

Don't worry about the syntax of these files for now (you can generally copy their syntax and just substitute your own file names and URLs). The main point is to be aware that these hubs can be used to group visualizations of multiple browser tracks.

### Other QC metrics 

The full set of QC metrics for each sample can be accessed here: http://1.gentc.net/outputs/reports/

## Part 5: Creating a merged peak set across all samples for downstream analysis 

Finally, we merge the peaks across all conditions to create a master list of peaks for analysis. To do this, we concatenate the optimal reproducible peaks (the `overlap_reproducibility` set) from all experiments, sort them, and merge them.

We take the output of the processing pipeline from the `$AGGREGATE_ANALYSIS_DIR` directory. This is the same analysis you performed above, but gathered in one location for all experiments conducted.

In [None]:
cd $WORK_DIR

In [None]:

#Use the "find" command to identify all optimal overlap narrowPeak output files and write their paths to a file.
find -L $AGGREGATE_ANALYSIS_DIR/croo  -wholename "*/peak/overlap_reproducibility/optimal_peak.narrowPeak.gz" > narrowPeak_files.txt
#sanity check the file 
cat narrowPeak_files.txt


In [None]:
wc -l narrowPeak_files.txt

In [None]:
cat narrowPeak_files.txt

In [None]:
#Now, iterate through the list of narrowPeak files and concatenate them into a single master peak list. 
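#Redirecting the whole loop with ">" (rather than appending with ">>" inside it) means re-running this cell does not duplicate peaks.
for f in `cat narrowPeak_files.txt`
do 
    zcat $f
done > all.peaks.bed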

#sanity check the all.peaks.bed file 
head all.peaks.bed


In [None]:
#sort the concatenated file 
bedtools sort -i all.peaks.bed > all.peaks.sorted.bed 

head all.peaks.sorted.bed 

In [None]:
#merge the sorted, concatenated file to join overlapping peaks 
bedtools merge -i all.peaks.sorted.bed > all_merged.peaks.bed 

head all_merged.peaks.bed

In [None]:
#Finally, we use the awk command to add row numbers to the merged peak file, such that each peak has a unique identifier. 

#We cannot do this 'in place', so we use an intermediate output file 
awk  -v OFS='\t' '{print $0,NR}' all_merged.peaks.bed > o.tmp
mv o.tmp all_merged.peaks.bed

head all_merged.peaks.bed
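
To make the sort, merge, and numbering steps concrete, here is a rough `awk` re-implementation on five made-up intervals. It is only a sketch of the idea, not a replacement for `bedtools`: it assumes plain 3-column input and, like the default behavior of `bedtools merge`, joins intervals that overlap or directly touch.

```bash
merged=$(printf '%s\t%s\t%s\n' \
        chrI  100 200 \
        chrII 10  50 \
        chrI  150 300 \
        chrI  300 320 \
        chrI  500 600 \
    | sort -k1,1 -k2,2n \
    | awk -v OFS='\t' '
        # Same chromosome and start inside (or touching) the open interval: extend it.
        $1 == chrom && $2 + 0 <= cur_end { if ($3 + 0 > cur_end) cur_end = $3 + 0; next }
        # Otherwise print the finished interval with the next running ID.
        chrom != "" { print chrom, cur_start, cur_end, ++id }
        # Open a new interval from the current line.
        { chrom = $1; cur_start = $2 + 0; cur_end = $3 + 0 }
        # Print the last interval, which is still open at end of input.
        END { if (chrom != "") print chrom, cur_start, cur_end, ++id }')

# Columns: chromosome, start, end, running peak ID.
echo "$merged"
```

The three `chrI` intervals between 100 and 320 collapse into one peak, while the non-overlapping `chrI` interval at 500 and the `chrII` interval stay separate.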

## Part 6: Creating read count and fold change matrices.

We would like to calculate the signal strength in each sample at the genomic regions in **all_merged.peaks.bed**. As we saw above, the ATAC-seq pipeline generates genome-wide p-value and fold change signal tracks for each sample that can be used for this calculation (the \*pval.signal.bigwig and \*fc.signal.bigwig files). We use the **bigWigAverageOverBed** utility to compute the mean signal for each genomic region in each sample. The cells below do this for the fold change tracks; the same steps apply to the p-value tracks.

In [None]:
module load ucsc_tools
bigWigAverageOverBed

In [None]:
#First, we find all the fold change bigWig files
cd $WORK_DIR
find -L $AGGREGATE_ANALYSIS_DIR/croo  -wholename "*signal/rep1/*tn5.fc.signal.bigwig" > all.fc.bigwig
find -L $AGGREGATE_ANALYSIS_DIR/croo  -wholename "*signal/rep2/*tn5.fc.signal.bigwig" >> all.fc.bigwig

head all.fc.bigwig


In [None]:
wc -l all.fc.bigwig

In [None]:
#Iterate through all bigWig fold change tracks to compute mean signal strength at each genomic region 
for f in `cat all.fc.bigwig`
do

    #we extract the part of the filename that corresponds to the sample name and write it as the header in the fc.signal file
    sample_name=`basename $f | awk -F'[.]' '{print $1}'`
    echo "$sample_name"
    echo $sample_name > $sample_name.fc.signal.tmp 
    
    
    bigWigAverageOverBed $f all_merged.peaks.bed $sample_name.fc.signal.data.tmp 
    cut -f5 $sample_name.fc.signal.data.tmp >> $sample_name.fc.signal.tmp

    #cleanup the intermediate file 
    rm $sample_name.fc.signal.data.tmp 
done
paste *fc.signal.tmp > all.fc.txt
#cleanup intermediate files that were generated 
rm *.tmp

#examine the output 
head all.fc.txt
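
Because `paste` joins files line by line, the matrix is only meaningful if every per-sample column has the same number of rows in the same peak order. A minimal sketch of that check, using two made-up column files (`sampleA` and `sampleB` are placeholder names):

```bash
demo_dir=$(mktemp -d)

# Two toy column files: a header line followed by one value per peak.
printf '%s\n' sampleA 1.5 2.0 0.7 > "$demo_dir/sampleA.fc.signal.tmp"
printf '%s\n' sampleB 1.1 3.2 0.9 > "$demo_dir/sampleB.fc.signal.tmp"

# Count the distinct line counts across column files; 1 means they all agree.
n_distinct=$(for col in "$demo_dir"/*.fc.signal.tmp; do
    wc -l < "$col"
done | sort -u | grep -c .)

if [ "$n_distinct" -eq 1 ]; then
    # Safe to join: row i of every column refers to the same peak.
    paste "$demo_dir"/*.fc.signal.tmp > "$demo_dir/all.fc.txt"
else
    echo "column files differ in length; not pasting" >&2
fi

matrix=$(cat "$demo_dir/all.fc.txt")
echo "$matrix"
rm -r "$demo_dir"
```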

In addition to the fold change data matrix, we would also like to know the number of reads that pile up at each peak region. This is useful for determining differential chromatin accessibility across samples. 
To calculate the read count matrix, we will use the **bedtools coverage** command on the *tagAlign* files generated by the processing pipeline. 

In [None]:
#First, we find all the tagAlign files
cd $WORK_DIR
find -L $AGGREGATE_ANALYSIS_DIR/croo  -wholename "*align/rep1/*merged.nodup.tn5.tagAlign.gz" > all.tagAlign.files.txt
find -L $AGGREGATE_ANALYSIS_DIR/croo  -wholename "*align/rep2/*merged.nodup.tn5.tagAlign.gz" >> all.tagAlign.files.txt

head all.tagAlign.files.txt

In [None]:
wc -l all.tagAlign.files.txt

In [None]:
#Let's see how the bedtools coverage command works
bedtools coverage

In [None]:
#Iterate through all tagAlign files to compute read count at each peak region.  
for f in `cat all.tagAlign.files.txt`
do
    sample_name=`basename $f | awk -F'[.]' '{print $1}'`
    echo "$sample_name"
    echo $sample_name > $sample_name.readcount.tmp 
    bedtools coverage -counts -a all_merged.peaks.bed -b $f  | cut -f5 >>$sample_name.readcount.tmp 
done
paste *.readcount.tmp > all.readcount.txt
#cleanup the temporary files
rm *.tmp

#examine the output 
head all.readcount.txt

We observe that the counts in the first and second columns are on a different scale. This makes sense because if a particular sample had more reads to begin with, the raw counts for each peak will be higher. 
We can address this problem with sample normalization, covered in the next section.
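
One quick way to see the scale difference is to sum each column of the count matrix. The sketch below does this for a made-up two-sample matrix with a header row; it only reports the totals and is not the normalization method itself.

```bash
# Toy count matrix: a header row, then one row of counts per peak.
totals=$(printf 'sampleA\tsampleB\n10\t100\n20\t210\n30\t290\n' \
    | awk -F'\t' -v OFS='\t' '
        # Remember the sample names from the header row.
        NR == 1 { for (i = 1; i <= NF; i++) name[i] = $i; ncol = NF; next }
        # Accumulate the counts of every column.
        { for (i = 1; i <= NF; i++) total[i] += $i }
        # Print one line per sample: name, total count over all peaks.
        END { for (i = 1; i <= ncol; i++) print name[i], total[i] }')

echo "$totals"
```

Here `sampleB` has ten times the total count of `sampleA`, so its raw per-peak counts are not directly comparable.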


In [None]:
#Finally, we add in the peak names to our counts file and fold change file so we can keep track of which row 
#corresponds to which peak. 


#add a header to the merged peak file 
sed -i '1i\Chrom\tStart\tEnd\tID' all_merged.peaks.bed

#paste the peak bed file region annotation matrix to the signal matrix
paste all_merged.peaks.bed all.fc.txt > o.tmp 
mv o.tmp all.fc.txt 

paste all_merged.peaks.bed all.readcount.txt > o.tmp
mv o.tmp all.readcount.txt

In [None]:
head all.readcount.txt


In [None]:
head all.fc.txt

In examining the files, we notice that all the sample names in the header row end with the suffix "\_R1". This is an artifact of the file names used by the processing pipeline. This part of the name is not informative for our purposes, since it's shared by all samples, so we can remove it with the **sed** command. The syntax is illustrated below: 

In [None]:
sed -i 's/_R1//g' all.fc.txt
sed -i 's/_R1//g' all.readcount.txt


In [None]:
head all.fc.txt

In [None]:
head all.readcount.txt
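
The `sed` commands above apply the substitution to every line of the file. That is harmless here because the pattern only occurs in the header, but the edit can also be restricted to the header by prefixing the command with the line address `1`. A small sketch on a made-up two-line matrix (the `toy_*` sample names are placeholders):

```bash
demo_dir=$(mktemp -d)

# Toy matrix: a header row with "_R1" suffixes and one data row.
printf 'Chrom\tStart\tEnd\tID\ttoy_1_R1\ttoy_2_R1\nchrI\t100\t320\t1\t1.5\t1.1\n' \
    > "$demo_dir/toy.fc.txt"

# The "1" address limits the substitution to the first line (the header).
cleaned=$(sed '1s/_R1//g' "$demo_dir/toy.fc.txt")
cleaned_header=$(printf '%s\n' "$cleaned" | head -n 1)

echo "$cleaned"
rm -r "$demo_dir"
```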

We have now generated a read count matrix and a fold change signal matrix for the peak regions in our dataset. 
This completes the basic data processing pipeline. 
Now, on to drawing conclusions about our data. 