# Sequencing data analysis

### IMPORTANT: Please make sure that you are using the bash kernel to run this notebook.
#### (Do this at the beginning of every session) ###

### This notebook covers analysis of DNA sequencing data from raw files to processed signals.

Although this analysis is for ATAC-seq data, many of the steps (especially the first section) are the same for other types of DNA sequencing experiments.

We'll be doing the analysis in Bash, which is the standard language for UNIX command-line scripting.

The steps in the analysis pipeline that are covered in this notebook are indicated below:
![Sequencing Data Analysis 1](images/part1.png)

## Part 1: Setting up the data

We start with raw `.fastq.gz` files, which are provided by the sequencing instrument. For each DNA molecule (read) that was sequenced, they provide the nucleotide sequence and a quality score for each base call.

In [1]:
### Set up variables storing the location of our data
### The proper way to load your variables is by sourcing ~/.bashrc, but this is very slow in iPython 
export SUNETID="$(whoami)"
export WORK_DIR="/scratch/${SUNETID}"
export DATA_DIR="${WORK_DIR}/data"
[[ ! -d ${WORK_DIR}/data ]] && mkdir -p "${WORK_DIR}/data"
export SRC_DIR="${WORK_DIR}/src"
[[ ! -d ${WORK_DIR}/src ]] && mkdir -p "${WORK_DIR}/src"
export METADATA_DIR="/metadata"
export AGGREGATE_DATA_DIR="/data"
export AGGREGATE_ANALYSIS_DIR="/outputs"
export YEAST_DIR="/saccer3"
export TMP="${WORK_DIR}/tmp"
export TEMP=$TMP
export TMPDIR=$TMP
[[ ! -d ${TMP} ]] && mkdir -p "${TMP}"




Now, let's check exactly which fastqs we have (we copied these from `$AGGREGATE_DATA_DIR` to your personal `$DATA_DIR` in the last tutorial):

(recall that the `ls` command lists the contents of a directory)

In [2]:
ls $DATA_DIR

hrosenbl_WT_YPGE_1_R1_001.fastq.gz  kjngo_WT_YPGE_2_R1_001.fastq.gz
hrosenbl_WT_YPGE_1_R2_001.fastq.gz  kjngo_WT_YPGE_2_R2_001.fastq.gz


As a sanity check, we can also look at the size and last-modified time of the fastqs by adding `-lrth` to the `ls` command (`-l` long listing, `-r` reverse order, `-t` sort by modification time, `-h` human-readable sizes):

In [3]:
ls -lrth $DATA_DIR

total 205M
-rwxr-xr-x 1 annashch users  98M Sep  9 18:44 hrosenbl_WT_YPGE_1_R1_001.fastq.gz
-rwxr-xr-x 1 annashch users 102M Sep  9 18:44 hrosenbl_WT_YPGE_1_R2_001.fastq.gz
-rwxr-xr-x 1 annashch users 2.6M Sep  9 18:44 kjngo_WT_YPGE_2_R1_001.fastq.gz
-rwxr-xr-x 1 annashch users 2.6M Sep  9 18:44 kjngo_WT_YPGE_2_R2_001.fastq.gz


Let's also inspect the format of one of the fastqs. Notice that each read takes up 4 lines:
1. the read name
2. the read's nucleotide sequence
3. a '+' separator line (optionally followed by the read name again)
4. a quality score for each base (a number encoded as a letter)
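
As a small illustration of this four-line layout, the sketch below builds a two-read FASTQ file from made-up records, counts the reads by dividing the line count by 4, and decodes one quality character. It assumes the Phred+33 encoding (score = ASCII code minus 33) used by current Illumina instruments; `phred_score` is a helper name invented for this example.

```bash
# Build a tiny two-read FASTQ file (toy data, for illustration only)
toy_fastq="$(mktemp)"
printf '%s\n' \
    '@read1' 'GACATGG' '+' '/6AAA6E' \
    '@read2' 'GTATTAT' '+' 'AAAAAEE' | gzip > "${toy_fastq}.gz"

# Each read occupies 4 lines, so the number of reads is the line count / 4
# (gzip -dc is equivalent to zcat)
n_lines=$(gzip -dc "${toy_fastq}.gz" | wc -l)
n_reads=$(( n_lines / 4 ))
echo "reads: ${n_reads}"

# Decode one quality character, assuming Phred+33: score = ASCII code - 33
phred_score() {
    local ascii
    ascii=$(printf '%d' "'$1")
    echo $(( ascii - 33 ))
}
phred_score 'E'   # E is ASCII 69, so this prints 36

# Remove the toy files
rm -f "${toy_fastq}" "${toy_fastq}.gz"
```

A Phred score Q corresponds to an estimated error probability of 10^(-Q/10), so a score of 36 means roughly one expected error in 4,000 base calls.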

In [4]:
zcat $(ls $DATA_DIR/*gz | head -n 1) | head -n 8

@NB551514:23:H5VFFBGX9:1:11101:20773:1063 1:N:0:TAAGGCGA+GCGATCTA
GACATGGTAGTGTCCACTTGCCCCTCGAATAGTTCCT
+
/6AAA6EEE/AEEEEAEAEEEEEEAEEAEEEEE6AEE
@NB551514:23:H5VFFBGX9:1:11101:21537:1065 1:N:0:TAAGGCGA+GCGATCTA
GTATTATAGCCTGTGGGAATACTGCCAGCTGGGACTG
+
AAAAAEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEE

gzip: stdout: Broken pipe


## Part 2: ATAC-seq data processing

The ENCODE consortium (https://www.encodeproject.org/) uses a standard ATAC-seq data processing pipeline, which can be downloaded here: https://github.com/ENCODE-DCC/atac-seq-pipeline

This pipeline is pre-installed on this computer and can be executed by running the **atac.bds** script. 



In [5]:
/opt/atac_dnase_pipelines/atac.bds


== atac pipeline settings
	-type <string>                   : Type of the pipeline. atac-seq or dnase-seq (default: atac-seq).
	-dnase_seq <bool>                : DNase-Seq (no tn5 shifting).
	-align <bool>                    : Align only (no MACS2 peak calling or IDR or ataqc analysis).
	-subsample_xcor <string>         : # reads to subsample for cross corr. analysis (default: 25M).
	-subsample <string>              : # reads to subsample exp. replicates. Subsampled tagalign will be used for steps downstream (default: 0; no subsampling).
	-true_rep <bool>                 : No pseudo-replicates.
	-no_ataqc <bool>                 : No ATAQC
	-no_xcor <bool>                  : No Cross-correlation analysis.
	-csem <bool>                     : Use CSEM for alignment.
	-smooth_win <string>             : Smoothing window size for MACS2 peak calling (default: 150).
	-idr_thresh <real>               : IDR threshold : -log_10(score) (default: 0.1).
	-ENCODE3 <bool>                 

In [5]:
ls $DATA_DIR

hrosenbl_WT_YPGE_1_R1_001.fastq.gz  kjngo_WT_YPGE_2_R1_001.fastq.gz
hrosenbl_WT_YPGE_1_R2_001.fastq.gz  kjngo_WT_YPGE_2_R2_001.fastq.gz


Though the pipeline is highly customizable and the options may seem confusing at first, the default settings will suffice for our purposes. You will run the pipeline on your two experiments. Fill in the names of the FASTQ files corresponding to your two experiments below, as well as the name of the output directory to store the processed data. 

In [38]:
#You can find the experiment names in the file $METADATA_DIR/TC2019_samples.tsv.
#Look under the column labeled "ID"
#example: 

export experiment1="hrosenbl_asf1_YPD_1"
export experiment2="pgoddard_rtt109_YPD_1"

#Create directories to store outputs from the pipeline

#We will store the outputs in the $WORK_DIR
export outdir1=$WORK_DIR/$experiment1\_out 
export outdir2=$WORK_DIR/$experiment2\_out
mkdir $outdir1
mkdir $outdir2


mkdir: cannot create directory ‘/srv/scratch/training_camp/work/ubuntu/hrosenbl_asf1_YPD_1_out’: File exists
mkdir: cannot create directory ‘/srv/scratch/training_camp/work/ubuntu/pgoddard_rtt109_YPD_1_out’: File exists


In [9]:
ls $AGGREGATE_DATA_DIR

ambenj_asf1_YPD_2_R1_001.fastq.gz      kjhanson_WT_YPD_6_R2_001.fastq.gz
ambenj_asf1_YPD_2_R2_001.fastq.gz      kjngo_rtt109_YPD_R1_001.fastq.gz
ambenj_rtt109_YPGE_2_R1_001.fastq.gz   kjngo_rtt109_YPD_R2_001.fastq.gz
ambenj_rtt109_YPGE_2_R2_001.fastq.gz   kjngo_WT_YPGE_2_R1_001.fastq.gz
dcotter1_asf1_YPGE_3_R1_001.fastq.gz   kjngo_WT_YPGE_2_R2_001.fastq.gz
dcotter1_asf1_YPGE_3_R2_001.fastq.gz   ktomins_rtt109_YPGE_5_R1_001.fastq.gz
dcotter1_rtt109_YPD_3_R1_001.fastq.gz  ktomins_rtt109_YPGE_5_R2_001.fastq.gz
dcotter1_rtt109_YPD_3_R2_001.fastq.gz  ktomins_WT_YPGE_5_R1_001.fastq.gz
dmaghini_asf1_YPD_5_R1_001.fastq.gz    ktomins_WT_YPGE_5_R2_001.fastq.gz
dmaghini_asf1_YPD_5_R2_001.fastq.gz    marinovg_WT_YPD_2_R1_001.fastq.gz
dmaghini_WT_YPD_5_R1_001.fastq.gz      marinovg_WT_YPD_2_R2_001.fastq.gz
dmaghini_WT_YPD_5_R2_001.fastq.gz      mkoska_asf1_YPD_3_R1_001.fastq.gz
egreenwa_asf1_YPD_6_R1_001.fastq.gz    mkoska_asf1_YPD_3_R2_001.fastq.gz
egreenwa_asf1_YPD_6_R2_001.fastq.gz 

Now, kick off the pipeline! 

In [None]:
#first experiment:
echo "bds_scr $experiment1 $outdir1/log.txt /opt/atac_dnase_pipelines/atac.bds -out_dir $outdir1 -species saccer3 -fastq1_1 $DATA_DIR/${experiment1}_R1_001.fastq.gz -fastq1_2 $DATA_DIR/${experiment1}_R2_001.fastq.gz -nth 4"
bds_scr $experiment1 $outdir1/log.txt /opt/atac_dnase_pipelines/atac.bds -out_dir $outdir1 -species saccer3 -fastq1_1 $DATA_DIR/$experiment1\_R1_001.fastq.gz -fastq1_2 $DATA_DIR/$experiment1\_R2_001.fastq.gz -nth 4 

In [None]:
#second experiment:
echo "bds_scr $experiment2 $outdir2/log.txt /opt/atac_dnase_pipelines/atac.bds -out_dir $outdir2 -species saccer3 -fastq1_1 $DATA_DIR/${experiment2}_R1_001.fastq.gz -fastq1_2 $DATA_DIR/${experiment2}_R2_001.fastq.gz -nth 4"
bds_scr $experiment2 $outdir2/log.txt /opt/atac_dnase_pipelines/atac.bds -out_dir $outdir2 -species saccer3 -fastq1_1 $DATA_DIR/$experiment2\_R1_001.fastq.gz -fastq1_2 $DATA_DIR/$experiment2\_R2_001.fastq.gz -nth 4  

The pipeline may run for an hour or so, so meanwhile, we will learn more about what it's doing under the hood. 
If you want to check on the progress, you can examine the latest entries in the log file generated by the pipeline with the *tail* command. The log file is the second argument passed to `bds_scr` above (here, `$outdir1/log.txt` and `$outdir2/log.txt`). 
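
A minimal sketch of such a progress check is shown here. The helper name `show_log_tail` is made up for this example, and the demonstration uses a temporary file rather than a real pipeline log.

```bash
# Print the last lines of a log file, or a notice if it does not exist yet
show_log_tail() {
    local log_file="$1"
    local n_lines="${2:-20}"   # number of lines to show (default 20)
    if [ -f "${log_file}" ]; then
        tail -n "${n_lines}" "${log_file}"
    else
        echo "No log file at ${log_file} yet"
    fi
}

# Demonstration on a toy log file
demo_log="$(mktemp)"
printf 'step 1 done\nstep 2 done\nstep 3 running\n' > "${demo_log}"
show_log_tail "${demo_log}" 2
rm -f "${demo_log}"
```

With a real run you would pass the log path instead, for example `show_log_tail "$outdir1/log.txt" 10`. Running `tail -f` on the log keeps printing new lines as the pipeline writes them.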


In [15]:
ls $outdir1/align/rep1

YPD_asf1_rep1_R1_001.PE2SE.bam
YPD_asf1_rep1_R1_001.PE2SE.bam.bai
YPD_asf1_rep1_R1_001.PE2SE.nodup.bam
YPD_asf1_rep1_R1_001.PE2SE.nodup.bam.bai
YPD_asf1_rep1_R1_001.PE2SE.nodup.bedpe.gz
YPD_asf1_rep1_R1_001.PE2SE.nodup.tagAlign.gz
YPD_asf1_rep1_R1_001.PE2SE.nodup.tn5.no_chrM.25M.R1.tagAlign.gz
YPD_asf1_rep1_R1_001.PE2SE.nodup.tn5.tagAlign.gz


In [11]:
samtools view $outdir1/align/rep1/YPD_asf1_rep1_R1_001.PE2SE.nodup.bam | head -n20

M00653:6:000000000-C3JYV:1:1103:10730:2946	99	chrI	39	40	76M	=	43	80	CCACACCACACCCACACACACACATCCTAACACTACCCTAACACAGCCCTAATCTAACCCTGGCCAACCTGTCTCT	CCCCCGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG	MD:Z:76	PG:Z:MarkDuplicates	XG:i:0	NM:i:0	XM:i:0	XN:i:0	XO:i:0	MQ:i:40	AS:i:152	XS:i:48	YS:i:152	YT:Z:CP
M00653:6:000000000-C3JYV:1:1103:19356:14698	163	chrI	39	40	76M	=	39	76	CCACACCACACCCACACACACACATCCTAACACTACCCTAACACAGCCCTAATCTAACCCTGGCCAACCTGTCTCT	CCCCCGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGD	MD:Z:76	PG:Z:MarkDuplicates	XG:i:0	NM:i:0	XM:i:0	XN:i:0	XO:i:0	MQ:i:40	AS:i:152	XS:i:48	YS:i:152	YT:Z:CP
M00653:6:000000000-C3JYV:1:1103:19356:14698	83	chrI	39	40	76M	=	39	-76	CCACACCACACCCACACACACACATCCTAACACTACCCTAACACAGCCCTAATCTAACCCTGGCCAACCTGTCTCT	GGGGGGGGGGGGGFGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGCCCCC	MD:Z:76	PG:Z:MarkDuplicates	XG:i:0	NM:i:0	XM:i:0	XN:i:0	XO:i:0	MQ:i:40	AS:i:152	XS:i:48	YS:i:152	YT:Z:CP
M00653:6:

In [19]:
samtools view -F8 $outdir1/align/rep1/YPD_asf1_rep1_R1_001.PE2SE.bam | head -n20

M00653:6:000000000-C3JYV:1:1109:20532:18321	99	chrI	1	21	3S73M	=	51	129	CACCCACACCACACCCACACACCCACACACCACACCACACACCACACCACACCCACACACACACATCCTAACACTA	CCCCCGFGGEGGGGGGGGGGGGFGGGGGGGCDFGGEGGGDFE@FDGGGGGGGGGGGGGGGGGGGGGFGGGGGFGFF	AS:i:146	XS:i:44	XN:i:0	XM:i:0	XO:i:0	XG:i:0	NM:i:0	MD:Z:73	YS:i:152	YT:Z:CP
M00653:6:000000000-C3JYV:1:2108:16795:7495	99	chrI	1	9	1S37M38S	=	1	-115	CCCACACCACACCCACACACCCACACACCACACCACACCTGTCTCTTATACACATCTCCGAGCCCACGAGACTGGA	CCCCCGGGGGGGGGGGEGGGGGGGGGGGGGGGGGGGGGGGGFGAFGGGGFGGGGGGGGGGEEGGGEGGFEEGGGGC	AS:i:74	XS:i:76	XN:i:0	XM:i:0	XO:i:0	XG:i:0	NM:i:0	MD:Z:37	YS:i:74	YT:Z:CP
M00653:6:000000000-C3JYV:1:2113:13087:7056	163	chrI	1	21	2S74M	=	2	79	ACCCACACCACACCCACACACCCACACACCACACCACACACCACACCACACCCACACACACACATCCTAACACTAC	CCCCCGGGGGGGGGGGGGGGGDGGGGGEGGG@FBFGFAF7FGGF?FF,CCE@@,BBFEGDCFG76C<<<<FFGGGE	AS:i:148	XS:i:104	XN:i:0	XM:i:0	XO:i:0	XG:i:0	NM:i:0	MD:Z:74	YS:i:152	YT:Z:CP
M00653:6:000000000-C3JYV:1:2116:15537:19642	163	chrI	1	21	3S73M	=	59	137	CACCCACACCACACCCA

In [None]:
samtools view  -F4 $outdir1/align/rep1/YPD_asf1_rep1_R1_001.PE2SE.nodup.bam | head -n20
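
The second column of each alignment record above is the SAM bitwise flag. The `-F` option of `samtools view` excludes records that have any of the given flag bits set, so `-F4` drops unmapped reads and `-F8` drops reads whose mate is unmapped. As a sketch of how a flag such as 99 breaks down into its individual bits (the helper name `decode_sam_flag` is invented for this example):

```bash
# Print the meaning of every bit that is set in a SAM flag value
decode_sam_flag() {
    local flag="$1"
    local bit name
    while read -r bit name; do
        # A bit is set when the bitwise AND with the flag is non-zero
        if [ $(( flag & bit )) -ne 0 ]; then
            echo "${bit}: ${name}"
        fi
    done <<'EOF'
1 read paired
2 read mapped in proper pair
4 read unmapped
8 mate unmapped
16 read reverse strand
32 mate reverse strand
64 first in pair
128 second in pair
256 not primary alignment
512 read fails quality checks
1024 PCR or optical duplicate
2048 supplementary alignment
EOF
}

decode_sam_flag 99
```

Flag 99 is 1 + 2 + 32 + 64: a paired read, mapped in a proper pair, whose mate is on the reverse strand, and which is the first read of the pair.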

In [13]:
tail $outdir1/*log

./qc/rep1/YPD_asf1_rep1_R1_001.PE2SE.nodup.tn5.no_chrM.25M.R1.cc.plot.pdf
./qc/rep1/YPD_asf1_rep1_R1_001.PE2SE.nodup.flagstat.qc
./qc/rep1/YPD_asf1_rep1_R1_001.PE2SE.flagstat.qc
./qc/rep1/YPD_asf1_rep1_R1_001.PE2SE.nodup.tn5.no_chrM.25M.R1.cc.qc
./qc/rep1/YPD_asf1_rep1_R1_001.PE2SE.align.log
./qc/rep1/YPD_asf1_rep1_R1_001.PE2SE.nodup.pbc.qc
./qc/rep1/YPD_asf1_rep1_R1_001.PE2SE.dup.qc
./qc/YPD_asf1_rep1_peak_overlap_final.qc

== Done report()


In [22]:
tail  $outdir2/*log

./qc/rep1/pgoddard_rtt109_YPD_1_R1_001.PE2SE.nodup.tn5.no_chrM.25M.R1.cc.plot.pdf
./qc/rep1/pgoddard_rtt109_YPD_1_R1_001.PE2SE.align.log
./qc/rep1/pgoddard_rtt109_YPD_1_R1_001.PE2SE.nodup.tn5.no_chrM.25M.R1.cc.qc
./qc/rep1/pgoddard_rtt109_YPD_1_R1_001.PE2SE.flagstat.qc
./qc/rep1/pgoddard_rtt109_YPD_1_R1_001.PE2SE.nodup.flagstat.qc
./qc/rep1/pgoddard_rtt109_YPD_1_R1_001.PE2SE.dup.qc
./qc/rep1/pgoddard_rtt109_YPD_1_R1_001.PE2SE.nodup.pbc.qc
./qc/pgoddard_rtt109_YPD_1_peak_overlap_final.qc

== Done report()


## Part 3: Examining the pipeline output

The pipeline consists of multiple modules, with output files that include the following: 

```
out                               # root dir. of outputs
│
├ *report.html                    #  HTML report
├ *tracks.json                    #  Tracks datahub (JSON) for WashU browser
├ ENCODE_summary.json             #  Metadata of all datafiles and QC results
│
├ align                           #  mapped alignments
│ ├ rep1                          #   for true replicate 1 
│ │ ├ *.trim.fastq.gz             #    adapter-trimmed fastq
│ │ ├ *.bam                       #    raw bam
│ │ ├ *.nodup.bam (E)             #    filtered and deduped bam
│ │ ├ *.tagAlign.gz               #    tagAlign (bed6) generated from filtered bam
│ │ ├ *.tn5.tagAlign.gz           #    TN5 shifted tagAlign for ATAC pipeline (not for DNase pipeline)
│ │ └ *.*M.tagAlign.gz            #    subsampled tagAlign for cross-corr. analysis
│ ├ rep2                          #   for true replicate 2
│ ...
│ ├ pooled_rep                    #   for pooled replicate
│ ├ pseudo_reps                   #   for self pseudo replicates
│ │ ├ rep1                        #    for replicate 1
│ │ │ ├ pr1                       #     for self pseudo replicate 1 of replicate 1
│ │ │ ├ pr2                       #     for self pseudo replicate 2 of replicate 1
│ │ ├ rep2                        #    for replicate 2
│ │ ...                           
│ └ pooled_pseudo_reps            #   for pooled pseudo replicates
│   ├ ppr1                        #    for pooled pseudo replicate 1 (rep1-pr1 + rep2-pr1 + ...)
│   └ ppr2                        #    for pooled pseudo replicate 2 (rep1-pr2 + rep2-pr2 + ...)
│
├ peak                             #  peaks called
│ └ macs2                          #   peaks generated by MACS2
│   ├ rep1                         #    for replicate 1
│   │ ├ *.narrowPeak.gz            #     narrowPeak (p-val threshold = 0.01)
│   │ ├ *.filt.narrowPeak.gz (E)   #     blacklist filtered narrowPeak 
│   │ ├ *.narrowPeak.bb (E)        #     narrowPeak bigBed
│   │ ├ *.narrowPeak.hammock.gz    #     narrowPeak track for WashU browser
│   │ ├ *.pval0.1.narrowPeak.gz    #     narrowPeak (p-val threshold = 0.1)
│   │ └ *.pval0.1.*K.narrowPeak.gz #     narrowPeak (p-val threshold = 0.1) with top *K peaks
│   ├ rep2                         #    for replicate 2
│   ...
│   ├ pseudo_reps                          #   for self pseudo replicates
│   ├ pooled_pseudo_reps                   #   for pooled pseudo replicates
│   ├ overlap                              #   naive-overlapped peaks
│   │ ├ *.naive_overlap.narrowPeak.gz      #     naive-overlapped peak
│   │ └ *.naive_overlap.filt.narrowPeak.gz #     naive-overlapped peak after blacklist filtering
│   └ idr                           #   IDR thresholded peaks
│     ├ true_reps                   #    for replicate 1
│     │ ├ *.narrowPeak.gz           #     IDR thresholded narrowPeak
│     │ ├ *.filt.narrowPeak.gz (E)  #     IDR thresholded narrowPeak (blacklist filtered)
│     │ └ *.12-col.bed.gz           #     IDR thresholded narrowPeak track for WashU browser
│     ├ pseudo_reps                 #    for self pseudo replicates
│     │ ├ rep1                      #    for replicate 1
│     │ ...
│     ├ optimal_set                 #    optimal IDR thresholded peaks
│     │ └ *.filt.narrowPeak.gz (E)  #     IDR thresholded narrowPeak (blacklist filtered)
│     ├ conservative_set            #    conservative IDR thresholded peaks
│     │ └ *.filt.narrowPeak.gz (E)  #     IDR thresholded narrowPeak (blacklist filtered)
│     ├ pseudo_reps                 #    for self pseudo replicates
│     └ pooled_pseudo_reps          #    for pooled pseudo replicate
│
│   
│ 
├ qc                              #  QC logs
│ ├ *IDR_final.qc                 #   Final IDR QC
│ ├ rep1                          #   for true replicate 1
│ │ ├ *.align.log                 #    Bowtie2 mapping stat log
│ │ ├ *.dup.qc                    #    Picard (or sambamba) MarkDuplicate QC log
│ │ ├ *.pbc.qc                    #    PBC QC
│ │ ├ *.nodup.flagstat.qc         #    Flagstat QC for filtered bam
│ │ ├ *M.cc.qc                    #    Cross-correlation analysis score for tagAlign
│ │ ├ *M.cc.plot.pdf/png          #    Cross-correlation analysis plot for tagAlign
│ │ └ *_qc.html/txt               #    ATAQC report
│ ...
│
├ signal                          #  signal tracks
│ ├ macs2                         #   signal tracks generated by MACS2
│ │ ├ rep1                        #    for true replicate 1 
│ │ │ ├ *.pval.signal.bigwig (E)  #     signal track for p-val
│ │ │ └ *.fc.signal.bigwig   (E)  #     signal track for fold change
│ ...
│ └ pooled_rep                    #   for pooled replicate
│ 
├ report                          # files for HTML report
└ meta                            # text files containing md5sum of output files and other metadata
```

Let's examine how well the reads aligned to the reference saccer3 genome. We'd like to see an overall alignment rate of at least 90%.

In [20]:
cat $outdir1/qc/rep1/*align.log

302572 reads; of these:
  302572 (100.00%) were paired; of these:
    69186 (22.87%) aligned concordantly 0 times
    129828 (42.91%) aligned concordantly exactly 1 time
    103558 (34.23%) aligned concordantly >1 times
    ----
    69186 pairs aligned concordantly 0 times; of these:
      22201 (32.09%) aligned discordantly 1 time
    ----
    46985 pairs aligned 0 times concordantly or discordantly; of these:
      93970 mates make up the pairs; of these:
        56485 (60.11%) aligned 0 times
        510 (0.54%) aligned exactly 1 time
        36975 (39.35%) aligned >1 times
90.67% overall alignment rate


In [21]:
cat $outdir2/qc/rep1/*align.log

2169329 reads; of these:
  2169329 (100.00%) were paired; of these:
    265572 (12.24%) aligned concordantly 0 times
    989355 (45.61%) aligned concordantly exactly 1 time
    914402 (42.15%) aligned concordantly >1 times
    ----
    265572 pairs aligned concordantly 0 times; of these:
      8319 (3.13%) aligned discordantly 1 time
    ----
    257253 pairs aligned 0 times concordantly or discordantly; of these:
      514506 mates make up the pairs; of these:
        491734 (95.57%) aligned 0 times
        2967 (0.58%) aligned exactly 1 time
        19805 (3.85%) aligned >1 times
88.67% overall alignment rate
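
To compare these logs against the 90% target without reading them by eye, something like the following sketch could be used. The function names `alignment_rate` and `check_alignment` are invented here, and the demonstration runs on a toy log that contains only the summary line.

```bash
# Extract the overall alignment rate (as a plain number) from a Bowtie2 log
alignment_rate() {
    awk '/overall alignment rate/ { sub(/%/, "", $1); print $1 }' "$1"
}

# Report whether a log meets a minimum rate (second argument, default 90)
check_alignment() {
    local rate
    rate="$(alignment_rate "$1")"
    awk -v r="${rate}" -v min="${2:-90}" 'BEGIN {
        if (r + 0 >= min + 0) print "PASS (" r "%)"
        else print "BELOW TARGET (" r "%)"
    }'
}

# Toy log with only the summary line (illustration only)
toy_log="$(mktemp)"
echo "88.67% overall alignment rate" > "${toy_log}"
check_alignment "${toy_log}"
rm -f "${toy_log}"
```

The overall rate counts individual mates rather than pairs. For the first sample above, (2 × (129828 + 103558) + 2 × 22201 + 510 + 36975) / (2 × 302572) ≈ 90.67%, which meets the target, while the second sample at 88.67% falls slightly below it.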


Now, let's examine how many peaks were called for each sample. We use the *zcat* command to print the contents of a gzip-compressed file, and *wc -l* to count its lines (one peak per line).

In [41]:
zcat $outdir1/peak/macs2/overlap/optimal_set/*narrowPeak.gz | wc -l 

1179


In [42]:
zcat $outdir2/peak/macs2/overlap/optimal_set/*narrowPeak.gz | wc -l 

2904


In [43]:
zcat $outdir1/peak/macs2/overlap/optimal_set/*narrowPeak.gz | head -n20

chrI	189973	191383	Peak_1564	38	.	2.61324	3.88988	2.32072	1293
chrI	189973	191383	Peak_1575	38	.	2.58621	3.83181	2.26678	1117
chrI	189973	191383	Peak_248	151	.	5.48673	15.17563	12.69517	454
chrI	189973	191383	Peak_276	142	.	5.28169	14.27812	11.86040	605
chrI	28668	28856	Peak_1113	55	.	3.42857	5.55696	3.82029	85
chrI	32938	33365	Peak_295	138	.	4.74479	13.87073	11.48363	140
chrI	34319	35043	Peak_1559	39	.	2.49816	3.92306	2.35115	291
chrI	34319	35043	Peak_1758	34	.	2.35121	3.44516	1.94253	94
chrI	34319	35043	Peak_511	99	.	4.02085	9.98134	7.87844	622
chrI	34	638	Peak_468	103	.	5.16072	10.39738	8.25911	480
chrI	34	638	Peak_912	63	.	3.87054	6.39987	4.57554	128
chrI	45144	45634	Peak_588	90	.	4.35233	9.05212	7.01809	236
chrI	62478	62770	Peak_789	72	.	3.78486	7.25446	5.36251	175
chrI	67957	69202	Peak_189	172	.	5.05370	17.24330	14.61938	616
chrI	67957	69202	Peak_275	142	.	4.55696	14.27955	11.86162	298
chrI	67957	69202	Peak_612	87	.	3.51759	8.79596	6.78057	804
chrI	67957	69202	Pea
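
The ten tab-separated columns of a narrowPeak record are: chromosome, start, end, peak name, score, strand, signal value, -log10 p-value, -log10 q-value, and the offset of the peak summit from the start coordinate. As a sketch of how these columns can be used, the example below computes the summit position (start + offset) for three records copied from the output above and lists them from strongest to weakest signal value. The function name `peak_summits` is invented for this example.

```bash
# Three narrowPeak records copied from the output above
toy_peaks="$(mktemp)"
printf 'chrI\t189973\t191383\tPeak_1564\t38\t.\t2.61324\t3.88988\t2.32072\t1293\n'  > "${toy_peaks}"
printf 'chrI\t189973\t191383\tPeak_248\t151\t.\t5.48673\t15.17563\t12.69517\t454\n' >> "${toy_peaks}"
printf 'chrI\t28668\t28856\tPeak_1113\t55\t.\t3.42857\t5.55696\t3.82029\t85\n'      >> "${toy_peaks}"

# Print: peak name, chromosome, summit position (start + offset), signal value
peak_summits() {
    awk -v OFS='\t' '{ print $4, $1, $2 + $10, $7 }' "$1"
}

# Sort by signal value (4th output column), strongest first
peak_summits "${toy_peaks}" | sort -k4,4nr
rm -f "${toy_peaks}"
```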

## Part 4: Visualizing signal tracks in the WashU and UCSC genome browsers ##

The pipeline uses the MACS2 peak caller to generate two types of signal tracks across the yeast genome: 

* P-value Tracks 
* Fold Change Tracks 

In [40]:
ls $outdir1/signal/macs2/rep1/

ls $outdir2/signal/macs2/rep1/

YPD_asf1_rep1_R1_001.PE2SE.nodup.tn5.pf.fc.signal.bigwig
YPD_asf1_rep1_R1_001.PE2SE.nodup.tn5.pf.pval.signal.bigwig
pgoddard_rtt109_YPD_1_R1_001.PE2SE.nodup.tn5.pf.fc.signal.bigwig
pgoddard_rtt109_YPD_1_R1_001.PE2SE.nodup.tn5.pf.pval.signal.bigwig


These files are in binary format, so we cannot print their contents to the terminal, but a number of genome browser tools have been developed that allow us to visualize their contents.  Two of the most popular of these are

* UCSC Genome Browser (https://genome.ucsc.edu/cgi-bin/hgGateway) 

* WashU Epigenome Browser (https://epigenomegateway.wustl.edu/) 

Both browsers enable you to upload or link your data for visualization. The most efficient way to do this is to place your bigwig files on a publicly accessible web server and link to them from the browser. 

We have uploaded the fold change and pval bigwigs to the mitra server, here: 

http://mitra.stanford.edu/kundaje/tc2018

In that directory, you see a folder of fc bigwig tracks (http://mitra.stanford.edu/kundaje/tc2018/fc_tracks/) as well as a folder of pval bigwig tracks (http://mitra.stanford.edu/kundaje/tc2018/pvalue_tracks/)

You can visualize the full set of fc or pval bigwigs by following this link: http://mitra.stanford.edu/kundaje/tc2018/saccer3_tracks.html 

We will now go step-by-step through the process used to generate this visualization. To begin, point your browser to 
https://epigenomegateway.wustl.edu/


It's quite inefficient to upload our 35 track files one by one. To visualize files in bulk, the WashU browser allows you to upload "datahubs". A datahub is a file in the JSON format, which uses a nested syntax to specify how the files are to be visualized. If you're curious, there's more information about such JSON datahubs here: http://washugb.blogspot.com/2012/04/data-hub.html. 


We have generated datahubs for our fc and pval bigwig files here: 

http://mitra.stanford.edu/kundaje/tc2018/pval.datahub.json and

http://mitra.stanford.edu/kundaje/tc2018/fc.datahub.json

Don't worry about the syntax of these files for now (you can generally copy their syntax and substitute your own file names and URLs). The main point is that these hubs can be used to group visualizations of multiple browser tracks. 
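
As a rough sketch of how such a hub could be generated from a list of track URLs, the example below writes a JSON array with one entry per bigwig file. It assumes that each entry needs only `type`, `name`, and `url` attributes, which is a simplification; the hub files linked above are the authoritative reference for the attributes the browser expects. The function name `make_datahub` and the two track file names are hypothetical.

```bash
# Read track URLs (one per line) on stdin and write a JSON array to stdout.
# Assumption: each entry carries only "type", "name" and "url".
make_datahub() {
    awk 'BEGIN { print "[" }
         {
             # Use the file name without the .bigwig suffix as the track name
             n = split($0, parts, "/")
             name = parts[n]
             sub(/\.bigwig$/, "", name)
             if (NR > 1) print ","
             printf "  {\"type\": \"bigwig\", \"name\": \"%s\", \"url\": \"%s\"}", name, $0
         }
         END { print "\n]" }'
}

# Hypothetical track file names under the fc_tracks folder
printf '%s\n' \
    'http://mitra.stanford.edu/kundaje/tc2018/fc_tracks/sampleA.fc.signal.bigwig' \
    'http://mitra.stanford.edu/kundaje/tc2018/fc_tracks/sampleB.fc.signal.bigwig' \
    | make_datahub
```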

## Part 5: Creating a merged peak set across all samples for downstream analysis 

Finally, we merge the peaks across all conditions to create a master list of peaks for analysis. To do this, we concatenate the optimal-set peak calls from all experiments, sort them, and merge them. 

We take the output of the processing pipeline from the `$AGGREGATE_ANALYSIS_DIR` directory. This is the same analysis you performed above, but gathered in one location for all experiments conducted. 

In [44]:
cd $WORK_DIR



In [45]:

#Use the "find" command to identify all optimal-set narrowPeak output files (naive-overlap peaks) and write them to a file. 
find -L $AGGREGATE_ANALYSIS_DIR  -wholename "*peak/macs2/overlap/optimal_set/*narrowPeak.gz" > narrowPeak_files.txt

#sanity check the file 
head narrowPeak_files.txt


/srv/scratch/training_camp/aggregate_analysis/yiuwong_WT_YPD_1/peak/macs2/overlap/optimal_set/yiuwong_WT_YPD_1_rep1-pr.naive_overlap.narrowPeak.gz
/srv/scratch/training_camp/aggregate_analysis/yiuwong_rtt109_YPGE_1/peak/macs2/overlap/optimal_set/yiuwong_rtt109_YPGE_1_rep1-pr.naive_overlap.narrowPeak.gz
/srv/scratch/training_camp/aggregate_analysis/mkoska_WT_YPGE_3/peak/macs2/overlap/optimal_set/mkoska_WT_YPGE_3_rep1-pr.naive_overlap.narrowPeak.gz
/srv/scratch/training_camp/aggregate_analysis/dcotter_asf1_YPGE_3/peak/macs2/overlap/optimal_set/YPGE_asf1_rep2_rep1-pr.naive_overlap.narrowPeak.gz
/srv/scratch/training_camp/aggregate_analysis/rosaxma_rtt109_YPD_4/peak/macs2/overlap/optimal_set/rosaxma_rtt109_YPD_4_rep1-pr.naive_overlap.narrowPeak.gz
/srv/scratch/training_camp/aggregate_analysis/rpatel7_asf1_YPGE_4/peak/macs2/overlap/optimal_set/rpatel7_asf1_YPGE_4_rep1-pr.naive_overlap.narrowPeak.gz
/srv/scratch/training_camp/aggregate_analysis/raungar_asf1_YPGE_6/peak/macs2/overlap/op

In [46]:
wc -l narrowPeak_files.txt

35 narrowPeak_files.txt


In [47]:
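#Now, iterate through the list of narrowPeak files and concatenate them into a single master peak list. 
#Start from an empty file so that re-running this cell does not append duplicate peaks.
rm -f all.peaks.bed
while read -r f
do 
    zcat "$f" >> all.peaks.bed
done < narrowPeak_files.txt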

#sanity check the all.peaks.bed file 
head all.peaks.bed


chrI	0	768	Peak_515	3535	.	8.35297	353.50931	351.46481	484
chrI	0	768	Peak_88	9544	.	15.56806	954.43042	951.47919	168
chrI	113337	113580	Peak_3252	131	.	1.93954	13.19590	12.03852	134
chrI	114307	114827	Peak_2769	207	.	2.23694	20.75704	19.52608	411
chrI	114307	114827	Peak_3937	37	.	1.42233	3.76329	2.76032	135
chrI	119682	119885	Peak_3033	163	.	2.06885	16.31106	15.12077	98
chrI	128874	129274	Peak_3147	147	.	1.97935	14.78236	13.60782	135
chrI	128874	129274	Peak_3640	76	.	1.65228	7.61521	6.53318	299
chrI	129841	130841	Peak_1518	814	.	3.80832	81.40385	79.89246	696
chrI	129841	130841	Peak_2299	333	.	2.63074	33.30013	31.98524	408


In [48]:
#sort the concatenated file 
bedtools sort -i all.peaks.bed > all.peaks.sorted.bed 

head all.peaks.sorted.bed 

chrI	0	766	Peak_321	2052	.	8.33502	205.25972	202.89413	479
chrI	0	638	Peak_63	3227	.	20.94427	322.78024	319.64966	164
chrI	0	638	Peak_545	769	.	7.35614	76.99364	74.98401	476
chrI	0	638	Peak_113	2100	.	13.71821	210.03169	207.20430	154
chrI	0	638	Peak_300	2198	.	9.27775	219.86606	217.47575	482
chrI	0	638	Peak_58	3931	.	13.36497	393.15579	389.91458	155
chrI	0	638	Peak_507	1012	.	9.74152	101.21449	99.17241	466
chrI	0	637	Peak_21	1879	.	27.93623	187.96126	184.36090	156
chrI	0	637	Peak_484	386	.	9.42228	38.67609	36.61895	481
chrI	0	857	Peak_96	7829	.	16.38464	782.98938	779.95404	153


In [52]:
bedtools sort  --help 


*****ERROR: Unrecognized parameter: --help *****


Tool:    bedtools sort (aka sortBed)
Version: v2.17.0
Summary: Sorts a feature file in various and useful ways.

Usage:   bedtools sort [OPTIONS] -i <bed/gff/vcf>

Options: 
	-sizeA		Sort by feature size in ascending order.
	-sizeD		Sort by feature size in descending order.
	-chrThenSizeA	Sort by chrom (asc), then feature size (asc).
	-chrThenSizeD	Sort by chrom (asc), then feature size (desc).
	-chrThenScoreA	Sort by chrom (asc), then score (asc).
	-chrThenScoreD	Sort by chrom (asc), then score (desc).
	-header	Print the header from the A file prior to results.



In [53]:
#merge the sorted, concatenated file to join overlapping peaks 
bedtools merge -i all.peaks.sorted.bed > all_merged.peaks.bed 

head all_merged.peaks.bed

chrI	0	857
chrI	2415	2586
chrI	6315	6556
chrI	14706	14936
chrI	20592	21210
chrI	28570	28931
chrI	29238	29452
chrI	29729	30050
chrI	31624	35831
chrI	42233	42693
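
To make the sort-then-merge logic concrete, the sketch below reproduces it on a toy BED file using only `sort` and `awk`, so it runs without bedtools. It illustrates the idea and is not how bedtools is implemented. Intervals that overlap or directly abut are joined, which matches the default behavior of `bedtools merge`. The function name `merge_intervals` is invented for this example.

```bash
# Merge overlapping or abutting intervals in a BED file (any input order)
merge_intervals() {
    sort -k1,1 -k2,2n "$1" | awk -v OFS='\t' '
        # First interval: start a new merged region
        NR == 1 { chrom = $1; cur_start = $2 + 0; cur_end = $3 + 0; next }
        # Same chromosome and starts inside (or at the end of) the region: extend it
        $1 == chrom && $2 + 0 <= cur_end {
            if ($3 + 0 > cur_end) cur_end = $3 + 0
            next
        }
        # Otherwise emit the finished region and start a new one
        { print chrom, cur_start, cur_end
          chrom = $1; cur_start = $2 + 0; cur_end = $3 + 0 }
        END { if (NR > 0) print chrom, cur_start, cur_end }'
}

# Toy peaks: three overlapping intervals near the chromosome start, one separate
toy_bed="$(mktemp)"
printf 'chrI\t0\t638\nchrI\t2415\t2586\nchrI\t0\t857\nchrI\t500\t766\n' > "${toy_bed}"
merge_intervals "${toy_bed}"
rm -f "${toy_bed}"
```

The three intervals starting at 0 and 500 collapse into a single region from 0 to 857, while the interval at 2415 to 2586 is reported unchanged.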


In [54]:
#Finally, we use the awk command to add row numbers to the merged peak file, such that each peak has a unique identifier. 

#We cannot do this 'in place', so we use an intermediate output file 
awk  -v OFS='\t' '{print $0,NR}' all_merged.peaks.bed > o.tmp
mv o.tmp all_merged.peaks.bed

head all_merged.peaks.bed

chrI	0	857	1
chrI	2415	2586	2
chrI	6315	6556	3
chrI	14706	14936	4
chrI	20592	21210	5
chrI	28570	28931	6
chrI	29238	29452	7
chrI	29729	30050	8
chrI	31624	35831	9
chrI	42233	42693	10


## Part 6: Creating read count and fold change matrices.

We would like to calculate the signal strength in each sample at the genomic regions in **all_merged.peaks.bed**. As we saw above, the ATAC-seq pipeline generates genome-wide fold change and p-value signal tracks for each sample that can be used for this calculation (the \*fc.signal.bigwig and \*pval.signal.bigwig files). We use the **bigWigAverageOverBed** utility to compute the mean signal from the pval tracks and the mean signal from the fold change tracks for each genomic region in each sample. 

In [55]:
bigWigAverageOverBed

bigWigAverageOverBed v2 - Compute average score of big wig over each bed, which may have introns.
usage:
   bigWigAverageOverBed in.bw in.bed out.tab
The output columns are:
   name - name field from bed, which should be unique
   size - size of bed (sum of exon sizes
   covered - # bases within exons covered by bigWig
   sum - sum of values over all bases covered
   mean0 - average over bases with non-covered bases counting as zeroes
   mean - average over just covered bases
Options:
   -stats=stats.ra - Output a collection of overall statistics to stat.ra file
   -bedOut=out.bed - Make output bed that is echo of input bed but with mean column appended
   -sampleAroundCenter=N - Take sample at region N bases wide centered around bed item, rather
                     than the usual sample in the bed item.
   -minMax - include two additional columns containing the min and max observed in the area.
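
The output table has six tab-separated columns in the order listed above, so column 5 is `mean0`. This is also why each region needs a unique name in the fourth BED column, which we added with `awk` in Part 5. As a sketch of how per-sample tables can be combined into a matrix with one column per sample, the example below uses two made-up tables in the same six-column layout; the sample names and values are invented for illustration.

```bash
# Toy tables in the bigWigAverageOverBed layout:
# name, size, covered, sum, mean0, mean
work="$(mktemp -d)"
printf '1\t857\t800\t1600\t1.867\t2\n2\t171\t171\t513\t3\t3\n' > "${work}/sampleA.tab"
printf '1\t857\t857\t857\t1\t1\n2\t171\t0\t0\t0\t0\n'          > "${work}/sampleB.tab"

# For each sample, write a header line followed by the mean0 column (column 5)
for tab in "${work}"/*.tab; do
    sample="$(basename "${tab}" .tab)"
    { echo "${sample}"; cut -f5 "${tab}"; } > "${work}/${sample}.col"
done

# Place the per-sample columns side by side
matrix="$(paste "${work}"/*.col)"
echo "${matrix}"
rm -r "${work}"
```

Each row of the resulting matrix corresponds to one region, in the same order as the BED file, and each column to one sample.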



In [56]:
#First, we find all the fold change bigWig files
cd $WORK_DIR
find -L $AGGREGATE_ANALYSIS_DIR  -name "*fc*bigwig" > all.fc.bigwig
head all.fc.bigwig


/srv/scratch/training_camp/aggregate_analysis/yiuwong_WT_YPD_1/signal/macs2/rep1/yiuwong_WT_YPD_1_R1_001.PE2SE.nodup.tn5.pf.fc.signal.bigwig
/srv/scratch/training_camp/aggregate_analysis/yiuwong_rtt109_YPGE_1/signal/macs2/rep1/yiuwong_rtt109_YPGE_1_R1_001.PE2SE.nodup.tn5.pf.fc.signal.bigwig
/srv/scratch/training_camp/aggregate_analysis/mkoska_WT_YPGE_3/signal/macs2/rep1/mkoska_WT_YPGE_3_R1_001.PE2SE.nodup.tn5.pf.fc.signal.bigwig
/srv/scratch/training_camp/aggregate_analysis/dcotter_asf1_YPGE_3/signal/macs2/rep1/YPGE_asf1_rep2_R1_001.PE2SE.nodup.tn5.pf.fc.signal.bigwig
/srv/scratch/training_camp/aggregate_analysis/rosaxma_rtt109_YPD_4/signal/macs2/rep1/rosaxma_rtt109_YPD_4_R1_001.PE2SE.nodup.tn5.pf.fc.signal.bigwig
/srv/scratch/training_camp/aggregate_analysis/rpatel7_asf1_YPGE_4/signal/macs2/rep1/rpatel7_asf1_YPGE_4_R1_001.PE2SE.nodup.tn5.pf.fc.signal.bigwig
/srv/scratch/training_camp/aggregate_analysis/raungar_asf1_YPGE_6/signal/macs2/rep1/YPGE_asf1_rep1_R1_001.PE2SE.nodup.tn5.p

In [57]:
wc -l all.fc.bigwig

35 all.fc.bigwig


In [58]:
#Iterate through all bigWig fold change tracks to compute mean signal strength at each genomic region 
for f in `cat all.fc.bigwig`
do

    #we extract the part of the filename that corresponds to the sample name and write it as the header in the fc.signal file
    sample_name=`basename $f | awk -F'[.]' '{print $1}'`
    echo "$sample_name"
    echo $sample_name > $sample_name.fc.signal.tmp 
    
    
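    #column 5 of the bigWigAverageOverBed output is mean0 (the average over all bases in the region, counting uncovered bases as zero)
    bigWigAverageOverBed $f all_merged.peaks.bed $sample_name.fc.signal.data.tmp 
    cut -f5 $sample_name.fc.signal.data.tmp >> $sample_name.fc.signal.tmp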

    #cleanup the intermediate file 
    rm $sample_name.fc.signal.data.tmp 
done
paste *fc.signal.tmp > all.fc.txt
#cleanup intermediate files that were generated 
rm *.tmp

#examine the output 
head all.fc.txt

yiuwong_WT_YPD_1_R1_001
processing chromosomes................
yiuwong_rtt109_YPGE_1_R1_001
processing chromosomes................
mkoska_WT_YPGE_3_R1_001
processing chromosomes................
YPGE_asf1_rep2_R1_001
processing chromosomes................
rosaxma_rtt109_YPD_4_R1_001
processing chromosomes................
rpatel7_asf1_YPGE_4_R1_001
processing chromosomes................
YPGE_asf1_rep1_R1_001
processing chromosomes................
ktomins_WT_YPGE_5_R1_001
processing chromosomes................
dmaghini_asf1_YPD_5_R1_001
processing chromosomes................
hrosenbl_WT_YPGE_1_R1_001
processing chromosomes................
gamador_rtt109_YPGE_6_R1_001
processing chromosomes................
ambenj_asf1_YPD_2_R1_001
processing chromosomes................
jarod_rtt109_YPGE_4_R1_001
processing chromosomes................
YPD_WT_rep2_R1_001
processing chromosomes................
YPD_WT_rep1_R1_001
processing chromosomes................
YPD_rtt109_r

In addition to the fold change data matrix, we would also like to know the number of reads that pile up at each peak region. This is useful for determining differential chromatin accessibility across samples. 
To calculate the read count matrix, we will use the **bedtools coverage** command on the *tagAlign* files generated by the processing pipeline. 
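
Conceptually, the read count for a peak is the number of tagAlign intervals that overlap it. The sketch below computes this on toy data with plain `awk`, so it runs without bedtools. It is meant to show what is being counted and is not how bedtools is implemented (it compares every read with every peak, which is only practical for small inputs). The function name `count_reads_per_peak` and the toy coordinates are invented for this example.

```bash
# $1: reads (chrom, start, end); $2: peaks (chrom, start, end, id)
# Prints each peak followed by the number of overlapping reads.
# Note: the NR == FNR idiom assumes the reads file is not empty.
count_reads_per_peak() {
    awk -v OFS='\t' '
        # First file: store every read interval
        NR == FNR { rc[NR] = $1; rs[NR] = $2 + 0; re[NR] = $3 + 0; n = NR; next }
        # Second file: count stored reads overlapping this peak (half-open intervals)
        {
            count = 0
            for (i = 1; i <= n; i++)
                if (rc[i] == $1 && rs[i] < $3 + 0 && re[i] > $2 + 0) count++
            print $1, $2, $3, $4, count
        }' "$1" "$2"
}

toy_reads="$(mktemp)"
toy_regions="$(mktemp)"
printf 'chrI\t10\t60\nchrI\t850\t900\nchrI\t2500\t2550\nchrII\t10\t60\n' > "${toy_reads}"
printf 'chrI\t0\t857\t1\nchrI\t2415\t2586\t2\n' > "${toy_regions}"
count_reads_per_peak "${toy_reads}" "${toy_regions}"
rm -f "${toy_reads}" "${toy_regions}"
```

Here the first peak overlaps two reads and the second overlaps one. When switching to `bedtools coverage`, check which input plays which role: in the v2.17.0 installed here the reads are passed with `-a` and the peaks with `-b`, whereas in bedtools 2.24.0 and later the roles of `-a` and `-b` are swapped.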

In [59]:
#First, we find all the tagAlign files
cd $WORK_DIR
find -L $AGGREGATE_ANALYSIS_DIR  -name "*nodup.tn5.no_chrM.25M.R1.tagAlign*" > all.tagAlign.files.txt

head all.tagAlign.files.txt

/srv/scratch/training_camp/aggregate_analysis/yiuwong_WT_YPD_1/align/rep1/yiuwong_WT_YPD_1_R1_001.PE2SE.nodup.tn5.no_chrM.25M.R1.tagAlign.gz
/srv/scratch/training_camp/aggregate_analysis/yiuwong_rtt109_YPGE_1/align/rep1/yiuwong_rtt109_YPGE_1_R1_001.PE2SE.nodup.tn5.no_chrM.25M.R1.tagAlign.gz
/srv/scratch/training_camp/aggregate_analysis/mkoska_WT_YPGE_3/align/rep1/mkoska_WT_YPGE_3_R1_001.PE2SE.nodup.tn5.no_chrM.25M.R1.tagAlign.gz
/srv/scratch/training_camp/aggregate_analysis/dcotter_asf1_YPGE_3/align/rep1/YPGE_asf1_rep2_R1_001.PE2SE.nodup.tn5.no_chrM.25M.R1.tagAlign.gz
/srv/scratch/training_camp/aggregate_analysis/rosaxma_rtt109_YPD_4/align/rep1/rosaxma_rtt109_YPD_4_R1_001.PE2SE.nodup.tn5.no_chrM.25M.R1.tagAlign.gz
/srv/scratch/training_camp/aggregate_analysis/rpatel7_asf1_YPGE_4/align/rep1/rpatel7_asf1_YPGE_4_R1_001.PE2SE.nodup.tn5.no_chrM.25M.R1.tagAlign.gz
/srv/scratch/training_camp/aggregate_analysis/raungar_asf1_YPGE_6/align/rep1/YPGE_asf1_rep1_R1_001.PE2SE.nodup.tn5.no_chrM.

In [60]:
wc -l all.tagAlign.files.txt

35 all.tagAlign.files.txt


In [61]:
#Let's see how the bedtools coverage command works
bedtools coverage


Tool:    bedtools coverage (aka coverageBed)
Version: v2.17.0
Summary: Returns the depth and breadth of coverage of features from A
	 on the intervals in B.

Usage:   bedtools coverage [OPTIONS] -a <bed/gff/vcf> -b <bed/gff/vcf>

Options: 
	-abam	The A input file is in BAM format.

	-s	Require same strandedness.  That is, only counts hits in A that
		overlap B on the _same_ strand.
		- By default, overlaps are counted without respect to strand.

	-S	Require different strandedness.  That is, only report hits in A
		that overlap B on the _opposite_ strand.
		- By default, overlaps are counted without respect to strand.

	-hist	Report a histogram of coverage for each feature in B
		as well as a summary histogram for _all_ features in B.

		Output (tab delimited) after each feature in B:
		  1) depth
		  2) # bases at depth
		  3) size of B
		  4) % of B at depth

	-d	Report the depth at each position in each B feature.
		Positions reported are one based.  Eac

In [62]:
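#Iterate through all tagAlign files to compute the read count at each peak region.
#With bedtools v2.17.0 (the version shown above), coverage counts the features in A (reads)
#that overlap each interval in B (peaks) and appends the count after the 4 peak columns,
#so the count is column 5. bedtools 2.24 and later swap the roles of -a and -b.
for f in $(cat all.tagAlign.files.txt)
do
    sample_name=$(basename "$f" | awk -F'[.]' '{print $1}')
    echo "$sample_name"
    #start the per-sample column with the sample name as its header
    echo "$sample_name" > "$sample_name.readcount.tmp"
    zcat "$f" | bedtools coverage -counts -a stdin -b all_merged.peaks.bed | cut -f5 >> "$sample_name.readcount.tmp"
done
#paste expands the glob in sorted order, so the columns are ordered by sample name,
#not in the order the loop processed the files
paste *.readcount.tmp > all.readcount.txt
#cleanup the temporary files
rm *.tmp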

#examine the output 
head all.readcount.txt

yiuwong_WT_YPD_1_R1_001
yiuwong_rtt109_YPGE_1_R1_001
mkoska_WT_YPGE_3_R1_001
YPGE_asf1_rep2_R1_001
rosaxma_rtt109_YPD_4_R1_001
rpatel7_asf1_YPGE_4_R1_001
YPGE_asf1_rep1_R1_001
ktomins_WT_YPGE_5_R1_001
dmaghini_asf1_YPD_5_R1_001
hrosenbl_WT_YPGE_1_R1_001
gamador_rtt109_YPGE_6_R1_001
ambenj_asf1_YPD_2_R1_001
jarod_rtt109_YPGE_4_R1_001
YPD_WT_rep2_R1_001
YPD_WT_rep1_R1_001
YPD_rtt109_rep1_R1_001
kjhanson_rtt109_YPD_6_R1_001
jkcheng_rtt109_YPGE_3_R1_001
ambenj_rtt109_YPGE_2_R1_001
jkcheng_WT_YPD_3_R1_001
dmaghini_WT_YPD_5_R1_001
YPGE_WT_rep1_R1_001
YPD_rtt109_rep2_R1_001
gamador_WT_YPGE_6_R1_001
rosaxma_asf1_YPD_4_R1_001
egreenwa_rtt109_YPD_5_R1_001
pgoddard_rtt109_YPD_1_R1_001
mkoska_asf1_YPD_3_R1_001
kjhanson_WT_YPD_6_R1_001
rpatel7_WT_YPGE_4_R1_001
YPD_asf1_rep1_R1_001
egreenwa_asf1_YPD_6_R1_001
jarod_asf1_YPGE_5_R1_001
ktomins_rtt109_YPGE_5_R1_001
pgoddard_asf1_YPGE_1_R1_001
ambenj_asf1_YPD_2_R1_001	ambenj_rtt109_YPGE_2_R1_001	dmaghini_asf1_YPD_5_R1_0

We observe that the counts in the first and second columns are on different scales. This makes sense: a sample that had more reads to begin with will have higher raw counts at each peak. 
We can address this problem with sample normalization, covered in the next section.
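
One quick way to see the scale difference is to total the counts in each column. The sketch below does this with `awk` on a toy matrix; the sample names `sampleA` and `sampleB` and all values are hypothetical, and the layout (one header line, then one tab-separated row per peak) is assumed to match `all.readcount.txt` as it stands at this point in the notebook.

```shell
# Toy count matrix: a header line followed by one row per peak (hypothetical values)
toy_matrix=$(printf 'sampleA\tsampleB\n36\t3\n1354\t136\n217\t20\n')

# Sum each column, skipping the header, and print one "sample<TAB>total" line per sample
toy_totals=$(printf '%s\n' "$toy_matrix" | awk -F'\t' -v OFS='\t' '
    NR == 1 { for (i = 1; i <= NF; i++) name[i] = $i; ncol = NF; next }
    { for (i = 1; i <= NF; i++) total[i] += $i }
    END { for (i = 1; i <= ncol; i++) print name[i], total[i] }')
echo "$toy_totals"
```

A sample whose total is roughly ten times larger will also tend to have roughly ten times larger counts at every peak, which is why raw counts should not be compared directly across samples.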


In [63]:
#Finally, we add the peak coordinates and IDs to our counts file and fold change file so we can keep track of which row 
#corresponds to which peak. 
#This cell edits files in place, so run it only once: rerunning it would add a duplicate header line and duplicate peak columns.


#add a header to the merged peak file 
sed -i '1i\Chrom\tStart\tEnd\tID' all_merged.peaks.bed

#prepend the peak annotation columns to the fold change signal matrix
paste all_merged.peaks.bed all.fc.txt > o.tmp 
mv o.tmp all.fc.txt 

#do the same for the read count matrix
paste all_merged.peaks.bed all.readcount.txt > o.tmp
mv o.tmp all.readcount.txt
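
Because `paste` joins files line by line without checking that the rows correspond, it can be worth confirming that both files have the same number of lines before combining them. The following is a minimal sketch of such a check on toy files; the file names (`peaks.bed`, `counts.txt`, `annotated.txt`) and sample names are hypothetical stand-ins for the real ones.

```shell
# Toy stand-ins for the peak file (with header) and a count matrix (with header)
tmpdir=$(mktemp -d)
printf 'Chrom\tStart\tEnd\tID\nchrI\t0\t857\t1\nchrI\t2415\t2586\t2\n' > "$tmpdir/peaks.bed"
printf 'sampleA\tsampleB\n36\t25\n1354\t2882\n' > "$tmpdir/counts.txt"

# Only paste when the two files have the same number of lines
n_peaks=$(wc -l < "$tmpdir/peaks.bed")
n_rows=$(wc -l < "$tmpdir/counts.txt")
if [ "$n_peaks" -eq "$n_rows" ]; then
    paste "$tmpdir/peaks.bed" "$tmpdir/counts.txt" > "$tmpdir/annotated.txt"
    check_status="ok"
    annotated_header=$(head -n 1 "$tmpdir/annotated.txt")
else
    check_status="mismatch"
    annotated_header=""
fi
echo "$check_status"
rm -r "$tmpdir"
```

A matching line count does not prove that the rows are in the same order, but a mismatch is a clear sign that something went wrong upstream.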



In [64]:
head all.readcount.txt


Chrom	Start	End	ID	ambenj_asf1_YPD_2_R1_001	ambenj_rtt109_YPGE_2_R1_001	dmaghini_asf1_YPD_5_R1_001	dmaghini_WT_YPD_5_R1_001	egreenwa_asf1_YPD_6_R1_001	egreenwa_rtt109_YPD_5_R1_001	gamador_rtt109_YPGE_6_R1_001	gamador_WT_YPGE_6_R1_001	hrosenbl_WT_YPGE_1_R1_001	jarod_asf1_YPGE_5_R1_001	jarod_rtt109_YPGE_4_R1_001	jkcheng_rtt109_YPGE_3_R1_001	jkcheng_WT_YPD_3_R1_001	kjhanson_rtt109_YPD_6_R1_001	kjhanson_WT_YPD_6_R1_001	ktomins_rtt109_YPGE_5_R1_001	ktomins_WT_YPGE_5_R1_001	mkoska_asf1_YPD_3_R1_001	mkoska_WT_YPGE_3_R1_001	pgoddard_asf1_YPGE_1_R1_001	pgoddard_rtt109_YPD_1_R1_001	rosaxma_asf1_YPD_4_R1_001	rosaxma_rtt109_YPD_4_R1_001	rpatel7_asf1_YPGE_4_R1_001	rpatel7_WT_YPGE_4_R1_001	yiuwong_rtt109_YPGE_1_R1_001	yiuwong_WT_YPD_1_R1_001	YPD_asf1_rep1_R1_001	YPD_rtt109_rep1_R1_001	YPD_rtt109_rep2_R1_001	YPD_WT_rep1_R1_001	YPD_WT_rep2_R1_001	YPGE_asf1_rep1_R1_001	YPGE_asf1_rep2_R1_001	YPGE_WT_rep1_R1_001
chrI	0	857	1	36	25	21	35	14	11	58	18	50	33	10	77	40	31	22	25	8	131	17	102	39	30	27	3	51	57	8

In [65]:
head all.fc.txt

Chrom	Start	End	ID	ambenj_asf1_YPD_2_R1_001	ambenj_rtt109_YPGE_2_R1_001	dmaghini_asf1_YPD_5_R1_001	dmaghini_WT_YPD_5_R1_001	egreenwa_asf1_YPD_6_R1_001	egreenwa_rtt109_YPD_5_R1_001	gamador_rtt109_YPGE_6_R1_001	gamador_WT_YPGE_6_R1_001	hrosenbl_WT_YPGE_1_R1_001	jarod_asf1_YPGE_5_R1_001	jarod_rtt109_YPGE_4_R1_001	jkcheng_rtt109_YPGE_3_R1_001	jkcheng_WT_YPD_3_R1_001	kjhanson_rtt109_YPD_6_R1_001	kjhanson_WT_YPD_6_R1_001	ktomins_rtt109_YPGE_5_R1_001	ktomins_WT_YPGE_5_R1_001	mkoska_asf1_YPD_3_R1_001	mkoska_WT_YPGE_3_R1_001	pgoddard_asf1_YPGE_1_R1_001	pgoddard_rtt109_YPD_1_R1_001	rosaxma_asf1_YPD_4_R1_001	rosaxma_rtt109_YPD_4_R1_001	rpatel7_asf1_YPGE_4_R1_001	rpatel7_WT_YPGE_4_R1_001	yiuwong_rtt109_YPGE_1_R1_001	yiuwong_WT_YPD_1_R1_001	YPD_asf1_rep1_R1_001	YPD_rtt109_rep1_R1_001	YPD_rtt109_rep2_R1_001	YPD_WT_rep1_R1_001	YPD_WT_rep2_R1_001	YPGE_asf1_rep1_R1_001	YPGE_asf1_rep2_R1_001	YPGE_WT_rep1_R1_001
chrI	0	857	1	4.25008	7.21524	4.27744	6.06117	5.64958	10.6068	6.53717	9.58935	5.73521	6.11552

In examining the files, we notice that all the sample names in the header end with the suffix "\_R1_001". This is an artifact generated by the processing pipeline. Because it is shared by all samples, this part of the name is not informative for our purposes, and we can remove it with the **sed** command. The expression `s/_R1_001//g` replaces every occurrence of `_R1_001` with an empty string, and the `-i` flag edits the file in place. The syntax is illustrated below: 

In [66]:
sed -i 's/_R1_001//g' all.fc.txt
sed -i 's/_R1_001//g' all.readcount.txt
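
Since `sed -i` overwrites the file, it can help to try the substitution on a single line first. Here is a small sketch that applies the same expression to a toy header line without `-i`, so nothing on disk is changed; the variable names `toy_header` and `clean_header` are made up for this example.

```shell
# A toy header line using two of the sample names from the matrix above
toy_header=$(printf 'Chrom\tStart\tEnd\tID\tYPD_WT_rep1_R1_001\tYPD_WT_rep2_R1_001')

# s/PATTERN/REPLACEMENT/g replaces every match of PATTERN on each line.
# An empty REPLACEMENT deletes the match. Without -i, sed prints the result
# instead of modifying a file, which makes this a safe dry run.
clean_header=$(printf '%s\n' "$toy_header" | sed 's/_R1_001//g')
echo "$clean_header"
```

Once the printed line looks right, the same expression can be applied in place with `-i`, as in the cell above.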




In [67]:
head all.fc.txt

Chrom	Start	End	ID	ambenj_asf1_YPD_2	ambenj_rtt109_YPGE_2	dmaghini_asf1_YPD_5	dmaghini_WT_YPD_5	egreenwa_asf1_YPD_6	egreenwa_rtt109_YPD_5	gamador_rtt109_YPGE_6	gamador_WT_YPGE_6	hrosenbl_WT_YPGE_1	jarod_asf1_YPGE_5	jarod_rtt109_YPGE_4	jkcheng_rtt109_YPGE_3	jkcheng_WT_YPD_3	kjhanson_rtt109_YPD_6	kjhanson_WT_YPD_6	ktomins_rtt109_YPGE_5	ktomins_WT_YPGE_5	mkoska_asf1_YPD_3	mkoska_WT_YPGE_3	pgoddard_asf1_YPGE_1	pgoddard_rtt109_YPD_1	rosaxma_asf1_YPD_4	rosaxma_rtt109_YPD_4	rpatel7_asf1_YPGE_4	rpatel7_WT_YPGE_4	yiuwong_rtt109_YPGE_1	yiuwong_WT_YPD_1	YPD_asf1_rep1	YPD_rtt109_rep1	YPD_rtt109_rep2	YPD_WT_rep1	YPD_WT_rep2	YPGE_asf1_rep1	YPGE_asf1_rep2	YPGE_WT_rep1
chrI	0	857	1	4.25008	7.21524	4.27744	6.06117	5.64958	10.6068	6.53717	9.58935	5.73521	6.11552	6.1878	6.54996	5.88486	8.36391	9.53371	3.61219	6.22783	5.56345	8.38938	4.85431	5.62175	5.86927	6.23959	7.97977	7.07626	4.6925	6.81927	3.10887	2.32237	2.94599	4.56884	3.93238	1.47424	2.63765	4.39358
chrI	2415	2586	2	0.377759	1.26268	0.736095	0.

In [68]:
head all.readcount.txt

Chrom	Start	End	ID	ambenj_asf1_YPD_2	ambenj_rtt109_YPGE_2	dmaghini_asf1_YPD_5	dmaghini_WT_YPD_5	egreenwa_asf1_YPD_6	egreenwa_rtt109_YPD_5	gamador_rtt109_YPGE_6	gamador_WT_YPGE_6	hrosenbl_WT_YPGE_1	jarod_asf1_YPGE_5	jarod_rtt109_YPGE_4	jkcheng_rtt109_YPGE_3	jkcheng_WT_YPD_3	kjhanson_rtt109_YPD_6	kjhanson_WT_YPD_6	ktomins_rtt109_YPGE_5	ktomins_WT_YPGE_5	mkoska_asf1_YPD_3	mkoska_WT_YPGE_3	pgoddard_asf1_YPGE_1	pgoddard_rtt109_YPD_1	rosaxma_asf1_YPD_4	rosaxma_rtt109_YPD_4	rpatel7_asf1_YPGE_4	rpatel7_WT_YPGE_4	yiuwong_rtt109_YPGE_1	yiuwong_WT_YPD_1	YPD_asf1_rep1	YPD_rtt109_rep1	YPD_rtt109_rep2	YPD_WT_rep1	YPD_WT_rep2	YPGE_asf1_rep1	YPGE_asf1_rep2	YPGE_WT_rep1
chrI	0	857	1	36	25	21	35	14	11	58	18	50	33	10	77	40	31	22	25	8	131	17	102	39	30	27	3	51	57	88	3	1	7	3	3	4	3	4
chrI	2415	2586	2	1354	2882	1503	2884	560	354	2873	1347	4899	2759	1441	3221	3123	1009	948	1032	703	4604	6601	6020	1513	1459	1239	285	3889	4071	5241	136	108	160	111	212	139	162	220
chrI	6315	6556	3	217	222	155	350	70	48	258	88	

We have now generated a read count matrix and a fold change signal matrix for the peak regions in our dataset. 
This completes the basic data processing pipeline. 
Now, on to drawing conclusions about our data. 