# Counting Reads

In this notebook, I'll count the number of reads in both untrimmed and trimmed *C. virginica* gonad sequence data from Illumina.

1. Trimmed files
2. Untrimmed files

## 0. Prepare for analyses

### 0a. Set working directory

In [1]:
pwd

'/Users/yaamini/Documents/yaamini-virginica/notebooks'

In [4]:
cd ../data/

/Users/yaamini/Documents/yaamini-virginica/data


In [5]:
!mkdir 2019-03-17-Counting-Reads

In [6]:
cd 2019-03-17-Counting-Reads/

/Users/yaamini/Documents/yaamini-virginica/data/2019-03-17-Counting-Reads


## 1. Trimmed files

Since FastQC was run on my files after trimming, I can use the FastQC reports to get the number of reads in each file. In the Basic Statistics module, FastQC reports Total Sequences (i.e. total reads), which here is the number of reads remaining after trimming.

### 1a. Download files

In [14]:
!mkdir 2019-03-17-FastQC-Reports

In [15]:
cd 2019-03-17-FastQC-Reports/

/Users/yaamini/Documents/yaamini-virginica/data/2019-03-17-Counting-Reads/2019-03-17-FastQC-Reports


In [16]:
#Download files from owl. The files will be downloaded in the same directory structure they are in online.
!wget -r -l1 --no-parent -A_fastqc.zip \
http://owl.fish.washington.edu/Athaliana/20180411_trimgalore_10bp_Cvirginica_MBD/20180411_fastqc_trim_10bp_Cvirginica_MBD/

--2019-03-18 09:39:10--  http://owl.fish.washington.edu/Athaliana/20180411_trimgalore_10bp_Cvirginica_MBD/20180411_fastqc_trim_10bp_Cvirginica_MBD/
Resolving owl.fish.washington.edu... 128.95.149.83
Connecting to owl.fish.washington.edu|128.95.149.83|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: 'owl.fish.washington.edu/Athaliana/20180411_trimgalore_10bp_Cvirginica_MBD/20180411_fastqc_trim_10bp_Cvirginica_MBD/index.html'

owl.fish.washington     [ <=>                ]  10.61K  --.-KB/s    in 0s      

2019-03-18 09:39:10 (61.3 MB/s) - 'owl.fish.washington.edu/Athaliana/20180411_trimgalore_10bp_Cvirginica_MBD/20180411_fastqc_trim_10bp_Cvirginica_MBD/index.html' saved [10864]

Loading robots.txt; please ignore errors.
--2019-03-18 09:39:10--  http://owl.fish.washington.edu/robots.txt
Reusing existing connection to owl.fish.washington.edu:80.
HTTP request sent, awaiting response... 404 Not Found
2019-03-18 09:39:10 ERROR 404: N

In [17]:
#Move all files from owl folder to the current directory
!mv owl.fish.washington.edu/Athaliana/20180411_trimgalore_10bp_Cvirginica_MBD/20180411_fastqc_trim_10bp_Cvirginica_MBD/* .

In [18]:
#Confirm all files were moved
!ls

multiqc_data                     zr2096_4_s1_R2_val_2_fastqc.zip
owl.fish.washington.edu          zr2096_5_s1_R1_val_1_fastqc.zip
zr2096_10_s1_R1_val_1_fastqc.zip zr2096_5_s1_R2_val_2_fastqc.zip
zr2096_10_s1_R2_val_2_fastqc.zip zr2096_6_s1_R1_val_1_fastqc.zip
zr2096_1_s1_R1_val_1_fastqc.zip  zr2096_6_s1_R2_val_2_fastqc.zip
zr2096_1_s1_R2_val_2_fastqc.zip  zr2096_7_s1_R1_val_1_fastqc.zip
zr2096_2_s1_R1_val_1_fastqc.zip  zr2096_7_s1_R2_val_2_fastqc.zip
zr2096_2_s1_R2_val_2_fastqc.zip  zr2096_8_s1_R1_val_1_fastqc.zip
zr2096_3_s1_R1_val_1_fastqc.zip  zr2096_8_s1_R2_val_2_fastqc.zip
zr2096_3_s1_R2_val_2_fastqc.zip  zr2096_9_s1_R1_val_1_fastqc.zip
zr2096_4_s1_R1_val_1_fastqc.zip  zr2096_9_s1_R2_val_2_fastqc.zip


In [19]:
#Remove the empty owl directory
!rm -r owl.fish.washington.edu

### 1b. Count reads

First, I'll test a loop and ensure it identifies all of the files I want to use by having the loop print the filename of each file (`f`):

In [30]:
%%bash
for f in *zip
do
    echo "${f}"
done

zr2096_10_s1_R1_val_1_fastqc.zip
zr2096_10_s1_R2_val_2_fastqc.zip
zr2096_1_s1_R1_val_1_fastqc.zip
zr2096_1_s1_R2_val_2_fastqc.zip
zr2096_2_s1_R1_val_1_fastqc.zip
zr2096_2_s1_R2_val_2_fastqc.zip
zr2096_3_s1_R1_val_1_fastqc.zip
zr2096_3_s1_R2_val_2_fastqc.zip
zr2096_4_s1_R1_val_1_fastqc.zip
zr2096_4_s1_R2_val_2_fastqc.zip
zr2096_5_s1_R1_val_1_fastqc.zip
zr2096_5_s1_R2_val_2_fastqc.zip
zr2096_6_s1_R1_val_1_fastqc.zip
zr2096_6_s1_R2_val_2_fastqc.zip
zr2096_7_s1_R1_val_1_fastqc.zip
zr2096_7_s1_R2_val_2_fastqc.zip
zr2096_8_s1_R1_val_1_fastqc.zip
zr2096_8_s1_R2_val_2_fastqc.zip
zr2096_9_s1_R1_val_1_fastqc.zip
zr2096_9_s1_R2_val_2_fastqc.zip


Now that I know it works, I'm going to count the number of reads in each file. I will first unzip each file with `unzip`.

In [35]:
%%bash
for f in *zip
do
    unzip "${f}"
done

Archive:  zr2096_10_s1_R1_val_1_fastqc.zip
   creating: zr2096_10_s1_R1_val_1_fastqc/
   creating: zr2096_10_s1_R1_val_1_fastqc/Icons/
   creating: zr2096_10_s1_R1_val_1_fastqc/Images/
  inflating: zr2096_10_s1_R1_val_1_fastqc/Icons/fastqc_icon.png  
  inflating: zr2096_10_s1_R1_val_1_fastqc/Icons/error.png  
  inflating: zr2096_10_s1_R1_val_1_fastqc/Icons/tick.png  
  inflating: zr2096_10_s1_R1_val_1_fastqc/summary.txt  
  inflating: zr2096_10_s1_R1_val_1_fastqc/Images/per_base_quality.png  
  inflating: zr2096_10_s1_R1_val_1_fastqc/Images/per_tile_quality.png  
  inflating: zr2096_10_s1_R1_val_1_fastqc/Images/per_sequence_quality.png  
  inflating: zr2096_10_s1_R1_val_1_fastqc/Images/per_base_sequence_content.png  
  inflating: zr2096_10_s1_R1_val_1_fastqc/Images/per_sequence_gc_content.png  
  inflating: zr2096_10_s1_R1_val_1_fastqc/Images/per_base_n_content.png  
  inflating: zr2096_10_s1_R1_val_1_fastqc/Images/sequence_length_distribution.png  
  inflating: zr2096_10_s1_R1_val_1_f

In [36]:
#Confirm files were unzipped
!ls

zr2096_10_s1_R1_val_1_fastqc     zr2096_5_s1_R1_val_1_fastqc
zr2096_10_s1_R1_val_1_fastqc.zip zr2096_5_s1_R1_val_1_fastqc.zip
zr2096_10_s1_R2_val_2_fastqc     zr2096_5_s1_R2_val_2_fastqc
zr2096_10_s1_R2_val_2_fastqc.zip zr2096_5_s1_R2_val_2_fastqc.zip
zr2096_1_s1_R1_val_1_fastqc      zr2096_6_s1_R1_val_1_fastqc
zr2096_1_s1_R1_val_1_fastqc.zip  zr2096_6_s1_R1_val_1_fastqc.zip
zr2096_1_s1_R2_val_2_fastqc      zr2096_6_s1_R2_val_2_fastqc
zr2096_1_s1_R2_val_2_fastqc.zip  zr2096_6_s1_R2_val_2_fastqc.zip
zr2096_2_s1_R1_val_1_fastqc      zr2096_7_s1_R1_val_1_fastqc
zr2096_2_s1_R1_val_1_fastqc.zip  zr2096_7_s1_R1_val_1_fastqc.zip
zr2096_2_s1_R2_val_2_fastqc      zr2096_7_s1_R2_val_2_fastqc
zr2096_2_s1_R2_val_2_fastqc.zip  zr2096_7_s1_R2_val_2_fastqc.zip
zr2096_3_s1_R1_val_1_fastqc      zr2096_8_s1_R1_val_1_fastqc
zr2096_3_s1_R1_v

Now, I'll use `grep` to find the "Total Sequences" line in each sample's `fastqc_data.txt`. Using `>>`, I can append the result to a new file each time the loop runs, so the file ends up holding the counts for every sample.

In [49]:
%%bash
for f in *fastqc
do
    grep "Total Sequences" "${f}/fastqc_data.txt" \
    >> 2019-03-17-Trimmed-Read-Counts.txt
done

In [51]:
#Confirm total sequences were counted. The first 2 lines correspond to sample 10.
!head 2019-03-17-Trimmed-Read-Counts.txt

Total Sequences	17448883
Total Sequences	17448883
Total Sequences	28603346
Total Sequences	28603346
Total Sequences	30325606
Total Sequences	30325606
Total Sequences	29548753
Total Sequences	29548753
Total Sequences	23970516
Total Sequences	23970516
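
Because the `grep` output carries no filename, the counts above can only be matched to samples by their order. Here is a minimal sketch of one way to keep the label, assuming the unzipped `*_fastqc` directories from the previous step. The function name `labeled_read_counts` is hypothetical and not part of the original analysis.

```shell
# Hypothetical helper: print "<report directory><TAB><Total Sequences>" for
# every unzipped FastQC report directory found inside the directory given as $1.
labeled_read_counts() {
    for d in "$1"/*_fastqc
    do
        # Skip anything without a report (including an unmatched glob)
        [ -f "${d}/fastqc_data.txt" ] || continue
        # On the matching line, field 1 is the label and field 2 the count
        awk -F'\t' -v name="$(basename "${d}")" \
            '$1 == "Total Sequences" { print name "\t" $2 }' \
            "${d}/fastqc_data.txt"
    done
}

# Usage: list the counts for the current directory.
# Redirect with ">" rather than ">>" to save them without duplicating
# lines when the cell is re-run.
labeled_read_counts .
```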


In [59]:
#Sum the contents of the second column ($2), dividing each count by 2 because every read pair is counted twice (once in R1, once in R2). The result is the total number of read pairs.
!awk -F"\t" '{ sum += $2 / 2 } END { print sum }' 2019-03-17-Trimmed-Read-Counts.txt

275914272
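
The division by 2 relies on every sample contributing two equal counts, one from R1 and one from R2. Here is a hedged sketch of how that assumption could be checked before summing: `paste - -` joins each consecutive pair of lines, and `awk` compares the two counts. The function name `sum_read_pairs` is hypothetical, and the sketch assumes the file lists R1 and R2 for each sample on adjacent lines, as in the `head` output above.

```shell
# Hypothetical helper: sum read pairs from a file of "Total Sequences<TAB>N"
# lines in which each sample's R1 and R2 counts sit on adjacent lines.
sum_read_pairs() {
    # paste - - merges every two lines into one tab-separated row, so the
    # R1 count lands in field 2 and the R2 count in field 4
    paste - - < "$1" | awk -F'\t' '
        $2 != $4 {
            # Report any sample whose R1 and R2 counts disagree
            printf("mismatch on row %d: %s vs %s\n", NR, $2, $4) > "/dev/stderr"
            bad = 1
        }
        { pairs += $2 }
        END {
            if (bad) exit 1
            # %.0f avoids exponent notation for large totals
            printf("%.0f\n", pairs)
        }'
}
```

For example, `sum_read_pairs 2019-03-17-Trimmed-Read-Counts.txt` would print the number of read pairs if every R1/R2 pair agrees, and would otherwise list the mismatched rows and return a non-zero exit status.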


## 2. Untrimmed files

### 2a. Download files

In [None]:
#Download files from owl. The files will be downloaded in the same directory structure they are in online.
!wget -r -l1 --no-parent -A "_s1_R1_val_1.fq.gz,_s1_R2_val_2.fq.gz" \
http://owl.fish.washington.edu/Athaliana/20180411_trimgalore_10bp_Cvirginica_MBD/20180411_fastqc_trim_10bp_Cvirginica_MBD/

In [None]:
#Move all files from owl folder to the current directory
!mv owl.fish.washington.edu/Athaliana/20180411_trimgalore_10bp_Cvirginica_MBD/* .

In [None]:
#Confirm all files were moved
!ls

In [None]:
#Remove the empty owl directory
!rm -r owl.fish.washington.edu

### 2b. Count reads
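
Here is a minimal sketch of one common way to count reads directly from gzipped FASTQ files, not necessarily the approach used for this dataset. Each FASTQ record spans four lines, so the read count is the line count divided by 4. The function name `count_fastq_reads` is hypothetical, and the sketch assumes standard four-line records with no wrapped sequence lines.

```shell
# Hypothetical helper: print the number of reads in one gzipped FASTQ file.
# Assumes four lines per record (header, sequence, "+", qualities).
count_fastq_reads() {
    # gzip -dc streams the decompressed contents to stdout without
    # writing an uncompressed copy to disk
    gzip -dc "$1" | awk 'END { printf("%.0f\n", NR / 4) }'
}

# Usage: one line per file, filename and read count separated by a tab
for f in *.fq.gz
do
    [ -e "${f}" ] || continue  # skip if the glob matched nothing
    printf '%s\t%s\n' "${f}" "$(count_fastq_reads "${f}")"
done
```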