# BWA

v-02-03

- paper: Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM
- paper URL: https://arxiv.org/abs/1303.3997
- GitHub: https://github.com/lh3/bwa
- My GitHub: https://github.com/jingwora/bwa

- Paired-End Sequencing: https://www.illumina.com/science/technology/next-generation-sequencing/plan-experiments/paired-end-vs-single-read.html
- Reference: https://www.youtube.com/watch?v=PSYnQMRSoPY

Experiment:

1. Agy99.fasta -> index  [bwa index]
2. R1.fastq.gz + R2.fastq.gz + index -> unsorted alignments (output.bam) [bwa mem]
3. R1.fastq.gz + R2.fastq.gz + index -> coordinate-sorted BAM (output.sorted.bam) [bwa mem | samtools sort]

Environment
- Google Colab
- bwa 0.7.17-r1198-dirty (built from source)
- SAMtools 1.16.1 (built from source)

Experiment Result

```
Installation Duration:     0:01:33.431302
Data Load Duration:        0:06:45.588115
bwa index Duration:        0:00:03.849388
bwa mem unsorted Duration: 0:07:08.097268
bwa mem sorted Duration:   0:07:26.602248
Total Duration:            0:22:57.568321
```

Input:
- P7741_R1.fastq.gz (47 MB)
- P7741_R2.fastq.gz (53 MB)
- Agy99.fasta (5 MB)

Output:
- output.bam (321 MB; uncompressed SAM text, because bwa mem writes SAM rather than BAM)
- output.sorted.bam (75 MB)

## Experiment

## Environment Setup

In [None]:
# C version
!gcc -dM -E - < /dev/null | grep __STDC_VERSION__ | awk '{ print $2 " --> " $3 }'

__STDC_VERSION__ --> 201112L


#### BWA

In [None]:
from datetime import datetime
time01 = datetime.now()

In [None]:
!git clone https://github.com/lh3/bwa.git
!cd bwa; make
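# The next three lines are the usage examples from the bwa README, kept as run.
# ref.fa, read-se.fq.gz, read1.fq and read2.fq are placeholder names that do not
# exist here, and ./bwa is the cloned directory rather than the binary (bwa/bwa),
# so these calls produce no alignments. They only leave behind the aln-se.sam.gz
# and aln-pe.sam.gz files that show up in later listings.
!./bwa index ref.fa
!./bwa mem ref.fa read-se.fq.gz | gzip -3 > aln-se.sam.gz
!./bwa mem ref.fa read1.fq read2.fq | gzip -3 > aln-pe.sam.gz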

Cloning into 'bwa'...
remote: Enumerating objects: 4398, done.
remote: Counting objects: 100% (87/87), done.
remote: Compressing objects: 100% (57/57), done.
remote: Total 4398 (delta 38), reused 53 (delta 26), pack-reused 4311
Receiving objects: 100% (4398/4398), 1.72 MiB | 6.97 MiB/s, done.
Resolving deltas: 100% (3123/3123), done.
gcc -c -g -Wall -Wno-unused-function -O2 -DHAVE_PTHREAD -DUSE_MALLOC_WRAPPERS   utils.c -o utils.o
gcc -c -g -Wall -Wno-unused-function -O2 -DHAVE_PTHREAD -DUSE_MALLOC_WRAPPERS   kthread.c -o kthread.o
gcc -c -g -Wall -Wno-unused-function -O2 -DHAVE_PTHREAD -DUSE_MALLOC_WRAPPERS   kstring.c -o kstring.o
gcc -c -g -Wall -Wno-unused-function -O2 -DHAVE_PTHREAD -DUSE_MALLOC_WRAPPERS   ksw.c -o ksw.o
gcc -c -g -Wall -Wno-unused-function -O2 -DHAVE_PTHREAD -DUSE_MALLOC_WRAPPERS   bwt.c -o bwt.o
gcc -c -g -Wall -Wno-unused-function -O2 -DHAVE_PTHREAD -DUSE_MALLOC_WRAPPERS   bntseq.c -o bntseq.o
gcc -c -g -Wall -Wno-unused-function -O2 -DHAVE_PTHREAD 

#### SAMtools

In [None]:
!wget https://github.com/jingwora/bioinformatics-tools/raw/main/SAMtools/samtools-1.16.1.tar.bz2
!sudo tar xvjf samtools-1.16.1.tar.bz2
!cd samtools-1.16.1; make

--2023-01-04 03:16:01--  https://github.com/jingwora/bioinformatics-tools/raw/main/SAMtools/samtools-1.16.1.tar.bz2
Resolving github.com (github.com)... 20.27.177.113
Connecting to github.com (github.com)|20.27.177.113|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/jingwora/bioinformatics-tools/main/SAMtools/samtools-1.16.1.tar.bz2 [following]
--2023-01-04 03:16:02--  https://raw.githubusercontent.com/jingwora/bioinformatics-tools/main/SAMtools/samtools-1.16.1.tar.bz2
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 8217689 (7.8M) [application/octet-stream]
Saving to: ‘samtools-1.16.1.tar.bz2’


2023-01-04 03:16:03 (74.5 MB/s) - ‘samtools-1.16.1.tar.bz2’ saved [8217689/8217689]

samtools-1.16.1

In [None]:
time02 = datetime.now()

## Data Download

#### fastq.gz R1 and R2 from EBI

In [None]:
!wget http://ftp.sra.ebi.ac.uk/vol1/run/ERR333/ERR3335404/P7741_R1.fastq.gz

--2023-01-04 03:17:19--  http://ftp.sra.ebi.ac.uk/vol1/run/ERR333/ERR3335404/P7741_R1.fastq.gz
Resolving ftp.sra.ebi.ac.uk (ftp.sra.ebi.ac.uk)... 193.62.193.138
Connecting to ftp.sra.ebi.ac.uk (ftp.sra.ebi.ac.uk)|193.62.193.138|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 48483635 (46M) [application/x-gzip]
Saving to: ‘P7741_R1.fastq.gz’


2023-01-04 03:20:28 (252 KB/s) - ‘P7741_R1.fastq.gz’ saved [48483635/48483635]



In [None]:
!wget http://ftp.sra.ebi.ac.uk/vol1/run/ERR333/ERR3335404/P7741_R2.fastq.gz

--2023-01-04 03:20:28--  http://ftp.sra.ebi.ac.uk/vol1/run/ERR333/ERR3335404/P7741_R2.fastq.gz
Resolving ftp.sra.ebi.ac.uk (ftp.sra.ebi.ac.uk)... 193.62.193.138
Connecting to ftp.sra.ebi.ac.uk (ftp.sra.ebi.ac.uk)|193.62.193.138|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 54395207 (52M) [application/x-gzip]
Saving to: ‘P7741_R2.fastq.gz’


2023-01-04 03:24:02 (250 KB/s) - ‘P7741_R2.fastq.gz’ saved [54395207/54395207]



#### Agy99.fasta from NCBI
- Mycobacterium ulcerans Agy99, complete genome (GenBank CP000325.1), mirrored in my GitHub repository
- https://www.ncbi.nlm.nih.gov/nuccore/CP000325.1

In [None]:
!wget https://github.com/jingwora/bwa/raw/master/dataset2/ncbi/Agy99.fasta

--2023-01-04 03:24:02--  https://github.com/jingwora/bwa/raw/master/dataset2/ncbi/Agy99.fasta
Resolving github.com (github.com)... 20.27.177.113
Connecting to github.com (github.com)|20.27.177.113|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/jingwora/bwa/master/dataset2/ncbi/Agy99.fasta [following]
--2023-01-04 03:24:03--  https://raw.githubusercontent.com/jingwora/bwa/master/dataset2/ncbi/Agy99.fasta
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 5712117 (5.4M) [text/plain]
Saving to: ‘Agy99.fasta’


2023-01-04 03:24:04 (57.8 MB/s) - ‘Agy99.fasta’ saved [5712117/5712117]



In [None]:
!ls

Agy99.fasta    bwa		  sample_data
aln-pe.sam.gz  P7741_R1.fastq.gz  samtools-1.16.1
aln-se.sam.gz  P7741_R2.fastq.gz  samtools-1.16.1.tar.bz2
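
The two FASTQ downloads ran at roughly 250 KB/s, so an interrupted transfer is a realistic risk. A minimal sketch of a completeness check follows, assuming the byte counts reported on the `Length:` lines of the wget logs above are the correct sizes. The helper name `find_size_mismatches` is hypothetical and not part of the original notebook.

```python
import os

# Byte counts copied from the "Length:" lines of the wget logs above.
EXPECTED_SIZES = {
    "P7741_R1.fastq.gz": 48483635,
    "P7741_R2.fastq.gz": 54395207,
    "Agy99.fasta": 5712117,
}


def find_size_mismatches(expected, directory="."):
    """Return {name: (expected_bytes, actual_bytes_or_None)} for files that differ.

    A missing file is reported with an actual size of None.
    """
    mismatches = {}
    for name, want in expected.items():
        path = os.path.join(directory, name)
        have = os.path.getsize(path) if os.path.exists(path) else None
        if have != want:
            mismatches[name] = (want, have)
    return mismatches


if __name__ == "__main__":
    problems = find_size_mismatches(EXPECTED_SIZES)
    print("all files complete" if not problems else problems)
```

A size match only shows that the transfer finished; it does not verify file contents the way a checksum would.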


In [None]:
time03 = datetime.now()

## Processing

### bwa index

In [None]:
!mkdir index
!mv Agy99.fasta index/

In [None]:
!bwa/bwa index index/Agy99.fasta

[bwa_index] Pack FASTA... 0.06 sec
[bwa_index] Construct BWT for the packed sequence...
[bwa_index] 2.15 seconds elapse.
[bwa_index] Update BWT... 0.05 sec
[bwa_index] Pack forward-only FASTA... 0.03 sec
[bwa_index] Construct SA from BWT and Occ... 0.95 sec
[main] Version: 0.7.17-r1198-dirty
[main] CMD: bwa/bwa index index/Agy99.fasta
[main] Real time: 3.388 sec; CPU: 3.256 sec


In [None]:
!ls /content/index

Agy99.fasta	 Agy99.fasta.ann  Agy99.fasta.pac
Agy99.fasta.amb  Agy99.fasta.bwt  Agy99.fasta.sa
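
The listing shows the five files that `bwa index` wrote next to the reference. As a rough sketch, a guard like the one below could confirm that they are all present before `bwa mem` is started. The suffix list is taken from the listing above, and the function name `missing_index_files` is made up for this example.

```python
import os

# Suffixes of the index files seen next to Agy99.fasta in the listing above.
BWA_INDEX_SUFFIXES = (".amb", ".ann", ".bwt", ".pac", ".sa")


def missing_index_files(reference):
    """Return the index files expected beside `reference` that do not exist."""
    return [
        reference + suffix
        for suffix in BWA_INDEX_SUFFIXES
        if not os.path.exists(reference + suffix)
    ]


if __name__ == "__main__":
    missing = missing_index_files("index/Agy99.fasta")
    print("index complete" if not missing else f"missing: {missing}")
```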


In [None]:
time04 = datetime.now()

### bwa mem

In [None]:
!bwa/bwa mem -t 8 index/Agy99.fasta /content/P7741_R1.fastq.gz /content/P7741_R2.fastq.gz > output.bam

[M::bwa_idx_load_from_disk] read 0 ALT contigs
[M::process] read 362848 sequences (80000393 bp)...
[M::process] read 183294 sequences (43629854 bp)...
[M::mem_pestat] # candidate unique pairs for (FF, FR, RF, RR): (1132, 132319, 27, 1018)
[M::mem_pestat] analyzing insert size distribution for orientation FF...
[M::mem_pestat] (25, 50, 75) percentile: (223, 379, 632)
[M::mem_pestat] low and high boundaries for computing mean and std.dev: (1, 1450)
[M::mem_pestat] mean and std.dev: (430.55, 288.91)
[M::mem_pestat] low and high boundaries for proper pairs: (1, 1859)
[M::mem_pestat] analyzing insert size distribution for orientation FR...
[M::mem_pestat] (25, 50, 75) percentile: (204, 324, 466)
[M::mem_pestat] low and high boundaries for computing mean and std.dev: (1, 990)
[M::mem_pestat] mean and std.dev: (347.47, 190.58)
[M::mem_pestat] low and high boundaries for proper pairs: (1, 1252)
[M::mem_pestat] analyzing insert size distribution for orientation RF...
[M::mem_pestat] (25, 50, 75

In [None]:
!ls

aln-pe.sam.gz  index		  P7741_R2.fastq.gz  samtools-1.16.1.tar.bz2
aln-se.sam.gz  output.bam	  sample_data
bwa	       P7741_R1.fastq.gz  samtools-1.16.1


In [None]:
time05 = datetime.now()

### bwa mem | samtools sort

In [None]:
!bwa/bwa mem -t 8 index/Agy99.fasta /content/P7741_R1.fastq.gz /content/P7741_R2.fastq.gz | samtools-1.16.1/samtools sort -o output.sorted.bam -

[M::bwa_idx_load_from_disk] read 0 ALT contigs
[M::process] read 362848 sequences (80000393 bp)...
[M::process] read 183294 sequences (43629854 bp)...
[M::mem_pestat] # candidate unique pairs for (FF, FR, RF, RR): (1132, 132319, 27, 1018)
[M::mem_pestat] analyzing insert size distribution for orientation FF...
[M::mem_pestat] (25, 50, 75) percentile: (223, 379, 632)
[M::mem_pestat] low and high boundaries for computing mean and std.dev: (1, 1450)
[M::mem_pestat] mean and std.dev: (430.55, 288.91)
[M::mem_pestat] low and high boundaries for proper pairs: (1, 1859)
[M::mem_pestat] analyzing insert size distribution for orientation FR...
[M::mem_pestat] (25, 50, 75) percentile: (204, 324, 466)
[M::mem_pestat] low and high boundaries for computing mean and std.dev: (1, 990)
[M::mem_pestat] mean and std.dev: (347.47, 190.58)
[M::mem_pestat] low and high boundaries for proper pairs: (1, 1252)
[M::mem_pestat] analyzing insert size distribution for orientation RF...
[M::mem_pestat] (25, 50, 75
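
The `mem_pestat` lines report, for each pair orientation, the insert-size quartiles and two derived ranges. The sketch below reproduces the logged ranges from the logged quartiles, mean and standard deviation. The multipliers (2.0 times the interquartile range for the statistics range, 3.0 times for proper pairs, widened to the mean plus or minus 4 standard deviations) are my reading of the bwa source and should be treated as assumptions; they do match the numbers in this log. The function name `insert_size_bounds` is hypothetical.

```python
def insert_size_bounds(p25, p75, mean, std):
    """Return ((stat_low, stat_high), (proper_low, proper_high)).

    Assumed constants: 2.0 * IQR for the range used to compute mean/std.dev,
    3.0 * IQR for proper pairs, widened to mean +/- 4 * std if that is larger.
    """
    iqr = p75 - p25

    # Range of insert sizes used when computing the mean and std.dev.
    stat_low = max(1, int(p25 - 2.0 * iqr + 0.499))
    stat_high = int(p75 + 2.0 * iqr + 0.499)

    # Range within which a read pair is treated as properly paired.
    low = int(p25 - 3.0 * iqr + 0.499)
    high = int(p75 + 3.0 * iqr + 0.499)
    if low > mean - 4.0 * std:
        low = int(mean - 4.0 * std + 0.499)
    if high < mean + 4.0 * std:
        high = int(mean + 4.0 * std + 0.499)
    low = max(1, low)

    return (stat_low, stat_high), (low, high)


# FR orientation from the log: quartiles (204, 324, 466), mean 347.47, std 190.58
print(insert_size_bounds(204, 466, 347.47, 190.58))  # ((1, 990), (1, 1252))
# FF orientation from the log: quartiles (223, 379, 632), mean 430.55, std 288.91
print(insert_size_bounds(223, 632, 430.55, 288.91))  # ((1, 1450), (1, 1859))
```

The FR orientation dominates here (132319 of the candidate pairs), which is what standard paired-end libraries are expected to produce.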

In [None]:
!ls

aln-pe.sam.gz  index		  P7741_R1.fastq.gz  samtools-1.16.1
aln-se.sam.gz  output.bam	  P7741_R2.fastq.gz  samtools-1.16.1.tar.bz2
bwa	       output.sorted.bam  sample_data
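
The shell pipe used above streams alignments straight into `samtools sort`, so no intermediate SAM file is written. For readers who want to drive the same step from Python rather than a `!` cell, here is a minimal sketch using `subprocess`. The binary paths and flags are the ones from this notebook; the function names `build_commands` and `run_pipeline` are hypothetical.

```python
import subprocess


def build_commands(reference, reads1, reads2, sorted_bam, threads=8,
                   bwa="bwa/bwa", samtools="samtools-1.16.1/samtools"):
    """Build the argument lists for `bwa mem ... | samtools sort -o ... -`."""
    align = [bwa, "mem", "-t", str(threads), reference, reads1, reads2]
    sort = [samtools, "sort", "-o", sorted_bam, "-"]  # "-" reads SAM from stdin
    return align, sort


def run_pipeline(align, sort):
    """Run the aligner and pipe its stdout into the sorter; raise on failure."""
    aligner = subprocess.Popen(align, stdout=subprocess.PIPE)
    sorter = subprocess.Popen(sort, stdin=aligner.stdout)
    # Close our copy of the pipe so bwa is signalled if samtools exits early.
    aligner.stdout.close()
    sort_rc = sorter.wait()
    align_rc = aligner.wait()
    if align_rc != 0 or sort_rc != 0:
        raise RuntimeError(f"pipeline failed: bwa={align_rc}, samtools={sort_rc}")


align_cmd, sort_cmd = build_commands(
    "index/Agy99.fasta", "P7741_R1.fastq.gz", "P7741_R2.fastq.gz", "output.sorted.bam"
)
print(" ".join(align_cmd), "|", " ".join(sort_cmd))
# Call run_pipeline(align_cmd, sort_cmd) to execute; it needs the built binaries and data.
```

Checking both exit codes matters because a plain shell pipe reports only the status of the last command, so a failed alignment could otherwise go unnoticed.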


In [None]:
time06 = datetime.now()

In [None]:
print(f'Installation Duration:     {time02 - time01}')
print(f'Data Load Duration:        {time03 - time02}')
print(f'bwa index Duration:        {time04 - time03}')
print(f'bwa mem unsorted Duration: {time05 - time04}')
print(f'bwa mem sorted Duration:   {time06 - time05}')
print(f'Total Duration:            {time06 - time01}')

Installation Duration:     0:01:33.431302
Data Load Duration:        0:06:45.588115
bwa index Duration:        0:00:03.849388
bwa mem unsorted Duration: 0:07:08.097268
bwa mem sorted Duration:   0:07:26.602248
Total Duration:            0:22:57.568321
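
As a quick consistency check on the figures above, the sketch below parses the printed durations, confirms that the five stages add up to the reported total, and derives the extra time that piping through `samtools sort` cost compared with writing unsorted output. The values are copied from the output above; the helper `parse_duration` is a hypothetical name.

```python
from datetime import timedelta

# Durations copied from the notebook output above.
STAGES = {
    "installation": "0:01:33.431302",
    "data load": "0:06:45.588115",
    "bwa index": "0:00:03.849388",
    "bwa mem unsorted": "0:07:08.097268",
    "bwa mem sorted": "0:07:26.602248",
}
TOTAL = "0:22:57.568321"


def parse_duration(text):
    """Parse the 'H:MM:SS[.ffffff]' form that str(timedelta) prints for short runs."""
    hours, minutes, seconds = text.split(":")
    whole, _, frac = seconds.partition(".")
    return timedelta(
        hours=int(hours),
        minutes=int(minutes),
        seconds=int(whole),
        microseconds=int(frac.ljust(6, "0")[:6]),
    )


durations = {name: parse_duration(value) for name, value in STAGES.items()}
stage_sum = sum(durations.values(), timedelta())
sort_overhead = durations["bwa mem sorted"] - durations["bwa mem unsorted"]

print(stage_sum == parse_duration(TOTAL))  # True
print(sort_overhead)                       # 0:00:18.504980
```

On this run, sorting added about 18.5 seconds to a roughly 7 minute alignment, while the data download took almost as long as one alignment pass.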


END