## In this notebook I use Bowtie2 to align geoduck RNASeq reads to my genome scaffolds longer than 70 kb, convert the SAM output to BAM, then sort and index the BAM for IGV

### _Software requirements:_ 
### [Samtools Version: 1.3.1](https://sourceforge.net/projects/samtools/files/samtools/1.3.1/) (using htslib 1.3.1)
### [Bowtie 2 version 2.2.9](https://sourceforge.net/projects/bowtie-bio/files/bowtie2/2.2.9/) 
* #### The Bowtie2 [user manual](http://bowtie-bio.sourceforge.net/bowtie2/manual.shtml) has a great tutorial, which I relied on heavily for this protocol.





In [1]:
pwd

'/Users/shlaura3/Documents/SAFS/FISH_546_Bioinformatics/546-Bioinformatics/2016-10_Geo-Ann-Project/Jupyter-Notebooks'

In [None]:
# The bioconda channel is already configured on this machine, so Bowtie2 installs with a single conda command
! conda install bowtie2

In [5]:
! bowtie2 -help

Usage: 
bowtie [options]* <ebwt> {-1 <m1> -2 <m2> | --12 <r> | <s>} [<hit>]

  <m1>    Comma-separated list of files containing upstream mates (or the
          sequences themselves, if -c is set) paired with mates in <m2>
  <m2>    Comma-separated list of files containing downstream mates (or the
          sequences themselves if -c is set) paired with mates in <m1>
  <r>     Comma-separated list of files containing Crossbow-style reads.  Can be
          a mixture of paired and unpaired.  Specify "-" for stdin.
  <s>     Comma-separated list of files containing unpaired reads, or the
          sequences themselves, if -c is set.  Specify "-" for stdin.
  <hit>   File to write hits to (default: stdout)
Input:
  -q                 query input files are FASTQ .fq/.fastq (default)
  -f                 query input files are (multi-)FASTA .fa/.mfa
  -r                 query input files are raw one-sequence-per-line
  -c                 query sequences given on cmd line (as 

In [12]:
cd ../data/ 

/Users/srlab/546-Bioinformatics/2016-10_Geo-Ann-Project/data


In [13]:
# download geoduck RNASeq files, which are paired-end reads, samples from gonads in 1 male & 1 female
! curl -O http://owl.fish.washington.edu/halfshell/bu-data-genomic/tentacle/Geoduck_v3/Geo_Pool_F_GGCTAC_L006_R2_001_val_1.fq \
-O http://owl.fish.washington.edu/halfshell/bu-data-genomic/tentacle/Geoduck_v3/Geo_Pool_F_GGCTAC_L006_R1_001_val_2.fq \
-O http://owl.fish.washington.edu/halfshell/bu-data-genomic/tentacle/Geoduck_v3/Geo_Pool_M_CTTGTA_L006_R2_001_val_1.fq \
-O http://owl.fish.washington.edu/halfshell/bu-data-genomic/tentacle/Geoduck_v3/Geo_Pool_M_CTTGTA_L006_R1_001_val_2.fq    

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 24.9G  100 24.9G    0     0  65.4M      0  0:06:30  0:06:30 --:--:--  110M
100 25.0G  100 25.0G    0     0  59.3M      0  0:07:11  0:07:11 --:--:-- 75.0M
100 28.1G  100 28.1G    0     0  59.0M      0  0:08:07  0:08:07 --:--:-- 99.1M
100 28.2G  100 28.2G    0     0  64.0M      0  0:07:31  0:07:31 --:--:-- 65.2M


In [15]:
ls ../data/

Geo-70k-scaff-annotations-merged-table.tabular
Geo-v3-join-uniprot-all0916-condensed.tab
Geo_Pool_F_GGCTAC_L006_R1_001_val_2.fq
Geo_Pool_F_GGCTAC_L006_R2_001_val_1.fq
Geo_Pool_M_CTTGTA_L006_R1_001_val_2.fq
Geo_Pool_M_CTTGTA_L006_R2_001_val_1.fq
Geoduck-transcriptome-v3.fa.zip
Geoduck-transcriptome-v3.fa_db.nhr
Geoduck-transcriptome-v3.fa_db.nin
Geoduck-transcriptome-v3.fa_db.nsq
Panopea_generosa_No-Line-Breaks.scafSeq.fai
Panopea_generosa_ScafSeq.genome
Panopea_generosa_scaff-70k.fasta
Panopea_generosa_scaff-70k.fasta.fai
Panopea_generosa_scaff-70k.scafSeq
Panopea_generosa_scaff-70k.scafSeq.fai
Panopea_generosa_scaff-70k_bowtie2-index.1.bt2
Panopea_generosa_scaff-70k_bowtie2-index.2.bt2
Panopea_generosa_scaff-70k_bowtie2-index.3.bt2
Panopea_generosa_scaff-70k_bowtie2-index.4.bt2
Panopea_generosa_scaff-70k_bowtie2-index.rev.1.bt2
Panopea_generosa_scaff-70k_bowtie2-index.rev.2.bt2
Panopea_generosa_scaff-70k_db.nhr
Panopea_generosa_scaff-70k_db.nin
Panopea_generosa

---

## Building my Bowtie2 Index
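#### Bowtie2 requires an index built from the reference sequences against which the RNASeq reads are mapped
#### See line 1429 in the [bowtie manual](https://github.com/BenLangmead/bowtie/blob/master/MANUAL) for the bowtie-build options (this link points to the Bowtie 1 manual; the Bowtie2 user manual linked above documents `bowtie2-build`)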

In [5]:
# Build the Bowtie2 "index" (aka database) from my genome scaffolds longer than 70 kb
! bowtie2-build \
../data/Panopea_generosa_scaff-70k.fasta \
../data/Panopea_generosa_scaff-70k_bowtie2-index

Settings:
  Output files: "../data/Panopea_generosa_scaff-70k_bowtie2-index.*.bt2"
  Line rate: 6 (line is 64 bytes)
  Lines per side: 1 (side is 64 bytes)
  Offset rate: 4 (one in 16)
  FTable chars: 10
  Strings: unpacked
  Max bucket size: default
  Max bucket size, sqrt multiplier: default
  Max bucket size, len divisor: 4
  Difference-cover sample period: 1024
  Endianness: little
  Actual local endianness: little
  Sanity checking: disabled
  Assertions: disabled
  Random seed: 0
  Sizeofs: void*:8, int:4, long:8, size_t:8
Input files DNA, FASTA:
  ../data/Panopea_generosa_scaff-70k.fasta
Building a SMALL index
Reading reference sizes
  Time reading reference sizes: 00:00:00
Calculating joined length
Writing header
Reserving space for joined string
Joining reference sequences
  Time to join reference sequences: 00:00:00
bmax according to bmaxDivN setting: 451843
Using parameters --bmax 338883 --dcv 1024
  Doing ahead-of-time memory usage test
  Passed!  Constructing with these pa

In [20]:
# Indexing creates 6 files ending in .bt2 
! ls ../data/

Geo-70k-scaff-annotations-merged-table.tabular
Geo-v3-join-uniprot-all0916-condensed.tab
Geo_Pool_F_GGCTAC_L006_R1_001_val_2.fq
Geo_Pool_F_GGCTAC_L006_R2_001_val_1.fq
Geo_Pool_M_CTTGTA_L006_R1_001_val_2.fq
Geo_Pool_M_CTTGTA_L006_R2_001_val_1.fq
Geoduck-transcriptome-v3.fa.zip
Geoduck-transcriptome-v3.fa_db.nhr
Geoduck-transcriptome-v3.fa_db.nin
Geoduck-transcriptome-v3.fa_db.nsq
Panopea_generosa_No-Line-Breaks.scafSeq.fai
Panopea_generosa_ScafSeq.genome
Panopea_generosa_scaff-70k.fasta
Panopea_generosa_scaff-70k.fasta.fai
Panopea_generosa_scaff-70k.scafSeq
Panopea_generosa_scaff-70k.scafSeq.fai
Panopea_generosa_scaff-70k_bowtie2-index.1.bt2
Panopea_generosa_scaff-70k_bowtie2-index.2.bt2
Panopea_generosa_scaff-70k_bowtie2-index.3.bt2
Panopea_generosa_scaff-70k_bowtie2-index.4.bt2
Panopea_generosa_scaff-70k_bowtie2-index.rev.1.bt2
Panopea_generosa_scaff-70k_bowtie2-index.rev.2.bt2
Panopea_generosa_scaff-70k_db.nhr
Panopea_generosa_scaff-70k_db.nin
Panopea_generosa

In [8]:
! bowtie2-inspect --summary ../data/Panopea_generosa_scaff-70k_bowtie2-index

Flags	1
Reverse flags	5
Colorspace	0
2.0-compatible	1
SA-Sample	1 in 16
FTab-Chars	10
Sequence-1	scaffold3071 37.0	92480
Sequence-2	scaffold4463 35.0	96504
Sequence-3	scaffold5354 34.4	78472
Sequence-4	scaffold9504 36.1	106643
Sequence-5	scaffold10970 36.6	103072
Sequence-6	scaffold11875 36.6	71487
Sequence-7	scaffold15463 36.3	94990
Sequence-8	scaffold18558 34.9	73279
Sequence-9	scaffold19489 36.3	88023
Sequence-10	scaffold20302 34.0	90664
Sequence-11	scaffold26337 37.5	154899
Sequence-12	scaffold26960 37.7	118606
Sequence-13	scaffold27692 37.9	82296
Sequence-14	scaffold30278 35.7	92860
Sequence-15	scaffold31723 35.8	75618
Sequence-16	scaffold32578 36.5	88248
Sequence-17	scaffold34940 36.6	91720
Sequence-18	scaffold45727 36.4	106547
Sequence-19	scaffold59644 35.9	78717
Sequence-20	scaffold71773 33.8	103920
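
A quick sketch (not part of the original run) of how the summary above could be totalled up in Python to see how much reference sequence the index actually holds. The function name `parse_inspect_summary` is my own, and I am assuming each sequence row is tab-separated as `Sequence-N`, name, length, which is how the output above appears to be laid out.

```python
# Sketch: total up reference lengths from `bowtie2-inspect --summary` text.
# Assumption: sequence rows look like "Sequence-N<TAB>name<TAB>length".

def parse_inspect_summary(text):
    """Return a list of (name, length) tuples, one per Sequence-N row."""
    records = []
    for line in text.splitlines():
        if not line.startswith("Sequence-"):
            continue  # skip Flags, SA-Sample, FTab-Chars, etc.
        fields = line.split("\t")
        name = fields[1].split()[0]  # keep the scaffold ID, drop the number after it
        length = int(fields[-1])
        records.append((name, length))
    return records

# A non-sequence row plus the first three sequence rows from the summary above.
example = (
    "SA-Sample\t1 in 16\n"
    "Sequence-1\tscaffold3071 37.0\t92480\n"
    "Sequence-2\tscaffold4463 35.0\t96504\n"
    "Sequence-3\tscaffold5354 34.4\t78472\n"
)
records = parse_inspect_summary(example)
total_bp = sum(length for _, length in records)
print(len(records), "scaffolds,", total_bp, "bp")
```

In the notebook itself, the real summary text could be captured with IPython's `lines = !command` assignment and joined with newlines before being passed in.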


### Mapping reads from female gonad RNASeq data to the indexed fasta file

In [16]:
# Now use Bowtie2 to map the female gonad paired-end reads to the scaffold index
# FYI this took about an hour on my personal laptop & on Roadrunner
! bowtie2 \
-x ../data/Panopea_generosa_scaff-70k_bowtie2-index \
-1 ../data/Geo_Pool_F_GGCTAC_L006_R2_001_val_1.fq \
-2 ../data/Geo_Pool_F_GGCTAC_L006_R1_001_val_2.fq \
-S ../analyses/Panopea_generosascaff-70k_gonadF_bowtie.sam

103213920 reads; of these:
  103213920 (100.00%) were paired; of these:
    102864801 (99.66%) aligned concordantly 0 times
    261616 (0.25%) aligned concordantly exactly 1 time
    87503 (0.08%) aligned concordantly >1 times
    ----
    102864801 pairs aligned concordantly 0 times; of these:
      14372 (0.01%) aligned discordantly 1 time
    ----
    102850429 pairs aligned 0 times concordantly or discordantly; of these:
      205700858 mates make up the pairs; of these:
        205376406 (99.84%) aligned 0 times
        207451 (0.10%) aligned exactly 1 time
        117001 (0.06%) aligned >1 times
0.51% overall alignment rate
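
To make the last line of that summary less mysterious, here is a small sketch of the arithmetic that appears to reproduce it: every pair that aligned (concordantly or discordantly) contributes two mates, the leftover mates that aligned on their own are added, and the total is divided by all mates. The function name `overall_alignment_rate` and its parameter names are mine, and the formula is inferred from the numbers above rather than taken from the Bowtie2 source.

```python
# Sketch: recompute the "overall alignment rate" from the counts Bowtie2 prints.
# Assumption: rate = aligned mates / total mates, with 2 mates per pair.

def overall_alignment_rate(pairs, conc_once, conc_multi, disc_once,
                           mates_once, mates_multi):
    """Fraction of all mates that aligned at least once."""
    aligned_pairs = conc_once + conc_multi + disc_once
    aligned_mates = 2 * aligned_pairs + mates_once + mates_multi
    return aligned_mates / (2 * pairs)

# Counts copied from the female gonad summary above.
female_rate = overall_alignment_rate(
    pairs=103213920,
    conc_once=261616,
    conc_multi=87503,
    disc_once=14372,
    mates_once=207451,
    mates_multi=117001,
)
print(f"{female_rate:.2%} overall alignment rate")
```

A rate this low is plausibly what to expect here, since the index only contains the handful of scaffolds listed by `bowtie2-inspect`, so most reads have nowhere to map.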


In [17]:
!ls ../analyses/

2016-11-3_Geo70k-scaff-transcrip-sequences.tabular
GeoTrans-PhelDiffExp_blasted_sorted
P-Generosa_IGV.xml
Panopea_generosascaff-70k_gonadF_bowtie.sam
Phel_DEGlist.tab
RepeatMasker
SeaStarDiffExp.R
pgenerosa-scaff70-miRNA.tab
pgenerosa-transcrv3-blastn-scaff70k-01.gff.fai
pgenerosa-transcrv3-blastn-scaff70k-01.tab
pgenerosa-transcrv3-blastn-scaff70k-01.tab.fai


In [18]:
! head ../analyses/Panopea_generosascaff-70k_gonadF_bowtie.sam

@HD	VN:1.0	SO:unsorted
@SQ	SN:scaffold3071	LN:92480
@SQ	SN:scaffold4463	LN:96504
@SQ	SN:scaffold5354	LN:78472
@SQ	SN:scaffold9504	LN:106643
@SQ	SN:scaffold10970	LN:103072
@SQ	SN:scaffold11875	LN:71487
@SQ	SN:scaffold15463	LN:94990
@SQ	SN:scaffold18558	LN:73279
@SQ	SN:scaffold19489	LN:88023
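
The `@SQ` header lines shown by `head` carry the same scaffold names and lengths as the index summary. Below is a minimal sketch of pulling them into a dictionary, e.g. to confirm the SAM file was written against the expected reference. The function name `parse_sq_lines` is hypothetical, and I am assuming the standard tab-separated `TAG:value` header layout.

```python
# Sketch: collect scaffold lengths from SAM @SQ header lines.
# Assumption: header fields are tab-separated TAG:value pairs (SN = name, LN = length).

def parse_sq_lines(header_lines):
    """Map each reference name (SN) to its length (LN)."""
    lengths = {}
    for line in header_lines:
        if not line.startswith("@SQ"):
            continue  # ignore @HD, @PG and alignment records
        fields = line.rstrip("\n").split("\t")[1:]
        tags = dict(field.split(":", 1) for field in fields)
        lengths[tags["SN"]] = int(tags["LN"])
    return lengths

# The first three header lines from the SAM file above.
header = [
    "@HD\tVN:1.0\tSO:unsorted",
    "@SQ\tSN:scaffold3071\tLN:92480",
    "@SQ\tSN:scaffold4463\tLN:96504",
]
sq_lengths = parse_sq_lines(header)
print(sq_lengths)
```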


### Converting from .sam to .bam format

In [19]:
# This took about 75 minutes on Roadrunner
! samtools view -bS -o \
../analyses/Panopea_generosascaff-70k_gonadF_bowtie.bam \
../analyses/Panopea_generosascaff-70k_gonadF_bowtie.sam

In [21]:
ls ../analyses/

2016-11-3_Geo70k-scaff-transcrip-sequences.tabular
GeoTrans-PhelDiffExp_blasted_sorted
P-Generosa_IGV.xml
Panopea_generosascaff-70k_gonadF_bowtie.bam
Panopea_generosascaff-70k_gonadF_bowtie.sam
Phel_DEGlist.tab
RepeatMasker/
SeaStarDiffExp.R
pgenerosa-scaff70-miRNA.tab
pgenerosa-transcrv3-blastn-scaff70k-01.gff.fai
pgenerosa-transcrv3-blastn-scaff70k-01.tab
pgenerosa-transcrv3-blastn-scaff70k-01.tab.fai


### Sorting BAM, and indexing for IGV

In [22]:
# Sort the bam file 
# this took about 90 minutes on Roadrunner
! samtools sort \
-o ../analyses/Panopea_generosascaff-70k_gonadF_bowtie-sorted.bam \
-T TEMP_gonadF_bowtie-sorted \
../analyses/Panopea_generosascaff-70k_gonadF_bowtie.bam

[bam_sort_core] merging from 91 files...


In [23]:
# Index bam file for viewing in IGV browser
! samtools index ../analyses/Panopea_generosascaff-70k_gonadF_bowtie-sorted.bam

In [25]:
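# Move the index into the IGV track folder. IGV looks for the .bai next to its sorted .bam,
# so the sorted .bam needs to be moved or copied into the same directory as well.
mv ../analyses/Panopea_generosascaff-70k_gonadF_bowtie-sorted.bam.bai ../IGV_track_files/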

In [26]:
ls ../IGV_track_files/

Panopea_generosa_Scaff-70k-CpG.gff
Panopea_generosa_scaff-70k.scafSeq.out.refined.gff
Panopea_generosa_scaff-70k.scafSeq.out.unrefined.gff
Panopea_generosascaff-70k_gonadF_bowtie-sorted.bam.bai
pgenerosa-scaff70-miRNA.gff
pgenerosa-transcrv3-blastn-scaff70k-01.gff


### Let's do it again! This time using male gonad RNASeq data

In [39]:
# Map the male gonad paired-end reads to the same scaffold index
! bowtie2 \
-x ../data/Panopea_generosa_scaff-70k_bowtie2-index \
-1 ../data/Geo_Pool_M_CTTGTA_L006_R2_001_val_1.fq \
-2 ../data/Geo_Pool_M_CTTGTA_L006_R1_001_val_2.fq \
-S ../analyses/Panopea_generosascaff-70k_gonadM_bowtie.sam


116711310 reads; of these:
  116711310 (100.00%) were paired; of these:
    116630840 (99.93%) aligned concordantly 0 times
    56523 (0.05%) aligned concordantly exactly 1 time
    23947 (0.02%) aligned concordantly >1 times
    ----
    116630840 pairs aligned concordantly 0 times; of these:
      3318 (0.00%) aligned discordantly 1 time
    ----
    116627522 pairs aligned 0 times concordantly or discordantly; of these:
      233255044 mates make up the pairs; of these:
        233141267 (99.95%) aligned 0 times
        72884 (0.03%) aligned exactly 1 time
        40893 (0.02%) aligned >1 times
0.12% overall alignment rate


In [40]:
# Convert the male SAM file to BAM
! samtools view -bS -o \
../analyses/Panopea_generosascaff-70k_gonadM_bowtie.bam \
../analyses/Panopea_generosascaff-70k_gonadM_bowtie.sam


In [41]:
# Sort the male BAM file
! samtools sort \
-o ../analyses/Panopea_generosascaff-70k_gonadM_bowtie-sorted.bam \
-T TEMP_gonadM_bowtie-sorted \
../analyses/Panopea_generosascaff-70k_gonadM_bowtie.bam

[bam_sort_core] merging from 97 files...


In [43]:
! samtools index ../analyses/Panopea_generosascaff-70k_gonadM_bowtie-sorted.bam
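
Since the female and male runs repeat the same four steps (align, convert, sort, index), here is a sketch of how the commands could be generated from one function instead of being retyped. This is not how I ran it above. The function name `pipeline_commands` and its parameters are hypothetical, and it only uses the flags and file-naming pattern that already appear in this notebook. It builds the command lists without running them; the `subprocess.run` line is left commented out.

```python
# Sketch: build the align -> BAM -> sort -> index commands for one sample.
# Assumption: output names follow the pattern used in the cells above.
import subprocess  # only needed if the run line below is uncommented

INDEX = "../data/Panopea_generosa_scaff-70k_bowtie2-index"

def pipeline_commands(sample, reads_1, reads_2, index=INDEX):
    """Return the four commands (as argument lists) for one sample label."""
    prefix = f"../analyses/Panopea_generosascaff-70k_{sample}_bowtie"
    sam = f"{prefix}.sam"
    bam = f"{prefix}.bam"
    sorted_bam = f"{prefix}-sorted.bam"
    return [
        ["bowtie2", "-x", index, "-1", reads_1, "-2", reads_2, "-S", sam],
        ["samtools", "view", "-bS", "-o", bam, sam],
        ["samtools", "sort", "-o", sorted_bam,
         "-T", f"TEMP_{sample}_bowtie-sorted", bam],
        ["samtools", "index", sorted_bam],
    ]

cmds = pipeline_commands(
    "gonadM",
    "../data/Geo_Pool_M_CTTGTA_L006_R2_001_val_1.fq",
    "../data/Geo_Pool_M_CTTGTA_L006_R1_001_val_2.fq",
)
for cmd in cmds:
    print(" ".join(cmd))
    # subprocess.run(cmd, check=True)  # uncomment to actually run each step
```

Passing the commands as lists rather than one shell string avoids quoting problems with paths, and `check=True` would stop the loop if any step fails.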