## In this notebook I align geoduck RNA-Seq reads to my genome scaffolds longer than 70 kb, first attempting Bowtie and then switching to Bowtie 2.

### The Bowtie [tutorial](http://bowtie-bio.sourceforge.net/tutorial.shtml) is a great walkthrough, and I followed it for the steps in this notebook.

In [1]:
pwd

'/Users/shlaura3/Documents/SAFS/FISH_546_Bioinformatics/546-Bioinformatics/2016-10_Geo-Ann-Project/Jupyter-Notebooks'

In [None]:
# I installed Bowtie via the following command: 
conda install -c bioconda bowtie=1.1.2

In [2]:
# Bowtie was automatically installed in the `anaconda` bin:
! /Users/shlaura3/anaconda/bin/bowtie

No index, query, or output file specified!
Usage: 
bowtie [options]* <ebwt> {-1 <m1> -2 <m2> | --12 <r> | <s>} [<hit>]

  <m1>    Comma-separated list of files containing upstream mates (or the
          sequences themselves, if -c is set) paired with mates in <m2>
  <m2>    Comma-separated list of files containing downstream mates (or the
          sequences themselves if -c is set) paired with mates in <m1>
  <r>     Comma-separated list of files containing Crossbow-style reads.  Can be
          a mixture of paired and unpaired.  Specify "-" for stdin.
  <s>     Comma-separated list of files containing unpaired reads, or the
          sequences themselves, if -c is set.  Specify "-" for stdin.
  <hit>   File to write hits to (default: stdout)
Input:
  -q                 query input files are FASTQ .fq/.fastq (default)
  -f                 query input files are (multi-)FASTA .fa/.mfa
  -r                 query input files are raw one-sequence-per-line
  -c           

In [5]:
! /Users/shlaura3/anaconda/bin/bowtie -help

Usage: 
bowtie [options]* <ebwt> {-1 <m1> -2 <m2> | --12 <r> | <s>} [<hit>]

  <m1>    Comma-separated list of files containing upstream mates (or the
          sequences themselves, if -c is set) paired with mates in <m2>
  <m2>    Comma-separated list of files containing downstream mates (or the
          sequences themselves if -c is set) paired with mates in <m1>
  <r>     Comma-separated list of files containing Crossbow-style reads.  Can be
          a mixture of paired and unpaired.  Specify "-" for stdin.
  <s>     Comma-separated list of files containing unpaired reads, or the
          sequences themselves, if -c is set.  Specify "-" for stdin.
  <hit>   File to write hits to (default: stdout)
Input:
  -q                 query input files are FASTQ .fq/.fastq (default)
  -f                 query input files are (multi-)FASTA .fa/.mfa
  -r                 query input files are raw one-sequence-per-line
  -c                 query sequences given on cmd line (as 

In [6]:
# Now I'll build my "index" (aka database) from my genome scaffolds longer than 70 kb;
# see line 1429 in the [bowtie manual](https://github.com/BenLangmead/bowtie/blob/master/MANUAL) for the bowtie-build options
!/Users/shlaura3/anaconda/bin/bowtie-build \
../data/Panopea_generosa_scaff-70k.scafSeq \
../data/Panopea_generosa_scaff-70k_bowtie-index

Settings:
  Output files: "../data/Panopea_generosa_scaff-70k_bowtie-index.*.ebwt"
  Line rate: 6 (line is 64 bytes)
  Lines per side: 1 (side is 64 bytes)
  Offset rate: 5 (one in 32)
  FTable chars: 10
  Strings: unpacked
  Max bucket size: default
  Max bucket size, sqrt multiplier: default
  Max bucket size, len divisor: 4
  Difference-cover sample period: 1024
  Endianness: little
  Actual local endianness: little
  Sanity checking: disabled
  Assertions: disabled
  Random seed: 0
  Sizeofs: void*:8, int:4, long:8, size_t:8
Input files DNA, FASTA:
  ../data/Panopea_generosa_scaff-70k.scafSeq
Reading reference sizes
  Time reading reference sizes: 00:00:00
Calculating joined length
Writing header
Reserving space for joined string
Joining reference sequences
  Time to join reference sequences: 00:00:00
bmax according to bmaxDivN setting: 451843
Using parameters --bmax 338883 --dcv 1024
  Doing ahead-of-time memory usage test
  Passed!  Constructing with these parameters: --bmax 3388

In [7]:
ls ../data/

Geo-70k-scaff-annotations-merged-table.tabular
Geo-v3-join-uniprot-all0916-condensed.tab
Geoduck-transcriptome-v3.fa.zip
Geoduck-transcriptome-v3.fa_db.nhr
Geoduck-transcriptome-v3.fa_db.nin
Geoduck-transcriptome-v3.fa_db.nsq
Geoduck-transcriptome-v3annotated.fa*
Panopea_generosa_No-Line-Breaks.scafSeq.fai
Panopea_generosa_ScafSeq.genome
Panopea_generosa_scaff-70k.scafSeq
Panopea_generosa_scaff-70k.scafSeq.fai
Panopea_generosa_scaff-70k_bowtie-index.1.ebwt
Panopea_generosa_scaff-70k_bowtie-index.2.ebwt
Panopea_generosa_scaff-70k_bowtie-index.3.ebwt
Panopea_generosa_scaff-70k_bowtie-index.4.ebwt
Panopea_generosa_scaff-70k_bowtie-index.rev.1.ebwt
Panopea_generosa_scaff-70k_bowtie-index.rev.2.ebwt
Panopea_generosa_scaff-70k_db.nhr
Panopea_generosa_scaff-70k_db.nin
Panopea_generosa_scaff-70k_db.nsq
Phel_countdata.txt
Phel_transcriptome.fasta
hairpin.fa
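The listing above shows the six `.ebwt` files that `bowtie-build` wrote for the index basename. A minimal sketch of checking that all of them exist before aligning is below. The helper name `missing_index_files` is hypothetical, and the suffix list is an assumption taken from the `ls` output above, so it only covers a small (non-large) index.

```python
from pathlib import Path

# Suffixes seen in the `ls` output above for a small Bowtie index (assumption:
# a "large" index would use different extensions and is not handled here).
INDEX_PARTS = ("1", "2", "3", "4", "rev.1", "rev.2")


def missing_index_files(basename, ext="ebwt"):
    """Return the expected index files for `basename` that are not on disk.

    `basename` is the prefix passed to bowtie-build (no extension);
    `ext` is "ebwt" for Bowtie 1 or "bt2" for Bowtie 2.
    """
    expected = [Path(f"{basename}.{part}.{ext}") for part in INDEX_PARTS]
    return [str(path) for path in expected if not path.is_file()]


# Example call (path taken from the notebook; prints an empty list if complete):
# print(missing_index_files("../data/Panopea_generosa_scaff-70k_bowtie-index"))
```

An empty list means every expected file was found; it does not prove the index is valid, only that no file is absent.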


In [11]:
cd ../data/ 

/Users/shlaura3/Documents/SAFS/FISH_546_Bioinformatics/546-Bioinformatics/2016-10_Geo-Ann-Project/data


In [None]:
# Download the geoduck RNA-Seq files: paired-end reads from gonad samples of 1 female and 1 male
! curl -O http://owl.fish.washington.edu/halfshell/bu-data-genomic/tentacle/Geoduck_v3/Geo_Pool_F_GGCTAC_L006_R2_001_val_1.fq \
-O http://owl.fish.washington.edu/halfshell/bu-data-genomic/tentacle/Geoduck_v3/Geo_Pool_F_GGCTAC_L006_R1_001_val_2.fq \
-O http://owl.fish.washington.edu/halfshell/bu-data-genomic/tentacle/Geoduck_v3/Geo_Pool_M_CTTGTA_L006_R2_001_val_1.fq \
-O http://owl.fish.washington.edu/halfshell/bu-data-genomic/tentacle/Geoduck_v3/Geo_Pool_M_CTTGTA_L006_R1_001_val_2.fq    

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 24.9G  100 24.9G    0     0   857k      0  8:28:43  8:28:43 --:--:--  916k
 29 25.0G   29 7478M    0     0   912k      0  7:59:11  2:19:55  5:39:16  914k
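Downloads this large can be cut short, and paired-end aligners expect the two mate files to contain the same number of reads. A minimal sketch of a sanity check follows, assuming uncompressed FASTQ with exactly four lines per record. The function names are hypothetical, and reading a ~25 GB file line by line will take a while.

```python
def count_fastq_records(path):
    """Count reads in an uncompressed FASTQ file, assuming 4 lines per record."""
    with open(path, "rb") as handle:
        n_lines = sum(1 for _ in handle)
    if n_lines % 4 != 0:
        # A line count that is not a multiple of 4 suggests a truncated file.
        raise ValueError(f"{path}: {n_lines} lines is not a multiple of 4")
    return n_lines // 4


def mates_match(fastq_1, fastq_2):
    """Return True when both mate files hold the same number of records."""
    return count_fastq_records(fastq_1) == count_fastq_records(fastq_2)


# Example call with the female-pool files named in the curl command above:
# mates_match("Geo_Pool_F_GGCTAC_L006_R1_001_val_2.fq",
#             "Geo_Pool_F_GGCTAC_L006_R2_001_val_1.fq")
```

Equal counts do not guarantee the mates are in the same order, but unequal counts are a clear sign that something went wrong.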

In [14]:
ls data/

Geo-70k-scaff-annotations-merged-table.tabular
Geo-v3-join-uniprot-all0916-condensed.tab
Geo_Pool_F_GGCTAC_L006_R1_001_val_2.fq
Geo_Pool_F_GGCTAC_L006_R2_001_val_1.fq
Geoduck-transcriptome-v3.fa.zip
Geoduck-transcriptome-v3.fa_db.nhr
Geoduck-transcriptome-v3.fa_db.nin
Geoduck-transcriptome-v3.fa_db.nsq
Geoduck-transcriptome-v3annotated.fa*
Panopea_generosa_No-Line-Breaks.scafSeq.fai
Panopea_generosa_ScafSeq.genome
Panopea_generosa_scaff-70k.scafSeq
Panopea_generosa_scaff-70k.scafSeq.fai
Panopea_generosa_scaff-70k_bowtie-index.1.ebwt
Panopea_generosa_scaff-70k_bowtie-index.2.ebwt
Panopea_generosa_scaff-70k_bowtie-index.3.ebwt
Panopea_generosa_scaff-70k_bowtie-index.4.ebwt
Panopea_generosa_scaff-70k_bowtie-index.rev.1.ebwt
Panopea_generosa_scaff-70k_bowtie-index.rev.2.ebwt
Panopea_generosa_scaff-70k_db.nhr
Panopea_generosa_scaff-70k_db.nin
Panopea_generosa_scaff-70k_db.nsq
Phel_countdata.txt
Phel_transcriptome.fasta
hairpin.fa


In [16]:
# Now, use bowtie to first map female gonad paired-end reads to scaffold index
! /Users/shlaura3/anaconda/bin/bowtie \
-x ../data/Panopea_generosa_scaff-70k_bowtie-index \
-1 ../data/Geo_Pool_F_GGCTAC_L006_R2_001_val_1.fq \
-2 ../data/Geo_Pool_F_GGCTAC_L006_R1_001_val_2.fq \
-S ../data/Panopea_generosascaff-70k_gonadF_.sam

Could not locate a Bowtie index corresponding to basename "../data/Panopea_generosascaff-70k_gonadF_.sam"
Command: bowtie --wrapper basic-0 -x ../data/Panopea_generosa_scaff-70k_bowtie-index -1 ../data/Geo_Pool_F_GGCTAC_L006_R2_001_val_1.fq -2 ../data/Geo_Pool_F_GGCTAC_L006_R1_001_val_2.fq -S ../data/Panopea_generosascaff-70k_gonadF_.sam 
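The basename in this error is the `.sam` output path, not the index. That fits the usage text printed earlier: Bowtie 1 takes the index as the positional `<ebwt>` argument and the output as the trailing `<hit>`, while Bowtie 2 uses `-x <bt2-idx>` and `-S <sam>`. The sketch below builds both argument lists side by side. The function names are hypothetical, and the Bowtie 1 form is my reading of the manual rather than something run on these data.

```python
def bowtie1_paired_args(index, mates_1, mates_2, sam_out):
    """Argument list for a paired-end Bowtie 1 run that writes SAM.

    In Bowtie 1, -S is a switch (SAM output on), the index basename is
    positional, and the output file is the last positional argument.
    """
    return ["bowtie", "-S", index, "-1", mates_1, "-2", mates_2, sam_out]


def bowtie2_paired_args(index, mates_1, mates_2, sam_out):
    """Argument list for a paired-end Bowtie 2 run (index via -x, SAM via -S)."""
    return ["bowtie2", "-x", index, "-1", mates_1, "-2", mates_2, "-S", sam_out]


if __name__ == "__main__":
    # Print both forms for the female-pool files used in this notebook.
    files = (
        "../data/Geo_Pool_F_GGCTAC_L006_R2_001_val_1.fq",
        "../data/Geo_Pool_F_GGCTAC_L006_R1_001_val_2.fq",
        "../data/Panopea_generosascaff-70k_gonadF_.sam",
    )
    print(" ".join(bowtie1_paired_args(
        "../data/Panopea_generosa_scaff-70k_bowtie-index", *files)))
    print(" ".join(bowtie2_paired_args(
        "../data/Panopea_generosa_scaff-70k_bowtie2-index", *files)))
```

Either list can be handed to `subprocess.run(args, check=True)`. Keeping the arguments as a list avoids shell quoting problems with long paths.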


In [29]:
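# Bowtie reports that it cannot find an index, but the basename in the error is my .sam output path.
# Bowtie 1 takes the index as a positional <ebwt> argument (see the usage text above) and its -S option is a
# switch that takes no file name, so my Bowtie 2-style command was misread.
# Rather than rework the Bowtie 1 call, I'll switch to Bowtie2 and run through the same process: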
!/Users/shlaura3/anaconda/bin/bowtie2-build \
../data/Panopea_generosa_scaff-70k.fasta \
../data/Panopea_generosa_scaff-70k_bowtie2-index

Settings:
  Output files: "../data/Panopea_generosa_scaff-70k_bowtie2-index.*.bt2"
  Line rate: 6 (line is 64 bytes)
  Lines per side: 1 (side is 64 bytes)
  Offset rate: 4 (one in 16)
  FTable chars: 10
  Strings: unpacked
  Max bucket size: default
  Max bucket size, sqrt multiplier: default
  Max bucket size, len divisor: 4
  Difference-cover sample period: 1024
  Endianness: little
  Actual local endianness: little
  Sanity checking: disabled
  Assertions: disabled
  Random seed: 0
  Sizeofs: void*:8, int:4, long:8, size_t:8
Input files DNA, FASTA:
  ../data/Panopea_generosa_scaff-70k.fasta
Building a SMALL index
Reading reference sizes
  Time reading reference sizes: 00:00:00
Calculating joined length
Writing header
Reserving space for joined string
Joining reference sequences
  Time to join reference sequences: 00:00:00
bmax according to bmaxDivN setting: 451843
Using parameters --bmax 338883 --dcv 1024
  Doing ahead-of-time memory usage test
  Passed!  Constructing with these pa

In [30]:
# The .bt2 index files are binary, so `head` only prints unreadable bytes
! head ../data/Panopea_generosa_scaff-70k_bowtie2-index.rev.1.bt2

[binary output omitted]

In [23]:
! /Users/shlaura3/anaconda/bin/bowtie2 -help

Bowtie 2 version 2.2.8 by Ben Langmead (langmea@cs.jhu.edu, www.cs.jhu.edu/~langmea)
Usage: 
  bowtie2 [options]* -x <bt2-idx> {-1 <m1> -2 <m2> | -U <r>} [-S <sam>]

  <bt2-idx>  Index filename prefix (minus trailing .X.bt2).
             NOTE: Bowtie 1 and Bowtie 2 indexes are not compatible.
  <m1>       Files with #1 mates, paired with files in <m2>.
             Could be gzip'ed (extension: .gz) or bzip2'ed (extension: .bz2).
  <m2>       Files with #2 mates, paired with files in <m1>.
             Could be gzip'ed (extension: .gz) or bzip2'ed (extension: .bz2).
  <r>        Files with unpaired reads.
             Could be gzip'ed (extension: .gz) or bzip2'ed (extension: .bz2).
  <sam>      File for SAM output (default: stdout)

  <m1>, <m2>, <r> can be comma-separated lists (no whitespace) and can be
  specified many times.  E.g. '-U file1.fq,file2.fq -U file3.fq'.

Options (defaults in parentheses):

 Input:
  -q                 query input files are FASTQ .fq

In [32]:
pwd

'/Users/shlaura3/Documents/SAFS/FISH_546_Bioinformatics/546-Bioinformatics/2016-10_Geo-Ann-Project/data'

In [41]:
! bowtie2-inspect --summary ../data/Panopea_generosa_scaff-70k_bowtie2-index

Flags	1
Reverse flags	5
Colorspace	0
2.0-compatible	1
SA-Sample	1 in 16
FTab-Chars	10
Sequence-1	scaffold3071 37.0	92480
Sequence-2	scaffold4463 35.0	96504
Sequence-3	scaffold5354 34.4	78472
Sequence-4	scaffold9504 36.1	106643
Sequence-5	scaffold10970 36.6	103072
Sequence-6	scaffold11875 36.6	71487
Sequence-7	scaffold15463 36.3	94990
Sequence-8	scaffold18558 34.9	73279
Sequence-9	scaffold19489 36.3	88023
Sequence-10	scaffold20302 34.0	90664
Sequence-11	scaffold26337 37.5	154899
Sequence-12	scaffold26960 37.7	118606
Sequence-13	scaffold27692 37.9	82296
Sequence-14	scaffold30278 35.7	92860
Sequence-15	scaffold31723 35.8	75618
Sequence-16	scaffold32578 36.5	88248
Sequence-17	scaffold34940 36.6	91720
Sequence-18	scaffold45727 36.4	106547
Sequence-19	scaffold59644 35.9	78717
Sequence-20	scaffold71773 33.8	103920
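The summary lists each reference sequence with its length in the last column, which makes it easy to confirm that the index holds only the scaffolds I expect. A minimal sketch of parsing that report is below. It assumes tab-separated `Sequence-N` lines as printed above, and `parse_inspect_summary` is a hypothetical helper name.

```python
def parse_inspect_summary(text):
    """Return a list of (name, length) from the 'Sequence-N' lines of a
    `bowtie2-inspect --summary` report (tab-separated fields assumed)."""
    records = []
    for line in text.splitlines():
        if not line.startswith("Sequence-"):
            continue  # skip Flags, SA-Sample, and other header fields
        fields = line.split("\t")
        name = fields[1].split()[0]  # keep the ID, drop the rest of the FASTA header
        records.append((name, int(fields[-1])))
    return records


# Two lines copied from the report above, including the shortest scaffold.
example = (
    "Flags\t1\n"
    "Sequence-1\tscaffold3071 37.0\t92480\n"
    "Sequence-6\tscaffold11875 36.6\t71487\n"
)
records = parse_inspect_summary(example)
print(len(records), "sequences; shortest:", min(length for _, length in records))
```

Feeding the full report through the same function would give the sequence count and total reference length for the index.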


In [42]:
# Now, use Bowtie2 to map the female gonad paired-end reads to the scaffold index
# FYI this took about an hour
! /Users/shlaura3/anaconda/bin/bowtie2 \
-x ../data/Panopea_generosa_scaff-70k_bowtie2-index \
-1 Geo_Pool_F_GGCTAC_L006_R1_001_val_2.fq \
-2 Geo_Pool_F_GGCTAC_L006_R2_001_val_1.fq \
-S Panopea_generosascaff-70k_gonadF_bowtie.sam

103213920 reads; of these:
  103213920 (100.00%) were paired; of these:
    102865097 (99.66%) aligned concordantly 0 times
    261264 (0.25%) aligned concordantly exactly 1 time
    87559 (0.08%) aligned concordantly >1 times
    ----
    102865097 pairs aligned concordantly 0 times; of these:
      14413 (0.01%) aligned discordantly 1 time
    ----
    102850684 pairs aligned 0 times concordantly or discordantly; of these:
      205701368 mates make up the pairs; of these:
        205379458 (99.84%) aligned 0 times
        205853 (0.10%) aligned exactly 1 time
        116057 (0.06%) aligned >1 times
0.51% overall alignment rate
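The overall rate can be reproduced from the counts in the summary. As I read the report, every pair that aligned (concordantly or discordantly) contributes two aligned mates, and the mates that aligned on their own are added on top, all divided by the total number of mates. The sketch below is that arithmetic only; the function name is hypothetical.

```python
def overall_alignment_rate(pairs, conc_once, conc_multi, disc_once,
                           mate_once, mate_multi):
    """Percentage of mates that aligned, from Bowtie 2 summary counts."""
    # Each aligned pair accounts for two mates.
    aligned_mates = 2 * (conc_once + conc_multi + disc_once)
    # Mates from otherwise unaligned pairs that aligned individually.
    aligned_mates += mate_once + mate_multi
    return 100.0 * aligned_mates / (2 * pairs)


# Counts copied from the summary above.
rate = overall_alignment_rate(
    pairs=103213920,
    conc_once=261264,
    conc_multi=87559,
    disc_once=14413,
    mate_once=205853,
    mate_multi=116057,
)
print(f"{rate:.2f}% overall alignment rate")  # prints "0.51% overall alignment rate"
```

A rate this low is worth keeping in mind for the downstream steps: the index contains only 20 scaffolds, so most reads have nowhere to align.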


In [43]:
! mv ../data/Panopea_generosascaff-70k_gonadF_bowtie.sam ../analyses/Panopea_generosascaff-70k_gonadF_bowtie.sam

In [44]:
!ls ../analyses/

2016-11-3_Geo70k-scaff-transcrip-sequences.tabular
GeoTrans-PhelDiffExp_blasted_sorted
P-Generosa_IGV.xml
Panopea_generosascaff-70k_gonadF_bowtie.sam
Phel_DEGlist.tab
RepeatMasker
SeaStarDiffExp.R
pgenerosa-scaff70-miRNA.tab
pgenerosa-transcrv3-blastn-scaff70k-01.gff.fai
pgenerosa-transcrv3-blastn-scaff70k-01.tab
pgenerosa-transcrv3-blastn-scaff70k-01.tab.fai


In [51]:
! head ../analyses/Panopea_generosascaff-70k_gonadF_bowtie.sam

@HD	VN:1.0	SO:unsorted
@SQ	SN:scaffold3071	LN:92480
@SQ	SN:scaffold4463	LN:96504
@SQ	SN:scaffold5354	LN:78472
@SQ	SN:scaffold9504	LN:106643
@SQ	SN:scaffold10970	LN:103072
@SQ	SN:scaffold11875	LN:71487
@SQ	SN:scaffold15463	LN:94990
@SQ	SN:scaffold18558	LN:73279
@SQ	SN:scaffold19489	LN:88023


In [52]:
# Converting from .sam to .bam format
! samtools view -bS -o \
../analyses/Panopea_generosascaff-70k_gonadF_bowtie.bam \
../analyses/Panopea_generosascaff-70k_gonadF_bowtie.sam

In [58]:
ls ../analyses/

2016-11-3_Geo70k-scaff-transcrip-sequences.tabular
GeoTrans-PhelDiffExp_blasted_sorted
P-Generosa_IGV.xml
Panopea_generosascaff-70k_gonadF_bowtie.bam
Panopea_generosascaff-70k_gonadF_bowtie.sam
Phel_DEGlist.tab
RepeatMasker/
SeaStarDiffExp.R
pgenerosa-scaff70-miRNA.tab
pgenerosa-transcrv3-blastn-scaff70k-01.gff.fai
pgenerosa-transcrv3-blastn-scaff70k-01.tab
pgenerosa-transcrv3-blastn-scaff70k-01.tab.fai


In [64]:
! samtools sort \
-o ../analyses/Panopea_generosascaff-70k_gonadF_bowtie-sorted.bam \
-T TEMP_gonadF_bowtie-sorted \
../analyses/Panopea_generosascaff-70k_gonadF_bowtie.bam

[bam_sort_core] merging from 91 files...
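The SAM-to-BAM conversion and the sort above can be written down as one list of steps so the file names stay consistent. The sketch below derives the `.bam` and `-sorted.bam` names from the `.sam` path, following the naming used in this notebook. The `samtools index` step is my addition and was not run here, and both function names are hypothetical.

```python
import subprocess


def sam_to_sorted_bam_steps(sam_path, tmp_prefix="TEMP_sort"):
    """Return samtools command lists: SAM -> BAM -> sorted BAM -> index."""
    stem = sam_path[:-4] if sam_path.endswith(".sam") else sam_path
    bam = f"{stem}.bam"
    sorted_bam = f"{stem}-sorted.bam"
    return [
        ["samtools", "view", "-b", "-o", bam, sam_path],
        ["samtools", "sort", "-o", sorted_bam, "-T", tmp_prefix, bam],
        # Extra step (not in the notebook): index the sorted BAM for viewers.
        ["samtools", "index", sorted_bam],
    ]


def run_steps(steps):
    """Run each command in order, stopping at the first failure."""
    for cmd in steps:
        subprocess.run(cmd, check=True)


# Example (requires samtools on PATH):
# run_steps(sam_to_sorted_bam_steps(
#     "../analyses/Panopea_generosascaff-70k_gonadF_bowtie.sam",
#     tmp_prefix="TEMP_gonadF_bowtie-sorted"))
```

With `check=True`, a failed conversion raises an exception instead of letting the sort run on a missing or partial BAM.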


In [75]:
cd ../analyses/

/Users/shlaura3/Documents/SAFS/FISH_546_Bioinformatics/546-Bioinformatics/2016-10_Geo-Ann-Project/analyses


In [76]:
ls

2016-11-3_Geo70k-scaff-transcrip-sequences.tabular
GeoTrans-PhelDiffExp_blasted_sorted
P-Generosa_IGV.xml
Panopea_generosascaff-70k_gonadF_bowtie-sorted.bam
Panopea_generosascaff-70k_gonadF_bowtie.bam
Panopea_generosascaff-70k_gonadF_bowtie.sam
Phel_DEGlist.tab
RepeatMasker/
SeaStarDiffExp.R
pgenerosa-scaff70-miRNA.tab
pgenerosa-transcrv3-blastn-scaff70k-01.gff.fai
pgenerosa-transcrv3-blastn-scaff70k-01.tab
pgenerosa-transcrv3-blastn-scaff70k-01.tab.fai


In [79]:
! samtools mpileup --help

samtools: unrecognized option `--help'

Usage: samtools mpileup [options] in1.bam [in2.bam [...]]

Input options:
  -6, --illumina1.3+      quality is in the Illumina-1.3+ encoding
  -A, --count-orphans     do not discard anomalous read pairs
  -b, --bam-list FILE     list of input BAM filenames, one per line
  -B, --no-BAQ            disable BAQ (per-Base Alignment Quality)
  -C, --adjust-MQ INT     adjust mapping quality; recommended:50, disable:0 [0]
  -d, --max-depth INT     max per-file depth; avoids excessive memory usage [250]
  -E, --redo-BAQ          recalculate BAQ on the fly, ignore existing BQs
  -f, --fasta-ref FILE    faidx indexed reference sequence file
  -G, --exclude-RG FILE   exclude read groups listed in FILE
  -l, --positions FILE    skip unlisted positions (chr pos) or regions (BED)
  -q, --min-MQ INT        skip alignments with mapQ smaller than INT [0]
  -Q, --min-BQ INT        skip bases with baseQ/BAQ smaller than INT [13]
  -r, --region REG  

In [1]:
! samtools mpileup -u -t DP \
-f ../data/Panopea_generosa_scaff-70k.fasta \
../analyses/Panopea_generosascaff-70k_gonadF_bowtie-sorted.bam \
-o ../analyses/Panopea_generosascaff-70k_gonadF_mpileup

[E::hts_open_format] fail to open file '../analyses/Panopea_generosascaff-70k_gonadF_bowtie-sorted.bam'
[mpileup] failed to open ../analyses/Panopea_generosascaff-70k_gonadF_bowtie-sorted.bam: No such file or directory
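The error only says that the BAM could not be opened at that relative path. Earlier cells changed directory with `cd`, so one cheap check before rerunning is to see how each relative path resolves from the current working directory. The sketch below is a generic check and not a diagnosis of this failure. `missing_inputs` is a hypothetical helper name.

```python
import os


def missing_inputs(paths, cwd=None):
    """Map each input path that is not an existing file to its resolved location.

    Relative paths are resolved against `cwd` (default: the current directory),
    which shows exactly where a tool would have looked.
    """
    base = os.getcwd() if cwd is None else cwd
    report = {}
    for path in paths:
        resolved = os.path.normpath(os.path.join(base, path))
        if not os.path.isfile(resolved):
            report[path] = resolved
    return report


# Example with the two inputs of the mpileup call above:
# for given, resolved in missing_inputs([
#         "../data/Panopea_generosa_scaff-70k.fasta",
#         "../analyses/Panopea_generosascaff-70k_gonadF_bowtie-sorted.bam",
# ]).items():
#     print(f"{given} -> not found at {resolved}")
```

An empty result means both files are visible from the current directory, in which case the cause lies elsewhere.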


In [2]:
ls ../data/

Geo-70k-scaff-annotations-merged-table.tabular
Geo-v3-join-uniprot-all0916-condensed.tab
Geoduck-transcriptome-v3.fa.zip
Geoduck-transcriptome-v3.fa_db.nhr
Geoduck-transcriptome-v3.fa_db.nin
Geoduck-transcriptome-v3.fa_db.nsq
Panopea_generosa_No-Line-Breaks.scafSeq.fai
Panopea_generosa_ScafSeq.genome
Panopea_generosa_scaff-70k.fasta
Panopea_generosa_scaff-70k.fasta.fai
Panopea_generosa_scaff-70k.scafSeq
Panopea_generosa_scaff-70k.scafSeq.fai
Panopea_generosa_scaff-70k_bowtie2-index.1.bt2
Panopea_generosa_scaff-70k_bowtie2-index.2.bt2
Panopea_generosa_scaff-70k_bowtie2-index.3.bt2
Panopea_generosa_scaff-70k_bowtie2-index.4.bt2
Panopea_generosa_scaff-70k_bowtie2-index.rev.1.bt2
Panopea_generosa_scaff-70k_bowtie2-index.rev.2.bt2
Panopea_generosa_scaff-70k_db.nhr
Panopea_generosa_scaff-70k_db.nin
Panopea_generosa_scaff-70k_db.nsq
Phel_countdata.txt
Phel_transcriptome.fasta
hairpin.fa
