# Downloading Reference Genome and Whole Genome SRA (Sequence Read Archive) Data for Downstream Use in FreeBayes

## 1. Download reference genome

[Papio anubis](https://www.ncbi.nlm.nih.gov/genome/394?genome_assembly_id=324755) genome information on NCBI

[Additional downloads for Papio anubis](ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/264/685/GCF_000264685.3_Panu_3.0/) - e.g. genome annotations, etc.

In [15]:
%%bash
# save ftp download link as a variable
refpapio="ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/264/685/GCF_000264685.3_Panu_3.0/GCF_000264685.3_Panu_3.0_genomic.fna.gz"

# make directory for storing reference file
mkdir -p /moto/eaton/projects/macaques/refpapio

# download file to dir
curl -Lk "$refpapio" -o /moto/eaton/projects/macaques/refpapio/refpapio.fna.gz

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
  0     0    0     0    0     0      0      0 --:--:--  0:00:01 --:--:--     0
  0     0    0     0    0     0      0      0 --:--:--  0:00:02 --:--:--     0
  0  869M    0  425k    0     0   118k      0  2:05:04  0:00:03  2:05:01  118k
  0  869M    0 1365k    0     0   305k      0  0:48:37  0:00:04  0:48:33  304k
  0  869M    0 2522k    0     0   460k      0  0:32:12  0:00:05  0:32:07  523k
  0  869M    0 3796k    0     0   585k      0  0:25:21  0:00:06  0:25:15  768k
  0  869M    0 4763k    0     0   637k      0  0:23:16  0:00:07  0:23:09  954k
  0  869M    0 5896k    0     0   695k      0  0:21:18  0:00:08  0:21:10 1119k
  0  869M    0 6766k    0     0   710k      0  0:20

In [1]:
ls /moto/eaton/projects/macaques/refpapio

[0m[38;5;9mrefpapio.fna.gz[0m


## 2. SRA File Download Using [sratools](https://github.com/ncbi/sra-tools) (`conda install -c bioconda sra-tools`)

Load the CSV of runs to download. An SRR value is NaN when the data are either not available on NCBI or the genome data are spread across multiple runs:

In [7]:
import pandas as pd
import os

In [8]:
df = pd.read_csv("./data/SRA-table.csv")
df[["Species", "Group", "SRR", "BioSample", "Sample", "Study", "PRJ"]]

| | Species | Group | SRR | BioSample | Sample | Study | PRJ |
|---|---|---|---|---|---|---|---|
| 0 | Macaca mulatta northern | mulatta | SRR4454026 | SAMN05883679 | SRS1762015 | SRP092140 | PRJNA345528 |
| 1 | Macaca mulatta southern low altitude | mulatta | SRR4454020 | SAMN05883709 | SRS1762009 | SRP092140 | PRJNA345529 |
| 2 | Macaca mulatta southern high altitude | mulatta | SRR4453966 | SAMN05883736 | SRS1761955 | SRP092140 | PRJNA345530 |
| 3 | Macaca mulatta Indian | mulatta | SRR5628058 | SAMN07168901 | SRS2238957 | SRP049547 | PRJNA251548 |
| 4 | Macaca fascicularis northern | fascicularis | NaN | SAMN00116341 | SRS117874 | SRP045755 | PRJNA51411 |
| 5 | Macaca fascicularis southern | fascicularis | NaN | SAMD00006158 | DRS000787 | DRP000438 | PRJDB2038 |
| 6 | Macaca fuscata | mulatta | DRR002233 | SAMD00011919 | DRS001583 | DRP000620 | PRJDB2459 |
| 7 | Macaca thibethana | sinica | SRR1024051 | SAMN02390221 | SRS498543 | SRP032525 | PRJNA226187 |
| 8 | Macaca assamensis | sinica | SRR2981114 | SAMN04316321 | SRS1196892 | SRP067118 | PRJNA305009 |
| 9 | Macaca arctoides | fascicularis | SRR2981139 | SAMN04316319 | SRS1196879 | SRP067118 | PRJNA305009 |
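Only rows with a run accession can be downloaded directly, so it helps to pull those out first. The sketch below does this with the standard library only, assuming the CSV has `Species` and `SRR` columns as shown in the table and that a missing accession is stored as an empty cell. The function name `runs_with_accession` is hypothetical. With the DataFrame already loaded, `df.dropna(subset=["SRR"])` would be the pandas equivalent.

```python
import csv


def runs_with_accession(csv_path):
    """Return (Species, SRR) pairs for rows whose SRR cell is not empty."""
    with open(csv_path, newline="") as handle:
        return [
            (row["Species"], row["SRR"])
            for row in csv.DictReader(handle)
            # Treat a missing or whitespace-only cell as "no accession".
            if (row.get("SRR") or "").strip()
        ]
```

For the table above, calling `runs_with_accession("./data/SRA-table.csv")` should skip the two _Macaca fascicularis_ rows.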


For this study I process each sample one at a time, so I use plain bash commands. To download everything in one go, an easy option is ipyrad; see [here](https://eaton-lab.org/articles/sra-downloads/) for details. Because I'm working one sample at a time, I download each SRA file over FTP and then run fastq-dump locally. If disk space is limited, I recommend running fastq-dump directly on the SRR number rather than downloading the SRA files first.
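The per-sample cells below all run the same command with a different accession and output folder. As an illustration only, the sketch here builds that command from a small mapping of folder label to run accession, using the flags, thread count, and paths that appear in this notebook. The names `BASE`, `SAMPLES`, and `fastq_dump_command` are hypothetical. The commands are printed rather than run.

```python
# Project root used throughout this notebook.
BASE = "/moto/eaton/projects/macaques"

# Folder label -> run accession, taken from the table above.
SAMPLES = {
    "mulattanorthern": "SRR4454026",
    "mulattasouthernlow": "SRR4454020",
    "mulattasouthernhigh": "SRR4453966",
    "mulattaindian": "SRR5628058",
}


def fastq_dump_command(srr, label, threads=24, base=BASE):
    """Build the parallel-fastq-dump argument list for one run."""
    return [
        "parallel-fastq-dump",
        "--sra-id", srr,
        "--threads", str(threads),
        "--tmpdir", f"{base}/tmp",
        "--outdir", f"{base}/{label}",
        "--split-3", "--gzip", "--skip-technical",
        "--readids", "--dumpbase", "--clip",
    ]


for label, srr in SAMPLES.items():
    print(" ".join(fastq_dump_command(srr, label)))
```

To execute a command instead of printing it, the list can be passed to `subprocess.run(cmd, check=True)`, which raises an error if the tool exits with a non-zero status.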

1) Northern _Macaca mulatta_

In [6]:
##!conda install -c bioconda parallel-fastq-dump

In [3]:
!mkdir -p /moto/eaton/projects/macaques/mulattanorthern

In [5]:
%time !parallel-fastq-dump --sra-id SRR4454026 --threads 24 --tmpdir /moto/eaton/projects/macaques/tmp --outdir /moto/eaton/projects/macaques/mulattanorthern --split-3 --gzip --skip-technical --readids --dumpbase --clip

SRR ids: ['SRR4454026']
extra args: ['--split-3', '--gzip', '--skip-technical', '--readids', '--dumpbase', '--clip']
tempdir: /moto/eaton/projects/macaques/tmp/pfd_28ievza1
SRR4454026 spots: 169608187
blocks: [[1, 7067007], [7067008, 14134014], [14134015, 21201021], [21201022, 28268028], [28268029, 35335035], [35335036, 42402042], [42402043, 49469049], [49469050, 56536056], [56536057, 63603063], [63603064, 70670070], [70670071, 77737077], [77737078, 84804084], [84804085, 91871091], [91871092, 98938098], [98938099, 106005105], [106005106, 113072112], [113072113, 120139119], [120139120, 127206126], [127206127, 134273133], [134273134, 141340140], [141340141, 148407147], [148407148, 155474154], [155474155, 162541161], [162541162, 169608187]]
Read 7067007 spots for SRR4454026
Written 7067007 spots for SRR4454026
Read 7067007 spots for SRR4454026
Written 7067007 spots for SRR4454026
Read 7067007 spots for SRR4454026
Written 7067007 spots for SRR4454026
Read 7067007 spots for SRR4454026
Writt
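The `blocks` line in the log shows how the 169608187 spots are divided among the 24 threads: 23 blocks of 7067007 spots each, with the last block absorbing the 19 left over. The sketch below reproduces those ranges. It is inferred from the logged numbers, so treat it as an assumption about the splitting rule, not as the tool's actual code. The function name `split_spots` is hypothetical.

```python
def split_spots(total_spots, n_blocks):
    """Split spots 1..total_spots into n_blocks contiguous [start, end] ranges.

    Every block gets total_spots // n_blocks spots, and the last block
    additionally takes whatever remainder is left.
    """
    size = total_spots // n_blocks
    blocks = []
    for i in range(n_blocks):
        start = i * size + 1
        end = total_spots if i == n_blocks - 1 else (i + 1) * size
        blocks.append([start, end])
    return blocks


blocks = split_spots(169608187, 24)
print(blocks[0], blocks[-1])
```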

In [9]:
os.system('mv /moto/eaton/projects/macaques/mulattanorthern/SRR4454026_1.fastq.gz /moto/eaton/projects/macaques/mulattanorthern/mulattanorthernSRR4454026_1.fastq.gz')

0

In [10]:
##renaming to something more human-readable
os.system('mv /moto/eaton/projects/macaques/mulattanorthern/SRR4454026_2.fastq.gz /moto/eaton/projects/macaques/mulattanorthern/mulattanorthernSRR4454026_2.fastq.gz')

0
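The two `mv` calls above differ only in the mate number, so they can be folded into one helper. This is a sketch using `pathlib`, not what the notebook actually ran. It assumes the output files follow the `<SRR>_1.fastq.gz` and `<SRR>_2.fastq.gz` naming seen above. The function name `add_label_prefix` is hypothetical.

```python
from pathlib import Path


def add_label_prefix(outdir, label, srr):
    """Rename <srr>_1/_2.fastq.gz to <label><srr>_1/_2.fastq.gz inside outdir.

    Files that are not present are skipped; the new paths of the files
    that were renamed are returned.
    """
    renamed = []
    for mate in (1, 2):
        src = Path(outdir) / f"{srr}_{mate}.fastq.gz"
        dst = src.with_name(f"{label}{src.name}")
        if src.exists():
            src.rename(dst)
            renamed.append(dst)
    return renamed
```

For the first sample this would be called as `add_label_prefix("/moto/eaton/projects/macaques/mulattanorthern", "mulattanorthern", "SRR4454026")`.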

2) Southern low altitude _Macaca mulatta_

In [None]:
!mkdir -p /moto/eaton/projects/macaques/mulattasouthernlow

In [None]:
%time !parallel-fastq-dump --sra-id SRR4454020 --threads 24 --tmpdir /moto/eaton/projects/macaques/tmp --outdir /moto/eaton/projects/macaques/mulattasouthernlow --split-3 --gzip --skip-technical --readids --dumpbase --clip

In [None]:
os.system('mv /moto/eaton/projects/macaques/mulattasouthernlow/SRR4454020_1.fastq.gz /moto/eaton/projects/macaques/mulattasouthernlow/mulattasouthernlowSRR4454020_1.fastq.gz')

In [None]:
os.system('mv /moto/eaton/projects/macaques/mulattasouthernlow/SRR4454020_2.fastq.gz /moto/eaton/projects/macaques/mulattasouthernlow/mulattasouthernlowSRR4454020_2.fastq.gz')

3) Southern high altitude _Macaca mulatta_

In [None]:
!mkdir -p /moto/eaton/projects/macaques/mulattasouthernhigh

In [None]:
%time !parallel-fastq-dump --sra-id SRR4453966 --threads 24 --tmpdir /moto/eaton/projects/macaques/tmp --outdir /moto/eaton/projects/macaques/mulattasouthernhigh --split-3 --gzip --skip-technical --readids --dumpbase --clip

In [None]:
os.system('mv /moto/eaton/projects/macaques/mulattasouthernhigh/SRR4453966_1.fastq.gz /moto/eaton/projects/macaques/mulattasouthernhigh/mulattasouthernhighSRR4453966_1.fastq.gz')

In [None]:
os.system('mv /moto/eaton/projects/macaques/mulattasouthernhigh/SRR4453966_2.fastq.gz /moto/eaton/projects/macaques/mulattasouthernhigh/mulattasouthernhighSRR4453966_2.fastq.gz')

4) Indian _Macaca mulatta_ (lab specimen)

In [None]:
!mkdir -p /moto/eaton/projects/macaques/mulattaindian

In [None]:
%time !parallel-fastq-dump --sra-id SRR5628058 --threads 24 --tmpdir /moto/eaton/projects/macaques/tmp --outdir /moto/eaton/projects/macaques/mulattaindian --split-3 --gzip --skip-technical --readids --dumpbase --clip

In [None]:
os.system('mv /moto/eaton/projects/macaques/mulattaindian/SRR5628058_1.fastq.gz /moto/eaton/projects/macaques/mulattaindian/mulattaindianSRR5628058_1.fastq.gz')

In [None]:
os.system('mv /moto/eaton/projects/macaques/mulattaindian/SRR5628058_2.fastq.gz /moto/eaton/projects/macaques/mulattaindian/mulattaindianSRR5628058_2.fastq.gz')
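Before passing the reads to mapping and FreeBayes, it may be worth confirming that the two mate files of each sample hold the same number of reads. With `--split-3`, paired reads should land in the `_1` and `_2` files, so their counts are expected to match. The sketch below is an optional check under the assumption that every FASTQ record spans exactly four lines. The function names `count_fastq_reads` and `mates_match` are hypothetical.

```python
import gzip


def count_fastq_reads(path):
    """Count reads in a gzipped FASTQ file, assuming 4 lines per record."""
    with gzip.open(path, "rt") as handle:
        n_lines = sum(1 for _ in handle)
    return n_lines // 4


def mates_match(path_1, path_2):
    """Return True if both mate files contain the same number of reads."""
    return count_fastq_reads(path_1) == count_fastq_reads(path_2)
```

Each call decompresses the whole file, so on whole-genome data this takes a while per sample.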