# Downloading Reference Genome and Whole Genome SRA (Sequence Read Archive) Data for Downstream Use in FreeBayes

## 1. Download reference genome

[Papio anubis](https://www.ncbi.nlm.nih.gov/assembly/GCA_000264685.2) genome information on NCBI

[Additional downloads for Papio anubis](ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/008/728/515/GCF_008728515.1_Panubis1.0) - e.g. genome annotations, etc.

In [1]:
%%bash
# save ftp download link as a variable
refpapio="ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/008/728/515/GCF_008728515.1_Panubis1.0/GCF_008728515.1_Panubis1.0_genomic.fna.gz"

# make directory for storing reference file
mkdir -p /moto/eaton/projects/macaques/refpapio

# download file to dir
curl -Lk $refpapio -o /moto/eaton/projects/macaques/refpapio/refpapio.fna.gz


In [1]:
ls /moto/eaton/projects/macaques/refpapio

[0m[38;5;9mrefpapio.fna.gz[0m


## 2. SRA File Download Using [sratools](https://github.com/ncbi/sra-tools) (`conda install -c bioconda sra-tools`)

Open the CSV of runs to download. A NaN in the SRR column means either that the data are not available on NCBI or that the sample's genome data are spread across multiple runs:

In [1]:
import pandas as pd
import os

In [5]:
df = pd.read_csv("./data/SRA-table.csv")
df[["Species", "Group", "SRR", "BioSample", "Sample", "Study", "PRJ"]]

Unnamed: 0,Species,Group,SRR,BioSample,Sample,Study,PRJ
0,Macaca mulatta northern low altitude,mulatta,SRR4454026,SAMN05883679,SRS1762015,SRP092140,PRJNA345528
1,Macaca mulatta southern high altitude,mulatta,SRR4454020,SAMN05883709,SRS1762009,SRP092140,PRJNA345529
2,Macaca mulatta southern low altitude,mulatta,SRR4453966,SAMN05883736,SRS1761955,SRP092140,PRJNA345530
3,Macaca mulatta Indian,mulatta,SRR5628058,SAMN07168901,SRS2238957,SRP049547,PRJNA251548
4,Macaca fascicularis northern,fascicularis,,SAMN00116341,SRS117874,SRP045755,PRJNA51411
5,Macaca fascicularis southern,fascicularis,SRR445713,SAMN00811240,SRS300124,SRP011089,PRJNA20409
6,Macaca fuscata,mulatta,DRR002233,SAMD00011919,DRS001583,DRP000620,PRJDB2459
7,Macaca fuscata,mulatta,,SAMD00013516,DRS002017,DRP000657,PRJDB2648
8,Macaca thibethana,sinica,SRR1024051,SAMN02390221,SRS498543,SRP032525,PRJNA226187
9,Macaca assamensis,sinica,SRR2981114,SAMN04316321,SRS1196892,SRP067118,PRJNA305009
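
As a minimal sketch of how the table separates into the two cases described above, the rows can be split on whether `SRR` is missing. The helper name `split_by_srr` and the two-row demo frame are assumptions for illustration; the demo values are copied from rows 3 and 4 of the table.

```python
import pandas as pd


def split_by_srr(table):
    """Return (rows with a single SRR accession, rows whose SRR is missing)."""
    has_run = table["SRR"].notna()
    return table[has_run], table[~has_run]


# Two rows copied from the table above; None stands in for the NaN entry.
demo = pd.DataFrame({
    "Species": ["Macaca mulatta Indian", "Macaca fascicularis northern"],
    "SRR": ["SRR5628058", None],
    "Sample": ["SRS2238957", "SRS117874"],
})
single, pending = split_by_srr(demo)
print(pending["Sample"].tolist())  # samples that still need a run lookup
```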


In [None]:
##Download individuals that have a single SRR accession.
##Rows with NaN in SRR need more attention, so we handle them separately.
for i in df["SRR"]:
    if isinstance(i, str):
        ##the .sra file sits under <first 3 chars>/<first 6 chars>/<accession>/
        cmd = ('wget -P /moto/eaton/projects/macaques/SRA/ '
               'ftp://ftp-trace.ncbi.nih.gov/sra/sra-instant/reads/ByRun/sra/'
               + i[0:3] + '/' + i[0:6] + '/' + i + '/' + i + '.sra')
        os.system(cmd)
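
The path layout used in the loop above (accession prefix, first six characters, then the full accession) can be isolated in a small function, which makes it easy to inspect a URL before downloading. This is only a sketch: `sra_ftp_url` is a hypothetical name, and it simply reproduces the URL pattern from the cell above. It does not check whether that FTP location is still served.

```python
SRA_FTP_BASE = "ftp://ftp-trace.ncbi.nih.gov/sra/sra-instant/reads/ByRun/sra"


def sra_ftp_url(run):
    """Build the .sra URL for a run accession using the layout from the cell above."""
    # e.g. SRR4454026 -> SRR / SRR445 / SRR4454026 / SRR4454026.sra
    return f"{SRA_FTP_BASE}/{run[0:3]}/{run[0:6]}/{run}/{run}.sra"


print(sra_ftp_url("SRR4454026"))
```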

In [6]:
!wget -O /moto/eaton/projects/macaques/SRA/fasno/SRS117874.csv 'http://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?save=efetch&db=sra&rettype=runinfo&term=SRS117874'

--2019-12-31 23:35:16--  http://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?save=efetch&db=sra&rettype=runinfo&term=SRS117874
Resolving trace.ncbi.nlm.nih.gov (trace.ncbi.nlm.nih.gov)... 130.14.29.113
Connecting to trace.ncbi.nlm.nih.gov (trace.ncbi.nlm.nih.gov)|130.14.29.113|:80... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?save=efetch&db=sra&rettype=runinfo&term=SRS117874 [following]
--2019-12-31 23:35:16--  https://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?save=efetch&db=sra&rettype=runinfo&term=SRS117874
Connecting to trace.ncbi.nlm.nih.gov (trace.ncbi.nlm.nih.gov)|130.14.29.113|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [application/octet-stream]
Saving to: '/moto/eaton/projects/macaques/SRA/fasno/SRS117874.csv'

    [  <=>                                  ] 72,001       192KB/s   in 0.4s   

2019-12-31 23:35:17 (192 KB/s) - '/moto/eaton/projects/macaq

In [6]:
!wget -O /moto/eaton/projects/macaques/SRA/fuscata/DRS002017.csv 'http://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?save=efetch&db=sra&rettype=runinfo&term=DRS002017'

--2019-12-27 00:12:49--  http://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?save=efetch&db=sra&rettype=runinfo&term=DRS002017
Resolving trace.ncbi.nlm.nih.gov (trace.ncbi.nlm.nih.gov)... 130.14.29.113
Connecting to trace.ncbi.nlm.nih.gov (trace.ncbi.nlm.nih.gov)|130.14.29.113|:80... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?save=efetch&db=sra&rettype=runinfo&term=DRS002017 [following]
--2019-12-27 00:12:49--  https://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?save=efetch&db=sra&rettype=runinfo&term=DRS002017
Connecting to trace.ncbi.nlm.nih.gov (trace.ncbi.nlm.nih.gov)|130.14.29.113|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [application/octet-stream]
Saving to: '/moto/eaton/projects/macaques/SRA/fuscata/DRS002017.csv'

    [ <=>                                   ] 1,501       --.-K/s   in 0s      

2019-12-27 00:12:49 (7.70 MB/s) - '/moto/eaton/projects/ma

In [7]:
!wget -O /moto/eaton/projects/macaques/SRA/nemestrina/SRS4092093.csv 'http://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?save=efetch&db=sra&rettype=runinfo&term=SRS4092093'

--2019-12-27 00:12:56--  http://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?save=efetch&db=sra&rettype=runinfo&term=SRS4092093
Resolving trace.ncbi.nlm.nih.gov (trace.ncbi.nlm.nih.gov)... 130.14.29.113
Connecting to trace.ncbi.nlm.nih.gov (trace.ncbi.nlm.nih.gov)|130.14.29.113|:80... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?save=efetch&db=sra&rettype=runinfo&term=SRS4092093 [following]
--2019-12-27 00:12:56--  https://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?save=efetch&db=sra&rettype=runinfo&term=SRS4092093
Connecting to trace.ncbi.nlm.nih.gov (trace.ncbi.nlm.nih.gov)|130.14.29.113|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [application/octet-stream]
Saving to: '/moto/eaton/projects/macaques/SRA/nemestrina/SRS4092093.csv'

    [ <=>                                   ] 4,681       --.-K/s   in 0.007s  

2019-12-27 00:12:56 (639 KB/s) - '/moto/eaton/proje
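
The three `wget` calls above differ only in the sample accession passed as `term`. A minimal sketch of building that runinfo URL in Python follows; `runinfo_url` is a hypothetical helper, and the endpoint and query parameters are copied from the redirected `https` URL in the logs above.

```python
from urllib.parse import urlencode

RUNINFO_ENDPOINT = "https://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi"


def runinfo_url(accession):
    """Return the runinfo CSV URL for a sample accession such as SRS117874."""
    query = urlencode({
        "save": "efetch",
        "db": "sra",
        "rettype": "runinfo",
        "term": accession,
    })
    return f"{RUNINFO_ENDPOINT}?{query}"


for sample in ["SRS117874", "DRS002017", "SRS4092093"]:
    print(runinfo_url(sample))
```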

In [2]:
df1 = pd.read_csv("/moto/eaton/projects/macaques/SRA/fasno/SRS117874.csv")
df1

Unnamed: 0,Run,ReleaseDate,LoadDate,spots,bases,spots_with_mates,avgLength,size_MB,AssemblyName,download_path,...,Affection_Status,Analyte_Type,Histological_Type,Body_Site,CenterName,Submission,dbgap_study_accession,Consent,RunHash,ReadHash
0,SRR069635,2011-10-14 06:23:44,2014-05-28 08:49:15,7043321,619812248,7043321,88,286,,https://sra-downloadb.be-md.ncbi.nlm.nih.gov/s...,...,,,,,BGI,SRA023855,,public,60D8544B4A6ED0CF191794E5649D02C9,4797619F7942C163B4CADE0C534D82CE
1,SRR069636,2011-10-14 06:23:44,2014-05-28 08:49:38,7500713,660062744,7500713,88,312,,https://sra-downloadb.be-md.ncbi.nlm.nih.gov/s...,...,,,,,BGI,SRA023855,,public,2AE28E669412DDA12B91DE333A01117B,4167A7A269D6DFA8E17E62D34E8129E7
2,SRR069637,2011-10-14 06:23:44,2014-05-28 08:49:51,7530557,662689016,7530557,88,321,,https://sra-downloadb.be-md.ncbi.nlm.nih.gov/s...,...,,,,,BGI,SRA023855,,public,4CDEF55C6D8D7CECB69C9BF41340E814,B16391E509937ADE17B788478392EC1B
3,SRR069638,2011-10-14 06:23:44,2014-05-28 08:48:13,1372576,120786688,1372576,88,51,,https://sra-downloadb.be-md.ncbi.nlm.nih.gov/s...,...,,,,,BGI,SRA023855,,public,7975D1FF84E8CA0163A4D222E42896EA,8E2382A431AF23DA94DEF11D9BEA39DB
4,SRR069639,2011-10-14 06:23:44,2014-05-28 08:49:09,6607678,581475664,6607678,88,220,,https://sra-downloadb.be-md.ncbi.nlm.nih.gov/s...,...,,,,,BGI,SRA023855,,public,9DF27FBA2FC54F15D18CCE7D95F236D1,701DE2C5BE26AB87ED135E0CA38C998D
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
148,SRR069783,2011-10-14 06:23:44,2014-05-28 09:07:59,17780797,1564710136,17780797,88,1032,,https://sra-downloadb.be-md.ncbi.nlm.nih.gov/s...,...,,,,,BGI,SRA023855,,public,772A9B85A03248EE5885D3EA0C8354DF,99D9A0316984CB7E6F818699B4E626F1
149,SRR069784,2011-10-14 06:23:44,2014-05-28 09:15:21,17538928,1543425664,17538928,88,1013,,https://sra-downloadb.be-md.ncbi.nlm.nih.gov/s...,...,,,,,BGI,SRA023855,,public,73A22C3B456B4FFEE928E5393530F8AF,61B566FE78E0C9F8758682B1573F5976
150,SRR069785,2011-10-14 06:23:44,2014-05-28 09:07:36,17196731,1513312328,17196731,88,988,,https://sra-downloadb.be-md.ncbi.nlm.nih.gov/s...,...,,,,,BGI,SRA023855,,public,D47C2DEB25B952F1C7F005935DB20F13,E03CB5EC39BCA051A051706584F7799C
151,SRR069786,2011-10-14 06:23:44,2014-05-28 09:07:19,17323022,1524425936,17323022,88,1004,,https://sra-downloadb.be-md.ncbi.nlm.nih.gov/s...,...,,,,,BGI,SRA023855,,public,FFD5A00F18F5865EF1277DAF2889FE72,7D134B08D290D9D408D29EF9394D7237


In [4]:
df2 = pd.read_csv("/moto/eaton/projects/macaques/SRA/fuscata/DRS002017.csv")
df2

Unnamed: 0,Run,ReleaseDate,LoadDate,spots,bases,spots_with_mates,avgLength,size_MB,AssemblyName,download_path,...,Affection_Status,Analyte_Type,Histological_Type,Body_Site,CenterName,Submission,dbgap_study_accession,Consent,RunHash,ReadHash
0,DRR002520,2012-09-11 22:55:10,2012-09-11 22:54:23,104263305,21061187610,104263305,202,11745,,https://sra-downloadb.be-md.ncbi.nlm.nih.gov/s...,...,,,,,NIG,DRA000626,,public,336A17E624D47AECEC2AB84FAEFA9673,03DDBDE57DADADDBC6B538EC1A0CB02F
1,DRR002521,2012-09-11 22:51:29,2012-09-11 22:50:19,104810231,21171666662,104810231,202,11751,,https://sra-downloadb.be-md.ncbi.nlm.nih.gov/s...,...,,,,,NIG,DRA000626,,public,8253031C10DF7EFDD8D960E72BE65EE1,2908C7DBF5E5102C92B3F7404623A8EC


In [5]:
df3 = pd.read_csv("/moto/eaton/projects/macaques/SRA/nemestrina/SRS4092093.csv")
df3

Unnamed: 0,Run,ReleaseDate,LoadDate,spots,bases,spots_with_mates,avgLength,size_MB,AssemblyName,download_path,...,Affection_Status,Analyte_Type,Histological_Type,Body_Site,CenterName,Submission,dbgap_study_accession,Consent,RunHash,ReadHash
0,SRR8262014,2018-11-29 14:50:11,2018-11-29 14:23:37,42052680,12615804000,42052680,300,5053,,https://sra-downloadb.be-md.ncbi.nlm.nih.gov/s...,...,,,,,"UNIVERSITY OF CALIFORNIA, SAN FRANCISCO",SRA816674,,public,BC521E2A15CCA4306FC1FD25FA26A8ED,EAB511C8BCECE99CB64A0496FC0197CF
1,SRR8262021,2018-11-29 14:39:13,2018-11-29 14:20:58,41833704,12550111200,41833704,300,5043,,https://sra-downloadb.be-md.ncbi.nlm.nih.gov/s...,...,,,,,"UNIVERSITY OF CALIFORNIA, SAN FRANCISCO",SRA816674,,public,D44E91EA22D339B2FAD74B685DE5EE5C,0491E39D1C94EFF1729BDB335003B667
2,SRR8262038,2018-11-29 14:39:13,2018-11-29 14:18:50,41295410,12388623000,41295410,300,5086,,https://sra-downloadb.be-md.ncbi.nlm.nih.gov/s...,...,,,,,"UNIVERSITY OF CALIFORNIA, SAN FRANCISCO",SRA816674,,public,28EE022F08B2E442645A428AD57A3EC4,5921776F96336D4DA081EC9CB46BC71E
3,SRR8262039,2018-11-29 15:00:13,2018-11-29 14:26:43,42829769,12848930700,42829769,300,5100,,https://sra-downloadb.be-md.ncbi.nlm.nih.gov/s...,...,,,,,"UNIVERSITY OF CALIFORNIA, SAN FRANCISCO",SRA816674,,public,E830477EC01C7AA4E316DC615DDF8E9E,9934484AE7F5B481B933C2F070C9A30C
4,SRR8262042,2018-11-29 14:39:12,2018-11-29 14:15:15,41889659,12566897700,41889659,300,5071,,https://sra-downloadb.be-md.ncbi.nlm.nih.gov/s...,...,,,,,"UNIVERSITY OF CALIFORNIA, SAN FRANCISCO",SRA816674,,public,8BBEB635C0EE9BB792C14257FEA15E39,4C275316A3176637C92B2D78DB32E1B7
5,SRR8262043,2018-11-29 15:42:12,2018-11-29 15:33:40,41380325,12414097500,41380325,300,5087,,https://sra-downloadb.be-md.ncbi.nlm.nih.gov/s...,...,,,,,"UNIVERSITY OF CALIFORNIA, SAN FRANCISCO",SRA816674,,public,DFFB32EFA4E6457A6E7B4AA916160B71,72C4B4D061A770001C8509D55E2BB6DB
6,SRR8262044,2018-11-29 15:16:13,2018-11-29 14:40:04,78059738,23417921400,78059738,300,9280,,https://sra-downloadb.be-md.ncbi.nlm.nih.gov/s...,...,,,,,"UNIVERSITY OF CALIFORNIA, SAN FRANCISCO",SRA816674,,public,911A5FB5ED79278A75C0FC1A70B27C41,7F0B2D197AC0FA1FB88E41AAB6FA718B
7,SRR8262045,2018-11-29 15:14:12,2018-11-29 14:36:08,76601314,22980394200,76601314,300,9187,,https://sra-downloadb.be-md.ncbi.nlm.nih.gov/s...,...,,,,,"UNIVERSITY OF CALIFORNIA, SAN FRANCISCO",SRA816674,,public,CB8A558E61838D5D496B76B6AA70B42D,B58C7BE19054FE33CE30D5F5F1DB7DDD
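
Before converting the runs to FASTQ, it can help to total the `bases` and `size_MB` columns of a runinfo table to estimate disk needs. The sketch below assumes only the `Run`, `bases`, and `size_MB` columns shown above; `summarize_runs` is a hypothetical helper and the demo frame copies the two fuscata rows from `df2`.

```python
import pandas as pd


def summarize_runs(runinfo):
    """Total the number of runs, sequenced bases, and .sra size for one sample."""
    return {
        "runs": int(runinfo["Run"].nunique()),
        "total_bases": int(runinfo["bases"].sum()),
        "total_size_MB": int(runinfo["size_MB"].sum()),
    }


# Values copied from the two rows of DRS002017.csv displayed above.
demo = pd.DataFrame({
    "Run": ["DRR002520", "DRR002521"],
    "bases": [21061187610, 21171666662],
    "size_MB": [11745, 11751],
})
summary = summarize_runs(demo)
print(summary)
```

The uncompressed FASTQ files are typically several times larger than the `.sra` size, so this total is a lower bound on the space to reserve.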


In [8]:
!mkdir -p /moto/eaton/projects/macaques/fastqdump
!mkdir -p /moto/eaton/projects/macaques/fastqdump/fasno
!mkdir -p /moto/eaton/projects/macaques/fastqdump/fuscata
!mkdir -p /moto/eaton/projects/macaques/fastqdump/nemestrina

In [None]:
##convert every run that has an SRR accession to FASTQ (the samples with NaN in SRR are handled in the cells below)
for i in df["SRR"]:
    if type(i) is str:
        cmd='fasterq-dump '+i+' \
            -O /moto/eaton/projects/macaques/fastqdump/ -t /moto/eaton/projects/macaques/tmp -e 8 \
            ; pigz /moto/eaton/projects/macaques/fastqdump/'+i+'_1.fastq \
            ; pigz /moto/eaton/projects/macaques/fastqdump/'+i+'_2.fastq'
        os.system(cmd)

In [None]:
for i in df1["Run"]:
    cmd='fasterq-dump '+i+' \
        -O /moto/eaton/projects/macaques/fastqdump/fasno/ -t /moto/eaton/projects/macaques/tmp -e 12 \
        ; pigz /moto/eaton/projects/macaques/fastqdump/fasno/'+i+'_1.fastq \
        ; pigz /moto/eaton/projects/macaques/fastqdump/fasno/'+i+'_2.fastq'
    os.system(cmd)

In [None]:
!scancel -u nsl2119

In [6]:
for i in df2["Run"]:
    cmd='fasterq-dump '+i+' \
        -O /moto/eaton/projects/macaques/fastqdump/fuscata/ -t /moto/eaton/projects/macaques/tmp -e 8 \
        ; pigz /moto/eaton/projects/macaques/fastqdump/fuscata/'+i+'_1.fastq \
        ; pigz /moto/eaton/projects/macaques/fastqdump/fuscata/'+i+'_2.fastq'
    os.system(cmd)

In [None]:
for i in df3["Run"]:
    cmd='fasterq-dump '+i+' \
        -O /moto/eaton/projects/macaques/fastqdump/nemestrina/ -t /moto/eaton/projects/macaques/tmp -e 8 \
        ; pigz /moto/eaton/projects/macaques/fastqdump/nemestrina/'+i+'_1.fastq \
        ; pigz /moto/eaton/projects/macaques/fastqdump/nemestrina/'+i+'_2.fastq'
    os.system(cmd)
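
The four loops above build the same `fasterq-dump` and `pigz` commands with different output directories. One possible refactoring is sketched below. It builds argument lists instead of concatenated strings and, unlike the `;`-chained shell string, stops at the first failing step because of `check=True`. The helper names `fasterq_commands` and `run_all` and the `dry_run` switch are assumptions for illustration; the flags `-O`, `-t`, and `-e` are the ones used above.

```python
import subprocess
from pathlib import Path


def fasterq_commands(run, outdir, tmpdir, threads=8):
    """Return the fasterq-dump command followed by one pigz command per mate file."""
    outdir = Path(outdir)
    dump = ["fasterq-dump", run, "-O", str(outdir), "-t", str(tmpdir), "-e", str(threads)]
    compress = [["pigz", str(outdir / f"{run}_{mate}.fastq")] for mate in (1, 2)]
    return [dump] + compress


def run_all(commands, dry_run=True):
    """Print the commands, or execute them and raise on the first failure."""
    for cmd in commands:
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)


cmds = fasterq_commands(
    "DRR002520",
    "/moto/eaton/projects/macaques/fastqdump/fuscata",
    "/moto/eaton/projects/macaques/tmp",
)
run_all(cmds)  # dry run: only prints the three commands
```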

In [1]:
##concatenate the forward (_1) and reverse (_2) reads, each separately, across all runs of the samples split over multiple runs, starting with fasno
!cat /moto/eaton/projects/macaques/fastqdump/fasno/*_1.fastq.gz > /moto/eaton/projects/macaques/fastqdump/fasno.sra_1.fastq.gz
!cat /moto/eaton/projects/macaques/fastqdump/fasno/*_2.fastq.gz > /moto/eaton/projects/macaques/fastqdump/fasno.sra_2.fastq.gz

In [1]:
!cat /moto/eaton/projects/macaques/fastqdump/fuscata/*_1.fastq.gz > /moto/eaton/projects/macaques/fastqdump/fuscata.sra_1.fastq.gz
!cat /moto/eaton/projects/macaques/fastqdump/fuscata/*_2.fastq.gz > /moto/eaton/projects/macaques/fastqdump/fuscata.sra_2.fastq.gz

In [2]:
!cat /moto/eaton/projects/macaques/fastqdump/nemestrina/*_1.fastq.gz > /moto/eaton/projects/macaques/fastqdump/nemestrina.sra_1.fastq.gz
!cat /moto/eaton/projects/macaques/fastqdump/nemestrina/*_2.fastq.gz > /moto/eaton/projects/macaques/fastqdump/nemestrina.sra_2.fastq.gz
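
Plain `cat` works on these files because a gzip file may contain several compressed members back to back, and decompressors read them as one stream. The sketch below reproduces that idea in Python on two tiny made-up runs (`runA`, `runB`); `concat_gzip` is a hypothetical helper. The inputs are sorted so that the `_1` and `_2` files would be joined in the same run order, which keeps read pairs in sync.

```python
import gzip
import shutil
import tempfile
from pathlib import Path


def concat_gzip(parts, dest):
    """Byte-concatenate gzip files in sorted order, like `cat a.gz b.gz > dest`."""
    parts = sorted(parts)  # fix the input list before the output file is created
    with open(dest, "wb") as out:
        for part in parts:
            with open(part, "rb") as src:
                shutil.copyfileobj(src, out)


# Offline demonstration with two made-up single-read FASTQ files.
with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    reads = {"runA": "@r1\nACGT\n+\nIIII\n", "runB": "@r2\nGGCC\n+\nIIII\n"}
    for run, record in reads.items():
        with gzip.open(tmp / f"{run}_1.fastq.gz", "wt") as fh:
            fh.write(record)
    merged = tmp / "merged.sra_1.fastq.gz"
    concat_gzip(tmp.glob("run*_1.fastq.gz"), merged)
    # Reading the merged file yields the records of both runs in order.
    with gzip.open(merged, "rt") as fh:
        merged_text = fh.read()

print(merged_text)
```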

In [3]:
##do the same for sylvanus and silenus (not processed above because these files were generated in this study and are not yet on NCBI)
!cat /moto/eaton/projects/macaques/sylvanus/*.R1.fastq.gz > /moto/eaton/projects/macaques/fastqdump/sylvanus.sra_1.fastq.gz
!cat /moto/eaton/projects/macaques/sylvanus/*.R2.fastq.gz > /moto/eaton/projects/macaques/fastqdump/sylvanus.sra_2.fastq.gz
!cat /moto/eaton/projects/macaques/silenus/*.R1.fastq.gz > /moto/eaton/projects/macaques/fastqdump/silenus.sra_1.fastq.gz
!cat /moto/eaton/projects/macaques/silenus/*.R2.fastq.gz > /moto/eaton/projects/macaques/fastqdump/silenus.sra_2.fastq.gz

In [9]:
##update the dataframe with the names fasno, fuscata, nemestrina, sylvanus, and silenus so we can refer to the merged files later:
df.at[4,'SRR']='fasno'
df.at[7,'SRR']='fuscata'
df.at[13,'SRR']='nemestrina'
df.at[16,'SRR']='sylvanus'
df.at[17,'SRR']='silenus'
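
The assignments above depend on integer row positions, which change if the CSV is reordered. A possible alternative, sketched here under the assumption that the `Sample` accession identifies each row, is to fill the missing `SRR` values by accession. `label_missing_runs` is a hypothetical helper; the accessions are the ones used in the runinfo downloads above. Samples without an accession (sylvanus and silenus) would need another key and are not covered by this sketch.

```python
import pandas as pd


def label_missing_runs(table, labels):
    """Fill missing SRR values with a short label, matched on the Sample accession."""
    out = table.copy()
    for sample, label in labels.items():
        mask = (out["Sample"] == sample) & out["SRR"].isna()
        out.loc[mask, "SRR"] = label
    return out


labels = {"SRS117874": "fasno", "DRS002017": "fuscata", "SRS4092093": "nemestrina"}

# Small demo frame; the first two rows are copied from the table above.
demo = pd.DataFrame({
    "SRR": ["SRR5628058", None, None],
    "Sample": ["SRS2238957", "SRS117874", "DRS002017"],
})
labelled = label_missing_runs(demo, labels)
print(labelled)
```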

In [22]:
df[["Species", "Group", "SRR", "BioSample", "Sample", "Study", "PRJ"]]

Unnamed: 0,Species,Group,SRR,BioSample,Sample,Study,PRJ
0,Macaca mulatta northern low altitude,mulatta,SRR4454026,SAMN05883679,SRS1762015,SRP092140,PRJNA345528
1,Macaca mulatta southern high altitude,mulatta,SRR4454020,SAMN05883709,SRS1762009,SRP092140,PRJNA345529
2,Macaca mulatta southern low altitude,mulatta,SRR4453966,SAMN05883736,SRS1761955,SRP092140,PRJNA345530
3,Macaca mulatta Indian,mulatta,SRR5628058,SAMN07168901,SRS2238957,SRP049547,PRJNA251548
4,Macaca fascicularis northern,fascicularis,fasno,SAMN00116341,SRS117874,SRP045755,PRJNA51411
5,Macaca fascicularis southern,fascicularis,SRR445713,SAMN00811240,SRS300124,SRP011089,PRJNA20409
6,Macaca fuscata,mulatta,DRR002233,SAMD00011919,DRS001583,DRP000620,PRJDB2459
7,Macaca fuscata,mulatta,fuscata,SAMD00013516,DRS002017,DRP000657,PRJDB2648
8,Macaca thibethana,sinica,SRR1024051,SAMN02390221,SRS498543,SRP032525,PRJNA226187
9,Macaca assamensis,sinica,SRR2981114,SAMN04316321,SRS1196892,SRP067118,PRJNA305009


In [23]:
df.to_csv(path_or_buf='/moto/eaton/projects/macaques/metadata.csv')
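
By default `to_csv` also writes the row index as an unnamed first column, which `read_csv` later loads as `Unnamed: 0` unless it is told to use that column as the index. The round trip can be checked in memory; this is a small illustration on a made-up one-row frame, not part of the original notebook.

```python
import io

import pandas as pd

demo = pd.DataFrame({"Species": ["Macaca fuscata"], "SRR": ["fuscata"]})

# Write to an in-memory buffer exactly as df.to_csv(path) would write to disk.
buf = io.StringIO()
demo.to_csv(buf)
text = buf.getvalue()

naive = pd.read_csv(io.StringIO(text))               # index comes back as a data column
clean = pd.read_csv(io.StringIO(text), index_col=0)  # index restored as the index

print(list(naive.columns))
print(list(clean.columns))
```

When reading `metadata.csv` in later steps, passing `index_col=0` keeps the column set identical to the dataframe saved here.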