![rmotr](https://user-images.githubusercontent.com/7065401/52071918-bda15380-2562-11e9-828c-7f95297e4a82.png)
<hr style="margin-bottom: 40px;">

# Python for Genomics 
## Section 6: Retrieving Sequences from NCBI

![purple-divider](https://user-images.githubusercontent.com/7065401/52071927-c1cd7100-2562-11e9-908a-dde91ba14e59.png)

More often than we'd like, we'll encounter sequences that are not associated with a BioProject.  

Maybe a paper didn't set up a BioProject.

Or you have a stash of accession numbers from which you need the sequence data.

It's useful to know how to efficiently (and responsibly) download large sets of sequences from NCBI's repositories.

---

This is the case for us in the second paper. 

To do our analysis, we need to download the sequences from accession numbers, which the authors have listed in an Excel spreadsheet. (This is actually more common than you might think.) 

### How do we go from accession numbers in an .xls ➡️ a file that contains fasta or genbank (or whatever...) data?

We can manually copy and paste each number into NCBI's database, click some buttons, and then copy and paste sequences into a single file.

But what if we have hundreds of accession numbers?

#### Let's see how to make this workflow more efficient 🐍💫

![green-divider](https://user-images.githubusercontent.com/7065401/52071924-c003ad80-2562-11e9-8297-1c6595f8a7ff.png)

Specifically, these are sequences derived from specimens acquired during what experts call the "rapid growth phase."

"The newly sequenced genomes have been submitted to GenBank. The accession numbers are provided in Supplementary Table 1."

This is an .xlsx file named 'genetic_diversity.xlsx', which is available in our `data/` folder.

### The logic for this workflow is:

1. Import the Excel worksheet into a dataframe
2. Create a list of accession numbers
3. Make a call to NCBI's servers by looping through list
4. Save sequences to a file
5. Use file for downstream applications or analysis here in the notebook


In [38]:
import numpy as np
import pandas as pd

In [39]:
df = pd.read_excel('data/genetic_diversity.xlsx')

In [40]:
df

| | LabID | Date | District | Town | Country | Accession Number |
|---|---|---|---|---|---|---|
| 0 | J0001 | 20140927 | WesternRural | Jui | SLE | KP759636 |
| 1 | J0002 | 20140927 | WesternUrban | Freetown | SLE | KP759640 |
| 2 | J0003 | 2014-09-28 00:00:00 | Portloko | MasimeraChiefdom | SLE | KP759651 |
| 3 | J0004 | 2014-09-28 00:00:00 | WesternRural | Cole | SLE | KP759668 |
| 4 | J0005 | 20140926 | WesternRural | Jui | SLE | KP759628 |
| ... | ... | ... | ... | ... | ... | ... |
| 170 | J0171 | 20141110 | WesternRural | Allen | SLE | KP759707 |
| 171 | J0172 | 2014-11-10 00:00:00 | WesternRural | Hastings | SLE | KP759626 |
| 172 | J0173 | 2014-11-10 00:00:00 | WesternRural | Sulpon | SLE | KP759627 |
| 173 | J0174 | 2014-11-10 00:00:00 | WesternRural | Waterloo | SLE | KP759708 |


So we have sample information, including dates, towns, and accession numbers.

Let's see if we can download the genbank files so we can look a little closer.

We need to isolate the accession numbers and make a call to NCBI.

In [41]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 175 entries, 0 to 174
Data columns (total 6 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   LabID             175 non-null    object
 1   Date              175 non-null    object
 2   District          175 non-null    object
 3   Town              175 non-null    object
 4   Country           175 non-null    object
 5   Accession Number  175 non-null    object
dtypes: object(6)
memory usage: 8.3+ KB


In [42]:
ebola_acc = df["Accession Number"].to_list()

In [43]:
ebola_acc

['KP759636',
 'KP759640',
 'KP759651',
 'KP759668',
 'KP759628',
 'KP759630',
 'KP759631',
 'KP759718',
 'KP759734',
 'KP759639',
 'KP759641',
 'KP759642',
 'KP759643',
 'KP759740',
 'KP759741',
 'KP759742',
 'KP759747',
 'KP759644',
 'KP759754',
 'KP759755',
 'KP759756',
 'KP759757',
 'KP759652',
 'KP759663',
 'KP759666',
 'KP759670',
 'KP759678',
 'KP759683',
 'KP759688',
 'KP759606',
 'KP759607',
 'KP759691',
 'KP759692',
 'KP759608',
 'KP759609',
 'KP759694',
 'KP759615',
 'KP759618',
 'KP759620',
 'KP759629',
 'KP759710',
 'KP759711',
 'KP759712',
 'KP759713',
 'KP759714',
 'KP759715',
 'KP759632',
 'KP759716',
 'KP759717',
 'KP759719',
 'KP759633',
 'KP759720',
 'KP759721',
 'KP759722',
 'KP759723',
 'KP759634',
 'KP759635',
 'KP759724',
 'KP759725',
 'KP759726',
 'KP759637',
 'KP759727',
 'KP759638',
 'KP759728',
 'KP759729',
 'KP759730',
 'KP759731',
 'KP759732',
 'KP759733',
 'KP759735',
 'KP759736',
 'KP759737',
 'KP759738',
 'KP759739',
 'KP759743',
 'KP759744',
 'KP759745',

The dataset is small enough that we can just eyeball it and notice there's a dash where an accession isn't recorded, so we'll just toss that one out for now.

In [44]:
clean_ebola_acc = []

for acc in ebola_acc:
    if acc != '-':
        clean_ebola_acc.append(acc)

clean_ebola_acc
    

['KP759636',
 'KP759640',
 'KP759651',
 'KP759668',
 'KP759628',
 'KP759630',
 'KP759631',
 'KP759718',
 'KP759734',
 'KP759639',
 'KP759641',
 'KP759642',
 'KP759643',
 'KP759740',
 'KP759741',
 'KP759742',
 'KP759747',
 'KP759644',
 'KP759754',
 'KP759755',
 'KP759756',
 'KP759757',
 'KP759652',
 'KP759663',
 'KP759666',
 'KP759670',
 'KP759678',
 'KP759683',
 'KP759688',
 'KP759606',
 'KP759607',
 'KP759691',
 'KP759692',
 'KP759608',
 'KP759609',
 'KP759694',
 'KP759615',
 'KP759618',
 'KP759620',
 'KP759629',
 'KP759710',
 'KP759711',
 'KP759712',
 'KP759713',
 'KP759714',
 'KP759715',
 'KP759632',
 'KP759716',
 'KP759717',
 'KP759719',
 'KP759633',
 'KP759720',
 'KP759721',
 'KP759722',
 'KP759723',
 'KP759634',
 'KP759635',
 'KP759724',
 'KP759725',
 'KP759726',
 'KP759637',
 'KP759727',
 'KP759638',
 'KP759728',
 'KP759729',
 'KP759730',
 'KP759731',
 'KP759732',
 'KP759733',
 'KP759735',
 'KP759736',
 'KP759737',
 'KP759738',
 'KP759739',
 'KP759743',
 'KP759744',
 'KP759745',
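
Eyeballing works for a single dash, but a pattern check scales better to longer lists. Here's a minimal sketch, assuming every valid accession in this table looks like two uppercase letters followed by six digits (as `KP759636` does). The helper name `split_accessions` and the pattern are illustrative assumptions, not part of the original workflow, and other GenBank accession formats would need a broader pattern.

```python
import re

# Assumed pattern: two uppercase letters + six digits, e.g. 'KP759636'.
ACCESSION_RE = re.compile(r"[A-Z]{2}\d{6}")

def split_accessions(values):
    """Return (valid, rejected) lists, stripping stray whitespace first."""
    valid, rejected = [], []
    for value in values:
        text = str(value).strip()
        if ACCESSION_RE.fullmatch(text):
            valid.append(text)
        else:
            # Keep the original value so we can inspect what was dropped.
            rejected.append(value)
    return valid, rejected

valid, rejected = split_accessions(['KP759636', ' KP759640 ', '-', 'KP759651'])
print(valid)
print(rejected)
```

In the notebook this could be called as `split_accessions(ebola_acc)`; looking at the rejected list is a quick way to confirm that only the dash was thrown out.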

![green-divider](https://user-images.githubusercontent.com/7065401/52071924-c003ad80-2562-11e9-8297-1c6595f8a7ff.png)

## efetch our data

Now that we have a clean list, it's time to see how we can download the associated sequences from these identifiers (accession numbers).

The Entrez package in Biopython provides us with all the tools to talk to NCBI.

### 👉🏼 efetch function returns data records from NCBI's databases

Let's try to grab one record first to get the hang of efetch: KP759636.


In [68]:
from IPython.display import IFrame
url = "https://www.ncbi.nlm.nih.gov/nuccore/KP759636.1/"
IFrame(url, 800, 400)

The `efetch` function takes the following parameters (only the first two are required):
   1. db - what database are you requesting the info from (required)
   2. id - the accession (required)
   3. rettype - retrieval type (fasta only? full genbank?)
   4. retmode - retrieval mode, or data format of the sequence (text, HTML or XML?)
   
👇🏼 View the handy table below to see what's best for your downloads.

In [47]:
from IPython.display import IFrame
url = "https://www.ncbi.nlm.nih.gov/books/NBK25499/table/chapter4.T._valid_values_of__retmode_and/?report=objectonly"
IFrame(url, 800, 400)

In [86]:
from Bio import Entrez

Entrez.email = "adriana@dranalytics.co"

handle = Entrez.efetch(db='nuccore', id='KP759636', rettype="gb", retmode="text")
print(handle.read())

LOCUS       KP759636               18713 bp    cRNA    linear   VRL 07-AUG-2015
DEFINITION  Zaire ebolavirus isolate
            Ebolavirus/H.sapiens-wt/SLE/2014/Makona-J0001, partial genome.
ACCESSION   KP759636
VERSION     KP759636.1
KEYWORDS    .
SOURCE      Zaire ebolavirus
  ORGANISM  Zaire ebolavirus
            Viruses; ssRNA viruses; ssRNA negative-strand viruses;
            Mononegavirales; Filoviridae; Ebolavirus.
REFERENCE   1  (bases 1 to 18713)
  AUTHORS   Tong,Y.G., Shi,W.F., Liu,D., Qian,J., Liang,L., Bo,X.C., Liu,J.,
            Ren,H.G., Fan,H., Ni,M., Sun,Y., Jin,Y., Teng,Y., Li,Z., Kargbo,D.,
            Dafae,F., Kanu,A., Chen,C.C., Lan,Z.H., Jiang,H., Luo,Y., Lu,H.J.,
            Zhang,X.G., Yang,F., Hu,Y., Cao,Y.X., Deng,Y.Q., Su,H.X., Sun,Y.,
            Liu,W.S., Wang,Z., Wang,C.Y., Bu,Z.Y., Guo,Z.D., Zhang,L.B.,
            Nie,W.M., Bai,C.Q., Sun,C.H., An,X.P., Xu,P.S., Zhang,X.L.,
            Huang,Y., Mi,Z.Q., Yu,D., Yao,H.W., Feng,Y., Xia,Z.P., Zheng,X.X.,
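
The same call can be wrapped in a small helper that makes `rettype` and `retmode` easy to switch and always closes the handle. This is only a sketch: `fetch_text` and `fake_efetch` are hypothetical names, and the function that talks to NCBI is passed in as an argument so the example can run offline with a stand-in that returns made-up data.

```python
import io

def fetch_text(efetch, accession, rettype="gb", retmode="text", db="nuccore"):
    """Fetch one record as text and always close the handle.

    `efetch` is whatever function performs the request; in this notebook
    that would be Entrez.efetch.
    """
    handle = efetch(db=db, id=accession, rettype=rettype, retmode=retmode)
    try:
        return handle.read()
    finally:
        handle.close()

# Offline stand-in for Entrez.efetch (made-up data, demonstration only).
def fake_efetch(**params):
    return io.StringIO(f">{params['id']} rettype={params['rettype']}\nACGT\n")

print(fetch_text(fake_efetch, "KP759636", rettype="fasta"))
```

With Biopython loaded and `Entrez.email` set, the real request would be `fetch_text(Entrez.efetch, "KP759636", rettype="fasta")` for FASTA, or the default `rettype="gb"` for the full GenBank record.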

Now that we know how to make a call to an NCBI database, let's see how to incorporate our list of accessions to get all of the records at once.

There are plenty of ways to achieve this, but as this is a freely available public resource, there are a few rules we should follow: 

### * No more than 3 URL requests per second
### * Limit large requests to weekends or between 9PM to 5AM EST during weekdays
### * Always give them your email so they can contact you if there are problems

In this case we are trying to download 174 GenBank files, which relatively speaking is not that large. So theoretically, you could just write a loop that requests (efetches) each GenBank file as it moves through your list.  

But in general it's good practice to minimize requests, so we'll go over how to download large numbers of files.  
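
For the simple one-request-per-accession loop mentioned above, here's a rough sketch of how the three-requests-per-second rule could be respected by spacing calls at least a third of a second apart. `Throttle` and `fetch_one_by_one` are hypothetical helpers written for illustration. As far as I know, Biopython's `Entrez` module already paces requests on its own, so treat this as a way to see the idea rather than something you must add.

```python
import time

class Throttle:
    """Space calls at least 1/max_per_second seconds apart."""

    def __init__(self, max_per_second=3, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / max_per_second
        self.clock = clock
        self.sleep = sleep
        self.last_call = None

    def wait(self):
        now = self.clock()
        if self.last_call is not None:
            remaining = self.min_interval - (now - self.last_call)
            if remaining > 0:
                self.sleep(remaining)
                now = self.clock()
        self.last_call = now

def fetch_one_by_one(accessions, fetch, throttle=None):
    """Call fetch(acc) for each accession, pausing between calls."""
    if throttle is None:
        throttle = Throttle()
    results = {}
    for acc in accessions:
        throttle.wait()
        results[acc] = fetch(acc)
    return results

# Offline demo: the lambda stands in for a real request to NCBI.
demo = fetch_one_by_one(["KP759636", "KP759640"],
                        fetch=lambda acc: f"record for {acc}")
print(demo)
```

In the notebook, `fetch` could be a small function that wraps `Entrez.efetch` for a single accession.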

![green-divider](https://user-images.githubusercontent.com/7065401/52071924-c003ad80-2562-11e9-8297-1c6595f8a7ff.png)
## Use EPost to download large datasets


### 👉🏼 The EPost function uploads your list of accession numbers to NCBI's History server and saves it, much like a stored search result, so later requests can refer to it.  


the `epost` function requires the following parameters:
1. db - the database to search
2. id - the accession numbers you wish to search for

 
Let's try the EPost step:



In [87]:
from Bio import Entrez
Entrez.email = "adriana@dranalytics.co"

epost_handle = Entrez.epost(db="nuccore", id=",".join(clean_ebola_acc))
epost_handle

<_io.TextIOWrapper encoding='utf-8'>

In [88]:
search_results = Entrez.read(epost_handle)
search_results

{'QueryKey': '1', 'WebEnv': 'NCID_1_238793533_130.14.22.33_9001_1588009776_1970646318_0MetA0_S_MegaStore'}

In [89]:
webenv = search_results["WebEnv"]
query_key = search_results["QueryKey"]

Now let's walk through the script to download all 174 records based on the search result variables that we saved.

In [1]:
from Bio import Entrez

Entrez.email = "adriana@dranalytics.co"

batch_size = 174  # number of accessions we posted

# webenv and query_key are the variables we saved from the EPost step
efetch_handle = Entrez.efetch(
    db="nuccore",
    rettype="gb",
    retmode="text",
    retmax=batch_size,
    webenv=webenv,
    query_key=query_key,
)
records = efetch_handle.read()
efetch_handle.close()

# Write all records to a single GenBank file
with open("data/gen_div_ebola.gb", "w") as out_handle:
    out_handle.write(records)
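
With only 174 records, a single `efetch` call is enough. For much longer lists, the same `webenv` and `query_key` can be paged through with `retstart`, fetching one batch per request. Here's a sketch of that pattern: `fetch_in_batches` is a hypothetical helper, the default batch size of 50 is an arbitrary choice, and `fake_efetch` is an offline stand-in so the example runs without contacting NCBI.

```python
import io
import os
import tempfile

def fetch_in_batches(efetch, webenv, query_key, total, out_path,
                     batch_size=50, db="nuccore", rettype="gb", retmode="text"):
    """Write `total` posted records to `out_path`, `batch_size` at a time."""
    with open(out_path, "w") as out_handle:
        for start in range(0, total, batch_size):
            handle = efetch(db=db, rettype=rettype, retmode=retmode,
                            retstart=start, retmax=batch_size,
                            webenv=webenv, query_key=query_key)
            try:
                out_handle.write(handle.read())
            finally:
                handle.close()

# Offline stand-in for Entrez.efetch that records which batches were asked for.
calls = []
def fake_efetch(**params):
    calls.append((params["retstart"], params["retmax"]))
    return io.StringIO(f"batch starting at {params['retstart']}\n")

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.gb")
    fetch_in_batches(fake_efetch, "WEBENV", "1", total=174,
                     out_path=path, batch_size=50)
    with open(path) as fh:
        written = fh.read()

print(calls)
```

In the notebook, the real call would pass `Entrez.efetch` along with the `webenv` and `query_key` saved from the EPost step.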

In [91]:
# Now we have one combined file with (hopefully) 174 records in it.

! head data/gen_div_ebola.gb

LOCUS       KP759636               18713 bp    cRNA    linear   VRL 07-AUG-2015
DEFINITION  Zaire ebolavirus isolate
            Ebolavirus/H.sapiens-wt/SLE/2014/Makona-J0001, partial genome.
ACCESSION   KP759636
VERSION     KP759636.1
KEYWORDS    .
SOURCE      Zaire ebolavirus
  ORGANISM  Zaire ebolavirus
            Viruses; ssRNA viruses; ssRNA negative-strand viruses;
            Mononegavirales; Filoviridae; Ebolavirus.


Recall that we can use `SeqIO.parse()` to count the number of records in a file:

In [92]:
from Bio import SeqIO

ebola_seqs = SeqIO.parse('data/gen_div_ebola.gb', 'genbank')

count = 0
for record in ebola_seqs:
    count += 1

print(count)

174
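
A matching count is reassuring, but it doesn't tell us *which* record is missing if the numbers ever disagree. Here's a small sketch that compares the requested accessions against the IDs found in the file. `missing_accessions` is a hypothetical helper, and it assumes record IDs carry a version suffix such as `.1` (as in `KP759636.1`), which is stripped before comparing.

```python
def missing_accessions(requested, found_ids):
    """Return requested accessions that are absent from the downloaded IDs.

    Version suffixes such as '.1' are stripped before comparing.
    """
    found = {record_id.split(".")[0] for record_id in found_ids}
    return [acc for acc in requested if acc not in found]

# Toy example with one record missing from the "download".
print(missing_accessions(["KP759636", "KP759640"], ["KP759636.1"]))
```

In the notebook this could be called as `missing_accessions(clean_ebola_acc, (rec.id for rec in SeqIO.parse('data/gen_div_ebola.gb', 'genbank')))`; an empty list means every accession came back.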


#### Good to go 👍🏼 

## Now imagine doing that manually 😱

If you want to learn more about retrieving specific data from NCBI, their staff has written some great articles here:
[The insider's guide to accessing NLM Data](https://dataguide.nlm.nih.gov/eutilities/what_is_eutilities.html)