# Biopython 3: SearchIO and Entrez

In this workshop, we will take a quick look at SearchIO, Biopython's tool for unifying results from various sequence search tools, before exploring the Entrez API utility from NCBI.

## SearchIO

Biopython provides the [SearchIO package](https://biopython.org/docs/latest/api/Bio.SearchIO.html) for dealing with outputs from various sequence searching utilities, allowing them to be compared directly. Similar to SeqIO and AlignIO, it has capabilities for parsing, reading, writing, and indexing different search results, as well as converting between file types. 

### Reading and Parsing

Parsing and reading work the same way as in SeqIO and AlignIO: `parse` returns an iterator over all of the queries in a file, while `read` returns the single query from a file that contains exactly one query (and raises an error otherwise). Note here that each item is a query, not an individual sequence.

Let's start with an example of reading an XML result file from a BLAST search.

In [2]:
# Parse a BLAST XML file

from Bio import SearchIO

handle = "searchio-data/blast.xml"

my_result = SearchIO.read(handle = handle, format = "blast-xml")
print(dir(my_result))

print("Search {} has {} hits".format(my_result.id, len(my_result)))

print(dir(my_result.hsps[0]))

for hsp in my_result.hsps:
    print(hsp.hit_id, hsp.evalue) 

['_NON_STICKY_ATTRS', '_QueryResult__alt_hit_ids', '_QueryResult__marker', '__annotations__', '__bool__', '__class__', '__contains__', '__delattr__', '__delitem__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getitem__', '__getstate__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__iter__', '__le__', '__len__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__setitem__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_description', '_hit_key_function', '_id', '_items', '_transfer_attrs', 'absorb', 'append', 'blast_id', 'description', 'fragments', 'hit_filter', 'hit_keys', 'hit_map', 'hits', 'hsp_filter', 'hsp_map', 'hsps', 'id', 'index', 'items', 'iterhit_keys', 'iterhits', 'iteritems', 'param_evalue_threshold', 'param_filter', 'param_gap_extend', 'param_gap_open', 'param_score_match', 'param_score_mismatch', 'pop', 'program', 'reference', 'seq_len', 'sort',

### Writing and Format Conversion

You can also write results files in various file formats. To do this, use the `Bio.SearchIO.write()` function. In the example below, we parse a file in the blast-xml format, and then write its contents to a new file in blast-tab tabular format. 

```
from Bio import SearchIO
qresults = SearchIO.parse(handle = 'sample.xml', format = 'blast-xml')
SearchIO.write(qresults, handle = 'sample.tab', format = 'blast-tab')
```

You can also use the `convert()` function to directly convert between file formats without reading/parsing and then writing. The supported formats are below, although not all of the pairs will work for conversion (see [documentation](https://biopython.org/docs/latest/api/Bio.SearchIO.html#Bio.SearchIO.convert) for details).

'blast-tab', 'blast-xml', 'blat-psl', 'hmmer3-tab', 'hmmscan3-domtab', 'hmmsearch3-domtab', 'phmmer3-domtab'

The function will return a tuple of four values: the number of QueryResult, Hit, HSP, and HSPFragment objects it writes to the output file.

In [None]:
in_file = "searchio-data/blast.xml"
out_file = "searchio-data/blast_tab.tab"

in_format = "blast-xml"
out_format = "blast-tab"

SearchIO.convert(in_file=in_file, in_format=in_format, out_file=out_file, out_format=out_format)
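
To give a feel for what a blast-tab file contains, here is a minimal sketch that reads BLAST tabular text with the standard library alone. It assumes the file uses the 12 default BLAST tabular columns and has no comment lines. The helper `read_blast_tab` and the sample row are made up for illustration; the row is not taken from `blast.xml`.

```python
import csv
import io

# The 12 default columns of BLAST tabular output (assumed layout of the file).
BLAST_TAB_COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]

# Made-up example row standing in for one line of a blast-tab file.
sample_text = "query1\tsubject1\t98.50\t200\t3\t0\t1\t200\t11\t210\t1e-100\t360\n"

def read_blast_tab(handle):
    """Yield one dict per row, keyed by the default column names."""
    reader = csv.DictReader(handle, fieldnames=BLAST_TAB_COLUMNS, delimiter="\t")
    for row in reader:
        # Convert the two numeric columns we want to compare or sort on.
        row["evalue"] = float(row["evalue"])
        row["bitscore"] = float(row["bitscore"])
        yield row

rows = list(read_blast_tab(io.StringIO(sample_text)))
for row in rows:
    print(row["qseqid"], row["sseqid"], row["evalue"])
```

For a real file, you would pass an open file handle (for example `open(out_file, newline="")`) instead of the `io.StringIO` stand-in.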

## Entrez

The Entrez system is a collection of [NCBI databases](https://www.ncbi.nlm.nih.gov/guide/all/) ranging from literature to sequences, along with a text search tool for exploring them. You can [search online](https://www.ncbi.nlm.nih.gov/search/) via browser, or through the NCBI's E-utilities API. 

### Entrez Rules and Etiquette

### UIDs

Every entry in an NCBI database has a UID (unique identifier). The kind of UID depends on the database. For example, PubMed uses PMIDs, while protein records use GI numbers.

| Entrez Database    | UID common name | E-utility Database Name |
|--------------------|-----------------|-------------------------|
| BioProject         | BioProject ID  | bioproject              |
| BioSample          | BioSample ID   | biosample               |
| Books              | Book ID        | books                   |
| Conserved Domains  | PSSM-ID        | cdd                     |
| dbGaP              | dbGaP ID       | gap                     |
| dbVar              | dbVar ID       | dbvar                   |
| Gene               | Gene ID        | gene                    |
| Genome             | Genome ID      | genome                  |
| GEO Datasets       | GDS ID         | gds                     |
| GEO Profiles       | GEO ID         | geoprofiles             |
| HomoloGene         | HomoloGene ID  | homologene              |
| MeSH               | MeSH ID        | mesh                    |
| NCBI C++ Toolkit   | Toolkit ID     | toolkit                 |
| NLM Catalog        | NLM Catalog ID | nlmcatalog              |
| Nucleotide         | GI number      | nuccore                 |
| PopSet             | PopSet ID      | popset                  |
| Probe              | Probe ID       | probe                   |
| Protein            | GI number      | protein                 |
| Protein Clusters   | Protein Cluster ID | proteinclusters      |
| PubChem BioAssay   | AID            | pcassay                 |
| PubChem Compound   | CID            | pccompound              |
| PubChem Substance  | SID            | pcsubstance             |
| PubMed             | PMID           | pubmed                  |
| PubMed Central     | PMCID          | pmc                     |
| SNP                | rs number      | snp                     |
| SRA                | SRA ID         | sra                     |
| Structure          | MMDB-ID        | structure               |
| Taxonomy           | TaxID          | taxonomy                |

**Accession.Version vs GI Number**

Sequences have two parallel identifiers assigned by the NCBI: the GI number and the accession.version. For a full disambiguation, see the NCBI page on [sequence identifiers](https://www.ncbi.nlm.nih.gov/genbank/sequenceids/).
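
As a small illustration, an accession.version identifier can be split into its two parts at the final dot. The helper `split_accession_version` and the identifier `AB123456.2` are hypothetical, used here only to show the shape of the identifier.

```python
def split_accession_version(identifier):
    """Split 'ACCESSION.VERSION' into (accession, version).

    The version is None when the identifier has no numeric '.version' suffix.
    """
    accession, dot, version = identifier.rpartition(".")
    if not dot or not version.isdigit():
        return identifier, None
    return accession, int(version)

print(split_accession_version("AB123456.2"))
print(split_accession_version("AB123456"))
```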

### E-Utilities on the Unix Command Line

While Biopython gives us nice tools for using the E-utilities within Python code, they are also available as [command line tools](https://www.ncbi.nlm.nih.gov/books/NBK179288/) (Entrez Direct).

### Core Concepts of the E-utilities

**What are the E-utilities?**

There are nine total E-utilities which perform different tasks with respect to the NCBI databases (from [A General Introduction to the E-utilities](https://www.ncbi.nlm.nih.gov/books/NBK25497/)).

1. EInfo (database statistics) eutils.ncbi.nlm.nih.gov/entrez/eutils/einfo.fcgi

    Provides the number of records indexed in each field of a given database, the date of the last update of the database, and the available links from the database to other Entrez databases.

2. ESearch (text searches) eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi

    Responds to a text query with the list of matching UIDs in a given database (for later use in ESummary, EFetch or ELink), along with the term translations of the query.

3. EPost (UID uploads) eutils.ncbi.nlm.nih.gov/entrez/eutils/epost.fcgi

    Accepts a list of UIDs from a given database, stores the set on the History Server, and responds with a query key and web environment for the uploaded dataset.

4. ESummary (document summary downloads) eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi

    Responds to a list of UIDs from a given database with the corresponding document summaries.

5. EFetch (data record downloads) eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi

    Responds to a list of UIDs in a given database with the corresponding data records in a specified format.

6. ELink (Entrez links) eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi

    Responds to a list of UIDs in a given database with either a list of related UIDs (and relevancy scores) in the same database or a list of linked UIDs in another Entrez database; checks for the existence of a specified link from a list of one or more UIDs; creates a hyperlink to the primary LinkOut provider for a specific UID and database, or lists LinkOut URLs and attributes for multiple UIDs.

7. EGQuery (global query) eutils.ncbi.nlm.nih.gov/entrez/eutils/egquery.fcgi

    Responds to a text query with the number of records matching the query in each Entrez database.

8. ESpell (spelling suggestions) eutils.ncbi.nlm.nih.gov/entrez/eutils/espell.fcgi

    Retrieves spelling suggestions for a text query in a given database.

9. ECitMatch (batch citation searching in PubMed) eutils.ncbi.nlm.nih.gov/entrez/eutils/ecitmatch.cgi

    Retrieves PubMed IDs (PMIDs) corresponding to a set of input citation strings.

## Bio.Entrez Module

Biopython provides the `Bio.Entrez` module, which lets us call the E-utilities from inside a Python script. In essence, it takes our arguments, builds the corresponding E-utility URL, sends the request to the NCBI server, and handles the returned information. See the [Bio.Entrez documentation](https://biopython.org/docs/latest/api/Bio.Entrez.html) for details. In addition to functions for the nine E-utilities, the module provides functions for reading and parsing the results.
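
To make the URL-building step concrete, here is a minimal sketch using only the standard library. It is not Biopython's implementation, which does more (for example, it adds your email address to the request). The function `build_eutils_url` is a hypothetical helper; the base address and endpoint names come from the list of E-utilities above.

```python
from urllib.parse import urlencode

EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"

def build_eutils_url(utility, **params):
    """Return the request URL for an E-utility such as 'esearch' or 'einfo'."""
    # ECitMatch is the one utility in the list above whose endpoint ends in .cgi.
    suffix = ".cgi" if utility == "ecitmatch" else ".fcgi"
    url = EUTILS_BASE + utility + suffix
    query = urlencode(params)  # spaces become '+', special characters are escaped
    return url + "?" + query if query else url

url = build_eutils_url("esearch", db="nuccore", term="Opuntia AND rpl16", retmax=200)
print(url)
```

The sketch only builds the address and sends nothing, so it is safe to run without network access.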

### An Example with EInfo

Here we will use the first function, EInfo, to get information on some NCBI databases. The basic routine is as follows:

0. Give Entrez your email address.
1. Define your handle as the output of the `Entrez.einfo()` function.
2. Create a record by "reading" the handle using the `Entrez.read()` function. This record will be a nested structure of Python dictionaries and lists; explore it by looking at the keys.
3. Extract the information you need by calling the specific keys as needed.

When given no arguments, `einfo()` returns a list of all valid Entrez databases. If you pass a `db` (database) argument, it instead returns information about that specific database. Further arguments select the output format (XML by default, with JSON also available) and the EInfo version (see the documentation for details).

First, let's get all of the database information.

In [1]:
from Bio import Entrez

Entrez.email = "cwarner@rockefeller.edu"
handle = Entrez.einfo() # Get handle with information
record = Entrez.read(handle) # Use the Bio parser to turn it into a Python object (dictionary)
print(record.keys())
print(record)
print(len(record['DbList']))



dict_keys(['DbList'])
{'DbList': ['pubmed', 'protein', 'nuccore', 'ipg', 'nucleotide', 'structure', 'genome', 'annotinfo', 'assembly', 'bioproject', 'biosample', 'blastdbinfo', 'books', 'cdd', 'clinvar', 'gap', 'gapplus', 'grasp', 'dbvar', 'gene', 'gds', 'geoprofiles', 'medgen', 'mesh', 'nlmcatalog', 'omim', 'orgtrack', 'pmc', 'popset', 'proteinclusters', 'pcassay', 'protfam', 'pccompound', 'pcsubstance', 'seqannot', 'snp', 'sra', 'taxonomy', 'biocollections', 'gtr']}
40


Next, we can find out more information about a single database, here dbGaP (the Database of Genotypes and Phenotypes).

In [15]:
Entrez.email = "cwarner@rockefeller.edu"
handle = Entrez.einfo(db = "gap") # Get handle with information
record = Entrez.read(handle) # Change into Python object
print(record.keys()) # Print the first "dictionary layer"
print(record["DbInfo"])
print(record["DbInfo"]["FieldList"])

dict_keys(['DbInfo'])
{'DbName': 'gap', 'MenuName': 'dbGaP', 'Description': 'dbGaP Data', 'DbBuild': 'Build230522-0335m.1', 'Count': '363717', 'LastUpdate': '2023/05/22 04:11', 'FieldList': [{'Name': 'ALL', 'FullName': 'All Fields', 'Description': 'All terms from all searchable fields', 'TermCount': '2183077', 'IsDate': 'N', 'IsNumerical': 'N', 'SingleToken': 'N', 'Hierarchy': 'N', 'IsHidden': 'N'}, {'Name': 'UID', 'FullName': 'UID', 'Description': 'Unique number assigned to publication', 'TermCount': '0', 'IsDate': 'N', 'IsNumerical': 'Y', 'SingleToken': 'Y', 'Hierarchy': 'N', 'IsHidden': 'Y'}, {'Name': 'FILT', 'FullName': 'Filter', 'Description': 'Limits the records', 'TermCount': '23', 'IsDate': 'N', 'IsNumerical': 'N', 'SingleToken': 'Y', 'Hierarchy': 'N', 'IsHidden': 'N'}, {'Name': 'DISC', 'FullName': 'Discriminator', 'Description': 'Discriminator', 'TermCount': '10', 'IsDate': 'N', 'IsNumerical': 'N', 'SingleToken': 'Y', 'Hierarchy': 'N', 'IsHidden': 'N'}, {'Name': 'ANCE', 'FullN
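
The nested structure printed above can be navigated with ordinary dictionary and list operations. The sketch below uses a hand-abbreviated copy of that output (first three fields only, with some keys dropped) so that it runs without contacting NCBI.

```python
# Abbreviated copy of the structure printed above (first three fields only).
record = {
    "DbInfo": {
        "DbName": "gap",
        "MenuName": "dbGaP",
        "Count": "363717",
        "FieldList": [
            {"Name": "ALL", "FullName": "All Fields", "IsHidden": "N"},
            {"Name": "UID", "FullName": "UID", "IsHidden": "Y"},
            {"Name": "FILT", "FullName": "Filter", "IsHidden": "N"},
        ],
    }
}

db_info = record["DbInfo"]
# Values come back as strings, so convert counts before doing arithmetic.
record_count = int(db_info["Count"])

# Keep only the search fields that are not hidden.
visible_fields = {
    field["Name"]: field["FullName"]
    for field in db_info["FieldList"]
    if field["IsHidden"] == "N"
}

print(db_info["MenuName"], record_count)
print(visible_fields)
```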

### A quick aside: finding the DTD file

Let's do our EInfo search again, but instead of using the `Entrez.read()` function, which translates the XML returned by EInfo to Python dictionaries and lists, read and print it as an XML file so we can examine it more closely.

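In [29]:
from Bio import Entrez
Entrez.email = "cwarner@rockefeller.edu"
handle = Entrez.einfo() # Get handle with information
result = handle.read() # Raw XML, returned as bytes
handle.close()
with open("entrez-data/einfo.xml", "wb") as out_handle:
    out_handle.write(result) # Save the XML so we can examine it later
print(result)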

b'<?xml version="1.0" encoding="UTF-8" ?>\n<!DOCTYPE eInfoResult PUBLIC "-//NLM//DTD einfo 20190110//EN" "https://eutils.ncbi.nlm.nih.gov/eutils/dtd/20190110/einfo.dtd">\n<eInfoResult>\n<DbList>\n\n\t<DbName>pubmed</DbName>\n\t<DbName>protein</DbName>\n\t<DbName>nuccore</DbName>\n\t<DbName>ipg</DbName>\n\t<DbName>nucleotide</DbName>\n\t<DbName>structure</DbName>\n\t<DbName>genome</DbName>\n\t<DbName>annotinfo</DbName>\n\t<DbName>assembly</DbName>\n\t<DbName>bioproject</DbName>\n\t<DbName>biosample</DbName>\n\t<DbName>blastdbinfo</DbName>\n\t<DbName>books</DbName>\n\t<DbName>cdd</DbName>\n\t<DbName>clinvar</DbName>\n\t<DbName>gap</DbName>\n\t<DbName>gapplus</DbName>\n\t<DbName>grasp</DbName>\n\t<DbName>dbvar</DbName>\n\t<DbName>gene</DbName>\n\t<DbName>gds</DbName>\n\t<DbName>geoprofiles</DbName>\n\t<DbName>medgen</DbName>\n\t<DbName>mesh</DbName>\n\t<DbName>nlmcatalog</DbName>\n\t<DbName>omim</DbName>\n\t<DbName>orgtrack</DbName>\n\t<DbName>pmc</DbName>\n\t<DbName>popset</DbName>\n\t<D

Take a look at the `einfo.xml` results. Note that before the beginning of the results, denoted by `<eInfoResult>`, the header contains information about the file encoding as well as the DTD file used to define the XML data returned. Open the DTD URL given in the `<!DOCTYPE ...>` declaration to download or view that file. If you ever receive a warning about your DTD file when you use `Entrez.read()`, you will see a link for the updated DTD file in the warning. Biopython will automatically access this new DTD online and continue as usual, but if you want better performance you can update the DTD on your machine (see the Biopython Tutorial document Ch. 9 for complete instructions).
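
If you would rather pull the DTD address out of the response programmatically, a regular expression over the raw bytes is enough. This is a minimal sketch: `find_dtd_url` is a hypothetical helper, and `xml_header` is the start of the EInfo output shown above.

```python
import re

# First part of the raw EInfo response shown above.
xml_header = (
    b'<?xml version="1.0" encoding="UTF-8" ?>\n'
    b'<!DOCTYPE eInfoResult PUBLIC "-//NLM//DTD einfo 20190110//EN" '
    b'"https://eutils.ncbi.nlm.nih.gov/eutils/dtd/20190110/einfo.dtd">\n'
    b"<eInfoResult>\n"
)

def find_dtd_url(xml_bytes):
    """Return the DTD URL from a PUBLIC DOCTYPE declaration, or None."""
    pattern = rb'<!DOCTYPE\s+\S+\s+PUBLIC\s+"[^"]*"\s+"([^"]+)"'
    match = re.search(pattern, xml_bytes)
    return match.group(1).decode("ascii") if match else None

dtd_url = find_dtd_url(xml_header)
print(dtd_url)
```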

## An Example with GenBank Data

Here we will use several of the E-utilities to search for genomic data of the RPL16 gene of a prickly pear (*Opuntia*) in the [NCBI Nucleotide Database](https://www.ncbi.nlm.nih.gov/nucleotide/). This will occur in several stages:

1. Use EGQuery to find the number of hits for your search terms in the "nuccore" database.
2. Use ESearch to get the UIDs in "nuccore" that match the desired search terms.
3. Use EFetch to get the data associated with those UIDs.
4. Parse this sequence data and write it to a .gb file.
5. Check your work by parsing the .gb file.

Let's begin by finding the number of results in the "nuccore" database.


In [5]:
from Bio import Entrez
Entrez.email = "cwarner@rockefeller.edu"

handle = Entrez.egquery(term="Opuntia AND rpl16") # Search all databases for the specified terms
record = Entrez.read(handle) 
print(record.keys())
print(record["eGQueryResult"])

for row in record['eGQueryResult']:
    if row["DbName"] == "nuccore":
        print(row["Count"])


dict_keys(['Term', 'eGQueryResult'])
[{'DbName': 'pubmed', 'MenuName': 'PubMed', 'Count': '0', 'Status': 'Term or Database is not found'}, {'DbName': 'pmc', 'MenuName': 'PubMed Central', 'Count': '21', 'Status': 'Ok'}, {'DbName': 'mesh', 'MenuName': 'MeSH', 'Count': '0', 'Status': 'Term or Database is not found'}, {'DbName': 'books', 'MenuName': 'Books', 'Count': '0', 'Status': 'Term or Database is not found'}, {'DbName': 'pubmedhealth', 'MenuName': 'PubMed Health', 'Count': 'Error', 'Status': 'Database Error'}, {'DbName': 'omim', 'MenuName': 'OMIM', 'Count': '0', 'Status': 'Term or Database is not found'}, {'DbName': 'ncbisearch', 'MenuName': 'Site Search', 'Count': '0', 'Status': 'Term or Database is not found'}, {'DbName': 'nuccore', 'MenuName': 'Nucleotide', 'Count': '115', 'Status': 'Ok'}, {'DbName': 'nucgss', 'MenuName': 'GSS', 'Count': '0', 'Status': 'Ok'}, {'DbName': 'nucest', 'MenuName': 'EST', 'Count': '0', 'Status': 'Ok'}, {'DbName': 'protein', 'MenuName': 'Protein', 'Count'

In [10]:
handle = Entrez.esearch(db = "nuccore", term = "Opuntia AND rpl16", retmax = 200)
record = Entrez.read(handle)
print(record.keys())
id_list = record["IdList"]
print(id_list)
print(len(id_list))

dict_keys(['Count', 'RetMax', 'RetStart', 'IdList', 'TranslationSet', 'TranslationStack', 'QueryTranslation'])
['2689244941', '2689244853', '2666708278', '2666703263', '2419852361', '2643614054', '2627891212', '2627891124', '2627891047', '2627890961', '2627890885', '2627890802', '2627890718', '2627890636', '2627890548', '2627890460', '2627890382', '2627890305', '2627890228', '2627890152', '2627890068', '2627889984', '2627889900', '2627889817', '2627889740', '2627889653', '2627889560', '2627889470', '2627889380', '2627889278', '2627889193', '2627889118', '2627889035', '2627888952', '2582771976', '2582771888', '2582771811', '2582771725', '2582771649', '2582771566', '2582771482', '2582771400', '2582771312', '2582771241', '2582771153', '2582771075', '2582770998', '2582770921', '2582770845', '2582770761', '2582770677', '2582770593', '2582770510', '2582770433', '2582770349', '2582770274', '2582770192', '2582770117', '2582770042', '2582769959', '2582769885', '2582769808', '2582769733', '25827
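
ESearch returns at most `retmax` UIDs (20 by default), while `Count` reports the total number of matches, which is why the search above sets `retmax = 200`. The sketch below checks whether a parsed ESearch record holds every matching UID. The function `idlist_is_complete` is a hypothetical helper, and the `truncated` record is a hand-written stand-in that reuses the first three UIDs from the output above.

```python
def idlist_is_complete(search_record):
    """Return True if the ESearch record holds every matching UID."""
    total = int(search_record["Count"])        # total matches on the server
    returned = len(search_record["IdList"])    # UIDs actually sent back
    return returned >= total

# Hand-written stand-in for an ESearch record: 115 matches, only 3 UIDs returned.
truncated = {
    "Count": "115",
    "RetMax": "3",
    "IdList": ["2689244941", "2689244853", "2666708278"],
}
print(idlist_is_complete(truncated))
```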

In [13]:
from Bio import SeqIO

handle = Entrez.efetch(db = "nuccore", id = id_list, rettype = "gb", retmode = "text")
records = SeqIO.parse(handle = handle, format = "genbank")

with open("entrez-data/opuntia.gb", "w") as output_handle:
    for record in records:
        SeqIO.write(sequences = record, handle = output_handle, format = "genbank")

In [14]:
records = SeqIO.parse(handle = "entrez-data/opuntia.gb", format = "genbank")
for record in records:
    print(record.seq[0:10])

GACCAAACAG
TTTGTTGAAG
GCGAACGACG
ATAAATAATT
ATAAATAATT
GCGAACGACG
ACTTAATAGC
AAAAAGAAAT
TTAGAAAGAA
GAAAGGGTAG
ACAGTAAGAA
CCAAGTCAAG
CCAAGTCAAG
CCAAGTCAAG
AAAAAGAAAT
AAAAAGAAAT
AAAAAGAAAT
AAGAATTGAA
TTAGAAAGAA
TTAGAAAGAA
CCAAGTCAAG
CTTGCGCCAA
CTTGCGCCAA
CTTGCGCCAA
AAGAATTGAA
CTTGCGCCAA
CCAAGTCAAG
TTAGAAAGAA
TTAGAAAGAA
CTTGCGCCAA
AAGAATTGAA
TTAGAAAGAA
CTTGCGCCAA
CCAAGTCAAG
ACTTAATAGC
AAAAAGAAAT
TTAGAAAGAA
GAAAGGGTAG
ACAGTAAGAA
CCAAGTCAAG
CCAAGTCAAG
CCAAGTCAAG
AAAAAGAAAT
GGCGAACGAC
AAAAAGAAAT
AAAAAGAAAT
AAGAATTGAA
TTAGAAAGAA
TTAGAAAGAA
CCAAGTCAAG
CTTGCGCCAA
CTTGCGCCAA
CTTGCGCCAA
AAGAATTGAA
CTTGCGCCAA
CGAGAAAGGG
CCAAGTCAAG
TTAGAAAGAA
TTAGAAAGAA
CTTGCGCCAA
ATTTACAGAC
AAGAATTGAA
TTAGAAAGAA
CTTGCGCCAA
CCAAGTCAAG
CTTAGTGTGT
ATACTTTCAA
ATGCTTAGTG
ATGCTTAGTG
ATGCTTAGTG
ATGCTTAGTG
ATGCTTAGTG
ATGCTTAGTG
ATGCTTAGTG
ATGCTTAGTG
ATGCTTAGTG
ATACTTTCAA
GTAAGAGCCC
ATGCTTAGTG
ATGCTTAGTG
ATGCTTAGTG
ATGCTTAGTG
ATGCTTAGTG
ATGCTTAGTG
ATGCTTAGTG
ATGCTTAGTG
ATGCTTAGTG
ATGCTTAGTG
ATGCTTAGTG
TTCTATAAAC
AACCCCAAAA

## Using the History Feature

When we create pipelines of different E-utilities, the best practice is to store the intermediate information on the NCBI History server, rather than on our own devices. To add an item to the History server, we can use the EPost E-utility, or we can set the `usehistory = "y"` parameter when using ESearch. Each item in the History server will have two identifying parameters: the WebEnv (cookies string) and an integer Query Key. By invoking these identifiers, we can access our information directly from the server, rather than having to use our own device's memory. Note that we can also combine different records in the History server, which will share a WebEnv but have different Query Keys.

In the example below, we will run an ESearch on PubMed, setting `usehistory = "y"` so that the results are posted on the server. Then, we will pull the results from the server using EFetch to retrieve the citation information about these items. Without the history, we would have had to define the list of UIDs as a variable in our code, and then call the EFetch function on that list of UIDs. 

In the code block below, we will run the initial search.

In [16]:
from Bio import Entrez
Entrez.email = "cwarner@rockefeller.edu"

handle = Entrez.esearch(db = "pubmed", term = "Opuntia[ORGN]", reldate = 365, datetype = "pdat", usehistory = "y")
search_results = Entrez.read(handle)
print(search_results.keys())

count = int(search_results["Count"])
print(count)


dict_keys(['Count', 'RetMax', 'RetStart', 'QueryKey', 'WebEnv', 'IdList', 'TranslationSet', 'QueryTranslation', 'ErrorList'])
128


Now that we have performed the search, the results are saved in our History. We can use the EFetch function to refer to these results and get the full citations, downloading them in batches of 10.

In [18]:
batch_size = 10
out_handle = open("entrez-data/opuntia-papers.txt", "w")

for start in range(0, count, batch_size):
    end = min(count, start + batch_size)
    print("Downloading records {} through {}".format(start + 1, end))
    fetch_handle = Entrez.efetch(db = "pubmed", rettype = "medline", retmode = "text", retstart = start, retmax = batch_size, webenv = search_results["WebEnv"], query_key = search_results["QueryKey"])

    data = fetch_handle.read()
    fetch_handle.close()
    out_handle.write(data)
out_handle.close()


Downloading records 1 through 10
Downloading records 11 through 20
Downloading records 21 through 30
Downloading records 31 through 40
Downloading records 41 through 50
Downloading records 51 through 60
Downloading records 61 through 70
Downloading records 71 through 80
Downloading records 81 through 90
Downloading records 91 through 100
Downloading records 101 through 110
Downloading records 111 through 120
Downloading records 121 through 128
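
The batch arithmetic in the loop above can be checked on its own, without any network calls. This sketch uses a hypothetical helper, `batch_ranges`, that yields the same `start` and `end` values as the loop for a given `count` and `batch_size`.

```python
def batch_ranges(count, batch_size):
    """Yield (start, end) pairs that cover records 0..count-1 in order."""
    for start in range(0, count, batch_size):
        # The last batch may be shorter than batch_size.
        yield start, min(count, start + batch_size)

batches = list(batch_ranges(128, 10))
print(len(batches))
print(batches[0], batches[-1])
```

Here `start` is the zero-based `retstart` value passed to EFetch, which is why the progress message prints `start + 1`.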


We now have a MEDLINE formatted file with all 128 citations. We can easily add this to any reference management software. Alternatively, we can use the `Bio.Medline` module to parse this file and extract additional information.
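
To show what the MEDLINE format looks like, here is a minimal standard-library sketch of a parser. It is not `Bio.Medline` and makes simplifying assumptions: each field starts with a tag of up to four characters followed by `- `, continuation lines are indented, and records are separated by blank lines. Every tag is collected into a list of strings. The function `parse_medline` and the two sample records are made up for illustration.

```python
import io

# Made-up two-record sample in MEDLINE tagged format (not real citations).
sample = """PMID- 11111111
TI  - A first example title that is long enough
      to continue on a second line.
AU  - Author A
AU  - Author B

PMID- 22222222
TI  - A second example title.
AU  - Author C
"""

def parse_medline(handle):
    """Yield one dict per record, mapping each tag to a list of values."""
    record, tag = {}, None
    for line in handle:
        line = line.rstrip("\n")
        if not line.strip():          # blank line ends the current record
            if record:
                yield record
            record, tag = {}, None
        elif line[:4].strip():        # new "TAG - value" line
            tag = line[:4].strip()
            record.setdefault(tag, []).append(line[6:])
        elif tag is not None:         # indented continuation of the previous value
            record[tag][-1] += " " + line.strip()
    if record:
        yield record

records = list(parse_medline(io.StringIO(sample)))
for rec in records:
    print(rec["PMID"][0], rec["TI"][0], len(rec["AU"]))
```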

### Revisiting our GenBank Search

With this in mind, let's also revisit our Opuntia sequence search and do things the "right" way using the History server. First, let's redo our search but this time using `usehistory = "y"`. Note that we no longer have to deal with the awkward problem of "retmax" since the results are being stored on the server, and we are not actually retrieving the list of UIDs here! All we need is the "Count", which is the total number of results rather than the number of UIDs returned. 

In [25]:
from Bio import Entrez
Entrez.email = "cwarner@rockefeller.edu"

handle = Entrez.esearch(db = "nuccore", term = "Opuntia AND rpl16", usehistory = "y")
search_results = Entrez.read(handle)
handle.close()

webenv = search_results["WebEnv"]
query_key = search_results["QueryKey"]
count = int(search_results["Count"])

print(webenv)
print(query_key)
print(count)

MCID_65ef23ef25842c72a56cfb44
1
115


Now that we have the search saved in the History, we can download the results in batches (best practice). This time, let's write them to a file in FASTA format. For all of the available retrieval modes (defined by the `retmode` variable), see this [table](https://www.ncbi.nlm.nih.gov/books/NBK25499/table/chapter4.T._valid_values_of__retmode_and/?report=objectonly).

In [26]:
batch_size = 10
out_handle = open("entrez-data/opuntia.fasta", "w")

for start in range(0, count, batch_size):
    end = min(count, start + batch_size)
    print("Downloading records {} through {}".format(start + 1, end))
    fetch_handle = Entrez.efetch(db = "nuccore", rettype = "fasta", retmode = "text", retstart = start, retmax = batch_size, webenv = webenv, query_key = query_key)
    data = fetch_handle.read()
    fetch_handle.close()
    out_handle.write(data)
out_handle.close()

Downloading records 1 through 10
Downloading records 11 through 20
Downloading records 21 through 30
Downloading records 31 through 40
Downloading records 41 through 50
Downloading records 51 through 60
Downloading records 61 through 70
Downloading records 71 through 80
Downloading records 81 through 90
Downloading records 91 through 100
Downloading records 101 through 110
Downloading records 111 through 115


Now, let's parse the data and see that it is the same as what we got before!

In [27]:
records = SeqIO.parse(handle = "entrez-data/opuntia.fasta", format = "fasta")
for record in records:
    print(record.seq[0:10])

GACCAAACAG
TTTGTTGAAG
GCGAACGACG
ATAAATAATT
ATAAATAATT
GCGAACGACG
ACTTAATAGC
AAAAAGAAAT
TTAGAAAGAA
GAAAGGGTAG
ACAGTAAGAA
CCAAGTCAAG
CCAAGTCAAG
CCAAGTCAAG
AAAAAGAAAT
AAAAAGAAAT
AAAAAGAAAT
AAGAATTGAA
TTAGAAAGAA
TTAGAAAGAA
CCAAGTCAAG
CTTGCGCCAA
CTTGCGCCAA
CTTGCGCCAA
AAGAATTGAA
CTTGCGCCAA
CCAAGTCAAG
TTAGAAAGAA
TTAGAAAGAA
CTTGCGCCAA
AAGAATTGAA
TTAGAAAGAA
CTTGCGCCAA
CCAAGTCAAG
ACTTAATAGC
AAAAAGAAAT
TTAGAAAGAA
GAAAGGGTAG
ACAGTAAGAA
CCAAGTCAAG
CCAAGTCAAG
CCAAGTCAAG
AAAAAGAAAT
GGCGAACGAC
AAAAAGAAAT
AAAAAGAAAT
AAGAATTGAA
TTAGAAAGAA
TTAGAAAGAA
CCAAGTCAAG
CTTGCGCCAA
CTTGCGCCAA
CTTGCGCCAA
AAGAATTGAA
CTTGCGCCAA
CGAGAAAGGG
CCAAGTCAAG
TTAGAAAGAA
TTAGAAAGAA
CTTGCGCCAA
ATTTACAGAC
AAGAATTGAA
TTAGAAAGAA
CTTGCGCCAA
CCAAGTCAAG
CTTAGTGTGT
ATACTTTCAA
ATGCTTAGTG
ATGCTTAGTG
ATGCTTAGTG
ATGCTTAGTG
ATGCTTAGTG
ATGCTTAGTG
ATGCTTAGTG
ATGCTTAGTG
ATGCTTAGTG
ATACTTTCAA
GTAAGAGCCC
ATGCTTAGTG
ATGCTTAGTG
ATGCTTAGTG
ATGCTTAGTG
ATGCTTAGTG
ATGCTTAGTG
ATGCTTAGTG
ATGCTTAGTG
ATGCTTAGTG
ATGCTTAGTG
ATGCTTAGTG
TTCTATAAAC
AACCCCAAAA

## Putting It All Together

### Dealing with Large Data: Reading vs Parsing

As we have seen throughout these Biopython workshops, when dealing with large data files it is better to parse them than to read them. Rather than creating one Python object for your entire data file (in this case an XML file), the `Entrez.parse()` function allows you to read XML records one by one. In this workshop, we have not been dealing with large data files, so we simply use the `Entrez.read()` function. Note that a response holding a single top-level item, such as the EInfo and ESearch results above, cannot be parsed this way; `Entrez.parse()` is intended for other NCBI XML data that consists of a list of records.
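
The idea of handling XML records one at a time can be illustrated with the standard library's `xml.etree.ElementTree.iterparse`. This is an analogy, not how `Entrez.parse()` is implemented. The element names (`RecordSet`, `Record`, `Id`, `Name`) and the helper `iter_records` are made up for the sketch.

```python
import io
import xml.etree.ElementTree as ET

# Made-up XML standing in for a large file that holds a list of records.
xml_data = b"""<RecordSet>
  <Record><Id>1</Id><Name>alpha</Name></Record>
  <Record><Id>2</Id><Name>beta</Name></Record>
  <Record><Id>3</Id><Name>gamma</Name></Record>
</RecordSet>"""

def iter_records(handle):
    """Yield (id, name) for each <Record>, clearing each element after use."""
    for _event, elem in ET.iterparse(handle, events=("end",)):
        if elem.tag == "Record":
            yield elem.findtext("Id"), elem.findtext("Name")
            elem.clear()  # drop the children we no longer need

for record_id, name in iter_records(io.BytesIO(xml_data)):
    print(record_id, name)
```

Because each record is yielded as soon as its closing tag is seen, the caller can start working before the whole file has been read.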

### Building Python Functions with E-utilities

All of the code we wrote previously could be easily made into Python functions. For example, you could define a function that searches PubMed for a given search term and then generates a citation file all in one step (or a similar scheme for retrieving data files).
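
As one possible shape for such a function, the sketch below wraps the batch-download loop from earlier. To keep it runnable without network access, the fetching function is passed in as an argument. `download_in_batches` and `fake_fetch` are hypothetical names, and `fetch` is assumed to accept `retstart`, `retmax`, and extra keyword arguments and to return a readable handle, as `Entrez.efetch` does in the cells above.

```python
import io

def download_in_batches(fetch, out_handle, count, batch_size=10, **fetch_kwargs):
    """Write `count` records to `out_handle`, fetching `batch_size` at a time.

    Returns the number of batches requested.
    """
    batches = 0
    for start in range(0, count, batch_size):
        handle = fetch(retstart=start, retmax=batch_size, **fetch_kwargs)
        try:
            out_handle.write(handle.read())
        finally:
            handle.close()  # always release the handle, even on error
        batches += 1
    return batches

# Stand-in for Entrez.efetch so the sketch runs without contacting NCBI.
def fake_fetch(retstart, retmax, **kwargs):
    last = min(retstart + retmax, 25)
    lines = ["record {}\n".format(i + 1) for i in range(retstart, last)]
    return io.StringIO("".join(lines))

buffer = io.StringIO()
n_batches = download_in_batches(fake_fetch, buffer, count=25, batch_size=10, db="pubmed")
print(n_batches, len(buffer.getvalue().splitlines()))
```

With Biopython, you would pass `Entrez.efetch` as `fetch`, an open output file as `out_handle`, and the same keyword arguments used earlier (`db`, `rettype`, `retmode`, `webenv`, `query_key`).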

### An Example with ELink

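Let's use a different E-utility this time: ELink. Given a list of UIDs from one database, ELink returns related UIDs in the same database or linked UIDs in another Entrez database. Here we look up the Taxonomy UID for the seal family Phocidae and then find the PubMed records linked to it.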

In [43]:
from Bio import Entrez
Entrez.email = "cwarner@rockefeller.edu"

handle = Entrez.esearch(db = "taxonomy", term = "phocidae")
results = Entrez.read(handle)
handle.close()

print(results)

id_list = results["IdList"]




{'Count': '1', 'RetMax': '1', 'RetStart': '0', 'IdList': ['9709'], 'TranslationSet': [], 'TranslationStack': [{'Term': 'phocidae[All Names]', 'Field': 'All Names', 'Count': '1', 'Explode': 'N'}, 'GROUP'], 'QueryTranslation': 'phocidae[All Names]'}


In [None]:
# With the default cmd = "neighbor", ELink returns the linked UIDs directly.
# ELink has no usehistory parameter; cmd = "neighbor_history" posts the links to the History server instead.
handle = Entrez.elink(db = "pubmed", dbfrom = "taxonomy", id = id_list)
search_results = Entrez.read(handle)
handle.close()


In [46]:
print(search_results[0])

{'LinkSetDb': [{'Link': [{'Id': '15845833'}, {'Id': '15840117'}, {'Id': '15755884'}, {'Id': '15702463'}, {'Id': '15657738'}, {'Id': '15653886'}, {'Id': '15636154'}, {'Id': '15627521'}, {'Id': '15599768'}, {'Id': '15597432'}, {'Id': '15581937'}, {'Id': '15574926'}, {'Id': '15547801'}, {'Id': '15519732'}, {'Id': '15508573'}, {'Id': '15506185'}, {'Id': '15504391'}, {'Id': '15488010'}, {'Id': '15465714'}, {'Id': '15388740'}, {'Id': '15386135'}, {'Id': '15376691'}, {'Id': '15365810'}, {'Id': '15362830'}, {'Id': '15352491'}, {'Id': '15336671'}, {'Id': '15330452'}, {'Id': '15325146'}, {'Id': '15296299'}, {'Id': '15288743'}, {'Id': '15270120'}, {'Id': '15264492'}, {'Id': '15247308'}, {'Id': '15245408'}, {'Id': '15234883'}, {'Id': '15233163'}, {'Id': '15200869'}, {'Id': '15193087'}, {'Id': '15190756'}, {'Id': '15172694'}, {'Id': '15165069'}, {'Id': '15165046'}, {'Id': '15145227'}, {'Id': '15144018'}, {'Id': '15143524'}, {'Id': '15143043'}, {'Id': '15143018'}, {'Id': '15139648'}, {'Id': '1513748
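
The linked PubMed IDs sit several layers deep in the result printed above. The sketch below flattens them into a plain list. It uses a hand-abbreviated copy of the output (first three links only) so it runs without contacting NCBI, and `linked_ids` is a hypothetical helper.

```python
# Abbreviated copy of the structure printed above (first three links only).
link_results = [
    {
        "LinkSetDb": [
            {"Link": [{"Id": "15845833"}, {"Id": "15840117"}, {"Id": "15755884"}]}
        ]
    }
]

def linked_ids(link_sets):
    """Collect every linked UID from a parsed ELink result, in order."""
    ids = []
    for link_set in link_sets:
        # A link set with no links has no "LinkSetDb" entries to walk.
        for link_db in link_set.get("LinkSetDb", []):
            ids.extend(link["Id"] for link in link_db["Link"])
    return ids

pubmed_ids = linked_ids(link_results)
print(pubmed_ids)
```

A list like `pubmed_ids` could then be passed as the `id` argument of `Entrez.efetch` to retrieve the citations.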

## Useful Links and Documentation

* [Bio.SearchIO Documentation](https://biopython.org/docs/latest/api/Bio.SearchIO.html)
* [Bio.Entrez Documentation](https://biopython.org/docs/latest/api/Bio.Entrez.html)
* [Entrez Help](https://www.ncbi.nlm.nih.gov/books/NBK3837/)
* [Table of Entrez Databases and UIDs](https://www.ncbi.nlm.nih.gov/books/NBK25497/table/chapter2.T._entrez_unique_identifiers_ui/)
* [Values of retmode and rettype for EFetch](https://www.ncbi.nlm.nih.gov/books/NBK25499/table/chapter4.T._valid_values_of__retmode_and/?report=objectonly)
* [Table of ELinks](https://eutils.ncbi.nlm.nih.gov/entrez/query/static/entrezlinks.html)

## Credits and Inspiration

* [Biopython Tutorial and Cookbook](https://biopython.org/DIST/docs/tutorial/Tutorial.html)
