# **Programming for Biologists: Interfacing with Entrez Databases using Biopython**

**Welcome, data miners!**

In the previous notebook, you learned how to manipulate biological sequences and parse local files using Biopython's `Seq`, `SeqRecord`, and `SeqIO` modules. Now, let's take your data acquisition skills to the next level: **accessing the vast biological databases directly from your Python code!**

The **National Center for Biotechnology Information (NCBI)** hosts an incredible array of publicly available biological and biomedical databases. These databases are integrated through the **Entrez** system.

Manually Browse the NCBI website for data is fine for a few queries, but what if you need to download thousands of gene sequences, retrieve abstract information for hundreds of papers, or check for related entries across different databases?

This is where `Bio.Entrez` comes in. It's Biopython's module for programmatically querying and downloading data from Entrez. It allows you to:

* **Discover databases:** Find out what databases are available and what fields they contain (`EInfo`).
* **Search:** Perform sophisticated searches using keywords, accession numbers, or other criteria to retrieve lists of unique identifiers (IDs) (`ESearch`).
* **Fetch:** Download the full data records corresponding to those IDs in various formats (`EFetch`).
* **Link:** Find related entries across different databases (`ELink`).

Let's get started!

In [2]:
!pip install biopython

Collecting biopython
  Downloading biopython-1.85-cp311-cp311-win_amd64.whl.metadata (13 kB)
Downloading biopython-1.85-cp311-cp311-win_amd64.whl (2.8 MB)
   ---------------------------------------- 0.0/2.8 MB ? eta -:--:--
   ------- -------------------------------- 0.5/2.8 MB 4.2 MB/s eta 0:00:01
   ------------------------------------- -- 2.6/2.8 MB 8.9 MB/s eta 0:00:01
   ------------------------------------- -- 2.6/2.8 MB 8.9 MB/s eta 0:00:01
   ---------------------------------------- 2.8/2.8 MB 4.8 MB/s eta 0:00:00
Installing collected packages: biopython
Successfully installed biopython-1.85


---

## **1. Setup: Importing Entrez and Setting Your Email**

First, we import the `Entrez` module. **Crucially, you must set `Entrez.email` to a valid email address.** NCBI requires this for tracking usage and to contact you if there are any issues or if your script is making an excessive number of requests.

It's good practice to provide your actual email address, but for a workshop, you can use a placeholder like `"your.name@example.com"`.

We'll also import `SeqIO` again, as it's often used to parse the results from `Entrez.efetch`.

In [3]:
from Bio import Entrez
from Bio import SeqIO

# IMPORTANT: Set your email address! NCBI requires this.
Entrez.email = "your.name@example.com"  # Replace with your actual email if you plan extensive use

print("Bio.Entrez and Bio.SeqIO imported. Entrez email set.")

Bio.Entrez and Bio.SeqIO imported. Entrez email set.


---

## **2. `Entrez.einfo()`: Discovering Databases and Fields**

The `einfo()` function is used to obtain information about the Entrez databases and their available search fields. This is incredibly useful when you're not sure how to structure your search queries or what data types are available.

### **2.1 Listing All Available Databases**

You can get a list of all databases managed by Entrez by calling `einfo()` without any specific database parameter.

In [4]:
print("--- Listing All Entrez Databases ---")
try:
    handle = Entrez.einfo()
    record = Entrez.read(handle)
    handle.close()

    print("Available Databases:")
    for db_name in record['DbList']:
        print(f"- {db_name}")
except Exception as e:
    print(f"An error occurred: {e}")


print("\nCommonly used databases for biologists include: pubmed, nuccore (nucleotide), protein, gene, taxonomy, cdd, snp, sra, pmb")

--- Listing All Entrez Databases ---
Available Databases:
- pubmed
- protein
- nuccore
- ipg
- nucleotide
- structure
- genome
- annotinfo
- assembly
- bioproject
- biosample
- blastdbinfo
- books
- cdd
- clinvar
- gap
- gapplus
- grasp
- dbvar
- gene
- gds
- geoprofiles
- medgen
- mesh
- nlmcatalog
- omim
- orgtrack
- pmc
- proteinclusters
- pcassay
- protfam
- pccompound
- pcsubstance
- seqannot
- snp
- sra
- taxonomy
- biocollections
- gtr

Commonly used databases for biologists include: pubmed, nuccore (nucleotide), protein, gene, taxonomy, cdd, snp, sra, pmb


### **2.2 Getting Information About a Specific Database**

To understand a database's structure and search capabilities, pass the `db` parameter to `einfo()`.

In [6]:
print("--- Information about 'nuccore' (Nucleotide) Database ---")
try:
    handle = Entrez.einfo(db="nuccore")
    record = Entrez.read(handle)
    handle.close()

    # print(record) # Uncomment to see the full raw output

    db_info = record['DbInfo']
    print(f"DB Name: {db_info['DbName']}")
    print(f"Description: {db_info['Description']}")
    print(f"Total Records: {db_info['Count']}")

    print("\nSearchable Fields (partial list):")
    # Iterate through fields and print common ones
    for field in db_info['FieldList']:
        # Check if the field is relevant for searching and not too obscure
        if field['Name'] in ['ALL', 'ACCN', 'GENE', 'TITL', 'ORGN', 'AUTH', 'FKEY', 'PDAT']:
             print(f"- {field['Name']}: {field['Description']} ({field['IsIndexed']})")

except Exception as e:
    print(f"An error occurred: {e}")

--- Information about 'nuccore' (Nucleotide) Database ---
DB Name: nuccore
Description: Core Nucleotide db
Total Records: 660845242

Searchable Fields (partial list):
An error occurred: 'IsIndexed'


**Your Turn! (EInfo Exercise)**

1.  Use `Entrez.einfo()` to get information about the `pubmed` database.
2.  Print the total number of records in `pubmed`.
3.  List at least 5 different searchable fields from the `pubmed` database and their descriptions.

In [7]:
# Write your code for EInfo Exercise here!


---

## **3. `Entrez.esearch()`: Searching for Data (Getting IDs)**

The `esearch()` function is your primary tool for querying Entrez databases. It takes a database name (`db`) and a search term (`term`) and returns a list of unique identifiers (UIDs) that match your query.

Key parameters:
* `db`: The database to search (e.g., `'nuccore'`, `'pubmed'`, `'protein'`).
* `term`: Your search query (e.g., `'human[ORGN] AND hemoglobin[GENE]'`). You can use Entrez query syntax.
* `retmax`: The maximum number of UIDs to return. Default is often 20. Essential for getting more results.
* `retstart`: The index of the first UID to retrieve (for pagination).
* `usehistory`: If `True`, enables Web Environment for very large queries, allowing you to fetch results in batches. (More advanced, but good to know).

### **3.1 Searching for Nucleotide Sequences**

Let's search the `nuccore` database for a specific gene from a particular organism.

In [8]:
print("--- Searching 'nuccore' for 'human hemoglobin alpha' ---")

search_term = "human AND hemoglobin alpha"
try:
    handle = Entrez.esearch(db="nuccore", term=search_term, retmax="5") # Get up to 5 results
    record = Entrez.read(handle)
    handle.close()

    # print(record) # Uncomment to see the full raw output

    count = int(record['Count'])
    id_list = record['IdList']

    print(f"Found {count} records matching '{search_term}'.")
    print(f"Retrieved {len(id_list)} IDs: {id_list}")
except Exception as e:
    print(f"An error occurred during nuccore search: {e}")

--- Searching 'nuccore' for 'human hemoglobin alpha' ---
Found 0 records matching 'human[ORGN] AND hemoglobin alpha[GENE]'.
Retrieved 0 IDs: []


### **3.2 Searching PubMed for Articles**

You can search the `pubmed` database for scientific literature.

In [9]:
print("--- Searching 'pubmed' for 'CRISPR cancer therapy review' ---")

search_term_pubmed = "CRISPR cancer therapy review"
try:
    handle = Entrez.esearch(db="pubmed", term=search_term_pubmed, retmax="10") # Get up to 10 results
    record = Entrez.read(handle)
    handle.close()

    count = int(record['Count'])
    id_list_pubmed = record['IdList']

    print(f"Found {count} PubMed articles matching '{search_term_pubmed}'.")
    print(f"Retrieved {len(id_list_pubmed)} IDs: {id_list_pubmed}")
except Exception as e:
    print(f"An error occurred during PubMed search: {e}")

--- Searching 'pubmed' for 'CRISPR cancer therapy review' ---
Found 1174 PubMed articles matching 'CRISPR cancer therapy review'.
Retrieved 10 IDs: ['40394347', '40389954', '40358685', '40356298', '40356202', '40351170', '40350539', '40346049', '40342841', '40342053']


**Your Turn! (ESearch Exercise)**

1.  Search the `protein` database for `"SARS-CoV-2 spike protein"`.
2.  Retrieve and print the first 3 IDs found.
3.  Print the total count of matching protein records.

In [13]:
# Write your code for ESearch Exercise here!
print("--- ESearch Exercise: SARS-CoV-2 spike protein ---")
search_term_spike = "SARS-CoV-2 spike protein"
try:
    handle = Entrez.esearch(db="protein", term=search_term_spike, retmax="3")
    record = Entrez.read(handle)
    handle.close()

    count_spike = int(record['Count'])
    id_list_spike = record['IdList']

    print(f"Total protein records found for '{search_term_spike}': {count_spike}")
    print(f"First 3 retrieved IDs: {id_list_spike}")

except Exception as e:
    print(f"An error occurred during protein search: {e}")

--- ESearch Exercise: SARS-CoV-2 spike protein ---
Total protein records found for 'SARS-CoV-2 spike protein': 48040
First 3 retrieved IDs: ['984655739', '118150418', '2976446949']


---

## **4. `Entrez.efetch()`: Retrieving Full Records**

Once you have a list of IDs from `esearch()`, you use `efetch()` to download the actual data records. This is where you get the sequences, annotations, abstracts, etc.

Key parameters:
* `db`: The database the IDs belong to.
* `id`: A single ID or a comma-separated string of IDs.
    * *Tip:* If you have a list of IDs from `esearch()`, you can join them: `','.join(id_list)`.
    
    
    
* `rettype`: The return type, specifying the format of the data. Common types include:
    * `'fasta'` (for nucleotide or protein sequences)
    * `'gb'` or `'genbank'` (for GenBank records)
    * `'medline'` (for PubMed abstracts)
    * `'xml'` (raw XML, flexible but needs parsing)
* `retmode`: The return mode, typically `'text'` for human-readable formats or `'xml'` for structured data.

### **4.1 Fetching a Single GenBank Record**

Let's fetch a known GenBank accession number (e.g., from a previous search or directly provided).

In [14]:
print("--- Fetching a single GenBank record (Accession: 'NC_001416') ---")
accession_id = "NC_001416" # Phage lambda genome

try:
    handle = Entrez.efetch(db="nuccore", id=accession_id, rettype="gb", retmode="text")
    # Read the GenBank record using SeqIO.read() because we expect one record
    record = SeqIO.read(handle, "genbank")
    handle.close()

    print(f"ID: {record.id}")
    print(f"Name: {record.name}")
    print(f"Description: {record.description}")
    print(f"Sequence Length: {len(record.seq)}")
    print(f"First 50 bases of sequence: {record.seq[:50]}...")
    print("Features (partial list):")
    for feature in record.features[:3]: # Print first 3 features
        print(f"  - Type: {feature.type}, Location: {feature.location}")

except Exception as e:
    print(f"An error occurred during GenBank fetch: {e}")

--- Fetching a single GenBank record (Accession: 'NC_001416') ---
ID: NC_001416.1
Name: NC_001416
Description: Enterobacteria phage lambda, complete genome
Sequence Length: 48502
First 50 bases of sequence: GGGCGGCGACCTCGCGGGTTTTCGCTATTTATGAAAATTTTCCGGTTTAA...
Features (partial list):
  - Type: source, Location: [0:48502](+)
  - Type: gene, Location: [190:736](+)
  - Type: CDS, Location: [190:736](+)


### **4.2 Fetching Multiple FASTA Sequences**

You can pass a list of IDs to `efetch()` and get multiple records back. `SeqIO.parse()` is ideal for iterating through these multiple records.

In [21]:
print("--- Fetching multiple FASTA sequences (from previous human hemoglobin search) ---")

# Reuse the IDs from the previous ESearch (assuming it ran successfully)
if 'id_list' in locals() and id_list:
    fetch_ids = ','.join(id_list)
    print(f"Fetching IDs: {fetch_ids}")

    try:
        handle = Entrez.efetch(db="nuccore", id=fetch_ids, rettype="fasta", retmode="text")
        # Use SeqIO.parse for multiple records from the handle
        fasta_records = list(SeqIO.parse(handle, "fasta"))
        handle.close()

        print(f"\nFetched {len(fasta_records)} FASTA records.")
        for i, record in enumerate(fasta_records):
            print(f"Record {i+1}: ID={record.id}, Length={len(record.seq)}, First 30 bases={record.seq[:30]}...")

    except Exception as e:
        print(f"An error occurred during multi-FASTA fetch: {e}")
else:
    print("No IDs available from previous ESearch for 'human hemoglobin alpha'. Please run that cell first.")

--- Fetching multiple FASTA sequences (from previous human hemoglobin search) ---
Fetching IDs: 46309492,1478050928,2978489883,2724200156,2724200150

Fetched 5 FASTA records.
Record 1: ID=NM_207062.1, Length=2338, First 30 bases=GGGAGACAGAGGAACGCTGTAAGGATAGTA...
Record 2: ID=NM_001366015.1, Length=2428, First 30 bases=GAAAAATAAAAACTTTCATCATGAGGTGGC...
Record 3: ID=CP040665.2, Length=2820179, First 30 bases=CGATGCGAGCAATCAAATTTCATAACATCA...
Record 4: ID=NC_088388.1, Length=133402803, First 30 bases=CCCTAACCCTAACCCTAACCCTAACCCTAA...
Record 5: ID=NC_088394.1, Length=86056530, First 30 bases=CCCTAACCCTAACCCTAACCCTAACCCTAA...


**Your Turn! (EFetch Exercise)**

1.  Choose one of the protein IDs you found in the `ESearch` exercise (`SARS-CoV-2 spike protein`). If you didn't run it, use `YP_009724390.1` as an example.
2.  Use `Entrez.efetch()` to retrieve the protein sequence in **FASTA** format from the `protein` database.
3.  Parse the fetched data using `SeqIO.read()` (since it's a single sequence).
4.  Print the protein's ID, description, and its full sequence.

In [None]:
# Write your code for EFetch Exercise here!


---

## **5. `Entrez.elink()`: Finding Related Records (Cross-Database Linking)**

The `elink()` function allows you to find related records. Given an ID in one database, you can find linked IDs in the same or different databases. This is useful for exploring biological connections.

Key parameters:
* `dbfrom`: The database from which the input IDs come.
    
* `db`: The target database to link to.
* `id`: The ID(s) to link from.

    

In [40]:
print("--- ELink: Finding PubMed articles related to a Gene ID ---")

gene_id = "41985191" # Example: human gene for 'TP53' (tumor protein p53)

try:
    handle = Entrez.elink(dbfrom="gene", db="pubmed", LinkName="gene_pubmed", id=gene_id)
    record = Entrez.read(handle)
    handle.close()

    print(record) # Uncomment to see the full raw output of elink

    pubmed_ids = []
    if record[0]['LinkSetDb']:
        for link_set_db in record[0]['LinkSetDb']:
            pubmed_ids.extend([link['Id'] for link in link_set_db['Link'] ])

    print(f"Found {len(pubmed_ids)} PubMed articles linked to Gene ID {gene_id}.")
    print(f"First 5 PubMed IDs: {pubmed_ids[:5]}")

    # Optionally, fetch titles for these PubMed IDs
    if pubmed_ids:
        handle_pubmed = Entrez.efetch(db="pubmed", id=','.join(pubmed_ids[:3]), rettype="medline", retmode="text")
        medline_records = handle_pubmed.read().split('\n\n') # Medline records are separated by double newlines
        handle_pubmed.close()
        print("\nFirst 3 linked PubMed article titles:")
        for rec_text in medline_records:
            if 'TI  -' in rec_text: # Title field in Medline format
                title_line = [line for line in rec_text.split('\n') if line.startswith('TI  -')]
                if title_line:
                    print(f"- {title_line[0].replace('TI  - ', '').strip()}")

except Exception as e:
    print(f"An error occurred during ELink: {e}")

--- ELink: Finding PubMed articles related to a Gene ID ---
[{'LinkSetDb': [{'Link': [{'Id': '22301074'}, {'Id': '30032202'}], 'DbTo': 'pubmed', 'LinkName': 'gene_pubmed'}], 'LinkSetDbHistory': [], 'ERROR': [], 'DbFrom': 'gene', 'IdList': ['41985191']}]
Found 2 PubMed articles linked to Gene ID 41985191.
First 5 PubMed IDs: ['22301074', '30032202']

First 3 linked PubMed article titles:
- Manual GO annotation of predictive protein signatures: the InterPro approach to
- TreeGrafter: phylogenetic tree-based annotation of proteins with Gene Ontology


---

## **6. Putting It All Together: A Complete Workflow**

Let's combine `esearch` and `efetch` to simulate a common task: searching for a protein of interest, and then downloading its sequence and related abstract information.

In [45]:
print("--- Complete Workflow: Search , get its ID, then fetch its sequence ---")

Search TERM = "MAPK"
search_term = f"{organism_name}
database = "nuccore"

print(f"\n1. Searching '{database}' for: '{search_term}'")
try:
    # Step 1: Search for IDs
    handle_search = Entrez.esearch(db=database, term=search_term, retmax="1") # Get just the top result
    search_record = Entrez.read(handle_search)
    handle_search.close()

    found_count = int(search_record['Count'])
    if found_count == 0:
        print(f"No records found for '{search_term}'.")
    else:
        top_id = search_record['IdList'][0]
        print(f"  Found {found_count} records. Top ID: {top_id}")

        # Step 2: Fetch the full record in GenBank format
        print(f"\n2. Fetching full GenBank record for ID: {top_id}")
        handle_fetch = Entrez.efetch(db=database, id=top_id, rettype="gb", retmode="text")
        genome_record = SeqIO.read(handle_fetch, "genbank")
        handle_fetch.close()

        print(f"  Record ID: {genome_record.id}")
        print(f"  Description: {genome_record.description}")
        print(f"  Sequence Length: {len(genome_record.seq)} bases")
        print(f"  First 100 bases: {genome_record.seq[:100]}...")

        # Step 3 (Optional): Find related PubMed articles using ELink
        print(f"\n3. Finding related PubMed articles using ELink...")
        handle_elink = Entrez.elink(dbfrom="nuccore", db="pubmed", LinkName="nuccore_pubmed", id=top_id)
        elink_record = Entrez.read(handle_elink)
        handle_elink.close()

        linked_pubmed_ids = []
        if elink_record[0]['LinkSetDb']:
            for link_set_db in record[0]['LinkSetDb']:
                linked_pubmed_ids.extend([link['Id'] for link in link_set_db['Link'] ])

        if linked_pubmed_ids:
            print(f"  Found {len(linked_pubmed_ids)} related PubMed articles. First 3 IDs: {linked_pubmed_ids[:3]}")
        else:
            print("  No directly linked PubMed articles found for this record.")

except Exception as e:
    print(f"An error occurred during the workflow: {e}")

--- Complete Workflow: Search for a virus genome, get its ID, then fetch its sequence ---

1. Searching 'nuccore' for: 'MAPK AND genome'
  Found 44280 records. Top ID: 2889144701

2. Fetching full GenBank record for ID: 2889144701
  Record ID: NM_001435891.1
  Description: Danio rerio TEA domain family member 3 a (tead3a), mRNA
  Sequence Length: 1944 bases
  First 100 bases: ATACTTCACATTCCAGCTTTACTGTCAAATCAGGAGAAATATTTCTTCAAAAACATCAGCGCGGTTTGCTTTTGTAAATGAGCTCCTGCGATAAAGCCTG...

3. Finding related PubMed articles using ELink...
  Found 2 related PubMed articles. First 3 IDs: ['22301074', '30032202']


---

## **Conclusion: Programmatic Access to Biological Knowledge**

You've now successfully navigated the world of NCBI databases using Biopython's `Bio.Entrez` module!

You can now:

* **Discover** available databases and their fields (`EInfo`).
* **Search** for records based on various criteria to get their IDs (`ESearch`).
* **Fetch** the full data records in desired formats using those IDs (`EFetch`).
* **Link** records across different databases (`ELink`).

This programmatic access is invaluable for automating data retrieval, performing large-scale analyses, and ensuring reproducibility in your bioinformatics workflows. Remember to always be respectful of NCBI's usage guidelines (e.g., provide your email, avoid excessive rapid requests).

Keep practicing, explore more of Biopython's capabilities, and leverage the power of Entrez for your biological research!