## This notebook is created by Prateek Paul.
* Email: prateekp@iiitd.ac.in
* LinkedIn: [linkedin.com/in/prateekpaulpro/](https://linkedin.com/in/prateekpaulpro/)

Disclaimer: 
The code and content in this notebook are compiled from various open sources, personal experience, and reference materials. It is intended solely for educational purposes. All credits for original ideas and code snippets go to their respective authors. If you find any inaccuracies or have suggestions, feel free to reach out.

# Essential Biological Databases
## 1. GenBank
A comprehensive public database of nucleotide sequences maintained by NCBI.
Important for storing and sharing genetic information.
## 2. UniProt
A curated database of protein sequences and their functional annotation.
Useful for protein annotation, interactions, and pathways.
## 3. PDB
A repository for 3D structural data of large biological molecules, including proteins and nucleic acids.
Essential for structural bioinformatics and molecular modeling.
## 4. NCBI Entrez
A tool for accessing a wide range of databases like PubMed, Gene, Protein, and Taxonomy through a single interface.
Ideal for cross-database queries.
## 5. KEGG
A resource for understanding biological systems, diseases, and pathways.
Useful for systems biology and metabolic pathway analysis.


* Visit each of these databases and explore them before running the code.
* Many more databases exist; try exploring some of them as well.

# Biopython

Biopython is a set of freely available tools for biological computation written in Python by an international team of developers.

It is a distributed collaborative effort to develop Python libraries and applications which address the needs of current and future work in bioinformatics. The source code is made available under the Biopython License, which is extremely liberal and compatible with almost every license in the world.

https://biopython.org/

In [1]:
pip install biopython



## **Entrez**
Entrez (https://www.ncbi.nlm.nih.gov/Web/Search/entrezfs.html) is a data retrieval system that gives users access to NCBI’s databases such as PubMed, GenBank, GEO, and many others. You can query Entrez manually from a web browser, or programmatically through Biopython’s `Bio.Entrez` module. The latter lets you, for example, search PubMed or download GenBank records from within a Python script.

https://biopython.org/docs/dev/Tutorial/chapter_entrez.html
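Programmatic access ultimately means sending HTTP requests to the NCBI E-utilities. The sketch below only builds such a request URL with the standard library, so it runs offline. `build_eutils_url` is a hypothetical helper written for this notebook, and the exact parameters that `Bio.Entrez` sends may differ from the ones shown here.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities; each tool (einfo, esearch, efetch, ...)
# is a separate .fcgi endpoint under it.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"


def build_eutils_url(tool, **params):
    """Return the GET URL for an E-utilities tool (hypothetical helper)."""
    return f"{EUTILS_BASE}{tool}.fcgi?{urlencode(params)}"


url = build_eutils_url(
    "esearch",
    db="pubmed",
    term="biopython[title]",
    retmax=40,
    email="A.N.Other@example.com",  # placeholder address, replace with your own
)
print(url)
```

Pasting the printed URL into a browser should return the same kind of XML that the `Bio.Entrez` calls in the following cells parse for you.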

## EInfo: Obtaining information about the Entrez databases.

EInfo provides field index term counts, last update, and available links for each of NCBI’s databases. In addition, you can use EInfo to obtain a list of all database names accessible through the Entrez utilities:

In [2]:
from Bio import Entrez
Entrez.email = "your-email@iiitd.ac.in"  # Always tell NCBI who you are
stream = Entrez.einfo()
result = stream.read()
stream.close()
print(result)

b'<?xml version="1.0" encoding="UTF-8" ?>\n<!DOCTYPE eInfoResult PUBLIC "-//NLM//DTD einfo 20190110//EN" "https://eutils.ncbi.nlm.nih.gov/eutils/dtd/20190110/einfo.dtd">\n<eInfoResult>\n<DbList>\n\n\t<DbName>pubmed</DbName>\n\t<DbName>protein</DbName>\n\t<DbName>nuccore</DbName>\n\t<DbName>ipg</DbName>\n\t<DbName>nucleotide</DbName>\n\t<DbName>structure</DbName>\n\t<DbName>genome</DbName>\n\t<DbName>annotinfo</DbName>\n\t<DbName>assembly</DbName>\n\t<DbName>bioproject</DbName>\n\t<DbName>biosample</DbName>\n\t<DbName>blastdbinfo</DbName>\n\t<DbName>books</DbName>\n\t<DbName>cdd</DbName>\n\t<DbName>clinvar</DbName>\n\t<DbName>gap</DbName>\n\t<DbName>gapplus</DbName>\n\t<DbName>grasp</DbName>\n\t<DbName>dbvar</DbName>\n\t<DbName>gene</DbName>\n\t<DbName>gds</DbName>\n\t<DbName>geoprofiles</DbName>\n\t<DbName>medgen</DbName>\n\t<DbName>mesh</DbName>\n\t<DbName>nlmcatalog</DbName>\n\t<DbName>omim</DbName>\n\t<DbName>orgtrack</DbName>\n\t<DbName>pmc</DbName>\n\t<DbName>popset</DbName>\n\t<D

In [3]:
from Bio import Entrez
stream = Entrez.einfo()
record = Entrez.read(stream)
stream.close()

In [4]:
record['DbList']

['pubmed', 'protein', 'nuccore', 'ipg', 'nucleotide', 'structure', 'genome', 'annotinfo', 'assembly', 'bioproject', 'biosample', 'blastdbinfo', 'books', 'cdd', 'clinvar', 'gap', 'gapplus', 'grasp', 'dbvar', 'gene', 'gds', 'geoprofiles', 'medgen', 'mesh', 'nlmcatalog', 'omim', 'orgtrack', 'pmc', 'popset', 'proteinclusters', 'pcassay', 'protfam', 'pccompound', 'pcsubstance', 'seqannot', 'snp', 'sra', 'taxonomy', 'biocollections', 'gtr']
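`Entrez.read` turns the XML response into Python objects. As a rough illustration of what that involves, the sketch below extracts the database names from a shortened copy of the EInfo response using only the standard library. `sample_xml` is a trimmed stand-in for the real response, not the full document.

```python
import xml.etree.ElementTree as ET

# A shortened copy of the EInfo response shown above (first three databases only).
sample_xml = b"""<?xml version="1.0" encoding="UTF-8" ?>
<eInfoResult>
<DbList>
    <DbName>pubmed</DbName>
    <DbName>protein</DbName>
    <DbName>nuccore</DbName>
</DbList>
</eInfoResult>"""

root = ET.fromstring(sample_xml)
# Collect the text of every <DbName> element, in document order.
db_names = [node.text for node in root.iter("DbName")]
print(db_names)
```

In practice `Entrez.read` is the more convenient route, since it returns the nested dictionary-like structure used in the cells above.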

## 1. PubMed

In [5]:
from Bio import Entrez

stream = Entrez.einfo(db="pubmed")
record = Entrez.read(stream)
record["DbInfo"]["Description"]

'PubMed bibliographic record'

In [6]:
record["DbInfo"]["Count"]

'37859381'

In [7]:
record["DbInfo"]["LastUpdate"]

'2024/10/13 22:16'

In [8]:
for field in record["DbInfo"]["FieldList"]:
    print("%(Name)s, %(FullName)s, %(Description)s" % field)

ALL, All Fields, All terms from all searchable fields
UID, UID, Unique number assigned to publication
FILT, Filter, Limits the records
TITL, Title, Words in title of publication
MESH, MeSH Terms, Medical Subject Headings assigned to publication
MAJR, MeSH Major Topic, MeSH terms of major importance to publication
JOUR, Journal, Journal abbreviation of publication
AFFL, Affiliation, Author's institutional affiliation and address
ECNO, EC/RN Number, EC number for enzyme or CAS registry number
SUBS, Supplementary Concept, CAS chemical name or MEDLINE Substance Name
PDAT, Date - Publication, Date of publication
EDAT, Date - Entry, Date publication first accessible through Entrez
VOL, Volume, Volume number of publication
PAGE, Pagination, Page number(s) of publication
PTYP, Publication Type, Type of publication (e.g., review)
LANG, Language, Language of publication
ISS, Issue, Issue number of publication
SUBH, MeSH Subheading, Additional specificity for MeSH term
SI, Secondary Source ID, Cr

In [9]:
from Bio import Entrez
Entrez.email = "A.N.Other@example.com"  # Always tell NCBI who you are
stream = Entrez.esearch(db="pubmed", term="biopython[title]", retmax="40")
record = Entrez.read(stream)
# "19304878" in record["IdList"]

In [10]:
record["IdList"]

['34434786', '22909249', '19304878']

EFetch is what you use when you want to retrieve a full record from Entrez.


In [11]:
from Bio.Entrez import efetch
handle = efetch(db='pubmed', id=34434786, retmode='text', rettype='abstract')
print(handle.read())

1. MethodsX. 2021 Feb 14;8:101264. doi: 10.1016/j.mex.2021.101264. eCollection 
2021.

A Biopython-based method for comprehensively searching for eponyms in Pubmed.

Cornish TC(1), Kricka LJ(2), Park JY(3).

Author information:
(1)Department of Pathology, University of Colorado School of Medicine, Aurora, 
CO, USA.
(2)Department of Pathology & Laboratory Medicine, Perelman School of Medicine at 
the University of Pennsylvania, Philadelphia, PA, USA.
(3)Department of Pathology and the Eugene McDermott Center for Human Growth and 
Development, Children's Medical Center, and University of Texas Southwestern 
Medical School, Dallas, TX, USA.

Eponyms are common in medicine; however, their usage has varied between 
specialties and over time. A search of specific eponyms will reveal the 
frequency of usage within a medical specialty. While usage of eponyms can be 
studied by searching PubMed, manual searching can be time-consuming. As an 
alternative, we modified an existing Biopython method

In [12]:
for pmid in record["IdList"]:
  handle = efetch(db='pubmed', id=pmid, retmode='text', rettype='abstract')
  print(handle.read())
  print('-----------')

1. MethodsX. 2021 Feb 14;8:101264. doi: 10.1016/j.mex.2021.101264. eCollection 
2021.

A Biopython-based method for comprehensively searching for eponyms in Pubmed.

Cornish TC(1), Kricka LJ(2), Park JY(3).

Author information:
(1)Department of Pathology, University of Colorado School of Medicine, Aurora, 
CO, USA.
(2)Department of Pathology & Laboratory Medicine, Perelman School of Medicine at 
the University of Pennsylvania, Philadelphia, PA, USA.
(3)Department of Pathology and the Eugene McDermott Center for Human Growth and 
Development, Children's Medical Center, and University of Texas Southwestern 
Medical School, Dallas, TX, USA.

Eponyms are common in medicine; however, their usage has varied between 
specialties and over time. A search of specific eponyms will reveal the 
frequency of usage within a medical specialty. While usage of eponyms can be 
studied by searching PubMed, manual searching can be time-consuming. As an 
alternative, we modified an existing Biopython method
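The loop above sends one request per PMID. EFetch also accepts several IDs in a single request, written as a comma-separated string, which reduces the number of round trips for longer ID lists. The sketch below only prepares those ID strings and makes no network call. `batched_ids` and the default `batch_size` of 200 are assumptions chosen for illustration.

```python
def batched_ids(id_list, batch_size=200):
    """Yield comma-separated ID strings, at most batch_size IDs each."""
    for start in range(0, len(id_list), batch_size):
        yield ",".join(id_list[start:start + batch_size])


pmids = ['34434786', '22909249', '19304878']  # IdList from the search above
for id_string in batched_ids(pmids, batch_size=2):
    # Each string could be passed as id=... to a single efetch call.
    print(id_string)
```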

## 2. GenBank

In [13]:
from Bio import Entrez
Entrez.email = "your-email@iiitd.ac.in"

# Fetch a nucleotide sequence using an accession number
handle = Entrez.efetch(db="nucleotide", id="NM_001301717", rettype="fasta", retmode="text")
fasta_seq = handle.read()
print(fasta_seq)

>NM_001301717.2 Homo sapiens C-C motif chemokine receptor 7 (CCR7), transcript variant 4, mRNA
CTCTAGATGAGTCAGTGGAGGGCGGGTGGAGCGTTGAACCGTGAAGAGTGTGGTTGGGCGTAAACGTGGA
CTTAAACTCAGGAGCTAAGGGGGAAACCAATGAAAAGCGTGCTGGTGGTGGCTCTCCTTGTCATTTTCCA
GGTATGCCTGTGTCAAGATGAGGTCACGGACGATTACATCGGAGACAACACCACAGTGGACTACACTTTG
TTCGAGTCTTTGTGCTCCAAGAAGGACGTGCGGAACTTTAAAGCCTGGTTCCTCCCTATCATGTACTCCA
TCATTTGTTTCGTGGGCCTACTGGGCAATGGGCTGGTCGTGTTGACCTATATCTATTTCAAGAGGCTCAA
GACCATGACCGATACCTACCTGCTCAACCTGGCGGTGGCAGACATCCTCTTCCTCCTGACCCTTCCCTTC
TGGGCCTACAGCGCGGCCAAGTCCTGGGTCTTCGGTGTCCACTTTTGCAAGCTCATCTTTGCCATCTACA
AGATGAGCTTCTTCAGTGGCATGCTCCTACTTCTTTGCATCAGCATTGACCGCTACGTGGCCATCGTCCA
GGCTGTCTCAGCTCACCGCCACCGTGCCCGCGTCCTTCTCATCAGCAAGCTGTCCTGTGTGGGCATCTGG
ATACTAGCCACAGTGCTCTCCATCCCAGAGCTCCTGTACAGTGACCTCCAGAGGAGCAGCAGTGAGCAAG
CGATGCGATGCTCTCTCATCACAGAGCATGTGGAGGCCTTTATCACCATCCAGGTGGCCCAGATGGTGAT
CGGCTTTCTGGTCCCCCTGCTGGCCATGAGCTTCTGTTACCTTGTCATCATCCGCACCCTGCTCCAGGCA
CGCAACTTTGAGCGCAACAAGGCCATCAAGGTGATCATCGCTGTGGTCGTGGT

Q1. As covered in Tutorial 1, write this FASTA sequence to `NM_001301717.fasta`.

Q2. Calculate its GC content.

Q3. Modify the code to get the output in GenBank (`gb`) format, and save it to the `NM_001301717.gb` file.
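For the GC content question, one possible approach is sketched below in plain Python. It uses a toy record so that it runs without a network connection; `parse_fasta`, `gc_content`, and `toy_fasta` are hypothetical names, and the toy sequence is made up for illustration. Recent Biopython releases also provide a GC helper in `Bio.SeqUtils`, which you may prefer.

```python
def parse_fasta(text):
    """Split a single-record FASTA string into (header, sequence)."""
    lines = text.strip().splitlines()
    header = lines[0].lstrip(">")
    sequence = "".join(line.strip() for line in lines[1:])
    return header, sequence


def gc_content(sequence):
    """Return the percentage of G and C bases in a nucleotide sequence."""
    sequence = sequence.upper()
    if not sequence:
        return 0.0
    gc = sequence.count("G") + sequence.count("C")
    return 100.0 * gc / len(sequence)


# Toy record for illustration only; use fasta_seq from the cell above instead.
toy_fasta = ">toy_record hypothetical example\nATGC\nGGCC\n"
header, sequence = parse_fasta(toy_fasta)
print(header, len(sequence), gc_content(sequence))
```

To apply it to the real record, pass `fasta_seq` from the cell above to `parse_fasta` in place of `toy_fasta`.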

## 3. UniProt


In [33]:
from Bio import ExPASy, SwissProt
handle = ExPASy.get_sprot_raw("P00533")  # Example accession
record = SwissProt.read(handle)
record

<Bio.SwissProt.Record at 0x7efcbd9577c0>

In [34]:
record.entry_name

'EGFR_HUMAN'

In [35]:
record.gene_name

[{'Name': 'EGFR {ECO:0000312|HGNC:HGNC:3236}',
  'Synonyms': ['ERBB', 'ERBB1', 'HER1']}]

In [36]:
record.description

'RecName: Full=Epidermal growth factor receptor {ECO:0000305}; EC=2.7.10.1; AltName: Full=Proto-oncogene c-ErbB-1; AltName: Full=Receptor tyrosine-protein kinase erbB-1; Flags: Precursor;'

In [18]:
record.sequence

'MRPSGTAGAALLALLAALCPASRALEEKKVCQGTSNKLTQLGTFEDHFLSLQRMFNNCEVVLGNLEITYVQRNYDLSFLKTIQEVAGYVLIALNTVERIPLENLQIIRGNMYYENSYALAVLSNYDANKTGLKELPMRNLQEILHGAVRFSNNPALCNVESIQWRDIVSSDFLSNMSMDFQNHLGSCQKCDPSCPNGSCWGAGEENCQKLTKIICAQQCSGRCRGKSPSDCCHNQCAAGCTGPRESDCLVCRKFRDEATCKDTCPPLMLYNPTTYQMDVNPEGKYSFGATCVKKCPRNYVVTDHGSCVRACGADSYEMEEDGVRKCKKCEGPCRKVCNGIGIGEFKDSLSINATNIKHFKNCTSISGDLHILPVAFRGDSFTHTPPLDPQELDILKTVKEITGFLLIQAWPENRTDLHAFENLEIIRGRTKQHGQFSLAVVSLNITSLGLRSLKEISDGDVIISGNKNLCYANTINWKKLFGTSGQKTKIISNRGENSCKATGQVCHALCSPEGCWGPEPRDCVSCRNVSRGRECVDKCNLLEGEPREFVENSECIQCHPECLPQAMNITCTGRGPDNCIQCAHYIDGPHCVKTCPAGVMGENNTLVWKYADAGHVCHLCHPNCTYGCTGPGLEGCPTNGPKIPSIATGMVGALLLLLVVALGIGLFMRRRHIVRKRTLRRLLQERELVEPLTPSGEAPNQALLRILKETEFKKIKVLGSGAFGTVYKGLWIPEGEKVKIPVAIKELREATSPKANKEILDEAYVMASVDNPHVCRLLGICLTSTVQLITQLMPFGCLLDYVREHKDNIGSQYLLNWCVQIAKGMNYLEDRRLVHRDLAARNVLVKTPQHVKITDFGLAKLLGAEEKEYHAEGGKVPIKWMALESILHRIYTHQSDVWSYGVTVWELMTFGSKPYDGIPASEISSILEKGERLPQPPICTIDVYMIMVKCWMIDADSRPKFRELIIEFSKMARDPQRYLVIQGDERMHLPSPTDSNFYR

In [37]:
for ref in record.references:
    print("authors:", ref.authors)
    print("title:", ref.title)

authors: Ullrich A., Coussens L., Hayflick J.S., Dull T.J., Gray A., Tam A.W., Lee J., Yarden Y., Libermann T.A., Schlessinger J., Downward J., Mayes E.L.V., Whittle N., Waterfield M.D., Seeburg P.H.
title: Human epidermal growth factor receptor cDNA sequence and aberrant expression of the amplified gene in A431 epidermoid carcinoma cells.
authors: Ilekis J.V., Stark B.C., Scoccia B.
title: Possible role of variant RNA transcripts in the regulation of epidermal growth factor receptor expression in human placenta.
authors: Reiter J.L., Maihle N.J.
title: A 1.8 kb alternative transcript from the human epidermal growth factor receptor gene encodes a truncated form of the receptor.
authors: Ilekis J.V., Gariti J., Niederberger C., Scoccia B.
title: Expression of a truncated epidermal growth factor receptor-like protein (TEGFR) in ovarian cancer.
authors: Reiter J.L., Threadgill D.W., Eley G.D., Strunk K.E., Danielsen A.J., Schehl Sinclair C., Pearsall R.S., Green P.J., Yee D., Lampland A.L

In [38]:
record.organism_classification

['Eukaryota',
 'Metazoa',
 'Chordata',
 'Craniata',
 'Vertebrata',
 'Euteleostomi',
 'Mammalia',
 'Eutheria',
 'Euarchontoglires',
 'Primates',
 'Haplorrhini',
 'Catarrhini',
 'Hominidae',
 'Homo']

In [39]:
accessions = ["O23729", "O23730", "O23731"]
records = []

for accession in accessions:
    handle = ExPASy.get_sprot_raw(accession)
    record = SwissProt.read(handle)
    records.append(record)

Q1. `append` vs `extend`?

In [22]:
records

[<Bio.SwissProt.Record at 0x7efcf0245540>,
 <Bio.SwissProt.Record at 0x7efcf0310a00>,
 <Bio.SwissProt.Record at 0x7efcf0245ff0>]
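On the `append` vs `extend` question, a small sketch with plain lists shows the difference. The variable names are placeholders.

```python
appended = [1, 2]
appended.append([3, 4])   # the whole list becomes one new element
print(appended)           # [1, 2, [3, 4]]

extended = [1, 2]
extended.extend([3, 4])   # each item of the iterable is added separately
print(extended)           # [1, 2, 3, 4]
```

In the loop above each `record` is a single object, so `append` is the fitting choice.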

### Searching with UniProt


In [40]:
from Bio import UniProt
query = "(organism_id:2697049) AND (reviewed:true)"
results = list(UniProt.search(query))
print(results)

Output hidden; open in https://colab.research.google.com to view.

In [41]:
import pandas as pd
pd.DataFrame(results)

Unnamed: 0,entryType,primaryAccession,uniProtkbId,entryAudit,annotationScore,organism,organismHosts,proteinExistence,proteinDescription,comments,features,keywords,references,uniProtKBCrossReferences,sequence,extraAttributes,genes
0,UniProtKB reviewed (Swiss-Prot),P0DTC1,R1A_SARS2,"{'firstPublicDate': '2020-04-22', 'lastAnnotat...",5.0,{'scientificName': 'Severe acute respiratory s...,"[{'scientificName': 'Homo sapiens', 'commonNam...",1: Evidence at protein level,{'recommendedName': {'fullName': {'value': 'Re...,[{'texts': [{'evidences': [{'evidenceCode': 'E...,"[{'type': 'Chain', 'location': {'start': {'val...","[{'id': 'KW-0002', 'category': 'Technical term...","[{'referenceNumber': 1, 'citation': {'id': '32...","[{'database': 'EMBL', 'id': 'MN908947', 'prope...",{'value': 'MESLVPGFNEKTHVQLSLPVLQVRDVLVRGFGDSV...,"{'countByCommentType': {'FUNCTION': 11, 'CATAL...",
1,UniProtKB reviewed (Swiss-Prot),P0DTD1,R1AB_SARS2,"{'firstPublicDate': '2020-04-22', 'lastAnnotat...",5.0,{'scientificName': 'Severe acute respiratory s...,"[{'scientificName': 'Homo sapiens', 'commonNam...",1: Evidence at protein level,{'recommendedName': {'fullName': {'value': 'Re...,[{'texts': [{'evidences': [{'evidenceCode': 'E...,"[{'type': 'Chain', 'location': {'start': {'val...","[{'id': 'KW-0002', 'category': 'Technical term...","[{'referenceNumber': 1, 'citation': {'id': '32...","[{'database': 'EMBL', 'id': 'MN908947', 'prope...",{'value': 'MESLVPGFNEKTHVQLSLPVLQVRDVLVRGFGDSV...,"{'countByCommentType': {'FUNCTION': 16, 'CATAL...","[{'geneName': {'value': 'rep'}, 'orfNames': [{..."
2,UniProtKB reviewed (Swiss-Prot),P0DTC2,SPIKE_SARS2,"{'firstPublicDate': '2020-04-22', 'lastAnnotat...",5.0,{'scientificName': 'Severe acute respiratory s...,"[{'scientificName': 'Homo sapiens', 'commonNam...",1: Evidence at protein level,{'recommendedName': {'fullName': {'evidences':...,[{'texts': [{'evidences': [{'evidenceCode': 'E...,"[{'type': 'Signal', 'location': {'start': {'va...","[{'id': 'KW-0002', 'category': 'Technical term...","[{'referenceNumber': 1, 'citation': {'id': '32...","[{'database': 'EMBL', 'id': 'MN908947', 'prope...",{'value': 'MFVFLVLLPLVSSQCVNLTTRTQLPPAYTNSFTRG...,"{'countByCommentType': {'FUNCTION': 3, 'SUBUNI...",[{'geneName': {'evidences': [{'evidenceCode': ...
3,UniProtKB reviewed (Swiss-Prot),P0DTC7,NS7A_SARS2,"{'firstPublicDate': '2020-04-22', 'lastAnnotat...",5.0,{'scientificName': 'Severe acute respiratory s...,"[{'scientificName': 'Homo sapiens', 'commonNam...",1: Evidence at protein level,{'recommendedName': {'fullName': {'value': 'OR...,[{'texts': [{'evidences': [{'evidenceCode': 'E...,"[{'type': 'Signal', 'location': {'start': {'va...","[{'id': 'KW-0002', 'category': 'Technical term...","[{'referenceNumber': 1, 'citation': {'id': '32...","[{'database': 'EMBL', 'id': 'MN908947', 'prope...",{'value': 'MKIILFLALITLATCELYHYQECVRGTTVLLKEPC...,"{'countByCommentType': {'FUNCTION': 1, 'SUBUNI...",[{'orfNames': [{'value': '7a'}]}]
4,UniProtKB reviewed (Swiss-Prot),P0DTC4,VEMP_SARS2,"{'firstPublicDate': '2020-04-22', 'lastAnnotat...",5.0,{'scientificName': 'Severe acute respiratory s...,"[{'scientificName': 'Homo sapiens', 'commonNam...",1: Evidence at protein level,{'recommendedName': {'fullName': {'evidences':...,[{'texts': [{'evidences': [{'evidenceCode': 'E...,"[{'type': 'Chain', 'location': {'start': {'val...","[{'id': 'KW-0002', 'category': 'Technical term...","[{'referenceNumber': 1, 'citation': {'id': '32...","[{'database': 'EMBL', 'id': 'MN908947', 'prope...",{'value': 'MYSFVSEETGTLIVNSVLLFLAFVVFLLVTLAILT...,"{'countByCommentType': {'FUNCTION': 1, 'SUBUNI...",[{'geneName': {'evidences': [{'evidenceCode': ...
5,UniProtKB reviewed (Swiss-Prot),P0DTC8,NS8_SARS2,"{'firstPublicDate': '2020-04-22', 'lastAnnotat...",5.0,{'scientificName': 'Severe acute respiratory s...,"[{'scientificName': 'Homo sapiens', 'commonNam...",1: Evidence at protein level,{'recommendedName': {'fullName': {'value': 'OR...,[{'texts': [{'evidences': [{'evidenceCode': 'E...,"[{'type': 'Signal', 'location': {'start': {'va...","[{'id': 'KW-0002', 'category': 'Technical term...","[{'referenceNumber': 1, 'citation': {'id': '32...","[{'database': 'EMBL', 'id': 'MN908947', 'prope...",{'value': 'MKFLVFLGIITTVAAFHQECSLQSCTQHQPYVVDD...,"{'countByCommentType': {'FUNCTION': 1, 'SUBUNI...",[{'orfNames': [{'value': '8'}]}]
6,UniProtKB reviewed (Swiss-Prot),P0DTD2,ORF9B_SARS2,"{'firstPublicDate': '2020-04-22', 'lastAnnotat...",5.0,{'scientificName': 'Severe acute respiratory s...,"[{'scientificName': 'Homo sapiens', 'commonNam...",1: Evidence at protein level,{'recommendedName': {'fullName': {'value': 'OR...,[{'texts': [{'evidences': [{'evidenceCode': 'E...,"[{'type': 'Chain', 'location': {'start': {'val...","[{'id': 'KW-0002', 'category': 'Technical term...","[{'referenceNumber': 1, 'citation': {'id': '32...","[{'database': 'EMBL', 'id': 'MN908947', 'prope...",{'value': 'MDPKISEMHPALRLVDPQIQLAVTRMENAVGRDQN...,"{'countByCommentType': {'FUNCTION': 1, 'SUBUNI...",[{'orfNames': [{'value': '9b'}]}]
7,UniProtKB reviewed (Swiss-Prot),P0DTC3,AP3A_SARS2,"{'firstPublicDate': '2020-04-22', 'lastAnnotat...",5.0,{'scientificName': 'Severe acute respiratory s...,"[{'scientificName': 'Homo sapiens', 'commonNam...",1: Evidence at protein level,{'recommendedName': {'fullName': {'value': 'OR...,[{'texts': [{'evidences': [{'evidenceCode': 'E...,"[{'type': 'Chain', 'location': {'start': {'val...","[{'id': 'KW-0002', 'category': 'Technical term...","[{'referenceNumber': 1, 'citation': {'id': '32...","[{'database': 'EMBL', 'id': 'MN908947', 'prope...",{'value': 'MDLFMRIFTIGTVTLKQGEIKDATPSDFVRATATI...,"{'countByCommentType': {'FUNCTION': 1, 'SUBUNI...",[{'orfNames': [{'value': '3a'}]}]
8,UniProtKB reviewed (Swiss-Prot),P0DTC5,VME1_SARS2,"{'firstPublicDate': '2020-04-22', 'lastAnnotat...",5.0,{'scientificName': 'Severe acute respiratory s...,"[{'scientificName': 'Homo sapiens', 'commonNam...",1: Evidence at protein level,{'recommendedName': {'fullName': {'evidences':...,[{'texts': [{'evidences': [{'evidenceCode': 'E...,"[{'type': 'Chain', 'location': {'start': {'val...","[{'id': 'KW-0002', 'category': 'Technical term...","[{'referenceNumber': 1, 'citation': {'id': '32...","[{'database': 'EMBL', 'id': 'MN908947', 'prope...",{'value': 'MADSNGTITVEELKKLLEQWNLVIGFLFLTWICLL...,"{'countByCommentType': {'FUNCTION': 1, 'SUBUNI...",[{'orfNames': [{'value': 'M'}]}]
9,UniProtKB reviewed (Swiss-Prot),P0DTC6,NS6_SARS2,"{'firstPublicDate': '2020-04-22', 'lastAnnotat...",5.0,{'scientificName': 'Severe acute respiratory s...,"[{'scientificName': 'Homo sapiens', 'commonNam...",1: Evidence at protein level,{'recommendedName': {'fullName': {'value': 'OR...,[{'texts': [{'evidences': [{'evidenceCode': 'E...,"[{'type': 'Chain', 'location': {'start': {'val...","[{'id': 'KW-0002', 'category': 'Technical term...","[{'referenceNumber': 1, 'citation': {'id': '32...","[{'database': 'EMBL', 'id': 'MN908947', 'prope...",{'value': 'MFHLVDFQVTIAEILLIIMRTFKVSIWNLDYIINL...,"{'countByCommentType': {'FUNCTION': 2, 'SUBUNI...",[{'orfNames': [{'value': '6'}]}]


Let’s try a search that returns more results. At the time of writing, there are 5,147 results for the query “Insulin AND (reviewed:true)”; the exact count changes as UniProt is updated. We can use slicing to get a list of the first 50 results.

In [42]:
from Bio import UniProt

query = "Insulin AND (reviewed:true)"
results = UniProt.search(query, batch_size=50)[:50]

In [43]:
pd.DataFrame(results)

Unnamed: 0,entryType,primaryAccession,secondaryAccessions,uniProtkbId,entryAudit,annotationScore,organism,proteinExistence,proteinDescription,genes,comments,features,keywords,references,uniProtKBCrossReferences,sequence,extraAttributes
0,UniProtKB reviewed (Swiss-Prot),P01308,[Q5EEX2],INS_HUMAN,"{'firstPublicDate': '1986-07-21', 'lastAnnotat...",5.0,"{'scientificName': 'Homo sapiens', 'commonName...",1: Evidence at protein level,{'recommendedName': {'fullName': {'value': 'In...,[{'geneName': {'value': 'INS'}}],[{'texts': [{'value': 'Insulin decreases blood...,"[{'type': 'Signal', 'location': {'start': {'va...","[{'id': 'KW-0002', 'category': 'Technical term...","[{'referenceNumber': 1, 'citation': {'id': '62...","[{'database': 'EMBL', 'id': 'V00565', 'propert...",{'value': 'MALWMRLLPLLALLALWGPDPAAAFVNQHLCGSHL...,"{'countByCommentType': {'FUNCTION': 1, 'SUBUNI..."
1,UniProtKB reviewed (Swiss-Prot),P06213,"[Q17RW0, Q59H98, Q9UCB7, Q9UCB8, Q9UCB9]",INSR_HUMAN,"{'firstPublicDate': '1988-01-01', 'lastAnnotat...",5.0,"{'scientificName': 'Homo sapiens', 'commonName...",1: Evidence at protein level,{'recommendedName': {'fullName': {'value': 'In...,[{'geneName': {'value': 'INSR'}}],[{'texts': [{'evidences': [{'evidenceCode': 'E...,"[{'type': 'Signal', 'location': {'start': {'va...","[{'id': 'KW-0002', 'category': 'Technical term...","[{'referenceNumber': 1, 'citation': {'id': '28...","[{'database': 'EMBL', 'id': 'M10051', 'propert...",{'value': 'MATGGRRGAAAAPLLVAVAALLLGAAGHLYPGEVC...,"{'countByCommentType': {'FUNCTION': 1, 'CATALY..."
2,UniProtKB reviewed (Swiss-Prot),P14735,"[B2R721, B7ZAU2, D3DR35, Q5T5N2]",IDE_HUMAN,"{'firstPublicDate': '1990-04-01', 'lastAnnotat...",5.0,"{'scientificName': 'Homo sapiens', 'commonName...",1: Evidence at protein level,{'recommendedName': {'fullName': {'evidences':...,[{'geneName': {'evidences': [{'evidenceCode': ...,[{'texts': [{'evidences': [{'evidenceCode': 'E...,"[{'type': 'Chain', 'location': {'start': {'val...","[{'id': 'KW-0002', 'category': 'Technical term...","[{'referenceNumber': 1, 'citation': {'id': '30...","[{'database': 'EMBL', 'id': 'M21188', 'propert...",{'value': 'MRYRLAWLLHPALPSTFRSVLGARLPPPERLCGFQ...,"{'countByCommentType': {'FUNCTION': 2, 'CATALY..."
3,UniProtKB reviewed (Swiss-Prot),P01317,,INS_BOVIN,"{'firstPublicDate': '1986-07-21', 'lastAnnotat...",5.0,"{'scientificName': 'Bos taurus', 'commonName':...",1: Evidence at protein level,{'recommendedName': {'fullName': {'value': 'In...,[{'geneName': {'value': 'INS'}}],[{'texts': [{'value': 'Insulin decreases blood...,"[{'type': 'Signal', 'location': {'start': {'va...","[{'id': 'KW-0002', 'category': 'Technical term...","[{'referenceNumber': 1, 'citation': {'id': '24...","[{'database': 'EMBL', 'id': 'M54979', 'propert...",{'value': 'MALWTRLRPLLALLALWPPPPARAFVNQHLCGSHL...,"{'countByCommentType': {'FUNCTION': 1, 'SUBUNI..."
4,UniProtKB reviewed (Swiss-Prot),P67970,"[P01332, Q53YX4]",INS_CHICK,"{'firstPublicDate': '1986-07-21', 'lastAnnotat...",5.0,"{'scientificName': 'Gallus gallus', 'commonNam...",1: Evidence at protein level,{'recommendedName': {'fullName': {'value': 'In...,[{'geneName': {'value': 'INS'}}],[{'texts': [{'value': 'Insulin decreases blood...,"[{'type': 'Signal', 'location': {'start': {'va...","[{'id': 'KW-0119', 'category': 'Biological pro...","[{'referenceNumber': 1, 'citation': {'id': '73...","[{'database': 'EMBL', 'id': 'V00416', 'propert...",{'value': 'MALWIRSLPLLALLVFSGPGTSYAAANQHLCGSHL...,"{'countByCommentType': {'FUNCTION': 1, 'SUBUNI..."
5,UniProtKB reviewed (Swiss-Prot),P01329,,INS_CAVPO,"{'firstPublicDate': '1986-07-21', 'lastAnnotat...",5.0,"{'scientificName': 'Cavia porcellus', 'commonN...",1: Evidence at protein level,{'recommendedName': {'fullName': {'value': 'In...,[{'geneName': {'value': 'INS'}}],[{'texts': [{'value': 'Insulin decreases blood...,"[{'type': 'Signal', 'location': {'start': {'va...","[{'id': 'KW-0119', 'category': 'Biological pro...","[{'referenceNumber': 1, 'citation': {'id': '38...","[{'database': 'EMBL', 'id': 'K02233', 'propert...",{'value': 'MALWMHLLTVLALLALWGPNTGQAFVSRHLCGSNL...,"{'countByCommentType': {'FUNCTION': 1, 'SUBUNI..."
6,UniProtKB reviewed (Swiss-Prot),P17715,,INS_OCTDE,"{'firstPublicDate': '1990-08-01', 'lastAnnotat...",5.0,"{'scientificName': 'Octodon degus', 'commonNam...",1: Evidence at protein level,{'recommendedName': {'fullName': {'value': 'In...,[{'geneName': {'value': 'INS'}}],[{'texts': [{'value': 'Insulin decreases blood...,"[{'type': 'Signal', 'location': {'start': {'va...","[{'id': 'KW-0119', 'category': 'Biological pro...","[{'referenceNumber': 1, 'citation': {'id': '22...","[{'database': 'EMBL', 'id': 'M57671', 'propert...",{'value': 'MAPWMHLLTVLALLALWGPNSVQAYSSQHLCGSNL...,"{'countByCommentType': {'FUNCTION': 1, 'SUBUNI..."
7,UniProtKB reviewed (Swiss-Prot),P01315,[Q9TSJ5],INS_PIG,"{'firstPublicDate': '1986-07-21', 'lastAnnotat...",5.0,"{'scientificName': 'Sus scrofa', 'commonName':...",1: Evidence at protein level,{'recommendedName': {'fullName': {'value': 'In...,[{'geneName': {'value': 'INS'}}],[{'texts': [{'value': 'Insulin decreases blood...,"[{'type': 'Signal', 'location': {'start': {'va...","[{'id': 'KW-0002', 'category': 'Technical term...","[{'referenceNumber': 1, 'citation': {'id': 'CI...","[{'database': 'EMBL', 'id': 'AF064555', 'prope...",{'value': 'MALWTRLLPLLALLALWAPAPAQAFVNQHLCGSHL...,"{'countByCommentType': {'FUNCTION': 1, 'SUBUNI..."
8,UniProtKB reviewed (Swiss-Prot),Q91XI3,,INS_ICTTR,"{'firstPublicDate': '2003-03-28', 'lastAnnotat...",5.0,{'scientificName': 'Ictidomys tridecemlineatus...,3: Inferred from homology,{'recommendedName': {'fullName': {'value': 'In...,[{'geneName': {'value': 'INS'}}],[{'texts': [{'value': 'Insulin decreases blood...,"[{'type': 'Signal', 'location': {'start': {'va...","[{'id': 'KW-0119', 'category': 'Biological pro...","[{'referenceNumber': 1, 'citation': {'id': 'CI...","[{'database': 'EMBL', 'id': 'AY038604', 'prope...",{'value': 'MALWTRLLPLLALLALLGPDPAQAFVNQHLCGSHL...,"{'countByCommentType': {'FUNCTION': 1, 'SUBUNI..."
9,UniProtKB reviewed (Swiss-Prot),Q9Y5Q6,"[Q3MIY4, Q5VYD8]",INSL5_HUMAN,"{'firstPublicDate': '2000-05-30', 'lastAnnotat...",5.0,"{'scientificName': 'Homo sapiens', 'commonName...",1: Evidence at protein level,{'recommendedName': {'fullName': {'value': 'In...,"[{'geneName': {'value': 'INSL5'}, 'orfNames': ...",[{'texts': [{'value': 'May have a role in gut ...,"[{'type': 'Signal', 'location': {'start': {'va...","[{'id': 'KW-0002', 'category': 'Technical term...","[{'referenceNumber': 1, 'citation': {'id': '10...","[{'database': 'EMBL', 'id': 'AF133816', 'prope...",{'value': 'MKGSIFTLFLFSVLFAISEVRSKESVRLCGLEYIR...,"{'countByCommentType': {'FUNCTION': 1, 'SUBUNI..."
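The slice `[:50]` works because the object returned by `UniProt.search` supports slicing. For an ordinary Python generator, which does not, `itertools.islice` gives the same "first N items" behaviour without consuming the rest. The sketch below uses `fake_search`, a made-up stand-in that yields placeholder labels, so it runs offline.

```python
from itertools import islice


def fake_search(total):
    """Stand-in for a lazy result iterator; yields made-up labels."""
    for i in range(total):
        yield f"RESULT_{i:04d}"


# Take only the first five items; the remaining ones are never generated.
first_five = list(islice(fake_search(5140), 5))
print(first_five)
```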


In [44]:
from Bio import UniProt

query = "Insulin AND (reviewed:true)"
result_iterator = UniProt.search(query, batch_size=0)
len(result_iterator)

5140

Q1. Plot insulin sequence lengths for Homo sapiens, Mus musculus, Rattus norvegicus, Drosophila melanogaster and Bos taurus.
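As a possible starting point for the plotting question, each search result is a nested dictionary, so the species name and the sequence have to be pulled out of the inner levels first. The sketch below does this for two hand-written records that use only keys visible in the DataFrame output above. The sequences are the shortened previews from that output, not the full proteins, and `summarize` is a hypothetical helper.

```python
# Two records trimmed to the keys visible in the DataFrame output above.
# The sequences are shortened placeholders, not the real full-length proteins.
sample_results = [
    {
        "primaryAccession": "P01308",
        "uniProtkbId": "INS_HUMAN",
        "organism": {"scientificName": "Homo sapiens"},
        "sequence": {"value": "MALWMRLLPLLALLALWGPDPAAAFVNQHLCGSHL"},
    },
    {
        "primaryAccession": "P01317",
        "uniProtkbId": "INS_BOVIN",
        "organism": {"scientificName": "Bos taurus"},
        "sequence": {"value": "MALWTRLRPLLALLALWPPPPARAFVNQHLCGSHL"},
    },
]


def summarize(result):
    """Flatten one nested result dict into a small row dict."""
    return {
        "accession": result["primaryAccession"],
        "entry": result["uniProtkbId"],
        "species": result["organism"]["scientificName"],
        "length": len(result["sequence"]["value"]),
    }


rows = [summarize(r) for r in sample_results]
for row in rows:
    print(row)
```

With real search results, `rows` could be passed to `pd.DataFrame(rows)` and filtered by `species` before plotting the `length` column.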



https://biopython.org/docs/dev/Tutorial/chapter_uniprot.html

https://biopython.org/docs/dev/api/Bio.UniProt.html


## 4. PDB Example


In [45]:
from Bio.PDB import PDBList
pdbl = PDBList()
pdbl.retrieve_pdb_file('1A8M')

Structure exists: '/content/a8/1a8m.cif' 




'/content/a8/1a8m.cif'

In [46]:
from Bio.PDB import PDBList
pdbl = PDBList()
pdbl.retrieve_pdb_file('1A8M', file_format='pdb', pdir='./')

Structure exists: './pdb1a8m.ent' 


'./pdb1a8m.ent'

In [47]:
from Bio.PDB import PDBParser
parser = PDBParser()
structure = parser.get_structure('1A8M', 'pdb1a8m.ent')
for model in structure:
    for chain in model:
        print(f"Chain ID: {chain.id}")

Chain ID: A
Chain ID: B
Chain ID: C




<Structure id=1A8M>

## 5. KEGG


In [31]:
from Bio.KEGG import REST

# Example: Get a KEGG pathway by KEGG ID
pathway_data = REST.kegg_get("hsa04010").read()
print(pathway_data)

ENTRY       hsa04010                    Pathway
NAME        MAPK signaling pathway - Homo sapiens (human)
DESCRIPTION The mitogen-activated protein kinase (MAPK) cascade is a highly conserved module that is involved in various cellular functions, including cell proliferation, differentiation and migration. Mammals express at least four distinctly regulated groups of MAPKs, extracellular signal-related kinases (ERK)-1/2, Jun amino-terminal kinases (JNK1/2/3), p38 proteins (p38alpha/beta/gamma/delta) and ERK5, that are activated by specific MAPKKs: MEK1/2 for ERK1/2, MKK3/6 for the p38, MKK4/7 (JNKK1/2) for the JNKs, and MEK5 for ERK5. Each MAPKK, however, can be activated by more than one MAPKKK, increasing the complexity and diversity of MAPK signalling. Presumably each MAPKKK confers responsiveness to distinct stimuli. For example, activation of ERK1/2 by growth factors depends on the MAPKKK c-Raf, but other MAPKKKs may activate ERK1/2 in response to pro-inflammatory stimuli.
CLASS   

In [49]:
# Example: Get genes associated with a KEGG disease
disease_data = REST.kegg_link("hsa", "ds:H00099").read()
print(disease_data)

ds:H00099	hsa:3689
ds:H00099	hsa:55343
ds:H00099	hsa:83706
ds:H00099	hsa:5880
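The `kegg_link` response is plain text with one tab-separated pair per line. A minimal sketch for turning it into a list of gene IDs is shown below. `link_text` is copied from the output above so the example runs offline; with a live call you would use `disease_data` instead.

```python
# Text copied from the kegg_link output above (tab-separated pairs).
link_text = (
    "ds:H00099\thsa:3689\n"
    "ds:H00099\thsa:55343\n"
    "ds:H00099\thsa:83706\n"
    "ds:H00099\thsa:5880\n"
)

gene_ids = []
for line in link_text.splitlines():
    if not line.strip():
        continue  # skip blank lines
    source, target = line.split("\t")
    gene_ids.append(target.split(":", 1)[1])  # drop the "hsa:" prefix

print(gene_ids)
```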



https://biopython-tutorial.readthedocs.io/en/latest/notebooks/18%20-%20KEGG.html#

# Practice Questions
## GenBank Practice Questions

1. Fetch a nucleotide sequence by accession number "NM_000546" from GenBank and save it to a FASTA file.
2. Search for nucleotide records related to "insulin gene" in humans using the Entrez module, and print out the first 10 accession numbers.
3. Retrieve a full GenBank record for the nucleotide sequence with accession number "NC_001807" and extract the source organism from the metadata.
4. Download the CDS (coding sequence) of the BRCA2 gene from GenBank and save it to a text file.
5. Count how many nucleotide records in GenBank match the keyword “coronavirus” within the last 5 years.

## UniProt Practice Questions
1. Fetch a protein record by UniProt accession "P12345" and print the organism name and function.
2. Download the protein sequence of Hemoglobin subunit alpha (accession P69905) from UniProt and save it in FASTA format.
3. Perform a keyword search on UniProt for “kinase” and retrieve the first 5 protein accessions.
4. List the domains and active sites of the protein with UniProt accession "P51587".
5. Download functional annotations for a given protein using the UniProt database and summarize them.


## PDB Practice Questions
1. Download the structure file for the protein "1A8M" from the Protein Data Bank and print the resolution of the structure.
2. Parse the downloaded PDB file for the protein "1CRN" and print the number of chains in the structure.
3. Retrieve and print the sequence of the protein chains from a downloaded PDB file.
4. List all the ligands present in the protein structure with PDB ID "2GC4".
5. Download and analyze a structure file, and print all amino acid residues in chain A of protein "3GFT".


## NCBI Entrez Practice Questions
1. Retrieve article metadata for 5 PubMed articles related to "CRISPR gene editing" and print their titles and authors.
2. Search the NCBI Taxonomy database for “Saccharomyces cerevisiae” and retrieve its taxonomic classification.
3. Fetch a protein sequence from the NCBI Protein database using accession number "NP_001005353" and print its length and amino acid sequence.
4. Perform an advanced PubMed search for articles published in 2023 related to “COVID-19 vaccine” and retrieve their PubMed IDs.
5. Retrieve and display the taxonomy ID of "Homo sapiens" and "Mus musculus" using Entrez.