# <center>Working with NCBI</center>

## Libraries

In [1]:
from Bio import Entrez
Entrez.email = "isemenov.bioinfo1998@gmail.com"

## Some basic commands

`esearch`: searches an NCBI database and returns the UIDs of the records that match the query

### Find PubMed articles related to cytomegalovirus

In [2]:
handle = Entrez.esearch(db='pubmed', term='cytomegalovirus')
record = Entrez.read(handle)
print(record)

{'Count': '53107', 'RetMax': '20', 'RetStart': '0', 'IdList': ['35354065', '35353226', '35352632', '35352508', '35351972', '35349757', '35348580', '35348198', '35347236', '35346077', '35345706', '35344659', '35344424', '35344176', '35343807', '35343785', '35343771', '35343620', '35342821', '35342158'], 'TranslationSet': [{'From': 'cytomegalovirus', 'To': '"cytomegalovirus"[MeSH Terms] OR "cytomegalovirus"[All Fields]'}], 'TranslationStack': [{'Term': '"cytomegalovirus"[MeSH Terms]', 'Field': 'MeSH Terms', 'Count': '22284', 'Explode': 'Y'}, {'Term': '"cytomegalovirus"[All Fields]', 'Field': 'All Fields', 'Count': '53110', 'Explode': 'N'}, 'OR', 'GROUP'], 'QueryTranslation': '"cytomegalovirus"[MeSH Terms] OR "cytomegalovirus"[All Fields]'}


By default `RetMax` is 20, so only the first 20 of the 53107 matching articles are returned.
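To go beyond the first 20 hits, `esearch` accepts the E-utilities parameters `retstart` and `retmax`, which Biopython passes through as keyword arguments. The sketch below shows one possible way to page through a result set. `batch_windows` and `fetch_ids` are hypothetical helper names, the batch size and the cap of 1000 IDs are arbitrary choices, and `Entrez.email` is assumed to be set as in the cell above.

```python
def batch_windows(total, batch_size):
    """Yield (retstart, retmax) pairs that together cover `total` hits."""
    for start in range(0, total, batch_size):
        yield start, min(batch_size, total - start)


def fetch_ids(term, db="pubmed", batch_size=200, limit=1000):
    """Collect up to `limit` UIDs for `term` (hypothetical helper, needs network access)."""
    from Bio import Entrez  # imported here so batch_windows stays usable offline

    # The first request only asks for the total number of hits.
    handle = Entrez.esearch(db=db, term=term, retmax=0)
    total = int(Entrez.read(handle)["Count"])
    handle.close()

    ids = []
    for start, size in batch_windows(min(total, limit), batch_size):
        handle = Entrez.esearch(db=db, term=term, retstart=start, retmax=size)
        ids.extend(Entrez.read(handle)["IdList"])
        handle.close()
    return ids


print(list(batch_windows(45, 20)))  # [(0, 20), (20, 20), (40, 5)]
```

Calling `fetch_ids("cytomegalovirus")` would then return up to 1000 PubMed IDs instead of 20.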

PubMed IDs of the retrieved articles on cytomegalovirus:

In [3]:
print(*record['IdList'], sep='\n')

35354065
35353226
35352632
35352508
35351972
35349757
35348580
35348198
35347236
35346077
35345706
35344659
35344424
35344176
35343807
35343785
35343771
35343620
35342821
35342158


### Retrieve abstracts from articles found

`efetch`: retrieves the records themselves; with `rettype="abstract"` and `retmode="text"` it returns the abstracts as plain text

Here we fetch the abstracts of the first two articles

In [4]:
mshandle = Entrez.efetch(db="pubmed", id=record["IdList"][0:2], rettype="abstract", retmode="text")
print(mshandle.read())


1. Antiviral Res. 2022 Mar 27:105299. doi: 10.1016/j.antiviral.2022.105299. [Epub
ahead of print]

Reliable quantification of Cytomegalovirus DNAemia in Letermovir treated
patients.

Weinberger S(1), Steininger C(2).

Author information: 
(1)Division of Infectious Diseases and Tropical Medicine, Department of Medicine 
I, Medical University of Vienna, Vienna, Austria.
(2)Division of Infectious Diseases and Tropical Medicine, Department of Medicine 
I, Medical University of Vienna, Vienna, Austria. Electronic address:
christoph.steininger@meduniwien.ac.at.

Polymerase chain reaction (PCR) based methods are a fast and sensitive approach
to detect and monitor viral load in Cytomegalovirus (CMV) patients. Letermovir
(LMV) acts at a late stage during the CMV replication cycle and does not inhibit 
CMV DNA replication per se. Therefore, quantitative nucleic acid amplification
testing might lead to the overestimation of viral load in patients treated with
LMV and underestimate treatment succ

### Retrieve nucleotide sequences for rhodopsin from Homo sapiens

`esearch` returns results in XML format

In [5]:
handle = Entrez.esearch(db = "nucleotide", term = "rhodopsin AND Homo sapiens[orgn]")
record = Entrez.read(handle)
print(record['IdList'])
Entrez.efetch(db = "nucleotide", id = record["IdList"])

['1675034719', '1653962141', '1523964325', '1520687782', '1519473570', '1519315165', '1519313919', '1519244045', '1062594141', '1236508655', '219281886', '1783531726', '1677539460', '1677530135', '2075919630', '2068205929', '1519312296', '1918095546', '1654220746', '1654220743']


<_io.TextIOWrapper encoding='UTF-8'>

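We append the `[orgn]` field tag to the species name to restrict the search to sequences whose source organism is *H. sapiens* (and not other organisms associated with it, e.g. microorganisms inhabiting it).

Note that `efetch` returns a handle, as the last output line shows; call `.read()` on it to obtain the records themselves.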
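Search terms like the one above are plain strings, so they can be assembled programmatically. The following is a minimal sketch, assuming only the `[orgn]` field tag used in this notebook; `build_term` is a hypothetical helper, not part of Biopython.

```python
def build_term(keyword, organism=None):
    """Join a keyword and an optional organism restriction with AND."""
    clauses = [keyword]
    if organism:
        # [orgn] limits the hits to records from this source organism
        clauses.append(f"{organism}[orgn]")
    return " AND ".join(clauses)


print(build_term("rhodopsin", organism="Homo sapiens"))
# rhodopsin AND Homo sapiens[orgn]
```

The resulting string can be passed directly as the `term` argument of `Entrez.esearch`.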

### Find the taxonomy ID of an organism (Mus musculus)

In [6]:
handle = Entrez.esearch(db = "taxonomy", term = "Mus musculus")
record = Entrez.read(handle)
print(record)

{'Count': '1', 'RetMax': '1', 'RetStart': '0', 'IdList': ['10090'], 'TranslationSet': [], 'TranslationStack': [{'Term': 'mus musculus[All Names]', 'Field': 'All Names', 'Count': '1', 'Explode': 'N'}, 'GROUP'], 'QueryTranslation': 'mus musculus[All Names]'}


In [7]:
print('Mus Musculus ID:', record['IdList'][0])

Mus Musculus ID: 10090


### Retrieve protein records for rhodopsin from H. sapiens and return results in a table (not XML)

`esearch` + `esummary`: searches a database by gene name and then writes a table with the UID (`Id` in the parsed XML), the accession number (`Caption`) and the sequence length (`Length`).

We take the UIDs returned by the search (the first 20 by default) and iterate over them to obtain a basic description of each record

In [8]:
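import os

handle = Entrez.esearch(db="protein", term="rhodopsin AND Homo sapiens[orgn]")
record = Entrez.read(handle)
os.makedirs("results", exist_ok=True)  # make sure the output directory exists
with open("results/id_capt_len_py.txt", "w") as ouf:
    for rec in record["IdList"]:
        # Despite its name, temphandle holds the parsed summary (a one-element list), not a handle.
        # Entrez.read expects XML, which is esummary's default output.
        temphandle = Entrez.read(Entrez.esummary(db="protein", id=rec))
        ouf.write(f"{temphandle[0]['Id']}\t{temphandle[0]['Caption']}\t{temphandle[0]['Length']}\n")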

In [9]:
temphandle

[{'Item': [], 'Id': '166362740', 'Caption': 'NP_001983', 'Title': 'proteinase-activated receptor 1 isoform 1 precursor [Homo sapiens]', 'Extra': 'gi|166362740|ref|NP_001983.2|[166362740]', 'Gi': IntegerElement(166362740, attributes={}), 'CreateDate': '1999/03/19', 'UpdateDate': '2022/03/12', 'Flags': IntegerElement(512, attributes={}), 'TaxId': IntegerElement(9606, attributes={}), 'Length': IntegerElement(425, attributes={}), 'Status': 'live', 'ReplacedBy': '', 'Comment': '  ', 'AccessionVersion': 'NP_001983.2'}]

In [10]:
! cat results/id_capt_len_py.txt

523704349	NP_001265723	375
1043261393	NP_001316355	74
1002819403	NP_001307687	439
539848521	NP_064445	364
523704347	NP_005963	375
222080049	NP_000858	481
88758606	NP_001034674	297
52317263	NP_001004723	307
11386179	NP_068774	74
485837026	NP_057726	3674
282403488	NP_997253	609
153945858	NP_002089	200
116008178	NP_000532	405
110556629	NP_775904	493
38505194	NP_000946	402
31083344	NP_064707	362
23312386	NP_006230	753
7662494	NP_054798	416
1062594142	NP_001318005	374
166362740	NP_001983	425
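The loop above sends one `esummary` request per UID. As an alternative sketch, the summaries could be requested in a single call by passing the UIDs as one comma-separated string and then formatted row by row. `summary_row` and `write_summary_table` are hypothetical names; the fields `Id`, `Caption` and `Length` are the ones visible in the parsed record shown above.

```python
def summary_row(summary):
    """Format one parsed esummary record as a tab-separated line."""
    return f"{summary['Id']}\t{summary['Caption']}\t{int(summary['Length'])}\n"


def write_summary_table(ids, path, db="protein"):
    """Fetch all summaries in one request and write them to `path` (needs network access)."""
    from Bio import Entrez  # imported here so summary_row stays usable offline

    handle = Entrez.esummary(db=db, id=",".join(ids))
    summaries = Entrez.read(handle)
    handle.close()
    with open(path, "w") as ouf:
        ouf.writelines(summary_row(s) for s in summaries)


# Values copied from the record printed in the notebook.
example = {"Id": "166362740", "Caption": "NP_001983", "Length": 425}
print(summary_row(example), end="")
```

With the search result from above, `write_summary_table(record["IdList"], "results/id_capt_len_py.txt")` would be expected to produce the same kind of table with a single request.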


### Retrieve protein sequences in FASTA format for the requested gene

Return the sequences as plain text

In [11]:
handle = Entrez.esearch(db="protein", term="rhodopsin AND Homo sapiens[orgn]")
record = Entrez.read(handle)
print(Entrez.efetch(db="protein", id=record["IdList"], retmode="text", rettype="fasta").read())

>NP_001265723.1 neuropeptide Y receptor type 4 [Homo sapiens]
MNTSHLLALLLPKSPQGENRSKPLGTPYNFSEHCQDSVDVMVFIVTSYSIETVVGVLGNLCLMCVTVRQK
EKANVTNLLIANLAFSDFLMCLLCQPLTAVYTIMDYWIFGETLCKMSAFIQCMSVTVSILSLVLVALERH
QLIINPTGWKPSISQAYLGIVLIWVIACVLSLPFLANSILENVFHKNHSKALEFLADKVVCTESWPLAHH
RTIYTTFLLLFQYCLPLGFILVCYARIYRCLQRQGRVFHKGTYSLRAGHMKQVNVVLVVMVVAFAVLWLP
LHVFNSLEDWHHEAIPICHGNLIFLVCHLLAMASTCVNPFIYGFLNTNFKKEIKALVLTCQQSAPLEESE
HLPLSTVHTEVSKGSLRLSGRSNPI

>NP_001316355.1 guanine nucleotide-binding protein G(T) subunit gamma-T1 precursor [Homo sapiens]
MPVINIEDLTEKDKLKMEVDQLKKEVTLERMLVSKCCEEVRDYVEERSGEDPLVKGIPEDKNPFKELKGG
CVIS

>NP_001307687.1 5-hydroxytryptamine receptor 2B isoform 2 [Homo sapiens]
MKQIVEEQGNKLHWAALLILMVIIPTIGGNTLVILAVSLEKKLQYATNYFLMSLAVADLLVGLFVMPIAL
LTIMFEAMWPLPLVLCPAWLFLDVLFSTASIMHLCAISVDRYIAIKKPIQANQYNSRATAFIKITVVWLI
SIGIAIPVPIKGIETDVDNPNNITCVLTKERFGDFMLFGSLAAFFTPLAIMIVTYFLTIHALQKKAYLVK
NKPPQRLTWLTVSTVFQRDETPCSSPEKVAMLDGSRKDKALPNSGDETLMRRTSTIGKKSVQTISNEQRA
SKVLGIVFFLFLLMWCPFFITNIT

Save the downloaded sequences to a FASTA file

In [12]:
with open("results/rhodopsin.fasta", "w") as ouf:
    for rec in record["IdList"]:
        lne = Entrez.efetch(db="protein", id=rec, retmode="text", rettype="fasta").read()
        ouf.write(lne)
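Fetching one record per request becomes slow for long ID lists, and NCBI's usage guidelines limit how often a client may send requests. One possible compromise, sketched below, is to fetch the IDs in small batches. `chunked` and `fetch_fasta` are hypothetical helpers and the chunk size of 10 is an arbitrary assumption.

```python
def chunked(items, size):
    """Split a list into consecutive sublists of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def fetch_fasta(ids, db="protein", chunk_size=10):
    """Download FASTA text for `ids`, one request per chunk (needs network access)."""
    from Bio import Entrez  # imported here so chunked stays usable offline

    parts = []
    for chunk in chunked(ids, chunk_size):
        handle = Entrez.efetch(db=db, id=",".join(chunk), rettype="fasta", retmode="text")
        parts.append(handle.read())
        handle.close()
    return "".join(parts)


print(chunked(["1", "2", "3", "4", "5"], 2))  # [['1', '2'], ['3', '4'], ['5']]
```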

In [13]:
! head -10 results/rhodopsin.fasta

>NP_001265723.1 neuropeptide Y receptor type 4 [Homo sapiens]
MNTSHLLALLLPKSPQGENRSKPLGTPYNFSEHCQDSVDVMVFIVTSYSIETVVGVLGNLCLMCVTVRQK
EKANVTNLLIANLAFSDFLMCLLCQPLTAVYTIMDYWIFGETLCKMSAFIQCMSVTVSILSLVLVALERH
QLIINPTGWKPSISQAYLGIVLIWVIACVLSLPFLANSILENVFHKNHSKALEFLADKVVCTESWPLAHH
RTIYTTFLLLFQYCLPLGFILVCYARIYRCLQRQGRVFHKGTYSLRAGHMKQVNVVLVVMVVAFAVLWLP
LHVFNSLEDWHHEAIPICHGNLIFLVCHLLAMASTCVNPFIYGFLNTNFKKEIKALVLTCQQSAPLEESE
HLPLSTVHTEVSKGSLRLSGRSNPI

>NP_001316355.1 guanine nucleotide-binding protein G(T) subunit gamma-T1 precursor [Homo sapiens]
MPVINIEDLTEKDKLKMEVDQLKKEVTLERMLVSKCCEEVRDYVEERSGEDPLVKGIPEDKNPFKELKGG
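To check a downloaded file, the FASTA text can be split back into records and the sequence lengths compared with the `Length` column of the table written earlier. The parser below is a minimal plain-Python sketch with a hypothetical name (`parse_fasta`); Biopython's own `Bio.SeqIO.parse(handle, "fasta")` is the more complete option.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Second record from the output above, copied verbatim.
sample = """>NP_001316355.1 guanine nucleotide-binding protein G(T) subunit gamma-T1 precursor [Homo sapiens]
MPVINIEDLTEKDKLKMEVDQLKKEVTLERMLVSKCCEEVRDYVEERSGEDPLVKGIPEDKNPFKELKGG
CVIS
"""
for header, seq in parse_fasta(sample):
    print(header.split()[0], len(seq))  # NP_001316355.1 74
```

The length of 74 agrees with the entry for `NP_001316355` in the summary table.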


### Download protein by its nucleotide UID

In [14]:
lhandle = Entrez.elink(dbfrom="nucleotide", db="protein", id="2065188392")  # db defaults to "pubmed" if omitted
lrecord = Entrez.read(lhandle)
lrecord

[{'ERROR': [], 'LinkSetDb': [{'Link': [{'Id': '2068639613'}], 'DbTo': 'protein', 'LinkName': 'nuccore_protein'}], 'LinkSetDbHistory': [], 'DbFrom': 'nuccore', 'IdList': ['2065188392']}]

In [15]:
prot_id = lrecord[0]["LinkSetDb"][0]["Link"][0]["Id"]  # UID of the linked protein
handle = Entrez.efetch(db="protein", id=prot_id, rettype="fasta", retmode="text")
with open("results/prot_from_nt.fasta", "w") as ouf:
    ouf.write(handle.read())

In [16]:
! cat results/prot_from_nt.fasta

>XP_042223242.1 crustacyanin-A2 subunit-like [Homarus americanus]
MGVWYEIQAQPNIFQSIKSCLASSYKRVKTEIHVLSEGLDSSGASTTTKSILKIVDPQNPAHMVTDFVPG
VEPPFDIVDTDYKTFSCAHSCLSIVGIKTEFVFIYSRNRTLRSNSTQHCLSIFEVSIIGIISFYTNANNY



### Download nucleotide sequences linked to a PMID

In [17]:
lhandle = Entrez.elink(dbfrom="pubmed", db="nucleotide", id="20558169")
lrecord = Entrez.read(lhandle)
lrecord

[{'ERROR': [], 'LinkSetDb': [{'Link': [{'Id': '312146895'}, {'Id': '312146894'}, {'Id': '312146893'}, {'Id': '312146892'}, {'Id': '312146891'}, {'Id': '312146890'}, {'Id': '312146889'}, {'Id': '312146888'}, {'Id': '312146887'}], 'DbTo': 'nuccore', 'LinkName': 'pubmed_nuccore'}], 'LinkSetDbHistory': [], 'DbFrom': 'pubmed', 'IdList': ['20558169']}]

In [18]:
ids = [el["Id"] for el in lrecord[0]["LinkSetDb"][0]["Link"]]
ids

['312146895',
 '312146894',
 '312146893',
 '312146892',
 '312146891',
 '312146890',
 '312146889',
 '312146888',
 '312146887']
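Indexing with `[0]["LinkSetDb"][0]` assumes that at least one link exists; if `LinkSetDb` comes back empty, that expression raises an `IndexError`. A more defensive way to collect the linked UIDs could look like the sketch below, where `linked_ids` is a hypothetical helper that works on the parsed `elink` structure shown in the outputs above.

```python
def linked_ids(link_results, db_to=None):
    """Collect linked UIDs from a parsed elink result, optionally for one target database."""
    ids = []
    for linkset in link_results:
        for linkset_db in linkset.get("LinkSetDb", []):
            if db_to is None or linkset_db.get("DbTo") == db_to:
                ids.extend(link["Id"] for link in linkset_db.get("Link", []))
    return ids


# Structure copied from the nucleotide-to-protein elink output in this notebook.
example = [{
    "ERROR": [],
    "LinkSetDb": [{"Link": [{"Id": "2068639613"}], "DbTo": "protein", "LinkName": "nuccore_protein"}],
    "LinkSetDbHistory": [],
    "DbFrom": "nuccore",
    "IdList": ["2065188392"],
}]
print(linked_ids(example))                # ['2068639613']
print(linked_ids([{"LinkSetDb": []}]))    # []
```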

In [19]:
# Fetch only the first four of the linked sequences
rrecord = Entrez.efetch(db="nucleotide", id=ids[:4], rettype="fasta", retmode="text")
with open("results/py_fasta_pmid.fasta", "w") as ouf:
    ouf.write(rrecord.read())

In [20]:
! cat results/py_fasta_pmid.fasta

>HM140499.1 Myospora metanephrops isolate NZ6C small subunit ribosomal RNA gene, partial sequence
GACGGCTACCAGGTCCAAGGACAGCAGCAGGCGCGAAAATTACCGAAGCCTACAACAGGGCGGTAGTAAT
GAGACGTGAAAACTAGACACGAATAAAATACGTGTTAGCAACTGGAGGTCAAGTCTGGTGCCAGCATCCG
CGGTAATACCAGCTCCAGGGGTGTCTATGATGATTGCTGCGATTAAAAGGTCCGTAGTCTTATGTCAGAA
CCGATGTGTAAGATGCTCGATCTAAGAGCAAAAAGGATTGGTACAGACATACATATATAGTGGTGTGTAT
ATAGAGATGTTATATTTGTAATGTTGATATGTATATGGTGCAATATATTGAAATGAGGAGCGACCGGGGG
CTAGATTATCGAGCAACGAGAGGTGAAATTTGATGACTTGCTTGGGAGTAACAGAGGCGAAAGCGCTAGT
CAAGGGCGAATCCGATGATCAAGGACGTAGGCTGGAGGATCGAACACGATTAGATACCGTAGTAGTTCCA
GCAGTAAACTATGCCGACGCCGTGGGTTGTTTGACCCGCGGAAGAGAAATCTAGTAGGGCTTTGGGGAGA
GTACGCGCGCAAGCGATAAATTTAAAGGAAATTGACGGAAGAACACCACAAGGAGTGGAGTGTGCGGCTT
AATTTGACTCAACGCGGGACAGCTTACCAGGCCCGAGGATTGCACGAGCGAATACGCGATAGATCTGAAA
GTGGTGCATGGCCGTTATCGACGAATGGAGTGATCTTTTGGTTAAATCCGTCAATTCGTGAGACCCTTTT
AATTTGATTAATGTCAGTGGTTGATACAGGTATGAAAATACAGGGGGGAAAGGACAAGAACAGGTCAGTG
ATGCCCTTAGATGGCCTGGGCTGCACGCGCACTACAG