# **Molecular Phylogenetics**
## **Mining Data**
> Done by Ilia Popov

### Part I - Python

0) First, I identify myself to NCBI by setting my e-mail address, so I don't get 'shadow-banned'

In [18]:
from Bio import Entrez
Entrez.email = 'iljapopov17@gmail.com'
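
For context, the address travels with each E-utilities request as an ordinary query parameter. The sketch below builds such a URL by hand with the standard library only and sends nothing over the network. The helper name `build_esearch_url`, the `tool` value and the placeholder address are assumptions made for illustration.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Public base URL of the NCBI E-utilities
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"

def build_esearch_url(db, term, email, tool="my_notebook"):
    """Return an esearch URL that identifies the caller (no request is made)."""
    params = {"db": db, "term": term, "email": email, "tool": tool}
    return EUTILS_BASE + "esearch.fcgi?" + urlencode(params)

url = build_esearch_url("pubmed", "Cyclophilin A", "user@example.com")
# Decode the query string again to check what would be sent
query = parse_qs(urlsplit(url).query)
print(query["email"][0])
```

In the notebook itself, Biopython takes care of this once `Entrez.email` is set.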

1) Search PubMed for articles matching the query "Cyclophilin A AND Open reading frame AND Real-time PCR" and view the abstracts of the first two articles in plain-text format.

In [19]:
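# Search PubMed; esearch returns the PMIDs of the matching articles
handle = Entrez.esearch(db="pubmed", term="Cyclophilin A AND Open reading frame AND Real-time PCR")
record = Entrez.read(handle)
handle.close()
print(record)
# Fetch the abstracts of the first two hits as plain text
abstracts = Entrez.efetch(db="pubmed", id=record["IdList"][0:2], rettype="abstract", retmode="text")
print(abstracts.read())
abstracts.close()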

{'Count': '3', 'RetMax': '3', 'RetStart': '0', 'IdList': ['29097323', '19041262', '18819019'], 'TranslationSet': [{'From': 'Cyclophilin A', 'To': '"cyclophilin a"[MeSH Terms] OR "cyclophilin a"[All Fields]'}, {'From': 'Open reading frame', 'To': '"open reading frames"[MeSH Terms] OR ("open"[All Fields] AND "reading"[All Fields] AND "frames"[All Fields]) OR "open reading frames"[All Fields] OR ("open"[All Fields] AND "reading"[All Fields] AND "frame"[All Fields]) OR "open reading frame"[All Fields]'}, {'From': 'Real-time PCR', 'To': '"real-time polymerase chain reaction"[MeSH Terms] OR ("real-time"[All Fields] AND "polymerase"[All Fields] AND "chain"[All Fields] AND "reaction"[All Fields]) OR "real-time polymerase chain reaction"[All Fields] OR ("real"[All Fields] AND "time"[All Fields] AND "pcr"[All Fields]) OR "real time pcr"[All Fields]'}], 'QueryTranslation': '("cyclophilin a"[MeSH Terms] OR "cyclophilin a"[All Fields]) AND ("open reading frames"[MeSH Terms] OR ("open"[All Fields] A
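
Only two fields of this dictionary are needed for the next step: the hit count and the ID list. A minimal sketch of reading them, using a trimmed, hand-copied version of the record printed above rather than a live query:

```python
# Trimmed copy of the dictionary printed above (only the fields used here)
record = {
    "Count": "3",
    "RetMax": "3",
    "RetStart": "0",
    "IdList": ["29097323", "19041262", "18819019"],
}

total_hits = int(record["Count"])   # the count arrives as a string
first_two = record["IdList"][:2]    # slicing is safe even with fewer than two hits

print(f"{total_hits} hits, fetching abstracts for: {', '.join(first_two)}")
```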

2) Find the ID of the organism "_Procambarus clarkii_" by name in the taxonomy database

In [20]:
handle = Entrez.esearch(db = "taxonomy", term = "Procambarus clarkii")
record = Entrez.read(handle)
print(record)
print(record['IdList'])

{'Count': '1', 'RetMax': '1', 'RetStart': '0', 'IdList': ['6728'], 'TranslationSet': [], 'TranslationStack': [{'Term': 'Procambarus clarkii[All Names]', 'Field': 'All Names', 'Count': '1', 'Explode': 'N'}, 'GROUP'], 'QueryTranslation': 'Procambarus clarkii[All Names]'}
['6728']
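
The numeric ID can stand in for the organism name in later searches. A minimal sketch, assuming the `txid...[Organism:exp]` form that NCBI uses in its own taxonomy links; the helper name `organism_filter` is hypothetical, and the ID list is copied from the output above.

```python
def organism_filter(taxid):
    """Build a taxonomy-restricted search clause from a numeric taxon ID."""
    taxid = str(taxid).strip()
    if not taxid.isdigit():
        raise ValueError(f"not a numeric taxon ID: {taxid!r}")
    return f"txid{taxid}[Organism:exp]"

id_list = ["6728"]  # the IdList printed above
term = "cyclophilin AND " + organism_filter(id_list[0])
print(term)
```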


3) Let's query the nucleotide sequence database for the gene name "cyclophilin", and then return a table with:
- UID
- Accession number
- Sequence length

I changed the command from the lecture: there we searched the protein sequence database, whereas this assignment needs the UID of a nucleotide record

In [21]:
handle = Entrez.esearch(db="nucleotide", term="cyclophilin AND Procambarus clarkii[orgn]")
record = Entrez.read(handle)
handle.close()
for uid in record["IdList"]:
    # esummary returns XML, which Entrez.read parses into a list of summaries
    summary = Entrez.read(Entrez.esummary(db="nucleotide", id=uid))
    # Print UID, accession (Caption) and sequence length, tab-separated
    print(summary[0]['Id'] + "\t" + summary[0]['Caption'] + "\t" + str(int(summary[0]['Length'])))

1940114972	MT601694	636
429843488	JX878886	495
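
The table is easier to read with a header row. A small sketch of one way to add it, working on plain dictionaries hand-copied from the output above (a live esummary record carries many more fields); the function name `to_tsv` is an illustrative choice.

```python
# Hand-copied from the output above; a live esummary record has many more fields
summaries = [
    {"Id": "1940114972", "Caption": "MT601694", "Length": 636},
    {"Id": "429843488", "Caption": "JX878886", "Length": 495},
]

def to_tsv(rows):
    """Render summary dicts as tab-separated text with a header line."""
    lines = ["UID\tAccession\tLength"]
    for row in rows:
        lines.append(f"{row['Id']}\t{row['Caption']}\t{int(row['Length'])}")
    return "\n".join(lines)

table = to_tsv(summaries)
print(table)
```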


4) Let's give the protein sequence database a text query and then return the sequences in fasta format, which we write to a file

In [22]:
handle = Entrez.esearch(db="protein", term="cyclophilin AND Procambarus clarkii[orgn]")
record = Entrez.read(handle)
handle.close()
# Fetch each hit as FASTA and append it to the output file
with open("cyclophilin.fasta", "w") as ouf:
    for uid in record["IdList"]:
        fasta = Entrez.efetch(db="protein", id=uid, retmode="text", rettype="fasta").read()
        ouf.write(fasta + "\n")
# Show the first five lines of the file as a sanity check
with open("cyclophilin.fasta", "r") as fastaf:
    snippet = [next(fastaf) for x in range(5)]
    print(snippet)

['>QPM92673.1 cyclophilin [Procambarus clarkii]\n', 'MKALVAVVALLVIFSVFNRADGQAGESKGPKVTHKVFFDITIGGVPKGTVVIGLFGSTVPRTAQNFFELA\n', 'QKPVGEGYKGSVFHRVIKDFMIQGGDFTRGDGTGGRSIYGERFADENFKLKHFGAGWLSMANAGKDTNGS\n', 'QFFITTNKTTWLDGKHVVFGKVLAGMPIIREIEASATDGRDRPVAEVKIVDSRGEALSQPFESVAKEDAT\n', 'D\n']
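
One caveat about the preview step: `[next(fastaf) for x in range(5)]` raises `StopIteration` when the file has fewer than five lines. A minimal alternative using `itertools.islice`, shown here on an in-memory stand-in for the file with a made-up sequence:

```python
import io
from itertools import islice

def head(lines, n=5):
    """Return up to n lines; unlike repeated next(), never raises on short input."""
    return list(islice(lines, n))

# Stand-in for an open file: a three-line FASTA fragment (sequence is made up)
short_file = io.StringIO(">demo\nMKAL\nVAVV\n")
snippet = head(short_file)
print(snippet)
```

The same `head(fastaf)` call would work on a real file object opened with `open(...)`.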


5) Download the protein that corresponds to a nucleotide record with a known UID

In [23]:
# Follow the nucleotide -> protein link for the known nucleotide UID
lhandle = Entrez.elink(dbfrom="nucleotide", db="protein", id="429843488")
lrecord = Entrez.read(lhandle)
lhandle.close()
# UID of the first linked protein
prot_id = lrecord[0]["LinkSetDb"][0]['Link'][0]['Id']
fhandle = Entrez.efetch(db="protein", id=prot_id, rettype="fasta", retmode="text")
with open("prot_from_nt.fasta", "w") as ouf:
    ouf.write(fhandle.read() + "\n")
with open("prot_from_nt.fasta", "r") as fastaf:
    snippet = [next(fastaf) for x in range(5)]
    print(snippet)

['>AGA16578.1 cyclophilin A [Procambarus clarkii]\n', 'MGNPQVFFDITANGKPLGRIVMELRADVVPKTAENFRALCTGEKGFGYKGSTFHRVIPNFMCQGGDFTAG\n', 'NGTGGKSIYGSKFADENFQLPHDGPGILSMANAGPNTNGSQFFLCTVRTNWLDGKHVVLGKVTEGMDVVR\n', 'QIEGYGKPSGETSAKIVVANCGQL\n', '\n']
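
The chained indexing `[0]["LinkSetDb"][0]['Link'][0]['Id']` fails with an `IndexError` when a record has no links. A hedged sketch of a more tolerant helper, written against plain lists and dictionaries that mimic the nesting indexed above; the name `linked_ids` and the sample IDs are made up for illustration.

```python
def linked_ids(link_sets):
    """Collect every linked ID from an elink-style result, tolerating empty results."""
    ids = []
    for link_set in link_sets:
        for link_db in link_set.get("LinkSetDb", []):
            ids.extend(link["Id"] for link in link_db.get("Link", []))
    return ids

# Hand-written stand-ins with the same nesting as the elink result used above
with_links = [{"LinkSetDb": [{"Link": [{"Id": "111"}, {"Id": "222"}]}]}]  # made-up IDs
without_links = [{"LinkSetDb": []}]

print(linked_ids(with_links))
print(linked_ids(without_links))
```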


6) Finally, let's download the nucleotide sequences linked to the article with PMID 19041262 and write them to a FASTA file

In [24]:
# Follow the PubMed -> nucleotide links for the article
lhandle = Entrez.elink(dbfrom="pubmed", db="nucleotide", id="19041262")
lrecord = Entrez.read(lhandle)
lhandle.close()
ids = [el['Id'] for el in lrecord[0]["LinkSetDb"][0]["Link"]]
# Note: only the first four linked sequences are fetched here
fhandle = Entrez.efetch(db="nucleotide", id=ids[:4], rettype="fasta", retmode="text")
with open("py_fasta_pmid.fasta", "w") as ouf:
    ouf.write(fhandle.read() + "\n")
with open("py_fasta_pmid.fasta", "r") as fastaf:
    snippet = [next(fastaf) for x in range(5)]
    print(snippet)

['>EU164775.1 Penaeus monodon cyclophilin A mRNA, complete cds\n', 'CTCGTCCTCGGTTCCCGGCGATCCTCTGGAGATTGTTGCCGTAGATGGACTTGCGAGCAGACCTACACCA\n', 'ACTTAGCCACCATGGGCAACCCCAAAGTCTTTTTCGACATTACCGCTGACAACCAGCCCGTTGGCAGGAT\n', 'CGTCATGGAGCTCCGCGCCGACGTGGTCCCCAAGACCGCCGAGAACTTCCGGTCGCTGTGCACGGGCGAG\n', 'AAGGGCTTCGGCTACAAGGGTTCCTGCTTCCACCGCGTGATCCCCAACTTCATGTGTCAGGGAGGCGACT\n']
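
Printing the first five lines shows only the start of the first record. To check how many sequences were written and how long each one is, the file can be split into records. Biopython's `Bio.SeqIO` is the usual tool for this; the sketch below uses a small hand-rolled parser instead so that it runs without extra dependencies, and the sample text is made up.

```python
def parse_fasta(text):
    """Split FASTA text into (header, sequence) pairs, joining wrapped lines."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank separator lines
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records

# Made-up two-record sample standing in for py_fasta_pmid.fasta
sample = ">seq1 demo\nACGT\nAC\n\n>seq2 demo\nGGG\n"
for header, seq in parse_fasta(sample):
    print(header.split()[0], len(seq))
```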


### Part II - Bash

1) Search PubMed for articles matching the query "Cyclophilin A AND Open reading frame AND Real-time PCR" and view the abstracts of the first two articles in plain-text format.

In [1]:
! esearch -email iljapopov17@gmail.com -db pubmed -query "Cyclophilin A AND Open reading frame AND Real-time PCR" | efetch -mode text -format abstract

1. Fish Shellfish Immunol. 2018 Jan;72:383-388. doi: 10.1016/j.fsi.2017.10.053. 
Epub 2017 Oct 31.

Molecular identification and expression analysis of a novel cyclophilin a gene 
in the red swamp crayfish, Procambarus clarkii.

Zhu J(1), Lin F(2), Li F(2), Wang Y(3).

Author information:
(1)College of Animal Sciences, Zhejiang University, Hangzhou, 310058, China; 
School of Life Sciences, RanHuzhou University, Huzhou, 313000, China.
(2)Zhejiang Institute of Freshwater Fisheries, Huzhou, 313001, China.
(3)College of Animal Sciences, Zhejiang University, Hangzhou, 310058, China. 
Electronic address: ywang@zju.edu.cn.

Cyclophilin A (Cyp A) is the main intracellular receptor of cyclosporin A (CsA) 
belonging to the immunophilin family, which is known as an effective 
immunosuppressive drug. This study aimed to gain insights into the structure and 
biological function of cyclophilin A in the red swamp crayfish, Procambarus 
clarkii (PcCypA). We cloned PcCypA by homology cloning and anchor

2) Find the ID of the organism "_Procambarus clarkii_" by name in the taxonomy database

In [2]:
! esearch -email iljapopov17@gmail.com -db taxonomy -query "Procambarus clarkii" | esummary | grep TaxId

    <TaxId>6728</TaxId>
    <AkaTaxId>0</AkaTaxId>


3) Let's query the nucleotide sequence database for the gene name "cyclophilin", and then return a table with:
- UID
- Accession number
- Sequence length

I changed the command from the lecture: there we searched the protein sequence database, whereas this assignment needs the UID of a nucleotide record

In [3]:
! esearch -email iljapopov17@gmail.com -db nucleotide -query "cyclophilin AND Procambarus clarkii[orgn]" | esummary -mode xml | xtract -pattern DocumentSummary -element Id Caption Slen

1940114972	MT601694	636
429843488	JX878886	495
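
For comparison with Part I, the selection done by `xtract -pattern DocumentSummary -element Id Caption Slen` can be sketched in Python with the standard XML parser. The XML below is a hand-written, heavily simplified stand-in for real esummary output, and `extract_rows` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Hand-written, heavily simplified stand-in for esummary XML output
SAMPLE_XML = """
<DocumentSummarySet>
  <DocumentSummary><Id>1940114972</Id><Caption>MT601694</Caption><Slen>636</Slen></DocumentSummary>
  <DocumentSummary><Id>429843488</Id><Caption>JX878886</Caption><Slen>495</Slen></DocumentSummary>
</DocumentSummarySet>
"""

def extract_rows(xml_text, fields=("Id", "Caption", "Slen")):
    """Return one tab-separated line per DocumentSummary element."""
    root = ET.fromstring(xml_text)
    rows = []
    for doc in root.iter("DocumentSummary"):
        # A missing field becomes an empty column instead of raising an error
        rows.append("\t".join(doc.findtext(field, default="") for field in fields))
    return rows

rows = extract_rows(SAMPLE_XML)
print("\n".join(rows))
```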


4) Let's give the protein sequence database a text query and then return the sequences in fasta format, which we write to a file

In [4]:
! esearch -email iljapopov17@gmail.com -db protein -query "cyclophilin AND Procambarus clarkii[orgn]" | efetch -format fasta -mode text >cyclophilin.fa
! head cyclophilin.fa

>QPM92673.1 cyclophilin [Procambarus clarkii]
MKALVAVVALLVIFSVFNRADGQAGESKGPKVTHKVFFDITIGGVPKGTVVIGLFGSTVPRTAQNFFELA
QKPVGEGYKGSVFHRVIKDFMIQGGDFTRGDGTGGRSIYGERFADENFKLKHFGAGWLSMANAGKDTNGS
QFFITTNKTTWLDGKHVVFGKVLAGMPIIREIEASATDGRDRPVAEVKIVDSRGEALSQPFESVAKEDAT
D
>AGA16578.1 cyclophilin A [Procambarus clarkii]
MGNPQVFFDITANGKPLGRIVMELRADVVPKTAENFRALCTGEKGFGYKGSTFHRVIPNFMCQGGDFTAG
NGTGGKSIYGSKFADENFQLPHDGPGILSMANAGPNTNGSQFFLCTVRTNWLDGKHVVLGKVTEGMDVVR
QIEGYGKPSGETSAKIVVANCGQL


5) Download the protein that corresponds to a nucleotide record with a known UID

In [6]:
! elink -id 429843488 -db nuccore -target protein | efetch -format fasta -mode text > prot_from_nt.fa
! head prot_from_nt.fa

>AGA16578.1 cyclophilin A [Procambarus clarkii]
MGNPQVFFDITANGKPLGRIVMELRADVVPKTAENFRALCTGEKGFGYKGSTFHRVIPNFMCQGGDFTAG
NGTGGKSIYGSKFADENFQLPHDGPGILSMANAGPNTNGSQFFLCTVRTNWLDGKHVVLGKVTEGMDVVR
QIEGYGKPSGETSAKIVVANCGQL


6) Finally, let's download the nucleotide sequences linked to the article with PMID 19041262 and write them to a FASTA file

In [7]:
! elink -db pubmed -target nucleotide -id 19041262 | efetch -format fasta -mode text > py_fasta_pmid.fa
! head py_fasta_pmid.fa

>EU164775.1 Penaeus monodon cyclophilin A mRNA, complete cds
CTCGTCCTCGGTTCCCGGCGATCCTCTGGAGATTGTTGCCGTAGATGGACTTGCGAGCAGACCTACACCA
ACTTAGCCACCATGGGCAACCCCAAAGTCTTTTTCGACATTACCGCTGACAACCAGCCCGTTGGCAGGAT
CGTCATGGAGCTCCGCGCCGACGTGGTCCCCAAGACCGCCGAGAACTTCCGGTCGCTGTGCACGGGCGAG
AAGGGCTTCGGCTACAAGGGTTCCTGCTTCCACCGCGTGATCCCCAACTTCATGTGTCAGGGAGGCGACT
TCACCGCCGGCAACGGCACGGGCGGCAAGTCCATCTACGGCAACAAATTCGAGGACGAGAACTTCGCACT
GAAGCACACCGGCCCCGGCACCCTGTCCATGGCCAACGCCGGCCCCAACACCAACGGGTCGCAATTCTTC
ATCTGCACCGTCAAAACCCCCTGGCTGGACAACAAGCACGTGGTTTTCGGCTCCGTGGTGGAGGGCATGG
ACATCGTGCGCCAGGTCGAGGGTTTCGGCACCCCCAACGGCTCTTGCAAGCGGAAAGTGATGATCGCCAA
CTGCGGCCAGCTGTAAAGTTTCAGAACATTCCCCCTTAGCCGCCCACCCCTTTTTTTTTTGATGTAATTG
