## Identifying host and pathogen information

In this notebook I will go through how to identify your host and pathogen, crawl through NCBI using Biopython, extract full GenBank records, create a list of associated publications, and download host and pathogen sequences. There are a few sections in the Biopython cookbook that I reference frequently, and I outline them here.

* [Overview of Biopython](http://biopython.org/DIST/docs/tutorial/Tutorial.html#sec7)
* [Accessing NCBI Entrez database](http://biopython.org/DIST/docs/tutorial/Tutorial.html#sec139)
    + I would pay particular attention to sections 9.15 and 9.16
* [Full genbank record](http://biopython.org/DIST/docs/tutorial/Tutorial.html#sec145)
* [Retrieving publications](http://biopython.org/DIST/docs/tutorial/Tutorial.html#sec158)
* [Downloading Sequences](http://biopython.org/DIST/docs/tutorial/Tutorial.html#sec163)

For this tutorial we will use human immunodeficiency virus 1 (HIV) and Homo sapiens as our models. These are two good models because humans have selfishly put many genetic records of themselves, along with their pathogens, into NCBI.

### Starting with the Entrez database

Entrez is a powerful but complex database. It allows nearly all of NCBI to be accessed using Biopython. Being complex and highly sought after comes with some caveats.

1. You need to make an NCBI account. This will allow you to create an API key and have an email on file.
2. They limit traffic during certain parts of the week. Be diligent about not overloading their servers.
3. Did I mention it was complex?

**We will always enter our API key and email.**
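
A key typed directly into a notebook is visible to anyone the notebook is shared with. One option, sketched here, is to read the email and key from environment variables. The variable names `NCBI_EMAIL` and `NCBI_API_KEY` and both helper functions are assumptions made for this example, not something NCBI or Biopython requires.

```python
import os


def load_ncbi_credentials(env=None):
    """Return (email, api_key) read from environment variables.

    NCBI_EMAIL and NCBI_API_KEY are names assumed for this sketch.
    """
    env = os.environ if env is None else env
    email = env.get("NCBI_EMAIL")
    api_key = env.get("NCBI_API_KEY")  # optional, may be None
    if not email:
        raise ValueError("Set NCBI_EMAIL so the NCBI admin can contact you")
    return email, api_key


def configure_entrez(env=None):
    """Apply the credentials to Bio.Entrez and return the module."""
    # Imported here so load_ncbi_credentials works without Biopython installed
    from Bio import Entrez

    email, api_key = load_ncbi_credentials(env)
    Entrez.email = email
    if api_key:
        Entrez.api_key = api_key
    return Entrez
```

With the two variables set in your shell, `Entrez = configure_entrez()` could stand in for the import and the two assignment lines at the top of each cell.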

In [6]:
# Showing the depth of Entrez
# Import statement
from Bio import Entrez
# Enter your own api key. To find this go to your NCBI account, click your profile name (top right), go to API Key management
Entrez.api_key = "92471a1f25f1f658f6711d0c7bc31ff80708"
# Always add an email so the NCBI admin can contact you
Entrez.email = "gree9242@uri.edu"
# You will find these handle and record variables used often. I will specify what they hold when I can.
handle = Entrez.einfo()
record = Entrez.read(handle)
record.keys()
# Print out database list
record["DbList"]

['pubmed', 'protein', 'nuccore', 'ipg', 'nucleotide', 'structure', 'sparcle', 'protfam', 'genome', 'annotinfo', 'assembly', 'bioproject', 'biosample', 'blastdbinfo', 'books', 'cdd', 'clinvar', 'gap', 'gapplus', 'grasp', 'dbvar', 'gene', 'gds', 'geoprofiles', 'homologene', 'medgen', 'mesh', 'ncbisearch', 'nlmcatalog', 'omim', 'orgtrack', 'pmc', 'popset', 'proteinclusters', 'pcassay', 'biosystems', 'pccompound', 'pcsubstance', 'seqannot', 'snp', 'sra', 'taxonomy', 'biocollections', 'gtr']

#### See I told you

To be able to manipulate these databases we need to know what we can search for. To do this we need to view information about the individual databases.

**If you are interested you can swap out the nucleotide database in the (db="nucleotide") line for another database and view the description, entries, updates, and fields within each one**

In [55]:
from Bio import Entrez
Entrez.api_key = "92471a1f25f1f658f6711d0c7bc31ff80708"
Entrez.email = "gree9242@uri.edu"
handle = Entrez.einfo(db="nucleotide")
record = Entrez.read(handle)
print("This is a description of the database " + record["DbInfo"]["Description"])
print("This is a count of all the entries in the database " + record["DbInfo"]["Count"])
print("This is when the last update was done " + record["DbInfo"]["LastUpdate"]) 

This is a description of the database Core Nucleotide db
This is a count of all the entries in the database 434797561
This is when the last update was done 2020/12/06 23:16


#### The EInfo record also lists the fields you can search within the selected database

In [56]:
for field in record["DbInfo"]["FieldList"]:
    print("%(Name)s, %(FullName)s, %(Description)s" % field)

ALL, All Fields, All terms from all searchable fields
UID, UID, Unique number assigned to each sequence
FILT, Filter, Limits the records
WORD, Text Word, Free text associated with record
TITL, Title, Words in definition line
KYWD, Keyword, Nonstandardized terms provided by submitter
AUTH, Author, Author(s) of publication
JOUR, Journal, Journal abbreviation of publication
VOL, Volume, Volume number of publication
ISS, Issue, Issue number of publication
PAGE, Page Number, Page number(s) of publication
ORGN, Organism, Scientific and common names of organism, and all higher levels of taxonomy
ACCN, Accession, Accession number of sequence
PACC, Primary Accession, Does not include retired secondary accessions
GENE, Gene Name, Name of gene associated with sequence
PROT, Protein Name, Name of protein associated with sequence
ECNO, EC/RN Number, EC number for enzyme or CAS registry number
PDAT, Publication Date, Date sequence added to GenBank
MDAT, Modification Date, Date of last update
SUBS, S

**We need to set our host and pathogen names as variables**

**This way we won't have to enter this information every time we move on to a new section**

**It is helpful to be descriptive in this naming, such as providing genus and species**

**Searching your host/pathogen in NCBI might help to refine the organism name**

In [44]:
# Enter the host organism name with the organism field tag
host = 'Homo sapiens[Orgn]'

# Enter the pathogen organism name with the organism field tag
pathogen = 'Human immunodeficiency virus 1[Orgn]'

# Enter host and pathogen combined into one query
hp = 'Homo sapiens[Orgn] AND Human immunodeficiency virus 1[Orgn]'
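
If you plan to swap in other organisms, the three strings above can be built from their parts instead of typed by hand. This is only a sketch; `tag` and `all_of` are hypothetical helper names, and the only field used is the `Orgn` tag already shown in this notebook.

```python
def tag(term, field):
    """Attach an Entrez field tag, e.g. tag('Homo sapiens', 'Orgn')."""
    return "{}[{}]".format(term.strip(), field)


def all_of(*terms):
    """Join already-tagged terms with AND."""
    return " AND ".join(terms)


host = tag("Homo sapiens", "Orgn")
pathogen = tag("Human immunodeficiency virus 1", "Orgn")
hp = all_of(host, pathogen)
print(hp)
```

Building the combined query from `host` and `pathogen` keeps the three variables in sync when one organism name changes.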

### Finding Records

Now we will move on to finding records for our host and pathogen. This section prints how many records we found and their accession ids. An accession id is a unique identifier assigned to a specific record of genetic information. Since there may be multiple assembled human genomes, each verified and complete entry has its own unique identifier. We will access the nucleotide database for host and pathogen information, as it provides the most specific results when using general names. We could use the assembly database to find ids, but that database is limited when using the efetch function in the next section.

In [39]:
from Bio import Entrez
Entrez.api_key = "92471a1f25f1f658f6711d0c7bc31ff80708"
Entrez.email = "gree9242@uri.edu"
# Searches the nucleotide database for host records and retrieves the first 10 accession numbers
handle1 = Entrez.esearch(db='nucleotide',term=host,idtype="acc",retmax='10')
# Searches the nucleotide database for pathogen records and retrieves the first 10 accession numbers
handle2 = Entrez.esearch(db='nucleotide',term=pathogen,idtype="acc",retmax='10')
record1 = Entrez.read(handle1)
record2 = Entrez.read(handle2)
print('{} number of host ids found'.format(record1["Count"]))
print()
print('{} host id list'.format(record1["IdList"]))
print()
print('{} pathogen ids found'.format(record2["Count"]))
print()
print('{} pathogen id list'.format(record2["IdList"]))

27704223 number of host ids found

['NM_001389244.1', 'NM_001386188.2', 'NR_138058.2', 'NM_001324459.2', 'NM_001165037.2', 'NM_001265609.2', 'NM_001286378.2', 'NM_001329906.2', 'NM_001164720.3', 'NM_004194.5'] host id list

929947 pathogen ids found

['6XH3_D', '6XH2_D', '6XH1_D', '6XH0_D', '7JU1_X', '7K1Z_A', '6WB2_C', '6WB1_C', '6WB0_C', '6WAZ_C'] pathogen id list


#### We can see here that there are many more host ids (about 27.7 million) than pathogen ids (about 930,000). Either list is far too long to print, which is why we limit our returns to 10 for the time being.
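
The two id lists above also look different: the host ids are RefSeq-style accessions (`NM_`, `NR_`), while the pathogen ids look like a PDB code followed by a chain label (`6XH3_D`). A rough way to tell them apart is sketched below. The two regular expressions are assumptions based on those visible shapes, not an official NCBI classification, and `classify_accession` is a hypothetical helper.

```python
import re
from collections import Counter

# Assumed shapes for this sketch:
#   RefSeq:    two capital letters, underscore, digits, optional version (NM_004194.5)
#   PDB chain: a digit plus three characters, underscore, chain label (6XH3_D)
REFSEQ = re.compile(r"^[A-Z]{2}_\d+(\.\d+)?$")
PDB_CHAIN = re.compile(r"^\d[A-Za-z0-9]{3}_[A-Za-z0-9]+$")


def classify_accession(acc):
    """Return 'refseq', 'pdb_chain', or 'other' for one accession string."""
    if REFSEQ.match(acc):
        return "refseq"
    if PDB_CHAIN.match(acc):
        return "pdb_chain"
    return "other"


# A few ids copied from the output above
host_ids = ["NM_001389244.1", "NR_138058.2", "NM_004194.5"]
pathogen_ids = ["6XH3_D", "7JU1_X", "6WAZ_C"]
print(Counter(classify_accession(a) for a in host_ids + pathogen_ids))
```

A check like this can flag when a general organism search is returning short structure-derived entries instead of the genomic records you had in mind.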

### Accessing and saving full records

Having full GenBank records provides a large amount of information about your study system. Best of all, each record can point you to the publication and project associated with the id.

In [40]:
from Bio import Entrez
Entrez.api_key = "92471a1f25f1f658f6711d0c7bc31ff80708"
Entrez.email = "gree9242@uri.edu"
# Loop through the id lists from the previous section and fetch each full GenBank record
for i in record1["IdList"]:
    handle = Entrez.efetch(db='nucleotide',id=i, rettype="gb", retmode="text",retmax=10)
    print(handle.read())
    handle.close()
for i in record2["IdList"]:
    handle = Entrez.efetch(db='nucleotide',id=i, rettype="gb", retmode="text",retmax=10)
    print(handle.read())
    handle.close()

LOCUS       NM_001389244            1831 bp    mRNA    linear   PRI 03-DEC-2020
DEFINITION  Homo sapiens keratin 40 (KRT40), transcript variant 4, mRNA.
ACCESSION   NM_001389244 XM_017024189
VERSION     NM_001389244.1
KEYWORDS    RefSeq; RefSeq Select.
SOURCE      Homo sapiens (human)
  ORGANISM  Homo sapiens
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
            Mammalia; Eutheria; Euarchontoglires; Primates; Haplorrhini;
            Catarrhini; Hominidae; Homo.
REFERENCE   1  (bases 1 to 1831)
  AUTHORS   Luck K, Kim DK, Lambourne L, Spirohn K, Begg BE, Bian W, Brignall
            R, Cafarelli T, Campos-Laborie FJ, Charloteaux B, Choi D, Cote AG,
            Daley M, Deimling S, Desbuleux A, Dricot A, Gebbia M, Hardy MF,
            Kishore N, Knapp JJ, Kovacs IA, Lemmens I, Mee MW, Mellor JC,
            Pollis C, Pons C, Richardson AD, Schlabach S, Teeking B, Yadav A,
            Babor M, Balcha D, Basha O, Bowman-Colin C, Chin SF, Choi SG,
     

LOCUS       NR_138058               1272 bp    RNA     linear   PRI 03-DEC-2020
DEFINITION  Homo sapiens ADP ribosylation factor like GTPase 16 (ARL16),
            transcript variant 4, non-coding RNA.
ACCESSION   NR_138058
VERSION     NR_138058.2
KEYWORDS    RefSeq.
SOURCE      Homo sapiens (human)
  ORGANISM  Homo sapiens
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
            Mammalia; Eutheria; Euarchontoglires; Primates; Haplorrhini;
            Catarrhini; Hominidae; Homo.
REFERENCE   1  (bases 1 to 1272)
  AUTHORS   Haenig C, Atias N, Taylor AK, Mazza A, Schaefer MH, Russ J,
            Riechers SP, Jain S, Coughlin M, Fontaine JF, Freibaum BD,
            Brusendorf L, Zenkner M, Porras P, Stroedicke M, Schnoegl S,
            Arnsburg K, Boeddrich A, Pigazzini L, Heutink P, Taylor JP,
            Kirstein J, Andrade-Navarro MA, Sharan R and Wanker EE.
  TITLE     Interactome Mapping Provides a Network of Neurodegenerative Disease
            

LOCUS       NM_001165037            3169 bp    mRNA    linear   PRI 03-DEC-2020
DEFINITION  Homo sapiens gamma-aminobutyric acid type A receptor subunit alpha5
            (GABRA5), transcript variant 2, mRNA.
ACCESSION   NM_001165037
VERSION     NM_001165037.2
KEYWORDS    RefSeq.
SOURCE      Homo sapiens (human)
  ORGANISM  Homo sapiens
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
            Mammalia; Eutheria; Euarchontoglires; Primates; Haplorrhini;
            Catarrhini; Hominidae; Homo.
REFERENCE   1  (bases 1 to 3169)
  AUTHORS   Hernandez CC, XiangWei W, Hu N, Shen D, Shen W, Lagrange AH, Zhang
            Y, Dai L, Ding C, Sun Z, Hu J, Zhu H, Jiang Y and Macdonald RL.
  TITLE     Altered inhibitory synapses in de novo GABRA5 and GABRA1 mutations
            associated with early onset epileptic encephalopathies
  JOURNAL   Brain 142 (7), 1938-1954 (2019)
   PUBMED   31056671
  REMARK    GeneRIF: study identified GABRA5 as a causative gene for 


LOCUS       NM_001286378            1701 bp    mRNA    linear   PRI 03-DEC-2020
DEFINITION  Homo sapiens maelstrom spermatogenic transposon silencer (MAEL),
            transcript variant 3, mRNA.
ACCESSION   NM_001286378
VERSION     NM_001286378.2
KEYWORDS    RefSeq.
SOURCE      Homo sapiens (human)
  ORGANISM  Homo sapiens
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
            Mammalia; Eutheria; Euarchontoglires; Primates; Haplorrhini;
            Catarrhini; Hominidae; Homo.
REFERENCE   1  (bases 1 to 1701)
  AUTHORS   Taei A, Kiani T, Taghizadeh Z, Moradi S, Samadian A, Mollamohammadi
            S, Sharifi-Zarchi A, Guenther S, Akhlaghpour A, Asgari Abibeiglou
            B, Najar-Asl M, Karamzadeh R, Khalooghi K, Braun T, Hassani SN and
            Baharvand H.
  TITLE     Temporal activation of LRH-1 and RAR-gamma in human pluripotent
            stem cells induces a functional naive-like state
  JOURNAL   EMBO Rep 21 (10), e47533 (2020)
   P

LOCUS       NM_001329906            1879 bp    mRNA    linear   PRI 03-DEC-2020
DEFINITION  Homo sapiens nodal growth differentiation factor (NODAL),
            transcript variant 2, mRNA.
ACCESSION   NM_001329906
VERSION     NM_001329906.2
KEYWORDS    RefSeq.
SOURCE      Homo sapiens (human)
  ORGANISM  Homo sapiens
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
            Mammalia; Eutheria; Euarchontoglires; Primates; Haplorrhini;
            Catarrhini; Hominidae; Homo.
REFERENCE   1  (bases 1 to 1879)
  AUTHORS   Taei A, Kiani T, Taghizadeh Z, Moradi S, Samadian A, Mollamohammadi
            S, Sharifi-Zarchi A, Guenther S, Akhlaghpour A, Asgari Abibeiglou
            B, Najar-Asl M, Karamzadeh R, Khalooghi K, Braun T, Hassani SN and
            Baharvand H.
  TITLE     Temporal activation of LRH-1 and RAR-gamma in human pluripotent
            stem cells induces a functional naive-like state
  JOURNAL   EMBO Rep 21 (10), e47533 (2020)
   PUBMED   

LOCUS       NM_004194               2883 bp    mRNA    linear   PRI 03-DEC-2020
DEFINITION  Homo sapiens ADAM metallopeptidase domain 22 (ADAM22), transcript
            variant 4, mRNA.
ACCESSION   NM_004194
VERSION     NM_004194.5
KEYWORDS    RefSeq.
SOURCE      Homo sapiens (human)
  ORGANISM  Homo sapiens
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
            Mammalia; Eutheria; Euarchontoglires; Primates; Haplorrhini;
            Catarrhini; Hominidae; Homo.
REFERENCE   1  (bases 1 to 2883)
  AUTHORS   Sundell GN, Arnold R, Ali M, Naksukpaiboon P, Orts J, Guntert P,
            Chi CN and Ivarsson Y.
  TITLE     Proteome-wide analysis of phospho-regulated PDZ domain interactions
  JOURNAL   Mol Syst Biol 14 (8), e8129 (2018)
   PUBMED   30126976
  REMARK    Publication Status: Online-Only
REFERENCE   2  (bases 1 to 2883)
  AUTHORS   Yamagata A, Miyazaki Y, Yokoi N, Shigematsu H, Sato Y, Goto-Ito S,
            Maeda A, Goto T, Sanbo M, Hirabayash

LOCUS       6XH2_D                    27 bp    RNA     linear   VRL 01-DEC-2020
DEFINITION  Co-crystal structure of HIV-1 TAR RNA in complex with lab-evolved
            RRM 6.6.
ACCESSION   6XH2_D
VERSION     6XH2_D
KEYWORDS    .
SOURCE      Human immunodeficiency virus 1 (HIV-1)
  ORGANISM  Human immunodeficiency virus 1
            Viruses; Riboviria; Pararnavirae; Artverviricota; Revtraviricetes;
            Ortervirales; Retroviridae; Orthoretrovirinae; Lentivirus.
REFERENCE   1  (bases 1 to 27)
  AUTHORS   Chavali,S.S., Mali,S.M., Jenkins,J.L., Fasan,R. and Wedekind,J.E.
  TITLE     Co-crystal structures of HIV TAR RNA bound to lab-evolved proteins
            show key roles for arginine relevant to the design of cyclic
            peptide TAR inhibitors
  JOURNAL   Unpublished
REFERENCE   2  (bases 1 to 27)
  AUTHORS   Chavali,S.S., Jenkins,J.L. and Wedekind,J.E.
  TITLE     Direct Submission
  JOURNAL   Submitted (18-JUN-2020)
COMMENT     Co-crystal structure of HIV-1 TAR RNA i

LOCUS       6WB0_C                   101 bp    RNA     linear   VRL 01-DEC-2020
DEFINITION  +3 extended HIV-1 reverse transcriptase initiation complex core
            (pre-translocation state).
ACCESSION   6WB0_C
VERSION     6WB0_C
KEYWORDS    .
SOURCE      Human immunodeficiency virus 1 (HIV-1)
  ORGANISM  Human immunodeficiency virus 1
            Viruses; Riboviria; Pararnavirae; Artverviricota; Revtraviricetes;
            Ortervirales; Retroviridae; Orthoretrovirinae; Lentivirus.
REFERENCE   1  (bases 1 to 101)
  AUTHORS   Larsen,K.P., Choi,J., Jackson,L.N., Kappel,K., Zhang,J., Ha,B.,
            Chen,D.H. and Puglisi,E.V.
  TITLE     Distinct Conformational States Underlie Pausing during Initiation
            of HIV-1 Reverse Transcription
  JOURNAL   J Mol Biol 432 (16), 4499-4522 (2020)
   PUBMED   32512005
REFERENCE   2  (bases 1 to 101)
  AUTHORS   Larsen,K.P., Jackson,L.N., Kappel,K., Zhang,J. and Puglisi,E.V.
  TITLE     Direct Submission
  JOURNAL   Submitted (26-MAR-20
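
Since the publication links are one of the main reasons to pull full records, here is a minimal sketch that collects the `PUBMED` ids from GenBank text like the output above. It is a plain-text scan, not a full GenBank parser, and `extract_pubmed_ids` is a hypothetical helper name.

```python
import re

# Matches a line such as "   PUBMED   30126976"
PUBMED_LINE = re.compile(r"^[ \t]*PUBMED[ \t]+(\d+)[ \t]*$", re.MULTILINE)


def extract_pubmed_ids(genbank_text):
    """Return the unique PubMed ids found in GenBank flat text, in order."""
    seen = []
    for pmid in PUBMED_LINE.findall(genbank_text):
        if pmid not in seen:
            seen.append(pmid)
    return seen


# Lines copied from the NM_004194 record printed above
sample = """\
  TITLE     Proteome-wide analysis of phospho-regulated PDZ domain interactions
  JOURNAL   Mol Syst Biol 14 (8), e8129 (2018)
   PUBMED   30126976
  REMARK    Publication Status: Online-Only
"""
print(extract_pubmed_ids(sample))
```

In the loop above you could pass the text returned by `handle.read()` to this function instead of only printing it.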

### Retrieving publications

Since we are interested in humans and their interaction with HIV, we may want to find publications related to this field. PubMed is a great free resource and database for finding publications. We will use similar coding structures with a few changes, especially when saving files.

In [45]:
from Bio import Entrez
Entrez.api_key = "92471a1f25f1f658f6711d0c7bc31ff80708"
Entrez.email = "gree9242@uri.edu"
search_results = Entrez.read(Entrez.esearch(db="pubmed", term=hp, reldate=365, datetype="pdat", usehistory="y"))
count = int(search_results["Count"])
print("Found %i results" % count)

Found 949 results


In case you were wondering, that is about 2.6 papers every day.
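
The figure comes from dividing the number of results by the 365-day window set with `reldate`. The variable names below are chosen for this example only.

```python
n_results = 949      # results reported by the search above
window_days = 365    # the reldate value used in the search
papers_per_day = n_results / window_days
print("{:.1f} papers per day".format(papers_per_day))
```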

#### Saving abstracts

Now that we know we have a lot of information to sift through, let's limit ourselves to saving the abstracts. We will download these in batches, which helps the NCBI admins regulate traffic.
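
The batching idea can be separated from the network call, which makes it easy to check without contacting NCBI. This is a sketch under assumptions: `batch_ranges` and `download_in_batches` are hypothetical helpers, and `fetch_batch` stands for any function you supply that returns the text for one batch (in this notebook it would wrap `Entrez.efetch` with `retstart` and `retmax`).

```python
def batch_ranges(count, batch_size):
    """Yield (start, end) pairs covering range(count) in steps of batch_size."""
    for start in range(0, count, batch_size):
        yield start, min(count, start + batch_size)


def download_in_batches(count, batch_size, fetch_batch, out_path):
    """Write every batch returned by fetch_batch(start, size) to out_path.

    fetch_batch is a caller-supplied callable (an assumption of this sketch).
    Returns the number of batches written.
    """
    batches = 0
    # The with block closes the file even if one fetch raises an error
    with open(out_path, "w") as out_handle:
        for start, end in batch_ranges(count, batch_size):
            out_handle.write(fetch_batch(start, end - start))
            batches += 1
    return batches


print(list(batch_ranges(25, 10)))
```

The last batch is shorter than the others whenever the count is not a multiple of the batch size, which is what `min(count, start + batch_size)` handles.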

In [48]:
batch_size = 10
out_handle = open("recent_human_hiv_papers.txt", "w")
for start in range(0, count, batch_size):
    end = min(count, start + batch_size)
    print("Going to download record %i to %i" % (start + 1, end))
    fetch_handle = Entrez.efetch(
        db="pubmed",
        rettype="medline",
        retmode="text",
        retstart=start,
        retmax=batch_size,
        webenv=search_results["WebEnv"],
        query_key=search_results["QueryKey"],
    )
    data = fetch_handle.read()
    fetch_handle.close()
    out_handle.write(data)
out_handle.close()

Going to download record 1 to 10
Going to download record 11 to 20
Going to download record 21 to 30
Going to download record 31 to 40
Going to download record 41 to 50
Going to download record 51 to 60
Going to download record 61 to 70
Going to download record 71 to 80
Going to download record 81 to 90
Going to download record 91 to 100
Going to download record 101 to 110
Going to download record 111 to 120
Going to download record 121 to 130
Going to download record 131 to 140
Going to download record 141 to 150
Going to download record 151 to 160
Going to download record 161 to 170
Going to download record 171 to 180
Going to download record 181 to 190
Going to download record 191 to 200
Going to download record 201 to 210
Going to download record 211 to 220
Going to download record 221 to 230
Going to download record 231 to 240
Going to download record 241 to 250
Going to download record 251 to 260
Going to download record 261 to 270
Going to download record 271 to 280
Going to dow

#### Not bad for downloading 900 abstracts

In terms of starting a research project, having access to this much relevant information is crucial to developing good hypotheses and designing experiments.

### Downloading Sequences

This next section is the most crucial aspect of our project. For the last two sections we have been on an information-gathering hunt, which is important, but if we want to apply the genetic information contained within the NCBI databases we need to download the sequences themselves. We will take a slightly more refined approach to this process because of how much genetic information there is.

As an example, the human genome is about 3.2 Gb (gigabases) long and takes up approximately 964.8 MB. That would be about 72,000 sheets of paper filled with ATCG. Since we want to align several different things, we will ask a more specific question: do any HIV proteins align with human proteins? Our knowledge of how HIV invades the immune system helps here.

<img src="HIV_absorb_cd4_and_ccr5.png" width="400"/>

What we know is that the major players in facilitating HIV invasion of T helper cells are the HIV gp120 protein and the T helper cell CD4 receptor. Changes in these proteins can inhibit invasion of the immune cell and keep people from developing AIDS. In this section we will download sequences for these two proteins to use in subsequent sections of this project.

In [101]:
# Download CD4 fasta
from Bio import Entrez
# Add email and key
Entrez.api_key = "92471a1f25f1f658f6711d0c7bc31ff80708"
Entrez.email = "gree9242@uri.edu"
search_handle = Entrez.esearch(db="nuccore",term="Homo sapiens[orgn] AND cd4 [gene]", usehistory="y", idtype="acc")
search_results = Entrez.read(search_handle)
search_handle.close()
acc_list = search_results["IdList"]
count = int(search_results["Count"])
# Note: esearch returns at most 20 ids by default, so len(acc_list) can be smaller than count
webenv = search_results["WebEnv"]
query_key = search_results["QueryKey"]
print("Found %i results " % count)
print('CD4 accession ids: {}'.format(search_results["IdList"]))

Found 30 results 
CD4 accession ids: ['NM_001382705.1', 'NM_001382714.1', 'NM_001382706.1', 'NM_001382707.1', 'NM_001195016.3', 'NM_001195015.3', 'NM_001195014.3', 'NM_001195017.3', 'NM_000616.5', 'NC_000012.12', 'NG_027688.1', 'MK170450.1', 'DQ895246.2', 'CM000263.1', 'CH471116.2', 'BC025782.1', 'U40625.1', 'U47924.1', 'AB590293.1', 'U01066.1']
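
Notice that the search found 30 results but the list shows only 20 ids. ESearch returns at most `retmax` ids, and the default is 20. Because the search used `usehistory="y"`, the full result set is still kept on the NCBI history server for the batched download that follows. A small check like the one below makes the gap explicit; `unreturned_ids` is a hypothetical helper and the record is a stand-in dictionary shaped like the parsed search results.

```python
def unreturned_ids(search_record):
    """Return how many matching ids were left out of IdList."""
    return int(search_record["Count"]) - len(search_record["IdList"])


# Stand-in for the parsed CD4 search above: 30 matches, 20 ids listed
example_record = {"Count": "30", "IdList": ["id%d" % i for i in range(20)]}
print(unreturned_ids(example_record))
```

If you need every id in `IdList` itself, pass a larger `retmax` to `Entrez.esearch`.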


In [102]:
batch_size = 3
out_handle = open("human_cd4.fasta", "w")
for start in range(0, count, batch_size):
    end = min(count, start + batch_size)
    print("Going to download record %i to %i" % (start + 1, end))
    fetch_handle = Entrez.efetch(
        db="nucleotide",
        rettype="fasta",
        retmode="text",
        retstart=start,
        retmax=batch_size,
        webenv=webenv,
        query_key=query_key,
        idtype="acc",
    )
    data = fetch_handle.read()
    fetch_handle.close()
    out_handle.write(data)
out_handle.close()

Going to download record 1 to 3
Going to download record 4 to 6
Going to download record 7 to 9
Going to download record 10 to 12
Going to download record 13 to 15
Going to download record 16 to 18
Going to download record 19 to 21
Going to download record 22 to 24
Going to download record 25 to 27
Going to download record 28 to 30


In [109]:
# Download gp120 fasta
from Bio import Entrez
# Add email and key
Entrez.api_key = "92471a1f25f1f658f6711d0c7bc31ff80708"
Entrez.email = "gree9242@uri.edu"
search_handle = Entrez.esearch(db="nuccore",term="HIV1 [orgn] AND gp120 [gene]", usehistory="y", idtype="acc")
search_results = Entrez.read(search_handle)
search_handle.close()
acc_list = search_results["IdList"]
count = int(search_results["Count"])
# Note: esearch returns at most 20 ids by default, so len(acc_list) can be smaller than count
webenv = search_results["WebEnv"]
query_key = search_results["QueryKey"]
print("Found %i results " % count)
print('gp120 accession ids: {}'.format(search_results["IdList"]))

Found 1165 results 
gp120 accession ids: ['EU010360.1', 'EU010359.1', 'EU010358.1', 'EU010357.1', 'EU010356.1', 'EU010355.1', 'EU010354.1', 'EU010353.1', 'EU010352.1', 'EU010351.1', 'EU010350.1', 'EU010349.1', 'EU010348.1', 'EU010347.1', 'EU010346.1', 'EU010345.1', 'EU010344.1', 'EU010343.1', 'EU010342.1', 'EU010341.1']


In [63]:
batch_size = 3
out_handle = open("HIV_gp120.fasta", "w")
for start in range(0, count, batch_size):
    end = min(count, start + batch_size)
    print("Going to download record %i to %i" % (start + 1, end))
    fetch_handle = Entrez.efetch(
        db="nucleotide",
        rettype="fasta",
        retmode="text",
        retstart=start,
        retmax=batch_size,
        webenv=webenv,
        query_key=query_key,
        idtype="acc",
    )
    data = fetch_handle.read()
    fetch_handle.close()
    out_handle.write(data)
out_handle.close()

Going to download record 1 to 3
Going to download record 4 to 6
Going to download record 7 to 9
Going to download record 10 to 12
Going to download record 13 to 15
Going to download record 16 to 18
Going to download record 19 to 21
Going to download record 22 to 24
Going to download record 25 to 27
Going to download record 28 to 30
Going to download record 31 to 33
Going to download record 34 to 36
Going to download record 37 to 39
Going to download record 40 to 42
Going to download record 43 to 45
Going to download record 46 to 48
Going to download record 49 to 51
Going to download record 52 to 54
Going to download record 55 to 57
Going to download record 58 to 60
Going to download record 61 to 63
Going to download record 64 to 66
Going to download record 67 to 69
Going to download record 70 to 72
Going to download record 73 to 75
Going to download record 76 to 78
Going to download record 79 to 81
Going to download record 82 to 84
Going to download record 85 to 87
Going to download re

Going to download record 691 to 693
Going to download record 694 to 696
Going to download record 697 to 699
Going to download record 700 to 702
Going to download record 703 to 705
Going to download record 706 to 708
Going to download record 709 to 711
Going to download record 712 to 714
Going to download record 715 to 717
Going to download record 718 to 720
Going to download record 721 to 723
Going to download record 724 to 726
Going to download record 727 to 729
Going to download record 730 to 732
Going to download record 733 to 735
Going to download record 736 to 738
Going to download record 739 to 741
Going to download record 742 to 744
Going to download record 745 to 747
Going to download record 748 to 750
Going to download record 751 to 753
Going to download record 754 to 756
Going to download record 757 to 759
Going to download record 760 to 762
Going to download record 763 to 765
Going to download record 766 to 768
Going to download record 769 to 771
Going to download record 772

In [107]:
# Editing version
# Download gp120 fasta
from Bio import Entrez
# Add email and key
Entrez.api_key = "92471a1f25f1f658f6711d0c7bc31ff80708"
Entrez.email = "gree9242@uri.edu"
search_handle = Entrez.esearch(db="nuccore",term="HIV1 [orgn] AND gp120 [gene]", usehistory="y", idtype="acc")
search_results = Entrez.read(search_handle)
search_handle.close()
#search_results.pop("NC_000012.12")
acc_list = search_results["IdList"]
count = int(search_results["Count"])
count == len(acc_list)
idlist = search_results["IdList"][:9]
webenv = search_results["WebEnv"]
query_key = search_results["QueryKey"]
print("Found %i results " % count)
print('gp120 accession ids: {}'.format(search_results["IdList"][:9]))

Found 1165 results 
gp120 accession ids: ['EU010360.1', 'EU010359.1', 'EU010358.1', 'EU010357.1', 'EU010356.1', 'EU010355.1', 'EU010354.1', 'EU010353.1', 'EU010352.1']


In [105]:
# Editing version


Going to download record 1 to 3


HTTPError: HTTP Error 500: Internal Server Error
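
An HTTP 500 from NCBI can be a passing server problem, or it can mean the request itself was malformed. Retrying only helps with the first case. The sketch below retries a call a few times on 5xx errors and gives up immediately on anything else. `with_retries` is a hypothetical helper, and the attempt count and delay are arbitrary values chosen for illustration.

```python
import time
from urllib.error import HTTPError


def with_retries(call, attempts=3, delay=1.0, sleep=time.sleep):
    """Run call(); on an HTTP 5xx error wait and try again, up to attempts times."""
    for attempt in range(1, attempts + 1):
        try:
            return call()
        except HTTPError as err:
            # Client errors (4xx) and the final failed attempt are re-raised
            if err.code < 500 or attempt == attempts:
                raise
            sleep(delay * attempt)  # wait a little longer after each failure
```

A fetch could then be wrapped as `with_retries(lambda: Entrez.efetch(db="nucleotide", id=idlist, rettype="fasta", retmode="text"))`. If the error persists across retries, check the request parameters before trying again.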

In [None]:
handle = Entrez.efetch(db="nucleotide", id=idlist, retmode="xml")

### Viewing files

Luckily, Jupyter makes it fairly straightforward to view these files (as long as they are UTF-8 encoded, which most NCBI files are). The abstract file and the FASTA files are all in the local directory this notebook is run from. Make sure to go back into these files and view some of the diversity of genetic information that was obtained. One issue with data mining resources like NCBI is that specificity in refining searches is extremely important, because it is very common to download elements that have little to no relevance. In this case, view the last entry of human_cd4.fasta and you will find that we downloaded a whole chromosome of the human genome reference. While it is amazing that we can do that, and it may provide a good reference point, there are some analyses in which we would not want to include that entry.
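
One way to set such an entry aside is to copy the FASTA file while skipping records above a length cutoff. This is a minimal sketch using only the standard library. `read_fasta` and `drop_long_records` are hypothetical helpers, the cutoff is yours to choose, and the output writes each sequence on a single line instead of preserving the original line wrapping.

```python
def read_fasta(path):
    """Yield (header, sequence) pairs from a FASTA file."""
    header, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            elif line and header is not None:
                chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def drop_long_records(in_path, out_path, max_length):
    """Copy records no longer than max_length; return the headers that were dropped."""
    dropped = []
    with open(out_path, "w") as out:
        for header, seq in read_fasta(in_path):
            if len(seq) > max_length:
                dropped.append(header)
            else:
                out.write(">{}\n{}\n".format(header, seq))
    return dropped
```

For example, `drop_long_records("human_cd4.fasta", "human_cd4_filtered.fasta", 100000)` would keep gene-sized records and report which headers were left out. The output file name and the cutoff of 100000 are illustrative assumptions.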

### Onward to alignment

In the next tutorial we will be aligning our pathogen sequences in the fasta file that we downloaded. Make sure you have the notebook and fasta file locally available by running this script. See you over there!