# My own fetch test on Bruna's data

First, I'll import the Entrez module, set my email, and list the searchable
fields (`FieldList`) of the `bioproject` database:


In [2]:
from Bio import Entrez
Entrez.email = 'guilhormo.47@gmail.com'

In [3]:
handle = Entrez.einfo(db='bioproject')
result = Entrez.read(handle)
handle.close()
# result['DbInfo']['FieldList'] holds one dict per searchable field
for field in result['DbInfo']['FieldList']:
  print(f"{field['Name']}: {field['FullName']}, {field['Description']}")

ALL: All Fields, All terms from all searchable fields
UID: UID, Unique number assigned to publication
FILT: Filter, Limits the records
ORGN: Organism, Organism
PRJA: Project Accession, Project Accession
TYPE: Project Type, Project Type
STPE: Project Subtype, Project Subtype
DATE: Registration Date, Registration Date
TITL: Title, Title
CEN: Submitter Organization, Submitter Organization(s)
ACCN: Replicon accession, Space delimited GenBank or RefSeq Replicon Accessions
RTYP: Replicon type, Replicon Type
RNME: Replicon name, Replicon Name
LTP: Locus Tag Prefix, Locus Tag Prefix
WORD: Description, Organism/Project Description
KWRD: Keyword, Keyword
PROP: Properties, Project/Organism Properties
DTPE: Project Data Type, Project Data Type
GRNT: Grant ID, Grant ID
FUND: Funding Agency, Funding Agency
PMID: PMID, Pubmed ID
DOID: DOI, DOI ID
PID: ProjectID, Project ID
RELV: Relevance, Relevance
ANME: Assembly name, Assembly Name
BPRJ: BioProject ID, BioProject ID or accession
TPRJ: Top Bioprojec

The project accession field is `PRJA`, so the search term will be `PRJNA788342[PRJA]`:

In [4]:
handle = Entrez.esearch(db='bioproject', term='PRJNA788342[PRJA]')
result = Entrez.read(handle)
project_id = result['IdList'][0]
result

{'Count': '1', 'RetMax': '1', 'RetStart': '0', 'IdList': ['788342'], 'TranslationSet': [], 'TranslationStack': [{'Term': 'PRJNA788342[PRJA]', 'Field': 'PRJA', 'Count': '1', 'Explode': 'N'}, 'GROUP'], 'QueryTranslation': 'PRJNA788342[PRJA]'}

In [5]:
# Remember to close the handle!
handle.close()
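Taking `result['IdList'][0]` works here because the search returned exactly one hit. As a minimal sketch (the helper name `single_uid` is my own, not part of Biopython), the same lookup could be guarded so that an empty or ambiguous search fails loudly instead of silently picking the first ID:

```python
def single_uid(esearch_result):
    """Return the only UID in a parsed esearch result.

    `esearch_result` is assumed to be the dict-like object returned by
    Entrez.read() for an esearch call, with an 'IdList' entry.
    """
    ids = list(esearch_result["IdList"])
    if len(ids) != 1:
        # Zero hits or several hits both mean the term needs another look.
        raise ValueError(f"expected exactly one match, got {len(ids)}: {ids}")
    return ids[0]
```

With the result shown above, `single_uid(result)` would give the same `'788342'` that was stored in `project_id`.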

Now that we've found the project ID in the bioproject database, we'll
use `elink` to look up its linked records in the SRA database:

In [6]:
handle = Entrez.elink(dbfrom='bioproject', db='sra', id=project_id)
result = Entrez.read(handle)
result
result[0].keys()

dict_keys(['LinkSetDb', 'LinkSetDbHistory', 'ERROR', 'DbFrom', 'IdList'])

In [7]:
result[0]['LinkSetDb'][0].keys()

dict_keys(['Link', 'DbTo', 'LinkName'])

In [8]:
result[0]['LinkSetDb'][0]['Link']

[{'Id': '18549310'},
 {'Id': '18549309'},
 {'Id': '18549308'},
 {'Id': '18549307'},
 {'Id': '18549306'},
 {'Id': '18549305'},
 {'Id': '18549304'},
 {'Id': '18549303'},
 {'Id': '18549302'},
 {'Id': '18549301'},
 {'Id': '18549300'},
 {'Id': '18549299'},
 {'Id': '18549298'},
 {'Id': '18549297'},
 {'Id': '18549296'},
 {'Id': '18549295'},
 {'Id': '18549294'},
 {'Id': '18549293'},
 {'Id': '18549292'},
 {'Id': '18549291'},
 {'Id': '18549290'},
 {'Id': '18549289'},
 {'Id': '18549288'},
 {'Id': '18549287'},
 {'Id': '18549286'},
 {'Id': '18549285'},
 {'Id': '18549284'},
 {'Id': '18549283'},
 {'Id': '18549282'},
 {'Id': '18549281'},
 {'Id': '18549280'},
 {'Id': '18549279'},
 {'Id': '18549278'},
 {'Id': '18549277'},
 {'Id': '18549276'},
 {'Id': '18549275'},
 {'Id': '18549274'},
 {'Id': '18549273'},
 {'Id': '18549272'},
 {'Id': '18549271'},
 {'Id': '18549270'},
 {'Id': '18549269'},
 {'Id': '18549268'},
 {'Id': '18549267'},
 {'Id': '18549266'},
 {'Id': '18549265'},
 {'Id': '18549264'},
 {'Id': '1854
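The nesting explored in the cells above (`result[0]['LinkSetDb'][0]['Link']`) assumes that at least one link set came back. A small helper can walk the same structure without hard-coded indexes and simply return an empty list when nothing is linked. This is only a sketch: `linked_ids` is a hypothetical name, and it relies on the keys (`LinkSetDb`, `DbTo`, `Link`, `Id`) seen in the outputs above.

```python
def linked_ids(linksets, db_to="sra"):
    """Collect the linked IDs for one target database from a parsed elink result.

    `linksets` is assumed to be the list-like object returned by
    Entrez.read() for an elink call.
    """
    ids = []
    for linkset in linksets:
        # 'LinkSetDb' is an empty list when the record has no links.
        for linkset_db in linkset.get("LinkSetDb", []):
            if linkset_db.get("DbTo") == db_to:
                ids.extend(link["Id"] for link in linkset_db["Link"])
    return ids
```

Called as `linked_ids(result)`, it should yield the same IDs as the list printed above, in the same order.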

In [9]:
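handle.close()
# Collect the linked SRA IDs (avoid shadowing the built-in `id`)
sra_ids = [link['Id'] for link in result[0]['LinkSetDb'][0]['Link']]
len(sra_ids)
# Save one ID per line for later use
with open('sra_ids.txt', 'w') as output:
  for sra_id in sra_ids:
    output.write(f'{sra_id}\n')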

`sra_ids` now holds the SRA ID for each sample taken during the experiment,
and the same IDs are saved to `sra_ids.txt`.
Each sample contained thousands of reads, which the authors rarefied to
about 50 thousand reads per sample.
Next up (probably tomorrow) we'll try to fetch some of those FASTQ files!
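As a possible first step toward those FASTQ files, the SRA IDs can be turned into run accessions by requesting the "runinfo" table with `Entrez.efetch`. The sketch below is untested against this project and makes several assumptions: that `rettype='runinfo'` with `retmode='text'` returns CSV with a `Run` column, that batches of 100 IDs are acceptable, and that the helper names (`batched`, `parse_run_accessions`, `fetch_run_accessions`) are my own. Biopython is imported inside the fetching function so the parsing helpers can be tried offline.

```python
import csv
import io


def batched(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def parse_run_accessions(runinfo_csv):
    """Extract the 'Run' column from runinfo CSV text.

    Repeated header rows (which can appear when tables are concatenated)
    are skipped.
    """
    reader = csv.DictReader(io.StringIO(runinfo_csv))
    return [row["Run"] for row in reader
            if row.get("Run") and row["Run"] != "Run"]


def fetch_run_accessions(sra_ids, email, batch_size=100):
    """Fetch run accessions for SRA IDs (assumed efetch parameters, see above)."""
    from Bio import Entrez  # requires Biopython and network access

    Entrez.email = email
    runs = []
    for batch in batched(sra_ids, batch_size):
        handle = Entrez.efetch(db="sra", id=",".join(batch),
                               rettype="runinfo", retmode="text")
        try:
            data = handle.read()
        finally:
            handle.close()
        if isinstance(data, bytes):  # some Biopython versions return bytes
            data = data.decode("utf-8")
        runs.extend(parse_run_accessions(data))
    return runs
```

If this works as assumed, the resulting accessions could then be handed to a download tool such as `fasterq-dump` from the SRA Toolkit.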