# EInfo: Obtaining information about the Entrez databases

In [1]:
from Bio import Entrez
Entrez.email = "vela.vela.luis@gmail.com"  
handle = Entrez.einfo()
result = handle.read()
handle.close()
print(result)

<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE eInfoResult PUBLIC "-//NLM//DTD einfo 20130322//EN" "https://eutils.ncbi.nlm.nih.gov/eutils/dtd/20130322/einfo.dtd">
<eInfoResult>
<DbList>

	<DbName>pubmed</DbName>
	<DbName>protein</DbName>
	<DbName>nuccore</DbName>
	<DbName>ipg</DbName>
	<DbName>nucleotide</DbName>
	<DbName>nucgss</DbName>
	<DbName>nucest</DbName>
	<DbName>structure</DbName>
	<DbName>sparcle</DbName>
	<DbName>genome</DbName>
	<DbName>annotinfo</DbName>
	<DbName>assembly</DbName>
	<DbName>bioproject</DbName>
	<DbName>biosample</DbName>
	<DbName>blastdbinfo</DbName>
	<DbName>books</DbName>
	<DbName>cdd</DbName>
	<DbName>clinvar</DbName>
	<DbName>clone</DbName>
	<DbName>gap</DbName>
	<DbName>gapplus</DbName>
	<DbName>grasp</DbName>
	<DbName>dbvar</DbName>
	<DbName>gene</DbName>
	<DbName>gds</DbName>
	<DbName>geoprofiles</DbName>
	<DbName>homologene</DbName>
	<DbName>medgen</DbName>
	<DbName>mesh</DbName>
	<DbName>ncbisearch</DbName>
	<DbName>nlmcatalog</DbName>
	<DbName

In [2]:
from Bio import Entrez
handle = Entrez.einfo()
record = Entrez.read(handle)
record.keys()

dict_keys(['DbList'])

In [3]:
record["DbList"]

['pubmed', 'protein', 'nuccore', 'ipg', 'nucleotide', 'nucgss', 'nucest', 'structure', 'sparcle', 'genome', 'annotinfo', 'assembly', 'bioproject', 'biosample', 'blastdbinfo', 'books', 'cdd', 'clinvar', 'clone', 'gap', 'gapplus', 'grasp', 'dbvar', 'gene', 'gds', 'geoprofiles', 'homologene', 'medgen', 'mesh', 'ncbisearch', 'nlmcatalog', 'omim', 'orgtrack', 'pmc', 'popset', 'probe', 'proteinclusters', 'pcassay', 'biosystems', 'pccompound', 'pcsubstance', 'seqannot', 'snp', 'sra', 'taxonomy', 'biocollections', 'unigene', 'gencoll', 'gtr']
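
`Entrez.read` does the XML parsing in the cells above. For comparison, the raw response printed by the first cell could also be handled with the standard library. This is a minimal sketch, run on a shortened copy of that response (declaration and DOCTYPE omitted) rather than on a live request; the helper name `db_names` is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the EInfo response shown earlier (only three databases kept)
SAMPLE_XML = """<eInfoResult>
<DbList>
    <DbName>pubmed</DbName>
    <DbName>protein</DbName>
    <DbName>gds</DbName>
</DbList>
</eInfoResult>"""


def db_names(xml_text):
    """Return the text of every <DbName> element in an EInfo response."""
    root = ET.fromstring(xml_text)
    return [node.text for node in root.iter("DbName")]


names = db_names(SAMPLE_XML)
print(names)
print("gds" in names)
```

In practice `Entrez.read` is the more convenient route, since it also handles the per-database `DbInfo` structure used in the next cells.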

In [4]:
handle = Entrez.einfo(db="gds")
record = Entrez.read(handle)
handle.close()
record["DbInfo"]["Description"]

'GEO DataSets'

In [5]:
record["DbInfo"]["Count"]

'3064302'

In [6]:
record["DbInfo"]["LastUpdate"]

'2019/03/10 17:09'

In [7]:
for field in record["DbInfo"]["FieldList"]:
    print("%(Name)s, %(FullName)s, %(Description)s" % field)

ALL, All Fields, All terms from all searchable fields
UID, UID, Unique number assigned to publication
FILT, Filter, Limits the records
ORGN, Organism, exploded organism names
ACCN, GEO Accession, accession for GDS (DataSet), GPL (Platform), GSM (Sample), GSE (Series)
TITL, Title, Words in title of record
DESC, Description, Text from description, summary and other similar fields
SFIL, Supplementary Files, Supplementary Files
ETYP, Entry Type, Entry type (DataSet or Series)
STYP, Sample Type, Sample type
VTYP, Sample Value Type, type of values, e.g. log ratio, count
PTYP, Platform Technology Type, Platform technology type
GTYP, DataSet Type, type of dataset
NSAM, Number of Samples, Number of samples
SRC, Sample Source, sample source
AUTH, Author, author of the GEO Sample, Platform or Series
INST, Submitter Institute, institute, or organization affiliatedd with contributers
NPRO, Number of Platform Probes, number of platform probes
SSTP, Subset Variable Type, subset variable type
SSDE, Su
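
The names in the first column of this listing are the tags that go in square brackets in a search term, for example `GSE117746[ACCN]`. A small sketch of how such terms could be assembled; `field_term` and `all_of` are hypothetical helpers, not part of Biopython.

```python
def field_term(value, tag):
    """Format one search clause such as 'GSE117746[ACCN]'."""
    return f"{value}[{tag}]"


def all_of(*clauses):
    """Join clauses with AND so that every one of them must match."""
    return " AND ".join(clauses)


print(field_term("GSE117746", "ACCN"))
print(all_of(field_term("GSE", "ETYP"), field_term("Homo", "Organism")))
```

The resulting strings are meant to be passed as the `term` argument of `Entrez.esearch`.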

# ESearch: Searching the Entrez databases

In [6]:
from Bio import Entrez
Entrez.email = "vela.vela.luis@gmail.com" 
handle = Entrez.esearch(db="gds", term='GSE117746[ACCN]')
record = Entrez.read(handle)
print(record['IdList'])

['200117746', '100018573', '303308153', '303308152', '303308151', '303308150', '303308149', '303308148']


In [13]:
count = record['Count']
handle = Entrez.esearch(db="gds", term='GSE117746[ACCN]', retmax=count)
record = Entrez.read(handle)
print(len(record['IdList']))

8


In [7]:
record['IdList'][:25]

['200117746',
 '100018573',
 '303308153',
 '303308152',
 '303308151',
 '303308150',
 '303308149',
 '303308148']

In [16]:
'303308148' in record['IdList']

True
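
The IDs returned here appear to encode the accession type in their first digit: `200117746` is the UID of the series GSE117746, and the `3033081xx` IDs line up with sample accessions of the form GSM33081xx. The sketch below turns that observation into a helper. The prefix table is an assumption inferred from this one result set (the GPL mapping in particular), not a documented guarantee, and `uid_to_accession` is a made-up name.

```python
# Prefix digit -> accession type, inferred from the IDs printed above (an assumption)
PREFIXES = {"1": "GPL", "2": "GSE", "3": "GSM"}


def uid_to_accession(uid):
    """Guess the GEO accession for a 9-digit 'gds' UID, or return None."""
    if len(uid) != 9 or not uid.isdigit() or uid[0] not in PREFIXES:
        return None
    # Drop the prefix digit and the zero padding that follows it
    return PREFIXES[uid[0]] + str(int(uid[1:]))


for uid in ["200117746", "100018573", "303308153"]:
    print(uid, uid_to_accession(uid))
```

When the accession matters, the `Accession` field returned by ESummary is the safer source.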

# ESummary: Retrieving summaries from primary IDs

In [9]:
from Bio import Entrez
Entrez.email = "vela.vela.luis@gmail.com"
handle = Entrez.esummary(db="gds", id="200117746")
record = Entrez.read(handle)

In [10]:
record[0].keys()

dict_keys(['Item', 'Id', 'Accession', 'GDS', 'title', 'summary', 'GPL', 'GSE', 'taxon', 'entryType', 'gdsType', 'ptechType', 'valType', 'SSInfo', 'subsetInfo', 'PDAT', 'suppFile', 'Samples', 'Relations', 'ExtRelations', 'n_samples', 'SeriesTitle', 'PlatformTitle', 'PlatformTaxa', 'SamplesTaxa', 'PubMedIds', 'Projects', 'FTPLink', 'GEO2R'])

In [11]:
record[0]['summary']

'CDK4/6 inhibition is now part of the standard armamentarium for patients with estrogen receptor (ER)-positive breast cancer, so that defining mechanisms of resistance is a pressing issue. Here, we identify increased CDK6 expression as a key determinant of acquired resistance after exposure to palbociclib in ER-positive breast cancer cells. Increased CDK6 in resistant cells was dependent on TGF-β pathway suppression via miR-432-5p expression. Exosomal miR-432-5p expression mediated transfer of the resistance phenotype between neighboring cell populations. We confirmed these data in pre-treatment and post-progression biopsies from a parotid cancer patient who had responded to ribociclib, demonstrating clinical relevance of this mechanism. Additionally, the CDK4/6 inhibitor resistance phenotype can be reversed in vitro and in vivo by a prolonged drug holiday.'

In [20]:
record[0]['Samples']

[DictElement({'Accession': 'GSM3308149', 'Title': 'T47D control RNA 2'}, attributes={}),
 DictElement({'Accession': 'GSM3308152', 'Title': 'T47D mir-432-5p captured RNA 2'}, attributes={}),
 DictElement({'Accession': 'GSM3308148', 'Title': 'T47D control RNA 1'}, attributes={}),
 DictElement({'Accession': 'GSM3308151', 'Title': 'T47D mir-432-5p captured RNA 1'}, attributes={}),
 DictElement({'Accession': 'GSM3308153', 'Title': 'T47D mir-432-5p captured RNA 3'}, attributes={}),
 DictElement({'Accession': 'GSM3308150', 'Title': 'T47D control RNA 3'}, attributes={})]
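
Each entry in `Samples` can be indexed like a dictionary, so the list is easy to reshape. A minimal sketch that builds an accession-to-title lookup, using plain dictionaries as stand-ins for the `DictElement` objects; the function name is hypothetical.

```python
# Plain dicts standing in for the DictElement objects shown above
samples = [
    {"Accession": "GSM3308149", "Title": "T47D control RNA 2"},
    {"Accession": "GSM3308152", "Title": "T47D mir-432-5p captured RNA 2"},
    {"Accession": "GSM3308148", "Title": "T47D control RNA 1"},
]


def titles_by_accession(sample_list):
    """Map each sample accession to its title, ordered by accession."""
    pairs = sorted((s["Accession"], s["Title"]) for s in sample_list)
    return dict(pairs)


for accession, title in titles_by_accession(samples).items():
    print(accession, title)
```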

In [12]:
for key, value in record[0].items():
    print(key, ' : ', value)

Item  :  []
Id  :  200117746
Accession  :  GSE117746
GDS  :  
title  :  MicroRNA-mediated suppression of the TGF-β pathway confers transmissible and reversible CDK4/6 inhibitor resistance (RNA-Seq)
summary  :  CDK4/6 inhibition is now part of the standard armamentarium for patients with estrogen receptor (ER)-positive breast cancer, so that defining mechanisms of resistance is a pressing issue. Here, we identify increased CDK6 expression as a key determinant of acquired resistance after exposure to palbociclib in ER-positive breast cancer cells. Increased CDK6 in resistant cells was dependent on TGF-β pathway suppression via miR-432-5p expression. Exosomal miR-432-5p expression mediated transfer of the resistance phenotype between neighboring cell populations. We confirmed these data in pre-treatment and post-progression biopsies from a parotid cancer patient who had responded to ribociclib, demonstrating clinical relevance of this mechanism. Additionally, the CDK4/6 inhibitor resistan

# Automating the lookup for a GSE accession

In [1]:
from Bio import Entrez
Entrez.email = "vela.vela.luis@gmail.com" 

query_dataset = 'GSE117746'
type_of_query = '[ACCN]'

In [2]:
handle = Entrez.esearch(db="gds", term=query_dataset+type_of_query)
record = Entrez.read(handle)
every_id = record['IdList']
print(every_id)

['200117746', '100018573', '303308153', '303308152', '303308151', '303308150', '303308149', '303308148']


In [3]:
for e_id in every_id:
    # Fetch and parse the summary for one UID, then release the connection
    handle = Entrez.esummary(db="gds", id=e_id)
    record = Entrez.read(handle)
    handle.close()
    print('ID: ', e_id)
    print()
    print('Summary: ', record[0]['summary'])
    print()
    print('Samples: ', *record[0]['Samples'])
    print()
    print()

ID:  200117746

Summary:  CDK4/6 inhibition is now part of the standard armamentarium for patients with estrogen receptor (ER)-positive breast cancer, so that defining mechanisms of resistance is a pressing issue. Here, we identify increased CDK6 expression as a key determinant of acquired resistance after exposure to palbociclib in ER-positive breast cancer cells. Increased CDK6 in resistant cells was dependent on TGF-β pathway suppression via miR-432-5p expression. Exosomal miR-432-5p expression mediated transfer of the resistance phenotype between neighboring cell populations. We confirmed these data in pre-treatment and post-progression biopsies from a parotid cancer patient who had responded to ribociclib, demonstrating clinical relevance of this mechanism. Additionally, the CDK4/6 inhibitor resistance phenotype can be reversed in vitro and in vivo by a prolonged drug holiday.

Samples:  DictElement({'Accession': 'GSM3308149', 'Title': 'T47D control RNA 2'}, attributes={}) DictEle
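
The loop above sends one request per ID. The E-utilities also accept several IDs in a single comma-separated `id` value, so the requests could be grouped. A sketch of the grouping step only; `id_batches` is a hypothetical helper, and the batch size of 2 is chosen just to keep the printout short.

```python
def id_batches(ids, size):
    """Yield comma-separated ID strings holding at most `size` IDs each."""
    for start in range(0, len(ids), size):
        yield ",".join(ids[start:start + size])


every_id = ["200117746", "100018573", "303308153", "303308152", "303308151"]
for id_string in id_batches(every_id, 2):
    # Each string could be passed as Entrez.esummary(db="gds", id=id_string);
    # Entrez.read would then return one entry per ID instead of only record[0].
    print(id_string)
```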

# Improved Script

In [5]:
# Imports
from Bio import Entrez
import csv
import time
import pandas as pd
from urllib.error import HTTPError  # for Python 2 use: from urllib2 import HTTPError

# Define max_samples
max_samples = 10

# Define email
Entrez.email = "vela.vela.luis@gmail.com"

# Perform search - get handle
handle = Entrez.esearch(db="gds", term="GSE[ETYP] AND Homo[Organism]", usehistory="y", retmax=max_samples)

# Read results
record = Entrez.read(handle)

# Get idlist
idlist = record['IdList']

# Count entries
found_count = int(record['Count'])
read_count = len(idlist)

# Echo results
print('Total number of FOUND entries: ' + str(found_count))
print('Total number of READ  entries: ' + str(read_count))

# Close handle
handle.close()

Total number of FOUND entries: 46584
Total number of READ  entries: 10
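
Because the search was made with `usehistory="y"`, the parsed result should also carry `WebEnv` and `QueryKey` entries that refer to the result set stored on the server. Those can be combined with `retstart` and `retmax` to walk through all found entries page by page instead of re-sending IDs. The sketch below only builds the keyword arguments; `history_pages` and the stand-in record are hypothetical, and no request is made.

```python
def history_pages(search_record, page_size):
    """Yield keyword arguments for fetching a stored search page by page."""
    total = int(search_record["Count"])
    for start in range(0, total, page_size):
        yield {
            "webenv": search_record["WebEnv"],
            "query_key": search_record["QueryKey"],
            "retstart": start,
            "retmax": page_size,
        }


# Hypothetical stand-in for the dictionary returned by Entrez.read
fake_record = {"Count": "25", "WebEnv": "EXAMPLE_WEBENV", "QueryKey": "1"}
for kwargs in history_pages(fake_record, 10):
    # Each set could be used as Entrez.esummary(db="gds", **kwargs)
    print(kwargs["retstart"], kwargs["retmax"])
```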


In [6]:
idlist[:15]

['200112120',
 '200117735',
 '200117734',
 '200128119',
 '200126367',
 '200128077',
 '200126109',
 '200106774',
 '200125561',
 '200117301']

In [7]:
filename = 'test_1000.csv'

# Open csv-target file
with open(filename, 'w', newline='', encoding='utf-8') as opened_file:

    # write with writer
    csvwriter = csv.writer(opened_file)

    # Set fieldnames (the full list of keys returned by ESummary is kept for reference)
    # fieldnames = ['Item', 'Id', 'Accession', 'GDS', 'title',
    #               'summary', 'GPL', 'GSE', 'taxon', 'entryType', 'gdsType',
    #               'ptechType', 'valType', 'SSInfo', 'subsetInfo', 'PDAT',
    #               'suppFile', 'Samples', 'Relations', 'ExtRelations',
    #               'n_samples', 'SeriesTitle', 'PlatformTitle', 'PlatformTaxa',
    #               'SamplesTaxa', 'PubMedIds', 'Projects', 'FTPLink', 'GEO2R']
    fieldnames = ['Id', 'Accession', 'title', 'summary', 'taxon']

    # Write the header row
    csvwriter.writerow(fieldnames)

    # Begin retrieval
    for i, e_id in enumerate(idlist):
    
        # Echo info
        print("Going to download record: {:10.0f} ({:5.1f}%)".format(int(e_id), (i+1)/read_count*100))
    
        # Get Summary
        handle = Entrez.esummary(db="gds", id=e_id)
    
        # Read handle
        data = Entrez.read(handle)

        # Collect the selected fields in column order
        list_to_print = [data[0][name] for name in fieldnames]

        # Write the row
        csvwriter.writerow(list_to_print)

        # Close handle
        handle.close()

Going to download record:  200112120 ( 10.0%)
Going to download record:  200117735 ( 20.0%)
Going to download record:  200117734 ( 30.0%)


HTTPError: HTTP Error 500: Internal Server Error
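
An HTTP 500 like this one comes from the server side and is often temporary, so re-sending the request after a short pause is a common workaround. A minimal sketch of such a retry wrapper, assuming that only 5xx responses are worth retrying; `call_with_retry` and its parameters are made up for illustration. The `sleep` parameter exists only so the waiting can be replaced in a test.

```python
import time
from urllib.error import HTTPError


def call_with_retry(func, attempts=3, delay=1.0, sleep=time.sleep):
    """Call func(), retrying on HTTP 5xx errors with a doubling pause."""
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except HTTPError as err:
            # Client errors (4xx) and the final failed attempt are re-raised
            if err.code < 500 or attempt == attempts:
                raise
            sleep(delay * 2 ** (attempt - 1))


# Demonstration with a function that fails twice before succeeding
calls = []


def flaky():
    calls.append(1)
    if len(calls) < 3:
        raise HTTPError("http://example.invalid", 500, "Internal Server Error", None, None)
    return "ok"


print(call_with_retry(flaky, delay=0))
```

Inside the download loop, the request could then be written as `call_with_retry(lambda: Entrez.esummary(db="gds", id=e_id))`.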

In [195]:
# Read to DataFrame
pd.read_csv(filename, sep=',', header=0, on_bad_lines='skip')  # on_bad_lines replaces error_bad_lines in pandas >= 1.3

Unnamed: 0,Id,Accession,title,summary,taxon
0,200112120,GSE112120,Risk SNPs mediated promoter-enhancer switching...,To determine the binding of H3K4me1 and H3K4me...,Homo sapiens
1,200117735,GSE117735,The ATPase module of mammalian SWI/SNF family ...,This SuperSeries is composed of the SubSeries ...,Homo sapiens
2,200117734,GSE117734,The mSWI/SNF ATPase module mediates subcomplex...,Perturbations to mammalian SWI/SNF (mSWI/SNF) ...,Homo sapiens
3,200128119,GSE128119,COX-2 mediates tumor-stromal Prolactin signali...,Tumor-stromal communication within the microen...,Homo sapiens
4,200126367,GSE126367,Copy number analysis of selumetinib-resistant ...,Copy number analysis to compare parental color...,Homo sapiens
5,200128077,GSE128077,Whole genome-derived tiled peptide arrays dete...,Investigation of whole genome-derived tiled pe...,Homo sapiens
6,200126109,GSE126109,RNA sequencing analysis of selumetinib-resista...,RNA sequencing analysis to compare parental co...,Homo sapiens
7,200106774,GSE106774,Risk SNPs mediated promoter-enhancer switching...,To determine the functional mechanisms of PCAT...,Homo sapiens
8,200125561,GSE125561,Transcription factor SPIB binding sites identi...,SPIB overexpressed in lung cancer and promoted...,Homo sapiens
9,200117301,GSE117301,The mSWI/SNF ATPase module mediates subcomplex...,Perturbations to mammalian SWI/SNF (mSWI/SNF) ...,Homo sapiens
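
The cell that wrote this file picks each field out of the record by hand. `csv.DictWriter` can do that selection itself when it is given `extrasaction="ignore"`. A sketch on an in-memory buffer, with a hypothetical dictionary standing in for one ESummary entry:

```python
import csv
import io

fieldnames = ["Id", "Accession", "title", "summary", "taxon"]

# Hypothetical record with one extra key, standing in for an ESummary entry
summary = {
    "Id": "200117746",
    "Accession": "GSE117746",
    "title": "Example title",
    "summary": "Example summary",
    "taxon": "Homo sapiens",
    "n_samples": 6,
}

# newline="" leaves line endings to the csv module, as with a real file
buffer = io.StringIO(newline="")
writer = csv.DictWriter(buffer, fieldnames=fieldnames, extrasaction="ignore")
writer.writeheader()
writer.writerow(summary)  # keys not listed in fieldnames are skipped

# Read the data back to check the round trip
buffer.seek(0)
rows = list(csv.DictReader(buffer))
print(rows[0]["Accession"], rows[0]["taxon"])
```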
