### PubMed Extract
Libraries:
**Bio:** Biopython, used to call *NCBI's Entrez* utilities and retrieve PubMed information
**os:** Python's miscellaneous operating system interfaces, used here to get the working directory
**random (randint):** generates a random integer
**time:** sleep functions, for pausing between requests (with a random number of seconds)
**collections:** defaultdict, for dealing with missing keys
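
MEDLINE records do not always carry every field, which is why defaultdict appears in the list above. The following is a minimal sketch of two common ways to handle a missing key. The sample record and the `'Key Not found'` default are illustrative values, not fetched data.

```python
from collections import defaultdict

# Illustrative record with only two MEDLINE fields present (made-up values)
record = {"PMID": "00000000", "TI": "An example title."}

# Option 1: dict.get() returns a fallback instead of raising KeyError
author_ids = record.get("AUID", "")

# Option 2: wrap the record in a defaultdict so any absent key yields a default
safe_record = defaultdict(lambda: "Key Not found", record)
grant = safe_record["GR"]

print(repr(author_ids))
print(grant)
```

Note that looking up an absent key in a defaultdict also stores the default under that key, so `dict.get()` is the lighter choice when the record should stay unchanged.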

In [1]:
import os
# re module for regular expressions - used to pull DOIs out of a string
import re
# random number generator, used to vary the pause between requests
from random import randint
# time provides sleep() for pausing between requests
import time
# defaultdict deals with missing keys in the data dictionary of fetched records
import collections
from collections import defaultdict
# Biopython for NCBI's Entrez utilities
from Bio import Entrez
from Bio import SeqIO
# Medline parses records fetched with rettype="medline"
from Bio import Medline
# Bio.Alphabet was removed in Biopython 1.78 and is not used below, so it is not imported
# NCBI asks for a contact email with Entrez requests
Entrez.email = 'jamison@library.ucla.edu'
# base URL for the PMC ID Converter API
converterBase = "https://www.ncbi.nlm.nih.gov/pmc/utils/idconv/v1.0/"
# Crossref REST API client
from crossref.restful import Works

In [2]:
# esearch: searches and retrieves the primary IDs used by efetch, elink and esummary
# efetch: retrieves records in the requested format from a list of primary IDs

In [3]:
# uses python os module
# tells you where your default working directory is - ie. where files will be written
os.getcwd()

'/home/jmjamison/work/pubmedExtracts'

In [4]:
query='("University of California, Los Angeles"[Affiliation] OR "University of California Los Angeles"[Affiliation] OR "University of California at Los Angeles"[Affiliation] OR "University of California of Los Angeles"[Affiliation] OR "University of California in Los Angeles"[Affiliation] OR UCLA[Affiliation] OR 90095[Affiliation]) AND ("2014/01/01"[PDAT] : "2018/01/31"[PDAT])'
handle=Entrez.esearch(db="pubmed", term=query)
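
The long query string above is easier to maintain when it is assembled from a list of name variants. A minimal sketch follows. The helper name `build_affiliation_query` and the variable names are hypothetical. The variants, field tags and dates are taken from the query in this cell.

```python
# Name variants to match in the [Affiliation] field (taken from the query above)
variants = [
    "University of California, Los Angeles",
    "University of California Los Angeles",
    "University of California at Los Angeles",
    "University of California of Los Angeles",
    "University of California in Los Angeles",
]
# Terms that the original query uses without quotes
bare_terms = ["UCLA", "90095"]


def build_affiliation_query(variants, bare_terms, start, end):
    """Join affiliation variants with OR and restrict by publication date."""
    parts = ['"{}"[Affiliation]'.format(v) for v in variants]
    parts += ["{}[Affiliation]".format(t) for t in bare_terms]
    affiliation = "(" + " OR ".join(parts) + ")"
    dates = '("{}"[PDAT] : "{}"[PDAT])'.format(start, end)
    return affiliation + " AND " + dates


built_query = build_affiliation_query(variants, bare_terms, "2014/01/01", "2018/01/31")
print(built_query)
```

The resulting string can be passed as the `term` argument of `Entrez.esearch` in the same way as `query`.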

#### Entrez esearch, read the results

In [5]:
handle = Entrez.esearch(db="pubmed", term=query)
record=Entrez.read(handle)
record["IdList"]

['29686938', '29686739', '29686725', '29682398', '29682382', '29681647', '29676225', '29673526', '29672601', '29671273', '29671268', '29671267', '29671266', '29671265', '29671264', '29670931', '29670925', '29670334', '29666723', '29664956']

#### Get and print the total number of records. Without a maximum (retmax), esearch lists only the first 20 PubMed IDs

In [6]:
record["Count"]
recordCount = record["Count"]
print(recordCount)

21640


#### Same search, but specifying the maximum number of records to return; the total count is unchanged

In [7]:
handle=Entrez.esearch(db="pubmed", term=query, retmax=recordCount)
result=Entrez.read(handle)
# read the count from this search's result, not from the earlier record
recordCount = result["Count"]
print(recordCount)

21640


#### Write the list of PubMed record numbers to a file

In [8]:
# f = open('/home/jmjamison/work/pubmedExtracts/outputList.txt', "w")
# commented out so the file is written to the default working directory
#
# write out the search result, which includes the id list
f = open('outputList.txt', "w")
f.write(str(result))
f.close()

In [9]:
handle=Entrez.efetch(db="pubmed", id=17383002, retmode="xml")
record=Entrez.read(handle)
# print(record) comment out print statement

#### search, read and fetch records, pull selected PubMed fields. 

In [10]:
#fetchHandle=Entrez.efetch(db="pubmed", id=17383002, retmode="xml")
infoHandle = Entrez.einfo()
record = Entrez.read(infoHandle)
infoHandle.close()
print(record.keys())
record['DbList']

infoHandle = Entrez.einfo(db="pubmed")
record = Entrez.read(infoHandle)
record["DbInfo"]["Description"]
record["DbInfo"]["Count"]
record["DbInfo"].keys()

dict_keys(['DbList'])


dict_keys(['DbName', 'MenuName', 'Description', 'DbBuild', 'Count', 'LastUpdate', 'FieldList', 'LinkList'])

In [11]:
handle=Entrez.esearch(db="pubmed", term=query, retmax=recordCount)
record=Entrez.read(handle)
record["Count"]

'21640'

#### Crossref REST API
"Singletons" are single results. Retrieving metadata for a specific identifier (e.g. DOI, ISSN, funder_identifier) typically returns a singleton result.

In [None]:
# identify yourself to NCBI before making any requests
Entrez.email = 'jamison@library.ucla.edu'
handle = Entrez.esearch(db="pubmed", term=query)
record=Entrez.read(handle)
idList = record["IdList"]
print(idList)
# Medline.parse expects plain-text MEDLINE records, so request retmode="text"
handle = Entrez.efetch(db="pubmed", id=idList, rettype="medline", retmode="text")
records = Medline.parse(handle)
# declaring defaultdict
# sets default value 'Key Not found' to absent keys
#records = collections.defaultdict(lambda : 'Key Not found')
# Crossref "polite pool": you can do one of two things to get directed to it.
# One is to include a "mailto" parameter in your query.

works=Works()
for record in records:
    PMID = record["PMID"]
    print(PMID)
    # PMC = record["PMC"]
    # print(PMC)
    # DOI lookup through the ID Converter API (disabled)
    # getDOI = converterBase + "?tool=my_tool&email=jamison@library.ucla.edu&ids=" + PMID + "&format=json"
    # print(getDOI)  # left commented out: getDOI is only defined when the line above is enabled
    # record.get() returns "" when a field is missing instead of raising KeyError
    print("Pubmed Record ID (PMID): " + record.get('PMID', ""))
    # TI: title
    print("Title: " + record.get('TI', ""))
    # AUID: author identifier
    print("Author Identifier (AUID): " + str(record.get('AUID', "")))
    # AU: list of authors
    print("Author(s)(AU): " + str(record.get('AU', "")))
    # AD: affiliation
    print("Author affiliation (AD): " + str(record.get('AD', "")))
    # AID: article identifier, including the DOI
    print("Article Identifier/DOI (AID): " + str(record.get('AID', "")))
    # GR: grant number
    print(record.get('GR', ""))
    print("\n")
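
The cell above builds the ID Converter URL by string concatenation and mentions the Crossref "mailto" parameter. Both request URLs can be assembled with `urllib.parse`, which handles escaping of values such as the email address. This is only a sketch: the function names are hypothetical, the code only builds URLs and sends no request, and passing several comma-separated PMIDs in `ids` is an assumption about the ID Converter service that should be checked against its documentation.

```python
from urllib.parse import quote, urlencode

# Base URLs: the first is converterBase from this notebook,
# the second is the Crossref REST API endpoint for single works
CONVERTER_BASE = "https://www.ncbi.nlm.nih.gov/pmc/utils/idconv/v1.0/"
CROSSREF_WORKS = "https://api.crossref.org/works/"


def id_converter_url(pmids, tool="my_tool", email="jamison@library.ucla.edu"):
    """Build an ID Converter request URL for one or more PMIDs."""
    params = {"tool": tool, "email": email, "ids": ",".join(pmids), "format": "json"}
    return CONVERTER_BASE + "?" + urlencode(params)


def crossref_singleton_url(doi, mailto):
    """Build a single-work (singleton) lookup URL with a mailto parameter."""
    return CROSSREF_WORKS + quote(doi, safe="/") + "?" + urlencode({"mailto": mailto})


converter_url = id_converter_url(["29682398"])
crossref_url = crossref_singleton_url("10.1021/acscatal.7b03688", "jamison@library.ucla.edu")
print(converter_url)
print(crossref_url)
```

The `Works` client created in the cell above wraps the same Crossref endpoint, so the URL builder is mainly useful for seeing what a singleton request looks like.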

In [None]:
# f = open('/home/jmjamison/work/pubmedExtracts/outputIdListFull.txt', "w")
# commented out so the file is written to the default working directory
f = open('outputIdListFull.txt', "w")
# first esearch and then efetch
handle=Entrez.esearch(db="pubmed", term=query)
# read - should now be a Python dictionary/list
searchResult=Entrez.read(handle)
handle.close()
# now have a list of record ids
idList=searchResult["IdList"]
ids= ','.join(idList)
listCount=searchResult["Count"]
# write out the id list
f.write(str(searchResult["IdList"]))
print(idList)
print(ids)
#print(searchResult)
f.close()

In [77]:
from random import randint
import time
# f = open('/home/jmjamison/work/pubmedExtracts/outputPubMed.txt', "w")
# commented out so the file is written to the default working directory
f = open('outputPubMed.csv', "w")
# first esearch and then efetch
searchHandle=Entrez.esearch(db="pubmed", term=query, retmax=recordCount)
# read - should now be a Python dictionary/list
# (read the result before using its IdList)
searchResult=Entrez.read(searchHandle)
searchHandle.close()
# now have a list of record ids
idList=searchResult["IdList"]
ids= ','.join(idList)
fetchHandle = Entrez.efetch(db="pubmed", id=ids, rettype="medline", retmode="text")

listCount=searchResult["Count"]
print(listCount)
#print(idList)
#print(ids)
#print(searchResult)
records = Medline.parse(fetchHandle)
print("Total records: " + listCount)
# header row; fields are separated by "*" and the row ends with a newline
f.write("Pubmed Record*Title*AUID-Author Id*AU-Author*AD-Affiliation*AID-Article Id*GR-Grant Number\n")
# i starts at 1, so its final value is one more than the number of records written
i = 1
for record in records:
    f.write(record.get('PMID', ""))
    f.write("*")
    f.write(record.get('TI', ""))
    f.write("*")
    f.write(str(record.get('AUID', "")))
    f.write("*")
    f.write(str(record.get('AU', "")))
    f.write("*")
    f.write(str(record.get('AD', "")))
    f.write("*")
    f.write(str(record.get('AID', "")))
    f.write("*")
    f.write(str(record.get('GR', "")))
    f.write("*")
    f.write("\n")
    i = i+1

f.close()
fetchHandle.close()

print("Total records processed: " + str(i))

21641
Total records: 21641
Total records processed: 10001
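
The output above reports 21641 matching records but far fewer processed, which suggests that a single efetch call did not return the full set. One common workaround is to fetch the IDs in batches and pause for a random interval between requests, which is what the `randint` and `time` imports are for. The following is a sketch under stated assumptions, not the method used in this notebook: `chunk_ids`, `fetch_in_batches`, the batch size of 200 and the 1 to 3 second pause are all hypothetical choices, and `fake_fetch` is a stand-in that makes no network request.

```python
import time
from random import randint


def chunk_ids(id_list, batch_size):
    """Yield successive slices of id_list with at most batch_size items."""
    for start in range(0, len(id_list), batch_size):
        yield id_list[start:start + batch_size]


def fetch_in_batches(id_list, fetch, batch_size=200, pause_range=(1, 3), sleep=time.sleep):
    """Call fetch(batch) for each batch, pausing a random time between calls."""
    results = []
    batches = list(chunk_ids(id_list, batch_size))
    for n, batch in enumerate(batches):
        results.extend(fetch(batch))
        if n < len(batches) - 1:
            # random pause so requests are not sent back to back
            sleep(randint(*pause_range))
    return results


# Stand-in fetch function for demonstration; it makes no network request
def fake_fetch(batch):
    return [{"PMID": pmid} for pmid in batch]


demo_ids = [str(n) for n in range(5)]
# sleep is replaced by a no-op so the demonstration runs instantly
demo = fetch_in_batches(demo_ids, fake_fetch, batch_size=2, sleep=lambda seconds: None)
print(len(demo))
```

In the notebook, `fetch` could be a small function that calls `Entrez.efetch(db="pubmed", id=",".join(batch), rettype="medline", retmode="text")` and returns `list(Medline.parse(handle))` for each batch.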


#### Use regular expressions to search the AID field for 10.* [doi]


In [93]:
# read the report back and print it line by line
with open('OutputReportTest.txt') as f:
    for line in f:
        print(line)


['29682398', '29682382', '29681647', '29676225', '29673526', '29672601', '29671273', '29671268', '29671267', '29671266', '29671265', '29671264', '29670931', '29670925', '29670334', '29666723', '29664956', '29664904', '29662744', '29662743']

29682398

Pubmed Record ID (PMID): 29682398

Title: Nickel-Catalyzed Suzuki-Miyaura Coupling of Aliphatic Amides.

Author Identifier (AUID): 

Author(s)(AU): ['Boit TB', 'Weires NA', 'Kim J', 'Garg NK']

Author affiliation (AD): Department of Chemistry and Biochemistry, University of California, Los Angeles, California 90095, United States. Department of Chemistry and Biochemistry, University of California, Los Angeles, California 90095, United States. Department of Chemistry and Biochemistry, University of California, Los Angeles, California 90095, United States. Department of Chemistry and Biochemistry, University of California, Los Angeles, California 90095, United States.

Article Identifier/DOI (AID): ['10.1021/acscatal.7b03688 [doi]']
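
The AID value above is a list of strings in which the DOI is followed by the tag `[doi]`. A minimal sketch of pulling the DOI out with the `re` module is shown here. The pattern and the helper name `extract_dois` are assumptions for illustration, and the second sample list uses made-up identifiers to show that entries with other tags are skipped.

```python
import re

# A DOI starts with "10.", a numeric registrant code, a slash, then a suffix;
# in the AID field it is followed by the tag "[doi]"
DOI_PATTERN = re.compile(r"^(10\.\d{4,9}/\S+)\s*\[doi\]$")


def extract_dois(aid_values):
    """Return the DOIs found in a list of AID strings."""
    dois = []
    for value in aid_values:
        match = DOI_PATTERN.match(value.strip())
        if match:
            dois.append(match.group(1))
    return dois


# AID value from the sample output in this notebook
aid = ['10.1021/acscatal.7b03688 [doi]']
print(extract_dois(aid))

# Made-up values: only the entry tagged [doi] is kept
mixed = ['S0000-0000(00)00000-0 [pii]', '10.1000/xyz [doi]']
print(extract_dois(mixed))
```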







