# Identifying the workflow for PubMed article abstract extraction using a small test case

## Import the Python requests library

In [1]:
import requests

In [2]:
requests.__version__

'2.31.0'

## Step 1 - querying PubMed and retrieving a list of IDs matching the search

**References and guides**

1. The base query, some example searches, and XML output parameters can be found here:
https://www.ncbi.nlm.nih.gov/books/NBK25500/

2. Parameters and syntax in depth:
https://www.ncbi.nlm.nih.gov/books/NBK25499/

3. These are the PubMed specific search term fields and their tags that should be used to construct the query:
https://pubmed.ncbi.nlm.nih.gov/help/#tiab

4. As seen in the search_suffix variable, the query uses the ESearch utility. There are 9 utilities in total under eutils (see the base URL). The others can be found here:
https://www.ncbi.nlm.nih.gov/books/NBK25497/

5. Alternative python package: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6821292/ (Not used here.)

6. EDirect - for batch access using command line
https://www.ncbi.nlm.nih.gov/books/NBK179288/

**Notes**
* By default, ESearch returns only 20 IDs.
* However, the Count parameter within the XML output shows the total number of records matching the search query.
* To establish workflow, a small query was searched - papers with ALK and GBM in the title or abstract.
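The paging implied by these notes can be sketched as follows. This is only an illustration, not part of the original workflow: `page_starts` is a hypothetical helper, and the count is assumed to arrive as a string, as it does in the ESearch JSON output.

```python
def page_starts(count, retmax):
    """Return the retstart offsets needed to page through `count` records."""
    return list(range(0, int(count), retmax))

# ESearch reports the total as a string in its JSON output, e.g. {'count': '76'}
offsets = page_starts('76', 20)
print(offsets)  # [0, 20, 40, 60]
```

Each offset would be passed as `retstart` in a separate ESearch call, with `retmax` kept constant.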

In [3]:
base_url = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/'

In [4]:
# Specify search parameters
database = 'pubmed'
query = 'glioblastoma[tiab]+AND+ALK[tiab]'

# index of the first record. When doing calls in a for loop, this number can be incremented by adding the maximum records being extracted
retstart = '0'

# max. number of records to obtain. The default is 20 and the max. allowable IDs returned are 10,000
retmax = 100

# output type. The default is XML.
retmode = 'json'

In [5]:
# usehistory=y saves the IDs from the search on the server so that a subsequent
# call to another utility, such as EFetch, can use them to extract the abstracts.

search_suffix = f'esearch.fcgi?db={database}&term={query}&usehistory=y&retmode={retmode}&retstart={retstart}&retmax={retmax}'

In [6]:
search_url = base_url + search_suffix

In [7]:
search_url

'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&term=glioblastoma[tiab]+AND+ALK[tiab]&usehistory=y&retmode=json&retstart=0&retmax=100'
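As an alternative to assembling the query string by hand, the URL could be built with `urllib.parse.urlencode`, which percent-encodes the brackets and spaces in the search term. This is a sketch under assumptions: `build_esearch_url` and `BASE_URL` are hypothetical names, and the parameter set simply mirrors the cells above.

```python
from urllib.parse import urlencode

BASE_URL = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/'

def build_esearch_url(term, retstart=0, retmax=20, retmode='json'):
    """Build an ESearch URL for PubMed, percent-encoding the query term."""
    params = {
        'db': 'pubmed',
        'term': term,
        'usehistory': 'y',
        'retmode': retmode,
        'retstart': retstart,
        'retmax': retmax,
    }
    return BASE_URL + 'esearch.fcgi?' + urlencode(params)

url = build_esearch_url('glioblastoma[tiab] AND ALK[tiab]', retmax=100)
print(url)
```

The `requests` library can do the same encoding if the parameters are passed as a dictionary through the `params` argument of `requests.get`.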

In [8]:
response = requests.get(search_url)

* A status code of 200 means the request was successful. Other codes indicate problems, such as 400, which means bad request.

In [9]:
response.status_code

200
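To look up what a given status code means, the standard library's `http.HTTPStatus` enum can be used. The helpers below are a hypothetical illustration and are not used elsewhere in this notebook.

```python
from http import HTTPStatus

def describe_status(code):
    """Return a short description of an HTTP status code, e.g. '200 OK'."""
    status = HTTPStatus(code)
    return f'{status.value} {status.phrase}'

def is_success(code):
    """True for any 2xx status code."""
    return 200 <= code < 300

print(describe_status(200))  # 200 OK
print(describe_status(400))  # 400 Bad Request
```

In a loop of API calls, `response.raise_for_status()` from `requests` is another option: it raises an exception for 4xx and 5xx responses instead of requiring a manual check.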

In [10]:
response.encoding

'UTF-8'

* The default output is XML; however, the retrieval mode can be set to return JSON.
* Also, the number of records was confirmed by searching manually on PubMed: https://pubmed.ncbi.nlm.nih.gov/?term=%28ALK%5BTitle%2FAbstract%5D%29+AND+%28glioblastoma%5BTitle%2FAbstract%5D%29&sort=

In [11]:
response.json()

{'header': {'type': 'esearch', 'version': '0.3'},
 'esearchresult': {'count': '76',
  'retmax': '76',
  'retstart': '0',
  'querykey': '1',
  'webenv': 'MCID_65c2c81a67315d3455095fc3',
  'idlist': ['38285799',
   '37939020',
   '37861443',
   '37271069',
   '37260294',
   '37240478',
   '37168365',
   '36823756',
   '36780194',
   '36707425',
   '35724395',
   '35625997',
   '35190826',
   '34702773',
   '34626238',
   '34341009',
   '34323181',
   '34015889',
   '33966367',
   '33887544',
   '33853673',
   '33728771',
   '33486679',
   '33341678',
   '33109342',
   '32866816',
   '32308772',
   '31875306',
   '31776900',
   '31483918',
   '31399568',
   '30894200',
   '30065256',
   '29336268',
   '28960893',
   '28912153',
   '28837676',
   '28484053',
   '28465216',
   '28459464',
   '28090572',
   '28069875',
   '27993946',
   '27579614',
   '27178681',
   '27046135',
   '26939704',
   '26648752',
   '26498130',
   '26438251',
   '26235020',
   '26090865',
   '25882777',
   '257338

## Step 2 - Extracting article information for the PMIDs with the efetch utility, using the IDs stored on the server

**Notes**

* The esearch utility used above only returns a list of identifiers matching the query terms.
* To extract other information, such as the title and abstract, another utility called EFetch has to be used.

In [12]:
# The base url and database fields have already been defined above:
base_url

'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/'

In [13]:
database

'pubmed'

In [14]:
# Settings for testing efetch on IDs from the esearch JSON output above:

rettype = 'abstract'
retmode = 'xml'

In [15]:
# fetchurl_suffix = f'efetch.fcgi?db={database}&id={ids_of_interest}&rettype={rettype}&retmode={retmode}'


* Alternatively, if the IDs are not stored on the server with usehistory=y in esearch, the efetch URL should instead include the id parameter, e.g. 'id=38285799,37939020'. A comma-separated list of IDs can be passed here.
* However, storing the ID list on the server and then retrieving the querykey and webenv values keeps all IDs available, so there is no need to pass a very long list when there are many IDs.
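The two ways of telling EFetch which records to return can be sketched side by side. This is a minimal illustration: `build_efetch_url` and `BASE_URL` are hypothetical names, and only the parameters already used in this notebook are included.

```python
from urllib.parse import urlencode

BASE_URL = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/'

def build_efetch_url(ids=None, query_key=None, web_env=None,
                     rettype='abstract', retmode='xml'):
    """Build an EFetch URL from an explicit ID list or from history-server keys."""
    params = {'db': 'pubmed', 'rettype': rettype, 'retmode': retmode}
    if ids:
        # Option 1: pass the PMIDs directly as a comma-separated list
        params['id'] = ','.join(ids)
    elif query_key and web_env:
        # Option 2: point at the ID list stored on the server by usehistory=y
        params['query_key'] = query_key
        params['WebEnv'] = web_env
    else:
        raise ValueError('Provide either ids or both query_key and web_env')
    # safe=',' keeps the commas in the ID list readable instead of %2C
    return BASE_URL + 'efetch.fcgi?' + urlencode(params, safe=',')

print(build_efetch_url(ids=['38285799', '37939020']))
print(build_efetch_url(query_key='1', web_env='MCID_65c2c81a67315d3455095fc3'))
```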

In [16]:
# The query_key and web_env values can be obtained from the Esearch result above. 
# This stores the ids on the server for subsequent access to be used in another functionality.
query_key=response.json()['esearchresult']['querykey']
web_env = response.json()['esearchresult']['webenv']

fetchurl_suffix = f'efetch.fcgi?db={database}&query_key={query_key}&WebEnv={web_env}&rettype={rettype}&retmode={retmode}'


In [17]:
fetch_url = base_url + fetchurl_suffix
fetch_url

'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&query_key=1&WebEnv=MCID_65c2c81a67315d3455095fc3&rettype=abstract&retmode=xml'

In [18]:
fetch_response = requests.get(fetch_url)

In [19]:
fetch_response.status_code

200

In [20]:
type(fetch_response.text)

str

* The first 5,000 characters are shown as example output, but this variable holds a string with the records for all 76 papers matching the GBM and ALK terms.
* Because retmode=xml is set in the URL, the output is in XML format.

In [21]:
fetch_response

<Response [200]>

In [22]:
print(fetch_response.text[0:5000])

<?xml version="1.0" ?>
<!DOCTYPE PubmedArticleSet PUBLIC "-//NLM//DTD PubMedArticle, 1st January 2024//EN" "https://dtd.nlm.nih.gov/ncbi/pubmed/out/pubmed_240101.dtd">
<PubmedArticleSet>
<PubmedArticle><MedlineCitation Status="MEDLINE" Owner="NLM" IndexingMethod="Automated"><PMID Version="1">38285799</PMID><DateCompleted><Year>2024</Year><Month>01</Month><Day>31</Day></DateCompleted><DateRevised><Year>2024</Year><Month>01</Month><Day>31</Day></DateRevised><Article PubModel="Electronic"><Journal><ISSN IssnType="Electronic">2476-762X</ISSN><JournalIssue CitedMedium="Internet"><Volume>25</Volume><Issue>1</Issue><PubDate><Year>2024</Year><Month>Jan</Month><Day>01</Day></PubDate></JournalIssue><Title>Asian Pacific journal of cancer prevention : APJCP</Title><ISOAbbreviation>Asian Pac J Cancer Prev</ISOAbbreviation></Journal><ArticleTitle>Evaluation of Immunohistochemical Expression of ALK-1 in Gliomas, WHO Grade 4 and Its Correlation with IDH1-R132H Mutation Status.</ArticleTitle><Paginatio

**Feasibility**

If the esearch and efetch steps are repeated over and over again to get all cancer related papers (cancer OR tumor OR tumour in the title/abstract), then how many loops of API calls and how much time would be required?

* For bulk access, it might be better to use the command line EDirect utility. 
* However, for an application where gene and cancer specific searches are required, a real time extraction using these combined steps is feasible.

In [23]:
# About 6 million records (the figure used in the calculation below) and 10,000 records allowed per call.
total_loops = 6e6 / 10000

In [24]:
# Estimating 2 seconds per call (1 esearch call and 1 efetch call)
# Adding 3 seconds between each round of loop as a wait time

time_sec = (2 * total_loops) + ((total_loops-1)*3)

Estimated time, in minutes, to get all papers matching the query: cancer OR tumor OR tumour

In [25]:
time_sec/60

49.95
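The same estimate can be wrapped in a small function so that the assumptions are easy to vary. This is a sketch: `estimate_minutes` is a hypothetical helper, and the defaults (2 seconds per esearch plus efetch round, 3 seconds of waiting between rounds) are the rough guesses from the cells above, not measured values.

```python
import math

def estimate_minutes(total_records, records_per_call=10_000,
                     seconds_per_round=2, wait_seconds=3):
    """Estimate the minutes needed to page through `total_records` records."""
    rounds = math.ceil(total_records / records_per_call)
    # One wait between consecutive rounds, none after the last one
    waits = max(rounds - 1, 0)
    total_seconds = seconds_per_round * rounds + wait_seconds * waits
    return total_seconds / 60

print(estimate_minutes(6e6))  # 49.95
print(estimate_minutes(3e6))  # 24.95
```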

## Step 3 - Converting the XML output obtained from the PubMed efetch utility to a DataFrame or a CSV

### Import the module that can parse or create XML format and test it on a small subset.

In [26]:
import xml.etree.ElementTree as ET

**This module can parse data from an XML file or from a string. First, obtain just 2-3 records from PubMed to see how the module works.**

In [27]:
base_url

'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/'

* The following are copied from steps 1 and 2 above.

In [28]:
# Specify search parameters
database = 'pubmed'
query = 'glioblastoma[tiab]+AND+ALK[tiab]'

# index of the first record. When doing calls in a for loop, this number can be incremented by adding the maximum records being extracted
retstart = '0'

# max. number of records to obtain. The default is 20 and the max. allowable IDs returned are 10,000
retmax = 3

# output type. The default is XML.
retmode = 'json'

search_suffix = f'esearch.fcgi?db={database}&term={query}&usehistory=y&retmode={retmode}&retstart={retstart}&retmax={retmax}'

search_url = base_url + search_suffix

# Get the PMIDs first
response = requests.get(search_url)
esearch_result = response.json()

# Extract the first 3 PMIDs from the esearch result and convert it to a string of comma separated ids

ids_list = esearch_result['esearchresult']['idlist'][0:3]
ids_str = ','.join(ids_list)

# Retrieval type and mode for the efetch call
rettype = 'abstract'
retmode = 'xml'

fetchurl_suffix = f'efetch.fcgi?db={database}&id={ids_str}&rettype={rettype}&retmode={retmode}'

fetch_url = base_url + fetchurl_suffix
print(fetch_url)

# Get the abstracts using the efetch utility
fetch_response = requests.get(fetch_url)

https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&id=38285799,37939020,37861443&rettype=abstract&retmode=xml


* The esearch worked properly and obtained the first 3 PMIDs for papers where glioblastoma and ALK are listed in the title or abstract.

In [29]:
esearch_result

{'header': {'type': 'esearch', 'version': '0.3'},
 'esearchresult': {'count': '76',
  'retmax': '3',
  'retstart': '0',
  'querykey': '1',
  'webenv': 'MCID_65c2c8e122f8c6029022d43c',
  'idlist': ['38285799', '37939020', '37861443'],
  'translationset': [],
  'querytranslation': '"glioblastoma"[Title/Abstract] AND "ALK"[Title/Abstract]'}}

* The efetch call extracted the abstracts for the above 3 PMIDs; the result is an XML document held as a string:

In [143]:
fetch_response.text

'<?xml version="1.0" ?>\n<!DOCTYPE PubmedArticleSet PUBLIC "-//NLM//DTD PubMedArticle, 1st January 2024//EN" "https://dtd.nlm.nih.gov/ncbi/pubmed/out/pubmed_240101.dtd">\n<PubmedArticleSet>\n<PubmedArticle><MedlineCitation Status="MEDLINE" Owner="NLM" IndexingMethod="Automated"><PMID Version="1">38285799</PMID><DateCompleted><Year>2024</Year><Month>01</Month><Day>31</Day></DateCompleted><DateRevised><Year>2024</Year><Month>01</Month><Day>31</Day></DateRevised><Article PubModel="Electronic"><Journal><ISSN IssnType="Electronic">2476-762X</ISSN><JournalIssue CitedMedium="Internet"><Volume>25</Volume><Issue>1</Issue><PubDate><Year>2024</Year><Month>Jan</Month><Day>01</Day></PubDate></JournalIssue><Title>Asian Pacific journal of cancer prevention : APJCP</Title><ISOAbbreviation>Asian Pac J Cancer Prev</ISOAbbreviation></Journal><ArticleTitle>Evaluation of Immunohistochemical Expression of ALK-1 in Gliomas, WHO Grade 4 and Its Correlation with IDH1-R132H Mutation Status.</ArticleTitle><Pagin

**PROBLEM - The result string has many character references such as &#xa0 and &#x3ba. How should these be decoded?**

In [57]:
print(fetch_response.text.count("&#xa0"))
print(fetch_response.text.count("&#x3ba"))
print(fetch_response.text.count("&#"))

1
10
59


**This library can normalize the data. However it shows that the string is already in the normal form.**

In [42]:
import unicodedata

In [47]:
unicodedata.is_normalized("NFKD", fetch_response.text)

True

**Q - How to save the string to a file with the proper encoding?**
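One possible answer, sketched under assumptions: encode the text explicitly as UTF-8 when writing and use the same encoding when reading it back. `save_text` is a hypothetical helper, the sample string is made up, and a temporary directory stands in for the real output location.

```python
import tempfile
from pathlib import Path

def save_text(text, path):
    """Write text to `path` as UTF-8 and return the number of bytes written."""
    data = text.encode('utf-8')
    Path(path).write_bytes(data)
    return len(data)

# Toy string containing a Greek kappa and a non-breaking space
sample = 'NF-\u03baB\u00a0signalling'

with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / 'abstracts.xml'
    save_text(sample, out_path)
    round_trip = out_path.read_text(encoding='utf-8')

print(round_trip == sample)  # True
```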

**Try the html module to clean up the string containing html characters such as the one shown above.**

In [159]:
import html

In [160]:
cleaned_response = html.unescape(fetch_response.text)

In [177]:
print(cleaned_response.count("&#xa0"))
print(cleaned_response.count("&#x3ba"))
print(cleaned_response.count("&#"))

0
0
0


**From above, although it looks like all HTML entities were taken care of, the string now has '\xa0', which is a non breaking space character in unicode.**

In [188]:
"\xa0" in cleaned_response

True

In [185]:
cleaned_response[cleaned_response.find("\xa0")-20 : cleaned_response.find("\xa0")+20]

' than the mean age. \xa0Conclusion: Our res'

**To resolve this, the non-breaking spaces are replaced with regular space characters.**

In [189]:
cleaned_response = html.unescape(fetch_response.text).replace('\xa0', ' ')

In [191]:
cleaned_response[cleaned_response.find("Conclusion")-20 : cleaned_response.find("Conclusion")+20]

'than the mean age.  Conclusion: Our resu'
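A caveat worth checking before relying on `html.unescape`: the XML parser decodes character references by itself, and unescaping the raw string first also turns `&amp;` into a bare `&`, which is no longer valid XML. The snippet below is a toy example written for this note, not data from PubMed.

```python
import html
import xml.etree.ElementTree as ET

# Toy snippet with a non-breaking space, a Greek kappa and an escaped ampersand
snippet = '<AbstractText>mean age.&#xa0;NF-&#x3ba;B &amp; ALK-1</AbstractText>'

# The parser decodes character references and &amp; on its own
parsed_text = ET.fromstring(snippet).text
clean_text = parsed_text.replace('\xa0', ' ')
print(clean_text)  # mean age. NF-κB & ALK-1

# Unescaping before parsing leaves a bare '&' that the parser rejects
try:
    ET.fromstring(html.unescape(snippet))
    unescape_first_parses = True
except ET.ParseError:
    unescape_first_parses = False
print(unescape_first_parses)  # False
```

Under this reading, it may be safer to parse the original response text and replace the non-breaking spaces in the extracted text afterwards.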

**Explore the XML module to see the different ways of getting the fields or keys for all papers.**

Element object

In [49]:
root = ET.fromstring(fetch_response.text)

* Top level tag

In [54]:
root.tag

'PubmedArticleSet'

**The root contains all 3 papers, each starting with a 'PubmedArticle' tag.**

In [59]:
root.tag

'PubmedArticleSet'

#### INDICES METHOD
**Indices can be used to fetch fields; however, this can get very messy and unreliable.**

Each of the 3 papers is a child of the root element, PubmedArticleSet.
The tag attribute gives the name of the tag or field.

In [62]:

print(root[0].tag)
print(root[1].tag)
print(root[2].tag)

PubmedArticle
PubmedArticle
PubmedArticle


The tree depth can be navigated by specifying further indices, e.g. the children of PubmedArticle:

In [67]:
print(root[0][0].tag)
print(root[0][1].tag)


MedlineCitation
PubmedData


Children of MedlineCitation:
The bulk of the information, such as the PMID and abstract text, is contained here.

In [76]:
for num in range(9):
    print(root[0][0][num].tag)

PMID
DateCompleted
DateRevised
Article
MedlineJournalInfo
ChemicalList
CitationSubset
MeshHeadingList
KeywordList


**Each of the above may have internal text or further sub-elements. So this is why it might get complex to use indices to get specific information.**

For example, PMID has no sub-elements (hence, the IndexError). The actual ID is contained as text:

In [83]:
print(root[0][0][0][0])

IndexError: child index out of range

The text attribute gives the actual text contained within the tag.

In [88]:
# the PMID of the first paper
print(root[0][0][0].tag)
print(root[0][0][0].text)

PMID
38285799


The attrib attribute gives the attributes of an element as a dictionary.

In [89]:
print(root[0][0][0].tag)
print(root[0][0][0].attrib)

PMID
{'Version': '1'}
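Iterating over an element yields its direct children, which avoids guessing how many indices exist and the IndexError seen above. The tree below is a toy stand-in for one PubMed record, and `child_tags` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Toy record with the same nesting as the first levels of a PubmedArticle
sample = ET.fromstring(
    '<PubmedArticle><MedlineCitation>'
    '<PMID Version="1">38285799</PMID>'
    '<Article><ArticleTitle>Example title</ArticleTitle></Article>'
    '</MedlineCitation></PubmedArticle>'
)

def child_tags(element):
    """Tags of the direct children; an empty list for a leaf element."""
    return [child.tag for child in element]

citation = sample[0]
print(child_tags(citation))     # ['PMID', 'Article']
print(child_tags(citation[0]))  # [] because PMID is a leaf
print(len(citation[0]))         # 0
```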


#### ITERATION METHOD
Iteration tool to get information about a field from all papers.

Following code pieces can be used to get information of interest.

**The problem here is that if a paper lacks a given field, the iteration silently skips it, so the results can no longer be matched to their papers.**

In [156]:
data = {'title': [],
        'keywords':[]}

for child in root.iter("PubmedArticle"):
    for title in child.iter("ArticleTitle"):
        
        data['title'].append(title.text)
        
    keywords = ""
    for keyword in child.iter("KeywordList"):
        keywords += ",".join(keyword.itertext())
    
    data['keywords'].append(keywords)
        

In [157]:
data

{'title': ['Evaluation of Immunohistochemical Expression of ALK-1 in Gliomas, WHO Grade 4 and Its Correlation with IDH1-R132H Mutation Status.',
  'Cyanine Dye Conjugation Enhances Crizotinib Localization to Intracranial Tumors, Attenuating NF-κB-Inducing Kinase Activity and Glioma Progression.',
  'Synthesis and Bioevaluation of 3-(Arylmethylene)indole Derivatives: Discovery of a Novel ALK Modulator with Antiglioblastoma Activities.'],
 'keywords': ['ALK-1,Glioblastoma,Glioma grade 4,IDH-1,immunohistochemistry',
  'Crizotinib,GBM,HMCD,NF-κB-inducing kinase,cyanine dye,kinase inhibitor',
  '']}

**Title of the paper**

In [97]:
for child in root.iter("ArticleTitle"):
    print(f'tag: {child.tag}')
    print(f'attrib: {child.attrib}')
    print(f'text:{child.text}')
    

tag: ArticleTitle
attrib: {}
text:Evaluation of Immunohistochemical Expression of ALK-1 in Gliomas, WHO Grade 4 and Its Correlation with IDH1-R132H Mutation Status.
tag: ArticleTitle
attrib: {}
text:Cyanine Dye Conjugation Enhances Crizotinib Localization to Intracranial Tumors, Attenuating NF-κB-Inducing Kinase Activity and Glioma Progression.
tag: ArticleTitle
attrib: {}
text:Synthesis and Bioevaluation of 3-(Arylmethylene)indole Derivatives: Discovery of a Novel ALK Modulator with Antiglioblastoma Activities.


**Publication date.**

Here, further tags such as Year, Month, and Day are contained within the main PubDate tag. These can be obtained with the itertext method, which gathers the text from all sub-elements.

In [105]:
for child in root.iter("PubDate"):
    print(f'tag: {child.tag}')
    print(" ".join(child.itertext()))

tag: PubDate
2024 Jan 01
tag: PubDate
2023 Dec 04
tag: PubDate
2023 Nov 09
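Joining all the text works when every part is present, but the parts can also be read individually so that a missing Month or Day is visible. This is a sketch on toy PubDate elements; `pubdate_parts` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

def pubdate_parts(pubdate):
    """Return (year, month, day) from a PubDate element; missing parts are None."""
    return tuple(pubdate.findtext(tag) for tag in ('Year', 'Month', 'Day'))

full = ET.fromstring(
    '<PubDate><Year>2024</Year><Month>Jan</Month><Day>01</Day></PubDate>')
partial = ET.fromstring(
    '<PubDate><Year>2023</Year><Month>Dec</Month></PubDate>')

print(pubdate_parts(full))     # ('2024', 'Jan', '01')
print(pubdate_parts(partial))  # ('2023', 'Dec', None)
```

Some PubMed records carry a free-text MedlineDate element instead of Year, Month, and Day; those would come back as three None values here and would need separate handling.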


**Journal name**

In [107]:
for child in root.iter("Title"):
    print(f'tag: {child.tag}')
    print(f'text:{child.text}')

tag: Title
text:Asian Pacific journal of cancer prevention : APJCP
tag: Title
text:Molecular pharmaceutics
tag: Title
text:Journal of medicinal chemistry


**Publication type, i.e. article, review, etc.**

In [123]:
for child in root.iter("PublicationTypeList"):
    print(" ".join(child.itertext()))

Journal Article
Journal Article
Journal Article Research Support, Non-U.S. Gov't


**Keywords**

Here, the third (last) paper has no keywords and hence no KeywordList tag. Had the second paper lacked it instead, there would have been no way to tell from this output.

In [138]:
for count,child in enumerate(root.iter("KeywordList")):
    print(count)
    print(f'tag: {child.tag}')
    print(",".join(child.itertext()))
 

0
tag: KeywordList
ALK-1,Glioblastoma,Glioma grade 4,IDH-1,immunohistochemistry
1
tag: KeywordList
Crizotinib,GBM,HMCD,NF-κB-inducing kinase,cyanine dye,kinase inhibitor
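One way around the alignment problem is to loop over each PubmedArticle and look up its fields inside that element, so a missing field yields an empty value for that paper instead of shifting the results. The sketch below uses toy XML and a hypothetical `article_records` helper, and writes the rows with the standard `csv` module.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Toy data: the second article has no KeywordList
SAMPLE_XML = '''<PubmedArticleSet>
<PubmedArticle><MedlineCitation><PMID>1</PMID>
<Article><ArticleTitle>First toy title</ArticleTitle></Article>
<KeywordList><Keyword>ALK</Keyword><Keyword>GBM</Keyword></KeywordList>
</MedlineCitation></PubmedArticle>
<PubmedArticle><MedlineCitation><PMID>2</PMID>
<Article><ArticleTitle>Second toy title</ArticleTitle></Article>
</MedlineCitation></PubmedArticle>
</PubmedArticleSet>'''

def article_records(root):
    """Return one dict per PubmedArticle; missing fields become None or ''."""
    records = []
    for article in root.iter('PubmedArticle'):
        records.append({
            'pmid': article.findtext('MedlineCitation/PMID'),
            'title': article.findtext('MedlineCitation/Article/ArticleTitle'),
            'keywords': ','.join(kw.text or '' for kw in article.iter('Keyword')),
        })
    return records

records = article_records(ET.fromstring(SAMPLE_XML))

# Write the rows to an in-memory CSV; a real file handle works the same way
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=['pmid', 'title', 'keywords'])
writer.writeheader()
writer.writerows(records)
csv_text = buffer.getvalue()
print(csv_text)
```

The same list of dictionaries could also be passed to `pandas.DataFrame` if a DataFrame is preferred over a CSV.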


**Mesh headings**

In [137]:
for count,child in enumerate(root.iter("MeshHeadingList")):
    print(count)
    print(f'tag: {child.tag}')
    print(",".join(child.itertext()))
 

0
tag: MeshHeadingList
Male,Adult,Humans,Female,Brain Neoplasms,pathology,Anaplastic Lymphoma Kinase,genetics,Glioma,pathology,Glioblastoma,Mutation,Receptor Protein-Tyrosine Kinases,genetics,World Health Organization,Isocitrate Dehydrogenase,genetics,metabolism
1
tag: MeshHeadingList
Mice,Animals,Humans,Crizotinib,pharmacology,therapeutic use,NF-kappa B,Cell Line, Tumor,Glioma,drug therapy,pathology,Brain Neoplasms,drug therapy,pathology,Glioblastoma,drug therapy,NF-kappaB-Inducing Kinase
2
tag: MeshHeadingList
Humans,Anaplastic Lymphoma Kinase,Receptor Protein-Tyrosine Kinases,Glioblastoma,pathology,Glioma,Indoles,pharmacology,therapeutic use,Cell Line, Tumor,Protein Kinase Inhibitors,pharmacology,therapeutic use,Cell Proliferation


**Chemicals**

In [142]:
for count,child in enumerate(root.iter("ChemicalList")):
    print(count)
    print(f'tag: {child.tag}')
    print(f'attrib: {child.attrib}')
    print(",".join(child.itertext()))

0
tag: ChemicalList
attrib: {}
EC 2.7.10.1,Anaplastic Lymphoma Kinase,EC 2.7.10.1,Receptor Protein-Tyrosine Kinases,EC 1.1.1.41,Isocitrate Dehydrogenase,EC 1.1.1.42.,IDH1 protein, human
1
tag: ChemicalList
attrib: {}
53AH36668S,Crizotinib,0,NF-kappa B
2
tag: ChemicalList
attrib: {}
EC 2.7.10.1,Anaplastic Lymphoma Kinase,EC 2.7.10.1,Receptor Protein-Tyrosine Kinases,0,Indoles,0,Protein Kinase Inhibitors


In [1049]:
val = root.find('PubmedArticle/MedlineCitation/Article/Journal/JournalIssue/PubDate/Year')
val.text

'2024'

In [1050]:
fetch_response.text.strip()

'<?xml version="1.0" ?>\n<!DOCTYPE PubmedArticleSet PUBLIC "-//NLM//DTD PubMedArticle, 1st January 2024//EN" "https://dtd.nlm.nih.gov/ncbi/pubmed/out/pubmed_240101.dtd">\n<PubmedArticleSet>\n<PubmedArticle><MedlineCitation Status="MEDLINE" Owner="NLM" IndexingMethod="Automated"><PMID Version="1">38285799</PMID><DateCompleted><Year>2024</Year><Month>01</Month><Day>31</Day></DateCompleted><DateRevised><Year>2024</Year><Month>01</Month><Day>31</Day></DateRevised><Article PubModel="Electronic"><Journal><ISSN IssnType="Electronic">2476-762X</ISSN><JournalIssue CitedMedium="Internet"><Volume>25</Volume><Issue>1</Issue><PubDate><Year>2024</Year><Month>Jan</Month><Day>01</Day></PubDate></JournalIssue><Title>Asian Pacific journal of cancer prevention : APJCP</Title><ISOAbbreviation>Asian Pac J Cancer Prev</ISOAbbreviation></Journal><ArticleTitle>Evaluation of Immunohistochemical Expression of ALK-1 in Gliomas, WHO Grade 4 and Its Correlation with IDH1-R132H Mutation Status.</ArticleTitle><Pagin

In [1157]:
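# root_str2 is not defined in the cells shown; it is assumed to hold the raw XML string (e.g. fetch_response.text)
root_str2[root_str2.find("<Abstract>"):root_str2.find("</Abstract>")]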

'<Abstract><AbstractText Label="BACKGROUND" NlmCategory="BACKGROUND">Glioblastoma (GB), a grade 4 glioma is the most common primary malignant brain tumor in adults. Recently, the mutation status of isocitrate dehydrogenase (IDH) has been crucial in the treatment of GB. IDH mutant cases display a more favorable prognosis than IDH-wild type ones. The anaplastic lymphoma kinase (ALK) is expressed as a receptor tyrosine kinase in both the developing central and peripheral nervous systems. Increasing lines of evidence suggest that ALK is over-expressed in GB and represents a potential therapeutic target.</AbstractText><AbstractText Label="OBJECTIVES" NlmCategory="OBJECTIVE">The goal of the current study was to investigate ALK-1 immunohistochemical expression in gliomas, grade 4, besides its correlation with IDH1-R132H mutation status and the clinicopathological parameters of the tumors.</AbstractText><AbstractText Label="MATERIAL AND METHODS" NlmCategory="METHODS">Seventy cases of gliomas, 

In [158]:
# Here, the &#xa0 and similar others don't appear. However, they do appear if the content is read from a file.

for child in root.iter("AbstractText"):
#     print(f'tag: {child.tag}')
#     print(f'attrib: {child.attrib.get("Label")}')
#     print(child.text)
    print("".join(child.itertext()))
    

Glioblastoma (GB), a grade 4 glioma is the most common primary malignant brain tumor in adults. Recently, the mutation status of isocitrate dehydrogenase (IDH) has been crucial in the treatment of GB. IDH mutant cases display a more favorable prognosis than IDH-wild type ones. The anaplastic lymphoma kinase (ALK) is expressed as a receptor tyrosine kinase in both the developing central and peripheral nervous systems. Increasing lines of evidence suggest that ALK is over-expressed in GB and represents a potential therapeutic target.
The goal of the current study was to investigate ALK-1 immunohistochemical expression in gliomas, grade 4, besides its correlation with IDH1-R132H mutation status and the clinicopathological parameters of the tumors.
Seventy cases of gliomas, grade 4 were tested for immunohistochemical expression of ALK-1 & IDH1-R132H in the tumor cells.
ALK-1 immunoexpression was detected in 22.9% of our cases and IDH1-R132H mutation was detected in 12.9% of them. ALK-1 exp
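For structured abstracts, the Label attribute of each AbstractText can be kept next to its text. The sketch below runs on a toy Abstract element; `abstract_sections` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Toy abstract: two labelled sections (one with inline markup) and one unlabelled
abstract = ET.fromstring(
    '<Abstract>'
    '<AbstractText Label="BACKGROUND">Toy background with <i>italic</i> text.</AbstractText>'
    '<AbstractText Label="OBJECTIVES">Toy objective.</AbstractText>'
    '<AbstractText>Unlabelled closing sentence.</AbstractText>'
    '</Abstract>'
)

def abstract_sections(abstract_element):
    """Return (label, text) pairs; label is None for unlabelled sections."""
    return [(section.get('Label'), ''.join(section.itertext()))
            for section in abstract_element.iter('AbstractText')]

sections = abstract_sections(abstract)
for label, text in sections:
    print(f'{label}: {text}' if label else text)
```

Using `itertext` rather than `text` matters here, because `text` stops at the first nested tag such as `<i>`.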

In [1076]:
for child in root.iter("KeywordList"):
    print(f'tag: {child.tag}')
    print(f'attrib: {child.attrib.get("Label")}')
#     print(child.text)
    print(",".join(child.itertext()))
 

tag: KeywordList
attrib: None
ALK-1,Glioblastoma,Glioma grade 4,IDH-1,immunohistochemistry
tag: KeywordList
attrib: None
Crizotinib,GBM,HMCD,NF-κB-inducing kinase,cyanine dye,kinase inhibitor


In [1074]:
a = 'some,string'
a.split(",")

['some', 'string']

In [862]:
for count,child in enumerate(root.iter("PublicationType")):
    print(count)
    print(f'tag: {child.tag}')
    print(f'attrib: {child.attrib}')
    print(f'text:{child.text}')
    print("".join(child.itertext()))
    

0
tag: PublicationType
attrib: {'UI': 'D016428'}
text:Journal Article
Journal Article
1
tag: PublicationType
attrib: {'UI': 'D016428'}
text:Journal Article
Journal Article
2
tag: PublicationType
attrib: {'UI': 'D016428'}
text:Journal Article
Journal Article
3
tag: PublicationType
attrib: {'UI': 'D013485'}
text:Research Support, Non-U.S. Gov't
Research Support, Non-U.S. Gov't


In [815]:
content_type = "PMID"

for child in root.iter(content_type):
    print(child.tag)
    print(child.attrib)
    print(child.text)
    print("".join(child.itertext()))

PMID
{'Version': '1'}
38285799
38285799
PMID
{'Version': '1'}
37939020
37939020
PMID
{'Version': '1'}
37861443
37861443


In [783]:
content_type = "PubDate"

for child in root.iter(content_type):
    print(child.tag)
    print(child.attrib)
    print(child.text)
    print("".join(child.itertext()))

PubDate
{}
None
2024Jan01
[]
PubDate
{}
None
2023Dec04
[]
PubDate
{}
None
2023Nov09
[]


In [868]:
root.findall("./PubMedArticle/Authors")

[]
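The empty list is probably explained by two things: tag names are case-sensitive ('PubmedArticle', not 'PubMedArticle'), and a path in `findall` must name every level between the root and the target. The sketch below uses a toy tree that follows the AuthorList nesting seen in these records; the author names are made up.

```python
import xml.etree.ElementTree as ET

toy_root = ET.fromstring(
    '<PubmedArticleSet><PubmedArticle><MedlineCitation><Article><AuthorList>'
    '<Author><LastName>Doe</LastName><ForeName>Jane</ForeName></Author>'
    '<Author><LastName>Roe</LastName><ForeName>Sam</ForeName></Author>'
    '</AuthorList></Article></MedlineCitation></PubmedArticle></PubmedArticleSet>'
)

# Wrong case and a tag that is not a direct child: no matches
print(toy_root.findall('./PubMedArticle/Authors'))  # []

# Full, correctly cased path to each Author element
authors = toy_root.findall(
    './PubmedArticle/MedlineCitation/Article/AuthorList/Author')
names = [f"{a.findtext('ForeName')} {a.findtext('LastName')}" for a in authors]
print(names)  # ['Jane Doe', 'Sam Roe']

# './/Author' searches all descendants when the exact path is not known
print(len(toy_root.findall('.//Author')))  # 2
```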

In [785]:
for child in root:
    print(child.tag, child.attrib)

PubmedArticle {}
PubmedArticle {}
PubmedArticle {}


In [803]:
test_string = '''<?xml version="1.0"?>
<data>
    <country name="Liechtenstein">
        <rank>1</rank>
        <year>2008</year>
        <gdppc>141100</gdppc>
        <neighbor name="Austria" direction="E"/>
        <neighbor name="Switzerland" direction="W"/>
    </country>
    <country name="Singapore">
        <rank>4</rank>
        <year>2011</year>
        <gdppc>59900</gdppc>
        <neighbor name="Malaysia" direction="N"/>
    </country>
    <country name="Panama">
        <rank>68</rank>
        <year>2011</year>
        <gdppc>13600</gdppc>
        <neighbor name="Costa Rica" direction="W"/>
        <neighbor name="Colombia" direction="E"/>
    </country>
</data>'''

In [804]:
root = ET.fromstring(test_string)

In [805]:
for child in root:
    print(child.attrib)

{'name': 'Liechtenstein'}
{'name': 'Singapore'}
{'name': 'Panama'}


In [806]:
for child in root:
    print(child.tag)

country
country
country


In [810]:
for rank in root.iter('rank'):
    print(rank.attrib)
    print(rank.text)

{}
1
{}
4
{}
68


In [730]:
for child in root.iter("PMID"):
    print(child.attrib)
    print(child)

{'Version': '1'}
<Element 'PMID' at 0x10b8c5400>
{'Version': '1'}
<Element 'PMID' at 0x10ba25720>
{'Version': '1'}
<Element 'PMID' at 0x10bb3c2c0>


* The 3 papers are found at root[0], root[1], and root[2].

In [732]:
# Print the tag of each of the 3 papers
for paper_index in range(3):
    print(root[paper_index].tag)

PubmedArticle
PubmedArticle
PubmedArticle


In [876]:
paper_1 = root[0]

paper_1.attrib


for date in root.iter("PubDate"):
    print(date.tag)

PubDate
PubDate
PubDate


In [734]:

for child in paper_1[0]:
    print(child.tag)

PMID
DateCompleted
DateRevised
Article
MedlineJournalInfo
ChemicalList
CitationSubset
MeshHeadingList
KeywordList


In [735]:
root[0].tag

'PubmedArticle'

In [736]:
root[0][0].tag

'MedlineCitation'

In [737]:
root[0][0][0].tag

'PMID'

In [738]:
root[0][0][0][0].tag

IndexError: child index out of range

In [739]:
print(root[0][0].tag)
for count in range(10):
    

    print(root[0][0][count].tag)

MedlineCitation
PMID
DateCompleted
DateRevised
Article
MedlineJournalInfo
ChemicalList
CitationSubset
MeshHeadingList
KeywordList


IndexError: child index out of range

In [740]:
root[0]

<Element 'PubmedArticle' at 0x10b8c5540>

In [741]:
root[0][0]

<Element 'MedlineCitation' at 0x10b8c5f40>

In [742]:
root[0][0][0]

<Element 'PMID' at 0x10b8c5400>

In [762]:
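def get_children(parent, label=None):
    """Return the attributes and text of the first element yielded by
    parent.iter(label), plus the tags of every element yielded after it."""
    summary = {}
    tags = []
    parent_name = None
    for count, element in enumerate(parent.iter(label)):
        if count == 0:
            # With label=None, iter() yields the parent itself first
            parent_name = element.tag
            summary[parent_name + '_attr'] = element.attrib
            summary[parent_name + '_text'] = element.text
        else:
            # Later elements are all descendants, not only direct children
            tags.append(element.tag)
    # Only add the tag list if iter() yielded at least one element
    if parent_name is not None:
        summary[parent_name + '_children'] = tags

    return summary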
    

In [946]:
get_children(root[1][0])

{'MedlineCitation_attr': {'Status': 'MEDLINE',
  'Owner': 'NLM',
  'IndexingMethod': 'Automated'},
 'MedlineCitation_text': None,
 'MedlineCitation_children': ['PMID',
  'DateCompleted',
  'Year',
  'Month',
  'Day',
  'DateRevised',
  'Year',
  'Month',
  'Day',
  'Article',
  'Journal',
  'ISSN',
  'JournalIssue',
  'Volume',
  'Issue',
  'PubDate',
  'Year',
  'Month',
  'Day',
  'Title',
  'ISOAbbreviation',
  'ArticleTitle',
  'Pagination',
  'StartPage',
  'EndPage',
  'MedlinePgn',
  'ELocationID',
  'Abstract',
  'AbstractText',
  'i',
  'i',
  'i',
  'i',
  'AuthorList',
  'Author',
  'LastName',
  'ForeName',
  'Initials',
  'AffiliationInfo',
  'Affiliation',
  'Author',
  'LastName',
  'ForeName',
  'Initials',
  'Identifier',
  'AffiliationInfo',
  'Affiliation',
  'Author',
  'LastName',
  'ForeName',
  'Initials',
  'AffiliationInfo',
  'Affiliation',
  'Author',
  'LastName',
  'ForeName',
  'Initials',
  'AffiliationInfo',
  'Affiliation',
  'Author',
  'LastName',
  '

In [761]:
for x in root[0][0].iter("PMID"):
    print(x.attrib)

{'Version': '1'}
