# Data collection of PubMed Central full text articles

## 1. Introduction

Europe PubMed Central (Europe PMC) is an open-access repository of biomedical research that contains ~45 million abstracts and ~10 million full text articles, including research articles, preprints, micropublications, books, reviews, and protocols.

Data will be collected from [Europe PMC](https://europepmc.org/) for the PubMed Central index of full text scientific papers via the [Articles RESTful API](https://europepmc.org/RestfulWebService). The query used for the project relates to COVID-19 and drug repurposing, covering publications from 2019 to 2022.

## 2. Install/import libraries

In [None]:
!pip install diskcache

In [None]:
import requests
import urllib
import urllib.parse as urlparse
import json
import pandas as pd
import time
import pickle
import re
import concurrent.futures  # the futures submodule must be imported explicitly


from collections import Counter
from diskcache import Cache
from multiprocessing.pool import ThreadPool
from bs4 import BeautifulSoup, NavigableString, Tag
from itertools import chain

## 3. Download metadata

Construct a function that uses the RESTful web service [search module](https://europepmc.org/RestfulWebService#!/Europe32PMC32Articles32RESTful32API/search) to search the publications database and collect the following fields for each result: 'pmcid', 'published', 'revised', 'title', 'journal', 'authors', 'doi', 'pdf_url'. The resulting dictionary is then converted to a DataFrame.

The PubMed Central Identifier (pmcid) has been chosen because it is only returned if the full text is available in Europe PMC, whereas the PubMed Identifier (pmid) links to abstracts.





In [None]:
# Adapted from https://github.com/carrlucy/HSL_OA/blob/main/streamlit_app.py

def get_search_results(query):
    """Page through the Europe PMC search results and return the metadata as a DataFrame."""
    url = 'https://www.ebi.ac.uk/europepmc/webservices/rest/search'
    columns = ['pmcid', 'published', 'revised', 'title', 'journal', 'authors', 'doi', 'pdf_url']
    dct = {col: [] for col in columns}

    cr_mrk = ''    # current cursor mark
    nxt_mrk = '*'  # next cursor mark ('*' requests the first page)
    while cr_mrk != nxt_mrk:
        params = {'query': query, 'resultType': 'core', 'synonym': 'TRUE',
                  'cursorMark': nxt_mrk, 'pageSize': '1000', 'format': 'json'}
        response = requests.get(url, params=params)
        response.raise_for_status()
        rjson = response.json()
        cr_mrk = urlparse.unquote(rjson['request']['cursorMark'])
        # Fall back to the current mark (ending the loop) if no further cursor mark is returned
        nxt_mrk = urlparse.unquote(rjson.get('nextCursorMark', cr_mrk))
        for rslt in rjson['resultList']['result']:
            # Missing fields are recorded as 0 so that they can be filtered later
            pmcid = rslt.get('pmcid', 0)
            dct['pmcid'].append(pmcid)
            dct['published'].append(rslt.get('firstPublicationDate', 0))
            dct['revised'].append(rslt.get('dateOfRevision', 0))
            dct['title'].append(rslt.get('title', 0))
            dct['journal'].append(rslt.get('journalInfo', {}).get('journal', {}).get('title', 0))
            dct['authors'].append(rslt.get('authorString', 0))
            dct['doi'].append(rslt.get('doi', 0))
            dct['pdf_url'].append(f"https://europepmc.org/articles/{pmcid}?pdf=render" if pmcid else 0)

    return pd.DataFrame.from_dict(dct, orient='columns')
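
The `while` loop in `get_search_results()` keeps requesting pages until the cursor mark stops changing. The sketch below isolates that stopping rule so it can be run offline. The `paginate` helper, the `FAKE_PAGES` data and the simplified page layout are hypothetical stand-ins for illustration only, not the actual Europe PMC response format.

```python
def paginate(fetch_page):
    """Collect results until the next cursor mark equals the current one."""
    results = []
    current, nxt = '', '*'
    while current != nxt:
        page = fetch_page(nxt)
        current = page['cursorMark']
        nxt = page['nextCursorMark']
        results.extend(page['results'])
    return results


# Hypothetical three-page response, keyed by the cursor mark that requests it
FAKE_PAGES = {
    '*': {'cursorMark': '*', 'nextCursorMark': 'A', 'results': [1, 2]},
    'A': {'cursorMark': 'A', 'nextCursorMark': 'B', 'results': [3, 4]},
    'B': {'cursorMark': 'B', 'nextCursorMark': 'B', 'results': []},
}

collected = paginate(FAKE_PAGES.__getitem__)
print(collected)
```

The final page returns its own cursor mark as the next one, which is what ends the loop.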

### 3.1 Construct search query

The direct search syntax allows the Boolean operators AND, OR and NOT, as well as phrases inside double quotes, which can be combined with field searches to construct more complex queries.

Field names IN_EPMC, HAS_FT and OPEN_ACCESS are used to return full text Open Access publications in Europe PMC. The search can also be limited by PUB_TYPE. Here Journal Article has been chosen - the [most common and general publication type](https://www.nlm.nih.gov/bsd/indexing/training/PUB_040.html) used for original full text research, review or other reports published in a journal. Research Article has also been selected. See the [Advanced search](https://europepmc.org/advancesearch) page for a full list of publication types.

The [Search syntax reference](https://europepmc.org/searchsyntax) provides more information on query syntax and search field combinations with examples.

Refer also to the [Web Service Reference](https://europepmc.org/docs/EBI_Europe_PMC_Web_Service_Reference.pdf) for method parameters and output elements.
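
The Boolean and field-search syntax described above can also be assembled programmatically, which makes it easier to change the term lists. This is a minimal sketch: the `any_of` helper is a hypothetical name introduced here, and it only covers the OR groups and simple field flags, not the full search syntax.

```python
def any_of(terms, field=None):
    """Join terms with OR inside parentheses, quoting multi-word phrases."""
    def fmt(term):
        quoted = f'"{term}"' if ' ' in term else term
        return f'{field}:{quoted}' if field else quoted
    return '(' + ' OR '.join(fmt(term) for term in terms) + ')'


query = ' AND '.join([
    any_of(['covid', 'coronavirus', 'sars-cov-2']),
    any_of(['drug discovery', 'drug repurposing', 'drug repositioning']),
    any_of(['Y'], field='OPEN_ACCESS'),
    '(FIRST_PDATE:[2019-01-01 TO 2022-12-31])',
])
print(query)
```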



In [None]:
query = (
    '(covid OR coronavirus OR sars-cov-2) AND ("drug discovery" OR "drug repurposing" OR "drug repositioning")'
    ' AND (IN_EPMC:Y) AND (HAS_FT:Y) AND (OPEN_ACCESS:Y) AND (PUB_TYPE:"Journal Article" OR PUB_TYPE:"Research-article")'
    ' NOT (PUB_TYPE:"Abstract" OR PUB_TYPE:"News" OR PUB_TYPE:"Editorial" OR PUB_TYPE:"Letter")'
    ' AND (FIRST_PDATE:[2019-01-01 TO 2022-12-31])'
)
search_results = get_search_results(query)
search_results

Unnamed: 0,pmcid,published,revised,title,journal,authors,doi,pdf_url
0,PMC9549161,2022-09-26,2022-10-14,Drug repositioning: A bibliometric analysis.,Frontiers in pharmacology,"Sun G, Dong D, Dong Z, Zhang Q, Fang H, Wang C...",10.3389/fphar.2022.974849,https://europepmc.org/articles/PMC9549161?pdf=...
1,PMC9539342,2022-09-22,2022-11-12,A review on computer-aided chemogenomics and d...,Chemical biology & drug design,"Maghsoudi S, Taghavi Shahraki B, Rameh F, Naza...",10.1111/cbdd.14136,https://europepmc.org/articles/PMC9539342?pdf=...
2,PMC9357751,2022-12-01,2022-12-05,Repurposing Molnupiravir as a new opportunity ...,"Journal of Generic Medicines : Duplicate, mark...",0,0,https://europepmc.org/articles/PMC9357751?pdf=...
3,PMC9346052,2022-08-03,2022-09-05,Scope of repurposed drugs against the potentia...,Structural chemistry,"Niranjan V, Setlur AS, Karunakaran C, Uttarkar...",10.1007/s11224-022-02020-z,https://europepmc.org/articles/PMC9346052?pdf=...
4,PMC9775208,2022-12-15,2022-12-25,Drug Repurposing Using Gene Co-Expression and ...,Biology,"Mailem RC, Tayo LL.",10.3390/biology11121827,https://europepmc.org/articles/PMC9775208?pdf=...
...,...,...,...,...,...,...,...,...
11411,PMC6328940,2019-01-01,2020-03-09,β-RA reduces DMQ/CoQ ratio and rescues the enc...,EMBO molecular medicine,"Hidalgo-Gutiérrez A, Barriocanal-Casado E, Bak...",10.15252/emmm.201809466,https://europepmc.org/articles/PMC6328940?pdf=...
11412,PMC6598402,2019-06-21,2020-09-28,Alzheimer Disease Pathogenesis: Insights From ...,Frontiers in neuroscience,"Chen XQ, Mobley WC.",10.3389/fnins.2019.00659,https://europepmc.org/articles/PMC6598402?pdf=...
11413,PMC6481739,2019-02-05,2020-09-28,Modeling cardiac complexity: Advancements in m...,APL bioengineering,"Callaghan NI, Hadipour-Lakmehsari S, Lee SH, G...",10.1063/1.5055873,https://europepmc.org/articles/PMC6481739?pdf=...
11414,PMC6624471,2019-07-05,2020-09-28,Tissue Response to Neural Implants: The Use of...,Frontiers in neuroscience,"Gulino M, Kim D, Pané S, Santos SD, Pêgo AP.",10.3389/fnins.2019.00689,https://europepmc.org/articles/PMC6624471?pdf=...


Time to complete was 2m 18s for 11,416 rows of metadata.

In [None]:
with open('2023-01-06_europepmc_df_json_ft_urls_all.pickle', 'wb') as f:
  pickle.dump(search_results, f)

## 4. Data cleaning

Basic data cleaning to identify and remove missing values, duplicates, unwanted categories etc.

### 4.1 Rows with missing authors



In [None]:
search_results.loc[search_results['authors']==0]

Unnamed: 0,pmcid,published,revised,title,journal,authors,doi,pdf_url
2,PMC9357751,2022-12-01,2022-12-05,Repurposing Molnupiravir as a new opportunity ...,"Journal of Generic Medicines : Duplicate, mark...",0,0,https://europepmc.org/articles/PMC9357751?pdf=...
287,PMC9803264,2022-12-30,2023-01-06,DFT investigations and molecular docking as po...,Journal of molecular structure,0,0,https://europepmc.org/articles/PMC9803264?pdf=...
334,PMC9428111,2022-08-31,2022-09-30,Evaluation of flavonoids as potential inhibito...,Journal of the Indian Chemical Society,0,0,https://europepmc.org/articles/PMC9428111?pdf=...
361,PMC9548488,2022-11-01,2022-11-25,Language models for the prediction of SARS-CoV...,The international journal of high performance ...,0,0,https://europepmc.org/articles/PMC9548488?pdf=...
362,PMC9798891,2022-08-26,2023-01-05,Identification and semisynthesis of (−)-anisom...,National science review,0,0,https://europepmc.org/articles/PMC9798891?pdf=...
397,PMC9711896,2022-12-01,2022-12-05,Potential of vibrational spectroscopy coupled ...,Computer methods and programs in biomedicine,0,0,https://europepmc.org/articles/PMC9711896?pdf=...
438,PMC9762046,2022-12-19,2023-01-04,COVID-19 infection and metabolic comorbidities...,Human Nutrition & Metabolism,0,0,https://europepmc.org/articles/PMC9762046?pdf=...
590,PMC9812491,2022-01-01,2023-01-06,Drug repositioning based on heterogeneous netw...,Frontiers in pharmacology,0,0,https://europepmc.org/articles/PMC9812491?pdf=...
917,PMC9452103,2022-08-17,2022-09-12,SARS-CoV-2 immunity and vaccine strategies in ...,Oxford open immunology,0,0,https://europepmc.org/articles/PMC9452103?pdf=...
986,PMC9273052,2022-07-11,2022-07-18,Cover Story.,Acta pharmaceutica Sinica. B,0,10.1016/s2211-3835(22)00283-0,https://europepmc.org/articles/PMC9273052?pdf=...


In [None]:
len(search_results.loc[search_results['authors']==0])

54

This is not necessarily a problem, so we will keep these rows and still extract the full text for them.

### 4.2 Remove unwanted categories

To ensure that only full text articles remain, we will remove other categories (Abstracts, Research Abstract, Annual Meeting, Congress, Conference).

In [None]:
# One case-insensitive pattern; 'Abstract' also matches 'Abstracts'
pattern = 'Abstract|Annual Meeting|Congress|Conference'
str_remove = search_results[search_results['title'].str.contains(pattern, case=False)]
str_remove

Unnamed: 0,pmcid,published,revised,title,journal,authors,doi,pdf_url
1417,PMC9615638,2022-09-28,2022-11-08,"Atti 55° Congresso Nazionale SItI Padova, 28 s...",Journal of preventive medicine and hygiene,0,10.15167/2421-4248/jpmh2022.63.2s1,https://europepmc.org/articles/PMC9615638?pdf=...
1712,PMC9623364,2022-01-01,2022-11-04,Meeting report: 34th international conference ...,Antiviral chemistry & chemotherapy,"Brancale A, Carter K, Delang L, Deval J, Duran...",10.1177/20402066221130853,https://europepmc.org/articles/PMC9623364?pdf=...
1763,PMC8011435,2021-03-01,2022-09-29,Drugmonizome and Drugmonizome-ML: integration ...,Database : the journal of biological databases...,"Kropiwnicki E, Evangelista JE, Stein DJ, Clark...",10.1093/database/baab017,https://europepmc.org/articles/PMC8011435?pdf=...
2995,PMC8041998,2021-04-12,2021-04-29,COVID-19 information retrieval with deep-learn...,NPJ digital medicine,"Esteva A, Kale A, Paulus R, Hashimoto K, Yin W...",10.1038/s41746-021-00437-0,https://europepmc.org/articles/PMC8041998?pdf=...
5581,PMC7418285,2020-08-11,2020-12-18,Meeting report of the 49th annual meeting of t...,Inflammation research : official journal of th...,"Kay L, Obara I.",10.1007/s00011-020-01390-6,https://europepmc.org/articles/PMC7418285?pdf=...
6137,PMC7767910,2020-12-28,2021-01-06,Accelerating bioinformatics research with Inte...,BMC bioinformatics,"Guo Y, Shen L, Shi X, Wang K, Dai Y, Zhao Z.",10.1186/s12859-020-03890-y,https://europepmc.org/articles/PMC7767910?pdf=...
6256,PMC8265285,2021-07-08,2022-03-17,Lewy Body Dementia Association's Industry Advi...,Alzheimer's research & therapy,"Goldman JG, Boeve BF, Armstrong MJ, Galasko DR...",10.1186/s13195-021-00868-7,https://europepmc.org/articles/PMC8265285?pdf=...
7109,PMC9169230,2022-04-08,2022-07-16,The Cure VCP Scientific Conference 2021: Molec...,Neurobiology of disease,"Johnson MA, Klickstein JA, Khanna R, Gou Y, Cu...",10.1016/j.nbd.2022.105722,https://europepmc.org/articles/PMC9169230?pdf=...
7154,PMC8679246,2021-12-17,2022-04-07,Program Abstracts from The GSA 2021 Annual Sci...,Innovation in aging,0,10.1093/geroni/igab046,https://europepmc.org/articles/PMC8679246?pdf=...
7213,PMC7376524,2020-07-23,2021-01-27,Reflections on the upsurge of virtual cancer c...,British journal of cancer,Speirs V.,10.1038/s41416-020-1000-x,https://europepmc.org/articles/PMC7376524?pdf=...


In [None]:
len(str_remove)

18
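
The title filter above is a case-insensitive substring match, so a keyword such as 'Abstract' also matches longer words that contain it. The sketch below compares the substring pattern with a stricter whole-word variant using the standard `re` module. The three titles are made up for illustration, and the whole-word pattern is a suggestion rather than the filter used in this notebook.

```python
import re

# Hypothetical titles, for illustration only
titles = [
    'Program Abstracts from an Annual Meeting',
    'Integration and abstraction of molecule attributes',
    'Drug repurposing for COVID-19',
]

substring = re.compile(r'Abstract|Annual Meeting|Congress|Conference', re.IGNORECASE)
whole_word = re.compile(r'\b(?:Abstracts?|Annual Meeting|Congress|Conference)\b', re.IGNORECASE)

substring_hits = [t for t in titles if substring.search(t)]
whole_word_hits = [t for t in titles if whole_word.search(t)]

print(len(substring_hits), len(whole_word_hits))
```

The whole-word variant no longer flags 'abstraction', but it also stops matching plurals that are not listed explicitly (for example 'Conferences'), so the flagged rows are still worth reviewing by eye.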

Create a new DataFrame without the 18 rows and reset the index.

In [None]:
# Drop the flagged rows by their index labels
search_results_new = search_results.drop(str_remove.index).reset_index(drop=True)
search_results_new

Unnamed: 0,pmcid,published,revised,title,journal,authors,doi,pdf_url
0,PMC9549161,2022-09-26,2022-10-14,Drug repositioning: A bibliometric analysis.,Frontiers in pharmacology,"Sun G, Dong D, Dong Z, Zhang Q, Fang H, Wang C...",10.3389/fphar.2022.974849,https://europepmc.org/articles/PMC9549161?pdf=...
1,PMC9539342,2022-09-22,2022-11-12,A review on computer-aided chemogenomics and d...,Chemical biology & drug design,"Maghsoudi S, Taghavi Shahraki B, Rameh F, Naza...",10.1111/cbdd.14136,https://europepmc.org/articles/PMC9539342?pdf=...
2,PMC9357751,2022-12-01,2022-12-05,Repurposing Molnupiravir as a new opportunity ...,"Journal of Generic Medicines : Duplicate, mark...",0,0,https://europepmc.org/articles/PMC9357751?pdf=...
3,PMC9346052,2022-08-03,2022-09-05,Scope of repurposed drugs against the potentia...,Structural chemistry,"Niranjan V, Setlur AS, Karunakaran C, Uttarkar...",10.1007/s11224-022-02020-z,https://europepmc.org/articles/PMC9346052?pdf=...
4,PMC9775208,2022-12-15,2022-12-25,Drug Repurposing Using Gene Co-Expression and ...,Biology,"Mailem RC, Tayo LL.",10.3390/biology11121827,https://europepmc.org/articles/PMC9775208?pdf=...
...,...,...,...,...,...,...,...,...
11393,PMC6328940,2019-01-01,2020-03-09,β-RA reduces DMQ/CoQ ratio and rescues the enc...,EMBO molecular medicine,"Hidalgo-Gutiérrez A, Barriocanal-Casado E, Bak...",10.15252/emmm.201809466,https://europepmc.org/articles/PMC6328940?pdf=...
11394,PMC6598402,2019-06-21,2020-09-28,Alzheimer Disease Pathogenesis: Insights From ...,Frontiers in neuroscience,"Chen XQ, Mobley WC.",10.3389/fnins.2019.00659,https://europepmc.org/articles/PMC6598402?pdf=...
11395,PMC6481739,2019-02-05,2020-09-28,Modeling cardiac complexity: Advancements in m...,APL bioengineering,"Callaghan NI, Hadipour-Lakmehsari S, Lee SH, G...",10.1063/1.5055873,https://europepmc.org/articles/PMC6481739?pdf=...
11396,PMC6624471,2019-07-05,2020-09-28,Tissue Response to Neural Implants: The Use of...,Frontiers in neuroscience,"Gulino M, Kim D, Pané S, Santos SD, Pêgo AP.",10.3389/fnins.2019.00689,https://europepmc.org/articles/PMC6624471?pdf=...


In [None]:
len(search_results_new)

11398

Validate by checking that no titles contain 'Abstract'.

In [None]:
contain_abstract = search_results_new[search_results_new['title'].str.contains('Abstract')]
len(contain_abstract)

0

### 4.3 Check for missing values

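Concise summary of the DataFrame to see if there are any columns with missing values. Note that missing fields were recorded as 0 when the metadata was downloaded, so `info()` counts them as non-null.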

In [None]:
search_results_new.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11398 entries, 0 to 11397
Data columns (total 8 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   pmcid      11398 non-null  object
 1   published  11398 non-null  object
 2   revised    11398 non-null  object
 3   title      11398 non-null  object
 4   journal    11398 non-null  object
 5   authors    11398 non-null  object
 6   doi        11398 non-null  object
 7   pdf_url    11398 non-null  object
dtypes: object(8)
memory usage: 712.5+ KB
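
Because `get_search_results()` records missing fields as 0 rather than as NaN, a per-column count of zeros is a more informative check than the non-null counts. This is a minimal sketch on a made-up three-row DataFrame; the same expression could be applied to `search_results_new`.

```python
import pandas as pd

# Hypothetical sample that mimics the 0 placeholders in the metadata
sample = pd.DataFrame(
    {
        'pmcid': ['PMC1', 'PMC2', 'PMC3'],
        'authors': ['Smith J.', 0, 0],
        'doi': ['10.1/x', 0, '10.1/y'],
    },
    dtype=object,
)

# Element-wise comparison with 0, then count the matches in each column
missing_counts = (sample == 0).sum()
print(missing_counts)
```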


### 4.4 Remove duplicates

Check value counts for duplicate pmcids.

In [None]:
search_results_new['pmcid'].value_counts(ascending=True)

PMC9549161    1
PMC7892713    1
PMC8576417    1
PMC8100288    1
PMC9032529    1
             ..
PMC7705431    1
PMC8216129    1
PMC7454275    1
PMC6409730    1
PMC7365084    2
Name: pmcid, Length: 11397, dtype: int64

There is one duplicate, for PMC7365084.

In [None]:
dup_pmcid = search_results_new.loc[search_results_new['pmcid']=='PMC7365084']
dup_pmcid

Unnamed: 0,pmcid,published,revised,title,journal,authors,doi,pdf_url
7504,PMC7365084,2020-08-01,2021-08-31,Targeting Two-Pore Channels: Current Progress ...,Trends in pharmacological sciences,"Jin X, Zhang Y, Alharbi A, Hanbashi A, Alhosha...",10.1016/j.tips.2020.06.002,https://europepmc.org/articles/PMC7365084?pdf=...
7530,PMC7365084,2020-07-16,2020-07-20,Targeting Two-Pore Channels: Current Progress ...,Trends in pharmacological sciences,"Jin X, Zhang Y, Alharbi A, Hanbashi A, Alhosha...",0,https://europepmc.org/articles/PMC7365084?pdf=...


Drop duplicate pmcid keeping the row with the DOI and most recent revised date.

In [None]:
search_results_new = search_results_new.drop(7530).reset_index(drop=True)
search_results_new

Unnamed: 0,pmcid,published,revised,title,journal,authors,doi,pdf_url
0,PMC9549161,2022-09-26,2022-10-14,Drug repositioning: A bibliometric analysis.,Frontiers in pharmacology,"Sun G, Dong D, Dong Z, Zhang Q, Fang H, Wang C...",10.3389/fphar.2022.974849,https://europepmc.org/articles/PMC9549161?pdf=...
1,PMC9539342,2022-09-22,2022-11-12,A review on computer-aided chemogenomics and d...,Chemical biology & drug design,"Maghsoudi S, Taghavi Shahraki B, Rameh F, Naza...",10.1111/cbdd.14136,https://europepmc.org/articles/PMC9539342?pdf=...
2,PMC9357751,2022-12-01,2022-12-05,Repurposing Molnupiravir as a new opportunity ...,"Journal of Generic Medicines : Duplicate, mark...",0,0,https://europepmc.org/articles/PMC9357751?pdf=...
3,PMC9346052,2022-08-03,2022-09-05,Scope of repurposed drugs against the potentia...,Structural chemistry,"Niranjan V, Setlur AS, Karunakaran C, Uttarkar...",10.1007/s11224-022-02020-z,https://europepmc.org/articles/PMC9346052?pdf=...
4,PMC9775208,2022-12-15,2022-12-25,Drug Repurposing Using Gene Co-Expression and ...,Biology,"Mailem RC, Tayo LL.",10.3390/biology11121827,https://europepmc.org/articles/PMC9775208?pdf=...
...,...,...,...,...,...,...,...,...
11392,PMC6328940,2019-01-01,2020-03-09,β-RA reduces DMQ/CoQ ratio and rescues the enc...,EMBO molecular medicine,"Hidalgo-Gutiérrez A, Barriocanal-Casado E, Bak...",10.15252/emmm.201809466,https://europepmc.org/articles/PMC6328940?pdf=...
11393,PMC6598402,2019-06-21,2020-09-28,Alzheimer Disease Pathogenesis: Insights From ...,Frontiers in neuroscience,"Chen XQ, Mobley WC.",10.3389/fnins.2019.00659,https://europepmc.org/articles/PMC6598402?pdf=...
11394,PMC6481739,2019-02-05,2020-09-28,Modeling cardiac complexity: Advancements in m...,APL bioengineering,"Callaghan NI, Hadipour-Lakmehsari S, Lee SH, G...",10.1063/1.5055873,https://europepmc.org/articles/PMC6481739?pdf=...
11395,PMC6624471,2019-07-05,2020-09-28,Tissue Response to Neural Implants: The Use of...,Frontiers in neuroscience,"Gulino M, Kim D, Pané S, Santos SD, Pêgo AP.",10.3389/fnins.2019.00689,https://europepmc.org/articles/PMC6624471?pdf=...
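
Dropping the duplicate by its index label works for a single case. The rule stated above (keep the row with a DOI and the most recent revised date) can also be applied without hard-coding an index. This is a sketch on made-up rows, and it assumes that every 'revised' value is an ISO date string, since a 0 placeholder in that column would not sort against strings.

```python
import pandas as pd

# Hypothetical rows: PMC2 appears twice, once without a DOI
df = pd.DataFrame(
    {
        'pmcid': ['PMC1', 'PMC2', 'PMC2'],
        'revised': ['2022-01-01', '2020-07-20', '2021-08-31'],
        'doi': ['10.1/a', 0, '10.1/b'],
    },
    dtype=object,
)

deduped = (
    df.assign(has_doi=df['doi'] != 0)
      # Rows with a DOI first, then the latest revision first
      .sort_values(['has_doi', 'revised'], ascending=False)
      .drop_duplicates(subset='pmcid', keep='first')
      .drop(columns='has_doi')
      .sort_index()
      .reset_index(drop=True)
)
print(deduped)
```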


Check that the duplicate has gone.

In [None]:
search_results_new['pmcid'].value_counts(ascending=True)

PMC9549161    1
PMC8875160    1
PMC7892713    1
PMC8576417    1
PMC8100288    1
             ..
PMC7705431    1
PMC8216129    1
PMC9558053    1
PMC8015232    1
PMC6409730    1
Name: pmcid, Length: 11397, dtype: int64

In [None]:
with open('2023-01-06_europepmc_df_json_ft_urls_11397.pickle', 'wb') as f:
  pickle.dump(search_results_new, f)

## 5. Download full text as XML

### 5.1 DiskCache

[DiskCache](https://github.com/grantjenks/python-diskcache) can effectively cache expensive computations, improve performance and reduce computation time for heavy data processing, especially for functions with repeatable results over time.

In [None]:
# Initialise the Cache object, specifying the path to the cache directory
cache = Cache('/content/drive/MyDrive/cache')

Define a function to download the full article text for each pmcid using the Europe PMC API endpoint for full text XML.

In [None]:
# Define function and use the cache.memoize decorator to cache its results
@cache.memoize(expire=3600)  # Results are cached for 1 hour
def dl_article_xml(pmcid: str):
    url = f'https://www.ebi.ac.uk/europepmc/webservices/rest/{pmcid}/fullTextXML'
    response = requests.get(url, allow_redirects=True)
    if response.status_code != 200:
        # Surface the response body so that failed downloads can be inspected later
        raise Exception(response.text)
    return response.text

### 5.2 Multithreading

Multithreading is used here since the code is I/O-bound rather than CPU-bound. Executing multiple threads concurrently reduced the download time to 22m 47s, compared with ~6h 30m when iterating through the pmcid list in a for loop and calling dl_article_xml() sequentially for each pmcid.

In [None]:
with concurrent.futures.ThreadPoolExecutor(20) as executor:
     futures = [executor.submit(dl_article_xml, pmcid) for pmcid in search_results_new.pmcid]
     concurrent.futures.wait(futures)

Create a dictionary of futures which are proxies for results that do not yet exist but will in the future.

In [None]:
futures_map = dict(zip(search_results_new.pmcid, futures))

In [None]:
len(futures_map)

11397

Create dictionary of exceptions with pmcid as key and exception error message as value.

In [None]:
exceptions = {pmcid: f.exception() for pmcid, f in futures_map.items() if f.exception() is not None}
exceptions

{'PMC8018918': Exception(''),
 'PMC7640961': Exception(''),
 'PMC8018905': Exception(''),
 'PMC7098069': Exception(''),
 'PMC7382535': Exception(''),
 'PMC7936759': Exception(''),
 'PMC8014535': Exception(''),
 'PMC7383733': Exception(''),
 'PMC7558230': Exception(''),
 'PMC8115429': Exception(''),
 'PMC7497212': Exception(''),
 'PMC7321661': Exception(''),
 'PMC8459260': Exception('')}

In [None]:
len(exceptions)

13
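
The submit, wait and collect-exceptions pattern used above can be tried offline with a stand-in for the download function. In this sketch `fake_download` and the three identifiers are hypothetical; only the `concurrent.futures` calls mirror the cells above.

```python
import concurrent.futures


def fake_download(pmcid: str) -> str:
    """Hypothetical stand-in for dl_article_xml(); fails for one made-up ID."""
    if pmcid == 'PMC0000002':
        raise Exception('')
    return f'<article id="{pmcid}"/>'


pmcids = ['PMC0000001', 'PMC0000002', 'PMC0000003']

with concurrent.futures.ThreadPoolExecutor(max_workers=3) as executor:
    futures = [executor.submit(fake_download, p) for p in pmcids]
    concurrent.futures.wait(futures)

# Futures are created in input order, so they can be zipped with the IDs
futures_map = dict(zip(pmcids, futures))
exceptions = {p: f.exception() for p, f in futures_map.items() if f.exception() is not None}
results = {p: f.result() for p, f in futures_map.items() if f.exception() is None}

print(list(exceptions), len(results))
```

An exception raised inside a worker is stored on its future rather than propagated, which is why the failures only become visible when `exception()` or `result()` is called.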

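13 exceptions were raised, each with an empty message, which means that the service returned a non-200 status code with an empty response body. These will be handled separately.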

In [None]:
with open('2023-01-06_europepmc_df_json_ft_urls_13_exceptions.pickle', 'wb') as f:
  pickle.dump(exceptions, f)

Initialise the cache for the full text XML.

In [None]:
cache = Cache('/content/drive/MyDrive/cache')

Amend the function to handle the exceptions and run again to download the full text.

In [None]:
@cache.memoize(expire=3600)  # Results are cached for 1 hour
def dl_article_xml(pmcid: str):
    url = f'https://www.ebi.ac.uk/europepmc/webservices/rest/{pmcid}/fullTextXML'
    response = requests.get(url, allow_redirects=True)
    if response.status_code != 200:
        if response.text == '':
            # Empty error response: return the placeholder '0' instead of raising
            return '0'
        raise Exception(response.text)
    return response.text

In [None]:
with ThreadPool(20) as pool:
  dl_results = pool.map(dl_article_xml, search_results_new.pmcid)

In [None]:
len(dl_results)

11397
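
Since the amended function returns the placeholder '0' for failed downloads and `pool.map()` returns results in input order, the affected pmcids can be recovered by pairing the two sequences. This is a sketch with made-up identifiers and XML strings standing in for `search_results_new.pmcid` and `dl_results`.

```python
# Hypothetical stand-ins for search_results_new.pmcid and dl_results
pmcids = ['PMC0000001', 'PMC0000002', 'PMC0000003']
dl_results = ['<article/>', '0', '<article/>']

# Pair each ID with its result and keep the IDs whose download failed
failed_pmcids = [pmcid for pmcid, xml in zip(pmcids, dl_results) if xml == '0']
print(failed_pmcids)
```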

In [None]:
with open('2023-01-06_europepmc_ft_xml.pickle', 'wb') as f:
  pickle.dump(dl_results, f)

## 6. Clean and return XML as Beautiful Soup object

Function to remove DOCTYPE declaration and return the cleaned, parsed text as a [Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/) object. This object represents the document as a nested data structure so that we can navigate and extract text using XML tags.

The simplest way to navigate the parse tree is to find a tag by name so we will use the find() method to return the XML for the `<body>` tag content.

In [None]:
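def soupify(text):
    # Raw string so that the regex escapes (\[, \s, \S) are not treated as string escapes
    cleaned_text = re.sub(r'<!DOCTYPE.*(\[[\s\S]*?\])?>', '', text)
    # find() returns None when the parsed document has no <body> element
    return BeautifulSoup(cleaned_text, 'lxml-xml').find("body")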
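
The DOCTYPE pattern in `soupify()` starts with a greedy `.*`, which runs to the end of the line before backtracking to the last `>` on that line. The sketch below shows how the pattern behaves on two made-up inputs, using only the standard `re` module. The `BOUNDED` pattern is a hypothetical alternative, not the one used in this notebook, and whether any downloaded article is affected is not established here.

```python
import re

ORIGINAL = re.compile(r'<!DOCTYPE.*(\[[\s\S]*?\])?>')
# Hypothetical alternative: stop at the first '>' after an optional [...] subset
BOUNDED = re.compile(r'<!DOCTYPE[^>\[]*(\[[\s\S]*?\])?\s*>')

# Made-up documents: DOCTYPE on its own line vs. everything on one line
multi_line = '<!DOCTYPE article>\n<article><body><p>Hi</p></body></article>'
one_line = '<!DOCTYPE article><article><body><p>Hi</p></body></article>'

original_multi = ORIGINAL.sub('', multi_line)
original_one = ORIGINAL.sub('', one_line)
bounded_one = BOUNDED.sub('', one_line)

print(repr(original_multi))
print(repr(original_one))
print(repr(bounded_one))
```

When the declaration sits on its own line, both patterns remove only the declaration. When the whole document shares one line, the original pattern removes everything up to the last `>`, which would leave no `<body>` to find.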

The downloaded results were parsed in batches of up to 2500 articles. Each batch took ~3m to complete, for file sizes of ~9-10 GB.

In [None]:
soup_results_1 = list(map(soupify, dl_results[0:2500]))

In [None]:
soup_results_2 = list(map(soupify, dl_results[2500:5000]))

In [None]:
soup_results_3 = list(map(soupify, dl_results[5000:7500]))

In [None]:
soup_results_4 = list(map(soupify, dl_results[7500:10000]))

In [None]:
soup_results_5 = list(map(soupify, dl_results[10000:11397]))

### 6.1 Find all tags including descendants

The function below uses the find_all() method to return all of the direct child and descendant tags of the `<body>` tag for the first 2500 articles.

In [None]:
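def find_all_tags(article):
    tags_list = []
    try:
        # find_all(True) matches every tag below the given element
        for tag in article.find_all(True):
            if tag.name is not None:
                tags_list.append(tag.name)
    except Exception as e:
        print(e)

    # Unique tag names per article, in alphabetical order
    return sorted(set(tags_list))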

In [None]:
all_tags = list(map(find_all_tags, soup_results_1))

'NoneType' object has no attribute 'find_all'
'NoneType' object has no attribute 'find_all'
...

The find_all() method scans all descendants of the `<body>` tag and returns a list of matching tags. The attribute errors above were raised for articles where soupify() returned None because no `<body>` tag was found, so these articles have an empty tag list.

In [None]:
len(all_tags)

2500

Print out a list of tags.

In [None]:
all_tags

[['caption',
  'fig',
  'fn',
  'graphic',
  'italic',
  'label',
  'p',
  'sec',
  'sup',
  'table',
  'table-wrap',
  'table-wrap-foot',
  'tbody',
  'td',
  'th',
  'thead',
  'title',
  'tr',
  'xref'],
 ['caption',
  'col',
  'ext-link',
  'fig',
  'graphic',
  'italic',
  'label',
  'list',
  'list-item',
  'media',
  'p',
  'sec',
  'styled-content',
  'sub',
  'sup',
  'supplementary-material',
  'table',
  'table-wrap',
  'tbody',
  'td',
  'th',
  'thead',
  'title',
  'tr',
  'xref'],
 ['caption',
  'fig',
  'graphic',
  'italic',
  'label',
  'list',
  'list-item',
  'p',
  'sec',
  'sup',
  'title',
  'xref'],
 ['bold',
  'caption',
  'ext-link',
  'fig',
  'graphic',
  'label',
  'list',
  'list-item',
  'p',
  'sec',
  'table',
  'table-wrap',
  'tbody',
  'td',
  'th',
  'thead',
  'title',
  'tr',
  'xref'],
 ['inline-formula',
  'italic',
  'math',
  'mi',
  'mrow',
  'p',
  'sec',
  'sub',
  'title',
  'uri',
  'xref'],
 ['italic', 'list', 'list-item', 'p', 'sec', 's

Use sum() with an empty list as the start value to flatten the nested all_tags list into a single list of tag names.

In [None]:
sum_all_tags = sum(all_tags, [])
sum_all_tags

['caption',
 'fig',
 'fn',
 'graphic',
 'italic',
 'label',
 'p',
 'sec',
 'sup',
 'table',
 'table-wrap',
 'table-wrap-foot',
 'tbody',
 'td',
 'th',
 'thead',
 'title',
 'tr',
 'xref',
 'caption',
 'col',
 'ext-link',
 'fig',
 'graphic',
 'italic',
 'label',
 'list',
 'list-item',
 'media',
 'p',
 'sec',
 'styled-content',
 'sub',
 'sup',
 'supplementary-material',
 'table',
 'table-wrap',
 'tbody',
 'td',
 'th',
 'thead',
 'title',
 'tr',
 'xref',
 'caption',
 'fig',
 'graphic',
 'italic',
 'label',
 'list',
 'list-item',
 'p',
 'sec',
 'sup',
 'title',
 'xref',
 'bold',
 'caption',
 'ext-link',
 'fig',
 'graphic',
 'label',
 'list',
 'list-item',
 'p',
 'sec',
 'table',
 'table-wrap',
 'tbody',
 'td',
 'th',
 'thead',
 'title',
 'tr',
 'xref',
 'inline-formula',
 'italic',
 'math',
 'mi',
 'mrow',
 'p',
 'sec',
 'sub',
 'title',
 'uri',
 'xref',
 'italic',
 'list',
 'list-item',
 'p',
 'sec',
 'sub',
 'title',
 'xref',
 'alt-text',
 'bold',
 'caption',
 'ext-link',
 'fig',
 'graphi
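
Flattening with `sum(all_tags, [])` builds a new list at every step, which becomes slow for long inputs. `itertools.chain.from_iterable()` gives the same result in a single pass. A minimal sketch on a made-up nested list:

```python
from itertools import chain

# Hypothetical nested list standing in for all_tags
nested_tags = [['p', 'sec'], ['p', 'title'], []]

flat_tags = list(chain.from_iterable(nested_tags))
print(flat_tags)
```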

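Use the Counter class and its most_common() method to return a list of the tags and their counts, ordered from the most common to the least. Since each article's tag list contains unique names only, each count is the number of articles (out of the first 2500) in which the tag occurs.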

In [None]:
num_all_unique_tags = Counter(sum_all_tags).most_common()

for tag, count in num_all_unique_tags:
    print(tag, count)

p 2461
xref 2456
title 2446
sec 2438
italic 2261
label 2060
caption 1978
graphic 1902
fig 1897
sup 1889
sub 1628
table-wrap 1567
td 1551
tr 1551
tbody 1550
table 1545
thead 1511
bold 1497
th 1483
ext-link 1182
table-wrap-foot 797
fn 582
supplementary-material 580
media 574
disp-formula 537
math 507
mo 493
mrow 492
mi 480
list 429
list-item 429
alt-text 411
col 408
msub 398
mn 364
break 356
inline-formula 308
inline-graphic 297
uri 281
mfrac 280
alternatives 262
colgroup 259
hr 250
msup 244
sc 222
mfenced 205
mspace 201
mtext 198
tex-math 190
msubsup 155
mtable 120
mtd 120
mtr 120
boxed-text 119
underline 118
object-id 106
munder 89
funding-source 89
munderover 88
permissions 83
copyright-holder 80
mover 72
mstyle 70
msqrt 59
institution 59
institution-id 59
institution-wrap 59
def 58
def-item 58
def-list 58
term 58
styled-content 53
glyph-data 24
private-char 24
named-content 21
x 19
glyph-ref 17
disp-quote 13
attrib 10
notes 9
monospace 9
inline-supplementary-material 9
statement 8
co
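
Because `find_all_tags()` returns each tag name at most once per article, the counts can be read as the number of articles that contain a tag and converted to a share of the batch. This is a sketch on made-up per-article tag lists; `Counter.update()` is used so that no flattened list is needed.

```python
from collections import Counter

# Hypothetical per-article tag lists (unique names within each article)
per_article_tags = [['p', 'sec'], ['p', 'title'], []]

tag_counts = Counter()
for tags in per_article_tags:
    tag_counts.update(tags)

# Share of articles in which each tag appears, most common first
tag_share = {tag: count / len(per_article_tags) for tag, count in tag_counts.most_common()}
print(tag_share)
```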

#### 6.1.1 Example: section titles

Function to view the structure of an article by using the find_all() method to return the text of all `<title>` tags.

In [None]:
def find_section_titles(article):
    """
    Return list of sections from an article
    """

    section_titles_list = []
    section_titles = article.find_all("title")
    for section_title in section_titles:
        title = section_title.text
        section_titles_list.append(
            {
                "section_title": title,
            }
        )
    return section_titles_list

View the section titles for article 0.

In [None]:
section_titles_article_0 = find_section_titles(soup_results_1[0])
section_titles_article_0

[{'section_title': '1 Introduction'},
 {'section_title': '2 Methodology and data processing'},
 {'section_title': '2.1 Data collection'},
 {'section_title': '2.2 Data import and deduplication'},
 {'section_title': '2.3 Data splitting or merging'},
 {'section_title': '2.4 Data analysis and visualization'},
 {'section_title': '3 Results'},
 {'section_title': '3.1 Number and type of publications'},
 {'section_title': '3.2 Countries and number of publications'},
 {'section_title': '3.3 National/regional cooperation'},
 {'section_title': '3.4 Contributions of leading bodies'},
 {'section_title': '3.5 Contribution of leading research areas'},
 {'section_title': '3.6 Contribution of major journals'},
 {'section_title': '3.7 Contribution of the lead author'},
 {'section_title': '3.8 Research hotspots and trends'},
 {'section_title': '3.8.1 Author keyword analysis'},
 {'section_title': '3.8.2 Analysis of hot research topics'},
 {'section_title': '3.8.3 Analysis of the most cited studies'},
 {'s

View the section titles for article 1.

In [None]:
section_titles_article_1 = find_section_titles(soup_results_1[1])
section_titles_article_1

[{'section_title': 'INTRODUCTION'},
 {'section_title': 'COMPUTATIONAL DRUG DISCOVERY APPROACHES FOR COVID‐19'},
 {'section_title': 'Ligand‐based drug design for COVID‐19'},
 {'section_title': 'Structure‐based drug design and molecular docking'},
 {'section_title': 'Chemogenomic approaches for COVID‐19 drug discovery'},
 {'section_title': 'Target fishing Chemogenomics approaches for COVID‐19 drug discovery'},
 {'section_title': 'Drug repurposing application of Chemogenomics for COVID‐19 drug discovery'},
 {'section_title': 'Predicting the bio‐profile of drugs via Chemogenomics for COVID‐19 drug discovery'},
 {'section_title': 'DRUG REPOSITIONING'},
 {'section_title': 'Drug repositioning for COVID‐19 drug discovery'},
 {'section_title': 'CONCLUSION AND FUTURE PROSPECTS'},
 {'section_title': 'FUNDING INFORMATION'},
 {'section_title': 'Supporting information'}]

### 6.2 Find direct child tags of `<body>` tag

Function to find direct child tags of `<body>` tag using the `.name` attribute and append to a list.

In [None]:
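def find_body_child_tags(article):

    body_child_list = []

    try:
        for tag in article:
            # Text nodes have no tag name, so only real tags are collected
            if tag.name is not None:
                body_child_list.append(tag.name)
    except Exception as e:
        # Articles without a <body> are None and cannot be iterated
        print(e)

    return sorted(set(body_child_list))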

In [None]:
body_child_tags = list(map(find_body_child_tags, soup_results_1))

'NoneType' object is not iterable
'NoneType' object is not iterable
'NoneType' object is not iterable
'NoneType' object is not iterable
'NoneType' object is not iterable
'NoneType' object is not iterable
'NoneType' object is not iterable
'NoneType' object is not iterable
'NoneType' object is not iterable
'NoneType' object is not iterable
'NoneType' object is not iterable
'NoneType' object is not iterable
'NoneType' object is not iterable
'NoneType' object is not iterable
'NoneType' object is not iterable
'NoneType' object is not iterable
'NoneType' object is not iterable
'NoneType' object is not iterable
'NoneType' object is not iterable
'NoneType' object is not iterable
'NoneType' object is not iterable
'NoneType' object is not iterable
'NoneType' object is not iterable
'NoneType' object is not iterable
'NoneType' object is not iterable
'NoneType' object is not iterable
'NoneType' object is not iterable
'NoneType' object is not iterable
'NoneType' object is not iterable
'NoneType' obj

Printing out the tags reveals some empty lists for the rows giving 'NoneType' object exceptions.
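
To see which articles trigger the exception, one option is to record the positions of the `None` entries before mapping the function. The sketch below uses a small stand-in list; `toy_results` and `missing_positions` are hypothetical names, not part of the notebook's data.

```python
# Toy stand-in for soup_results_1: parsed bodies, with None where no <body> was found
toy_results = [["sec"], None, ["p", "sec"], None]

# Positions of the articles that would raise "'NoneType' object is not iterable"
missing_positions = [i for i, body in enumerate(toy_results) if body is None]

print(missing_positions)
print(len(missing_positions))
```

Applied to the real list, the same comprehension would give the row indices to cross-check against the empty lists in the output.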



In [None]:
body_child_tags

[['sec'],
 ['sec'],
 ['sec'],
 ['sec'],
 ['sec'],
 ['sec'],
 ['sec'],
 ['sec'],
 ['sec'],
 ['sec'],
 ['sec'],
 ['sec'],
 ['graphic', 'p'],
 ['sec'],
 ['sec'],
 ['sec'],
 ['sec'],
 ['sec'],
 ['sec'],
 ['sec'],
 ['sec'],
 ['sec'],
 ['sec'],
 ['sec'],
 ['sec'],
 ['sec'],
 ['sec'],
 ['sec'],
 ['sec'],
 ['sec'],
 ['sec'],
 ['sec'],
 ['sec'],
 ['sec'],
 ['sec'],
 ['sec'],
 ['sec'],
 ['sec'],
 ['sec'],
 ['sec'],
 ['sec'],
 ['sec'],
 ['sec'],
 ['sec'],
 ['sec'],
 ['sec'],
 ['sec'],
 ['sec'],
 ['sec'],
 ['sec'],
 ['sec'],
 ['sec'],
 ['sec'],
 ['sec'],
 ['sec'],
 ['sec'],
 ['sec'],
 ['sec'],
 ['sec'],
 ['sec'],
 ['sec'],
 ['sec'],
 ['p'],
 ['sec'],
 [],
 ['sec'],
 ['sec'],
 ['sec'],
 ['sec'],
 ['sec'],
 ['sec'],
 ['sec'],
 ['sec'],
 ['sec'],
 ['sec'],
 ['sec'],
 ['sec'],
 ['sec'],
 ['sec'],
 ['sec'],
 ['sec'],
 ['sec'],
 ['sec'],
 ['p', 'sec'],
 ['sec'],
 ['sec'],
 ['sec'],
 ['sec'],
 ['sec'],
 ['sec'],
 ['sec'],
 ['sec'],
 ['sec'],
 ['sec'],
 ['sec'],
 ['sec'],
 ['sec'],
 ['sec'],
 ['sec'],
 ['

Use `sum()` with an empty list as the start value to flatten the nested `body_child_tags` list into a single list.

In [None]:
sum_body_child_tags = sum(body_child_tags, [])
sum_body_child_tags

['sec',
 'sec',
 'sec',
 'sec',
 'sec',
 'sec',
 'sec',
 'sec',
 'sec',
 'sec',
 'sec',
 'sec',
 'graphic',
 'p',
 'sec',
 'sec',
 'sec',
 'sec',
 'sec',
 'sec',
 'sec',
 'sec',
 'sec',
 'sec',
 'sec',
 'sec',
 'sec',
 'sec',
 'sec',
 'sec',
 'sec',
 'sec',
 'sec',
 'sec',
 'sec',
 'sec',
 'sec',
 'sec',
 'sec',
 'sec',
 'sec',
 'sec',
 'sec',
 'sec',
 'sec',
 'sec',
 'sec',
 'sec',
 'sec',
 'sec',
 'sec',
 'sec',
 'sec',
 'sec',
 'sec',
 'sec',
 'sec',
 'sec',
 'sec',
 'sec',
 'sec',
 'sec',
 'sec',
 'p',
 'sec',
 'sec',
 'sec',
 'sec',
 'sec',
 'sec',
 'sec',
 'sec',
 'sec',
 'sec',
 'sec',
 'sec',
 'sec',
 'sec',
 'sec',
 'sec',
 'sec',
 'sec',
 'sec',
 'p',
 'sec',
 'sec',
 'sec',
 'sec',
 'sec',
 'sec',
 'sec',
 'sec',
 'sec',
 'sec',
 'sec',
 'sec',
 'sec',
 'sec',
 'sec',
 'sec',
 'sec',
 'sec',
 'sec',
 'sec',
 'sec',
 'sec',
 'sec',
 'sec',
 'sec',
 'sec',
 'sec',
 'sec',
 'sec',
 'sec',
 'sec',
 'sec',
 'sec',
 'sec',
 'sec',
 'sec',
 'sec',
 'sec',
 'sec',
 'sec',
 'sec',
 '

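Again use the Counter class with the most_common() method, which, called without an argument, returns every element and its count from the most common to the least. Because each article's tag list was deduplicated with `set()`, each count is the number of articles in which the tag appears as a direct child of `<body>`, not the total number of occurrences.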

In [None]:
# most_common() with no argument returns all (tag, count) pairs, most frequent first
num_unique_body_child_tags = Counter(sum_body_child_tags).most_common()

for tag, count in num_unique_body_child_tags:
    print(tag, count)

sec 2438
p 149
fig 40
def-list 24
table-wrap 8
boxed-text 5
disp-quote 3
graphic 2
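
As a small illustration of what these counts mean, the sketch below runs the same flatten-then-count steps on a hypothetical list of per-article tag lists (`toy_tags` is made up for this example). Since every inner list is already deduplicated, a tag's count equals the number of articles that contain it.

```python
from collections import Counter

# Hypothetical per-article tag lists, each already deduplicated and sorted
toy_tags = [["sec"], ["p", "sec"], ["fig", "sec"], []]

# Flatten the nested list, then count how many articles contain each tag
flat = [tag for tags in toy_tags for tag in tags]
counts = Counter(flat).most_common()

for tag, count in counts:
    print(tag, count)
```

Here `sec` is counted three times because it occurs in three of the four toy articles; the empty list contributes nothing.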


### 6.3 Find all direct child tags

You can iterate over a tag's direct children using the `.children` generator.

In [None]:
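def find_child_tags(article):
    # article is the parsed <body> tag of one article, or None if it was missing

    child_tags_list = []

    try:
        for tag in article:
            for child in tag.children:
                if child.name is not None:
                    child_tags_list.append(child.name)
    except Exception as e:
        # Raised for None articles and for text nodes, which have no .children
        print(e)

    return sorted(set(child_tags_list))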


In [None]:
child_tags = list(map(find_child_tags, soup_results_1))

'NoneType' object is not iterable
'NoneType' object is not iterable
'NoneType' object is not iterable
'NoneType' object is not iterable
'NoneType' object is not iterable
'NavigableString' object has no attribute 'children'
'NoneType' object is not iterable
'NoneType' object is not iterable
'NoneType' object is not iterable
'NoneType' object is not iterable
'NoneType' object is not iterable
'NoneType' object is not iterable
'NoneType' object is not iterable
'NoneType' object is not iterable
'NoneType' object is not iterable
'NoneType' object is not iterable
'NoneType' object is not iterable
'NoneType' object is not iterable
'NoneType' object is not iterable
'NoneType' object is not iterable
'NoneType' object is not iterable
'NoneType' object is not iterable
'NoneType' object is not iterable
'NoneType' object is not iterable
'NoneType' object is not iterable
'NoneType' object is not iterable
'NoneType' object is not iterable
'NoneType' object is not iterable
'NoneType' object is not iter

The empty-list 'NoneType' exceptions appear again, along with three additional attribute errors.

Two are instances of 'NavigableString' object has no attribute 'children', possibly caused by newlines and spaces in the markup between nodes, which Beautiful Soup turns into NavigableStrings. A string has no children because it cannot contain anything, whereas a tag may contain a string or another tag.

The third is a 'Comment' object, which represents a comment in the XML document.
According to the [Beautiful Soup documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#bs4.Comment) this is a special type of NavigableString that adds something extra to the string on output, displaying with special formatting. Again, because the content of the comment is a string it cannot have child nodes.
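
One way to avoid these errors is to skip anything that is not a `Tag` before asking for its children. The following is a minimal sketch under that assumption; `find_child_tags_safe` and the toy markup are hypothetical and are not the function used to produce the outputs in this notebook.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag


def find_child_tags_safe(body):
    """Return the sorted unique tag names found one level below the children of body."""
    if body is None:
        return []  # article had no <body>

    names = set()
    for node in body.children:
        if not isinstance(node, Tag):
            continue  # skip NavigableString and Comment nodes
        for child in node.children:
            if isinstance(child, Tag):
                names.add(child.name)
    return sorted(names)


toy_markup = (
    "<body>\n<!-- note --><sec><title>T</title><p>x <italic>y</italic></p></sec>"
    "\n<p><bold>b</bold></p></body>"
)
toy_body = BeautifulSoup(toy_markup, "html.parser").body
print(find_child_tags_safe(toy_body))
print(find_child_tags_safe(None))
```

With this guard, a stray newline or comment no longer aborts the loop for the whole article, so tags that follow it are still collected.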

In [None]:
child_tags

[['p', 'sec', 'title'],
 ['fig', 'label', 'p', 'sec', 'supplementary-material', 'title'],
 ['p', 'sec', 'title'],
 ['p', 'sec', 'title'],
 ['p', 'sec', 'title'],
 ['p', 'sec', 'title'],
 ['label', 'p', 'sec', 'title'],
 ['label', 'p', 'sec', 'title'],
 ['p', 'sec', 'title'],
 ['p', 'sec', 'title'],
 ['label', 'p', 'sec', 'title'],
 ['label', 'p', 'sec', 'title'],
 ['bold', 'fig', 'italic', 'list', 'table-wrap', 'xref'],
 ['p', 'sec', 'title'],
 ['p', 'sec', 'title'],
 ['p', 'sec', 'title'],
 ['label', 'p', 'sec', 'title'],
 ['p', 'sec', 'title'],
 ['label', 'p', 'sec', 'title'],
 ['p', 'sec', 'title'],
 ['label', 'p', 'sec', 'title'],
 ['fig', 'label', 'p', 'sec', 'supplementary-material', 'title'],
 ['boxed-text', 'fig', 'p', 'sec', 'table-wrap', 'title'],
 ['boxed-text', 'fig', 'p', 'sec', 'title'],
 ['p', 'sec', 'title'],
 ['p', 'sec', 'title'],
 ['fig', 'p', 'sec', 'title'],
 ['p', 'sec', 'title'],
 ['p', 'sec', 'title'],
 ['label', 'p', 'sec', 'title'],
 ['p', 'sec', 'title'],
 ['

In [None]:
len(child_tags)

2500

Use `sum()` with an empty list as the start value to flatten the nested `child_tags` list into a single list.

In [None]:
sum_child_tags = sum(child_tags, [])
sum_child_tags

['p',
 'sec',
 'title',
 'fig',
 'label',
 'p',
 'sec',
 'supplementary-material',
 'title',
 'p',
 'sec',
 'title',
 'p',
 'sec',
 'title',
 'p',
 'sec',
 'title',
 'p',
 'sec',
 'title',
 'label',
 'p',
 'sec',
 'title',
 'label',
 'p',
 'sec',
 'title',
 'p',
 'sec',
 'title',
 'p',
 'sec',
 'title',
 'label',
 'p',
 'sec',
 'title',
 'label',
 'p',
 'sec',
 'title',
 'bold',
 'fig',
 'italic',
 'list',
 'table-wrap',
 'xref',
 'p',
 'sec',
 'title',
 'p',
 'sec',
 'title',
 'p',
 'sec',
 'title',
 'label',
 'p',
 'sec',
 'title',
 'p',
 'sec',
 'title',
 'label',
 'p',
 'sec',
 'title',
 'p',
 'sec',
 'title',
 'label',
 'p',
 'sec',
 'title',
 'fig',
 'label',
 'p',
 'sec',
 'supplementary-material',
 'title',
 'boxed-text',
 'fig',
 'p',
 'sec',
 'table-wrap',
 'title',
 'boxed-text',
 'fig',
 'p',
 'sec',
 'title',
 'p',
 'sec',
 'title',
 'p',
 'sec',
 'title',
 'fig',
 'p',
 'sec',
 'title',
 'p',
 'sec',
 'title',
 'p',
 'sec',
 'title',
 'label',
 'p',
 'sec',
 'title',
 'p'

Again, counts from most to least common.

In [None]:
# most_common() with no argument returns all (tag, count) pairs, most frequent first
num_unique_child_tags = Counter(sum_child_tags).most_common()

for tag, count in num_unique_child_tags:
    print(tag, count)

title 2434
p 2425
sec 2210
label 715
fig 449
supplementary-material 286
table-wrap 220
xref 109
italic 81
boxed-text 80
sup 64
caption 42
graphic 40
sub 37
list 31
def-list 27
def-item 24
ext-link 21
bold 19
table 7
table-wrap-foot 6
uri 5
disp-quote 4
sc 3
disp-formula 2
underline 1
inline-formula 1


In [None]:
with open('2023-01-06_europepmc_child_tags_2500.pickle', 'wb') as f:
  pickle.dump(child_tags, f)

### 6.4 Create single sorted list of child tags

Use the chain() function from the itertools module to flatten the nested list, then deduplicate with set() and sort to get a single list of unique child tags.

In [None]:
sorted(list(set(chain(*child_tags))))

['bold',
 'boxed-text',
 'caption',
 'def-item',
 'def-list',
 'disp-formula',
 'disp-quote',
 'ext-link',
 'fig',
 'graphic',
 'inline-formula',
 'italic',
 'label',
 'list',
 'p',
 'sc',
 'sec',
 'sub',
 'sup',
 'supplementary-material',
 'table',
 'table-wrap',
 'table-wrap-foot',
 'title',
 'underline',
 'uri',
 'xref']

In [None]:
len(sorted(list(set(chain(*child_tags)))))

27
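
The notebook flattens nested lists in two ways: `sum(nested, [])` and `chain(*nested)`. The sketch below, on a hypothetical `nested` list, shows that both give the same result. `chain.from_iterable` is used here as an equivalent spelling of `chain(*nested)`; for long lists it is generally the cheaper option, because `sum` builds a new list at every step.

```python
from itertools import chain

# Hypothetical nested list of per-article tag names
nested = [["p", "sec", "title"], ["fig", "p"], []]

flat_sum = sum(nested, [])                       # repeated list concatenation
flat_chain = list(chain.from_iterable(nested))   # single pass over the sublists

# Deduplicate and sort, as in the cell above
unique_sorted = sorted(set(flat_chain))

print(flat_sum == flat_chain)
print(unique_sorted)
```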

## 7. Remove unwanted tags and content

Functions to remove unwanted tags and their contents using BeautifulSoup's decompose() method.

In [None]:
def remove_section_titles(article):

    for title in article("title"):
        title.decompose()

    return article


def remove_figures(article):

    for fig in article("fig"):
        fig.decompose()

    return article

def remove_def_lists(article):

    for def_list in article("def-list"):
        def_list.decompose()

    return article

def remove_tables(article):

    for table_wrap in article("table-wrap"):
        table_wrap.decompose()

    return article

def remove_formulas(article):

    for inline_formula in article("inline-formula"):
        inline_formula.decompose()
    for disp_formula in article("disp-formula"):
        disp_formula.decompose()
    for disp_formula_group in article("disp-formula-group"):
        disp_formula_group.decompose()

    return article


def remove_ext_links(article):

    for ext_link in article("ext-link"):
        ext_link.decompose()

    return article


def remove_labels(article):

    for label in article("label"):
        label.decompose()

    return article


def remove_refs(article):

    for xref in article("xref"):
        xref.decompose()

    return article


def remove_supp_material(article):

    for supp_material in article("supplementary-material"):
        supp_material.decompose()

    return article


Function to call the functions above on an article. It returns a one-element list containing the cleaned article, or an empty list if the article is `None`.

In [None]:
def remove_tags(article):

    new_article_list = []

    try:
        # Articles with no <body> are None and are skipped
        if article is not None:
            new_article = remove_section_titles(article)
            new_article = remove_figures(new_article)
            new_article = remove_def_lists(new_article)
            new_article = remove_tables(new_article)
            new_article = remove_formulas(new_article)
            new_article = remove_ext_links(new_article)
            new_article = remove_labels(new_article)
            new_article = remove_refs(new_article)
            new_article = remove_supp_material(new_article)
            new_article_list.append(new_article)
    except Exception as e:
        print(e)

    return new_article_list
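
The removal functions all follow the same pattern, so they could also be driven by a single list of tag names. The sketch below is one possible alternative, not the notebook's implementation: `UNWANTED_TAGS` and `remove_unwanted_tags` are hypothetical names, and the tag list simply mirrors the tags targeted above.

```python
from bs4 import BeautifulSoup

# Tag names targeted by the removal functions above (JATS names use hyphens)
UNWANTED_TAGS = (
    "title", "fig", "def-list", "table-wrap", "inline-formula", "disp-formula",
    "disp-formula-group", "ext-link", "label", "xref", "supplementary-material",
)


def remove_unwanted_tags(article, tag_names=UNWANTED_TAGS):
    """Remove every listed tag and its contents; return [article], or [] for None."""
    if article is None:
        return []
    for name in tag_names:
        # Re-query after each removal so already-removed nested tags are never touched
        element = article.find(name)
        while element is not None:
            element.decompose()
            element = article.find(name)
    return [article]


toy_markup = (
    "<body><sec><title>Intro</title>"
    "<p>Text <xref>1</xref>more.<fig><label>F1</label></fig></p></sec></body>"
)
toy_body = BeautifulSoup(toy_markup, "html.parser").body
cleaned = remove_unwanted_tags(toy_body)
print(cleaned[0].get_text())
```

Keeping the names in one tuple makes it easier to spot a misspelled tag name and to add or drop a tag without writing another function.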

### 7.1 First batch of 2500 remove tags

Map the remove_tags() function onto the first batch of 2500 articles to remove unwanted tags and content.

In [None]:
soup_results_removed_tags_2500 = list(map(remove_tags, soup_results_1))

In [None]:
len(soup_results_removed_tags_2500)

2500

View the first 10 articles with unwanted tags and contents removed.

In [None]:
soup_results_removed_tags_2500[0:10]

[[<body><sec id="s1"><p>Sir James Black, a winner of the 1988 Nobel Prize, clearly recognized well before the 21st century that drug repurposing strategies would occupy an important place in the future of new drug discovery (). In 2004, Ted T. Ashburn et al. () summarized previous research and developed a general approach to drug development using drug repurposing, retrospectively looking for new indications for approved drugs and molecules that are waiting for approval for new pathways of action and targets. These molecules are usually safe in clinical trials but do not show sufficient efficacy for the treatment of the disease originally targeted (). The definition of the term “drug repurposing” has been endorsed by scholars () and used by them (; ). It should be pointed out that the synonyms of “drug repurposing” often used by academics also include drug repositioning (), drug rediscovery (), drug redirecting (), drug retasking (), and therapeutic switching (; ). After the research s

Repeat for the remaining four batches.

### 7.2 Second batch of 2500 remove tags

In [None]:
soup_results_removed_tags_2500_5000 = list(map(remove_tags, soup_results_2))

In [None]:
len(soup_results_removed_tags_2500_5000)

2500

### 7.3 Third batch of 2500 remove tags

In [None]:
soup_results_removed_tags_5000_7500 = list(map(remove_tags, soup_results_3))

In [None]:
len(soup_results_removed_tags_5000_7500)

2500

### 7.4 Fourth batch of 2500 remove tags

In [None]:
soup_results_removed_tags_7500_10000 = list(map(remove_tags, soup_results_4))

In [None]:
len(soup_results_removed_tags_7500_10000)

2500

### 7.5 Fifth batch of 1397 remove tags

In [None]:
soup_results_removed_tags_10000_11397 = list(map(remove_tags, soup_results_5))

In [None]:
len(soup_results_removed_tags_10000_11397)

1397
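
The five batches are processed with identical cells, so the repetition could be folded into a loop over a dictionary of batches. The sketch below shows the idea on toy data; `process_batches` and the batch names are hypothetical, and `str.upper` merely stands in for `remove_tags`.

```python
def process_batches(batches, func):
    """Apply func to every item of every named batch; return a dict of result lists."""
    return {name: [func(item) for item in batch] for name, batch in batches.items()}


# Toy batches keyed by the same range labels used in the variable names above
toy_batches = {"0_2500": ["a", "b"], "2500_5000": ["c"]}

processed = process_batches(toy_batches, str.upper)
sizes = {name: len(items) for name, items in processed.items()}
print(sizes)
```

The `sizes` dictionary plays the role of the repeated `len(...)` checks, giving one length per batch.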

### 7.6 Strip markup and keep text

Function to return only the human-readable text in a document using BeautifulSoup's get_text() method. This will return all the text in an article as a single Unicode string without the `<body>`, `<sec>` and `<p>` tags.
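
By default `get_text()` joins adjacent text nodes with no separator, so the end of one paragraph can run straight into the start of the next. The sketch below uses toy markup (an assumption for illustration) to compare the default call with the `separator` and `strip` arguments.

```python
from bs4 import BeautifulSoup

toy_markup = "<body><sec><p>First paragraph.</p><p>Second paragraph.</p></sec></body>"
toy_body = BeautifulSoup(toy_markup, "html.parser").body

# Default: text nodes are concatenated directly
plain = toy_body.get_text()

# With a separator and strip=True, each text node is stripped and joined by a space
spaced = toy_body.get_text(separator=" ", strip=True)

print(plain)
print(spaced)
```

Whether to add a separator depends on the downstream use; the notebook keeps the default behaviour.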



In [None]:
def strip_markup(articles):

    # remove_tags() returns a one-element list, or an empty list for a missing article
    if not articles:
        return None

    return articles[0].get_text()



In [None]:
stripped_markup_articles_2500 = list(map(strip_markup, soup_results_removed_tags_2500))

View a couple of articles with all tags removed.

In [None]:
stripped_markup_articles_2500[0]

'Sir James Black, a winner of the 1988 Nobel Prize, clearly recognized well before the 21st century that drug repurposing strategies would occupy an important place in the future of new drug discovery (). In 2004, Ted T. Ashburn et al. () summarized previous research and developed a general approach to drug development using drug repurposing, retrospectively looking for new indications for approved drugs and molecules that are waiting for approval for new pathways of action and targets. These molecules are usually safe in clinical trials but do not show sufficient efficacy for the treatment of the disease originally targeted (). The definition of the term “drug repurposing” has been endorsed by scholars () and used by them (; ). It should be pointed out that the synonyms of “drug repurposing” often used by academics also include drug repositioning (), drug rediscovery (), drug redirecting (), drug retasking (), and therapeutic switching (; ). After the research study by Ashburn et al.,

In [None]:
stripped_markup_articles_2500[3]

'The sudden outbreak of SARS-CoV-2 in 2019 took the world by storm and despite there being vaccines, numerous other alternative treatment approaches are also being researched. Coronavirus disease-2019 (COVID-19) is a communicable disease caused by severe acute respiratory syndrome coronavirus-2 (SARS Cov-2). This disease was first detected in Wuhan, China, and has expanded its reach globally, leading to myriad deaths. According to World Health Organisation, the total number of COVID-19 confirmed cases worldwide was found to be 53,22,01,219 and total number of deaths accounted to 63,05,358 as of 10th June, 2022. Among many regions, Europe holds the top position in highest number of confirmed COVID-19 cases (22,24,17,177) then followed by America (15,89,83,746), Western Pacific (6,17,35,224), South-East Asia (5,82,17,287), Eastern Mediterranean (2,18,07,376), and Africa (90,39,645) respectively []. These statistics reveal that even today, the disease is still infecting several people, ma

In [None]:
len(stripped_markup_articles_2500)

2500

In [None]:
with open('2023-01-06_stripped_markup_articles_2500.pickle', 'wb') as f:
  pickle.dump(stripped_markup_articles_2500, f)

Repeat for second batch.

In [None]:
stripped_markup_articles_2500_5000 = list(map(strip_markup, soup_results_removed_tags_2500_5000))

In [None]:
len(stripped_markup_articles_2500_5000)

2500

Check an article from second batch.

In [None]:
stripped_markup_articles_2500_5000[10]

'Corona Virus Disease 2019 (COVID-19) is an infectious disease caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). It was first identified in December 2019 in Wuhan, Hubei, China, and has since spread around the world. The WHO has declared that the COVID-19 outbreak constitutes a Public Health Emergency of International Concern (PHEIC). This disease can be clinically classified as mild, severe, or critical. Fever, dry cough, and fatigue are the main manifestations, and patients classified as severe can rapidly progress to acute respiratory distress syndrome (ARDS) and multiple organ failure (MOF), amongst other conditions. Unfortunately, at present, there is no cure officially approved for this disease, creating a formidable challenge in its treatment, prognosis, and control. Traditional Chinese medicines (TCM), that are characterized as being anti-viral and affecting multiple pathways and targets, have been proven to be significantly effective in treating COVID-19.

In [None]:
with open('2023-01-06_stripped_markup_articles_2500_5000.pickle', 'wb') as f:
  pickle.dump(stripped_markup_articles_2500_5000, f)

Repeat for third batch.

In [None]:
stripped_markup_articles_5000_7500 = list(map(strip_markup, soup_results_removed_tags_5000_7500))

In [None]:
len(stripped_markup_articles_5000_7500)

2500

In [None]:
with open('2023-01-06_stripped_markup_articles_5000_7500.pickle', 'wb') as f:
  pickle.dump(stripped_markup_articles_5000_7500, f)

Repeat for fourth batch.

In [None]:
stripped_markup_articles_7500_10000 = list(map(strip_markup, soup_results_removed_tags_7500_10000))

In [None]:
len(stripped_markup_articles_7500_10000)

2500

In [None]:
with open('2023-01-06_stripped_markup_articles_7500_10000.pickle', 'wb') as f:
  pickle.dump(stripped_markup_articles_7500_10000, f)

Repeat for fifth and final batch.

In [None]:
stripped_markup_articles_10000_11397 = list(map(strip_markup, soup_results_removed_tags_10000_11397))

In [None]:
len(stripped_markup_articles_10000_11397)

1397

In [None]:
with open('2023-01-06_stripped_markup_articles_10000_11397.pickle', 'wb') as f:
  pickle.dump(stripped_markup_articles_10000_11397, f)

### 7.7 Concatenate lists for five batches

Using the * operator we can concatenate the five lists into one list containing all 11397 articles.

In [None]:
concat_lists = [*stripped_markup_articles_2500, *stripped_markup_articles_2500_5000,
                *stripped_markup_articles_5000_7500, *stripped_markup_articles_7500_10000,
                *stripped_markup_articles_10000_11397]

Print first article to test.

In [None]:
concat_lists[0]

'Sir James Black, a winner of the 1988 Nobel Prize, clearly recognized well before the 21st century that drug repurposing strategies would occupy an important place in the future of new drug discovery (). In 2004, Ted T. Ashburn et al. () summarized previous research and developed a general approach to drug development using drug repurposing, retrospectively looking for new indications for approved drugs and molecules that are waiting for approval for new pathways of action and targets. These molecules are usually safe in clinical trials but do not show sufficient efficacy for the treatment of the disease originally targeted (). The definition of the term “drug repurposing” has been endorsed by scholars () and used by them (; ). It should be pointed out that the synonyms of “drug repurposing” often used by academics also include drug repositioning (), drug rediscovery (), drug redirecting (), drug retasking (), and therapeutic switching (; ). After the research study by Ashburn et al.,

In [None]:
len(concat_lists)

11397

In [None]:
with open('2023-01-06_concat_all_11397.pickle', 'wb') as f:
  pickle.dump(concat_lists, f)

### 7.8 Add extracted full text to DataFrame

Load metadata for Europe PMC articles.

In [None]:
with open('2023-01-06_europepmc_df_json_ft_urls_11397.pickle', 'rb') as f:
    search_results_new = pickle.load(f)

In [None]:
search_results_full_text = search_results_new.copy()

In [None]:
search_results_full_text

Unnamed: 0,pmcid,published,revised,title,journal,authors,doi,pdf_url
0,PMC9549161,2022-09-26,2022-10-14,Drug repositioning: A bibliometric analysis.,Frontiers in pharmacology,"Sun G, Dong D, Dong Z, Zhang Q, Fang H, Wang C...",10.3389/fphar.2022.974849,https://europepmc.org/articles/PMC9549161?pdf=...
1,PMC9539342,2022-09-22,2022-11-12,A review on computer-aided chemogenomics and d...,Chemical biology & drug design,"Maghsoudi S, Taghavi Shahraki B, Rameh F, Naza...",10.1111/cbdd.14136,https://europepmc.org/articles/PMC9539342?pdf=...
2,PMC9357751,2022-12-01,2022-12-05,Repurposing Molnupiravir as a new opportunity ...,"Journal of Generic Medicines : Duplicate, mark...",0,0,https://europepmc.org/articles/PMC9357751?pdf=...
3,PMC9346052,2022-08-03,2022-09-05,Scope of repurposed drugs against the potentia...,Structural chemistry,"Niranjan V, Setlur AS, Karunakaran C, Uttarkar...",10.1007/s11224-022-02020-z,https://europepmc.org/articles/PMC9346052?pdf=...
4,PMC9775208,2022-12-15,2022-12-25,Drug Repurposing Using Gene Co-Expression and ...,Biology,"Mailem RC, Tayo LL.",10.3390/biology11121827,https://europepmc.org/articles/PMC9775208?pdf=...
...,...,...,...,...,...,...,...,...
11392,PMC6328940,2019-01-01,2020-03-09,β-RA reduces DMQ/CoQ ratio and rescues the enc...,EMBO molecular medicine,"Hidalgo-Gutiérrez A, Barriocanal-Casado E, Bak...",10.15252/emmm.201809466,https://europepmc.org/articles/PMC6328940?pdf=...
11393,PMC6598402,2019-06-21,2020-09-28,Alzheimer Disease Pathogenesis: Insights From ...,Frontiers in neuroscience,"Chen XQ, Mobley WC.",10.3389/fnins.2019.00659,https://europepmc.org/articles/PMC6598402?pdf=...
11394,PMC6481739,2019-02-05,2020-09-28,Modeling cardiac complexity: Advancements in m...,APL bioengineering,"Callaghan NI, Hadipour-Lakmehsari S, Lee SH, G...",10.1063/1.5055873,https://europepmc.org/articles/PMC6481739?pdf=...
11395,PMC6624471,2019-07-05,2020-09-28,Tissue Response to Neural Implants: The Use of...,Frontiers in neuroscience,"Gulino M, Kim D, Pané S, Santos SD, Pêgo AP.",10.3389/fnins.2019.00689,https://europepmc.org/articles/PMC6624471?pdf=...


Load concatenated lists of full text and add as 'text' column to metadata DataFrame.

In [None]:
with open('2023-01-06_concat_all_11397.pickle', 'rb') as f:
    concat_lists = pickle.load(f)

In [None]:
search_results_full_text['text'] = concat_lists
search_results_full_text

Unnamed: 0,pmcid,published,revised,title,journal,authors,doi,pdf_url,text
0,PMC9549161,2022-09-26,2022-10-14,Drug repositioning: A bibliometric analysis.,Frontiers in pharmacology,"Sun G, Dong D, Dong Z, Zhang Q, Fang H, Wang C...",10.3389/fphar.2022.974849,https://europepmc.org/articles/PMC9549161?pdf=...,"Sir James Black, a winner of the 1988 Nobel Pr..."
1,PMC9539342,2022-09-22,2022-11-12,A review on computer-aided chemogenomics and d...,Chemical biology & drug design,"Maghsoudi S, Taghavi Shahraki B, Rameh F, Naza...",10.1111/cbdd.14136,https://europepmc.org/articles/PMC9539342?pdf=...,Tight and selective interaction between ligand...
2,PMC9357751,2022-12-01,2022-12-05,Repurposing Molnupiravir as a new opportunity ...,"Journal of Generic Medicines : Duplicate, mark...",0,0,https://europepmc.org/articles/PMC9357751?pdf=...,The severe acute respiratory syndrome coronavi...
3,PMC9346052,2022-08-03,2022-09-05,Scope of repurposed drugs against the potentia...,Structural chemistry,"Niranjan V, Setlur AS, Karunakaran C, Uttarkar...",10.1007/s11224-022-02020-z,https://europepmc.org/articles/PMC9346052?pdf=...,The sudden outbreak of SARS-CoV-2 in 2019 took...
4,PMC9775208,2022-12-15,2022-12-25,Drug Repurposing Using Gene Co-Expression and ...,Biology,"Mailem RC, Tayo LL.",10.3390/biology11121827,https://europepmc.org/articles/PMC9775208?pdf=...,"The 2019 novel coronavirus, now dubbed SARS-Co..."
...,...,...,...,...,...,...,...,...,...
11392,PMC6328940,2019-01-01,2020-03-09,β-RA reduces DMQ/CoQ ratio and rescues the enc...,EMBO molecular medicine,"Hidalgo-Gutiérrez A, Barriocanal-Casado E, Bak...",10.15252/emmm.201809466,https://europepmc.org/articles/PMC6328940?pdf=...,Mitochondria are the primary site of cellular ...
11393,PMC6598402,2019-06-21,2020-09-28,Alzheimer Disease Pathogenesis: Insights From ...,Frontiers in neuroscience,"Chen XQ, Mobley WC.",10.3389/fnins.2019.00659,https://europepmc.org/articles/PMC6598402?pdf=...,"AD is the most common cause of dementia, accou..."
11394,PMC6481739,2019-02-05,2020-09-28,Modeling cardiac complexity: Advancements in m...,APL bioengineering,"Callaghan NI, Hadipour-Lakmehsari S, Lee SH, G...",10.1063/1.5055873,https://europepmc.org/articles/PMC6481739?pdf=...,Compromised contractility of the heart is a ma...
11395,PMC6624471,2019-07-05,2020-09-28,Tissue Response to Neural Implants: The Use of...,Frontiers in neuroscience,"Gulino M, Kim D, Pané S, Santos SD, Pêgo AP.",10.3389/fnins.2019.00689,https://europepmc.org/articles/PMC6624471?pdf=...,Recent technological progress in the field of ...


In [None]:
with open('2023-01-06_search_results_full_text.pickle', 'wb') as f:
  pickle.dump(search_results_full_text, f)

## 8. Check for missing text

Concise summary of DataFrame to see if there are any articles with missing text.

In [None]:
search_results_full_text.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11397 entries, 0 to 11396
Data columns (total 9 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   pmcid      11397 non-null  object
 1   published  11397 non-null  object
 2   revised    11397 non-null  object
 3   title      11397 non-null  object
 4   journal    11397 non-null  object
 5   authors    11397 non-null  object
 6   doi        11397 non-null  object
 7   pdf_url    11397 non-null  object
 8   text       11279 non-null  object
dtypes: object(9)
memory usage: 801.5+ KB


We can see that there are 118 articles with missing values for the text column.
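
The number of missing values can also be computed directly rather than read off the `info()` summary. This is a minimal sketch on a hypothetical three-row DataFrame (`toy_df` is not the notebook's data).

```python
import pandas as pd

# Toy DataFrame with one missing full text
toy_df = pd.DataFrame(
    {"pmcid": ["PMC1", "PMC2", "PMC3"], "text": ["Some text", None, "More text"]}
)

# Count the missing values and list the affected identifiers
n_missing = int(toy_df["text"].isna().sum())
missing_ids = toy_df.loc[toy_df["text"].isna(), "pmcid"].tolist()

print(n_missing)
print(missing_ids)
```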

In [None]:
pmc_search_results_no_full_text = search_results_full_text[search_results_full_text['text'].isna()]
pmc_search_results_no_full_text

Unnamed: 0,pmcid,published,revised,title,journal,authors,doi,pdf_url,text
64,PMC9538661,2022-10-13,2022-11-22,Recent Drug Development and Medicinal Chemistr...,ChemMedChem,"Ghosh AK, Mishevich JL, Mesecar A, Mitsuya H.",10.1002/cmdc.202200440,https://europepmc.org/articles/PMC9538661?pdf=...,
188,PMC9794394,2022-12-28,2023-01-02,Nonstructural protein 1 (nsp1) widespread RNA ...,iScience,"Bermudez Y, Miles J, Muller M.",10.1016/j.isci.2022.105887,https://europepmc.org/articles/PMC9794394?pdf=...,
212,PMC9538837,2022-10-10,2022-11-22,The Efficacy of Traditional Medicinal Plants i...,Chemistry & biodiversity,"Choe J, Har Yong P, Xiang Ng Z.",10.1002/cbdv.202200655,https://europepmc.org/articles/PMC9538837?pdf=...,
256,PMC9788990,2022-12-24,2023-01-02,Sleep and circadian rhythm disruption alters t...,iScience,"Taylor L, Von Lendenfeld F, Ashton A, Sanghani...",10.1016/j.isci.2022.105877,https://europepmc.org/articles/PMC9788990?pdf=...,
275,PMC9794516,2022-12-28,2023-01-02,MultiOMICs landscape of SARS-CoV-2-induced hos...,iScience,"Pinto SM, Subbannayya Y, Kim H, Hagen L, Górna...",10.1016/j.isci.2022.105895,https://europepmc.org/articles/PMC9794516?pdf=...,
...,...,...,...,...,...,...,...,...,...
9797,PMC7162151,2020-02-28,2020-04-21,News.,Chemistry & industry,0,10.1002/cind.842_3.x,https://europepmc.org/articles/PMC7162151?pdf=...,
9809,PMC9094125,2022-05-11,2022-07-16,Recent advances in metal-organic framework-bas...,Nano research,"Yang M, Zhang J, Wei Y, Zhang J, Tao C.",10.1007/s12274-022-4302-x,https://europepmc.org/articles/PMC9094125?pdf=...,
10399,PMC7492056,2020-09-15,2020-09-28,Full Issue PDF.,JACC. CardioOncology,0,10.1016/s2666-0873(20)30180-0,https://europepmc.org/articles/PMC7492056?pdf=...,
10767,PMC8459260,2021-02-25,2022-01-25,Getting in touch with your senses: Mechanisms ...,WIREs mechanisms of disease,"Gupta S, Butler SJ.",10.1002/wsbm.1520,https://europepmc.org/articles/PMC8459260?pdf=...,


In [None]:
with open('2023-01-06_pmc_search_results_no_full_text.pickle', 'wb') as f:
  pickle.dump(pmc_search_results_no_full_text, f)

Check one of the articles to see 'None' in text column.

In [None]:
pmc_search_results_no_full_text.loc[pmc_search_results_no_full_text['pmcid']  == 'PMC8459260']

Unnamed: 0,pmcid,published,revised,title,journal,authors,doi,pdf_url,text
10767,PMC8459260,2021-02-25,2022-01-25,Getting in touch with your senses: Mechanisms ...,WIREs mechanisms of disease,"Gupta S, Butler SJ.",10.1002/wsbm.1520,https://europepmc.org/articles/PMC8459260?pdf=...,


The above article PMC8459260 was one of the 13 exceptions raised earlier.

## 9. GROBID to download exceptions full text as XML

We will reuse the parse_pdf function written for the arXiv PDFs, which uses [GROBID](https://github.com/kermitt2/grobid/) to parse a PDF into XML. Here it is applied to the PDF versions of the Europe PMC articles that raised exceptions when extracting the text via the Europe PMC API.

The code below used the previous demo server hosted at [https://cloud.science-miner.com/grobid](https://cloud.science-miner.com/grobid). That server has since been updated and now redirects to a new demo server, so the GROBID URL in the function will need to be changed to [https://kermitt2-grobid.hf.space/api/processFulltextDocument](https://kermitt2-grobid.hf.space/api/processFulltextDocument).

In [None]:
import urllib.request


def parse_pdf(pdf_url: str):
    """
    Parse PDF to XML using GROBID tool

    :param pdf_url: str, URL to article PDF

    :return: XML of parsed article
    """
    # GROBID URL for the cloud service to parse full text of the article
    url = "https://cloud.science-miner.com/grobid/api/processFulltextDocument"

    if isinstance(pdf_url, str):
        # Download the PDF, then post its bytes to GROBID for parsing
        page = urllib.request.urlopen(pdf_url).read()
        resp = requests.post(url, files={"input": page})
        if resp.status_code != 200:
            raise Exception(resp.text)
        parsed_article = resp.text
        # Pause between requests to avoid overloading the demo server
        time.sleep(3)
    else:
        raise TypeError("Need to supply a url")

    return parsed_article

Running the above function with multithreading for the 118 exceptions took 10m 27s.

In [None]:
import concurrent.futures

with concurrent.futures.ThreadPoolExecutor(4) as executor:
    # Submit one parsing task per PDF URL and wait for all of them to finish
    futures = [executor.submit(parse_pdf, pdf_url) for pdf_url in pmc_search_results_no_full_text.pdf_url]
    concurrent.futures.wait(futures)

Create a dictionary that maps each PDF URL to its future.

In [None]:
futures_map = dict(zip(pmc_search_results_no_full_text.pdf_url, futures))

In [None]:
len(futures_map)

118

Create a dictionary of exceptions, with the PDF URL as key and the raised exception as value.

In [None]:
exceptions = {url: f.exception() for url, f in futures_map.items() if f.exception() is not None}
exceptions

{'https://europepmc.org/articles/PMC7492056?pdf=render': Exception('{\n  "message":"The upstream server is timing out"\n}')}
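
The same pattern can be tried offline. In the sketch below, `fake_parse` is a hypothetical stand-in for `parse_pdf` that fails for one input; the futures are then split into failures and successful results with `Future.exception()` and `Future.result()`.

```python
import concurrent.futures


def fake_parse(url: str) -> str:
    """Hypothetical stand-in for parse_pdf: fails for one specific URL."""
    if url == "bad":
        raise Exception("The upstream server is timing out")
    return f"<TEI>{url}</TEI>"


urls = ["a", "bad", "c"]

with concurrent.futures.ThreadPoolExecutor(4) as executor:
    futures = [executor.submit(fake_parse, url) for url in urls]
    concurrent.futures.wait(futures)

futures_map = dict(zip(urls, futures))

# exception() returns None for futures that completed normally
failed = {url: f.exception() for url, f in futures_map.items() if f.exception() is not None}
# result() re-raises for failed futures, so it is only called on the successful ones
parsed = {url: f.result() for url, f in futures_map.items() if f.exception() is None}

print(sorted(failed))
print(sorted(parsed))
```

Keeping the successful results in their own dictionary means only the failed URLs would need to be submitted again.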

Only one article, PMC7492056, raised an exception. We will amend the function to handle the timeout response and any other server errors, then run it again to download the full text.

In [None]:
def parse_pdf(pdf_url: str):
    """
    Parse PDF to XML using GROBID tool

    :param pdf_url: str, URL to article PDF

    :return: XML of parsed article, "0" if GROBID timed out,
             or "500" for any other server error
    """
    if not isinstance(pdf_url, str):
        raise TypeError("Need to supply a url")

    # GROBID URL for the cloud service to parse full text of the article
    url = "https://cloud.science-miner.com/grobid/api/processFulltextDocument"

    page = urllib.request.urlopen(pdf_url).read()
    resp = requests.post(url, files={"input": page})
    if resp.status_code != 200:
        # Timeout reported by the server: return a sentinel instead of raising
        if resp.text == '{\n  "message":"The upstream server is timing out"\n}':
            return "0"
        # Any other server-side (5xx) error: return a sentinel instead of raising
        if resp.status_code >= 500:
            return "500"
        # Anything else (e.g. a 4xx client error) is unexpected, so surface it
        raise Exception(resp.text)
    parsed_article = resp.text
    time.sleep(3)

    return parsed_article
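
The amended function returns a sentinel string on a timeout rather than trying again. Retrying with a growing pause is another way to deal with transient failures. The sketch below is illustrative only: `call_with_retries` and `flaky` are hypothetical names, the defaults are arbitrary, and it assumes the wrapped function raises on failure, as the first version of `parse_pdf` does.

```python
import time


def call_with_retries(func, *args, retries: int = 3, delay: float = 1.0, backoff: float = 2.0):
    """Call func(*args), retrying on any exception with a growing pause.

    Hypothetical helper: names and default values are illustrative only.
    """
    for attempt in range(1, retries + 1):
        try:
            return func(*args)
        except Exception:
            if attempt == retries:
                raise  # out of attempts: surface the last error
            time.sleep(delay)
            delay *= backoff  # wait longer before the next attempt


# Demonstration with a stand-in that fails twice before succeeding
calls = {"n": 0}


def flaky(url):
    calls["n"] += 1
    if calls["n"] < 3:
        raise Exception("The upstream server is timing out")
    return f"<TEI>{url}</TEI>"


result = call_with_retries(flaky, "example.pdf", retries=3, delay=0.01)
print(result, calls["n"])
```

Applied to this notebook, the call would look like `call_with_retries(parse_pdf, pdf_url)`, on the assumption that `parse_pdf` raises for failed requests.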

In [None]:
with ThreadPool(4) as pool:
  exceptions_xml = pool.map(parse_pdf, pmc_search_results_no_full_text.pdf_url)

In [None]:
len(exceptions_xml)

118

No further exceptions this time. Print out first article to view GROBID XML output.

In [None]:
exceptions_xml[0]

'<?xml version="1.0" encoding="UTF-8"?>\n<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" \nxmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" \nxsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"\n xmlns:xlink="http://www.w3.org/1999/xlink">\n\t<teiHeader xml:lang="en">\n\t\t<fileDesc>\n\t\t\t<titleStmt>\n\t\t\t\t<title level="a" type="main">Recent Drug Development and Medicinal Chemistry Approaches for the Treatment of SARS-CoV-2 and Covid-19</title>\n\t\t\t</titleStmt>\n\t\t\t<publicationStmt>\n\t\t\t\t<publisher/>\n\t\t\t\t<availability status="unknown"><licence/></availability>\n\t\t\t</publicationStmt>\n\t\t\t<sourceDesc>\n\t\t\t\t<biblStruct>\n\t\t\t\t\t<analytic>\n\t\t\t\t\t\t<author role="corresp">\n\t\t\t\t\t\t\t<persName><roleName>Professor</roleName><forename type="first">Arun</forename><forename type="middle">K</forename><surname>Ghosh</surname></persName>\n\t\t\t\t\t\t

In [None]:
with open('2023-01-06_pmc_search_results_no_full_text_exceptions_xml_v2.pickle', 'wb') as f:
  pickle.dump(exceptions_xml, f)

## 10. Clean and return XML as Beautiful Soup object
The function below removes the `xmlns` attributes, which declare the XML namespaces, and returns the parsed `<body>` tag as a Beautiful Soup object. This object represents the document as a nested data structure, so we can navigate it and extract text using the XML tags.

In [None]:
# remove xmlns attributes from XML and convert to soup object

def soupify(text):
    # raw string so that \s and \w are passed to the regex engine unchanged
    cleaned_text = re.sub(r'\s*xmlns(:\w+)?="[^"]*"', '', text)
    return BeautifulSoup(cleaned_text, 'lxml-xml').find("body")
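
To see what the regular expression removes, it can be run on a small TEI-like string. The sketch below uses the standard library's `xml.etree.ElementTree` as a stand-in for Beautiful Soup so that it is self-contained; the sample XML is made up for illustration.

```python
import re
import xml.etree.ElementTree as ET

tei = (
    '<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" '
    'xmlns:xlink="http://www.w3.org/1999/xlink">'
    '<text><body><div><head n="1.">Introduction</head><p>Hello</p></div></body></text></TEI>'
)

# Same pattern as in soupify: drops xmlns="..." and xmlns:prefix="..." declarations
cleaned = re.sub(r'\s*xmlns(:\w+)?="[^"]*"', '', tei)
print(cleaned)

# In ElementTree, the default namespace makes a bare-name search fail...
print(ET.fromstring(tei).find(".//body") is None)

# ...while the cleaned document can be searched by plain tag names
body = ET.fromstring(cleaned).find(".//body")
print(body.find("div/head").text)
```

Note that `xml:space` is left in place: the pattern only matches attributes whose name starts with `xmlns`.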

In [None]:
soup_results_exceptions_xml = list(map(soupify, exceptions_xml))

In [None]:
len(soup_results_exceptions_xml)

118

Check the first article has been converted into a Tag object, which corresponds to the `<body>` tag in the original document.

In [None]:
type(soup_results_exceptions_xml[0])

bs4.element.Tag

We can navigate the parse tree using tag names e.g. head.

In [None]:
soup_results_exceptions_xml[0].head

<head n="1.">Introduction</head>

### 10.1 Find all tags including descendants


The function below uses the `find_all()` method to collect the names of all tags nested inside the `<body>` tag of each article, both direct children and deeper descendants, and returns the unique names in sorted order.

In [None]:
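def find_all_tags(article):

    tags_list = []

    try:
        # find_all(True) matches every tag nested inside the article body
        for tag in article.find_all(True):
            if tag.name is not None:
                tags_list.append(tag.name)
    except Exception as e:
        print(e)

    # unique tag names in alphabetical order
    return sorted(set(tags_list))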

In [None]:
all_tags = list(map(find_all_tags, soup_results_exceptions_xml))

There should be no exceptions this time and we can print out all the child and descendant tags.

In [None]:
all_tags

[['cell',
  'div',
  'figDesc',
  'figure',
  'graphic',
  'head',
  'label',
  'p',
  'ref',
  'row',
  'table'],
 ['div', 'figDesc', 'figure', 'graphic', 'head', 'label', 'note', 'p', 'ref'],
 ['cell',
  'div',
  'figDesc',
  'figure',
  'graphic',
  'head',
  'label',
  'note',
  'p',
  'ref',
  'row',
  'table'],
 ['cell',
  'div',
  'figDesc',
  'figure',
  'graphic',
  'head',
  'label',
  'note',
  'p',
  'ref',
  'row',
  'table'],
 ['cell',
  'div',
  'figDesc',
  'figure',
  'head',
  'label',
  'p',
  'ref',
  'row',
  'table'],
 ['cell',
  'div',
  'figDesc',
  'figure',
  'graphic',
  'head',
  'label',
  'note',
  'p',
  'ref',
  'row',
  'table'],
 ['cell',
  'div',
  'figDesc',
  'figure',
  'formula',
  'graphic',
  'head',
  'label',
  'note',
  'p',
  'ref',
  'row',
  'table'],
 ['cell',
  'div',
  'figDesc',
  'figure',
  'formula',
  'graphic',
  'head',
  'label',
  'note',
  'p',
  'ref',
  'row',
  'table'],
 ['cell',
  'div',
  'figDesc',
  'figure',
  'formul

Flatten the nested `all_tags` list into a single list by calling `sum()` with an empty list as the start value.

In [None]:
sum_all_tags = sum(all_tags, [])
sum_all_tags

['cell',
 'div',
 'figDesc',
 'figure',
 'graphic',
 'head',
 'label',
 'p',
 'ref',
 'row',
 'table',
 'div',
 'figDesc',
 'figure',
 'graphic',
 'head',
 'label',
 'note',
 'p',
 'ref',
 'cell',
 'div',
 'figDesc',
 'figure',
 'graphic',
 'head',
 'label',
 'note',
 'p',
 'ref',
 'row',
 'table',
 'cell',
 'div',
 'figDesc',
 'figure',
 'graphic',
 'head',
 'label',
 'note',
 'p',
 'ref',
 'row',
 'table',
 'cell',
 'div',
 'figDesc',
 'figure',
 'head',
 'label',
 'p',
 'ref',
 'row',
 'table',
 'cell',
 'div',
 'figDesc',
 'figure',
 'graphic',
 'head',
 'label',
 'note',
 'p',
 'ref',
 'row',
 'table',
 'cell',
 'div',
 'figDesc',
 'figure',
 'formula',
 'graphic',
 'head',
 'label',
 'note',
 'p',
 'ref',
 'row',
 'table',
 'cell',
 'div',
 'figDesc',
 'figure',
 'formula',
 'graphic',
 'head',
 'label',
 'note',
 'p',
 'ref',
 'row',
 'table',
 'cell',
 'div',
 'figDesc',
 'figure',
 'formula',
 'graphic',
 'head',
 'label',
 'note',
 'p',
 'ref',
 'row',
 'table',
 'cell',
 'di
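
`sum(nested, [])` creates a new list at every addition, which is fine for 118 short lists. A common alternative is `itertools.chain.from_iterable`, and `chain` is already imported at the top of the notebook. A small sketch on toy data, showing that both give the same flat list:

```python
from itertools import chain

# Toy stand-in for the nested all_tags list
nested = [["cell", "div", "p"], ["div", "note", "p"], []]

flat_sum = sum(nested, [])
flat_chain = list(chain.from_iterable(nested))

print(flat_chain)
print(flat_sum == flat_chain)
```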

Use the `Counter` class and its `most_common()` method to list every tag with its count, ordered from the most common to the least.

In [None]:
num_all_unique_tags = Counter(sum_all_tags).most_common()

for tag, count in num_all_unique_tags:
    print(tag, count)

div 117
p 116
head 115
ref 113
figDesc 111
figure 111
label 111
graphic 89
table 83
cell 79
row 79
note 78
formula 32
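
Because `find_all_tags` returns each article's tag names without duplicates, these counts are the number of articles that contain each tag, not the total number of tag occurrences. For example, `div 117` means that 117 of the 118 articles contain at least one `<div>`. A toy illustration, where `per_article_tags` is a made-up stand-in for `all_tags`:

```python
from collections import Counter
from itertools import chain

# Toy stand-in for all_tags: one deduplicated tag list per article
per_article_tags = [["div", "p"], ["div", "figure", "p"], ["div"]]

# Each article contributes at most 1 to a tag's count
article_counts = Counter(chain.from_iterable(per_article_tags)).most_common()
print(article_counts)  # [('div', 3), ('p', 2), ('figure', 1)]
```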


### 10.2 Find direct child tags of `<body>` tag

Every tag has a name which can be accessed using the `.name` attribute.

We will use this to find the direct child tags of the `<body>` tag.




In [None]:
def find_body_child_tags(article):

    body_child_list = []

    try:
        # iterating over a tag yields only its direct children;
        # text nodes have no name and are skipped
        for tag in article:
            if tag.name is not None:
                body_child_list.append(tag.name)
    except Exception as e:
        print(e)

    return sorted(set(body_child_list))
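
The difference between direct children and all descendants can be shown on a tiny document. The sketch below uses the standard library's `xml.etree.ElementTree` as a stand-in for Beautiful Soup so that it is self-contained. Iterating over an element plays the role of `for tag in article`, and `iter()` plays the role of `find_all(True)`. The sample XML is made up.

```python
import xml.etree.ElementTree as ET

body = ET.fromstring(
    "<body>"
    "<div><head>Introduction</head><p>Text <ref>1</ref></p></div>"
    "<figure><figDesc>Caption</figDesc></figure>"
    "</body>"
)

# Iterating over an element yields only its direct children
direct_children = sorted({child.tag for child in body})

# iter() walks the whole subtree; body itself is included, so it is skipped here
descendants = sorted({el.tag for el in body.iter() if el is not body})

print(direct_children)
print(descendants)
```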

In [None]:
body_child_tags = list(map(find_body_child_tags, soup_results_exceptions_xml))

We can see below that the `<body>` tag has direct child tags for `<div>`, `<figure>` and `<note>`.

In [None]:
body_child_tags

[['div', 'figure'],
 ['div', 'figure'],
 ['div', 'figure', 'note'],
 ['div', 'figure'],
 ['div', 'figure'],
 ['div', 'figure', 'note'],
 ['div', 'figure', 'note'],
 ['div', 'figure', 'note'],
 ['div', 'figure'],
 ['div', 'figure'],
 ['div', 'figure', 'note'],
 ['div', 'figure'],
 ['div', 'figure'],
 ['div'],
 ['div', 'figure', 'note'],
 ['div', 'figure', 'note'],
 ['div', 'figure'],
 ['div', 'figure', 'note'],
 [],
 ['div', 'figure'],
 ['div', 'figure'],
 ['div', 'figure'],
 ['div', 'figure', 'note'],
 ['div', 'figure', 'note'],
 ['div', 'figure', 'note'],
 ['div', 'figure', 'note'],
 ['div', 'figure', 'note'],
 ['div', 'figure'],
 ['div', 'figure'],
 ['div', 'figure', 'note'],
 ['div', 'figure'],
 ['div', 'figure'],
 ['div', 'figure', 'note'],
 ['div', 'figure'],
 ['div', 'figure'],
 ['div', 'figure'],
 ['div', 'figure'],
 ['div', 'figure', 'note'],
 ['div', 'figure', 'note'],
 ['div', 'figure', 'note'],
 ['div', 'figure'],
 ['div', 'figure'],
 ['div', 'figure', 'note'],
 ['div', 'fig

Flatten the nested `body_child_tags` list into a single list by calling `sum()` with an empty list as the start value.

In [None]:
sum_body_child_tags = sum(body_child_tags, [])
sum_body_child_tags

['div',
 'figure',
 'div',
 'figure',
 'div',
 'figure',
 'note',
 'div',
 'figure',
 'div',
 'figure',
 'div',
 'figure',
 'note',
 'div',
 'figure',
 'note',
 'div',
 'figure',
 'note',
 'div',
 'figure',
 'div',
 'figure',
 'div',
 'figure',
 'note',
 'div',
 'figure',
 'div',
 'figure',
 'div',
 'div',
 'figure',
 'note',
 'div',
 'figure',
 'note',
 'div',
 'figure',
 'div',
 'figure',
 'note',
 'div',
 'figure',
 'div',
 'figure',
 'div',
 'figure',
 'div',
 'figure',
 'note',
 'div',
 'figure',
 'note',
 'div',
 'figure',
 'note',
 'div',
 'figure',
 'note',
 'div',
 'figure',
 'note',
 'div',
 'figure',
 'div',
 'figure',
 'div',
 'figure',
 'note',
 'div',
 'figure',
 'div',
 'figure',
 'div',
 'figure',
 'note',
 'div',
 'figure',
 'div',
 'figure',
 'div',
 'figure',
 'div',
 'figure',
 'div',
 'figure',
 'note',
 'div',
 'figure',
 'note',
 'div',
 'figure',
 'note',
 'div',
 'figure',
 'div',
 'figure',
 'div',
 'figure',
 'note',
 'div',
 'figure',
 'note',
 'div',
 'figu

Again, use the `Counter` class and its `most_common()` method to list every tag with its count, ordered from the most common to the least.

In [None]:
num_unique_body_child_tags = Counter(sum_body_child_tags).most_common()

for tag, count in num_unique_body_child_tags:
    print(tag, count)

div 117
figure 111
note 51


### 10.3 Find direct child tags of `<div>` tags

Function to find all `<div>` tags and append all direct child tags to a list.

In [None]:
def find_div_tags(article):

    div_tags = []

    try:
        divs = article.find_all("div")
        for div in divs:
            # iterate over the direct children of each <div>
            for tag in div:
                if tag.name is not None:
                    div_tags.append(tag.name)
    except Exception as e:
        print(e)

    return sorted(set(div_tags))

In [None]:
div_tags = list(map(find_div_tags, soup_results_exceptions_xml))

In [None]:
div_tags

[['head', 'p'],
 ['head', 'note', 'p'],
 ['head', 'p'],
 ['head', 'p'],
 ['head', 'p'],
 ['head', 'p'],
 ['formula', 'head', 'p'],
 ['formula', 'head', 'p'],
 ['formula', 'head', 'p'],
 ['formula', 'head', 'note', 'p'],
 ['formula', 'head', 'p'],
 ['head', 'p'],
 ['head', 'p'],
 ['head', 'p'],
 ['head', 'p'],
 ['formula', 'head', 'note', 'p'],
 ['head', 'p'],
 ['head', 'p'],
 [],
 ['head', 'p'],
 ['head', 'note', 'p'],
 ['head', 'p'],
 ['head', 'p'],
 ['formula', 'head', 'p'],
 ['head', 'p'],
 ['head', 'p'],
 ['head', 'p'],
 ['head', 'p'],
 ['head', 'p'],
 ['head', 'p'],
 ['formula', 'head', 'p', 'ref'],
 ['head', 'p'],
 ['formula', 'head', 'p'],
 ['head', 'p'],
 ['formula', 'head', 'p'],
 ['head', 'p'],
 ['head', 'p'],
 ['formula', 'head', 'p'],
 ['head', 'p'],
 ['head', 'p'],
 ['head', 'p'],
 ['formula', 'head', 'p'],
 ['head', 'p'],
 ['head', 'p'],
 ['head', 'p'],
 ['head', 'p'],
 ['head', 'p'],
 ['head', 'p'],
 ['formula', 'head', 'p'],
 ['head', 'p'],
 ['head', 'p'],
 ['formula', 

Flatten the nested `div_tags` list into a single list by calling `sum()` with an empty list as the start value.

In [None]:
sum_div_tags = sum(div_tags, [])
sum_div_tags

['head',
 'p',
 'head',
 'note',
 'p',
 'head',
 'p',
 'head',
 'p',
 'head',
 'p',
 'head',
 'p',
 'formula',
 'head',
 'p',
 'formula',
 'head',
 'p',
 'formula',
 'head',
 'p',
 'formula',
 'head',
 'note',
 'p',
 'formula',
 'head',
 'p',
 'head',
 'p',
 'head',
 'p',
 'head',
 'p',
 'head',
 'p',
 'formula',
 'head',
 'note',
 'p',
 'head',
 'p',
 'head',
 'p',
 'head',
 'p',
 'head',
 'note',
 'p',
 'head',
 'p',
 'head',
 'p',
 'formula',
 'head',
 'p',
 'head',
 'p',
 'head',
 'p',
 'head',
 'p',
 'head',
 'p',
 'head',
 'p',
 'head',
 'p',
 'formula',
 'head',
 'p',
 'ref',
 'head',
 'p',
 'formula',
 'head',
 'p',
 'head',
 'p',
 'formula',
 'head',
 'p',
 'head',
 'p',
 'head',
 'p',
 'formula',
 'head',
 'p',
 'head',
 'p',
 'head',
 'p',
 'head',
 'p',
 'formula',
 'head',
 'p',
 'head',
 'p',
 'head',
 'p',
 'head',
 'p',
 'head',
 'p',
 'head',
 'p',
 'head',
 'p',
 'formula',
 'head',
 'p',
 'head',
 'p',
 'head',
 'p',
 'formula',
 'head',
 'note',
 'p',
 'head',
 'p',

Again, use the `Counter` class and its `most_common()` method to list every tag with its count, ordered from the most common to the least.

In [None]:
num_unique_div_tags = Counter(sum_div_tags).most_common()

for tag, count in num_unique_div_tags:
    print(tag, count)

p 116
head 112
formula 32
note 10
ref 2


## 11. Remove unwanted tags and content

Define functions to remove unwanted tags and their contents using the `decompose()` method, which removes a tag from the tree and then destroys it together with its contents.

In [None]:
def remove_headings(article):

    for head in article("head"):
        head.decompose()

    return article


def remove_figures(article):

    for figure in article("figure"):
        figure.decompose()

    return article

def remove_tables(article):

    for table in article("table"):
        table.decompose()

    return article

def remove_formulas(article):

    for formula in article("formula"):
        formula.decompose()

    return article


def remove_labels(article):

    for label in article("label"):
        label.decompose()

    return article


def remove_refs(article):

    for ref in article("ref"):
        ref.decompose()

    return article


def remove_graphics(article):

    for graphic in article("graphic"):
        graphic.decompose()

    return article


def remove_notes(article):

    for note in article("note"):
        note.decompose()

    return article

The function below applies each of the functions above to an article and returns the cleaned article wrapped in a list. The list is empty if the article is `None`.

In [None]:
def remove_tags(article):

    new_article_list = []

    if article is not None:
        try:
            # each helper modifies the tree in place and returns the same object
            new_article = remove_headings(article)
            new_article = remove_figures(new_article)
            new_article = remove_tables(new_article)
            new_article = remove_formulas(new_article)
            new_article = remove_labels(new_article)
            new_article = remove_refs(new_article)
            new_article = remove_graphics(new_article)
            new_article = remove_notes(new_article)
            new_article_list.append(new_article)
        except Exception as e:
            print(e)

    return new_article_list

In [None]:
soup_results_exceptions_removed_tags = list(map(remove_tags, soup_results_exceptions_xml))

In [None]:
len(soup_results_exceptions_removed_tags)

118

View article with unwanted tags and contents removed.

In [None]:
soup_results_exceptions_removed_tags[0]

[<body>
 <div><p>Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) originated in Wuhan, China in late December 2019.  This outbreak began spreading at an alarming rate, and unleashed a severe health crisis around the globe. Subsequently, the uncertainties caused a serious economic meltdown worldwide. On March 11, 2020, the World Health Organization (WHO) declared the novel coronavirus (COVID-19) outbreak a global pandemic.  Since then, it has gone on to affect millions of lives across the globe and caused nearly 6.3 million deaths as of June 7, 2022.  Human-human transmission for SARS-CoVs occurs primarily via respiratory droplets through sneezing, coughing, or close contact between persons. Mild symptomatic cases may include: fever, fatigue, dyspnea.  More severe cases of SARS-CoV-2 develop pneumonia, acute respiratory distress, and hypoxia.  Early on, many laboratories around the world got involved in the development of COVID-19 therapeutics. These include, development of 

### 11.1 Strip markup and keep text

We only want to keep the human-readable text so we will use the get_text() method to return all the text in the articles as a single Unicode string without the `<body>`, `<div>` and `<p>` tags.

In [None]:
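def strip_markup(articles):

    # `articles` is the one-item list produced by remove_tags, so this returns
    # the text of that single article (or None if the list is empty)
    for article in articles:

        return article.get_text()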

In [None]:
stripped_markup_articles = list(map(strip_markup, soup_results_exceptions_removed_tags))

In [None]:
len(stripped_markup_articles)

118

View article with all tags removed.

In [None]:
stripped_markup_articles[0]

'\nSevere acute respiratory syndrome coronavirus 2 (SARS-CoV-2) originated in Wuhan, China in late December 2019.  This outbreak began spreading at an alarming rate, and unleashed a severe health crisis around the globe. Subsequently, the uncertainties caused a serious economic meltdown worldwide. On March 11, 2020, the World Health Organization (WHO) declared the novel coronavirus (COVID-19) outbreak a global pandemic.  Since then, it has gone on to affect millions of lives across the globe and caused nearly 6.3 million deaths as of June 7, 2022.  Human-human transmission for SARS-CoVs occurs primarily via respiratory droplets through sneezing, coughing, or close contact between persons. Mild symptomatic cases may include: fever, fatigue, dyspnea.  More severe cases of SARS-CoV-2 develop pneumonia, acute respiratory distress, and hypoxia.  Early on, many laboratories around the world got involved in the development of COVID-19 therapeutics. These include, development of therapies thro

In [None]:
with open('2023-01-06_stripped_markup_articles_exceptions_118_v2.pickle', 'wb') as f:
  pickle.dump(stripped_markup_articles, f)

### 11.2 Add extracted full text to DataFrame

Add extracted full text to DataFrame for 118 exceptions.

In [None]:
with open('2023-01-06_pmc_search_results_no_full_text.pickle', 'rb') as f:
    pmc_search_results_no_full_text = pickle.load(f)

In [None]:
pmc_search_results_no_full_text

Unnamed: 0,pmcid,published,revised,title,journal,authors,doi,pdf_url,text
64,PMC9538661,2022-10-13,2022-11-22,Recent Drug Development and Medicinal Chemistr...,ChemMedChem,"Ghosh AK, Mishevich JL, Mesecar A, Mitsuya H.",10.1002/cmdc.202200440,https://europepmc.org/articles/PMC9538661?pdf=...,
188,PMC9794394,2022-12-28,2023-01-02,Nonstructural protein 1 (nsp1) widespread RNA ...,iScience,"Bermudez Y, Miles J, Muller M.",10.1016/j.isci.2022.105887,https://europepmc.org/articles/PMC9794394?pdf=...,
212,PMC9538837,2022-10-10,2022-11-22,The Efficacy of Traditional Medicinal Plants i...,Chemistry & biodiversity,"Choe J, Har Yong P, Xiang Ng Z.",10.1002/cbdv.202200655,https://europepmc.org/articles/PMC9538837?pdf=...,
256,PMC9788990,2022-12-24,2023-01-02,Sleep and circadian rhythm disruption alters t...,iScience,"Taylor L, Von Lendenfeld F, Ashton A, Sanghani...",10.1016/j.isci.2022.105877,https://europepmc.org/articles/PMC9788990?pdf=...,
275,PMC9794516,2022-12-28,2023-01-02,MultiOMICs landscape of SARS-CoV-2-induced hos...,iScience,"Pinto SM, Subbannayya Y, Kim H, Hagen L, Górna...",10.1016/j.isci.2022.105895,https://europepmc.org/articles/PMC9794516?pdf=...,
...,...,...,...,...,...,...,...,...,...
9797,PMC7162151,2020-02-28,2020-04-21,News.,Chemistry & industry,0,10.1002/cind.842_3.x,https://europepmc.org/articles/PMC7162151?pdf=...,
9809,PMC9094125,2022-05-11,2022-07-16,Recent advances in metal-organic framework-bas...,Nano research,"Yang M, Zhang J, Wei Y, Zhang J, Tao C.",10.1007/s12274-022-4302-x,https://europepmc.org/articles/PMC9094125?pdf=...,
10399,PMC7492056,2020-09-15,2020-09-28,Full Issue PDF.,JACC. CardioOncology,0,10.1016/s2666-0873(20)30180-0,https://europepmc.org/articles/PMC7492056?pdf=...,
10767,PMC8459260,2021-02-25,2022-01-25,Getting in touch with your senses: Mechanisms ...,WIREs mechanisms of disease,"Gupta S, Butler SJ.",10.1002/wsbm.1520,https://europepmc.org/articles/PMC8459260?pdf=...,


In [None]:
pmc_search_results_full_text_grobid = pmc_search_results_no_full_text.copy()

In [None]:
with open('2023-01-06_stripped_markup_articles_exceptions_118_v2.pickle', 'rb') as f:
    stripped_markup_articles = pickle.load(f)

In [None]:
len(stripped_markup_articles)

118

Add full text with stripped markup as 'text' column to DataFrame.

In [None]:
pmc_search_results_full_text_grobid['text'] = stripped_markup_articles

In [None]:
pmc_search_results_full_text_grobid

Unnamed: 0,pmcid,published,revised,title,journal,authors,doi,pdf_url,text
64,PMC9538661,2022-10-13,2022-11-22,Recent Drug Development and Medicinal Chemistr...,ChemMedChem,"Ghosh AK, Mishevich JL, Mesecar A, Mitsuya H.",10.1002/cmdc.202200440,https://europepmc.org/articles/PMC9538661?pdf=...,\nSevere acute respiratory syndrome coronaviru...
188,PMC9794394,2022-12-28,2023-01-02,Nonstructural protein 1 (nsp1) widespread RNA ...,iScience,"Bermudez Y, Miles J, Muller M.",10.1016/j.isci.2022.105887,https://europepmc.org/articles/PMC9794394?pdf=...,\nThe past 20 years have seen the emergence of...
212,PMC9538837,2022-10-10,2022-11-22,The Efficacy of Traditional Medicinal Plants i...,Chemistry & biodiversity,"Choe J, Har Yong P, Xiang Ng Z.",10.1002/cbdv.202200655,https://europepmc.org/articles/PMC9538837?pdf=...,\nCoronavirus disease (Covid- 19) is a human r...
256,PMC9788990,2022-12-24,2023-01-02,Sleep and circadian rhythm disruption alters t...,iScience,"Taylor L, Von Lendenfeld F, Ashton A, Sanghani...",10.1016/j.isci.2022.105877,https://europepmc.org/articles/PMC9788990?pdf=...,\n\n\nJ o u r n a l P r e -p r o o f\nRespirat...
275,PMC9794516,2022-12-28,2023-01-02,MultiOMICs landscape of SARS-CoV-2-induced hos...,iScience,"Pinto SM, Subbannayya Y, Kim H, Hagen L, Górna...",10.1016/j.isci.2022.105895,https://europepmc.org/articles/PMC9794516?pdf=...,\nThe rapid emergence of the COVID-19 pandemic...
...,...,...,...,...,...,...,...,...,...
9797,PMC7162151,2020-02-28,2020-04-21,News.,Chemistry & industry,0,10.1002/cind.842_3.x,https://europepmc.org/articles/PMC7162151?pdf=...,\n\n\nRestricting the amount of the amino acid...
9809,PMC9094125,2022-05-11,2022-07-16,Recent advances in metal-organic framework-bas...,Nano research,"Yang M, Zhang J, Wei Y, Zhang J, Tao C.",10.1007/s12274-022-4302-x,https://europepmc.org/articles/PMC9094125?pdf=...,\n\n\n\n\n\n\n\n\n\n\n\n
10399,PMC7492056,2020-09-15,2020-09-28,Full Issue PDF.,JACC. CardioOncology,0,10.1016/s2666-0873(20)30180-0,https://europepmc.org/articles/PMC7492056?pdf=...,\nT he survival of children with cancer has co...
10767,PMC8459260,2021-02-25,2022-01-25,Getting in touch with your senses: Mechanisms ...,WIREs mechanisms of disease,"Gupta S, Butler SJ.",10.1002/wsbm.1520,https://europepmc.org/articles/PMC8459260?pdf=...,\nSomatosensation is essential for survival. I...


In [None]:
with open('2023-01-06_pmc_search_results_full_text_grobid_v2.pickle', 'wb') as f:
  pickle.dump(pmc_search_results_full_text_grobid, f)

## 12. Check for missing text

Concise summary of DataFrame to see if there are any articles with missing text.

In [None]:
pmc_search_results_full_text_grobid.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 118 entries, 64 to 11329
Data columns (total 9 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   pmcid      118 non-null    object
 1   published  118 non-null    object
 2   revised    118 non-null    object
 3   title      118 non-null    object
 4   journal    118 non-null    object
 5   authors    118 non-null    object
 6   doi        118 non-null    object
 7   pdf_url    118 non-null    object
 8   text       118 non-null    object
dtypes: object(9)
memory usage: 9.2+ KB


There are no null values for any of the 118 exceptions, but a null check does not detect empty strings, so we also check the text column for an empty string.

In [None]:
pmc_search_results_full_text_grobid[pmc_search_results_full_text_grobid['text']=='']

Unnamed: 0,pmcid,published,revised,title,journal,authors,doi,pdf_url,text
1305,PMC8010379,2021-03-31,2022-11-08,Repurposing antiviral drugs on recently emerge...,Materials today. Proceedings,"Swathi K, Nikitha B, Chandrakala B, Lakshmanad...",10.1016/j.matpr.2021.03.143,https://europepmc.org/articles/PMC8010379?pdf=...,


In [None]:
pmc_search_results_full_text_grobid.loc[pmc_search_results_full_text_grobid['pmcid'] == 'PMC8010379'].iloc[0]

pmcid                                               PMC8010379
published                                           2021-03-31
revised                                             2022-11-08
title        Repurposing antiviral drugs on recently emerge...
journal                           Materials today. Proceedings
authors      Swathi K, Nikitha B, Chandrakala B, Lakshmanad...
doi                                10.1016/j.matpr.2021.03.143
pdf_url      https://europepmc.org/articles/PMC8010379?pdf=...
text                                                          
Name: 1305, dtype: object

One article (PMC8010379) was withdrawn, hence the empty string in its text column.
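
An equality test against `''` does not flag text that consists only of whitespace. In the DataFrame printed in section 11.2, the text for PMC9094125 displays as newline characters only. The sketch below shows a stricter check on plain Python strings. `blank_ids` is a hypothetical helper, and the records are toy values modelled on the rows above.

```python
def blank_ids(records):
    """Return the ids whose text is empty or whitespace-only.

    records: iterable of (id, text) pairs. Hypothetical helper for illustration.
    """
    return [rec_id for rec_id, text in records if not text.strip()]


records = [
    ("PMC8010379", ""),          # empty string
    ("PMC9094125", "\n\n\n\n"),  # whitespace only
    ("PMC8459260", "\nSomatosensation is essential for survival."),
]

print(blank_ids(records))  # ['PMC8010379', 'PMC9094125']
```

On the DataFrame itself, a comparable filter would be `pmc_search_results_full_text_grobid['text'].str.strip() == ''`.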

### References

* Europe PMC https://europepmc.org/

* Articles RESTful API https://europepmc.org/RestfulWebService

* Full list of publication types https://europepmc.org/advancesearch

* Search syntax reference https://europepmc.org/searchsyntax

* Web Service Reference https://europepmc.org/docs/EBI_Europe_PMC_Web_Service_Reference.pdf

* University of Virginia Claude Moore Health Sciences Library project https://github.com/carrlucy/HSL_OA/blob/main/streamlit_app.py


* Beautiful Soup documentation https://www.crummy.com/software/BeautifulSoup/bs4/doc/

* GROBID https://github.com/kermitt2/grobid/
