# acceptTime_xml_extract

## Purpose
This notebook documents the process of extracting received, accepted, and electronic publication (epub) dates from PLOS article XML files.

## Workflow
The workflow begins by updating the local allofplos article directory so that it includes articles published since the last update. As of 11/26/2023, the directory held 347,607 articles and took up around 38 GB of storage.

In [None]:
# External allofplos update command
# !python -m allofplos.update

Because my focus is on primary research articles published for the first time, the next step filters out entries for corrected articles, which are identified by "correct" in the file name.

In [224]:
import os

In [109]:
# List all PLOS article XMLs in allofplos - points to local directory for allofplos module
xml_list = os.listdir(".venv/Lib/site-packages/allofplos/allofplos_xml/")

# Remove corrected articles (file names containing "correct")
xml_before = len(xml_list)
xml_no_corr = [xml for xml in xml_list if "correct" not in xml]
xml_after = len(xml_no_corr)

print(f"{xml_before - xml_after} corrected articles removed from final XML list.")

3641 corrected articles removed from final XML list.
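
To make the filter concrete, here is a minimal sketch that applies the same substring test to a few made-up file names. The names in `sample_files` and the helper `is_correction` are illustrative assumptions, not taken from the real directory listing.

```python
def is_correction(filename):
    """Return True if the file name contains "correct" (same test as the cell above)."""
    return "correct" in filename

# Illustrative file names only; the real list comes from os.listdir()
sample_files = [
    "journal.pbio.0000001.xml",
    "journal.pone.0012345.xml",
    "plos.correction.00000000-0000-0000-0000-000000000000.xml",
]

kept = [name for name in sample_files if not is_correction(name)]
removed = len(sample_files) - len(kept)
print(f"{removed} corrected articles removed; {len(kept)} kept.")
```

Note that this is a plain substring match, so any file whose name happens to contain "correct" would be dropped as well.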


We'll be using the **xmltodict** module to read the XML files as more accessible Python dictionaries. Below are several functions to help us parse the resulting xmltodict dictionary objects, taking advantage of the relatively rigid structure of PLOS article XMLs.

In [27]:
import xmltodict

def readxml(xml_path):
    """
    Reads XML files as Python dictionary objects.

    Param xml_path: Local file path to XML file
    Returns: Python dictionary (xmldict) object
    """

    # The context manager closes the file even if parsing fails
    with open(xml_path, "r", encoding="utf-8") as xml:
        xml_dict = xmltodict.parse(xml.read())

    return xml_dict


def get_date(branch):
    """
    Retrieves and formats the date from a Python dictionary (xmltodict) object branch.
    If "day" dictionary key is missing, defaults day value to 1.

    Param branch: Branch of an xmltodict object that has date information
    Returns: Formatted date
    """

    if "day" in branch.keys():
        out_date = branch["year"] + "-" + branch["month"] + "-" + branch["day"]
    else:
        out_date = branch["year"] + "-" + branch["month"] + "-1"
    return(out_date)


def find_dates(xmldict):
    """
    Finds and retrieves: received, accepted, and epub dates from a 
    Python dictionary (xmltodict) object.

    Param xmldict: Python dictionary (xmltodict) object
    Returns: Tuple of date strings (receive_date, accept_date, epub_date); a missing date is an empty string
    """
    
    receive_date = ""
    accept_date = ""
    epub_date = ""

    if "history" in xmldict["article"]["front"]["article-meta"].keys():
        date_branch = xmldict["article"]["front"]["article-meta"]["history"]["date"]

        # xmltodict returns a dict for a single <date> element and a list for several,
        # so wrap the single case in a list and handle both the same way
        if isinstance(date_branch, dict):
            date_branch = [date_branch]

        for hist_entry in date_branch:
            if hist_entry["@date-type"] == "received":
                receive_date = get_date(hist_entry)
            elif hist_entry["@date-type"] == "accepted":
                accept_date = get_date(hist_entry)

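    if "pub-date" in xmldict["article"]["front"]["article-meta"].keys():
        pub_branch = xmldict["article"]["front"]["article-meta"]["pub-date"]

        # A single <pub-date> element is parsed as a dict rather than a list
        if isinstance(pub_branch, dict):
            pub_branch = [pub_branch]

        for pub_entry in pub_branch:
            if pub_entry["@pub-type"] == "epub":
                epub_date = get_date(pub_entry)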
                
    return(receive_date, accept_date, epub_date)


def get_doi(xmldict):
    """
    Retrieves DOI for a Python dictionary (xmltodict) object.

    Param xmldict: Python dictionary (xmltodict) object
    Returns: DOI string
    """

    doi = ""
    id_branch = xmldict["article"]["front"]["article-meta"]["article-id"]

    # xmltodict returns a dict for a single <article-id> element and a list for several
    if isinstance(id_branch, dict):
        id_branch = [id_branch]

    for article_id in id_branch:
        if article_id.get("@pub-id-type") == "doi":
            doi = article_id["#text"]

    return(doi)
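
For comparison, the same dates can be pulled out with the standard library's `xml.etree.ElementTree` instead of xmltodict. This is only a sketch of an alternative, not the method used in this notebook: `SAMPLE_XML` is a hand-written stand-in for a PLOS article, the DOI in it is made up, and `format_date` and `extract_dates` are hypothetical helper names. It assumes the elements of interest carry no XML namespace. Because `findall` always returns a list, no special handling is needed when an element occurs only once.

```python
import xml.etree.ElementTree as ET

# Hand-written stand-in for a PLOS article XML (not a real article)
SAMPLE_XML = """<article article-type="research-article">
  <front>
    <article-meta>
      <article-id pub-id-type="doi">10.1371/journal.example.0000001</article-id>
      <pub-date pub-type="epub">
        <day>13</day><month>10</month><year>2003</year>
      </pub-date>
      <history>
        <date date-type="received"><day>1</day><month>6</month><year>2003</year></date>
        <date date-type="accepted"><day>10</day><month>7</month><year>2003</year></date>
      </history>
    </article-meta>
  </front>
</article>"""


def format_date(elem):
    """Build a "YYYY-M-D" string; a missing <day> defaults to 1, as in get_date."""
    year = elem.findtext("year")
    month = elem.findtext("month")
    day = elem.findtext("day", default="1")
    return f"{year}-{month}-{day}"


def extract_dates(xml_text):
    """Return received, accepted, and epub dates; missing ones stay empty strings."""
    meta = ET.fromstring(xml_text).find("front/article-meta")
    dates = {"received": "", "accepted": "", "epub": ""}

    for hist_entry in meta.findall("history/date"):
        kind = hist_entry.get("date-type")
        if kind in ("received", "accepted"):
            dates[kind] = format_date(hist_entry)

    for pub_entry in meta.findall("pub-date"):
        if pub_entry.get("pub-type") == "epub":
            dates["epub"] = format_date(pub_entry)

    return dates


result = extract_dates(SAMPLE_XML)
print(result)
```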

Next we iterate through all of the PLOS articles in allofplos that are first-time publications of research articles (not corrections). This step has not been parallelized and takes about 3 hours to run sequentially through the more than 300,000 articles in allofplos.

In [660]:
# Initialize information lists
received_dates = []
accepted_dates = []
epub_dates = []
journals = []
article_dois = []
skipped_articles = 0

# Iterate through all non-correction research articles
# Also keeps a count of any skipped articles due to XML, formatting, or other errors (note: only 1 article skipped when run on 11/26/2023)
for xml in xml_no_corr:
    try: 
        xml_path = ".venv/Lib/site-packages/allofplos/allofplos_xml/" + xml
        xmldict = readxml(xml_path)

        if xmldict["article"]["@article-type"] == "research-article":
            journal = xml.split(".")[1]
            journals.append(journal)
            xml_dates = find_dates(xmldict)
            received_dates.append(xml_dates[0])
            accepted_dates.append(xml_dates[1])
            epub_dates.append(xml_dates[2])
            article_dois.append(get_doi(xmldict))
    
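    # Catch Exception rather than using a bare except so that Ctrl+C can still stop the long run
    except Exception:
        skipped_articles += 1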
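
Since the loop above treats each file independently, it could in principle be spread across CPU cores. The sketch below shows one way to do that with `concurrent.futures.ProcessPoolExecutor`. It has not been run on the full allofplos directory, so no speedup is claimed. `article_type` is a deliberately tiny hypothetical worker that uses `xml.etree.ElementTree`, and `run_parallel`, `max_workers`, and `chunksize` are illustrative names and values; a real worker would return the dates, journal, and DOI for one file.

```python
from concurrent.futures import ProcessPoolExecutor
import xml.etree.ElementTree as ET


def article_type(xml_path):
    """Hypothetical worker: return (path, article-type), or (path, None) if the file cannot be read."""
    try:
        root = ET.parse(xml_path).getroot()
        return xml_path, root.get("article-type")
    except (ET.ParseError, OSError):
        return xml_path, None


def run_parallel(worker, paths, max_workers=4, chunksize=256):
    """Apply worker to every path in separate processes, preserving input order."""
    with ProcessPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(worker, paths, chunksize=chunksize))


# Example call (assumes xml_paths is a list of full file paths):
# if __name__ == "__main__":
#     results = run_parallel(article_type, xml_paths)
```

The worker has to be a top-level function so that it can be sent to the worker processes, and it returns a value instead of appending to shared lists, because lists in the parent process are not shared with the workers.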


Now we can create a pandas dataframe from the extracted information.

In [664]:
import pandas as pd

df = pd.DataFrame(list(zip(received_dates, accepted_dates, epub_dates, journals, article_dois)),
                  columns = ["received_date", "accepted_date", "epub_date", "journal", "article_doi"])

We'll need a function that computes the difference between two dates so that we can measure the time from submission (received_date) to acceptance or publication:

In [26]:
from datetime import date
import math

def time_diff(date1, date2):
    """
    Computes the difference between two dates in days.

    Param date1: First date string in the "YYYY-MM-DD" format
    Param date2: Second date string in the "YYYY-MM-DD" format
    Returns: Difference in days from date1 to date2 (date2 - date1), or None if either date is missing
    """

    if isinstance(date1, str) and date1 != "" and isinstance(date2, str) and date2 != "":
        year1, month1, day1 = (int(part) for part in date1.split("-"))
        year2, month2, day2 = (int(part) for part in date2.split("-"))
        return (date(year2, month2, day2) - date(year1, month1, day1)).days

    # Missing dates yield None, which pandas stores as NaN in the resulting column
    return None
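
As a quick check of the date arithmetic, here is a small standalone sketch that does the same calculation with `datetime.strptime`, which also accepts months and days without zero padding. `days_between` is a hypothetical name. Unlike `time_diff`, it returns `None` for an impossible date instead of raising an error.

```python
from datetime import datetime


def days_between(start, end):
    """Return whole days from start to end, or None if either value cannot be parsed."""
    try:
        start_dt = datetime.strptime(start, "%Y-%m-%d")
        end_dt = datetime.strptime(end, "%Y-%m-%d")
    except (TypeError, ValueError):
        return None
    return (end_dt - start_dt).days


print(days_between("2003-6-1", "2003-7-10"))   # 39
print(days_between("2003-7-10", "2003-6-1"))   # -39 (dates in the wrong order)
print(days_between("", "2003-7-10"))           # None (missing date)
```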

Now we can compute the time differences and add those values to the pandas dataframe.

In [32]:
# Initialize lists
accept_time = []
receive_to_pub_time = []
accept_to_pub_time = []

# Iterate through each row of the dataframe
for i in range(df.shape[0]):
    t1 = time_diff(df.loc[i, "received_date"], df.loc[i, "accepted_date"])
    accept_time.append(t1)

    t2 = time_diff(df.loc[i, "received_date"], df.loc[i, "epub_date"])
    receive_to_pub_time.append(t2)

    t3 = time_diff(df.loc[i, "accepted_date"], df.loc[i, "epub_date"])
    accept_to_pub_time.append(t3)

# Add computed times to dataframe
df["accept_time"] = accept_time
df["receive_to_pub_time"] = receive_to_pub_time
df["accept_to_pub_time"] = accept_to_pub_time

print(df.shape)
df.head()

(326737, 8)


| | received_date | accepted_date | epub_date | journal | article_doi | accept_time | receive_to_pub_time | accept_to_pub_time |
|---|---|---|---|---|---|---|---|---|
| 0 | 2003-6-1 | 2003-7-10 | 2003-10-13 | pbio | 10.1371/journal.pbio.0000001 | 39.0 | 134.0 | 95.0 |
| 1 | 2003-5-19 | 2003-7-16 | 2003-11-17 | pbio | 10.1371/journal.pbio.0000002 | 58.0 | 182.0 | 124.0 |
| 2 | 2003-6-12 | 2003-7-25 | 2003-8-18 | pbio | 10.1371/journal.pbio.0000005 | 43.0 | 67.0 | 24.0 |
| 3 | 2003-6-3 | 2003-7-29 | 2003-8-18 | pbio | 10.1371/journal.pbio.0000006 | 56.0 | 76.0 | 20.0 |
| 4 | 2003-6-20 | 2003-8-1 | 2003-10-13 | pbio | 10.1371/journal.pbio.0000010 | 42.0 | 115.0 | 73.0 |
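
The three interval columns can also be computed without a Python-level loop by parsing the date columns once with `pd.to_datetime` and subtracting them. This is a sketch on a small hand-made frame (`demo`) that reuses the notebook's column names; it is not the code that produced the output above. With `errors="coerce"`, empty or unparseable dates become `NaT`, so the affected intervals come out as `NaN`.

```python
import pandas as pd

# Small illustrative frame using the same column names as the notebook
demo = pd.DataFrame({
    "received_date": ["2003-6-1", "2003-5-19", ""],
    "accepted_date": ["2003-7-10", "2003-7-16", "2003-8-1"],
    "epub_date": ["2003-10-13", "2003-11-17", "2003-10-13"],
})

# Parse each date column once; empty strings become NaT
date_cols = ["received_date", "accepted_date", "epub_date"]
parsed = demo[date_cols].apply(pd.to_datetime, format="%Y-%m-%d", errors="coerce")

# Subtracting datetime columns gives timedeltas; .dt.days converts them to day counts
demo["accept_time"] = (parsed["accepted_date"] - parsed["received_date"]).dt.days
demo["receive_to_pub_time"] = (parsed["epub_date"] - parsed["received_date"]).dt.days
demo["accept_to_pub_time"] = (parsed["epub_date"] - parsed["accepted_date"]).dt.days

print(demo)
```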


Let's save the processed data for use in downstream analyses in Python and/or R. One final note: a small number of the computed intervals are negative. These are due to nonsensical date values in some older XML documents, and they do not appear in a filtered subset of recently published articles (e.g. the last 180 days). There are also very short times to acceptance, such as 0 days or 1 day (e.g. 10.1371/journal.pgen.1007429), but spot-checking shows that these are true values: in this case, the article was received on May 17, 2018 and accepted on May 18, 2018. These are likely rare special cases.

In [None]:
out_file = f"data/{date.today()}_plostime_data.txt"
df.to_csv(out_file, sep = "\t", index = False)

# End of notebook