# Disruptive Papers in Fetal Surgery

## Description

This notebook takes one or more PubMed CSV exports, placed in the [data/pubmed/results/](data/pubmed/results/) directory, as input. For each export, it adds hyperlink columns (DOI, PubMed record, and library catalog lookups), maps the PubMed IDs (PMIDs) to Microsoft Academic Graph IDs (MAGIDs), uses the MAGIDs to look up citation counts and disruption scores in the Wu et al. (2023) dataset, and saves the top-scoring articles to a styled Excel spreadsheet.
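
For orientation, the disruption score in the Wu et al. dataset is, as described in the accompanying paper, a ratio built from three counts of later papers: those citing only the focal paper (`n_i`), those citing both the focal paper and its references (`n_j`), and those citing only its references (`n_k`). The sketch below illustrates that formula only; it is not code from this notebook, and the function name and arguments are hypothetical.

```python
def disruption_index(n_i: int, n_j: int, n_k: int) -> float:
    """Disruption index D = (n_i - n_j) / (n_i + n_j + n_k).

    n_i: later papers citing only the focal paper
    n_j: later papers citing the focal paper and its references
    n_k: later papers citing only the focal paper's references
    """
    total = n_i + n_j + n_k
    if total == 0:
        raise ValueError("no citing papers, so the index is undefined")
    return (n_i - n_j) / total


# made-up counts, for illustration only
print(disruption_index(8, 2, 10))  # 0.3  (more disruptive)
print(disruption_index(1, 9, 10))  # -0.4 (more developmental)
```

Scores fall between -1 and 1. Positive values indicate disruptive papers and negative values developmental ones, which is why the notebook ranks articles by both the largest and the smallest scores.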

### PubMed Search Documentation and Replication

The searches used to create the PubMed CSV exports are documented in [DisruptivePapersFetalSurgery-PubMedSearchDocumentation-202303.xlsx](data/pubmed/searches/DisruptivePapersFetalSurgery-PubMedSearchDocumentation-202303.xlsx). They cover Congenital Diaphragmatic Hernia (CDH), Congenital Pulmonary Airway Malformation (CPAM), Neural Tube Defects (NTD), and Twin-to-Twin Transfusion Syndrome (TTTS), with each search documented on a separate spreadsheet tab.

The searches can be replicated by either copying and pasting the full search string from each spreadsheet tab into PubMed or clicking on the number in the Results column. Once replicated, the search results can be exported to CSV via the *Save* button. Select *All results* and *CSV* format, then click *Create file*.

### References

Wu, L., Wang, D., & Evans, J. (2023). Replication Data for: Large teams develop and small teams disrupt science and technology [Data set]. Harvard Dataverse. https://doi.org/10.7910/DVN/JPWNNK


## Environment

### Import libraries

In [33]:
import glob
import logging
import pandas as pd
from pathlib import Path
import requests
from styleframe import StyleFrame, Styler, utils
import time
from tqdm.notebook import tqdm


### Define constants

In [34]:
email = "whimar@ohsu.edu"
openalex = "https://api.openalex.org/works?per-page=100&filter=pmid:"
headers = {"user-agent": "mailto:" + email}

datestamp = time.strftime("%Y%m%d")
TOP_PAPERS = 100

DATA_DIR = "data/"
DISRUPT_DIR = DATA_DIR + "MAG-disruption/"
PUBMED_DIR = DATA_DIR + "pubmed/"
RESULTS_DIR = PUBMED_DIR + "results/*"

LOG_DIR = "log/"

TARGET_DIR = "target/"
CSV_DIR = TARGET_DIR + "csv/"
XLSX_DIR = TARGET_DIR + "xlsx/"


### Create target directories

In [35]:
Path(LOG_DIR).mkdir(parents=True, exist_ok=True)
Path(TARGET_DIR).mkdir(parents=True, exist_ok=True)
Path(CSV_DIR).mkdir(parents=True, exist_ok=True)
Path(XLSX_DIR).mkdir(parents=True, exist_ok=True)


### Configure logging

In [36]:
logging.basicConfig(
    filename=f"{LOG_DIR}fsdp-{datestamp}.log",
    filemode="w",
    force=True,
    format="%(asctime)s.%(msecs)03d : %(levelname)s : %(message)s",
    level=logging.INFO,
    datefmt="%Y-%m-%d %H:%M:%S",
)
logger = logging.getLogger()

## Function definitions

### pmid2magid



#### Description

Maps PubMed IDs (PMID) from a PubMed CSV export to Microsoft Academic Graph IDs (MAGID) using the [OpenAlex Works API](https://docs.openalex.org/api-entities/works). Caches API results to CSV and reads from CSV if it already exists to reduce API calls.
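
The batching step can be sketched on its own. The example below mirrors the grouping used in the function (50 PMIDs per request, joined with `|` so that one filter matches any of them). The helper name `batch_pmid_filters` and the example PMIDs are hypothetical, and no network request is made.

```python
def batch_pmid_filters(pmids, batch_size=50):
    """Yield pipe-separated PMID strings, one string per API request."""
    for start in range(0, len(pmids), batch_size):
        batch = pmids[start:start + batch_size]
        yield "|".join(str(pmid) for pmid in batch)


# hypothetical PMIDs, for illustration only
example_pmids = list(range(1001, 1121))  # 120 IDs
filters = list(batch_pmid_filters(example_pmids))

print(len(filters))               # 3 requests instead of 120
print(filters[2].count("|") + 1)  # 20 IDs in the final, partial batch
```

Each string would be appended to the `openalex` base URL defined in the constants cell.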

#### Parameters

| Parameter | Description |
| --- | --- |
| articles_df | A processed/augmented DataFrame created from a PubMed CSV export |
| topicdate | The stem of the PubMed CSV filename, expected filename convention *topic-date.csv* |
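
As a small illustration of the `topicdate` convention, the sketch below derives the stem from an export path and builds the cache filename the same way the function does. The helper name `cache_path_for` is hypothetical; `CSV_DIR` repeats the notebook constant so the snippet runs on its own.

```python
from pathlib import Path

CSV_DIR = "target/csv/"  # same value as the notebook constant


def cache_path_for(pubmed_csv: str) -> Path:
    """Return the MAGID cache path for a PubMed export named topic-date.csv."""
    topicdate = Path(pubmed_csv).stem  # "CDH-202303.csv" -> "CDH-202303"
    return Path(f"{CSV_DIR}{topicdate}-MAGID.csv")


path = cache_path_for("data/pubmed/results/CDH-202303.csv")
print(path.as_posix())  # target/csv/CDH-202303-MAGID.csv
print(path.is_file())   # True only if a cached mapping already exists
```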

#### Return Values

| Return Value | Description |
| --- | --- |
| mags_df | A DataFrame indexed by `long_pmid` (the PubMed URL form of the PMID) with a single `magid` column |


In [37]:
def pmid2magid(articles_df, topicdate):
    magsfile = f"{CSV_DIR}{topicdate}-MAGID.csv"
    magsfilepath = Path(magsfile)

    if magsfilepath.is_file():
        logger.info("Reading cached %s MAGID data from file %s", topicdate, magsfile)
        mags_df = pd.read_csv(magsfile, index_col="long_pmid")
    else:
        logger.info("Pulling new %s MAGID data from OpenAlex", topicdate)
        pmids = articles_df["pmid"].tolist()

        # combine PMIDs into groups of 50 to reduce OpenAlex API calls
        pmid_groups = [pmids[i: i+50] for i in range(0, len(pmids), 50)]

        pmid_strings = []
        mags = {}
        for pmid_group in pmid_groups:
            pmid_strings.append("|".join(map(str, pmid_group)))
        
        for pmid_string in tqdm(pmid_strings, desc=topicdate + " openalex"):
            logger.info("Requesting %s PMIDs from OpenAlex: %s", topicdate, pmid_string)
            # a timeout keeps a stalled connection from hanging the notebook
            response = requests.get(openalex + pmid_string, headers=headers, timeout=30)
            logger.info("OpenAlex response code: %s", response.status_code)
            if response.status_code == 200:
                works = response.json()
                for work in works["results"]:
                    if "mag" in work["ids"]:
                        logger.info("Found MAGID %s for PMID %s", work["ids"]["mag"], work["ids"]["pmid"])
                        mags[work["ids"]["pmid"]] = work["ids"]["mag"]
                    else:
                        logger.info("No MAGID found for PMID %s", work["ids"]["pmid"])
            else:
                logger.error("OpenAlex request failed with status %s; skipped PMIDs: %s", response.status_code, pmid_string)
        
        mags_df = pd.DataFrame.from_dict(mags, orient="index", columns=["magid"])
        mags_df.index.name = "long_pmid"
        mags_df.sort_index(inplace=True)
        logger.info("Saving mapped PMID/MAGIDs to %s", magsfile)
        mags_df.to_csv(magsfile)
        mags_df = pd.read_csv(magsfile, index_col="long_pmid")

    return mags_df

### score_results


#### Description

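Joins the articles, indexed by MAGID, to the citation counts and disruption scores in the Wu et al. dataset, then ranks them three ways: most cited, most developmental (lowest disruption score), and most disruptive (highest disruption score). The top `TOP_PAPERS` articles in each ranking are saved to CSV and returned.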
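
A minimal sketch of the join-and-rank logic, using toy DataFrames in place of the real inputs. All titles, MAGIDs, and scores here are made up, and `top_n` stands in for `TOP_PAPERS`.

```python
import pandas as pd

# toy articles indexed by MAGID (values are made up)
articles_df = pd.DataFrame(
    {"title": ["A", "B", "C", "D"]},
    index=pd.Index([10, 20, 30, 40], name="magid"),
)

# toy stand-in for the citation/disruption table
magdata_df = pd.DataFrame(
    {
        "num_citations": [5, 50, 12, 7, 99],
        "disruption_score": [0.02, -0.01, 0.40, -0.30, 0.0],
    },
    index=pd.Index([10, 20, 30, 40, 50], name="magid"),
)

# join on the shared MAGID index; MAGID 50 is not an article, so it drops out
scored_df = articles_df.join(magdata_df).reset_index()

top_n = 2
most_cited = scored_df.nlargest(top_n, columns="num_citations")
most_developmental = scored_df.nsmallest(top_n, columns="disruption_score")
most_disruptive = scored_df.nlargest(top_n, columns="disruption_score")

print(most_cited["title"].tolist())          # ['B', 'C']
print(most_developmental["title"].tolist())  # ['D', 'B']
print(most_disruptive["title"].tolist())     # ['C', 'A']
```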

#### Parameters

| Parameter | Description |
| --- | --- |
| articles_df | A processed/augmented DataFrame created from a PubMed CSV export |
| topicdate | The stem of the PubMed CSV filename, expected format topic-date.csv |

#### Return Values

| Return Value | Description |
| --- | --- |
| cited_df | A processed/augmented DataFrame created from a PubMed CSV export and sorted by descending number of citations |
| develop_df | A processed/augmented DataFrame created from a PubMed CSV export and sorted from most developmental to least developmental |
| disrupt_df | A processed/augmented DataFrame created from a PubMed CSV export and sorted from most disruptive to least disruptive |


In [38]:
def score_results(articles_df, topicdate):
    citefile = f"{CSV_DIR}{topicdate}-topcited.csv"
    developfile = f"{CSV_DIR}{topicdate}-topdevelopmental.csv"
    disruptfile = f"{CSV_DIR}{topicdate}-topdisruptive.csv"

    # magdata_df = pd.read_csv(
    #     f"AggregatedMAG.txt",
    #     sep="\t",
    #     usecols=[0, 5, 6],
    #     names=["magid", "num_citations", "disruption_score"],
    #     index_col="magid",
    #     dtype={"num_citations": "int64", "disruption_score": "float64"},
    # )
    magdata_df = pd.read_csv(
        f"{DISRUPT_DIR}Aggregated_20210521.txt",
        sep="\t",
        usecols=[0, 4, 5],
        names=["magid", "num_citations", "disruption_score"],
        index_col="magid",
        dtype={"num_citations": "int64", "disruption_score": "float64"},
    )
    magdata_df.sort_index(inplace=True)

    scored_df = articles_df.join(magdata_df)
    scored_df.reset_index(inplace=True)

    cited_df = scored_df[
        [
            "num_citations",
            "title",
            "journal",
            "pubdate",
            "magid",
            "pmid",
            "doi",
            "pubmed",
            "ohsu_library",
            "rush_library",
        ]
    ].nlargest(TOP_PAPERS, columns="num_citations")
    development_df = scored_df[
        [
            "disruption_score",
            "title",
            "journal",
            "pubdate",
            "magid",
            "pmid",
            "doi",
            "pubmed",
            "ohsu_library",
            "rush_library",
        ]
    ].nsmallest(TOP_PAPERS, columns="disruption_score")
    disrupt_sf = scored_df[
        [
            "disruption_score",
            "title",
            "journal",
            "pubdate",
            "magid",
            "pmid",
            "doi",
            "pubmed",
            "ohsu_library",
            "rush_library",
        ]
    ].nlargest(TOP_PAPERS, columns="disruption_score")

    cited_df.to_csv(
        citefile,
        index=False,
        columns=[
            "num_citations",
            "magid",
            "pmid",
            "title",
            "journal",
            "pubdate",
            "doi",
        ],
    )
    development_df.to_csv(
        developfile,
        index=False,
        columns=[
            "disruption_score",
            "magid",
            "pmid",
            "title",
            "journal",
            "pubdate",
            "doi",
        ],
    )
    disrupt_sf.to_csv(
        disruptfile,
        index=False,
        columns=[
            "disruption_score",
            "magid",
            "pmid",
            "title",
            "journal",
            "pubdate",
            "doi",
        ],
    )

    return cited_df, development_df, disrupt_sf

### style_output


#### Description

Exports sorted scored dataframes to styled XLSX file.

#### Parameters

| Parameter | Description |
| --- | --- |
| cited_df | A processed/augmented DataFrame created from a PubMed CSV export and sorted by descending number of citations |
| develop_df | A processed/augmented DataFrame created from a PubMed CSV export and sorted from most developmental to least developmental |
| disrupt_df | A processed/augmented DataFrame created from a PubMed CSV export and sorted from most disruptive to least disruptive |
| topicdate | The stem of the PubMed CSV filename, expected format topic-date.csv |

#### Return Values

None.

In [39]:
def style_output(cited_df, develop_df, disrupt_df, topicdate):
    scorefile = f"{XLSX_DIR}{topicdate}-scores.xlsx"

    # wrap each DataFrame in a StyleFrame; the pmid column is dropped because
    # the pubmed column already carries the PMID as a hyperlink
    cited_sf = StyleFrame(cited_df.drop("pmid", axis=1))
    develop_sf = StyleFrame(develop_df.drop("pmid", axis=1))
    disrupt_sf = StyleFrame(disrupt_df.drop("pmid", axis=1))
    # query_s = pd.Series({f"{querytype} query": query})

    default_style = Styler(
        font=utils.fonts.calibri,
        font_size=11,
        border_type=utils.borders.default_grid,
        horizontal_alignment=utils.horizontal_alignments.left,
        wrap_text=False,
        shrink_to_fit=False,
    )

    header_style = Styler(
        bg_color=utils.colors.black,
        bold=True,
        font=utils.fonts.calibri,
        font_color=utils.colors.white,
        font_size=14,
        horizontal_alignment=utils.horizontal_alignments.left,
        shrink_to_fit=False,
        wrap_text=False,
        vertical_alignment=utils.vertical_alignments.center,
    )
    hyperlink_style = Styler(
        font_color=utils.colors.blue,
        protection=True,
        underline=utils.underline.single,
    )
    float_style = Styler(
        number_format="0.000000000000",
        horizontal_alignment=utils.horizontal_alignments.right,
    )

    cited_sf.set_column_width_dict(
        col_width_dict={
            ("pubdate", "magid", "pubmed"): 12,
            ("num_citations", "ohsu_library", "rush_library"): 15,
            ("journal"): 30,
            ("title", "doi"): 50,
        }
    )
    cited_sf.apply_headers_style(header_style)
    cited_sf.apply_column_style(cited_sf.columns, styler_obj=default_style)
    cited_sf.apply_column_style(
        ["doi", "pubmed", "ohsu_library", "rush_library"],
        styler_obj=Styler.combine(default_style, hyperlink_style),
    )

    develop_sf.set_column_width_dict(
        col_width_dict={
            ("pubdate", "magid", "pubmed"): 12,
            ("ohsu_library", "rush_library"): 15,
            ("disruption_score"): 20,
            ("journal"): 30,
            ("title", "doi"): 50,
        }
    )
    develop_sf.apply_headers_style(header_style)
    develop_sf.apply_column_style(develop_sf.columns, styler_obj=default_style)
    develop_sf.apply_column_style(
        "disruption_score", styler_obj=Styler.combine(default_style, float_style)
    )
    develop_sf.apply_column_style(
        ["doi", "pubmed", "ohsu_library", "rush_library"],
        styler_obj=Styler.combine(default_style, hyperlink_style),
    )

    disrupt_sf.set_column_width_dict(
        col_width_dict={
            ("pubdate", "magid", "pubmed"): 12,
            ("ohsu_library", "rush_library"): 15,
            ("disruption_score"): 20,
            ("journal"): 30,
            ("title", "doi"): 50,
        }
    )
    disrupt_sf.apply_headers_style(header_style)
    disrupt_sf.apply_column_style(disrupt_sf.columns, styler_obj=default_style)
    disrupt_sf.apply_column_style(
        "disruption_score", styler_obj=Styler.combine(default_style, float_style)
    )
    disrupt_sf.apply_column_style(
        ["doi", "pubmed", "ohsu_library", "rush_library"],
        styler_obj=Styler.combine(default_style, hyperlink_style),
    )

    with pd.ExcelWriter(scorefile) as sfile:
        cited_sf.to_excel(
            sfile,
            index=False,
            columns=[
                "num_citations",
                "title",
                "journal",
                "pubdate",
                "magid",
                "doi",
                "pubmed",
                "ohsu_library",
                "rush_library",
            ],
            sheet_name=f"top {TOP_PAPERS} cited",
        )
        develop_sf.to_excel(
            sfile,
            index=False,
            columns=[
                "disruption_score",
                "title",
                "journal",
                "pubdate",
                "magid",
                "doi",
                "pubmed",
                "ohsu_library",
                "rush_library",
            ],
            sheet_name=f"top {TOP_PAPERS} developmental",
        )
        disrupt_sf.to_excel(
            sfile,
            index=False,
            columns=[
                "disruption_score",
                "title",
                "journal",
                "pubdate",
                "magid",
                "doi",
                "pubmed",
                "ohsu_library",
                "rush_library",
            ],
            sheet_name=f"top {TOP_PAPERS} disruptive",
        )
        # query_s.to_excel(sfile, index=False, header=False, sheet_name="pubmed query")

## Read and augment PubMed CSVs for MAGID mapping and citation/disruption scoring
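
The cell in this section builds Excel `HYPERLINK` formulas as plain strings so that links stay clickable in the XLSX output. The sketch below shows one way to wrap that pattern in a helper. The name `excel_hyperlink` is hypothetical, and unlike the notebook's PubMed column, which passes the PMID as a bare number, this version always quotes the label as text.

```python
def excel_hyperlink(url: str, label: str = "") -> str:
    """Return an Excel HYPERLINK formula string.

    Double quotes inside the URL or label are doubled, which is how
    Excel escapes quotes within a string literal.
    """
    def quote(text: str) -> str:
        return '"' + text.replace('"', '""') + '"'

    if label:
        return f"=HYPERLINK({quote(url)}, {quote(label)})"
    return f"=HYPERLINK({quote(url)})"


pmid = 12345678  # hypothetical PMID, for illustration only
print(excel_hyperlink(f"https://pubmed.ncbi.nlm.nih.gov/{pmid}", str(pmid)))
# =HYPERLINK("https://pubmed.ncbi.nlm.nih.gov/12345678", "12345678")
```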

In [40]:
pubmedCSVs = sorted(glob.glob(RESULTS_DIR))

for pubmedCSV in pubmedCSVs:
    topicdate = Path(pubmedCSV).stem

    logger.info("Reading %s search results from PubMed CSV", topicdate)
    
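    # PubMed exports start with a header row; header=0 discards it in favor
    # of the shorter column names below
    pmresults_df = pd.read_csv(
        pubmedCSV,
        header=0,
        usecols=[0, 1, 5, 6, 10],
        names=["pmid", "title", "journal", "pubdate", "doi"],
    )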

    # add columns of helpful URLs
    # hyperlinked URLs make publications more accessible in the XLSX output

    # DOI link, blank if no DOI
    pmresults_df["doi"] = [
        f'=HYPERLINK("https://doi.org/{doi}")' if pd.notna(doi) else ""
        for doi in pmresults_df["doi"]
    ]

    # Link to PubMed record
    pmresults_df["pubmed"] = [
        f'=HYPERLINK("https://pubmed.ncbi.nlm.nih.gov/{pmid}", {pmid})'
        for pmid in pmresults_df["pmid"]
    ]

    # Link to OHSU Library Catalog lookup for full-text access
    pmresults_df["ohsu_library"] = [
        f'=HYPERLINK("https://librarysearch.ohsu.edu/openurl/OHSU/OHSU?sid=Entrez:PubMed&id=pmid:{pmid}", "Find @ OHSU")'
        for pmid in pmresults_df["pmid"]
    ]

    # Link to Rush University Library Catalog lookup for full-text access
    pmresults_df["rush_library"] = [
        f'=HYPERLINK("https://i-share-rsh.primo.exlibrisgroup.com/openurl/01CARLI_RSH/01CARLI_RSH:CARLI_RSH?sid=Entrez:PubMed&id=pmid:{pmid}", "Find @ Rush")'
        for pmid in pmresults_df["pmid"]
    ]

    # "Long PMID" for MAGID mapping in OpenAlex
    pmresults_df["long_pmid"] = [
        f"https://pubmed.ncbi.nlm.nih.gov/{pmid}" for pmid in pmresults_df["pmid"]
    ]

    pmresults_df.set_index("long_pmid", inplace=True)
    pmresults_df.sort_index(inplace=True)
    pmresults_df.to_csv(f"{CSV_DIR}{topicdate}.csv")
    pmresults_df = pd.read_csv(f"{CSV_DIR}{topicdate}.csv", index_col="long_pmid")

    logger.info("Mapping PMIDs to MAGIDs")
    mags_df = pmid2magid(pmresults_df, topicdate)

    logger.info("Joining PubMed search results with MAGID mapping")
    articles_df = mags_df.join(pmresults_df).set_index("magid").sort_index()
    articles_df.index = articles_df.index.astype("int64")

    logger.info("Scoring articles")
    articles_cited_df, articles_develop_df, articles_disrupt_df = score_results(articles_df, topicdate)

    logger.info("Styling output XLSX")
    style_output(articles_cited_df, articles_develop_df, articles_disrupt_df, topicdate)
