In [1]:
from scripts.excel_imports import append_sheets_with_organ_column, describe_df
from scripts.get_full_text_epmc import check_full_texts, get_full_text
from scripts.get_abstract_epmc import get_abstract
from scripts.pubmed_id_api import id_convert, get_identifier_type, id_set
import numpy as np
import pandas as pd

# 1. Exploration

This notebook explores the supplementary materials of Kumar _et al._, 2023 ([Nanoparticle biodistribution coefficients: A quantitative approach for understanding the tissue distribution of nanoparticles](https://doi.org/10.1016/j.addr.2023.114708)) and prepares a subset containing only the open-access articles. The subset is used to assess LLM information retrieval / question answering with [OpenAI models](01_nanodistribution-openai-qa.ipynb) and [BioBERT](01_nanodistribution-biobert-qa.ipynb).


## 1.1 Downloading the review data

The curated Excel file from Kumar _et al._, 2023 is available as supplementary material to the article: https://ars.els-cdn.com/content/image/1-s2.0-S0169409X23000236-mmc1.xlsx

The file is downloaded and explored before defining the curation strategy. The script used to concatenate the different spreadsheets in the file can be found under [excel_imports](scripts/excel_imports.py).
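To illustrate the concatenation step, here is a minimal sketch of how sheets can be stacked with an added `Organ` column. It is not the code in `scripts/excel_imports.py`: the function name `concat_sheets_with_organ` is hypothetical, the sketch assumes one sheet per organ, and the numbers in the example are made up. In practice the input dictionary could come from `pd.read_excel(file_path, sheet_name=None)`, which returns one DataFrame per sheet.

```python
import pandas as pd


def concat_sheets_with_organ(sheets: dict[str, pd.DataFrame]) -> pd.DataFrame:
    """Stack per-sheet tables and record each sheet name in an 'Organ' column.

    Hypothetical helper: assumes every sheet holds the data for one organ.
    """
    frames = []
    for sheet_name, sheet_df in sheets.items():
        frame = sheet_df.copy()
        frame["Organ"] = sheet_name.strip()
        frames.append(frame)
    # ignore_index gives the combined table a fresh 0..n-1 index
    return pd.concat(frames, ignore_index=True)


# In-memory stand-in for pd.read_excel(path, sheet_name=None); values are illustrative only
example_sheets = {
    "Liver": pd.DataFrame({"ID": [1, 2], "perc_ID_g": [12.5, 30.1]}),
    "Spleen": pd.DataFrame({"ID": [1], "perc_ID_g": [8.2]}),
}
combined = concat_sheets_with_organ(example_sheets)
print(combined)
```

Keeping the sheet name as a column means the combined table can still be grouped or filtered per organ after concatenation.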

In [2]:
%%bash

if [ ! -f "../data/perc_id_g_organ.xlsx" ]; then
    curl https://ars.els-cdn.com/content/image/1-s2.0-S0169409X23000236-mmc1.xlsx --output ../data/perc_id_g_organ.xlsx
    echo "xlsx data successfully downloaded and saved to ../data/perc_id_g_organ.xlsx."
else
    echo "Data (../data/perc_id_g_organ.xlsx) already exists"
fi


Data (../data/perc_id_g_organ.xlsx) already exists


In [3]:
file_path = '../data/perc_id_g_organ.xlsx'
df = append_sheets_with_organ_column(file_path)
df.describe(include='object')


Column 'ID': 0 NaN values; datatype: int64
Column 'Time_h': 0 NaN values; datatype: float64
Column 'perc_ID_g': 17 NaN values; datatype: float64
Column 'Species': 0 NaN values; datatype: object
Column 'Age/weight': 994 NaN values; datatype: float64
Column 'Strain': 0 NaN values; datatype: object
Column 'Organ': 0 NaN values; datatype: object
Column 'Size_nm': 12 NaN values; datatype: float64
Column 'Analysis method': 0 NaN values; datatype: object
Column 'NP_Type': 0 NaN values; datatype: object
Column 'NP_Shape': 0 NaN values; datatype: object
Column 'Ligand': 360 NaN values; datatype: object
Column 'Charge': 2576 NaN values; datatype: object
Column 'PEG cover': 56 NaN values; datatype: object
Column 'PMID': 0 NaN values; datatype: object
Column 'Name': 3720 NaN values; datatype: object
Column 'Charge ': 5651 NaN values; datatype: object

Number of rows with at least one NaN value: 5703 of 5703

Organs: Tail, Liver, Skin, Tumor, Intestine, Lung, Brain, Pancreas, Heart, Plasma, Kidney,

Unnamed: 0,Species,Strain,Organ,Analysis method,NP_Type,NP_Shape,Ligand,Charge,PEG cover,PMID,Name,Charge.1
count,5703,5703,5703,5703,5703,5703,5703,3127,5703,5703,1983,52
unique,1,17,16,29,9,18,25,9,25,116,235,4
top,Mouse,Balb/c mice,Blood,PET,Gold,Nanoparticle,5000,Neutral,5000,17962085,SPIO,Neutral
freq,5703,2357,809,962,1986,4194,2234,1731,2234,220,36,28


The provided 'PMIDs' (`df['PMID']`) are not all PubMed identifiers. In addition, requesting a full text from Europe PMC requires the PMC identifier (PMCID), when one exists.
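As an illustration of telling the identifier types apart, the sketch below classifies a value by its shape using regular expressions. It is not the `get_identifier_type` function from the project scripts: the name `classify_identifier`, the patterns, and the `unknown` label are assumptions made for this example.

```python
import re

# Assumed patterns: PMCIDs start with "PMC", PMIDs are plain digits, DOIs start with "10."
PMCID_RE = re.compile(r"^PMC\d+$", re.IGNORECASE)
PMID_RE = re.compile(r"^\d{1,8}$")
DOI_RE = re.compile(r"^10\.\d{4,9}/\S+$")


def classify_identifier(value) -> str:
    """Return 'pmcid', 'pmid', 'doi' or 'unknown' for a raw identifier value."""
    text = str(value).strip()
    if PMCID_RE.match(text):
        return "pmcid"
    if PMID_RE.match(text):
        return "pmid"
    if DOI_RE.match(text):
        return "doi"
    return "unknown"


# Identifiers taken from the table shown later in this notebook
for raw in ["24495038", "PMC3985880", "10.1021/ja412001e"]:
    print(raw, classify_identifier(raw))
```

Converting the value with `str()` first means integer PMIDs read from the spreadsheet are handled the same way as string ones.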

The code used to convert PMIDs to DOIs and PMCIDs is in the [pubmed_id_api](scripts/pubmed_id_api.py) script ([ID Converter API documentation](https://www.ncbi.nlm.nih.gov/pmc/tools/id-converter-api/)). The Europe PMC API is then used to check which PMCIDs have an associated full text ([script](scripts/get_full_text_epmc.py), [documentation](https://europepmc.org/RestfulWebService#!/Europe32PMC32Articles32RESTful32API/fullTextXML)).

In [4]:
df.rename(columns={'PMID': 'provided_identifier'}, inplace=True)
df['provided_identifier_type'] = df['provided_identifier'].apply(get_identifier_type)
df = id_set(df)

In [5]:
# Unique PMCIDs only; rows without a PMCID are skipped
pmcids = list({i for i in df['pmcid'] if 'PMC' in str(i)})

The full texts are requested to check that they are actually available.
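A minimal sketch of such an availability check is given here, using only the standard library. It is not the `check_full_texts` implementation from the project scripts: the names `full_text_url` and `has_full_text`, the `opener` parameter, and the timeout are assumptions for illustration. The URL pattern is the one that appears in the request log of this notebook.

```python
from urllib.error import HTTPError, URLError
from urllib.request import urlopen

EPMC_REST = "https://www.ebi.ac.uk/europepmc/webservices/rest"


def full_text_url(pmcid: str) -> str:
    """Build the Europe PMC fullTextXML URL for one PMCID."""
    return f"{EPMC_REST}/{pmcid}/fullTextXML"


def has_full_text(pmcid: str, opener=urlopen, timeout: float = 30.0) -> bool:
    """Return True if the full-text request succeeds with status 200.

    `opener` is injectable (hypothetical design choice) so the check can be
    exercised without network access.
    """
    try:
        with opener(full_text_url(pmcid), timeout=timeout) as response:
            return response.status == 200
    except (HTTPError, URLError):
        # urlopen raises HTTPError for responses such as 404
        return False


print(full_text_url("PMC3425121"))
```

With this helper, the list of available articles could be built as `[p for p in pmcids if has_full_text(p)]`, at the cost of one request per PMCID.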

In [6]:
available = check_full_texts(pmcids)
print(f'{len(available)} available open access texts')

Failed to retrieve data for PMC4127427. Response code: 404 / url: https://www.ebi.ac.uk/europepmc/webservices/rest/PMC4127427/fullTextXML
PMC4207078
Failed to retrieve data for PMC3086380. Response code: 404 / url: https://www.ebi.ac.uk/europepmc/webservices/rest/PMC3086380/fullTextXML
Failed to retrieve data for PMC3292876. Response code: 404 / url: https://www.ebi.ac.uk/europepmc/webservices/rest/PMC3292876/fullTextXML
Failed to retrieve data for PMC4437573. Response code: 404 / url: https://www.ebi.ac.uk/europepmc/webservices/rest/PMC4437573/fullTextXML
Failed to retrieve data for PMC3314116. Response code: 404 / url: https://www.ebi.ac.uk/europepmc/webservices/rest/PMC3314116/fullTextXML
PMC5102673
Failed to retrieve data for PMC3404261. Response code: 404 / url: https://www.ebi.ac.uk/europepmc/webservices/rest/PMC3404261/fullTextXML
Failed to retrieve data for PMC6854296. Response code: 404 / url: https://www.ebi.ac.uk/europepmc/webservices/rest/PMC6854296/fullTextXML
PMC3425121
F

Subset the table for the open access articles only:

In [7]:
# .copy() so that columns added later are written to an independent DataFrame
df = df[df['pmcid'].isin(available)].copy()
describe_df(df)

Column 'ID': 0 NaN values; datatype: int64
Column 'Time_h': 0 NaN values; datatype: float64
Column 'perc_ID_g': 0 NaN values; datatype: float64
Column 'Species': 0 NaN values; datatype: object
Column 'Age/weight': 28 NaN values; datatype: float64
Column 'Strain': 0 NaN values; datatype: object
Column 'Organ': 0 NaN values; datatype: object
Column 'Size_nm': 0 NaN values; datatype: float64
Column 'Analysis method': 0 NaN values; datatype: object
Column 'NP_Type': 0 NaN values; datatype: object
Column 'NP_Shape': 0 NaN values; datatype: object
Column 'Ligand': 0 NaN values; datatype: object
Column 'Charge': 80 NaN values; datatype: object
Column 'PEG cover': 0 NaN values; datatype: object
Column 'provided_identifier': 0 NaN values; datatype: object
Column 'Name': 103 NaN values; datatype: object
Column 'Charge ': 168 NaN values; datatype: object
Column 'provided_identifier_type': 0 NaN values; datatype: object
Column 'pmcid': 0 NaN values; datatype: object
Column 'doi': 0 NaN values; dat

Get full texts and abstracts:

In [8]:
seen = set()  # PMCIDs already requested, so each article is fetched only once
abs_count = 0
text_count = 0
total = len(np.unique(df['pmcid']))
for index, row in df.iterrows():
    pmcid = row['pmcid']
    if pd.notnull(pmcid) and pmcid not in seen:
        seen.add(pmcid)
        abstract = get_abstract(pmcid)
        text = get_full_text(pmcid)
        # The texts are stored on the first row of each article only
        df.at[index, 'abstract'] = abstract
        abs_count += 1
        df.at[index, 'full_text'] = text
        text_count += 1
print(f'{abs_count} abstracts and {text_count} full texts retrieved out of {total} different journal articles.')


J Am Chem SocJ. Am. Chem. SocjajacsatJournal of the American Chemical Society0002-78631520-5126American Chemical Society398588010.1021/ja412001eArticleConstruction and Validation of Nano Gold Tripods for Molecular Imaging of Living SubjectsChengKaiKothapalliSri-RajasekharLiuHongguangKohAi LeenJokerstJesse V.JiangHanYangMengLiJinboLeviJelenaWuJoseph C.GambhirSanjiv S.ChengZhen*Molecular Imaging Program at Stanford (MIPS), Canary Center at Stanford for Cancer Early Detection, Department of Radiology and Bio-X Program, School of Medicine, Stanford Nanocharacterization Laboratory, Stanford University, 1201 Welch Road, Lucas P095, Stanford, California 94305-5484, United Stateszcheng@stanford.edu04022015040220140503201413693560357101122013Copyright 2014 American Chemical Society2014American Chemical SocietyAnisotropic colloidal hybrid nanoparticles exhibit superior optical and physical properties compared to their counterparts with regular architectures. We herein developed a controlled, ste

In [9]:
df

Unnamed: 0,ID,Time_h,perc_ID_g,Species,Age/weight,Strain,Organ,Size_nm,Analysis method,NP_Type,...,Charge,PEG cover,provided_identifier,Name,Charge.1,provided_identifier_type,pmcid,doi,abstract,full_text
30,7,1.0,9.716981,Mouse,21.4,Nude mice,Spleen,10.0,PET,Gold,...,,3400,24495038,Gold tripods <20nm,,pmid,PMC3985880,10.1021/ja412001e,Anisotropic colloidal hybrid nanoparticles exh...,J Am Chem SocJ. Am. Chem. SocjajacsatJournal o...
34,8,1.0,2.489796,Mouse,20.0,Balb/c mice,Spleen,56.8,198Au,Gold,...,,5000,24766522,Gold nanospheres 56.8nm,,pmid,PMC4358630,10.1021/nn406258m,"With Au nanocages as an example, we recently d...",ACS NanoACS Nanonnancac3ACS Nano1936-08511936-...
59,16,4.0,9.022310,Mouse,,SCID mice,Spleen,42.5,ICP-MS,Gold,...,,5000,21711861,43 nm AuNP-PEG5000 (Kennedy et al. 2011),,pmid,PMC3211348,10.1186/1556-276X-6-283,Gold nanoparticle-mediated photothermal therap...,Nanoscale Res LettNanoscale Research Letters19...
176,50,72.0,91.914894,Mouse,32.0,Male ddY mice,Spleen,11.0,ICP-MS,Gold,...,Neutral,5000,23050635,PEG-modified gold nanorods AP 5.0 10.6*49.6,,pmid,PMC3492114,10.1186/1556-276X-7-565,Gold nanorods that have an absorption band in ...,Nanoscale Res LettNanoscale Res LettNanoscale ...
185,56,5.0,1.993007,Mouse,18.0,Nude mice,Spleen,10.0,64Cu,Gold,...,Positive,5000,22916075,AuNR - DOX 45*10nm,,pmid,PMC3425121,10.7150/thno.4756,A multifunctional gold nanorod (GNR)-based nan...,TheranosticsTheranosticsthnoTheranostics1838-7...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
338,125,1.0,5.090000,Mouse,19.1,Balb/c mice,Kidney,92.1,111In,Lipid,...,Neutral,2000,23226020,111In-labeledlow RLP 92.1nm -1.8mV - 6.1mV,,pmid,PMC4550540,10.1021/acsnano.5b00526,Traditional chelator-based radio-labeled nanop...,ACS NanoACS Nanonnancac3ACS Nano1936-08511936-...
345,129,1.0,6.900000,Mouse,,Nude mice,Kidney,100.0,64Cu,Lipid,...,Negative,2000,26646780,PEGylated 64Cu-liposomes 5mol %,,pmid,PMC3625170,10.1371/journal.pone.0061346,This study aimed to evaluate the acute toxicit...,PLoS OnePLoS ONEplosplosonePLoS ONE1932-6203Pu...
402,149,24.0,1.436782,Mouse,20.0,Nude mice,Kidney,70.7,ICP-MS,Iron Oxide,...,Negative,0,22100983,SPIO 70.72nm,,pmid,PMC5197067,10.7150/thno.18078,Minimizing the sequestration of nanomaterials ...,TheranosticsTheranosticsthnoTheranostics1838-7...
419,155,3.0,8.979592,Mouse,22.5,Nude mice,Kidney,6.0,SPECT/CT,Iron Oxide,...,,3400,26353592,99mTc-USPION-RAD,,pmid,PMC3512544,10.2147/IJN.S36847,<h4>Purpose</h4>Liposomes have been proposed t...,Int J NanomedicineInt J NanomedicineInternatio...


In [10]:
df.to_csv("../data/subset_distribution_nm.csv", index=False)