In [1]:
import openai
import time
import os
from scripts.get_abstract_epmc import get_abstract
from scripts.get_full_text_epmc import get_full_text
import pandas as pd
import numpy as np
import re
import json

# LLMs and information retrieval

This notebook uses the OpenAI API to assess how well the GPT-3.5-turbo model extracts assay-specific data from the abstracts of the open access journal articles sourced in [Nanoparticle biodistribution coefficients: A quantitative approach for understanding the tissue distribution of nanoparticles](https://doi.org/10.1016/j.addr.2023.114708) (continues [exploration-clean](00_exploration-manual-ir.ipynb)).

We first curated data manually from the available open access full texts (on top of the data already curated in the aforementioned review), and then compared them to the answers given by the OpenAI chat model GPT-3.5-turbo when asked to process either the abstract or a synthesized version of the full text.

One big caveat is the lack of automation: I cannot think of a way of making the language model stick to a specific lexicon.
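
One possible workaround for the lexicon problem is to normalize the answers after the fact rather than constrain the model. The sketch below is only an illustration under that assumption: `SPECIES_LEXICON` and `normalize_term` are hypothetical names that are not part of this notebook, and the variant lists (including treating "murine" as mouse) would need to be curated by hand.

```python
import re

# Hypothetical controlled vocabulary: canonical term -> accepted free-text variants
SPECIES_LEXICON = {
    "mouse": ["mouse", "mice", "murine"],
    "rat": ["rat", "rats"],
}


def normalize_term(answer, lexicon, default="unmapped"):
    """Map a free-text model answer to a canonical lexicon term."""
    # Compare whole lowercase words so that "rat" does not match inside "strategy"
    words = set(re.findall(r"[a-z0-9/]+", answer.lower()))
    for canonical, variants in lexicon.items():
        if words & set(variants):
            return canonical
    return default


for raw in [" murine", " mice", " not specified"]:
    print(repr(raw), "->", normalize_term(raw, SPECIES_LEXICON))
```

Answers that fall outside the lexicon are flagged as `unmapped` so they can be reviewed manually instead of being silently coerced.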

## 0. Set up

The materials for this study are:
-   The curated Excel spreadsheet from Kumar _et al._ as cleaned and wrangled in the [exploration-clean notebook](00_exploration-manual-ir.ipynb), to which we added more manually curated information.
-   OpenAI API key
-   EuropePMC's article RESTful API ([documentation](https://europepmc.org/RestfulWebService), [scripts](scripts/))


### 0.1 Importing the subset data

In [2]:
df = pd.read_csv("../data/subset_distribution_nm.csv")


### 0.2 Set up the OpenAI text completion API

The API key is read from a local file:

In [3]:
with open('resources/openAI_key.txt', 'r') as f:
    api_key = f.read().strip()
os.environ['OPENAI_API_KEY'] = api_key
openai.api_key = os.getenv("OPENAI_API_KEY")
# model
GPT_MODEL = "gpt-3.5-turbo"
tokens = 0 #TODO add up all used tokens * pricing
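
The TODO above (token count times pricing) could be covered by a small helper such as the one below. This is a minimal sketch: `estimate_cost` is a hypothetical name, and the price constant is a placeholder assumption that should be replaced with the value from the current OpenAI pricing page.

```python
# Placeholder price in USD per 1,000 tokens (assumption, not an official figure)
PRICE_PER_1K_TOKENS_USD = 0.002


def estimate_cost(total_tokens, price_per_1k=PRICE_PER_1K_TOKENS_USD):
    """Return the estimated cost in USD for a running token total."""
    return total_tokens / 1000 * price_per_1k


print(f"Estimated cost: {estimate_cost(150_000):.2f} USD")
```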

## Test

The abstract used for the test belongs to [PMC5102673](https://doi.org/10.1002/advs.201600122). The abstract is:

In [4]:
text_test = df.iloc[31]['abstract']
id_test = df.iloc[31]['pmcid']
doi_test = df.iloc[31]['doi']
print(id_test, " ", doi_test)
print(text_test)


PMC5102673   10.1002/advs.201600122
A systematic study of in vitro and in vivo behavior of biodegradable mesoporous silica nanoparticles (bMSNs), designed to carry multiple cargos (both small and macromolecular drugs) and subsequently self-destruct following release of their payloads, is presented. Complete degradation of bMSNs is seen within 21 d of incubation in simulated body fluid. The as-synthesized bMSNs are intrinsically radiolabeled with oxophilic zirconium-89 (<sup>89</sup>Zr, <i>t</i><sub>1/2</sub> = 78.4 h) radionuclide to track their in vivo pharmacokinetics via positron emission tomography imaging. Rapid and persistent CD105 specific tumor vasculature targeting is successfully demonstrated in murine model of metastatic breast cancer by using TRC105 (an anti-CD105 antibody)-conjugated bMSNs. This study serves to illustrate a simple, versatile, and readily tunable approach to potentially overcome the current challenges facing nanomedicine and further the goals of personalize

Setting up the query and function used for the API call:

In [5]:
query = """Scan the following scientific article {} describing the use of animal models to investigate nanomaterial or nanoparticle biodistribution in organs to fill up the braces for each value (use 3 words max for each)

{}:
\"\"\"
{}
\"\"\"

Your response:

1. assessed nanomaterial (exclude ligand): []
2. species of the animal model used for the in vivo assay: []
3: Strain of the animal model used for the in vivo assay: []
4. labelling used for the in vivo assay: []
5. analysis method used for measurements for the in vivo assay: []
6. Time points included in the in vivo assay (h): []
7: Organs analyzed: []
8: Nanomaterial shape: []
9: Nanomaterial type (lipid, silica, graphene, gold, metal oxide...): []
10: Nanomaterial size (nm): []
11: Ligand used for imaging: []
"""


def query_chat_openai(text_type, query, text, model, temperature, token_bag=0):
    """Send the prompt to the chat model and parse its numbered answer.

    Returns a dict with the parsed 'values', the raw 'response' and the
    'tokens' used by this call only, so that callers can add them to a
    running total. `token_bag` is kept for compatibility with existing
    calls and is not added to the returned count (adding it made callers
    count earlier tokens twice).
    """
    # The main prompt has three placeholders; the chunk prompt has one
    if text_type in ['abstract', 'full_text']:
        query = query.format(text_type, text_type, text)
    elif text_type == 'chunk':
        query = query.format(text)
    messages = [
        {'role': 'system', 'content': 'You answer questions about the abstract of a journal article.'},
        {'role': 'user', 'content': query},
    ]
    retries = 5
    wait_time = 70  # seconds to wait after a rate limit error

    while retries > 0:
        try:
            response = openai.ChatCompletion.create(
                messages=messages,
                model=model,
                temperature=temperature
            )

            response_text = response['choices'][0]['message']['content']
            used_tokens = int(response['usage']['total_tokens'])
            result = {'values': {}, 'response': response_text, 'tokens': used_tokens}

            # Parse one "key: value" pair per answer line
            for line in response_text.strip().split("\n"):
                if ":" in line:
                    stripped = line.strip()
                    # Key: text before the first ": "; value: text after the last ":"
                    # (values keep their leading space)
                    key = stripped.split(": ")[0]
                    value = stripped.split(":")[-1]
                    result['values'][key] = value

            return result

        except openai.error.RateLimitError:
            print('RateLimitError: Too many requests. Retrying after {} seconds...'.format(wait_time))
            time.sleep(wait_time)
            retries -= 1

    raise Exception('Max retries exceeded. Request could not be completed.')


test_abstract = query_chat_openai('abstract', query, text_test, model=GPT_MODEL, temperature=0, token_bag=tokens)
tokens += int(test_abstract['tokens'])
print(json.dumps(test_abstract['values'], indent = 4))

{
    "1. assessed nanomaterial (exclude ligand)": " biodegradable mesoporous silica nanoparticles (bMSNs)",
    "2. species of the animal model used for the in vivo assay": " murine",
    "3": " not specified",
    "4. labelling used for the in vivo assay": " oxophilic zirconium-89 (<sup>89</sup>Zr)",
    "5. analysis method used for measurements for the in vivo assay": " positron emission tomography imaging",
    "6. Time points included in the in vivo assay (h)": " not specified, but complete degradation of bMSNs is seen within 21 days",
    "7": " not specified",
    "8": " not specified",
    "9": " silica",
    "10": " not specified",
    "11": " TRC105 (an anti-CD105 antibody)"
}
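
Some keys in the output above collapse to a bare number (`"3"`, `"7"`, ...) because the prompt mixes `1.` and `3:` numbering. A parser keyed on the leading item number sidesteps this. The sketch below is one possible approach, not the parser used in this notebook: `FIELDS`, `ITEM_RE` and `parse_numbered_answer` are hypothetical names, and the field names are the column stems used later in the notebook, in prompt order.

```python
import re

# Column stems in prompt order (items 1 to 11)
FIELDS = ["nm", "species", "strain", "label", "analysis_method", "time_points",
          "organ", "shape", "type", "size", "ligand"]

# Matches both "3. label: value" and "3: label: value"
ITEM_RE = re.compile(r"^\s*(\d+)\s*[.:]")


def parse_numbered_answer(response_text):
    """Return {field: value} keyed by the item number that starts each line."""
    parsed = {}
    for line in response_text.splitlines():
        match = ITEM_RE.match(line)
        if not match:
            continue
        number = int(match.group(1))
        if 1 <= number <= len(FIELDS):
            # Keep the text after the last colon, as the notebook's parser does
            parsed[FIELDS[number - 1]] = line.rsplit(":", 1)[-1].strip()
    return parsed


sample = (
    "1. assessed nanomaterial (exclude ligand): bMSNs\n"
    "3: Strain of the animal model used for the in vivo assay: Balb/c\n"
    "10: Nanomaterial size (nm): 162.9 nm"
)
print(parse_numbered_answer(sample))
```

Keying on the number also makes the mapping robust to answers that skip an item, which a purely positional mapping is not.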


Now performing the request for all available open access abstracts:

In [19]:
# Output columns, in the same order as the numbered items in the prompt
ABS_COLUMNS = ['nm_abs', 'species_abs', 'strain_abs', 'label_abs', 'analysis_method_abs',
               'time_points_abs', 'organ_abs', 'shape_abs', 'type_abs', 'size_abs', 'ligand_abs']

seen = set()
for index, row in df.iterrows():
    pmcid = row['pmcid']
    # Query each article once, skipping rows without an abstract
    if pd.isnull(row['abstract']) or pmcid in seen:
        continue
    seen.add(pmcid)
    condition = df['pmcid'] == pmcid
    query_chat = query_chat_openai('abstract', query, row['abstract'], GPT_MODEL, 0, tokens)
    tokens += int(query_chat['tokens'])
    # Map the answers back to their columns by position
    for column, value in zip(ABS_COLUMNS, query_chat['values'].values()):
        df.loc[condition, column] = value

Overview of results:

Using the NM type as a benchmark for accuracy:

In [20]:
# One row per article (first row with an abstract): curated name next to the model answer
df_comparison = (
    df[df['abstract'].notnull()]
    .drop_duplicates(subset='pmcid')
    .loc[:, ['pmcid', 'Name', 'nm_abs']]
    .rename(columns={'Name': 'curated_nm', 'nm_abs': 'abs_nm'})
    .reset_index(drop=True)
)
df_comparison


| | pmcid | curated_nm | abs_nm |
|---|---|---|---|
| 0 | PMC3985880 | Gold tripods <20nm | Au-tripods |
| 1 | PMC4358630 | Gold nanospheres 56.8nm | Au nanostructures |
| 2 | PMC3211348 | 43 nm AuNP-PEG5000 (Kennedy et al. 2011) | gold nanoparticles |
| 3 | PMC3492114 | PEG-modified gold nanorods AP 5.0 10.6*49.6 | gold nanorods |
| 4 | PMC3425121 | AuNR - DOX 45*10nm | gold nanorod |
| 5 | PMC5039679 | GNR P1 2*10 | AuNR |
| 6 | PMC4207078 | 125I-NGO | nanoscale graphene oxide (NGO) |
| 7 | PMC4262629 | 64Cu-NOTA-MSN-PEG-VEGF121 | MSNs |
| 8 | PMC4218929 | r 64Cu-MSN-800CW-TRC105(Fab). 80nm | MSN |
| 9 | PMC4038837 | 4Cu-HMSN-ZW800-TRC105 150nm | HMSN |
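
The comparison above is judged by eye. A crude first pass at an automatic score could rely on string similarity, as in the sketch below. This is only an assumption about how agreement might be quantified: `similarity` is a hypothetical helper built on the standard library `difflib.SequenceMatcher`, and it measures character overlap, so it will underrate correct answers that use a synonym (for example "Au" versus "gold").

```python
from difflib import SequenceMatcher


def similarity(curated, extracted):
    """Return a 0-1 character-level similarity between two answers."""
    a = str(curated).strip().lower()
    b = str(extracted).strip().lower()
    return SequenceMatcher(None, a, b).ratio()


# Two pairs taken from the table above
pairs = [
    ("Gold tripods <20nm", "Au-tripods"),
    ("125I-NGO", "nanoscale graphene oxide (NGO)"),
]
for curated, extracted in pairs:
    print(f"{curated!r} vs {extracted!r}: {similarity(curated, extracted):.2f}")
```

Low scores are best treated as a prompt for manual review rather than as errors.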


## Full texts

### Test
Same journal article as before

In [8]:
test_text = df.iloc[47]['full_text']
test_text

"Adv Sci (Weinh)Adv Sci (Weinh)10.1002/(ISSN)2198-3844ADVSAdvanced Science2198-3844John Wiley and Sons Inc.Hoboken510267310.1002/advs.201600122ADVS163Full PaperFull PapersEngineering Intrinsically Zirconium89 Radiolabeled SelfDestructing Mesoporous Silica Nanostructures for In Vivo Biodistribution and Tumor Targeting StudiesGoelShreya 1 ChenFeng 2 LuanShijie 3 ValdovinosHector F. 4 ShiSixiang 1 GravesStephen A. 4 AiFanrong 2 BarnhartTodd E. 4 TheuerCharles P. 5 CaiWeiboWCai@uwhealth.org 1 2 4 6 1Materials Science ProgramUniversity of WisconsinMadisonMadisonWI53705USA2Department of RadiologyUniversity of WisconsinMadisonMadisonWI53705USA3School of PharmacyTemple UniversityPhiladelphiaPA19140USA4Department of Medical PhysicsUniversity of WisconsinMadisonMadisonWI53705USA5TRACON Pharmaceuticals IncSan DiegoCA92122USA6University of Wisconsin Carbone Cancer CentreMadisonWI53705USA*Email: WCai@uwhealth.org275201611201631110.1002/advs.v3.11160012230320161942016 Copyright 2016 WILEYVCH Verlag 

The full text exceeds the model's [token limit](https://platform.openai.com/docs/models/gpt-4). To get around it, we set up a function that splits the full text into chunks and asks the model to synthesize each one, generating a shortened version of the article.

In [24]:
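def ftext_tochunks(text, max_tokens=2700, chars_per_token=0.75):
    # Slice the text into fixed-size character chunks.
    # Note: 0.75 is the usual words-per-token estimate; used here as characters
    # per token it gives conservative chunks (2025 characters by default).
    chunk_size = int(chars_per_token * max_tokens)
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


def long_query_answer(query, text, token_bag, model=GPT_MODEL, temperature=0, max_tokens=2700):
    chunks = ftext_tochunks(text, max_tokens=max_tokens)
    print(f'Text divided into {len(chunks)} chunks')
    # Word budget per chunk, so that the joined summaries stay below max_tokens
    words = int(max_tokens / len(chunks))
    print(f'Synthesizing into chunks of less than {words} words')
    query_synthesize = 'Synthesize this excerpt of a scientific journal article about nanomaterial biodistribution into a very schematic version of less than {} words which will be used to extract information about the article. It is crucial to retain all information and parameters in the experiment:\n\"\"\"\n{}\n\"\"\"'
    # Fill in the word budget and leave one placeholder for the chunk text
    query_synthesize = query_synthesize.format(words, '{}')
    synthesized = ''
    used_tokens = 0
    for i, chunk in enumerate(chunks, start=1):
        print(f'Synthesizing chunk #{i}')
        chunk_query = query_chat_openai(query=query_synthesize, text=chunk, model=model, temperature=temperature, text_type='chunk', token_bag=0)
        synthesized += chunk_query['response']
        used_tokens += chunk_query['tokens']
    synthesized = synthesized.replace("\n", " ")
    print(synthesized)
    # Ask the extraction prompt on the synthesized article
    response = query_chat_openai(query=query, text=synthesized, model=model, temperature=temperature, token_bag=0, text_type='full_text')
    # Report the tokens used by the chunk summaries plus the final query
    response['tokens'] += used_tokens
    response['query'] = synthesized
    return response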

test = long_query_answer(query=query, text=test_text, model=GPT_MODEL, temperature=0, token_bag=tokens)

tokens += test['tokens']

Text divided into 27 chunks
Synthesizing into chunks of less than 100 words
Synthesizing chunk #1
Synthesizing chunk #2
Synthesizing chunk #3
Synthesizing chunk #4
Synthesizing chunk #5
Synthesizing chunk #6
Synthesizing chunk #7
Synthesizing chunk #8
Synthesizing chunk #9
Synthesizing chunk #10
Synthesizing chunk #11
Synthesizing chunk #12
Synthesizing chunk #13
Synthesizing chunk #14
Synthesizing chunk #15
Synthesizing chunk #16
Synthesizing chunk #17
Synthesizing chunk #18
Synthesizing chunk #19
Synthesizing chunk #20
Synthesizing chunk #21
Synthesizing chunk #22
Synthesizing chunk #23
Synthesizing chunk #24
Synthesizing chunk #25
Synthesizing chunk #26
Synthesizing chunk #27
The article presents a study on the behavior of biodegradable mesoporous silica nanoparticles (bMSNs) designed to carry multiple cargos and self-destruct following the release of their payloads. The bMSNs are intrinsically radiolabeled with oxophilic zirconium89 (89Zr) radionuclide to track their in vivo pharma
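
Slicing by a fixed number of characters, as `ftext_tochunks` does, can cut a word (or a number such as a size or a dose) in half at a chunk boundary. A possible alternative is to split on whitespace and group whole words. The sketch below is an illustration only: `chunk_by_words` and its `max_words` default are hypothetical and were not used to produce the output above.

```python
def chunk_by_words(text, max_words=300):
    """Split text into chunks of at most max_words whole words."""
    words = text.split()
    return [
        " ".join(words[i:i + max_words])
        for i in range(0, len(words), max_words)
    ]


example = "Complete degradation of bMSNs is seen within 21 d of incubation"
for chunk in chunk_by_words(example, max_words=4):
    print(chunk)
```

Chunking by words also makes the per-chunk budget easier to reason about, since the synthesis prompt is itself expressed in words.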

In [25]:
print(json.dumps(test['values'], indent = 4))

{
    "1. assessed nanomaterial (exclude ligand)": " biodegradable mesoporous silica nanoparticles (bMSNs)",
    "2. species of the animal model used for the in vivo assay": " mice",
    "3": " Balb/c",
    "4. labelling used for the in vivo assay": " oxophilic zirconium89 (89Zr) radionuclide",
    "5. analysis method used for measurements for the in vivo assay": " positron emission tomography (PET) imaging",
    "6. Time points included in the in vivo assay (h)": " up to 48 hours",
    "7": " tumor tissues, liver, spleen, joints",
    "8": " dendritic, radially arranged mesochannels",
    "9": " silica nanoparticles",
    "10": " 162.9 nm",
    "11": " TRC105 (an antiCD105 antibody)"
}


In [26]:
for i in range(len(test['values'])):
    key = list(test['values'].keys())[i]
    abstract  = test_abstract['values'][key]
    full_text = test['values'][key]
    print(f"{key}\n\tABSTRACT: {abstract}\n\tFULL TEXT: {full_text}")

1. assessed nanomaterial (exclude ligand)
	ABSTRACT:  biodegradable mesoporous silica nanoparticles (bMSNs)
	FULL TEXT:  biodegradable mesoporous silica nanoparticles (bMSNs)
2. species of the animal model used for the in vivo assay
	ABSTRACT:  murine
	FULL TEXT:  mice
3
	ABSTRACT:  not specified
	FULL TEXT:  Balb/c
4. labelling used for the in vivo assay
	ABSTRACT:  oxophilic zirconium-89 (<sup>89</sup>Zr)
	FULL TEXT:  oxophilic zirconium89 (89Zr) radionuclide
5. analysis method used for measurements for the in vivo assay
	ABSTRACT:  positron emission tomography imaging
	FULL TEXT:  positron emission tomography (PET) imaging
6. Time points included in the in vivo assay (h)
	ABSTRACT:  not specified, but complete degradation of bMSNs is seen within 21 days
	FULL TEXT:  up to 48 hours
7
	ABSTRACT:  not specified
	FULL TEXT:  tumor tissues, liver, spleen, joints
8
	ABSTRACT:  not specified
	FULL TEXT:  dendritic, radially arranged mesochannels
9
	ABSTRACT:  silica
	FULL TEXT:  silica nan

In [35]:
# Output columns, in the same order as the numbered items in the prompt
FT_COLUMNS = ['nm_ft', 'species_ft', 'strain_ft', 'label_ft', 'analysis_method_ft',
              'time_points_ft', 'organ_ft', 'shape_ft', 'type_ft', 'size_ft', 'ligand_ft']

seen = set()
for index, row in df.iterrows():
    pmcid = row['pmcid']
    # Query each article once, skipping rows without a full text
    if pd.isnull(row['full_text']) or pmcid in seen:
        continue
    seen.add(pmcid)
    condition = df['pmcid'] == pmcid
    print(f'\n{pmcid}')
    query_ftext = long_query_answer(query=query, text=row['full_text'], model=GPT_MODEL, temperature=0, token_bag=tokens)
    tokens += query_ftext['tokens']
    # Map the answers back to their columns by position
    for column, value in zip(FT_COLUMNS, query_ftext['values'].values()):
        df.loc[condition, column] = value


PMC3985880
Text divided into 28 chunks
Synthesizing into chunks of less than 96 words
Synthesizing chunk #1
Synthesizing chunk #2
Synthesizing chunk #3
Synthesizing chunk #4
Synthesizing chunk #5
Synthesizing chunk #6
Synthesizing chunk #7
Synthesizing chunk #8
Synthesizing chunk #9
Synthesizing chunk #10
Synthesizing chunk #11
Synthesizing chunk #12
Synthesizing chunk #13
Synthesizing chunk #14
Synthesizing chunk #15
Synthesizing chunk #16
Synthesizing chunk #17
Synthesizing chunk #18
Synthesizing chunk #19
Synthesizing chunk #20
Synthesizing chunk #21
Synthesizing chunk #22
Synthesizing chunk #23
Synthesizing chunk #24
Synthesizing chunk #25
Synthesizing chunk #26
Synthesizing chunk #27
Synthesizing chunk #28
The article describes the development of anisotropic, branched, gold nanoarchitectures (Au-tripods) for bioimaging. The Au-tripods were less than 20 nm in size and were identified as high-absorbance nanomaterials for in vivo photoacoustic imaging (PAI). The in vivo biodistribut

In [36]:
df

Unnamed: 0,ID,Time_h,perc_ID_g,Species,Age/weight,Strain,Organ,Size_nm,Analysis method,NP_Type,...,species_ft,strain_ft,label_ft,analysis_method_ft,time_points_ft,organ_ft,shape_ft,type_ft,size_ft,ligand_ft
0,7,1.0,9.716981,Mouse,21.4,Nude mice,Spleen,10.0,PET,Gold,...,mice,"U87MG tumor-bearing mice, nude mice",64Cu,small animal positron emission tomography (PE...,"1, 2, 4, 24, and 48 hours post-injection","liver, spleen, kidney, muscle, tumor","anisotropic, branched, tripods (dipods, tripo...",gold-based nanomaterials,less than 20 nm,cyclic Arg-Gly-Asp-d-Phe-Cys (RGDfC) peptide
1,8,1.0,2.489796,Mouse,20.0,Balb/c mice,Spleen,56.8,198Au,Gold,...,mice,BALB/c,radioactive 198Au,"Cerenkov imaging, autoradiography, PET imagin...","1, 6, 24","liver, spleen, tumor, lungs","nanospheres, nanodisks, nanorods, nanocages",gold,around 50 nm,PEGylation
2,16,4.0,9.022310,Mouse,,SCID mice,Spleen,42.5,ICP-MS,Gold,...,mice,severe combined immune deficient (SCID) mice,"retroviral labelling, chromium51",inductively coupled plasma mass spectrometry ...,"4, 8, 24, 48","liver, spleen, kidneys, small intestine, musc...","nanorods, nanoshells, colloidal nanospheres",gold,40-45 nm,anti-epidermal growth factor receptor antibodies
3,50,72.0,91.914894,Mouse,32.0,Male ddY mice,Spleen,11.0,ICP-MS,Gold,...,mice,tumor-bearing mice,not specified,inductively coupled plasma mass spectrometry ...,72 hours,"liver, lung, spleen, kidney, tumor",rod-shaped and spherical,gold,10 to 50 nm,PEG (polyethylene glycol)
4,56,5.0,1.993007,Mouse,18.0,Nude mice,Spleen,10.0,64Cu,Gold,...,mice,athymic nude mice,64Cu,PET imaging,"1 hour, 48 hours","tumors, liver",rod-shaped,gold,approximately 45 x 10 nm,NOTA
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
166,125,1.0,5.090000,Mouse,19.1,Balb/c mice,Kidney,92.1,111In,Lipid,...,mice,BALB/c,chelator-free zirconium-89 (89Zr) labeling,positron emission tomography (PET) imaging,up to 21 days,"liver, spleen, bone marrow",not specified,silica nanoparticles,"90 nm (dSiO2), 150 nm (MSN)",not specified
167,129,1.0,6.900000,Mouse,,Nude mice,Kidney,100.0,64Cu,Lipid,...,mice,ICR mice,not specified,"ICP-OES, blood biochemical assay, histologica...",14 days,"liver, spleen, lungs, kidneys, heart, brain",amorphous spherical,silica and silver nanoparticles,64-70 nm,not specified
168,149,24.0,1.436782,Mouse,20.0,Nude mice,Kidney,70.7,ICP-MS,Iron Oxide,...,mice,athymic nude mice,64Cu,PET imaging,"1, 5, 8, 24, 27","liver, spleen, tumor, heart",nanorods,gold,10,DOTA
169,155,3.0,8.979592,Mouse,22.5,Nude mice,Kidney,6.0,SPECT/CT,Iron Oxide,...,mice,BALB/c nude mice,"indium-111, 125I-cyclo-(-RGDyV-)","biodistribution studies, micro-SPECT/CT imaging",1 and 4 hours post-injection,"liver, spleen, tumor, intestine, lungs",not specified,lipid-based,92.1-110.4 nm,RGD peptide


In [40]:
# One row per article (first row with an abstract): curated name next to both model answers
df_comparison = (
    df[df['abstract'].notnull()]
    .drop_duplicates(subset='pmcid')
    .loc[:, ['pmcid', 'Name', 'nm_abs', 'nm_ft']]
    .rename(columns={'Name': 'curated_nm'})
    .reset_index(drop=True)
)
df_comparison

| | pmcid | curated_nm | nm_abs | nm_ft |
|---|---|---|---|---|
| 0 | PMC3985880 | Gold tripods <20nm | Au-tripods | gold nanoarchitectures |
| 1 | PMC4358630 | Gold nanospheres 56.8nm | Au nanostructures | nanospheres, nanodisks, nanorods, nanocages |
| 2 | PMC3211348 | 43 nm AuNP-PEG5000 (Kennedy et al. 2011) | gold nanoparticles | gold nanoparticles (AuNPs) |
| 3 | PMC3492114 | PEG-modified gold nanorods AP 5.0 10.6*49.6 | gold nanorods | gold nanoparticles |
| 4 | PMC3425121 | AuNR - DOX 45*10nm | gold nanorod | gold nanorods |
| 5 | PMC5039679 | GNR P1 2*10 | AuNR | gold nanorods (AuNR) |
| 6 | PMC4207078 | 125I-NGO | nanoscale graphene oxide (NGO) | nanoscale graphene oxide (NGO) |
| 7 | PMC4262629 | 64Cu-NOTA-MSN-PEG-VEGF121 | MSNs | mesoporous silica nanoparticles (MSNs) |
| 8 | PMC4218929 | r 64Cu-MSN-800CW-TRC105(Fab). 80nm | MSN | MSN, QDs |
| 9 | PMC4038837 | 4Cu-HMSN-ZW800-TRC105 150nm | HMSN | hollow mesoporous silica nanoparticles (HMSN) |


In [42]:
# One row per article (first row with an abstract): curated ligand next to both model answers
df_comparison = (
    df[df['abstract'].notnull()]
    .drop_duplicates(subset='pmcid')
    .loc[:, ['pmcid', 'Ligand', 'ligand_abs', 'ligand_ft']]
    .rename(columns={'Ligand': 'curated_ligand', 'ligand_abs': 'abs_ligand', 'ligand_ft': 'ft_ligand'})
    .reset_index(drop=True)
)
df_comparison

| | pmcid | curated_ligand | abs_ligand | ft_ligand |
|---|---|---|---|---|
| 0 | PMC3985880 | 3400 | cyclic Arg-Gly-Asp-d-Phe-Cys (RGDfC) peptide | cyclic Arg-Gly-Asp-d-Phe-Cys (RGDfC) peptide |
| 1 | PMC4358630 | 5000 | PEGylation | PEGylation |
| 2 | PMC3211348 | 5000 | not specified | anti-epidermal growth factor receptor antibodies |
| 3 | PMC3492114 | 5000 | PEG density on the gold surface | PEG (polyethylene glycol) |
| 4 | PMC3425121 | 5000 | cyclo(Arg-Gly-Asp-D-Phe-Cys) peptides (cRGD) | NOTA |
| 5 | PMC5039679 | 5000 | | PEGylation |
| 6 | PMC4207078 | 0 | polyethylene glycol (PEG) coating | not specified |
| 7 | PMC4262629 | 5000 | anti-VEGFR ligand VEGF121 | VEGF121 |
| 8 | PMC4218929 | 5000 | TRC105 | TRC105 (a human/murine chimeric IgG1 monoclon... |
| 9 | PMC4038837 | 5000 | TRC105 | anti-CD105 antibody (TRC105) |


In [44]:
# One row per article (first row with an abstract): curated strain next to both model answers
df_comparison = (
    df[df['abstract'].notnull()]
    .drop_duplicates(subset='pmcid')
    .loc[:, ['pmcid', 'Strain', 'strain_abs', 'strain_ft']]
    .rename(columns={'Strain': 'curated_strain', 'strain_abs': 'abs_strain', 'strain_ft': 'ft_strain'})
    .reset_index(drop=True)
)
df_comparison

| | pmcid | curated_strain | abs_strain | ft_strain |
|---|---|---|---|---|
| 0 | PMC3985880 | Nude mice | U87MG tumor-bearing mice | U87MG tumor-bearing mice, nude mice |
| 1 | PMC4358630 | Balb/c mice | EMT6 breast cancer model | BALB/c |
| 2 | PMC3211348 | SCID mice | not specified | severe combined immune deficient (SCID) mice |
| 3 | PMC3492114 | Male ddY mice | tumor-bearing mice | tumor-bearing mice |
| 4 | PMC3425121 | Nude mice | not specified | athymic nude mice |
| 5 | PMC5039679 | Nude mice | | not specified |
| 6 | PMC4207078 | Kunming mice | not specified | not specified |
| 7 | PMC4262629 | Nude mice | U87MG | U87MG tumor-bearing mice, female athymic nude... |
| 8 | PMC4218929 | Balb/c mice | 4T1 murine breast tumor-bearing mice | 4T1 murine breast tumor-bearing mice |
| 9 | PMC4038837 | mice | 4T1 | 4T1 murine breast cancer model |


In [46]:
# One row per article (first row with an abstract): curated organ next to both model answers
df_comparison = (
    df[df['abstract'].notnull()]
    .drop_duplicates(subset='pmcid')
    .loc[:, ['pmcid', 'Organ', 'organ_abs', 'organ_ft']]
    .rename(columns={'Organ': 'curated_organ', 'organ_abs': 'abs_organ', 'organ_ft': 'ft_organ'})
    .reset_index(drop=True)
)
df_comparison

| | pmcid | curated_organ | abs_organ | ft_organ |
|---|---|---|---|---|
| 0 | PMC3985880 | Spleen | not specified | liver, spleen, kidney, muscle, tumor |
| 1 | PMC4358630 | Spleen | not specified | liver, spleen, tumor, lungs |
| 2 | PMC3211348 | Spleen | tumors | liver, spleen, kidneys, small intestine, musc... |
| 3 | PMC3492114 | Spleen | liver, lung, kidney, tumors, spleen | liver, lung, spleen, kidney, tumor |
| 4 | PMC3425121 | Spleen | not specified | tumors, liver |
| 5 | PMC5039679 | Spleen | | heart, tumor, liver, spleen |
| 6 | PMC4207078 | Spleen | liver, lung, spleen | liver, lung, spleen, kidney |
| 7 | PMC4262629 | Spleen | not specified | liver, spleen, tumor, kidney, intestine, lung... |
| 8 | PMC4218929 | Spleen | not specified | liver, spleen, kidneys |
| 9 | PMC4038837 | Spleen | not specified | liver, spleen, tumor, lungs, muscle |


In [48]:
# One row per article (first row with an abstract): curated size next to both model answers
df_comparison = (
    df[df['abstract'].notnull()]
    .drop_duplicates(subset='pmcid')
    .loc[:, ['pmcid', 'Size_nm', 'size_abs', 'size_ft']]
    .rename(columns={'Size_nm': 'curated_size', 'size_abs': 'abs_size', 'size_ft': 'ft_size'})
    .reset_index(drop=True)
)
df_comparison

| | pmcid | curated_size | abs_size | ft_size |
|---|---|---|---|---|
| 0 | PMC3985880 | 10.0 | less than 20 | less than 20 nm |
| 1 | PMC4358630 | 56.8 | not specified | around 50 nm |
| 2 | PMC3211348 | 42.5 | 45 nm | 40-45 nm |
| 3 | PMC3492114 | 11.0 | diameter of 10 to 50 nm | 10 to 50 nm |
| 4 | PMC3425121 | 10.0 | not specified | approximately 45 x 10 nm |
| 5 | PMC5039679 | 2.0 | varied | various sizes of AuNRs were used |
| 6 | PMC4207078 | 308.0 | nanoscale | 10-800 nm |
| 7 | PMC4262629 | 129.1 | not specified | 80 nm |
| 8 | PMC4218929 | 175.3 | not specified | 80 |
| 9 | PMC4038837 | 194.4 | not specified | 150nm |


In [50]:
df.to_csv('../data/openai_ir_distribution_nm.csv', index = False)

Note: for variables such as size, the LLM correctly looked for an interval, whereas the table above shows only one of the possible curated values for each publication.
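
Since the model tends to answer with an interval, one way to compare it against a single curated value is to check whether that value falls inside the reported range. The sketch below assumes a very simple reading of the free-text answers: `parse_size_interval` and `curated_in_interval` are hypothetical helpers, every number found is treated as a bound in nm, and answers that give dimensions (such as "45 x 10 nm") are treated like a range, which may not be appropriate.

```python
import re

NUMBER_RE = re.compile(r"\d+(?:\.\d+)?")


def parse_size_interval(answer):
    """Return (low, high) in nm from a free-text size answer, or None."""
    numbers = [float(n) for n in NUMBER_RE.findall(answer)]
    if not numbers:
        return None
    low, high = min(numbers), max(numbers)
    # "less than 20 nm" only gives an upper bound
    if "less than" in answer.lower():
        low = 0.0
    return low, high


def curated_in_interval(curated, answer):
    """Check whether the curated size falls inside the model's interval."""
    interval = parse_size_interval(answer)
    if interval is None:
        return False
    low, high = interval
    return low <= curated <= high


# Curated value and full-text answer pairs taken from the table above
for curated, answer in [(42.5, "40-45 nm"), (10.0, "less than 20 nm"), (129.1, "80 nm")]:
    print(curated, repr(answer), curated_in_interval(curated, answer))
```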