In [1]:
import openai
import os
from scripts.get_abstract_epmc import get_abstract
import pandas as pd
import numpy as np
import re
import json

# Using the OpenAI API for text completion

This notebook uses the OpenAI API to assess how well the GPT-3.5-turbo model extracts assay-specific data from the abstracts of the open access journal articles sourced in [Nanoparticle biodistribution coefficients: A quantitative approach for understanding the tissue distribution of nanoparticles](https://doi.org/10.1016/j.addr.2023.114708). It continues [exploration-clean](00_exploration-clean.ipynb).

In [2]:
df = pd.read_csv("../data/perc_id_g.csv")

## Get abstracts
The abstracts of the Open Access journal articles behind the dataset are retrieved through the Europe PMC API.


In [3]:
seen = []
count = 0
total = len(np.unique(df['provided_identifier']))
for index, row in df.iterrows():
    if pd.notnull(row['pmcid']):     
        pmcid = row['pmcid']
        if pmcid not in seen:
            seen.append(pmcid)
            text = get_abstract(pmcid)
            if text != "":
                df.at[index, 'abstract'] = text
                count += 1
print(f'{count} abstracts retrieved out of {total} different journal articles')


Not available: ('abstractText') for PMC4127427
55 abstracts retrieved out of 116 different journal articles
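
The `get_abstract` helper is imported from `scripts.get_abstract_epmc` and is not shown in this notebook. As a rough idea of what such a lookup could involve, the sketch below queries the Europe PMC REST search endpoint for one PMCID and reads the `abstractText` field of the first hit. The names `extract_abstract` and `fetch_abstract` are hypothetical and the actual helper may work differently; the empty-string fallback mirrors the `text != ""` check in the loop above.

```python
import json
import urllib.parse
import urllib.request

# Europe PMC REST search endpoint
EPMC_SEARCH_URL = "https://www.ebi.ac.uk/europepmc/webservices/rest/search"


def extract_abstract(payload: dict) -> str:
    """Return the abstract of the first search hit, or "" if none is present."""
    results = payload.get("resultList", {}).get("result", [])
    if not results:
        return ""
    return results[0].get("abstractText", "")


def fetch_abstract(pmcid: str, timeout: float = 30.0) -> str:
    """Hypothetical stand-in for get_abstract: look up one PMCID (needs network)."""
    params = urllib.parse.urlencode(
        {"query": f"PMCID:{pmcid}", "resultType": "core", "format": "json"}
    )
    url = f"{EPMC_SEARCH_URL}?{params}"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return extract_abstract(json.load(resp))


# Offline check with a hand-made payload shaped like a search response
sample = {"resultList": {"result": [{"pmcid": "PMC0000000", "abstractText": "Example."}]}}
print(extract_abstract(sample))
print(repr(extract_abstract({"resultList": {"result": [{}]}})))
```

Keeping the parsing step separate from the network call makes the missing-abstract case (as reported for PMC4127427 above) easy to check without contacting the service.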


## Set up the OpenAI text completion API

An API key is required; it is read from a local text file:

In [20]:
with open('resources/openAI_key.txt', 'r') as f:
    api_key = f.read().strip()
os.environ['OPENAI_API_KEY'] = api_key
openai.api_key = os.getenv("OPENAI_API_KEY")
# models
EMBEDDING_MODEL = "text-embedding-ada-002"
GPT_MODEL = "gpt-3.5-turbo"
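
The cell above writes the key into `os.environ` and reads it straight back. When the key is already exported in the environment, the file is not needed at all. A minimal sketch of that fallback order, assuming the same file location; `load_api_key` is a hypothetical helper and not part of the `openai` package:

```python
import os
from pathlib import Path


def load_api_key(path: str = "resources/openAI_key.txt",
                 env_var: str = "OPENAI_API_KEY") -> str:
    """Return the key from the environment if set, otherwise from a text file."""
    key = os.environ.get(env_var, "").strip()
    if key:
        return key
    key_file = Path(path)
    if not key_file.is_file():
        raise FileNotFoundError(
            f"No API key found: set {env_var} or create {path}"
        )
    return key_file.read_text().strip()
```

With this helper the assignment would read `openai.api_key = load_api_key()`, and the key file can stay out of version control.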

## Test

The test uses the abstract of [PMC5102673](https://doi.org/10.1002/advs.201600122):

    A systematic study of in vitro and in vivo behavior of biodegradable mesoporous silica nanoparticles (bMSNs), designed to carry multiple cargos (both small and macromolecular drugs) and subsequently self-destruct following release of their payloads, is presented. Complete degradation of bMSNs is seen within 21 d of incubation in simulated body fluid. The as-synthesized bMSNs are intrinsically radiolabeled with oxophilic zirconium-89 (<sup>89</sup>Zr, <i>t</i><sub>1/2</sub> = 78.4 h) radionuclide to track their in vivo pharmacokinetics via positron emission tomography imaging. Rapid and persistent CD105 specific tumor vasculature targeting is successfully demonstrated in murine model of metastatic breast cancer by using TRC105 (an anti-CD105 antibody)-conjugated bMSNs. This study serves to illustrate a simple, versatile, and readily tunable approach to potentially overcome the current challenges facing nanomedicine and further the goals of personalized nanotheranostics.

In [24]:
text_test = df.iloc[330]['abstract']
id_test = df.iloc[330]['pmcid']
doi_test = df.iloc[330]['doi']
print(id_test, " ", doi_test)
print(text_test)


PMC5102673   10.1002/advs.201600122
A systematic study of in vitro and in vivo behavior of biodegradable mesoporous silica nanoparticles (bMSNs), designed to carry multiple cargos (both small and macromolecular drugs) and subsequently self-destruct following release of their payloads, is presented. Complete degradation of bMSNs is seen within 21 d of incubation in simulated body fluid. The as-synthesized bMSNs are intrinsically radiolabeled with oxophilic zirconium-89 (<sup>89</sup>Zr, <i>t</i><sub>1/2</sub> = 78.4 h) radionuclide to track their in vivo pharmacokinetics via positron emission tomography imaging. Rapid and persistent CD105 specific tumor vasculature targeting is successfully demonstrated in murine model of metastatic breast cancer by using TRC105 (an anti-CD105 antibody)-conjugated bMSNs. This study serves to illustrate a simple, versatile, and readily tunable approach to potentially overcome the current challenges facing nanomedicine and further the goals of personalize

Setting up the query and function used for the API call:

In [6]:
query = """Scan the following scientific article abstract describing the use of animal models to investigate nanomaterial or nanoparticle biodistribution in organs to fill up the following key-value pairs, respecting the formatting:

Abstract:
\"\"\"
{}
\"\"\"

Key-value pairs:

1. assessed nanomaterial:
2. commercial or synthesized:
3. labelling used for the in vivo assay:
4. instrumental equipment used for measurements for the in vivo assay:
5. animal model used: 
6. route of administration of nanomaterial:
7. age/sex of the animal model:
8. fate of the nanomaterial observed: (3 words max)
"""


import time

def question_answer(query, abstract, model, temperature):
    query = query.format(abstract)
    messages = [
        {'role': 'system', 'content': 'You answer questions about the abstract of a journal article.'},
        {'role': 'user', 'content': query},
    ]
    retries = 3
    wait_time = 30

    while retries > 0:
        try:
            response = openai.ChatCompletion.create(
                messages=messages,
                model=model,
                temperature=temperature
            )

            response_text = response['choices'][0]['message']['content']
            print(response_text)
            dictionary = {'values': {}, 'response': response_text}

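            # Split the response into lines
            lines = response_text.strip().split("\n")

            # Parse each "key: value" line; split only on the first colon so
            # that values containing a colon do not raise a ValueError
            for line in lines:
                if ":" in line:
                    key, value = line.split(":", 1)
                    dictionary['values'][key.strip()] = value.strip()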

            return dictionary

        except openai.error.RateLimitError:
            print('RateLimitError: Too many requests. Retrying after {} seconds...'.format(wait_time))
            time.sleep(wait_time)
            retries -= 1

    raise Exception('Max retries exceeded. Request could not be completed.')

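# Example usage with placeholder arguments: 'model' is not a valid model name,
# so this call is expected to fail and print the error message
try:
    result = question_answer('Your query', 'Abstract', 'model', 0.8)
    print('Response:', result['response'])
    print('Values:', result['values'])
except Exception as e:
    print('Error occurred:', str(e))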


test = question_answer(query, text_test, model=GPT_MODEL, temperature=0)
print(json.dumps(test, indent = 4))

Error occurred: The model `model` does not exist
1. assessed nanomaterial: biodegradable mesoporous silica nanoparticles (bMSNs)
2. commercial or synthesized: synthesized
3. labelling used for the in vivo assay: oxophilic zirconium-89 (<sup>89</sup>Zr)
4. instrumental equipment used for measurements for the in vivo assay: positron emission tomography (PET) imaging
5. animal model used: murine model of metastatic breast cancer
6. route of administration of nanomaterial: not specified
7. age/sex of the animal model: not specified
8. fate of the nanomaterial observed: complete degradation
{
    "values": {
        "1. assessed nanomaterial": "biodegradable mesoporous silica nanoparticles (bMSNs)",
        "2. commercial or synthesized": "synthesized",
        "3. labelling used for the in vivo assay": "oxophilic zirconium-89 (<sup>89</sup>Zr)",
        "4. instrumental equipment used for measurements for the in vivo assay": "positron emission tomography (PET) imaging",
        "5. animal 

Now performing the request for all available open access abstracts:

In [15]:
seen = []
count = 0
total = len(set(df['abstract']))
for index, row in df.iterrows():
    if pd.notnull(row['abstract']):     
        abstract = row['abstract']
        pmcid = row['pmcid']
        if pmcid not in seen:
            seen.append(pmcid)
            condition = df['pmcid'] == pmcid
            # Query the model once per article and map the returned keys
            # back to dataframe columns
            answer = question_answer(query, abstract, GPT_MODEL, 0)['values']
            for key in answer:
                # Match each returned key against the expected field names
                if re.search(r'assessed nanomaterial', key):
                    df.loc[condition, 'nm_llm'] = answer[key]
                if re.search(r'commercial', key):
                    df.loc[condition, 'commercial_synthesized_llm'] = answer[key]
                if re.search(r'animal model', key):
                    df.loc[condition, 'model_llm'] = answer[key]
                if re.search(r'route of administration', key):
                    # Store the answer itself, not the key that matched
                    df.loc[condition, 'route_admin_llm'] = answer[key]
                if re.search(r'fate', key):
                    df.loc[condition, 'fate_llm'] = answer[key]
            


1. assessed nanomaterial: gold nanoparticles (Au NPs)
2. commercial or synthesized: not specified
3. labelling used for the in vivo assay: intrinsic fluorescence of photodynamic therapy drug Pc 4
4. instrumental equipment used for measurements for the in vivo assay: not specified
5. animal model used: not specified
6. route of administration of nanomaterial: intravenous (IV) injection
7. age/sex of the animal model: not specified
8. fate of the nanomaterial observed: biodistribution impact
1. assessed nanomaterial: gold nanocluster ((64)Cu-doped AuNCs)
2. commercial or synthesized: synthesized
3. labelling used for the in vivo assay: PET radionuclide (64)Cu
4. instrumental equipment used for measurements for the in vivo assay: dual-modality positron emission tomography (PET) and near-infrared (NIR) fluorescence imaging
5. animal model used: U87MG glioblastoma xenograft model
6. route of administration of nanomaterial: not specified
7. age/sex of the animal model: not specified
8. fate 
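
The chain of `re.search` checks in the loop above amounts to a lookup from prompt field to dataframe column. A minimal sketch of that lookup written as a table, assuming the numbered `N. field: value` layout requested in the prompt; `FIELD_TO_COLUMN`, `parse_numbered_pairs` and `to_columns` are hypothetical names, and the sample lines are copied from the output above.

```python
import re

# Prompt field -> dataframe column, following the columns written in the loop
FIELD_TO_COLUMN = {
    "assessed nanomaterial": "nm_llm",
    "commercial or synthesized": "commercial_synthesized_llm",
    "animal model used": "model_llm",
    "route of administration of nanomaterial": "route_admin_llm",
    "fate of the nanomaterial observed": "fate_llm",
}

# Matches lines such as "5. animal model used: not specified"
LINE_PATTERN = re.compile(r"^\s*\d+\.\s*(?P<field>[^:]+?)\s*:\s*(?P<value>.*)$")


def parse_numbered_pairs(response_text: str) -> dict:
    """Parse 'N. field: value' lines into a {field: value} dictionary."""
    pairs = {}
    for line in response_text.splitlines():
        match = LINE_PATTERN.match(line)
        if match:
            pairs[match.group("field")] = match.group("value").strip()
    return pairs


def to_columns(pairs: dict) -> dict:
    """Keep only the fields that have a target column and rename them."""
    return {
        FIELD_TO_COLUMN[field]: value
        for field, value in pairs.items()
        if field in FIELD_TO_COLUMN
    }


example = """1. assessed nanomaterial: gold nanoparticles (Au NPs)
5. animal model used: not specified
6. route of administration of nanomaterial: intravenous (IV) injection"""
print(to_columns(parse_numbered_pairs(example)))
```

Each entry of the resulting dictionary could then be written with `df.loc[condition, column] = value`, so adding a field only requires a new row in the table.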

Overview of results:

In [16]:
print("NMs identified:\n\t-")
print("\n\t-".join(set(df['nm_llm'].dropna())))
print("commercial/synthesized?\n\t-")
print("\n\t-".join(set(df['commercial_synthesized_llm'].dropna())))
print("animal model identified:\n\t-")
print("\n\t-".join(set(df['model_llm'].dropna())))
print("How was the NM administered:\n\t-")
print("\n\t-".join(set(df['route_admin_llm'].dropna())))

NMs identified:
	-
mesoporous silica nanoparticles (MSNs)
	-graphene oxide (GO) nanoconjugates
	-Solid lipid nanoparticles (SLNs)
	-superparamagnetic iron oxide (SPIO) nanoparticles
	-Au nanostructures
	-liposomes
	-polymeric nanoparticles
	-Gold nanorods (GNR)
	-Gold nanorods (AuNR)
	-poly(lactic acid)-polyethylene glycol copolymer nanoparticles
	-magnetic nanoparticles (MNPs)
	-gold nanocluster ((64)Cu-doped AuNCs)
	-Gold nanocages
	-Reduced graphene oxide nanosheets anchored with iron oxide nanoparticles (RGO-IONP-(1st)PEG-(2nd)PEG)
	-polymeric micelles
	-gold nanoparticles
	-Magnetic iron oxide nanoparticles platform (MIONPs)
	-Hollow mesoporous silica nanoparticles (HMSNs)
	-cRGDY-conjugated fluorescent silica nanoparticles (C dots)
	-Gold nanoparticles
	-Polyethylene glycol (PEG)-coated gold nanoparticles (AuNPs)
	-Cur-loaded nanoparticles
	-Gold nanoparticles (AuNPs)
	-sub-nanometer sized polymeric nanoparticles
	-Mesoporous silica nanoparticles (MSN)
	-nanographene oxide (GO)
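
The list above contains near-duplicates that differ only in capitalisation or in a trailing abbreviation, for example `Gold nanoparticles`, `gold nanoparticles` and `Gold nanoparticles (AuNPs)`. A minimal sketch of one way to collapse them before counting, assuming that lower-casing and dropping a final parenthesised abbreviation is acceptable for this overview; `normalize_label` is a hypothetical helper and the short list stands in for `df['nm_llm'].dropna()`.

```python
import re
from collections import Counter


def normalize_label(label: str) -> str:
    """Lower-case a label and drop one trailing parenthesised abbreviation."""
    label = label.strip().lower()
    return re.sub(r"\s*\([^()]*\)$", "", label)


# Values copied from the output above, standing in for df['nm_llm'].dropna()
labels = [
    "Gold nanoparticles",
    "gold nanoparticles",
    "Gold nanoparticles (AuNPs)",
    "Gold nanorods (GNR)",
    "Gold nanorods (AuNR)",
    "liposomes",
]

counts = Counter(normalize_label(label) for label in labels)
for name, n in counts.most_common():
    print(f"{n}\t{name}")
```

Labels with nested parentheses, such as `gold nanocluster ((64)Cu-doped AuNCs)`, are left unchanged by this pattern apart from the lower-casing.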
	

As a rough accuracy check, the NM name recorded in the dataset (`Name`) is printed next to the nanomaterial extracted by the model (`nm_llm`):

In [17]:
seen = []
count = 0
total = len(set(df['abstract']))
for index, row in df.iterrows():
    if pd.notnull(row['abstract']):     
        abstract = row['abstract']
        pmcid = row['pmcid']
        if pmcid not in seen:
            seen.append(pmcid)
            print(str(row['pmcid']) , ' | ' , str(row['Name']) ," -----> " + str(row['nm_llm']))

PMC4437573  |  Peptide Au NP 5nm  -----> gold nanoparticles (Au NPs)
PMC4180787  |  64Cu-doped AuNCs 2.5nm  -----> gold nanocluster ((64)Cu-doped AuNCs)
PMC3985880  |  Gold tripods <20nm  -----> Au-tripods
PMC4358630  |  Gold nanospheres 56.8nm  -----> Au nanostructures
PMC3404261  |  64Cu-DOTA-PEGAuNCs (55 nm)  -----> Gold nanocages
PMC3211348  |  43 nm AuNP-PEG5000 (Kennedy et al. 2011)  -----> Gold nanoparticles (AuNPs)
PMC3379889  |  50nm Au PEG  -----> gold nanoparticles
PMC3563754  |  64Cu-NOTA-Au-IONP-Affibody 24.4nm  -----> Au-IO hetero-nanostructures (Au-IONPs)
PMC4151626  |  SERS nanoparticles Gold 120nm  -----> gold surface-enhanced Raman scattering (SERS) nanoparticles
PMC2745599  |  20-nm AuNPs coated with PEG5000-TA  -----> Polyethylene glycol (PEG)-coated gold nanoparticles (AuNPs)
PMC4836969  |   5 nm 199Au-AuNP-PEG  -----> Gold nanoparticles
PMC3492114  |  PEG-modified gold nanorods AP 5.0 10.6*49.6  -----> Gold nanorods
PMC3425121  |  AuNR - DOX 45*10nm  -----> Gold n
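
Comparing the two columns by eye does not scale, so a coarse automatic check may help. The sketch below assigns each string to one or more material families through a keyword table and reports whether the dataset name and the model answer share a family. The keyword table and the helper names are assumptions made for illustration and would need extending to cover the full dataset; the three pairs are copied from the output above.

```python
import re

# Hypothetical keyword table: material family -> patterns that indicate it
FAMILY_PATTERNS = {
    "gold": [r"(?i)\bgold\b", r"Au(?![a-z])"],
    "silica": [r"(?i)\bsilica\b", r"MSN"],
    "iron oxide": [r"(?i)\biron oxide\b", r"IONP", r"SPIO"],
}


def material_families(text: str) -> set:
    """Return every family whose patterns occur in the text."""
    return {
        family
        for family, patterns in FAMILY_PATTERNS.items()
        if any(re.search(pattern, text) for pattern in patterns)
    }


def same_family(name: str, llm_answer: str) -> bool:
    """True when both strings share at least one material family."""
    return bool(material_families(name) & material_families(llm_answer))


# (dataset Name, nm_llm) pairs taken from the printed comparison
pairs = [
    ("Peptide Au NP 5nm", "gold nanoparticles (Au NPs)"),
    ("64Cu-DOTA-PEGAuNCs (55 nm)", "Gold nanocages"),
    ("64Cu-NOTA-Au-IONP-Affibody 24.4nm", "Au-IO hetero-nanostructures (Au-IONPs)"),
]

agreeing = sum(same_family(name, answer) for name, answer in pairs)
print(f"{agreeing} of {len(pairs)} pairs share a material family")
```

This only tests agreement at the level of the core material; it says nothing about coatings, size or shape, which the `Name` column often encodes as well.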

The dataframe, including the new `*_llm` columns, is saved to [perc_id_g_llm.csv](../data/perc_id_g_llm.csv):

In [19]:
df.to_csv('../data/perc_id_g_llm.csv', index=False)