# Table of contents

* [Overview](#overview)
* [Imports](#imports)
* [Paths](#paths)
* [1. Get literature from PubMed](#pubmed)
* [2. Prepare prompt text](#prompt)
* [3. Run a pipeline to process 30 randomly selected abstracts](#pipeline)
* [4. Evaluate](#evaluate)
    * [Evaluation metrics for drug names extracted by GPT-4 [Level 1]](#level1)
    * [Evaluation metrics for target names extracted by GPT-4 [Level 2]](#level2)
    * [Evaluation metrics for drug-target interactions extracted by GPT-4 [Level 3]](#level3)
* [Conclusions and next steps](#conclusion)
* [References & Accessory links](#refs)

# Overview

A Retrieval-Augmented Generation project that builds a drug-target dataset from biomedical literature using GPT-4 and evaluates the extraction performance.
This project involved the following steps:
* Using the NCBI API to automatically retrieve abstracts and metadata.
* Prompt engineering to settle on the prompt text that gave the best output (not shown in this notebook).
* A pipeline that packages each abstract with the prompt in the format expected by the OpenAI API and generates output with GPT-4.
* Comparing GPT-4-extracted entities with manually curated ground truth data to obtain performance metrics.

<font color='blue'>**For a quick glance, major outputs in the pipeline are marked in blue text.**</font>


# Imports<a id="imports"></a>

In [1]:
%load_ext autoreload
%autoreload 2

In [38]:
import json
import pandas as pd
from pathlib import Path
import sys
import tiktoken

### pubmed
Contains code for querying PubMed, collecting metadata, organizing abstracts and other information into a df.

In [39]:
cwd = Path.cwd()

In [40]:
module_dir = cwd.parent / 'scripts'

In [41]:
sys.path.append(str(module_dir))

In [7]:
import pubmed

### openai_api
Contains code for running a pipeline using the GPT API: packaging PubMed abstracts into prompts, creating messages, estimating tokens and cost, and extracting relevant parts of a completion into a df.

In [8]:
import openai_api

### utils
Contains code for interacting with the Google Sheets API and some data cleaning functions.

In [48]:
import utils

### ner_evaluation
Contains code to evaluate model extracted entities against manually curated ground truth data.

In [126]:
import ner_evaluation

# Paths<a id="paths"></a>

## PubMed output path

In [42]:

output_dir = cwd.parent / 'outputs'

**Path to CSV containing literature from 87 papers from PubMed.**

In [9]:
pubmed_output = output_dir / 'pubmed_data_run_1.csv'

## Prompt language - system and user instructions

In [10]:
input_dir = cwd.parent / 'inputs'

In [11]:
system_path = input_dir / 'system_prompt_v1.json'
user_path = input_dir / 'user_prompt_v1.json'

## Completions jsons

**Run 1 - Completions for 30 pubmed abstracts processed for entity extraction via OpenAI API.**

In [12]:
completions_path = output_dir / 'completions_run_1.json'

**Run 1 - results from completions extracted into a CSV.**

In [43]:
ner_path = output_dir / 'pubmed_ner_run_1.csv'

# 1. Get literature from PubMed<a id="pubmed"></a>

<span style="font-size:20px;">**Method**</span>

* Since the focus of the project is to create a drug-target dataset, published biomedical articles involving drug(s) within specific cancers of interest were obtained from PubMed. 
* A total of 87 articles (30 breast cancer, 30 glioblastoma, and 27 lung cancer) were obtained using NCBI's E-utilities API.
* To get recent content, a publication date range from 2023 to the present was used.


<span style="font-size:20px;">**Steps**</span>

**What query should be used?** 

The search query was first tuned on PubMed's website. This ensured that the programmatic search returned relevant results similar to those on the website, and that the query used the syntax expected by the API.

**Establish base criteria for querying.**

The variables below define the date range (2023 to the date of development) and the search terms that must appear in the title or abstract, so that the results are studies involving drugs in the cancer types of interest. Review-type articles are excluded.

In [12]:
date_criteria = '(2023/01/01:3000[pdat])'
drug_criteria = '(drug[tiab]+OR+inhibitor[tiab]+OR+compound[tiab]+OR+small+molecule[tiab]+OR+clinical+trial[tiab]+OR+therapy[tiab]+OR+agent[tiab])'
pub_criteria = '(Review[pt]+OR+Scientific+Integrity+Review[pt]+OR+Systematic+Review[pt])'


**Run the PubMed pipeline for getting papers for 3 cancer types.**

The pipeline searches PubMed, collects the matching PMIDs and metadata about the search results, and then fetches only the first 30 of the matched articles (sorted by relevance).
Full article text was not pulled because the goal was to use only the abstract text to identify different types of information pertaining to the drug.
The contents of the requested papers are returned as an XML string.
The XML string goes through cleanup and parsing steps so that the relevant details of each paper are extracted and structured into a dataframe.
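The search-then-fetch flow described above can be sketched with NCBI's E-utilities endpoints. This is a minimal illustration, assuming `pubmed.run_pubmed_pipeline` wraps `esearch` and `efetch` with the history server, as its log output suggests. The helper names `build_esearch_url` and `build_efetch_url` are hypothetical and are not part of the `pubmed` module. The sketch only builds the request URLs and makes no network calls.

```python
from urllib.parse import urlencode

EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def build_esearch_url(term, retmax=9999, retstart=0, sort="relevance"):
    """URL for an esearch call that stores the result set on the history server."""
    params = {
        "db": "pubmed",
        "term": term,        # plain spaces here; urlencode turns them into '+'
        "retmode": "json",
        "retstart": retstart,
        "retmax": retmax,
        "sort": sort,
        "usehistory": "y",   # the response then carries WebEnv and query_key
    }
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


def build_efetch_url(webenv, query_key, retmax=30, retstart=0):
    """URL for an efetch call that pulls abstracts as XML for a stored result set."""
    params = {
        "db": "pubmed",
        "WebEnv": webenv,
        "query_key": query_key,
        "rettype": "abstract",
        "retmode": "xml",
        "retstart": retstart,
        "retmax": retmax,
    }
    return f"{EUTILS_BASE}/efetch.fcgi?{urlencode(params)}"


search_url = build_esearch_url("(2023/01/01:3000[pdat]) AND (glioblastoma[tiab])")
# WebEnv and query_key would come from the esearch response; these are placeholders.
fetch_url = build_efetch_url(webenv="EXAMPLE_WEBENV", query_key="1")
```

Storing the result set on the server means the full list of matching PMIDs does not have to be sent back with the fetch request, which only asks for the first 30 records.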


In [13]:
# Collect all dfs to merge later
all_dfs = []

for cancer_type in ['breast+cancer','lung+cancer','glioblastoma']:
    disease_criteria = f'({cancer_type}[tiab])'
    
    query = date_criteria + '+AND+' + disease_criteria + '+AND+' + drug_criteria + '+NOT+' + pub_criteria
    
    df = pubmed.run_pubmed_pipeline(query=query,
                                    save_on_server='y',
                                    search_format='json',
                                    search_starting_index=0,
                                    search_max_records=9999,
                                    sorting_criteria='relevance',
                                    content_type='abstract',
                                    fetch_starting_index=0,
                                    fetch_max_records=30)
    
    # Add an identifier column for cancer type for easy searching
    df['disease'] = cancer_type
    print(f'Num rows in df: {len(df)}')
    all_dfs.append(df)
    
    print(f'Pipeline complete for {cancer_type}')

# Combine all dfs
final_df = pd.concat(all_dfs)

----Running pipeline for the following query:----
(2023/01/01:3000[pdat])+AND+(breast+cancer[tiab])+AND+(drug[tiab]+OR+inhibitor[tiab]+OR+compound[tiab]+OR+small+molecule[tiab]+OR+clinical+trial[tiab]+OR+therapy[tiab]+OR+agent[tiab])+NOT+(Review[pt]+OR+Scientific+Integrity+Review[pt]+OR+Systematic+Review[pt])
Using PubMed esearch API to get PMIDs matching the search query.
	The actual total number of records matching the search for is 9701
	The number of ids present in the esearch json is 9701
	Function get_pmids complete.
Collecting metadata about the search results into a dictionary.
	Metadata obtained and saved in a dictionary.
Using PubMed efetch API to get abstract and other details for relevant PMIDs into an XML string.
	The number of matching PMIDs based on the server: 30
	Function get_abstracts complete.
Extracting data from XML string and organizing it into a dataframe.
	Performing basic cleanup.
	&#xa0 left: 0
	&#x3ba left: 0
	&# left: 0
Iterating through each article and col

In [14]:
len(final_df)

87

<font color='blue'>**Final dataframe ready with content extracted from PubMed.**</font>

In [15]:
final_df

Unnamed: 0,pmid,publication_date,publication_type,article_title,abstract,keywords,journal,num_abstracts_retrieved,num_abstracts_requested,query_string,num_total_matches,all_matching_pmids,acquisition_date,disease
0,37256976,2023 Jun 01,"Clinical Trial, Phase III|Journal Article|Rand...",Capivasertib in Hormone Receptor-Positive Adva...,[BACKGROUND]AKT pathway activation is implicat...,,The New England journal of medicine,30,30,(2023/01/01:3000/12/31[Date - Publication] AND...,9701,"37256976,37070653,37147285,37723305,36585452,3...",2024-03-21,breast+cancer
1,37070653,2023 Mar,Clinical Trial Protocol|Journal Article,"Design of SERENA-6, a phase III switching tria...",ESR1 mutation (ESR1m) is a frequent cause of a...,ESR1 mutation|advanced breast cancer|camizestr...,"Future oncology (London, England)",30,30,(2023/01/01:3000/12/31[Date - Publication] AND...,9701,"37256976,37070653,37147285,37723305,36585452,3...",2024-03-21,breast+cancer
2,37147285,2023 May 05,"Journal Article|Research Support, Non-U.S. Gov't",KK-LC-1 as a therapeutic target to eliminate ALDH,Failure to achieve complete elimination of tri...,,Nature communications,30,30,(2023/01/01:3000/12/31[Date - Publication] AND...,9701,"37256976,37070653,37147285,37723305,36585452,3...",2024-03-21,breast+cancer
3,37723305,2023 Oct,Journal Article,Acetate acts as a metabolic immunomodulator by...,Acetate metabolism is an important metabolic p...,,Nature cancerMain References:Methods Only Refe...,30,30,(2023/01/01:3000/12/31[Date - Publication] AND...,9701,"37256976,37070653,37147285,37723305,36585452,3...",2024-03-21,breast+cancer
4,36585452,2023 Feb,"Journal Article|Research Support, Non-U.S. Gov...",Network-based assessment of HDAC6 activity pre...,Inhibiting individual histone deacetylase (HDA...,,Nature cancerMETHODS-ONLY REFERENCES,30,30,(2023/01/01:3000/12/31[Date - Publication] AND...,9701,"37256976,37070653,37147285,37723305,36585452,3...",2024-03-21,breast+cancer
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
25,37460404,2023 Nov,Journal Article,Drug Repurposing-Based Brain-Targeting Self-As...,"Glioblastoma (GBM), the most aggressive and le...",blood-brain barriers|chemophototherapy|drug re...,"Small (Weinheim an der Bergstrasse, Germany)",30,30,(2023/01/01:3000/12/31[Date - Publication] AND...,1886,"36791206,37451272,36749723,37935665,36966152,3...",2024-03-21,glioblastoma
26,37244935,2023 May 27,"Journal Article|Research Support, Non-U.S. Gov't",macroH2A2 antagonizes epigenetic programs of s...,Self-renewal is a crucial property of glioblas...,,Nature communications,30,30,(2023/01/01:3000/12/31[Date - Publication] AND...,1886,"36791206,37451272,36749723,37935665,36966152,3...",2024-03-21,glioblastoma
27,37886538,2023 Oct 05,Preprint,LDHA-regulated tumor-macrophage symbiosis prom...,Abundant macrophage infiltration and altered t...,,Research square,30,30,(2023/01/01:3000/12/31[Date - Publication] AND...,1886,"36791206,37451272,36749723,37935665,36966152,3...",2024-03-21,glioblastoma
28,37672559,2023 Sep 19,"Research Support, N.I.H., Extramural|Editorial...",Overcoming EGFR inhibitor resistance in Gliobl...,,,Proceedings of the National Academy of Science...,30,30,(2023/01/01:3000/12/31[Date - Publication] AND...,1886,"36791206,37451272,36749723,37935665,36966152,3...",2024-03-21,glioblastoma


**Final df with the PubMed data showing the information extracted and their dtypes.**

In [16]:
final_df.dtypes

pmid                       object
publication_date           object
publication_type           object
article_title              object
abstract                   object
keywords                   object
journal                    object
num_abstracts_retrieved     int64
num_abstracts_requested     int64
query_string               object
num_total_matches           int64
all_matching_pmids         object
acquisition_date           object
disease                    object
dtype: object

**Save the df as a CSV.**

In [43]:
final_df.to_csv(pubmed_output,index=False)

# 2. Prepare prompt text<a id="prompt"></a>

<span style="font-size:20px;">**Method**</span>

* See the notebook llms_playground for the extensive experimentation behind this step: working out the OpenAI API workflow, testing the feasibility of using GPT for entity extraction, testing prompt language (including one-shot and few-shot learning), and extracting relevant data from completions into a df.
* The following prompt was selected after testing a series of zero-shot, one-shot, and few-shot prompt formats.
* The system text and user instructions below form the basis of the prompt that goes into the messages list when calling the OpenAI API. They are first saved to JSON files.
* The abstract text is appended to the user text, beneath 'TASKS'.
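The way an abstract is attached to the base prompt can be sketched as follows. This is an illustration only: `build_messages` is a hypothetical helper, not a function from `openai_api`, and the `PMID<id>:` prefix is an assumption based on the example format in the user prompt.

```python
import copy


def build_messages(system_msg, user_msg, pmid, abstract):
    """Return a chat `messages` list with one abstract appended beneath 'TASKS'."""
    user = copy.deepcopy(user_msg)  # leave the base prompt untouched for reuse
    user["content"] = f"{user['content']}PMID{pmid}:{abstract}"
    return [system_msg, user]


# Shortened stand-ins for the real system and user prompts.
system_msg = {"role": "system", "content": "You are a meticulous data scientist."}
user_msg = {"role": "user", "content": "RULES ...\nTASKS:\n"}
messages = build_messages(system_msg, user_msg, "36841821", "Pellino-1 (PELI1) is ...")
```

Copying the user message before appending keeps the base prompt unchanged, so the same prompt can be reused for every abstract.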

In [20]:
system_data = {"role":"system","content": "You are a meticulous data scientist working on named entity recognition and building out a drug-target dataset from biomedical text. Only consider the text given to you."}
system_data

{'role': 'system',
 'content': 'You are a meticulous data scientist working on named entity recognition and building out a drug-target dataset from biomedical text. Only consider the text given to you.'}

In [21]:
user_data = {"role":"user","content":"""Look at the examples delimited by ### and the rules delimited by ***.
*** RULES
For each of the text shown under TASKS, do the following:
1. Identify which drugs have been tested and create a set for each. 
2. Return multiple sets if more than drug is present in the text.
For each drug:
3. Get direct target: Use the following logic: if the text clearly and directly mentions that the drug targets a gene and has also defined the type of interaction
4. Get interaction type between the drug and direct target: if explicitly and clearly mentioned, then get the drug-direct target relationship type. What does the drug do to the direct target? inhibitor, activator, etc.?
5. Get groups tested: specify which type of genes was the drug tested on, eg. if drug was tested on samples showing high expression of certain gene.
6. Collect all disease names for which the drug has been tested into 1 list.
7. Extract any specific ClinicalTrials.gov identifier or number.
8. For each drug, construct a set consisting of (drug name, direct target, drug-direct target interaction, tested or effective group, ClinicalTrials.gov number, all diseases that the drug is tested in)
9. Any empty values should be indicated by null and not an empty string.
10. Double check your logic and see if the data you have collected is actually reflected in the text. If not, make revisions.
11. Both, direct target, and drug-direct target interaction fields should have been filled or both should be null. Only one of these fields cannot be null.
12. Assemble all sets and produce 1 final JSON output in a single line without any whitespaces.
***
### EXAMPLES
PMID23:LKB1/STK11 is a serine/threonine kinase that plays a major role in controlling cell metabolism, resulting in potential therapeutic vulnerabilities in LKB1-mutant cancers. Here, we identify the NAD + degrading ectoenzyme, CD38, as a new target in LKB1-mutant NSCLC. Metabolic profiling of genetically engineered mouse models (GEMMs) revealed that LKB1 mutant lung cancers have a striking increase in ADP-ribose, a breakdown product of the critical redox co-factor, NAD + . Surprisingly, compared with other genetic subsets, murine and human LKB1-mutant NSCLC show marked overexpression of the NAD+-catabolizing ectoenzyme, CD38 on the surface of tumor cells. Loss of LKB1 or inactivation of Salt-Inducible Kinases (SIKs)-key downstream effectors of LKB1- induces CD38 transcription induction via a CREB binding site in the CD38 promoter. Treatment with the FDA-approved anti-CD38 antibody, daratumumab, inhibited growth of LKB1-mutant NSCLC xenografts. Together, these results reveal CD38 as a promising therapeutic target in patients with LKB1 mutant lung cancer.
Output:{"PMID23": [{"drug name": "daratumumab","target": [{"direct target": "CD38","drug-direct target interaction": "anti-CD38 monoclonal antibody"},],"tested or effective group": ["LKB-1/STK-11 mutant NSCLC"],"drug tested in following diseases": ["lung cancer", "NSCLC"],"ClinicalTrials.gov ID": []}]}
###
TASKS:
Use the RULES and EXAMPLES and create a similar outputs for the following text delimited by ID:
"""}

**Save these in json files.**

In [22]:
with open(system_path, "w") as json_file:
    json.dump(system_data, json_file)

In [23]:
with open(user_path, "w") as json_file:
    json.dump(user_data, json_file)

# 3. Run a pipeline to process 30 randomly selected abstracts<a id="pipeline"></a>

<span style="font-size:20px;">**Method**</span>

* Out of the 87 PubMed articles, 30 were randomly chosen to generate the drug-target dataset using GPT.
* A small subset was chosen to confirm the feasibility of the approach and show proof of concept.
* The GPT chat completions API was used, with one API call per abstract.
* Each abstract was appended to the user prompt shown above.
* Each API call returned its completion as JSON output.
* After collecting all outputs, entities of interest were extracted from the completions and organized into a df.
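The per-abstract call loop can be sketched as below. This is a minimal illustration, not the actual `openai_api.process_pubmed_data` implementation, which is not shown in this notebook. `run_completions` is a hypothetical name. The stand-in client exists only so the sketch runs without an API key. It mimics the response attributes of the `openai` Python package (v1 and later), where the call is `client.chat.completions.create(...)`.

```python
from types import SimpleNamespace


def run_completions(client, messages_by_pmid, model, seed=1, temperature=0.0):
    """Call the chat completions endpoint once per abstract and collect the results."""
    all_completions = {}
    for pmid, messages in messages_by_pmid.items():
        completion = client.chat.completions.create(
            model=model,
            messages=messages,
            seed=seed,                # best-effort reproducibility
            temperature=temperature,  # 0.0 to reduce run-to-run variation
        )
        all_completions[pmid] = {
            "completion_id": completion.id,
            "content": completion.choices[0].message.content,
            "prompt_tokens": completion.usage.prompt_tokens,
            "completion_tokens": completion.usage.completion_tokens,
        }
    return all_completions


# Stand-in client for a dry run (not part of the openai package).
def _fake_create(**kwargs):
    return SimpleNamespace(
        id="chatcmpl-example",
        choices=[SimpleNamespace(message=SimpleNamespace(content='{"PMID1":[]}'))],
        usage=SimpleNamespace(prompt_tokens=10, completion_tokens=5),
    )


fake_client = SimpleNamespace(
    chat=SimpleNamespace(completions=SimpleNamespace(create=_fake_create))
)
demo = run_completions(
    fake_client, {"1": [{"role": "user", "content": "..."}]}, model="example-model"
)
```

With the real library, `client` would be an `openai.OpenAI` instance and `model` would be the GPT-4 model name used for the run.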

<span style="font-size:20px;">**Steps**</span>

**Get the df containing pubmed data (generated above in Step 1).**

In [24]:
df = pd.read_csv(pubmed_output,dtype={'pmid':'str'})

In [25]:
df1 = openai_api.select_pubmed_data(data_df=df,
                                    cols=['pmid','abstract'],
                                    num_rows=30,
                                    random_state=1)

Function select_pubmed_data complete.


**Only pubmed ID (PMID) and abstract text are used in the prompts.**

In [26]:
df1.head()

Unnamed: 0,pmid,abstract
10,36841821,Pellino-1 (PELI1) is an E3 ubiquitin ligase ac...
69,36399643,Durable glioblastoma multiforme (GBM) manageme...
61,36966152,Glioblastoma multiforme (GBM) is the most comm...
34,37321064,Drug resistance is a major challenge in cancer...
86,37147437,Mounting evidence is identifying human cytomeg...


In [27]:
len(df1)

30

In [28]:
df1['pmid'].nunique()

30

In [29]:
df1['pmid'].dtype

dtype('O')

**Pass each abstract to the prompt and call the API. Run the whole pipeline for all 30 abstracts, collect all completions in a single dictionary.**

In [32]:
all_completions = openai_api.process_pubmed_data(system_path=system_path,
                                                 user_path=user_path,
                                                 data_df1=df1,
                                                 openai_key_path=openai_key_path,
                                                 seed=1,
                                                 temperature=0.0)

	Base prompt messages loaded.
	Necessary columns present in the input df.
Function setup_client complete.
	Client setup complete.
	Processing abstracts through the API.
Function process_pubmed_data complete in 0:03:21.925227. This run costed $0.41.
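The reported cost follows from token counts and per-token rates. A minimal sketch of that arithmetic is shown below. `estimate_cost` is a hypothetical helper, and the rates are placeholders for illustration only, not the pricing used for this run. The token counts are taken from the first row of the results table further down.

```python
def estimate_cost(prompt_tokens, completion_tokens,
                  usd_per_1k_prompt, usd_per_1k_completion):
    """Estimated cost in USD for one API call, given per-1,000-token rates."""
    return (prompt_tokens / 1000 * usd_per_1k_prompt
            + completion_tokens / 1000 * usd_per_1k_completion)


# Placeholder rates; look up the current pricing for the model actually used.
cost = estimate_cost(1073, 61, usd_per_1k_prompt=0.01, usd_per_1k_completion=0.03)
```

Summing this value over all 30 calls gives the total for a run.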


**Save the completions dictionary as a JSON.**

In [41]:
with open(completions_path, "w") as json_file:
    json.dump(all_completions, json_file)

**Take GPT completions and organize them into a df.**

In [14]:
with open(completions_path,"r") as json_file:
    all_completions = json.load(json_file)
    

In [19]:
ner_df = openai_api.get_df_from_completions(all_completions=all_completions)

Function get_df_from_completions complete.


<font color='blue'>**Entities extracted by GPT from PubMed abstracts.**</font>

* completion_id: ID given by the GPT API for each 'chat' conversation; in this case, each API call that processes an abstract and completes the tasks specified in the prompt.

* completion_tokens: number of output tokens 
* prompt_tokens: number of tokens in the input messages
* total_tokens: sum of completion and prompt tokens
* pmid: PubMed ID

The following fields are extracted by GPT from the abstract text. Parentheses indicate the associated instruction in the user_data prompt given to GPT.
* drug_name: drug name (Instruction 1)
* direct_target: target of the drug (Instruction 3)
* drug-direct_target_interaction: relationship or the type of interaction between the drug and the target (Instruction 4)
* tested_or_effective_group: a list of specific sample groups that the drug was tested in (Instruction 5)
* drug_tested_in_diseases: a list of disease types that the drug was tested in (Instruction 6)
* clinical_trials_id: ClinicalTrials.gov identifier if mentioned (Instruction 7)
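The mapping from one completion's JSON to the columns listed above can be sketched as follows. This is an illustration under assumptions: `flatten_completion` is a hypothetical helper, not the actual `openai_api.get_df_from_completions`. It also assumes the model returns valid JSON in the shape of the prompt's example output.

```python
import json


def flatten_completion(content):
    """Turn one completion's JSON string into one row (dict) per drug-target pair."""
    rows = []
    for pmid_key, drug_sets in json.loads(content).items():
        for drug in drug_sets:
            # A drug without targets still yields one row, with null target fields.
            targets = drug.get("target") or [{}]
            for target in targets:
                rows.append({
                    "pmid": pmid_key.replace("PMID", ""),
                    "drug_name": drug.get("drug name"),
                    "direct_target": target.get("direct target"),
                    "drug-direct_target_interaction":
                        target.get("drug-direct target interaction"),
                    "tested_or_effective_group": drug.get("tested or effective group"),
                    "drug_tested_in_diseases":
                        drug.get("drug tested in following diseases"),
                    "clinical_trials_id": drug.get("ClinicalTrials.gov ID"),
                })
    return rows


example = ('{"PMID23":[{"drug name":"daratumumab","target":[{"direct target":"CD38",'
           '"drug-direct target interaction":"anti-CD38 monoclonal antibody"}],'
           '"tested or effective group":["LKB-1/STK-11 mutant NSCLC"],'
           '"drug tested in following diseases":["lung cancer","NSCLC"],'
           '"ClinicalTrials.gov ID":[]}]}')
rows = flatten_completion(example)
```

A list of such rows can be passed directly to `pd.DataFrame(rows)` to obtain a table like the one below.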


In [22]:
ner_df.head()

Unnamed: 0,completion_id,completion_tokens,prompt_tokens,total_tokens,drug_name,tested_or_effective_group,drug_tested_in_diseases,clinical_trials_id,direct_target,drug-direct_target_interaction,pmid
0,chatcmpl-95Fz5ygwhHbh6fS34y9rl6UCQzw7V,61.0,1073.0,1134.0,S62,,[breast cancer],,PELI1/EGFR,disruptor,36841821
1,chatcmpl-95FzB5d0oHjWiZz8uYJofalUBnFuB,99.0,1131.0,1230.0,regorafenib,,[GBM],,autophagosome-lysosome fusion,inhibitor,36399643
2,chatcmpl-95FzLUxdFBE1KNul7adDXyaq1Ds0Q,70.0,1086.0,1156.0,fatostatin,,"[glioblastoma, GBM]",,AKT/mTORC1/GPX4 signaling pathway,inhibitor,36966152
3,chatcmpl-95FzTMkaW3MDTzy0nvTisQfDIP9bc,78.0,1171.0,1249.0,MAM,"[cisplatin-resistant A549, AZD9291-resistant H...","[non-small cell lung cancer, NSCLC]",,NQO1,activator,37321064
4,chatcmpl-95FzZlGTV1CFR13ye9QStxhproc0a,130.0,1111.0,1241.0,ganciclovir,[HCMV-induced glioblastoma],[glioblastoma],,EZH2,inhibitor,37147437


**Save the results from the API as a CSV.**

In [23]:
ner_df.to_csv(ner_path,index=False)

**Write the PMIDs to Google Sheets so that manually curated data can be added in Sheets.**

In [62]:
utils.write_data_to_sheets(df=ner_df,
                           credentials_path=credentials_path,
                           sheets_filename='pubmed_manual_curated',
                           sheets_sheetname='run_1',
                           columns_to_write=['pmid'])

# 4. Evaluate<a id="evaluate"></a>

<span style="font-size:20px;">**Method**</span>

* For evaluation, a manually curated dataset was prepared for all 30 abstracts.
* Of all the extracted information types, the following 3 are the most important and the most complex to identify:
drug name, target, and drug-target interaction type.
* Hence, metrics were obtained only for these 3 data types by comparing the GPT-4 results with the manually curated dataset.
* In general, the fuzzywuzzy package was used for comparing two strings. However, different scorers and cutoffs were used for each data type, and the logic behind each is explained in the sections below.
* Overall, this is the strategy used for evaluation, in comparison with manually curated data:

* As shown in the above schematic, level 2 evaluation is dependent on level 1, and level 3 evaluation is dependent on entities (drug-target pairs) captured in level 2.
* Drug name is the most important and primary data type.
* So, in level 2, targets aren't assessed if their drug was not extracted correctly from the abstract.
* Similarly, in level 3, drug-target interaction type values are assessed only if both the drug and the corresponding target were captured correctly by GPT-4.
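The dependency between the three levels can be sketched as a simple gating rule. The match flags below are hypothetical and only illustrate which records reach each level. In the actual evaluation the flags come from fuzzy string matching.

```python
def cascade_counts(records):
    """Count how many records are eligible for assessment at each level."""
    level1 = len(records)                           # every drug is assessed
    level2 = sum(r["drug_match"] for r in records)  # targets: matched drugs only
    level3 = sum(r["drug_match"] and r["target_match"] for r in records)
    return {"level1": level1, "level2": level2, "level3": level3}


# Hypothetical match flags, for illustration only.
records = [
    {"drug_match": True, "target_match": True},
    {"drug_match": True, "target_match": False},
    {"drug_match": False, "target_match": True},  # excluded from levels 2 and 3
]
counts = cascade_counts(records)
```

The third record shows why the gating matters: a correct target attached to the wrong drug never reaches level 2 or level 3.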

<span style="font-size:20px;">**Steps**</span>

**Get the entities extracted by GPT-4.**

In [115]:
ner_df = pd.read_csv(ner_path,
                     dtype={'pmid':'str'})

In [116]:
ner_df.head()

Unnamed: 0,completion_id,completion_tokens,prompt_tokens,total_tokens,drug_name,tested_or_effective_group,drug_tested_in_diseases,clinical_trials_id,direct_target,drug-direct_target_interaction,pmid
0,chatcmpl-95Fz5ygwhHbh6fS34y9rl6UCQzw7V,61.0,1073.0,1134.0,S62,,['breast cancer'],,PELI1/EGFR,disruptor,36841821
1,chatcmpl-95FzB5d0oHjWiZz8uYJofalUBnFuB,99.0,1131.0,1230.0,regorafenib,,['GBM'],,autophagosome-lysosome fusion,inhibitor,36399643
2,chatcmpl-95FzLUxdFBE1KNul7adDXyaq1Ds0Q,70.0,1086.0,1156.0,fatostatin,,"['glioblastoma', 'GBM']",,AKT/mTORC1/GPX4 signaling pathway,inhibitor,36966152
3,chatcmpl-95FzTMkaW3MDTzy0nvTisQfDIP9bc,78.0,1171.0,1249.0,MAM,"['cisplatin-resistant A549', 'AZD9291-resistan...","['non-small cell lung cancer', 'NSCLC']",,NQO1,activator,37321064
4,chatcmpl-95FzZlGTV1CFR13ye9QStxhproc0a,130.0,1111.0,1241.0,ganciclovir,['HCMV-induced glioblastoma'],['glioblastoma'],,EZH2,inhibitor,37147437


**Run cleaning steps and read specific data types consistently.**

In [117]:
ner_df['tested_or_effective_group'] = ner_df['tested_or_effective_group'].apply(utils.convert_string_to_structure)

In [118]:
ner_df['drug_tested_in_diseases'] = ner_df['drug_tested_in_diseases'].apply(utils.convert_string_to_structure)

In [119]:
ner_df2 = utils.clean_missing_data(ner_df)

**Get manually curated data present in Google Sheets.**

In [120]:
sheets_filename='pubmed_manual_curated'
sheets_sheetname='run_1'
start_cell='A1'
end_cell='G68'

In [121]:
manual_df = utils.read_data_from_sheets(credentials_path,
                                        sheets_filename,
                                        sheets_sheetname,
                                        start_cell,
                                        end_cell)

**Run cleaning steps and read specific data types consistently.**

In [122]:
manual_df2 = utils.clean_missing_data(manual_df)

In [123]:
manual_df2['tested_or_effective_group'] = manual_df2['tested_or_effective_group'].apply(utils.convert_string_to_structure)


In [124]:
manual_df2['drug_tested_in_diseases'] = manual_df2['drug_tested_in_diseases'].apply(utils.convert_string_to_structure)


## 4.1 Evaluation metrics for drug names extracted by GPT-4 [Level 1]<a id="level1"></a>

<span style="font-size:20px;">**Method**</span>

* To compare drug name strings extracted by GPT-4 with the manually curated strings, punctuation characters (!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~) were removed from the strings and a score was then obtained using token_set_ratio from the fuzzywuzzy package.
* The assessment did not rely on exact string matching but on GPT-4 capturing the bulk of the drug name.
* A score cutoff of 100 was used with token_set_ratio. For example, the strings 'sotorasib (AMG510)' and 'sotorasib' are considered a match: even though the model failed to capture the abbreviated or alternative drug name AMG510, it still correctly identified the drug from the abstract.
* A potential downside of this approach is that 'erlotinib' and 'erlotinib phosphate' (an analog of the same drug) are also considered a match.
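The matching rule above can be approximated with the standard library alone. This sketch is a stand-in, not fuzzywuzzy itself. It assumes that a token_set_ratio of 100 corresponds to one name's token set fully containing the other's, which is the case for the examples in this section. `tokens` and `is_drug_match` are hypothetical helper names.

```python
import string

# Same character set as listed above: string.punctuation.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def tokens(name):
    """Lowercase a name, strip punctuation, and split it into a set of tokens."""
    return set(name.lower().translate(_PUNCT_TABLE).split())


def is_drug_match(extracted, curated):
    """True when the token sets overlap and one fully contains the other."""
    a, b = tokens(extracted), tokens(curated)
    return bool(a & b) and (a <= b or b <= a)


match_alias = is_drug_match("sotorasib", "sotorasib (AMG510)")
match_analog = is_drug_match("erlotinib", "erlotinib phosphate")
match_partial = is_drug_match("AE-NPs", "aloe-emodin (AE)")
```

Here `match_alias` and `match_analog` are both true, which reproduces the intended behavior and the downside noted above. `match_partial` is false because no whole token is shared.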

<span style="font-size:20px;">**Steps**</span>

**Compare GPT-4 extracted drug names with manually curated data: about 45% of the false negatives (10 of 22) come from a single paper, PMID 36942434.**

In [127]:
metrics, pmids_errors, matches = ner_evaluation.get_drug_metrics(ner_df2,
                                                        manual_df2,
                                                        'drug_name',
                                                        score_cutoff=100)

PMID 37451272 has a false positive.
PMID 37066699 has a false positive.
PMID 37066699: 1 false negatives.
PMID 36942434: 10 false negatives.
PMID 36966152: 1 false negatives.
PMID 37036151: 1 false negatives.
PMID 37816716 has a false positive.
PMID 35849035: 1 false negatives.
PMID 37507782: 1 false negatives.
PMID 36399643: 2 false negatives.
PMID 36521103: 3 false negatives.
PMID 37264081: 2 false negatives.


**Metrics comparing the ground truth set and the model extracted set.**

In [128]:
metrics

{'tp': 29, 'fp': 3, 'fn': 22, 'tn': 6}

<font color='blue'>**Accuracy is low (58%), but the model extracts drug names with 91% precision when drug names are fuzzy matched.**</font>

In [129]:
eval_metrics = ner_evaluation.calculate_metrics(metrics)

In [130]:
eval_metrics

{'accuracy': 0.58,
 'precision': 0.91,
 'recall': 0.57,
 'f1_score': 0.7,
 'neg_pred_value': 0.21}
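The values above are consistent with the standard definitions applied to the confusion counts. The sketch below assumes that `ner_evaluation.calculate_metrics` uses these formulas and rounds to 2 decimals. Its internals are not shown in this notebook, so `calculate_metrics_sketch` is a hypothetical re-implementation.

```python
def calculate_metrics_sketch(counts):
    """Standard classification metrics from tp/fp/fn/tn counts, rounded to 2 decimals."""
    tp, fp, fn, tn = counts["tp"], counts["fp"], counts["fn"], counts["tn"]
    return {
        "accuracy": round((tp + tn) / (tp + fp + fn + tn), 2),
        "precision": round(tp / (tp + fp), 2),
        "recall": round(tp / (tp + fn), 2),
        "f1_score": round(2 * tp / (2 * tp + fp + fn), 2),
        "neg_pred_value": round(tn / (tn + fn), 2),
    }


level1 = calculate_metrics_sketch({"tp": 29, "fp": 3, "fn": 22, "tn": 6})
```

For example, recall is 29 / (29 + 22) = 0.57, which is the figure that drops because of the 22 missed drug names.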

**Deep dive into the PMIDs where the model made errors.**

In [131]:
pmids_errors

{'fp': ['37451272', '37066699', '37816716'],
 'fn': ['37066699',
  '36942434',
  '36966152',
  '37036151',
  '35849035',
  '37507782',
  '36399643',
  '36521103',
  '37264081']}

**FALSE POSITIVES:**
1. GPT extracted non-specific phrases as drug names:
    * 'PKC inhibitors' would be a class of drugs.
    * 'F3-targeting agent' isn't a drug name.
    
2. The extracted drug name was incomplete because the text was extremely complex.
    * GPT extracted 'AE-NPs', whereas the manually curated string was 'transferrin and polyethylene glycol-poly (lactic-co-glycolic acid) aloe-emodin nanocarrier (Tf-PEG-PLGA modified AE@ZIF-8 NPs)'. In this case even manual curation was difficult, as it required piecing together context from several sentences in the text.

Model results showing PMIDs where false positives were present.

In [132]:
ner_df2[ner_df2['pmid'].isin(['37066699', '37816716', '37451272'])][['pmid','drug_name']]

Unnamed: 0,pmid,drug_name
11,37451272,F3-targeting agent
12,37066699,Aloe-emodin
13,37066699,AE@ZIF-8 NPs
14,37066699,AE-NPs
33,37816716,PKC inhibitors


Ground truth data showing these same PMIDs.

In [157]:
manual_df2[manual_df2['pmid'].isin(['37066699', '37816716', '37451272'])][['pmid','drug_name']]

Unnamed: 0,pmid,drug_name
16,37451272,
17,37066699,aloe-emodin (AE)
18,37066699,aloe-emodin nanocarrier (AE@ZIF-8 NPs)
19,37066699,transferrin and polyethylene glycol-poly (lact...
60,37816716,


**FALSE NEGATIVES: The following shows drugs that were correctly extracted along with those that were missed from each paper.**

* PMID 36942434 is an outlier: its abstract lists 11 drugs. GPT correctly extracted the primary drug that the paper was about but missed the other 10 names. Metrics are recalculated below with this paper excluded.
* Surprisingly, in many cases GPT-4 captured only 1 of the drug names in the text and failed to capture the others (36942434 and 37264081 are examples).
* Drug names were also missed when they weren't explicitly stated in the text but could be pieced together by a researcher from the context of neighboring sentences,
    e.g. "H-ferritin (HFn), regorafenib, and Cu2+ nanoplatform (HFn-Cu-REGO NPs)" from PMID 36399643.

In [134]:
for pmid in set(pmids_errors['fn']):
    print(f'PMID {pmid}')
    print(f"GPT result: {ner_df2[ner_df2['pmid']==pmid]['drug_name'].unique()}")
    print(f"Ground truth result:{manual_df2[manual_df2['pmid']==pmid]['drug_name'].unique()}")
    

PMID 35849035
GPT result: ['depatuxizumab mafodotin']
Ground truth result:['depatuxizumab mafodotin (depatux-m)' 'temozolomide']
PMID 37507782
GPT result: ['sertaconazole']
Ground truth result:['Sertaconazole (STZ)'
 'sertaconazole (STZ) repurposed nanoplatform (HTS NP)']
PMID 37066699
GPT result: ['Aloe-emodin' 'AE@ZIF-8 NPs' 'AE-NPs']
Ground truth result:['aloe-emodin (AE)' 'aloe-emodin nanocarrier (AE@ZIF-8 NPs)'
 'transferrin and polyethylene glycol-poly (lactic-co-glycolic acid) aloe-emodin nanocarrier (Tf-PEG-PLGA modified AE@ZIF-8 NPs)']
PMID 36942434
GPT result: ['osimertinib']
Ground truth result:['osimertinib' 'afatinib' 'gefitinib' 'carboplatin' 'pemetrexed'
 'rosuvastatin' 'dabigatran' 'bisoprolol' 'digoxin' 'ramipril'
 'spironolactone']
PMID 36399643
GPT result: ['regorafenib']
Ground truth result:['regorafenib'
 'H-ferritin (HFn), regorafenib, and Cu2+ nanoplatform (HFn-Cu-REGO NPs)'
 'temozolomide']
PMID 36966152
GPT result: ['fatostatin']
Ground truth result:['fatostati

**Since the majority of false negatives come from PMID 36942434, the metrics below are recalculated with this paper excluded from evaluation. Accuracy improves from 58% to 69% and recall from 57% to 70%, but both are still low.**

In [135]:
ner_df3 = ner_df2[ner_df2['pmid']!='36942434'].copy()
manual_df3 = manual_df2[manual_df2['pmid']!='36942434'].copy()

In [136]:
re_metrics, re_pmids_errors, re_matches = ner_evaluation.get_drug_metrics(ner_df3,
                                                                         manual_df3,
                                                                         'drug_name',
                                                                         score_cutoff=100)

PMID 37451272 has a false positive.
PMID 37066699 has a false positive.
PMID 37066699: 1 false negatives.
PMID 36966152: 1 false negatives.
PMID 37036151: 1 false negatives.
PMID 37816716 has a false positive.
PMID 35849035: 1 false negatives.
PMID 37507782: 1 false negatives.
PMID 36399643: 2 false negatives.
PMID 36521103: 3 false negatives.
PMID 37264081: 2 false negatives.


In [137]:
re_metrics

{'tp': 28, 'fp': 3, 'fn': 12, 'tn': 6}

In [138]:
re_eval_metrics = ner_evaluation.calculate_metrics(re_metrics)

In [139]:
re_eval_metrics

{'accuracy': 0.69,
 'precision': 0.9,
 'recall': 0.7,
 'f1_score': 0.79,
 'neg_pred_value': 0.33}
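
The values above follow directly from the four confusion-matrix counts. As a minimal sketch, the standard definitions reproduce the same numbers when rounded to two decimals. The function name `metrics_from_counts` is hypothetical; this is not the `ner_evaluation.calculate_metrics` implementation, which is not shown in this notebook.

```python
def metrics_from_counts(counts):
    """Standard classification metrics from a dict with keys tp, fp, fn, tn."""
    tp, fp, fn, tn = (counts[k] for k in ("tp", "fp", "fn", "tn"))

    def safe_div(num, den):
        # Return None instead of raising when a denominator is zero.
        return round(num / den, 2) if den else None

    return {
        "accuracy": safe_div(tp + tn, tp + fp + fn + tn),
        "precision": safe_div(tp, tp + fp),
        "recall": safe_div(tp, tp + fn),
        "f1_score": safe_div(2 * tp, 2 * tp + fp + fn),
        "neg_pred_value": safe_div(tn, tn + fn),
    }


print(metrics_from_counts({"tp": 28, "fp": 3, "fn": 12, "tn": 6}))
```

Negative predictive value here is tn / (tn + fn), so the 12 false negatives pull it down to 6 / 18 = 0.33.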

## 4.2 Evaluation metrics for target names extracted by GPT-4 [Level 2]<a id="level2"></a>

<span style="font-size:20px;">**Method**</span>

* As shown in the schematic above, only drugs that were correctly identified by GPT-4 were considered for evaluating targets (35 entities). If the drug itself is not correctly identified, even a correctly captured target cannot be reliably linked to the right drug, and such a data point would lead to inaccurate computational analyses.
* To compare target name strings extracted by GPT-4 with the manually curated strings, punctuation characters were removed from the strings (!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~) and a score was then obtained using token_sort_ratio from the fuzzywuzzy package.
* A stricter matching criterion was used for targets because a slight difference in target names could mean different genes or mutation variants.

* As an example, when 'KRAS' and 'KRAS G12C' (a specific mutation of the KRAS gene) are compared, token_sort_ratio gives a score of 62, so the pair is not considered a match at a score cutoff of 90. The scorer used for drug names would have given a score of 100 here, which would be an incorrect match.

* Only for drugs that matched:
    * TP - target correctly ID'd
    * TN - targets don't exist and were not extracted
    * FP - targets don't exist but were incorrectly extracted
    * FN - targets exist but were missed
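
The KRAS example above can be checked with a small, stdlib-only approximation of the scorer. This is a sketch under assumptions: `clean` and `token_sort_score` are hypothetical helpers that mimic the idea behind fuzzywuzzy's `token_sort_ratio` using `difflib`. They are not the library's implementation and may differ from it on other inputs.

```python
import string
from difflib import SequenceMatcher

# Translation table that deletes the punctuation characters listed above.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def clean(text):
    """Lower-case, strip punctuation, and collapse whitespace."""
    return " ".join(text.lower().translate(_PUNCT_TABLE).split())


def token_sort_score(a, b):
    """Sort the tokens of each string, then score the joined strings (0-100)."""
    a_sorted = " ".join(sorted(clean(a).split()))
    b_sorted = " ".join(sorted(clean(b).split()))
    return round(100 * SequenceMatcher(None, a_sorted, b_sorted).ratio())


score = token_sort_score("KRAS", "KRAS G12C")
print(score, score >= 90)  # 62 False: rejected at a cutoff of 90
```

Because the extra token 'G12C' lengthens one string, the similarity drops well below the cutoff, which is the behavior wanted when a gene and a specific mutation must not be merged.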

<span style="font-size:20px;">**Steps**</span>

**For each correctly identified drug, compare GPT-4 extracted target names with manually curated data.**
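
One way to picture this comparison is sketched here. It assumes that the targets of a matched drug are compared as sets and uses exact, case-insensitive matching for brevity, whereas the notebook uses fuzzy matching with a cutoff of 90. The function name `classify_targets` and the counting rule are illustrative assumptions, not the logic of `ner_evaluation.get_target_metrics`.

```python
def classify_targets(model_targets, truth_targets):
    """Count TP/FP/FN/TN for one correctly identified drug.

    Both arguments are iterables of target names; None or empty strings
    mean that no target was recorded.
    """
    model = {t.strip().lower() for t in model_targets if t and t.strip()}
    truth = {t.strip().lower() for t in truth_targets if t and t.strip()}

    if not model and not truth:
        # No target exists and none was extracted.
        return {"tp": 0, "fp": 0, "fn": 0, "tn": 1}

    return {
        "tp": len(model & truth),  # extracted and present in ground truth
        "fp": len(model - truth),  # extracted but not in ground truth
        "fn": len(truth - model),  # in ground truth but not extracted
        "tn": 0,
    }


# depatuxizumab mafodotin (PMID 35849035): EGFRvIII was missed.
print(classify_targets(["EGFR"], ["EGFR", "EGFRvIII"]))
# regorafenib (PMID 36399643): a process was extracted where no target exists.
print(classify_targets(["autophagosome-lysosome fusion"], [None]))
```

The two calls use drug-target rows that appear in the model and ground-truth tables later in this section.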

In [140]:
target_metrics, targets_pmids_errors, target_matches = ner_evaluation.get_target_metrics(ner_df2,
                                                                                         manual_df2,
                                                                                         matches=matches,
                                                                                         score_cutoff=90)

In [141]:
target_metrics

{'fp': 6, 'fn': 12, 'tp': 18, 'tn': 4}

In [142]:
targets_pmids_errors

{'fp': ['36966152',
  '36272139',
  '36272139',
  '36650267',
  '37147437',
  '36399643'],
 'fn': ['37141396',
  '36966152',
  '36272139',
  '37147285',
  '35849035',
  '36650267',
  '36650267',
  '36970205',
  '36240971',
  '37264081']}
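
Assuming each PMID is listed once per error, repeated entries mark abstracts that contribute more than one error of the same kind. A quick tally with `collections.Counter` makes that visible; the dictionary below is copied from the output above.

```python
from collections import Counter

targets_pmids_errors = {
    "fp": ["36966152", "36272139", "36272139", "36650267", "37147437", "36399643"],
    "fn": ["37141396", "36966152", "36272139", "37147285", "35849035",
           "36650267", "36650267", "36970205", "36240971", "37264081"],
}

fp_counts = Counter(targets_pmids_errors["fp"])
fn_counts = Counter(targets_pmids_errors["fn"])

# PMIDs with more than one error of the same kind.
print({p: n for p, n in fp_counts.items() if n > 1})
print({p: n for p, n in fn_counts.items() if n > 1})

# PMIDs that show up in both the false-positive and false-negative lists.
both = sorted(set(fp_counts) & set(fn_counts))
print(both)
```

PMIDs that appear in both lists are good candidates for the manual review that follows.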

In [143]:
target_eval_metrics = ner_evaluation.calculate_metrics(target_metrics)

<font color='blue'>**For drugs that were correctly captured from the abstracts, GPT-4 performed poorly at identifying the associated targets or the lack thereof.**</font>

In [144]:
target_eval_metrics

{'accuracy': 0.55,
 'precision': 0.75,
 'recall': 0.6,
 'f1_score': 0.67,
 'neg_pred_value': 0.25}

**FALSE POSITIVES:**
 
1. GPT-4 extracted entities as targets even when they were not actually targets of the drug.
   e.g. PMID 36272139, where it listed MGMT as a target for the drug temozolomide.
   e.g. PMID 37147437, where it listed EZH2 as the target for temozolomide in addition to ganciclovir, the second drug in the abstract, with which it is correctly paired.

2. It extracted pathway names or processes as target names.
   e.g. PMID 36272139 - DNA damage repair pathway

Model results showing PMIDs where false positives were present.

In [146]:
ner_df2[ner_df2['pmid'].isin(['36966152',
  '36272139',
  '36272139',
  '36650267',
  '37147437',
  '36399643'])][['pmid','drug_name','direct_target']]

Unnamed: 0,pmid,drug_name,direct_target
1,36399643,regorafenib,autophagosome-lysosome fusion
2,36966152,fatostatin,AKT/mTORC1/GPX4 signaling pathway
4,37147437,ganciclovir,EZH2
5,37147437,temozolomide,EZH2
16,36272139,EPIC-0412,MGMT
17,36272139,EPIC-0412,DNA damage repair pathway
18,36272139,Temozolomide,MGMT
25,36650267,KU60019,ATM
26,36650267,VP-16,TOP2β


Ground truth data showing PMIDs where false positives were present.

In [147]:
manual_df2[manual_df2['pmid'].isin(['36966152',
  '36272139',
  '36272139',
  '36650267',
  '37147437',
  '36399643'])][['pmid','drug_name','direct_target']]

Unnamed: 0,pmid,drug_name,direct_target
1,36399643,regorafenib,
2,36399643,"H-ferritin (HFn), regorafenib, and Cu2+ nanopl...",
3,36399643,temozolomide,
4,36966152,fatostatin,SREBP
5,36966152,p28-functionalized PLGA nanoparticle with fato...,SREBP
7,37147437,ganciclovir,EZH2
8,37147437,temozolomide,
23,36272139,EPIC-0412,HOTAIR/EZH2
24,36272139,EPIC-0412,MGMT
25,36272139,EPIC-0412,ATF3-p-p65-HADC1 axis


**FALSE NEGATIVES:**

1. In some cases, the drug was described as targeting an association or interaction between two genes. Manual curation recorded these associations as targets by listing both genes, e.g. HOTAIR/EZH2, but GPT-4 missed this complexity. In GPT-4's defense, the prompt gave no instructions for handling such a scenario. Such cases warrant fine-tuning a model on specific data and outputs rather than trying to write prompts for every possible scenario.

2. GPT-4 also missed targets that were explicitly mentioned in the text.
   e.g. PMID 36240971 - glutaminase is a target of CB-839; PMID 36970205 - MEK is a target of trametinib.

3. Another complexity that was missed, and that warrants fine-tuning a model, arises when a gene and its variant are both mentioned for a drug in the same abstract.
   e.g. PMID 35849035 - depatuxizumab mafodotin (depatux-m) targets EGFR and its variant EGFRvIII, but GPT-4 failed to capture the latter as a target.

Model results showing PMIDs with false negatives

In [148]:
ner_df2[ner_df2['pmid'].isin(['37141396',
  '36966152',
  '36272139',
  '37147285',
  '35849035',
  '36650267',
  '36650267',
  '36970205',
  '36240971',
  '37264081'])][['pmid','drug_name','direct_target']]

Unnamed: 0,pmid,drug_name,direct_target
2,36966152,fatostatin,AKT/mTORC1/GPX4 signaling pathway
8,36240971,TAK-228,TORC1/2
9,36240971,CB-839,
15,35849035,depatuxizumab mafodotin,EGFR
16,36272139,EPIC-0412,MGMT
17,36272139,EPIC-0412,DNA damage repair pathway
18,36272139,Temozolomide,MGMT
19,36970205,trametinib,
20,36970205,IACS-010759,mitochondrial complex I
25,36650267,KU60019,ATM


Ground truth data from PMIDs where false negatives were present:

In [149]:
manual_df2[manual_df2['pmid'].isin(['37141396',
  '36966152',
  '36272139',
  '37147285',
  '35849035',
  '36650267',
  '36650267',
  '36970205',
  '36240971',
  '37264081'])][['pmid','drug_name','direct_target']]

Unnamed: 0,pmid,drug_name,direct_target
4,36966152,fatostatin,SREBP
5,36966152,p28-functionalized PLGA nanoparticle with fato...,SREBP
12,36240971,TAK-228,TORC1/2
13,36240971,CB-839,glutaminase
20,35849035,depatuxizumab mafodotin (depatux-m),EGFR
21,35849035,depatuxizumab mafodotin (depatux-m),EGFRvIII
22,35849035,temozolomide,
23,36272139,EPIC-0412,HOTAIR/EZH2
24,36272139,EPIC-0412,MGMT
25,36272139,EPIC-0412,ATF3-p-p65-HADC1 axis


## 4.3 Evaluation metrics for drug-target interactions extracted by GPT-4 [Level 3]<a id="level3"></a>

<span style="font-size:20px;">**Method**</span>

* As shown in the schematic above, only drug-target pairs that were correctly identified by GPT-4 were considered for evaluating interactions. Drugs, targets, and their relationships are all interconnected entities. If the drug is correctly identified but the target is not, even a correctly captured interaction would lead to inaccurate computational analyses.
* To compare interaction strings extracted by GPT-4 with the manually curated strings, punctuation characters were removed from the strings (!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~) and a score was then obtained using token_set_ratio from the fuzzywuzzy package.
* Here, a relaxed criterion was used for comparing the strings because it was more important to capture the primary context than the exact string.

* As an example, if the manually curated string was 'non-covalent inhibitor' and the GPT-4 output was 'inhibitor', the pair was assessed as correct. Had token_sort_ratio been used here (as for target evaluation), it would have given a poor matching score.

* Only for drug-target pairs that matched:
    * TP - interaction correctly ID'd
    * TN - interaction doesn't exist and was not extracted
    * FP - interaction doesn't exist but was incorrectly extracted
    * FN - interaction exists but was missed by GPT-4
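
The 'non-covalent inhibitor' example can be illustrated with a stdlib-only approximation of a token-set comparison. This is a sketch under assumptions: `clean` and `token_set_score` are hypothetical helpers that follow the idea behind fuzzywuzzy's `token_set_ratio` (score the shared tokens against each full token set and keep the best score) using `difflib`. They are not the library's code and may not match it on every input.

```python
import string
from difflib import SequenceMatcher

# Translation table that deletes the punctuation characters listed above.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def clean(text):
    """Lower-case, strip punctuation, and collapse whitespace."""
    return " ".join(text.lower().translate(_PUNCT_TABLE).split())


def _score(a, b):
    """Similarity of two strings on a 0-100 scale."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def token_set_score(a, b):
    """Best score among shared tokens vs. each full set, and full vs. full."""
    tokens_a, tokens_b = set(clean(a).split()), set(clean(b).split())
    shared = " ".join(sorted(tokens_a & tokens_b))
    full_a = (shared + " " + " ".join(sorted(tokens_a - tokens_b))).strip()
    full_b = (shared + " " + " ".join(sorted(tokens_b - tokens_a))).strip()
    return max(_score(shared, full_a), _score(shared, full_b), _score(full_a, full_b))


print(token_set_score("non-covalent inhibitor", "inhibitor"))  # 100
# With no shared tokens, the score stays below the cutoff of 80.
print(token_set_score("inhibitor", "epigenetic silencer") >= 80)  # False
```

When one string's tokens are a subset of the other's, the shared part matches itself exactly, which is why the qualifier 'non-covalent' does not lower the score.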

<span style="font-size:20px;">**Steps**</span>

**For each correctly identified drug-target pair, compare GPT-4 extracted interaction string with manually curated string.**

In [150]:
int_metrics, int_pmids_errors, int_matches = ner_evaluation.get_interaction_metrics(ner_df2,
                                                                                    manual_df2,
                                                                                    target_matches,
                                                                                    score_cutoff=80)

In [151]:
int_metrics

{'fp': 2, 'fn': 2, 'tp': 16, 'tn': 4}

In [152]:
int_eval_metrics = ner_evaluation.calculate_metrics(int_metrics)

In [154]:
int_pmids_errors

{'fp': ['36272139', '36521103'], 'fn': ['36272139', '36521103']}

<font color='blue'>**For drug-target pairs that were correctly captured from the abstracts by GPT-4, it showed good performance in correctly identifying their interactions.**</font>

In [153]:
int_eval_metrics

{'accuracy': 0.83,
 'precision': 0.89,
 'recall': 0.89,
 'f1_score': 0.89,
 'neg_pred_value': 0.67}

In [155]:
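# For each PMID with a false-positive interaction, print the interaction extracted
# by GPT-4 next to the manually curated one for every matched drug-target pair.
for pmid in int_pmids_errors['fp']:
    for pair in target_matches:
        if pair['pmid'] != pmid:
            continue

        # Interaction string(s) from the model output for this pair.
        model_rows = ner_df2[(ner_df2['pmid'] == pmid) &
                             (ner_df2['drug_name'] == pair['model_drug_name']) &
                             (ner_df2['direct_target'] == pair['model_direct_target'])]
        model = model_rows['drug-direct_target_interaction'].unique()

        # Interaction string(s) from the ground truth for the same pair.
        gt_rows = manual_df2[(manual_df2['pmid'] == pmid) &
                             (manual_df2['drug_name'] == pair['gtruth_drug_name']) &
                             (manual_df2['direct_target'] == pair['gtruth_direct_target'])]
        gt = gt_rows['drug-direct_target_interaction'].unique()

        print(pmid)
        print(f'Interaction extracted by model ={model}')
        print(f'Ground truth ={gt}')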

36272139
Interaction extracted by model =['inhibitor']
Ground truth =['epigenetic silencer']
36521103
Interaction extracted by model =['activator']
Ground truth =['peptide vaccine conjugate against target']


In [156]:
ner_df2[ner_df2['pmid']=='36272139']

Unnamed: 0,completion_id,completion_tokens,prompt_tokens,total_tokens,drug_name,tested_or_effective_group,drug_tested_in_diseases,clinical_trials_id,direct_target,drug-direct_target_interaction,pmid
16,chatcmpl-95G0Wc7vcVsPnVWNFT1pvp6WdmDPY,142.0,1227.0,1369.0,EPIC-0412,[GBM],[Glioblastoma multiforme],,MGMT,inhibitor,36272139
17,chatcmpl-95G0Wc7vcVsPnVWNFT1pvp6WdmDPY,142.0,1227.0,1369.0,EPIC-0412,[GBM],[Glioblastoma multiforme],,DNA damage repair pathway,inhibitor,36272139
18,chatcmpl-95G0Wc7vcVsPnVWNFT1pvp6WdmDPY,142.0,1227.0,1369.0,Temozolomide,[GBM],[Glioblastoma multiforme],,MGMT,affected by resistance,36272139


# Conclusions and next steps<a id="conclusion"></a>

1. This proof-of-concept study shows generation of an updated drug-centric dataset from PubMed abstracts using GPT-4 and confirms the feasibility of the approach.
2. GPT-4 extracted drug names, their targets, and the drug-target interactions with the precision and recall shown in the schematic.
3. Even with exhaustive instructions included in the prompt, GPT-4 does not perform well at all levels of evaluation (shown in the schematic below).
4. The edge cases and deep dives into GPT-4 errors in Section 4 provide a rationale for fine-tuning a model on a training dataset.
5. The prompt used in this project was selected after testing various prompts. Because biomedical context can be complex, further prompt engineering is not expected to improve performance significantly, which motivates applying a model fine-tuned on this specific dataset. This next step is shown in the notebook pubmed_finetuning [LINK].

# References & Accessory links<a id="refs"></a>

*Additional, detailed references are present in the notebooks linked here.*

**REFERENCES**
* NCBI PubMed https://pubmed.ncbi.nlm.nih.gov/
* Setting up service account https://medium.com/@jb.ranchana/write-and-append-dataframes-to-google-sheets-in-python-f62479460cf0
* GPT API https://platform.openai.com/docs/api-reference

**ACCESSORY LINKS**
* Notebook where the PubMed API workflow was determined to arrive at the development of the pubmed module - 
* Notebook showing GPT API testing, prompt engineering to arrive at development of the openai_api module - 

# Scratchpad

In [9]:
df = pd.DataFrame([{'pmid':34245,'col1':'ggtt'},
                   {'pmid':35355,'col1':None},
                   {'pmid':35355,'col1':'dfhg'},
                   {'pmid':23986,'col1':'ksjf'}])

In [10]:
df

Unnamed: 0,pmid,col1
0,34245,ggtt
1,35355,
2,35355,dfhg
3,23986,ksjf


In [11]:
df.groupby('pmid')['col1'].nunique()[lambda x: x==0].index

Index([], dtype='int64', name='pmid')

In [12]:
df.groupby('pmid')['col1'].nunique()

pmid
23986    1
34245    1
35355    1
Name: col1, dtype: int64