# Overview

# Table of contents

* [Imports](#imports)
* [Paths](#paths)
* [1. Get literature from PubMed](#pubmed)

# Imports<a id="imports"></a>

In [161]:
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [97]:
import json
import pandas as pd
from pathlib import Path
import sys
import tiktoken

**Import the PubMed module.**

In [162]:
cwd = Path.cwd()

In [163]:
module_dir = cwd.parent / 'scripts'

In [164]:
sys.path.append(str(module_dir))

In [6]:
import pubmed

**Import the openai_api module.**

In [167]:
import openai_api

# Paths<a id="paths"></a>

## PubMed output path

In [8]:

output_dir = cwd.parent / 'outputs'

**Path to the CSV containing literature data for 87 papers from PubMed.**

In [9]:
pubmed_output = output_dir / 'pubmed_data.csv'

## Prompt language - system and user instructions

In [98]:
input_dir = cwd.parent / 'inputs'

In [99]:
system_path = input_dir / 'system_prompt.json'
user_path = input_dir / 'user_prompt.json'

In [179]:
openai_key_path = '/Users/mukti/myproject_creds/.env.openai'
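How `openai_api` reads this key file is not shown in the notebook. A minimal sketch, assuming the file follows the usual dotenv layout of `KEY=value` lines and that the variable is called `OPENAI_API_KEY` (both are assumptions); `read_env_file` is a hypothetical helper, not part of the module:

```python
from pathlib import Path


def read_env_file(path):
    """Parse simple KEY=value lines, skipping blanks and # comments."""
    values = {}
    for raw_line in Path(path).read_text().splitlines():
        line = raw_line.strip()
        if not line or line.startswith('#') or '=' not in line:
            continue
        key, _, value = line.partition('=')
        # Drop optional surrounding quotes around the value
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values
```

Something like `read_env_file(openai_key_path).get('OPENAI_API_KEY')` would then return the key without it ever appearing in the notebook itself.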

# 1. Get literature from PubMed<a id="pubmed"></a>

**Scope**

We'll focus on getting ~30 articles each for breast cancer, lung cancer, and glioblastoma from 2023 to the present, excluding articles of publication type Review, Scientific Integrity Review, or Systematic Review.

**Establish base criteria for querying.**

In [7]:
date_criteria = '(2023/01/01:3000[pdat])'
drug_criteria = '(drug[tiab]+OR+inhibitor[tiab]+OR+compound[tiab]+OR+small+molecule[tiab]+OR+clinical+trial[tiab]+OR+therapy[tiab]+OR+agent[tiab])'
pub_criteria = '(Review[pt]+OR+Scientific+Integrity+Review[pt]+OR+Systematic+Review[pt])'
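These criteria are later joined with a disease term using `+AND+`, with the publication-type filter attached through `+NOT+`. A small sketch of that composition as a reusable function; `build_query` and the `example_*` names are hypothetical, and the criteria strings are shortened for illustration:

```python
def build_query(include, exclude=()):
    """Join criteria with +AND+, then append each exclusion with +NOT+."""
    query = '+AND+'.join(include)
    for criterion in exclude:
        query += '+NOT+' + criterion
    return query


# Shortened stand-ins for the criteria defined in the cell above
example_date = '(2023/01/01:3000[pdat])'
example_disease = '(glioblastoma[tiab])'
example_drug = '(drug[tiab]+OR+inhibitor[tiab])'
example_pub = '(Review[pt]+OR+Systematic+Review[pt])'

example_query = build_query([example_date, example_disease, example_drug],
                            exclude=[example_pub])
print(example_query)
```

Keeping the included and excluded criteria in separate lists makes it harder to misplace an operator when more filters are added.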


**Run the PubMed pipeline for getting papers for 3 cancer types.**

In [8]:
# Collect all dfs to merge later
all_dfs = []

for cancer_type in ['breast+cancer','lung+cancer','glioblastoma']:
    disease_criteria = f'({cancer_type}[tiab])'
    
    query = date_criteria + '+AND+' + disease_criteria + '+AND+' + drug_criteria + '+NOT+' + pub_criteria
    
    df = pubmed.run_pubmed_pipeline(query=query,
                                    save_on_server='y',
                                    search_format='json',
                                    search_starting_index=0,
                                    search_max_records=9999,
                                    sorting_criteria='relevance',
                                    content_type='abstract',
                                    fetch_starting_index=0,
                                    fetch_max_records=30)
    
    # Add an identifier column for cancer type for easy searching
    df['disease'] = cancer_type
    print(f'Num rows in df: {len(df)}')
    all_dfs.append(df)
    
    print(f'Pipeline complete for {cancer_type}')

# Combine all dfs
final_df = pd.concat(all_dfs)

----Running pipeline for the following query:----
(2023/01/01:3000[pdat])+AND+(breast+cancer[tiab])+AND+(drug[tiab]+OR+inhibitor[tiab]+OR+compound[tiab]+OR+small+molecule[tiab]+OR+clinical+trial[tiab]+OR+therapy[tiab]+OR+agent[tiab])+NOT+(Review[pt]+OR+Scientific+Integrity+Review[pt]+OR+Systematic+Review[pt])
Using PubMed esearch API to get PMIDs matching the search query.
	The actual total number of records matching the search for is 9403
	The number of ids present in the esearch json is 9403
	Function get_pmids complete.
Collecting metadata about the search results into a dictionary.
	Metadata obtained and saved in a dictionary.
Using PubMed efetch API to get abstract and other details for relevant PMIDs into an XML string.
	The number of matching PMIDs based on the server: 30
	Function get_abstracts complete.
Extracting data from XML string and organizing it into a dataframe.
	Performing basic cleanup.
	&#xa0 left: 0
	&#x3ba left: 0
	&# left: 0
Iterating through each article and col
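The log above refers to the esearch and efetch endpoints. The internals of `pubmed.run_pubmed_pipeline` are not shown here, so the following is only a sketch of how such requests could be formed with the public NCBI E-utilities parameters. The mapping from the pipeline's arguments (`save_on_server`, `sorting_criteria`, `content_type`) to `usehistory`, `sort`, and `rettype` is an assumption, and both function names are hypothetical. No request is sent.

```python
from urllib.parse import urlencode

EUTILS_BASE = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils'


def esearch_url(query, retstart=0, retmax=9999):
    """Build an esearch URL that stores the result set on the history server."""
    params = {
        'db': 'pubmed',
        'term': query.replace('+', ' '),  # urlencode turns spaces back into '+'
        'retmode': 'json',
        'retstart': retstart,
        'retmax': retmax,
        'sort': 'relevance',
        'usehistory': 'y',
    }
    return f'{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}'


def efetch_url(web_env, query_key, retstart=0, retmax=30):
    """Build an efetch URL that pulls abstracts for a stored result set."""
    params = {
        'db': 'pubmed',
        'WebEnv': web_env,      # returned by the esearch response
        'query_key': query_key,  # returned by the esearch response
        'rettype': 'abstract',
        'retmode': 'xml',
        'retstart': retstart,
        'retmax': retmax,
    }
    return f'{EUTILS_BASE}/efetch.fcgi?{urlencode(params)}'
```

Under this reading, esearch can match thousands of records (9403 for breast cancer) while efetch retrieves only the first 30 in relevance order.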

In [9]:
len(final_df)

87

**Final dataframe ready with content extracted from PubMed.**

In [15]:
final_df

Unnamed: 0,pmid,publication_date,publication_type,article_title,abstract,keywords,journal,num_abstracts_retrieved,num_abstracts_requested,query_string,num_total_matches,all_matching_pmids,acquisition_date,disease
0,37256976,2023 Jun 01,"Clinical Trial, Phase III|Journal Article|Rand...",Capivasertib in Hormone Receptor-Positive Adva...,[BACKGROUND]AKT pathway activation is implicat...,,The New England journal of medicine,30,30,(2023/01/01:3000/12/31[Date - Publication] AND...,9403,"37256976,37557181,37070653,37147285,37723305,3...",2024-03-06,breast+cancer
1,37557181,2023 Oct 19,"Journal Article|Research Support, Non-U.S. Gov't","Discovery of a highly potent, selective, orall...","KAT6A, and its paralog KAT6B, are histone lysi...",CTx-648|KAT6A|KAT6B|PF-9363|breast cancer|cell...,Cell chemical biology,30,30,(2023/01/01:3000/12/31[Date - Publication] AND...,9403,"37256976,37557181,37070653,37147285,37723305,3...",2024-03-06,breast+cancer
2,37070653,2023 Mar,Clinical Trial Protocol|Journal Article,"Design of SERENA-6, a phase III switching tria...",ESR1 mutation (ESR1m) is a frequent cause of a...,ESR1 mutation|advanced breast cancer|camizestr...,"Future oncology (London, England)",30,30,(2023/01/01:3000/12/31[Date - Publication] AND...,9403,"37256976,37557181,37070653,37147285,37723305,3...",2024-03-06,breast+cancer
3,37147285,2023 May 05,"Journal Article|Research Support, Non-U.S. Gov't",KK-LC-1 as a therapeutic target to eliminate ALDH,Failure to achieve complete elimination of tri...,,Nature communications,30,30,(2023/01/01:3000/12/31[Date - Publication] AND...,9403,"37256976,37557181,37070653,37147285,37723305,3...",2024-03-06,breast+cancer
4,37723305,2023 Oct,Journal Article,Acetate acts as a metabolic immunomodulator by...,Acetate metabolism is an important metabolic p...,,Nature cancerMain References:Methods Only Refe...,30,30,(2023/01/01:3000/12/31[Date - Publication] AND...,9403,"37256976,37557181,37070653,37147285,37723305,3...",2024-03-06,breast+cancer
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
25,37886538,2023 Oct 05,Preprint,LDHA-regulated tumor-macrophage symbiosis prom...,Abundant macrophage infiltration and altered t...,,Research square,30,30,(2023/01/01:3000/12/31[Date - Publication] AND...,1817,"36791206,37451272,36749723,37935665,38215747,3...",2024-03-06,glioblastoma
26,37417769,2023 Oct,Journal Article,Engineering and Characterization of an Artific...,Glioblastoma multiforme (GBM) treatment is hin...,blood-brain barrier|engineered artificial vesi...,"Advanced materials (Deerfield Beach, Fla.)",30,30,(2023/01/01:3000/12/31[Date - Publication] AND...,1817,"36791206,37451272,36749723,37935665,38215747,3...",2024-03-06,glioblastoma
27,37147437,2023 Jun,Journal Article,EZH2-Myc driven glioblastoma elicited by cytom...,Mounting evidence is identifying human cytomeg...,,Oncogene,30,30,(2023/01/01:3000/12/31[Date - Publication] AND...,1817,"36791206,37451272,36749723,37935665,38215747,3...",2024-03-06,glioblastoma
28,37572644,2023 Sep,Journal Article,Combination drug screen targeting glioblastoma...,[BACKGROUND]Pharmacological synergisms are an ...,Cancer vulnerabilities|Drug combination screen...,EBioMedicine,30,30,(2023/01/01:3000/12/31[Date - Publication] AND...,1817,"36791206,37451272,36749723,37935665,38215747,3...",2024-03-06,glioblastoma


**Save the df as a CSV.**

In [17]:
final_df.to_csv(pubmed_output,index=False)

# 2. Prepare prompt text

**The following system text and user instructions form the basis of the prompt text that goes inside the messages list when calling the OpenAI API. They are first saved to JSON files.
The abstract text is then appended to the user text, beneath 'TASKS'.**

In [105]:
system_data = {"role":"system","content": "You are a meticulous data scientist working on named entity recognition and building out a drug-target dataset from biomedical text. Only consider the text given to you."}
system_data

{'role': 'system',
 'content': 'You are a meticulous data scientist working on named entity recognition and building out a drug-target dataset from biomedical text. Only consider the text given to you.'}

In [104]:
user_data = {"role":"user","content":"""Look at the examples delimited by ### and the rules delimited by ***.
*** RULES
For each of the text shown under TASKS, do the following:
1. Identify which drugs have been tested and create a set for each. 
2. Return multiple sets if more than one drug is present in the text.
For each drug:
3. Get direct target: Use the following logic: if the text clearly and directly mentions that the drug targets a gene and has also defined the type of interaction
4. Get interaction type between the drug and direct target: if explicitly and clearly mentioned, then get the drug-direct target relationship type. What does the drug do to the direct target? inhibitor, activator, etc.?
5. Get groups tested: specify which type of genes was the drug tested on, eg. if drug was tested on samples showing high expression of certain gene.
6. Collect all disease names for which the drug has been tested into 1 list.
7. Extract any specific ClinicalTrials.gov identifier or number.
8. For each drug, construct a set consisting of (drug name, direct target, drug-direct target interaction, tested or effective group, ClinicalTrials.gov number, all diseases that the drug is tested in)
9. Any empty values should be indicated by null and not an empty string.
10. Double check your logic and see if the data you have collected is actually reflected in the text. If not, make revisions.
11. The direct target and drug-direct target interaction fields should either both be filled or both be null. It is not valid for only one of these fields to be null.
12. Assemble all sets and produce 1 final JSON output in a single line without any whitespaces.
***
### EXAMPLES
PMID23:LKB1/STK11 is a serine/threonine kinase that plays a major role in controlling cell metabolism, resulting in potential therapeutic vulnerabilities in LKB1-mutant cancers. Here, we identify the NAD + degrading ectoenzyme, CD38, as a new target in LKB1-mutant NSCLC. Metabolic profiling of genetically engineered mouse models (GEMMs) revealed that LKB1 mutant lung cancers have a striking increase in ADP-ribose, a breakdown product of the critical redox co-factor, NAD + . Surprisingly, compared with other genetic subsets, murine and human LKB1-mutant NSCLC show marked overexpression of the NAD+-catabolizing ectoenzyme, CD38 on the surface of tumor cells. Loss of LKB1 or inactivation of Salt-Inducible Kinases (SIKs)-key downstream effectors of LKB1- induces CD38 transcription induction via a CREB binding site in the CD38 promoter. Treatment with the FDA-approved anti-CD38 antibody, daratumumab, inhibited growth of LKB1-mutant NSCLC xenografts. Together, these results reveal CD38 as a promising therapeutic target in patients with LKB1 mutant lung cancer.
Output:{"PMID23": [{"drug name": "daratumumab","target": [{"direct target": "CD38","drug-direct target interaction": "anti-CD38 monoclonal antibody"},],"tested or effective group": ["LKB-1/STK-11 mutant NSCLC"],"drug tested in following diseases": ["lung cancer", "NSCLC"],"ClinicalTrials.gov ID": []}]}
###
TASKS:
Use the RULES and EXAMPLES and create a similar outputs for the following text delimited by ID:
"""}

**Save these in JSON files.**

In [106]:
with open(system_path, "w") as json_file:
    json.dump(system_data, json_file)

In [107]:
with open(user_path, "w") as json_file:
    json.dump(user_data, json_file)
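The saved prompt parts still need an abstract appended beneath 'TASKS' before they can be sent. A minimal sketch of that assembly, assuming one abstract per request and a `PMID<id>:` prefix mirroring the EXAMPLES block; `build_messages` is a hypothetical helper and may differ from what `openai_api` does internally:

```python
import json
from pathlib import Path


def build_messages(system_path, user_path, pmid, abstract):
    """Load the saved prompt parts and append one abstract under TASKS."""
    system_msg = json.loads(Path(system_path).read_text())
    user_msg = json.loads(Path(user_path).read_text())
    # Mirror the 'PMID23:...' layout used in the EXAMPLES block
    user_msg['content'] += f'PMID{pmid}:{abstract}'
    return [system_msg, user_msg]
```

The returned list has the `[{"role": "system", ...}, {"role": "user", ...}]` shape expected by a chat-style messages parameter.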

# 3. Run a pipeline to process randomly selected abstracts (3 in this run)

**Get the df containing PubMed data.**

In [157]:
df = pd.read_csv(pubmed_output)

In [177]:
df1 = openai_api.select_pubmed_data(data_df=df,
                                    cols=['pmid','abstract'],
                                    num_rows=3,
                                    random_state=1)

Function select_pubmed_data complete.


In [178]:
df1

Unnamed: 0,pmid,abstract
39,36240971,[INTRODUCTION]Increased insight into the mutat...
46,36650267,Type II topoisomerases (TOP2) poisons represen...
28,37044095,A genome-wide PiggyBac transposon-mediated scr...


In [185]:
completions = openai_api.process_pubmed_data(system_path=system_path,
                                  user_path=user_path,
                                  data_df1=df1,
                                  openai_key_path=openai_key_path,
                                  seed=1,
                                  temperature=0.0)

	Necessary columns present in the input df.
Function setup_client complete.
Function call_api complete.
Function call_api complete.
Function call_api complete.
Function process_pubmed_data complete. The cost of the run is 0.04.
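The reported cost can be sanity-checked from the `usage_dict` entries in the `completions` output. A rough sketch, not the module's own calculation: `estimate_cost` is a hypothetical helper, and the per-token rates are assumed values passed in as parameters that should be checked against current pricing for the model in use.

```python
def estimate_cost(usage_dicts, input_price_per_1k, output_price_per_1k):
    """Sum token usage across calls and convert it to a dollar estimate."""
    prompt_tokens = sum(u['prompt_tokens'] for u in usage_dicts)
    completion_tokens = sum(u['completion_tokens'] for u in usage_dicts)
    return (prompt_tokens * input_price_per_1k
            + completion_tokens * output_price_per_1k) / 1000


# Usage figures copied from the first two completions in this notebook
usage = [
    {'prompt_tokens': 1129, 'completion_tokens': 105},
    {'prompt_tokens': 1069, 'completion_tokens': 115},
]
# Assumed rates in USD per 1,000 tokens
cost = estimate_cost(usage, input_price_per_1k=0.01, output_price_per_1k=0.03)
print(round(cost, 4))
```

This covers only two of the three calls, so under these assumed rates the figure comes out below the 0.04 reported for the full run.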


In [186]:
completions

{'chatcmpl-91hv6k50eNcXthuoMki8pXdqXk6hw': {'usage_dict': {'completion_tokens': 105,
   'prompt_tokens': 1129,
   'total_tokens': 1234},
  'data_dict': {'PMID36240971': [{'drug name': 'TAK-228',
     'target': [{'direct target': 'TORC1/2',
       'drug-direct target interaction': 'inhibitor'}],
     'tested or effective group': ['NFE2L2-mutated LUSC',
      'KEAP1-mutated LUSC',
      'KRAS/NFE2L2- or KEAP1-mutated NSCLC'],
     'drug tested in following diseases': ['NSCLC', 'LUSC'],
     'ClinicalTrials.gov ID': None}]},
  'model_str': 'gpt-4-0125-preview'},
 'chatcmpl-91hvBPdBSPkQy8SFSc1B9VcRfJqcM': {'usage_dict': {'completion_tokens': 115,
   'prompt_tokens': 1069,
   'total_tokens': 1184},
  'data_dict': {'PMID36650267': [{'drug name': 'KU60019',
     'target': [{'direct target': 'ATM',
       'drug-direct target interaction': 'inhibitor'}],
     'tested or effective group': None,
     'drug tested in following diseases': ['lung cancer'],
     'ClinicalTrials.gov ID': None},
    {'
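Rules 9 and 11 of the prompt are mechanical enough to verify once the responses are parsed. A minimal sketch that walks a `completions`-style dictionary and reports violations; `check_record` and `check_completions` are hypothetical helpers, not part of `openai_api`, and the dictionary layout is assumed to match the output above.

```python
def check_record(record):
    """Return rule violations for one drug record (empty list if none)."""
    problems = []
    targets = record.get('target') or []
    # Rule 9: empty values must be null (None), never an empty string
    for mapping in [record, *targets]:
        for field, value in mapping.items():
            if value == '':
                problems.append(f'{field}: empty string instead of null')
    # Rule 11: direct target and interaction are both set or both null
    for target in targets:
        has_target = target.get('direct target') is not None
        has_interaction = target.get('drug-direct target interaction') is not None
        if has_target != has_interaction:
            problems.append('target: direct target and interaction '
                            'must both be set or both be null')
    return problems


def check_completions(completions):
    """Map each PMID key to its violations, omitting clean entries."""
    report = {}
    for payload in completions.values():
        for pmid_key, records in payload['data_dict'].items():
            problems = [p for record in records for p in check_record(record)]
            if problems:
                report[pmid_key] = problems
    return report
```

An empty report from `check_completions(completions)` means no record broke either rule. It does not confirm that the extracted values are actually supported by the abstract, which still needs manual review.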

In [19]:
text

'[BACKGROUND]AKT pathway activation is implicated in endocrine-therapy resistance. Data on the efficacy and safety of the AKT inhibitor capivasertib, as an addition to fulvestrant therapy, in patients with hormone receptor-positive advanced breast cancer are limited.[METHODS]In a phase 3, randomized, double-blind trial, we enrolled eligible pre-, peri-, and postmenopausal women and men with hormone receptor-positive, human epidermal growth factor receptor 2-negative advanced breast cancer who had had a relapse or disease progression during or after treatment with an aromatase inhibitor, with or without previous cyclin-dependent kinase 4 and 6 (CDK4/6) inhibitor therapy. Patients were randomly assigned in a 1:1 ratio to receive capivasertib plus fulvestrant or placebo plus fulvestrant. The dual primary end point was investigator-assessed progression-free survival assessed both in the overall population and among patients with AKT pathway-altered ([RESULTS]Overall, 708 patients underwent