# Overview

# Table of contents

* [Imports](#imports)
* [Paths](#paths)
* [1. Get literature from PubMed](#pubmed)

# Imports<a id="imports"></a>

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import json
import pandas as pd
from pathlib import Path
import sys
import tiktoken

**Import the PubMed module.**

In [3]:
cwd = Path.cwd()

In [4]:
module_dir = cwd.parent / 'scripts'

In [5]:
sys.path.append(str(module_dir))

In [6]:
import pubmed

**Import the `openai_api` module.**

In [7]:
import openai_api

**Import the module containing utility functions, e.g. writing data from a df to Google Sheets.**

In [53]:
import utils

# Paths<a id="paths"></a>

## PubMed output path

In [8]:

output_dir = cwd.parent / 'outputs'

**Path to the CSV containing PubMed data for 87 papers.**

In [42]:
pubmed_output = output_dir / 'pubmed_data_run_1.csv'

## Prompt language - system and user instructions

In [10]:
input_dir = cwd.parent / 'inputs'

In [11]:
system_path = input_dir / 'system_prompt_v1.json'
user_path = input_dir / 'user_prompt_v1.json'

## Completions JSON files

**Run 1 - Completions for 30 PubMed abstracts processed for entity extraction via the OpenAI API.**

In [40]:
completions_path = output_dir / 'completions_run_1.json'

**Run 1 - Results from the completions extracted into a CSV.**

In [44]:
ner_path = output_dir / 'pubmed_ner_run_1.csv'

## Results

# 1. Get literature from PubMed<a id="pubmed"></a>

**Scope**

We'll focus on getting ~30 articles each for breast cancer, lung cancer, and glioblastoma published from 2023 to the present, excluding articles of type Review, Scientific Integrity Review, or Systematic Review.

**Establish base criteria for querying.**

In [12]:
date_criteria = '(2023/01/01:3000[pdat])'
drug_criteria = '(drug[tiab]+OR+inhibitor[tiab]+OR+compound[tiab]+OR+small+molecule[tiab]+OR+clinical+trial[tiab]+OR+therapy[tiab]+OR+agent[tiab])'
pub_criteria = '(Review[pt]+OR+Scientific+Integrity+Review[pt]+OR+Systematic+Review[pt])'
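
The criteria strings above follow one pattern: each term gets a PubMed field tag (`[tiab]` for title/abstract, `[pt]` for publication type), spaces become `+`, and terms are joined with `+OR+`. A minimal sketch of a helper that builds such clauses from plain term lists is shown below. `build_or_clause` is a hypothetical name and is not part of the `pubmed` module.

```python
def build_or_clause(terms, tag):
    """Join terms into a parenthesised OR clause, e.g. (a[tiab]+OR+b[tiab])."""
    # Spaces inside a term are replaced by '+', matching the hand-written strings
    tagged = [f"{term.replace(' ', '+')}[{tag}]" for term in terms]
    return '(' + '+OR+'.join(tagged) + ')'


drug_terms = ['drug', 'inhibitor', 'compound', 'small molecule',
              'clinical trial', 'therapy', 'agent']
review_types = ['Review', 'Scientific Integrity Review', 'Systematic Review']

drug_clause = build_or_clause(drug_terms, 'tiab')
pub_clause = build_or_clause(review_types, 'pt')
print(drug_clause)
print(pub_clause)
```

Keeping the terms in lists makes it easier to add or drop a term without editing a long string by hand.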


**Run the PubMed pipeline for getting papers for 3 cancer types.**

In [13]:
# Collect all dfs to merge later
all_dfs = []

for cancer_type in ['breast+cancer','lung+cancer','glioblastoma']:
    disease_criteria = f'({cancer_type}[tiab])'
    
    query = date_criteria + '+AND+' + disease_criteria + '+AND+' + drug_criteria + '+NOT+' + pub_criteria
    
    df = pubmed.run_pubmed_pipeline(query=query,
                                    save_on_server='y',
                                    search_format='json',
                                    search_starting_index=0,
                                    search_max_records=9999,
                                    sorting_criteria='relevance',
                                    content_type='abstract',
                                    fetch_starting_index=0,
                                    fetch_max_records=30)
    
    # Add an identifier column for cancer type for easy searching
    df['disease'] = cancer_type
    print(f'Num rows in df: {len(df)}')
    all_dfs.append(df)
    
    print(f'Pipeline complete for {cancer_type}')

# Combine all dfs
final_df = pd.concat(all_dfs)

----Running pipeline for the following query:----
(2023/01/01:3000[pdat])+AND+(breast+cancer[tiab])+AND+(drug[tiab]+OR+inhibitor[tiab]+OR+compound[tiab]+OR+small+molecule[tiab]+OR+clinical+trial[tiab]+OR+therapy[tiab]+OR+agent[tiab])+NOT+(Review[pt]+OR+Scientific+Integrity+Review[pt]+OR+Systematic+Review[pt])
Using PubMed esearch API to get PMIDs matching the search query.
	The actual total number of records matching the search for is 9701
	The number of ids present in the esearch json is 9701
	Function get_pmids complete.
Collecting metadata about the search results into a dictionary.
	Metadata obtained and saved in a dictionary.
Using PubMed efetch API to get abstract and other details for relevant PMIDs into an XML string.
	The number of matching PMIDs based on the server: 30
	Function get_abstracts complete.
Extracting data from XML string and organizing it into a dataframe.
	Performing basic cleanup.
	&#xa0 left: 0
	&#x3ba left: 0
	&# left: 0
Iterating through each article and col
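
The log above mentions the PubMed esearch API. The internals of `pubmed.run_pubmed_pipeline` are not shown here, so the following is only a sketch of how the search arguments could map onto an NCBI E-utilities esearch URL. The mapping of `save_on_server='y'` to `usehistory=y` is an assumption, and `build_esearch_url` is a hypothetical helper. No request is sent.

```python
from urllib.parse import urlencode, urlparse, parse_qs

ESEARCH_URL = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi'


def build_esearch_url(query, retstart=0, retmax=9999, sort='relevance'):
    """Build (but do not send) an esearch URL for the PubMed database."""
    params = {
        'db': 'pubmed',
        'term': query,
        'retmode': 'json',
        'retstart': retstart,
        'retmax': retmax,
        'sort': sort,
        'usehistory': 'y',  # assumed equivalent of save_on_server='y'
    }
    # safe='+' keeps the '+' separators already present in the query string
    return ESEARCH_URL + '?' + urlencode(params, safe='+')


query = '(2023/01/01:3000[pdat])+AND+(glioblastoma[tiab])'
url = build_esearch_url(query)

# Round-trip check: decoding the URL turns each '+' back into a space
decoded_term = parse_qs(urlparse(url).query)['term'][0]
print(decoded_term)
```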

In [14]:
len(final_df)

87
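
The combined length of 87 is below the 3 x 30 = 90 abstracts requested, and the notebook does not say why. A per-disease row count would show which query returned fewer rows. The sketch below uses toy frames as stand-ins for the real per-disease dataframes. It also passes `ignore_index=True` to `pd.concat`, which avoids the repeated 0..n index labels visible in the combined dataframe.

```python
import pandas as pd

# Toy stand-ins for the per-disease dataframes collected in all_dfs
frames = [
    pd.DataFrame({'pmid': ['1', '2'], 'disease': 'breast+cancer'}),
    pd.DataFrame({'pmid': ['3'], 'disease': 'glioblastoma'}),
]

# ignore_index=True renumbers rows 0..n-1 instead of repeating each frame's index
combined = pd.concat(frames, ignore_index=True)

# Number of rows contributed by each disease query
counts = combined['disease'].value_counts()
print(counts)
```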

**Final dataframe ready with content extracted from PubMed.**

In [15]:
final_df

Unnamed: 0,pmid,publication_date,publication_type,article_title,abstract,keywords,journal,num_abstracts_retrieved,num_abstracts_requested,query_string,num_total_matches,all_matching_pmids,acquisition_date,disease
0,37256976,2023 Jun 01,"Clinical Trial, Phase III|Journal Article|Rand...",Capivasertib in Hormone Receptor-Positive Adva...,[BACKGROUND]AKT pathway activation is implicat...,,The New England journal of medicine,30,30,(2023/01/01:3000/12/31[Date - Publication] AND...,9701,"37256976,37070653,37147285,37723305,36585452,3...",2024-03-21,breast+cancer
1,37070653,2023 Mar,Clinical Trial Protocol|Journal Article,"Design of SERENA-6, a phase III switching tria...",ESR1 mutation (ESR1m) is a frequent cause of a...,ESR1 mutation|advanced breast cancer|camizestr...,"Future oncology (London, England)",30,30,(2023/01/01:3000/12/31[Date - Publication] AND...,9701,"37256976,37070653,37147285,37723305,36585452,3...",2024-03-21,breast+cancer
2,37147285,2023 May 05,"Journal Article|Research Support, Non-U.S. Gov't",KK-LC-1 as a therapeutic target to eliminate ALDH,Failure to achieve complete elimination of tri...,,Nature communications,30,30,(2023/01/01:3000/12/31[Date - Publication] AND...,9701,"37256976,37070653,37147285,37723305,36585452,3...",2024-03-21,breast+cancer
3,37723305,2023 Oct,Journal Article,Acetate acts as a metabolic immunomodulator by...,Acetate metabolism is an important metabolic p...,,Nature cancerMain References:Methods Only Refe...,30,30,(2023/01/01:3000/12/31[Date - Publication] AND...,9701,"37256976,37070653,37147285,37723305,36585452,3...",2024-03-21,breast+cancer
4,36585452,2023 Feb,"Journal Article|Research Support, Non-U.S. Gov...",Network-based assessment of HDAC6 activity pre...,Inhibiting individual histone deacetylase (HDA...,,Nature cancerMETHODS-ONLY REFERENCES,30,30,(2023/01/01:3000/12/31[Date - Publication] AND...,9701,"37256976,37070653,37147285,37723305,36585452,3...",2024-03-21,breast+cancer
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
25,37460404,2023 Nov,Journal Article,Drug Repurposing-Based Brain-Targeting Self-As...,"Glioblastoma (GBM), the most aggressive and le...",blood-brain barriers|chemophototherapy|drug re...,"Small (Weinheim an der Bergstrasse, Germany)",30,30,(2023/01/01:3000/12/31[Date - Publication] AND...,1886,"36791206,37451272,36749723,37935665,36966152,3...",2024-03-21,glioblastoma
26,37244935,2023 May 27,"Journal Article|Research Support, Non-U.S. Gov't",macroH2A2 antagonizes epigenetic programs of s...,Self-renewal is a crucial property of glioblas...,,Nature communications,30,30,(2023/01/01:3000/12/31[Date - Publication] AND...,1886,"36791206,37451272,36749723,37935665,36966152,3...",2024-03-21,glioblastoma
27,37886538,2023 Oct 05,Preprint,LDHA-regulated tumor-macrophage symbiosis prom...,Abundant macrophage infiltration and altered t...,,Research square,30,30,(2023/01/01:3000/12/31[Date - Publication] AND...,1886,"36791206,37451272,36749723,37935665,36966152,3...",2024-03-21,glioblastoma
28,37672559,2023 Sep 19,"Research Support, N.I.H., Extramural|Editorial...",Overcoming EGFR inhibitor resistance in Gliobl...,,,Proceedings of the National Academy of Science...,30,30,(2023/01/01:3000/12/31[Date - Publication] AND...,1886,"36791206,37451272,36749723,37935665,36966152,3...",2024-03-21,glioblastoma


**Columns of the final PubMed df and their dtypes.**

In [16]:
final_df.dtypes

pmid                       object
publication_date           object
publication_type           object
article_title              object
abstract                   object
keywords                   object
journal                    object
num_abstracts_retrieved     int64
num_abstracts_requested     int64
query_string               object
num_total_matches           int64
all_matching_pmids         object
acquisition_date           object
disease                    object
dtype: object
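
Several object columns, such as `publication_type` and `keywords`, hold pipe-delimited strings. If per-value filtering is needed later, they can be split into lists. The sketch below runs on a toy frame whose values are copied from the output above; the `publication_type_list` column name is an assumption.

```python
import pandas as pd

toy = pd.DataFrame({
    'pmid': ['37070653', '37723305'],
    'publication_type': ['Clinical Trial Protocol|Journal Article',
                         'Journal Article'],
})

# A single-character separator is treated as a literal string, not a regex
toy['publication_type_list'] = toy['publication_type'].str.split('|')

# Flag rows where any publication type mentions a clinical trial
is_trial = toy['publication_type_list'].apply(
    lambda types: any('Clinical Trial' in t for t in types)
)
print(toy[is_trial]['pmid'].tolist())
```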

**Save the df as a CSV.**

In [43]:
final_df.to_csv(pubmed_output,index=False)

# 2. Prepare prompt text

**The following system text and user instructions form the basis of the prompt text placed in the messages list when calling the OpenAI API. They are first saved to JSON files.
The abstract text is then appended to the user text, beneath 'TASKS'.**

In [20]:
system_data = {"role":"system","content": "You are a meticulous data scientist working on named entity recognition and building out a drug-target dataset from biomedical text. Only consider the text given to you."}
system_data

{'role': 'system',
 'content': 'You are a meticulous data scientist working on named entity recognition and building out a drug-target dataset from biomedical text. Only consider the text given to you.'}

In [21]:
user_data = {"role":"user","content":"""Look at the examples delimited by ### and the rules delimited by ***.
*** RULES
For each of the text shown under TASKS, do the following:
1. Identify which drugs have been tested and create a set for each. 
2. Return multiple sets if more than one drug is present in the text.
For each drug:
3. Get direct target: Use the following logic: if the text clearly and directly mentions that the drug targets a gene and has also defined the type of interaction
4. Get interaction type between the drug and direct target: if explicitly and clearly mentioned, then get the drug-direct target relationship type. What does the drug do to the direct target? inhibitor, activator, etc.?
5. Get groups tested: specify which type of genes was the drug tested on, eg. if drug was tested on samples showing high expression of certain gene.
6. Collect all disease names for which the drug has been tested into 1 list.
7. Extract any specific ClinicalTrials.gov identifier or number.
8. For each drug, construct a set consisting of (drug name, direct target, drug-direct target interaction, tested or effective group, ClinicalTrials.gov number, all diseases that the drug is tested in)
9. Any empty values should be indicated by null and not an empty string.
10. Double check your logic and see if the data you have collected is actually reflected in the text. If not, make revisions.
11. Both, direct target, and drug-direct target interaction fields should have been filled or both should be null. Only one of these fields cannot be null.
12. Assemble all sets and produce 1 final JSON output in a single line without any whitespaces.
***
### EXAMPLES
PMID23:LKB1/STK11 is a serine/threonine kinase that plays a major role in controlling cell metabolism, resulting in potential therapeutic vulnerabilities in LKB1-mutant cancers. Here, we identify the NAD + degrading ectoenzyme, CD38, as a new target in LKB1-mutant NSCLC. Metabolic profiling of genetically engineered mouse models (GEMMs) revealed that LKB1 mutant lung cancers have a striking increase in ADP-ribose, a breakdown product of the critical redox co-factor, NAD + . Surprisingly, compared with other genetic subsets, murine and human LKB1-mutant NSCLC show marked overexpression of the NAD+-catabolizing ectoenzyme, CD38 on the surface of tumor cells. Loss of LKB1 or inactivation of Salt-Inducible Kinases (SIKs)-key downstream effectors of LKB1- induces CD38 transcription induction via a CREB binding site in the CD38 promoter. Treatment with the FDA-approved anti-CD38 antibody, daratumumab, inhibited growth of LKB1-mutant NSCLC xenografts. Together, these results reveal CD38 as a promising therapeutic target in patients with LKB1 mutant lung cancer.
Output:{"PMID23": [{"drug name": "daratumumab","target": [{"direct target": "CD38","drug-direct target interaction": "anti-CD38 monoclonal antibody"},],"tested or effective group": ["LKB-1/STK-11 mutant NSCLC"],"drug tested in following diseases": ["lung cancer", "NSCLC"],"ClinicalTrials.gov ID": []}]}
###
TASKS:
Use the RULES and EXAMPLES and create a similar outputs for the following text delimited by ID:
"""}

**Save these to JSON files.**

In [22]:
with open(system_path, "w") as json_file:
    json.dump(system_data, json_file)

In [23]:
with open(user_path, "w") as json_file:
    json.dump(user_data, json_file)
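
The abstract is meant to be appended to the user text beneath 'TASKS'. How `openai_api` does this is not shown, so the following is only a sketch under the assumption that each abstract is prefixed with `PMID<id>:`, mirroring the example in the prompt. `build_messages` is a hypothetical helper, and the message contents are shortened placeholders.

```python
def build_messages(system_msg, user_msg, pmid, abstract):
    """Return a two-message list with the abstract appended beneath 'TASKS'."""
    # Copy the user message so the base prompt can be reused for the next abstract
    user_with_task = dict(user_msg)
    user_with_task['content'] = f"{user_msg['content']}PMID{pmid}:{abstract}"
    return [system_msg, user_with_task]


# Shortened placeholders standing in for the saved system and user prompts
system_msg = {'role': 'system', 'content': 'You are a meticulous data scientist.'}
user_msg = {'role': 'user', 'content': 'TASKS:\nUse the RULES and EXAMPLES ...:\n'}

messages = build_messages(system_msg, user_msg,
                          pmid='36841821',
                          abstract='Pellino-1 (PELI1) is an E3 ubiquitin ligase ...')
print(messages[1]['content'])
```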

# 3. Run a pipeline to process 30 randomly selected abstracts

**Load the df containing the PubMed data.**

In [24]:
df = pd.read_csv(pubmed_output,dtype={'pmid':'str'})

In [25]:
df1 = openai_api.select_pubmed_data(data_df=df,
                                    cols=['pmid','abstract'],
                                    num_rows=30,
                                    random_state=1)

Function select_pubmed_data complete.
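
The body of `select_pubmed_data` is not shown. A reproducible random selection like this can be sketched with `DataFrame.sample`, assuming that is roughly what the function does. `select_rows` and the toy data are hypothetical.

```python
import pandas as pd


def select_rows(data_df, cols, num_rows, random_state):
    """Return num_rows randomly sampled rows, restricted to cols."""
    # A fixed random_state makes the same rows come back on every run
    return data_df[cols].sample(n=num_rows, random_state=random_state)


toy = pd.DataFrame({
    'pmid': [str(i) for i in range(10)],
    'abstract': [f'text {i}' for i in range(10)],
    'journal': ['J'] * 10,
})

first = select_rows(toy, ['pmid', 'abstract'], num_rows=3, random_state=1)
second = select_rows(toy, ['pmid', 'abstract'], num_rows=3, random_state=1)
print(first)
```

Sampling keeps the original row labels, which would be consistent with the non-sequential index values in the preview of `df1`.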


**Preview the selected rows.**

In [26]:
df1.head()

Unnamed: 0,pmid,abstract
10,36841821,Pellino-1 (PELI1) is an E3 ubiquitin ligase ac...
69,36399643,Durable glioblastoma multiforme (GBM) manageme...
61,36966152,Glioblastoma multiforme (GBM) is the most comm...
34,37321064,Drug resistance is a major challenge in cancer...
86,37147437,Mounting evidence is identifying human cytomeg...


In [27]:
len(df1)

30

In [28]:
df1['pmid'].nunique()

30

In [29]:
df1['pmid'].dtype

dtype('O')

**Process the selected abstracts through the OpenAI API.**

In [32]:
all_completions = openai_api.process_pubmed_data(system_path=system_path,
                                                 user_path=user_path,
                                                 data_df1=df1,
                                                 openai_key_path=openai_key_path,
                                                 seed=1,
                                                 temperature=0.0)

	Base prompt messages loaded.
	Necessary columns present in the input df.
Function setup_client complete.
	Client setup complete.
	Processing abstracts through the API.
Function process_pubmed_data complete in 0:03:21.925227. This run costed $0.41.
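
Note that `openai_key_path` is not defined in the cells shown, so it must be set before this call. The per-abstract request inside `process_pubmed_data` is not shown either. The sketch below assumes it wraps `client.chat.completions.create` from the `openai` package (version 1.0 or later), passing through `seed` and `temperature`. To stay runnable without a key or network access, it uses a stub client; the model name and `request_completion` are hypothetical.

```python
from types import SimpleNamespace


def request_completion(client, model, messages, seed=1, temperature=0.0):
    """Send one chat request and keep the fields needed for later tabulation."""
    response = client.chat.completions.create(
        model=model, messages=messages, seed=seed, temperature=temperature
    )
    return {
        'completion_id': response.id,
        'content': response.choices[0].message.content,
        'prompt_tokens': response.usage.prompt_tokens,
        'completion_tokens': response.usage.completion_tokens,
    }


def _fake_create(**kwargs):
    # Stub mimicking the shape of a chat completion response
    return SimpleNamespace(
        id='chatcmpl-stub',
        choices=[SimpleNamespace(message=SimpleNamespace(content='{"PMID1":[]}'))],
        usage=SimpleNamespace(prompt_tokens=10, completion_tokens=5),
    )


stub_client = SimpleNamespace(
    chat=SimpleNamespace(completions=SimpleNamespace(create=_fake_create))
)
result = request_completion(stub_client, model='hypothetical-model',
                            messages=[{'role': 'user', 'content': 'PMID1:...'}])
print(result)
```

With the real package, the stub would be replaced by a client created from the stored key, for example `openai.OpenAI(api_key=...)`.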


**Save the completions dictionary as a JSON.**

In [41]:
with open(completions_path, "w") as json_file:
    json.dump(all_completions, json_file)
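
Rule 11 of the prompt requires the direct target and the interaction type to be either both filled or both null. Before tabulating, model outputs could be checked against that rule. The sketch below assumes the output follows the key names used in the prompt's example. `check_target_pairs`, `drugX`, and `GENE1` are hypothetical.

```python
import json


def check_target_pairs(raw_output):
    """Return (pmid, drug name) pairs whose target entries break rule 11."""
    violations = []
    for pmid, drug_sets in json.loads(raw_output).items():
        for drug_set in drug_sets:
            for target in drug_set.get('target') or []:
                has_target = target.get('direct target') is not None
                has_interaction = target.get('drug-direct target interaction') is not None
                # Exactly one of the two fields being filled violates the rule
                if has_target != has_interaction:
                    violations.append((pmid, drug_set.get('drug name')))
    return violations


good = ('{"PMID23":[{"drug name":"daratumumab","target":[{"direct target":"CD38",'
        '"drug-direct target interaction":"anti-CD38 monoclonal antibody"}]}]}')
bad = ('{"PMID24":[{"drug name":"drugX","target":[{"direct target":"GENE1",'
       '"drug-direct target interaction":null}]}]}')

print(check_target_pairs(good))
print(check_target_pairs(bad))
```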

**Convert the completions into a dataframe.**

In [33]:
ner_df = openai_api.get_df_from_completions(all_completions=all_completions)

Function get_df_from_completions complete.
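
The structure of `all_completions` and the body of `get_df_from_completions` are not shown. As a rough illustration of the flattening step, the sketch below assumes already-parsed outputs in the form `{pmid: [drug sets]}` and emits one row per drug-target pair. `completions_to_frame` and the `drugX` entry are hypothetical.

```python
import pandas as pd


def completions_to_frame(parsed_by_pmid):
    """Flatten {pmid: [drug sets]} into one row per drug-target pair."""
    rows = []
    for pmid, drug_sets in parsed_by_pmid.items():
        for drug_set in drug_sets:
            # A drug without targets still gets one row with null target fields
            targets = drug_set.get('target') or [{}]
            for target in targets:
                rows.append({
                    'pmid': pmid,
                    'drug_name': drug_set.get('drug name'),
                    'direct_target': target.get('direct target'),
                    'drug-direct_target_interaction':
                        target.get('drug-direct target interaction'),
                })
    return pd.DataFrame(rows)


parsed = {
    '36841821': [{'drug name': 'S62',
                  'target': [{'direct target': 'PELI1/EGFR',
                              'drug-direct target interaction': 'disruptor'}]}],
    '00000000': [{'drug name': 'drugX', 'target': []}],
}
flat = completions_to_frame(parsed)
print(flat)
```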


In [39]:
ner_df.head()

Unnamed: 0,completion_id,completion_tokens,prompt_tokens,total_tokens,drug_name,tested_or_effective_group,drug_tested_in_diseases,clinical_trials_id,direct_target,drug-direct_target_interaction,pmid
0,chatcmpl-95Fz5ygwhHbh6fS34y9rl6UCQzw7V,61.0,1073.0,1134.0,S62,,[breast cancer],,PELI1/EGFR,disruptor,36841821
1,chatcmpl-95FzB5d0oHjWiZz8uYJofalUBnFuB,99.0,1131.0,1230.0,regorafenib,,[GBM],,autophagosome-lysosome fusion,inhibitor,36399643
2,chatcmpl-95FzLUxdFBE1KNul7adDXyaq1Ds0Q,70.0,1086.0,1156.0,fatostatin,[null],"[glioblastoma, GBM]",[],AKT/mTORC1/GPX4 signaling pathway,inhibitor,36966152
3,chatcmpl-95FzTMkaW3MDTzy0nvTisQfDIP9bc,78.0,1171.0,1249.0,MAM,"[cisplatin-resistant A549, AZD9291-resistant H...","[non-small cell lung cancer, NSCLC]",,NQO1,activator,37321064
4,chatcmpl-95FzZlGTV1CFR13ye9QStxhproc0a,130.0,1111.0,1241.0,ganciclovir,[HCMV-induced glioblastoma],[glioblastoma],,EZH2,inhibitor,37147437
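
The token columns make it possible to cross-check the cost reported in the run log. The sketch below assumes a completion can appear on several rows (one per drug), so it deduplicates by `completion_id` before summing. The per-1K-token rates are placeholder values, not actual pricing, and `estimate_cost` is a hypothetical helper.

```python
import pandas as pd


def estimate_cost(ner_df, prompt_rate_per_1k, completion_rate_per_1k):
    """Estimate API cost from token counts, counting each completion once."""
    per_call = ner_df.drop_duplicates(subset='completion_id')
    prompt_cost = per_call['prompt_tokens'].sum() / 1000 * prompt_rate_per_1k
    completion_cost = per_call['completion_tokens'].sum() / 1000 * completion_rate_per_1k
    return prompt_cost + completion_cost


toy = pd.DataFrame({
    'completion_id': ['a', 'a', 'b'],          # 'a' yielded two drug rows
    'prompt_tokens': [1000.0, 1000.0, 2000.0],
    'completion_tokens': [100.0, 100.0, 300.0],
})

# Placeholder rates for illustration only
cost = estimate_cost(toy, prompt_rate_per_1k=0.01, completion_rate_per_1k=0.03)
print(round(cost, 4))
```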


**Save the results from the API as a CSV.**

In [45]:
ner_df.to_csv(ner_path,index=False)

**Write the PMIDs to Google Sheets so that manually curated data can be added in Sheets.**

In [62]:
utils.write_data_to_sheets(df=ner_df,
                           credentials_path=credentials_path,
                           sheets_filename='pubmed_manual_curated',
                           sheets_sheetname='run_1',
                           columns_to_write=['pmid'])
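
Note that `credentials_path` is not defined in the cells shown, so it must be set before this call. The body of `utils.write_data_to_sheets` is not shown either. The sketch below assumes it uses the third-party `gspread` package with a service-account credentials file; both helper names are hypothetical. Only the data-preparation step is executed here, since the upload needs real credentials.

```python
import pandas as pd


def frame_to_values(df, columns):
    """Convert selected columns into a header row plus string-valued data rows."""
    subset = df[columns].astype(str)
    return [list(columns)] + subset.values.tolist()


def append_to_sheet(values, credentials_path, filename, sheetname):
    """Append rows to an existing worksheet (not called in this sketch)."""
    import gspread  # third-party; imported lazily so the sketch runs without it

    client = gspread.service_account(filename=str(credentials_path))
    worksheet = client.open(filename).worksheet(sheetname)
    worksheet.append_rows(values)


toy = pd.DataFrame({'pmid': ['36841821', '36399643'],
                    'drug_name': ['S62', 'regorafenib']})
values = frame_to_values(toy, ['pmid'])
print(values)
```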