# Table of contents



* [Overview](#overview)
* [Notes and links about LLMs](#notes)
* [Imports](#imports)
* [Gemini](#gemini)
* [GPT](#gpt)
    * [Set-up](#setup)
    * [Use cases - Q/A and entity extraction](#usecases)
    * [Ideas to reduce the cost per query](#cost)
        * [Model alternatives](#models)
    * [Testing some examples for drug and gene name extraction](#druggene)
    * [Testing tiktoken to estimate token size](#tik)

# Overview<a id="overview"></a>

The purpose of this notebook is to try out different LLMs.

FREE:
* Llama-2 [decoder only]
* BERT [encoder only]
* Gemini [decoder only]
* T5 [encoder-decoder]

PAID:
* GPT-4 (OpenAI) [decoder only]

# Notes and links about LLMs<a id="notes"></a>

1. Gemini 1.0 [Google]
    * is free to use: https://ai.google.dev/pricing
    * Python package: https://pypi.org/project/google-generativeai/
    * How to: https://ai.google.dev/tutorials/python_quickstart
    * API key was obtained by going to Create API Key (New Project) - https://aistudio.google.com/app/apikey
    * The free tier (v1.0) allows 60 queries per minute. Prompts and responses are used to improve Google's products.
    
    
2. OpenAI API
    * Is not free, but it doesn't require upgrading to ChatGPT Plus.
    * You first have to purchase credits (\$5 minimum) to use the API.
    * Pricing for GPT-4: https://openai.com/pricing#language-models
    
        Input: \$0.03 / 1K tokens
        
        Output: \$0.06 / 1K tokens
        
    * API ref: https://platform.openai.com/docs/api-reference
    * Python quickstart: https://platform.openai.com/docs/quickstart?context=python
    * How to / cookbook on formatting inputs. https://cookbook.openai.com/examples/how_to_format_inputs_to_chatgpt_models
    * Model types and compatibility with endpoints - https://platform.openai.com/docs/models/model-endpoint-compatibility
    * How to count tokens to get an estimate of cost: https://cookbook.openai.com/examples/how_to_format_inputs_to_chatgpt_models#4-counting-tokens
    * Parameters that can be passed into the Chat Completions endpoint: https://platform.openai.com/docs/api-reference/chat/create
    * Tiktoken to estimate token size: https://cookbook.openai.com/examples/how_to_count_tokens_with_tiktoken
    * Context window limits: https://community.openai.com/t/what-does-context-window-mean-in-the-documentation/566158
    * How to reduce cost: https://www.reddit.com/r/OpenAI/comments/13scry1/how_to_reduce_your_openai_costs_by_up_to_30_3/
    
    
3. Llama-2
    * Free to use from Meta
    * Getting started guide - https://ai.meta.com/blog/5-steps-to-getting-started-with-llama-2/#:~:text=Llama%202%20is%20available%20for%20free%20for%20research%20and%20commercial%20use.
    * Does not have an API through Meta.
    * Try Hugging Face https://huggingface.co/docs/transformers/v4.38.1/en/autoclass_tutorial

4. Prompting
    * https://www.promptingguide.ai/techniques/cot#zero-shot-cot-prompting
    * https://community.openai.com/t/cheat-sheet-mastering-temperature-and-top-p-in-chatgpt-api/172683
    * https://help.openai.com/en/articles/6654000-best-practices-for-prompt-engineering-with-the-openai-api
    * https://cookbook.openai.com/articles/related_resources
    * https://community.openai.com/t/convert-few-shot-example-to-api-code/325614
    * https://cookbook.openai.com/examples/reproducible_outputs_with_the_seed_parameter

# Imports<a id="imports"></a>

In [1]:
from dotenv import load_dotenv
import google.generativeai as genai
import json
from openai import OpenAI
import os
import pandas as pd
from pathlib import Path
import tiktoken
import torch
import transformers

# Gemini<a id="gemini"></a>

**Load environment variables.**

In [14]:
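# dotenv_path is not defined in the cells shown; it must point to the .env file that stores API_KEY_GEMINI.
load_dotenv(dotenv_path=dotenv_path)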

True

**Access environment variables.**

In [17]:
api_key_gemini = os.environ['API_KEY_GEMINI']

**Pass the key to the gemini API.**

In [18]:
genai.configure(api_key=api_key_gemini)

In [26]:
genai.list_models()  

generator

**The model-listing code from the Python tutorial does not work here and raises an AttributeError.**
https://ai.google.dev/tutorials/python_quickstart
It should print a list of the available models. Other users report the same problem, and it does not appear to have been resolved: https://github.com/google/generative-ai-python/issues/145
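`genai.list_models()` returns a generator, so evaluating it on its own displays only the generator object; the models appear when it is iterated. Below is a minimal sketch of the filter the quickstart applies, run on stand-in objects so it works without an API key. The attribute names `name` and `supported_generation_methods` are the ones the quickstart uses. The helper name `models_supporting` and the two stand-in entries are hypothetical, and whether iterating the real generator avoids the AttributeError is untested.

```python
from types import SimpleNamespace


def models_supporting(models, method="generateContent"):
    """Return the names of models whose supported methods include `method`."""
    return [m.name for m in models if method in m.supported_generation_methods]


def fake_list_models():
    """Hypothetical stand-in for genai.list_models(); yields model-like objects."""
    yield SimpleNamespace(name="models/gemini-pro",
                          supported_generation_methods=["generateContent", "countTokens"])
    yield SimpleNamespace(name="models/embedding-001",
                          supported_generation_methods=["embedContent"])


# Iterating the generator is what exposes the individual models.
print(models_supporting(fake_list_models()))
```

With a working client, `models_supporting(genai.list_models())` would be the equivalent call.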

**Try some queries with Gemini-Pro**

Load the model.

In [39]:
model = genai.GenerativeModel('gemini-pro')

In [38]:
response = model.generate_content('What is BRCA1?')

In [42]:
print(response.text)

TypeError: argument of type 'Part' is not iterable

# GPT4<a id="gpt"></a>

## Set up the client.<a id="setup"></a>

**Load API key.**

In [8]:
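# openai_key_path is not defined in the cells shown; it must point to the .env file that stores OPENAI_API_KEY.
load_dotenv(dotenv_path=openai_key_path)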

True

**Access the API key.**

In [9]:
openai_api_key = os.environ['OPENAI_API_KEY']

In [10]:
client = OpenAI(api_key=openai_api_key)

**Use the chat completions endpoint.**

First try with just the two required arguments, `model` and `messages`.

The `messages` input is a list of dictionaries, where each dictionary gives the content and the role it comes from.
In short:

    * 'user' role: you, the user talking to the model
    
    * 'assistant' role: the model's replies
    
    * 'system' role: instructions you set as a developer, such as 'Frame the answer for a non-engineer'.
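A small helper can make the role structure explicit. This is only a sketch: `build_messages` is a hypothetical name, and placing the system message first follows the ordering commonly used in OpenAI's examples rather than a requirement stated in this notebook.

```python
def build_messages(user_text, system_text=None):
    """Build the `messages` list for a chat completion request."""
    messages = []
    if system_text:
        # Developer-level instruction that frames how the model should answer.
        messages.append({"role": "system", "content": system_text})
    # The actual question or text supplied by the user.
    messages.append({"role": "user", "content": user_text})
    return messages


print(build_messages("What is BRCA1?", "Frame the answer for a non-engineer."))
```

The returned list can be passed directly as the `messages` argument of `client.chat.completions.create`.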
    

## Try a Q/A and an entity recognition use case.<a id="usecases"></a>

**USE CASE 1 - Q/A.**

**A slight change in the system prompt can produce noticeably different results.**

In [67]:
MODEL = "gpt-4"
response = client.chat.completions.create(
    model=MODEL,
    messages=[
        {"role": "system", "content": "Provide explanation to a lay audience in JSON format."},
        {"role": "user", "content": "What is BRCA1?"}
    ]
)


In [68]:
response

ChatCompletion(id='chatcmpl-8vqqmLFCoJSAcoEQxWHNVsLdX9Mgh', choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content='{\n  "BRCA1": {\n    "Definition": "BRCA1 is a gene that produces a protein responsible for repairing damaged DNA and maintaining the cell\'s genetic stability.",\n    "Significance": "Mutations or changes in this gene can lead to the development of hereditary breast and ovarian cancer. When functioning normally, this gene helps prevent uncontrolled cell growth. However, a mutation can lead to an increased risk of developing cancer.",\n    "Testing": "Genetic tests are available to check for BRCA1 mutations. These tests are often recommended for individuals with a strong family history of breast or ovarian cancer."\n  }\n}', role='assistant', function_call=None, tool_calls=None))], created=1708798544, model='gpt-4-0613', object='chat.completion', system_fingerprint=None, usage=CompletionUsage(completion_tokens=128, prompt_token

In [65]:
MODEL = "gpt-4"
response = client.chat.completions.create(
    model=MODEL,
    messages=[
        {"role": "system", "content": "Explain to a lay audience."},
        {"role": "user", "content": "What is BRCA1?"}
    ]
)


In [66]:
response

ChatCompletion(id='chatcmpl-8vqnbzu6mE7CIFprnlx6qGbHe5waT', choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content="BRCA1 is a gene that everyone has in their cells. This gene plays an important role in repairing damaged DNA and keeping our cells' genetic material stable. When this gene works properly, it helps prevent uncontrolled cell growth that could otherwise lead to cancer.\n\nHowever, some people carry changes or mutations in the BRCA1 gene that they inherited from their parents. These changes can prevent the gene from working properly, which increases the risk of breast and ovarian cancer, and to a lesser extent, other types of cancer.\n\nTesting for these gene changes is sometimes recommended for people with a strong family history of breast or ovarian cancer. If an individual knows they carry a mutated BRCA1 gene, they can make certain decisions about preventative measures, early detection and treatment options.", role='assistant',

**USE CASE 2 - EXTRACT ENTITIES FROM GIVEN INFORMATION.**

In [8]:
MODEL = "gpt-4"
response = client.chat.completions.create(
    model=MODEL,
    messages=[
        {"role": "user", "content": """Extract the target gene name and associated drug names from a chunk of text:
        In this phase 1 study, patients with ALK-positive stage III or IV NSCLC received oral crizotinib 250 mg twice daily in 28-day cycles. 
        Endpoints included tumour responses, duration of response, time to tumour response, progression-free survival (PFS), overall survival at 6 and 12 months, and determination of the safety and tolerability and characterisation of the plasma pharmacokinetic profile of crizotinib after oral administration. 
        Responses were analysed in evaluable patients and PFS and safety were analysed in all patients. 
        This study is registered with ClinicalTrials.gov, number NCT00585195.
        """}
    ]
)


In [9]:
response

ChatCompletion(id='chatcmpl-8wXyu0JFHegsYnSMhXuBgfGUlrmVQ', choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content='Target Gene Name: ALK \n\nAssociated Drug Names: Crizotinib', role='assistant', function_call=None, tool_calls=None))], created=1708964340, model='gpt-4-0613', object='chat.completion', system_fingerprint=None, usage=CompletionUsage(completion_tokens=16, prompt_tokens=164, total_tokens=180))
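The full `ChatCompletion` repr is hard to scan. The answer text sits in `choices[0].message.content` and the token counts in `usage`, both visible in the output above. The sketch below reads those fields from a stand-in object with the same attribute layout, so it runs without an API call; `summarize_response` is a hypothetical helper name.

```python
from types import SimpleNamespace


def summarize_response(response):
    """Pull out the answer text and token counts from a chat completion."""
    return {
        "content": response.choices[0].message.content,
        "prompt_tokens": response.usage.prompt_tokens,
        "completion_tokens": response.usage.completion_tokens,
        "total_tokens": response.usage.total_tokens,
    }


# Stand-in mirroring the fields of the response shown above.
example = SimpleNamespace(
    choices=[SimpleNamespace(message=SimpleNamespace(
        content="Target Gene Name: ALK \n\nAssociated Drug Names: Crizotinib"))],
    usage=SimpleNamespace(prompt_tokens=164, completion_tokens=16, total_tokens=180),
)

print(summarize_response(example)["content"])
```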

In [10]:
MODEL = "gpt-4"
response = client.chat.completions.create(
    model=MODEL,
    messages=[
        {"role": "user", "content": """Extract the target gene name and associated drug names from a chunk of text:
        In this phase 1 study, patients with ALK-positive stage III or IV NSCLC received oral crizotinib 250 mg twice daily in 28-day cycles. 
        Endpoints included tumour responses, duration of response, time to tumour response, progression-free survival (PFS), overall survival at 6 and 12 months, and determination of the safety and tolerability and characterisation of the plasma pharmacokinetic profile of crizotinib after oral administration. 
        Responses were analysed in evaluable patients and PFS and safety were analysed in all patients. 
        This study is registered with ClinicalTrials.gov, number NCT00585195.
        """},
        {"role":"system","content":"Try to also extract any aliases for gene or drug names from any external links to other databases provided in the text."}
    ]
)

In [11]:
response

ChatCompletion(id='chatcmpl-8wY80Uwnsruad1FSr3QpcTPSbbd1R', choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content='The text does not provide information about any external links to other databases for gene or drug names. Please provide the text that contains these details.', role='assistant', function_call=None, tool_calls=None))], created=1708964904, model='gpt-4-0613', object='chat.completion', system_fingerprint=None, usage=CompletionUsage(completion_tokens=28, prompt_tokens=191, total_tokens=219))

In [12]:
MODEL = "gpt-4"
response = client.chat.completions.create(
    model=MODEL,
    messages=[
        {"role": "user", "content": """Extract the target gene name and associated drug names from a chunk of text:
        In this phase 1 study, patients with ALK-positive stage III or IV NSCLC received oral crizotinib 250 mg twice daily in 28-day cycles. 
        Endpoints included tumour responses, duration of response, time to tumour response, progression-free survival (PFS), overall survival at 6 and 12 months, and determination of the safety and tolerability and characterisation of the plasma pharmacokinetic profile of crizotinib after oral administration. 
        Responses were analysed in evaluable patients and PFS and safety were analysed in all patients. 
        This study is registered with ClinicalTrials.gov, number NCT00585195.
        """},
        {"role":"system","content":"Extract details if any of the following database ids are provided: Clinical trials, PubChem, Entrez or similar."}
    ]
)

In [13]:
response

ChatCompletion(id='chatcmpl-8wYA7dq8MRbcDKGgaK2qHi7d1RNPE', choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content='The text does not provide database ids for Clinical trials, PubChem, Entrez or similar.', role='assistant', function_call=None, tool_calls=None))], created=1708965035, model='gpt-4-0613', object='chat.completion', system_fingerprint=None, usage=CompletionUsage(completion_tokens=19, prompt_tokens=191, total_tokens=210))

In [14]:
MODEL = "gpt-4"
response = client.chat.completions.create(
    model=MODEL,
    messages=[
        {"role": "user", "content": """Extract the target gene name and associated drug names from a chunk of text:
        In this phase 1 study, patients with ALK-positive stage III or IV NSCLC received oral crizotinib 250 mg twice daily in 28-day cycles. 
        Endpoints included tumour responses, duration of response, time to tumour response, progression-free survival (PFS), overall survival at 6 and 12 months, and determination of the safety and tolerability and characterisation of the plasma pharmacokinetic profile of crizotinib after oral administration. 
        Responses were analysed in evaluable patients and PFS and safety were analysed in all patients. 
        This study is registered with ClinicalTrials.gov, number NCT00585195.
        """},
        {"role":"system","content":"If ClinicalTrials.gov id or a number starting with NCT is given, extract the title of the study."}
    ]
)

In [15]:
response

ChatCompletion(id='chatcmpl-8wYBQttEtHrUrQgz3gUNs8c6PE0EG', choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content='The target gene name is "ALK". The associated drug name is "crizotinib".', role='assistant', function_call=None, tool_calls=None))], created=1708965116, model='gpt-4-0613', object='chat.completion', system_fingerprint=None, usage=CompletionUsage(completion_tokens=20, prompt_tokens=191, total_tokens=211))

**Looks like it's not good to split instructions between the user and the system messages. In most of the runs above, the system instruction took precedence over the instruction in the user message.**

In [16]:
MODEL = "gpt-4"
response = client.chat.completions.create(
    model=MODEL,
    messages=[
        {"role": "user", "content": """Extract the target gene name and associated drug names from a chunk of text:
        In this phase 1 study, patients with ALK-positive stage III or IV NSCLC received oral crizotinib 250 mg twice daily in 28-day cycles. 
        Endpoints included tumour responses, duration of response, time to tumour response, progression-free survival (PFS), overall survival at 6 and 12 months, and determination of the safety and tolerability and characterisation of the plasma pharmacokinetic profile of crizotinib after oral administration. 
        Responses were analysed in evaluable patients and PFS and safety were analysed in all patients. 
        This study is registered with ClinicalTrials.gov, number NCT00585195.
        """},
        {"role":"system","content":"Extract ClinicalTrials.gov number."}
    ]
)

In [17]:
response

ChatCompletion(id='chatcmpl-8wYCITXZRp7j1KllxerKfpTb9pOUx', choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content='NCT00585195', role='assistant', function_call=None, tool_calls=None))], created=1708965170, model='gpt-4-0613', object='chat.completion', system_fingerprint=None, usage=CompletionUsage(completion_tokens=5, prompt_tokens=175, total_tokens=180))

**Put all instructions in the system message.**
This seems to work better.

In [20]:
MODEL = "gpt-4"
response = client.chat.completions.create(
    model=MODEL,
    messages=[
        {"role": "user", "content": """
        In this phase 1 study, patients with ALK-positive stage III or IV NSCLC received oral crizotinib 250 mg twice daily in 28-day cycles. 
        Endpoints included tumour responses, duration of response, time to tumour response, progression-free survival (PFS), overall survival at 6 and 12 months, and determination of the safety and tolerability and characterisation of the plasma pharmacokinetic profile of crizotinib after oral administration. 
        Responses were analysed in evaluable patients and PFS and safety were analysed in all patients. 
        This study is registered with ClinicalTrials.gov, number NCT00585195.
        """},
        {"role":"system","content":"Extract the target gene name, associated drug names, and ClinicalTrials.gov number."}
    ]
)

In [19]:
response

ChatCompletion(id='chatcmpl-8wYEbRAu9nGXArcK5FlJXNi4o8W4O', choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content='Target Gene Name: ALK\nAssociated Drug Names: Crizotinib\nClinicalTrials.gov number: NCT00585195', role='assistant', function_call=None, tool_calls=None))], created=1708965313, model='gpt-4-0613', object='chat.completion', system_fingerprint=None, usage=CompletionUsage(completion_tokens=28, prompt_tokens=171, total_tokens=199))

## Ideas to reduce the cost per query.<a id="cost"></a>

1. Calculate the length of the different text queries in tokens. Even an estimate shows whether it is feasible to use GPT-3.5 and reduce the cost. The context window must fit the whole input text along with the role and content key-values.

2. If step 1 confirms that the longest text query in the dataset fits in the context window of both GPT-4 and GPT-3.5, test whether GPT-3.5 gives the same quality of response as GPT-4. 

3. Clean the text from the dataset before it goes into a prompt: remove extra spaces, trailing spaces and the last period. This small step can reduce the total number of tokens.
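The cleaning step in item 3 could look roughly like the sketch below. `clean_text` is a hypothetical helper: it collapses runs of whitespace, strips leading and trailing spaces, and drops a single final period. How many tokens this saves depends on the tokenizer and is not measured here.

```python
import re


def clean_text(text):
    """Collapse whitespace, trim the ends, and drop one trailing period."""
    text = re.sub(r"\s+", " ", text).strip()
    if text.endswith("."):
        text = text[:-1].rstrip()
    return text


raw = """
    This study is registered with   ClinicalTrials.gov,
    number NCT00585195.
"""
print(clean_text(raw))
```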

### Model alternatives - shortlist acceptable models based on context window size, cost, and use case.<a id="models"></a>

1. gpt-3.5-turbo-0125
    * Context window: 16,385 tokens
    * Training data only up to Sep 2021
    * Input - \$0.0005 / 1K tokens
    * Output - \$0.0015 / 1K tokens
    * You can set response_format to { "type": "json_object" } to enable JSON mode.


2. gpt-4
    * Currently points to gpt-4-0613.
    * Context window: 8,192 tokens
    * Training data up to Sep 2021
    * Input - \$0.03 / 1K tokens
    * Output - \$0.06 / 1K tokens
    

3. gpt-4-turbo-preview
    * New
    * Context window: 128,000 tokens
    * Training data up to Dec 2023
    * Input - \$0.01 / 1K tokens
    * Output - \$0.03 / 1K tokens
    * You can set response_format to { "type": "json_object" } to enable JSON mode.
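The prices listed above make a per-query cost estimate straightforward: tokens divided by 1,000, times the per-1K price, summed over input and output. The sketch below hard-codes the prices from this list (they may change) and uses token counts of the kind reported in `usage`; `PRICES_PER_1K` and `estimate_cost` are hypothetical names.

```python
# Prices in USD per 1K tokens, copied from the list above (subject to change).
PRICES_PER_1K = {
    "gpt-3.5-turbo-0125": {"input": 0.0005, "output": 0.0015},
    "gpt-4": {"input": 0.03, "output": 0.06},
    "gpt-4-turbo-preview": {"input": 0.01, "output": 0.03},
}


def estimate_cost(model, prompt_tokens, completion_tokens):
    """Estimate the USD cost of one request from its token counts."""
    price = PRICES_PER_1K[model]
    input_cost = (prompt_tokens / 1000) * price["input"]
    output_cost = (completion_tokens / 1000) * price["output"]
    return input_cost + output_cost


# Token counts taken from a gpt-4 extraction response earlier in this notebook.
for model in PRICES_PER_1K:
    print(model, round(estimate_cost(model, 171, 28), 6))
```

For a 171-token prompt and a 28-token completion, gpt-4 works out to roughly \$0.0068 per query and gpt-4-turbo-preview to roughly \$0.0026.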

#### Test if these 3 models give the same response quality.

This is the model we've used so far: 

In [21]:
MODEL = "gpt-4"
response = client.chat.completions.create(
    model=MODEL,
    messages=[
        {"role": "user", "content": """
        In this phase 1 study, patients with ALK-positive stage III or IV NSCLC received oral crizotinib 250 mg twice daily in 28-day cycles. 
        Endpoints included tumour responses, duration of response, time to tumour response, progression-free survival (PFS), overall survival at 6 and 12 months, and determination of the safety and tolerability and characterisation of the plasma pharmacokinetic profile of crizotinib after oral administration. 
        Responses were analysed in evaluable patients and PFS and safety were analysed in all patients. 
        This study is registered with ClinicalTrials.gov, number NCT00585195.
        """},
        {"role":"system","content":"Extract the target gene name, associated drug names, and ClinicalTrials.gov number."}
    ]
)

In [22]:
response

ChatCompletion(id='chatcmpl-8wZ4ANHsg2yXIrK1GeXF2aRXcDpPA', choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content='Target gene name: ALK\nAssociated drug names: Crizotinib\nClinicalTrials.gov number: NCT00585195', role='assistant', function_call=None, tool_calls=None))], created=1708968510, model='gpt-4-0613', object='chat.completion', system_fingerprint=None, usage=CompletionUsage(completion_tokens=28, prompt_tokens=171, total_tokens=199))

Test the other 2 models: gpt-4-turbo-preview is more expensive but has the most recent training data, while gpt-3.5-turbo-0125 is the least expensive but has older training data.

**Conclusion - On this example, gpt-4-turbo-preview is more accurate than the cheaper gpt-3.5-turbo-0125, which returned "ALK-positive" instead of "ALK" as the target gene. gpt-4-turbo-preview extracts the same values as gpt-4 and is less expensive than gpt-4.**

In [26]:
models_to_test = ['gpt-4-turbo-preview', 'gpt-3.5-turbo-0125']
# models_to_test = ['gpt-3.5-turbo-0125']

# Here we can add an additional argument for JSON format.

for MODEL in models_to_test:
    
    response = client.chat.completions.create(
        model=MODEL,
        response_format={ "type": "json_object" },
        messages=[
            {"role": "user", "content": """
            In this phase 1 study, patients with ALK-positive stage III or IV NSCLC received oral crizotinib 250 mg twice daily in 28-day cycles. 
            Endpoints included tumour responses, duration of response, time to tumour response, progression-free survival (PFS), overall survival at 6 and 12 months, and determination of the safety and tolerability and characterisation of the plasma pharmacokinetic profile of crizotinib after oral administration. 
            Responses were analysed in evaluable patients and PFS and safety were analysed in all patients. 
            This study is registered with ClinicalTrials.gov, number NCT00585195.
            """},
            {"role":"system","content":"Extract the target gene name, associated drug names, and ClinicalTrials.gov number and give JSON."}
        ]
    )
    
    print(response)

ChatCompletion(id='chatcmpl-8wZ9kPCdHjS0SIdLUSv9ZQpFK4O23', choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content='{\n  "target_gene": "ALK",\n  "associated_drug_names": ["crizotinib"],\n  "ClinicalTrials.gov_number": "NCT00585195"\n}', role='assistant', function_call=None, tool_calls=None))], created=1708968856, model='gpt-4-0125-preview', object='chat.completion', system_fingerprint='fp_91aa3742b1', usage=CompletionUsage(completion_tokens=39, prompt_tokens=174, total_tokens=213))
ChatCompletion(id='chatcmpl-8wZ9mw6TGouFg8MuIl1D1P5P5Xz1t', choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content='{\n  "target_gene": "ALK-positive",\n  "drug_name": "crizotinib",\n  "ClinicalTrials.gov_number": "NCT00585195"\n}', role='assistant', function_call=None, tool_calls=None))], created=1708968858, model='gpt-3.5-turbo-0125', object='chat.completion', system_fingerprint='fp_86156a94a0', usage=CompletionUsage
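Because both models return JSON strings in `message.content`, the outputs can be parsed and compared field by field rather than by eye. Below is a minimal sketch using the two `content` strings printed above; `compare_outputs` is a hypothetical helper.

```python
import json

# `content` strings copied from the two responses above.
gpt4_turbo = '{\n  "target_gene": "ALK",\n  "associated_drug_names": ["crizotinib"],\n  "ClinicalTrials.gov_number": "NCT00585195"\n}'
gpt35 = '{\n  "target_gene": "ALK-positive",\n  "drug_name": "crizotinib",\n  "ClinicalTrials.gov_number": "NCT00585195"\n}'


def compare_outputs(a, b):
    """Report keys unique to each JSON output and shared keys whose values differ."""
    da, db = json.loads(a), json.loads(b)
    shared = set(da) & set(db)
    return {
        "only_in_a": sorted(set(da) - set(db)),
        "only_in_b": sorted(set(db) - set(da)),
        "different_values": sorted(k for k in shared if da[k] != db[k]),
    }


print(compare_outputs(gpt4_turbo, gpt35))
```

The comparison surfaces both kinds of drift seen here: a different key name for the drug field and a different value for the target gene.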

## Testing some more use cases for entity extraction - gene name and drug names<a id="druggene"></a>

First, create a `messages` variable so that different abstract examples and system-level instructions can be swapped in.

In [45]:
MODEL = 'gpt-4-turbo-preview'

# Placeholders; the content of each message is filled in inside the loop below.
messages = [{"role": "user", "content": ""},
            {"role": "system", "content": ""}]

In [53]:
user_content = ["""
In this phase 1 study, patients with ALK-positive stage III or IV NSCLC received oral crizotinib 250 mg twice daily in 28-day cycles.
Endpoints included tumour responses, duration of response, time to tumour response, progression-free survival (PFS), overall survival at 6 and 12 months, and determination of the safety and tolerability and characterisation of the plasma pharmacokinetic profile of crizotinib after oral administration. 
Responses were analysed in evaluable patients and PFS and safety were analysed in all patients. 
This study is registered with ClinicalTrials.gov, number NCT00585195.
""",
"""
Conventional chemotherapeutic drugs such as doxorubicin (DOX) are associated with severe adverse effects such as cardiac, hepatic, and gastrointestinal (GI) toxicities. Excessive production of reactive oxygen species (ROS) was reported to be one of the main mechanisms underlying these severe adverse effects. Recently, we have developed 2 types of novel redox nanoparticles (RNPs) including pH-sensitive redox nanoparticle (RNP(N)) and pH-insensitive redox nanoparticle (RNP(O)), which effectively scavenge overproduced ROS in inflamed and cancerous tissues. In this study, we investigated the effects of these RNPs on DOX-induced adverse effects during cancer chemotherapy. The DOX-induced body weight loss was significantly attenuated in the mice treated with RNPs, particularly pH-insensitive RNP(O). We also found that cardiac ROS levels in the DOX-treated mice were dramatically decreased by treatment with RNPs, resulting in the reversal of cardiac damage, as confirmed by both plasma cardiac biomarkers and histological analysis. It was interesting to notice that, during cotreatment with DOX and RNPs, the DOX uptake was significantly enhanced in the cancer cells, but not in healthy aortic endothelial cells in vitro. Treatment with RNPs also improved anticancer efficacy of DOX in the colitis-associated colon cancer model mice in vivo. On the basis of these results, a combination of the novel antioxidative nanotherapeutics (RNPs) with conventional anticancer drugs seems to be a robust strategy for well-tolerated anticancer therapy.
""",
"""
Background: Sotorasib showed anticancer activity in patients with KRAS p.G12C-mutated advanced solid tumors in a phase 1 study, and particularly promising anticancer activity was observed in a subgroup of patients with non-small-cell lung cancer (NSCLC).
Methods: In a single-group, phase 2 trial, we investigated the activity of sotorasib, administered orally at a dose of 960 mg once daily, in patients with KRAS p.G12C-mutated advanced NSCLC previously treated with standard therapies. The primary end point was objective response (complete or partial response) according to independent central review. Key secondary end points included duration of response, disease control (defined as complete response, partial response, or stable disease), progression-free survival, overall survival, and safety. Exploratory biomarkers were evaluated for their association with response to sotorasib therapy.
Results: Among the 126 enrolled patients, the majority (81.0%) had previously received both platinum-based chemotherapy and inhibitors of programmed death 1 (PD-1) or programmed death ligand 1 (PD-L1). According to central review, 124 patients had measurable disease at baseline and were evaluated for response. An objective response was observed in 46 patients (37.1%; 95% confidence interval [CI], 28.6 to 46.2), including in 4 (3.2%) who had a complete response and in 42 (33.9%) who had a partial response. The median duration of response was 11.1 months (95% CI, 6.9 to could not be evaluated). Disease control occurred in 100 patients (80.6%; 95% CI, 72.6 to 87.2). The median progression-free survival was 6.8 months (95% CI, 5.1 to 8.2), and the median overall survival was 12.5 months (95% CI, 10.0 to could not be evaluated). Treatment-related adverse events occurred in 88 of 126 patients (69.8%), including grade 3 events in 25 patients (19.8%) and a grade 4 event in 1 (0.8%). Responses were observed in subgroups defined according to PD-L1 expression, tumor mutational burden, and co-occurring mutations in STK11, KEAP1, or TP53.
""",

"""
Background: KRAS G12C is a mutation that occurs in approximately 3 to 4% of patients with metastatic colorectal cancer. Monotherapy with KRAS G12C inhibitors has yielded only modest efficacy. Combining the KRAS G12C inhibitor sotorasib with panitumumab, an epidermal growth factor receptor (EGFR) inhibitor, may be an effective strategy.
Methods: In this phase 3, multicenter, open-label, randomized trial, we assigned patients with chemorefractory metastatic colorectal cancer with mutated KRAS G12C who had not received previous treatment with a KRAS G12C inhibitor to receive sotorasib at a dose of 960 mg once daily plus panitumumab (53 patients), sotorasib at a dose of 240 mg once daily plus panitumumab (53 patients), or the investigator's choice of trifluridine-tipiracil or regorafenib (standard care; 54 patients). The primary end point was progression-free survival as assessed by blinded independent central review according to the Response Evaluation Criteria in Solid Tumors, version 1.1. Key secondary end points were overall survival and objective response.
Results: After a median follow-up of 7.8 months (range, 0.1 to 13.9), the median progression-free survival was 5.6 months (95% confidence interval [CI], 4.2 to 6.3) and 3.9 months (95% CI, 3.7 to 5.8) in the 960-mg sotorasib-panitumumab and 240-mg sotorasib-panitumumab groups, respectively, as compared with 2.2 months (95% CI, 1.9 to 3.9) in the standard-care group. The hazard ratio for disease progression or death in the 960-mg sotorasib-panitumumab group as compared with the standard-care group was 0.49 (95% CI, 0.30 to 0.80; P = 0.006), and the hazard ratio in the 240-mg sotorasib-panitumumab group was 0.58 (95% CI, 0.36 to 0.93; P = 0.03). Overall survival data are maturing. The objective response was 26.4% (95% CI, 15.3 to 40.3), 5.7% (95% CI, 1.2 to 15.7), and 0% (95% CI, 0.0 to 6.6) in the 960-mg sotorasib-panitumumab, 240-mg sotorasib-panitumumab, and standard-care groups, respectively. Treatment-related adverse events of grade 3 or higher occurred in 35.8%, 30.2%, and 43.1% of patients, respectively. Skin-related toxic effects and hypomagnesemia were the most common adverse events observed with sotorasib-panitumumab.
Conclusions: In this phase 3 trial of a KRAS G12C inhibitor plus an EGFR inhibitor in patients with chemorefractory metastatic colorectal cancer, both doses of sotorasib in combination with panitumumab resulted in longer progression-free survival than standard treatment. Toxic effects were as expected for either agent alone and resulted in few discontinuations of treatment. (Funded by Amgen; CodeBreaK 300 ClinicalTrials.gov number, NCT05198934.).
"""
]

In [54]:
system_content = ["""
Extract the target gene name, associated drug names, and ClinicalTrials.gov number and give JSON.
"""]

In [55]:
for user_text in user_content:
    messages[0]['content'] = user_text
    messages[1]['content'] = system_content[0]
    
    response = client.chat.completions.create(
        model=MODEL,
        response_format={ "type": "json_object" },
        messages=messages
    )
    print(f'''----RESULT---- {response}''')

----RESULT---- ChatCompletion(id='chatcmpl-8wadqdx5J8ZLVT8EM27zqROfaHvIL', choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content='{\n  "target_gene": "ALK",\n  "associated_drug_names": ["crizotinib"],\n  "ClinicalTrials.gov_number": "NCT00585195"\n}', role='assistant', function_call=None, tool_calls=None))], created=1708974566, model='gpt-4-0125-preview', object='chat.completion', system_fingerprint='fp_89b1a570e1', usage=CompletionUsage(completion_tokens=39, prompt_tokens=168, total_tokens=207))
----RESULT---- ChatCompletion(id='chatcmpl-8wadtC7gTMDxkBKgAlQM6UJEcd6Ui', choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content='\n{\n  "target_gene_name": null,\n  "associated_drug_names": ["doxorubicin (DOX)"],\n  "ClinicalTrials.gov_number": null\n}', role='assistant', function_call=None, tool_calls=None))], created=1708974569, model='gpt-4-0125-preview', object='chat.completion', system_fingerprint
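The two results above use different key names for the gene (`target_gene` vs `target_gene_name`), which makes downstream parsing brittle. One option, sketched below, is to map known variants onto a fixed set of keys after parsing. The alias table and the name `normalize_keys` are assumptions for illustration; naming the exact keys in the system prompt would be another approach.

```python
import json

# Hypothetical alias table: variant key -> canonical key.
KEY_ALIASES = {
    "target_gene_name": "target_gene",
    "drug_name": "associated_drug_names",
}


def normalize_keys(content):
    """Parse a JSON answer and rename known key variants to canonical names."""
    parsed = json.loads(content)  # leading/trailing whitespace is ignored
    result = {KEY_ALIASES.get(key, key): value for key, value in parsed.items()}
    # Keep drug names as a list even when the model returned a single string.
    drugs = result.get("associated_drug_names")
    if isinstance(drugs, str):
        result["associated_drug_names"] = [drugs]
    return result


# `content` string copied from the second result above.
second = '\n{\n  "target_gene_name": null,\n  "associated_drug_names": ["doxorubicin (DOX)"],\n  "ClinicalTrials.gov_number": null\n}'
print(normalize_keys(second))
```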

## Test use cases to see if any drug-target interaction type can also be extracted from the abstract

In [3]:
MODEL = 'gpt-4-turbo-preview'

# Placeholders; the content of each message is filled in inside the loop below.
messages = [{"role": "user", "content": ""},
            {"role": "system", "content": ""}]

In [4]:
system_content = ["""
Extract the target gene name, associated drug names, type of drug mechanism or interaction, and ClinicalTrials.gov number as JSON
"""]

In [5]:
user_content = [
"""
Background: Sotorasib is a specific, irreversible inhibitor of the GTPase protein, KRASG12C. 
We compared the efficacy and safety of sotorasib with a standard-of-care treatment in patients with non-small-cell lung cancer (NSCLC) with the KRASG12C mutation who had been previously treated with other anticancer drugs.
"""
    
]

**This one example shows that the model can also extract the type of drug-target interaction. 
Next, test another example where the interaction is worded differently.**

In [11]:
for user_text in user_content:
    messages[0]['content'] = user_text
    messages[1]['content'] = system_content[0]
    
    response = client.chat.completions.create(
        model=MODEL,
        response_format={ "type": "json_object" },
        messages=messages
    )
    print(f'''----RESULT---- {response}''')

----RESULT---- ChatCompletion(id='chatcmpl-8wuvDO53HHEH4kvYtUjwzdqlIRkEj', choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content='{\n  "target_gene_name": "KRASG12C",\n  "associated_drug_names": ["Sotorasib"],\n  "type_of_drug_mechanism": "Specific, irreversible inhibitor of the GTPase protein",\n  "ClinicalTrials.gov_number": null\n}', role='assistant', function_call=None, tool_calls=None))], created=1709052523, model='gpt-4-0125-preview', object='chat.completion', system_fingerprint='fp_91aa3742b1', usage=CompletionUsage(completion_tokens=60, prompt_tokens=114, total_tokens=174))


**Here too, the model extracted the correct drug and gene names.**

In [12]:
user_content = [
"""
Lung adenocarcinoma (LUAD) is the most common lung cancer, with high mortality. 
As a tumor-suppressor gene, JWA plays an important role in blocking pan-tumor progression. 
JAC4, a small molecular-compound agonist, transcriptionally activates JWA expression both in vivo and in vitro.
"""
    
]

In [13]:
for user_text in user_content:
    messages[0]['content'] = user_text
    messages[1]['content'] = system_content[0]
    
    response = client.chat.completions.create(
        model=MODEL,
        response_format={ "type": "json_object" },
        messages=messages
    )
    print(f'''----RESULT---- {response}''')

----RESULT---- ChatCompletion(id='chatcmpl-8wuzc0WQUb3FOcxnCIhJT6e3H2cWR', choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content='{\n  "target_gene": "JWA",\n  "associated_drug_names": [\n    "JAC4"\n  ],\n  "type_of_drug_mechanism_or_interaction": "agonist",\n  "ClinicalTrials.gov_number": null\n}', role='assistant', function_call=None, tool_calls=None))], created=1709052796, model='gpt-4-0125-preview', object='chat.completion', system_fingerprint='fp_91aa3742b1', usage=CompletionUsage(completion_tokens=53, prompt_tokens=108, total_tokens=161))


## Modify the prompt for it to return pairs or a set consisting of {gene, drug, interaction, clinical trial}. There should be a separate set if multiple entities are present

First, test the usual prompt.
The output is incomplete: it reports only AXL as the target gene and lists vemurafenib next to BGB324 without linking it to its own target, BRAF.

In [14]:
MODEL = 'gpt-4-turbo-preview'

messages = [{"role": "user", "content": """"""},
            {'role': 'system',"content": """"""}]

system_content = ["""
Extract the target gene name, associated drug names, type of drug mechanism or interaction, and ClinicalTrials.gov number as JSON
"""]

user_content = [
"""
More than half of metastatic melanoma patients receiving standard therapy fail to achieve a long-term survival due to primary and/or acquired resistance. 
Tumor cell ability to switch from epithelial to a more aggressive mesenchymal phenotype, attributed with AXLhigh molecular profile in melanoma, has been recently linked to such event, limiting treatment efficacy. 
In the current study, we investigated the therapeutic potential of the AXL inhibitor (AXLi) BGB324 alone or in combination with the clinically relevant BRAF inhibitor (BRAFi) vemurafenib. 
Firstly, AXL was shown to be expressed in majority of melanoma lymph node metastases.
When treated ex vivo, the largest reduction in cell viability was observed when the two drugs were combined. 
In addition, a therapeutic benefit of adding AXLi to the BRAF-targeted therapy was observed in pre-clinical AXLhigh melanoma models in vitro and in vivo. When searching for mechanistic insights, AXLi was found to potentiate BRAFi-induced apoptosis, stimulate ferroptosis and inhibit autophagy. Altogether, our findings propose AXLi as a promising treatment in combination with standard therapy to improve therapeutic outcome in metastatic melanoma.
"""
    
]

for user_text in user_content:
    messages[0]['content'] = user_text
    messages[1]['content'] = system_content[0]
    
    response = client.chat.completions.create(
        model=MODEL,
        response_format={ "type": "json_object" },
        messages=messages
    )
    print(f'''----RESULT---- {response}''')

----RESULT---- ChatCompletion(id='chatcmpl-8wvEkFhdoII94FjUa1DL3rlxNpNO5', choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content='{\n  "target_gene_name": "AXL",\n  "associated_drug_names": [\n    "BGB324",\n    "vemurafenib"\n  ],\n  "type_of_drug_mechanism_or_interaction": [\n    "AXLi (AXL inhibitor) potentiates BRAFi (BRAF inhibitor)-induced apoptosis",\n    "stimulates ferroptosis",\n    "inhibits autophagy"\n  ],\n  "ClinicalTrials.gov_number": null\n}', role='assistant', function_call=None, tool_calls=None))], created=1709053734, model='gpt-4-0125-preview', object='chat.completion', system_fingerprint='fp_91aa3742b1', usage=CompletionUsage(completion_tokens=103, prompt_tokens=291, total_tokens=394))


**Try changing the prompt instruction to get the intended output.
Asking the model to return a separate set for each gene-drug combination works.
The output could be cleaner and more concise with a one-shot approach, which should be the next test.**

In [17]:
system_content = ["""
Return a JSON output with a set consisting of (target gene name,drug name, drug-target interaction, ClinicalTrials.gov number).
Return multiple sets if more than 1 gene-drug combinations are present in the text.
"""]


**It extracts multiple drug-target combinations correctly for the first article. 
However, for the second article it returns inaccurate information and gives the drug type as the drug name.
It also simply reproduces a sentence from the paper to describe the type of drug interaction.
This presents an opportunity for:**
1. Experimenting with the system instruction within the prompt.
2. Testing if a one-shot approach gives a cleaner output.
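
A one-shot prompt could be sketched as below. It places a single worked example (a user abstract followed by the desired assistant JSON) between the system instruction and the real abstract. The example reuses the sotorasib sentence from the earlier test. The short key names (`gene`, `drug`, `interaction`, `nct`), the `sets` wrapper in the example answer, and the helper name `build_one_shot_messages` are assumptions made for illustration and have not been tested in this notebook.

```python
import json

SYSTEM_INSTRUCTION = (
    "Return a JSON output with a set consisting of (target gene name, drug name, "
    "drug-target interaction, ClinicalTrials.gov number). "
    "Return multiple sets if more than 1 gene-drug combination is present in the text."
)

# Hypothetical worked example shown to the model once (the "one shot").
EXAMPLE_ABSTRACT = (
    "Background: Sotorasib is a specific, irreversible inhibitor of the GTPase "
    "protein, KRASG12C."
)
EXAMPLE_ANSWER = {
    "sets": [
        {"gene": "KRASG12C", "drug": "sotorasib", "interaction": "inhibitor", "nct": None}
    ]
}


def build_one_shot_messages(abstract):
    """Return chat messages: system instruction, one example exchange, then the real abstract."""
    return [
        {"role": "system", "content": SYSTEM_INSTRUCTION},
        {"role": "user", "content": EXAMPLE_ABSTRACT},
        {"role": "assistant", "content": json.dumps(EXAMPLE_ANSWER)},
        {"role": "user", "content": abstract},
    ]


messages = build_one_shot_messages(
    "JAC4, a small molecular-compound agonist, transcriptionally activates "
    "JWA expression both in vivo and in vitro."
)
```

The resulting `messages` list can be passed to `client.chat.completions.create` in the same way as in the other cells. Whether the example answer actually makes the output more concise would still need to be checked against the abstracts above.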

In [25]:
MODEL = 'gpt-4-turbo-preview'

messages = [{"role": "user", "content": """"""},
            {'role': 'system',"content": """"""}]

user_content = [
"""
More than half of metastatic melanoma patients receiving standard therapy fail to achieve a long-term survival due to primary and/or acquired resistance. 
Tumor cell ability to switch from epithelial to a more aggressive mesenchymal phenotype, attributed with AXLhigh molecular profile in melanoma, has been recently linked to such event, limiting treatment efficacy. 
In the current study, we investigated the therapeutic potential of the AXL inhibitor (AXLi) BGB324 alone or in combination with the clinically relevant BRAF inhibitor (BRAFi) vemurafenib. 
Firstly, AXL was shown to be expressed in majority of melanoma lymph node metastases.
When treated ex vivo, the largest reduction in cell viability was observed when the two drugs were combined. 
In addition, a therapeutic benefit of adding AXLi to the BRAF-targeted therapy was observed in pre-clinical AXLhigh melanoma models in vitro and in vivo. When searching for mechanistic insights, AXLi was found to potentiate BRAFi-induced apoptosis, stimulate ferroptosis and inhibit autophagy. Altogether, our findings propose AXLi as a promising treatment in combination with standard therapy to improve therapeutic outcome in metastatic melanoma.
"""
    
]

for user_text in user_content:
    messages[0]['content'] = user_text
    messages[1]['content'] = system_content[0]
    
    response = client.chat.completions.create(
        model=MODEL,
        response_format={ "type": "json_object" },
        messages=messages
    )
    print(f'''----RESULT---- {response}''')

----RESULT---- ChatCompletion(id='chatcmpl-8wwTTSJhidKa2PRZamGg8Wn0USUCf', choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content='{\n  "gene-drug combinations": [\n    {\n      "target gene name": "AXL",\n      "drug name": "BGB324",\n      "drug-target interaction": "AXL inhibitor (AXLi)",\n      "ClinicalTrials.gov number": null\n    },\n    {\n      "target gene name": "BRAF",\n      "drug name": "vemurafenib",\n      "drug-target interaction": "BRAF inhibitor (BRAFi)",\n      "ClinicalTrials.gov number": null\n    }\n  ]\n}', role='assistant', function_call=None, tool_calls=None))], created=1709058491, model='gpt-4-0125-preview', object='chat.completion', system_fingerprint='fp_91aa3742b1', usage=CompletionUsage(completion_tokens=113, prompt_tokens=310, total_tokens=423))


* Entities captured correctly.

In [26]:
print(response.choices[0].message.content)

{
  "gene-drug combinations": [
    {
      "target gene name": "AXL",
      "drug name": "BGB324",
      "drug-target interaction": "AXL inhibitor (AXLi)",
      "ClinicalTrials.gov number": null
    },
    {
      "target gene name": "BRAF",
      "drug name": "vemurafenib",
      "drug-target interaction": "BRAF inhibitor (BRAFi)",
      "ClinicalTrials.gov number": null
    }
  ]
}


In [27]:
MODEL = 'gpt-4-turbo-preview'

messages = [{"role": "user", "content": """"""},
            {'role': 'system',"content": """"""}]

user_content = [
"""
Triple-negative breast cancers (TNBC) frequently inactivate p53, increasing their aggressiveness and therapy resistance. We identified an unexpected protein vulnerability in p53-inactivated TNBC and designed a new PROteolysis TArgeting Chimera (PROTAC) to target it. Our PROTAC selectively targets MDM2 for proteasome-mediated degradation with high-affinity binding and VHL recruitment. MDM2 loss in p53 mutant/deleted TNBC cells in two-dimensional/three-dimensional culture and TNBC patient explants, including relapsed tumors, causes apoptosis while sparing normal cells. Our MDM2-PROTAC is stable in vivo, and treatment of TNBC xenograft-bearing mice demonstrates tumor on-target efficacy with no toxicity to normal cells, significantly extending survival. Transcriptomic analyses revealed upregulation of p53 family target genes. Investigations showed activation and a required role for TAp73 to mediate MDM2-PROTAC-induced apoptosis. Our data, challenging the current MDM2/p53 paradigm, show MDM2 is required for p53-inactivated TNBC cell survival, and PROTAC-targeted MDM2 degradation is an innovative potential therapeutic strategy for TNBC and superior to existing MDM2 inhibitors.
"""
    
]

for user_text in user_content:
    messages[0]['content'] = user_text
    messages[1]['content'] = system_content[0]
    
    response = client.chat.completions.create(
        model=MODEL,
        response_format={ "type": "json_object" },
        messages=messages
    )
    print(f'''----RESULT---- {response}''')

----RESULT---- ChatCompletion(id='chatcmpl-8wwTvHEIUzOfazMNWe66ycUfjgiq8', choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content='{\n  "gene-drug combinations": [\n    {\n      "target gene name": "MDM2",\n      "drug name": "MDM2-PROTAC",\n      "drug-target interaction": "PROTAC selectively targets MDM2 for proteasome-mediated degradation with high-affinity binding and VHL recruitment",\n      "ClinicalTrials.gov number": null\n    }\n  ]\n}', role='assistant', function_call=None, tool_calls=None))], created=1709058519, model='gpt-4-0125-preview', object='chat.completion', system_fingerprint='fp_8c864dca93', usage=CompletionUsage(completion_tokens=82, prompt_tokens=312, total_tokens=394))


* Entities captured incorrectly.

In [28]:
print(response.choices[0].message.content)

{
  "gene-drug combinations": [
    {
      "target gene name": "MDM2",
      "drug name": "MDM2-PROTAC",
      "drug-target interaction": "PROTAC selectively targets MDM2 for proteasome-mediated degradation with high-affinity binding and VHL recruitment",
      "ClinicalTrials.gov number": null
    }
  ]
}


## Can multiple abstracts be sent in 1 call with just one system instruction?

Answer: Yes
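
The message list for such a call can also be assembled programmatically rather than written out by hand. The following is a minimal sketch; the helper name `build_batch_messages` and the shortened example strings are assumptions made for illustration.

```python
def build_batch_messages(system_instruction, abstracts):
    """Return one system message followed by one user message per abstract."""
    messages = [{"role": "system", "content": system_instruction.strip()}]
    for abstract in abstracts:
        messages.append({"role": "user", "content": abstract.strip()})
    return messages


abstracts = [
    "Lung adenocarcinoma (LUAD) is the most common lung cancer, with high mortality.",
    "Sotorasib is a specific, irreversible inhibitor of the GTPase protein, KRASG12C.",
]
messages = build_batch_messages("Assemble all sets and produce 1 final JSON output.", abstracts)
print(len(messages))  # 3
```

The returned list has the same shape as the hand-written `messages` in the cell below, so it can be passed to the API call unchanged.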

In [29]:
MODEL = 'gpt-4-turbo-preview'

messages = [
    {"role": "system","content": """
    For each of the user text delimited by triple quotes, do the following:
    1. Return a set consisting of (target gene name,drug name, drug-target interaction, ClinicalTrials.gov number).
    2. Return multiple sets if more than 1 gene-drug combinations are present in the text.
    3. Assemble all sets and produce 1 final JSON output.
    """ },
    {
        "role": "user", "content": """
        More than half of metastatic melanoma patients receiving standard therapy fail to achieve a long-term survival due to primary and/or acquired resistance. 
Tumor cell ability to switch from epithelial to a more aggressive mesenchymal phenotype, attributed with AXLhigh molecular profile in melanoma, has been recently linked to such event, limiting treatment efficacy. 
In the current study, we investigated the therapeutic potential of the AXL inhibitor (AXLi) BGB324 alone or in combination with the clinically relevant BRAF inhibitor (BRAFi) vemurafenib. 
Firstly, AXL was shown to be expressed in majority of melanoma lymph node metastases.
When treated ex vivo, the largest reduction in cell viability was observed when the two drugs were combined. 
In addition, a therapeutic benefit of adding AXLi to the BRAF-targeted therapy was observed in pre-clinical AXLhigh melanoma models in vitro and in vivo. When searching for mechanistic insights, AXLi was found to potentiate BRAFi-induced apoptosis, stimulate ferroptosis and inhibit autophagy. Altogether, our findings propose AXLi as a promising treatment in combination with standard therapy to improve therapeutic outcome in metastatic melanoma.
        """
    },
    {
        "role": "user", "content": """
        Lung adenocarcinoma (LUAD) is the most common lung cancer, with high mortality. 
As a tumor-suppressor gene, JWA plays an important role in blocking pan-tumor progression. 
JAC4, a small molecular-compound agonist, transcriptionally activates JWA expression both in vivo and in vitro.
        """
    }
]

response = client.chat.completions.create(
        model=MODEL,
        response_format={ "type": "json_object" },
        messages=messages
    )

In [35]:
response

ChatCompletion(id='chatcmpl-8wzCWPeaRKazsJa7W5gRPuqgGPj09', choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content='{\n  "sets": [\n    {\n      "target gene name": "AXL",\n      "drug name": "BGB324",\n      "drug-target interaction": "inhibitor",\n      "ClinicalTrials.gov number": null\n    },\n    {\n      "target gene name": "BRAF",\n      "drug name": "vemurafenib",\n      "drug-target interaction": "inhibitor",\n      "ClinicalTrials.gov number": null\n    },\n    {\n      "target gene name": "JWA",\n      "drug name": "JAC4",\n      "drug-target interaction": "agonist",\n      "ClinicalTrials.gov number": null\n    }\n  ]\n}', role='assistant', function_call=None, tool_calls=None))], created=1709068972, model='gpt-4-0125-preview', object='chat.completion', system_fingerprint='fp_89b1a570e1', usage=CompletionUsage(completion_tokens=146, prompt_tokens=424, total_tokens=570))

In [31]:
print(response.choices[0].message.content)

{
  "sets": [
    {
      "target gene name": "AXL",
      "drug name": "BGB324",
      "drug-target interaction": "inhibitor",
      "ClinicalTrials.gov number": null
    },
    {
      "target gene name": "BRAF",
      "drug name": "vemurafenib",
      "drug-target interaction": "inhibitor",
      "ClinicalTrials.gov number": null
    },
    {
      "target gene name": "JWA",
      "drug name": "JAC4",
      "drug-target interaction": "agonist",
      "ClinicalTrials.gov number": null
    }
  ]
}


## How to extract details from the Chat completion object?

* In addition to the results, it's useful to extract the model and usage details.
* When building the main code that processes abstracts in batches, assign each batch an ID to track which PMIDs were processed in it, and record the model and token usage for each batch. This metadata would help estimate future costs from abstract lengths.
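
The batch tracking idea above could be sketched as follows, assuming the response has already been converted to a dict with `model` and `usage` keys. The function name `batch_record`, the row layout, and the placeholder PMIDs are assumptions for illustration only.

```python
def batch_record(batch_id, pmids, result):
    """Flatten one parsed response into a row of batch metadata."""
    usage = result["usage"]
    return {
        "batch_id": batch_id,
        "pmids": list(pmids),
        "model": result["model"],
        "prompt_tokens": usage["prompt_tokens"],
        "completion_tokens": usage["completion_tokens"],
        "total_tokens": usage["total_tokens"],
    }


# Trimmed response dict; the PMIDs are placeholders, not real identifiers.
example_result = {
    "model": "gpt-4-0125-preview",
    "usage": {"completion_tokens": 146, "prompt_tokens": 424, "total_tokens": 570},
}
row = batch_record(batch_id=1, pmids=["PMID_A", "PMID_B"], result=example_result)
print(row["total_tokens"])  # 570
```

A list of such rows can be turned into a table with `pd.DataFrame(rows)` to relate token usage to the abstracts in each batch.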

**The response can be dumped to JSON and loaded as a Python dict.**

In [58]:
result = json.loads(response.model_dump_json())

* Model string can be extracted like:

In [81]:
result['model']

'gpt-4-0125-preview'

* Usage details can be extracted like:

In [82]:
result['usage']

{'completion_tokens': 146, 'prompt_tokens': 424, 'total_tokens': 570}

**The main results are contained within 'message' key in the 'choices' list.**

In [80]:
result

{'id': 'chatcmpl-8wzCWPeaRKazsJa7W5gRPuqgGPj09',
 'choices': [{'finish_reason': 'stop',
   'index': 0,
   'logprobs': None,
   'message': {'content': '{\n  "sets": [\n    {\n      "target gene name": "AXL",\n      "drug name": "BGB324",\n      "drug-target interaction": "inhibitor",\n      "ClinicalTrials.gov number": null\n    },\n    {\n      "target gene name": "BRAF",\n      "drug name": "vemurafenib",\n      "drug-target interaction": "inhibitor",\n      "ClinicalTrials.gov number": null\n    },\n    {\n      "target gene name": "JWA",\n      "drug name": "JAC4",\n      "drug-target interaction": "agonist",\n      "ClinicalTrials.gov number": null\n    }\n  ]\n}',
    'role': 'assistant',
    'function_call': None,
    'tool_calls': None}}],
 'created': 1709068972,
 'model': 'gpt-4-0125-preview',
 'object': 'chat.completion',
 'system_fingerprint': 'fp_89b1a570e1',
 'usage': {'completion_tokens': 146,
  'prompt_tokens': 424,
  'total_tokens': 570}}

**Extract the results into a df.**

In [75]:
data_dict = json.loads(result['choices'][0]['message']['content'])

In [77]:
data_dict['sets']

[{'target gene name': 'AXL',
  'drug name': 'BGB324',
  'drug-target interaction': 'inhibitor',
  'ClinicalTrials.gov number': None},
 {'target gene name': 'BRAF',
  'drug name': 'vemurafenib',
  'drug-target interaction': 'inhibitor',
  'ClinicalTrials.gov number': None},
 {'target gene name': 'JWA',
  'drug name': 'JAC4',
  'drug-target interaction': 'agonist',
  'ClinicalTrials.gov number': None}]

In [79]:
pd.DataFrame(data_dict['sets'])

| | target gene name | drug name | drug-target interaction | ClinicalTrials.gov number |
|---|---|---|---|---|
| 0 | AXL | BGB324 | inhibitor | None |
| 1 | BRAF | vemurafenib | inhibitor | None |
| 2 | JWA | JAC4 | agonist | None |
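
The extraction above relies on the top-level key being `sets`, while an earlier response used `gene-drug combinations`. A slightly more defensive version is sketched below. It assumes the parsed dict layout shown above; the helper name `extract_sets` and the rule of taking the first list-valued entry are assumptions, not part of the original workflow.

```python
import json


def extract_sets(result):
    """Return the list of extracted sets from a parsed chat-completion dict."""
    choice = result["choices"][0]
    # A finish_reason other than 'stop' (e.g. 'length') means the JSON may be cut off.
    if choice["finish_reason"] != "stop":
        raise ValueError(f"incomplete response: finish_reason={choice['finish_reason']!r}")
    payload = json.loads(choice["message"]["content"])
    # The top-level key varies between runs, so take the first list-valued entry.
    for value in payload.values():
        if isinstance(value, list):
            return value
    return []


example = {
    "choices": [{
        "finish_reason": "stop",
        "message": {"content": '{"gene-drug combinations": [{"drug name": "BGB324"}]}'},
    }]
}
print(extract_sets(example))  # [{'drug name': 'BGB324'}]
```

The returned list can be passed straight to `pd.DataFrame`, as in the cell above.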


## Test extraction of the disease a drug is tested in, in addition to target, drug, and clinical trial number

**An important note: although it would be useful to also collect the specific cancer subtype, doing so increases the output token count considerably.**
* Creating a unique set for each disease tested with a gene-drug combination generates 478 tokens.
* Also, for the last paper, it missed some of the diseases that the drug was tested in.
* The last paper does not explicitly state that vemurafenib is a BRAF inhibitor, but the model returned that interaction anyway. To avoid such inaccuracies, state in the prompt that a gene-drug interaction must be explicitly mentioned in the text.

In [91]:
MODEL = 'gpt-4-turbo-preview'

messages = [
    {"role": "system","content": """
    You are a helpful biologist and data scientist.
    For each of the user text delimited by triple quotes, do the following:
    1. Return a set consisting of (target gene name,drug name, drug-target interaction, ClinicalTrials.gov number, cancer type the drug is tested in).
    2. Return multiple sets if more than 1 gene-drug combinations-cancer are present in the text.
    3. Identify the relationships accurately to construct each set rather than just extracting information.
    4. Assemble all sets and produce 1 final JSON output.
    """ },
    {
        "role": "user", "content": """
        More than half of metastatic melanoma patients receiving standard therapy fail to achieve a long-term survival due to primary and/or acquired resistance. 
Tumor cell ability to switch from epithelial to a more aggressive mesenchymal phenotype, attributed with AXLhigh molecular profile in melanoma, has been recently linked to such event, limiting treatment efficacy. 
In the current study, we investigated the therapeutic potential of the AXL inhibitor (AXLi) BGB324 alone or in combination with the clinically relevant BRAF inhibitor (BRAFi) vemurafenib. 
Firstly, AXL was shown to be expressed in majority of melanoma lymph node metastases.
When treated ex vivo, the largest reduction in cell viability was observed when the two drugs were combined. 
In addition, a therapeutic benefit of adding AXLi to the BRAF-targeted therapy was observed in pre-clinical AXLhigh melanoma models in vitro and in vivo. When searching for mechanistic insights, AXLi was found to potentiate BRAFi-induced apoptosis, stimulate ferroptosis and inhibit autophagy. Altogether, our findings propose AXLi as a promising treatment in combination with standard therapy to improve therapeutic outcome in metastatic melanoma.
        """
    },
    {
        "role": "user", "content": """
        Lung adenocarcinoma (LUAD) is the most common lung cancer, with high mortality. 
As a tumor-suppressor gene, JWA plays an important role in blocking pan-tumor progression. 
JAC4, a small molecular-compound agonist, transcriptionally activates JWA expression both in vivo and in vitro.
        """
    },
    {
        "role": "user", "content": """
        Background: BRAF V600 mutations occur in various nonmelanoma cancers. We undertook a histology-independent phase 2 "basket" study of vemurafenib in BRAF V600 mutation-positive nonmelanoma cancers.
        Methods: We enrolled patients in six prespecified cancer cohorts; patients with all other tumor types were enrolled in a seventh cohort. A total of 122 patients with BRAF V600 mutation-positive cancer were treated, including 27 patients with colorectal cancer who received vemurafenib and cetuximab. The primary end point was the response rate; secondary end points included progression-free and overall survival.
        Results: In the cohort with non-small-cell lung cancer, the response rate was 42% (95% confidence interval [CI], 20 to 67) and median progression-free survival was 7.3 months (95% CI, 3.5 to 10.8). In the cohort with Erdheim-Chester disease or Langerhans'-cell histiocytosis, the response rate was 43% (95% CI, 18 to 71); the median treatment duration was 5.9 months (range, 0.6 to 18.6), and no patients had disease progression during therapy. There were anecdotal responses among patients with pleomorphic xanthoastrocytoma, anaplastic thyroid cancer, cholangiocarcinoma, salivary-duct cancer, ovarian cancer, and clear-cell sarcoma and among patients with colorectal cancer who received vemurafenib and cetuximab. Safety was similar to that in prior studies of vemurafenib for melanoma.
        Conclusions: BRAF V600 appears to be a targetable oncogene in some, but not all, nonmelanoma cancers. Preliminary vemurafenib activity was observed in non-small-cell lung cancer and in Erdheim-Chester disease and Langerhans'-cell histiocytosis. The histologic context is an important determinant of response in BRAF V600-mutated cancers. (Funded by F. Hoffmann-La Roche/Genentech; ClinicalTrials.gov number, NCT01524978.).
        """
    }
]

response = client.chat.completions.create(
        model=MODEL,
        response_format={ "type": "json_object" },
        messages=messages
    )

In [92]:
result = json.loads(response.model_dump_json())

In [93]:
result

{'id': 'chatcmpl-8x2Aa7KXIdUNTptebMgIl2ezC8FZI',
 'choices': [{'finish_reason': 'stop',
   'index': 0,
   'logprobs': None,
   'message': {'content': '{\n  "sets": [\n    {\n      "target gene name": "AXL",\n      "drug name": "BGB324",\n      "drug-target interaction": "inhibitor",\n      "ClinicalTrials.gov number": null,\n      "cancer type tested in": "metastatic melanoma"\n    },\n    {\n      "target gene name": "BRAF",\n      "drug name": "vemurafenib",\n      "drug-target interaction": "inhibitor",\n      "ClinicalTrials.gov number": null,\n      "cancer type tested in": "metastatic melanoma"\n    },\n    {\n      "target gene name": "JWA",\n      "drug name": "JAC4",\n      "drug-target interaction": "agonist",\n      "ClinicalTrials.gov number": null,\n      "cancer type tested in": "lung adenocarcinoma"\n    },\n    {\n      "target gene name": "BRAF V600",\n      "drug name": "vemurafenib",\n      "drug-target interaction": "inhibitor",\n      "ClinicalTrials.gov number": "

In [94]:
data_dict = json.loads(result['choices'][0]['message']['content'])

In [95]:
data_dict['sets']

[{'target gene name': 'AXL',
  'drug name': 'BGB324',
  'drug-target interaction': 'inhibitor',
  'ClinicalTrials.gov number': None,
  'cancer type tested in': 'metastatic melanoma'},
 {'target gene name': 'BRAF',
  'drug name': 'vemurafenib',
  'drug-target interaction': 'inhibitor',
  'ClinicalTrials.gov number': None,
  'cancer type tested in': 'metastatic melanoma'},
 {'target gene name': 'JWA',
  'drug name': 'JAC4',
  'drug-target interaction': 'agonist',
  'ClinicalTrials.gov number': None,
  'cancer type tested in': 'lung adenocarcinoma'},
 {'target gene name': 'BRAF V600',
  'drug name': 'vemurafenib',
  'drug-target interaction': 'inhibitor',
  'ClinicalTrials.gov number': 'NCT01524978',
  'cancer type tested in': 'non-small-cell lung cancer'},
 {'target gene name': 'BRAF V600',
  'drug name': 'vemurafenib',
  'drug-target interaction': 'inhibitor',
  'ClinicalTrials.gov number': 'NCT01524978',
  'cancer type tested in': 'Erdheim-Chester disease'},
 {'target gene name': 'BRAF

**Modifying the prompt to decrease output size.**

In [96]:
MODEL = 'gpt-4-turbo-preview'

messages = [
    {"role": "system","content": """
    You are a helpful biologist and data scientist.
    For each of the user text delimited by triple quotes, do the following:
    1. Return a set consisting of (target gene name,drug name, drug-target interaction, ClinicalTrials.gov number, all diseases that the drug is tested in).
    2. Return multiple sets if more than 1 gene-drug combinations are present in the text.
    3. Identify the relationships accurately to construct each set rather than just extracting information.
    4. Assemble all sets and produce 1 final JSON output.
    """ },
    {
        "role": "user", "content": """
        More than half of metastatic melanoma patients receiving standard therapy fail to achieve a long-term survival due to primary and/or acquired resistance. 
Tumor cell ability to switch from epithelial to a more aggressive mesenchymal phenotype, attributed with AXLhigh molecular profile in melanoma, has been recently linked to such event, limiting treatment efficacy. 
In the current study, we investigated the therapeutic potential of the AXL inhibitor (AXLi) BGB324 alone or in combination with the clinically relevant BRAF inhibitor (BRAFi) vemurafenib. 
Firstly, AXL was shown to be expressed in majority of melanoma lymph node metastases.
When treated ex vivo, the largest reduction in cell viability was observed when the two drugs were combined. 
In addition, a therapeutic benefit of adding AXLi to the BRAF-targeted therapy was observed in pre-clinical AXLhigh melanoma models in vitro and in vivo. When searching for mechanistic insights, AXLi was found to potentiate BRAFi-induced apoptosis, stimulate ferroptosis and inhibit autophagy. Altogether, our findings propose AXLi as a promising treatment in combination with standard therapy to improve therapeutic outcome in metastatic melanoma.
        """
    },
    {
        "role": "user", "content": """
        Lung adenocarcinoma (LUAD) is the most common lung cancer, with high mortality. 
As a tumor-suppressor gene, JWA plays an important role in blocking pan-tumor progression. 
JAC4, a small molecular-compound agonist, transcriptionally activates JWA expression both in vivo and in vitro.
        """
    },
    {
        "role": "user", "content": """
        Background: BRAF V600 mutations occur in various nonmelanoma cancers. We undertook a histology-independent phase 2 "basket" study of vemurafenib in BRAF V600 mutation-positive nonmelanoma cancers.
        Methods: We enrolled patients in six prespecified cancer cohorts; patients with all other tumor types were enrolled in a seventh cohort. A total of 122 patients with BRAF V600 mutation-positive cancer were treated, including 27 patients with colorectal cancer who received vemurafenib and cetuximab. The primary end point was the response rate; secondary end points included progression-free and overall survival.
        Results: In the cohort with non-small-cell lung cancer, the response rate was 42% (95% confidence interval [CI], 20 to 67) and median progression-free survival was 7.3 months (95% CI, 3.5 to 10.8). In the cohort with Erdheim-Chester disease or Langerhans'-cell histiocytosis, the response rate was 43% (95% CI, 18 to 71); the median treatment duration was 5.9 months (range, 0.6 to 18.6), and no patients had disease progression during therapy. There were anecdotal responses among patients with pleomorphic xanthoastrocytoma, anaplastic thyroid cancer, cholangiocarcinoma, salivary-duct cancer, ovarian cancer, and clear-cell sarcoma and among patients with colorectal cancer who received vemurafenib and cetuximab. Safety was similar to that in prior studies of vemurafenib for melanoma.
        Conclusions: BRAF V600 appears to be a targetable oncogene in some, but not all, nonmelanoma cancers. Preliminary vemurafenib activity was observed in non-small-cell lung cancer and in Erdheim-Chester disease and Langerhans'-cell histiocytosis. The histologic context is an important determinant of response in BRAF V600-mutated cancers. (Funded by F. Hoffmann-La Roche/Genentech; ClinicalTrials.gov number, NCT01524978.).
        """
    }
]

response = client.chat.completions.create(
        model=MODEL,
        response_format={ "type": "json_object" },
        messages=messages
    )

**Here, the output size decreased from 478 completion tokens with the previous prompt to 433 with the new one. For the last paper, it captured the disease names that the previous prompt missed. 
However, it also returned an inaccurate combination: cetuximab paired with BRAF, which shouldn't have been captured.
Next, try modifying the prompt to get complete and accurate outputs.**

In [97]:
result = json.loads(response.model_dump_json())

In [98]:
result['usage']

{'completion_tokens': 433, 'prompt_tokens': 925, 'total_tokens': 1358}
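
The usage dict above can be converted into a rough cost estimate. The sketch below defaults to the GPT4 prices quoted in the notes section at the top of this notebook (\$0.03 per 1K input tokens, \$0.06 per 1K output tokens). The rate for the `gpt-4-turbo-preview` model used here may differ, so the defaults are placeholders to be replaced with current pricing. The function name `estimate_cost` is an assumption.

```python
def estimate_cost(usage, input_price_per_1k=0.03, output_price_per_1k=0.06):
    """Estimate the dollar cost of one call from its token usage."""
    input_cost = usage["prompt_tokens"] / 1000 * input_price_per_1k
    output_cost = usage["completion_tokens"] / 1000 * output_price_per_1k
    return input_cost + output_cost


usage = {"completion_tokens": 433, "prompt_tokens": 925, "total_tokens": 1358}
print(f"${estimate_cost(usage):.4f}")  # $0.0537
```

Because output tokens are priced higher than input tokens, shortening the requested output has a larger effect per token than shortening the prompt.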

In [99]:
data_dict = json.loads(result['choices'][0]['message']['content'])
data_dict['sets']

[{'target gene name': 'AXL',
  'drug name': 'BGB324',
  'drug-target interaction': 'inhibitor',
  'ClinicalTrials.gov number': '',
  'all diseases that the drug is tested in': ['metastatic melanoma']},
 {'target gene name': 'BRAF',
  'drug name': 'vemurafenib',
  'drug-target interaction': 'inhibitor',
  'ClinicalTrials.gov number': '',
  'all diseases that the drug is tested in': ['metastatic melanoma']},
 {'target gene name': 'JWA',
  'drug name': 'JAC4',
  'drug-target interaction': 'agonist',
  'ClinicalTrials.gov number': '',
  'all diseases that the drug is tested in': ['Lung adenocarcinoma']},
 {'target gene name': 'BRAF',
  'drug name': 'vemurafenib',
  'drug-target interaction': '',
  'ClinicalTrials.gov number': 'NCT01524978',
  'all diseases that the drug is tested in': ['non-small-cell lung cancer',
   'Erdheim-Chester disease',
   "Langerhans'-cell histiocytosis",
   'pleomorphic xanthoastrocytoma',
   'anaplastic thyroid cancer',
   'cholangiocarcinoma',
   'salivary-duct

**Testing different prompts for this one use case to resolve the following:**
1. A set should pair a drug with a gene only if the text explicitly describes the drug as acting on that target gene.
2. It should pull all diseases that the drug is tested in.

**Test different system instructions on just this one abstract, as it has multiple gene, drug, and disease names. Then re-test the final selected prompt on all the other text chunks used in the cells above.**
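Since the same abstract is re-sent with several system instructions, a small helper can keep the cells shorter. The sketch below is one possible way to assemble the `messages` list; `build_messages` and its `examples` parameter are names introduced here, not part of the OpenAI client. It wraps each user text in literal triple quotes, because in the cells of this notebook the `"""` are Python string delimiters and are not part of the text that is sent, even though the system prompt refers to text "delimited by triple quotes".

```python
def build_messages(system_prompt, user_text, examples=()):
    """Assemble a chat `messages` list.

    `examples` is an optional sequence of (user_text, assistant_text) pairs
    that are inserted before the real query as few-shot demonstrations.
    """
    messages = [{"role": "system", "content": system_prompt}]
    for example_input, example_output in examples:
        # Wrap in literal triple quotes so the delimiter is actually sent.
        messages.append({"role": "user", "content": f'"""{example_input}"""'})
        messages.append({"role": "assistant", "content": example_output})
    messages.append({"role": "user", "content": f'"""{user_text}"""'})
    return messages


# Hypothetical prompt and text, only to show the resulting structure.
demo_messages = build_messages(
    "You are a meticulous data scientist.",
    "Drug Q, a gene Z inhibitor, was tested in cell lines.",
)
for message in demo_messages:
    print(message["role"], "->", message["content"][:40])
```

The returned list can be passed as `messages=` to `client.chat.completions.create`, so each candidate system prompt only needs one call to the helper.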

In [100]:
MODEL = 'gpt-4-turbo-preview'

messages = [
    {"role": "system","content": """
    You are a helpful biologist and meticulous data scientist.
    For each of the user text delimited by triple quotes, do the following:
    1.Identify which drugs have been tested and create a set for each.
    2.For each drug, construct a set consisting of (drug name, target gene name, drug-target interaction, ClinicalTrials.gov number, all diseases that the drug is tested in)
    To collect these details and build these sets, use the following rules:
    2a)target gene name: only mention if it is explicitly mentioned that the drug targets the gene in some way
    2b)drug-target interaction: only mention if the interaction type is clearly specified in the text
    3.Identify the relationships accurately to construct each set rather than just extracting information.
    4.Assemble all sets and produce 1 final JSON output.
    """ },
    {
        "role": "user", "content": """
        Background: BRAF V600 mutations occur in various nonmelanoma cancers. We undertook a histology-independent phase 2 "basket" study of vemurafenib in BRAF V600 mutation-positive nonmelanoma cancers.
        Methods: We enrolled patients in six prespecified cancer cohorts; patients with all other tumor types were enrolled in a seventh cohort. A total of 122 patients with BRAF V600 mutation-positive cancer were treated, including 27 patients with colorectal cancer who received vemurafenib and cetuximab. The primary end point was the response rate; secondary end points included progression-free and overall survival.
        Results: In the cohort with non-small-cell lung cancer, the response rate was 42% (95% confidence interval [CI], 20 to 67) and median progression-free survival was 7.3 months (95% CI, 3.5 to 10.8). In the cohort with Erdheim-Chester disease or Langerhans'-cell histiocytosis, the response rate was 43% (95% CI, 18 to 71); the median treatment duration was 5.9 months (range, 0.6 to 18.6), and no patients had disease progression during therapy. There were anecdotal responses among patients with pleomorphic xanthoastrocytoma, anaplastic thyroid cancer, cholangiocarcinoma, salivary-duct cancer, ovarian cancer, and clear-cell sarcoma and among patients with colorectal cancer who received vemurafenib and cetuximab. Safety was similar to that in prior studies of vemurafenib for melanoma.
        Conclusions: BRAF V600 appears to be a targetable oncogene in some, but not all, nonmelanoma cancers. Preliminary vemurafenib activity was observed in non-small-cell lung cancer and in Erdheim-Chester disease and Langerhans'-cell histiocytosis. The histologic context is an important determinant of response in BRAF V600-mutated cancers. (Funded by F. Hoffmann-La Roche/Genentech; ClinicalTrials.gov number, NCT01524978.).
        """
    }
]

response = client.chat.completions.create(
        model=MODEL,
        response_format={ "type": "json_object" },
        messages=messages
    )

In [101]:
result = json.loads(response.model_dump_json())

In [104]:
data_dict = json.loads(result['choices'][0]['message']['content'])
data_dict

{'drugs_tested': ['vemurafenib', 'cetuximab'],
 'vemurafenib': [{'drug_name': 'vemurafenib',
   'target_gene_name': 'BRAF V600',
   'drug-target_interaction': '',
   'ClinicalTrials.gov_number': 'NCT01524978',
   'diseases_tested_in': ['non-small-cell lung cancer',
    'Erdheim-Chester disease',
    "Langerhans'-cell histiocytosis",
    'pleomorphic xanthoastrocytoma',
    'anaplastic thyroid cancer',
    'cholangiocarcinoma',
    'salivary-duct cancer',
    'ovarian cancer',
    'clear-cell sarcoma',
    'colorectal cancer']}],
 'cetuximab': [{'drug_name': 'cetuximab',
   'target_gene_name': '',
   'drug-target_interaction': '',
   'ClinicalTrials.gov_number': '',
   'diseases_tested_in': ['colorectal cancer']}]}

**The output still pairs vemurafenib with BRAF. The pairing is correct, but the text never states it explicitly.**
Test some more prompts to change this behavior.
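Independently of the prompt, a crude post-hoc check can flag pairs that the text does not state outright. The sketch below is a heuristic only, not a validated method: it reports a pair as stated when one sentence mentions the drug, the gene, and an interaction keyword. The keyword list and the function name `is_pair_stated` are assumptions made for this example, and plain substring matching will misfire on very short gene symbols.

```python
import re

# Hypothetical keyword list; extend it for the interaction types of interest.
INTERACTION_KEYWORDS = ("inhibitor", "inhibits", "agonist", "antagonist", "blocker", "activator")


def is_pair_stated(text, drug, gene, keywords=INTERACTION_KEYWORDS):
    """Return True if one sentence mentions the drug, the gene, and a keyword."""
    # Naive sentence split on whitespace that follows ., ! or ?
    sentences = re.split(r"(?<=[.!?])\s+", text)
    for sentence in sentences:
        lowered = sentence.lower()
        if drug.lower() in lowered and gene.lower() in lowered:
            if any(keyword in lowered for keyword in keywords):
                return True
    return False


# First sentence pair of the abstract used in the cell above.
abstract_start = (
    "BRAF V600 mutations occur in various nonmelanoma cancers. "
    'We undertook a histology-independent phase 2 "basket" study of vemurafenib '
    "in BRAF V600 mutation-positive nonmelanoma cancers."
)
vemurafenib_stated = is_pair_stated(abstract_start, "vemurafenib", "BRAF")

# Hypothetical sentence in which the interaction is spelled out.
explicit_stated = is_pair_stated(
    "Drug Q, a gene Z inhibitor, was tested in cell lines.", "drug Q", "gene Z"
)
print(vemurafenib_stated, explicit_stated)
```

Pairs that fail the check could be routed to manual review rather than dropped, since the heuristic says nothing about whether the pairing is biologically true.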

In [105]:
MODEL = 'gpt-4-turbo-preview'

messages = [
    {"role": "system","content": """
    You are a helpful biologist and meticulous data scientist.
    For each of the user text delimited by triple quotes, do the following:
    1.Identify which drugs have been tested and create a set for each.
    2.For each drug, construct a set consisting of (drug name, target gene name, drug-target interaction, ClinicalTrials.gov number, all diseases that the drug is tested in)
    To collect these details and build these sets, use the following rules:
    2a)target gene name: properly inspect the text to see if it has been mentioned clearly that the said drug interacts with the target. Basically, the drug-target interaction should be mentioned.
    2b)drug-target interaction: only mention if the interaction type is clearly specified in the text
    3.Identify the relationships accurately to construct each set rather than just extracting information.
    4.Assemble all sets and produce 1 final JSON output.
    """ },
    {
        "role": "user", "content": """
        Background: BRAF V600 mutations occur in various nonmelanoma cancers. We undertook a histology-independent phase 2 "basket" study of vemurafenib in BRAF V600 mutation-positive nonmelanoma cancers.
        Methods: We enrolled patients in six prespecified cancer cohorts; patients with all other tumor types were enrolled in a seventh cohort. A total of 122 patients with BRAF V600 mutation-positive cancer were treated, including 27 patients with colorectal cancer who received vemurafenib and cetuximab. The primary end point was the response rate; secondary end points included progression-free and overall survival.
        Results: In the cohort with non-small-cell lung cancer, the response rate was 42% (95% confidence interval [CI], 20 to 67) and median progression-free survival was 7.3 months (95% CI, 3.5 to 10.8). In the cohort with Erdheim-Chester disease or Langerhans'-cell histiocytosis, the response rate was 43% (95% CI, 18 to 71); the median treatment duration was 5.9 months (range, 0.6 to 18.6), and no patients had disease progression during therapy. There were anecdotal responses among patients with pleomorphic xanthoastrocytoma, anaplastic thyroid cancer, cholangiocarcinoma, salivary-duct cancer, ovarian cancer, and clear-cell sarcoma and among patients with colorectal cancer who received vemurafenib and cetuximab. Safety was similar to that in prior studies of vemurafenib for melanoma.
        Conclusions: BRAF V600 appears to be a targetable oncogene in some, but not all, nonmelanoma cancers. Preliminary vemurafenib activity was observed in non-small-cell lung cancer and in Erdheim-Chester disease and Langerhans'-cell histiocytosis. The histologic context is an important determinant of response in BRAF V600-mutated cancers. (Funded by F. Hoffmann-La Roche/Genentech; ClinicalTrials.gov number, NCT01524978.).
        """
    }
]

response = client.chat.completions.create(
        model=MODEL,
        response_format={ "type": "json_object" },
        messages=messages
    )

In [106]:
result = json.loads(response.model_dump_json())
data_dict = json.loads(result['choices'][0]['message']['content'])
data_dict

{'tested_drugs': ['vemurafenib', 'cetuximab'],
 'vemurafenib': [{'drug_name': 'vemurafenib',
   'target_gene_name': 'BRAF V600',
   'drug-target_interaction': 'mutation-positive',
   'ClinicalTrials.gov_number': 'NCT01524978',
   'diseases_tested_in': ['non-small-cell lung cancer',
    'Erdheim-Chester disease',
    "Langerhans'-cell histiocytosis",
    'pleomorphic xanthoastrocytoma',
    'anaplastic thyroid cancer',
    'cholangiocarcinoma',
    'salivary-duct cancer',
    'ovarian cancer',
    'clear-cell sarcoma',
    'colorectal cancer']}],
 'cetuximab': [{'drug_name': 'cetuximab',
   'target_gene_name': 'BRAF V600',
   'drug-target_interaction': 'mutation-positive',
   'ClinicalTrials.gov_number': 'NCT01524978',
   'diseases_tested_in': ['colorectal cancer']}]}

In [107]:
MODEL = 'gpt-4-turbo-preview'

messages = [
    {"role": "system","content": """
    You are a helpful biologist and meticulous data scientist.
    For each of the user text delimited by triple quotes, do the following:
    1. Identify which drugs have been tested and create a set for each.
    2. Look for any mention of the drug's target, which gene does the drug target or against what gene was the drug developed? Collect the gene name as target gene name. It is possible for this value to be null if the text does not contain sufficient proof.
    3. Briefly indicate how you found this drug-gene combination (logic for identifying gene-drug pair)
    4. If target gene name is found, then get the type of drug-target interaction.
    5. Collect all diseases for which the drug has been tested.
    6. Extract any specific ClinicalTrials.gov identifier or number.
    7. For each drug, construct a set consisting of (drug name, target gene name, drug-target interaction, logic, ClinicalTrials.gov number, all diseases that the drug is tested in)
    8. Assemble all sets and produce 1 final JSON output.
    """ },
    {
        "role": "user", "content": """
        Background: BRAF V600 mutations occur in various nonmelanoma cancers. We undertook a histology-independent phase 2 "basket" study of vemurafenib in BRAF V600 mutation-positive nonmelanoma cancers.
        Methods: We enrolled patients in six prespecified cancer cohorts; patients with all other tumor types were enrolled in a seventh cohort. A total of 122 patients with BRAF V600 mutation-positive cancer were treated, including 27 patients with colorectal cancer who received vemurafenib and cetuximab. The primary end point was the response rate; secondary end points included progression-free and overall survival.
        Results: In the cohort with non-small-cell lung cancer, the response rate was 42% (95% confidence interval [CI], 20 to 67) and median progression-free survival was 7.3 months (95% CI, 3.5 to 10.8). In the cohort with Erdheim-Chester disease or Langerhans'-cell histiocytosis, the response rate was 43% (95% CI, 18 to 71); the median treatment duration was 5.9 months (range, 0.6 to 18.6), and no patients had disease progression during therapy. There were anecdotal responses among patients with pleomorphic xanthoastrocytoma, anaplastic thyroid cancer, cholangiocarcinoma, salivary-duct cancer, ovarian cancer, and clear-cell sarcoma and among patients with colorectal cancer who received vemurafenib and cetuximab. Safety was similar to that in prior studies of vemurafenib for melanoma.
        Conclusions: BRAF V600 appears to be a targetable oncogene in some, but not all, nonmelanoma cancers. Preliminary vemurafenib activity was observed in non-small-cell lung cancer and in Erdheim-Chester disease and Langerhans'-cell histiocytosis. The histologic context is an important determinant of response in BRAF V600-mutated cancers. (Funded by F. Hoffmann-La Roche/Genentech; ClinicalTrials.gov number, NCT01524978.).
        """
    }
]

response = client.chat.completions.create(
        model=MODEL,
        response_format={ "type": "json_object" },
        messages=messages
    )

**Pros and cons of the result**

PROS:
1. Shows the logic used to identify a drug-target pair.
2. It did not assign any target to cetuximab.
3. Correctly identified all diseases.

CONS:
1. Incorrectly reports "mutation-targeting" as the interaction type. We are looking for specific types such as inhibitor, agonist, blocker, etc.
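One way to catch values such as "mutation-targeting" after the fact is to compare the returned interaction type against a controlled vocabulary. The sketch below assumes a small vocabulary built around the types named above (inhibitor, agonist, blocker) plus a few additions of my own; both the set and the helper name `clean_interaction` are illustrative.

```python
# Hypothetical controlled vocabulary; adjust to the interaction types needed.
ALLOWED_INTERACTIONS = {"inhibitor", "agonist", "antagonist", "blocker", "activator"}


def clean_interaction(value, allowed=ALLOWED_INTERACTIONS):
    """Return the lower-cased interaction type if it is allowed, else None."""
    if not isinstance(value, str):
        return None
    normalized = value.strip().lower()
    return normalized if normalized in allowed else None


for raw_value in ["Inhibitor", "mutation-targeting", "", None]:
    print(repr(raw_value), "->", repr(clean_interaction(raw_value)))
```

The same vocabulary could also be listed in the system prompt, so that the model is asked to choose from it rather than invent a label.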

In [108]:
result = json.loads(response.model_dump_json())
data_dict = json.loads(result['choices'][0]['message']['content'])
data_dict

{'sets': [{'drug name': 'vemurafenib',
   'target gene name': 'BRAF',
   'drug-target interaction': 'mutation-targeting',
   'logic': 'The text specifies that vemurafenib was used in the context of BRAF V600 mutation-positive cancers, indicating the drug targets BRAF V600 mutations.',
   'ClinicalTrials.gov number': 'NCT01524978',
   'all diseases that the drug is tested in': ['nonmelanoma cancers',
    'colorectal cancer',
    'non-small-cell lung cancer',
    'Erdheim-Chester disease',
    "Langerhans'-cell histiocytosis",
    'pleomorphic xanthoastrocytoma',
    'anaplastic thyroid cancer',
    'cholangiocarcinoma',
    'salivary-duct cancer',
    'ovarian cancer',
    'clear-cell sarcoma']},
  {'drug name': 'cetuximab',
   'target gene name': None,
   'drug-target interaction': None,
   'logic': 'The text mentions cetuximab was received by patients with colorectal cancer in combination with vemurafenib, but does not specify the gene target for cetuximab.',
   'ClinicalTrials.gov nu

In [113]:
MODEL = 'gpt-4-turbo-preview'

messages = [
    {"role": "system","content": """
    You are a meticulous data scientist.
    For each of the user text delimited by triple quotes, do the following:
    1. Identify which drugs have been tested and create a set for each.
    2. For this step be very conservative: If and only if the text explicitly mentions the drug's target, collect it as target gene name.
    Just because a drug is tested in samples having a certain gene, doesn't mean that gene is a target. Do not imply or extrapolate information.
    This fact should be directly present in the text and clearly mentioned.
    3. If you do find a target gene name, briefly indicate how you found this drug-gene combination (logic for identifying gene-drug pair)
    4. If target gene name is found, then get the type of drug - example, is it an inhibitor, blocker, etc. for the target gene?
    5. Collect all diseases for which the drug has been tested.
    6. Extract any specific ClinicalTrials.gov identifier or number.
    7. For each drug, construct a set consisting of (drug name, target gene name, drug-target interaction, logic, ClinicalTrials.gov number, all diseases that the drug is tested in)
    8. Any empty values should be indicated by null and not an empty string.
    9. Assemble all sets and produce 1 final JSON output.
    """ },
    {
        "role": "user", "content": """
        Background: BRAF V600 mutations occur in various nonmelanoma cancers. We undertook a histology-independent phase 2 "basket" study of vemurafenib in BRAF V600 mutation-positive nonmelanoma cancers.
        Methods: We enrolled patients in six prespecified cancer cohorts; patients with all other tumor types were enrolled in a seventh cohort. A total of 122 patients with BRAF V600 mutation-positive cancer were treated, including 27 patients with colorectal cancer who received vemurafenib and cetuximab. The primary end point was the response rate; secondary end points included progression-free and overall survival.
        Results: In the cohort with non-small-cell lung cancer, the response rate was 42% (95% confidence interval [CI], 20 to 67) and median progression-free survival was 7.3 months (95% CI, 3.5 to 10.8). In the cohort with Erdheim-Chester disease or Langerhans'-cell histiocytosis, the response rate was 43% (95% CI, 18 to 71); the median treatment duration was 5.9 months (range, 0.6 to 18.6), and no patients had disease progression during therapy. There were anecdotal responses among patients with pleomorphic xanthoastrocytoma, anaplastic thyroid cancer, cholangiocarcinoma, salivary-duct cancer, ovarian cancer, and clear-cell sarcoma and among patients with colorectal cancer who received vemurafenib and cetuximab. Safety was similar to that in prior studies of vemurafenib for melanoma.
        Conclusions: BRAF V600 appears to be a targetable oncogene in some, but not all, nonmelanoma cancers. Preliminary vemurafenib activity was observed in non-small-cell lung cancer and in Erdheim-Chester disease and Langerhans'-cell histiocytosis. The histologic context is an important determinant of response in BRAF V600-mutated cancers. (Funded by F. Hoffmann-La Roche/Genentech; ClinicalTrials.gov number, NCT01524978.).
        """
    }
]

response = client.chat.completions.create(
        model=MODEL,
        response_format={ "type": "json_object" },
        messages=messages
    )

**It still extracts BRAF as a target. Again, this is correct, but the text does not state it directly; the model is inferring it. The inference happens to be right here, but it may not be in other cases.**

In [114]:
result = json.loads(response.model_dump_json())
data_dict = json.loads(result['choices'][0]['message']['content'])
data_dict

{'sets': [{'drug name': 'vemurafenib',
   'target gene name': 'BRAF',
   'drug-target interaction': 'Inhibitor',
   'logic': 'The text explicitly mentions vemurafenib in the context of BRAF V600 mutation-positive cancers, indicating BRAF as the target gene vemurafenib inhibits.',
   'ClinicalTrials.gov number': 'NCT01524978',
   'all diseases that the drug is tested in': ['non-small-cell lung cancer',
    'Erdheim-Chester disease',
    "Langerhans'-cell histiocytosis",
    'pleomorphic xanthoastrocytoma',
    'anaplastic thyroid cancer',
    'cholangiocarcinoma',
    'salivary-duct cancer',
    'ovarian cancer',
    'clear-cell sarcoma',
    'colorectal cancer']},
  {'drug name': 'cetuximab',
   'target gene name': None,
   'drug-target interaction': None,
   'logic': None,
   'ClinicalTrials.gov number': None,
   'all diseases that the drug is tested in': ['colorectal cancer']}]}
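The top-level shape of the JSON differs between the runs above: some return a `sets` list, others return one key per drug, and the drug-name key appears as both `drug name` and `drug_name`. A minimal sketch for flattening any of these shapes into a list of records is shown below; `collect_records` is a helper introduced here, and the two key spellings it looks for are the ones seen so far, so other spellings would need to be added.

```python
def collect_records(node, name_keys=("drug name", "drug_name")):
    """Recursively gather every dict that carries a drug-name key."""
    records = []
    if isinstance(node, dict):
        if any(key in node for key in name_keys):
            records.append(node)
        else:
            for value in node.values():
                records.extend(collect_records(value, name_keys))
    elif isinstance(node, list):
        for item in node:
            records.extend(collect_records(item, name_keys))
    return records


# Abbreviated versions of two output shapes seen above.
shape_sets = {"sets": [{"drug name": "vemurafenib"}, {"drug name": "cetuximab"}]}
shape_per_drug = {
    "drugs_tested": ["vemurafenib", "cetuximab"],
    "vemurafenib": [{"drug_name": "vemurafenib"}],
    "cetuximab": [{"drug_name": "cetuximab"}],
}
print(len(collect_records(shape_sets)), len(collect_records(shape_per_drug)))
```

Specifying the exact JSON keys in the system prompt is the more direct fix; the flattening step is only a safety net for runs that deviate.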

**Create a few examples to test few-shot inference and identify the optimal prompt before applying it to abstracts.**
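Because the requests in this notebook set `response_format` to `json_object`, one option is to write the example assistant turn as JSON as well, so that the demonstration has the same shape as the desired answer. This is an untested assumption rather than something verified in this notebook. The sketch below builds such a turn with `json.dumps` for the hypothetical drug ABC example used in the next cell; the key names follow the set described in the system prompt and the `sets` wrapper follows earlier outputs.

```python
import json

# Hypothetical few-shot answer for the made-up drug ABC, expressed as JSON.
example_answer = {
    "sets": [
        {
            "drug name": "ABC",
            "target gene name": "X",
            "drug-target interaction": "inhibitor",
            "logic": "Text explicitly states that ABC is a gene X inhibitor.",
            "ClinicalTrials.gov number": None,  # serialized as JSON null
            "all diseases that the drug is tested in": ["ALL", "AML", "CML", "myeloma"],
        }
    ]
}

few_shot_assistant_turn = {
    "role": "assistant",
    "content": json.dumps(example_answer, indent=2),
}
print(few_shot_assistant_turn["content"])
```

Using `None` in the Python dict means the serialized example contains a real JSON `null`, which demonstrates the "null and not an empty string" rule directly.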

In [115]:
MODEL = 'gpt-4-turbo-preview'

messages = [
    {"role": "system","content": """
    You are a meticulous data scientist.
    For each of the user text delimited by triple quotes, do the following:
    1. Identify which drugs have been tested and create a set for each.
    2. For this step be very conservative: If and only if the text explicitly mentions the drug's target, collect it as target gene name.
    Just because a drug is tested in samples having a certain gene, doesn't mean that gene is a target. Do not imply or extrapolate information.
    This fact should be directly present in the text and clearly mentioned.
    3. If you do find a target gene name, briefly indicate how you found this drug-gene combination (logic for identifying gene-drug pair)
    4. If target gene name is found, then get the type of drug - example, is it an inhibitor, blocker, etc. for the target gene?
    5. Collect all diseases for which the drug has been tested.
    6. Extract any specific ClinicalTrials.gov identifier or number.
    7. For each drug, construct a set consisting of (drug name, target gene name, drug-target interaction, logic, ClinicalTrials.gov number, all diseases that the drug is tested in)
    8. Any empty values should be indicated by null and not an empty string.
    9. Assemble all sets and produce 1 final JSON output.
    """ },
    {
        "role": "user", "content": """
        Drug ABC has been widely tested for ovarian cancers. 
        However, it's applications have been unknown in hematological cancers. 
        In this study, we show that ABC, a gene X inhibitor, shows efficacy in cell lines for the following cancer types: ALL, AML, CML, and myeloma.
        Furthermore, we also discover a vulnerability in these cancers specific to the gene Y, where cancers with high expression of gene Y are more sensitive to ABC.        
        """
    },
    {
        "role": "assistant", "content":"""
        drug name: ABC
        target gene name: X
        drug-target-interaction: inhibitor
        logic: Text explicitly mentions that ABC is an inhibitor for X. Although it was tested and shown to be effective in Y- high expressing cancers, it is not explicitly mentioned that Y is also a target of ABC. 
        all diseases drug is tested in: hematological cancers, ALL, AML, CML, myeloma
        ClinicalTrials.gov ID: null
        """
    },
    {
        "role": "user", "content": """
        Background: BRAF V600 mutations occur in various nonmelanoma cancers. We undertook a histology-independent phase 2 "basket" study of vemurafenib in BRAF V600 mutation-positive nonmelanoma cancers.
        Methods: We enrolled patients in six prespecified cancer cohorts; patients with all other tumor types were enrolled in a seventh cohort. A total of 122 patients with BRAF V600 mutation-positive cancer were treated, including 27 patients with colorectal cancer who received vemurafenib and cetuximab. The primary end point was the response rate; secondary end points included progression-free and overall survival.
        Results: In the cohort with non-small-cell lung cancer, the response rate was 42% (95% confidence interval [CI], 20 to 67) and median progression-free survival was 7.3 months (95% CI, 3.5 to 10.8). In the cohort with Erdheim-Chester disease or Langerhans'-cell histiocytosis, the response rate was 43% (95% CI, 18 to 71); the median treatment duration was 5.9 months (range, 0.6 to 18.6), and no patients had disease progression during therapy. There were anecdotal responses among patients with pleomorphic xanthoastrocytoma, anaplastic thyroid cancer, cholangiocarcinoma, salivary-duct cancer, ovarian cancer, and clear-cell sarcoma and among patients with colorectal cancer who received vemurafenib and cetuximab. Safety was similar to that in prior studies of vemurafenib for melanoma.
        Conclusions: BRAF V600 appears to be a targetable oncogene in some, but not all, nonmelanoma cancers. Preliminary vemurafenib activity was observed in non-small-cell lung cancer and in Erdheim-Chester disease and Langerhans'-cell histiocytosis. The histologic context is an important determinant of response in BRAF V600-mutated cancers. (Funded by F. Hoffmann-La Roche/Genentech; ClinicalTrials.gov number, NCT01524978.).

        """
    }
]

response = client.chat.completions.create(
        model=MODEL,
        response_format={ "type": "json_object" },
        messages=messages
    )

In [116]:
result = json.loads(response.model_dump_json())
data_dict = json.loads(result['choices'][0]['message']['content'])
data_dict

{'vemurafenib': [{'drug name': 'vemurafenib',
   'target gene name': 'BRAF',
   'drug-target interaction': 'null',
   'logic': 'null',
   'ClinicalTrials.gov number': 'NCT01524978',
   'all diseases drug is tested in': ['nonmelanoma cancers',
    'non-small-cell lung cancer',
    'Erdheim-Chester disease',
    "Langerhans'-cell histiocytosis",
    'pleomorphic xanthoastrocytoma',
    'anaplastic thyroid cancer',
    'cholangiocarcinoma',
    'salivary-duct cancer',
    'ovarian cancer',
    'clear-cell sarcoma',
    'colorectal cancer']}],
 'cetuximab': [{'drug name': 'cetuximab',
   'target gene name': 'null',
   'drug-target interaction': 'null',
   'logic': 'null',
   'ClinicalTrials.gov number': 'null',
   'all diseases drug is tested in': ['colorectal cancer']}]}
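In the output above, empty values came back as the string `'null'` rather than as a JSON null (which `json.loads` would turn into `None`), and earlier runs used empty strings. A minimal clean-up sketch is shown below; `nulls_to_none` is a helper introduced here, and treating both `''` and `'null'` as missing is an assumption about what the downstream code wants.

```python
def nulls_to_none(node):
    """Recursively replace empty strings and the literal string 'null' with None."""
    if isinstance(node, dict):
        return {key: nulls_to_none(value) for key, value in node.items()}
    if isinstance(node, list):
        return [nulls_to_none(item) for item in node]
    if isinstance(node, str) and node.strip().lower() in {"", "null"}:
        return None
    return node


# Abbreviated cetuximab record from the output above.
raw_record = {
    "drug name": "cetuximab",
    "target gene name": "null",
    "logic": "null",
    "all diseases drug is tested in": ["colorectal cancer"],
}
cleaned_record = nulls_to_none(raw_record)
print(cleaned_record)
```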

In [117]:
MODEL = 'gpt-4-turbo-preview'

messages = [
    {"role": "system","content": """
    You are a meticulous data scientist.
    For each of the user text delimited by triple quotes, do the following:
    1. Identify which drugs have been tested and create a set for each.
    2. Identify the drug targets and categorize them as direct or indirect: Use the following logic.
    direct target: if the text clearly and directly mentions that the drug targets a gene
    indirect target: if the text does not explicitly mention the drug and target relationship but the drug is tested on some gene, then put that gene name here.
    3. Briefly indicate the logic for identifying the direct and indirect targets. 
    4. Again if explicitly and clearly mentioned, then get the drug-direct target relationship type. What does the drug do to the direct target? inhibitor, activator, etc.?
    5. Collect all diseases for which the drug has been tested.
    6. Extract any specific ClinicalTrials.gov identifier or number.
    7. For each drug, construct a set consisting of (drug name, direct target, indirect target, drug-direct target interaction, logic for direct target, logic for indirect target, ClinicalTrials.gov number, all diseases that the drug is tested in)
    8. Any empty values should be indicated by null and not an empty string.
    9. Assemble all sets and produce 1 final JSON output.
    """ },
    {
        "role": "user", "content": """
        Drug ABC has been widely tested for ovarian cancers. 
        However, it's applications have been unknown in hematological cancers. 
        In this study, we show that ABC, a gene X inhibitor, shows efficacy in cell lines for the following cancer types: ALL, AML, CML, and myeloma.
        Furthermore, we also discover a vulnerability in these cancers specific to the gene Y, where cancers with high expression of gene Y are more sensitive to ABC.        
        """
    },
    {
        "role": "assistant", "content":"""
        drug name: ABC
        direct target: X
        indirect target: Y
        drug-direct target-interaction: inhibitor
        logic for direct target: Text explicitly mentions that ABC is an inhibitor for X, so X is a direct target. 
        logic for indirect target: Although ABC was tested and shown to be effective in Y- high expressing cancers, it is not explicitly mentioned that Y is also a target of ABC. So, Y is an indirect target of drug ABC.
        all diseases drug is tested in: hematological cancers, ALL, AML, CML, myeloma
        ClinicalTrials.gov ID: null
        """
    },
    {
        "role": "user", "content": """
        Background: BRAF V600 mutations occur in various nonmelanoma cancers. We undertook a histology-independent phase 2 "basket" study of vemurafenib in BRAF V600 mutation-positive nonmelanoma cancers.
        Methods: We enrolled patients in six prespecified cancer cohorts; patients with all other tumor types were enrolled in a seventh cohort. A total of 122 patients with BRAF V600 mutation-positive cancer were treated, including 27 patients with colorectal cancer who received vemurafenib and cetuximab. The primary end point was the response rate; secondary end points included progression-free and overall survival.
        Results: In the cohort with non-small-cell lung cancer, the response rate was 42% (95% confidence interval [CI], 20 to 67) and median progression-free survival was 7.3 months (95% CI, 3.5 to 10.8). In the cohort with Erdheim-Chester disease or Langerhans'-cell histiocytosis, the response rate was 43% (95% CI, 18 to 71); the median treatment duration was 5.9 months (range, 0.6 to 18.6), and no patients had disease progression during therapy. There were anecdotal responses among patients with pleomorphic xanthoastrocytoma, anaplastic thyroid cancer, cholangiocarcinoma, salivary-duct cancer, ovarian cancer, and clear-cell sarcoma and among patients with colorectal cancer who received vemurafenib and cetuximab. Safety was similar to that in prior studies of vemurafenib for melanoma.
        Conclusions: BRAF V600 appears to be a targetable oncogene in some, but not all, nonmelanoma cancers. Preliminary vemurafenib activity was observed in non-small-cell lung cancer and in Erdheim-Chester disease and Langerhans'-cell histiocytosis. The histologic context is an important determinant of response in BRAF V600-mutated cancers. (Funded by F. Hoffmann-La Roche/Genentech; ClinicalTrials.gov number, NCT01524978.).

        """
    }
]

response = client.chat.completions.create(
        model=MODEL,
        response_format={ "type": "json_object" },
        messages=messages
    )

In [118]:
result = json.loads(response.model_dump_json())
data_dict = json.loads(result['choices'][0]['message']['content'])
data_dict

{'ABC': {'drug name': 'ABC',
  'direct target': 'X',
  'indirect target': 'Y',
  'drug-direct target interaction': 'inhibitor',
  'logic for direct target': 'Explicitly mentioned as a gene X inhibitor.',
  'logic for indirect target': 'ABC shown effective in Y-high expressing cancers without explicitly mentioning Y as a target.',
  'ClinicalTrials.gov number': None,
  'all diseases that the drug is tested in': ['ovarian cancer',
   'hematological cancers',
   'ALL',
   'AML',
   'CML',
   'myeloma']},
 'vemurafenib': {'drug name': 'vemurafenib',
  'direct target': 'BRAF V600',
  'indirect target': None,
  'drug-direct target interaction': None,
  'logic for direct target': 'Study on vemurafenib in BRAF V600 mutation-positive nonmelanoma cancers.',
  'logic for indirect target': None,
  'ClinicalTrials.gov number': 'NCT01524978',
  'all diseases that the drug is tested in': ['non-small-cell lung cancer',
   'Erdheim-Chester disease',
   "Langerhans'-cell histiocytosis",
   'pleomorphic 

In [120]:
MODEL = 'gpt-4-turbo-preview'

messages = [
    {"role": "system","content": """
    You are a meticulous data scientist.
    For each of the user text delimited by triple quotes, do the following:
    1. Identify which drugs have been tested and create a set for each.
    2. Identify the drug targets and categorize them as direct or indirect: Use the following logic.
    direct target: if the text clearly and directly mentions that the drug targets a gene
    indirect target: if the text does not explicitly mention the drug and target relationship but the drug is tested on some gene, then put that gene name here.
    3. Briefly indicate the logic for identifying the direct and indirect targets. 
    4. Again if explicitly and clearly mentioned, then get the drug-direct target relationship type. What does the drug do to the direct target? inhibitor, activator, etc.?
    5. Collect all diseases for which the drug has been tested.
    6. Extract any specific ClinicalTrials.gov identifier or number.
    7. For each drug, construct a set consisting of (drug name, direct target, indirect target, drug-direct target interaction, logic for direct target, logic for indirect target, ClinicalTrials.gov number, all diseases that the drug is tested in)
    8. Any empty values should be indicated by null and not an empty string.
    9. Double check your logic and see if the data you have collected is actually reflected in the text. If not, make revisions.
    9. Assemble all sets and produce 1 final JSON output.
    """ },
    {
        "role": "user", "content": """
        Drug ABC has been widely tested for ovarian cancers. 
        However, it's applications have been unknown in hematological cancers. 
        In this study, we show that ABC, a gene X inhibitor, shows efficacy in cell lines for the following cancer types: ALL, AML, CML, and myeloma.
        Furthermore, we also discover a vulnerability in these cancers specific to the gene Y, where cancers with high expression of gene Y are more sensitive to ABC.        
        """
    },
    {
        "role": "assistant", "content":"""
        drug name: ABC
        direct target: X
        indirect target: Y
        drug-direct target-interaction: inhibitor
        logic for direct target: Text explicitly mentions that ABC is an inhibitor for X, so X is a direct target. 
        logic for indirect target: Although ABC was tested and shown to be effective in Y- high expressing cancers, it is not explicitly mentioned that Y is also a target of ABC. So, Y is an indirect target of drug ABC.
        all diseases drug is tested in: hematological cancers, ALL, AML, CML, myeloma
        ClinicalTrials.gov ID: null
        """
    },
    {
        "role":"user","content":"Great, you correctly identified that Y is not a direct target but that X is a direct target of ABC."
    },
    {
        "role": "user", "content": """
        Background: BRAF V600 mutations occur in various nonmelanoma cancers. We undertook a histology-independent phase 2 "basket" study of vemurafenib in BRAF V600 mutation-positive nonmelanoma cancers.
        Methods: We enrolled patients in six prespecified cancer cohorts; patients with all other tumor types were enrolled in a seventh cohort. A total of 122 patients with BRAF V600 mutation-positive cancer were treated, including 27 patients with colorectal cancer who received vemurafenib and cetuximab. The primary end point was the response rate; secondary end points included progression-free and overall survival.
        Results: In the cohort with non-small-cell lung cancer, the response rate was 42% (95% confidence interval [CI], 20 to 67) and median progression-free survival was 7.3 months (95% CI, 3.5 to 10.8). In the cohort with Erdheim-Chester disease or Langerhans'-cell histiocytosis, the response rate was 43% (95% CI, 18 to 71); the median treatment duration was 5.9 months (range, 0.6 to 18.6), and no patients had disease progression during therapy. There were anecdotal responses among patients with pleomorphic xanthoastrocytoma, anaplastic thyroid cancer, cholangiocarcinoma, salivary-duct cancer, ovarian cancer, and clear-cell sarcoma and among patients with colorectal cancer who received vemurafenib and cetuximab. Safety was similar to that in prior studies of vemurafenib for melanoma.
        Conclusions: BRAF V600 appears to be a targetable oncogene in some, but not all, nonmelanoma cancers. Preliminary vemurafenib activity was observed in non-small-cell lung cancer and in Erdheim-Chester disease and Langerhans'-cell histiocytosis. The histologic context is an important determinant of response in BRAF V600-mutated cancers. (Funded by F. Hoffmann-La Roche/Genentech; ClinicalTrials.gov number, NCT01524978.).

        """
    }
]

response = client.chat.completions.create(
        model=MODEL,
        response_format={ "type": "json_object" },
        messages=messages
    )

In [121]:
result = json.loads(response.model_dump_json())
data_dict = json.loads(result['choices'][0]['message']['content'])
data_dict

{'drugs': [{'drug name': 'vemurafenib',
   'direct target': 'BRAF V600',
   'indirect target': None,
   'drug-direct target interaction': None,
   'logic for direct target': 'The study involved vemurafenib targeting BRAF V600 mutation-positive nonmelanoma cancers, thus BRAF V600 is a direct target.',
   'logic for indirect target': 'There is no explicit mention of an indirect target in the provided text.',
   'ClinicalTrials.gov number': 'NCT01524978',
   'all diseases drug is tested in': ['non-small-cell lung cancer',
    'Erdheim-Chester disease',
    "Langerhans'-cell histiocytosis",
    'colorectal cancer',
    'pleomorphic xanthoastrocytoma',
    'anaplastic thyroid cancer',
    'cholangiocarcinoma',
    'salivary-duct cancer',
    'ovarian cancer',
    'clear-cell sarcoma']},
  {'drug name': 'cetuximab',
   'direct target': None,
   'indirect target': None,
   'drug-direct target interaction': None,
   'logic for direct target': "There is no explicit information provided about ce

In [122]:
MODEL = 'gpt-4-turbo-preview'

messages = [
    {"role": "system","content": """
    You are a meticulous data scientist.
    For each user text delimited by triple quotes, do the following:
    1. Identify which drugs have been tested and create a set for each.
    2. Identify the drug targets and categorize them as direct or indirect: Use the following logic.
    direct target: if the text clearly and directly mentions that the drug targets a gene
    indirect target: if the text does not explicitly mention the drug and target relationship but the drug is tested on some gene, then put that gene name here.
    3. Briefly indicate the logic for identifying the direct and indirect targets. 
    4. Again if explicitly and clearly mentioned, then get the drug-direct target relationship type. What does the drug do to the direct target? inhibitor, activator, etc.?
    5. Collect all diseases for which the drug has been tested.
    6. Extract any specific ClinicalTrials.gov identifier or number.
    7. For each drug, construct a set consisting of (drug name, direct target, indirect target, drug-direct target interaction, logic for direct target, logic for indirect target, ClinicalTrials.gov number, all diseases that the drug is tested in)
    8. Any empty values should be indicated by null and not an empty string.
    9. Double check your logic and see if the data you have collected is actually reflected in the text. If not, make revisions.
    10. Assemble all sets and produce 1 final JSON output.
    """ },
    {
        "role": "user", "content": """
        Drug ABC has been widely tested for ovarian cancers. 
        However, its applications have been unknown in hematological cancers.
        In this study, we show that ABC, a gene X inhibitor, shows efficacy in cell lines for the following cancer types: ALL, AML, CML, and myeloma.
        Furthermore, we also discover a vulnerability in these cancers specific to the gene Y, where cancers with high expression of gene Y are more sensitive to ABC.        
        """
    },
    {
        "role": "assistant", "content":"""
        drug name: ABC
        direct target: X
        indirect target: Y
        drug-direct target-interaction: inhibitor
        logic for direct target: Text explicitly mentions that ABC is an inhibitor for X, so X is a direct target. 
        logic for indirect target: Although ABC was tested and shown to be effective in Y- high expressing cancers, it is not explicitly mentioned that Y is also a target of ABC. So, Y is an indirect target of drug ABC.
        all diseases drug is tested in: hematological cancers, ALL, AML, CML, myeloma
        ClinicalTrials.gov ID: null
        """
    },
    {
        "role":"user","content":"Great, you correctly identified that Y is not a direct target but that X is a direct target of ABC."
    },
    {
        "role": "user", "content": """
        Treatment-refractory lupus nephritis (LN) has a high risk of a poor outcome and is often life-threatening. Here we report a case series of six patients (one male and five females) with a median age of 41.3 years (range, 20-61 years) with refractory LN who received renal biopsies and were subsequently treated with intravenous daratumumab, an anti-CD38 monoclonal antibody (weekly for 8 weeks, followed by eight biweekly infusions and up to eight monthly infusions). One patient did not show any improvement after 6 months of therapy, and daratumumab was discontinued. In five patients, the mean disease activity, as assessed by the Systemic Lupus Erythematosus Disease Activity 2000 index, decreased from 10.8 before treatment to 3.6 at 12 months after treatment. Mean proteinuria (5.6 g per 24 h to 0.8 g per 24 h) and mean serum creatinine (2.3 mg dl-1 to 1.5 mg dl-1) also decreased after 12 months. Improvement of clinical symptoms was accompanied by seroconversion of anti-double-stranded DNA antibodies; decreases in median interferon-gamma levels, B cell maturation antigen and soluble CD163 levels; and increases in C4 and interleukin-10 levels. These data suggest that daratumumab monotherapy warrants further exploration as a potential treatment for refractory LN.
        """
    }
]

response = client.chat.completions.create(
        model=MODEL,
        response_format={ "type": "json_object" },
        messages=messages
    )
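
The system prompt, the ABC few-shot turn, and the feedback message are repeated verbatim in each cell; only the final abstract changes. A minimal sketch of a helper that assembles the message list from those parts follows. The name `build_messages` and its parameters are hypothetical and are not part of the OpenAI client.

```python
def build_messages(system_prompt, examples, abstract, feedback=None):
    """Assemble a chat message list from reusable parts.

    examples: list of (user_text, assistant_text) pairs used as few-shot turns.
    feedback: optional user message appended after the few-shot turns.
    abstract: the new text to extract from; always the last user message.
    """
    messages = [{"role": "system", "content": system_prompt}]
    for user_text, assistant_text in examples:
        messages.append({"role": "user", "content": user_text})
        messages.append({"role": "assistant", "content": assistant_text})
    if feedback:
        messages.append({"role": "user", "content": feedback})
    messages.append({"role": "user", "content": abstract})
    return messages
```

With this, each experiment only needs to pass a different `abstract`, and any edit to the rules happens in one place.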

In [123]:
result = json.loads(response.model_dump_json())
data_dict = json.loads(result['choices'][0]['message']['content'])
data_dict

{'daratumumab': {'direct_target': 'CD38',
  'indirect_target': None,
  'drug_direct_target_interaction': 'anti-CD38 monoclonal antibody',
  'logic_for_direct_target': 'The text explicitly mentions daratumumab as an anti-CD38 monoclonal antibody, indicating a direct interaction with CD38 as its target.',
  'logic_for_indirect_target': 'No indirect targets are mentioned as being affected by daratumumab; the effects on other molecules like interferon-gamma, B cell maturation antigen, soluble CD163, C4, and interleukin-10 are likely downstream consequences rather than direct targets.',
  'clinical_trials_gov_id': None,
  'all_diseases_tested_in': 'lupus nephritis (LN)'}}
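
The two responses so far come back in different shapes: the first wraps a list of records under `'drugs'`, while this one uses the drug name as the top-level key. The sketch below shows one possible way to flatten both into a list of per-drug dicts before further processing. `normalize_records` is a hypothetical helper, and it assumes that a top-level key is the drug name whenever the nested record has no name field of its own.

```python
def normalize_records(data):
    """Flatten a parsed model response into a list of per-drug dicts."""
    name_keys = ("drug name", "drug_name")
    # A single flat record that already carries its own name field.
    if any(k in data for k in name_keys):
        return [data]
    records = []
    for key, value in data.items():
        # A wrapper key may hold one record or a list of records.
        items = value if isinstance(value, list) else [value]
        for item in items:
            if not isinstance(item, dict):
                continue
            record = dict(item)
            # Assumption: if no name field exists, the wrapper key is the drug name.
            if not any(k in record for k in name_keys):
                record["drug name"] = key
            records.append(record)
    return records
```

This does not reconcile differing field names (for example `'direct target'` versus `'direct_target'`); that would need a separate key-mapping step.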

In [124]:
MODEL = 'gpt-4-turbo-preview'

messages = [
    {"role": "system","content": """
    You are a meticulous data scientist.
    For each user text delimited by triple quotes, do the following:
    1. Identify which drugs have been tested and create a set for each.
    2. Identify the drug targets and categorize them as direct or indirect: Use the following logic.
    direct target: if the text clearly and directly mentions that the drug targets a gene
    indirect target: if the text does not explicitly mention the drug and target relationship but the drug is tested on some gene, then put that gene name here.
    3. Briefly indicate the logic for identifying the direct and indirect targets. 
    4. Again if explicitly and clearly mentioned, then get the drug-direct target relationship type. What does the drug do to the direct target? inhibitor, activator, etc.?
    5. Collect all diseases for which the drug has been tested.
    6. Extract any specific ClinicalTrials.gov identifier or number.
    7. For each drug, construct a set consisting of (drug name, direct target, indirect target, drug-direct target interaction, logic for direct target, logic for indirect target, ClinicalTrials.gov number, all diseases that the drug is tested in)
    8. Any empty values should be indicated by null and not an empty string.
    9. Double check your logic and see if the data you have collected is actually reflected in the text. If not, make revisions.
    10. Assemble all sets and produce 1 final JSON output.
    """ },
    {
        "role": "user", "content": """
        Drug ABC has been widely tested for ovarian cancers. 
        However, its applications have been unknown in hematological cancers.
        In this study, we show that ABC, a gene X inhibitor, shows efficacy in cell lines for the following cancer types: ALL, AML, CML, and myeloma.
        Furthermore, we also discover a vulnerability in these cancers specific to the gene Y, where cancers with high expression of gene Y are more sensitive to ABC.        
        """
    },
    {
        "role": "assistant", "content":"""
        drug name: ABC
        direct target: X
        indirect target: Y
        drug-direct target-interaction: inhibitor
        logic for direct target: Text explicitly mentions that ABC is an inhibitor for X, so X is a direct target. 
        logic for indirect target: Although ABC was tested and shown to be effective in Y- high expressing cancers, it is not explicitly mentioned that Y is also a target of ABC. So, Y is an indirect target of drug ABC.
        all diseases drug is tested in: hematological cancers, ALL, AML, CML, myeloma
        ClinicalTrials.gov ID: null
        """
    },
    {
        "role":"user","content":"Great, you correctly identified that Y is not a direct target but that X is a direct target of ABC."
    },
    {
        "role": "user", "content": """
        LKB1/STK11 is a serine/threonine kinase that plays a major role in controlling cell metabolism, resulting in potential therapeutic vulnerabilities in LKB1-mutant cancers. Here, we identify the NAD + degrading ectoenzyme, CD38, as a new target in LKB1-mutant NSCLC. Metabolic profiling of genetically engineered mouse models (GEMMs) revealed that LKB1 mutant lung cancers have a striking increase in ADP-ribose, a breakdown product of the critical redox co-factor, NAD + . Surprisingly, compared with other genetic subsets, murine and human LKB1-mutant NSCLC show marked overexpression of the NAD+-catabolizing ectoenzyme, CD38 on the surface of tumor cells. Loss of LKB1 or inactivation of Salt-Inducible Kinases (SIKs)-key downstream effectors of LKB1- induces CD38 transcription induction via a CREB binding site in the CD38 promoter. Treatment with the FDA-approved anti-CD38 antibody, daratumumab, inhibited growth of LKB1-mutant NSCLC xenografts. Together, these results reveal CD38 as a promising therapeutic target in patients with LKB1 mutant lung cancer.
        """
    }
]

response = client.chat.completions.create(
        model=MODEL,
        response_format={ "type": "json_object" },
        messages=messages
    )

**The model should have pulled LKB1 as an indirect target, but the output below leaves that field null.**

In [125]:
result = json.loads(response.model_dump_json())
data_dict = json.loads(result['choices'][0]['message']['content'])
data_dict

{'drugs': [{'drug name': 'daratumumab',
   'direct target': 'CD38',
   'indirect target': None,
   'drug-direct target interaction': 'inhibitor',
   'logic for direct target': 'Text explicitly mentions daratumumab as an FDA-approved anti-CD38 antibody, indicating it targets CD38 directly.',
   'logic for indirect target': "No indirect targets are explicitly mentioned in relation to daratumumab's action.",
   'ClinicalTrials.gov number': None,
   'all diseases that the drug is tested in': ['LKB1-mutant NSCLC']}]}

**Change the format of the prompt by explicitly highlighting an examples section.**

In [126]:
MODEL = 'gpt-4-turbo-preview'

messages = [
    {"role": "system","content": """
    You are a meticulous data scientist building out a drug-target dataset from biomedical text.
    """ },
    {
        "role": "user", "content": """
        Look at the examples delimited by ### and the rules delimited by ***.
        *** RULES ***
        For each user text delimited by triple quotes, do the following:
    1. Identify which drugs have been tested and create a set for each.
    2. Identify the drug targets and categorize them as direct or indirect: Use the following logic.
    direct target: if the text clearly and directly mentions that the drug targets a gene
    indirect target: if the text does not explicitly mention the drug and target relationship but the drug is tested on some gene, then put that gene name here.
    3. Briefly indicate the logic for identifying the direct and indirect targets. 
    4. Again if explicitly and clearly mentioned, then get the drug-direct target relationship type. What does the drug do to the direct target? inhibitor, activator, etc.?
    5. Collect all diseases for which the drug has been tested.
    6. Extract any specific ClinicalTrials.gov identifier or number.
    7. For each drug, construct a set consisting of (drug name, direct target, indirect target, drug-direct target interaction, logic for direct target, logic for indirect target, ClinicalTrials.gov number, all diseases that the drug is tested in)
    8. Any empty values should be indicated by null and not an empty string.
    9. Double check your logic and see if the data you have collected is actually reflected in the text. If not, make revisions.
    10. Assemble all sets and produce 1 final JSON output.
    
        ### EXAMPLES 
        Example 1: LKB1/STK11 is a serine/threonine kinase that plays a major role in controlling cell metabolism, resulting in potential therapeutic vulnerabilities in LKB1-mutant cancers. Here, we identify the NAD + degrading ectoenzyme, CD38, as a new target in LKB1-mutant NSCLC. Metabolic profiling of genetically engineered mouse models (GEMMs) revealed that LKB1 mutant lung cancers have a striking increase in ADP-ribose, a breakdown product of the critical redox co-factor, NAD + . Surprisingly, compared with other genetic subsets, murine and human LKB1-mutant NSCLC show marked overexpression of the NAD+-catabolizing ectoenzyme, CD38 on the surface of tumor cells. Loss of LKB1 or inactivation of Salt-Inducible Kinases (SIKs)-key downstream effectors of LKB1- induces CD38 transcription induction via a CREB binding site in the CD38 promoter. Treatment with the FDA-approved anti-CD38 antibody, daratumumab, inhibited growth of LKB1-mutant NSCLC xenografts. Together, these results reveal CD38 as a promising therapeutic target in patients with LKB1 mutant lung cancer.
        Output: {drug name: daratumumab,
        direct target: CD38, 
        indirect target: LKB-1/STK-11, 
        drug-direct target interaction: anti-CD38 monoclonal antibody,
        direct target logic: explicitly mentioned,
        indirect target logic: drug shown to be effective in cancers with LKB-1 mutations,
        drug tested in following diseases: lung cancer, NSCLC,
        ClinicalTrials.gov ID: null}
        
        Example 2: Triple-negative breast cancers (TNBC) frequently inactivate p53, increasing their aggressiveness and therapy resistance. We identified an unexpected protein vulnerability in p53-inactivated TNBC and designed a new PROteolysis TArgeting Chimera (PROTAC) to target it. Our PROTAC selectively targets MDM2 for proteasome-mediated degradation with high-affinity binding and VHL recruitment. MDM2 loss in p53 mutant/deleted TNBC cells in two-dimensional/three-dimensional culture and TNBC patient explants, including relapsed tumors, causes apoptosis while sparing normal cells. Our MDM2-PROTAC is stable in vivo, and treatment of TNBC xenograft-bearing mice demonstrates tumor on-target efficacy with no toxicity to normal cells, significantly extending survival. Transcriptomic analyses revealed upregulation of p53 family target genes. Investigations showed activation and a required role for TAp73 to mediate MDM2-PROTAC-induced apoptosis. Our data, challenging the current MDM2/p53 paradigm, show MDM2 is required for p53-inactivated TNBC cell survival, and PROTAC-targeted MDM2 degradation is an innovative potential therapeutic strategy for TNBC and superior to existing MDM2 inhibitors. 
        Output: {drug name: MDM2-PROTAC,
        direct target: MDM2,
        indirect target: p53, 
        drug-direct target interaction: degrader,
        direct target logic: explicit mention - PROTAC selectively targets MDM2 for degradation,
        indirect target logic: drug kills cells with inactive p53 ,
        drug tested in following diseases: breast cancer, Triple-negative breast cancer TNBC,
        ClinicalTrials.gov ID: null}
        
    Use the RULES and EXAMPLES and create a similar output for the following text:
        Background: BRAF V600 mutations occur in various nonmelanoma cancers. We undertook a histology-independent phase 2 "basket" study of vemurafenib in BRAF V600 mutation-positive nonmelanoma cancers.
        Methods: We enrolled patients in six prespecified cancer cohorts; patients with all other tumor types were enrolled in a seventh cohort. A total of 122 patients with BRAF V600 mutation-positive cancer were treated, including 27 patients with colorectal cancer who received vemurafenib and cetuximab. The primary end point was the response rate; secondary end points included progression-free and overall survival.
        Results: In the cohort with non-small-cell lung cancer, the response rate was 42% (95% confidence interval [CI], 20 to 67) and median progression-free survival was 7.3 months (95% CI, 3.5 to 10.8). In the cohort with Erdheim-Chester disease or Langerhans'-cell histiocytosis, the response rate was 43% (95% CI, 18 to 71); the median treatment duration was 5.9 months (range, 0.6 to 18.6), and no patients had disease progression during therapy. There were anecdotal responses among patients with pleomorphic xanthoastrocytoma, anaplastic thyroid cancer, cholangiocarcinoma, salivary-duct cancer, ovarian cancer, and clear-cell sarcoma and among patients with colorectal cancer who received vemurafenib and cetuximab. Safety was similar to that in prior studies of vemurafenib for melanoma.
        Conclusions: BRAF V600 appears to be a targetable oncogene in some, but not all, nonmelanoma cancers. Preliminary vemurafenib activity was observed in non-small-cell lung cancer and in Erdheim-Chester disease and Langerhans'-cell histiocytosis. The histologic context is an important determinant of response in BRAF V600-mutated cancers. (Funded by F. Hoffmann-La Roche/Genentech; ClinicalTrials.gov number, NCT01524978.).

        """
    }
]

response = client.chat.completions.create(
        model=MODEL,
        response_format={ "type": "json_object" },
        messages=messages
    )

In [127]:
result = json.loads(response.model_dump_json())
data_dict = json.loads(result['choices'][0]['message']['content'])
data_dict

{'drug name': 'vemurafenib',
 'direct target': 'BRAF V600',
 'indirect target': None,
 'drug-direct target interaction': 'null',
 'direct target logic': 'explicit mention - study of vemurafenib in BRAF V600 mutation-positive nonmelanoma cancers',
 'indirect target logic': 'null',
 'drug tested in following diseases': ['non-small-cell lung cancer',
  'Erdheim-Chester disease',
  "Langerhans'-cell histiocytosis",
  'colorectal cancer',
  'pleomorphic xanthoastrocytoma',
  'anaplastic thyroid cancer',
  'cholangiocarcinoma',
  'salivary-duct cancer',
  'ovarian cancer',
  'clear-cell sarcoma'],
 'ClinicalTrials.gov ID': 'NCT01524978'}
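
In this output, two fields hold the string `'null'` rather than a real null (`None` after parsing), which conflicts with rule 8. A small post-processing step can make missing values uniform. This is a sketch only: `clean_nulls` is a hypothetical helper, and treating `'none'` and empty strings as missing is an assumption beyond what the rules say.

```python
def clean_nulls(value):
    """Recursively replace 'null'-like strings and empty strings with None."""
    if isinstance(value, dict):
        return {k: clean_nulls(v) for k, v in value.items()}
    if isinstance(value, list):
        return [clean_nulls(v) for v in value]
    # Assumption: these spellings all mean "no value".
    if isinstance(value, str) and value.strip().lower() in ("", "null", "none"):
        return None
    return value
```

Calling `clean_nulls(data_dict)` returns a new structure and leaves the original untouched.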

In [139]:
MODEL = 'gpt-4-turbo-preview'

messages = [
    {"role": "system","content": """
    You are a meticulous data scientist building out a drug-target dataset from biomedical text.
    """ },
    {
        "role": "user", "content": """
        Look at the examples delimited by ### and the rules delimited by ***.
        *** RULES 
        For each text shown under TASKS, do the following:
    1. Identify which drugs have been tested and create a set for each. 
    2. Return multiple sets if more than one drug is present in the text.
    
    For each drug:
    3. Get direct target: Use the following logic: if the text clearly and directly mentions that the drug targets a gene and has also defined the type of interaction
    4. Get interaction type between the drug and direct target: if explicitly and clearly mentioned, then get the drug-direct target relationship type. What does the drug do to the direct target? inhibitor, activator, etc.?
    5. Get groups tested: specify which type of genes was the drug tested on, eg. if drug was tested on samples showing high expression of certain gene.
    6. Collect all disease names for which the drug has been tested.
    7. Extract any specific ClinicalTrials.gov identifier or number.
    8. For each drug, construct a set consisting of (drug name, direct target, drug-direct target interaction, tested or effective group, logic for direct target, ClinicalTrials.gov number, all diseases that the drug is tested in)
    9. Any empty values should be indicated by null and not an empty string.
    10. Double check your logic and see if the data you have collected is actually reflected in the text. If not, make revisions.
    11. The direct target and drug-direct target interaction fields should either both be filled or both be null. It is not valid for only one of them to be null.
    12. Assemble all sets and produce 1 final JSON output.
    ***
        ### EXAMPLES 
        Example 1: LKB1/STK11 is a serine/threonine kinase that plays a major role in controlling cell metabolism, resulting in potential therapeutic vulnerabilities in LKB1-mutant cancers. Here, we identify the NAD + degrading ectoenzyme, CD38, as a new target in LKB1-mutant NSCLC. Metabolic profiling of genetically engineered mouse models (GEMMs) revealed that LKB1 mutant lung cancers have a striking increase in ADP-ribose, a breakdown product of the critical redox co-factor, NAD + . Surprisingly, compared with other genetic subsets, murine and human LKB1-mutant NSCLC show marked overexpression of the NAD+-catabolizing ectoenzyme, CD38 on the surface of tumor cells. Loss of LKB1 or inactivation of Salt-Inducible Kinases (SIKs)-key downstream effectors of LKB1- induces CD38 transcription induction via a CREB binding site in the CD38 promoter. Treatment with the FDA-approved anti-CD38 antibody, daratumumab, inhibited growth of LKB1-mutant NSCLC xenografts. Together, these results reveal CD38 as a promising therapeutic target in patients with LKB1 mutant lung cancer.
        Output: {drug name: daratumumab,
        direct target: CD38, 
        tested or effective group: LKB-1/STK-11 mutant NSCLC, 
        drug-direct target interaction: anti-CD38 monoclonal antibody,
        drug tested in following diseases: lung cancer, NSCLC,
        ClinicalTrials.gov ID: null}
        ###
        TASKS:
    Use the RULES and EXAMPLES and create similar outputs for the following texts delimited by ID:
    ID1:Triple-negative breast cancers (TNBC) frequently inactivate p53, increasing their aggressiveness and therapy resistance. We identified an unexpected protein vulnerability in p53-inactivated TNBC and designed a new PROteolysis TArgeting Chimera (PROTAC) to target it. Our PROTAC selectively targets MDM2 for proteasome-mediated degradation with high-affinity binding and VHL recruitment. MDM2 loss in p53 mutant/deleted TNBC cells in two-dimensional/three-dimensional culture and TNBC patient explants, including relapsed tumors, causes apoptosis while sparing normal cells. Our MDM2-PROTAC is stable in vivo, and treatment of TNBC xenograft-bearing mice demonstrates tumor on-target efficacy with no toxicity to normal cells, significantly extending survival. Transcriptomic analyses revealed upregulation of p53 family target genes. Investigations showed activation and a required role for TAp73 to mediate MDM2-PROTAC-induced apoptosis. Our data, challenging the current MDM2/p53 paradigm, show MDM2 is required for p53-inactivated TNBC cell survival, and PROTAC-targeted MDM2 degradation is an innovative potential therapeutic strategy for TNBC and superior to existing MDM2 inhibitors. 
    
    ID2:Background: BRAF V600 mutations occur in various nonmelanoma cancers. We undertook a histology-independent phase 2 "basket" study of vemurafenib in BRAF V600 mutation-positive nonmelanoma cancers.
        Methods: We enrolled patients in six prespecified cancer cohorts; patients with all other tumor types were enrolled in a seventh cohort. A total of 122 patients with BRAF V600 mutation-positive cancer were treated, including 27 patients with colorectal cancer who received vemurafenib and cetuximab. The primary end point was the response rate; secondary end points included progression-free and overall survival.
        Results: In the cohort with non-small-cell lung cancer, the response rate was 42% (95% confidence interval [CI], 20 to 67) and median progression-free survival was 7.3 months (95% CI, 3.5 to 10.8). In the cohort with Erdheim-Chester disease or Langerhans'-cell histiocytosis, the response rate was 43% (95% CI, 18 to 71); the median treatment duration was 5.9 months (range, 0.6 to 18.6), and no patients had disease progression during therapy. There were anecdotal responses among patients with pleomorphic xanthoastrocytoma, anaplastic thyroid cancer, cholangiocarcinoma, salivary-duct cancer, ovarian cancer, and clear-cell sarcoma and among patients with colorectal cancer who received vemurafenib and cetuximab. Safety was similar to that in prior studies of vemurafenib for melanoma.
        Conclusions: BRAF V600 appears to be a targetable oncogene in some, but not all, nonmelanoma cancers. Preliminary vemurafenib activity was observed in non-small-cell lung cancer and in Erdheim-Chester disease and Langerhans'-cell histiocytosis. The histologic context is an important determinant of response in BRAF V600-mutated cancers. (Funded by F. Hoffmann-La Roche/Genentech; ClinicalTrials.gov number, NCT01524978.).

        """
    }
]

response = client.chat.completions.create(
        model=MODEL,
        response_format={ "type": "json_object" },
        messages=messages
    )

**This output looks reasonable and also captures cetuximab. However, it again tagged BRAF V600 as a direct target, even though the text does not explicitly state that vemurafenib targets it.**
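
One way to catch this kind of unsupported inference automatically is to check each record after parsing. The sketch below applies rule 11 (both fields set or both null) and a crude grounding test that looks for the extracted strings verbatim in the source text. `check_record` is a hypothetical helper, and the field names are assumed to match the ones used in this prompt. A substring match will miss paraphrases, so it only flags candidates for manual review.

```python
def check_record(record, source_text):
    """Return a list of human-readable problems found in one extracted record."""
    problems = []
    target = record.get("direct target")
    interaction = record.get("drug-direct target interaction")
    # Rule 11: both fields filled, or both null.
    if (target is None) != (interaction is None):
        problems.append("direct target and interaction must both be set or both be null")
    # Grounding: extracted strings should occur verbatim (case-insensitive) in the text.
    text = source_text.lower()
    for field, value in (("direct target", target), ("interaction", interaction)):
        if isinstance(value, str) and value.lower() not in text:
            problems.append(f"{field} {value!r} not found verbatim in the text")
    return problems
```

For the vemurafenib abstract, a record with interaction `'inhibitor'` would be flagged, because that word does not appear anywhere in the text.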

In [140]:
result = json.loads(response.model_dump_json())
data_dict = json.loads(result['choices'][0]['message']['content'])
data_dict

{'ID1': {'drug name': 'PROTAC',
  'direct target': 'MDM2',
  'drug-direct target interaction': 'proteasome-mediated degradation',
  'tested or effective group': 'p53 mutant/deleted TNBC',
  'logic for direct target': None,
  'ClinicalTrials.gov ID': None,
  'drug tested in following diseases': ['Triple-negative breast cancer (TNBC)']},
 'ID2': [{'drug name': 'vemurafenib',
   'direct target': 'BRAF V600',
   'drug-direct target interaction': 'inhibitor',
   'tested or effective group': 'BRAF V600 mutation-positive nonmelanoma cancers',
   'logic for direct target': None,
   'ClinicalTrials.gov ID': 'NCT01524978',
   'drug tested in following diseases': ['non-small-cell lung cancer',
    'Erdheim-Chester disease',
    "Langerhans'-cell histiocytosis",
    'pleomorphic xanthoastrocytoma',
    'anaplastic thyroid cancer',
    'cholangiocarcinoma',
    'salivary-duct cancer',
    'ovarian cancer',
    'clear-cell sarcoma']},
  {'drug name': 'cetuximab',
   'direct target': None,
   'drug-d

In [141]:
MODEL = 'gpt-4-turbo-preview'

messages = [
    {"role": "system","content": """
    You are a meticulous data scientist working on named entity recognition and building out a drug-target dataset from biomedical text.
    """ },
    {
        "role": "user", "content": """
        Look at the examples delimited by ### and the rules delimited by ***.
        *** RULES 
        For each text shown under TASKS, do the following:
    1. Identify which drugs have been tested and create a set for each. 
    2. Return multiple sets if more than one drug is present in the text.
    
    For each drug:
    3. Get direct target: Use the following logic: if the text clearly and directly mentions that the drug targets a gene and has also defined the type of interaction
    4. Get interaction type between the drug and direct target: if explicitly and clearly mentioned, then get the drug-direct target relationship type. What does the drug do to the direct target? inhibitor, activator, etc.?
    5. Get groups tested: specify which type of genes was the drug tested on, eg. if drug was tested on samples showing high expression of certain gene.
    6. Collect all disease names for which the drug has been tested.
    7. Extract any specific ClinicalTrials.gov identifier or number.
    8. For each drug, construct a set consisting of (drug name, direct target, drug-direct target interaction, tested or effective group, logic for direct target, ClinicalTrials.gov number, all diseases that the drug is tested in)
    9. Any empty values should be indicated by null and not an empty string.
    10. Double check your logic and see if the data you have collected is actually reflected in the text. If not, make revisions.
    11. The direct target and drug-direct target interaction fields should either both be filled or both be null. It is not valid for only one of them to be null.
    12. Assemble all sets and produce 1 final JSON output.
    ***
        ### EXAMPLES 
        Example 1: LKB1/STK11 is a serine/threonine kinase that plays a major role in controlling cell metabolism, resulting in potential therapeutic vulnerabilities in LKB1-mutant cancers. Here, we identify the NAD + degrading ectoenzyme, CD38, as a new target in LKB1-mutant NSCLC. Metabolic profiling of genetically engineered mouse models (GEMMs) revealed that LKB1 mutant lung cancers have a striking increase in ADP-ribose, a breakdown product of the critical redox co-factor, NAD + . Surprisingly, compared with other genetic subsets, murine and human LKB1-mutant NSCLC show marked overexpression of the NAD+-catabolizing ectoenzyme, CD38 on the surface of tumor cells. Loss of LKB1 or inactivation of Salt-Inducible Kinases (SIKs)-key downstream effectors of LKB1- induces CD38 transcription induction via a CREB binding site in the CD38 promoter. Treatment with the FDA-approved anti-CD38 antibody, daratumumab, inhibited growth of LKB1-mutant NSCLC xenografts. Together, these results reveal CD38 as a promising therapeutic target in patients with LKB1 mutant lung cancer.
        Output: {drug name: daratumumab,
        direct target: CD38, 
        tested or effective group: LKB-1/STK-11 mutant NSCLC, 
        logic for direct target: text mentioned it is an anti-CD38 antibody
        drug-direct target interaction: anti-CD38 monoclonal antibody,
        drug tested in following diseases: lung cancer, NSCLC,
        ClinicalTrials.gov ID: null}
        ###
        TASKS:
    Use the RULES and EXAMPLES and create a similar outputs for the following text delimited by ID:
    ID1:Triple-negative breast cancers (TNBC) frequently inactivate p53, increasing their aggressiveness and therapy resistance. We identified an unexpected protein vulnerability in p53-inactivated TNBC and designed a new PROteolysis TArgeting Chimera (PROTAC) to target it. Our PROTAC selectively targets MDM2 for proteasome-mediated degradation with high-affinity binding and VHL recruitment. MDM2 loss in p53 mutant/deleted TNBC cells in two-dimensional/three-dimensional culture and TNBC patient explants, including relapsed tumors, causes apoptosis while sparing normal cells. Our MDM2-PROTAC is stable in vivo, and treatment of TNBC xenograft-bearing mice demonstrates tumor on-target efficacy with no toxicity to normal cells, significantly extending survival. Transcriptomic analyses revealed upregulation of p53 family target genes. Investigations showed activation and a required role for TAp73 to mediate MDM2-PROTAC-induced apoptosis. Our data, challenging the current MDM2/p53 paradigm, show MDM2 is required for p53-inactivated TNBC cell survival, and PROTAC-targeted MDM2 degradation is an innovative potential therapeutic strategy for TNBC and superior to existing MDM2 inhibitors. 
    
    ID2:Background: BRAF V600 mutations occur in various nonmelanoma cancers. We undertook a histology-independent phase 2 "basket" study of vemurafenib in BRAF V600 mutation-positive nonmelanoma cancers.
        Methods: We enrolled patients in six prespecified cancer cohorts; patients with all other tumor types were enrolled in a seventh cohort. A total of 122 patients with BRAF V600 mutation-positive cancer were treated, including 27 patients with colorectal cancer who received vemurafenib and cetuximab. The primary end point was the response rate; secondary end points included progression-free and overall survival.
        Results: In the cohort with non-small-cell lung cancer, the response rate was 42% (95% confidence interval [CI], 20 to 67) and median progression-free survival was 7.3 months (95% CI, 3.5 to 10.8). In the cohort with Erdheim-Chester disease or Langerhans'-cell histiocytosis, the response rate was 43% (95% CI, 18 to 71); the median treatment duration was 5.9 months (range, 0.6 to 18.6), and no patients had disease progression during therapy. There were anecdotal responses among patients with pleomorphic xanthoastrocytoma, anaplastic thyroid cancer, cholangiocarcinoma, salivary-duct cancer, ovarian cancer, and clear-cell sarcoma and among patients with colorectal cancer who received vemurafenib and cetuximab. Safety was similar to that in prior studies of vemurafenib for melanoma.
        Conclusions: BRAF V600 appears to be a targetable oncogene in some, but not all, nonmelanoma cancers. Preliminary vemurafenib activity was observed in non-small-cell lung cancer and in Erdheim-Chester disease and Langerhans'-cell histiocytosis. The histologic context is an important determinant of response in BRAF V600-mutated cancers. (Funded by F. Hoffmann-La Roche/Genentech; ClinicalTrials.gov number, NCT01524978.).

        """
    }
]

response = client.chat.completions.create(
        model=MODEL,
        response_format={ "type": "json_object" },
        messages=messages,
        seed=0
    )

In [142]:
result = json.loads(response.model_dump_json())
data_dict = json.loads(result['choices'][0]['message']['content'])
data_dict

{'ID1': {'drug name': 'MDM2-PROTAC',
  'direct target': 'MDM2',
  'tested or effective group': 'p53 mutant/deleted TNBC',
  'logic for direct target': 'Our PROTAC selectively targets MDM2 for proteasome-mediated degradation with high-affinity binding and VHL recruitment',
  'drug-direct target interaction': 'proteasome-mediated degradation',
  'drug tested in following diseases': 'Triple-negative breast cancer (TNBC)',
  'ClinicalTrials.gov ID': None},
 'ID2': {'drug name': 'vemurafenib',
  'direct target': 'BRAF V600',
  'tested or effective group': 'BRAF V600 mutation-positive nonmelanoma cancers',
  'logic for direct target': 'vemurafenib in BRAF V600 mutation-positive nonmelanoma cancers',
  'drug-direct target interaction': 'inhibitor',
  'drug tested in following diseases': "non-small-cell lung cancer, Erdheim-Chester disease, Langerhans'-cell histiocytosis, pleomorphic xanthoastrocytoma, anaplastic thyroid cancer, cholangiocarcinoma, salivary-duct cancer, ovarian cancer, clear-c
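
Rule 11 in the prompt asks that `direct target` and `drug-direct target interaction` are either both filled or both null. A minimal sketch of how that rule could be checked on the parsed output, rather than trusting the model to follow it. The helper name `check_target_pairing` and the second sample record are hypothetical and only for illustration.

```python
def check_target_pairing(records):
    """Return the IDs whose 'direct target' and 'drug-direct target interaction'
    fields are not both filled or both null."""
    bad = []
    for rec_id, rec in records.items():
        has_target = rec.get('direct target') is not None
        has_interaction = rec.get('drug-direct target interaction') is not None
        if has_target != has_interaction:
            bad.append(rec_id)
    return bad

sample = {
    'ID1': {'direct target': 'MDM2',
            'drug-direct target interaction': 'proteasome-mediated degradation'},
    # Made-up violation to show what gets flagged (not taken from the real output).
    'ID2': {'direct target': 'BRAF V600',
            'drug-direct target interaction': None},
}
print(check_target_pairing(sample))
```

Records flagged this way could be re-queried or reviewed by hand.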

In [143]:
result['usage']

{'completion_tokens': 414, 'prompt_tokens': 1532, 'total_tokens': 1946}
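
The usage counts above can be turned into a rough cost per query. This is a sketch that assumes the GPT-4 rates quoted in the notes section (\$0.03 per 1K input tokens, \$0.06 per 1K output tokens); the model actually used for this call may be priced differently, so the rates are left as parameters. The function name `estimate_cost` is hypothetical.

```python
def estimate_cost(usage, input_per_1k=0.03, output_per_1k=0.06):
    """Estimate the cost in USD of one call from its token usage.

    The default rates are an assumption (GPT-4 list prices from the notes
    section); pass the rates for the model you are actually using.
    """
    return (usage['prompt_tokens'] * input_per_1k
            + usage['completion_tokens'] * output_per_1k) / 1000

usage = {'completion_tokens': 414, 'prompt_tokens': 1532, 'total_tokens': 1946}
cost = estimate_cost(usage)
print(f'Estimated cost: ${cost:.4f}')
```

At these assumed rates the long prompt dominates the cost, which is why the later cells try to shrink it.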

**Since it is still inferring targets that the text only implies, let's try some shorter phrases to see its behavior. Also try chain-of-thought prompting.**

In [150]:
MODEL = 'gpt-4-turbo-preview'

messages = [
    {"role": "system","content": """
    You are a meticulous data scientist working on named entity recognition and building out a drug-target dataset from biomedical text.
    Return a single JSON output combining outputs from all tasks contained within ###.
    """ },
    {
        "role": "user", "content": """
        Look at this example and logically arriving at any drug and target information. Return a set capturing drug target information. 
        Example: We tested Vemurafenib in KRASG12C mutated cancers and showed that is very effective in these cancers.
        
        Answer=The text talks about the drug Vemurafenib. It was tested in cancers with a gene mutation in KRAS gene.
        However, it does not say that the drug is designed to target the KRAS gene. So, KRAS should not be labeled as a target for Vemurafenib.
        The answer is {drug name: Vemurafenib, target name: None}
        
        ### TASKS
        Task 1 - Sotorosib is a newly approved FDA drug and is designed to inhibit the KRAS gene.
        Task 2 - Venetoclax efficacy was shown to be very poor in AXL mutated cancers.
        Task 3 - Paclitaxel is a very common chemotherapy known to deregulate the microtubule dynamics.
        
        """
    }
]

response = client.chat.completions.create(
        model=MODEL,
        response_format={ "type": "json_object" },
        messages=messages,
        seed=0
    )

In [151]:
response

ChatCompletion(id='chatcmpl-8xQlSVIvrgHv6pKNky15sTmgTVhfO', choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content='{\n  "Task 1": {\n    "drug name": "Sotorosib",\n    "target name": "KRAS"\n  },\n  "Task 2": {\n    "drug name": "Venetoclax",\n    "target name": "None"\n  },\n  "Task 3": {\n    "drug name": "Paclitaxel",\n    "target name": "microtubule dynamics"\n  }\n}', role='assistant', function_call=None, tool_calls=None))], created=1709174926, model='gpt-4-0125-preview', object='chat.completion', system_fingerprint='fp_89b1a570e1', usage=CompletionUsage(completion_tokens=92, prompt_tokens=271, total_tokens=363))

**Let's run the exact same prompt again. The behavior often changes from run to run.**

In [155]:
MODEL = 'gpt-4-turbo-preview'

messages = [
    {"role": "system","content": """
    You are a meticulous data scientist working on named entity recognition and building out a drug-target dataset from biomedical text.
    Return a single JSON output combining outputs from all tasks contained within ###.
    """ },
    {
        "role": "user", "content": """
        Look at this example and logically arriving at any drug and target information. Return a set capturing drug target information. 
        Example: We tested Vemurafenib in KRASG12C mutated cancers and showed that is very effective in these cancers.
        
        Answer=The text talks about the drug Vemurafenib. It was tested in cancers with a gene mutation in KRAS gene.
        However, it does not say that the drug is designed to target the KRAS gene. So, KRAS should not be labeled as a target for Vemurafenib.
        The answer is {drug name: Vemurafenib, target name: None}
        
        ### TASKS
        Task 1 - Sotorosib is a newly approved FDA drug and is designed to inhibit the KRAS gene.
        Task 2 - Venetoclax efficacy was shown to be very poor in AXL mutated cancers.
        Task 3 - Paclitaxel is a very common chemotherapy known to deregulate the microtubule dynamics.
        
        """
    }
]

response = client.chat.completions.create(
        model=MODEL,
        response_format={ "type": "json_object" },
        messages=messages,
        seed=0
    )

**The prompt text is unchanged, yet the result differs. The earlier run captured the relationships correctly, while the result below incorrectly labels AXL as a target for Venetoclax. So, at least here, the seed parameter did not lead to consistent outputs.**

In [156]:
response

ChatCompletion(id='chatcmpl-8xQnD7HVtiOBvOKXJ5gRhgooY3GNH', choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content='{\n  "Task 1": {\n    "drug name": "Sotorosib",\n    "target name": "KRAS"\n  },\n  "Task 2": {\n    "drug name": "Venetoclax",\n    "target name": "AXL"\n  },\n  "Task 3": {\n    "drug name": "Paclitaxel",\n    "target name": "Microtubule dynamics"\n  }\n}', role='assistant', function_call=None, tool_calls=None))], created=1709175035, model='gpt-4-0125-preview', object='chat.completion', system_fingerprint='fp_89b1a570e1', usage=CompletionUsage(completion_tokens=93, prompt_tokens=271, total_tokens=364))
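
To make run-to-run differences easier to spot than by reading two raw responses, the parsed outputs can be compared field by field. A minimal sketch, using the two outputs shown above as inputs; `diff_outputs` is a hypothetical helper. Both responses report the same `system_fingerprint` (`fp_89b1a570e1`), so the difference does not appear to come from a backend change visible through that field.

```python
def diff_outputs(first, second):
    """Return {task: {field: (first_value, second_value)}} for fields that differ."""
    diffs = {}
    for task in sorted(set(first) | set(second)):
        a, b = first.get(task, {}), second.get(task, {})
        changed = {
            field: (a.get(field), b.get(field))
            for field in sorted(set(a) | set(b))
            if a.get(field) != b.get(field)
        }
        if changed:
            diffs[task] = changed
    return diffs

# Parsed content of the two responses above (same prompt, same seed).
run_1 = {
    'Task 1': {'drug name': 'Sotorosib', 'target name': 'KRAS'},
    'Task 2': {'drug name': 'Venetoclax', 'target name': 'None'},
    'Task 3': {'drug name': 'Paclitaxel', 'target name': 'microtubule dynamics'},
}
run_2 = {
    'Task 1': {'drug name': 'Sotorosib', 'target name': 'KRAS'},
    'Task 2': {'drug name': 'Venetoclax', 'target name': 'AXL'},
    'Task 3': {'drug name': 'Paclitaxel', 'target name': 'Microtubule dynamics'},
}
differences = diff_outputs(run_1, run_2)
for task, changed in differences.items():
    print(task, changed)
```

Note that the comparison is case-sensitive, so the capitalization change in Task 3 is also reported.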

**Lowering the temperature might help, since lower values tend to produce more deterministic output.**

In [161]:
MODEL = 'gpt-4-turbo-preview'

messages = [
    {"role": "system","content": """
    You are a meticulous data scientist working on named entity recognition and building out a drug-target dataset from biomedical text.
    Return a single JSON output combining outputs from all tasks contained within ###.
    """ },
    {
        "role": "user", "content": """
        Look at this example and logically arriving at any drug and target information. Return a set capturing drug target information. 
        Example: We tested Vemurafenib in KRASG12C mutated cancers and showed that is very effective in these cancers.
        
        Answer=The text talks about the drug Vemurafenib. It was tested in cancers with a gene mutation in KRAS gene.
        However, it does not say that the drug is designed to target the KRAS gene. So, KRAS should not be labeled as a target for Vemurafenib.
        The answer is {drug name: Vemurafenib, target name: None}
        
        ### TASKS
        Task 1 - Sotorosib is a newly approved FDA drug and is designed to inhibit the KRAS gene.
        Task 2 - Venetoclax efficacy was shown to be very poor in AXL mutated cancers.
        Task 3 - Paclitaxel is a very common chemotherapy known to deregulate the microtubule dynamics.
        
        """
    }
]


temperatures = [0.2, 0.5, 0.7, 1.0, 1.2, 1.5]

for temperature in temperatures:
    response = client.chat.completions.create(
        model=MODEL,
        response_format={ "type": "json_object" },
        messages=messages,
        seed=0,
        temperature=temperature
    )
    
    result = json.loads(response.model_dump_json())
    data_dict = json.loads(result['choices'][0]['message']['content'])
    
    print(f'Temperature={temperature}')
    print(data_dict)

Temperature=0.2
{'Task 1': {'drug name': 'Sotorosib', 'target name': 'KRAS'}, 'Task 2': {'drug name': 'Venetoclax', 'target name': 'None'}, 'Task 3': {'drug name': 'Paclitaxel', 'target name': 'microtubule dynamics'}}
Temperature=0.5
{'Task 1': {'drug name': 'Sotorosib', 'target name': 'KRAS'}, 'Task 2': {'drug name': 'Venetoclax', 'target name': 'None'}, 'Task 3': {'drug name': 'Paclitaxel', 'target name': 'microtubule dynamics'}}
Temperature=0.7
{'Task 1': {'drug name': 'Sotorosib', 'target name': 'KRAS'}, 'Task 2': {'drug name': 'Venetoclax', 'target name': 'None'}, 'Task 3': {'drug name': 'Paclitaxel', 'target name': 'microtubule dynamics'}}
Temperature=1.0
{'Task 1': {'drug name': 'Sotorosib', 'target name': 'KRAS'}, 'Task 2': {'drug name': 'Venetoclax', 'target name': 'AXL'}, 'Task 3': {'drug name': 'Paclitaxel', 'target name': 'microtubule dynamics'}}
Temperature=1.2
{'Task 1': {'drug name': 'Sotorosib', 'target name': 'KRAS'}, 'Task 2': {'drug name': 'Venetoclax', 'target name'
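
The sweep above makes one call per temperature, so it cannot separate the effect of temperature from the run-to-run variation already seen with an identical prompt. A sketch of one way to address this: repeat each setting a few times and count the distinct outputs. `tally_outputs` and `fake_call` are hypothetical; `fake_call` is a stub standing in for the real API call so the example runs offline.

```python
import json
from collections import Counter

def tally_outputs(call_model, temperatures, n_repeats=3):
    """Call `call_model(temperature)` n_repeats times per temperature and
    count how often each distinct (JSON-serialized) output occurs."""
    tallies = {}
    for temperature in temperatures:
        counts = Counter()
        for _ in range(n_repeats):
            output = call_model(temperature)
            counts[json.dumps(output, sort_keys=True)] += 1
        tallies[temperature] = counts
    return tallies

def fake_call(temperature):
    # Stub for illustration only; replace with a function that sends the
    # prompt via client.chat.completions.create and parses the JSON reply.
    target = 'None' if temperature < 1.0 else 'AXL'
    return {'Task 2': {'drug name': 'Venetoclax', 'target name': target}}

tallies = tally_outputs(fake_call, [0.2, 1.0], n_repeats=2)
for temperature, counts in tallies.items():
    print(f'Temperature={temperature}: {len(counts)} distinct output(s)')
```

With the real API, more than one distinct output at a given temperature would indicate that the setting alone does not make the result stable.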

**FEW SHOT: Adding a mix of positive and negative examples is supposed to improve performance. Here, including 2 negative cases and 1 positive case seems to have improved the output at a temperature of 0.2: it no longer tags BRAF as a target for Vemurafenib. However, here and in the tests above, it reports microtubule dynamics as a target. In a way that is correct, but in a dataset it could cause some messiness, since it is a term rather than a gene. Not yet tested: if we go back to a 1-shot prompt, would it still give the same result at a low temperature of 0.2?**
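
Two post-processing issues show up in these outputs: missing targets come back as the string `'None'` rather than a JSON null, and free-text terms such as "microtubule dynamics" are mixed in with gene symbols. A rough sketch of a cleanup step follows. Both helpers are hypothetical, and the regular expression is only a crude heuristic for "looks like a gene symbol" (it will also accept variant strings and other short uppercase tokens), not a lookup against a real gene nomenclature.

```python
import re

# Crude heuristic (assumption): short, uppercase, alphanumeric token.
GENE_LIKE = re.compile(r'^[A-Z][A-Z0-9-]{1,9}$')

def normalize_target(value):
    """Map 'None', 'null' or an empty string to None; strip whitespace otherwise."""
    if value is None:
        return None
    text = str(value).strip()
    if text.lower() in {'none', 'null', ''}:
        return None
    return text

def looks_like_gene_symbol(value):
    """True if the value superficially resembles a gene symbol."""
    return value is not None and bool(GENE_LIKE.match(value))

for raw in ['KRAS', 'None', 'microtubule dynamics']:
    target = normalize_target(raw)
    print(repr(raw), '->', repr(target), '| gene-like:', looks_like_gene_symbol(target))
```

Targets that fail the check could be routed to manual review instead of being dropped.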

In [162]:
MODEL = 'gpt-4-turbo-preview'

messages = [
    {"role": "system","content": """
    You are a meticulous data scientist working on named entity recognition and building out a drug-target dataset from biomedical text.
    Return a single JSON output combining outputs from all tasks contained within ###.
    """ },
    {
        "role": "user", "content": """
        Look at these examples and logically arriving at any drug and target information. Return a set capturing drug target information. 
        
        Example 1: We tested Vemurafenib in KRASG12C mutated cancers and showed that is very effective in these cancers.
        Answer=The text talks about the drug Vemurafenib. It was tested in cancers with a gene mutation in KRAS gene.
        However, it does not say that the drug is designed to target the KRAS gene. So, KRAS should not be labeled as a target for Vemurafenib.
        The answer is {drug name: Vemurafenib, target name: None}
        
        Example 2: Erlotinib is a kinase inihibitor, with binding activites showed for EGFR and downstream effects seen on PKA, PKC. 
        Answer=The drug name is Erlotinib. There are 3 genes mentioned here - EGFR, PKA, PKC. Erlotinib is said to bind to EGFR and hence EGFR is a target. Erlotinib has effects on PKA and PKC but that doesn't mean it directly binds and targets them, so they are not targets.
        The answer is {drug name: Erlotinib, target name: EGFR}
        
        Example 3: Metformin is a drug commonly used for people with type 2 diabetes. But in cancer, Metformin has been shown to decrease KI-67 expression.
        Answer=The drug name is Metformin. Although it can impact Ki-67 expression in cancers, the text doesn't explicitly say that Ki-67 is a direct target. So, in this text no target is mentioned for Metformin.
        The answer is {drug name: Metformin, target name: None}
       
        
        ### TASKS
        Task 1 - Sotorosib is a newly approved FDA drug and is designed to inhibit the KRAS gene.
        Task 2 - Venetoclax efficacy was shown to be very poor in AXL mutated cancers.
        Task 3 - Paclitaxel is a very common chemotherapy known to deregulate the microtubule dynamics.
        Task 4 - Vemurafenib was tested and showed positive results in patients with a BRAFV600E mutation.
        """
    }
]


temperatures = [0.2, 0.5, 0.7, 1.0, 1.2, 1.5]

for temperature in temperatures:
    response = client.chat.completions.create(
        model=MODEL,
        response_format={ "type": "json_object" },
        messages=messages,
        seed=0,
        temperature=temperature
    )
    
    result = json.loads(response.model_dump_json())
    data_dict = json.loads(result['choices'][0]['message']['content'])
    
    print(f'Temperature={temperature}')
    print(data_dict)

Temperature=0.2
{'Task 1': {'drug name': 'Sotorosib', 'target name': 'KRAS'}, 'Task 2': {'drug name': 'Venetoclax', 'target name': 'None'}, 'Task 3': {'drug name': 'Paclitaxel', 'target name': 'microtubule dynamics'}, 'Task 4': {'drug name': 'Vemurafenib', 'target name': 'None'}}
Temperature=0.5
{'Task 1': {'drug name': 'Sotorosib', 'target name': 'KRAS'}, 'Task 2': {'drug name': 'Venetoclax', 'target name': 'None'}, 'Task 3': {'drug name': 'Paclitaxel', 'target name': 'microtubule dynamics'}, 'Task 4': {'drug name': 'Vemurafenib', 'target name': 'BRAFV600E'}}
Temperature=0.7
{'Task 1': {'drug name': 'Sotorosib', 'target name': 'KRAS'}, 'Task 2': {'drug name': 'Venetoclax', 'target name': 'None'}, 'Task 3': {'drug name': 'Paclitaxel', 'target name': 'microtubule dynamics'}, 'Task 4': {'drug name': 'Vemurafenib', 'target name': 'BRAFV600E'}}
Temperature=1.0
{'Task 1': {'drug name': 'Sotorosib', 'target name': 'KRAS'}, 'Task 2': {'drug name': 'Venetoclax', 'target name': 'None'}, 'Task 3

**Let's split the instructions between the system and user messages and refine the structure. Test a temperature of 0.2.**

In [165]:
MODEL = 'gpt-4-turbo-preview'

messages = [
    {"role": "system","content": """
    You are a meticulous data scientist working on named entity recognition and building out a drug-target dataset from biomedical text.
    You will look at instructions from the ###instruction section, see examples from the ***example sections. Produce an output similar to shown for examples for each of the sections identified as TASK.   
    Then generate a single JSON output from all tasks.
    """ },
    {
        "role": "user", "content": """
        ###Instruction: Extract drug and target names from text by following these steps:
        First, identify all drug names.
        If no drug name is present, completely skip the task and go to the next one.
        Then, check if any targets are mentioned for each drug.This fact should be clearly mentioned in the text.
        Generate a separate set for each drug and then combine all into 1 JSON. 
        Follow the 'output' pattern shown in ###example sections.
        """
    },
    {
        "role": "user", "content": """
        ***example: We tested Vemurafenib in KRASG12C mutated cancers and showed that is very effective in these cancers.
        Let's follow ###Instruction and think step by step.
        The text talks about the drug Vemurafenib. It was tested in cancers with a gene mutation in KRAS gene.
        However, it does not say that the drug is designed to target the KRAS gene. So, KRAS should not be labeled as a target for Vemurafenib.
        Output={drug name: Vemurafenib, target name:None}
        """
        
    },
    {
        "role": "user", "content":"""
        TASKS
        Task 1 - Sotorosib is a newly approved FDA drug and is designed to inhibit the KRAS gene.
        Task 2 - Venetoclax efficacy was shown to be very poor in AXL mutated cancers.
        Task 3 - Paclitaxel is a very common chemotherapy known to deregulate the microtubule dynamics.
        Task 4 - Vemurafenib was tested and showed positive results in patients with a BRAFV600E mutation.
        """
    }
]

temperature=0.2

**Oddly, the problem is back here. It did better with the prompt above, which actually had fewer instructions.
To narrow this down: is the prompt wording itself the problem, or the fact that the number of examples dropped from 3 to just 1?
Let's test the two scenarios.**

In [166]:
response = client.chat.completions.create(
        model=MODEL,
        response_format={ "type": "json_object" },
        messages=messages,
        seed=0,
        temperature=temperature
    )

result = json.loads(response.model_dump_json())
data_dict = json.loads(result['choices'][0]['message']['content'])

In [167]:
data_dict

{'Task 1': {'drug name': 'Sotorosib', 'target name': 'KRAS'},
 'Task 2': {'drug name': 'Venetoclax', 'target name': 'None'},
 'Task 3': {'drug name': 'Paclitaxel', 'target name': 'microtubule dynamics'},
 'Task 4': {'drug name': 'Vemurafenib', 'target name': 'BRAFV600E'}}
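
The request-and-parse code is repeated for every prompt variant in this section. A small helper could wrap it, so each experiment only has to define `messages`. This is a sketch: `run_prompt` is a hypothetical name, and the fake client below exists only so the example runs without an API key. With the real `client`, the same call would go to the API.

```python
import json
from types import SimpleNamespace

def run_prompt(client, messages, model='gpt-4-turbo-preview', temperature=0.2, seed=0):
    """Send one chat completion request in JSON mode and return
    (parsed JSON content, usage)."""
    response = client.chat.completions.create(
        model=model,
        response_format={"type": "json_object"},
        messages=messages,
        seed=seed,
        temperature=temperature,
    )
    return json.loads(response.choices[0].message.content), response.usage

class _FakeCompletions:
    """Offline stand-in for client.chat.completions (illustration only)."""
    def create(self, **kwargs):
        content = json.dumps({'Task 1': {'drug name': 'Sotorosib', 'target name': 'KRAS'}})
        message = SimpleNamespace(content=content)
        return SimpleNamespace(choices=[SimpleNamespace(message=message)], usage=None)

fake_client = SimpleNamespace(chat=SimpleNamespace(completions=_FakeCompletions()))
demo_dict, demo_usage = run_prompt(fake_client, [{'role': 'user', 'content': 'demo'}])
print(demo_dict)
```

Reading `response.choices[0].message.content` directly avoids the round trip through `model_dump_json()` used in the cells above; the parsed result is the same.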

**Older prompt language with just 1 example, but with the text distributed differently across the system and user messages.**

In [168]:
MODEL = 'gpt-4-turbo-preview'

messages = [
    {"role": "system","content": """
    You are a meticulous data scientist working on named entity recognition and building out a drug-target dataset from biomedical text.
    Return a single JSON output combining outputs from all tasks contained within ###
    """ },
    {
        "role": "user", "content": """
        Look at these examples and logically arriving at any drug and target information. Return a set capturing drug target information.
        """
    },
    {
        "role": "user", "content": """
        Example 1: We tested Vemurafenib in KRASG12C mutated cancers and showed that is very effective in these cancers.
        Answer=The text talks about the drug Vemurafenib. It was tested in cancers with a gene mutation in KRAS gene.
        However, it does not say that the drug is designed to target the KRAS gene. So, KRAS should not be labeled as a target for Vemurafenib.
        The answer is {drug name: Vemurafenib, target name: None}
        """
    },
    {
        "role": "user", "content":"""
        ### TASKS
        Task 1 - Sotorosib is a newly approved FDA drug and is designed to inhibit the KRAS gene.
        Task 2 - Venetoclax efficacy was shown to be very poor in AXL mutated cancers.
        Task 3 - Paclitaxel is a very common chemotherapy known to deregulate the microtubule dynamics.
        Task 4 - Vemurafenib was tested and showed positive results in patients with a BRAFV600E mutation.
        """
    }
]

temperature=0.2

response = client.chat.completions.create(
        model=MODEL,
        response_format={ "type": "json_object" },
        messages=messages,
        seed=0,
        temperature=temperature
    )

result = json.loads(response.model_dump_json())
data_dict = json.loads(result['choices'][0]['message']['content'])

In [169]:
data_dict

{'Task 1': {'drug name': 'Sotorosib', 'target name': 'KRAS'},
 'Task 2': {'drug name': 'Venetoclax', 'target name': 'None'},
 'Task 3': {'drug name': 'Paclitaxel', 'target name': 'microtubule dynamics'},
 'Task 4': {'drug name': 'Vemurafenib', 'target name': 'BRAFV600E'}}

In [174]:
MODEL = 'gpt-4-turbo-preview'

messages = [
    {"role": "system","content": """
    You are a meticulous data scientist working on named entity recognition and building out a drug-target dataset from biomedical text.
    Return a single JSON output combining outputs from all tasks contained within ###
    """ },
    {
        "role": "user", "content": """
        Look at these examples and logically arriving at any drug and target information. Return a set capturing drug target information.
        """
    },
    {
        "role": "user", "content": """
        Example 1: We tested Vemurafenib in KRASG12C mutated cancers and showed that is very effective in these cancers.
        """
    },
    {
        "role":"user","content":"""
        Let's think about this step by step.
        The text talks about the drug Vemurafenib. It was tested in cancers with a gene mutation in KRAS gene.
        However, it does not say that the drug is designed to target the KRAS gene. So, KRAS should not be labeled as a target for Vemurafenib.
        The output should be {drug name: Vemurafenib, target name: None}
        """
    },
    {
        "role": "user", "content":"""
        ### TASKS
        Task 1 - Sotorosib is a newly approved FDA drug and is designed to inhibit the KRAS gene.
        Task 2 - Venetoclax efficacy was shown to be very poor in AXL mutated cancers.
        Task 3 - Paclitaxel is a very common chemotherapy known to deregulate the microtubule dynamics.
        Task 4 - Vemurafenib was tested and showed positive results in patients with a BRAFV600E mutation.
        """
    }
]

temperature=0.2

response = client.chat.completions.create(
        model=MODEL,
        response_format={ "type": "json_object" },
        messages=messages,
        seed=0,
        temperature=temperature
    )

result = json.loads(response.model_dump_json())
data_dict = json.loads(result['choices'][0]['message']['content'])

In [175]:
data_dict

{'Task 1': {'drug name': 'Sotorosib', 'target name': 'KRAS'},
 'Task 2': {'drug name': 'Venetoclax', 'target name': 'None'},
 'Task 3': {'drug name': 'Paclitaxel', 'target name': 'microtubule dynamics'},
 'Task 4': {'drug name': 'Vemurafenib', 'target name': 'BRAFV600E'}}

**That didn't work. So, let's go back to the new prompt language but increase the number of examples from 1 to 3.**

In [177]:
MODEL = 'gpt-4-turbo-preview'

messages = [
    {"role": "system","content": """
    You are a meticulous data scientist working on named entity recognition and building out a drug-target dataset from biomedical text.
    You will look at instructions from the ###Instruction ### section, see examples from the ***examples sections. Produce an output similar to shown for examples for each of the sections identified as TASK.   
    Then generate a single JSON output from all tasks.
    """ },
    {
        "role": "user", "content": """
        ###Instruction: Extract drug and target names from text by following these steps:
        First, identify all drug names.
        If no drug name is present, completely skip the task and go to the next one.
        Then, check if any targets are mentioned for each drug.This fact should be clearly mentioned in the text.
        Generate a separate set for each drug and then combine all into 1 JSON. 
        Follow the 'output' pattern shown in ***examples*** sections.
        ###
        """
    },
    {
        "role": "user", "content": """
        ***examples
        Example 1: We tested Vemurafenib in KRASG12C mutated cancers and showed that is very effective in these cancers.
        Answer=The text talks about the drug Vemurafenib. It was tested in cancers with a gene mutation in KRAS gene.
        However, it does not say that the drug is designed to target the KRAS gene. So, KRAS should not be labeled as a target for Vemurafenib.
        The answer is {drug name: Vemurafenib, target name: None}
        
        Example 2: Erlotinib is a kinase inihibitor, with binding activites showed for EGFR and downstream effects seen on PKA, PKC. 
        Answer=The drug name is Erlotinib. There are 3 genes mentioned here - EGFR, PKA, PKC. Erlotinib is said to bind to EGFR and hence EGFR is a target. Erlotinib has effects on PKA and PKC but that doesn't mean it directly binds and targets them, so they are not targets.
        The answer is {drug name: Erlotinib, target name: EGFR}
        
        Example 3: Metformin is a drug commonly used for people with type 2 diabetes. But in cancer, Metformin has been shown to decrease KI-67 expression.
        Answer=The drug name is Metformin. Although it can impact Ki-67 expression in cancers, the text doesn't explicitly say that Ki-67 is a direct target. So, in this text no target is mentioned for Metformin.
        The answer is {drug name: Metformin, target name: None}
        ***
        """
        
    },
    {
        "role": "user", "content":"""
        TASKS
        Task 1 - Sotorosib is a newly approved FDA drug and is designed to inhibit the KRAS gene.
        Task 2 - Venetoclax efficacy was shown to be very poor in AXL mutated cancers.
        Task 3 - Paclitaxel is a very common chemotherapy known to deregulate the microtubule dynamics.
        Task 4 - Vemurafenib was tested and showed positive results in patients with a BRAFV600E mutation.
        """
    }
]

temperature=0.2

response = client.chat.completions.create(
        model=MODEL,
        response_format={ "type": "json_object" },
        messages=messages,
        seed=0,
        temperature=temperature
    )

result = json.loads(response.model_dump_json())
data_dict = json.loads(result['choices'][0]['message']['content'])

**So, it looks like increasing the number of examples helped. But this is a lot of tokens, so we need to reduce the text somehow.**

In [178]:
data_dict

{'Task 1': {'drug name': 'Sotorosib', 'target name': 'KRAS'},
 'Task 2': {'drug name': 'Venetoclax', 'target name': 'None'},
 'Task 3': {'drug name': 'Paclitaxel', 'target name': 'None'},
 'Task 4': {'drug name': 'Vemurafenib', 'target name': 'None'}}
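
Comparing prompt variants by eye gets harder as the number of variants grows. A sketch of a simple scoring step against a small set of expected answers follows. The `expected` labels are an assumption based on the commentary in this notebook (only Task 1 states a direct target; treating Paclitaxel as having no gene target is a labeling choice), and `score_targets` is a hypothetical helper.

```python
def score_targets(data_dict, expected):
    """Return (accuracy, mismatches) for the 'target name' field.

    The strings 'None', 'null' and '' are treated as a missing target.
    """
    mismatches = {}
    for task, want in expected.items():
        got = data_dict.get(task, {}).get('target name')
        if got in ('None', 'null', ''):
            got = None
        if got != want:
            mismatches[task] = (want, got)
    accuracy = 1 - len(mismatches) / len(expected)
    return accuracy, mismatches

# Assumed labels for the four test sentences (see the note above).
expected = {'Task 1': 'KRAS', 'Task 2': None, 'Task 3': None, 'Task 4': None}

one_example = {
    'Task 1': {'drug name': 'Sotorosib', 'target name': 'KRAS'},
    'Task 2': {'drug name': 'Venetoclax', 'target name': 'None'},
    'Task 3': {'drug name': 'Paclitaxel', 'target name': 'microtubule dynamics'},
    'Task 4': {'drug name': 'Vemurafenib', 'target name': 'BRAFV600E'},
}
three_examples = {
    'Task 1': {'drug name': 'Sotorosib', 'target name': 'KRAS'},
    'Task 2': {'drug name': 'Venetoclax', 'target name': 'None'},
    'Task 3': {'drug name': 'Paclitaxel', 'target name': 'None'},
    'Task 4': {'drug name': 'Vemurafenib', 'target name': 'None'},
}
for name, output in [('1 example', one_example), ('3 examples', three_examples)]:
    accuracy, mismatches = score_targets(output, expected)
    print(name, accuracy, mismatches)
```

Four sentences are far too few to draw conclusions; the same function could be applied to a larger hand-labeled set.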

**Reduce the prompt size.**

In [215]:
MODEL = 'gpt-4-turbo-preview'

messages = [
    {"role": "system","content": """
    You are a meticulous data scientist working on named entity recognition.
    Look at instructions from the ###Instruction ### section, see examples from the ***examples sections. Produce an output similar to shown for examples for each of the sections identified as TASK.   
    Then generate a single JSON output from all tasks.
    """ },
    {
        "role": "user", "content": """
        ###Instruction: Extract drug and target names from text by following these steps:
        First, identify all drug names.If no drug name is present, completely skip the task and go to the next one.
        Check if targets are mentioned for each drug. Only extract target if drug-target relationship is clearly defined.
        ###
        """
    },
    {
        "role": "user", "content": """
        ***examples
        Example 1: We tested Vemurafenib in KRASG12C mutated cancers and showed that is very effective in these cancers.
        Answer=The text talks about the drug Vemurafenib. It was tested in cancers with a gene mutation in KRAS gene.
        However, it does not say that the drug is designed to target the KRAS gene. So, KRAS should not be labeled as a target for Vemurafenib.
        The answer is {drug name: Vemurafenib, target name: None}
        
        Example 2: Erlotinib is a kinase inihibitor, with binding activites showed for EGFR and downstream effects seen on PKA, PKC. 
        Answer=The drug name is Erlotinib. There are 3 genes mentioned here - EGFR, PKA, PKC. Erlotinib is said to bind to EGFR and hence EGFR is a target. Erlotinib has effects on PKA and PKC but that doesn't mean it directly binds and targets them, so they are not targets.
        The answer is {drug name: Erlotinib, target name: EGFR}
        
        Example 3: Metformin is a drug commonly used for people with type 2 diabetes. But in cancer, Metformin has been shown to decrease KI-67 expression.
        Answer=The drug name is Metformin. Although it can impact Ki-67 expression in cancers, the text doesn't explicitly say that Ki-67 is a direct target. So, in this text no target is mentioned for Metformin.
        The answer is {drug name: Metformin, target name: None}
        ***
        """
        
    },
    {
        "role": "user", "content":"""
        TASKS
        Task 1 - Sotorosib is a newly approved FDA drug and is designed to inhibit the KRAS gene.
        Task 2 - Venetoclax efficacy was shown to be very poor in AXL mutated cancers.
        Task 3 - Paclitaxel is a very common chemotherapy known to deregulate the microtubule dynamics.
        Task 4 - Vemurafenib was tested and showed positive results in patients with a BRAFV600E mutation.
        """
    }
]

temperature=0

response = client.chat.completions.create(
        model=MODEL,
        response_format={ "type": "json_object" },
        messages=messages,
        seed=0,
        temperature=temperature
    )

result = json.loads(response.model_dump_json())
data_dict = json.loads(result['choices'][0]['message']['content'])

In [216]:
data_dict

{'Task 1': {'drug name': 'Sotorosib', 'target name': 'KRAS'},
 'Task 2': {'drug name': 'Venetoclax', 'target name': 'None'},
 'Task 3': {'drug name': 'Paclitaxel', 'target name': 'None'},
 'Task 4': {'drug name': 'Vemurafenib', 'target name': 'None'}}

In [217]:
result['usage']

{'completion_tokens': 117, 'prompt_tokens': 619, 'total_tokens': 736}
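
The `usage` counts above can be turned into a rough cost per query. This is a minimal sketch: the helper name `estimate_cost` is hypothetical, and the example prices are the GPT-4 figures quoted in the notes at the top of this notebook, which may not match what `gpt-4-turbo-preview` is actually billed at.

```python
def estimate_cost(usage, input_price_per_1k, output_price_per_1k):
    """Return the approximate USD cost of one request from its token usage."""
    prompt_cost = usage["prompt_tokens"] / 1000 * input_price_per_1k
    completion_cost = usage["completion_tokens"] / 1000 * output_price_per_1k
    return prompt_cost + completion_cost


# Usage dict copied from the cell output above.
usage = {"completion_tokens": 117, "prompt_tokens": 619, "total_tokens": 736}

# Assumed prices (USD per 1K tokens): substitute the current rates for your model.
cost = estimate_cost(usage, input_price_per_1k=0.03, output_price_per_1k=0.06)
print(f"Estimated cost: ${cost:.5f}")
```

Keeping the prices as arguments makes it easy to compare the same prompt across models when looking for ways to reduce the cost per query.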

**Can we remove 1 of the examples?**

Not shown, but notes from testing:

* Removing Example 2 (Erlotinib/EGFR) changes the Paclitaxel output.

* Removing Example 3 (Metformin) leaves the results unchanged.

In [225]:
MODEL = 'gpt-4-turbo-preview'

messages = [
    {"role": "system","content": """
    You are a meticulous data scientist working on named entity recognition.
    Look at instructions from the ###Instruction ### section, see examples from the ***examples sections.Produce an output similar to shown for examples for each of the sections identified as TASK.   
    Then generate a single JSON output from all tasks.
    """ },
    {
        "role": "user", "content": """
        ###Instruction: Extract drug and target names from text by following these steps:
        First, identify all drug names.If no drug name is present, completely skip the task and go to the next one.
        Check if targets are mentioned for each drug. Only extract target if drug-target relationship is clearly defined.
        ###
        """
    },
    {
        "role": "user", "content": """
        ***examples
        Example 1: We tested Vemurafenib in KRASG12C mutated cancers and showed that is very effective in these cancers.
        Answer=The text talks about the drug Vemurafenib.It was tested in cancers with a gene mutation in KRAS gene.
        However, it does not say that the drug is designed to target the KRAS gene.So, KRAS should not be labeled as a target for Vemurafenib.
        The answer is {drug name: Vemurafenib, target name: None}
        Example 2: Erlotinib is a kinase inihibitor, with binding activites showed for EGFR and downstream effects seen on PKA, PKC. 
        Answer=The drug name is Erlotinib.There are 3 genes mentioned here - EGFR, PKA, PKC.Erlotinib is said to bind to EGFR and hence EGFR is a target.Erlotinib has effects on PKA and PKC but that doesn't mean it directly binds and targets them, so they are not targets.
        The answer is {drug name: Erlotinib, target name: EGFR}
        ***
        """
        
    },
    {
        "role": "user", "content":"""
        TASKS
        Task 1 - Sotorosib is a newly approved FDA drug and is designed to inhibit the KRAS gene.
        Task 2 - Venetoclax efficacy was shown to be very poor in AXL mutated cancers.
        Task 3 - Paclitaxel is a very common chemotherapy known to deregulate the microtubule dynamics.
        Task 4 - Vemurafenib was tested and showed positive results in patients with a BRAFV600E mutation.
        """
    }
]

temperature=0

response = client.chat.completions.create(
        model=MODEL,
        response_format={ "type": "json_object" },
        messages=messages,
        seed=0,
        temperature=temperature
    )

result = json.loads(response.model_dump_json())
data_dict = json.loads(result['choices'][0]['message']['content'])

In [226]:
data_dict

{'Task 1': {'drug name': 'Sotorosib', 'target name': 'KRAS'},
 'Task 2': {'drug name': 'Venetoclax', 'target name': 'None'},
 'Task 3': {'drug name': 'Paclitaxel', 'target name': 'None'},
 'Task 4': {'drug name': 'Vemurafenib', 'target name': 'None'}}

In [222]:
result['usage']

{'completion_tokens': 117, 'prompt_tokens': 612, 'total_tokens': 729}

**Let's test full abstracts. With exactly the prompt wording above, the model failed to extract cetuximab as a drug name. After experimenting with the wording of the instructions, it captured cetuximab but combined the drugs into a single list, [vemurafenib, cetuximab], and again incorrectly labeled BRAF as a target for the second text.
The prompt below captures all drugs but still gives BRAF as the target for vemurafenib.**

**IMPORTANT NOTE: When this cell was copied and run again, cetuximab was omitted again. One would not expect this at a temperature of 0 with a fixed seed, but other developers have widely reported that outputs can differ between runs even at temperature 0.**
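
One way to quantify this run-to-run variation is to repeat the same request several times and count how many distinct outputs come back. The sketch below is only an illustration: `count_distinct_outputs` is a hypothetical helper, and `fake_run` is a stand-in for the real API call so the example runs offline.

```python
import json
from collections import Counter


def count_distinct_outputs(run_once, n_runs=5):
    """Call run_once() n_runs times and count each distinct parsed output.

    run_once is any zero-argument function returning the parsed JSON (a dict).
    Outputs are serialized with sorted keys so key order does not matter.
    """
    counts = Counter()
    for _ in range(n_runs):
        canonical = json.dumps(run_once(), sort_keys=True)
        counts[canonical] += 1
    return counts


# Stand-in for a real API call (assumption): it alternates between two
# made-up outputs to mimic the variation described above.
_fake_outputs = [
    {"Task 2": {"drug name": "vemurafenib"}},
    {"Task 2": {"drug names": ["vemurafenib", "cetuximab"]}},
]
_calls = {"n": 0}


def fake_run():
    output = _fake_outputs[_calls["n"] % len(_fake_outputs)]
    _calls["n"] += 1
    return output


counts = count_distinct_outputs(fake_run, n_runs=4)
print(f"{len(counts)} distinct outputs across {sum(counts.values())} runs")
```

In the notebook, `run_once` would wrap the `client.chat.completions.create(...)` call and the `json.loads` of the message content. The OpenAI API reference also describes a `system_fingerprint` field on the response for spotting backend changes; comparing it across runs may help explain differences.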

In [241]:
MODEL = 'gpt-4-turbo-preview'

messages = [
    {"role": "system","content": """
    You are a meticulous data scientist working on named entity recognition.
    Look at instructions from the ###Instruction ### section, see examples from the ***examples sections.Produce an output similar to shown for examples for each of the sections identified as TASK.   
    Then generate a single JSON output from all tasks.
    """ },
    {
        "role": "user", "content": """
        ###Instruction: For each of the text shown under TASKS, do the following:
    1.Identify which drug names are mentioned and create a set for each drug. 
    2.Check if targets are mentioned for each drug.Only extract target if drug-target relationship is clearly defined.
    3.Return multiple sets if more than 1 drug is present in the text.

        ###
        """
    },
    {
        "role": "user", "content": """
        ***examples
        Example 1: We tested Vemurafenib in KRASG12C mutated cancers and showed that is very effective in these cancers.
        Answer=The text talks about the drug Vemurafenib.It was tested in cancers with a gene mutation in KRAS gene.
        However, it does not say that the drug is designed to target the KRAS gene.So, KRAS should not be labeled as a target for Vemurafenib.
        The answer is {drug name: Vemurafenib, target name: None}
        Example 2: Erlotinib is a kinase inihibitor, with binding activites showed for EGFR and downstream effects seen on PKA, PKC. 
        Answer=The drug name is Erlotinib.There are 3 genes mentioned here - EGFR, PKA, PKC.Erlotinib is said to bind to EGFR and hence EGFR is a target.Erlotinib has effects on PKA and PKC but that doesn't mean it directly binds and targets them, so they are not targets.
        The answer is {drug name: Erlotinib, target name: EGFR}
        ***
        """
        
    },
    {
        "role": "user", "content":"""
        TASKS
        Task 1-LKB1/STK11 is a serine/threonine kinase that plays a major role in controlling cell metabolism, resulting in potential therapeutic vulnerabilities in LKB1-mutant cancers. Here, we identify the NAD + degrading ectoenzyme, CD38, as a new target in LKB1-mutant NSCLC. Metabolic profiling of genetically engineered mouse models (GEMMs) revealed that LKB1 mutant lung cancers have a striking increase in ADP-ribose, a breakdown product of the critical redox co-factor, NAD + . Surprisingly, compared with other genetic subsets, murine and human LKB1-mutant NSCLC show marked overexpression of the NAD+-catabolizing ectoenzyme, CD38 on the surface of tumor cells. Loss of LKB1 or inactivation of Salt-Inducible Kinases (SIKs)-key downstream effectors of LKB1- induces CD38 transcription induction via a CREB binding site in the CD38 promoter. Treatment with the FDA-approved anti-CD38 antibody, daratumumab, inhibited growth of LKB1-mutant NSCLC xenografts. Together, these results reveal CD38 as a promising therapeutic target in patients with LKB1 mutant lung cancer.
        Task 2-Background: BRAF V600 mutations occur in various nonmelanoma cancers. We undertook a histology-independent phase 2 "basket" study of vemurafenib in BRAF V600 mutation-positive nonmelanoma cancers. Methods: We enrolled patients in six prespecified cancer cohorts; patients with all other tumor types were enrolled in a seventh cohort. A total of 122 patients with BRAF V600 mutation-positive cancer were treated, including 27 patients with colorectal cancer who received vemurafenib and cetuximab. The primary end point was the response rate; secondary end points included progression-free and overall survival. Results: In the cohort with non-small-cell lung cancer, the response rate was 42% (95% confidence interval [CI], 20 to 67) and median progression-free survival was 7.3 months (95% CI, 3.5 to 10.8). In the cohort with Erdheim-Chester disease or Langerhans'-cell histiocytosis, the response rate was 43% (95% CI, 18 to 71); the median treatment duration was 5.9 months (range, 0.6 to 18.6), and no patients had disease progression during therapy. There were anecdotal responses among patients with pleomorphic xanthoastrocytoma, anaplastic thyroid cancer, cholangiocarcinoma, salivary-duct cancer, ovarian cancer, and clear-cell sarcoma and among patients with colorectal cancer who received vemurafenib and cetuximab. Safety was similar to that in prior studies of vemurafenib for melanoma. Conclusions: BRAF V600 appears to be a targetable oncogene in some, but not all, nonmelanoma cancers. Preliminary vemurafenib activity was observed in non-small-cell lung cancer and in Erdheim-Chester disease and Langerhans'-cell histiocytosis. The histologic context is an important determinant of response in BRAF V600-mutated cancers. (Funded by F. Hoffmann-La Roche/Genentech; ClinicalTrials.gov number, NCT01524978.).
        """
    }
]

temperature=0

response = client.chat.completions.create(
        model=MODEL,
        response_format={ "type": "json_object" },
        messages=messages,
        seed=0,
        temperature=temperature
    )

result = json.loads(response.model_dump_json())
data_dict = json.loads(result['choices'][0]['message']['content'])

In [242]:
data_dict

{'Task 1': {'drug name': 'daratumumab', 'target name': 'CD38'},
 'Task 2': {'drug names': [{'drug name': 'vemurafenib',
    'target name': 'BRAF V600'},
   {'drug name': 'cetuximab', 'target name': None}]}}
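
The output above mixes two shapes: Task 1 is a single drug dict, while Task 2 wraps a list under `'drug names'`. Earlier outputs also return the string `'None'` rather than a real null. A minimal sketch of a normalizer that tolerates both is shown here; the function name `normalize_tasks` is hypothetical and it only handles the shapes seen so far in this notebook.

```python
def normalize_tasks(data_dict):
    """Flatten task outputs into a list of (task, drug, target) tuples."""
    records = []
    for task, value in data_dict.items():
        # A task may hold one drug dict or a wrapper such as {'drug names': [...]}.
        if isinstance(value, dict) and "drug name" in value:
            entries = [value]
        elif isinstance(value, dict):
            entries = [e for v in value.values() if isinstance(v, list) for e in v]
        else:
            entries = list(value)
        for entry in entries:
            if not isinstance(entry, dict):
                continue  # skip bare strings or other unexpected items
            target = entry.get("target name")
            # Treat the string 'None' and empty strings the same as a real null.
            if target in (None, "None", ""):
                target = None
            records.append((task, entry.get("drug name"), target))
    return records


# Output copied from the cell above.
data_dict = {
    "Task 1": {"drug name": "daratumumab", "target name": "CD38"},
    "Task 2": {"drug names": [
        {"drug name": "vemurafenib", "target name": "BRAF V600"},
        {"drug name": "cetuximab", "target name": None},
    ]},
}
records = normalize_tasks(data_dict)
for record in records:
    print(record)
```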

**Show only 2 examples instead of 3. Behavior changed again, and adding the example back didn't help!**

In [287]:
MODEL = 'gpt-4-turbo-preview'

messages = [
    {"role": "system","content": """
    You are a meticulous data scientist working on named entity recognition.
    Look at instructions from the ###Instruction ### section, see examples from the ***examples sections.Produce an output similar to shown for examples for each of the sections identified as TASK.   
    Then generate a single JSON output from all tasks.
    """ },
    {
        "role": "user", "content": """
        ###Instruction: For each of the text shown under TASKS, extract drug and target names from text by following these steps:
        1.Identify all drug names and create a set for each drug.Return multiple sets if more than 1 drug is present in the text.
        2.Check if targets are mentioned for each drug.Only extract target if drug-target relationship is clearly defined.
        3.Return multiple sets if more than 1 drug is present in the text.
        ###
        """
    },
    {
        "role": "user", "content": """
        ***examples
        Example 1: We tested Crizotinib in KRASG12C mutated cancers and showed that is very effective in these cancers.
        Answer=The text talks about the drug Crizotinib.It was tested in cancers with a gene mutation in KRAS gene.
        However, it does not say that the drug is designed to target the KRAS gene.So, KRAS should not be labeled as a target for Crizotinib.
        The answer is {drug name: crizotinib, target name: None}
        Example 2: Erlotinib is a kinase inihibitor, with binding activites showed for EGFR and downstream effects seen on PKA, PKC. 
        Answer=The drug name is Erlotinib.There are 3 genes mentioned here - EGFR, PKA, PKC.Erlotinib is said to bind to EGFR and hence EGFR is a target.Erlotinib has effects on PKA and PKC but that doesn't mean it directly binds and targets them, so they are not targets.
        The answer is {drug name: erlotinib, target name: EGFR}
        ***
        """
        
    },
    {
        "role": "user", "content":"""
        TASKS
        Task 1-LKB1/STK11 is a serine/threonine kinase that plays a major role in controlling cell metabolism, resulting in potential therapeutic vulnerabilities in LKB1-mutant cancers. Here, we identify the NAD + degrading ectoenzyme, CD38, as a new target in LKB1-mutant NSCLC. Metabolic profiling of genetically engineered mouse models (GEMMs) revealed that LKB1 mutant lung cancers have a striking increase in ADP-ribose, a breakdown product of the critical redox co-factor, NAD + . Surprisingly, compared with other genetic subsets, murine and human LKB1-mutant NSCLC show marked overexpression of the NAD+-catabolizing ectoenzyme, CD38 on the surface of tumor cells. Loss of LKB1 or inactivation of Salt-Inducible Kinases (SIKs)-key downstream effectors of LKB1- induces CD38 transcription induction via a CREB binding site in the CD38 promoter. Treatment with the FDA-approved anti-CD38 antibody, daratumumab, inhibited growth of LKB1-mutant NSCLC xenografts. Together, these results reveal CD38 as a promising therapeutic target in patients with LKB1 mutant lung cancer.
        Task 2-Background: BRAF V600 mutations occur in various nonmelanoma cancers. We undertook a histology-independent phase 2 "basket" study of vemurafenib in BRAF V600 mutation-positive nonmelanoma cancers. Methods: We enrolled patients in six prespecified cancer cohorts; patients with all other tumor types were enrolled in a seventh cohort. A total of 122 patients with BRAF V600 mutation-positive cancer were treated, including 27 patients with colorectal cancer who received vemurafenib and cetuximab. The primary end point was the response rate; secondary end points included progression-free and overall survival. Results: In the cohort with non-small-cell lung cancer, the response rate was 42% (95% confidence interval [CI], 20 to 67) and median progression-free survival was 7.3 months (95% CI, 3.5 to 10.8). In the cohort with Erdheim-Chester disease or Langerhans'-cell histiocytosis, the response rate was 43% (95% CI, 18 to 71); the median treatment duration was 5.9 months (range, 0.6 to 18.6), and no patients had disease progression during therapy. There were anecdotal responses among patients with pleomorphic xanthoastrocytoma, anaplastic thyroid cancer, cholangiocarcinoma, salivary-duct cancer, ovarian cancer, and clear-cell sarcoma and among patients with colorectal cancer who received vemurafenib and cetuximab. Safety was similar to that in prior studies of vemurafenib for melanoma. Conclusions: BRAF V600 appears to be a targetable oncogene in some, but not all, nonmelanoma cancers. Preliminary vemurafenib activity was observed in non-small-cell lung cancer and in Erdheim-Chester disease and Langerhans'-cell histiocytosis. The histologic context is an important determinant of response in BRAF V600-mutated cancers. (Funded by F. Hoffmann-La Roche/Genentech; ClinicalTrials.gov number, NCT01524978.).
        Task 3-Durafenib was tested and showed positive results in patients with a BRAFV600E mutation.
        Task 4-Venetoclax efficacy was shown to be very poor in AXL mutated cancers.
        """
    }
]

temperature=0

response = client.chat.completions.create(
        model=MODEL,
        response_format={ "type": "json_object" },
        messages=messages,
        seed=0,
        temperature=temperature
    )

result = json.loads(response.model_dump_json())
data_dict = json.loads(result['choices'][0]['message']['content'])
data_dict

{'Task 1': {'drug name': 'daratumumab', 'target name': 'CD38'},
 'Task 2': {'drug name': 'vemurafenib', 'target name': 'BRAF V600'},
 'Task 3': {'drug name': 'durafenib', 'target name': 'BRAFV600E'},
 'Task 4': {'drug name': 'venetoclax', 'target name': 'None'}}

In [266]:
result['usage']

{'completion_tokens': 124, 'prompt_tokens': 1329, 'total_tokens': 1453}

**PROBLEMS AND PROMPT SELECTED FOR THE PIPELINE:**

1. Needs to identify and return multiple drugs from each text.
2. Should give accurate output for the self-made sentences.
3. Shouldn't clump all drugs into 1 list.
4. The outputs vary quite a bit depending on whether self-made (dummy) 1-2 line sentences or 1 full actual abstract is shown as the example, so it's best to go back to the following longer prompt, which extracts multiple entities at once.

**CONCLUSION - SELECT THE FOLLOWING FORMAT AND CREATE A PIPELINE TO TEST ON 20-30 ABSTRACTS.**

In [324]:
MODEL = 'gpt-4-turbo-preview'

messages = [
    {"role": "system","content": """
    You are a meticulous data scientist working on named entity recognition and building out a drug-target dataset from biomedical text.
    """ },
    {
        "role": "user", "content": """
        Look at the examples delimited by ### and the rules delimited by ***.
        *** RULES 
        For each of the text shown under TASKS, do the following:
    1. Identify which drugs have been tested and create a set for each. 
    2. Return multiple sets if more than drug is present in the text.
    
    For each drug:
    3. Get direct target: Use the following logic: if the text clearly and directly mentions that the drug targets a gene and has also defined the type of interaction
    4. Get interaction type between the drug and direct target: if explicitly and clearly mentioned, then get the drug-direct target relationship type. What does the drug do to the direct target? inhibitor, activator, etc.?
    5. Get groups tested: specify which type of genes was the drug tested on, eg. if drug was tested on samples showing high expression of certain gene.
    6. Collect all disease names for which the drug has been tested into 1 list.
    7. Extract any specific ClinicalTrials.gov identifier or number.
    8. For each drug, construct a set consisting of (drug name, direct target, drug-direct target interaction, tested or effective group, logic for direct target, ClinicalTrials.gov number, all diseases that the drug is tested in)
    9. Any empty values should be indicated by null and not an empty string.
    10. Double check your logic and see if the data you have collected is actually reflected in the text. If not, make revisions.
    11. Both, direct target, and drug-direct target interaction fields should have been filled or both should be null. Only one of these fields cannot be null.
    11. Assemble all sets and produce 1 final JSON output.
    ***
        ### EXAMPLES 
        Example 1: LKB1/STK11 is a serine/threonine kinase that plays a major role in controlling cell metabolism, resulting in potential therapeutic vulnerabilities in LKB1-mutant cancers. Here, we identify the NAD + degrading ectoenzyme, CD38, as a new target in LKB1-mutant NSCLC. Metabolic profiling of genetically engineered mouse models (GEMMs) revealed that LKB1 mutant lung cancers have a striking increase in ADP-ribose, a breakdown product of the critical redox co-factor, NAD + . Surprisingly, compared with other genetic subsets, murine and human LKB1-mutant NSCLC show marked overexpression of the NAD+-catabolizing ectoenzyme, CD38 on the surface of tumor cells. Loss of LKB1 or inactivation of Salt-Inducible Kinases (SIKs)-key downstream effectors of LKB1- induces CD38 transcription induction via a CREB binding site in the CD38 promoter. Treatment with the FDA-approved anti-CD38 antibody, daratumumab, inhibited growth of LKB1-mutant NSCLC xenografts. Together, these results reveal CD38 as a promising therapeutic target in patients with LKB1 mutant lung cancer.
        Output: {drug name: daratumumab,
        direct target: CD38, 
        tested or effective group: LKB-1/STK-11 mutant NSCLC, 
        logic for direct target: text mentioned it is an anti-CD38 antibody
        drug-direct target interaction: anti-CD38 monoclonal antibody,
        drug tested in following diseases: lung cancer, NSCLC,
        ClinicalTrials.gov ID: null}
        ###
        TASKS:
    Use the RULES and EXAMPLES and create a similar outputs for the following text delimited by ID:
    ID1:Triple-negative breast cancers (TNBC) frequently inactivate p53, increasing their aggressiveness and therapy resistance. We identified an unexpected protein vulnerability in p53-inactivated TNBC and designed a new PROteolysis TArgeting Chimera (PROTAC) to target it. Our PROTAC selectively targets MDM2 for proteasome-mediated degradation with high-affinity binding and VHL recruitment. MDM2 loss in p53 mutant/deleted TNBC cells in two-dimensional/three-dimensional culture and TNBC patient explants, including relapsed tumors, causes apoptosis while sparing normal cells. Our MDM2-PROTAC is stable in vivo, and treatment of TNBC xenograft-bearing mice demonstrates tumor on-target efficacy with no toxicity to normal cells, significantly extending survival. Transcriptomic analyses revealed upregulation of p53 family target genes. Investigations showed activation and a required role for TAp73 to mediate MDM2-PROTAC-induced apoptosis. Our data, challenging the current MDM2/p53 paradigm, show MDM2 is required for p53-inactivated TNBC cell survival, and PROTAC-targeted MDM2 degradation is an innovative potential therapeutic strategy for TNBC and superior to existing MDM2 inhibitors. 
    
    ID2:Background: BRAF V600 mutations occur in various nonmelanoma cancers. We undertook a histology-independent phase 2 "basket" study of vemurafenib in BRAF V600 mutation-positive nonmelanoma cancers.
        Methods: We enrolled patients in six prespecified cancer cohorts; patients with all other tumor types were enrolled in a seventh cohort. A total of 122 patients with BRAF V600 mutation-positive cancer were treated, including 27 patients with colorectal cancer who received vemurafenib and cetuximab. The primary end point was the response rate; secondary end points included progression-free and overall survival.
        Results: In the cohort with non-small-cell lung cancer, the response rate was 42% (95% confidence interval [CI], 20 to 67) and median progression-free survival was 7.3 months (95% CI, 3.5 to 10.8). In the cohort with Erdheim-Chester disease or Langerhans'-cell histiocytosis, the response rate was 43% (95% CI, 18 to 71); the median treatment duration was 5.9 months (range, 0.6 to 18.6), and no patients had disease progression during therapy. There were anecdotal responses among patients with pleomorphic xanthoastrocytoma, anaplastic thyroid cancer, cholangiocarcinoma, salivary-duct cancer, ovarian cancer, and clear-cell sarcoma and among patients with colorectal cancer who received vemurafenib and cetuximab. Safety was similar to that in prior studies of vemurafenib for melanoma.
        Conclusions: BRAF V600 appears to be a targetable oncogene in some, but not all, nonmelanoma cancers. Preliminary vemurafenib activity was observed in non-small-cell lung cancer and in Erdheim-Chester disease and Langerhans'-cell histiocytosis. The histologic context is an important determinant of response in BRAF V600-mutated cancers. (Funded by F. Hoffmann-La Roche/Genentech; ClinicalTrials.gov number, NCT01524978.).

        """
    }
]

response = client.chat.completions.create(
        model=MODEL,
        response_format={ "type": "json_object" },
        messages=messages,
        seed=0,
        temperature=0
    )


In [325]:
result = json.loads(response.model_dump_json())
data_dict = json.loads(result['choices'][0]['message']['content'])
data_dict

{'ID1': {'drug name': 'MDM2-PROTAC',
  'direct target': 'MDM2',
  'tested or effective group': 'p53 mutant/deleted TNBC',
  'logic for direct target': 'text mentioned it selectively targets MDM2 for proteasome-mediated degradation with high-affinity binding and VHL recruitment',
  'drug-direct target interaction': 'proteasome-mediated degradation',
  'drug tested in following diseases': ['Triple-negative breast cancer (TNBC)'],
  'ClinicalTrials.gov ID': None},
 'ID2': {'drug name': 'vemurafenib',
  'direct target': 'BRAF V600',
  'tested or effective group': None,
  'logic for direct target': 'text mentioned BRAF V600 mutation-positive nonmelanoma cancers',
  'drug-direct target interaction': None,
  'drug tested in following diseases': ['non-small-cell lung cancer',
   'Erdheim-Chester disease',
   "Langerhans'-cell histiocytosis",
   'pleomorphic xanthoastrocytoma',
   'anaplastic thyroid cancer',
   'cholangiocarcinoma',
   'salivary-duct cancer',
   'ovarian cancer',
   'clear-cel

In [326]:
result['usage']

{'completion_tokens': 403, 'prompt_tokens': 1536, 'total_tokens': 1939}
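
In the output above, ID2 has a direct target (`'BRAF V600'`) but a null interaction, which conflicts with rule 11 of the prompt (both fields filled or both null). A small check like the following could flag such records before they enter the dataset. It is a sketch that assumes each ID maps to a single dict, as in this output; `find_rule_11_violations` is a hypothetical name.

```python
def find_rule_11_violations(data_dict):
    """Return IDs whose record has exactly one of target / interaction filled."""
    violations = []
    for text_id, record in data_dict.items():
        has_target = record.get("direct target") is not None
        has_interaction = record.get("drug-direct target interaction") is not None
        if has_target != has_interaction:
            violations.append(text_id)
    return violations


# Abridged copy of the output above (only the two fields being checked).
sample = {
    "ID1": {"drug name": "MDM2-PROTAC", "direct target": "MDM2",
            "drug-direct target interaction": "proteasome-mediated degradation"},
    "ID2": {"drug name": "vemurafenib", "direct target": "BRAF V600",
            "drug-direct target interaction": None},
}
violations = find_rule_11_violations(sample)
print(violations)
```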

**Cleaned up the above prompt.**

In [327]:
MODEL = 'gpt-4-turbo-preview'

messages = [
    {"role": "system","content": """
    You are a meticulous data scientist working on named entity recognition and building out a drug-target dataset from biomedical text.
    Only consider the text given to you.""" },
    {
        "role": "user", "content": """
        Look at the examples delimited by ### and the rules delimited by ***.
        *** RULES 
        For each of the text shown under TASKS, do the following:
    1. Identify which drugs have been tested and create a set for each. 
    2. Return multiple sets if more than drug is present in the text.
    
    For each drug:
    3. Get direct target: Use the following logic: if the text clearly and directly mentions that the drug targets a gene and has also defined the type of interaction
    4. Get interaction type between the drug and direct target: if explicitly and clearly mentioned, then get the drug-direct target relationship type. What does the drug do to the direct target? inhibitor, activator, etc.?
    5. Get groups tested: specify which type of genes was the drug tested on, eg. if drug was tested on samples showing high expression of certain gene.
    6. Collect all disease names for which the drug has been tested into 1 list.
    7. Extract any specific ClinicalTrials.gov identifier or number.
    8. For each drug, construct a set consisting of (drug name, direct target, drug-direct target interaction, tested or effective group, ClinicalTrials.gov number, all diseases that the drug is tested in)
    9. Any empty values should be indicated by null and not an empty string.
    10. Double check your logic and see if the data you have collected is actually reflected in the text. If not, make revisions.
    11. Both, direct target, and drug-direct target interaction fields should have been filled or both should be null. Only one of these fields cannot be null.
    12. Assemble all sets and produce 1 final JSON output.
    ***
        ### EXAMPLES 
        PMID23: LKB1/STK11 is a serine/threonine kinase that plays a major role in controlling cell metabolism, resulting in potential therapeutic vulnerabilities in LKB1-mutant cancers. Here, we identify the NAD + degrading ectoenzyme, CD38, as a new target in LKB1-mutant NSCLC. Metabolic profiling of genetically engineered mouse models (GEMMs) revealed that LKB1 mutant lung cancers have a striking increase in ADP-ribose, a breakdown product of the critical redox co-factor, NAD + . Surprisingly, compared with other genetic subsets, murine and human LKB1-mutant NSCLC show marked overexpression of the NAD+-catabolizing ectoenzyme, CD38 on the surface of tumor cells. Loss of LKB1 or inactivation of Salt-Inducible Kinases (SIKs)-key downstream effectors of LKB1- induces CD38 transcription induction via a CREB binding site in the CD38 promoter. Treatment with the FDA-approved anti-CD38 antibody, daratumumab, inhibited growth of LKB1-mutant NSCLC xenografts. Together, these results reveal CD38 as a promising therapeutic target in patients with LKB1 mutant lung cancer.
        Output: {"PMID23": [{"drug name": "daratumumab",
                            "target": [
                                {"direct target": "CD38",
                                  "drug-direct target interaction": "anti-CD38 monoclonal antibody"},],
                            "tested or effective group": ["LKB-1/STK-11 mutant NSCLC"],
                             "drug tested in following diseases": ["lung cancer", "NSCLC"],
                             "ClinicalTrials.gov ID": []
                             }]}
        ###
        TASKS:
    Use the RULES and EXAMPLES and create a similar outputs for the following text delimited by ID:
    PMID345:Triple-negative breast cancers (TNBC) frequently inactivate p53, increasing their aggressiveness and therapy resistance. We identified an unexpected protein vulnerability in p53-inactivated TNBC and designed a new PROteolysis TArgeting Chimera (PROTAC) to target it. Our PROTAC selectively targets MDM2 for proteasome-mediated degradation with high-affinity binding and VHL recruitment. MDM2 loss in p53 mutant/deleted TNBC cells in two-dimensional/three-dimensional culture and TNBC patient explants, including relapsed tumors, causes apoptosis while sparing normal cells. Our MDM2-PROTAC is stable in vivo, and treatment of TNBC xenograft-bearing mice demonstrates tumor on-target efficacy with no toxicity to normal cells, significantly extending survival. Transcriptomic analyses revealed upregulation of p53 family target genes. Investigations showed activation and a required role for TAp73 to mediate MDM2-PROTAC-induced apoptosis. Our data, challenging the current MDM2/p53 paradigm, show MDM2 is required for p53-inactivated TNBC cell survival, and PROTAC-targeted MDM2 degradation is an innovative potential therapeutic strategy for TNBC and superior to existing MDM2 inhibitors. 
    
    PMID567:Background: BRAF V600 mutations occur in various nonmelanoma cancers. We undertook a histology-independent phase 2 "basket" study of vemurafenib in BRAF V600 mutation-positive nonmelanoma cancers.
        Methods: We enrolled patients in six prespecified cancer cohorts; patients with all other tumor types were enrolled in a seventh cohort. A total of 122 patients with BRAF V600 mutation-positive cancer were treated, including 27 patients with colorectal cancer who received vemurafenib and cetuximab. The primary end point was the response rate; secondary end points included progression-free and overall survival.
        Results: In the cohort with non-small-cell lung cancer, the response rate was 42% (95% confidence interval [CI], 20 to 67) and median progression-free survival was 7.3 months (95% CI, 3.5 to 10.8). In the cohort with Erdheim-Chester disease or Langerhans'-cell histiocytosis, the response rate was 43% (95% CI, 18 to 71); the median treatment duration was 5.9 months (range, 0.6 to 18.6), and no patients had disease progression during therapy. There were anecdotal responses among patients with pleomorphic xanthoastrocytoma, anaplastic thyroid cancer, cholangiocarcinoma, salivary-duct cancer, ovarian cancer, and clear-cell sarcoma and among patients with colorectal cancer who received vemurafenib and cetuximab. Safety was similar to that in prior studies of vemurafenib for melanoma.
        Conclusions: BRAF V600 appears to be a targetable oncogene in some, but not all, nonmelanoma cancers. Preliminary vemurafenib activity was observed in non-small-cell lung cancer and in Erdheim-Chester disease and Langerhans'-cell histiocytosis. The histologic context is an important determinant of response in BRAF V600-mutated cancers. (Funded by F. Hoffmann-La Roche/Genentech; ClinicalTrials.gov number, NCT01524978.).

        """
    }
]

response = client.chat.completions.create(
        model=MODEL,
        response_format={ "type": "json_object" },
        messages=messages,
        seed=0,
        temperature=0
    )


In [328]:
result = json.loads(response.model_dump_json())
data_dict = json.loads(result['choices'][0]['message']['content'])
data_dict

{'PMID345': [{'drug name': 'MDM2-PROTAC',
   'target': [{'direct target': 'MDM2',
     'drug-direct target interaction': 'proteasome-mediated degradation'}],
   'tested or effective group': ['p53 mutant/deleted TNBC'],
   'drug tested in following diseases': ['TNBC'],
   'ClinicalTrials.gov ID': None}],
 'PMID567': [{'drug name': 'vemurafenib',
   'target': [{'direct target': None, 'drug-direct target interaction': None}],
   'tested or effective group': None,
   'drug tested in following diseases': ['non-small-cell lung cancer',
    'Erdheim-Chester disease',
    "Langerhans'-cell histiocytosis",
    'pleomorphic xanthoastrocytoma',
    'anaplastic thyroid cancer',
    'cholangiocarcinoma',
    'salivary-duct cancer',
    'ovarian cancer',
    'clear-cell sarcoma'],
   'ClinicalTrials.gov ID': ['NCT01524978']},
  {'drug name': 'cetuximab',
   'target': [{'direct target': None, 'drug-direct target interaction': None}],
   'tested or effective group': None,
   'drug tested in following 

In [329]:
result['usage']

{'completion_tokens': 341, 'prompt_tokens': 1543, 'total_tokens': 1884}
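
The usage counts above can be turned into a rough cost per call. This is a minimal sketch that assumes the GPT-4 rates quoted in the notes section (0.03 USD per 1K input tokens, 0.06 USD per 1K output tokens); the model behind `MODEL` may be billed differently, so the rates are left as parameters. The function name `estimate_cost` is made up for this example.

```python
def estimate_cost(usage, input_rate_per_1k=0.03, output_rate_per_1k=0.06):
    """Return an estimated cost in USD for one API call.

    Default rates are the GPT-4 prices from the notes section (assumption:
    they may not match the model actually used in this notebook).
    """
    input_cost = usage['prompt_tokens'] / 1000 * input_rate_per_1k
    output_cost = usage['completion_tokens'] / 1000 * output_rate_per_1k
    return input_cost + output_cost


# Usage dictionary copied from the cell output above.
usage = {'completion_tokens': 341, 'prompt_tokens': 1543, 'total_tokens': 1884}
cost = estimate_cost(usage)
print(f'Estimated cost: {cost:.3f} USD')
```

Keeping the rates as arguments makes it easy to compare the same call under a cheaper model's pricing.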

**PROBLEMS**

1. Needs to ID and return multiple drugs from each text. - Add example?
2. Should give accurate output for the self-made sentences.
3. Shouldn't clump all drugs into 1 list.

Things to figure out / next steps:

1. How to send multiple abstracts and how many? - for the first part, test 3 text chunks. - DONE
2. How to extract details from a JSON object? - DONE
3. Experiment with the temperature setting?
4. One shot learning
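
For step 2 (extracting details from the JSON object), one option is to flatten the nested dictionary into one row per drug-target pair, which also keeps each drug separate rather than clumped into a single list. This is a sketch that assumes the response keeps the key names shown in the output above; `flatten_extractions` and the row field names are made up for this example.

```python
def flatten_extractions(data_dict):
    """Flatten {PMID: [drug entries]} into one row per drug-target pair."""
    rows = []
    for pmid, drugs in data_dict.items():
        for drug in drugs:
            # A missing or None target list still yields one row for the drug.
            targets = drug.get('target') or [{}]
            for target in targets:
                rows.append({
                    'pmid': pmid,
                    'drug': drug.get('drug name'),
                    'direct_target': target.get('direct target'),
                    'interaction': target.get('drug-direct target interaction'),
                    'diseases': drug.get('drug tested in following diseases') or [],
                    'trial_ids': drug.get('ClinicalTrials.gov ID') or [],
                })
    return rows


# First entry copied from the output above.
example = {
    'PMID345': [{'drug name': 'MDM2-PROTAC',
                 'target': [{'direct target': 'MDM2',
                             'drug-direct target interaction': 'proteasome-mediated degradation'}],
                 'tested or effective group': ['p53 mutant/deleted TNBC'],
                 'drug tested in following diseases': ['TNBC'],
                 'ClinicalTrials.gov ID': None}],
}
rows = flatten_extractions(example)
print(rows)
```

The resulting list of dictionaries can be passed directly to `pandas.DataFrame(rows)` if a table is preferred.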

## Abstracts and some example sentences for checking the quality of responses.

**Two really good abstracts to check:**

LKB1/STK11 is a serine/threonine kinase that plays a major role in controlling cell metabolism, resulting in potential therapeutic vulnerabilities in LKB1-mutant cancers. Here, we identify the NAD + degrading ectoenzyme, CD38, as a new target in LKB1-mutant NSCLC. Metabolic profiling of genetically engineered mouse models (GEMMs) revealed that LKB1 mutant lung cancers have a striking increase in ADP-ribose, a breakdown product of the critical redox co-factor, NAD + . Surprisingly, compared with other genetic subsets, murine and human LKB1-mutant NSCLC show marked overexpression of the NAD+-catabolizing ectoenzyme, CD38 on the surface of tumor cells. Loss of LKB1 or inactivation of Salt-Inducible Kinases (SIKs)-key downstream effectors of LKB1- induces CD38 transcription induction via a CREB binding site in the CD38 promoter. Treatment with the FDA-approved anti-CD38 antibody, daratumumab, inhibited growth of LKB1-mutant NSCLC xenografts. Together, these results reveal CD38 as a promising therapeutic target in patients with LKB1 mutant lung cancer.


AND 

Background: BRAF V600 mutations occur in various nonmelanoma cancers. We undertook a histology-independent phase 2 "basket" study of vemurafenib in BRAF V600 mutation-positive nonmelanoma cancers.
Methods: We enrolled patients in six prespecified cancer cohorts; patients with all other tumor types were enrolled in a seventh cohort. A total of 122 patients with BRAF V600 mutation-positive cancer were treated, including 27 patients with colorectal cancer who received vemurafenib and cetuximab. The primary end point was the response rate; secondary end points included progression-free and overall survival.
Results: In the cohort with non-small-cell lung cancer, the response rate was 42% (95% confidence interval [CI], 20 to 67) and median progression-free survival was 7.3 months (95% CI, 3.5 to 10.8). In the cohort with Erdheim-Chester disease or Langerhans'-cell histiocytosis, the response rate was 43% (95% CI, 18 to 71); the median treatment duration was 5.9 months (range, 0.6 to 18.6), and no patients had disease progression during therapy. There were anecdotal responses among patients with pleomorphic xanthoastrocytoma, anaplastic thyroid cancer, cholangiocarcinoma, salivary-duct cancer, ovarian cancer, and clear-cell sarcoma and among patients with colorectal cancer who received vemurafenib and cetuximab. Safety was similar to that in prior studies of vemurafenib for melanoma.
Conclusions: BRAF V600 appears to be a targetable oncogene in some, but not all, nonmelanoma cancers. Preliminary vemurafenib activity was observed in non-small-cell lung cancer and in Erdheim-Chester disease and Langerhans'-cell histiocytosis. The histologic context is an important determinant of response in BRAF V600-mutated cancers. (Funded by F. Hoffmann-La Roche/Genentech; ClinicalTrials.gov number, NCT01524978.).


**Example/made-up sentences for few-shot prompting.**

Example 2: Erlotinib is a kinase inhibitor, with binding activity shown for EGFR and downstream effects seen on PKA, PKC. 
Answer=The drug name is Erlotinib. There are 3 genes mentioned here - EGFR, PKA, PKC. Erlotinib is said to bind to EGFR and hence EGFR is a target. Erlotinib has effects on PKA and PKC but that doesn't mean it directly binds and targets them, so they are not targets.
The answer is {drug name: Erlotinib, target name: EGFR}

Example 3: Metformin is a drug commonly used for people with type 2 diabetes. But in cancer, Metformin has been shown to decrease KI-67 expression.
Answer=The drug name is Metformin. Although it can impact Ki-67 expression in cancers, the text doesn't explicitly say that Ki-67 is a direct target. So, in this text no target is mentioned for Metformin.
The answer is {drug name: Metformin, target name: None}
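
One way to use these sentences is to place them in the chat history as user/assistant pairs before the real query. The sketch below only builds the `messages` list and makes no API call. The system prompt wording and the helper name `build_few_shot_messages` are assumptions for illustration, and the worked reasoning from the answers above is left out for brevity (it could be added to the assistant turns).

```python
import json

# The two made-up examples above, reduced to text and final answer.
FEW_SHOT_EXAMPLES = [
    {'text': ('Erlotinib is a kinase inhibitor, with binding activity shown '
              'for EGFR and downstream effects seen on PKA, PKC.'),
     'answer': {'drug name': 'Erlotinib', 'target name': 'EGFR'}},
    {'text': ('Metformin is a drug commonly used for people with type 2 diabetes. '
              'But in cancer, Metformin has been shown to decrease KI-67 expression.'),
     'answer': {'drug name': 'Metformin', 'target name': None}},
]


def build_few_shot_messages(system_prompt, examples, new_text):
    """Return a chat message list: system, example pairs, then the new text."""
    messages = [{'role': 'system', 'content': system_prompt}]
    for example in examples:
        messages.append({'role': 'user', 'content': example['text']})
        # Answers are serialized as JSON so the model sees the target format.
        messages.append({'role': 'assistant', 'content': json.dumps(example['answer'])})
    messages.append({'role': 'user', 'content': new_text})
    return messages


few_shot_messages = build_few_shot_messages(
    'Extract the drug name and its direct target as JSON.',  # hypothetical prompt
    FEW_SHOT_EXAMPLES,
    'Daratumumab is an anti-CD38 antibody.',
)
print(len(few_shot_messages))
```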

## Testing tiktoken to estimate token size<a id="tik"></a>

**tiktoken is a library that splits a string into a list of tokens. Several encodings are available and each model is tied to one of them; the encodings differ in how they convert text into tokens.**

Here we can see that the encoding used for the gpt-4-turbo-preview model is cl100k_base.

In [3]:
encoding = tiktoken.encoding_for_model('gpt-4-turbo-preview')

In [4]:
encoding

<Encoding 'cl100k_base'>

**Load an encoding model.**

In [5]:
encoding = tiktoken.encoding_for_model("gpt-4-turbo-preview")

**Convert string into a list of tokens.**

In [37]:
test_str = 'ZNF10.1 is a gene that encodes a zinc-finger protein.'

In [38]:
tokens = encoding.encode(test_str)

In [39]:
tokens

[57,
 39167,
 605,
 13,
 16,
 374,
 264,
 15207,
 430,
 3289,
 2601,
 264,
 49601,
 2269,
 5248,
 13128,
 13]

In [40]:
len(test_str)

53

In [41]:
len(tokens)

17

**A list of tokens can be converted to a string.**

In [42]:
encoding.decode(tokens)

'ZNF10.1 is a gene that encodes a zinc-finger protein.'

**Single tokens can be decoded like this:**

Here we can see that a space or another character can be grouped with the adjacent letters and count as 1 token.
The b prefix indicates that each decoded value is a bytes object.

In [43]:
for token in tokens:
    print(f'{token}-',encoding.decode_single_token_bytes(token))

57- b'Z'
39167- b'NF'
605- b'10'
13- b'.'
16- b'1'
374- b' is'
264- b' a'
15207- b' gene'
430- b' that'
3289- b' enc'
2601- b'odes'
264- b' a'
49601- b' zinc'
2269- b'-f'
5248- b'inger'
13128- b' protein'
13- b'.'


**Estimating the token count with tiktoken before calling the API helps to decide how many prompts to send in 1 call and to confirm that the input stays within the token limit of the model.**
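
A minimal sketch of how such an estimate could drive batching: texts are grouped greedily so that each batch stays under a token budget. The function takes any token counter; a word count stands in here so the example runs without tiktoken. In this notebook the counter would be `lambda s: len(encoding.encode(s))`. The name `batch_by_token_budget` and the budget value are assumptions, and the sketch ignores the tokens used by the instructions and message formatting.

```python
def batch_by_token_budget(texts, count_tokens, max_tokens):
    """Greedily group texts so each batch's summed token count <= max_tokens."""
    batches, current, current_size = [], [], 0
    for text in texts:
        size = count_tokens(text)
        if size > max_tokens:
            raise ValueError(f'Single text of {size} tokens exceeds the budget.')
        # Start a new batch when adding this text would go over the budget.
        if current and current_size + size > max_tokens:
            batches.append(current)
            current, current_size = [], 0
        current.append(text)
        current_size += size
    if current:
        batches.append(current)
    return batches


def word_count(text):
    """Stand-in counter (whitespace-separated words), not real tokens."""
    return len(text.split())


texts = ['a b c', 'd e', 'f g h i', 'j']
batches = batch_by_token_budget(texts, word_count, max_tokens=5)
print(batches)
```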

In [14]:
str_a = 'All written in 1 line.'
str_b = '''
    All written in 
    1 line
'''

In [16]:
len(encoding.encode(str_a))

7

In [17]:
len(encoding.encode(str_b))

11

### The following data cleaning steps would help before the text is passed into a prompt.

Let's try a few scenarios. This is a small example, but for large texts, small clean-up steps like these still reduce the number of tokens.
Even something as trivial as removing the last period decreases the token count.

In [20]:
base_str = 'This is a string.'
trail_str = ' This is a string. '
double_str = 'This is  a string.'
broken_str = '''This
is a string.'''
tab_str = '''This is a 
    string.'''
noperiod_str = 'This is a string'

In [22]:
for text in [base_str,trail_str,double_str,broken_str,tab_str,noperiod_str]:
    
    size = len(encoding.encode(text))
    print(text)
    print(f'Token size={size}')

This is a string.
Token size=5
 This is a string. 
Token size=6
This is  a string.
Token size=6
This
is a string.
Token size=6
This is a 
    string.
Token size=7
This is a string
Token size=4
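
The whitespace cases above (trailing spaces, double spaces, line breaks, indentation) can all be handled by one clean-up step. This is a small sketch using only the standard library; the helper name `normalize_whitespace` is made up here. It maps every whitespace variant back to the base string, which had the lowest token count among the variants that keep the period.

```python
import re


def normalize_whitespace(text):
    """Collapse runs of spaces, tabs and newlines to one space and trim the ends."""
    return re.sub(r'\s+', ' ', text).strip()


variants = [
    ' This is a string. ',        # leading and trailing spaces
    'This is  a string.',         # double space
    'This\nis a string.',         # line break
    'This is a \n    string.',    # line break plus indentation
]
cleaned = {normalize_whitespace(v) for v in variants}
print(cleaned)
```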


In [24]:
a = 'This is a line. This is a second line.'

b = 'This is a line.This is a second line.'

**Test an example abstract.**

In [50]:
abstract = '[BACKGROUND]AKT pathway activation is implicated in endocrine-therapy resistance. Data on the efficacy and safety of the AKT inhibitor capivasertib, as an addition to fulvestrant therapy, in patients with hormone receptor-positive advanced breast cancer are limited.[METHODS]In a phase 3, randomized, double-blind trial, we enrolled eligible pre-, peri-, and postmenopausal women and men with hormone receptor-positive, human epidermal growth factor receptor 2-negative advanced breast cancer who had had a relapse or disease progression during or after treatment with an aromatase inhibitor, with or without previous cyclin-dependent kinase 4 and 6 (CDK4/6) inhibitor therapy. Patients were randomly assigned in a 1:1 ratio to receive capivasertib plus fulvestrant or placebo plus fulvestrant. The dual primary end point was investigator-assessed progression-free survival assessed both in the overall population and among patients with AKT pathway-altered ([RESULTS]Overall, 708 patients underwent randomization; 289 patients (40.8%) had AKT pathway alterations, and 489 (69.1%) had received a CDK4/6 inhibitor previously for advanced breast cancer. In the overall population, the median progression-free survival was 7.2 months in the capivasertib-fulvestrant group, as compared with 3.6 months in the placebo-fulvestrant group (hazard ratio for progression or death, 0.60; 95% confidence interval [CI], 0.51 to 0.71; P<0.001). In the AKT pathway-altered population, the median progression-free survival was 7.3 months in the capivasertib-fulvestrant group, as compared with 3.1 months in the placebo-fulvestrant group (hazard ratio, 0.50; 95% CI, 0.38 to 0.65; P<0.001). The most frequent adverse events of grade 3 or higher in patients receiving capivasertib-fulvestrant were rash (in 12.1% of patients, vs. in 0.3% of those receiving placebo-fulvestrant) and diarrhea (in 9.3% vs. 0.3%). 
Adverse events leading to discontinuation were reported in 13.0% of the patients receiving capivasertib and in 2.3% of those receiving placebo.[CONCLUSIONS]Capivasertib-fulvestrant therapy resulted in significantly longer progression-free survival than treatment with fulvestrant alone among patients with hormone receptor-positive advanced breast cancer whose disease had progressed during or after previous aromatase inhibitor therapy with or without a CDK4/6 inhibitor. (Funded by AstraZeneca and the National Cancer Institute; CAPItello-291 ClinicalTrials.gov number, NCT04305496.).'


In [51]:
len(encoding.encode(abstract))

603

In [52]:
len(abstract)

2480

# LangChain trial

In [7]:
from langchain_openai import ChatOpenAI
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate

In [5]:
# Load the API key
load_dotenv(dotenv_path=openai_key_path)

# Access the key
openai_api_key = os.environ['OPENAI_API_KEY']

In [21]:
model = ChatOpenAI(model="gpt-4-turbo-preview",
                   openai_api_key=openai_api_key)

In [22]:
messages = [('system','You are working on named entity recognition from biomedical text. Only consider the text given to you.'),
            ('user',"""Identify drug names from the following text delimited by ID. Only list the names. 
            {task}""")]

In [23]:
prompt = ChatPromptTemplate.from_messages(messages)

In [24]:
output_parser = StrOutputParser()

In [25]:
chain = prompt | model | output_parser

In [26]:
drug_name = chain.invoke({'task':"""IDXYZ: Sorafenib has been recently used to treat many lung cancers. It is a KRAS inhibitor."""})

In [27]:
drug_name

'Sorafenib'

**Try a chained pipeline for a paper.**

In [37]:
abstract = """BRAF V600 mutations occur in various nonmelanoma cancers. We undertook a histology-independent phase 2 "basket" study of vemurafenib in BRAF V600 mutation-positive nonmelanoma cancers. Methods: We enrolled patients in six prespecified cancer cohorts; patients with all other tumor types were enrolled in a seventh cohort. A total of 122 patients with BRAF V600 mutation-positive cancer were treated, including 27 patients with colorectal cancer who received vemurafenib and cetuximab. The primary end point was the response rate; secondary end points included progression-free and overall survival. Results: In the cohort with non-small-cell lung cancer, the response rate was 42% (95% confidence interval [CI], 20 to 67) and median progression-free survival was 7.3 months (95% CI, 3.5 to 10.8). In the cohort with Erdheim-Chester disease or Langerhans'-cell histiocytosis, the response rate was 43% (95% CI, 18 to 71); the median treatment duration was 5.9 months (range, 0.6 to 18.6), and no patients had disease progression during therapy. There were anecdotal responses among patients with pleomorphic xanthoastrocytoma, anaplastic thyroid cancer, cholangiocarcinoma, salivary-duct cancer, ovarian cancer, and clear-cell sarcoma and among patients with colorectal cancer who received vemurafenib and cetuximab. Safety was similar to that in prior studies of vemurafenib for melanoma. Conclusions: BRAF V600 appears to be a targetable oncogene in some, but not all, nonmelanoma cancers. Preliminary vemurafenib activity was observed in non-small-cell lung cancer and in Erdheim-Chester disease and Langerhans'-cell histiocytosis. The histologic context is an important determinant of response in BRAF V600-mutated cancers. (Funded by F. Hoffmann-La Roche/Genentech; ClinicalTrials.gov number, NCT01524978.)."""

In [29]:
from langchain_core.output_parsers import JsonOutputParser

In [51]:
messages = [('system','You are working on named entity recognition from biomedical text. Only consider the text given to you.'),
            ('user','Look at the given text delimited by PMID and identify drug names in a JSON output. The output should be a dictionary with keys as drug names: {task}')]


In [52]:
prompt = ChatPromptTemplate.from_messages(messages)
output_parser = JsonOutputParser()

In [53]:
chain = prompt | model | output_parser

In [55]:
drugs = chain.invoke({'task':abstract})

**Try chaining outputs on a simple sentence.**

In [65]:
sentence = 'The blue cat jumped over the red fence. The fence was surrounding a white house that had black windows.'

In [116]:
questions = ['What are the colors in this text {sentence}?',
             'Take this {ans0} and for each color identify which entities are of that color (objects) based on the text {sentence}',
             'Take this {ans1} and for each color add which word precedes each color based on the text {sentence}.']


In [117]:
message_0 = [('system','You are working on entity recognition. Only consider the text and the question given to you and generate JSON.'),
            ('user',questions[0])]
message_1 = [('system','You are working on entity recognition. Only consider the text given to you and generate JSON.'),
            ('user',questions[1])]
message_2 = [('system','You are working on entity recognition. Only consider the text given to you and generate JSON.'),
            ('user',questions[2])]

In [118]:
prompt_0 = ChatPromptTemplate.from_messages(message_0)
prompt_1 = ChatPromptTemplate.from_messages(message_1)
prompt_2 = ChatPromptTemplate.from_messages(message_2)

In [119]:
chain = prompt_0 | model | output_parser
ans0 = chain.invoke({'sentence':sentence})
ans0

{'colors': ['blue', 'red', 'white', 'black']}

In [120]:
chain = prompt_1 | model | output_parser
ans1 = chain.invoke({'sentence':sentence,'ans0':ans0})
ans1

{'colors': {'blue': ['cat'],
  'red': ['fence'],
  'white': ['house'],
  'black': ['windows']}}

In [121]:
chain = prompt_2 | model | output_parser
ans2 = chain.invoke({'sentence':sentence,'ans0':ans0,'ans1':ans1})

In [122]:
ans2

{'colors': {'blue': {'preceding_word': 'the', 'examples': ['cat']},
  'red': {'preceding_word': 'the', 'examples': ['fence']},
  'white': {'preceding_word': 'a', 'examples': ['house']},
  'black': {'preceding_word': 'had', 'examples': ['windows']}}}
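
The three cells above repeat the same pattern: fill a template, call the model, store the answer under `ans0`, `ans1`, and so on for the next template. As a sketch of that pattern without LangChain, the loop below takes any callable that maps a prompt string to an answer. `run_chain` and `stub_ask` are hypothetical names; the stub only records prompts so the example runs without an API key.

```python
def run_chain(question_templates, ask, **inputs):
    """Run templates in order, exposing each answer to later ones as ans0, ans1, ..."""
    context = dict(inputs)
    answers = []
    for i, template in enumerate(question_templates):
        # str.format ignores unused keyword arguments, so extra context is fine.
        prompt = template.format(**context)
        answer = ask(prompt)
        context[f'ans{i}'] = answer
        answers.append(answer)
    return answers


seen_prompts = []


def stub_ask(prompt):
    """Stand-in for a model call: records the prompt and returns a dummy answer."""
    seen_prompts.append(prompt)
    return {'step': len(seen_prompts)}


templates = ['What are the colors in this text {sentence}?',
             'Take this {ans0} and list the objects of each color in {sentence}']
answers = run_chain(templates, stub_ask, sentence='The blue cat jumped.')
print(seen_prompts[1])
```

With the real model, `ask` could be something like `lambda p: (model | output_parser).invoke(p)`, though that is untested here.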

**Check drug overlap with DrugBank and TTD.**

In [147]:
messages = [('system','Strictly generate JSON output.'),('user','Does the drug {drug_name} exist in Drugbank online?')]

In [148]:
prompt = ChatPromptTemplate.from_messages(messages)

In [149]:
chain = prompt | model | output_parser

In [150]:
chain.invoke({'drug_name':'herceptin'})

{'exists_in_DrugBank': True,
 'DrugBank_ID': 'DB00072',
 'name': 'Trastuzumab',
 'common_brand_names': ['Herceptin']}

In [151]:
messages = [('system','Strictly generate JSON output.'),('user','Does the drug {drug_name} exist in Therapeutic Targets Database online?')]

In [152]:
prompt = ChatPromptTemplate.from_messages(messages)

In [153]:
chain = prompt | model | output_parser

In [154]:
chain.invoke({'drug_name':'herceptin'})

{'exists_in_TTD': True,
 'TTD_ID': 'DAP000633',
 'drug_name': 'Herceptin',
 'active_ingredient': 'Trastuzumab'}

**Conduct checks in DrugBank and TTD in parallel using RunnableParallel.**

In [157]:
from langchain_core.runnables import RunnableParallel

In [169]:
db_messages = [('system','Strictly generate JSON output and include database IDs.'),('user','Does the drug {drug_name} exist in Drugbank online?')]
db_prompt = ChatPromptTemplate.from_messages(db_messages)
drugbank_chain = db_prompt | model | output_parser

In [170]:
ttd_messages = [('system','Strictly generate JSON output and include database IDs.'),('user','Does the drug {drug_name} exist in Therapeutic Targets Database online?')]
ttd_prompt = ChatPromptTemplate.from_messages(ttd_messages)
ttd_chain = ttd_prompt | model | output_parser

In [171]:
map_chain = RunnableParallel(db=drugbank_chain, ttd=ttd_chain)

In [172]:
map_chain.invoke({"drug_name": "herceptin"})


{'db': {'exists_in_drugbank': True,
  'drugbank_id': 'DB00072',
  'name': 'Herceptin',
  'generic_name': 'Trastuzumab'},
 'ttd': {'exists': True,
  'database_id': None,
  'name': 'Herceptin',
  'alternative_names': ['Trastuzumab'],
  'database': 'Therapeutic Targets Database',
  'url': 'https://db.idrblab.net/ttd/',
  'note': 'The database ID for specific drugs can change and may need to be directly queried from the database for the most accurate and up-to-date reference.'}}

In [173]:
map_chain.invoke({'drug_name':'S62'})

{'db': {'exists': False,
  'message': "As of the current knowledge, there is no drug with the identifier 'S62' listed in the DrugBank database."},
 'ttd': {'exists': False,
  'error': "Unable to verify the existence of drug 'S62' in the Therapeutic Targets Database without current access."}}

In [174]:
map_chain.invoke({'drug_name':'sorafenib'})

{'db': {'exists': True,
  'drugbank_id': 'DB00398',
  'name': 'Sorafenib',
  'status': 'confirmed'},
 'ttd': {'exists': True,
  'database_ID': 'TTD Drug ID: D0R8SO',
  'name': 'Sorafenib',
  'source': 'Therapeutic Targets Database'}}
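
The key names differ from call to call in the outputs above (`exists_in_DrugBank`, `exists_in_drugbank`, `exists_in_TTD`, `exists`), so downstream code cannot rely on a fixed key. A minimal sketch of a tolerant reader follows; `exists_flag` is a made-up helper. Note also that these chains never query DrugBank or TTD, so the IDs come from the model itself (the TTD ID for Herceptin changes between runs above) and would need checking against the databases.

```python
def exists_flag(response):
    """Return the first boolean whose key starts with 'exists', or None if absent."""
    for key, value in response.items():
        if key.lower().startswith('exists') and isinstance(value, bool):
            return value
    return None


# Shortened copies of responses shown in the outputs above.
responses = [
    {'exists_in_DrugBank': True, 'DrugBank_ID': 'DB00072'},
    {'exists_in_TTD': True, 'TTD_ID': 'DAP000633'},
    {'exists': False, 'message': 'no drug with the identifier S62'},
    {'drugbank_id': 'DB00398'},  # no existence key at all
]
flags = [exists_flag(r) for r in responses]
print(flags)
```

Asking for a fixed schema in the system prompt (for example, naming the exact keys to return) would likely reduce this variation, but that is not tested in this notebook.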