# Table of contents



* [Overview](#overview)
* [Notes and links about LLMs](#notes)
* [Imports](#imports)
* [Gemini](#gemini)
* [GPT](#gpt)
    * [Set-up](#setup)
    * [Use cases - Q/A and entity extraction](#usecases)
    * [Ideas to reduce the cost per query](#cost)
        * [Model alternatives](#models)
    * [Testing some examples for drug and gene name extraction](#druggene)
* [Hugging Face](#hf)

# Overview<a id="overview"></a>

The purpose of this notebook is to try out different LLMs.

FREE:
* Llama-2 [decoder only]
* BERT [encoder only]
* Gemini [decoder only]
* T5 [encoder-decoder]

PAID:
* GPT-4 (OpenAI) [decoder only]

# Notes and links about LLMs<a id="notes"></a>

1. Gemini 1.0 [Google]
    * is free to use: https://ai.google.dev/pricing
    * Python package: https://pypi.org/project/google-generativeai/
    * How to: https://ai.google.dev/tutorials/python_quickstart
    * API key was obtained by going to Create API Key (New Project) - https://aistudio.google.com/app/apikey
    * The free tier of version 1.0 allows 60 queries per minute. Prompts and responses are used to improve Google's products.
    
    
2. OpenAI API
    * Is not free, but it doesn't require upgrading to ChatGPT Plus.
    * You first have to purchase credits (\$5 minimum) to use the API.
    * Pricing for GPT-4: https://openai.com/pricing#language-models
    
        Input: \$0.03 / 1K tokens
        
        Output: \$0.06 / 1K tokens
        
    * API ref: https://platform.openai.com/docs/api-reference
    * Python quickstart: https://platform.openai.com/docs/quickstart?context=python
    * How to / cookbook on formatting inputs. https://cookbook.openai.com/examples/how_to_format_inputs_to_chatgpt_models
    * Model types and compatibility with endpoints - https://platform.openai.com/docs/models/model-endpoint-compatibility
    * How to count tokens to get an estimate of cost: https://cookbook.openai.com/examples/how_to_format_inputs_to_chatgpt_models#4-counting-tokens
    
    
3. Llama-2
    * Free to use from Meta
    * Getting started guide - https://ai.meta.com/blog/5-steps-to-getting-started-with-llama-2/#:~:text=Llama%202%20is%20available%20for%20free%20for%20research%20and%20commercial%20use.
    * Does not have an API through Meta.
    * Try Hugging Face https://huggingface.co/docs/transformers/v4.38.1/en/autoclass_tutorial
    

# Imports<a id="imports"></a>

In [78]:
from dotenv import load_dotenv
import google.generativeai as genai
import json
from openai import OpenAI
import os
import pandas as pd
from pathlib import Path
import torch
import transformers

# Gemini<a id="gemini"></a>

**Load environment variables.**

In [14]:
load_dotenv(dotenv_path=dotenv_path)

True

**Access environment variables.**

In [17]:
api_key_gemini = os.environ['API_KEY_GEMINI']

**Pass the key to the gemini API.**

In [18]:
genai.configure(api_key=api_key_gemini)

In [26]:
genai.list_models()  

generator

**The model-listing code from the Python tutorial does not work here and raises an AttributeError.**
https://ai.google.dev/tutorials/python_quickstart
It should print the list of available models. Other users report the same problem, and it does not appear to have been resolved: https://github.com/google/generative-ai-python/issues/145

**Try some queries with Gemini-Pro**

Load the model.

In [39]:
model = genai.GenerativeModel('gemini-pro')

In [38]:
response = model.generate_content('What is BRCA1?')

In [42]:
print(response.text)

TypeError: argument of type 'Part' is not iterable

# GPT4<a id="gpt"></a>

## Set up the client.<a id="setup"></a>

**Load API key.**

In [8]:
load_dotenv(dotenv_path=openai_key_path)

True

**Access the API key.**

In [9]:
openai_api_key = os.environ['OPENAI_API_KEY']

In [10]:
client = OpenAI(api_key=openai_api_key)

**Use the chat completions endpoint.**

First try with just the two required arguments, `model` and `messages`.

The `messages` input is a list of dictionaries, where each dictionary records who the content comes from (`role`) and what it says (`content`).
In short:

* 'user' role: you, the user talking to the model.
* 'assistant' role: the model's own replies.
* 'system' role: instructions you set as the developer, such as 'Frame the answer for a non-engineer'.
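The structure described above can be sketched without calling the API. `build_messages` is a hypothetical helper, not part of the OpenAI SDK; it places the system instruction first, which is the order commonly shown in OpenAI's examples, and only assumes that each message is a dictionary with `role` and `content` keys.

```python
def build_messages(user_text, system_text=None):
    """Return a chat `messages` list: optional system instruction, then the user turn."""
    messages = []
    if system_text:
        messages.append({"role": "system", "content": system_text})
    messages.append({"role": "user", "content": user_text})
    return messages


messages = build_messages("What is BRCA1?", system_text="Explain to a lay audience.")
for message in messages:
    print(f"{message['role']}: {message['content']}")
```

The resulting list can be passed as the `messages` argument of `client.chat.completions.create`.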
    

## Try a Q/A and an entity recognition use case.<a id="usecases"></a>

**USE CASE 1 - Q/A.**

**A slight change in the system instruction can produce noticeably different results.**

In [67]:
MODEL = "gpt-4"
response = client.chat.completions.create(
    model=MODEL,
    messages=[
        {"role": "system", "content": "Provide explanation to a lay audience in JSON format."},
        {"role": "user", "content": "What is BRCA1?"}
    ]
)


In [68]:
response

ChatCompletion(id='chatcmpl-8vqqmLFCoJSAcoEQxWHNVsLdX9Mgh', choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content='{\n  "BRCA1": {\n    "Definition": "BRCA1 is a gene that produces a protein responsible for repairing damaged DNA and maintaining the cell\'s genetic stability.",\n    "Significance": "Mutations or changes in this gene can lead to the development of hereditary breast and ovarian cancer. When functioning normally, this gene helps prevent uncontrolled cell growth. However, a mutation can lead to an increased risk of developing cancer.",\n    "Testing": "Genetic tests are available to check for BRCA1 mutations. These tests are often recommended for individuals with a strong family history of breast or ovarian cancer."\n  }\n}', role='assistant', function_call=None, tool_calls=None))], created=1708798544, model='gpt-4-0613', object='chat.completion', system_fingerprint=None, usage=CompletionUsage(completion_tokens=128, prompt_token

In [65]:
MODEL = "gpt-4"
response = client.chat.completions.create(
    model=MODEL,
    messages=[
        {"role": "system", "content": "Explain to a lay audience."},
        {"role": "user", "content": "What is BRCA1?"}
    ]
)


In [66]:
response

ChatCompletion(id='chatcmpl-8vqnbzu6mE7CIFprnlx6qGbHe5waT', choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content="BRCA1 is a gene that everyone has in their cells. This gene plays an important role in repairing damaged DNA and keeping our cells' genetic material stable. When this gene works properly, it helps prevent uncontrolled cell growth that could otherwise lead to cancer.\n\nHowever, some people carry changes or mutations in the BRCA1 gene that they inherited from their parents. These changes can prevent the gene from working properly, which increases the risk of breast and ovarian cancer, and to a lesser extent, other types of cancer.\n\nTesting for these gene changes is sometimes recommended for people with a strong family history of breast or ovarian cancer. If an individual knows they carry a mutated BRCA1 gene, they can make certain decisions about preventative measures, early detection and treatment options.", role='assistant',

**USE CASE 2 - EXTRACT ENTITIES FROM GIVEN INFORMATION.**

In [8]:
MODEL = "gpt-4"
response = client.chat.completions.create(
    model=MODEL,
    messages=[
        {"role": "user", "content": """Extract the target gene name and associated drug names from a chunk of text:
        In this phase 1 study, patients with ALK-positive stage III or IV NSCLC received oral crizotinib 250 mg twice daily in 28-day cycles. 
        Endpoints included tumour responses, duration of response, time to tumour response, progression-free survival (PFS), overall survival at 6 and 12 months, and determination of the safety and tolerability and characterisation of the plasma pharmacokinetic profile of crizotinib after oral administration. 
        Responses were analysed in evaluable patients and PFS and safety were analysed in all patients. 
        This study is registered with ClinicalTrials.gov, number NCT00585195.
        """}
    ]
)


In [9]:
response

ChatCompletion(id='chatcmpl-8wXyu0JFHegsYnSMhXuBgfGUlrmVQ', choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content='Target Gene Name: ALK \n\nAssociated Drug Names: Crizotinib', role='assistant', function_call=None, tool_calls=None))], created=1708964340, model='gpt-4-0613', object='chat.completion', system_fingerprint=None, usage=CompletionUsage(completion_tokens=16, prompt_tokens=164, total_tokens=180))

In [10]:
MODEL = "gpt-4"
response = client.chat.completions.create(
    model=MODEL,
    messages=[
        {"role": "user", "content": """Extract the target gene name and associated drug names from a chunk of text:
        In this phase 1 study, patients with ALK-positive stage III or IV NSCLC received oral crizotinib 250 mg twice daily in 28-day cycles. 
        Endpoints included tumour responses, duration of response, time to tumour response, progression-free survival (PFS), overall survival at 6 and 12 months, and determination of the safety and tolerability and characterisation of the plasma pharmacokinetic profile of crizotinib after oral administration. 
        Responses were analysed in evaluable patients and PFS and safety were analysed in all patients. 
        This study is registered with ClinicalTrials.gov, number NCT00585195.
        """},
        {"role":"system","content":"Try to also extract any aliases for gene or drug names from any external links to other databases provided in the text."}
    ]
)

In [11]:
response

ChatCompletion(id='chatcmpl-8wY80Uwnsruad1FSr3QpcTPSbbd1R', choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content='The text does not provide information about any external links to other databases for gene or drug names. Please provide the text that contains these details.', role='assistant', function_call=None, tool_calls=None))], created=1708964904, model='gpt-4-0613', object='chat.completion', system_fingerprint=None, usage=CompletionUsage(completion_tokens=28, prompt_tokens=191, total_tokens=219))

In [12]:
MODEL = "gpt-4"
response = client.chat.completions.create(
    model=MODEL,
    messages=[
        {"role": "user", "content": """Extract the target gene name and associated drug names from a chunk of text:
        In this phase 1 study, patients with ALK-positive stage III or IV NSCLC received oral crizotinib 250 mg twice daily in 28-day cycles. 
        Endpoints included tumour responses, duration of response, time to tumour response, progression-free survival (PFS), overall survival at 6 and 12 months, and determination of the safety and tolerability and characterisation of the plasma pharmacokinetic profile of crizotinib after oral administration. 
        Responses were analysed in evaluable patients and PFS and safety were analysed in all patients. 
        This study is registered with ClinicalTrials.gov, number NCT00585195.
        """},
        {"role":"system","content":"Extract details if any of the following database ids are provided: Clinical trials, PubChem, Entrez or similar."}
    ]
)

In [13]:
response

ChatCompletion(id='chatcmpl-8wYA7dq8MRbcDKGgaK2qHi7d1RNPE', choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content='The text does not provide database ids for Clinical trials, PubChem, Entrez or similar.', role='assistant', function_call=None, tool_calls=None))], created=1708965035, model='gpt-4-0613', object='chat.completion', system_fingerprint=None, usage=CompletionUsage(completion_tokens=19, prompt_tokens=191, total_tokens=210))

In [14]:
MODEL = "gpt-4"
response = client.chat.completions.create(
    model=MODEL,
    messages=[
        {"role": "user", "content": """Extract the target gene name and associated drug names from a chunk of text:
        In this phase 1 study, patients with ALK-positive stage III or IV NSCLC received oral crizotinib 250 mg twice daily in 28-day cycles. 
        Endpoints included tumour responses, duration of response, time to tumour response, progression-free survival (PFS), overall survival at 6 and 12 months, and determination of the safety and tolerability and characterisation of the plasma pharmacokinetic profile of crizotinib after oral administration. 
        Responses were analysed in evaluable patients and PFS and safety were analysed in all patients. 
        This study is registered with ClinicalTrials.gov, number NCT00585195.
        """},
        {"role":"system","content":"If ClinicalTrials.gov id or a number starting with NCT is given, extract the title of the study."}
    ]
)

In [15]:
response

ChatCompletion(id='chatcmpl-8wYBQttEtHrUrQgz3gUNs8c6PE0EG', choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content='The target gene name is "ALK". The associated drug name is "crizotinib".', role='assistant', function_call=None, tool_calls=None))], created=1708965116, model='gpt-4-0613', object='chat.completion', system_fingerprint=None, usage=CompletionUsage(completion_tokens=20, prompt_tokens=191, total_tokens=211))

**Splitting the instructions between the user and the system messages does not seem to work well. In these runs the system instructions appear to override the user instructions.**

In [16]:
MODEL = "gpt-4"
response = client.chat.completions.create(
    model=MODEL,
    messages=[
        {"role": "user", "content": """Extract the target gene name and associated drug names from a chunk of text:
        In this phase 1 study, patients with ALK-positive stage III or IV NSCLC received oral crizotinib 250 mg twice daily in 28-day cycles. 
        Endpoints included tumour responses, duration of response, time to tumour response, progression-free survival (PFS), overall survival at 6 and 12 months, and determination of the safety and tolerability and characterisation of the plasma pharmacokinetic profile of crizotinib after oral administration. 
        Responses were analysed in evaluable patients and PFS and safety were analysed in all patients. 
        This study is registered with ClinicalTrials.gov, number NCT00585195.
        """},
        {"role":"system","content":"Extract ClinicalTrials.gov number."}
    ]
)

In [17]:
response

ChatCompletion(id='chatcmpl-8wYCITXZRp7j1KllxerKfpTb9pOUx', choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content='NCT00585195', role='assistant', function_call=None, tool_calls=None))], created=1708965170, model='gpt-4-0613', object='chat.completion', system_fingerprint=None, usage=CompletionUsage(completion_tokens=5, prompt_tokens=175, total_tokens=180))

**Put all instructions in the system message.**
This seems to work better.

In [20]:
MODEL = "gpt-4"
response = client.chat.completions.create(
    model=MODEL,
    messages=[
        {"role": "user", "content": """
        In this phase 1 study, patients with ALK-positive stage III or IV NSCLC received oral crizotinib 250 mg twice daily in 28-day cycles. 
        Endpoints included tumour responses, duration of response, time to tumour response, progression-free survival (PFS), overall survival at 6 and 12 months, and determination of the safety and tolerability and characterisation of the plasma pharmacokinetic profile of crizotinib after oral administration. 
        Responses were analysed in evaluable patients and PFS and safety were analysed in all patients. 
        This study is registered with ClinicalTrials.gov, number NCT00585195.
        """},
        {"role":"system","content":"Extract the target gene name, associated drug names, and ClinicalTrials.gov number."}
    ]
)

In [19]:
response

ChatCompletion(id='chatcmpl-8wYEbRAu9nGXArcK5FlJXNi4o8W4O', choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content='Target Gene Name: ALK\nAssociated Drug Names: Crizotinib\nClinicalTrials.gov number: NCT00585195', role='assistant', function_call=None, tool_calls=None))], created=1708965313, model='gpt-4-0613', object='chat.completion', system_fingerprint=None, usage=CompletionUsage(completion_tokens=28, prompt_tokens=171, total_tokens=199))

## Ideas to reduce the cost per query.<a id="cost"></a>

1. Calculate the length of the different text queries and estimate how many tokens each would use. Even a rough estimate indicates whether it is feasible to use GPT-3.5 and reduce the cost. The context window must hold the whole input text along with the role and content key-value pairs.

2. If step 1 confirms that the longest text query in the dataset fits in the context window of both GPT-4 and GPT-3.5, test whether GPT-3.5 gives the same quality of response as GPT-4.

3. Clean the text from the dataset before it goes into a prompt: remove extra spaces, trailing spaces and the last period. This small step can reduce the total number of tokens.
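The cleaning step in item 3 might look like the sketch below. `clean_text` is a hypothetical helper that uses only the standard library. It also collapses line breaks into single spaces, which is an assumption about what is acceptable for these abstracts, and it reports the change in characters rather than tokens, since an exact token count needs a tokenizer.

```python
import re


def clean_text(text):
    """Collapse whitespace runs, trim both ends, and drop one trailing period."""
    cleaned = re.sub(r"\s+", " ", text).strip()
    if cleaned.endswith("."):
        cleaned = cleaned[:-1]
    return cleaned


raw = """
    This study is registered with   ClinicalTrials.gov,
    number NCT00585195.
"""
cleaned = clean_text(raw)
print(repr(cleaned))
print(f"characters: {len(raw)} -> {len(cleaned)}")
```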

### Model alternatives - shortlist acceptable models based on context window size, cost, and use case.<a id="models"></a>

1. gpt-3.5-turbo-0125
    * Context window: 16,385 tokens
    * Training data only up to Sep 2021
    * Input - \$0.0005 / 1K tokens
    * Output - \$0.0015 / 1K tokens
    * You can set response_format to { "type": "json_object" } to enable JSON mode.


2. gpt-4
    * Currently points to gpt-4-0613.
    * Context window: 8,192 tokens
    * Training data up to Sep 2021
    * Input - \$0.03 / 1K tokens
    * Output - \$0.06 / 1K tokens
    

3. gpt-4-turbo-preview
    * New
    * Context window: 128,000 tokens
    * Training data up to Dec 2023
    * Input - \$0.01 / 1K tokens
    * Output - \$0.03 / 1K tokens
    * You can set response_format to { "type": "json_object" } to enable JSON mode.
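To compare the shortlisted models on cost, the per-1K-token prices listed above can be turned into a per-request estimate. This is a minimal sketch: `estimate_cost` and `PRICES_PER_1K` are hypothetical names, the prices are copied from this list and may be out of date, and the token counts (171 prompt, 28 completion) come from a `usage` field of a gpt-4 response earlier in this notebook. Other models may tokenize or answer slightly differently, so the figures are indicative only.

```python
# Prices in USD per 1K tokens, copied from the list above (they may be out of date).
PRICES_PER_1K = {
    "gpt-3.5-turbo-0125": {"input": 0.0005, "output": 0.0015},
    "gpt-4": {"input": 0.03, "output": 0.06},
    "gpt-4-turbo-preview": {"input": 0.01, "output": 0.03},
}


def estimate_cost(model, prompt_tokens, completion_tokens):
    """Estimate the USD cost of one request from its token counts."""
    price = PRICES_PER_1K[model]
    return (prompt_tokens * price["input"] + completion_tokens * price["output"]) / 1000


# Token counts taken from the `usage` field of an earlier gpt-4 response.
for model in PRICES_PER_1K:
    cost = estimate_cost(model, prompt_tokens=171, completion_tokens=28)
    print(f"{model}: ${cost:.6f}")
```

Multiplying the per-request estimate by the number of abstracts in the dataset gives a rough budget for each model.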

#### Test if these 3 models give the same response quality.

This is the model we've used so far: 

In [21]:
MODEL = "gpt-4"
response = client.chat.completions.create(
    model=MODEL,
    messages=[
        {"role": "user", "content": """
        In this phase 1 study, patients with ALK-positive stage III or IV NSCLC received oral crizotinib 250 mg twice daily in 28-day cycles. 
        Endpoints included tumour responses, duration of response, time to tumour response, progression-free survival (PFS), overall survival at 6 and 12 months, and determination of the safety and tolerability and characterisation of the plasma pharmacokinetic profile of crizotinib after oral administration. 
        Responses were analysed in evaluable patients and PFS and safety were analysed in all patients. 
        This study is registered with ClinicalTrials.gov, number NCT00585195.
        """},
        {"role":"system","content":"Extract the target gene name, associated drug names, and ClinicalTrials.gov number."}
    ]
)

In [22]:
response

ChatCompletion(id='chatcmpl-8wZ4ANHsg2yXIrK1GeXF2aRXcDpPA', choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content='Target gene name: ALK\nAssociated drug names: Crizotinib\nClinicalTrials.gov number: NCT00585195', role='assistant', function_call=None, tool_calls=None))], created=1708968510, model='gpt-4-0613', object='chat.completion', system_fingerprint=None, usage=CompletionUsage(completion_tokens=28, prompt_tokens=171, total_tokens=199))

Test the other 2 models: gpt-4-turbo-preview, which is more expensive but has the most recent training data, and gpt-3.5-turbo-0125, which is the least expensive but has older training data.

**Conclusion: gpt-4-turbo-preview is more accurate than the cheaper gpt-3.5-turbo-0125, which returned "ALK-positive" instead of "ALK" as the gene. gpt-4-turbo-preview extracts the same values as gpt-4 and is less expensive than gpt-4.**

In [26]:
models_to_test = ['gpt-4-turbo-preview', 'gpt-3.5-turbo-0125']
# models_to_test = ['gpt-3.5-turbo-0125']

# Here we can add an additional argument for JSON format.

for MODEL in models_to_test:
    
    response = client.chat.completions.create(
        model=MODEL,
        response_format={ "type": "json_object" },
        messages=[
            {"role": "user", "content": """
            In this phase 1 study, patients with ALK-positive stage III or IV NSCLC received oral crizotinib 250 mg twice daily in 28-day cycles. 
            Endpoints included tumour responses, duration of response, time to tumour response, progression-free survival (PFS), overall survival at 6 and 12 months, and determination of the safety and tolerability and characterisation of the plasma pharmacokinetic profile of crizotinib after oral administration. 
            Responses were analysed in evaluable patients and PFS and safety were analysed in all patients. 
            This study is registered with ClinicalTrials.gov, number NCT00585195.
            """},
            {"role":"system","content":"Extract the target gene name, associated drug names, and ClinicalTrials.gov number and give JSON."}
        ]
    )
    
    print(response)

ChatCompletion(id='chatcmpl-8wZ9kPCdHjS0SIdLUSv9ZQpFK4O23', choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content='{\n  "target_gene": "ALK",\n  "associated_drug_names": ["crizotinib"],\n  "ClinicalTrials.gov_number": "NCT00585195"\n}', role='assistant', function_call=None, tool_calls=None))], created=1708968856, model='gpt-4-0125-preview', object='chat.completion', system_fingerprint='fp_91aa3742b1', usage=CompletionUsage(completion_tokens=39, prompt_tokens=174, total_tokens=213))
ChatCompletion(id='chatcmpl-8wZ9mw6TGouFg8MuIl1D1P5P5Xz1t', choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content='{\n  "target_gene": "ALK-positive",\n  "drug_name": "crizotinib",\n  "ClinicalTrials.gov_number": "NCT00585195"\n}', role='assistant', function_call=None, tool_calls=None))], created=1708968858, model='gpt-3.5-turbo-0125', object='chat.completion', system_fingerprint='fp_86156a94a0', usage=CompletionUsage
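Because JSON mode returns the JSON as a string, the two answers are easier to compare once parsed. The sketch below copies the `content` strings from the two responses printed above and parses them with `json.loads`; in a live session the string would come from `response.choices[0].message.content`. The variable names are illustrative only.

```python
import json

# `content` strings copied from the two responses printed above.
content_by_model = {
    "gpt-4-turbo-preview": '{\n  "target_gene": "ALK",\n  "associated_drug_names": ["crizotinib"],\n  "ClinicalTrials.gov_number": "NCT00585195"\n}',
    "gpt-3.5-turbo-0125": '{\n  "target_gene": "ALK-positive",\n  "drug_name": "crizotinib",\n  "ClinicalTrials.gov_number": "NCT00585195"\n}',
}

parsed = {model: json.loads(content) for model, content in content_by_model.items()}
for model, record in parsed.items():
    print(model, "->", record)

# Shared keys whose values differ, plus keys that only one model used.
gpt4, gpt35 = parsed["gpt-4-turbo-preview"], parsed["gpt-3.5-turbo-0125"]
differing = sorted(key for key in gpt4.keys() & gpt35.keys() if gpt4[key] != gpt35[key])
unshared = sorted(gpt4.keys() ^ gpt35.keys())
print("differing values:", differing)
print("keys used by only one model:", unshared)
```

The comparison shows two kinds of disagreement: the gene value itself, and the key name chosen for the drug field.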

## Testing some more use cases for entity extraction - gene name and drug names<a id="druggene"></a>

First, create a messages variable to swap out different abstract examples and system level instructions.

In [45]:
MODEL = 'gpt-4-turbo-preview'

messages = [{"role": "user", "content": """"""},
            {'role': 'system',"content": """"""}]

In [53]:
user_content = ["""
In this phase 1 study, patients with ALK-positive stage III or IV NSCLC received oral crizotinib 250 mg twice daily in 28-day cycles.
Endpoints included tumour responses, duration of response, time to tumour response, progression-free survival (PFS), overall survival at 6 and 12 months, and determination of the safety and tolerability and characterisation of the plasma pharmacokinetic profile of crizotinib after oral administration. 
Responses were analysed in evaluable patients and PFS and safety were analysed in all patients. 
This study is registered with ClinicalTrials.gov, number NCT00585195.
""",
"""
Conventional chemotherapeutic drugs such as doxorubicin (DOX) are associated with severe adverse effects such as cardiac, hepatic, and gastrointestinal (GI) toxicities. Excessive production of reactive oxygen species (ROS) was reported to be one of the main mechanisms underlying these severe adverse effects. Recently, we have developed 2 types of novel redox nanoparticles (RNPs) including pH-sensitive redox nanoparticle (RNP(N)) and pH-insensitive redox nanoparticle (RNP(O)), which effectively scavenge overproduced ROS in inflamed and cancerous tissues. In this study, we investigated the effects of these RNPs on DOX-induced adverse effects during cancer chemotherapy. The DOX-induced body weight loss was significantly attenuated in the mice treated with RNPs, particularly pH-insensitive RNP(O). We also found that cardiac ROS levels in the DOX-treated mice were dramatically decreased by treatment with RNPs, resulting in the reversal of cardiac damage, as confirmed by both plasma cardiac biomarkers and histological analysis. It was interesting to notice that, during cotreatment with DOX and RNPs, the DOX uptake was significantly enhanced in the cancer cells, but not in healthy aortic endothelial cells in vitro. Treatment with RNPs also improved anticancer efficacy of DOX in the colitis-associated colon cancer model mice in vivo. On the basis of these results, a combination of the novel antioxidative nanotherapeutics (RNPs) with conventional anticancer drugs seems to be a robust strategy for well-tolerated anticancer therapy.
""",
"""
Background: Sotorasib showed anticancer activity in patients with KRAS p.G12C-mutated advanced solid tumors in a phase 1 study, and particularly promising anticancer activity was observed in a subgroup of patients with non-small-cell lung cancer (NSCLC).
Methods: In a single-group, phase 2 trial, we investigated the activity of sotorasib, administered orally at a dose of 960 mg once daily, in patients with KRAS p.G12C-mutated advanced NSCLC previously treated with standard therapies. The primary end point was objective response (complete or partial response) according to independent central review. Key secondary end points included duration of response, disease control (defined as complete response, partial response, or stable disease), progression-free survival, overall survival, and safety. Exploratory biomarkers were evaluated for their association with response to sotorasib therapy.
Results: Among the 126 enrolled patients, the majority (81.0%) had previously received both platinum-based chemotherapy and inhibitors of programmed death 1 (PD-1) or programmed death ligand 1 (PD-L1). According to central review, 124 patients had measurable disease at baseline and were evaluated for response. An objective response was observed in 46 patients (37.1%; 95% confidence interval [CI], 28.6 to 46.2), including in 4 (3.2%) who had a complete response and in 42 (33.9%) who had a partial response. The median duration of response was 11.1 months (95% CI, 6.9 to could not be evaluated). Disease control occurred in 100 patients (80.6%; 95% CI, 72.6 to 87.2). The median progression-free survival was 6.8 months (95% CI, 5.1 to 8.2), and the median overall survival was 12.5 months (95% CI, 10.0 to could not be evaluated). Treatment-related adverse events occurred in 88 of 126 patients (69.8%), including grade 3 events in 25 patients (19.8%) and a grade 4 event in 1 (0.8%). Responses were observed in subgroups defined according to PD-L1 expression, tumor mutational burden, and co-occurring mutations in STK11, KEAP1, or TP53.
""",

"""
Background: KRAS G12C is a mutation that occurs in approximately 3 to 4% of patients with metastatic colorectal cancer. Monotherapy with KRAS G12C inhibitors has yielded only modest efficacy. Combining the KRAS G12C inhibitor sotorasib with panitumumab, an epidermal growth factor receptor (EGFR) inhibitor, may be an effective strategy.
Methods: In this phase 3, multicenter, open-label, randomized trial, we assigned patients with chemorefractory metastatic colorectal cancer with mutated KRAS G12C who had not received previous treatment with a KRAS G12C inhibitor to receive sotorasib at a dose of 960 mg once daily plus panitumumab (53 patients), sotorasib at a dose of 240 mg once daily plus panitumumab (53 patients), or the investigator's choice of trifluridine-tipiracil or regorafenib (standard care; 54 patients). The primary end point was progression-free survival as assessed by blinded independent central review according to the Response Evaluation Criteria in Solid Tumors, version 1.1. Key secondary end points were overall survival and objective response.
Results: After a median follow-up of 7.8 months (range, 0.1 to 13.9), the median progression-free survival was 5.6 months (95% confidence interval [CI], 4.2 to 6.3) and 3.9 months (95% CI, 3.7 to 5.8) in the 960-mg sotorasib-panitumumab and 240-mg sotorasib-panitumumab groups, respectively, as compared with 2.2 months (95% CI, 1.9 to 3.9) in the standard-care group. The hazard ratio for disease progression or death in the 960-mg sotorasib-panitumumab group as compared with the standard-care group was 0.49 (95% CI, 0.30 to 0.80; P = 0.006), and the hazard ratio in the 240-mg sotorasib-panitumumab group was 0.58 (95% CI, 0.36 to 0.93; P = 0.03). Overall survival data are maturing. The objective response was 26.4% (95% CI, 15.3 to 40.3), 5.7% (95% CI, 1.2 to 15.7), and 0% (95% CI, 0.0 to 6.6) in the 960-mg sotorasib-panitumumab, 240-mg sotorasib-panitumumab, and standard-care groups, respectively. Treatment-related adverse events of grade 3 or higher occurred in 35.8%, 30.2%, and 43.1% of patients, respectively. Skin-related toxic effects and hypomagnesemia were the most common adverse events observed with sotorasib-panitumumab.
Conclusions: In this phase 3 trial of a KRAS G12C inhibitor plus an EGFR inhibitor in patients with chemorefractory metastatic colorectal cancer, both doses of sotorasib in combination with panitumumab resulted in longer progression-free survival than standard treatment. Toxic effects were as expected for either agent alone and resulted in few discontinuations of treatment. (Funded by Amgen; CodeBreaK 300 ClinicalTrials.gov number, NCT05198934.).
"""
]

In [54]:
system_content = ["""
Extract the target gene name, associated drug names, and ClinicalTrials.gov number and give JSON.
"""]

In [55]:
for user_text in user_content:
    messages[0]['content'] = user_text
    messages[1]['content'] = system_content[0]
    
    response = client.chat.completions.create(
        model=MODEL,
        response_format={ "type": "json_object" },
        messages=messages
    )
    print(f'''----RESULT---- {response}''')

----RESULT---- ChatCompletion(id='chatcmpl-8wadqdx5J8ZLVT8EM27zqROfaHvIL', choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content='{\n  "target_gene": "ALK",\n  "associated_drug_names": ["crizotinib"],\n  "ClinicalTrials.gov_number": "NCT00585195"\n}', role='assistant', function_call=None, tool_calls=None))], created=1708974566, model='gpt-4-0125-preview', object='chat.completion', system_fingerprint='fp_89b1a570e1', usage=CompletionUsage(completion_tokens=39, prompt_tokens=168, total_tokens=207))
----RESULT---- ChatCompletion(id='chatcmpl-8wadtC7gTMDxkBKgAlQM6UJEcd6Ui', choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content='\n{\n  "target_gene_name": null,\n  "associated_drug_names": ["doxorubicin (DOX)"],\n  "ClinicalTrials.gov_number": null\n}', role='assistant', function_call=None, tool_calls=None))], created=1708974569, model='gpt-4-0125-preview', object='chat.completion', system_fingerprint

## Test use cases to see if any drug-target interaction type can also be extracted from the abstract

In [3]:
MODEL = 'gpt-4-turbo-preview'

messages = [{"role": "user", "content": """"""},
            {'role': 'system',"content": """"""}]

In [4]:
system_content = ["""
Extract the target gene name, associated drug names, type of drug mechanism or interaction, and ClinicalTrials.gov number as JSON
"""]

In [5]:
user_content = [
"""
Background: Sotorasib is a specific, irreversible inhibitor of the GTPase protein, KRASG12C. 
We compared the efficacy and safety of sotorasib with a standard-of-care treatment in patients with non-small-cell lung cancer (NSCLC) with the KRASG12C mutation who had been previously treated with other anticancer drugs.
"""
    
]

**This example shows that the model can also extract the type of drug-target interaction. 
Test another example where it's worded differently.**

In [11]:
for user_text in user_content:
    messages[0]['content'] = user_text
    messages[1]['content'] = system_content[0]
    
    response = client.chat.completions.create(
        model=MODEL,
        response_format={ "type": "json_object" },
        messages=messages
    )
    print(f'''----RESULT---- {response}''')

----RESULT---- ChatCompletion(id='chatcmpl-8wuvDO53HHEH4kvYtUjwzdqlIRkEj', choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content='{\n  "target_gene_name": "KRASG12C",\n  "associated_drug_names": ["Sotorasib"],\n  "type_of_drug_mechanism": "Specific, irreversible inhibitor of the GTPase protein",\n  "ClinicalTrials.gov_number": null\n}', role='assistant', function_call=None, tool_calls=None))], created=1709052523, model='gpt-4-0125-preview', object='chat.completion', system_fingerprint='fp_91aa3742b1', usage=CompletionUsage(completion_tokens=60, prompt_tokens=114, total_tokens=174))


**Here too, the model extracted the correct drug and gene names.**

In [12]:
user_content = [
"""
Lung adenocarcinoma (LUAD) is the most common lung cancer, with high mortality. 
As a tumor-suppressor gene, JWA plays an important role in blocking pan-tumor progression. 
JAC4, a small molecular-compound agonist, transcriptionally activates JWA expression both in vivo and in vitro.
"""
    
]

In [13]:
for user_text in user_content:
    messages[0]['content'] = user_text
    messages[1]['content'] = system_content[0]
    
    response = client.chat.completions.create(
        model=MODEL,
        response_format={ "type": "json_object" },
        messages=messages
    )
    print(f'''----RESULT---- {response}''')

----RESULT---- ChatCompletion(id='chatcmpl-8wuzc0WQUb3FOcxnCIhJT6e3H2cWR', choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content='{\n  "target_gene": "JWA",\n  "associated_drug_names": [\n    "JAC4"\n  ],\n  "type_of_drug_mechanism_or_interaction": "agonist",\n  "ClinicalTrials.gov_number": null\n}', role='assistant', function_call=None, tool_calls=None))], created=1709052796, model='gpt-4-0125-preview', object='chat.completion', system_fingerprint='fp_91aa3742b1', usage=CompletionUsage(completion_tokens=53, prompt_tokens=108, total_tokens=161))
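
The two responses above name the same fields differently ("target_gene_name" vs. "target_gene", "type_of_drug_mechanism" vs. "type_of_drug_mechanism_or_interaction"), which would complicate downstream parsing. A minimal sketch of one way to handle this, assuming a hand-maintained alias table. `CANONICAL_KEYS`, `normalize_record`, and the canonical names are hypothetical and not part of the notebook:

```python
import json

# Hypothetical alias table. The names on the left were observed in the two
# responses above; the canonical names on the right are an arbitrary choice.
CANONICAL_KEYS = {
    "target_gene": "target_gene_name",
    "target_gene_name": "target_gene_name",
    "type_of_drug_mechanism": "interaction_type",
    "type_of_drug_mechanism_or_interaction": "interaction_type",
    "ClinicalTrials.gov_number": "clinical_trial_number",
}


def normalize_record(record):
    """Rename known key variants to one canonical name; keep other keys unchanged."""
    return {CANONICAL_KEYS.get(key, key): value for key, value in record.items()}


# Message contents copied from the two responses above.
sotorasib = json.loads(
    '{"target_gene_name": "KRASG12C", "associated_drug_names": ["Sotorasib"], '
    '"type_of_drug_mechanism": "Specific, irreversible inhibitor of the GTPase protein", '
    '"ClinicalTrials.gov_number": null}'
)
jac4 = json.loads(
    '{"target_gene": "JWA", "associated_drug_names": ["JAC4"], '
    '"type_of_drug_mechanism_or_interaction": "agonist", '
    '"ClinicalTrials.gov_number": null}'
)

normalized = [normalize_record(sotorasib), normalize_record(jac4)]
for row in normalized:
    print(row)
```

Spelling out the exact key names in the system instruction may be a simpler fix; the alias table is only a fallback for responses that have already been collected.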


## Modify the prompt to return a separate set of {gene, drug, interaction, clinical trial} for each gene-drug combination

First, test the usual prompt.
Here the output is incomplete: it reports only AXL as the target gene and lists both drugs under it, so the second gene, BRAF, is not paired with its drug, vemurafenib.

In [14]:
MODEL = 'gpt-4-turbo-preview'

messages = [{"role": "user", "content": """"""},
            {'role': 'system',"content": """"""}]

system_content = ["""
Extract the target gene name, associated drug names, type of drug mechanism or interaction, and ClinicalTrials.gov number as JSON
"""]

user_content = [
"""
More than half of metastatic melanoma patients receiving standard therapy fail to achieve a long-term survival due to primary and/or acquired resistance. 
Tumor cell ability to switch from epithelial to a more aggressive mesenchymal phenotype, attributed with AXLhigh molecular profile in melanoma, has been recently linked to such event, limiting treatment efficacy. 
In the current study, we investigated the therapeutic potential of the AXL inhibitor (AXLi) BGB324 alone or in combination with the clinically relevant BRAF inhibitor (BRAFi) vemurafenib. 
Firstly, AXL was shown to be expressed in majority of melanoma lymph node metastases.
When treated ex vivo, the largest reduction in cell viability was observed when the two drugs were combined. 
In addition, a therapeutic benefit of adding AXLi to the BRAF-targeted therapy was observed in pre-clinical AXLhigh melanoma models in vitro and in vivo. When searching for mechanistic insights, AXLi was found to potentiate BRAFi-induced apoptosis, stimulate ferroptosis and inhibit autophagy. Altogether, our findings propose AXLi as a promising treatment in combination with standard therapy to improve therapeutic outcome in metastatic melanoma.
"""
    
]

for user_text in user_content:
    messages[0]['content'] = user_text
    messages[1]['content'] = system_content[0]
    
    response = client.chat.completions.create(
        model=MODEL,
        response_format={ "type": "json_object" },
        messages=messages
    )
    print(f'''----RESULT---- {response}''')

----RESULT---- ChatCompletion(id='chatcmpl-8wvEkFhdoII94FjUa1DL3rlxNpNO5', choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content='{\n  "target_gene_name": "AXL",\n  "associated_drug_names": [\n    "BGB324",\n    "vemurafenib"\n  ],\n  "type_of_drug_mechanism_or_interaction": [\n    "AXLi (AXL inhibitor) potentiates BRAFi (BRAF inhibitor)-induced apoptosis",\n    "stimulates ferroptosis",\n    "inhibits autophagy"\n  ],\n  "ClinicalTrials.gov_number": null\n}', role='assistant', function_call=None, tool_calls=None))], created=1709053734, model='gpt-4-0125-preview', object='chat.completion', system_fingerprint='fp_91aa3742b1', usage=CompletionUsage(completion_tokens=103, prompt_tokens=291, total_tokens=394))


**Try changing the prompt instruction to get the intended output.
Changing the instruction and asking the model to distinguish between the different gene-drug combinations works.
A one-shot approach might make the output cleaner and more concise; this should be the next test.**

In [17]:
system_content = ["""
Return a JSON output with a set consisting of (target gene name,drug name, drug-target interaction, ClinicalTrials.gov number).
Return multiple sets if more than 1 gene-drug combinations are present in the text.
"""]


**The model extracts multiple drug-target combinations correctly for the first article. 
However, for the second article it returns inaccurate information and gives the drug type as the drug name.
It also just reproduces the sentence from the paper to describe the type of drug interaction.
This presents an opportunity for:**
1. Experimenting with the system instruction within the prompt.
2. Testing if a one-shot approach gives a cleaner output.
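
A one-shot prompt could be assembled as sketched below: one worked example (a user message plus the desired assistant answer) is placed between the system instruction and the real abstract. This is only a sketch and has not been run against the API. The helper `build_one_shot_messages`, the choice of the sotorasib sentence as the example, and the "sets" key and short "inhibitor" label in the example answer are all assumptions made for illustration:

```python
import json

SYSTEM_INSTRUCTION = (
    "Return a JSON output with a set consisting of "
    "(target gene name, drug name, drug-target interaction, ClinicalTrials.gov number). "
    "Return multiple sets if more than one gene-drug combination is present in the text."
)

# Hypothetical worked example, built from the sotorasib abstract used earlier.
EXAMPLE_ABSTRACT = (
    "Sotorasib is a specific, irreversible inhibitor of the GTPase protein, KRASG12C."
)
EXAMPLE_ANSWER = {
    "sets": [
        {
            "target gene name": "KRASG12C",
            "drug name": "Sotorasib",
            "drug-target interaction": "inhibitor",
            "ClinicalTrials.gov number": None,
        }
    ]
}


def build_one_shot_messages(abstract):
    """Return system instruction, one worked example, then the abstract to process."""
    return [
        {"role": "system", "content": SYSTEM_INSTRUCTION},
        {"role": "user", "content": EXAMPLE_ABSTRACT},
        {"role": "assistant", "content": json.dumps(EXAMPLE_ANSWER, indent=2)},
        {"role": "user", "content": abstract.strip()},
    ]


messages = build_one_shot_messages("JAC4, a small molecular-compound agonist, ...")
print([message["role"] for message in messages])
```

The resulting list can be passed as `messages` to `client.chat.completions.create` in the same way as in the cells here.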

In [25]:
MODEL = 'gpt-4-turbo-preview'

messages = [{"role": "user", "content": """"""},
            {'role': 'system',"content": """"""}]

user_content = [
"""
More than half of metastatic melanoma patients receiving standard therapy fail to achieve a long-term survival due to primary and/or acquired resistance. 
Tumor cell ability to switch from epithelial to a more aggressive mesenchymal phenotype, attributed with AXLhigh molecular profile in melanoma, has been recently linked to such event, limiting treatment efficacy. 
In the current study, we investigated the therapeutic potential of the AXL inhibitor (AXLi) BGB324 alone or in combination with the clinically relevant BRAF inhibitor (BRAFi) vemurafenib. 
Firstly, AXL was shown to be expressed in majority of melanoma lymph node metastases.
When treated ex vivo, the largest reduction in cell viability was observed when the two drugs were combined. 
In addition, a therapeutic benefit of adding AXLi to the BRAF-targeted therapy was observed in pre-clinical AXLhigh melanoma models in vitro and in vivo. When searching for mechanistic insights, AXLi was found to potentiate BRAFi-induced apoptosis, stimulate ferroptosis and inhibit autophagy. Altogether, our findings propose AXLi as a promising treatment in combination with standard therapy to improve therapeutic outcome in metastatic melanoma.
"""
    
]

for user_text in user_content:
    messages[0]['content'] = user_text
    messages[1]['content'] = system_content[0]
    
    response = client.chat.completions.create(
        model=MODEL,
        response_format={ "type": "json_object" },
        messages=messages
    )
    print(f'''----RESULT---- {response}''')

----RESULT---- ChatCompletion(id='chatcmpl-8wwTTSJhidKa2PRZamGg8Wn0USUCf', choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content='{\n  "gene-drug combinations": [\n    {\n      "target gene name": "AXL",\n      "drug name": "BGB324",\n      "drug-target interaction": "AXL inhibitor (AXLi)",\n      "ClinicalTrials.gov number": null\n    },\n    {\n      "target gene name": "BRAF",\n      "drug name": "vemurafenib",\n      "drug-target interaction": "BRAF inhibitor (BRAFi)",\n      "ClinicalTrials.gov number": null\n    }\n  ]\n}', role='assistant', function_call=None, tool_calls=None))], created=1709058491, model='gpt-4-0125-preview', object='chat.completion', system_fingerprint='fp_91aa3742b1', usage=CompletionUsage(completion_tokens=113, prompt_tokens=310, total_tokens=423))


* Entities captured correctly.

In [26]:
print(response.choices[0].message.content)

{
  "gene-drug combinations": [
    {
      "target gene name": "AXL",
      "drug name": "BGB324",
      "drug-target interaction": "AXL inhibitor (AXLi)",
      "ClinicalTrials.gov number": null
    },
    {
      "target gene name": "BRAF",
      "drug name": "vemurafenib",
      "drug-target interaction": "BRAF inhibitor (BRAFi)",
      "ClinicalTrials.gov number": null
    }
  ]
}


In [27]:
MODEL = 'gpt-4-turbo-preview'

messages = [{"role": "user", "content": """"""},
            {'role': 'system',"content": """"""}]

user_content = [
"""
Triple-negative breast cancers (TNBC) frequently inactivate p53, increasing their aggressiveness and therapy resistance. We identified an unexpected protein vulnerability in p53-inactivated TNBC and designed a new PROteolysis TArgeting Chimera (PROTAC) to target it. Our PROTAC selectively targets MDM2 for proteasome-mediated degradation with high-affinity binding and VHL recruitment. MDM2 loss in p53 mutant/deleted TNBC cells in two-dimensional/three-dimensional culture and TNBC patient explants, including relapsed tumors, causes apoptosis while sparing normal cells. Our MDM2-PROTAC is stable in vivo, and treatment of TNBC xenograft-bearing mice demonstrates tumor on-target efficacy with no toxicity to normal cells, significantly extending survival. Transcriptomic analyses revealed upregulation of p53 family target genes. Investigations showed activation and a required role for TAp73 to mediate MDM2-PROTAC-induced apoptosis. Our data, challenging the current MDM2/p53 paradigm, show MDM2 is required for p53-inactivated TNBC cell survival, and PROTAC-targeted MDM2 degradation is an innovative potential therapeutic strategy for TNBC and superior to existing MDM2 inhibitors.
"""
    
]

for user_text in user_content:
    messages[0]['content'] = user_text
    messages[1]['content'] = system_content[0]
    
    response = client.chat.completions.create(
        model=MODEL,
        response_format={ "type": "json_object" },
        messages=messages
    )
    print(f'''----RESULT---- {response}''')

----RESULT---- ChatCompletion(id='chatcmpl-8wwTvHEIUzOfazMNWe66ycUfjgiq8', choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content='{\n  "gene-drug combinations": [\n    {\n      "target gene name": "MDM2",\n      "drug name": "MDM2-PROTAC",\n      "drug-target interaction": "PROTAC selectively targets MDM2 for proteasome-mediated degradation with high-affinity binding and VHL recruitment",\n      "ClinicalTrials.gov number": null\n    }\n  ]\n}', role='assistant', function_call=None, tool_calls=None))], created=1709058519, model='gpt-4-0125-preview', object='chat.completion', system_fingerprint='fp_8c864dca93', usage=CompletionUsage(completion_tokens=82, prompt_tokens=312, total_tokens=394))


* Entities captured incorrectly.

In [28]:
print(response.choices[0].message.content)

{
  "gene-drug combinations": [
    {
      "target gene name": "MDM2",
      "drug name": "MDM2-PROTAC",
      "drug-target interaction": "PROTAC selectively targets MDM2 for proteasome-mediated degradation with high-affinity binding and VHL recruitment",
      "ClinicalTrials.gov number": null
    }
  ]
}


## Can multiple abstracts be sent in one call with just one system instruction?

Answer: Yes

In [29]:
MODEL = 'gpt-4-turbo-preview'

messages = [
    {"role": "system","content": """
    For each of the user text delimited by triple quotes, do the following:
    1. Return a set consisting of (target gene name,drug name, drug-target interaction, ClinicalTrials.gov number).
    2. Return multiple sets if more than 1 gene-drug combinations are present in the text.
    3. Assemble all sets and produce 1 final JSON output.
    """ },
    {
        "role": "user", "content": """
        More than half of metastatic melanoma patients receiving standard therapy fail to achieve a long-term survival due to primary and/or acquired resistance. 
Tumor cell ability to switch from epithelial to a more aggressive mesenchymal phenotype, attributed with AXLhigh molecular profile in melanoma, has been recently linked to such event, limiting treatment efficacy. 
In the current study, we investigated the therapeutic potential of the AXL inhibitor (AXLi) BGB324 alone or in combination with the clinically relevant BRAF inhibitor (BRAFi) vemurafenib. 
Firstly, AXL was shown to be expressed in majority of melanoma lymph node metastases.
When treated ex vivo, the largest reduction in cell viability was observed when the two drugs were combined. 
In addition, a therapeutic benefit of adding AXLi to the BRAF-targeted therapy was observed in pre-clinical AXLhigh melanoma models in vitro and in vivo. When searching for mechanistic insights, AXLi was found to potentiate BRAFi-induced apoptosis, stimulate ferroptosis and inhibit autophagy. Altogether, our findings propose AXLi as a promising treatment in combination with standard therapy to improve therapeutic outcome in metastatic melanoma.
        """
    },
    {
        "role": "user", "content": """
        Lung adenocarcinoma (LUAD) is the most common lung cancer, with high mortality. 
As a tumor-suppressor gene, JWA plays an important role in blocking pan-tumor progression. 
JAC4, a small molecular-compound agonist, transcriptionally activates JWA expression both in vivo and in vitro.
        """
    }
]

response = client.chat.completions.create(
    model=MODEL,
    response_format={"type": "json_object"},
    messages=messages
)

In [35]:
response

ChatCompletion(id='chatcmpl-8wzCWPeaRKazsJa7W5gRPuqgGPj09', choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content='{\n  "sets": [\n    {\n      "target gene name": "AXL",\n      "drug name": "BGB324",\n      "drug-target interaction": "inhibitor",\n      "ClinicalTrials.gov number": null\n    },\n    {\n      "target gene name": "BRAF",\n      "drug name": "vemurafenib",\n      "drug-target interaction": "inhibitor",\n      "ClinicalTrials.gov number": null\n    },\n    {\n      "target gene name": "JWA",\n      "drug name": "JAC4",\n      "drug-target interaction": "agonist",\n      "ClinicalTrials.gov number": null\n    }\n  ]\n}', role='assistant', function_call=None, tool_calls=None))], created=1709068972, model='gpt-4-0125-preview', object='chat.completion', system_fingerprint='fp_89b1a570e1', usage=CompletionUsage(completion_tokens=146, prompt_tokens=424, total_tokens=570))

In [31]:
print(response.choices[0].message.content)

{
  "sets": [
    {
      "target gene name": "AXL",
      "drug name": "BGB324",
      "drug-target interaction": "inhibitor",
      "ClinicalTrials.gov number": null
    },
    {
      "target gene name": "BRAF",
      "drug name": "vemurafenib",
      "drug-target interaction": "inhibitor",
      "ClinicalTrials.gov number": null
    },
    {
      "target gene name": "JWA",
      "drug name": "JAC4",
      "drug-target interaction": "agonist",
      "ClinicalTrials.gov number": null
    }
  ]
}
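
The system instruction above refers to text "delimited by triple quotes", but the Python triple quotes only delimit the string literal; they are not part of the content sent to the model. A minimal sketch that adds literal delimiters and builds one user message per abstract, as in the call above. `build_batch_messages` is a hypothetical helper, and whether explicit delimiters change the result has not been tested here:

```python
def build_batch_messages(system_instruction, abstracts):
    """Build one system message followed by one delimited user message per abstract."""
    messages = [{"role": "system", "content": system_instruction.strip()}]
    for abstract in abstracts:
        # Wrap the text in literal triple quotes so the instruction's wording applies.
        delimited = f'"""{abstract.strip()}"""'
        messages.append({"role": "user", "content": delimited})
    return messages


batch = build_batch_messages(
    "For each of the user texts delimited by triple quotes, return one JSON output.",
    [
        "  JAC4, a small molecular-compound agonist, transcriptionally activates JWA.  ",
        "We investigated the AXL inhibitor BGB324 in combination with vemurafenib.",
    ],
)
for message in batch:
    print(message["role"], "|", message["content"][:40])
```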


## How to extract details from the ChatCompletion object?

* In addition to the results, it's useful to extract the model and usage details.
* When building the main code that processes abstracts in batches, assign each batch an ID to track which PMIDs were processed in which batch, and record the model and token usage for each batch. This metadata would help to plan future costs based on abstract lengths.

**The response can be serialized to JSON and loaded back as a plain Python dict.**

In [58]:
result = json.loads(response.model_dump_json())

* Model string can be extracted like:

In [81]:
result['model']

'gpt-4-0125-preview'

* Usage details can be extracted like:

In [82]:
result['usage']

{'completion_tokens': 146, 'prompt_tokens': 424, 'total_tokens': 570}
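
The model and usage details can be combined with the batch bookkeeping described above into one record per call. A minimal sketch, assuming per-1K-token prices are passed in by the caller. `usage_record` is a hypothetical helper, the PMIDs are placeholders, and the prices below are the GPT-4 figures quoted in the notes, which may not match the model actually used:

```python
def usage_record(batch_id, pmids, model, usage, input_price_per_1k, output_price_per_1k):
    """Combine batch bookkeeping with token usage and an estimated cost in USD."""
    estimated_cost = (
        usage["prompt_tokens"] / 1000 * input_price_per_1k
        + usage["completion_tokens"] / 1000 * output_price_per_1k
    )
    return {
        "batch_id": batch_id,
        "pmids": list(pmids),
        "model": model,
        **usage,
        "estimated_cost_usd": round(estimated_cost, 6),
    }


# Model and usage values copied from the response above. The PMIDs are
# placeholders and the prices are assumptions; check the current pricing page.
record = usage_record(
    batch_id=1,
    pmids=["PMID_PLACEHOLDER_1", "PMID_PLACEHOLDER_2"],
    model="gpt-4-0125-preview",
    usage={"completion_tokens": 146, "prompt_tokens": 424, "total_tokens": 570},
    input_price_per_1k=0.03,
    output_price_per_1k=0.06,
)
print(record)
```

Collecting one such record per batch gives a table of token counts that can be compared with abstract lengths.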

**The main results are contained in the 'content' field under the 'message' key of each entry in the 'choices' list.**

In [80]:
result

{'id': 'chatcmpl-8wzCWPeaRKazsJa7W5gRPuqgGPj09',
 'choices': [{'finish_reason': 'stop',
   'index': 0,
   'logprobs': None,
   'message': {'content': '{\n  "sets": [\n    {\n      "target gene name": "AXL",\n      "drug name": "BGB324",\n      "drug-target interaction": "inhibitor",\n      "ClinicalTrials.gov number": null\n    },\n    {\n      "target gene name": "BRAF",\n      "drug name": "vemurafenib",\n      "drug-target interaction": "inhibitor",\n      "ClinicalTrials.gov number": null\n    },\n    {\n      "target gene name": "JWA",\n      "drug name": "JAC4",\n      "drug-target interaction": "agonist",\n      "ClinicalTrials.gov number": null\n    }\n  ]\n}',
    'role': 'assistant',
    'function_call': None,
    'tool_calls': None}}],
 'created': 1709068972,
 'model': 'gpt-4-0125-preview',
 'object': 'chat.completion',
 'system_fingerprint': 'fp_89b1a570e1',
 'usage': {'completion_tokens': 146,
  'prompt_tokens': 424,
  'total_tokens': 570}}

**Extract the results into a pandas DataFrame.**

In [75]:
data_dict = json.loads(result['choices'][0]['message']['content'])

In [77]:
data_dict['sets']

[{'target gene name': 'AXL',
  'drug name': 'BGB324',
  'drug-target interaction': 'inhibitor',
  'ClinicalTrials.gov number': None},
 {'target gene name': 'BRAF',
  'drug name': 'vemurafenib',
  'drug-target interaction': 'inhibitor',
  'ClinicalTrials.gov number': None},
 {'target gene name': 'JWA',
  'drug name': 'JAC4',
  'drug-target interaction': 'agonist',
  'ClinicalTrials.gov number': None}]

In [79]:
pd.DataFrame(data_dict['sets'])

| | target gene name | drug name | drug-target interaction | ClinicalTrials.gov number |
|---|---|---|---|---|
| 0 | AXL | BGB324 | inhibitor | None |
| 1 | BRAF | vemurafenib | inhibitor | None |
| 2 | JWA | JAC4 | agonist | None |
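
The top-level key is chosen by the model and differs between the responses in this notebook ("gene-drug combinations" in some, "sets" in others), so hard-coding `data_dict['sets']` may fail on other responses. A minimal sketch that returns the first list of records found under any top-level key. `extract_records` is a hypothetical helper, and treating the first list of dicts as the result is an assumption:

```python
import json


def extract_records(content):
    """Return the list of record dicts from a JSON string, whatever its top-level key."""
    parsed = json.loads(content)
    if isinstance(parsed, list):
        return parsed
    for value in parsed.values():
        # Take the first value that is a list made up only of dicts.
        if isinstance(value, list) and all(isinstance(item, dict) for item in value):
            return value
    # No list of records found: treat the whole object as a single record.
    return [parsed]


with_sets = '{"sets": [{"target gene name": "JWA", "drug name": "JAC4"}]}'
with_combinations = '{"gene-drug combinations": [{"target gene name": "MDM2"}]}'
flat = '{"target_gene": "JWA", "associated_drug_names": ["JAC4"]}'

for content in (with_sets, with_combinations, flat):
    print(extract_records(content))
```

The returned list can be passed directly to `pd.DataFrame(...)`, as in the cell above.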


Things to figure out / next steps:

1. How to send multiple abstracts and how many? - for the first part, test 3 text chunks. - DONE
2. How to extract details from a JSON object? - DONE
3. Experiment with the temperature setting?
4. Try one-shot prompting