# Notebook for Testing Prompts with GPT-4o

This notebook generates a synthetic dataset by having an LLM (GPT-4o in this case) add different magnitudes of spin to a given abstract.

In [144]:
# import packages
from openai import OpenAI
import pandas as pd
from tqdm import tqdm
import os
from dotenv import load_dotenv
from enum import Enum
from pydantic import BaseModel

In [145]:
def load_api_keys():
    load_dotenv(override=True)
    client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
    return client

In [146]:
# get original abstracts used for testing prompts (DEV set)
df_abstracts = pd.read_csv('../data/FactPico_115_abstracts_with_pmid_pmcid_title.csv')
df_dev_abstracts = df_abstracts[df_abstracts['data_split'] == 'DEV']

# FOR TESTING
# df_dev_abstracts = df_dev_abstracts.sample(3)

# count number of abstracts
print('Number of abstracts in DEV set:', len(df_dev_abstracts))

Number of abstracts in DEV set: 14


## PART TO CUSTOMIZE

In [None]:
PROMPT_TEMPLATE_NAME = 'gpt4o-generated-long-detailed-with-rationale-change-results-conclusions-example-abstracts-temp0_6-p0_7'

In [148]:
SYSTEM_PROMPT = 'You are an expert in the field of health sciences.'

In [149]:
PROMPT_TEMPLATE = '''Spin is misrepresentation of study results which involves overstating (exaggerating) the efficacy and/or understating the harms. It also involves using specific reporting strategies highlighting that the experimental treatment is beneficial, despite a statistically nonsignificant difference for the primary outcome, or to distract the reader from statistically nonsignificant results. 

Reframe the provided title and abstract to produce three distinct versions with varying levels of spin (subtle, moderate, and strong). Incorporate the definition of spin, common strategies for spin, and approaches for quantifying spin levels while ensuring that the original numerical results (e.g., p-values, effect sizes) remain unchanged. These adjustments should aim to emphasize positive treatment outcomes while maintaining adherence to the data's integrity.

Spin Strategies:
1. Misleading Reporting (i.e., incomplete or inadequate reporting of any important information in the context of the research that could be misleading for the reader)
    * Not reporting adverse events or lack of focus on harm
    * Reporting of statistically non-significant outcome as if the trial were an equivalence trial
    * Selective reporting (only reporting subset of original or primary outcomes)
        1. Focus on statistically significant within-group comparison
        2. Focus on statistically significant secondary outcomes
        3. Focus on statistically significant subgroup analyses
        4. Focus on statistically significant modified population of analyses (Reporting a per-protocol analysis when intention-to-treat was prespecified)
        5. Focus on statistically significant within- and between-group comparisons for secondary outcomes
        6. Emphasizing statistically significant results out of order (e.g. subgroup before overall analysis, secondary endpoint before primary endpoint)
    * Misleading description of study design (study design is presented as more robust than it is actually)
    * Use of linguistic spin or “hype”
        1. Used “trend statements” in the description of statistical significance (e.g. “trend toward significance”) even though there is no statistical significance in results
        2. Exaggeration of efficacy of treatment
        3. Understating the harms of treatment
2. Misleading Interpretation (i.e., interpretation of the study results in news stories that is not consistent with the results reported in the scientific articles and overestimating the beneficial effect of the treatment)
    * Claim an effect for non-statistically significant results
    * Claim equivalence for non-statistically significant results
    * Rule out safety when results are not statistically significant (intervention presented as safe based on non-statistically significant comparison test, despite large confidence interval)
    * Causal language or causal claim (using cause-and-effect link between intervention and outcome)
    * Claim any difference despite no comparison test performed
    * Focus on statistical significance instead of clinical relevance of the effect size
    * Distracting from nonsignificant findings by stating that the nonsignificant results were due to a trial design issue (e.g., underpowered)
    * Claiming treatment benefits from a statistically significant surrogate endpoint or secondary outcome when primary outcome was nonsignificant.
    * Emphasizing subgroups or modified treatment populations
    * Focusing on within-group assessment (both treatments are effective/treatment administered in both groups is effective)
3. Misleading extrapolation (i.e., overgeneralization of study results in news stories to different populations, treatments or outcomes that were not assessed in the study)
    * Inadequate extrapolation to larger population, intervention, or outcome
    * Inadequate implications for clinical practice
    * No mentions of further investigation being necessary to confirm present findings

Spin Quantification:
* Measure spin by the total number of strategies used and their severity:
    * Subtle spin: Minimal strategy use, low severity.
    * Moderate spin: Increased strategy use, moderate severity.
    * Strong spin: Frequent strategy use, high severity.

Instructions for Output:
* Write Three Versions of the Title and Abstract:
    * Subtle Spin: Balanced tone with minimal rephrasing.
    * Moderate Spin: Optimistic tone emphasizing benefits.
    * Strong Spin: Aggressive language focusing on efficacy while minimizing drawbacks.
* Editing Scope:
    * Only edit the title, results, findings, and conclusion sections.
    * Do not alter the introduction, background, aims, objectives, or methods.
* Documentation for Each Version:
    * Edit Summary: List all changes to the title and abstract.
    * Rationale for Edits: Justify each edit, explaining how it aligns with the level of spin. Examples:
        * Linguistic adjustments (e.g., "significant improvement" → "substantial improvement").
        * Reordering results to emphasize secondary over primary outcomes.
        * Adding trend statements to suggest potential significance even though the result is not statistically significant.
    * Spin Strategy Applied: Identify the strategy used (e.g., linguistic spin, selective focus).

Examples of abstracts with spin and strategies used:
* Example 1
    * Title: A Novel Home-Based Intervention for Child and Adolescent Obesity: The Results of the Whānau Pakari Randomized Controlled Trial.
    * Abstract: Objective: To report 12-month outcomes from a multidisciplinary child obesity intervention program, targeting high-risk groups.

Methods: In this unblinded randomized controlled trial, participants (recruited January 2012-August 2014) were aged 5 to 16 years, resided in Taranaki, Aotearoa/New Zealand, and had BMI ≥ 98th percentile or BMI > 91st percentile with weight-related comorbidities. Randomization was by minimization (age and ethnicity), with participants assigned to an intense intervention group (home-based assessments at 6-month intervals and a 12-month multidisciplinary program with weekly group sessions) or to a minimal-intensity control group with home-based assessments and advice at each 6-month follow-up. The primary outcome was the change in BMI standard deviation score (SDS) at 12 months from baseline. A mixed model analysis was undertaken, incorporating all 6- and 12-month data.

Results: Two hundred and three children were randomly assigned (47% Māori, 43% New Zealand European, 53% female, 28% from the most deprived quintile, mean age 10.7 years, mean BMI SDS 3.12). Both groups displayed a change in BMI SDS at 12 months from baseline (-0.12 control, -0.10 intervention), improvements in cardiovascular fitness (P < 0.0001), and improvements in quality of life (P < 0.001). Achieving ≥ 70% attendance in the intense intervention group resulted in a change in BMI SDS of -0.22.

Conclusions: This program achieved a high recruitment of target groups and a high rate of BMI SDS reduction, irrespective of intervention intensity. If retention is optimized, the intensive program doubles its effect. 
    * Strategies: RESULTS: Selectively focus on (+) within-group comparison for primary endpoint, CONCLUSION: "Trend toward significance" // "Numerically longer survival" or equivalent verbiage, Focus on another objective (e.g., trial is (-) but they say they accomplish some goal that they did not prespecify)
* Example 2
    * Title: Low-dose ketamine vs morphine for acute pain in the ED: a randomized controlled trial.
    * Abstract: Objectives
To compare the maximum change in numeric rating scale (NRS) pain scores, in patients receiving low-dose ketamine (LDK) or morphine (MOR) for acute pain in the emergency department.
Methods
We performed an institutional review board–approved, randomized, prospective, double-blinded trial at a tertiary, level 1 trauma center. A convenience sample of patients aged 18 to 59 years with acute abdominal, flank, low back, or extremity pain were enrolled. Subjects were consented and randomized to intravenous LDK (0.3 mg/kg) or intravenous MOR (0.1 mg/kg). Our primary outcome was the maximum change in NRS scores. A sample size of 20 subjects per group was calculated based on an 80% power to detect a 2-point change in NRS scores between treatment groups with estimated SDs of 2 and an α of .05, using a repeated-measures linear model.
Results
Forty-five subjects were enrolled (MOR 21, LDK 24). Demographic variables and baseline NRS scores (7.1 vs 7.1) were similar. Ketamine was not superior to MOR in the maximum change of NRS pain scores, MOR = 5 (confidence interval, 6.6-3.5) and LDK = 4.9 (confidence interval, 5.8-4). The time to achieve maximum reduction in NRS pain scores was at 5 minutes for LDK and 100 minutes for MOR. Vital signs, adverse events, provider, and nurse satisfaction scores were similar between groups.
Conclusion
Low-dose ketamine did not produce a greater reduction in NRS pain scores compared with MOR for acute pain in the emergency department. However, LDK induced a significant analgesic effect within 5 minutes and provided a moderate reduction in pain for 2 hours.
    * Strategies: RESULTS: Focus on (+) secondary endpoint, RESULTS: Focus on (+) subgroup analysis, CONCLUSION: Claim benefit based on (+) secondary endpoint
* Example 3
    * Title: Intraarticular analgesia versus epidural plus femoral nerve block after TKA: a randomized, double-blind trial.
    * Abstract: Background
Pain management after TKA remains challenging and the efficacy of continuously infused intraarticular anesthetics remains a controversial topic.

Questions/purposes
We compared the side effect profile, analgesic efficacy, and functional recovery between patients receiving a continuous intraarticular infusion of ropivacaine and patients receiving an epidural plus femoral nerve block (FNB) after TKA.

Methods
Ninety-four patients undergoing unilateral TKA were prospectively randomized to receive a spinal-epidural analgesic infusion plus a single-injection FNB or a spinal anesthetic plus a continuous postoperative intraarticular infusion of 0.2% ropivacaine. All patients were blinded to their treatment with placebo saline catheters. Blinded coinvestigators collected data concerning side effect profiles (nausea, hypotension), analgesic efficacy (VAS pain scores, narcotic usage), and functional recovery (timed up and go test, quadriceps strength, WOMAC scores, Knee Society scores, early postoperative ambulatory ability, in-hospital falls). All complications and adverse events were recorded.

Results
The frequency of nausea and hypertension was not different between the study groups. During the first 12 and 24 postoperative hours, the mean maximum VAS pain scores were higher in the ropivacaine group than in the epidural group (first 12 hours: 3.93 versus 1.14, respectively, p < 0.0001; 12–24 hours: 3.52 versus 1.93, respectively, p = 0.008). After 24 hours, pain scores were similar between groups. Narcotic consumption was significantly higher in the ropivacaine group on the day of surgery, but overall in-hospital narcotic usage was similar between groups. There were no clinically important differences in functional recovery between groups at any time point, but patients in the epidural group were more likely to have knee buckling (32.7% versus 6.7%, p = 0.002) and delayed ambulation (16.3% versus 0.0%, p = 0.006) than patients in the ropivacaine group, though not in-hospital falls. No infections occurred in either group, and the frequency of complications was not different between groups.

Conclusions
A continuous intraarticular infusion of ropivacaine can be recommended as a safe, effective alternative to epidural analgesia plus single-injection FNB after TKA. Improved analgesic efficacy in the group that received epidural analgesia plus single-injection FNB must be weighed against the disadvantage of a higher likelihood of knee buckling and delayed ambulation with that treatment approach.
    * Strategies: CONCLUSION: Claim equivalence/non-inferiority versus control for a (-) endpoint

Title: {title}
Abstract: {abstract}'''
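The template ends with two placeholders, `{title}` and `{abstract}`, which are filled later with chained `str.replace` calls. A minimal sketch of an alternative is shown below, assuming a hypothetical helper `fill_template` (not part of this notebook): it fills all placeholders in a single pass, so text inside one value that happens to look like another placeholder is left alone, and it fails loudly when a placeholder has no value. The short template here is a stand-in for `PROMPT_TEMPLATE`.

```python
import re

# Matches simple placeholders such as {title} or {abstract}
PLACEHOLDER = re.compile(r"\{(\w+)\}")


def fill_template(template, **fields):
    """Fill every {name} placeholder in one pass (hypothetical helper).

    Raises ValueError if the template names a placeholder with no supplied value.
    """
    missing = set(PLACEHOLDER.findall(template)) - set(fields)
    if missing:
        raise ValueError(f"No value supplied for placeholder(s): {sorted(missing)}")
    # A function replacement inserts the value literally, without escape processing
    return PLACEHOLDER.sub(lambda match: fields[match.group(1)], template)


# Stand-in template for illustration only
demo_template = "Title: {title}\nAbstract: {abstract}"
print(fill_template(demo_template, title="A trial of X", abstract="Objective: ..."))
```

With chained `replace` calls, a title containing the literal text `{abstract}` would be overwritten by the second call; the single-pass version avoids that edge case.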

In [150]:
TEMPERATURE = 0.6
TOP_P = 0.7

In [151]:
NEW_FILENAME = "./prompt_engineering/" + PROMPT_TEMPLATE_NAME + ".csv"

In [152]:
class Severity(Enum):
    subtle = 'subtle'
    moderate = 'moderate'
    strong = 'strong'

# Used for generating rationales for each edit
class Documentation(BaseModel):
    edit: str
    rationale: str
    strategy_applied: str

class AbstractResponse(BaseModel):
    title: str
    abstract: str
    spin_severity: Severity
    documentation: list[Documentation]

class Response(BaseModel):
    generated: list[AbstractResponse]
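The schema constrains each version's `spin_severity` to one of three values, but it does not by itself require that a response contains exactly one version per level. The sketch below is one possible sanity check, assuming a hypothetical helper `check_severity_coverage` (not part of this notebook). It compares on the string values rather than on enum members, so it works with any object that exposes a `spin_severity` attribute; the stand-in objects are for illustration only.

```python
from types import SimpleNamespace

# The three levels requested in the prompt, as plain strings
EXPECTED_LEVELS = ("subtle", "moderate", "strong")


def check_severity_coverage(versions):
    """Return (missing, duplicated) severity labels for one model response.

    Each item needs a `spin_severity` attribute that is either an Enum member
    with a string `.value` or a plain string.
    """
    seen = []
    for version in versions:
        # Enum members expose .value; plain strings are used as-is
        seen.append(getattr(version.spin_severity, "value", version.spin_severity))
    missing = [level for level in EXPECTED_LEVELS if level not in seen]
    duplicated = sorted({level for level in seen if seen.count(level) > 1})
    return missing, duplicated


# Stand-in objects; in the notebook the input would be `response.generated`
example = [
    SimpleNamespace(spin_severity="subtle"),
    SimpleNamespace(spin_severity="strong"),
    SimpleNamespace(spin_severity="strong"),
]
print(check_severity_coverage(example))  # (['moderate'], ['strong'])
```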

## RUN generation

In [153]:
def format_model_output(model_output_response: Response):
    formatted_output = ""
    for response in model_output_response.generated:
        formatted_documentation = ""
        for documentation in response.documentation:
            formatted_documentation += f"Edit: {documentation.edit}\nRationale: {documentation.rationale}\nStrategy Applied: {documentation.strategy_applied},\n"
        formatted_output += f"{response.spin_severity.value.capitalize()}:\nTitle: {response.title}\nAbstract: {response.abstract}\nDocumentation: [{formatted_documentation}]\n\n"

    return formatted_output

In [154]:
def gen_gpt4o(title, abstract, client, temperature=1, top_p=1):
    """Request the spun versions for one abstract; return None on refusal or error."""
    # Fill placeholders (a no-op for the system prompt unless it is customized to contain them)
    sys_prompt = SYSTEM_PROMPT.replace('{title}', title).replace('{abstract}', abstract)
    user_prompt = PROMPT_TEMPLATE.replace('{title}', title).replace('{abstract}', abstract)
    try:
        response = client.beta.chat.completions.parse(
            model="gpt-4o",
            temperature=temperature,
            top_p=top_p,
            messages=[
                {'role': 'system', 'content': sys_prompt},
                {'role': 'user', 'content': user_prompt}
            ],
            response_format=Response,
        )
        response_message = response.choices[0].message
        if response_message.parsed:
            # Structured output parsed into the Response schema
            return response_message.parsed
        if response_message.refusal:
            # The model declined; log the refusal text
            print(response_message.refusal)
    except Exception as e:
        # Log API or parsing errors; the caller receives None
        print(e)
    return None
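Because `gen_gpt4o` yields `None` whenever the request fails, a transient error leaves that abstract without output. One possible mitigation, sketched below under assumptions, is a small retry wrapper with exponential backoff. The helper `call_with_retries` and its parameters (`max_attempts`, `base_delay`, `sleep`) are hypothetical and not part of this notebook; the demo uses a fake function instead of a real API call.

```python
import time


def call_with_retries(fn, *args, max_attempts=3, base_delay=1.0, sleep=time.sleep, **kwargs):
    """Call fn until it returns a non-None value or attempts run out (hypothetical helper)."""
    for attempt in range(1, max_attempts + 1):
        result = fn(*args, **kwargs)
        if result is not None:
            return result
        if attempt < max_attempts:
            # Wait base_delay, then 2x, 4x, ... before the next attempt
            sleep(base_delay * 2 ** (attempt - 1))
    return None


# Demo with a fake function that fails twice, then succeeds
calls = {"n": 0}

def flaky():
    calls["n"] += 1
    return "ok" if calls["n"] == 3 else None

delays = []
# Record the delays instead of actually sleeping
print(call_with_retries(flaky, sleep=delays.append), delays)  # ok [1.0, 2.0]
```

In the generation loop this could be used as `call_with_retries(gen_gpt4o, row['title'], row['abstract'], client, TEMPERATURE, TOP_P)`. Retrying is unlikely to help when the model refuses, so a refusal may deserve separate handling.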

In [155]:
# prompt_template, prompt_with_input, model, model_output
client = load_api_keys()

output_data = []
for i, row in tqdm(df_dev_abstracts.iterrows(), total=df_dev_abstracts.shape[0]):
    data_dict = {}
    data_dict['pmid'] = row['pmid']
    data_dict['pmcid'] = row['pmcid']
    data_dict['title'] = row['title']
    data_dict['abstract'] = row['abstract']
    data_dict['prompt_template_name'] = PROMPT_TEMPLATE_NAME
    data_dict['prompt_template'] = 'system prompt: ' + SYSTEM_PROMPT + ' user prompt: ' + PROMPT_TEMPLATE
    data_dict['model_name'] = 'gpt-4o'
    data_dict['temperature'] = TEMPERATURE
    data_dict['top_p'] = TOP_P
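    model_output = gen_gpt4o(row['title'], row['abstract'], client, TEMPERATURE, TOP_P)
    # gen_gpt4o returns None on a refusal or API error; store an empty string in that case
    data_dict['model_output'] = format_model_output(model_output) if model_output is not None else ''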

    output_data.append(data_dict)
    
new_df = pd.DataFrame.from_dict(output_data)

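# Create the output folder if it does not exist yet
os.makedirs(os.path.dirname(NEW_FILENAME), exist_ok=True)
new_df.to_csv(NEW_FILENAME, index=False)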

100%|██████████| 14/14 [04:45<00:00, 20.37s/it]
