# LLM: Creating prompts

Zero-shot classification assigns data to categories without training the model on labeled examples for that specific task. Instead, the model uses its pre-trained knowledge and a natural-language prompt to infer the correct category. The table below lists the prompt variants created in this notebook.

| Prompt Name | Description |
|---|---|
| **simple_prompt** | "Does the following person develop a psychological disorder between time point T1 and T2? Data: …" |
| **class_definitions_prompt** | "Given the following data, classify whether this person develops a psychological disorder between T1 and T2 according to the instructions provided and data measured by F-DIPS structural interviews. Respond with YES or NO. Instructions: NO: The person did not develop any new psychological disorder between T1 and T2. This means they were either healthy at both time points, had an ongoing disorder across both time points, or had already recovered from a previous disorder. YES: The person was psychologically healthy at T1 but developed a psychological disorder at T2. Data: …" |
| **profiled_simple_prompt** | "You are an expert in clinical psychology and mental health diagnostics. You are trained to analyze longitudinal data to assess whether a person develops a psychological disorder over time. You know how to analyze data measured with F-DIPS structural interviews at two time points T1 and T2. Does the following person develop a psychological disorder between time point T1 and T2? Data: …" |
| **few_shot_prompt** | "Please refer to the following examples of correctly classified data points with 'Total mental disorders incidence point prevalence' being the true classification: … Based on the previous example data prompts, classify the following data. Does the following person develop a psychological disorder between time point T1 and T2? Data: …" |
| **vignette_text_prompt** | "A woman with a BMI of {bmi}, an education level of {education}, and a socioeconomic status of {ses} has the following psychological profile: At Time Point 1, she showed {t1_pm_health} positive mental health, {t1_support} social support, and {t1_self_eff} general self-efficacy. Her life satisfaction was {t1_life_sat}, and she relied on {t1_coping_prob} problem-focused and {t1_coping_emot} emotion-focused coping strategies. Anxiety sensitivity was {t1_anx_sens}, and her fear of bodily sensations was {t1_fear}. She reported {t1_dysfunc} levels of dysfunctional attitudes and {t1_gsi} levels of general psychopathology. Her stress level was {t1_stress}. By Time Point 2, approximately 17 months later, she reported {t2_pm_health} positive mental health, {t2_support} social support, and {t2_self_eff} self-efficacy. Life satisfaction remained {t2_life_sat}. However, anxiety sensitivity had become {t2_anx_sens}, and fear of bodily sensations was {t2_fear}. Dysfunctional attitudes were {t2_dysfunc}, and general psychopathology was {t2_gsi}. Does this person develop a psychological disorder between time point T1 and T2?" |

## 0 Imports

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from scipy.spatial.distance import pdist

In [2]:
data_change = pd.read_csv("../../../dat/dips/DIPS_Data_cleaned_change.csv", sep =",", low_memory = False)

## 1 Zero-shot prompts

In this section, I **create different prompts for zero-shot classification** for every person (row) in the training set (`X_train`). Each prompt asks whether the person develops a psychological disorder between time points T1 and T2.

### 1.1 Data preparation

In [3]:
# Predictors
X = data_change
X = X.drop(["hpi"], axis = 1)

# Target
y = data_change["hpi"]

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42, stratify = y)

# concatenate X and y for example data
dat_train = pd.concat([X_train, y_train], axis = 1)
dat_test = pd.concat([X_test, y_test], axis = 1)

print("LLMs \n",
      "X_train shape: ", X_train.shape, round(X_train.shape[0]/len(X), 2), "\n",
      "X_test shape: ", X_test.shape, round(X_test.shape[0]/len(X), 2),  "\n",
      "y_train shape: ", y_train.shape, round(y_train.shape[0]/len(y), 2), "\n",
      "y_test shape: ", y_test.shape, round(y_test.shape[0]/len(y), 2), "\n")

LLMs 
 X_train shape:  (983, 22) 0.8 
 X_test shape:  (246, 22) 0.2 
 y_train shape:  (983,) 0.8 
 y_test shape:  (246,) 0.2 



In [4]:
# get clear names for each column in data
preds = X.columns
preds

Index(['whi_ges', 'soz_ges', 'gke_ges', 'lzh_ges', 'ile', 'woc_gesp',
       'woc_gese', 'asi_ges', 'bsq_ges', 'das_ges', 'scl_gsi', 'bild',
       'bmi_kat', 'ses_kom', 'whi_change', 'soz_change', 'gke_change',
       'lzh_change', 'asi_change', 'bsq_change', 'das_change', 'scl_change'],
      dtype='object')

In [5]:
predictor_names = ["T1 Positive mental health",    #whi_ges
                   "T1 Social support",            #soz_ges (SAVE)
                   "T1 General self-efficacy",     #gke_ges SAVE
                   "T1 Life satisfaction",         #lzh_ges SAVE
                   "T1 Stress",                     #ile SAVE
                   "T1 Problem-focused coping",    #woc_gesp SAVE
                   "T1 Emotion-focused coping",    #woc_gese SAVE
                   "T1 Anxiety sensitivity",       #asi_ges SAVE
                   "T1 Fear of bodily sensations", #bsq_ges SAVE
                   "T1 Dysfunctional attitudes",   #das_ges SAVE
                   "T1 General psychopathology: Global Severity Index (GSI)",    #scl_gsi SAVE
                   "Education",                    #bild
                   "T1 BMI",                       #bmi_kat
                   "Socioeconomic status",         #ses_kom
                   "Change in Positive mental health (T2-T1)",    #whi_change
                   "Change in Social support (T2-T1)",            #soz_change
                   "Change in General self-efficacy (T2-T1)",     #gke_change
                   "Change in Life satisfaction (T2-T1)",         #lzh_change
                   "Change in Anxiety sensitivity (T2-T1)",       #asi_change
                   "Change in Fear of bodily sensations (T2-T1)", #bsq_change
                   "Change in Dysfunctional attitudes (T2-T1)",   #das_change
                   "Change in General psychopathology: Global Severity Index (GSI) (T2-T1)",    #scl_change
                   ]

target_names = ["Total mental disorders incidence point prevalence"]    #hpi

In [6]:
# print col and names
for col, name in zip(preds, predictor_names):
    print(f"{col}: {name}")

whi_ges: T1 Positive mental health
soz_ges: T1 Social support
gke_ges: T1 General self-efficacy
lzh_ges: T1 Life satisfaction
ile: T1 Stress
woc_gesp: T1 Problem-focused coping
woc_gese: T1 Emotion-focused coping
asi_ges: T1 Anxiety sensitivity
bsq_ges: T1 Fear of bodily sensations
das_ges: T1 Dysfunctional attitudes
scl_gsi: T1 General psychopathology: Global Severity Index (GSI)
bild: Education
bmi_kat: T1 BMI
ses_kom: Socioeconomic status
whi_change: Change in Positive mental health (T2-T1)
soz_change: Change in Social support (T2-T1)
gke_change: Change in General self-efficacy (T2-T1)
lzh_change: Change in Life satisfaction (T2-T1)
asi_change: Change in Anxiety sensitivity (T2-T1)
bsq_change: Change in Fear of bodily sensations (T2-T1)
das_change: Change in Dysfunctional attitudes (T2-T1)
scl_change: Change in General psychopathology: Global Severity Index (GSI) (T2-T1)
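
Because `zip` stops silently at the shorter input, a length mismatch between `preds` and `predictor_names` would go unnoticed in the loop above. A minimal sketch of a guarded mapping, shown on a three-column excerpt of the names listed here; `build_name_map` is a hypothetical helper and not part of this notebook:

```python
def build_name_map(columns, names):
    """Map column codes to readable names, failing loudly on a length mismatch."""
    columns, names = list(columns), list(names)
    if len(columns) != len(names):
        raise ValueError(f"{len(columns)} columns but {len(names)} names")
    return dict(zip(columns, names))

# Excerpt of the column codes and labels from the listing above
cols = ["whi_ges", "soz_ges", "ile"]
labels = ["T1 Positive mental health", "T1 Social support", "T1 Stress"]

name_map = build_name_map(cols, labels)
print(name_map["ile"])
```

A dictionary lookup by column code also avoids relying on list positions when the labels are used later.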


### 1.2 Simple prompt

In [7]:
def create_data(data_row):
    """
    Format one row of predictors as a labelled, sectioned text block for the prompt.
    """

    # time point T1 measures
    t1_positive_mental_health = data_row['whi_ges']
    t1_social_support = data_row['soz_ges']
    t1_general_self_efficacy = data_row['gke_ges']
    t1_life_satisfaction = data_row['lzh_ges']
    t1_stress = data_row['ile']
    t1_problem_focused_coping = data_row['woc_gesp']
    t1_emotion_focused_coping = data_row['woc_gese']
    t1_anxiety_sensitivity = data_row['asi_ges']
    t1_fear_of_bodily_sensations = data_row['bsq_ges']
    t1_dysfunctional_attitudes = data_row['das_ges']
    t1_general_psychopathology_gsi = data_row['scl_gsi']

    # demographics
    bmi = data_row['bmi_kat']
    education = data_row['bild']
    ses = data_row['ses_kom']

    # changes between T1 and T2
    change_positive_mental_health = data_row['whi_change']
    change_social_support = data_row['soz_change']
    change_general_self_efficacy = data_row['gke_change']
    change_life_satisfaction = data_row['lzh_change']
    change_anxiety_sensitivity = data_row['asi_change']
    change_fear_of_bodily_sensations = data_row['bsq_change']
    change_dysfunctional_attitudes = data_row['das_change']
    change_general_psychopathology_gsi = data_row['scl_change']

    data_string = f"Here are the variables you need to consider: \n \n 1. Demographic Information (centered): \n {predictor_names[11]}: {education} \n {predictor_names[12]}: {bmi} \n {predictor_names[13]}: {ses} \n \n 2. Time Point T1 Measures (standardized): \n {predictor_names[0]}: {t1_positive_mental_health} \n {predictor_names[1]}: {t1_social_support} \n {predictor_names[2]}: {t1_general_self_efficacy} \n {predictor_names[3]}: {t1_life_satisfaction} \n {predictor_names[4]}: {t1_stress} \n {predictor_names[5]}: {t1_problem_focused_coping} \n {predictor_names[6]}: {t1_emotion_focused_coping} \n {predictor_names[7]}: {t1_anxiety_sensitivity} \n {predictor_names[8]}: {t1_fear_of_bodily_sensations} \n {predictor_names[9]}: {t1_dysfunctional_attitudes} \n {predictor_names[10]}: {t1_general_psychopathology_gsi} \n \n 3. Changes between T1 and T2 (standardized): \n {predictor_names[14]}: {change_positive_mental_health} \n {predictor_names[15]}: {change_social_support} \n {predictor_names[16]}: {change_general_self_efficacy} \n {predictor_names[17]}: {change_life_satisfaction} \n {predictor_names[18]}: {change_anxiety_sensitivity} \n {predictor_names[19]}: {change_fear_of_bodily_sensations} \n {predictor_names[20]}: {change_dysfunctional_attitudes} \n {predictor_names[21]}: {change_general_psychopathology_gsi}"

    return data_string
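
The long f-string in `create_data` could also be assembled from a section layout, which keeps each label next to its column code instead of relying on list positions. The sketch below is a hypothetical alternative (`SECTIONS` and `format_sections` are illustrative names), shown with only a few variables per section and a plain dict standing in for a DataFrame row. It uses plain `\n` line breaks and does not reproduce the exact whitespace of `create_data`.

```python
# Each section: (title, [(readable label, column code), ...])
SECTIONS = [
    ("1. Demographic Information (centered)",
     [("Education", "bild"), ("T1 BMI", "bmi_kat")]),
    ("2. Time Point T1 Measures (standardized)",
     [("T1 Positive mental health", "whi_ges"), ("T1 Stress", "ile")]),
    ("3. Changes between T1 and T2 (standardized)",
     [("Change in Social support (T2-T1)", "soz_change")]),
]

def format_sections(row, sections=SECTIONS):
    """Build the data block line by line from the section layout."""
    lines = ["Here are the variables you need to consider:"]
    for title, fields in sections:
        lines.append("")           # blank line between sections
        lines.append(f"{title}:")
        for label, col in fields:
            lines.append(f"{label}: {row[col]}")
    return "\n".join(lines)

# Made-up values for illustration only
example_row = {"bild": -1.0, "bmi_kat": 0.0, "whi_ges": -0.03,
               "ile": 0.04, "soz_change": 0.32}
print(format_sections(example_row))
```

Because `row[col]` works for both dicts and pandas Series, the same function could be called with `X_train.iloc[i]`.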

In [8]:
def create_simple_prompt(data_row):
    """
    Create a simple prompt for zero-shot classification.
    """

    # Convert the data row to a string format using the real column names
    data_string = create_data(data_row)

    instruction = "You must respond ONLY with YES or NO to the question. Do not output anything else. Example output format: YES/NO"

    prompt = f"Does the following person develop a psychological disorder between time point T1 and T2? {data_string}"

    return instruction, prompt
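
Since the instruction asks the model to answer only with YES or NO, the reply eventually has to be mapped back to the binary target. A minimal sketch of such a step, assuming replies arrive as plain strings; `parse_yes_no` is a hypothetical helper and not part of this notebook:

```python
def parse_yes_no(reply):
    """Return 1 for YES, 0 for NO, and None for anything else."""
    # Tolerate surrounding whitespace, case differences and trailing punctuation
    cleaned = reply.strip().strip(".!\"'").upper()
    if cleaned == "YES":
        return 1
    if cleaned == "NO":
        return 0
    return None  # off-format reply; handle separately instead of guessing

replies = ["YES", " no.", "Maybe", "YES/NO"]
print([parse_yes_no(r) for r in replies])
```

Returning `None` for off-format replies keeps them distinguishable from genuine NO answers.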

In [9]:
# Create a simple prompt for each row in the training set
X_train_simple_prompt = []

for i in range(len(X_train)):
    row = X_train.iloc[i]
    simple_instruction, simple_prompt = create_simple_prompt(row)
    X_train_simple_prompt.append(simple_prompt)

print(X_train_simple_prompt[0])

Does the following person develop a psychological disorder between time point T1 and T2? Here are the variables you need to consider: 
 
 1. Demographic Information (centered): 
 Education: -1.0 
 T1 BMI: 0.0 
 Socioeconomic status: 0.0 
 
 2. Time Point T1 Measures (standardized): 
 T1 Positive mental health: -0.0279170753483525 
 T1 Social support: 0.1887579094812689 
 T1 General self-efficacy: -0.7683193301454964 
 T1 Life satisfaction: -4.988109170535162 
 T1 Stress: 0.0419806820921755 
 T1 Problem-focused coping: -0.1132461506506245 
 T1 Emotion-focused coping: -0.0352757160928132 
 T1 Anxiety sensitivity: 0.0273038333880951 
 T1 Fear of bodily sensations: -1.0510334432380857 
 T1 Dysfunctional attitudes: 0.1083521462552709 
 T1 General psychopathology: Global Severity Index (GSI): -0.6365022182918669 
 
 3. Changes between T1 and T2 (standardized): 
 Change in Positive mental health (T2-T1): -0.235399282050399 
 Change in Social support (T2-T1): 0.3152228312446513 
 Change in Gen

In [10]:
# # create random input for testing
# data_row_random = X_train.iloc[4]
# data_row_random = data_row_random + np.random.rand(len(data_row_random))
#
# simple_instruction, simple_prompt = create_simple_prompt(data_row_random)
# print(simple_prompt)

In [11]:
X_train_simple_prompt_df = pd.DataFrame(X_train_simple_prompt)
X_train_simple_prompt_df.to_csv("X_train_pred/prompts/X_train_simple_prompt.csv", sep = ",", index = True)

simple_instruction_df = pd.DataFrame([simple_instruction])
simple_instruction_df.to_csv("X_train_pred/instructions/simple_instruction.csv", sep = ",", index = True)

### 1.3 Class definitions prompt

In [12]:
def create_class_definitions_prompt(data_row):
    """
    Create a class definitions prompt for zero-shot classification.
    """

    # Convert the data row to a string format using the real column names
    data_string = create_data(data_row)

    instruction = "You must respond ONLY with YES or NO to the question. Do not output anything else. Example output format: YES/NO"

    prompt = f"""Given the following data, classify whether this person develops a psychological disorder between T1 and T2 according to the general context and instructions provided and data measured by F-DIPS structural interviews. Respond only with YES or NO.

    General Context:
    Psychological disorders are common globally, with approximately 12-25% of people living with a mental health condition at any given time, and about 50% of individuals developing at least one mental health disorder during their lifetime by age 75.

    Instructions:
    NO: The person did not develop any new psychological disorder between T1 and T2. This means they were either psychologically healthy at both time points (no diagnosis at T1 and T2), had an ongoing psychological disorder across both time points (diagnosis at both T1 and T2), or they recovered and were diagnosed only at the first, but healthy at the second time point (diagnosis at T1 but no diagnosis at T2).
    YES: The person was psychologically healthy at T1 but developed a psychological disorder at T2. This includes new onset of any mental disorder assessed by the F-DIPS interview, such as anxiety disorders (e.g., panic disorder, specific phobia), affective disorders (e.g., depression, bipolar disorder), somatoform disorders (e.g., hypochondria, somatization disorder), substance-related disorders (e.g., alcohol dependence, drug misuse), eating disorders (e.g., anorexia nervosa, bulimia nervosa), child/adolescent disorders (e.g., ADHD), or psychotic disorders.

    Does the following person develop a psychological disorder between time point T1 and T2? {data_string}"""

    return instruction, prompt

In [13]:
# Create a class definitions prompt for each row in the training set
X_train_class_definitions_prompt = []
for i in range(len(X_train)):
    row = X_train.iloc[i]
    class_definitions_instruction, class_definitions_prompt = create_class_definitions_prompt(row)
    X_train_class_definitions_prompt.append(class_definitions_prompt)

print(X_train_class_definitions_prompt[0])

Given the following data, classify whether this person develops a psychological disorder between T1 and T2 according to the general context and instructions provided and data measured by F-DIPS structural interviews. Respond only with YES or NO.

    General Context:
    Psychological disorders are common globally, with approximately 12-25% of people living with a mental health condition at any given time, and about 50% of individuals developing at least one mental health disorder during their lifetime by age 75.

    Instructions:
    NO: The person did not develop any new psychological disorder between T1 and T2. This means they were either psychologically healthy at both time points (no diagnosis at T1 and T2), had an ongoing psychological disorder across both time points (diagnosis at both T1 and T2), or they recovered and were diagnosed only at the first, but healthy at the second time point (diagnosis at T1 but no diagnosis at T2).
    YES: The person was psychologically healthy 
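
The printed prompt above keeps the four-space indentation of the function body, because a triple-quoted string preserves leading whitespace on every line after the first. If that indentation is unwanted, one option is to dedent a template before filling in the data. A minimal sketch with a shortened, hypothetical template (`TEMPLATE` and `build_prompt` are illustrative names, and the template text is abbreviated):

```python
import textwrap

# The backslash after the opening quotes avoids an empty first line;
# dedent then strips the indentation shared by all lines.
TEMPLATE = textwrap.dedent("""\
    Classify whether this person develops a psychological disorder.

    Instructions:
    NO: No new disorder between T1 and T2.
    YES: Healthy at T1, disorder at T2.

    {data_string}""")

def build_prompt(data_string):
    # Fill in the data after dedenting so multi-line data cannot affect the margin
    return TEMPLATE.format(data_string=data_string)

print(build_prompt("T1 Stress: 0.04\nT1 Social support: 0.19"))
```

Dedenting before formatting matters here: the data block contains its own line breaks, and inserting it first would change which indentation counts as common.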

In [14]:
# # create random input for testing
# class_definitions_instruction, class_definitions_prompt = create_class_definitions_prompt(data_row_random)
# print(class_definitions_prompt)

In [15]:
X_train_class_definitions_prompt_df = pd.DataFrame(X_train_class_definitions_prompt)
X_train_class_definitions_prompt_df.to_csv("X_train_pred/prompts/X_train_class_definitions_prompt.csv", sep = ",", index = True)

class_definitions_instruction_df = pd.DataFrame([class_definitions_instruction])
class_definitions_instruction_df.to_csv("X_train_pred/instructions/class_definitions_instruction.csv", sep = ",", index = True)

### 1.4 Profiled simple prompt

In [16]:
def create_profiled_simple_prompt(data_row):
    """
    Create a profiled simple prompt for zero-shot classification.
    """

    # Convert the data row to a string format using the real column names
    data_string = create_data(data_row)

    instruction = "You must respond ONLY with YES or NO to the question. Do not output anything else. Example output format: YES/NO"

    prompt = f"You are an expert in clinical psychology and mental health diagnostics. You specialize in analyzing longitudinal data to determine whether individuals develop psychological disorders over time. You are deeply familiar with diagnostic frameworks (e.g., DSM, ICD, and F-DIPS) and are trained in interpreting self-reports and longitudinal predictors. \n You recognize clinical patterns and changes in psychological variables over time. You can determine mental health risk factors and protective factors. You pay special attention to temporal changes in symptoms and mental health predictors.\n You work on a high-stakes project and analyse data very accurately and with a high level of precision and scientific thoroughness. You have a high level of responsibility and methodological rigour. You understand that your analysis has important implications for clinical research. Therefore, you approach each decision with careful consideration. You base your decision on data collected through F-DIPS structured diagnostic interviews and predictors at two time points (T1 and T2). Does the following person develop a psychological disorder between time point T1 and T2? {data_string}"

    return instruction, prompt

In [17]:
# Create a profiled simple prompt for each row in the training set
X_train_profiled_simple_prompt = []
for i in range(len(X_train)):
    row = X_train.iloc[i]
    profiled_simple_instruction, profiled_simple_prompt = create_profiled_simple_prompt(row)
    X_train_profiled_simple_prompt.append(profiled_simple_prompt)

print(X_train_profiled_simple_prompt[0])

You are an expert in clinical psychology and mental health diagnostics. You specialize in analyzing longitudinal data to determine whether individuals develop psychological disorders over time. You are deeply familiar with diagnostic frameworks (e.g., DSM, ICD, and F-DIPS) and are trained in interpreting self-reports and longitudinal predictors. 
 You recognize clinical patterns and changes in psychological variables over time. You can determine mental health risk factors and protective factors. You pay special attention to temporal changes in symptoms and mental health predictors.
 You work on a high-stakes project and analyse data very accurately and with a high level of precision and scientific thoroughness. You have a high level of responsibility and methodological rigour. You understand that your analysis has important implications for clinical research. Therefore, you approach each decision with careful consideration. You base your decision on data collected through F-DIPS struct

In [18]:
# # create random input for testing
# profiled_simple_instruction, profiled_simple_prompt = create_profiled_simple_prompt(data_row_random)
# print(profiled_simple_prompt)

In [19]:
X_train_profiled_simple_prompt_df = pd.DataFrame(X_train_profiled_simple_prompt)
X_train_profiled_simple_prompt_df.to_csv("X_train_pred/prompts/X_train_profiled_simple_prompt.csv", sep = ",", index = True)

profiled_simple_instruction_df = pd.DataFrame([profiled_simple_instruction])
profiled_simple_instruction_df.to_csv("X_train_pred/instructions/profiled_simple_instruction.csv", sep = ",", index = True)

### 1.5 Few-shot prompt

In [20]:
def create_few_shot_prompt(data_row, few_shot_examples):
    """
    Create a few-shot prompt for few-shot classification.
    """

    example_data = ""

    # build the example block, including the target as well: Example 1: ..., Example 2: ...
    for i, example in enumerate(few_shot_examples):
        example_df = pd.DataFrame([example], columns = data_change.columns)
        example_data = example_data + f"\n \n Example {i+1}: " + create_data(example_df.iloc[0])
        label = "YES" if example[-1] == 1 else "NO"
        example_data = example_data + f"\n \n Target: Does this person in example {i+1} develop a psychological disorder between time point T1 and T2?: {label}"

    # Convert the data row to a string format using the real column names
    data_string = create_data(data_row)

    instruction = "You must respond ONLY with YES or NO to the question. Do not output anything else. Example output format: YES/NO"

    prompt = f"""Please refer to the following examples of correctly classified data points including the true classification ('Target: Does this person in example 1 develop a psychological disorder between time point T1 and T2?'): {example_data}

    Based on the previous example data prompts, classify the following data. Does the following person develop a psychological disorder between time point T1 and T2? {data_string}"""

    return instruction, prompt

In [21]:
def get_similar_examples(data_row, dat_test, n):
    """
    Get the n rows of the candidate set (passed as `dat_test`) that are closest to `data_row`, measured by Euclidean distance over the predictors.
    """

    # get columns that we want to use for similarity
    preds = dat_test.drop("hpi", axis = 1).columns
    dat_preds = dat_test[preds]
    data_row = data_row[preds]
    data_row = np.array([data_row])

    # dictionary to store the distances
    distances = {}

    for row in dat_preds.iterrows():
        # create array with the row and the data_row
        row_input = np.array([row[1]])
        dist_row = np.concatenate((data_row, row_input), axis = 0)

        # calculate the distance between the two rows
        dist = pdist(dist_row, metric = "euclidean")

        # store the distance in the dictionary
        distances[row[0]] = dist[0]

    # sort the dictionary by distance
    sorted_distances = sorted(distances.items(), key = lambda x: x[1])

    # get top keys
    keys = []
    for key in sorted_distances[:n]:
        keys.append(key[0])

    # drop all rows that are not in keys
    ex_dat = dat_test.loc[keys]

    return ex_dat
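
When the query row is itself part of the candidate pool, its distance to itself is 0 and it is returned first. The sketch below illustrates nearest-neighbour selection with an optional exclusion of the query's own key. It is a pure-Python illustration under assumed names (`nearest_keys`, `exclude`) with made-up points, not the notebook's implementation, and it uses `math.dist` for the Euclidean distance:

```python
import math

def nearest_keys(query, pool, n, exclude=None):
    """Return the keys of the n pool rows closest to `query` (Euclidean distance)."""
    distances = {
        key: math.dist(query, values)
        for key, values in pool.items()
        if key != exclude  # optionally leave out the query's own entry
    }
    return sorted(distances, key=distances.get)[:n]

# Toy candidate pool: key -> feature vector
pool = {
    "a": [0.0, 0.0],
    "b": [1.0, 0.0],
    "c": [0.0, 3.0],
    "d": [5.0, 5.0],
}
query = pool["a"]

print(nearest_keys(query, pool, 2))               # "a" itself is among the results
print(nearest_keys(query, pool, 2, exclude="a"))  # query left out
```

In the DataFrame setting, the analogous key would be the row's index label.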

In [41]:
# Create a few-shot prompt for each row in the training set
X_train_few_shot_prompt = []

for i in range(len(X_train)):
    row = X_train.iloc[i]
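    # NOTE: `row` is itself contained in dat_train, so it comes back as its own
    # nearest example (distance 0), together with its true label
    few_shot_examples = get_similar_examples(row, dat_train, 5).values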
    few_shot_instruction, few_shot_prompt = create_few_shot_prompt(row, few_shot_examples)
    X_train_few_shot_prompt.append(few_shot_prompt)

print(X_train_few_shot_prompt[0])

Please refer to the following examples of correctly classified data points including the true classification ('Target: Does this person in example 1 develop a psychological disorder between time point T1 and T2?'): 
 
 Example 1: Here are the variables you need to consider: 
 
 1. Demographic Information (centered): 
 Education: -1.0 
 T1 BMI: 0.0 
 Socioeconomic status: 0.0 
 
 2. Time Point T1 Measures (standardized): 
 T1 Positive mental health: -0.0279170753483525 
 T1 Social support: 0.1887579094812689 
 T1 General self-efficacy: -0.7683193301454964 
 T1 Life satisfaction: -4.988109170535162 
 T1 Stress: 0.0419806820921755 
 T1 Problem-focused coping: -0.1132461506506245 
 T1 Emotion-focused coping: -0.0352757160928132 
 T1 Anxiety sensitivity: 0.0273038333880951 
 T1 Fear of bodily sensations: -1.0510334432380857 
 T1 Dysfunctional attitudes: 0.1083521462552709 
 T1 General psychopathology: Global Severity Index (GSI): -0.6365022182918669 
 
 3. Changes between T1 and T2 (standar

In [23]:
# # create random input for testing
# few_shot_examples = get_similar_examples(data_row_random, dat_test, 5).values
# few_shot_examples_rand = np.array([])
#
# for ex in few_shot_examples:
#     ex = ex + np.random.rand(len(ex))
#     few_shot_examples_rand = np.append(few_shot_examples_rand, ex)
#
# few_shot_instruction, few_shot_prompt = create_few_shot_prompt(data_row_random, few_shot_examples)
# print(few_shot_prompt)

In [24]:
X_train_few_shot_prompt_df = pd.DataFrame(X_train_few_shot_prompt)
X_train_few_shot_prompt_df.to_csv("X_train_pred/prompts/X_train_few_shot_prompt.csv", sep = ",", index = True)

few_shot_instruction_df = pd.DataFrame([few_shot_instruction])
few_shot_instruction_df.to_csv("X_train_pred/instructions/few_shot_instruction.csv", sep = ",", index = True)

### 1.6 Vignette text prompt

In [25]:
def interpret_value(val, thresh_low = -0.5, thresh_high = 0.5):
    """Convert standardized value to qualitative description with value."""
    val_str = f"{val:.4f}"
    if val >= thresh_high:
        return f"above average ({val_str})"
    elif val <= thresh_low:
        return f"below average ({val_str})"
    else:
        return f"average ({val_str})"

In [26]:
def interpret_value_demographics(val, thresh_low = -0.5, thresh_high = 0.5):
    """Convert centered demographic value to qualitative description with value (unrounded)."""
    val_str = f"{val}"
    if val >= thresh_high:
        return f"above average ({val_str})"
    elif val <= thresh_low:
        return f"below average ({val_str})"
    else:
        return f"average ({val_str})"

In [27]:
def interpret_change(val, t_similar = 0.1, t_slight = 0.5, t_strong = 1.5):
    """Convert standardized change score to qualitative description of its direction and size, with value."""
    val_str = f"{val:.4f}"
    if val >= t_strong:
        return f"strongly increased ({val_str})"
    elif val >= t_slight:
        return f"increased ({val_str})"
    elif val >= t_similar:
        return f"slightly increased ({val_str})"
    elif val <= -t_strong:
        return f"strongly decreased ({val_str})"
    elif val <= -t_slight:
        return f"decreased ({val_str})"
    elif val <= -t_similar:
        return f"slightly decreased ({val_str})"
    else:
        return f"similar ({val_str})"
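
A quick check of the bin boundaries helps confirm which wording a given change score receives. The sketch below applies the same default thresholds in a symmetric form and prints the label for values at and around each boundary. `label_change` is a hypothetical stand-in that returns only the label, without the numeric value appended by `interpret_change`:

```python
def label_change(val, t_similar=0.1, t_slight=0.5, t_strong=1.5):
    """Label a standardized change score by direction and magnitude."""
    direction = "increased" if val > 0 else "decreased"
    size = abs(val)
    if size >= t_strong:
        return f"strongly {direction}"
    if size >= t_slight:
        return direction
    if size >= t_similar:
        return f"slightly {direction}"
    return "similar"  # |val| below t_similar

# Values at and around each threshold
for v in [-2.0, -0.5, -0.1, 0.0, 0.09, 0.1, 0.5, 1.5]:
    print(v, "->", label_change(v))
```

Thresholds are inclusive on both sides, so a score of exactly 0.5 counts as "increased" and exactly -0.5 as "decreased".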

In [28]:
def create_vignette_prompt(data):

    # demographics
    bmi = interpret_value_demographics(data['bmi_kat'])
    education = interpret_value_demographics(data['bild'])
    ses = interpret_value_demographics(data['ses_kom'])

    # time point 1
    t1_pm_health = interpret_value(data['whi_ges'])
    t1_support = interpret_value(data['soz_ges'])
    t1_self_eff = interpret_value(data['gke_ges'])
    t1_life_sat = interpret_value(data['lzh_ges'])
    t1_coping_prob = interpret_value(data['woc_gesp'])
    t1_coping_emot = interpret_value(data['woc_gese'])
    t1_anx_sens = interpret_value(data['asi_ges'])
    t1_fear = interpret_value(data['bsq_ges'])
    t1_dysfunc = interpret_value(data['das_ges'])
    t1_gsi = interpret_value(data['scl_gsi'])
    t1_stress = interpret_value(data['ile'])

    # change variables
    change_pm_health = interpret_change(data['whi_change'])
    change_support = interpret_change(data['soz_change'])
    change_self_eff = interpret_change(data['gke_change'])
    change_life_sat = interpret_change(data['lzh_change'])
    change_anx_sens = interpret_change(data['asi_change'])
    change_fear = interpret_change(data['bsq_change'])
    change_dysfunc = interpret_change(data['das_change'])
    change_gsi = interpret_change(data['scl_change'])

    instruction = "You must respond ONLY with YES or NO to the question. Do not output anything else. Example output format: YES/NO"

    # generate prompt
    vignette = f"""A woman with an {bmi} BMI, an {education} education level, and an {ses} socioeconomic status has the following psychological profile:

    At the initial assessment at the first time point, she exhibited {t1_pm_health} positive mental health. Her social support network was {t1_support}, and she displayed {t1_self_eff} general self-efficacy. Her life satisfaction was {t1_life_sat}. When faced with challenges, she tended to use {t1_coping_prob} problem-focused and {t1_coping_emot} emotion-focused coping strategies. Her anxiety sensitivity was {t1_anx_sens}, and her fear of bodily sensations was {t1_fear}. She reported {t1_dysfunc} levels of dysfunctional attitudes and {t1_gsi} levels of general psychopathology. Her stress level was {t1_stress}.

    By the second time point, approximately 17 months later, she reported {change_pm_health} positive mental health. Compared to the first time point, the level of social support she received was {change_support}, and her general self-efficacy was {change_self_eff}. Her life satisfaction was {change_life_sat} when compared to the first time point. Her anxiety sensitivity was reported to be {change_anx_sens}, and fear of bodily sensations was {change_fear}. Her dysfunctional attitudes were {change_dysfunc}, and her general psychopathology was {change_gsi} compared to the initial assessment.

    Did this person develop a psychological disorder between time point T1 and T2?"""

    return instruction, vignette

In [29]:
# create a vignette prompt for each row in the training set
X_train_vignette_prompt = []
for i in range(len(X_train)):
    row = X_train.iloc[i]
    vignette_instruction, vignette_prompt = create_vignette_prompt(row)
    X_train_vignette_prompt.append(vignette_prompt)

print(X_train_vignette_prompt[113])

A woman with an average (0.0) BMI, an average (0.0) education level, and an average (0.0) socioeconomic status has the following psychological profile:

    At the initial assessment at the first time point, she exhibited below average (-0.9944) positive mental health. Her social support network was above average (1.0790), and she displayed average (0.3650) general self-efficacy. Her life satisfaction was average (0.4991). When faced with challenges, she tended to use above average (1.9077) problem-focused and below average (-1.3318) emotion-focused coping strategies. Her anxiety sensitivity was average (0.0273), and her fear of bodily sensations was above average (0.8596). She reported average (-0.1834) levels of dysfunctional attitudes and average (0.0172) levels of general psychopathology. Her stress level was average (-0.1580).

    By the second time point, approximately 17 months later, she reported increased (1.3145) positive mental health. Compared to the first time point, the 

In [30]:
# # create random input for testing
# vignette_instruction, vignette_prompt = create_vignette_prompt(data_row_random)
# print(vignette_prompt)

In [31]:
X_train_vignette_prompt_df = pd.DataFrame(X_train_vignette_prompt)
X_train_vignette_prompt_df.to_csv("X_train_pred/prompts/X_train_vignette_prompt.csv", sep = ",", index = True)

vignette_instruction_df = pd.DataFrame([vignette_instruction])
vignette_instruction_df.to_csv("X_train_pred/instructions/vignette_instruction.csv", sep = ",", index = True)
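
Because the prompts are written with `index = True` and no column name, the CSV contains an index column and a single data column labeled `0`. The following is a minimal sketch of how such a file could be read back into a list of strings. It uses an in-memory buffer and two made-up prompts instead of the real files, so the names `prompts`, `buffer`, and `restored_prompts` are assumptions for illustration only.

```python
import io

import pandas as pd

# Made-up prompts; the second one spans several lines like the vignette prompts do.
prompts = ["first prompt, with a comma", "second prompt\nspanning two lines"]

# Write in the same way as the cells above, but to an in-memory buffer.
buffer = io.StringIO()
pd.DataFrame(prompts).to_csv(buffer, sep=",", index=True)

# Read back: the first column is the saved index, the remaining column holds the prompts.
buffer.seek(0)
restored = pd.read_csv(buffer, sep=",", index_col=0)
restored_prompts = restored.iloc[:, 0].tolist()

print(restored_prompts == prompts)
```

Passing `index_col=0` keeps the saved row index from showing up as an extra unnamed data column.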

### 1.7 Chain-of-thought prompt

In [32]:
def create_cot_prompt(data_row):
    """
    Create a zero-shot chain-of-thought prompt (first version; the next cell
    redefines create_cot_prompt, so this version is not used below).
    """

    data_string = create_data(data_row)

    instruction = "You are an expert psychologist tasked with predicting whether an individual will develop a psychological disorder between two time points (T1 and T2) based on various psychological measures and demographic information. Your goal is to provide an accurate YES or NO prediction, supported by a brief explanation of your reasoning. Example output format: \n Prediction: [YES/NO] \n Explanation: [Brief explanation supporting your prediction]."

    prompt = f"""You are an expert psychologist tasked with predicting whether an individual will develop a psychological disorder between two time points (T1 and T2) based on various psychological measures and demographic information. Your goal is to provide an accurate YES or NO prediction, supported by a brief explanation of your reasoning.

    {data_string}

    Your task is to analyze these variables and predict whether the individual will develop a psychological disorder between T1 and T2. Follow these steps:
    1. Carefully examine each variable and its potential impact on psychological health.
    2. Consider the interplay between different factors, especially how changes between T1 and T2 might influence the outcome.
    3. Evaluate both risk factors and protective factors present in the data.
    4. Make a prediction based on your analysis.

    Conduct your psychological assessment and be thorough in your consideration of the variables and their relationships. In your assessment:
    1. Categorize variables into risk factors and protective factors.
    2. Consider the severity and direction of changes between T1 and T2.
    3. Evaluate the overall trend (improving or worsening) based on the changes.
    4. Consider potential interactions between variables.

    After your assessment, provide your prediction as either YES or NO, followed by a brief explanation (no more than two sentences) supporting your decision.
    Example output format:
    Prediction: [YES/NO]
    Explanation: [Brief explanation supporting your prediction]

    Remember, your goal is to make the most accurate prediction possible based on the given information. Your final output should consist only of the prediction and brief explanation, without duplicating or rehashing any of the work you did in the psychological assessment."""

    return instruction, prompt

In [33]:
def create_cot_prompt(data_row):
    """
    Create a chain-of-thought prompt that includes one worked example
    (row 764 of X_train) before the data of the person to classify.
    """

    data_string = create_data(data_row)
    ex_data = create_data(X_train.iloc[764])

    instruction = "You are an expert psychologist tasked with predicting whether an individual will develop a psychological disorder between two time points (T1 and T2) based on various psychological measures and demographic information. Your goal is to provide an accurate YES or NO prediction, supported by a brief explanation of your reasoning. Example output format: \n Prediction: [YES/NO] \n Explanation: [Brief explanation supporting your prediction]."

    prompt = f"""You are an expert in clinical psychology and mental health diagnostics. Your task is to determine whether a person develops a psychological disorder between time points T1 and T2 (approximately 17 months later) based on F-DIPS structural interview data.

Instructions:
NO: The person did not develop any new psychological disorder between T1 and T2 (healthy at both points, ongoing disorder, or recovered)
YES: The person was healthy at T1 but developed a new psychological disorder at T2

To make this decision, work through your reasoning step-by-step:
1. First, analyze the demographic variables: Review education, BMI, and socioeconomic status.
2. Second, analyze the baseline mental health at T1: Look at the T1 measures to assess the person's initial psychological state.
3. Third, analyze the changes between T1 and T2: Look at how key psychological variables changed.
4. Lastly, make your final decision: Based on the demographics, T1 baseline and T1-to-T2 changes, determine if this pattern suggests new disorder development.

Conduct your psychological assessment and be thorough in your consideration of the variables and their relationships. In your assessment of each step:
Categorize variables into risk factors and protective factors. Typical risk factors are e.g. GSI, dysfunctional attitudes, anxiety sensitivity, fear of bodily sensations. Typical protective factors are e.g. general self-efficacy, life satisfaction, social support, socioeconomic status.
Also consider potential interactions between variables and that variables might be ambivalent. For example high emotion-focused coping might reduce the risk of developing a psychological disorder (e.g., because you experience less stress) or increase the risk (e.g., because you don’t really solve the problem). Equivalently, low emotion-focused coping might reduce the risk (e.g., because you don’t need to cope if you don’t have any problems) or increase the risk (e.g., because you lack an efficient way of handling your stress).
Consider the magnitude and direction of the variables. Since the variables are standardized, a value of about 0 means average, a negative value means below average and a positive value means above average. Carefully examine each variable and its potential impact on developing a psychological disorder.

Here is a correctly classified chain-of-thought example conducted by a human with a background in psychology:

{ex_data}

First, I analyze the demographic variables: Since education, BMI and socioeconomic status are all zero, all of them are average. There are seemingly no important demographic risk or protective factors, nor a strong magnitude or trend of these variables.

Second, I analyze the baseline mental health at T1: Positive mental health is average, indicating that the person believes that life has a meaning. Social support is below average (negative): the individual perceives that they do not receive much support from their social environment. General self-efficacy is slightly above average, indicating that the individual has confidence in coping with a variety of stressful demands. Life satisfaction is above average, indicating their satisfaction with different life areas (e.g., health, occupation, leisure time).
Stress is below average, indicating that they are not experiencing a lot of stress from life events and daily hassles. The strongly negative problem- and emotion-focused coping indicate that they neither activate resources to solve the stress-inducing problem nor try to ease inner tension without trying to solve the problem.
Anxiety sensitivity and fear of bodily sensations are negative, indicating a low belief that anxiety and its accompanying symptoms may lead to physically, psychologically, or socially harmful consequences and that they have a low fear of sensations mostly related to autonomic arousal. Dysfunctional attitudes are positive, meaning that they agreed with irrational beliefs and faulty assumptions about reality that are etiologically related to depression. Finally, general psychopathology is negative, indicating low overall psychological distress.
To summarize an overall trend in T1, the person seems to be relatively healthy and happy. They have good positive mental health, good self-efficacy, and good life satisfaction. They had little stress, coping, anxiety sensitivity and fear of bodily sensations and low overall psychological distress. However, they have slightly higher dysfunctional attitudes and little social support, indicating a potential for developing a psychological disorder.

Third, I analyze the changes between T1 and T2: Positive mental health and general self-efficacy decrease in T2, indicating that they believe less in themselves and in a meaning of life. Social support is similar to T1, which means it is still negative and the person does not feel like they are being socially supported. Life satisfaction increases in T2, which indicates a positive development. Anxiety sensitivity and dysfunctional attitudes strongly increase in T2, indicating that the person now strongly believes that anxiety leads to harmful consequences and strongly agrees with irrational beliefs related to depression. However, fear of bodily sensations decreases slightly. General psychopathology slightly increases, but is still average. To summarize a general trend in T2, the person is feeling significantly worse than in T1.

Lastly, I make my final decision and output:
YES

Don’t over-interpret this correctly classified example and don’t stick to it too closely. Just because the decision in this example was based on some specific variables does not mean that these are generally more important. Magnitudes, directions, interactions and interpretations of variables always differ and depend on the specific case. Each person is a new individual assessment. After your assessment, provide your prediction as either YES or NO.
Example output format: \n Prediction: [YES/NO] \n Explanation: [Brief explanation supporting your prediction].

Remember, your goal is to make the most accurate prediction possible based on the given information. Your final output should consist only of the prediction and brief explanation, without duplicating or rehashing any of the work you did in the psychological assessment.

Does the following person develop a psychological disorder between time point T1 and T2? {data_string}"""

    return instruction, prompt

In [34]:
# create a cot prompt for each row in the training set
X_train_cot_prompt = []

for i in range(len(X_train)):
    row = X_train.iloc[i]
    cot_instruction, cot_prompt = create_cot_prompt(row)
    X_train_cot_prompt.append(cot_prompt)

print(X_train_cot_prompt[0])

You are an expert in clinical psychology and mental health diagnostics. Your task is to determine whether a person develops a psychological disorder between time points T1 and T2 (approximately 17 months later) based on F-DIPS structural interview data.

Instructions:
NO: The person did not develop any new psychological disorder between T1 and T2 (healthy at both points, ongoing disorder, or recovered)
YES: The person was healthy at T1 but developed a new psychological disorder at T2

To make this decision, work through your reasoning step-by-step:
1. First, analyze the demographic variables: Review education, BMI, and socioeconomic status.
2. Second, analyze the baseline mental health at T1: Look at the T1 measures to assess the person's initial psychological state.
3. Third, analyze the changes between T1 and T2: Look at how key psychological variables changed.
4. Lastly, make your final decision: Based on the demographics, T1 baseline and T1-to-T2 changes, determine if this pattern 

In [35]:
# # create random input for testing
# cot_instruction, cot_prompt = create_cot_prompt(data_row_random)
# print(cot_prompt)

In [36]:
X_train_cot_prompt_df = pd.DataFrame(X_train_cot_prompt)
X_train_cot_prompt_df.to_csv("X_train_pred/prompts/X_train_cot_prompt.csv", sep = ",", index = True)

cot_instruction_df = pd.DataFrame([cot_instruction])
cot_instruction_df.to_csv("X_train_pred/instructions/cot_instruction.csv", sep = ",", index = True)

In [37]:
print(create_data(X_train.iloc[764]))

Here are the variables you need to consider: 
 
 1. Demographic Information (centered): 
 Education: 0.0 
 T1 BMI: 0.0 
 Socioeconomic status: 0.0 
 
 2. Time Point T1 Measures (standardized): 
 T1 Positive mental health: -0.0279170753483525 
 T1 Social support: -0.5234719075739229 
 T1 General self-efficacy: 0.1383196105638797 
 T1 Life satisfaction: 0.6609316470538229 
 T1 Stress: -0.3579748085379315 
 T1 Problem-focused coping: -1.6948315955326445 
 T1 Emotion-focused coping: -1.2508043631414372 
 T1 Anxiety sensitivity: -0.6332554428711353 
 T1 Fear of bodily sensations: -0.3823291810495087 
 T1 Dysfunctional attitudes: 0.1083521462552709 
 T1 General psychopathology: Global Severity Index (GSI): -0.7091383212961528 
 
 3. Changes between T1 and T2 (standardized): 
 Change in Positive mental health (T2-T1): -0.235399282050399 
 Change in Social support (T2-T1): 0.0935950031372813 
 Change in General self-efficacy (T2-T1): -0.4584361821328143 
 Change in Life satisfaction (T2-T1): 0

In [38]:
y_train.iloc[764]

np.float64(1.0)

## 2 More instructions

### 2.1 Retry prompting if invalid output format

In [39]:
# retry prompting if invalid output format
retry_instruction = "You must respond ONLY with YES or NO. If the answer is unclear or difficult to determine, choose the option that seems most likely based on the data provided. Do not output any thinking, explanation, rationale or reasoning. Example output format: YES/NO"

retry_cot_instruction = """You are an expert psychologist tasked with predicting whether an individual will develop a psychological disorder between two time points (T1 and T2) based on various psychological measures and demographic information. Your goal is to provide an accurate YES or NO prediction. Don't provide an explanation. You must respond ONLY with YES or NO. If the answer is unclear or difficult to determine, choose the option that seems most likely based on the data provided. Example output format: [YES/NO]"""

print(retry_instruction, "\n")
print(retry_cot_instruction)

You must respond ONLY with YES or NO. If the answer is unclear or difficult to determine, choose the option that seems most likely based on the data provided. Do not output any thinking, explanation, rationale or reasoning. Example output format: YES/NO 

You are an expert psychologist tasked with predicting whether an individual will develop a psychological disorder between two time points (T1 and T2) based on various psychological measures and demographic information. Your goal is to provide an accurate YES or NO prediction. Don't provide an explanation. You must respond ONLY with YES or NO. If the answer is unclear or difficult to determine, choose the option that seems most likely based on the data provided. Example output format: [YES/NO]


In [40]:
retry_instruction_df = pd.DataFrame([retry_instruction])
retry_instruction_df.to_csv("X_train_pred/instructions/retry_instruction.csv", sep = ",", index = True)

retry_cot_instruction_df = pd.DataFrame([retry_cot_instruction])
retry_cot_instruction_df.to_csv("X_train_pred/instructions/retry_cot_instruction.csv", sep = ",", index = True)
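
The retry instructions are only useful together with a check that decides whether a reply is valid. The notebook does not show that step, so the following is a rough sketch under stated assumptions: `ask_model` is a hypothetical callable that takes an instruction and a prompt and returns the model's reply as a string, and `parse_yes_no`, `predict_with_retry`, and `max_retries` are names introduced here for illustration, not the author's implementation.

```python
import re

# "Prediction: YES" or "Prediction: [NO]"; an echoed template "[YES/NO]" is rejected.
PREDICTION_RE = re.compile(r"prediction:\s*\[?\s*(yes|no)\b(?!\s*/)", re.IGNORECASE)
# A reply that consists only of YES or NO, optionally wrapped in punctuation.
BARE_RE = re.compile(r"^\W*(yes|no)\W*$", re.IGNORECASE)


def parse_yes_no(reply):
    """Return 'YES' or 'NO' if the reply has a valid format, otherwise None."""
    match = PREDICTION_RE.search(reply) or BARE_RE.match(reply.strip())
    return match.group(1).upper() if match else None


def predict_with_retry(ask_model, prompt, instruction, retry_instruction, max_retries=2):
    """Ask once with the normal instruction, then retry with the stricter one."""
    answer = parse_yes_no(ask_model(instruction, prompt))
    attempts = 0
    while answer is None and attempts < max_retries:
        answer = parse_yes_no(ask_model(retry_instruction, prompt))
        attempts += 1
    return answer  # None means every attempt produced an invalid format


# Usage with a stub in place of a real model (hypothetical replies).
def make_stub(replies):
    remaining = iter(replies)
    return lambda instruction, prompt: next(remaining)


stub = make_stub(["It is hard to say.", "Prediction: [YES] \n Explanation: rising anxiety."])
print(predict_with_retry(stub, "some prompt", "first instruction", "retry instruction"))
```

The parser is deliberately strict: a reply that merely mentions "yes" or "no" inside a longer sentence counts as invalid and triggers a retry rather than a guess.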