# LLM: Creating prompts

Zero-shot classification assigns data to categories without the model being explicitly trained on labeled examples for that specific task. Instead, the model uses pre-trained knowledge and natural language prompts to infer the correct category.

| Prompt Name              | Description                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            |
|--------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| **simple_prompt**        | "Does the following person develop a psychological disorder between time point T1 and T2? Data: …"                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     |
| **class_definitions_prompt** | "Given the following data, classify whether this person develops a psychological disorder between T1 and T2 according to the instructions provided and data measured by F-DIPS structural interviews. Respond with YES or NO. Instructions: NO: The person did not develop any new psychological disorder between T1 and T2. This means they were either healthy at both time points, had an ongoing disorder across both time points, or had already recovered from a previous disorder. YES: The person was psychologically healthy at T1 but developed a psychological disorder at T2. Data: …"                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     |
| **profiled_simple_prompt** | "You are an expert in clinical psychology and mental health diagnostics. You are trained to analyze longitudinal data to assess whether a person develops a psychological disorder over time. You know how to analyze data measured with F-DIPS structural interviews at two time points T1 and T2. Does the following person develop a psychological disorder between time point T1 and T2? Data: …"                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  |
| **few_shot_prompt**      | "Please refer to the following examples of correctly classified data points with 'Total mental disorders incidence point prevalence' being the true classification: … Based on the previous example data prompts, classify the following data. Does the following person develop a psychological disorder between time point T1 and T2? Data: …"                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       |
| **vignette_text_prompt** | "A woman with a BMI of {bmi}, an education level of {education}, and a socioeconomic status of {ses} has the following psychological profile: At time point 1, she showed {t1_pm_health} positive mental health, {t1_support} social support, and {t1_self_eff} general self-efficacy. Her life satisfaction was {t1_life_sat}, and she relied on {t1_coping_prob} problem-focused and {t1_coping_emot} emotion-focused coping strategies. Anxiety sensitivity was {t1_anx_sens}, and her fear of bodily sensations was {t1_fear}. She reported {t1_dysfunc} levels of dysfunctional attitudes and {t1_gsi} levels of general psychopathology. Her stress level was {t1_stress}. By time point 2, approximately 17 months later, she reported {change_pm_health} positive mental health, {change_support} social support, and {change_self_eff} self-efficacy. Life satisfaction was {change_life_sat}. Anxiety sensitivity was reported to be {change_anx_sens}, and fear of bodily sensations was {change_fear}. Dysfunctional attitudes were {change_dysfunc}, and general psychopathology was {change_gsi} compared to time point 1. Does this person develop a psychological disorder between time point T1 and T2?" |

## 0 Imports

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from scipy.spatial.distance import pdist

In [2]:
data_change = pd.read_csv("../dat/dips/DIPS_Data_cleaned_change.csv", sep = ",", low_memory = False)

## 1 Zero-shot prompts

In this section, I will **create different prompts for zero-shot classification** for every person (row) in the test set. The prompts will be used to classify whether a person develops a psychological disorder between time point T1 and T2.

### 1.1 Data preparation

In [3]:
# Predictors
X = data_change
X = X.drop(["hpi"], axis = 1)

# Target
y = data_change["hpi"]

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42, stratify = y)

# concatenate X and y for example data
dat_train = pd.concat([X_train, y_train], axis = 1)

print("LLMs \n",
      "X_train shape: ", X_train.shape, round(X_train.shape[0]/len(X), 2), "\n",
      "X_test shape: ", X_test.shape, round(X_test.shape[0]/len(X), 2),  "\n",
      "y_train shape: ", y_train.shape, round(y_train.shape[0]/len(y), 2), "\n",
      "y_test shape: ", y_test.shape, round(y_test.shape[0]/len(y), 2), "\n")

LLMs 
 X_train shape:  (983, 22) 0.8 
 X_test shape:  (246, 22) 0.2 
 y_train shape:  (983,) 0.8 
 y_test shape:  (246,) 0.2 
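
The `stratify = y` argument asks `train_test_split` to keep the class proportions of the target the same in the training and test sets, which matters when one class is rare. The following sketch illustrates this on synthetic data; the 10% positive rate and the names `X_toy` and `y_toy` are assumptions for illustration, not properties of the DIPS data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic imbalanced target: 100 positives, 900 negatives (assumed rate, not the real hpi rate).
y_toy = pd.Series(np.r_[np.ones(100), np.zeros(900)].astype(int), name="hpi")
X_toy = pd.DataFrame({"x": np.arange(len(y_toy))})

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# With stratify, both parts keep the positive rate of the full data.
print("train positive rate:", y_tr.mean())
print("test positive rate: ", y_te.mean())
```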



In [4]:
# get clear names for each column in data
preds = X.columns
preds

Index(['whi_ges', 'soz_ges', 'gke_ges', 'lzh_ges', 'ile', 'woc_gesp',
       'woc_gese', 'asi_ges', 'bsq_ges', 'das_ges', 'scl_gsi', 'bild',
       'bmi_kat', 'ses_kom', 'whi_change', 'soz_change', 'gke_change',
       'lzh_change', 'asi_change', 'bsq_change', 'das_change', 'scl_change'],
      dtype='object')

In [5]:
predictor_names = ["T1 Positive mental health",    #whi_ges
                   "T1 Social support",            #soz_ges (SAVE)
                   "T1 General self-efficacy",     #gke_ges SAVE
                   "T1 Life satisfaction",         #lzh_ges SAVE
                   "T1 Stress",                     #ile SAVE
                   "T1 Problem-focused coping",    #woc_gesp SAVE
                   "T1 Emotion-focused coping",    #woc_gese SAVE
                   "T1 Anxiety sensitivity",       #asi_ges SAVE
                   "T1 Fear of bodily sensations", #bsq_ges SAVE
                   "T1 Dysfunctional attitudes",   #das_ges SAVE
                   "T1 General psychopathology: Global Severity Index (GSI)",    #scl_gsi SAVE
                   "Education",                    #bild
                   "T1 BMI",                       #bmi_kat
                   "Socioeconomic status",         #ses_kom
                   "Change in Positive mental health (T2-T1)",    #whi_change
                   "Change in Social support (T2-T1)",            #soz_change
                   "Change in General self-efficacy (T2-T1)",     #gke_change
                   "Change in Life satisfaction (T2-T1)",         #lzh_change
                   "Change in Anxiety sensitivity (T2-T1)",       #asi_change
                   "Change in Fear of bodily sensations (T2-T1)", #bsq_change
                   "Change in Dysfunctional attitudes (T2-T1)",   #das_change
                   "Change in General psychopathology: Global Severity Index (GSI) (T2-T1)",    #scl_change
                   ]

target_names = ["Total mental disorders incidence point prevalence"]    #hpi

In [6]:
# print col and names
for col, name in zip(preds, predictor_names):
    print(f"{col}: {name}")

whi_ges: T1 Positive mental health
soz_ges: T1 Social support
gke_ges: T1 General self-efficacy
lzh_ges: T1 Life satisfaction
ile: T1 Stress
woc_gesp: T1 Problem-focused coping
woc_gese: T1 Emotion-focused coping
asi_ges: T1 Anxiety sensitivity
bsq_ges: T1 Fear of bodily sensations
das_ges: T1 Dysfunctional attitudes
scl_gsi: T1 General psychopathology: Global Severity Index (GSI)
bild: Education
bmi_kat: T1 BMI
ses_kom: Socioeconomic status
whi_change: Change in Positive mental health (T2-T1)
soz_change: Change in Social support (T2-T1)
gke_change: Change in General self-efficacy (T2-T1)
lzh_change: Change in Life satisfaction (T2-T1)
asi_change: Change in Anxiety sensitivity (T2-T1)
bsq_change: Change in Fear of bodily sensations (T2-T1)
das_change: Change in Dysfunctional attitudes (T2-T1)
scl_change: Change in General psychopathology: Global Severity Index (GSI) (T2-T1)
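
The pairing of `preds` and `predictor_names` above is positional, so it holds only as long as both stay in the same order. A minimal sketch of a more defensive alternative looks up each label by column name and fails loudly when a column has no label. The dictionary `NAME_BY_COLUMN` and the helper `row_to_data_string` are hypothetical and not part of the original notebook; only three columns are shown.

```python
import pandas as pd

# Hypothetical mapping from column name to readable label (subset for brevity).
NAME_BY_COLUMN = {
    "whi_ges": "T1 Positive mental health",
    "soz_ges": "T1 Social support",
    "bild": "Education",
}

def row_to_data_string(data_row: pd.Series, name_by_column: dict) -> str:
    """Format a row as 'label: value' pairs, looking labels up by column name."""
    missing = [col for col in data_row.index if col not in name_by_column]
    if missing:
        raise KeyError(f"No readable name defined for columns: {missing}")
    return ", ".join(
        f"{name_by_column[col]}: {value}" for col, value in data_row.items()
    )

# The column order differs from the dictionary order, yet the labels still match.
example_row = pd.Series({"bild": 2.0, "whi_ges": -0.03, "soz_ges": 0.14})
data_string = row_to_data_string(example_row, NAME_BY_COLUMN)
print(data_string)
```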


### 1.2 Simple prompt

In [27]:
def create_simple_prompt(data_row):
    """
    Create a simple prompt for zero-shot classification.
    """

    # Convert the data row to a string format using the real column names
    data_string = ", ".join([f"{name}: {value}" for name, value in zip(predictor_names, data_row)])

    instruction = "Respond only with YES or NO."

    prompt = f"Does the following person develop a psychological disorder between time point T1 and T2? Data: {data_string}"

    return instruction, prompt
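
The function returns the instruction and the prompt separately, and the instruction asks the model to answer only with YES or NO. How the answers are scored is not shown in this notebook. As a minimal sketch, a raw model reply could be mapped to a binary label as follows; the helper `parse_yes_no` is hypothetical, and the coding YES = 1 / NO = 0 is an assumption rather than something taken from the data.

```python
import re
from typing import Optional

def parse_yes_no(response: str) -> Optional[int]:
    """Map a model reply to 1 (YES), 0 (NO), or None if it starts with neither.

    Hypothetical helper: the 1/0 coding is an assumption for illustration.
    """
    match = re.match(r"\s*(yes|no)\b", response, flags=re.IGNORECASE)
    if match is None:
        return None
    return 1 if match.group(1).lower() == "yes" else 0

print(parse_yes_no("YES"))
print(parse_yes_no(" no."))
print(parse_yes_no("Maybe"))
```

Returning `None` for anything else keeps malformed replies visible instead of silently counting them as one of the classes.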

In [28]:
# # Test the function
# data_row = X.iloc[4]
# simple_instruction, simple_prompt = create_simple_prompt(data_row)
# print(simple_prompt)
# print("\n", target_names[0], y.iloc[4])

In [29]:
# Create a simple prompt for each row in the test set
X_test_simple_prompt = []

for i in range(len(X_test)):
    data_row = X_test.iloc[i]
    simple_instruction, simple_prompt = create_simple_prompt(data_row)
    X_test_simple_prompt.append(simple_prompt)

X_test_simple_prompt[0]

'Does the following person develop a psychological disorder between time point T1 and T2? Data: T1 Positive mental health: -0.0279170753483525, T1 Social support: 0.1421238143169474, T1 General self-efficacy: 0.3649793457412237, T1 Life satisfaction: 0.3372886835461141, T1 Stress: 0.4419361727222826, T1 Problem-focused coping: 1.7319368683783989, T1 Emotion-focused coping: 0.2078300133169115, T1 Anxiety sensitivity: 0.1594156886399411, T1 Fear of bodily sensations: 0.2863750811390516, T1 Dysfunctional attitudes: 0.2750686254386546, T1 General psychopathology: Global Severity Index (GSI): 0.0172227087467131, Education: 2.0, T1 BMI: 1.0, Socioeconomic status: 2.0, Change in Positive mental health (T2-T1): -0.7520166349788642, Change in Social support (T2-T1): 0.7057099569575698, Change in General self-efficacy (T2-T1): -0.1819798191096402, Change in Life satisfaction (T2-T1): 0.1407378746091848, Change in Anxiety sensitivity (T2-T1): -0.8617238998696288, Change in Fear of bodily sensatio

In [30]:
# Create perturbed input for testing
data_row = X_test.iloc[4]

# Add uniform random noise in [0, 1) to every value in data_row
data_row = data_row + np.random.rand(len(data_row))

simple_instruction, simple_prompt = create_simple_prompt(data_row)
print(simple_prompt)

Does the following person develop a psychological disorder between time point T1 and T2? Data: T1 Positive mental health: -0.2905672162128131, T1 Social support: 1.1846793113149312, T1 General self-efficacy: 1.4027858457889284, T1 Life satisfaction: 1.7808992275100926, T1 Stress: -0.19837236880230458, T1 Problem-focused coping: 0.46604060565293626, T1 Emotion-focused coping: -0.3212951873071682, T1 Anxiety sensitivity: -0.45587513172050786, T1 Fear of bodily sensations: 0.11007137309325277, T1 Dysfunctional attitudes: -0.28622667394204704, T1 General psychopathology: Global Severity Index (GSI): 0.16050338946000808, Education: 2.103862786001591, T1 BMI: 2.187934071238933, Socioeconomic status: 2.0448716320569904, Change in Positive mental health (T2-T1): 0.7600767184098469, Change in Social support (T2-T1): -0.03491680764470617, Change in General self-efficacy (T2-T1): -3.7909570621094906, Change in Life satisfaction (T2-T1): -0.5165594508517094, Change in Anxiety sensitivity (T2-T1): 

In [31]:
X_test_simple_prompt_df = pd.DataFrame(X_test_simple_prompt)
X_test_simple_prompt_df.to_csv("../dat/prompts/X_test_simple_prompt.csv", sep = ",", index = True)
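
Because `pd.DataFrame(X_test_simple_prompt)` is built from a plain list, the saved index is 0, 1, 2, … in test-set order, and the original row labels of `X_test` are not written to the file. If the prompts need to be matched back to specific rows later, one option is to carry the original index along. A minimal sketch with toy data follows; the column name `prompt` and the stand-in frame `X_test_toy` are assumptions.

```python
import io
import pandas as pd

# Toy stand-in for X_test with non-consecutive row labels, as after a shuffled split.
X_test_toy = pd.DataFrame({"whi_ges": [0.1, -0.4, 1.2]}, index=[17, 3, 42])
prompts = [f"Data: T1 Positive mental health: {v}" for v in X_test_toy["whi_ges"]]

# Attach the original row labels so each prompt stays linked to its test row.
prompts_df = pd.DataFrame({"prompt": prompts}, index=X_test_toy.index)

# Round-trip through CSV (in memory here) to confirm that the labels survive.
buffer = io.StringIO()
prompts_df.to_csv(buffer, index=True)
buffer.seek(0)
restored = pd.read_csv(buffer, index_col=0)
print(restored.index.tolist())
```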

### 1.3 Class definitions prompt

In [32]:
def create_class_definitions_prompt(data_row):
    """
    Create a class definitions prompt for zero-shot classification.
    """

    # Convert the data row to a string format using the real column names
    data_string = ", ".join([f"{name}: {value}" for name, value in zip(predictor_names, data_row)])

    instruction = "Respond only with YES or NO."

    prompt = f"Given the following data, classify whether this person develops a psychological disorder between T1 and T2 according to the instructions provided and data measured by F-DIPS structural interviews. Respond with YES or NO. Instructions: NO: The person did not develop any new psychological disorder between T1 and T2. This means they were either healthy at both time points, had an ongoing disorder across both time points, or had already recovered from a previous disorder. YES: The person was psychologically healthy at T1 but developed a psychological disorder at T2. Does the following person develop a psychological disorder between time point T1 and T2? Data: {data_string}"

    return instruction, prompt

In [33]:
# # Test the function
# data_row = X.iloc[4]
# class_definitions_instruction, class_definitions_prompt = create_class_definitions_prompt(data_row)
# print(class_definitions_prompt)
# print("\n", target_names[0], y.iloc[4])

In [34]:
# Create a class definitions prompt for each row in the test set
X_test_class_definitions_prompt = []
for i in range(len(X_test)):
    data_row = X_test.iloc[i]
    class_definitions_instruction, class_definitions_prompt = create_class_definitions_prompt(data_row)
    X_test_class_definitions_prompt.append(class_definitions_prompt)

X_test_class_definitions_prompt[0]

'Given the following data, classify whether this person develops a psychological disorder between T1 and T2 according to the instructions provided and data measured by F-DIPS structural interviews. Respond with YES or NO. Instructions: NO: The person did not develop any new psychological disorder between T1 and T2. This means they were either healthy at both time points, had an ongoing disorder across both time points, or had already recovered from a previous disorder. YES: The person was psychologically healthy at T1 but developed a psychological disorder at T2. Does the following person develop a psychological disorder between time point T1 and T2? Data: T1 Positive mental health: -0.0279170753483525, T1 Social support: 0.1421238143169474, T1 General self-efficacy: 0.3649793457412237, T1 Life satisfaction: 0.3372886835461141, T1 Stress: 0.4419361727222826, T1 Problem-focused coping: 1.7319368683783989, T1 Emotion-focused coping: 0.2078300133169115, T1 Anxiety sensitivity: 0.159415688

In [35]:
# # create random input for testing
# data_row = np.random.rand(len(predictor_names))
# class_definitions_instruction, class_definitions_prompt = create_class_definitions_prompt(data_row)
# print(class_definitions_prompt)

In [36]:
X_test_class_definitions_prompt_df = pd.DataFrame(X_test_class_definitions_prompt)
X_test_class_definitions_prompt_df.to_csv("../dat/prompts/X_test_class_definitions_prompt.csv", sep = ",", index = True)

### 1.4 Profiled simple prompt

In [37]:
def create_profiled_simple_prompt(data_row):
    """
    Create a profiled simple prompt for zero-shot classification.
    """

    # Convert the data row to a string format using the real column names
    data_string = ", ".join([f"{name}: {value}" for name, value in zip(predictor_names, data_row)])

    instruction = "Respond only with YES or NO."

    prompt = f"You are an expert in clinical psychology and mental health diagnostics. You are trained to analyze longitudinal data to assess whether a person develops a psychological disorder over time. You know how to analyze data measured with F-DIPS structural interviews at two time points T1 and T2. Does the following person develop a psychological disorder between time point T1 and T2? Data: {data_string}"

    return instruction, prompt

In [38]:
# # Test the function
# data_row = X.iloc[4]
# profiled_simple_instruction, profiled_simple_prompt = create_profiled_simple_prompt(data_row)
# print(profiled_simple_prompt)
# print("\n", target_names[0], y.iloc[4])

In [39]:
# Create a profiled simple prompt for each row in the test set
X_test_profiled_simple_prompt = []
for i in range(len(X_test)):
    data_row = X_test.iloc[i]
    profiled_simple_instruction, profiled_simple_prompt = create_profiled_simple_prompt(data_row)
    X_test_profiled_simple_prompt.append(profiled_simple_prompt)

X_test_profiled_simple_prompt[0]

'You are an expert in clinical psychology and mental health diagnostics. You are trained to analyze longitudinal data to assess whether a person develops a psychological disorder over time. You know how to analyze data measured with F-DIPS structural interviews at two time points T1 and T2. Does the following person develop a psychological disorder between time point T1 and T2? Data: T1 Positive mental health: -0.0279170753483525, T1 Social support: 0.1421238143169474, T1 General self-efficacy: 0.3649793457412237, T1 Life satisfaction: 0.3372886835461141, T1 Stress: 0.4419361727222826, T1 Problem-focused coping: 1.7319368683783989, T1 Emotion-focused coping: 0.2078300133169115, T1 Anxiety sensitivity: 0.1594156886399411, T1 Fear of bodily sensations: 0.2863750811390516, T1 Dysfunctional attitudes: 0.2750686254386546, T1 General psychopathology: Global Severity Index (GSI): 0.0172227087467131, Education: 2.0, T1 BMI: 1.0, Socioeconomic status: 2.0, Change in Positive mental health (T2-T

In [40]:
# # create random input for testing
# data_row = np.random.rand(len(predictor_names))
# profiled_simple_instruction, profiled_simple_prompt = create_profiled_simple_prompt(data_row)
# print(profiled_simple_prompt)

In [41]:
X_test_profiled_simple_prompt_df = pd.DataFrame(X_test_profiled_simple_prompt)
X_test_profiled_simple_prompt_df.to_csv("../dat/prompts/X_test_profiled_simple_prompt.csv", sep = ",", index = True)

### 1.5 Few-shot prompt

In [42]:
def create_few_shot_prompt(data_row, few_shot_examples):
    """
    Create a few-shot prompt that shows labeled example cases before the case to classify.
    """

    example_data = []

    # Build one string per example, including the target as well, labeled Example 1: ..., Example 2: ...
    for i, example in enumerate(few_shot_examples):
        example_data.append(
            f"Example {i + 1}: " + ", ".join([f"{name}: {value}" for name, value in zip(predictor_names + target_names, example)])
        )

    # Convert the data row to a string format using the real column names
    data_string = ", ".join([f"{name}: {value}" for name, value in zip(predictor_names, data_row)])

    instruction = "Respond only with YES or NO."

    prompt1 = f"Please refer to the following examples of correctly classified data points with 'Total mental disorders incidence point prevalence' being the true classification: {example_data}"

    prompt2 = f"Based on the previous example data prompts, classify the following data. Does the following person develop a psychological disorder between time point T1 and T2? Data: {data_string}"

    # concatenate
    prompt = f"{prompt1} {prompt2}"

    return instruction, prompt
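
Note that `example_data` is a Python list, so interpolating it directly into the f-string embeds the list representation (square brackets and quotes) in the prompt. Whether this affects the model's answers is an open question. A minimal sketch of an alternative joins the examples into plain text first; the helper `format_examples` and the toy values are hypothetical.

```python
def format_examples(examples, names):
    """Render labeled examples as plain text, one 'Example k: ...' entry per line."""
    lines = []
    for i, example in enumerate(examples, start=1):
        pairs = ", ".join(f"{name}: {value}" for name, value in zip(names, example))
        lines.append(f"Example {i}: {pairs}")
    return "\n".join(lines)

# Toy data: two predictors plus the target as the last value (assumed layout).
names = ["T1 Stress", "Education", "Total mental disorders incidence point prevalence"]
examples = [[0.44, 2.0, 0.0], [-0.20, 1.0, 1.0]]

example_text = format_examples(examples, names)
print(example_text)
```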

In [43]:
def get_similar_examples(data_row, dat_train, n):
    """
    Get the n training examples closest to the given data row, measured by Euclidean distance over the predictor columns.
    """

    # get columns that we want to use for similarity
    preds = dat_train.drop("hpi", axis = 1).columns
    dat_preds = dat_train[preds]
    data_row = data_row[preds]
    data_row = np.array([data_row])

    # dictionary to store the distances
    distances = {}

    for row in dat_preds.iterrows():
        # create array with the row and the data_row
        row_input = np.array([row[1]])
        dist_row = np.concatenate((data_row, row_input), axis = 0)

        # calculate the distance between the two rows
        dist = pdist(dist_row, metric = "euclidean")

        # store the distance in the dictionary
        distances[row[0]] = dist[0]

    # sort the dictionary by distance
    sorted_distances = sorted(distances.items(), key = lambda x: x[1])

    # get top keys
    keys = []
    for key in sorted_distances[:n]:
        keys.append(key[0])

    # drop all rows that are not in keys
    ex_dat = dat_train.loc[keys]

    return ex_dat
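
The row-by-row loop above computes one distance per call to `pdist`. The same nearest-neighbour selection can be sketched in vectorized form with NumPy broadcasting, which tends to be faster on larger training sets. The function name `get_similar_examples_vectorized` and the toy data are assumptions; the sketch uses plain NumPy instead of `pdist` and assumes the training index is unique.

```python
import numpy as np
import pandas as pd

def get_similar_examples_vectorized(data_row, dat_train, n, target="hpi"):
    """Return the n training rows with the smallest Euclidean distance to data_row."""
    preds = dat_train.columns.drop(target)
    # Differences between every training row and the query row (broadcast over rows).
    diffs = dat_train[preds].to_numpy(dtype=float) - data_row[preds].to_numpy(dtype=float)
    distances = np.sqrt((diffs ** 2).sum(axis=1))
    # A stable sort keeps the original order among ties.
    nearest = np.argsort(distances, kind="stable")[:n]
    return dat_train.iloc[nearest]

toy_train = pd.DataFrame(
    {"a": [0.0, 1.0, 5.0], "b": [0.0, 1.0, 5.0], "hpi": [0, 1, 0]},
    index=[10, 11, 12],
)
query = pd.Series({"a": 0.9, "b": 1.2})

closest = get_similar_examples_vectorized(query, toy_train, n=2)
print(closest.index.tolist())
```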

In [44]:
# # Test the function
# close_ex = get_similar_examples(X.iloc[4], dat_train, 5)
#
# data_row = X.iloc[4]
# few_shot_examples = close_ex.values
# few_shot_instruction, few_shot_prompt = create_few_shot_prompt(data_row, few_shot_examples)
# print(few_shot_prompt)

In [45]:
# Create a few-shot prompt for each row in the test set
X_test_few_shot_prompt = []
for i in range(len(X_test)):
    data_row = X_test.iloc[i]
    few_shot_examples = get_similar_examples(data_row, dat_train, 5).values
    few_shot_instruction, few_shot_prompt = create_few_shot_prompt(data_row, few_shot_examples)
    X_test_few_shot_prompt.append(few_shot_prompt)

X_test_few_shot_prompt[0]

"Please refer to the following examples of correctly classified data points with 'Total mental disorders incidence point prevalence' being the true classification: ['Example 1: T1 Positive mental health: -0.0279170753483525, T1 Social support: -0.167356999046327, T1 General self-efficacy: -0.5416595949681524, T1 Life satisfaction: -0.471818725223128, T1 Stress: 0.241958427407229, T1 Problem-focused coping: 0.8532782878883876, T1 Emotion-focused coping: 0.2078300133169115, T1 Anxiety sensitivity: 0.6878631096473256, T1 Fear of bodily sensations: 0.3819042614516962, T1 Dysfunctional attitudes: -0.2667599319073422, T1 General psychopathology: Global Severity Index (GSI): -0.2733217032704306, Education: 2.0, T1 BMI: 1.0, Socioeconomic status: 1.0, Change in Positive mental health (T2-T1): -0.7520166349788642, Change in Social support (T2-T1): 0.8692924015129987, Change in General self-efficacy (T2-T1): -0.1819798191096402, Change in Life satisfaction (T2-T1): 0.6686513409272499, Change in 

In [46]:
# # create random input for testing
# data_row = X_test.iloc[4]
# # add a random number to the data_row
# data_row = data_row + np.random.rand(len(data_row))
#
# few_shot_examples = get_similar_examples(data_row, dat_train, 5).values
# few_shot_instruction, few_shot_prompt = create_few_shot_prompt(data_row, few_shot_examples)
# print(few_shot_prompt)

In [47]:
X_test_few_shot_prompt_df = pd.DataFrame(X_test_few_shot_prompt)
X_test_few_shot_prompt_df.to_csv("../dat/prompts/X_test_few_shot_prompt.csv", sep = ",", index = True)

### 1.6 Vignette text prompt

In [48]:
def interpret_value(val, thresh_low = -1, thresh_high = 1):
    """Convert standardized value to qualitative description with value."""
    val_str = f"{val:.3f}"
    if val > thresh_high:
        return f"above average ({val_str})"
    elif val < thresh_low:
        return f"below average ({val_str})"
    else:
        return f"average ({val_str})"

In [49]:
def interpret_change(val, thresh_low = -1, thresh_high = 1):
    """Convert standardized change score to qualitative description with value."""
    val_str = f"{val:.3f}"
    if val > thresh_high:
        return f"increased ({val_str})"
    elif val < thresh_low:
        return f"decreased ({val_str})"
    else:
        return f"similar ({val_str})"
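
Both helpers follow the same pattern: compare a standardized score with two thresholds and attach the rounded value. The sketch below folds that pattern into one hypothetical function, `describe`, to show how the thresholds drive the wording. With the default thresholds of ±1, a change of about -0.75 is still labeled "similar"; narrower thresholds (the ±0.5 used here is an arbitrary assumption) would label it "decreased".

```python
def describe(val, labels, thresh_low=-1.0, thresh_high=1.0):
    """Label a standardized score as low / middle / high and append the rounded value."""
    low, middle, high = labels
    if val > thresh_high:
        label = high
    elif val < thresh_low:
        label = low
    else:
        label = middle
    return f"{label} ({val:.3f})"

LEVEL_LABELS = ("below average", "average", "above average")
CHANGE_LABELS = ("decreased", "similar", "increased")

print(describe(1.7319368683783989, LEVEL_LABELS))    # above average (1.732)
print(describe(-0.7520166349788642, CHANGE_LABELS))  # similar (-0.752)
print(describe(-0.7520166349788642, CHANGE_LABELS, thresh_low=-0.5, thresh_high=0.5))  # decreased (-0.752)
```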

In [50]:
def create_vignette_prompt(data):
    """
    Create a narrative vignette prompt for zero-shot classification.
    """

    # demographics
    bmi = data['bmi_kat']
    education = data['bild']
    ses = data['ses_kom']

    # time point 1
    t1_pm_health = interpret_value(data['whi_ges'])
    t1_support = interpret_value(data['soz_ges'])
    t1_self_eff = interpret_value(data['gke_ges'])
    t1_life_sat = interpret_value(data['lzh_ges'])
    t1_coping_prob = interpret_value(data['woc_gesp'])
    t1_coping_emot = interpret_value(data['woc_gese'])
    t1_anx_sens = interpret_value(data['asi_ges'])
    t1_fear = interpret_value(data['bsq_ges'])
    t1_dysfunc = interpret_value(data['das_ges'])
    t1_gsi = interpret_value(data['scl_gsi'])
    t1_stress = interpret_value(data['ile'])

    # # time point 2
    # t2_pm_health = interpret_value(data['whi_gesy'])
    # t2_support = interpret_value(data['soz_gesy'])
    # t2_self_eff = interpret_value(data['gke_gesy'])
    # t2_life_sat = interpret_value(data['lzh_gesy'])
    # t2_anx_sens = interpret_value(data['asi_gesy'])
    # t2_fear = interpret_value(data['bsq_gesy'])
    # t2_dysfunc = interpret_value(data['das_gesy'])
    # t2_gsi = interpret_value(data['scl_gsiy'])

    # change variables
    change_pm_health = interpret_change(data['whi_change'])
    change_support = interpret_change(data['soz_change'])
    change_self_eff = interpret_change(data['gke_change'])
    change_life_sat = interpret_change(data['lzh_change'])
    change_anx_sens = interpret_change(data['asi_change'])
    change_fear = interpret_change(data['bsq_change'])
    change_dysfunc = interpret_change(data['das_change'])
    change_gsi = interpret_change(data['scl_change'])

    instruction = "Respond only with YES or NO."

    # generate prompt
    # vignette = f"""A woman with a BMI of {bmi}, an education level of {education}, and a socioeconomic status of {ses} has the following psychological profile: At Time Point 1, she showed {t1_pm_health} positive mental health, {t1_support} social support, and {t1_self_eff} general self-efficacy. Her life satisfaction was {t1_life_sat}, and she relied on {t1_coping_prob} problem-focused and {t1_coping_emot} emotion-focused coping strategies. Anxiety sensitivity was {t1_anx_sens}, and her fear of bodily sensations was {t1_fear}. She reported {t1_dysfunc} levels of dysfunctional attitudes and {t1_gsi} levels of general psychopathology. Her stress level was {t1_stress}. By Time Point 2, approximately 17 months later, she reported {t2_pm_health} positive mental health, {t2_support} social support, and {t2_self_eff} self-efficacy. Life satisfaction remained {t2_life_sat}. However, anxiety sensitivity had become {t2_anx_sens}, and fear of bodily sensations was {t2_fear}. Dysfunctional attitudes were {t2_dysfunc}, and general psychopathology was {t2_gsi}. Does this person develop a psychological disorder between time point T1 and T2?"""

    vignette = f"""A woman with a BMI of {bmi}, an education level of {education}, and a socioeconomic status of {ses} has the following psychological profile: At time point 1, she showed {t1_pm_health} positive mental health, {t1_support} social support, and {t1_self_eff} general self-efficacy. Her life satisfaction was {t1_life_sat}, and she relied on {t1_coping_prob} problem-focused and {t1_coping_emot} emotion-focused coping strategies. Anxiety sensitivity was {t1_anx_sens}, and her fear of bodily sensations was {t1_fear}. She reported {t1_dysfunc} levels of dysfunctional attitudes and {t1_gsi} levels of general psychopathology. Her stress level was {t1_stress}. By time point 2, approximately 17 months later, she reported {change_pm_health} positive mental health, {change_support} social support, and {change_self_eff} self-efficacy. Life satisfaction was {change_life_sat}. Anxiety sensitivity was reported to be {change_anx_sens}, and fear of bodily sensations was {change_fear}. Dysfunctional attitudes were {change_dysfunc}, and general psychopathology was {change_gsi} compared to time point 1. Does this person develop a psychological disorder between time point T1 and T2?"""

    return instruction, vignette

In [51]:
# # Test the function
# data_row = X.iloc[4]
# vignette_instruction, vignette_prompt = create_vignette_prompt(data_row)
# print(vignette_prompt)

In [52]:
# Create a vignette prompt for each row in the test set
X_test_vignette_prompt = []
for i in range(len(X_test)):
    data_row = X_test.iloc[i]
    vignette_instruction, vignette_prompt = create_vignette_prompt(data_row)
    X_test_vignette_prompt.append(vignette_prompt)

X_test_vignette_prompt[0]

'A woman with a BMI of 1.0, an education level of 2.0, and a socioeconomic status of 2.0 has the following psychological profile: At time point 1, she showed average (-0.028) positive mental health, average (0.142) social support, and average (0.365) general self-efficacy. Her life satisfaction was average (0.337), and she relied on above average (1.732) problem-focused and average (0.208) emotion-focused coping strategies. Anxiety sensitivity was average (0.159), and her fear of bodily sensations was average (0.286). She reported average (0.275) levels of dysfunctional attitudes and average (0.017) levels of general psychopathology. Her stress level was average (0.442). By time point 2, approximately 17 months later, she reported similar (-0.752) positive mental health, similar (0.706) social support, and similar (-0.182) self-efficacy. Life satisfaction was similar (0.141). Anxiety sensitivity was reported to be similar (-0.862), and fear of bodily sensations was similar (-0.847). Dy

In [53]:
# # create random input for testing
# data_row = X_test.iloc[4]
# # add a random number to the data_row
# data_row = data_row + np.random.rand(len(data_row))
#
# vignette_instruction, vignette_prompt = create_vignette_prompt(data_row)
# print(vignette_prompt)

In [54]:
X_test_vignette_prompt_df = pd.DataFrame(X_test_vignette_prompt)
X_test_vignette_prompt_df.to_csv("../dat/prompts/X_test_vignette_prompt.csv", sep = ",", index = True)
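Because the prompts are saved with `index = True`, the CSV holds the row index in its first column and the prompt text in a column that is read back under the name `"0"`. The following is a minimal round-trip sketch of how the file could be reloaded. It assumes a temporary file rather than the `../dat/prompts/` path, and `example_prompts` is a made-up stand-in for the real prompt list.

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-ins for the generated prompts (not real data).
example_prompts = [
    "A woman with a BMI of 1.0, an education level of 2.0, and ...",
    'A prompt with a comma, and a "quoted" word inside.',
]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "X_test_vignette_prompt.csv")

    # Same call pattern as in the cell above: one unnamed column, index written out.
    pd.DataFrame(example_prompts).to_csv(path, sep=",", index=True)

    # index_col=0 restores the row index; the prompt column comes back named "0".
    reloaded = pd.read_csv(path, index_col=0)

reloaded_prompts = reloaded["0"].tolist()
print(reloaded_prompts == example_prompts)
```

Reading with `index_col=0` keeps the index from turning into an extra data column, so each prompt stays aligned with its position in `X_test`.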

### 1.7 Claude prompt

In [10]:
def create_claude_prompt(data):
    """
    Create a structured, step-by-step (Claude-style) prompt for zero-shot classification.
    Returns the instruction text and the full prompt for one data row.
    """

    # Extract the values from the data row using the dataset's column names

    # time point T1 measures
    t1_positive_mental_health = data['whi_ges']
    t1_social_support = data['soz_ges']
    t1_general_self_efficacy = data['gke_ges']
    t1_life_satisfaction = data['lzh_ges']
    t1_stress = data['ile']
    t1_problem_focused_coping = data['woc_gesp']
    t1_emotion_focused_coping = data['woc_gese']
    t1_anxiety_sensitivity = data['asi_ges']
    t1_fear_of_bodily_sensations = data['bsq_ges']
    t1_dysfunctional_attitudes = data['das_ges']
    t1_general_psychopathology_gsi = data['scl_gsi']

    # demographics
    bmi = data['bmi_kat']
    education = data['bild']
    ses = data['ses_kom']

    # Changes between T1 and T2
    change_positive_mental_health = data['whi_change']
    change_social_support = data['soz_change']
    change_general_self_efficacy = data['gke_change']
    change_life_satisfaction = data['lzh_change']
    change_anxiety_sensitivity = data['asi_change']
    change_fear_of_bodily_sensations = data['bsq_change']
    change_dysfunctional_attitudes = data['das_change']
    change_general_psychopathology_gsi = data['scl_change']

    instruction = "You are an expert psychologist tasked with predicting whether an individual will develop a psychological disorder between two time points (T1 and T2) based on various psychological measures and demographic information. Your goal is to provide an accurate YES or NO prediction, supported by a brief explanation of your reasoning. Example output format: \n Prediction: [YES/NO] \n Explanation: [Brief explanation supporting your prediction]."

    prompt = f"You are an expert psychologist tasked with predicting whether an individual will develop a psychological disorder between two time points (T1 and T2) based on various psychological measures and demographic information. Your goal is to provide an accurate YES or NO prediction, supported by a brief explanation of your reasoning. \n Here are the variables you need to consider: \n 1. Time Point T1 Measures: \n T1 Positive mental health: {t1_positive_mental_health} \n T1 Social support: {t1_social_support} \n T1 General self efficacy: {t1_general_self_efficacy} \n T1 Life satisfaction: {t1_life_satisfaction} \n T1 Stress: {t1_stress} \n T1 problem focused coping: {t1_problem_focused_coping} \n T1 emotion focused coping: {t1_emotion_focused_coping} \n T1 anxiety sensitivity: {t1_anxiety_sensitivity} \n T1 fear of bodily sensations: {t1_fear_of_bodily_sensations} \n T1 dysfunctional attitudes: {t1_dysfunctional_attitudes} \n T1 General psychopathology (Global Severity Index (GSI)): {t1_general_psychopathology_gsi} \n \n 2. Demographic Information: \n Education: {education} \n T1 BMI: {bmi} \n Socioeconomic status: {ses} \n \n 3. Changes between T1 and T2: \n Change in positive mental health: {change_positive_mental_health} \n Change in social support: {change_social_support} \n Change in general self efficacy: {change_general_self_efficacy} \n Change in life satisfaction: {change_life_satisfaction} \n Change in anxiety sensitivity: {change_anxiety_sensitivity} \n Change in fear of bodily sensations: {change_fear_of_bodily_sensations} \n Change in dysfunctional attitudes: {change_dysfunctional_attitudes} \n Change in general psychopathology (Global Severity Index (GSI)): {change_general_psychopathology_gsi} \n \n Your task is to analyze these variables and predict whether the individual will develop a psychological disorder between T1 and T2. Follow these steps: \n 1. Carefully examine each variable and its potential impact on psychological health. \n 2. Consider the interplay between different factors, especially how changes between T1 and T2 might influence the outcome. \n 3. Evaluate both risk factors and protective factors present in the data. \n 4. Make a prediction based on your analysis. \n \n Conduct your psychological assessment inside <psychological_assessment> tags in your thinking block. Be thorough in your consideration of the variables and their relationships. In your assessment: \n 1. Categorize variables into risk factors and protective factors. \n 2. Consider the severity and direction of changes between T1 and T2. \n 3. Evaluate the overall trend (improving or worsening) based on the changes. \n 4. Consider potential interactions between variables. \n \n  After your assessment, provide your prediction as either YES or NO, followed by a brief explanation (no more than two sentences) supporting your decision. \n Example output format: \n Prediction: [YES/NO] \n Explanation: [Brief explanation supporting your prediction] \n \n Remember, your goal is to make the most accurate prediction possible based on the given information. Your final output should consist only of the prediction and brief explanation, without duplicating or rehashing any of the work you did in the psychological assessment."

    return instruction, prompt

In [11]:
# Create a claude prompt for each row in the test set
X_test_claude_prompt = []

for i in range(len(X_test)):
    data_row = X_test.iloc[i]
    claude_instruction, claude_prompt = create_claude_prompt(data_row)
    X_test_claude_prompt.append(claude_prompt)

X_test_claude_prompt[0]

'You are an expert psychologist tasked with predicting whether an individual will develop a psychological disorder between two time points (T1 and T2) based on various psychological measures and demographic information. Your goal is to provide an accurate YES or NO prediction, supported by a brief explanation of your reasoning. \n Here are the variables you need to consider: \n 1. Time Point T1 Measures: \n T1 Positive mental health: -0.0279170753483525 \n T1 Social support: 0.1421238143169474 \n T1 General self efficacy: 0.3649793457412237 \n T1 Life satisfaction: 0.3372886835461141 \n T1 Stress: 0.4419361727222826 \n T1 problem focused coping: 1.7319368683783989 \n T1 emotion focused coping: 0.2078300133169115 \n T1 anxiety sensitivity: 0.1594156886399411 \n T1 fear of bodily sensations: 0.2863750811390516 \n T1 dysfunctional attitudes: 0.2750686254386546 \n T1 General psychopathology (Global Severity Index (GSI)): 0.0172227087467131 \n \n 2. Demographic Information: \n Education: 2.

In [12]:
X_test_claude_prompt_df = pd.DataFrame(X_test_claude_prompt)
X_test_claude_prompt_df.to_csv("../dat/prompts/X_test_claude_prompt.csv", sep = ",", index = True)
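The prompt asks the model to answer in the form `Prediction: [YES/NO]` followed by `Explanation: ...`. The sketch below shows one way such a response could be turned into a label for evaluation. The helper `parse_prediction`, the 1/0 coding for YES/NO, and the example response are all assumptions for illustration and are not part of the original notebook.

```python
import re

# Hypothetical helper, not part of the original notebook.
# Accepts "Prediction: YES", "Prediction: [NO]", and mixed case.
_PREDICTION_RE = re.compile(r"Prediction:\s*\[?\s*(YES|NO)\b", re.IGNORECASE)
_EXPLANATION_RE = re.compile(r"Explanation:\s*(.*)", re.IGNORECASE | re.DOTALL)


def parse_prediction(response_text):
    """Return (label, explanation).

    label is 1 for YES, 0 for NO (an assumed coding), or None when
    no prediction line is found.
    """
    label = None
    pred_match = _PREDICTION_RE.search(response_text)
    if pred_match:
        label = 1 if pred_match.group(1).upper() == "YES" else 0

    expl_match = _EXPLANATION_RE.search(response_text)
    explanation = expl_match.group(1).strip() if expl_match else ""
    return label, explanation


# Made-up response text in the requested output format.
example_response = (
    "Prediction: NO \n Explanation: Scores are near average and changes are small."
)
print(parse_prediction(example_response))
```

Returning `None` for responses without a recognizable prediction line keeps malformed outputs visible, so they can be counted or re-queried rather than silently coded as NO.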