# LLM: Zero-shot classification through LLMs and prompts

Zero-shot classification assigns data to categories using a model that was not explicitly trained on labeled examples for that specific task. Instead, the model relies on its pre-trained knowledge and a natural-language prompt to infer the correct category.

| Prompt Name              | Description                                                                                                                                                                                                                                                                                                                                                                |
|--------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| **simple_prompt**        | "Respond with YES or NO. Does the following person develop a psychological disorder between time point T1 and T2? Data: …"                                                                                                                                                                                                                                                 |
| **class_definitions_prompt** | "Given the following data, classify whether this person develops a psychological disorder between T1 and T2. Respond with YES or NO. Instructions: - NO: The person shows no evidence of developing a psychological disorder between T1 and T2. - YES: The person shows indicators consistent with the development of a psychological disorder between T1 and T2. Data: …" |
| **profiled_simple_prompt** | "You are an expert in clinical psychology and mental health diagnostics. You are trained to analyze longitudinal data to assess whether a person develops a psychological disorder over time. Tell me if the following data shows the development of a psychological disorder between time point T1 and T2. Just answer YES or NO. Data: …"                                |
| **few_shot_prompt**      | "Select the five training examples most semantically similar to the test instance using a Sentence-Transformer model; these examples are then included in the prompt as in-context learning examples for the LLM. Prompt 1: Please refer to … Prompt 2: Based on the previous data prompt, classify the following data into the following categories: …" |
| **vignette_text_prompt** | "The data is in CSV format; the idea is to prompt ChatGPT to convert each table row into a vivid, emotional text (a vignette). Vignette: Each fictional patient was assigned an age, sex, physical examination findings, and clinical symptoms. 'Can you grade this case according to … Classification System?'" |
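
The first three prompts in the table share one shape: a fixed instruction followed by the serialized data. A minimal sketch of keeping them as templates, assuming a hypothetical `PROMPT_TEMPLATES` dict and `build_prompt` helper (neither is part of the notebook); the wording is copied from the table, with `{data}` standing in for the trailing "…":

```python
# Hypothetical helper, not part of the notebook: prompt wording copied from
# the table, with "{data}" standing in for the serialized predictor values.
PROMPT_TEMPLATES = {
    "simple_prompt": (
        "Respond with YES or NO. Does the following person develop a "
        "psychological disorder between time point T1 and T2? Data: {data}"
    ),
    "class_definitions_prompt": (
        "Given the following data, classify whether this person develops a "
        "psychological disorder between T1 and T2. Respond with YES or NO. "
        "Instructions: "
        "- NO: The person shows no evidence of developing a psychological "
        "disorder between T1 and T2. "
        "- YES: The person shows indicators consistent with the development "
        "of a psychological disorder between T1 and T2. "
        "Data: {data}"
    ),
    "profiled_simple_prompt": (
        "You are an expert in clinical psychology and mental health "
        "diagnostics. You are trained to analyze longitudinal data to assess "
        "whether a person develops a psychological disorder over time. "
        "Tell me if the following data shows the development of a "
        "psychological disorder between time point T1 and T2. "
        "Just answer YES or NO. Data: {data}"
    ),
}


def build_prompt(template_name, data_string):
    """Fill the chosen template with an already serialized data string."""
    return PROMPT_TEMPLATES[template_name].format(data=data_string)


example = build_prompt("simple_prompt", "T1 Stress: 0.5")
print(example)
```

Keeping the wording in one place makes it easier to compare prompt variants on the same test rows, since only the template name changes between runs.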

## 0 Imports

In [6]:
import os
import pandas as pd
import numpy as np
from openai import OpenAI
from sklearn.model_selection import train_test_split

In [3]:
data = pd.read_csv("../dat/dips/DIPS_Data_cleaned.csv", sep = ",", low_memory = False)
data_pred = pd.read_csv("../dat/dips/DIPS_Data_cleaned_pred.csv", sep = ",", low_memory = False)
data_pred_y = pd.read_csv("../dat/dips/DIPS_Data_cleaned_pred_y.csv", sep = ",", low_memory = False)

## 1 Zero-shot classification

### 1.1 Data preparation

In [7]:
# Predictors
X = data
X = X.drop(["hpi"], axis = 1)

# Target
y = data["hpi"]

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)

print("LLMs \n",
      "X_train shape: ", X_train.shape, round(X_train.shape[0]/len(X), 2), "\n",
      "X_test shape: ", X_test.shape, round(X_test.shape[0]/len(X), 2),  "\n",
      "y_train shape: ", y_train.shape, round(y_train.shape[0]/len(y), 2), "\n",
      "y_test shape: ", y_test.shape, round(y_test.shape[0]/len(y), 2), "\n")

LLMs 
 X_train shape:  (1005, 22) 0.8 
 X_test shape:  (252, 22) 0.2 
 y_train shape:  (1005,) 0.8 
 y_test shape:  (252,) 0.2 



In [8]:
# get clear names for each column in data
preds = X.columns
preds

Index(['whi_ges', 'soz_ges', 'gke_ges', 'lzh_ges', 'woc_gesp', 'woc_gese',
       'asi_ges', 'bsq_ges', 'das_ges', 'scl_gsi', 'bild', 'bmi_kat',
       'ses_kom', 'whi_gesy', 'soz_gesy', 'gke_gesy', 'lzh_gesy', 'asi_gesy',
       'bsq_gesy', 'das_gesy', 'scl_gsiy', 'ile'],
      dtype='object')

In [11]:
predictor_names = ["T1 Positive mental health",    #whi_ges
                   "T1 Social support",            #soz_ges (SAVE)
                   "T1 General self-efficacy",     #gke_ges SAVE
                   "T1 Life satisfaction",         #lzh_ges SAVE
                   "T1 Problem-focused coping",    #woc_gesp SAVE
                   "T1 Emotion-focused coping",    #woc_gese SAVE
                   "T1 Anxiety sensitivity",       #asi_ges SAVE
                   "T1 Fear of bodily sensations", #bsq_ges SAVE
                   "T1 Dysfunctional attitudes",   #das_ges SAVE
                   "T1 General psychopathology: Global Severity Index (GSI)",    #scl_gsi SAVE
                   "Education",                    #bild
                   "T1 BMI",                       #bmi_kat
                   "Socioeconomic status",         #ses_kom
                   "T2 Positive mental health",    #whi_gesy
                   "T2 Social support",            #soz_gesy (SAVE)
                   "T2 General self-efficacy",     #gke_gesy SAVE
                   "T2 Life satisfaction",         #lzh_gesy SAVE
                   "T2 Anxiety sensitivity",       #asi_gesy SAVE
                   "T2 Fear of bodily sensations", #bsq_gesy SAVE
                   "T2 Dysfunctional attitudes",   #das_gesy SAVE
                   "T2 General psychopathology: Global Severity Index (GSI)",    #scl_gsiy SAVE
                   "T1 Stress"                     #ile SAVE
                   ]

target_names = ["Total mental disorders incidence point prevalence"]    #hpi

In [12]:
# print col and names
for col, name in zip(preds, predictor_names):
    print(f"{col}: {name}")

whi_ges: T1 Positive mental health
soz_ges: T1 Social support
gke_ges: T1 General self-efficacy
lzh_ges: T1 Life satisfaction
woc_gesp: T1 Problem-focused coping
woc_gese: T1 Emotion-focused coping
asi_ges: T1 Anxiety sensitivity
bsq_ges: T1 Fear of bodily sensations
das_ges: T1 Dysfunctional attitudes
scl_gsi: T1 General psychopathology: Global Severity Index (GSI)
bild: Education
bmi_kat: T1 BMI
ses_kom: Socioeconomic status
whi_gesy: T2 Positive mental health
soz_gesy: T2 Social support
gke_gesy: T2 General self-efficacy
lzh_gesy: T2 Life satisfaction
asi_gesy: T2 Anxiety sensitivity
bsq_gesy: T2 Fear of bodily sensations
das_gesy: T2 Dysfunctional attitudes
scl_gsiy: T2 General psychopathology: Global Severity Index (GSI)
ile: T1 Stress
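
Because `zip` stops at the shorter input without raising, a missing label would go unnoticed in a printout like the one above. A minimal sketch of a guard, assuming a hypothetical `make_name_map` helper (not part of the notebook) and a three-column excerpt rather than the full lists:

```python
def make_name_map(columns, names):
    """Map column codes to readable names, failing loudly on a length mismatch."""
    columns, names = list(columns), list(names)
    if len(columns) != len(names):
        raise ValueError(f"{len(columns)} columns but {len(names)} names")
    return dict(zip(columns, names))


# Excerpt of the notebook's columns and labels, for illustration only
name_map = make_name_map(
    ["whi_ges", "soz_ges", "ile"],
    ["T1 Positive mental health", "T1 Social support", "T1 Stress"],
)
print(name_map["ile"])
```

The check only catches a wrong count; a label placed in the wrong position still has to be verified by eye against the column order.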


### 1.2 Testing prompting

In [2]:
client = OpenAI(
    api_key = os.environ.get("OPENAI_API_KEY"),
)

# testing
response = client.responses.create(
    model = "gpt-4o",
    instructions = "You are a coding assistant that talks like a pirate.",
    input = "How do I check if a Python object is an instance of a class?",
)

print(response.output_text)

RateLimitError: Error code: 429 - {'error': {'message': 'You exceeded your current quota, please check your plan and billing details. For more information on this error, read the docs: https://platform.openai.com/docs/guides/error-codes/api-errors.', 'type': 'insufficient_quota', 'param': None, 'code': 'insufficient_quota'}}
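
The call above failed with HTTP 429. A minimal sketch of retrying with exponential backoff, assuming a hypothetical `call_with_backoff` helper; `TransientError` is a stand-in exception so the example runs without network access. Retrying can help with transient rate limits, but it will probably not resolve an `insufficient_quota` error like the one shown, which points to the account's plan and billing.

```python
import time


class TransientError(Exception):
    """Stand-in for a retriable API error (hypothetical, for illustration)."""


def call_with_backoff(fn, retriable=(TransientError,), max_retries=3, base_delay=0.01):
    """Call fn(); on a retriable error wait base_delay * 2**attempt, then retry."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except retriable:
            if attempt == max_retries:
                raise  # out of retries: surface the original error
            time.sleep(base_delay * 2 ** attempt)


# Fake request that fails twice and then succeeds
attempts = {"n": 0}


def flaky_request():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise TransientError("rate limited")
    return "ok"


result = call_with_backoff(flaky_request)
print(result, attempts["n"])
```

With the OpenAI client, `fn` could be a lambda wrapping `client.responses.create(...)` and `retriable` could be `(openai.RateLimitError,)`; treat that pairing as an assumption to verify against the installed SDK version.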

### 1.3 Simple prompt

In [13]:
def create_simple_prompt(data_row):
    """
    Create a simple prompt for zero-shot classification.
    """

    # Convert the data row to a string format using the real column names
    data_string = ", ".join([f"{name}: {value}" for name, value in zip(predictor_names, data_row)])

    prompt = f"Respond with YES or NO. Does the following person develop a psychological disorder between time point T1 and T2? Data: {data_string}"

    return prompt
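
Since the prompt asks for YES or NO, the model's reply has to be mapped back to a binary label before it can be compared with `hpi`. A minimal sketch, assuming a hypothetical `parse_yes_no` helper and assuming that YES corresponds to 1 and NO to 0 (the notebook does not state how `hpi` is coded):

```python
import re


def parse_yes_no(text):
    """Return 1 for YES, 0 for NO, None if the reply starts with neither."""
    # Skip leading punctuation or markup, then require a whole word yes/no
    match = re.match(r"\W*(yes|no)\b", text.strip(), flags=re.IGNORECASE)
    if match is None:
        return None
    return 1 if match.group(1).lower() == "yes" else 0


for reply in ["YES", "no.", "  Yes, likely.", "I cannot tell"]:
    print(repr(reply), "->", parse_yes_no(reply))
```

Returning `None` for unparseable replies keeps them visible, so they can be counted separately instead of being silently scored as NO.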

In [30]:
# Test the function
# Use X (predictors only) so the values line up with predictor_names
data_row = X.iloc[4]
simple_prompt = create_simple_prompt(data_row)
print(simple_prompt)
print("\n", target_names[0], y.iloc[4])

Respond with YES or NO. Does the following person develop a psychological disorder between time point T1 and T2? Data: T1 Positive mental health: -0.0346394734930797, T1 Social support: -1.9443495599450165, T1 General self-efficacy: 0.5865222429197926, T1 Life satisfaction: -1.2777276005686404, T1 Problem-focused coping: -0.6434380872019897, T1 Emotion-focused coping: -0.4501970947608277, T1 Anxiety sensitivity: -1.6912121763530454, T1 Fear of bodily sensations: -1.3399218609334862, T1 Dysfunctional attitudes: 1.528535706802646, T1 General psychopathology: Global Severity Index (GSI): -0.5273858134379611, Education: 1.0, T1 BMI: 1.0, Socioeconomic status: 2.0, T2 Positive mental health: -0.2748340578894306, T2 Social support: 1.0533586187543231, T2 General self-efficacy: 1.3559331135757966, T2 Life satisfaction: 1.005628346789385, T2 Anxiety sensitivity: -1.1666271173074825, T2 Fear of bodily sensations: -1.0416157340293282, T2 Dysfunctional attitudes: -0.8267579141727915, T2 General p

In [18]:
# Create a simple prompt for each row in the test set
X_test_simple_prompt = [create_simple_prompt(row) for _, row in X_test.iterrows()]

X_test_simple_prompt[0]

'Respond with YES or NO. Does the following person develop a psychological disorder between time point T1 and T2? Data: T1 Positive mental health: -0.9916025104337652, T1 Social support: 0.1912554943776842, T1 General self-efficacy: -1.209478876081861, T1 Life satisfaction: 0.1815598805259552, T1 Problem-focused coping: -1.0834346466693496, T1 Emotion-focused coping: -0.1263180001181833, T1 Anxiety sensitivity: 1.083708747266506, T1 Fear of bodily sensations: 1.2360549695511902, T1 Dysfunctional attitudes: -0.6885415900752596, T1 General psychopathology: Global Severity Index (GSI): -0.4917971179478549, Education: 2.0, T1 BMI: 2.0, Socioeconomic status: 2.0, T2 Positive mental health: -0.7828728972526869, T2 Social support: 0.0984652458135236, T2 General self-efficacy: -1.1421997527016898, T2 Life satisfaction: -0.6556858132553557, T2 Anxiety sensitivity: 1.5262249302236135, T2 Fear of bodily sensations: 1.9574857842600144, T2 Dysfunctional attitudes: -0.2807869674967928, T2 General ps
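
Once every test row has a prompt, the predictions can be scored against the true labels. A minimal sketch of that loop, assuming a hypothetical `evaluate_prompts` helper; `stub_classifier` stands in for the LLM call plus reply parsing, and the prompts and labels are toy values, not results from the notebook:

```python
def evaluate_prompts(prompts, labels, classify_fn):
    """Apply classify_fn to each prompt and return (predictions, accuracy)."""
    predictions = [classify_fn(p) for p in prompts]
    correct = sum(int(pred == true) for pred, true in zip(predictions, labels))
    return predictions, correct / len(labels)


# Stub in place of a real model call (hypothetical): answers 1 whenever
# the prompt contains the literal text "T1 Stress: 1".
def stub_classifier(prompt):
    return 1 if "T1 Stress: 1" in prompt else 0


toy_prompts = ["Data: T1 Stress: 1.2", "Data: T1 Stress: -0.4", "Data: T1 Stress: 1.0"]
toy_labels = [1, 0, 0]
toy_predictions, accuracy = evaluate_prompts(toy_prompts, toy_labels, stub_classifier)
print(toy_predictions, round(accuracy, 2))
```

In the notebook's setting, `prompts` would be `X_test_simple_prompt` and `labels` would be `y_test`; accuracy alone can mislead if the two classes are unbalanced, so other metrics may be worth adding.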

In [31]:
# create random input for testing
data_row = np.random.rand(len(predictor_names))
simple_prompt = create_simple_prompt(data_row)
print(simple_prompt)

Respond with YES or NO. Does the following person develop a psychological disorder between time point T1 and T2? Data: T1 Positive mental health: 0.5095653574926321, T1 Social support: 0.550369543121323, T1 General self-efficacy: 0.40898981976163884, T1 Life satisfaction: 0.7402492391816218, T1 Problem-focused coping: 0.13654584812841952, T1 Emotion-focused coping: 0.32854208900697246, T1 Anxiety sensitivity: 0.18247347851500506, T1 Fear of bodily sensations: 0.9929708436334378, T1 Dysfunctional attitudes: 0.6081881376323585, T1 General psychopathology: Global Severity Index (GSI): 0.3126652197050417, Education: 0.41002661251298256, T1 BMI: 0.009057344343104345, Socioeconomic status: 0.7889356866833619, T2 Positive mental health: 0.7277780849380541, T2 Social support: 0.5991167634257386, T2 General self-efficacy: 0.7783703307580181, T2 Life satisfaction: 0.025403205082699998, T2 Anxiety sensitivity: 0.7199126911514265, T2 Fear of bodily sensations: 0.755801054354372, T2 Dysfunctional a

### 1.4 Class definitions

In [None]:
def create_class_definitions_prompt(data_row):
    """
    Create a class definitions prompt for zero-shot classification.
    """

    # Convert the data row to a string format using the real column names
    data_string = ", ".join([f"{name}: {value}" for name, value in zip(predictor_names, data_row)])

