# LLM: Zero-shot classification through LLMs and prompts

**Models**:

- GPT-4o (OpenAI)
- Gemini (Google)
- Gemma (Google)
- Llama (Meta)
- Claude (Anthropic)
- DeepSeek


## 0 Imports

In [1]:
import os
import pandas as pd
import anthropic
import numpy as np
import time
import re
from openai import OpenAI
from google import genai
from google.genai import types
from sklearn.model_selection import train_test_split
from scipy.spatial.distance import pdist

In [2]:
X_test_simple_prompt_df = pd.read_csv("../dat/prompts/X_test_simple_prompt.csv", sep = ",", index_col = 0)
X_test_class_definitions_prompt_df = pd.read_csv("../dat/prompts/X_test_class_definitions_prompt.csv", sep = ",", index_col = 0)
X_test_profiled_simple_prompt_df = pd.read_csv("../dat/prompts/X_test_profiled_simple_prompt.csv", sep = ",", index_col = 0)
X_test_few_shot_prompt_df = pd.read_csv("../dat/prompts/X_test_few_shot_prompt.csv", sep = ",", index_col = 0)
X_test_vignette_prompt_df = pd.read_csv("../dat/prompts/X_test_vignette_prompt.csv", sep = ",", index_col = 0)
X_test_claude_prompt_df = pd.read_csv("../dat/prompts/X_test_claude_prompt.csv", sep = ",", index_col = 0)

In [3]:
# flatten each prompt DataFrame into a 1-D array of prompt strings
X_test_simple_prompt = X_test_simple_prompt_df.values.flatten()
X_test_class_definitions_prompt = X_test_class_definitions_prompt_df.values.flatten()
X_test_profiled_simple_prompt = X_test_profiled_simple_prompt_df.values.flatten()
X_test_few_shot_prompt = X_test_few_shot_prompt_df.values.flatten()
X_test_vignette_prompt = X_test_vignette_prompt_df.values.flatten()
X_test_claude_prompt = X_test_claude_prompt_df.values.flatten()

In [4]:
simple_instruction = "Respond only with YES or NO."
class_definitions_instruction = "Respond only with YES or NO."
profiled_simple_instruction = "Respond only with YES or NO."
few_shot_instruction = "Respond only with YES or NO."
vignette_instruction = "Respond only with YES or NO."
claude_instruction = "You are an expert psychologist tasked with predicting whether an individual will develop a psychological disorder between two time points (T1 and T2) based on various psychological measures and demographic information. Your goal is to provide an accurate YES or NO prediction, supported by a brief explanation of your reasoning. Example output format: \n Prediction: [YES/NO] \n Explanation: [Brief explanation supporting your prediction]."

## 1 Zero-shot classification with LLMs

In this section, I use the prompts created in the previous section to **classify the test set with different LLMs**. Each LLM predicts whether a person develops a psychological disorder between time points T1 and T2.
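
The per-model cells below all repeat the same pattern: loop over the prompts, collect the raw responses, and time the run. A minimal sketch of that pattern as one reusable helper follows. The names `run_prompts` and `fake_ask` are hypothetical and do not appear in the notebook, and `fake_ask` is only a stand-in for a real API call.

```python
import time
from typing import Callable, Iterable, List, Tuple


def run_prompts(prompts: Iterable[str], ask: Callable[[str], str]) -> Tuple[List[str], float]:
    """Send each prompt through `ask` and return (raw responses, elapsed seconds)."""
    start = time.perf_counter()  # monotonic clock, suited to measuring durations
    responses = [ask(prompt) for prompt in prompts]
    elapsed = time.perf_counter() - start
    return responses, elapsed


# Hypothetical stand-in for an API call; it only illustrates the call shape.
def fake_ask(prompt: str) -> str:
    return "YES" if "high" in prompt else "NO"


responses, elapsed = run_prompts(["high stress at T1", "low stress at T1"], fake_ask)
print(responses)  # ['YES', 'NO']
print(f"Time taken: {elapsed:.4f} seconds")
```

Passing the API call in as a function keeps the loop and the timing identical across models, so only the small `ask` wrapper would differ per provider.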

### 1.1 ChatGPT (OpenAI)

#### 1.1.1 Testing prompting

In [5]:
# client = OpenAI(
#     api_key = os.environ.get("OPENAI_API_KEY"),
# )
#
# # testing
# response = client.responses.create(
#     model = "gpt-4o-mini",
#     instructions = "You are a coding assistant that talks like a pirate.",
#     input = "How do I check if a Python object is an instance of a class?",
# )
#
# print(response.output_text)

#### 1.1.2 Prompting with GPT-4o

In [6]:
# simple_prompt_y_pred_GPT = []
#
# client = OpenAI(
#     api_key = os.environ.get("OPENAI_API_KEY"),
# )
#
# # measure time in seconds
# start = time.time()
#
# # iterate over the test set and save the response for each prompt in an array
# for prompt in X_test_simple_prompt:
#     response = client.responses.create(
#         model = "gpt-4o",
#         instructions = simple_instruction,
#         input = prompt,
#     )
#     simple_prompt_y_pred_GPT.append(response.output_text)
#     print(response.output_text)
#
# end = time.time()
# print(f"Time taken: {end - start} seconds")
# time_GPT_simple_prompt = end - start
# time_GPT_simple_prompt_df = pd.DataFrame({"time": [time_GPT_simple_prompt]})
# time_GPT_simple_prompt_df.to_csv("../exp/times_LLMs/GPT4/time_GPT4_simple_prompt.csv", sep = ",", index = False)
#
# # value counts for array
# counts_simple_GPT = pd.Series(simple_prompt_y_pred_GPT).value_counts()
# print(counts_simple_GPT)
#
# # convert YES to 1 and NO to 0
# simple_prompt_y_pred_GPT = [1 if response == "YES" else 0 if response == "NO" else np.nan for response in simple_prompt_y_pred_GPT]
# simple_prompt_y_pred_GPT
#
# # save the array to a csv file
# simple_prompt_df_GPT = pd.DataFrame(simple_prompt_y_pred_GPT, columns = ["y_pred"])
# simple_prompt_df_GPT.to_csv("../exp/preds_LLMs/GPT4/y_pred_GPT4_simple_prompt.csv", sep = ",", index = False)
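
The exact comparison `response == "YES"` misses answers such as `"Yes."` or `" NO\n"`. A minimal sketch of a more tolerant mapping is shown here. The helper name `parse_yes_no` is hypothetical, and `float("nan")` is used in place of `np.nan` only to keep the example free of dependencies.

```python
import math
import re


def parse_yes_no(response: str) -> float:
    """Map a raw model response to 1.0 (YES), 0.0 (NO), or NaN (anything else)."""
    # Keep letters only, so "Yes.", " NO\n" and "**YES**" normalize to YES / NO.
    cleaned = re.sub(r"[^A-Za-z]", "", response).upper()
    if cleaned == "YES":
        return 1.0
    if cleaned == "NO":
        return 0.0
    return float("nan")  # unparseable answers stay visible as missing values


raw = ["YES", "no.", " Yes\n", "I cannot answer that."]
labels = [parse_yes_no(r) for r in raw]
print(labels)
print(sum(math.isnan(x) for x in labels), "unparseable response(s)")
```

Keeping unparseable answers as NaN, rather than folding them into 0, makes it possible to report how often a model ignored the output format.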

In [7]:
# class_def_y_pred_GPT = []
#
# client = OpenAI(
#     api_key = os.environ.get("OPENAI_API_KEY"),
# )
#
# # measure time in seconds
# start = time.time()
#
# # iterate over the test set and save the response for each prompt in an array
# for prompt in X_test_class_definitions_prompt:
#     response = client.responses.create(
#         model = "gpt-4o",
#         instructions = class_definitions_instruction,
#         input = prompt,
#     )
#     class_def_y_pred_GPT.append(response.output_text)
#     print(response.output_text)
#
# end = time.time()
# print(f"Time taken: {end - start} seconds")
# time_GPT_class_definitions = end - start
# time_GPT_class_definitions_df = pd.DataFrame({"time": [time_GPT_class_definitions]})
# time_GPT_class_definitions_df.to_csv("../exp/times_LLMs/GPT4/time_GPT4_class_definitions_prompt.csv", sep = ",", index = False)
#
# # value counts for array
# counts_class_def_GPT = pd.Series(class_def_y_pred_GPT).value_counts()
# print(counts_class_def_GPT)
#
# # convert YES to 1 and NO to 0
# class_def_y_pred_GPT = [1 if response == "YES" else 0 if response == "NO" else np.nan for response in class_def_y_pred_GPT]
# class_def_y_pred_GPT
#
# # save the array to a csv file
# class_def_df_GPT = pd.DataFrame(class_def_y_pred_GPT, columns = ["y_pred"])
# class_def_df_GPT.to_csv("../exp/preds_LLMs/GPT4/y_pred_GPT4_class_definitions_prompt.csv", sep = ",", index = False)

In [8]:
# profiled_simple_y_pred_GPT = []
#
# client = OpenAI(
#     api_key = os.environ.get("OPENAI_API_KEY"),
# )
#
# # measure time in seconds
# start = time.time()
#
# # iterate over the test set and save the response for each prompt in an array
# for prompt in X_test_profiled_simple_prompt:
#     response = client.responses.create(
#         model = "gpt-4o",
#         instructions = profiled_simple_instruction,
#         input = prompt,
#     )
#     profiled_simple_y_pred_GPT.append(response.output_text)
#     print(response.output_text)
#
# end = time.time()
# print(f"Time taken: {end - start} seconds")
# time_GPT_profiled_simple = end - start
# time_GPT_profiled_simple_df = pd.DataFrame({"time": [time_GPT_profiled_simple]})
# time_GPT_profiled_simple_df.to_csv("../exp/times_LLMs/GPT4/time_GPT4_profiled_simple_prompt.csv", sep = ",", index = False)
#
# # value counts for array
# counts_profiled_simple_GPT = pd.Series(profiled_simple_y_pred_GPT).value_counts()
# print(counts_profiled_simple_GPT)
#
# # convert YES to 1 and NO to 0
# profiled_simple_y_pred_GPT_val = [1 if response == "YES" else 0 if response == "NO" else np.nan for response in profiled_simple_y_pred_GPT]
# profiled_simple_y_pred_GPT_val
#
# # save the array to a csv file
# profiled_simple_df_GPT = pd.DataFrame(profiled_simple_y_pred_GPT_val, columns = ["y_pred"])
# profiled_simple_df_GPT.to_csv("../exp/preds_LLMs/GPT4/y_pred_GPT4_profiled_simple_prompt.csv", sep = ",", index = False)

In [9]:
# few_shot_y_pred_GPT = []
#
# client = OpenAI(
#     api_key = os.environ.get("OPENAI_API_KEY"),
# )
#
# # measure time in seconds
# start = time.time()
#
# # iterate over the test set and save the response for each prompt in an array
# for prompt in X_test_few_shot_prompt:
#     response = client.responses.create(
#         model = "gpt-4o",
#         instructions = few_shot_instruction,
#         input = prompt,
#     )
#     few_shot_y_pred_GPT.append(response.output_text)
#     print(response.output_text)
#
# end = time.time()
# print(f"Time taken: {end - start} seconds")
# time_GPT_few_shot = end - start
# time_GPT_few_shot_df = pd.DataFrame({"time": [time_GPT_few_shot]})
# time_GPT_few_shot_df.to_csv("../exp/times_LLMs/GPT4/time_GPT4_few_shot_prompt.csv", sep = ",", index = False)
#
# # value counts for array
# counts_few_shot_GPT = pd.Series(few_shot_y_pred_GPT).value_counts()
# print(counts_few_shot_GPT)
#
# # convert YES to 1 and NO to 0
# few_shot_y_pred_GPT_val = [1 if response == "YES" else 0 if response == "NO" else np.nan for response in few_shot_y_pred_GPT]
# few_shot_y_pred_GPT_val
#
# # save the array to a csv file
# few_shot_df_GPT = pd.DataFrame(few_shot_y_pred_GPT_val, columns = ["y_pred"])
# few_shot_df_GPT.to_csv("../exp/preds_LLMs/GPT4/y_pred_GPT4_few_shot_prompt.csv", sep = ",", index = False)

In [10]:
# vignette_y_pred_GPT = []
#
# client = OpenAI(
#     api_key = os.environ.get("OPENAI_API_KEY"),
# )
#
# # measure time in seconds
# start = time.time()
#
# # iterate over the test set and save the response for each prompt in an array
# for prompt in X_test_vignette_prompt:
#     response = client.responses.create(
#         model = "gpt-4o",
#         instructions = vignette_instruction,
#         input = prompt,
#     )
#     vignette_y_pred_GPT.append(response.output_text)
#     print(response.output_text)
#
# end = time.time()
# print(f"Time taken: {end - start} seconds")
# time_GPT_vignette = end - start
# time_GPT_vignette_df = pd.DataFrame({"time": [time_GPT_vignette]})
# time_GPT_vignette_df.to_csv("../exp/times_LLMs/GPT4/time_GPT4_vignette_prompt.csv", sep = ",", index = False)
#
# # value counts for array
# counts_vignette_GPT = pd.Series(vignette_y_pred_GPT).value_counts()
# print(counts_vignette_GPT)
#
# # convert YES to 1 and NO to 0
# vignette_y_pred_GPT_val = [1 if response == "YES" else 0 if response == "NO" else np.nan for response in vignette_y_pred_GPT]
# vignette_y_pred_GPT_val
#
# # save the array to a csv file
# vignette_df_GPT = pd.DataFrame(vignette_y_pred_GPT_val, columns = ["y_pred"])
# vignette_df_GPT.to_csv("../exp/preds_LLMs/GPT4/y_pred_GPT4_vignette_prompt.csv", sep = ",", index = False)

#### 1.1.3 Reasons for misclassified cases

In [11]:
# y_pred_GPT4_simple_prompt = pd.read_csv("../exp/preds_LLMs/GPT4/y_pred_GPT4_simple_prompt.csv", sep = ",")
# y_pred_GPT4_class_definition_prompt = pd.read_csv("../exp/preds_LLMs/GPT4/y_pred_GPT4_class_definitions_prompt.csv", sep = ",")
# y_pred_GPT4_profiled_simple_prompt = pd.read_csv("../exp/preds_LLMs/GPT4/y_pred_GPT4_profiled_simple_prompt.csv", sep = ",")
# y_pred_GPT4_few_shot_prompt = pd.read_csv("../exp/preds_LLMs/GPT4/y_pred_GPT4_few_shot_prompt.csv", sep = ",")
# y_pred_GPT4_vignette_prompt = pd.read_csv("../exp/preds_LLMs/GPT4/y_pred_GPT4_vignette_prompt.csv", sep = ",")
#
# # convert to array
# y_pred_GPT4_simple_prompt = y_pred_GPT4_simple_prompt["y_pred"].to_numpy()
# y_pred_GPT4_class_definition_prompt = y_pred_GPT4_class_definition_prompt["y_pred"].to_numpy()
# y_pred_GPT4_profiled_simple_prompt = y_pred_GPT4_profiled_simple_prompt["y_pred"].to_numpy()
# y_pred_GPT4_few_shot_prompt = y_pred_GPT4_few_shot_prompt["y_pred"].to_numpy()
# y_pred_GPT4_vignette_prompt = y_pred_GPT4_vignette_prompt["y_pred"].to_numpy()

In [12]:
# # identify misclassified cases by comparing y_pred_GPT4_XXX with y_test and save their indices
# misclassified_cases_simple = []
# misclassified_cases_class_def = []
# misclassified_cases_profiled_simple = []
# misclassified_cases_few_shot = []
# misclassified_cases_vignette = []
#
# for i in range(len(y_pred_GPT4_simple_prompt)):
#     if y_pred_GPT4_simple_prompt[i] != y_test.iloc[i]:
#         misclassified_cases_simple.append(i)
# total_cases_simple = len(y_pred_GPT4_simple_prompt)
# misscl_cases_simple = len(misclassified_cases_simple)
# correct_clases_simple = total_cases_simple - misscl_cases_simple
#
# for i in range(len(y_pred_GPT4_class_definition_prompt)):
#     if y_pred_GPT4_class_definition_prompt[i] != y_test.iloc[i]:
#         misclassified_cases_class_def.append(i)
# total_cases_class_def = len(y_pred_GPT4_class_definition_prompt)
# misscl_cases_class_def = len(misclassified_cases_class_def)
# correct_clases_class_def = total_cases_class_def - misscl_cases_class_def
#
# for i in range(len(y_pred_GPT4_profiled_simple_prompt)):
#     if y_pred_GPT4_profiled_simple_prompt[i] != y_test.iloc[i]:
#         misclassified_cases_profiled_simple.append(i)
# total_cases_profiled = len(y_pred_GPT4_profiled_simple_prompt)
# misscl_cases_profiled = len(misclassified_cases_profiled_simple)
# correct_clases_profiled = total_cases_profiled - misscl_cases_profiled
#
# for i in range(len(y_pred_GPT4_few_shot_prompt)):
#     if y_pred_GPT4_few_shot_prompt[i] != y_test.iloc[i]:
#         misclassified_cases_few_shot.append(i)
# total_cases_few_shot = len(y_pred_GPT4_few_shot_prompt)
# misscl_cases_few_shot = len(misclassified_cases_few_shot)
# correct_clases_few_shot = total_cases_few_shot - misscl_cases_few_shot
#
# for i in range(len(y_pred_GPT4_vignette_prompt)):
#     if y_pred_GPT4_vignette_prompt[i] != y_test.iloc[i]:
#         misclassified_cases_vignette.append(i)
# total_cases_vignette = len(y_pred_GPT4_vignette_prompt)
# misscl_cases_vignette = len(misclassified_cases_vignette)
# correct_clases_vignette = total_cases_vignette - misscl_cases_vignette
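
The five loops above differ only in the prediction array they compare against `y_test`. A minimal sketch of the same comparison as a single helper follows. The function names and the toy labels are hypothetical and are not taken from the study data.

```python
from typing import Dict, List, Sequence


def misclassified_indices(y_pred: Sequence[float], y_true: Sequence[float]) -> List[int]:
    """Return the positions where the prediction and the true label differ."""
    if len(y_pred) != len(y_true):
        raise ValueError("y_pred and y_true must have the same length")
    return [i for i, (pred, true) in enumerate(zip(y_pred, y_true)) if pred != true]


def summarize(y_pred: Sequence[float], y_true: Sequence[float]) -> Dict[str, int]:
    """Count total, correct, and misclassified cases."""
    wrong = misclassified_indices(y_pred, y_true)
    return {"total": len(y_pred), "correct": len(y_pred) - len(wrong), "misclassified": len(wrong)}


# Toy labels for illustration only.
y_pred_toy = [1, 0, 0, 1, 0]
y_true_toy = [1, 1, 0, 0, 0]
print(misclassified_indices(y_pred_toy, y_true_toy))  # [1, 3]
print(summarize(y_pred_toy, y_true_toy))
```

Because `NaN != x` is always true, a prediction stored as NaN is counted as misclassified by this comparison, which matches the behavior of the `!=` check in the loops above.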

In [13]:
# # save as DataFrames with total, correct, and misclassified cases
# simple_cases_df = pd.DataFrame({"total": [total_cases_simple], "correct": [correct_clases_simple], "missclassified": [misscl_cases_simple]})
# simple_cases_df.to_csv("../exp/reasons_misclassifications_LLMs/GPT4/simple_cases_GPT_df.csv", sep = ",", index = True)
#
# class_def_cases_df = pd.DataFrame({"total": [total_cases_class_def], "correct": [correct_clases_class_def], "missclassified": [misscl_cases_class_def]})
# class_def_cases_df.to_csv("../exp/reasons_misclassifications_LLMs/GPT4/class_def_cases_GPT_df.csv", sep = ",", index = True)
#
# profiled_cases_df = pd.DataFrame({"total": [total_cases_profiled], "correct": [correct_clases_profiled], "missclassified": [misscl_cases_profiled]})
# profiled_cases_df.to_csv("../exp/reasons_misclassifications_LLMs/GPT4/profiled_cases_GPT_df.csv", sep = ",", index = True)
#
# few_shot_cases_df = pd.DataFrame({"total": [total_cases_few_shot], "correct": [correct_clases_few_shot], "missclassified": [misscl_cases_few_shot]})
# few_shot_cases_df.to_csv("../exp/reasons_misclassifications_LLMs/GPT4/few_shot_cases_GPT_df.csv", sep = ",", index = True)
#
# vignette_cases_df = pd.DataFrame({"total": [total_cases_vignette], "correct": [correct_clases_vignette], "missclassified": [misscl_cases_vignette]})
# vignette_cases_df.to_csv("../exp/reasons_misclassifications_LLMs/GPT4/vignette_cases_GPT_df.csv", sep = ",", index = True)


In [14]:
# simple_prompt_reasons = []
# class_def_prompt_reasons = []
# profiled_simple_prompt_reasons = []
# few_shot_prompt_reasons = []
# vignette_prompt_reasons = []
#
# client = OpenAI(
#     api_key = os.environ.get("OPENAI_API_KEY"),
# )
#
# instruction_reason = "Please categorize why you misclassified the data. Respond only with the following categories as reasons for the misclassification in order to improve prompting. Possible categories are: \nLack of context (emphasize or indicate the context of the query), \nLack of examples (few-shot prompting with several examples of appropriate responses are shown before posing the actual question missing), \nLack of feedback (interactive refining the prompt), \nLack of counterfactual demonstrations (instances containing false facts to improve faithfulness in knowledge conflict situations), \nLack of opinion-based information (reframe the context as a narrator’s statement and inquire about the narrator’s opinions), \nKnowledge conflicts (memorized facts became outdated and counterfactual facts), \nPrediction with Abstention (model is uncertain about their predictions) \n \n Do not mention specific change (e.g., increase or decrease) in predictors, do not go into detail of this specific case and do not repeat the question. Only respond with one or multiple of the categories as reasons for the misclassification, separated by ','. Mention the most important category first."
#
# # iterate over the misclassified cases and save the response for each prompt in an array
# print("Simple prompt: \n \n")
# for i in misclassified_cases_simple:
#     response = client.responses.create(
#         model = "gpt-4o",
#         instructions = instruction_reason,
#         input = f"Misclassified case {i}: Prompt: {X_test_simple_prompt[i]} Response: {y_pred_GPT4_simple_prompt[i]} True label: {y_test.iloc[i]}"
#     )
#     simple_prompt_reasons.append(response.output_text)
#     print(response.output_text)
#
# print("\n \n Class definition prompt: \n \n")
# for i in misclassified_cases_class_def:
#     response = client.responses.create(
#         model = "gpt-4o",
#         instructions = instruction_reason,
#         input = f"Misclassified case {i}: Prompt: {X_test_class_definitions_prompt[i]} Response: {y_pred_GPT4_class_definition_prompt[i]} True label: {y_test.iloc[i]}"
#     )
#     class_def_prompt_reasons.append(response.output_text)
#     print(response.output_text)
#
# print("\n \n Profiled simple prompt: \n \n")
# for i in misclassified_cases_profiled_simple:
#     response = client.responses.create(
#         model = "gpt-4o",
#         instructions = instruction_reason,
#         input = f"Misclassified case {i}: Prompt: {X_test_profiled_simple_prompt[i]} Response: {y_pred_GPT4_profiled_simple_prompt[i]} True label: {y_test.iloc[i]}"
#     )
#     profiled_simple_prompt_reasons.append(response.output_text)
#     print(response.output_text)
#
# print("\n \n Few shot prompt: \n \n")
# for i in misclassified_cases_few_shot:
#     response = client.responses.create(
#         model = "gpt-4o",
#         instructions = instruction_reason,
#         input = f"Misclassified case {i}: Prompt: {X_test_few_shot_prompt[i]} Response: {y_pred_GPT4_few_shot_prompt[i]} True label: {y_test.iloc[i]}"
#     )
#     few_shot_prompt_reasons.append(response.output_text)
#     print(response.output_text)
#
# print("\n \n Vignette prompt: \n \n")
# for i in misclassified_cases_vignette:
#     response = client.responses.create(
#         model = "gpt-4o",
#         instructions = instruction_reason,
#         input = f"Misclassified case {i}: Prompt: {X_test_vignette_prompt[i]} Response: {y_pred_GPT4_vignette_prompt[i]} True label: {y_test.iloc[i]}"
#     )
#     vignette_prompt_reasons.append(response.output_text)
#     print(response.output_text)

In [15]:
# all_reasons_simple = []
# all_reasons_class_def = []
# all_reasons_profiled_simple = []
# all_reasons_few_shot = []
# all_reasons_vignette = []
#
# # split on "," (the separator requested in instruction_reason); strip() removes surrounding whitespace
# for reason in simple_prompt_reasons:
#     reason = reason.split(",")
#     reason = [re.sub(r'[^A-Za-z\s]', '', r).strip() for r in reason]
#     all_reasons_simple.append(reason)
#
# for reason in class_def_prompt_reasons:
#     reason = reason.split(",")
#     reason = [re.sub(r'[^A-Za-z\s]', '', r).strip() for r in reason]
#     all_reasons_class_def.append(reason)
#
# for reason in profiled_simple_prompt_reasons:
#     reason = reason.split(",")
#     reason = [re.sub(r'[^A-Za-z\s]', '', r).strip() for r in reason]
#     all_reasons_profiled_simple.append(reason)
#
# for reason in few_shot_prompt_reasons:
#     reason = reason.split(",")
#     reason = [re.sub(r'[^A-Za-z\s]', '', r).strip() for r in reason]
#     all_reasons_few_shot.append(reason)
#
# for reason in vignette_prompt_reasons:
#     reason = reason.split(",")
#     reason = [re.sub(r'[^A-Za-z\s]', '', r).strip() for r in reason]
#     all_reasons_vignette.append(reason)

In [16]:
# simple_prompt_reasons_dict = {}
# class_def_prompt_reasons_dict = {}
# profiled_simple_prompt_reasons_dict = {}
# few_shot_prompt_reasons_dict = {}
# vignette_prompt_reasons_dict = {}
#
# for i in all_reasons_simple:
#     for j in i:
#         # count the occurrences of each reason
#         if j in simple_prompt_reasons_dict:
#             simple_prompt_reasons_dict[j] += 1
#         else:
#             simple_prompt_reasons_dict[j] = 1
# simple_prompt_reasons_df = pd.DataFrame.from_dict(simple_prompt_reasons_dict, orient='index', columns=['count'])
# # simple_prompt_reasons_df.to_csv("../exp/reasons_misclassifications_LLMs/GPT4/simple_prompt_reasons.csv", sep = ",", index = True)
#
#
# for i in all_reasons_class_def:
#     for j in i:
#         # count the occurrences of each reason
#         if j in class_def_prompt_reasons_dict:
#             class_def_prompt_reasons_dict[j] += 1
#         else:
#             class_def_prompt_reasons_dict[j] = 1
# class_def_prompt_reasons_df = pd.DataFrame.from_dict(class_def_prompt_reasons_dict, orient='index', columns=['count'])
# # class_def_prompt_reasons_df.to_csv("../exp/reasons_misclassifications_LLMs/GPT4/class_def_prompt_reasons.csv", sep = ",", index = True)
#
# for i in all_reasons_profiled_simple:
#     for j in i:
#         # count the occurrences of each reason
#         if j in profiled_simple_prompt_reasons_dict:
#             profiled_simple_prompt_reasons_dict[j] += 1
#         else:
#             profiled_simple_prompt_reasons_dict[j] = 1
# profiled_simple_prompt_reasons_df = pd.DataFrame.from_dict(profiled_simple_prompt_reasons_dict, orient='index', columns=['count'])
# # profiled_simple_prompt_reasons_df.to_csv("../exp/reasons_misclassifications_LLMs/GPT4/profiled_simple_prompt_reasons.csv", sep = ",", index = True)
#
#
# for i in all_reasons_few_shot:
#     for j in i:
#         # count the occurrences of each reason
#         if j in few_shot_prompt_reasons_dict:
#             few_shot_prompt_reasons_dict[j] += 1
#         else:
#             few_shot_prompt_reasons_dict[j] = 1
# few_shot_prompt_reasons_df = pd.DataFrame.from_dict(few_shot_prompt_reasons_dict, orient='index', columns=['count'])
# # few_shot_prompt_reasons_df.to_csv("../exp/reasons_misclassifications_LLMs/GPT4/few_shot_prompt_reasons.csv", sep = ",", index = True)
#
# for i in all_reasons_vignette:
#     for j in i:
#         # count the occurrences of each reason
#         if j in vignette_prompt_reasons_dict:
#             vignette_prompt_reasons_dict[j] += 1
#         else:
#             vignette_prompt_reasons_dict[j] = 1
# vignette_prompt_reasons_df = pd.DataFrame.from_dict(vignette_prompt_reasons_dict, orient='index', columns=['count'])
# # vignette_prompt_reasons_df.to_csv("../exp/reasons_misclassifications_LLMs/GPT4/vignette_prompt_reasons.csv", sep = ",", index = True)
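
The splitting and counting steps above can be combined with `collections.Counter`. The following is a minimal sketch of that idea, not the notebook's own implementation. The helper name `count_reasons` and the toy responses are hypothetical.

```python
import re
from collections import Counter
from typing import Iterable


def count_reasons(responses: Iterable[str]) -> Counter:
    """Split comma-separated category lists and count each category."""
    counts: Counter = Counter()
    for response in responses:
        for part in response.split(","):
            # Keep letters and whitespace only, mirroring the regex used above.
            category = re.sub(r"[^A-Za-z\s]", "", part).strip()
            if category:  # skip empty fragments, e.g. from a trailing comma
                counts[category] += 1
    return counts


# Toy responses for illustration only.
toy = ["Lack of context, Lack of examples", "Lack of context,Knowledge conflicts."]
reason_counts = count_reasons(toy)
print(reason_counts.most_common())
```

Since a `Counter` is a `dict` subclass, it should be usable wherever the cells above pass a plain dictionary, for example in `pd.DataFrame.from_dict(..., orient='index', columns=['count'])`.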

In [17]:
# print(simple_prompt_reasons_dict, "\n \n")
# print(class_def_prompt_reasons_dict, "\n \n")
# print(profiled_simple_prompt_reasons_dict, "\n \n")
# print(few_shot_prompt_reasons_dict, "\n \n")
# print(vignette_prompt_reasons_dict)

### 1.2 Gemini (Google)

#### 1.2.1 Testing prompting

In [18]:
# client = genai.Client(api_key = os.environ.get("GEMINI_API_KEY"))
#
# response = client.models.generate_content(
#     model = "gemini-2.0-flash",
#     contents = "Explain how AI works in a few words",
# )
#
# print(response.text)

In [19]:
# client = genai.Client(api_key = os.environ.get("GEMINI_API_KEY"))
#
# response = client.models.generate_content(
#     model = "gemini-2.0-flash",
#     config = types.GenerateContentConfig(
#         system_instruction = simple_instruction),
#     contents = X_test_simple_prompt[0]
# )
#
# # alternative model: gemini-2.5-pro-preview-05-06

### 1.3 Gemma (Google)

### 1.4 Llama (Meta)

### 1.5 Claude (Anthropic)

#### 1.5.1 Testing prompting

In [20]:
# client = anthropic.Anthropic(api_key = os.environ.get("ANTHROPIC_API_KEY"))
#
# message = client.messages.create(
#     model = "claude-3-7-sonnet-20250219",
#     max_tokens = 20000,
#     temperature = 1,
#     thinking = {
#         "type": "enabled",
#         "budget_tokens": 16000
#     },
#     system = claude_instruction,
#     messages = [
#         {
#             "role": "user",
#             "content": [
#                 {
#                     "type": "text",
#                     "text": X_test_claude_prompt[0]
#                 }
#             ]
#         }
#     ]
# )
# print(message.content)

In [21]:
# print(message.content[0].thinking)
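
Indexing `message.content[0]` and `message.content[1]` assumes the thinking block always comes first and the text block second. A minimal sketch that selects blocks by their `type` attribute instead is shown here. `split_blocks` is a hypothetical helper, and the `SimpleNamespace` objects are stand-ins that only mimic the `type`, `thinking`, and `text` attributes used in the cells above.

```python
from types import SimpleNamespace
from typing import Iterable, Tuple


def split_blocks(content: Iterable) -> Tuple[str, str]:
    """Return (thinking_text, answer_text), chosen by block type, not by position."""
    blocks = list(content)
    thinking = "".join(b.thinking for b in blocks if b.type == "thinking")
    answer = "".join(b.text for b in blocks if b.type == "text")
    return thinking, answer


# Stand-in blocks for illustration; a real response would come from the API client.
fake_content = [
    SimpleNamespace(type="thinking", thinking="Scores rise between T1 and T2."),
    SimpleNamespace(type="text", text="Prediction: YES"),
]
thinking, answer = split_blocks(fake_content)
print(answer)  # Prediction: YES
```

If a response contains no block of a given type, the helper returns an empty string for it rather than raising an `IndexError` or `AttributeError`.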

In [22]:
# prediction = re.findall(r'Prediction: (.*)', message.content[1].text)
# prediction[0]

In [23]:
# # extract what comes after Explanation:
# explanation = re.findall(r'Explanation: (.*)', message.content[1].text)
# explanation[0]
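
The two `re.findall` calls above can be wrapped in one function that also handles a missing field. In this minimal sketch, `parse_prediction` is a hypothetical name and the sample text is invented for illustration. It assumes the answer follows the `Prediction: ... Explanation: ...` format requested in `claude_instruction`.

```python
import re
from typing import Optional, Tuple


def parse_prediction(text: str) -> Tuple[Optional[str], Optional[str]]:
    """Extract the YES/NO prediction and the explanation from a formatted answer."""
    # Optional brackets cover answers that copy the "[YES/NO]" template literally.
    pred = re.search(r"Prediction:\s*\[?(YES|NO)\]?", text, flags=re.IGNORECASE)
    # DOTALL lets the explanation span several lines.
    expl = re.search(r"Explanation:\s*(.*)", text, flags=re.DOTALL)
    prediction = pred.group(1).upper() if pred else None
    explanation = expl.group(1).strip() if expl else None
    return prediction, explanation


sample = "Prediction: NO\nExplanation: Scores are stable\nacross both time points."
prediction, explanation = parse_prediction(sample)
print(prediction)  # NO
print(explanation)
```

Returning `None` for a missing field avoids the `IndexError` that `prediction[0]` raises when `re.findall` finds no match.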

#### 1.5.2 Prompting with Claude 3.7 Sonnet

In [24]:
# simple_prompt_y_pred_claude = []
# simple_prompt_thinking_claude = []
#
# client = anthropic.Anthropic(api_key = os.environ.get("ANTHROPIC_API_KEY"))
#
# # measure time in seconds
# start = time.time()
#
# # iterate over the test set and save the response for each prompt in an array
# for prompt in X_test_simple_prompt:
#     message = client.messages.create(
#         model = "claude-3-7-sonnet-20250219",
#         max_tokens = 20000,
#         temperature = 1,
#         thinking = {
#             "type": "enabled",
#             "budget_tokens": 16000
#         },
#         system = simple_instruction,
#         messages = [
#             {
#                 "role": "user",
#                 "content": [
#                     {
#                         "type": "text",
#                         "text": prompt
#                     }
#                 ]
#             }
#         ]
#     )
#     simple_prompt_y_pred_claude.append(message.content[1].text)
#     simple_prompt_thinking_claude.append(message.content[0].thinking)
#     print(message.content[1].text)
#
# end = time.time()
# print(f"Time taken: {end - start} seconds")
# time_claude_simple_prompt = end - start
# time_claude_simple_prompt_df = pd.DataFrame({"time": [time_claude_simple_prompt]})
# time_claude_simple_prompt_df.to_csv("../exp/times_LLMs/Claude/time_claude_simple_prompt.csv", sep = ",", index = False)
#
# # value counts for array
# counts_simple_claude = pd.Series(simple_prompt_y_pred_claude).value_counts()
# print(counts_simple_claude)
#
# # convert YES to 1 and NO to 0
# simple_prompt_y_pred_claude = [1 if response == "YES" else 0 if response == "NO" else np.nan for response in simple_prompt_y_pred_claude]
#
# # save the array to a csv file
# simple_prompt_df_claude = pd.DataFrame(simple_prompt_y_pred_claude, columns = ["y_pred"])
# simple_prompt_df_claude.to_csv("../exp/preds_LLMs/Claude/y_pred_claude_simple_prompt.csv", sep = ",", index = False)
#
# simple_prompt_df_thinking_claude = pd.DataFrame(simple_prompt_thinking_claude, columns = ["thinking"])
# simple_prompt_df_thinking_claude.to_csv("../exp/preds_LLMs/Claude/Thinking/thinking_claude_simple_prompt.csv", sep = ",", index = False)

In [25]:
# class_def_y_pred_claude = []
# class_def_thinking_claude = []
#
# client = anthropic.Anthropic(api_key = os.environ.get("ANTHROPIC_API_KEY"))
#
# # measure time in seconds
# start = time.time()
#
# # iterate over the test set and save the response for each prompt in an array
# for prompt in X_test_class_definitions_prompt:
#     message = client.messages.create(
#         model = "claude-3-7-sonnet-20250219",
#         max_tokens = 20000,
#         temperature = 1,
#         thinking = {
#             "type": "enabled",
#             "budget_tokens": 16000
#         },
#         system = class_definitions_instruction,
#         messages = [
#             {
#                 "role": "user",
#                 "content": [
#                     {
#                         "type": "text",
#                         "text": prompt
#                     }
#                 ]
#             }
#         ]
#     )
#     class_def_y_pred_claude.append(message.content[1].text)
#     class_def_thinking_claude.append(message.content[0].thinking)
#     print(message.content[1].text)
#
# end = time.time()
# print(f"Time taken: {end - start} seconds")
# time_claude_class_definitions = end - start
# time_claude_class_definitions_df = pd.DataFrame({"time": [time_claude_class_definitions]})
# time_claude_class_definitions_df.to_csv("../exp/times_LLMs/Claude/time_claude_class_definitions_prompt.csv", sep = ",", index = False)
#
# # value counts for array
# counts_class_def_claude = pd.Series(class_def_y_pred_claude).value_counts()
# print(counts_class_def_claude)
#
# # convert YES to 1 and NO to 0
# class_def_y_pred_claude = [1 if response == "YES" else 0 if response == "NO" else np.nan for response in class_def_y_pred_claude]
#
# # save the array to a csv file
# class_def_df_claude = pd.DataFrame(class_def_y_pred_claude, columns = ["y_pred"])
# class_def_df_claude.to_csv("../exp/preds_LLMs/Claude/y_pred_claude_class_definitions_prompt.csv", sep = ",", index = False)
#
# class_def_prompt_df_thinking_claude = pd.DataFrame(class_def_thinking_claude, columns = ["thinking"])
# class_def_prompt_df_thinking_claude.to_csv("../exp/preds_LLMs/Claude/Thinking/thinking_claude_class_def_prompt.csv", sep = ",", index = False)

In [26]:
# profiled_simple_y_pred_claude = []
# profiled_simple_thinking_claude = []
#
# client = anthropic.Anthropic(api_key = os.environ.get("ANTHROPIC_API_KEY"))
#
# # measure time in seconds
# start = time.time()
#
# # iterate over the test set and save the response for each prompt in an array
# for prompt in X_test_profiled_simple_prompt:
#     message = client.messages.create(
#         model = "claude-3-7-sonnet-20250219",
#         max_tokens = 20000,
#         temperature = 1,
#         thinking = {
#             "type": "enabled",
#             "budget_tokens": 16000
#         },
#         system = profiled_simple_instruction,
#         messages = [
#             {
#                 "role": "user",
#                 "content": [
#                     {
#                         "type": "text",
#                         "text": prompt
#                     }
#                 ]
#             }
#         ]
#     )
#     profiled_simple_y_pred_claude.append(message.content[1].text)
#     profiled_simple_thinking_claude.append(message.content[0].thinking)
#     print(message.content[1].text)
#
# end = time.time()
# print(f"Time taken: {end - start} seconds")
# time_claude_profiled_simple = end - start
# time_claude_profiled_simple_df = pd.DataFrame({"time": [time_claude_profiled_simple]})
# time_claude_profiled_simple_df.to_csv("../exp/times_LLMs/Claude/time_claude_profiled_simple_prompt.csv", sep = ",", index = False)
#
# # value counts for array
# counts_profiled_simple_claude = pd.Series(profiled_simple_y_pred_claude).value_counts()
# print(counts_profiled_simple_claude)
#
# # convert YES to 1 and NO to 0
# profiled_simple_y_pred_claude_val = [1 if response == "YES" else 0 if response == "NO" else np.nan for response in profiled_simple_y_pred_claude]
#
# # save the array to a csv file
# profiled_simple_df_claude = pd.DataFrame(profiled_simple_y_pred_claude_val, columns = ["y_pred"])
# profiled_simple_df_claude.to_csv("../exp/preds_LLMs/Claude/y_pred_claude_profiled_simple_prompt.csv", sep = ",", index = False)
#
# profiled_simple_prompt_df_thinking_claude = pd.DataFrame(profiled_simple_thinking_claude, columns = ["thinking"])
# profiled_simple_prompt_df_thinking_claude.to_csv("../exp/preds_LLMs/Claude/Thinking/thinking_claude_profiled_simple_prompt.csv", sep = ",", index = False)

In [27]:
# few_shot_y_pred_claude = []
# few_shot_thinking_claude = []
#
# client = anthropic.Anthropic(api_key = os.environ.get("ANTHROPIC_API_KEY"))
#
# # measure time in seconds
# start = time.time()
#
# # iterate over the test set and save the response for each prompt in an array
# for prompt in X_test_few_shot_prompt:
#     message = client.messages.create(
#         model = "claude-3-7-sonnet-20250219",
#         max_tokens = 20000,
#         temperature = 1,
#         thinking = {
#             "type": "enabled",
#             "budget_tokens": 16000
#         },
#         system = few_shot_instruction,
#         messages = [
#             {
#                 "role": "user",
#                 "content": [
#                     {
#                         "type": "text",
#                         "text": prompt
#                     }
#                 ]
#             }
#         ]
#     )
#     few_shot_y_pred_claude.append(message.content[1].text)
#     few_shot_thinking_claude.append(message.content[0].thinking)
#     print(message.content[1].text)
#
# end = time.time()
# print(f"Time taken: {end - start} seconds")
# time_claude_few_shot = end - start
# time_claude_few_shot_df = pd.DataFrame({"time": [time_claude_few_shot]})
# time_claude_few_shot_df.to_csv("../exp/times_LLMs/Claude/time_claude_few_shot_prompt.csv", sep = ",", index = False)
#
# # value counts for array
# counts_few_shot_claude = pd.Series(few_shot_y_pred_claude).value_counts()
# print(counts_few_shot_claude)
#
# # convert YES to 1 and NO to 0
# few_shot_y_pred_claude_val = [1 if response == "YES" else 0 for response in few_shot_y_pred_claude]
#
# # save the array to a csv file
# few_shot_df_claude = pd.DataFrame(few_shot_y_pred_claude_val, columns = ["y_pred"])
# few_shot_df_claude.to_csv("../exp/preds_LLMs/Claude/y_pred_claude_few_shot_prompt.csv", sep = ",", index = False)
#
# few_shot_prompt_df_thinking_claude = pd.DataFrame(few_shot_thinking_claude, columns = ["thinking"])
# few_shot_prompt_df_thinking_claude.to_csv("../exp/preds_LLMs/Claude/Thinking/thinking_claude_few_shot_prompt.csv", sep = ",", index = False)

In [28]:
# vignette_y_pred_claude = []
# vignette_thinking_claude = []
#
# client = anthropic.Anthropic(api_key = os.environ.get("ANTHROPIC_API_KEY"))
#
# # measure time in seconds
# start = time.time()
#
# # iterate over the test set and save the response for each prompt in an array
# for prompt in X_test_vignette_prompt:
#     message = client.messages.create(
#         model = "claude-3-7-sonnet-20250219",
#         max_tokens = 20000,
#         temperature = 1,
#         thinking = {
#             "type": "enabled",
#             "budget_tokens": 16000
#         },
#         system = vignette_instruction,
#         messages = [
#             {
#                 "role": "user",
#                 "content": [
#                     {
#                         "type": "text",
#                         "text": prompt
#                     }
#                 ]
#             }
#         ]
#     )
#     vignette_y_pred_claude.append(message.content[1].text)
#     vignette_thinking_claude.append(message.content[0].thinking)
#     print(message.content[1].text)
#
# end = time.time()
# print(f"Time taken: {end - start} seconds")
# time_claude_vignette = end - start
# time_claude_vignette_df = pd.DataFrame({"time": [time_claude_vignette]})
# time_claude_vignette_df.to_csv("../exp/times_LLMs/Claude/time_claude_vignette_prompt.csv", sep = ",", index = False)
#
# # value counts for array
# counts_vignette_claude = pd.Series(vignette_y_pred_claude).value_counts()
# print(counts_vignette_claude)
#
# # convert YES to 1 and NO to 0
# vignette_y_pred_claude_val = [1 if response == "YES" else 0 for response in vignette_y_pred_claude]
#
# # save the array to a csv file
# vignette_df_claude = pd.DataFrame(vignette_y_pred_claude_val, columns = ["y_pred"])
# vignette_df_claude.to_csv("../exp/preds_LLMs/Claude/y_pred_claude_vignette_prompt.csv", sep = ",", index = False)
#
# vignette_prompt_df_thinking_claude = pd.DataFrame(vignette_thinking_claude, columns = ["thinking"])
# vignette_prompt_df_thinking_claude.to_csv("../exp/preds_LLMs/Claude/Thinking/thinking_claude_vignette_prompt.csv", sep = ",", index = False)
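
The Claude cells above read the response as `message.content[0].thinking` and `message.content[1].text`, which assumes the thinking block always comes first and the text block second. Below is a minimal sketch of a position-independent alternative. It assumes each content block exposes a `type` attribute (`"thinking"` or `"text"`). `split_claude_blocks` is a hypothetical helper name, and the stand-in blocks are built with `SimpleNamespace` so the example runs without an API call.

```python
from types import SimpleNamespace


def split_claude_blocks(content):
    """Return (thinking, text) by inspecting block types instead of positions."""
    thinking_parts = []
    text_parts = []
    for block in content:
        block_type = getattr(block, "type", None)
        if block_type == "thinking":
            thinking_parts.append(block.thinking)
        elif block_type == "text":
            text_parts.append(block.text)
        # any other block type is skipped rather than raising an error
    return "\n".join(thinking_parts), "".join(text_parts).strip()


# stand-in for message.content (hypothetical data, no API call)
fake_content = [
    SimpleNamespace(type="thinking", thinking="GSI decreased, so ..."),
    SimpleNamespace(type="text", text="NO\n"),
]
thinking, answer = split_claude_blocks(fake_content)
print(answer)
```

Inside the loops, `split_claude_blocks(message.content)` could replace the two indexed lookups. The returned text is already stripped of surrounding whitespace.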

In [29]:
# claude_prompt_y_pred_claude = []
# claude_prompt_explanation_claude = []
# claude_prompt_thinking_claude = []
#
# client = anthropic.Anthropic(api_key = os.environ.get("ANTHROPIC_API_KEY"))
#
# # measure time in seconds
# start = time.time()
#
# # iterate over the test set and save the response for each prompt in an array
# for prompt in X_test_claude_prompt[:2]:
#     message = client.messages.create(
#         model = "claude-3-7-sonnet-20250219",
#         max_tokens = 20000,
#         temperature = 1,
#         thinking = {
#             "type": "enabled",
#             "budget_tokens": 16000
#         },
#         system = claude_instruction,
#         messages = [
#             {
#                 "role": "user",
#                 "content": [
#                     {
#                         "type": "text",
#                         "text": prompt
#                     }
#                 ]
#             }
#         ]
#     )
#     try:
#         prediction = re.findall(r'Prediction: (.*)', message.content[1].text)[0]
#         explanation = re.findall(r'Explanation: (.*)', message.content[1].text)[0]
#         claude_prompt_y_pred_claude.append(prediction)
#         claude_prompt_explanation_claude.append(explanation)
#         claude_prompt_thinking_claude.append(message.content[0].thinking)
#         print(prediction)
#     except IndexError:
#         print("IndexError")
#         claude_prompt_y_pred_claude.append("IndexError")
#         claude_prompt_explanation_claude.append("IndexError")
#         claude_prompt_thinking_claude.append("IndexError")
#
#
# end = time.time()
# print(f"Time taken: {end - start} seconds")
# time_claude_claude_prompt = end - start
# time_claude_claude_prompt_df = pd.DataFrame({"time": [time_claude_claude_prompt]})
# time_claude_claude_prompt_df.to_csv("../exp/times_LLMs/Claude/time_claude_claude_prompt.csv", sep = ",", index = False)
#
# # value counts for array
# counts_claude_prompt_claude = pd.Series(claude_prompt_y_pred_claude).value_counts()
# print(counts_claude_prompt_claude)
#
# # convert YES to 1 and NO to 0; unparsed responses (e.g. "IndexError") become NaN
# claude_prompt_y_pred_claude_val = [1 if response.strip() == "YES" else 0 if response.strip() == "NO" else np.nan for response in claude_prompt_y_pred_claude]
#
# # save the array to a csv file
# claude_prompt_df_claude = pd.DataFrame(claude_prompt_y_pred_claude_val, columns = ["y_pred"])
# claude_prompt_df_claude.to_csv("../exp/preds_LLMs/Claude/y_pred_claude_claude_prompt_prompt.csv", sep = ",", index = False)
#
# claude_prompt_prompt_df_thinking_claude = pd.DataFrame(claude_prompt_thinking_claude, columns = ["thinking"])
# claude_prompt_prompt_df_thinking_claude.to_csv("../exp/preds_LLMs/Claude/Thinking/thinking_claude_claude_prompt.csv", sep = ",", index = False)
#
# claude_prompt_prompt_df_explanation_claude = pd.DataFrame(claude_prompt_explanation_claude, columns = ["explanation"])
# claude_prompt_prompt_df_explanation_claude.to_csv("../exp/preds_LLMs/Claude/Thinking/explanation_claude_claude_prompt.csv", sep = ",", index = False)

NO
NO
Time taken: 53.0749671459198 seconds
NO    2
Name: count, dtype: int64
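
The pattern `Prediction: (.*)` used above captures the rest of the line verbatim, so a reply such as `Prediction: [NO] ` would not compare equal to `"NO"` later on. Below is a sketch of a more tolerant parser. It assumes the reply follows the `Prediction: ... / Explanation: ...` layout requested in `claude_instruction`. `parse_prediction` is a hypothetical helper, not part of the original pipeline.

```python
import re

# optional opening bracket and surrounding whitespace are tolerated
PREDICTION_RE = re.compile(r"Prediction:\s*\[?\s*(YES|NO)\b", re.IGNORECASE)
# DOTALL lets the explanation span several lines
EXPLANATION_RE = re.compile(r"Explanation:\s*(.*)", re.IGNORECASE | re.DOTALL)


def parse_prediction(text):
    """Return (label, explanation); label is "YES", "NO", or None if absent."""
    pred = PREDICTION_RE.search(text)
    expl = EXPLANATION_RE.search(text)
    label = pred.group(1).upper() if pred else None
    explanation = expl.group(1).strip() if expl else None
    return label, explanation


sample = "Prediction: [NO] \nExplanation: GSI decreased between T1 and T2."
print(parse_prediction(sample))
```

Returning `None` for a missing label keeps malformed replies distinguishable from a genuine `NO`, which matters when the labels are later converted to 0/1.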


### 1.6 DeepSeek

#### 1.6.1 Testing prompting

In [31]:
client = OpenAI(api_key = os.environ.get("DeepSeek_API_Key"), base_url = "https://api.deepseek.com")

response = client.chat.completions.create(
    model = "deepseek-reasoner",
    messages = [
        {"role": "system", "content": simple_instruction},
        {"role": "user", "content": X_test_simple_prompt[0]},
    ],
    stream = False
)

print(response.choices[0].message.content)

NO


In [35]:
response.choices[0].message.reasoning_content

"Okay, let's see. I need to determine if the person develops a psychological disorder between T1 and T2 based on the given data. The answer should be just YES or NO. \n\nFirst, looking at the T1 values. The General psychopathology (GSI) at T1 is 0.0172, which is very low. That suggests that at T1, their psychopathology isn't severe. Other factors like stress are 0.44, which is moderate. Positive mental health is slightly negative but close to zero. Social support, self-efficacy, life satisfaction are all positive, but not extremely high. Problem-focused coping is quite high at 1.73, which might be a protective factor. Emotion-focused coping is lower. Anxiety sensitivity and fear of bodily sensations are low to moderate. Dysfunctional attitudes are around 0.27.\n\nNow, looking at the changes from T1 to T2. The change in GSI (General psychopathology) is -0.8256, which means it decreased. That's a significant drop. But wait, if GSI decreases, that would imply their psychopathology is gett

#### 1.6.2 Prompting with DeepSeek Reasoning R1

In [24]:
# simple_prompt_y_pred_deeps = []
# simple_prompt_thinking_deeps = []
#
# client = OpenAI(api_key = os.environ.get("DeepSeek_API_Key"), base_url = "https://api.deepseek.com")
#
# # measure time in seconds
# start = time.time()
#
# # iterate over the test set and save the response for each prompt in an array
# for prompt in X_test_simple_prompt:
#     response = client.chat.completions.create(
#         model = "deepseek-reasoner",
#         messages = [
#             {"role": "system", "content": simple_instruction},
#             {"role": "user", "content": prompt},
#         ],
#         stream = False
#     )
#     simple_prompt_y_pred_deeps.append(response.choices[0].message.content)
#     simple_prompt_thinking_deeps.append(response.choices[0].message.reasoning_content)
#     print(response.choices[0].message.content)
#
# end = time.time()
# print(f"Time taken: {end - start} seconds")
# time_deeps_simple_prompt = end - start
# time_deeps_simple_prompt_df = pd.DataFrame({"time": [time_deeps_simple_prompt]})
# time_deeps_simple_prompt_df.to_csv("../exp/times_LLMs/DeepSeek/time_deeps_simple_prompt.csv", sep = ",", index = False)
#
# # value counts for array
# counts_simple_deeps = pd.Series(simple_prompt_y_pred_deeps).value_counts()
# print(counts_simple_deeps)
#
# # convert YES to 1 and NO to 0
# simple_prompt_y_pred_deeps = [1 if response == "YES" else 0 if response == "NO" else np.nan for response in simple_prompt_y_pred_deeps]
#
# # save the array to a csv file
# simple_prompt_df_deeps = pd.DataFrame(simple_prompt_y_pred_deeps, columns = ["y_pred"])
# simple_prompt_df_deeps.to_csv("../exp/preds_LLMs/DeepSeek/y_pred_deeps_simple_prompt.csv", sep = ",", index = False)
#
# simple_prompt_df_thinking_deeps = pd.DataFrame(simple_prompt_thinking_deeps, columns = ["thinking"])
# simple_prompt_df_thinking_deeps.to_csv("../exp/preds_LLMs/DeepSeek/Thinking/thinking_deeps_simple_prompt.csv", sep = ",", index = False)
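
The conversion in the cell above maps exact `"YES"`/`"NO"` strings to 1/0 and everything else to NaN. Below is a sketch of a slightly more forgiving variant that also accepts replies such as `"no."` or `" NO\n"`. `to_binary_label` is a hypothetical helper, and the choice to treat longer answers as NaN is an assumption rather than something the notebook prescribes.

```python
import re

import numpy as np


def to_binary_label(response):
    """Map a raw model reply to 1 (YES), 0 (NO), or NaN (anything else)."""
    if not isinstance(response, str):
        return np.nan
    # drop whitespace, punctuation and digits, then compare case-insensitively
    cleaned = re.sub(r"[^A-Za-z]", "", response).upper()
    if cleaned == "YES":
        return 1
    if cleaned == "NO":
        return 0
    return np.nan


raw = ["YES", "no.", " NO\n", "Maybe", None]
labels = [to_binary_label(r) for r in raw]
print(labels)
```

Counting the NaN entries before saving gives a quick check of how often a model ignored the "Respond only with YES or NO." instruction.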

In [25]:
# class_def_y_pred_deeps = []
# class_def_thinking_deeps = []
#
# client = OpenAI(api_key = os.environ.get("DeepSeek_API_Key"), base_url = "https://api.deepseek.com")
#
# # measure time in seconds
# start = time.time()
#
# # iterate over the test set and save the response for each prompt in an array
# for prompt in X_test_class_definitions_prompt:
#     response = client.chat.completions.create(
#         model = "deepseek-reasoner",
#         messages = [
#             {"role": "system", "content": class_definitions_instruction},
#             {"role": "user", "content": prompt},
#         ],
#         stream = False
#     )
#     class_def_y_pred_deeps.append(response.choices[0].message.content)
#     class_def_thinking_deeps.append(response.choices[0].message.reasoning_content)
#     print(response.choices[0].message.content)
#
# end = time.time()
# print(f"Time taken: {end - start} seconds")
# time_deeps_class_definitions = end - start
# time_deeps_class_definitions_df = pd.DataFrame({"time": [time_deeps_class_definitions]})
# time_deeps_class_definitions_df.to_csv("../exp/times_LLMs/DeepSeek/time_deeps_class_definitions_prompt.csv", sep = ",", index = False)
#
# # value counts for array
# counts_class_def_deeps = pd.Series(class_def_y_pred_deeps).value_counts()
# print(counts_class_def_deeps)
#
# # convert YES to 1 and NO to 0
# class_def_y_pred_deeps = [1 if response == "YES" else 0 for response in class_def_y_pred_deeps]
#
# # save the array to a csv file
# class_def_df_deeps = pd.DataFrame(class_def_y_pred_deeps, columns = ["y_pred"])
# class_def_df_deeps.to_csv("../exp/preds_LLMs/DeepSeek/y_pred_deeps_class_definitions_prompt.csv", sep = ",", index = False)
#
# class_def_prompt_df_thinking_deeps = pd.DataFrame(class_def_thinking_deeps, columns = ["thinking"])
# class_def_prompt_df_thinking_deeps.to_csv("../exp/preds_LLMs/DeepSeek/Thinking/thinking_deeps_class_def_prompt.csv", sep = ",", index = False)

In [26]:
# profiled_simple_y_pred_deeps = []
# profiled_simple_thinking_deeps = []
#
# client = OpenAI(api_key = os.environ.get("DeepSeek_API_Key"), base_url = "https://api.deepseek.com")
#
# # measure time in seconds
# start = time.time()
#
# # iterate over the test set and save the response for each prompt in an array
# for prompt in X_test_profiled_simple_prompt:
#     response = client.chat.completions.create(
#         model = "deepseek-reasoner",
#         messages = [
#             {"role": "system", "content": profiled_simple_instruction},
#             {"role": "user", "content": prompt},
#         ],
#         stream = False
#     )
#     profiled_simple_y_pred_deeps.append(response.choices[0].message.content)
#     profiled_simple_thinking_deeps.append(response.choices[0].message.reasoning_content)
#     print(response.choices[0].message.content)
#
# end = time.time()
# print(f"Time taken: {end - start} seconds")
# time_deeps_profiled_simple = end - start
# time_deeps_profiled_simple_df = pd.DataFrame({"time": [time_deeps_profiled_simple]})
# time_deeps_profiled_simple_df.to_csv("../exp/times_LLMs/DeepSeek/time_deeps_profiled_simple_prompt.csv", sep = ",", index = False)
#
# # value counts for array
# counts_profiled_simple_deeps = pd.Series(profiled_simple_y_pred_deeps).value_counts()
# print(counts_profiled_simple_deeps)
#
# # convert YES to 1 and NO to 0
# profiled_simple_y_pred_deeps_val = [1 if response == "YES" else 0 for response in profiled_simple_y_pred_deeps]
#
# # save the array to a csv file
# profiled_simple_df_deeps = pd.DataFrame(profiled_simple_y_pred_deeps_val, columns = ["y_pred"])
# profiled_simple_df_deeps.to_csv("../exp/preds_LLMs/DeepSeek/y_pred_deeps_profiled_simple_prompt.csv", sep = ",", index = False)
#
# profiled_simple_prompt_df_thinking_deeps = pd.DataFrame(profiled_simple_thinking_deeps, columns = ["thinking"])
# profiled_simple_prompt_df_thinking_deeps.to_csv("../exp/preds_LLMs/DeepSeek/Thinking/thinking_deeps_profiled_simple_prompt.csv", sep = ",", index = False)

In [27]:
# few_shot_y_pred_deeps = []
# few_shot_thinking_deeps = []
#
# client = OpenAI(api_key = os.environ.get("DeepSeek_API_Key"), base_url = "https://api.deepseek.com")
#
# # measure time in seconds
# start = time.time()
#
# # iterate over the test set and save the response for each prompt in an array
# for prompt in X_test_few_shot_prompt:
#     response = client.chat.completions.create(
#         model = "deepseek-reasoner",
#         messages = [
#             {"role": "system", "content": few_shot_instruction},
#             {"role": "user", "content": prompt},
#         ],
#         stream = False
#     )
#     few_shot_y_pred_deeps.append(response.choices[0].message.content)
#     few_shot_thinking_deeps.append(response.choices[0].message.reasoning_content)
#     print(response.choices[0].message.content)
#
# end = time.time()
# print(f"Time taken: {end - start} seconds")
# time_deeps_few_shot = end - start
# time_deeps_few_shot_df = pd.DataFrame({"time": [time_deeps_few_shot]})
# time_deeps_few_shot_df.to_csv("../exp/times_LLMs/DeepSeek/time_deeps_few_shot_prompt.csv", sep = ",", index = False)
#
# # value counts for array
# counts_few_shot_deeps = pd.Series(few_shot_y_pred_deeps).value_counts()
# print(counts_few_shot_deeps)
#
# # convert YES to 1 and NO to 0
# few_shot_y_pred_deeps_val = [1 if response == "YES" else 0 for response in few_shot_y_pred_deeps]
#
# # save the array to a csv file
# few_shot_df_deeps = pd.DataFrame(few_shot_y_pred_deeps_val, columns = ["y_pred"])
# few_shot_df_deeps.to_csv("../exp/preds_LLMs/DeepSeek/y_pred_deeps_few_shot_prompt.csv", sep = ",", index = False)
#
# few_shot_prompt_df_thinking_deeps = pd.DataFrame(few_shot_thinking_deeps, columns = ["thinking"])
# few_shot_prompt_df_thinking_deeps.to_csv("../exp/preds_LLMs/DeepSeek/Thinking/thinking_deeps_few_shot_prompt.csv", sep = ",", index = False)

In [28]:
# vignette_y_pred_deeps = []
# vignette_thinking_deeps = []
#
# client = OpenAI(api_key = os.environ.get("DeepSeek_API_Key"), base_url = "https://api.deepseek.com")
#
# # measure time in seconds
# start = time.time()
#
# # iterate over the test set and save the response for each prompt in an array
# for prompt in X_test_vignette_prompt:
#     response = client.chat.completions.create(
#         model = "deepseek-reasoner",
#         messages = [
#             {"role": "system", "content": vignette_instruction},
#             {"role": "user", "content": prompt},
#         ],
#         stream = False
#     )
#     vignette_y_pred_deeps.append(response.choices[0].message.content)
#     vignette_thinking_deeps.append(response.choices[0].message.reasoning_content)
#     print(response.choices[0].message.content)
#
# end = time.time()
# print(f"Time taken: {end - start} seconds")
# time_deeps_vignette = end - start
# time_deeps_vignette_df = pd.DataFrame({"time": [time_deeps_vignette]})
# time_deeps_vignette_df.to_csv("../exp/times_LLMs/DeepSeek/time_deeps_vignette_prompt.csv", sep = ",", index = False)
#
# # value counts for array
# counts_vignette_deeps = pd.Series(vignette_y_pred_deeps).value_counts()
# print(counts_vignette_deeps)
#
# # convert YES to 1 and NO to 0
# vignette_y_pred_deeps_val = [1 if response == "YES" else 0 for response in vignette_y_pred_deeps]
#
# # save the array to a csv file
# vignette_df_deeps = pd.DataFrame(vignette_y_pred_deeps_val, columns = ["y_pred"])
# vignette_df_deeps.to_csv("../exp/preds_LLMs/DeepSeek/y_pred_deeps_vignette_prompt.csv", sep = ",", index = False)
#
# vignette_prompt_df_thinking_deeps = pd.DataFrame(vignette_thinking_deeps, columns = ["thinking"])
# vignette_prompt_df_thinking_deeps.to_csv("../exp/preds_LLMs/DeepSeek/Thinking/thinking_deeps_vignette_prompt.csv", sep = ",", index = False)
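
The five DeepSeek cells above differ only in the prompt array and the output file names. Below is a minimal sketch of how the shared loop and timing could be factored out. `run_prompt_set` and `fake_ask` are hypothetical names, and the stand-in `fake_ask` replaces the real API call so the example runs offline.

```python
import time


def run_prompt_set(prompts, ask):
    """Send each prompt through `ask` and return (responses, elapsed_seconds).

    `ask` is any callable that takes one prompt string and returns the
    model's reply, so the same loop can wrap different API clients.
    """
    responses = []
    start = time.perf_counter()
    for prompt in prompts:
        responses.append(ask(prompt))
    elapsed = time.perf_counter() - start
    return responses, elapsed


# stand-in for an API call (hypothetical): answers YES if "stress" is mentioned
def fake_ask(prompt):
    return "YES" if "stress" in prompt.lower() else "NO"


replies, seconds = run_prompt_set(["T1 Stress: 0.44", "T1 BMI: 1.0"], fake_ask)
print(replies, f"{seconds:.4f} s")
```

With a real client, `ask` would wrap `client.chat.completions.create(...)` and return `response.choices[0].message.content`. Passing the loop variable through a single function also makes it harder to send a fixed prompt on every iteration by mistake.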

### 1.7 Grok (xAI)

#### 1.7.1 Testing prompting

In [8]:
# client = OpenAI(
#     api_key = os.environ.get("XAI_API_KEY"),
#     base_url = "https://api.x.ai/v1",
# )
#
# completion = client.chat.completions.create(
#     model = "grok-3-beta",
#     # model = "grok-3-mini-beta",
#     messages = [
#         {"role": "system", "content": simple_instruction},
#         {"role": "user", "content": X_test_simple_prompt[0]},
#     ],
#     # reasoning_effort = "high"
# )
# print(completion.choices[0].message)

ChatCompletionMessage(content='NO', refusal=None, role='assistant', annotations=None, audio=None, function_call=None, tool_calls=None, reasoning_content='First, the user asks me to respond only with YES or NO to whether the person develops a psychological disorder between T1 and T2.\n\nI have data points at T1 and changes from T1 to T2 for various psychological measures.\n\nKey data includes:\n\n- T1 General psychopathology: Global Severity Index (GSI): 0.0172227087467131\n\n- Change in General psychopathology: GSI (T2-T1): -0.8256637821414634\n\nGSI is a measure of general psychopathology. A higher GSI typically indicates more severe symptoms.\n\nAt T1, GSI is very low (0.0172), suggesting low psychopathology.\n\nThe change from T1 to T2 is -0.8257, which is negative. This means the GSI decreased from T1 to T2. So, at T2, the GSI is even lower.\n\nA decrease in GSI suggests an improvement in mental health, not the development of a disorder.\n\nThe question is about developing a psycho

In [9]:
# completion.choices[0].message.content

'NO'

#### 1.7.2 Prompting with Grok 3 Beta

In [14]:
simple_prompt_y_pred_grok = []

client = OpenAI(
    api_key = os.environ.get("XAI_API_KEY"),
    base_url = "https://api.x.ai/v1",
)

# measure time in seconds
start = time.time()

# iterate over the test set and save the response for each prompt in an array
for prompt in X_test_simple_prompt:
    completion = client.chat.completions.create(
        model = "grok-3-beta",
        messages = [
            {"role": "system", "content": simple_instruction},
            {"role": "user", "content": prompt},
        ],
    )
    simple_prompt_y_pred_grok.append(completion.choices[0].message.content)
    print(completion.choices[0].message.content)

end = time.time()
print(f"Time taken: {end - start} seconds")
time_grok_simple_prompt = end - start
time_grok_simple_prompt_df = pd.DataFrame({"time": [time_grok_simple_prompt]})
time_grok_simple_prompt_df.to_csv("../exp/times_LLMs/Grok/time_grok_simple_prompt.csv", sep = ",", index = False)

# value counts for array
counts_simple_grok = pd.Series(simple_prompt_y_pred_grok).value_counts()
print(counts_simple_grok)

# convert YES to 1 and NO to 0
simple_prompt_y_pred_grok = [1 if response == "YES" else 0 if response == "NO" else np.nan for response in simple_prompt_y_pred_grok]

# save the array to a csv file
simple_prompt_df_grok = pd.DataFrame(simple_prompt_y_pred_grok, columns = ["y_pred"])
simple_prompt_df_grok.to_csv("../exp/preds_LLMs/Grok/y_pred_grok_simple_prompt.csv", sep = ",", index = False)

Does the following person develop a psychological disorder between time point T1 and T2? Data: T1 Positive mental health: -0.0279170753483525, T1 Social support: 0.1421238143169474, T1 General self-efficacy: 0.3649793457412237, T1 Life satisfaction: 0.3372886835461141, T1 Stress: 0.4419361727222826, T1 Problem-focused coping: 1.7319368683783989, T1 Emotion-focused coping: 0.2078300133169115, T1 Anxiety sensitivity: 0.1594156886399411, T1 Fear of bodily sensations: 0.2863750811390516, T1 Dysfunctional attitudes: 0.2750686254386546, T1 General psychopathology: Global Severity Index (GSI): 0.0172227087467131, Education: 2.0, T1 BMI: 1.0, Socioeconomic status: 2.0, Change in Positive mental health (T2-T1): -0.7520166349788642, Change in Social support (T2-T1): 0.7057099569575698, Change in General self-efficacy (T2-T1): -0.1819798191096402, Change in Life satisfaction (T2-T1): 0.1407378746091848, Change in Anxiety sensitivity (T2-T1): -0.8617238998696288, Change in Fear of bodily sensation

In [13]:
class_def_y_pred_grok = []

client = OpenAI(
    api_key = os.environ.get("XAI_API_KEY"),
    base_url = "https://api.x.ai/v1",
)

# measure time in seconds
start = time.time()

# iterate over the test set and save the response for each prompt in an array
for prompt in X_test_class_definitions_prompt:
    completion = client.chat.completions.create(
        model = "grok-3-beta",
        messages = [
            {"role": "system", "content": class_definitions_instruction},
            {"role": "user", "content": prompt},
        ],
    )
    class_def_y_pred_grok.append(completion.choices[0].message.content)
    print(completion.choices[0].message.content)

end = time.time()
print(f"Time taken: {end - start} seconds")
time_grok_class_definitions = end - start
time_grok_class_definitions_df = pd.DataFrame({"time": [time_grok_class_definitions]})
time_grok_class_definitions_df.to_csv("../exp/times_LLMs/Grok/time_grok_class_definitions_prompt.csv", sep = ",", index = False)

# value counts for array
counts_class_def_grok = pd.Series(class_def_y_pred_grok).value_counts()
print(counts_class_def_grok)

# convert YES to 1 and NO to 0; any other response becomes NaN
class_def_y_pred_grok = [1 if response == "YES" else 0 if response == "NO" else np.nan for response in class_def_y_pred_grok]

# save the array to a csv file
class_def_df_grok = pd.DataFrame(class_def_y_pred_grok, columns = ["y_pred"])
class_def_df_grok.to_csv("../exp/preds_LLMs/Grok/y_pred_grok_class_definitions_prompt.csv", sep = ",", index = False)

Given the following data, classify whether this person develops a psychological disorder between T1 and T2 according to the instructions provided and data measured by F-DIPS structural interviews. Respond with YES or NO. Instructions: NO: The person did not develop any new psychological disorder between T1 and T2. This means they were either healthy at both time points, had an ongoing disorder across both time points, or had already recovered from a previous disorder. YES: The person was psychologically healthy at T1 but developed a psychological disorder at T2. Does the following person develop a psychological disorder between time point T1 and T2? Data: T1 Positive mental health: -0.0279170753483525, T1 Social support: 0.1421238143169474, T1 General self-efficacy: 0.3649793457412237, T1 Life satisfaction: 0.3372886835461141, T1 Stress: 0.4419361727222826, T1 Problem-focused coping: 1.7319368683783989, T1 Emotion-focused coping: 0.2078300133169115, T1 Anxiety sensitivity: 0.1594156886

In [15]:
profiled_simple_y_pred_grok = []

client = OpenAI(
    api_key = os.environ.get("XAI_API_KEY"),
    base_url = "https://api.x.ai/v1",
)

# measure time in seconds
start = time.time()

# iterate over the test set and save the response for each prompt in an array
for prompt in X_test_profiled_simple_prompt:
    completion = client.chat.completions.create(
        model = "grok-3-beta",
        messages = [
            {"role": "system", "content": profiled_simple_instruction},
            {"role": "user", "content": prompt},
        ],
    )
    profiled_simple_y_pred_grok.append(completion.choices[0].message.content)
    print(completion.choices[0].message.content)

end = time.time()
print(f"Time taken: {end - start} seconds")
time_grok_profiled_simple = end - start
time_grok_profiled_simple_df = pd.DataFrame({"time": [time_grok_profiled_simple]})
time_grok_profiled_simple_df.to_csv("../exp/times_LLMs/Grok/time_grok_profiled_simple_prompt.csv", sep = ",", index = False)

# value counts for array
counts_profiled_simple_grok = pd.Series(profiled_simple_y_pred_grok).value_counts()
print(counts_profiled_simple_grok)

# convert YES to 1 and NO to 0; any other response becomes NaN
profiled_simple_y_pred_grok_val = [1 if response == "YES" else 0 if response == "NO" else np.nan for response in profiled_simple_y_pred_grok]

# save the array to a csv file
profiled_simple_df_grok = pd.DataFrame(profiled_simple_y_pred_grok_val, columns = ["y_pred"])
profiled_simple_df_grok.to_csv("../exp/preds_LLMs/Grok/y_pred_grok_profiled_simple_prompt.csv", sep = ",", index = False)

You are an expert in clinical psychology and mental health diagnostics. You are trained to analyze longitudinal data to assess whether a person develops a psychological disorder over time. You know how to analyze data measured with F-DIPS structural interviews at two time points T1 and T2. Does the following person develop a psychological disorder between time point T1 and T2? Data: T1 Positive mental health: -0.0279170753483525, T1 Social support: 0.1421238143169474, T1 General self-efficacy: 0.3649793457412237, T1 Life satisfaction: 0.3372886835461141, T1 Stress: 0.4419361727222826, T1 Problem-focused coping: 1.7319368683783989, T1 Emotion-focused coping: 0.2078300133169115, T1 Anxiety sensitivity: 0.1594156886399411, T1 Fear of bodily sensations: 0.2863750811390516, T1 Dysfunctional attitudes: 0.2750686254386546, T1 General psychopathology: Global Severity Index (GSI): 0.0172227087467131, Education: 2.0, T1 BMI: 1.0, Socioeconomic status: 2.0, Change in Positive mental health (T2-T1

In [16]:
few_shot_y_pred_grok = []

client = OpenAI(
    api_key = os.environ.get("XAI_API_KEY"),
    base_url = "https://api.x.ai/v1",
)

# measure time in seconds
start = time.time()

# iterate over the test set and save the response for each prompt in an array
for prompt in X_test_few_shot_prompt:
    completion = client.chat.completions.create(
        model = "grok-3-beta",
        messages = [
            {"role": "system", "content": few_shot_instruction},
            {"role": "user", "content": prompt},
        ],
    )
    few_shot_y_pred_grok.append(completion.choices[0].message.content)
    print(completion.choices[0].message.content)

end = time.time()
print(f"Time taken: {end - start} seconds")
time_grok_few_shot = end - start
time_grok_few_shot_df = pd.DataFrame({"time": [time_grok_few_shot]})
time_grok_few_shot_df.to_csv("../exp/times_LLMs/Grok/time_grok_few_shot_prompt.csv", sep = ",", index = False)

# value counts for array
counts_few_shot_grok = pd.Series(few_shot_y_pred_grok).value_counts()
print(counts_few_shot_grok)

# convert YES to 1 and NO to 0; any other response becomes NaN
few_shot_y_pred_grok_val = [1 if response == "YES" else 0 if response == "NO" else np.nan for response in few_shot_y_pred_grok]

# save the array to a csv file
few_shot_df_grok = pd.DataFrame(few_shot_y_pred_grok_val, columns = ["y_pred"])
few_shot_df_grok.to_csv("../exp/preds_LLMs/Grok/y_pred_grok_few_shot_prompt.csv", sep = ",", index = False)

Please refer to the following examples of correctly classified data points with 'Total mental disorders incidence point prevalence' being the true classification: ['Example 1: T1 Positive mental health: -0.0279170753483525, T1 Social support: -0.167356999046327, T1 General self-efficacy: -0.5416595949681524, T1 Life satisfaction: -0.471818725223128, T1 Stress: 0.241958427407229, T1 Problem-focused coping: 0.8532782878883876, T1 Emotion-focused coping: 0.2078300133169115, T1 Anxiety sensitivity: 0.6878631096473256, T1 Fear of bodily sensations: 0.3819042614516962, T1 Dysfunctional attitudes: -0.2667599319073422, T1 General psychopathology: Global Severity Index (GSI): -0.2733217032704306, Education: 2.0, T1 BMI: 1.0, Socioeconomic status: 1.0, Change in Positive mental health (T2-T1): -0.7520166349788642, Change in Social support (T2-T1): 0.8692924015129987, Change in General self-efficacy (T2-T1): -0.1819798191096402, Change in Life satisfaction (T2-T1): 0.6686513409272499, Change in A

In [17]:
vignette_y_pred_grok = []

client = OpenAI(
    api_key = os.environ.get("XAI_API_KEY"),
    base_url = "https://api.x.ai/v1",
)

# measure time in seconds
start = time.time()

# iterate over the test set and save the response for each prompt in an array
for prompt in X_test_vignette_prompt:
    completion = client.chat.completions.create(
        model = "grok-3-beta",
        messages = [
            {"role": "system", "content": vignette_instruction},
            {"role": "user", "content": prompt},
        ],
    )
    vignette_y_pred_grok.append(completion.choices[0].message.content)
    print(completion.choices[0].message.content)

end = time.time()
print(f"Time taken: {end - start} seconds")
time_grok_vignette = end - start
time_grok_vignette_df = pd.DataFrame({"time": [time_grok_vignette]})
time_grok_vignette_df.to_csv("../exp/times_LLMs/Grok/time_grok_vignette_prompt.csv", sep = ",", index = False)

# value counts for array
counts_vignette_grok = pd.Series(vignette_y_pred_grok).value_counts()
print(counts_vignette_grok)

# convert YES to 1 and NO to 0; any other response becomes NaN
vignette_y_pred_grok_val = [1 if response == "YES" else 0 if response == "NO" else np.nan for response in vignette_y_pred_grok]

# save the array to a csv file
vignette_df_grok = pd.DataFrame(vignette_y_pred_grok_val, columns = ["y_pred"])
vignette_df_grok.to_csv("../exp/preds_LLMs/Grok/y_pred_grok_vignette_prompt.csv", sep = ",", index = False)

A woman with a BMI of 1.0, an education level of 2.0, and a socioeconomic status of 2.0 has the following psychological profile: At time point 1, she showed average (-0.028) positive mental health, average (0.142) social support, and average (0.365) general self-efficacy. Her life satisfaction was average (0.337), and she relied on above average (1.732) problem-focused and average (0.208) emotion-focused coping strategies. Anxiety sensitivity was average (0.159), and her fear of bodily sensations was average (0.286). She reported average (0.275) levels of dysfunctional attitudes and average (0.017) levels of general psychopathology. Her stress level was average (0.442). By time point 2, approximately 17 months later, she reported similar (-0.752) positive mental health, similar (0.706) social support, and similar (-0.182) self-efficacy. Life satisfaction was similar (0.141). Anxiety sensitivity was reported to be similar (-0.862), and fear of bodily sensations was similar (-0.847). Dys

### 1.8 Mistral