Used OpenAI's o1-preview to generate an ontology. It was first prompted to generate tasks (via the ChatGPT interface):

```
you are an expert in psychological research.  Researchers in the field of psychology study specific 
psychological constructs, which are the building blocks of the mind, such as 
memory, attention, theory of mind, and so on.  To study them, researchers use 
experimental tasks or surveys, which are meant to measure behavior related to the 
constructs.  In many cases psychological tasks have several different experimental
conditions, which are meant to manipulate the construct in different ways.  These are 
commonly compared to one another in order to measure the effect of the manipulation; we 
refer to these comparisons as contrasts.

your job is to generate a list of all of the psychological tasks and surveys used by researchers in this field.  Please be as specific and exhaustive as possible. 

you should return these as a JSON list, with no additional text.
```

This identified a list of 144 tasks.  These were then used to generate descriptions using the following prompt:

```
for each of these tasks, please generate:

1) a brief description of the task
2) a list of the psychological constructs that the task is used to assess
3) a small number of references for each task

Please return these as a dictionary of sub-dictionaries, with the task names as keys and with the elements 'description', 'constructs', and 'references' within each sub-dictionary
```

Results from this were stored in [gpt4_task_ontology.json]().

In [6]:
#autoreload
%load_ext autoreload
%autoreload 2

import json
import os
from openai import OpenAI
from dotenv import load_dotenv
from ontology_learner.gpt4_batch_utils import get_batch_results, save_batch_results
from ontology_learner.json_utils import load_jsonl, parse_jsonl_results
from llm_query.chat_client import ChatClientFactory
from tqdm import tqdm
from pathlib import Path
import fasttext
import time
import hashlib
import secrets

from gpt_term_mining import (
    clean_task_name,
    clean_task_ontology
)

# Load environment variables from .env file
load_dotenv()

datadir = Path(os.getenv('DATADIR'))
print(datadir)




The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload
/Users/poldrack/Dropbox/data/ontology-learner/data


In [7]:
with open(datadir / 'gpt4/gpt4_task_ontology.json') as f:
    task_ontology = json.load(f)

print(len(task_ontology.keys()))

144


Clean up the ontology, moving acronyms out of the task names and into a separate `acronym` field.
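
The cleaning helpers (`clean_task_name`, `clean_task_ontology`) are imported from `gpt_term_mining` and are not shown in this notebook. As a rough illustration only, the sketch below assumes that acronyms appear as a trailing parenthetical in the task label and that names are lowercased, which is consistent with the outputs shown here but may not match the real implementation. The function name `split_acronym_sketch` is hypothetical.

```python
import re

def split_acronym_sketch(task_label):
    """Hypothetical stand-in for clean_task_name: returns (name, acronym).

    Assumes the acronym, if any, is a trailing parenthetical; returns an
    empty list as the acronym when none is found.
    """
    label = task_label.strip().lower()
    match = re.match(r'^(.*?)\s*\(([^()]*)\)\s*$', label)
    if match is None:
        return label, []
    return match.group(1).strip(), match.group(2).strip()

print(split_acronym_sketch('Mindful Attention Awareness Scale (MAAS)'))
print(split_acronym_sketch('Stroop Task'))
```

Labels with a parenthetical in the middle (rather than at the end) are left untouched by this sketch.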

In [9]:
ontology_clean

{'stroop task': {'description': "A cognitive task where participants name the ink color of a word that may spell out a different color (e.g., the word 'Red' printed in blue ink), measuring the ability to inhibit cognitive interference.",
  'constructs': ['Attention',
   'Cognitive Control',
   'Inhibitory Control',
   'Executive Function'],
  'references': ['Stroop, J. R. (1935). Studies of interference in serial verbal reactions. Journal of Experimental Psychology, 18(6), 643–662.'],
  'acronym': []},
 'wisconsin card sorting task': {'description': 'A test where participants sort cards according to different rules (color, shape, number) without explicit instruction, assessing cognitive flexibility and problem-solving abilities.',
  'constructs': ['Cognitive Flexibility',
   'Executive Function',
   'Set Shifting',
   'Problem Solving'],
  'references': ['Berg, E. A. (1948). A simple objective technique for measuring flexibility in thinking. The Journal of General Psychology, 39(1), 15

Here we extract all of the constructs for further annotation.

In [10]:
constructs = {}

# invert the ontology: map each construct to the list of tasks that measure it
for taskname, taskdict in ontology_clean.items():
    for construct in taskdict['constructs']:
        constructs.setdefault(construct, []).append(taskname)

print(len(constructs))

186


In [11]:
# create json list of constructs

with open(datadir / 'gpt4/gpt4_construct_list.json', 'w') as f:
    json.dump(list(constructs.keys()), f, indent=2)


This list was then fed into o1-preview with the following prompt:

```
The following is a list of psychological constructs identified above.  These represent only a fraction of all of the constructs that are studied in the field.  please use your expert knowledge of psychology to expand this list to contain a wider selection of the constructs studied within the field.  Please return your result as a json list.
```

The result was saved to [gpt4_expanded_construct_list.json]().

In [12]:
with open(datadir / 'gpt4/gpt4_expanded_construct_list.json') as f:
    expanded_constructs = json.load(f)

print(len(expanded_constructs))
expanded_constructs = list(set(expanded_constructs))
print(len(expanded_constructs))


866
807


I tried to expand these further, but the ChatGPT interface would not process a list of this length, so from here on we use the API.  We also switch to GPT-4o because o1-preview is too expensive.

In [13]:
def get_construct_prompt(construct):
    prompt = f"""
# CONTEXT #
Researchers in the field of cognitive neuroscience and psychology study specific 
psychological constructs, which are the building blocks of the mind, such as 
memory, attention, theory of mind, and so on.  

# OBJECTIVE #
Your job is to analyze a specific construct: {construct}.

- You should first determine whether it is truly a psychological construct, or whether it is some 
other kind of thing.  For example, "working memory" is a psychological construct, 
but "n-back task" is not (it is a task, not a construct).  Include a 'type' key in your response with the value 'construct' if it is 
truly a psychological construct or 'other' if it is not.

If it is a psychological construct, please do the following:
- provide a short description of the construct.
- provide a short list of widely cited publications that describe the construct. Include a 
'references' key in your response with a list of the references.
- provide a list of commonly used tasks or surveys that measure the construct.  Include a 'tasks' key in your response with a list of the tasks.
- Provide a list of other constructs that are related to this construct.  Include a 'related_constructs' key in your response with a list of the related constructs.
Be as specific as possible, using names that are as specific as possible.

# RESPONSE #
Please return the results in JSON format.  Use the following keys:
- type: 'construct', or 'other'
- description: a short description of the construct
- references: a list of references that use the construct
- tasks: a list of tasks used to measure the construct
- related_constructs: a list of other constructs that are related to this construct
Respond only with JSON, without any additional text or description.
"""
    return prompt
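
Because the prompt asks for a fixed set of JSON keys, each response can be checked before it is merged into the ontology. The following is a minimal sketch of such a check; the function name `check_construct_response` and the exact rules are assumptions for illustration, not part of the original pipeline.

```python
import json

REQUIRED_KEYS = {'type', 'description', 'references', 'tasks', 'related_constructs'}
LIST_KEYS = ('references', 'tasks', 'related_constructs')

def check_construct_response(raw_text):
    """Return (parsed, problems); problems is empty when the response looks valid."""
    try:
        parsed = json.loads(raw_text)
    except json.JSONDecodeError as e:
        return None, [f'not valid JSON: {e}']
    if not isinstance(parsed, dict):
        return None, ['top-level JSON value is not an object']
    problems = []
    if parsed.get('type') not in ('construct', 'other'):
        problems.append("'type' must be 'construct' or 'other'")
    # the prompt only asks for the remaining keys when the item is a construct
    if parsed.get('type') == 'construct':
        for key in sorted(REQUIRED_KEYS - parsed.keys()):
            problems.append(f'missing key: {key}')
        for key in LIST_KEYS:
            if key in parsed and not isinstance(parsed[key], list):
                problems.append(f'{key} should be a list')
    return parsed, problems

parsed, problems = check_construct_response('{"type": "construct", "tasks": "Stroop"}')
print(problems)
```

Responses typed as `'other'` pass with only the `type` key, since the prompt requests the remaining fields only for true constructs.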


Create batch submission

In [14]:
expanded_constructs


['Framing Effect',
 'Chaining',
 'Syntax',
 'Bipolar Disorder',
 'Behavioral Economics',
 'Obedience',
 'Verbal Recall',
 'Spatial Memory',
 'Forgiveness',
 'Acceptance and Commitment',
 'Moral Disgust',
 'Equity Theory',
 'Ego Depletion',
 'Phonological Loop',
 'Self-Esteem',
 'Personality Disorders',
 'Law of Pragnanz',
 'Attention',
 'Microaggressions',
 'Wisdom',
 'Prosocial Behavior',
 'Vicarious Reinforcement',
 'Haptic Memory',
 'Semantic Memory',
 'Cognitive Distortions',
 'Cross-Modal Perception',
 'Sensory Memory',
 'Expressive Suppression',
 'Ethnocentrism',
 'Perception',
 'Executive Attention',
 'Health Status',
 'Self-Evaluation',
 'Attention Deficits',
 'Evolutionary Psychology',
 'Modeling',
 'Job Stress',
 'Metacognition',
 'Sleep Deprivation',
 'Visual Processing',
 'Collective Efficacy',
 'Psychological Contract',
 'Just-World Hypothesis',
 'Intergroup Conflict',
 'Archetypes',
 'Cognitive Empathy',
 'Technostress',
 'Confirmation Bias',
 'Morphological Processing',


In [9]:

outdir = datadir / 'gpt4/construct_refinement_results'
outdir.mkdir(exist_ok=True, parents=True)

# only run the batch if no results file has been saved yet
existing_result_files = sorted(outdir.glob('*.jsonl'))
run_batch = len(existing_result_files) == 0

if run_batch:

    api_key = os.environ.get("OPENAI")

    system_msg = """
        You are an expert in psychology and neuroscience.
        You should be as specific and as comprehensive as possible in your responses.
        Your response should be a JSON object with no additional text.  
        """

    # wanted to use o1-preview but it's too expensive so we fall back to GPT-4o
    model = 'gpt-4o'
    client = ChatClientFactory.create_client("openai", api_key, 
                                                system_msg=system_msg,
                                                model=model)
    
    batchfile = datadir / 'gpt4/gpt4_construct_expansion_batch.jsonl'

    # opening in 'w' mode overwrites any batch file left over from an earlier run
    with open(batchfile, 'w') as f:
        for construct in expanded_constructs:
            prompt = get_construct_prompt(construct)
            try:
                batch_request = client.create_batch_request(construct, prompt)
            except Exception as e:
                print(f'error processing {construct}: {e}')
                continue
            f.write(json.dumps(batch_request) + '\n')
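
The `create_batch_request` method belongs to the `llm_query` client and is not shown here. For orientation, the sketch below builds one line of a batch input file in the format the OpenAI Batch API expects for the `/v1/chat/completions` endpoint. The helper name `build_batch_request_sketch` is hypothetical, and the real client may construct its requests differently.

```python
import json

def build_batch_request_sketch(custom_id, prompt, model='gpt-4o', system_msg=None):
    """Hypothetical helper: build one Batch API request for /v1/chat/completions."""
    messages = []
    if system_msg is not None:
        messages.append({'role': 'system', 'content': system_msg})
    messages.append({'role': 'user', 'content': prompt})
    return {
        'custom_id': custom_id,  # used to match each result back to its input
        'method': 'POST',
        'url': '/v1/chat/completions',
        'body': {'model': model, 'messages': messages},
    }

request = build_batch_request_sketch('Mindfulness', 'Analyze the construct: Mindfulness')
print(json.dumps(request))
```

Using the construct name as `custom_id` only works if every name is unique within the batch, which is why the list is deduplicated beforehand.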


Run batch request

In [10]:

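if run_batch:
    batch_client = OpenAI(api_key=api_key)

    # upload the batch file, closing the file handle afterwards
    with open(batchfile, "rb") as f:
        batch_input_file = batch_client.files.create(file=f, purpose="batch")

    batch_input_file_id = batch_input_file.id

    batch_metadata = batch_client.batches.create(
        input_file_id=batch_input_file_id,
        endpoint="/v1/chat/completions",
        completion_window="24h",
        metadata={
            "description": "construct annotation"
        }
    )

    print(batch_client.batches.retrieve(batch_metadata.id))

    # poll until the batch reaches a terminal state, so that a failed,
    # expired, or cancelled batch does not leave the loop running forever
    terminal_states = {'completed', 'failed', 'expired', 'cancelled'}
    status = batch_client.batches.retrieve(batch_metadata.id).status
    print(status)
    while status not in terminal_states:
        time.sleep(60)
        status = batch_client.batches.retrieve(batch_metadata.id).status
        print(status)

    if status != 'completed':
        raise RuntimeError(f'batch {batch_metadata.id} ended with status {status}')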


In [11]:
if run_batch:
    batch_results = get_batch_results(batch_client, batch_metadata.id)

    outfile = save_batch_results(batch_results, batch_metadata.id, outdir)


Load construct refinement results

In [12]:
construct_refinement_result_file = list(outdir.glob('*.jsonl'))[0]

construct_refinement_results = parse_jsonl_results(load_jsonl(construct_refinement_result_file))

print(len(construct_refinement_results))

807
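
The helpers `load_jsonl` and `parse_jsonl_results` come from `ontology_learner.json_utils` and are not shown in this notebook. As a rough sketch of what such parsing involves, the code below assumes the standard Batch API output format, where each line carries a `custom_id` and the chat completion under `response.body`. The function name `parse_batch_output_sketch` and the example record are made up for illustration.

```python
import json

def parse_batch_output_sketch(jsonl_text):
    """Hypothetical parser: map custom_id -> parsed JSON content of the model reply."""
    results = {}
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue
        record = json.loads(line)
        response = record.get('response') or {}
        if record.get('error') or response.get('status_code') != 200:
            continue  # skip requests that failed
        content = response['body']['choices'][0]['message']['content']
        try:
            results[record['custom_id']] = json.loads(content)
        except json.JSONDecodeError:
            continue  # the reply was not valid JSON
    return results

# made-up record, shaped like one line of a Batch API output file
example_line = json.dumps({
    'custom_id': 'Mindfulness',
    'error': None,
    'response': {
        'status_code': 200,
        'body': {'choices': [{'message': {'content': '{"type": "construct"}'}}]},
    },
})
print(parse_batch_output_sketch(example_line))
```

This sketch silently drops failed or unparseable replies; in practice it may be preferable to log them so that the affected constructs can be resubmitted.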


In [13]:
construct_refinement_results['Mindfulness']

{'type': 'construct',
 'description': 'Mindfulness is a psychological construct that refers to the quality of being present and fully engaged with the current moment, without judgment or distraction. It involves awareness and acceptance of feelings, thoughts, and bodily sensations in a non-reactive way.',
 'references': ['Kabat-Zinn, J. (1990). Full Catastrophe Living: Using the Wisdom of Your Body and Mind to Face Stress, Pain, and Illness. Delta.',
  'Brown, K. W., & Ryan, R. M. (2003). The benefits of being present: Mindfulness and its role in psychological well-being. Journal of Personality and Social Psychology, 84(4), 822-848.',
  'Baer, R. A. (2003). Mindfulness Training as a Clinical Intervention: A Conceptual and Empirical Review. Clinical Psychology: Science and Practice, 10(2), 125-143.'],
 'tasks': ['Mindful Attention Awareness Scale (MAAS)',
  'Five Facet Mindfulness Questionnaire (FFMQ)',
  'Kentucky Inventory of Mindfulness Skills (KIMS)'],
 'related_constructs': ['Atten

Combine gpt-4 task list with list of tasks from construct refinement.

In [14]:
import copy

# deep copy so that appending constructs below does not also modify ontology_clean
expanded_task_ontology = copy.deepcopy(ontology_clean)

for construct, result in construct_refinement_results.items():
    if result['type'] == 'construct':
        for task in result['tasks']:
            taskname_clean, acronym = clean_task_name(task)
            if taskname_clean not in expanded_task_ontology:
                expanded_task_ontology[taskname_clean] = {
                    'description': None,
                    'constructs': [construct],
                    'acronym': [acronym]
                }
            else:
                expanded_task_ontology[taskname_clean]['constructs'].append(construct)



In [15]:
expanded_task_ontology['theory of mind task']

{'description': "Tasks designed to assess the understanding that others have beliefs, desires, and perspectives different from one's own, fundamental for social cognition.",
 'constructs': ['Theory of Mind',
  'Perspective Taking',
  'Social Cognition',
  'Cognitive Development',
  'Beliefs',
  'Social Cognition',
  'Social Brain Hypothesis'],
 'references': ['Wimmer, H., & Perner, J. (1983). Beliefs about beliefs. Cognition, 13(1), 103–128.'],
 'acronym': 'e.g., sally-anne task'}
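
In the output above, 'Social Cognition' appears twice in the `constructs` list, because the merge step appends a construct every time it is encountered. If duplicates are unwanted, one option is an order-preserving deduplication pass. This is a minimal sketch; the helper name `dedupe_preserving_order` is an assumption and is not part of the original code.

```python
def dedupe_preserving_order(items):
    """Drop case-insensitive duplicates, keeping the first occurrence of each item."""
    seen = set()
    unique = []
    for item in items:
        key = item.lower()
        if key not in seen:
            seen.add(key)
            unique.append(item)
    return unique

example_constructs = ['Theory of Mind', 'Perspective Taking', 'Social Cognition',
                      'Cognitive Development', 'Beliefs', 'Social Cognition',
                      'Social Brain Hypothesis']
print(dedupe_preserving_order(example_constructs))
```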

Get all task names and constructs and create a hash dictionary for each.


In [None]:
def generate_random_hash(hashlength=12):
    # Generate a secure random string
    random_string = secrets.token_hex(32)  # 32 random bytes, i.e. a 64-character hexadecimal string

    # Create a hash object
    hash_object = hashlib.sha256()

    # Update the hash object with the bytes of the random string
    hash_object.update(random_string.encode('utf-8'))

    # Get the hexadecimal representation of the hash
    random_hash = hash_object.hexdigest()

    return random_hash[:hashlength]


generate_random_hash()

'b1e9454e56ef'

Create unique identifiers for each construct and task label.  These will be useful when we start collapsing overlapping labels/constructs.
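
Since `secrets.token_hex` already returns cryptographically strong random hex, the SHA-256 step in `generate_random_hash` is not strictly needed. The sketch below shows a shorter alternative along with a birthday-bound estimate of the collision risk for 12-character identifiers. Both function names are hypothetical, and the 5,000 used in the example is an illustrative number, not a count taken from this notebook.

```python
import secrets

def generate_random_id_sketch(prefix, n_hex_chars=12):
    """Hypothetical alternative: random hex id taken directly from secrets.token_hex."""
    # token_hex(nbytes) returns 2 * nbytes hexadecimal characters
    n_bytes = (n_hex_chars + 1) // 2
    return prefix + secrets.token_hex(n_bytes)[:n_hex_chars]

def approx_collision_probability(n_ids, n_hex_chars=12):
    """Birthday-bound approximation n^2 / (2 * 16^k) for n ids of k hex characters."""
    return n_ids ** 2 / (2 * 16 ** n_hex_chars)

task_id = generate_random_id_sketch('task_')
print(task_id, len(task_id))
print(approx_collision_probability(5000))
```

With 12 hex characters (48 bits), collisions are very unlikely at this scale, but checking each new identifier against those already issued remains a cheap safeguard.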

Create a full task ontology, keyed by a hash rather than task labels.  

In [17]:
full_task_ontology = {}

for k, v in expanded_task_ontology.items():
    v = v.copy()
    # create a random hash for the id, regenerating in the unlikely event of a collision
    task_id = 'task_' + generate_random_hash()
    while task_id in full_task_ontology:
        task_id = 'task_' + generate_random_hash()
    v['name'] = k
    v['construct_names'] = [i.lower() for i in v['constructs']]
    v['construct_ids'] = []
    full_task_ontology[task_id] = v
    

assert len(full_task_ontology) == len(expanded_task_ontology)



In [18]:
v

{'description': None,
 'constructs': ['Reappraisal'],
 'acronym': [[]],
 'name': 'reappraisal instruction trials',
 'construct_names': ['reappraisal'],
 'construct_ids': []}

Do the same for constructs.

In [24]:
full_construct_ontology = {}

# look up task ids by cleaned task name
task_name_to_id = {v['name']: k for k, v in full_task_ontology.items()}

for k, v in construct_refinement_results.items():
    v = v.copy()
    if v['type'] == 'construct':
        # create a random hash for the id, regenerating in the unlikely event of a collision
        construct_id = 'concept_' + generate_random_hash()
        while construct_id in full_construct_ontology:
            construct_id = 'concept_' + generate_random_hash()
        v['name'] = k
        v['task_ids'] = []
        full_construct_ontology[construct_id] = v
        for task in v['tasks']:
            taskname_clean, _ = clean_task_name(task)
            task_hash = task_name_to_id.get(taskname_clean)
            if task_hash is None:
                print(f'task {taskname_clean} not found in full task ontology')
                continue
            v['task_ids'].append(task_hash)
            # link the task back to this construct, avoiding duplicate entries
            task_entry = full_task_ontology[task_hash]
            if construct_id not in task_entry['construct_ids']:
                task_entry['construct_ids'].append(construct_id)
            if k.lower() not in task_entry['construct_names']:
                task_entry['construct_names'].append(k.lower())

# now replace constructs in full_task_ontology with the new construct ids


with open(datadir / 'gpt4/gpt4_full_construct_ontology.json', 'w') as f:
    json.dump(full_construct_ontology, f, indent=2)

with open(datadir / 'gpt4/gpt4_full_task_ontology.json', 'w') as f:
    json.dump(full_task_ontology, f, indent=2)


In [25]:
full_task_ontology

{'task_541ad64a78cd': {'description': "A cognitive task where participants name the ink color of a word that may spell out a different color (e.g., the word 'Red' printed in blue ink), measuring the ability to inhibit cognitive interference.",
  'constructs': ['Attention',
   'Cognitive Control',
   'Inhibitory Control',
   'Executive Function',
   'Bilingual Cognitive Advantage',
   'Response Inhibition',
   'Resource Allocation',
   'Goal Maintenance',
   'Cognitive Resources',
   'Information Processing',
   'Monitoring',
   'Cognitive Flexibility',
   'Behavioral Control',
   'Executive Control',
   'Top-Down Processing',
   'Reaction Time',
   'Performance',
   'Dual-Task Performance',
   'Task Switching',
   'Executive Attention',
   'Selective Attention',
   'Conflict Monitoring',
   'Attention',
   'Attentional Set',
   'Self-Regulation',
   'Error Monitoring',
   'Cognitive Processing',
   'Attention Bias',
   'Set Shifting',
   'Central Executive',
   'Response Selection',
  

Generate a text embedding using fasttext for all of the concepts and tasks, for use in identifying overlapping items.
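
Once each label has an embedding, overlapping items can be flagged by comparing vectors pairwise. The following is a minimal sketch using cosine similarity in pure Python. It assumes the embeddings are already available as plain lists of floats (for example, from fastText's `get_sentence_vector`); the toy vectors, the threshold of 0.9, and the function names are made up for illustration.

```python
import math
from itertools import combinations

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def find_overlapping_labels(embeddings, threshold=0.9):
    """Return (label_a, label_b, similarity) for all pairs at or above the threshold."""
    pairs = []
    for (name_a, vec_a), (name_b, vec_b) in combinations(embeddings.items(), 2):
        sim = cosine_similarity(vec_a, vec_b)
        if sim >= threshold:
            pairs.append((name_a, name_b, sim))
    return sorted(pairs, key=lambda p: p[2], reverse=True)

# toy vectors for illustration only, not real fastText output
toy_embeddings = {
    'stroop task': [0.9, 0.1, 0.0],
    'stroop color-word task': [0.88, 0.12, 0.02],
    'wisconsin card sorting task': [0.1, 0.9, 0.2],
}
for name_a, name_b, sim in find_overlapping_labels(toy_embeddings):
    print(f'{name_a} <-> {name_b}: {sim:.3f}')
```

The pairwise comparison is quadratic in the number of labels, which is fine for a few thousand items but would need an approximate nearest-neighbor index at larger scale.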

In [21]:
# first generate a text file with all of the concepts and tasks
with open(datadir / 'gpt4/gpt4_full_ontology.txt', 'w') as f:
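    for k, v in full_task_ontology.items():
        f.write(f'{k}\n')
        # tasks added during construct refinement have no description, so skip it
        if v['description'] is not None:
            f.write(f'{v["description"]}\n')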
