## Ontology learning using GPT-4

Used GPT-4-o1-preview to generate an initial ontology. First prompted it to generate tasks (via chatgpt interface):

```
you are an expert in psychological research.  Researchers in the field of psychology study specific 
psychological constructs, which are the building blocks of the mind, such as 
memory, attention, theory of mind, and so on.  To study them, researchers use 
experimental tasks or surveys, which are meant to measure behavior related to the 
constructs.  In many cases psychological tasks have several different experimental
conditions, which are meant to manipulate the construct in different ways.  These are 
commonly compared to one another in order to measure the effect of the manipulation; we 
refer to these comparisons as contrasts.

your job is to generate a list of all of the psychological tasks and surveys used by researchers in this field.  Please be as specific and exhaustive as possible. 

you should return these as a JSON list, with no additional text.
```

This identified a list of 144 tasks.  These were then used to generate descriptions using the following prompt:

```
for each of these tasks, please generate:

1) a brief description of the task
2) a list of the psychological constructs that the task is used to assess
3) a small number of references for each task

Please return these as a dictionary of sub-dictionaries, with the task names as keys and with the elements 'description', 'constructs', and 'references' within each sub-dictionary
```

Results from this were stored in [gpt4_task_ontology.json]().

In [77]:
#autoreload
%load_ext autoreload
%autoreload 2

import json
import os
from openai import OpenAI
from dotenv import load_dotenv
from ontology_learner.gpt4_batch_utils import get_batch_results, save_batch_results
from ontology_learner.json_utils import load_jsonl, parse_jsonl_results
from llm_query.chat_client import ChatClientFactory
from tqdm import tqdm
from pathlib import Path
import fasttext
import time
import hashlib
import secrets
import pandas as pd
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.preprocessing import StandardScaler
from collections import defaultdict

from gpt_term_mining import (
    clean_task_name,
    clean_task_ontology,
    get_construct_task_dict_from_task_ontology,
    get_construct_prompt,
    mk_batch_script,
    run_batch_request,
    wait_for_batch_completion,
    get_main_construct_dict,
    get_task_prompt,
    get_task_cluster_prompt,
)

# Load environment variables from .env file
load_dotenv()

datadir = Path(os.getenv('DATADIR'))
print(datadir)




The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload
/Users/poldrack/Dropbox/data/ontology-learner/data


### Load task ontology results generated using the ChatGPT console

In [3]:
with open(datadir / 'gpt4/gpt4_task_ontology.json') as f:
    task_ontology = json.load(f)

print(len(task_ontology.keys()))

144


Clean up the ontology - in particular, make lower case and move acronyms into structure

In [4]:

ontology_clean = clean_task_ontology(task_ontology)



Here we extract all of the constructs from the task annotation for further annotation.

In [5]:
construct_task_dict = get_construct_task_dict_from_task_ontology(ontology_clean)
print(len(construct_task_dict))
# create json list of constructs

with open(datadir / 'gpt4/gpt4_construct_list.json', 'w') as f:
    json.dump(list(construct_task_dict.keys()), f, indent=2)

186


This list was then fed into GPT-4-o1-preview via ChatGPT with the following prompt:

```
The following is a list of psychological constructs identified above.  These represent only a fraction of all of the constructs that are studied in the field.  please use your expert knowledge of psychology to expand this list to contain a wider selection of the constructs studied within the field.  Please return your result as a json list.
```

the result was saved to [gpt4_expanded_construct_list.json]().

In [6]:
with open(datadir / 'gpt4/gpt4_expanded_construct_list.json') as f:
    expanded_constructs = json.load(f)

print(len(expanded_constructs))
expanded_constructs = list(set(expanded_constructs))
print(len(expanded_constructs))


866
807


I tried to further expand these but the chatgpt interface wouldn't do it due to the length of the list, so we then move to the API.  We also switch to GPT-4o due to cost of o1-preview.

### Construct refinement using GPT-4o

create batch submission using the prompt above

In [7]:

outdir = datadir / 'gpt4/construct_refinement_results'
outdir.mkdir(exist_ok=True, parents=True)
construct_refinement_result_file = list(outdir.glob('*.jsonl'))[0]

if construct_refinement_result_file.exists():
    run_batch = False
else:
    run_batch = True

batchfile = datadir / 'gpt4/gpt4_construct_expansion_batch.jsonl'



if run_batch:
    mk_batch_script(batchfile, expanded_constructs, get_construct_prompt)
    batch_metadata = run_batch_request(batchfile)

    wait_for_batch_completion(batch_metadata)
    batch_results = get_batch_results(batch_client, batch_metadata.id)
    outfile = save_batch_results(batch_results, batch_metadata.id, outdir)

Load construct refinement results

In [8]:
construct_refinement_result_file = list(outdir.glob('*.jsonl'))[0]
construct_refinement_results = parse_jsonl_results(load_jsonl(construct_refinement_result_file))

# exclude non-constructs
construct_refinement_results = {k: v for k, v in construct_refinement_results.items() if v['type'] == 'construct'}
print(len(construct_refinement_results))

741


In [9]:
construct_refinement_results['Mindfulness']

{'type': 'construct',
 'description': 'Mindfulness is a psychological construct that refers to the quality of being present and fully engaged with the current moment, without judgment or distraction. It involves awareness and acceptance of feelings, thoughts, and bodily sensations in a non-reactive way.',
 'references': ['Kabat-Zinn, J. (1990). Full Catastrophe Living: Using the Wisdom of Your Body and Mind to Face Stress, Pain, and Illness. Delta.',
  'Brown, K. W., & Ryan, R. M. (2003). The benefits of being present: Mindfulness and its role in psychological well-being. Journal of Personality and Social Psychology, 84(4), 822-848.',
  'Baer, R. A. (2003). Mindfulness Training as a Clinical Intervention: A Conceptual and Empirical Review. Clinical Psychology: Science and Practice, 10(2), 125-143.'],
 'tasks': ['Mindful Attention Awareness Scale (MAAS)',
  'Five Facet Mindfulness Questionnaire (FFMQ)',
  'Kentucky Inventory of Mindfulness Skills (KIMS)'],
 'related_constructs': ['Atten

Create a main list of constructs by combining the construct refinement result keys with any related constructs that are not in the main list




In [10]:

main_construct_dict = get_main_construct_dict(construct_refinement_results, construct_task_dict)

print(len(construct_refinement_results))
print(len(main_construct_dict))

741
1524


Some of the constructs (those identified in the last round of refinement) will not yet be annotated, so we identify and annotate those.

In [11]:
unannotated_constructs = [k for k, v in main_construct_dict.items() if len(v) == 0]
print(len(unannotated_constructs))


783


In [52]:
outdir = datadir / 'gpt4/construct_expansion_unannotated_results'

if 0:
    batchfile = datadir / 'gpt4/gpt4_construct_expansion_unannotated_batch.jsonl'
    if not batchfile.exists():
        mk_batch_script(batchfile, unannotated_constructs, get_construct_prompt)

    batch_metadata, batch_client = run_batch_request(batchfile)

    wait_for_batch_completion(batch_metadata, batch_client)

    batch_results = get_batch_results(batch_client, batch_metadata.id)
    outfile = save_batch_results(batch_results, batch_metadata.id, outdir)


Combine expansion results with refinement results.

In [53]:
construct_expansion_result_file = list(outdir.glob('*.jsonl'))[0]
construct_expansion_results = parse_jsonl_results(load_jsonl(construct_expansion_result_file))

# exclude non-constructs
construct_expansion_results = {k: v for k, v in construct_expansion_results.items() if v['type'] == 'construct'}
print(len(construct_expansion_results))

full_construct_results = {**construct_refinement_results, **construct_expansion_results}
# assert len(full_construct_results) == len(construct_refinement_results) + len(construct_expansion_results)
print(len(full_construct_results))

720
1461


Now we find all of the tasks listed in the construct dicts and add them to the task ontology if they don't already exist - i.e. just the same as we did above for constructs.

In [54]:
expanded_task_ontology = ontology_clean.copy()
unexpanded_tasks = []

for construct, result in full_construct_results.items():
    result = result.copy()
    if result['type'] == 'construct':
        for task in result['tasks']:
            taskname_clean, acronym = clean_task_name(task)
            if taskname_clean not in expanded_task_ontology:
                unexpanded_tasks.append(taskname_clean)
                expanded_task_ontology[taskname_clean] = {
                    'description': None,
                    'constructs': [construct],
                    'acronym': [acronym]
                }
            elif construct not in expanded_task_ontology[taskname_clean]['constructs']:
                expanded_task_ontology[taskname_clean]['constructs'].append(construct)

print(len(ontology_clean))
print(len(expanded_task_ontology))

143
3167


In [15]:
batchfile = datadir / 'gpt4/gpt4_task_expansion_unannotated_batch.jsonl'
outdir = datadir / 'gpt4/task_expansion_unannotated_results'

if not batchfile.exists():
    mk_batch_script(batchfile, unexpanded_tasks, get_task_prompt)

    batch_metadata, batch_client = run_batch_request(batchfile)

    wait_for_batch_completion(batch_metadata, batch_client)

    batch_results = get_batch_results(batch_client, batch_metadata.id)
    outfile = save_batch_results(batch_results, batch_metadata.id, outdir)


In [16]:
task_expansion_result_file = list(outdir.glob('*.jsonl'))[0]
task_expansion_results = parse_jsonl_results(load_jsonl(task_expansion_result_file))

# exclude non-constructs
task_expansion_results = {k: v for k, v in task_expansion_results.items() if v['type'] in ['task', 'survey']}
print(len(task_expansion_results))

full_task_results = {**expanded_task_ontology, **task_expansion_results}
print(len(full_task_results))


error decoding distance matching task
2554
2950


Get all task names and constructs and create a hash dictionary for each.


In [17]:
def generate_random_hash(hashlength=12):
    # Generate a secure random string
    random_string = secrets.token_hex(32)  # Generates a 32-character hexadecimal string

    # Create a hash object
    hash_object = hashlib.sha256()

    # Update the hash object with the bytes of the random string
    hash_object.update(random_string.encode('utf-8'))

    # Get the hexadecimal representation of the hash
    random_hash = hash_object.hexdigest()

    return random_hash[:hashlength]


generate_random_hash()

'89a0e1400b02'

Create unique identifiers for each construct and task label.  These will be useful when we start collapsing overlapping labels/constructs.

In [51]:
len(full_task_results.keys())



2950

In [72]:
full_task_results['stroop task']

{'description': "A cognitive task where participants name the ink color of a word that may spell out a different color (e.g., the word 'Red' printed in blue ink), measuring the ability to inhibit cognitive interference.",
 'constructs': ['Attention',
  'Cognitive Control',
  'Inhibitory Control',
  'Executive Function',
  'Bilingual Cognitive Advantage',
  'Response Inhibition',
  'Resource Allocation',
  'Goal Maintenance',
  'Cognitive Resources',
  'Information Processing',
  'Monitoring',
  'Cognitive Flexibility',
  'Behavioral Control',
  'Executive Control',
  'Top-Down Processing',
  'Reaction Time',
  'Performance',
  'Dual-Task Performance',
  'Task Switching',
  'Executive Attention',
  'Selective Attention',
  'Conflict Monitoring',
  'Attentional Set',
  'Self-Regulation',
  'Error Monitoring',
  'Cognitive Processing',
  'Attention Bias',
  'Set Shifting',
  'Central Executive',
  'Response Selection',
  'Ego Depletion',
  'Behavioral Regulation',
  'Cognitive Inhibition'

In [56]:
len(full_construct_results.keys())


1461

Generate a text embedding using fasttext for all of the concepts and tasks, for use in identifying overlapping items.

In [57]:
# first generate a text file with all of the concepts and tasks
with open(datadir / 'gpt4/gpt4_full_text_for_embedding_concepts.txt', 'w') as f:

    for k, v in full_construct_results.items():
        v = v.copy()
        f.write(f'construct_{k.replace(' ', '_')}: {k} {json.dumps(v)}\n')


with open(datadir / 'gpt4/gpt4_full_text_for_embedding_tasks.txt', 'w') as f:
    for k, v in full_task_results.items():
        v = v.copy()
        if 'acronym' in v:
            del v['acronym']
        v['constructs'] = list(set([i.lower() for i in v['constructs']]))
        v['type'] = 'task'
        f.write(f'task_{k.replace(' ', '_')}: {k} {json.dumps(v)}\n')



In [24]:
# generate a fasttext model
task_model_file = datadir / 'gpt4/gpt4_task_model.bin'
if not task_model_file.exists():
    task_model = fasttext.train_unsupervised(
        (datadir / 'gpt4/gpt4_full_text_for_embedding_tasks.txt').as_posix(), 
        dim=200)
    task_model.save_model(task_model_file.as_posix())
else:
    print(f'Loading task model from {task_model_file}')
    task_model = fasttext.load_model(task_model_file.as_posix())


Loading task model from /Users/poldrack/Dropbox/data/ontology-learner/data/gpt4/gpt4_task_model.bin


In [25]:
tasks = sorted(list(full_task_results.keys()))


with open(datadir / 'gpt4/gpt4_task_names.txt', 'w') as f:
    for k in tasks:
        f.write(f'{k}\n')


In [26]:
# create embeddings for the task names
task_embeddings = {}
for k in tasks:
    task_embeddings[k] = task_model.get_sentence_vector(k)

task_embeddings_df = pd.DataFrame(task_embeddings).T
# scale the embeddings
scaler = StandardScaler()
task_embeddings_scaled = scaler.fit_transform(task_embeddings_df)
task_embeddings_scaled_df = pd.DataFrame(task_embeddings_scaled, index=task_embeddings_df.index)
print(task_embeddings_scaled_df.shape)


(2950, 200)


In [27]:
# cluster the task names
cluster = AgglomerativeClustering(n_clusters=None, distance_threshold=7)
cluster.fit(task_embeddings_scaled_df)
print(f'Found {len(set(cluster.labels_))} clusters')
# print out the clusters
task_cluster_dict = defaultdict(list)

for i in set(cluster.labels_):
    if np.sum(cluster.labels_ == i) > 1:
        #print(f'Cluster {i}')
        for j in task_embeddings_scaled_df.index[cluster.labels_ == i]:
            #print(f'  {j}')
            task_cluster_dict[i].append(j)

task_cluster_dict

Found 2525 clusters


defaultdict(list,
            {0: ['the world values survey', 'world values survey'],
             1: ['mood induction procedures and memory task',
              'mood induction procedures followed by memory recall task',
              'mood induction procedures followed by memory retrieval task'],
             2: ['interpersonal justice scale', 'interpersonal trust scale'],
             3: ['color adjustment task', 'visual angle adjustment task'],
             4: ['dissociative experiences scale',
              'the dissociative experiences scale'],
             5: ['semantic encoding task', 'semantic priming task'],
             6: ['letter identification task',
              'lineup identification task',
              'morpheme identification task'],
             7: ['olweus bullying questionnaire',
              'revised olweus bully/victim questionnaire'],
             8: ['sleep diaries', 'sleep diary'],
             9: ['bogardus social distance scale',
              'social dis

In [76]:
full_task_results['stroop task']['description']

"A cognitive task where participants name the ink color of a word that may spell out a different color (e.g., the word 'Red' printed in blue ink), measuring the ability to inhibit cognitive interference."

In [80]:
len([k for k,v in full_task_results.items() if v['description'] is None])



253

In [91]:
#  include descriptions when available
batchfile = datadir / 'gpt4/gpt4_task_clustering_batch.jsonl'
tasks = []
descriptions = []

for k in full_task_results.keys():
    if full_task_results[k]['description'] is None:
        full_task_results[k]['description'] = ''
    tasks.append(k)

task_desc_list = [k + ' ' + full_task_results[k]['description'] 
                  for k in full_task_results.keys() ]

if not batchfile.exists():
    mk_batch_script(batchfile, task_desc_list, 
                    get_task_cluster_prompt,
                    custom_ids=tasks)

    batch_metadata, batch_client = run_batch_request(batchfile)

wait_for_batch_completion(batch_metadata, batch_client)

batch_results = get_batch_results(batch_client, batch_metadata.id)
outfile = save_batch_results(batch_results, batch_metadata.id, outdir)



in_progress


KeyboardInterrupt: 

In [92]:
def cancel_all_running_batches(client):
    batches = client.batches.list()
    for batch in batches:
        if batch.status == 'in_progress':
            print(f'cancelling batch {batch.id}')
            client.batches.cancel(batch.id)

cancel_all_running_batches(batch_client)


cancelling batch batch_6751cba66b7481918b1c6c680cec860f


In [86]:
batch_results = get_batch_results(batch_client, 'batch_6751aad0c874819198e4d99a544d592a')
outdir = datadir / 'gpt4/task_clustering_results'
outfile = save_batch_results(batch_results, 'batch_6751aad0c874819198e4d99a544d592a', outdir)


In [84]:
batch_metadata.id

'batch_6751ae73f97c8191b8cb09dd05289196'

In [62]:
batchfile = datadir / 'gpt4/gpt4_task_clustering_batch.jsonl'
outdir = datadir / 'gpt4/task_clustering_results'

if not batchfile.exists():
    mk_batch_script(batchfile, list(task_cluster_dict.values()), 
                    get_task_cluster_prompt,
                    custom_ids=[str(i) for i in task_cluster_dict.keys()])

    batch_metadata, batch_client = run_batch_request(batchfile)

wait_for_batch_completion(batch_metadata, batch_client)

batch_results = get_batch_results(batch_client, batch_metadata.id)
outfile = save_batch_results(batch_results, batch_metadata.id, outdir)


completed


create dict from original task labels to harmonized/clustered labels

In [81]:
task_clustering_result_file = list(outdir.glob('*.jsonl'))[0]
task_clustering_results = parse_jsonl_results(load_jsonl(task_clustering_result_file))

clustered_task_dict = {}
orig_task_to_cluster_dict = {}
for k, v in task_clustering_results.items():
    for label, tasks in v.items():
        clustered_task_dict[label] = tasks
        for task in tasks:
            orig_task_to_cluster_dict[task] = label

for k, v in clustered_task_dict.items():
    if len(v) > 1:
        print(k)
        for task in v:
            print(f'  {task}')
        print()



Acceptance and Commitment
  acceptance
  commitment

Aggression
  physical aggression
  verbal aggression

Anima and Animus
  Anima
  Animus

Anxiety
  general anxiety
  separation anxiety

Mentalizing/Theory of Mind
  mentalizing
  theory of mind

Attachment
  secure attachment
  insecure attachment

Avoidant Attachment Style
  Avoidant Attachment
  Dismissive Avoidant

Anxious Attachment Style
  Anxious Attachment
  Preoccupied Attachment

Disorganized Attachment Style
  Disorganized Attachment
  Fearful Avoidant

Attention
  attention
  concentration

Mentalizing/ToM
  mentalizing
  theory of mind

Big Five Personality Traits
  Openness
  Conscientiousness
  Extraversion
  Agreeableness
  Neuroticism

Empathy
  empathy
  empathic ability

Self-regulation
  impulse control
  emotion regulation

Social Skills
  prosocial behavior
  social competence

Conduct Problems
  rule-breaking behavior
  conduct disorder

Peer Relations
  peer acceptance
  peer rejection

Cognitive Development
 

### perform concept clustering

First do agglomerative clustering on concepts


In [58]:
# fit embeddings for concepts
concept_model_file = datadir / 'gpt4/gpt4_concept_model.bin'
if not concept_model_file.exists():
    concept_model = fasttext.train_unsupervised(
        (datadir / 'gpt4/gpt4_full_text_for_embedding_concepts.txt').as_posix(), 
        dim=200)
    concept_model.save_model(concept_model_file.as_posix())
else:
    print(f'Loading concept model from {concept_model_file}')
    concept_model = fasttext.load_model(concept_model_file.as_posix())



Read 0M words
Number of words:  5164
Number of labels: 0
Progress: 100.0% words/sec/thread:   74472 lr:  0.000000 avg.loss:  2.163138 ETA:   0h 0m 0s


In [59]:
# generate embeddings for concepts
concept_embeddings = {}
for k, v in full_construct_results.items():
    if v['type'] == 'construct':
        concept_embeddings[k] = task_model.get_sentence_vector(k)

concept_embeddings_df = pd.DataFrame(concept_embeddings).T
# scale the embeddings
scaler = StandardScaler()
concept_embeddings_scaled = scaler.fit_transform(concept_embeddings_df)
concept_embeddings_scaled_df = pd.DataFrame(concept_embeddings_scaled, index=concept_embeddings_df.index)
print(concept_embeddings_scaled_df.shape)


(1461, 200)


In [60]:
# cluster the task names
cluster = AgglomerativeClustering(n_clusters=None, distance_threshold=7)
cluster.fit(concept_embeddings_scaled_df)
print(f'Found {len(set(cluster.labels_))} clusters')
# print out the clusters
concept_cluster_dict = defaultdict(list)

for i in set(cluster.labels_):
    if np.sum(cluster.labels_ == i) > 1:
        #print(f'Cluster {i}')
        for j in concept_embeddings_scaled_df.index[cluster.labels_ == i]:
            #print(f'  {j}')
            concept_cluster_dict[i].append(j)

concept_cluster_dict

Found 1318 clusters


defaultdict(list,
            {0: ['Servant Leadership', 'Leadership Styles'],
             1: ['linguistic awareness',
              'semantic awareness',
              'syntactic awareness'],
             2: ['employee engagement', 'job involvement'],
             3: ['need for closure',
              'need for belonging',
              'need for intimacy'],
             4: ['Analogical Reasoning', 'Ethical Reasoning'],
             5: ['substance use disorder', 'substance use'],
             6: ['Reconstructive Memory',
              'Declarative Memory',
              'Prospective Memory'],
             7: ['cognitive facilitation', 'cognitive restoration'],
             8: ['Hypothetical-Deductive Reasoning',
              'Inductive Reasoning',
              'Deductive Reasoning'],
             9: ['social value orientation', 'social dominance orientation'],
             10: ['Actor-Observer Bias', 'Self-Serving Bias'],
             11: ['Social Support', 'Social Capital'],
     

Examination of these clusters showed that they rarely contained synonymous terms; more often, they contained contrasting terms. Thus, we don't do any further refinement.

In [118]:
constructs = sorted(list(full_construct_results.keys()))
for k in constructs:
    print(k)


ADHD Symptoms
Abstract Reasoning
Abstract Thinking
Acceptance and Commitment
Acculturation
Acculturative Stress
Action Selection
Actor-Observer Bias
Adaptive Coping
Affect
Affective Empathy
Affective Forecasting
Affective Symptoms
Aggression
Agnosia
Agreeableness
Alcohol Use
Alerting
Alexithymia
Altruism
Ambivalent Sexism
Analogical Reasoning
Anchoring
Anger
Anima and Animus
Animal Cognition
Anosognosia
Anterograde Amnesia
Antisocial Behavior
Anxiety
Anxiety Disorders
Aphasia
Apraxia
Archetypes
Associative Learning
Associative Networks
Associative Thinking
Attachment
Attachment Behaviors
Attachment Disorders
Attachment Security
Attachment Styles
Attention
Attention Bias
Attention Deficit
Attention Deficits
Attention Restoration
Attention to Detail
Attentional Blink
Attentional Control
Attentional Set
Attitude Change
Attitudes
Attribution
Auditory Processing
Authentic Leadership
Authority Influence
Automatic Thoughts
Autonomy
Availability Heuristic
Avoidance
Avoidance Behaviors
Avoidant

In [113]:

scaler = StandardScaler()
task_embeddings_scaled = scaler.fit_transform(task_embeddings_df)
task_embeddings_scaled_df = pd.DataFrame(task_embeddings_scaled, index=task_embeddings_df.index)

cluster = AgglomerativeClustering(n_clusters=None, distance_threshold=10)
cluster.fit(task_embeddings_scaled)
print(f'Found {len(set(cluster.labels_))} clusters')

Found 1292 clusters


In [114]:
# print out the clusters
for i in set(cluster.labels_):
    if np.sum(cluster.labels_ == i) > 1:
        print(f'Cluster {i}')
        for j in task_embeddings_scaled_df.index[cluster.labels_ == i]:
            print(f'  {j}')


Cluster 0
  task_dyadic_negotiation_task
  task_resource_allocation_task
  task_negotiation_simulations
  task_negotiation_task
Cluster 1
  task_reciprocity_questionnaire
  task_equity_sensitivity_instrument
  task_justice_sensitivity_inventory
Cluster 2
  task_ethical_decision-making_measure
  task_rest's_ethical_decision-making_model
Cluster 3
  task_multimodal_emotion_comprehension_test
  task_geneva_emotion_recognition_test
  task_facial_affect_recognition_task
  task_multimodal_emotion_recognition_test
  task_facial_affect_processing_task
  task_the_emotion_recognition_task
  task_ekman_60_faces_test
  task_emotion_recognition_index
  task_vu_amsterdam_emotion_recognition_test
Cluster 4
  task_edmondson's_team_psychological_safety_survey
  task_perceived_team_safety_scale
  task_workplace_deviance_scale
  task_employee_deviance_scale
Cluster 5
  task_the_interpersonal_attraction_scale
  task_miller_social_intimacy_scale
  task_the_miller_social_intimacy_scale
  task_interpersonal_

Create a full task ontology, keyed by a hash rather than task labels.  

In [17]:
full_task_ontology = {}

for k, v in expanded_task_ontology.items():
    # create a random hash for the id
    v = v.copy()
    task_id = k
    while task_id in expanded_task_ontology:
        task_id = 'task_' + generate_random_hash()
    v['name'] = k
    v['construct_names'] = [i.lower() for i in v['constructs']]
    v['construct_ids'] = []
    full_task_ontology[task_id] = v
    

assert len(full_task_ontology) == len(expanded_task_ontology)



Do the same for constructs.

In [24]:
full_construct_ontology = {}

def task_name_to_hash(taskname, full_task_ontology):
    return [k for k, v in full_task_ontology.items() if taskname_clean == v['name']][0]

for k, v in construct_refinement_results.items():
    v = v.copy()
    if v['type'] == 'construct':
        construct_id = k
        while construct_id in construct_refinement_results:
            construct_id = 'concept_' + generate_random_hash()
        v['name'] = k
        v['task_ids'] = []
        full_construct_ontology[construct_id] = v
        for task in v['tasks']:
            taskname_clean, _ = clean_task_name(task)
            task_hash = task_name_to_hash(taskname_clean, full_task_ontology)
            v['task_ids'].append(task_hash)
            if task_hash not in full_task_ontology:
                print(f'task {taskname_clean} not found in full task ontology')
            elif construct not in full_task_ontology[task_hash]['construct_names']:
                full_task_ontology[task_hash]['construct_ids'].append(construct_id)
                full_task_ontology[task_hash]['construct_names'].append(k)

# now replace constructs in full_task_ontology with the new construct ids


with open(datadir / 'gpt4/gpt4_full_construct_ontology.json', 'w') as f:
    json.dump(full_construct_ontology, f, indent=2)

with open(datadir / 'gpt4/gpt4_full_task_ontology.json', 'w') as f:
    json.dump(full_task_ontology, f, indent=2)
