Annotate diseases/phenotypes using chatGPT #19

bschilder · 2023-03-20T18:23:07Z

(checked boxes indicate at least an initial attempt has been made)

Annotations

Severity score (without criterion).
Severity score (using Lazarin 2014 (table 2) criteria)
Childhood onset
Causes death

Models

chatGPT (gpt-3.5)
chatGPT (gpt-4) (pending payment details from @NathanSkene)
bioGPT

Annotating HPO phenotypes using chatGPT via gptstudio

Set up

install.packages("gptstudio")
library(gptstudio)
library(data.table) # provides fread(), data.table() and rbindlist() used below

# Load HPO terms 
terms_dt = HPOExplorer::load_phenotype_to_genes(3)
terms_cols = list(name="Phenotype",
                  id="ID")

# Get unique terms and their IDs 
terms_dt_sub <- unique(terms_dt[,unname(unlist(terms_cols)), with=FALSE])

Attempt #1

Here I'm using the congenital onset terms (without HPO ID) that were provided to us by Peter Robinson. Will also try:

inputting HPO ID into prompt
asking chatGPT to add column with HPO ID

# congenital onset terms without HPO ID
congenital_onset <- "Syndactyly; 
Ventricular septal defect; Atrioventricular canal defect; 
Atrial septal defect; Abnormal connection of the cardiac segments; 
Fetal anomaly; Neural tube defect; 
Coloboma; Microtia; Cryptotia; 
Cupped ear; Cleft helix; Low-set ears; 
Synotia; Holoprosencephaly; Exstrophy; 
Abdominal wall defect; Abnormal lung lobation; 
Unilateral primary pulmonary dysgenesis"

# define the effects you need answers to e.g. does the phenotype cause death
effects <- "mental retardation, death, impaired mobility, 
physical malformations, blindness, sensory impairments, 
immunodeficiency, cancer, reduced fertility."

# define the columns of the output table 
table_columns <- "phenotype, mental retardation, death, impaired mobility,
physical malformations, blindness, sensory impairments, immunodeficiency, cancer, 
reduced fertility, congenital onset, justification."

# define chatGPT prompt
question = paste("Do:", 
                 congenital_onset, 
                 ", typically cause:",
                 effects, 
                 "Do they have congenital onset?",
                 "You must give one-word yes or no answers and give a justification for why they do or don't have congenital onset.",
                 "You must provide the output in .tsv format with columns:",
                 table_columns)
question <- gsub("\n", "", question)

# run chatGPT 5 times for the same prompt
n <- 5
run_chatgpt <- function(q){
  # send the prompt passed in as `q` (previously the global `question` was used and `q` ignored)
  all_res <- gptstudio::openai_create_chat_completion(prompt = q)
  # parse the .tsv text returned by the model into a data.table
  data.table::fread(all_res[["choices"]]$message.content)
}

res_allPheno <- lapply(seq_len(n), function(x) run_chatgpt(question))

res_allPheno_dt <- data.table::rbindlist(res_allPheno, fill = TRUE,
                                        use.names = TRUE,
                                        idcol = "iteration")

# order alphabetically so that you can compare results across phenotypes
res_allPheno_dt <- res_allPheno_dt[order(res_allPheno_dt$phenotype), ]

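Below is a subset of res_allPheno_dt. The answers chatGPT gives across iterations of the same prompt are not consistent; for example, compare the mental retardation column for coloboma. A coloboma is an area of missing tissue in the eye and, judging from a quick Google search, is not associated with mental retardation. Some iterations (e.g. iteration 2) also ignore the one-word yes/no instruction and return fewer fields than there are columns.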

iteration	phenotype	mental retardation	death	impaired mobility	physical malformations	blindness	sensory impairments	immunodeficiency	cancer	reduced fertility	congenital onset	justification
1	Atrioventricular canal defect	Yes	Yes	Yes	Yes	No	No	No	No	No	Yes	Congenital heart defect present at birth
2	Atrioventricular canal defect	Yes, in some cases	May lead to premature death	no	May lead to growth failure, fatigue or rapid breathing	May lead to vision problems	None	None	None	No	AV canal defect is present at birth and is a congenital condition.
3	Atrioventricular canal defect	Yes	Yes	No	Yes	No	No	No	No	No	Yes	Atrioventricular canal defect is a congenital heart defect in which there is an opening in the center of the heart where the walls separating the heart chambers should be.
4	Atrioventricular canal defect	Yes	Possible	None	Physical malformations	No	No	No	No	No	Yes	Congenital onset is typical of this phenotype as it is a result of abnormal development of the heart during fetal development.
5	Atrioventricular canal defect	Yes	Yes	No	Yes	No	No	No	No	No	Yes	It is a congenital heart defect that is present at birth.
1	Cleft helix	No	No	No	Yes	No	No	No	No	No	Yes	Congenital ear malformation present at birth
2	Cleft helix	No	None	None	May lead to physical malformations of the ear	None	None	None	None	Yes	Cleft helix is present at birth and is a congenital condition.
3	Cleft helix	No	No	No	Yes	No	No	No	No	No	Yes	Cleft helix is a congenital anomaly characterized by a cleft or gap in the top part of the ear.
4	Cleft helix	No	None	None	Physical malformations	No	No	No	No	No	Yes	Congenital onset is typical of this phenotype as it is a result of incomplete development of the ear during fetal development.
5	Cleft helix	No	No	No	Yes	No	No	No	No	No	Yes	A cleft helix is a rare congenital malformation of the ear.
1	Coloboma	Yes	No	No	Yes	Yes	Yes	No	No	No	Yes	Present at birth and can affect vision and eye structure
2	Coloboma	No	May lead to vision problems or blindness	May depend on location on the body	None	May lead to vision problems or blindness	May lead to hearing loss or deafness	None	None	No	Coloboma is present at birth and is a congenital condition.
3	Coloboma	Yes	No	No	Yes	Yes	Yes	No	No	No	Yes	Coloboma is a congenital anomaly characterized by a gap or hole in one of the structures of the eye.
4	Coloboma	No	None	None	Physical malformations	Possible	Possible	No	No	No	Yes	Congenital onset is typical of this phenotype as it is a result of incomplete fusion of the tissues that form the eye during fetal development.
5	Coloboma	Yes	No	No	Yes	Yes	No	No	No	No	Yes	A coloboma is a birth defect that affects the eye.
1	Cryptotia	No	No	No	Yes	No	No	No	No	No	Yes	Congenital ear malformation present at birth
2	Cryptotia	No	None	None	May lead to physical malformations of the ear	None	None	None	None	Yes	Cryptotia is present at birth and is a congenital condition.
3	Cryptotia	No	No	No	Yes	No	No	No	No	No	Yes	Cryptotia is a congenital anomaly characterized by a hidden ear that is partially or completely covered by skin.
4	Cryptotia	No	None	None	Physical malformations	No	No	No	No	No	Yes	Congenital onset is typical of this phenotype as it is a result of abnormal development of the ear during fetal development.
5	Cryptotia	No	No	No	Yes	No	No	No	No	No	Yes	Cryptotia is a congenital ear deformity.
1	Cupped ear	No	No	No	Yes	No	No	No	No	No	Yes	Congenital ear malformation present at birth
2	Cupped ear	No	None	None	May lead to physical malformations of the ear	None	None	None	None	Yes	Cupped ear is present at birth and is a congenital condition.
3	Cupped ear	No	No	No	Yes	No	No	No	No	No	Yes	Cupped ear is a congenital anomaly characterized by an ear that is shaped like a cup and protrudes outward from the side of the head.
4	Cupped ear	No	None	None	Physical malformations	No	No	No	No	No	Yes	Congenital onset is typical of this phenotype as it is a result of abnormal development of the ear during fetal development.
5	Cupped ear	No	No	No	Yes	No	No	No	No	No	Yes	A cupped ear is a congenital malformation.
1	Exstrophy	Yes	No	Yes	Yes	No	No	No	No	No	Yes	Present at birth and affects bladder and pelvic development
2	Exstrophy	No	None	None	May lead to physical malformations of the abdominal wall or pelvic organs	None	None	None	May lead to reduced fertility	Yes	Exstrophy is present at birth and is a congenital condition.
3	Exstrophy	Yes	No	Yes	Yes	No	No	No	No	No	Yes	Exstrophy is a congenital anomaly characterized by a defect in the abdominal wall or bladder.
4	Exstrophy	No	None	None	Physical malformations	No	No	No	No	No	Yes	Congenital onset is typical of this phenotype as it is a result of abnormal development of the abdominal wall during fetal development.
5	Exstrophy	Yes	No	Yes	Yes	No	No	No	No	No	Yes	Exstrophy is a congenital abnormality where the bladd
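
The rows from iteration 2 above show that chatGPT does not always stick to one-word answers. Below is a minimal sketch of how such answers could be flagged before comparing iterations. It assumes a hypothetical helper normalise_answer() (not part of gptstudio or HPOExplorer) and assumes that "None" should be read as "no":

```r
# Map free-text answers onto "yes" / "no"; anything else becomes NA for manual review.
# Hypothetical helper for illustration only.
normalise_answer <- function(x) {
  x <- tolower(trimws(x))
  out <- rep(NA_character_, length(x))
  out[x %in% c("yes", "y")] <- "yes"
  out[x %in% c("no", "n", "none")] <- "no"  # assumption: "None" means "no"
  out
}

# 'death' answers for Atrioventricular canal defect, copied from the table above
death <- c("Yes", "May lead to premature death", "Yes", "Possible", "Yes")
normalise_answer(death)
# "yes" NA "yes" NA "yes"
```

Anything returned as NA would then need a manual check or a re-run of the prompt.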

Attempt #2

What if I run the prompt one phenotype at a time, with 3 iterations?

congenital_onset_split <- as.list(strsplit(congenital_onset, "; ")[[1]])

results_list <- list() 

for (j in 1:3) { 
  res_individualPheno <- lapply(seq_len(length(congenital_onset_split)), function(i){
    pheno <- congenital_onset_split[[i]]
    question = paste("Does",
                     pheno, 
                     "typically cause:", 
                     effects,
                     "Does",
                     pheno, 
                     "have congenital onset?",
                     "You must give one-word yes or no answers and give a justification for why it does or doesn't have congenital onset.",
                     "You must provide the output in .tsv format with columns:",
                     table_columns)
    question <- gsub("\n", "", question)
    print(question)
    all_res <- gptstudio::openai_create_chat_completion(prompt = question)
    choices <- fread(all_res[["choices"]]$message.content)
    return(choices)
  })
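  results_list[[j]] <- res_individualPheno # store the results of run j
}

# flatten the list of runs into a single list of tables
res_individualPheno_flat <- unlist(results_list, recursive = FALSE)

# note: idcol records the position in the flattened list, not the run number j
res_individualPheno_dt <- data.table::rbindlist(res_individualPheno_flat, fill = TRUE,
                                         use.names = TRUE,
                                         idcol = "iteration")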

# order alphabetically so that you can compare results across phenotypes
res_individualPheno_dt <- res_individualPheno_dt[order(res_individualPheno_dt$phenotype), ]

Below is a subset of res_individualPheno_dt; I've shown the same phenotypes as for res_allPheno_dt for comparison. The answers seem to be more consistent across iterations when chatGPT is run on each phenotype individually.

phenotype	mental retardation	death	impaired mobility	physical malformations	blindness	sensory impairments	immunodeficiency	cancer	reduced fertility	congenital onset	justification	justification
Atrioventricular canal defect	no	no	no	yes	no	no	no	no	no	yes	NA	Defect occurs during fetal development, therefore present at birth.
Atrioventricular canal defect	No	No	No	Yes	No	No	No	No	No	Yes	NA	Atrioventricular canal defect is a congenital heart defect. It is present at birth and develops as the heart forms during fetal development.
Atrioventricular canal defect	No	No	No	Yes	No	No	No	No	No	Yes	NA	Atrioventricular canal defect is a congenital heart defect that occurs during fetal development.
Cleft helix	No	No	No	Yes	No	No	No	No	No	Yes	NA	Cleft helix is a genetic condition that is present at birth, thus indicating that it has a congenital onset.
Cleft helix	No	No	No	Yes	No	No	No	No	No	Yes	NA	Cleft helix is a genetic condition, meaning it is present at birth and caused by inherited gene mutations. It is a congenital condition.
Cleft helix	No	No	No	Yes	No	No	No	No	No	Yes	NA	Congenital onset is indicated by the presence of a physical malformation at birth, which is true for cleft helix.
Coloboma	No	No	No	Yes	Yes	Yes	No	No	No	Yes	NA	Congenital onset means present at birth, and coloboma is a congenital condition that occurs when certain structures in the eye or other parts of the body don't develop properly during fetal growth. Therefore, it has a congenital onset.
Coloboma	No	No	No	Yes	Yes	Yes	No	No	No	Yes	NA	Congenital onset refers to a condition that is present at or before birth. Coloboma is a congenital condition, as it occurs when the eye doesn't develop properly during pregnancy.
Coloboma	no	no	no	yes	yes	yes	no	no	no	yes	NA	Coloboma is a congenital birth defect that affects the eyes, and it is usually present from birth. It is caused by abnormal development of the eye during gestation.
Cryptotia	No	No	No	Yes	No	No	No	No	No	Yes	NA	Cryptotia is a congenital ear anomaly.
Cryptotia	No	No	No	Yes	No	No	No	No	No	Yes	NA	Cryptotia is a congenital ear malformation that is present at birth.
Cryptotia	No	No	No	Yes	No	No	No	No	No	Yes	NA	Cryptotia is a congenital condition, meaning it is present at or before birth.
Cupped ear	No	No	No	Yes	No	No	No	No	No	Yes	NA	Cupped ear is associated with physical malformations and is present at birth (congenital).
Cupped ear	no	no	no	yes	no	no	no	no	no	yes	NA	The development of an ear occurs during fetal development, hence the onset of cupped ear is congenital.
Cupped ear	No	No	No	Yes	No	No	No	No	No	Yes	NA	It is a congenital deformity that occurs during fetal development.
Exstrophy	No	No	Yes	Yes	No	No	No	No	Yes	Yes	NA	It is a birth defect that occurs during fetal development.
Exstrophy	No	No	Yes	Yes	No	No	No	Yes	Yes	Yes	NA	Exstrophy is a congenital anomaly that occurs during fetal development. The anterior body wall fails to properly fuse together, resulting in the exposure of internal organs.
Exstrophy	No	No	Yes	Yes	No	No	No	No	Yes	Yes	NA	Consequence of abnormal embryonic development
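
To put a number on "more consistent", one option might be the share of iterations that agree with the most common answer. This is only a sketch: modal_agreement() is a hypothetical helper, and the inputs are the mental retardation answers for coloboma copied from the two tables above.

```r
# Share of answers that match the most frequent answer (case-insensitive).
# Hypothetical helper for illustration only.
modal_agreement <- function(answers) {
  answers <- tolower(trimws(answers))
  max(table(answers)) / length(answers)
}

coloboma_attempt1 <- c("Yes", "No", "Yes", "No", "Yes")  # 5 iterations, all phenotypes in one prompt
coloboma_attempt2 <- c("No", "No", "no")                 # 3 iterations, one phenotype per prompt

modal_agreement(coloboma_attempt1)  # 0.6
modal_agreement(coloboma_attempt2)  # 1
```

A value of 1 means every iteration gave the same answer. It says nothing about whether that answer is correct.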

Attempt #3

Here I'm repeating attempt #1 with the addition of providing chatGPT with the definition of each congenital onset term.

# make dataframe with congenital onset phenotypes and their IDs, match column names to those in hpo meta 
congenital_onset_dt <- data.table(preferredlabel = c("Syndactyly",
                            "Ventricular septal defect",
                            "Atrioventricular canal defect",
                            "Atrial septal defect",
                            "Abnormal connection of the cardiac segments",
                            "Fetal anomaly",
                            "Neural tube defect",
                            "Coloboma",
                            "Microtia",
                            "Cryptotia",
                            "Cupped ear",
                            "Cleft helix",
                            "Low-set ears",
                            "Synotia",
                            "Holoprosencephaly",
                            "Exstrophy",
                            "Abdominal wall defect",
                            "Abnormal lung lobation",
                            "Unilateral primary pulmonary dysgenesis"),
                   HPO_ID = c("HP:0001159",
                          "HP:0001629",
                          "HP:0006695",
                          "HP:0001631",
                          "HP:0011545",
                          "HP:0034057",
                          "HP:0045005",
                          "HP:0000589",
                          "HP:0008551",
                          "HP:0011252",
                          "HP:0000378",
                          "HP:0009902",
                          "HP:0000369",
                          "HP:0100663",
                          "HP:0001360",
                          "HP:0100548",
                          "HP:0010866",
                          "HP:0002101",
                          "HP:0006549"))


# get HPO metadata table for all descendant terms of 'phenotypic abnormality' 
hpo_meta <- HPOExplorer::make_phenos_dataframe("HP:0000118")

# get meta info for congenital onset phenotypes
congenital_onset_dt <- merge(congenital_onset_dt, hpo_meta)

# phenos + definition for prompt, note that some don't have a definition in the hpo_meta table
# (columns referenced by name rather than by position, as in attempt #4)
phenos <- paste(
  paste0(congenital_onset_dt$preferredlabel,
         " - ", congenital_onset_dt$definition),
  collapse="; "
) 

phenos <- gsub("\"\"","'", phenos)

# define chatGPT prompt
question = paste("Do:", 
                 phenos, 
                 ", typically cause:",
                 effects, 
                 "Do they have congenital onset?",
                 "You must give one-word yes or no answers and give a justification for why they do or don't have congenital onset.",
                 "You must provide the output in .tsv format with columns:",
                 table_columns)
question <- gsub("\n", "", question)

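# run chatGPT 5 times for the same prompt
n <- 5
run_chatgpt <- function(q){
  # send the prompt passed in as `q` rather than relying on the global `question`
  all_res <- gptstudio::openai_create_chat_completion(prompt = q)
  # parse the .tsv text returned by the model into a data.table
  data.table::fread(all_res[["choices"]]$message.content)
}

res_multiPheno_def <- lapply(seq_len(n), function(x) run_chatgpt(question))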

res_multiPheno_def_dt <- data.table::rbindlist(res_multiPheno_def,fill = TRUE,
                                         use.names = TRUE,
                                         idcol = "iteration")

# order alphabetically so that you can compare results across phenotypes
res_multiPheno_def_dt <- res_multiPheno_def_dt[order(res_multiPheno_def_dt$phenotype), ]

Here is a subset of res_multiPheno_def_dt. Including the definition in the prompt seems to (i) improve consistency across iterations but (ii) reduce accuracy: e.g. coloboma doesn't seem to be associated with mental retardation, and atrioventricular canal defect does not 'typically' cause death if there is surgical intervention (see below the table for a more detailed answer for this phenotype from chatGPT).

iteration	phenotype	mental retardation	death	impaired mobility	physical malformations	blindness	sensory impairments	immunodeficiency	cancer	reduced fertility	congenital onset	justification
1	Atrioventricular canal defect	Yes	Yes	No	Yes	No	No	No	No	No	Yes	Present at birth (congenital).
2	Atrioventricular canal defect	Yes	Yes	Yes	Yes	No	No	No	No	No	Yes	This condition is present at birth and affects the heart.
3	Atrioventricular canal defect	Yes	Yes	No	Yes	No	No	No	No	No	Yes	This is a defect in the atrioventricular septum of the heart which is a congenital defect.
4	Atrioventricular canal defect	Yes	Yes	No	Yes	No	No	No	No	No	Yes	The term refers to a congenital heart defect that is present at birth (congenital).
5	Atrioventricular canal defect	Yes	Yes	No	Yes	No	No	No	No	No	Yes	Congenital onset is specified in the definition.
1	Cleft helix	No	No	No	Yes	No	No	No	No	No	Yes	Present at birth (congenital).
2	Cleft helix	No	No	No	Yes	No	No	No	No	No	Yes	This is a congenital abnormality that affects the ear.
3	Cleft helix	No	No	No	Yes	No	No	No	No	No	Yes	Cleft helix is a defect that is present since birth.
4	Cleft helix	No	No	No	Yes	No	No	No	No	No	Yes	The term refers to a developmental defect of the helix of the ear that is present at birth (congenital).
5	Cleft helix	No	No	No	Yes	No	No	No	No	No	Yes	Congenital onset is specified in the definition.
1	Coloboma	Yes	No	No	Yes	Yes	Yes	No	No	No	Yes	Develops during fetal development and is present at birth (congenital).
2	Coloboma	Yes	No	No	Yes	Yes	No	No	No	No	Yes	This is a developmental defect that is present at birth.
3	Coloboma	Yes	Yes	No	Yes	Yes	Yes	No	No	No	Yes	Coloboma is a developmental defect that occurs during embryonic development.
4	Coloboma	Yes	No	No	Yes	Yes	Yes	No	No	No	Yes	The term refers to a developmental defect of the eye that is present at birth (congenital).
5	Coloboma	Yes	No	No	Yes	Yes	Yes	No	No	No	Yes	Congenital onset is specified in the definition.
1	Cryptotia	No	No	Yes	Yes	No	No	No	No	No	Yes	Present at birth (congenital).
2	Cryptotia	No	No	No	Yes	No	No	No	No	No	Yes	This is a congenital abnormality that affects the ear.
3	Cryptotia	No	No	No	Yes	No	No	No	No	No	Yes	Cryptotia is present at birth.
4	Cryptotia	No	No	No	Yes	No	No	No	No	No	Yes	The term refers to a developmental defect of the auricle of the ear that is present at birth (congenital).
5	Cryptotia	No	No	No	Yes	No	No	No	No	No	Yes	Congenital onset is specified in the definition.
1	Cupped ear	No	No	No	Yes	No	No	No	No	No	Yes	Present at birth (congenital).
2	Cupped ear	No	No	No	Yes	No	No	No	No	No	Yes	This is a congenital abnormality that affects the ear.
3	Cupped ear	No	No	No	Yes	No	No	No	No	No	Yes	This is a defect in ear folding which occurs during embryonic development.
4	Cupped ear	No	No	No	Yes	No	No	No	No	No	Yes	The term refers to a developmental defect of the ear that is present at birth (congenital).
5	Cupped ear	No	No	No	Yes	No	No	No	No	No	Yes	Congenital onset is specified in the definition.
1	Exstrophy	Yes	Yes	Yes	Yes	No	No	No	No	No	Yes	Present at birth (congenital).
2	Exstrophy	No	No	No	Yes	No	No	No	No	No	Yes	This is a developmental defect that is present at birth.
3	Exstrophy	No	No	Yes	Yes	No	No	No	No	No	Yes	Exstrophy is a result of developmental defects in embryonic development.
4	Exstrophy	No	No	Yes	Yes	No	No	No	No	No	Yes	The term refers to a developmental defect of the abdominal wall that is present at birth (congenital).
5	Exstrophy	Yes	Yes	Yes	Yes	No	No	No	No	No	Yes	Congenital onset is specified in the definition.

Attempt #4

Here I'm repeating attempt #2 with the addition of providing chatGPT with the definition of each congenital onset term.

results_list <- list()

for (j in 1:3) { 
res_indPheno_def <- lapply(seq_len(nrow(congenital_onset_dt)), function(i){
  pheno <- congenital_onset_dt$preferredlabel[[i]]
  definition <- congenital_onset_dt$definition[[i]]
  question <- paste("Does",
                    pheno, 
                    "-",
                    definition,
                    ", typically cause:", 
                    effects,
                    "Does",
                    pheno, 
                    "have congenital onset?",
                    "You must give one-word yes or no answers and give a justification for why it does or doesn't have congenital onset.",
                    "You must provide the output in .tsv format with columns:",
                    table_columns)
  question <- gsub("\n", "", question)
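  # fixed = TRUE matches the literal ". ," so the regex "." cannot swallow an arbitrary character
  question <- gsub(". , typically", ", typically", question, fixed = TRUE)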
  all_res <- gptstudio::openai_create_chat_completion(prompt = question)
  choices <- fread(all_res[["choices"]]$message.content)
})
results_list[[j]] <- res_indPheno_def
}

list <- unlist(results_list, recursive = FALSE)

res_indPheno_def_dt <- data.table::rbindlist(list,fill = TRUE,
                                                use.names = TRUE,
                                                idcol = "iteration")

# order alphabetically so that you can compare results across phenotypes
res_indPheno_def_dt <- res_indPheno_def_dt[order(res_indPheno_def_dt$phenotype), ]

Here is a subset of res_indPheno_def_dt.

iteration	phenotype	mental retardation	death	impaired mobility	physical malformations	blindness	sensory impairments	immunodeficiency	cancer	reduced fertility	congenital onset	justification	justification
5	Atrioventricular canal defect	No	No	No	Yes	No	No	No	No	No	Yes	Cause is a defect of the atrioventricular septum which develops during fetal development, making it congenital.	NA
24	Atrioventricular canal defect	No	No	No	Yes	No	No	No	No	No	Yes	Atrioventricular canal defect is a congenital heart defect, meaning it is present at birth.	NA
43	Atrioventricular canal defect	No	Yes	No	Yes	No	No	No	No	No	Yes	Atrioventricular canal defect is a congenital heart defect that is present at birth.	NA
6	Cleft helix	No	No	No	Yes	No	No	No	No	No	Yes	Cleft helix is a congenital malformation that occurs during fetal development.	NA
25	Cleft helix	No	No	No	Yes	No	No	No	No	No	Yes	Cleft helix is a physical malformation that is present at birth and affects the ear.	NA
44	Cleft helix	No	No	No	Yes	No	No	No	No	No	Yes	Cleft helix is a physical malformation of the ear that is present at birth, indicating a congenital onset.	NA
7	Coloboma	No	No	No	Yes	Yes	No	No	No	No	Yes	Coloboma is a congenital condition as it results from incomplete closure of the optic fissure during embryonic development, which occurs during the early stages of fetal development.	NA
26	Coloboma	No	No	No	Yes	Yes	Yes	No	No	No	Yes	Coloboma is a developmental defect that is present at birth, therefore it has a congenital onset.	NA
45	Coloboma	no	no	no	yes	yes	yes	no	no	no	yes	It is a developmental defect, meaning it occurs during fetal development and is present at birth.	NA
8	Cryptotia	No	No	No	Yes	No	No	No	No	No	Yes	Cryptotia is a congenital condition, meaning it is present at birth. It is caused by abnormal development of the ear during fetal development.	NA
27	Cryptotia	No	No	No	Yes	No	No	No	No	No	Yes	Cryptotia is a congenital anomaly caused by abnormal development of the auricle in utero.	NA
46	Cryptotia	No	No	No	Yes	No	No	No	No	No	Yes	Cryptotia is a congenital anomaly that develops during fetal growth and is present at birth.	NA
9	Cupped ear	No	No	No	Yes	No	No	No	No	No	Yes	Cupped ear is a physical malformation that is present at birth, thus it has a congenital onset.	NA
28	Cupped ear	No	No	No	Yes	No	No	No	No	No	Yes	Cupped ear is a physical malformation that is present at birth, indicating congenital onset.	NA
47	Cupped ear	No	No	No	Yes	No	No	No	No	No	Yes	Cupped ear is a physical malformation that is present at birth and does not develop later in life. Therefore, it has a congenital onset.	NA
10	Exstrophy	No	No	Yes	Yes	No	No	No	No	Yes	Yes	Exstrophy is a congenital birth defect that occurs during fetal development.	NA
29	Exstrophy	No	No	Yes	Yes	No	No	No	No	Yes	Yes	Exstrophy is a congenital abnormality, present at birth.	NA
48	Exstrophy	No	No	Yes	Yes	No	No	No	No	Yes	Yes	Exstrophy is a congenital condition that occurs during.	NA
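
The iteration values above (5, 24 and 43 for the same phenotype) look like positions in the flattened results list rather than run numbers: consecutive values are 19 apart, and 19 terms were queried per run. Assuming that is the case, and assuming all 19 terms survived the merge with hpo_meta, the run number and the within-run position could be recovered with integer arithmetic:

```r
n_pheno <- 19            # phenotypes queried per run (assumed from the term list)
flat_id <- c(5, 24, 43)  # 'iteration' values shown for Atrioventricular canal defect

run       <- (flat_id - 1) %/% n_pheno + 1  # which of the 3 runs the row came from
pheno_idx <- (flat_id - 1) %% n_pheno + 1   # position of the phenotype within a run

data.frame(flat_id, run, pheno_idx)
# run is 1, 2, 3 and pheno_idx is 5 for all three rows
```

Adding a run column like this would make the table directly comparable with the 1 to 5 iteration column used in attempts #1 and #3.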

@bschilder @NathanSkene

NathanSkene · 2023-03-26T12:41:27Z

That prompt is not including the description of the phenotype is it?

…

________________________________ From: Kitty Murphy ***@***.***> Sent: Sunday, March 26, 2023 11:55:25 AM To: neurogenomics/RareDiseasePrioritisation ***@***.***> Cc: Skene, Nathan G ***@***.***>; Mention ***@***.***> Subject: Re: [neurogenomics/RareDiseasePrioritisation] Annotate diseases/phenotypes using chatGPT (Issue #19) This email from ***@***.*** originates from outside Imperial. Do not click on links and attachments unless you recognise the sender. If you trust the sender, add them to your safe senders list<https://spam.ic.ac.uk/SpamConsole/Senders.aspx> to disable email stamping for this address. Annotating HPO phenotypes using chatGPT via gptstudio Set up install.packages("gptstudio") library(gptstudio) # Load HPO terms terms_dt = HPOExplorer::load_phenotype_to_genes(3) terms_cols = list(name="Phenotype", id="ID") # Get unique terms and their ID's terms_dt_sub <.- unique(terms_dt[,unname(unlist(terms_cols)), with=FALSE]) Attempt #1<#1> Here I'm using the congenital onset terms (without HPO ID) that were provided to us by Peter Robinson. Will also try: * inputting HPO ID into prompt * asking chatGPT to add column with HPO ID # congenital onset terms without HPO ID congenital_onset <- "Syndactyly; Ventricular septal defect; Atrioventricular canal defect; Atrial septal defect; Abnormal connection of the cardiac segments; Fetal anomaly; Neural tube defect; Coloboma; Microtia; Cryptotia; Cupped ear; Cleft helix; Low-set ears; Synotia; Holoprosencephaly; Exstrophy; Abdominal wall defect; Abnormal lung lobation; Unilateral primary pulmonary dysgenesis" # define the effects you need answers to e.g. does the phenotype cause death effects <- "mental retardation, death, impaired mobility, physical malformations, blindness, sensory impairments, immunodeficiency, cancer, reduced fertility." 
# define the columns of the output table table_columns <- "phenotype, mental retardation, death, impaired mobility, physical malformations, blindness, sensory impairments, immunodeficiency, cancer, reduced fertility, congenital onset, jusitification." # define chatGPT prompt question = paste("Do:", congenital_onset, ", typically cause:", effects, "Do they have congenital onset?", "You must give one-word yes or no answers and give a justification for why they do or don't have congenital onset.", "You must provide the output in .tsv format with columns:", table_columns) question <- gsub("\n", "", question) # run chatgpt 5 times for the same prompt n = 5 run_chatgpt <- function(q){ all_res <- gptstudio::openai_create_chat_completion(prompt = question) choices <- fread(all_res[["choices"]]$message.content) } res_allPheno <- lapply(seq_len(n), function(x) run_chatgpt(1)) res_allPheno_dt <- data.table::rbindlist(res_list,fill = TRUE, use.names = TRUE, idcol = "iteration") # order alphabetically so that you can compare results across phenotypes res_allPheno_dt <- res_allPheno_dt [order(res_allPheno_dt $phenotype), ] Below is a subset of res_allPheno_dt. The answers chatGPT gives over iterations of the same prompt are not consistent e.g. look at mental retardation for coloboma. A coloboma is an area of missing tissue in your eye, and through a quick google search is not associated with mental retardation. iteration phenotype mental retardation death impaired mobility physical malformations blindness sensory impairments immunodeficiency cancer reduced fertility congenital onset justification 1 Atrioventricular canal defect Yes Yes Yes Yes No No No No No Yes Congenital heart defect present at birth 2 Atrioventricular canal defect Yes, in some cases May lead to premature death no May lead to growth failure, fatigue or rapid breathing May lead to vision problems None None None No AV canal defect is present at birth and is a congenital condition. 
3 Atrioventricular canal defect Yes Yes No Yes No No No No No Yes Atrioventricular canal defect is a congenital heart defect in which there is an opening in the center of the heart where the walls separating the heart chambers should be. 4 Atrioventricular canal defect Yes Possible None Physical malformations No No No No No Yes Congenital onset is typical of this phenotype as it is a result of abnormal development of the heart during fetal development. 5 Atrioventricular canal defect Yes Yes No Yes No No No No No Yes It is a congenital heart defect that is present at birth. 1 Cleft helix No No No Yes No No No No No Yes Congenital ear malformation present at birth 2 Cleft helix No None None May lead to physical malformations of the ear None None None None Yes Cleft helix is present at birth and is a congenital condition. 3 Cleft helix No No No Yes No No No No No Yes Cleft helix is a congenital anomaly characterized by a cleft or gap in the top part of the ear. 4 Cleft helix No None None Physical malformations No No No No No Yes Congenital onset is typical of this phenotype as it is a result of incomplete development of the ear during fetal development. 5 Cleft helix No No No Yes No No No No No Yes A cleft helix is a rare congenital malformation of the ear. 1 Coloboma Yes No No Yes Yes Yes No No No Yes Present at birth and can affect vision and eye structure 2 Coloboma No May lead to vision problems or blindness May depend on location on the body None May lead to vision problems or blindness May lead to hearing loss or deafness None None No Coloboma is present at birth and is a congenital condition. 3 Coloboma Yes No No Yes Yes Yes No No No Yes Coloboma is a congenital anomaly characterized by a gap or hole in one of the structures of the eye. 4 Coloboma No None None Physical malformations Possible Possible No No No Yes Congenital onset is typical of this phenotype as it is a result of incomplete fusion of the tissues that form the eye during fetal development. 
5 Coloboma Yes No No Yes Yes No No No No Yes A coloboma is a birth defect that affects the eye. 1 Cryptotia No No No Yes No No No No No Yes Congenital ear malformation present at birth 2 Cryptotia No None None May lead to physical malformations of the ear None None None None Yes Cryptotia is present at birth and is a congenital condition. 3 Cryptotia No No No Yes No No No No No Yes Cryptotia is a congenital anomaly characterized by a hidden ear that is partially or completely covered by skin. 4 Cryptotia No None None Physical malformations No No No No No Yes Congenital onset is typical of this phenotype as it is a result of abnormal development of the ear during fetal development. 5 Cryptotia No No No Yes No No No No No Yes Cryptotia is a congenital ear deformity. 1 Cupped ear No No No Yes No No No No No Yes Congenital ear malformation present at birth 2 Cupped ear No None None May lead to physical malformations of the ear None None None None Yes Cupped ear is present at birth and is a congenital condition. 3 Cupped ear No No No Yes No No No No No Yes Cupped ear is a congenital anomaly characterized by an ear that is shaped like a cup and protrudes outward from the side of the head. 4 Cupped ear No None None Physical malformations No No No No No Yes Congenital onset is typical of this phenotype as it is a result of abnormal development of the ear during fetal development. 5 Cupped ear No No No Yes No No No No No Yes A cupped ear is a congenital malformation. 1 Exstrophy Yes No Yes Yes No No No No No Yes Present at birth and affects bladder and pelvic development 2 Exstrophy No None None May lead to physical malformations of the abdominal wall or pelvic organs None None None May lead to reduced fertility Yes Exstrophy is present at birth and is a congenital condition. 3 Exstrophy Yes No Yes Yes No No No No No Yes Exstrophy is a congenital anomaly characterized by a defect in the abdominal wall or bladder. 
4 | Exstrophy | No | None | None | Physical malformations | No | No | No | No | No | Yes | Congenital onset is typical of this phenotype as it is a result of abnormal development of the abdominal wall during fetal development.
5 | Exstrophy | Yes | No | Yes | Yes | No | No | No | No | No | Yes | Exstrophy is a congenital abnormality where the bladd

Attempt #2

What if I run the prompt one phenotype at a time, with 3 iterations?

congenital_onset_split <- as.list(strsplit(congenital_onset, "; ")[[1]])

results_list <- list()
for (j in 1:3) {
  res_individualPheno <- lapply(seq_len(length(congenital_onset_split)), function(i){
    pheno <- congenital_onset_split[[i]]
    question = paste("Does", pheno, "typically cause:", effects,
                     "Does", pheno, "have congenital onset?",
                     "You must give one-word yes or no answers and give a justification for why it does or doesn't have congenital onset.",
                     "You must provide the output in .tsv format with columns:",
                     table_columns)
    question <- gsub("\n", "", question)
    print(question)
    all_res <- gptstudio::openai_create_chat_completion(prompt = question)
    # parse the returned .tsv text into a table
    choices <- data.table::fread(text = all_res[["choices"]]$message.content)
    return(choices)
  })
  results_list[[j]] <- res_individualPheno # store this iteration's results in the list
}
# flatten the per-iteration lists into a single list of tables
res_list <- unlist(results_list, recursive = FALSE)
# note: after flattening, "iteration" indexes each returned table (1..N), not the loop counter j
res_individualPheno_dt <- data.table::rbindlist(res_list, fill = TRUE, use.names = TRUE, idcol = "iteration")
# order alphabetically so that you can compare results across phenotypes
res_individualPheno_dt <- res_individualPheno_dt[order(res_individualPheno_dt$phenotype), ]

Below is a subset of res_individualPheno_dt; I've shown the same phenotypes as for res_allPheno_dt for comparison. There seems to be more consistency across the iterations when you run chatGPT on each phenotype individually.
phenotype | mental retardation | death | impaired mobility | physical malformations | blindness | sensory impairments | immunodeficiency | cancer | reduced fertility | congenital onset | justification
Atrioventricular canal defect | no | no | no | yes | no | no | no | no | no | yes | NA
Atrioventricular canal defect | No | No | No | Yes | No | No | No | No | No | Yes | NA
Atrioventricular canal defect | No | No | No | Yes | No | No | No | No | No | Yes | NA
Cleft helix | No | No | No | Yes | No | No | No | No | No | Yes | NA
Cleft helix | No | No | No | Yes | No | No | No | No | No | Yes | NA
Cleft helix | No | No | No | Yes | No | No | No | No | No | Yes | NA
Coloboma | No | No | No | Yes | Yes | Yes | No | No | No | Yes | NA
Coloboma | No | No | No | Yes | Yes | Yes | No | No | No | Yes | NA
Coloboma | no | no | no | yes | yes | yes | no | no | no | yes | NA
Cryptotia | No | No | No | Yes | No | No | No | No | No | Yes | NA
Cryptotia | No | No | No | Yes | No | No | No | No | No | Yes | NA
Cryptotia | No | No | No | Yes | No | No | No | No | No | Yes | NA
Cupped ear | No | No | No | Yes | No | No | No | No | No | Yes | NA
Cupped ear | no | no | no | yes | no | no | no | no | no | yes | NA
Cupped ear | No | No | No | Yes | No | No | No | No | No | Yes | NA
Exstrophy | No | No | Yes | Yes | No | No | No | No | Yes | Yes | NA
Exstrophy | No | No | Yes | Yes | No | No | No | Yes | Yes | Yes | NA
Exstrophy | No | No | Yes | Yes | No | No | No | No | Yes | Yes | NA

@bschilder @NathanSkene

bschilder · 2023-03-26T14:53:30Z

Nice progress @KittyMurphy . That's interesting about the responses being more consistent when provided individually. Wondering if this has to do with informational overload like we were discussing before. Might be an aspect of chatGPT that other people have noticed and documented.

One thing that would be helpful is to come up with a function that computes consistency scores for each metric. That will give us at least some quantitative metric of performance (tho not exactly the ground truth). Something like:

dat=xlsx::read.xlsx("~/Downloads/annot.xlsx",1)
avg <- dplyr::group_by(dat, phenotype) |> dplyr::summarise( mental.retardation_consistency=1/length(unique(mental.retardation)))
avg

After computing the within phenotype consistency, you can compute mean consistency:

mean(avg$mental.retardation_consistency)
# 0.75
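
The same idea could be extended to every annotation column at once. Below is a minimal base R sketch of one way to do that; the function name `consistency_scores`, its arguments and the toy data are hypothetical, and answers are lower-cased first on the assumption that "No" and "no" should count as the same response.

```r
# Hypothetical helper: per-phenotype consistency for every annotation column.
# Consistency = 1 / (number of distinct answers given for that phenotype),
# so 1 means every run agreed and 0.5 means two different answers were seen.
consistency_scores <- function(dat,
                               id_col = "phenotype",
                               skip_cols = "justification") {
  value_cols <- setdiff(names(dat), c(id_col, skip_cols))
  groups <- split(dat[value_cols], dat[[id_col]])
  per_pheno <- lapply(groups, function(g) {
    # Normalise case and whitespace so "No" and "no " are treated alike
    vapply(g, function(x) 1 / length(unique(tolower(trimws(x)))), numeric(1))
  })
  scores <- do.call(rbind, per_pheno)
  # Mean consistency per column, averaged over phenotypes
  colMeans(scores)
}

# Toy data (made up for illustration, not real chatGPT output)
toy <- data.frame(
  phenotype = c("A", "A", "B", "B"),
  death     = c("Yes", "No", "no", "No"),
  cancer    = c("No", "No", "No", "No"),
  stringsAsFactors = FALSE
)
res <- consistency_scores(toy)
print(res)
```

In the toy data, phenotype A received two different answers for `death` (score 0.5) and phenotype B only one (score 1), so the mean for that column is 0.75.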

That prompt is not including the description of the phenotype is it?

@NathanSkene I believe this is only providing chatGPT with the name of the phenotype, not the full description of it. Thus, any other information about the disease is being pulled from the LLM itself.

NathanSkene · 2023-03-26T15:24:32Z

Good idea to get some stats on it. Could also use scoring to compare ChatGPT 3 vs 4 consistency: expect some folks will be interested. Including the HPO description might help it get a more consistent understanding of what the phenotype is. Brian, do you know how the descriptions can be accessed programmatically?

KittyMurphy · 2023-03-26T15:26:26Z

Already working on adding the description, @bschilder I assume the best way to get this is to use the definition column in HPOExplorer::make_phenos_dataframe?

bschilder · 2023-03-26T15:27:45Z

Already working on adding the description, @bschilder I assume the best way to get this is to use the definition column in HPOExplorer::make_phenos_dataframe?

Yeah, that'll work. Or the subfunction which is more direct:
HPOExplorer::add_hpo_definition()

NathanSkene · 2023-03-28T09:07:05Z

The current prompts do not include a statement for "Do not consider indirect effects". Would be worth adding this in and seeing if it makes any difference.

bschilder · 2023-05-09T17:06:37Z

I tried out AutoGPT to see if this might be a useful avenue. Here’s what I learned:

Pros

It can search the internet, via APIs or via Selenium queries. For example, if you ask it something it’s unsure about, it can read the relevant literature/databases on the topic to gain more expertise in that area.
It has built-in python code for reading/writing code or other files. This means no need to copy-and-paste output from the browser interface. Using this feature I was able to tell it to read in a series of CSVs with 100 HPO terms each (that I had created beforehand) so that each query was a manageable size that didn’t exceed the token limit.
There is a dedicated Docker container to run AutoGPT. The instructions are not super straightforward (or correct) but after some troubleshooting and checking the GitHub Issues I was able to get things working. I took notes on exactly how to do this and will share.
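
The batching step mentioned above (a series of CSVs with 100 HPO terms each) could be sketched in base R roughly as follows. The function name `write_term_batches`, the file naming pattern and the placeholder term names are all assumptions for illustration, not the code that was actually used.

```r
# Hypothetical helper: split a character vector of HPO term names into
# batches of `size` and write each batch to its own CSV file.
write_term_batches <- function(terms, size = 100, out_dir = tempdir()) {
  batch_id <- ceiling(seq_along(terms) / size)
  batches <- split(terms, batch_id)
  paths <- file.path(out_dir,
                     sprintf("hpo_terms_batch_%03d.csv", seq_along(batches)))
  for (i in seq_along(batches)) {
    utils::write.csv(data.frame(phenotype = batches[[i]]),
                     file = paths[i], row.names = FALSE)
  }
  paths
}

# Toy example with 250 placeholder term names
terms <- paste("term", seq_len(250))
paths <- write_term_batches(terms, size = 100)
length(paths)
```

With 250 terms and a batch size of 100, this writes three files, the last of which holds the remaining 50 terms.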

Cons

As very few people have API access to GPT4 atm, it means that when we use AutoGPT we can only use the GPT3.5-turbo model. As you know, this is not as sophisticated of a model and will do things like write lazy code that just assigns the same annotations to every phenotype, or simply do substring searches for the term “blindness” within the HPO term itself (which isn’t very useful).
It requires you to have a paid OpenAI account. In the interest of time, I just entered my personal credit card details. It’s actually not too bad; after a whole day of making hundreds of queries I only racked up $1.46 in charges. But still something to be mindful of.
It’s very tricky to get it to do what you actually want, and requires a lot of trial-and-error to get it close. This will hopefully be better with GPT4, but in the meantime I wasn’t able to get it to produce any kind of meaningful annotation for the HPO terms.

bschilder · 2023-05-09T17:10:30Z

Here is my favorite example of how AutoGPT can be very lazy 😅

KittyMurphy · 2023-05-19T11:03:30Z

I have now performed a trial run to annotate phenotypes using chat gpt via selenium. Initially we asked gpt to provide the output in .tsv format but I had difficulty trying to extract this from the chat interface into python. To overcome this, I asked gpt to provide the output as python code that I could then run to generate a data frame. @bschilder noted that earlier versions of gpt could sometimes be lazy when asking for code.

Here is a prompt example:
"I need to annotate phenotypes as to whether they typically cause: intellectual disability, death, impaired mobility, physical malformations, blindness, sensory impairments, immunodeficiency, cancer, reduced fertility? Do they have congenital onset? You must give one-word yes or no answers. Do not consider indirect effects. You must provide the output in python code as a data frame called df with columns: phenotype, intellectual_disability, death, impaired_mobility, physical_malformations, blindness, sensory_impairments, immunodeficiency, cancer, reduced_fertility, congenital_onset, justification. These are the phenotypes: Abnormality of body height; Multicystic kidney dysplasia; Autosomal dominant inheritance; Autosomal recessive inheritance; Abnormal morphology of female internal genitalia; Functional abnormality of the bladder; Recurrent urinary tract infections; Neurogenic bladder; Urinary urgency; Hypoplasia of the uterus; Abnormality of the bladder; Bladder diverticulum"
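
A prompt like the one above could be assembled programmatically for each batch of phenotypes. The following is a minimal sketch in base R; the function name `build_prompt` and its arguments are hypothetical, and the wording simply mirrors the example prompt.

```r
# Hypothetical helper: assemble the annotation prompt for a batch of phenotypes.
build_prompt <- function(phenotypes,
                         outcomes = c("intellectual disability", "death",
                                      "impaired mobility",
                                      "physical malformations", "blindness",
                                      "sensory impairments",
                                      "immunodeficiency", "cancer",
                                      "reduced fertility"),
                         exclude_indirect = TRUE) {
  # Column names are the outcomes with spaces replaced by underscores
  cols <- c("phenotype", gsub(" ", "_", outcomes),
            "congenital_onset", "justification")
  paste0(
    "I need to annotate phenotypes as to whether they typically cause: ",
    paste(outcomes, collapse = ", "), "? ",
    "Do they have congenital onset? ",
    "You must give one-word yes or no answers. ",
    if (exclude_indirect) "Do not consider indirect effects. " else "",
    "You must provide the output in python code as a data frame called df ",
    "with columns: ", paste(cols, collapse = ", "), ". ",
    "These are the phenotypes: ",
    paste(phenotypes, collapse = "; ")
  )
}

prompt <- build_prompt(c("Neurogenic bladder", "Urinary urgency"))
cat(prompt)
```

Keeping the outcome list as an argument means the column names in the requested data frame stay in sync with the outcomes named in the question.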

Here is the trial run using ~100 phenotypes (note, there are ~200 because I think I appended the results twice by mistake): annot_HPO_gpt_test.csv

@NathanSkene noted that the phenotype 'Azoospermia' is not being annotated as reducing fertility. This is worrying because a literature search for this phenotype returns:
"Azoospermia is the complete absence of spermatozoa in the ejaculate. It is the most severe and one of the leading causes of male infertility. The exact pathophysiology of azoospermia is not always known. Azoospermia can be due to pre-testicular, testicular, and post-testicular causes."

Next, I want to:

Run the prompt that included the 'Azoospermia' phenotype again, once asking gpt to provide the output as python code and once as a semi-colon separated list.
This time round, the python code output 'Azoospermia' as reducing fertility and the justification column had justifications and not just NAs.
I just realised that the prompt doesn't specify what the justification column should be for, maybe that's why it was outputting NAs previously.
Ask gpt to add a justification column for each phenotype; this might require including fewer phenotypes in the prompt so as to not overwhelm gpt with information (this seems to be an issue with earlier versions of gpt).
Seems to work well with the justifications but the response generation was stopped prematurely, probably due to token usage.
Repeated using only 4 phenotypes (used 12 before), again seems to be working well e.g. the justification column for reduced fertility for Azoospermia: 'Azoospermia leads to male infertility', but the response generation was also stopped prematurely.

bschilder · 2023-05-19T17:59:02Z

Thanks @KittyMurphy !

A couple of other ideas for reducing token usage (tho whether this helps will depend on how OpenAI counts 'tokens', which I'm still not totally clear on):

Using a persistent session and only defining the task in the first prompt. After that, just keep asking it to produce the same output each time. Hopefully this won't impact the quality of the outputs.
Ask to return "Y/N" instead of "Yes/No"
Ask chatGPT to abbreviate column names (e.g. "Physical_Malformations"-->"PM")
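
If abbreviated output were requested, it would need to be expanded again before analysis. Here is a rough sketch of how that decoding might look in base R. The abbreviation scheme (initials of each underscore-separated word), the function names and the toy row are all assumptions made for illustration.

```r
# Full annotation column names, as used in the consistency checks
full_names <- c("Intellectual_Disability", "Death", "Impaired_Mobility",
                "Physical_Malformations", "Blindness", "Sensory_Impairments",
                "Immunodeficiency", "Cancer", "Reduced_Fertility",
                "Congenital_Onset")

# Hypothetical scheme: "Physical_Malformations" becomes "PM"
abbreviate_name <- function(x) {
  vapply(strsplit(x, "_", fixed = TRUE),
         function(p) paste(substr(p, 1, 1), collapse = ""),
         character(1))
}

# Expand abbreviated column names and Y/N answers back to the full forms
decode_annotations <- function(df, full_names) {
  key <- stats::setNames(full_names, abbreviate_name(full_names))
  # The scheme only works if every abbreviation is unique
  stopifnot(anyDuplicated(names(key)) == 0)
  hits <- names(df) %in% names(key)
  names(df)[hits] <- unname(key[names(df)[hits]])
  yn <- c(Y = "Yes", N = "No")
  for (col in intersect(full_names, names(df))) {
    # Anything other than Y/N becomes NA so it can be flagged later
    df[[col]] <- unname(yn[toupper(trimws(df[[col]]))])
  }
  df
}

# Toy row (made up for illustration)
toy <- data.frame(Phenotype = "Azoospermia", PM = "N", RF = "Y",
                  stringsAsFactors = FALSE)
decoded <- decode_annotations(toy, full_names)
print(decoded)
```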

bschilder · 2023-05-19T19:38:16Z

Annotation output checks

All of the annotation validation procedures described below can be rerun with any new annotations using the new internal function: HPOExplorer:::check_annot_gpt
https://github.com/neurogenomics/HPOExplorer/blob/master/R/check_annot_gpt.R

Check phenotype names

Check that chatGPT hasn't modified the phenotype names in a way that prevents us from linking them back to the input HPO terms.

  d <- data.table::fread(path, key = "Phenotype")
  annot <- HPOExplorer::load_phenotype_to_genes()
  d$Phenotype[!d$Phenotype %in% annot$Phenotype]
# character(0)

✅ All phenotypes appear verbatim in the HPO gene annotations file.
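
If some names ever fail this exact-match check, it may help to see whether they differ only in case or whitespace. The sketch below shows one possible diagnostic in base R; the function name `find_unmatched` and the toy vectors are hypothetical.

```r
# Hypothetical diagnostic: flag returned names that fail the exact match
# but would match after normalising case and whitespace.
find_unmatched <- function(returned, reference) {
  norm <- function(x) tolower(trimws(gsub("\\s+", " ", x)))
  exact <- returned %in% reference
  loose <- norm(returned) %in% norm(reference)
  data.frame(returned = returned,
             exact = exact,
             loose_only = loose & !exact,
             stringsAsFactors = FALSE)
}

# Toy example (made up for illustration)
reference <- c("Neurogenic bladder", "Urinary urgency")
returned  <- c("Neurogenic bladder", "urinary  urgency", "Bladder pain")
chk <- find_unmatched(returned, reference)
print(chk)
```

Rows with `loose_only = TRUE` are candidates for automatic repair, while rows that fail both checks would need manual review.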

Check annotation consistency

For phenotypes that chatGPT annotated more than once, how consistent are the Y/N annotations it gave for each?

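  # annotation columns (everything except the ID and free-text columns)
  nm <- names(d)[!names(d) %in% c("Phenotype","Justification")]
  # per phenotype: proportion of runs that answered "Yes" for each column
  d_mean <- d[,lapply(.SD,function(x){mean(x=="Yes")}),.SDcols=nm, by="Phenotype"]
  # per column: proportion of phenotypes whose runs all agreed (mean is exactly 0 or 1)
  d_consist <- lapply(d_mean[,-1], function(x) sum(x %in% c(0,1))/nrow(d_mean))
d_consist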

$Intellectual_Disability
[1] 1

$Death
[1] 1

$Impaired_Mobility
[1] 1

$Physical_Malformations
[1] 1

$Blindness
[1] 1

$Sensory_Impairments
[1] 1

$Immunodeficiency
[1] 1

$Cancer
[1] 1

$Reduced_Fertility
[1] 0.7708333

$Congenital_Onset
[1] 1

mean(unlist(d_consist))
#  0.9770833

✅ At least in this small subsample, 9/10 annotation columns are 100% consistent across chatGPT runs. This results in an average consistency score of 97.7% across all annotations. "Reduced_Fertility" is one to look out for, as it does not appear to always provide the same annotation here (77%, which may seem not too bad but remember that baseline is 50% as the options are binary).

Check phenotype classifications

As some of these phenotypes belong to specific branches of the HPO that should guarantee a particular annotation (e.g. all forms of blindness phenotypes cause Blindness ('Yes')), we can use this information to validate the chatGPT-provided annotations.

While we can confirm annotations that we would expect (true positives vs. false negatives), this doesn't really let us definitively say whether some phenotypes do NOT cause a given condition such as blindness (true negatives).

d$HPO_ID <- harmonise_phenotypes(phenotypes = d$Phenotype,
                                   as_hpo_ids = TRUE)
  ## Find matching HPO branches
  hpo <- get_hpo() 
  queries <- list(
    Intellectual_Disability=c("intellectual disability"),
    Impaired_Mobility=c("Abnormal central motor function",
                        "Abnormality of movement"),
    Physical_Malformations=c("malformation","morphology"),
    Blindness=c("^blindness"),
    Sensory_Impairments=c("Abnormality of vision",
                          "Abnormality of the sense of smell",
                          "Abnormality of taste sensation",
                          "Somatic sensory dysfunction",
                          "Hearing abnormality"
                          ),
    Immunodeficiency=c("Immunodeficiency"),
    Cancer=c("Neoplasm","Cancer"),
    Reduced_Fertility=c("Decreased fertility")
    ) 
  tiers <- lapply(queries, function(q){
    terms <- grep(paste(q,collapse = "|"),
         hpo$name,
         ignore.case = TRUE, value = TRUE)
    ontologyIndex::get_descendants(ontology = hpo,
                                   roots = names(terms),
                                   exclude_roots = FALSE) |>
      unique()
  })
  annot_check <- lapply(seq_len(nrow(d)), function(i){
    r <- d[i,]
    cbind(
      r[,c("Phenotype","HPO_ID")],
      lapply(stats::setNames(names(tiers),names(tiers)),
             function(x){
               if(r$HPO_ID %in% tiers[[x]]){
                 r[,x,with=FALSE][[1]]=="Yes"
               } else {
                 NA
               }
             }) |> data.table::as.data.table()
    )
  }) |> data.table::rbindlist()
  
### Proportion of rows where the annotation could not be checked (NA)
  missing_rate <- sapply(
    annot_check[,names(tiers),with=FALSE],
    function(x){sum(is.na(x))/length(x)})
missing_rate

Intellectual_Disability       Impaired_Mobility  Physical_Malformations 
              1.0000000               1.0000000               0.4558824 
              Blindness     Sensory_Impairments        Immunodeficiency 
              1.0000000               1.0000000               1.0000000 
                 Cancer       Reduced_Fertility 
              0.9901961               0.9607843

True positive rate

### Proportion of checkable rows where the annotation was TRUE
true_pos_rate <- sapply(annot_check[,names(tiers),with=FALSE], function(x){sum(na.omit(x)==TRUE)/length(na.omit(x))})
true_pos_rate

Intellectual_Disability       Impaired_Mobility  Physical_Malformations 
                    NaN                     NaN               0.5765766 
              Blindness     Sensory_Impairments        Immunodeficiency 
                    NaN                     NaN                     NaN 
                 Cancer       Reduced_Fertility 
              1.0000000               0.5000000

False negative rate

### Proportion of checkable rows where the annotation was FALSE
false_neg_rate <- sapply(annot_check[,names(tiers),with=FALSE], function(x){sum(na.omit(x)==FALSE)/length(na.omit(x))})
false_neg_rate

Intellectual_Disability       Impaired_Mobility  Physical_Malformations 
                    NaN                     NaN               0.4234234 
              Blindness     Sensory_Impairments        Immunodeficiency 
                    NaN                     NaN                     NaN 
                 Cancer       Reduced_Fertility 
              0.0000000               0.5000000
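
In the outputs above, the true positive and false negative rates for each column sum to 1 because both are computed over the same checkable rows, and NaN appears wherever no row was checkable (0/0). As a small sketch of how the three rates could be computed together, here is a base R helper; the name `annotation_rates` and the toy vectors are hypothetical, and returning NA instead of NaN for uncheckable columns is a design choice of this sketch.

```r
# Hypothetical helper: summarise one column of the annotation check.
# x is a logical vector: TRUE/FALSE where the annotation could be checked
# against the HPO branch, NA where it could not.
annotation_rates <- function(x) {
  checked <- x[!is.na(x)]
  c(missing   = mean(is.na(x)),
    # Return NA (rather than NaN from 0/0) when nothing was checkable
    true_pos  = if (length(checked)) mean(checked) else NA_real_,
    false_neg = if (length(checked)) mean(!checked) else NA_real_)
}

# Toy input (made up for illustration)
r <- annotation_rates(c(TRUE, FALSE, NA, TRUE, NA))
print(r)

# A column with no checkable rows
r_none <- annotation_rates(c(NA, NA))
print(r_none)
```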

KittyMurphy · 2023-06-08T07:02:31Z

I have since updated the prompt twice.

Example prompt 1.1: I need to annotate phenotypes as to whether they typically cause: intellectual disability, death, impaired mobility, physical malformations, blindness, sensory impairments, immunodeficiency, cancer, reduced fertility? Do they always have congenital onset? You must give one-word yes or no answers. Do not consider indirect effects. You must provide the output in python code as a data frame called df with columns: phenotype, intellectual_disability, death, impaired_mobility, physical_malformations, blindness, sensory_impairments, immunodeficiency, cancer, reduced_fertility, congenital_onset. Also add justification columns for each outcome. These are the phenotypes: Recurrent urinary tract infections; Neurogenic bladder; Urinary urgency

Here are the results for ~500 phenotypes: gpt_hpo_annotations.csv. The issue here was that we were getting answers other than yes or no for some of the phenotypic outcomes, e.g. 'can be', 'may be'. To get around this, we decided to add a scale for the phenotypic outcomes, so instead of yes or no answers we ask chat gpt to answer using a scale of: never, rarely, often, always. Due to limited token usage we had to drop the number of phenotypes in each prompt to two.

Example prompt 1.2: I need to annotate phenotypes as to whether they typically cause: intellectual disability, death, impaired mobility, physical malformations, blindness, sensory impairments, immunodeficiency, cancer, reduced fertility? Do they have congenital onset? To answer, use a severity scale of: never, rarely, often, always. Do not consider indirect effects. You must provide the output in python code as a data frame called df with columns: phenotype, intellectual_disability, death, impaired_mobility, physical_malformations, blindness, sensory_impairments, immunodeficiency, cancer, reduced_fertility, congenital_onset. Also add justification columns for each outcome. These are the phenotypes: Urinary urgency; Hypoplasia of the uterus

Here are the results so far: gpt_hpo_annotations_scale.csv
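
For downstream analysis, answers on the never/rarely/often/always scale could be converted to ordered numeric scores. Below is a minimal base R sketch; the function name `to_score` and the 0 to 3 coding are assumptions, and any answer outside the scale (such as 'can be') is deliberately mapped to NA so that it can be flagged.

```r
# The four-point scale used in the prompt, from lowest to highest
scale_levels <- c("never", "rarely", "often", "always")

# Hypothetical helper: map scale answers to integer scores 0..3.
# Answers that are not on the scale become NA.
to_score <- function(x) {
  f <- factor(tolower(trimws(x)), levels = scale_levels, ordered = TRUE)
  as.integer(f) - 1L
}

scores <- to_score(c("Never", "often", "always", "can be"))
print(scores)
# 0 2 3 NA
```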

Currently waiting for help from Eugene to get this set up on a remote machine so that it can run 24/7, and it will probably take ~2 weeks.

bschilder · 2023-06-14T14:43:42Z

@KittyMurphy I'm looking into some resources that might be helpful:

ChatGPT File uploader (google chrome extension)
https://chrome.google.com/webstore/detail/chatgpt-file-uploader-ext/becfinhbfclcgokjlobojlnldbfillpf/

Bing Chat: Microsoft's iteration of ChatGPT:
https://www.bing.com/

bschilder · 2023-11-06T14:23:23Z

Update

Stage 1

We first only ran GPT annotations for the 2,832 phenotypes that were significantly enriched for at least one cell type in our first round of analyses.

Stage 2

Then, we expanded to all 10,969 phenotypes that appeared within the HPO gene annotations file. This should be sufficient for the first Rare Disease Celltyping paper, as it allows us to prioritise all phenotypes relevant for that paper.

annot=HPOExplorer::load_phenotype_to_genes()
length(unique(annot$hpo_name))
# [1] 10969

@KittyMurphy is running the last of these now.

Stage 3

Finally, we will further extend our GPT annotations to all phenotypes in the HPO, which is currently 18,057 total phenotypes. This will be used for the GPT annotations manuscript.

hpo=HPOExplorer::get_hpo()
length(unique(hpo$name))
# [1] 18057

KittyMurphy · 2023-11-06T14:28:53Z

I've actually been using the below code to get the phenotypes:

annot <- HPOExplorer::make_phenos_dataframe()
length(unique(annot$hpo_id))
# [1] 10954

I'll make sure I run the remaining 15 phenotypes that are called with HPOExplorer::load_phenotype_to_genes() but just wanted to flag the discrepancy between the two.

bschilder · 2023-11-06T14:50:38Z

@KittyMurphy make_phenos_dataframe calls load_phenotype_to_genes to get the data, so they should be the same (unless somehow certain phenotypes get filtered in the former function).
https://github.com/neurogenomics/HPOExplorer/blob/master/R/make_phenos_dataframe.R

Could you check whether this discrepancy stems from:

The functions themselves
Different versions of HPO ontology/genes data (note, data is cached by default)
Different versions of HPOExplorer
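
One way to narrow this down is to compare the two sets of identifiers directly rather than only their counts. Note that the counts quoted above are taken over different columns (hpo_name in one case, hpo_id in the other), so it may be worth comparing like with like. The sketch below uses base R only; the function name `compare_ids` and the placeholder ID vectors are hypothetical stand-ins for the columns returned by the two HPOExplorer functions.

```r
# Hypothetical helper: report which identifiers are unique to each source.
compare_ids <- function(a, b) {
  a <- unique(a)
  b <- unique(b)
  list(n_a = length(a),
       n_b = length(b),
       only_in_a = setdiff(a, b),
       only_in_b = setdiff(b, a))
}

# Placeholder IDs standing in for the two real ID columns
ids_genes  <- c("HP:0000001", "HP:0000002", "HP:0000003")
ids_phenos <- c("HP:0000001", "HP:0000003")
cmp <- compare_ids(ids_genes, ids_phenos)
print(cmp$only_in_a)
```

If `only_in_a` is non-empty while `only_in_b` is empty, one source is a strict subset of the other, which would point towards filtering inside the function rather than a data version mismatch.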

bschilder assigned bschilder and KittyMurphy Mar 20, 2023

bschilder changed the title ~~Annotate disease severity using chatGPT~~ Annotate diseases/phenotypes using chatGPT Mar 20, 2023

bschilder added this to the Publish rare disease celltyping manuscript milestone Oct 23, 2023

bschilder mentioned this issue Nov 2, 2023

Prioritise therapeutic candidates for nervous system phenotypes #30

Closed

bschilder added the enhancement New feature or request label Nov 2, 2023

bschilder mentioned this issue Nov 6, 2023

Are congenital phenotypes more enriched for fetal cell types? neurogenomics/rare_disease_celltyping#47

Closed

bschilder mentioned this issue Nov 7, 2023

Improve HPO data version control neurogenomics/HPOExplorer#38

Closed

bschilder closed this as completed Dec 18, 2023
