Add a generate-extract command for parsing the results of generated text #158

cmungall · 2023-07-31T15:56:56Z

Use case: generate a description of a concept (e.g. a cell type) entirely from an LLM's "latent knowledge base", and then extract structured knowledge from it, thus bypassing the need for an incomplete PubMed search.

This could also be used as a kind of validation procedure on generated text: compare the extracted knowledge with what is in the KB. Any difference is either a hallucination or a KB gap.
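The validation idea above can be sketched as a per-slot set comparison. This is only an illustration, not part of OntoGPT: the function name `compare_to_kb` and the `kb` contents are hypothetical, and the `extracted` values are copied from the example output in the linked PR.

```python
def compare_to_kb(extracted, kb):
    """Compare extracted IDs with KB IDs, slot by slot.

    Both arguments map a slot name to a list of CURIEs.
    IDs found only in the generated text are candidate hallucinations
    (or KB gaps); IDs found only in the KB were not mentioned by the LLM.
    """
    report = {}
    for slot in sorted(set(extracted) | set(kb)):
        ext = set(extracted.get(slot, []))
        ref = set(kb.get(slot, []))
        report[slot] = {
            "agreed": sorted(ext & ref),
            "only_in_text": sorted(ext - ref),
            "only_in_kb": sorted(ref - ext),
        }
    return report


# Values taken from the example generate-extract output.
extracted = {
    "parents": ["CL:0000066"],
    "localizations": ["UBERON:0001044", "UBERON:0009842"],
}
# Hypothetical KB content, invented for this illustration only.
kb = {
    "parents": ["CL:0000066"],
    "localizations": ["UBERON:0001044"],
}

report = compare_to_kb(extracted, kb)
for slot, diff in report.items():
    print(slot, diff)
```

Each `only_in_text` entry still needs manual review, since the comparison alone cannot tell a hallucination from a genuine KB gap.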

This PR does two things:

- Add a combined generate-extract command, fixes #158
- Adds cell type templates, fixes #159

## Generate-Extract

`ontogpt generate-extract -m gpt-4 -t cell_type "Acinar Cell Of Salivary Gland"`

This does two things:

1. asks GPT to generate a summary of the cell type
2. parses/extracts knowledge from that summary

This resuscitates the original HALO idea. We could in principle **directly generate an entire knowledgebase in structured form from the latent GPT KB**.

Example output:

```yaml
extracted_object:
  cell_type: Acinar cell of a salivary gland
  parents:
    - CL:0000066
  subtypes:
    - CL:0000313
    - CL:0000319
  localizations:
    - UBERON:0001044
    - UBERON:0009842
  diseases:
    - AUTO:Sj%C3%B6gren%27s%20syndrome
    - MONDO:0021357
named_entities:
  - id: CL:0000066
    label: Epithelial cell
  - id: CL:0000313
    label: Serous cells
  - id: CL:0000319
    label: Mucous cells
  - id: UBERON:0001044
    label: Salivary gland
  - id: UBERON:0009842
    label: Acinus
  - id: AUTO:Sj%C3%B6gren%27s%20syndrome
    label: Sjögren's syndrome
  - id: MONDO:0021357
    label: Salivary gland tumors
```

## Cell Type Templates

This PR also demonstrates using subclasses for more refined subtypes.

Compare the two:

1. `ontogpt generate-extract -m gpt-4 -t cell_type "L2/3 Intratelencephalic Projecting Glutamatergic Neuron Of The Primary Motor Cortex"`
2. `ontogpt generate-extract -m gpt-4 -t cell_type.InterneuronDocument "L2/3 Intratelencephalic Projecting Glutamatergic Neuron Of The Primary Motor Cortex"`

The first uses the generic base class. The second uses a subclass designed for interneurons, which has an extra slot for projection fields.

Example output:

```yaml
extracted_object:
  cell_type: L2/3 Intratelencephalic Projecting Glutamatergic Neuron of the Primary Motor Cortex
  range: Not mentioned
  parents:
    - AUTO:excitatory%20neuron
  subtypes:
    - AUTO:Not%20mentioned
  localizations:
    - UBERON:0000956
    - UBERON:0001384
  genes:
    - AUTO:Not%20mentioned
  diseases:
    - MONDO:0005180
    - MONDO:0020128
  projects_to_or_from:
    - UBERON:0001893
named_entities:
  - id: UBERON:0001893
    label: telencephalon
  - id: AUTO:excitatory%20neuron
    label: excitatory neuron
  - id: AUTO:Not%20mentioned
    label: Not mentioned
  - id: UBERON:0000956
    label: cerebral cortex
  - id: UBERON:0001384
    label: primary motor cortex
  - id: MONDO:0005180
    label: Parkinson's disease
  - id: MONDO:0020128
    label: motor neuron disease
```
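Some IDs in the example outputs carry an `AUTO:` prefix followed by a percent-encoded label (for instance `AUTO:Sj%C3%B6gren%27s%20syndrome`), while others are ontology CURIEs such as `MONDO:0021357`. A minimal sketch for separating the two when post-processing the output, assuming `AUTO:` marks a term that was not matched to an ontology ID (the helper name `split_grounded` is hypothetical and not part of OntoGPT):

```python
from urllib.parse import unquote


def split_grounded(ids):
    """Split CURIEs into ontology-grounded IDs and decoded AUTO: labels.

    Assumption: an "AUTO:" prefix means the term was not matched to an
    ontology ID and the local part is the percent-encoded label.
    """
    grounded, ungrounded = [], []
    for curie in ids:
        prefix, _, local = curie.partition(":")
        if prefix == "AUTO":
            # Decode e.g. "Sj%C3%B6gren%27s%20syndrome" back to readable text.
            ungrounded.append(unquote(local))
        else:
            grounded.append(curie)
    return grounded, ungrounded


# The "diseases" slot from the first example output.
diseases = ["AUTO:Sj%C3%B6gren%27s%20syndrome", "MONDO:0021357"]
grounded, ungrounded = split_grounded(diseases)
print(grounded)
print(ungrounded)
```

Under the same assumption, values such as `AUTO:Not%20mentioned` decode to "Not mentioned" and would land in the ungrounded list, so they can be filtered out before loading results into a KB.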


cmungall added a commit that referenced this issue Jul 31, 2023

Adding generate-extract command, 158. Add cell type templates #159

3774dec


cmungall mentioned this issue Jul 31, 2023

Adding generate-extract command, 158. Add cell type templates #159 #162

Merged

cmungall closed this as completed in #162 Jul 31, 2023
