# Create an Ontology Index

CurateGPT depends on *indexes* for many operations. These are used to provide context for LLM queries.

Indexes may contain unstructured textual information, structured data (conforming to any LinkML schema you provide), or a mix.
A special kind of index is an ontology index. This assumes a very simple data model for ontologies (to be extended later), with the following fields:

- id
- label
- definition
- relationships
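
As a rough illustration, the four fields above could be modeled in Python as follows. This is a minimal sketch, not CurateGPT's actual classes: the names `OntologyTerm` and `Relationship` are hypothetical, and the `predicate`/`target` shape of a relationship is taken from the search results shown later in this notebook.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Relationship:
    """One edge from a term to another term (hypothetical class name)."""
    predicate: str  # e.g. "rdfs:subClassOf"
    target: str     # id of the related term, e.g. "PeptideTransport"


@dataclass
class OntologyTerm:
    """A single indexed ontology term (hypothetical class name)."""
    id: str
    label: str
    definition: Optional[str] = None  # some terms carry no definition
    relationships: List[Relationship] = field(default_factory=list)


# Example built from one of the GO search results shown further down.
term = OntologyTerm(
    id="BacteriocinTransport",
    label="bacteriocin transport",
    relationships=[
        Relationship("HasPrimaryInput", "Bacteriocin"),
        Relationship("rdfs:subClassOf", "PeptideTransport"),
    ],
)
print(term.label, [r.target for r in term.relationships])
```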

Behind the scenes, OAK (oaklib) provides access to a wide variety of ontologies so that they can be indexed. See the oaklib docs for
documentation on handles such as `sqlite:obo:go`.

Let's start by making an index of GO:

In [1]:
!curategpt ontology index -m openai:  -c terms_go sqlite:obo:go

This currently takes about 1 hr. If you use OpenAI to embed the terms, you will need an OpenAI API key. You can also
leave the `-m` option off, in which case the default chromadb embedding model is used.

Unless you specify `--path` (or `-p`), the index is stored in the `./db` folder. The `-c` option specifies the name of the collection.
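
If you want to script several indexing runs, a small helper can assemble the command line from the options described above. The following is a sketch only: `build_index_command` is a hypothetical helper, and the placement of `-p` before `-c` is an assumption rather than a documented requirement. It only builds and prints the command; it does not run it.

```python
import shlex
from typing import List, Optional


def build_index_command(
    handle: str,
    collection: str,
    model: Optional[str] = None,
    path: Optional[str] = None,
) -> List[str]:
    """Assemble (but do not run) a `curategpt ontology index` command."""
    cmd = ["curategpt", "ontology", "index"]
    if model is not None:
        cmd += ["-m", model]   # omit to use the default chromadb embeddings
    if path is not None:
        cmd += ["-p", path]    # omit to store the index under ./db
    cmd += ["-c", collection, handle]
    return cmd


cmd = build_index_command("sqlite:obo:go", "terms_go", model="openai:")
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` once you are happy with the printed command.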

In [2]:
!curategpt search -c terms_go "alginate transport"

## 1 DISTANCE: 0.14579163491725922
id: Alginate
label: alginate
relationships:
- predicate: rdfs:subClassOf
  target: CarbohydrateAcidAnion
- predicate: rdfs:subClassOf
  target: IonicPolymer

## 2 DISTANCE: 0.24588903784751892
id: SodiumAlginate
label: sodium alginate
relationships:
- predicate: HasPart
  target: Alginate
- predicate: HasRole
  target: HematologicAgent
- predicate: rdfs:subClassOf
  target: OrganicSodiumSalt
- predicate: rdfs:subClassOf
  target: CopolymerMacromolecule

## 3 DISTANCE: 0.3071279227733612
id: AlginicAcid
label: alginic acid
relationships:
- predicate: HasRole
  target: HematologicAgent
- predicate: rdfs:subClassOf
  target: Heteroglycan
- predicate: rdfs:subClassOf
  target: CopolymerMacromolecule
- predicate: rdfs:subClassOf
  target: Exopolysaccharide

## 4 DISTANCE: 0.33000582456588745
id: AlginicAcidAcetylation
label: alginic acid acetylation
definition: The addition of O-acetyl ester groups to alginic acid, a linear polymer
  of D-mannuronate and L

In [4]:
!curategpt search -c terms_go "alginate transport" --relevance-factor 0.4

## 1 DISTANCE: 0.14579163491725922
id: Alginate
label: alginate
relationships:
- predicate: rdfs:subClassOf
  target: CarbohydrateAcidAnion
- predicate: rdfs:subClassOf
  target: IonicPolymer

## 2 DISTANCE: 0.3051307201385498
id: BacteriocinTransport
label: bacteriocin transport
definition: The directed movement of a bacteriocin into, out of or within a cell,
  or between cells, by means of some agent such as a transporter or pore. Bacteriocins
  are a group of antibiotics produced by bacteria and are encoded by a group of naturally
  occurring plasmids, e.g. Col E1. Bacteriocins are toxic to bacteria closely related
  to the bacteriocin producing strain.
relationships:
- predicate: HasPrimaryInput
  target: Bacteriocin
- predicate: rdfs:subClassOf
  target: PeptideTransport

## 3 DISTANCE: 0.26726552844047546
id: AllantoinTransport
label: allantoin transport
definition: The directed movement of allantoin, (2,5-dioxo-4-imidazolidinyl)urea,
  into, out of or within a cell, or between c
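
Note that with `--relevance-factor` the distances are no longer in ascending order: rank 2 (0.305) is farther from the query than rank 3 (0.267). This is what you would expect from a re-ranking that trades relevance against diversity. One common technique for this is maximal marginal relevance (MMR); the sketch below assumes that approach and is not taken from CurateGPT's code. The function names and the toy two-dimensional vectors are made up for illustration.

```python
import math
from typing import Dict, List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def mmr_rank(
    query: Sequence[float],
    candidates: Dict[str, Sequence[float]],
    relevance_factor: float = 0.5,
    limit: int = 3,
) -> List[str]:
    """Greedy MMR: repeatedly pick the candidate that is close to the
    query but not too similar to anything already selected."""
    selected: List[str] = []
    remaining = dict(candidates)
    while remaining and len(selected) < limit:

        def score(item_id: str) -> float:
            relevance = cosine_similarity(query, remaining[item_id])
            # Similarity to the closest already-selected item (0 if none yet).
            redundancy = max(
                (cosine_similarity(remaining[item_id], candidates[s]) for s in selected),
                default=0.0,
            )
            return relevance_factor * relevance - (1 - relevance_factor) * redundancy

        best = max(remaining, key=score)
        selected.append(best)
        del remaining[best]
    return selected


# Toy vectors: term_a_variant is a near-duplicate of term_a.
query = [1.0, 0.0]
candidates = {
    "term_a": [1.0, 0.1],
    "term_a_variant": [1.0, 0.12],
    "term_b": [0.6, -0.8],
}
print(mmr_rank(query, candidates, relevance_factor=1.0, limit=2))
print(mmr_rank(query, candidates, relevance_factor=0.5, limit=2))
```

With a relevance factor of 1.0 the ranking is purely by similarity to the query, so the near-duplicate is returned second. With 0.5, the less similar but more distinct `term_b` takes its place.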

## Index the Cell Ontology

First, we will try creating an index with the default chromadb embeddings:

In [5]:
!curategpt index-ontology -c terms_cl sqlite:obo:cl

In [11]:
!curategpt search -c terms_cl "neocortical Martinotti cells"

## 1 DISTANCE: 0.529904842376709
id: MartinottiNeuron
label: Martinotti neuron
definition: An interneuron that has Martinotti morphology. These interneurons are
  scattered throughout various layers of the cerebral cortex, sending their axons
  up to the cortical layer I where they form axonal arborization.
relationships:
- predicate: HasCharacteristic
  target: MartinottiMorphology
- predicate: rdfs:subClassOf
  target: MultipolarNeuron
- predicate: rdfs:subClassOf
  target: GABAergicInterneuron

## 2 DISTANCE: 0.6669416427612305
id: TMartinottiNeuron
label: T Martinotti neuron
definition: A Martinotti neuron that has axons that form a horizontal ramification,
  making it T-shaped.
relationships:
- predicate: HasCharacteristic
  target: TMartinottiMorphology
- predicate: rdfs:subClassOf
  target: MartinottiNeuron

## 3 DISTANCE: 0.7783235311508179
id: FanMartinottiNeuron
label: fan Martinotti neuron
definition: A Martinotti neuron that has axons that form a fan-like plexus.
relationsh

We can see the first few results are largely relevant; however, the cell cortex doesn't have much to do with
the neocortex.

Let's try diversifying the results:

In [13]:
!curategpt search -c terms_cl "neocortical Martinotti cells" --relevance-factor 0.5

## 1 DISTANCE: 0.529904842376709
id: MartinottiNeuron
label: Martinotti neuron
definition: An interneuron that has Martinotti morphology. These interneurons are
  scattered throughout various layers of the cerebral cortex, sending their axons
  up to the cortical layer I where they form axonal arborization.
relationships:
- predicate: HasCharacteristic
  target: MartinottiMorphology
- predicate: rdfs:subClassOf
  target: MultipolarNeuron
- predicate: rdfs:subClassOf
  target: GABAergicInterneuron

## 2 DISTANCE: 1.077032446861267
id: NeoplasticCell
label: neoplastic cell
definition: An abnormal cell exhibiting dysregulation of cell proliferation or programmed
  cell death and capable of forming a neoplasm, an aggregate of cells in the form
  of a tumor mass or an excess number of abnormal cells (liquid tumor) within an organism.
relationships:
- predicate: HasCharacteristic
  target: Neoplastic
- predicate: rdfs:subClassOf
  target: AbnormalCell

## 3 DISTANCE: 1.0383251905441284
id: K

### Compare with OpenAI index



In [10]:
!curategpt ontology index -m openai: -c terms_cl_oai sqlite:obo:cl

In [14]:
!curategpt search -c terms_cl_oai "neocortical Martinotti cells"

## 1 DISTANCE: 0.22947755455970764
id: NeocortexBasketCell
label: neocortex basket cell
definition: Any basket cell that is part of a neocortex.
relationships:
- predicate: HasSomaLocation
  target: Neocortex
- predicate: rdfs:subClassOf
  target: BasketCell
- predicate: rdfs:subClassOf
  target: CerebralCortexGABAergicInterneuron

## 2 DISTANCE: 0.2639077305793762
id: CorticalGranuleCell
label: cortical granule cell
definition: Granule cell that is part of the cerebral cortex.
relationships:
- predicate: rdfs:subClassOf
  target: GranuleCell
- predicate: rdfs:subClassOf
  target: CerebralCortexNeuron
- predicate: rdfs:subClassOf
  target: NeuronOfTheForebrain

## 3 DISTANCE: 0.2880401611328125
id: KidneyCorticalCell
label: kidney cortical cell
relationships:
- predicate: PartOf
  target: CortexOfKidney
- predicate: rdfs:subClassOf
  target: KidneyCell

## 4 DISTANCE: 0.28898707032203674
id: Neocortex
label: neocortex
definition: 'An area of cerebral cortex defined on the basis of cyto

Surprisingly, none of the top hits are actual Martinotti cells.

In [15]:
!curategpt search -c terms_cl_oai "neocortical Martinotti neurons"

## 1 DISTANCE: 0.2589622437953949
id: NearProjectingGlutamatergicCorticalNeuron
label: near-projecting glutamatergic cortical neuron
definition: A glutamatergic neuron located in the cerebral cortex that projects axons
  locally rather than distantly.
relationships:
- predicate: HasCharacteristic
  target: NearProjecting
- predicate: rdfs:subClassOf
  target: GlutamatergicNeuron
- predicate: rdfs:subClassOf
  target: CerebralCortexNeuron

## 2 DISTANCE: 0.27403831481933594
id: CorticothalamicProjectingGlutamatergicCorticalNeuron
label: corticothalamic-projecting glutamatergic cortical neuron
definition: A glutamatergic neuron located in the cerebral cortex that projects to
  the thalamus.
relationships:
- predicate: HasCharacteristic
  target: CorticothalamicProjecting
- predicate: rdfs:subClassOf
  target: GlutamatergicNeuron
- predicate: rdfs:subClassOf
  target: CerebralCortexNeuron

## 3 DISTANCE: 0.27480828762054443
id: NeocortexBasketCell
label: neocortex basket cell
definition: 

In [16]:
!curategpt search -c terms_cl_oai "Martinotti neuron"

## 1 DISTANCE: 0.0
id: MartinottiNeuron
label: Martinotti neuron
definition: An interneuron that has Martinotti morphology. These interneurons are
  scattered throughout various layers of the cerebral cortex, sending their axons
  up to the cortical layer I where they form axonal arborization.
relationships:
- predicate: HasCharacteristic
  target: MartinottiMorphology
- predicate: rdfs:subClassOf
  target: MultipolarNeuron
- predicate: rdfs:subClassOf
  target: GABAergicInterneuron

## 2 DISTANCE: 0.0993199422955513
id: TMartinottiNeuron
label: T Martinotti neuron
definition: A Martinotti neuron that has axons that form a horizontal ramification,
  making it T-shaped.
relationships:
- predicate: HasCharacteristic
  target: TMartinottiMorphology
- predicate: rdfs:subClassOf
  target: MartinottiNeuron

## 3 DISTANCE: 0.16247744858264923
id: FanMartinottiNeuron
label: fan Martinotti neuron
definition: A Martinotti neuron that has axons that form a fan-like plexus.
relationships:
- predic

## ENVO

In [17]:
!curategpt ontology index -c terms_envo sqlite:obo:envo

In [19]:
!curategpt ontology index -m openai: -c terms_envo_oai sqlite:obo:envo

In [20]:
!curategpt search -c terms_envo "decidous forest"

## 1 DISTANCE: 0.6482061147689819
id: TropicalForest
label: tropical forest
definition: A forest ecosystem which is subject to tropical climate conditions.
relationships:
- predicate: rdfs:subClassOf
  target: ForestEcosystem
- predicate: rdfs:subClassOf
  target: TropicalEnvironment

## 2 DISTANCE: 0.6844499111175537
id: ForestFloor
label: forest floor
definition: Land which is present within a forest biome.
relationships:
- predicate: rdfs:subClassOf
  target: Land

## 3 DISTANCE: 0.6939178109169006
id: PlantedForest
label: planted forest
definition: A forest that has been intentionally established by human intervention.
relationships:
- predicate: rdfs:subClassOf
  target: ForestEcosystem

## 4 DISTANCE: 0.7068362832069397
id: MontaneForest
label: montane forest
relationships:
- predicate: rdfs:subClassOf
  target: ForestEcosystem

## 5 DISTANCE: 0.7241395115852356
id: BroadleafForest
label: broadleaf forest
definition: A forest biome which contains densely packed populations or com

In [21]:
!curategpt search -c terms_envo_oai "decidous forest"

## 1 DISTANCE: 0.14267714321613312
id: AreaOfDeciduousForest
label: area of deciduous forest
definition: An area of a planet's surface which is primarily covered by a forest in
  which the majority of trees shed foliage simultaneously in response to seasonal
  change. The surfaces of this area (including the surface of the forest canopy) are
  in contact with an atmospheric column extending from the planetary boundary layer
  to the planet's exosphere with little to no physical obstruction.
relationships:
- predicate: AdjacentTo
  target: AtmosphericBoundaryLayer
- predicate: rdfs:subClassOf
  target: TerrestrialEnvironmentalZone

## 2 DISTANCE: 0.1432930827140808
id: Deciduous_plant_
label: deciduous (plant)
definition: A quality inhering in a plant by virtue of the bearer's disposition to
  shed foliage.
relationships:
- predicate: rdfs:subClassOf
  target: Shedability

## 3 DISTANCE: 0.16446948051452637
id: TropicalDeciduousBroadleafForest
label: tropical deciduous broadleaf forest
