# How to index the caDSR metadata element registry with LinkML-Store





In [1]:
import os
import json

In [2]:



# Walk the directory tree and collect the DataElement payload of every JSON file.
# Note: this path is a symlink to the actual data.
path = "cadsr/cde-json"
objs = []
for root, _dirs, files in os.walk(path):
    for file in files:
        if file.endswith(".json"):
            with open(os.path.join(root, file)) as stream:
                obj = json.load(stream)
                objs.append(obj["DataElement"])


In [3]:
len(objs)

74229

In [4]:
import yaml
print(yaml.dump(objs[1]))

AlternateNames: []
ClassificationSchemes: []
DataElementConcept:
  ConceptualDomain:
    administrativeNotes: null
    beginDate: '2006-09-28'
    changeDescription: null
    context: CCR
    contextVersion: '1'
    createdBy: REEVESD
    dateCreated: '2006-09-28'
    dateModified: '2008-11-19'
    deletedIndicator: 'No'
    endDate: null
    id: 1E838B40-6636-0A25-E044-0003BA3F9857
    latestVersionIndicator: 'Yes'
    longName: MEASURE/INSTRUMENT TESTING
    modifiedBy: REEVESD
    origin: CCR:Center for Cancer Research
    preferredDefinition: Process and results associated with self-reported measures
      and instruments, surveys, other tools
    preferredName: Person Measure/Instrument Testing
    publicId: '2524082'
    registrationStatus: Application
    unresolvedIssues: null
    version: '1'
    workflowStatus: RELEASED
  ObjectClass:
    Concepts:
    - conceptCode: C15747
      definition: Supportive care is that which helps the patient and their family
        to cope with

## Creating a client and attaching to a database

First we will create a client as normal:

In [5]:
from linkml_store import Client

client = Client()

Next we'll attach to a MongoDB instance. This assumes you already have one running locally on the default port (27017).

In [6]:
db = client.attach_database("mongodb://localhost:27017/cadsr", "cadsr", recreate_if_exists=True)
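
Because attaching assumes a running server, it can help to check that something is listening before proceeding. The following is a minimal sketch using only the standard library; `is_port_open` is a hypothetical helper, not part of linkml-store, and a successful TCP connection only shows that some process is listening on the port, not that it is MongoDB.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


# 27017 is the port used in the connection URL of this notebook.
if not is_port_open("localhost", 27017):
    print("Nothing is listening on localhost:27017; start MongoDB before attaching.")
```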

## Creating a collection

We'll create a simple test collection. The concept of a collection in linkml-store maps directly to MongoDB collections.

In [7]:
collection = db.create_collection("cdes", recreate_if_exists=True)

## Loading

In [8]:
collection.insert(objs)

In [9]:
collection.commit()
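
The notebook inserts all objects in a single call, which worked here. If a single large insert ever proves too heavy, one option is to insert in batches. The sketch below is an assumption rather than a documented linkml-store pattern: `chunked` is a hypothetical helper, the batch size is arbitrary, and the commented usage assumes that repeated `insert` calls append to the collection.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def chunked(items: Iterable[dict], size: int) -> Iterator[List[dict]]:
    """Yield successive lists of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


# Hypothetical usage with the objects loaded earlier (batch size is arbitrary):
# for batch in chunked(objs, 5000):
#     collection.insert(batch)
# collection.commit()
```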

In [10]:
collection.find({}, limit=5).num_rows

74229

Note that `num_rows` reports the total number of matching objects (all 74229), not the number of rows returned under `limit`.

Let's check with pandas just to make sure it looks as expected:

In [11]:
qr = collection.find({}, limit=3)
qr.rows_dataframe

| | publicId | version | preferredName | preferredDefinition | longName | context | contextVersion | DataElementConcept | ValueDomain | ClassificationSchemes | ... | beginDate | endDate | createdBy | dateCreated | modifiedBy | dateModified | changeDescription | administrativeNotes | unresolvedIssues | deletedIndicator |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2869761 | 1 | Clinical Performed Observation Outcome Referen... | A coded value specifying the relationship of a... | 2868088v1.0:2803170v1.0 | NCIP | 1 | {'publicId': '2868088', 'version': '1', 'prefe... | {'publicId': '2803170', 'version': '1', 'prefe... | [{'publicId': '2714898', 'version': '3.02', 'l... | ... | 2009-05-06 | | UMLLOADER_BRIDGPRD | 2009-05-06 | SBREXT | 2019-10-03 | Moved UML alt def to CDE preferred def and rel... | | | No |
| 1 | 7571389 | 1 | Supportive Care When I Hear the Term Palliativ... | A person's agreement with a statement related ... | 7571388v1.0:3682709v2.0 | NHLBI | 1 | {'publicId': '7571388', 'version': '1', 'prefe... | {'publicId': '3682709', 'version': '2', 'prefe... | [] | ... | 2021-01-29 | | MALUMK | 2021-01-29 | MALUMK | 2021-03-18 | Released. 03/18/2021 KMM; System generated def... | | | No |
| 2 | 2773112 | 1 | Antibody Antibody Isotype java.lang.Boolean | An antibody is a type of protein made by B lym... | 2753919v1.0:2178538v1.0 | NCIP | 1 | {'publicId': '2753919', 'version': '1', 'prefe... | {'publicId': '2178538', 'version': '1', 'prefe... | [{'publicId': '2772168', 'version': '1.1', 'lo... | ... | 2008-08-12 | | UMLLOADER_CALIMS | 2008-08-12 | CHILLIJ | 2010-06-12 | 8/12/2010 released at request of caLIMS2 model... | | | No |


In [12]:
qr.rows[1]

{'publicId': '7571389',
 'version': '1',
 'preferredName': 'Supportive Care When I Hear the Term Palliative Care, I Feel Fear Agreement 5 Point Likert Scale',
 'preferredDefinition': "A person's agreement with a statement related to feeling fear when hearing the term palliative care using a five-point Likert scale.",
 'longName': '7571388v1.0:3682709v2.0',
 'context': 'NHLBI',
 'contextVersion': '1',
 'DataElementConcept': {'publicId': '7571388',
  'version': '1',
  'preferredName': 'Supportive Care When I Hear The Term Palliative Care, I Feel Fear Agreement Scale',
  'preferredDefinition': 'Supportive care is that which helps the patient and their family to cope with cancer and treatment of it from pre-diagnosis, through the process of diagnosis and treatment, to cure, continuing illness or death and into bereavement. It helps the patient to maximize the benefits of treatment and to live as well as possible with the effects of the disease. Supportive therapy may provide a patient with

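We can also compute facet counts to see how the CDEs are distributed across the values of selected fields:

In [None]: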
collection.query_facets(facet_columns=["context", "origin"])
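
No output is shown for this cell. Conceptually, a facet query tallies how many objects carry each value in a given column. The sketch below illustrates that idea in plain Python on the three `context` values visible in the dataframe preview; `facet_counts` is a hypothetical helper and does not reflect how linkml-store or MongoDB compute facets internally.

```python
from collections import Counter
from typing import Dict, Iterable, List


def facet_counts(rows: Iterable[dict], facet_columns: List[str]) -> Dict[str, Counter]:
    """Tally how often each value occurs in each requested column."""
    counts = {column: Counter() for column in facet_columns}
    for row in rows:
        for column in facet_columns:
            # Rows lacking the column are tallied under None.
            counts[column][row.get(column)] += 1
    return counts


# The `context` values of the three rows previewed in the dataframe above.
sample_rows = [
    {"publicId": "2869761", "context": "NCIP"},
    {"publicId": "7571389", "context": "NHLBI"},
    {"publicId": "2773112", "context": "NCIP"},
]
print(facet_counts(sample_rows, ["context"])["context"].most_common())
# [('NCIP', 2), ('NHLBI', 1)]
```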

## Semantic Search

We will index the CDEs using a template that extracts the preferred name and preferred definition of each data element.

In [None]:
template = """name: {{preferredName}}
def: {{preferredDefinition}}"""

In [None]:
from linkml_store.index.implementations.llm_indexer import LLMIndexer

index = LLMIndexer(
    name="cde", 
    cached_embeddings_database="tmp/llm_cde_cache.db",
    text_template=template,
    text_template_syntax="jinja2",
)

In [None]:
print(index.object_to_text(qr.rows[0]))
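
The output of this cell is not shown. To give a sense of the text that gets embedded, here is a rough standard-library approximation of the template substitution. It is a stand-in for Jinja2, not the indexer's own rendering: `render_simple` is a hypothetical helper that handles only plain top-level `{{field}}` placeholders, and the field values are copied from the second row displayed earlier.

```python
import re

template = """name: {{preferredName}}
def: {{preferredDefinition}}"""


def render_simple(template_text: str, obj: dict) -> str:
    """Replace each {{field}} placeholder with the matching value from obj.

    Handles only plain top-level fields; missing fields render as empty text.
    """
    def substitute(match):
        # group(1) is the field name between the double braces.
        return str(obj.get(match.group(1), ""))

    return re.sub(r"\{\{\s*(\w+)\s*\}\}", substitute, template_text)


# Field values copied from the second row shown earlier (qr.rows[1]).
row = {
    "preferredName": "Supportive Care When I Hear the Term Palliative Care, I Feel Fear Agreement 5 Point Likert Scale",
    "preferredDefinition": "A person's agreement with a statement related to feeling fear when hearing the term palliative care using a five-point Likert scale.",
}
print(render_simple(template, row))
```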

In [None]:
collection.attach_indexer(index, auto_index=True)

## Queries

We can now run a free-text semantic search against the indexed CDEs:

In [None]:
qr = collection.search("variables relevant for long COVID")
qr.rows_dataframe[0:10]

In [None]:
qr.ranked_rows[0]
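
Embedding-based search generally works by embedding the query text and ranking stored objects by vector similarity. The sketch below illustrates that ranking step with cosine similarity on invented three-dimensional vectors. The keys and vectors are hypothetical, real embeddings have many more dimensions, and the similarity measure linkml-store actually uses is assumed here rather than confirmed.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank(query: Sequence[float], embeddings: Dict[str, Sequence[float]]) -> List[Tuple[float, str]]:
    """Return (score, key) pairs sorted from most to least similar."""
    scored = [(cosine_similarity(query, vec), key) for key, vec in embeddings.items()]
    return sorted(scored, reverse=True)


# Toy 3-dimensional vectors, invented for illustration only.
toy_embeddings = {
    "cde_a": [1.0, 0.0, 0.0],
    "cde_b": [0.0, 1.0, 0.0],
    "cde_c": [1.0, 1.0, 0.0],
}
for score, key in rank([1.0, 0.1, 0.0], toy_embeddings):
    print(f"{score:.3f} {key}")
```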

## Validation

__TODO__    