## Handling (single or relational) quantitative queries on DANDI Search

#### Example

Input: `Are there dandisets that do not contain 2 species but do contain 3 or more measurements?`

Output:
```
DANDI:000251/draft
- Species (1): Mus musculus - House mouse
- Approaches (2): behavioral approach, electrophysiological approach
- Measurement Techniques (3): analytical technique, behavioral technique, spike sorting technique
- Variables Measured (3): SpatialSeries, ProcessingModule, Units
DANDI:000491/0.230602.1307
- Species (1): Mus musculus - House mouse
- Approaches (1): microscopy approach; cell population imaging
- Measurement Techniques (3): two-photon microscopy technique, analytical technique, surgical technique
- Variables Measured (5): TwoPhotonSeries, ImagingPlane, OpticalChannel, ProcessingModule, PlaneSegmentation
DANDI:000458/0.230317.0039
- Species (1): Mus musculus - House mouse
- Approaches (2): electrophysiological approach, behavioral approach
- Measurement Techniques (6): surgical technique, spike sorting technique, behavioral technique, analytical technique, signal filtering technique, multi electrode extracellular electrophysiology recording technique
- Variables Measured (6): BehavioralTimeSeries, Units, ElectricalSeries, ElectrodeGroup, LFP, ProcessingModule
```
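The filtering implied by this example can be checked by hand. The sketch below is only an illustration in plain Python, not the Qdrant-based implementation used in the cells that follow. It applies the two constraints from the input query (species count `!= 2`, measurement technique count `>= 3`) to the counts shown in the output above. The `matches` helper and the `DANDI:hypothetical` record are assumptions introduced here to show a rejected case.

```python
import operator

# Comparison operators the extraction step is expected to produce.
OPS = {
    "==": operator.eq, "!=": operator.ne,
    ">=": operator.ge, ">": operator.gt,
    "<=": operator.le, "<": operator.lt,
}

# Constraints implied by the example query: (payload field, operator, value).
constraints = [
    ("number_of_species", "!=", 2),
    ("number_of_measurement_techniques", ">=", 3),
]

def matches(payload: dict, constraints: list) -> bool:
    """Return True when the payload satisfies every constraint."""
    return all(OPS[op](payload[field], value) for field, op, value in constraints)

# Counts copied from the example output above, plus one made-up record.
payloads = {
    "DANDI:000251": {"number_of_species": 1, "number_of_measurement_techniques": 3},
    "DANDI:000491": {"number_of_species": 1, "number_of_measurement_techniques": 3},
    "DANDI:000458": {"number_of_species": 1, "number_of_measurement_techniques": 6},
    # Hypothetical record: rejected because it has exactly 2 species.
    "DANDI:hypothetical": {"number_of_species": 2, "number_of_measurement_techniques": 4},
}

selected = [name for name, payload in payloads.items() if matches(payload, constraints)]
print(selected)
```

All three dandisets from the example pass both constraints, while the hypothetical record is dropped by the `!=` constraint even though it satisfies the `>=` one.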

In [4]:
from qdrant_client import QdrantClient as Qdrant
from qdrant_client.http import models
import os

# ---------------------

QUERY = "Are there dandisets that do not contain 2 species but do contain 3 or more measurements?"

# ---------------------


In [5]:
from langchain.llms.openai import OpenAIChat
import json

SPECIES = "species"
SCIENTIFIC_APPROACHES = "scientific_approaches"
VARIABLES_MEASURED = "variables_measured"
MEASUREMENT_TECHNIQUES = "measurement_techniques"

template = """
OBJECTIVE: 
Extract a numerical value and its corresponding asset type from a user query. 
The asset type must be classified to one of the following: {}, {}, {}, {}
If multiple numerical value and asset type pairs are identified, generate a list of triplets, each containing the numerical value, asset type, and comparison operator. 
The comparison operator (>=, >, <, <=, ==, !=) should best represent the number of assets to be retrieved. Remember to account for negation based on the query wording.
If either the numerical value or asset type is missing, do not create a triplet for that instance.

DESIRED OUTPUT FORMAT (JSON list of objects, one object per triplet):
- "number_of_assets": [int - numerical value],
- "asset_type": [string - type of asset],
- "comparison_op": [string - type of comparison operator]

Note: Use exactly the keys shown above, and only the asset types and comparison operators listed.

USER QUERY:
{}
""".format(SPECIES, SCIENTIFIC_APPROACHES, VARIABLES_MEASURED, MEASUREMENT_TECHNIQUES, QUERY)

llm = OpenAIChat(model_name="gpt-3.5-turbo", max_tokens=100, temperature=0)
response = llm(template)
try:
    response_json = json.loads(response)
    print(response_json)
except json.JSONDecodeError:
    # Fall back to an empty list so the next cell can exit cleanly.
    response_json = []
    print("Failed to parse JSON. Output:", response)



[{'number_of_assets': 2, 'asset_type': 'species', 'comparison_op': '!='}, {'number_of_assets': 3, 'asset_type': 'measurement_techniques', 'comparison_op': '>='}]
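The list above is shown with single quotes because it is the `print` of the already parsed Python list; the raw model response itself has to be valid JSON for `json.loads` to succeed. A slightly more defensive parser is sketched below, under the assumption that the model may occasionally return a Python literal or malformed entries. `parse_triplets` and its validation rules are hypothetical helpers, not part of the notebook; the allowed asset types are the four named in the prompt.

```python
import ast
import json

ALLOWED_OPS = {">=", ">", "<", "<=", "==", "!="}
ALLOWED_TYPES = {"species", "scientific_approaches",
                 "variables_measured", "measurement_techniques"}

def parse_triplets(response: str) -> list:
    """Parse the model response and keep only well-formed triplets."""
    try:
        data = json.loads(response)
    except json.JSONDecodeError:
        try:
            # Fallback: a Python literal such as [{'a': 1}] (single quotes).
            data = ast.literal_eval(response)
        except (ValueError, SyntaxError, TypeError):
            return []
    if isinstance(data, dict):
        data = [data]  # tolerate a single triplet returned without a list
    if not isinstance(data, list):
        return []

    triplets = []
    for item in data:
        if not isinstance(item, dict):
            continue
        number = item.get("number_of_assets")
        asset_type = item.get("asset_type")
        op = str(item.get("comparison_op", "")).strip()
        # bool is a subclass of int, so exclude it explicitly.
        if not isinstance(number, int) or isinstance(number, bool):
            continue
        if not isinstance(asset_type, str) or asset_type not in ALLOWED_TYPES:
            continue
        if op not in ALLOWED_OPS:
            continue
        triplets.append(
            {"number_of_assets": number, "asset_type": asset_type, "comparison_op": op}
        )
    return triplets

# The printed output above, fed back in as a Python-literal string.
raw = ("[{'number_of_assets': 2, 'asset_type': 'species', 'comparison_op': '!='}, "
       "{'number_of_assets': 3, 'asset_type': 'measurement_techniques', 'comparison_op': '>='}]")
print(parse_triplets(raw))
```

Validating the operator and asset type at this stage means later cells only ever see triplets they know how to translate into filter conditions.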


In [6]:
import sys

if not response_json:
    print("No quantitative extractions found.")
    sys.exit(0)

# Qdrant client (the API key is read from the environment)
qdrant_client = Qdrant(
    location="https://906c3b3f-d3ff-4497-905f-2d7089487cf9.us-east4-0.gcp.cloud.qdrant.io",
    port=6333,
    api_key=os.environ.get("QDRANT_API_KEY"),
)

# Map an asset type to the payload field that stores its count.
# Returns None for an unknown asset type.
def get_filter_key(asset_type: str):
    return {
        SPECIES: "number_of_species",
        SCIENTIFIC_APPROACHES: "number_of_approaches",
        VARIABLES_MEASURED: "number_of_variables_measured",
        MEASUREMENT_TECHNIQUES: "number_of_measurement_techniques",
    }.get(asset_type)

from typing import Optional, Tuple

# Translate a comparison operator into a Qdrant condition.
# "!=" reuses the equality match and returns is_must=False, so the caller
# places the condition under `must_not` instead of `must`.
def get_condition(key: str, comparison_op: str, num_asset: int) -> Tuple[Optional[models.FieldCondition], bool]:
    is_must = True
    if comparison_op == "==":
        match = models.MatchValue(value=num_asset)
        condition = models.FieldCondition(key=key, match=match)
    elif comparison_op == "!=":
        match = models.MatchValue(value=num_asset)
        condition = models.FieldCondition(key=key, match=match)
        is_must = False
    elif comparison_op == ">=":
        value_range = models.Range(gte=float(num_asset))
        condition = models.FieldCondition(key=key, range=value_range)
    elif comparison_op == ">":
        value_range = models.Range(gt=float(num_asset))
        condition = models.FieldCondition(key=key, range=value_range)
    elif comparison_op == "<=":
        value_range = models.Range(lte=float(num_asset))
        condition = models.FieldCondition(key=key, range=value_range)
    elif comparison_op == "<":
        value_range = models.Range(lt=float(num_asset))
        condition = models.FieldCondition(key=key, range=value_range)
    else:
        # Unknown operator: the caller should skip this triplet.
        condition = None
    return condition, is_must

# get matches
must_conditions = []
must_not_conditions = []
for i, extraction in enumerate(response_json):
    asset_type = extraction.get("asset_type", None)
    number_of_assets = extraction.get("number_of_assets", None)
    comparison_op = extraction.get("comparison_op", None)
    # Compare against None explicitly so that a count of 0 is not discarded.
    if not asset_type or number_of_assets is None or not comparison_op:
        print(f"Triplet #{i} skipped (due to None value).")
        continue

    key = get_filter_key(asset_type=asset_type)
    if key is None:
        print(f"Triplet #{i} skipped (unknown asset type).")
        continue

    condition, is_must = get_condition(key=key, comparison_op=comparison_op.strip(), num_asset=number_of_assets)
    if condition is None:
        print(f"Triplet #{i} skipped (unknown comparison operator).")
        continue

    if is_must:
        must_conditions.append(condition)
    else:
        must_not_conditions.append(condition)

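# Qdrant filter (named query_filter to avoid shadowing the built-in `filter`)
query_filter = models.Filter(
    must=must_conditions if must_conditions else None,
    must_not=must_not_conditions if must_not_conditions else None,
)

# List up to 10 points that satisfy the filter (no vector similarity involved);
# scroll returns a (records, next_page_offset) tuple.
docs, _next_offset = qdrant_client.scroll(
    collection_name="dandi_collection_ada002",
    scroll_filter=query_filter,
    limit=10,
    with_vectors=False,
    with_payload=True,
)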
print("QUERY:", QUERY)
print("-----")
if not docs:
    print("No relevant dandisets found.")
else:
    for doc in docs:
        pl = doc.payload
        print(f"DANDI:{pl['dandiset_id']}/{pl['dandiset_version']}")
        print(f"- Species ({pl['number_of_species']}):", ", ".join(pl["species"]))
        print(f"- Approaches ({pl['number_of_approaches']}):", ", ".join(pl["approaches"]))
        print(f"- Measurement Techniques ({pl['number_of_measurement_techniques']}):", ", ".join(pl["measurement_techniques"]))
        print(f"- Variables Measured ({pl['number_of_variables_measured']}):", ", ".join(pl["variables_measured"]))

QUERY: Are there dandisets that do not contain 2 species but do contain 3 or more measurements?
-----
DANDI:000251/draft
- Species (1): Mus musculus - House mouse
- Approaches (2): behavioral approach, electrophysiological approach
- Measurement Techniques (3): analytical technique, behavioral technique, spike sorting technique
- Variables Measured (3): SpatialSeries, ProcessingModule, Units
DANDI:000491/0.230602.1307
- Species (1): Mus musculus - House mouse
- Approaches (1): microscopy approach; cell population imaging
- Measurement Techniques (3): two-photon microscopy technique, analytical technique, surgical technique
- Variables Measured (5): TwoPhotonSeries, ImagingPlane, OpticalChannel, ProcessingModule, PlaneSegmentation
DANDI:000458/0.230317.0039
- Species (1): Mus musculus - House mouse
- Approaches (2): electrophysiological approach, behavioral approach
- Measurement Techniques (6): surgical technique, spike sorting technique, behavioral technique, analytical technique, sig