## Handling quantitative queries (single or relational) in DANDI Search

#### Example

Input: `Are there dandisets that do not contain 2 species but do contain 3 or more measurements?`

Output:
```
DANDI:000251/draft
- Species (1): Mus musculus - House mouse
- Approaches (2): behavioral approach, electrophysiological approach
- Measurement Techniques (3): analytical technique, behavioral technique, spike sorting technique
- Variables Measured (3): SpatialSeries, ProcessingModule, Units
DANDI:000491/0.230602.1307
- Species (1): Mus musculus - House mouse
- Approaches (1): microscopy approach; cell population imaging
- Measurement Techniques (3): two-photon microscopy technique, analytical technique, surgical technique
- Variables Measured (5): TwoPhotonSeries, ImagingPlane, OpticalChannel, ProcessingModule, PlaneSegmentation
DANDI:000458/0.230317.0039
- Species (1): Mus musculus - House mouse
- Approaches (2): electrophysiological approach, behavioral approach
- Measurement Techniques (6): surgical technique, spike sorting technique, behavioral technique, analytical technique, signal filtering technique, multi electrode extracellular electrophysiology recording technique
- Variables Measured (6): BehavioralTimeSeries, Units, ElectricalSeries, ElectrodeGroup, LFP, ProcessingModule
```
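The example query can be read as two (asset type, operator, number) triplets: `species != 2` and `measurement_techniques >= 3`. The sketch below illustrates that reading in plain Python, without Qdrant. It assumes payload keys of the form `number_of_<asset_type>`, as used later in this notebook, and that "measurements" is classified as `measurement_techniques`. The helper name `matches` and the second sample payload are hypothetical.

```python
import operator

# Map the comparison operators used in the triplets to Python functions.
OPS = {
    "==": operator.eq,
    "!=": operator.ne,
    ">=": operator.ge,
    ">": operator.gt,
    "<=": operator.le,
    "<": operator.lt,
}


def matches(payload: dict, triplets: list) -> bool:
    """Return True if the payload satisfies every triplet (hypothetical helper)."""
    for t in triplets:
        value = payload.get(f"number_of_{t['asset_type']}")
        # Choice made for this sketch: a missing count fails the condition.
        if value is None:
            return False
        if not OPS[t["comparison_op"]](value, t["number_of_assets"]):
            return False
    return True


# Triplets for: "do not contain 2 species but do contain 3 or more measurements"
triplets = [
    {"number_of_assets": 2, "asset_type": "species", "comparison_op": "!="},
    {"number_of_assets": 3, "asset_type": "measurement_techniques", "comparison_op": ">="},
]

# Counts copied from the first result above (DANDI:000251/draft).
hit = {"number_of_species": 1, "number_of_measurement_techniques": 3}
# Made-up payload that violates the species condition.
miss = {"number_of_species": 2, "number_of_measurement_techniques": 4}

print(matches(hit, triplets), matches(miss, triplets))
```

The cells that follow express the same conditions as a Qdrant filter, so the database does the comparison instead of Python.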

In [68]:
from qdrant_client import QdrantClient as Qdrant
from qdrant_client import models
import os

# ---------------------

QUERY = "Show me datasets that have at least 100 files and 2 or more animals."

# ---------------------


In [69]:
from langchain.llms.openai import OpenAIChat
import json

supported_quantity_values = ["species", "approaches", "variables_measured", "measurement_techniques", "subjects", "bytes", "files", "cells", "samples"]

template = f"""
OBJECTIVE: 
Extract a numerical value and its corresponding asset type from a user query. 
Classify the asset type into one of the following categories (asset_type values): {", ".join(supported_quantity_values)}
If multiple numerical value and asset type pairs are identified, generate triplets with the corresponding numerical value, asset type, and a comparison operator. 
Use operators (>=, >, <, <=, ==, !=) based on the query to represent the exact filtered number of a specific asset to be retrieved. Account for negation in the query wording.
Do not create a triplet if either the numerical value or asset type is missing.
If no triplets exist, then return [].
IMPORTANT: Use the specified dictionary keys and values exactly as mentioned above and below.

OUTPUT FORMAT (JSON - list of triplets):
[
    {{
        "number_of_assets": [integer],
        "asset_type": [string],
        "comparison_op": [string]
    }},
    {{ ... }}, ...
]

USER QUERY:
{QUERY}
"""

# llm_lingua = PromptCompressor()

llm = OpenAIChat(model_name="gpt-3.5-turbo", max_tokens=100, temperature=0)
response = llm(template)
try:
    response_json = json.loads(response)
    print(response_json)
except json.JSONDecodeError:
    # Fall back to an empty list so that later cells can still check `response_json`.
    response_json = []
    print("Failed to parse JSON. Output:", response)



[{'number_of_assets': 100, 'asset_type': 'files', 'comparison_op': '>='}, {'number_of_assets': 2, 'asset_type': 'species', 'comparison_op': '>='}]
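Because the triplets come from an LLM, it can help to validate them before building a filter. The following is a minimal sketch of such a check, assuming the reply is a JSON list in the format requested by the prompt. `parse_triplets`, `SUPPORTED_ASSET_TYPES`, and `SUPPORTED_OPS` are hypothetical names introduced here for illustration.

```python
import json

# Same categories as `supported_quantity_values` in the prompt above.
SUPPORTED_ASSET_TYPES = {
    "species", "approaches", "variables_measured", "measurement_techniques",
    "subjects", "bytes", "files", "cells", "samples",
}
SUPPORTED_OPS = {"==", "!=", ">=", ">", "<=", "<"}


def parse_triplets(raw: str) -> list:
    """Parse the model's reply and keep only well-formed triplets (hypothetical helper)."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return []
    if not isinstance(data, list):
        return []

    valid = []
    for item in data:
        if not isinstance(item, dict):
            continue
        number = item.get("number_of_assets")
        asset_type = str(item.get("asset_type", "")).strip()
        op = str(item.get("comparison_op", "")).strip()
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(number, bool) or not isinstance(number, int):
            continue
        if asset_type not in SUPPORTED_ASSET_TYPES or op not in SUPPORTED_OPS:
            continue
        valid.append(
            {"number_of_assets": number, "asset_type": asset_type, "comparison_op": op}
        )
    return valid


# Illustrative reply: one valid triplet, one unknown asset type, one zero count.
raw = (
    '[{"number_of_assets": 100, "asset_type": "files", "comparison_op": ">="},'
    ' {"number_of_assets": 2, "asset_type": "animals", "comparison_op": ">="},'
    ' {"number_of_assets": 0, "asset_type": "cells", "comparison_op": "=="}]'
)
triplets = parse_triplets(raw)
print(triplets)
```

Checking the type of `number_of_assets` rather than its truthiness keeps a legitimate count of 0, which a plain `if not number` test would drop.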


In [70]:
if not response_json:
    # Nothing to filter on, so stop here.
    raise SystemExit("No quantitative extractions found.")

# qdrant client (the port must be an integer, not a string)
qdrant_client = Qdrant(
    url="https://906c3b3f-d3ff-4497-905f-2d7089487cf9.us-east4-0.gcp.cloud.qdrant.io",
    port=6333,
    api_key=os.environ.get("QDRANT_API_KEY"),
)

from typing import Optional, Tuple

# Translate one comparison operator into a Qdrant field condition.
# Returns (condition, is_must). "!=" is expressed as an equality match
# that the caller places under `must_not`.
def get_condition(key: str, comparison_op: str, num_asset: int) -> Tuple[Optional[models.FieldCondition], bool]:
    is_must = True
    if comparison_op in ("==", "!="):
        condition = models.FieldCondition(key=key, match=models.MatchValue(value=num_asset))
        is_must = comparison_op == "=="
    elif comparison_op == ">=":
        condition = models.FieldCondition(key=key, range=models.Range(gte=float(num_asset)))
    elif comparison_op == ">":
        condition = models.FieldCondition(key=key, range=models.Range(gt=float(num_asset)))
    elif comparison_op == "<=":
        condition = models.FieldCondition(key=key, range=models.Range(lte=float(num_asset)))
    elif comparison_op == "<":
        condition = models.FieldCondition(key=key, range=models.Range(lt=float(num_asset)))
    else:
        # Unknown operator: the caller should skip this triplet.
        condition = None
    return condition, is_must

# build the filter conditions from the extracted triplets
must_conditions = []
must_not_conditions = []
for i, extraction in enumerate(response_json):
    asset_type = extraction.get("asset_type", None)
    number_of_assets = extraction.get("number_of_assets", None)
    comparison_op = extraction.get("comparison_op", None)
    # Compare against None explicitly so that a count of 0 is not discarded.
    if not asset_type or number_of_assets is None or not comparison_op:
        print(f"Triplet #{i} skipped (due to None value).")
        continue
    asset_type = asset_type.strip()
    if asset_type not in supported_quantity_values:
        print(f"Triplet #{i} has invalid asset type selected (LLM error).")
        continue
    key = f"number_of_{asset_type}"
    condition, is_must = get_condition(key=key, comparison_op=comparison_op.strip(), num_asset=number_of_assets)
    if condition is None:
        # An unsupported operator must not end up in the filter as None.
        print(f"Triplet #{i} has an unsupported comparison operator (LLM error).")
        continue
    if is_must:
        must_conditions.append(condition)
    else:
        must_not_conditions.append(condition)

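# qdrant filter (named `scroll_filter` to avoid shadowing the built-in `filter`)
scroll_filter = models.Filter(
    must=must_conditions if must_conditions else None,
    must_not=must_not_conditions if must_not_conditions else None,
)

# Retrieve up to 10 points that satisfy the filter. `scroll` applies the filter only;
# no vector similarity is involved. It returns (points, next_page_offset).
docs, _next_offset = qdrant_client.scroll(
    "dandi_collection_ada002",
    scroll_filter=scroll_filter,
    limit=10,
    with_vectors=False,
    with_payload=True,
)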
print("QUERY:", QUERY)
print("-----")
if not docs:
    print("No relevant dandisets found.")
else:
    for doc in docs:
        pl = doc.payload
        print(f"DANDI:{pl['dandiset_id']}/{pl['dandiset_version']}")
        print(f"- Species ({pl['number_of_species']}):", ", ".join(pl["species"]))
        print(f"- Approaches ({pl['number_of_approaches']}):", ", ".join(pl["approaches"]))
        print(f"- Measurement Techniques ({pl['number_of_measurement_techniques']}):", ", ".join(pl["measurement_techniques"]))
        print(f"- Variables Measured ({pl['number_of_variables_measured']}):", ", ".join(pl["variables_measured"]))
        print(f"- Bytes: {pl['number_of_bytes']}")
        print(f"- Files: {pl['number_of_files']}")
        print(f"- Subjects: {pl['number_of_subjects']}")
        print(f"- Cells: {pl['number_of_cells']}")
        print(f"- Samples: {pl['number_of_samples']}")

QUERY: Show me datasets that have at least 100 files and 2 or more animals.
-----
DANDI:000636/draft
- Species (2): Human, Homo sapiens - Human
- Approaches (1): electrophysiological approach
- Measurement Techniques (3): analytical technique, current clamp technique, voltage clamp technique
- Variables Measured (5): ProcessingModule, VoltageClampSeries, CurrentClampStimulusSeries, VoltageClampStimulusSeries, CurrentClampSeries
- Bytes: 24473831989
- Files: 706
- Subjects: 109
- Cells: 616
- Samples: None
DANDI:000341/draft
- Species (3): Rattus norvegicus - Norway rat, Homo sapiens - Human, Mus musculus - House mouse
- Approaches (1): electrophysiological approach
- Measurement Techniques (1): current clamp technique
- Variables Measured (2): CurrentClampStimulusSeries, CurrentClampSeries
- Bytes: 711580684440
- Files: 787
- Subjects: 310
- Cells: None
- Samples: None
DANDI:000630/0.230915.2257
- Species (2): Human, Homo sapiens - Human
- Approaches (1): electrophysiological approach
