Fix #49: handle comma-concatenated ontology term ids in translate() by jkobject · Pull Request #18 · jkobject/scPRINT

jkobject · 2026-05-20T07:50:56Z

CELLxGENE allows comma-concatenated ontology term ids in some metadata columns (most commonly self_reported_ethnicity_ontology_term_id, e.g. HANCESTRO:0005,HANCESTRO:0008 for multi-ethnicity). These concatenated strings are not individual entries in lamindb, so the previous translate() call resolved .loc[val] on a missing index and crashed the entire Embedder.log_adata() step.

Changes

Split comma-concatenated ontology ids and resolve each part separately, joining the resolved names with ,.
Fall back to the raw id (instead of raising) when an ontology id is not in the current lamindb instance, so a single missing / custom / locally-curated term no longer kills the whole embedding run.

Workaround for users on older releases

Pre-processing the metadata before embedding still works, e.g. dropping the second ancestry:

col = "self_reported_ethnicity_ontology_term_id"
adata.obs[col] = adata.obs[col].astype(str).str.split(",").str[0]

CC @danielee0707.

CELLxGENE allows comma-concatenated ontology terms in single-cell metadata (e.g. self_reported_ethnicity_ontology_term_id='HANCESTRO:0005, HANCESTRO:0008'). These concatenated strings are not individual entries in lamindb, so the old translate() call hit '.filter(ontology_id=...).one()' on a non-existent record and crashed the whole Embedder.log_adata() step. This patch: - splits comma-concatenated ontology ids and resolves each part separately, joining the resolved names with the same separator, - falls back to the raw id (instead of raising) when an ontology id is not present in the current lamindb instance, so a single missing / custom term no longer kills the entire embedding run. Refs jkobject/scPRINT#49.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: adc9870737

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-05-20T07:54:29Z

+        return {i: _lookup(i) for i in set(val)}
    elif type(val) is dict or type(val) is Counter:
-        return {obj.loc[k]["name"] if k != "unknown" else k: v for k, v in val.items()}
+        return {_lookup(k): v for k, v in val.items()}


Aggregate colliding translated keys in Counter input

When val is a dict/Counter, the comprehension {_lookup(k): v for k, v in val.items()} silently overwrites earlier entries if multiple raw IDs resolve to the same translated label (for example, comma-concatenated IDs that differ only by whitespace, or different IDs that normalize to the same output string). In those cases counts are lost instead of combined, which can skew label-frequency summaries produced from value_counts()/Counter inputs; this should accumulate values per translated key rather than keep only the last one.

Useful? React with 👍 / 👎.

Copilot

Pull request overview

Updates translate() to be more robust to CELLxGENE metadata that contains comma-concatenated ontology term IDs (e.g. multi-ethnicity) and to avoid hard failures when an ontology ID can’t be resolved in the current LaminDB/Bionty instance.

Changes:

Add a helper lookup that splits comma-concatenated ontology IDs and translates each part, joining results with ,.
Add a fallback behavior to return the raw ontology ID when a lookup misses, instead of raising.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@@ -458,14 +458,33 @@ def translate(
        obj = bt.Ethnicity.df().set_index("ontology_id")
    else:
        return None


+        if ontology_id == "unknown":
+            return ontology_id
+        if "," in ontology_id:
+            parts = [p.strip() for p in ontology_id.split(",") if p.strip()]
+            return ",".join(_lookup(p) for p in parts)


jkobject · 2026-05-20T08:25:22Z

Closing — cantinilab#51 (already merged) implements an equivalent fix (try/except inside make_adata) plus jkobject/scDataLoader#33 fixes translate() at the source. No need to land my version on the stale jkobject/scPRINT fork.

Copilot AI review requested due to automatic review settings May 20, 2026 07:50

Copilot started reviewing on behalf of jkobject May 20, 2026 07:51 View session

jkobject mentioned this pull request May 20, 2026

Comma concatenated ontology term ID not in lamindb causes error in translate cantinilab/scPRINT#49

Closed

chatgpt-codex-connector Bot reviewed May 20, 2026

View reviewed changes

Copilot AI reviewed May 20, 2026

View reviewed changes

jkobject closed this May 20, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix #49: handle comma-concatenated ontology term ids in translate()#18

Fix #49: handle comma-concatenated ontology term ids in translate()#18
jkobject wants to merge 1 commit into
mainfrom
fix/translate-multi-ontology

jkobject commented May 20, 2026

Uh oh!

chatgpt-codex-connector Bot left a comment

Uh oh!

chatgpt-codex-connector Bot May 20, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

jkobject commented May 20, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

jkobject commented May 20, 2026

Changes

Workaround for users on older releases

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot May 20, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

jkobject commented May 20, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants