Fix #49: handle comma-concatenated ontology term ids in translate() #18

Closed
jkobject wants to merge 1 commit into main from fix/translate-multi-ontology

@jkobject
Owner

Fixes cantinilab#49.

CELLxGENE allows comma-concatenated ontology term ids in some metadata columns (most commonly self_reported_ethnicity_ontology_term_id, e.g. HANCESTRO:0005,HANCESTRO:0008 for multi-ethnicity). These concatenated strings are not individual entries in lamindb, so the previous translate() call resolved .loc[val] on a missing index and crashed the entire Embedder.log_adata() step.

Changes

  • Split comma-concatenated ontology ids and resolve each part separately, joining the resolved names with a comma (",").
  • Fall back to the raw id (instead of raising) when an ontology id is not in the current lamindb instance, so a single missing / custom / locally-curated term no longer kills the whole embedding run.
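
The two changes above can be sketched as follows. This is a minimal illustration, not the code in this PR: `lookup_name`, the `names` mapping, and the labels in `NAMES` are hypothetical stand-ins for the lamindb-backed table that translate() actually queries.

```python
from typing import Mapping


def lookup_name(ontology_id: str, names: Mapping[str, str]) -> str:
    """Resolve one ontology id, or a comma-concatenated list of ids, to names."""
    # "unknown" is passed through untouched.
    if ontology_id == "unknown":
        return ontology_id
    # Comma-concatenated ids: resolve each non-empty part, then re-join.
    if "," in ontology_id:
        parts = [p.strip() for p in ontology_id.split(",") if p.strip()]
        return ",".join(lookup_name(p, names) for p in parts)
    # Fall back to the raw id instead of raising when the id is not known.
    return names.get(ontology_id, ontology_id)


# Hypothetical id -> name table; the labels are illustrative placeholders,
# not values read from a lamindb instance.
NAMES = {"HANCESTRO:0005": "European", "HANCESTRO:0008": "Asian"}

print(lookup_name("HANCESTRO:0005,HANCESTRO:0008", NAMES))
print(lookup_name("CUSTOM:0001", NAMES))
```

With this shape, a single unresolved id degrades to its raw string rather than aborting the whole run, which is the behavior the second bullet describes.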

Workaround for users on older releases

Pre-processing the metadata before embedding still works, e.g. dropping the second ancestry:

col = "self_reported_ethnicity_ontology_term_id"
adata.obs[col] = adata.obs[col].astype(str).str.split(",").str[0]

CC @danielee0707.

CELLxGENE allows comma-concatenated ontology terms in single-cell
metadata (e.g. self_reported_ethnicity_ontology_term_id='HANCESTRO:0005,
HANCESTRO:0008'). These concatenated strings are not individual entries
in lamindb, so the old translate() call hit
'.filter(ontology_id=...).one()' on a non-existent record and crashed
the whole Embedder.log_adata() step.

This patch:
- splits comma-concatenated ontology ids and resolves each part
  separately, joining the resolved names with the same separator,
- falls back to the raw id (instead of raising) when an ontology id is
  not present in the current lamindb instance, so a single missing /
  custom term no longer kills the entire embedding run.

Refs jkobject/scPRINT#49.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: adc9870737

Comment thread scprint/model/utils.py
+        return {i: _lookup(i) for i in set(val)}
     elif type(val) is dict or type(val) is Counter:
-        return {obj.loc[k]["name"] if k != "unknown" else k: v for k, v in val.items()}
+        return {_lookup(k): v for k, v in val.items()}

P2: Aggregate colliding translated keys in Counter input

When val is a dict/Counter, the comprehension {_lookup(k): v for k, v in val.items()} silently overwrites earlier entries if multiple raw IDs resolve to the same translated label (for example, comma-concatenated IDs that differ only by whitespace, or different IDs that normalize to the same output string). In those cases counts are lost instead of combined, which can skew label-frequency summaries produced from value_counts()/Counter inputs; this should accumulate values per translated key rather than keep only the last one.
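
One way to accumulate counts per translated key is sketched below. It assumes a hypothetical helper `translate_counts` and a toy `normalize` lookup that only strips whitespace around comma-separated ids. Neither name exists in scprint, and the real translate() returns a plain dict rather than a Counter.

```python
from collections import Counter
from typing import Callable, Mapping


def translate_counts(
    counts: Mapping[str, int], lookup: Callable[[str], str]
) -> Counter:
    """Translate the keys of a count mapping, summing counts on collisions."""
    merged: Counter = Counter()
    for raw_id, n in counts.items():
        # += adds to any count already stored under the translated key,
        # so two raw ids that map to the same label are combined.
        merged[lookup(raw_id)] += n
    return merged


def normalize(raw_id: str) -> str:
    # Toy lookup (assumption): strips whitespace around comma-separated ids.
    return ",".join(p.strip() for p in raw_id.split(",") if p.strip())


raw = {"HANCESTRO:0005,HANCESTRO:0008": 3, "HANCESTRO:0005, HANCESTRO:0008": 2}
print(translate_counts(raw, normalize))
```

A dict comprehension over the same input would keep only one of the two counts, whichever key is visited last.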

Copilot AI left a comment

Pull request overview

Updates translate() to be more robust to CELLxGENE metadata that contains comma-concatenated ontology term IDs (e.g. multi-ethnicity) and to avoid hard failures when an ontology ID can’t be resolved in the current LaminDB/Bionty instance.

Changes:

  • Add a helper lookup that splits comma-concatenated ontology IDs and translates each part, joining results with a comma (",").
  • Add a fallback behavior to return the raw ontology ID when a lookup misses, instead of raising.

Comment thread scprint/model/utils.py
@@ -458,14 +458,33 @@ def translate(
obj = bt.Ethnicity.df().set_index("ontology_id")
else:
return None
Comment on lines +470 to +474
if ontology_id == "unknown":
return ontology_id
if "," in ontology_id:
parts = [p.strip() for p in ontology_id.split(",") if p.strip()]
return ",".join(_lookup(p) for p in parts)
@jkobject
Owner Author

Closing: cantinilab#51 (already merged) implements an equivalent fix (try/except inside make_adata), and jkobject/scDataLoader#33 fixes translate() at the source. No need to land my version on the stale jkobject/scPRINT fork.

@jkobject jkobject closed this May 20, 2026

Development

Successfully merging this pull request may close these issues.

Comma concatenated ontology term ID not in lamindb causes error in translate

2 participants