Fix #49: handle comma-concatenated ontology term ids in translate()#18
Fix #49: handle comma-concatenated ontology term ids in translate()#18jkobject wants to merge 1 commit into
Conversation
CELLxGENE allows comma-concatenated ontology terms in single-cell metadata (e.g. self_reported_ethnicity_ontology_term_id='HANCESTRO:0005, HANCESTRO:0008'). These concatenated strings are not individual entries in lamindb, so the old translate() call hit '.filter(ontology_id=...).one()' on a non-existent record and crashed the whole Embedder.log_adata() step. This patch: - splits comma-concatenated ontology ids and resolves each part separately, joining the resolved names with the same separator, - falls back to the raw id (instead of raising) when an ontology id is not present in the current lamindb instance, so a single missing / custom term no longer kills the entire embedding run. Refs jkobject/scPRINT#49.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: adc9870737
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| return {i: _lookup(i) for i in set(val)} | ||
| elif type(val) is dict or type(val) is Counter: | ||
| return {obj.loc[k]["name"] if k != "unknown" else k: v for k, v in val.items()} | ||
| return {_lookup(k): v for k, v in val.items()} |
There was a problem hiding this comment.
Aggregate colliding translated keys in Counter input
When val is a dict/Counter, the comprehension {_lookup(k): v for k, v in val.items()} silently overwrites earlier entries if multiple raw IDs resolve to the same translated label (for example, comma-concatenated IDs that differ only by whitespace, or different IDs that normalize to the same output string). In those cases counts are lost instead of combined, which can skew label-frequency summaries produced from value_counts()/Counter inputs; this should accumulate values per translated key rather than keep only the last one.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Pull request overview
Updates translate() to be more robust to CELLxGENE metadata that contains comma-concatenated ontology term IDs (e.g. multi-ethnicity) and to avoid hard failures when an ontology ID can’t be resolved in the current LaminDB/Bionty instance.
Changes:
- Add a helper lookup that splits comma-concatenated ontology IDs and translates each part, joining results with
,. - Add a fallback behavior to return the raw ontology ID when a lookup misses, instead of raising.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| @@ -458,14 +458,33 @@ def translate( | |||
| obj = bt.Ethnicity.df().set_index("ontology_id") | |||
| else: | |||
| return None | |||
| if ontology_id == "unknown": | ||
| return ontology_id | ||
| if "," in ontology_id: | ||
| parts = [p.strip() for p in ontology_id.split(",") if p.strip()] | ||
| return ",".join(_lookup(p) for p in parts) |
|
Closing — cantinilab#51 (already merged) implements an equivalent fix (try/except inside |
Fixes cantinilab#49.
CELLxGENE allows comma-concatenated ontology term ids in some metadata columns (most commonly
self_reported_ethnicity_ontology_term_id, e.g.HANCESTRO:0005,HANCESTRO:0008for multi-ethnicity). These concatenated strings are not individual entries in lamindb, so the previoustranslate()call resolved.loc[val]on a missing index and crashed the entireEmbedder.log_adata()step.Changes
,.Workaround for users on older releases
Pre-processing the metadata before embedding still works, e.g. dropping the second ancestry:
CC @danielee0707.