
NMDC: Document string/scalar-to-structured measurement transforms (tests only) #131

Merged
amc-corey-cox merged 2 commits into main from test-range-override-string-to-object
Mar 9, 2026

Conversation


@turbomam turbomam commented Feb 26, 2026

Closes #129

NMDC Relevance

This test documents the inverse NMDC pattern: transforming string or split scalar inputs into structured measurement objects (for example depth -> QuantityValue).

What This PR Adds

  • Test-only coverage for string-to-object range override behavior using expr.
  • No code changes; validates existing behavior on current main.

Coverage

  • Baseline string-to-string passthrough
  • Object construction from separate fields
  • Parsing composite strings like "5 m" into object fields

Why It Matters

For NMDC schema-integration workflows (including MIxS-adjacent data), this confirms linkml-map can already express string/scalar -> structured-object transforms without engine changes.
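The composite-string case described above can be sketched as a linkml-map transform spec. The class names, slot names, and the expr string below (StringSample, StructuredSample, depth) follow the snippets quoted later in the review comments, so treat this as an illustrative sketch rather than the exact merged test fixtures:

```python
from typing import Any

# Illustrative transform spec: parse a composite string like "5 m" into a
# structured QuantityValue-shaped dict via an `expr` slot derivation.
# Names are taken from the review snippets in this PR, not the test file.
TRANSFORM_PARSE_SKETCH: dict[str, Any] = {
    "class_derivations": {
        "StructuredSample": {
            "populated_from": "StringSample",
            "slot_derivations": {
                "id": {},
                "depth": {
                    "expr": (
                        '{"has_numeric_value": float(depth.split(" ")[0]),'
                        ' "has_unit": depth.split(" ")[1]}'
                    ),
                },
            },
        },
    },
}


def parse_depth(depth: str) -> dict[str, Any]:
    """Plain-Python equivalent of what the expr computes for the happy path."""
    return {
        "has_numeric_value": float(depth.split(" ")[0]),
        "has_unit": depth.split(" ")[1],
    }
```

For the input record {"id": "samp1", "depth": "5 m"}, the expression yields a dict with has_numeric_value 5.0 and has_unit "m", which is what the tests assert against.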

Copilot AI review requested due to automatic review settings February 26, 2026 14:54

Copilot AI left a comment


Pull request overview

This PR adds comprehensive tests demonstrating linkml-map's capability to transform slots from string ranges to structured object ranges (e.g., depth: string → depth: QuantityValue). This pattern is common in schema integration projects where upstream schemas use simple strings for values that downstream schemas model as structured objects.

Changes:

  • Adds three test scenarios: string-to-string passthrough (baseline), constructing objects from separate fields via expressions, and parsing composite strings into structured objects
  • Documents that dict construction via expr works without code changes
  • Complements PR #130's inverse pattern (object-to-scalar flattening)

assert result["depth"] == {
    "has_numeric_value": 5.0,
    "has_unit": "m",
}

Copilot AI Feb 26, 2026


The test_parse_string_into_object test parses a string with split(" ") but does not validate edge cases. Consider adding tests for:

  • Empty string input (e.g., depth: "")
  • Null/None input (e.g., depth: null)
  • Malformed strings (e.g., depth: "5" without unit, depth: "five m" with a non-numeric value, depth: "5  m" with multiple spaces between value and unit)
  • Strings with extra whitespace (e.g., depth: " 5 m ")

Without validation, these cases could cause IndexError or ValueError exceptions at runtime. While the current test documents the happy path, adding edge case tests would ensure robust handling of real-world data where input quality may vary.

Suggested change
def test_parse_string_into_object_empty_and_null_depth() -> None:
    """Edge cases: empty string or None depth should not parse as valid."""
    for depth in ("", None):
        with pytest.raises((TypeError, ValueError, IndexError)):
            _run(
                source_schema=_source_schema_string(),
                target_schema=_target_schema_quantity(),
                transform_spec=TRANSFORM_PARSE,
                input_data={"id": "samp1", "depth": depth},
                source_type="StringSample",
            )


def test_parse_string_into_object_malformed_and_whitespace() -> None:
    """Edge cases: malformed or oddly spaced depth strings."""
    # Missing unit, non-numeric value, multiple internal spaces, extra outer spaces
    malformed_depth_values = [
        "5",  # no unit
        "five m",  # non-numeric value
        "5  m",  # multiple spaces between value and unit
        " 5 m ",  # leading/trailing whitespace that split(' ') may mishandle
    ]
    for depth in malformed_depth_values:
        with pytest.raises((TypeError, ValueError, IndexError)):
            _run(
                source_schema=_source_schema_string(),
                target_schema=_target_schema_quantity(),
                transform_spec=TRANSFORM_PARSE,
                input_data={"id": "samp1", "depth": depth},
                source_type="StringSample",
            )

Comment on lines +114 to +116
'{"has_numeric_value": float(depth.split(" ")[0]),'
' "has_unit": depth.split(" ")[1]}'
),

Copilot AI Feb 26, 2026


The expression in line 114 calls split(" ") twice on the same string, which is inefficient. While this works correctly, consider documenting or testing whether the expression evaluator supports variable assignment within expressions (e.g., parts = depth.split(" ")) to avoid redundant string operations. If variable assignment is not supported, this duplication is necessary but worth noting in a comment for maintainability.
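If the expression evaluator does not support variable assignment, one alternative (a sketch only, not what this PR or linkml-map does) is to perform the split once in plain Python before the record reaches the transformer. The helper name below is hypothetical:

```python
def split_depth_once(depth: str) -> dict[str, object]:
    """Split the composite string a single time and reuse the parts.

    This avoids the duplicated depth.split(" ") calls that a single
    expression string forces when the evaluator has no assignment.
    """
    value, unit = depth.split(" ", 1)  # maxsplit=1: unit may contain spaces
    return {"has_numeric_value": float(value), "has_unit": unit}
```

Whether this is worth doing is debatable: two splits of a short string are cheap, so the duplicated call in the expression is arguably fine as long as a comment notes why it is there.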

"has_unit": "m",
}



Copilot AI Feb 26, 2026


The test constructs a QuantityValue object using float() conversion, but the test only validates the happy path. Consider adding a test case for test_construct_object_from_subfields where depth_value contains a non-numeric string (e.g., "five") to document the expected behavior. Should this raise an error during transformation, or should it be validated before the transformation? This would help clarify the contract for users of this pattern.

Suggested change
def test_construct_object_from_subfields_non_numeric() -> None:
    """Non-numeric depth_value should raise an error during transformation."""
    with pytest.raises(ValueError):
        _run(
            source_schema=_source_schema_flat(),
            target_schema=_target_schema_quantity(),
            transform_spec=TRANSFORM_CONSTRUCT,
            input_data={"id": "samp1", "depth_value": "five", "depth_unit": "m"},
            source_type="FlatSample",
        )

Comment on lines +122 to +124


# ---------------------------------------------------------------------------

Copilot AI Feb 26, 2026


The PR description notes that expressions returning dictionaries are not validated against the target schema. Consider adding a test case that explicitly validates this behavior - for example, an expression that returns {"wrong_key": 5.0} for a QuantityValue target - to document the current behavior. This would make it easier to track when/if schema validation of expression outputs is implemented in the future, and helps users understand the current limitations.

Suggested change
# ---------------------------------------------------------------------------
# Transformation where the expr returns a dict that does NOT match the
# QuantityValue schema. This documents that expression outputs are not
# currently validated against the target schema.
TRANSFORM_BAD_QUANTITY: dict[str, Any] = {
    "class_derivations": {
        "StructuredSample": {
            "populated_from": "StringSample",
            "slot_derivations": {
                "id": {},
                "depth": {
                    # Deliberately return a dict with the wrong key for a
                    # QuantityValue target; current behavior is that this
                    # passes through without schema validation.
                    "expr": '{"wrong_key": 5.0}',
                },
            },
        },
    },
}


# ---------------------------------------------------------------------------
# Tests documenting current behavior for dict-valued expressions
# ---------------------------------------------------------------------------
def test_dict_expression_not_validated_against_target_schema() -> None:
    """Expression dict outputs are not schema-validated for QuantityValue.

    This test encodes the current limitation: an expr that returns a dict
    with an unexpected key for a QuantityValue target is accepted and the
    dict is used as-is, rather than being validated against the
    QuantityValue schema (which would require has_numeric_value/has_unit).
    """
    source_schema = _source_schema_string()
    target_schema = _target_schema_quantity()
    result = _run(
        source_schema=source_schema,
        target_schema=target_schema,
        transform_spec=TRANSFORM_BAD_QUANTITY,
        input_data={"id": "sample1", "depth": "5 m"},
        source_type="StringSample",
    )
    # The transform should succeed and produce the unvalidated dict.
    assert "depth" in result
    assert isinstance(result["depth"], dict)
    assert "wrong_key" in result["depth"]
    # And it should *not* be coerced into the expected QuantityValue shape.
    assert "has_numeric_value" not in result["depth"]
    assert "has_unit" not in result["depth"]
# ---------------------------------------------------------------------------

- Fix spec mutation: use copy.deepcopy() on transform specs before
  passing to create_transformer_specification()
- Add parametrized tests for malformed parse inputs (null, empty,
  no-unit, non-numeric) — all yield None via simpleeval error handling
- Add test for non-numeric depth_value in construct expr
- Add test documenting validation gap: wrong dict keys pass through
  without target schema validation
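The "all yield None" behavior described in the commit message can be mimicked in plain Python. This is a sketch of the observed failure mode only, not linkml-map's actual implementation (which delegates expression evaluation to simpleeval); the helper name is hypothetical:

```python
def safe_parse_depth(depth):
    """Return a QuantityValue-shaped dict, or None if parsing fails.

    Mirrors the behavior the tests document: malformed inputs (null,
    empty, no unit, non-numeric value) yield None rather than raising,
    because the evaluator swallows expression errors.
    """
    try:
        value, unit = depth.split(" ", 1)
        return {"has_numeric_value": float(value), "has_unit": unit}
    except (AttributeError, ValueError):
        # AttributeError: depth is None; ValueError: wrong part count
        # or a non-numeric value component.
        return None
```

This is why the parametrized tests assert on a None result for the malformed cases instead of expecting an exception to propagate.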
@turbomam turbomam changed the title Add tests for string-to-object range override patterns NMDC: Document string/scalar-to-structured measurement transforms (tests only) Feb 26, 2026

@amc-corey-cox amc-corey-cox left a comment


Looks like Copilot has a bunch of false positives. This looks good to me.

@amc-corey-cox amc-corey-cox merged commit 8c03485 into main Mar 9, 2026
10 checks passed
@amc-corey-cox amc-corey-cox deleted the test-range-override-string-to-object branch March 9, 2026 18:44


Development

Successfully merging this pull request may close these issues.

Test: slot import with range override (MIxS string → structured object)

3 participants