NMDC: Document string/scalar-to-structured measurement transforms (tests only)#131
Conversation
Pull request overview
This PR adds comprehensive tests demonstrating linkml-map's capability to transform slots from string ranges to structured object ranges (e.g., depth: string → depth: QuantityValue). This pattern is common in schema integration projects where upstream schemas use simple strings for values that downstream schemas model as structured objects.
Changes:
- Adds three test scenarios: string-to-string passthrough (baseline), constructing objects from separate fields via expressions, and parsing composite strings into structured objects
- Documents that dict construction via `expr` works without code changes
- Complements PR #130's inverse pattern (object-to-scalar flattening)
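As context, the parse pattern these tests exercise can be sketched in plain Python. This mirrors the `expr` string used in the tests; `parse_depth` is an illustrative name, not part of the PR:

```python
# Illustrative sketch (plain Python, not linkml-map code): what the
# tests' parse expr computes for a composite string like "5 m".
def parse_depth(depth: str) -> dict:
    return {
        "has_numeric_value": float(depth.split(" ")[0]),
        "has_unit": depth.split(" ")[1],
    }

print(parse_depth("5 m"))  # {'has_numeric_value': 5.0, 'has_unit': 'm'}
```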
```python
assert result["depth"] == {
    "has_numeric_value": 5.0,
    "has_unit": "m",
}
```
The test_parse_string_into_object test parses a string with `split(" ")` but does not validate edge cases. Consider adding tests for:
- Empty string input (e.g., `depth: ""`)
- Null/None input (e.g., `depth: null`)
- Malformed strings (e.g., `depth: "5"` without a unit, `depth: "five m"` with a non-numeric value, `depth: "5  m"` with multiple spaces)
- Strings with extra whitespace (e.g., `depth: " 5 m "`)
Without validation, these cases could cause IndexError or ValueError exceptions at runtime. While the current test documents the happy path, adding edge case tests would ensure robust handling of real-world data where input quality may vary.
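The failure modes can be probed in plain Python, outside linkml-map's evaluator (the commit notes later in this thread report that inside linkml-map these errors are swallowed by simpleeval error handling and yield None instead):

```python
# Plain-Python probe (not run through linkml-map's evaluator) of how
# the split(" ")-based expr behaves on the edge cases listed above.
def parse(depth):
    parts = depth.split(" ")
    return {"has_numeric_value": float(parts[0]), "has_unit": parts[1]}

results = {}
for value in ("", None, "5", "five m", "5  m", " 5 m "):
    try:
        results[value] = parse(value)
    except (AttributeError, IndexError, ValueError) as exc:
        results[value] = type(exc).__name__

# "5  m" (double space) is the nasty one: it raises nothing and
# silently produces an empty has_unit.
print(results)
```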
```python
def test_parse_string_into_object_empty_and_null_depth() -> None:
    """Edge cases: empty string or None depth should not parse as valid."""
    for depth in ("", None):
        with pytest.raises((TypeError, ValueError, IndexError)):
            _run(
                source_schema=_source_schema_string(),
                target_schema=_target_schema_quantity(),
                transform_spec=TRANSFORM_PARSE,
                input_data={"id": "samp1", "depth": depth},
                source_type="StringSample",
            )


def test_parse_string_into_object_malformed_and_whitespace() -> None:
    """Edge cases: malformed or oddly spaced depth strings."""
    # Missing unit, non-numeric value, multiple internal spaces, extra outer spaces
    malformed_depth_values = [
        "5",        # no unit
        "five m",   # non-numeric value
        "5  m",     # multiple spaces between value and unit
        " 5 m ",    # leading/trailing whitespace that split(' ') may mishandle
    ]
    for depth in malformed_depth_values:
        with pytest.raises((TypeError, ValueError, IndexError)):
            _run(
                source_schema=_source_schema_string(),
                target_schema=_target_schema_quantity(),
                transform_spec=TRANSFORM_PARSE,
                input_data={"id": "samp1", "depth": depth},
                source_type="StringSample",
            )
```
```python
    '{"has_numeric_value": float(depth.split(" ")[0]),'
    ' "has_unit": depth.split(" ")[1]}'
),
```
The expression in line 114 calls split(" ") twice on the same string, which is inefficient. While this works correctly, consider documenting or testing whether the expression evaluator supports variable assignment within expressions (e.g., parts = depth.split(" ")) to avoid redundant string operations. If variable assignment is not supported, this duplication is necessary but worth noting in a comment for maintainability.
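In plain Python the duplication could be avoided with a single `str.partition` pass; whether a single-expression evaluator like simpleeval permits the unpacking used here is exactly the open question, so this is only a reference point, not a drop-in `expr`:

```python
# Plain-Python alternative that scans the string once; this is NOT an
# `expr` string -- evaluators restricted to single expressions may not
# permit the tuple unpacking used here.
def parse_depth_once(depth: str) -> dict:
    value, _, unit = depth.partition(" ")
    return {"has_numeric_value": float(value), "has_unit": unit}

print(parse_depth_once("5 m"))  # {'has_numeric_value': 5.0, 'has_unit': 'm'}
```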
```python
"has_unit": "m",
}
```
The test constructs a QuantityValue object using float() conversion, but the test only validates the happy path. Consider adding a test case for test_construct_object_from_subfields where depth_value contains a non-numeric string (e.g., "five") to document the expected behavior. Should this raise an error during transformation, or should it be validated before the transformation? This would help clarify the contract for users of this pattern.
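For reference, `float()`'s coercion contract in plain Python (illustrative only, independent of how the expression evaluator surfaces the error):

```python
# What float() does with the kinds of raw values the construct
# pattern might receive; plain Python, no linkml-map involved.
outcomes = {}
for raw in ("5", "5.0", " 5 ", "five"):
    try:
        outcomes[raw] = float(raw)
    except ValueError:
        outcomes[raw] = "ValueError"

print(outcomes)  # float() strips surrounding whitespace but rejects words
```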
```python
def test_construct_object_from_subfields_non_numeric() -> None:
    """Non-numeric depth_value should raise an error during transformation."""
    with pytest.raises(ValueError):
        _run(
            source_schema=_source_schema_flat(),
            target_schema=_target_schema_quantity(),
            transform_spec=TRANSFORM_CONSTRUCT,
            input_data={"id": "samp1", "depth_value": "five", "depth_unit": "m"},
            source_type="FlatSample",
        )
```
The PR description notes that expressions returning dictionaries are not validated against the target schema. Consider adding a test case that explicitly validates this behavior - for example, an expression that returns {"wrong_key": 5.0} for a QuantityValue target - to document the current behavior. This would make it easier to track when/if schema validation of expression outputs is implemented in the future, and helps users understand the current limitations.
```python
# ---------------------------------------------------------------------------
# Transformation where the expr returns a dict that does NOT match the
# QuantityValue schema. This documents that expression outputs are not
# currently validated against the target schema.
TRANSFORM_BAD_QUANTITY: dict[str, Any] = {
    "class_derivations": {
        "StructuredSample": {
            "populated_from": "StringSample",
            "slot_derivations": {
                "id": {},
                "depth": {
                    # Deliberately return a dict with the wrong key for a
                    # QuantityValue target; current behavior is that this
                    # passes through without schema validation.
                    "expr": '{"wrong_key": 5.0}',
                },
            },
        },
    },
}


# ---------------------------------------------------------------------------
# Tests documenting current behavior for dict-valued expressions
# ---------------------------------------------------------------------------
def test_dict_expression_not_validated_against_target_schema() -> None:
    """Expression dict outputs are not schema-validated for QuantityValue.

    This test encodes the current limitation: an expr that returns a dict
    with an unexpected key for a QuantityValue target is accepted and the
    dict is used as-is, rather than being validated against the
    QuantityValue schema (which would require has_numeric_value/has_unit).
    """
    source_schema = _source_schema_string()
    target_schema = _target_schema_quantity()
    result = _run(
        source_schema=source_schema,
        target_schema=target_schema,
        transform_spec=TRANSFORM_BAD_QUANTITY,
        input_data={"id": "sample1", "depth": "5 m"},
        source_type="StringSample",
    )
    # The transform should succeed and produce the unvalidated dict.
    assert "depth" in result
    assert isinstance(result["depth"], dict)
    assert "wrong_key" in result["depth"]
    # And it should *not* be coerced into the expected QuantityValue shape.
    assert "has_numeric_value" not in result["depth"]
    assert "has_unit" not in result["depth"]
# ---------------------------------------------------------------------------
```
- Fix spec mutation: use copy.deepcopy() on transform specs before passing to create_transformer_specification()
- Add parametrized tests for malformed parse inputs (null, empty, no-unit, non-numeric): all yield None via simpleeval error handling
- Add test for non-numeric depth_value in construct expr
- Add test documenting validation gap: wrong dict keys pass through without target schema validation
amc-corey-cox
left a comment
Looks like Copilot has a bunch of false positives. This looks good to me.
Closes #129
NMDC Relevance
This test documents the inverse NMDC pattern: transforming string or split scalar inputs into structured measurement objects (for example `depth` -> `QuantityValue`).
What This PR Adds
Tests demonstrating dict construction via `expr`, targeting `main`.
Coverage
Parsing composite strings such as `"5 m"` into object fields.
Why It Matters
For NMDC schema-integration workflows (including MIxS-adjacent data), this confirms linkml-map can already express string/scalar -> structured-object transforms without engine changes.