Wrong reference resolution if there are few dependent artefacts with the same ID but different agencies or version #164

Closed
sychsergiy opened this issue Feb 2, 2024 · 10 comments · Fixed by #165
Assignees: khaeru
Labels: bug, reader (Handle standard SDMX message types), xml (SDMX-ML format)

Comments

@sychsergiy

sychsergiy commented Feb 2, 2024

Hi @khaeru

I used the sdmx1 library to parse the attached XML file and found the following issue.
IMF_STA_DSD_GFS(4.0.1).xml.zip

The XML reader stores artefacts on its stack keyed by resource ID; however, the resource ID alone is not unique (only agency + resource ID + version is unique).
We have a few ConceptSchemes with the same ID but a different agency or version: in this case, IMF_STA:CS_MASTER(1.0.1) for the STATUS attribute and IMF:CS_MASTER(4.0.0) for the COUNTRY dimension.
A Concept that references one of those ConceptSchemes can't be resolved correctly, and None appears in core_representation.

image

Here is the stack state when the COUNTRY dimension tries to resolve its reference to the CS_MASTER concept scheme.
It contains a dict with several CS_MASTER ConceptSchemes, but it is not possible to access them by ID:
image
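
The ambiguity described above can be illustrated with a minimal sketch. It does not use sdmx1 internals: the `Artefact` class, `find_by_id`, and `by_full_key` are hypothetical stand-ins for whatever the reader keeps on its stack.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Artefact:
    """Hypothetical stand-in for a maintainable artefact such as a ConceptScheme."""

    agency: str
    id: str
    version: str


schemes = [
    Artefact("IMF_STA", "CS_MASTER", "1.0.1"),
    Artefact("IMF", "CS_MASTER", "4.0.0"),
]


def find_by_id(items, id_):
    """Return every artefact whose ID matches; the ID alone may match several."""
    return [a for a in items if a.id == id_]


# Lookup by ID alone is ambiguous: both schemes match.
candidates = find_by_id(schemes, "CS_MASTER")

# Lookup by (agency, ID, version) identifies exactly one scheme.
by_full_key = {(a.agency, a.id, a.version): a for a in schemes}
country_scheme = by_full_key[("IMF", "CS_MASTER", "4.0.0")]

print(len(candidates))
print(country_scheme)
```

With only the ID available, a resolver has no basis for choosing between the two candidates, which is consistent with the reference ending up unresolved.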

Is this a known issue?

I think the problem might be much more complex in the equivalent case with SDMX 3.0, where the version might contain a wildcard.

@sychsergiy
Author

Steps to reproduce: run the following with the attached file:

from io import BytesIO

from sdmx.reader.xml.v21 import Reader


def ignore_none_tag():
    """
    The XML reader fails with NotImplemented when it encounters a <None/> tag.
    """
    Reader.parser["None", "start"] = None
    Reader.parser["None", "end"] = None


def read_xml(file):
    with open(file, "rb") as f:
        content = f.read()
        response_io = BytesIO(content)
        return Reader().read_message(response_io)


def main():
    ignore_none_tag()
    message = read_xml("IMF_STA_DSD_GFS(4.0.1).xml")
    # print core_representation for COUNTRY dimension 
    print(message.structure["DSD_GFS"].dimensions.components[0].concept_identity.core_representation)


if __name__ == "__main__":
    main()

@sychsergiy
Author

sychsergiy commented Feb 2, 2024

Parsing of the <None/> tag (which should be parsed to NoSpecifiedRelationship) might warrant a separate ticket, but I'm not sure yet whether this is a bug in the sdmx1 library or the XML file doesn't fully conform to the SDMX standard. Still checking.

@khaeru
Owner

khaeru commented Feb 2, 2024

Thanks for the report, including a test specimen and code. I will try to reproduce and fix.

I recall there was an earlier bug (#116) that the SDMX-ML reader would return a StructureMessage that only had a subset of artifacts, when some had the same ID but different maintainers and/or versions. That was addressed in #124. The issue you describe is similar, but distinct: it's about the proper/expected association between artifacts within the message when some have the same ID.

One point of information that could help: is there a particular query (I guess against IMF) that returns this set of artefacts that you include in your ZIP file? This isn't essential, but would help us ensure any fix is durable.

@khaeru khaeru self-assigned this Feb 2, 2024
@khaeru khaeru added bug xml SDMX-ML format reader Handle standard SDMX message types labels Feb 2, 2024
@khaeru
Owner

khaeru commented Feb 2, 2024

As a note to self: this might be resolved by adjusting Reader.pop_resolved_ref() / Reader.get_single() to operate by URN rather than a combination of id and/or version. Since every (artifact class, maintainer ID, artifact ID, version) combination will have a URN that is (by definition) unique, this could disambiguate and even simplify some of the existing code.
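
A rough sketch of the URN idea, assuming the usual SDMX URN pattern `urn:sdmx:org.sdmx.infomodel.<package>.<class>=<agency>:<id>(<version>)`. The `make_urn` helper and the `registry` dict are hypothetical illustrations, not the actual Reader.pop_resolved_ref() / Reader.get_single() code.

```python
def make_urn(package: str, cls: str, agency: str, id_: str, version: str) -> str:
    """Build an SDMX-style URN for a maintainable artefact (illustrative helper)."""
    return f"urn:sdmx:org.sdmx.infomodel.{package}.{cls}={agency}:{id_}({version})"


# Hypothetical registry of parsed artefacts, keyed by URN instead of bare ID.
registry = {}
for agency, id_, version in [
    ("IMF_STA", "CS_MASTER", "1.0.1"),
    ("IMF", "CS_MASTER", "4.0.0"),
]:
    urn = make_urn("conceptscheme", "ConceptScheme", agency, id_, version)
    registry[urn] = {"agency": agency, "id": id_, "version": version}

# A reference carrying agency, ID, and version maps to exactly one entry.
target = make_urn("conceptscheme", "ConceptScheme", "IMF", "CS_MASTER", "4.0.0")
print(target)
print(registry[target])
```

Because the class name is part of the key, two artefacts of different classes that happen to share agency, ID, and version would also stay distinct.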

@khaeru
Owner

khaeru commented Feb 4, 2024

Parsing of the <None/> tag (which should be parsed to NoSpecifiedRelationship) might warrant a separate ticket, but I'm not sure yet whether this is a bug in the sdmx1 library or the XML file doesn't fully conform to the SDMX standard. Still checking.

I think the latter. Here we can benefit from the very useful sdmx.validate_xml(…) contributed by @goatsweater in #154:

import sdmx

sdmx.install_schemas()
sdmx.validate_xml("IMF_STA_DSD_GFS(4.0.1).xml")

After pruning away content that triggers other, unrelated errors, I see:

Element 'None': This element is not expected. Expected is one of (
  {http://www.sdmx.org/resources/sdmxml/schemas/v2_1/structure}None,
  {http://www.sdmx.org/resources/sdmxml/schemas/v2_1/structure}Dimension,
  {http://www.sdmx.org/resources/sdmxml/schemas/v2_1/structure}Group,
  {http://www.sdmx.org/resources/sdmxml/schemas/v2_1/structure}PrimaryMeasure
)., line 29573

This indicates that, per the XML Schema Documents, the tag should be <str:None/> and not <None/>. So the file is not valid SDMX-ML.

We already have a parser for the first of these:

sdmx/sdmx/reader/xml/v21.py

Lines 1521 to 1523 in fa936b2

@end("str:None")
def _ar_kind(reader: Reader, elem):
    return reader.class_for_tag(elem.tag)()

…and I find that when I replace all instances of the invalid tag, then sdmx.read_xml() parses the file without raising any exception.
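
One way such a replacement could be done is sketched below. It assumes the document already binds the `str` prefix to the SDMX 2.1 structure namespace, and that the invalid element only appears in its empty, self-closing form; `fix_none_tags` is a hypothetical helper, not part of sdmx1.

```python
import re

# Matches an unprefixed, self-closing <None/> element, with optional
# whitespace before the slash. A prefixed <str:None/> is not matched.
_INVALID_NONE = re.compile(r"<None\s*/>")


def fix_none_tags(xml_text: str) -> str:
    """Replace the invalid <None/> with the namespaced <str:None/>."""
    return _INVALID_NONE.sub("<str:None/>", xml_text)


sample = "<str:AttributeRelationship><None/></str:AttributeRelationship>"
fixed = fix_none_tags(sample)
print(fixed)
```

Running the function a second time leaves the text unchanged, so it is safe to apply to files that are already partly corrected.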

I'll proceed to investigate the main issue here with this modified file. But it would still be helpful if you could say where this specimen is coming from. This would indicate whether/where we can report the invalid SDMX-ML to the data provider.

khaeru added a commit to khaeru/sdmx-test-data that referenced this issue Feb 4, 2024
khaeru added a commit to khaeru/sdmx-test-data that referenced this issue Feb 4, 2024
- Manually reduced while still reproducing the reported issue.
- Correct invalid <None/> to <str:None/>.
khaeru added a commit that referenced this issue Feb 4, 2024
khaeru added a commit that referenced this issue Feb 4, 2024
@sychsergiy
Author

sychsergiy commented Feb 5, 2024

Here is SDMX query:
client.datastructure("DSD_GFS", provider="*", params={"references": "all"})
.../registry/sdmx/2.1/datastructure/IMF_STA/DSD_GFS/latest?references=all
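
For reference, the path above follows the SDMX 2.1 REST pattern `<resource>/<agency>/<id>/<version>`. The sketch below assembles such a URL; the base URL is a placeholder (the real host is not given here) and `structure_url` is a hypothetical helper.

```python
from urllib.parse import urlencode


def structure_url(base, resource, agency, resource_id, version="latest", **params):
    """Assemble an SDMX 2.1 REST structure query URL (illustrative helper)."""
    path = "/".join([base.rstrip("/"), resource, agency, resource_id, version])
    return f"{path}?{urlencode(params)}" if params else path


# Placeholder base URL; substitute the actual registry endpoint.
url = structure_url(
    "https://example.org/registry/sdmx/2.1",
    "datastructure",
    "IMF_STA",
    "DSD_GFS",
    references="all",
)
print(url)
```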

However, the data provider is behind a private network, so I don't think it is super useful, at least for now.
I know for sure that it will be exposed to the public soon, maybe via a subscription key, which I can share.

The platform supports a bunch of features that you might be interested in, such as SDMX 3.0 Hierarchy support, SDMX 2.1 HierarchicalCodelist support, version wildcards, and a lot of other 3.0 stuff.

So, I will send you a message as soon as the platform is exposed to the public.

@sychsergiy
Author

cc: @FedorYatsenko

@khaeru
Owner

khaeru commented Feb 5, 2024

However, the data provider is behind a private network, so I don't think it is super useful, at least for now. I know for sure that it will be exposed to the public soon, maybe via a subscription key, which I can share.

The platform supports a bunch of features that you might be interested in, such as SDMX 3.0 Hierarchy support, SDMX 2.1 HierarchicalCodelist support, version wildcards, and a lot of other 3.0 stuff.

Okay, thanks for this information. I also did not know Fedor was your colleague. From the sounds of it, I guess you are working with the IMF statistics division.

It is definitely useful to have "real-world" examples of different kinds of SDMX applications, including:

  • REST APIs that support the various kinds of queries,
  • data, metadata, and structure messages in the different formats,
  • both SDMX 2.1 and 3.0.

This is because there is no real "reference implementation" of SDMX, and the samples published with the standards are not really comprehensive—there is a lot of possible usage for which no official example or test suite exists.

At the same time, the primary goal for this sdmx1 package is to faithfully implement the standards, rather than align with quirks of other implementations (server or client software). So what's most valuable is cases where I can say confidently, "Aha, I can take this REST API or this provider's XML/JSON as a clear example of what the standards described, and test sdmx1 against it."

@sychsergiy
Author

sychsergiy commented Feb 5, 2024

We aren't from the IMF.

We are working with a platform that consumes, stores, and provides statistical data according to SDMX, and also provides analytic tools on top of it.

The provided example is a public IMF dataflow (and related DSD), but the platform is generic.
The platform's goal is the same, but from the server side: to fully align with the SDMX standard (it supports both 2.1 and 3.0, XML and JSON).
I assume there are minor violations, because of bugs or historical reasons, but 99% of it is aligned with the standard.

For analytics purposes, we are also working on an SDMX client.

We have our own implementation of the SDMX Infomodel in Python; we are even thinking of releasing it as open source as well.
Once we found sdmx1, we realized that our implementation is a bit ahead in terms of SDMX 3.0 structure artefacts and features (wildcard versioning), but our data client is far behind sdmx1.

We also have sdmx-to-pandas and pandas-to-sdmx converters, which are outside the standard but super useful.
This part is also a bit behind yours, I think.

Our colleagues from a parallel team have already open-sourced a Java implementation of the SDMX Infomodel:
https://github.com/epam/jsdmx

This is because there is no real "reference implementation" of SDMX, and the samples published with the standards are not really comprehensive—there is a lot of possible usage for which no official example or test suite exists.

What I meant by that message is that the platform I described might be that "real-world" example, especially from the SDMX 3.0 perspective, because I didn't find any public data source in sources.json with 3.0 support.

I perfectly understand that the scope of this library is only the SDMX standard, and we aren't asking you to implement anything outside of it. Our overall goal is pretty much the same.

@khaeru
Owner

khaeru commented Feb 6, 2024

Great stuff, thanks for letting me know the context.
