Skip to content

pipestat retrieval breaks with common prefix #159

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
nsheff opened this issue Feb 21, 2024 · 7 comments
Closed

pipestat retrieval breaks with common prefix #159

nsheff opened this issue Feb 21, 2024 · 7 comments
Assignees
Labels
bug Something isn't working likely-solved

Comments

@nsheff
Copy link
Contributor

nsheff commented Feb 21, 2024

If I have two record identifiers that share a common prefix, pipestat will return the value for the shorter one even when you request the longer one.

This may be the root of this problem I found earlier in looper: pepkit/looper#470

Watch this:

Create pipestatmanager object

psm3 = pipestat.PipestatManager(
    record_identifier="sample1",
    results_file_path="analysis/bug_test.yaml",
    schema_path="analysis/seqcol_pipestat_schema.yaml",
)

Here's my seqcol_pipestat_schema.yaml

title: Refget henge back-end schema
description: Allows pipestat to be used as a simple key-value store.
type: object
properties:
  pipeline_name: "refget"
  samples:
    type: object
    properties:
      value:
        type: string
        description: "Value of the object referred to by the key"

Report/retrieve results is broken if you use similar sample names:

psm3.report({"value": "abcdefg"}, record_identifier="sample1")
psm3.report({"value": "12345"}, record_identifier="sample")
psm3.retrieve_one("sample", "value")
# '12345'
 psm3.retrieve_one("sample1", "value")  # should return 'abcdefg'
# '12345'

And continuing:

psm3.report({"value": "This is a new value"}, record_identifier="sample1", force_overwrite=True)
psm3.retrieve_one("sample1", "value")
# '12345'

It seems that if anything matches the first prefix of the sample, it will return that somehow. This is a critical bug.

The problem is with retrieval, not with reporting, because the file itself is actually correct, it shows:

refget:
  project: {}
  sample:
    sample1:
      value: This is a new value
      pipestat_created_time: '2024-02-21 07:49:04'
      pipestat_modified_time: '2024-02-21 07:50:52'
    sample:
      value: '12345'
      pipestat_created_time: '2024-02-21 07:49:13'
      pipestat_modified_time: '2024-02-21 07:49:13'
@donaldcampbelljr
Copy link
Contributor

Additional info: Confirmed it is just for the filebackend.

@donaldcampbelljr
Copy link
Contributor

Found it:
if record_identifier in filter_condition["value"]:

should be

if record_identifier == filter_condition["value"]:

I'm failing a couple of tests with this change so I need to investigate that.

@donaldcampbelljr
Copy link
Contributor

This is fixed

@nsheff
Copy link
Contributor Author

nsheff commented Feb 21, 2024

Can you add in a test that mimics the above? Like, have two samples named sample1 and sample, so one is a prefix of the other, and show you can retrieve them both correctly?

@nsheff
Copy link
Contributor Author

nsheff commented Feb 21, 2024

Might be good to see if this solves pepkit/looper#470 (which should probably move to pipestat)

@donaldcampbelljr
Copy link
Contributor

Yes, I already added that test

@nsheff
Copy link
Contributor Author

nsheff commented Feb 21, 2024

I can confirm the fix is working correctly in my hands, thanks.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working likely-solved
Projects
None yet
Development

No branches or pull requests

2 participants