Proposal: Cross-references of jobs #96

Open
csadorf opened this issue Aug 17, 2018 · 6 comments
Labels
enhancement, good first issue, pinned, proposal

Comments

@csadorf
Contributor

csadorf commented Aug 17, 2018

Original report by Carl Simon Adorf (Bitbucket: csadorf, GitHub: csadorf).


About

This issue describes a proposed feature which would standardize the way that jobs may be referenced within and between different projects.
One typical use case is the need to store aggregated results or data that is shared among many different jobs within one larger data space.

Rationale

There is currently no standardized way to reference jobs from different projects in order to define relationships of jobs within or across projects.
This puts the burden on the user to conceptualize and implement such references, which leads to duplication of effort and possible complications when third parties interface with the code.
Standardizing references will make it easier for users to set up data spaces with the above-mentioned relationships.

Example

Assume that the user has performed multiple computations at different state points and wants to generate aggregated results, such as a phase diagram, from that data.
We propose that such a workflow would be supported with the following API:

from itertools import tee
import signac
# ...

# Main project:
project = signac.get_project()

# Project for aggregated results:
phase_diagrams = signac.get_project('phase-diagrams')

for (p, T), group in project.groupby(('p', 'T')):
    with phase_diagrams.open_job(dict(p=p, T=T)) as pd_job:
        # We store references (links) to the original jobs within
        # the pd_job's document in order to track the data provenance.
        pd_job.doc.origin, group = tee(group)

        # Now, we generate and store the phase diagram.
        generate_and_save_phase_diagram(group, 'phase_diagram.pdf')

The above-mentioned workflow allows us to easily determine the origin data:

for pd_job in phase_diagrams:
    origin_jobs = phase_diagrams.lookup(pd_job.doc.origin)

Definitions

Terms used in this proposal document:

  • link: A uniform resource identifier (URI), which contains all information needed to look up another job.
  • sub-project: A project within a sub-directory of the current project root directory.
  • parent-project: A project within a parent-directory of the current project root directory.
  • neighbor-project: A project within another directory that is on the same level as the current project root directory.

Explicitly supported use-cases

The following use-cases should be supported by the proposed concept and implementation:

  1. Specify the following relationships: one-to-one, one-to-many, many-to-one, many-to-many,

  2. Specify a reference:

    a) within the same project,

    b) from one project to a sub-project,

    c) from one project to a parent-project,

    d) from one project to a neighbor-project.
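The four project relationships in item 2 reduce to four shapes of relative path. The following is a minimal sketch using only the standard library; the directory names (`/data/main` and so on) are made up for illustration and are not part of the proposal:

```python
import posixpath

# Hypothetical layout; the directory names are illustrative only.
current = "/data/main"
targets = {
    "same project": "/data/main",
    "sub-project": "/data/main/analysis",
    "parent-project": "/data",
    "neighbor-project": "/data/phase-diagrams",
}

# Path of each target project relative to the current project root.
relative = {
    kind: posixpath.relpath(path, start=current)
    for kind, path in targets.items()
}

for kind, rel in relative.items():
    print(f"{kind:>16}: {rel}")
```

Each of these relative paths could then serve as the project part of a link.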

Concept

We need two pieces of information in order to be able to locate a job within or across projects:

  1. The project that the referenced job belongs to.
  2. The job id of the referenced job.

The project is referenced by a relative or an absolute path to its root directory.
A relative path is defined as relative to a specific project, where the default is the current project.

A link is a URI defined like this:

signac://relative/path/to/project#abcdef123456...

The URI scheme is called 'signac', the project root directory is defined as the combination of the netloc and path component, and the job id is specified through the fragment component.

A signac URI can be parsed, for example, with the urllib.parse.urlparse function:

import urllib.parse
import signac

o = urllib.parse.urlparse('signac://path/to/project#abcdef')
project = signac.get_project(o.netloc + o.path)
job = project.open_job(id=o.fragment)
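To make the URI layout concrete without requiring a signac installation, here is a small standard-library sketch that builds and splits such a link. The helper names `make_link` and `parse_link` are hypothetical and not part of signac or of this proposal:

```python
from urllib.parse import urlsplit


def make_link(project_path: str, job_id: str) -> str:
    """Build a signac:// URI from a project path and a job id."""
    return f"signac://{project_path}#{job_id}"


def parse_link(link: str) -> tuple[str, str]:
    """Split a signac:// URI into (project path, job id)."""
    parts = urlsplit(link)
    if parts.scheme != "signac":
        raise ValueError(f"not a signac link: {link!r}")
    # The first path segment lands in netloc, the remainder in path,
    # so the project path is their concatenation.
    return parts.netloc + parts.path, parts.fragment


link = make_link("../phase-diagrams", "abcdef123456")
project_path, job_id = parse_link(link)
print(link, project_path, job_id)
```

An absolute path yields an empty netloc (`signac:///abs/path#id`), and the concatenation still recovers the path. Project paths containing `?` or `#` would need percent-encoding, which this sketch omits.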

Proposed API

The high-level API consists of project-based methods and root-namespace functions.

Project-based API

Using the project-based API, all links are generated relative to a specific project.

project.link_to(job)

Generate a link to job relative to project.

project.lookup(link)

Look up the project referenced in link relative to project's root directory and then return the referenced job.
This function will raise a LookupError if the referenced project cannot be found and a KeyError if the referenced job does not exist in the looked-up project.

project.lookup_project(link)

Look up the project referenced in link relative to project's root directory.
This function will raise a LookupError if the referenced project cannot be found.
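One consequence of the proposed error types is worth illustrating: in Python, KeyError is a subclass of LookupError, so callers who want to tell the two failures apart have to catch KeyError first. The sketch below uses a hypothetical stand-in `lookup` backed by a plain dict, not the proposed signac implementation:

```python
# Hypothetical stand-in for the proposed Project.lookup: projects are a
# plain dict mapping a project path to the set of job ids it contains.
PROJECTS = {"../phase-diagrams": {"abcdef123456"}}


def lookup(project_path: str, job_id: str) -> str:
    if project_path not in PROJECTS:
        raise LookupError(f"project not found: {project_path}")
    if job_id not in PROJECTS[project_path]:
        raise KeyError(job_id)
    return job_id


def describe(project_path: str, job_id: str) -> str:
    try:
        lookup(project_path, job_id)
    except KeyError:  # must come first: KeyError subclasses LookupError
        return "missing job"
    except LookupError:
        return "missing project"
    return "found"


print(describe("../phase-diagrams", "abcdef123456"))
print(describe("../phase-diagrams", "000000"))
print(describe("../nowhere", "abcdef123456"))
```

With the except clauses in the opposite order, the missing-job case would be reported as a missing project.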

Root-namespace API

The root-namespace API works like the project-based API, but always acts on the current project, that means the project returned by signac.get_project().
The root-namespace API can also be used if users want to specify an arbitrary path, including an absolute path.

signac.link_to(job, from=None)

This function will generate a link to job relative to from.
If the argument for from is None (the default), then the link will be relative to the return value of signac.get_project().root_directory(); otherwise, it will be relative to the path specified in from.

Instead of a directory path, one can also pass an instance of Project as the from argument, in which case the link will be relative to the project's root directory.

signac.lookup(link, from=None)

This function will attempt to look up the job referenced in link relative to from.
If no argument for from is provided, the link will be relative to the return value of signac.get_project().root_directory().

The argument for from can be a directory or an instance of Project.
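The path arithmetic behind these two functions can be sketched with the standard library alone. Everything here is an assumption made for illustration: the reference directory is passed as `start` because `from` is a reserved word in Python and cannot be used as a parameter name, the current working directory stands in for the current project's root, and jobs are represented by bare ids:

```python
import os
from urllib.parse import urlsplit


def _root_of(start=None) -> str:
    """Return a directory for None, a path, or a project-like object."""
    if start is None:
        # Assumption: stands in for signac.get_project().root_directory().
        return os.getcwd()
    if hasattr(start, "root_directory"):
        return start.root_directory()
    return os.path.abspath(os.fspath(start))


def link_to(job_id: str, project_root: str, start=None) -> str:
    """Build a link to a job, relative to the reference directory."""
    rel = os.path.relpath(project_root, _root_of(start))
    return f"signac://{rel.replace(os.sep, '/')}#{job_id}"


def lookup(link: str, start=None) -> tuple[str, str]:
    """Resolve a link to (absolute project root, job id)."""
    parts = urlsplit(link)
    root = os.path.join(_root_of(start), parts.netloc + parts.path)
    return os.path.normpath(root), parts.fragment


link = link_to("abcdef", "/data/b", start="/data/a")
root, job_id = lookup(link, start="/data/a")
print(link, root, job_id)
```

Resolving a link against the same reference directory that produced it returns the original project root, which is the round-trip property the API relies on.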

Automatic conversion of instances of Job to links

When storing an instance of Job within a job's state point or document, it is automatically converted to a link.
For example:

job.doc.other = other_job

This is equivalent to:

job.doc.other = signac.link_to(other_job)

This enables users to specify links with a concise API and predictable behavior.
To ensure that links are relative to the project of the job that contains the references, it is recommended to use with:

with job:
    job.doc.other = other_job

By entering the job's workspace prior to the look-up, we can guarantee that we use the same reference:

with job:
    other_job = signac.lookup(job.doc.other)
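One way such a conversion could work is to intercept assignment in the document type. The sketch below is purely illustrative: `Job` is a stand-in dataclass and `LinkingDocument` is a hypothetical dict subclass, neither of which reflects signac's actual classes:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    """Stand-in for signac's Job: just a project path and an id."""
    project_path: str
    id: str


class LinkingDocument(dict):
    """Hypothetical document that stores Job values as signac:// links."""

    def __setitem__(self, key, value):
        if isinstance(value, Job):
            value = f"signac://{value.project_path}#{value.id}"
        super().__setitem__(key, value)


doc = LinkingDocument()
doc["other"] = Job("../phase-diagrams", "abcdef123456")
doc["note"] = "plain values pass through unchanged"
print(doc)
```

Note that a plain dict subclass only intercepts item assignment; `update()` and the constructor bypass `__setitem__`, so a real implementation would need to cover those paths as well.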

Examples

Single link to another job

To create a reference to another job you simply call:

project = signac.get_project()
link = project.link_to(other_job)
# equivalent to: link = signac.link_to(other_job, from=project)

To look up the referenced job, we use the complementary Project.lookup function:

other_job = project.lookup(link)
# equivalent to: other_job = signac.lookup(link, from=project)

In general, the following relationship always holds:

assert other_job == project.lookup(project.link_to(other_job))

Link across projects

Jobs and the references to them do not need to belong to the same project.
For example:

project_a = signac.get_project('a/')
project_b = signac.get_project('b/')

job_in_a = project_a.open_job({'foo': 0})
job_in_b = project_b.open_job({'bar': 0})

job_in_a.doc.other_job = project_a.link_to(job_in_b)
job_in_b = project_a.lookup(job_in_a.doc.other_job)

Caveats

Migration

Changing the state point of a job, for example by adding an additional key, changes its id and therefore breaks any references to it.
Special care must be taken when migrating referenced jobs.

Assume that we have a one-to-many relationship, where one parent job is referenced by many child jobs:

for parent in project_a:
    for i in range(3):
        link_to_parent = project_b.link_to(parent)
        project_b.open_job(dict(i=i, parent=link_to_parent)).init()

Then, to properly migrate all parent jobs, we could use the following recipe, where we take advantage of the groupbydoc function:

for parent_link, children in project_b.groupbydoc('parent'):
    parent = project_b.lookup(parent_link)
    parent.sp.setdefault('new_key', False)
    for child in children:
        child.doc.parent = project_b.link_to(parent)
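The reason a migration breaks links can be shown without signac. In the sketch below, the id function is an assumption (a hash of the sorted JSON state point standing in for signac's own id calculation), and the child documents are plain dicts:

```python
import hashlib
import json


def job_id(statepoint: dict) -> str:
    # Assumption: a hash of the sorted JSON stands in for signac's id function.
    blob = json.dumps(statepoint, sort_keys=True).encode()
    return hashlib.md5(blob).hexdigest()


old_sp = {"p": 1.0, "T": 300}
new_sp = {**old_sp, "new_key": False}
old_link = f"signac://../a#{job_id(old_sp)}"
new_link = f"signac://../a#{job_id(new_sp)}"

# Child documents that still point at the pre-migration id.
children = [{"i": i, "parent": old_link} for i in range(3)]

# Rewrite every stale link using an old-to-new mapping.
remap = {old_link: new_link}
for child in children:
    child["parent"] = remap.get(child["parent"], child["parent"])

print(old_link != new_link, children[0]["parent"] == new_link)
```

Recording the old-to-new mapping during the migration is what makes the rewrite possible; once it is lost, the links can only be repaired by searching for candidates.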

Fixing broken references

Assume that a user migrated jobs without updating the references.
The following recipe could be used to repair those broken links:

for parent_link, children in project_b.groupbydoc('parent'):
    broken_parent = signac.lookup(parent_link)
    assert broken_parent not in project_a
    parent_candidates = project_a.find_jobs(broken_parent.sp())
    assert len(parent_candidates) == 1
    for child in children:
        child.doc.parent = project_b.link_to(parent_candidates[0])
@csadorf
Contributor Author

csadorf commented Aug 18, 2018

Original comment by Bradley Dice (Bitbucket: bdice, GitHub: bdice).


I don't think we've discussed "many to many" relationships. I'm not sure if or how those fit into this framework.

@csadorf
Contributor Author

csadorf commented Aug 18, 2018

Original comment by Carl Simon Adorf (Bitbucket: csadorf, GitHub: csadorf).


@bdice I would argue that a many-to-many relationship in this context could be realized by storing many references to project A and many references to project B within one location.
This could for example be a job that uses data from multiple jobs from multiple different projects.
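As a rough sketch of that idea, with made-up links and a plain dict standing in for the job document, one location can hold links into several projects and be regrouped by project when needed:

```python
from collections import defaultdict
from urllib.parse import urlsplit

# Hypothetical document of one aggregating job; the links are made up.
doc = {
    "inputs": [
        "signac://../project-a#aaa111",
        "signac://../project-a#aaa222",
        "signac://../project-b#bbb111",
    ]
}

# Group the referenced job ids by the project they live in.
by_project = defaultdict(list)
for link in doc["inputs"]:
    parts = urlsplit(link)
    by_project[parts.netloc + parts.path].append(parts.fragment)

print(dict(by_project))
```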

@csadorf
Contributor Author

csadorf commented Aug 18, 2018

Original comment by Carl Simon Adorf (Bitbucket: csadorf, GitHub: csadorf).


Update link to be a URI, not a document.

@mikemhenry mikemhenry added this to the v2.0.0 milestone Feb 8, 2019
@mikemhenry mikemhenry removed the v2 label Feb 8, 2019
@csadorf csadorf added enhancement New feature or request and removed critical labels Jul 5, 2019
@csadorf csadorf modified the milestones: v2.0.0, v1.3.0 Jul 5, 2019
@csadorf csadorf mentioned this issue Dec 5, 2019
@bdice bdice removed this from the v1.3.0 milestone Dec 6, 2019
@bdice bdice added the proposal label Feb 27, 2020
@bdice bdice changed the title Proposal for cross-references of jobs Proposal: Cross-references of jobs Feb 27, 2020
@vyasr
Contributor

vyasr commented Sep 16, 2020

Addressing this issue thoroughly requires quite a bit of additional work. For the moment I'm bumping this feature past the 2.0 milestone. I'm (possibly ambitiously) targeting #189 for version 2.0, but since there's more work to do to enable this proposal fully I don't think we need this for 2.0. I also don't think this feature needs to break any existing APIs, so we could add it to a 2.x release.

@vyasr vyasr added this to the v3.0.0 milestone Sep 16, 2020
@stale

stale bot commented Mar 31, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Mar 31, 2021
@stale stale bot closed this as completed Apr 18, 2021
@vyasr
Contributor

vyasr commented Feb 21, 2022

Reopening since there is still interest in this work.

@vyasr vyasr reopened this Feb 21, 2022
@stale stale bot removed the stale label Feb 21, 2022
@vyasr vyasr added the pinned Instructs stale bot to ignore this issue. label Feb 21, 2022
@b-butler b-butler added the good first issue Good for newcomers label Feb 13, 2023
@bdice bdice removed this from the v2.1.0 milestone Jul 12, 2023

5 participants