Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement API endpoint that returns name of collection containing document having a given id #532

Conversation

eecavanna
Copy link
Collaborator

@eecavanna eecavanna commented May 23, 2024

Description

In this branch, I implemented a new API endpoint. Its behavior is described in its docstring, shown here:

@router.get("/nmdcschema/ids/{doc_id}/collection-name")
def get_collection_name_by_doc_id(
    doc_id: str,
    mdb: MongoDatabase = Depends(get_mongo_db),
):
    r"""
    Returns the name of the collection, if any, containing the document having the specified `id`.

    This endpoint uses the NMDC Schema to determine the schema class of which an instance could have
    the specified value as its `id`; and then uses the NMDC Schema to determine the names of the
    `Database` slots (i.e. Mongo collection names) that could contain instances of that schema class.

    This endpoint then searches those Mongo collections for a document having that `id`.
    If it finds one, it responds with the name of the collection containing the document.
    If it does not find one, it response with an `HTTP 404 Not Found` response.
    """

Fixes #531

Type of change

Please delete options that are not relevant.

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

How Has This Been Tested?

Please describe the tests that you ran to verify your changes. Provide instructions so we can reproduce. Please also list any relevant details for your test configuration, if it is not simply make up-test && make test-run.

  • I implemented an automated test targeting the new functionality and that test ran/passed on GitHub Actions:
    image

Configuration Details: none

Checklist:

  • My code follows the style guidelines of this project (have you run black nmdc_runtime/?)
  • I have performed a self-review of my code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation (in docs/ and in https://github.com/microbiomedata/NMDC_documentation/?)
  • I have added tests that prove my fix is effective or that my feature works, incl. considering downstream usage (e.g. https://github.com/microbiomedata/notebook_hackathons) if applicable.
  • New and existing unit and functional tests pass locally with my changes (make up-test && make test-run)

    I delegated this test running to GitHub Actions, which passes

@eecavanna
Copy link
Collaborator Author

eecavanna commented May 25, 2024

@aclum, is this the type of HTTP response you had in mind for the endpoint that we discussed; the endpoint that, for a given id, determines the collection—if any—in which a document having that id would reside, and determines the schema class—if any—of which an instance would have that as its id?

    {
        "id": "nmdc:sty-10-123456abcdef",
        "collection_name": "study_set",
        "class_name": "Study",
    }

Note: The value of the "id" property in the HTTP response would be a verbatim copy of the one in the HTTP request. It's only included in the response for the convenience of the client.

CC: @sujaypatil96

@aclum
Copy link
Contributor

aclum commented May 30, 2024

Yes. I think this would also be useful for the data portal, I have a ticket that has been in for a while to be able to search by an nmdc identifier. microbiomedata/nmdc-server#964 cc @naglepuff @marySalvi @jeffbaumes

@ssarrafan ssarrafan requested a review from dwinston May 31, 2024 18:40
@dwinston
Copy link
Collaborator

dwinston commented Jun 7, 2024

What is expected for an id that looks like e.g. nmdc:clsite-99-123abc? From /nmdcschema/typecodes, clsite is the typecode for a CollectingBiosamplesFromSite object, but this can be in either the collecting_biosamples_from_site_set (https://github.com/microbiomedata/nmdc-schema/blob/v10.5.3/nmdc_schema/nmdc_materialized_patterns.schema.json#L3531) or planned_process_set (https://github.com/microbiomedata/nmdc-schema/blob/v10.5.3/nmdc_schema/nmdc_materialized_patterns.schema.json#L3653) slots, according to the schema. So in this case, both collection names should be returned, no?

I have updated the PR to reflect this and return a list of collections.

@aclum
Copy link
Contributor

aclum commented Jun 7, 2024

The issue of being valid for multiple collections is fixed in berkeley schema. The original feature of this endpoint was to return which collection the ID is in so if it is ambiguous for now it should check which collection the document actually resides in.

@eecavanna eecavanna changed the title Implement endpoint that returns collection and class name compatible with id Implement endpoint that returns class name and collection names compatible with id Jun 13, 2024
@eecavanna eecavanna marked this pull request as ready for review June 13, 2024 08:05
@eecavanna
Copy link
Collaborator Author

eecavanna commented Jun 13, 2024

The original feature of this endpoint was to return which collection the ID is in so if it is ambiguous for now it should check which collection the document actually resides in.

This endpoint only uses the schema, not the database. The class_name it returns is the name of the schema class, if any, for which the specified value would be a valid id (on an instance of that class). The collection_names list it returns is a list of the names of all the "collections" (technically, the names of slots of the Database schema class), if any, that could contain a document having that value as its id.

In other words, it currently doesn't do this:

if it is ambiguous for now it should check which collection the document actually resides in.

Instead, it returns a list of all collections that — according to the schema — a document having that id could reside in.

CC: @aclum

@eecavanna
Copy link
Collaborator Author

Sounds to me like one of the requirements for this endpoint is to also indicate which collection, if any, a document having the specified id resides in.

@eecavanna
Copy link
Collaborator Author

eecavanna commented Jun 13, 2024

Sounds to me like one of the requirements for this endpoint is to also indicate which collection, if any, a document having the specified id resides in.

Update: I have added this feature to the endpoint.

Here's the response shape:

{
    "id": "string",                             // `id` from the URL (the `hypothetical_doc_id` path parameter)
    "compatible_class_name": "string",          // name of the class of which an instance _could_ have that `id`
    "compatible_collection_names": ["string"],  // names of all collections that _could_ contain a document having that `id`
    "containing_collection_name": "string"      // name of the collection in which a document having that `id` _does_ exist, if any
}

Example response:

{
    "id": "nmdc:sty-1-foobar",
    "compatible_class_name": "Study",
    "compatible_collection_names": ["study_set"],
    "containing_collection_name": "study_set"
}

@eecavanna
Copy link
Collaborator Author

I am ready for this PR branch to be reviewed/merged.

sujaypatil96
sujaypatil96 previously approved these changes Jun 13, 2024
Copy link
Collaborator

@sujaypatil96 sujaypatil96 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tested the /nmdcschema/ids/{hypothetical_doc_id}/class-and-collection-names endpoint with 3 ids.

  • Case 1: An Extraction id which exists in the database (for testing, in my local Mongo), and can (theoretically) be part of multiple collections
Screenshot 2024-06-13 at 10 12 02 AM
  • Case 2: A study id which can technically be part of only one collection
Screenshot 2024-06-13 at 10 12 57 AM
  • Case 3: Non-conformant id
Screenshot 2024-06-13 at 10 13 27 AM

It worked like a charm in all three cases, so I'm happy with its behavior and am ready for this PR to be merged 🚀

@eecavanna
Copy link
Collaborator Author

Hi @aclum, are you OK with how this endpoint behaves? If so (and someone approves it via GitHub's Review mechanism), I'll merge it in. I accidentally invalidated @sujaypatil96's approval by making an additional commit.

@sujaypatil96 added screenshots of some example request/response pairs above—thanks, @sujaypatil96!

Copy link
Collaborator

@PeopleMakeCulture PeopleMakeCulture left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm. thank you for writing tests!

@PeopleMakeCulture
Copy link
Collaborator

@eecavanna You're good to merge. If/when there's time, I'd like to consolidate and replace these utils with your logic. (non-blocking):

  • nmdc_runtime.api.core.metadata.map_id_to_collection
  • nmdc_runtime.util.collection_name_to_class_names
  • nmdc_runtime.api.db.mongo.nmdc_schema_collection_names

@eecavanna
Copy link
Collaborator Author

eecavanna commented Jun 13, 2024

Based on @aclum's feedback during today's infrastructure meeting, I will update the endpoint as follows:

  • 1. If a document having the specified id exists in any of the collections the schema says it can exist in, return the name of that collection. If it happens to exist in multiple collections, only return the name of one collection that it exists in.
  • 2. If no document having the specified id exists in any of the collections the schema says it can exist in, return a "null-ish" response (TBD).

@eecavanna eecavanna changed the title Implement endpoint that returns class name and collection names compatible with id Implement API endpoint that returns name of collection containing document having a given id Jun 13, 2024
@eecavanna
Copy link
Collaborator Author

Sorry about all the thrash here! Getting the automated tests running locally was not happening smoothly for me (I think one of the tests hung last night), so I've been relying on the GHA workflow to run them for me. A downside of that is that the test failures generate notifications—at least to me.

@eecavanna
Copy link
Collaborator Author

Hi @aclum, I updated the API response based upon our conversation earlier today.

The new behavior is:

  1. Scenario 1: User specifies the id of nmdc:sty-1-foobar to the API. A document having that id exists in the study_set collection. The API responds with:

    {
       "id": "nmdc:sty-1-foobar",
       "collection_name": "study_set",
    }
  2. Scenario 2: User specifies the id of nmdc:sty-1-nonex to the API. A document having that id does not exist in any collections (that the schema says it can exist in). The API responds with a HTTP 404 Not Found response.

I'm ready for this PR branch to be reviewed/merged in.

@eecavanna eecavanna merged commit 6bcf735 into main Jun 13, 2024
@eecavanna eecavanna deleted the 531-implement-api-endpoint-that-accepts-id-value-and-returns-its-class-name-and-collection-name branch June 13, 2024 23:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Implement API endpoint that accepts id value and returns its class name and collection name
5 participants