Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Charon API endpoints don't respect global dataset-name redirections #858

Open
tsibley opened this issue May 13, 2024 · 2 comments
Open

Charon API endpoints don't respect global dataset-name redirections #858

tsibley opened this issue May 13, 2024 · 2 comments
Labels
bug Something isn't working

Comments

@tsibley
Copy link
Member

tsibley commented May 13, 2024

Initial reporting by @huddlej in Slack; initial troubleshooting happened there and some brief in-person discussion.

By way of example:

https://nextstrain.org/flu/seasonal/h1n1pdm/ha/2y@2024-05-09 redirects to https://nextstrain.org/seasonal-flu/h1n1pdm/ha/2y@2024-05-09, but https://nextstrain.org/charon/getDataset?prefix=flu/seasonal/h1n1pdm/ha/2y@2024-05-09 responds with a 404 instead of a 302 redirection.

When a browser (or RESTful API client) directly requests the first URL, the redirection occurs and then Auspice (or the data) is served. Auspice then makes the request to the Charon API for the new name and everything is happy. Auspice never sees the old URLs.

However, when rendering a narrative, Auspice can see the old names, and requests to the Charon API with the old names can fail. In this particular example, the failure is because the old name is not in the resources index and so accessing a past snapshot by date does not work.

Remove the date and observe a secondary failure mode, however: the Charon API returns data but it's from 23 April 2024 instead of 9 May 2024, because it's sourcing the last update under the old name without respecting the dataset renames. This failure mode is more pernicious because it's a) not entirely obvious (success but increasingly out of date) and b) has existed for well-before snapshots/resource indexing existed.

@tsibley tsibley added the bug Something isn't working label May 13, 2024
@tsibley
Copy link
Member Author

tsibley commented May 13, 2024

The global dataset-name redirects

/** A list of original patterns with matching redirect patterns
* that are used to create a list of match functions for transforming paths
* into parameters and compile functions for transforming parameters into a valid path
* intended to be used in `updateDatasetUrl`.
*
* Order matters as `updateDatasetUrl` returns the first matching pattern,
* so list more specific patterns first.
*
* The original and redirect patterns MUST share the same params
*/
const datasetRedirectPatterns = [
/** prior to June 2021 our core nCoV builds were available at
* /ncov/global, ncov/asia etc. These used GISAID data exclusively.
* We now have GISAID builds and GenBank builds, and so the URLs
* (i.e. names on s3://nextstrain-data) have changed. We add redirects
* for the old URLs to point to the new GISAID URLs.
* The timestamped URLs (e.g. /ncov/:region/YYYY-MM-DD) which currently exist
* will not be redirected, but new URLs will be of the format
* /ncov/gisaid/:region/YYYY-MM-DD.
*/
['/ncov/:region(global|asia|oceania|north-america|south-america|europe|africa)', '/ncov/gisaid/:region'],
/**
* We shifted from using 'monkeypox' to 'mpox', as per WHO naming
* recommendations. Note that monkeypox YYYY-MM-DD URLs remain,
* e.g. /monkeypox/hmpxv1/2022-09-04
*/
['/monkeypox/mpxv', '/mpox/all-clades'],
['/monkeypox/hmpxv1', '/mpox/clade-IIb'],
['/monkeypox/hmpxv1/big', '/mpox/lineage-B.1'],
["/monkeypox", "/mpox"],
/**
* We shifted avian-flu and seasonal-flu to top level to stop nesting them
* both under flu/ so that avian-flu is more discoverable
*/
['/flu/avian(.*)', '/avian-flu(.*)'],
['/flu/seasonal(.*)', '/seasonal-flu(.*)'],
/**
* Redirect /flu URL itself. We choose seasonal-flu to mimic historical behaviour
*/
['/flu', '/seasonal-flu'],
].map(([originalPattern, redirectPattern]) => [
match(`${originalPattern}${versionDatasetPattern}`),
compile(`${redirectPattern}${versionDatasetPattern}`)
]);

are implemented via a /* intercepting route

/* handle redirects for dataset paths which have changed name & preserve any queries */
/* We do route matching within `updateDatasetUrl`, so just capture all routes */
app.route("/*").get((req, res, next) => {
const urlPath = url.parse(req.url).pathname;
const newUrlPath = updateDatasetUrl(urlPath);
if (newUrlPath !== urlPath) {
return res.redirect(url.format({pathname: newUrlPath, query: req.query}))
}
return next('route')
});

and via the indexer

/**
* Process server-hardcoded dataset redirects before the defaults, as the
* path we redirect to may itself be further redirected by the defaults.
* e.g. redirects monkeypox -> mpox, core routing redirects mpox -> mpox/clade-IIb
* Requires a leading slash to match the paths expected in server.
*/
const serverRedirectedUrlPath = _removeSlashes(updateDatasetUrl(`/${urlPath}`))
return resolvedPath[serverRedirectedUrlPath] || serverRedirectedUrlPath

which both make calls to updateDatasetUrl()

/**
* Checks if the provided URL pathname matches are any of our datasetRedirectPatterns
* If the there is match, then return the new URL for the dataset.
* If there is no match, then return the original URL pathname.
*
* @param {string} originalUrlPathname
* @returns {string}
*/
function updateDatasetUrl(originalUrlPathname) {
for (const [urlMatch, toPath] of datasetRedirectPatterns) {
const matchingURL = urlMatch(originalUrlPathname)
if (matchingURL) {
return toPath({...matchingURL.params})
}
}
return originalUrlPathname
}

Fixing Charon by continuing down those paths could include sprinkling more updateDatasetUrl() calls in just the right place(s) with maybe (or maybe not) another catch-all route (e.g. /charon/* which inspects prefix) to parallel the existing /* route.

But stepping back, I think the more systematic and sustainable thing to do would be to make the dataset-rename redirections go thru the existing canonicalization layer that's currently used (solely) for default dataset handling (e.g. /seasonal-flu maps to /seasonal-flu/h3n2/ha/2y). This would truly centralize handling changes to dataset names and not rely on overlay routes like /*.

The indexer's remapCoreUrl() already points at these two alike things—the manifest and dataset renames—that yet are treated separately in the codebase:

export const remapCoreUrl = (urlPath) => {
if (!resolvedPath) {
/* map only computed the first time this function is run */
const serverManifest = JSON.parse(readFileSync(path.join(__basedir, "data", "manifest_core.json")));
resolvedPath = resolveAll(
convertManifestJsonToAvailableDatasetList(serverManifest).defaults
)
}
/**
* Process server-hardcoded dataset redirects before the defaults, as the
* path we redirect to may itself be further redirected by the defaults.
* e.g. redirects monkeypox -> mpox, core routing redirects mpox -> mpox/clade-IIb
* Requires a leading slash to match the paths expected in server.
*/
const serverRedirectedUrlPath = _removeSlashes(updateDatasetUrl(`/${urlPath}`))
return resolvedPath[serverRedirectedUrlPath] || serverRedirectedUrlPath
}

(From an implementation perspective, I'd think ideally this would look like the indexer going thru Dataset.resolve() too, but I don't know how feasible that is given current indexer design.)

@jameshadfield
Copy link
Member

jameshadfield commented May 17, 2024

But stepping back, I think the more systematic and sustainable thing to do would be to make the dataset-rename redirections go thru the existing canonicalization layer that's currently used (solely) for default dataset handling (e.g. /seasonal-flu maps to /seasonal-flu/h3n2/ha/2y). This would truly centralize handling changes to dataset names and not rely on overlay routes like /*.

Yeah, this would be clarifying (for the indexer as well).

charon/getAvailable requests are never made for a narrative, so from Auspice's perspective it's ok that those routes don't redirect to the canonical name. (Problematic for other reasons, but that's the current implementation.)

Stepping back (me this time), maintaining two APIs (rest + charon) will always be susceptible to these kinds of bugs. We can change auspice to use the restful API, or change the server to rewrite charon calls into restful ones (not straightforward if you end up sending redirects down the line). Staying with two parallel APIs - even if they use a lot of the same methods - seems non ideal. As mentioned above, this bug has probably existed from the day global dataset-name redirects were added.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

2 participants