Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Surface previous dataset versions across URL redirects #783

Merged
merged 5 commits into from
Jan 21, 2024

Conversation

jameshadfield
Copy link
Member

@jameshadfield jameshadfield commented Jan 17, 2024

Closes #777. See that issue or the commit messages for full details.

Example URLs which are enabled via this work:

The earliest nCoV dataset we have is from 2020-01-19, and was back then accessed via /ncov. Now /ncov@2020-01-19 redirects to /ncov/gisaid/global/6m@2020-01-19 and loads that dataset.

We can access the /dengue/denv1 datasets prior to Jan 2024 via either /dengue/denv1@YYYY-MM-DD or /dengue/denv1/genome@YYYY-MM-DD, with the former redirecting to the latter.

Testing

Best done locally, via something like

mkdir -p devData
node resourceIndexer/main.js --output devData/index.json --save-inventories \
  --indent --resourceTypes dataset --collections core # needs AWS creds to access s3://nextstrain-inventories
RESOURCE_INDEX="devData/index.json" node server.js --verbose

The heroku review app will not correctly use this functionality as the live index has not been updated with the changes in this PR. If you want a non-local way to test this let me know and I'll make it happen. (You can test a subset of the functionality, e.g. /dengue/denv1@2024-01-03 works as expected.)

@nextstrain-bot nextstrain-bot temporarily deployed to nextstrain-s-james-reso-uw9ge9 January 17, 2024 22:37 Inactive
@jameshadfield jameshadfield requested a review from a team January 17, 2024 22:48
@tsibley tsibley self-requested a review January 18, 2024 21:52
src/endpoints/sources.js Outdated Show resolved Hide resolved
resourceIndexer/core-url-remapping.js Outdated Show resolved Hide resolved
resourceIndexer/core-url-remapping.js Outdated Show resolved Hide resolved
resourceIndexer/core-url-remapping.js Outdated Show resolved Hide resolved
resourceIndexer/core-url-remapping.js Outdated Show resolved Hide resolved
I've supported using local manifest+inventory files from the start as it
makes for a much nicer dev experience than fetching from S3 each time.
This adds the ability to easily save a local copy of the latest
inventory data from S3. This both makes getting started much easier and
provides an easy way to update your local data.
This is to allow the resource indexer to import
`convertManifestJsonToAvailableDatasetList` without the side-effects of
running code on import and setting `global.availableDatasets`.

No code changes, just extracting 2 functions into a new file.
following the approach the server uses to redirect URLs. For instance,
we change the resource name for "dengue/denv1" to "dengue/denv1/genome"
because the former URL is redirected to the latter.

This allows the server to access historical datasets for the URL
"dengue/denv1/genome" which now include all "dengue/denv1"
("dengue_denv1.json") datasets.

Due to the way the server currently works this will not allow historical
version access for "dengue/denv1" URLs because the instantiation of the
Dataset (Resource) instance happens before the resolve/redirecting
happens, and the "dengue/denv1" dataset is considered invalid because
it's not present in the index _and_ a version descriptor is being
provided.
The automated CI tests were using the correct index
The desired behaviour is that (e.g.) dengue/denv1@version redirects to
dengue/denv1/genome@version, similarly to how the non-versioned URL
currently behaves. This failed because of three inter-related aspects:

1. The previous commit removed (e.g.) "dengue/denv1" from the index,
   choosing instead to place that data under the "dengue/denv1/genome"
   ID.
2. The server instantiates the Resource (Dataset) class for the request
   before resolving/redirecting the URL. During this instantiation the
   requested version was checked against the index and since the
   resource wasn't present in the index this failed (NotFound error).
   This was solved by deferring the version lookup until after the URL
   resolving/redirecting stage.
3. The redirect URL didn't include the version descriptor.
@jameshadfield jameshadfield force-pushed the james/resource-index-url-remapping branch from f9d2000 to c88a87d Compare January 19, 2024 02:51
@nextstrain-bot nextstrain-bot temporarily deployed to nextstrain-s-james-reso-uw9ge9 January 19, 2024 02:51 Inactive
Comment on lines +327 to +338
/* test Charon API */
const charonUrl = (path, sidecar) =>
`${BASE_URL}/charon/getDataset?prefix=${path}${sidecar ? `&type=root-sequence` : ''}`;

/* note that the redirect URL of charon requests is constructed by creating
a URL object (the RESTful API does not) which results in '/' and '@' being
escaped in the query. This isn't necessary, but it's also fine to do so. */

it(`Charon main dataset: ${fromUrl} goes to ${toUrl} `, async () => {
const res = await fetch(charonUrl(fromUrl, false), {redirect: 'manual'});
expect(res.status).toEqual(307);
expect(decodeURIComponent(res.headers.get('location'))).toEqual(charonUrl(toUrl, false));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

non-blocking

These are only tests, so non-blocking, but this is the wrong way 'round in terms of URL encoding/decoding. The Location header is a URL, not a single URL component, so the more appropriate function is decodeURI not decodeURIComponent. Instead of decoding the Location URL at all though, the right way 'round fix is making charonUrl properly encode the URL it generates.

@jameshadfield jameshadfield merged commit 2ab0598 into master Jan 21, 2024
5 checks passed
@jameshadfield jameshadfield deleted the james/resource-index-url-remapping branch January 21, 2024 20:42
jameshadfield added a commit to nextstrain/docs.nextstrain.org that referenced this pull request Jan 22, 2024
jameshadfield added a commit that referenced this pull request Feb 1, 2024
Recent work surfaced previous dataset versions across URL redirects¹ by
mirroring the dataset-name resolution process we use for requests on the
server. However it neglected to consider the redirects which are handled
prior to this in the server. This commit adds that functionality as
well. This situation was recently discussed in slack².

To use a couple of examples representative of the redirects we use:
1. the dataset (URL) path "mpox/all-clades" now includes previous
   versions which were named "monkeypox_mpxv.json", thus extending the
   snapshot history of this dataset from 2023-09-23 to 2022-06-12.
2. the dataset (URL) path "ncov/gisaid/global/6m" (and paths which would
   be resolved to this, such as "ncov", "ncov/gisaid") contained periods
   without any snapshots because we were using the url "ncov/global".
   These datasets (versions of "ncov_global.json") are now included as
   snapshots for "ncov/gisaid/global/6m".

¹ <#783>
² <https://bedfordlab.slack.com/archives/CSKMU6YUC/p1706483980082939>
jameshadfield added a commit that referenced this pull request Feb 6, 2024
Recent work surfaced previous dataset versions across URL redirects¹ by
mirroring the dataset-name resolution process we use for requests on the
server. However it neglected to consider the redirects which are handled
prior to this in the server. This commit adds that functionality as
well. This situation was recently discussed in slack².

To use a couple of examples representative of the redirects we use:
1. the dataset (URL) path "mpox/all-clades" now includes previous
   versions which were named "monkeypox_mpxv.json", thus extending the
   snapshot history of this dataset from 2023-09-23 to 2022-06-12.
2. the dataset (URL) path "ncov/gisaid/global/6m" (and paths which would
   be resolved to this, such as "ncov", "ncov/gisaid") contained periods
   without any snapshots because we were using the url "ncov/global".
   These datasets (versions of "ncov_global.json") are now included as
   snapshots for "ncov/gisaid/global/6m".

¹ <#783>
² <https://bedfordlab.slack.com/archives/CSKMU6YUC/p1706483980082939>
jameshadfield added a commit that referenced this pull request Feb 6, 2024
This functionality should have been implemented as part of
<#783> but that PR
didn't consider the hardcoded redirects in `src/redirects.js`
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[resource index] use core manifest to update datasets whose URLs are redirected
3 participants