Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[MRG] Add CKAN content provider #1336

Merged
merged 10 commits into from
Jun 28, 2024
Merged

[MRG] Add CKAN content provider #1336

merged 10 commits into from
Jun 28, 2024

Conversation

u10313335
Copy link
Contributor

This PR add support for CKAN datasets as content provider for repo2docker, e.g.:

repo2docker https://demo.ckan.org/dataset/sample-dataset-1

It has been tested on some CKAN repositories such as:

We also adopt this PR in our Binder service (as the CKAN dataset repository): https://binder.depositar.io/, e.g.:

Copy link

welcome bot commented Feb 20, 2024

Thanks for submitting your first pull request! You are awesome! 🤗

If you haven't done so already, check out Jupyter's Code of Conduct.
welcome
You can meet the other Jovyans by joining our Discourse forum. There is also a intro thread there where you can stop by and say Hi! 👋

Welcome to the Jupyter community! 🎉

@u10313335 u10313335 changed the title [WIP] Add CKAN content provider [MRG] Add CKAN content provider Feb 20, 2024
Copy link
Collaborator

@yuvipanda yuvipanda left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apologies for the delay in reviewing! Thank you for the PR :) I've left a couple of style comments that should be fairly easy fixes.

The one question for me is wether the unit should be 'dataset' or 'dataset at a particular version'. This is illustrated by the current unit test failing because the version of the package has changed. The equivalent of sorts in git land is 'git repo url + hash', where the hash is counted as the 'ref'. If no 'ref' is provided, we default to using the latest (or whatever HEAD) points to. Does CKAN need something similar?

repo2docker/contentproviders/ckan.py Outdated Show resolved Hide resolved
tests/unit/contentproviders/test_ckan.py Show resolved Hide resolved
tests/unit/contentproviders/test_ckan.py Outdated Show resolved Hide resolved
@u10313335
Copy link
Contributor Author

The one question for me is wether the unit should be 'dataset' or 'dataset at a particular version'. This is illustrated by the current unit test failing because the version of the package has changed. The equivalent of sorts in git land is 'git repo url + hash', where the hash is counted as the 'ref'. If no 'ref' is provided, we default to using the latest (or whatever HEAD) points to. Does CKAN need something similar?

Thank you @yuvipanda for your review! As a newcomer to Project Jupyter, it really encouraged me.

Regard to the question referring to the unit, there are indeed "activities" that work like the versions for datasets. For example:

https://demo.ckan.org/dataset/activity/sample-dataset-1

However, activities are often unavailable as they can be hidden by admins. In fact, the above demo site of CKAN also restricts the access to activities. Another example is the data.gov.

Therefore, from my point of view, it's much safer to take the 'dataset' rather than the 'dataset at a particular version' as the unit for the CKAN provider.

@manics
Copy link
Member

manics commented Apr 1, 2024

However, activities are often unavailable as they can be hidden by admins. In fact, the above demo site of CKAN also restricts the access to activities. Another example is the data.gov.

Does this just disable listing previous versions, or does it completely prevent access to previous versions? Do you know why admins chose to prevent this, since they're reducing the utility of their CKAN repository as a reproducible data source?

@u10313335
Copy link
Contributor Author

However, activities are often unavailable as they can be hidden by admins. In fact, the above demo site of CKAN also restricts the access to activities. Another example is the data.gov.

Does this just disable listing previous versions, or does it completely prevent access to previous versions? Do you know why admins chose to prevent this, since they're reducing the utility of their CKAN repository as a reproducible data source?

When the ckan.auth.public_activity_stream_detail is set to True, it will completely block both list and access to previous versions for the public (without login). The reason was explained in the changelog:

A full history of dataset changes is now displayed in the Activity Stream to admins, and optionally to the public. By default this is enabled for new installs, but disabled for sites which upgrade (just in case the history is sensitive).

AFAIK, most of sites were upgraded from the old versions of CKAN.

Or I think we could still support the versions (urls with activity_id), e.g.:

https://demo.ckan.org/dataset/sample-dataset-1?activity_id=0da6b0a5-9fe5-4c34-9545-9079903a8449

And go to the latest one when the version is missing, e.g.:

https://demo.ckan.org/dataset/sample-dataset-1

@manics
Copy link
Member

manics commented Apr 9, 2024

I think we could still support the versions (urls with activity_id), e.g.:
https://demo.ckan.org/dataset/sample-dataset-1?activity_id=0da6b0a5-9fe5-4c34-9545-9079903a8449
And go to the latest one when the version is missing

That sounds reasonable to me- support a link to a fixed version where possible, fallback to the latest if not. Are there any disadvantages to doing this?

@u10313335
Copy link
Contributor Author

That sounds reasonable to me- support a link to a fixed version where possible, fallback to the latest if not. Are there any disadvantages to doing this?

I don't see any disadvantages in this moment. I will try to handle this in the following days.

@u10313335
Copy link
Contributor Author

u10313335 commented Apr 19, 2024

The content provider has been updated to support the versions ("activities" in CKAN) and fallback to the latest. There are two types of URL patterns to the activities:

  1. https://SITE_URL/dataset/DATASET_NAME?activity_id=ACTIVITY_ID (for CKAN version 2.9), e.g., https://demo.ckan.org/dataset/sample-dataset-1?activity_id=0da6b0a5-9fe5-4c34-9545-9079903a8449.

  2. https://SITE_URL/dataset/DATASET_NAME/history/ACTIVITY_ID (for CKAN version 2.10). I can't find any CKAN 2.10 sites which opens up access to activities for the public (we are still using the CKAN 2.9...) — sorry about that. But it has been tested with an internal CKAN 2.10 site.

Also note that the metadata_modified timestamp is still used to determine the "versions" in the repo2docker since the metadata_modified is available across all CKAN versions.

@u10313335
Copy link
Contributor Author

Hi @yuvipanda @manics, if there is any need for further improvement, please let me know. Thanks!

@yuvipanda
Copy link
Collaborator

@u10313335 apologies for the delayed response. I think this is good to go, but I wanted to cleanup some of the URL handling a little bit. I can't seem to push to your branch, nor make a PR to your branch for some reason, so I've pushed my changes to https://github.com/yuvipanda/repo2docker/tree/ckan. Also here's a patch file, which can be applied via git am if needed:

From 49613412583def1330cd8b7d6e42664728367945 Mon Sep 17 00:00:00 2001
From: YuviPanda <yuvipanda@gmail.com>
Date: Mon, 10 Jun 2024 15:49:40 -0700
Subject: [PATCH 1/2] Use urlencode to construct query strings

---
 repo2docker/contentproviders/ckan.py | 7 +++----
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/repo2docker/contentproviders/ckan.py b/repo2docker/contentproviders/ckan.py
index 9131b01..c1e31a5 100644
--- a/repo2docker/contentproviders/ckan.py
+++ b/repo2docker/contentproviders/ckan.py
@@ -1,6 +1,6 @@
 from datetime import datetime, timedelta, timezone
 from os import path
-from urllib.parse import parse_qs, urlparse
+from urllib.parse import parse_qs, urlparse, urlencode
 
 from requests import Session
 
@@ -86,11 +86,10 @@ class CKAN(ContentProvider):
         # handle the activites
         if activity_id:
             fetch_url = (
-                f"{spec['api_url']}activity_data_show?"
-                f"id={activity_id}&object_type=package"
+                f"{spec['api_url']}activity_data_show?" + urlencode({"id": activity_id, "object_type": "package"})
             )
         else:
-            fetch_url = f"{spec['api_url']}package_show?id={dataset_id}"
+            fetch_url = f"{spec['api_url']}package_show?" + urlencode({"id": dataset_id})
 
         resp = self.urlopen(
             fetch_url,
-- 
2.43.2


From 45fb7b908527630370b4ff1a99c89c5afafea2dc Mon Sep 17 00:00:00 2001
From: YuviPanda <yuvipanda@gmail.com>
Date: Mon, 10 Jun 2024 17:13:22 -0700
Subject: [PATCH 2/2] Cleanup URL parsing mechanisms

---
 repo2docker/contentproviders/ckan.py | 26 +++++++++++++++-----------
 1 file changed, 15 insertions(+), 11 deletions(-)

diff --git a/repo2docker/contentproviders/ckan.py b/repo2docker/contentproviders/ckan.py
index c1e31a5..9c74afd 100644
--- a/repo2docker/contentproviders/ckan.py
+++ b/repo2docker/contentproviders/ckan.py
@@ -43,28 +43,32 @@ class CKAN(ContentProvider):
         if not parsed_url.netloc:
             return None
 
-        url_parts_1 = parsed_url.path.split("/history/")
-        url_parts_2 = url_parts_1[0].split("/")
-        if url_parts_2[-2] == "dataset":
-            self.dataset_id = url_parts_2[-1]
-        else:
+        if "/dataset/" not in parsed_url.path:
+            # Not actually a dataset
             return None
 
-        api_url_path = "/api/3/action/"
+        # CKAN may be under a URL prefix, and we should accomodate that
+        url_prefix, dataset_url = parsed_url.path.split("/dataset/")
+
+        dataset_url_parts = dataset_url.split("/")
+        self.dataset_id = dataset_url_parts[0]
+
         api_url = parsed_url._replace(
-            path="/".join(url_parts_2[:-2]) + api_url_path, query=""
+            path=f"{url_prefix}/api/3/action/", query=""
         ).geturl()
 
         status_show_url = f"{api_url}status_show"
         resp = self.urlopen(status_show_url)
         if resp.status_code == 200:
 
-            # handle the activites
+            # Activity ID may be present either as a query parameter, activity_id
+            # or as part of the URL, under `/history/<activity-id>`. If `/history/`
+            # is present, that takes precedence over `activity_id`
             activity_id = None
-            if parse_qs(parsed_url.query).get("activity_id") is not None:
+            if "history" in dataset_url_parts:
+                activity_id = dataset_url_parts[dataset_url_parts.index("history") + 1]
+            elif parse_qs(parsed_url.query).get("activity_id") is not None:
                 activity_id = parse_qs(parsed_url.query).get("activity_id")[0]
-            if len(url_parts_1) == 2:
-                activity_id = url_parts_1[-1]
 
             self.version = self._fetch_version(api_url)
             return {
-- 
2.43.2

If these changes look good to you, apply those and I'll merge it? If not you can also select the 'allow edits from maintainers' button and I'll push these in too.

@u10313335
Copy link
Contributor Author

Thanks @yuvipanda again for the reviews and suggestions! I have applied your commits.

@yuvipanda
Copy link
Collaborator

I've somehow missed your comment - sorry about that, @u10313335.

Tests are now unfortunately failing due to #1354. I'm running all these tests locally - if they pass, I'll merge this PR (as it's already been waiting so long!) before figuring out how to fix that.

Thank you for your patience.

@yuvipanda
Copy link
Collaborator

I was able to successfully run these tests locally, so I'm going to merge this one. I'll try to come back to the associated binderhub PR soon as well.

Thank you so much for your patience as we worked through this, @u10313335! Hope to continue to see PRs from you, and I think our reviewing speeds should be better now.

@yuvipanda yuvipanda merged commit 9f15678 into jupyterhub:main Jun 28, 2024
8 of 18 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants