Move croquemort features from udata to udata-croquemort v2 #1110
2c25948 to 6acd50b
@noirbizarre I'd be happy to have your feedback at this point, especially about the |
@noirbizarre proposed refactor (cf previous comment) in 63f9fb8 and tests added.
udata/linkchecker/backends.py
Outdated
log = logging.getLogger(__name__)

DEFAULT_LINKCHECKER = 'croquemort'
I think that croquemort should not be referenced here but comes from udata-croquemort's settings.
Maybe introduce a `DEFAULT_LINKCHECKER` configuration variable, `no_check` by default, that has to be set to `croquemort` when you install `udata-croquemort`?
Or something like `LINK_CHECKER = 'path.to.DummyChecker'` by default; I kinda like the Django-like Python path in settings.
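For readers unfamiliar with the Django-like approach mentioned here, a minimal sketch of resolving a dotted Python path from a setting could look like the following. The helper name `load_checker` is hypothetical, and a standard-library class stands in for a real checker, since `path.to.DummyChecker` is only a placeholder in the comment above.

```python
import importlib


def load_checker(dotted_path):
    """Import and return the object named by a 'package.module.Name' path."""
    module_path, _, attr_name = dotted_path.rpartition('.')
    if not module_path:
        raise ValueError('Expected a dotted path, got %r' % dotted_path)
    module = importlib.import_module(module_path)
    return getattr(module, attr_name)


# Hypothetical setting value: a stdlib class stands in for a real checker.
LINK_CHECKER = 'collections.OrderedDict'
checker_class = load_checker(LINK_CHECKER)
```

The trade-off discussed in this thread is that a dotted path needs no registration step, whereas a named entrypoint lets an extension advertise itself under a short key.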
I pushed something more coherent IMHO. `no_check` is now an entrypoint-provided linkchecker in udata (there is your django-like path ;-)). `LINKCHECKING_DEFAULT_LINKCHECKER` is by default the key to this linkchecker. It will be replaced by `croquemort` if you install `udata-croquemort` and want it to be the default one (or whatever linkchecker you want).
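The lookup described above (named linkchecker first, configured default otherwise) can be sketched roughly as follows. This is only an illustration under assumptions: the plain dicts `LINKCHECKERS` and `CONFIG` stand in for entrypoint discovery and `current_app.config`, and the `check` method and its return value are hypothetical.

```python
class NoCheckLinkChecker(object):
    """Dummy checker: assumes every resource is available (hypothetical API)."""

    def check(self, resource):
        return {'check:available': True}


# Stand-in for what entrypoint discovery would return: name -> class.
LINKCHECKERS = {'no_check': NoCheckLinkChecker}

# Stand-in for current_app.config.
CONFIG = {'LINKCHECKING_DEFAULT_LINKCHECKER': 'no_check'}


def get_linkchecker(name, linkcheckers=LINKCHECKERS, config=CONFIG):
    """Return the checker registered under `name`, else the configured default."""
    linkchecker = linkcheckers.get(name)
    if linkchecker is None:
        default_name = config.get('LINKCHECKING_DEFAULT_LINKCHECKER')
        linkchecker = linkcheckers.get(default_name)
    return linkchecker
```

With this shape, a resource asking for `croquemort` on an instance without `udata-croquemort` installed would fall back to `no_check` rather than fail.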
setup.py
Outdated
@@ -132,6 +132,9 @@ def pip(filename):
'ods = udata.harvest.backends.ods:OdsHarvester',
'ckan = udata.harvest.backends.ckan:CkanBackend',
'dcat = udata.harvest.backends.dcat:DcatBackend',
],
'udata.linkcheckers': [
'no_check = udata.linkchecker.backends:NoCheckLinchecker',
👍
(missing `k`)
🙇
udata/core/dataset/api.py
Outdated
def get(self, dataset, rid):
    '''Checks that a resource's URL exists and returns metadata.'''
    resource = self.get_resource_or_404(dataset, rid)
    result, status = check_resource(resource)
`return check_resource`?
👍
udata/core/dataset/models.py
Outdated
@@ -366,21 +349,13 @@ def check_availability(self):
Return a list of booleans.
"""
# Only check remote resources.
# XXX is this enough? from the frontend we check every type
AFAIK, it is only in use within the admin to get global stats. To be checked.
udata/linkchecker/backends.py
Outdated
ENTRYPOINT = 'udata.linkcheckers'


class NoCheckLinchecker(object):
`NoCheckLinkChecker` (missing `k`)
🙇
udata/linkchecker/backends.py
Outdated
def get(name):
    '''Get a linkchecker given its name or fallback on default'''
    all_lc = get_all()
Not fond of shortened variable names but YMMV
Yes, `linkcheckers` or `link_checkers` might be more explicit
👍
udata/linkchecker/checker.py
Outdated
# store the check result in the resource's extras
resource.extras.update(_get_check_keys(result))
resource.save()
return result, result.get('check:status')
Is it pertinent to keep the status in the result dict?
I think it makes things more straightforward on the frontend.
udata/linkchecker/checker.py
Outdated
the `resource.extras['check:checker']` attribute with a key that points
to a valid `udata.linkcheckers` entrypoint. If not set, it will
fallback on croquemort. If set to `no_check` it will assume the resource
is available via a dummy linkchecker.
Add a line about the fact that it returns a flask-like response (dict, status)
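To illustrate what a flask-like `(dict, status)` response means here, the sketch below maps a check result to such a tuple, using only the branches visible in the hunks quoted in this review. The function name `to_response` is hypothetical, and the real `check_resource` may handle more cases than shown.

```python
def to_response(result):
    """Map a check result dict to a Flask-like (body, status) tuple."""
    if result.get('check:error'):
        # The linkchecker reported an error of its own.
        return {'error': result['check:error']}, 500
    if not result.get('check:status'):
        # The linkchecker answered, but without any status.
        return {'error': 'No status in response from linkchecker'}, 503
    # Nominal case: hand back the result and reuse its status code.
    return result, result['check:status']


body, status = to_response({'check:status': 200, 'check:available': True})
```

A Flask or Flask-RESTPlus view can return such a tuple directly, which is presumably why the API method can simply forward it.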
LINKCHECKING_ENABLED = True
LINKCHECKING_IGNORE_DOMAINS = []
LINKCHECKING_CACHE_DURATION = 60 * 5  # in seconds
LINKCHECKING_DEFAULT_LINKCHECKER = 'no_check'
I don't get why we don't put the path here too? udata.linkchecker.backends:NoCheckLinchecker
This is the entrypoint name
Are these settings documented somewhere?
udata/settings.py
Outdated
@@ -184,6 +187,7 @@ class Defaults(object):
# 'RESOURCE_DOWNLOAD': , # Demo = 5, Prod = ?
# 'RESOURCE_REDIRECT': , # Demo = 6, Prod = ?
# }
# TODO
TODO?
udata/core/dataset/models.py
Outdated
return self.check_availability(group=None)
Non checked resources are presumed available.
'''
return self.extras.get('check:available', True)
Available by default? Why not the opposite? Or 'unknown' by default?
This is used by `dataset.quality.has_unavailable_resources`. I think we do not want to flag some datasets as "bad quality" just because we did not check their resources. (It was already done like that).
Nope, but I think we can safely set it to `unknown` (I saw this possibility in this PR). This is more accurate I think.
udata/linkchecker/backends.py
Outdated
lc = all_lc.get(name)
if not lc:
    default_lc = current_app.config.get(
        'LINKCHECKING_DEFAULT_LINKCHECKER')
`LINKCHECKER_DEFAULT` or `DEFAULT_LINKCHECKER`?
I think it has to start with `LINKCHECKING_` to stay coherent with the other settings. And I used the `LINKCHECKING_` prefix and not `LINKCHECKER_` because `LINKCHECKER_ENABLED` does not really make sense since we can have multiple link checkers; `LINKCHECKING_ENABLED` is better IMHO.
udata/linkchecker/backends.py
Outdated
)


def _ep_to_kv(entrypoint):
Duplicate of https://github.com/opendatateam/udata/blob/master/udata/harvest/backends/__init__.py#L17-L26
Should be factorized
udata/linkchecker/checker.py
Outdated
cached_check = get_cache(resource)
if cached_check:
    return cached_check
linkchecker_type = resource.extras.get('check:checker', 'croquemort')
The default value should be configurable, especially if the `croquemort` implementation is in an optional extension.
Other alternative: handle the fact that there may not be a default link checker (and provide an unknown status)
You're right. And this is useless since I implemented the `no_check` fallback and `LINKCHECKING_DEFAULT_LINKCHECKER`.
elif result.get('check:error'):
    return {'error': result['check:error']}, 500
elif not result.get('check:status'):
    return {'error': 'No status in response from linkchecker'}, 503
Error or unknown?
Same NB as above.
udata/settings.py
Outdated
LINKCHECKING_CACHE_DURATION = 60 * 5  # in seconds
LINKCHECKING_DEFAULT_LINKCHECKER = 'no_check'

# `udata-croquemort` settings
We can drop these lines I think
udata/templates/macros/metadata.html
Outdated
@@ -46,8 +46,8 @@
/>
{% endif %}

<meta name="check-urls" content="{{ 'true' if config.CROQUEMORT else 'false' }}" />
<meta name="check-urls-ignore" content="{{ config.CROQUEMORT_IGNORE|tojson|urlencode }}" />
<meta name="check-urls" content="{{ 'true' if config.LINKCHECKING_ENABLED else 'false' }}" />
Or <meta name="check-urls" content="{{ config.LINKCHECKING_ENABLED|to_json }}" />
Works for other booleans.
Nice start
js/front/dataset/index.js
Outdated
}
});
})
.catch(error => console.log('Something went wrong with the linkchecker', error));
Don't we want to put a `format-label-unchecked` or something like that in this case?
Yes, could come in handy 👍
@property
def is_available(self):
return self.check_availability(group=None)
NB: `unknown` will evaluate to True in the aggregate checks using
Tricky
It kind of is :-/ I hesitated with `None` but the advantage of this solution is that we don't need to modify the aggregate checks (`all([])`).
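The trick under discussion relies on Python truthiness: a non-empty string such as `'unknown'` is truthy, while `None` is falsy. A tiny sketch, with a hypothetical helper name, shows why the `all()`-based aggregates keep working unchanged:

```python
def aggregate_available(availabilities):
    """Aggregate per-resource availability with all(), as in the thread."""
    return all(availabilities)


with_unknown = aggregate_available([True, 'unknown'])  # 'unknown' is truthy
with_none = aggregate_available([True, None])          # None is falsy
with_nothing = aggregate_available([])                 # vacuously true
```

Here `with_unknown` and `with_nothing` are `True` while `with_none` is `False`, so picking `None` as the marker would have required touching every aggregate.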
@@ -135,11 +135,13 @@ def resources_availability(self):
*[org.check_availability() for org in self.organizations]
)
)
# Filter out the unknown
availabilities = [a for a in availabilities if type(a) is bool]
Why not use `isinstance`?
I like that when checking for a basic type because it reads as an English sentence.
👍
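As a side note on the two spellings, a quick sketch suggests they behave the same for this filter: `bool` cannot be subclassed in Python, so `type(a) is bool` and `isinstance(a, bool)` accept exactly the same values. The function names and sample list below are made up for illustration.

```python
def known_by_type(availabilities):
    """Keep only real booleans, spelled as in the diff."""
    return [a for a in availabilities if type(a) is bool]


def known_by_isinstance(availabilities):
    """Same filter, spelled with isinstance."""
    return [a for a in availabilities if isinstance(a, bool)]


# Sample data: booleans mixed with the 'unknown' marker and other values.
mixed = [True, 'unknown', False, 1, None]
by_type = known_by_type(mixed)
by_isinstance = known_by_isinstance(mixed)
```

Both lists come out as `[True, False]`; the integer `1` is dropped either way because `bool` is a subclass of `int`, not the other way around.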
@davidbgk I can't reply on your comment on |
@@ -135,11 +135,13 @@ def resources_availability(self):
*[org.check_availability() for org in self.organizations]
)
)
# Filter out the unknown
availabilities = [a for a in availabilities if type(a) is bool]
👍
8d9b8dd to 5e386ac
Rebased on |
5e386ac to 35eb5a4
@davidbgk is this OK for you?
👍 Good work!
this.$api.get(checkurl)
    .then((res) => {
        const status = res['check:status'];
        if (status >= 200 && status < 400) {
Thoughts for future: deal with redirection :)
udata/core/user/models.py
Outdated
if availabilities:
# Trick will work because it's a sum() of booleans.
return round(100. * sum(availabilities) / len(availabilities), 2)
else:
return 0
return 100
Maybe explain why?
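For context, here is a standalone sketch of the computation in this hunk, combined with the "filter out the unknown" step quoted earlier in the review. The function name is hypothetical. The 100 default presumably follows the same reasoning as for dataset quality: resources that were never checked should not count against their owner.

```python
def availability_percentage(availabilities):
    """Percentage of available resources among those actually checked."""
    # Drop 'unknown' (and any other non-boolean) entries first.
    known = [a for a in availabilities if type(a) is bool]
    if known:
        # sum() of booleans counts the True values.
        return round(100. * sum(known) / len(known), 2)
    # Nothing was checked: presume everything is available.
    return 100


half = availability_percentage([True, False, 'unknown'])
unchecked = availability_percentage(['unknown', 'unknown'])
```

In this sketch `half` is `50.0` and `unchecked` is `100`.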
9547d04 to cbbed63
cbbed63 to 959b57f
959b57f to 986cb50
986cb50 to e720d02
Croquemort specific stuff is removed from udata and will live in udata-croquemort. Linkchecking can now be specified on a resource level. Related change: default availability for a user's datasets is now 100% (vs 0%).
e720d02 to 040f4a9
Replace #1099
Depends on opendatateam/udata-croquemort#2 (comment) — see there for more info.