Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Move croquemort features from udata to udata-croquemort v2 #1110

Merged

Conversation

abulte
Copy link
Contributor

@abulte abulte commented Sep 7, 2017

Replace #1099

Depends on opendatateam/udata-croquemort#2 (comment) — see there for more info.

@abulte
Copy link
Contributor Author

abulte commented Sep 13, 2017

@noirbizarre I'd be happy to have your feedback at this point, especially about the XXXs in the code. Once I know how to best organize the CheckDatasetResource.get logic, I should be able to try and test it.

@abulte
Copy link
Contributor Author

abulte commented Sep 13, 2017

@noirbizarre proposed refactor (cf previous comment) in 63f9fb8 and tests added.

@abulte abulte changed the title [WIP] Move croquemort features from udata to udata-croquemort v2 Move croquemort features from udata to udata-croquemort v2 Sep 13, 2017
@abulte abulte requested a review from a team September 13, 2017 15:46

log = logging.getLogger(__name__)

DEFAULT_LINKCHECKER = 'croquemort'
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that croquemort should not be referenced here but comes from udata-croquemort's settings.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe introduce a DEFAULT_LINKCHECKER configuration variable with no_check by default and has to be set to croquemort when you install udata-croquemort?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or something like LINK_CHECKER = 'path.to.DummyChecker' by default, I kinda like the django-like Python path in settings.

Copy link
Contributor Author

@abulte abulte Sep 14, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I pushed something more coherent IMHO. no_check is now an entrypoint provided linkchecker in udata (there is your django-like path ;-)). LINKCHECKING_DEFAULT_LINKCHECKER is by default the key to this linkchecker. It will be replaced by croquemort if you install udata-croquemort and want it to be the default one (or whatever linkchecker you want).

setup.py Outdated
@@ -132,6 +132,9 @@ def pip(filename):
'ods = udata.harvest.backends.ods:OdsHarvester',
'ckan = udata.harvest.backends.ckan:CkanBackend',
'dcat = udata.harvest.backends.dcat:DcatBackend',
],
'udata.linkcheckers': [
'no_check = udata.linkchecker.backends:NoCheckLinchecker',
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(missing k)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🙇

def get(self, dataset, rid):
'''Checks that a resource's URL exists and returns metadata.'''
resource = self.get_resource_or_404(dataset, rid)
result, status = check_resource(resource)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

return check_resource ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@@ -366,21 +349,13 @@ def check_availability(self):
Return a list of booleans.
"""
# Only check remote resources.
# XXX is this enough? from the frontend we check every type
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AFAIK, it is only in use within the admin to get global stats. To be checked.

ENTRYPOINT = 'udata.linkcheckers'


class NoCheckLinchecker(object):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NoCheckLinkChecker (missing k)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🙇


def get(name):
'''Get a linkchecker given its name or fallback on default'''
all_lc = get_all()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not fond of shortened variables names but YMMV

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, linkcheckers or link_checkers might be more explicit

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

# store the check result in the resource's extras
resource.extras.update(_get_check_keys(result))
resource.save()
return result, result.get('check:status')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is that pertinent to keep the status in result dict?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it can be more straightforward on frontend.

the `resource.extras['check:checker']` attribute with a key that points
to a valid `udata.linkcheckers` entrypoint. If not set, it will
fallback on croquemort. If set to `no_check` it will assume the resource
is available via a dummy linkchecker.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a line about the fact that it returns a flask-like response (dict, status)

LINKCHECKING_ENABLED = True
LINKCHECKING_IGNORE_DOMAINS = []
LINKCHECKING_CACHE_DURATION = 60 * 5 # in seconds
LINKCHECKING_DEFAULT_LINKCHECKER = 'no_check'
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't get why we don't put the path here too? udata.linkchecker.backends:NoCheckLinchecker

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the endpoint name

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are these settings documented somewhere?

@@ -184,6 +187,7 @@ class Defaults(object):
# 'RESOURCE_DOWNLOAD': , # Demo = 5, Prod = ?
# 'RESOURCE_REDIRECT': , # Demo = 6, Prod = ?
# }
# TODO
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TODO?

def get(self, dataset, rid):
'''Checks that a resource's URL exists and returns metadata.'''
resource = self.get_resource_or_404(dataset, rid)
result, status = check_resource(resource)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

return self.check_availability(group=None)
Non checked resources are presumed available.
'''
return self.extras.get('check:available', True)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Available by default ? Why not the opposite ? Or 'unknown' by default ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is used by dataset.quality.has_unavailable_resources. I think we do not want to flag some datasets as "bad quality" just because we did not check their resources. (It was already done like that).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nop, but I think we can safely set it to unknown (I saw this possibility in this PR). This is more accurate I think


def get(name):
'''Get a linkchecker given its name or fallback on default'''
all_lc = get_all()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, linkcheckers or link_checkers might be more explicit

lc = all_lc.get(name)
if not lc:
default_lc = current_app.config.get(
'LINKCHECKING_DEFAULT_LINKCHECKER')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LINKCHECKER_DEFAULT or DEFAULT_LINKCHECKER ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it has to start by LINKCHECKING_ to stay coherent with the other settings. And I used the LINKCHECKING_ prefix and not LINKCHECKER_ because LINKCHECKER_ENABLED does not really make sense since we can have multiple link checkers, LINKCHECKING_ENABLED is better IMHO.

)


def _ep_to_kv(entrypoint):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cached_check = get_cache(resource)
if cached_check:
return cached_check
linkchecker_type = resource.extras.get('check:checker', 'croquemort')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The default value should be configurable, especially if the croquemort implementation is in an optionnal extension.
Other alternative: handle the fact that there may not be a default link checker (and provide an unkown status)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right. And this is useless since I implemented to no_check fallback and LINKCHECKING_DEFAULT_LINKCHECKER.

elif result.get('check:error'):
return {'error': result['check:error']}, 500
elif not result.get('check:status'):
return {'error': 'No status in response from linkchecker'}, 503
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

error or unkown ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same NB as above.

LINKCHECKING_ENABLED = True
LINKCHECKING_IGNORE_DOMAINS = []
LINKCHECKING_CACHE_DURATION = 60 * 5 # in seconds
LINKCHECKING_DEFAULT_LINKCHECKER = 'no_check'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the endpoint name

LINKCHECKING_CACHE_DURATION = 60 * 5 # in seconds
LINKCHECKING_DEFAULT_LINKCHECKER = 'no_check'

# `udata-croquemort` settings
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can drop these lines I think

@@ -46,8 +46,8 @@
/>
{% endif %}

<meta name="check-urls" content="{{ 'true' if config.CROQUEMORT else 'false' }}" />
<meta name="check-urls-ignore" content="{{ config.CROQUEMORT_IGNORE|tojson|urlencode }}" />
<meta name="check-urls" content="{{ 'true' if config.LINKCHECKING_ENABLED else 'false' }}" />
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or <meta name="check-urls" content="{{ config.LINKCHECKING_ENABLED|to_json }}" />
Works for other booleans.

Copy link
Contributor

@noirbizarre noirbizarre left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice start

}
});
})
.catch(error => console.log('Something went wrong with the linkchecker', error));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't we want to put a format-label-unchecked or something like that in this case?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, could come in handy 👍

@property
def is_available(self):
return self.check_availability(group=None)
NB: `unknown` will evaluate to True in the aggregate checks using
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tricky

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It kind of is :-/ I hesitated with None but the advantage of this solution is that we don't need to modify the aggregate checks (all([])).

@@ -135,11 +135,13 @@ def resources_availability(self):
*[org.check_availability() for org in self.organizations]
)
)
# Filter out the unknown
availabilities = [a for a in availabilities if type(a) is bool]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not using isinstance?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like that when checking for a basic type because it reads as en English sentence.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

LINKCHECKING_ENABLED = True
LINKCHECKING_IGNORE_DOMAINS = []
LINKCHECKING_CACHE_DURATION = 60 * 5 # in seconds
LINKCHECKING_DEFAULT_LINKCHECKER = 'no_check'
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are these settings documented somewhere?

@abulte
Copy link
Contributor Author

abulte commented Sep 25, 2017

@davidbgk I can't reply on your comment on udata/settings.py (don't know why): nope they aren't documented, I have to do it 👍

@@ -135,11 +135,13 @@ def resources_availability(self):
*[org.check_availability() for org in self.organizations]
)
)
# Filter out the unknown
availabilities = [a for a in availabilities if type(a) is bool]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@abulte abulte changed the base branch from dev to master September 26, 2017 08:27
@abulte
Copy link
Contributor Author

abulte commented Sep 26, 2017

Rebased on master.

@abulte
Copy link
Contributor Author

abulte commented Sep 27, 2017

@davidbgk is this OK for you?

Copy link
Member

@davidbgk davidbgk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 Good work!

this.$api.get(checkurl)
.then((res) => {
const status = res['check:status'];
if (status >= 200 && status < 400) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thoughts for future: deal with redirection :)

if availabilities:
# Trick will work because it's a sum() of booleans.
return round(100. * sum(availabilities) / len(availabilities), 2)
else:
return 0
return 100
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe explain why?

@abulte abulte force-pushed the feature/udata-croquemort-v2 branch 5 times, most recently from 9547d04 to cbbed63 Compare October 4, 2017 11:40
@abulte abulte changed the base branch from master to feature/linkchecker October 4, 2017 11:40
@abulte abulte force-pushed the feature/udata-croquemort-v2 branch from cbbed63 to 959b57f Compare October 6, 2017 09:40
@abulte abulte force-pushed the feature/udata-croquemort-v2 branch from 959b57f to 986cb50 Compare October 18, 2017 08:21
@abulte abulte force-pushed the feature/udata-croquemort-v2 branch from 986cb50 to e720d02 Compare October 18, 2017 12:12
Croquemort specific stuff is removed from udata and will live in udata-croquemort.
Linkchecking can now be specified on a resource level.
Related change: default availability for a user's datasets is now 100% (vs 0%).
@abulte abulte force-pushed the feature/udata-croquemort-v2 branch from e720d02 to 040f4a9 Compare October 18, 2017 12:28
@abulte abulte merged commit ce56a0b into opendatateam:feature/linkchecker Oct 18, 2017
@abulte abulte deleted the feature/udata-croquemort-v2 branch October 18, 2017 12:34
abulte added a commit that referenced this pull request Oct 18, 2017
Croquemort specific stuff is removed from udata and will live in udata-croquemort.
Linkchecking can now be specified on a resource level.
Related change: default availability for a user's datasets is now 100% (vs 0%).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants