DCAT harvester refactoring #2096

Merged


@noirbizarre
Member

commented Apr 4, 2019

This PR refactors the DCAT harvesting to store only one graph for all datasets in a DCAT catalog.
The graph is stored in a factorized way (i.e. in JSON-LD with a @context) to massively reduce its size.
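
To illustrate why a shared @context shrinks the stored graph, here is a minimal standard-library sketch. It is not the harvester's code and not a full JSON-LD compaction algorithm: `CONTEXT`, `compact_iri`, `compact_node`, `factorize`, and the `example.org` dataset IRIs are all hypothetical names made up for the illustration.

```python
import json

# Hypothetical prefix map playing the role of a JSON-LD @context.
CONTEXT = {
    "dcat": "http://www.w3.org/ns/dcat#",
    "dct": "http://purl.org/dc/terms/",
}


def compact_iri(iri, context):
    """Replace a known namespace with its short prefix (e.g. dct:title)."""
    for prefix, namespace in context.items():
        if iri.startswith(namespace):
            return f"{prefix}:{iri[len(namespace):]}"
    return iri


def compact_node(node, context):
    """Shorten the property keys and @type values of one expanded node."""
    compacted = {}
    for key, value in node.items():
        if key == "@type":
            compacted[key] = [compact_iri(v, context) for v in value]
        elif key.startswith("@"):
            # Other keywords such as @id are kept untouched.
            compacted[key] = value
        else:
            compacted[compact_iri(key, context)] = value
    return compacted


def factorize(nodes, context):
    """Wrap all nodes in a single document that declares the context once."""
    return {
        "@context": context,
        "@graph": [compact_node(node, context) for node in nodes],
    }


# Made-up catalog of 100 datasets in expanded form (full IRIs everywhere).
nodes = [
    {
        "@id": f"http://example.org/datasets/{i}",
        "@type": ["http://www.w3.org/ns/dcat#Dataset"],
        "http://purl.org/dc/terms/title": [{"@value": f"Dataset {i}"}],
        "http://purl.org/dc/terms/description": [{"@value": "Sample"}],
    }
    for i in range(100)
]

expanded_size = len(json.dumps(nodes))
compact_size = len(json.dumps(factorize(nodes, CONTEXT)))
print(expanded_size, compact_size)
```

The namespaces are written once in the context instead of once per property, so the gain grows with the number of datasets in the catalog.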

These changes allow processing many more datasets in a single job. As each harvesting job stores all its items in a single MongoDB document, each job can store at most 16 MB (see MongoDB Limits).
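
As a rough sketch of that constraint, the check below estimates whether a job payload would fit under the 16 MB BSON document limit. It assumes the UTF-8 JSON size is an acceptable approximation of the BSON size, which is not exact. `estimated_size` and `fits_in_one_document` are hypothetical helpers, not part of udata or of any MongoDB driver.

```python
import json

# MongoDB's maximum BSON document size: 16 megabytes.
MAX_BSON_DOCUMENT_SIZE = 16 * 1024 * 1024


def estimated_size(document):
    """Approximate the stored size using the UTF-8 encoded JSON length."""
    return len(json.dumps(document).encode("utf-8"))


def fits_in_one_document(document, limit=MAX_BSON_DOCUMENT_SIZE):
    """Return True when the estimated size stays within the limit."""
    return estimated_size(document) <= limit


# Made-up job payload: one shared graph plus lightweight item references.
job = {"graph": "{}", "items": [{"remote_id": str(i)} for i in range(1000)]}
print(estimated_size(job), fits_in_one_document(job))
```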

This also prevents some data loss, because properly slicing a graph is difficult in RDF when you don't know all node properties in advance. This will allow parsing more triples for each dataset and resource.
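
One way to see the slicing difficulty, sketched with plain tuples rather than a real RDF library: selecting only the triples whose subject is the dataset misses everything hanging off nested blank nodes. The triples, the `ex:` identifiers, and the helpers `is_blank` and `slice_subgraph` are hypothetical and only serve the illustration.

```python
def is_blank(term):
    """Blank nodes are written with a '_:' prefix in this sketch."""
    return term.startswith("_:")


def slice_subgraph(triples, root):
    """Collect triples about root plus those reachable through blank nodes."""
    selected = []
    seen = set()
    pending = [root]
    while pending:
        subject = pending.pop()
        if subject in seen:
            continue
        seen.add(subject)
        for s, p, o in triples:
            if s == subject:
                selected.append((s, p, o))
                if is_blank(o):
                    # Follow nested blank nodes whatever the property is.
                    pending.append(o)
    return selected


# Made-up catalog fragment: a dataset with a nested distribution and license.
triples = [
    ("ex:dataset-1", "dct:title", '"Dataset 1"'),
    ("ex:dataset-1", "dcat:distribution", "_:b0"),
    ("_:b0", "dcat:accessURL", "ex:file.csv"),
    ("_:b0", "dct:license", "_:b1"),
    ("_:b1", "dct:title", '"Open license"'),
    ("ex:dataset-2", "dct:title", '"Dataset 2"'),
]

# Naive slice: only triples whose subject is the dataset itself.
naive = [t for t in triples if t[0] == "ex:dataset-1"]
full = slice_subgraph(triples, "ex:dataset-1")
print(len(naive), len(full))
```

Even this traversal only follows blank nodes. Related nodes identified by a URI would still be missed, which is one argument for keeping the whole catalog graph instead of slicing it per dataset.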

These changes also allow processing DCAT Datasets declared as blank nodes instead of URIRefs.

@noirbizarre noirbizarre requested a review from opendatateam/etalab Apr 4, 2019
@noirbizarre noirbizarre force-pushed the noirbizarre:dcat-harvester-refactoring branch from b89a6ed to 0916969 Apr 4, 2019
@noirbizarre noirbizarre added this to the 1.6.7 milestone Apr 4, 2019
@abulte
abulte approved these changes Apr 4, 2019
CHANGELOG.md (outdated)
udata/harvest/tests/test_dcat_backend.py (outdated)
noirbizarre added 2 commits Apr 3, 2019
@noirbizarre noirbizarre force-pushed the noirbizarre:dcat-harvester-refactoring branch from 0916969 to f63f113 Apr 4, 2019
@noirbizarre noirbizarre merged commit 882595d into opendatateam:master Apr 4, 2019
3 checks passed
ci/circleci: assets Your tests passed on CircleCI!
ci/circleci: dist Your tests passed on CircleCI!
ci/circleci: python Your tests passed on CircleCI!
@noirbizarre noirbizarre deleted the noirbizarre:dcat-harvester-refactoring branch Apr 4, 2019
2 participants