RDF/DCAT support #966

Merged
merged 23 commits into from Jun 22, 2017

Contributor

@noirbizarre noirbizarre commented Jun 12, 2017

This PR adds DCAT support:

  • Harvest RDF/DCAT endpoints
    • RDFXML
    • turtle
    • JSON-LD
    • n3
    • nt
    • guess license
    • parse temporal coverage
    • guess endpoint format from extension
    • pagination support:
      • hydra current (PartialCollection)
      • hydra legacy (PaginatedCollection)
  • Expose RDF/DCAT endpoints
    • JSON-LD context (/context.jsonld)
    • catalog
      • root content negotiation (/catalog)
      • RDFXML (/catalog.xml)
      • turtle (/catalog.ttl)
      • JSON-LD (/catalog.json(ld))
      • n3 (/catalog.n3)
      • nt (/catalog.nt)
      • pagination support
    • dataset
      • root content negotiation (/datasets/{id}/rdf)
      • html page content negotiation ({lang}/datasets/{id}/)
      • RDFXML (/datasets/{id}/rdf.xml)
      • turtle (/datasets/{id}/rdf.ttl)
      • JSON-LD (/datasets/{id}/rdf.json(ld))
      • n3 (/datasets/{id}/rdf.n3)
      • nt (/datasets/{id}/rdf.nt)
      • temporal coverage
    • dataportal compliance (/data.{json,xml,ttl})
  • links html pages to their rdf counterparts:
    • home
    • dataset
  • documentation
    • harvesting documentation
    • rdf documentation:
      • endpoints
      • fields mapping

These points can/should be done in other PRs:

  • spatial coverage handling (parsing and exposition), ideally reusing geozones
  • harvest content type negotiation improvement (use headers...)
  • allow search results to be available as RDF in addition to CSV (very easy)
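
As an illustration of the "guess endpoint format from extension" item above, here is a minimal sketch. The mapping and the `guess_format_from_url` name are assumptions made for this example, not the code shipped in this PR.

```python
import os
from urllib.parse import urlparse

# Hypothetical mapping from file extension to RDF format name.
# The 'rdf' alias is an assumption, it is not listed in the PR description.
EXTENSION_TO_FORMAT = {
    'xml': 'xml',
    'rdf': 'xml',
    'ttl': 'turtle',
    'jsonld': 'json-ld',
    'json': 'json-ld',
    'n3': 'n3',
    'nt': 'nt',
}


def guess_format_from_url(url, default=None):
    '''Guess an RDF format from the extension found in the URL path.'''
    # Only look at the path so that query strings do not hide the extension
    path = urlparse(url).path
    extension = os.path.splitext(path)[1].lstrip('.').lower()
    return EXTENSION_TO_FORMAT.get(extension, default)
```

An endpoint without a recognizable extension returns `default`, which is where header-based negotiation (listed above as future work) would have to take over.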

@davidbgk
Member

Fixtures are huge 😱

}

# Map formats to default used extensions
RDF_EXTENSIONS = {
Member

Do we really need all these extensions? I would only activate jsonld for now. It impacts (at least) the number of links rendered in the template.

Contributor Author

These are the main supported RDF extensions. Supporting them comes at no extra programming cost, and the choice has been discussed and validated with @ColinMaudry.
Different formats serve different usages and tools.
We can't predict which usages or tools will consume udata RDF, and not many tools support JSON-LD. In fact, I've used several tools during this PR and not all of them support JSON-LD.
On the harvest side, we need to be able to harvest these formats: not all portals expose JSON-LD, which is the youngest format in the list (this is why it is the only one requiring an extra dependency on top of rdflib).
DCAT is an RDF specification, not a JSON-LD-specific one (the DCAT vocabulary itself is not exposed as JSON-LD, only as XML and Turtle).

@noirbizarre
Contributor Author

I removed the fixtures that shouldn't be there.

@ColinMaudry

ColinMaudry commented Jun 12, 2017

Regarding content negotiation, is it possible to negotiate an RDF format from /datasets/{id}?

@noirbizarre
Contributor Author

Yes, still to be done (it's in the checklist).
The helpers are done; only the view integration is missing.
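
To make the negotiation question concrete, here is a rough sketch of how an `Accept` header could be mapped to an RDF format. It is not the helper from this PR: the `ACCEPTED_MIME_TYPES` mapping and the `negotiate_rdf_format` name are hypothetical, and wildcards such as `*/*` are deliberately ignored.

```python
# Hypothetical mapping of MIME types to RDF format names
ACCEPTED_MIME_TYPES = {
    'application/rdf+xml': 'xml',
    'text/turtle': 'turtle',
    'application/ld+json': 'json-ld',
    'text/n3': 'n3',
    'application/n-triples': 'nt',
}


def negotiate_rdf_format(accept_header):
    '''Return the preferred RDF format for an Accept header, or None.'''
    candidates = []
    for position, item in enumerate(accept_header.split(',')):
        parts = [part.strip() for part in item.split(';')]
        mime_type = parts[0].lower()
        quality = 1.0  # Default quality when no q parameter is given
        for param in parts[1:]:
            name, _, value = param.partition('=')
            if name.strip() == 'q':
                try:
                    quality = float(value)
                except ValueError:
                    quality = 0.0
        if mime_type in ACCEPTED_MIME_TYPES and quality > 0:
            # Sort key: highest quality first, then order in the header
            candidates.append((-quality, position, mime_type))
    if not candidates:
        return None
    return ACCEPTED_MIME_TYPES[min(candidates)[2]]
```

A `None` result would mean that no RDF format was requested, so the view can keep serving the HTML page.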

@ColinMaudry

Fixture = endpoint? I'm not sure of what you call a fixture.

@noirbizarre
Contributor Author

A fixture is a set of data required by the test(s).
In our case, fixtures are RDF files that simulate server responses.

@noirbizarre
Contributor Author

Requires #967

@noirbizarre
Contributor Author

Requires #968

@noirbizarre
Contributor Author

Requires #969

Member

@vinyll vinyll left a comment

Suggestions for English sentences


## Vocabulary

- **Backend**: designate a protocol implementation to harvest a remote endpoint.
Member

implementation protocol?

Contributor Author

No, it doesn't mean the same thing ;)

## Vocabulary

- **Backend**: designate a protocol implementation to harvest a remote endpoint.
- **Source**: it's remote end point to harvest. Each harvest source is caracterized by
Member

its


- **Backend**: designate a protocol implementation to harvest a remote endpoint.
- **Source**: it's remote end point to harvest. Each harvest source is caracterized by
a single endpoint URL and a backend implementation. An harvester is configured for each source.
Member


## Behavior

After an harvester for a given source has been created and validated, it will run either on demand or periodically.
Member

a harvester


After an harvester for a given source has been created and validated, it will run either on demand or periodically.

An harvesting job is done in three separate phases:
Member

A harvesting

unschedule Run an harvester synchronously
run Run an harvester synchronously
validate Validate a source given its identifier
attach Attach existing dataset to their harvest remote id.
Member

no final period

run Run an harvester synchronously
validate Validate a source given its identifier
attach Attach existing dataset to their harvest remote id.
delete Delete an harvest source
Member

a harvest

### DCAT (prefered)

This backend harvest any [DCAT][] endpoint.
This is now the recommanded way to harvest remote portals and repositories
Member

recommended

This is now the recommanded way to harvest remote portals and repositories
(and so to expose opendata metadata for any portal and repository).

As pagination is not described into the DCAT specifcation, we try to detect some supported
Member

specification


```

You take a look at the [existing backends][backends-repository] to see exiting implementations.
Member

You may take a look

# Harvesting

Harvesting is the process of fetching of automatically remote metadata (ie. from other data portals or not)
and store them into udata for being able to search them.
Member

Harvesting is the process of automatically syncing remote metadata (i.e. from other data portals or not) and storing them into udata to be able to find remote data.

Contributor Author

Wow, so many typos in my sentence 😱
I'll fix that ASAP. But I think we need to keep `fetching`, as `syncing` suggests it's bidirectional while harvesting is not.


1. `initialize`: the harvester fetches remote identifiers to harvest and create a single task for each.
2. `process`: each task created in the `initialize` is executed. Each item is processed independently.
3. `finalize`: when all tasks are done, the `finilize` is a closure for the job and mark it as done.
Member

finalize (second one)
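
For readers unfamiliar with the three phases quoted above, they can be sketched roughly as follows. This is a simplified, synchronous illustration: the `HarvestJobSketch` class, its callbacks and its status values are hypothetical and do not reflect udata's actual backend API or its task queue.

```python
class HarvestJobSketch:
    '''Hypothetical three-phase job, not udata's actual backend class.'''

    def __init__(self, fetch_identifiers, process_item):
        self.fetch_identifiers = fetch_identifiers
        self.process_item = process_item
        self.tasks = []
        self.results = {}
        self.status = 'pending'

    def initialize(self):
        # Phase 1: fetch remote identifiers and create one task per item
        self.tasks = list(self.fetch_identifiers())
        self.status = 'initialized'

    def process(self):
        # Phase 2: each item is processed independently, so a failure
        # on one item does not prevent the others from being processed
        for identifier in self.tasks:
            try:
                self.results[identifier] = ('done', self.process_item(identifier))
            except Exception as error:
                self.results[identifier] = ('failed', str(error))
        self.status = 'processed'

    def finalize(self):
        # Phase 3: close the job once every task has been handled
        failed = any(state == 'failed' for state, _ in self.results.values())
        self.status = 'done-errors' if failed else 'done'

    def run(self):
        self.initialize()
        self.process()
        self.finalize()
        return self.status
```

Keeping one task per item is what lets a single broken remote record fail without aborting the whole job.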

docs/rdf.md Outdated
/dataset/{id}/rdf.{format}

The dataset pages serve as identifier and perform content negociation too,
so the following URL will all redirect to the same RDF endpoint:
Member

URLs

elif isinstance(period_of_time, RdfResource):
return temporal_from_resource(period_of_time)
except:
# There is a lot of case where parsing could/should fail
Member

are a lot of cases

elif dataset.deleted:
abort(410)

format = guess_format(format)
Member

These few lines are duplicated in rdf_catalog_format, factor them out?

elif isinstance(period_of_time, RdfResource):
return temporal_from_resource(period_of_time)
except:
# There are a lot of cases where parsing could/should fail
Member

Is it worth logging this for future improvements?
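
One possible shape for that logging, as a sketch only: `parse_period` and `temporal_from_literal_or_none` are hypothetical stand-ins for the parsing helpers in this PR, and the `YYYY-MM-DD/YYYY-MM-DD` input format is an assumption made for the example.

```python
import logging
from datetime import date

log = logging.getLogger(__name__)


def parse_period(value):
    '''Hypothetical parser: expects "YYYY-MM-DD/YYYY-MM-DD".'''
    start, end = value.split('/')
    return date.fromisoformat(start), date.fromisoformat(end)


def temporal_from_literal_or_none(value):
    '''Best-effort parsing: return a (start, end) tuple or None.'''
    try:
        return parse_period(value)
    except Exception:
        # Log the failure with its traceback instead of silently
        # swallowing it, then carry on without temporal coverage
        log.warning('Unable to parse temporal coverage %r', value, exc_info=True)
        return None
```

Catching `Exception` rather than using a bare `except:` keeps `KeyboardInterrupt` and `SystemExit` from being swallowed, and the logged values give a list of real-world formats to support later.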

udata/rdf.py Outdated
resulting JSON-LD.

See: https://github.com/RDFLib/rdflib-jsonld/blob/master/rdflib_jsonld/serializer.py#L101-L103
''' # noqa
Member

If you put only a URL on one line, the linter shouldn't complain even without a # noqa comment. At least mine 😉

Contributor Author

Mine cries without it 😢

@noirbizarre noirbizarre merged commit a66184e into opendatateam:dev Jun 22, 2017
@noirbizarre noirbizarre deleted the dcat branch June 22, 2017 12:32

4 participants