
Merge pull request #966 from noirbizarre/dcat
RDF/DCAT support
noirbizarre committed Jun 22, 2017
2 parents 843fb3c + fb7fb5f commit a66184e
Showing 32 changed files with 3,580 additions and 14 deletions.
6 changes: 6 additions & 0 deletions CHANGELOG.md
@@ -2,6 +2,12 @@

## Current (in progress)

- Added a [DCAT](https://www.w3.org/TR/vocab-dcat/) harvester
  and exposed metadata as RDF/DCAT
  [#966](https://github.com/opendatateam/udata/pull/966).
  See the dedicated documentation:
  - [RDF](https://udata.readthedocs.io/en/stable/rdf/)
  - [Harvesting](https://udata.readthedocs.io/en/stable/harvesting/)
- Upgrade to Flask-Mongoengine 0.9.2, Flask-WTF 0.14.2, mongoengine 0.11.0.
[#812](https://github.com/opendatateam/udata/pull/812)
- Upgrade to Flask-Login 0.4.0 and switch from Flask-Security to the latest
222 changes: 222 additions & 0 deletions docs/harvesting.md
@@ -0,0 +1,222 @@
# Harvesting

Harvesting is the process of automatically fetching remote metadata (e.g. from other data portals or repositories)
and storing it into udata so that the corresponding datasets can be searched and found.

## Vocabulary

- **Backend**: designates a protocol implementation used to harvest a remote endpoint.
- **Source**: a remote endpoint to harvest. Each harvest source is characterized by
  a single endpoint URL and a backend implementation. A harvester is configured for each source.
- **Job**: designates a full harvesting run for a given source.
- **Validation**: each newly created harvester needs to be validated by the admin team before being run.

## Behavior

After a harvester for a given source has been created and validated,
it will run either on demand or periodically.

A harvesting job is done in three separate phases:

1. `initialize`: the harvester fetches the remote identifiers to harvest and creates a single task for each of them.
2. `process`: each task created in the `initialize` phase is executed. Each item is processed independently.
3. `finalize`: when all tasks are done, the `finalize` phase closes the job and marks it as done.

Harvested datasets will have the following `extras` properties:

| Property            | Meaning                                                                  |
|---------------------|--------------------------------------------------------------------------|
| harvest:domain      | Domain from which the dataset has been harvested (e.g. `data.test.org`)  |
| harvest:remote_id   | Dataset identifier on the remote repository                              |
| harvest:source_id   | Harvester identifier                                                     |
| harvest:last_update | Last time this dataset was harvested                                     |
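
For instance, these extras make it possible to tell harvested datasets apart from manually
submitted ones. A minimal sketch, assuming the standard dict-like `Dataset.extras` field:

```python
from udata.models import Dataset

# Fetch any dataset and check whether it comes from a harvester
dataset = Dataset.objects.first()
if 'harvest:source_id' in dataset.extras:
    print('Harvested from {0}'.format(dataset.extras['harvest:domain']))
    print('Remote identifier: {0}'.format(dataset.extras['harvest:remote_id']))
    print('Last harvested on {0}'.format(dataset.extras['harvest:last_update']))
```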

## Administration interface

You can see the harvester administration interface in the `System` view.

![Administration harvester listing](screenshots/admin-harvest.png)

You'll have an overview of all harvesters and their state (pending validation, last run, and so on).

Each harvester has a full job history listing every harvested remote item.

![Administration harvester details](screenshots/admin-single-harvester.png)

## Shell

All harvesting operations are grouped together into the `harvest` command namespace:

```shell
usage: udata harvest [-?]
                     {jobs,launch,create,schedule,purge,sources,backends,unschedule,run,validate,attach,delete}
                     ...

Handle remote repositories harvesting operations

positional arguments:
  jobs          List started harvest jobs
  launch        Launch a source harvesting on the workers
  create        Create a new harvest source
  schedule      Schedule a harvest job to run periodically
  purge         Permanently remove deleted harvest sources
  sources       List all harvest sources
  backends      List available backends
  unschedule    Unschedule a periodical harvest job
  run           Run a harvester synchronously
  validate      Validate a source given its identifier
  attach        Attach existing datasets to their harvest remote id
  delete        Delete a harvest source

optional arguments:
  -?, --help    show this help message and exit
```
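
For instance, a typical workflow chains a few of these commands. `<source-id>` is a
placeholder for the identifier displayed by `udata harvest sources`; see each
subcommand's `--help` for its exact options:

```shell
# List the available backends and the declared sources
udata harvest backends
udata harvest sources

# Validate a pending source given its identifier, then run it synchronously
udata harvest validate <source-id>
udata harvest run <source-id>

# Or launch it asynchronously on the workers
udata harvest launch <source-id>
```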

## Backends

`udata` comes with 3 harvest backends, but you can also implement your own.

### DCAT

This backend harvests any [DCAT][] endpoint.
This is now the recommended way to harvest remote portals and repositories
(and, conversely, to expose open data metadata from any portal or repository).

As pagination is not described in the DCAT specification, the harvester tries to detect
one of the following supported pagination ontologies:
- [Hydra PartialCollectionView](http://www.hydra-cg.com/spec/latest/core/#hydra:PartialCollectionView)
- [Legacy Hydra PagedCollection](https://www.w3.org/community/hydra/wiki/Pagination)
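
As an illustration, detecting these Hydra pagination markers with rdflib could look like
the following sketch (illustrative only, not the actual udata implementation; the namespace
URI and property names come from the Hydra specifications):

```python
from rdflib import Namespace
from rdflib.namespace import RDF

HYDRA = Namespace('http://www.w3.org/ns/hydra/core#')


def next_page_url(graph):
    '''Return the URL of the next page if the graph declares Hydra pagination.'''
    # Current vocabulary: hydra:PartialCollectionView with hydra:next
    for view in graph.subjects(RDF.type, HYDRA.PartialCollectionView):
        url = graph.value(view, HYDRA.next)
        if url:
            return str(url)
    # Legacy vocabulary: hydra:PagedCollection with hydra:nextPage
    for collection in graph.subjects(RDF.type, HYDRA.PagedCollection):
        url = graph.value(collection, HYDRA.nextPage)
        if url:
            return str(url)
    return None
```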

Fields are extracted according to the following rules:

#### Dataset fields

| Dataset | dcat:Dataset | notes |
|-------------------|-------------------------|---------------------------------------------|
| title | dct:title | |
| description | dct:description | Detect and parse HTML as Markdown |
| tags | dct:keyword + dct:theme | |
| frequency | dct:accrualPeriodicity | |
| temporal_coverage | dct:temporal | See [Temporal coverage](#temporal-coverage) |
| license | N/A | See [License detection](#license-detection) |
| resources         | dct:distribution        | Also matches the non-standard dct:distributions |

| Dataset.extras | dcat:Dataset | notes |
|----------------|----------------|--------------------------|
| dct:identifier | dct:identifier | |
| uri | @id | URI Reference if present |

#### Resource fields

| Resource | dcat:Distribution | notes |
|---------------|------------------------------------|----------------------------------------|
| title | dct:title | If missing, guessed from URL or format |
| description | dct:description | Detect and parse HTML |
| url | dcat:downloadURL or dcat:accessURL | |
| published | dct:issued | |
| last_modified | dct:modified | |
| format | dct:format | |
| mime | dcat:mediaType | |
| filesize      | dcat:byteSize                      |                                         |
| checksum | spdx:checksum | See [Checksum](#checksum) |
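
For instance, extracting some of these resource fields from a `dcat:Distribution` node
with rdflib could look like the sketch below (illustrative only; note the
`dcat:downloadURL` / `dcat:accessURL` fallback from the table above):

```python
from rdflib import Namespace

# Bracket access avoids clashes with str methods such as `format` or `title`
DCAT = Namespace('http://www.w3.org/ns/dcat#')
DCT = Namespace('http://purl.org/dc/terms/')


def extract_resource(graph, distribution):
    '''Map a dcat:Distribution node to a plain dict of resource fields.'''
    url = (graph.value(distribution, DCAT['downloadURL'])
           or graph.value(distribution, DCAT['accessURL']))
    return {
        'title': graph.value(distribution, DCT['title']),
        'description': graph.value(distribution, DCT['description']),
        'url': str(url) if url else None,
        'format': graph.value(distribution, DCT['format']),
        'mime': graph.value(distribution, DCAT['mediaType']),
        'filesize': graph.value(distribution, DCAT['byteSize']),
    }
```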


#### Temporal coverage

Temporal coverage can be expressed in many ways. The harvester tries the following patterns:
- DCAT-AP format using schema.org properties (`schema:startDate` and `schema:endDate`)
- [Gov.uk Time Interval][gov-uk-references] parsing
- an ISO date interval as a literal (i.e. `YYYY[-MM[-DD]]/YYYY[-MM[-DD]]`), parsed as sketched below
- a single ISO month or year (i.e. `YYYY[-MM]`)
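
A minimal sketch of that literal parsing (a hypothetical helper, not the harvester's actual code):

```python
import calendar
from datetime import date


def parse_iso_interval(value):
    '''Parse `YYYY[-MM[-DD]]/YYYY[-MM[-DD]]` or a single `YYYY[-MM]` into (start, end).'''
    start, _, end = value.partition('/')
    return expand(start, last=False), expand(end or start, last=True)


def expand(value, last):
    '''Expand a partial ISO date to the first or last day it covers.'''
    parts = [int(part) for part in value.split('-')]
    year = parts[0]
    month = parts[1] if len(parts) > 1 else (12 if last else 1)
    if len(parts) > 2:
        day = parts[2]
    else:
        day = calendar.monthrange(year, month)[1] if last else 1
    return date(year, month, day)


parse_iso_interval('2016-02/2017')  # (date(2016, 2, 1), date(2017, 12, 31))
```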

#### Checksum

| Checksum | spdx:Checksum |
|----------|--------------------|
| type | spdx:algorithm |
| value | spdx:checksumValue |

#### License detection

The license is extracted from one of the DCAT distributions.
The harvester tries to guess the license from `dct:license` and `dct:rights`.
The first match is kept.
If none matches, no license is set on the dataset.
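
A possible shape for that matching, using udata's `License.guess` helper, is sketched
below (hypothetical: the distributions are modeled here as plain dicts holding the
already-extracted `dct:license` and `dct:rights` literals):

```python
from udata.models import License


def guess_license(distributions):
    '''Return the first known license matched in the distributions' metadata.'''
    for distribution in distributions:
        # `distribution` is assumed to be a dict of extracted literals
        for candidate in (distribution.get('license'), distribution.get('rights')):
            license = candidate and License.guess(candidate)
            if license:
                return license  # First match wins
    return None  # No match: the dataset is left without a license
```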

### CKAN

This backend harvests CKAN repositories/portals through their API.

### OpenDataSoft

This backend harvests OpenDataSoft repositories/portals through their API (v1).

### Custom

You can implement your own backends by extending `udata.harvest.backends.BaseBackend`
and implementing the `initialize()` and `process()` methods.

A minimal harvester adding fake random datasets might look like:

```python
# -*- coding: utf-8 -*-
from __future__ import unicode_literals

from udata.models import db, Resource
from udata.utils import faker

from . import BaseBackend, register


@register
class RandomBackend(BaseBackend):
    name = 'random'
    display_name = 'Random'

    def initialize(self):
        '''Generate a list of fake identifiers to harvest'''
        # In a real implementation, you should iterate over
        # a remote endpoint to list the identifiers to harvest
        # and optionally store extra data
        for _ in range(faker.pyint()):
            self.add_item(faker.uuid4())  # Accepts kwargs to store extra data

    def process(self, item):
        '''Generate a random dataset from a fake identifier'''
        # Get or create a harvested dataset with this identifier.
        # Harvest metadata are already filled on creation.
        dataset = self.get_dataset(item.remote_id)

        # In a real implementation you should:
        # - fetch the remote dataset (if necessary)
        # - validate the fetched payload
        # - map its content to the dataset fields
        # - store extra significant data in the `extras` attribute
        # - map resources data

        dataset.title = faker.sentence()
        dataset.description = faker.text()
        dataset.tags = list(set(faker.words(nb=faker.pyint())))

        # Resources
        for i in range(faker.pyint()):
            dataset.resources.append(Resource(
                title=faker.sentence(),
                description=faker.text(),
                url=faker.url(),
                filetype='remote',
                mime=faker.mime_type(category='text'),
                format=faker.file_extension(category='text'),
                filesize=faker.pyint()))

        return dataset

```

You may take a look at the [existing backends][backends-repository] for real-world implementations.


[DCAT]: https://www.w3.org/TR/vocab-dcat/
[backends-repository]: https://github.com/opendatateam/udata/tree/master/udata/harvest/backends
[gov-uk-references]: http://reference.data.gov.uk/
