global: removal of workflows dependency #19

Closed
wants to merge 10 commits into from
1 change: 0 additions & 1 deletion .travis.invenio.cfg
@@ -31,7 +31,6 @@ ASSETS_AUTO_BUILD = False

 PACKAGES = [
     'invenio_oaiharvester',
-    'invenio_workflows',
     'invenio_accounts',
     'invenio_records',
     'invenio_formatter',
46 changes: 23 additions & 23 deletions README.rst
@@ -1,6 +1,6 @@
 ..
     This file is part of Invenio.
-    Copyright (C) 2015 CERN.
+    Copyright (C) 2015, 2016 CERN.

     Invenio is free software; you can redistribute it
     and/or modify it under the terms of the GNU General Public License as
@@ -51,9 +51,10 @@ Invenio module for OAI-PMH metadata harvesting between repositories.
 Features
 ========

-This module allows you to easily harvest OAI-PMH repositories, thanks to the `Sickle`_ module, and feed the
-output into your ingestion workflows, or simply to files. You can configure
-your OAI-PMH sources via a web-interface and run or schedule immediate harvesting jobs
+This module allows you to easily harvest OAI-PMH repositories, thanks to the `Sickle`_ module, and via signals
+you can hook the output into your application, or simply to files.
+
+You keep configurations of your OAI-PMH sources via SQLAlchemy models and run or schedule immediate harvesting jobs
 via command-line or regularly via `Celery beat`_.

 .. _Celery beat: http://celery.readthedocs.org/en/latest/userguide/periodic-tasks.html
@@ -73,43 +74,40 @@ If you want to have your harvested records saved in a directory automatically, i

 .. code-block:: shell

-    inveniomanage oaiharvester get -u http://export.arxiv.org/oai2 -i oai:arXiv.org:1507.07286 -o dir
+    inveniomanage oaiharvester get -u http://export.arxiv.org/oai2 -i oai:arXiv.org:1507.07286 -d /tmp

-Note the output ``-o`` parameter that specifies how to output the harvested records. The three options are:
-
-* Sent to a workflow (E.g. `-o workflow`)
-* Saved files in a folder (E.g. `-o dir`)
-* Printed to stdout (default)
+Note the directory ``-d`` parameter that specifies a directory to save harvested XML files.


-Harvesting with workflows
-=========================
+Integration with your application
+=================================

-.. code-block:: shell
+If you want to integrate ``invenio-oaiharvester`` into your application, you should hook into
+the signals sent by the harvester upon completed harvesting.

-    inveniomanage oaiharvester get -u http://export.arxiv.org/oai2 -i oai:arXiv.org:1507.07286 -o workflow
+See ``invenio_oaiharvester.signals:oaiharvest_finished``.

-When you send an harvested record to a workflow you can process the harvested
-files however you'd like and then even upload it automatically into your own repository.
-
-This module already provides some
+Check also the defined Celery tasks under ``invenio_oaiharvester.tasks``.
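
Review note on the new "Integration with your application" section: the README names the signal (``invenio_oaiharvester.signals:oaiharvest_finished``) but this diff does not show its sender or keyword arguments. The sketch below is therefore an assumption throughout: ``MiniSignal`` is a hypothetical stand-in with a blinker-style ``connect``/``send`` interface, and the ``records`` and ``name`` keyword arguments are invented for illustration.

```python
class MiniSignal(object):
    """Hypothetical stand-in for a blinker-style signal (not the real object)."""

    def __init__(self, name):
        self.name = name
        self._receivers = []

    def connect(self, receiver):
        # Register a callable; returning it lets connect() double as a decorator.
        self._receivers.append(receiver)
        return receiver

    def send(self, sender, **kwargs):
        # Call every receiver and collect (receiver, return value) pairs.
        return [(receiver, receiver(sender, **kwargs))
                for receiver in self._receivers]


# Stand-in for invenio_oaiharvester.signals.oaiharvest_finished (assumed shape).
oaiharvest_finished = MiniSignal("oaiharvest-finished")

ingested = []


@oaiharvest_finished.connect
def ingest_harvested_records(sender, records=None, name=None, **kwargs):
    """Application-side receiver: hand each harvested record to your own code."""
    for record in records or []:
        ingested.append((name, record))
    return len(records or [])


# Simulate what the harvester would do once a harvest completes.
results = oaiharvest_finished.send("harvester",
                                   records=["<record/>"],
                                   name="somerepo")
```

In a real application the receiver would be connected to the imported signal rather than to the stand-in, and its keyword arguments should be checked against the signal's actual definition.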


 Managing OAI-PMH sources
 ========================

 If you want to store configuration for an OAI repository, you can use the
-administration interface available via the admin panel. This is useful if you regularly need to query a server.
+SQLAlchemy model ``invenio_oaiharvester.models:OaiHARVEST``.

-Here you can add information about the server URL, metadataPrefix to use etc. This information is also available when scheduling and running tasks:
+This is useful if you regularly need to query a server.
+
+Here you can add information about the server URL, metadataPrefix to use etc.
+This information is also available when scheduling and running tasks:

 .. code-block:: shell

     inveniomanage oaiharvester get -n somerepo -i oai:example.org:1234

-Here we are using the `-n, --name` parameter to specify which stored OAI-PMH source to query, by name.
+Here we are using the `-n, --name` parameter to specify which configured
+OAI-PMH source to query, using the ``name`` property.


 API
@@ -120,6 +118,8 @@ If you need to schedule or run harvests via Python, you can use our API:
 .. code-block:: python

     from invenio_oaiharvester.api import get_records
-    for rec in get_records(identifiers=["oai:arXiv.org:1207.7214"],
-                           url="http://export.arxiv.org/oai2"):
+
+    request, records = get_records(identifiers=["oai:arXiv.org:1207.7214"],
+                                   url="http://export.arxiv.org/oai2")
+    for record in records:
         print rec.raw
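
Review note on the README example: the loop variable is now ``record``, but the unchanged last line still prints ``rec.raw``, so the snippet as written would fail with a ``NameError``. A minimal sketch of the intended usage with the new ``(request, records)`` return shape follows. ``fake_get_records`` and ``FakeRecord`` are hypothetical stand-ins so the block runs without Invenio or network access; only the tuple shape and the ``raw`` attribute are taken from the diff.

```python
class FakeRecord(object):
    """Hypothetical stand-in for a harvested record exposing ``raw``."""

    def __init__(self, raw):
        self.raw = raw


def fake_get_records(identifiers, url=None):
    """Stand-in that mimics the new return shape: (request, list of records)."""
    request = {"endpoint": url}  # the real function returns a Sickle object here
    records = [FakeRecord("<record>{0}</record>".format(identifier))
               for identifier in identifiers]
    return request, records


# Unpack the tuple first, then iterate over the list of records.
request, records = fake_get_records(["oai:arXiv.org:1207.7214"],
                                    url="http://export.arxiv.org/oai2")
raws = [record.raw for record in records]
for raw in raws:
    print(raw)
```

Callers written against the old generator-based API (``for rec in get_records(...)``) would now iterate over the two-element tuple instead of the records, so they need the same unpacking step.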
8 changes: 7 additions & 1 deletion invenio_oaiharvester/__init__.py
@@ -1,7 +1,7 @@
 # -*- coding: utf-8 -*-
 #
 # This file is part of Invenio.
-# Copyright (C) 2013 CERN.
+# Copyright (C) 2013, 2015, 2016 CERN.
 #
 # Invenio is free software; you can redistribute it and/or
 # modify it under the terms of the GNU General Public License as
@@ -18,3 +18,9 @@
 # 59 Temple Place, Suite 330, Boston, MA 02111-1307, USA.

 from __future__ import absolute_import, print_function, unicode_literals
+
+from .version import __version__
+
+from .api import get_records, list_records
+
+__all__ = ('__version__', 'get_records', 'list_records')
68 changes: 37 additions & 31 deletions invenio_oaiharvester/api.py
@@ -1,7 +1,7 @@
 # -*- coding: utf-8 -*-
 #
 # This file is part of Invenio.
-# Copyright (C) 2015 CERN.
+# Copyright (C) 2015, 2016 CERN.
 #
 # Invenio is free software; you can redistribute it and/or
 # modify it under the terms of the GNU General Public License as
@@ -19,36 +19,40 @@

 from __future__ import absolute_import, print_function, unicode_literals

+import datetime
+
 from sickle import Sickle

 from .errors import NameOrUrlMissing, WrongDateCombination
 from .utils import get_oaiharvest_object


 def list_records(metadata_prefix=None, from_date=None, until_date=None,
-                 url=None, name=None, setSpec=None):
+                 url=None, name=None, setspecs=None):
     """Harvest records from an OAI repo, based on datestamp and/or set parameters.

     :param metadata_prefix: The prefix for the metadata return (defaults to 'oai_dc').
     :param from_date: The lower bound date for the harvesting (optional).
     :param until_date: The upper bound date for the harvesting (optional).
     :param url: The The url to be used to create the endpoint.
     :param name: The name of the OaiHARVEST object that we want to use to create the endpoint.
-    :param setSpec: The 'set' criteria for the harvesting (optional).
+    :param setspecs: The 'set' criteria for the harvesting (optional).
     :return: An iterator of harvested records.
     """
-    if url:
-        request = Sickle(url)
-    elif name:
-        request, _metadata_prefix, lastrun = get_from_oai_name(name)
+    if name:
+        url, _metadata_prefix, lastrun, _setspecs = get_info_by_oai_name(name)

         # In case we provide a prefix, we don't want it to be
         # overwritten by the one we get from the name variable.
         if metadata_prefix is None:
             metadata_prefix = _metadata_prefix
-    else:
+        if setspecs is None:
+            setspecs = _setspecs
+    elif not url:
         raise NameOrUrlMissing("Retry using the parameters -n <name> or -u <url>.")

+    request = Sickle(url)
+
     # By convention, when we have a url we have no lastrun, and when we use
     # the name we can either have from_date (if provided) or lastrun.
     dates = {
@@ -60,12 +64,19 @@ def list_records(metadata_prefix=None, from_date=None, until_date=None,
     if (dates['until'] is not None) and (dates['from'] > dates['until']):
         raise WrongDateCombination("'Until' date larger than 'from' date.")

-    if metadata_prefix is None:
-        metadata_prefix = "oai_dc"
-
-    return request.ListRecords(metadataPrefix=metadata_prefix,
-                               set=setSpec,
-                               **dates)
+    lastrun_date = datetime.datetime.now()
+    records = []
+    for spec in setspecs.split():
+        for record in request.ListRecords(metadataPrefix=metadata_prefix or "oai_dc",
+                                          set=spec,
+                                          **dates):
+            records.append(record)
+    # Update lastrun?
+    if from_date is None and until_date is None and name is not None:
+        oai_source = get_oaiharvest_object(name)
+        oai_source.update_lastrun(lastrun_date)
+        oai_source.save()
+    return request, records


 def get_records(identifiers, metadata_prefix=None, url=None, name=None):
@@ -77,39 +88,34 @@ def get_records(identifiers, metadata_prefix=None, url=None, name=None):
     :param name: The name of the OaiHARVEST object that we want to use to create the endpoint.
     :return: An iterator of harvested records.
     """
-    if url:
-        request = Sickle(url)
-    elif name:
-        request, _metadata_prefix, _ = get_from_oai_name(name)
+    if name:
+        url, _metadata_prefix, _, __ = get_info_by_oai_name(name)

         # In case we provide a prefix, we don't want it to be
         # overwritten by the one we get from the name variable.
         if metadata_prefix is None:
             metadata_prefix = _metadata_prefix
-    else:
+    elif not url:
         raise NameOrUrlMissing("Retry using the parameters -n <name> or -u <url>.")

-    if metadata_prefix is None:
-        metadata_prefix = "oai_dc"
-
+    request = Sickle(url)
+    records = []
     for identifier in identifiers:
         arguments = {
             'identifier': identifier,
-            'metadataPrefix': metadata_prefix
+            'metadataPrefix': metadata_prefix or "oai_dc"
         }
-        yield request.GetRecord(**arguments)
+        records.append(request.GetRecord(**arguments))
+    return request, records


-def get_from_oai_name(name):
+def get_info_by_oai_name(name):
     """Get basic OAI request data from the OaiHARVEST model.

     :param name: name of the source (OaiHARVEST.name)

-    :return: (Sickle obj, metadataprefix, lastrun)
+    :return: (url, metadataprefix, lastrun as YYYY-MM-DD, setspecs)
     """
     obj = get_oaiharvest_object(name)
-
-    req = Sickle(obj.baseurl)
-    metadata_prefix = obj.metadataprefix
-    lastrun = obj.lastrun
-    return req, metadata_prefix, lastrun
+    lastrun = obj.lastrun.strftime("%Y-%m-%d")
+    return obj.baseurl, obj.metadataprefix, lastrun, obj.setspecs
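
Review note on ``api.py``: two spots in the new code look fragile when optional values are absent. ``setspecs.split()`` is called even when ``list_records`` receives only a ``url``, in which case ``setspecs`` is still ``None``; and ``obj.lastrun.strftime(...)`` assumes every stored source already has a ``lastrun``. Below is a minimal sketch of None-safe helpers, assuming that "no set" should mean a single unrestricted request and that a missing ``lastrun`` should be passed through as ``None``. Both helper names are hypothetical and not part of the module.

```python
import datetime


def iter_setspecs(setspecs):
    """Return the sets to harvest; [None] stands for 'no set restriction'."""
    specs = (setspecs or "").split()
    return specs or [None]


def format_lastrun(lastrun):
    """Format a datetime as YYYY-MM-DD, passing None through unchanged."""
    if lastrun is None:
        return None
    return lastrun.strftime("%Y-%m-%d")


# Whitespace-separated sets are harvested one by one.
several = iter_setspecs("physics math")
# Without any set, the loop body still runs once, with set=None.
unrestricted = iter_setspecs(None)
```

With such a helper the loop in ``list_records`` could read ``for spec in iter_setspecs(setspecs):``. Passing ``set=None`` matches what the pre-change code did whenever ``setSpec`` was left at its default.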
5 changes: 1 addition & 4 deletions invenio_oaiharvester/config.py
@@ -1,7 +1,7 @@
 # -*- coding: utf-8 -*-
 #
 # This file is part of Invenio.
-# Copyright (C) 2014, 2015 CERN.
+# Copyright (C) 2014, 2015, 2016 CERN.
 #
 # Invenio is free software; you can redistribute it and/or
 # modify it under the terms of the GNU General Public License as
@@ -34,6 +34,3 @@

 OAIHARVESTER_STORAGEDIR = os.path.join(CFG_DATADIR, "oaiharvester", "storage")
 """Path to a storage directory where the oaiharvester may put files."""
-
-OAIHARVESTER_RECORD_ARXIV_ID_LOOKUP = "system_control_number.value"
-"""Path to the arXiv ID value used by sample post-process tasks."""
10 changes: 1 addition & 9 deletions invenio_oaiharvester/errors.py
@@ -1,7 +1,7 @@
 # -*- coding: utf-8; -*-
 #
 # This file is part of Invenio.
-# Copyright (C) 2015 CERN.
+# Copyright (C) 2015, 2016 CERN.
 #
 # Invenio is free software; you can redistribute it and/or
 # modify it under the terms of the GNU General Public License as
@@ -36,11 +36,3 @@ class WrongDateCombination(Exception):

 class IdentifiersOrDates(Exception):
     """Identifiers cannot be used in combination with dates."""
-
-
-class WrongOutputIdentifier(Exception):
-    """Output type not recognized. Try 'workflow', directory' or omit for stdout."""
-
-
-class WorkflowNotFound(Exception):
-    """Workflow not found. Try '-o workflow -w <workflow name> or provide a name (-n <name>)."""
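
Review note on ``WrongDateCombination``: ``list_records`` raises it when ``dates['from'] > dates['until']``, yet the message says the 'until' date is the larger one. The ``dates`` dictionary itself is collapsed in this diff, so the sketch below relies only on what the comment in ``api.py`` states: ``from_date`` wins when given, otherwise ``lastrun`` is used. ``resolve_dates`` is a hypothetical helper, and the exception is redefined locally so the block is self-contained.

```python
class WrongDateCombination(Exception):
    """Local copy for this sketch: 'from' date is later than 'until' date."""


def resolve_dates(from_date=None, until_date=None, lastrun=None):
    """Build the from/until pair and reject an inverted range.

    Dates are YYYY-MM-DD strings, which order correctly as plain strings.
    """
    dates = {"from": from_date or lastrun, "until": until_date}
    # Compare only when both bounds exist; comparing None with a string
    # raises TypeError on Python 3.
    if dates["from"] is not None and dates["until"] is not None:
        if dates["from"] > dates["until"]:
            raise WrongDateCombination(
                "'From' date is later than 'until' date.")
    return dates


fallback = resolve_dates(lastrun="2016-01-01")
explicit = resolve_dates(from_date="2016-02-01", until_date="2016-03-01",
                         lastrun="2016-01-01")
```

The guard on both bounds matters for a ``url``-only call, where neither ``from_date`` nor ``lastrun`` is available.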