Merge a082a58 into bcf3d87
jalavik committed Jan 22, 2016
2 parents bcf3d87 + a082a58 commit b81c7dc
Showing 26 changed files with 5,166 additions and 747 deletions.
1 change: 0 additions & 1 deletion .travis.invenio.cfg
@@ -31,7 +31,6 @@ ASSETS_AUTO_BUILD = False

PACKAGES = [
'invenio_oaiharvester',
'invenio_workflows',
'invenio_accounts',
'invenio_records',
'invenio_formatter',
46 changes: 23 additions & 23 deletions README.rst
@@ -1,6 +1,6 @@
..
This file is part of Invenio.
Copyright (C) 2015 CERN.
Copyright (C) 2015, 2016 CERN.
Invenio is free software; you can redistribute it
and/or modify it under the terms of the GNU General Public License as
@@ -51,9 +51,10 @@ Invenio module for OAI-PMH metadata harvesting between repositories.
Features
========

This module allows you to easily harvest OAI-PMH repositories, thanks to the `Sickle`_ module, and feed the
output into your ingestion workflows, or simply to files. You can configure
your OAI-PMH sources via a web-interface and run or schedule immediate harvesting jobs
This module allows you to easily harvest OAI-PMH repositories, thanks to the `Sickle`_ module, and via signals
you can hook the output into your application, or simply to files.

You keep configurations of your OAI-PMH sources via SQLAlchemy models and run or schedule immediate harvesting jobs
via command-line or regularly via `Celery beat`_.

.. _Celery beat: http://celery.readthedocs.org/en/latest/userguide/periodic-tasks.html
@@ -73,43 +74,40 @@ If you want to have your harvested records saved in a directory automatically, i

.. code-block:: shell
inveniomanage oaiharvester get -u http://export.arxiv.org/oai2 -i oai:arXiv.org:1507.07286 -o dir
inveniomanage oaiharvester get -u http://export.arxiv.org/oai2 -i oai:arXiv.org:1507.07286 -d /tmp
Note the output ``-o`` parameter that specifies how to output the harvested records. The three options are:
* Sent to a workflow (E.g. `-o workflow`)
* Saved files in a folder (E.g. `-o dir`)
* Printed to stdout (default)
Note the directory ``-d`` parameter that specifies a directory to save harvested XML files.


Harvesting with workflows
=========================
Integration with your application
=================================

.. code-block:: shell
If you want to integrate ``invenio-oaiharvester`` into your application, you should hook into
the signals sent by the harvester upon completed harvesting.

inveniomanage oaiharvester get -u http://export.arxiv.org/oai2 -i oai:arXiv.org:1507.07286 -o workflow
See ``invenio_oaiharvester.signals:oaiharvest_finished``.

When you send an harvested record to a workflow you can process the harvested
files however you'd like and then even upload it automatically into your own repository.

This module already provides some
Check also the defined Celery tasks under ``invenio_oaiharvester.tasks``.
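
The receiver pattern described above can be sketched as follows. This is only an illustration under stated assumptions: the arguments that ``oaiharvest_finished`` actually sends are not shown here, so the ``Signal`` class below is a tiny stand-in that mimics blinker-style ``connect`` and ``send``, and the ``records`` keyword is a hypothetical name.

```python
class Signal(object):
    """Tiny stand-in for a blinker-style signal, for illustration only."""

    def __init__(self, name):
        self.name = name
        self._receivers = []

    def connect(self, receiver):
        # Register a callable to be invoked on every send().
        self._receivers.append(receiver)
        return receiver

    def send(self, sender, **kwargs):
        # Call each receiver and collect (receiver, return value) pairs.
        return [(receiver, receiver(sender, **kwargs))
                for receiver in self._receivers]


# Hypothetical stand-in for invenio_oaiharvester.signals.oaiharvest_finished.
oaiharvest_finished = Signal("oaiharvest-finished")

harvested = []


def on_harvest_finished(sender, records=None, **kwargs):
    """Collect harvested records; `records` is an assumed keyword name."""
    harvested.extend(records or [])
    return len(harvested)


oaiharvest_finished.connect(on_harvest_finished)

# Simulate the harvester announcing two finished records.
results = oaiharvest_finished.send(
    "demo-request", records=["<record>1</record>", "<record>2</record>"])
```

In a real application the receiver would presumably be connected to the module's own signal at import time, and its body would hand the records to whatever ingestion step the application needs.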


Managing OAI-PMH sources
========================

If you want to store configuration for an OAI repository, you can use the
administration interface available via the admin panel. This is useful if you regularly need to query a server.
SQLAlchemy model ``invenio_oaiharvester.models:OaiHARVEST``.

Here you can add information about the server URL, metadataPrefix to use etc. This information is also available when scheduling and running tasks:
This is useful if you regularly need to query a server.

Here you can add information about the server URL, metadataPrefix to use etc.
This information is also available when scheduling and running tasks:

.. code-block:: shell
inveniomanage oaiharvester get -n somerepo -i oai:example.org:1234
Here we are using the `-n, --name` parameter to specify which stored OAI-PMH source to query, by name.
Here we are using the `-n, --name` parameter to specify which configured
OAI-PMH source to query, using the ``name`` property.
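
How stored settings combine with explicitly passed options can be sketched as below. It mirrors the precedence used in ``api.py`` in this commit (an explicit value wins, the stored one is the fallback, ``oai_dc`` is the last resort). ``StoredSource`` and ``resolve_options`` are hypothetical names for illustration; only the field names follow the attributes read from the ``OaiHARVEST`` object.

```python
from collections import namedtuple

# Hypothetical stand-in for a stored OaiHARVEST row (not the real model).
StoredSource = namedtuple(
    "StoredSource", ["name", "baseurl", "metadataprefix", "setspecs"])


def resolve_options(stored, metadata_prefix=None, set_spec=None):
    """Return (url, metadata_prefix, set_spec) for a harvesting request."""
    # Explicit arguments must not be overwritten by the stored values.
    if metadata_prefix is None:
        metadata_prefix = stored.metadataprefix
    if set_spec is None:
        set_spec = stored.setspecs
    # Fall back to Dublin Core when no prefix is known at all.
    return stored.baseurl, metadata_prefix or "oai_dc", set_spec


source = StoredSource(
    name="somerepo",
    baseurl="http://example.org/oai2",
    metadataprefix="marcxml",
    setspecs="physics math",
)

defaults = resolve_options(source)
overridden = resolve_options(source, metadata_prefix="oai_dc", set_spec="cs")
```

Here ``defaults`` carries the stored prefix and sets, while ``overridden`` keeps the stored URL but uses the values passed by the caller.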


API
@@ -120,6 +118,8 @@ If you need to schedule or run harvests via Python, you can use our API:
.. code-block:: python
from invenio_oaiharvester.api import get_records
for rec in get_records(identifiers=["oai:arXiv.org:1207.7214"],
url="http://export.arxiv.org/oai2"):
request, records = get_records(identifiers=["oai:arXiv.org:1207.7214"],
url="http://export.arxiv.org/oai2")
for record in records:
print rec.raw
8 changes: 7 additions & 1 deletion invenio_oaiharvester/__init__.py
@@ -1,7 +1,7 @@
# -*- coding: utf-8 -*-
#
# This file is part of Invenio.
# Copyright (C) 2013 CERN.
# Copyright (C) 2013, 2015, 2016 CERN.
#
# Invenio is free software; you can redistribute it and/or
# modify it under the terms of the GNU General Public License as
@@ -18,3 +18,9 @@
# 59 Temple Place, Suite 330, Boston, MA 02111-1307, USA.

from __future__ import absolute_import, print_function, unicode_literals

from .version import __version__

from .api import get_records, list_records

__all__ = ('__version__', 'get_records', 'list_records')
55 changes: 26 additions & 29 deletions invenio_oaiharvester/api.py
@@ -1,7 +1,7 @@
# -*- coding: utf-8 -*-
#
# This file is part of Invenio.
# Copyright (C) 2015 CERN.
# Copyright (C) 2015, 2016 CERN.
#
# Invenio is free software; you can redistribute it and/or
# modify it under the terms of the GNU General Public License as
@@ -37,18 +37,20 @@ def list_records(metadata_prefix=None, from_date=None, until_date=None,
:param setSpec: The 'set' criteria for the harvesting (optional).
:return: An iterator of harvested records.
"""
if url:
request = Sickle(url)
elif name:
request, _metadata_prefix, lastrun = get_from_oai_name(name)
if name:
url, _metadata_prefix, lastrun, _setSpec = get_info_by_oai_name(name)

# In case we provide a prefix, we don't want it to be
# overwritten by the one we get from the name variable.
if metadata_prefix is None:
metadata_prefix = _metadata_prefix
else:
if setSpec is None:
setSpec = _setSpec
elif not url:
raise NameOrUrlMissing("Retry using the parameters -n <name> or -u <url>.")

request = Sickle(url)

# By convention, when we have a url we have no lastrun, and when we use
# the name we can either have from_date (if provided) or lastrun.
dates = {
@@ -60,12 +62,12 @@ def list_records(metadata_prefix=None, from_date=None, until_date=None,
if (dates['until'] is not None) and (dates['from'] > dates['until']):
raise WrongDateCombination("'Until' date larger than 'from' date.")

if metadata_prefix is None:
metadata_prefix = "oai_dc"

return request.ListRecords(metadataPrefix=metadata_prefix,
set=setSpec,
**dates)
records = []
for spec in setSpec.split():
records.extend(list(request.ListRecords(metadataPrefix=metadata_prefix or "oai_dc",
set=spec,
**dates)))
return request, records


def get_records(identifiers, metadata_prefix=None, url=None, name=None):
@@ -77,39 +79,34 @@ def get_records(identifiers, metadata_prefix=None, url=None, name=None):
:param name: The name of the OaiHARVEST object that we want to use to create the endpoint.
:return: An iterator of harvested records.
"""
if url:
request = Sickle(url)
elif name:
request, _metadata_prefix, _ = get_from_oai_name(name)
if name:
url, _metadata_prefix, _, __ = get_info_by_oai_name(name)

# In case we provide a prefix, we don't want it to be
# overwritten by the one we get from the name variable.
if metadata_prefix is None:
metadata_prefix = _metadata_prefix
else:
elif not url:
raise NameOrUrlMissing("Retry using the parameters -n <name> or -u <url>.")

if metadata_prefix is None:
metadata_prefix = "oai_dc"

request = Sickle(url)
records = []
for identifier in identifiers:
arguments = {
'identifier': identifier,
'metadataPrefix': metadata_prefix
'metadataPrefix': metadata_prefix or "oai_dc"
}
yield request.GetRecord(**arguments)
records.append(request.GetRecord(**arguments))
return request, records


def get_from_oai_name(name):
def get_info_by_oai_name(name):
"""Get basic OAI request data from the OaiHARVEST model.
:param name: name of the source (OaiHARVEST.name)
:return: (Sickle obj, metadataprefix, lastrun)
:return: (url, metadataprefix, lastrun as YYYY-MM-DD, setspecs)
"""
obj = get_oaiharvest_object(name)

req = Sickle(obj.baseurl)
metadata_prefix = obj.metadataprefix
lastrun = obj.lastrun
return req, metadata_prefix, lastrun
lastrun = obj.lastrun.strftime("%Y-%m-%d")
return obj.baseurl, obj.metadataprefix, lastrun, obj.setspecs
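
One thing to watch in the new ``list_records`` loop: ``setSpec.split()`` presumes a string. If ``setSpec`` can still be ``None`` when only a URL is passed (the ``is None`` check above suggests it can), the call would raise ``AttributeError``, and an empty string would skip the request entirely. The previous code passed ``set=setSpec`` through unchanged, ``None`` included. A minimal sketch of a guard, where ``iter_set_specs`` is a hypothetical helper that is not part of this commit:

```python
def iter_set_specs(set_spec):
    """Yield one set name per request.

    Yields None exactly once when no set is given, so that a single
    unrestricted request is still made.
    """
    specs = (set_spec or "").split()
    if not specs:
        yield None
        return
    for spec in specs:
        yield spec


# Space-separated sets become one request each.
several = list(iter_set_specs("physics math"))
# No set at all still yields one (unrestricted) request.
none_given = list(iter_set_specs(None))
```

The loop in ``list_records`` could then iterate over ``iter_set_specs(setSpec)`` instead of ``setSpec.split()``, assuming the underlying request accepts ``set=None`` as the old code implied.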
5 changes: 1 addition & 4 deletions invenio_oaiharvester/config.py
@@ -1,7 +1,7 @@
# -*- coding: utf-8 -*-
#
# This file is part of Invenio.
# Copyright (C) 2014, 2015 CERN.
# Copyright (C) 2014, 2015, 2016 CERN.
#
# Invenio is free software; you can redistribute it and/or
# modify it under the terms of the GNU General Public License as
@@ -34,6 +34,3 @@

OAIHARVESTER_STORAGEDIR = os.path.join(CFG_DATADIR, "oaiharvester", "storage")
"""Path to a storage directory where the oaiharvester may put files."""

OAIHARVESTER_RECORD_ARXIV_ID_LOOKUP = "system_control_number.value"
"""Path to the arXiv ID value used by sample post-process tasks."""
10 changes: 1 addition & 9 deletions invenio_oaiharvester/errors.py
@@ -1,7 +1,7 @@
# -*- coding: utf-8; -*-
#
# This file is part of Invenio.
# Copyright (C) 2015 CERN.
# Copyright (C) 2015, 2016 CERN.
#
# Invenio is free software; you can redistribute it and/or
# modify it under the terms of the GNU General Public License as
@@ -36,11 +36,3 @@ class WrongDateCombination(Exception):

class IdentifiersOrDates(Exception):
"""Identifiers cannot be used in combination with dates."""


class WrongOutputIdentifier(Exception):
"""Output type not recognized. Try 'workflow', directory' or omit for stdout."""


class WorkflowNotFound(Exception):
"""Workflow not found. Try '-o workflow -w <workflow name> or provide a name (-n <name>)."""