
docs: document how to push to HAL #2629

Closed
jacquerie opened this issue Aug 15, 2017 · 5 comments

Comments

@jacquerie
Contributor

As part of #2628, and for future reference, document which steps are needed to push to HAL.

@jacquerie jacquerie self-assigned this Aug 15, 2017
@jacquerie jacquerie removed their assignment Sep 29, 2017
@jacquerie jacquerie self-assigned this Nov 10, 2017
@jacquerie
Contributor Author

jacquerie commented Nov 10, 2017

Manual Push

Dumping the data from Legacy

Imports and stuff:

>>> from invenio.intbitset import intbitset
>>> from invenio.search_engine import format_record, perform_request_search, search_unit
>>> from invenio.webuser import collect_user_info, get_uid_from_email, get_email_from_username
>>>
>>> ADMIN_USER_INFO = collect_user_info(get_uid_from_email(get_email_from_username('admin')))

Dump all conferences:

>>> conference_recids = perform_request_search(cc='Conferences')
>>> with open('Conferences.xml', 'w') as f:
...     for recid in conference_recids:
...         f.write(format_record(recid, 'XME', user_info=ADMIN_USER_INFO))

Dump all institutions:

>>> institution_recids = perform_request_search(cc='Institutions')
>>> with open('Institutions.xml', 'w') as f:
...     for recid in institution_recids:
...         f.write(format_record(recid, 'XME', user_info=ADMIN_USER_INFO))

Dump all HAL records:

>>> hal_recids = (search_unit('HAL', f='595__c', m='a') | intbitset(perform_request_search(cc='HAL Hidden'))) - search_unit('DELETED', f='980', m='a')
>>> with open('HAL.xml', 'w') as f:
...     for recid in hal_recids:
...         f.write(format_record(recid, 'XME', user_info=ADMIN_USER_INFO))

Configuring a local Labs instance

Add the following to inspirehep.cfg:

RECORDS_SKIP_FILES = True

HAL_COL_IRI = 'https://api.archives-ouvertes.fr/sword/hal'
HAL_EDIT_IRI = 'https://api.archives-ouvertes.fr/sword/'

HAL_USER_NAME = 'username'  # tbag
HAL_USER_PASS = 'password'  # tbag

Loading the data in a local Labs instance

$ docker-compose up -d
$ docker-compose scale worker=$(($(nproc) - 1))
$ docker-compose run --rm web scripts/recreate_records --no-populate
$ docker-compose run --rm web inspirehep migrator populate -f Conferences.xml
$ docker-compose run --rm web inspirehep migrator populate -f Institutions.xml
$ docker-compose run --rm web inspirehep migrator populate -f HAL.xml

Pushing to HAL

Imports and stuff:

>>> from invenio_records.models import RecordMetadata
>>> from inspirehep.modules.hal.core.tei import convert_to_tei
>>> from inspirehep.modules.hal.core.sword import create, update
>>>
>>> records = [record.json for record in RecordMetadata.query]

The actual push:

>>> ok, ko = [], []
>>> with open('HAL.log', 'w') as f:
...     for record in records:
...         if 'Literature' in record['_collections'] or 'HAL Hidden' in record['_collections']:
...             try:
...                 tei = convert_to_tei(record)
...                 try:
...                     hal_id = ''
...                     ids = record.get('external_system_identifiers', [])
...                     for id_ in ids:
...                         if id_['schema'] == 'HAL':
...                             hal_id = id_['value']
...                     if hal_id:
...                         update(tei.encode('utf8'), hal_id.encode('utf8'))
...                         f.write('UPD: %s %s\n' % (record['control_number'], hal_id))
...                     else:
...                         receipt = create(tei.encode('utf8'))
...                         f.write('NEW: %s %s\n' % (record['control_number'], receipt.id))
...                     ok.append(record['control_number'])
...                 except Exception, e:
...                     f.write('HAL: %s %s\n' % (record['control_number'], str(e)))
...                     ko.append(record['control_number'])
...             except Exception, e:
...                 f.write('TEI: %s %s\n' % (record['control_number'], str(e)))
...                 ko.append(record['control_number'])

@jacquerie
Contributor Author

With the above steps anyone should be able to do a manual push, so that my availability stops being a blocker for more regular pushes.

CC: @mathieugrives

@jacquerie jacquerie removed their assignment Dec 13, 2017
@puntonim puntonim added this to the HAL Integration milestone Feb 7, 2018
@puntonim puntonim reopened this Feb 7, 2018
@puntonim
Contributor

puntonim commented Feb 7, 2018

I enhanced this monster so that it can be run on a Labs machine (not on localhost).
SSH into a Labs machine, open a screen session, start an inspirehep shell, then copy/paste (%cpaste) the enhanced monster below.
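
For reference, the setup steps might look like this (the hostname and screen session name are only illustrative, and whether the inspirehep shell is IPython on a given machine is an assumption):

$ ssh inspire-prod-worker3-task1   # any Labs worker should do
$ screen -S hal-push
$ inspirehep shell

The enhanced monster: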

import datetime
import time

from flask import current_app

from invenio_records.models import RecordMetadata
from inspirehep.modules.hal.core.tei import convert_to_tei
from inspirehep.modules.hal.core.sword import create, update

# Set the proper configuration.
current_app.config['RECORDS_SKIP_FILES'] = True
current_app.config['HAL_COL_IRI'] = 'https://api.archives-ouvertes.fr/sword/hal'
current_app.config['HAL_EDIT_IRI'] = 'https://api.archives-ouvertes.fr/sword/'


def run(username, password, limit=None):
    start = time.time()
    current_app.config['HAL_USER_NAME'] = username
    current_app.config['HAL_USER_PASS'] = password
    records = RecordMetadata.query.filter(RecordMetadata.json['_export_to'].op('@>')('{"HAL": true}'))
    if limit:
        records = records[:limit]
    # log_file = os.path.join(os.path.dirname(__file__), 'HAL.log')
    log_file = '/opt/inspire/HAL.log'
    ok = ko = 0
    with open(log_file, 'w') as f:
        for i, raw_record in enumerate(records):
            if i % 10 == 0:
                now = str(datetime.timedelta(seconds=time.time()-start))
                print '%s records processed in %s: %s ok, %s ko' % (i, now, ok, ko)
            record = raw_record.json
            if 'Literature' in record['_collections'] or 'HAL Hidden' in record['_collections']:
                try:
                    tei = convert_to_tei(record)
                except Exception, e:
                    f.write('EXC TEI: %s %s\n' % (record['control_number'], str(e)))
                    # ko.append(record['control_number'])
                    ko += 1
                    continue

                success = False
                for _ in range(2):
                    try:
                        hal_id = ''
                        ids = record.get('external_system_identifiers', [])
                        for id_ in ids:
                            if id_['schema'] == 'HAL':
                                hal_id = id_['value']
                        if hal_id:
                            update(tei.encode('utf8'), hal_id.encode('utf8'))
                            f.write('UPD: %s %s\n' % (record['control_number'], hal_id))
                        else:
                            receipt = create(tei.encode('utf8'))
                            f.write('NEW: %s %s\n' % (record['control_number'], receipt.id))
                        success = True
                        break
                    except Exception, e:
                        continue
                if success:
                    # ok.append(record['control_number'])
                    ok += 1
                else:
                    f.write('EXC HAL: %s %s\n' % (record['control_number'], str(e)))
                    # ko.append(record['control_number'])
                    ko += 1
    now = str(datetime.timedelta(seconds=time.time() - start))
    print '%s records processed in %s: %s ok, %s ko' % (i + 1, now, ok, ko)

Then run it with (remove the 20 to run it on all records):

run('USERNAME-IN-TBAG', 'PASSWORD-IN-TBAG', 20)

TODO next:

  • performance of the RecordMetadata.query.filter query is really bad because it reads a non-indexed JSON field from the DB. Read it from ES instead (together with the full record); see the sketch after this list.
  • make it an executable script (Flask-Script maybe?)
  • schedule it (Celery beat?)
  • proper error reporting (Gitter?)
  • performance of the HAL APIs is bad, with frequent timeout errors: not much we can do.
    The script already performs 2 attempts for each failed record (for _ in range(2):); make the number of attempts configurable via kwargs. Note: last time I ran it with 10 attempts it took too long (4852 records processed in 1 day, 0:23:13.167395: 2050 ok, 2803 ko).
  • log file of the export ended on 8/2/2018:
    @inspire-prod-worker3-task1 in /home/inspire/HAL-push/HAL.log
  • avoid spamming Sentry on every error
  • check the forwarding for inspire-hal-cataloger@cern.ch (currently too many people receive these emails, and there will be one for each record indexed by HAL)
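
As a starting point for the first item, a minimal, untested sketch that reads the flagged records from Elasticsearch instead of the DB (it assumes the records live in the records-hep index and that _export_to.HAL is indexed as a boolean):

from elasticsearch.helpers import scan
from invenio_search import current_search_client

def hal_records_from_es(index='records-hep'):
    # Pull the full record source for everything flagged for HAL export,
    # avoiding the slow filter on the non-indexed JSON column in Postgres.
    query = {'query': {'term': {'_export_to.HAL': True}}}
    for hit in scan(current_search_client, query=query, index=index):
        yield hit['_source']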

@jacquerie
Contributor Author

@mathieugrives requested that subtitles be included in the next export, which @ammirate will do. You can cherry-pick this commit for that: jacquerie@2625e19
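
For reference, one way to apply it locally (assuming the commit lives on jacquerie's fork of this repository; the remote URL below is a guess):

$ git remote add jacquerie https://github.com/jacquerie/inspire-next.git
$ git fetch jacquerie
$ git cherry-pick 2625e19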

@puntonim
Contributor

OBSOLETE

We made a script to be run with:

$ /usr/bin/time -v inspirehep hal push

It will ask for username, password and limit.
