
docs: document how to push to HAL #2629

Closed
jacquerie opened this issue Aug 15, 2017 · 5 comments

Comments

@jacquerie
Contributor

As part of #2628, and for future reference, document which steps are needed to push to HAL.

@jacquerie jacquerie self-assigned this Aug 15, 2017
@jacquerie jacquerie removed their assignment Sep 29, 2017
@jacquerie jacquerie self-assigned this Nov 10, 2017
@jacquerie
Contributor Author

jacquerie commented Nov 10, 2017

Manual Push

Dumping the data from Legacy

Imports and stuff:

>>> from invenio.intbitset import intbitset
>>> from invenio.search_engine import format_record, perform_request_search, search_unit
>>> from invenio.webuser import collect_user_info, get_uid_from_email, get_email_from_username
>>>
>>> ADMIN_USER_INFO = collect_user_info(get_uid_from_email(get_email_from_username('admin')))

Dump all conferences:

>>> conference_recids = perform_request_search(cc='Conferences')
>>> with open('Conferences.xml', 'w') as f:
...     for recid in conference_recids:
...         f.write(format_record(recid, 'XME', user_info=ADMIN_USER_INFO))

Dump all institutions:

>>> institution_recids = perform_request_search(cc='Institutions')
>>> with open('Institutions.xml', 'w') as f:
...     for recid in institution_recids:
...         f.write(format_record(recid, 'XME', user_info=ADMIN_USER_INFO))

Dump all HAL records:

>>> hal_recids = (search_unit('HAL', f='595__c', m='a') | intbitset(perform_request_search(cc='HAL Hidden'))) - search_unit('DELETED', f='980', m='a')
>>> with open('HAL.xml', 'w') as f:
...     for recid in hal_recids:
...         f.write(format_record(recid, 'XME', user_info=ADMIN_USER_INFO))

Configuring a local Labs instance

Add the following to inspirehep.cfg:

RECORDS_SKIP_FILES = True

HAL_COL_IRI = 'https://api.archives-ouvertes.fr/sword/hal'
HAL_EDIT_IRI = 'https://api.archives-ouvertes.fr/sword/'

HAL_USER_NAME = 'username'  # tbag
HAL_USER_PASS = 'password'  # tbag

Loading the data in a local Labs instance

$ docker-compose up -d
$ docker-compose scale worker=$(($(nproc) - 1))
$ docker-compose run --rm web scripts/recreate_records --no-populate
$ docker-compose run --rm web inspirehep migrator populate -f Conferences.xml
$ docker-compose run --rm web inspirehep migrator populate -f Institutions.xml
$ docker-compose run --rm web inspirehep migrator populate -f HAL.xml

Pushing to HAL

Imports and stuff:

>>> from invenio_records.models import RecordMetadata
>>> from inspirehep.modules.hal.core.tei import convert_to_tei
>>> from inspirehep.modules.hal.core.sword import create, update
>>>
>>> records = [record.json for record in RecordMetadata.query]

The actual push:

>>> ok, ko = [], []
>>> with open('HAL.log', 'w') as f:
...     for record in records:
...         if 'Literature' in record['_collections'] or 'HAL Hidden' in record['_collections']:
...             try:
...                 tei = convert_to_tei(record)
...                 try:
...                     hal_id = ''
...                     ids = record.get('external_system_identifiers', [])
...                     for id_ in ids:
...                         if id_['schema'] == 'HAL':
...                             hal_id = id_['value']
...                     if hal_id:
...                         update(tei.encode('utf8'), hal_id.encode('utf8'))
...                         f.write('UPD: %s %s\n' % (record['control_number'], hal_id))
...                     else:
...                         receipt = create(tei.encode('utf8'))
...                         f.write('NEW: %s %s\n' % (record['control_number'], receipt.id))
...                     ok.append(record['control_number'])
...                 except Exception, e:
...                     f.write('HAL: %s %s\n' % (record['control_number'], str(e)))
...                     ko.append(record['control_number'])
...             except Exception, e:
...                 f.write('TEI: %s %s\n' % (record['control_number'], str(e)))
...                 ko.append(record['control_number'])

@jacquerie
Contributor Author

With the above steps anyone should be able to do a manual push, so that my availability stops being a blocker for more regular pushes.

CC: @mathieugrives

@jacquerie jacquerie removed their assignment Dec 13, 2017
@puntonim puntonim added this to the HAL Integration milestone Feb 7, 2018
@puntonim puntonim reopened this Feb 7, 2018
@puntonim
Contributor

puntonim commented Feb 7, 2018

I enhanced this monster so that it can be run on a Labs machine (not on localhost).
SSH into a Labs machine, open a screen session, start an inspirehep shell, then copy/paste (%cpaste) the enhanced monster below.
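
For reference, the setup steps might look like this (the hostname and screen session name are only illustrative, and whether the inspirehep shell is IPython on a given machine is an assumption):

$ ssh inspire-prod-worker3-task1   # any Labs worker should do
$ screen -S hal-push
$ inspirehep shell

The enhanced monster: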

import datetime
import time

from flask import current_app

from invenio_records.models import RecordMetadata
from inspirehep.modules.hal.core.tei import convert_to_tei
from inspirehep.modules.hal.core.sword import create, update

# Set the proper configuration.
current_app.config['RECORDS_SKIP_FILES'] = True
current_app.config['HAL_COL_IRI'] = 'https://api.archives-ouvertes.fr/sword/hal'
current_app.config['HAL_EDIT_IRI'] = 'https://api.archives-ouvertes.fr/sword/'


def run(username, password, limit=None):
    start = time.time()
    current_app.config['HAL_USER_NAME'] = username
    current_app.config['HAL_USER_PASS'] = password
    records = RecordMetadata.query.filter(RecordMetadata.json['_export_to'].op('@>')('{"HAL": true}'))
    if limit:
        records = records[:limit]
    # log_file = os.path.join(os.path.dirname(__file__), 'HAL.log')
    log_file = '/opt/inspire/HAL.log'
    ok = ko = 0
    with open(log_file, 'w') as f:
        for i, raw_record in enumerate(records):
            if i % 10 == 0:
                now = str(datetime.timedelta(seconds=time.time()-start))
                print '%s records processed in %s: %s ok, %s ko' % (i, now, ok, ko)
            record = raw_record.json
            if 'Literature' in record['_collections'] or 'HAL Hidden' in record['_collections']:
                try:
                    tei = convert_to_tei(record)
                except Exception, e:
                    f.write('EXC TEI: %s %s\n' % (record['control_number'], str(e)))
                    # ko.append(record['control_number'])
                    ko += 1
                    continue

                success = False
                for _ in range(2):
                    try:
                        hal_id = ''
                        ids = record.get('external_system_identifiers', [])
                        for id_ in ids:
                            if id_['schema'] == 'HAL':
                                hal_id = id_['value']
                        if hal_id:
                            update(tei.encode('utf8'), hal_id.encode('utf8'))
                            f.write('UPD: %s %s\n' % (record['control_number'], hal_id))
                        else:
                            receipt = create(tei.encode('utf8'))
                            f.write('NEW: %s %s\n' % (record['control_number'], receipt.id))
                        success = True
                        break
                    except Exception, e:
                        continue
                if success:
                    # ok.append(record['control_number'])
                    ok += 1
                else:
                    f.write('EXC HAL: %s %s\n' % (record['control_number'], str(e)))
                    # ko.append(record['control_number'])
                    ko += 1
    now = str(datetime.timedelta(seconds=time.time() - start))
    print '%s records processed in %s: %s ok, %s ko' % (i + 1, now, ok, ko)

Then run it with (remove the 20 to run it on all records):

run('USERNAME-IN-TBAG', 'PASSWORD-IN-TBAG', 20)

TODO next:

  • performance of the RecordMetadata.query.filter query is really bad because it reads a non-indexed JSON field from the DB. Read it from ES instead (together with the full record); see the sketch after this list.
  • make it an executable script (Flask-Script maybe?)
  • schedule it (Celery beat?)
  • proper error reporting (Gitter?)
  • performance of the HAL APIs is bad, with frequent timeout errors: not much we can do.
    The script already performs 2 attempts for each failed record (for _ in range(2):); make the number of attempts configurable via kwargs. Note: last time I ran it with 10 attempts it took too long (4852 records processed in 1 day, 0:23:13.167395: 2050 ok, 2803 ko).
  • log file of the export ended on 8/2/2018:
    @inspire-prod-worker3-task1 in /home/inspire/HAL-push/HAL.log
  • avoid spamming Sentry on every error
  • check the forwarding for inspire-hal-cataloger@cern.ch (currently too many people receive these emails, and there will be one for each record indexed by HAL)
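
As a starting point for the first item, a minimal, untested sketch that reads the flagged records from Elasticsearch instead of the DB (it assumes the records live in the records-hep index and that _export_to.HAL is indexed as a boolean):

from elasticsearch.helpers import scan
from invenio_search import current_search_client

def hal_records_from_es(index='records-hep'):
    # Pull the full record source for everything flagged for HAL export,
    # avoiding the slow filter on the non-indexed JSON column in Postgres.
    query = {'query': {'term': {'_export_to.HAL': True}}}
    for hit in scan(current_search_client, query=query, index=index):
        yield hit['_source']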

@jacquerie
Contributor Author

@mathieugrives requested that subtitles be included in the next export, which @ammirate will do. You can cherry-pick this commit for that: jacquerie@2625e19
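
For reference, one way to apply it locally (assuming the commit lives on jacquerie's fork of this repository; the remote URL below is a guess):

$ git remote add jacquerie https://github.com/jacquerie/inspire-next.git
$ git fetch jacquerie
$ git cherry-pick 2625e19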

@puntonim
Contributor

OBSOLETE

We made a script to be run with:

$ /usr/bin/time -v inspirehep hal push

It will ask for username, password and limit.
