docs: document how to push to HAL #2629
Manual Push

Dumping the data from Legacy

Imports and stuff:

>>> from invenio.intbitset import intbitset
>>> from invenio.search_engine import format_record, perform_request_search, search_unit
>>> from invenio.webuser import collect_user_info, get_uid_from_email, get_email_from_username
>>>
>>> ADMIN_USER_INFO = collect_user_info(get_uid_from_email(get_email_from_username('admin')))

Dump all conferences:

>>> conference_recids = perform_request_search(cc='Conferences')
>>> with open('Conferences.xml', 'w') as f:
...     for recid in conference_recids:
...         f.write(format_record(recid, 'XME', user_info=ADMIN_USER_INFO))

Dump all institutions:

>>> institution_recids = perform_request_search(cc='Institutions')
>>> with open('Institutions.xml', 'w') as f:
...     for recid in institution_recids:
...         f.write(format_record(recid, 'XME', user_info=ADMIN_USER_INFO))

Dump all HAL records:

>>> hal_recids = (search_unit('HAL', f='595__c', m='a') | intbitset(perform_request_search(cc='HAL Hidden'))) - search_unit('DELETED', f='980', m='a')
>>> with open('HAL.xml', 'w') as f:
...     for recid in hal_recids:
...         f.write(format_record(recid, 'XME', user_info=ADMIN_USER_INFO))

Configuring a local Labs instance

Add the following to the configuration:

RECORDS_SKIP_FILES = True
HAL_COL_IRI = 'https://api.archives-ouvertes.fr/sword/hal'
HAL_EDIT_IRI = 'https://api.archives-ouvertes.fr/sword/'
HAL_USER_NAME = 'username'  # tbag
HAL_USER_PASS = 'password'  # tbag

Loading the data in a local Labs instance

$ docker-compose up -d
$ docker-compose scale worker=$(($(nproc) - 1))
$ docker-compose run --rm web scripts/recreate_records --no-populate
$ docker-compose run --rm web inspirehep migrator populate -f Conferences.xml
$ docker-compose run --rm web inspirehep migrator populate -f Institutions.xml
$ docker-compose run --rm web inspirehep migrator populate -f HAL.xml

Pushing to HAL

Imports and stuff:

>>> from invenio_records.models import RecordMetadata
>>> from inspirehep.modules.hal.core.tei import convert_to_tei
>>> from inspirehep.modules.hal.core.sword import create, update
>>>
>>> records = [record.json for record in RecordMetadata.query]

The actual push:

>>> ok, ko = [], []
>>> with open('HAL.log', 'w') as f:
...     for record in records:
...         if 'Literature' in record['_collections'] or 'HAL Hidden' in record['_collections']:
...             try:
...                 # Conversion failures are logged with the TEI prefix.
...                 tei = convert_to_tei(record)
...                 try:
...                     # Records that already carry a HAL id are updated, the others are created.
...                     hal_id = ''
...                     ids = record.get('external_system_identifiers', [])
...                     for id_ in ids:
...                         if id_['schema'] == 'HAL':
...                             hal_id = id_['value']
...                     if hal_id:
...                         update(tei.encode('utf8'), hal_id.encode('utf8'))
...                         f.write('UPD: %s %s\n' % (record['control_number'], hal_id))
...                     else:
...                         receipt = create(tei.encode('utf8'))
...                         f.write('NEW: %s %s\n' % (record['control_number'], receipt.id))
...                     ok.append(record['control_number'])
...                 except Exception as e:
...                     # Push failures are logged with the HAL prefix.
...                     f.write('HAL: %s %s\n' % (record['control_number'], str(e)))
...                     ko.append(record['control_number'])
...             except Exception as e:
...                 f.write('TEI: %s %s\n' % (record['control_number'], str(e)))
...                 ko.append(record['control_number'])
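
The push loop decides between an update and a create by looking for a HAL entry among the record's external_system_identifiers. A minimal sketch of that lookup as a standalone function follows; the name get_hal_id and the sample record are made up for illustration and are not part of inspirehep. As in the loop, the last HAL identifier wins if a record carries several.

```python
def get_hal_id(record):
    """Return the HAL identifier of ``record``, or '' if it has none.

    Hypothetical helper: it mirrors the lookup done inline in the push loop.
    """
    hal_id = ''
    for id_ in record.get('external_system_identifiers', []):
        if id_['schema'] == 'HAL':
            hal_id = id_['value']  # the last HAL entry wins, as in the loop
    return hal_id


# Made-up sample record, only to show the shape of the data.
sample_record = {
    'control_number': 1,
    'external_system_identifiers': [
        {'schema': 'ADS', 'value': 'some-ads-id'},
        {'schema': 'HAL', 'value': 'hal-01234567'},
    ],
}

print(get_hal_id(sample_record))
print(repr(get_hal_id({'control_number': 2})))
```

An empty string means that the record has never been pushed, so the loop would call create instead of update.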
With the above steps anyone should be able to do a manual push, so that my availability stops being a blocker for more regular pushes. CC: @mathieugrives
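
After a push, the HAL.log file written by the loop above can be summarized by counting lines per prefix (UPD, NEW, HAL, TEI). The following is only a sketch: summarize_log is a hypothetical helper, and the sample lines are made up to match the format strings used in the loop.

```python
from collections import Counter

# Prefixes written by the push loop: updates, creations, push errors, TEI errors.
PREFIXES = ('UPD', 'NEW', 'HAL', 'TEI')


def summarize_log(lines):
    """Count the log lines that start with one of the known prefixes."""
    counts = Counter()
    for line in lines:
        for prefix in PREFIXES:
            if line.startswith(prefix + ':'):
                counts[prefix] += 1
                break
    return counts


# Made-up sample lines in the format written by the loop.
sample_lines = [
    'UPD: 1234 hal-01234567\n',
    'NEW: 1235 hal-07654321\n',
    'HAL: 1236 some error message\n',
    'TEI: 1237 some error message\n',
    'NEW: 1238 hal-07654322\n',
]

summary = summarize_log(sample_lines)
for prefix in PREFIXES:
    print('%s: %d' % (prefix, summary[prefix]))
```

To run it on a real log, pass the open file instead of the sample list, for example `with open('HAL.log') as f: summarize_log(f)`. Lines that do not start with a known prefix, such as the continuation of a multi-line error message, are ignored.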
I enhanced this monster so that it can be run on a Labs machine (not on localhost):

import datetime
import time
from flask import current_app
from invenio_records.models import RecordMetadata
from inspirehep.modules.hal.core.tei import convert_to_tei
from inspirehep.modules.hal.core.sword import create, update
# Set the proper configuration.
current_app.config['RECORDS_SKIP_FILES'] = True
current_app.config['HAL_COL_IRI'] = 'https://api.archives-ouvertes.fr/sword/hal'
current_app.config['HAL_EDIT_IRI'] = 'https://api.archives-ouvertes.fr/sword/'

def run(username, password, limit=None):
    start = time.time()
    current_app.config['HAL_USER_NAME'] = username
    current_app.config['HAL_USER_PASS'] = password

    # Select only the records flagged for export to HAL.
    records = RecordMetadata.query.filter(
        RecordMetadata.json['_export_to'].op('@>')('{"HAL": true}'))
    if limit:
        records = records[:limit]

    log_file = '/opt/inspire/HAL.log'
    ok = ko = 0
    processed = 0
    with open(log_file, 'w') as f:
        for i, raw_record in enumerate(records):
            if i % 10 == 0:
                now = str(datetime.timedelta(seconds=time.time() - start))
                print('%s records processed in %s: %s ok, %s ko' % (i, now, ok, ko))
            processed = i + 1
            record = raw_record.json
            if 'Literature' in record['_collections'] or 'HAL Hidden' in record['_collections']:
                try:
                    tei = convert_to_tei(record)
                except Exception as e:
                    f.write('EXC TEI: %s %s\n' % (record['control_number'], str(e)))
                    ko += 1
                    continue

                # Try the push twice before giving up, keeping the last error for the log.
                last_error = None
                for _ in range(2):
                    try:
                        hal_id = ''
                        ids = record.get('external_system_identifiers', [])
                        for id_ in ids:
                            if id_['schema'] == 'HAL':
                                hal_id = id_['value']
                        if hal_id:
                            update(tei.encode('utf8'), hal_id.encode('utf8'))
                            f.write('UPD: %s %s\n' % (record['control_number'], hal_id))
                        else:
                            receipt = create(tei.encode('utf8'))
                            f.write('NEW: %s %s\n' % (record['control_number'], receipt.id))
                        last_error = None
                        break
                    except Exception as e:
                        last_error = e

                if last_error is None:
                    ok += 1
                else:
                    f.write('EXC HAL: %s %s\n' % (record['control_number'], str(last_error)))
                    ko += 1

    # Final report, with the elapsed time recomputed at the end of the run.
    now = str(datetime.timedelta(seconds=time.time() - start))
    print('%s records processed in %s: %s ok, %s ko' % (processed, now, ok, ko))

Then run it with (remove the 20 to run it on all records):

run(USERNAME-IN-TBAG, PASSWORD-IN-TBAG, 20)

TODO next:
@mathieugrives asked for subtitles to be included in the next export, which @ammirate will do. You can cherry-pick this commit for that: jacquerie@2625e19
OBSOLETE

We made a script to be run with:

$ /usr/bin/time -v inspirehep hal push

It will ask for username, password and limit.
As part of #2628, and for future reference, document which steps are needed to push to HAL.