Devise an automated procedure to export plaintexts #918

marekhorst · 2018-10-10T12:32:05Z

As a result of this procedure we would like to obtain a datastore of (documentId, text) pairs. It should be possible to export this datastore to an external Hadoop cluster.

This could be achieved by defining a workflow that binds the already available IIS modules:

iis-wf-import
iis-wf-metadataextraction
iis-wf-ingest-pmc
iis-wf-transformers

and we could define it inside the iis-wf-metadataextraction module.

Technically it will extract data from the underlying PDF and XML-PMC caches. We could reuse the already existing workflows:

import_infospace for importing identifiers deduplication mapping
importer_content_url_chain for importing content URLs required by the metadataextraction and pmc_ingestion submodules
metadataextraction_cache to handle both PDF and XML-PMC caches
transformers_metadataextraction_documenttext for plaintext extraction out of ExtractedDocumentMetadata cached records

and supplement them with an exporting module producing output in the desired format. The dedicated workflow could be based on the already existing primary_import workflow definition.

We should support the following input parameters:

match_content_with_metadata - flag indicating that contents should be filtered against metadata entries retrieved from InformationSpace, with publication identifiers replaced by deduplicated ones. This way only contents having a metadata representation will be exported. When disabled, hbase_dump_location does not need to be provided.
hbase_dump_location - InformationSpace dump location (may point to a remote cluster), required for filtering contents against their metadata representatives and mapping original identifiers into deduplicated ones
objectstore_service_location - ObjectStore service location
approved_objectstores_csv - predefined set of ObjectStores to be handled
ingest_pmc_cache_location - XML PMC texts cache location
metadataextraction_cache_location - PDF texts cache location
metadataextraction_excluded_checksums - set of excluded PDF checksums; should be defined in the config-default.xml file placed on the IIS cluster instead of being provided at runtime
output_remote_location - desired output location, may point to an external cluster (e.g. DM)
reports_external_path - local IIS cluster path where the processing metrics should be stored
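
The effect of the match_content_with_metadata flag can be illustrated with a minimal Python sketch. It is not IIS code: the function name `export_pairs` and the in-memory dictionary standing in for the identifiers deduplication mapping are hypothetical, and the sketch assumes the mapping contains an entry for every identifier that has a metadata representation.

```python
from typing import Dict, Iterable, Iterator, Optional, Tuple


def export_pairs(
    contents: Iterable[Tuple[str, str]],
    dedup_mapping: Optional[Dict[str, str]],
    match_content_with_metadata: bool,
) -> Iterator[Tuple[str, str]]:
    """Yield (documentId, text) pairs ready for export (illustrative only)."""
    if not match_content_with_metadata:
        # Flag disabled: original identifiers are kept and nothing is filtered.
        yield from contents
        return
    if dedup_mapping is None:
        # Mirrors the rule that the dump location is needed when the flag is set.
        raise ValueError("a deduplication mapping is required when matching is enabled")
    for doc_id, text in contents:
        dedup_id = dedup_mapping.get(doc_id)
        if dedup_id is not None:
            # Only contents with a metadata representation are exported,
            # under their deduplicated identifier.
            yield dedup_id, text
```

With the flag enabled, a content whose identifier is absent from the mapping is dropped; with the flag disabled, all pairs pass through unchanged.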


Introducing the cache_retriever uber workflow realizing the requested logic. The following input parameters are required:

* match_content_with_metadata - flag indicating that contents should be filtered against metadata entries retrieved from InformationSpace. This way only contents having a metadata representation will be processed. To be disabled when processing new contents whose metadata is not available in HBase, or when original identifiers should be preserved (entries will not be filtered either)
* hbase_dump_location - InformationSpace HBase dump location, usually pointing to a remote cluster HDFS location. Required when ${match_content_with_metadata} is set to true
* objectstore_service_location - object store service location required by IIS to import contents
* approved_objectstores_csv - CSV of object stores subject to inference processing
* ingest_pmc_cache_location - cache to be used for PMC XML metadata extraction
* metadataextraction_cache_location - cache to be used for PDF metadata extraction
* output_remote_location - remote cluster output location where plaintexts should be copied with distcp
* reports_external_path - HDFS location where the report should be copied

Currently no additional transformation is applied, therefore `eu.dnetlib.iis.metadataextraction.schemas.DocumentText` avro records are exported.
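
The dependency between match_content_with_metadata and hbase_dump_location can be expressed as a small pre-flight check. The following Python sketch is only an illustration: `validate_parameters` is a hypothetical helper, not part of IIS or Oozie, and it assumes the flag is passed as the string `true` or `false`.

```python
from typing import Dict, List

# Parameters that the comment above lists without any condition attached.
ALWAYS_REQUIRED = (
    "objectstore_service_location",
    "approved_objectstores_csv",
    "ingest_pmc_cache_location",
    "metadataextraction_cache_location",
    "output_remote_location",
    "reports_external_path",
)


def validate_parameters(params: Dict[str, str]) -> List[str]:
    """Return a list of problems found in the workflow parameters."""
    errors = [f"missing parameter: {name}" for name in ALWAYS_REQUIRED if not params.get(name)]
    flag = params.get("match_content_with_metadata", "").lower()
    if flag not in ("true", "false"):
        errors.append("match_content_with_metadata must be 'true' or 'false'")
    elif flag == "true" and not params.get("hbase_dump_location"):
        # The dump location is only mandatory when matching is enabled.
        errors.append("hbase_dump_location is required when match_content_with_metadata is true")
    return errors
```

An empty list means the parameter set is consistent with the rules described above.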

Aligning cache retriever workflow with aggregation subsystem based source of contents metadata. Other minor changes.

Introducing `LicenseBasedApprover` in `ImportInformationSpaceJob` to eliminate Result entities having `bestaccessright` license property incompatible with `license_whitelist` provided at runtime. `license_whitelist` default value is set to `$UNDEFINED$` which disables licensing restriction.
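
The whitelist behaviour described above can be sketched in Python as follows. This is not the actual `LicenseBasedApprover` implementation: the comma-separated whitelist format, the function names and the example values are assumptions made for illustration. Only the `$UNDEFINED$` default and its meaning come from the commit message.

```python
from typing import Optional, Set

UNDEFINED = "$UNDEFINED$"  # default value that disables the licensing restriction


def parse_whitelist(value: str) -> Optional[Set[str]]:
    """Return the set of accepted values, or None when no restriction applies."""
    if value == UNDEFINED:
        return None
    # Assumption: the whitelist is provided as a comma-separated string.
    return {item.strip() for item in value.split(",") if item.strip()}


def is_approved(best_access_right: str, whitelist: Optional[Set[str]]) -> bool:
    """Accept every entity when unrestricted, otherwise only whitelisted ones."""
    if whitelist is None:
        return True
    return best_access_right in whitelist
```

With the default value every Result entity passes; with an explicit whitelist only entities whose `bestaccessright` value is listed are kept.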

Transforming avro to json using avro2json subworkflow when producing output with plaintexts.
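
As a rough illustration of the output shape, the Python sketch below serializes DocumentText-like records as one JSON object per line. It uses neither an Avro library nor the actual avro2json subworkflow. The field names `id` and `text` and the one-object-per-line layout are assumptions.

```python
import json
from typing import Dict, Iterable, Iterator, Optional


def records_to_json_lines(records: Iterable[Dict[str, Optional[str]]]) -> Iterator[str]:
    """Serialize DocumentText-like records, one JSON object per line."""
    for record in records:
        # Assumed fields: a mandatory "id" and an optional "text".
        yield json.dumps(
            {"id": record["id"], "text": record.get("text")},
            ensure_ascii=False,  # keep non-ASCII plaintext readable
        )
```

Each yielded line can be parsed back independently with `json.loads`, which keeps the exported plaintexts easy to consume outside of Hadoop.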

Extending LicenseBasedApprover with access right and license verification performed on the same instance object. Best access right is not taken into account anymore.
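
One possible reading of "performed on the same instance object" is that a single instance has to satisfy both conditions at once. The Python sketch below illustrates that reading only. The `Instance` class, its field names and the example values are hypothetical and do not reflect the actual IIS data model.

```python
from dataclasses import dataclass
from typing import Iterable, Set


@dataclass(frozen=True)
class Instance:
    access_right: str
    license: str


def is_result_approved(
    instances: Iterable[Instance],
    accepted_access_rights: Set[str],
    license_whitelist: Set[str],
) -> bool:
    """Approve a result only if one instance passes both checks together."""
    return any(
        inst.access_right in accepted_access_rights and inst.license in license_whitelist
        for inst in instances
    )
```

Under this reading, a result with one open but non-whitelisted instance and another whitelisted but closed instance is rejected, because no single instance meets both conditions.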

marekhorst added the labels activity: impl, functionality: export, functionality: metadataextraction on Oct 10, 2018

marekhorst self-assigned this Oct 10, 2018

marekhorst mentioned this issue Oct 11, 2018

Introduce counters support in union3 and union4 transformers #919

Closed

marekhorst added a commit that referenced this issue Aug 2, 2022

Closes #918: Devise an automated procedure to export plaintexts

a61fc08

Aligning cache retriever workflow with aggregation subsystem based source of contents metadata. Other minor changes.

marekhorst mentioned this issue Aug 3, 2022

Add optional compression support in avro2json workflow #1371

Open

marekhorst added a commit that referenced this issue Aug 3, 2022

Closes #918: Devise an automated procedure to export plaintexts

409411b

Transforming avro to json using avro2json subworkflow when producing output with plaintexts.
