Devise an automated procedure to export plaintexts #918

Open
marekhorst opened this issue Oct 10, 2018 · 0 comments

marekhorst commented Oct 10, 2018

As a result of this procedure we would like to obtain a datastore of (documentId, text) pairs. It should be possible to export this datastore to an external Hadoop cluster.

This could be achieved by defining a workflow that binds the already available IIS modules:

  • iis-wf-import
  • iis-wf-metadataextraction
  • iis-wf-ingest-pmc
  • iis-wf-transformers

and we could define it inside the iis-wf-metadataextraction module.

Technically it will extract data from the underlying PDF and XML-PMC caches. We could reuse already existing workflows:

and supplement them with an exporting module producing output in the desired format. The dedicated workflow could be based on the already existing primary_import workflow definition.

We should support the following input parameters:

  • match_content_with_metadata - flag indicating that contents should be filtered against metadata entries retrieved from InformationSpace. Publication identifiers will be replaced by deduplicated ones. This way only contents having a metadata representation will be exported. When disabled, hbase_dump_location does not need to be provided.
  • hbase_dump_location - InfoSpace dump location (may point to a remote cluster), required for filtering contents against their metadata representatives and for mapping original identifiers to deduplicated ones
  • objectstore_service_location - ObjectStore service location
  • approved_objectstores_csv - predefined set of ObjectStores to be handled
  • ingest_pmc_cache_location - XML PMC texts cache location
  • metadataextraction_cache_location - PDF texts cache location
  • metadataextraction_excluded_checksums - set of excluded PDF checksums; should be defined in the config-default.xml file placed on the IIS cluster instead of being provided at runtime
  • output_remote_location - desired output location, may point to an external cluster (e.g. DM)
  • reports_external_path - local IIS cluster path where the processing metrics should be stored
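The dependency between match_content_with_metadata and hbase_dump_location can be checked before a run is submitted. Below is a minimal Python sketch of such a check. The parameter names are taken from the list above; the helper `missing_parameters`, the assumption that the flag is passed as the string "true"/"false", and the choice to treat all remaining parameters as mandatory are illustrative assumptions and not part of IIS.

```python
from typing import List, Mapping

# Parameter names come from the list above. Treating all of them as
# mandatory is an assumption made for this sketch.
ALWAYS_REQUIRED = (
    "objectstore_service_location",
    "approved_objectstores_csv",
    "ingest_pmc_cache_location",
    "metadataextraction_cache_location",
    "output_remote_location",
    "reports_external_path",
)


def missing_parameters(params: Mapping[str, str]) -> List[str]:
    """Return the names of parameters that are absent or blank (hypothetical helper)."""
    required = list(ALWAYS_REQUIRED)
    # hbase_dump_location is only needed when contents are matched with metadata.
    # Assumption: the flag arrives as the string "true" or "false".
    if params.get("match_content_with_metadata", "").strip().lower() == "true":
        required.append("hbase_dump_location")
    return [name for name in required if not params.get(name, "").strip()]
```

metadataextraction_excluded_checksums is deliberately left out of the check, since it is meant to come from config-default.xml rather than from runtime input.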
marekhorst self-assigned this Oct 10, 2018
marekhorst added a commit that referenced this issue Oct 11, 2018
Introducing cache_retriever uber workflow realizing requested logic.

The following input parameters are required:

* match_content_with_metadata - flag indicating that contents should be filtered against metadata entries retrieved from InformationSpace. This way only contents having a metadata representation will be processed. To be disabled when processing new contents whose metadata is not yet available in HBase, or when original identifiers should be preserved (in that case entries are not filtered either)
* hbase_dump_location - InformationSpace HBase dump location, usually points to remote cluster HDFS location. Required when ${match_content_with_metadata} is set to true
* objectstore_service_location - object store service location required by IIS to import contents
* approved_objectstores_csv - CSV of object stores being subject of inference processing
* ingest_pmc_cache_location - cache to be used for PMC XML metadata extraction
* metadataextraction_cache_location - cache to be used for PDF metadata extraction
* output_remote_location - remote cluster output location where plaintexts should be distcped
* reports_external_path - HDFS location where report should be copied

Currently no additional transformation is applied, therefore `eu.dnetlib.iis.metadataextraction.schemas.DocumentText` Avro records are exported.
marekhorst added a commit that referenced this issue Oct 26, 2018
marekhorst added a commit that referenced this issue Aug 2, 2022
marekhorst added a commit that referenced this issue Aug 2, 2022
Aligning cache retriever workflow with aggregation subsystem based source of contents metadata.

Other minor changes.
marekhorst added a commit that referenced this issue Aug 2, 2022
Introducing `LicenseBasedApprover` in `ImportInformationSpaceJob` to eliminate Result entities having `bestaccessright` license property incompatible with `license_whitelist` provided at runtime.

`license_whitelist` default value is set to `$UNDEFINED$` which disables licensing restriction.
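The whitelist behaviour described in this commit message can be illustrated with a small Python sketch. It is not the `LicenseBasedApprover` implementation: the function name `is_license_approved`, the comma-separated whitelist format, and the exact-match comparison are all assumptions made for illustration. Only the `$UNDEFINED$` default and its meaning (no licensing restriction) come from the message above.

```python
UNDEFINED = "$UNDEFINED$"  # default value named in the commit message


def is_license_approved(best_access_right: str,
                        license_whitelist: str = UNDEFINED) -> bool:
    """Decide whether an entity passes the license filter (hypothetical helper)."""
    # The default value disables the licensing restriction entirely.
    if license_whitelist == UNDEFINED:
        return True
    # Assumption: the whitelist is a comma-separated list of accepted values.
    allowed = {item.strip() for item in license_whitelist.split(",") if item.strip()}
    # Assumption: values are compared by exact, case-sensitive match.
    return best_access_right in allowed
```

With the default whitelist every entity is accepted; with a whitelist such as "OPEN, EMBARGO" (hypothetical values) only entities whose `bestaccessright` equals one of the listed values pass.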
marekhorst added a commit that referenced this issue Aug 3, 2022
Transforming avro to json using avro2json subworkflow when producing output with plaintexts.
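To give a rough idea of what an Avro-to-JSON step means for the exported plaintexts, here is a Python sketch that serializes records as one JSON object per line. It does not reproduce the avro2json subworkflow: the record field names `id` and `text` are assumed from the documentId and text pairs requested in the issue description, the function `to_json_lines` is hypothetical, and the actual output layout is not specified here.

```python
import json
from typing import Dict, Iterable, Iterator


def to_json_lines(records: Iterable[Dict[str, str]]) -> Iterator[str]:
    """Serialize each record as a single JSON line (hypothetical helper)."""
    for record in records:
        # ensure_ascii=False keeps non-ASCII plaintext readable in the output;
        # sort_keys=True makes the field order deterministic.
        yield json.dumps(record, ensure_ascii=False, sort_keys=True)


if __name__ == "__main__":
    # Assumed record shape: a document identifier and its extracted plaintext.
    sample = [{"id": "doc-1", "text": "First plaintext."}]
    for line in to_json_lines(sample):
        print(line)
```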
marekhorst added a commit that referenced this issue Dec 14, 2022
Extending LicenseBasedApprover with access right and license verification performed on the same instance object. Best access right is not taken into account anymore.