Devise an automated procedure to export plaintexts #918
marekhorst added a commit that referenced this issue on Oct 11, 2018:

Introducing the cache_retriever uber workflow realizing the requested logic. The following input parameters are required:
* match_content_with_metadata - flag indicating contents should be filtered against metadata entries retrieved from the InformationSpace, so that only contents having a metadata representation will be processed. To be disabled when processing new contents whose metadata is not yet available in HBase, or when original identifiers should be preserved (entries will then not be filtered either)
* hbase_dump_location - InformationSpace HBase dump location, usually pointing to a remote cluster HDFS location. Required when ${match_content_with_metadata} is set to true
* objectstore_service_location - object store service location required by IIS to import contents
* approved_objectstores_csv - CSV of object stores subject to inference processing
* ingest_pmc_cache_location - cache to be used for PMC XML metadata extraction
* metadataextraction_cache_location - cache to be used for PDF metadata extraction
* output_remote_location - remote cluster output location where plaintexts should be copied with distcp
* reports_external_path - HDFS location where the report should be copied

Currently no additional transformation is applied, therefore `eu.dnetlib.iis.metadataextraction.schemas.DocumentText` avro records are exported.
marekhorst added a commit that referenced this issue on Oct 26, 2018, carrying the same commit message as above.
marekhorst added a commit that referenced this issue on Aug 2, 2022, again carrying the same commit message as above.
marekhorst added a commit that referenced this issue on Aug 2, 2022:

Aligning the cache retriever workflow with the aggregation-subsystem-based source of contents metadata. Other minor changes.
marekhorst added a commit that referenced this issue on Aug 2, 2022:

Introducing `LicenseBasedApprover` in `ImportInformationSpaceJob` to eliminate Result entities whose `bestaccessright` license property is incompatible with the `license_whitelist` provided at runtime. The `license_whitelist` default value is set to `$UNDEFINED$`, which disables the licensing restriction.
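A minimal sketch of what such whitelist-based approval could look like; the class and method names below are illustrative assumptions, not the actual `LicenseBasedApprover` implementation:

```java
import java.util.Set;

// Illustrative sketch only: approves an entity based on its bestaccessright
// license checked against a whitelist, mirroring the behaviour described above.
public class LicenseApprovalSketch {

    // Sentinel value disabling the licensing restriction, per the commit message.
    private static final String UNDEFINED = "$UNDEFINED$";

    private final Set<String> licenseWhitelist;

    public LicenseApprovalSketch(Set<String> licenseWhitelist) {
        this.licenseWhitelist = licenseWhitelist;
    }

    public boolean approve(String bestAccessRightLicense) {
        // When the whitelist is left at its default, no entity is eliminated.
        if (licenseWhitelist.contains(UNDEFINED)) {
            return true;
        }
        return bestAccessRightLicense != null
                && licenseWhitelist.contains(bestAccessRightLicense);
    }
}
```

The `$UNDEFINED$` sentinel keeps the filter a no-op by default, so enabling the licensing restriction is an explicit runtime decision.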
marekhorst added a commit that referenced this issue on Aug 3, 2022:

Transforming avro to json using the avro2json subworkflow when producing the output with plaintexts.
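For illustration, this is roughly what per-record Avro-to-JSON conversion looks like with the standard Avro Java API; the inline schema is an assumed minimal shape for `DocumentText` (an identifier plus extracted plaintext), not necessarily the exact IIS schema:

```java
import java.io.ByteArrayOutputStream;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.DatumWriter;
import org.apache.avro.io.EncoderFactory;
import org.apache.avro.io.JsonEncoder;

public class AvroToJsonSketch {
    public static void main(String[] args) throws Exception {
        // Assumed minimal DocumentText shape: an identifier plus extracted plaintext.
        Schema schema = new Schema.Parser().parse(
                "{\"type\":\"record\",\"name\":\"DocumentText\","
                + "\"fields\":[{\"name\":\"id\",\"type\":\"string\"},"
                + "{\"name\":\"text\",\"type\":\"string\"}]}");

        GenericRecord record = new GenericData.Record(schema);
        record.put("id", "doc_1");
        record.put("text", "extracted plaintext ...");

        // Serialize the record through Avro's JSON encoder, which is the kind
        // of conversion an avro2json step performs on each record.
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        DatumWriter<GenericRecord> writer = new GenericDatumWriter<>(schema);
        JsonEncoder encoder = EncoderFactory.get().jsonEncoder(schema, out);
        writer.write(record, encoder);
        encoder.flush();

        System.out.println(out.toString("UTF-8")); // {"id":"doc_1","text":"extracted plaintext ..."}
    }
}
```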
marekhorst added a commit that referenced this issue on Dec 14, 2022:

Extending `LicenseBasedApprover` with access right and license verification performed on the same instance object. The best access right is not taken into account anymore.
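Under the same caveat (illustrative names and types, not the IIS API), the revised rule could be sketched as requiring both checks to pass on a single instance:

```java
import java.util.List;
import java.util.Set;

// Illustrative sketch of per-instance verification: access right and license
// are checked together on the same instance object; the entity-level
// bestaccessright is no longer consulted.
public class InstanceLevelApprovalSketch {

    // Minimal stand-in for a result instance carrying its own access metadata.
    public record Instance(String accessRight, String license) {}

    private final Set<String> acceptedAccessRights;
    private final Set<String> licenseWhitelist;

    public InstanceLevelApprovalSketch(Set<String> acceptedAccessRights,
                                       Set<String> licenseWhitelist) {
        this.acceptedAccessRights = acceptedAccessRights;
        this.licenseWhitelist = licenseWhitelist;
    }

    public boolean approve(List<Instance> instances) {
        // Both conditions must hold for the same instance, so an acceptable
        // access right on one instance cannot be paired with a whitelisted
        // license on a different one.
        return instances.stream().anyMatch(inst ->
                acceptedAccessRights.contains(inst.accessRight())
                        && licenseWhitelist.contains(inst.license()));
    }
}
```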
As a result of this procedure we would like to obtain a datastore with `documentId` and `text` pairs. It should be possible to export such a datastore to an external hadoop cluster.

This could be achieved by defining a workflow definition binding already available IIS modules:
* `iis-wf-import`
* `iis-wf-metadataextraction`
* `iis-wf-ingest-pmc`
* `iis-wf-transformers`
We could define it inside the `iis-wf-metadataextraction` module. Technically it will extract data from the underlying PDF and XML-PMC caches. We could reuse already existing workflows:
* `import_infospace` - for importing the identifiers deduplication mapping
* `importer_content_url_chain` - for importing contents URLs required by the metadataextraction and pmc_ingestion submodules
* `metadataextraction_cache` - to handle both PDF and XML-PMC caches
* `transformers_metadataextraction_documenttext` - for plaintext extraction out of `ExtractedDocumentMetadata` cached records

and supplement them with an exporting module producing the outcome in the desired format. The dedicated workflow could be based on the already existing `primary_import` workflow definition.

We should support the following input parameters (a sample parameter file sketch follows the list):
* `match_content_with_metadata` - flag indicating contents should be filtered against metadata entries retrieved from the InformationSpace. Publications identifiers will be replaced by deduplicated ones; this way only contents having a metadata representation will be exported. When disabled, `hbase_dump_location` need not be provided.
* `hbase_dump_location` - InfoSpace dump location (may point to a remote cluster), required for filtering contents against their metadata representatives and for mapping original identifiers to deduplicated ones
* `objectstore_service_location` - ObjectStore service location
* `approved_objectstores_csv` - predefined set of ObjectStores to be handled
* `ingest_pmc_cache_location` - XML PMC texts cache location
* `metadataextraction_cache_location` - PDF texts cache location
* `metadataextraction_excluded_checksums` - set of excluded PDF checksums; should be defined in the `config-default.xml` file placed on the IIS cluster instead of being provided at runtime
* `output_remote_location` - desired output location, may point to an external cluster (e.g. DM)
* `reports_external_path` - local IIS cluster path where the processing metrics should be stored
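For illustration, a hedged sketch of how these parameters might be supplied in an Oozie `job.properties`-style file; every value below is a placeholder, not an actual deployment setting:

```properties
# Placeholder values for the parameters described above; adjust paths and
# service URLs to the target deployment. metadataextraction_excluded_checksums
# is intentionally omitted, as it belongs in config-default.xml on the cluster.
match_content_with_metadata=true
hbase_dump_location=hdfs://remote-cluster-nn:8020/path/to/infospace/dump
objectstore_service_location=http://services.example.org/objectstore
approved_objectstores_csv=store-id-1,store-id-2
ingest_pmc_cache_location=hdfs:///cache/ingest_pmc
metadataextraction_cache_location=hdfs:///cache/metadataextraction
output_remote_location=hdfs://remote-cluster-nn:8020/exports/plaintexts
reports_external_path=/reports/cache_retriever
```

Were `match_content_with_metadata` set to `false`, `hbase_dump_location` could be omitted and the exported `DocumentText` records would keep their original identifiers.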