Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Integrate new aggregation subsystem based content importer module #1361

Closed

Conversation

marekhorst
Copy link
Member

@marekhorst marekhorst commented Jun 29, 2022

This pull request closes #1298 by introducing support for importing DocumentContentUrl records from the newly introduced hive-based content aggregation subsystem.

The aggregation subsystem support is automatically enabled after providing explicit import_content_pdfaggregation_table_name parameter value at runtime. When the parameter is unspecified then content importer module works in legacy mode (objectstore compatible).

This pull request includes agg_subsystem_based_importer integration test which works properly on the main ocean cluster but cannot be run on a dedicated CI cluster due to the missing hive-service therefore it is currently disabled in eu.dnetlib.iis.wf.importer.content.WorkflowTest class and should be reenabled after installing Hive service on CI cluster (dedicated issue covering this task was already crated: #1360).

…Store in favour of newly introduced PDF aggregation service

Initial implementation, proof of concept.
…Store in favour of newly introduced PDF aggregation service

Refined implementation. Subworkflow still to be integrated with IIS main workflow.
…Store in favour of newly introduced PDF aggregation service

Limiting entries to the ones with actual_url not null in other not to violate the DocumentContentURL schema constraints.
…Store in favour of newly introduced PDF aggregation service

Renaming `actual_url` to `location` as a column reference pointing to S3 content location.
…Store in favour of newly introduced PDF aggregation service

Enabling the new hive-based content metadata importer with content_url importer uber workflow and metadataextraction cache builder.
…Store in favour of newly introduced PDF aggregation service

Excluding conflicting jackson libraries.
…Store in favour of newly introduced PDF aggregation service

The following new properties are supported:
* `import_content_pdfaggregation_table_name` to be specified in at runtime 'dbname.tablename' format to indicate both hive database and table
* `import_content_pdfaggregation_hive_metastore_uris` with URIs pointing to the hive metastore utilized by the PDF aggregation subsystem, to be defined statically in IIS environment (default-config.xml file)

New hive-based PDF aggregation service support is automatically enabled by providing explicit `import_content_pdfaggregation_table_name` parameter value at runtime. When the parameter is unspecified then content importer module works in legacy mode (objectstore compatible).
Copy link
Contributor

@mpol mpol left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. I think I understand what's going on here.


@BeforeEach
public void beforeEach() {
super.beforeEach();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I find this convention rather confusing. If we have a method in the superclass that is explicitly annotated with @BeforeEach, overriding it and then using super instead of just letting the inherited method be run seems like the wrong thing to do. But I see that this is a common convention in iis so this PR is not a place to change it.

Copy link
Member Author

@marekhorst marekhorst Jun 29, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right. I didn't pay attention to this, just copy&pasted from some other test class.

I guess it might be some leftover from the time when the IIS test code base was switched from junit4 to junit5 as it seems the default behavior has changed. Again, I didn't check this thoroughly so let's do the cleanup as a part of the dedicated #1365 task.

}

//@Test
//TODO reenable this test once Hive is installed on CI Test hadoop cluster
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A perfect case for @Disabled("TODO reenable this test once Hive is installed on CI Test hadoop cluster"), I think.

Even @DisabledIf could be fun if Hive is easily detectable, but that's not necessarily a good idea.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will stick to @Disabled instead of @DisabledIf because in the end we want to have it enabled just need to fix the environment.

@@ -239,7 +280,7 @@

<decision name="decision-existence-filter">
<switch>
<case to="content-url-dispatcher">${input_id eq "$UNDEFINED$"}</case>
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This particular change could be confusing but it is bound to a specific commit which is closing the #1342 issue spotted when working on #1298.

Maybe it is not worth to elevate it to a dedicated PR but we should find a nice way to indicate this solution somehow in the commit comment once the commits get squshed before the merge.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I was surprised at first but found the reason for this change.

I personally prefer just not squashing, but adding "Also closes #1342, which was found during this work, by going to content-url-dedup from decision-existence-filter in the content_url_chain" should be enough, I think.

…Store in favour of newly introduced PDF aggregation service

Applying code review fixes.
@marekhorst
Copy link
Member Author

Already merged with IIS master branch with this squashed commit.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
2 participants