Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Integrate EOSC services mining with IIS primary workflow #1398

Closed

Conversation

marekhorst
Copy link
Member

Originally requested in: #1394 and on redmine: https://support.openaire.eu/issues/6354.

This PR introduces new EOSC services mining subworkflow and fully integrates it with the IIS primary workflow.

It includes importing services from the graph, processing (mining) and exporting matched datasources as a dedicated ActionSet.

marekhorst and others added 8 commits May 10, 2023 17:10
Implementing datasource importer module importing services. Supplementing unit and integration tests suite.
Implementing service mining exporter module. Supplementing unit and integration tests suite.
Introducing sqlite_builder phase producing sqlitedb with services imported from an avro datastore.
Introducing main_sqlite phase performing the text mining.
Binding all sub-phases in main uber workflow.
Introducing integration tests for eosc service mining.
Integrating EOSC services mining with IIS primary workflow. Integrating EOSC services mining with the IIS primary processing integration test.
Changing service record eligibility rule from eosctype=Service to monitoring if given record was collected from the specified datasource. This was required due to the datasource deduplication process which resulted in some datasources having eosctype set to 'Service' being replaced by the new representative having eosctype set to "Data Source".

Increasing allowed mapper timeout for main_sqlite phase due to slow performance when running mining on the full set of datasources, not services only.
Making `url` field in `eu.dnetlib.iis.importer.schemas.Service` avro record mandatory as it is required by the mining script.
@marekhorst marekhorst requested a review from mpol May 10, 2023 15:16
@marekhorst
Copy link
Member Author

One of the commits addresses an obsolete class removal: #1393.

@mpol
Copy link
Contributor

mpol commented May 11, 2023

One of the commits addresses an obsolete class removal: #1393.

Is there a good reason to mix this change in? Looks quite separate.

Copy link
Contributor

@mpol mpol left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good as far as I can tell. I only have two small questions and one small suggestion that could be handled in a separate PR.

// ------------------------ PRIVATE --------------------------

/**
* Creates similarity related actions.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm probably missing something, but is this code related to similarity?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, this is probably some copy/paste leftover. Fixed.

@@ -136,10 +133,14 @@ public static void main(String[] args) throws Exception {
OafEntityToAvroConverter<Result, DataSetReference> datasetConverter = new DatasetMetadataConverter(dataInfoBasedApprover);
OafEntityToAvroConverter<Organization, eu.dnetlib.iis.importer.schemas.Organization> organizationConverter = new OrganizationConverter();
OafEntityToAvroConverter<Project, eu.dnetlib.iis.importer.schemas.Project> projectConverter = new ProjectConverter();
OafEntityToAvroConverter<Datasource, eu.dnetlib.iis.importer.schemas.Service> serviceConverter = new ServiceConverter(
(StringUtils.isNotBlank(params.eligibleServiceCollectedFromDatasourceId)
&& !UNDEFINED_NONEMPTY_VALUE.equals(params.eligibleServiceCollectedFromDatasourceId)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe extract the test to a method? It looks reusable and is quite long. Possibly even do:

String eligibleId = nullIfNotValid(params.eligibleServiceCollectedFromDatasourceId);
serviceConverter = new ServiceConverter(eligibleId);

There are already similar methods in WorkflowRuntimeParameters but not an exact match.

Copy link
Member Author

@marekhorst marekhorst May 12, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeap, WorkflowRuntimeParameters is where this piece of code was taken from. Just didn't bother to do the refactoring but you're right, let's keep it clean.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just introduced getValueOrNullIfNotValid() in WorkflowRuntimeParameters and refactored other methods to rely on this newly introduced method.


String targetDbLocation = System.getProperty("java.io.tmpdir") + File.separatorChar + "services.db";
File targetDbFile = new File(targetDbLocation);
targetDbFile.setWritable(true);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method does not work for nonexistent files, so is the file supposed to exist already?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As far as I vaguely remember it was required at some point so it looks like under some circumstances (idk, maybe running this process on the same datanode which run this process before?) this file might already exist.

This piece of code also originates from other *DBBuilder class so what I could do is to extract this to a dedicated common method and create a dedicated issue to take a closer look at this.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done: #1399

I don't want to spin another batch of refactoring as a part of #1398 so I am delegating this task to become part of this newly created issue.

@marekhorst
Copy link
Member Author

One of the commits addresses an obsolete class removal: #1393.

Is there a good reason to mix this change in? Looks quite separate.

My initial idea was to put this change into a separate commit which seemed be enough in terms of "isolation". But now I realized the was followed up with a bunch of #1394 related commits which I'd prefer to squash into the single one thus making this initial idea obsolete.

I will find a way to revert this change and reintroduce it as a separate commit outside this thread.

@marekhorst
Copy link
Member Author

This pull request was supplemented with bunch of code review fixes, commits were squashed into the single f45b54f commit which was pushed into the master branch.

#1393 was extracted into the separate 06f0eed commit and also merged with the master branch.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants