fix: download linked objects in a WFS using "resolvedepth" #1180
base: master
Conversation
Force-pushed from fc243b7 to ef11840
@stempler Any idea why my PRs are always failing with the same error in the "Comment with links to the artifacts" step?
Force-pushed from cadb5a6 to 5519274
```diff
-        && totalFeaturesProcessed >= maxNumberOfFeatures)
-        || (size != UNKNOWN_SIZE && totalFeaturesProcessed >= size);
+return maxNumberOfFeatures != UNLIMITED
+        && totalFeaturesProcessed >= maxNumberOfFeatures;
```
Won't we still get problems here because the additional objects are counted as features processed?
If we just set hits == UNKNOWN_SIZE, that's the code that is left. It didn't make sense to me to keep it, as size would then also always be UNKNOWN_SIZE.

I have tested with just 2 projects, which is probably not enough, but I haven't encountered any problems.
> If we just setup the hits == UNKNOWN_SIZE
> size would also be always UNKNOWN_SIZE

Can you explain why size should always be UNKNOWN_SIZE? In case the request does not contain a RESOLVEDEPTH parameter, the old logic can still be used IMO.
From this code:

```java
switch (wfsVersion) {
case "1.1.0":
    // The "numberOfFeatures" reported by a 1.1.0 WFS may be smaller
    // than the actual number of features matched by the query if the
    // number of features returned per query is limited on the server
    // side. Therefore do not rely on it as size information here.
    this.size = UNKNOWN_SIZE;
    break;
case "2.0.0":
case "2.0.2":
    // The "numberMatched" reported by a 2.0.0/2.0.2 WFS should be the
    // number of features matched by the query. If hits equals
    // UNKNOWN_SIZE then size is also set to that value.
    this.size = isLimited() ? Math.min(maxNumberOfFeatures, hits) : hits;
    break;
default:
    this.size = UNKNOWN_SIZE;
}
```

with hits = UNKNOWN_SIZE, the size also comes out as UNKNOWN_SIZE for the 2.0.0/2.0.2 case, independently of whether isLimited() is true or false.
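This can be checked with a minimal sketch. One assumption here: UNKNOWN_SIZE is a negative sentinel such as -1 (its actual value is not shown in the diff), which is what makes Math.min collapse to it:

```java
public class SizeSentinel {
    static final int UNKNOWN_SIZE = -1; // assumed sentinel value, not taken from the diff

    // Mirrors the 2.0.0/2.0.2 branch of the switch above: with hits forced
    // to the negative sentinel, Math.min picks it regardless of the limit.
    static int computeSize(boolean limited, int maxNumberOfFeatures, int hits) {
        return limited ? Math.min(maxNumberOfFeatures, hits) : hits;
    }

    public static void main(String[] args) {
        System.out.println(computeSize(true, 1000, UNKNOWN_SIZE));  // -1
        System.out.println(computeSize(false, 1000, UNKNOWN_SIZE)); // -1
        System.out.println(computeSize(true, 1000, 500));           // 500
    }
}
```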
```diff
@@ -312,7 +279,18 @@ public WfsBackedGmlInstanceIterator iterator() {
  */
 @Override
 public boolean isEmpty() {
-    return size == 0;
+    if (!emptyInitialized) {
+        ResourceIterator<Instance> it = iterator();
```
I think a more effective way could be to use the hits query that was previously used - it could even be used with maxFeatures / count set to 1, as we are only interested in whether there are matches at all. With that we would avoid the server having to load and transfer a potentially large amount of data.
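A hedged sketch of what deriving such a cheap emptiness request could look like. EmptyCheck, hitsUri, and the example URL are hypothetical illustrations, not the actual hale implementation; RESULTTYPE=hits and COUNT are the WFS 2.0 request parameters the comment refers to:

```java
import java.net.URI;

public class EmptyCheck {
    // Hypothetical sketch: derive an "is there anything at all" request from
    // the primordial GetFeature URI by asking only for hits and at most one
    // match, so the server neither loads nor transfers feature data.
    static URI hitsUri(URI primordialUri) {
        String sep = (primordialUri.getQuery() == null) ? "?" : "&";
        return URI.create(primordialUri + sep + "RESULTTYPE=hits&COUNT=1");
    }

    public static void main(String[] args) {
        URI base = URI.create("https://example.com/wfs?SERVICE=WFS&REQUEST=GetFeature");
        System.out.println(hitsUri(base));
    }
}
```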
```java
// Use primordial URI and issue "hits" request to check if the WFS will
// return anything at all
int hits;
if (ignoreNumberMatched) {
    hits = UNKNOWN_SIZE;
}
else {
    try {
        hits = requestHits(primordialUri);
    } catch (WFSException e) {
        log.debug(MessageFormat.format(
                "Failed to perform hits query (REQUESTTYPE=hits): {0}", e.getMessage()), e);
        hits = UNKNOWN_SIZE;
    }
}

switch (wfsVersion) {
case "1.1.0":
    // The "numberOfFeatures" reported by a 1.1.0 WFS may be smaller
    // than the actual number of features matched by the query if the
    // number of features returned per query is limited on the server
    // side. Therefore do not rely on it as size information here.
    this.size = UNKNOWN_SIZE;
    break;
case "2.0.0":
case "2.0.2":
    // The "numberMatched" reported by a 2.0.0/2.0.2 WFS should be the
    // number of features matched by the query. If hits equals
    // UNKNOWN_SIZE then size is also set to that value.
    this.size = isLimited() ? Math.min(maxNumberOfFeatures, hits) : hits;
    break;
default:
    this.size = UNKNOWN_SIZE;
}
```
I'm not sure why this must be removed. I think this can still be used in general (as before) when the query does not contain a RESOLVEDEPTH
parameter, or am I missing something?
Commit message and PR should reference the related GitHub issue #1084
Or if we just set hits to UNKNOWN_SIZE, then the ignoreNumberMatched checkbox doesn't make any sense, because it won't be used.
Force-pushed from 70b72af to 91bac9e
@florianesser I have squashed the previous commit and this one, just to be able to see the change better.
```java
boolean isUsedresolvedepth = false;
for (NameValuePair nameValuePair : params) {
    if ("resolvedepth".equalsIgnoreCase(nameValuePair.getName())) {
        isUsedresolvedepth = true;
        break;
    }
}

if (ignoreNumberMatched || isUsedresolvedepth) {
```
@emanuelaepure10 I think this can be reduced to a single line, as primordialQueryParams contains all query parameters in upper case (see here).
```diff
-boolean isUsedresolvedepth = false;
-for (NameValuePair nameValuePair : params) {
-    if ("resolvedepth".equalsIgnoreCase(nameValuePair.getName())) {
-        isUsedresolvedepth = true;
-        break;
-    }
-}
-if (ignoreNumberMatched || isUsedresolvedepth) {
+if (ignoreNumberMatched || primordialQueryParams.contains("RESOLVEDEPTH")) {
```
Thanks! I overlooked primordialQueryParams.
Using "resolvedepth" when downloading data via WFS now also downloads the linked objects. The hits requested from a WFS are better set to UNKNOWN_SIZE if the request contains a "resolvedepth" parameter. ING-4128. Closes halestudio#1084
Force-pushed from 91bac9e to 8822bbc
LGTM
```diff
@@ -228,7 +228,8 @@ public WfsBackedGmlInstanceCollection(LocatableInputSupplier<? extends InputStre
 // Use primordial URI and issue "hits" request to check if the WFS will
 // return anything at all
 int hits;
-if (ignoreNumberMatched) {
+if (ignoreNumberMatched || primordialQueryParams.containsKey("RESOLVEDEPTH")) {
```
Maybe add a comment why this is done in case of RESOLVEDEPTH
being included?
Did you test if the paging works as expected when additionalObjects are returned?
I could imagine a possible impact could be that the request for the second page uses a wrong offset.
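The suspected paging problem can be illustrated with a toy sketch (hypothetical names and counters, not the actual request builder): the STARTINDEX for the next page should advance only by the number of main features of the previous page, so counting resolved additional objects as well would overshoot the offset.

```java
public class PagingOffset {
    // Hypothetical helper: compute the STARTINDEX for the next page from the
    // previous offset and the number of features counted on the current page.
    static int nextStartIndex(int currentStartIndex, int featuresCounted) {
        return currentStartIndex + featuresCounted;
    }

    public static void main(String[] args) {
        int pageSize = 100;         // main features per page
        int additionalObjects = 40; // resolved linked objects delivered with that page
        // correct: advance by the main features only
        System.out.println(nextStartIndex(0, pageSize));                     // 100
        // wrong: if additional objects are counted too, the offset overshoots
        // and features 100..139 would be skipped by the second request
        System.out.println(nextStartIndex(0, pageSize + additionalObjects)); // 140
    }
}
```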
> Did you test if the paging works as expected when additionalObjects are returned?
> I could imagine a possible impact could be that the request for the second page uses a wrong offset.

For this I believe I would need a bigger project. I will ask the service team.
```diff
@@ -225,10 +225,16 @@ public WfsBackedGmlInstanceCollection(LocatableInputSupplier<? extends InputStre
     }
 }
+
+// for a query containing RESOLVEDEPTH we disable the pagination
+if (primordialQueryParams.containsKey("RESOLVEDEPTH")) {
```
After discussing the issue with @florianesser, we identified a potential solution: disable pagination if the parameters contain "RESOLVEDEPTH".
Another approach could be to disable pagination directly from the UI. I attempted this but haven't found a viable solution yet.
@stempler, your input and ideas would be greatly appreciated to help us find the best solution.
Problem with that would be that in many cases it would then not be possible to load all data of a service.

- Was your intent to inform the user about this restriction in the UI?
- Might be unexpected when used from another context, e.g. Gradle or CLI
- Did you explore if you would see some way to determine if an instance originated from additionalObjects or not? (e.g. see if it is feasible to detect once entering additionalObjects and add the information to the created instances in the instance metadata)
> Was your intent to inform the user about this restriction in the UI?

Initially I wanted to disable the checkbox, but I didn't find a way; if you have any suggestion on how to achieve that, I would be very happy to hear it.

- I have tried to understand in the GmlReaderSettingsPage if there is any way to get the link and thereby check if the "resolvedepth" parameter is there.
- I tried to extend GmlReaderSettingsPage into the WFS UI package, but without success.
- Not the best solution, but we could add a note in the actual UI informing the user that the parameter for steps won't be used in case "resolvedepth" has been used.
> Might be unexpected when used from another context, e.g. Gradle or CLI

True.

> Did you explore if you would see some way to determine if an instance originated from additionalObjects or not?

I have tried, but again without getting the info on where the instance comes from. I will definitely try your suggestion.
```java
@@ -578,7 +597,31 @@ public Instance next() {
}

Instance instance = iterator.next();
return new StreamGmlInstance(instance, totalFeaturesProcessed++);
```
@stempler Could you help me understand how it is possible that for a totalFeaturesProcessed of 754 (which is the BP_Plan) and an additionalFeatureProcessed of 1844 processed here, with a sum of 2598, HS shows the following numbers:

Thank you
I would suggest to add a unit test where a WFS FeatureCollection file with additional objects is read. That doesn't allow to test the pagination, but if you include at least one object with a duplicate GML ID you can use this to test:
- The correct detection of the "additional objects"
- If your mechanism for detecting GML ID duplicates works as expected
Once it has been ruled out that a problem is hidden there the paging behavior can be analysed.
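A sketch of the duplicate-ID detection such a unit test would exercise. DuplicateIdCheck and findDuplicates are hypothetical names for illustration, not the PR's actual code:

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class DuplicateIdCheck {
    // Report every GML ID that occurs more than once. In a FeatureCollection
    // with additional objects, a repeated ID means the same object appears
    // both as a main feature and inside additionalObjects.
    static List<String> findDuplicates(List<String> gmlIds) {
        Set<String> seen = new HashSet<>();
        List<String> duplicates = new ArrayList<>();
        for (String id : gmlIds) {
            if (!seen.add(id)) { // add() returns false for an already-seen ID
                duplicates.add(id);
            }
        }
        return duplicates;
    }

    public static void main(String[] args) {
        System.out.println(findDuplicates(List.of("f1", "f2", "f1", "f3"))); // [f1]
    }
}
```

A test file with at least one such duplicate would then let the test assert both that additional objects are detected and that the duplicate handling works.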
```java
Iterable<QName> mapPropertiess = instance.getPropertyNames();
for (QName qName : mapPropertiess) {
    if (qName.getLocalPart().equals("id") && qName.getPrefix().equals("gml")) {
```
The namespace prefix is nothing that should be checked, because that can be an arbitrary value.
Here is an example where we check for a GML namespace.
Since the GML namespace used should be the same in the same data, I would suggest to only determine the property name once and after that use a more efficient approach than iterating all property names by getting the property directly.
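A sketch of the suggested approach: ignore the (arbitrary) prefix, match the namespace URI instead, and determine the QName only once so the property can afterwards be fetched directly. GmlIdLookup and findGmlIdName are hypothetical names, and the startsWith check against the common GML namespace base is a simplification for illustration:

```java
import java.util.List;
import javax.xml.namespace.QName;

public class GmlIdLookup {
    // Find the QName of the gml:id property by namespace URI, not by prefix.
    // The result should be determined once and cached; afterwards the property
    // can be accessed directly instead of iterating all property names again.
    static QName findGmlIdName(Iterable<QName> propertyNames) {
        for (QName name : propertyNames) {
            if ("id".equals(name.getLocalPart())
                    && name.getNamespaceURI().startsWith("http://www.opengis.net/gml")) {
                return name;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        List<QName> names = List.of(
                new QName("http://example.com/app", "name", "app"),
                // arbitrary prefix "x" - still matched, because only the URI counts
                new QName("http://www.opengis.net/gml/3.2", "id", "x"));
        System.out.println(findGmlIdName(names));
    }
}
```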
There is something I don't understand. In the GmlInstanceCollection the TypeDefinition def actually has a namespace URI starting with http://www.opengis.net/wfs, but by the time it arrives in the WfsBackedGmlInstanceCollection the instance no longer contains a namespace URI starting with http://www.opengis.net/wfs, just the QName (id, http://www.opengis.net/gml, gml), so it cannot even be compared with a namespace such as http://www.opengis.net/wfs.
```java
boolean condition = (maxNumberOfFeatures != UNLIMITED
        && totalFeaturesProcessed >= maxNumberOfFeatures)
        || (size != UNKNOWN_SIZE && totalFeaturesProcessed >= size)
        || (totalFeaturesProcessed == size && !iterator.hasNext());
```
Why is there a change to the condition needed with this additional term?
Because the additional objects found after the last main instance would be lost.
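A sketch of the condition with that rationale attached as a comment. IterationEnd, reachedEnd, and the sentinel values are hypothetical stand-ins (hasNext replaces iterator.hasNext(), and UNLIMITED/UNKNOWN_SIZE are assumed to be -1), not the actual class:

```java
public class IterationEnd {
    static final int UNLIMITED = -1;    // assumed sentinel values, not taken
    static final int UNKNOWN_SIZE = -1; // from the diff

    // Sketch of the end-of-iteration condition from the diff above.
    static boolean reachedEnd(int maxNumberOfFeatures, int totalFeaturesProcessed,
            int size, boolean hasNext) {
        return (maxNumberOfFeatures != UNLIMITED
                && totalFeaturesProcessed >= maxNumberOfFeatures)
                || (size != UNKNOWN_SIZE && totalFeaturesProcessed >= size)
                // per the PR discussion: this term accounts for additional
                // objects found after the last main instance, which would
                // otherwise be lost
                || (totalFeaturesProcessed == size && !hasNext);
    }
}
```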
Thank you. I think it would be very valuable to have that as a comment on that term.
@stempler do you happen to remember if there is a test example somewhere that contains "additionalObjects"? Thanks
Using "resolvedepth" when downloading data via WFS now also downloads the linked objects. As the hits requested from a WFS are better set to UNKNOWN_SIZE, so that the size also always comes out as UNKNOWN_SIZE, the code related to hits and size has been removed and the related methods have been changed accordingly.

ING-4128
Closes #1084