Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cannot parse EMPTY token with white space #1013

Closed
pramsey opened this issue Dec 8, 2023 · 1 comment · Fixed by #1025
Closed

Cannot parse EMPTY token with white space #1013

pramsey opened this issue Dec 8, 2023 · 1 comment · Fixed by #1025
Labels
Bug Good First Issue New developers consider for experience.

Comments

@pramsey
Copy link
Member

pramsey commented Dec 8, 2023

This does not parse

MULTIPOINT( EMPTY, (10 10), (20 20))

this does

MULTIPOINT(EMPTY, (10 10), (20 20))

Here's a test case

// EMPTY token with some white space
template<>
template<>
void object::test<24>
()
{
    GeomPtr geom(wktreader.read("MULTIPOINT( EMPTY, (10 10), (20 20))"));
    auto ngeoms = geom->getNumGeometries();

    ensure(ngeoms == 3);
}
@pramsey pramsey added Bug Good First Issue New developers consider for experience. labels Dec 8, 2023
@JamesParrott
Copy link
Contributor

JamesParrott commented Dec 22, 2023

Hi Paul,

Thanks for the example. I could easily reproduce it Python, using shapely.from_wkt

>>> shapely.from_wkt("MULTIPOINT( EMPTY, (10 10), (20 20))")
...
shapely.errors.GEOSException: ParseException: Expected number but encountered word: 'EMPTY'

Within the Geos code base, it certainly seems like the intention was to ignore whitespace, as str.find_first_not_of(" \r\n\t",... is used (notice the space in the string " \r\n\t"), both in StringTokenizer::peekNextToken,

pos = str.find_first_not_of(" \r\n\t", static_cast<string::size_type>(iter - str.begin()));
. and in StringTokenizer::nextToken

Possible explanation

Within WKTReader::readMultiPointText,

WKTReader::readMultiPointText(StringTokenizer* tokenizer, OrdinateSet& ordinateFlags) const
, on encountering EMPTY preceded by a space, either tokenizer->peekNextToken does return the enum StringTokenizer::TT_WORD, or perhaps due to some unlikely, very subtle bug it incorrectly returns StringTokenizer::TT_NUMBER. The third, error handling, else clause has nothing to do with it. In either scenario, flow continues to call WKTReader::getPreciseCoordinate (either directly in the number case), or if treated as TT_WORD) probably via calls to WKTReader::readPointText and WKTReader::getCoordinates. This calls WKTReader::getNextNumber

The Exception seen in Shapely above: Expected number but encountered word: 'EMPTY', is thrown inside WKTReader::getNextNumber

throw ParseException("Expected number but encountered word", tokenizer->getSVal());

If StringTokenizer::TT_WORD is being returned correctly by tokenizer->peekNextToken, then within WKTReader::getCoordinates possibly the check for EMPTY at

if(nextToken == "EMPTY") {
is foiled, as perhaps
in WKTReader::getNextEmptyOrOpener, within the call therein to ``WKTReader::getNextWord, the call to tokenizer->nextToken()` , which advances the iterator, should instead call `tokenizer->peekNextToken`?

All those functions are fundamental to the module though, well tested, and used by so many other WKT types (e.g. WKTReader::getCoordinates is also used by POINT, LINESTRING and LINEARRING and POLYGON) that very strong justification is needed before changing them. A special function for Multipoint could be added but that would duplicate code and increase maintenance work. Doing it properly, there's a big risk of breaking something else in fixing this particular bug.

So before I spend even longer digging in to this, please could someone confirm what the intended scope of the flavour of WKT that Geos will parse is? In the Geos WKT grammar spec, EMPTY is not supported -within the parens- in Multipoints at all (meaning a Multipoint is either a non-empty sequence of points, or it's empty. And a point cannot be empty):

<point> ::= <x> <y> [ <z> ] [ <m> ]

<multipoint text> ::=
    <empty set> |
    <left paren> <point> {<comma> <point>} ... <right paren>

<multipoint text representation> ::=
    MULTIPOINT [ <zm> ] <multipoint text>

https://libgeos.org/specifications/wkt/

So yes this is a bug. But please can someone either confirm or rule out, that the bug is actually because EMPTY should not be supported within the parens of a MULTIPOINT at all?

Parsers can be free to choose to parse a superset of a standard (if that standard doesn't require an error to be raised on invalid grammar). WKT is intended to be human readable. Either requiring structural whitespace or forbidding it would both defeat that very purpose. More importantly, in the official WKT spec, section 7.2.2 (2D geom only), multipoints are defined as empty, or a non-empty sequence of point texts, and point texts may be empty:

<multipoint tagged text> ::= multipoint <multipoint text> 
<multipoint text> ::= <empty set> | <left paren>
                                     <point text>
                                     {<comma> <point text>}* 
                                     <right paren>
<point text> ::= <empty set> | <left paren> <point> <right paren> 
 <point> ::= <x> <y> 

https://www.ogc.org/standard/sfa/
https://portal.ogc.org/files/?artifact_id=25355

So perhaps my point is just a quibble that the BNF for WKT in the geos documentation needs making consistent with (the OG) the OGC's.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Good First Issue New developers consider for experience.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants