Cannot parse EMPTY token with white space #1013

pramsey · 2023-12-08T17:35:56Z

This does not parse

MULTIPOINT( EMPTY, (10 10), (20 20))

this does

MULTIPOINT(EMPTY, (10 10), (20 20))

Here's a test case

// EMPTY token with some white space
template<>
template<>
void object::test<24>
()
{
    GeomPtr geom(wktreader.read("MULTIPOINT( EMPTY, (10 10), (20 20))"));
    auto ngeoms = geom->getNumGeometries();

    ensure(ngeoms == 3);
}

The text was updated successfully, but these errors were encountered:

JamesParrott · 2023-12-22T12:52:15Z

Hi Paul,

Thanks for the example. I could easily reproduce it Python, using shapely.from_wkt

>>> shapely.from_wkt("MULTIPOINT( EMPTY, (10 10), (20 20))")
...
shapely.errors.GEOSException: ParseException: Expected number but encountered word: 'EMPTY'

Within the Geos code base, it certainly seems like the intention was to ignore whitespace, as str.find_first_not_of(" \r\n\t",... is used (notice the space in the string " \r\n\t"), both in StringTokenizer::peekNextToken,

geos/src/io/StringTokenizer.cpp

Line 145 in 0aef713

    
           pos = str.find_first_not_of(" \r\n\t", static_cast<string::size_type>(iter - str.begin()));

. and in StringTokenizer::nextToken

Possible explanation

Within WKTReader::readMultiPointText,

geos/src/io/WKTReader.cpp

Line 331 in 0aef713

    
           WKTReader::readMultiPointText(StringTokenizer* tokenizer, OrdinateSet& ordinateFlags) const

, on encountering EMPTY preceded by a space, either tokenizer->peekNextToken does return the enum StringTokenizer::TT_WORD, or perhaps due to some unlikely, very subtle bug it incorrectly returns StringTokenizer::TT_NUMBER. The third, error handling, else clause has nothing to do with it. In either scenario, flow continues to call WKTReader::getPreciseCoordinate (either directly in the number case), or if treated as TT_WORD) probably via calls to WKTReader::readPointText and WKTReader::getCoordinates. This calls WKTReader::getNextNumber

The Exception seen in Shapely above: Expected number but encountered word: 'EMPTY', is thrown inside WKTReader::getNextNumber

geos/src/io/WKTReader.cpp

Line 140 in 0aef713

    
           throw  ParseException("Expected number but encountered word", tokenizer->getSVal());

If StringTokenizer::TT_WORD is being returned correctly by tokenizer->peekNextToken, then within WKTReader::getCoordinates possibly the check for EMPTY at

geos/src/io/WKTReader.cpp

Line 66 in 0aef713

if(nextToken == "EMPTY") {

is foiled, as perhaps
in WKTReader::getNextEmptyOrOpener, within the call therein to ``WKTReader::getNextWord, the call to tokenizer->nextToken()` , which advances the iterator, should instead call `tokenizer->peekNextToken`?

All those functions are fundamental to the module though, well tested, and used by so many other WKT types (e.g. WKTReader::getCoordinates is also used by POINT, LINESTRING and LINEARRING and POLYGON) that very strong justification is needed before changing them. A special function for Multipoint could be added but that would duplicate code and increase maintenance work. Doing it properly, there's a big risk of breaking something else in fixing this particular bug.

So before I spend even longer digging in to this, please could someone confirm what the intended scope of the flavour of WKT that Geos will parse is? In the Geos WKT grammar spec, EMPTY is not supported -within the parens- in Multipoints at all (meaning a Multipoint is either a non-empty sequence of points, or it's empty. And a point cannot be empty):

<point> ::= <x> <y> [ <z> ] [ <m> ]

<multipoint text> ::=
    <empty set> |
    <left paren> <point> {<comma> <point>} ... <right paren>

<multipoint text representation> ::=
    MULTIPOINT [ <zm> ] <multipoint text>

https://libgeos.org/specifications/wkt/

So yes this is a bug. But please can someone either confirm or rule out, that the bug is actually because EMPTY should not be supported within the parens of a MULTIPOINT at all?

Parsers can be free to choose to parse a superset of a standard (if that standard doesn't require an error to be raised on invalid grammar). WKT is intended to be human readable. Either requiring structural whitespace or forbidding it would both defeat that very purpose. More importantly, in the official WKT spec, section 7.2.2 (2D geom only), multipoints are defined as empty, or a non-empty sequence of point texts, and point texts may be empty:

<multipoint tagged text> ::= multipoint <multipoint text> 
<multipoint text> ::= <empty set> | <left paren>
                                     <point text>
                                     {<comma> <point text>}* 
                                     <right paren>
<point text> ::= <empty set> | <left paren> <point> <right paren> 
 <point> ::= <x> <y>

https://www.ogc.org/standard/sfa/
https://portal.ogc.org/files/?artifact_id=25355

So perhaps my point is just a quibble that the BNF for WKT in the geos documentation needs making consistent with (the OG) the OGC's.

pramsey added Bug Good First Issue New developers consider for experience. labels Dec 8, 2023

mwtoews mentioned this issue Jan 9, 2024

Fix StringTokenizer::peekNextToken #1025

Merged

mwtoews closed this as completed in #1025 Jan 14, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Cannot parse EMPTY token with white space #1013

Cannot parse EMPTY token with white space #1013

pramsey commented Dec 8, 2023

JamesParrott commented Dec 22, 2023 •

edited

Loading

Cannot parse EMPTY token with white space #1013

Cannot parse EMPTY token with white space #1013

Comments

pramsey commented Dec 8, 2023

JamesParrott commented Dec 22, 2023 • edited Loading

Possible explanation

JamesParrott commented Dec 22, 2023 •

edited

Loading