WKT lexer #2360

eriktim · 2014-07-13T21:26:41Z

As I found myself some spare time and enthusiasm, I tried to write a lexer to replace the regex-based parsing of WKT strings (see also #2172). Here is the result. Please let me know whether this is anywhere near the concepts that have been conceived.
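For readers unfamiliar with the approach, here is a minimal sketch of what tokenizing a WKT string in a single pass can look like. It is not the code from this pull request: `TokenType`, `tokenize` and the helper functions are hypothetical names chosen for illustration, and the sketch only covers letters, numbers, parentheses and commas.

```javascript
// Hypothetical token types (illustrative names, not taken from the PR).
var TokenType = {
  TEXT: 'TEXT',
  LEFT_PAREN: 'LEFT_PAREN',
  RIGHT_PAREN: 'RIGHT_PAREN',
  NUMBER: 'NUMBER',
  COMMA: 'COMMA',
  EOF: 'EOF'
};

function isAlpha(c) {
  return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
}

function isNumericStart(c) {
  return (c >= '0' && c <= '9') || c === '.' || c === '-' || c === '+';
}

function isNumericPart(c) {
  return isNumericStart(c) || c === 'e' || c === 'E';
}

/**
 * Split a WKT string into tokens in one left-to-right pass.
 * Throws on characters or numbers it cannot interpret.
 * @param {string} wkt WKT string.
 * @return {Array.<Object>} Tokens, terminated by an EOF token.
 */
function tokenize(wkt) {
  var tokens = [];
  var i = 0;
  while (i < wkt.length) {
    var c = wkt.charAt(i);
    var start = i;
    if (c === ' ' || c === '\t' || c === '\n' || c === '\r') {
      ++i; // whitespace only separates tokens
    } else if (c === '(') {
      tokens.push({type: TokenType.LEFT_PAREN});
      ++i;
    } else if (c === ')') {
      tokens.push({type: TokenType.RIGHT_PAREN});
      ++i;
    } else if (c === ',') {
      tokens.push({type: TokenType.COMMA});
      ++i;
    } else if (isAlpha(c)) {
      while (i < wkt.length && isAlpha(wkt.charAt(i))) {
        ++i;
      }
      // Keywords are normalized to upper case.
      tokens.push({
        type: TokenType.TEXT,
        value: wkt.substring(start, i).toUpperCase()
      });
    } else if (isNumericStart(c)) {
      while (i < wkt.length && isNumericPart(wkt.charAt(i))) {
        ++i;
      }
      var text = wkt.substring(start, i);
      var value = Number(text);
      if (isNaN(value)) {
        throw new Error('Invalid number "' + text + '" at position ' + start);
      }
      tokens.push({type: TokenType.NUMBER, value: value});
    } else {
      throw new Error('Unexpected character "' + c + '" at position ' + i);
    }
  }
  tokens.push({type: TokenType.EOF});
  return tokens;
}

console.log(tokenize('POINT(1.0 2.0)'));
```

A parser would then consume this token list recursively (geometry type, opening parenthesis, coordinate lists, closing parenthesis) instead of repeatedly splitting and matching the string with regular expressions.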

tschaub · 2014-07-14T17:44:39Z

This looks like really nice work @GingerIK. I haven't given it a thorough review yet, but hope to find time to do so soon. I know the current tests are missing this, but it might be nice to see tests of the behavior when given invalid WKT. And while a lexer is definitely cooler than regex-based parsing, we probably should at least give a nod to pragmatism and look to see if there are benefits in terms of parsing speed (especially given that this looks like it would bump up the build size).

@GingerIK do you think it would be possible to put together some simple benchmarks (run against compiled code)? And it would be nice to report the build size before and after.

At some point we should establish a convention for benchmarking (adding utilities as dev-dependencies as needed) and track changes to those benchmarks. It would be nice to be able to evaluate proposed changes based on changes to benchmarks (and build size). This should be ticketed separately though.

tschaub · 2014-07-14T18:01:39Z

src/ol/format/wktformat.js

+  'MultiLineString': ol.format.WKT.encodeMultiLineStringGeometry_,
+  'MultiPolygon': ol.format.WKT.encodeMultiPolygonGeometry_,
+  'GeometryCollection': ol.format.WKT.encodeGeometryCollectionGeometry_
+};


I know this is how the GeoJSON format (and others) are written, but it strikes me that we would get smaller compiled output if instead we used the ol.geom.GeometryType enum. I imagine this means it couldn't be @const, but I'm not sure we get much benefit from that. So something like:

/**
 * @type {Object.<string, function(ol.geom.Geometry): string>}
 */
ol.format.WKT.GeometryEncoder_ = {};
ol.format.WKT.GeometryEncoder_[ol.geom.GeometryType.POINT] =
    ol.format.WKT.encodePointGeometry_;
// ... etc.

Ugly to write. But perhaps smaller in the end.

It would also be fine with me to handle this consistently (as you do) and change other formats at the same time.

I totally agree (both on the ol.geom.GeometryType and the 'ugly' part). So I definitely would like to give it a try to compare both builds. However, I got a bit stuck on the upper camel case used by ol.geom.GeometryType. Converting the (typically uppercase) WKT type names to camel case isn't very nice. Do you happen to know a simple workaround?

For encoding, CamelCase keys in this object are appropriate (since that's what geometry.getType() will return). For decoding, a separate enum could be used with UPPERCASE keys (these would be used in assigning to ol.format.WKT.Parser.GeometryParser_ and ol.format.WKT.Parser.GeometryConstructor_ below). But I think both of these would be minor minification optimizations and should be handled separately.
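As a rough sketch of the split described above, the following uses one lookup table keyed by CamelCase type names for encoding and a separate table keyed by UPPERCASE WKT keywords for decoding. Everything here is a hypothetical stand-in: `GeometryType`, `WktKeyword`, `encoders`, `encode` and `typeFromKeyword` are illustrative names, and geometries are plain objects rather than ol.geom instances.

```javascript
// Hypothetical stand-in for the CamelCase strings geometry.getType() returns.
var GeometryType = {
  POINT: 'Point',
  LINE_STRING: 'LineString'
};

// Hypothetical separate enum with UPPERCASE values, as they appear in WKT.
var WktKeyword = {
  POINT: 'POINT',
  LINE_STRING: 'LINESTRING'
};

// Encoding side: keyed by CamelCase geometry type.
var encoders = {};
encoders[GeometryType.POINT] = function(coordinates) {
  return coordinates.join(' ');
};
encoders[GeometryType.LINE_STRING] = function(coordinates) {
  return coordinates.map(function(c) {
    return c.join(' ');
  }).join(',');
};

// Decoding side: keyed by UPPERCASE WKT keyword.
var typesByKeyword = {};
typesByKeyword[WktKeyword.POINT] = GeometryType.POINT;
typesByKeyword[WktKeyword.LINE_STRING] = GeometryType.LINE_STRING;

/**
 * Encode a plain {type, coordinates} object as WKT.
 */
function encode(geometry) {
  var encoder = encoders[geometry.type];
  if (!encoder) {
    throw new Error('Unsupported geometry type: ' + geometry.type);
  }
  return geometry.type.toUpperCase() + '(' +
      encoder(geometry.coordinates) + ')';
}

/**
 * Map a WKT keyword (any case) to the CamelCase geometry type.
 */
function typeFromKeyword(keyword) {
  var type = typesByKeyword[keyword.toUpperCase()];
  if (!type) {
    throw new Error('Unknown WKT keyword: ' + keyword);
  }
  return type;
}

console.log(encode({type: 'LineString', coordinates: [[1, 2], [3, 4]]}));
console.log(typeFromKeyword('linestring'));
```

With this layout no case conversion from WKT keywords to camel case is needed at parse time; the decoding table does the mapping directly.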

I'll give it a try later on. I don't expect too much out of it either.

FYI, a short test on the ol.format.WKT.GeometryEncoder_ surprisingly shows an increase of 22 bytes...

eriktim · 2014-07-15T11:51:44Z

Thanks for the feedback @tschaub.

I ran a simple benchmark earlier on, which showed the lexer to be about twice as fast. I'll post an updated version of this benchmark later on.

Concerning the build size I ran the following script:

hash0=`git merge-base master wkt-lexer`
hash1=`git rev-parse wkt-lexer`

# regex
git checkout $hash0
./build.py -c && ./build.py build
size0=`cat build/ol.js | wc -c`

# lexer
git checkout $hash1
./build.py -c && ./build.py build
size1=`cat build/ol.js | wc -c`

diff=`expr $size1 - $size0`
diffPerc=`echo "$diff $size0" | awk '{printf "%0.2f%%", 100*$1/$2}'`

echo "$hash0 $size0"
echo "$hash1 $size1 (+$diff bytes, +$diffPerc)"

Which results in:

cc9acef01fd4810fa8d5c09dd1c6cac8a6d4bea5 376873
3d475ac665cfbc3e304cfc0e31471df2bc2468eb 378071 (+1198 bytes, +0.32%)

Moreover, I added some tests to validate empty and invalid geometries.
An important note: getCoordinates() on ol.geom.Point failed, so I added a simple workaround.

That's it for now!

eriktim · 2014-07-15T15:35:51Z

OK, here is the speed test.

I temporarily added the following code to examples/wkt.js:

var geoms = ['POINT(1.0 2.0)',
             'LINESTRING(1.0 2.0,3.0 4.0)',
             'POLYGON((1.0 2.0,3.0 4.0),(5.0 6.0))',
             'MULTIPOINT((1.0 2.0),(3.0 4.0))',
             'MULTILINESTRING((1.0 2.0,3.0 4.0),(5.0 6.0))',
             'MULTIPOLYGON(((1.0 2.0,3.0 4.0),(5.0 6.0)),((7.0 8.0,9.0 0.0)))'];
geoms.push('GEOMETRYCOLLECTION(' + geoms.join(',') + ')');
geoms.push('MULTIPOLYGON(((1.0 2.0,3.0 4.0),(5.0 6.0)),((7.0 8.0,9.0 0.0)),' +
    '((1.0 2.0,3.0 4.0),(5.0 6.0)),((7.0 8.0,9.0 0.0)),' +
    '((1.0 2.0,3.0 4.0),(5.0 6.0)),((7.0 8.0,9.0 0.0)),' +
    '((1.0 2.0,3.0 4.0),(5.0 6.0)),((7.0 8.0,9.0 0.0)),' +
    '((1.0 2.0,3.0 4.0),(5.0 6.0)),((7.0 8.0,9.0 0.0)),' +
    '((1.0 2.0,3.0 4.0),(5.0 6.0)),((7.0 8.0,9.0 0.0)))');
var points = [];
for (var i = 0; i < 100; ++i) {
  points.push(i + ' ' + i);
}
geoms.push('LINESTRING(' + points.join(',') + ')');

console.time('TOTAL');
for (var k = 0, kk = geoms.length; k < kk; ++k) {
  var geom = geoms[k];
  var type = geom.substr(0, geom.indexOf('('));
  console.time(type);
  for (var i = 0; i < 100000; ++i) {
    format.readFeature(geom);
  }
  console.timeEnd(type);
}
console.timeEnd('TOTAL');

Giving these results:

Regex parser

POINT:                3903.893ms
LINESTRING:           6179.088ms
POLYGON:              9847.962ms
MULTIPOINT:           6485.805ms
MULTILINESTRING:     10072.264ms
MULTIPOLYGON:        15957.217ms
GEOMETRYCOLLECTION:  42606.922ms
MULTIPOLYGON:        77302.610ms
LINESTRING:         128722.897ms
TOTAL:              301079.753ms

Lexer parser

POINT:                3994.630ms (102.3%)
LINESTRING:           4362.073ms ( 70.6%)
POLYGON:              4772.003ms ( 48.5%)
MULTIPOINT:           4750.383ms ( 73.2%)
MULTILINESTRING:      4918.195ms ( 48.8%)
MULTIPOLYGON:         5563.338ms ( 34.9%)
GEOMETRYCOLLECTION:  17542.039ms ( 41.2%)
MULTIPOLYGON:        10879.889ms ( 14.1%)
LINESTRING:          20842.405ms ( 16.2%)
TOTAL:               77626.011ms ( 25.8%)

So it looks like the more complex the WKT string becomes, the more the lexer pays off (in terms of speed). Which kinda makes sense.
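For reference, the percentages in the second table are the lexer time divided by the regex time for the same row. A small sketch that recomputes them from the timings posted above (the `rows` layout and the `percent` and `speedup` helpers are just illustrative names):

```javascript
// [label, regex ms, lexer ms], copied from the two tables above.
// The second MULTIPOLYGON and LINESTRING rows are the larger test geometries.
var rows = [
  ['POINT', 3903.893, 3994.630],
  ['LINESTRING', 6179.088, 4362.073],
  ['POLYGON', 9847.962, 4772.003],
  ['MULTIPOINT', 6485.805, 4750.383],
  ['MULTILINESTRING', 10072.264, 4918.195],
  ['MULTIPOLYGON', 15957.217, 5563.338],
  ['GEOMETRYCOLLECTION', 42606.922, 17542.039],
  ['MULTIPOLYGON', 77302.610, 10879.889],
  ['LINESTRING', 128722.897, 20842.405],
  ['TOTAL', 301079.753, 77626.011]
];

// Lexer time as a percentage of regex time, one decimal place.
function percent(regexMs, lexerMs) {
  return (100 * lexerMs / regexMs).toFixed(1);
}

// How many times faster the lexer is, one decimal place.
function speedup(regexMs, lexerMs) {
  return (regexMs / lexerMs).toFixed(1);
}

rows.forEach(function(row) {
  console.log(row[0] + ': ' + percent(row[1], row[2]) + '% of regex time, ' +
      speedup(row[1], row[2]) + 'x as fast');
});
```

Read this way, the total of 25.8% corresponds to the lexer being roughly 3.9 times as fast over the whole run, while the single POINT case is about even.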

tschaub · 2014-07-15T21:22:28Z

src/ol/format/wktformat.js

+ * Class to tokenize a WKT string.
+ * @param {string} wkt WKT string.
+ * @constructor
+ * @protected


This could be @private I think (and named ol.format.WKT.Lexer_). Also minor and not necessary to change.

I agree, and the same holds for the parser. But somehow it feels more comfortable to use protected classes. I guess it looks better :-)

tschaub · 2014-07-15T21:39:20Z

This is very nice work @GingerIK. Thanks for the great contribution - and the extra work providing benchmarks and build results!

eriktim · 2014-07-16T11:06:10Z

You're most welcome!

elemoine · 2014-08-18T19:47:00Z

I just took a look at this patch. Great contribution!

eriktim · 2014-08-19T21:48:33Z

Thanks @elemoine

eriktim added 2 commits July 13, 2014 22:36

Encode WKT strings statically

621aafb

Parse WKT strings using a lexer/parser

4c03b3b

tschaub reviewed Jul 14, 2014

eriktim added 2 commits July 15, 2014 13:20

Allow for empty Point & GeometryCollection

fe8a72d

Encode empty geometries as WKT strings

4abc887

Add tests for empty & invalid WKT strings

1e7dc5c

tschaub reviewed Jul 15, 2014

tschaub added a commit that referenced this pull request Jul 15, 2014

Merge pull request #2360 from gingerik/wkt-lexer

7a26966

WKT lexer.

tschaub merged commit 7a26966 into openlayers:master Jul 15, 2014

eriktim deleted the wkt-lexer branch July 16, 2014 11:06

eriktim mentioned this pull request Jul 31, 2014

[wip] Benchmark #2480

Closed
