WKT lexer #2360

Merged
merged 5 commits into from Jul 15, 2014

Contributor

@eriktim eriktim commented Jul 13, 2014

Having found some spare time and enthusiasm, I wrote a lexer to replace the regex-based parsing of WKT strings (see also #2172). Here is the result. Please let me know whether this is anywhere near what you had in mind.
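
For readers unfamiliar with the approach: a lexer walks the WKT string once and emits tokens (keywords, parentheses, commas, numbers), which a parser then consumes, instead of matching the whole string with nested regular expressions. The sketch below is only an illustration of that idea and is not the code from this pull request; `tokenize`, `TokenType`, and the helper functions are hypothetical names.

```javascript
// Hypothetical token types for illustration; not the names used in the PR.
var TokenType = {
  TEXT: 'text',
  LEFT_PAREN: 'left-paren',
  RIGHT_PAREN: 'right-paren',
  COMMA: 'comma',
  NUMBER: 'number'
};

function isAlpha(c) {
  return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
}

function isDigit(c) {
  return c >= '0' && c <= '9';
}

function isWhitespace(c) {
  return c === ' ' || c === '\t' || c === '\r' || c === '\n';
}

// Split a WKT string into tokens in one left-to-right pass.
// Exponent notation (e.g. 1e5) is deliberately not handled in this sketch.
function tokenize(wkt) {
  var tokens = [];
  var i = 0;
  while (i < wkt.length) {
    var c = wkt.charAt(i);
    var start = i;
    if (isWhitespace(c)) {
      ++i;
    } else if (c === '(') {
      tokens.push({type: TokenType.LEFT_PAREN});
      ++i;
    } else if (c === ')') {
      tokens.push({type: TokenType.RIGHT_PAREN});
      ++i;
    } else if (c === ',') {
      tokens.push({type: TokenType.COMMA});
      ++i;
    } else if (isAlpha(c)) {
      // Keywords are normalized to uppercase.
      while (i < wkt.length && isAlpha(wkt.charAt(i))) {
        ++i;
      }
      tokens.push({
        type: TokenType.TEXT,
        value: wkt.substring(start, i).toUpperCase()
      });
    } else if (isDigit(c) || c === '-' || c === '.') {
      ++i;
      while (i < wkt.length &&
          (isDigit(wkt.charAt(i)) || wkt.charAt(i) === '.')) {
        ++i;
      }
      var value = Number(wkt.substring(start, i));
      if (isNaN(value)) {
        throw new Error('Invalid number at position ' + start);
      }
      tokens.push({type: TokenType.NUMBER, value: value});
    } else {
      throw new Error('Unexpected character "' + c + '" at position ' + i);
    }
  }
  return tokens;
}
```

With this sketch, `tokenize('POINT(1.0 2.0)')` yields five tokens: the keyword, an opening parenthesis, two numbers, and a closing parenthesis.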

Member

tschaub commented Jul 14, 2014

This looks like really nice work @GingerIK. I haven't given it a thorough review yet, but hope to find time to do so soon. I know the current tests are missing this, but it would be nice to see tests of the behavior when given invalid WKT. And while a lexer is definitely cooler than regex-based parsing, we should probably at least give a nod to pragmatism and check whether there are benefits in terms of parsing speed (especially given that this looks like it would bump up the build size).

@GingerIK do you think it would be possible to put together some simple benchmarks (run against compiled code)? And it would be nice to report the build size before and after.

At some point we should establish a convention for benchmarking (adding utilities as dev-dependencies as needed) and track changes to those benchmarks. It would be nice to be able to evaluate proposed changes based on changes to benchmarks (and build size). This should be ticketed separately though.

'MultiLineString': ol.format.WKT.encodeMultiLineStringGeometry_,
'MultiPolygon': ol.format.WKT.encodeMultiPolygonGeometry_,
'GeometryCollection': ol.format.WKT.encodeGeometryCollectionGeometry_
};
Member

I know this is how the GeoJSON format (and others) are written, but it strikes me that we would get smaller compiled output if instead we used the ol.geom.GeometryType enum. I imagine this means it couldn't be @const, but I'm not sure we get much benefit from that. So something like:

/**
 * @type {Object.<string, function(ol.geom.Geometry): string>}
 */
ol.format.WKT.GeometryEncoder_ = {};
ol.format.WKT.GeometryEncoder_[ol.geom.GeometryType.POINT] = ol.format.WKT.encodePointGeometry_;
// ... etc.

Ugly to write. But perhaps smaller in the end.

It would also be fine with me to handle this consistently (as you do) and change other formats at the same time.

Contributor Author

I totally agree (both on ol.geom.GeometryType and on the 'ugly' part), so I would definitely like to try it and compare both builds. However, I got a bit stuck on the upper camel case used by ol.geom.GeometryType: converting the (typically uppercase) WKT type names to camel case isn't very nice. Do you happen to know a simple workaround?

Member

For encoding, CamelCase keys in this object are appropriate (since that's what geometry.getType() will return). For decoding, a separate enum could be used with UPPERCASE keys (these would be used in assigning to ol.format.WKT.Parser.GeometryParser_ and ol.format.WKT.Parser.GeometryConstructor_ below). But I think both of these would be minor minification optimizations and should be handled separately.
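
The split described above could look roughly like the following. This is a self-contained sketch rather than OpenLayers code: `encoders` and `parsers` are hypothetical stand-ins for `ol.format.WKT.GeometryEncoder_` and `ol.format.WKT.Parser.GeometryParser_`, a plain `type` property stands in for `geometry.getType()`, and the string splitting is a naive placeholder for real parsing.

```javascript
// Encoding: keys are the CamelCase names a geometry reports for itself.
var encoders = {
  'Point': function(geom) {
    return geom.coordinates.join(' ');
  },
  'LineString': function(geom) {
    return geom.coordinates.map(function(c) {
      return c.join(' ');
    }).join(',');
  }
};

// Decoding: keys are the UPPERCASE keywords found in WKT text.
var parsers = {
  'POINT': function(body) {
    return {type: 'Point', coordinates: parsePosition(body)};
  },
  'LINESTRING': function(body) {
    return {
      type: 'LineString',
      coordinates: body.split(',').map(parsePosition)
    };
  }
};

// Naive "x y" splitter, used only to keep the sketch short.
function parsePosition(text) {
  return text.trim().split(/\s+/).map(Number);
}

// Look up an own property only, so keys like 'toString' are rejected.
function lookup(table, key) {
  if (!Object.prototype.hasOwnProperty.call(table, key)) {
    throw new Error('Unsupported geometry type: ' + key);
  }
  return table[key];
}

function encode(geom) {
  return geom.type.toUpperCase() + '(' + lookup(encoders, geom.type)(geom) + ')';
}

function decode(wkt) {
  var open = wkt.indexOf('(');
  var close = wkt.lastIndexOf(')');
  if (open === -1 || close < open) {
    throw new Error('Malformed WKT: ' + wkt);
  }
  var keyword = wkt.substring(0, open).trim().toUpperCase();
  return lookup(parsers, keyword)(wkt.substring(open + 1, close));
}
```

Keeping two tables avoids converting between `LINESTRING` and `LineString` at run time; each direction simply uses the spelling it already has in hand.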

Contributor Author

I'll give it a try later on. I don't expect too much out of it either.

Contributor Author

FYI, a short test on the ol.format.WKT.GeometryEncoder_ surprisingly shows an increase of 22 bytes...

Contributor Author

eriktim commented Jul 15, 2014

Thanks for the feedback @tschaub.

I ran a simple benchmark earlier, which showed the lexer to be about twice as fast. I'll post an updated version of this benchmark later.

Concerning the build size I ran the following script:

hash0=`git merge-base master wkt-lexer`
hash1=`git rev-parse wkt-lexer`

# regex
git checkout $hash0
./build.py -c && ./build.py build
size0=`cat build/ol.js | wc -c`

# lexer
git checkout $hash1
./build.py -c && ./build.py build
size1=`cat build/ol.js | wc -c`

diff=`expr $size1 - $size0`
diffPerc=`echo "$diff $size0" | awk '{printf "%0.2f%%", 100*$1/$2}'`

echo "$hash0 $size0"
echo "$hash1 $size1 (+$diff bytes, +$diffPerc)"

Which results in:

cc9acef01fd4810fa8d5c09dd1c6cac8a6d4bea5 376873
3d475ac665cfbc3e304cfc0e31471df2bc2468eb 378071 (+1198 bytes, +0.32%)

Moreover, I added some tests to validate empty and invalid geometries.
An important note: getCoordinates() on ol.geom.Point failed, so I added a simple workaround.
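
To illustrate what "empty" and "invalid" input can mean for a token-based parser, the sketch below handles only `POINT` and treats `POINT EMPTY` as a point without coordinates. It is not the parser or the tests from this pull request: `parsePoint`, the regex-based token splitting used here for brevity, and the choice of `null` coordinates for an empty point are all assumptions made for the example.

```javascript
// Hypothetical POINT-only parser that rejects malformed input.
function parsePoint(wkt) {
  // Words, numbers, and any other single non-space character become tokens,
  // so unexpected characters surface as tokens the parser can reject.
  var tokens = wkt.match(/[A-Za-z]+|-?\d+(?:\.\d+)?|\S/g) || [];
  var pos = 0;

  function next() {
    return tokens[pos++];
  }

  function expect(value) {
    var token = next();
    if (token !== value) {
      throw new Error('Expected "' + value + '" but found "' + token + '"');
    }
  }

  function number() {
    var token = next();
    if (token === undefined || !/^-?\d+(\.\d+)?$/.test(token)) {
      throw new Error('Expected a number but found "' + token + '"');
    }
    return parseFloat(token);
  }

  var keyword = next();
  if (keyword === undefined || keyword.toUpperCase() !== 'POINT') {
    throw new Error('Not a POINT: ' + wkt);
  }

  var coordinates;
  if (tokens[pos] !== undefined && tokens[pos].toUpperCase() === 'EMPTY') {
    ++pos;
    coordinates = null; // assumption: an empty point has no coordinates
  } else {
    expect('(');
    coordinates = [number(), number()];
    expect(')');
  }

  if (pos !== tokens.length) {
    throw new Error('Unexpected trailing input in: ' + wkt);
  }
  return {type: 'Point', coordinates: coordinates};
}
```

Inputs such as `POINT(1.0)`, `POINT(1 2` or `POINT(1 2) x` throw in this sketch, while `POINT EMPTY` returns a point whose coordinates are `null`.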

That's it for now!

Contributor Author

eriktim commented Jul 15, 2014

OK, here is the speed test.

I temporarily added the following code to examples/wkt.js:

var geoms = ['POINT(1.0 2.0)',
             'LINESTRING(1.0 2.0,3.0 4.0)',
             'POLYGON((1.0 2.0,3.0 4.0),(5.0 6.0))',
             'MULTIPOINT((1.0 2.0),(3.0 4.0))',
             'MULTILINESTRING((1.0 2.0,3.0 4.0),(5.0 6.0))',
             'MULTIPOLYGON(((1.0 2.0,3.0 4.0),(5.0 6.0)),((7.0 8.0,9.0 0.0)))'];
geoms.push('GEOMETRYCOLLECTION(' + geoms.join(',') + ')');
geoms.push('MULTIPOLYGON(((1.0 2.0,3.0 4.0),(5.0 6.0)),((7.0 8.0,9.0 0.0)),' +
    '((1.0 2.0,3.0 4.0),(5.0 6.0)),((7.0 8.0,9.0 0.0)),' +
    '((1.0 2.0,3.0 4.0),(5.0 6.0)),((7.0 8.0,9.0 0.0)),' +
    '((1.0 2.0,3.0 4.0),(5.0 6.0)),((7.0 8.0,9.0 0.0)),' +
    '((1.0 2.0,3.0 4.0),(5.0 6.0)),((7.0 8.0,9.0 0.0)),' +
    '((1.0 2.0,3.0 4.0),(5.0 6.0)),((7.0 8.0,9.0 0.0)))');
var points = [];
for (var i = 0; i < 100; ++i) {
  points.push(i + ' ' + i);
}
geoms.push('LINESTRING(' + points.join(',') + ')');

console.time('TOTAL');
for (var k = 0, kk = geoms.length; k < kk; ++k) {
  var geom = geoms[k];
  var type = geom.substr(0, geom.indexOf('('));
  console.time(type);
  for (var i = 0; i < 100000; ++i) {
    format.readFeature(geom);
  }
  console.timeEnd(type);
}
console.timeEnd('TOTAL');

Giving these results:

Regex parser

POINT:                3903.893ms
LINESTRING:           6179.088ms
POLYGON:              9847.962ms
MULTIPOINT:           6485.805ms
MULTILINESTRING:     10072.264ms
MULTIPOLYGON:        15957.217ms
GEOMETRYCOLLECTION:  42606.922ms
MULTIPOLYGON:        77302.610ms
LINESTRING:         128722.897ms
TOTAL:              301079.753ms

Lexer parser (percentages: time relative to the regex parser)

POINT:                3994.630ms (102.3%)
LINESTRING:           4362.073ms ( 70.6%)
POLYGON:              4772.003ms ( 48.5%)
MULTIPOINT:           4750.383ms ( 73.2%)
MULTILINESTRING:      4918.195ms ( 48.8%)
MULTIPOLYGON:         5563.338ms ( 34.9%)
GEOMETRYCOLLECTION:  17542.039ms ( 41.2%)
MULTIPOLYGON:        10879.889ms ( 14.1%)
LINESTRING:          20842.405ms ( 16.2%)
TOTAL:               77626.011ms ( 25.8%)

So it looks like the more complex the WKT string becomes, the more the lexer pays off in terms of speed, which kinda makes sense.
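
The percentages in the lexer table are the lexer time divided by the regex time for the same input. The snippet below redoes that arithmetic for three of the rows, using the figures reported above; `relativeTime` is a hypothetical helper name introduced only for this example.

```javascript
// Lexer time as a percentage of the regex time, rounded to one decimal.
function relativeTime(regexMs, lexerMs) {
  return (100 * lexerMs / regexMs).toFixed(1) + '%';
}

// Timings copied from the two result tables above.
var rows = [
  {label: 'POINT', regex: 3903.893, lexer: 3994.630},
  {label: 'POLYGON', regex: 9847.962, lexer: 4772.003},
  {label: 'TOTAL', regex: 301079.753, lexer: 77626.011}
];

rows.forEach(function(row) {
  console.log(row.label + ': ' + relativeTime(row.regex, row.lexer));
});
// POINT: 102.3%
// POLYGON: 48.5%
// TOTAL: 25.8%
```

For the total, the lexer takes roughly a quarter of the regex parser's time, while for a single `POINT` the two are about equal.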

* Class to tokenize a WKT string.
* @param {string} wkt WKT string.
* @constructor
* @protected
Member

This could be @private I think (and named ol.format.WKT.Lexer_). Also minor and not necessary to change.

Contributor Author

I agree, and the same holds for the parser. But somehow it feels more comfortable to use protected classes. I guess it looks better :-)

Member

tschaub commented Jul 15, 2014

This is very nice work @GingerIK. Thanks for the great contribution - and the extra work providing benchmarks and build results!

tschaub added a commit that referenced this pull request Jul 15, 2014
@tschaub tschaub merged commit 7a26966 into openlayers:master Jul 15, 2014
Contributor Author

eriktim commented Jul 16, 2014

You're most welcome!

@eriktim eriktim deleted the wkt-lexer branch July 16, 2014 11:06
@eriktim eriktim mentioned this pull request Jul 31, 2014
@elemoine
Member

I just took a look at this patch. Great contribution!

Contributor Author

eriktim commented Aug 19, 2014

Thanks @elemoine
