Replace geowave subproject with GeoTrellis/GeoWave data adapter (#3364)
* Replace geowave subproject with GeoTrellis / GeoWave data adapter

* Update geowave module description

* Add geowave to cassandra tests

* Downgrade JTS to 1.16 to avoid bin-compat problem with GeoWave

* Downgrade GeoTools version to 23.2

* Fix GeoWave builds

* Fix GeoTools deps

* Move GeoWave into a separate executor, rm BlockingThreadPool implementation from benchmarks, generate missing headers

* Update SBT plugins

* Upd GeoWave syntax to match Scala 2.13

Co-authored-by: Grigory Pomadchin <gr.pomadchin@gmail.com>
echeipesh and pomadchin committed Apr 30, 2021
1 parent fa88aed commit b52f70c
Showing 154 changed files with 6,840 additions and 1,739 deletions.
6 changes: 6 additions & 0 deletions .circleci/build-and-test-geowave.sh
@@ -0,0 +1,6 @@
#!/bin/bash

.circleci/unzip-rasters.sh

./sbt -Dsbt.supershell=false "++$SCALA_VERSION" \
"project geowave" test || { exit 1; }
25 changes: 25 additions & 0 deletions .circleci/config.yml
@@ -137,6 +137,21 @@ jobs:
.circleci/build-and-test-accumulo.sh
- save_cache: *save_build_cache

geowave:
parameters:
scala-version:
type: string
executor: executor-cassandra
steps:
- checkout
- restore_cache: *restore_build_cache
- run:
name: Test Cassandra
command: |
export SCALA_VERSION=<< parameters.scala-version >>
.circleci/build-and-test-geowave.sh
- save_cache: *save_build_cache

scaladocs:
parameters:
scala-version:
@@ -223,6 +238,16 @@ workflows:
tags:
only: /^v.*/

- geowave:
matrix:
parameters:
scala-version: [ "2.12.13", "2.13.5" ]
filters:
branches:
only: /.*/
tags:
only: /^v.*/

- scaladocs:
matrix:
parameters:
1 change: 1 addition & 0 deletions .locationtech/deploy-212.sh
@@ -17,6 +17,7 @@
&& ./sbt "project hbase-spark" publish -no-colors -J-Drelease=locationtech \
&& ./sbt "project cassandra" publish -no-colors -J-Drelease=locationtech \
&& ./sbt "project cassandra-spark" publish -no-colors -J-Drelease=locationtech \
&& ./sbt "project geowave" publish -no-colors -J-Drelease=locationtech \
&& ./sbt "project geotools" publish -no-colors -J-Drelease=locationtech \
&& ./sbt "project shapefile" publish -no-colors -J-Drelease=locationtech \
&& ./sbt "project layer" publish -no-colors -J-Drelease=locationtech \
7 changes: 4 additions & 3 deletions CHANGELOG.md
@@ -19,6 +19,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
- S3LayerDeleter cannot handle over 1000 objects to delete [#3371](https://github.com/locationtech/geotrellis/issues/3371)
- Drop Scala 2.11 cross compilation [#3259](https://github.com/locationtech/geotrellis/issues/3259)
- Fix MosaicRasterSource.tileToLayout behavior [#3338](https://github.com/locationtech/geotrellis/pull/3338)
- Replace geowave subproject with GeoTrellis/GeoWave data adapter [#3364](https://github.com/locationtech/geotrellis/pull/3364)

## [3.5.2] - 2021-02-01

@@ -40,8 +41,8 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
- Fix `LayoutTileSource` buffer should only be 1/2 a cellsize to avoid going out of bounds and creating `NODATA` values [#3302](https://github.com/locationtech/geotrellis/pull/3302)
- Remove unused allocation from CroppedTile [#3297](https://github.com/locationtech/geotrellis/pull/3297)
- Fix GeometryCollection::getAll extension method [#3295](https://github.com/locationtech/geotrellis/pull/3295)
- Update gdal-warp-bindings v1.1.1 [#3303](https://github.com/locationtech/geotrellis/pull/3303)
- gdal-warp-bindings 1.1.1 is a bugfix release that addresses a crash when initializing the bindings on MacOS. See:
- Update gdal-warp-bindings v1.1.1 [#3303](https://github.com/locationtech/geotrellis/pull/3303)
- gdal-warp-bindings 1.1.1 is a bugfix release that addresses a crash when initializing the bindings on MacOS. See:
- https://github.com/geotrellis/gdal-warp-bindings#macos
- https://github.com/geotrellis/gdal-warp-bindings/pull/99

@@ -80,7 +81,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
- GDALRasterSource works inconsistenly with BitCellType and ByteCellType [#3232](https://github.com/locationtech/geotrellis/issues/3232)
- rasterizeWithValue accepts only topologically valid polygons [#3236](https://github.com/locationtech/geotrellis/pull/3236)
- Rasterizer.rasterize should be consistent with rasterizeWithValue [#3238](https://github.com/locationtech/geotrellis/pull/3238)
- GeoTrellisRasterSource should return None on empty reads [#3240](https://github.com/locationtech/geotrellis/pull/3240)
- GeoTrellisRasterSource should return None on empty reads [#3240](https://github.com/locationtech/geotrellis/pull/3240)
- ArrayTile equals method always returns true if first elements are NaN [#3242](https://github.com/locationtech/geotrellis/issues/3242)
- Fixed resource issue with JpegDecompressor that was causing a "too many open files in the system" exception on many parallel reads of JPEG compressed GeoTiffs. [#3249](https://github.com/locationtech/geotrellis/pull/3249)
- Fix MosaicRasterSource, GDALRasterSource and GeoTiffResampleRasterSource behavior [#3252](https://github.com/locationtech/geotrellis/pull/3252)
@@ -48,7 +48,7 @@ object AccumuloCollectionReader {
val codec = KeyValueRecordCodec[K, V]
val includeKey = (key: K) => queryKeyBounds.includeKey(key)

val ranges = queryKeyBounds.flatMap(decomposeBounds).toIterator
val ranges = queryKeyBounds.flatMap(decomposeBounds).iterator

implicit val ec = executionContext
implicit val cs = IO.contextShift(ec)
26 changes: 9 additions & 17 deletions build.sbt
@@ -14,6 +14,7 @@ lazy val root = Project("geotrellis", file("."))
gdal,
`gdal-spark`,
geotools,
geowave,
hbase,
`hbase-spark`,
layer,
@@ -145,7 +146,7 @@ lazy val `hbase-spark` = project
.settings(projectDependencies := { Seq((hbase / projectID).value, (spark / projectID).value.exclude("com.google.protobuf", "protobuf-java")) })
.settings(Settings.`hbase-spark`)

lazy val `spark-pipeline` = Project(id = "spark-pipeline", base = file("spark-pipeline")).
lazy val `spark-pipeline` = project.
dependsOn(spark, `s3-spark`, `spark-testkit` % "test").
settings(Settings.`spark-pipeline`)

@@ -155,24 +156,15 @@ lazy val geotools = project
)
.settings(Settings.geotools)

/* lazy val geomesa = project
.dependsOn(`spark-testkit` % Test, spark, geotools, `accumulo-spark`)
.settings(Settings.geomesa)
.settings(
scalaVersion := "2.11.12",
crossScalaVersions := Seq("2.11.12")
)
lazy val geowave = project
.dependsOn(
proj4, raster, layer, store, accumulo,
`spark-testkit` % Test, geotools
)
.dependsOn(raster, store, `raster-testkit` % Test)
.settings(Settings.geowave)
.settings(
scalaVersion := "2.11.12",
crossScalaVersions := Seq("2.11.12")
) */

lazy val `geowave-benchmark` = (project in file("geowave/benchmark"))
.dependsOn(geowave)
.enablePlugins(JmhPlugin)
.settings(Settings.geowaveBenchmark)
.settings(publish / skip := true)

lazy val shapefile = project
.dependsOn(raster, `raster-testkit` % Test)
@@ -68,7 +68,7 @@ object CassandraCollectionReader {
instance.withSessionDo { session =>
val statement = session.prepare(query)

IOUtils.parJoin[K, V](ranges.toIterator){ index: BigInt =>
IOUtils.parJoin[K, V](ranges.iterator){ index: BigInt =>
val row = session.execute(statement.bind(index: BigInteger))
if (row.asScala.nonEmpty) {
val bytes = row.one().getBytes("value").array()
2 changes: 1 addition & 1 deletion docs/guide/module-hierarchy.rst
@@ -90,7 +90,7 @@ store `GeoWave <https://github.com/ngageoint/geowave>`__.

*Provides:* ``geotrellis.spark.io.geowave.*``

- Save and load ``RDD``\ s of features to and from GeoWave.
- Provides ``GeoTrellisDataAdapter`` to store GeoTrellis raster tiles and other Avro-encoded records through the GeoWave ``DataTypeAdapter`` interface.

geotrellis-hbase
----------------
@@ -246,23 +246,23 @@ class GDALRasterSourceRDDSpec extends AnyFunSpec with TestEnvironment with Befor
// println(Thread.currentThread().getName())
// Thread.sleep((Math.random() * 100).toLong)
val lts = reprojRS(i)
lts.readAll(lts.keys.take(10).toIterator)
lts.readAll(lts.keys.take(10).iterator)
reprojRS(i).source.resolutions

dirtyCalls(reprojRS(i).source)
}, IO {
// println(Thread.currentThread().getName())
// Thread.sleep((Math.random() * 100).toLong)
val lts = reprojRS(i)
lts.readAll(lts.keys.take(10).toIterator)
lts.readAll(lts.keys.take(10).iterator)
reprojRS(i).source.resolutions

dirtyCalls(reprojRS(i).source)
}, IO {
// println(Thread.currentThread().getName())
// Thread.sleep((Math.random() * 100).toLong)
val lts = reprojRS(i)
lts.readAll(lts.keys.take(10).toIterator)
lts.readAll(lts.keys.take(10).iterator)
reprojRS(i).source.resolutions

dirtyCalls(reprojRS(i).source)
2 changes: 2 additions & 0 deletions geowave/Makefile
@@ -0,0 +1,2 @@
cqlsh:
docker exec -it $(FOLDER)_cassandra_1 cqlsh
55 changes: 55 additions & 0 deletions geowave/README.md
@@ -0,0 +1,55 @@
# GeoTrellis/GeoWave Connector

GeoTrellis/GeoWave connector for storing raster and volumetric data.

- [GeoTrellis/GeoWave Connector](#geotrellisgeowave-connector)
- [Requirements](#requirements)
- [Project Inventory](#project-inventory)
- [Development](#development)
- [!Important](#important)
  - [Executing Tests](#executing-tests)

## Requirements

- Docker Engine 17.12+
- Docker Compose 1.21+
- OpenJDK 8

## Project Inventory

- `src` - Main project with `GeoTrellisDataAdapter`, which enables storing GeoTrellis types with GeoWave
- `benchmark` - Skeleton for microbenchmarks on GeoWave queries
- `docs` - Overview of GeoWave concepts relevant to index and data adapter usage

## Development

### !Important

After merging PRs or fetching changes from master and other branches, be sure to _recreate_ the
dev environment. Any change to an interface that is present in the `Persistable Registry`
and implements the `fromBinary` and `toBinary` methods can cause serialization / deserialization
issues in tests; as a consequence, tests may fail with a variety of unpredictable runtime exceptions.
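The failure mode can be illustrated with a toy sketch (plain Python using `struct`, not GeoWave code; the `to_binary_v1` / `from_binary_v2` helpers and their field layouts are invented for illustration): when a persisted type's binary layout changes, the new `fromBinary` reads stale bytes against the wrong schema and fails at an arbitrary point.

```python
import struct

def to_binary_v1(rows, cols):
    # "version 1" of a persisted type writes two big-endian int32 fields
    return struct.pack(">ii", rows, cols)

def from_binary_v2(payload):
    # "version 2" expects a third int32 field that v1 never wrote
    return struct.unpack(">iii", payload)

payload = to_binary_v1(256, 256)
try:
    from_binary_v2(payload)          # stale v1 bytes, v2 reader
except struct.error as err:
    print(f"deserialization failed: {err}")
```

Recreating the dev environment drops any such stale payloads before the tests run.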

### Executing Tests

Tests depend on Apache Cassandra, Kafka, ZooKeeper, and Graphite with Grafana. First, ensure
these dependencies are running:

```bash
docker-compose up -d cassandra
```

Now, you can execute tests from project root:

```bash
$ ./sbt "project geowave" test
...
[info] All tests passed.
[success] Total time: 72 s, completed Nov 22, 2019 11:48:25 AM
```

When you're done, ensure that the services and networks created by Docker
Compose are torn down:

```bash
docker-compose down
```
99 changes: 99 additions & 0 deletions geowave/benchmark/README.md
@@ -0,0 +1,99 @@
# JMH Benchmarks

## Instructions

1. Make the following cassandra changes:
```yaml
cassandra:
image: cassandra:3.11
environment:
- MAX_HEAP_SIZE=4G
- HEAP_NEWSIZE=800M
- CASSANDRA_LISTEN_ADDRESS=127.0.0.1
mem_limit: 8G
memswap_limit: -1
```
2. Ingest data into Cassandra via `sbt "project geowave-benchmark" run`
3. Run benchmarks via `jmh:run -i 5 -wi 5 -f1 -t1 .*QueryBenchmark.*`
   For more consistent results, it is recommended to run benchmarks via `jmh:run -i 20 -wi 10 -f1 -t1 .*QueryBenchmark.*`
   (at least 10 warm-up iterations and 20 measurement iterations).

## Results

<pre><code>
jmh:run -i 20 -wi 10 -f 1 -t 1 .*QueryBenchmark.*

88 Entries
Benchmark Mode Cnt Score Error Units
<b>entireSpatialGeometryQuery avgt 20 5.278 ± 0.643 s/op</b>
entireSpatialQuery avgt 20 1.155 ± 0.057 s/op
entireSpatialTemporalElevationElevationQuery avgt 20 1.145 ± 0.069 s/op
entireSpatialTemporalElevationGeometryQuery avgt 20 1.089 ± 0.030 s/op
<b>entireSpatialTemporalElevationGeometryTemporalElevationQuery avgt 20 5.963 ± 0.358 s/op</b>
entireSpatialTemporalElevationGeometryTemporalQuery avgt 20 1.093 ± 0.042 s/op
entireSpatialTemporalElevationQuery avgt 20 1.117 ± 0.033 s/op
entireSpatialTemporalElevationTemporalQuery avgt 20 1.080 ± 0.029 s/op
entireSpatialTemporalGeometryQuery avgt 20 1.117 ± 0.039 s/op
<b>entireSpatialTemporalGeometryTemporalQuery avgt 20 4.223 ± 0.213 s/op</b>
entireSpatialTemporalQuery avgt 20 1.072 ± 0.036 s/op
entireSpatialTemporalTemporalQuery avgt 20 1.110 ± 0.039 s/op

328 Entries
Benchmark Mode Cnt Score Error Units
<b>entireSpatialGeometryQuery avgt 20 4.705 ± 0.146 s/op</b>
entireSpatialQuery avgt 20 5.249 ± 0.503 s/op
entireSpatialTemporalElevationElevationQuery avgt 20 4.919 ± 0.310 s/op
entireSpatialTemporalElevationGeometryQuery avgt 20 4.688 ± 0.251 s/op
<b>entireSpatialTemporalElevationGeometryTemporalElevationQuery avgt 20 15.801 ± 6.629 s/op</b>
entireSpatialTemporalElevationGeometryTemporalQuery avgt 20 5.212 ± 0.467 s/op
entireSpatialTemporalElevationQuery avgt 20 5.256 ± 1.107 s/op
entireSpatialTemporalElevationTemporalQuery avgt 20 4.878 ± 0.324 s/op
entireSpatialTemporalGeometryQuery avgt 20 4.760 ± 0.498 s/op
<b>entireSpatialTemporalGeometryTemporalQuery avgt 20 4.272 ± 0.126 s/op</b>
entireSpatialTemporalQuery avgt 20 4.553 ± 0.275 s/op
entireSpatialTemporalTemporalQuery avgt 20 4.736 ± 0.290 s/op
</code></pre>

## Interpretation:

The index type affects query performance.
The more dimensions are defined for the index, the more ranges
are generated for the SFC, and the more range requests are sent to Cassandra.
Benchmarks that issue ranged queries are marked in bold in the results; all other benchmarks generate
full table scans.

A full scan through a three-dimensional index is more expensive than through a one- or
two-dimensional index: the more dimensions the SFC has, the more ranges are generated.

These benchmarks are not representative, since they were run against a local instance of Cassandra;
they demonstrate only relative local performance, i.e. how query performance
depends on the index type and the amount of data. Strictly speaking, this is a benchmark of the
Cassandra instance itself, though it gives a general sense of how index and query types affect performance.

In fact, this benchmark measures only full table scans (performed either via multiple ranged select queries or
via a single select).

In the `entireSpatialTemporalElevationGeometryTemporalElevationQuery` case the results
are noticeably high: too many range queries are generated, which is hard for a single Cassandra instance
to handle.

### Legend:
- `entireSpatial({Temporal|TemporalElevation})` performs a full table scan:
```genericsql
SELECT * FROM QueryBench.indexName;
```
- In all cases where the query does not constrain all of the index dimensions
(for instance, a spatial-only query against a spatiotemporally indexed table),
GeoWave performs a full table scan:
```genericsql
SELECT * FROM QueryBench.indexName;
```
- In all cases where the query constrains all of the index dimensions defined for the table,
GeoWave performs multiple ranged queries (the number of SFC splits depends on the index dimensionality);
**benchmarks that generate such queries are marked in bold in the JMH report**:
```genericsql
SELECT * FROM QueryBench.indexName
WHERE partition=:partition_val
AND adapter_id IN :adapter_id_val
AND sort>=:sort_min AND sort<:sort_max;
```
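The effect of dimensionality on the number of generated ranges can be sketched with a toy Z-order curve (a minimal stand-in for the SFC implementations GeoWave actually uses; `z_key` and `query_ranges` are invented for illustration, not GeoWave API):

```python
from itertools import product

def z_key(coords, bits=4):
    """Interleave the bits of each coordinate into one Morton (Z-order) sort key."""
    key, ndim = 0, len(coords)
    for bit in range(bits):
        for d, c in enumerate(coords):
            key |= ((c >> bit) & 1) << (bit * ndim + d)
    return key

def query_ranges(box, bits=4):
    """Decompose a query box into contiguous ranges of the 1-D sort key."""
    keys = sorted(z_key(cell, bits)
                  for cell in product(*(range(lo, hi) for lo, hi in box)))
    ranges = []
    for k in keys:
        if ranges and k == ranges[-1][1] + 1:
            ranges[-1][1] = k          # extend the current contiguous range
        else:
            ranges.append([k, k])      # start a new range
    return ranges

# The same 4-cell-wide query box fragments into more 1-D ranges as dimensions are added:
print(len(query_ranges([(1, 5), (1, 5)])))          # 2-D box
print(len(query_ranges([(1, 5), (1, 5), (1, 5)])))  # 3-D box
```

Each contiguous range corresponds to one `sort>=:sort_min AND sort<:sort_max` select, which is consistent with the multi-range (bolded) benchmarks being more expensive on higher-dimensional indices.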
12 changes: 12 additions & 0 deletions geowave/benchmark/src/main/resources/application.conf
@@ -0,0 +1,12 @@
geotrellis.geowave.connection.store {
data-store-type = "cassandra"
options = {
"contactPoints": "localhost",
"contactPoints": ${?CASSANDRA_HOST},
"gwNamespace" : "geotrellis"
}
}

geotrellis.blocking-thread-pool {
threads = default
}
31 changes: 31 additions & 0 deletions geowave/benchmark/src/main/resources/logback.xml
@@ -0,0 +1,31 @@
<configuration debug="true">
<variable name="LEVEL" value="${LOG_LEVEL:-INFO}"/>

<appender name="STDOUT" class="ch.qos.logback.core.ConsoleAppender">
<encoder>
<pattern>%white(%d{HH:mm:ss.SSS}) %highlight(%-5level) %cyan(%logger{50}) - %msg %n</pattern>
</encoder>
</appender>

<root level="${LEVEL}">
<appender-ref ref="STDOUT" />
</root>

<logger name="org.apache.kafka" level="${LEVEL}"/>
<logger name="mil.navsea.geoindex" level="DEBUG"/>

<!-- In order to enable this logging you have to register QueryLogger with Cassandra session -->
<!-- https://docs.datastax.com/en/developer/java-driver/2.1/manual/logging/#logging-query-latencies -->

<!--
<logger name="com.datastax.driver.core.QueryLogger.NORMAL">
<level value="TRACE"/>
</logger>
<logger name="com.datastax.driver.core.QueryLogger.SLOW">
<level value="TRACE"/>
</logger>
<logger name="com.datastax.driver.core.QueryLogger.ERROR">
<level value="TRACE"/>
</logger>
-->
</configuration>
