Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RasterizeRDD for Geometry #2266

Merged
merged 26 commits into from
Aug 31, 2017

Conversation

echeipesh
Copy link
Contributor

@echeipesh echeipesh commented Jun 28, 2017

Resolves: #2228
Resolves: #2168

): RDD[(SpatialKey, Tile)] with Metadata[LayoutDefinition] = {
val intValue = implicitly[Numeric[T]].toInt(value)
val dblValue = implicitly[Numeric[T]].toDouble(value)
val options = Options(includePartial=true, sampleType=PixelIsArea)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a certain reason why you defined options in this method when it's already being passed in as a parameter (line 44)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nope, thank you for the catch.

import org.apache.spark.rdd._
import scala.collection.immutable.VectorBuilder

object RasterizeRDD {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think it'd be worth it to have another overload method or two that doesn't require options?

def fromGeometry[G <: Geometry, T: Numeric](
    geoms: RDD[G],
    layout: LayoutDefinition,
    ct: CellType,
    value: T
  ): RDD[(SpatialKey, Tile)] with Metadata[LayoutDefinition] = {
    fromGeometry(geoms, layout, ct, value, new HashPartitioner(geoms.sparkContext.defaultParallelism), Options.DEFAULT)
  }

  def fromGeometry[G <: Geometry, T: Numeric](
    geoms: RDD[G],
    layout: LayoutDefinition,
    ct: CellType,
    value: T,
    numPartitions: Int
  ): RDD[(SpatialKey, Tile)] with Metadata[LayoutDefinition] = {
    fromGeometry(geoms, layout, ct, value, new HashPartitioner(numPartitions), Options.DEFAULT)
  }

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thats a good point. Thank you for taking the time to type them up.

@echeipesh
Copy link
Contributor Author

echeipesh commented Jul 7, 2017

RasterizeRDD Method Extensions

When creating method extensions for `RasterizeRDD.fromGeometry` we have to consder oprtional arguments.

def fromGeometry[G <: Geometry, T: Numeric](
  geoms: RDD[G],
  value: T,
  layout: LayoutDefinition,
  ct: CellType,
  partitioner: Partitioner,
  options: Rasterizer.Options
): RDD[(SpatialKey, Tile)] with Metadata[LayoutDefinition] = ???

There are two or three optional arguments:

def fromGeometry[G <: Geometry, T: Numeric](
  geoms: RDD[G],
  value: T,
  layout: LayoutDefinition,
  ct: CellType = IntConstantNoDataCellType,
  partitioner: Partitioner = new HasPartitioner(geoms.getNumPartitions),
  options: Rasterizer.Options= Rasterizer.Options.DEFAULT
): RDD[(SpatialKey, Tile)] with Metadata[LayoutDefinition] = ???

An existing pattern is to create RasterizeRDD.Options as follows

object RasterizeRDD {
  case class Options(
    rasterizerOptions: rasterizerOptions = Rasterizer.Options.DEFAULT,
    partitioner: Option[Partitioner] = None
  )
  val DEFAULT = Options()
}

trait GeometryRDDRasterizeMethods[G <: Geometry] extends MethodExtensions[RDD[G]] {
  def rasterizeWithValue[T: Numeric](
    value: T,
    cellType: CellType
    options: RasterizeRDD.Options = RasterizeRDD.Options.DEFAULT
  ): RDD[(SpatialKey, Tile)] with Metadata[LayoutDefinition] = ???
}

val geoms: RDD[Polygon] = ???

geoms.rasterizeWithValue(1, IntConstantNoDataCellType,
  RasterizeRDD.Options(
    partitioner = Some(new HashPartitioner(geoms.getNumPartitions)),
    rasterizerOptions = Rasterizer.Options(true, PixelIsArea)
  )
)

This is all well and good but it has the following two drawbacks:

  1. Specifying the options for RDD of geoms gets very nested
  2. Something somewhere has to match the None cases for Partitioner
  3. Moving more common option of CellType makes the ugly call more frequent

The main problem is that the true defaults in this case depend on parameters. There is a way scala handles that case is with multiple parameter lists.

Proposal

object RasterizeRDD {
  case class Options(
    rasterizerOptions: rasterizerOptions = Rasterizer.Options.DEFAULT,
    partitioner: Partitioner
  )

}

trait GeometryRDDRasterizeMethods[G <: Geometry] extends MethodExtensions[RDD[G]] {
  def rasterizeWithValue[T: Numeric](
    value: T
  )(
    cellType: CellType = narrowestCellType(value),
    includePartial: Boolean = true,
    sampleType: PixelSampleType = PixelIsPoint,
    partitioner = new HashPartitioner(self.getNumPartitions),
  ): RDD[(SpatialKey, Tile)] with Metadata[LayoutDefinition] = {
    val rasterizerOptions = Rasterizer.Options(includePartial, sampleType)
    val options = RasterizeRDD.Options(Some(rasterizerOptions), partitioner)
    ??? // make the call
  }
}

val geoms: RDD[Polygon] = ???

geoms.rasterizeWithValue(1)
geoms.rasterizeWithValue(4)(cellType = ByteCellType)
geoms.rasterizeWithValue(4)(
  cellType = ByteCellType,
  sampleType = PixelIsArea)

What we get for the trouble:

  1. We flattened out nested Options object at the call site
  2. The method call is now tab-completable
  3. Preserve Option objects as composition pattern
  4. Default velues are figured out at method definition site
  5. We can clearly use input dependant defaults like CellType

@echeipesh
Copy link
Contributor Author

This is what the scaladoc looks like for the method:
screen shot 2017-07-07 at 2 46 15 pm

@jbouffard
Copy link

I think the proposal looks good. I never liked how nested defining options could become in GeoTrellis. Do you think it'd be worth having more overloads just in case someone already has Options defined?

@echeipesh
Copy link
Contributor Author

Not sure that is possible with this pattern. You may only have one instance of a method name with default arguments, which means you couldn't define def rasterizeWithValue(value: Double, options: RasterizeRDD.Options = RasterizeRDD.Options.DEFAULT).

And it wouldn't make sense to put it into the same method signature because you'd have to choose if the rest parameters trump the contents of the options or not.

I suppose with this pattern if somebody needed to use the Options instance they would be required to use the RasterizeRDD.fromGeometry directly.

@jbouffard
Copy link

Ah, I wasn't aware of that. Then I think the current implementation is fine then. It might be worth noting somewhere in the docs that RasterizeRDD.fromGeometry should be used if the user has an Options instance ready to go.

Copy link
Contributor

@aklink aklink left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tested with ShapeFileReader.readSimpleFeatures(...) using MultiPolygons, works fine for me

Edited:
I am using
val rasterizedRDD : RDD[(SpatialKey, Tile)] with Metadata[LayoutDefinition] = RasterizeRDD.fromGeometry( polyRdd, 1, IntConstantNoDataCellType, ld, Options.DEFAULT)

What is currently not working on my site:
val rasterizedRdd = polyRdd.rasterizeWithValue(1, IntConstantNoDataCellType, ld)

Error:(112, 33) value rasterizeWithValue is not a member of org.apache.spark.rdd.RDD[geotrellis.vector.MultiPolygon]
val rasterizedRdd = polyRdd.rasterizeWithValue(1, IntConstantNoDataCellType, ld)

Maybe some missing import statement on my site, or a wrong usage of types in combination with:
ShapeFileReader.readSimpleFeatures(shapefileName)

Also didn't manage it to pass different rasterization values (only one value per RDD allowed) if not all geometries in RDD are same "class" (if using raster value as object class, specified by a field/column in shapefile). But maybe this is not intended to be supported by RasterizeRDD. If all geometries have same value, then it is working fine.

@aklink
Copy link
Contributor

aklink commented Jul 10, 2017

I have one suggestion how RasterizeRDD could be extended to support one value per geometry instead one value per RDD:

package geotrellis.spark.rasterize

import geotrellis.raster._
import geotrellis.raster.rasterize._
import geotrellis.spark._
import geotrellis.spark.tiling._
import geotrellis.vector._
import org.apache.spark.rdd._
import org.apache.spark.{HashPartitioner, Partitioner}

object RasterizeFeaturesRDD {

  /**
   * Rasterize an RDD of Geometry objects into a tiled raster RDD.
   * Cells not intersecting any geometry will left as NODATA.
   * Value will be converted to type matching specified [[CellType]].
   *
   * @param features Cell values for cells intersecting a feature consisting of (geometry,value)
   * @param layout Raster layer layout for the result of rasterization
   * @param cellType [[CellType]] for creating raster tiles
   * @param options Rasterizer options for cell intersection rules
   * @param partitioner Partitioner for result RDD
   */
  def fromFeature[G <: Geometry, D <: Double](

//geoms: RDD[G], value: Double,
features: RDD[(G,D)],
features: RDD[Feature[G,D]],

    cellType: CellType,
    layout: LayoutDefinition,
    options: Rasterizer.Options = Rasterizer.Options.DEFAULT,
    partitioner: Option[Partitioner] = None
  ): RDD[(SpatialKey, Tile)] with Metadata[LayoutDefinition] = {
    val layoutRasterExtent = RasterExtent(layout.extent, layout.layoutCols, layout.layoutRows)
    val layoutRasterizerOptions = Rasterizer.Options(includePartial=true, sampleType=PixelIsArea)

    /** Key geometry by spatial keys of intersecting tiles */
    def keyGeom(feature: (Geometry, Double)): Iterator[(SpatialKey, ((Geometry, Double), SpatialKey))] = {
      var keySet = Set.empty[SpatialKey]
      feature._1.foreach(layoutRasterExtent, layoutRasterizerOptions){ (col, row) =>
        keySet = keySet + SpatialKey(col, row)
      }
      keySet.toIterator.map { key => (key, (feature, key)) }
    }

    // key the geometry to intersecting tiles so it can be rasterized in the map-side combine
    val keyed: RDD[(SpatialKey, ((Geometry, Double), SpatialKey))] =

features.flatMap { case (geom,value) => keyGeom(geom, value) }
features.flatMap { feature => keyGeom(feature.geom, feature.data) }

    val createTile = (tup: ((Geometry, Double), SpatialKey)) => {
      val ((geom,value), key) = tup
      val tile = ArrayTile.empty(cellType, layout.tileCols, layout.tileRows)
      val re = RasterExtent(layout.mapTransform(key), layout.tileCols, layout.tileRows)
      geom.foreach(re, options){ tile.setDouble(_, _, value) }
      tile: MutableArrayTile
    }

    val updateTile = (tile: MutableArrayTile, tup: ((Geometry, Double), SpatialKey)) => {
      val ((geom,value), key) = tup
      val re = RasterExtent(layout.mapTransform(key), layout.tileCols, layout.tileRows)
      geom.foreach(re, options){ tile.setDouble(_, _, value) }
      tile: MutableArrayTile
    }

    val mergeTiles = (t1: MutableArrayTile, t2: MutableArrayTile) => {
      t1.merge(t2).mutable
    }

    val tiles = keyed.combineByKeyWithClassTag[MutableArrayTile](
      createCombiner = createTile,
      mergeValue = updateTile,
      mergeCombiners = mergeTiles,
      partitioner.getOrElse(new HashPartitioner(features.getNumPartitions))
    )

    ContextRDD(tiles.asInstanceOf[RDD[(SpatialKey, Tile)]], layout)
  }
}

This works perfectly for me when using:

val shapeFileName = ???
val attribName = ???
val layout = ???
val extent = ???
val ld = LayoutDefinition(extent, layout)
val multipolygonsWithValue = ShapeFileReader.readSimpleFeatures(shapefileName)
      .filter { feat => "MultiPolygon" != feat.getFeatureType.getGeometryDescriptor.getType.getName }
      .map { feat => (MultiPolygon.jts2MultiPolygon(feat.geom[jts.MultiPolygon].get), feat.attribute(attribName)) }
      .map { case (mp: MultiPolygon, value: Long)  => Feature(mp, value.toDouble) }
val multipolygonsRdd = sc.parallelize(multipolygonsWithValue)
val rasterizedRDD : RDD[(SpatialKey, Tile)] with Metadata[LayoutDefinition] = RasterizeFeaturesRDD
      .fromFeature( multipolygonsRdd, IntConstantNoDataCellType, ld, Options.DEFAULT)

@aklink
Copy link
Contributor

aklink commented Jul 10, 2017

Should have taken into account #2168 to distinguish between fromGeometry and fromFeature, but with some modifications above propsal could be adjusted to fit this, I hope. Edit: applied change above

@aklink
Copy link
Contributor

aklink commented Jul 10, 2017

An other Option would be using

def fromFeature[G <: Geometry, T : Numeric](
    features: RDD[(G,T)],
    cellType: CellType,
    layout: LayoutDefinition,
    options: Rasterizer.Options = Rasterizer.Options.DEFAULT,
    partitioner: Option[Partitioner] = None
  ): RDD[(SpatialKey, Tile)] with Metadata[LayoutDefinition] = ???

then use (Geometry, T) instead of (Geometry,Double) but then value can't be used directly, rather a value.toString.toDouble transform would be needed in createTile and updateTile for geom.foreach(re, options){ tile.setDouble(_, _, value.toString.toDouble) }

@echeipesh
Copy link
Contributor Author

@aklink thank you for testing and poking at this!

The fromFeature should totally work like that, but it should be around Feature[G, Double] type since we have it and it will be generated by reading GeoJSON and such.

The other point is that after looking at the code there doesn't seem to be any point in using Numeric[T] type class. Following the same logic as CellType.withNoData Double is wide enough to represent any possible cell value and using MutableTile.setDouble makes sure that the assignment is consistent with CellType of the generated tiles.

From that perspective Numeric[T] doesn't provide any real flexibility, just a useless type parameter. A map step to get a Feature[G, D] to Feature[G, Double] is cheap, explicit and unsurprising way to use this feature. By the same token RasterizeRDD.fromGeometry can actually delegate to RasterizerRDD.fromFeature by mapping geoms into features with the constant value.

What are you thoughts on z-order when rasterizing features with different values?
There is no way to control the order in which reduction happens in combineByKeyWithClassTag so I'm sure some really weird results as possible where the apparent z-order flips on tile boundaries when features with differing values overlap.

@aklink
Copy link
Contributor

aklink commented Jul 11, 2017

I have changed to features: RDD[Feature[G,Double]]
Regarding Z-Order: I didn't think that far. Since the Shapefile used to get Geometries has no overlapping Geometries (overlapping is invalid for Shapefiles, possible - but not allowed) this case can not happen on my site. My scala skills are also not sufficient to handle this.

@echeipesh
Copy link
Contributor Author

Its valuable to know that its not an issue for a pretty significant use case. Perhaps z-order functionality can be part of a different PR.

Copy link
Contributor

@aklink aklink left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

aa880d1 missing Signed-off-by, but identical to commit a8c4203 with Signed-off-by
may be causing ip-validation failure (maybe removing this commit aa880d1 from history could solve issue, it was accidentally pushed)

Copy link
Contributor

@aklink aklink left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

84d45b1 missing Signed-off-by, but no changes

jamesmcclain pushed a commit to jamesmcclain/geotrellis that referenced this pull request Aug 17, 2017
… on RasterizeRDD PR locationtech#2266

Signed-off-by: Adrian Klink <Adrian.Klink@eftas.com>
This compensates for RDD being invariant on its type parameter
echeipesh and others added 24 commits August 30, 2017 22:49
Squashed commits:
[f449338] Add scaladocs for rdd.rasterizeWithValue (+1 squashed commit)
Squashed commits:
[a42724a] fix: Pass through Rasterizer.Options correctly
Expect that for large layers a geometry will intersect small fraction of the tiles.
In such a case keeping a set of keys is more efficient than a bitmap.
For small layer it doesn't matter what choice we made.
If this ever fails the next step is a BloomFilter instead of a Set.
… on RasterizeRDD PR locationtech#2266

Signed-off-by: Adrian Klink <Adrian.Klink@eftas.com>
Signed-off-by: Adrian Klink <Adrian.Klink@eftas.com>
Signed-off-by: Adrian Klink <Adrian.Klink@eftas.com>
…,Double]]]

Signed-off-by: Adrian Klink <Adrian.Klink@eftas.com>
Signed-off-by: Adrian Klink <Adrian.Klink@eftas.com>
The line rasterizer converts edge endpoints to integers, whereas the
polygon rasterizer preserves the precision of vertices.  The loss of
precision caused by conversion to integers makes the line rasterizer
inappropriate for this application.
This will allow it to be re-used in the per-tile rasterizer
This can be re-used and tested when outside of line function
Signed-off-by: Eugene Cheipesh <echeipesh@gmail.com>
@echeipesh echeipesh merged commit 5c5ebed into locationtech:master Aug 31, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants