splitting up geodata chapter

1 parent cae3c4d commit 5d10afb5cd062d78fd7d471071022384fc2e1a99 Philip (flip) Kromer committed Feb 19, 2013
@@ -511,275 +511,3 @@ This section will show how to
* efficiently segment region polygons (county boundaries, watershed regions, etc) into grid cells
* store data pertaining to such regions in a grid-cell form: for example, pivoting a population-by-county table into a population-of-each-overlapping-county record on each quadtile.
-
-==== Adaptive Grid Size ====
-
-The world is a big place, but we don't use all of it the same. Most of the world is water. Lots of it is Siberia. Half the tiles at zoom level 2 have only a few thousand inhabitantsfootnote:[000 001 100 101 202 203 302 and 303].
-
-Suppose you wanted to store a "what country am I in" dataset -- a geo-joinable decomposition of the region boundaries of every country. You'll immediately note that
-Monaco fits easily within one zoom-level 12 quadtile; Russia spans two zoom-level 1 quadtiles.
-Without multiscaling, to cover the globe at 1-km scale and 64-kB records would take 70 terabytes -- and 1-km is not all that satisfactory. Huge parts of the world would be taken up by grid cells holding no border that simply said "Yep, still in Russia".
-
-There's a simple modification of the grid system that lets us very naturally describe multiscale data.
-
-The figures (REF: multiscale images) show the quadtiles covering Japan at ZL=7. For reasons you'll see in a bit, we will split everything up to at least that zoom level; we'll show the further decomposition down to ZL=9.
-
-image::images/fu05-quadkeys-multiscale-ZL7.png[Japan at Zoom Level 7]
-
-Already six of the 16 tiles shown don't have any land coverage, so you can record their values:
-
- 1330000xx { Pacific Ocean }
- 1330011xx { Pacific Ocean }
- 1330013xx { Pacific Ocean }
- 1330031xx { Pacific Ocean }
- 1330033xx { Pacific Ocean }
- 1330032xx { Pacific Ocean }
-
-Pad out each of the keys with `x`'s to meet our lower limit of ZL=9.
-
-The quadkey `1330011xx` means "I carry the information for all 16 ZL=9 grids from `133001100` through `133001133`".
-
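As a minimal sketch of what a padded key stands for, the following expands a key such as `1330011xx` into the full-depth quadkeys it covers. The function name `expand_padded_quadkey` and the choice of `x` as the only pad character are assumptions made for illustration, not part of the original text.

```python
from itertools import product

def expand_padded_quadkey(padded, pad_char="x"):
    """List every full-depth quadkey covered by a padded multiscale key."""
    prefix = padded.rstrip(pad_char)       # the digits that are actually set
    n_pad = len(padded) - len(prefix)      # how many zoom levels were padded
    # Each padded position can be any of the four quadrant digits.
    return [prefix + "".join(tail) for tail in product("0123", repeat=n_pad)]

covered = expand_padded_quadkey("1330011xx")
print(len(covered), covered[0], covered[-1])
```

Two padded positions give 4 x 4 = 16 covered tiles; a key with no padding covers only itself.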
-image::images/fu05-quadkeys-multiscale-ZL8.png[Japan at Zoom Level 8]
-
-
-
-image::images/fu05-quadkeys-multiscale-ZL9.png[Japan at Zoom Level 9]
-
-
-You should uniformly decompose everything to some upper zoom level so that if you join on something uniformly distributed across the globe you don't have cripplingly large skew in data size sent to each partition. A zoom level of 7 implies about 16,000 tiles (4^7 = 16,384) -- a small quantity given the exponential growth in the number of tiles with zoom level.
-
-
-
-With the upper range as your partition key, and the whole quadkey as the sort key, you can now do joins. In the reducer,
-
-* read keys on each side until one key is equal to or a prefix of the other.
-* emit combined record using the more specific of the two keys
-* read the next record from the more-specific column, until there's no overlap
-
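The reducer steps above can be sketched as a merge join over two sorted streams within a single partition. This is an illustration under stated assumptions, not the book's implementation: keys are unpadded digit strings (strip any `x` padding first), each input is sorted by key, and no key within one input is a prefix of another key in that same input. The name `multiscale_join` and the sample records are hypothetical.

```python
def multiscale_join(left, right):
    """Merge-join two sorted lists of (quadkey, value) pairs on prefix overlap."""
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        (a, va), (b, vb) = left[i], right[j]
        if b.startswith(a):
            # Left tile equals or contains the right tile: emit at the finer key
            # and advance the more specific (right) side.
            out.append((b, va, vb))
            j += 1
        elif a.startswith(b):
            # Right tile contains the left tile: advance the left side.
            out.append((a, va, vb))
            i += 1
        elif a < b:
            i += 1   # no overlap; left key is behind
        else:
            j += 1   # no overlap; right key is behind
    return out

regions = [("0", "ocean"), ("10", "A"), ("11", "B")]
counts = [("00", 1), ("01", 2), ("1", 3)]
joined = multiscale_join(regions, counts)
```

Because all keys sharing a prefix are contiguous in sorted order, each input is read once and nothing needs to be buffered beyond the current record on each side.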
-Take each grid cell; if it needs subfeatures, divide it; otherwise emit it directly.
-
-You must emit high-level grid cells with the least-significant digits filled with `xx`, or some other padding that sorts after a normal cell. This means that to find the value for a point, you:
-
-* Find the corresponding tile ID,
-* Index into the table to find the first tile whose ID is greater than or equal to the given one.
-
- 00.00.00
- 00.00.01
- 00.00.10
- 00.00.11
- 00.01.--
- 00.10.--
- 00.11.00
- 00.11.01
- 00.11.10
- 00.11.11
- 01.--.--
- 10.00.--
- 10.01.--
- 10.10.00
- 10.10.01
- 10.10.10
- 10.10.11
- 10.11.--
-
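A minimal sketch of that lookup, using the table above with the dots removed and coarse cells padded with `x` (an assumed pad character that sorts after the digits in ASCII). The helper names `pad_key` and `find_tile` are hypothetical.

```python
from bisect import bisect_left

PAD = "x"    # assumption: any character that sorts after the digits works
DEPTH = 6    # full key length used by the table above

def pad_key(key, depth=DEPTH):
    """Right-pad a coarse tile key out to the full key length."""
    return key + PAD * (depth - len(key))

def find_tile(sorted_padded_keys, point_key):
    """Return the padded key of the tile containing a full-depth point key."""
    idx = bisect_left(sorted_padded_keys, point_key)   # first key >= point_key
    if idx == len(sorted_padded_keys):
        return None
    candidate = sorted_padded_keys[idx]
    # Guard against gaps in coverage: the candidate must really be a prefix.
    return candidate if point_key.startswith(candidate.rstrip(PAD)) else None

raw_keys = ["000000", "000001", "000010", "000011", "0001", "0010",
            "001100", "001101", "001110", "001111", "01", "1000", "1001",
            "101000", "101001", "101010", "101011", "1011"]
table = sorted(pad_key(k) for k in raw_keys)
```

For example, `find_tile(table, "000110")` lands on `0001xx`, and a point under the uncovered `11` branch returns `None`.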
-
-==== Tree structure of Quadtile indexing ====
-
-You can look at quadtiles as a tree structure. Each branch splits the plane exactly in half by area, and only leaf nodes hold data.
-
-The first quadtile scheme required that we develop every branch of the tree to the same depth. The multiscale quadtile scheme effectively says "hey, let's only expand each branch to its required depth". Our rule to break up a quadtile if any section of it needs development preserves the "only leaf nodes hold data" property. Breaking tiles always exactly in two makes it easy to assign features to their quadtile and facilitates joins between datasets that have never met. There are other ways to make these tradeoffs, though -- read about K-D trees in the "keep exploring" section at end of chapter.
-
-
-==== Map Polygons to Grid Tiles ====
-
-
-
- +----------------------------+
- | |
- | C |
- | ~~+---------\ |
- | / | \ /
- | / | \ /|
- | / | \ / |
- \ / | B \ / |
- | | | |
- | A +--------------' |
- | | |
- | | D /
- | | __/
- \____/ \ |
- \____________,
-
-
- +-+-----------+-------------+--+------
- | | | | |
- | | | C | |
- 000x | | C ~~+--+------\ | | 0100
- | | / A|B | B \ | /
- |_|____/___|__|________\____|/|_______
- | | C / | | \ C / |
- | \ / |B | B \ /| |
- 001x | | | | | |D| 0110
- | | A +--+-----------' | |
- | | |D | D | |
- +---+------+--+-------------+-/-------
- | | A |D | _|/
- | \____/ \ | D | |
- 100x | \|___________, | 1100
- | | |
- | | |
- +-------------+-------------+---------
- ^ 1000 ^ 1001
-
-* Tile 0000: `[A, B, C ]`
-* Tile 0001: `[ B, C ]`
-* Tile 0010: `[A, B, C, D]`
-* Tile 0011: `[ B, C, D]`
-
-* Tile 0100: `[ C, ]`
-* Tile 0110: `[ C, D]`
-
-* Tile 1000: `[A, D]`
-* Tile 1001: `[ D]`
-* Tile 1100: `[ D]`
-
-For each grid, also calculate the area each polygon covers within that grid.
-
-Pivot:
-
-* A: `[ 0000 0010 1000 ]`
-* B: `[ 0000 0001 0010 0011 ]`
-* C: `[ 0000 0001 0010 0011 0100 0110 ]`
-* D: `[ 0010 0011 0110 1000 1001 1100 ]`
-
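The pivot from tile-to-polygons into polygon-to-tiles is a plain inversion of the mapping. A short sketch, using the tile lists above as literal data (the names `tile_to_polygons` and `pivot` are assumptions for illustration):

```python
from collections import defaultdict

tile_to_polygons = {
    "0000": ["A", "B", "C"],
    "0001": ["B", "C"],
    "0010": ["A", "B", "C", "D"],
    "0011": ["B", "C", "D"],
    "0100": ["C"],
    "0110": ["C", "D"],
    "1000": ["A", "D"],
    "1001": ["D"],
    "1100": ["D"],
}

def pivot(tile_to_polygons):
    """Invert tile -> polygons into polygon -> sorted list of tiles."""
    polygon_to_tiles = defaultdict(list)
    for tile in sorted(tile_to_polygons):          # visit tiles in key order
        for polygon in tile_to_polygons[tile]:
            polygon_to_tiles[polygon].append(tile)
    return dict(polygon_to_tiles)

polygon_to_tiles = pivot(tile_to_polygons)
```

If you also track the area each polygon covers within each tile, the same loop can carry that area along as a second field.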
-
-
-=== Weather Near You ===
-
-The weather station data is sampled at each weather station, and forms our best estimate for the surrounding region's weather.
-
-So weather data is gathered at a _point_, but imputes information about a _region_. You can't just slap each point down on coarse-grained tiles -- the closest weather station might lie just over on the next quad, and you're writing a check for very difficult calculations at run time.
-
-We also have a severe version of the multiscale problem. The coverage varies wildly over space: a similar number of weather stations cover a single large city as cover the entire Pacific ocean. It also varies wildly over time: in the 1970s, the closest weather station to Austin, TX was about 150 km away in San Antonio. Now, there are dozens in Austin alone.
-
-
-==== Find the Voronoi Polygon for each Weather Station ====
-
-These factors rule out any naïve approach to locality, but there's an elegant solution known as a Voronoi diagram footnote:[see http://en.wikipedia.org/wiki/Voronoi_diagram[Wikipedia entry] or (with a Java-enabled browser) this http://www.cs.cornell.edu/home/chew/Delaunay.html[Voronoi Diagram applet]].
-
-The Voronoi diagram covers the plane with polygons, one per point -- I'll call that the "centerish" of the polygon. Within each polygon, you are closer to its centerish than to any other. By extension, locations on the boundary of each Voronoi polygon are equidistant from the centerish on either side; polygon corners are equidistant from centerishes of all touching polygons footnote:[John Snow, the father of epidemiology, mapped cholera cases from an 1854 outbreak against the Voronoi regions defined by each neighborhood's closest water pump. The resulting infographic made plain to contemporary physicians and officials that bad drinking water, not "miasma" (bad air), transmitted cholera. http://johnsnow.matrix.msu.edu/book_images12.php].
-
-If you'd like to skip the details, just admire the diagram (REF) and agree that it's the "right" picture. As you would in practice, we're going to use vetted code from someone with a PhD and not write it ourselves.
-
-The details: Connect each point with a line to its neighbors, dividing the plane into triangles; there's an efficient algorithm (http://en.wikipedia.org/wiki/Delaunay_triangulation[Delaunay Triangulation]) to do so optimally. If I stand at the midpoint of the edge connecting two locations, and walk perpendicular to the edge in either direction, I will remain equidistant from each point. Extending these lines defines the Voronoi diagram -- a set of polygons, one per point, enclosing the area closer to that point than any other.
-
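The defining property can be checked without constructing any polygons: a location belongs to the Voronoi cell of whichever site is nearest. The brute-force sketch below illustrates only that property, on a flat plane with made-up coordinates; it is not a substitute for a vetted Voronoi library, and the names `nearest_site`, `sites`, `s1` and `s2` are hypothetical.

```python
import math

def nearest_site(point, sites):
    """Return the name of the site closest to point (planar distance)."""
    return min(sites, key=lambda name: math.dist(point, sites[name]))

# Two illustrative sites; the perpendicular bisector x = 0.5 is their shared
# Voronoi boundary.
sites = {"s1": (0.0, 0.0), "s2": (1.0, 0.0)}

owner_left = nearest_site((0.4, 3.0), sites)    # left of the bisector
owner_right = nearest_site((0.6, -2.0), sites)  # right of the bisector

# A point on the bisector is equidistant from both sites.
on_boundary = (0.5, 1.0)
d1 = math.dist(on_boundary, sites["s1"])
d2 = math.dist(on_boundary, sites["s2"])
```

Walking along the line x = 0.5 in either direction keeps `d1` and `d2` equal, which is the "walk perpendicular to the edge" argument in code form.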
-<remark>TODO: above paragraph not very clear, may not be necessary.</remark>
-
-
-==== Break polygons on quadtiles ====
-
-Now let's put Mr. Voronoi to work. Use the weather station locations to define a set of Voronoi polygons, treating each weather station's observations as applying uniformly to the whole of that polygon.
-
-Break the Voronoi polygons up by quadtile as we did above -- quadtiles will either contain a piece of boundary (and so are at the lower-bound zoom level), or are entirely contained within a boundary. You should choose a lower-bound zoom level that avoids skew but doesn't balloon the dataset's size.
-
-Also produce the reverse mapping, from weather station to the quadtile IDs its polygon covers.
-
-==== Map Observations to Grid Cells ====
-
-Now join observations to grid cells and reduce each grid cell.
-
-// === GeoJSON ===
-// Using polymaps to view results
-
-=== K-means clustering to summarize ===
-
-(TODO: section under construction)
-
-We will describe how to use clustering to form a progressive summary of point-level detail.
-
-There are X million Wikipedia topics.
-
-At distant zoom levels, storing them in a single record would be foolish.
-
-What we can do is summarize their contents -- coalesce records into groups based on their natural spatial arrangement. If the points represented Foursquare checkins, those clusters would match the population distribution. If they were wind turbine generators, they would cluster near shores and prairies.
-
-K-Means Clustering is an effective way to form that summarization.
-
-=== Keep Exploring ===
-
-===== Balanced Quadtiles =====
-
-Earlier, we described how quadtiles define a tree structure, where each branch of the tree divides the plane exactly in half and leaf nodes hold features. The multiscale scheme handles skewed distributions by developing each branch only to a certain depth. Splits are even, but the tree is lopsided (you need many more zoom levels for New York City than for Irkutsk).
-
-K-D trees are another approach. The rough idea: rather than blindly splitting in half by area, split the plane to have each half hold the same-ish number of points. It's more complicated, but it leads to a balanced tree while still accommodating highly skewed distributions. Jacob Perkins (`@thedatachef`) has a http://thedatachef.blogspot.com/2012/10/k-d-tree-generation-with-apache-pig.html[great post about K-D trees] with further links.
-
-===== It's not just for Geo =====
-
-=== Exercises ===
-
-[[brain_example]]
-**Exercise 1**: Extend quadtile mapping to three dimensions
-
-To jointly model the network and spatial relationships of neurons in the brain, you will need to use not two but three spatial dimensions. Write code to map positions within a 200mm-per-side cube to an "octcube" index analogous to the quadtile scheme. How large (in mm) is each cube using 30-bit keys? Using 63-bit keys?
-
-For even higher dimensions of fun, extend the http://en.wikipedia.org/wiki/Voronoi_diagram#Higher-order_Voronoi_diagrams[Voronoi diagram to three dimensions].
-
-**Exercise 2**: Locality
-
-We've seen a few ways to map feature data to joinable datasets. Describe how you'd join each possible pair of datasets from this list (along with the story it would tell):
-
-* Census data: dozens of variables, each attached to a census tract ID, along with a region polygon for each census tract.
-* Cell phone antenna locations: cell towers are spread unevenly, and have a maximum range that varies by type of antenna.
- - case 1: you want to match locations to the single nearest antenna, if any is within range.
- - case 2: you want to match locations to all antennae within range.
-* Wikipedia pages having geolocations.
-* Disease reporting: 60,000 points distributed sparsely and unevenly around the country, each reporting the occurrence of a disease.
-
-For example, joining disease reports against census data might expose correlations of outbreak with ethnicity or economic status. I would prepare the census regions as quadtile-split polygons. Next, map each disease report to the right quadtile and in the reducer identify the census region it lies within. Finally, join on the tract ID-to-census record table.
-
-**Exercise 3**: Write a generic utility to do multiscale smoothing
-
-Its input is a uniform sampling of values: a value for every grid cell at some zoom level.
-However, lots of those values are similar.
-Combine all grid cells whose values lie within a certain tolerance into a single coarser-scale cell.
-
-Example: merge all cells whose contents lie within 10% of each other
-
- 00 10
- 01 11
- 02 9
- 03 8
- 10 14
- 11 15
- 12 12
- 13 14
- 20 19
- 21 20
- 22 20
- 23 21
- 30 12
- 31 14
- 32 8
- 33 3
-
- 10 11 14 18 .9.5. 14 18
- 9 8 12 14 . . 12 14
- 19 20 12 14 . 20. 12 14
- 20 21 8 3 . . 8 3
-
-
-
-=== References ===
-
-* http://kartoweb.itc.nl/geometrics/Introduction/introduction.html -- an excellent overview of projections, reference surfaces and other fundamentals of geospatial analysis.
-* http://msdn.microsoft.com/en-us/library/bb259689.aspx
-* http://www.maptiler.org/google-maps-coordinates-tile-bounds-projection/
-* http://wiki.openstreetmap.org/wiki/QuadTiles
-* https://github.com/simplegeo/polymaps
-* http://www.slideshare.net/mmalone/scaling-gis-data-in-nonrelational-data-stores[Scaling GIS Data in Non-relational Data Stores] by Mike Malone
-
-* http://www.comp.lancs.ac.uk/~kristof/research/notes/voronoi/[Voronoi Diagrams]
-* http://bl.ocks.org/4122298[US County borders in GeoJSON]
-
-
-