Geospatial data typically comes in one of two data models: rasters which are similar to images with a regular grid of pixels whose values represent some spatial phenomenon (e.g. elevation) and vectors which are entities with discrete geometries (e.g. state boundaries). This software, rasterstats
, exists solely to extract information from geospatial raster data based on vector geometries.
Primarily, this involves zonal statistics: a method of summarizing and aggregating the raster values intersecting a vector geometry. For example, zonal statistics provides answers such as the mean precipitation or maximum elevation of an administrative unit. Additionally, functions are provided for point queries, most notably the ability to query a raster at a point and get an interpolated value rather than the simple nearest pixel.
The typical usage of rasterstats functions involves two arguments, a vector and a raster dataset:
>>> from rasterstats import zonal_stats, point_query
>>> stats = zonal_stats('tests/data/polygons.shp', 'tests/data/slope.tif')
>>> pts = point_query('tests/data/points.shp', 'tests/data/slope.tif')
zonal_stats
gives us a list of two dictionaries corresponding to each input polygon:
>>> from pprint import pprint
>>> pprint(stats)
[{'count': 75,
'max': 22.273418426513672,
'mean': 14.660084635416666,
'min': 6.575114727020264},
{'count': 50,
'max': 82.69043731689453,
'mean': 56.60576171875,
'min': 16.940950393676758}]
while point_query
gives us a list of raster values corresponding to each input point:
>>> pts
[14.037668283186257, 33.1370268256543, 36.46848854950241]
The most common use case is having vector data sources in a file such as an ESRI Shapefile or any other format supported by fiona
. The path to the file can be passed in directly as the first argument:
>>> zs = zonal_stats('tests/data/polygons.shp', 'tests/data/slope.tif')
If you have multi-layer sources, you can specify the layer
by either name or index:
>>> zs = zonal_stats('tests/data', 'tests/data/slope.tif', layer="polygons")
In addition to the basic usage above, rasterstats supports other mechanisms of specifying vector geometries.
The vector argument can be an iterable of GeoJSON-like features such as a fiona source:
>>> import fiona
>>> with fiona.open('tests/data/polygons.shp') as src:
... zs = zonal_stats(src, 'tests/data/slope.tif')
You can also pass in an iterable of python objects that support the __geo_interface__
(e.g. Shapely, ArcPy, PyShp, GeoDjango):
>>> from shapely.geometry import Point
>>> pt = Point(245000, 1000000)
>>> pt.__geo_interface__
{'type': 'Point', 'coordinates': (245000.0, 1000000.0)}
>>> point_query([pt], 'tests/data/slope.tif')
[21.32739672330894]
Strings in well known text (WKT) and binary (WKB) format :
>>> pt.wkt
'POINT (245000 1000000)'
>>> point_query([pt], 'tests/data/slope.tif')
[21.32739672330894]
>>> pt.wkb
'\x01\x01\x00\x00\x00\x00\x00\x00\x00@\xe8\rA\x00\x00\x00\x00\x80\x84.A'
>>> point_query([pt], 'tests/data/slope.tif')
[21.32739672330894]
Any format that can be read by rasterio
is supported by rasterstats
. To test if a data source is supported by your installation (this might differ depending on the format support of the underlying GDAL library), use the rio command line tool:
$ rio info raster.tif
You can specify the path to the raster directly:
>>> zs = zonal_stats('tests/data/polygons.shp', 'tests/data/slope.tif')
If the raster contains multiple bands, you must specify the band (1-indexed):
>>> zs = zonal_stats('tests/data/polygons.shp', 'tests/data/slope.tif', band=1)
Or you can pass a numpy ndarray
with an affine transform mapping the array dimensions to a coordinate reference system:
>>> import rasterio
>>> with rasterio.open('tests/data/slope.tif') as src:
... affine = src.transform
... array = src.read(1)
>>> zs = zonal_stats('tests/data/polygons.shp', array, affine=affine)
By default, the zonal_stats
function will return the following statistics
- min
- max
- mean
- count
Optionally, these statistics are also available.
- sum
- std
- median
- majority
- minority
- unique
- range
- nodata
- percentile (see note below for details)
You can specify the statistics to calculate using the stats
argument:
>>> stats = zonal_stats("tests/data/polygons.shp",
... "tests/data/slope.tif",
... stats=['min', 'max', 'median', 'majority', 'sum'])
You can also specify as a space-delimited string:
>>> stats = zonal_stats("tests/data/polygons.shp",
... "tests/data/slope.tif",
... stats="min max median majority sum")
Note that certain statistics (majority, minority, and unique) require significantly more processing due to expensive counting of unique occurences for each pixel value.
You can also use a percentile statistic by specifying percentile_<q>
where <q>
can be a floating point number between 0 and 100.
You can define your own aggregate functions using the add_stats
argument. This is a dictionary with the name(s) of your statistic as keys and the function(s) as values. For example, to reimplement the mean statistic:
>>> from __future__ import division
>>> import numpy as np
>>> def mymean(x):
... return np.ma.mean(x)
then use it in your zonal_stats
call like so:
>>> zonal_stats("tests/data/polygons.shp",
... "tests/data/slope.tif",
... stats="count",
... add_stats={'mymean':mymean})
[{'count': 75, 'mymean': 14.660084635416666}, {'count': 50, 'mymean': 56.605761718750003}]
To have access to geometry properties, a dictionnary can be passed to the user-defined function:
>>> def mymean_prop(x,prop):
... return np.ma.mean(x) * prop['id']
then use it in your zonal_stats
call like so:
>>> zonal_stats("tests/data/polygons.shp",
... "tests/data/slope.tif",
... stats="count",
... add_stats={'mymean_prop':mymean_prop},
... properties=['id'])
[{'count': 75, 'mymean_prop': 14.660084635416666}, {'count': 50, 'mymean_prop': 113.2115234375}]
If you want to retain the geometries and properties of the input features, you can output a list of geojson features using geojson_out
. The features contain the zonal statistics as additional properties:
>>> stats = zonal_stats("tests/data/polygons.shp",
... "tests/data/slope.tif",
... geojson_out=True)
>>> stats[0]['type']
'Feature'
>>> stats[0]['properties'].keys()
[u'id', 'count', 'max', 'mean', 'min']
There is no right or wrong way to rasterize a vector. The default strategy is to include all pixels along the line render path (for lines), or cells where the center point is within the polygon (for polygons). Alternatively, you can opt for the all_touched
strategy which rasterizes the geometry by including all pixels that it touches. You can enable this specifying:
>>> zs = zonal_stats("tests/data/polygons.shp",
... "tests/data/slope.tif",
... all_touched=True)
The figure above illustrates the difference; the default all_touched=False
is on the left while the all_touched=True
option is on the right. Both approaches are valid and there are tradeoffs to consider. Using the default rasterizer may miss polygons that are smaller than your cell size resulting in None
stats for those geometries. Using the all_touched
strategy includes many cells along the edges that may not be representative of the geometry and may give severly biased results in some cases.
You can treat rasters as categorical (i.e. raster values represent discrete classes) if you're only interested in the counts of unique pixel values.
For example, you may have a raster vegetation dataset and want to summarize vegetation by polygon. Statistics such as mean, median, sum, etc. don't make much sense in this context (What's the sum of oak + grassland
?).
Using categorical
, the output is dictionary with the unique raster values as keys and pixel counts as values:
>>> zonal_stats('tests/data/polygons.shp',
... 'tests/data/slope_classes.tif',
... categorical=True)[1]
{1.0: 1, 2.0: 9, 5.0: 40}
rasterstats will report using the pixel values as keys. To associate the pixel values with their appropriate meaning, you can use a category_map
:
>>> cmap = {1.0: 'low', 2.0: 'med', 5.0: 'high'}
>>> zonal_stats('tests/data/polygons.shp',
... 'tests/data/slope_classes.tif',
... categorical=True, category_map=cmap)[1]
{'high': 40, 'med': 9, 'low': 1}
Internally, we create a masked raster dataset for each feature in order to calculate statistics. Optionally, we can include these data in the output of zonal_stats
using the raster_out
argument:
>>> zonal_stats('tests/data/polygons.shp',
... 'tests/data/slope_classes.tif',
... stats="count",
... raster_out=True)[0].keys()
['count', 'mini_raster_affine', 'mini_raster_array', 'mini_raster_nodata']
Notice we have three additional keys:
* ``mini_raster_array``: The clipped and masked numpy array
* ``mini_raster_affine``: transformation as an Affine object
* ``mini_raster_nodata``: The nodata value
Keep in mind that having ndarrays in your stats dictionary means it is more difficult to serialize to json and other text formats.
rasterstats
aims to do only one thing well: getting information from rasters based on vector geometry. This module doesn't support coordinate reprojection, raster re-sampling, geometry manipulations or any other geospatial data transformations as those are better left to other Python packages. To the extent possible, data input is handled by fiona
and rasterio
, though there are some wrapper functions for IO to maintain usability. Where interoperability between packages is needed, loose coupling, simple python data structure and standard interfaces like GeoJSON are employed to keep the core library lean.
This work grew out of a need to have a native python implementation (based on numpy) for zonal statisics. I had been using starspan, a C++ command line tool, as well as GRASS's r.statistics for many years. They were suitable for offline analyses but were rather clunky to deploy in a large python application. In 2013, I implemented a proof-of-concept zonal stats function which eventually became rasterstats
. It has been in production in several large python web applications ever since, replacing the starspan wrapper madrona.raster_stats.