Fixed typos and added performance notes
jbednar committed Apr 9, 2018
1 parent d47ac21 commit 3be3a17
Showing 1 changed file with 26 additions and 5 deletions.
31 changes: 26 additions & 5 deletions examples/user_guide/14-Large_Data.ipynb
@@ -96,7 +96,7 @@
"\n",
"Because all of the data in these plots gets transferred directly into the web browser, the interactive functionality will be available even on a static export of this figure as a web page. Note that even though the visualization above is not computationally expensive, even with just 1000 points as in the scatterplot above, the plot already suffers from [overplotting](https://anaconda.org/jbednar/plotting_pitfalls), with later points obscuring previously plotted points. \n",
"\n",
"With much larger datasets, these issues will quickly make it impossible to see the true structure of the data. We can easily declare 50X or 1000X larger versions of the same plots above, but if we tried to visualize them they would be nearly unusable even if the brwoser did not crash:"
"With much larger datasets, these issues will quickly make it impossible to see the true structure of the data. We can easily declare 50X or 1000X larger versions of the same plots above, but if we tried to visualize them they would be nearly unusable even if the browser did not crash:"
]
},
{
@@ -154,7 +154,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"In all three of the above plots, `rasterize()` is being called to aggregate the data (a large set of x,y locations) into a rectangular grid, with each grid cell counting up the number of points that fall into it. In the plot on the left, only `rasterize()` is done, and the resulting numeric array of counts is passed to Bokeh for colormapping. Bokeh can thenuse dynamic (client-side, browser-based) operations in JavaScript, allowing users to have dynamic control over even static HTML plots. For instance, in this case, users can use the Box Select tool and select a range of the histogram shown, dynamically remapping the colors used in the plot to cover the selected range.\n",
"In all three of the above plots, `rasterize()` is being called to aggregate the data (a large set of x,y locations) into a rectangular grid, with each grid cell counting up the number of points that fall into it. In the plot on the left, only `rasterize()` is done, and the resulting numeric array of counts is passed to Bokeh for colormapping. Bokeh can then use dynamic (client-side, browser-based) operations in JavaScript, allowing users to have dynamic control over even static HTML plots. For instance, in this case, users can use the Box Select tool and select a range of the histogram shown, dynamically remapping the colors used in the plot to cover the selected range.\n",
"\n",
"The other two plots should be identical. In both cases, the numerical array output of `rasterize()` is mapped into RGB colors by Datashader itself, in Python (\"server-side\"), which allows special Datashader computations like the histogram-equalization in the above plots and the \"spreading\" discussed below. The `shade()` and `datashade()` operations accept a `cmap` argument that lets you control the colormap used, which can be selected to match the HoloViews/Bokeh `cmap` option but is strictly independent of it. See ``hv.help(rasterize)``, ``hv.help(shade)``, and ``hv.help(datashade)`` for options that can be selected, and the [Datashader web site](http://datashader.org) for all the details. You can also try the lower-level ``hv.aggregate()`` (for points and lines) and ``hv.regrid()` (for image/raster data) operations, which may provide more control."
]
@@ -522,11 +522,32 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"# Performance\n",
"# Optimizing performance\n",
"\n",
"Although HoloViews tries to convert whatever data you have provided into what Datashader itself supports, you will see much better performance if you store your data in a format that Datashader understands already, so that HoloViews can simply pass it down to Datashader without copying or transforming it. For point, line, and trimesh data, Datashader supports Dask and Pandas dataframes, and so those two data sources will be fastest (with Dask Dataframes somewhat faster and also able to make use of distributed computational resources). For rasters, Datashader supports xarray objects, and so those will be implemented most efficiently. See the [Datashader docs](http://datashader.org) for instructions and examples for dealing with even quite large datasets (in the billions of points) on commodity hardware. \n",
"Datashader and HoloViews have different design principles that are worth keeping in mind when using them in combination, if you want to ensure good overall performance. By design, Datashader supports only a small number of operations and datatypes, focusing only on what can be implemented very efficiently. HoloViews instead focuses on supporting the typical workflows of Python users, recognizing that the most computationally efficient choice is only going to be faster overall if it also minimizes the time users have to spend getting things working. \n",
"\n",
"The combination of HoloViews and Datashader allows you to work uniformly with data covering a huge range of sizes, trading off the ability to export user-manipulable plots against file size and browser compatibility, and allowing you to render even the largest dataset faithfully. HoloViews makes the full power of Datashader available in just a few lines of code, letting you reveal your data regardless of its size."
"HoloViews thus helps you get something working quickly, but once it is working and you realize that you need to do this often or that it comes up against the limits of your computing hardware, you can consider whether you can get much better performance by considering the following issues and suggestions.\n",
"\n",
"### Use a Datashader-supported data structure\n",
"\n",
"HoloViews helpfully tries to convert whatever data you have provided into what Datashader supports, which is good for optimizing your time to an initial solution, but will not always be the fastest approach computationally. If you ensure that you store your data in a format that Datashader understands already, HoloViews can simply pass it down to Datashader without copying or transforming it:\n",
"\n",
"1. For point, line, and trimesh data, Datashader supports Dask and Pandas dataframes, and so those two data sources will be fastest. Of those two, Dask Dataframes will usually be somewhat faster and also able to make use of distributed computational resources and out-of-core processing.\n",
"2. For rasters, Datashader supports xarray objects, and so if your data is provided as an xarray plotting will be faster. \n",
"\n",
"See the [Datashader docs](http://datashader.org) for examples of dealing with even quite large datasets (in the billions of points) on commodity hardware, including many HoloViews-based examples.\n",
"\n",
"### Cache initial procesing with `precompute=True`\n",
"\n",
"In the typical case of having datasets much larger than the plot resolution, HoloViews Datashader-based operations that work on the full dataset (`rasterize`, `aggregate`,`regrid`) are computationally expensive; the others are not (`shade`, `spread`, `dynspread`, etc.) \n",
"\n",
"The expensive operations are all of type `ResamplingOperation`, which has a parameter `precompute` (see `hv.help(hv.operation.datashader.rasterize)`, etc.) Precompute can be used to get faster performance in interactive usage by caching the last set of data used in plotting (*after* any transformations needed) and reusing it when it is requested again. `precompute` is False by default, because it requires using memory to store the cached data, but if you have enough memory, you can enable it so that repeated interactions (such as zooming and panning) will be much faster than the first one. In practice, most Datashader-plots don't need to do extensive precomputing, but enabling it for TriMesh plots (or anything based on TriMesh, such as QuadMesh) can greatly speed up interactive usage.\n",
"\n",
"### Project data only once\n",
"\n",
"If you are working with geographic data using [GeoViews](http://geoviews.org) that needs to be projected before display and/or before datashading, GeoViews will have to do this every time you update a plot, which can drown out the performance improvement you get by using Datashader. GeoViews allows you to project the entire dataset at once using `gv.operation.project`, and once you do this you should be able to use Datashader at full speed.\n",
"\n",
"If you follow these suggestions, the combination of HoloViews and Datashader will allow you to work uniformly with data covering a huge range of sizes. Per session or per plot, you can trade off the ability to export user-manipulable plots against file size and browser compatibility, and allowing you to render even the largest dataset faithfully. HoloViews makes the full power of Datashader available in just a few lines of code, giving you a natural way to work with your data regardless of its size."
]
}
],
