## 4: Using the `.plot` API

<div class="alert alert-warning" role="alert"> <strong>WORK IN PROGRESS:</strong> We are in the process of updating these materials in anticipation of a tutorial at the 2019 SciPy conference. Work will be complete by the morning of July 8th, 2019. Check out <a href="https://github.com/pyviz/holoviz/tree/v0.1.1">this tag</a> to access the materials as they were before these changes started. For the latest version of the tutorial, visit <a href="https://holoviz.org/tutorial">holoviz.org</a>.
</div>

If you have tried to visualize `pandas.DataFrame`s before, then you have likely encountered the `.plot` API. `.plot` provides a quick way to get from data to a visualization while keeping the focus on the data. Because this has become such a popular way of generating plots, other libraries have adopted the same interface. It is now possible to generate [matplotlib](https://pandas.pydata.org/pandas-docs/stable/user_guide/visualization.html), [plotly](https://github.com/santosjorge/cufflinks), [bokeh](https://github.com/PatrikHlobil/Pandas-Bokeh), [holoviews](https://hvplot.pyviz.org), or [vega](https://altair-viz.github.io/pdvega) plots using a very similar approach. From the user's perspective, this has made the barrier to switching plotting libraries much lower.

In this notebook we'll explore what is possible with the default `.plot` API and demonstrate the capabilities of `.hvplot`.

### Read in the data

In [None]:
import dask.dataframe as dd

In [None]:
df = dd.read_parquet('../data/earthquakes.parq')
df.head()

### Using default `.plot`

The first thing that we'd like to do with this data is visualize the location of every earthquake, so we want a `scatter` plot where `x='longitude'` and `y='latitude'`. If you are familiar with the pandas `.plot` API, you might expect to execute `df.plot.scatter(x='longitude', y='latitude')`. Feel free to try this out in a new cell. It throws an error: `AttributeError: 'DataFrame' object has no attribute 'plot'`. Because we have a dask dataframe rather than a pandas dataframe, we first need to convert it to pandas. To make the data more manageable, we'll use just a fraction (1%) of it and call that `small_df`.

In [None]:
%matplotlib inline

In [None]:
small_df = df.sample(frac=.01).compute()
small_df.shape

Now we have a smaller dataset with just 21k earthquakes. We can use that to test out our visualizations before ramping back up to the full dataset.
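
Because `sample` draws rows at random, `small_df` will contain different earthquakes each time the cell runs. As a minimal sketch in plain pandas, assuming a made-up 1,000-row frame (the values are hypothetical, not the earthquake data), the following shows how `frac` sets the sample size and how `random_state` makes the draw repeatable:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the earthquake table: 1,000 random points.
rng = np.random.default_rng(0)
toy_df = pd.DataFrame({
    'longitude': rng.uniform(-180, 180, size=1000),
    'latitude': rng.uniform(-90, 90, size=1000),
})

# frac=.01 keeps 1% of the rows; random_state fixes which rows are drawn.
sample_a = toy_df.sample(frac=.01, random_state=42)
sample_b = toy_df.sample(frac=.01, random_state=42)

print(sample_a.shape)             # 1% of the 1,000 rows, both columns
print(sample_a.equals(sample_b))  # same seed, same rows
```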

In [None]:
small_df.plot.scatter(x='longitude', y='latitude')

### Using `.hvplot`

That is a good place to start: you can begin to see the structure of the edges of the plates (which often correspond with the edges of the continents). You can make a very similar plot with the same arguments using `hvplot`.

In [None]:
import hvplot.pandas

In [None]:
small_df.hvplot.scatter(x='longitude', y='latitude')

You might have noticed that many of the dots in the scatter plot that we've just created lie on top of one another. This is called "overplotting" and can be avoided in a variety of ways, such as by making the dots slightly transparent or by binning the data. These approaches have the downside of introducing bias, because you need to choose the alpha or the edges of the bins, and to do that you have to make assumptions about the data. Rather than use these traditional approaches, we use a library called [datashader](https://datashader.org), which aggregates the data per pixel. In `hvplot` we can activate this capability by setting `datashade=True`.

In [None]:
### Exercise: Try changing the alpha (try .1) on the plot above to see the effect of this approach, or create a hexbin plot.

In [None]:
small_df.hvplot.scatter(x='longitude', y='latitude', datashade=True)
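
To make "aggregating per pixel" concrete, here is a minimal sketch that uses `numpy.histogram2d` as a stand-in for the idea. It is not datashader's implementation, and the synthetic points and the 40 by 20 grid are assumptions chosen purely for illustration. Every point lands in exactly one cell, so overlapping points raise that cell's count instead of hiding one another:

```python
import numpy as np

# Hypothetical points spread over the globe (not the earthquake data).
rng = np.random.default_rng(1)
n_points = 100_000
lon = rng.uniform(-180, 180, size=n_points)
lat = rng.uniform(-90, 90, size=n_points)

# Treat the plot as a grid of 40 x 20 "pixels" and count points per pixel.
width, height = 40, 20
counts, lon_edges, lat_edges = np.histogram2d(
    lon, lat, bins=(width, height), range=[[-180, 180], [-90, 90]]
)

print(counts.shape)       # one count per pixel
print(int(counts.sum()))  # no point is lost to overplotting
```

The counts are then mapped to colors, and the grid is recomputed for the new extent whenever you zoom or pan.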

Remember that we are still only plotting 1% of the data (21k earthquakes). With datashader, we can easily plot the full original dataset. Even though the full data are still in a dask object, we can use the same API, because `hvplot` differs from the other `.plot` libraries in that it doesn't just target pandas objects. Instead, `hvplot` can be used with:
 - Pandas : DataFrame, Series (columnar/tabular data)
 - xarray : Dataset, DataArray (labelled multidimensional arrays)
 - Dask : DataFrame, Series (distributed/out of core arrays and columnar data)
 - Streamz : DataFrame(s), Series(s) (streaming columnar data)
 - Intake : DataSource (data catalogues)
 - GeoPandas : GeoDataFrame (geometry data)
 - NetworkX : Graph (network graphs)

In [None]:
import hvplot.dask

In [None]:
df.hvplot.scatter(x='longitude', y='latitude', datashade=True)

**NOTE:** That probably felt very slow to render, and when you zoom into the plot, the image re-renders. This can also be slow (tens of seconds) when the data is being read from disk. If your computer has sufficient RAM, persist the dataset in memory to significantly reduce the time that it takes to re-render the plot on zoom events.

In [None]:
df = df.persist()

In [None]:
df.hvplot.scatter(x='longitude', y='latitude', datashade=True)

In [None]:
### Exercise: If you are brave and don't mind restarting your kernel if it dies, create a scatter without setting datashade=True.

### Statistical Plots

Next we'll look at the frequency of different magnitude earthquakes.

| Magnitude     | Earthquake Effect | Estimated Number Each Year |
|---------------|-------------------|----------------------------|
| 2.5 or less   | Usually not felt, but can be recorded by seismograph. |900,000|
| 2.5 to 5.4    | Often felt, but only causes minor damage. |30,000 |
| 5.5 to 6.0    | Slight damage to buildings and other structures. |500 |
| 6.1 to 6.9    | May cause a lot of damage in very populated areas. | 100 |
| 7.0 to 7.9    | Major earthquake. Serious damage. | 20 |
| 8.0 or greater| Great earthquake. Can totally destroy communities near the epicenter. | One every 5 to 10 years |

As a first pass, we'll plot a histogram, first with `.plot.hist` on the small data and then with `.hvplot.hist` on the full dataset. Before plotting, we can clean the data by setting any magnitudes that are not greater than 0 to NaN.

In [None]:
cleaned_df = df.copy()
cleaned_df['mag'] = df.mag.where(df.mag > 0)
cleaned_small_df = cleaned_df.sample(frac=.01).compute()
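
As a small illustration of the cleaning step, the sketch below applies the same `where` condition to a hypothetical four-value Series (the numbers are made up). Values that fail the condition become NaN rather than being dropped, so the rows stay in place, and a magnitude of exactly 0 is masked along with the negative ones:

```python
import pandas as pd

# Hypothetical magnitudes, including two non-positive readings.
mag = pd.Series([4.5, -1.0, 0.0, 2.3])

# where keeps values that satisfy the condition and sets the rest to NaN.
cleaned = mag.where(mag > 0)

print(cleaned.isna().tolist())  # which entries were masked
print(cleaned.count())          # count() ignores NaN
```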

In [None]:
cleaned_small_df.plot.hist(y='mag')

Similarly we can create a histogram of the whole dataset using hvplot.

In [None]:
cleaned_df.hvplot.hist(y='mag', bin_range=(0,10), bins=50)
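
Passing `bins` and `bin_range` fixes the bin edges up front, so they do not depend on the minimum and maximum of the data. A rough numpy equivalent, assuming hypothetical magnitudes drawn uniformly between 0 and 10, shows what those two arguments imply for the edges:

```python
import numpy as np

# Hypothetical magnitudes standing in for the 'mag' column.
rng = np.random.default_rng(2)
mag = rng.uniform(0, 10, size=5000)

# 50 equal-width bins over the range 0 to 10, mirroring bins=50, bin_range=(0,10).
counts, edges = np.histogram(mag, bins=50, range=(0, 10))

print(len(counts), len(edges))  # 50 bins need 51 edges
print(edges[1] - edges[0])      # width of each bin
```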

In [None]:
### Exercise: Create a kernel density estimate (kde) plot of magnitude.

### Categorical variables

Next we'll categorize the earthquakes based on depth. You can read about all the different variables available in this dataset [here](https://earthquake.usgs.gov/data/comcat/data-eventterms.php). In the interest of time, we'll use the small dataset and assume that it is representative of all the earthquakes. According to the [USGS page on earthquake depths](https://earthquake.usgs.gov/learn/topics/determining_depth.php):

> Shallow earthquakes are between 0 and 70 km deep; 
intermediate earthquakes, 70 - 300 km deep; and deep earthquakes, 
300 - 700 km deep. 
In general, the term "deep-focus earthquakes" is applied to earthquakes deeper than 70 km. 
All earthquakes deeper than 70 km are localized within great slabs of lithosphere that are sinking into the Earth's mantle.

First we'll use `pd.cut` to split the small dataset into depth categories.

In [None]:
import numpy as np
import pandas as pd

In [None]:
depth_bins = [-np.inf, 70, 300, np.inf]
depth_names = ['Shallow', 'Intermediate', 'Deep']
depth_class_column = pd.cut(cleaned_small_df['depth'], depth_bins, labels=depth_names)

cleaned_small_df.insert(1, 'depth_class', depth_class_column)
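
To see how `pd.cut` treats values that sit exactly on a bin edge, here is a minimal sketch that reuses the same bins and labels on five hypothetical depths (the depths are made up). By default each interval includes its right edge, so 70 km is classed as shallow and 300 km as intermediate:

```python
import numpy as np
import pandas as pd

depth_bins = [-np.inf, 70, 300, np.inf]
depth_names = ['Shallow', 'Intermediate', 'Deep']

# Hypothetical depths in km, including values exactly on the bin edges.
depths = pd.Series([10, 70, 150, 300, 650])

# Default right=True gives the intervals (-inf, 70], (70, 300], (300, inf].
classes = pd.cut(depths, depth_bins, labels=depth_names)

print(classes.tolist())
```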

We can now use this new categorical variable to group our data. First, we will overlay all our groups on the same plot using the `by` option:

In [None]:
cleaned_small_df.hvplot.hist(y='mag', by='depth_class')

**NOTE:** Click on the legend to turn off certain categories and see what is behind them.

In [None]:
### Exercise: Add subplots=True and width=300 to see the different classes side-by-side. The y-axis will be linked, so try zooming.

To instead use a widget to toggle between classes, use the `groupby` option:

In [None]:
cleaned_small_df.hvplot.bivariate(x='mag', y='depth', groupby='depth_class')

### Classifying by magnitude

In addition to classifying by depth, we can classify by magnitude.

| Class    | Magnitude |
|----------|-----------|
| Great    | 8 or more |
| Major    | 7 - 7.9   |
| Strong   | 6 - 6.9   |
| Moderate | 5 - 5.9   |
| Light    | 4 - 4.9   |
| Minor    | 3 - 3.9   |

In [None]:
classified_df = df[df.mag >= 3].compute()

depth_class_column = pd.cut(classified_df['depth'], depth_bins, labels=depth_names)
classified_df.insert(1, 'depth_class', depth_class_column)

mag_bins = [2.9, 3.9, 4.9, 5.9, 6.9, 7.9, 10]
mag_names = ['Minor', 'Light', 'Moderate', 'Strong', 'Major', 'Great']
mag_class_column = pd.cut(classified_df['mag'], mag_bins, labels=mag_names)
classified_df.insert(11, 'mag_class', mag_class_column)
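
The edges in `mag_bins` work well when magnitudes are recorded to one decimal place. With right-closed intervals, a value such as 3.95 falls in (3.9, 4.9] and is labelled 'Light'. If your magnitudes carry more decimals, one possible variation (an assumption for illustration, not what this notebook uses) is left-closed bins that start at the whole numbers from the table. The magnitudes below are hypothetical:

```python
import numpy as np
import pandas as pd

mag_names = ['Minor', 'Light', 'Moderate', 'Strong', 'Major', 'Great']
mags = pd.Series([3.0, 3.95, 4.0, 7.9, 8.0])

# Edges used in this notebook: right-closed intervals such as (2.9, 3.9].
notebook_bins = [2.9, 3.9, 4.9, 5.9, 6.9, 7.9, 10]
right_closed = pd.cut(mags, notebook_bins, labels=mag_names)

# Variation: left-closed intervals such as [3, 4), with no upper limit.
whole_bins = [3, 4, 5, 6, 7, 8, np.inf]
left_closed = pd.cut(mags, whole_bins, labels=mag_names, right=False)

print(right_closed.tolist())
print(left_closed.tolist())
```

The two schemes differ only for values that fall between the one-decimal edges, such as 3.95.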

In [None]:
classified_df.hvplot.heatmap(x='mag_class', y='depth_class', C='id', reduce_function=np.count_nonzero)
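
The heatmap colors each cell by the number of earthquakes in that combination of classes. To check the numbers behind the colors, a minimal sketch with `pd.crosstab` on a hypothetical five-row frame (all values are made up) builds the same kind of count table:

```python
import pandas as pd

# Hypothetical classified earthquakes; 'id' is only here to mirror the real columns.
toy = pd.DataFrame({
    'id': ['a', 'b', 'c', 'd', 'e'],
    'mag_class': ['Minor', 'Minor', 'Light', 'Light', 'Minor'],
    'depth_class': ['Shallow', 'Deep', 'Shallow', 'Shallow', 'Shallow'],
})

# Rows are depth classes, columns are magnitude classes, cells are counts.
counts = pd.crosstab(toy['depth_class'], toy['mag_class'])

print(counts)
```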