Add ecdfplot function (#2141)
* Add basic ecdfplot implementation

* Allow user to override drawstyle

* Add unit tests

* Add docstring content

* Add more docstring information and fix test

* Add complementary ECDF

* Add ecdfplot API examples

* Fix step plots with y data variable

* Housekeeping

* Fix error message

* Mention ecdfplot in release notes
mwaskom committed Jun 17, 2020
1 parent 2d6f86a commit eca47b9
Showing 9 changed files with 579 additions and 17 deletions.
1 change: 1 addition & 0 deletions doc/api.rst
@@ -45,6 +45,7 @@ Distribution plots

distplot
histplot
ecdfplot
kdeplot
rugplot

130 changes: 130 additions & 0 deletions doc/docstrings/ecdfplot.ipynb
@@ -0,0 +1,130 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Plot a univariate distribution along the x axis:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import seaborn as sns; sns.set()\n",
"penguins = sns.load_dataset(\"penguins\")\n",
"sns.ecdfplot(data=penguins, x=\"flipper_length_mm\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Flip the plot by assigning the data variable to the y axis:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"sns.ecdfplot(data=penguins, y=\"flipper_length_mm\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If neither `x` nor `y` is assigned, the dataset is treated as wide-form, and an ECDF is drawn for each numeric column:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"sns.ecdfplot(data=penguins.filter(like=\"culmen_\", axis=\"columns\"))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can also draw multiple ECDFs from a long-form dataset with hue mapping:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"sns.ecdfplot(data=penguins, x=\"culmen_length_mm\", hue=\"species\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The default distribution statistic is normalized to show a proportion, but you can show absolute counts instead:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"sns.ecdfplot(data=penguins, x=\"culmen_length_mm\", hue=\"species\", stat=\"count\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"It's also possible to plot the empirical complementary CDF (1 - CDF):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"sns.ecdfplot(data=penguins, x=\"culmen_length_mm\", hue=\"species\", complementary=True)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "seaborn-refactor (py38)",
"language": "python",
"name": "seaborn-refactor"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.2"
}
},
"nbformat": 4,
"nbformat_minor": 4
}
2 changes: 1 addition & 1 deletion doc/docstrings/histplot.ipynb
@@ -103,7 +103,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"You can also draw multiple histograms from a long-form dataset with hue mapping:"
"You can otherwise draw multiple histograms from a long-form dataset with hue mapping:"
]
},
{
6 changes: 5 additions & 1 deletion doc/releases/v0.11.0.txt
@@ -9,14 +9,18 @@ v0.11.0 (Unreleased)
Modernization of distribution functions
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

First, a new function, :func:`histplot` has been added. :func:`histplot` draws univariate or bivariate histograms with a number of features, including:
First, two new functions, :func:`histplot` and :func:`ecdfplot`, have been added.

:func:`histplot` draws univariate or bivariate histograms with a number of features, including:

- mapping multiple distributions with a ``hue`` semantic
- normalization to show density, probability, or frequency statistics
- flexible parameterization of bin size, including proper bins for discrete variables
- adding a KDE fit to show a smoothed distribution over all bin statistics
- experimental support for histograms over categorical and datetime variables. GH2125

:func:`ecdfplot` draws univariate empirical cumulative distribution functions, using a similar interface.

Second, the existing functions :func:`kdeplot` and :func:`rugplot` have been completely overhauled. Two of the oldest functions in the library, these lacked aspects of the otherwise-common seaborn API, such as the ability to assign variables by name from a ``data`` object; they had no capacity for semantic mapping; and they had numerous other inconsistencies and smaller issues.

The overhauled functions now share a common API with the rest of seaborn, they can show conditional distributions by mapping a third variable with a ``hue`` semantic, and have been improved in numerous other ways. The `github pull request (GH2104) <https://github.com/mwaskom/seaborn/pull/2104>`_ has a longer explanation of the changes and the motivation behind them.
3 changes: 3 additions & 0 deletions seaborn/_docstrings.py
@@ -116,6 +116,9 @@ def from_function_params(cls, func):
""",
kdeplot="""
kdeplot : Plot univariate or bivariate distributions using kernel density estimation.
""",
ecdfplot="""
ecdfplot : Plot empirical cumulative distribution functions.
""",
rugplot="""
rugplot : Plot a tick at each observation value along the x and/or y axes.
79 changes: 79 additions & 0 deletions seaborn/_statistics.py
@@ -1,3 +1,29 @@
"""Statistical transformations for visualization.

This module is currently private, but is being written to eventually form part
of the public API.

The classes should behave roughly in the style of scikit-learn.

- All data-independent parameters should be passed to the class constructor.
- Each class should implement a default transformation that is exposed through
  __call__. These are currently written for vector arguments, but I think
  consuming a whole `plot_data` DataFrame and returning it with transformed
  variables would make more sense.
- Some classes have data-dependent preprocessing that should be cached and used
  multiple times (think defining histogram bins off all data and then counting
  observations within each bin multiple times, once per data subset). These
  currently have unique names, but it would be good to have a common name. Not
  quite `fit`, but something similar.
- Alternatively, the transform interface could take some information about grouping
  variables and do a groupby internally.
- Some classes should define alternate transforms that might make the most sense
  with a different function. For example, KDE usually evaluates the distribution
  on a regular grid, but it would be useful for it to transform at the actual
  datapoints. Then again, this could be controlled by a parameter at the time of
  class instantiation.

"""
from distutils.version import LooseVersion
from numbers import Number
import numpy as np
@@ -345,3 +371,56 @@ def __call__(self, x1, x2=None, weights=None):
            return self._eval_univariate(x1, weights)
        else:
            return self._eval_bivariate(x1, x2, weights)


class ECDF:
    """Univariate empirical cumulative distribution estimator."""
    def __init__(self, stat="proportion", complementary=False):
        """Initialize the class with its parameters.

        Parameters
        ----------
        stat : {"proportion", "count"}
            Distribution statistic to compute.
        complementary : bool
            If True, use the complementary CDF (1 - CDF).

        """
        _check_argument("stat", ["count", "proportion"], stat)
        self.stat = stat
        self.complementary = complementary

    def _eval_bivariate(self, x1, x2, weights):
        """Inner function for ECDF of two variables."""
        raise NotImplementedError("Bivariate ECDF is not implemented")

    def _eval_univariate(self, x, weights):
        """Inner function for ECDF of one variable."""
        # Sort the observations and keep each weight with its observation
        sorter = x.argsort()
        x = x[sorter]
        weights = weights[sorter]
        y = weights.cumsum()

        if self.stat == "proportion":
            y = y / y.max()

        # Prepend a point at -inf so the curve starts from zero
        x = np.r_[-np.inf, x]
        y = np.r_[0, y]

        if self.complementary:
            y = y.max() - y

        return y, x

    def __call__(self, x1, x2=None, weights=None):
        """Return proportion or count of observations below each sorted datapoint."""
        x1 = np.asarray(x1)
        if weights is None:
            weights = np.ones_like(x1)
        else:
            weights = np.asarray(weights)

        if x2 is None:
            return self._eval_univariate(x1, weights)
        else:
            return self._eval_bivariate(x1, x2, weights)
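
To make the steps in `ECDF._eval_univariate` easier to follow, here is a minimal, dependency-free sketch of the same computation in plain Python. The helper name `ecdf_sketch` is hypothetical and not part of seaborn. It only mirrors the sort, cumulative sum, optional normalization, `-inf` padding, and optional complement shown in the diff, and it assumes a non-empty input.

```python
from itertools import accumulate


def ecdf_sketch(values, weights=None, stat="proportion", complementary=False):
    """Plain-Python sketch of the univariate ECDF steps (hypothetical helper).

    Returns (y, x), the same order that ECDF._eval_univariate uses.
    Assumes `values` is non-empty.
    """
    if stat not in ("proportion", "count"):
        raise ValueError(f"stat must be 'proportion' or 'count', got {stat!r}")

    values = list(values)
    if weights is None:
        weights = [1] * len(values)  # unweighted: each observation counts once
    else:
        weights = list(weights)

    # Sort the observations and keep each weight with its observation.
    order = sorted(range(len(values)), key=values.__getitem__)
    x = [values[i] for i in order]
    y = list(accumulate(weights[i] for i in order))  # running total of weights

    if stat == "proportion":
        total = max(y)
        y = [v / total for v in y]

    # Prepend a point at -inf so the curve starts from zero.
    x = [float("-inf")] + x
    y = [0] + y

    if complementary:
        top = max(y)
        y = [top - v for v in y]

    return y, x


y, x = ecdf_sketch([3, 1, 2, 2])
print(x)  # sorted observations, preceded by -inf
print(y)  # cumulative proportion at each sorted observation
```

Because the complement is taken after the zero is prepended, a complementary curve starts at the maximum (1 for proportions, the total weight for counts) and ends at zero.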
