Add ecdfplot function (#2141)

* Add basic ecdfplot implementation
* Allow user to override drawstyle
* Add unit tests
* Add docstring content
* Add more docstring information and fix test
* Add complementary ECDF
* Add ecdfplot API examples
* Fix step plots with y data variable
* Housekeeping
* Fix error message
* Mention ecdfplot in release notes
mwaskom · Jun 17, 2020 · eca47b9
1 parent 2d6f86a
Showing 9 changed files with 579 additions and 17 deletions.
diff --git a/doc/api.rst b/doc/api.rst
@@ -45,6 +45,7 @@ Distribution plots
 
     distplot
     histplot
+    ecdfplot
     kdeplot
     rugplot
 

diff --git a/doc/docstrings/ecdfplot.ipynb b/doc/docstrings/ecdfplot.ipynb
@@ -0,0 +1,130 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Plot a univariate distribution along the x axis:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import seaborn as sns; sns.set()\n",
+    "penguins = sns.load_dataset(\"penguins\")\n",
+    "sns.ecdfplot(data=penguins, x=\"flipper_length_mm\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Flip the plot by assigning the data variable to the y axis:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "sns.ecdfplot(data=penguins, y=\"flipper_length_mm\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "If neither `x` nor `y` is assigned, the dataset is treated as wide-form, and an ECDF is drawn for each numeric column:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "sns.ecdfplot(data=penguins.filter(like=\"culmen_\", axis=\"columns\"))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "You can also draw multiple ECDFs from a long-form dataset with hue mapping:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "sns.ecdfplot(data=penguins, x=\"culmen_length_mm\", hue=\"species\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "The default distribution statistic is normalized to show a proportion, but you can show absolute counts instead:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "sns.ecdfplot(data=penguins, x=\"culmen_length_mm\", hue=\"species\", stat=\"count\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "It's also possible to plot the empirical complementary CDF (1 - CDF):"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "sns.ecdfplot(data=penguins, x=\"culmen_length_mm\", hue=\"species\", complementary=True)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "seaborn-refactor (py38)",
+   "language": "python",
+   "name": "seaborn-refactor"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.8.2"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 4
+}
diff --git a/doc/docstrings/histplot.ipynb b/doc/docstrings/histplot.ipynb
@@ -103,7 +103,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "You can also draw multiple histograms from a long-form dataset with hue mapping:"
+    "You can otherwise draw multiple histograms from a long-form dataset with hue mapping:"
    ]
   },
   {

diff --git a/doc/releases/v0.11.0.txt b/doc/releases/v0.11.0.txt
@@ -9,14 +9,18 @@ v0.11.0 (Unreleased)
 Modernization of distribution functions
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
-First, a new function, :func:`histplot` has been added. :func:`histplot` draws univariate or bivariate histograms with a number of features, including:
+First, two new functions, :func:`histplot` and :func:`ecdfplot`, have been added.
+
+:func:`histplot` draws univariate or bivariate histograms with a number of features, including:
 
 - mapping multiple distributions with a ``hue`` semantic
 - normalization to show density, probability, or frequency statistics
 - flexible parameterization of bin size, including proper bins for discrete variables
 - adding a KDE fit to show a smoothed distribution over all bin statistics
 - experimental support for histograms over categorical and datetime variables. GH2125
 
+:func:`ecdfplot` draws univariate empirical cumulative distribution functions, using a similar interface.
+
 Second, the existing functions :func:`kdeplot` and :func:`rugplot` have been completely overhauled. Two of the oldest functions in the library, these lacked aspects of the otherwise-common seaborn API, such as the ability to assign variables by name from a ``data`` object; they had no capacity for semantic mapping; and they had numerous other inconsistencies and smaller issues.
 
 The overhauled functions now share a common API with the rest of seaborn, they can show conditional distributions by mapping a third variable with a ``hue`` semantic, and have been improved in numerous other ways. The `github pull request (GH2104) <https://github.com/mwaskom/seaborn/pull/2104>`_ has a longer explanation of the changes and the motivation behind them.
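
For readers less familiar with the term in the release note above, an empirical cumulative distribution function maps a value to the proportion of observations at or below it. The following is a minimal pure-Python sketch of that definition, not seaborn code: the helper name `ecdf_at` and the sample values are made up for illustration.

```python
from bisect import bisect_right


def ecdf_at(sample, t):
    """Proportion of observations in `sample` that are <= t.

    Hypothetical helper for illustration; not part of seaborn.
    """
    ordered = sorted(sample)
    # bisect_right counts how many sorted values are <= t.
    return bisect_right(ordered, t) / len(ordered)


# Made-up measurements, used only to exercise the helper.
flipper_lengths = [181, 186, 195, 193, 190]

print(ecdf_at(flipper_lengths, 190))  # 0.6 (three of five values are <= 190)
```

The complementary ECDF mentioned in the commit message is one minus this quantity.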

diff --git a/seaborn/_docstrings.py b/seaborn/_docstrings.py
@@ -116,6 +116,9 @@ def from_function_params(cls, func):
     """,
     kdeplot="""
 kdeplot : Plot univariate or bivariate distributions using kernel density estimation.
+    """,
+    ecdfplot="""
+ecdfplot : Plot empirical cumulative distribution functions.
     """,
     rugplot="""
 rugplot : Plot a tick at each observation value along the x and/or y axes.

diff --git a/seaborn/_statistics.py b/seaborn/_statistics.py
@@ -1,3 +1,29 @@
+"""Statistical transformations for visualization.
+
+This module is currently private, but is being written to eventually form part
+of the public API.
+
+The classes should behave roughly in the style of scikit-learn.
+
+- All data-independent parameters should be passed to the class constructor.
+- Each class should implement a default transformation that is exposed through
+  __call__. These are currently written for vector arguments, but I think
+  consuming a whole `plot_data` DataFrame and returning it with transformed
+  variables would make more sense.
+- Some classes have data-dependent preprocessing that should be cached and used
+  multiple times (think defining histogram bins off all data and then counting
+  observations within each bin for each of several data subsets). These currently
+  have unique names, but it would be good to have a common name. Not quite
+  `fit`, but something similar.
+- Alternatively, the transform interface could take some information about grouping
+  variables and do a groupby internally.
+- Some classes should define alternate transforms that might make the most sense
+  with a different function. For example, KDE usually evaluates the distribution
+  on a regular grid, but it would be useful for it to transform at the actual
+  datapoints. Then again, this could be controlled by a parameter at the time of
+  class instantiation.
+
+"""
 from distutils.version import LooseVersion
 from numbers import Number
 import numpy as np
@@ -345,3 +371,56 @@ def __call__(self, x1, x2=None, weights=None):
             return self._eval_univariate(x1, weights)
         else:
             return self._eval_bivariate(x1, x2, weights)
+
+
+class ECDF:
+    """Univariate empirical cumulative distribution estimator."""
+    def __init__(self, stat="proportion", complementary=False):
+        """Initialize the class with its parameters.
+
+        Parameters
+        ----------
+        stat : {"proportion", "count"}
+            Distribution statistic to compute.
+        complementary : bool
+            If True, use the complementary CDF (1 - CDF)
+
+        """
+        _check_argument("stat", ["count", "proportion"], stat)
+        self.stat = stat
+        self.complementary = complementary
+
+    def _eval_bivariate(self, x1, x2, weights):
+        """Inner function for ECDF of two variables."""
+        raise NotImplementedError("Bivariate ECDF is not implemented")
+
+    def _eval_univariate(self, x, weights):
+        """Inner function for ECDF of one variable."""
+        sorter = x.argsort()
+        x = x[sorter]
+        weights = weights[sorter]
+        y = weights.cumsum()
+
+        if self.stat == "proportion":
+            y = y / y.max()
+
+        x = np.r_[-np.inf, x]
+        y = np.r_[0, y]
+
+        if self.complementary:
+            y = y.max() - y
+
+        return y, x
+
+    def __call__(self, x1, x2=None, weights=None):
+        """Return proportion or count of observations below each sorted datapoint."""
+        x1 = np.asarray(x1)
+        if weights is None:
+            weights = np.ones_like(x1)
+        else:
+            weights = np.asarray(weights)
+
+        if x2 is None:
+            return self._eval_univariate(x1, weights)
+        else:
+            return self._eval_bivariate(x1, x2, weights)
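
The univariate path of the `ECDF` class above can be tried outside seaborn. The sketch below restates the same steps (sort, cumulative sum of weights, optional normalization, a leading point at negative infinity, optional complement) as a standalone NumPy function. The name `ecdf_sketch` is hypothetical, and the float casts are an assumption added here so that integer input behaves predictably. It is an illustration of the technique, not the library's implementation.

```python
import numpy as np


def ecdf_sketch(x, weights=None, stat="proportion", complementary=False):
    """Standalone sketch of a weighted univariate ECDF.

    Hypothetical helper mirroring the steps in the diff; not seaborn API.
    """
    x = np.asarray(x, dtype=float)
    if weights is None:
        weights = np.ones_like(x)
    else:
        weights = np.asarray(weights, dtype=float)

    # Sort the observations and carry their weights along.
    order = x.argsort()
    x = x[order]
    weights = weights[order]

    # Running total of weight at or below each sorted observation.
    y = weights.cumsum()
    if stat == "proportion":
        y = y / y.max()

    # Prepend a point at -inf so the step function starts from zero.
    x = np.r_[-np.inf, x]
    y = np.r_[0, y]

    if complementary:
        y = y.max() - y

    # Same return order as the class in the diff: statistic first, then x.
    return y, x


y, x = ecdf_sketch([3, 1, 2, 2])
y_count, _ = ecdf_sketch([3, 1, 2, 2], stat="count", complementary=True)
```

With the four observations above, `y` steps from 0 to 1 in increments of 0.25, and `y_count` steps down from 4 to 0.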