<a href="https://colab.research.google.com/github/maswadkar/python/blob/master/pandas_003_Essential_basic_functionality.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
import numpy as np
import pandas as pd
pd.__version__

# Essential basic functionality

<p>Here we discuss much of the essential functionality common to the pandas data
structures. To begin, let’s create some example objects like we did in
the <a class="reference internal" href="10min.html#min"><span class="std std-ref">10 minutes to pandas</span></a> section:</p>

In [None]:
index = pd.date_range("1/1/2000", periods=8)
s = pd.Series(np.random.randn(5), index=["a", "b", "c", "d", "e"])
df = pd.DataFrame(np.random.randn(8, 3), index=index, columns=["A", "B", "C"])

## Head and tail

<p>To view a small sample of a Series or DataFrame object, use the
<a class="reference internal" href="../reference/api/pandas.DataFrame.head.html#pandas.DataFrame.head" title="pandas.DataFrame.head"><code class="xref py py-meth docutils literal notranslate"><span class="pre">head()</span></code></a> and <a class="reference internal" href="../reference/api/pandas.DataFrame.tail.html#pandas.DataFrame.tail" title="pandas.DataFrame.tail"><code class="xref py py-meth docutils literal notranslate"><span class="pre">tail()</span></code></a> methods. The default number
of elements to display is five, but you may pass a custom number.</p>

In [None]:
long_series = pd.Series(np.random.randn(1000))

In [None]:
long_series.head()

In [None]:
long_series.tail(3)

## Attributes and underlying data

<p>pandas objects have a number of attributes that give you access to their metadata:</p>
<ul class="simple">
<li><p><strong>shape</strong>: gives the axis dimensions of the object, consistent with ndarray</p></li>
<li><dl class="simple">
<dt>Axis labels</dt><dd><ul>
<li><p><strong>Series</strong>: <em>index</em> (only axis)</p></li>
<li><p><strong>DataFrame</strong>: <em>index</em> (rows) and <em>columns</em></p></li>
</ul>
</dd>
</dl>
</li>
</ul>

<p>Note, <strong>these attributes can be safely assigned to</strong>!</p>

In [None]:
df[:2]

In [None]:
df.columns = [x.lower() for x in df.columns]

In [None]:
df
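
<p>As a minimal sketch of the attributes listed above, the following uses a small hypothetical frame (<code>demo</code>, not one of the objects created earlier) to show <code>shape</code>, <code>index</code>, and <code>columns</code>, and that assigning to the axis labels relabels the object:</p>

```python
import numpy as np
import pandas as pd

# Hypothetical frame used only for this illustration.
demo = pd.DataFrame(np.zeros((4, 2)), columns=["X", "Y"])

print(demo.shape)    # (number of rows, number of columns), as for an ndarray
print(demo.index)    # row labels
print(demo.columns)  # column labels

# Axis labels can be assigned to directly; the new labels must match
# the length of the axis they replace.
demo.columns = ["x", "y"]
demo.index = list("abcd")
```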

<p>pandas objects (<a class="reference internal" href="../reference/api/pandas.Index.html#pandas.Index" title="pandas.Index"><code class="xref py py-class docutils literal notranslate"><span class="pre">Index</span></code></a>, <a class="reference internal" href="../reference/api/pandas.Series.html#pandas.Series" title="pandas.Series"><code class="xref py py-class docutils literal notranslate"><span class="pre">Series</span></code></a>, <a class="reference internal" href="../reference/api/pandas.DataFrame.html#pandas.DataFrame" title="pandas.DataFrame"><code class="xref py py-class docutils literal notranslate"><span class="pre">DataFrame</span></code></a>) can be
thought of as containers for arrays, which hold the actual data and do the
actual computation. For many types, the underlying array is a
<a class="reference external" href="https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray" title="(in NumPy v1.23)"><code class="xref py py-class docutils literal notranslate"><span class="pre">numpy.ndarray</span></code></a>. However, pandas and 3rd party libraries may <em>extend</em>
NumPy’s type system to add support for custom arrays
(see <a class="reference internal" href="#basics-dtypes"><span class="std std-ref">dtypes</span></a>).</p>

<p>To get the actual data inside a <a class="reference internal" href="../reference/api/pandas.Index.html#pandas.Index" title="pandas.Index"><code class="xref py py-class docutils literal notranslate"><span class="pre">Index</span></code></a> or <a class="reference internal" href="../reference/api/pandas.Series.html#pandas.Series" title="pandas.Series"><code class="xref py py-class docutils literal notranslate"><span class="pre">Series</span></code></a>, use
the <code class="docutils literal notranslate"><span class="pre">.array</span></code> property:</p>

In [None]:
s.array

In [None]:
s.index.array

<p><a class="reference internal" href="../reference/api/pandas.Series.array.html#pandas.Series.array" title="pandas.Series.array"><code class="xref py py-attr docutils literal notranslate"><span class="pre">array</span></code></a> will always be an <a class="reference internal" href="../reference/api/pandas.api.extensions.ExtensionArray.html#pandas.api.extensions.ExtensionArray" title="pandas.api.extensions.ExtensionArray"><code class="xref py py-class docutils literal notranslate"><span class="pre">ExtensionArray</span></code></a>.
The exact details of what an <a class="reference internal" href="../reference/api/pandas.api.extensions.ExtensionArray.html#pandas.api.extensions.ExtensionArray" title="pandas.api.extensions.ExtensionArray"><code class="xref py py-class docutils literal notranslate"><span class="pre">ExtensionArray</span></code></a> is and why pandas uses them are a bit
beyond the scope of this introduction. See <a class="reference internal" href="#basics-dtypes"><span class="std std-ref">dtypes</span></a> for more.</p>

<p>If you know you need a NumPy array, use <a class="reference internal" href="../reference/api/pandas.Series.to_numpy.html#pandas.Series.to_numpy" title="pandas.Series.to_numpy"><code class="xref py py-meth docutils literal notranslate"><span class="pre">to_numpy()</span></code></a>
or <code class="xref py py-meth docutils literal notranslate"><span class="pre">numpy.asarray()</span></code>.</p>

In [None]:
s.to_numpy()

In [None]:
np.asarray(s)

<p>When the Series or Index is backed by
an <a class="reference internal" href="../reference/api/pandas.api.extensions.ExtensionArray.html#pandas.api.extensions.ExtensionArray" title="pandas.api.extensions.ExtensionArray"><code class="xref py py-class docutils literal notranslate"><span class="pre">ExtensionArray</span></code></a>, <a class="reference internal" href="../reference/api/pandas.Series.to_numpy.html#pandas.Series.to_numpy" title="pandas.Series.to_numpy"><code class="xref py py-meth docutils literal notranslate"><span class="pre">to_numpy()</span></code></a>
may involve copying data and coercing values. See <a class="reference internal" href="#basics-dtypes"><span class="std std-ref">dtypes</span></a> for more.</p>
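
<p>To make the coercion concrete, here is a small sketch using a nullable-integer Series, which is backed by an extension array. The Series name <code>nullable</code> is hypothetical, and the <code>dtype</code> and <code>na_value</code> arguments are passed explicitly so the result does not depend on version-specific defaults:</p>

```python
import numpy as np
import pandas as pd

# A nullable-integer Series is backed by an ExtensionArray, not an ndarray.
nullable = pd.Series([1, 2, None], dtype="Int64")

# Asking for float64 with NaN as the missing-value marker builds a new
# NumPy array: the integers are coerced to floats and <NA> becomes NaN.
as_float = nullable.to_numpy(dtype="float64", na_value=np.nan)

print(type(nullable.array).__name__)
print(as_float.dtype)
```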

<p><a class="reference internal" href="../reference/api/pandas.Series.to_numpy.html#pandas.Series.to_numpy" title="pandas.Series.to_numpy"><code class="xref py py-meth docutils literal notranslate"><span class="pre">to_numpy()</span></code></a> gives some control over the <code class="docutils literal notranslate"><span class="pre">dtype</span></code> of the
resulting <a class="reference external" href="https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray" title="(in NumPy v1.23)"><code class="xref py py-class docutils literal notranslate"><span class="pre">numpy.ndarray</span></code></a>. For example, consider datetimes with timezones.
NumPy doesn’t have a dtype to represent timezone-aware datetimes, so there
are two possibly useful representations:</p>


<ol class="arabic simple">
<li><p>An object-dtype <a class="reference external" href="https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray" title="(in NumPy v1.23)"><code class="xref py py-class docutils literal notranslate"><span class="pre">numpy.ndarray</span></code></a> with <a class="reference internal" href="../reference/api/pandas.Timestamp.html#pandas.Timestamp" title="pandas.Timestamp"><code class="xref py py-class docutils literal notranslate"><span class="pre">Timestamp</span></code></a> objects, each
with the correct <code class="docutils literal notranslate"><span class="pre">tz</span></code></p></li>
<li><p>A <code class="docutils literal notranslate"><span class="pre">datetime64[ns]</span></code> -dtype <a class="reference external" href="https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray" title="(in NumPy v1.23)"><code class="xref py py-class docutils literal notranslate"><span class="pre">numpy.ndarray</span></code></a>, where the values have
been converted to UTC and the timezone discarded</p></li>
</ol>


<p>Timezones may be preserved with <code class="docutils literal notranslate"><span class="pre">dtype=object</span></code>:</p>

In [None]:
ser = pd.Series(pd.date_range("2000", periods=2, tz="CET"))

In [None]:
ser.to_numpy(dtype=object)

<p>Or discarded with <code class="docutils literal notranslate"><span class="pre">dtype='datetime64[ns]'</span></code>:</p>

In [None]:
ser.to_numpy(dtype="datetime64[ns]")

<p>Getting the “raw data” inside a <a class="reference internal" href="../reference/api/pandas.DataFrame.html#pandas.DataFrame" title="pandas.DataFrame"><code class="xref py py-class docutils literal notranslate"><span class="pre">DataFrame</span></code></a> is possibly a bit more
complex. When your <code class="docutils literal notranslate"><span class="pre">DataFrame</span></code> only has a single data type for all the
columns, <a class="reference internal" href="../reference/api/pandas.DataFrame.to_numpy.html#pandas.DataFrame.to_numpy" title="pandas.DataFrame.to_numpy"><code class="xref py py-meth docutils literal notranslate"><span class="pre">DataFrame.to_numpy()</span></code></a> will return the underlying data:</p>

In [None]:
df.to_numpy()

<p>If a DataFrame contains homogeneously-typed data, the ndarray can
actually be modified in-place, and the changes will be reflected in the data
structure. For heterogeneous data (e.g. when the DataFrame’s columns do not
all share the same dtype), this will not be the case. The values attribute itself,
unlike the axis labels, cannot be assigned to.</p>

<div class="admonition note">
<p class="admonition-title">Note</p>
<p>When working with heterogeneous data, the dtype of the resulting ndarray
will be chosen to accommodate all of the data involved. For example, if
strings are involved, the result will be of object dtype. If there are only
floats and integers, the resulting array will be of float dtype.</p>
</div>
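
<p>The rule in the note can be checked with two small frames. Both frames (<code>numeric</code> and <code>mixed</code>) are hypothetical and exist only for this sketch:</p>

```python
import pandas as pd

# Hypothetical frames used only to illustrate how a common dtype is chosen.
numeric = pd.DataFrame({"i": [1, 2], "f": [0.5, 1.5]})
mixed = pd.DataFrame({"i": [1, 2], "s": ["a", "b"]})

numeric_dtype = numeric.to_numpy().dtype  # integers and floats share a float dtype
mixed_dtype = mixed.to_numpy().dtype      # strings force the object dtype

print(numeric_dtype)
print(mixed_dtype)
```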

<p>In the past, pandas recommended <a class="reference internal" href="../reference/api/pandas.Series.values.html#pandas.Series.values" title="pandas.Series.values"><code class="xref py py-attr docutils literal notranslate"><span class="pre">Series.values</span></code></a> or <a class="reference internal" href="../reference/api/pandas.DataFrame.values.html#pandas.DataFrame.values" title="pandas.DataFrame.values"><code class="xref py py-attr docutils literal notranslate"><span class="pre">DataFrame.values</span></code></a>
for extracting the data from a Series or DataFrame. You’ll still find references
to these in old code bases and online. Going forward, we recommend avoiding
<code class="docutils literal notranslate"><span class="pre">.values</span></code> and using <code class="docutils literal notranslate"><span class="pre">.array</span></code> or <code class="docutils literal notranslate"><span class="pre">.to_numpy()</span></code>. <code class="docutils literal notranslate"><span class="pre">.values</span></code> has the following
drawbacks:</p>

<ol class="arabic simple">
<li><p>When your Series contains an <a class="reference internal" href="../development/extending.html#extending-extension-types"><span class="std std-ref">extension type</span></a>, it’s
unclear whether <a class="reference internal" href="../reference/api/pandas.Series.values.html#pandas.Series.values" title="pandas.Series.values"><code class="xref py py-attr docutils literal notranslate"><span class="pre">Series.values</span></code></a> returns a NumPy array or the extension array.
<a class="reference internal" href="../reference/api/pandas.Series.array.html#pandas.Series.array" title="pandas.Series.array"><code class="xref py py-attr docutils literal notranslate"><span class="pre">Series.array</span></code></a> will always return an <a class="reference internal" href="../reference/api/pandas.api.extensions.ExtensionArray.html#pandas.api.extensions.ExtensionArray" title="pandas.api.extensions.ExtensionArray"><code class="xref py py-class docutils literal notranslate"><span class="pre">ExtensionArray</span></code></a>, and will never
copy data. <a class="reference internal" href="../reference/api/pandas.Series.to_numpy.html#pandas.Series.to_numpy" title="pandas.Series.to_numpy"><code class="xref py py-meth docutils literal notranslate"><span class="pre">Series.to_numpy()</span></code></a> will always return a NumPy array,
potentially at the cost of copying / coercing values.</p></li>
<li><p>When your DataFrame contains a mixture of data types, <a class="reference internal" href="../reference/api/pandas.DataFrame.values.html#pandas.DataFrame.values" title="pandas.DataFrame.values"><code class="xref py py-attr docutils literal notranslate"><span class="pre">DataFrame.values</span></code></a> may
involve copying data and coercing values to a common dtype, a relatively expensive
operation. <a class="reference internal" href="../reference/api/pandas.DataFrame.to_numpy.html#pandas.DataFrame.to_numpy" title="pandas.DataFrame.to_numpy"><code class="xref py py-meth docutils literal notranslate"><span class="pre">DataFrame.to_numpy()</span></code></a>, being a method, makes it clearer that the
returned NumPy array may not be a view on the same data in the DataFrame.</p></li>
</ol>
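
<p>The difference between the two recommended accessors can be seen on a Series with an extension type. This sketch assumes a categorical Series named <code>cat</code>, which is not part of the examples above:</p>

```python
import numpy as np
import pandas as pd

# A categorical Series stores its data in an extension array.
cat = pd.Series(["a", "b", "a"], dtype="category")

backing = cat.array         # the extension array that backs the Series
converted = cat.to_numpy()  # a NumPy array holding the category values

print(type(backing).__name__)
print(type(converted).__name__)
```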

## Accelerated operations

<p>pandas has support for accelerating certain types of binary numerical and boolean operations using
the <code class="docutils literal notranslate"><span class="pre">numexpr</span></code> and <code class="docutils literal notranslate"><span class="pre">bottleneck</span></code> libraries.</p>
<p>These libraries are especially useful when dealing with large data sets, and provide large
speedups. <code class="docutils literal notranslate"><span class="pre">numexpr</span></code> uses smart chunking, caching, and multiple cores. <code class="docutils literal notranslate"><span class="pre">bottleneck</span></code> is
a set of specialized cython routines that are especially fast when dealing with arrays that have
<code class="docutils literal notranslate"><span class="pre">nans</span></code>.</p>
<p>Here is a sample timing comparison (using <code class="docutils literal notranslate"><span class="pre">DataFrames</span></code> with 100 columns and 100,000 rows):</p>
<table class="colwidths-given table">
<colgroup>
<col style="width: 25%">
<col style="width: 25%">
<col style="width: 25%">
<col style="width: 25%">
</colgroup>
<thead>
<tr class="row-odd"><th class="head"><p>Operation</p></th>
<th class="head"><p>0.11.0 (ms)</p></th>
<th class="head"><p>Prior Version (ms)</p></th>
<th class="head"><p>Ratio to Prior</p></th>
</tr>
</thead>
<tbody>
<tr class="row-even"><td><p><code class="docutils literal notranslate"><span class="pre">df1</span> <span class="pre">&gt;</span> <span class="pre">df2</span></code></p></td>
<td><p>13.32</p></td>
<td><p>125.35</p></td>
<td><p>0.1063</p></td>
</tr>
<tr class="row-odd"><td><p><code class="docutils literal notranslate"><span class="pre">df1</span> <span class="pre">*</span> <span class="pre">df2</span></code></p></td>
<td><p>21.71</p></td>
<td><p>36.63</p></td>
<td><p>0.5928</p></td>
</tr>
<tr class="row-even"><td><p><code class="docutils literal notranslate"><span class="pre">df1</span> <span class="pre">+</span> <span class="pre">df2</span></code></p></td>
<td><p>22.04</p></td>
<td><p>36.50</p></td>
<td><p>0.6039</p></td>
</tr>
</tbody>
</table>
<p>You are highly encouraged to install both libraries. See the section
<a class="reference internal" href="../getting_started/install.html#install-recommended-dependencies"><span class="std std-ref">Recommended Dependencies</span></a> for more installation info.</p>
<p>Both are enabled by default. You can control this by setting the following options:</p>

In [None]:
pd.set_option("compute.use_bottleneck", False)
pd.set_option("compute.use_numexpr", False)
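
<p>If the setting should apply only to a limited block of code, one possible approach is <code>pd.option_context</code>, which restores the previous value on exit. A minimal sketch, with hypothetical variable names:</p>

```python
import pandas as pd

# Remember the current setting so the effect of the context is visible.
before = pd.get_option("compute.use_numexpr")

# option_context changes the option only inside the with-block.
with pd.option_context("compute.use_numexpr", False):
    inside = pd.get_option("compute.use_numexpr")

# After the block, the option is back to whatever it was before.
after = pd.get_option("compute.use_numexpr")
```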

## Flexible binary operations

<p>With binary operations between pandas data structures, there are two key points
of interest:</p>

<ul class="simple">
<li><p>Broadcasting behavior between higher- (e.g. DataFrame) and
lower-dimensional (e.g. Series) objects.</p></li>
<li><p>Missing data in computations.</p></li>
</ul>

<p>We will demonstrate how to manage these issues independently, though they can
be handled simultaneously.</p>

### Matching / broadcasting behavior

<p>DataFrame has the methods <a class="reference internal" href="../reference/api/pandas.DataFrame.add.html#pandas.DataFrame.add" title="pandas.DataFrame.add"><code class="xref py py-meth docutils literal notranslate"><span class="pre">add()</span></code></a>, <a class="reference internal" href="../reference/api/pandas.DataFrame.sub.html#pandas.DataFrame.sub" title="pandas.DataFrame.sub"><code class="xref py py-meth docutils literal notranslate"><span class="pre">sub()</span></code></a>,
<a class="reference internal" href="../reference/api/pandas.DataFrame.mul.html#pandas.DataFrame.mul" title="pandas.DataFrame.mul"><code class="xref py py-meth docutils literal notranslate"><span class="pre">mul()</span></code></a>, <a class="reference internal" href="../reference/api/pandas.DataFrame.div.html#pandas.DataFrame.div" title="pandas.DataFrame.div"><code class="xref py py-meth docutils literal notranslate"><span class="pre">div()</span></code></a> and related functions
<a class="reference internal" href="../reference/api/pandas.DataFrame.radd.html#pandas.DataFrame.radd" title="pandas.DataFrame.radd"><code class="xref py py-meth docutils literal notranslate"><span class="pre">radd()</span></code></a>, <a class="reference internal" href="../reference/api/pandas.DataFrame.rsub.html#pandas.DataFrame.rsub" title="pandas.DataFrame.rsub"><code class="xref py py-meth docutils literal notranslate"><span class="pre">rsub()</span></code></a>, …
for carrying out binary operations. For broadcasting behavior,
Series input is of primary interest. Using these functions, you can
match on either the <em>index</em> or the <em>columns</em> via the <strong>axis</strong> keyword:</p>

In [None]:
df = pd.DataFrame({
    "one": pd.Series(np.random.randn(3), index=["a", "b", "c"]),
    "two": pd.Series(np.random.randn(4), index=["a", "b", "c", "d"]),
    "three": pd.Series(np.random.randn(3), index=["b", "c", "d"]),
})

In [None]:
df

In [None]:
row = df.iloc[1]

In [None]:
column = df["two"]

In [None]:
df.sub(row, axis="columns")

In [None]:
df.sub(row, axis=1)

In [None]:
df.sub(column, axis="index")

In [None]:
df.sub(column, axis=0)

<p>Furthermore you can align a level of a MultiIndexed DataFrame with a Series.</p>

In [None]:
dfmi = df.copy()

In [None]:
dfmi.index = pd.MultiIndex.from_tuples([(1, "a"), (1, "b"), (1, "c"), (2, "a")], names=["first", "second"])

In [None]:
dfmi

In [None]:
dfmi.sub(column, axis=0, level="second")

<p>Series and Index also support the <a class="reference external" href="https://docs.python.org/3/library/functions.html#divmod" title="(in Python v3.10)"><code class="xref py py-func docutils literal notranslate"><span class="pre">divmod()</span></code></a> builtin. This function performs
floor division and the modulo operation at the same time, returning a two-tuple
of objects of the same type as the left-hand side. For example:</p>

In [None]:
s = pd.Series(np.arange(10))

In [None]:
s

In [None]:
div, rem = divmod(s, 3)

In [None]:
div

In [None]:
rem

In [None]:
idx = pd.Index(np.arange(10))

In [None]:
idx

In [None]:
div, rem = divmod(idx, 3)

In [None]:
div

In [None]:
rem

<p>We can also do elementwise <a class="reference external" href="https://docs.python.org/3/library/functions.html#divmod" title="(in Python v3.10)"><code class="xref py py-func docutils literal notranslate"><span class="pre">divmod()</span></code></a>:</p>

In [None]:
div, rem = divmod(s, [2, 2, 3, 3, 4, 4, 5, 5, 6, 6])

In [None]:
div

In [None]:
rem
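
<p>As a quick sanity check on what <code>divmod()</code> returns, the sketch below confirms that the two parts match the <code>//</code> and <code>%</code> operators and together reconstruct the input. The names <code>values</code>, <code>quotient</code>, and <code>remainder</code> are chosen here so that the variables defined above are left untouched:</p>

```python
import numpy as np
import pandas as pd

values = pd.Series(np.arange(10))
quotient, remainder = divmod(values, 3)

# divmod gives the same results as applying // and % separately ...
same_as_operators = quotient.equals(values // 3) and remainder.equals(values % 3)

# ... and the two parts together reconstruct the original values.
reconstructed = quotient * 3 + remainder
```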

### Missing data / operations with fill values

<p>In Series and DataFrame, the arithmetic functions have the option of inputting
a <em>fill_value</em>, namely a value to substitute when at most one of the values at
a location is missing. For example, when adding two DataFrame objects, you may
wish to treat NaN as 0 unless both DataFrames are missing that value, in which
case the result will be NaN (you can later replace NaN with some other value
using <code class="docutils literal notranslate"><span class="pre">fillna</span></code> if you wish).</p>

In [None]:
df

In [None]:
df2 = df.fillna(1)

In [None]:
df2

In [None]:
df + df2

In [None]:
df.add(df2, fill_value=0)
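
<p>Because <code>df</code> above holds random values, the effect of <code>fill_value</code> is easier to see on fixed data. A minimal sketch with two hypothetical Series, <code>left</code> and <code>right</code>:</p>

```python
import numpy as np
import pandas as pd

left = pd.Series([1.0, np.nan, np.nan])
right = pd.Series([10.0, 20.0, np.nan])

# Plain addition yields NaN wherever either operand is missing.
plain = left + right

# With fill_value=0, a value missing on one side only is treated as 0;
# the result is NaN only where both sides are missing.
filled = left.add(right, fill_value=0)
```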

### Flexible comparisons

<p>Series and DataFrame have the binary comparison methods <code class="docutils literal notranslate"><span class="pre">eq</span></code>, <code class="docutils literal notranslate"><span class="pre">ne</span></code>, <code class="docutils literal notranslate"><span class="pre">lt</span></code>, <code class="docutils literal notranslate"><span class="pre">gt</span></code>,
<code class="docutils literal notranslate"><span class="pre">le</span></code>, and <code class="docutils literal notranslate"><span class="pre">ge</span></code> whose behavior is analogous to the binary
arithmetic operations described above:</p>

In [None]:
df.gt(df2)

In [None]:
df2.ne(df)

<p>These operations produce a pandas object of the same type as the left-hand-side
input that is of dtype <code class="docutils literal notranslate"><span class="pre">bool</span></code>. These <code class="docutils literal notranslate"><span class="pre">boolean</span></code> objects can be used in
indexing operations, see the section on <a class="reference internal" href="indexing.html#indexing-boolean"><span class="std std-ref">Boolean indexing</span></a>.</p>
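
<p>As a brief illustration of using such a boolean result for indexing, assuming a small hypothetical Series named <code>scores</code>:</p>

```python
import pandas as pd

scores = pd.Series([3, 8, 5], index=["a", "b", "c"])

mask = scores.gt(4)      # boolean Series with the same index as scores
selected = scores[mask]  # keeps only the entries where the mask is True
```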

### Boolean reductions

<p>You can apply the reductions: <a class="reference internal" href="../reference/api/pandas.DataFrame.empty.html#pandas.DataFrame.empty" title="pandas.DataFrame.empty"><code class="xref py py-attr docutils literal notranslate"><span class="pre">empty</span></code></a>, <a class="reference internal" href="../reference/api/pandas.DataFrame.any.html#pandas.DataFrame.any" title="pandas.DataFrame.any"><code class="xref py py-meth docutils literal notranslate"><span class="pre">any()</span></code></a>,
<a class="reference internal" href="../reference/api/pandas.DataFrame.all.html#pandas.DataFrame.all" title="pandas.DataFrame.all"><code class="xref py py-meth docutils literal notranslate"><span class="pre">all()</span></code></a>, and <a class="reference internal" href="../reference/api/pandas.DataFrame.bool.html#pandas.DataFrame.bool" title="pandas.DataFrame.bool"><code class="xref py py-meth docutils literal notranslate"><span class="pre">bool()</span></code></a> to provide a
way to summarize a boolean result.</p>

In [None]:
(df > 0).all()

In [None]:
(df > 0).any()

<p>You can reduce to a final boolean value.</p>

In [None]:
(df > 0).any().any()

<p>You can test if a pandas object is empty, via the <a class="reference internal" href="../reference/api/pandas.DataFrame.empty.html#pandas.DataFrame.empty" title="pandas.DataFrame.empty"><code class="xref py py-attr docutils literal notranslate"><span class="pre">empty</span></code></a> property.</p>

In [None]:
df.empty

In [None]:
pd.DataFrame(columns=list("ABC")).empty

<p>To evaluate single-element pandas objects in a boolean context, use the method
<a class="reference internal" href="../reference/api/pandas.DataFrame.bool.html#pandas.DataFrame.bool" title="pandas.DataFrame.bool"><code class="xref py py-meth docutils literal notranslate"><span class="pre">bool()</span></code></a> (deprecated as of pandas 2.1):</p>

In [None]:
pd.Series([True]).bool()

In [None]:
pd.Series([False]).bool()

In [None]:
pd.DataFrame([True]).bool()

In [None]:
pd.DataFrame([False]).bool()

<p>See <a class="reference internal" href="gotchas.html#gotchas-truth"><span class="std std-ref">gotchas</span></a> for a more detailed discussion.</p>

### Comparing if objects are equivalent

<p>Often you may find that there is more than one way to compute the same
result. As a simple example, consider <code class="docutils literal notranslate"><span class="pre">df</span> <span class="pre">+</span> <span class="pre">df</span></code> and <code class="docutils literal notranslate"><span class="pre">df</span> <span class="pre">*</span> <span class="pre">2</span></code>. To test
that these two computations produce the same result, given the tools
shown above, you might imagine using <code class="docutils literal notranslate"><span class="pre">(df</span> <span class="pre">+</span> <span class="pre">df</span> <span class="pre">==</span> <span class="pre">df</span> <span class="pre">*</span> <span class="pre">2).all()</span></code>. But in
fact, the result is not True for every column:</p>

In [None]:
df + df == df * 2

In [None]:
(df + df == df * 2).all()

<p>Notice that the boolean DataFrame <code class="docutils literal notranslate"><span class="pre">df</span> <span class="pre">+</span> <span class="pre">df</span> <span class="pre">==</span> <span class="pre">df</span> <span class="pre">*</span> <span class="pre">2</span></code> contains some False values!
This is because NaNs do not compare as equals:</p>

In [None]:
np.nan == np.nan

<p>So, NDFrames (such as Series and DataFrames)
have an <a class="reference internal" href="../reference/api/pandas.DataFrame.equals.html#pandas.DataFrame.equals" title="pandas.DataFrame.equals"><code class="xref py py-meth docutils literal notranslate"><span class="pre">equals()</span></code></a> method for testing equality, with NaNs in
corresponding locations treated as equal.</p>

In [None]:
(df + df).equals(df * 2)

<p>Note that the Series or DataFrame index needs to be in the same order for
equality to be True:</p>

In [None]:
df1 = pd.DataFrame({"col": ["foo", 0, np.nan]})

In [None]:
df2 = pd.DataFrame({"col": [np.nan, 0, "foo"]}, index=[2, 1, 0])

In [None]:
df1.equals(df2)

In [None]:
df1.equals(df2.sort_index())

### Comparing array-like objects

<p>You can conveniently perform element-wise comparisons when comparing a pandas
data structure with a scalar value:</p>

In [None]:
pd.Series(["foo", "bar", "baz"]) == "foo"

In [None]:
pd.Index(["foo", "bar", "baz"]) == "foo"

<p>pandas also handles element-wise comparisons between different array-like
objects of the same length:</p>

In [None]:
pd.Series(["foo", "bar", "baz"]) == pd.Index(["foo", "bar", "qux"])

In [None]:
pd.Series(["foo", "bar", "baz"]) == np.array(["foo", "bar", "qux"])

<p>Trying to compare <code class="docutils literal notranslate"><span class="pre">Index</span></code> or <code class="docutils literal notranslate"><span class="pre">Series</span></code> objects of different lengths will
raise a ValueError:</p>

In [None]:
try:
    pd.Series(['foo', 'bar', 'baz']) == pd.Series(['foo', 'bar'])
except Exception as e:
    print(e)

In [None]:
try:
    pd.Series(['foo', 'bar', 'baz']) == pd.Series(['foo'])
except Exception as e:
    print(e)

<p>Note that this is different from the NumPy behavior where a comparison can
be broadcast:</p>

In [None]:
np.array([1, 2, 3]) == np.array([2])

<p>or, when broadcasting cannot be done, it can return False (newer NumPy versions raise an error instead):</p>

In [None]:
np.array([1, 2, 3]) == np.array([1, 2])

### Combining overlapping data sets

<p>A problem occasionally arising is the combination of two similar data sets
where values in one are preferred over the other. An example would be two data
series representing a particular economic indicator where one is considered to
be of “higher quality”. However, the lower quality series might extend further
back in history or have more complete data coverage. As such, we would like to
combine two DataFrame objects where missing values in one DataFrame are
conditionally filled with like-labeled values from the other DataFrame. The
function implementing this operation is <a class="reference internal" href="../reference/api/pandas.DataFrame.combine_first.html#pandas.DataFrame.combine_first" title="pandas.DataFrame.combine_first"><code class="xref py py-meth docutils literal notranslate"><span class="pre">combine_first()</span></code></a>,
which we illustrate:</p>

In [None]:
df1 = pd.DataFrame({"A": [1.0, np.nan, 3.0, 5.0, np.nan], "B": [np.nan, 2.0, 3.0, np.nan, 6.0]})

In [None]:
df2 = pd.DataFrame(
    {
        "A": [5.0, 2.0, 4.0, np.nan, 3.0, 7.0],
        "B": [np.nan, np.nan, 3.0, 4.0, 6.0, 8.0],
    }
)

In [None]:
df1

In [None]:
df2

In [None]:
df1.combine_first(df2)

### General DataFrame combine

<p>The <a class="reference internal" href="../reference/api/pandas.DataFrame.combine_first.html#pandas.DataFrame.combine_first" title="pandas.DataFrame.combine_first"><code class="xref py py-meth docutils literal notranslate"><span class="pre">combine_first()</span></code></a> method above calls the more general
<a class="reference internal" href="../reference/api/pandas.DataFrame.combine.html#pandas.DataFrame.combine" title="pandas.DataFrame.combine"><code class="xref py py-meth docutils literal notranslate"><span class="pre">DataFrame.combine()</span></code></a>. This method takes another DataFrame
and a combiner function, aligns the two DataFrames, and then passes pairs of
Series (i.e., columns whose names are the same) to the combiner function.</p>

<p>So, for instance, to reproduce <a class="reference internal" href="../reference/api/pandas.DataFrame.combine_first.html#pandas.DataFrame.combine_first" title="pandas.DataFrame.combine_first"><code class="xref py py-meth docutils literal notranslate"><span class="pre">combine_first()</span></code></a> as above:</p>

In [None]:
def combiner(x, y):
    return np.where(pd.isna(x), y, x)

In [None]:
df1.combine(df2, combiner)
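
<p>The combiner does not have to work element by element. Since it receives whole columns, it can also choose between them. The following sketch, with hypothetical frames <code>first</code> and <code>second</code>, keeps whichever like-named column has the smaller sum:</p>

```python
import pandas as pd

first = pd.DataFrame({"A": [0, 0], "B": [4, 4]})
second = pd.DataFrame({"A": [1, 1], "B": [3, 3]})

def take_smaller(col_a, col_b):
    # Called once per pair of like-named columns; returns the chosen column.
    return col_a if col_a.sum() < col_b.sum() else col_b

# Column "A" comes from `first` (sum 0 < 2), column "B" from `second` (sum 6 < 8).
smaller = first.combine(second, take_smaller)
```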

## Descriptive statistics

<p>There exists a large number of methods for computing descriptive statistics and
other related operations on <a class="reference internal" href="../reference/series.html#api-series-stats"><span class="std std-ref">Series</span></a> and <a class="reference internal" href="../reference/frame.html#api-dataframe-stats"><span class="std std-ref">DataFrame</span></a>. Most of these
are aggregations (hence producing a lower-dimensional result) like
<a class="reference internal" href="../reference/api/pandas.DataFrame.sum.html#pandas.DataFrame.sum" title="pandas.DataFrame.sum"><code class="xref py py-meth docutils literal notranslate"><span class="pre">sum()</span></code></a>, <a class="reference internal" href="../reference/api/pandas.DataFrame.mean.html#pandas.DataFrame.mean" title="pandas.DataFrame.mean"><code class="xref py py-meth docutils literal notranslate"><span class="pre">mean()</span></code></a>, and <a class="reference internal" href="../reference/api/pandas.DataFrame.quantile.html#pandas.DataFrame.quantile" title="pandas.DataFrame.quantile"><code class="xref py py-meth docutils literal notranslate"><span class="pre">quantile()</span></code></a>,
but some of them, like <a class="reference internal" href="../reference/api/pandas.DataFrame.cumsum.html#pandas.DataFrame.cumsum" title="pandas.DataFrame.cumsum"><code class="xref py py-meth docutils literal notranslate"><span class="pre">cumsum()</span></code></a> and <a class="reference internal" href="../reference/api/pandas.DataFrame.cumprod.html#pandas.DataFrame.cumprod" title="pandas.DataFrame.cumprod"><code class="xref py py-meth docutils literal notranslate"><span class="pre">cumprod()</span></code></a>,
produce an object of the same size. Generally speaking, these methods take an
<strong>axis</strong> argument, just like <em>ndarray.{sum, std, …}</em>, but the axis can be
specified by name or integer:</p>

<ul class="simple">
<li><p><strong>Series</strong>: no axis argument needed</p></li>
<li><p><strong>DataFrame</strong>: “index” (axis=0, default), “columns” (axis=1)</p></li>
</ul>

<p>For example:</p>

In [None]:
df

In [None]:
df.mean(0)

In [None]:
df.mean(1)

<p>All such methods have a <code class="docutils literal notranslate"><span class="pre">skipna</span></code> option signaling whether to exclude missing
data (<code class="docutils literal notranslate"><span class="pre">True</span></code> by default):</p>

In [None]:
df.sum(0, skipna=False)

In [None]:
df.sum(1, skipna=True)
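<p>Because the random <code>df</code> above contains no missing values, both settings give the same sums there. The following minimal sketch uses a small hypothetical frame, <code>demo</code>, with one <code>NaN</code> to make the difference visible:</p>

```python
import numpy as np
import pandas as pd

# Hypothetical frame (not part of the original example) with one missing value in "A".
demo = pd.DataFrame({"A": [1.0, np.nan, 3.0], "B": [4.0, 5.0, 6.0]})

# Default (skipna=True): the NaN is ignored, so "A" sums its two valid values.
sum_skip = demo.sum()

# skipna=False: any NaN in a column propagates to that column's result.
sum_keep = demo.sum(skipna=False)

print(sum_skip)
print(sum_keep)
```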

<p>Combined with the broadcasting / arithmetic behavior, one can describe various
statistical procedures, like standardization (rendering data zero mean and
standard deviation of 1), very concisely:</p>

In [None]:
ts_stand = (df - df.mean()) / df.std()

In [None]:
ts_stand.std()

In [None]:
xs_stand = df.sub(df.mean(1), axis=0).div(df.std(1), axis=0)

In [None]:
xs_stand.std(1)
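<p>As a quick sanity check of the two standardizations, the sketch below rebuilds them on a hypothetical frame <code>demo</code> (seeded so the run is repeatable) and confirms that the means are numerically zero and the standard deviations are one. The row-wise version needs <code>axis=0</code> because plain <code>demo - demo.mean(axis=1)</code> would try to align the per-row Series with the column labels.</p>

```python
import numpy as np
import pandas as pd

# Hypothetical data: 100 rows drawn around mean 10 with standard deviation 3.
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.normal(loc=10, scale=3, size=(100, 3)), columns=["A", "B", "C"])

# Column-wise: subtract each column's mean, divide by each column's std.
col_stand = (demo - demo.mean()) / demo.std()

# Row-wise: axis=0 aligns the per-row statistics with the row index.
row_stand = demo.sub(demo.mean(axis=1), axis=0).div(demo.std(axis=1), axis=0)

print(np.allclose(col_stand.mean(), 0.0), np.allclose(col_stand.std(), 1.0))
print(np.allclose(row_stand.mean(axis=1), 0.0), np.allclose(row_stand.std(axis=1), 1.0))
```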

<p>Note that methods like <a class="reference internal" href="../reference/api/pandas.DataFrame.cumsum.html#pandas.DataFrame.cumsum" title="pandas.DataFrame.cumsum"><code class="xref py py-meth docutils literal notranslate"><span class="pre">cumsum()</span></code></a> and <a class="reference internal" href="../reference/api/pandas.DataFrame.cumprod.html#pandas.DataFrame.cumprod" title="pandas.DataFrame.cumprod"><code class="xref py py-meth docutils literal notranslate"><span class="pre">cumprod()</span></code></a>
preserve the location of <code class="docutils literal notranslate"><span class="pre">NaN</span></code> values. This is somewhat different from
<a class="reference internal" href="../reference/api/pandas.DataFrame.expanding.html#pandas.DataFrame.expanding" title="pandas.DataFrame.expanding"><code class="xref py py-meth docutils literal notranslate"><span class="pre">expanding()</span></code></a> and <a class="reference internal" href="../reference/api/pandas.DataFrame.rolling.html#pandas.DataFrame.rolling" title="pandas.DataFrame.rolling"><code class="xref py py-meth docutils literal notranslate"><span class="pre">rolling()</span></code></a> since <code class="docutils literal notranslate"><span class="pre">NaN</span></code> behavior
is furthermore dictated by a <code class="docutils literal notranslate"><span class="pre">min_periods</span></code> parameter.</p>

In [None]:
df.cumsum()
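<p>The difference in <code>NaN</code> handling is easier to see on a Series that actually contains a missing value. This is a small sketch on a hypothetical Series <code>s_nan</code>:</p>

```python
import numpy as np
import pandas as pd

s_nan = pd.Series([1.0, np.nan, 2.0, 3.0])

# cumsum skips the NaN while accumulating but keeps NaN at that position.
cs = s_nan.cumsum()

# expanding().sum() emits a value wherever at least min_periods
# non-NaN observations have been seen so far (min_periods=1 by default).
ex1 = s_nan.expanding().sum()

# With min_periods=3 the first three positions do not have enough observations.
ex3 = s_nan.expanding(min_periods=3).sum()

print(pd.DataFrame({"cumsum": cs, "expanding": ex1, "expanding_min3": ex3}))
```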

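<p>Here is a quick reference summary table of common functions. In pandas versions before 2.0, each also takes an
optional <code class="docutils literal notranslate"><span class="pre">level</span></code> parameter, which applies only if the object has a
<a class="reference internal" href="advanced.html#advanced-hierarchical"><span class="std std-ref">hierarchical index</span></a>; the parameter was deprecated in 1.3 and later removed in favor of <code class="docutils literal notranslate"><span class="pre">groupby(level=...)</span></code>.</p>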

<table class="colwidths-given table">
<colgroup>
<col style="width: 20%">
<col style="width: 80%">
</colgroup>
<thead>
<tr class="row-odd"><th class="head"><p>Function</p></th>
<th class="head"><p>Description</p></th>
</tr>
</thead>
<tbody>
<tr class="row-even"><td><p><code class="docutils literal notranslate"><span class="pre">count</span></code></p></td>
<td><p>Number of non-NA observations</p></td>
</tr>
<tr class="row-odd"><td><p><code class="docutils literal notranslate"><span class="pre">sum</span></code></p></td>
<td><p>Sum of values</p></td>
</tr>
<tr class="row-even"><td><p><code class="docutils literal notranslate"><span class="pre">mean</span></code></p></td>
<td><p>Mean of values</p></td>
</tr>
<tr class="row-odd"><td><p><code class="docutils literal notranslate"><span class="pre">mad</span></code></p></td>
<td><p>Mean absolute deviation</p></td>
</tr>
<tr class="row-even"><td><p><code class="docutils literal notranslate"><span class="pre">median</span></code></p></td>
<td><p>Arithmetic median of values</p></td>
</tr>
<tr class="row-odd"><td><p><code class="docutils literal notranslate"><span class="pre">min</span></code></p></td>
<td><p>Minimum</p></td>
</tr>
<tr class="row-even"><td><p><code class="docutils literal notranslate"><span class="pre">max</span></code></p></td>
<td><p>Maximum</p></td>
</tr>
<tr class="row-odd"><td><p><code class="docutils literal notranslate"><span class="pre">mode</span></code></p></td>
<td><p>Mode</p></td>
</tr>
<tr class="row-even"><td><p><code class="docutils literal notranslate"><span class="pre">abs</span></code></p></td>
<td><p>Absolute Value</p></td>
</tr>
<tr class="row-odd"><td><p><code class="docutils literal notranslate"><span class="pre">prod</span></code></p></td>
<td><p>Product of values</p></td>
</tr>
<tr class="row-even"><td><p><code class="docutils literal notranslate"><span class="pre">std</span></code></p></td>
<td><p>Bessel-corrected sample standard deviation</p></td>
</tr>
<tr class="row-odd"><td><p><code class="docutils literal notranslate"><span class="pre">var</span></code></p></td>
<td><p>Unbiased variance</p></td>
</tr>
<tr class="row-even"><td><p><code class="docutils literal notranslate"><span class="pre">sem</span></code></p></td>
<td><p>Standard error of the mean</p></td>
</tr>
<tr class="row-odd"><td><p><code class="docutils literal notranslate"><span class="pre">skew</span></code></p></td>
<td><p>Sample skewness (3rd moment)</p></td>
</tr>
<tr class="row-even"><td><p><code class="docutils literal notranslate"><span class="pre">kurt</span></code></p></td>
<td><p>Sample kurtosis (4th moment)</p></td>
</tr>
<tr class="row-odd"><td><p><code class="docutils literal notranslate"><span class="pre">quantile</span></code></p></td>
<td><p>Sample quantile (value at %)</p></td>
</tr>
<tr class="row-even"><td><p><code class="docutils literal notranslate"><span class="pre">cumsum</span></code></p></td>
<td><p>Cumulative sum</p></td>
</tr>
<tr class="row-odd"><td><p><code class="docutils literal notranslate"><span class="pre">cumprod</span></code></p></td>
<td><p>Cumulative product</p></td>
</tr>
<tr class="row-even"><td><p><code class="docutils literal notranslate"><span class="pre">cummax</span></code></p></td>
<td><p>Cumulative maximum</p></td>
</tr>
<tr class="row-odd"><td><p><code class="docutils literal notranslate"><span class="pre">cummin</span></code></p></td>
<td><p>Cumulative minimum</p></td>
</tr>
</tbody>
</table>

<p>Note that by chance some NumPy methods, like <code class="docutils literal notranslate"><span class="pre">mean</span></code>, <code class="docutils literal notranslate"><span class="pre">std</span></code>, and <code class="docutils literal notranslate"><span class="pre">sum</span></code>,
will exclude NAs on Series input by default:</p>

In [None]:
np.mean(df["A"])

In [None]:
np.mean(df["A"].to_numpy())
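<p>The effect only shows up when the data contains missing values. The sketch below uses a hypothetical Series <code>col</code> with one <code>NaN</code> and compares the Series input, the raw ndarray input, and NumPy's explicit NaN-aware function:</p>

```python
import numpy as np
import pandas as pd

col = pd.Series([1.0, np.nan, 3.0])

# On a Series, np.mean ends up calling Series.mean, which skips NaN.
mean_series = np.mean(col)

# On the underlying ndarray, NumPy's own mean propagates NaN.
mean_array = np.mean(col.to_numpy())

# np.nanmean is NumPy's explicit NaN-skipping variant.
mean_nan_aware = np.nanmean(col.to_numpy())

print(mean_series, mean_array, mean_nan_aware)
```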

<p><a class="reference internal" href="../reference/api/pandas.Series.nunique.html#pandas.Series.nunique" title="pandas.Series.nunique"><code class="xref py py-meth docutils literal notranslate"><span class="pre">Series.nunique()</span></code></a> will return the number of unique non-NA values in a
Series:</p>

In [None]:
series = pd.Series(np.random.randn(500))

In [None]:
series[20:500] = np.nan
series[10:20] = 5

In [None]:
series.nunique()
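<p>To count <code>NaN</code> as a value of its own, <code>nunique</code> accepts <code>dropna=False</code>. A minimal sketch on a hypothetical Series <code>small</code>:</p>

```python
import numpy as np
import pandas as pd

small = pd.Series([5.0, 5.0, 1.5, np.nan, np.nan])

# Default: NaN is ignored, leaving the distinct values 5.0 and 1.5.
n_without_na = small.nunique()

# dropna=False: NaN counts as one additional distinct value.
n_with_na = small.nunique(dropna=False)

print(n_without_na, n_with_na)
```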

### Summarizing data: describe

In [None]:
series = pd.Series(np.random.randn(1000))

In [None]:
series[::2] = np.nan

In [None]:
series.describe()

In [None]:
frame = pd.DataFrame(np.random.randn(1000, 5), columns=["a", "b", "c", "d", "e"])

In [None]:
frame.iloc[::2] = np.nan

In [None]:
frame.iloc[::2]

In [None]:
frame.describe()

<p>You can select specific percentiles to include in the output:</p>

In [None]:
series.describe(percentiles=[0.05, 0.25, 0.75, 0.95])

<p>By default, the median is always included.</p>
<p>For a non-numerical Series object, <a class="reference internal" href="../reference/api/pandas.Series.describe.html#pandas.Series.describe" title="pandas.Series.describe"><code class="xref py py-meth docutils literal notranslate"><span class="pre">describe()</span></code></a> will give a simple
summary of the number of unique values and most frequently occurring values:</p>

In [None]:
s = pd.Series(["a", "a", "b", "b", "a", "a", np.nan, "c", "d", "a"])

In [None]:
s.describe()

<p>Note that on a mixed-type DataFrame object, <a class="reference internal" href="../reference/api/pandas.DataFrame.describe.html#pandas.DataFrame.describe" title="pandas.DataFrame.describe"><code class="xref py py-meth docutils literal notranslate"><span class="pre">describe()</span></code></a> will
restrict the summary to include only numerical columns or, if none are, only
categorical columns:</p>

In [None]:
frame = pd.DataFrame({"a": ["Yes", "Yes", "No", "No"], "b": range(4)})

In [None]:
frame.describe()

<p>This behavior can be controlled by providing a list of types as <code class="docutils literal notranslate"><span class="pre">include</span></code>/<code class="docutils literal notranslate"><span class="pre">exclude</span></code>
arguments. The special value <code class="docutils literal notranslate"><span class="pre">all</span></code> can also be used:</p>

In [None]:
frame.describe(include=["object"])

In [None]:
frame.describe(include=["number"])

In [None]:
frame.describe(include="all")

<p>That feature relies on <a class="reference internal" href="#basics-selectdtypes"><span class="std std-ref">select_dtypes</span></a>. Refer to
that section for details about accepted inputs.</p>
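<p>The <code>exclude</code> argument works as the mirror image of <code>include</code>. The sketch below rebuilds the mixed-type frame under the hypothetical name <code>mixed</code> and shows both <code>exclude</code> and <code>include="all"</code>:</p>

```python
import pandas as pd

mixed = pd.DataFrame({"a": ["Yes", "Yes", "No", "No"], "b": range(4)})

# Drop numeric columns from the summary; only the string column "a" remains.
non_numeric = mixed.describe(exclude=["number"])

# include="all" reports every column; statistics that do not apply
# to a column's dtype are filled with NaN.
everything = mixed.describe(include="all")

print(non_numeric)
print(everything)
```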

### Index of min/max values

<p>The <a class="reference internal" href="../reference/api/pandas.DataFrame.idxmin.html#pandas.DataFrame.idxmin" title="pandas.DataFrame.idxmin"><code class="xref py py-meth docutils literal notranslate"><span class="pre">idxmin()</span></code></a> and <a class="reference internal" href="../reference/api/pandas.DataFrame.idxmax.html#pandas.DataFrame.idxmax" title="pandas.DataFrame.idxmax"><code class="xref py py-meth docutils literal notranslate"><span class="pre">idxmax()</span></code></a> functions on Series
and DataFrame compute the index labels with the minimum and maximum
corresponding values:</p>

In [None]:
s1 = pd.Series(np.random.randn(5))

In [None]:
s1

In [None]:
s1.idxmin(), s1.idxmax()

In [None]:
df1 = pd.DataFrame(np.random.randn(5, 3), columns=["A", "B", "C"])

In [None]:
df1

In [None]:
df1.idxmin()

In [None]:
df1.idxmax(axis=1)

<p>When there are multiple rows (or columns) matching the minimum or maximum
value, <a class="reference internal" href="../reference/api/pandas.DataFrame.idxmin.html#pandas.DataFrame.idxmin" title="pandas.DataFrame.idxmin"><code class="xref py py-meth docutils literal notranslate"><span class="pre">idxmin()</span></code></a> and <a class="reference internal" href="../reference/api/pandas.DataFrame.idxmax.html#pandas.DataFrame.idxmax" title="pandas.DataFrame.idxmax"><code class="xref py py-meth docutils literal notranslate"><span class="pre">idxmax()</span></code></a> return the first
matching index:</p>

In [None]:
df3 = pd.DataFrame([2, 1, 1, 3, np.nan], columns=["A"], index=list("edcba"))

In [None]:
df3

In [None]:
df3["A"].idxmin()
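<p>The "first match" rule can be checked by hand. In this sketch the hypothetical Series <code>ties</code> has a tied minimum and a tied maximum, and the manual lookup selects the first label whose value equals the extreme:</p>

```python
import pandas as pd

ties = pd.Series([2, 1, 1, 3, 3], index=list("edcba"))

# Both "d" and "c" hold the minimum; both "b" and "a" hold the maximum.
first_min = ties.idxmin()
first_max = ties.idxmax()

# Manual equivalent: first label whose value equals the extreme.
manual_min = ties[ties == ties.min()].index[0]
manual_max = ties[ties == ties.max()].index[0]

print(first_min, first_max)
```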

### Value counts (histogramming) / mode

<p>The <a class="reference internal" href="../reference/api/pandas.Series.value_counts.html#pandas.Series.value_counts" title="pandas.Series.value_counts"><code class="xref py py-meth docutils literal notranslate"><span class="pre">value_counts()</span></code></a> Series method and the top-level function of the same name compute a histogram
of a 1D array of values. The top-level function can also be applied to regular arrays:</p>

In [None]:
data = np.random.randint(0, 7, size=50)

In [None]:
data

In [None]:
s = pd.Series(data)

In [None]:
s.value_counts()

In [None]:
pd.value_counts(data)

<p>The <a class="reference internal" href="../reference/api/pandas.DataFrame.value_counts.html#pandas.DataFrame.value_counts" title="pandas.DataFrame.value_counts"><code class="xref py py-meth docutils literal notranslate"><span class="pre">value_counts()</span></code></a> method can be used to count combinations across multiple columns.
By default all columns are used but a subset can be selected using the <code class="docutils literal notranslate"><span class="pre">subset</span></code> argument.</p>

In [None]:
data = {"a": [1, 2, 3, 4], "b": ["x", "x", "y", "y"]}

In [None]:
frame = pd.DataFrame(data)

In [None]:
frame.value_counts()
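<p>The example above counts over all columns, so every combination occurs once. A short sketch of the <code>subset</code> argument, using a hypothetical copy of the same data named <code>pairs</code>:</p>

```python
import pandas as pd

pairs = pd.DataFrame({"a": [1, 2, 3, 4], "b": ["x", "x", "y", "y"]})

# All columns: each (a, b) combination is unique.
all_cols = pairs.value_counts()

# subset restricts the count to the listed columns.
by_b = pairs.value_counts(subset=["b"])

print(all_cols)
print(by_b)
```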

<p>Similarly, you can get the most frequently occurring value(s), i.e. the mode, of the values in a Series or DataFrame:</p>

In [None]:
s5 = pd.Series([1, 1, 3, 3, 3, 5, 5, 7, 7, 7])

In [None]:
s5.mode()

In [None]:
df5 = pd.DataFrame(
    {
        "A": np.random.randint(0, 7, size=50),
        "B": np.random.randint(-10, 15, size=50),
    }
)

In [None]:
df5.mode()

### Discretization and quantiling

<p>Continuous values can be discretized using the <a class="reference internal" href="../reference/api/pandas.cut.html#pandas.cut" title="pandas.cut"><code class="xref py py-func docutils literal notranslate"><span class="pre">cut()</span></code></a> (bins based on values)
and <a class="reference internal" href="../reference/api/pandas.qcut.html#pandas.qcut" title="pandas.qcut"><code class="xref py py-func docutils literal notranslate"><span class="pre">qcut()</span></code></a> (bins based on sample quantiles) functions:</p>

In [None]:
arr = np.random.randn(20)

In [None]:
factor = pd.cut(arr, 4)

In [None]:
factor

In [None]:
factor = pd.cut(arr, [-5, -1, 0, 1, 5])

In [None]:
factor

<p><a class="reference internal" href="../reference/api/pandas.qcut.html#pandas.qcut" title="pandas.qcut"><code class="xref py py-func docutils literal notranslate"><span class="pre">qcut()</span></code></a> computes sample quantiles. For example, we could slice up some
normally distributed data into equal-size quartiles like so:</p>

In [None]:
arr = np.random.randn(30)

In [None]:
factor = pd.qcut(arr, [0, 0.25, 0.5, 0.75, 1])

In [None]:
factor

In [None]:
pd.value_counts(factor)

<p>We can also pass infinite values to define the bins:</p>

In [None]:
arr = np.random.randn(20)

In [None]:
factor = pd.cut(arr, [-np.inf, 0, np.inf])

In [None]:
factor
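<p>To make the contrast between the two functions concrete, the sketch below bins the same hypothetical sample (<code>values</code>, seeded for repeatability) both ways and compares how many observations land in each bin. With <code>cut</code> the bins have equal width, so the tails are sparse; with <code>qcut</code> the bins hold roughly equal counts.</p>

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
values = rng.normal(size=1000)

# cut: four equal-width bins spanning the observed range.
by_width = pd.cut(values, 4)

# qcut: four bins bounded by the sample quartiles.
by_quantile = pd.qcut(values, 4)

# Count per bin, kept in bin order rather than sorted by frequency.
width_counts = pd.Series(by_width).value_counts(sort=False)
quantile_counts = pd.Series(by_quantile).value_counts(sort=False)

print(width_counts)
print(quantile_counts)
```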

## Function application

<p>To apply your own or another library’s functions to pandas objects,
you should be aware of the four methods below. The appropriate
method to use depends on whether your function expects to operate
on an entire <code class="docutils literal notranslate"><span class="pre">DataFrame</span></code> or <code class="docutils literal notranslate"><span class="pre">Series</span></code>, row- or column-wise, or elementwise.</p>

<ol class="arabic simple">
<li><p><a class="reference internal" href="#tablewise-function-application">Tablewise Function Application</a>: <a class="reference internal" href="../reference/api/pandas.DataFrame.pipe.html#pandas.DataFrame.pipe" title="pandas.DataFrame.pipe"><code class="xref py py-meth docutils literal notranslate"><span class="pre">pipe()</span></code></a></p></li>
<li><p><a class="reference internal" href="#row-or-column-wise-function-application">Row or Column-wise Function Application</a>: <a class="reference internal" href="../reference/api/pandas.DataFrame.apply.html#pandas.DataFrame.apply" title="pandas.DataFrame.apply"><code class="xref py py-meth docutils literal notranslate"><span class="pre">apply()</span></code></a></p></li>
<li><p><a class="reference internal" href="#aggregation-api">Aggregation API</a>: <a class="reference internal" href="../reference/api/pandas.DataFrame.agg.html#pandas.DataFrame.agg" title="pandas.DataFrame.agg"><code class="xref py py-meth docutils literal notranslate"><span class="pre">agg()</span></code></a> and <a class="reference internal" href="../reference/api/pandas.DataFrame.transform.html#pandas.DataFrame.transform" title="pandas.DataFrame.transform"><code class="xref py py-meth docutils literal notranslate"><span class="pre">transform()</span></code></a></p></li>
<li><p><a class="reference internal" href="#applying-elementwise-functions">Applying Elementwise Functions</a>: <a class="reference internal" href="../reference/api/pandas.DataFrame.applymap.html#pandas.DataFrame.applymap" title="pandas.DataFrame.applymap"><code class="xref py py-meth docutils literal notranslate"><span class="pre">applymap()</span></code></a></p></li>
</ol>

### Tablewise function application

<p><code class="docutils literal notranslate"><span class="pre">DataFrames</span></code> and <code class="docutils literal notranslate"><span class="pre">Series</span></code> can be passed into functions.
However, if the function needs to be called in a chain, consider using the <a class="reference internal" href="../reference/api/pandas.DataFrame.pipe.html#pandas.DataFrame.pipe" title="pandas.DataFrame.pipe"><code class="xref py py-meth docutils literal notranslate"><span class="pre">pipe()</span></code></a> method.</p>

In [None]:
def extract_city_name(df):
    """
    Chicago, IL -> Chicago for city_name column
    """
    df["city_name"] = df["city_and_code"].str.split(",").str.get(0)
    return df

In [None]:
def add_country_name(df, country_name=None):
    """
    Chicago -> Chicago-US for city_and_country column
    """
    col = "city_name"
    # Join city and country with a hyphen so the result matches the docstring
    df["city_and_country"] = df[col] + "-" + country_name
    return df

In [None]:
df_p = pd.DataFrame({"city_and_code": ["Chicago, IL"]})

In [None]:
df_p

<p><code class="docutils literal notranslate"><span class="pre">extract_city_name</span></code> and <code class="docutils literal notranslate"><span class="pre">add_country_name</span></code> are functions taking and returning <code class="docutils literal notranslate"><span class="pre">DataFrames</span></code>.</p>

<p>Now compare the following:</p>

In [None]:
add_country_name(extract_city_name(df_p), country_name="US")

<p>This is equivalent to:</p>

In [None]:
df_p.pipe(extract_city_name).pipe(add_country_name, country_name="US")

<p>pandas encourages the second style, which is known as method chaining.
<code class="docutils literal notranslate"><span class="pre">pipe</span></code> makes it easy to use your own or another library’s functions
in method chains, alongside pandas’ methods.</p>

<p>In the example above, the functions <code class="docutils literal notranslate"><span class="pre">extract_city_name</span></code> and <code class="docutils literal notranslate"><span class="pre">add_country_name</span></code> each expected a <code class="docutils literal notranslate"><span class="pre">DataFrame</span></code> as the first positional argument.
What if the function you wish to apply takes its data as, say, the second argument?
In this case, provide <code class="docutils literal notranslate"><span class="pre">pipe</span></code> with a tuple of <code class="docutils literal notranslate"><span class="pre">(callable,</span> <span class="pre">data_keyword)</span></code>.
<code class="docutils literal notranslate"><span class="pre">.pipe</span></code> will route the <code class="docutils literal notranslate"><span class="pre">DataFrame</span></code> to the argument specified in the tuple.</p>

<p>For example, we can fit a regression using statsmodels. Their API expects a formula first and a <code class="docutils literal notranslate"><span class="pre">DataFrame</span></code> as the second argument, <code class="docutils literal notranslate"><span class="pre">data</span></code>. We pass in the function, keyword pair <code class="docutils literal notranslate"><span class="pre">(sm.ols,</span> <span class="pre">'data')</span></code> to <code class="docutils literal notranslate"><span class="pre">pipe</span></code>:</p>

In [None]:
import statsmodels.formula.api as sm

In [None]:
bb = pd.read_csv("baseball.csv", index_col="id")

In [None]:
(bb.query("h > 0")
 .assign(ln_h=lambda df: np.log(df.h))
 .pipe((sm.ols, "data"), "hr ~ ln_h + year + g + C(lg)")
 .fit()
 .summary()
)
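<p>The tuple form can also be tried without statsmodels or an external data file. In this sketch, <code>scale_by</code> is a hypothetical helper (not a pandas or statsmodels function) that takes its data as the second argument, named <code>data</code>:</p>

```python
import pandas as pd


def scale_by(factor, data):
    """Hypothetical helper whose DataFrame argument comes second."""
    return data * factor


prices = pd.DataFrame({"x": [1, 2, 3]})

# The tuple names the keyword ("data") that should receive the DataFrame;
# the remaining positional arguments are passed through unchanged.
scaled = prices.pipe((scale_by, "data"), 10)

# Equivalent direct call.
direct = scale_by(10, data=prices)

print(scaled)
```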

<p>The pipe method is inspired by Unix pipes and, more recently, by <a class="reference external" href="https://github.com/tidyverse/dplyr">dplyr</a> and <a class="reference external" href="https://github.com/tidyverse/magrittr">magrittr</a>, which
introduced the popular <code class="docutils literal notranslate"><span class="pre">(%&gt;%)</span></code> (read pipe) operator for <a class="reference external" href="https://www.r-project.org">R</a>.
The implementation of <code class="docutils literal notranslate"><span class="pre">pipe</span></code> here is quite clean and feels right at home in Python.
We encourage you to view the source code of <a class="reference internal" href="../reference/api/pandas.DataFrame.pipe.html#pandas.DataFrame.pipe" title="pandas.DataFrame.pipe"><code class="xref py py-meth docutils literal notranslate"><span class="pre">pipe()</span></code></a>.</p>

### Row or column-wise function application

<p>Arbitrary functions can be applied along the axes of a DataFrame
using the <a class="reference internal" href="../reference/api/pandas.DataFrame.apply.html#pandas.DataFrame.apply" title="pandas.DataFrame.apply"><code class="xref py py-meth docutils literal notranslate"><span class="pre">apply()</span></code></a> method, which, like the descriptive
statistics methods, takes an optional <code class="docutils literal notranslate"><span class="pre">axis</span></code> argument:</p>

In [None]:
df.apply(np.mean)

In [None]:
df.apply(np.mean, axis=1)

In [None]:
df.apply(lambda x: x.max() - x.min())

In [None]:
df.apply(np.cumsum)

In [None]:
df.apply(np.exp)

<p>The <a class="reference internal" href="../reference/api/pandas.DataFrame.apply.html#pandas.DataFrame.apply" title="pandas.DataFrame.apply"><code class="xref py py-meth docutils literal notranslate"><span class="pre">apply()</span></code></a> method will also dispatch on a string method name.</p>

In [None]:
df.apply("mean")

In [None]:
df.apply("mean", axis=1)

<p>The return type of the function passed to <a class="reference internal" href="../reference/api/pandas.DataFrame.apply.html#pandas.DataFrame.apply" title="pandas.DataFrame.apply"><code class="xref py py-meth docutils literal notranslate"><span class="pre">apply()</span></code></a> affects the
type of the final output from <code class="docutils literal notranslate"><span class="pre">DataFrame.apply</span></code> for the default behaviour:</p>


<ul class="simple">
<li><p>If the applied function returns a <code class="docutils literal notranslate"><span class="pre">Series</span></code>, the final output is a <code class="docutils literal notranslate"><span class="pre">DataFrame</span></code>.
The columns match the index of the <code class="docutils literal notranslate"><span class="pre">Series</span></code> returned by the applied function.</p></li>
<li><p>If the applied function returns any other type, the final output is a <code class="docutils literal notranslate"><span class="pre">Series</span></code>.</p></li>
</ul>

<p>This default behaviour can be overridden using the <code class="docutils literal notranslate"><span class="pre">result_type</span></code> argument, which
accepts three options: <code class="docutils literal notranslate"><span class="pre">reduce</span></code>, <code class="docutils literal notranslate"><span class="pre">broadcast</span></code>, and <code class="docutils literal notranslate"><span class="pre">expand</span></code>.
These determine how list-like return values expand (or not) to a <code class="docutils literal notranslate"><span class="pre">DataFrame</span></code>.</p>
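<p>The following sketch illustrates the options on a hypothetical two-row frame <code>small</code>, applying a row-wise function that returns a plain list. It is meant as an illustration of the behaviour described above, not an exhaustive account of <code>result_type</code>:</p>

```python
import pandas as pd

small = pd.DataFrame({"A": [1, 2], "B": [3, 4]})


def as_list(row):
    # Returns a plain list, i.e. a list-like that is not a Series.
    return [row["A"], row["B"] * 2]


# Default: each returned list is stored as a single object, giving a Series of lists.
as_lists = small.apply(as_list, axis=1)

# "expand": each list element becomes its own column (labelled 0, 1, ...).
expanded = small.apply(as_list, axis=1, result_type="expand")

# "broadcast": results are written back into the original shape and labels.
broadcast = small.apply(as_list, axis=1, result_type="broadcast")

print(as_lists)
print(expanded)
print(broadcast)
```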

<p><a class="reference internal" href="../reference/api/pandas.DataFrame.apply.html#pandas.DataFrame.apply" title="pandas.DataFrame.apply"><code class="xref py py-meth docutils literal notranslate"><span class="pre">apply()</span></code></a> combined with some cleverness can be used to answer many questions
about a data set. For example, suppose we wanted to extract the date where the
maximum value for each column occurred:</p>

In [None]:
tsdf = pd.DataFrame(
    np.random.randn(1000, 3),
    columns=["A", "B", "C"],
    index=pd.date_range("1/1/2000", periods=1000),
)

In [None]:
tsdf.apply(lambda x: x.idxmax())

<p>You may also pass additional arguments and keyword arguments to the <a class="reference internal" href="../reference/api/pandas.DataFrame.apply.html#pandas.DataFrame.apply" title="pandas.DataFrame.apply"><code class="xref py py-meth docutils literal notranslate"><span class="pre">apply()</span></code></a>
method. For instance, consider the following function you would like to apply:</p>

In [None]:
def subtract_and_divide(x, sub, divide=1):
    return (x - sub) / divide

<p>You may then apply this function as follows:</p>

In [None]:
df.apply(subtract_and_divide, args=(5,), divide=3)

<p>Another useful feature is the ability to pass Series methods to carry out some
Series operation on each column or row:</p>

In [None]:
tsdf

In [None]:
tsdf.apply(pd.Series.interpolate)

<p>Finally, <a class="reference internal" href="../reference/api/pandas.DataFrame.apply.html#pandas.DataFrame.apply" title="pandas.DataFrame.apply"><code class="xref py py-meth docutils literal notranslate"><span class="pre">apply()</span></code></a> takes an argument <code class="docutils literal notranslate"><span class="pre">raw</span></code>, which is False by default. With the default,
each row or column is converted into a Series before the function is applied. When
set to True, the passed function instead receives an ndarray object, which
improves performance if you do not need the indexing
functionality.</p>
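<p>A minimal sketch of <code>raw=True</code>, assuming a function that relies only on methods shared by Series and ndarray (here <code>max</code> and <code>min</code>). The names <code>wide</code> and <code>value_range</code> are hypothetical:</p>

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
wide = pd.DataFrame(rng.normal(size=(1000, 3)), columns=["A", "B", "C"])


def value_range(x):
    # Uses only .max() and .min(), so it works on a Series or an ndarray.
    return x.max() - x.min()


# Each column is passed to the function as a Series.
via_series = wide.apply(value_range)

# Each column is passed to the function as an ndarray, skipping Series construction.
via_ndarray = wide.apply(value_range, raw=True)

print(np.allclose(via_series, via_ndarray))
```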

### Aggregation API

<p>The aggregation API allows one to express possibly multiple aggregation operations in a single concise way.
This API is similar across pandas objects; see the <a class="reference internal" href="groupby.html#groupby-aggregate"><span class="std std-ref">groupby API</span></a>, the
<a class="reference internal" href="window.html#window-overview"><span class="std std-ref">window API</span></a>, and the <a class="reference internal" href="timeseries.html#timeseries-aggregate"><span class="std std-ref">resample API</span></a>.
The entry point for aggregation is <a class="reference internal" href="../reference/api/pandas.DataFrame.aggregate.html#pandas.DataFrame.aggregate" title="pandas.DataFrame.aggregate"><code class="xref py py-meth docutils literal notranslate"><span class="pre">DataFrame.aggregate()</span></code></a>, or the alias
<a class="reference internal" href="../reference/api/pandas.DataFrame.agg.html#pandas.DataFrame.agg" title="pandas.DataFrame.agg"><code class="xref py py-meth docutils literal notranslate"><span class="pre">DataFrame.agg()</span></code></a>.</p>


<p>We will use a starting frame similar to the one above:</p>

In [None]:
tsdf = pd.DataFrame(
    np.random.randn(10, 3),
    columns=["A", "B", "C"],
    index=pd.date_range("1/1/2000", periods=10),
)

In [None]:
tsdf.iloc[3:7] = np.nan
tsdf

<p>Using a single function is equivalent to <a class="reference internal" href="../reference/api/pandas.DataFrame.apply.html#pandas.DataFrame.apply" title="pandas.DataFrame.apply"><code class="xref py py-meth docutils literal notranslate"><span class="pre">apply()</span></code></a>. You can also
pass named methods as strings. These will return a <code class="docutils literal notranslate"><span class="pre">Series</span></code> of the aggregated
output:</p>

In [None]:
tsdf.aggregate(np.sum)

In [None]:
tsdf.aggregate("sum")

In [None]:
# these are equivalent to a ``.sum()`` because we are aggregating
# on a single function
tsdf.sum()

<p>A single aggregation on a <code class="docutils literal notranslate"><span class="pre">Series</span></code> returns a scalar value:</p>

In [None]:
tsdf['A'].aggregate(np.sum)

#### Aggregating with multiple functions

<p>You can pass multiple aggregation arguments as a list.
The results of each of the passed functions will be a row in the resulting <code class="docutils literal notranslate"><span class="pre">DataFrame</span></code>.
These are naturally named from the aggregation function.</p>

In [None]:
tsdf.agg(["sum"])

<p>Multiple functions yield multiple rows:</p>

In [None]:
tsdf.agg(["sum","mean"])

<p>On a <code class="docutils literal notranslate"><span class="pre">Series</span></code>, multiple functions return a <code class="docutils literal notranslate"><span class="pre">Series</span></code>, indexed by the function names:</p>

In [None]:
tsdf["A"].agg(["sum", "mean"])

<p>Passing a <code class="docutils literal notranslate"><span class="pre">lambda</span></code> function will yield a <code class="docutils literal notranslate"><span class="pre">&lt;lambda&gt;</span></code> named row:</p>

In [None]:
tsdf.aggregate(["sum", lambda x: x.mean()])

<p>Passing a named function will yield that name for the row:</p>

In [None]:
def mymean(x):
    return x.mean()


tsdf.aggregate(["sum", mymean])

#### Aggregating with a dict

<p>Passing a dictionary that maps column names to a scalar or a list of scalars to <code class="docutils literal notranslate"><span class="pre">DataFrame.agg</span></code>
allows you to customize which functions are applied to which columns. Note that the results
are not in any particular order; you can use an <code class="docutils literal notranslate"><span class="pre">OrderedDict</span></code> instead to guarantee ordering.</p>

In [None]:
tsdf.agg({"A": "mean", "B": "sum"})

<p>Passing a list-like will generate a <code class="docutils literal notranslate"><span class="pre">DataFrame</span></code> output. You will get a matrix-like output
of all of the aggregators. The output will consist of all unique functions. Those that are
not noted for a particular column will be <code class="docutils literal notranslate"><span class="pre">NaN</span></code>:</p>

In [None]:
tsdf.agg({"A": ["mean", "min"], "B": "sum"})
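<p>The shape of the two dict-based results is easier to inspect on fixed numbers. This sketch uses a hypothetical frame <code>demo</code> and looks values up by label, so it does not depend on the row order of the output:</p>

```python
import pandas as pd

demo = pd.DataFrame({"A": [1.0, 2.0, 3.0], "B": [10.0, 20.0, 30.0]})

# One function per column: the result is a Series indexed by column name.
per_col_scalar = demo.agg({"A": "mean", "B": "sum"})

# A list for at least one column: the result is a DataFrame with one row per
# distinct function; combinations that were not requested are NaN.
per_col_table = demo.agg({"A": ["mean", "min"], "B": "sum"})

print(per_col_scalar)
print(per_col_table)
```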

#### Mixed dtypes

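<p>When presented with mixed dtypes that cannot aggregate, <code class="docutils literal notranslate"><span class="pre">.agg</span></code> will only take the valid
aggregations. This is similar to how <code class="docutils literal notranslate"><span class="pre">.groupby.agg</span></code> works. Note that this silent dropping of
invalid aggregations was deprecated in pandas 1.4; newer versions raise an error instead, so select the columns to aggregate explicitly there.</p>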

In [None]:
mdf = pd.DataFrame({
    "A": [1, 2, 3],
    "B": [1.0, 2.0, 3.0],
    "C": ["foo", "bar", "baz"],
    "D": pd.date_range("20130101", periods=3),
})

In [None]:
mdf.dtypes

In [None]:
mdf.aggregate(["min", "sum"])

#### Custom describe

<p>With <code class="docutils literal notranslate"><span class="pre">.agg()</span></code> it is possible to easily create a custom describe function, similar
to the built-in <a class="reference internal" href="#basics-describe"><span class="std std-ref">describe function</span></a>.</p>

In [None]:
from functools import partial

In [None]:
q_25 = partial(pd.Series.quantile, q=0.25)

In [None]:
q_25.__name__ = "25%"

In [None]:
q_75 = partial(pd.Series.quantile, q=0.75)

In [None]:
q_75.__name__ = "75%"

In [None]:
tsdf.agg(["count", "mean", "std", "min", q_25, "median", q_75, "max"])

next session

## https://pandas.pydata.org/docs/user_guide/basics.html#transform-api