
In [None]:
import numpy as np
import pandas as pd
pd.__version__

## Essential basic functionality

<p>Here we discuss a lot of the essential functionality common to the pandas data
structures. To begin, let’s create some example objects like we did in
the <a class="reference internal" href="10min.html#min"><span class="std std-ref">10 minutes to pandas</span></a> section:</p>

In [None]:
index = pd.date_range("1/1/2000", periods=8)
s = pd.Series(np.random.randn(5), index=["a", "b", "c", "d", "e"])
df = pd.DataFrame(np.random.randn(8, 3), index=index, columns=["A", "B", "C"])

## Head and tail

<p>To view a small sample of a Series or DataFrame object, use the
<a class="reference internal" href="../reference/api/pandas.DataFrame.head.html#pandas.DataFrame.head" title="pandas.DataFrame.head"><code class="xref py py-meth docutils literal notranslate"><span class="pre">head()</span></code></a> and <a class="reference internal" href="../reference/api/pandas.DataFrame.tail.html#pandas.DataFrame.tail" title="pandas.DataFrame.tail"><code class="xref py py-meth docutils literal notranslate"><span class="pre">tail()</span></code></a> methods. The default number
of elements to display is five, but you may pass a custom number.</p>

In [None]:
long_series = pd.Series(np.random.randn(1000))

In [None]:
long_series.head()

In [None]:
long_series.tail(3)

## Attributes and underlying data

<p>pandas objects have a number of attributes enabling you to access the metadata:</p>
<ul class="simple">
<li><p><strong>shape</strong>: gives the axis dimensions of the object, consistent with ndarray</p></li>
<li><dl class="simple">
<dt>Axis labels</dt><dd><ul>
<li><p><strong>Series</strong>: <em>index</em> (only axis)</p></li>
<li><p><strong>DataFrame</strong>: <em>index</em> (rows) and <em>columns</em></p></li>
</ul>
</dd>
</dl>
</li>
</ul>

<p>Note, <strong>these attributes can be safely assigned to</strong>!</p>

In [None]:
df[:2]

In [None]:
df.columns = [x.lower() for x in df.columns]

In [None]:
df
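<p>As a minimal sketch of the attributes listed above, the cell below builds a small frame and reads or reassigns its metadata. The name <code>frame</code> and its labels are made up for illustration only.</p>

```python
import numpy as np
import pandas as pd

# Hypothetical 2 x 3 frame used only for this illustration.
frame = pd.DataFrame(np.zeros((2, 3)), columns=["A", "B", "C"])

# shape follows the ndarray convention: (rows, columns).
dims = frame.shape

# Axis labels can be replaced by plain assignment.
frame.index = ["x", "y"]
frame.columns = [name.lower() for name in frame.columns]
```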

<p>pandas objects (<a class="reference internal" href="../reference/api/pandas.Index.html#pandas.Index" title="pandas.Index"><code class="xref py py-class docutils literal notranslate"><span class="pre">Index</span></code></a>, <a class="reference internal" href="../reference/api/pandas.Series.html#pandas.Series" title="pandas.Series"><code class="xref py py-class docutils literal notranslate"><span class="pre">Series</span></code></a>, <a class="reference internal" href="../reference/api/pandas.DataFrame.html#pandas.DataFrame" title="pandas.DataFrame"><code class="xref py py-class docutils literal notranslate"><span class="pre">DataFrame</span></code></a>) can be
thought of as containers for arrays, which hold the actual data and do the
actual computation. For many types, the underlying array is a
<a class="reference external" href="https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray" title="(in NumPy v1.23)"><code class="xref py py-class docutils literal notranslate"><span class="pre">numpy.ndarray</span></code></a>. However, pandas and 3rd party libraries may <em>extend</em>
NumPy’s type system to add support for custom arrays
(see <a class="reference internal" href="#basics-dtypes"><span class="std std-ref">dtypes</span></a>).</p>

<p>To get the actual data inside a <a class="reference internal" href="../reference/api/pandas.Index.html#pandas.Index" title="pandas.Index"><code class="xref py py-class docutils literal notranslate"><span class="pre">Index</span></code></a> or <a class="reference internal" href="../reference/api/pandas.Series.html#pandas.Series" title="pandas.Series"><code class="xref py py-class docutils literal notranslate"><span class="pre">Series</span></code></a>, use
the <code class="docutils literal notranslate"><span class="pre">.array</span></code> property:</p>

In [None]:
s.array

In [None]:
s.index.array

<p><a class="reference internal" href="../reference/api/pandas.Series.array.html#pandas.Series.array" title="pandas.Series.array"><code class="xref py py-attr docutils literal notranslate"><span class="pre">array</span></code></a> will always be an <a class="reference internal" href="../reference/api/pandas.api.extensions.ExtensionArray.html#pandas.api.extensions.ExtensionArray" title="pandas.api.extensions.ExtensionArray"><code class="xref py py-class docutils literal notranslate"><span class="pre">ExtensionArray</span></code></a>.
The exact details of what an <a class="reference internal" href="../reference/api/pandas.api.extensions.ExtensionArray.html#pandas.api.extensions.ExtensionArray" title="pandas.api.extensions.ExtensionArray"><code class="xref py py-class docutils literal notranslate"><span class="pre">ExtensionArray</span></code></a> is and why pandas uses them are a bit
beyond the scope of this introduction. See <a class="reference internal" href="#basics-dtypes"><span class="std std-ref">dtypes</span></a> for more.</p>

<p>If you know you need a NumPy array, use <a class="reference internal" href="../reference/api/pandas.Series.to_numpy.html#pandas.Series.to_numpy" title="pandas.Series.to_numpy"><code class="xref py py-meth docutils literal notranslate"><span class="pre">to_numpy()</span></code></a>
or <code class="xref py py-meth docutils literal notranslate"><span class="pre">numpy.asarray()</span></code>.</p>

In [None]:
s.to_numpy()

In [None]:
np.asarray(s)

<p>When the Series or Index is backed by
an <a class="reference internal" href="../reference/api/pandas.api.extensions.ExtensionArray.html#pandas.api.extensions.ExtensionArray" title="pandas.api.extensions.ExtensionArray"><code class="xref py py-class docutils literal notranslate"><span class="pre">ExtensionArray</span></code></a>, <a class="reference internal" href="../reference/api/pandas.Series.to_numpy.html#pandas.Series.to_numpy" title="pandas.Series.to_numpy"><code class="xref py py-meth docutils literal notranslate"><span class="pre">to_numpy()</span></code></a>
may involve copying data and coercing values. See <a class="reference internal" href="#basics-dtypes"><span class="std std-ref">dtypes</span></a> for more.</p>
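<p>A small sketch of that coercion, assuming a Series with the nullable <code>Int64</code> extension dtype (the variable names are illustrative). Requesting a float array forces a copy, and the missing value has to be mapped to something NumPy can hold, here <code>np.nan</code> via the <code>na_value</code> argument.</p>

```python
import numpy as np
import pandas as pd

# Nullable integer data is stored in an ExtensionArray, not a plain ndarray.
nullable = pd.Series([1, 2, None], dtype="Int64")

# Ask explicitly for float64 and say how the missing value should be encoded.
as_float = nullable.to_numpy(dtype="float64", na_value=np.nan)
```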

<p><a class="reference internal" href="../reference/api/pandas.Series.to_numpy.html#pandas.Series.to_numpy" title="pandas.Series.to_numpy"><code class="xref py py-meth docutils literal notranslate"><span class="pre">to_numpy()</span></code></a> gives some control over the <code class="docutils literal notranslate"><span class="pre">dtype</span></code> of the
resulting <a class="reference external" href="https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray" title="(in NumPy v1.23)"><code class="xref py py-class docutils literal notranslate"><span class="pre">numpy.ndarray</span></code></a>. For example, consider datetimes with timezones.
NumPy doesn’t have a dtype to represent timezone-aware datetimes, so there
are two possibly useful representations:</p>


<ol class="arabic simple">
<li><p>An object-dtype <a class="reference external" href="https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray" title="(in NumPy v1.23)"><code class="xref py py-class docutils literal notranslate"><span class="pre">numpy.ndarray</span></code></a> with <a class="reference internal" href="../reference/api/pandas.Timestamp.html#pandas.Timestamp" title="pandas.Timestamp"><code class="xref py py-class docutils literal notranslate"><span class="pre">Timestamp</span></code></a> objects, each
with the correct <code class="docutils literal notranslate"><span class="pre">tz</span></code></p></li>
<li><p>A <code class="docutils literal notranslate"><span class="pre">datetime64[ns]</span></code> -dtype <a class="reference external" href="https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray" title="(in NumPy v1.23)"><code class="xref py py-class docutils literal notranslate"><span class="pre">numpy.ndarray</span></code></a>, where the values have
been converted to UTC and the timezone discarded</p></li>
</ol>


<p>Timezones may be preserved with <code class="docutils literal notranslate"><span class="pre">dtype=object</span></code>:</p>

In [None]:
ser = pd.Series(pd.date_range("2000", periods=2, tz="CET"))

In [None]:
ser.to_numpy(dtype=object)

<p>Or thrown away with <code class="docutils literal notranslate"><span class="pre">dtype='datetime64[ns]'</span></code>:</p>

In [None]:
ser.to_numpy(dtype="datetime64[ns]")
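<p>To make the difference between the two representations concrete, the sketch below repeats both conversions on a fresh Series (named <code>ser_tz</code> here purely for illustration) and inspects the first element. CET is one hour ahead of UTC in January, so midnight CET becomes 23:00 on the previous day once the timezone is dropped.</p>

```python
import numpy as np
import pandas as pd

ser_tz = pd.Series(pd.date_range("2000-01-01", periods=2, tz="CET"))

# Object dtype keeps Timestamp objects, each carrying its timezone.
with_tz = ser_tz.to_numpy(dtype=object)
offset_seconds = with_tz[0].utcoffset().total_seconds()

# datetime64[ns] converts to UTC and discards the timezone.
as_utc = ser_tz.to_numpy(dtype="datetime64[ns]")
first_utc = as_utc[0]
```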

<p>Getting the “raw data” inside a <a class="reference internal" href="../reference/api/pandas.DataFrame.html#pandas.DataFrame" title="pandas.DataFrame"><code class="xref py py-class docutils literal notranslate"><span class="pre">DataFrame</span></code></a> is possibly a bit more
complex. When your <code class="docutils literal notranslate"><span class="pre">DataFrame</span></code> only has a single data type for all the
columns, <a class="reference internal" href="../reference/api/pandas.DataFrame.to_numpy.html#pandas.DataFrame.to_numpy" title="pandas.DataFrame.to_numpy"><code class="xref py py-meth docutils literal notranslate"><span class="pre">DataFrame.to_numpy()</span></code></a> will return the underlying data:</p>

In [None]:
df.to_numpy()

<p>If a DataFrame contains homogeneously-typed data, the ndarray can
actually be modified in-place, and the changes will be reflected in the data
structure. For heterogeneous data (i.e. when the DataFrame’s columns do not
all share the same dtype), this will not be the case. The values attribute itself,
unlike the axis labels, cannot be assigned to.</p>

<div class="admonition note">
<p class="admonition-title">Note</p>
<p>When working with heterogeneous data, the dtype of the resulting ndarray
will be chosen to accommodate all of the data involved. For example, if
strings are involved, the result will be of object dtype. If there are only
floats and integers, the resulting array will be of float dtype.</p>
</div>
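<p>The dtype rule in the note can be checked with two small frames. This is only a sketch; the frames <code>numeric</code> and <code>mixed</code> and their column names are invented for the example.</p>

```python
import numpy as np
import pandas as pd

# Integers and floats together are upcast to a common float dtype.
numeric = pd.DataFrame({"ints": [1, 2], "floats": [0.5, 1.5]})
numeric_dtype = numeric.to_numpy().dtype

# As soon as strings are involved, the result falls back to object dtype.
mixed = pd.DataFrame({"ints": [1, 2], "text": ["x", "y"]})
mixed_dtype = mixed.to_numpy().dtype
```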

<p>In the past, pandas recommended <a class="reference internal" href="../reference/api/pandas.Series.values.html#pandas.Series.values" title="pandas.Series.values"><code class="xref py py-attr docutils literal notranslate"><span class="pre">Series.values</span></code></a> or <a class="reference internal" href="../reference/api/pandas.DataFrame.values.html#pandas.DataFrame.values" title="pandas.DataFrame.values"><code class="xref py py-attr docutils literal notranslate"><span class="pre">DataFrame.values</span></code></a>
for extracting the data from a Series or DataFrame. You’ll still find references
to these in old code bases and online. Going forward, we recommend avoiding
<code class="docutils literal notranslate"><span class="pre">.values</span></code> and using <code class="docutils literal notranslate"><span class="pre">.array</span></code> or <code class="docutils literal notranslate"><span class="pre">.to_numpy()</span></code>. <code class="docutils literal notranslate"><span class="pre">.values</span></code> has the following
drawbacks:</p>

<ol class="arabic simple">
<li><p>When your Series contains an <a class="reference internal" href="../development/extending.html#extending-extension-types"><span class="std std-ref">extension type</span></a>, it’s
unclear whether <a class="reference internal" href="../reference/api/pandas.Series.values.html#pandas.Series.values" title="pandas.Series.values"><code class="xref py py-attr docutils literal notranslate"><span class="pre">Series.values</span></code></a> returns a NumPy array or the extension array.
<a class="reference internal" href="../reference/api/pandas.Series.array.html#pandas.Series.array" title="pandas.Series.array"><code class="xref py py-attr docutils literal notranslate"><span class="pre">Series.array</span></code></a> will always return an <a class="reference internal" href="../reference/api/pandas.api.extensions.ExtensionArray.html#pandas.api.extensions.ExtensionArray" title="pandas.api.extensions.ExtensionArray"><code class="xref py py-class docutils literal notranslate"><span class="pre">ExtensionArray</span></code></a>, and will never
copy data. <a class="reference internal" href="../reference/api/pandas.Series.to_numpy.html#pandas.Series.to_numpy" title="pandas.Series.to_numpy"><code class="xref py py-meth docutils literal notranslate"><span class="pre">Series.to_numpy()</span></code></a> will always return a NumPy array,
potentially at the cost of copying / coercing values.</p></li>
<li><p>When your DataFrame contains a mixture of data types, <a class="reference internal" href="../reference/api/pandas.DataFrame.values.html#pandas.DataFrame.values" title="pandas.DataFrame.values"><code class="xref py py-attr docutils literal notranslate"><span class="pre">DataFrame.values</span></code></a> may
involve copying data and coercing values to a common dtype, a relatively expensive
operation. <a class="reference internal" href="../reference/api/pandas.DataFrame.to_numpy.html#pandas.DataFrame.to_numpy" title="pandas.DataFrame.to_numpy"><code class="xref py py-meth docutils literal notranslate"><span class="pre">DataFrame.to_numpy()</span></code></a>, being a method, makes it clearer that the
returned NumPy array may not be a view on the same data in the DataFrame.</p></li>
</ol>

<section id="accelerated-operations">
<span id="basics-accelerate"></span><h2>Accelerated operations</h2>
<p>pandas can accelerate certain types of binary numerical and boolean operations using
the <code class="docutils literal notranslate"><span class="pre">numexpr</span></code> and <code class="docutils literal notranslate"><span class="pre">bottleneck</span></code> libraries.</p>
<p>These libraries are especially useful when dealing with large data sets, and provide large
speedups. <code class="docutils literal notranslate"><span class="pre">numexpr</span></code> uses smart chunking, caching, and multiple cores. <code class="docutils literal notranslate"><span class="pre">bottleneck</span></code> is
a set of specialized Cython routines that are especially fast when dealing with arrays that have
<code class="docutils literal notranslate"><span class="pre">nans</span></code>.</p>
<p>Here is a sample (using 100 column x 100,000 row <code class="docutils literal notranslate"><span class="pre">DataFrames</span></code>):</p>
<table class="colwidths-given table">
<colgroup>
<col style="width: 25%">
<col style="width: 25%">
<col style="width: 25%">
<col style="width: 25%">
</colgroup>
<thead>
<tr class="row-odd"><th class="head"><p>Operation</p></th>
<th class="head"><p>0.11.0 (ms)</p></th>
<th class="head"><p>Prior Version (ms)</p></th>
<th class="head"><p>Ratio to Prior</p></th>
</tr>
</thead>
<tbody>
<tr class="row-even"><td><p><code class="docutils literal notranslate"><span class="pre">df1</span> <span class="pre">&gt;</span> <span class="pre">df2</span></code></p></td>
<td><p>13.32</p></td>
<td><p>125.35</p></td>
<td><p>0.1063</p></td>
</tr>
<tr class="row-odd"><td><p><code class="docutils literal notranslate"><span class="pre">df1</span> <span class="pre">*</span> <span class="pre">df2</span></code></p></td>
<td><p>21.71</p></td>
<td><p>36.63</p></td>
<td><p>0.5928</p></td>
</tr>
<tr class="row-even"><td><p><code class="docutils literal notranslate"><span class="pre">df1</span> <span class="pre">+</span> <span class="pre">df2</span></code></p></td>
<td><p>22.04</p></td>
<td><p>36.50</p></td>
<td><p>0.6039</p></td>
</tr>
</tbody>
</table>
<p>You are highly encouraged to install both libraries. See the section
<a class="reference internal" href="../getting_started/install.html#install-recommended-dependencies"><span class="std std-ref">Recommended Dependencies</span></a> for more installation info.</p>
<p>Both are enabled by default. You can control this by setting the options:</p>
</section>

In [None]:
pd.set_option("compute.use_bottleneck", False)
pd.set_option("compute.use_numexpr", False)
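<p><code>pd.set_option</code> changes the setting for the rest of the session. If the change should only apply to a few statements, one possible approach is <code>pd.option_context</code>, which restores the previous value on exit. The sketch below only reads the option back; it does not measure any speedup.</p>

```python
import pandas as pd

# Remember whatever the session is currently using.
before = pd.get_option("compute.use_numexpr")

# Temporarily disable numexpr for the statements inside the block.
with pd.option_context("compute.use_numexpr", False):
    inside = pd.get_option("compute.use_numexpr")

# On exit the previous value is restored.
after = pd.get_option("compute.use_numexpr")
```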

## Flexible binary operations

<p>With binary operations between pandas data structures, there are two key points
of interest:</p>

<ul class="simple">
<li><p>Broadcasting behavior between higher- (e.g. DataFrame) and
lower-dimensional (e.g. Series) objects.</p></li>
<li><p>Missing data in computations.</p></li>
</ul>

<p>We will demonstrate how to manage these issues independently, though they can
be handled simultaneously.</p>

### Matching / broadcasting behavior

<p>DataFrame has the methods <a class="reference internal" href="../reference/api/pandas.DataFrame.add.html#pandas.DataFrame.add" title="pandas.DataFrame.add"><code class="xref py py-meth docutils literal notranslate"><span class="pre">add()</span></code></a>, <a class="reference internal" href="../reference/api/pandas.DataFrame.sub.html#pandas.DataFrame.sub" title="pandas.DataFrame.sub"><code class="xref py py-meth docutils literal notranslate"><span class="pre">sub()</span></code></a>,
<a class="reference internal" href="../reference/api/pandas.DataFrame.mul.html#pandas.DataFrame.mul" title="pandas.DataFrame.mul"><code class="xref py py-meth docutils literal notranslate"><span class="pre">mul()</span></code></a>, <a class="reference internal" href="../reference/api/pandas.DataFrame.div.html#pandas.DataFrame.div" title="pandas.DataFrame.div"><code class="xref py py-meth docutils literal notranslate"><span class="pre">div()</span></code></a> and related functions
<a class="reference internal" href="../reference/api/pandas.DataFrame.radd.html#pandas.DataFrame.radd" title="pandas.DataFrame.radd"><code class="xref py py-meth docutils literal notranslate"><span class="pre">radd()</span></code></a>, <a class="reference internal" href="../reference/api/pandas.DataFrame.rsub.html#pandas.DataFrame.rsub" title="pandas.DataFrame.rsub"><code class="xref py py-meth docutils literal notranslate"><span class="pre">rsub()</span></code></a>, …
for carrying out binary operations. For broadcasting behavior,
Series input is of primary interest. Using these functions, you can
match on either the <em>index</em> or the <em>columns</em> via the <strong>axis</strong> keyword:</p>

In [None]:
df = pd.DataFrame({
    "one": pd.Series(np.random.randn(3), index=["a", "b", "c"]),
    "two": pd.Series(np.random.randn(4), index=["a", "b", "c", "d"]),
    "three": pd.Series(np.random.randn(3), index=["b", "c", "d"]),
})

In [None]:
df

In [None]:
row = df.iloc[1]

In [None]:
column = df["two"]

In [None]:
df.sub(row, axis="columns")

In [None]:
df.sub(row, axis=1)

In [None]:
df.sub(column, axis="index")

In [None]:
df.sub(column, axis=0)
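<p>Because the frames above hold random numbers, the effect of <code>axis</code> is easier to see on fixed values. The sketch below uses a made-up frame <code>df_demo</code>: matching on the columns zeroes out the row that was subtracted, while matching on the index zeroes out the column that was subtracted.</p>

```python
import pandas as pd

df_demo = pd.DataFrame(
    {"one": [1.0, 2.0, 3.0], "two": [4.0, 5.0, 6.0]},
    index=["a", "b", "c"],
)
row_demo = df_demo.iloc[1]     # Series labelled by column names
column_demo = df_demo["two"]   # Series labelled by row labels

# Match the Series labels against the column labels; row "b" becomes all zeros.
by_columns = df_demo.sub(row_demo, axis="columns")

# Match the Series labels against the row labels; column "two" becomes all zeros.
by_index = df_demo.sub(column_demo, axis="index")
```

<p>Matching on the columns is also what the plain operator does, so <code>df_demo - row_demo</code> gives the same result as <code>by_columns</code>.</p>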

<p>Furthermore, you can align a level of a MultiIndexed DataFrame with a Series.</p>

In [None]:
dfmi = df.copy()

In [None]:
dfmi.index = pd.MultiIndex.from_tuples([(1, "a"), (1, "b"), (1, "c"), (2, "a")], names=["first", "second"])

In [None]:
dfmi

In [None]:
dfmi.sub(column, axis=0, level="second")
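<p>A deterministic sketch of the same level-based alignment, using invented names (<code>frame_mi</code>, <code>offsets</code>): each row is matched to the Series entry whose label equals the row's value in the <code>"second"</code> level, regardless of the <code>"first"</code> level.</p>

```python
import pandas as pd

frame_mi = pd.DataFrame(
    {"x": [10.0, 20.0, 30.0, 40.0]},
    index=pd.MultiIndex.from_tuples(
        [(1, "a"), (1, "b"), (2, "a"), (2, "b")], names=["first", "second"]
    ),
)

# One offset per label of the "second" level.
offsets = pd.Series({"a": 1.0, "b": 2.0})

# Rows with second == "a" lose 1.0, rows with second == "b" lose 2.0.
result = frame_mi.sub(offsets, axis=0, level="second")
```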

<p>Series and Index also support the <a class="reference external" href="https://docs.python.org/3/library/functions.html#divmod" title="(in Python v3.10)"><code class="xref py py-func docutils literal notranslate"><span class="pre">divmod()</span></code></a> builtin. This function performs
floor division and the modulo operation in a single call, returning a two-tuple
whose elements have the same type as the left-hand side. For example:</p>

In [None]:
s = pd.Series(np.arange(10))

In [None]:
s

In [None]:
div, rem = divmod(s, 3)

In [None]:
div

In [None]:
rem

In [None]:
idx = pd.Index(np.arange(10))

In [None]:
idx

In [None]:
div, rem = divmod(idx, 3)

In [None]:
div

In [None]:
rem

<p>We can also do elementwise <a class="reference external" href="https://docs.python.org/3/library/functions.html#divmod" title="(in Python v3.10)"><code class="xref py py-func docutils literal notranslate"><span class="pre">divmod()</span></code></a>:</p>

In [None]:
div, rem = divmod(s, [2, 2, 3, 3, 4, 4, 5, 5, 6, 6])

In [None]:
div

In [None]:
rem
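<p>As a quick check of what <code>divmod()</code> returns, the sketch below compares it against the separate <code>//</code> and <code>%</code> operators. The variable names are illustrative.</p>

```python
import numpy as np
import pandas as pd

values = pd.Series(np.arange(10))

# One call yields both the quotient and the remainder.
quotient, remainder = divmod(values, 3)

# The same results, computed with two separate operators.
same_quotient = quotient.equals(values // 3)
same_remainder = remainder.equals(values % 3)

# Quotient and remainder always recombine to the original values.
recombined = quotient * 3 + remainder
```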

### Missing data / operations with fill values

<p>In Series and DataFrame, the arithmetic functions accept
a <em>fill_value</em>, namely a value to substitute when at most one of the values at
a location is missing. For example, when adding two DataFrame objects, you may
wish to treat NaN as 0 unless both DataFrames are missing that value, in which
case the result will be NaN (you can later replace NaN with some other value
using <code class="docutils literal notranslate"><span class="pre">fillna</span></code> if you wish).</p>

In [None]:
df

In [None]:
df2 = df.fillna(1)

In [None]:
df2

In [None]:
df + df2

In [None]:
df.add(df2, fill_value=0)
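<p>The three possible cases (neither value missing, one missing, both missing) can be seen side by side on a tiny hand-built example. The frames <code>left</code> and <code>right</code> are invented for this sketch.</p>

```python
import numpy as np
import pandas as pd

left = pd.DataFrame({"x": [1.0, np.nan, np.nan]})
right = pd.DataFrame({"x": [10.0, 20.0, np.nan]})

# Plain addition: any missing operand makes the result missing.
plain = left + right

# With fill_value=0, a single missing operand is treated as 0;
# a position missing on both sides still yields NaN.
filled = left.add(right, fill_value=0)
```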