
In [None]:
import numpy as np
import pandas as pd
pd.__version__

# Intro to data structures

<p>We’ll start with a quick, non-comprehensive overview of the fundamental data
structures in pandas to get you started. The fundamental behavior of data
types, indexing, and axis labeling / alignment applies across all of the
objects. To get started, import NumPy and load pandas into your namespace:</p>

In [1]:
import numpy as np

In [2]:
import pandas as pd

<p>Here is a basic tenet to keep in mind: <strong>data alignment is intrinsic</strong>. The link
between labels and data will not be broken unless done so explicitly by you.</p>

<p>We’ll give a brief intro to the data structures, then consider all of the broad
categories of functionality and methods in separate sections.</p>

## Series

<p><a class="reference internal" href="../reference/api/pandas.Series.html#pandas.Series" title="pandas.Series"><code class="xref py py-class docutils literal notranslate"><span class="pre">Series</span></code></a> is a one-dimensional labeled array capable of holding any data
type (integers, strings, floating point numbers, Python objects, etc.). The axis
labels are collectively referred to as the <strong>index</strong>. The basic method to create a Series is to call:</p>

In [4]:
# Schematic only: `data` and `index` are placeholders described below.
# s = pd.Series(data, index=index)

<p>Here, <code class="docutils literal notranslate"><span class="pre">data</span></code> can be many different things:</p>

<ul class="simple">
<li><p>a Python dict</p></li>
<li><p>an ndarray</p></li>
<li><p>a scalar value (like 5)</p></li>
</ul>


<p>The passed <strong>index</strong> is a list of axis labels. Construction therefore falls into a few
cases, depending on what <strong>data</strong> is:</p>


<p><strong>From ndarray</strong></p>

<p>If <code class="docutils literal notranslate"><span class="pre">data</span></code> is an ndarray, <strong>index</strong> must be the same length as <strong>data</strong>. If no
index is passed, one will be created having values <code class="docutils literal notranslate"><span class="pre">[0,</span> <span class="pre">...,</span> <span class="pre">len(data)</span> <span class="pre">-</span> <span class="pre">1]</span></code>.</p>

In [5]:
s = pd.Series(np.random.randn(5), index=["a", "b", "c", "d", "e"])

In [6]:
s

a   -1.260531
b   -2.562424
c   -0.891685
d   -1.130719
e   -1.785424
dtype: float64

In [7]:
s.index

Index(['a', 'b', 'c', 'd', 'e'], dtype='object')

In [8]:
pd.Series(np.random.randn(5))

0   -1.080348
1   -0.556580
2   -0.480786
3   -0.081364
4    0.087695
dtype: float64
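
<p>The length rule above can be checked directly. The following is a minimal sketch, not part of the original guide; the names <code>values</code>, <code>ok</code>, and <code>mismatch_error</code> are illustrative. It assumes that pandas reports a length mismatch between the values and the index as a <code>ValueError</code>.</p>

```python
import numpy as np
import pandas as pd

values = np.arange(3)  # three values

# Matching lengths: one label per value.
ok = pd.Series(values, index=["a", "b", "c"])

# No index passed: pandas builds the default 0..len(values)-1 index.
default_index = list(pd.Series(values).index)

# Mismatched lengths: two labels for three values is rejected.
try:
    pd.Series(values, index=["a", "b"])
    mismatch_error = None
except ValueError as err:
    mismatch_error = err

print(default_index)
print(type(mismatch_error).__name__)
```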

<p>pandas supports non-unique index values. If an operation
that does not support duplicate index values is attempted, an exception
is raised at that time. This lazy checking is almost entirely for performance
(many computations, such as parts of GroupBy, never use the index).</p>
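
<p>As a rough illustration of this lazy behavior (a sketch with made-up data, not taken from the original guide), a Series with a repeated label can be built and summed without complaint, while <code>reindex</code>, which needs unique labels, is expected to fail only when it is called:</p>

```python
import pandas as pd

# The label "a" appears twice; construction does not complain.
dup = pd.Series([1, 2, 3], index=["a", "a", "b"])
print(dup.index.is_unique)

# Label lookup returns every matching row, so the result is a Series.
both_a = dup["a"]

# Operations that ignore the index work as usual.
total = dup.sum()

# reindex cannot resolve duplicate labels and raises at call time.
try:
    dup.reindex(["a", "b", "c"])
    reindex_error = None
except ValueError as err:
    reindex_error = err

print(type(reindex_error).__name__)
```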


<p><strong>From dict</strong></p>

<p>Series can be instantiated from dicts:</p>

In [9]:
d = {"b": 1, "a": 0, "c": 2}

In [10]:
pd.Series(d)

b    1
a    0
c    2
dtype: int64

<p>When the data is a dict, and an index is not passed, the <code class="docutils literal notranslate"><span class="pre">Series</span></code> index
will be ordered by the dict’s insertion order, if you’re using Python
version &gt;= 3.6 and pandas version &gt;= 0.23.</p>

<p>If you’re using Python &lt; 3.6 or pandas &lt; 0.23, and an index is not passed,
the <code class="docutils literal notranslate"><span class="pre">Series</span></code> index will instead be the lexically ordered list of dict keys.
In the example above, that would give <code class="docutils literal notranslate"><span class="pre">['a',</span> <span class="pre">'b',</span> <span class="pre">'c']</span></code> rather than <code class="docutils literal notranslate"><span class="pre">['b',</span> <span class="pre">'a',</span> <span class="pre">'c']</span></code>.</p>

<p>If an index is passed, the values in data corresponding to the labels in the
index will be pulled out.</p>


In [11]:
d = {"a": 0.0, "b": 1.0, "c": 2.0}

In [12]:
pd.Series(d)

a    0.0
b    1.0
c    2.0
dtype: float64

In [13]:
pd.Series(d, index=["b", "c", "d", "a"])

b    1.0
c    2.0
d    NaN
a    0.0
dtype: float64

<p>NaN (not a number) is the standard missing data marker used in pandas.</p>
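
<p>One consequence worth seeing once: because NaN is a floating point value, integer data that gains a missing entry is stored as <code>float64</code> by default. This is a small sketch with hypothetical data (<code>counts</code>, <code>partial</code>), not an example from the original guide.</p>

```python
import pandas as pd

counts = {"a": 1, "b": 2}  # integer values only

# Without an extra label the integer dtype is kept.
complete = pd.Series(counts)

# "c" has no entry in the dict, so it is filled with NaN
# and the values are stored as floats.
partial = pd.Series(counts, index=["a", "b", "c"])

print(complete.dtype, partial.dtype)

# isna() marks the missing entries.
missing_mask = partial.isna()
print(missing_mask.tolist())
```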

<p><strong>From scalar value</strong></p>

<p>If <code class="docutils literal notranslate"><span class="pre">data</span></code> is a scalar value, an index must be
provided. The value will be repeated to match the length of <strong>index</strong>.</p>

In [14]:
pd.Series(5.0, index=["a", "b", "c", "d", "e"])

a    5.0
b    5.0
c    5.0
d    5.0
e    5.0
dtype: float64

### Series is ndarray-like

<p><code class="docutils literal notranslate"><span class="pre">Series</span></code> acts very similarly to an <code class="docutils literal notranslate"><span class="pre">ndarray</span></code>, and is a valid argument to most NumPy functions.
However, operations such as slicing will also slice the index.</p>

In [15]:
s.iloc[0]  # positional lookup; plain s[0] on a labeled index is deprecated in newer pandas

-1.2605308851908246

In [16]:
s[:3]

a   -1.260531
b   -2.562424
c   -0.891685
dtype: float64

In [17]:
s[s > s.median()]

c   -0.891685
d   -1.130719
dtype: float64

In [18]:
s.iloc[[4, 3, 1]]  # positional lookup with a list of positions

e   -1.785424
d   -1.130719
b   -2.562424
dtype: float64

In [19]:
np.exp(s)

a    0.283503
b    0.077118
c    0.409964
d    0.322801
e    0.167726
dtype: float64

<div class="admonition note">
<p class="admonition-title">Note</p>
<p>We will address array-based indexing, such as the lookup with the position list <code class="docutils literal notranslate"><span class="pre">[4,</span> <span class="pre">3,</span> <span class="pre">1]</span></code> above,
in the <a class="reference internal" href="indexing.html#indexing"><span class="std std-ref">section on indexing</span></a>.</p>
</div>

<p>Like a NumPy array, a pandas Series has a <a class="reference internal" href="../reference/api/pandas.Series.dtype.html#pandas.Series.dtype" title="pandas.Series.dtype"><code class="xref py py-attr docutils literal notranslate"><span class="pre">dtype</span></code></a>.</p>

In [20]:
s.dtype

dtype('float64')

<p>This is often a NumPy dtype. However, pandas and third-party libraries
extend NumPy’s type system in a few places, in which case the dtype would
be an <a class="reference internal" href="../reference/api/pandas.api.extensions.ExtensionDtype.html#pandas.api.extensions.ExtensionDtype" title="pandas.api.extensions.ExtensionDtype"><code class="xref py py-class docutils literal notranslate"><span class="pre">ExtensionDtype</span></code></a>. Some examples within
pandas are <a class="reference internal" href="categorical.html#categorical"><span class="std std-ref">Categorical data</span></a> and <a class="reference internal" href="integer_na.html#integer-na"><span class="std std-ref">Nullable integer data type</span></a>. See <a class="reference internal" href="basics.html#basics-dtypes"><span class="std std-ref">dtypes</span></a>
for more.</p>
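
<p>To make the distinction concrete, the sketch below (illustrative data, not from the original guide) builds one Series with a plain NumPy dtype and two with pandas extension dtypes, then checks each dtype against <code>pandas.api.extensions.ExtensionDtype</code>.</p>

```python
import pandas as pd
from pandas.api.extensions import ExtensionDtype

floats = pd.Series([1.0, 2.0])                              # NumPy float64
nullable = pd.Series([1, 2, None], dtype="Int64")           # nullable integer
categories = pd.Series(["x", "y", "x"], dtype="category")   # categorical

# Print each dtype and whether it is a pandas extension dtype.
for ser in (floats, nullable, categories):
    print(ser.dtype, isinstance(ser.dtype, ExtensionDtype))
```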

<p>If you need the actual array backing a <code class="docutils literal notranslate"><span class="pre">Series</span></code>, use <a class="reference internal" href="../reference/api/pandas.Series.array.html#pandas.Series.array" title="pandas.Series.array"><code class="xref py py-attr docutils literal notranslate"><span class="pre">Series.array</span></code></a>.</p>

In [21]:
s.array

<PandasArray>
[-1.2605308851908246,  -2.562424372303493, -0.8916853999803319,
 -1.1307187283718223, -1.7854235998804042]
Length: 5, dtype: float64




<p>Accessing the array can be useful when you need to do some operation without the
index (to disable <a class="reference internal" href="#dsintro-alignment"><span class="std std-ref">automatic alignment</span></a>, for example).</p>

<p><a class="reference internal" href="../reference/api/pandas.Series.array.html#pandas.Series.array" title="pandas.Series.array"><code class="xref py py-attr docutils literal notranslate"><span class="pre">Series.array</span></code></a> will always be an <a class="reference internal" href="../reference/api/pandas.api.extensions.ExtensionArray.html#pandas.api.extensions.ExtensionArray" title="pandas.api.extensions.ExtensionArray"><code class="xref py py-class docutils literal notranslate"><span class="pre">ExtensionArray</span></code></a>.
Briefly, an ExtensionArray is a thin wrapper around one or more <em>concrete</em> arrays like a
<a class="reference external" href="https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray" title="(in NumPy v1.23)"><code class="xref py py-class docutils literal notranslate"><span class="pre">numpy.ndarray</span></code></a>. pandas knows how to take an <code class="docutils literal notranslate"><span class="pre">ExtensionArray</span></code> and
store it in a <code class="docutils literal notranslate"><span class="pre">Series</span></code> or a column of a <code class="docutils literal notranslate"><span class="pre">DataFrame</span></code>.
See <a class="reference internal" href="basics.html#basics-dtypes"><span class="std std-ref">dtypes</span></a> for more.</p>



<p>While Series is ndarray-like, if you need an <em>actual</em> ndarray, then use
<a class="reference internal" href="../reference/api/pandas.Series.to_numpy.html#pandas.Series.to_numpy" title="pandas.Series.to_numpy"><code class="xref py py-meth docutils literal notranslate"><span class="pre">Series.to_numpy()</span></code></a>.</p>

In [22]:
s.to_numpy()

array([-1.26053089, -2.56242437, -0.8916854 , -1.13071873, -1.7854236 ])

<p>Even if the Series is backed by an <a class="reference internal" href="../reference/api/pandas.api.extensions.ExtensionArray.html#pandas.api.extensions.ExtensionArray" title="pandas.api.extensions.ExtensionArray"><code class="xref py py-class docutils literal notranslate"><span class="pre">ExtensionArray</span></code></a>,
<a class="reference internal" href="../reference/api/pandas.Series.to_numpy.html#pandas.Series.to_numpy" title="pandas.Series.to_numpy"><code class="xref py py-meth docutils literal notranslate"><span class="pre">Series.to_numpy()</span></code></a> will return a NumPy ndarray.</p>
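
<p>Dropping the index, mentioned earlier as a way to disable automatic alignment, might look like the following sketch. The two Series (<code>left</code>, <code>right</code>) are hypothetical and share the same labels in a different order, so adding them by label and adding them by position give different results.</p>

```python
import pandas as pd

left = pd.Series([1, 2, 3], index=["a", "b", "c"])
right = pd.Series([10, 20, 30], index=["c", "b", "a"])

# Series + Series: values are matched by label.
aligned = left + right

# ndarray + ndarray: the index is gone, values are matched by position.
positional = left.to_numpy() + right.to_numpy()

print(aligned)
print(positional)
```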

### Series is dict-like

<p>A Series is like a fixed-size dict in that you can get and set values by index
label:</p>

In [23]:
s["a"]

-1.2605308851908246

In [24]:
s["e"] = 12.0

In [25]:
s

a    -1.260531
b    -2.562424
c    -0.891685
d    -1.130719
e    12.000000
dtype: float64

In [26]:
"e" in s

True

In [27]:
"f" in s

False

<p>If a label is not contained in the index, a KeyError is raised:</p>

In [29]:
try:
    s["f"]
except KeyError as err:  # a missing label raises KeyError
    print(err)

'f'


<p>With the <code class="docutils literal notranslate"><span class="pre">get</span></code> method, a missing label returns None or a specified default:</p>

In [33]:
s.get("f")

In [34]:
s.get("f", np.nan)

nan

<p>See also the <a class="reference internal" href="indexing.html#indexing-attribute-access"><span class="std std-ref">section on attribute access</span></a>.</p>

### Vectorized operations and label alignment with Series

<p>When working with raw NumPy arrays, looping through value-by-value is usually
not necessary. The same is true when working with Series in pandas.
Series can also be passed into most NumPy methods expecting an ndarray.</p>

In [35]:
s + s

a    -2.521062
b    -5.124849
c    -1.783371
d    -2.261437
e    24.000000
dtype: float64

In [36]:
s * 2

a    -2.521062
b    -5.124849
c    -1.783371
d    -2.261437
e    24.000000
dtype: float64

In [37]:
np.exp(s)

a         0.283503
b         0.077118
c         0.409964
d         0.322801
e    162754.791419
dtype: float64

<p>A key difference between Series and ndarray is that operations between Series
automatically align the data based on label. Thus, you can write computations
without giving consideration to whether the Series involved have the same
labels.</p>

In [41]:
s[1:] + s[:-1]

a         NaN
b   -5.124849
c   -1.783371
d   -2.261437
e         NaN
dtype: float64

<p>The result of an operation between unaligned Series will have the <strong>union</strong> of
the indexes involved. If a label is not found in one Series or the other, the
result will be marked as missing (<code class="docutils literal notranslate"><span class="pre">NaN</span></code>). Being able to write code without doing
any explicit data alignment grants immense freedom and flexibility in
interactive data analysis and research. The integrated data alignment features
of the pandas data structures set pandas apart from the majority of related
tools for working with labeled data.</p>

<div class="admonition note">
<p class="admonition-title">Note</p>
<p>In general, we chose to make the default result of operations between
differently indexed objects yield the <strong>union</strong> of the indexes in order to
avoid loss of information. Having an index label, though the data is
missing, is typically important information as part of a computation. You
of course have the option of dropping labels with missing data via the
<strong>dropna</strong> method.</p>
</div>
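
<p>The union behavior and the <strong>dropna</strong> option described above can be sketched with two small Series whose labels only partly overlap. The data is made up for illustration; <code>Series.add</code> with <code>fill_value</code> is shown as one alternative when a missing side should count as zero.</p>

```python
import pandas as pd

s1 = pd.Series([1, 2, 3], index=["a", "b", "c"])
s2 = pd.Series([10, 20, 30], index=["b", "c", "d"])

# The result carries the union of both indexes; labels present on
# only one side ("a" and "d") become NaN.
total = s1 + s2
print(total)

# Drop the labels whose result is missing.
overlap_only = total.dropna()
print(overlap_only)

# Alternatively, treat a missing side as 0 instead of producing NaN.
filled = s1.add(s2, fill_value=0)
print(filled)
```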

### Name attribute

<p id="dsintro-name-attribute">Series can also have a <code class="docutils literal notranslate"><span class="pre">name</span></code> attribute:</p>

In [42]:
s = pd.Series(np.random.randn(5), name="something")

In [43]:
s

0   -1.372559
1   -0.911411
2    0.265166
3    0.191703
4   -0.392966
Name: something, dtype: float64

In [44]:
s.name

'something'

<p>The Series <code class="docutils literal notranslate"><span class="pre">name</span></code> is assigned automatically in many cases, in particular
when taking 1D slices of a DataFrame, as you will see below.</p>
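
<p>As a brief preview of that automatic naming (a sketch with a hypothetical two-column <code>frame</code>), selecting a single column from a DataFrame yields a Series whose <code>name</code> is the column label:</p>

```python
import pandas as pd

frame = pd.DataFrame({"height": [1.5, 1.8], "weight": [60, 75]})

# Selecting one column returns a Series named after that column.
column = frame["height"]
print(column.name)
```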

<p>You can rename a Series with the <a class="reference internal" href="../reference/api/pandas.Series.rename.html#pandas.Series.rename" title="pandas.Series.rename"><code class="xref py py-meth docutils literal notranslate"><span class="pre">pandas.Series.rename()</span></code></a> method.</p>

In [45]:
s2 = s.rename("different")

In [46]:
s2.name

'different'

<p>Note that <code class="docutils literal notranslate"><span class="pre">s</span></code> and <code class="docutils literal notranslate"><span class="pre">s2</span></code> refer to different objects.</p>
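
<p>A short check of that last point, using illustrative names (<code>original</code>, <code>renamed</code>): <code>rename</code> returns a new Series and leaves the name of the one it was called on unchanged.</p>

```python
import pandas as pd

original = pd.Series([1, 2, 3], name="something")
renamed = original.rename("different")

# rename() produced a separate object; the original keeps its name.
print(renamed is original)
print(original.name, renamed.name)
```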

## DataFrame