## CS102-4 - Further Computing

Prof. Götz Pfeiffer<br>
School of Mathematics, Statistics and Applied Mathematics<br>
NUI Galway

### 2. Aspects of Data Wrangling

# Week 6: `Pandas` Objects and Operations

* The `Pandas` package provides data structures for indexed collections of data, a kind of **database table**.
* `Pandas` objects can be thought of as enhanced versions of `NumPy` arrays in which the rows and columns are identified with labels rather than simple integer indices.
* `Pandas` provides a host of useful tools, methods, and functionality on top of the basic data structures.
* The three fundamental `Pandas` data structures are: ``Series``, ``DataFrame``, and ``Index``.

In [None]:
import numpy as np
import pandas as pd

## The `Series` Object

* A ``Series`` is a **one-dimensional** array of **indexed data**.

* It can be created from a list or array as follows:

In [None]:
data = pd.Series([0.25, 0.5, 0.75, 3.0])
data

* The ``Series`` wraps both a sequence of values and a sequence of indices.
* These sequences can be accessed with the ``values`` and ``index`` attributes.
* The ``values`` are simply a familiar `NumPy` array:

In [None]:
data.values

* The ``index`` is an array-like object of type `pd.Index`, see below.

In [None]:
data.index

### ``Series`` as generalized `NumPy` array

* The  ``Series`` is much more general and flexible than the one-dimensional `NumPy` array that it emulates.

* While the `NumPy` array has an **implicitly defined** integer index used to access the values, the ``Series`` also has an **explicitly defined** index associated with the values.

* The index need not be an integer, but can consist of values of any desired type.

In [None]:
data = pd.Series([0.25, 0.5, 0.75, 1.0],
                 index=['a', 'b', 'c', 'd'])
data

In [None]:
# indexing by explicit index
data['b']

In [None]:
# indexing by implicit (positional) index; plain data[1] is deprecated
# for a non-integer index in recent pandas versions
data.iloc[1]

* A ``Series`` provides array-style item selection via the same basic mechanisms as `NumPy` arrays – that is, **slices**, **masking**, and **fancy indexing**.

In [None]:
# slicing by explicit index
data['a':'c']

In [None]:
# slicing by implicit integer index
data[0:2]

In [None]:
# masking
data[(data > 0.3) & (data < 0.8)]

In [None]:
# fancy indexing
data[['a', 'c']]

* Note that when slicing with an explicit index (i.e., ``data['a':'c']``), the final index is **included** in the slice, while when slicing with an implicit index (i.e., ``data[0:2]``), the final index is **excluded** from the slice.
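
The two slicing conventions can be checked directly. The following is a minimal sketch that rebuilds the `data` series from above and uses the explicit `loc` and `iloc` indexers (covered in the section on indexers) so that it does not depend on how plain `[]` resolves integers; the names `by_label` and `by_position` are only illustrative.

```python
import pandas as pd

data = pd.Series([0.25, 0.5, 0.75, 1.0], index=['a', 'b', 'c', 'd'])

# label-based slice: the end label 'c' is included
by_label = data.loc['a':'c']
# position-based slice: the end position 2 is excluded
by_position = data.iloc[0:2]

print(list(by_label.index))      # ['a', 'b', 'c']
print(list(by_position.index))   # ['a', 'b']
```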

### `Series` as specialized dictionary

* A ``Series`` behaves a bit like a `Python` dictionary:

* A dictionary is a structure that maps arbitrary **keys** to a set of arbitrary **values**. 

* A ``Series`` is a structure which maps **typed keys** to a set of **typed values**.

* The type information of a ``Series`` makes it much more efficient than `Python` dictionaries for certain operations.

* A ``Series`` object can be constructed directly from a `Python` dictionary:

In [None]:
population_dict = {'California': 38332521,
                   'Texas': 26448193,
                   'New York': 19651127,
                   'Florida': 19552860,
                   'Illinois': 12882135}
population = pd.Series(population_dict)
population

* From here, dictionary-style item access can be performed:

In [None]:
population['California']

In [None]:
population.index

* We can also use dictionary-like `Python` expressions and methods to examine the keys/indices and values:

In [None]:
'Texas' in population

In [None]:
population.keys()

In [None]:
list(population.items())
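
The remark about typed values can be illustrated with a short sketch. It uses two of the entries from `population_dict` above; the name `in_millions` is only an illustrative choice.

```python
import pandas as pd

population = pd.Series({'California': 38332521, 'Texas': 26448193})

# unlike a plain dict, all values share a single NumPy dtype ...
print(population.dtype)           # int64

# ... so arithmetic can act on all values at once, keeping the keys
in_millions = population / 1e6
print(in_millions['Texas'])
```

With a plain dictionary, the same computation would need an explicit loop or a comprehension over the items.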

### Indexers: `loc` and  `iloc`


In [None]:
data = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])
data

* `Pandas` provides some special **indexer** attributes that expose the 
   different indexing schemes.

* The ``loc`` attribute allows indexing and slicing that always references the **explicit** index:

In [None]:
data.loc[1]

In [None]:
data.loc[1:3]

* The ``iloc`` attribute allows indexing and slicing that always references the implicit Python-style index:

In [None]:
data.iloc[1]

In [None]:
data.iloc[1:3]

In [None]:
import this

* **Explicit is better than implicit**: ``loc`` and ``iloc`` are very useful in maintaining clean and readable code; especially in the case of integer indexes.
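
One way to see why this matters for integer indexes is to ask for the key `2`, which is a valid position but not a label of the `data` series above. This is a small sketch; the names `third` and `found` are illustrative.

```python
import pandas as pd

data = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])

# iloc counts positions: position 2 holds the third value
third = data.iloc[2]
print(third)                      # c

# loc looks up labels: there is no label 2, so this raises KeyError
try:
    data.loc[2]
    found = True
except KeyError:
    found = False
print(found)                      # False
```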

## The `DataFrame` Object

* The `DataFrame` is the primary `pandas` data structure.

* A `DataFrame` contains two-dimensional **tabular data**, with both flexible **row indices** and flexible **column names**.

* Like a `Series`, a ``DataFrame`` also can be thought of either as a generalization of a `NumPy` array, or as a specialization of a `Python` dictionary.

* **Column**-wise,  you can think of a ``DataFrame`` as a sequence or a dictionary of aligned ``Series`` objects.

In [None]:
area = pd.Series({'California': 423967, 'Texas': 695662,
                  'New York': 141297, 'Florida': 170312,
                  'Illinois': 149995})
pop = pd.Series({'California': 38332521, 'Texas': 26448193,
                 'New York': 19651127, 'Florida': 19552860,
                 'Illinois': 12882135})

* We can construct a `DataFrame` from a dictionary of `Series`:

In [None]:
data = pd.DataFrame({'Area': area, 'Population': pop})
data

* The `index` attribute of a ``DataFrame`` gives access to the common index labels of the contained `Series`:

In [None]:
data.index

* Additionally, the ``DataFrame`` has a ``columns`` attribute, which is an ``Index`` object holding the column labels:

In [None]:
data.columns

### `DataFrame` as specialized dictionary

* We can also think of a ``DataFrame`` as a specialization of a dictionary.

* Where a dictionary maps a key to a value, a ``DataFrame`` maps a column name to a ``Series`` of column data.

* Note that in a two-dimensional `NumPy` array, ``data[i]`` will return a **row**. For a ``DataFrame``, ``data['name']`` will return a **column**.

* The individual ``Series`` that make up the columns of the ``DataFrame`` can be accessed via **dictionary-style indexing** of the column name:

In [None]:
data['Area']

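* Equivalently, we can use **attribute-style access** with column names that are strings, provided the name is a valid `Python` identifier and does not clash with an existing ``DataFrame`` attribute or method: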

In [None]:
data.Area

* The dictionary-style syntax can also be used to modify the object, in this case adding a new column:

In [None]:
data['Density'] = data['Population'] / data['Area']
data
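
The dictionary view also explains how membership tests and `keys()` behave on a ``DataFrame``: the keys are the column names, not the row labels. The following sketch builds a small two-row version of the table above (only California and Texas) to show this.

```python
import pandas as pd

data = pd.DataFrame({'Area': {'California': 423967, 'Texas': 695662},
                     'Population': {'California': 38332521,
                                    'Texas': 26448193}})

print('Area' in data)         # True: membership tests the column names
print('Texas' in data)        # False: row labels are not keys
print(list(data.keys()))      # ['Area', 'Population']
```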

### `DataFrame` as two-dimensional array

* We can also view the ``DataFrame`` as an enhanced two-dimensional array.

* The raw underlying data array can be accessed via the ``values`` attribute:

In [None]:
data.values

* Many familiar array-like operations can be done on the ``DataFrame`` itself.

* For example, we can transpose the full ``DataFrame`` to swap rows and columns:

In [None]:
data.T

* For **array-style indexing**, `Pandas` again uses the ``loc`` and ``iloc`` indexers mentioned earlier.

* Using the ``iloc`` indexer, we can index the underlying array as if it is a simple `NumPy` array.

* The ``DataFrame`` index and column labels are maintained in the result:

In [None]:
data.iloc[:3, :2]

* Similarly, using the ``loc`` indexer we can index the underlying data in an array-like style but using the explicit index and column names:

In [None]:
data.loc[:'Illinois', :'Population']

* Any of the familiar NumPy-style data access patterns can be used within these indexers.
* For example, in the ``loc`` indexer we can combine masking and fancy indexing as in the following:

In [None]:
data.loc[data.Density > 100, ['Population', 'Density']]

* Any of these indexing conventions may also be used to set or modify values, in the standard way that you might be accustomed to from working with `NumPy`:

In [None]:
data.iloc[0, 2] = 90
data

### Constructing DataFrame objects

* A  ``DataFrame`` can be constructed in a variety of ways.

* From a list of dictionaries

* If keys in one of the dictionaries are missing, `Pandas` will fill them in with ``NaN`` (i.e., "not a number") values:

In [None]:
pd.DataFrame([{'a': 1, 'b': 2}, {'b': 3, 'c': 4}])

* From a dictionary of ``Series``, as seen before:

In [None]:
pd.DataFrame({'Population': population,
              'Area': area})

* From a two-dimensional array of data, with any specified column and index names.

* If omitted, an integer index will be used for each:

In [None]:
pd.DataFrame(np.random.rand(3, 2),
             columns=['foo', 'bar'],
             index=['a', 'b', 'c'])
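
For comparison, here is a minimal sketch of the case where both `columns` and `index` are omitted; an array of zeros is used instead of random numbers so that the result is reproducible.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.zeros((3, 2)))

# both axes fall back to a default integer index starting at 0
print(df.index.tolist())      # [0, 1, 2]
print(df.columns.tolist())    # [0, 1]
```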

## The `Index` Object

* Both the ``Series`` and ``DataFrame`` objects contain an explicit **index**.

* The ``Index`` object can be thought of either as an **immutable array** or as an **ordered multi-set**. 

In [None]:
ind = pd.Index([2, 3, 5, 7, 11])
ind

### `Index` as immutable array

* We can use standard Python indexing notation to retrieve values or slices:

In [None]:
ind[1]

In [None]:
ind[::2]

* ``Index`` objects also have many of the attributes familiar from `NumPy` arrays:

In [None]:
print(ind.size, ind.shape, ind.ndim, ind.dtype)

* One difference between ``Index`` objects and `NumPy` arrays is that indices are **immutable**, that is, they cannot be modified via the normal means:
```
ind[1] = 0
```
would result in an error message.

* This immutability makes it safer to share indices between multiple ``DataFrame``s and arrays, without the potential for side effects from inadvertent index modification.
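
The error can be observed safely by catching it. This is a small sketch using the `ind` object from above; the flag `raised` is only there to record what happened.

```python
import pandas as pd

ind = pd.Index([2, 3, 5, 7, 11])

try:
    ind[1] = 0                # item assignment is not supported
    raised = False
except TypeError as err:
    raised = True
    print("TypeError:", err)

# the index is unchanged
print(ind.tolist())           # [2, 3, 5, 7, 11]
```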

### `Index` as ordered set

* `Pandas` objects are designed to facilitate operations such as joins across datasets, which depend on many aspects of set arithmetic.

* The ``Index`` object supports the set operations familiar from Python's built-in ``set`` data structure, so that unions, intersections, differences, and other combinations can be computed with correspondingly named methods:

In [None]:
indA = pd.Index([1, 3, 5, 7, 9])
indB = pd.Index([2, 3, 5, 7, 11])

In [None]:
indA.intersection(indB)  # intersection

In [None]:
indA.union(indB)  # union

In [None]:
indA.symmetric_difference(indB)  # symmetric difference

* Older `Pandas` versions also accepted the operators ``&``, ``|`` and ``^`` for these set operations; since `Pandas` 2.0 the operators act element-wise instead, so the methods are the reliable choice.

## Operating on Data in Pandas

* One of the essential pieces of `NumPy` is the ability to perform quick element-wise operations.

* `Pandas` inherits much of this functionality from `NumPy`, and universal functions (ufuncs) are key to this.

* `Pandas` includes a couple of useful twists, however:

* For **unary operations** like negation and trigonometric functions, these ufuncs will *preserve index and column labels* in the output.

* For **binary operations** such as addition and multiplication, `Pandas` will automatically *align indices* when passing the objects to the ufunc.

* This supports keeping the context of data and combining data from different sources.

* We will additionally see that there are well-defined operations between one-dimensional ``Series`` structures and two-dimensional ``DataFrame`` structures.
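
As a preview of operations between a ``Series`` and a ``DataFrame``, the following sketch subtracts a row and then a column from a small hand-made table. The data and the names `row_wise` and `col_wise` are illustrative assumptions, not part of the examples used elsewhere in these notes.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [10, 20, 30]})

# default: the Series index is aligned with the column labels,
# so the first row is subtracted from every row
row_wise = df - df.iloc[0]

# axis=0: the Series index is aligned with the row labels,
# so column 'A' is subtracted from every column
col_wise = df.sub(df['A'], axis=0)

print(row_wise)
print(col_wise)
```

The `axis` argument is the reason to prefer the method form here: the plain operator always aligns the ``Series`` with the columns.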

## Index Preservation

* Any `NumPy` UFunc will work on `Pandas` objects.

In [None]:
rng = np.random.RandomState(42)
ser = pd.Series(rng.randint(0, 10, 4))
ser

In [None]:
df = pd.DataFrame(rng.randint(0, 10, (3, 4)),
                  columns=['A', 'B', 'C', 'D'])
df

* If we apply a UFunc on either of these objects, the result will be another `Pandas` object *with the indices preserved:*

In [None]:
np.exp(ser)

In [None]:
np.sin(df * np.pi / 4)

## Index Alignment

* For binary operations on two ``Series`` or ``DataFrame`` objects, `Pandas` will align indices in the process of performing the operation.

### Index alignment in Series

* As an example, suppose we are combining two different data sources, and find only the top three US states by *area* and the top three US states by *population*:

In [None]:
area = pd.Series({'Alaska': 1723337, 'Texas': 695662,
                  'California': 423967}, name='area')
population = pd.Series({'California': 38332521, 'Texas': 26448193,
                        'New York': 19651127}, name='population')

* Let's see what happens when we divide these to compute the population density:

In [None]:
population / area

* The resulting ``Series`` contains the *union* of the indices of the two input ``Series``, which can be computed with the ``union`` method of these indices:

In [None]:
area.index.union(population.index)

* Any item for which one or the other does not have an entry is marked with ``NaN``, or "Not a Number," which is how `Pandas` marks missing data.

In [None]:
A = pd.Series([2, 4, 6], index=[0, 1, 2])
B = pd.Series([1, 3, 5], index=[1, 2, 3])
A + B

* If using NaN values is not the desired behavior, the fill value can be modified using appropriate object methods in place of the operators.
* For example, calling ``A.add(B)`` is equivalent to calling ``A + B``, but allows optional explicit specification of the fill value for any elements in ``A`` or ``B`` that might be missing:

In [None]:
A.add(B, fill_value=0)
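
To make the effect of `fill_value` concrete, this sketch compares the plain operator with the method call on the same `A` and `B` as above; the names `plain` and `filled` are illustrative.

```python
import pandas as pd

A = pd.Series([2, 4, 6], index=[0, 1, 2])
B = pd.Series([1, 3, 5], index=[1, 2, 3])

plain = A + B                      # NaN at labels 0 and 3
filled = A.add(B, fill_value=0)    # a missing entry is treated as 0

print(plain.isna().tolist())       # [True, False, False, True]
print(filled.tolist())             # [2.0, 5.0, 9.0, 5.0]
```

Note that the result is of floating-point type in both cases, since the alignment step has to allow for missing values.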

### Index alignment in DataFrame

A similar type of alignment takes place for *both* columns and indices when performing operations on ``DataFrame``s:

In [None]:
A = pd.DataFrame(rng.randint(0, 20, (2, 2)),
                 columns=list('AB'))
A

In [None]:
B = pd.DataFrame(rng.randint(0, 10, (3, 3)),
                 columns=list('BAC'))
B

In [None]:
A + B

* Notice that indices are aligned correctly irrespective of their order in the two objects, and indices in the result are sorted.
* As was the case with ``Series``, we can use the associated object's arithmetic method and pass any desired ``fill_value`` to be used in place of missing entries.
Here we'll fill with the mean of all values in ``A`` (computed by first stacking the rows of ``A``):

In [None]:
fill = A.stack().mean()
A.add(B, fill_value=fill)
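
Because the cells above use random numbers, the alignment is easier to verify on fixed data. The following sketch uses small hand-made frames (the values are arbitrary assumptions chosen for readability) with the same column layout as above.

```python
import pandas as pd

A = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
B = pd.DataFrame({'B': [10, 20, 30],
                  'A': [100, 200, 300],
                  'C': [7, 8, 9]})

total = A + B

# labels are matched by name, and the result holds the sorted union
print(total.columns.tolist())      # ['A', 'B', 'C']
print(total.index.tolist())        # [0, 1, 2]
print(total.loc[0, 'A'])           # 101.0

# row 2 and column 'C' do not exist in A, so those sums are missing
missing = int(total.isna().sum().sum())
print(missing)                     # 5

# with a fill value, every position has at least one real operand
filled = A.add(B, fill_value=0)
print(int(filled.isna().sum().sum()))   # 0
```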

The following table lists Python operators and their equivalent Pandas object methods:

| Python Operator | Pandas Method(s)                      |
|-----------------|---------------------------------------|
| ``+``           | ``add()``                             |
| ``-``           | ``sub()``, ``subtract()``             |
| ``*``           | ``mul()``, ``multiply()``             |
| ``/``           | ``truediv()``, ``div()``, ``divide()``|
| ``//``          | ``floordiv()``                        |
| ``%``           | ``mod()``                             |
| ``**``          | ``pow()``                             |
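
The correspondence in the table can be spot-checked with a short sketch. The two series are illustrative, and `equals` is used to compare whole results at once.

```python
import pandas as pd

A = pd.Series([2, 4, 6])
B = pd.Series([1, 3, 5])

same_sub = (A - B).equals(A.sub(B))
same_floordiv = (A // B).equals(A.floordiv(B))
same_pow = (A ** B).equals(A.pow(B))

print(same_sub, same_floordiv, same_pow)   # True True True
```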


## References

### `pandas`

* `Series`: [[doc]](https://pandas.pydata.org/pandas-docs/stable/reference/series.html)


* `DataFrame`: [[doc]](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html)


* `Index`: [[doc]](https://pandas.pydata.org/pandas-docs/stable/reference/indexing.html)

## Exercises