## CS102 - Further Computing

Mark Howard<br>
School of Mathematical & Statistical Sciences<br>
NUI Galway<br>
mark.howard@nuigalway.ie

### 2. Aspects of Data Wrangling

# Week 6: `Pandas` Objects and Operations

* **Structured arrays** are special forms of `NumPy` arrays. They store compound and heterogeneous data, unlike normal `NumPy` arrays that store homogeneous data.<br> 
* A structured array is described by a compound data type, which you can define, for example, with the following command:<br> `np.dtype({'names':('person_names', 'person_ages', 'is_python_programmer'), 'formats': ('U9', 'i8', 'bool')})`<br> 
* An array created with this `dtype` has three named fields (columns), each with its own data type as given in the `formats` tuple.
* So, heterogeneous arrays in `NumPy` _are possible, but ill-advised_ <br>
* Use `Pandas` (built on `NumPy`) instead
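
As a minimal sketch of the structured-array idea: the field names and formats are those of the command above, while the sample people and the variable names `person_dtype`, `people` and `older` are made up for illustration.

```python
import numpy as np

# Compound dtype: a 9-character Unicode string, a 64-bit integer and a Boolean.
person_dtype = np.dtype({'names': ('person_names', 'person_ages', 'is_python_programmer'),
                         'formats': ('U9', 'i8', 'bool')})

# Allocate a structured array with three records (illustrative data only).
people = np.zeros(3, dtype=person_dtype)
people['person_names'] = ['Alice', 'Bob', 'Carol']
people['person_ages'] = [25, 45, 37]
people['is_python_programmer'] = [True, False, True]

# Fields are accessed by name; Boolean masks work as for ordinary arrays.
older = people[people['person_ages'] > 30]['person_names']
print(older)
```

This works, but every field has to be declared up front and there are no row labels, which is part of why `Pandas` is usually the more comfortable choice.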

* The `Pandas` package provides data structures for **labelled collections** of data, much like **database tables**.
* `Pandas` objects can be thought of as enhanced versions of `numpy` arrays in which the rows and columns are identified with **labels** rather than simple **integer indices**.
* The three fundamental `Pandas` data structures are: `Series`, `DataFrame`, and `Index`.
* `Pandas` provides a host of useful tools, methods, and functionality on top of the basic data structures.

In [None]:
import numpy as np
import pandas as pd

## The `Series` Object

* A ``Series`` is a **one-dimensional** array of **indexed data**.

* It can be created from a list or array as follows:

In [None]:
data = pd.Series([0.25, 0.5, 0.75, 3.0])
data

* The ``Series`` wraps both a sequence of values and a sequence of indices.
* These sequences can be accessed with the ``values`` and ``index`` **attributes**.
* The ``values`` are simply a familiar one-dimensional `numpy` array:

In [None]:
data.values

* The `index` is an array-like object of type `pd.Index`.  We'll see more on index objects later.

In [None]:
data.index

### `Series` as generalized `numpy` array

* A `Series` object is much more general and flexible than the one-dimensional `numpy` array that it emulates.

* While the `numpy` array has an **implicitly defined** integer index used to access the values, the `Series` also has an **explicitly defined** index associated with the values.

* The index need not be an integer, but can consist of values of **any desired type**.

In [None]:
data = pd.Series([0.25, 0.5, 0.75, 1.0],
                 index=['a', 'b', 'c', 'd'])
data

In [None]:
data.index

In [None]:
# indexing by explicit index
data['b']

In [None]:
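# indexing by implicit (positional) index
# (recent pandas versions deprecate this fallback; data.iloc[1], introduced below, is the safe form)
data[1]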

In [None]:
data = pd.Series([0.25, 0.5, 0.75, 1.0],
                 index=['a', 'b', 77, 'd'])
data

In [None]:
data[77]

In [None]:
# potential for errors
data[1]

* A `Series` provides array-style item selection via the same basic mechanisms as `numpy` arrays – that is, **slices**, **masking**, and **fancy indexing**.

In [None]:
data = pd.Series([0.25, 0.5, 0.75, 1.0],
                 index=['a', 'b', 'c', 'd'])
# slicing by explicit index
data['a':'c']

In [None]:
# slicing by implicit integer index
data[0:2]

<div class="alert alert-danger">

* Note that when slicing with an explicit index (i.e., ``data['a':'c']``), the final index is **included** in the slice, while when slicing with an implicit index (i.e., ``data[0:2]``), the final index is **excluded** from the slice.
    
</div>

In [None]:
# masking
data[(data > 0.3) & (data < 0.8)]

In [None]:
# fancy indexing
data[['a', 'c']]
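
The difference between the two slicing conventions can be checked directly. This is a small sketch that reuses the `data` series from the cells above; the names `by_label` and `by_position` are only illustrative.

```python
import pandas as pd

data = pd.Series([0.25, 0.5, 0.75, 1.0], index=['a', 'b', 'c', 'd'])

by_label = data['a':'c']     # label slice: the endpoint 'c' is included
by_position = data[0:2]      # positional slice: position 2 is excluded

print(list(by_label.index))
print(list(by_position.index))
```

The label slice returns three entries and the positional slice only two, even though both are written with the same `start:stop` syntax.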

### `Series` as specialized dictionary

* A ``Series`` behaves a bit like a `Python` dictionary:

* A dictionary is a structure that maps arbitrary **keys** to a set of arbitrary **values**. 

* A ``Series`` is a structure which maps **typed keys** to a set of **typed values**.

* The type information of a ``Series`` makes it much more efficient than `Python` dictionaries for certain operations.

* A ``Series`` object can be constructed directly from a `Python` dictionary:

In [None]:
population_dict = {'California': 38332521,
                   'Texas': 26448193,
                   'New York': 19651127,
                   'Florida': 19552860,
                   'Illinois': 12882135}
population = pd.Series(population_dict)
population

* From here, dictionary-style item access can be performed:

In [None]:
population['California']

In [None]:
population.index

In [None]:
population.values

* We can also use dictionary-like `Python` expressions and methods to examine the keys/indices and values:

In [None]:
'Texas' in population

In [None]:
population.keys()

In [None]:
list(population.items())
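
To illustrate what the type information buys us, here is a hedged sketch comparing a plain dictionary with a `Series` built from it. It uses a shortened version of the population data above; `millions_dict`, `millions` and `large` are illustrative names, and no timing is claimed.

```python
import pandas as pd

population_dict = {'California': 38332521, 'Texas': 26448193, 'New York': 19651127}
population = pd.Series(population_dict)

# With a plain dictionary, each value is processed in an explicit Python loop.
millions_dict = {state: count / 1e6 for state, count in population_dict.items()}

# With a Series, the same arithmetic is one vectorised expression,
# and the index labels are carried along automatically.
millions = population / 1e6

# Boolean masking then selects entries by value.
large = millions[millions > 20]
print(large)
```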

### Indexers: `loc` and  `iloc`


* Some ambiguity can arise when the explicit index consists of integers, too.
* These slicing and indexing conventions can be a source of confusion. For example, if your Series has an explicit integer index, an indexing operation such as `data[1]` will use the explicit indices, while a slicing operation like `data[1:3]` will use the implicit Python-style index.

In [None]:
data = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])
data

In [None]:
data[1]

In [None]:
data[1:3]

* `Pandas` provides some special **indexer** attributes that expose the 
   different indexing schemes: `loc` and `iloc`.

* The `loc` attribute allows indexing and slicing that always references the **explicit** index:

In [None]:
data

In [None]:
data.loc[1]

In [None]:
data.loc[1:3]

* In contrast, the `iloc` attribute allows indexing and slicing that always references the **implicit** Python-style index:

In [None]:
data

In [None]:
data.iloc[1]

In [None]:
data.iloc[1:3]

* When to use which?  Consult the **Zen of Python**:

In [None]:
import this

* **Explicit is better than implicit**: 
The explicit nature of `loc` and `iloc` makes them very useful in maintaining clean and readable code.

If you get confused by `.loc` and `.iloc`, a mnemonic that might help is:
* `.iloc` is based on the implicit integer position (both start with i).
* `.loc` is based on the label (both start with l).
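
The two indexers can be put side by side on the integer-labelled series used above. This is only a sketch; the result variable names are illustrative.

```python
import pandas as pd

data = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])

label_item = data.loc[1]          # label 1    -> 'a'
position_item = data.iloc[1]      # position 1 -> 'b'

label_slice = data.loc[1:3]       # labels 1 through 3, endpoint included
position_slice = data.iloc[1:3]   # positions 1 and 2, endpoint excluded

print(label_item, position_item)
print(list(label_slice), list(position_slice))
```

Writing `loc` or `iloc` explicitly removes any doubt about which of the two conventions a given line of code relies on.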

## The `DataFrame` Object

* The `DataFrame` is the primary `pandas` data structure.

* A `DataFrame` contains two-dimensional **tabular data** with both flexible **row indices** and flexible **column names**.

* Like a `Series`, a `DataFrame` also can be thought of either as a generalization of a `numpy` array, or as a specialization of a `Python` dictionary.

* **Column**-wise,  you can think of a `DataFrame` as a sequence or a dictionary of **aligned** `Series` objects.

In [None]:
area = pd.Series({'California': 423967, 'Texas': 695662,
                  'New York': 141297, 'Florida': 170312,
                  'Illinois': 149995})
pop = pd.Series({'California': 38332521, 'Texas': 26448193,
                 'New York': 19651127, 'Florida': 19552860,
                 'Illinois': 12882135})

* We can construct a `DataFrame` from a dictionary of `Series`:

In [None]:
# feed in the area and pop series using dictionary (key:value) syntax
data = pd.DataFrame({'Area': area, 'Population': pop})
data

* Notice how there is a headed column for each series
* The `index` attribute of a ``DataFrame`` gives access to the **common index labels** of the contained `Series`:

In [None]:
data.index

* Additionally, the ``DataFrame`` has a ``columns`` attribute, which is an ``Index`` object holding the column labels:

In [None]:
data.columns
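
What **aligned** means becomes visible when the `Series` do not share all their labels. The following sketch uses a deliberately mismatched subset of the data above; the name `states` is illustrative.

```python
import pandas as pd

# Two Series whose indices only partly overlap.
area = pd.Series({'California': 423967, 'Texas': 695662})
pop = pd.Series({'Texas': 26448193, 'New York': 19651127})

# The DataFrame index contains every label that occurs in either Series;
# entries with no matching label are filled with NaN.
states = pd.DataFrame({'Area': area, 'Population': pop})
print(states)
```

Values are matched by label, never by position, so `'Texas'` ends up in a single row even though it sits at different positions in the two inputs.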

### `DataFrame` as specialized dictionary

* We can also think of a ``DataFrame`` as a specialization of a dictionary.

* Where a dictionary maps a key to a value, a ``DataFrame`` maps a column name to a ``Series`` of column data.

<div class="alert alert-danger">
    
* Note that in a two-dimensional `numpy` array, `data[i]` will return a **row**. For a `DataFrame`, `data['name']` will return a **column**.

</div>

* The individual ``Series`` that make up the columns of the ``DataFrame`` can be accessed via **dictionary-style indexing** of the column name:

In [None]:
data['Area']

* Equivalently, and conveniently, we can use **attribute-style access** with column names that are strings:

In [None]:
data.Area
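
Attribute-style access is a convenience with limits: it only works when the column name is a valid Python identifier and does not clash with an existing `DataFrame` attribute or method. The sketch below uses made-up column names (`'pop'` and `'land area'`) purely to show the two failure modes.

```python
import pandas as pd

df = pd.DataFrame({'pop': [38332521, 26448193], 'land area': [423967, 695662]},
                  index=['California', 'Texas'])

col = df['pop']     # dictionary-style access always returns the column
attr = df.pop       # attribute access finds the DataFrame.pop method instead

print(type(col).__name__)
print(callable(attr))

# A name containing a space cannot be written as an attribute at all,
# so dictionary-style access is the only option here.
land = df['land area']
```

For this reason dictionary-style access is the safer default, especially when assigning to columns.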

* The dictionary-style syntax can also be used to modify the object, in this case adding a new column:
* I'm simultaneously choosing to make the `dtype` be `int64` to match the other columns (this truncates the fractional part of the density).

In [None]:
data['Density'] = (data['Population'] / data['Area']).astype('int64')
data

### `DataFrame` as two-dimensional array

* We can also view the ``DataFrame`` as an enhanced two-dimensional array.

* The raw underlying data array (a `NumPy` array) can be accessed via the ``values`` attribute:

In [None]:
data.values

* Many familiar array-like operations can be done on the ``DataFrame`` itself.

* For example, we can transpose the full ``DataFrame`` to swap rows and columns:

In [None]:
data.T

* For **array-style indexing**, `Pandas` again uses the ``loc`` and ``iloc`` indexers mentioned earlier.

* Using the ``iloc`` indexer, we can index the underlying array as if it is a simple `NumPy` array.

* The ``DataFrame`` index and column **labels are maintained** in the result:

In [None]:
data.iloc[:3, :2]

* Similarly, using the ``loc`` indexer we can index the underlying data in an array-like style but using the explicit index and column names:

In [None]:
data.loc[:'New York', :'Population']

* Any of the familiar `numpy`-style data access patterns can be used within these indexers.
* For example, in the `loc` indexer we can combine **masking** and **fancy indexing**:

In [None]:
data.loc[data.Density > 100, ['Population', 'Density']]

* Any of these indexing conventions may also be used to **set or modify** values.
* This works in the same way as with `numpy` arrays:

In [None]:
data.iloc[0, 2] = 99
data

### Constructing DataFrame objects

* A  ``DataFrame`` can be constructed in a variety of ways.

#### From a list of dictionaries

* If keys in one of the dictionaries are missing, `Pandas` will fill them in with ``NaN`` (i.e., "not a number") values:

In [None]:
pd.DataFrame([{'a': 1, 'b': 2}, {'b': 3, 'c': 4}])
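
A side effect worth knowing is that a column containing ``NaN`` can no longer be stored as integers. The following sketch repeats the construction above and inspects the result; `df` is just an illustrative name.

```python
import pandas as pd

df = pd.DataFrame([{'a': 1, 'b': 2}, {'b': 3, 'c': 4}])

print(df.dtypes)         # columns with gaps become float64; 'b' stays integer
print(df.isna().sum())   # number of missing entries per column
```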

#### From a dictionary of ``Series``

* as seen before:

In [None]:
pd.DataFrame({'Population': population,
              'Area': area})

#### From a two-dimensional array of data

* with any specified column and index names.

* If omitted, an integer index will be used instead:

In [None]:
pd.DataFrame(np.random.rand(3, 2),
             columns=['foo', 'bar'],
             index=['a', 'b', 'c'])
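
To see the default labels, the sketch below builds the same kind of table once without and once with explicit labels. It uses `np.arange` instead of random numbers so that the output is reproducible; `values`, `unlabelled` and `labelled` are illustrative names.

```python
import numpy as np
import pandas as pd

values = np.arange(6).reshape(3, 2)

# No labels given: rows and columns are numbered from 0.
unlabelled = pd.DataFrame(values)

# Explicit labels for both axes.
labelled = pd.DataFrame(values, columns=['foo', 'bar'], index=['a', 'b', 'c'])

print(list(unlabelled.index), list(unlabelled.columns))
print(list(labelled.index), list(labelled.columns))
```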

## The `Index` Object

* Both the ``Series`` and ``DataFrame`` objects contain an explicit **index**.

* The ``Index`` object can be thought of either as an **immutable array** or as an **ordered multi-set**. 

In [None]:
ind = pd.Index([2, 3, 5, 7, 11])
ind

### `Index` as immutable array

* We can use standard Python indexing notation to retrieve values or slices:

In [None]:
ind[0]

In [None]:
ind[::2]

* ``Index`` objects also have many of the attributes of `numpy` arrays:

In [None]:
print(ind.size, ind.shape, ind.ndim, ind.dtype)

<div class="alert alert-danger">

* One difference between `Index` objects and `numpy` arrays is that indices are **immutable**–that is, they cannot be modified via the normal means:
```
ind[1] = 0
```
would result in an error message.

* This immutability makes it safer to share indices between multiple `DataFrame`s, without the potential for side effects from inadvertent index modification.
    
</div>

In [None]:
ind[1] = 0
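
If you want to see the error without interrupting the notebook, the assignment can be wrapped in a `try`/`except` block. This is a small sketch; `message` and `new_ind` are illustrative names.

```python
import pandas as pd

ind = pd.Index([2, 3, 5, 7, 11])

try:
    ind[1] = 0                    # item assignment is not supported
except TypeError as err:
    message = str(err)
    print("TypeError:", message)

# Methods that appear to modify an Index return a new object instead;
# the original is left untouched.
new_ind = ind.insert(1, 0)
print(list(ind), list(new_ind))
```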

### `Index` as ordered set

* `Pandas` objects are designed to facilitate operations such as joins across datasets, which depend on many aspects of set arithmetic.

* The ``Index`` object follows many of the conventions used by Python's built-in ``set`` data structure, so that unions, intersections, differences, and other combinations can be computed in a familiar way:

In [None]:
indA = pd.Index([1, 3, 5, 7, 9])
indB = pd.Index([2, 3, 5, 7, 11])

In [None]:
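indA.intersection(indB)  # intersection (in A AND B)

In [None]:
indA.union(indB)  # union (in A OR B)

In [None]:
indA.symmetric_difference(indB)  # symmetric difference aka (A OR B) less (A AND B)

* Older versions of `pandas` also accepted the operators `&`, `|`, and `^` for these set operations; since `pandas` 2.0 those operators act element-wise on an `Index`, so the methods above are the safe choice.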

## References

### `pandas`

* `Series`: [[doc]](https://pandas.pydata.org/pandas-docs/stable/reference/series.html)


* `DataFrame`: [[doc]](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html)


* `Index`: [[doc]](https://pandas.pydata.org/pandas-docs/stable/reference/indexing.html)

## Exercises

1. Four students, called `'AB'`, `'CD'`, `'EF'` and `'GH'`, do an assignment and receive $8$, $3$, $10$, and $5$ marks, respectively.  Record this information in a `Series` object that uses the student names as index.

2. The same four students sit an exam and receive $81$, $53$, $88$, and $45$ marks, respectively.  Record this information in a `Series` object that uses the student names as index.

3. Combine the two `Series` objects from above into a single `DataFrame` that uses the student names as index.

4. Compute for each student the sum of their marks from above, and add these numbers as a new column `'Total'` to your `DataFrame`.

5. Compute for each student the **weighted** sum of their marks from above, as five times their assignment mark plus half of their exam mark, and add these numbers as a new column `'Final Mark'` to your `DataFrame`.

In [None]:
assignment = pd.Series({'AB': 8, 'CD': 3, 'EF': 10, 'GH': 5})

In [None]:
exam = pd.Series({'AB': 81, 'CD': 53, 'EF': 88, 'GH': 45})

In [None]:
grades = pd.DataFrame({'Assignment': assignment, 'Exam': exam})

In [None]:
grades

In [None]:
grades['Total'] = grades['Assignment'] + grades['Exam']
grades

In [None]:
grades['Final Mark'] = 5 * grades['Assignment'] + 0.5 * grades['Exam']
grades