<p><font size="6"><b>03 - Pandas: Indexing and selecting data - part II</b></font></p>

> *© 2021, Joris Van den Bossche and Stijn Van Hoey  (<mailto:jorisvandenbossche@gmail.com>, <mailto:stijnvanhoey@gmail.com>). Licensed under [CC BY 4.0 Creative Commons](http://creativecommons.org/licenses/by/4.0/)*

---

In [None]:
import pandas as pd

In [None]:
# redefining the example dataframe

data = {'country': ['Belgium', 'France', 'Germany', 'Netherlands', 'United Kingdom'],
        'population': [11.3, 64.3, 81.3, 16.9, 64.9],
        'area': [30510, 671308, 357050, 41526, 244820],
        'capital': ['Brussels', 'Paris', 'Berlin', 'Amsterdam', 'London']}
countries = pd.DataFrame(data)
countries

<div class="alert alert-info" style="font-size:120%">
<b>REMEMBER</b>: <br><br>

In summary, `[]` provides the following convenience shortcuts:

* **Series**: selecting a **label**: `s[label]`
* **DataFrame**: selecting a single or multiple **columns**: `df['col']` or `df[['col1', 'col2']]`
* **DataFrame**: slicing or filtering the **rows**: `df['row_label1':'row_label2']` or `df[mask]`

</div>
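
As a quick illustration of these shortcuts, here is a minimal sketch. The objects `s` and `demo` are hypothetical toy data (not the notebook's `countries`), indexed by country name so that the label-based row slice is visible.

```python
import pandas as pd

# Hypothetical toy data, indexed by label
s = pd.Series([11.3, 64.3, 81.3], index=['Belgium', 'France', 'Germany'])
demo = pd.DataFrame({'population': [11.3, 64.3, 81.3],
                     'area': [30510, 671308, 357050]},
                    index=['Belgium', 'France', 'Germany'])

single_value = s['France']                   # Series: select by label
one_column = demo['area']                    # DataFrame: one column -> Series
two_columns = demo[['area', 'population']]   # DataFrame: list of columns -> DataFrame
row_slice = demo['Belgium':'France']         # DataFrame: a slice selects rows
filtered = demo[demo['population'] > 50]     # DataFrame: a boolean mask selects rows
```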

# Changing the DataFrame index

So far, we have mostly worked with DataFrames that use the default *0, 1, 2, ... N* row labels (except for the time series data). However, we can also set one of the columns as the index.

Setting the index to the country names:

In [None]:
countries = countries.set_index('country')
countries

The reverse operation is `reset_index`:

In [None]:
countries.reset_index('country')
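
Note that both methods return a new DataFrame rather than modifying the existing one, which is why the result of `set_index` was assigned back above. A small sketch of the round trip, using a hypothetical two-row `demo` frame:

```python
import pandas as pd

# Hypothetical toy data with the default 0, 1, ... row labels
demo = pd.DataFrame({'country': ['Belgium', 'France'],
                     'capital': ['Brussels', 'Paris']})

indexed = demo.set_index('country')   # new DataFrame; `demo` itself is unchanged
restored = indexed.reset_index()      # 'country' is a regular column again

print(indexed.index.name)
print(list(demo.columns))
print(list(restored.columns))
```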

# Selecting data based on the index

<div class="alert alert-warning" style="font-size:120%">
<b>ATTENTION!</b>: <br><br>

One of pandas' basic features is the labeling of rows and columns, but this also makes indexing a bit more complex compared to numpy. <br><br> We now have to distinguish between:

* selection by **label** (using the row and column names)
* selection by **position** (using integers)

</div>

## Systematic indexing with `loc` and `iloc`

When using `[]` like above, you can only select from one axis at a time (rows or columns, not both). For more advanced indexing, you have some extra attributes:
    
* `loc`: selection by label
* `iloc`: selection by position

Both `loc` and `iloc` use the following pattern: `df.loc[ <selection of the rows> , <selection of the columns> ]`.

This 'selection of the rows / columns' can be: a single label, a list of labels, a slice or a boolean mask.

Selecting a single element:

In [None]:
countries.loc['Germany', 'area']

But the row or column indexer can also be a list, a slice, a boolean array (see next section), etc.

In [None]:
countries.loc['France':'Germany', ['area', 'population']]

<div class="alert alert-danger">
<b>NOTE</b>:

* Unlike slicing in numpy, the end label is **included**!

</div>
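
To make the difference concrete, the following sketch compares a label slice with a positional slice on a hypothetical `demo` frame (an assumption for illustration, not the notebook's `countries`):

```python
import pandas as pd

# Hypothetical toy data, indexed by country name
demo = pd.DataFrame({'population': [11.3, 64.3, 81.3, 16.9]},
                    index=['Belgium', 'France', 'Germany', 'Netherlands'])

by_label = demo.loc['France':'Germany']   # label slice: both endpoints are included
by_position = demo.iloc[1:2]              # positional slice: the stop is excluded

print(list(by_label.index))
print(list(by_position.index))
```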

---
Selecting by position with `iloc` works similarly to **indexing numpy arrays**:

In [None]:
countries.iloc[0:2, 1:3]
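
The parallel with numpy can be checked directly. This sketch, on a hypothetical `demo` frame, applies the same slice through `iloc` and on the array returned by `to_numpy()`:

```python
import pandas as pd

# Hypothetical toy data
demo = pd.DataFrame({'population': [11.3, 64.3, 81.3],
                     'area': [30510, 671308, 357050],
                     'capital': ['Brussels', 'Paris', 'Berlin']},
                    index=['Belgium', 'France', 'Germany'])

subset = demo.iloc[0:2, 1:3]         # rows 0-1 and columns 1-2, labels are kept
values = demo.to_numpy()[0:2, 1:3]   # the same slice on the plain numpy array

last_row = demo.iloc[-1]             # negative positions count from the end
```

The `iloc` result keeps its row and column labels, whereas the numpy slice only holds the values.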

---

The different indexing methods can also be used to **assign data**:

In [None]:
countries2 = countries.copy()
countries2.loc['Belgium':'Germany', 'population'] = 10

In [None]:
countries2
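
Assignment works the same way with `iloc`. A minimal sketch on a hypothetical `demo` frame, working on a copy so that the original stays untouched:

```python
import pandas as pd

# Hypothetical toy data
demo = pd.DataFrame({'population': [11.3, 64.3, 81.3],
                     'area': [30510, 671308, 357050]},
                    index=['Belgium', 'France', 'Germany'])

updated = demo.copy()
# First two rows of the first column, selected by position
updated.iloc[0:2, 0] = [1.0, 2.0]

print(updated)
```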

<div class="alert alert-info" style="font-size:120%">
<b>REMEMBER</b>: <br><br>

Advanced indexing with **loc** and **iloc**

* **loc**: select by label: `df.loc[row_indexer, column_indexer]`
* **iloc**: select by position: `df.iloc[row_indexer, column_indexer]`

</div>

<div class="alert alert-success">
<b>EXERCISE 1</b>:

<p>
<ul>
    <li>Add the population density as column to the DataFrame.</li>
</ul>
</p>
Note: the population column is expressed in millions.
</div>

In [None]:
# %load _solutions/pandas_03b_indexing1.py

<div class="alert alert-success">
<b>EXERCISE 2</b>:

 <ul>
  <li>Select the capital and the population column of those countries where the density is larger than 300</li>
</ul>
</div>

In [None]:
# %load _solutions/pandas_03b_indexing2.py

<div class="alert alert-success">

<b>EXERCISE 3</b>:

 <ul>
  <li>Add a column 'density_ratio' with the ratio of the population density to the average population density for all countries.</li>
</ul>
</div>

In [None]:
# %load _solutions/pandas_03b_indexing3.py

<div class="alert alert-success">

<b>EXERCISE 4</b>:

 <ul>
  <li>Change the capital of the UK to Cambridge</li>
</ul>
</div>

In [None]:
# %load _solutions/pandas_03b_indexing4.py

<div class="alert alert-success">
<b>EXERCISE 5</b>:

 <ul>
  <li>Select all countries whose population density is between 100 and 300 people/km²</li>
</ul>
</div>

In [None]:
# %load _solutions/pandas_03b_indexing5.py

The next exercise uses the titanic data set:

In [None]:
df = pd.read_csv("data/titanic.csv")

In [None]:
df.head()

<div class="alert alert-success">

<b>EXERCISE 6</b>:

* Select all rows for male passengers and calculate the mean age of those passengers. Do the same for the female passengers. Do this now using `.loc`.

</div>

In [None]:
# %load _solutions/pandas_03b_indexing6.py

In [None]:
# %load _solutions/pandas_03b_indexing7.py

We will later see an easier way to calculate both averages at the same time with `groupby`.

# Alignment on the index

<div class="alert alert-danger">

**WARNING**: **Alignment!** (unlike numpy)

* Pay attention to **alignment**: operations between series will align on the index:

</div>

In [None]:
population = countries['population']
s1 = population[['Belgium', 'France']]
s2 = population[['France', 'Germany']]

In [None]:
s1

In [None]:
s2

In [None]:
s1 + s2
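
Labels that are present in only one of the two Series produce `NaN` in the result. If a missing label should count as zero instead, one option is the `add` method with `fill_value`. A self-contained sketch with hypothetical values:

```python
import pandas as pd

# Hypothetical toy data: only 'France' appears in both Series
a = pd.Series([11.3, 64.3], index=['Belgium', 'France'])
b = pd.Series([64.3, 81.3], index=['France', 'Germany'])

aligned = a + b                   # NaN where a label is missing on either side
filled = a.add(b, fill_value=0)   # treat a missing label as 0 instead

print(aligned)
print(filled)
```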

# Pitfall: chained indexing (and the 'SettingWithCopyWarning')

In [None]:
df = countries.copy()

When updating values in a DataFrame, you can run into the infamous "SettingWithCopyWarning" and issues with chained indexing.

Assume we want to cap the population and replace all values above 50 with 50. We can do this using the basic `[]` indexing operation twice ("chained indexing"):

In [None]:
df[df['population'] > 50]['population'] = 50

However, we get a warning, and we can also see that the original dataframe did not change:

In [None]:
df

The warning message explains that we should use `.loc[row_indexer,col_indexer] = value` instead. That is what we just learned in this notebook, so we can do:

In [None]:
df.loc[df['population'] > 50, 'population'] = 50

And now the dataframe actually changed:

In [None]:
df

To explain *why* the original `df[df['population'] > 50]['population'] = 50` didn't work, we can do the "chained indexing" in two explicit steps:

In [None]:
temp = df[df['population'] > 50]
temp['population'] = 50

For Python, there is no real difference between the one-liner and this two-liner. When writing it as two lines, you can see that we create a temporary, filtered dataframe (called `temp` above). So here, with `temp['population'] = 50`, we are actually updating `temp` but not the original `df`.

<div class="alert alert-info" style="font-size:120%">

<b>REMEMBER!</b><br><br>

What to do when encountering the *value is trying to be set on a copy of a slice from a DataFrame* warning?

* Use `loc` instead of chained indexing **if possible**!
* Or `copy` explicitly if you don't want to change the original data.

</div>
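
Both options can be sketched side by side. The names `demo`, `large` and `capped` are hypothetical and only serve this illustration:

```python
import pandas as pd

# Hypothetical toy data
demo = pd.DataFrame({'population': [11.3, 64.3, 81.3]},
                    index=['Belgium', 'France', 'Germany'])

# Option 1: an explicit copy, when the original should stay as it is
large = demo[demo['population'] > 50].copy()
large['population'] = 50              # modifies only `large`

# Option 2: a single `loc` call, when the data itself should be updated
capped = demo.copy()
capped.loc[capped['population'] > 50, 'population'] = 50

print(demo)
print(large)
print(capped)
```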