# DataFrame indexing

We would now like to access the values in our table using indices. We have seen two ways of indexing in Numpy: with numerical indices and with boolean arrays. A very similar approach applies to DataFrame indexing, with one additional component: while elements in a Numpy array are located purely by their position (numerical indices), elements of a DataFrame can also be located by their column name and their index label, which is not necessarily numerical.

To understand this, we have to explore the "anatomy" of a DataFrame a bit further. Let's first load one:

In [1]:
import pandas as pd
import numpy as np

In [2]:
nuclei = pd.read_csv('../exports/19838_1252_F8_1_in.csv')

In [3]:
nuclei

Unnamed: 0,label,area,mean_intensity
0,1,5629,28.21407
1,2,9904,44.429826
2,4,15070,53.126078
3,5,20884,49.792856
4,6,12972,42.911116
5,7,16068,54.610904
6,8,27912,52.343007
7,9,26131,60.766178
8,10,28071,58.83043
9,11,16176,54.782517


## Accessing columns

Columns can be accessed using square brackets with the column name (*not* its location!):

In [4]:
nuclei['area']

0      5629
1      9904
2     15070
3     20884
4     12972
5     16068
6     27912
7     26131
8     28071
9     16176
10    18853
Name: area, dtype: int64

What is returned here is a ```Series``` object. We won't discuss this object further. Just know that DataFrames are composed of such Series, and that Series are built on top of Numpy arrays, as the ```dtype: int64``` information above suggests. We can access the array underlying a Series with the ```values``` property:

In [5]:
nuclei['area'].values

array([ 5629,  9904, 15070, 20884, 12972, 16068, 27912, 26131, 28071,
       16176, 18853])
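
The relationship between a Series and its underlying array can be sketched as follows. The Series `areas` is a hypothetical stand-in for ```nuclei['area']``` (only the first three values are copied from the table above). The snippet also uses ```to_numpy()```, which the Pandas documentation suggests as a more explicit alternative to ```values```; for a simple integer column like this one, both should return the same array.

```python
import numpy as np
import pandas as pd

# Hypothetical Series standing in for nuclei['area'] (first three values only)
areas = pd.Series([5629, 9904, 15070], name='area')

arr_values = areas.values     # property used in the text
arr_numpy = areas.to_numpy()  # explicit method form

print(type(arr_values))
print(np.array_equal(arr_values, arr_numpy))
```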

We can also access multiple columns by specifying a list of those that we need:

In [6]:
nuclei[['area', 'mean_intensity']]

Unnamed: 0,area,mean_intensity
0,5629,28.21407
1,9904,44.429826
2,15070,53.126078
3,20884,49.792856
4,12972,42.911116
5,16068,54.610904
6,27912,52.343007
7,26131,60.766178
8,28071,58.83043
9,16176,54.782517


As you can see, when we pass a list of column names, the returned object is a DataFrame rather than a Series.
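
To make the difference between the two return types concrete, here is a minimal sketch. The table `demo` is a hypothetical miniature of `nuclei` built by hand from its first three rows. Note that a list containing a single name also yields a DataFrame (with one column), not a Series.

```python
import pandas as pd

# Hypothetical miniature of `nuclei` (first three rows only)
demo = pd.DataFrame({
    'label': [1, 2, 4],
    'area': [5629, 9904, 15070],
    'mean_intensity': [28.21407, 44.429826, 53.126078],
})

single = demo['area']                     # one column name -> Series
single_df = demo[['area']]                # list with one name -> DataFrame
multi = demo[['area', 'mean_intensity']]  # list of names -> DataFrame

print(type(single).__name__)
print(type(single_df).__name__)
print(type(multi).__name__)
```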

## Accessing indices

We have seen an example with Numpy where we accessed elements in a 2D array using a pair of indices e.g.:



In [7]:
my_array = np.random.normal(size=(3,5))
my_array

array([[ 1.629054  ,  0.83008351,  0.890091  ,  0.91890738, -0.47601898],
       [ 0.97904984,  0.65933734, -2.16228108,  0.28385469, -0.43457002],
       [ 0.03331159,  1.06584909,  0.77037236, -0.40119092, -1.04896934]])

In [8]:
my_array[2,1]

1.0658490896279376

Even though our DataFrame is a two-dimensional object, we can't access its elements in the same way. For example, we cannot recover the top left element by using:

In [9]:
nuclei[0,0]

KeyError: (0, 0)
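
The error arises because plain square brackets on a DataFrame look up column names, so the pair `0, 0` is treated as a single column key (the tuple `(0, 0)`) that does not exist. A minimal sketch of this behaviour, using a hypothetical two-row table `demo`:

```python
import pandas as pd

# Hypothetical two-row table with the same kind of columns as `nuclei`
demo = pd.DataFrame({'label': [1, 2], 'area': [5629, 9904]})

try:
    demo[0, 0]  # interpreted as a lookup of the column named (0, 0)
    raised = False
except KeyError:
    raised = True

print(raised)
```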

Because a DataFrame has column names and an index (the bold values on the left of the DataFrame), Pandas provides specific accessors to retrieve elements either by those labels or directly by numerical position.

### ```loc```

The ```.loc[index, name]``` accessor allows us to retrieve the element situated at a specific ```index``` (row label) and ```name``` (column name). For example:

In [None]:
nuclei.loc[3, 'area']

We can also recover all the items of a given index by only specifying the latter:

In [None]:
nuclei.loc[3]

We see that this returns a Series containing all the items of that row. Just like with columns, we can also pass a **list of indices** to recover more than one row, in which case we recover a DataFrame:

In [None]:
nuclei.loc[[1,3], ['area','mean_intensity']]
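
Since the cells above are shown without output, the three forms of ```loc``` can be tried on a small stand-in table. This is only a sketch: `demo` is a hypothetical DataFrame built by hand from the first four rows of `nuclei`, with the default integer index.

```python
import pandas as pd

# Hypothetical miniature of `nuclei` (first four rows, default integer index)
demo = pd.DataFrame({
    'label': [1, 2, 4, 5],
    'area': [5629, 9904, 15070, 20884],
    'mean_intensity': [28.21407, 44.429826, 53.126078, 49.792856],
})

element = demo.loc[3, 'area']                          # single value
row = demo.loc[3]                                      # whole row -> Series
subset = demo.loc[[1, 3], ['area', 'mean_intensity']]  # lists -> DataFrame

print(element)
print(type(row).__name__)
print(subset)
```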

Note that the index **does not have to be a sequential integer**. For example we can replace the index with a list of strings:

In [10]:
nuclei.index = np.array(['a','b','c','d','e','f','g','h','i','j','k'])
nuclei

Unnamed: 0,label,area,mean_intensity
a,1,5629,28.21407
b,2,9904,44.429826
c,4,15070,53.126078
d,5,20884,49.792856
e,6,12972,42.911116
f,7,16068,54.610904
g,8,27912,52.343007
h,9,26131,60.766178
i,10,28071,58.83043
j,11,16176,54.782517


Here we can still use ```.loc``` even though the index no longer consists of integers:

In [11]:
nuclei.loc[['a','c']]

Unnamed: 0,label,area,mean_intensity
a,1,5629,28.21407
c,4,15070,53.126078


### ```iloc```

The alternative to ```.loc[index, name]``` is ```.iloc[row, column]```. It is closer to the Numpy approach, as here we use the **actual location** in the DataFrame to recover elements. For example, to recover the ```area``` of the third row, we need row index 2 and column index 1 (the second column):

In [12]:
nuclei

Unnamed: 0,label,area,mean_intensity
a,1,5629,28.21407
b,2,9904,44.429826
c,4,15070,53.126078
d,5,20884,49.792856
e,6,12972,42.911116
f,7,16068,54.610904
g,8,27912,52.343007
h,9,26131,60.766178
i,10,28071,58.83043
j,11,16176,54.782517


In [13]:
nuclei.iloc[2,1]

15070

With ```iloc``` we can now use the **same indexing** approach as seen with Numpy arrays, i.e. we can select a range of values:

In [14]:
nuclei.iloc[0:2,2:4]

Unnamed: 0,mean_intensity
a,28.21407
b,44.429826
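
The contrast between label-based and position-based access can be sketched on a small stand-in table. Here `demo` is a hypothetical DataFrame built from the first three rows of `nuclei`, with string labels as index. As in Numpy, a positional slice excludes its end, and a slice that runs past the last column is simply truncated, which is why `2:4` returns a single column in a table with three columns.

```python
import pandas as pd

# Hypothetical miniature of `nuclei` with a string index
demo = pd.DataFrame(
    {
        'label': [1, 2, 4],
        'area': [5629, 9904, 15070],
        'mean_intensity': [28.21407, 44.429826, 53.126078],
    },
    index=['a', 'b', 'c'],
)

by_position = demo.iloc[2, 1]     # third row, second column
by_label = demo.loc['c', 'area']  # same element, addressed by labels

# Rows 0 and 1, columns from position 2 up to (but excluding) 4
block = demo.iloc[0:2, 2:4]

print(by_position, by_label)
print(block)
```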


## Logical indexing

We have seen before that we can create a boolean array by using for example a comparison such as:

In [15]:
my_array

array([[ 1.629054  ,  0.83008351,  0.890091  ,  0.91890738, -0.47601898],
       [ 0.97904984,  0.65933734, -2.16228108,  0.28385469, -0.43457002],
       [ 0.03331159,  1.06584909,  0.77037236, -0.40119092, -1.04896934]])

In [16]:
my_array > 0

array([[ True,  True,  True,  True, False],
       [ True,  True, False,  True, False],
       [ True,  True,  True, False, False]])

and then use it to **extract** the ```True``` elements in another array:

In [17]:
my_array[my_array > 0]

array([1.629054  , 0.83008351, 0.890091  , 0.91890738, 0.97904984,
       0.65933734, 0.28385469, 0.03331159, 1.06584909, 0.77037236])

We can apply the **same logic** to Pandas DataFrames. For example, let's select nuclei by area:

In [18]:
nuclei

Unnamed: 0,label,area,mean_intensity
a,1,5629,28.21407
b,2,9904,44.429826
c,4,15070,53.126078
d,5,20884,49.792856
e,6,12972,42.911116
f,7,16068,54.610904
g,8,27912,52.343007
h,9,26131,60.766178
i,10,28071,58.83043
j,11,16176,54.782517


In [19]:
nuclei['area'] > 20000

a    False
b    False
c    False
d     True
e    False
f    False
g     True
h     True
i     True
j    False
k    False
Name: area, dtype: bool

Just as we obtained a boolean array previously, we obtain here a **boolean Series**. And here as well we can use it to extract elements from the DataFrame. This works again with simple brackets:

In [20]:
nuclei[nuclei['area'] > 20000]

Unnamed: 0,label,area,mean_intensity
d,5,20884,49.792856
g,8,27912,52.343007
h,9,26131,60.766178
i,10,28071,58.83043


The same works with ```loc```, which additionally lets us select a specific column:

In [21]:
nuclei.loc[nuclei['area'] > 20000, 'mean_intensity']

d    49.792856
g    52.343007
h    60.766178
i    58.830430
Name: mean_intensity, dtype: float64
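
The two ways of applying a boolean Series can be sketched side by side. The table `demo` is a hypothetical subset of `nuclei` (rows `d` to `g`), and the names `mask`, `large` and `large_intensity` are illustrative only.

```python
import pandas as pd

# Hypothetical subset of `nuclei` (rows d to g)
demo = pd.DataFrame(
    {
        'label': [5, 6, 7, 8],
        'area': [20884, 12972, 16068, 27912],
        'mean_intensity': [49.792856, 42.911116, 54.610904, 52.343007],
    },
    index=['d', 'e', 'f', 'g'],
)

mask = demo['area'] > 20000                         # boolean Series
large = demo[mask]                                  # rows where mask is True
large_intensity = demo.loc[mask, 'mean_intensity']  # same rows, one column

print(mask)
print(large)
print(large_intensity)
```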

This is extremely useful to extract only specific elements from a large table when some values should be discarded.

## Exercises

1. Import the penguin dataset https://raw.githubusercontent.com/allisonhorst/palmerpenguins/master/inst/extdata/penguins.csv
2. Create a new dataframe ```new_dataframe``` by extracting the ```species```, ```bill_length_mm``` and ```body_mass_g``` columns.
3. Extract the row with index 4 of ```new_dataframe```
4. Extract the ```bill_length_mm``` of the first 3 rows of ```new_dataframe```
5. Extract all rows for which ```body_mass_g > 6000```