## Data Selection

This section covers how to slice, dice, and generally get and set subsets of pandas objects. The focus is on Series and DataFrame, since they have received the most development attention in this area.

The axis labeling information in pandas objects is useful for a variety of reasons:

* It identifies data (that is, it provides metadata) using known indicators, which is useful for analysis, visualization, and interactive console display.

* It enables both implicit (automatic) and explicit data alignment.

* It lets you get and set subsets of the data set in an intuitive way.
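
As a minimal sketch of the alignment point above, the following uses two small, made-up Series (the labels `x`, `y`, `z`, `w` and the values are illustrative assumptions, not part of the Iris data). Arithmetic aligns on labels implicitly, while `Series.align` does the same step explicitly.

```python
import pandas as pd

# Two small Series with partially overlapping labels (illustrative data)
a = pd.Series([1.0, 2.0, 3.0], index=['x', 'y', 'z'])
b = pd.Series([10.0, 20.0, 30.0], index=['y', 'z', 'w'])

# Implicit alignment: arithmetic matches values by label, not by position.
# Labels present in only one operand produce NaN in the result.
total = a + b

# Explicit alignment: put both Series on the union of labels first,
# filling the gaps with 0.0 instead of NaN
a_aligned, b_aligned = a.align(b, join='outer', fill_value=0.0)
total_filled = a_aligned + b_aligned

print(total)
print(total_filled)
```

Only the labels `y` and `z` exist in both Series, so only those entries of `total` are numbers; `total_filled` has a value for every label because the gaps were filled before adding.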


pandas currently supports three forms of multi-axis indexing:

1. The indexing operator `[]` and the attribute operator `.`, familiar from Python and NumPy, give quick and easy access to pandas data structures in a wide range of situations.

2. `.loc` is primarily label-based, but it can also be used with a boolean array. `.loc` raises `KeyError` when the requested items are not found.

3. `.iloc` is primarily integer-position-based (from 0 to length-1 of the axis), but it can also be used with a boolean array.

`.iloc` raises `IndexError` if a requested indexer is out of bounds, except for slice indexers, which allow out-of-bounds indexing. (This is in line with Python/NumPy slice semantics.)
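
The error behavior described above can be checked with a short sketch. The frame `df`, its row labels `r0` to `r2`, and the variable names are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Small illustrative frame (hypothetical data, not the Iris set)
df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]}, index=['r0', 'r1', 'r2'])

# .loc: a label that does not exist raises KeyError
try:
    df.loc['r9']
    loc_missing = 'no error'
except KeyError:
    loc_missing = 'KeyError'

# .iloc: a scalar position past the end raises IndexError
try:
    df.iloc[10]
    iloc_scalar = 'no error'
except IndexError:
    iloc_scalar = 'IndexError'

# .iloc: an out-of-bounds slice is truncated, like a Python list slice
tail = df.iloc[1:10]

# Both indexers accept a boolean array; .iloc needs a plain array
# rather than a labeled boolean Series
mask = df['a'] > 1
by_label = df.loc[mask]
by_pos = df.iloc[mask.to_numpy()]

print(loc_missing, iloc_scalar)
print(tail)
```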

This section uses the Iris dataset. First, load the data and inspect its row and column labels with the following commands:

In [2]:
import pandas as pd

In [3]:
url = 'https://raw.githubusercontent.com/pairote-sat/SCMA248/main/Data/iris.data'

iris = pd.read_csv(url, header=None, 
                   names = ['sepal_length', 'sepal_width', 
                            'petal_length', 'petal_width', 'class'])

In [101]:
print(iris.index)
print(iris.columns)

RangeIndex(start=0, stop=150, step=1)
Index(['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'class'], dtype='object')


In [102]:
iris.head()

|  | sepal_length | sepal_width | petal_length | petal_width | class |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |


### The indexing operators []

To begin, specify the column and then the row (by its index label) you are interested in.

You can use the following command to get the sepal width of the fifth row (index 4):

In [103]:
iris['sepal_width'][4]

3.6

In [104]:
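# This does not work: [] on a DataFrame looks up a column label,
# and there is no column named 4, so it raises KeyError
# iris[4]['sepal_width']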

**Note:** Be careful: a DataFrame is not a matrix, so you may be tempted to give the row first and then the column. On a pandas DataFrame, the `[]` operator selects the column first, and a second `[]` then selects an element of the resulting pandas Series.
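
A small sketch of this column-first behavior follows. It uses a tiny stand-in frame with the same `sepal_width` values as the first five Iris rows so that it runs without downloading the data. The variable names are illustrative.

```python
import pandas as pd

# Stand-in for the first five rows of the sepal_width column
df = pd.DataFrame({'sepal_width': [3.5, 3.0, 3.2, 3.1, 3.6]})

# [] on a DataFrame looks up a column label, so an integer that is
# not a column name raises KeyError
try:
    df[4]
    outcome = 'found'
except KeyError:
    outcome = 'KeyError'

# Column first, then the row label within the resulting Series
value_chained = df['sepal_width'][4]

# The same cell in one step, with row and column given together
value_loc = df.loc[4, 'sepal_width']

print(outcome, value_chained, value_loc)
```

The single-step `.loc` form is generally the safer choice when you later want to assign to the cell, because chained `[]` lookups may operate on an intermediate object.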

Retrieving a sub-matrix works the same way: specify slices or lists of indexes instead of scalars.

In [105]:
iris['sepal_width'][0:4]

0    3.5
1    3.0
2    3.2
3    3.1
Name: sepal_width, dtype: float64

In [106]:
iris[['petal_width','sepal_width']][0:4]

|  | petal_width | sepal_width |
|---|---|---|
| 0 | 0.2 | 3.5 |
| 1 | 0.2 | 3.0 |
| 2 | 0.2 | 3.2 |
| 3 | 0.2 | 3.1 |


In [107]:
iris['sepal_width'][range(4)]

0    3.5
1    3.0
2    3.2
3    3.1
Name: sepal_width, dtype: float64

### .loc

`.loc` is an indexer used with square brackets, not a method. It lets you access data with the row label first and the column label second, as in a matrix.

In [108]:
iris.loc[4,'sepal_width']

3.6

In [109]:
# rows at index labels between 0 and 4 (inclusive)
# See https://stackoverflow.com/questions/31593201/how-are-iloc-and-loc-different

iris.loc[0:4,'sepal_width']

0    3.5
1    3.0
2    3.2
3    3.1
4    3.6
Name: sepal_width, dtype: float64

In [110]:
iris.loc[range(4),['petal_width','sepal_width']]

|  | petal_width | sepal_width |
|---|---|---|
| 0 | 0.2 | 3.5 |
| 1 | 0.2 | 3.0 |
| 2 | 0.2 | 3.2 |
| 3 | 0.2 | 3.1 |
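
Label-based indexing can also set values, not only retrieve them. The following is a minimal sketch on a made-up three-row frame; the replacement values `3.3` and `'wide'` are arbitrary placeholders.

```python
import pandas as pd

# Tiny illustrative frame (hypothetical values)
df = pd.DataFrame({
    'sepal_width': [3.5, 3.0, 3.2],
    'class': ['Iris-setosa'] * 3,
})

# Work on a copy so the original frame stays untouched
edited = df.copy()

# Set a single cell by row label and column label
edited.loc[1, 'sepal_width'] = 3.3

# Set one column for every row that matches a boolean condition
edited.loc[edited['sepal_width'] > 3.4, 'class'] = 'wide'

print(edited)
```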


### .iloc

Finally, there is `.iloc`, an indexer that selects purely by integer position (as in a matrix). You identify a cell by its row and column numbers, both counted from 0.

In [111]:
iris.iloc[4,1]

3.6

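The following commands select by position. `iris.iloc[0:4,1]` returns rows 0 through 3 only, because a positional slice excludes its end point, so it matches `iris.loc[0:3,'sepal_width']` rather than `iris.loc[0:4,'sepal_width']`. `iris.iloc[range(4),[3,1]]` produces the same output as `iris.loc[range(4),['petal_width','sepal_width']]`.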

In [112]:
# rows at index locations between 0 and 4 (exclusive)
# See https://stackoverflow.com/questions/31593201/how-are-iloc-and-loc-different

iris.iloc[0:4,1]

0    3.5
1    3.0
2    3.2
3    3.1
Name: sepal_width, dtype: float64

In [113]:
iris.iloc[range(4),[3,1]]

|  | petal_width | sepal_width |
|---|---|---|
| 0 | 0.2 | 3.5 |
| 1 | 0.2 | 3.0 |
| 2 | 0.2 | 3.2 |
| 3 | 0.2 | 3.1 |
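
With the default `RangeIndex`, labels and positions coincide, which can hide the difference between `.loc` and `.iloc`. The sketch below uses a Series whose integer labels (10 to 14, chosen arbitrarily) differ from its positions to make the two slicing rules visible.

```python
import pandas as pd

# Integer labels that deliberately differ from positions (illustrative data)
s = pd.Series([3.5, 3.0, 3.2, 3.1, 3.6], index=[10, 11, 12, 13, 14])

# Label slice: both end points are included (labels 10, 11, 12)
by_label = s.loc[10:12]

# Position slice: the end point is excluded (positions 0 and 1)
by_position = s.iloc[0:2]

print(by_label)
print(by_position)
```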


**Note:** `.loc`, `.iloc`, and `[]` indexing can also accept a callable as the indexer, as the following examples illustrate. The callable must be a function with one argument (the calling Series or DataFrame) that returns valid output for indexing.

In [114]:
iris.loc[:,lambda df: ['petal_length','sepal_length']]

|  | petal_length | sepal_length |
|---|---|---|
| 0 | 1.4 | 5.1 |
| 1 | 1.4 | 4.9 |
| 2 | 1.3 | 4.7 |
| 3 | 1.5 | 4.6 |
| 4 | 1.4 | 5.0 |
| ... | ... | ... |
| 145 | 5.2 | 6.7 |
| 146 | 5.0 | 6.3 |
| 147 | 5.2 | 6.5 |
| 148 | 5.4 | 6.2 |


In [115]:
iris.loc[lambda df: df['sepal_width'] > 3.5, :]

|  | sepal_length | sepal_width | petal_length | petal_width | class |
|---|---|---|---|---|---|
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |
| 5 | 5.4 | 3.9 | 1.7 | 0.4 | Iris-setosa |
| 10 | 5.4 | 3.7 | 1.5 | 0.2 | Iris-setosa |
| 14 | 5.8 | 4.0 | 1.2 | 0.2 | Iris-setosa |
| 15 | 5.7 | 4.4 | 1.5 | 0.4 | Iris-setosa |
| 16 | 5.4 | 3.9 | 1.3 | 0.4 | Iris-setosa |
| 18 | 5.7 | 3.8 | 1.7 | 0.3 | Iris-setosa |
| 19 | 5.1 | 3.8 | 1.5 | 0.3 | Iris-setosa |
| 21 | 5.1 | 3.7 | 1.5 | 0.4 | Iris-setosa |
| 22 | 4.6 | 3.6 | 1.0 | 0.2 | Iris-setosa |
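
Callable indexers are mostly useful in method chains, where the intermediate frame has no variable name to refer to. The following sketch assumes a tiny made-up frame and a hypothetical derived column `ratio`; neither comes from the original notebook.

```python
import pandas as pd

# Tiny illustrative frame (hypothetical values)
df = pd.DataFrame({
    'sepal_length': [5.1, 4.9, 5.8],
    'sepal_width': [3.5, 3.0, 4.0],
})

# The lambda passed to .loc receives the frame returned by assign(),
# including the new 'ratio' column, and returns a boolean mask
selected = (
    df.assign(ratio=lambda d: d['sepal_length'] / d['sepal_width'])
      .loc[lambda d: d['ratio'] < 1.5, ['sepal_width', 'ratio']]
)

# [] also accepts a callable; here it returns a boolean mask over the rows
wide_rows = df[lambda d: d['sepal_width'] > 3.2]

print(selected)
print(wide_rows)
```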
