# Pandas Indexing and Selecting Data  

User guide [Indexing and Selecting Data](https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#indexing).  

The axis labeling information in pandas objects serves many purposes:
- Identifies data (i.e. provides metadata) using known indicators.
- Enables automatic and explicit data alignment.
- Allows intuitive getting and setting of subsets of the data set.  

We will focus on the final point: namely, how to slice, dice, and generally get and set subsets of pandas objects. 

This guide covers four ways to select data from pandas objects:
1. `.loc[]`
2. `.iloc[]`
3. `[]` (a.k.a. \_\_getitem\_\_)
4. Attribute access.   
  
Getting values from an object with multi-axis selection uses the following notation (using `.loc[]` as an example, but the same applies to `.iloc[]`):
- Any of the axes accessors may be the null slice `:`.
- Axes left out of the specification are assumed to be `:` (e.g. `p.loc['a']` is equivalent to `p.loc['a', :, :]`).  

|Object Type | Indexers
| :--- | :--- 
|Series | `s.loc[indexer]`
|DataFrame | `df.loc[row_indexer,column_indexer]`
|Panel (removed in pandas 0.25) | `p.loc[item_indexer,major_indexer,minor_indexer]`
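
As a minimal sketch of the notation above, the following compares an omitted column axis with an explicit null slice on a small DataFrame. The object `frame` and its labels are made up for illustration and are not used elsewhere in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical 2x3 frame used only for this illustration.
frame = pd.DataFrame(
    np.arange(6).reshape(2, 3),
    index=["a", "b"],
    columns=["x", "y", "z"],
)

row_implicit = frame.loc["a"]      # column axis left out
row_explicit = frame.loc["a", :]   # column axis given as the null slice

# Both forms select the same row.
print(row_implicit.equals(row_explicit))

# The null slice on the row axis selects a whole column.
col = frame.loc[:, "y"]
print(col.tolist())
```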

In [1]:
import pandas as pd
import numpy as np

# 1. `.loc[]` is label based (but may also be used with a boolean array):  

- Every label asked for must be in the index, or a `KeyError` will be raised.
- When slicing, both the start bound **AND** the stop bound are included, if present in the index.
- Integers are valid labels, but they refer to the label and **not the position**.  
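
The three rules above can be checked with a short sketch. The Series names and values here are illustrative assumptions, chosen so that labels and positions differ.

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40], index=list("abcd"))

# Rule 1: a label that is not in the index raises KeyError.
try:
    s.loc["z"]
    missing_raises = False
except KeyError:
    missing_raises = True

# Rule 2: a label slice includes the stop bound,
# unlike a plain Python list slice.
label_slice = s.loc["b":"c"]
list_slice = [10, 20, 30, 40][1:2]

# Rule 3: an integer passed to .loc is a label, not a position.
s_int = pd.Series([10, 20, 30], index=[2, 0, 1])
by_label = s_int.loc[0]  # the element labeled 0, stored at position 1
```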

The `.loc[]` attribute is the **primary access method**. The following are valid inputs:
- A single label, e.g. `5` or `'a'` (note that `5` is interpreted as a label of the index, not as an integer position along the index).
- A list or array of labels `['a', 'b', 'c']`.
- A slice object with labels `'a':'f'` (note that contrary to usual python slices, **both the start and the stop are included**, when present in the index!).
- A boolean array.
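
The cells below cover single labels and slices. For the other two input types, a hedged sketch with a made-up Series might look like this:

```python
import pandas as pd

s = pd.Series([1.5, -0.5, 2.0, -3.0], index=list("abcd"))

# A list of labels: the result follows the order of the list.
picked = s.loc[["d", "a"]]

# A boolean array: keeps the rows where the mask is True.
mask = s > 0
positive = s.loc[mask]
```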

In [2]:
# Seed the random number generator so the output is reproducible.
np.random.seed(seed=0)
s1 = pd.Series(np.random.randn(5), index=list('abcde'))
s1

a    1.764052
b    0.400157
c    0.978738
d    2.240893
e    1.867558
dtype: float64

In [3]:
s1.loc['a']

1.764052345967664

In [4]:
s1.loc['d':'e']

d    2.240893
e    1.867558
dtype: float64

Note that setting works as well:

In [5]:
s1.loc['c':] = 0
s1

a    1.764052
b    0.400157
c    0.000000
d    0.000000
e    0.000000
dtype: float64

In [6]:
np.random.seed(seed=0)
s2 = pd.Series(np.random.randn(5), index=range(0, 10, 2))
s2

0    1.764052
2    0.400157
4    0.978738
6    2.240893
8    1.867558
dtype: float64

In [7]:
s2.loc[4]

0.9787379841057392

Notice that `s2.loc[4]` returns the element with **label** 4, **not position** 4 (which would be 1.867558, the element labeled 8).

# 2. `.iloc[]` is integer position based (from 0 to length-1, but may also be used with a boolean array):
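
The notebook gives no cell for this section, so here is a minimal sketch of position-based selection. It assumes a Series with the same even-integer index as `s2` but simpler values, and a small hypothetical DataFrame.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0], index=range(0, 10, 2))

by_position = s.iloc[4]   # fifth element (label 8)
by_label = s.loc[4]       # element labeled 4 (third element)

# Position slices follow Python semantics: the stop bound is excluded.
window = s.iloc[1:3]

# A boolean NumPy array is also accepted.
large = s.iloc[(s > 3).to_numpy()]

frame = pd.DataFrame(np.arange(12).reshape(4, 3), columns=list("ABC"))
cell = frame.iloc[0, 2]          # first row, third column
rows = frame.iloc[[0, 3], 1]     # first and last row of the second column
```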

# 3. `[]` (a.k.a. \_\_getitem\_\_): the primary function of this option is selecting out lower-dimensional slices:
 - `series[label]` returns a scalar value.
 - `frame[colname]` returns a Series corresponding to colname.
 
**Note**: You can also pass a list of columns to [] to select columns in that order. If a column is not contained in the DataFrame, an exception will be raised.
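
A small sketch of the note above, using a hypothetical three-column frame: the list controls the column order, and a name that is not a column raises `KeyError`.

```python
import pandas as pd

frame = pd.DataFrame({"A": [1, 2], "B": [3, 4], "C": [5, 6]})

# Columns come back in the order given in the list.
reordered = frame[["C", "A"]]

# "Z" is not a column, so the lookup fails.
try:
    frame[["A", "Z"]]
    missing_raises = False
except KeyError:
    missing_raises = True
```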

In [8]:
dates = pd.date_range('2015', periods=5, freq='AS')
df = pd.DataFrame(np.random.randn(5, 3),index=dates, columns=['A', 'B', 'C'])
df

| | A | B | C
| :--- | ---: | ---: | ---: 
|2015-01-01 | -0.977278 | 0.950088 | -0.151357
|2016-01-01 | -0.103219 | 0.410599 | 0.144044
|2017-01-01 | 1.454274 | 0.761038 | 0.121675
|2018-01-01 | 0.443863 | 0.333674 | 1.494079
|2019-01-01 | -0.205158 | 0.313068 | -0.854096


In [9]:
s = df['A']
s

2015-01-01   -0.977278
2016-01-01   -0.103219
2017-01-01    1.454274
2018-01-01    0.443863
2019-01-01   -0.205158
Freq: AS-JAN, Name: A, dtype: float64

In [10]:
dates[3]

Timestamp('2018-01-01 00:00:00', freq='AS-JAN')

In [11]:
s[dates[3]]

0.44386323274542566

In [12]:
df[['B', 'A']]

| | B | A
| :--- | ---: | ---: 
|2015-01-01 | 0.950088 | -0.977278
|2016-01-01 | 0.410599 | -0.103219
|2017-01-01 | 0.761038 | 1.454274
|2018-01-01 | 0.333674 | 0.443863
|2019-01-01 | 0.313068 | -0.205158


Assignment is also possible. You may find this useful for applying a transform (in-place) to a subset of the columns. **Warning**: the following does not work with `.loc[]` or `.iloc[]`, because pandas aligns all axes when setting a Series or DataFrame through `.loc[]` and `.iloc[]`, so the columns are matched by label before the values are assigned.

In [13]:
df[['B', 'A']] = df[['A', 'B']]
df

| | A | B | C
| :--- | ---: | ---: | ---: 
|2015-01-01 | 0.950088 | -0.977278 | -0.151357
|2016-01-01 | 0.410599 | -0.103219 | 0.144044
|2017-01-01 | 0.761038 | 1.454274 | 0.121675
|2018-01-01 | 0.333674 | 0.443863 | 1.494079
|2019-01-01 | 0.313068 | -0.205158 | -0.854096
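
To illustrate the alignment warning above, the sketch below tries the same swap through `.loc[]` on a small made-up frame. Passing the right-hand side as a raw array with `.to_numpy()` is one way to bypass label alignment; treat it as an illustration rather than the only approach.

```python
import pandas as pd

frame = pd.DataFrame({"A": [1, 2], "B": [3, 4]})

# With .loc, the right-hand side is aligned on column labels first,
# so each column is assigned its own values and nothing changes.
aligned = frame.copy()
aligned.loc[:, ["B", "A"]] = aligned[["A", "B"]]

# A raw NumPy array carries no labels, so values are assigned by position.
swapped = frame.copy()
swapped.loc[:, ["B", "A"]] = swapped[["A", "B"]].to_numpy()
```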


# 4. Attribute access  

You may access an index on a Series, column on a DataFrame, and an item on a Panel directly as an attribute.  
`df.A` is equivalent to `df['A']`.  

**Warning**:
- **Assignment is possible only if the column already exists**. Otherwise use `df['col'] = ...`.
- You can use this access only if the index element is a valid Python identifier, e.g. `s.1` is not allowed.
- The attribute will not be available if it conflicts with an existing method name, e.g. `s.min` refers to the method, not to an element labeled `min`.

In any of these cases, standard indexing will still work, e.g. `s['1']`, `s['min']`, and `s['index']`.
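
The caveats above can be sketched as follows. The Series labels and the attribute name `new_col` are chosen only for illustration.

```python
import warnings

import pandas as pd

s = pd.Series([1, 2, 3], index=["min", "x", "1"])

via_attribute = s.x               # "x" is a valid identifier with no conflict
is_method = callable(s.min)       # s.min is the Series method, not the element
via_brackets = s["min"]           # standard indexing still reaches the element
digit_label = s["1"]              # s.1 would be a syntax error

frame = pd.DataFrame({"A": [1, 2]})
frame.A = [10, 20]                # column exists, so assignment updates it

with warnings.catch_warnings():
    warnings.simplefilter("ignore")   # pandas warns about this pattern
    frame.new_col = [0, 0]            # sets a plain attribute, not a column

has_new_column = "new_col" in frame.columns
```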

In [14]:
df.A

2015-01-01    0.950088
2016-01-01    0.410599
2017-01-01    0.761038
2018-01-01    0.333674
2019-01-01    0.313068
Freq: AS-JAN, Name: A, dtype: float64

In [15]:
df['A']

2015-01-01    0.950088
2016-01-01    0.410599
2017-01-01    0.761038
2018-01-01    0.333674
2019-01-01    0.313068
Freq: AS-JAN, Name: A, dtype: float64