# Requirements

In [1]:
import numpy as np
import pandas as pd

# Dataframe

We create a very simple dataframe with three columns, `alpha`, `beta` and `gamma`, as well as a non-trivial index.

In [2]:
indices = 'ABCDEFGHIJK'

df = pd.DataFrame({
    'alpha': [i for i in range(1, 1 + len(indices))],
    'beta': [i**2 for i in range(1, 1 + len(indices))],
    'gamma': [i**3 for i in range(1, 1 + len(indices))],
    'idx': [c for c in indices],
})
df.set_index('idx', inplace=True)

In [3]:
df

| idx | alpha | beta | gamma |
|-----|-------|------|-------|
| A | 1 | 1 | 1 |
| B | 2 | 4 | 8 |
| C | 3 | 9 | 27 |
| D | 4 | 16 | 64 |
| E | 5 | 25 | 125 |
| F | 6 | 36 | 216 |
| G | 7 | 49 | 343 |
| H | 8 | 64 | 512 |
| I | 9 | 81 | 729 |
| J | 10 | 100 | 1000 |
| K | 11 | 121 | 1331 |


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 11 entries, A to K
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   alpha   11 non-null     int64
 1   beta    11 non-null     int64
 2   gamma   11 non-null     int64
dtypes: int64(3)
memory usage: 352.0+ bytes


# Row selection

Rows can be selected either by row number or by index value.  Selection by value is likely to be the better choice if the index has some semantics, e.g., a datetime or a unique identifier.

## By row number

To select by row numbers, use the `iloc` locator object.  It takes an `int`, a slice, or a list of `int`.

In [27]:
df.iloc[7]

alpha      8
beta      64
gamma    512
Name: H, dtype: int64

In this case, a pandas `Series` is returned.
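
The difference between a single `int` and a list containing one `int` is easy to overlook, so here is a small sketch of it.  It rebuilds the example dataframe so that it runs on its own; the variable names `row_as_series`, `row_as_frame` and `last_row` are only illustrative.

```python
import pandas as pd

# Rebuild the example dataframe so that this snippet runs on its own.
labels = list('ABCDEFGHIJK')
numbers = range(1, 1 + len(labels))
df = pd.DataFrame(
    {
        'alpha': [i for i in numbers],
        'beta': [i**2 for i in numbers],
        'gamma': [i**3 for i in numbers],
    },
    index=pd.Index(labels, name='idx'),
)

row_as_series = df.iloc[7]     # a single int yields a Series
row_as_frame = df.iloc[[7]]    # a one-element list yields a DataFrame
last_row = df.iloc[-1]         # negative positions count from the end

print(type(row_as_series).__name__, type(row_as_frame).__name__)
print(last_row.name)
```

Passing a one-element list is a convenient way to keep the result two-dimensional when later code expects a dataframe.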

To select a slice, e.g., from the second to the fourth row, the familiar Python slicing syntax can be used.

In [5]:
df.iloc[1:4]

| idx | alpha | beta | gamma |
|-----|-------|------|-------|
| B | 2 | 4 | 8 |
| C | 3 | 9 | 27 |
| D | 4 | 16 | 64 |


It is also possible to provide a list of row numbers.

In [6]:
df.iloc[[1, 3, 5, 7]]

| idx | alpha | beta | gamma |
|-----|-------|------|-------|
| B | 2 | 4 | 8 |
| D | 4 | 16 | 64 |
| F | 6 | 36 | 216 |
| H | 8 | 64 | 512 |


## By index value

To select by index value, the `loc` object can be used.  It too takes a single index value, a slice or a list of values.  In our example, the index consists of `str` values from `A` to `K`, in order.

In [28]:
df.loc['B']

alpha    2
beta     4
gamma    8
Name: B, dtype: int64

As for the `iloc` locator, a pandas `Series` object is returned for a single value.

Slicing with index values is also possible. *Note:* as opposed to the familiar slicing semantics in Python, the end value is inclusive in this case.

In [7]:
df.loc['B':'E']

| idx | alpha | beta | gamma |
|-----|-------|------|-------|
| B | 2 | 4 | 8 |
| C | 3 | 9 | 27 |
| D | 4 | 16 | 64 |
| E | 5 | 25 | 125 |
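
To make the difference in slicing semantics concrete, the following sketch compares a positional slice with a label slice on a rebuilt copy of the example dataframe.  The names `by_position` and `by_label` are only illustrative.

```python
import pandas as pd

# Rebuild the example dataframe so that this snippet runs on its own.
labels = list('ABCDEFGHIJK')
numbers = range(1, 1 + len(labels))
df = pd.DataFrame(
    {
        'alpha': [i for i in numbers],
        'beta': [i**2 for i in numbers],
        'gamma': [i**3 for i in numbers],
    },
    index=pd.Index(labels, name='idx'),
)

by_position = df.iloc[1:4]     # end position 4 is excluded: rows B, C, D
by_label = df.loc['B':'D']     # end label 'D' is included: rows B, C, D

print(list(by_position.index))
print(list(by_label.index))
```

Both selections contain the same three rows, although the positional slice stops before position 4 while the label slice includes its end label.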


Finally, a list of index values can also be used.

In [8]:
df.loc[['B', 'D', 'H']]

| idx | alpha | beta | gamma |
|-----|-------|------|-------|
| B | 2 | 4 | 8 |
| D | 4 | 16 | 64 |
| H | 8 | 64 | 512 |
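
One thing to keep in mind with lists of index values: as far as I know, pandas 1.0 and later raise a `KeyError` when any of the requested labels is missing.  The sketch below shows this and one possible way around it; the label `'Z'` and the names `wanted`, `present` and `subset` are hypothetical.

```python
import pandas as pd

# Rebuild the example dataframe so that this snippet runs on its own.
labels = list('ABCDEFGHIJK')
numbers = range(1, 1 + len(labels))
df = pd.DataFrame(
    {
        'alpha': [i for i in numbers],
        'beta': [i**2 for i in numbers],
        'gamma': [i**3 for i in numbers],
    },
    index=pd.Index(labels, name='idx'),
)

wanted = ['B', 'Z', 'H']       # 'Z' is a hypothetical label that is not in the index

try:
    df.loc[wanted]
    missing_label_raised = False
except KeyError:
    missing_label_raised = True

# Keep only the labels that actually occur in the index before selecting.
present = df.index.intersection(wanted)
subset = df.loc[present]
print(missing_label_raised, list(subset.index))
```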


# Column selection

Just as for rows, columns can be selected by column number, or column name.  In most situations the latter is preferred since the order and even the number of columns may change over time.

## By column number

The `iloc` selector object is again used for selecting by column number.  The first position selects rows, so `:` is used there to keep all rows; the second position takes either an `int`, a slice or a list of integers.

In [29]:
df.iloc[:, 1]

idx
A      1
B      4
C      9
D     16
E     25
F     36
G     49
H     64
I     81
J    100
K    121
Name: beta, dtype: int64

In [10]:
df.iloc[:, 0:2]

| idx | alpha | beta |
|-----|-------|------|
| A | 1 | 1 |
| B | 2 | 4 |
| C | 3 | 9 |
| D | 4 | 16 |
| E | 5 | 25 |
| F | 6 | 36 |
| G | 7 | 49 |
| H | 8 | 64 |
| I | 9 | 81 |
| J | 10 | 100 |
| K | 11 | 121 |


In [9]:
df.iloc[:, [0, 2]]

| idx | alpha | gamma |
|-----|-------|-------|
| A | 1 | 1 |
| B | 2 | 8 |
| C | 3 | 27 |
| D | 4 | 64 |
| E | 5 | 125 |
| F | 6 | 216 |
| G | 7 | 343 |
| H | 8 | 512 |
| I | 9 | 729 |
| J | 10 | 1000 |
| K | 11 | 1331 |
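
The remark that column order may change over time can be illustrated with a small sketch.  Here a hypothetical reordered copy, `reordered`, stands in for a dataframe whose layout has changed; selection by position then silently returns a different column.

```python
import pandas as pd

# Rebuild the example dataframe so that this snippet runs on its own.
labels = list('ABCDEFGHIJK')
numbers = range(1, 1 + len(labels))
df = pd.DataFrame(
    {
        'alpha': [i for i in numbers],
        'beta': [i**2 for i in numbers],
        'gamma': [i**3 for i in numbers],
    },
    index=pd.Index(labels, name='idx'),
)

# Hypothetical situation: the same data arrives with the columns in another order.
reordered = df[['gamma', 'alpha', 'beta']]

name_before = df.iloc[:, 1].name           # column at position 1 in the original
name_after = reordered.iloc[:, 1].name     # column at position 1 after reordering
print(name_before, name_after)
```

Position 1 refers to `beta` in the original dataframe and to `alpha` in the reordered one, which is why selecting by name is usually the more robust choice.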


## By column names

Selection by column name is much more often used, and typically more appropriate.

A single column can be accessed by simply using the column name as a dataframe attribute.

In [11]:
df.alpha

idx
A     1
B     2
C     3
D     4
E     5
F     6
G     7
H     8
I     9
J    10
K    11
Name: alpha, dtype: int64

This is a nice syntax, but note that it will not work when column names contain spaces or characters that would lead to Python syntax errors, or when a column name coincides with an existing dataframe attribute or method.  In general, it is good practice to use straightforward column names.  If that is not possible, you can use an alternative approach that is a bit more cumbersome.

In [30]:
df['alpha']

idx
A     1
B     2
C     3
D     4
E     5
F     6
G     7
H     8
I     9
J    10
K    11
Name: alpha, dtype: int64
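
The sketch below illustrates two situations in which attribute access is not an option.  The column names `'delta value'` and `'count'` are hypothetical and were chosen only to trigger the problem; `compile` is used merely to show that the attribute form is not valid Python.

```python
import pandas as pd

df = pd.DataFrame({'alpha': [1, 2, 3]})
df['delta value'] = df['alpha'] * 10    # hypothetical column name containing a space
df['count'] = [7, 8, 9]                 # hypothetical name that clashes with DataFrame.count

spaced = df['delta value']              # bracket syntax works for any column name
clash_attr = df.count                   # attribute access finds the method, not the column
clash_col = df['count']                 # bracket syntax returns the column

# Writing df.delta value is not valid Python, so it cannot even be parsed.
try:
    compile('df.delta value', '<example>', 'eval')
    attribute_syntax_ok = True
except SyntaxError:
    attribute_syntax_ok = False

print(callable(clash_attr), attribute_syntax_ok)
```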

To select columns, slicing cannot be used directly with the `[]` operator, since a slice there selects rows.  However, a list of column names can be used, which is very useful in many situations.

In [12]:
df[['gamma', 'beta']]

| idx | gamma | beta |
|-----|-------|------|
| A | 1 | 1 |
| B | 8 | 4 |
| C | 27 | 9 |
| D | 64 | 16 |
| E | 125 | 25 |
| F | 216 | 36 |
| G | 343 | 49 |
| H | 512 | 64 |
| I | 729 | 81 |
| J | 1000 | 100 |
| K | 1331 | 121 |
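
If a range of adjacent columns is needed, one option is to slice by column name through `loc`.  The sketch below contrasts this with a bare slice inside `[]`, which operates on rows; the names `column_range` and `row_range` are only illustrative.

```python
import pandas as pd

# Rebuild the example dataframe so that this snippet runs on its own.
labels = list('ABCDEFGHIJK')
numbers = range(1, 1 + len(labels))
df = pd.DataFrame(
    {
        'alpha': [i for i in numbers],
        'beta': [i**2 for i in numbers],
        'gamma': [i**3 for i in numbers],
    },
    index=pd.Index(labels, name='idx'),
)

column_range = df.loc[:, 'alpha':'beta']   # label slice over columns, end inclusive
row_range = df[1:3]                        # a bare slice in [] selects rows, not columns

print(list(column_range.columns))
print(list(row_range.index))
```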


# Selecting both rows and columns

Both the `iloc` and the `loc` selector objects can be used to select a range of rows and columns simultaneously.

In [32]:
df.iloc[1:9:2, [0, 2]]

| idx | alpha | gamma |
|-----|-------|-------|
| B | 2 | 8 |
| D | 4 | 64 |
| F | 6 | 216 |
| H | 8 | 512 |


In [33]:
df.loc['B':'I':2, ['alpha', 'gamma']]

| idx | alpha | gamma |
|-----|-------|-------|
| B | 2 | 8 |
| D | 4 | 64 |
| F | 6 | 216 |
| H | 8 | 512 |
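
That the two selections really are the same can be checked programmatically.  The following sketch uses `DataFrame.equals` on a rebuilt copy of the example dataframe; `positional` and `labelled` are illustrative names.

```python
import pandas as pd

# Rebuild the example dataframe so that this snippet runs on its own.
labels = list('ABCDEFGHIJK')
numbers = range(1, 1 + len(labels))
df = pd.DataFrame(
    {
        'alpha': [i for i in numbers],
        'beta': [i**2 for i in numbers],
        'gamma': [i**3 for i in numbers],
    },
    index=pd.Index(labels, name='idx'),
)

positional = df.iloc[1:9:2, [0, 2]]                  # positions 1, 3, 5, 7
labelled = df.loc['B':'I':2, ['alpha', 'gamma']]     # labels B, D, F, H

same_selection = positional.equals(labelled)
print(same_selection)
```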


# Sampling rows

It can often be useful to sample a subset of the rows in a pandas dataframe.  It is easy to get the head or the tail of a dataframe by using the methods with those names, optionally providing the number of rows you want to select (the default is 5 rows).

In [34]:
df.head(3)

| idx | alpha | beta | gamma |
|-----|-------|------|-------|
| A | 1 | 1 | 1 |
| B | 2 | 4 | 8 |
| C | 3 | 9 | 27 |


In [35]:
df.tail(4)

| idx | alpha | beta | gamma |
|-----|-------|------|-------|
| H | 8 | 64 | 512 |
| I | 9 | 81 | 729 |
| J | 10 | 100 | 1000 |
| K | 11 | 121 | 1331 |


A random sample of the rows can be obtained using the `sample` method.

In [36]:
df.sample(n=5)

| idx | alpha | beta | gamma |
|-----|-------|------|-------|
| D | 4 | 16 | 64 |
| G | 7 | 49 | 343 |
| E | 5 | 25 | 125 |
| K | 11 | 121 | 1331 |
| C | 3 | 9 | 27 |


The `sample` method has many useful options.  It is possible to sample a given fraction of the rows, sample with replacement, or do weighted sampling.
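
A minimal sketch of these options follows, using the `frac`, `replace`, `weights` and `random_state` parameters of `sample`.  The seed value 42 and the choice of `beta` as the weight column are arbitrary assumptions made for the illustration.

```python
import pandas as pd

# Rebuild the example dataframe so that this snippet runs on its own.
labels = list('ABCDEFGHIJK')
numbers = range(1, 1 + len(labels))
df = pd.DataFrame(
    {
        'alpha': [i for i in numbers],
        'beta': [i**2 for i in numbers],
        'gamma': [i**3 for i in numbers],
    },
    index=pd.Index(labels, name='idx'),
)

# Roughly half of the rows; random_state makes the draw reproducible.
half = df.sample(frac=0.5, random_state=42)

# With replacement, more rows can be drawn than the dataframe contains.
bootstrap = df.sample(n=20, replace=True, random_state=42)

# Weighted sampling: rows with a larger beta are more likely to be drawn.
weighted = df.sample(n=3, weights='beta', random_state=42)

# The same seed gives the same sample again.
half_again = df.sample(frac=0.5, random_state=42)
print(len(half), len(bootstrap), len(weighted), half.equals(half_again))
```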

# Conditional selection

In many data analysis tasks, you want to select rows based on conditions.  For instance, selecting the rows where `beta` is larger than 64 can be done in two ways.

## Simple queries

The first approach is to create a temporary `Series` of Boolean values with the obvious semantics.

In [14]:
df[df.beta > 64]

| idx | alpha | beta | gamma |
|-----|-------|------|-------|
| I | 9 | 81 | 729 |
| J | 10 | 100 | 1000 |
| K | 11 | 121 | 1331 |


As you can see below, the condition `df.beta > 64` evaluates to a series of Boolean values.

In [37]:
df.beta > 64

idx
A    False
B    False
C    False
D    False
E    False
F    False
G    False
H    False
I     True
J     True
K     True
Name: beta, dtype: bool

In [41]:
df[~(df.beta == 4)]

| idx | alpha | beta | gamma |
|-----|-------|------|-------|
| A | 1 | 1 | 1 |
| C | 3 | 9 | 27 |
| D | 4 | 16 | 64 |
| E | 5 | 25 | 125 |
| F | 6 | 36 | 216 |
| G | 7 | 49 | 343 |
| H | 8 | 64 | 512 |
| I | 9 | 81 | 729 |
| J | 10 | 100 | 1000 |
| K | 11 | 121 | 1331 |


Although this method is quite fast, it may consume considerable memory for large dataframes since a temporary Boolean `Series` object has to be constructed.  In that case, the `query` method is quite useful.

In [15]:
df.query('beta > 64')

| idx | alpha | beta | gamma |
|-----|-------|------|-------|
| I | 9 | 81 | 729 |
| J | 10 | 100 | 1000 |
| K | 11 | 121 | 1331 |


In [44]:
df.query('not(beta == 4)')

| idx | alpha | beta | gamma |
|-----|-------|------|-------|
| A | 1 | 1 | 1 |
| C | 3 | 9 | 27 |
| D | 4 | 16 | 64 |
| E | 5 | 25 | 125 |
| F | 6 | 36 | 216 |
| G | 7 | 49 | 343 |
| H | 8 | 64 | 512 |
| I | 9 | 81 | 729 |
| J | 10 | 100 | 1000 |
| K | 11 | 121 | 1331 |


## Complex queries

More complex selection criteria can be implemented using both approaches, e.g.,

In [39]:
df[((df.beta > 64) | (df.gamma > 10)) & (df.alpha < 7)]

| idx | alpha | beta | gamma |
|-----|-------|------|-------|
| C | 3 | 9 | 27 |
| D | 4 | 16 | 64 |
| E | 5 | 25 | 125 |
| F | 6 | 36 | 216 |


*Note:* the bitwise operators `~`, `&` and `|` must be used rather than the logical operators `not`, `and` and `or`.  Since the bitwise operators bind more tightly than the comparison operators, the parentheses around each comparison are also required.
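
The following sketch shows what goes wrong when this note is ignored.  It only records whether an exception is raised; the flag names `and_fails` and `unparenthesized_fails` are illustrative.

```python
import pandas as pd

# Rebuild the example dataframe so that this snippet runs on its own.
labels = list('ABCDEFGHIJK')
numbers = range(1, 1 + len(labels))
df = pd.DataFrame(
    {
        'alpha': [i for i in numbers],
        'beta': [i**2 for i in numbers],
        'gamma': [i**3 for i in numbers],
    },
    index=pd.Index(labels, name='idx'),
)

# The logical operator 'and' asks for the truth value of a whole Series.
try:
    df[(df.beta > 64) and (df.alpha < 7)]
    and_fails = False
except ValueError:
    and_fails = True

# Without parentheses, 64 | df.gamma is evaluated first, which leaves
# a chained comparison between Series rather than two separate conditions.
try:
    df[df.beta > 64 | df.gamma > 10]
    unparenthesized_fails = False
except (ValueError, TypeError):
    unparenthesized_fails = True

# The correct form: bitwise operator, each comparison in parentheses.
correct = df[(df.beta > 64) | (df.gamma > 10)]
print(and_fails, unparenthesized_fails, len(correct))
```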

The `query` method offers a much cleaner syntax for the same selection query, but it is somewhat less efficient (to my surprise).

In [40]:
df.query('(beta > 64 or gamma > 10) and alpha < 7')

| idx | alpha | beta | gamma |
|-----|-------|------|-------|
| C | 3 | 9 | 27 |
| D | 4 | 16 | 64 |
| E | 5 | 25 | 125 |
| F | 6 | 36 | 216 |
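
In practice the threshold in a query is often not a literal but a Python variable.  The `query` method lets you refer to such a variable by prefixing its name with `@`.  The sketch below wraps this in a hypothetical helper function, `rows_above`, which is not part of pandas.

```python
import pandas as pd

# Rebuild the example dataframe so that this snippet runs on its own.
labels = list('ABCDEFGHIJK')
numbers = range(1, 1 + len(labels))
df = pd.DataFrame(
    {
        'alpha': [i for i in numbers],
        'beta': [i**2 for i in numbers],
        'gamma': [i**3 for i in numbers],
    },
    index=pd.Index(labels, name='idx'),
)


def rows_above(frame, threshold):
    """Return the rows of frame whose beta value exceeds threshold."""
    # @threshold refers to the local Python variable, not to a column.
    return frame.query('beta > @threshold')


selected = rows_above(df, 64)
print(list(selected.index))
```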


## Performance

Testing the performance of the two approaches for a larger dataframe reveals that `query` is the slower option.  However, it is more memory efficient since it avoids the creation of temporary Boolean `Series` objects.

In [17]:
data = pd.DataFrame({
    column: np.random.uniform(0.0, 1.0, size=(500_000, ))
    for column in 'ABCDEFGHIK'
})

In [18]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500000 entries, 0 to 499999
Data columns (total 10 columns):
 #   Column  Non-Null Count   Dtype  
---  ------  --------------   -----  
 0   A       500000 non-null  float64
 1   B       500000 non-null  float64
 2   C       500000 non-null  float64
 3   D       500000 non-null  float64
 4   E       500000 non-null  float64
 5   F       500000 non-null  float64
 6   G       500000 non-null  float64
 7   H       500000 non-null  float64
 8   I       500000 non-null  float64
 9   K       500000 non-null  float64
dtypes: float64(10)
memory usage: 38.1 MB


In [22]:
%timeit data.loc[(data.A > 0.5) & (data.B < 0.5)]

8.92 ms ± 661 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [23]:
%timeit data.query('A > 0.5 and B < 0.5')

11.9 ms ± 1.45 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)


Obviously, you should benchmark for your own use cases.
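
The `%timeit` magic is only available in IPython and Jupyter.  As a rough sketch of how a similar comparison could be run in a plain Python script, the standard library `timeit` module can be used.  The dataframe size, the seed and the helper names `with_mask` and `with_query` are assumptions made for this illustration, and the timings will differ from machine to machine.

```python
import timeit

import numpy as np
import pandas as pd

# A smaller dataframe than the one in the notebook, to keep the run short.
rng = np.random.default_rng(seed=1234)
data = pd.DataFrame({
    column: rng.uniform(0.0, 1.0, size=100_000)
    for column in 'AB'
})


def with_mask():
    """Select rows using a temporary Boolean Series."""
    return data.loc[(data.A > 0.5) & (data.B < 0.5)]


def with_query():
    """Select the same rows using the query method."""
    return data.query('A > 0.5 and B < 0.5')


# Make sure both approaches select the same rows before timing them.
same_result = with_mask().equals(with_query())

timings = {}
for func in (with_mask, with_query):
    # Best of 3 repeats, 10 calls each, converted to seconds per call.
    timings[func.__name__] = min(timeit.repeat(func, number=10, repeat=3)) / 10
    print(f'{func.__name__}: {timings[func.__name__] * 1e3:.2f} ms per call')
```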