## Advanced filtering 
### - [Selection by Callable](#callable)
### - [Selecting random samples](#sample)
### - [Setting with enlargement](#enlargement)
### - [map() and applymap() functions](#map)
### - [Indexing with isin](#isin)
### - [The where() Method](#where)
### - [The mask() Method](#mask)


<a id=callable></a>
## Selection by callable

.loc, .iloc, and [] indexing can all accept a callable as the indexer. The callable must take one argument (the calling Series or DataFrame) and return valid output for indexing.

In [2]:
import pandas as pd
import numpy as np

In [5]:
df1 = pd.DataFrame(np.random.randn(6, 4),
                   index=list('abcdef'),
                   columns=list('ABCD'))
df1

Unnamed: 0,A,B,C,D
a,-1.087399,2.002552,1.60198,0.564242
b,0.666815,0.01434,1.05637,-0.051857
c,-0.721554,0.537243,0.040829,0.133982
d,-0.222398,-0.073338,-0.163332,1.405695
e,-0.398351,-0.946127,-0.397846,0.904402
f,-0.303648,1.177625,2.328959,0.129193


### Use a lambda function for selection

In [11]:
df1.loc[lambda df: df['A'] > 0, :]

Unnamed: 0,A,B,C,D
b,0.666815,0.01434,1.05637,-0.051857


In [7]:
df1.loc[:, lambda df: ['A', 'B']]

Unnamed: 0,A,B
a,-1.087399,2.002552
b,0.666815,0.01434
c,-0.721554,0.537243
d,-0.222398,-0.073338
e,-0.398351,-0.946127
f,-0.303648,1.177625


In [13]:
df1.iloc[:, lambda df: [0, 1]]

Unnamed: 0,A,B
a,-1.087399,2.002552
b,0.666815,0.01434
c,-0.721554,0.537243
d,-0.222398,-0.073338
e,-0.398351,-0.946127
f,-0.303648,1.177625


In [14]:
df1[lambda df: df.columns[0]]

a   -1.087399
b    0.666815
c   -0.721554
d   -0.222398
e   -0.398351
f   -0.303648
Name: A, dtype: float64

In [15]:
# callable indexing in Series.
df1['A'].loc[lambda s: s > 0]

b    0.666815
Name: A, dtype: float64
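
A callable indexer is most useful in method chains, where the intermediate DataFrame has no name to refer to. The sketch below illustrates this under stated assumptions: the `sales` data and the `revenue` column are hypothetical and are not part of the notebook above.

```python
import pandas as pd

# Hypothetical data, used only for illustration.
sales = pd.DataFrame({'region': ['north', 'south', 'north', 'east'],
                      'units': [10, 3, 7, 12],
                      'price': [2.0, 5.0, 2.5, 1.0]})

result = (
    sales
    .assign(revenue=lambda df: df['units'] * df['price'])
    # The lambda receives the DataFrame returned by assign(),
    # so it can filter on the freshly created 'revenue' column.
    .loc[lambda df: df['revenue'] > 15, ['region', 'revenue']]
)
print(result)
```

Without the callable, the filtered frame would have to be stored in a temporary variable before it could be indexed.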

## Combining positional and label-based indexing

### Use df.columns.get_loc() to get a column number

In [16]:
df1.columns.get_loc('A')

0

In [18]:
df1.iloc[[0, 2], df1.columns.get_loc('A')]

a   -1.087399
c   -0.721554
Name: A, dtype: float64

### Use df.index to get row labels

In [20]:
df1.loc[df1.index[[0, 2]], 'A']

a   -1.087399
c   -0.721554
Name: A, dtype: float64
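
get_loc() handles a single label. For several labels at once, Index.get_indexer() returns an array of positions that can be passed to .iloc. A minimal sketch, using a small hypothetical frame with deterministic values instead of the random data above:

```python
import numpy as np
import pandas as pd

# Deterministic stand-in for df1 (values 0..23 instead of random numbers).
df = pd.DataFrame(np.arange(24).reshape(6, 4),
                  index=list('abcdef'),
                  columns=list('ABCD'))

# Translate column labels into integer positions.
col_positions = df.columns.get_indexer(['A', 'C'])

# Rows by position, columns by (translated) label.
subset = df.iloc[[0, 2], col_positions]
print(subset)
```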

<a id=sample></a>
## Selecting random samples on a Series

The sample() method returns a random selection of rows or columns from a Series or DataFrame. It samples rows by default, and accepts either a specific number of rows/columns to return or a fraction of rows.

In [24]:
s = pd.Series([0, 1, 2, 3, 4, 5])

In [25]:
# When no arguments are passed, returns 1 row.
s.sample()

2    2
dtype: int64

In [26]:
# One may specify either a number of rows:
s.sample(n=3)

2    2
3    3
5    5
dtype: int64

In [27]:
# Or a fraction of the rows:
s.sample(frac=0.5)

1    1
0    0
5    5
dtype: int64

By default, sample will return each row at most once, but one can also sample with replacement using the replace option:

In [28]:
# Without replacement (default):
s.sample(n=6, replace=False)

2    2
4    4
3    3
0    0
5    5
1    1
dtype: int64

In [29]:
# With replacement:
s.sample(n=6, replace=True)

5    5
4    4
5    5
5    5
0    0
4    4
dtype: int64

By default, each row has an equal probability of being selected, but if you want rows to have different probabilities, you can pass the sample function sampling weights as weights. These weights can be a list, a NumPy array, or a Series, but they must be of the same length as the object you are sampling. Missing values will be treated as a weight of zero, and inf values are not allowed. If weights do not sum to 1, they will be re-normalized by dividing all weights by the sum of the weights. For example:

In [30]:
example_weights = [0, 0, 0.2, 0.2, 0.2, 0.4]

s.sample(n=3, weights=example_weights)

5    5
3    3
4    4
dtype: int64

In [31]:
# Weights will be re-normalized automatically
example_weights2 = [0.5, 0, 0, 0, 0, 0]

s.sample(n=1, weights=example_weights2)

0    0
dtype: int64
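
The two rules stated above (missing weights count as zero, and weights are re-normalized) can be checked with a quick experiment. This is only a sketch: the weight values are made up, and sampling is done with replacement so that many draws can be taken from a six-element Series.

```python
import numpy as np
import pandas as pd

s = pd.Series([0, 1, 2, 3, 4, 5])

# NaN is treated as a weight of zero; 2 and 6 do not sum to 1,
# so they are re-normalized to 0.25 and 0.75.
weights = [np.nan, 0, 0, 0, 2, 6]

draws = s.sample(n=1000, replace=True, weights=weights, random_state=0)
counts = draws.value_counts()
print(counts)
```

Only the values 4 and 5 should appear, with 5 drawn roughly three times as often as 4.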

## Selecting random samples on a DataFrame

When sampling rows (not columns) of a DataFrame, you can use one of its columns as the sampling weights by passing the column name as a string.

In [32]:
df2 = pd.DataFrame({'col1': [9, 8, 7, 6],
                    'weight_column': [0.5, 0.4, 0.1, 0]})


df2.sample(n=3, weights='weight_column')

Unnamed: 0,col1,weight_column
0,9,0.5
2,7,0.1
1,8,0.4


In [33]:
df3 = pd.DataFrame({'col1': [1, 2, 3], 'col2': [2, 3, 4]})

# Sample columns instead of rows by passing axis=1.
df3.sample(n=1, axis=1)

Unnamed: 0,col1
0,1
1,2
2,3


Finally, one can also set a seed for sample’s random number generator using the random_state argument, which will accept either an integer (as a seed) or a NumPy RandomState object.

In [35]:
df4 = pd.DataFrame({'col1': [1, 2, 3], 'col2': [2, 3, 4]})

# With a given seed, the sample will always draw the same rows.
df4.sample(n=2, random_state=2)

Unnamed: 0,col1,col2
2,3,4
1,2,3


In [36]:
df4.sample(n=2, random_state=2)

Unnamed: 0,col1,col2
2,3,4
1,2,3
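
As a small check of the reproducibility claim, the sketch below compares two draws made with the same integer seed, and two draws made with identically seeded RandomState objects. The frame and the seed values are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 3, 4, 5], 'col2': [2, 3, 4, 5, 6]})

# Same integer seed: the same rows come back in the same order.
first = df.sample(n=2, random_state=42)
second = df.sample(n=2, random_state=42)

# Two RandomState objects created from the same seed behave the same way.
rs_a = df.sample(n=2, random_state=np.random.RandomState(7))
rs_b = df.sample(n=2, random_state=np.random.RandomState(7))

print(first.equals(second), rs_a.equals(rs_b))
```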


<a id=enlargement></a>
## Setting with enlargement

The .loc/[] operations can perform enlargement when setting a non-existent key for that axis.

In the Series case this is effectively an appending operation.


In [38]:
se = pd.Series([1, 2, 3])
se

0    1
1    2
2    3
dtype: int64

In [40]:
# adding a new key/value pair will add a row
se[5] = 5.
se

0    1.0
1    2.0
2    3.0
5    5.0
dtype: float64

### Enlarging a DataFrame
A DataFrame can be enlarged on either axis via .loc.

In [43]:
dfi = pd.DataFrame(np.arange(6).reshape(3, 2),
                   columns=['A', 'B'])
dfi

Unnamed: 0,A,B
0,0,1
1,2,3
2,4,5


In [44]:
dfi.loc[:, 'C'] = dfi.loc[:, 'A']
dfi

Unnamed: 0,A,B,C
0,0,1,0
1,2,3,2
2,4,5,4


In [45]:
# Setting a new row label with a scalar fills every column of the new row.
dfi.loc[3] = 5
dfi

Unnamed: 0,A,B,C
0,0,1,0
1,2,3,2
2,4,5,4
3,5,5,5
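
Two details are worth illustrating here: a new row can also be given one value per column, and enlargement is a feature of label-based setting only. The sketch below assumes a fresh two-column frame; the row values are arbitrary.

```python
import numpy as np
import pandas as pd

dfi = pd.DataFrame(np.arange(6).reshape(3, 2), columns=['A', 'B'])

# A list supplies one value per column for the new row label 3.
dfi.loc[3] = [6, 7]

# Positional setting cannot enlarge: position 4 does not exist.
try:
    dfi.iloc[4] = [8, 9]
    iloc_enlarged = True
except IndexError:
    iloc_enlarged = False

print(dfi)
print('iloc enlarged the frame:', iloc_enlarged)
```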


<a id=map></a>
## map() and applymap() functions

In [47]:
# Named filter1 to avoid shadowing the built-in filter().
filter1 = dfi['A'].map(lambda x: x>3)
dfi[filter1]

Unnamed: 0,A,B,C
2,4,5,4
3,5,5,5


In [56]:
# flag every element >= 3
# (DataFrame.applymap was deprecated in pandas 2.1 in favor of DataFrame.map)
filter2 = dfi.applymap(lambda x: x>=3)
filter2

Unnamed: 0,A,B,C
0,False,False,False
1,False,True,False
2,True,True,True
3,True,True,True


In [57]:
# True for each row that contains at least one True
filter2.T.any()

0    False
1     True
2     True
3     True
dtype: bool

In [58]:
dfi[filter2.T.any()]

Unnamed: 0,A,B,C
1,2,3,2
2,4,5,4
3,5,5,5
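
For a simple comparison like this one, a vectorized expression gives the same boolean frame without calling a Python function per element, and any(axis=1) reduces across columns without a transpose. A sketch under the assumption that dfi holds the values shown above:

```python
import pandas as pd

# Same values as dfi after the enlargement steps above.
dfi = pd.DataFrame({'A': [0, 2, 4, 5],
                    'B': [1, 3, 5, 5],
                    'C': [0, 2, 4, 5]})

# Element-wise comparison, no lambda needed.
elementwise = dfi >= 3

# axis=1 reduces across the columns of each row.
rows_any = elementwise.any(axis=1)

selected = dfi[rows_any]
print(selected)
```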


In [59]:
df2 = pd.DataFrame({'a': ['one', 'one', 'two', 'three', 'two', 'one', 'six'],
                    'b': ['x', 'y', 'y', 'x', 'y', 'x', 'x'],
                    'c': np.random.randn(7)})

In [60]:
# only want 'two' or 'three'
criterion = df2['a'].map(lambda x: x.startswith('t'))
df2[criterion]

Unnamed: 0,a,b,c
2,two,y,1.085893
3,three,x,0.919324
4,two,y,1.093819


### List comprehension is slower than map()

In [62]:
# equivalent but slower
df2[[x.startswith('t') for x in df2['a']]]

Unnamed: 0,a,b,c
2,two,y,1.085893
3,three,x,0.919324
4,two,y,1.093819


In [63]:
# Multiple criteria
df2[criterion & (df2['b'] == 'x')]

Unnamed: 0,a,b,c
3,three,x,0.919324
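
The same criterion can also be written with the vectorized .str accessor, which avoids the lambda. This is a sketch on a copy of the string columns only; the random column c is left out because it does not affect the filter.

```python
import pandas as pd

df2 = pd.DataFrame({'a': ['one', 'one', 'two', 'three', 'two', 'one', 'six'],
                    'b': ['x', 'y', 'y', 'x', 'y', 'x', 'x']})

# Vectorized string test, equivalent to map(lambda x: x.startswith('t')).
criterion = df2['a'].str.startswith('t')

# Combine with a second condition using & (parentheses are required).
both = df2[criterion & (df2['b'] == 'x')]
print(both)
```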


<a id=isin></a>
## Indexing with isin

Consider the isin() method of Series, which returns a boolean vector that is true wherever the Series elements exist in the passed list. This allows you to select rows where one or more columns have values you want:
    

In [64]:
s = pd.Series(np.arange(5), index=np.arange(5)[::-1], dtype='int64')
s

4    0
3    1
2    2
1    3
0    4
dtype: int64

In [65]:
s.isin([2, 4, 6])

4    False
3    False
2     True
1    False
0     True
dtype: bool

In [66]:
s[s.isin([2, 4, 6])]

2    2
0    4
dtype: int64
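
Because isin() returns a boolean Series, it can be inverted with ~ to select everything that is not in the list. A short sketch using the same Series as above:

```python
import numpy as np
import pandas as pd

s = pd.Series(np.arange(5), index=np.arange(5)[::-1], dtype='int64')

# ~ flips the boolean mask: keep values that are NOT in the list.
not_in = s[~s.isin([2, 4, 6])]
print(not_in)
```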

DataFrame also has an isin() method. When calling isin, pass a set of values as either an array or dict. If values is an array, isin returns a DataFrame of booleans that is the same shape as the original DataFrame, with True wherever the element is in the sequence of values.

In [76]:
df = pd.DataFrame({'vals': [1, 2, 3, 4], 'ids': ['a', 'b', 'f', 'n'],
                   'ids2': ['a', 'n', 'c', 'n']})

In [69]:
values = ['a', 'b', 1, 3]
df.isin(values)

Unnamed: 0,vals,ids,ids2
0,True,True,True
1,False,True,False
2,True,False,False
3,False,False,False


Oftentimes you’ll want to match certain values with certain columns. Just make values a dict where the key is the column, and the value is a list of items you want to check for.

In [70]:
values = {'ids': ['a', 'b'], 'vals': [1, 3]}
df.isin(values)

Unnamed: 0,vals,ids,ids2
0,True,True,False
1,False,True,False
2,True,False,False
3,False,False,False


Combine DataFrame’s isin with the any() and all() methods to quickly select subsets of your data that meet given criteria. To select the rows where each column meets its own criterion:

In [71]:
values = {'ids': ['a', 'b'], 'ids2': ['a', 'c'], 'vals': [1, 3]}
row_mask = df.isin(values).all(axis=1)
df[row_mask]

Unnamed: 0,vals,ids,ids2
0,1,a,a
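
Replacing all() with any() loosens the selection to rows where at least one column meets its criterion. The sketch below rebuilds the same frame and dict so that it runs on its own:

```python
import pandas as pd

df = pd.DataFrame({'vals': [1, 2, 3, 4],
                   'ids': ['a', 'b', 'f', 'n'],
                   'ids2': ['a', 'n', 'c', 'n']})
values = {'ids': ['a', 'b'], 'ids2': ['a', 'c'], 'vals': [1, 3]}

hits = df.isin(values)

# Rows where every column matches vs. rows where any column matches.
all_rows = df[hits.all(axis=1)]
any_rows = df[hits.any(axis=1)]
print(any_rows)
```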


<a id=where> </a>
## The where() Method and Masking

Selecting values from a Series with a boolean vector generally returns a subset of the data. To guarantee that selection output has the same shape as the original data, you can use the where method in Series and DataFrame.

Plain boolean indexing returns only the matching rows:

In [73]:
# To return only the selected rows:
s[s > 0]

3    1
2    2
1    3
0    4
dtype: int64

In [74]:
# To return the same shape as the original but fill with NaN for false conditions
s.where(s > 0)

4    NaN
3    1.0
2    2.0
1    3.0
0    4.0
dtype: float64

Selecting values from a DataFrame with a boolean DataFrame as the criterion also preserves the shape of the input data, because where is used under the hood as the implementation. The code below is equivalent to df.where(df < 0).

In [85]:
df = pd.DataFrame(np.random.randn(6, 4),
                   index=list('abcdef'),
                   columns=list('ABCD'))
df[df<0]

Unnamed: 0,A,B,C,D
a,-0.070857,,,
b,-2.053981,,,-0.061725
c,,-0.642826,,-0.971308
d,-0.849711,-1.765664,-1.251683,-0.776705
e,,,,
f,-0.933014,,,


In [86]:
df.where(df<0)

Unnamed: 0,A,B,C,D
a,-0.070857,,,
b,-2.053981,,,-0.061725
c,,-0.642826,,-0.971308
d,-0.849711,-1.765664,-1.251683,-0.776705
e,,,,
f,-0.933014,,,


In addition, where takes an optional other argument that supplies the replacement values wherever the condition is False in the returned copy.

In [87]:
df.where(df < 0, -df)

Unnamed: 0,A,B,C,D
a,-0.070857,-1.476863,-0.32184,-0.582583
b,-2.053981,-0.155098,-0.971747,-0.061725
c,-1.38265,-0.642826,-1.482905,-0.971308
d,-0.849711,-1.765664,-1.251683,-0.776705
e,-0.815752,-0.598098,-1.681905,-0.083716
f,-0.933014,-0.412416,-0.339498,-0.313542
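
Both the condition and other may also be callables that take the calling Series or DataFrame as their single argument, in the same spirit as the callable indexers shown earlier. A minimal sketch with a hypothetical integer frame (named df3 here only for this example):

```python
import pandas as pd

# Hypothetical data, chosen so the result is easy to verify by eye.
df3 = pd.DataFrame({'A': [1, 2, 3],
                    'B': [4, 5, 6],
                    'C': [7, 8, 9]})

# Keep values greater than 4; elsewhere, use the original value plus 10.
result = df3.where(lambda x: x > 4, lambda x: x + 10)
print(result)
```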


You may wish to set values based on some boolean criteria. This can be done intuitively like so:

In [88]:
df2 = df.copy()
df2[df2 < 0] = 0
df2

Unnamed: 0,A,B,C,D
a,0.0,1.476863,0.32184,0.582583
b,0.0,0.155098,0.971747,0.0
c,1.38265,0.0,1.482905,0.0
d,0.0,0.0,0.0,0.0
e,0.815752,0.598098,1.681905,0.083716
f,0.0,0.412416,0.339498,0.313542
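
The same result can be obtained without modifying a copy in place by passing the replacement to where. The sketch below uses a small hypothetical frame with fixed values so the two approaches can be compared directly:

```python
import pandas as pd

# Hypothetical fixed values instead of random numbers.
df = pd.DataFrame({'A': [-1.0, 2.0, -3.0],
                   'B': [4.0, -5.0, 6.0]})

# Approach 1: boolean setting on a copy.
assigned = df.copy()
assigned[assigned < 0] = 0

# Approach 2: where keeps values where the condition is True
# and substitutes 0 elsewhere, returning a new object.
replaced = df.where(df >= 0, 0)

print(replaced.equals(assigned))
```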


By default, where returns a modified copy of the data. There is an optional parameter inplace so that the original data can be modified without creating a copy:

In [89]:
df_orig = df.copy()
df_orig.where(df > 0, -df, inplace=True)
df_orig

Unnamed: 0,A,B,C,D
a,0.070857,1.476863,0.32184,0.582583
b,2.053981,0.155098,0.971747,0.061725
c,1.38265,0.642826,1.482905,0.971308
d,0.849711,1.765664,1.251683,0.776705
e,0.815752,0.598098,1.681905,0.083716
f,0.933014,0.412416,0.339498,0.313542


<a id=mask></a>
## Mask
mask() is the inverse boolean operation of where: it replaces values where the condition is True (with NaN by default) and keeps the rest.

In [90]:
s.mask(s >= 0)

4   NaN
3   NaN
2   NaN
1   NaN
0   NaN
dtype: float64

In [91]:
df.mask(df >= 0)

Unnamed: 0,A,B,C,D
a,-0.070857,,,
b,-2.053981,,,-0.061725
c,,-0.642826,,-0.971308
d,-0.849711,-1.765664,-1.251683,-0.776705
e,,,,
f,-0.933014,,,
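
The relationship between the two methods can be checked directly: masking with a condition should match where with the negated condition. Like where, mask accepts a replacement value as its second argument. A sketch with a small hypothetical Series:

```python
import pandas as pd

s = pd.Series([0, 1, 2, 3, 4])
cond = s >= 2

# mask hides values where cond is True; where hides them where cond is False.
masked = s.mask(cond)
same_as_where = masked.equals(s.where(~cond))

# Supply a replacement instead of the default NaN.
replaced = s.mask(cond, -1)

print(same_as_where)
print(replaced)
```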
