# BOOLEAN INDEXING

In [3]:
import pandas as pd
df = pd.read_csv('../data/medals.csv')
df

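| | Year | Medal Type | US | Canada | England | Australia |
|---|---|---|---|---|---|---|
| 0 | 2001 | Gold | 278 | 188 | 39 | 44 |
| 1 | 2001 | Silver | 324 | 235 | 82 | 66 |
| 2 | 2001 | Bronze | 446 | 399 | 100 | 15 |
| 3 | 2002 | Gold | 301 | 298 | 42 | 66 |
| 4 | 2002 | Silver | 378 | 222 | 228 | 88 |
| 5 | 2002 | Bronze | 502 | 245 | 165 | 173 |
| 6 | 2003 | Gold | 321 | 276 | 86 | 163 |
| 7 | 2003 | Silver | 322 | 263 | 76 | 184 |
| 8 | 2003 | Bronze | 423 | 165 | 97 | 136 |
| 9 | 2004 | Gold | 298 | 146 | 43 | 152 |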


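We use Boolean indexing to filter or select parts of the data. Each condition produces a Boolean array with one True/False value per row, and conditions are combined with the & operator, with every comparison wrapped in its own parentheses: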

In [9]:
# .ix has been removed from pandas; .loc accepts the same Boolean mask
df.loc[(df['Medal Type'] == 'Gold') & (df['US'] > 300)]

| | Year | Medal Type | US | Canada | England | Australia |
|---|---|---|---|---|---|---|
| 3 | 2002 | Gold | 301 | 298 | 42 | 66 |
| 6 | 2003 | Gold | 321 | 276 | 86 | 163 |
| 12 | 2005 | Gold | 311 | 248 | 83 | 73 |
| 15 | 2006 | Gold | 378 | 176 | 83 | 47 |


You can also store each Boolean condition in its own variable and then combine the
resulting Boolean arrays to filter the data:

In [10]:
silverSelection = df['Medal Type'] == 'Silver'
englandLow = df['England'] < 50
df.loc[silverSelection & englandLow]

| | Year | Medal Type | US | Canada | England | Australia |
|---|---|---|---|---|---|---|
| 16 | 2006 | Silver | 357 | 251 | 41 | 54 |
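
The element-wise operators `&` (and), `|` (or), and `~` (not) all work on Boolean Series. Each comparison needs its own parentheses, or its own variable, because these operators bind more tightly than `==` or `>`. The following is a minimal sketch on a small made-up DataFrame; the `medals` values are illustrative assumptions and are not the contents of medals.csv.

```python
import pandas as pd

# Hypothetical data for illustration only; not taken from medals.csv.
medals = pd.DataFrame({
    'Medal Type': ['Gold', 'Silver', 'Bronze', 'Gold'],
    'US': [278, 324, 446, 301],
    'England': [39, 82, 100, 42],
})

isGold = medals['Medal Type'] == 'Gold'
usHigh = medals['US'] > 300

# Rows that are Gold AND have more than 300 US medals.
goldAndHigh = medals.loc[isGold & usHigh]

# Rows that are Gold OR have more than 300 US medals.
goldOrHigh = medals.loc[isGold | usHigh]

# Rows that are NOT Gold.
notGold = medals.loc[~isGold]
```

Using Python's `and`, `or`, or `not` here instead would raise an error, because a Series with more than one element has no single truth value.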


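## The isin(), any(), and all() methods
The isin() method tests, element by element, whether each value appears in a given list and returns a Boolean array. We start with a small Series: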

In [11]:
categorySeries = pd.Series(['Mobile', 'Tablet', 'Laptop', 'Watch', 'Desktop', ''])
categorySeries

0     Mobile
1     Tablet
2     Laptop
3      Watch
4    Desktop
5           
dtype: object

In [12]:
categorySeries.isin(['Tablet', 'Laptop', 'Desktop'])

0    False
1     True
2     True
3    False
4     True
5    False
dtype: bool

Here, we use the Boolean array to select a sub-Series containing the values that we're
interested in:

In [13]:
categorySeries[categorySeries.isin(['Tablet', 'Laptop', 'Desktop'])]

1     Tablet
2     Laptop
4    Desktop
dtype: object
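
As a rough sketch of two related points: prefixing the mask with `~` selects the values that are not in the list, and isin() gives the same mask as OR-ing one equality test per candidate value. The names `computers`, `notComputers`, and `byHand` are introduced here only for illustration.

```python
import pandas as pd

categorySeries = pd.Series(['Mobile', 'Tablet', 'Laptop', 'Watch', 'Desktop', ''])
computers = ['Tablet', 'Laptop', 'Desktop']

# ~ flips every True/False, so this keeps the values NOT in the list.
notComputers = categorySeries[~categorySeries.isin(computers)]

# The same mask as isin(), written out as one equality test per value.
byHand = (
    (categorySeries == 'Tablet')
    | (categorySeries == 'Laptop')
    | (categorySeries == 'Desktop')
)
```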

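The isin() method also works on a DataFrame. Passing a dictionary checks each column against its own list of values: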

In [33]:
students = {
    'Henry': { 'class': 'Computer Science', 'grade': 'A' },
    'Peter': { 'class': 'Computer Network', 'grade': 'B' },
    'Mary': { 'class': 'Database', 'grade': 'B' },
    'Jack': { 'class': 'Computer Science', 'grade': 'E' },
    'Susan': { 'class': 'Software Programming', 'grade': 'A' },
    'John': { 'class': 'Database', 'grade': 'C' },
    'Mathew': { 'class': 'Software Programming', 'grade': 'B' },
    'Sam': { 'class': 'Computer Science', 'grade': 'A' }
}
studentsDf = pd.DataFrame(students).T
studentsDf

| | class | grade |
|---|---|---|
| Henry | Computer Science | A |
| Jack | Computer Science | E |
| John | Database | C |
| Mary | Database | B |
| Mathew | Software Programming | B |
| Peter | Computer Network | B |
| Sam | Computer Science | A |
| Susan | Software Programming | A |


In [34]:
studentsDf.isin({ 'class': ['Computer Science'], 'grade': ['A'] })

| | class | grade |
|---|---|---|
| Henry | True | True |
| Jack | True | False |
| John | False | False |
| Mary | False | False |
| Mathew | False | False |
| Peter | False | False |
| Sam | True | True |
| Susan | False | True |


Combining isin() with all() along the rows produces a mask that is True only for rows in which every column matches:

In [35]:
csA = { 'class': ['Computer Science'], 'grade': ['A'] }
csAMask = studentsDf.isin(csA).all(axis=1)  # axis=1 reduces across the columns of each row
studentsDf[csAMask]

| | class | grade |
|---|---|---|
| Henry | Computer Science | A |
| Sam | Computer Science | A |
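
The any() method is the counterpart of all(): it marks a row as True when at least one column matches. The sketch below uses a four-row subset of the students data to keep it short; the names `csOrAMask` and `csOrA` are introduced here for illustration.

```python
import pandas as pd

# A subset of the students data, built directly with an explicit index.
studentsDf = pd.DataFrame(
    {
        'class': ['Computer Science', 'Database', 'Software Programming', 'Database'],
        'grade': ['A', 'B', 'A', 'C'],
    },
    index=['Henry', 'Mary', 'Susan', 'John'],
)

csA = {'class': ['Computer Science'], 'grade': ['A']}

# True for a row when at least one column matches its list of allowed values.
csOrAMask = studentsDf.isin(csA).any(axis=1)
csOrA = studentsDf[csOrAMask]
```

Here Henry matches on both columns and Susan matches on grade alone, so both are selected, while all() would keep only Henry.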


## Using the where() method
The where() method keeps the result of Boolean filtering the same shape as the original data: entries for which the condition is False are replaced with NaN instead of being dropped.

In [36]:
numberSeries = pd.Series([65, 83, 42, 59, 19, 82, 37, 65, 73, 92, 74, 35])
numberSeries

0     65
1     83
2     42
3     59
4     19
5     82
6     37
7     65
8     73
9     92
10    74
11    35
dtype: int64

In [37]:
numberSeries[numberSeries < 50]

2     42
4     19
6     37
11    35
dtype: int64

In [38]:
numberSeries.where(numberSeries < 50)

0      NaN
1      NaN
2     42.0
3      NaN
4     19.0
5      NaN
6     37.0
7      NaN
8      NaN
9      NaN
10     NaN
11    35.0
dtype: float64
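
The result above is float64 rather than int64 because NaN is a floating-point value. where() also accepts a replacement value as its second argument, which is used instead of NaN. A minimal sketch, in which the replacement value 0 is an arbitrary choice for illustration:

```python
import pandas as pd

numberSeries = pd.Series([65, 83, 42, 59, 19, 82, 37, 65, 73, 92, 74, 35])

# Keep values below 50; replace everything else with 0 instead of NaN.
lowOrZero = numberSeries.where(numberSeries < 50, 0)
```

The result has the same length and index as `numberSeries`, so it can be assigned back or compared position by position.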

For a DataFrame, indexing with a Boolean DataFrame already preserves the shape, so calling where() with only a condition gives the same result as plain Boolean indexing:

In [40]:
numberDf = pd.DataFrame([[76, 84, 43, 62], [36, 82, 19, 72], [53, 52, 84, 45], [83, 14, 63, 38]], columns=['A', 'B', 'C', 'D'])
numberDf

| | A | B | C | D |
|---|---|---|---|---|
| 0 | 76 | 84 | 43 | 62 |
| 1 | 36 | 82 | 19 | 72 |
| 2 | 53 | 52 | 84 | 45 |
| 3 | 83 | 14 | 63 | 38 |


In [41]:
numberDf[numberDf > 50]

| | A | B | C | D |
|---|---|---|---|---|
| 0 | 76.0 | 84.0 | NaN | 62.0 |
| 1 | NaN | 82.0 | NaN | 72.0 |
| 2 | 53.0 | 52.0 | 84.0 | NaN |
| 3 | 83.0 | NaN | 63.0 | NaN |


In [42]:
numberDf.where(numberDf > 50)

| | A | B | C | D |
|---|---|---|---|---|
| 0 | 76.0 | 84.0 | NaN | 62.0 |
| 1 | NaN | 82.0 | NaN | 72.0 |
| 2 | 53.0 | 52.0 | 84.0 | NaN |
| 3 | 83.0 | NaN | 63.0 | NaN |
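
On a DataFrame, where() goes beyond plain Boolean indexing once a replacement value is passed as its second argument. The related mask() method does the opposite and replaces the entries where the condition is True. A minimal sketch, in which the replacement value -1 is an arbitrary choice for illustration:

```python
import pandas as pd

numberDf = pd.DataFrame(
    [[76, 84, 43, 62], [36, 82, 19, 72], [53, 52, 84, 45], [83, 14, 63, 38]],
    columns=['A', 'B', 'C', 'D'],
)

# Keep values above 50 and replace the rest with -1 rather than NaN.
highOrFlag = numberDf.where(numberDf > 50, -1)

# mask() is the inverse: entries where the condition is True become NaN.
lowOnly = numberDf.mask(numberDf > 50)
```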
