
# Boolean Indexing


## `[]`, `.loc` and `.iloc` vs Boolean Indexing subset selection

Previously, we covered subset selection with `[]`, `.loc` and `.iloc`. All three of these **indexers** use either the row/column labels or their integer location to make selections. The actual **data** of the Series/DataFrame is not used at all during the selection. 

In **boolean indexing**, we will select subsets of data based on the actual values of the data in the Series/DataFrame and NOT on their row/column labels or integer locations. 


## Documentation on boolean selection
I will always recommend reading the official documentation in addition to this.

The documentation uses the term **boolean indexing**, but you will also see **boolean selection**.

[Boolean Indexing from pandas documentation](http://pandas.pydata.org/pandas-docs/stable/indexing.html#boolean-indexing)

In [1]:
import pandas as pd
import numpy as np

In [3]:
so = pd.read_csv('data/stackoverflow_qa.csv')
so.head()

Unnamed: 0,id,creationdate,score,viewcount,title,answercount,commentcount,favoritecount,quest_name,quest_rep,ans_name,ans_rep
0,5486226,2011-03-30 12:26:50,4,2113,Rolling median in python,3,4,1.0,yueerhu,125.0,Mike Pennington,26995.0
1,5515021,2011-04-01 14:50:44,8,7015,Compute a compounded return series in Python,3,6,7.0,Jason Strimpel,3301.0,Mike Pennington,26995.0
2,5558607,2011-04-05 21:13:50,2,7392,Sort a pandas DataMatrix in ascending order,2,0,1.0,Jason Strimpel,3301.0,Wes McKinney,43310.0
3,6467832,2011-06-24 12:31:45,9,13056,How to get the correlation between two timeser...,1,0,7.0,user814005,117.0,Wes McKinney,43310.0
4,7577546,2011-09-28 01:58:38,9,2488,"Using pandas, how do I subsample a large DataF...",1,0,5.0,Uri Laserson,958.0,HYRY,54137.0


## Asking simple questions in plain English

Before we get to the technical definition of boolean indexing, let's see some examples of the types of questions it can answer.

* Find all questions that were created before 2014
* Find all questions with a score more than 50
* Find all questions with a score between 50 and 100
* Find all questions answered by Scott Boston
* Find all questions answered by the following 5 users
* Find all questions that were created between March, 2014 and October 2014 that were answered by Unutbu and have score less than 5.
* Find all questions that have score between 5 and 10 or have a view count of greater than 10,000
* Find all questions that are not answered by Scott Boston

You will also see examples like these referred to by the term **queries**.<br>
Each of the above queries has strict logical criteria that must be checked one row at a time.

## Keep or Discard entire row of data
If you were to manually answer the above queries, you would need to scan each row and determine whether the row as a whole meets the criteria or not. If the row meets the criteria, it is kept; if not, it is discarded.

## Each row will have a `True` or `False` value associated with it
When you perform boolean indexing, each row of the DataFrame (or value of a Series) will have a `True` or `False` value associated with it depending on whether or not it meets the criterion. True/False values are known as **boolean**. The documentation refers to the entire procedure as **boolean indexing**. 

Since we are using the booleans to select data, it is sometimes referred to as **boolean selection**. Essentially, we are using booleans to select subsets of data.

## Using `[]` and `.loc` for boolean selection
We will use two of the indexers from the previous notebook, **`[]`** and **`.loc`**, to complete our boolean selections. We will do so by placing a sequence of booleans inside of these indexers. The sequence must have the same number of values as there are rows in the DataFrame (or values in the Series) it is selecting from.


## Focus on `[]` for now
To simplify things, we will only use the brackets, **`[]`**.

In [4]:
so_head = so.head()
so_head

Unnamed: 0,id,creationdate,score,viewcount,title,answercount,commentcount,favoritecount,quest_name,quest_rep,ans_name,ans_rep
0,5486226,2011-03-30 12:26:50,4,2113,Rolling median in python,3,4,1.0,yueerhu,125.0,Mike Pennington,26995.0
1,5515021,2011-04-01 14:50:44,8,7015,Compute a compounded return series in Python,3,6,7.0,Jason Strimpel,3301.0,Mike Pennington,26995.0
2,5558607,2011-04-05 21:13:50,2,7392,Sort a pandas DataMatrix in ascending order,2,0,1.0,Jason Strimpel,3301.0,Wes McKinney,43310.0
3,6467832,2011-06-24 12:31:45,9,13056,How to get the correlation between two timeser...,1,0,7.0,user814005,117.0,Wes McKinney,43310.0
4,7577546,2011-09-28 01:58:38,9,2488,"Using pandas, how do I subsample a large DataF...",1,0,5.0,Uri Laserson,958.0,HYRY,54137.0


### <span style="color:blue">(a) Manually create a list of booleans</span>
For instance, let's begin by creating the following list:

In [5]:
criteria = [True, False, True, False, False]

We can pass this list of booleans to the [ ] operator and complete our selection:

In [6]:
so_head[criteria]

Unnamed: 0,id,creationdate,score,viewcount,title,answercount,commentcount,favoritecount,quest_name,quest_rep,ans_name,ans_rep
0,5486226,2011-03-30 12:26:50,4,2113,Rolling median in python,3,4,1.0,yueerhu,125.0,Mike Pennington,26995.0
2,5558607,2011-04-05 21:13:50,2,7392,Sort a pandas DataMatrix in ascending order,2,0,1.0,Jason Strimpel,3301.0,Wes McKinney,43310.0


# Wait a second... Isn't `[]` just for column selection?

## Operator Overloading
The indexing operator changes behavior based on the type of object that is passed to it. The following pseudocode outlines how the DataFrame indexing operator handles the object that it is passed:
~~~ python
>>> df[item]  # Where `df` is a DataFrame and item is some object
If item is a string then    
    Find a column name that matches the item exactly    
    Raise KeyError if there is no match    
    Return the column as a Series
    
If item is a list of strings then    
    Raise KeyError if one or more strings in item don't match columns    
    Return a DataFrame with just the columns in the list
    
If item is a slice object then   
    Works with either integer or string slices   
    Raise KeyError if label from label slice is not in index   
    Return all ROWS that are selected by the slice

If item is a list, Series or ndarray of booleans then   
    Raise ValueError if length of item not equal to length of DataFrame   
    Use the booleans to return only the rows with True in same location
~~~

In summary, [ ] primarily selects **columns**, but if you pass it a sequence of booleans it will select all **rows** that are **`True`**. The *[ ]* operator is overloaded. This means that, depending on the inputs, pandas will do something completely different. Here are the rules for the different objects you pass to [ ].
* string - return a column as a Series
* list of strings - return all those columns as a DataFrame
* a slice - select rows (can do both label and integer location - confusing!)
* a sequence of booleans - select all rows where **`True`**
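
The rules above can be checked on a tiny DataFrame. The following is a minimal sketch that assumes a made-up five-row frame (the values are illustrative and are not the Stack Overflow data) and passes each kind of object to [ ].

```python
import pandas as pd

# Hypothetical toy data; not the Stack Overflow dataset used in this notebook
df = pd.DataFrame({
    'score': [4, 8, 2, 9, 9],
    'viewcount': [2113, 7015, 7392, 13056, 2488],
    'ans_name': ['Mike', 'Mike', 'Wes', 'Wes', 'HYRY'],
})

one_column = df['score']                             # string -> one column as a Series
two_columns = df[['score', 'ans_name']]              # list of strings -> DataFrame
first_rows = df[:2]                                  # slice -> rows
kept_rows = df[[True, False, True, False, False]]    # booleans -> rows where True

print(type(one_column).__name__)   # Series
print(two_columns.shape)           # (5, 2)
print(first_rows.index.tolist())   # [0, 1]
print(kept_rows.index.tolist())    # [0, 2]
```

Only the last call uses the values `True`/`False` to decide which rows survive; the other three rely on labels or positions.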

## <span style="color:blue">(b) use NumPy arrays</span>
You can also use NumPy arrays to do boolean selection. NumPy arrays have no index, so there is no index alignment to worry about, but your array must be exactly the same length as the object you are doing boolean selection on.

In [7]:
a = np.array([True, False, False, False, False])
so_head[a]

Unnamed: 0,id,creationdate,score,viewcount,title,answercount,commentcount,favoritecount,quest_name,quest_rep,ans_name,ans_rep
0,5486226,2011-03-30 12:26:50,4,2113,Rolling median in python,3,4,1.0,yueerhu,125.0,Mike Pennington,26995.0


## What do you mean by 'sequence'?
I keep using the term **sequence of booleans** to refer to the `True/False` values. Technically, the most common built-in [Python sequence](https://docs.python.org/3/library/stdtypes.html#typesseq) types are lists and tuples. In addition to a list, you will most often be using a pandas Series as your 'sequence' of booleans.

### <span style="color:blue"> Using pandas Series</span>

Let's manually create a boolean Series to select the last three rows of **`so_head`**.

In [10]:
s = pd.Series([False, False, True, True, True])
s

0    False
1    False
2     True
3     True
4     True
dtype: bool

In [11]:
so_head[s]

Unnamed: 0,id,creationdate,score,viewcount,title,answercount,commentcount,favoritecount,quest_name,quest_rep,ans_name,ans_rep
2,5558607,2011-04-05 21:13:50,2,7392,Sort a pandas DataMatrix in ascending order,2,0,1.0,Jason Strimpel,3301.0,Wes McKinney,43310.0
3,6467832,2011-06-24 12:31:45,9,13056,How to get the correlation between two timeser...,1,0,7.0,user814005,117.0,Wes McKinney,43310.0
4,7577546,2011-09-28 01:58:38,9,2488,"Using pandas, how do I subsample a large DataF...",1,0,5.0,Uri Laserson,958.0,HYRY,54137.0


## Index Alignment

A boolean Series is matched to the DataFrame through its index labels, not through the position of its values. The Series **`s`** created above has the index 0 through 4, the same as **`so_head`**, which is why `so_head[s]` returned the last three rows.

## Take care when creating a boolean Series by hand
The above example only worked because the indexes of the boolean Series and of **`so_head`** were exactly the same. Let's output them so you can clearly see this.

In [15]:
s.index

RangeIndex(start=0, stop=5, step=1)

In [16]:
so_head.index

RangeIndex(start=0, stop=5, step=1)

## Boolean selection fails when the index doesn't align
When you are using a boolean Series to do boolean selection, the indexes of both objects must be exactly the same. Let's use **`set_index`** to give the DataFrame a different index, so that it no longer matches the index of the boolean Series.

In [10]:
so_head.set_index('id')[s]

IndexingError: Unalignable boolean Series provided as indexer (index of the boolean Series and of the indexed object do not match).

## `IndexingError`: Unalignable boolean Series!
If the index of the boolean Series and the index of the object you are doing boolean selection on don't match exactly, you will get the above error. This is one reason, as you will see below, why you will almost never create a boolean Series by hand like this.
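
To make the alignment behavior concrete, here is a small sketch on made-up data (the `id` and `score` values are hypothetical). It shows a selection that works, one that fails because the labels differ, and a positional workaround that converts the Series to a NumPy array with `to_numpy`. The exception is caught generically because the import path of `IndexingError` has differed between pandas versions.

```python
import warnings

import pandas as pd

# Hypothetical toy data standing in for so_head
df = pd.DataFrame({'id': [101, 102, 103], 'score': [4, 8, 2]})
mask = pd.Series([False, True, True])      # default index 0, 1, 2

same_index = df[mask]                      # indexes match, so this works

by_id = df.set_index('id')                 # index is now 101, 102, 103
try:
    with warnings.catch_warnings():
        warnings.simplefilter('ignore')    # pandas may warn before it raises
        by_id[mask]
    error_name = None
except Exception as err:                   # generic catch: import path varies by version
    error_name = type(err).__name__

# A NumPy array has no index, so the selection is purely positional
positional = by_id[mask.to_numpy()]

print(same_index.index.tolist())   # [1, 2]
print(error_name)                  # IndexingError
print(positional.index.tolist())   # [102, 103]
```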

## <span style="color:red"> Never create boolean Series by hand</span>
You will likely never create a boolean Series by hand as was done above. Instead, you will produce one from the values of your data.

## Use the comparison operators to create boolean Series
The primary method of creating a Series of booleans is to use one of the six comparison operators: 
* **`<`**
* **`<=`**
* **`>`**
* **`>=`**
* **`==`**
* **`!=`** 
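
As a quick illustration, the sketch below applies each of the six operators to a three-value Series (the values are made up). Every comparison is applied to each value in turn and returns a boolean Series of the same length.

```python
import pandas as pd

scores = pd.Series([4, 10, 25])   # hypothetical values

print((scores < 10).tolist())    # [True, False, False]
print((scores <= 10).tolist())   # [True, True, False]
print((scores > 10).tolist())    # [False, False, True]
print((scores >= 10).tolist())   # [False, True, True]
print((scores == 10).tolist())   # [False, True, False]
print((scores != 10).tolist())   # [True, False, True]
```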

## Use comparison operator with a single column of data
You will almost always use the comparison operators on just a single column or Series of data. For instance, let's create a boolean Series from the **`score`** column.

We select the score column and then test each value against a threshold: first whether it is strictly greater than 10, and then whether it is greater than or equal to 10 (at least 10). Notice that the comparison gets applied to each value in the Series and that a boolean Series is returned. Row 7 has a score of exactly 10, so it is `False` in the first result and `True` in the second.

In [19]:
so['score'] > 10

0        False
1        False
2        False
3        False
4        False
5        False
6         True
7        False
8         True
9        False
10       False
11       False
12       False
13       False
14        True
15       False
16       False
17        True
18       False
19       False
20       False
21       False
22       False
23       False
24        True
25       False
26        True
27       False
28       False
29       False
         ...  
56368    False
56369    False
56370    False
56371    False
56372    False
56373    False
56374    False
56375    False
56376    False
56377    False
56378    False
56379    False
56380    False
56381    False
56382    False
56383    False
56384    False
56385    False
56386    False
56387    False
56388    False
56389    False
56390    False
56391    False
56392    False
56393    False
56394    False
56395    False
56396    False
56397    False
Name: score, Length: 56398, dtype: bool

In [20]:
criteria = so['score'] >= 10
criteria.head(10)

0    False
1    False
2    False
3    False
4    False
5    False
6     True
7     True
8     True
9    False
Name: score, dtype: bool

## Finally making a boolean selection
Now that we have our boolean Series stored in the variable **`criteria`**, we can pass this to [ ] to select only the rows that have a score of at least 10. 

We are going to use the entire **`so`** DataFrame for the rest of the tutorial.

In [21]:
so[criteria]

Unnamed: 0,id,creationdate,score,viewcount,title,answercount,commentcount,favoritecount,quest_name,quest_rep,ans_name,ans_rep
6,7776679,2011-10-15 08:21:17,25,28159,append two data frame with pandas,2,7,4.0,Jean-Pat,882.0,Wes McKinney,43310.0
7,7813132,2011-10-18 20:16:12,10,18917,Convert array of string (category) to array of...,3,0,6.0,Jean-Pat,882.0,Wes McKinney,43310.0
8,7837722,2011-10-20 14:46:14,201,223746,What is the most efficient way to loop through...,8,3,115.0,Muppet,1563.0,Nick Crawford,2779.0
14,8916302,2012-01-18 19:41:27,29,20614,selecting across multiple columns with python ...,3,0,14.0,user248237dfsf,19244.0,Wes McKinney,43310.0
17,8991709,2012-01-24 17:59:53,136,16783,Why are pandas merges in python faster than da...,3,16,60.0,Zach,12484.0,Matt Dowle,41275.0
24,9555635,2012-03-04 14:25:36,19,6604,Open source Enthought Python alternative,8,5,6.0,tshauck,5957.0,ogrisel,24990.0
26,9588331,2012-03-06 17:01:47,22,10038,Simple cross-tabulation in pandas,2,3,5.0,Jon Clements,85944.0,Jeff Hammerbacher,3172.0
31,9652832,2012-03-11 06:00:56,41,39323,How to I load a tsv file into a Pandas DataFrame?,3,1,3.0,screechOwl,8774.0,huon,47402.0
35,9758450,2012-03-18 12:53:06,42,43262,Pandas convert dataframe to array of tuples,7,1,18.0,enrishi,303.0,Wes McKinney,43310.0
37,9762935,2012-03-18 22:34:26,11,8380,Add indexed column to DataFrame with pandas,2,0,2.0,saroele,2055.0,Wes McKinney,43310.0


In [22]:
so_score_10_or_more = so[criteria]
so_score_10_or_more.head()

Unnamed: 0,id,creationdate,score,viewcount,title,answercount,commentcount,favoritecount,quest_name,quest_rep,ans_name,ans_rep
6,7776679,2011-10-15 08:21:17,25,28159,append two data frame with pandas,2,7,4.0,Jean-Pat,882.0,Wes McKinney,43310.0
7,7813132,2011-10-18 20:16:12,10,18917,Convert array of string (category) to array of...,3,0,6.0,Jean-Pat,882.0,Wes McKinney,43310.0
8,7837722,2011-10-20 14:46:14,201,223746,What is the most efficient way to loop through...,8,3,115.0,Muppet,1563.0,Nick Crawford,2779.0
14,8916302,2012-01-18 19:41:27,29,20614,selecting across multiple columns with python ...,3,0,14.0,user248237dfsf,19244.0,Wes McKinney,43310.0
17,8991709,2012-01-24 17:59:53,136,16783,Why are pandas merges in python faster than da...,3,16,60.0,Zach,12484.0,Matt Dowle,41275.0


## Boolean selection in one line
It is possible to put the creation of the boolean Series inside of **[ ]** like this.

In [None]:
so[so['score'] >= 10].head()

## Single condition expression

In [11]:
# step 1 - create boolean Series
criteria = so['ans_name'] == 'Scott Boston'
print(criteria.head())
# step 2 - do boolean selection
so[criteria].head()

0    False
1    False
2    False
3    False
4    False
Name: ans_name, dtype: bool


Unnamed: 0,id,creationdate,score,viewcount,title,answercount,commentcount,favoritecount,quest_name,quest_rep,ans_name,ans_rep
38161,43491342,2017-04-19 09:14:28,4,167,Merging pandas dataframes based on nearest val...,1,0,,AkiRoss,3991.0,Scott Boston,23611.0
38178,43190850,2017-04-03 17:31:33,1,284,Python Seaborn Plot ValueError,2,3,,Ryan,545.0,Scott Boston,23611.0
38237,43176052,2017-04-03 03:21:12,2,39,Convert an indexed pandas matrix to a flat dat...,2,0,,alvas,31923.0,Scott Boston,23611.0
38246,43209525,2017-04-04 14:03:17,5,131,Pandas: Optimal way to MultiIndex columns,2,0,0.0,sparc_spread,5470.0,Scott Boston,23611.0
38275,43211893,2017-04-04 15:45:17,0,38,How to calculate a index series for a event wi...,1,3,,zsljulius,1102.0,Scott Boston,23611.0


## Multiple condition expression

## Use `&`, `|`, `~`
Although Python uses the syntax **`and`**, **`or`**, and **`not`**, these will not work when testing multiple conditions with pandas.

You must use the following operators with pandas:
* **`&`** for **and**
* **`|`** for **or**
* **`~`** for **not**
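
The difference can be seen in a short sketch on made-up data. Python's `and` asks each Series for a single `True`/`False` value, which pandas refuses to give, while `&`, `|` and `~` work element by element.

```python
import pandas as pd

# Hypothetical toy data
df = pd.DataFrame({'score': [3, 7, 12], 'answercount': [2, 1, 4]})
high_score = df['score'] >= 5
many_answers = df['answercount'] >= 2

try:
    high_score and many_answers        # `and` needs one True/False per operand
    and_failed = False
except ValueError:                     # "The truth value of a Series is ambiguous"
    and_failed = True

both = high_score & many_answers       # element-by-element and
either = high_score | many_answers     # element-by-element or
not_high = ~high_score                 # element-by-element not

print(and_failed)          # True
print(both.tolist())       # [False, False, True]
print(either.tolist())     # [True, True, True]
print(not_high.tolist())   # [True, False, False]
```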

## Our first multiple condition expression


In [19]:
so.tail()['score'] > 5

56393    False
56394    False
56395    False
56396    False
56397    False
Name: score, dtype: bool

In [25]:
score_5 = so.tail()['score'] > 5

In [29]:
score_8 = so.tail()['score'] < 8

In [30]:
score_5

56393    False
56394    False
56395    False
56396    False
56397    False
Name: score, dtype: bool

In [34]:
(so.tail()['score'] > 5) & (so.tail()['score'] < 8)

56393    False
56394    False
56395    False
56396    False
56397    False
Name: score, dtype: bool

Now let's find all the questions with a score of at least 5 that were answered by Scott Boston. We first create each condition as its own boolean Series.

In [26]:
criteria_1 = so['score'] >= 5
criteria_2 = so['ans_name'] == 'Scott Boston'

We will then use the **and** operator, the ampersand **`&`**, to combine them.

In [27]:
criteria_all = criteria_1 & criteria_2

We can now pass this final criteria to [ ]

In [28]:
so[criteria_all].head()

Unnamed: 0,id,creationdate,score,viewcount,title,answercount,commentcount,favoritecount,quest_name,quest_rep,ans_name,ans_rep
38246,43209525,2017-04-04 14:03:17,5,131,Pandas: Optimal way to MultiIndex columns,2,0,0.0,sparc_spread,5470.0,Scott Boston,23611.0
38640,42870703,2017-03-18 05:06:22,5,125,Simultaneous operation of groupby and resample...,1,0,1.0,S. Naribole,43.0,Scott Boston,23611.0
44358,45064916,2017-07-12 18:16:49,5,428,How to find the correlation between a group of...,2,5,,BKS,506.0,Scott Boston,23611.0
44814,44877663,2017-07-03 04:07:59,9,1267,Error: float object has no attribute notnull,3,2,1.0,Vivian Tio,181.0,Scott Boston,23611.0
52013,47061564,2017-11-01 18:40:36,5,60,How to create strings from dataframe columns e...,5,0,,hernanavella,1890.0,Scott Boston,23611.0


## Multiple conditions in one line


## Use parentheses to separate conditions


Each condition will be separated like this:
```Python
(so['score'] >= 5) & (so['ans_name'] == 'Scott Boston')
```

We can then drop this expression inside of [ ]

In [29]:
so[(so['score'] >= 5) & (so['ans_name'] == 'Scott Boston')].head()

Unnamed: 0,id,creationdate,score,viewcount,title,answercount,commentcount,favoritecount,quest_name,quest_rep,ans_name,ans_rep
38246,43209525,2017-04-04 14:03:17,5,131,Pandas: Optimal way to MultiIndex columns,2,0,0.0,sparc_spread,5470.0,Scott Boston,23611.0
38640,42870703,2017-03-18 05:06:22,5,125,Simultaneous operation of groupby and resample...,1,0,1.0,S. Naribole,43.0,Scott Boston,23611.0
44358,45064916,2017-07-12 18:16:49,5,428,How to find the correlation between a group of...,2,5,,BKS,506.0,Scott Boston,23611.0
44814,44877663,2017-07-03 04:07:59,9,1267,Error: float object has no attribute notnull,3,2,1.0,Vivian Tio,181.0,Scott Boston,23611.0
52013,47061564,2017-11-01 18:40:36,5,60,How to create strings from dataframe columns e...,5,0,,hernanavella,1890.0,Scott Boston,23611.0
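
The parentheses are needed because `&` binds more tightly than the comparison operators in Python. The sketch below, on made-up numeric columns, shows what happens when they are left out: Python first evaluates `5 & df['answercount']` and then treats the rest as a chained comparison, which fails.

```python
import pandas as pd

# Hypothetical toy data
df = pd.DataFrame({'score': [3, 7, 12], 'answercount': [2, 1, 4]})

try:
    # Parsed as: df['score'] >= (5 & df['answercount']) >= 2
    df['score'] >= 5 & df['answercount'] >= 2
    unparenthesized_failed = False
except (ValueError, TypeError):
    unparenthesized_failed = True

# With parentheses, each comparison is evaluated before the `&`
combined = (df['score'] >= 5) & (df['answercount'] >= 2)

print(unparenthesized_failed)   # True
print(combined.tolist())        # [False, False, True]
```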


## Using an `or` condition

For the **or** condition, we use the pipe **`|`**

In [None]:
so[(so['score'] >= 100) | (so['answercount'] >= 10)].head()

## Reversing a condition with the `not` operator
The tilde character **`~`** represents the **not** operator and reverses a condition. For instance, if we wanted all the questions with score greater than 100, we could do it like this:

In [30]:
so[~(so['score'] <= 100)].head()

Unnamed: 0,id,creationdate,score,viewcount,title,answercount,commentcount,favoritecount,quest_name,quest_rep,ans_name,ans_rep
8,7837722,2011-10-20 14:46:14,201,223746,What is the most efficient way to loop through...,8,3,115.0,Muppet,1563.0,Nick Crawford,2779.0
17,8991709,2012-01-24 17:59:53,136,16783,Why are pandas merges in python faster than da...,3,16,60.0,Zach,12484.0,Matt Dowle,41275.0
75,10373660,2012-04-29 16:10:35,199,207980,Converting a Pandas GroupBy object to DataFrame,5,0,90.0,saveenr,2421.0,Wes McKinney,43310.0
100,10665889,2012-05-19 14:11:42,144,179896,How to take column-slices of dataframe in pandas,7,3,65.0,cpa,988.0,Ted Petrou,10426.0
106,10715965,2012-05-23 08:12:31,340,408347,add one row in a pandas.DataFrame,15,3,89.0,PhE,1988.0,fred,2342.0
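
One caveat is worth a small sketch on made-up data: negating `<=` gives the same rows as `>` only when the column has no missing values. A missing value fails every comparison, so it is `False` for `<= 100`, and the tilde then turns it into `True`.

```python
import numpy as np
import pandas as pd

score = pd.Series([50, 100, 201])              # hypothetical, no missing values
print((~(score <= 100)).tolist())   # [False, False, True]
print((score > 100).tolist())       # [False, False, True]

favorites = pd.Series([1.0, np.nan, 115.0])    # hypothetical, one missing value
negated = ~(favorites <= 100)
direct = favorites > 100
print(negated.tolist())   # [False, True, True]
print(direct.tolist())    # [False, False, True]
```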


## Complex conditions
It is possible to build extremely complex conditions to select rows of your DataFrame that meet very specific criteria. For instance, we can select all questions answered by Scott Boston with **`score`** 5 or more OR questions answered by Ted Petrou with answer count 5 or more.

With multiple conditions, it's probably best to break out the logic into multiple steps:

In [31]:
criteria_1 = (so['score'] >= 5) & (so['ans_name'] == 'Scott Boston')
criteria_2 = (so['answercount'] >= 5) & (so['ans_name'] == 'Ted Petrou')
criteria_all = criteria_1 | criteria_2
so[criteria_all]

Unnamed: 0,id,creationdate,score,viewcount,title,answercount,commentcount,favoritecount,quest_name,quest_rep,ans_name,ans_rep
100,10665889,2012-05-19 14:11:42,144,179896,How to take column-slices of dataframe in pandas,7,3,65.0,cpa,988.0,Ted Petrou,10426.0
38246,43209525,2017-04-04 14:03:17,5,131,Pandas: Optimal way to MultiIndex columns,2,0,0.0,sparc_spread,5470.0,Scott Boston,23611.0
38640,42870703,2017-03-18 05:06:22,5,125,Simultaneous operation of groupby and resample...,1,0,1.0,S. Naribole,43.0,Scott Boston,23611.0
44358,45064916,2017-07-12 18:16:49,5,428,How to find the correlation between a group of...,2,5,,BKS,506.0,Scott Boston,23611.0
44814,44877663,2017-07-03 04:07:59,9,1267,Error: float object has no attribute notnull,3,2,1.0,Vivian Tio,181.0,Scott Boston,23611.0
52013,47061564,2017-11-01 18:40:36,5,60,How to create strings from dataframe columns e...,5,0,,hernanavella,1890.0,Scott Boston,23611.0


## Lots of `or` conditions in a single column - use `isin`
For instance, let's say we wanted to find all the questions answered by Scott Boston, Ted Petrou, MaxU, and unutbu.

One way to do this would be with four conditions joined by `or`.

```Python
criteria = ((so['ans_name'] == 'Scott Boston') | (so['ans_name'] == 'Ted Petrou') | 
            (so['ans_name'] == 'MaxU') | (so['ans_name'] == 'unutbu'))
```

An easier way is to use the Series method **`isin`**. Pass it a list of all the items you want to check for equality.

In [None]:
criteria = so['ans_name'].isin(['Scott Boston', 'Ted Petrou', 'MaxU', 'unutbu'])
criteria.head()

In [None]:
so[criteria].head()
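
As a check that the two approaches agree, the sketch below builds both boolean Series on a made-up column of names and compares them.

```python
import pandas as pd

# Hypothetical answerer names
ans_name = pd.Series(['Scott Boston', 'Wes McKinney', 'MaxU', 'HYRY', 'unutbu'])
names = ['Scott Boston', 'Ted Petrou', 'MaxU', 'unutbu']

with_or = ((ans_name == 'Scott Boston') | (ans_name == 'Ted Petrou') |
           (ans_name == 'MaxU') | (ans_name == 'unutbu'))
with_isin = ans_name.isin(names)

print(with_isin.tolist())          # [True, False, True, False, True]
print(with_or.equals(with_isin))   # True
```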

## Combining `isin` with other criteria
You can use the boolean Series returned by the **`isin`** method in the same way as one produced by the comparison operators. For instance, if we wanted to find all the questions answered by the people above that also have a score greater than 30, we would do the following:

In [None]:
criteria_1 = so['ans_name'].isin(['Scott Boston', 'Ted Petrou', 'MaxU', 'unutbu'])
criteria_2 = so['score'] > 30
criteria_all = criteria_1 & criteria_2
so[criteria_all].tail()

## Use `isnull` to find rows with missing values
The **`isnull`** method returns a boolean Series in which `True` marks each missing value. For instance, questions that do not have an **accepted answer** have missing values for **`ans_name`**. Let's call **`isnull`** on this column.

In [None]:
no_answer = so['ans_name'].isnull()
no_answer.head(6)

This is just another boolean Series which we can pass to *[ ]*

In [None]:
so[no_answer].head()

An alias of **`isnull`** is the **`isna`** method. Alias means it is the same exact method with a different name.
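
A short sketch on a made-up column shows the two names returning the same result. It also shows the opposite selection, either by negating with `~` or with the Series method `notnull`.

```python
import numpy as np
import pandas as pd

ans_name = pd.Series(['Wes McKinney', np.nan, 'HYRY'])   # hypothetical

print(ans_name.isnull().tolist())      # [False, True, False]
print(ans_name.isna().tolist())        # [False, True, False]
print((~ans_name.isnull()).tolist())   # [True, False, True]
print(ans_name.notnull().tolist())     # [True, False, True]
```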

# Boolean Selection on a Series
All the examples thus far have taken place on the **`so`** DataFrame. Boolean selection on a Series happens almost identically. Since there is only one dimension of data, the queries you ask are usually going to be simpler.

First, let's select a single column of data as a Series such as the **`commentcount`** column.

In [32]:
s = so['commentcount']
s.head()

0    4
1    6
2    0
3    0
4    0
Name: commentcount, dtype: int64

Let's test for a number of comments greater than 10.

In [None]:
criteria = s > 10
criteria.head()

Notice that there is no column selection here as we are already down to a single column. Let's pass this criteria to *[ ]* to select just the values greater than 10.

In [None]:
s[criteria].head()

We could have done this in one step like this

In [33]:
s[s > 10].head()

17     16
76     14
566    11
763    12
781    19
Name: commentcount, dtype: int64

If we wanted to find the comment counts greater than 10 but less than 15, we could have used an **and** condition like this:

In [34]:
s[(s > 10) & (s < 15)].head()

76     14
566    11
763    12
787    12
837    13
Name: commentcount, dtype: int64

## Another possibility is the `between` method
Pandas has lots of duplicate functionality built into it. Instead of writing two boolean conditions to select all values inside of a range as was done above, you can use the **`between`** method to create a boolean Series. To use it, pass it the left and right endpoints of the range. By default, both endpoints are inclusive.

So, to replicate the previous example, you could have done this:

In [None]:
s[s.between(11, 14)].head()
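
Because the endpoints are included, `between(11, 14)` matches `(s > 10) & (s < 15)` only for whole numbers. The sketch below checks this on a made-up integer Series.

```python
import pandas as pd

s = pd.Series([4, 10, 11, 14, 15, 19])   # hypothetical comment counts

two_conditions = (s > 10) & (s < 15)
with_between = s.between(11, 14)         # both endpoints are included

print(with_between.tolist())                 # [False, False, True, True, False, False]
print(two_conditions.equals(with_between))   # True
```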

# Simultaneous boolean selection with rows and column labels with `.loc`

Remember that **`.loc`** takes both a row selection and a column selection separated by a comma. Since the row selection comes first, you can pass it the same exact inputs that you do for *[ ]* and get the same results.

Let's take a look at a couple examples from above:

In [None]:
# same as above
so.loc[(so['score'] >= 5) & (so['ans_name'] == 'Scott Boston')]

In [None]:
# same as above
criteria = so['ans_name'].isin(['Scott Boston', 'Ted Petrou', 'MaxU', 'unutbu'])
so.loc[criteria].head()

## Separate row and column selection with a comma for `.loc`
The great benefit of **`.loc`** is that it allows you to simultaneously do boolean selection along the rows and make column selections by label.

For instance, let's say we wanted to find all the questions with more than 20k views but only return the **`creationdate`**, **`viewcount`**, and **`ans_name`** columns. You would do the following.

In [35]:
so.loc[so['viewcount'] > 20000, ['creationdate', 'viewcount', 'ans_name']].head(10)

Unnamed: 0,creationdate,viewcount,ans_name
6,2011-10-15 08:21:17,28159,Wes McKinney
8,2011-10-20 14:46:14,223746,Nick Crawford
14,2012-01-18 19:41:27,20614,Wes McKinney
31,2012-03-11 06:00:56,39323,huon
35,2012-03-18 12:53:06,43262,Wes McKinney
58,2012-04-04 23:17:23,20171,bmu
60,2012-04-08 18:01:13,77902,
71,2012-04-18 03:59:55,88115,ely
75,2012-04-29 16:10:35,207980,Wes McKinney
76,2012-04-29 22:41:28,26110,andrew cooke


You could have broken each selection into pieces like this:

```Python

row_selection = so['viewcount'] > 20000
col_selection = ['creationdate', 'viewcount', 'ans_name']
so.loc[row_selection, col_selection]
```
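
The same pattern can be tried on a small made-up frame. In this sketch, a list of column labels returns a DataFrame, while a single label returns a Series, just as with label-based column selection without a boolean row selection.

```python
import pandas as pd

# Hypothetical toy data
df = pd.DataFrame({
    'viewcount': [2113, 28159, 223746],
    'score': [4, 25, 201],
    'ans_name': ['Mike', 'Wes', 'Nick'],
})

popular = df['viewcount'] > 20000
subset = df.loc[popular, ['viewcount', 'ans_name']]   # list of labels -> DataFrame
single = df.loc[popular, 'ans_name']                  # one label -> Series

print(subset.index.tolist())     # [1, 2]
print(subset.columns.tolist())   # ['viewcount', 'ans_name']
print(single.tolist())           # ['Wes', 'Nick']
```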

## Lots of combinations possible with `.loc`
Remember that **`.loc`** can take a string, a list of strings or a slice. You can use all three possible ways to select your data. You can also make very complex boolean selections for your rows.

Let's select rows with **`favoritecount`** between 30 and 40 and every third column beginning from **`title`** to the end.

In [None]:
# weird but possible
so.loc[so['favoritecount'].between(30, 40), 'title'::3].head()

## Boolean selection for the columns?
It is actually possible to use a sequence of booleans to select columns. You pass a list, Series, or array of booleans the same length as the number of columns to **`.loc`**.

Let's do a simple manual example where we create a list of booleans by hand. First, let's find out how many columns are in our dataset.

In [35]:
so.shape

(56398, 12)

Let's create a list of 12 booleans

In [37]:
col_bools = [True, False, False] * 4
col_bools

[True,
 False,
 False,
 True,
 False,
 False,
 True,
 False,
 False,
 True,
 False,
 False]

Use **`.loc`** to select all rows with just the `True` columns from **`col_bools`**.

In [38]:
so.loc[:, col_bools].head()

Unnamed: 0,id,viewcount,commentcount,quest_rep
0,5486226,2113,4,125.0
1,5515021,7015,6,3301.0
2,5558607,7392,0,3301.0
3,6467832,13056,0,117.0
4,7577546,2488,0,958.0


You can simultaneously select rows and columns too. Let's select the same columns but for rows that have over 500,000 views.

In [39]:
so.loc[so['viewcount'] > 500000, col_bools]

Unnamed: 0,id,viewcount,commentcount,quest_rep
171,11285613,526432,4,3369.0
181,11346283,931604,1,4206.0
397,12555323,698537,0,2920.0
581,13411544,802655,0,8807.0
1253,17071871,549481,0,3374.0
4587,19482970,541299,1,3483.0


# Column to column comparisons
All of the previous comparisons in this notebook were made against a single scalar value. It is also possible to create a boolean Series by comparing one column to another. For instance, we can find all the questions where the answer count is greater than the **`score`**.

In [36]:
criteria = so['answercount'] > so['score']
so[criteria].head()

Unnamed: 0,id,creationdate,score,viewcount,title,answercount,commentcount,favoritecount,quest_name,quest_rep,ans_name,ans_rep
10,8273092,2011-11-25 18:39:02,1,2333,python: pandas install errors,2,0,,codingknob,2279.0,codingknob,2279.0
46,9927711,2012-03-29 14:42:42,1,1659,Reading csv in python pandas and handling bad ...,3,0,2.0,Dave31415,914.0,eumiro,104313.0
54,10003171,2012-04-03 23:59:41,1,404,What is an efficient way in pandas to do summa...,2,1,,LmW.,486.0,Wes McKinney,43310.0
59,10027719,2012-04-05 11:28:00,0,500,Installing Pandas with Python 2.5 on Windows,1,0,,JamesS,191.0,Wes McKinney,43310.0
77,10393447,2012-05-01 04:12:13,0,130,Scope gotcha when dynamically adding methods i...,2,0,,Chris Billington,424.0,Ignacio Vazquez-Abrams,513959.0


In one line, the above would have looked like this:

```Python
so[so['answercount'] > so['score']]
```
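
On a small made-up frame, the comparison is easy to follow by eye: the two columns are compared row by row and the result is one boolean per row.

```python
import pandas as pd

# Hypothetical toy data
df = pd.DataFrame({'score': [1, 9, 0, 3], 'answercount': [2, 1, 1, 3]})

more_answers = df['answercount'] > df['score']   # compared row by row

print(more_answers.tolist())             # [True, False, True, False]
print(df[more_answers].index.tolist())   # [0, 2]
```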

## Almost never use `.iloc` with boolean selection
First, remember that **`.iloc`** uses INTEGER location to make its selections. 

You will rarely use **`.iloc`** to do boolean selection and almost always use *[ ]* or **`.loc`**. To see why, let's try to run a simple boolean selection to find all the rows that have more than 100,000 views.

In [40]:
so.iloc[so['viewcount'] > 100000]

NotImplementedError: iLocation based boolean indexing on an integer type is not available

## `NotImplementedError`
The pandas developers have not implemented boolean selection with a Series for **`.iloc`**, so it does not work. You can, however, convert the Series to a list or a NumPy array as a workaround.

Let's save our Series to a variable and double-check its type.

In [37]:
criteria = so['viewcount'] > 100000
type(criteria)

pandas.core.series.Series

In [38]:
np.array(criteria)

array([False, False, False, ..., False, False, False])

In [39]:
criteria.values # numpy array

array([False, False, False, ..., False, False, False])

Let's grab the underlying NumPy array with the **`values`** attribute and pass it to **`.iloc`**

In [41]:
so['score'] > 10

0        False
1        False
2        False
3        False
4        False
5        False
6         True
7        False
8         True
9        False
10       False
11       False
12       False
13       False
14        True
15       False
16       False
17        True
18       False
19       False
20       False
21       False
22       False
23       False
24        True
25       False
26        True
27       False
28       False
29       False
         ...  
56368    False
56369    False
56370    False
56371    False
56372    False
56373    False
56374    False
56375    False
56376    False
56377    False
56378    False
56379    False
56380    False
56381    False
56382    False
56383    False
56384    False
56385    False
56386    False
56387    False
56388    False
56389    False
56390    False
56391    False
56392    False
56393    False
56394    False
56395    False
56396    False
56397    False
Name: score, Length: 56398, dtype: bool

In [40]:
a = criteria.values  # boolean NumPy array for viewcount > 100000
so.iloc[a].head()

Unnamed: 0,id,creationdate,score,viewcount,title,answercount,commentcount,favoritecount,quest_name,quest_rep,ans_name,ans_rep
8,7837722,2011-10-20 14:46:14,201,223746,What is the most efficient way to loop through...,8,3,115.0,Muppet,1563.0,Nick Crawford,2779.0
75,10373660,2012-04-29 16:10:35,199,207980,Converting a Pandas GroupBy object to DataFrame,5,0,90.0,saveenr,2421.0,Wes McKinney,43310.0
81,10457584,2012-05-05 00:00:12,72,117890,Redefining the Index in a Pandas DataFrame object,2,1,19.0,nitin,2945.0,Avaris,21067.0
100,10665889,2012-05-19 14:11:42,144,179896,How to take column-slices of dataframe in pandas,7,3,65.0,cpa,988.0,Ted Petrou,10426.0
106,10715965,2012-05-23 08:12:31,340,408347,add one row in a pandas.DataFrame,15,3,89.0,PhE,1988.0,fred,2342.0


You can make simultaneous column selection as well with integers.

In [None]:
so.iloc[a, [5, 10, 11]].head()
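A minimal, self-contained sketch of the same idea on made-up data (the rows and the `toy` name are invented for illustration). The boolean Series is converted to a NumPy array for **`.iloc`**, and the result is compared with the equivalent **`.loc`** call that uses labels.

```python
import pandas as pd

# Toy data: invented values, column names borrowed from the so DataFrame
toy = pd.DataFrame({'title': ['a', 'b', 'c', 'd'],
                    'viewcount': [150000, 90, 250000, 4000],
                    'score': [40, 0, 75, 3]})

criteria = toy['viewcount'] > 100000

# .iloc rejects a boolean Series, so hand it the underlying NumPy array
mask = criteria.values

# Rows where mask is True; columns at integer positions 0 and 2
by_position = toy.iloc[mask, [0, 2]]

# The same selection with .loc, which accepts the Series and column labels
by_label = toy.loc[criteria, ['title', 'score']]

print(by_position.equals(by_label))   # True
```

The **`.loc`** version reads more clearly because the columns are named, which is one reason it is usually preferred.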

I don't think I have ever used **`.iloc`** for boolean selection, as it's not implemented for Series. I included it because it's one of the three main indexers in pandas, and it's important to know that it is rarely used for boolean selection.

## `.loc` and `[]` work the same on a Series for boolean selection
Boolean selection works identically for **`.loc`** and *[ ]* on a Series. Both indexers do row selection when passed a boolean Series. Since a Series doesn't have columns, the two indexers are identical in this situation.

In [None]:
s = so['score']

In [None]:
s[s > 100].head()

In [None]:
s.loc[s > 100].head()
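The equivalence can be checked directly. This is a small sketch on an invented Series, not the `score` column from the dataset:

```python
import pandas as pd

# Invented scores for illustration
s = pd.Series([250, 12, 101, 100, 7], name='score')

with_brackets = s[s > 100]
with_loc = s.loc[s > 100]

print(with_brackets.tolist())           # [250, 101]
print(with_brackets.equals(with_loc))   # True
```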

# Summary

* **Boolean Indexing** or **Boolean Selection** is the selection of a subset of a Series/DataFrame based on the values themselves and not the row/column labels or integer location
* Boolean selection is used to answer common queries like "find all the female engineers with a salary over 150k/year"
* To do boolean selection, you first create a sequence of True/False values and pass it to a DataFrame/Series indexer
* Each row of data is kept or discarded
* The indexing operators are **overloaded**: they change functionality depending on what is passed to them
* Typically, you will first create a boolean Series with one of the 6 comparison operators
* You will pass this boolean series to one of the indexers to make your selection
* Use the **`isin`** method to test for multiple equalities in the same column
* Use **`isnull`** to find all rows with missing values in a particular column
* Use the **`between`** Series method to test whether Series values are within a range
* You can create complex criteria with the **and** (**`&`**), **or** (**`|`**), and **not** (**`~`**) logical operators
* When you have multiple conditions in a single line, you must wrap each expression in parentheses
* If you have complex criteria, think about storing each set of criteria into its own variable (i.e. don't do everything in one line)
* If you are only selecting rows (by boolean indexing), then you will almost always use [ ]
* If you are simultaneously doing boolean selection on the rows and selecting column labels then you will use **`.loc`**
* You will almost never use **`.iloc`** to do boolean selection
* Boolean selection works the same for Series as it does for DataFrames
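
Several of the points above can be seen together in one short sketch. The `emp` DataFrame below is made up for illustration (its values and department names are assumptions, not the course data).

```python
import numpy as np
import pandas as pd

# Invented employee data; one salary is missing on purpose
emp = pd.DataFrame({
    'GENDER': ['Female', 'Male', 'Female', 'Male', 'Female'],
    'DEPARTMENT': ['Engineering', 'Library', 'Engineering', 'Parks', 'Library'],
    'SALARY': [160000, 48000, 90000, np.nan, 152000],
})

# Store each criterion in its own variable
is_female = emp['GENDER'] == 'Female'
in_depts = emp['DEPARTMENT'].isin(['Engineering', 'Library'])
high_pay = emp['SALARY'] > 150000
mid_pay = emp['SALARY'].between(40000, 100000)
no_salary = emp['SALARY'].isnull()

# Rows only: use []
female_high = emp[is_female & in_depts & high_pay]
print(female_high.index.tolist())   # [0, 4]

# Rows and column labels together: use .loc
selected = emp.loc[mid_pay | no_salary, ['GENDER', 'SALARY']]
print(selected.index.tolist())      # [1, 2, 3]

# ~ inverts a boolean Series
print(emp[~is_female].index.tolist())   # [1, 3]

# On a single line, each condition needs its own parentheses
one_line = emp[(emp['GENDER'] == 'Female') & (emp['SALARY'] > 150000)]
print(one_line.equals(female_high))     # True
```

Note that the missing salary fails both `>` and `between`, so that row only appears when `isnull` is used explicitly.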

# Exercises
Boolean selection is difficult at first and the syntax is somewhat clunky. It will take some time to master. These questions will start easy and progressively become more difficult.

## Data for exercises

In [8]:
so = pd.read_csv('data/stackoverflow_qa.csv')
so.head()

Unnamed: 0,id,creationdate,score,viewcount,title,answercount,commentcount,favoritecount,quest_name,quest_rep,ans_name,ans_rep
0,5486226,2011-03-30 12:26:50,4,2113,Rolling median in python,3,4,1.0,yueerhu,125.0,Mike Pennington,26995.0
1,5515021,2011-04-01 14:50:44,8,7015,Compute a compounded return series in Python,3,6,7.0,Jason Strimpel,3301.0,Mike Pennington,26995.0
2,5558607,2011-04-05 21:13:50,2,7392,Sort a pandas DataMatrix in ascending order,2,0,1.0,Jason Strimpel,3301.0,Wes McKinney,43310.0
3,6467832,2011-06-24 12:31:45,9,13056,How to get the correlation between two timeser...,1,0,7.0,user814005,117.0,Wes McKinney,43310.0
4,7577546,2011-09-28 01:58:38,9,2488,"Using pandas, how do I subsample a large DataF...",1,0,5.0,Uri Laserson,958.0,HYRY,54137.0


In [10]:
employee = pd.read_csv('data/employee_sample.csv')
employee.head()

Unnamed: 0.1,Unnamed: 0,GENDER,RACE,DEPARTMENT,YEARS EXPERIENCE,SALARY
0,Tom,Male,White,Engineering,23,107962
1,Niko,Male,Black,Engineering,1,30347
2,Penelope,Female,White,Engineering,12,60258
3,Aria,Female,Black,Engineering,8,43618
4,Sofia,Female,Black,Parks & Recreation,23,26125


# Tip!
Append the **`head`** method at the end of your statements to prevent long output as was done above.

# Solutions
Make sure you check your answers with the solutions notebook.

### Exercise 1
<span  style="color:green; font-size:16px">Find all the questions that have exactly 5 answers</span>

In [11]:
# your code here
#so.loc[so['answercount'] == 5, 'title'].head()
so.loc[so.loc[:,'answercount'] == 5, 'title'].head()

75       Converting a Pandas GroupBy object to DataFrame
115    How do I discretize values in a pandas DataFra...
130           pandas: combine two columns in a DataFrame
189    How to group pandas DataFrame entries by date ...
246    How to generate a list from a pandas DataFrame...
Name: title, dtype: object

### Exercise 2
<span  style="color:green; font-size:16px">Find all the questions that have less than 10 views</span>

In [12]:
# your code here
so.loc[so.loc[:,'viewcount'] < 10 , 'title'].head()

7787     How to convert hierarchical DataFrame back fro...
17653    Joining Dataframes in Pandas deletes an existi...
29414    Replace one or more sub-strings from multiple ...
36086                        Saving box plot pandas python
36340    pandas: count the non-duplicated elements when...
Name: title, dtype: object

### Exercise 3
<span  style="color:green; font-size:16px">Find all the questions where the person asking it is the same as the person answering it</span>

In [17]:
# your code here
# Compare the two columns to each other, not to the literal string 'quest_name'
condition = so.loc[:, 'ans_name'] == so.loc[:, 'quest_name']
so[condition].head()


### Exercise 4
<span  style="color:green; font-size:16px">Find all the questions that don't have an accepted answer (ans_name), but have a score of more than 100</span>

In [23]:
# your code here
no_answer1 = so.loc[:,'ans_name'].isnull()
score_100  = so.loc[:,'score'] > 100
condition_all = no_answer1 & score_100
so[condition_all].head()

Unnamed: 0,id,creationdate,score,viewcount,title,answercount,commentcount,favoritecount,quest_name,quest_rep,ans_name,ans_rep
184,11350770,2012-07-05 18:57:34,149,155776,pandas + dataframe - select by partial string,7,0,52.0,euforia,816.0,,
514,13148429,2012-10-30 22:22:59,312,228357,How to change the order of DataFrame columns?,19,1,90.0,Timmie,1689.0,,
527,13187778,2012-11-02 00:57:33,102,202329,"Convert pandas dataframe to numpy array, prese...",8,0,45.0,mister.nobody.nz,511.0,,
560,13331698,2012-11-11 13:48:53,154,174964,How to apply a function to two columns of Pand...,9,4,66.0,bigbug,7865.0,,
5013,19377969,2013-10-15 09:42:52,128,144045,Combine two columns of text in dataframe in pa...,10,1,49.0,user2866103,831.0,,


### Exercise 5
<span  style="color:green; font-size:16px">Find all the questions where the reputation of the person asking the question is higher than the person answering it. Then find the percentage of times this happens</span>

In [48]:
# your code here
condition2 = so.loc[:,'quest_rep'] > so.loc[:,'ans_rep']
so1 = so[condition2]
print(so.shape)
print(so1.shape)
percent = so1.shape[0] / so.shape[0] * 100  # compute from the data rather than hard-coding the counts
percent

(56398, 12)
(2240, 12)


3.9717720486542074
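
An alternative way to get the percentage is to take the mean of the boolean Series, since `True` counts as 1 and `False` as 0. The sketch below uses invented reputations, and the `toy` name is hypothetical.

```python
import numpy as np
import pandas as pd

# Invented reputations; one answerer is missing (NaN)
toy = pd.DataFrame({'quest_rep': [500.0, 20.0, 3000.0, 75.0],
                    'ans_rep': [100.0, 9000.0, np.nan, 75.0]})

higher = toy['quest_rep'] > toy['ans_rep']

# A comparison against NaN is False, so row 2 is not counted
print(higher.tolist())   # [True, False, False, False]

# Fraction of True values, expressed as a percentage
percent = higher.mean() * 100
print(percent)           # 25.0
```

Because rows with a missing value compare as `False`, they still count in the denominator. The same is true of the shape-based calculation above.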

### Exercise 6
<span  style="color:green; font-size:16px">Find all the questions where the number of answers is between 5 and 10 inclusive, and the number of views is less than 1,000.</span>

In [45]:
# your code here
so2 = so.loc[so.loc[:, 'answercount'].between(5,10)]
so2

Unnamed: 0,id,creationdate,score,viewcount,title,answercount,commentcount,favoritecount,quest_name,quest_rep,ans_name,ans_rep
8,7837722,2011-10-20 14:46:14,201,223746,What is the most efficient way to loop through...,8,3,115.0,Muppet,1563.0,Nick Crawford,2779.0
24,9555635,2012-03-04 14:25:36,19,6604,Open source Enthought Python alternative,8,5,6.0,tshauck,5957.0,ogrisel,24990.0
35,9758450,2012-03-18 12:53:06,42,43262,Pandas convert dataframe to array of tuples,7,1,18.0,enrishi,303.0,Wes McKinney,43310.0
42,9850954,2012-03-24 10:26:50,14,4498,pandas - get most recent value of a particular...,6,0,5.0,enrishi,303.0,,
75,10373660,2012-04-29 16:10:35,199,207980,Converting a Pandas GroupBy object to DataFrame,5,0,90.0,saveenr,2421.0,Wes McKinney,43310.0
87,10511024,2012-05-09 06:45:29,71,48836,"in Ipython notebook, Pandas is not displaying ...",6,3,22.0,chrisfs,1617.0,Tal Yarkoni,1710.0
100,10665889,2012-05-19 14:11:42,144,179896,How to take column-slices of dataframe in pandas,7,3,65.0,cpa,988.0,Ted Petrou,10426.0
115,10791661,2012-05-29 00:06:52,7,12210,How do I discretize values in a pandas DataFra...,5,0,3.0,Uri Laserson,958.0,lbolla,4552.0
129,10951341,2012-06-08 15:01:32,48,19879,Pandas DataFrame aggregate function using mult...,6,0,27.0,user1444817,241.0,,
130,10972410,2012-06-10 21:12:43,19,46428,pandas: combine two columns in a DataFrame,5,0,5.0,BFTM,895.0,BrenBarn,136870.0
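
The cell above applies only the answer-count condition, which is why rows with hundreds of thousands of views appear in the output. A minimal sketch of combining it with the view-count condition, on invented data (the `toy` name and values are assumptions):

```python
import pandas as pd

# Invented counts for illustration
toy = pd.DataFrame({'answercount': [8, 5, 10, 4, 11, 7],
                    'viewcount': [223746, 999, 120, 50, 10, 1000]})

# between is inclusive on both ends by default
enough_answers = toy['answercount'].between(5, 10)
few_views = toy['viewcount'] < 1000

# Both conditions must hold for a row to be kept
both = toy[enough_answers & few_views]
print(both.index.tolist())   # [1, 2]
```

Applied to `so`, the same pattern would be `so[so['answercount'].between(5, 10) & (so['viewcount'] < 1000)]`.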


### Exercise 7
<span  style="color:green; font-size:16px">Find the inverse of exercise 6. Verify your results by adding the number of rows of both returned DataFrames to see if it matches the number of rows of the original</span>

In [50]:
# your code here
print(so2.shape)
so3 = so.loc[~so.loc[:, 'answercount'].between(5,10)]
print(so3.shape)

(473, 12)
(55925, 12)
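
The two shapes printed above do add up: 473 + 55,925 = 56,398, the number of rows in `so`. The same check can be written in code. This sketch uses invented data and the full two-part criteria from exercise 6 (the `toy` name and values are assumptions):

```python
import pandas as pd

# Invented counts for illustration
toy = pd.DataFrame({'answercount': [8, 5, 10, 4, 11, 7],
                    'viewcount': [223746, 999, 120, 50, 10, 1000]})

criteria = toy['answercount'].between(5, 10) & (toy['viewcount'] < 1000)

selected = toy[criteria]
inverse = toy[~criteria]   # ~ flips every True/False value

# Every row lands in exactly one of the two subsets
print(len(selected) + len(inverse) == len(toy))   # True
```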


### Use the employee data for the rest of the exercises

In [51]:
import pandas as pd

In [54]:
employee = pd.read_csv('data/employee.csv')
employee

FileNotFoundError: [Errno 2] File b'data/employee.csv' does not exist: b'data/employee.csv'

### Exercise 8
<span  style="color:green; font-size:16px">Find all the **`Black or African American`** females that work in the **`Houston Police Department-HPD`**</span>

In [None]:
# your code here

### Exercise 9
<span  style="color:green; font-size:16px">Find the females that have a salary over 100,000 OR males with salary under 50,000</span>

In [None]:
# your code here

### Exercise 10
<span  style="color:green; font-size:16px">Find the females in the following departments with salary over 60,000 (Parks & Recreation, Solid Waste Management, Fleet Management Department, Library)  </span>

In [None]:
# your code here

### Exercise 11
<span  style="color:green; font-size:16px">Find all the males with salary over 100,000. Return only the race, gender and salary columns</span>

In [None]:
# your code here

### Exercise 12
<span  style="color:green; font-size:16px">Select all salaries as a Series in a separate variable. From this series select all salaries under 25,000</span>

In [None]:
# your code here

### Exercise 13
<span  style="color:green; font-size:16px">Get the same exact result as exercise 12, but make your selection from the employee DataFrame. Use only a single line of code</span>

In [None]:
# your code here