# Boolean Indexing Single Conditions

## Overview

### Objectives

+ Boolean Indexing or Boolean Selection is the selection of a subset of a Series/DataFrame based on the **values** themselves and not the row/column labels or integer location
+ Boolean means **True** or **False**
+ Each row of the DataFrame will be kept or discarded based on the boolean value aligned with it
+ Boolean selection is a two-step process
    + First, create a **filter** - a sequence of True/False values the same length as the DataFrame/Series
    + Second, pass this filter to one of the indexers **`[ ]`** or **`loc`**
+ Boolean selection with a boolean Series does not work with `iloc`, which only accepts a plain boolean list or array
+ The indexing operators are overloaded: they change functionality depending on what is passed to them
+ The filter is commonly created by comparing a column of data (a Series) against some scalar value
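
As a rough illustration of the two-step process listed above, the sketch below builds a filter and passes it to both `[ ]` and `loc`. The tiny `rides` DataFrame and its values are made up for this example and are not part of the bikes dataset. The last line shows one workaround for `iloc`, which rejects a boolean Series but accepts the same values as a plain array.

```python
import pandas as pd

# Hypothetical data, invented for illustration only
rides = pd.DataFrame({'rider': ['a', 'b', 'c', 'd'],
                      'minutes': [12, 45, 7, 30]})

# Step 1: create the filter, one True/False value per row
filt = rides['minutes'] > 20

# Step 2: pass the filter to just the brackets or to loc
with_brackets = rides[filt]
with_loc = rides.loc[filt]

# Both indexers keep the same rows (index labels 1 and 3)
print(with_brackets.equals(with_loc))

# iloc needs the booleans as a plain array rather than a Series
with_iloc = rides.iloc[filt.to_numpy()]
print(with_iloc)
```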


## Boolean Indexing
Boolean indexing, also referred to as **Boolean Selection**, is the process of selecting subsets of rows from DataFrames (or Series) based on the actual data values and NOT by their labels or integer locations.

### Examples of Boolean Indexing

Let's see some examples of actual questions (in plain English) that boolean indexing can help us answer from the bikes dataset.

+ Find all male riders
+ Find all rides with duration longer than 2 hours
+ Find all rides that took place between March and June of 2015.
+ Find all the rides with a duration longer than 2 hours by females with temperature higher than 90 degrees

The term **query** is used to refer to these sorts of questions.

### All queries have a logical condition
Each of the above queries has a strict logical condition that must be checked one row at a time.

### Keep or discard an entire row of data
If you were to manually answer the above queries, you would need to scan each row and determine whether the row as a whole meets the condition. If so, then it is kept, otherwise it is discarded.

### Each row will have a True or False value associated with it
When you perform boolean indexing, each row of the DataFrame (or value of a Series) will have a True or False value associated with it depending on whether or not it meets the condition. True/False values are known as booleans. The documentation refers to the entire procedure as boolean indexing. Since we are using the booleans to select subsets of data, it is sometimes referred to as **boolean selection**.

### Beginning with a small DataFrame
We will perform our first boolean indexing on a dataset of 5 rows. Let's assign the head of the bikes dataset to its own variable, `bikes_head`.

In [1]:
import pandas as pd
bikes = pd.read_csv('../data/bikes.csv')
bikes_head = bikes.head()
bikes_head

Unnamed: 0,trip_id,usertype,gender,starttime,stoptime,tripduration,from_station_name,latitude_start,longitude_start,dpcapacity_start,to_station_name,latitude_end,longitude_end,dpcapacity_end,temperature,visibility,wind_speed,precipitation,events
0,7147,Subscriber,Male,2013-06-28 19:01:00,2013-06-28 19:17:00,993,Lake Shore Dr & Monroe St,41.88105,-87.61697,11.0,Michigan Ave & Oak St,41.90096,-87.623777,15.0,73.9,10.0,12.7,-9999.0,mostlycloudy
1,7524,Subscriber,Male,2013-06-28 22:53:00,2013-06-28 23:03:00,623,Clinton St & Washington Blvd,41.88338,-87.64117,31.0,Wells St & Walton St,41.89993,-87.63443,19.0,69.1,10.0,6.9,-9999.0,partlycloudy
2,10927,Subscriber,Male,2013-06-30 14:43:00,2013-06-30 15:01:00,1040,Sheffield Ave & Kingsbury St,41.909592,-87.653497,15.0,Dearborn St & Monroe St,41.88132,-87.629521,23.0,73.0,10.0,16.1,-9999.0,mostlycloudy
3,12907,Subscriber,Male,2013-07-01 10:05:00,2013-07-01 10:16:00,667,Carpenter St & Huron St,41.894556,-87.653449,19.0,Clark St & Randolph St,41.884576,-87.63189,31.0,72.0,10.0,16.1,-9999.0,mostlycloudy
4,13168,Subscriber,Male,2013-07-01 11:16:00,2013-07-01 11:18:00,130,Damen Ave & Pierce Ave,41.909396,-87.677692,19.0,Damen Ave & Pierce Ave,41.909396,-87.677692,19.0,73.0,10.0,17.3,-9999.0,partlycloudy


## Manual filtering of the data
Let's find all the rides with a trip duration greater than 900. We will do this manually by inspecting the data. 

### Create a list of booleans
By inspecting the data, we see that the 1st and 3rd rows have a trip duration greater than 900. A list of 5 boolean values is created, one for each row. The 1st and 3rd values are `True`. The others are `False`.

In [2]:
filt = [True, False, True, False, False]

### Variable name `filt`
The variable name `filt` will be used throughout the book to contain the sequence of booleans. `filt` simply stands for filter. Being consistent with variables makes your code easier to understand.

### Pass this list into just the brackets
The above list has `True` in both the 1st and 3rd position. These will be the rows that are kept during boolean indexing. To formally do boolean indexing, we place the list inside the brackets.

In [None]:
bikes_head[filt]
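
The manual scan described above can also be sketched in plain Python. The stand-in DataFrame below holds only the `trip_id` and `tripduration` values of the five rows shown earlier, so the example runs without `bikes.csv`. The name `bikes_small` is introduced here purely for illustration.

```python
import pandas as pd

# Stand-in for bikes_head: two columns copied from the five rows shown earlier
bikes_small = pd.DataFrame({
    'trip_id': [7147, 7524, 10927, 12907, 13168],
    'tripduration': [993, 623, 1040, 667, 130],
})

# Check the condition one row at a time, as we did by eye
filt = [duration > 900 for duration in bikes_small['tripduration']]
print(filt)

# Rows aligned with True are kept, the rest are discarded
kept = bikes_small[filt]
print(kept)
```

The list built by the loop matches the one typed by hand, and only the 1st and 3rd rows survive the selection.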

### Wait a second… Isn’t `[ ]` just for column selection?

The primary purpose of *just the brackets* for a DataFrame is to select one or more columns by using either a string or a list of strings. Now, all of a sudden, this example is showing that entire rows are selected with boolean values. This is what makes pandas, unfortunately, a confusing library to use.

## Operator Overloading
*Just the brackets* is **overloaded**. This means that, depending on the inputs, pandas will do something completely different. Here are the rules for the different objects you pass to the brackets.

* **string**: return a column as a Series
* **list of strings**: return all those columns as a DataFrame
* **sequence of booleans**: select all rows where True
* **slice**: select rows. A slice can use either labels or integer locations, which makes it ambiguous, so I never do this. Slicing has not been covered yet.

In summary, *just the brackets* primarily selects columns, but if you pass it a sequence of booleans, it will select all rows where the value is True.
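
A minimal sketch of these four behaviors follows. The DataFrame `df` and its column names are hypothetical and exist only to make the example self-contained.

```python
import pandas as pd

# Hypothetical data, invented for illustration only
df = pd.DataFrame({'name': ['x', 'y', 'z'],
                   'score': [1, 5, 9],
                   'level': [2, 4, 6]})

col = df['score']               # string: one column as a Series
cols = df[['name', 'score']]    # list of strings: columns as a DataFrame
rows = df[[True, False, True]]  # sequence of booleans: rows where True
sliced = df[0:2]                # slice: rows, here by integer location

print(type(col).__name__, cols.shape, rows.shape, sliced.shape)
```

The same pair of brackets returns a column in the first two cases and rows in the last two, which is the source of the confusion described above.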

### Using booleans in a Series and not a list
Instead of using a list to contain our booleans, we can store them in a Series. This produces the same output. Below, we use the Series constructor to create a Series object.

In [None]:
filt = pd.Series([True, False, True, False, False])
filt

### Use the boolean Series to do the boolean selection
Placing the Series directly in the brackets will again select only the rows which have True values in the Series.

In [3]:
bikes_head[filt]

Unnamed: 0,trip_id,usertype,gender,starttime,stoptime,tripduration,from_station_name,latitude_start,longitude_start,dpcapacity_start,to_station_name,latitude_end,longitude_end,dpcapacity_end,temperature,visibility,wind_speed,precipitation,events
0,7147,Subscriber,Male,2013-06-28 19:01:00,2013-06-28 19:17:00,993,Lake Shore Dr & Monroe St,41.88105,-87.61697,11.0,Michigan Ave & Oak St,41.90096,-87.623777,15.0,73.9,10.0,12.7,-9999.0,mostlycloudy
2,10927,Subscriber,Male,2013-06-30 14:43:00,2013-06-30 15:01:00,1040,Sheffield Ave & Kingsbury St,41.909592,-87.653497,15.0,Dearborn St & Monroe St,41.88132,-87.629521,23.0,73.0,10.0,16.1,-9999.0,mostlycloudy


## Practical Boolean Selection
We will almost never create boolean lists/Series manually like we did above but instead use the actual data to create them.

### Creating boolean Series from column data
By far the most common way to create a boolean Series will be from the values of one particular column. We will test a condition using one of the six comparison operators:

* `<`
* `<=`
* `>`
* `>=`
* `==`
* `!=`
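
To make the six operators concrete, here is a small sketch that compares one Series against the scalar 1000. The three values are invented for illustration and chosen so that one is below, one equal to, and one above the scalar.

```python
import pandas as pd

# Hypothetical durations, invented for illustration only
s = pd.Series([500, 1000, 1500])

# Each comparison returns a boolean Series the same length as s
results = {
    '<': s < 1000,
    '<=': s <= 1000,
    '>': s > 1000,
    '>=': s >= 1000,
    '==': s == 1000,
    '!=': s != 1000,
}

for op, filt in results.items():
    print(op, filt.tolist())
```

Note how the middle value (exactly 1000) is `True` only for `<=`, `>=`, and `==`.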


### Create a boolean Series
Let's create a boolean Series by determining which rows have a trip duration of over 1000 seconds.

In [None]:
filt = bikes['tripduration'] > 1000
filt.head(10)

### Manually verify correctness
Let's output the head of the `tripduration` Series to manually verify that indeed integer locations 2 and 8 are the ones greater than 1000.

In [None]:
bikes['tripduration'].head(10)

### Complete our boolean indexing
We created our boolean Series, `filt`, using the greater than comparison operator on the `tripduration` column. We can now pass this result into just the brackets to filter the entire DataFrame. Verify that all `tripduration` values are greater than 1000.

In [None]:
bikes[filt].head()

### How many rows have a trip duration greater than 1000?
To answer this question, let's assign the result of the boolean selection to a variable and then retrieve the `shape` of the DataFrame.

In [None]:
bikes.shape

In [None]:
bikes_duration_1000 = bikes[filt]
bikes_duration_1000.shape

About 20% of the rides are longer than 1000 seconds.
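
As a side note, and not the approach used above, the count and the fraction can also be read directly from the filter, because `True` is treated as 1 and `False` as 0 when a boolean Series is summed or averaged. The sketch below uses made-up durations rather than the bikes data, so its numbers say nothing about the real dataset.

```python
import pandas as pd

# Hypothetical durations, invented for illustration only
durations = pd.DataFrame(
    {'tripduration': [400, 1200, 950, 1800, 300, 2500, 700, 1001]}
)

filt = durations['tripduration'] > 1000

n_rows = durations[filt].shape[0]  # count via the filtered DataFrame
n_true = filt.sum()                # count of True values in the filter
fraction = filt.mean()             # share of rows meeting the condition

print(n_rows, n_true, fraction)
```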

## Boolean selection in one line
Often, you will see boolean selection happen in a single line of code instead of the multiple lines we used above. Put the expression for the filter directly inside the brackets.

In [None]:
bikes[bikes['tripduration'] > 1000].head()

I recommend assigning the filter as a separate variable to help with readability.

## Single condition expression
Our first example tested a single condition (whether the trip duration was greater than 1,000). Let's test a different single condition and find all the rides that happened when the weather was cloudy. We use the `==` operator to test for equality and again pass this variable to the brackets, which completes our selection.

In [None]:
filt = bikes['events'] == 'cloudy'
bikes[filt].head()

## Exercises

### Exercise 1
<span  style="color:green; font-size:16px">Read in the movie dataset and set the index to be the title. Select all movies that have Tom Hanks as `actor1`. How many of these movies has he starred in?</span>

### Exercise 2
<span  style="color:green; font-size:16px">Select movies with an IMDB score greater than 9.</span>