# More Boolean Selection

In this chapter, we explore several more ways to use boolean selection to filter data.

## Boolean selection on a Series

All of the examples thus far have taken place on DataFrames. Boolean selection on a Series works almost identically. Since there is only one dimension of data, the queries you ask are usually going to be simpler. First, let's select a single column of data as a Series, such as the temperature column from the bikes dataset.

In [1]:
import pandas as pd
bikes = pd.read_csv('input/bikes.csv', parse_dates=['starttime', 'stoptime'])
temp = bikes['temperature']
temp.head(3)

0    73.9
1    69.1
2    73.0
Name: temperature, dtype: float64

Let's select temperatures greater than 90. The procedure is the same as with DataFrames. Create a boolean Series and pass that Series to *just the brackets*.

In [2]:
filt = temp > 90
temp[filt].head(3)

54    91.0
55    91.0
56    91.0
Name: temperature, dtype: float64

Select temperatures less than 0 or greater than 95. Boolean Series built from multiple conditions also work the same way.

In [3]:
filt1 = temp < 0
filt2 = temp > 95
filt = filt1 | filt2
temp[filt].head()

395     96.1
396     96.1
397     96.1
1871    -2.0
2049    -2.0
Name: temperature, dtype: float64
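
The same selection can also be written on a single line. The following is a minimal sketch using a small made-up Series (`temp_demo` and its values are hypothetical, not taken from the bikes dataset). It shows that each comparison needs its own parentheses, because `|` and `&` have higher precedence than `<` and `>` in Python.

```python
import pandas as pd

# Hypothetical temperatures, not taken from the bikes dataset
temp_demo = pd.Series([73.9, -2.0, 96.1, 50.0, 98.2], name='temperature')

# Wrap each comparison in parentheses: | binds more tightly than < and >
extreme = temp_demo[(temp_demo < 0) | (temp_demo > 95)]
print(extreme)
```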

### Set the index as `starttime`

The default index is not very helpful. Let's use the `set_index` method to make the `starttime` column the new index. While this column may not be unique, it does provide us with useful labels for each row.

In [4]:
bikes2 = bikes.set_index('starttime')
bikes2.head(3)

| starttime | gender | stoptime | tripduration | from_station_name | start_capacity | to_station_name | end_capacity | temperature | wind_speed | events |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 2013-06-28 19:01:00 | Male | 2013-06-28 19:17:00 | 993 | Lake Shore Dr & Monroe St | 11 | Michigan Ave & Oak St | 15 | 73.9 | 12.7 | mostlycloudy |
| 2013-06-28 22:53:00 | Male | 2013-06-28 23:03:00 | 623 | Clinton St & Washington Blvd | 31 | Wells St & Walton St | 19 | 69.1 | 6.9 | partlycloudy |
| 2013-06-30 14:43:00 | Male | 2013-06-30 15:01:00 | 1040 | Sheffield Ave & Kingsbury St | 15 | Dearborn St & Monroe St | 23 | 73.0 | 16.1 | mostlycloudy |


Let's get back our temperature Series with its updated index.

In [5]:
temp2 = bikes2['temperature']
temp2.head()

starttime
2013-06-28 19:01:00    73.9
2013-06-28 22:53:00    69.1
2013-06-30 14:43:00    73.0
2013-07-01 10:05:00    72.0
2013-07-01 11:16:00    73.0
Name: temperature, dtype: float64

Let's select temperatures greater than 90. We expect to get a summer month and we do.

In [6]:
filt = temp2 > 90
temp2[filt].head(5)

starttime
2013-07-16 15:13:00    91.0
2013-07-16 15:31:00    91.0
2013-07-16 16:35:00    91.0
2013-07-17 17:08:00    93.0
2013-07-17 17:25:00    93.0
Name: temperature, dtype: float64

Select temperatures less than 0 or greater than 95. We expect to get some winter months in the result and we do.

In [7]:
filt1 = temp2 < 0
filt2 = temp2 > 95
filt = filt1 | filt2
temp2[filt].head()

starttime
2013-08-30 15:33:00    96.1
2013-08-30 15:37:00    96.1
2013-08-30 15:49:00    96.1
2013-12-12 05:13:00    -2.0
2014-01-23 06:15:00    -2.0
Name: temperature, dtype: float64
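
To see how the index labels travel with the selected values, here is a small sketch. The frame `demo`, its timestamps, and its temperatures are all made up for illustration. It also checks the earlier remark that a datetime index built this way may contain repeated labels.

```python
import pandas as pd

# Hypothetical data; the timestamps and values are made up for illustration
demo = pd.DataFrame({
    'starttime': pd.to_datetime(['2013-07-16 15:13', '2013-07-16 15:13',
                                 '2013-12-12 05:13', '2014-01-23 06:15']),
    'temperature': [91.0, 91.0, -2.0, 55.0],
})
temp_demo = demo.set_index('starttime')['temperature']

# The index contains a repeated timestamp, so it is not unique
print(temp_demo.index.is_unique)

# Boolean selection keeps the index label of every row that passes the filter
selected = temp_demo[(temp_demo < 0) | (temp_demo > 90)]
print(selected)
```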

## The `between` method

The `between` method returns a boolean Series by testing whether each value is between two given values. For instance, if we want to select the temperatures between 50 and 60 degrees, we do the following:

In [8]:
filt = temp2.between(50, 60)
filt.head(3)

starttime
2013-06-28 19:01:00    False
2013-06-28 22:53:00    False
2013-06-30 14:43:00    False
Name: temperature, dtype: bool

By default, the `between` method is inclusive of the given values, so temperatures of exactly 50 or 60 would be found in the result. We pass this boolean Series to *just the brackets* to complete the selection.

In [9]:
temp2[filt].head(3)

starttime
2013-09-13 07:55:00    54.0
2013-09-13 08:04:00    57.9
2013-09-13 08:04:00    57.9
Name: temperature, dtype: float64
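
The inclusive behavior can be checked on a few boundary values. This sketch uses a made-up Series (`temp_demo` is hypothetical) and assumes pandas 1.3 or later, where the `inclusive` parameter of `between` accepts the strings `'both'`, `'neither'`, `'left'`, and `'right'`; older versions took a boolean instead.

```python
import pandas as pd

# Hypothetical values chosen to sit on and around the boundaries
temp_demo = pd.Series([49.9, 50.0, 55.0, 60.0, 60.1])

# Default behavior: both endpoints are included
both = temp_demo[temp_demo.between(50, 60)]

# Assumes pandas 1.3 or later, where inclusive accepts a string
neither = temp_demo[temp_demo.between(50, 60, inclusive='neither')]

# The default call matches two comparisons combined with &
manual = temp_demo[(temp_demo >= 50) & (temp_demo <= 60)]

print(both.tolist())
print(neither.tolist())
```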

## Simultaneous boolean selection of rows and column labels with `loc`

The `loc` indexer was thoroughly covered in an earlier chapter and will now be brought up again to show how it can simultaneously select rows with boolean selection and columns by labels.

Remember that `loc` takes both a row selection and a column selection separated by a comma. Since the row selection comes first, you can pass it the same exact inputs that you do for *just the brackets* and get the same results. Let's run some of the previous examples of boolean selection with `loc`. Here, we select all rides with trip duration greater than 1,000.

In [10]:
filt = bikes['tripduration'] > 1000
bikes.loc[filt].head(3)

|  | gender | starttime | stoptime | tripduration | from_station_name | start_capacity | to_station_name | end_capacity | temperature | wind_speed | events |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 2 | Male | 2013-06-30 14:43:00 | 2013-06-30 15:01:00 | 1040 | Sheffield Ave & Kingsbury St | 15 | Dearborn St & Monroe St | 23 | 73.0 | 16.1 | mostlycloudy |
| 8 | Male | 2013-07-03 15:21:00 | 2013-07-03 15:42:00 | 1300 | Clinton St & Washington Blvd | 31 | Wood St & Division St | 15 | 71.1 | 0.0 | cloudy |
| 10 | Male | 2013-07-04 17:17:00 | 2013-07-04 17:42:00 | 1523 | Morgan St & 18th St | 15 | Damen Ave & Pierce Ave | 19 | 79.0 | 9.2 | mostlycloudy |


Here, we select all rides where the weather event was rain, snow, tstorms, or sleet.

In [11]:
filt = bikes['events'].isin(['rain', 'snow', 'tstorms', 'sleet'])
bikes.loc[filt].head(3)

|  | gender | starttime | stoptime | tripduration | from_station_name | start_capacity | to_station_name | end_capacity | temperature | wind_speed | events |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 45 | Male | 2013-07-15 16:43:00 | 2013-07-15 16:55:00 | 727 | Greenwood Ave & 47th St | 15 | State St & Harrison St | 19 | 82.9 | 5.8 | rain |
| 78 | Male | 2013-07-21 16:35:00 | 2013-07-21 17:06:00 | 1809 | Michigan Ave & Pearson St | 23 | Millennium Park | 35 | 82.4 | 11.5 | tstorms |
| 79 | Male | 2013-07-21 16:47:00 | 2013-07-21 17:03:00 | 999 | Carpenter St & Huron St | 19 | Carpenter St & Huron St | 19 | 82.4 | 11.5 | tstorms |


### Separate row and column selection with a comma for `loc`

The nice benefit of `loc` is that it allows us to simultaneously select rows with boolean selection and columns by label. Let's select rides during rain or snow and the columns `events` and `tripduration`.

In [12]:
filt = bikes['events'].isin(['rain', 'snow'])
cols = ['events', 'tripduration']
bikes.loc[filt, cols].head()

|  | events | tripduration |
| --- | --- | --- |
| 45 | rain | 727 |
| 112 | rain | 1395 |
| 124 | rain | 442 |
| 161 | rain | 890 |
| 498 | rain | 978 |
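
The shape of the result depends on what is passed as the column selection. The sketch below uses a small made-up DataFrame (`rides` and its values are hypothetical) to contrast a list of labels, which returns a DataFrame, with a single label, which returns a Series.

```python
import pandas as pd

# Hypothetical rides; values are made up for illustration
rides = pd.DataFrame({
    'gender': ['Male', 'Female', 'Female', 'Male'],
    'tripduration': [727, 1395, 442, 890],
    'events': ['rain', 'cloudy', 'snow', 'rain'],
})

filt = rides['events'].isin(['rain', 'snow'])

# A list of column labels returns a DataFrame
subset = rides.loc[filt, ['events', 'tripduration']]

# A single column label returns a Series
durations = rides.loc[filt, 'tripduration']

print(subset)
print(durations)
```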


Now let's find all female riders with trip duration greater than 5,000 when it was cloudy. We'll only return the columns used during the boolean selection.

In [13]:
filt1 = bikes['gender'] == 'Female'
filt2 = bikes['tripduration'] > 5000
filt3 = bikes['events'] == 'cloudy'
filt = filt1 & filt2 & filt3
cols = ['gender', 'tripduration', 'events']
bikes.loc[filt, cols]

|  | gender | tripduration | events |
| --- | --- | --- | --- |
| 2712 | Female | 79988 | cloudy |


## Column to column comparisons

So far, we have created filters by comparing each of our column values to a single scalar value. It is also possible to do element-by-element comparisons by comparing two columns to one another. For instance, the total bike capacity at each station at the start and end of the ride is stored in the `start_capacity` and `end_capacity` columns. If we wanted to test whether there was more capacity at the start of the ride than at the end, we would do the following:

In [14]:
filt = bikes['start_capacity'] > bikes['end_capacity']

Let's use this filter with `loc` to return all the rows where the start capacity is greater than the end.

In [15]:
cols = ['start_capacity', 'end_capacity']
bikes.loc[filt, cols].head(3)

|  | start_capacity | end_capacity |
| --- | --- | --- |
| 1 | 31 | 19 |
| 6 | 31 | 19 |
| 8 | 31 | 15 |
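
Column-to-column filters can also be built from arithmetic between columns. The following sketch uses a made-up DataFrame (`stations`, its values, and the threshold of 12 are hypothetical) to show a direct comparison alongside a comparison on the difference of two columns.

```python
import pandas as pd

# Hypothetical capacities; values are made up for illustration
stations = pd.DataFrame({
    'start_capacity': [11, 31, 15, 31],
    'end_capacity': [15, 19, 23, 15],
})

# Element-by-element comparison of two columns
filt = stations['start_capacity'] > stations['end_capacity']

# Arithmetic between columns can feed a comparison as well
big_drop = (stations['start_capacity'] - stations['end_capacity']) > 12

print(stations.loc[filt])
print(stations.loc[big_drop])
```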


### Boolean selection with a Series does not work with `iloc`

The `iloc` indexer does not accept a boolean Series, so the filters we have been building cannot be passed to it directly. The following raises an error.

In [16]:
bikes.iloc[filt]

NotImplementedError: iLocation based boolean indexing on an integer type is not available
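
The restriction applies to a boolean Series specifically. As a hedged sketch on a made-up DataFrame (`demo` is hypothetical), the code below shows the error being raised for a Series and shows that `iloc` accepts the same mask once it is converted to a NumPy array with `to_numpy`. In practice, `loc` remains the more natural choice for boolean selection.

```python
import pandas as pd

# Hypothetical capacities; values are made up for illustration
demo = pd.DataFrame({
    'start_capacity': [11, 31, 15],
    'end_capacity': [15, 19, 23],
})
filt = demo['start_capacity'] > demo['end_capacity']

# Passing the boolean Series directly to iloc raises an error
try:
    demo.iloc[filt]
    raised = False
except (NotImplementedError, ValueError):
    raised = True
print(raised)

# Converting the Series to a NumPy array first is accepted by iloc
by_position = demo.iloc[filt.to_numpy()]
print(by_position)
```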

## Finding missing values with `isna`

The `isna` method called from either a DataFrame or a Series returns `True` for every value that is missing and `False` for any other value. Let's see this in action by calling `isna` on the start capacity column.

In [None]:
bikes['start_capacity'].isna().head(3)

### Filtering for missing values

We can now use this boolean Series to select all the rows where the start capacity column is missing. Verify that those values are indeed missing.

In [None]:
filt = bikes['start_capacity'].isna()
bikes[filt].head(3)
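
Since the cells above depend on the bikes dataset, here is a self-contained sketch of the same idea on a made-up DataFrame (`demo` and its values are hypothetical). It calls `isna` on both a Series and a DataFrame and then filters for the rows with a missing start capacity.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value in each column
demo = pd.DataFrame({
    'start_capacity': [11.0, np.nan, 15.0],
    'events': ['rain', 'cloudy', None],
})

# Called on a Series: one boolean per value
filt = demo['start_capacity'].isna()

# Called on a DataFrame: one boolean per cell
cell_mask = demo.isna()

# Rows where start_capacity is missing
missing_rows = demo[filt]

print(cell_mask)
print(missing_rows)
```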

### `isnull` is an alias for `isna`

There is an identical method named `isnull` that you will see in other tutorials. It is an **alias** of `isna`, meaning it does exactly the same thing under a different name. Either one is suitable to use, but I prefer `isna` because of its similarity to **NaN**, the representation of missing values. There are also other methods, such as `dropna` and `fillna`, that have 'na' in their names.
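
A quick way to confirm the equivalence is to compare the two results directly. This sketch uses a made-up Series (`s` is hypothetical) and also calls `fillna` and `dropna`, the two related methods named above.

```python
import numpy as np
import pandas as pd

# Hypothetical Series with one missing value
s = pd.Series([1.0, np.nan, 3.0])

# isna and isnull return identical boolean Series
same_result = s.isna().equals(s.isnull())
print(same_result)

# Related methods that share the 'na' naming
filled = s.fillna(0)   # replace missing values with 0
dropped = s.dropna()   # remove missing values
print(filled.tolist())
print(dropped.tolist())
```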