# 2.5 Filtering

A data set contains many observations, but often not all of them are relevant to the question being asked. Filtering lets the analyst restrict the investigation to the rows that matter, which makes the analysis more precise. It is also a convenient way to remove rows with null values from the data set.

Before we begin, let's import the data set.

In [2]:
import pandas as pd
df = pd.read_csv("./data/titanic.csv")

df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


## Creating a filter

Let's say that we want a dataframe that contains only first-class passengers (i.e., where `Pclass` is 1). To create this dataframe, we might first try to select the column `Pclass` and write a conditional statement that tests whether the column is equal to `1`.

In [3]:
df['Pclass'] == 1

0      False
1       True
2      False
3       True
4      False
       ...  
886    False
887     True
888    False
889     True
890    False
Name: Pclass, Length: 891, dtype: bool

Notice, however, that a Series object is returned instead of a dataframe! This occurred because we grabbed a column (a Series) and tested the condition `== 1` against each of its rows. Thus, a new Series of boolean values (True/False) was returned to us.

This Series of boolean values is the heart of filters in Pandas.

We can apply this filter to a dataframe by passing it directly to the dataframe inside square brackets.

In [9]:
df[ df['Pclass'] == 1 ].head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
6,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S
11,12,1,1,"Bonnell, Miss. Elizabeth",female,58.0,0,0,113783,26.55,C103,S
23,24,1,1,"Sloper, Mr. William Thompson",male,28.0,0,0,113788,35.5,A6,S


Some people prefer to assign the filter to a variable before passing it to the dataframe. A descriptive variable name makes the filter easier to read.

In [11]:
firstClassPassengers = df['Pclass'] == 1
df[ firstClassPassengers ].head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
6,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S
11,12,1,1,"Bonnell, Miss. Elizabeth",female,58.0,0,0,113783,26.55,C103,S
23,24,1,1,"Sloper, Mr. William Thompson",male,28.0,0,0,113788,35.5,A6,S


You can also apply filters to a dataframe by using the `.loc` property. **Using `.loc` is recommended over passing the filter directly in square brackets because it lets the analyst (1) filter rows by a condition *and* (2) select the individual columns to return in a single step.** This can be seen below.

In [12]:
females = df['Sex'] == 'female'
df.loc[ females, ['Name', 'Sex'] ]

Unnamed: 0,Name,Sex
1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female
2,"Heikkinen, Miss. Laina",female
3,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female
8,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female
9,"Nasser, Mrs. Nicholas (Adele Achem)",female
...,...,...
880,"Shelley, Mrs. William (Imanita Parrish Hall)",female
882,"Dahlberg, Miss. Gerda Ulrika",female
885,"Rice, Mrs. William (Margaret Norton)",female
887,"Graham, Miss. Margaret Edith",female


## How it works
Dataframes are built to recognize Series objects with boolean values. When the Series of booleans is passed to the dataframe between square brackets, the dataframe goes through each row and keeps it if the corresponding entry in the boolean Series is True. All rows that correspond to False are discarded, and a new dataframe containing only the True rows is returned. This pattern is also used in other programming languages, such as R.
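
The row-by-row behavior described above can be sketched with a tiny, made-up dataframe. The names `toy`, `mask`, and `kept_by_hand` are illustrative and not part of the Titanic data. The hand-written loop is only a mental model of the result; pandas does not literally loop in Python.

```python
import pandas as pd

# A tiny made-up dataframe, used only for illustration.
toy = pd.DataFrame({"Name": ["Ann", "Bob", "Cara", "Dev"],
                    "Pclass": [1, 3, 1, 2]})

# The filter: one True/False value per row.
mask = toy["Pclass"] == 1

# What pandas returns when the mask is passed in square brackets.
filtered = toy[mask]

# A hand-written equivalent: keep the index labels whose mask entry is True.
kept_by_hand = [label for label, keep in mask.items() if keep]

print(filtered)
print(kept_by_hand)
```

The index labels of `filtered` are the same labels collected by hand, which is why the filtered outputs in this section keep their original row numbers.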

## Multiple filters at once
You can also combine multiple conditions into a single boolean Series. These conditions use the standard logical operations *and*, *or*, and *not*, although they are written with different operators than in plain Python.

| Python | Pandas |
| ------ | ------ |
| and    | &      |
| or     | \|     |
| not    | ~      |

Multiple conditions can be created by wrapping each condition in parentheses and then combining them with either the `&` or the `|` symbol. The parentheses are required because `&` and `|` bind more tightly than comparison operators such as `==` and `<`. The `~` symbol can be placed in front of a condition to negate it.
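
As a rough illustration of why the operators and the parentheses matter, the sketch below uses a small made-up dataframe (`toy` and its values are assumptions, not Titanic data). Python's `and` asks a whole Series for a single True/False value and therefore fails, and dropping the parentheses changes how the expression is parsed.

```python
import pandas as pd

# Made-up data, used only for illustration.
toy = pd.DataFrame({"Sex": ["male", "female", "male"],
                    "Age": [12, 30, 70]})

# Correct: each condition in parentheses, combined with & or |.
boys = (toy["Sex"] == "male") & (toy["Age"] < 18)
young_or_old = (toy["Age"] < 18) | (toy["Age"] > 65)

# Python's `and` needs one truth value, but a Series holds many.
try:
    (toy["Sex"] == "male") and (toy["Age"] < 18)
    and_failed = False
except ValueError:
    and_failed = True

# Without parentheses, | is evaluated before < and >, so this is parsed as
# toy["Age"] < (18 | toy["Age"]) > 65 and does not build the intended filter.
try:
    toy["Age"] < 18 | toy["Age"] > 65
    unparenthesized_failed = False
except (TypeError, ValueError):
    unparenthesized_failed = True

print(toy.loc[boys])
print(toy.loc[young_or_old])
```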

In [35]:
male_and_embarked_at_queenstown = (df['Sex'] == 'male') & (df['Embarked'] == 'Q')
df.loc[male_and_embarked_at_queenstown].head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
5,6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q
16,17,0,3,"Rice, Master. Eugene",male,2.0,4,1,382652,29.125,,Q
46,47,0,3,"Lennon, Mr. Denis",male,,1,0,370371,15.5,,Q
116,117,0,3,"Connors, Mr. Patrick",male,70.5,0,0,370369,7.75,,Q
126,127,0,3,"McMahon, Mr. Martin",male,,0,0,370372,7.75,,Q


In [45]:
younger_than_18_or_older_than_65 = (df['Age'] < 18) | (df['Age'] > 65)
df.loc[younger_than_18_or_older_than_65]

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
7,8,0,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909,21.0750,,S
9,10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0,1,0,237736,30.0708,,C
10,11,1,3,"Sandstrom, Miss. Marguerite Rut",female,4.0,1,1,PP 9549,16.7000,G6,S
14,15,0,3,"Vestrom, Miss. Hulda Amanda Adolfina",female,14.0,0,0,350406,7.8542,,S
16,17,0,3,"Rice, Master. Eugene",male,2.0,4,1,382652,29.1250,,Q
...,...,...,...,...,...,...,...,...,...,...,...,...
851,852,0,3,"Svensson, Mr. Johan",male,74.0,0,0,347060,7.7750,,S
852,853,0,3,"Boulos, Miss. Nourelain",female,9.0,1,1,2678,15.2458,,C
853,854,1,1,"Lines, Miss. Mary Conover",female,16.0,0,1,PC 17592,39.4000,D28,S
869,870,1,3,"Johnson, Master. Harold Theodor",male,4.0,1,1,347742,11.1333,,S


In [51]:
not_first_class_and_not_embarked_Cherbourg = ~(df['Pclass'] == 1) & ~(df['Embarked'] == 'C')
df.loc[not_first_class_and_not_embarked_Cherbourg]

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
5,6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q
7,8,0,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909,21.0750,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
884,885,0,3,"Sutehall, Mr. Henry Jr",male,25.0,0,0,SOTON/OQ 392076,7.0500,,S
885,886,0,3,"Rice, Mrs. William (Margaret Norton)",female,39.0,0,5,382652,29.1250,,Q
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S
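
Several rows in the outputs above have no `Age` recorded. The same boolean-Series pattern can be used to filter such rows out, which is the null-value use case mentioned at the start of this section. Below is a minimal sketch on a made-up dataframe (`toy` is an assumption for illustration). `notna()` returns a boolean Series that works like any other filter.

```python
import pandas as pd

# Made-up data with one missing Age, used only for illustration.
toy = pd.DataFrame({"Name": ["Ann", "Bob", "Cara"],
                    "Age": [22.0, None, 4.0]})

has_age = toy["Age"].notna()       # True where Age is present
known_ages = toy.loc[has_age]      # rows with a recorded Age

# ~ negates the filter, leaving only the rows where Age is missing.
missing_ages = toy.loc[~has_age]

# Null checks combine with other conditions like any boolean Series.
known_children = toy.loc[has_age & (toy["Age"] < 18)]

print(known_ages)
print(missing_ages)
print(known_children)
```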


## Filtering by string
In a normal Python conditional statement, we can determine whether a string exists inside another string by using the `in` keyword.

In [52]:
"wizard" in "You're a wizard, Harry!"

True

Unfortunately, it's not that simple in a Pandas dataframe: the `in` keyword does not test each value in a column. However, the Series object does come with some string methods built into it.

We can use the `.str` property to access the string methods of a Series object, and then call the method we need on it. To search for a string inside of another string, we can use the `.contains` method.

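In this example, we will try to find women who are likely unmarried by searching for the title "Ms." or "Miss" inside of the column `Name`. Strictly speaking, "Ms." does not indicate marital status, so this is only an approximation.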

In [55]:
# regex=False treats "Ms." as literal text; by default the "." is a regex wildcard.
unmarried_women = (df['Name'].str.contains("Ms.", regex=False)) | (df['Name'].str.contains("Miss"))
df.loc[ unmarried_women ]

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
10,11,1,3,"Sandstrom, Miss. Marguerite Rut",female,4.0,1,1,PP 9549,16.7000,G6,S
11,12,1,1,"Bonnell, Miss. Elizabeth",female,58.0,0,0,113783,26.5500,C103,S
14,15,0,3,"Vestrom, Miss. Hulda Amanda Adolfina",female,14.0,0,0,350406,7.8542,,S
22,23,1,3,"McGowan, Miss. Anna ""Annie""",female,15.0,0,0,330923,8.0292,,Q
...,...,...,...,...,...,...,...,...,...,...,...,...
866,867,1,2,"Duran y More, Miss. Asuncion",female,27.0,1,0,SC/PARIS 2149,13.8583,,C
875,876,1,3,"Najib, Miss. Adele Kiamie ""Jane""",female,15.0,0,0,2667,7.2250,,C
882,883,0,3,"Dahlberg, Miss. Gerda Ulrika",female,22.0,0,0,7552,10.5167,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
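
One detail worth knowing is that `.str.contains` interprets its pattern as a regular expression by default, so the `.` in "Ms." matches any character. The sketch below uses made-up names (the `names` Series is hypothetical, not taken from the Titanic data) to show the difference that `regex=False` makes. It also passes `na=False` so that missing values count as non-matches.

```python
import pandas as pd

# Hypothetical names, including one missing value.
names = pd.Series(["Smith, Ms. Anna", "Doe, Msgr. John", "Lee, Miss. Mary", None])

# Default: the pattern is a regular expression, so "." matches any character
# and "Ms." also matches the "Msg" in "Msgr.".
as_regex = names.str.contains("Ms.", na=False)

# regex=False searches for the literal text "Ms." instead.
as_literal = names.str.contains("Ms.", regex=False, na=False)

print(as_regex.tolist())
print(as_literal.tolist())
```

For a plain substring search like the one in this section, passing `regex=False` is usually the safer choice.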
