# Filtering a DataFrame

## Optimizing a data set for memory use

Before we segue into filtering, let’s quickly talk about reducing memory use in pandas. Whenever we import a data set, it’s important to consider whether each column stores its data in the most suitable type. The “best” data type is the one that consumes the least memory or provides the most utility.

For example, if a data set includes dates, it’s ideal to import them as datetimes rather than as strings, because datetimes support datetime-specific operations.

In [1]:
import pandas as pd

pd.read_csv("../data/employees.csv")

Unnamed: 0,First Name,Gender,Start Date,Salary,Mgmt,Team
0,Douglas,Male,8/6/93,,True,Marketing
1,Thomas,Male,3/31/96,61933.0,True,
2,Maria,Female,,130590.0,False,Finance
3,Jerry,,3/4/05,138705.0,True,Finance
4,Larry,Male,1/24/98,101004.0,True,IT
...,...,...,...,...,...,...
996,Phillip,Male,1/31/84,42392.0,False,Finance
997,Russell,Male,5/20/13,96914.0,False,Product
998,Larry,Male,4/20/13,60500.0,False,Business Dev
999,Albert,Male,5/15/12,129949.0,True,Sales


How can we increase the utility of our data set? 

We can convert the text values in the Start Date column to datetimes with the <code>parse_dates</code> parameter:

In [3]:
employees = pd.read_csv("../data/employees.csv", parse_dates = ["Start Date"]).head()
employees

Unnamed: 0,First Name,Gender,Start Date,Salary,Mgmt,Team
0,Douglas,Male,1993-08-06,,True,Marketing
1,Thomas,Male,1996-03-31,61933.0,True,
2,Maria,Female,NaT,130590.0,False,Finance
3,Jerry,,2005-03-04,138705.0,True,Finance
4,Larry,Male,1998-01-24,101004.0,True,IT
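If a data set is already in memory, the same conversion can be applied afterward. The sketch below assumes a small hypothetical frame (not the employees file) whose dates follow the month/day/two-digit-year layout seen above. It uses <code>pd.to_datetime</code> with an explicit <code>format</code>; missing entries become <code>NaT</code>:

```python
import pandas as pd

# Hypothetical sample data, used only for illustration.
sample = pd.DataFrame({
    "First Name": ["Douglas", "Thomas", "Maria"],
    "Start Date": ["8/6/93", "3/31/96", None],
})

# Parse the text as month/day/two-digit year; None becomes NaT.
sample["Start Date"] = pd.to_datetime(sample["Start Date"], format="%m/%d/%y")
print(sample.dtypes)
print(sample)
```

Stating the format explicitly removes any ambiguity about whether the first number is the month or the day.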


A few options are available for improving the speed and efficiency of <code>DataFrame</code> operations.

First, let’s summarize the data set as it currently stands. We can invoke the <code>info</code> method to see a list of the columns, their data types, a count of non-null values in each column, and the <code>DataFrame</code>’s total memory consumption:

In [4]:
employees.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 6 columns):
 #   Column      Non-Null Count  Dtype         
---  ------      --------------  -----         
 0   First Name  5 non-null      object        
 1   Gender      4 non-null      object        
 2   Start Date  4 non-null      datetime64[ns]
 3   Salary      4 non-null      float64       
 4   Mgmt        5 non-null      object        
 5   Team        4 non-null      object        
dtypes: datetime64[ns](1), float64(1), object(4)
memory usage: 368.0+ bytes
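The plus sign in the memory figure signals that the number is a lower bound: for <code>object</code> columns, <code>info</code> counts only the references to the values, not the values themselves. As a minimal sketch on a small hypothetical frame (not the employees file), passing <code>deep=True</code> to the <code>memory_usage</code> method asks pandas to measure the stored objects as well:

```python
import pandas as pd

# Hypothetical sample data, used only for illustration.
sample = pd.DataFrame({
    "First Name": pd.Series(["Douglas", "Thomas", "Maria"], dtype=object),
    "Salary": [0.0, 61933.0, 130590.0],
})

shallow = sample.memory_usage()          # object column: pointers only
deep = sample.memory_usage(deep=True)    # object column: pointers plus strings

print(shallow)
print(deep)
```

The numeric column reports the same size either way; only the <code>object</code> column grows under the deep measurement.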


## Converting data types with the astype method

Did you notice that pandas stored the Mgmt column with the <code>object</code> data type, which it typically uses for strings? The column holds only two values: <code>True</code> and <code>False</code>. We can reduce memory use by converting the values to the more lightweight Boolean data type.

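The <code>astype</code> method converts a <code>Series</code>’ values to a different data type. It returns a new <code>Series</code>, so we overwrite the Mgmt column to keep the change.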

In [5]:
employees["Mgmt"].astype(bool)

0     True
1     True
2    False
3     True
4     True
Name: Mgmt, dtype: bool

In [6]:
employees["Mgmt"] = employees["Mgmt"].astype(bool)

In [7]:
employees.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 6 columns):
 #   Column      Non-Null Count  Dtype         
---  ------      --------------  -----         
 0   First Name  5 non-null      object        
 1   Gender      4 non-null      object        
 2   Start Date  4 non-null      datetime64[ns]
 3   Salary      4 non-null      float64       
 4   Mgmt        5 non-null      bool          
 5   Team        4 non-null      object        
dtypes: bool(1), datetime64[ns](1), float64(1), object(3)
memory usage: 333.0+ bytes
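One caveat is worth keeping in mind: <code>astype(bool)</code> applies Python’s truthiness rules rather than parsing text. The sketch below uses hypothetical <code>Series</code> (not the employees data) to show that a non-empty string such as "False" and a missing value both convert to <code>True</code>, and that an explicit mapping avoids the surprise:

```python
import numpy as np
import pandas as pd

# Text that merely looks Boolean: every non-empty string is truthy.
as_text = pd.Series(["True", "False"], dtype=object)
print(as_text.astype(bool).tolist())      # [True, True]

# A missing value is truthy as well, so it silently becomes True.
with_gap = pd.Series([True, np.nan, False], dtype=object)
print(with_gap.astype(bool).tolist())     # [True, True, False]

# An explicit mapping converts the text by meaning instead.
parsed = as_text.map({"True": True, "False": False})
print(parsed.tolist())                    # [True, False]
```

It is therefore worth checking a column’s actual values and its null count before converting it.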


The Salary column offers a similar opportunity. In <code>employees</code>, pandas stores the Salary values as floats. To support the <code>NaN</code> values throughout the column, pandas converts the integers to floating-point numbers, a technical requirement of the library that we observed in earlier chapters. If we first replace the missing values with 0, we can convert the column to integers:

In [9]:
employees["Salary"].fillna(0).astype(int)

0         0
1     61933
2    130590
3    138705
4    101004
Name: Salary, dtype: int64

In [11]:
employees["Salary"] = employees["Salary"].fillna(0).astype(int)

In [16]:
employees.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 6 columns):
 #   Column      Non-Null Count  Dtype         
---  ------      --------------  -----         
 0   First Name  5 non-null      object        
 1   Gender      4 non-null      category      
 2   Start Date  4 non-null      datetime64[ns]
 3   Salary      5 non-null      int64         
 4   Mgmt        5 non-null      bool          
 5   Team        4 non-null      object        
dtypes: bool(1), category(1), datetime64[ns](1), int64(1), object(2)
memory usage: 422.0+ bytes
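Filling the gaps with 0 changes the data: a missing salary is not the same as a salary of zero. As an alternative sketch (a swapped-in technique, not the approach used above), pandas also offers the nullable <code>"Int64"</code> extension type, which stores integers while keeping missing entries missing. The values below are a hypothetical stand-in for the Salary column:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for a float column with one missing value.
salaries = pd.Series([np.nan, 61933.0, 130590.0])

# Note the capital "I": this is the nullable integer type.
nullable = salaries.astype("Int64")

print(nullable)
print(nullable.isna().tolist())
```

Which option fits depends on whether later calculations should treat the gap as zero or skip it.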


We can make one additional optimization. 

Pandas includes a special data type called a *category*, which is ideal for a column consisting of a small number of unique values relative to its total size. Some everyday examples of data points with a limited number of values include gender, weekdays, blood types, planets, and income groups. Behind the scenes, pandas stores only one copy of each categorical value rather than storing duplicates across rows.

In [12]:
employees.nunique()

First Name    5
Gender        2
Start Date    4
Salary        5
Mgmt          2
Team          3
dtype: int64

In [13]:
employees["Gender"].astype("category")

0      Male
1      Male
2    Female
3       NaN
4      Male
Name: Gender, dtype: category
Categories (2, object): ['Female', 'Male']

In [14]:
employees["Gender"] = employees["Gender"].astype("category")

In [15]:
employees.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 6 columns):
 #   Column      Non-Null Count  Dtype         
---  ------      --------------  -----         
 0   First Name  5 non-null      object        
 1   Gender      4 non-null      category      
 2   Start Date  4 non-null      datetime64[ns]
 3   Salary      5 non-null      int64         
 4   Mgmt        5 non-null      bool          
 5   Team        4 non-null      object        
dtypes: bool(1), category(1), datetime64[ns](1), int64(1), object(2)
memory usage: 422.0+ bytes


In [17]:
employees["Team"] = employees["Team"].astype("category")
employees.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 6 columns):
 #   Column      Non-Null Count  Dtype         
---  ------      --------------  -----         
 0   First Name  5 non-null      object        
 1   Gender      4 non-null      category      
 2   Start Date  4 non-null      datetime64[ns]
 3   Salary      5 non-null      int64         
 4   Mgmt        5 non-null      bool          
 5   Team        4 non-null      category      
dtypes: bool(1), category(2), datetime64[ns](1), int64(1), object(1)
memory usage: 519.0+ bytes
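On a five-row sample, the category type does not pay off: the reported memory rose from 422 to 519 bytes after we converted Team, because each categorical column also stores its list of categories. The savings appear when values repeat many times. A minimal sketch with a hypothetical 10,000-row column:

```python
import pandas as pd

# Hypothetical column: two distinct values repeated 5,000 times each.
genders = pd.Series(["Male", "Female"] * 5000, dtype=object)
as_category = genders.astype("category")

print(genders.memory_usage(deep=True))
print(as_category.memory_usage(deep=True))

# Each row now holds a small integer code that points at a category.
print(as_category.cat.codes.head(4).tolist())   # [1, 0, 1, 0]
```

The categories are sorted alphabetically, so "Female" receives code 0 and "Male" receives code 1.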


## Filtering by a single condition

Extracting a subset of data is perhaps the most common operation in data analysis. A *subset* is a portion of a larger data set that fits some kind of condition.

Suppose that we want to generate a list of all employees named "<code>Maria</code>". To accomplish this task, we need to filter our employees data set based on the values in the First Name column. The list of employees named Maria is a subset of all employees.

To compare every Series entry with a constant value, we place the Series on one side of the equality operator and the value on the other:

In [None]:
Series == value

When we combine a <code>Series</code> with an equality operator, pandas returns a <code>Series</code> of Booleans.

In [18]:
employees["First Name"] == "Maria"

0    False
1    False
2     True
3    False
4    False
Name: First Name, dtype: bool

Pandas offers a convenient syntax for extracting rows by using a Boolean <code>Series</code>. 

To filter rows, we provide the Boolean <code>Series</code> between square brackets following the <code>DataFrame</code> :

In [19]:
employees[employees["First Name"] == "Maria"]

Unnamed: 0,First Name,Gender,Start Date,Salary,Mgmt,Team
2,Maria,Female,NaT,130590,False,Finance


If the use of multiple square brackets is confusing, you can assign the Boolean <code>Series</code> to a descriptive variable and then pass the variable into the square brackets instead.

In [20]:
marias = employees["First Name"] == "Maria"
employees[marias]

Unnamed: 0,First Name,Gender,Start Date,Salary,Mgmt,Team
2,Maria,Female,NaT,130590,False,Finance
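The same Boolean <code>Series</code> also works with the <code>loc</code> accessor, which additionally lets us pick columns in the same step. A short sketch on a hypothetical three-row frame:

```python
import pandas as pd

# Hypothetical sample data, used only for illustration.
sample = pd.DataFrame({
    "First Name": ["Douglas", "Maria", "Larry"],
    "Team": ["Marketing", "Finance", "IT"],
})

marias = sample["First Name"] == "Maria"

print(sample[marias])                 # all columns for the matching rows
print(sample.loc[marias, ["Team"]])   # matching rows, Team column only
```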


Let’s try another example. What if we want to extract a subset of employees who are not on the Finance team? The approach remains the same, but we swap the equality operator for the inequality operator (<code>!=</code>):

In [21]:
employees["Team"] != "Finance"

0     True
1     True
2    False
3    False
4     True
Name: Team, dtype: bool

In [None]:
employees[employees["Team"] != "Finance"]
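Notice in the Boolean output above that row 1, whose Team value is missing, evaluates to <code>True</code>: a missing value is never equal to "Finance", so the inequality check keeps it. If rows without a team should be left out, one option is to combine the check with the <code>notnull</code> method and the & operator, both covered later in this chapter. A sketch on a hypothetical <code>Series</code>:

```python
import pandas as pd

# Hypothetical team values with one missing entry.
teams = pd.Series(["Marketing", None, "Finance", "IT"], dtype=object)

not_finance = teams != "Finance"
print(not_finance.tolist())          # [True, True, False, True]

# Require a present value as well as a non-Finance one.
known_not_finance = not_finance & teams.notnull()
print(known_not_finance.tolist())    # [True, False, False, True]
```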

What if we want to retrieve all the managers in the company? Managers have a value of <code>True</code> in the Mgmt column. 

We could execute <code>employees["Mgmt"] == True</code>, but we don’t need to because Mgmt is already a <code>Series</code> of Booleans.

In [22]:
employees[employees["Mgmt"]]

Unnamed: 0,First Name,Gender,Start Date,Salary,Mgmt,Team
0,Douglas,Male,1993-08-06,0,True,Marketing
1,Thomas,Male,1996-03-31,61933,True,
3,Jerry,,2005-03-04,138705,True,Finance
4,Larry,Male,1998-01-24,101004,True,IT


The next example generates a Boolean <code>Series</code> for Salary values greater than $100,000:

In [24]:
high_earners = employees["Salary"] > 100000
employees[high_earners].head()

Unnamed: 0,First Name,Gender,Start Date,Salary,Mgmt,Team
2,Maria,Female,NaT,130590,False,Finance
3,Jerry,,2005-03-04,138705,True,Finance
4,Larry,Male,1998-01-24,101004,True,IT


## Filtering by multiple conditions

We can filter a <code>DataFrame</code> with multiple conditions by creating two independent Boolean Series and then declaring the logical criterion that pandas should apply between them.

### The AND condition

Suppose that we want to find all female employees who work on the business development team. Now pandas must look for two conditions to select a row: a value of "<code>Female</code>" in the Gender column and a value of "<code>Business Dev</code>" in the Team column.

In [25]:
is_female = employees["Gender"] == "Female"
in_biz_dev = employees["Team"] == "Business Dev"

is_female & in_biz_dev

0    False
1    False
2    False
3    False
4    False
dtype: bool

In [27]:
employees[is_female & in_biz_dev]

Unnamed: 0,First Name,Gender,Start Date,Salary,Mgmt,Team


We can include any number of Boolean <code>Series</code> within the square brackets as long as we separate each consecutive pair with an & symbol.

In [28]:
is_manager = employees["Mgmt"]
employees[is_female & in_biz_dev & is_manager]

Unnamed: 0,First Name,Gender,Start Date,Salary,Mgmt,Team
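When the conditions are written inline rather than stored in variables, each comparison needs its own parentheses, because Python evaluates & before ==. The sketch below uses a hypothetical frame that, unlike the five-row sample, does contain a matching row:

```python
import pandas as pd

# Hypothetical sample data, used only for illustration.
sample = pd.DataFrame({
    "First Name": ["Maria", "Angela", "Larry"],
    "Gender": ["Female", "Female", "Male"],
    "Team": ["Finance", "Business Dev", "Business Dev"],
})

# Parentheses keep each comparison intact before & combines them.
matches = sample[
    (sample["Gender"] == "Female") & (sample["Team"] == "Business Dev")
]
print(matches)
```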


### The OR condition

We can also extract rows if they fit one of several conditions. Not all conditions have to be true, but at least one does.

Suppose that we want to identify all employees with a Salary below $40,000 or a Start Date after January 1, 2015. We can use comparison operators such as < and > to arrive at two separate Boolean <code>Series</code> for these conditions:

In [30]:
earning_below_40k = employees["Salary"] < 40000
started_after_2015 = employees["Start Date"] > "2015-01-01"

We use a pipe symbol ( | ) between Boolean <code>Series</code> to declare <code>OR</code> criteria. 

The next example selects the rows in which either of the Boolean <code>Series</code> holds a <code>True</code> value:

In [31]:
employees[earning_below_40k | started_after_2015].tail()

Unnamed: 0,First Name,Gender,Start Date,Salary,Mgmt,Team
0,Douglas,Male,1993-08-06,0,True,Marketing
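Only one of the five sample rows qualifies above, and only through the salary condition. To see both conditions contribute, here is a sketch on a hypothetical frame. It compares against an explicit <code>pd.Timestamp</code> rather than a date string, which is an equivalent but more explicit spelling:

```python
import pandas as pd

# Hypothetical sample data, used only for illustration.
sample = pd.DataFrame({
    "First Name": ["Ann", "Bob", "Cal"],
    "Salary": [35000, 90000, 120000],
    "Start Date": pd.to_datetime(["2010-05-01", "2016-02-15", "2012-09-30"]),
})

earning_below_40k = sample["Salary"] < 40000
started_after_2015 = sample["Start Date"] > pd.Timestamp("2015-01-01")

# A row is kept when at least one of the two conditions is True.
either = sample[earning_below_40k | started_after_2015]
print(either)
```

Ann qualifies through her salary and Bob through his start date; Cal meets neither condition.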


### Inversion with ~

The tilde symbol ( ~ ) inverts the values in a Boolean <code>Series</code>. All <code>True</code> values become <code>False</code>, and all <code>False</code> values become <code>True</code>.

In [32]:
my_series = pd.Series([True, False, True])
my_series

0     True
1    False
2     True
dtype: bool

In [35]:
~my_series

0    False
1     True
2    False
dtype: bool

In [36]:
employees[employees["Salary"] < 100000]

Unnamed: 0,First Name,Gender,Start Date,Salary,Mgmt,Team
0,Douglas,Male,1993-08-06,0,True,Marketing
1,Thomas,Male,1996-03-31,61933,True,


In [37]:
employees[~(employees["Salary"] >= 100000)]

Unnamed: 0,First Name,Gender,Start Date,Salary,Mgmt,Team
0,Douglas,Male,1993-08-06,0,True,Marketing
1,Thomas,Male,1996-03-31,61933,True,
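The two filters above return the same rows because Salary no longer contains missing values. With <code>NaN</code> present they can differ: both the < and the >= comparison against <code>NaN</code> evaluate to <code>False</code>, so inverting the second one yields <code>True</code>. A sketch with hypothetical values:

```python
import numpy as np
import pandas as pd

# Hypothetical salaries with one missing value.
salaries = pd.Series([50000.0, np.nan, 150000.0])

below = salaries < 100000
not_at_least = ~(salaries >= 100000)

print(below.tolist())          # [True, False, False]
print(not_at_least.tolist())   # [True, True, False]
```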


## Filtering by condition

Some filtering operations are more complex than simple equality or inequality checks. Luckily, pandas ships with many helper methods that generate Boolean Series for these types of extractions.

### The <code>isin</code> method

What if we want to isolate the employees who belong to either the Sales, Legal, or Marketing team? We could provide three separate Boolean <code>Series</code> inside the square brackets and add the | symbol to declare <code>OR</code> criteria:

In [39]:
sales = employees["Team"] == "Sales"
legal = employees["Team"] == "Legal"
mktg = employees["Team"] == "Marketing"
employees[sales | legal | mktg]

Unnamed: 0,First Name,Gender,Start Date,Salary,Mgmt,Team
0,Douglas,Male,1993-08-06,0,True,Marketing


Although this solution works, it isn’t scalable. What if our next report asked for employees from 15 teams instead of three?

A better solution is the <code>isin</code> method, which accepts an iterable of elements (list, tuple, <code>Series</code>, and so on) and returns a Boolean <code>Series</code>. <code>True</code> denotes that pandas found the row’s value among the iterable’s values, and <code>False</code> denotes that it did not. Once we have the <code>Series</code>, we can use it to filter the <code>DataFrame</code> in the usual manner.

The next example achieves the same result set:

In [40]:
all_star_teams = ["Sales", "Legal", "Marketing"]
on_all_star_teams = employees["Team"].isin(all_star_teams)
employees[on_all_star_teams]

Unnamed: 0,First Name,Gender,Start Date,Salary,Mgmt,Team
0,Douglas,Male,1993-08-06,0,True,Marketing
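Because <code>isin</code> returns an ordinary Boolean <code>Series</code>, it also combines with the tilde to select everyone outside the listed teams. A sketch on a hypothetical frame:

```python
import pandas as pd

# Hypothetical sample data, used only for illustration.
sample = pd.DataFrame({
    "First Name": ["Douglas", "Maria", "Larry", "Albert"],
    "Team": ["Marketing", "Finance", "IT", "Sales"],
})

all_star_teams = ["Sales", "Legal", "Marketing"]
on_all_star_teams = sample["Team"].isin(all_star_teams)

print(sample[on_all_star_teams])    # Douglas and Albert
print(sample[~on_all_star_teams])   # Maria and Larry
```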


### The <code>isnull</code> and <code>notnull</code> methods

Pandas marks missing text values and missing numeric values with a <code>NaN</code> (not a number) designation, and it marks missing datetime values with a <code>NaT</code> (not a time) designation.

We can use several pandas methods to isolate rows with either null or present values in a given column. The <code>isnull</code> method returns a Boolean <code>Series</code> in which <code>True</code> denotes that a row’s value is missing:

In [41]:
employees["Team"].isnull()

0    False
1     True
2    False
3    False
4    False
Name: Team, dtype: bool

Pandas considers the <code>NaT</code> and <code>None</code> values to be null as well. The next example invokes the <code>isnull</code> method on the Start Date column:

In [42]:
employees["Start Date"].isnull()

0    False
1    False
2     True
3    False
4    False
Name: Start Date, dtype: bool

The <code>notnull</code> method returns the inverse <code>Series</code> , one in which <code>True</code> indicates that a row’s value is present. The following output communicates that indices 0, 2, 3, and 4 do not have missing values:

In [43]:
employees["Team"].notnull()

0     True
1    False
2     True
3     True
4     True
Name: Team, dtype: bool

We can produce the same result set by inverting the <code>Series</code> returned by the <code>isnull</code> method. As a reminder, we use the tilde symbol ( ~ ) to invert a Boolean <code>Series</code> :

In [45]:
(~employees["Team"].isnull())

0     True
1    False
2     True
3     True
4     True
Name: Team, dtype: bool

Either approach works, but <code>notnull</code> is a bit more descriptive and thus is recommended.
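Like any other Boolean <code>Series</code>, the result can be placed in square brackets to filter the <code>DataFrame</code>. The sketch below uses a hypothetical frame and also shows <code>isna</code> and <code>notna</code>, which pandas provides as aliases for <code>isnull</code> and <code>notnull</code>:

```python
import pandas as pd

# Hypothetical sample data with one missing team.
sample = pd.DataFrame({
    "First Name": ["Douglas", "Thomas", "Maria"],
    "Team": pd.Series(["Marketing", None, "Finance"], dtype=object),
})

has_team = sample["Team"].notnull()
print(sample[has_team])     # rows 0 and 2
print(sample[~has_team])    # row 1, whose Team is missing

# isna/notna are alternate names for the same checks.
print(sample["Team"].notna().equals(has_team))   # True
```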

## Dealing with null values

While we’re on the topic of missing values, let’s discuss some options for dealing with them. In section 5.2, we learned how to use the <code>fillna</code> method to replace NaNs with a constant value. We could also remove them.

The <code>dropna</code> method removes <code>DataFrame</code> rows that hold any <code>NaN</code> values. 

It doesn’t matter how many values a row is missing; the method excludes the row if a single <code>NaN</code> is present.

In [46]:
employees.dropna()

Unnamed: 0,First Name,Gender,Start Date,Salary,Mgmt,Team
0,Douglas,Male,1993-08-06,0,True,Marketing
4,Larry,Male,1998-01-24,101004,True,IT


We can pass the <code>how</code> parameter an argument of "<code>all</code>" to remove only the rows in which all values are missing. None of the five rows in <code>employees</code> satisfies this condition, so the output keeps every row:

In [47]:
employees.dropna(how = "all")

Unnamed: 0,First Name,Gender,Start Date,Salary,Mgmt,Team
0,Douglas,Male,1993-08-06,0,True,Marketing
1,Thomas,Male,1996-03-31,61933,True,
2,Maria,Female,NaT,130590,False,Finance
3,Jerry,,2005-03-04,138705,True,Finance
4,Larry,Male,1998-01-24,101004,True,IT


The <code>how</code> parameter’s default argument is "<code>any</code>". An argument of "<code>any</code>" removes a row if any of its values is absent.
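To see the two settings diverge, here is a sketch on a hypothetical frame in which one row is partly missing and another is entirely missing:

```python
import numpy as np
import pandas as pd

# Row 1 lacks a salary; row 2 lacks every value.
sample = pd.DataFrame({
    "First Name": pd.Series(["Douglas", "Thomas", None], dtype=object),
    "Salary": [52000.0, np.nan, np.nan],
})

any_dropped = sample.dropna(how="any")   # keeps only row 0
all_dropped = sample.dropna(how="all")   # keeps rows 0 and 1

print(any_dropped)
print(all_dropped)
```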
    
We can use the <code>subset</code> parameter to target rows with a missing value in a specific column.

The next example removes rows that have a missing value in the Gender column:

In [48]:
employees.dropna(subset = ["Gender"])

Unnamed: 0,First Name,Gender,Start Date,Salary,Mgmt,Team
0,Douglas,Male,1993-08-06,0,True,Marketing
1,Thomas,Male,1996-03-31,61933,True,
2,Maria,Female,NaT,130590,False,Finance
4,Larry,Male,1998-01-24,101004,True,IT


We can also pass the <code>subset</code> parameter a list of columns. Pandas will remove a row if it has a missing value in any of the specified columns.

In [49]:
employees.dropna(subset = ["Start Date", "Salary"])

Unnamed: 0,First Name,Gender,Start Date,Salary,Mgmt,Team
0,Douglas,Male,1993-08-06,0,True,Marketing
1,Thomas,Male,1996-03-31,61933,True,
3,Jerry,,2005-03-04,138705,True,Finance
4,Larry,Male,1998-01-24,101004,True,IT
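Note that <code>dropna</code> returns a new <code>DataFrame</code> and leaves the original untouched, which is why every call above still starts from all five rows. To keep a result, assign it to a variable. A sketch on a hypothetical frame:

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with one missing salary.
sample = pd.DataFrame({
    "First Name": ["Douglas", "Thomas", "Maria"],
    "Salary": [52000.0, np.nan, 130590.0],
})

with_salary = sample.dropna(subset=["Salary"])

print(len(sample))        # 3: the original is unchanged
print(len(with_salary))   # 2: the copy lacks the row with no salary
```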


The <code>thresh</code> parameter specifies a minimum threshold of non-null values that a row must have for pandas to keep it. The next example filters <code>employees</code> for rows with at least four present values: