# Subsetting and Filtering with Pandas


Lesson Goals

In this lesson we will learn:

    The basics of indexing
    Selecting rows and columns
    Subsetting our data using different functions
    Filtering our data using Boolean conditions

Introduction

Pandas is an open source library with many contributors. As a result, its syntax is very expressive and there are often several ways to perform the same task. Subsetting and filtering DataFrames is one such example.

Indexing

Before we talk about subsetting and filtering, we should say a bit about indexing. A DataFrame has three components: the row index, the columns and the data. The default index is a row number starting at zero. However, we can set the index to any sequence we like, as long as it has one label per row. We can read the index of any DataFrame and assign a new sequence of values to it.

Recall our animals dataset:

In [13]:
import numpy as np
import pandas as pd

animals = pd.read_csv('data/animals.csv')

animals.index.values

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33,
       34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50,
       51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61])
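Assigning a new index, as described above, can be sketched on a small stand-in DataFrame. The `toy` frame here is hypothetical (three rows copied from the animals data), and using the animal names as labels is just one illustrative choice.

```python
import pandas as pd

# Hypothetical three-row stand-in for the animals data
toy = pd.DataFrame({
    "brainwt": [3.385, 0.48, 1.35],
    "bodywt": [44.5, 15.499, 8.1],
})

# The default index counts rows from zero
default_labels = list(toy.index)

# Assign a new sequence of labels; it needs one label per row
toy.index = ["Arctic_fox", "Owl_monkey", "Beaver"]
new_labels = list(toy.index)

print(default_labels)
print(new_labels)
```

Once the index holds names, a row can be looked up by its label, for example `toy.loc["Beaver"]`.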

# Selecting Rows and Columns


Selecting Columns

We can access the different columns in our data by using square brackets or directly as an attribute.

In [14]:
# Select the bodywt column using square brackets:
animals['bodywt'].head()

0     44.500
1     15.499
2      8.100
3    423.012
4    119.498
Name: bodywt, dtype: float64

In [15]:
# Select the brainwt column as an attribute of the DataFrame:
animals.brainwt.head()

0      3.385
1      0.480
2      1.350
3    464.983
4     36.328
Name: brainwt, dtype: float64
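The two ways of selecting a column are not always interchangeable. The sketch below uses a hypothetical frame whose column names `body wt` and `count` are made up to show two cases where only square brackets return the column.

```python
import pandas as pd

# Hypothetical frame; "body wt" and "count" are made-up column names
toy = pd.DataFrame({
    "brainwt": [3.385, 0.48],
    "body wt": [44.5, 15.499],
    "count": [1, 2],
})

# For a simple column name, both forms return the same Series
same = toy["brainwt"].equals(toy.brainwt)

# A name containing a space can only be reached with square brackets
body = toy["body wt"]

# "count" is also the name of a DataFrame method, so attribute access
# returns the method rather than the column
attr_is_method = callable(toy.count)
count_col = toy["count"]
```

Square brackets work for any column name, so they are the safer default when column names are not known in advance.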

Selecting Rows

We can select a range of rows from the DataFrame by slicing with square brackets. We give the starting row and the ending row, separated by a colon. Either one can be left blank: a missing start defaults to the first row, and a missing end runs through the last row. The slice does not include the upper bound, so animals[11:20] returns rows 11 through 19.



In [16]:
animals[11:20]

Unnamed: 0,brainwt,bodywt,animal
11,0.92,5.7,Arctic
12,1.0,6.6,African_giant_pouched_rat
13,0.005,0.14,Lesser_short-tailed-shrew
14,0.06,1.0,Star-nosed_mole
15,3.5,10.8,Nine-banded_armadillo
16,2.0,12.3,Tree_hyrax
17,1.7,6.3,N._American
18,2547.07,4603.17,Asian_elephant
19,0.023,0.3,Big_brown_bat


In [17]:
animals[:10]

Unnamed: 0,brainwt,bodywt,animal
0,3.385,44.5,Arctic_fox
1,0.48,15.499,Owl_monkey
2,1.35,8.1,Beaver
3,464.983,423.012,Cow
4,36.328,119.498,Gray_wolf
5,27.66,114.996,Goat
6,14.831,98.199,Roe_deer
7,1.04,5.5,Guinea_pig
8,4.19,57.998,Vervet
9,0.425,6.4,Chinchilla


In [18]:
animals[55:]

Unnamed: 0,brainwt,bodywt,animal
55,192.001,180.008,Pig
56,3.0,25.001,Echidna
57,160.004,169.0,Brazilian_tapir
58,0.9,2.6,Tenrec
59,1.62,11.4,Phalanger
60,0.104,2.5,Tree_shrew
61,4.235,50.4,Red_fox
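The same slicing rules can be checked on a small frame where the result is easy to read off. The `toy` DataFrame is a hypothetical five-row stand-in, not the full animals data.

```python
import pandas as pd

# Hypothetical five-row stand-in with the default 0..4 index
toy = pd.DataFrame({"bodywt": [44.5, 15.499, 8.1, 423.012, 119.498]})

middle = toy[1:3]  # rows 1 and 2; the upper bound 3 is excluded
head = toy[:2]     # start omitted: begins at the first row
tail = toy[3:]     # end omitted: runs through the last row

print(list(middle.index))
print(list(head.index))
print(list(tail.index))
```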


# Subsetting Using Functions

There are two indexers for more complex subsetting: .loc and .iloc. Both are used with square brackets rather than called like functions.

    .loc is primarily label based. With .loc we perform subsetting using the labels of the rows and columns
    .iloc is primarily integer based. With .iloc we perform subsetting using the integer positions of the rows and columns

Note: If we ask for a row or column that does not exist, pandas will raise an error.

When subsetting, we first specify the row numbers or names and then the column numbers or names.
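As a rough illustration of the note about missing rows and columns, the sketch below triggers both kinds of error on a hypothetical two-row frame. The column name `height` is made up and deliberately absent.

```python
import pandas as pd

# Hypothetical two-row stand-in for the animals data
toy = pd.DataFrame({"brainwt": [3.385, 0.48], "bodywt": [44.5, 15.499]})

# Rows are given first, then columns
value = toy.loc[1, "bodywt"]

# A column label that does not exist raises KeyError with .loc
try:
    toy.loc[0, "height"]
    loc_error = None
except KeyError as exc:
    loc_error = type(exc).__name__

# A single out-of-range position raises IndexError with .iloc
try:
    toy.iloc[5, 0]
    iloc_error = None
except IndexError as exc:
    iloc_error = type(exc).__name__

print(value, loc_error, iloc_error)
```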

Examples of .loc and .iloc:

In [19]:
animals.iloc[5:10, 0:2]

Unnamed: 0,brainwt,bodywt
5,27.66,114.996
6,14.831,98.199
7,1.04,5.5
8,4.19,57.998
9,0.425,6.4


In [20]:
animals.loc[2:6, 'bodywt']

2      8.100
3    423.012
4    119.498
5    114.996
6     98.199
Name: bodywt, dtype: float64

In [21]:
animals.loc[1:4, ['bodywt', 'brainwt']]

Unnamed: 0,bodywt,brainwt
1,15.499,0.48
2,8.1,1.35
3,423.012,464.983
4,119.498,36.328
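One difference is visible in the outputs above: animals.iloc[5:10, 0:2] stops before row 10, while animals.loc[2:6, 'bodywt'] includes row 6. A slice of labels with .loc includes its end label, whereas a slice of positions with .iloc does not. The sketch below makes this easier to see by using a hypothetical frame with animal names as the index.

```python
import pandas as pd

# Hypothetical stand-in with string labels so labels and positions differ
toy = pd.DataFrame(
    {
        "brainwt": [3.385, 0.48, 1.35, 464.983],
        "bodywt": [44.5, 15.499, 8.1, 423.012],
    },
    index=["Arctic_fox", "Owl_monkey", "Beaver", "Cow"],
)

# Label slice: the end label "Cow" is included
by_label = toy.loc["Owl_monkey":"Cow", "bodywt"]

# Position slice: position 3 is excluded
by_position = toy.iloc[1:3, 1]

print(list(by_label.index))
print(list(by_position.index))
```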


# Filtering

We often use filtering to answer questions about the dataset. For example, we might want to know which animals have a body weight under 2 lbs. We can filter the DataFrame by putting this condition inside the square brackets.

In [22]:
animals[animals.bodywt < 2]

Unnamed: 0,brainwt,bodywt,animal
13,0.005,0.14,Lesser_short-tailed-shrew
14,0.06,1.0,Star-nosed_mole
19,0.023,0.3,Big_brown_bat
37,0.12,1.0,Golden_hamster
38,0.023,0.4,Mouse
39,0.01,0.25,Little_brown_bat
51,0.28,1.9,Rat
52,0.075,1.2,E._American_mole
54,0.048,0.33,Musk_shrew
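It can help to look at the condition on its own. A comparison such as animals.bodywt < 2 produces a Series of True and False values, one per row, and the square brackets keep only the rows marked True. A minimal sketch on a hypothetical four-row frame:

```python
import pandas as pd

# Hypothetical four-row stand-in for the animals data
toy = pd.DataFrame({
    "bodywt": [44.5, 0.14, 1.0, 423.012],
    "animal": ["Arctic_fox", "Lesser_short-tailed-shrew", "Star-nosed_mole", "Cow"],
})

# The condition alone is a Boolean Series aligned with the rows
mask = toy.bodywt < 2

# Indexing with the mask keeps only the rows where it is True
small = toy[mask]

print(mask.tolist())
print(small.animal.tolist())
```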


Our Boolean condition can also depend on the data itself, rather than a comparison to a fixed number. For example, we might be curious to know which animals have particularly small brains compared to their body. We might ask which animals have a body weight more than 25 times their brain weight.

In [23]:
animals[animals.bodywt > 25 * animals.brainwt]

Unnamed: 0,brainwt,bodywt,animal
1,0.48,15.499,Owl_monkey
10,0.101,4.0,Ground_squirrel
13,0.005,0.14,Lesser_short-tailed-shrew
34,6.8,179.003,Rhesus_monkey


We can also combine subsetting and filtering. For example, we might only care about the names of the animals whose body weight is more than 25 times their brain weight.

In [24]:
animals[animals.bodywt > 25 * animals.brainwt]['animal']

1                    Owl_monkey
10              Ground_squirrel
13    Lesser_short-tailed-shrew
34                Rhesus_monkey
Name: animal, dtype: object
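The same result can also be written as a single .loc call, with the Boolean condition as the row selector and the column name as the column selector. This is offered as an alternative sketch on a hypothetical four-row frame, not as the method the lesson uses.

```python
import pandas as pd

# Hypothetical four-row stand-in for the animals data
toy = pd.DataFrame({
    "brainwt": [3.385, 0.48, 1.35, 0.101],
    "bodywt": [44.5, 15.499, 8.1, 4.0],
    "animal": ["Arctic_fox", "Owl_monkey", "Beaver", "Ground_squirrel"],
})

mask = toy.bodywt > 25 * toy.brainwt

# Two steps: filter the rows, then select the column
chained = toy[mask]["animal"]

# One step: rows and column in a single .loc call
single = toy.loc[mask, "animal"]

print(single.tolist())
```

Both forms return the same Series here. The single .loc call tends to be the safer choice if the selection is later used to assign values.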