# Filtering columns and rows in pandas

This notebook has a little more detail on selecting and filtering data in pandas. We'll use the MLB salary data as an example -- a CSV lives at `'../data/mlb.csv'`.

In [2]:
# import pandas
import pandas as pd

In [3]:
# use the read_csv() function to create a data frame
df = pd.read_csv('../data/mlb.csv')

In [4]:
# use the `head()` method to check out what we've got
df.head()

Unnamed: 0,NAME,TEAM,POS,SALARY,START_YEAR,END_YEAR,YEARS
0,Clayton Kershaw,LAD,SP,33000000,2014,2020,7
1,Zack Greinke,ARI,SP,31876966,2016,2021,6
2,David Price,BOS,SP,30000000,2016,2022,7
3,Miguel Cabrera,DET,1B,28000000,2014,2023,10
4,Justin Verlander,DET,SP,28000000,2013,2019,7


### Selecting one column of data

You can select a column of data using dot notation `.` or bracket notation: `[]`.

If you want to select a single column of data and the column name is a valid Python name (no spaces or punctuation) that doesn't clash with an existing data frame attribute or method, you can use a period ("dot notation"). You could also pass the name of the column as a string inside square brackets ("bracket notation"), which always works; if your column names have spaces (avoid this if you can), you _must_ use bracket notation.

Let's say we wanted to select the `TEAM` column. We could do this:

In [None]:
df.TEAM

... or we could do this:

In [None]:
df['TEAM']

Either works. It's not a huge deal, but it's generally good practice to pick one and stay consistent, at least within the same script.
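
Here is a minimal sketch of where the two notations differ. It uses a tiny hand-built data frame rather than the MLB file, and the `START YEAR` and `count` column names are made up for illustration.

```python
import pandas as pd

# A tiny hand-built frame for illustration (not the real MLB data).
# 'START YEAR' and 'count' are hypothetical column names.
demo = pd.DataFrame({
    'TEAM': ['LAD', 'ARI'],
    'START YEAR': [2014, 2016],
    'count': [1, 2],
})

# For a simple name, dot and bracket notation return the same column
same = demo.TEAM.equals(demo['TEAM'])

# A name with a space can only be reached with bracket notation
start_years = demo['START YEAR']

# 'count' is also the name of a data frame method, so dot notation
# hands back the method, not the column
dot_is_method = callable(demo.count)
bracket_is_column = isinstance(demo['count'], pd.Series)
```

This is one reason to lean toward bracket notation if you have to pick: it behaves the same way no matter what the column is called.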

### Selecting multiple columns of data

To select multiple columns of data, you use _bracket notation_ -- but instead of putting a single column name inside the brackets, you hand it a _list_ of column names.

👉 For a refresher on _lists_, [check out this notebook](http://localhost:8888/notebooks/appendix/Python%20data%20types%20and%20basic%20syntax.ipynb#Lists).

Let's select the `NAME` and `TEAM` columns.

In [None]:
name_and_team = df[['NAME', 'TEAM']]

name_and_team.head()

Lots of square brackets happening there! An alternative: You could assign the list of column names that you want to select to its own variable to make things a little clearer.

In [5]:
cols_of_interest = ['TEAM', 'NAME']
df[cols_of_interest].head()

Unnamed: 0,TEAM,NAME
0,LAD,Clayton Kershaw
1,ARI,Zack Greinke
2,BOS,David Price
3,DET,Miguel Cabrera
4,DET,Justin Verlander


A good rule of thumb: If you can do anything in your script to make things clearer, or more explicit, for other people reading your code (including your future self), do it.
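
One detail worth knowing about the two bracket styles: a single column name gives you back a pandas `Series`, while a list of names gives you back a data frame, even if the list has only one item. A small sketch, using a hand-built frame with two rows copied from the output above:

```python
import pandas as pd

# Two rows copied from the head() output above, built by hand
demo = pd.DataFrame({
    'NAME': ['Clayton Kershaw', 'Zack Greinke'],
    'TEAM': ['LAD', 'ARI'],
    'SALARY': [33000000, 31876966],
})

one_col = demo['TEAM']              # a single string returns a Series
one_col_frame = demo[['TEAM']]      # a one-item list returns a DataFrame
two_cols = demo[['TEAM', 'NAME']]   # columns come back in the order you list them
```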

### Filtering rows of data

You can also filter your data set to keep just the rows that meet your filtering condition(s) -- like using the filter dropdowns in Excel or a `WHERE` clause in SQL.

Let's say you wanted to filter our MLB data to include just the Los Angeles Dodgers.

First, we need to make sure that we understand how "Los Angeles Dodgers" is represented in the data. We can use the `unique()` method to get a unique list of values in the `TEAM` column.

In [6]:
df.TEAM.unique()

array(['LAD', 'ARI', 'BOS', 'DET', 'CHC', 'LAA', 'SEA', 'NYY', 'TEX',
       'SF', 'MIN', 'NYM', 'WSH', 'CIN', 'ATL', 'BAL', 'CWS', 'COL',
       'TOR', 'STL', 'MIL', 'PHI', 'HOU', 'KC', 'MIA', 'CLE', 'PIT', 'TB',
       'OAK', 'SD'], dtype=object)

So we want to find all the records in our data where the value in the `TEAM` column is "LAD". The equivalent SQL would be:

```SQL
SELECT *
FROM mlb
WHERE TEAM = 'LAD'
```

With pandas, the basic syntax is to pass your filtering condition to the data frame in square brackets `[]`. We'll use Python's `==` [comparison operator](https://docs.python.org/3/reference/expressions.html#value-comparisons) to test for equality. Typically, you'd also want to "save" the results of your filtering by assigning the result to a variable that you can access later:

In [7]:
lad = df[df['TEAM'] == 'LAD']

In [8]:
lad.head()

Unnamed: 0,NAME,TEAM,POS,SALARY,START_YEAR,END_YEAR,YEARS
0,Clayton Kershaw,LAD,SP,33000000,2014,2020,7
17,Adrian Gonzalez,LAD,1B,22357142,2012,2018,7
48,Andre Ethier,LAD,LF,17500000,2013,2017,5
68,Scott Kazmir,LAD,SP,14985384,2016,2018,3
89,Justin Turner,LAD,3B,13000000,2017,2020,4


You can do numerical comparisons -- let's get just the players who make $1 million or more:

In [9]:
millionaires = df[df['SALARY'] >= 1000000]

In [10]:
millionaires.head()

Unnamed: 0,NAME,TEAM,POS,SALARY,START_YEAR,END_YEAR,YEARS
0,Clayton Kershaw,LAD,SP,33000000,2014,2020,7
1,Zack Greinke,ARI,SP,31876966,2016,2021,6
2,David Price,BOS,SP,30000000,2016,2022,7
3,Miguel Cabrera,DET,1B,28000000,2014,2023,10
4,Justin Verlander,DET,SP,28000000,2013,2019,7


Again, if it makes your script clearer and more readable, you can break your filter up into multiple pieces -- "save" the filtering condition under a new variable and _then_ hand that off to the data frame:

In [11]:
is_a_2_millionaire = df['SALARY'] >= 2000000
two_millionaires = df[is_a_2_millionaire]

In [12]:
two_millionaires.head()

Unnamed: 0,NAME,TEAM,POS,SALARY,START_YEAR,END_YEAR,YEARS
0,Clayton Kershaw,LAD,SP,33000000,2014,2020,7
1,Zack Greinke,ARI,SP,31876966,2016,2021,6
2,David Price,BOS,SP,30000000,2016,2022,7
3,Miguel Cabrera,DET,1B,28000000,2014,2023,10
4,Justin Verlander,DET,SP,28000000,2013,2019,7


### Filtering against multiple matches

You can use the [`isin()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.isin.html) method to test a value against multiple matches -- just hand it your list of values to check against.

Let's say we wanted to return all of the players for the Texas Rangers and Houston Astros.

In [13]:
tx = df[df['TEAM'].isin(['TEX', 'HOU'])]

In [15]:
tx.TEAM.unique()

array(['TEX', 'HOU'], dtype=object)

### NOT filtering

You can filter for data that do _not_ meet some criteria by prepending a tilde `~` to the filtering condition, which flips every `True` to `False` and vice versa.

Let's filter for all players who _aren't_ on a Texas team. It'll be the same as the filtering statement above but with a tilde at the beginning.

In [16]:
not_tx = df[~df['TEAM'].isin(['TEX', 'HOU'])]

In [17]:
not_tx.TEAM.unique()

array(['LAD', 'ARI', 'BOS', 'DET', 'CHC', 'LAA', 'SEA', 'NYY', 'SF',
       'MIN', 'NYM', 'WSH', 'CIN', 'ATL', 'BAL', 'CWS', 'COL', 'TOR',
       'STL', 'MIL', 'PHI', 'KC', 'MIA', 'CLE', 'PIT', 'TB', 'OAK', 'SD'],
      dtype=object)
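
As a quick sanity check, a condition and its tilde-inverted twin should split the data into two groups with no overlap and nothing left over. A small sketch, assuming a hand-built frame with placeholder player names (they are not from the MLB file):

```python
import pandas as pd

# Hand-built frame; the player names are placeholders, not real data
demo = pd.DataFrame({
    'NAME': ['Player A', 'Player B', 'Player C', 'Player D'],
    'TEAM': ['TEX', 'LAD', 'HOU', 'SD'],
})

# Save the condition once, then use it as-is and inverted
in_texas = demo['TEAM'].isin(['TEX', 'HOU'])

tx = demo[in_texas]
not_tx = demo[~in_texas]

# Every row lands in exactly one of the two results
row_counts_match = len(tx) + len(not_tx) == len(demo)
```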

### Filtering on multiple criteria

You can filter your data on multiple criteria. A few gotchas:
- Don't use Python's `and` and `or` operators to chain the statements -- [pandas wants you to use `&` and `|`](https://pandas.pydata.org/pandas-docs/version/0.22/indexing.html#boolean-indexing)
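- Don't forget to use parentheses to group your statements -- `&` and `|` bind more tightly than comparisons like `==`, so each condition needs its own set of parentheses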
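
A rough sketch of the first gotcha, using a hand-built frame with three rows copied from outputs in this notebook. Python's `and` asks for a single `True` or `False` for the whole column, which pandas refuses to give, while `&` compares row by row.

```python
import pandas as pd

# Three rows copied from outputs in this notebook, built by hand
demo = pd.DataFrame({
    'NAME': ['Andrew Knapp', 'Luis Torrens', 'Clayton Kershaw'],
    'POS': ['C', 'C', 'SP'],
    'SALARY': [535000, 535000, 33000000],
})

is_catcher = demo['POS'] == 'C'
at_minimum = demo['SALARY'] == 535000

# `and` needs one True/False answer for the entire Series,
# so pandas raises a ValueError about ambiguous truth values
try:
    demo[is_catcher and at_minimum]
    and_failed = False
except ValueError:
    and_failed = True

# `&` combines the two conditions one row at a time
both = demo[is_catcher & at_minimum]
```

Saving each condition to its own variable, as done here, also sidesteps the parentheses problem, because each comparison is finished before `&` ever sees it.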

Let's filter for all catchers who make the league minimum of $535,000.

In [6]:
catchers_lm = df[(df['POS'] == 'C') & (df['SALARY'] == 535000)]

In [7]:
catchers_lm

Unnamed: 0,NAME,TEAM,POS,SALARY,START_YEAR,END_YEAR,YEARS
823,Andrew Knapp,PHI,C,535000,2017,2017,1
850,Luis Torrens,SD,C,535000,2017,2017,1
864,Stuart Turner,CIN,C,535000,2017,2017,1


Much of the time, though, it's clearer to just break up your filtering into multiple statements, like this:

In [8]:
catchers = df[df['POS'] == 'C']
catchers_lm = catchers[catchers['SALARY'] == 535000]

catchers_lm

Unnamed: 0,NAME,TEAM,POS,SALARY,START_YEAR,END_YEAR,YEARS
823,Andrew Knapp,PHI,C,535000,2017,2017,1
850,Luis Torrens,SD,C,535000,2017,2017,1
864,Stuart Turner,CIN,C,535000,2017,2017,1
