<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Preliminaries" data-toc-modified-id="Preliminaries-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Preliminaries</a></span></li><li><span><a href="#Filtering-(subsetting)-pandas-DataFrames" data-toc-modified-id="Filtering-(subsetting)-pandas-DataFrames-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Filtering (subsetting) pandas DataFrames</a></span><ul class="toc-item"><li><span><a href="#.loc[]" data-toc-modified-id=".loc[]-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span><code>.loc[]</code></a></span><ul class="toc-item"><li><span><a href="#Subsetting-with-explicit-labels" data-toc-modified-id="Subsetting-with-explicit-labels-2.1.1"><span class="toc-item-num">2.1.1&nbsp;&nbsp;</span>Subsetting with explicit labels</a></span></li><li><span><a href="#Subsetting-with-slices-on-labels" data-toc-modified-id="Subsetting-with-slices-on-labels-2.1.2"><span class="toc-item-num">2.1.2&nbsp;&nbsp;</span>Subsetting with slices on labels</a></span></li><li><span><a href="#Subsetting-with-boolean-arrays" data-toc-modified-id="Subsetting-with-boolean-arrays-2.1.3"><span class="toc-item-num">2.1.3&nbsp;&nbsp;</span>Subsetting with boolean arrays</a></span></li></ul></li><li><span><a href="#.iloc[]" data-toc-modified-id=".iloc[]-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span><code>.iloc[]</code></a></span></li><li><span><a href="#.filter()" data-toc-modified-id=".filter()-2.3"><span class="toc-item-num">2.3&nbsp;&nbsp;</span><code>.filter()</code></a></span></li><li><span><a href="#Copies-vs-&quot;views&quot;" data-toc-modified-id="Copies-vs-&quot;views&quot;-2.4"><span class="toc-item-num">2.4&nbsp;&nbsp;</span>Copies vs "views"</a></span></li></ul></li><li><span><a href="#Filtering-on-a-MultiIndex" data-toc-modified-id="Filtering-on-a-MultiIndex-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Filtering on a MultiIndex</a></span></li></ul></div>

# Preliminaries

In [1]:
import pandas as pd
import numpy as np

In [2]:
df = pd.DataFrame(data=[['TSLA',1000, 0.1],['AAPL',2000, 0.05], ['MSFT', 100, 0.07]], 
                  index = ['Tesla','Apple', 'Microsoft'], 
                  columns = ['ticker','price', 'return'])
df

Unnamed: 0,ticker,price,return
Tesla,TSLA,1000,0.1
Apple,AAPL,2000,0.05
Microsoft,MSFT,100,0.07


# Filtering (subsetting) pandas DataFrames 

## ``.loc[]``

The most common way to access a subset of the data in a dataframe is through the ``.loc`` attribute. It is used with square brackets instead of parentheses and takes two arguments: the first tells pandas which rows we want from the parent dataframe, and the second tells it which columns we want. This generally looks like this:

```python
DataFrame.loc[<which_rows>, <which_columns>]
```
where, instead of ``DataFrame``, you would use the name of the full dataframe you want to subset. Pandas allows for a lot of flexibility as to what you can use instead of ``<which_rows>`` and ``<which_columns>`` above. See the examples in the official documentation to get a more complete picture of what is possible with ``.loc[]``: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.loc.html

Below, I cover the most common ways to specify which rows and columns you want:

1. By explicitly specifying the names (labels) of the index and/or column(s)
2. Using slices (ranges) on the index labels or column labels
3. Using anything that returns a boolean sequence (a value of "True" in that sequence will be interpreted as "I want this row/column")

### Subsetting with explicit labels

In [3]:
df

Unnamed: 0,ticker,price,return
Tesla,TSLA,1000,0.1
Apple,AAPL,2000,0.05
Microsoft,MSFT,100,0.07


In [4]:
df.loc['Tesla','price']

1000

In [5]:
df.loc[['Tesla','Microsoft'], ['ticker','return']]

Unnamed: 0,ticker,return
Tesla,TSLA,0.1
Microsoft,MSFT,0.07


In [6]:
df.loc[:, ['ticker']]
# equivalent shorthand: df[['ticker']]

Unnamed: 0,ticker
Tesla,TSLA
Apple,AAPL
Microsoft,MSFT


In [7]:
df.loc[:, 'ticker']

Tesla        TSLA
Apple        AAPL
Microsoft    MSFT
Name: ticker, dtype: object

In [8]:
df.loc[['Apple'], :]

Unnamed: 0,ticker,price,return
Apple,AAPL,2000,0.05


In [9]:
df.loc['Apple', :]

ticker    AAPL
price     2000
return    0.05
Name: Apple, dtype: object
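
The examples above differ in the type of object they return. As a minimal sketch (the variable names `col_as_series` and `col_as_frame` are made up for this illustration): a single label drops that dimension and gives back a Series, while a list of labels, even a list with one element, keeps the dimension and gives back a DataFrame.

```python
import pandas as pd

df = pd.DataFrame(
    data=[['TSLA', 1000, 0.1], ['AAPL', 2000, 0.05], ['MSFT', 100, 0.07]],
    index=['Tesla', 'Apple', 'Microsoft'],
    columns=['ticker', 'price', 'return'],
)

# A single label drops that dimension: the column comes back as a Series
col_as_series = df.loc[:, 'ticker']

# A list of labels keeps the dimension: the column comes back as a DataFrame
col_as_frame = df.loc[:, ['ticker']]

print(type(col_as_series).__name__)  # Series
print(type(col_as_frame).__name__)   # DataFrame
print(col_as_frame.shape)            # (3, 1)
```

The same rule applies to rows: `df.loc['Apple', :]` is a Series, while `df.loc[['Apple'], :]` is a one-row DataFrame.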

### Subsetting with slices on labels

In [10]:
df

Unnamed: 0,ticker,price,return
Tesla,TSLA,1000,0.1
Apple,AAPL,2000,0.05
Microsoft,MSFT,100,0.07


In [11]:
df.loc['Apple':'Microsoft', 'price':'return']

Unnamed: 0,price,return
Apple,2000,0.05
Microsoft,100,0.07


In [12]:
df.loc['Tesla':'Apple']

Unnamed: 0,ticker,price,return
Tesla,TSLA,1000,0.1
Apple,AAPL,2000,0.05


In [13]:
df.loc[:,'ticker':'price']

Unnamed: 0,ticker,price
Tesla,TSLA,1000
Apple,AAPL,2000
Microsoft,MSFT,100
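
Note that label slices include **both** endpoints, unlike ordinary Python slices on positions. A small sketch of the difference, using a plain list called `labels` that is introduced here only for comparison:

```python
import pandas as pd

df = pd.DataFrame(
    data=[['TSLA', 1000, 0.1], ['AAPL', 2000, 0.05], ['MSFT', 100, 0.07]],
    index=['Tesla', 'Apple', 'Microsoft'],
    columns=['ticker', 'price', 'return'],
)

# Ordinary Python slice on positions: the stop position is excluded
labels = ['Tesla', 'Apple', 'Microsoft']
print(labels[0:1])  # ['Tesla']

# Label slice with .loc: both 'Tesla' and 'Apple' are included
rows = df.loc['Tesla':'Apple']
print(list(rows.index))  # ['Tesla', 'Apple']

# The same holds for column labels
cols = df.loc[:, 'ticker':'price']
print(list(cols.columns))  # ['ticker', 'price']
```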


### Subsetting with boolean arrays

In [14]:
df.loc[[True, True, False], [True, False, True]]

Unnamed: 0,ticker,return
Tesla,TSLA,0.1
Apple,AAPL,0.05


In [15]:
df.loc[df['return'] > 0.05, :]

Unnamed: 0,ticker,price,return
Tesla,TSLA,1000,0.1
Microsoft,MSFT,100,0.07


In [16]:
df.loc[:, df.columns.str.contains('tic')]

Unnamed: 0,ticker
Tesla,TSLA
Apple,AAPL
Microsoft,MSFT


In [17]:
df.loc[df['return']> 0.07, 
       df.columns.str.contains('tic')]

Unnamed: 0,ticker
Tesla,TSLA
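
Boolean conditions can also be combined before being passed to ``.loc[]``. The sketch below uses the element-wise operators `&` (and), `|` (or) and `~` (not); the thresholds are arbitrary values chosen for illustration. Each condition needs its own parentheses, because these operators bind more tightly than comparisons like `>`.

```python
import pandas as pd

df = pd.DataFrame(
    data=[['TSLA', 1000, 0.1], ['AAPL', 2000, 0.05], ['MSFT', 100, 0.07]],
    index=['Tesla', 'Apple', 'Microsoft'],
    columns=['ticker', 'price', 'return'],
)

# Rows where BOTH conditions hold
both = df.loc[(df['return'] > 0.05) & (df['price'] < 500), :]
print(list(both.index))  # ['Microsoft']

# Rows where AT LEAST ONE condition holds
either = df.loc[(df['return'] > 0.07) | (df['price'] > 1500), :]
print(list(either.index))  # ['Tesla', 'Apple']

# Rows where the condition does NOT hold
neither = df.loc[~(df['return'] > 0.05), :]
print(list(neither.index))  # ['Apple']
```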


## ``.iloc[]``

``.iloc[]`` works similarly to ``.loc[]``, with one crucial exception: ``.iloc[]`` uses index/column **integer positions** (as opposed to labels, like ``.loc[]``).

In [18]:
df

Unnamed: 0,ticker,price,return
Tesla,TSLA,1000,0.1
Apple,AAPL,2000,0.05
Microsoft,MSFT,100,0.07


In [19]:
df.iloc[[0,2], [0,2]]

Unnamed: 0,ticker,return
Tesla,TSLA,0.1
Microsoft,MSFT,0.07


In [20]:
df.loc[['Tesla','Microsoft'], ['ticker', 'return']]

Unnamed: 0,ticker,return
Tesla,TSLA,0.1
Microsoft,MSFT,0.07


Slicing also works, but this time we have to use index/column positions, and, unlike with ``.loc[]``, the right end of the range is **not** included:

In [21]:
df.iloc[1:2, 0:2]

Unnamed: 0,ticker,price
Apple,AAPL,2000
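
To make the difference in endpoint handling concrete, here is a small sketch that selects the same block of data once by position and once by label. The names `by_position` and `by_label` are used only for this illustration.

```python
import pandas as pd

df = pd.DataFrame(
    data=[['TSLA', 1000, 0.1], ['AAPL', 2000, 0.05], ['MSFT', 100, 0.07]],
    index=['Tesla', 'Apple', 'Microsoft'],
    columns=['ticker', 'price', 'return'],
)

# Positions 0 and 1 on each axis (the stop position 2 is excluded)
by_position = df.iloc[0:2, 0:2]

# The equivalent label slice names the last label we want explicitly
by_label = df.loc['Tesla':'Apple', 'ticker':'price']

print(by_position.equals(by_label))  # True

# get_loc() translates a label into its integer position
print(df.index.get_loc('Apple'))     # 1
print(df.columns.get_loc('return'))  # 2
```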


We rarely use boolean arrays with ``.iloc[]``, so we will not cover them here.

## ``.filter()``

The ``.filter()`` method comes in handy if we want to subset based on index or column names (i.e. if we want entire rows or entire columns). In particular, its ``like`` parameter allows us to specify that we want all rows/columns that contain a particular piece of text in their label.

Syntax:
```python
DataFrame.filter(items=None, like=None, regex=None, axis=None)
```

For example:

In [22]:
df

Unnamed: 0,ticker,price,return
Tesla,TSLA,1000,0.1
Apple,AAPL,2000,0.05
Microsoft,MSFT,100,0.07


In [23]:
df.filter(like='esla', axis=0)

Unnamed: 0,ticker,price,return
Tesla,TSLA,1000,0.1


In [24]:
df.filter(like='ret', axis=1)

Unnamed: 0,return
Tesla,0.1
Apple,0.05
Microsoft,0.07
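
The other two parameters in the syntax above work along the same lines. As a brief sketch: ``items`` takes a list of exact labels, and ``regex`` takes a regular expression that is searched for in each label. The patterns below are arbitrary examples. Only one of ``items``, ``like`` and ``regex`` can be used per call.

```python
import pandas as pd

df = pd.DataFrame(
    data=[['TSLA', 1000, 0.1], ['AAPL', 2000, 0.05], ['MSFT', 100, 0.07]],
    index=['Tesla', 'Apple', 'Microsoft'],
    columns=['ticker', 'price', 'return'],
)

# items: exact labels (for a DataFrame, the default axis is the columns)
exact_cols = df.filter(items=['ticker', 'price'])
print(list(exact_cols.columns))  # ['ticker', 'price']

# regex: columns whose label starts with 'p' or ends with 'n'
regex_cols = df.filter(regex='^p|n$', axis=1)
print(list(regex_cols.columns))  # ['price', 'return']

# regex on the index: rows whose label starts with 'A' or 'M'
regex_rows = df.filter(regex='^[AM]', axis=0)
print(list(regex_rows.index))  # ['Apple', 'Microsoft']
```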


## Copies vs "views"

Let's make a copy of ``df`` that we can safely change for this section:

In [25]:
newdf = df.copy()
newdf

Unnamed: 0,ticker,price,return
Tesla,TSLA,1000,0.1
Apple,AAPL,2000,0.05
Microsoft,MSFT,100,0.07


Many times, we want to store a subset of a dataframe inside a new dataframe. For example:

In [26]:
sub = newdf.loc[:,'price']
sub

Tesla        1000
Apple        2000
Microsoft     100
Name: price, dtype: int64

Now suppose we have to make a change to the parent dataframe ``newdf``. For example:

In [27]:
newdf.loc['Tesla','price'] = 0

This change will be passed to ``sub``, even though we never made this change explicitly ourselves:

In [30]:
sub

Tesla           0
Apple        2000
Microsoft     100
Name: price, dtype: int64

This happened because, when we created ``sub`` with the command ``sub = newdf.loc[:,'price']``, Python did not actually create an entirely new object. Instead, it just returned something like an address of where in ``newdf`` the ``price`` data can be found. This is called a **view** of the data.

This is done to save memory and speed up the code, but, as we saw above, it can cause some of our dataframes to change when we edit other dataframes. Whether this happens depends on the pandas version and settings: with Copy-on-Write enabled, ``sub`` is left unchanged.

To avoid this possible problem, I recommend always telling Python to create a copy of the subset of data you want, using the ``.copy()`` method. In our example above, ``sub`` should have been created like this:

In [31]:
sub = newdf.loc[:, 'price'].copy()
sub

Tesla           0
Apple        2000
Microsoft     100
Name: price, dtype: int64

Now, changes to ``newdf``, like this:

In [32]:
newdf.loc['Apple','price'] = 123
newdf

Unnamed: 0,ticker,price,return
Tesla,TSLA,0,0.1
Apple,AAPL,123,0.05
Microsoft,MSFT,100,0.07


will not cause ``sub`` to change:

In [33]:
sub

Tesla           0
Apple        2000
Microsoft     100
Name: price, dtype: int64
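
The whole pattern can be condensed into one self-contained sketch. It rebuilds the example dataframe from the Preliminaries section and checks that an explicit ``.copy()`` is unaffected by later edits to its parent. This part behaves the same way regardless of how a given pandas version handles views.

```python
import pandas as pd

df = pd.DataFrame(
    data=[['TSLA', 1000, 0.1], ['AAPL', 2000, 0.05], ['MSFT', 100, 0.07]],
    index=['Tesla', 'Apple', 'Microsoft'],
    columns=['ticker', 'price', 'return'],
)

# Independent copy of the full dataframe, safe to modify
newdf = df.copy()

# Independent copy of one column
sub = newdf.loc[:, 'price'].copy()

# Edit the parent after taking the subset
newdf.loc['Tesla', 'price'] = 0

print(newdf.loc['Tesla', 'price'])  # 0
print(sub['Tesla'])                 # 1000 (the copy is unchanged)
print(df.loc['Tesla', 'price'])     # 1000 (the original is unchanged)
```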

# Filtering on a MultiIndex

So far, all the dataframes we've seen have had a one-dimensional index (a single column). Dataframes can have a higher-dimensional index, and when they do, Pandas calls that index a MultiIndex (for a more thorough tutorial on MultiIndex, see https://pandas.pydata.org/pandas-docs/stable/user_guide/advanced.html).

Let's create an example dataframe with a MultiIndex. Below, we create the MultiIndex using the ``pd.MultiIndex.from_product()`` function, but there are several other ways of creating one (see the link above).

In [34]:
m = pd.DataFrame(data = np.random.rand(9,3), 
                 columns = list('ABC'),
                 index = pd.MultiIndex.from_product([['AAPL','TSLA','MSFT'], [2007,2008,2009]],
                                                    names = ['ticker','year']
                                                   )
                )
m

Unnamed: 0_level_0,Unnamed: 1_level_0,A,B,C
ticker,year,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
AAPL,2007,0.90932,0.286187,0.42634
AAPL,2008,0.17153,0.020387,0.622
AAPL,2009,0.4226,0.71504,0.835241
TSLA,2007,0.695981,0.682544,0.742443
TSLA,2008,0.278523,0.571573,0.619444
TSLA,2009,0.525363,0.405011,0.851191
MSFT,2007,0.700339,0.856153,0.712035
MSFT,2008,0.61102,0.959225,0.792747
MSFT,2009,0.776965,0.385613,0.668776


Now, the index has two columns instead of one:

In [38]:
m.index

MultiIndex([('AAPL', 2007),
            ('AAPL', 2008),
            ('AAPL', 2009),
            ('TSLA', 2007),
            ('TSLA', 2008),
            ('TSLA', 2009),
            ('MSFT', 2007),
            ('MSFT', 2008),
            ('MSFT', 2009)],
           names=['ticker', 'year'])

Each entry (row) in the index is a tuple (note the parentheses):

In [39]:
m.index[0]

('AAPL', 2007)

This means that, if we want to use index labels with ``.loc`` to extract some subset of a dataframe, we need to use a tuple when we specify those index labels: 

In [40]:
m.loc[('AAPL', 2007), :]

A    0.909320
B    0.286187
C    0.426340
Name: (AAPL, 2007), dtype: float64

We can use slices on each dimension of the index, though you should first make sure that your data is sorted by the values in the index, using the ``sort_index()`` method (more on this method later):

In [43]:
m = m.sort_index()
m

Unnamed: 0_level_0,Unnamed: 1_level_0,A,B,C
ticker,year,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
AAPL,2007,0.90932,0.286187,0.42634
AAPL,2008,0.17153,0.020387,0.622
AAPL,2009,0.4226,0.71504,0.835241
MSFT,2007,0.700339,0.856153,0.712035
MSFT,2008,0.61102,0.959225,0.792747
MSFT,2009,0.776965,0.385613,0.668776
TSLA,2007,0.695981,0.682544,0.742443
TSLA,2008,0.278523,0.571573,0.619444
TSLA,2009,0.525363,0.405011,0.851191
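
One way to check whether sorting is needed is the index's ``is_monotonic_increasing`` attribute. The sketch below rebuilds a dataframe shaped like ``m`` (with fresh random values, so the numbers differ from the ones shown above) and inspects the attribute before and after sorting. The name `m_sorted` is used only for this illustration.

```python
import numpy as np
import pandas as pd

m = pd.DataFrame(
    data=np.random.rand(9, 3),
    columns=list('ABC'),
    index=pd.MultiIndex.from_product(
        [['AAPL', 'TSLA', 'MSFT'], [2007, 2008, 2009]],
        names=['ticker', 'year'],
    ),
)

# 'TSLA' comes before 'MSFT' in the index, so it is not sorted yet
print(m.index.is_monotonic_increasing)  # False

# sort_index() returns a new dataframe sorted by all index levels
m_sorted = m.sort_index()
print(m_sorted.index.is_monotonic_increasing)  # True
```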


I recommend using the ``slice()`` function to slice on each dimension of the index. Below, ``slice(None)`` means no condition should be imposed on that dimension of the index:

In [44]:
m.loc[(slice('AAPL','MSFT'), slice(None)), :]

Unnamed: 0_level_0,Unnamed: 1_level_0,A,B,C
ticker,year,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
AAPL,2007,0.90932,0.286187,0.42634
AAPL,2008,0.17153,0.020387,0.622
AAPL,2009,0.4226,0.71504,0.835241
MSFT,2007,0.700339,0.856153,0.712035
MSFT,2008,0.61102,0.959225,0.792747
MSFT,2009,0.776965,0.385613,0.668776


In [45]:
m.loc[(slice(None), slice(2008,2010)), :]

Unnamed: 0_level_0,Unnamed: 1_level_0,A,B,C
ticker,year,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
AAPL,2008,0.17153,0.020387,0.622
AAPL,2009,0.4226,0.71504,0.835241
MSFT,2008,0.61102,0.959225,0.792747
MSFT,2009,0.776965,0.385613,0.668776
TSLA,2008,0.278523,0.571573,0.619444
TSLA,2009,0.525363,0.405011,0.851191


In [46]:
m.loc[(slice('AAPL','MSFT'), slice(2008,2010)), :]

Unnamed: 0_level_0,Unnamed: 1_level_0,A,B,C
ticker,year,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
AAPL,2008,0.17153,0.020387,0.622
AAPL,2009,0.4226,0.71504,0.835241
MSFT,2008,0.61102,0.959225,0.792747
MSFT,2009,0.776965,0.385613,0.668776


Though indexing on the first dimension of the index is a lot easier:

In [50]:
m.loc['AAPL':'MSFT']

Unnamed: 0_level_0,Unnamed: 1_level_0,A,B,C
ticker,year,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
AAPL,2007,0.90932,0.286187,0.42634
AAPL,2008,0.17153,0.020387,0.622
AAPL,2009,0.4226,0.71504,0.835241
MSFT,2007,0.700339,0.856153,0.712035
MSFT,2008,0.61102,0.959225,0.792747
MSFT,2009,0.776965,0.385613,0.668776


However, the convenient syntax above does not work for the other dimensions of the index (e.g. ``m.loc[2008:2010]`` will not work).
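
For selecting on an inner dimension of the index, pandas offers a few alternatives to spelling out ``slice()`` calls. The sketch below shows three of them on a dataframe shaped like ``m`` (rebuilt with fresh random values): ``pd.IndexSlice``, the ``.xs()`` method, and a boolean array built from ``get_level_values()``. The variable names are made up for this illustration, and which option reads best is a matter of taste.

```python
import numpy as np
import pandas as pd

m = pd.DataFrame(
    data=np.random.rand(9, 3),
    columns=list('ABC'),
    index=pd.MultiIndex.from_product(
        [['AAPL', 'TSLA', 'MSFT'], [2007, 2008, 2009]],
        names=['ticker', 'year'],
    ),
).sort_index()

# 1. pd.IndexSlice allows the usual colon syntax on every level
idx = pd.IndexSlice
years_sliced = m.loc[idx[:, 2008:2009], :]
print(years_sliced.shape)  # (6, 3)

# 2. .xs() takes a cross-section for one value of a named level
#    (by default, that level is dropped from the result's index)
only_2008 = m.xs(2008, level='year')
print(list(only_2008.index))  # ['AAPL', 'MSFT', 'TSLA']

# 3. A boolean array built from the values of one index level
recent = m.loc[m.index.get_level_values('year') >= 2008, :]
print(recent.shape)  # (6, 3)
```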