<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Preliminaries" data-toc-modified-id="Preliminaries-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Preliminaries</a></span></li><li><span><a href="#Filtering-(subsetting)-pandas-DataFrames" data-toc-modified-id="Filtering-(subsetting)-pandas-DataFrames-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Filtering (subsetting) pandas DataFrames</a></span><ul class="toc-item"><li><span><a href="#.loc[]" data-toc-modified-id=".loc[]-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span><code>.loc[]</code></a></span><ul class="toc-item"><li><span><a href="#Subsetting-with-explicit-labels" data-toc-modified-id="Subsetting-with-explicit-labels-2.1.1"><span class="toc-item-num">2.1.1&nbsp;&nbsp;</span>Subsetting with explicit labels</a></span></li><li><span><a href="#Subsetting-with-slices-on-labels" data-toc-modified-id="Subsetting-with-slices-on-labels-2.1.2"><span class="toc-item-num">2.1.2&nbsp;&nbsp;</span>Subsetting with slices on labels</a></span></li><li><span><a href="#Subsetting-with-boolean-arrays" data-toc-modified-id="Subsetting-with-boolean-arrays-2.1.3"><span class="toc-item-num">2.1.3&nbsp;&nbsp;</span>Subsetting with boolean arrays</a></span></li></ul></li><li><span><a href="#.iloc[]" data-toc-modified-id=".iloc[]-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span><code>.iloc[]</code></a></span></li><li><span><a href="#.filter()" data-toc-modified-id=".filter()-2.3"><span class="toc-item-num">2.3&nbsp;&nbsp;</span><code>.filter()</code></a></span></li><li><span><a href="#Copies-vs-&quot;views&quot;" data-toc-modified-id="Copies-vs-&quot;views&quot;-2.4"><span class="toc-item-num">2.4&nbsp;&nbsp;</span>Copies vs "views"</a></span></li></ul></li><li><span><a href="#Filtering-on-a-MultiIndex" data-toc-modified-id="Filtering-on-a-MultiIndex-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Filtering on a MultiIndex</a></span></li></ul></div>

# Preliminaries

In [None]:
import pandas as pd
import numpy as np

In [None]:
df = pd.DataFrame(data=[['TSLA',1000, 0.1],['AAPL',2000, 0.05], ['MSFT', 100, 0.07]], 
                  index = ['Tesla','Apple', 'Microsoft'], 
                  columns = ['ticker','price', 'return'])
df

# Filtering (subsetting) pandas DataFrames 

## ``.loc[]``

The most common way to access a subset of the data in a dataframe is through the ``.loc`` attribute. This attribute uses square brackets instead of parentheses and takes two arguments: the first tells pandas which rows we want from the parent dataframe, and the second tells it which columns we want. This generally looks like this:

```python
DataFrame.loc[<which_rows>, <which_columns>]
```
where, instead of ``DataFrame``, you would use the name of the full dataframe you want to subset. Pandas allows for a lot of flexibility as to what you can use instead of ``<which_rows>`` and ``<which_columns>`` above. See the examples in the official documentation to get a more complete picture of what is possible with ``.loc[]``: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.loc.html

Below, I cover the most common ways to specify which rows and columns you want:

1. By explicitly specifying the names (labels) of the index and/or column(s)
2. Using slices (ranges) on the index labels or column labels
3. Using anything that returns a boolean sequence (a value of "True" in that sequence will be interpreted as "I want this row/column")

### Subsetting with explicit labels
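
A minimal sketch of subsetting with explicit labels, assuming the ``df`` dataframe from the Preliminaries section (it is recreated here so the cell runs on its own). The variable names are only illustrative.

```python
import pandas as pd

# Same example dataframe as in the Preliminaries section
df = pd.DataFrame(
    data=[['TSLA', 1000, 0.1], ['AAPL', 2000, 0.05], ['MSFT', 100, 0.07]],
    index=['Tesla', 'Apple', 'Microsoft'],
    columns=['ticker', 'price', 'return'],
)

# One row label and one column label: returns a single value
tesla_price = df.loc['Tesla', 'price']

# Lists of labels: returns a smaller dataframe
two_firms = df.loc[['Tesla', 'Microsoft'], ['ticker', 'return']]

# A bare ':' means "everything" along that dimension
all_prices = df.loc[:, 'price']

print(tesla_price)
print(two_firms)
print(all_prices)
```

Note that passing a single label returns a lower-dimensional object (a value or a Series), while passing a list of labels keeps the result a dataframe.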

### Subsetting with slices on labels
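
A short sketch of label-based slicing, again assuming the example ``df`` from the Preliminaries section. With ``.loc[]``, both ends of a label slice are included in the result.

```python
import pandas as pd

# Same example dataframe as in the Preliminaries section
df = pd.DataFrame(
    data=[['TSLA', 1000, 0.1], ['AAPL', 2000, 0.05], ['MSFT', 100, 0.07]],
    index=['Tesla', 'Apple', 'Microsoft'],
    columns=['ticker', 'price', 'return'],
)

# Rows from 'Tesla' up to AND including 'Apple', all columns
first_two = df.loc['Tesla':'Apple', :]

# All rows, columns from 'price' up to AND including 'return'
price_and_return = df.loc[:, 'price':'return']

print(first_two)
print(price_and_return)
```

The slice follows the order in which the labels appear in the dataframe, not alphabetical order.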

### Subsetting with boolean arrays
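
A sketch of boolean subsetting, assuming the example ``df`` from the Preliminaries section. The thresholds below are arbitrary and chosen only for illustration.

```python
import pandas as pd

# Same example dataframe as in the Preliminaries section
df = pd.DataFrame(
    data=[['TSLA', 1000, 0.1], ['AAPL', 2000, 0.05], ['MSFT', 100, 0.07]],
    index=['Tesla', 'Apple', 'Microsoft'],
    columns=['ticker', 'price', 'return'],
)

# A comparison on a column returns a boolean Series (one True/False per row)
expensive = df.loc[df['price'] > 500, :]

# Combine conditions with & (and) or | (or); each condition needs parentheses
mask = (df['price'] > 500) & (df['return'] > 0.07)
expensive_high_return = df.loc[mask, ['ticker', 'price']]

# An explicit list of booleans also works, one entry per row
first_and_last = df.loc[[True, False, True], :]

print(expensive)
print(expensive_high_return)
print(first_and_last)
```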

## ``.iloc[]``

``.iloc[]`` works similarly to ``.loc[]``, with one crucial exception: ``.iloc[]`` uses index/column **integer positions** (as opposed to labels, like ``.loc[]``). Positions start at 0.
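
A minimal sketch of position-based selection, assuming the example ``df`` from the Preliminaries section:

```python
import pandas as pd

# Same example dataframe as in the Preliminaries section
df = pd.DataFrame(
    data=[['TSLA', 1000, 0.1], ['AAPL', 2000, 0.05], ['MSFT', 100, 0.07]],
    index=['Tesla', 'Apple', 'Microsoft'],
    columns=['ticker', 'price', 'return'],
)

# First row (position 0), second column (position 1): Tesla's price
tesla_price = df.iloc[0, 1]

# Lists of positions: first and third rows, first and third columns
corners = df.iloc[[0, 2], [0, 2]]

# Negative positions count from the end: the last row, all columns
last_row = df.iloc[-1, :]

print(tesla_price)
print(corners)
print(last_row)
```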

Slicing also works, but this time we have to use index/column numbers, and the right-most end of the range is **not** included:
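
A sketch of slicing by position, assuming the example ``df`` from the Preliminaries section. Unlike label slices with ``.loc[]``, the end position is excluded, as in regular Python slicing.

```python
import pandas as pd

# Same example dataframe as in the Preliminaries section
df = pd.DataFrame(
    data=[['TSLA', 1000, 0.1], ['AAPL', 2000, 0.05], ['MSFT', 100, 0.07]],
    index=['Tesla', 'Apple', 'Microsoft'],
    columns=['ticker', 'price', 'return'],
)

# Rows at positions 0 and 1 (position 2 is NOT included), all columns
first_two_rows = df.iloc[0:2, :]

# All rows, columns at positions 1 and 2 (position 3 is NOT included)
last_two_cols = df.iloc[:, 1:3]

print(first_two_rows)
print(last_two_cols)
```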

We rarely use boolean arrays with ``.iloc[]``, so we will not cover them here.

## ``.filter()``

The ``.filter()`` method comes in handy if we want to subset based on index or column names (i.e. if we want entire rows or entire columns). Note that it filters on the labels, not on the data inside the dataframe. In particular, its ``like`` parameter allows us to specify that we want all rows/columns that contain a particular piece of text in their label.

Syntax:
```python
DataFrame.filter(items=None, like=None, regex=None, axis=None)
```

For example:
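
A few illustrative calls, assuming the example ``df`` from the Preliminaries section. The text patterns used below are arbitrary choices for this sketch.

```python
import pandas as pd

# Same example dataframe as in the Preliminaries section
df = pd.DataFrame(
    data=[['TSLA', 1000, 0.1], ['AAPL', 2000, 0.05], ['MSFT', 100, 0.07]],
    index=['Tesla', 'Apple', 'Microsoft'],
    columns=['ticker', 'price', 'return'],
)

# items: keep the listed labels (for a dataframe, columns are the default axis)
by_items = df.filter(items=['ticker', 'price'])

# like: keep columns whose label contains the text 'pr'
cols_like = df.filter(like='pr', axis=1)

# like on the rows: keep rows whose index label contains the letter 's'
rows_like = df.filter(like='s', axis=0)

# regex: keep rows whose index label starts with 'A' or 'M'
rows_regex = df.filter(regex='^(A|M)', axis=0)

print(by_items)
print(cols_like)
print(rows_like)
print(rows_regex)
```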

## Copies vs "views"

Let's make a copy of ``df`` that we can safely change for this section:

Many times, we want to store a subset of a dataframe inside a new dataframe. For example:

Now suppose we have to make a change to the parent dataframe ``newdf``. For example:

In pandas versions that do not use Copy-on-Write, this change will be passed to ``sub``, even though we never made this change explicitly ourselves:
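
The steps described in this section can be sketched in a single cell, as below. Whether ``sub`` actually picks up the change depends on the pandas version: older versions without Copy-on-Write typically pass the change through, while pandas 3.0 makes Copy-on-Write the default, in which case ``sub`` stays untouched. For that reason the sketch prints what happened instead of assuming an outcome.

```python
import pandas as pd

# Same example dataframe as in the Preliminaries section
df = pd.DataFrame(
    data=[['TSLA', 1000, 0.1], ['AAPL', 2000, 0.05], ['MSFT', 100, 0.07]],
    index=['Tesla', 'Apple', 'Microsoft'],
    columns=['ticker', 'price', 'return'],
)

# A copy of df that we can safely change
newdf = df.copy()

# Store a subset of newdf WITHOUT asking for a copy
sub = newdf.loc[:, 'price']
before = sub.loc['Tesla']

# Change the parent dataframe
newdf.loc['Tesla', 'price'] = 0
after = sub.loc['Tesla']

# Report whether the change leaked into sub on this pandas version
print('pandas version:', pd.__version__)
print('sub changed along with newdf:', bool(before != after))
```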

This happened because, when we created ``sub`` with the command ``sub = newdf.loc[:,'price']``, pandas did not actually create an entirely new object. Instead, it just returned something like an address of where in ``newdf`` the ``price`` data can be found. This is called a **view** of the data.

This is done to preserve memory and speed up the code, but, like we saw above, it can cause some of our dataframes to change when we edit other dataframes. 

To avoid this possible problem, I recommend always telling pandas to create a copy of the subset of data you want, using the ``.copy()`` method. In our example above, ``sub`` should have been created like this:

Now, changes to ``newdf``, like this:

Will not cause ``sub`` to change:
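
A minimal sketch of the recommended pattern, assuming the example ``df`` from the Preliminaries section. Because ``sub`` is an explicit copy, it does not depend on what later happens to ``newdf``.

```python
import pandas as pd

# Same example dataframe as in the Preliminaries section
df = pd.DataFrame(
    data=[['TSLA', 1000, 0.1], ['AAPL', 2000, 0.05], ['MSFT', 100, 0.07]],
    index=['Tesla', 'Apple', 'Microsoft'],
    columns=['ticker', 'price', 'return'],
)

newdf = df.copy()

# Explicitly copy the subset so it owns its own data
sub = newdf.loc[:, 'price'].copy()

# Change the parent dataframe
newdf.loc['Tesla', 'price'] = 0

print(newdf)
print(sub)  # Tesla's price in sub is still the original value
```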

# Filtering on a MultiIndex

So far, all the dataframes we've seen have had a one-dimensional index (a single column of labels). Dataframes can have a higher-dimensional index, and when they do, pandas calls that index a MultiIndex (for a more thorough tutorial on MultiIndex, see https://pandas.pydata.org/pandas-docs/stable/user_guide/advanced.html).

Let's create an example dataframe with a MultiIndex. Below, we create the MultiIndex using the ``pd.MultiIndex.from_product()`` function, but there are several other ways of creating one (see the link above).

In [None]:
m = pd.DataFrame(data = np.random.rand(9,3), 
                 columns = list('ABC'),
                 index = pd.MultiIndex.from_product([['AAPL','TSLA','MSFT'], [2007,2008,2009]],
                                                    names = ['ticker','year']
                                                   )
                )
m

Now, the index has two columns instead of one:
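
One way to inspect the index, assuming the dataframe ``m`` created above (recreated here so the cell runs on its own):

```python
import numpy as np
import pandas as pd

# Same MultiIndex dataframe as above (values are random)
m = pd.DataFrame(
    data=np.random.rand(9, 3),
    columns=list('ABC'),
    index=pd.MultiIndex.from_product(
        [['AAPL', 'TSLA', 'MSFT'], [2007, 2008, 2009]],
        names=['ticker', 'year'],
    ),
)

print(m.index)                # the full MultiIndex
print(m.index.nlevels)        # number of dimensions (levels) in the index
print(list(m.index.names))    # the name of each level
```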

Each entry (row) in the index is a tuple (note the parentheses):
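
A quick check of this, assuming the dataframe ``m`` created above:

```python
import numpy as np
import pandas as pd

# Same MultiIndex dataframe as above (values are random)
m = pd.DataFrame(
    data=np.random.rand(9, 3),
    columns=list('ABC'),
    index=pd.MultiIndex.from_product(
        [['AAPL', 'TSLA', 'MSFT'], [2007, 2008, 2009]],
        names=['ticker', 'year'],
    ),
)

# A single entry of the index is a (ticker, year) tuple
first_label = m.index[0]
print(first_label)
print(type(first_label))

# The first few index entries as a plain list of tuples
print(m.index.to_list()[:3])
```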

This means that, if we want to use index labels with ``.loc`` to extract some subset of a dataframe, we need to use a tuple when we specify those index labels: 
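
A sketch of label-based selection with tuples, assuming the dataframe ``m`` created above. The particular tickers and years picked here are arbitrary.

```python
import numpy as np
import pandas as pd

# Same MultiIndex dataframe as above (values are random)
m = pd.DataFrame(
    data=np.random.rand(9, 3),
    columns=list('ABC'),
    index=pd.MultiIndex.from_product(
        [['AAPL', 'TSLA', 'MSFT'], [2007, 2008, 2009]],
        names=['ticker', 'year'],
    ),
)

# One full index label (a tuple): returns that row as a Series
one_row = m.loc[('AAPL', 2008), :]

# A list of tuples: returns a dataframe with those rows
two_rows = m.loc[[('AAPL', 2008), ('MSFT', 2009)], :]

# A tuple for the row plus a column label: returns a single value
one_value = m.loc[('TSLA', 2007), 'B']

print(one_row)
print(two_rows)
print(one_value)
```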

We can use slices on each dimension of the index, though you should make sure that you have first sorted your data by the values in the index, using the ``sort_index()`` function (more on this function later):
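
A sketch of slicing between two full index labels after sorting, assuming the dataframe ``m`` created above. The name ``m_sorted`` is introduced only for this example.

```python
import numpy as np
import pandas as pd

# Same MultiIndex dataframe as above (values are random)
m = pd.DataFrame(
    data=np.random.rand(9, 3),
    columns=list('ABC'),
    index=pd.MultiIndex.from_product(
        [['AAPL', 'TSLA', 'MSFT'], [2007, 2008, 2009]],
        names=['ticker', 'year'],
    ),
)

# Sort by ticker, then by year, before slicing
m_sorted = m.sort_index()
print(m_sorted.index.is_monotonic_increasing)

# Slice from ('AAPL', 2008) up to and including ('MSFT', 2007)
subset = m_sorted.loc[('AAPL', 2008):('MSFT', 2007), :]
print(subset)
```

After sorting, the tickers appear in alphabetical order (AAPL, MSFT, TSLA), so the slice above covers the last two AAPL rows and the first MSFT row.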

I recommend using the ``slice()`` function to slice on each dimension of the index. Below, ``slice(None)`` means no condition should be imposed on that dimension of the index:
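
A sketch of per-dimension slicing with ``slice()``, assuming the dataframe ``m`` created above, sorted first as discussed earlier. The name ``m_sorted`` is introduced only for this example.

```python
import numpy as np
import pandas as pd

# Same MultiIndex dataframe as above (values are random)
m = pd.DataFrame(
    data=np.random.rand(9, 3),
    columns=list('ABC'),
    index=pd.MultiIndex.from_product(
        [['AAPL', 'TSLA', 'MSFT'], [2007, 2008, 2009]],
        names=['ticker', 'year'],
    ),
)
m_sorted = m.sort_index()

# Any ticker, years 2008 through 2009 (both ends included)
recent = m_sorted.loc[(slice(None), slice(2008, 2009)), :]

# Tickers from 'AAPL' through 'MSFT', years from 2008 onward
some_firms = m_sorted.loc[(slice('AAPL', 'MSFT'), slice(2008, None)), :]

print(recent)
print(some_firms)
```

The tuple passed as the row argument has one ``slice()`` per dimension of the index, in the same order as the index levels.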

Indexing on the first dimension of the index alone is a lot easier, though:
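
A sketch of selecting on the first dimension only, assuming the dataframe ``m`` created above:

```python
import numpy as np
import pandas as pd

# Same MultiIndex dataframe as above (values are random)
m = pd.DataFrame(
    data=np.random.rand(9, 3),
    columns=list('ABC'),
    index=pd.MultiIndex.from_product(
        [['AAPL', 'TSLA', 'MSFT'], [2007, 2008, 2009]],
        names=['ticker', 'year'],
    ),
)

# A single ticker: the result is indexed by year only
aapl = m.loc['AAPL']

# A list of tickers: the result keeps both index dimensions
two_tickers = m.loc[['AAPL', 'TSLA']]

# A slice of tickers (sort the index first, as discussed above)
ticker_range = m.sort_index().loc['AAPL':'MSFT']

print(aapl)
print(two_tickers)
print(ticker_range)
```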

However, the convenient syntax above does not work for the other dimensions of the index (e.g. ``m.loc[2008:2010]`` will not work).
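
To filter on the ``year`` dimension alone, a few alternatives exist. The sketch below assumes the dataframe ``m`` created above and shows three options; which one reads best is a matter of taste.

```python
import numpy as np
import pandas as pd

# Same MultiIndex dataframe as above (values are random)
m = pd.DataFrame(
    data=np.random.rand(9, 3),
    columns=list('ABC'),
    index=pd.MultiIndex.from_product(
        [['AAPL', 'TSLA', 'MSFT'], [2007, 2008, 2009]],
        names=['ticker', 'year'],
    ),
)

# Option 1: slice() on every dimension, with slice(None) for the tickers
from_2008 = m.sort_index().loc[(slice(None), slice(2008, None)), :]

# Option 2: a boolean array built from the values of the 'year' level
year_mask = m.index.get_level_values('year') >= 2008
from_2008_mask = m.loc[year_mask, :]

# Option 3: .xs() picks a single value of one level (the level is dropped)
only_2008 = m.xs(2008, level='year')

print(from_2008)
print(from_2008_mask)
print(only_2008)
```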