# Tutorial 05 - `DataFrame` Masking

This tutorial begins with a discussion of a programming technique called *comparison*.

We then see how *comparison* can be used to select subsets of a `DataFrame` according to a variety of criteria.

Using comparison to subset a `DataFrame` is known as *masking*.

## Importing Packages

Let's import the packages that we will need throughout this tutorial.

In [None]:
##> import numpy as np
##> import pandas as pd




## Comparison

In data analysis, you often have occasion to compare two variables.

This means you are checking for conditions like equality (`==`), inequality (`!=`), and less-than (`<`).

The comparison of variables creates `boolean` data: `True` or `False`.

The following code demonstrates some comparisons involving strings:

In [None]:
##> str_x = "SPY"
##> str_y = "IWM"
##> str_z = "SPY"
##> 
##> print(str_x == str_y)
##> print(str_x == str_z)
##> print(str_x != str_y)




This code demonstrates comparisons involving numbers:

In [None]:
##> dbl_x = 287.50
##> dbl_y = 250
##> 
##> print(dbl_x > dbl_y)
##> print(dbl_x < dbl_y)
##> print(dbl_x == dbl_y)
##> print(dbl_x != dbl_y)




You can also assign the boolean that is generated by comparison.

In [None]:
##> bln_1 = (dbl_x == dbl_y)
##> bln_2 = (str_x == str_y)
##> 
##> print(bln_1)
##> print(bln_2)




## Boolean Operators

A quick aside on *boolean operators*...

A boolean operator takes in one or more booleans, and returns another boolean.

There are three operators to be aware of:

- `not` (unary) - takes a single boolean and returns its negation

- `or` (binary) - takes two booleans and returns `True` if at least one is `True`

- `and` (binary) - takes two booleans and returns `True` only if both are `True`

Let's demonstrate the `not` operator:

In [None]:
##> print(not True)
##> print(not False)




Let's demonstrate the `or` operator:

In [None]:
##> print(True or True)
##> print(True or False)
##> print(False or True)
##> print(False or False)




And lastly, let's demonstrate the `and` operator:

In [None]:
##> print(True and True)
##> print(True and False)
##> print(False and True)
##> print(False and False)




It is possible to combine these operators in various ways, using parentheses, to come up with complex expressions:

In [None]:
##> (not True) and (False and not(False or True))



This can seem a little confusing at first, but it always reduces down to the simple unary and binary boolean operators we discussed above.
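To make that reduction concrete, here is a rough sketch that evaluates the expression above one piece at a time. The variable names (`left`, `inner_or`, `right`, `result`) are made up for illustration.

```python
# Evaluate (not True) and (False and not (False or True)) step by step.
left = not True                     # False
inner_or = False or True            # True
right = False and (not inner_or)    # False and False -> False
result = left and right             # False and False -> False

print(left, inner_or, right, result)

# The pieces agree with evaluating the whole expression at once.
print(result == ((not True) and (False and not (False or True))))
```

Each intermediate line uses only one of the three basic operators, which is all that any larger boolean expression ever does.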

**Code Challenge:** Create a big, ugly boolean expression and make sure that the code actually runs.

## Reading in Sample Data

For the remainder of the discussion, we will need some data to work with. Let's read in our sample SPY data:

In [None]:
##> df_spy = pd.read_csv('../data/spy_dec_2018.csv')
##> df_spy




It's always helpful to explore your data a little once you have imported it.  One thing that I like to check is the number of unique values in categorical columns, like `date`.

That is what is done in the following code:

In [None]:
##> df_spy['date'].unique().size



The above calculation is not important to the rest of the tutorial, but I will be peppering little tips and tricks throughout these tutorials to facilitate learning via repetition.
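As a small aside, `Series.nunique()` gives the same count in one call. The sketch below uses a made-up `ser_dates` rather than the tutorial's data. One difference to keep in mind: `nunique()` skips missing values by default, while `unique()` includes them.

```python
import pandas as pd

# Hypothetical stand-in for a date column (not the tutorial's data).
ser_dates = pd.Series(["2018-12-03", "2018-12-03", "2018-12-04", "2018-12-05"])

n_unique_a = ser_dates.unique().size  # unique() returns an array; .size counts its elements
n_unique_b = ser_dates.nunique()      # counts distinct values directly

print(n_unique_a, n_unique_b)  # 3 3
```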

**Code Challenge:** Use simple slicing to select the last three rows of `df_spy` without typing any numbers.

## Comparison and DataFrame Columns

As discussed in a previous tutorial, a column of a `DataFrame` is a `Series` object, which is a glorified `numpy.array` (think vector or matrix).

Let's separate out the `adjusted` column of `df_spy` and assign it to a variable:

In [None]:
##> pd.options.display.max_rows = 6 # this modifies the printing of dataframes
##> ser_adjusted = df_spy['adjusted']
##> ser_adjusted




Recall that a `pandas.Series` is smart with respect to component-wise operations, meaning it behaves like vectors from linear algebra would.  This means that the following operations are *broadcast* as you might expect.

Subtraction by 100 broadcasted:

In [None]:
##> ser_adjusted - 100



Division by 100 broadcasted:

In [None]:
##> ser_adjusted / 100



It is a convenient fact that this broadcasting behavior also occurs with comparison.  

The following code checks which elements of `ser_adjusted` are greater than 250: 

In [None]:
##> ser_test = (ser_adjusted > 250)
##> ser_test




Let's check that the resulting variable `ser_test` is a `pandas.Series`:

In [None]:
##> type(ser_test)



And finally let's observe the elements of `ser_test`:

In [None]:
##> print(ser_test.values)



A few observations about what just happened:

1. When we compare a `Series` of numerical values (`ser_adjusted`) to a single number (`250`), we get back a `Series` of booleans (`ser_test`)

2. For each index `i`, `ser_test[i]` equals `ser_adjusted[i] > 250`.

3. So the comparison operation was broadcast as you would expect.

This is easy to see by appending `ser_test` to `df_spy` and then reprinting:

In [None]:
##> pd.options.display.max_rows = 25
##> df_spy['test'] = ser_test
##> df_spy




Notice that starting on 12/19/2018, the adjusted price of SPY was below 250.
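The element-by-element claim in the second observation can also be checked by hand. Here is a minimal sketch, using made-up prices in `ser_prices` rather than the SPY data, that compares the broadcast result against a plain Python loop.

```python
import pandas as pd

# Hypothetical prices standing in for the adjusted column.
ser_prices = pd.Series([262.5, 249.9, 250.0, 251.3])

# Broadcast comparison: one boolean per element.
ser_flags = ser_prices > 250

# The same comparison done one element at a time.
lst_manual = [bool(price > 250) for price in ser_prices]

print(ser_flags.tolist())                 # [True, False, False, True]
print(ser_flags.tolist() == lst_manual)   # True
```

Note that `250.0 > 250` is `False`, so a price of exactly 250 is not flagged.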

The broadcasting of comparison can be used as a powerful way to query subsets of data from a `DataFrame`; we consider this technique next.

## Masking with DataFrames

The next few steps might seem a little strange, but bear with me, I assure you this is leading somewhere.

From the code below we know that `df_spy` has 20 rows:

In [None]:
##> df_spy.shape



The following code creates a list consisting of 20 booleans, all of them `False`:

In [None]:
##> lst_bool = [False] * 20
##> lst_bool




Now, let's see what happens when we feed this `list` of `False` booleans into `df_spy` using square brackets.

In [None]:
##> df_spy[lst_bool]



It may look like nothing has happened, but really what has been returned is an empty `DataFrame`.

**Code Challenge:** Verify that `df_spy[lst_bool]` is a `DataFrame`.

Next let's modify `lst_bool` slightly, by changing the first entry to `True`, and then feed it into `df_spy` again.

In [None]:
##> lst_bool[0] = True # this changes the first entry to True
##> df_spy[lst_bool]




So what happened?  Notice that `df_spy[lst_bool]` returns a `DataFrame` consisting only of the first row of `df_spy`.

Let's modify `lst_bool` once again, by setting its second entry to `True`, and then repeat our little experiment.

In [None]:
##> lst_bool[1] = True # this sets the second component to True
##> df_spy[lst_bool]




**Punchline:** What is returned by the code `df_spy[lst_bool]` will be a `DataFrame` consisting of all the rows corresponding to the `True` entries of `lst_bool`.

This is the notion of `DataFrame` *masking*.
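Here is the same idea as a self-contained sketch on a tiny, made-up frame (`df_demo` and its values are hypothetical). The list needs one boolean per row, and the rows that survive keep their original index labels.

```python
import pandas as pd

# Hypothetical four-row frame (not the tutorial's data).
df_demo = pd.DataFrame({
    "symbol": ["SPY", "IWM", "QQQ", "DIA"],
    "close": [250.0, 135.0, 155.0, 233.0],
})

lst_mask = [True, False, True, False]  # one boolean per row
df_kept = df_demo[lst_mask]            # keeps rows 0 and 2

print(df_kept)
print(len(df_kept) == sum(lst_mask))   # True: rows kept equals the number of True entries
```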

**Code Challenge:** Modify `lst_bool` and then use `DataFrame` masking to grab the first, second, and fourth rows of `df_spy`.

## Filtering a `DataFrame` with Masking

In a data analysis, it is often the case that you will want to isolate certain rows of a `DataFrame`. The rows you are interested in can often be identified by comparisons involving columns of the `DataFrame`.

We can achieve this kind of row isolation by combining the broadcasting of comparison over `DataFrame` columns with the masking of `DataFrames`.

In order to consider concrete examples, let's read in some data.

The following code reads in a dataset consisting of EOD prices for four different ETFs (SPY, IWM, QQQ, DIA) during the month of December 2018:

In [None]:
##> pd.options.display.max_rows = 25 # changes displaying behavior 
##> df_etf = pd.read_csv("../data/index_etf_dec_2018.csv")
##> df_etf




A method that is useful for inspecting the columns of a `DataFrame` is `Series.value_counts()`.

Let's apply this to the `symbol` column of `df_etf`:

In [None]:
##> df_etf['symbol'].value_counts()



The above output means that each value of `symbol` appears in 20 rows of `df_etf`.

### Filtering for One Symbol

We are now ready to start exploring non-trivial applications of `DataFrame` masking.

As a first example, let's isolate all the rows of `df_etf` that correspond to `IWM`:

In [None]:
##> pd.options.display.max_rows = 6
##> ser_bool = (df_etf['symbol'] == "IWM")
##> df_etf[ser_bool]




Notice that we did this in two steps: 

1. Calculate the `Series` of booleans called `ser_bool`;

2. Perform the masking by using square brackets `[]` and `ser_bool`.

We can actually perform this masking in a single step:

In [None]:
##> df_etf[df_etf['symbol'] == "IWM"]



**Code Challenge:** Select all the rows of `df_etf` for `QQQ`. 

### Filtering for Multiple Symbols

We can use the `.isin()` method to filter a `DataFrame` for multiple symbols.  The technique is to feed the `.isin()` method a list of the symbols you want to filter for.

The following code grabs all the rows of `df_etf` for either `QQQ` or `DIA`:

In [None]:
##> df_etf[df_etf['symbol'].isin(['QQQ', 'DIA'])]
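
One way to think about `.isin()` is as shorthand for several equality checks joined by "or". The sketch below, on a made-up `ser_symbol`, compares the two; it uses `|`, which is the element-wise counterpart of `or` for a `Series`.

```python
import pandas as pd

# Hypothetical symbol column (not the tutorial's data).
ser_symbol = pd.Series(["SPY", "QQQ", "DIA", "IWM", "QQQ"])

mask_isin = ser_symbol.isin(["QQQ", "DIA"])

# Parentheses are needed because | binds more tightly than ==.
mask_or = (ser_symbol == "QQQ") | (ser_symbol == "DIA")

print(mask_isin.tolist())                  # [False, True, True, False, True]
print(bool((mask_isin == mask_or).all()))  # True
```

With two symbols either form is readable; with a longer list, `.isin()` tends to be easier to maintain.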



**Code Challenge:** Grab all rows of `df_etf` corresponding to `SPY`, `IWM`, and `QQQ`.

### Filtering for a Single Date

For our next example, let's isolate the price data from the December 2018 regular option expiration, which was 12/21/2018.

In [None]:
##> df_etf[df_etf['date'] == '2018-12-21']



**Code Challenge:** Grab all the rows of `df_etf` for New Year's Eve.

### Filtering for a Date Range

The following code grabs all the rows of `df_etf` that come after the middle of the month:

In [None]:
##> df_etf[df_etf['date'] > '2018-12-15']
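
The comparison above is between strings, since `pd.read_csv()` was called without date parsing. It works because `YYYY-MM-DD` strings sort in chronological order. As a hedged sketch on made-up data (`df_dates` is hypothetical), the same filter can also be written with real datetimes, which does not depend on the string format:

```python
import pandas as pd

# Hypothetical frame with ISO-formatted date strings.
df_dates = pd.DataFrame({
    "date": ["2018-12-14", "2018-12-17", "2018-12-18"],
    "close": [260.0, 255.0, 254.0],
})

# String comparison: relies on YYYY-MM-DD sorting chronologically.
mask_str = df_dates["date"] > "2018-12-15"

# Datetime comparison: convert once, then compare against a Timestamp.
ser_dt = pd.to_datetime(df_dates["date"])
mask_dt = ser_dt > pd.Timestamp("2018-12-15")

print(mask_str.tolist())  # [False, True, True]
print(mask_dt.tolist())   # [False, True, True]
```

If dates were stored in a format such as `12/7/2018`, the string comparison would no longer order them correctly, so converting with `pd.to_datetime()` first would be the safer route.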



**Code Challenge:** Grab all the rows of `df_etf` that come before 12/7/2018.

### Filtering on Multiple Criteria

We can filter on multiple criteria by using the `&` operator, which is the vectorized version of `and`.

Suppose that we want all rows for `SPY` or `IWM` that come after Christmas:

In [None]:
##> pd.options.display.max_rows = 10
##> bln_ticker = df_etf['symbol'].isin(['SPY', 'IWM'])
##> bln_date = df_etf['date'] > '2018-12-25'
##> bln_combined = bln_ticker & bln_date
##> 
##> df_etf[bln_combined]
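
The same filter can be written on one line, and `&` has two siblings: `|` (element-wise `or`) and `~` (element-wise `not`). Here is a rough sketch on a small, made-up `df_demo`. Each comparison is wrapped in parentheses because `&` and `|` bind more tightly than `>` and `==`.

```python
import pandas as pd

# Hypothetical frame (not the tutorial's data).
df_demo = pd.DataFrame({
    "symbol": ["SPY", "IWM", "QQQ", "SPY", "IWM", "QQQ"],
    "date": ["2018-12-24", "2018-12-24", "2018-12-24",
             "2018-12-26", "2018-12-26", "2018-12-26"],
})

# and: SPY or IWM rows that fall after Christmas.
df_and = df_demo[(df_demo["symbol"].isin(["SPY", "IWM"])) & (df_demo["date"] > "2018-12-25")]

# or: rows that are QQQ, or that fall after Christmas (or both).
df_or = df_demo[(df_demo["symbol"] == "QQQ") | (df_demo["date"] > "2018-12-25")]

# not: rows whose symbol is neither SPY nor IWM.
df_not = df_demo[~df_demo["symbol"].isin(["SPY", "IWM"])]

print(list(df_and.index))  # [3, 4]
print(list(df_or.index))   # [2, 3, 4, 5]
print(list(df_not.index))  # [2, 5]
```

Building the masks as named variables first, as in the cell above, is often easier to read and debug than the one-line form.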




**Code Challenge:** Isolate the row for `QQQ` on New Year's Eve using the `&` operator.

## Related Reading

*PDSH* - 2.6 - Comparisons, Masks, and Boolean Logic

*PDSH* - 2.7 - Fancy Indexing

*PDSH* - 3.2 - Data Indexing and Selection 