# Tutorial 05 - `DataFrame` Masking

This tutorial begins with a discussion of a programming technique called *comparison*.

We then see how *comparison* can be used to select subsets of a `DataFrame` according to a variety of criteria.

Using comparison to subset a `DataFrame` is known as *masking*.

## Importing Packages

Let's import the packages that we will need throughout this tutorial.

In [1]:
import numpy as np
import pandas as pd

## Comparison

In data analysis, you often have occasion to compare two variables.

This means you are checking for conditions such as equality (`==`), inequality (`!=`), and less-than (`<`).

The comparison of variables creates `boolean` data: `True` or `False`.

The following code demonstrates some comparisons involving strings:

In [2]:
str_x = "SPY"
str_y = "IWM"
str_z = "SPY"

print(str_x == str_y)
print(str_x == str_z)
print(str_x == str_y)

False
True
False


This code demonstrates comparisons involving numbers:

In [3]:
dbl_x = 287.50
dbl_y = 250

print(dbl_x > dbl_y)
print(dbl_x < dbl_y)
print(dbl_x == dbl_y)
print(dbl_x != dbl_y)

True
False
False
True


You can also assign the boolean that is generated by comparison.

In [4]:
bln_1 = (dbl_x == dbl_y)
bln_2 = (str_x == str_y)

print(bln_1)
print(bln_2)

False
False


## Boolean Operators

A quick aside on *boolean operators*...

A boolean operator takes in one or more booleans, and returns another boolean.

There are three operators to be aware of:

- `not` (unary) - takes a single boolean and returns its negation

- `or` (binary) - takes two booleans and returns `True` if at least one is `True`

- `and` (binary) - takes two booleans and returns `True` only if both are `True`

Let's demonstrate the `not` operator:

In [5]:
print(not True)
print(not False)

False
True


Let's demonstrate the `or` operator:

In [6]:
print(True or True)
print(True or False)
print(False or True)
print(False or False)

True
True
True
False


And lastly, let's demonstrate the `and` operator:

In [7]:
print(True and True)
print(True and False)
print(False and True)
print(False and False)

True
False
False
False


It is possible to combine these operators in various ways, using parentheses, to come up with complex expressions:

In [8]:
(not True) and (False and not(False or True))

False

This can seem a little confusing at first, but it always reduces down to the simple unary and binary boolean operators we discussed above.
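
One way to see this reduction is to evaluate the expression from the previous cell one piece at a time. The sketch below does exactly that; the intermediate variable names (`bln_inner`, `bln_negated`, and so on) are made up for illustration.

```python
# Reduce (not True) and (False and not(False or True)) from the inside out.
bln_inner = False or True            # innermost parentheses: True
bln_negated = not bln_inner          # not True: False
bln_right = False and bln_negated    # False and False: False
bln_left = not True                  # False
bln_result = bln_left and bln_right  # False and False: False

print(bln_result)
```

Each line uses only one of the three operators, which is all that the original one-line expression is doing under the hood.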

**Code Challenge:** Create a big, ugly boolean expression and make sure that the code actually runs.

## Reading in Sample Data

For the remainder of the discussion, we will need some data to work with. Let's read in our sample SPY data:

In [9]:
df_spy = pd.read_csv('../data/spy_dec_2018.csv')
df_spy

Unnamed: 0,date,open,high,low,close,volume,adjusted
0,2018-11-30,273.809998,276.279999,273.450012,275.649994,98204200,271.527222
1,2018-12-03,280.279999,280.399994,277.51001,279.299988,103176300,275.122589
2,2018-12-04,278.369995,278.850006,269.899994,270.25,177986000,266.207977
3,2018-12-06,265.920013,269.970001,262.440002,269.839996,204185400,265.804108
4,2018-12-07,269.459991,271.220001,262.630005,263.570007,161018900,259.627899
5,2018-12-10,263.369995,265.160004,258.619995,264.070007,151445900,260.120422
6,2018-12-11,267.660004,267.869995,262.480011,264.130005,121504400,260.179504
7,2018-12-12,267.470001,269.0,265.369995,265.459991,97976700,261.489624
8,2018-12-13,266.519989,267.48999,264.119995,265.369995,96662700,261.40094
9,2018-12-14,262.959991,264.029999,259.850006,260.470001,116961100,256.574249


It's always helpful to explore your data a little once you have imported it. One thing that I like to check for is the number of unique values in categorical columns, like `date`.

That is what is done in the following code:

In [10]:
df_spy['date'].unique().size

20

The above calculation is not important to the rest of the tutorial, but I will be peppering little tips and tricks throughout these tutorials to facilitate learning via repetition.
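
In the same spirit, here is a second way to get that count. pandas also provides `Series.nunique()`, which returns the number of distinct values directly. The tiny `ser_dates` below is made-up data, used only so that the snippet runs on its own, and it assumes there are no missing values (the two approaches treat `NaN` differently by default).

```python
import pandas as pd

# Made-up stand-in for a date column; one date is repeated.
ser_dates = pd.Series(["2018-12-03", "2018-12-03", "2018-12-04"])

n_via_unique = ser_dates.unique().size  # array of distinct values, then its size
n_via_nunique = ser_dates.nunique()     # the count in a single call

print(n_via_unique, n_via_nunique)
```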

**Code Challenge:** Use simple slicing to select the last three rows of `df_spy` without typing any numbers.

## Comparison and DataFrame Columns

As discussed in a previous tutorial, a column of a `DataFrame` is a `Series` object, which is a glorified `numpy.array` (think vector or matrix).

Let's separate out the `adjusted` column of `df_spy` and assign it to a variable:

In [11]:
pd.options.display.max_rows = 6 # this modifies the printing of dataframes
ser_adjusted = df_spy['adjusted']
ser_adjusted

0     271.527222
1     275.122589
2     266.207977
         ...    
17    245.786697
18    245.469635
19    247.619644
Name: adjusted, Length: 20, dtype: float64

Recall that a `pandas.Series` is smart with respect to component-wise operations, meaning it behaves like vectors from linear algebra would. This means that the following operations are *broadcast* as you might expect.

Subtraction of 100, broadcast:

In [12]:
ser_adjusted - 100

0     171.527222
1     175.122589
2     166.207977
         ...    
17    145.786697
18    145.469635
19    147.619644
Name: adjusted, Length: 20, dtype: float64

Division by 100, broadcast:

In [13]:
ser_adjusted / 100

0     2.715272
1     2.751226
2     2.662080
        ...   
17    2.457867
18    2.454696
19    2.476196
Name: adjusted, Length: 20, dtype: float64

It is a convenient fact that this broadcasting behavior also occurs with comparison.  

The following code checks which elements of `ser_adjusted` are greater than 250: 

In [14]:
ser_test = (ser_adjusted > 250)
ser_test

0      True
1      True
2      True
      ...  
17    False
18    False
19    False
Name: adjusted, Length: 20, dtype: bool

Let's check that the resulting variable `ser_test` is a `pandas.Series`:

In [15]:
type(ser_test)

pandas.core.series.Series

And finally let's observe the elements of `ser_test`:

In [16]:
print(ser_test.values)

[ True  True  True  True  True  True  True  True  True  True  True  True
 False False False False False False False False]


A few observations about what just happened:

1. When we compare a `Series` of numerical values (`ser_adjusted`) to a single number (`250`), we get back a `Series` of booleans (`ser_test`)

2. For each index `i`, we have that `ser_test[i]` equals `(ser_adjusted[i] > 250)`.

3. So the comparison operation was broadcast as you would expect.

This is easy to see by appending `ser_test` to `df_spy` and then reprinting:

In [17]:
pd.options.display.max_rows = 25
df_spy['test'] = ser_test
df_spy

Unnamed: 0,date,open,high,low,close,volume,adjusted,test
0,2018-11-30,273.809998,276.279999,273.450012,275.649994,98204200,271.527222,True
1,2018-12-03,280.279999,280.399994,277.51001,279.299988,103176300,275.122589,True
2,2018-12-04,278.369995,278.850006,269.899994,270.25,177986000,266.207977,True
3,2018-12-06,265.920013,269.970001,262.440002,269.839996,204185400,265.804108,True
4,2018-12-07,269.459991,271.220001,262.630005,263.570007,161018900,259.627899,True
5,2018-12-10,263.369995,265.160004,258.619995,264.070007,151445900,260.120422,True
6,2018-12-11,267.660004,267.869995,262.480011,264.130005,121504400,260.179504,True
7,2018-12-12,267.470001,269.0,265.369995,265.459991,97976700,261.489624,True
8,2018-12-13,266.519989,267.48999,264.119995,265.369995,96662700,261.40094,True
9,2018-12-14,262.959991,264.029999,259.850006,260.470001,116961100,256.574249,True


Notice that starting on 12/19/2018, the adjusted price of SPY was below 250.

The broadcasting of comparison can be used as a powerful way to query subsets of data from a `DataFrame`; we consider this technique next.
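
Before moving on, the element-wise behavior can also be checked directly on a small example. The following is a minimal sketch: `ser_px` holds a handful of made-up prices (not the SPY data), and the list comprehension performs the same comparison one element at a time. Because `True` counts as 1 when summed, `.sum()` on a boolean `Series` is a quick way to count how many elements passed the test.

```python
import pandas as pd

# Made-up prices, just for illustration.
ser_px = pd.Series([271.5, 266.2, 247.5, 245.8])

ser_flag = ser_px > 250                   # broadcast comparison
lst_manual = [px > 250 for px in ser_px]  # the same check, element by element

print(ser_flag.tolist())
print(ser_flag.tolist() == lst_manual)
print(int(ser_flag.sum()))                # number of True entries
```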

## Masking with DataFrames

The next few steps might seem a little strange, but bear with me, I assure you this is leading somewhere.

From the code below we know that `df_spy` has 20 rows:

In [18]:
df_spy.shape

(20, 8)

The following code creates a list consisting of 20 booleans, all of them `False`:

In [19]:
lst_bool = [False] * 20
lst_bool

[False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False]

Now, let's see what happens when we feed this `list` of `False` booleans into `df_spy` using square brackets.

In [20]:
df_spy[lst_bool]

Unnamed: 0,date,open,high,low,close,volume,adjusted,test


It may look like nothing has happened, but really what has been returned is an empty `DataFrame`.

**Code Challenge:** Verify that `df_spy[lst_bool]` is a `DataFrame`.

Next let's modify `lst_bool` slightly, by changing the first entry to `True`, and then feed it into `df_spy` again.

In [21]:
lst_bool[0] = True # this changes the first entry to True
df_spy[lst_bool]

Unnamed: 0,date,open,high,low,close,volume,adjusted,test
0,2018-11-30,273.809998,276.279999,273.450012,275.649994,98204200,271.527222,True


So what happened?  Notice that `df_spy[lst_bool]` returns a `DataFrame` consisting only of the first row of `df_spy`.

Let's modify `lst_bool` once again, by setting the second entry of `lst_bool` to `True`, and then once again repeat our little experiment.

In [22]:
lst_bool[1] = True # this sets the second component to True
df_spy[lst_bool]

Unnamed: 0,date,open,high,low,close,volume,adjusted,test
0,2018-11-30,273.809998,276.279999,273.450012,275.649994,98204200,271.527222,True
1,2018-12-03,280.279999,280.399994,277.51001,279.299988,103176300,275.122589,True


**Punchline:** What is returned by the code `df_spy[lst_bool]` will be a `DataFrame` consisting of all the rows corresponding to the `True` entries of `lst_bool`.

This is the notion of `DataFrame` *masking*.
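
To make the idea concrete outside of `df_spy`, here is a minimal, self-contained sketch. The `df_toy` frame and its `close` values are made up for illustration. It also shows that passing the same list of booleans to `.loc` selects the same rows, and that the original row labels are kept in the result.

```python
import pandas as pd

# Made-up four-row frame; the numbers carry no meaning.
df_toy = pd.DataFrame({
    "symbol": ["SPY", "IWM", "QQQ", "DIA"],
    "close": [1.0, 2.0, 3.0, 4.0],
})

lst_mask = [True, False, True, False]  # one boolean per row

df_masked = df_toy[lst_mask]           # keeps rows where the mask is True
df_loc = df_toy.loc[lst_mask]          # same selection via .loc

print(df_masked)
print(df_masked.equals(df_loc))
```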

**Code Challenge:** Modify `lst_bool` and then use `DataFrame` masking to grab the first, second, and fourth rows of `df_spy`.

## Filtering a `DataFrame` with Masking

In a data analysis, it is often the case that you will want to isolate certain rows of a `DataFrame`. The rows you are interested in can often be identified by comparisons involving columns of the `DataFrame`.

We can achieve this kind of row isolation by combining the broadcasting of comparison over `DataFrame` columns with the masking of `DataFrames`.

In order to consider concrete examples, let's read in some data.

The following code reads in a dataset consisting of EOD prices for four different ETFs (SPY, IWM, QQQ, DIA) during the month of December 2018:

In [23]:
pd.options.display.max_rows = 25 # changes displaying behavior 
df_etf = pd.read_csv("../data/index_etf_dec_2018.csv")
df_etf

Unnamed: 0,symbol,date,open,high,low,close,volume,adjusted
0,SPY,2018-11-30,273.809998,276.279999,273.450012,275.649994,98204200,271.527222
1,SPY,2018-12-03,280.279999,280.399994,277.510010,279.299988,103176300,275.122589
2,SPY,2018-12-04,278.369995,278.850006,269.899994,270.250000,177986000,266.207977
3,SPY,2018-12-06,265.920013,269.970001,262.440002,269.839996,204185400,265.804108
4,SPY,2018-12-07,269.459991,271.220001,262.630005,263.570007,161018900,259.627899
5,SPY,2018-12-10,263.369995,265.160004,258.619995,264.070007,151445900,260.120422
6,SPY,2018-12-11,267.660004,267.869995,262.480011,264.130005,121504400,260.179504
7,SPY,2018-12-12,267.470001,269.000000,265.369995,265.459991,97976700,261.489624
8,SPY,2018-12-13,266.519989,267.489990,264.119995,265.369995,96662700,261.400940
9,SPY,2018-12-14,262.959991,264.029999,259.850006,260.470001,116961100,256.574249


A method that is useful for inspecting the columns of a `DataFrame` is `Series.value_counts()`.

Let's apply this to the `symbol` column of `df_etf`:

In [24]:
df_etf['symbol'].value_counts()

IWM    20
QQQ    20
DIA    20
SPY    20
Name: symbol, dtype: int64

The above output means that for each value of `symbol` there are 20 rows in `df_etf` that have that value in the `symbol` column.
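
These counts tie back to comparison: the count for a given symbol is just the number of `True` entries in the corresponding boolean `Series`. The sketch below checks this on a made-up `ser_symbol` (not the ETF data).

```python
import pandas as pd

# Made-up symbol column, just for illustration.
ser_symbol = pd.Series(["SPY", "IWM", "SPY", "QQQ", "IWM", "SPY"])

ser_counts = ser_symbol.value_counts()         # one count per distinct value
n_spy_mask = int((ser_symbol == "SPY").sum())  # count via comparison

print(ser_counts["SPY"], n_spy_mask)
```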

### Filtering for One Symbol

We are now ready to start exploring non-trivial applications of `DataFrame` masking.

As a first example, let's isolate all the rows of `df_etf` that correspond to `IWM`:

In [25]:
pd.options.display.max_rows = 6
ser_bool = (df_etf['symbol'] == "IWM")
df_etf[ser_bool]

Unnamed: 0,symbol,date,open,high,low,close,volume,adjusted
20,IWM,2018-11-30,151.460007,152.860001,151.080002,152.619995,20304700,151.173599
21,IWM,2018-12-03,154.389999,154.479996,152.020004,154.080002,23379800,152.619751
22,IWM,2018-12-04,153.750000,154.130005,147.139999,147.520004,41077900,146.121948
...,...,...,...,...,...,...,...,...
37,IWM,2018-12-27,130.270004,132.479996,127.870003,132.479996,39527900,131.663620
38,IWM,2018-12-28,132.479996,135.009995,131.539993,132.860001,35994400,132.041306
39,IWM,2018-12-31,133.720001,134.050003,131.800003,133.899994,29173400,133.074875


Notice that we did this in two steps: 

1. Calculate the series of `booleans` called `ser_bool`; 

2. Perform the masking by using square brackets `[]` and `ser_bool`.

We can actually perform this masking in a single step:

In [26]:
df_etf[df_etf['symbol'] == "IWM"]

Unnamed: 0,symbol,date,open,high,low,close,volume,adjusted
20,IWM,2018-11-30,151.460007,152.860001,151.080002,152.619995,20304700,151.173599
21,IWM,2018-12-03,154.389999,154.479996,152.020004,154.080002,23379800,152.619751
22,IWM,2018-12-04,153.750000,154.130005,147.139999,147.520004,41077900,146.121948
...,...,...,...,...,...,...,...,...
37,IWM,2018-12-27,130.270004,132.479996,127.870003,132.479996,39527900,131.663620
38,IWM,2018-12-28,132.479996,135.009995,131.539993,132.860001,35994400,132.041306
39,IWM,2018-12-31,133.720001,134.050003,131.800003,133.899994,29173400,133.074875


**Code Challenge:** Select all the rows of `df_etf` for `QQQ`. 

### Filtering for Multiple Symbols

We can use the `.isin()` method to filter a `DataFrame` for multiple symbols. The technique is to feed the `.isin()` method a list of the symbols you want to filter for.

The following code grabs all the rows of `df_etf` for either `QQQ` or `DIA`:

In [27]:
df_etf[df_etf['symbol'].isin(['QQQ', 'DIA'])]

Unnamed: 0,symbol,date,open,high,low,close,volume,adjusted
40,QQQ,2018-11-30,168.380005,169.470001,167.539993,169.369995,36722800,168.208099
41,QQQ,2018-12-03,173.110001,173.309998,169.509995,172.330002,50771700,171.147797
42,QQQ,2018-12-04,171.429993,171.910004,165.520004,165.720001,70594700,164.583130
...,...,...,...,...,...,...,...,...
77,DIA,2018-12-27,225.539993,231.369995,222.529999,231.259995,7952000,227.933624
78,DIA,2018-12-28,232.710007,233.809998,229.669998,230.479996,7266800,227.164841
79,DIA,2018-12-31,232.330002,233.259995,231.050003,233.199997,5079600,229.845718


**Code Challenge:** Grab all rows of `df_etf` corresponding to `SPY`, `IWM`, and `QQQ`.
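
As a cross-check on what `.isin()` is doing, the sketch below compares it with chaining two equality comparisons using `|`, the element-wise counterpart of `or`. The `ser_symbol` values are made up for illustration.

```python
import pandas as pd

# Made-up symbol column, just for illustration.
ser_symbol = pd.Series(["SPY", "IWM", "QQQ", "DIA", "QQQ"])

ser_isin = ser_symbol.isin(["QQQ", "DIA"])
ser_or = (ser_symbol == "QQQ") | (ser_symbol == "DIA")  # element-wise "or"

print(ser_isin.tolist())
print(ser_isin.equals(ser_or))
```

With two symbols either form is readable; as the list of symbols grows, `.isin()` tends to stay much shorter.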

### Filtering for a Single Date

For our next example, let's isolate the price data from the December 2018 regular option expiration, which was 12/21/2018.

In [28]:
df_etf[df_etf['date'] == '2018-12-21']

Unnamed: 0,symbol,date,open,high,low,close,volume,adjusted
14,SPY,2018-12-21,246.740005,249.710007,239.979996,240.699997,255345600,238.484512
34,IWM,2018-12-21,132.380005,132.949997,127.980003,128.369995,59404800,127.578957
54,QQQ,2018-12-21,153.050003,154.089996,146.720001,147.570007,141129400,146.557663
74,DIA,2018-12-21,229.009995,232.470001,223.850006,224.089996,10242700,220.866776


**Code Challenge:** Grab all the rows of `df_etf` for New Year's Eve.

### Filtering for a Date Range

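The following code grabs all the rows of `df_etf` that come after the middle of the month. The `date` column holds strings, but because they are written in `YYYY-MM-DD` format, comparing them alphabetically gives the same ordering as comparing them chronologically: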

In [29]:
df_etf[df_etf['date'] > '2018-12-15']

Unnamed: 0,symbol,date,open,high,low,close,volume,adjusted
10,SPY,2018-12-17,259.399994,260.649994,253.529999,255.360001,165492300,251.540680
11,SPY,2018-12-18,257.200012,257.950012,253.279999,255.080002,134515100,251.264862
12,SPY,2018-12-19,255.169998,259.399994,249.350006,251.259995,214992800,247.501999
...,...,...,...,...,...,...,...,...
77,DIA,2018-12-27,225.539993,231.369995,222.529999,231.259995,7952000,227.933624
78,DIA,2018-12-28,232.710007,233.809998,229.669998,230.479996,7266800,227.164841
79,DIA,2018-12-31,232.330002,233.259995,231.050003,233.199997,5079600,229.845718


**Code Challenge:** Grab all the rows of `df_etf` that come before 12/7/2018.
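
The comparison above works on date strings. If you would rather work with real dates, one option is to convert the column with `pd.to_datetime` first. The following is a minimal sketch on a made-up `df_toy` (not the ETF data); it also uses `Series.between` to express a window with both a start and an end date, with endpoints chosen arbitrarily.

```python
import pandas as pd

# Made-up frame with ISO-formatted date strings.
df_toy = pd.DataFrame({
    "date": ["2018-12-14", "2018-12-17", "2018-12-31"],
    "close": [1.0, 2.0, 3.0],
})

# String comparison, as in the tutorial.
ser_mask_str = df_toy["date"] > "2018-12-15"

# The same filter after converting the strings to datetimes.
ser_dt = pd.to_datetime(df_toy["date"])
ser_mask_dt = ser_dt > pd.Timestamp("2018-12-15")

# A window with two endpoints (inclusive by default).
ser_mask_window = ser_dt.between(pd.Timestamp("2018-12-15"),
                                 pd.Timestamp("2018-12-20"))

print(ser_mask_str.tolist())
print(ser_mask_dt.tolist())
print(df_toy[ser_mask_window])
```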

### Filtering on Multiple Criteria

We can filter on multiple criteria by using the `&` operator, which is the vectorized (element-wise) version of `and`.

Suppose that we want all rows for `SPY` or `IWM` that come after Christmas:

In [30]:
pd.options.display.max_rows = 10
bln_ticker = df_etf['symbol'].isin(['SPY', 'IWM'])
bln_date = df_etf['date'] > '2018-12-25'
bln_combined = bln_ticker & bln_date

df_etf[bln_combined]

Unnamed: 0,symbol,date,open,high,low,close,volume,adjusted
16,SPY,2018-12-26,235.970001,246.179993,233.759995,246.179993,218485400,243.914063
17,SPY,2018-12-27,242.570007,248.289993,238.960007,248.070007,186267300,245.786697
18,SPY,2018-12-28,249.580002,251.399994,246.449997,247.75,153100200,245.469635
19,SPY,2018-12-31,249.559998,250.190002,247.470001,249.919998,144299400,247.619644
36,IWM,2018-12-26,126.279999,132.100006,125.809998,131.929993,40182700,131.11702
37,IWM,2018-12-27,130.270004,132.479996,127.870003,132.479996,39527900,131.66362
38,IWM,2018-12-28,132.479996,135.009995,131.539993,132.860001,35994400,132.041306
39,IWM,2018-12-31,133.720001,134.050003,131.800003,133.899994,29173400,133.074875


**Code Challenge:** Isolate the row for `QQQ` on New Year's Eve using the `&` operator.
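
A few practical details about `&` are worth sketching. The example below uses a made-up `df_toy` rather than the ETF data. It shows that the plain `and` keyword does not work on a `Series` (pandas raises a `ValueError`), that each comparison needs its own parentheses when the mask is written inline (because `&` binds more tightly than `==` and `>`), and that `~` plays the role of `not`.

```python
import pandas as pd

# Made-up frame, just for illustration.
df_toy = pd.DataFrame({
    "symbol": ["SPY", "SPY", "IWM", "IWM"],
    "date": ["2018-12-24", "2018-12-26", "2018-12-24", "2018-12-26"],
})

ser_is_spy = df_toy["symbol"] == "SPY"
ser_is_late = df_toy["date"] > "2018-12-25"

ser_both = ser_is_spy & ser_is_late  # element-wise "and"
ser_not_spy = ~ser_is_spy            # element-wise "not"

# Inline version: each comparison is wrapped in its own parentheses.
ser_inline = (df_toy["symbol"] == "SPY") & (df_toy["date"] > "2018-12-25")

# The plain `and` keyword asks for a single True/False, which a Series cannot give.
try:
    ser_is_spy and ser_is_late
    bln_plain_and_failed = False
except ValueError:
    bln_plain_and_failed = True

print(ser_both.tolist())
print(ser_not_spy.tolist())
print(bln_plain_and_failed)
```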

## Related Reading

*PDSH* - 2.6 - Comparisons, Masks, and Boolean Logic

*PDSH* - 2.7 - Fancy Indexing

*PDSH* - 3.2 - Data Indexing and Selection 