# Tutorial 05 - Comparison and `DataFrame` Masking

This tutorial begins with a discussion of a programming technique called *comparison*.

We then see how *comparison* can be used to select subsets of a `DataFrame` according to a variety of criteria.

Using comparison to subset a `DataFrame` is known as *masking*.

## Importing Packages

Let's import the packages that we will need throughout this tutorial.

In [1]:
import numpy as np
import pandas as pd
import datetime as dt

Notice that we are using the `datetime` package for the first time, which will help us to work with dates.

## Comparison

In data analysis, you often have occasion to compare two variables.

This means you are checking for conditions like equality (`==`), non-equality (`!=`), and less-than (`<`).

The comparison of variables creates `boolean` data: `True` or `False`.

The following code demonstrates comparisons with strings.

In [2]:
str_x = "SPY"
str_y = "IWM"
str_z = "SPY"

str_x == str_y
str_x == str_z
str_x == str_y

False

This code demonstrates comparisons involving numbers.

In [3]:
dbl_x = 287.50
dbl_y = 250

dbl_x > dbl_y 
dbl_x < dbl_y
dbl_x == dbl_y
dbl_x != dbl_y

True

You can also assign the boolean that is generated by comparison.

In [4]:
bln_1 = (dbl_x == dbl_y)
bln_2 = (str_x == str_y)
bln_1
bln_2

False
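
As a minimal sketch of the idea above, the snippet below stores the results of a few comparisons and checks their type. The variable names and prices are hypothetical, chosen only for illustration; `<=` is one more comparison operator alongside those already listed.

```python
# Hypothetical prices, chosen only for illustration.
price_a = 287.50
price_b = 250

# Each comparison evaluates to a bool, which can be stored like any other value.
is_greater = price_a > price_b
is_at_most = price_a <= price_b
is_equal = price_a == price_b

print(type(is_greater))                   # <class 'bool'>
print(is_greater, is_at_most, is_equal)   # True False False
```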

## Boolean Operators

A quick aside on *boolean operators*...

A boolean operator takes in one or more booleans, and returns another boolean.

There are three operators to be aware of:

- `not` (unary) - takes a single boolean and returns its negation

- `or` (binary) - takes two booleans and returns `True` if at least one is `True`

- `and` (binary) - takes two booleans and returns `True` only if both are `True`

Let's demonstrate the `not` operator:

In [5]:
not True
not False

True

Let's demonstrate the `or` operator:

In [6]:
True or True
True or False
False or True
False or False

False

And lastly, let's demonstrate the `and` operator:

In [7]:
True and True
True and False
False and True
False and False

False

It is possible to combine these operators together in various ways, using parentheses, to come up with complex expressions:

In [8]:
(not True) and (False and not(False or True))

False

This can seem a little confusing at first, but it always reduces down to the simple unary and binary boolean operators we discussed above.
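
One way to see this reduction is to evaluate the expression from the inside out, one operator at a time. The intermediate variable names below are hypothetical and exist only to label each step.

```python
# Evaluate the expression from the inside out, one operator at a time.
inner_or = False or True            # True
negated = not inner_or              # False
right_side = False and negated      # False
left_side = not True                # False
result = left_side and right_side   # False

# The step-by-step result agrees with the one-line expression.
one_line = (not True) and (False and not (False or True))
print(result == one_line)           # True
```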

## Reading in Sample Data

For the remainder of the discussion, we will need some data to work with.
Let's read in our sample SPY data.

In [9]:
df_spy = pd.read_csv('../data/spy_dec_2018.csv')
df_spy

Unnamed: 0,date,open,high,low,close,volume,adjusted
0,2018-12-03,280.279999,280.399994,277.51001,279.299988,103176300,277.678436
1,2018-12-04,278.369995,278.850006,269.899994,270.25,177986000,268.681
2,2018-12-06,265.920013,269.970001,262.440002,269.839996,204185400,268.273376
3,2018-12-07,269.459991,271.220001,262.630005,263.570007,161018900,262.039795
4,2018-12-10,263.369995,265.160004,258.619995,264.070007,151445900,262.536896
5,2018-12-11,267.660004,267.869995,262.480011,264.130005,121504400,262.596527
6,2018-12-12,267.470001,269.0,265.369995,265.459991,97976700,263.918793
7,2018-12-13,266.519989,267.48999,264.119995,265.369995,96662700,263.829315
8,2018-12-14,262.959991,264.029999,259.850006,260.470001,116961100,258.957794
9,2018-12-17,259.399994,260.649994,253.529999,255.360001,165492300,253.877457


It's always helpful to explore your data a little once you have imported it.  One thing that I like to check for is the number of unique values in categorical columns, like `date`.

That is what is done in the following code:

In [10]:
df_spy['date'].unique().size

18

The above calculation is not important to the rest of the tutorial, but I will be peppering little tips and tricks throughout these tutorials to facilitate learning via repetition.
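
As a small aside on the same tip, `pandas` also provides `Series.nunique()`, which counts distinct values directly. The sketch below uses a hypothetical stand-in for the `date` column rather than the real file; note that `nunique()` skips missing values by default, so the two approaches can differ when a column contains `NaN`.

```python
import pandas as pd

# Hypothetical stand-in for the `date` column; one date is repeated on purpose.
ser_date = pd.Series(["2018-12-03", "2018-12-04", "2018-12-04", "2018-12-06"])

n_unique_a = ser_date.unique().size   # approach used in the tutorial
n_unique_b = ser_date.nunique()       # built-in shortcut

print(n_unique_a, n_unique_b)         # 3 3
```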

## Comparison and DataFrame Columns

As discussed in a previous tutorial, a column of a `DataFrame` is a `Series` object, which is basically just a glorified `numpy.array` (think vector or matrix).

Let's separate out the `adjusted` column of `df_spy` and assign it to a variable:

In [11]:
pd.options.display.max_rows = 6 # this modifies the printing of dataframes
ser_adjusted = df_spy['adjusted']
ser_adjusted

0     277.678436
1     268.681000
2     268.273376
         ...    
15    246.179993
16    248.070007
17    247.750000
Name: adjusted, Length: 18, dtype: float64

Recall that a `pandas.Series` is smart with respect to component-wise operations, meaning it behaves like vectors from linear algebra would.

This means that the following operations are *broadcast* as you might expect.

In [12]:
ser_adjusted - 100
ser_adjusted / 100

0     2.776784
1     2.686810
2     2.682734
        ...   
15    2.461800
16    2.480700
17    2.477500
Name: adjusted, Length: 18, dtype: float64
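
To make the broadcasting concrete on a scale that is easy to check by hand, here is a minimal sketch using a hypothetical three-element `Series` (round numbers, not the SPY data). Each arithmetic operation is applied to every element, and the length is preserved.

```python
import pandas as pd

# Hypothetical prices, chosen as round numbers so the results are exact.
ser_prices = pd.Series([300.0, 250.0, 200.0])

ser_shifted = ser_prices - 100   # subtracts 100 from every element
ser_scaled = ser_prices / 100    # divides every element by 100

print(ser_shifted.tolist())      # [200.0, 150.0, 100.0]
print(ser_scaled.tolist())       # [3.0, 2.5, 2.0]
print(len(ser_scaled) == len(ser_prices))   # True
```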

As it turns out, this broadcasting behavior also occurs when performing comparisons.

The following code demonstrates this:

In [13]:
ser_test = (ser_adjusted > 250)

ser_test
type(ser_test)
ser_test.values

array([ True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True, False, False, False, False, False, False, False])

Let's make a few observations about what just happened:

1. When we compare a `Series` of numerical values (`ser_adjusted`) to a single number (250), we get back a `Series` of booleans (`ser_test`)

2. We have that `ser_test[i]` = (`ser_adjusted[i] > 250`).

3. So the comparison operation was broadcast as you would expect.
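
The observations above can be checked on a small scale. The sketch below uses a hypothetical four-element `Series` as a stand-in for `ser_adjusted`, and compares the broadcast result against booleans built one element at a time. Because `True` counts as 1 when summed, `.sum()` gives the number of elements that pass the test.

```python
import pandas as pd

# Hypothetical adjusted prices (a small stand-in for ser_adjusted).
ser_prices = pd.Series([277.68, 253.88, 249.10, 246.18])

ser_flags = ser_prices > 250   # broadcast comparison

# Build the same booleans one element at a time, for comparison.
lst_manual = [price > 250 for price in ser_prices]

print(ser_flags.tolist())                 # [True, True, False, False]
print(ser_flags.tolist() == lst_manual)   # True
print(ser_flags.sum())                    # 2
```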

This is probably easiest to see by appending `ser_test` to `df_spy` and then reprinting the dataframe.  That's what the following code does:

In [14]:
pd.options.display.max_rows = 25
df_spy['test'] = ser_test
df_spy

Unnamed: 0,date,open,high,low,close,volume,adjusted,test
0,2018-12-03,280.279999,280.399994,277.51001,279.299988,103176300,277.678436,True
1,2018-12-04,278.369995,278.850006,269.899994,270.25,177986000,268.681,True
2,2018-12-06,265.920013,269.970001,262.440002,269.839996,204185400,268.273376,True
3,2018-12-07,269.459991,271.220001,262.630005,263.570007,161018900,262.039795,True
4,2018-12-10,263.369995,265.160004,258.619995,264.070007,151445900,262.536896,True
5,2018-12-11,267.660004,267.869995,262.480011,264.130005,121504400,262.596527,True
6,2018-12-12,267.470001,269.0,265.369995,265.459991,97976700,263.918793,True
7,2018-12-13,266.519989,267.48999,264.119995,265.369995,96662700,263.829315,True
8,2018-12-14,262.959991,264.029999,259.850006,260.470001,116961100,258.957794,True
9,2018-12-17,259.399994,260.649994,253.529999,255.360001,165492300,253.877457,True


Notice that starting on 12/19/2018, the adjusted price of SPY was below 250.

The broadcasting of comparison can be used as a powerful way to query subsets of data from a `DataFrame`; we consider this technique next.

## Masking with DataFrames

The next few steps might seem a little strange, but bear with me, I assure you this is leading somewhere.

From the code below we know that `df_spy` has 18 rows:

In [15]:
df_spy.shape

(18, 8)

The following code creates a list consisting of 18 booleans, all of them `False`:

In [16]:
lst_bool = [False] * 18
lst_bool

[False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False]

Now, let's see what happens when we feed this list of `False` booleans into `df_spy` using square brackets.

In [17]:
df_spy[lst_bool]

Unnamed: 0,date,open,high,low,close,volume,adjusted,test


It may look like nothing has happened, but really what has been returned is an empty `DataFrame`. (Exercise: try to demonstrate this to yourself with the `type()` function.)

Next let's modify `lst_bool` slightly, by changing the first entry to `True`, and then feed it into `df_spy` again.

In [18]:
lst_bool[0] = True # this changes the first entry to True
df_spy[lst_bool]
df_spy.head()

Unnamed: 0,date,open,high,low,close,volume,adjusted,test
0,2018-12-03,280.279999,280.399994,277.51001,279.299988,103176300,277.678436,True
1,2018-12-04,278.369995,278.850006,269.899994,270.25,177986000,268.681,True
2,2018-12-06,265.920013,269.970001,262.440002,269.839996,204185400,268.273376,True
3,2018-12-07,269.459991,271.220001,262.630005,263.570007,161018900,262.039795,True
4,2018-12-10,263.369995,265.160004,258.619995,264.070007,151445900,262.536896,True


So what happened?  Notice that `df_spy[lst_bool]` returns a `DataFrame` consisting only of the first row of `df_spy`.

Let's modify `lst_bool` once again, by setting the second entry of `lst_bool` to `True`, and then once again repeat our little experiment.

In [19]:
lst_bool[1] = True # this sets the second component to True
df_spy[lst_bool]
df_spy.head()

Unnamed: 0,date,open,high,low,close,volume,adjusted,test
0,2018-12-03,280.279999,280.399994,277.51001,279.299988,103176300,277.678436,True
1,2018-12-04,278.369995,278.850006,269.899994,270.25,177986000,268.681,True
2,2018-12-06,265.920013,269.970001,262.440002,269.839996,204185400,268.273376,True
3,2018-12-07,269.459991,271.220001,262.630005,263.570007,161018900,262.039795,True
4,2018-12-10,263.369995,265.160004,258.619995,264.070007,151445900,262.536896,True


**Punchline:** What is returned by the code `df_spy[lst_bool]` will be a `DataFrame` consisting of all the rows corresponding to the `True` entries of `lst_bool`.

This is the notion of `DataFrame` *masking*.
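
Here is a minimal, self-contained sketch of the same idea, assuming a hypothetical three-row `DataFrame` (`df_small`) in place of `df_spy`. The mask holds one boolean per row, and only the rows marked `True` survive.

```python
import pandas as pd

# Hypothetical three-row DataFrame, used only for illustration.
df_small = pd.DataFrame({
    "date": ["2018-12-03", "2018-12-04", "2018-12-06"],
    "adjusted": [277.68, 268.68, 268.27],
})

lst_mask = [True, False, True]   # one boolean per row
df_kept = df_small[lst_mask]

print(df_kept.index.tolist())    # [0, 2]
print(df_kept.shape)             # (2, 2)
```

Notice that the surviving rows keep their original index labels (0 and 2) rather than being renumbered.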

## Querying a DataFrame

In a data analysis, it is often the case that you will want to isolate certain rows of a `DataFrame`, call it `df`. The rows you are interested in can often be identified by the condition that a comparison involving one or more columns of `df` is `True`.

We can achieve this kind of row isolation by combining the broadcasting of comparison over `DataFrame` columns with the masking of `DataFrames`.

Let's consider a concrete example.  The following code reads in a slightly more complicated dataset.  It consists of EOD prices, for four different ETFs (SPY, IWM, QQQ, DIA), during the month of December 2018:

In [20]:
pd.options.display.max_rows = 25 # changes displaying behavior 
df_etf = pd.read_csv("../data/index_etf_dec_2018.csv")
df_etf

Unnamed: 0,symbol,date,open,high,low,close,volume,adjusted
0,SPY,2018-12-03,280.279999,280.399994,277.510010,279.299988,103176300,277.678436
1,SPY,2018-12-04,278.369995,278.850006,269.899994,270.250000,177986000,268.681000
2,SPY,2018-12-06,265.920013,269.970001,262.440002,269.839996,204185400,268.273376
3,SPY,2018-12-07,269.459991,271.220001,262.630005,263.570007,161018900,262.039795
4,SPY,2018-12-10,263.369995,265.160004,258.619995,264.070007,151445900,262.536896
5,SPY,2018-12-11,267.660004,267.869995,262.480011,264.130005,121504400,262.596527
6,SPY,2018-12-12,267.470001,269.000000,265.369995,265.459991,97976700,263.918793
7,SPY,2018-12-13,266.519989,267.489990,264.119995,265.369995,96662700,263.829315
8,SPY,2018-12-14,262.959991,264.029999,259.850006,260.470001,116961100,258.957794
9,SPY,2018-12-17,259.399994,260.649994,253.529999,255.360001,165492300,253.877457


As a first example of masking, let's isolate all the rows corresponding to `IWM`:

In [21]:
pd.options.display.max_rows = 6
lst_bool = (df_etf['symbol'] == "IWM")
df_etf[lst_bool]

Unnamed: 0,symbol,date,open,high,low,close,volume,adjusted
18,IWM,2018-12-03,154.389999,154.479996,152.020004,154.080002,23379800,153.566071
19,IWM,2018-12-04,153.750000,154.130005,147.139999,147.520004,41077900,147.027954
20,IWM,2018-12-06,145.449997,147.199997,143.429993,147.199997,37581200,146.709015
...,...,...,...,...,...,...,...,...
33,IWM,2018-12-26,126.279999,132.100006,125.809998,131.929993,40182700,131.929993
34,IWM,2018-12-27,130.270004,132.479996,127.870003,132.479996,39527900,132.479996
35,IWM,2018-12-28,132.479996,135.009995,131.539993,132.860001,35994400,132.860001


Notice that we did this in two steps: (1) calculate the booleans (despite its name, `lst_bool` is now a `Series` of booleans rather than a list); (2) perform the masking by using square brackets `[]` and `lst_bool`.

We can actually perform this masking in a single step:

In [22]:
df_etf[df_etf['symbol'] == "IWM"]

Unnamed: 0,symbol,date,open,high,low,close,volume,adjusted
18,IWM,2018-12-03,154.389999,154.479996,152.020004,154.080002,23379800,153.566071
19,IWM,2018-12-04,153.750000,154.130005,147.139999,147.520004,41077900,147.027954
20,IWM,2018-12-06,145.449997,147.199997,143.429993,147.199997,37581200,146.709015
...,...,...,...,...,...,...,...,...
33,IWM,2018-12-26,126.279999,132.100006,125.809998,131.929993,40182700,131.929993
34,IWM,2018-12-27,130.270004,132.479996,127.870003,132.479996,39527900,132.479996
35,IWM,2018-12-28,132.479996,135.009995,131.539993,132.860001,35994400,132.860001


For our next example, let's isolate the price data from the December 2018 regular option expiration, which was 12/21/2018.  The `date` column is currently an `object` data type (strings), so we can compare it directly against the string `'2018-12-21'`; to compare against true dates, we would first need to convert the column to a `datetime`.

In [23]:
df_etf[df_etf['date'] == '2018-12-21']

Unnamed: 0,symbol,date,open,high,low,close,volume,adjusted
13,SPY,2018-12-21,246.740005,249.710007,239.979996,240.699997,255345600,240.699997
31,IWM,2018-12-21,132.380005,132.949997,127.980003,128.369995,59404800,128.369995
49,QQQ,2018-12-21,153.050003,154.089996,146.720001,147.570007,141129400,147.149017
67,DIA,2018-12-21,229.009995,232.470001,223.850006,224.089996,10242700,223.930069
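
For comparisons against true dates, one possible approach is to convert the column with `pd.to_datetime` and then compare against a `datetime` object. The sketch below is an illustration under assumptions, not part of the original notebook: it uses a hypothetical three-row `DataFrame` (`df_demo`) laid out like `df_etf`.

```python
import datetime as dt
import pandas as pd

# Hypothetical rows in the same layout as df_etf.
df_demo = pd.DataFrame({
    "symbol": ["SPY", "SPY", "IWM"],
    "date": ["2018-12-20", "2018-12-21", "2018-12-21"],
})

# Convert the string column to datetimes so it compares against real dates.
df_demo["date"] = pd.to_datetime(df_demo["date"])

expiration = dt.datetime(2018, 12, 21)
df_expiry = df_demo[df_demo["date"] == expiration]
df_before = df_demo[df_demo["date"] < expiration]

print(df_expiry["symbol"].tolist())   # ['SPY', 'IWM']
print(df_before["symbol"].tolist())   # ['SPY']
```

Once the column holds datetimes, ordering comparisons such as `<` are made on the dates themselves rather than on their text.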



The possibilities for masking are endless, and it will be a staple of our data analysis toolbox.
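
As one illustration of how masks can be combined, the boolean operators discussed earlier have element-wise counterparts for `Series`: `&` (and), `|` (or), and `~` (not). The sketch below assumes a hypothetical four-row `DataFrame` (`df_demo`) with illustrative values; each comparison is wrapped in parentheses because `&` and `|` bind more tightly than `==` and `<`.

```python
import pandas as pd

# Hypothetical rows in the same layout as df_etf (values are illustrative).
df_demo = pd.DataFrame({
    "symbol": ["SPY", "SPY", "IWM", "IWM"],
    "date": ["2018-12-20", "2018-12-21", "2018-12-20", "2018-12-21"],
    "adjusted": [245.59, 240.70, 131.06, 128.37],
})

# Element-wise "and": both conditions must hold for a row to be kept.
mask_both = (df_demo["symbol"] == "IWM") & (df_demo["date"] == "2018-12-21")
# Element-wise "or": at least one condition must hold.
mask_either = (df_demo["symbol"] == "SPY") | (df_demo["adjusted"] < 130)

print(df_demo[mask_both].index.tolist())     # [3]
print(df_demo[mask_either].index.tolist())   # [0, 1, 3]
print(df_demo[~mask_both].index.tolist())    # [0, 1, 2]
```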

## Related Reading

*PDSH* - 2.6 - Comparisons, Masks, and Boolean Logic

*PDSH* - 2.7 - Fancy Indexing

*PDSH* - 3.2 - Data Indexing and Selection 