# `DataFrame` Querying

In this tutorial we discuss two ways of querying a `DataFrame`:

1. masking
2. the `DataFrame.query()` method

### Importing Packages

Let's first import the packages that we will need.

In [1]:
import numpy as np
import pandas as pd
import pandas_datareader as pdr

### Reading-In Data

Next, let's use `pandas_datareader` to read in some `SPY` data from July 2021.

In [2]:
df_spy = pdr.get_data_yahoo('SPY', start='2021-06-30', end='2021-07-31')
df_spy = df_spy.round(2)
df_spy.head()

Date,High,Low,Open,Close,Volume,Adj Close
2021-06-30,428.78,427.18,427.21,428.06,64827900,428.06
2021-07-01,430.6,428.8,428.87,430.43,53441000,430.43
2021-07-02,434.1,430.52,431.67,433.72,57697700,433.72
2021-07-06,434.01,430.01,433.78,432.93,68710400,432.93
2021-07-07,434.76,431.51,433.66,434.46,63549500,434.46


The following code resets the index so that `date` becomes a regular column, and then converts all the column names to snake case.

In [3]:
df_spy.reset_index(inplace=True)
df_spy.columns = df_spy.columns.str.lower().str.replace(' ', '_')
df_spy.head()

,date,high,low,open,close,volume,adj_close
0,2021-06-30,428.78,427.18,427.21,428.06,64827900,428.06
1,2021-07-01,430.6,428.8,428.87,430.43,53441000,430.43
2,2021-07-02,434.1,430.52,431.67,433.72,57697700,433.72
3,2021-07-06,434.01,430.01,433.78,432.93,68710400,432.93
4,2021-07-07,434.76,431.51,433.66,434.46,63549500,434.46
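
The two cleanup steps above can be tried on a tiny frame without downloading anything. This is only a sketch: `df_demo` is a hypothetical three-row frame built by hand from the prices shown above, and it uses the non-`inplace` form of `reset_index()`, which returns a new `DataFrame`.

```python
import pandas as pd

# Hypothetical three-row frame indexed by date; prices copied from the table above
df_demo = pd.DataFrame(
    {"Close": [428.06, 430.43, 433.72], "Adj Close": [428.06, 430.43, 433.72]},
    index=pd.DatetimeIndex(["2021-06-30", "2021-07-01", "2021-07-02"], name="Date"),
)

# reset_index() returns a new frame with the index moved into a regular column
df_demo = df_demo.reset_index()

# Lowercase every column name, then swap spaces for underscores
df_demo.columns = df_demo.columns.str.lower().str.replace(" ", "_")

print(list(df_demo.columns))  # ['date', 'close', 'adj_close']
```

Assigning the result of `reset_index()` back to the variable has the same effect as `inplace=True`, and it plays more nicely with method chaining.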


### Comparison and `DataFrame` Columns

As discussed in a previous tutorial, a column of a `DataFrame` is a `Series` object, which is a souped-up one-dimensional `numpy` array (think vector).

Let's separate out the `adj_close` column of `df_spy` and assign it to a variable:

In [4]:
pd.options.display.max_rows = 6 # this modifies the printing of dataframes
ser_adjusted = df_spy['adj_close']
ser_adjusted

0     428.06
1     430.43
2     433.72
       ...  
19    438.83
20    440.65
21    438.51
Name: adj_close, Length: 22, dtype: float64

Recall that a `pandas.Series` supports component-wise arithmetic, meaning it behaves like a vector from linear algebra.  Arithmetic operations with a single number are *broadcast* to every element, as you might expect.

For example, division by 100 is broadcasted component-wise:

In [5]:
ser_adjusted / 100

0     4.2806
1     4.3043
2     4.3372
       ...  
19    4.3883
20    4.4065
21    4.3851
Name: adj_close, Length: 22, dtype: float64

It is a convenient fact that this broadcasting behavior also occurs with comparison, and produces a `Series` of booleans. 

The following code checks which elements of `ser_adjusted` are greater than 435: 

In [6]:
ser_test = (ser_adjusted > 435)
ser_test

0     False
1     False
2     False
      ...  
19     True
20     True
21     True
Name: adj_close, Length: 22, dtype: bool

Let's check that the resulting variable `ser_test` is a `pandas.Series`:

In [7]:
type(ser_test)

pandas.core.series.Series

And finally, let's look at the underlying `.values` array of `ser_test`:

In [8]:
print(ser_test.values)

[False False False False False False  True  True  True  True False False
 False False False  True  True  True  True  True  True  True]


A few observations about what just happened:

1. When we compare a `Series` of numerical values (`ser_adjusted`) to a single number (`435`), we get back a `Series` of booleans (`ser_test`).

2. For each position `i`, `ser_test[i]` equals `ser_adjusted[i] > 435`.

3. So the comparison operation was broadcast, as advertised.

This is easy to see by appending `ser_test` to `df_spy` and then reprinting:

In [9]:
pd.options.display.max_rows = 25
df_spy['test'] = ser_test
df_spy

,date,high,low,open,close,volume,adj_close,test
0,2021-06-30,428.78,427.18,427.21,428.06,64827900,428.06,False
1,2021-07-01,430.6,428.8,428.87,430.43,53441000,430.43,False
2,2021-07-02,434.1,430.52,431.67,433.72,57697700,433.72,False
3,2021-07-06,434.01,430.01,433.78,432.93,68710400,432.93,False
4,2021-07-07,434.76,431.51,433.66,434.46,63549500,434.46,False
5,2021-07-08,431.73,427.52,428.78,430.92,97595200,430.92,False
6,2021-07-09,435.84,430.71,432.53,435.52,76238600,435.52,True
7,2021-07-12,437.35,434.97,435.43,437.08,52889600,437.08,True
8,2021-07-13,437.84,435.31,436.24,435.59,52911300,435.59,True
9,2021-07-14,437.92,434.91,437.4,436.24,64130400,436.24,True


As we will see in the next two sections, the broadcasting of comparison can be used to query subsets of rows of a `DataFrame`.
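
To recap the broadcasting of comparison on something small, here is a minimal sketch. The names `ser_demo`, `ser_flag`, and `lst_flag` are made up for illustration, and the values are the first seven adjusted closes from the table above.

```python
import pandas as pd

# First seven adjusted closes from the SPY table above
ser_demo = pd.Series(
    [428.06, 430.43, 433.72, 432.93, 434.46, 430.92, 435.52], name="adj_close"
)

# The scalar 435 is compared against every element of the Series
ser_flag = ser_demo > 435

# The same result built by hand with a plain Python loop
lst_flag = [value > 435 for value in ser_demo]

print(ser_flag.tolist())  # [False, False, False, False, False, False, True]
print(ser_flag.sum())     # 1, because True counts as 1 when summed
```

Summing a boolean `Series` is a handy way to count how many rows satisfy a condition before actually filtering on it.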

### `DataFrame` Masking

From the code below we know that `df_spy` has 22 rows:

In [10]:
df_spy.shape

(22, 8)

The following code creates a list consisting of 22 booleans, all of them `False`:

In [11]:
lst_bool = [False] * 22
lst_bool

[False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False]

Now, let's see what happens when we feed this `list` of `False` booleans into `df_spy` using square brackets.

In [12]:
df_spy[lst_bool]

,date,high,low,open,close,volume,adj_close,test


**Code Challenge:** Verify that `df_spy[lst_bool]` is an empty `DataFrame`.

In [13]:
type(df_spy[lst_bool])

pandas.core.frame.DataFrame

In [14]:
df_spy[lst_bool].shape

(0, 8)

Next let's modify `lst_bool` slightly, by changing the 0th entry to `True`, and then feed it into `df_spy` again.

In [15]:
lst_bool[0] = True
df_spy[lst_bool]

,date,high,low,open,close,volume,adj_close,test
0,2021-06-30,428.78,427.18,427.21,428.06,64827900,428.06,False


So what happened?  Notice that `df_spy[lst_bool]` returns a `DataFrame` consisting only of the 0th row of `df_spy`.

Let's modify `lst_bool` once again, by setting its 1st entry to `True`, and then once again feed it into `df_spy`.

In [16]:
lst_bool[1] = True
df_spy[lst_bool]

,date,high,low,open,close,volume,adj_close,test
0,2021-06-30,428.78,427.18,427.21,428.06,64827900,428.06,False
1,2021-07-01,430.6,428.8,428.87,430.43,53441000,430.43,False


**Punchline:** The code `df_spy[lst_bool]` returns a `DataFrame` consisting of exactly those rows of `df_spy` whose corresponding entry of `lst_bool` is `True`.

This is called `DataFrame` *masking*.
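
Here is a small, self-contained sketch of masking with a list of booleans. The frame `df_demo` and the names `lst_mask` and `df_kept` are hypothetical; the prices are taken from the first four rows of `df_spy`.

```python
import pandas as pd

df_demo = pd.DataFrame({
    "date": ["2021-06-30", "2021-07-01", "2021-07-02", "2021-07-06"],
    "close": [428.06, 430.43, 433.72, 432.93],
})

lst_mask = [True, False, False, True]  # one boolean per row
df_kept = df_demo[lst_mask]

print(df_kept.index.tolist())  # [0, 3], the original row labels are preserved

# A mask of the wrong length is rejected rather than silently padded
try:
    df_demo[[True, False]]
except ValueError as err:
    print("ValueError:", err)
```

Note that the mask needs exactly one boolean per row, which is why `lst_bool` above was built with 22 entries.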

**Code Challenge:** Modify `lst_bool` and then use `DataFrame` masking to grab the 0th, 1st, and 3rd rows of `df_spy`.

In [17]:
lst_bool[3] = True
df_spy[lst_bool]

,date,high,low,open,close,volume,adj_close,test
0,2021-06-30,428.78,427.18,427.21,428.06,64827900,428.06,False
1,2021-07-01,430.6,428.8,428.87,430.43,53441000,430.43,False
3,2021-07-06,434.01,430.01,433.78,432.93,68710400,432.93,False


### Querying with `DataFrame` Masking

We often want to query a `DataFrame` based on some kind of comparison involving its column values.

We can achieve this kind of querying by combining the broadcasting of comparison over `DataFrame` columns with `DataFrame` masking.

In order to consider concrete examples, let's read in some data.

The following code reads in a dataset consisting of end-of-day (EOD) prices for four different ETFs (SPY, IWM, QQQ, DIA) during July 2021:

In [18]:
pd.options.display.max_rows = 25
df_etf = pdr.get_data_yahoo(['SPY', 'QQQ', 'IWM', 'DIA'], start='2021-06-30', end='2021-07-31')
df_etf = df_etf.round(2)
df_etf.head()

Attributes,Adj Close,Adj Close,Adj Close,Adj Close,Close,Close,Close,Close,High,High,...,Low,Low,Open,Open,Open,Open,Volume,Volume,Volume,Volume
Symbols,SPY,QQQ,IWM,DIA,SPY,QQQ,IWM,DIA,SPY,QQQ,...,IWM,DIA,SPY,QQQ,IWM,DIA,SPY,QQQ,IWM,DIA
Date,,,,,,,,,,,,,,,,,,,,,
2021-06-30,428.06,354.43,229.37,344.75,428.06,354.43,229.37,344.95,428.78,355.23,...,227.76,342.35,427.21,354.83,228.65,342.38,64827900.0,32724000.0,26039000.0,3778900.0
2021-07-01,430.43,354.57,231.39,346.16,430.43,354.57,231.39,346.36,430.6,355.09,...,229.71,344.92,428.87,354.07,230.81,345.78,53441000.0,29290000.0,18089100.0,3606900.0
2021-07-02,433.72,358.64,229.19,347.73,433.72,358.64,229.19,347.94,434.1,358.97,...,228.56,346.18,431.67,356.52,232.0,347.04,57697700.0,32727200.0,21029700.0,3013500.0
2021-07-06,432.93,360.19,225.86,345.62,432.93,360.19,225.86,345.82,434.01,360.48,...,223.87,343.6,433.78,359.26,229.36,347.75,68710400.0,38842400.0,27771300.0,3910600.0
2021-07-07,434.46,360.95,223.76,346.71,434.46,360.95,223.76,346.92,434.76,362.76,...,221.8,344.43,433.66,362.45,225.54,345.65,63549500.0,35265200.0,28521500.0,3347000.0


This data is not as tidy as we would like.  Let's use method chaining to perform a series of data munging operations.

In [19]:
df_etf = (
    df_etf
    .stack(level='Symbols')  # pivot the symbols from columns into rows
    .reset_index()  # turn date and symbol into regular columns
    .sort_values(by=['Symbols', 'Date'])  # sort by symbol, then by date
    .rename(columns={'Date': 'date', 'Symbols': 'symbol', 'Adj Close': 'adj_close', 'Close': 'close',
                     'High': 'high', 'Low': 'low', 'Open': 'open', 'Volume': 'volume'})  # rename columns
    [['date', 'symbol', 'open', 'high', 'low', 'close', 'volume', 'adj_close']]  # reorder columns
)
df_etf

Attributes,date,symbol,open,high,low,close,volume,adj_close
3,2021-06-30,DIA,342.38,345.51,342.35,344.95,3778900.0,344.75
7,2021-07-01,DIA,345.78,346.40,344.92,346.36,3606900.0,346.16
11,2021-07-02,DIA,347.04,348.29,346.18,347.94,3013500.0,347.73
15,2021-07-06,DIA,347.75,348.11,343.60,345.82,3910600.0,345.62
19,2021-07-07,DIA,345.65,347.14,344.43,346.92,3347000.0,346.71
...,...,...,...,...,...,...,...,...
68,2021-07-26,SPY,439.31,441.03,439.26,441.02,43719200.0,441.02
72,2021-07-27,SPY,439.91,439.94,435.99,439.01,67397100.0,439.01
76,2021-07-28,SPY,439.68,440.30,437.31,438.83,52472400.0,438.83
80,2021-07-29,SPY,439.82,441.80,439.81,440.65,47435300.0,440.65
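
To see what the munging chain does without downloading anything, here is a sketch on a tiny hand-built frame. Everything here is an assumption for illustration: `df_wide` mimics the two-level column layout shown above with only two attributes and two symbols, and the numbers are copied from the first two dates of the ETF table.

```python
import pandas as pd

# Hypothetical wide frame: two attributes x two symbols
columns = pd.MultiIndex.from_product(
    [["Close", "Volume"], ["SPY", "QQQ"]], names=["Attributes", "Symbols"]
)
df_wide = pd.DataFrame(
    [[428.06, 354.43, 64827900, 32724000],
     [430.43, 354.57, 53441000, 29290000]],
    index=pd.DatetimeIndex(["2021-06-30", "2021-07-01"], name="Date"),
    columns=columns,
)

df_long = (
    df_wide
    .stack(level="Symbols")  # move the symbol level from the columns into the rows
    .reset_index()           # turn Date and Symbols into regular columns
    .rename(columns={"Date": "date", "Symbols": "symbol",
                     "Close": "close", "Volume": "volume"})
    .sort_values(by=["symbol", "date"])
    [["date", "symbol", "close", "volume"]]  # reorder columns
)

print(df_long.shape)  # (4, 4): one row per (date, symbol) pair
```

The result is in "long" form, with one row per date and symbol, which is the shape that makes filtering on `symbol` straightforward.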


#### Querying for One Symbol

We are now ready to apply `DataFrame` masking to our ETF data set.

As a first example, let's isolate all the rows of `df_etf` that correspond to `IWM`:

In [20]:
pd.options.display.max_rows = 6
ser_bool = (df_etf['symbol'] == "IWM")
df_etf[ser_bool]

Attributes,date,symbol,open,high,low,close,volume,adj_close
2,2021-06-30,IWM,228.65,230.32,227.76,229.37,26039000.0,229.37
6,2021-07-01,IWM,230.81,231.85,229.71,231.39,18089100.0,231.39
10,2021-07-02,IWM,232.00,232.08,228.56,229.19,21029700.0,229.19
...,...,...,...,...,...,...,...,...
78,2021-07-28,IWM,219.00,222.59,217.40,220.82,33043700.0,220.82
82,2021-07-29,IWM,222.79,224.44,222.14,222.52,22634800.0,222.52
86,2021-07-30,IWM,221.65,224.05,220.28,221.05,28465700.0,221.05


Notice that we did this in two steps: 

1. Calculate the series of `booleans` called `ser_bool` using comparison broadcasting.

2. Perform the masking by using square brackets `[]` and `ser_bool`.

We can actually perform this masking in a single line of code (without creating an intermediate variable):

In [21]:
df_etf[df_etf['symbol'] == "IWM"]

Attributes,date,symbol,open,high,low,close,volume,adj_close
2,2021-06-30,IWM,228.65,230.32,227.76,229.37,26039000.0,229.37
6,2021-07-01,IWM,230.81,231.85,229.71,231.39,18089100.0,231.39
10,2021-07-02,IWM,232.00,232.08,228.56,229.19,21029700.0,229.19
...,...,...,...,...,...,...,...,...
78,2021-07-28,IWM,219.00,222.59,217.40,220.82,33043700.0,220.82
82,2021-07-29,IWM,222.79,224.44,222.14,222.52,22634800.0,222.52
86,2021-07-30,IWM,221.65,224.05,220.28,221.05,28465700.0,221.05


**Code Challenge:** Select all the rows of `df_etf` for `QQQ`. 

In [22]:
df_etf[df_etf['symbol'] == 'QQQ']

Attributes,date,symbol,open,high,low,close,volume,adj_close
1,2021-06-30,QQQ,354.83,355.23,353.83,354.43,32724000.0,354.43
5,2021-07-01,QQQ,354.07,355.09,352.68,354.57,29290000.0,354.57
9,2021-07-02,QQQ,356.52,358.97,356.28,358.64,32727200.0,358.64
...,...,...,...,...,...,...,...,...
77,2021-07-28,QQQ,365.60,367.45,363.24,365.83,42066200.0,365.83
81,2021-07-29,QQQ,365.25,367.68,365.25,366.48,25672500.0,366.48
85,2021-07-30,QQQ,362.44,365.17,362.41,364.57,36463500.0,364.57


#### Querying for Multiple Symbols

We can use the `.isin()` method to filter a `DataFrame` for multiple symbols.  The technique is to feed `.isin()` a `list` of symbols you want to filter for.

The following code grabs all the rows of `df_etf` for both `QQQ` and `DIA`:

In [23]:
df_etf[df_etf['symbol'].isin(['QQQ', 'DIA'])]

Attributes,date,symbol,open,high,low,close,volume,adj_close
3,2021-06-30,DIA,342.38,345.51,342.35,344.95,3778900.0,344.75
7,2021-07-01,DIA,345.78,346.40,344.92,346.36,3606900.0,346.16
11,2021-07-02,DIA,347.04,348.29,346.18,347.94,3013500.0,347.73
...,...,...,...,...,...,...,...,...
77,2021-07-28,QQQ,365.60,367.45,363.24,365.83,42066200.0,365.83
81,2021-07-29,QQQ,365.25,367.68,365.25,366.48,25672500.0,366.48
85,2021-07-30,QQQ,362.44,365.17,362.41,364.57,36463500.0,364.57
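
As a rough sketch of what `.isin()` is doing, the example below compares it to a mask spelled out by hand. The frame `df_demo` is hypothetical (closes from 2021-07-30), and the `|` operator used for comparison is the element-wise "or", which is not otherwise covered in this tutorial.

```python
import pandas as pd

df_demo = pd.DataFrame({
    "symbol": ["DIA", "IWM", "QQQ", "SPY"],
    "close": [349.48, 221.05, 364.57, 438.51],  # 2021-07-30 closes
})

# True wherever the symbol appears in the list
mask_isin = df_demo["symbol"].isin(["QQQ", "DIA"])

# Equivalent mask spelled out with == and the element-wise "or" operator |
mask_or = (df_demo["symbol"] == "QQQ") | (df_demo["symbol"] == "DIA")

print(mask_isin.tolist())                      # [True, False, True, False]
print(df_demo[mask_isin]["symbol"].tolist())   # ['DIA', 'QQQ']
```

The `.isin()` version stays the same length no matter how many symbols are in the list, which is why it tends to be the more convenient choice.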


**Code Challenge:** Grab all rows of `df_etf` corresponding to `SPY`, `IWM`, and `QQQ`.

In [24]:
df_etf[df_etf['symbol'].isin(['SPY', 'IWM', 'QQQ'])]

Attributes,date,symbol,open,high,low,close,volume,adj_close
2,2021-06-30,IWM,228.65,230.32,227.76,229.37,26039000.0,229.37
6,2021-07-01,IWM,230.81,231.85,229.71,231.39,18089100.0,231.39
10,2021-07-02,IWM,232.00,232.08,228.56,229.19,21029700.0,229.19
...,...,...,...,...,...,...,...,...
76,2021-07-28,SPY,439.68,440.30,437.31,438.83,52472400.0,438.83
80,2021-07-29,SPY,439.82,441.80,439.81,440.65,47435300.0,440.65
84,2021-07-30,SPY,437.91,440.06,437.77,438.51,68890600.0,438.51


#### Querying for Dates

The following code grabs all the rows of `df_etf` that come after the middle of the month:

In [25]:
df_etf[df_etf['date'] > '2021-07-15']

Attributes,date,symbol,open,high,low,close,volume,adj_close
47,2021-07-16,DIA,350.72,350.74,346.34,346.74,5710400.0,346.74
51,2021-07-19,DIA,341.79,350.03,337.38,339.88,9715300.0,339.88
55,2021-07-20,DIA,340.29,346.12,339.75,345.08,5802200.0,345.08
...,...,...,...,...,...,...,...,...
76,2021-07-28,SPY,439.68,440.30,437.31,438.83,52472400.0,438.83
80,2021-07-29,SPY,439.82,441.80,439.81,440.65,47435300.0,440.65
84,2021-07-30,SPY,437.91,440.06,437.77,438.51,68890600.0,438.51
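
The date comparison above works because the `date` column holds datetimes, so `pandas` parses the string on the right-hand side as a date before comparing. A minimal sketch, assuming a hypothetical four-row frame `df_demo`:

```python
import pandas as pd

df_demo = pd.DataFrame({
    "date": pd.to_datetime(["2021-07-14", "2021-07-15", "2021-07-16", "2021-07-19"]),
    "symbol": ["SPY"] * 4,
})

mask_str = df_demo["date"] > "2021-07-15"               # string is parsed as a date
mask_ts = df_demo["date"] > pd.Timestamp("2021-07-15")  # explicit timestamp

print(mask_str.tolist())         # [False, False, True, True]
print(mask_str.equals(mask_ts))  # True
```

Since the comparison is strict, the row for July 15 itself is excluded; use `>=` if you want to keep it.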


**Code Challenge:** Grab all the rows of `df_etf` for the last trade date of the month.

In [26]:
df_etf[df_etf['date'] == '2021-07-30']

Attributes,date,symbol,open,high,low,close,volume,adj_close
87,2021-07-30,DIA,349.88,351.01,348.67,349.48,3573000.0,349.48
86,2021-07-30,IWM,221.65,224.05,220.28,221.05,28465700.0,221.05
85,2021-07-30,QQQ,362.44,365.17,362.41,364.57,36463500.0,364.57
84,2021-07-30,SPY,437.91,440.06,437.77,438.51,68890600.0,438.51


#### Querying on Multiple Criteria

We can filter on multiple criteria by using the `&` operator, which is the element-wise (vectorized) version of `and`.

Suppose that we want all rows for `SPY` that come before July fourth:

In [27]:
bln_ticker = (df_etf['symbol'] == 'SPY')
bln_date = (df_etf['date'] < '2021-07-04')
bln_combined = bln_ticker & bln_date

df_etf[bln_combined]

Attributes,date,symbol,open,high,low,close,volume,adj_close
0,2021-06-30,SPY,427.21,428.78,427.18,428.06,64827900.0,428.06
4,2021-07-01,SPY,428.87,430.6,428.8,430.43,53441000.0,430.43
8,2021-07-02,SPY,431.67,434.1,430.52,433.72,57697700.0,433.72
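
Two details are worth seeing in isolation: each comparison should be wrapped in parentheses when combined inline with `&`, and the plain Python `and` does not work on a `Series`. The sketch below uses a hypothetical four-row frame `df_demo` with closes taken from the ETF data.

```python
import pandas as pd

df_demo = pd.DataFrame({
    "symbol": ["SPY", "SPY", "QQQ", "QQQ"],
    "close": [428.06, 433.72, 354.43, 358.64],
})

# Each comparison sits in its own parentheses because & binds tighter than == and >
mask = (df_demo["symbol"] == "SPY") & (df_demo["close"] > 430)
print(df_demo[mask].index.tolist())  # [1]

# Plain "and" asks for a single truth value for the whole Series, which fails
try:
    (df_demo["symbol"] == "SPY") and (df_demo["close"] > 430)
except ValueError as err:
    print("ValueError:", err)
```

Creating named intermediate masks, as in the cell above, avoids the parentheses issue entirely and is often easier to read.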


**Code Challenge:** Isolate the rows for `QQQ` and `IWM` on the last trading day before July 4th.

In [28]:
df_etf[(df_etf['symbol'].isin(["QQQ", "IWM"])) & (df_etf['date']=='2021-07-02')]

Attributes,date,symbol,open,high,low,close,volume,adj_close
10,2021-07-02,IWM,232.0,232.08,228.56,229.19,21029700.0,229.19
9,2021-07-02,QQQ,356.52,358.97,356.28,358.64,32727200.0,358.64


### Querying with `.query()`

I find querying a `DataFrame` via masking to be rather cumbersome.  

I greatly prefer the use of the `DataFrame.query()` method which uses strings to define queries.

For example, the following code grabs all the rows corresponding to `IWM`.

In [29]:
df_etf.query('symbol == "IWM"')

Attributes,date,symbol,open,high,low,close,volume,adj_close
2,2021-06-30,IWM,228.65,230.32,227.76,229.37,26039000.0,229.37
6,2021-07-01,IWM,230.81,231.85,229.71,231.39,18089100.0,231.39
10,2021-07-02,IWM,232.00,232.08,228.56,229.19,21029700.0,229.19
...,...,...,...,...,...,...,...,...
78,2021-07-28,IWM,219.00,222.59,217.40,220.82,33043700.0,220.82
82,2021-07-29,IWM,222.79,224.44,222.14,222.52,22634800.0,222.52
86,2021-07-30,IWM,221.65,224.05,220.28,221.05,28465700.0,221.05
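
As a quick check that `.query()` and masking select the same rows, here is a sketch on a hypothetical four-row frame. It also shows the `@` prefix, which lets a query string refer to a local Python variable; the variable name `target` is made up for the example.

```python
import pandas as pd

df_demo = pd.DataFrame({
    "symbol": ["DIA", "IWM", "QQQ", "SPY"],
    "close": [349.48, 221.05, 364.57, 438.51],  # 2021-07-30 closes
})

df_mask = df_demo[df_demo["symbol"] == "IWM"]
df_query = df_demo.query('symbol == "IWM"')
print(df_mask.equals(df_query))  # True

# Prefix a local Python variable with @ to use it inside the query string
target = "IWM"  # hypothetical variable name
df_var = df_demo.query("symbol == @target")
print(df_var["close"].tolist())  # [221.05]
```

The `@` form is useful when the symbol to filter on is not known ahead of time, for example inside a loop or a function.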


This code queries all rows corresponding to `QQQ` and `DIA`.

In [30]:
df_etf.query('symbol in ("QQQ", "DIA")')

Attributes,date,symbol,open,high,low,close,volume,adj_close
3,2021-06-30,DIA,342.38,345.51,342.35,344.95,3778900.0,344.75
7,2021-07-01,DIA,345.78,346.40,344.92,346.36,3606900.0,346.16
11,2021-07-02,DIA,347.04,348.29,346.18,347.94,3013500.0,347.73
...,...,...,...,...,...,...,...,...
77,2021-07-28,QQQ,365.60,367.45,363.24,365.83,42066200.0,365.83
81,2021-07-29,QQQ,365.25,367.68,365.25,366.48,25672500.0,366.48
85,2021-07-30,QQQ,362.44,365.17,362.41,364.57,36463500.0,364.57


Here we grab the rows dated before July 15 (the first half of the month, plus June 30).

In [31]:
df_etf.query('date < "2021-07-15"')

Attributes,date,symbol,open,high,low,close,volume,adj_close
3,2021-06-30,DIA,342.38,345.51,342.35,344.95,3778900.0,344.75
7,2021-07-01,DIA,345.78,346.40,344.92,346.36,3606900.0,346.16
11,2021-07-02,DIA,347.04,348.29,346.18,347.94,3013500.0,347.73
...,...,...,...,...,...,...,...,...
28,2021-07-12,SPY,435.43,437.35,434.97,437.08,52889600.0,437.08
32,2021-07-13,SPY,436.24,437.84,435.31,435.59,52911300.0,435.59
36,2021-07-14,SPY,437.40,437.92,434.91,436.24,64130400.0,436.24


We can also filter on multiple criteria via method chaining.  Here we grab all the rows for `SPY` and `IWM` from the second half of the month.

In [32]:
(
df_etf
    .query('symbol in ("SPY", "IWM")')
    .query('date > "2021-07-15"')
)

Attributes,date,symbol,open,high,low,close,volume,adj_close
46,2021-07-16,IWM,219.83,219.88,214.47,214.95,36620200.0,214.95
50,2021-07-19,IWM,210.63,214.45,209.05,211.73,58571000.0,211.73
54,2021-07-20,IWM,212.20,219.27,211.26,218.30,40794600.0,218.30
...,...,...,...,...,...,...,...,...
76,2021-07-28,SPY,439.68,440.30,437.31,438.83,52472400.0,438.83
80,2021-07-29,SPY,439.82,441.80,439.81,440.65,47435300.0,440.65
84,2021-07-30,SPY,437.91,440.06,437.77,438.51,68890600.0,438.51
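
Chaining is one option; another is to combine the conditions inside a single query string with `and`. The following sketch compares the two on a hypothetical five-row frame, using a price threshold of 225 that was picked only for illustration.

```python
import pandas as pd

df_demo = pd.DataFrame({
    "symbol": ["SPY", "SPY", "IWM", "IWM", "QQQ"],
    "close": [428.06, 438.51, 229.37, 221.05, 364.57],
})

# Two chained queries, one criterion each
df_chained = (
    df_demo
    .query('symbol in ("SPY", "IWM")')
    .query("close > 225")
)

# Same filter in one string; "and" is allowed inside a query expression
df_single = df_demo.query('symbol in ("SPY", "IWM") and close > 225')

print(df_chained.equals(df_single))  # True
print(df_single.index.tolist())      # [0, 1, 2]
```

Which form to use is largely a matter of taste: chaining keeps each criterion on its own line, while a single string keeps the whole filter in one place.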


**Code Challenge:** Grab all the rows of `df_etf` that correspond to the following criteria:
1. `SPY`
2. first half of month
3. close less than 450

In [33]:
(
df_etf
    .query('symbol == "SPY"')
    .query('date < "2021-07-15"')
    .query('close < 450')
)

Attributes,date,symbol,open,high,low,close,volume,adj_close
0,2021-06-30,SPY,427.21,428.78,427.18,428.06,64827900.0,428.06
4,2021-07-01,SPY,428.87,430.60,428.80,430.43,53441000.0,430.43
8,2021-07-02,SPY,431.67,434.10,430.52,433.72,57697700.0,433.72
...,...,...,...,...,...,...,...,...
28,2021-07-12,SPY,435.43,437.35,434.97,437.08,52889600.0,437.08
32,2021-07-13,SPY,436.24,437.84,435.31,435.59,52911300.0,435.59
36,2021-07-14,SPY,437.40,437.92,434.91,436.24,64130400.0,436.24


## Related Reading

*PDSH* - 2.6 - Comparisons, Masks, and Boolean Logic

*PDSH* - 2.7 - Fancy Indexing

*PDSH* - 3.2 - Data Indexing and Selection 

*PDSH* - 3.12 - High Performance Pandas