# `DataFrame` Querying

In this tutorial we discuss two ways of querying a `DataFrame`:

1. masking
2. the `DataFrame.query()` method

### Importing Packages

Let's first import the packages that we will need.

In [None]:
##> import numpy as np
##> import pandas as pd
##> import pandas_datareader as pdr




### Reading-In Data

Next, let's use `pandas_datareader` to read in some `SPY` data, from June 30, 2021 through the end of July 2021.

In [None]:
##> df_spy = pdr.get_data_yahoo('SPY', start='2021-06-30', end='2021-07-31')
##> df_spy = df_spy.round(2)
##> df_spy.head()




The following code resets the `index` so that `date` is a regular column, and then converts all the column names to snake case.

In [None]:
##> df_spy.reset_index(inplace=True)
##> df_spy.columns = df_spy.columns.str.lower().str.replace(' ', '_')
##> df_spy.head()




### Comparison and `DataFrame` Columns

As discussed in a previous tutorial, a column of a `DataFrame` is a `Series` object, which is a souped up `numpy.array` (think vector or matrix).

Let's separate out the `adj_close` column of `df_spy` and assign it to a variable:

In [None]:
##> pd.options.display.max_rows = 6 # this modifies the printing of dataframes
##> ser_adjusted = df_spy['adj_close']
##> ser_adjusted




Recall that a `pandas.Series` is smart with respect to component-wise arithmetic operations, meaning it behaves like a vector from linear algebra.  This means that arithmetic operations are *broadcasted* as you might expect.

For example, division by 100 is broadcasted component-wise:

In [None]:
##> ser_adjusted / 100



It is a convenient fact that this broadcasting behavior also occurs with comparison, and produces a `Series` of booleans. 

The following code checks which elements of `ser_adjusted` are greater than 435: 

In [None]:
##> ser_test = (ser_adjusted > 435)
##> ser_test




Let's check that the resulting variable `ser_test` is a `pandas.Series`:

In [None]:
##> type(ser_test)



And finally let's observe the `.values` elements of `ser_test`:

In [None]:
##> print(ser_test.values)



A few observations about what just happened:

1. When we compare a `Series` of numerical values (`ser_adjusted`) to a single number (`435`), we get back a `Series` of booleans (`ser_test`).

2. For every position `i`, `ser_test[i]` equals `ser_adjusted[i] > 435`.

3. So the comparison operation was broadcasted as advertised.

This is easy to see by appending `ser_test` to `df_spy` and then reprinting:

In [None]:
##> pd.options.display.max_rows = 25
##> df_spy['test'] = ser_test
##> df_spy



As we will see in the next two sections, the broadcasting of comparison can be used to query subsets of rows of a `DataFrame`.
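
If you want to try comparison broadcasting without downloading anything, here is a minimal sketch on a tiny hand-made `Series`. The prices and the names `ser_prices`, `ser_above`, and `n_above` are made up for illustration and are not part of the SPY data above.

```python
import pandas as pd

# Hypothetical adjusted closes; these numbers are made up for illustration.
ser_prices = pd.Series([433.1, 434.8, 435.0, 436.2], name='adj_close')

# Comparing a Series to a single number broadcasts the comparison element-wise.
ser_above = ser_prices > 435

print(ser_above.tolist())   # [False, False, False, True]
print(ser_above.dtype)      # bool

# Because True counts as 1, summing the booleans counts the matching rows.
n_above = int(ser_above.sum())
print(n_above)              # 1
```

Note that `435.0 > 435` is `False`, since the comparison is strict.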

### `DataFrame` Masking

The code below shows that `df_spy` has 22 rows:

In [None]:
##> df_spy.shape



The following code creates a list consisting of 22 booleans, all of them `False`:

In [None]:
##> lst_bool = [False] * 22
##> lst_bool




Now, let's see what happens when we feed this `list` of `False` booleans into `df_spy` using square brackets.

In [None]:
##> df_spy[lst_bool]



**Code Challenge:** Verify that `df_spy[lst_bool]` is an empty `DataFrame`.

Next let's modify `lst_bool` slightly, by changing the 0th entry to `True`, and then feed it into `df_spy` again.

In [None]:
##> lst_bool[0] = True
##> df_spy[lst_bool]




So what happened?  Notice that `df_spy[lst_bool]` returns a `DataFrame` consisting only of the 0th row of `df_spy`.

Let's modify `lst_bool` once again, by setting the 1st entry of `lst_bool` to `True`, and then once again feed it into `df_spy`.

In [None]:
##> lst_bool[1] = True
##> df_spy[lst_bool]




**Punchline:** What is returned by the code `df_spy[lst_bool]` will be a `DataFrame` consisting of all the rows corresponding to the `True` entries of `lst_bool`.

This is called `DataFrame` *masking*.
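
The same idea can be sketched on a small, self-contained table. The `DataFrame` `df_toy`, its values, and the name `lst_mask` are hypothetical and only meant to show the mechanics.

```python
import pandas as pd

# Made-up three-row table, used only to illustrate masking.
df_toy = pd.DataFrame({
    'symbol': ['SPY', 'QQQ', 'IWM'],
    'close': [430.0, 355.0, 229.0],
})

# One boolean per row; True means "keep this row".
lst_mask = [True, False, True]
df_kept = df_toy[lst_mask]

print(df_kept['symbol'].tolist())   # ['SPY', 'IWM']
print(df_kept.index.tolist())       # [0, 2], the original index labels are kept
```

The mask needs exactly one boolean per row; if the lengths differ, `pandas` raises an error rather than guessing.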

**Code Challenge:** Modify `lst_bool` and then use `DataFrame` masking to grab the 0th, 1st, and 3rd rows of `df_spy`.

### Querying with `DataFrame` Masking

We often want to query a `DataFrame` based on some kind of comparison involving its column values.

We can achieve this kind of querying by combining the broadcasting of comparison over `DataFrame` columns with `DataFrame` masking.

In order to consider concrete examples, let's read in some data.

The following code reads in a dataset consisting of EOD prices for four different ETFs (SPY, IWM, QQQ, DIA), during the month of July 2021:

In [None]:
##> pd.options.display.max_rows = 25
##> df_etf = pdr.get_data_yahoo(['SPY', 'QQQ', 'IWM', 'DIA'], start='2021-06-30', end='2021-07-31')
##> df_etf = df_etf.round(2)
##> df_etf.head()




This data is not as tidy as we would like.  Let's use method chaining to perform a series of data munging operations.

In [None]:
##> df_etf = \
##>     (
##>     df_etf
##>         .stack(level='Symbols') #move the Symbols column level into the rows (wide to long)
##>         .reset_index() #turn date into a column 
##>         .sort_values(by=['Symbols', 'Date']) #sort
##>         .rename(columns={'Date':'date', 'Symbols':'symbol', 'Adj Close':'adj_close','Close':'close', 
##>                          'High':'high', 'Low':'low', 'Open':'open', 'Volume':'volume'}) #renaming columns
##>         [['date', 'symbol','open', 'high', 'low', 'close', 'volume', 'adj_close']] #reordering columns
##>     )
##> df_etf




#### Querying for One Symbol

We are now ready to apply `DataFrame` masking to our ETF data set.

As a first example, let's isolate all the rows of `df_etf` that correspond to `IWM`:

In [None]:
##> pd.options.display.max_rows = 6
##> ser_bool = (df_etf['symbol'] == "IWM")
##> df_etf[ser_bool]




Notice that we did this in two steps: 

1. Calculate the `Series` of booleans called `ser_bool` using comparison broadcasting.

2. Perform the masking by using square brackets `[]` and `ser_bool`.

We can actually perform this masking in a single line of code (without creating an intermediate variable):

In [None]:
##> df_etf[df_etf['symbol'] == "IWM"]



**Code Challenge:** Select all the rows of `df_etf` for `QQQ`. 

#### Querying for Multiple Symbols

We can use the `.isin()` method to filter a `DataFrame` for multiple symbols.  The technique is to feed `.isin()` a `list` of symbols you want to filter for.

The following code grabs all the rows of `df_etf` for both `QQQ` and `DIA`:

In [None]:
##> df_etf[df_etf['symbol'].isin(['QQQ', 'DIA'])]



**Code Challenge:** Grab all rows of `df_etf` corresponding to `SPY`, `IWM`, and `QQQ`.
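
As a side note, a boolean `Series` can be flipped with the `~` operator, which is handy when you want every symbol *except* the ones in a list. Below is a small sketch on made-up data; `df_toy` and its values are hypothetical.

```python
import pandas as pd

# Made-up one-row-per-symbol table, for illustration only.
df_toy = pd.DataFrame({
    'symbol': ['SPY', 'QQQ', 'IWM', 'DIA'],
    'close': [430.0, 355.0, 229.0, 345.0],
})

# True where the symbol is in the list, False otherwise.
ser_in = df_toy['symbol'].isin(['QQQ', 'DIA'])

df_in = df_toy[ser_in]     # rows whose symbol is in the list
df_out = df_toy[~ser_in]   # ~ negates the mask: rows whose symbol is NOT in the list

print(df_in['symbol'].tolist())    # ['QQQ', 'DIA']
print(df_out['symbol'].tolist())   # ['SPY', 'IWM']
```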

#### Querying for Dates

The following code grabs all the rows of `df_etf` that come after the middle of the month:

In [None]:
##> df_etf[df_etf['date'] > '2021-07-15']



**Code Challenge:** Grab all the rows of `df_etf` for the last trade date of the month.
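
A note on how date comparisons behave: when the column holds datetimes, `pandas` parses the string on the right-hand side for us, so comparing against a string or an explicit `pd.Timestamp` should select the same rows. The sketch below assumes a datetime column and uses made-up data (`df_toy`, `cutoff` are hypothetical names).

```python
import pandas as pd

# Made-up table with a proper datetime column.
df_toy = pd.DataFrame({
    'date': pd.to_datetime(['2021-07-14', '2021-07-15', '2021-07-16']),
    'close': [1.0, 2.0, 3.0],
})

# Comparison against an explicit Timestamp.
cutoff = pd.Timestamp('2021-07-15')
df_after = df_toy[df_toy['date'] > cutoff]

# Comparison against a date string; pandas parses the string for us.
df_after_str = df_toy[df_toy['date'] > '2021-07-15']

print(df_after['close'].tolist())       # [3.0]
print(df_after.equals(df_after_str))    # True
```

Since `>` is strict, the row dated exactly July 15 is excluded; use `>=` if you want to include the cutoff date.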

#### Querying on Multiple Criteria

We can filter on multiple criteria by using the `&` operator, which is the vectorized (element-wise) version of `and`.

Suppose that we want all rows for `SPY` that come before July fourth:

In [None]:
##> bln_ticker = (df_etf['symbol'] == 'SPY')
##> bln_date = (df_etf['date'] < '2021-07-04')
##> bln_combined = bln_ticker & bln_date
##> 
##> df_etf[bln_combined]




**Code Challenge:** Isolate the rows for `QQQ` and `IWM` on the last trading day before July 4th.
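
One detail worth keeping in mind when combining conditions in a single expression: each comparison needs its own parentheses, because `&` and `|` bind more tightly than `==` and `>`. Here is a minimal sketch on made-up data (`df_toy` is hypothetical) that also shows `|`, the element-wise version of `or`.

```python
import pandas as pd

# Made-up table, for illustration only.
df_toy = pd.DataFrame({
    'symbol': ['SPY', 'SPY', 'QQQ', 'QQQ'],
    'close': [428.0, 436.0, 354.0, 366.0],
})

# Both conditions must hold (element-wise "and").
df_and = df_toy[(df_toy['symbol'] == 'SPY') & (df_toy['close'] > 430)]

# At least one condition must hold (element-wise "or").
df_or = df_toy[(df_toy['symbol'] == 'SPY') | (df_toy['close'] > 360)]

print(df_and.index.tolist())   # [1]
print(df_or.index.tolist())    # [0, 1, 3]
```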

### Querying with `.query()`

I find querying a `DataFrame` via masking to be rather cumbersome.  

I greatly prefer the `DataFrame.query()` method, which uses strings to define queries.

For example, the following code grabs all the rows corresponding to `IWM`.

In [None]:
##> df_etf.query('symbol == "IWM"')



This code queries all rows corresponding to `QQQ` and `DIA`.

In [None]:
##> df_etf.query('symbol in ("QQQ", "DIA")')



Here we grab the rows dated before July 15 (roughly the first half of the month).

In [None]:
##> df_etf.query('date < "2021-07-15"')




And we can filter on multiple criteria via method chaining.  Here we grab all the rows for `SPY` and `IWM` from the second half of the month.

In [None]:
##> (
##> df_etf
##>     .query('symbol in ("SPY", "IWM")')
##>     .query('date > "2021-07-15"')
##> )




**Code Challenge:** Grab all the rows of `df_etf` that correspond to the following criteria:
1. `SPY`
2. first half of month
3. close less than 450
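
Besides chaining, a single query string can combine conditions with `and`, and it can refer to a Python variable by prefixing its name with `@`. The sketch below uses made-up data (`df_toy` and `limit` are hypothetical names) and checks that the query matches the equivalent mask.

```python
import pandas as pd

# Made-up table, for illustration only.
df_toy = pd.DataFrame({
    'symbol': ['SPY', 'SPY', 'QQQ', 'QQQ'],
    'close': [428.0, 436.0, 354.0, 366.0],
})

limit = 360.0  # hypothetical threshold, referenced inside the query via @limit

# Masking version.
df_mask = df_toy[(df_toy['symbol'] == 'QQQ') & (df_toy['close'] < limit)]

# Equivalent query-string version.
df_query = df_toy.query('symbol == "QQQ" and close < @limit')

print(df_query.index.tolist())     # [2]
print(df_mask.equals(df_query))    # True
```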

## Related Reading

*PDSH* - 2.6 - Comparisons, Masks, and Boolean Logic

*PDSH* - 2.7 - Fancy Indexing

*PDSH* - 3.2 - Data Indexing and Selection 

*PDSH* - 3.12 - High Performance Pandas