# Querying `DataFrame`

In this lecture we're going to talk about querying DataFrames. The first step in the process is to understand Expressions. Expressions are the heart of multi-row data manipulation.



In [1]:
import qualified DataFrame as D

df <- D.readCsv "datasets/Admission_Predict.csv"

D.take 10 df

-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Serial No.<br>Int | GRE Score<br>Int | TOEFL Score<br>Int | University Rating<br>Int | SOP<br>Double | LOR <br>Double | CGPA<br>Double | Research<br>Int | Chance of Admit <br>Double
------------------|------------------|--------------------|--------------------------|---------------|----------------|----------------|-----------------|---------------------------
1                 | 337              | 118                | 4                        | 4.5           | 4.5            | 9.65           | 1               | 0.92                      
2                 | 324              | 107                | 4                        | 4.0           | 4.5            | 8.87           | 1               | 0.76                      
3                 | 316              | 104                | 3                        | 3.0           | 3.5            | 8.0            | 1               | 0.72                      
4                 | 322              | 110                | 3                        | 3.5           | 2.5            | 8.67           | 1               | 0.8                       
5                 | 314              | 103                | 2                        | 2.0           | 3.0            | 8.21           | 0               | 0.65                      
6                 | 330              | 115                | 5                        | 4.5           | 3.0            | 9.34           | 1               | 0.9                       
7                 | 321              | 109                | 3                        | 3.0           | 4.0            | 8.2            | 1               | 0.75                      
8                 | 308              | 101                | 2                        | 3.0           | 4.0            | 7.9            | 0               | 0.68                      
9                 | 302              | 102                | 1                        | 2.0           | 1.5            | 8.0            | 0               | 0.5                       
10                | 323              | 108                | 3                        | 3.5           | 3.0            | 8.6            | 0               | 0.45                      


In other sections we used `F.col` to refer to columns. Sometimes we had to specify the type to help the compiler. This gets tedious. We can actually generate ready-to-use bindings to `F.col` that correspond to our columns (and their types).

In [2]:
:set -XTemplateHaskell
import qualified DataFrame.Functions as F
import Data.Text (Text) -- We'll need this when generating functions if some of our columns are text.

F.declareColumns df

We can now use these references in our expressions with autocomplete.

In our graduate admission dataset, we might be interested in seeing only those students whose chance of admission is higher than 0.7.

We can write that expression and save it for later.

In [3]:
import DataFrame.Functions ((.&&), (.>), (.<))

chance_of_admit_ .> 0.7

(gt (col @Double "Chance of Admit ") (lit (0.7)))

The expression renders as human-readable text with the column name already filled in. Note that we didn't even have to worry about "Chance of Admit " having a trailing space: as long as we use the generated column reference, we don't have to worry about typos.
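
To see why an expression can be both printed and saved for later, it helps to think of it as an ordinary value that describes a computation without running it. The following is a minimal sketch of that idea using only `base`. The names `Expr`, `render`, `eval`, and `highChance` are hypothetical and made up for illustration; this is not how the `dataframe` library is actually implemented.

```haskell
{-# LANGUAGE GADTs #-}

-- A toy row: column names paired with numeric values.
type Row = [(String, Double)]

-- Hypothetical expression type: each constructor describes a computation
-- without running it, which is why the value can be stored and printed.
data Expr a where
  Col :: String -> Expr Double
  Lit :: Double -> Expr Double
  Gt  :: Expr Double -> Expr Double -> Expr Bool
  Lt  :: Expr Double -> Expr Double -> Expr Bool
  And :: Expr Bool -> Expr Bool -> Expr Bool

-- Turn the expression tree into readable text.
render :: Expr a -> String
render (Col n)   = "(col " ++ show n ++ ")"
render (Lit x)   = "(lit " ++ show x ++ ")"
render (Gt a b)  = "(gt " ++ render a ++ " " ++ render b ++ ")"
render (Lt a b)  = "(lt " ++ render a ++ " " ++ render b ++ ")"
render (And a b) = "(and " ++ render a ++ " " ++ render b ++ ")"

-- Evaluate the expression against one row; Nothing if a column is missing.
eval :: Row -> Expr a -> Maybe a
eval row (Col n)   = lookup n row
eval _   (Lit x)   = Just x
eval row (Gt a b)  = (>)  <$> eval row a <*> eval row b
eval row (Lt a b)  = (<)  <$> eval row a <*> eval row b
eval row (And a b) = (&&) <$> eval row a <*> eval row b

-- Saved for later, in the spirit of `chance_of_admit_ .> 0.7`.
highChance :: Expr Bool
highChance = Gt (Col "Chance of Admit ") (Lit 0.7)
```

Loaded into GHCi, `render highChance` gives `(gt (col "Chance of Admit ") (lit 0.7))`, and `eval [("Chance of Admit ", 0.92)] highChance` gives `Just True`. Nothing is computed until `eval` is handed a row, which is what makes the expression reusable.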

Let's apply this expression to the dataframe.

In [4]:
df |> D.filterWhere (chance_of_admit_ .> 0.7)
   |> D.take 10

-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Serial No.<br>Int | GRE Score<br>Int | TOEFL Score<br>Int | University Rating<br>Int | SOP<br>Double | LOR <br>Double | CGPA<br>Double | Research<br>Int | Chance of Admit <br>Double
------------------|------------------|--------------------|--------------------------|---------------|----------------|----------------|-----------------|---------------------------
1                 | 337              | 118                | 4                        | 4.5           | 4.5            | 9.65           | 1               | 0.92                      
2                 | 324              | 107                | 4                        | 4.0           | 4.5            | 8.87           | 1               | 0.76                      
3                 | 316              | 104                | 3                        | 3.0           | 3.5            | 8.0            | 1               | 0.72                      
4                 | 322              | 110                | 3                        | 3.5           | 2.5            | 8.67           | 1               | 0.8                       
6                 | 330              | 115                | 5                        | 4.5           | 3.0            | 9.34           | 1               | 0.9                       
7                 | 321              | 109                | 3                        | 3.0           | 4.0            | 8.2            | 1               | 0.75                      
12                | 327              | 111                | 4                        | 4.0           | 4.5            | 9.0            | 1               | 0.84                      
13                | 328              | 112                | 4                        | 4.0           | 4.5            | 9.1            | 1               | 0.78                      
23                | 328              | 116                | 5                        | 5.0           | 5.0            | 9.5            | 1               | 0.94                      
24                | 334              | 119                | 5                        | 5.0           | 4.5            | 9.7            | 1               | 0.95                      


We can combine expressions to form complex conditions.

In [6]:
df |> D.filterWhere
        (chance_of_admit_ .> 0.7 .&& (chance_of_admit_ .< 0.9))
    |> D.take 10

-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Serial No.<br>Int | GRE Score<br>Int | TOEFL Score<br>Int | University Rating<br>Int | SOP<br>Double | LOR <br>Double | CGPA<br>Double | Research<br>Int | Chance of Admit <br>Double
------------------|------------------|--------------------|--------------------------|---------------|----------------|----------------|-----------------|---------------------------
2                 | 324              | 107                | 4                        | 4.0           | 4.5            | 8.87           | 1               | 0.76                      
3                 | 316              | 104                | 3                        | 3.0           | 3.5            | 8.0            | 1               | 0.72                      
4                 | 322              | 110                | 3                        | 3.5           | 2.5            | 8.67           | 1               | 0.8                       
7                 | 321              | 109                | 3                        | 3.0           | 4.0            | 8.2            | 1               | 0.75                      
12                | 327              | 111                | 4                        | 4.0           | 4.5            | 9.0            | 1               | 0.84                      
13                | 328              | 112                | 4                        | 4.0           | 4.5            | 9.1            | 1               | 0.78                      
27                | 322              | 109                | 5                        | 4.5           | 3.5            | 8.8            | 0               | 0.76                      
32                | 327              | 103                | 3                        | 4.0           | 4.0            | 8.3            | 1               | 0.74                      
36                | 320              | 110                | 5                        | 5.0           | 5.0            | 9.2            | 1               | 0.88                      
44                | 332              | 117                | 4                        | 4.5           | 4.0            | 9.1            | 0               | 0.87                      


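Expressions can also branch. Judging by the output below, `F.ifThenElse` checks its first argument for each row, then applies the second condition where it holds and the third where it does not: students in the 0.7 to 0.9 band are kept only if their GRE score is above 320, and everyone else only if their university rating is below 3.

In [8]: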
df |> D.filterWhere
        (F.ifThenElse (chance_of_admit_ .> 0.7 .&& (chance_of_admit_ .< 0.9))
            (gre_score .> 320)
            (university_rating .< 3))
    |> D.take 10

-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Serial No.<br>Int | GRE Score<br>Int | TOEFL Score<br>Int | University Rating<br>Int | SOP<br>Double | LOR <br>Double | CGPA<br>Double | Research<br>Int | Chance of Admit <br>Double
------------------|------------------|--------------------|--------------------------|---------------|----------------|----------------|-----------------|---------------------------
2                 | 324              | 107                | 4                        | 4.0           | 4.5            | 8.87           | 1               | 0.76                      
4                 | 322              | 110                | 3                        | 3.5           | 2.5            | 8.67           | 1               | 0.8                       
5                 | 314              | 103                | 2                        | 2.0           | 3.0            | 8.21           | 0               | 0.65                      
7                 | 321              | 109                | 3                        | 3.0           | 4.0            | 8.2            | 1               | 0.75                      
8                 | 308              | 101                | 2                        | 3.0           | 4.0            | 7.9            | 0               | 0.68                      
9                 | 302              | 102                | 1                        | 2.0           | 1.5            | 8.0            | 0               | 0.5                       
12                | 327              | 111                | 4                        | 4.0           | 4.5            | 9.0            | 1               | 0.84                      
13                | 328              | 112                | 4                        | 4.0           | 4.5            | 9.1            | 1               | 0.78                      
27                | 322              | 109                | 5                        | 4.5           | 3.5            | 8.8            | 0               | 0.76                      
28                | 298              | 98                 | 2                        | 1.5           | 2.5            | 7.5            | 1               | 0.44                      
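
Read row by row, the conditional filter above amounts to an ordinary predicate. The sketch below restates it in plain Haskell on the first five rows of the dataset. It assumes `F.ifThenElse` uses its second argument where the condition holds and its third otherwise, which is consistent with the output shown; the `Applicant` record and the names `keep`, `firstFive`, and `keptSerials` are hypothetical and not part of the library.

```haskell
-- Hypothetical record holding only the four columns this sketch needs.
data Applicant = Applicant
  { serialNo         :: Int
  , greScore         :: Int
  , universityRating :: Int
  , chanceOfAdmit    :: Double
  }

-- Row-wise reading of the conditional filter: inside the 0.7 to 0.9 band
-- the GRE condition decides, outside it the university rating decides.
keep :: Applicant -> Bool
keep a
  | inBand    = greScore a > 320
  | otherwise = universityRating a < 3
  where
    inBand = chanceOfAdmit a > 0.7 && chanceOfAdmit a < 0.9

-- First five rows of the dataset, copied from the first table.
firstFive :: [Applicant]
firstFive =
  [ Applicant 1 337 4 0.92
  , Applicant 2 324 4 0.76
  , Applicant 3 316 3 0.72
  , Applicant 4 322 3 0.80
  , Applicant 5 314 2 0.65
  ]

-- Serial numbers of the rows that survive the filter.
keptSerials :: [Int]
keptSerials = map serialNo (filter keep firstFive)
```

Here `keptSerials` evaluates to `[2,4,5]`, matching the first three rows of the filtered table.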


In [None]:
# One thing to watch out for is order of operations! A common error for new pandas users is
# to try to do boolean comparisons using the & operator without putting parentheses around
# the individual terms they are interested in
df['chance of admit'] > 0.7 & df['chance of admit'] < 0.9

In [None]:
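# The problem is that & binds more tightly than > and <, so Python tries to bitwise AND 0.7 with a
# pandas Series, when you really want to bitwise AND the two broadcasted boolean Series together.
# Parentheses around each comparison fix the grouping
(df['chance of admit'] > 0.7) & (df['chance of admit'] < 0.9)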
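
For contrast with the pandas pitfall, Haskell lets a library declare how its operators group. The following is a minimal sketch, assuming hypothetical stand-ins for `.>`, `.<`, and `.&&` defined from scratch over plain functions. These are not the `dataframe` library's definitions, and its actual fixities may differ, which is one reason to keep explicit parentheses when in doubt.

```haskell
-- Fixities chosen to mirror Prelude's (>) and (<) at infix 4 and (&&) at
-- infixr 3, so both comparisons group before the conjunction.
infix 4 .>, .<
infixr 3 .&&

-- A predicate over some row type r.
type Pred r = r -> Bool

-- Compare a column accessor against a constant.
(.>) :: (r -> Double) -> Double -> Pred r
f .> x = \row -> f row > x

(.<) :: (r -> Double) -> Double -> Pred r
f .< x = \row -> f row < x

-- Combine two predicates row by row.
(.&&) :: Pred r -> Pred r -> Pred r
p .&& q = \row -> p row && q row

-- For this toy example a "row" is just the chance-of-admit value itself.
chance :: Double -> Double
chance = id

-- No parentheses needed: this parses as (chance .> 0.7) .&& (chance .< 0.9).
midChance :: Pred Double
midChance = chance .> 0.7 .&& chance .< 0.9
```

With these declarations, `map midChance [0.65, 0.76, 0.92]` gives `[False,True,False]`. Without the `infix` lines every operator would default to `infixl 9` and the unparenthesized form would no longer type-check.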

In [None]:
# Another way to do this is to get rid of the comparison operators completely and instead
# use the built-in methods that mimic them
df['chance of admit'].gt(0.7) & df['chance of admit'].lt(0.9)

In [None]:
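# These functions are built right into the Series and DataFrame objects, so it is tempting to chain
# them. Be careful: .gt(0.7) returns a boolean Series, so .gt(0.7).lt(0.9) would compare True/False
# with 0.9 rather than the original values, and the answer would not be the same. To avoid
# operators entirely, Series.between expresses the range in one call (inclusive="neither" keeps
# both bounds exclusive; the string form assumes pandas 1.3 or later). You can decide what
# looks best for you
df['chance of admit'].between(0.7, 0.9, inclusive="neither")

In [None]:
# This only works if your operator, such as less than or greater than, is built into the DataFrame, but I
# certainly find that last code example much more readable than one with ampersands and parentheses.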

In [None]:
 # You need to be able to read and write all of these, and understand the implications of the route you are
 # choosing. It's worth really going back and rewatching this lecture to make sure you have it. I would say
 # 50% or more of the work you'll be doing in data cleaning involves querying DataFrames.

In this lecture, we have learned to query a DataFrame using boolean masking, which is extremely important and often used in the world of data science. With boolean masking, we can select data based on the criteria we desire and, frankly, you'll use it everywhere. We've also seen that there are many different ways to query the DataFrame, and the interesting side implications that come up when doing so.