# Missing Values

Missing values are common in data cleaning, and they can occur for any number of reasons. For instance, if you are running a survey and a respondent doesn't answer a question, the missing value is an omission.

This kind of missing data is called **Missing at Random (MAR)**. There is no physical or logical reason why the data must be missing, and other variables in the dataset can often be used to predict it.

If there is no relationship to other variables, then we call this data **Missing Completely at Random (MCAR)**.

These are just two examples of missing data, and there are many more. For instance, data might be missing because it wasn't collected, either because the process responsible for collecting it, such as a researcher, missed it, or because the value wouldn't make sense to collect. This last case is extremely common when you start joining DataFrames together from multiple sources, such as joining a list of people at a university with a list of offices in the university (students generally don't have offices).
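
To make the join case concrete, here is a minimal sketch in plain Haskell rather than the DataFrame library. The names `people`, `offices`, and `joinOffices` and the sample rows are hypothetical and exist only for illustration. Every person is kept, and the office lookup yields `Nothing` when no office exists.

```haskell
-- Hypothetical data: people at a university and the offices some of them have.
people :: [(String, String)]   -- (name, role)
people = [("Ada", "staff"), ("Ben", "student"), ("Cy", "staff")]

offices :: [(String, String)]  -- (name, office)
offices = [("Ada", "B101"), ("Cy", "C204")]

-- Left join: keep every person and look up an office that may not exist.
-- `lookup` returns Nothing when the name has no entry in the office list.
joinOffices :: [(String, String)] -> [(String, String)] -> [(String, String, Maybe String)]
joinOffices ps os = [(name, role, lookup name os) | (name, role) <- ps]

main :: IO ()
main = mapM_ print (joinOffices people offices)
```

The student row ends up with `Nothing` in the office position, which is missingness created by the join itself and not by a failure to collect data.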

Let's look at some ways of handling missing data.

In [1]:
import qualified DataFrame as D

Python has a number of ways of representing missingness (`nan`, `None`, etc.), and you typically check whether data is missing by counting the occurrences of the values that represent missingness.

In Haskell, when we detect that a column has missing values, we wrap its values in an optional type called `Maybe`: values that exist become `Just x`, and missing values become `Nothing`.
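
As a rough illustration of what a `Maybe` column looks like, the sketch below models a column as an ordinary list of `Maybe Double`. It uses only `Data.Maybe` from base, not the DataFrame API, and the helper names `countMissing` and `present` and the sample values are assumptions made for this example.

```haskell
import Data.Maybe (catMaybes, isNothing)

-- A small illustrative column: three present values and one missing value.
assignment :: [Maybe Double]
assignment = [Just 57.14, Just 95.05, Just 83.7, Nothing]

-- Count the cells that hold no value.
countMissing :: [Maybe a] -> Int
countMissing = length . filter isNothing

-- Keep only the values that exist, unwrapped from Just.
present :: [Maybe a] -> [a]
present = catMaybes

main :: IO ()
main = do
  print (countMissing assignment)
  print (present assignment)
```

Because missingness is part of the type, the compiler will not let you do arithmetic on the column until you decide how to handle `Nothing`.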


In [2]:
df <- D.readCsv "./datasets/class_grades.csv"

D.take 10 df

--------------------------------------------------------------------------------------------------------------------------------------------------
Prefix<br>Int | Assignment<br>Maybe Double | Tutorial<br>Maybe Double | Midterm<br>Maybe Double | TakeHome<br>Maybe Double | Final<br>Maybe Double
--------------|----------------------------|--------------------------|-------------------------|--------------------------|----------------------
5             | Just 57.14                 | Just 34.09               | Just 64.38              | Just 51.48               | Just 52.5            
8             | Just 95.05                 | Just 105.49              | Just 67.5               | Just 99.07               | Just 68.33           
8             | Just 83.7                  | Just 83.17               | Nothing                 | Just 63.15               | Just 48.89           
7             | Nothing                    | Nothing                  | Just 49.38              | Just 105.93              | Just 80.56           
8             | Just 91.32                 | Just 93.64               | Just 95.0               | Just 107.41              | Just 73.89           
7             | Just 95.0                  | Just 92.58               | Just 93.12              | Just 97.78               | Just 68.06           
8             | Just 95.05                 | Just 102.99              | Just 56.25              | Just 99.07               | Just 50.0            
7             | Just 72.85                 | Just 86.85               | Just 60.0               | Nothing                  | Just 56.11           
8             | Just 84.26                 | Just 93.1                | Just 47.5               | Just 18.52               | Just 50.83           
7             | Just 90.1                  | Just 97.55               | Just 51.25              | Just 88.89               | Just 63.61           


We can see how many missing values are in the dataset by using `describeColumns`.

In [3]:
D.describeColumns df

------------------------------------------------------------------------------------
Column Name<br>Text | # Non-null Values<br>Int | # Null Values<br>Int | Type<br>Text
--------------------|--------------------------|----------------------|-------------
Assignment          | 92                       | 7                    | Maybe Double
TakeHome            | 94                       | 5                    | Maybe Double
Final               | 96                       | 3                    | Maybe Double
Midterm             | 96                       | 3                    | Maybe Double
Tutorial            | 96                       | 3                    | Maybe Double
Prefix              | 99                       | 0                    | Int         


We can also drop rows that have any missing data.

In [4]:
import DataFrame ((|>))

df |> D.filterAllJust
   |> D.take 10

--------------------------------------------------------------------------------------------------------------------
Prefix<br>Int | Assignment<br>Double | Tutorial<br>Double | Midterm<br>Double | TakeHome<br>Double | Final<br>Double
--------------|----------------------|--------------------|-------------------|--------------------|----------------
5             | 57.14                | 34.09              | 64.38             | 51.48              | 52.5           
8             | 95.05                | 105.49             | 67.5              | 99.07              | 68.33          
8             | 91.32                | 93.64              | 95.0              | 107.41             | 73.89          
7             | 95.0                 | 92.58              | 93.12             | 97.78              | 68.06          
8             | 95.05                | 102.99             | 56.25             | 99.07              | 50.0           
8             | 84.26                | 93.1               | 47.5              | 18.52              | 50.83          
7             | 90.1                 | 97.55              | 51.25             | 88.89              | 63.61          
7             | 80.44                | 90.2               | 75.0              | 91.48              | 39.72          
8             | 97.16                | 103.71             | 72.5              | 93.52              | 63.33          
7             | 91.28                | 83.53              | 81.25             | 99.81              | 92.22          
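
Conceptually, dropping incomplete rows amounts to keeping only the rows in which every cell is a `Just`. The following sketch shows that idea in plain Haskell under a simplifying assumption: every cell is a `Maybe Double`, whereas the real frame also has an `Int` column. The `Row` type and `dropIncomplete` are hypothetical names and say nothing about how `filterAllJust` is implemented.

```haskell
import Data.Maybe (mapMaybe)

-- A row of optional cells (simplified: all cells are Doubles).
type Row = [Maybe Double]

-- `sequence` turns a row into Just the unwrapped cells only when
-- every cell is present; `mapMaybe` then discards the Nothing rows.
dropIncomplete :: [Row] -> [[Double]]
dropIncomplete = mapMaybe sequence

rows :: [Row]
rows =
  [ [Just 57.14, Just 34.09]
  , [Just 83.7, Nothing]      -- dropped: one cell missing
  , [Nothing, Nothing]        -- dropped: all cells missing
  , [Just 91.32, Just 93.64]
  ]

main :: IO ()
main = mapM_ print (dropIncomplete rows)
```

Note that the surviving rows no longer need the `Maybe` wrapper, which matches the column types changing from `Maybe Double` to `Double` in the output above.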


One handy function for working with missing values is the filling function, `impute`. It takes a reference to the column and a value to fill in, and the resulting column is no longer wrapped in `Maybe`.

In [7]:
import qualified DataFrame.Functions as F
df |> D.impute (F.col @(Maybe Double) "Assignment") 0 |> D.take 10

--------------------------------------------------------------------------------------------------------------------------------------------
Prefix<br>Int | Assignment<br>Double | Tutorial<br>Maybe Double | Midterm<br>Maybe Double | TakeHome<br>Maybe Double | Final<br>Maybe Double
--------------|----------------------|--------------------------|-------------------------|--------------------------|----------------------
5             | 57.14                | Just 34.09               | Just 64.38              | Just 51.48               | Just 52.5            
8             | 95.05                | Just 105.49              | Just 67.5               | Just 99.07               | Just 68.33           
8             | 83.7                 | Just 83.17               | Nothing                 | Just 63.15               | Just 48.89           
7             | 0.0                  | Nothing                  | Just 49.38              | Just 105.93              | Just 80.56           
8             | 91.32                | Just 93.64               | Just 95.0               | Just 107.41              | Just 73.89           
7             | 95.0                 | Just 92.58               | Just 93.12              | Just 97.78               | Just 68.06           
8             | 95.05                | Just 102.99              | Just 56.25              | Just 99.07               | Just 50.0            
7             | 72.85                | Just 86.85               | Just 60.0               | Nothing                  | Just 56.11           
8             | 84.26                | Just 93.1                | Just 47.5               | Just 18.52               | Just 50.83           
7             | 90.1                 | Just 97.55               | Just 51.25              | Just 88.89               | Just 63.61           


The first row with Prefix 7, which was dropped by `filterAllJust`, is back, and this time it has 0.0 for Assignment.
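
The idea behind filling can be sketched on a plain list with `fromMaybe` from `Data.Maybe`. This is not the library's implementation; `fillWith` and `meanOfPresent` are hypothetical helpers. The second helper is included because the fill value is a modelling choice: a constant such as 0 pulls averages down, so the mean of the observed values is a common alternative.

```haskell
import Data.Maybe (catMaybes, fromMaybe)

-- Replace every missing cell with a default and drop the Maybe wrapper.
fillWith :: a -> [Maybe a] -> [a]
fillWith def = map (fromMaybe def)

-- Mean of the values that exist; Nothing when no value is present.
meanOfPresent :: [Maybe Double] -> Maybe Double
meanOfPresent xs = case catMaybes xs of
  [] -> Nothing
  ys -> Just (sum ys / fromIntegral (length ys))

main :: IO ()
main = do
  let assignment = [Just 57.14, Just 95.05, Just 83.7, Nothing]
  -- Constant fill, as in the cell above.
  print (fillWith 0 assignment)
  -- Mean fill, falling back to 0 if the column is entirely missing.
  print (fillWith (fromMaybe 0 (meanOfPresent assignment)) assignment)
```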