# Missing Values

Missing values are common in data cleaning, and they can arise for any number of reasons. For instance, if you are running a survey and a respondent didn't answer a question, the missing value is actually an omission.

This kind of missing data is called **Missing at Random (MAR)** when its absence is related to other variables in the dataset. There is no physical or logical reason why the value could not exist, and you could probably use those other variables to predict it.

If there is no relationship to other variables, then we call this data **Missing Completely at Random (MCAR)**.

These are just two examples of missing data, and there are many more. For instance, data might be missing because it wasn't collected, either because the process responsible for collecting it (such as a researcher) skipped it, or because it wouldn't make sense to collect it. This last case is extremely common when you start joining DataFrames together from multiple sources, such as joining a list of people at a university with a list of offices in the university (students generally don't have offices).
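As a rough illustration of the join case, here is a minimal pandas sketch. The `people` and `offices` tables and their contents are hypothetical, invented only to show how a left join introduces missing values.

```python
import pandas as pd

# Hypothetical tables: everyone at the university, and office assignments for staff only.
people = pd.DataFrame({"name": ["Ada", "Ben", "Cy"],
                       "role": ["staff", "student", "student"]})
offices = pd.DataFrame({"name": ["Ada"], "office": ["B12"]})

# A left join keeps every person. Students have no matching office row,
# so their office value comes back as missing.
joined = people.merge(offices, on="name", how="left")
missing_offices = int(joined["office"].isna().sum())
print(joined)
print(missing_offices)
```

Here the missing offices are not a data-quality problem: the values were never supposed to exist.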

Let's look at some ways of handling missing data.

In [1]:
import qualified DataFrame as D

Python has a number of ways of representing missingness (`nan`, `None`, etc.), and you typically check whether data is missing by counting the occurrences of the values that represent missingness.
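For comparison, a small pandas sketch of that counting approach might look like the following. The `grades` frame is made-up illustrative data, not the class grades file used in this notebook.

```python
import numpy as np
import pandas as pd

# Illustrative data: pandas treats both NaN and None as missing.
grades = pd.DataFrame({"midterm": [64.38, np.nan, 49.38],
                       "final": [52.5, 48.89, None]})

# isna() marks each missing cell as True; summing counts them per column.
missing_per_column = grades.isna().sum()
print(missing_per_column)
```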

In Haskell, when we detect that a column has missing values, we wrap its values in an optional type called `Maybe`: values that exist become `Just x` and missing values become `Nothing`.


In [7]:
df <- D.readCsv "./datasets/class_grades.csv"

D.take 10 df

--------------------------------------------------------------------------------------------------------------------------------------------------
Prefix<br>Int | Assignment<br>Maybe Double | Tutorial<br>Maybe Double | Midterm<br>Maybe Double | TakeHome<br>Maybe Double | Final<br>Maybe Double
--------------|----------------------------|--------------------------|-------------------------|--------------------------|----------------------
5             | Just 57.14                 | Just 34.09               | Just 64.38              | Just 51.48               | Just 52.5            
8             | Just 95.05                 | Just 105.49              | Just 67.5               | Just 99.07               | Just 68.33           
8             | Just 83.7                  | Just 83.17               | Nothing                 | Just 63.15               | Just 48.89           
7             | Nothing                    | Nothing                  | Just 49.38              | Just 105.93              | Just 80.56           
8             | Just 91.32                 | Just 93.64               | Just 95.0               | Just 107.41              | Just 73.89           
7             | Just 95.0                  | Just 92.58               | Just 93.12              | Just 97.78               | Just 68.06           
8             | Just 95.05                 | Just 102.99              | Just 56.25              | Just 99.07               | Just 50.0            
7             | Just 72.85                 | Just 86.85               | Just 60.0               | Nothing                  | Just 56.11           
8             | Just 84.26                 | Just 93.1                | Just 47.5               | Just 18.52               | Just 50.83           
7             | Just 90.1                  | Just 97.55               | Just 51.25              | Just 88.89               | Just 63.61           


We can see how many missing values are in the dataset by using `describeColumns`.

In [8]:
D.describeColumns df

------------------------------------------------------------------------------------
Column Name<br>Text | # Non-null Values<br>Int | # Null Values<br>Int | Type<br>Text
--------------------|--------------------------|----------------------|-------------
Assignment          | 92                       | 7                    | Maybe Double
TakeHome            | 94                       | 5                    | Maybe Double
Final               | 96                       | 3                    | Maybe Double
Midterm             | 96                       | 3                    | Maybe Double
Tutorial            | 96                       | 3                    | Maybe Double
Prefix              | 99                       | 0                    | Int         


We can also drop rows that have any missing data.

In [9]:
import DataFrame ((|>))

df |> D.filterAllJust
   |> D.take 10

--------------------------------------------------------------------------------------------------------------------
Prefix<br>Int | Assignment<br>Double | Tutorial<br>Double | Midterm<br>Double | TakeHome<br>Double | Final<br>Double
--------------|----------------------|--------------------|-------------------|--------------------|----------------
5             | 57.14                | 34.09              | 64.38             | 51.48              | 52.5           
8             | 95.05                | 105.49             | 67.5              | 99.07              | 68.33          
8             | 91.32                | 93.64              | 95.0              | 107.41             | 73.89          
7             | 95.0                 | 92.58              | 93.12             | 97.78              | 68.06          
8             | 95.05                | 102.99             | 56.25             | 99.07              | 50.0           
8             | 84.26                | 93.1               | 47.5              | 18.52              | 50.83          
7             | 90.1                 | 97.55              | 51.25             | 88.89              | 63.61          
7             | 80.44                | 90.2               | 75.0              | 91.48              | 39.72          
8             | 97.16                | 103.71             | 72.5              | 93.52              | 63.33          
7             | 91.28                | 83.53              | 81.25             | 99.81              | 92.22          


One of the handy functions we have for working with missing values is the filling function, `impute`. This function takes a reference to the column and a value to fill in.

In [None]:
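import qualified DataFrame.Functions as F  -- assumed source of F.col
df |> D.impute (F.col @Double "Assignment") 0  -- 0 is an illustrative fill value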

In [None]:
# Note that the inplace parameter of fillna() causes pandas to fill the values in place: it does not return a
# copy of the dataframe, but instead modifies the dataframe you have.
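The difference between the two modes can be sketched as follows, assuming pandas' `fillna` and a small made-up `scores` frame (the names and the fill value 0 are illustrative only).

```python
import numpy as np
import pandas as pd

scores = pd.DataFrame({"quiz": [1.0, np.nan, 3.0]})

# Default behaviour: fillna returns a new DataFrame and leaves the original untouched.
filled_copy = scores.fillna(0)
still_missing = int(scores["quiz"].isna().sum())  # the original still has its NaN

# With inplace=True the original DataFrame itself is modified.
scores.fillna(0, inplace=True)
now_missing = int(scores["quiz"].isna().sum())
print(still_missing, now_missing)
```

Reassigning the result (`scores = scores.fillna(0)`) is often easier to follow than `inplace=True`, since it makes the mutation explicit.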

In [None]:
# We can also use the na_filter option of read_csv() to turn off white space filtering, if white space is an
# actual value of interest. But in practice, this is pretty rare. In data without any NAs, passing
# na_filter=False can improve the performance of reading a large file.
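A minimal sketch of what `na_filter` changes, assuming a tiny in-memory CSV rather than a real file (the column names and values are invented for illustration):

```python
import io
import pandas as pd

csv_text = "name,score\nann,10\nbob,\ncat,NA\n"

# Default: empty fields and markers such as "NA" are detected and loaded as missing.
default = pd.read_csv(io.StringIO(csv_text))
n_missing_default = int(default["score"].isna().sum())

# na_filter=False: no missing-value detection, so the raw strings are kept as they are.
raw = pd.read_csv(io.StringIO(csv_text), na_filter=False)
n_missing_raw = int(raw["score"].isna().sum())
print(n_missing_default, n_missing_raw)
```

Note the trade-off: without NA detection the `score` column can no longer be parsed as numeric, because it still contains the strings `""` and `"NA"`.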

# In addition to rules controlling how missing values might be loaded, it's sometimes useful to consider
# missing values as actually having information. I'll give an example from my own research. I often deal with
# logs from online learning systems. I've looked at video use in lecture capture systems. In these systems
# it's common for the player to have a heartbeat functionality where playback statistics are sent to the
# server every so often, maybe every 30 seconds. These heartbeats can get big as they can carry the whole
# state of the playback system, such as where the video play head is, what the video size is, which video
# is being rendered to the screen, and how loud the volume is.

# If we load the data file log.csv, we can see an example of what this might look like.
import pandas as pd

df = pd.read_csv("datasets/log.csv")
df.head(20)

In [None]:
# In this data the first column is a timestamp in the Unix epoch format. The next column is the user name
# followed by a web page they're visiting and the video that they're playing. Each row of the DataFrame has a
# playback position. And we can see that as the playback position increases by one, the time stamp increases
# by about 30 seconds.

# Except for user Bob. It turns out that Bob has paused his playback, so as time increases the playback
# position doesn't change. Note too how difficult it is for us to try and derive this knowledge from the data,
# because it's not sorted by time stamp as one might expect. This is actually not uncommon on systems which
# have a high degree of parallelism. There are a lot of missing values in the paused and volume columns. It's
# not efficient to send this information across the network if it hasn't changed. So this particular system
# just inserts null values into the database if there are no changes.

In [None]:
# Next up is the method parameter of fillna(). The two common fill methods are ffill and bfill. ffill is
# forward filling: it updates an NA value in a particular cell with the value from the previous row. bfill is
# backward filling, the opposite of ffill: it fills a missing value with the next valid value.
# It's important to note that your data needs to be sorted in order for this to have the effect you might
# want. Data which comes from traditional database management systems usually has no order guarantee, just
# like this data. So be careful.

# In Pandas we can sort either by index or by values. Here we'll just promote the time stamp to an index then
# sort on the index.
df = df.set_index('time')
df = df.sort_index()
df.head(20)

In [None]:
# If we look closely at the output, though, we'll notice that the index isn't really unique: two users
# can use the system at the same time. Again, this is a very common case. To deal with it, let's reset
# the index and use multi-level indexing on time AND user together, promoting the user name to a second
# level of the index.

df = df.reset_index()
df = df.set_index(['time', 'user'])
df
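One way to check the uniqueness problem directly, rather than by eye, is the `is_unique` property of the index. The sketch below uses a small hypothetical `events` frame in place of `log.csv`.

```python
import pandas as pd

# Hypothetical heartbeat events: two users report at the same time stamp.
events = pd.DataFrame({"time": [100, 100, 130],
                       "user": ["cheryl", "bob", "cheryl"],
                       "position": [5, 24, 6]})

# Indexing on time alone gives duplicate index entries.
by_time = events.set_index("time")
time_is_unique = by_time.index.is_unique

# Indexing on (time, user) together identifies each row uniquely in this data.
by_time_user = events.set_index(["time", "user"]).sort_index()
pair_is_unique = by_time_user.index.is_unique
print(time_is_unique, pair_is_unique)
```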

In [None]:
# Now that we have the data indexed and sorted appropriately, we can fill the missing data using ffill. It's
# good to remember that when dealing with missing values you can work on individual columns or sets of
# columns by projecting them, so you don't have to fix all missing values in one command.

# ffill() is the method form of fillna(method='ffill'), which is deprecated as of pandas 2.1.
df = df.ffill()
df.head()
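To make the difference between the two fill directions concrete, here is a small sketch on a hypothetical, already sorted `log` frame. It also shows filling a single column by selecting it.

```python
import numpy as np
import pandas as pd

# Hypothetical log, sorted by time; volume is only recorded when it changes.
log = pd.DataFrame({"time": [1, 2, 3, 4],
                    "volume": [np.nan, 10.0, np.nan, 5.0]})

forward = log["volume"].ffill()   # carry the previous valid value forward
backward = log["volume"].bfill()  # pull the next valid value backward

# Fill one column only by selecting it; the other columns are untouched.
log["volume"] = log["volume"].ffill()
print(forward.tolist())
print(backward.tolist())
```

Forward filling cannot fill the first row, because there is no earlier value to carry forward, so a leading missing value stays missing.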

In [None]:
# We can also do customized fill-in to replace values with the replace() function. It allows replacement
# using several approaches: value-to-value, list, dictionary, and regex. Let's generate a simple example.
df = pd.DataFrame({'A': [1, 1, 2, 3, 4],
                   'B': [3, 6, 3, 8, 9],
                   'C': ['a', 'b', 'c', 'd', 'e']})
df

In [None]:
# We can replace 1's with 100. Let's try the value-to-value approach.
df.replace(1, 100)

In [None]:
# How about changing two values? Let's try the list approach. For example, we want to change 1's to 100 and
# 3's to 300.
df.replace([1, 3], [100, 300])
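The dictionary approach mentioned earlier is not shown above, so here is a brief sketch of it. The frame `df_small` is a copy of the numeric columns from the example, renamed so it does not clobber `df`.

```python
import pandas as pd

df_small = pd.DataFrame({'A': [1, 1, 2, 3, 4],
                         'B': [3, 6, 3, 8, 9]})

# A flat dictionary maps old values to new values across the whole DataFrame.
replaced_everywhere = df_small.replace({1: 100, 3: 300})

# A nested dictionary restricts the replacement to the named column.
replaced_in_a = df_small.replace({'A': {1: 100}})
print(replaced_everywhere)
print(replaced_in_a)
```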

In [None]:
# What's really cool about pandas replacement is that it supports regex too!
# Let's look at our data from the dataset logs again
df = pd.read_csv("datasets/log.csv")
df.head(20)

In [None]:
# To replace using a regex we make the first parameter to replace the regex pattern we want to match, the
# second parameter the value we want to emit upon match, and then we pass in a third parameter "regex=True".

# Take a moment to think about this problem: imagine we want to detect all html pages in the "video" column,
# let's say that just means they end with ".html", and we want to overwrite that with the keyword "webpage".
# How could we accomplish this?

In [None]:
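# Here's my solution: match any number of characters, then a literal ".html" at the end of the string.
# The dot is escaped because an unescaped "." in a regex matches any character.
df.replace(to_replace=r".*\.html$", value="webpage", regex=True)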
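In a regular expression an unescaped dot matches any character, so a pattern meant to match a literal `.html` suffix behaves differently depending on whether the dot is escaped. The sketch below compares the two on a made-up `videos` series; the value `notes_xhtml` is a contrived example chosen to expose the difference.

```python
import pandas as pd

videos = pd.Series(["intro.html", "lecture.mp4", "notes_xhtml"])

# Unescaped dot: "x" in "notes_xhtml" satisfies the ".", so it is replaced too.
loose = videos.replace(to_replace=".*.html$", value="webpage", regex=True)

# Escaped dot: only strings that really end in ".html" are replaced.
strict = videos.replace(to_replace=r".*\.html$", value="webpage", regex=True)
print(loose.tolist())
print(strict.tolist())
```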

One last note on missing values. When you use statistical functions on DataFrames, these functions typically ignore missing values. For instance, if you try to calculate the mean value of a DataFrame, the underlying NumPy function will ignore missing values. This is usually what you want, but you should be aware that values are being excluded. Why you have missing values really matters, depending upon the problem you are trying to solve. It might be unreasonable to infer missing values, for instance, if the data shouldn't exist in the first place.
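As a small pandas illustration of that exclusion, the sketch below computes a mean with and without skipping missing values. The `finals` series is made-up data, and `skipna` is the pandas parameter that controls this behaviour.

```python
import numpy as np
import pandas as pd

finals = pd.Series([50.0, np.nan, 70.0])

mean_skipping = finals.mean()            # averages only the two values that are present
mean_strict = finals.mean(skipna=False)  # propagates the missing value instead
n_used = int(finals.count())             # how many values actually contributed
print(mean_skipping, mean_strict, n_used)
```

Comparing `count()` with `len()` is a quick way to see how many values a statistic silently left out.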