Welcome to Day 2 of the 5-Day Data Challenge! Today, we're going to be looking at how to handle missing values in R. Specifically, we're going to:

* Determine if your data is missing at random
* See what data is missing
* Guess (impute) the values that are missing

I'll start by introducing each concept or technique, and then you'll get a chance to apply it with an exercise (look for the **Your turn!** section). Ready? Let's get started!

___

**Kernel FAQs:**

* **How do I get started?** To get started, click the blue "Fork Notebook" button in the upper right-hand corner. This will create a private copy of this notebook that you can edit and play with. Once you're finished with the exercises, you can choose to make your notebook public to share with others. :)

* **How do I run the code in this notebook?** Once you fork the notebook, it will open in the notebook editor. From there you can write code in any code cell (the ones with the grey background) and run the code by either 1) clicking in the code cell and then hitting CTRL + ENTER or 2) clicking in the code cell and then clicking on the white "play" arrow to the left of the cell. If you want to run all the code in your notebook, you can use the double "fast forward" arrows at the bottom of the notebook editor.

* **How do I save my work?** Any changes you make are saved automatically as you work. You can run all the code in your notebook and save a static version by hitting the blue "Commit & Run" button in the upper right hand corner of the editor. 

* **How can I find my notebook again later?** The easiest way is to go to your user profile (https://www.kaggle.com/replace-this-with-your-username), then click on the "Kernels" tab. All of your kernels will be under the "Your Work" tab, and all the kernels you've upvoted will be under the "Favorites" tab.

___

# Get our environment set up

___

First, let's get our environment set up with all the packages and data we'll need. Make sure to run this cell in your own notebook! :)

In [21]:
# read in libraries we'll use
library(tidyverse) # handy utility functions
library(mice) # package for categorical & numeric imputation

# set seed for reproducibility 
set.seed(5)

# read in our data
punjab <- read_csv("../input/all-census-data/gdp_Punjab2.csv")
pet_data <- read_csv("../input/austin-animal-center-shelter-outcomes-and/aac_shelter_outcomes.csv")

# separate GDP & growth data 
punjab_gdp <- punjab %>%
    filter(Description == 'GDP (in Rs. Cr.)')
punjab_growth <- punjab %>%
    filter(Description == 'Growth Rate % (YoY)')

# Is your data MAR (Missing At Random)?
 
___
 

The first thing to consider when you're looking at missing data is *why* it's missing. To show why this is so important, let's take a look at this dataset of information about what happened to some animals at a shelter in Austin. 

In [16]:
# print the first few rows of the pet_data dataset
head(pet_data)

As you can see, we have some missing values in the "outcome_subtype" and "name" columns. In this instance, it wouldn't make sense to do anything with these particular missing values. For example, the name of the cat in the first row is missing. It's probably missing because it doesn't *have* a name, maybe because it's just two weeks old. The fact that the value is missing tells us something about the value. It's unlikely that it's just missing at random because someone forgot to write it down or something.

Let's look at another dataset. This file has information on the GDP (gross domestic product) of different cities in Punjab. We can see that it also has some missing values.

In [24]:
# print the first few rows of the punjab_gdp dataset
head(punjab_gdp)

In this case, these values are probably missing just because they weren't recorded for some years. It seems likely that we could make a pretty good guess about what they should be based on the rest of the data. They're (probably) not missing because of whatever the underlying value is. In this case, we can say that our data is *missing at random*.

There's no statistical way to determine if your data is missing at random or not. This is one of those cases where the solution is just to spend some time getting to know your data and how it was generated. (If you still find the distinction confusing, you're not alone! You might like [this paper](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4121561/). It's written as a little play between a medical researcher and a statistician. It's very readable.)
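
One way to get to know your data is to check whether missingness lines up with some other column. The sketch below uses a tiny made-up data frame (the `pets` values and the 8-week cutoff are assumptions for illustration, not taken from the Austin dataset) and compares how often `name` is missing for young and older animals. A check like this can't prove anything about why data is missing, but a big gap between groups is a hint that the values aren't simply missing at random.

```r
# Made-up shelter-style data (hypothetical values, not the Austin dataset)
pets <- data.frame(
  age_weeks = c(2, 3, 2, 60, 104, 80, 4, 150),
  name = c(NA, NA, "Bean", "Rex", "Luna", NA, NA, "Milo"),
  stringsAsFactors = FALSE
)

# Flag young animals (the 8-week cutoff is an arbitrary choice for this sketch)
pets$is_young <- pets$age_weeks < 8

# Share of missing names within each group (FALSE = older, TRUE = young)
missing_rate <- tapply(is.na(pets$name), pets$is_young, mean)
print(missing_rate)
```

In this toy data, names are missing for three of the four young animals but only one of the four older ones, which fits the idea that very young animals often haven't been named yet.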


## Your turn!
___

Take a look at the `punjab_growth` dataset and decide if you think any values that are missing are missing at random. If you're looking for more information or practice with figuring out why data might be missing, I talk more about different types of missing data [in this notebook](https://www.kaggle.com/rtatman/data-cleaning-challenge-handling-missing-values). 

In [25]:
# your code here :)
punjab_gro <- read_csv("../input/all-census-data/gdp_Punjab1.csv")

punjab_grow <- punjab_gro %>%
    filter(Description == 'GDP (in Rs. Cr.)')



# Visualize your missing data
___

So far we've just looked at a couple of rows of our data. While this works with pretty small datasets, it doesn't scale well and it can be hard to pick out `NA`'s with your eyes. I personally find it helpful to visualize missing values if I suspect they might exist. This can help you see how much data is missing, and also whether some particular columns or rows are very likely to have missing values. First, we need to do a little bit of data munging to get our data in the shape we want it in. (If you're not familiar with the `%>%` symbol or some of these functions, I talk more about these techniques [in this notebook](https://www.kaggle.com/rtatman/manipulating-data-with-the-tidyverse/).)

In [27]:
# create a data frame with information on whether the value in each cell is missing
missing_by_column <- punjab_gdp %>% 
    is.na %>% # check if each cell is na
    as_tibble %>% # convert to a data frame (as_data_frame is deprecated in newer versions of tibble)
    mutate(row_number = 1:nrow(.)) %>% # add a column with the row number
    gather(variable, is_missing, -row_number) # turn wide data into narrow data 
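
To make the reshaping step easier to follow, here is the same pipeline on a tiny made-up data frame (the `toy` data and its column names are hypothetical). It's only a sketch of what each step produces: a logical matrix from `is.na`, then one row per original cell after `gather()`.

```r
library(dplyr)  # part of the tidyverse; provides %>%, tibble(), as_tibble(), mutate()
library(tidyr)  # part of the tidyverse; provides gather()

# Hypothetical data: 3 rows, 2 columns, 3 missing cells
toy <- tibble(city_a = c(1, NA, 3),
              city_b = c(NA, NA, 6))

toy_missing <- toy %>%
    is.na %>%                                  # TRUE wherever a cell is NA
    as_tibble %>%                              # logical matrix -> data frame
    mutate(row_number = 1:nrow(.)) %>%         # remember which row each cell came from
    gather(variable, is_missing, -row_number)  # one row per original cell

print(toy_missing)
```

The result has one row for each of the six cells in `toy`, with `variable` holding the original column name and `is_missing` saying whether that cell was `NA`.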

And now we can plot it and visually see where our data is missing! (With a shoutout to njtierney, whose [excellent color-blind friendly chart aesthetic I'm borrowing here](https://www.r-bloggers.com/ggplot-your-missing-data-2/).) 

In [28]:
# Plot the missing values in our data frame, with a good-looking theme
ggplot(missing_by_column, aes(x = variable, y = row_number, fill = is_missing)) +
    geom_tile() + 
    theme_minimal() +
    scale_fill_grey(name = "",
                    labels = c("Present","Missing")) +
    theme(axis.text.x  = element_text(angle=45, vjust=0.5, size = 8)) + 
    labs(x = "Variables in Dataset",
         y = "Rows / observations")

You can read this chart a little bit like your dataframe. The rows of the chart are the rows of your dataframe, and the columns are the columns. Each cell represents a single cell in your dataframe, and its color tells you if it's missing or not. [There are a bunch of other ways to visualize missing data](https://cran.r-project.org/web/packages/naniar/vignettes/naniar-visualisation.html), but this is the one I tend to prefer because of how parallel it is to the dataframe itself.

Looking at our chart, we can see that three cities--Barnala, Sahibzada Ajit Singh Nagar and Taran Taran--all have missing data and that they overlap in which rows are missing. (Since each row is a year, this tells us that some of the same years are missing from each city). 

We can also see that our dataset is pretty small. One way to handle missing values is to only look at "complete cases", or the rows you don't have any missing values for. In this case, though, that means that we'd lose almost a seventh of our data! In the next section, we'll discuss a technique for filling in those missing values instead of tossing out rows.
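
If you'd like numbers to go with the chart, base R can count missing values per column and tell you how much you'd lose by keeping only complete cases. This is a small sketch on made-up data (the `gdp` data frame and its values are hypothetical, not the Punjab file).

```r
# Hypothetical yearly data where two columns are missing the same year
gdp <- data.frame(
  year   = 2001:2007,
  city_a = c(10, 11, NA, 13, 14, 15, 16),
  city_b = c(20, 21, NA, 23, 24, 25, 26),
  city_c = c(30, 31, 32, 33, 34, 35, 36)
)

# How many values are missing in each column?
missing_per_column <- colSums(is.na(gdp))
print(missing_per_column)

# Which rows have no missing values at all?
is_complete <- complete.cases(gdp)

# Share of rows we'd throw away if we only kept complete cases
share_lost <- mean(!is_complete)
print(share_lost)

# The complete-case version of the data
complete_only <- gdp[is_complete, ]
```

Here one row out of seven has a missing value, so keeping only complete cases would drop a seventh of this toy dataset.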

## Your turn!
___

Try your hand at plotting the missing values from the  `punjab_growth` dataset. Do you notice any patterns? 

In [37]:
head(punjab_growth)
# create a data frame with information on whether the value in each cell is missing
missing_by_column <- punjab_growth %>% 
    is.na %>% # check if each cell is na
    as_tibble %>% # convert to a data frame (as_data_frame is deprecated in newer versions of tibble)
    mutate(row_number = 1:nrow(.)) %>% # add a column with the row number
    gather(variable, is_missing, -row_number) # turn wide data into narrow data 

# Plot the missing values in our data frame, with a good-looking theme
ggplot(missing_by_column, aes(x = variable, y = row_number, fill = is_missing)) +
    geom_tile() + 
    theme_minimal() +
    scale_fill_grey(name = "",
                    labels = c("Present","Missing")) +
    theme(axis.text.x  = element_text(angle=45, vjust=0.5, size = 8)) + 
    labs(x = "Variables in Dataset",
         y = "Rows / observations")

# Guessing what the missing values should be
___
Now that we’ve determined that we have randomly missing values, the next step is to try and figure out what they *should* be. 

> **Imputation**: The fancy math term for guessing what the value of a missing cell in your dataset should actually be. 

To do the imputation we're not going to actually look at each missing value and take our best guess. Instead, we're going to automate the process using MICE. Not the little squeaky rodents, but [Multiple Imputation by Chained Equations](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3074241/), specifically the version implemented in [the mice R package](https://cran.r-project.org/web/packages/mice/mice.pdf). In the words of the package authors, Stef van Buuren and Karin Groothuis-Oudshoorn:  

> **The mice package implements a method to deal with missing data. The package creates multiple imputations (replacement values) for multivariate missing data. The method is based on Fully Conditional Specification, where each incomplete variable is imputed by a separate model. The MICE algorithm can impute mixes of continuous, binary, unordered categorical and ordered categorical data. In addition, MICE can impute continuous two-level data, and maintain consistency between imputations by means of passive imputation.**

Since that's a lot of math-y jargon you may or may not have run into in the past, let's break it down bit by bit so that you can get a better idea of what's going on here.

>**The package creates multiple imputations (replacement values) for multivariate missing data.**
* This basically means that, for every missing value, this function will provide several different guesses for what it could be. This gives us a better idea of the range of possible values that our variables could take. For example, if you're imputing someone's salary, your guesses may range over several thousand units. On the other hand, if you're guessing their height in feet, your guesses will probably range over a couple units. The possible range of values depends on the range of the values we *did* observe. Then, once you've imputed however-many values, you create that many different versions of your dataset and then pool across them, generally by taking the mean or median for numeric variables and the most common class for categorical variables. (If you're curious about the details, you can check out some of the papers in [this bibliography](http://www.stefvanbuuren.nl/mi/MI.html).)

>**The method is based on Fully Conditional Specification, where each incomplete variable is imputed by a separate model.**
* This means that we're going to build a separate model for each variable that has missing values, and make our guesses one variable at a time. The other common choice for imputation is called "Joint Modeling", which means that you make your guesses for two or more variables at the same time. For instance, if you were imputing both salary and height information for someone, with Fully Conditional Specification you'd guess one and then the other. With Joint Modeling, however, you'd generate your guesses for both the salary and the height at the same time. In both cases, the guesses depend on each other. So if you have reason to believe that taller people have higher salaries, you will probably impute a higher salary if you also impute a taller height. In Fully Conditional Specification, you might guess salary first and then height, and you might guess a taller height if you have a larger salary. In Joint Modeling, you'll guess the salary and height at the same time, so they'll both be higher or lower together.

> **The MICE algorithm can impute mixes of continuous, binary, unordered categorical and ordered categorical data. In addition, MICE can impute continuous two-level data, and maintain consistency between imputations by means of passive imputation.**
* This bit just means that this particular function can handle pretty much any kind of data, not just numbers. This can make MICE a handy one-stop-shop for imputing all your missing data. (Assuming it's missing at random, of course.) 
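
To make "multiple imputations" and "a separate model per variable" a bit more concrete, here's a minimal sketch using `nhanes`, a small example dataset that ships with the mice package (it is not one of the datasets used in this notebook). The settings (`m = 5`, `seed = 5`) are just illustrative choices, and averaging the guesses at the end is only one simple way to summarise them.

```r
library(mice)  # also provides the small built-in `nhanes` example dataset

# Create 5 imputed versions of the data; `seed` makes the run repeatable
imp <- mice(nhanes, m = 5, printFlag = FALSE, seed = 5)

# One imputation method per variable ("" means the variable had nothing to impute)
print(imp$method)

# One row per missing bmi cell, one column per imputation (5 guesses each)
bmi_guesses <- imp$imp$bmi
print(bmi_guesses)

# A simple summary of the guesses for each missing cell: their mean
bmi_mean_guess <- rowMeans(bmi_guesses)
print(bmi_mean_guess)
```

Looking across a row of `bmi_guesses` gives you a feel for how much the guesses for a single missing cell vary from one imputation to the next.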

Ok, now that you know the general theory, let's get to the specifics of how to actually use the function. 

There are a lot of different parameters you can tweak when using the `mice()` function. The easiest way to set them, however, is to use the parameters from an imputation run where you don’t actually do any imputation. This creates an empty model (one that doesn’t have any imputed values) with reasonable generic values. Then you can use the parameters from your empty model in a new imputation model that actually runs several times. Finally, you use the `complete()` function to fill in each cell you’re imputing with one of the imputed values (by default, the ones from the first imputation). This gives you a copy of your original dataset with the `NA` values replaced with imputed values.

In [31]:
# initialize an empty model to take the parameters from
empty_model <- mice(punjab_gdp, maxit=0) 
method <- empty_model$method
predictorMatrix <- empty_model$predictorMatrix

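# first make a bunch of guesses (arguments are named so they can't be matched to the wrong parameter)...
imputed_data <- mice(punjab_gdp, method = method, predictorMatrix = predictorMatrix, m = 5)
# then fill in the missing cells (complete() uses the first imputed dataset by default)
imputed_data <- complete(imputed_data)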
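
If you're curious about the other imputed versions, `complete()` can return those too. The sketch below again uses the `nhanes` example dataset that ships with mice rather than the Punjab data, and the object names are just for illustration. It calls `mice::complete()` with the package prefix because tidyr also has a function called `complete()`.

```r
library(mice)

# Five imputed versions of the built-in example data
imp <- mice(nhanes, m = 5, printFlag = FALSE, seed = 5)

# By default you get the data filled in with the first set of imputed values
first_version <- mice::complete(imp)

# Ask for a different imputed version by number
third_version <- mice::complete(imp, 3)

# Or stack all versions on top of each other;
# the .imp column says which imputation each row came from
all_versions <- mice::complete(imp, "long")
print(head(all_versions))
```

Comparing versions side by side is a quick way to see how much your results might depend on which set of guesses you happened to pick.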

Let's take a look at our data to make sure it looks reasonable. 

In [32]:
# take a look at our dataset with the values imputed
head(imputed_data)

Looks good to me! The final step is to do a quick graph of our imputed data to make sure that we actually did get rid of all the `NA` values.

In [33]:
# create a data frame with information on whether the value in each cell is missing
missing_by_column <- imputed_data %>% 
    is.na %>% # check if each cell is na
    as_tibble %>% # convert to a data frame (as_data_frame is deprecated in newer versions of tibble)
    mutate(row_number = 1:nrow(.)) %>% # add a column with the row number
    gather(variable, is_missing, -row_number) # turn wide data into narrow data 

# Plot the missing values in our data frame, with a good-looking theme
ggplot(missing_by_column, aes(x = variable, y = row_number, fill = is_missing)) +
    geom_tile() + 
    theme_minimal() +
    scale_fill_grey(name = "",
                    labels = c("Present","Missing")) +
    theme(axis.text.x  = element_text(angle=45, vjust=0.5, size = 8)) + 
    labs(x = "Variables in Dataset",
         y = "Rows / observations")
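
A plot is handy, but you can also check for leftover `NA` values with a couple of base R calls. Here's a small sketch on made-up "before" and "after" data frames (both are hypothetical stand-ins, not the output of the imputation above).

```r
# Hypothetical data before and after filling in missing values
before <- data.frame(x = c(1, NA, 3), y = c("a", "b", NA))
after  <- data.frame(x = c(1, 2, 3),  y = c("a", "b", "c"))

# anyNA() is TRUE if at least one value is missing
has_na_before <- anyNA(before)
has_na_after  <- anyNA(after)
print(c(before = has_na_before, after = has_na_after))

# Count of missing values left in each column
remaining <- colSums(is.na(after))
print(remaining)
```

In your own notebook you could run the same two checks on `imputed_data`.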

And all the `NA` values have been successfully imputed! Now try it for yourself. 

## Your turn!
____

Try imputing the missing values in the `punjab_growth` dataset, then plot the result to check that no `NA` values are left. For an extra challenge, you can spend some time investigating the various visualizations that are available for looking at the different imputed versions of the dataset. [This kernel has a nice introduction to some of them.](https://www.kaggle.com/captcalculator/imputing-missing-data-with-the-mice-package-in-r)

In [None]:
# your code here :)
# initialize an empty model to take the parameters from
empty_model <- mice(punjab_growth, maxit=0) 
method <- empty_model$method
predictorMatrix <- empty_model$predictorMatrix

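# first make a bunch of guesses (arguments are named so they can't be matched to the wrong parameter)...
imputed_data <- mice(punjab_growth, method = method, predictorMatrix = predictorMatrix, m = 5)
# then fill in the missing cells (complete() uses the first imputed dataset by default)
imputed_data <- complete(imputed_data)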

# create a data frame with information on whether the value in each cell is missing
missing_by_column <- imputed_data %>% 
    is.na %>% # check if each cell is na
    as_tibble %>% # convert to a data frame (as_data_frame is deprecated in newer versions of tibble)
    mutate(row_number = 1:nrow(.)) %>% # add a column with the row number
    gather(variable, is_missing, -row_number) # turn wide data into narrow data 

# Plot the missing values in our data frame, with a good-looking theme
ggplot(missing_by_column, aes(x = variable, y = row_number, fill = is_missing)) +
    geom_tile() + 
    theme_minimal() +
    scale_fill_grey(name = "",
                    labels = c("Present","Missing")) +
    theme(axis.text.x  = element_text(angle=45, vjust=0.5, size = 8)) + 
    labs(x = "Variables in Dataset",
         y = "Rows / observations")

# And that's it for Day 2!
___

And that's it for today! If you have any questions, be sure to post them in the comments below or [on the forums](https://www.kaggle.com/questions-and-answers).

Remember that your notebook is private by default, and in order to share it with other people or ask for help with it, you'll need to make it public. First, you'll need to save a version of your notebook that shows your current work by hitting the "Commit & Run" button. (Your work is saved automatically, but versioning your work lets you go back and look at what it was like at the point you saved it. It also lets you share a nice compiled notebook instead of just the raw code.) Then, once your notebook is finished running, you can go to the Settings tab in the panel to the left (you may have to expand it by hitting the [<] button next to the "Commit & Run" button) and setting the "Visibility" dropdown to "Public".