vignettes/rl.Rmd

---
title: "Checking ranges and legal values"
author: "Mark Myatt"
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  warning = FALSE,
  error = FALSE,
  message = FALSE,
  collapse = TRUE,
  comment = "#>"
)
```

```{r setup, echo = FALSE}
library(nipnTK)
```

Checking that data are within an acceptable or plausible range is an important basic check to apply to quantitative data. Checking that data are recorded with appropriate legal values or codes is an important basic check to apply to categorical data.

## Checking quantitative data

We will use the dataset `rl.ex01` that is included in the **nipnTK** package.

```{r, echo = TRUE, eval = FALSE}
svy <- rl.ex01
head(svy)
```

```{r, echo = FALSE, eval = TRUE}
svy <- rl.ex01
head(svy)
```

The `rl.ex01` dataset contains anthropometry data from a SMART survey from Angola.

We can use the `summary()` function to examine range (and other summary statistics) of a quantitative variable:

```{r, echo = TRUE, eval = FALSE}
summary(svy$muac)
```

This returns:

```{r, echo = FALSE, eval = TRUE}
summary(svy$muac)
```

A graphical examination can also be made:

```{r, echo = TRUE, eval = FALSE}
boxplot(svy$muac, horizontal = TRUE, xlab = "MUAC (mm)", frame.plot = FALSE)
```

```{r, echo = FALSE, eval = TRUE}
boxplot(svy$muac, horizontal = TRUE, xlab = "MUAC (mm)", frame.plot = FALSE)
```

The "whiskers" on the boxplot extend to 1.5 times the interquartile range from the ends of the box (i.e., the lower and upper quartiles). This is known as the *inner fence*. Data points that are outside the inner fence are considered to be *mild outliers*. The NiPN data quality toolkit provides an R language function `outliersUV()` that uses the same method to identify outliers:

```{r, echo = TRUE, eval = FALSE}
svy[outliersUV(svy$muac), ]
```

This returns:

```{r, echo = FALSE, eval = TRUE}
svy[outliersUV(svy$muac), ]
```

We can count the number of outliers or use:

```{r, echo = TRUE, eval = FALSE}
table(outliersUV(svy$muac))
```

This returns:

```{r, echo = FALSE, eval = TRUE}
table(outliersUV(svy$muac))
```

We can express this as a proportion:

```{r, echo = TRUE, eval = FALSE}
prop.table(table(outliersUV(svy$muac)))
```

This returns:

```{r, echo = FALSE, eval = TRUE}
prop.table(table(outliersUV(svy$muac)))
```

You may find it easier to use percentages:

```{r, echo = TRUE, eval = FALSE}
prop.table(table(outliersUV(svy$muac))) * 100
```

This returns:

```{r, echo = FALSE, eval = TRUE}
prop.table(table(outliersUV(svy$muac))) * 100
```

Some of the **muac** values identified as potential outliers are possible **muac** values:

```{r, echo = FALSE, eval = TRUE}
svy[outliersUV(svy$muac), ]
```

The `outliersUV()` function provides a **fence** parameter which alters the threshold at which a data point is considered to be an outlier.

The default **fence = 1.5** defines the *inner fence* (i.e **1.5** times the interquartile range below the lower quartile and above the upper quartile). This will identify *mild* and *severe* outliers.

The value **fence = 3** defines the *outer fence* (i.e **3** times the interquartile range below the lower quartile and above the upper quartile). This will identify *severe* outliers only:

```{r, echo = TRUE, eval = FALSE}
svy[outliersUV(svy$muac, fence = 3), ]
```

This returns:

```{r, echo = FALSE, eval = TRUE}
svy[outliersUV(svy$muac, fence = 3), ]
```

There is something wrong with all of these values of **muac**.

The intention was that the **muac** variable records mid-upper-arm-circumference (MUAC) in mm. There are some impossibly small (i.e. **11.1**, **12.4**, and **13.2**) and impossibly large values (i.e. **999.0**).

The three impossibly small values are probably due to data being recorded in cm rather than mm. It is probably safe to change these three values to 111, 124 and 132. It is easiest to do this each record separately:

```{r, echo = TRUE, eval = TRUE}
svy$muac[svy$muac == 11.1] <- 111
```

An alternative approach is to specify row numbers instead of values:

```{r, echo = TRUE, eval = TRUE}
svy$muac[381] <- 124
svy$muac[594] <- 132
```

The three **999.0** values are missing values coded as 999.0. It is safe to set these three values to missing using the special NA value:

```{r, echo = TRUE, eval = TRUE}
svy$muac[svy$muac == 999.00] <- NA
```

Range checks should be repeated after editing the data to ensure that the problems have been fixed:

```{r, echo = TRUE, eval = FALSE}
summary(svy$muac)
svy[outliersUV(svy$muac), ]
svy[outliersUV(svy$muac, fence = 3), ]
```

Following is a boxplot of the **muac** variable made using:

```{r, echo = TRUE, eval = FALSE}
boxplot(svy$muac, horizontal = TRUE, xlab = "MUAC (mm)", frame.plot = FALSE)
```

after the fixes for incorrectly entered data and missing values were made.

```{r, echo = FALSE, eval = TRUE}
boxplot(svy$muac, horizontal = TRUE, xlab = "MUAC (mm)", frame.plot = FALSE)
```

There should now be no severe outliers:

```{r, echo = TRUE, eval = FALSE}
prop.table(table(outliersUV(svy$muac, fence = 3))) * 100
```

returns:

```{r, echo = FALSE, eval = TRUE}
prop.table(table(outliersUV(svy$muac, fence = 3))) * 100
```

It is usually better to identify and edit only the most extreme *univariate* outliers, as we have done here, and use the scatterplot and statistical distance methods described elsewhere in this toolkit to identify other potential outliers.

## Editing data

We have edited records with outliers at the *R* command line.

It is a good idea to edit data at the command line or using a script containing the required commands.

A script provides a record of changes made to the data.

*R* also keeps a record of whatever you do at the command line in a "history file". The history file is a plain text file which is usually called .Rhistory and stored in your home directory.

Some regulatory authorities require you to keep a history file.

Some publications may require you to provide a "reproducible data analysis". This could be an edited and annotated copy of your history file.

The `edit()` function provides a basic tool for editing data interactively.

Editing data using the `edit()` function is typically a three stage process:

1. Create a new object containing only the data that requires editing.

2. Use the `edit()` function to edit data in the new object closing the data editor window when you are finished.

3. Replace the old records with the edited records.

We will try this using a separate copy of the example data:

```{r, echo = TRUE, eval = FALSE}
x <- rl.ex01
records2update <- x[outliersUV(x$muac, fence = 3), ]
records2update <- edit(records2update)
x[row.names(records2update), ] <- records2update
```

We can check that the edits have been made using:

```{r, echo = FALSE, eval = TRUE}
x <- rl.ex01
records2update <- x[outliersUV(x$muac, fence = 3), ]
#records2update <- edit(records2update)
x[row.names(records2update), ] <- records2update
```

```{r, echo = TRUE, eval = FALSE}
x[outliersUV(x$muac, fence = 3), ]
```

If you have fixed the problems in the data this should return:

```{r, echo = FALSE, eval = TRUE}
x[outliersUV(x$muac, fence = 3), ]
```

The `edit()` function works differently on different operating systems and with different graphical user interfaces. If you are using *RStudio* or *RAnalyticFlow* on OS X you will need to install *XQuartz* if you want to use the `edit()` function. *XQuartz* is available from:

[https://www.xquartz.org/index.html](https://www.xquartz.org/index.html)

## Checking categorical variables

We can use the **table()** function to examine the codes used in categorical variables. For example:

```{r, echo = TRUE, eval = FALSE}
table(svy$sex)
```

returns:

```{r, echo = FALSE, eval = TRUE}
table(svy$sex)
```

The intention was that the **sex** variable was coded using 1 for male and 2 for female but in a small number of records the codes **M** for male and **F** for female have been used. A mixed coding scheme like this will complicate data-management and data-analysis. Data in the sex variable should be edited to ensure that consistent coding is used:

```{r, echo = TRUE, eval = TRUE}
svy$sex[svy$sex == "M"] <- 1
svy$sex[svy$sex == "F"] <- 2
```

You may find that a few records contain meaningless codes. The code **3** in the example dataset has, very probably, no meaning and is likely to be a simple data entry error. This record should be checked and corrected, if possible. If the record cannot be corrected then the **sex** variable should be set to missing:

```{r, echo = TRUE, eval = TRUE}
svy$sex[svy$sex == 3] <- NA
```

Legal value checks should be repeated after editing to ensure that problems have been fixed:

```{r, echo = TRUE, eval = FALSE}
table(svy$sex)
```

now returns:

```{r, echo = FALSE, eval = TRUE}
table(svy$sex)
```

The table contains cells for the values **M**, **F**, and **3** because *R* imported the variable as a categorical or "factor" variable:

```{r, echo = TRUE, eval = FALSE}
str(svy)
```

returns:

```{r, echo = FALSE, eval = TRUE}
str(svy)
```

We can fix this by redefining the levels of the sex variable:

```{r, echo = TRUE, eval = TRUE}
levels(svy$sex) <- c("1", "2", NA, NA, NA)
table(svy$sex)
```

## Saving changes

We have edited some data.

We usually want to save changes.

It is simple to save a dataset in a comma-separated-value (CSV) text file using the `write.table()` function:

```{r, echo = TRUE, eval = FALSE}
write.table(x = svy, file = "rl.ex01.clean.csv", sep = ",", quote = FALSE, 
            row.names = FALSE, fileEncoding = "ASCII")
```

*R* can work with a variety of files format but it is usually simplest to work with simple text files.