-
Notifications
You must be signed in to change notification settings - Fork 1
/
rl.Rmd
315 lines (210 loc) · 9.44 KB
/
rl.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
---
title: "Checking ranges and legal values"
author: "Mark Myatt"
---
```{r, include = FALSE}
knitr::opts_chunk$set(
warning = FALSE,
error = FALSE,
message = FALSE,
collapse = TRUE,
comment = "#>"
)
```
```{r setup, echo = FALSE}
library(nipnTK)
```
Checking that data are within an acceptable or plausible range is an important basic check to apply to quantitative data. Checking that data are recorded with appropriate legal values or codes is an important basic check to apply to categorical data.
## Checking quantitative data
We will use the dataset `rl.ex01` that is included in the **nipnTK** package.
```{r, echo = TRUE, eval = FALSE}
svy <- rl.ex01
head(svy)
```
```{r, echo = FALSE, eval = TRUE}
svy <- rl.ex01
head(svy)
```
The `rl.ex01` dataset contains anthropometry data from a SMART survey from Angola.
We can use the `summary()` function to examine range (and other summary statistics) of a quantitative variable:
```{r, echo = TRUE, eval = FALSE}
summary(svy$muac)
```
This returns:
```{r, echo = FALSE, eval = TRUE}
summary(svy$muac)
```
A graphical examination can also be made:
```{r, echo = TRUE, eval = FALSE}
boxplot(svy$muac, horizontal = TRUE, xlab = "MUAC (mm)", frame.plot = FALSE)
```
```{r, echo = FALSE, eval = TRUE}
boxplot(svy$muac, horizontal = TRUE, xlab = "MUAC (mm)", frame.plot = FALSE)
```
The "whiskers" on the boxplot extend to 1.5 times the interquartile range from the ends of the box (i.e., the lower and upper quartiles). This is known as the *inner fence*. Data points that are outside the inner fence are considered to be *mild outliers*. The NiPN data quality toolkit provides an R language function `outliersUV()` that uses the same method to identify outliers:
```{r, echo = TRUE, eval = FALSE}
svy[outliersUV(svy$muac), ]
```
This returns:
```{r, echo = FALSE, eval = TRUE}
svy[outliersUV(svy$muac), ]
```
We can count the number of outliers or use:
```{r, echo = TRUE, eval = FALSE}
table(outliersUV(svy$muac))
```
This returns:
```{r, echo = FALSE, eval = TRUE}
table(outliersUV(svy$muac))
```
We can express this as a proportion:
```{r, echo = TRUE, eval = FALSE}
prop.table(table(outliersUV(svy$muac)))
```
This returns:
```{r, echo = FALSE, eval = TRUE}
prop.table(table(outliersUV(svy$muac)))
```
You may find it easier to use percentages:
```{r, echo = TRUE, eval = FALSE}
prop.table(table(outliersUV(svy$muac))) * 100
```
This returns:
```{r, echo = FALSE, eval = TRUE}
prop.table(table(outliersUV(svy$muac))) * 100
```
Some of the **muac** values identified as potential outliers are possible **muac** values:
```{r, echo = FALSE, eval = TRUE}
svy[outliersUV(svy$muac), ]
```
The `outliersUV()` function provides a **fence** parameter which alters the threshold at which a data point is considered to be an outlier.
The default **fence = 1.5** defines the *inner fence* (i.e **1.5** times the interquartile range below the lower quartile and above the upper quartile). This will identify *mild* and *severe* outliers.
The value **fence = 3** defines the *outer fence* (i.e **3** times the interquartile range below the lower quartile and above the upper quartile). This will identify *severe* outliers only:
```{r, echo = TRUE, eval = FALSE}
svy[outliersUV(svy$muac, fence = 3), ]
```
This returns:
```{r, echo = FALSE, eval = TRUE}
svy[outliersUV(svy$muac, fence = 3), ]
```
There is something wrong with all of these values of **muac**.
The intention was that the **muac** variable records mid-upper-arm-circumference (MUAC) in mm. There are some impossibly small (i.e. **11.1**, **12.4**, and **13.2**) and impossibly large values (i.e. **999.0**).
The three impossibly small values are probably due to data being recorded in cm rather than mm. It is probably safe to change these three values to 111, 124 and 132. It is easiest to do this each record separately:
```{r, echo = TRUE, eval = TRUE}
svy$muac[svy$muac == 11.1] <- 111
```
An alternative approach is to specify row numbers instead of values:
```{r, echo = TRUE, eval = TRUE}
svy$muac[381] <- 124
svy$muac[594] <- 132
```
The three **999.0** values are missing values coded as 999.0. It is safe to set these three values to missing using the special NA value:
```{r, echo = TRUE, eval = TRUE}
svy$muac[svy$muac == 999.00] <- NA
```
Range checks should be repeated after editing the data to ensure that the problems have been fixed:
```{r, echo = TRUE, eval = FALSE}
summary(svy$muac)
svy[outliersUV(svy$muac), ]
svy[outliersUV(svy$muac, fence = 3), ]
```
Following is a boxplot of the **muac** variable made using:
```{r, echo = TRUE, eval = FALSE}
boxplot(svy$muac, horizontal = TRUE, xlab = "MUAC (mm)", frame.plot = FALSE)
```
after the fixes for incorrectly entered data and missing values were made.
```{r, echo = FALSE, eval = TRUE}
boxplot(svy$muac, horizontal = TRUE, xlab = "MUAC (mm)", frame.plot = FALSE)
```
There should now be no severe outliers:
```{r, echo = TRUE, eval = FALSE}
prop.table(table(outliersUV(svy$muac, fence = 3))) * 100
```
returns:
```{r, echo = FALSE, eval = TRUE}
prop.table(table(outliersUV(svy$muac, fence = 3))) * 100
```
It is usually better to identify and edit only the most extreme *univariate* outliers, as we have done here, and use the scatterplot and statistical distance methods described elsewhere in this toolkit to identify other potential outliers.
## Editing data
We have edited records with outliers at the *R* command line.
It is a good idea to edit data at the command line or using a script containing the required commands.
A script provides a record of changes made to the data.
*R* also keeps a record of whatever you do at the command line in a "history file". The history file is a plain text file which is usually called .Rhistory and stored in your home directory.
Some regulatory authorities require you to keep a history file.
Some publications may require you to provide a "reproducible data analysis". This could be an edited and annotated copy of your history file.
The `edit()` function provides a basic tool for editing data interactively.
Editing data using the `edit()` function is typically a three stage process:
1. Create a new object containing only the data that requires editing.
2. Use the `edit()` function to edit data in the new object closing the data editor window when you are finished.
3. Replace the old records with the edited records.
We will try this using a separate copy of the example data:
```{r, echo = TRUE, eval = FALSE}
x <- rl.ex01
records2update <- x[outliersUV(x$muac, fence = 3), ]
records2update <- edit(records2update)
x[row.names(records2update), ] <- records2update
```
We can check that the edits have been made using:
```{r, echo = FALSE, eval = TRUE}
x <- rl.ex01
records2update <- x[outliersUV(x$muac, fence = 3), ]
#records2update <- edit(records2update)
x[row.names(records2update), ] <- records2update
```
```{r, echo = TRUE, eval = FALSE}
x[outliersUV(x$muac, fence = 3), ]
```
If you have fixed the problems in the data this should return:
```{r, echo = FALSE, eval = TRUE}
x[outliersUV(x$muac, fence = 3), ]
```
The `edit()` function works differently on different operating systems and with different graphical user interfaces. If you are using *RStudio* or *RAnalyticFlow* on OS X you will need to install *XQuartz* if you want to use the `edit()` function. *XQuartz* is available from:
[https://www.xquartz.org/index.html](https://www.xquartz.org/index.html)
## Checking categorical variables
We can use the **table()** function to examine the codes used in categorical variables. For example:
```{r, echo = TRUE, eval = FALSE}
table(svy$sex)
```
returns:
```{r, echo = FALSE, eval = TRUE}
table(svy$sex)
```
The intention was that the **sex** variable was coded using 1 for male and 2 for female but in a small number of records the codes **M** for male and **F** for female have been used. A mixed coding scheme like this will complicate data-management and data-analysis. Data in the sex variable should be edited to ensure that consistent coding is used:
```{r, echo = TRUE, eval = TRUE}
svy$sex[svy$sex == "M"] <- 1
svy$sex[svy$sex == "F"] <- 2
```
You may find that a few records contain meaningless codes. The code **3** in the example dataset has, very probably, no meaning and is likely to be a simple data entry error. This record should be checked and corrected, if possible. If the record cannot be corrected then the **sex** variable should be set to missing:
```{r, echo = TRUE, eval = TRUE}
svy$sex[svy$sex == 3] <- NA
```
Legal value checks should be repeated after editing to ensure that problems have been fixed:
```{r, echo = TRUE, eval = FALSE}
table(svy$sex)
```
now returns:
```{r, echo = FALSE, eval = TRUE}
table(svy$sex)
```
The table contains cells for the values **M**, **F**, and **3** because *R* imported the variable as a categorical or "factor" variable:
```{r, echo = TRUE, eval = FALSE}
str(svy)
```
returns:
```{r, echo = FALSE, eval = TRUE}
str(svy)
```
We can fix this by redefining the levels of the sex variable:
```{r, echo = TRUE, eval = TRUE}
levels(svy$sex) <- c("1", "2", NA, NA, NA)
table(svy$sex)
```
## Saving changes
We have edited some data.
We usually want to save changes.
It is simple to save a dataset in a comma-separated-value (CSV) text file using the `write.table()` function:
```{r, echo = TRUE, eval = FALSE}
write.table(x = svy, file = "rl.ex01.clean.csv", sep = ",", quote = FALSE,
row.names = FALSE, fileEncoding = "ASCII")
```
*R* can work with a variety of files format but it is usually simplest to work with simple text files.