---
title: "Extract UK (WHO) ICD code lists from UK Biobank files"
author: "Jan Savinc"
date: "`r format(Sys.time(), '%d %B, %Y')`"
output:
  html_document:
    toc: true
    toc_float: true
    code_folding: hide
editor_options:
  chunk_output_type: console
---
# Loading libraries
```{r, warning=FALSE}
library(tidyverse)
library(icd)
```
# Introduction
For code validation and interpreting ICD codes in the data, we need definitive lists of ICD-9 and ICD-10 codes.
# Finding WHO listing of codes
Lists of WHO ICD codes are fairly hard to find online because (1) the clinical modification (CM) versions used in the US are so prominent and are released to the public domain, and (2) the base WHO code lists are not in the public domain as far as I'm aware.
I have also been unable to find a write-up of the differences between the base (WHO) code lists and the clinical modifications; the only difference I can confirm is that the CM lists fail to cover some codes found in Scottish SMR data.
A key code I discovered was ICD-9 code *6509 Delivery in a completely normal case* used in the UK, but not elsewhere in ICD-9. I found this code in Scottish SMR02 data, and couldn't find out what it was, apart from it being a sub-code to *650*, which denotes a normal delivery but doesn't specify sub-codes. It was only by searching for it online that I came across the UK Biobank coding lists which happened to contain UK (WHO) coding. It's not clear to me if this is
The most authoritative lists as of 24 April 2019 were found on the [UK Biobank](https://www.ukbiobank.ac.uk/) website, specifically in the [Data Showcase section](http://biobank.ndph.ox.ac.uk/showcase/index.cgi).

* ICD-9: https://biobank.ctsu.ox.ac.uk/crystal/coding.cgi?id=87
* ICD-10: https://biobank.ctsu.ox.ac.uk/crystal/coding.cgi?id=19

The coding files were downloaded from the above two pages, and saved as:

* ICD-9: `coding87.tsv`
* ICD-10: `coding19.tsv`
# Importing data
```{r}
raw_icd9 <- read_tsv("./icd_codes/coding87.tsv", trim_ws = TRUE)
raw_icd10 <- read_tsv("./icd_codes/coding19.tsv", trim_ws = TRUE)
```
# Processing data
## Data format
The data is in a hierarchical format, with each node linked to a parent node, so we can reconstruct a tree with the ICD chapters on top, and "selectable" codes as leaves. This may be useful for working out codes later, so we'll keep the format.
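As a minimal sketch of how the parent links can be followed, the chunk below walks from a leaf up to its chapter. The table and the helper `example_get_ancestors()` are hypothetical and use made-up values; only the `node_id`, `parent_id` and `coding` column names are taken from the downloaded files.

```{r}
library(tibble)

# Toy table mimicking the node_id / parent_id layout (values are made up for illustration)
example_tree <- tribble(
  ~node_id, ~parent_id, ~coding,
  1,        0,          "Chapter I",
  20,       1,          "Block A00-A09",
  300,      20,         "A00",
  3000,     300,        "A000"
)

# Hypothetical helper: walk up the parent links until the root (parent_id 0) is reached
example_get_ancestors <- function(tree, node) {
  ancestors <- character(0)
  parent <- tree$parent_id[tree$node_id == node]
  while (length(parent) == 1 && parent != 0) {
    ancestors <- c(ancestors, tree$coding[tree$node_id == parent])
    parent <- tree$parent_id[tree$node_id == parent]
  }
  ancestors
}

example_get_ancestors(example_tree, 3000)
# "A00" "Block A00-A09" "Chapter I"
```

The loop stops as soon as a parent of 0 is found, which is how the chapters are identified later in this document (`parent_id == 0`).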
## Converting to decimal code
Non-decimal codes were provided, but it's useful to have both a decimal and non-decimal code for matching different-formatted sources without having to convert between them.
One option for converting non-decimal ICD codes to decimal is the *short_to_decimal()* function from the *icd* package. However, this fails on long E-codes in ICD-9, which aren't defined in ICD-9-CM (which the *icd* package was based on at the time of writing).
In ICD-9, codes consist of 3 digits, with any further digits following a decimal point. E-codes follow the same convention except that they are prefixed by E. V-codes are V followed by 2 digits, with any further digits following a decimal point.
In ICD-10, codes consist of a letter followed by 2 digits, with any further characters following a decimal point.
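The conventions above can be illustrated with base R regular expressions on a handful of codes. The helper `example_to_decimal()` and the example codes are only for illustration; codes that have nothing after the prefix are returned unchanged because the pattern doesn't match them.

```{r}
# Hypothetical helper: insert a decimal point between the two captured groups
example_to_decimal <- function(code, pattern) {
  sub(pattern = pattern, replacement = "\\1.\\2", x = code)
}

# plain ICD-9: 3 digits before the point
example_to_decimal(c("650", "6509"), "^(\\d{3})(\\d+)$")  # "650" "650.9"
# ICD-9 E-code: E + 3 digits before the point
example_to_decimal("E8001", "^(E\\d{3})(\\d+)$")          # "E800.1"
# ICD-9 V-code: V + 2 digits before the point
example_to_decimal("V270", "^(V\\d{2})(\\d+)$")           # "V27.0"
# ICD-10: letter + 2 digits before the point
example_to_decimal("A000", "^([A-Z]\\d{2})(\\w+)$")       # "A00.0"
```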
## Blocks & chapters
Because of the hierarchical structure, nodes that aren't codes are included in the data - those include Chapters and Blocks. These will be kept, but will be designated separately so that decimal codes aren't extracted.
## Code order
The primary ordering that should be retained in the final dictionary of codes is *node_id*. This is important because the analysis will sometimes have to deal with codes specified as ranges: the most practical way to handle a range is to look up its start and end points in the dictionary and extract all codes in between. This only works if the canonical order is kept in the dictionary.
This issue can be avoided if we can also deal with the parent-child relationships between codes.
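A rough sketch of the range lookup described above, assuming the dictionary is sorted by *node_id* and that both end points are present in it. The toy dictionary and the helper `example_codes_in_range()` are hypothetical.

```{r}
library(dplyr)

# Toy dictionary, already in canonical node_id order (values are made up)
example_dictionary <- tibble(
  node_id = 1:5,
  code = c("001", "002", "003", "004", "005")
)

# Hypothetical helper: return every code whose node_id lies between
# the node_ids of the start and end codes (inclusive)
example_codes_in_range <- function(dictionary, start_code, end_code) {
  start_id <- dictionary$node_id[dictionary$code == start_code]
  end_id <- dictionary$node_id[dictionary$code == end_code]
  dictionary %>%
    filter(between(node_id, start_id, end_id)) %>%
    pull(code)
}

example_codes_in_range(example_dictionary, "002", "004")
# "002" "003" "004"
```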
## ICD-9
Non-codes appear to correspond to *node_id* between 0 and 188; the first code-entry in the data is 189.
```{r}
# uncomment to review the node_id range that corresponds to non-codes
# raw_icd9 %>% arrange(node_id) %>% View()
non_code_range_icd9 <- 0:188
processed_icd9 <-
  raw_icd9 %>%
  rename(code = coding) %>%
  mutate(
    code_decimal = case_when(  # define the three cases: E-codes, V-codes, and all the rest
      node_id %in% non_code_range_icd9 ~ as.character(NA),
      str_detect(code, pattern = "^V") ~ sub(code, pattern = "^(V\\d{2})(\\d+)$", replacement = "\\1.\\2"),
      str_detect(code, pattern = "^E") ~ sub(code, pattern = "^(E\\d{3})(\\d+)$", replacement = "\\1.\\2"),
      TRUE ~ sub(code, pattern = "^(\\d{3})(\\d+)$", replacement = "\\1.\\2")
    )
  ) %>%
  arrange(node_id)
```
## ICD-10
Non-codes appear to correspond to *node_id* between 0 and 285; the first code-entry in the data is 286.
```{r}
# uncomment to review the node_id range that corresponds to non-codes
# raw_icd10 %>% arrange(node_id) %>% View()
non_code_range_icd10 <- 0:285
processed_icd10 <-
  raw_icd10 %>%
  rename(code = coding) %>%
  mutate(
    code_decimal = case_when(  # non-codes get NA; all other codes are a letter + 2 digits before the decimal point
      node_id %in% non_code_range_icd10 ~ as.character(NA),
      TRUE ~ sub(code, pattern = "^([A-Z]\\d{2})(\\w+)$", replacement = "\\1.\\2")
    )
  ) %>%
  arrange(node_id)
## one way to check validity is to find cases where the description doesn't begin with the decimal code
## (non-code rows have an NA code_decimal and are dropped by filter())
processed_icd10 %>%
  filter(
    !startsWith(x = meaning, prefix = code_decimal)
  )
## hooray!
```
# Saving resulting dictionaries
```{r}
write.csv(
  processed_icd9,
  file = "./processed_ICD_codes/master_icd9_code_list_UK(WHO).csv",
  row.names = FALSE
)
write.csv(
  processed_icd10,
  file = "./processed_ICD_codes/master_icd10_code_list_UK(WHO).csv",
  row.names = FALSE
)
```
# Compiling mappings of code prefixes to ICD chapters
For rough categorisation of conditions, it is also useful to have a mapping of top-level ICD codes (3 characters) to their respective ICD chapter.
```{r}
generate_csv_from_range <- function(code_range, icd_version) {
  ## code_range is in format 001-141, E123-E128, V11-V45, for example
  ## returns one comma-separated string of 3-character prefixes per range
  prefix_letter <- str_extract(code_range, pattern = "^[A-Z]") %>% replace_na("")
  start_and_end_numbers <- str_split(code_range, pattern = "\\-")
  pmap_chr(
    list(
      start_and_end_numbers,
      prefix_letter,
      icd_version
    ),
    ~{
      start <- parse_number(..1[1])
      end <- parse_number(..1[2])
      sequence <- seq(start, end, by = 1)
      if (!..2 %in% c("V") && ..3 == "9") {  # for ICD-9 codes, use 3 digits, or E+3 digits
        padded_sequence <- str_pad(sequence, width = 3, pad = "0")
      } else {  # for ICD-10 (and ICD-9 V-codes), use letter+2 digits
        padded_sequence <- str_pad(sequence, width = 2, pad = "0")
      }
      lettered_sequence <- paste0(..2, padded_sequence)
      csv <- paste(lettered_sequence, collapse = ",")
      return(csv)
    }
  )
}
chapter_block_code_icd9 <-
  processed_icd9 %>%
  filter(parent_id == 0) %>%  # the main chapters
  mutate(chapter_num = case_when(
    node_id == 18 ~ "E-codes",
    node_id == 19 ~ "V-codes",
    TRUE ~ as.character(node_id)
  )) %>%
  select(chapter = meaning, chapter_id = node_id, chapter_num) %>%
  left_join(processed_icd9 %>% select(parent_id, block_id = node_id, block = meaning), by = c("chapter_id" = "parent_id")) %>%
  mutate(
    code_range = gsub(block, pattern = "^([A-Z]*\\d{2,3}\\-[A-Z]*\\d{2,3})\\s.*$", replacement = "\\1"),  # extract the e.g. E800-E859 ranges
    icd_version = 9,
    csv = generate_csv_from_range(code_range, icd_version = 9)
  ) %>%
  separate_rows(csv, sep = "\\,") %>%
  select(chapter, chapter_num, block, prefix = csv, icd_version) %>%
  distinct()
chapter_block_code_icd10 <-
  processed_icd10 %>%
  filter(parent_id == 0) %>%  # the main chapters
  mutate(chapter_num = as.character(node_id)) %>%
  select(chapter = meaning, chapter_id = node_id, chapter_num) %>%
  left_join(processed_icd10 %>% select(parent_id, block_id = node_id, block = meaning), by = c("chapter_id" = "parent_id")) %>%
  mutate(
    code_range = gsub(block, pattern = "^([A-Z]*\\d{2,3}\\-[A-Z]*\\d{2,3})\\s.*$", replacement = "\\1")
  ) %>%
  mutate(
    icd_version = 10,
    csv = generate_csv_from_range(code_range, icd_version = 10)
  ) %>%
  separate_rows(csv, sep = "\\,") %>%
  select(chapter, chapter_num, block, prefix = csv, icd_version) %>%
  distinct()
map_icd_chapter_block_code <-
  bind_rows(
    chapter_block_code_icd9,
    chapter_block_code_icd10
  )
# output file passed by position: the argument is named `path` in older readr and `file` in newer versions
write_csv(map_icd_chapter_block_code, "./processed_ICD_codes/map_icd_chapter_block_code.csv")
```
### Example ICD-9:
```{r}
head(chapter_block_code_icd9)
```
### Example ICD-10:
```{r}
head(chapter_block_code_icd10)
```
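As a sketch of how such a prefix map might be used, the chunk below attaches a chapter to a few non-decimal ICD-9 codes by joining on the prefix. The toy map, its placeholder chapter labels and the example codes are assumptions for illustration; note that ICD-9 E-codes need a 4-character prefix (E + 3 digits), whereas all other ICD-9 codes and all ICD-10 codes use 3 characters.

```{r}
library(dplyr)
library(stringr)

# Toy stand-in for the ICD-9 part of the prefix-to-chapter map; chapter labels are placeholders
example_map <- tibble(
  prefix = c("650", "V27", "E800"),
  chapter = c("chapter for 650", "chapter for V27", "chapter for E800")
)

example_codes <- tibble(code = c("6509", "V270", "E8001"))

example_joined <-
  example_codes %>%
  mutate(
    # ICD-9 only: E-codes carry a 4-character prefix, everything else 3 characters
    prefix = if_else(str_detect(code, "^E"), str_sub(code, 1, 4), str_sub(code, 1, 3))
  ) %>%
  left_join(example_map, by = "prefix")

example_joined
```

For ICD-10 codes the E rule should not be applied, since ICD-10 codes beginning with E follow the usual letter + 2 digits pattern.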
## Equivalent chapters between version 9 and 10:
```{r}
map_icd_equivalent_chapters <-
  tibble(icd_9 = character(0), icd_10 = character(0)) %>%
  add_case(icd_9 = "1", icd_10 = "1") %>%
  add_case(icd_9 = "2", icd_10 = "2") %>%
  add_case(icd_9 = "3", icd_10 = "4") %>%
  add_case(icd_9 = "4", icd_10 = "3") %>%
  add_case(icd_9 = "5", icd_10 = "5") %>%
  add_case(icd_9 = "6", icd_10 = "6,7,8") %>%
  add_case(icd_9 = "7", icd_10 = "9") %>%
  add_case(icd_9 = "8", icd_10 = "10") %>%
  add_case(icd_9 = "9", icd_10 = "11") %>%
  add_case(icd_9 = "10", icd_10 = "14") %>%
  add_case(icd_9 = "11", icd_10 = "15") %>%
  add_case(icd_9 = "12", icd_10 = "12") %>%
  add_case(icd_9 = "13", icd_10 = "13") %>%
  add_case(icd_9 = "14", icd_10 = "17") %>%
  add_case(icd_9 = "15", icd_10 = "16") %>%
  add_case(icd_9 = "16", icd_10 = "18") %>%
  add_case(icd_9 = "17", icd_10 = "19") %>%
  add_case(icd_9 = "E-codes", icd_10 = "20") %>%
  add_case(icd_9 = "V-codes", icd_10 = "21") %>%
  add_case(icd_9 = NA_character_, icd_10 = "22")  # ICD-10 chapter 22 has no ICD-9 equivalent
# output file passed by position: the argument is named `path` in older readr and `file` in newer versions
write_csv(map_icd_equivalent_chapters, "./processed_ICD_codes/map_icd_equivalent_chapters.csv")
```
## Example of chapter equivalence:
```{r}
map_icd_equivalent_chapters %>%
  separate_rows(icd_10, sep = ",") %>%
  left_join(chapter_block_code_icd9 %>% select(chapter_icd_9 = chapter, chapter_num) %>% distinct(), by = c("icd_9" = "chapter_num")) %>%
  left_join(chapter_block_code_icd10 %>% select(chapter_icd_10 = chapter, chapter_num) %>% distinct(), by = c("icd_10" = "chapter_num")) %>%
  knitr::kable()
```
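The same table can also be read in the other direction. The sketch below uses a two-row excerpt of the equivalence table (the object names are hypothetical) to find which ICD-9 chapter an ICD-10 chapter maps back to, after splitting the comma-separated entries into one row each.

```{r}
library(dplyr)
library(tidyr)

# Two rows copied from the equivalence table above
example_equivalence <- tibble(icd_9 = c("3", "6"), icd_10 = c("4", "6,7,8"))

# One row per ICD-9 / ICD-10 chapter pair
example_equivalence_long <-
  example_equivalence %>%
  separate_rows(icd_10, sep = ",")

# Which ICD-9 chapter corresponds to ICD-10 chapter 7?
example_equivalence_long %>%
  filter(icd_10 == "7") %>%
  pull(icd_9)
# "6"
```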