-
Notifications
You must be signed in to change notification settings - Fork 0
/
convert_occupation_codes_for_safe_haven.Rmd
98 lines (78 loc) · 3.06 KB
/
convert_occupation_codes_for_safe_haven.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
---
title: "Convert occupation codes to usable format for safe haven"
author: "Jan Savinc"
date: "`r format(Sys.time(), '%d %B, %Y')`"
output:
html_document:
toc: true
code_folding: show
editor_options:
chunk_output_type: console
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
```{r}
library(tidyverse)
library(lubridate)
library(readxl)
```
# Introduction
For finding out occupations from death records, we need to know how they are coded. The safe haven uses a single *OCCUPATION* variable across years. THe coding is different each decade, so we need to merge the codes with the correct decade. For this reason, we need to convert the provided codes to long format, paired to appropriate decade
# Download coding
```{r}
url <- "https://www.nrscotland.gov.uk/files/statistics/vital-events/socio-economic-code-lists.xlsx"
file <- "./raw/socio-economic-code-lists.xlsx"
if (!file.exists(file)) {
download.file(
url=url,
destfile=file, mode = "wb"
)
}
```
# Load coding
The coding is on the third tab, *Occupations*. The data for different decades in in separate columns, every three columns, and there are 5 decades.
```{r}
coding_raw <-
map_dfr(
.x = seq(1, 13, by=3), # 5 decades, every 3rd column
.f = ~(read_excel(path = file, sheet = "Occupations", guess_max = 1e5, skip = 4, col_types = "text") %>% # skip first four lines - no info
.[,.x:(.x+1)] %>% # take the first two columns
mutate(
decade = names(.)[1], # set the decade name
years = .[[1,1]] # set the years that apply
) %>%
set_names(x = ., nm = c("code","standard_occupational_classification","decade","years")) %>% # rename columns
slice(-(1:3)) # remove top three rows - no info in them apart from headers
) %>%
filter(!is.na(code)) %>% # remove entries without code
filter(!str_detect(tolower(code), pattern = "crown copyright")) # remove copyright stamp in last row
)
```
# Cleaning up
The data is almost clean now - let's clean up the decade and make sure we understand which years they apply to.
In 2000 & 2010, 4 digit format was used,
in 1980 & 1990, 3 digit format,
and in 1970, there wasn't a width specified it seems.
```{r}
coding_intermediate <-
coding_raw %>%
mutate(
classification = sub(decade, pattern=".*\\((.*)\\).*", replacement = "\\1"),
decade = as.numeric(str_extract(decade, "\\d{4}")),
code = case_when(
decade %in% c(2010,2000) ~ str_pad(code, width=4, pad = "0"),
decade %in% c(1990,1980) ~ str_pad(code, width=3, pad = "0"),
TRUE ~ code
)
)
coding_clean <- coding_intermediate %>% select(-years, -classification) %>% distinct %>% select(decade, everything())
decades_and_years <- coding_intermediate %>% select(classification, decade, years) %>% distinct %>% select(decade, everything())
```
# Save
```{r}
write_csv(coding_clean, path = "./occupation_classification/occupation_coding_by_decade.csv", )
write_csv(decades_and_years, path = "./occupation_classification/occupation_coding_rules_by_decade.csv")
```
# Le Fin
Thanks for coming along!