convert_occupation_codes_for_safe_haven.Rmd

---
title: "Convert occupation codes to usable format for safe haven"
author: "Jan Savinc"
date: "`r format(Sys.time(), '%d %B, %Y')`"
output: 
  html_document:
    toc: true
    code_folding: show
editor_options: 
  chunk_output_type: console
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```


```{r}
library(tidyverse)
library(lubridate)
library(readxl)
```

# Introduction

For finding out occupations from death records, we need to know how they are coded. The safe haven uses a single *OCCUPATION* variable across years. THe coding is different each decade, so we need to merge the codes with the correct decade. For this reason, we need to convert the provided codes to long format, paired to appropriate decade

# Download coding

```{r}
url <- "https://www.nrscotland.gov.uk/files/statistics/vital-events/socio-economic-code-lists.xlsx"
file <- "./raw/socio-economic-code-lists.xlsx"

if (!file.exists(file)) {
  download.file(
    url=url, 
    destfile=file, mode = "wb"
    )
}
```

# Load coding

The coding is on the third tab, *Occupations*. The data for different decades in in separate columns, every three columns, and there are 5 decades.

```{r}
coding_raw <- 
  map_dfr(
    .x = seq(1, 13, by=3),  # 5 decades, every 3rd column
    .f = ~(read_excel(path = file, sheet = "Occupations", guess_max = 1e5, skip = 4, col_types = "text") %>%  # skip first four lines - no info
      .[,.x:(.x+1)] %>%  # take the first two columns
    mutate(
      decade = names(.)[1],  # set the decade name
      years = .[[1,1]]  # set the years that apply
      ) %>%
    set_names(x = ., nm = c("code","standard_occupational_classification","decade","years")) %>%  # rename columns
      slice(-(1:3))  # remove top three rows - no info in them apart from headers
      ) %>%
      filter(!is.na(code)) %>% # remove entries without code
      filter(!str_detect(tolower(code), pattern = "crown copyright"))  # remove copyright stamp in last row
    )
```

# Cleaning up

The data is almost clean now - let's clean up the decade and make sure we understand which years they apply to.

In 2000 & 2010, 4 digit format was used,
in 1980 & 1990, 3 digit format,
and in 1970, there wasn't a width specified it seems.

```{r}
coding_intermediate <-
  coding_raw %>%
  mutate(
    classification = sub(decade, pattern=".*\\((.*)\\).*", replacement = "\\1"),
    decade = as.numeric(str_extract(decade, "\\d{4}")),
    code = case_when(
      decade %in% c(2010,2000) ~ str_pad(code, width=4, pad = "0"),
      decade %in% c(1990,1980) ~ str_pad(code, width=3, pad = "0"),
      TRUE ~ code
    )
  )

coding_clean <- coding_intermediate %>% select(-years, -classification) %>% distinct %>% select(decade, everything())
decades_and_years <- coding_intermediate %>% select(classification, decade, years) %>% distinct %>% select(decade, everything())
```

# Save

```{r}
write_csv(coding_clean, path = "./occupation_classification/occupation_coding_by_decade.csv", )
write_csv(decades_and_years, path = "./occupation_classification/occupation_coding_rules_by_decade.csv")
```

# Le Fin

Thanks for coming along!