# NAILDOH Variables & Missingness

## Resources

This is the second notebook in the series used to prepare and analyze the NAILDOH collection.

In [1]:
# Options
options(digits = 1)

In [2]:
# Libraries
library(tidyverse) # for data manipulation

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──

✔ ggplot2 3.3.3     ✔ purrr   0.3.4
✔ tibble  3.1.2     ✔ dplyr   1.0.6
✔ tidyr   1.1.3     ✔ stringr 1.4.0
✔ readr   1.4.0     ✔ forcats 0.5.1

── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()



In [3]:
# Functions
factorize <- function(df){ # Create a function
  for(i in which(sapply(df, is.character))) # that looks for variables with the character class
      df[[i]] <- as.factor(df[[i]]) # and converts them to factor (i.e., categorical) class
  return(df)
}

unfactorize <- function(df){ # Create a function
  for(i in which(sapply(df, is.factor))) # that looks for variables with the factor class
      df[[i]] <- as.character(df[[i]]) # and converts them back to character class
  return(df)
}
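As a quick check of how these two helpers behave, the sketch below applies them to a small made-up data frame. The `toy` data frame and its columns are hypothetical and not part of the NAILDOH data, and the helpers are redefined in compact form (using `vapply()`) only so that the block runs on its own.

```r
# Helpers redefined here so this block is self-contained
factorize <- function(df) {
  for (i in which(vapply(df, is.character, logical(1)))) df[[i]] <- as.factor(df[[i]])
  df
}
unfactorize <- function(df) {
  for (i in which(vapply(df, is.factor, logical(1)))) df[[i]] <- as.character(df[[i]])
  df
}

# Hypothetical toy data: one character column, one integer column
toy <- data.frame(gender = c("F", "M", "F"),
                  year   = c(1872L, 1873L, 1873L),
                  stringsAsFactors = FALSE)

toy_f <- factorize(toy)      # gender becomes a factor; year is untouched
toy_c <- unfactorize(toy_f)  # gender goes back to character

sapply(toy_f, class)
sapply(toy_c, class)
```

Only character columns are converted by `factorize()`, so numeric variables such as years keep their original type.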

In [4]:
# Data
letters <- factorize(read.csv("20230209_AM_PhD-NaildohSubset.csv")) # Put csv into a dataframe called letters (note: this masks base R's built-in letters constant)

# Observations and Variables
dim(letters)

# Variable Names
colnames(letters) # Get an overview of the dataframe

Keep in the working dataframe only the variables of interest described in the <a href="https://wellbeinginlifewriting.wordpress.com/2023/02/13/naldoh-variables-missingness/">Variables & Missingness</a> blog post.

In [5]:
letters  <- #Put into the letters dataframe
letters %>% #The existing letters dataframe
select(docid, sourcetitle, docyear, docmonth, docday, authorLocation, docauthorid, docauthorname, #Include only the listed variables
       authorgender, agewriting, birthyear, deathyear, birthplace, religion, 
       cultural_heritage, nationality, educlevel, native_occupation,
       north_american_occupation, marriagestatus, maternalstatus)
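Because `select()` stops with an error when a listed column is absent, it can help to compare the list of wanted variables against the column names first. The following is a minimal base R sketch of that check; the `toy` data frame and the `wanted` vector are hypothetical stand-ins for the real data and variable list.

```r
# Hypothetical stand-ins for the real dataframe and variable list
toy    <- data.frame(docid = "D001", docyear = 1872L, extra = "x")
wanted <- c("docid", "docyear", "educlevel")

# Variables that were requested but are not present in the data
missing_vars <- setdiff(wanted, colnames(toy))
missing_vars

# Keep only the requested variables that do exist
kept <- toy[, intersect(wanted, colnames(toy)), drop = FALSE]
colnames(kept)
```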

In [6]:
# Replace 'Not indicated' and blank strings with NA across the entire dataframe
letters[letters == 'Not indicated']  <- NA
letters[letters == ""]  <- NA
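One detail of this replacement is that, because the text columns are factors, the placeholder values stay listed as unused factor levels after their cells become NA. The sketch below shows this on a made-up data frame (`toy` and its values are hypothetical) and uses base R's `droplevels()` as one possible way to tidy the levels afterwards; the notebook itself does not take this extra step.

```r
# Hypothetical toy data using the two placeholder values for "missing"
toy <- data.frame(religion   = factor(c("Catholic", "Not indicated", "")),
                  birthplace = factor(c("", "Italy", "Italy")))

# Same replacement as above: matching cells become NA
toy[toy == "Not indicated"] <- NA
toy[toy == ""] <- NA

levels(toy$religion)    # the placeholder levels are still listed
toy <- droplevels(toy)  # drop levels that no longer occur in the data
levels(toy$religion)

sum(is.na(toy))         # number of cells that are now NA
```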

In [7]:
# What proportion of each variable is missing?
# Variables with no missing data (a value of zero) are not shown.
letters %>% 
  summarise(across(everything(), ~ mean(is.na(.x)))) %>%
  pivot_longer(everything()) %>%
  filter(value > 0)

| name `<chr>` | value `<dbl>` |
|---|---|
| docyear | 0.024 |
| docmonth | 0.215 |
| docday | 0.277 |
| agewriting | 0.469 |
| birthyear | 0.462 |
| deathyear | 0.499 |
| birthplace | 0.227 |
| religion | 0.397 |
| cultural_heritage | 0.006 |
| nationality | 0.528 |
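The per-variable proportions can be cross-checked in base R, since `is.na()` on a data frame returns a logical matrix and `colMeans()` then gives the share of NA values in each column. This is a minimal sketch on a hypothetical `toy` data frame, not the NAILDOH data.

```r
# Hypothetical toy data with some missing values
toy <- data.frame(docyear = c(1872, 1873, NA, 1874),
                  docday  = c(30, NA, NA, 16),
                  docid   = c("a", "b", "c", "d"))

miss <- colMeans(is.na(toy))  # proportion of NA values per column
miss[miss > 0]                # docyear 0.25, docday 0.50; docid is dropped
```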


Remove the educlevel and nationality variables for the reasons described in the <a href="https://wellbeinginlifewriting.wordpress.com/2023/02/13/naldoh-variables-missingness/">Variables & Missingness</a> blog post. Note that the selection below also leaves out birthplace.

In [8]:
letters  <- #Put into the letters dataframe
letters %>% #The existing letters dataframe
select(docid, sourcetitle, docyear, docmonth, docday, authorLocation, docauthorid, docauthorname, #Only the variables listed
       authorgender, agewriting, birthyear, deathyear, religion, 
       cultural_heritage, north_american_occupation, native_occupation, marriagestatus, maternalstatus)

In [9]:
# Get some stats
round(mean(is.na(letters)), digits = 2) # proportion of missing data across dataframe
round(sum(complete.cases(letters))/nrow(letters), digits = 2) # proportion of complete cases
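These two statistics measure different things: the first counts missing cells across the whole dataframe, while the second counts rows with no missing value in any column. A small hypothetical example (`toy` is made up for illustration) shows how they can diverge.

```r
# Hypothetical toy data: 4 rows x 3 columns = 12 cells
toy <- data.frame(a = c(1, NA, 3, 4),
                  b = c("x", "y", NA, "z"),
                  c = c(TRUE, NA, FALSE, TRUE))

cell_missing <- mean(is.na(toy))                      # 3 NA cells out of 12 = 0.25
row_complete <- sum(complete.cases(toy)) / nrow(toy)  # rows 1 and 4 out of 4 = 0.5

cell_missing
row_complete
```

Here only a quarter of the cells are missing, yet half of the rows are incomplete, because a single NA is enough to make a row incomplete.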

In [10]:
glimpse(letters)

Rows: 1,032
Columns: 18
$ docid                     <fct> S1019-D002, S1019-D004, S1019-D005, S1019-D0…
$ sourcetitle               <fct> "At the End of the Santa Fe Trail", "At the …
$ docyear                   <int> 1872, 1872, 1872, 1872, 1873, 1873, 1873, 18…
$ docmonth                  <int> 11, 12, 12, 12, 3, 7, 9, 6, 11, 6, 9, 12, 1,…
$ docday                    <int> 30, 6, 10, 21, 1, NA, NA, 30, 14, NA, NA, 16…
$ authorLocation            <fct> USA, USA, USA, USA, USA, USA, USA, USA, USA,…
$ docauthorid               <fct> per0001043, per0001043, per0001043, per00010…
$ docauthorname             <fct> "Segale, Sister Blandina, 1850-1941", "Segal…
$ authorgender              <fct> F, F, F, F, F, F, F, F, F, F, F, F, F, F, F,…
$ agewriting                <int> 22, 22, 22, 22, 23, 23, 23, 24,

In [11]:
write.csv(letters, 
          "20230213_AM_PhD-NaildohSubset.csv", 
          row.names=FALSE)
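A quick way to confirm that the export preserves the data is to read the file back and compare it with the original. The sketch below does this with a hypothetical `toy` data frame and a temporary file rather than the real output file; it relies on `write.csv()` writing missing values as `NA` and `read.csv()` reading them back as NA by default.

```r
# Hypothetical toy data with one missing value
toy <- data.frame(docid  = factor(c("D001", "D002")),
                  docday = c(30L, NA))

path <- tempfile(fileext = ".csv")       # temporary file for illustration only
write.csv(toy, path, row.names = FALSE)  # same call pattern as above

back <- read.csv(path)                   # read the exported file back in
dim(back)                                # same number of rows and columns
is.na(back$docday)                       # the NA survives the round trip
```

Factor columns are stored as plain text in the CSV, so a later notebook may need to convert them to factors again after reading the file.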