# Restricting the metadata to letters produced in North America

This notebook restricts the metadata to letters written in the United States or Canada. Letters written elsewhere, that is, in the writers' home countries, are excluded.

## Resources

In [1]:
# Libraries
library(tidyverse) # for data manipulation

── Attaching packages ───────────────────────────────────────── tidyverse 1.3.1 ──

✔ ggplot2 3.3.3     ✔ purrr   0.3.4
✔ tibble  3.1.2     ✔ dplyr   1.0.6
✔ tidyr   1.1.3     ✔ stringr 1.4.0
✔ readr   1.4.0     ✔ forcats 0.5.1

── Conflicts ──────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()



In [2]:
# Functions
factorize <- function(df){ # Convert every character column to a factor
  for(i in which(vapply(df, is.character, logical(1)))) # find the columns of class character
      df[[i]] <- as.factor(df[[i]]) # and convert them to factor (i.e., categorical) class
  return(df)
}

unfactorize <- function(df){ # Convert every factor column back to character
  for(i in which(vapply(df, is.factor, logical(1)))) # find the columns of class factor
      df[[i]] <- as.character(df[[i]]) # and convert them to character class
  return(df)
}
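
Since the tidyverse is already loaded, the same two conversions can also be written with `dplyr::across()`. The following is only a sketch of that alternative, not the approach used in this notebook; the `toy` data frame and its column names are made up for illustration.

```r
library(dplyr)

# Hypothetical toy data frame (not the letters metadata)
toy <- data.frame(name = c("a", "b"), n = 1:2, stringsAsFactors = FALSE)

# Character columns -> factor, leaving other columns untouched
toy_fct <- toy %>% mutate(across(where(is.character), as.factor))

# Factor columns -> character, reversing the step above
toy_chr <- toy_fct %>% mutate(across(where(is.factor), as.character))

str(toy_fct)
str(toy_chr)
```

Either form leaves non-character (or non-factor) columns unchanged; the loop-based functions above simply avoid depending on dplyr.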

In [3]:
# Data
letters <- factorize(read.csv("20230507_AM_PhD-NaildohSubset.csv")) # Read the csv into a data frame called letters (this masks base R's built-in letters vector)
colnames(letters) # Get an overview of the dataframe
dim(letters)

In [4]:
table(letters$authorLocation)


  Canada  England  Ireland Scotland      USA 
     433       37        1        3      143 

In [5]:
letters <- letters %>% 
  filter(authorLocation %in% c("Canada", "USA")) # keep only letters written in Canada or the USA
table(letters$authorLocation)
glimpse(letters)


  Canada  England  Ireland Scotland      USA 
     433        0        0        0      143 

Rows: 576
Columns: 24
$ docauthorid      <fct> per0001043, per0001043, per0001043, per0001043, per00…
$ docauthorname    <fct> "Segale, Sister Blandina, 1850-1941", "Segale, Sister…
$ docid            <fct> S1019-D002, S1019-D004, S1019-D005, S1019-D006, S1019…
$ sourcetitle      <fct> "At the End of the Santa Fe Trail", "At the End of th…
$ docyear          <int> 1872, 1872, 1872, 1872, 1873, 1873, 1873, 1874, 1874,…
$ docmonth         <int> 11, 12, 12, 12, 3, 7, 9, 6, 11, 6, 9, 12, 1, 3, 3, 6,…
$ docday           <int> 30, 6, 10, 21, 1, NA, NA, 30, 14, NA, NA, 16, NA, NA,…
$ authorgender     <fct> F, F, F, F, F, F, F, F, F, F, F, F, F, F, F, F, F, F,…
$ agewriting       <int> 22, 22, 22, 22, 23, 23, 23, 24, 24, 26, 26, 26, 27, 2…
$ birthyear        <int> 1850, 1850, 1850, 1850, 1850, 1850, 1850, 1
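
The table printed after filtering still lists England, Ireland and Scotland with a count of 0. This is because `authorLocation` is a factor, and `filter()` removes rows but keeps the factor's levels. If those empty levels are unwanted, base R's `droplevels()` removes them. The sketch below shows the behaviour on a made-up data frame (`toy` and its values are illustrative, not taken from the letters metadata).

```r
# Hypothetical stand-in for the authorLocation column
toy <- data.frame(authorLocation = factor(c("Canada", "USA", "England", "Canada")))

# Keep only the North American rows
kept <- toy[toy$authorLocation %in% c("Canada", "USA"), , drop = FALSE]
table(kept$authorLocation) # England is still listed, with a count of 0

# Drop the factor levels that no longer occur in the data
kept <- droplevels(kept)
table(kept$authorLocation) # only Canada and USA remain
```

Whether to drop the empty levels is a judgment call: keeping them documents which locations were present before filtering, while dropping them keeps later tables and plots free of empty categories.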

In [26]:
write.csv(letters, 
          "20230606_AM_PhD-NaildohSubset.csv", 
          row.names=FALSE)

In [13]:
# Count the letters whose docid contains "S316"
letters %>% 
  filter(grepl("S316", docid)) %>% 
  select(docid) %>% 
  nrow()