# NAILDOH - IED Merge 1 (NAILDOH)

## Resources

In [3]:
# Libraries
library(tidyverse) # for data manipulation

# Functions
factorize <- function(df){ # Create a function
  for(i in which(sapply(df, class) == "character")) # that looks for variables with the character class 
      df[[i]] = as.factor(df[[i]]) # and converts them to factor (i.e., categorical) class
  return(df)
}

unfactorize <- function(df){ # Create a function
  for(i in which(sapply(df, class) == "factor")) # that looks for variables with the factor class 
      df[[i]] = as.character(df[[i]]) # and converts them back to character class
  return(df)
}
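
As a quick check of the two helpers, the sketch below applies them to a small made-up data frame. The `toy` data frame and its values are illustrative only and are not part of the NAILDOH data.

```r
# Helper definitions repeated so the example runs on its own
factorize <- function(df){
  for(i in which(sapply(df, class) == "character"))
      df[[i]] = as.factor(df[[i]])
  return(df)
}
unfactorize <- function(df){
  for(i in which(sapply(df, class) == "factor"))
      df[[i]] = as.character(df[[i]])
  return(df)
}

# Toy data frame (hypothetical values)
toy <- data.frame(id = c("a", "b", "c"),
                  year = c(1831L, 1833L, 1844L),
                  stringsAsFactors = FALSE)

toy_f <- factorize(toy)      # character column becomes a factor
toy_c <- unfactorize(toy_f)  # and is converted back to character

sapply(toy_f, class)         # id is a factor, year is untouched
sapply(toy_c, class)         # id is character again
```

Only character (or factor) columns are touched, so numeric and logical columns pass through both helpers unchanged.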

## Prep NAILDOH

In [814]:
# Data
df1 <- factorize(read.csv("20240510_PhD_NaildohSubset.csv")) # Put csv into a dataframe called df1
colnames(df1) # Get an overview of the dataframe
dim(df1)

I am going to omit Sister Blandina Segale's letters because they differ considerably from the others in the corpus and might introduce confounding variables. For example, they were published (and therefore likely edited) and they take journal form. In terms of demographics, Sister Segale is the only writer without Anglo or Irish ancestry (she is Italian), and she is a 1.5-generation immigrant whereas most of the others are first generation. While she is Catholic, she is a nun, a rather special case of Catholic that won't tell me much about the Irish-Catholic (i.e., the labouring underclass) experience that I aim to understand.

As explained in the Social Class Part 1 notebook, Thomas Mooney was a prolific writer, apparently Catholic. Like Sister Blandina Segale, his letters to a sibling and "the rest of my countrymen" were gathered together and published in book form. The three letters by him do not relate much about his own life, but are rather advice for his sibling and other prospective migrants. I am going to omit these.

In [815]:
length(unique(df1$docauthorid))
vals <- c("per0001043", "per0034430")
df1 <- df1[!df1$docauthorid %in% vals, ]
length(unique(df1$docauthorid))

Margaret Carrothers wrote a letter that is bundled in with, and attributed to, one by Nathaniel. I am removing Nathaniel's part, which is preserved in S9635-D014, changing the attribution to Margaret, and giving her a unique authorid. I am also changing the gender from male to female.

In [816]:
df1 <- unfactorize(df1)

df1$docauthorname[df1$docid=="S9635-D015"] <- "Carrothers, Margaret"
df1$docauthorid[df1$docauthorid=="per0026978"] <- "per0026978a"
df1$docauthorid[df1$docid=="S9635-D015"] <- "per0026978b"
df1$authorgender[df1$docauthorid=="per0026978b"] <- "F"
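
The call to `unfactorize()` before these assignments matters because a factor cannot take a value that is not already one of its levels. The sketch below illustrates this on a toy vector; the vector and the variable names `as_factor` and `as_char` are made up for the example.

```r
# Toy id vector (illustrative values only)
ids <- factor(c("per0026978", "per0026978", "per0031263"))

# Assigning a value that is not an existing level produces NA.
# R also emits an "invalid factor level" warning, silenced here.
as_factor <- ids
suppressWarnings(as_factor[2] <- "per0026978b")
is.na(as_factor[2])

# Converting to character first lets the new id through.
as_char <- as.character(ids)
as_char[2] <- "per0026978b"
as_char
```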

Now I am changing the format of the name from last-first to first-last to match the IED metadata and avoid complications with the comma character. 

In [817]:
# Create a new variable for the modified names.
df1$authorName <- df1$docauthorname

# Cleaning
df1$authorName <- gsub("\\d", "", df1$authorName)
df1$authorName <- gsub("\\?", "", df1$authorName)
df1$authorName <- gsub("\\(\\)", "", df1$authorName)
df1$authorName <- gsub("fl\\.", "", df1$authorName) # escape the dot so that only a literal "fl." is removed
df1$authorName <- gsub("\\s-", "", df1$authorName)
df1$authorName <- gsub(",$", "", df1$authorName)
df1$authorName <- gsub(",\\s\\s$", "", df1$authorName)

# Changing to first name first format
first <- gsub("^.*,\\s", "", df1$authorName)
last <- gsub(",.*$", "", df1$authorName)
authorName <- paste(first, last, sep=" ")
df1$authorName <- authorName

# Fixing a few stragglers
df1$authorName[df1$docauthorid=="per0031263"] <- "William Davies"
df1$authorName[df1$docauthorid=="per0031335"] <- "Lewis Howell Jr"
df1$authorName[df1$docauthorid=="per0004486"] <- "Samuel Roberts"

# Checking anonymous names to see if key info will be lost if I convert to NA
#df1 %>% 
#filter(grepl("Anonymous", authorName)) %>% 
#select(authorName, docauthorid, authorgender, nationalOrigin, authorLocation, L, religionNew) %>% 
#unique()

# per0036149 is the wife of a tradesman so coding the labour variables accordingly.
df1$L[df1$docauthorid=="per0036149"] <- FALSE
df1$S[df1$docauthorid=="per0036149"] <- TRUE

# No other info appears to be lost by changing Anonymous to NA so doing that now.
df1$authorName[grepl("Anonymous", df1$docauthorname)] <- NA

# Checking that docauthorname converted correctly to authorName
#df1 %>% 
#select(docauthorname, authorName, docauthorid) %>% 
#unique() %>% 
#arrange(docauthorname) %>%
#slice(51:100)
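
To see what the cleaning steps do, the sketch below runs the same sequence of substitutions on four `docauthorname` values that appear in this notebook. The wrapper `clean_name()` is a hypothetical helper introduced only for this example, and the sketch escapes the dot in the `fl.` pattern so that it matches a literal full stop.

```r
# Sample docauthorname values from this notebook
raw <- c("White, Jane, 1831(?)-1867",
         "Robb, Alexander, 1839-",
         "Jennings, Joseph, fl. 1931",
         "Carrothers, Nathaniel, ?-1881")

clean_name <- function(x){           # hypothetical helper name
  x <- gsub("\\d", "", x)            # drop digits (years)
  x <- gsub("\\?", "", x)            # drop question marks
  x <- gsub("\\(\\)", "", x)         # drop the empty brackets left behind
  x <- gsub("fl\\.", "", x)          # drop the literal "fl." marker
  x <- gsub("\\s-", "", x)           # drop the dangling " -" from date ranges
  x <- gsub(",$", "", x)             # drop a trailing comma
  x <- gsub(",\\s\\s$", "", x)       # drop a trailing comma plus two spaces
  first <- gsub("^.*,\\s", "", x)    # everything after the last ", "
  last  <- gsub(",.*$", "", x)       # everything before the first ","
  paste(first, last, sep = " ")
}

clean_name(raw)
```

A value without a comma would come out with the same text as both first and last name, which is one reason the stragglers and the anonymous writers are handled separately.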

In [818]:
# Double checking that the author & docid counts are correct

df1 <- factorize(df1)

#Drop unused levels
df1$docauthorname <- droplevels(df1$docauthorname)
df1$docauthorid <- droplevels(df1$docauthorid)
df1$authorName <- droplevels(df1$authorName)

#Original variables
length(unique(df1$docauthorname))
length(unique(df1$docauthorid))

#New variable
df1 %>% 
select(docauthorid, authorName) %>% 
unique() %>% 
group_by(authorName)  %>% 
tally(sort=TRUE) %>% 
filter(n>1)

df1 %>% 
filter(!is.na(authorName)) %>% 
select(docauthorid) %>% 
unique() %>% 
nrow()


authorName,n
<fct>,<int>
,34


All good -- the counts match up.

In [819]:
# Given that the entire corpus is European, I am re-coding "European" to NA
# Also dropping unused levels.
df1 <- factorize(df1)

df1$nationalOrigin[df1$nationalOrigin=="European"] <- NA
df1$nationalOrigin <- droplevels(df1$nationalOrigin)
summary(df1$nationalOrigin)
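
A small sketch of the same recode on a toy factor shows why `droplevels()` is needed afterwards. The vector `origin` and its values are made up for illustration.

```r
# Toy factor (hypothetical values)
origin <- factor(c("English", "European", "Irish", "European", "Welsh"))

origin[origin == "European"] <- NA   # recode the uninformative category to NA
levels(origin)                       # "European" is still listed as a level

origin <- droplevels(origin)         # remove levels that no longer occur
levels(origin)
summary(origin)                      # NAs are reported in their own column
```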

In [820]:
df1 %>% 
filter(is.na(nationalOrigin)) 

docauthorid,docauthorname,docid,docyear,docmonth,authorgender,agewriting,agedeath,religionNew,relMin,⋯,A,I,CCP,UWL,U,M,S,F,L,authorName
<fct>,<fct>,<fct>,<int>,<int>,<fct>,<int>,<int>,<fct>,<lgl>,⋯,<lgl>,<lgl>,<lgl>,<lgl>,<lgl>,<lgl>,<lgl>,<lgl>,<lgl>,<fct>
per0036138,"Jennings, Joseph, fl. 1931",S9845-D004,1831,4,M,,,,,⋯,,,,,,,,,,Joseph Jennings
per0029183,"Anonymous Government Agent in Upper Canada, fl. 1833",S9865-D020,1833,7,M,,,,,⋯,False,False,True,False,True,False,False,False,False,


I am going to omit both of these cases because they are missing data on what will be a key variable, nationalOrigin. I think this is acceptable because both are written to provide advice to a prospective emigrant rather than as personal letters to a friend or family member. Also, I believe that the data for this variable is truly missing at random, as the writers do not fit the profile of those usually excluded from cultural records. That is, these sound like men in a reasonably solid social position.

In [821]:
vals <- c("per0036138", "per0029183")
df1 <- df1[!df1$docauthorid %in% vals, ]
df1$nationalOrigin <- droplevels(df1$nationalOrigin)
summary(df1$nationalOrigin)

In [822]:
summary(df1$relMin)
summary(df1$religionNew)

   Mode   FALSE    TRUE    NA's 
logical     337       5      94 

The 9 Christians and the 105 NAs in the religionNew variable were coded to NA for the relMin variable but then 20 of these (all Protestants) were resolved through additional research, reducing the relMin NA count from 114 to 94. 
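
The counts in the paragraph above can be checked with a few lines of arithmetic. The numbers are taken from the text and from the `summary(df1$relMin)` output; the variable names are illustrative only.

```r
n_christian   <- 9     # coded "Christian" in religionNew
n_na_religion <- 105   # NA in religionNew
n_resolved    <- 20    # Protestants resolved through additional research

n_na_before <- n_christian + n_na_religion   # NAs in relMin before research
n_na_after  <- n_na_before - n_resolved      # NAs in relMin after research

# Counts reported by summary(df1$relMin)
relMin_counts <- c(false = 337, true = 5, na = 94)

n_na_before
n_na_after
sum(relMin_counts)   # total number of letters in df1 at this point
```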

In [823]:
df1 %>% 
filter(is.na(religionNew) | religionNew == "Christian") %>% 
filter(!is.na(relMin)) %>% 
select(docauthorname, religionNew, relMin, nationalOrigin) 

docauthorname,religionNew,relMin,nationalOrigin
<fct>,<fct>,<lgl>,<fct>
"White, Jane, 1831(?)-1867",,False,Irish
"White, Jane, 1831(?)-1867",,False,Irish
"White, Jane, 1831(?)-1867",,False,Irish
"White, Jane, 1831(?)-1867",,False,Irish
"White, Jane, 1831(?)-1867",,False,Irish
"White, Jane, 1831(?)-1867",,False,Irish
"Robb, Alexander, 1839-",,False,Irish
"Robb, Alexander, 1839-",,False,Irish
"Robb, Alexander, 1839-",,False,Irish
"Robb, Alexander, 1839-",,False,Irish


There is only one Catholic in the corpus, both before and after the NAs were investigated and resolved. Now I am creating indicator variables for Catholic and Irish so that, in the regression, the theoretically least empowered / privileged individuals can serve as the comparison group, whether as the base case or the test case.

In [824]:
df1  %>% 
filter(relMin==TRUE) %>% 
select(docauthorname, religionNew) %>% 
unique()

Unnamed: 0_level_0,docauthorname,religionNew
Unnamed: 0_level_1,<fct>,<fct>
1,"Ellis, Ann, fl. 1855",Mormon
2,"Llewellyn, Rees, fl. 1857",Mormon
3,"Anonymous Welsh Immigrant, Jane, fl. 1862",Mormon
4,"Hudson, Henry James, 1822-",Mormon
5,"Mee, Patrick, fl. 1844",Catholic


In [825]:
df1$catholic <- FALSE
df1$catholic[is.na(df1$relMin)] <- NA # Using relMin here to capture discovery work
df1$catholic[df1$religionNew=="Catholic"] <- TRUE # ReligionNew ok here bc none found during discovery
summary(df1$catholic)

   Mode   FALSE    TRUE    NA's 
logical     341       1      94 

In [826]:
df1$irish <- FALSE
df1$irish[is.na(df1$nationalOrigin)] <- NA
df1$irish[df1$nationalOrigin=="Irish"] <- TRUE
summary(df1$irish)

   Mode   FALSE    TRUE 
logical     395      41 
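
The three-step pattern used for `catholic` and `irish` (default to FALSE, mark unknowns as NA, then set the TRUE cases) can be sketched on toy vectors as follows. The values are hypothetical, and the use of `which()` is an assumption of this sketch: it makes explicit that NA comparisons are skipped.

```r
# Toy vectors (hypothetical values, one element per letter)
religionNew <- factor(c("Catholic", "Anglican", NA, NA, "Christian"))
relMin      <- c(TRUE, FALSE, FALSE, NA, NA)

catholic <- rep(FALSE, length(relMin))              # default: not Catholic
catholic[is.na(relMin)] <- NA                       # unknown status stays unknown
catholic[which(religionNew == "Catholic")] <- TRUE  # which() drops the NA comparisons

catholic
summary(catholic)
```

The third element shows the effect of the discovery work: `religionNew` is NA but `relMin` was resolved to FALSE, so the indicator is FALSE rather than NA.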

Now doing the same in the inverse so that all expected correlations are positive, which might make interpretation easier. For example, a 1 / TRUE on an indicator variable would then correlate with higher sentiment, just as a greater number of tokens or person mentions does.

In [827]:
df1$otherChristian <- FALSE
df1$otherChristian[!df1$religionNew=="Catholic"] <- TRUE # Mirror of the Catholic coding above
df1$otherChristian[df1$relMin==FALSE] <- TRUE # To capture discovery for Protestants
df1$otherChristian[is.na(df1$relMin)] <- NA # Same NA rule as for catholic
summary(df1$otherChristian)

   Mode   FALSE    TRUE    NA's 
logical       1     341      94 

In [828]:
df1$otherUK <- FALSE
df1$otherUK[!df1$nationalOrigin=="Irish"] <- TRUE
df1$otherUK[is.na(df1$nationalOrigin)] <- NA
summary(df1$otherUK)

   Mode   FALSE    TRUE 
logical      41     395 
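
In the summaries above, the inverse indicators have exactly the mirrored counts of `catholic` and `irish`. As a cross-check, and assuming every non-missing case should simply flip, the same mirror image can be obtained by logical negation, which leaves NA as NA. This is an alternative sketch on toy values, not the coding used in the cells above.

```r
# Toy indicator (hypothetical values)
catholic <- c(TRUE, FALSE, FALSE, NA)

otherChristian <- !catholic   # TRUE and FALSE swap, NA stays NA
otherChristian

# The counts mirror each other
table(catholic, useNA = "ifany")
table(otherChristian, useNA = "ifany")
```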

In [829]:
# Checking to make sure csv matches folder list.

# make list of doc ids in csv
csv <- sort(df1$docid)

# make list of doc ids in folder (path is relative to the working directory)
files <- list.files("SubsetNAILDOH")
folder <- sort(sub("\\.txt$", "", files)) # strip only a trailing ".txt"

setdiff(csv, folder)
setdiff(folder, csv)
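
The two `setdiff()` calls return empty results when the csv and the folder agree. The sketch below shows what a mismatch looks like, using a temporary folder and made-up document ids. The names `demo_dir`, `csv_ids` and `folder_ids` are hypothetical.

```r
# Build a throwaway folder with three text files (made-up ids)
demo_dir <- tempfile("SubsetDemo")
dir.create(demo_dir)
invisible(file.create(file.path(
  demo_dir, c("S0001-D001.txt", "S0001-D002.txt", "S0001-D003.txt"))))

# Ids as they might appear in the csv
csv_ids <- c("S0001-D001", "S0001-D002", "S0001-D004")

# Ids recovered from the file names
folder_ids <- sort(tools::file_path_sans_ext(
  list.files(demo_dir, pattern = "\\.txt$")))

missing_files <- setdiff(csv_ids, folder_ids)  # in the csv but not in the folder
extra_files   <- setdiff(folder_ids, csv_ids)  # in the folder but not in the csv

missing_files
extra_files

unlink(demo_dir, recursive = TRUE)             # clean up
```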

In [8]:
# What is the gender breakdown by doc and by author for letters?

lettersG <- df1 %>% # One row per letter
select(authorGender) # keeping only the gender column
table(lettersG$authorGender) # counts by letter
prop.table(as.matrix(table(lettersG$authorGender)), 2)*100 # percentages by letter

letterAuthorsG <- df1 %>% # One row per writer
select(docauthorid, authorGender) %>% # keeping author id and gender
unique() # unique author-gender pairs only
table(letterAuthorsG$authorGender) # counts by author
prop.table(as.matrix(table(letterAuthorsG$authorGender)), 2)*100 # percentages by author

df1 %>% 
filter(is.na(authorGender)) %>% 
select(docauthorid) %>% 
unique() %>% 
nrow()



  F   M 
283 153 

0,1
F,64.90826
M,35.09174



 F  M 
14 78 

0,1
F,15.21739
M,84.78261


In [831]:
# What is the nationalOrigin breakdown by doc and by author for letters?

lettersN <- df1 %>% # One row per letter
select(nationalOrigin) # keeping only the nationalOrigin column
table(lettersN$nationalOrigin) # counts by letter
prop.table(as.matrix(table(lettersN$nationalOrigin)), 2)*100 # percentages by letter

letterAuthorsN <- df1 %>% # One row per writer
select(docauthorid, nationalOrigin) %>% # keeping author id and nationalOrigin
unique() # unique author-origin pairs only
table(letterAuthorsN$nationalOrigin) # counts by author
prop.table(as.matrix(table(letterAuthorsN$nationalOrigin)), 2)*100 # percentages by author



 English    Irish Scottish    Welsh 
     327       41       42       26 

0,1
English,75.0
Irish,9.40367
Scottish,9.633028
Welsh,5.963303



 English    Irish Scottish    Welsh 
      35       10       26       21 

0,1
English,38.04348
Irish,10.86957
Scottish,28.26087
Welsh,22.82609
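
The per-letter and per-author percentages differ because prolific writers weigh more heavily in the per-letter view. A minimal base R sketch on made-up data shows the two ways of counting side by side. The data frame `toy` and its values are hypothetical.

```r
# Toy data: one row per letter, author a1 wrote three of the five letters
toy <- data.frame(
  docauthorid    = c("a1", "a1", "a1", "a2", "a3"),
  nationalOrigin = c("English", "English", "English", "Irish", "Welsh"),
  stringsAsFactors = FALSE)

# Breakdown by letter: every row counts
by_doc <- table(toy$nationalOrigin)

# Breakdown by author: reduce to unique author-origin pairs first
authors   <- unique(toy[, c("docauthorid", "nationalOrigin")])
by_author <- table(authors$nationalOrigin)

round(prop.table(by_doc) * 100, 1)
round(prop.table(by_author) * 100, 1)
```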


In [832]:
# Who are the Irish writers
df1 %>% 
filter(nationalOrigin=="Irish") %>% 
select(religionNew, docauthorname, relMin, catholic) %>% 
unique()

Unnamed: 0_level_0,religionNew,docauthorname,relMin,catholic
Unnamed: 0_level_1,<fct>,<fct>,<lgl>,<lgl>
1,,"Humphrey, James, fl. 1824",,
2,Catholic,"Mee, Patrick, fl. 1844",True,True
3,Anglican,"Carrothers, Nathaniel, ?-1881",False,False
4,Anglican,"Carrothers, Margaret",False,False
10,Methodist,"Carrothers, Joseph, 1793(?)-",False,False
20,,"White, Jane, 1831(?)-1867",False,False
26,,"Robb, Alexander, 1839-",False,False
38,,"Buchanan, J. C., fl. 1833",False,False
39,,"Graham, Thomas, fl. 1827",,
41,,"Sampson, William, 1764-1836",False,False


In [833]:
df1$authorGender <- df1$authorgender
df1$authorgender <- NULL

In [3]:
# Data
#df1 <- factorize(read.csv("20240514_PhD_NaildohSubset.csv")) # Put csv into a dataframe called df1
#colnames(df1) # Get an overview of the dataframe
#(df1)

In [26]:
temp <- df1 %>% # One row per writer
select(docauthorid, authorGender) %>% # keeping author id and gender
unique() # unique author-gender pairs only
table(temp$authorGender) # counts by author
prop.table(as.matrix(table(temp$authorGender)), 2)*100 # percentages by author


 F  M 
14 78 

0,1
F,15.21739
M,84.78261


In [22]:
temp <- df1 %>% 
select(docauthorid, irish, authorLocation, authorGender)  %>% 
unique()

table(temp$irish, temp$authorGender)
round(prop.table(table(temp$irish, temp$authorGender)), digits = 2)

# temp[temp$docauthorid=="per0029184",] # This person appears twice in the count because one letter was sent from the USA and one from Canada.
# The correct count is therefore 12 female and 70 male non-Irish writers.

table(temp$authorLocation, temp$authorGender)
round(prop.table(table(temp$authorLocation, temp$authorGender)), digits = 2)

table(temp$authorLocation, temp$irish)
round(prop.table(table(temp$authorLocation, temp$irish)), digits=2)

       
         F  M
  FALSE 12 71
  TRUE   2  8

       
           F    M
  FALSE 0.13 0.76
  TRUE  0.02 0.09

        
          F  M
  Canada  6 34
  USA     8 45

        
            F    M
  Canada 0.06 0.37
  USA    0.09 0.48

        
         FALSE TRUE
  Canada    32    8
  USA       51    2

        
         FALSE TRUE
  Canada  0.34 0.09
  USA     0.55 0.02
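
Because `unique()` is applied to rows that include `authorLocation`, a writer with letters from two locations is counted twice in these tables. The sketch below, on made-up data, shows one way to spot such writers with `duplicated()` and to count authors without the location column. The ids and the names `dup_ids` and `authors` are hypothetical.

```r
# Toy data: author p2 wrote from both the USA and Canada
temp <- unique(data.frame(
  docauthorid    = c("p1", "p2", "p2", "p3"),
  authorLocation = c("USA", "USA", "Canada", "Canada"),
  authorGender   = c("F", "M", "M", "M"),
  stringsAsFactors = FALSE))

# Author ids that occur on more than one row
dup_ids <- unique(temp$docauthorid[duplicated(temp$docauthorid)])
dup_ids

# Counting by author only, so each writer appears once
authors <- unique(temp[, c("docauthorid", "authorGender")])
table(authors$authorGender)
```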

In [491]:
write.csv(df1, 
          "20240514_PhD_NaildohSubset.csv", 
          row.names=FALSE)