# IED Subset

This is the first notebook in the series used to prepare the Irish Emigration Database subset for analysis. The goals of this series are:
<ol>
    <li>To identify the existing data structure;</li>
    <li>To identify how it must be restructured to match the data structure of the NAILDOH subset;</li>
    <li>To make the necessary changes;</li>
    <li>To merge the two dataframes.</li>
 </ol>

## Resources

In [2]:
# Libraries
library(tidyverse) # for data manipulation

In [3]:
# Functions
factorize <- function(df){ # Create a function
  for(i in which(sapply(df, class) == "character")) # that looks for variables with the character class 
      df[[i]] = as.factor(df[[i]]) # and converts them to factor (i.e., categorical) class
  return(df)
}

unfactorize <- function(df){ # Create a function
  for(i in which(sapply(df, class) == "factor")) # that looks for variables with the factor class 
      df[[i]] = as.character(df[[i]]) # and converts them to character class
  return(df)
}
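
The two helpers above compare `sapply(df, class)` against a single string, which may misbehave when a column carries more than one class (an ordered factor, for example). A minimal base R sketch of the same idea using `is.character()` and `is.factor()` follows. The names `factorize2`, `unfactorize2`, and `toy` are hypothetical and not part of the notebook.

```r
# Hypothetical variants of factorize()/unfactorize(); not part of the notebook.
factorize2 <- function(df) {
  # Convert character columns to factor, leave every other column untouched
  df[] <- lapply(df, function(x) if (is.character(x)) as.factor(x) else x)
  df
}

unfactorize2 <- function(df) {
  # Convert factor columns back to character, leave every other column untouched
  df[] <- lapply(df, function(x) if (is.factor(x)) as.character(x) else x)
  df
}

# Toy data frame for illustration only
toy <- data.frame(id = 1:2, name = c("a", "b"), stringsAsFactors = FALSE)
str(factorize2(toy))
str(unfactorize2(factorize2(toy)))
```

Assigning into `df[]` keeps the object a data frame while replacing its columns one by one.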

In [4]:
# Data
df <- factorize(read.csv("20230514_AM_ied.csv", header=FALSE, sep=",")) # Put csv into a dataframe called df

# Variable Names
colnames(df)
dim(df)
df[0:1,]


Unnamed: 0_level_0,V1,V2,V3,V4,V5,V6
Unnamed: 0_level_1,<int>,<fct>,<fct>,<fct>,<fct>,<int>
1,300090,04-12-1896,Letters (Emigrants),"Public Record Office, Northern Ireland","Edward Stanley, Katawa, Canada to Joshua Peel, Armagh; PRONI D889/7/1; CMSIED 300090",20481


In [5]:
colnames(df) <- c('idIED', 
                       'ddmmyyyy', 
                       'publisher', 
                       'sourcetitle', 
                       'description', 
                       'docid')
df[0:1,]

Unnamed: 0_level_0,idIED,ddmmyyyy,publisher,sourcetitle,description,docid
Unnamed: 0_level_1,<int>,<fct>,<fct>,<fct>,<fct>,<int>
1,300090,04-12-1896,Letters (Emigrants),"Public Record Office, Northern Ireland","Edward Stanley, Katawa, Canada to Joshua Peel, Armagh; PRONI D889/7/1; CMSIED 300090",20481


In [7]:
main <- read.csv("20240510_PhD_NaildohSubset.csv")
colnames(main) # Get an overview of the dataframe
dim(main)
main[0:1,]

Unnamed: 0_level_0,docauthorid,docauthorname,docid,docyear,docmonth,authorgender,agewriting,agedeath,religionNew,relMin,⋯,authorLocation,A,I,CCP,UWL,U,M,S,F,L
Unnamed: 0_level_1,<chr>,<chr>,<chr>,<int>,<int>,<chr>,<int>,<int>,<chr>,<lgl>,⋯,<chr>,<lgl>,<lgl>,<lgl>,<lgl>,<lgl>,<lgl>,<lgl>,<lgl>,<lgl>
1,per0001043,"Segale, Sister Blandina, 1850-1941",S1019-D002,1872,11,F,22,91,Catholic,True,⋯,USA,False,False,True,False,True,True,False,False,False


First, add a "publisher" column to main and set it to "NAILDOH".

Then, for df, keep docid as it is and create the following variables (the source column or fill value is given in parentheses):

<ul>
    <li>docauthorid (create from docauthorname)</li>
    <li>docauthorname (description)</li>
    <li>docyear (ddmmyyyy)</li>
    <li>docmonth (ddmmyyyy)</li>
    <li>authorgender (NA)</li>
    <li>agewriting (NA)</li>
    <li>agedeath (NA)</li>
    <li>relMin (NA)</li>
    <li>nationalOrigin ("Irish")</li>
    <li>authorLocation (description)</li>
    <li>U (NA)</li>
    <li>M (NA)</li>
    <li>S (NA)</li>
    <li>F (NA)</li>
    <li>L (NA)</li>
 </ul>
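
The list above can be cross-checked by comparing column names and classes directly. The sketch below uses two toy data frames (`main_toy` and `df_toy`, both hypothetical stand-ins for `main` and `df`) to show one way of listing which columns still need to be created, and which shared columns are stored with different classes. In the outputs shown earlier, `docid` is `<chr>` in main and `<int>` in df, which is the kind of mismatch this would surface.

```r
# Toy stand-ins for the two data frames; the real ones are `main` and `df`.
main_toy <- data.frame(docid = "S1019-D002", docyear = 1872L, docmonth = 11L,
                       authorgender = "F", publisher = "NAILDOH",
                       stringsAsFactors = FALSE)
df_toy <- data.frame(idIED = 300090L, ddmmyyyy = "04-12-1896",
                     publisher = "IED", docid = 20481L,
                     stringsAsFactors = FALSE)

missing_in_df <- setdiff(colnames(main_toy), colnames(df_toy)) # still to create in df
extra_in_df   <- setdiff(colnames(df_toy), colnames(main_toy)) # to drop or rename
shared        <- intersect(colnames(main_toy), colnames(df_toy))

# Shared columns whose (first) class differs between the two data frames
class_main <- vapply(main_toy[shared], function(x) class(x)[1], character(1))
class_df   <- vapply(df_toy[shared], function(x) class(x)[1], character(1))
mismatched <- shared[class_main != class_df]

missing_in_df
extra_in_df
mismatched
```

Class mismatches matter for the final merge because row-binding or joining on a column stored as character in one data frame and integer in the other may fail or coerce silently, depending on the function used.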
  


In [8]:
# Change to main:
main$publisher <- "NAILDOH"
main

docauthorid,docauthorname,docid,docyear,docmonth,authorgender,agewriting,agedeath,religionNew,relMin,⋯,A,I,CCP,UWL,U,M,S,F,L,publisher
<chr>,<chr>,<chr>,<int>,<int>,<chr>,<int>,<int>,<chr>,<lgl>,⋯,<lgl>,<lgl>,<lgl>,<lgl>,<lgl>,<lgl>,<lgl>,<lgl>,<lgl>,<chr>
per0001043,"Segale, Sister Blandina, 1850-1941",S1019-D002,1872,11,F,22,91,Catholic,TRUE,⋯,FALSE,FALSE,TRUE,FALSE,TRUE,TRUE,FALSE,FALSE,FALSE,NAILDOH
per0001043,"Segale, Sister Blandina, 1850-1941",S1019-D004,1872,12,F,22,91,Catholic,TRUE,⋯,FALSE,FALSE,TRUE,FALSE,TRUE,TRUE,FALSE,FALSE,FALSE,NAILDOH
per0001043,"Segale, Sister Blandina, 1850-1941",S1019-D005,1872,12,F,22,91,Catholic,TRUE,⋯,FALSE,FALSE,TRUE,FALSE,TRUE,TRUE,FALSE,FALSE,FALSE,NAILDOH
per0001043,"Segale, Sister Blandina, 1850-1941",S1019-D006,1872,12,F,22,91,Catholic,TRUE,⋯,FALSE,FALSE,TRUE,FALSE,TRUE,TRUE,FALSE,FALSE,FALSE,NAILDOH
per0001043,"Segale, Sister Blandina, 1850-1941",S1019-D007,1873,3,F,23,91,Catholic,TRUE,⋯,FALSE,FALSE,TRUE,FALSE,TRUE,TRUE,FALSE,FALSE,FALSE,NAILDOH
per0001043,"Segale, Sister Blandina, 1850-1941",S1019-D008,1873,7,F,23,91,Catholic,TRUE,⋯,FALSE,FALSE,TRUE,FALSE,TRUE,TRUE,FALSE,FALSE,FALSE,NAILDOH
per0001043,"Segale, Sister Blandina, 1850-1941",S1019-D009,1873,9,F,23,91,Catholic,TRUE,⋯,FALSE,FALSE,TRUE,FALSE,TRUE,TRUE,FALSE,FALSE,FALSE,NAILDOH
per0001043,"Segale, Sister Blandina, 1850-1941",S1019-D010,1874,6,F,24,91,Catholic,TRUE,⋯,FALSE,FALSE,TRUE,FALSE,TRUE,TRUE,FALSE,FALSE,FALSE,NAILDOH
per0001043,"Segale, Sister Blandina, 1850-1941",S1019-D011,1874,11,F,24,91,Catholic,TRUE,⋯,FALSE,FALSE,TRUE,FALSE,TRUE,TRUE,FALSE,FALSE,FALSE,NAILDOH
per0001043,"Segale, Sister Blandina, 1850-1941",S1019-D012,1876,6,F,26,91,Catholic,TRUE,⋯,FALSE,FALSE,TRUE,FALSE,TRUE,TRUE,FALSE,FALSE,FALSE,NAILDOH


In [9]:
# Data
statesUSA <- factorize(read.csv("20240429_PhD_StatesUSA.csv", header=TRUE, sep=",")) # Put csv into a dataframe called statesUSA
glimpse(statesUSA)

Rows: 51
Columns: 2
$ State        <fct> Alabama, Alaska, Arizona, Arkansas, California, Colorado,…
$ Abbreviation <fct> AL, AK, AZ, AR, CA, CO, CT, DE, DC, FL, GA, HI, ID, IL, I…


In [10]:
# Data
namesFemale <- factorize(read.table("female.txt", header=FALSE)) # Put text file into a dataframe called namesFemale
glimpse(namesFemale)

Rows: 5,004
Columns: 1
$ V1 <fct> Abagael, Abagail, Abbe, Abbey, Abbi, Abbie, Abby, Abigael, Abigail,…


In [11]:
# Data
namesMale <- factorize(read.table("male.txt", header=FALSE)) # Put text file into a dataframe called namesMale
glimpse(namesMale)

Rows: 2,943
Columns: 1
$ V1 <fct> Aamir, Aaron, Abbey, Abbie, Abbot, Abbott, Abby, Abdel, Abdul, Abdu…


## Changes

In [12]:
# First the easy changes to df
df$publisher <- "IED"
df$nationalOrigin <- "Irish"
df

idIED,ddmmyyyy,publisher,sourcetitle,description,docid,nationalOrigin
<int>,<fct>,<chr>,<fct>,<fct>,<int>,<chr>
300090,04-12-1896,IED,"Public Record Office, Northern Ireland","Edward Stanley, Katawa, Canada to Joshua Peel, Armagh; PRONI D889/7/1; CMSIED 300090",20481,Irish
9501251,12-01-1891,IED,"Public Record Office, Northern Ireland","From, Brooklyn, N.Y., to ""Dear James"" [no address]; PRONI T 3033/10; CMSIED 9501251",20487,Irish
300018,01-06-1822,IED,Ulster-American Folk Park.,"James Kelly, Desertmartin to John Kelly, Pennsylvania;The Kelly Family Documents: Copyright Retained by The UlsterAmerican Folk Park.; CMSIED 300018",20514,Irish
9408355,01-10-1842,IED,"Public Record Office, Northern Ireland","Alexander McCloy, Pennsylvania, to Cousin, [Ireland? or Liverpool, England?]; PRONI D1444/18; CMSIED 9408355",20519,Irish
9003061,10-01-1896,IED,"Public Record Office, Northern Ireland","George Kirkpatrick, Toronto, to Rev. Alex. Kirkpatrick, Co Antrim; PRONI D 1424/11; CMSIED 9003061",20522,Irish
9011027,13-02-1873,IED,"Public Record Office, Northern Ireland","William Porter, U.S.A. to Robert Porter, Ireland; PRONI D 1152/3/25; CMSIED 9011027",20529,Irish
9905110,16-05-1891,IED,Ulster-American Folk Park.,"G.R. Wood, Holly, Michigan to Annie Weir, Michigan;Copyright Retained by Mrs. Linda Weir; CMSIED 9905110",20530,Irish
9006021,01-01-1862,IED,"Public Record Office, Northern Ireland","Alexander Robb, Near Panama, to Family [Dundonald, Co Down?]; PRONI T 1454/6/2; CMSIED 9006021",20538,Irish
9411034,15-02-1858,IED,"Public Record Office, Northern Ireland","N. Carrothers, Ontario to W.Carrothers, Farnaght, Fermanagh; PRONI T3734; CMSIED 9411034",20563,Irish
8906036,08-09-1886,IED,"Public Record Office, Northern Ireland","James P. Breeze, Concord, to Charley Breeze.; PRONI T 1381/9; CMSIED 8906036",20568,Irish


In [13]:
vars <- c("authorgender", "agewriting", "agedeath", "relMin", "U", "M", "S", "F", "L")
df[vars] <- NA
df

idIED,ddmmyyyy,publisher,sourcetitle,description,docid,nationalOrigin,authorgender,agewriting,agedeath,relMin,U,M,S,F,L
<int>,<fct>,<chr>,<fct>,<fct>,<int>,<chr>,<lgl>,<lgl>,<lgl>,<lgl>,<lgl>,<lgl>,<lgl>,<lgl>,<lgl>
300090,04-12-1896,IED,"Public Record Office, Northern Ireland","Edward Stanley, Katawa, Canada to Joshua Peel, Armagh; PRONI D889/7/1; CMSIED 300090",20481,Irish,,,,,,,,,
9501251,12-01-1891,IED,"Public Record Office, Northern Ireland","From, Brooklyn, N.Y., to ""Dear James"" [no address]; PRONI T 3033/10; CMSIED 9501251",20487,Irish,,,,,,,,,
300018,01-06-1822,IED,Ulster-American Folk Park.,"James Kelly, Desertmartin to John Kelly, Pennsylvania;The Kelly Family Documents: Copyright Retained by The UlsterAmerican Folk Park.; CMSIED 300018",20514,Irish,,,,,,,,,
9408355,01-10-1842,IED,"Public Record Office, Northern Ireland","Alexander McCloy, Pennsylvania, to Cousin, [Ireland? or Liverpool, England?]; PRONI D1444/18; CMSIED 9408355",20519,Irish,,,,,,,,,
9003061,10-01-1896,IED,"Public Record Office, Northern Ireland","George Kirkpatrick, Toronto, to Rev. Alex. Kirkpatrick, Co Antrim; PRONI D 1424/11; CMSIED 9003061",20522,Irish,,,,,,,,,
9011027,13-02-1873,IED,"Public Record Office, Northern Ireland","William Porter, U.S.A. to Robert Porter, Ireland; PRONI D 1152/3/25; CMSIED 9011027",20529,Irish,,,,,,,,,
9905110,16-05-1891,IED,Ulster-American Folk Park.,"G.R. Wood, Holly, Michigan to Annie Weir, Michigan;Copyright Retained by Mrs. Linda Weir; CMSIED 9905110",20530,Irish,,,,,,,,,
9006021,01-01-1862,IED,"Public Record Office, Northern Ireland","Alexander Robb, Near Panama, to Family [Dundonald, Co Down?]; PRONI T 1454/6/2; CMSIED 9006021",20538,Irish,,,,,,,,,
9411034,15-02-1858,IED,"Public Record Office, Northern Ireland","N. Carrothers, Ontario to W.Carrothers, Farnaght, Fermanagh; PRONI T3734; CMSIED 9411034",20563,Irish,,,,,,,,,
8906036,08-09-1886,IED,"Public Record Office, Northern Ireland","James P. Breeze, Concord, to Charley Breeze.; PRONI T 1381/9; CMSIED 8906036",20568,Irish,,,,,,,,,


In [14]:
# Now the tougher ones. First break the date out into month and year.
df$ddmmyyyy <- as.character(df$ddmmyyyy)
str(df$ddmmyyyy)

 chr [1:3055] "04-12-1896" "12-01-1891" "01-06-1822" "01-10-1842" ...


In [16]:
# Now new ones based on old ones
# Dates
df$docmonth <- as.factor(str_sub(df$ddmmyyyy, 4, 5))
df$docyear <- as.numeric(str_sub(df$ddmmyyyy, -4,-1))
vars <- c("docmonth", "docyear")
df[vars] %>% 
summary()

    docmonth       docyear    
 01     : 376   Min.   :1750  
 12     : 291   1st Qu.:1844  
 11     : 276   Median :1865  
 06     : 262   Mean   :1860  
 09     : 250   3rd Qu.:1888  
 03     : 243   Max.   :1913  
 (Other):1357                 
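
The substring approach above assumes that every value is strictly in dd-mm-yyyy form. As a hedged alternative, the sketch below parses the strings with `as.Date()` so that impossible dates surface as `NA`, and returns integers, which is how `docmonth` and `docyear` are stored in main. The vector `dates` is illustrative only, and its last value is deliberately invalid.

```r
# Illustrative values; the last one (31 February) cannot exist
dates <- c("04-12-1896", "12-01-1891", "31-02-1873")

parsed   <- as.Date(dates, format = "%d-%m-%Y") # NA where parsing fails
docmonth <- as.integer(format(parsed, "%m"))    # month as integer
docyear  <- as.integer(format(parsed, "%Y"))    # year as integer

data.frame(dates, docmonth, docyear)
sum(is.na(parsed)) # number of values that need manual inspection
```

Whether integer months are preferable to the factor used in the cell above depends on how the merged data will be analysed. The point of the sketch is only that both data frames should end up with the same type.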

In [17]:
#Names
unique(df$description)[0:50]

For this variable, I am concerned with the sender names, not the locations or recipients, so I am going to drop everything after the first comma.

In [18]:
df$docauthorname <- gsub(",.*\\b", "", df$description)
unique(df$docauthorname)[0:50]

Now I am dropping everything after "to" in those cases where it appears.

In [19]:
df$docauthorname <- gsub(".to.*","", df$docauthorname)
unique(df$docauthorname)[0:50]
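
One caveat about the pattern `".to.*"`: the leading `.` matches any character, so the pattern also fires inside words that contain "to". Truncated values that appear later in this notebook, such as "Washin" or "Pembe", may be products of this. The sketch below contrasts that pattern with a stricter one that requires whitespace on both sides of "to". The helper `strip_recipient` and the example strings are hypothetical, and matching "To" case-insensitively is an assumption that differs from the cell above.

```r
x <- c("Edward Stanton", "Arthe Dunamanagh To Mrs A.W. Smyth", "James Kelly")

# Pattern used in the cell above: also truncates "Stanton"
loose <- gsub(".to.*", "", x)

# Hypothetical stricter variant: "to" must stand alone as a word
strip_recipient <- function(x) {
  sub("\\s+to\\s.*$", "", x, ignore.case = TRUE)
}
strict <- strip_recipient(x)

data.frame(x, loose, strict)
```

Switching patterns would change the intermediate values that the later cells match on, so it is best treated as something to try on a copy of the column first.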

In [20]:
# Cleaning up a few more problems.
df$docauthorname <- gsub("From", "", df$docauthorname)# From
df$docauthorname <- gsub("\\[.*\\]","", df$docauthorname) #Brackets and contents thereof
df$docauthorname <- gsub("\\;.*","", df$docauthorname) #Everything after semi-colon
df$docauthorname <- gsub(".for.*","", df$docauthorname) #Everything after "for"
unique(df$docauthorname)[0:50]

Reference letters are fine to keep, but the phrase "Reference written by" does not belong in the author variable.

In [21]:
df$docauthorname <- gsub("Reference written by ","", df$docauthorname)
unique(df$docauthorname)[0:50]

Now, what is this "Sample" entry?

In [22]:
df %>% 
filter(docauthorname==" Sample") %>% 
select(docid)

docid
<int>
20976


In [23]:
getwd()

In [26]:
read_file(paste0("/Users/alaynemoody/Dropbox/GradStudies/Flinders/Dissertation/IED/", "20976.txt"))

"Sample" is not a name, and the letter itself does not seem to include one. I am therefore removing the value so that it becomes blank and is then recoded as Anonymous.
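
Several cells in this notebook read a letter by pasting a docid onto an absolute path. A small helper can keep the directory in one place. This is only a sketch: `read_letter` and `letters_dir` are hypothetical names, the temporary directory stands in for the real letters folder, and base `readLines()` is used in place of `read_file()`.

```r
# Hypothetical helper; not part of the notebook.
read_letter <- function(docid, dir) {
  path <- file.path(dir, paste0(docid, ".txt"))
  if (!file.exists(path)) stop("No letter file found: ", path)
  paste(readLines(path, warn = FALSE), collapse = "\n")
}

# Demonstration with a temporary directory and a made-up docid
letters_dir <- tempdir()
writeLines(c("First line", "Second line"), file.path(letters_dir, "99999.txt"))
text <- read_letter(99999, letters_dir)
cat(text)
```

With the real data, `letters_dir` would be set once to the folder that holds the `.txt` files, and each inspection would become a call such as `read_letter(20976, letters_dir)`.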

In [24]:
df$docauthorname <- gsub(" Sample","", df$docauthorname) 
unique(df$docauthorname)[0:50]

In [25]:
df$docauthorname[df$docauthorname==""] <- "Anonymous"
unique(df$docauthorname)[0:50]

In [26]:
# Omit letter from/by
df$docauthorname <- gsub(".*letter from ","", df$docauthorname)
df$docauthorname <- gsub(".*Letter from ","", df$docauthorname)
df$docauthorname <- gsub(".*letter by ","", df$docauthorname)
df$docauthorname <- gsub(".*Letter by ","", df$docauthorname)
unique(df$docauthorname)[0:50]

In [27]:
toMatch <- c("\\[|\\]", "\\(|\\)")
unique (grep(paste(toMatch,collapse="|"), 
                        df$docauthorname, value=TRUE))

In [28]:
# Code for testing method bypassed
#x <- "H.Y. [Hami"
df$docauthorname <- gsub(".*\\[.*","", df$docauthorname) #Blank the whole value if it contains an opening bracket
#gsub("\\[|\\]", "", x) #Just the brackets
#x #Original

#y <- "Agnes Shakespeare (Nesta)"
df$docauthorname <- gsub("\\(.*\\)","", df$docauthorname) #Parentheses and contents thereof
#gsub("\\(|\\)", "", y) #Just the parentheses
#y #Original

unique(grep(paste(toMatch,collapse="|"), df$docauthorname, value=TRUE))
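
The patterns in the cell above behave quite differently from one another, so a small side-by-side comparison may help. The sketch below runs the four patterns from that cell on the two test strings from its commented-out lines. The result variable names are hypothetical.

```r
x <- "H.Y. [Hami"
y <- "Agnes Shakespeare (Nesta)"

whole_x    <- gsub(".*\\[.*", "", x)  # blanks the entire value once a "[" is present
marks_x    <- gsub("\\[|\\]", "", x)  # removes only the bracket characters
contents_y <- gsub("\\(.*\\)", "", y) # removes the parentheses and their contents
marks_y    <- gsub("\\(|\\)", "", y)  # removes only the parenthesis characters

c(whole_x, marks_x, contents_y, marks_y)
# "" "H.Y. Hami" "Agnes Shakespeare " "Agnes Shakespeare Nesta"
```

The first pattern therefore turns any bracketed name into an empty string, which a later cell recodes as "Anonymous". The third leaves a trailing space that still needs trimming.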

In [29]:
df$docauthorname[df$docauthorname==""] <- "Anonymous"
unique(df$docauthorname)[0:50]

In [30]:
df %>% 
filter(docauthorname=="Incomplete:  ") %>% 
select(docid)

docid
<int>
38606


In [34]:
read_file(paste0("/Users/alaynemoody/Dropbox/GradStudies/Flinders/Dissertation/IED/", "38606.txt"))

In [31]:
df %>%
filter(docauthorname=="Incomplete: ") %>% 
select(docid, description)

docid,description
<int>,<fct>
24149,"Incomplete: [?], Michigan To Annie Weir, [?];Copyright Retained by Mrs Linda Weir; CMSIED 9906061"
32432,"Incomplete: [W.J. Weir?], Fresno, to ""My Dear Annie"";Copyright Retained by Mrs Linda Weir; CMSIED 9906160"
32652,"Incomplete: [?], Edenclaw, Co. Fermanagh to ""Dear Sister"";Copyright Retained by Mrs Linda Weir; CMSIED 9906177"
35929,"Incomplete: [Bella Weir?] to ""Dear Mother, Sisters Bros"";Copyright Retained by Mrs Linda Weir; CMSIED 9906155"
37323,"Incomplete: [W. Weir?], Fresno, to ""Dear Annie"";Copyright Retained by Mrs Linda Weir; CMSIED 9906165"
40815,"Incomplete: [W.J. Weir?], Fresno, to ""Dear Annie"";Copyright Retained by Mrs Linda Weir; CMSIED 9906169"
40882,"Incomplete: [?], Michigan to Annie Weir, Pontiac;Copyright Retained by Mrs Linda Weir; CMSIED 9906170"
44825,"Incomplete: [?] [New Brunswick?], to ""My very dear Sister"",; PRONI D 1792/; CMSIED 9909292"
51927,"Incomplete: [W.J. Weir?], Fresno, to ""My Dear Annie"";Copyright Retained by Mrs Linda Weir; CMSIED 9906167"


In [32]:
docids <- c("32432", "40815", "51927")
df$docauthorname[df$docid %in% docids] <- "W.J. Weir"

docids <- c("35929")
df$docauthorname[df$docid %in% docids] <- "Bella Weir"

docids <- c("37323")
df$docauthorname[df$docid %in% docids] <- "W. Weir"

docids <- c("24149", "32652", "40882", "44825")
df$docauthorname[df$docid %in% docids] <- "Anonymous"

df %>%
filter(docauthorname=="Incomplete: ") %>% 
select(docid, description)

docid,description
<int>,<fct>
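
The manual recodes by docid could also be written as a single lookup table, which keeps each docid next to its corrected name. The sketch below is a hedged illustration on a toy data frame. The names `toy`, `fixes`, and `hit` are hypothetical, and only three of the docids from the cell above are reproduced.

```r
# Toy data frame for illustration only
toy <- data.frame(docid = c(32432L, 35929L, 24149L, 20481L),
                  docauthorname = c("Incomplete: ", "Incomplete: ",
                                    "Incomplete: ", "Edward Stanley"),
                  stringsAsFactors = FALSE)

# Named vector: names are docids, values are the corrected author names
fixes <- c("32432" = "W.J. Weir", "35929" = "Bella Weir", "24149" = "Anonymous")

hit <- match(as.character(toy$docid), names(fixes)) # NA where no fix applies
toy$docauthorname[!is.na(hit)] <- unname(fixes[hit[!is.na(hit)]])
toy
```

Rows without an entry in `fixes` are left unchanged, so the table can grow as more letters are inspected.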


In [33]:
df %>% 
filter(docauthorname=="Philadelphia.") %>% 
select(docid, description)

docid,description
<int>,<fct>
35118,Extract of a Letter from Philadelphia.;The Belfast Mercury or Freeman's Chronicle 30th Sept 1783.; CMSIED 9407168
35820,"Extract of a Letter from Philadelphia.;The Northern Star, July 6th to July 9th 1795.; CMSIED 9408215"
41043,"Extract of a Letter from Philadelphia.;The Belfast Mercury, 27th April 1784.; CMSIED 9407189"
43633,Extract of a Letter from Philadelphia.;The Belfast Mercury 28th November 1783.; CMSIED 9407176


These are all published letters, which is fine in principle, but let's check the content.

In [38]:
read_file(paste0("/Users/alaynemoody/Dropbox/GradStudies/Flinders/Dissertation/IED/", "43633.txt"))

These are all fine to include as they are of a personal nature. I will simply recode docauthorname to "Anonymous".

In [34]:
df$docauthorname[df$docauthorname=="Philadelphia."] <- "Anonymous"
df %>% 
filter(docauthorname=="Philadelphia.") %>% 
select(docid, description)

docid,description
<int>,<fct>


In [35]:
df %>% 
filter(docauthorname=="Letter") %>% 
select(docid, description)

docid,description
<int>,<fct>
33451,"Letter to Albert Estopinal, [Washington D.C.?];Copyright Retained by Brendan O'Reilly; CMSIED 9808568"
35325,"Letter to the Editor on American Foreign and Domestic Policy;The Armagh Guardian, Tuesday, September 16, 1845; CMSIED 9407149"
40831,"Letter to Emigrants from British Vice-Consul, New York;The Belfast Newsletter, Tuesday, 22 October, 1833; CMSIED 200443"
46536,Letter to John Humphrey in South Carolina.; PRONI D3561/; CMSIED 9306115


Again, checking on content.

In [41]:
read_file(paste0("/Users/alaynemoody/Dropbox/GradStudies/Flinders/Dissertation/IED/", "46536.txt"))

Only 40831 is of a personal nature, but it was not written in North America. The others were written in a professional capacity or offer political commentary. Removing all four.

In [36]:
nrow(df)
docids <- c("33451", "35325", "40831", "46536")
df <- df[!df$docid %in% docids,]
nrow(df)
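
Comparing `nrow(df)` before and after shows how many rows were removed, but not whether every listed docid was actually present. A minimal sketch of a removal helper that warns about docids it cannot find is given below. `drop_docids` and `toy` are hypothetical names, and the toy data frame holds only two of the four docids from the cell above.

```r
# Hypothetical helper; not part of the notebook.
drop_docids <- function(data, docids) {
  absent <- setdiff(as.character(docids), as.character(data$docid))
  if (length(absent) > 0) {
    warning("docid not found: ", paste(absent, collapse = ", "))
  }
  # Keep only the rows whose docid is not in the removal list
  data[!as.character(data$docid) %in% as.character(docids), , drop = FALSE]
}

toy  <- data.frame(docid = c(33451L, 35325L, 20481L))
kept <- drop_docids(toy, c("33451", "35325"))
nrow(toy) - nrow(kept) # number of rows removed
```

Converting both sides with `as.character()` makes the comparison explicit, since docid is stored as an integer in df while the removal lists are written as strings.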

In [37]:
df %>% 
filter(grepl("Incomplete", docauthorname)) %>% 
select(docid, description)

docid,description
<int>,<fct>
21324,"Incomplete: [Isabella Moore?] San Francisco, to ""Dear Sister"";Copyright Retained by Mrs Linda Weir; CMSIED 9906156"
23201,"Incomplete: [?] Kalamazoo, To Annie Weir, [?];Copyright Retained By Mrs Linda Weir; CMSIED 9906119"
23310,"Incomplete letter U.S.A. to ""Dear William John"", Shell Creek.; PRONI D 1558/1/1; CMSIED 9708168"
25369,"Incomplete letter [Indiana?] to ""My Dear Mother"".;Donated by Mrs. I. J. Beattie; CMSIED 9904184"
27824,"Incomplete: [?] Birmingham, To ""Dear Sister"";Copyright Retained By Mrs Linda Weir; CMSIED 9906118"
29751,"Incomplete: [?] California To ""Dear Sister"";Copyright Retained By Mrs. Linda Weir; CMSIED 9906122"
30112,"Incomplete: [?] Ardvarney, to ""My Dear Sister"";Copyright Retained by Mrs Linda Weir; CMSIED 9906095"
32097,"Incomplete: W.J. Weir, Fresno, to Annie Weir, Pontiac;Copyright Retained By Mrs Linda Weir; CMSIED 9906162"
36625,"Incomplete letter: Officer of the 46th to Irish Gentleman;The London-Derry Journal and General Advertiser, Friday, August9th, 1776.; CMSIED 9909238"
38191,"Incomplete - [W.J. Weir?] Fresno, to ""My Dear Annie"";Copyright Retained by Mrs Linda Weir; CMSIED 9906164"


In [38]:
docids <- c("21324", "38606")
df$docauthorname[df$docid %in% docids] <- "Isabella Moore"

docids <- c("32097", "38191", "52472")
df$docauthorname[df$docid %in% docids] <- "W.J. Weir"

docids <- c("43840")
df$docauthorname[df$docid %in% docids] <- "William Love"

docids <- c("40130")
df$docauthorname[df$docid %in% docids] <- "J. Magill"

docids <- c("23201", "23310", "25369", "27824", "29751", "30112", "36625")
df$docauthorname[df$docid %in% docids] <- "Anonymous"

df %>% 
filter(grepl("Incomplete", docauthorname)) %>% 
select(docid, description)

docid,description
<int>,<fct>


In [39]:
df %>% 
filter(grepl("George Ritchie", docauthorname)) %>% 
select(docid, description)

docid,description
<int>,<fct>
30953,"George Ritchie, NY, to ""My Dear Father & Mother"", Londonderry.; PRONI T3292/2; CMSIED 9406202"
38584,"George Ritchie NY to James Ritchie, Co Londonderry; PRONI T3292/1; CMSIED 9406201"


In [40]:
docids <- c("38584")
df$docauthorname[df$docid %in% docids] <- "George Ritchie"

In [41]:
#test <- df$docauthorname
#test %>% 
 #str_remove_all('\"') %>% 
  #str_squish() %>% 
#unique() %>% 
#sort()

df$docauthorname <- df$docauthorname %>% 
str_remove_all('\"') %>% 
str_squish()
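
For reference, the effect of `str_remove_all('\"')` followed by `str_squish()` can be approximated in base R. The sketch below is an assumed equivalent for illustration, not the code the notebook runs. `clean_name` and the example strings are hypothetical.

```r
# Hypothetical base R approximation of the stringr pipeline above
clean_name <- function(x) {
  x <- gsub('"', "", x, fixed = TRUE) # drop double quotes
  x <- gsub("\\s+", " ", x)           # collapse runs of whitespace to one space
  trimws(x)                           # trim leading and trailing whitespace
}

cleaned <- clean_name(c("  George   Ritchie ", "\"Dear James\""))
cleaned
```

Normalising whitespace at this point matters because the later cells match author names by exact string equality.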

In [42]:
# To inspect
vals <- c(#"American Letter", # Letter to editor, more like a collection of journal entries. Omit.
          #"An Account of a Visit", # This appears to be a letter from an individual to a congregation. Keep.
          "an American Officer", 
          #"An Emigrant in Chicago", # Personal letter to parents. Keep.
          "an Irish Emigrant", 
          #"An Irishman in Cal", # This appears to be an open letter, probably in a publication of some kind. Omit.
          #"Anderson Canada", # Personal letter. Keep but fix in cell below.
          #"Baltimore", # All four of these to be deleted because they are impersonal (e.g., business, organizational)
          #"Benjamin Neely Co. Derry To his Brother", #This appears to contain two letters, one from Benjamin and one from William, both in South Carolina
          #"Capt. F. R. M. Crozier", #Personal letter. Keep.
          #"Chancellor of the Duchy of Carlisle", # Letters of recommendation. Omit.
          #"Cheque paid by William Parke", #Promise to pay. Omit
          #"Co. Tyrone", # Sent from Ireland (not USA or Can). Omit.
          #"Cork", # Sent from Ireland (not USA or Can). Omit.
          #"Countess of Dufferin", # Personal letter. Keep but fix name.
          #"Craven County", #Personal letter from America. Keep but change name to Anonymous.
          #"Cumberland Co.", #Personal letter w/intro. Keep but correct name and delete intro (done)
          #"Danville", #Possibly published but personal, originally from one individual to his father. Keep but change to Anonymous.
          #"Description of the P", # Personal letter. Keep but change to Anonymous.
          #"Destruction of the Irish Regiment at Fredericksburg.", #Personal letter that was later published. Keep but change to Anonymous.
          #"Diaries of James Harshaw", # Personal letter to his aunt. Keep but correct docauthorname
          #"Directions", #Personal letter later published. Keep but change to Anonymous.
          #"Dyer et al", #Business letter. Omit.
          #"Earl of Ava", #These appear to be all personal letters from USA or Canada. Keep but change name to Archie.
          #"Earl of Caledon", #Personal letter from Quebec. Keep but change name to Caledon.
          #"Emigrant Letter Regarding Conditions", #No content. Omit.
          #"Emigration", #Keep 30890 and change to Anonymous. Omit 43332.
          #"Envelope", # Just an envelope. Omit.
          #"Envelope:", # Just an envelope. Omit.
          #"Envelopes Sent To Rev J. Orr", # Just envelopes. Omit.
          #"Erin-Go-Bragh", # Letter to editor. Not personal. Omit.
          #"Ernest Cochrane Belfast", # Written in Belfast: Omit
          "Extract a Letter of a Recent Emigrant",
          "Extract an American Letter.",
          "Extract from a Gentleman Who Sailed on the Wilmin",
          "Extract from An Emigrant's Letter Discussing Problems in America.",
          "Extract Local Paper of a Letter Printed In America.",
          "Extract of a Letter Newry Concerning Returning Irishmen.",
          "Extract of a Letter on Canadian Emigration",
          "Extract of a Letter Oregon Terr",
          "Extract Of A Letter Philadelphia.",
          "Extract One of the Drennan Letters",
          "Extracts a Letter Dated San Francisco",
          "Extracts an Emigrant Letter New Orleans",
          "Extracts from Letter - Writer Unknown.",
          "Extracts of letters from New York",
          "Impressions of Cal",
          "Irish Emigrant Pigua",
          "Irishmen In America - Important.",
          "Irishmen in Virginia in 1784",
          "Joseph Philadelphia",
          "Letter",
          "Letter Concerning Problems of Emigration.",
          "Letter Mrs. Martha. L. Weyman Re the Savage family.",
          "Letter of Thanks by Belfast Printers",
          "Letter of Thanks from Passengers of Ship Prosperity",
          "Letter of Thanks from Passengers of Ship Riverdale",
          "Letter Re - The Jane McCullagh Estate",
          "Letter written by Thomas Gribbin",
          "Letters America",
          "Letters from America",
          "Letters from Mrs. Lizzie Street",
          "Letters from the McGinty and Crosby families",
          "Letters of the FitzGerald Family of Co. Tipperary",
          "Limerick emigrant",
          "Lord Alexander Caledon", #If ok, change to form below
          "Lord Caledon", 
          "Loving Mother & Sisters",
          "Lowell Emigrant",
          "ME",
          "Member of the Coman family",
          "More Returned Emigrants from United States.",
          "Mortimer & Harris",
          "Moyers & Consaul",
          "My Life in the Army William McCarter",
          "New Orleans",
          "New York",
          "Newcastle",
          "NY",
          "One of The Drennan Letters",
          "Oregon .",
          "Papers of Prof. E. R. R. Green.",
          "Passenger on Board the Ship Faithful Steward",
          "Passenger on S.S. Caledonia",
          "Passenger on Ship Iphigenia.",
          "Passenger who Sailed on the Ship Josephine",
          "Pembe",
          "Pennsylvania",
          "Petition",
          "Petition of John Caldwell Senior",
          "Philadelphia",
          "Poscript",
          "Postcard from S.",
          "Postcard Ralph",
          "Prospects",
          "Prospects of Emigrants in Canada.",
          "Protestant Episcopal Church in the United States",
          "pupils",
          "Quebec",
          "Receipt from James Clarke",
          "Request",
          "Return Migration from America",
          "Return of Emigrants from America.",
          "Returned Emigrants from United States.",
          "Salvation Army Emigrants",
          "Savannah",
          "Sep",
          "Sir Francis Hincks",
          "Sister Bell",
          "Sister M. Mamerta",
          "Sister Rose",
          "Smiley",
          "Son in America",
          "South Carolina",
          "Susan McAleece County Tyrone",
          "Susquehanna",
          "The",
          "The Anderson Brothers",
          "The Emigrant.",
          "The Estate of James Denny",
          "The Fenian Brotherhood Letter",
          "The Land Question - Rev. Mr Mullen",
          "The North West Terr",
          "The Presbyterian Church in America.",
          "The Privateer",
          "The Province of New York",
          "Things as They Are in The United States.",
          "Travels Through The United States",
          "Vere Foster",
          "Vere Foster and Irish Emigration.",
          "Vote of Thanks the Passengers of the Arethusa.",
          "W",
          "We")

In [43]:
df %>% 
filter(df$docauthorname %in% vals) %>% 
select(docid, docauthorname) %>% 
arrange(docauthorname)

docid,docauthorname
<int>,<chr>
49121,Extract Local Paper of a Letter Printed In America.
42440,Extract Of A Letter Philadelphia.
29386,Extract One of the Drennan Letters
39371,Extract a Letter of a Recent Emigrant
21992,Extract an American Letter.
43818,Extract from An Emigrant's Letter Discussing Problems in America.
43843,Extract from a Gentleman Who Sailed on the Wilmin
28347,Extract of a Letter Newry Concerning Returning Irishmen.
49043,Extract of a Letter Oregon Terr
37191,Extract of a Letter on Canadian Emigration


In [50]:
text <- read_file(paste0("/Users/alaynemoody/Dropbox/GradStudies/Flinders/Dissertation/IED/", "36501.txt"))
cat(text)

                         75 Fitzwilliam Place
                          Belfast May 15, 82 [1882?]

My dear Kitty
           I should have thanked
you long ago, for your very
kind letter. It is awfully
goood of you, old girl, to take
so much interest in me, as to
give me advice. And I don't
think it will be altogether
thrown away. But one thing
I want to say - any wrong
I have done was single
handed. No companions could
lead me. And id any blame
is going, "freeze" it on to me.
    I brought several of my
friends to see four paintings
They did not know I knew
"Miss Finlay", and they praised them.
    It is a fact I am leaving
town. My office is closed, &
my things sold; so I will do
a "shunt" some of these days
Its better for everybody, as
one of my old rambling fits
is on me. Where it will lead
me to I don't know yet, but
I think I will light on my
feet!!      I am so glad
John is well. He is a thorough
good fellow, and has a big
heart. Friday night I saw
Albert off. He went away
with 

In [44]:
# To modify
df$docauthorname[df$docauthorname==""] <- "Anonymous"
df$docauthorname[df$docauthorname=="& Mary Boyd"] <- "Mary Boyd"
df$docauthorname[df$docauthorname=="a Young Man Who Sailed Belfast"] <- "Anonymous"
df$docauthorname[df$docauthorname=="Alex Borrowman Quebec"] <- "Alex Borrowman"
df$docauthorname[df$docauthorname=="Arthe Dunamanagh To Mrs A.W. Smyth"] <- "Arthe Dunamanagh"
df$docauthorname[df$docauthorname=="B.F.Butler"] <- "B.F. Butler"
df$docauthorname[df$docauthorname=="George Farrelly To Dear Aunt Sarah"] <- "George Farrelly"
df$docauthorname[df$docauthorname=="George Hayes Farrelly New York City"] <- "George Hayes Farrelly"
df$docauthorname[df$docauthorname=="J. Banks Re"] <- "J. Banks"
df$docauthorname[df$docauthorname=="J. Cochrane Philadelphia"] <- "J. Cochrane"
df$docauthorname[df$docauthorname=="James Horner Philadelphia"] <- "James Horner"
df$docauthorname[df$docauthorname=="John Ferguson Philadelphia"] <- "John Ferguson"
df$docauthorname[df$docauthorname=="John S. Sinclair Cal"] <- "John S. Sinclair"
df$docauthorname[df$docauthorname=="John Wightman jun."] <- "John Wightman"
df$docauthorname[df$docauthorname=="Joseph Carrothers Canada"] <- "Joseph Carrothers"
df$docauthorname[df$docauthorname=="Joseph Carrothers London Canada"] <- "Joseph Carrothers"
df$docauthorname[df$docauthorname=="Maggie Martin 3122 Rhodes Ave. Chicago To Her Aunt Ballyfounder Portaferry"] <- "Maggie Martin"
df$docauthorname[df$docauthorname=="Margaret Hughes Philadelphia"] <- "Margaret Hughes"
df$docauthorname[df$docauthorname=="Mary Anderson Chattanooga Tennessee"] <- "Mary Anderson"
df$docauthorname[df$docauthorname=="Matilda Ferguson Philadelphia"] <- "Matilda Ferguson"
df$docauthorname[df$docauthorname=="Nathaniel Carrothers Canada"] <- "Nathaniel Carrothers"
df$docauthorname[df$docauthorname=="Robert Campbell New York"] <- "Robert Campbell"
df$docauthorname[df$docauthorname=="Robert Campbell St Louis"] <- "Robert Campbell"
df$docauthorname[df$docauthorname=="Robert Robinson Chulahoma"] <- "Robert Robinson"
df$docauthorname[df$docauthorname=="Robert Smith Philadelphia"] <- "Robert Smith"
df$docauthorname[df$docauthorname=="Thomas McGinity New York"] <- "Thomas McGinity"
df$docauthorname[df$docauthorname=="William Beatty New York America"] <- "William Beatty"
df$docauthorname[df$docauthorname=="William Porter Chicago U.S.A."] <- "William Porter"
df$docauthorname[df$docauthorname=="William Stavely.Pennuslvania"] <- "William Stavely"
df$docauthorname[df$docauthorname=="William Stavely Pennsylvania"] <- "William Stavely"

# Fix from inspected docs (2024 Apr 26)
df$docauthorname[df$docauthorname=="Anderson Canada"] <- "Muriel"
df$docauthorname[df$docauthorname=="Benjamin Neely Co. Derry To his Brother"] <- "Benjamin and William Neely"
df$docauthorname[df$docauthorname=="Countess of Dufferin"] <- "Harriot"
df$docauthorname[df$docauthorname=="Craven County"] <- "Anonymous"
df$docauthorname[df$docauthorname=="Cumberland Co."] <- "John Taylor"
df$docauthorname[df$docauthorname=="Danville"] <- "Anonymous"
df$docauthorname[df$docauthorname=="Description of the P"] <- "Anonymous"
df$docauthorname[df$docauthorname=="Destruction of the Irish Regiment at Fredericksburg."] <- "Anonymous"
df$docauthorname[df$docauthorname=="Diaries of James Harshaw"] <- "James Harshaw"
df$docauthorname[df$docauthorname=="Directions"] <- "Anonymous"
df$docauthorname[df$docauthorname=="Earl of Ava"] <- "Archie"
df$docauthorname[df$docauthorname=="Earl of Caledon"] <- "Caledon"
df$docauthorname[df$docauthorname=="R. Campbell U.S.A."] <- "R. Campbell"
df$docauthorname[df$docid=="30890"] <- "Anonymous"
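
Many of the recodes above strip a trailing place name from an otherwise valid author name. As a hedged sketch, that pattern could be handled with one regular expression built from a list of places. The vector `places` below is an illustrative subset taken from the values above, and `pattern`, `x`, and `stripped` are hypothetical names.

```r
# Illustrative subset of trailing place names seen in the recodes above
places  <- c("Philadelphia", "New York", "Canada", "Quebec")
pattern <- paste0("\\s+(", paste(places, collapse = "|"), ")$")

x <- c("J. Cochrane Philadelphia", "Robert Campbell New York",
       "Joseph Carrothers Canada", "Mary Boyd")
stripped <- sub(pattern, "", x) # remove one trailing place name, if present
stripped
```

This would not replace the manual review: it removes only one trailing place per value, and it would wrongly shorten any author whose surname happens to match a place in the list.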


In [45]:
# To delete (permanently)
vals <- c("A List of Killed & Wounded in the Irish Brigade.", 
          "a Mr McCarver Describing Business in Oregon .",
          "Address",
          "Address from the Residents of Amherst Island",
          "Advice",
          "Agents of Latter-Day Saints To Officers of S.S. Minnesota",
          "Ballymacarrett Emigrant Weavers",
          "British Officer",
          "Capt P Dillon Concerning the Passage",
          "Genealogical Notes on: Alexander Reed & John Colhoun",
          "Halifax Repealers",
          "Jonathan Smyth Liverpool",
          "Lists of Caldwell Family Letters",
          "Memo of Will of William Redmond",
          "Merchants & Planters Bank",
          "Ship's Record Book.",
          "Song By An Irish Schoolmaster.",
          "Testimony of Passengers on the Washin",
          "American Letter", #From here on omitting items from inspected cell.
          "An Irishman in Cal",
          "Baltimore",
          "Chancellor of the Duchy of Carlisle",
          "Cheque paid by William Parke",
          "Co. Tyrone",
          "Cork",
          "Dyer et al",
          "Ernest Cockrane",
          "Ernest Cochrane Belfast",
          "Emigrant Letter Regarding Conditions",
          "Envelope",
          "Envelope:",
          "Envelopes Sent To Rev J. Orr",
          "Erin-Go-Bragh")

docids <- c("43332")

In [46]:
# Remove the above.
nrow(df)
df <- df[!df$docauthorname %in% vals,]
df <- df[!df$docid %in% docids,]
nrow(df)

In [47]:
# To delete (temporarily -- that is, until I've had an opportunity to check)

vals <- c("an American Officer",
          "an Irish Emigrant",
          "A. McFeeters & Bros",
          "Archbishop John Hughes",
          "Brooklyn",
          "Bryson and Robb Families",
          "Byrne Family",
          "Capt. He",
          "Capt. Samuel Smiley",
          "Colonel Leslie",
          "Cook & Cook",
          "Cook & Leach",
          "Cormac et al.",
          "E. & McCann",
          "E. Dunlop Peterborough.",
          "Extract a Letter of a Recent Emigrant",
          "Extract an American Letter.",
          "Extract from a Gentleman Who Sailed on the Wilmin",
          "Extract from An Emigrant's Letter Discussing Problems in America.",
          "Extract Local Paper of a Letter Printed In America.",
          "Extract of a Letter Newry Concerning Returning Irishmen.",
          "Extract of a Letter on Canadian Emigration",
          "Extract of a Letter Oregon Terr",
          "Extract Of A Letter Philadelphia.",
          "Extract One of the Drennan Letters",
          "Extracts a Letter Dated San Francisco",
          "Extracts an Emigrant Letter New Orleans",
          "Extracts from Letter - Writer Unknown.",
          "Extracts of letters from New York",
          "Impressions of Cal",
          "Irish Emigrant Pigua",
          "Irishmen In America - Important.",
          "Irishmen in Virginia in 1784",
          "J. Fisher & Sons",
          "Joseph Philadelphia",
          "Last letter received by the family of Capt. F. R. M. Crozier.",
          "Letter",
          "Letter Concerning Problems of Emigration.",
          "Letter Mrs. Martha. L. Weyman Re the Savage family.",
          "Letter of Thanks by Belfast Printers",
          "Letter of Thanks from Passengers of Ship Prosperity",
          "Letter of Thanks from Passengers of Ship Riverdale",
          "Letter on Emigration.",
          "Letter on Irish Emigration",
          "Letter Re - The Jane McCullagh Estate",
          "Letter written by Thomas Gribbin",
          "Letters America",
          "Letters from America",
          "Letters from Mrs. Lizzie Street",
          "Letters from the McGinty and Crosby families",
          "Letters of the FitzGerald Family of Co. Tipperary",
          "Limerick emigrant",
          "Lt. Col. Leslie",
          "Lord Alexander Caledon", #If ok, change to form below
          "Lord Caledon", 
          "Loving Mother & Sisters",
          "Lowell Emigrant",
          "ME",
          "Member of the Coman family",
          "Miller & Bonsal",
          "More Returned Emigrants from United States.",
          "Mortimer & Harris",
          "Moyers & Consaul",
          "My Life in the Army William McCarter",
          "New Orleans",
          "New York",
          "Newcastle",
          "NY",
          "One of The Drennan Letters",
          "Oregon .",
          "Owen & Honr. Henigan",
          "Papers of Prof. E. R. R. Green.",
          "Passenger on Board the Ship Faithful Steward",
          "Passenger on S.S. Caledonia",
          "Passenger on Ship Iphigenia.",
          "Passenger who Sailed on the Ship Josephine",
          "Pembe",
          "Pennsylvania",
          "Petition",
          "Petition of John Caldwell Senior",
          "Philadelphia",
          "Poscript",
          "Postcard from S.",
          "Postcard Ralph",
          "Prospects",
          "Prospects of Emigrants in Canada.",
          "Protestant Episcopal Church in the United States",
          "pupils",
          "Quebec",
          "R. Redmond in Ireland",
          "R. Moore Portadown",
          "Receipt from James Clarke",
          "Request",
          "Return Migration from America",
          "Return of Emigrants from America.",
          "Returned Emigrants from United States.",
          "Rev E T O'Neill",
          "Rev F Kirkpatrick",
          "Rev J G Mulholland",
          "Rev John Orr",
          "Rev. J. Orr",
          "Rev. John Orr",
          "Rutledge & Young",
          "Salvation Army Emigrants",
          "Sergeant Major John Laird",
          "Savannah",
          "Sep",
          "Sir Francis Hincks",
          "Sister Bell",
          "Sister M. Mamerta",
          "Sister Rose",
          "Smiley",
          "Son in America",
          "South Carolina",
          "Susan McAleece County Tyrone",
          "Susquehanna",
          "The",
          "The Anderson Brothers",
          "The Emigrant.",
          "The Estate of James Denny",
          "The Fenian Brotherhood Letter",
          "The Land Question - Rev. Mr Mullen",
          "The North West Terr",
          "The Presbyterian Church in America.",
          "The Privateer",
          "The Province of New York",
          "Things as They Are in The United States.",
          "Travels Through The United States",
          "Vere Foster",
          "Vere Foster and Irish Emigration.",
          "Vote of Thanks the Passengers of the Arethusa.",
          "W",
          "We"
)

In [48]:
nrow(df)
df <- df[!df$docauthorname %in% vals,]
nrow(df)

Are there multiple letters in some of the files?

In [49]:
df$docid[grepl("etters", df$docauthorname)]

For now, the only item here that is relevant is 49280, which contains 8 letters and needs to be broken up. The status of the others will probably be resolved when I return to verify which letters to keep and which to omit.

<p>Examining the letters in the IED database (https://www.dippam.ac.uk/ied/records/[docid]) provided the following information about these documents.</p>

<p>Multiple letters</p>
<ul><li>22052 (2 - include)</li>
<li>33434 (Intro alluding to multiple letters, possibly a Catholic family, some included elsewhere)</li>
<li>38514 (Intro alluding to multiple letters, included elsewhere in the collection)</li>
<li>40215 (Intro alluding to multiple letters that I don't see in the collection)</li>
<li>48704 (2 letters, but one is from Ireland and the other is extremely short, basically just a 2-line greeting. Exclude)</li>
<li>49280 (8)</li>
</ul>

<p>One letter</p>
<ul><li>29386</li>
<li>33455 (open letter in a publication)</li>
<li>35173</li>
<li>41708 (extract, published)</li>
</ul>    

<p>33434, 38514, 40215, and 48704 will be excluded from the NLP analysis, but 33434 and 38514 will be read as part of the interpretive part of the analysis (to-do). Multiple letters will be separated into individual files. Single letters will be left as-is, aside from modifying the docauthorname to omit the suggestion that the item contains multiple letters.</p>

The code for adding lines (if necessary) is <br>

#Add rows for 22053 with same author info as for 22052.<br>
df <- rbind(df, df[df$docid=="22052",])<br>
nrow(df)


In [57]:
text <- read_file(paste0("/Users/alaynemoody/Dropbox/GradStudies/Flinders/Dissertation/IED/", "49280.txt"))
cat(text)

                Letters of Thomas Taylor of Ireland to his cousin Robert
                Taylor in America.  These are the last letters in the
                collection, the first is dated in 1799.
                                   Ballygoskin, 20th, May, 1826
Dear Bob:
I am yet in the land of the living, in March, 1824, I was very bad and
expected to die, my feet and legs was greatly swollen, about a month before
I took my bed my legs was greatly swoolen [swollen?], I found my legs
always hurt in bed, at length the water ossed [oozed?] out of them and wet
the bed, the latter end of April I got recovered and has enjoyed tolerable
good health ever since.
My wife has been in a delicate state of health for the most a year past but
is now feeling better, I believe somewhat touched with a liver complaint.
The rest of the family enjoys pretty good health.  You will be anxious to
know about my family; Thomas is about 5 feet 9 inches, slender make, very
like me, but has not such a prominent 

I will omit 49280 because the letters are sent from Ireland to the USA or Canada.

In [50]:
nrow(df)
docids <- c("49280")
df <- df[!df$docid %in% docids,]
nrow(df)

In [51]:
length(unique(df$docauthorname))
nrow(df)
sort(table(df$docauthorname), decreasing = TRUE)
print(sort(unique(df$docauthorname)))


                         Anonymous                               Hami 
                               164                                 51 
                       R. Campbell                   Andrew Greenlees 
                                40                                 38 
                    James Buchanan                       Mary Cumming 
                                30                                 28 
                    Alexander Robb                     Roland Redmond 
                                20                                 20 
                            Archie                        John Walker 
                                19                                 17 
                      James Horner                   Richard Rothwell 
                                16                                 16 
                   Isabella Martin                   Thos. W. Coskery 
                                15                                 15 
     

   [1] "A"                                  "A & L Greenlees"                   
   [3] "A M'L Staveley"                     "A M'Leod Staveley"                 
   [5] "A McElheran"                        "A McFeeters"                       
   [7] "A S Woodburn"                       "A Wilson"                          
   [9] "A. A. Longstreet"                   "A. Aitken"                         
  [11] "A. Brown"                           "A. Browne"                         
  [13] "A. Campbell"                        "A. D. Cruickshank"                 
  [15] "A. Doran"                           "A. Greenlees"                      
  [17] "A. Hami"                            "A. Hunter"                         
  [19] "A. Jackson"                         "A. M."                             
  [21] "A. Mapherson"                       "A. McFeeters"                      
  [23] "A. S. Whitell"                      "A. S. Woodburn"                    
  [25] "A. Sinclair"        

In [52]:
df %>% 
filter(grepl("&", docauthorname)) %>% 
select(docauthorname) %>% 
unique()

Unnamed: 0_level_0,docauthorname
Unnamed: 0_level_1,<chr>
1,Elisha & Lois Parish
2,Nathaniel & Margaret Carrothers
3,Samuel & John Mo
4,Jane Chambers & Alexander Park
6,Maggie & Alice Martin
7,R & J. Smyth
8,A & L Greenlees
9,John & Matilda Ferguson
10,Nancy & Samuel Laird
11,J. E. Orr & M. Orr


In [53]:
df$test <- gsub("\\.(?=[A-Za-z])", ". ", df$docauthorname, perl = TRUE)

In [54]:
df %>% 
filter(grepl("\\.\\S{1,}", docauthorname))%>% 
select(docauthorname, test) %>% 
unique()

Unnamed: 0_level_0,docauthorname,test
Unnamed: 0_level_1,<chr>,<chr>
1,G.R. Wood,G. R. Wood
2,R.A. Taylor,R. A. Taylor
3,W.J. Weir,W. J. Weir
4,A.B. Mc Millan,A. B. Mc Millan
5,S.J. Porter,S. J. Porter
6,J.J. Elder,J. J. Elder
7,D.S. Cooper,D. S. Cooper
8,W.J. Campbell Allen,W. J. Campbell Allen
10,W.G. Weir,W. G. Weir
12,C.K. Breeze,C. K. Breeze


In [55]:
df$docauthorname <- df$test

In [56]:
dfNames <- df

## Author Location

I want to isolate everything between the first comma and the word "to".

In [57]:
#reset
df <- dfNames

In [58]:
df$test <- gsub("\\b[tT]o\\b.*", "", df$description) # Remove everything from the word "to" onward
df$test <- gsub("^[^,]*,", "", df$test) # Remove everything up to and including the first comma
df$test <- gsub("\\[|\\]", "", df$test) # Strip brackets
df$test <- gsub("\\?", "", df$test) # Strip question marks
df$test <- gsub("PRONI.*", "", df$test) # Strip bibliographic reference
df$test
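
To make the extraction steps concrete, here is a minimal sketch that applies the first two substitutions to the single description shown in the first row of the data. The variable names (`desc`, `loc`) and the final `trimws()` call are illustrative additions and are not part of the notebook's pipeline.

```r
# One description copied from the first row of the dataset
desc <- "Edward Stanley, Katawa, Canada to Joshua Peel, Armagh; PRONI D889/7/1; CMSIED 300090"

loc <- gsub("\\b[tT]o\\b.*", "", desc)  # drop the word "to" and everything after it
loc <- gsub("^[^,]*,", "", loc)         # drop the author name, up to the first comma
loc <- trimws(loc)                      # assumption: trim spaces for readability
print(loc)
```

What remains is the sender's location, which is the text the later cells search for country, state, and city names.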

In [59]:
df %>% 
filter(grepl("Co\\.|Co\\s", test)) %>% 
select(test) %>% 
unique()
#pull(docid)

Unnamed: 0_level_0,test
Unnamed: 0_level_1,<chr>
1,"Ballintur, Co Down,"
2,Co. Tyrone
3,"Rostrevor,Co Down"
4,Co. Tyrone
5,"Cahard, Co. Down"
6,"Co Down,"
7,"Co. Monaghan,"
8,"Portaferry, Co. Down"
9,"Co. Donegal,"
10,"Ballintur, Co Down"


None of the "Co" matches refers to Colorado; they are all counties. A few are U.S. counties (Allegheny, Washington), which are kept so they can be coded as USA; the remaining rows are omitted.

In [60]:
docids <- df %>% 
filter(grepl("Co\\.|Co\\s", test)) %>% 
filter(!grepl("Allegheny|Washington", test)) %>% 
#select(test) %>% 
#unique()
pull(docid)

In [61]:
nrow(df)
df <- df[!df$docid %in% docids,]
nrow(df)

In [62]:
# Fix test for docid 52111 
df$test[df$test==" Springbrook, USA My Dearest Mother, Banbridge, Ireland; "]  <- "Springbrook, USA"

In [63]:
docids <- 
df %>% 
filter(grepl("England|Ireland|Wales|Australia|New Zealand|Scotland", test)) %>% 
#select(docid, test, description)
pull(docid)

In [64]:
nrow(df)
df <- df[!df$docid %in% docids,]
nrow(df)

In [65]:
df %>% 
filter(grepl("Canada", test)) %>% 
#select(docid, test, description) %>% 
nrow()
#pull(docid)

vals <- c("USA", "U\\.S\\.A\\.", "U\\.\\sS\\.\\sA\\.", "U\\.S\\.\\W")

length(df$test[grepl(paste(vals, collapse = "|"), df$test)])

In [66]:
# New variable for author location
df$authorLocation <- NA

#Recode Canada
rows = which(grepl('Canada', df$test)) # Get rows that meet condition
df$authorLocation[rows] <- "Canada" # Recode data

rows = which(grepl(paste(vals, collapse = "|"), df$test)) # Get rows that meet condition
df$authorLocation[rows] <- "USA" # Recode data

# Convert character vars to factor
df$authorLocation <- as.factor(df$authorLocation)

#Check counts
summary(df$authorLocation)
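
As a rough illustration of how the combined pattern behaves, the sketch below joins the same four expressions with `|` and tests them against a handful of location strings. The first two strings appear in the notebook's output; the last two are made up for the example, as are the names `usa_patterns`, `samples`, and `hits`.

```r
# The same four expressions used for the USA recode
usa_patterns <- c("USA", "U\\.S\\.A\\.", "U\\.\\sS\\.\\sA\\.", "U\\.S\\.\\W")
pattern <- paste(usa_patterns, collapse = "|")  # one alternation: any variant may match

samples <- c(" Ellinwood, KS, U.S.A. ",  # dotted form
             " N.J., USA ",              # undotted form
             " Katawa, Canada ",         # hypothetical: should not match
             " New London, U.S. ")       # hypothetical: "U.S." followed by a space

hits <- grepl(pattern, samples)
print(data.frame(samples, hits))
```

Because the Canada recode runs first and the USA recode second, a string that mentions both countries would end up coded as USA.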

In [67]:
provinces <- c("Quebec", 
               "Alberta", 
               "Ontario", 
               "Nova Scotia", 
               "Newfoundland", 
               "Labrador", 
               "Saskatchewan", 
               "Prince Edward Island", 
               "Manitoba", 
               "British Columbia", 
               "Yukon", 
               "New Brunswick",
               "Northern Territories"
               )

In [68]:
df$authorLocation[grepl(paste(provinces, collapse = "|"), df$test)] <- "Canada"
summary(df$authorLocation)

In [69]:
df %>% 
filter(authorLocation=="Canada") %>% 
pull(test) %>% 
unique() %>% 
sort()

In [70]:
states <- statesUSA$State
df$authorLocation[grepl(paste(states, collapse = "|"), df$test)] <- "USA"
summary(df$authorLocation)

In [71]:
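# Now abbreviations
states01 <- as.character(statesUSA$Abbreviation) # Plain abbreviation, e.g. "NY"
states02 <- gsub('$', ',', states01) # Trailing comma, e.g. "NY,"
states03 <- gsub(',', '.', states02) # Trailing period, e.g. "NY."
states04 <- gsub('([A-Z])', '\\1.', states01) # Period after each letter, e.g. "N.Y."
states05 <- gsub('\\.$', '', states04) # Same without the final period, e.g. "N.Y"
states <- c(states01, states02, states03, states04, states05)
states <- gsub('^', ' ', states) # Leading space so the abbreviation starts a word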
states
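
One thing to keep in mind about these variants is that they are later used as regular expressions, where an unescaped `.` matches any character. The sketch below uses made-up location strings to show how the dotted form of an abbreviation can match a surname, which may be why names such as McArdle have to be excluded by hand in a later cell. The escaping step is a suggestion, not something the notebook does.

```r
abbr <- "MA"                                # Massachusetts
dotted <- gsub("([A-Z])", "\\1.", abbr)     # "M.A."
unescaped <- paste0(" ", dotted)            # " M.A." where each dot matches any character
escaped <- gsub("\\.", "\\\\.", unescaped)  # dots now match literal periods only

test_strings <- c(" McArdle, Belfast ", " Boston, M.A. ")  # hypothetical values

unescaped_hits <- grepl(unescaped, test_strings)  # the surname matches as well
escaped_hits <- grepl(escaped, test_strings)      # only the true abbreviation matches
print(rbind(unescaped_hits, escaped_hits))
```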

In [72]:
print(sort(unique(df$test[grepl(paste(states, collapse = "|"), df$test)])))

 [1] " Abbeville SC, "                       
 [2] " Allegheny, P.A "                      
 [3] " Allegheny, PA. "                      
 [4] " Blacksburg, VA., "                    
 [5] " Bloomfield, NJ "                      
 [6] " Brooklyn NY, "                        
 [7] " Brooklyn, N.Y., "                     
 [8] " Charleston SC "                       
 [9] " Charleston SC., "                     
[10] " Charleston, SC, "                     
[11] " Cohoes, NY "                          
[12] " Copenhagen, NY "                      
[13] " Ellinwood, KS, U.S.A. "               
[14] " Elmira, NY "                          
[15] " Fairfield, N.Y. "                     
[16] " Lancaster, PA "                       
[17] " N York "                              
[18] " N,York "                              
[19] " N.J., USA "                           
[20] " N.Jersey, "                           
[21] " N.Y U.S.A, "                          
[22] " N.Y U.S.A. "               

In [73]:
docids <- 
df %>% 
filter(grepl(paste(states, collapse = "|"), test)) %>% 
filter(!grepl("McArdle|McEarlane|Indies", test)) %>% 
#select(docid, test)
#unique()
pull(docid)

In [74]:
df$authorLocation[df$docid %in% docids] <- "USA"
summary(df$authorLocation)

In [75]:
cities <- c("Philadel", 
            "Chicag", 
            "New Orlean", 
            "Washin",  
            "San Francisco", 
            "Augusta", 
            "Petersburg",
            "Washin",
            "New Orleans",
            "San Francisco",
            "Baltimore",
            "Pittsburg",
            "Savannah",
            "Brooklyn",
            "Detroit",
            "Fresno",
            "Lynchburg",
            "Buffalo",
            "Cincinnat",
            "Sacrame",
            "Jeffersonville",
            "Pontiac",
            "Southfield",
            "Albany",
            "Cleveland",
            "Birmingham",
            "Frankford",
            "Grand Rapids",
            "Healdsburg",
            "Lynchburg",
            "Milwaukee",
            "N\\. York",
            "West Salem",
            "Wichita",
            "Abbeville",
            "Agnesville",
            "Allegheny",
            "Ansley",
            "Birmingham",
            "Blacksburg",
            "Boston",
            "Campbell\\'s Corner",
            "Chambersburg",
            "Charlevoix",
            "Cherokee City",
            "Concord",
            "Danielsonville",
            "Fairfield",
            "Fremont",
            "Greensburg",
            "Harvey",
            "Holly Hill",
            "Jackonsville",
            "Jersey City",
            "Kalamazoo",
            "Kalkaska",
            "Kaolin",
            "Kittanning",
            "Louisville",
            "Marshall",
            "Maryville",
            "Minneapolis",
            "Monticello",
            "N\\. Orleans",
            "N\\. York Cty",
            "N\\.Orleans",
            "Napa City",
            "New London, U\\.S\\.",
            "New Windsor",
            "Newburgh",
            "Newcastle De\\.",
            "Oakland",
            "Oxford\\, Mich",
            "Paterson")
cities <- sort(unique(cities)) # Sort and drop repeated patterns
cities

other <- c("Calif",
              "America",
              "Bucks County",
              "Mass\\.",
              "Charlotte County",
              "Mississipi",
              "Pa\\.",
              "Pennslyvania")
other <- sort(other)
other

In [76]:
df$authorLocation[grepl(paste(other, collapse = "|"), df$test)] <- "USA"
summary(df$authorLocation)

In [77]:
df$authorLocation[grepl(paste(cities, collapse = "|"), df$test)] <- "USA"
summary(df$authorLocation)

In [78]:
cities <- c("Philadephia",
            "Platsburgh",
            "Plymouth",
            "Rock Island",
            "Rockycreek",
            "Roxbury",
            "Saint Louis",
            "Sanfrancisco",
            "Springfield",
            "Spruce Creek",
            "Susquehanna",
            "Wolford",
            "Belvidere",
            "Ottawa.*Il\\."
           )
cities  <- sort(cities)
cities

other <- c("S\\. Carolina",
          "South of Carolina U\\.S\\.",
          "\\bVa\\b",
          "Waccamaw",
          "Chebanse",
          "Irish Channel")
other <- sort(other)
other

In [79]:
df$authorLocation[grepl(paste(cities, collapse = "|"), df$test)] <- "USA"
df$authorLocation[grepl(paste(other, collapse = "|"), df$test)] <- "USA"
summary(df$authorLocation)

In [80]:
cities <- c("Milford",
            "Goderich",
            "Montreal",
            "Calgary",
            "Ottawa",
            "Halifax",
            "St.*John",
            "Temperanceville",
            "Amherst Island",
            "Bosanquet",
            "Coleraine",
            "Grosse Island",
            "Little Britain",
            "Indian Island",
            "Lawrenceville",
            "Marmora",
            "Owen Sound",
            "Point St Charles",
            "Smithville",
            "Wainfleet",
            "Watsonville",
            "Weldon",
            "Winnipeg",
            "Toronto",
            "Vancouver",
            "Edmonton"
           )
cities  <- sort(cities)
cities

other <- c("Cumberland Co")
other <- sort(other)
other

In [81]:
df$authorLocation[grepl(paste(other, collapse = "|"), df$test)] <- "Canada"
summary(df$authorLocation)

In [82]:
docids <- 
df %>% 
filter(grepl(paste(cities, collapse = "|"), test)) %>% 
filter(!grepl("Illinois|Il\\.", test)) %>% 
#select(docid, test)
pull(docid)

In [83]:
#sort(df$test[grepl(paste(cities, collapse = "|"), df$test)])

In [84]:
df$authorLocation[df$docid %in% docids] <- "Canada"
summary(df$authorLocation)

In [85]:
df %>% 
filter(is.na(authorLocation)) %>% 
select(docid, test, description) %>% 
#pull(description) %>% 
unique() #%>% 
#print()

Unnamed: 0_level_0,docid,test,description
Unnamed: 0_level_1,<int>,<chr>,<fct>
1,20514,Desertmartin,"James Kelly, Desertmartin to John Kelly, Pennsylvania;The Kelly Family Documents: Copyright Retained by The UlsterAmerican Folk Park.; CMSIED 300018"
2,20538,"Near Panama,","Alexander Robb, Near Panama, to Family [Dundonald, Co Down?]; PRONI T 1454/6/2; CMSIED 9006021"
3,20724,",","E. Megaw, [?], to [Annie Weir?] [?];Copyright Retained By Mrs Linda Weir; CMSIED 9906120"
4,20937,Brighton,"Leydd McCrory [?], Brighton to Annie Weir, Michigan;Copyright Retained by Mrs. Linda Weir; CMSIED 9905112"
5,20950,John Parks,John Parks to John Caldwell Junior; PRONI T 3541/2/2; CMSIED 9309353
6,21003,Thomas Armstrong,Thomas Armstrong to Christopher Armstrong; PRONI T 2125/7/6; CMSIED 9309326
7,21043,Reference written by James Barbour for Henry Johnson.;,Reference written by James Barbour for Henry Johnson.; PRONI T 2319/1; CMSIED 9404132
8,21070,Watertown,"John McBride, Watertown to James McBride, Co. Antrim.; PRONI T 2613/11; CMSIED 9007105"
9,21145,Holly,"Ida M., Holly to ""Dear friend Anna"";Copyright Retained by Mrs Linda Weir; CMSIED 9906091"
10,21171,letter from Eliza Steele,letter from Eliza Steele to her Aunt;Dermot Lyttle; CMSIED 200912012


In [86]:
docids <- c("21319", "21324", "52472", "53460", "53624")
df$authorLocation[df$docid %in% docids] <- "USA"
summary(df$authorLocation)

In [87]:
df <- df[!is.na(df$authorLocation),]

In [88]:
dfPlaces <- df

## Mapping names to genders

In [89]:
#re-set
df <- dfPlaces

In [90]:
df$authorgender <- NA
summary(as.factor(df$authorgender))

In [91]:
namesB <- intersect(namesFemale, namesMale)
namesB <- namesB$V1
length(namesB)

In [92]:
namesF <- setdiff(namesFemale, namesMale)
namesF <- as.vector(namesF$V1)
length(namesF)

namesM <- setdiff(namesMale,namesFemale)
namesM <- as.vector(namesM$V1)
length(namesM)

In [93]:
names <- union(namesFemale, namesMale)
names <- names$V1
length(names)

# Check
length(namesB)+length(namesF)+length(namesM)

In [94]:
# Because of memory limitations, I have to break this process into chunks of 1000 names.
# When letters are co-authored by men and women, they are coded as "F" to indicate that a woman participated in the writing.

namesF <- setdiff(namesFemale, namesMale)
namesF <- as.vector(namesF$V1[0:1000])
length(namesF)
namesF[0:10]

namesM <- setdiff(namesMale,namesFemale)
namesM <- as.vector(namesM$V1[0:1000])
length(namesM)
namesM[0:10]

df$authorgender[grepl(paste(namesM, collapse = "|"), df$docauthorname)] <- "M"
df$authorgender[grepl(paste(namesF, collapse = "|"), df$docauthorname)] <- "F"
summary(as.factor(df$authorgender))

In [95]:
# Because of memory limitations, I have to break this process into chunks of 1000 names

namesF <- setdiff(namesFemale, namesMale)
namesF <- as.vector(namesF$V1[1001:2000])
length(namesF)
namesF[0:10]

namesM <- setdiff(namesMale,namesFemale)
namesM <- as.vector(namesM$V1[1001:2000])
length(namesM)
namesM[0:10]

df$authorgender[grepl(paste(namesM, collapse = "|"), df$docauthorname)] <- "M"
df$authorgender[grepl(paste(namesF, collapse = "|"), df$docauthorname)] <- "F"
summary(as.factor(df$authorgender))

In [96]:
# Because of memory limitations, I have to break this process into chunks of 1000 names

namesF <- setdiff(namesFemale, namesMale)
namesF <- as.vector(namesF$V1[2001:3000])
length(namesF)
namesF[0:10]

namesM <- setdiff(namesMale,namesFemale)
namesM <- as.vector(namesM$V1[2001:2578])
length(namesM)
namesM[0:10]

df$authorgender[grepl(paste(namesM, collapse = "|"), df$docauthorname)] <- "M"
df$authorgender[grepl(paste(namesF, collapse = "|"), df$docauthorname)] <- "F"
summary(as.factor(df$authorgender))

In [97]:
# Because of memory limitations, I have to break this process into chunks of 1000 names
# Only running female code because male names have been exhausted.

namesF <- setdiff(namesFemale, namesMale)
namesF <- as.vector(namesF$V1[3001:4000])
length(namesF)
namesF[0:10]

df$authorgender[grepl(paste(namesF, collapse = "|"), df$docauthorname)] <- "F"
summary(as.factor(df$authorgender))

In [98]:
# Because of memory limitations, I have to break this process into chunks of 1000 names
# Only running female code because male names have been exhausted.

namesF <- setdiff(namesFemale, namesMale)
namesF <- as.vector(namesF$V1[4001:4633])
length(namesF)
namesF[0:10]

df$authorgender[grepl(paste(namesF, collapse = "|"), df$docauthorname)] <- "F"
summary(as.factor(df$authorgender))
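
The chunked cells above could also be written as a loop. The following is a minimal sketch, assuming a hypothetical helper `match_in_chunks()` and toy name lists; it is not the code used to produce the counts in this notebook.

```r
# Hypothetical helper: flag entries of `x` that contain any name in `name_list`,
# testing the names in chunks so no single regular expression grows too large.
match_in_chunks <- function(x, name_list, chunk_size = 1000) {
  hit <- rep(FALSE, length(x))
  chunks <- split(name_list, ceiling(seq_along(name_list) / chunk_size))
  for (chunk in chunks) {
    hit <- hit | grepl(paste(chunk, collapse = "|"), x)
  }
  hit
}

# Toy data for illustration only
authors <- c("Mary Cumming", "James Buchanan", "John & Matilda Ferguson", "R. Campbell")
female_names <- c("Mary", "Matilda", "Isabella")
male_names <- c("James", "John", "Andrew")

gender <- rep(NA_character_, length(authors))
gender[match_in_chunks(authors, male_names, chunk_size = 2)] <- "M"
gender[match_in_chunks(authors, female_names, chunk_size = 2)] <- "F"  # applied last, so "F" wins for co-authors
print(gender)
```

As in the cells above, the female pass runs after the male pass, so a co-authored letter such as "John & Matilda Ferguson" ends up coded "F", and names found in neither list stay `NA`.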

What are the names for the NAs?

In [99]:
sort(unique(df$docauthorname[is.na(df$authorgender)]))

In [100]:
# Making a few corrections
vals <- c("Atrhur", "Charly", "Frank", "Fred", "Wm\\.", "Gorge", "Samual", "Theophilus", "Thos\\.")
df$authorgender[grepl(paste(vals, collapse = "|"), df$docauthorname)] <- "M"

vals <- c("Mrs","Miss")
df$authorgender[grepl(paste(vals, collapse = "|"), df$docauthorname)] <- "F"

#Summary
summary(as.factor(df$authorgender))

## Summary Statistics

In [101]:
df <- factorize(df)

In [102]:
glimpse(df)

Rows: 2,345
Columns: 21
$ idIED          <int> 300090, 9501251, 9408355, 9003061, 9011027, 9905110, 94…
$ ddmmyyyy       <fct> 04-12-1896, 12-01-1891, 01-10-1842, 10-01-1896, 13-02-1…
$ publisher      <fct> IED, IED, IED, IED, IED, IED, IED, IED, IED, IED, IED, …
$ sourcetitle    <fct> "Public Record Office, Northern Ireland", "Public Recor…
$ description    <fct> "Edward Stanley, Katawa, Canada to Joshua Peel, Armagh;…
$ docid          <int> 20481, 20487, 20519, 20522, 20529, 20530, 20563, 20568,…
$ nationalOrigin <fct> Irish, Irish, Irish, Irish, Irish, Irish, Irish, Irish,…
$ authorgender   <fct> M, NA, F, F, M, M, NA, M, F, M, NA, NA, F, F, M, F, M, …
$ agewriting     <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ agedeath       <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,

In [103]:
vars <- c("docid", "nationalOrigin", "authorgender", "relMin", "U", "M", "S", "F", "L", "docmonth", "docyear", "docauthorname", "authorLocation")
df <- df[vars]
summary(df)

     docid       nationalOrigin authorgender  relMin           U          
 Min.   :20481   Irish:2345     F   : 597    Mode:logical   Mode:logical  
 1st Qu.:28328                  M   :1365    NA's:2345      NA's:2345     
 Median :36427                  NA's: 383                                 
 Mean   :36636                                                            
 3rd Qu.:44933                                                            
 Max.   :53643                                                            
                                                                          
    M              S              F              L              docmonth   
 Mode:logical   Mode:logical   Mode:logical   Mode:logical   01     : 283  
 NA's:2345      NA's:2345      NA's:2345      NA's:2345      12     : 233  
                                                             06     : 209  
                                                             11     : 209  
                    

In [104]:
length(unique(df$docauthorname))

## Make Unique Identifiers

In [105]:
# Make unique identifiers for all
df01 <- df %>%
 group_by(docauthorname) %>%
 mutate(docauthorid = cur_group_id())

In [106]:
summary(df01)

     docid       nationalOrigin authorgender  relMin           U          
 Min.   :20481   Irish:2345     F   : 597    Mode:logical   Mode:logical  
 1st Qu.:28328                  M   :1365    NA's:2345      NA's:2345     
 Median :36427                  NA's: 383                                 
 Mean   :36636                                                            
 3rd Qu.:44933                                                            
 Max.   :53643                                                            
                                                                          
    M              S              F              L              docmonth   
 Mode:logical   Mode:logical   Mode:logical   Mode:logical   01     : 283  
 NA's:2345      NA's:2345      NA's:2345      NA's:2345      12     : 233  
                                                             06     : 209  
                                                             11     : 209  
                    

In [107]:
df01$docauthorid[0:50]
df01$docauthorid <- sprintf("%04d", df01$docauthorid) # Zero-pad to four digits
df01$docauthorid[0:50]

In [108]:
df01$docauthorid[0:50]
df01$docauthorid <- gsub("^", "IED", df01$docauthorid)
df01$docauthorid[0:50]
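
For comparison, the numbering, zero-padding, and prefixing steps can be sketched in base R in one line. This is an illustrative alternative with toy author names, not the notebook's method: `factor()` numbers the names in sorted order, which may not always agree with the ordering `group_by()` uses.

```r
# Toy author names for illustration only
authors <- c("Mary Cumming", "Anonymous", "Mary Cumming", "James Buchanan")

# Number each distinct name, pad to four digits, and add the "IED" prefix
ids <- sprintf("IED%04d", as.integer(factor(authors)))
print(data.frame(authors, ids))
```

Rows with the same author name share an ID, which is the property the grouped version relies on.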

In [109]:
vals <- make.unique(df01$docauthorid[df01$docauthorid=="IED0089"])
vals

In [110]:
df01$docauthorid[df01$docauthorid=="IED0089"] <- vals
df01$docauthorname[grepl("IED0089", df01$docauthorid)] 
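
The reason for the `make.unique()` step is not stated here, but its effect is that every row sharing the ID `IED0089` receives its own identifier. A minimal sketch with made-up IDs shows the suffixes it produces:

```r
# Made-up IDs: three rows share the same author ID
ids <- c("IED0001", "IED0089", "IED0089", "IED0089")

shared <- ids == "IED0089"
ids[shared] <- make.unique(ids[shared])  # first stays as-is, later ones get .1, .2, ...
print(ids)
```

Because the suffixed IDs still contain `IED0089`, a `grepl("IED0089", ...)` check like the one above continues to find all of these rows.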

In [111]:
df01 <- factorize(df01)
summary(df01)

     docid       nationalOrigin authorgender  relMin           U          
 Min.   :20481   Irish:2345     F   : 597    Mode:logical   Mode:logical  
 1st Qu.:28328                  M   :1365    NA's:2345      NA's:2345     
 Median :36427                  NA's: 383                                 
 Mean   :36636                                                            
 3rd Qu.:44933                                                            
 Max.   :53643                                                            
                                                                          
    M              S              F              L              docmonth   
 Mode:logical   Mode:logical   Mode:logical   Mode:logical   01     : 283  
 NA's:2345      NA's:2345      NA's:2345      NA's:2345      12     : 233  
                                                             06     : 209  
                                                             11     : 209  
                    

## Save subset

In [112]:
write.csv(df01, 
          "20240502_PhD_IEDSubset.csv", 
          row.names=FALSE)

In [113]:
glimpse(df01)

Rows: 2,345
Columns: 14
Groups: docauthorname [1,063]
$ docid          <int> 20481, 20487, 20519, 20522, 20529, 20530, 20563, 20568,…
$ nationalOrigin <fct> Irish, Irish, Irish, Irish, Irish, Irish, Irish, Irish,…
$ authorgender   <fct> M, NA, F, F, M, M, NA, M, F, M, NA, NA, F, F, M, F, M, …
$ relMin         <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ U              <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ M              <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ S              <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ F              <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ L              <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ docmonth       <fct> 12, 01, 10, 0