# Metadata Adjustment

## Resources

In [2]:
#Import Library
library(tidyverse)

## Functions

In [3]:
# Converts all factors to character class
unfactorize <- function(df){
  for(i in which(sapply(df, class) == "factor")) df[[i]] = as.character(df[[i]])
  return(df)
}
# Code from user "By0" at https://stackoverflow.com/questions/2851015/convert-data-frame-columns-from-factors-to-characters (line 14)

In [4]:
# Converts character to factor class
factorize <- function(df){
  for(i in which(sapply(df, class) == "character")) df[[i]] = as.factor(df[[i]])
  return(df)
}

## Data

In [71]:
# Get and view last meta dataset
letters  <- unfactorize(read.csv("20201219_AM_Meta2Merge.csv"))
glimpse(letters)

Rows: 915
Columns: 48
$ X                       <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14…
$ docsequence             <int> 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 3…
$ docid                   <chr> "S10003-D023", "S10003-D024", "S10003-D025", …
$ docyear                 <int> 1836, 1836, 1837, 1837, 1838, 1838, 1838, 183…
$ doctype                 <chr> "Letter", "Letter", "Letter", "Letter", "Lett…
$ allsubject              <chr> "Childbirth; Church attendance; Cities; Farms…
$ broadsubj               <chr> "Health; Religion; Communities; Relationships…
$ personalevent           <chr> NA, NA, NA, NA, NA, NA, NA, NA, "Physical ill…
$ wwritten                <chr> "Baltimore, MD; Maryland; United States; Mid-…
$ docauthorid             <

In [72]:
# Drop unnecessary columns
letters = select(letters, -c(X,stayednorthamerica.y))
colnames(letters)

## Integration

This is operationalized as cessation of correspondence. Several variables might reflect this: docsequence, docid, and the date fields docyear / docmonth / docday. The working metadata set does not include the month and day variables, so they need to be attached from the original metadata set.

In [73]:
# Get original data
original <- read.csv("IMLD_DOCS_QA completed.csv")
glimpse(original)

Rows: 8,749
Columns: 71
$ docsequence               <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, …
$ docid                     <fct> S10000-D001, S10000-D002, S10000-D003, S100…
$ sourceid                  <fct> S10000, S10000, S10000, S10000, S10000, S10…
$ docauthorid               <fct> per0002637, per0021589, per0021589, per0021…
$ doctitle                  <fct> "Front Matter", "Chapter 1. A Necessary Dec…
$ docyear                   <int> 1992, 1992, 1992, 1992, 1992, 1992, 1992, 1…
$ docmonth                  <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ docday                    <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ docpage                   <fct> "N pag-vi", "1-4", "5-9", "10-11", "12-14",…
$ doctype

In [74]:
# Put vars of interest into a list
vars  <- c("docyear", "docmonth", "docday", "docsequence", "docid")

In [75]:
#Get summary info for variables of interest
summary(original[vars])

    docyear        docmonth          docday      docsequence    
 Min.   :1784   Min.   : 1.000   Min.   : 1.0   Min.   :  0.00  
 1st Qu.:1880   1st Qu.: 4.000   1st Qu.: 5.0   1st Qu.:  4.00  
 Median :1925   Median : 7.000   Median :13.0   Median : 14.00  
 Mean   :1922   Mean   : 6.466   Mean   :13.9   Mean   : 28.87  
 3rd Qu.:1961   3rd Qu.: 9.000   3rd Qu.:22.0   3rd Qu.: 35.00  
 Max.   :2004   Max.   :12.000   Max.   :31.0   Max.   :241.00  
 NA's   :168    NA's   :5764     NA's   :5965                   
         docid     
 S10000-D001:   1  
 S10000-D002:   1  
 S10000-D003:   1  
 S10000-D004:   1  
 S10000-D005:   1  
 S10000-D006:   1  
 (Other)    :8743  

In [76]:
# Drop unnecessary variables
original  <- original[c("docid", "docmonth", "docday")]
glimpse(original)

Rows: 8,749
Columns: 3
$ docid    <fct> S10000-D001, S10000-D002, S10000-D003, S10000-D004, S10000-D…
$ docmonth <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ docday   <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …


In [77]:
# Merge these columns to working dataset linking them by docid
letters <- left_join(letters, original, by = "docid")
glimpse(letters[vars])

Rows: 915
Columns: 5
$ docyear     <int> 1836, 1836, 1837, 1837, 1838, 1838, 1838, 1839, 1839, 183…
$ docmonth    <int> 9, 11, 8, 9, 3, 9, 10, 4, 6, 8, 1, 4, 6, 2, 3, 7, 10, 10,…
$ docday      <int> 20, 14, NA, 7, 1, 23, 21, 25, 5, NA, 17, NA, 14, 15, 28, …
$ docsequence <int> 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 3…
$ docid       <chr> "S10003-D023", "S10003-D024", "S10003-D025", "S10003-D026…
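
A left join adds one output row for every matching key, so a duplicated docid in the lookup table would silently inflate the working dataset. The row count staying at 915 suggests that did not happen here. A minimal sketch of making that check explicit, using made-up toy data (the frames `left` and `right` are hypothetical, and base `merge()` stands in for `left_join()` only to keep the example free of dependencies):

```r
# Toy stand-ins for the working and original metadata (values are invented)
left <- data.frame(
  docid   = c("S1-D001", "S1-D002", "S2-D001"),
  docyear = c(1836L, 1836L, 1840L),
  stringsAsFactors = FALSE
)
right <- data.frame(
  docid    = c("S1-D001", "S1-D002", "S2-D001", "S9-D001"),
  docmonth = c(9L, 11L, NA, 2L),
  stringsAsFactors = FALSE
)

# The key must be unique in the lookup table, otherwise rows get duplicated
keyIsUnique <- anyDuplicated(right$docid) == 0

# all.x = TRUE keeps every row of `left`, like a left join
# (merge() does not promise to preserve the original row order)
merged <- merge(left, right, by = "docid", all.x = TRUE)

# The join should neither add nor drop rows
sameRowCount <- nrow(merged) == nrow(left)
```

If `keyIsUnique` or `sameRowCount` came back FALSE, the lookup table would need deduplicating before the join.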


In [78]:
# Get summary data
summary(factorize(letters[vars]))

    docyear        docmonth          docday       docsequence    
 Min.   :1804   Min.   : 1.000   Min.   : 1.00   Min.   :  2.00  
 1st Qu.:1856   1st Qu.: 4.000   1st Qu.: 8.00   1st Qu.: 25.00  
 Median :1863   Median : 7.000   Median :15.00   Median : 53.00  
 Mean   :1865   Mean   : 6.629   Mean   :15.51   Mean   : 72.19  
 3rd Qu.:1880   3rd Qu.:10.000   3rd Qu.:23.00   3rd Qu.:110.00  
 Max.   :1913   Max.   :12.000   Max.   :31.00   Max.   :239.00  
 NA's   :8      NA's   :134      NA's   :201                     
         docid    
 S10003-D023:  1  
 S10003-D024:  1  
 S10003-D025:  1  
 S10003-D026:  1  
 S10003-D027:  1  
 S10003-D028:  1  
 (Other)    :909  

Now I need to figure out the best variable for identifying the last letter in a series. To begin with, I need to merge the year, month, and day fields so that they can be converted into a Date-class variable from which the most recent value can be identified.

In [79]:
# Add leading zeros to month and day variables (flag is documented as a character string)
letters$docMonth  <- formatC(letters$docmonth, width = 2, flag = "0")
letters$docDay  <- formatC(letters$docday, width = 2, flag = "0")

# Add these variables to list of variables of interest
vars  <- append(vars, c("docMonth", "docDay"))

# Preview the first rows
head(letters[vars])

docyear,docmonth,docday,docsequence,docid,docMonth,docDay
1836,9,20.0,23,S10003-D023,9,20.0
1836,11,14.0,24,S10003-D024,11,14.0
1837,8,,25,S10003-D025,8,
1837,9,7.0,26,S10003-D026,9,7.0
1838,3,1.0,27,S10003-D027,3,1.0
1838,9,23.0,28,S10003-D028,9,23.0


In [80]:
# Make sure summary data is unchanged.
letters <- factorize(letters)
summary(letters[vars])

    docyear        docmonth          docday       docsequence    
 Min.   :1804   Min.   : 1.000   Min.   : 1.00   Min.   :  2.00  
 1st Qu.:1856   1st Qu.: 4.000   1st Qu.: 8.00   1st Qu.: 25.00  
 Median :1863   Median : 7.000   Median :15.00   Median : 53.00  
 Mean   :1865   Mean   : 6.629   Mean   :15.51   Mean   : 72.19  
 3rd Qu.:1880   3rd Qu.:10.000   3rd Qu.:23.00   3rd Qu.:110.00  
 Max.   :1913   Max.   :12.000   Max.   :31.00   Max.   :239.00  
 NA's   :8      NA's   :134      NA's   :201                     
         docid        docMonth       docDay   
 S10003-D023:  1   NA     :134   NA     :201  
 S10003-D024:  1   12     : 81   04     : 34  
 S10003-D025:  1   01     : 77   15     : 32  
 S10003-D026:  1   10     : 74   25     : 32  
 S10003-D027:  1   09     : 70   01     : 29  
 S10003-D028:  1   07     : 67   10     : 29  
 (Other)    :909   (Other):412   (Other):558  

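Missing values in docMonth and docDay no longer register as NA: formatC() turned them into the literal string "NA", which factorize() then made a factor level. In any case, these need to be recoded so that dates with a missing month or day are not invalidated (e.g., 1850-NA-NA). Missing months and days will both be recoded as 01 for lack of a better solution.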
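
Recoding to 01 makes an imputed "1850-01-01" indistinguishable from a letter genuinely written on New Year's Day. One possible way to keep that information, sketched here on invented toy values, is to record which parts were imputed before overwriting them. The columns `monthImputed` and `dayImputed` are hypothetical additions, not part of the dataset, and `sprintf()` is used in place of `formatC()` purely for compactness:

```r
# Toy data standing in for docyear / docmonth / docday (values are made up)
toy <- data.frame(
  docyear  = c(1836L, 1837L, 1850L, NA),
  docmonth = c(9L, 8L, NA, 3L),
  docday   = c(20L, NA, NA, 1L)
)

# Record which parts are missing before overwriting anything
toy$monthImputed <- is.na(toy$docmonth)
toy$dayImputed   <- is.na(toy$docday)

# Replace missing month/day with 1
month <- ifelse(is.na(toy$docmonth), 1L, toy$docmonth)
day   <- ifelse(is.na(toy$docday), 1L, toy$docday)

# Build an ISO date string; a missing year leaves the whole date missing
toy$docdate <- ifelse(
  is.na(toy$docyear),
  NA_character_,
  sprintf("%04d-%02d-%02d", toy$docyear, month, day)
)

# Convert to Date class with an explicit format
toy$docDate <- as.Date(toy$docdate, format = "%Y-%m-%d")
```

With flags like these, later steps could choose to treat imputed dates more cautiously, for example when breaking ties.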

In [81]:
# Recode the "NA" strings: missing month -> 01 (January), missing day -> 01 (first of the month)
letters$docMonth[letters$docMonth == "NA"]  <- "01"
letters$docDay[letters$docDay == "NA"]  <- "01"

# Check counts.
summary(letters[vars])

    docyear        docmonth          docday       docsequence    
 Min.   :1804   Min.   : 1.000   Min.   : 1.00   Min.   :  2.00  
 1st Qu.:1856   1st Qu.: 4.000   1st Qu.: 8.00   1st Qu.: 25.00  
 Median :1863   Median : 7.000   Median :15.00   Median : 53.00  
 Mean   :1865   Mean   : 6.629   Mean   :15.51   Mean   : 72.19  
 3rd Qu.:1880   3rd Qu.:10.000   3rd Qu.:23.00   3rd Qu.:110.00  
 Max.   :1913   Max.   :12.000   Max.   :31.00   Max.   :239.00  
 NA's   :8      NA's   :134      NA's   :201                     
         docid        docMonth       docDay   
 S10003-D023:  1   01     :211   01     :230  
 S10003-D024:  1   12     : 81   04     : 34  
 S10003-D025:  1   10     : 74   15     : 32  
 S10003-D026:  1   09     : 70   25     : 32  
 S10003-D027:  1   07     : 67   10     : 29  
 S10003-D028:  1   06     : 66   12     : 29  
 (Other)    :909   (Other):346   (Other):529  

In [82]:
# Create new variable for docdate
letters$docdate  <- NA

In [83]:
# Join year and month
letters$docdate <- paste(letters$docyear, letters$docMonth, sep = "-")
head(letters$docdate)

In [84]:
# Join date and day
letters$docdate <- paste(letters$docdate, letters$docDay, sep = "-")
head(letters$docdate)

In [85]:
# Add this to list of variables of interest
vars  <- append(vars, c("docdate"))

# View
glimpse(letters[vars])

Rows: 915
Columns: 8
$ docyear     <int> 1836, 1836, 1837, 1837, 1838, 1838, 1838, 1839, 1839, 183…
$ docmonth    <int> 9, 11, 8, 9, 3, 9, 10, 4, 6, 8, 1, 4, 6, 2, 3, 7, 10, 10,…
$ docday      <int> 20, 14, NA, 7, 1, 23, 21, 25, 5, NA, 17, NA, 14, 15, 28, …
$ docsequence <int> 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 3…
$ docid       <fct> S10003-D023, S10003-D024, S10003-D025, S10003-D026, S1000…
$ docMonth    <fct> 09, 11, 08, 09, 03, 09, 10, 04, 06, 08, 01, 04, 06, 02, 0…
$ docDay      <fct> 20, 14, 01, 07, 01, 23, 21, 25, 05, 01, 17, 01, 14, 15, 2…
$ docdate     <chr> "1836-09-20", "1836-11-14", "1837-08-01", "1837-09-07", "…


In [86]:
# Create new date class variable from docdate
letters$docDate <- as.Date(letters$docdate)

# Add this to list of variables of interest
vars  <- append(vars, c("docDate"))

# View
glimpse(letters[vars])

# How many NAs (should be equivalent to the number of NAs for docyear)
sum(is.na(letters$docDate))
sum(is.na(letters$docyear))

Rows: 915
Columns: 9
$ docyear     <int> 1836, 1836, 1837, 1837, 1838, 1838, 1838, 1839, 1839, 183…
$ docmonth    <int> 9, 11, 8, 9, 3, 9, 10, 4, 6, 8, 1, 4, 6, 2, 3, 7, 10, 10,…
$ docday      <int> 20, 14, NA, 7, 1, 23, 21, 25, 5, NA, 17, NA, 14, 15, 28, …
$ docsequence <int> 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 3…
$ docid       <fct> S10003-D023, S10003-D024, S10003-D025, S10003-D026, S1000…
$ docMonth    <fct> 09, 11, 08, 09, 03, 09, 10, 04, 06, 08, 01, 04, 06, 02, 0…
$ docDay      <fct> 20, 14, 01, 07, 01, 23, 21, 25, 05, 01, 17, 01, 14, 15, 2…
$ docdate     <chr> "1836-09-20", "1836-11-14", "1837-08-01", "1837-09-07", "…
$ docDate     <date> 1836-09-20, 1836-11-14, 1837-08-01, 1837-09-07, 1838-03-…


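Now I need to get the highest value for the key variables of interest. Note that tail(n = 1) returns each author's last row in the current row order, which is the highest value only if the rows are already ordered within each author.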

In [92]:
# Reset vars of interest.
vars  <- c("docsequence", "docid", "docDate", "docauthorid")

# Get the highest values.
highestVals <- by(letters[vars], letters$docauthorid, tail, n=1)

# Put output into DF
highestVals.df <-do.call("rbind", as.list(highestVals))

# View DF
highestVals.df

Unnamed: 0,docsequence,docid,docDate,docauthorid
per0000238,27,S9974-D027,1866-01-01,per0000238
per0000624,5,S530-D005,1913-10-20,per0000624
per0001043,58,S1019-D058,1892-06-01,per0001043
per0004772,144,S2344-D144,1882-12-25,per0004772
per0005226,38,S9860-D038,1864-09-02,per0005226
per0007227,14,S9819-D014,,per0007227
per0012569,72,S9865-D072,1831-10-12,per0012569
per0014260,39,S8552-D039,1868-10-22,per0014260
per0014266,7,S8557-D007,1853-03-29,per0014266
per0017671,4,S9110-D004,1857-07-01,per0017671
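
Since `tail(n = 1)` simply picks the final row of each group, the result depends on how the rows happen to be ordered. A small sketch on invented data of one way to make the ordering explicit before taking the last row per author (this is an illustration of the idea, not the method used above):

```r
# Toy data, deliberately out of order for author "a" (values are made up)
toy <- data.frame(
  docauthorid = c("a", "a", "a", "b", "b"),
  docsequence = c(3L, 1L, 2L, 1L, 2L),
  stringsAsFactors = FALSE
)

# On unsorted rows, tail() returns whatever row comes last
unsortedLast <- tail(toy[toy$docauthorid == "a", ], n = 1)$docsequence

# Sort by author, then by docsequence within author
sorted <- toy[order(toy$docauthorid, toy$docsequence), ]

# Keep the final row of each author's block
lastBySeq <- sorted[!duplicated(sorted$docauthorid, fromLast = TRUE), ]
```

Here `unsortedLast` is 2 even though author "a" has a letter with docsequence 3, while `lastBySeq` holds sequence 3 for "a" and 2 for "b".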


It appears that docsequence and docid mirror each other: the digits after "-D" in docid match docsequence. Neither docsequence nor docid has missing data. What happens in the cases where docDate is missing?

In [93]:
# Check the IDs for the people who show NA for docDate
highestVals.df %>%
filter(is.na(docDate)) 

Unnamed: 0,docsequence,docid,docDate,docauthorid
per0007227,14,S9819-D014,,per0007227
per0025503,13,S9819-D013,,per0025503
per0031623,9,S12296-D009,,per0031623


There are only three authors here, but eight cases are missing a docDate. Why is that? Did these three people write all eight letters? Double-check that.

In [94]:
# Check working metadataset for records missing docDate and summarize to see docauthorid counts
summary(subset(letters, is.na(docDate), select = vars))

  docsequence            docid      docDate       docauthorid
 Min.   : 8.00   S12296-D008:1   Min.   :NA   per0031623:2   
 1st Qu.:12.00   S12296-D009:1   1st Qu.:NA   per0035783:2   
 Median :18.00   S9819-D013 :1   Median :NA   per0007227:1   
 Mean   :27.12   S9819-D014 :1   Mean   :NA   per0025503:1   
 3rd Qu.:44.50   S9831-D022 :1   3rd Qu.:NA   per0035762:1   
 Max.   :61.00   S9831-D044 :1   Max.   :NA   per0035803:1   
                 (Other)    :2   NA's   :8    (Other)   :0   

Argh -- why are docauthorids per0035783, per0035762 and per0035803 not included as NAs for docDate in highestVals.df?

In [95]:
# Put IDs to check into a list
ids2check  <- c("per0035783", "per0035762", "per0035803")

# Get rows for those ids in highestVals.df
rows = which(grepl(paste(ids2check,collapse="|"), highestVals.df$docauthorid))
highestVals.df$docDate[rows]
highestVals.df$docauthorid[rows]
highestVals.df$docid[rows]

# Get rows for those ids in letters
rows = which(grepl(paste(ids2check,collapse="|"), letters$docauthorid))
letters$docDate[rows]
letters$docauthorid[rows]
letters$docid[rows]
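
The `grepl()` calls above treat the pasted IDs as one regular expression, so they match substrings rather than whole IDs. That is harmless as long as every ID has the same fixed format, but an exact comparison with `%in%` sidesteps the question. A small sketch with invented IDs, chosen only to show the difference:

```r
# Invented IDs: the second and third merely contain the wanted ID
wanted <- c("S12-D001")
docid  <- c("S12-D001", "S12-D0010", "XS12-D001")

# Regex alternation matches any string that contains the pattern
regexHits <- which(grepl(paste(wanted, collapse = "|"), docid))

# %in% matches whole values only
exactHits <- which(docid %in% wanted)
```

In this toy case `regexHits` covers all three positions, while `exactHits` contains only the first.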

OK, the reason is that those authors have a mix of dated and undated letters: per0035762 (x1), per0035783 (x2) and per0035803 (x1), for a total of four undated cases. Added to the four undated letters belonging to the three authors flagged in highestVals.df (per0031623 wrote two of them), that makes eight. This checks out. However, it highlights another issue: the highest value may be a duplicate. In the case of per0035783, two docs bear the highest value (1865-01-01): S9831-D073 and S9831-D045. In this case, docsequence can be used to break the tie. But first we need to make sure that docsequence and docDate otherwise refer to the same docids. First, let's double-check that docsequence and docid are indeed perfect reflections of one another.

In [100]:
# Convert factors to character
highestVals.df  <- unfactorize(highestVals.df)

# Extract the last three digits of the docid and place into a new column called test
highestVals.df$test <- substr(highestVals.df$docid, nchar(highestVals.df$docid) - 3 + 1, nchar(highestVals.df$docid))

# Convert it to integer (to drop the leading zeros)
highestVals.df$test <- as.integer(highestVals.df$test)

# Convert character to factor
highestVals.df  <- factorize(highestVals.df)

# View
head(highestVals.df)

Unnamed: 0,docsequence,docid,docDate,docauthorid,test
per0000238,27,S9974-D027,1866-01-01,per0000238,27
per0000624,5,S530-D005,1913-10-20,per0000624,5
per0001043,58,S1019-D058,1892-06-01,per0001043,58
per0004772,144,S2344-D144,1882-12-25,per0004772,144
per0005226,38,S9860-D038,1864-09-02,per0005226,38
per0007227,14,S9819-D014,,per0007227,14


In [102]:
# Test to see if docid and docsequence are the same
table(highestVals.df$test == highestVals.df$docsequence)


TRUE 
 218 
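
Taking a fixed three characters works because every docid seen here ends in a three-digit sequence. A slightly more general sketch, assuming (and this is only an assumption based on the IDs shown) that docids always follow the pattern `S<source>-D<sequence>`, strips everything up to "-D" instead:

```r
# A few docids and sequences copied from the output above
docid       <- c("S9974-D027", "S530-D005", "S2344-D144")
docsequence <- c(27L, 5L, 144L)

# Remove everything up to and including "-D", then drop leading zeros
suffix <- as.integer(sub("^.*-D", "", docid))

# TRUE when every suffix equals its docsequence
allMatch <- all(suffix == docsequence)
```

This version would keep working if a series ever ran past 999 letters, where a fixed width of three would truncate the number.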

All 218 docids in highestVals.df reflect the docsequence. Now, do the docDates recorded in the working metadata for those docids match the docDates in highestVals.df?

In [113]:
# Put docDates into list
docsDates <- highestVals.df$docDate
length(docsDates)

# Put docIDs into list
docIDs <- highestVals.df$docid
length(docIDs)

In [114]:
# Get rows for docIDs in working metadataset
rows = which(grepl(paste(docIDs,collapse="|"), letters$docid))

# Get docDates for those rows
letterDocdates <- letters$docDate[rows]

# Verify length
length(letterDocdates)

In [116]:
# How many NAs in each list?
sum(is.na(docsDates))
sum(is.na(letterDocdates))


In [119]:
# Place docDates in decreasing order (NOTE: THIS WILL EXCLUDE NAs)
docsDates <- sort(docsDates, decreasing = TRUE)
letterDocdates <- sort(letterDocdates, decreasing = TRUE)

# Do they match?
table(letterDocdates == docsDates)


TRUE 
 215 
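
Comparing two sorted vectors confirms that the same collection of dates occurs on both sides, but not that each docid carries the same date in both places. A sketch of a per-docid comparison using `match()`, on invented data (`highest` and `meta` are hypothetical stand-ins for highestVals.df and the working metadata):

```r
# Invented stand-ins; note the rows are in a different order on each side
highest <- data.frame(
  docid   = c("S1-D002", "S2-D005", "S3-D001"),
  docDate = as.Date(c("1850-01-01", NA, "1861-07-04")),
  stringsAsFactors = FALSE
)
meta <- data.frame(
  docid   = c("S3-D001", "S1-D001", "S1-D002", "S2-D005"),
  docDate = as.Date(c("1861-07-04", "1849-12-01", "1850-01-01", NA)),
  stringsAsFactors = FALSE
)

# Position of each highest$docid within meta$docid
idx <- match(highest$docid, meta$docid)

# Dates agree if they are equal, or if both are missing
sameDate    <- highest$docDate == meta$docDate[idx]
bothMissing <- is.na(highest$docDate) & is.na(meta$docDate[idx])
agree       <- (sameDate & !is.na(sameDate)) | bothMissing
```

Unlike the sorted comparison, this keeps the undated rows in the check instead of dropping them.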

Based on the above, it is best to populate the last-letter variable using the docid values from highestVals.df. First, let's make an indicator variable to distinguish letter series from orphan letters (letters whose author appears only once in the dataset).

In [120]:
# Create indicator variables for orphan letters
letters$letterOrphan <- FALSE
glimpse(letters$letterOrphan)

 logi [1:915] FALSE FALSE FALSE FALSE FALSE FALSE ...


In [121]:
# Make a table showing how many letters per author.
docsAuthor <- letters %>% 
unfactorize() %>%
count(docauthorid, sort = TRUE) 
glimpse(docsAuthor)

Rows: 218
Columns: 2
$ docauthorid <chr> "per0038009", "per0022938", "per0004772", "per0001043", "…
$ n           <int> 186, 136, 101, 56, 34, 26, 21, 13, 12, 10, 10, 9, 9, 8, 7…


In [122]:
# Get IDs for authors of orphan letters
orphanAuthors <- docsAuthor %>% 
filter(n == 1) %>% 
pull(docauthorid)

# Get IDs for authors of letter series
seriesAuthors <- docsAuthor %>% 
filter(n >1) %>% 
pull(docauthorid)

length(orphanAuthors)
length(seriesAuthors)

In [123]:
# How many letters are part of a series?
rows = which(grepl(paste(seriesAuthors,collapse="|"), letters$docauthorid))
length(letters$docid[rows])

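# Does this add up? Series letters plus orphan letters (one per orphan author) should equal nrow(letters)
length(letters$docid[rows]) + length(orphanAuthors)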

In [124]:
# Get rows for those orphan IDs
rows = which(grepl(paste(orphanAuthors,collapse="|"), letters$docauthorid))
letters$letterOrphan[rows] <- TRUE # Recode data
summary(letters$letterOrphan)

   Mode   FALSE    TRUE 
logical     751     164 
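
The orphan flag can also be derived in a single step by counting letters per author directly on the rows. A minimal sketch with invented author IDs, using base `ave()` (shown only as an alternative to the count-then-match approach above):

```r
# Toy data: author "b" wrote a single letter (values are made up)
toy <- data.frame(
  docauthorid = c("a", "a", "b", "c", "c", "c"),
  stringsAsFactors = FALSE
)

# Number of letters by the same author, repeated on every row
nLetters <- ave(rep(1L, nrow(toy)), toy$docauthorid, FUN = sum)

# A letter is an orphan when its author has exactly one letter
toy$letterOrphan <- nLetters == 1
```

Because the count lives on each row, no ID matching is needed to write the flag back.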

In [125]:
# Create variable for lastLetter, keeping values from letterOrphan
letters$letterLast <- letters$letterOrphan
glimpse(letters$letterLast)

 logi [1:915] FALSE FALSE FALSE FALSE FALSE FALSE ...


In [126]:
# Get rows for docIDs in working metadataset
rows = which(grepl(paste(docIDs,collapse="|"), letters$docid))

# Flag those rows as last letters
letters$letterLast[rows]  <- TRUE

# Check counts
summary(letters$letterLast)

   Mode   FALSE    TRUE 
logical     697     218 

In [127]:
# Recode orphan letters (currently TRUE) as NA
# Just because letters are orphans does not mean that they were last letters. We don't know.
letters$letterLast[letters$letterOrphan == TRUE]  <- NA
summary(letters$letterLast)

   Mode   FALSE    TRUE    NA's 
logical     697      54     164 
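
The resulting variable has three states: TRUE for the last letter of a series, FALSE for earlier letters in a series, and NA for orphans. A compact sketch of the same coding on invented data, assuming (as established above for this dataset) that the highest docsequence identifies the last letter:

```r
# Toy data: author "b" is an orphan (values are made up)
toy <- data.frame(
  docauthorid = c("a", "a", "a", "b", "c", "c"),
  docsequence = c(1L, 2L, 3L, 7L, 4L, 9L),
  stringsAsFactors = FALSE
)

# Letters per author and highest sequence per author, repeated on every row
nLetters <- ave(toy$docsequence, toy$docauthorid, FUN = length)
maxSeq   <- ave(toy$docsequence, toy$docauthorid, FUN = max)

# NA for orphans, otherwise TRUE only on the highest sequence
toy$letterLast <- ifelse(nLetters == 1, NA, toy$docsequence == maxSeq)
```

Keep in mind that comparisons such as `letterLast == TRUE` return NA for orphans, so filters built on them drop those rows.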

In [136]:
# Collect the docDates of letters flagged as last letters
finalCheck <- letters %>%
filter(letterLast == TRUE) %>%
select(docDate, docauthorid) %>%
pull(docDate)

Does this list match docDates from highestVals.df?

In [137]:
# How many NAs
sum(is.na(finalCheck))

In [144]:
# Are there any values in finalCheck that are not in docDates (the dates from highestVals.df)
setdiff(finalCheck, docsDates)

But we should double-check that NA value. Whose letter is it?

In [147]:
# Get docauthorid
subset(letters, letterLast == TRUE & is.na(docDate), select = c("docauthorid", "docDate", "docsequence", 'docid'))

Unnamed: 0,docauthorid,docDate,docsequence,docid
228,per0031623,,9,S12296-D009


In [148]:
# Examine series info
subset(letters, letters$docauthorid == "per0031623", select = c("docDate", "docid", "docsequence"))

Unnamed: 0,docDate,docid,docsequence
226,1912-10-24,S12296-D002,2
227,,S12296-D008,8
228,,S12296-D009,9


In [150]:
# Check immigration dates.
subset(letters, letters$docauthorid == "per0031623", select = c("docDate", "docid", "docsequence", "yearimmigration"))

Unnamed: 0,docDate,docid,docsequence,yearimmigration
226,1912-10-24,S12296-D002,2,1908
227,,S12296-D008,8,1908
228,,S12296-D009,9,1908


There is no way for me to tell if letter 9 in the series comes before or after the others. I am going to leave it coded as the last letter with the understanding that a human editor more familiar with the series was able to position it in the migrant's life history.

In [152]:
#One last look at the DF
glimpse(letters)

Rows: 915
Columns: 54
$ docsequence             <int> 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 3…
$ docid                   <fct> S10003-D023, S10003-D024, S10003-D025, S10003…
$ docyear                 <int> 1836, 1836, 1837, 1837, 1838, 1838, 1838, 183…
$ doctype                 <fct> Letter, Letter, Letter, Letter, Letter, Lette…
$ allsubject              <fct> "Childbirth; Church attendance; Cities; Farms…
$ broadsubj               <fct> Health; Religion; Communities; Relationships;…
$ personalevent           <fct> NA, NA, NA, NA, NA, NA, NA, NA, Physical illn…
$ wwritten                <fct> "Baltimore, MD; Maryland; United States; Mid-…
$ docauthorid             <fct> per0022938, per0022938, per0022938, per002293…
$ docauthorname           <

In [154]:
vars <- c("docauthorid", "docid", "docsequence", "docDate", "letterOrphan", "letterLast")
summary(letters[vars])

     docauthorid          docid      docsequence        docDate          
 per0038009:186   S10003-D023:  1   Min.   :  2.00   Min.   :1804-05-07  
 per0022938:136   S10003-D024:  1   1st Qu.: 25.00   1st Qu.:1856-11-07  
 per0004772:101   S10003-D025:  1   Median : 53.00   Median :1863-01-16  
 per0001043: 56   S10003-D026:  1   Mean   : 72.19   Mean   :1865-04-15  
 per0022575: 34   S10003-D027:  1   3rd Qu.:110.00   3rd Qu.:1880-08-16  
 per0022530: 26   S10003-D028:  1   Max.   :239.00   Max.   :1913-10-20  
 (Other)   :376   (Other)    :909                    NA's   :8           
 letterOrphan    letterLast     
 Mode :logical   Mode :logical  
 FALSE:751       FALSE:697      
 TRUE :164       TRUE :54       
                 NA's :164      
                                
                                
                                

In [155]:
# Write a new .csv
write.csv(letters, "../20210118_AM_Meta2Merge.csv")
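
By default `write.csv()` also writes the row names, and `read.csv()` reads them back as a column named X, which is the kind of column dropped at the start of this notebook. A small sketch with a toy data frame and temporary files (the file paths are throwaway, not the project's files) shows the effect of `row.names = FALSE`:

```r
# Toy data frame (values are made up)
toy <- data.frame(
  docid   = c("S1-D001", "S1-D002"),
  docyear = c(1836L, 1837L),
  stringsAsFactors = FALSE
)

# Throwaway files for the comparison
withNames    <- tempfile(fileext = ".csv")
withoutNames <- tempfile(fileext = ".csv")

write.csv(toy, withNames)                        # default: row names included
write.csv(toy, withoutNames, row.names = FALSE)  # row names omitted

# Column names seen when each file is read back
colsWith    <- names(read.csv(withNames))
colsWithout <- names(read.csv(withoutNames))
```

Whether to switch is a judgment call: any downstream step that explicitly drops X would need adjusting if the column stopped being written.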