## Crosscheck authorData and docData

In [1]:
# Install packages, load libraries.
library(tidyverse)
library(arsenal)

── Attaching packages ─────────────────────────────────────── tidyverse 1.2.1 ──
✔ ggplot2 3.1.0     ✔ purrr   0.2.5
✔ tibble  2.0.1     ✔ dplyr   0.7.6
✔ tidyr   0.8.1     ✔ stringr 1.3.1
✔ readr   1.1.1     ✔ forcats 0.3.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()


Tidyverse is used in this notebook mostly for data manipulation and visualization. Relevant documentation is at https://dplyr.tidyverse.org and https://ggplot2.tidyverse.org.

Arsenal is used to compare datasets with overlapping data. Documentation is at https://cran.r-project.org/web/packages/arsenal/vignettes/comparedf.html

In [3]:
# Import and view author data.
authorData <- read.csv("IMLD_AUTHORS_QA completed.csv")
glimpse(authorData)

Observations: 2,162
Variables: 30
$ sourceids                 <fct> , S10000; S9527, S10001, S10002, S10003, S1…
$ numdocs                   <int> 502, 23, 19, 23, 172, 2, 1, 3, 1, 1, 24, 1,…
$ docauthorid               <fct> per0002637, per0021589, per0022935, per0022…
$ docauthorname             <fct> "Editor", "Pilibosian, Khachadoor, 1904-198…
$ alternatenames            <fct> "", "", "", "", "Giesberg, Henriette Ann El…
$ briefname                 <fct> "Editor", "Khachadoor Pilibosian", "Evelio …
$ authrace                  <fct> Not applicable, White, Black, White, White,…
$ nationality               <fct> Not applicable, Not indicated, United State…
$ religion                  <fct> Not applicable, Catholic; Christian, Cathol…
$ birthyear                 <int> NA, 1904, 1919, 1882, 1813, 1774, 1794, 181…
$ birthmonth                <int> NA, NA, NA, 4, NA, 8, NA, 9, 5, 4, NA, NA, …
$ birthday                  <int> NA, NA, NA, 16, NA, 11, NA, 17, 3, 23, NA, …
$ deathyear       

In [2]:
# Import and view document data.
docData <- read.csv("IMLD_DOCS_QA completed.csv")
glimpse(docData)

Observations: 8,749
Variables: 71
$ docsequence               <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, …
$ docid                     <fct> S10000-D001, S10000-D002, S10000-D003, S100…
$ sourceid                  <fct> S10000, S10000, S10000, S10000, S10000, S10…
$ docauthorid               <fct> per0002637, per0021589, per0021589, per0021…
$ doctitle                  <fct> "Front Matter", "Chapter 1. A Necessary Dec…
$ docyear                   <int> 1992, 1992, 1992, 1992, 1992, 1992, 1992, 1…
$ docmonth                  <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ docday                    <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ docpage                   <fct> "N pag-vi", "1-4", "5-9", "10-11", "12-14",…
$ doctype                   <fct> Editorial, Chapter, Chapter, Chapter, Chapt…
$ docabbr                   <fct> S10000-D001, S10000-D002, S10000-D003, S100…
$ allsubject                <fct> "", "Cousins; Family separation; Fathers; M…
$ subjname        

In [4]:
#Compare datasets and view summary.
cmp <- comparedf(authorData,
          docData, 
          by = "docauthorid", # variable that links the two datasets
          tol.vars = c(point_of_emigration = "point_of_departure"), # origin of migrant?
          tol.factor="labels" # compare only the factor labels not the underlying numbers
          )
cmp
summary(cmp)

Compare Object

Function Call: 
comparedf(x = authorData, y = docData, by = "docauthorid", tol.vars = c(point_of_emigration = "point_of_departure"), 
    tol.factor = "labels")

Shared: 25 non-by variables and 8556 observations.
Not shared: 49 variables and 539 observations.

Differences found in 23/25 variables compared.
16 variables compared have non-identical attributes.



Table: Summary of data.frames

version   arg           ncol   nrow
--------  -----------  -----  -----
x         authorData      30   2162
y         docData         71   8749



Table: Summary of overall comparison

statistic                                                      value
------------------------------------------------------------  ------
Number of by-variables                                             1
Number of non-by variables in common                              25
Number of variables compared                                      25
Number of variables in x but not y                                 4
Number of variables in y but not x                                45
Number of variables compared with some values unequal             23
Number of variables compared with all values equal                 2
Number of observations in common                                8556
Number of observations in x but not y                            346
Number of observations 

docData seems to contain all but 4 variables in authorData, but the values and attributes for many of the shared variables differ, as shown below.
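The variable and observation overlap reported in the summary can also be checked directly with base R set operations. The following is a minimal sketch on made-up data frames (`author_toy` and `doc_toy`, with hypothetical ids and columns), not on the real files.

```r
# Toy stand-ins for authorData and docData (hypothetical values).
author_toy <- data.frame(
  docauthorid = c("per1", "per2", "per3"),
  nationality = c("Canada", "Mexico", "Austria"),
  numdocs     = c(1, 2, 3),
  stringsAsFactors = FALSE
)
doc_toy <- data.frame(
  docauthorid = c("per1", "per1", "per2", "per4"),
  nationality = c("Canadian", "Canadian", "Mexican", "American"),
  doctitle    = c("A", "B", "C", "D"),
  stringsAsFactors = FALSE
)

# Variables shared by both frames, and variables unique to each.
shared_vars <- intersect(names(author_toy), names(doc_toy))
only_author <- setdiff(names(author_toy), names(doc_toy))
only_doc    <- setdiff(names(doc_toy), names(author_toy))

# Ids present in one frame but missing from the other.
ids_only_author <- setdiff(author_toy$docauthorid, doc_toy$docauthorid)
ids_only_doc    <- setdiff(doc_toy$docauthorid, author_toy$docauthorid)

shared_vars
only_author
ids_only_author
```

Unlike `comparedf()`, this check knows nothing about renamed variables such as `point_of_emigration` and `point_of_departure`, so those would show up as unshared.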

In [5]:
# Show just the number of differences and attributes that differ.
summary(cmp)[c(6,8)] 

var.x,var.y,n,NAs
docauthorname,docauthorname,131,0
briefname,briefname,48,0
authrace,authrace,0,0
nationality,nationality,6246,0
religion,religion,57,0
birthyear,birthyear,1,0
birthmonth,birthmonth,4,2
birthday,birthday,3,2
deathyear,deathyear,2,2
deathmonth,deathmonth,54,3

Unnamed: 0,var.x,var.y,name
docauthorid,docauthorid,docauthorid,levels
docauthorname,docauthorname,docauthorname,levels
briefname,briefname,briefname,levels
authrace,authrace,authrace,levels
nationality,nationality,nationality,levels
religion,religion,religion,levels
birthplace,birthplace,birthplace,levels
deathplace,deathplace,deathplace,levels
schoolattend,schoolattend,schoolattend,levels
native_occupation,native_occupation,native_occupation,levels


Examine variables with the greatest number of differences: nationality, birthplace, native_occupation, north_american_occupation, cultural_heritage, point_of_entry and stayed_north_america.
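To rank variables by number of differences rather than reading the table by eye, the per-variable table can be sorted. This is a minimal sketch that assumes the table is a data frame with `var.x` and `n` columns, as the output above suggests. The rows are copied by hand from the first five lines of that output, and `diffs_by_var` is a hypothetical name.

```r
# First five rows of the differences-by-variable table shown above.
diffs_by_var <- data.frame(
  var.x = c("docauthorname", "briefname", "authrace", "nationality", "religion"),
  n     = c(131, 48, 0, 6246, 57),
  stringsAsFactors = FALSE
)

# Sort so the variables with the most differing values come first.
ranked <- diffs_by_var[order(diffs_by_var$n, decreasing = TRUE), ]
head(ranked, 3)
```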

## Nationality (Country of Origin?)

In [163]:
# Get differences
diffs  <- unnest(diffs(cmp, vars = "nationality"))

# What are they?
sort(table(diffs$values.x), decreasing = TRUE)
sort(table(diffs$values.y), decreasing = TRUE)
sum(is.na(diffs$values.x))
sum(is.na(diffs$values.y))

# What does the original data look like?
sort(table(authorData$nationality), decreasing = TRUE)
sort(table(docData$nationality), decreasing = TRUE)
sum(is.na(authorData$nationality))
sum(is.na(docData$nationality))


 United States         Canada United Kingdom         Mexico          China 
          5263            495            123             94             61 
       Austria        Germany      Australia          Japan    Switzerland 
            34             32             31             27             21 
         India        Ireland         Sweden          Chile    Puerto Rico 
            17             13             12              7              6 
        Russia         France           Iran          Italy         Latvia 
             4              1              1              1              1 
        Norway   South Africa Not applicable  Not indicated 
             1              1              0              0 


           American            Canadian   English; European             Mexican 
               5295                 494                 117                  94 
     Chinese; Asian    German; European          Australian     Japanese; Asian 
                 61                  32                  31                  27 
    Swiss; European       Indian; Asian     Irish; European   Swedish; European 
                 21                  17                  13                  12 
            Chilean       Not indicated        Puerto Rican  Scottish; European 
                  7                   5                   5                   5 
  Russian; European    French; European      Iranian; Asian   Italian; European 
                  4                   1                   1                   1 
  Latvian; European Norwegian; European       South African      Not applicable 
                  1                   1                   1                   0 
    Welsh; European 
      


 United States  Not indicated         Mexico         Canada United Kingdom 
          1496            545             40             33             19 
       Germany         Russia          Chile    Puerto Rico      Australia 
             4              4              3              3              1 
       Austria          China         France          India           Iran 
             1              1              1              1              1 
       Ireland          Italy          Japan         Latvia         Norway 
             1              1              1              1              1 
Not applicable   South Africa         Sweden    Switzerland 
             1              1              1              1 


           American       Not indicated            Canadian      Not applicable 
               5389                1951                 494                 460 
  English; European             Mexican      Chinese; Asian    German; European 
                117                  94                  61                  32 
         Australian     Japanese; Asian     Swiss; European       Indian; Asian 
                 31                  27                  21                  17 
    Irish; European   Swedish; European             Chilean        Puerto Rican 
                 13                  12                   7                   5 
 Scottish; European   Russian; European     Welsh; European    French; European 
                  5                   4                   3                   1 
     Iranian; Asian   Italian; European   Latvian; European Norwegian; European 
                  1                   1                   1                   1 
      South African 
      

In [164]:
# Recode blanks and "Not indicated" as NA
authorData$nationality[authorData$nationality==""] <- NA
authorData$nationality[authorData$nationality=="Not indicated"] <- NA
docData$nationality[docData$nationality==""] <- NA
docData$nationality[docData$nationality=="Not indicated"] <- NA

# Check original data
sort(table(authorData$nationality), decreasing = TRUE)
sort(table(docData$nationality), decreasing = TRUE)
sum(is.na(authorData$nationality))
sum(is.na(docData$nationality))


 United States         Mexico         Canada United Kingdom        Germany 
          1496             40             33             19              4 
        Russia          Chile    Puerto Rico      Australia        Austria 
             4              3              3              1              1 
         China         France          India           Iran        Ireland 
             1              1              1              1              1 
         Italy          Japan         Latvia         Norway Not applicable 
             1              1              1              1              1 
  South Africa         Sweden    Switzerland  Not indicated 
             1              1              1              0 


           American            Canadian      Not applicable   English; European 
               5389                 494                 460                 117 
            Mexican      Chinese; Asian    German; European          Australian 
                 94                  61                  32                  31 
    Japanese; Asian     Swiss; European       Indian; Asian     Irish; European 
                 27                  21                  17                  13 
  Swedish; European             Chilean        Puerto Rican  Scottish; European 
                 12                   7                   5                   5 
  Russian; European     Welsh; European    French; European      Iranian; Asian 
                  4                   3                   1                   1 
  Italian; European   Latvian; European Norwegian; European       South African 
                  1                   1                   1                   1 
      Not indicated 
      

In [165]:
# Re-run comparison
cmp_nationality <- comparedf(authorData,
          docData, 
          by = "docauthorid", # variable that links the two datasets
          tol.vars = c(point_of_emigration = "point_of_departure"), # origin of migrant?
          tol.factor="labels" # compare only the factor labels not the underlying numbers
          )
diffs_nationality  <- unnest(diffs(cmp_nationality, vars = "nationality"))

# Re-examine key differences
summary(diffs_nationality$values.x)[c(3, 24)]
summary(diffs_nationality$values.y)[c(1, 3, 26)]
unique(diffs_nationality[is.na(diffs_nationality$values.y), 6]) #This is the X value where y value is NA
unique(diffs_nationality[diffs_nationality$values.y == "American" 
                & diffs_nationality$values.x != "United States", 
                c(3,6)]) #This is the X value and person ID where docData says American and authorData says otherwise

Unnamed: 0,docauthorid,values.x
2488.0,per0021755,Austria
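Most of the raw nationality differences appear to be differences of vocabulary: authorData uses country names and docData uses demonyms. The filter above handles the "American" versus "United States" case by hand. A more general version could use a lookup table, as in this minimal sketch. The `demonym_to_country` mapping is an assumption built from a few labels visible in the tables above, and the toy vectors are made up.

```r
# Hypothetical lookup from docData-style demonyms to authorData-style countries.
demonym_to_country <- c(
  "American"          = "United States",
  "Canadian"          = "Canada",
  "Mexican"           = "Mexico",
  "English; European" = "United Kingdom"
)

# Toy paired values (same person in each position).
author_nat <- c("United States", "Canada", "Austria", "Mexico")
doc_nat    <- c("American", "Canadian", "American", "Mexican")

# Translate demonyms, then flag only the pairs that still disagree.
doc_as_country <- unname(demonym_to_country[doc_nat])
real_conflict  <- !is.na(doc_as_country) & doc_as_country != author_nat

which(real_conflict)
```

Labels missing from the lookup translate to `NA` and are excluded from the conflict flag, so they would need a separate review.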


In [166]:
# Who is this person?
authorData[authorData$docauthorid == "per0021755", 
           c("docauthorname", 
             "birthplace", 
             "point_of_emigration", 
             "year_immigration", 
             "stayed_north_america")]

Unnamed: 0,docauthorname,birthplace,point_of_emigration,year_immigration,stayed_north_america
1688,"Lamarr, Hedy, 1913-2000","Vienna, Vienna, Austria; Vienna, Austria; Austria; Western Europe; Europe",England; United Kingdom; British Isles; Western Europe; Europe,1937,Stayed


According to <i>Wikipedia</i>, Hedy Lamarr was originally Austrian but acquired American citizenship in 1953. This variable cannot indicate point of origin because it does not reflect multiple citizenships. It might be useful in assessing migration outcomes (e.g., did immigrants in North America acquire citizenship?). To be useful in that way, it would require a review of naturalization policies during the period under investigation.

<b>Decision</b>: Correct the authorData entry for Hedy Lamarr. Treat this variable as potentially useful for assessing migration outcomes, but not as an indicator of country of origin. Omit the docData nationality variable from the analysis.

## Birthplace

In [167]:
# Get differences.
diffs  <- unnest(diffs(cmp, vars = "birthplace"))

# What are they?
print(summary(diffs$values.x)[1:5])
print(summary(diffs$values.y)[1:5])

# What does the original data look like?
summary(authorData$birthplace)[1]
summary(docData$birthplace)[1]

                 England; United Kingdom; British Isles; Western Europe; Europe 
                                                                            484 
                                 Ireland; British Isles; Western Europe; Europe 
                                                                            176 
        Biturmansk, Lithuania; Lithuania; Baltic States; Eastern Europe; Europe 
                                                                            171 
                               Lithuania; Baltic States; Eastern Europe; Europe 
                                                                            133 
London, England; England; United Kingdom; British Isles; Western Europe; Europe 
                                                                            123 
                 England; United Kingdom; Western Europe; Europe 
                                                             484 
                                 Ireland; Western Europe; 

In [168]:
# Values in authorData where docData shows "not indicated"
diffs[diffs$values.y == "Not indicated", 6]

<b>Decision</b>: Use authorData because it is missing less data and because, where it differs from docData, it is the more complete of the two (e.g., its geo-referencing includes "British Isles" and docData's does not).
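The claim that one value is simply a less detailed version of the other can be tested mechanically by splitting each semicolon-delimited hierarchy and checking containment. This is a minimal sketch; `split_levels()` and `is_less_detailed()` are hypothetical helpers, and the two strings are taken from the output above.

```r
# Split a "a; b; c" hierarchy string into trimmed elements.
split_levels <- function(x) {
  trimws(strsplit(x, ";", fixed = TRUE)[[1]])
}

# TRUE when every element of `short` appears in `long` and `long` has more elements.
is_less_detailed <- function(short, long) {
  s <- split_levels(short)
  l <- split_levels(long)
  length(s) < length(l) && all(s %in% l)
}

author_value <- "England; United Kingdom; British Isles; Western Europe; Europe"
doc_value    <- "England; United Kingdom; Western Europe; Europe"

is_less_detailed(doc_value, author_value)
is_less_detailed(author_value, doc_value)
```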

## Occupations (Native and North American)

In [170]:
# Get differences for native occupation
diffs  <- unnest(diffs(cmp, vars = "native_occupation"))

In [171]:
# What are they?
print(summary(diffs$values.x)[1:10])
print(summary(diffs$values.y)[1:10])

# What does the original data look like?
summary(authorData$native_occupation)[1:3]
summary(docData$native_occupation)[1:3]

                             Not indicated 
                                      4229 
Businessman; Homemaker; Physician; Student 
                                       172 
                                    Writer 
                                       150 
                      Head of state; Nurse 
                                        30 
                Diplomat; Landlord; Writer 
                                        10 
                               Businessman 
                                         4 
                        Military personnel 
                                         1 
                                Accountant 
                                         0 
                 Accountant; Retail worker 
                                         0 
                                     Actor 
                                         0 
                                     Homemaker; Physician's wife; Student 
                                4379         

In [172]:
# Values in authorData where docData shows missing data
summary(diffs[diffs$values.y == "", 6])[1:2]
# Values in docData where authorData shows "Not indicated"
summary(diffs[diffs$values.x == "Not indicated", 7])[1]

# Who is the person identified as the physician and physician's wife
unique(diffs$docauthorid[diffs$values.x == "Businessman; Homemaker; Physician; Student" | 
              diffs$values.y == "Homemaker; Physician's wife; Student"])

In [173]:
# Who is this person?
authorData[authorData$docauthorid == "per0022938", 
           c("docauthorname", 
             "birthplace", "north_american_occupation", "native_occupation")]
unique(docData[docData$docauthorid == "per0022938", 
           c("docauthorname", 
             "birthplace", "north_american_occupation", "native_occupation")])

Unnamed: 0,docauthorname,birthplace,north_american_occupation,native_occupation
5,"Bruns, Jette, 1813-1899",Germany; Western Europe; Europe,Homemaker; Physician's wife,Businessman; Homemaker; Physician; Student


Unnamed: 0,docauthorname,birthplace,north_american_occupation,native_occupation
69,"Bruns, Jette, 1813-1899",Germany; Western Europe; Europe,Homemaker; Physician's wife,Homemaker; Physician's wife; Student


According to the <i>Find a Grave</i> entries for the subject and her husband, Dr Johann Bernhard Bruns, it appears that the information in authorData is incorrect. Where else are wives associated with their husbands' jobs?

In [174]:
# Values in docData where authorData is "Writer"
summary(diffs$values.y[diffs$values.x == "Writer"])[1]

# Values in docData where other differences are identified.
# %in% tests membership in the set; == would recycle the vector element by element
summary(diffs$values.y[diffs$values.x %in% c("Head of state; Nurse",
                                             "Diplomat; Landlord; Writer",
                                             "Businessman",
                                             "Military personnel")])[1:3]
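A note on filtering against several values at once: `==` recycles the right-hand vector position by position, whereas `%in%` tests set membership. The two can give different answers, as this minimal sketch with made-up values shows.

```r
# Made-up occupation values and a set of targets to match.
x       <- c("Writer", "Businessman", "Businessman", "Writer")
targets <- c("Businessman", "Writer")

# == compares x[1] with targets[1], x[2] with targets[2], then recycles.
n_equal <- sum(x == targets)

# %in% asks, for each element of x, whether it occurs anywhere in targets.
n_in <- sum(x %in% targets)

c(equal = n_equal, in_set = n_in)
```

Here every element of `x` is one of the targets, yet `==` matches only two of the four.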

It appears that "wife" is a point of divergence between the datasets. To what degree are female authors associated with their husbands' professions in the two datasets?

In [175]:
# Get rows in the two dataframes where either native or North American value contains "wife."
docWives <- unique(subset(docData, 
                          grepl('wife',native_occupation) | 
                          grepl('wife',north_american_occupation),
                          c("docauthorid", 
                            "native_occupation", 
                            "north_american_occupation")))
authorWives <- subset(authorData, 
                      grepl('wife',native_occupation) | 
                      grepl('wife',north_american_occupation),
                      c("docauthorid", 
                        "native_occupation", 
                        "north_american_occupation"))
authorWives
docWives

Unnamed: 0,docauthorid,native_occupation,north_american_occupation
5,per0022938,Businessman; Homemaker; Physician; Student,Homemaker; Physician's wife
30,per0000960,Not indicated,Rancher's wife
116,per0025012,Tradesman,Merchant's wife
246,per0034268,Not indicated,Businessman's wife
594,per0009430,Not indicated,Military wife; Writer
620,per0027884,Laborer,Miner's wife
648,per0027984,Accountant; Retail worker,Educator's wife
725,per0028201,Not indicated,Businessman's wife; Accountant
789,per0028852,Butcher,Physician's wife; Social worker
794,per0028866,Butcher,Lawyer's wife; Secretary


Unnamed: 0,docauthorid,native_occupation,north_american_occupation
69,per0022938,Homemaker; Physician's wife; Student,Homemaker; Physician's wife
413,per0000960,,Rancher's wife
848,per0025012,Tradesman,Merchant's wife
984,per0033698,Businessman's wife,
1041,per0034190,Businessman's wife,
1053,per0034268,,Businessman's wife
1932,per0009430,,Military wife; Writer
2012,per0027884,Laborer,Miner's wife
2068,per0027984,Accountant; Retail worker,Educator's wife
2145,per0028201,,Businessman's wife; Accountant


"Wife" is common in North American occupation values in both datasets. In native occupation values, it is absent from authorData and rare in docData.

In [181]:
# Get differences for North American occupation
diffs  <- unnest(diffs(cmp, vars = "north_american_occupation"))

In [178]:
# What are they?
print(summary(diffs$values.x)[1:3])
print(summary(diffs$values.y)[1:2])

# What does the original data look like?
summary(authorData$north_american_occupation)[1]
summary(docData$north_american_occupation)[1]

                       Not indicated Farmer's wife; Military wife; Writer 
                                1941                                  150 
                              Artist 
                                  60 
                      Military wife; Writer 
                 2009                   150 


In [180]:
# Values in authorData where docData shows missing data
summary(diffs[diffs$values.y == "", 6])[1:4]

# Values in docData where authorData shows "Artist"
summary(diffs[diffs$values.x == "Artist", 7])[1]

<b>Decision</b>: Use authorData for both occupation variables because it has less missing data and is more nuanced. Omit "'s wife" from the values for North American occupation and treat occupation as "family occupations". Check this with CR, AP and PM.
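One way the "'s wife" removal could be done is with a fixed-string substitution. This is a minimal sketch; `strip_wife()` is a hypothetical helper and the sample values are taken from the tables above.

```r
# Remove the literal suffix "'s wife" so "Physician's wife" becomes "Physician".
strip_wife <- function(x) {
  gsub("'s wife", "", as.character(x), fixed = TRUE)
}

occupations <- c(
  "Homemaker; Physician's wife",
  "Military wife; Writer",
  "Rancher's wife",
  "Businessman's wife; Accountant"
)

cleaned <- strip_wife(occupations)
cleaned
```

Values such as "Military wife" contain no "'s wife" and are left unchanged, so they would need a separate rule if they are also to be treated as family occupations.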

## Cultural Heritage (Country of Origin)

In [182]:
# Get differences
diffs  <- unnest(diffs(cmp, vars = "cultural_heritage"))

# What are they?
print(summary(diffs$values.x)[1:2])
print(summary(diffs$values.y)[1:2])

# What does the original data look like?
summary(authorData$cultural_heritage)
summary(docData$cultural_heritage)

               Not indicated Puerto Rican; Latin American 
                        1274                          122 
             Puerto Rican 
        1294          121 


In [183]:
summary(diffs[diffs$values.y == "",6])[1:2]
summary(diffs[diffs$values.x == "Not indicated",7])[1]

<b>Decision</b>: Use authorData because it is missing less data and the data is more nuanced (e.g., "Puerto Rican; Latin American" versus just "Puerto Rican").

## Point of Emigration / Departure (Country of Origin?)

In [184]:
# Get differences
diffs  <- unnest(diffs(cmp, vars = "point_of_emigration"))

# What are they?
summary(diffs$values.x)[1]
summary(diffs$values.y)[1]

# What does the original data look like?
summary(authorData$point_of_emigration)[1]
summary(docData$point_of_departure)[c(1,3)]

In [185]:
#Recode missing data in original data to NAs
authorData$point_of_emigration[authorData$point_of_emigration==""] <- NA
authorData$point_of_emigration[authorData$point_of_emigration=="Not indicated"] <- NA
docData$point_of_departure[docData$point_of_departure==""] <- NA
docData$point_of_departure[docData$point_of_departure=="Not indicated"] <- NA
sum(is.na(authorData$point_of_emigration)) #Check to make sure it matches original dataset
sum(is.na(docData$point_of_departure)) #Check to make sure NAs add up
3394+385
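The blank and "Not indicated" recoding in this cell repeats a pattern used for other variables in this notebook. It could be wrapped in a small helper, sketched below. `recode_missing()` is a hypothetical name, and dropping the unused factor levels is an added assumption that the cells above do not make.

```r
# Replace placeholder labels with NA; works on factors and character vectors.
recode_missing <- function(x, missing_labels = c("", "Not indicated")) {
  x[x %in% missing_labels] <- NA
  # Assumption: unused levels are not wanted after recoding.
  if (is.factor(x)) x <- droplevels(x)
  x
}

toy <- factor(c("Canada", "", "Not indicated", "Mexico"))
recoded <- recode_missing(toy)

sum(is.na(recoded))
levels(recoded)
```

Because `%in%` never returns `NA`, the helper is also safe to run on a column that already contains missing values.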

In [188]:
#Re-run the comparison and report of differences
cmp_origin <- comparedf(authorData,
          docData, 
          by = "docauthorid", # variable that links the two datasets
          tol.vars = c(point_of_emigration = "point_of_departure"), # origin of migrant?
          tol.factor="labels" # compare only the factor labels not the underlying numbers
          )
diffs_origin <- unnest(diffs(cmp_origin, vars = "point_of_emigration"))

#Look at biggest diff
summary(diffs_origin$values.x)[1]
summary(diffs_origin$values.y)[1]
unique(diffs_origin[is.na(diffs_origin$values.y), 6]) #This is the X value where y value is NA

<b>Decision</b>: Use authorData because it has no NAs and the data that is present is more detailed.

## Port of Entry (Destination Country)

In [192]:
# Get differences
diffs  <- unnest(diffs(cmp, vars = "point_of_entry"))

In [193]:
# What are they?
summary(droplevels(diffs$values.x))
summary(droplevels(diffs$values.y))

In [194]:
# What does the original data look like?
table(authorData$point_of_entry)["Not indicated"]
table(docData$point_of_entry)["Not indicated"]
table(docData$point_of_entry)[1]

In [195]:
# Recode "Not indicated" and blanks as NA
authorData$point_of_entry[authorData$point_of_entry==""] <- NA
authorData$point_of_entry[authorData$point_of_entry=="Not indicated"] <- NA
docData$point_of_entry[docData$point_of_entry==""] <- NA
docData$point_of_entry[docData$point_of_entry=="Not indicated"] <- NA

# Check original data
table(authorData$point_of_entry)["Not indicated"]
table(docData$point_of_entry)["Not indicated"]
table(docData$point_of_entry)[1]
sum(is.na(authorData$point_of_entry))
sum(is.na(docData$point_of_entry))

In [196]:
# Re-run the comparison and report of differences
cmp_entry <- comparedf(authorData,
          docData, 
          by = "docauthorid", # variable that links the two datasets
          tol.factor="labels" # compare only the factor labels not the underlying numbers
          )
diffs_entry <- unnest(diffs(cmp_entry, vars = "point_of_entry"))

In [197]:
# Look at biggest diff
print(summary(droplevels(diffs_entry$values.x)))
print(summary(droplevels(diffs_entry$values.y)))

Ellis Island, NJ; New Jersey; United States; Mid-Atlantic States; Northeast States; East Coast States; North America 
                                                                                                                   1 
                                                   Montreal, QC; Quebec; Canada; East Coast Provinces; North America 
                                                                                                                  62 
      New York, NY; New York; United States; Mid-Atlantic States; Northeast States; East Coast States; North America 
                                                                                                                 204 
           Ellis Island, NJ; New Jersey; United States; Mid-Atlantic States; Northeast States; East Coast States; North America 
                                                                                                                              7 
      New York, NY - Brooklyn; New

In [198]:
# Values in authorData where docData shows NA
summary(droplevels(diffs_entry[is.na(diffs_entry$values.y), 6])) #This is the X value where y value is NA
# Values in docData where authorData shows New York, NY... 
unique(diffs_entry[diffs_entry$values.x == "New York, NY; New York; United States; Mid-Atlantic States; Northeast States; East Coast States; North America", 7])
# Values in docData where authorData shows Montreal, QC
unique(diffs_entry[diffs_entry$values.x == "Montreal, QC; Quebec; Canada; East Coast Provinces; North America", 7])
unique(diffs_entry[diffs_entry$values.y == "New York, NY - Liberty Island; New York; United States; Mid-Atlantic States; Northeast States; East Coast States; North America", 6])

In [199]:
# How many references to Ellis versus Liberty Island exist in the original data?
summary(droplevels(subset(authorData, grepl('Island',point_of_entry), point_of_entry)))
summary(droplevels(subset(docData, grepl('Island',point_of_entry), point_of_entry)))

                                                                                                              point_of_entry
 Charlottetown, PE; Prince Edward Island; Canada; Maritime Provinces; East Coast Provinces; North America            :  6   
 Ellis Island, NJ; New Jersey; United States; Mid-Atlantic States; Northeast States; East Coast States; North America:917   
 Providence, RI; Rhode Island; United States; New England; Northeast States; East Coast States; North America        :  3   

                                                                                                                         point_of_entry
 Charlottetown, PE; Prince Edward Island; Canada; Maritime Provinces; East Coast Provinces; North America                       :228   
 Ellis Island, NJ; New Jersey; United States; Mid-Atlantic States; Northeast States; East Coast States; North America           :796   
 New York, NY - Liberty Island; New York; United States; Mid-Atlantic States; Northeast States; East Coast States; North America:156   
 Providence, RI; Rhode Island; United States; New England; Northeast States; East Coast States; North America                   : 46   

<b>Decision</b>: Use authorData because it has less missing data and no confounding references to Liberty (as opposed to Ellis) Island. During analysis, consider using state/province-level data.
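State/province-level values could be pulled out of the semicolon-delimited hierarchy by position. This is a minimal sketch that assumes the first element is always the city or port and the second is the state or province, which holds for the point_of_entry values shown above but has not been checked for the whole column. `get_level()` is a hypothetical helper.

```r
# Return the element at position `level` of each "a; b; c" hierarchy string.
get_level <- function(x, level = 2) {
  parts <- strsplit(as.character(x), ";", fixed = TRUE)
  vapply(parts, function(p) {
    p <- trimws(p)
    if (length(p) >= level) p[[level]] else NA_character_
  }, character(1))
}

entries <- c(
  "Ellis Island, NJ; New Jersey; United States; Mid-Atlantic States; Northeast States; East Coast States; North America",
  "Montreal, QC; Quebec; Canada; East Coast Provinces; North America",
  NA
)

states <- get_level(entries)
states
```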

## Stayed in North America (Permanent Settlement)

In [190]:
# Get differences
diffs  <- unnest(diffs(cmp, vars = "stayed_north_america"))

# What are they?
summary(diffs$values.x)
summary(diffs$values.y)

# What does the original data look like?
summary(authorData$stayed_north_america)
summary(docData$stayed_north_america)

<b>Decision</b>: Use authorData because docData implausibly shows immigrants not staying in North America.

## Author Generation

In [6]:
# Get differences
diffs  <- unnest(diffs(cmp, vars = "author_generation"))


In [7]:
# What are they?
summary(droplevels(diffs$values.x))
summary(droplevels(diffs$values.y))

In [10]:
# What are the original values?
summary(droplevels(docData$author_generation))

In [8]:
diffs

var.x,var.y,docauthorid,row.x,row.y,values.x,values.y
author_generation,author_generation,per0003974,1999,8271,First,Not indi....
author_generation,author_generation,per0022893,1989,8170,First,Not indi....
author_generation,author_generation,per0022893,1989,8172,First,Not indi....
author_generation,author_generation,per0022893,1989,8175,First,Not indi....
author_generation,author_generation,per0022893,1989,8166,First,Not indi....
author_generation,author_generation,per0022893,1989,8163,First,Not indi....
author_generation,author_generation,per0022893,1989,8177,First,Not indi....
author_generation,author_generation,per0022893,1989,8161,First,Not indi....
author_generation,author_generation,per0022893,1989,8169,First,Not indi....
author_generation,author_generation,per0022893,1989,8178,First,Not indi....


## How well does the docauthorid variable line up across datasets?

In [16]:
# Factor levels do not match across the datasets.
# This function converts all factors to character class
unfactorize <- function(df){
  # is.factor() also catches ordered factors, whose class() has two elements
  for(i in which(sapply(df, is.factor))) df[[i]] <- as.character(df[[i]])
  return(df)
}
# Code from user "By0" at https://stackoverflow.com/questions/2851015/convert-data-frame-columns-from-factors-to-characters (line 14)
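An alternative to converting after the fact is to stop `read.csv()` from creating factors at import time with `stringsAsFactors = FALSE`. This is a minimal sketch on inline CSV text with hypothetical ids. It is not a change to how the files above were loaded.

```r
# Inline CSV standing in for one of the data files (hypothetical rows).
csv_text <- "docauthorid,nationality,numdocs
per0000001,Canada,3
per0000002,Not indicated,1
"

# With stringsAsFactors = FALSE, text columns arrive as character vectors.
toy <- read.csv(text = csv_text, stringsAsFactors = FALSE)

sapply(toy, class)
```

Reading strings as character from the start would also make the factor-level mismatches between the two datasets moot, at the cost of re-running the import.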

In [17]:
# Convert factor data to character data
authorData <- unfactorize(authorData)
docData <- unfactorize(docData)
glimpse(authorData)
glimpse(docData)

Observations: 2,162
Variables: 30
$ sourceids                 <chr> "", "S10000; S9527", "S10001", "S10002", "S…
$ numdocs                   <int> 502, 23, 19, 23, 172, 2, 1, 3, 1, 1, 24, 1,…
$ docauthorid               <chr> "per0002637", "per0021589", "per0022935", "…
$ docauthorname             <chr> "Editor", "Pilibosian, Khachadoor, 1904-198…
$ alternatenames            <chr> "", "", "", "", "Giesberg, Henriette Ann El…
$ briefname                 <chr> "Editor", "Khachadoor Pilibosian", "Evelio …
$ authrace                  <chr> "Not applicable", "White", "Black", "White"…
$ nationality               <chr> "Not applicable", "Not indicated", "United …
$ religion                  <chr> "Not applicable", "Catholic; Christian", "C…
$ birthyear                 <int> NA, 1904, 1919, 1882, 1813, 1774, 1794, 181…
$ birthmonth                <int> NA, NA, NA, 4, NA, 8, NA, 9, 5, 4, NA, NA, …
$ birthday                  <int> NA, NA, NA, 16, NA, 11, NA, 17, 3, 23, NA, …
$ deathyear       

In [37]:
# Subset docData such that I have only one instance of each id 
daidDoc <- docData[!duplicated(docData$docauthorid), ]
glimpse(daidDoc)

Observations: 1,982
Variables: 71
$ docsequence               <int> 1, 2, 2, 2, 2, 4, 8, 42, 107, 181, 2, 26, 2…
$ docid                     <chr> "S10000-D001", "S10000-D002", "S10001-D002"…
$ sourceid                  <chr> "S10000", "S10000", "S10001", "S10002", "S1…
$ docauthorid               <chr> "per0002637", "per0021589", "per0022935", "…
$ doctitle                  <chr> "Front Matter", "Chapter 1. A Necessary Dec…
$ docyear                   <int> 1992, 1992, 2000, 1967, 1886, 1827, 1832, 1…
$ docmonth                  <int> NA, NA, NA, NA, NA, 7, 1, NA, 3, 11, NA, NA…
$ docday                    <int> NA, NA, NA, NA, NA, 2, 10, NA, NA, 8, NA, N…
$ docpage                   <chr> "N pag-vi", "1-4", "3-5", "[3]-10", "33-36"…
$ doctype                   <chr> "Editorial", "Chapter", "Chapter", "Chapter…
$ docabbr                   <chr> "S10000-D001", "S10000-D002", "S10001-D002"…
$ allsubject                <chr> "", "Cousins; Family separation; Fathers; M…
$ subjname        
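A small sketch of what the `!duplicated()` subset keeps, using a toy data frame (the IDs and document IDs here are made up for illustration). Only the first row for each docauthorid survives, so document-level fields in the subset describe just the first document listed for each author.

```r
# Toy document table: the first author has two documents.
docs_toy <- data.frame(
  docauthorid = c("per0000001", "per0000001", "per0000002"),
  docid       = c("S1-D001", "S1-D002", "S2-D001"),
  stringsAsFactors = FALSE
)

# duplicated() flags the second and later occurrences of each ID,
# so negating it keeps the first occurrence only.
first_per_author <- docs_toy[!duplicated(docs_toy$docauthorid), ]
first_per_author$docid
```

This is worth keeping in mind when summarizing doctype or language from the deduplicated table: an author's later documents are not represented.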

In [38]:
# Compare datasets and view summary.
cmp_docauthorid <- comparedf(authorData,
          daidDoc, 
          by = "docauthorid", # variable that links the two datasets
          # compare x's point_of_emigration with y's point_of_departure (origin of migrant?)
          tol.vars = c(point_of_emigration = "point_of_departure")
          )
cmp_docauthorid
summary(cmp_docauthorid)

Compare Object

Function Call: 
comparedf(x = authorData, y = daidDoc, by = "docauthorid", tol.vars = c(point_of_emigration = "point_of_departure"))

Shared: 25 non-by variables and 1816 observations.
Not shared: 49 variables and 512 observations.

Differences found in 23/25 variables compared.
0 variables compared have non-identical attributes.



Table: Summary of data.frames

version   arg           ncol   nrow
--------  -----------  -----  -----
x         authorData      30   2162
y         daidDoc         71   1982



Table: Summary of overall comparison

statistic                                                      value
------------------------------------------------------------  ------
Number of by-variables                                             1
Number of non-by variables in common                              25
Number of variables compared                                      25
Number of variables in x but not y                                 4
Number of variables in y but not x                                45
Number of variables compared with some values unequal             23
Number of variables compared with all values equal                 2
Number of observations in common                                1816
Number of observations in x but not y                            346
Number of observations 

In [39]:
# Examine differences on the key variable (docauthorid)
# The fifth element of the summary is the table of observations not shared
diffs_docauthorid <- summary(cmp_docauthorid)[[5]] # Extract the table from the list
diffs_docauthorid

Unnamed: 0,version,docauthorid,observation
14,x,per0002514,1427
18,x,per0003886,1948
78,x,per0005132,1949
82,x,per0006512,1373
134,x,per0018263,1478
137,x,per0018345,1481
182,x,per0018452,1526
215,x,per0018489,1559
221,x,per0018577,1565
226,x,per0018582,1570


In [40]:
# How many unshared IDs in total (duplicates are checked in the next cell)?
length(diffs_docauthorid$docauthorid)

In [41]:
# Ids in authorData but not daidDoc
idsAuthor  <- diffs_docauthorid[diffs_docauthorid$version=="x",2]
length(idsAuthor)

# Ids in daidDoc but not authorData
idsDoc <- diffs_docauthorid[diffs_docauthorid$version=="y",2]
length(idsDoc)

# Do the counts match?
length(idsAuthor) + length(idsDoc)

# Are there any duplicates?
sum(duplicated(c(idsAuthor, idsDoc)))
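The counts above can also be cross-checked without the comparison object, using base R set operations. The sketch below uses toy ID vectors (hypothetical values); on the notebook's data the analogous calls would presumably be `setdiff(authorData$docauthorid, daidDoc$docauthorid)` and its reverse.

```r
# Toy ID vectors standing in for the two docauthorid columns.
author_ids <- c("per0000001", "per0000002", "per0000003")
doc_ids    <- c("per0000002", "per0000003", "per0000004", "per0000005")

only_author <- setdiff(author_ids, doc_ids)    # in x but not y
only_doc    <- setdiff(doc_ids, author_ids)    # in y but not x
shared      <- intersect(author_ids, doc_ids)  # in both

# Each dataset's unique IDs should split cleanly into shared + unshared.
length(only_author) + length(shared) == length(unique(author_ids))
length(only_doc) + length(shared) == length(unique(doc_ids))
```

If both checks return TRUE and the set differences agree with idsAuthor and idsDoc, the two approaches tell the same story.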

In [42]:
# Confirm that the IDs are present in their own datasets
nrow(subset(authorData, grepl(paste(idsAuthor,collapse="|"), docauthorid), select = "docauthorid"))
nrow(subset(daidDoc, grepl(paste(idsDoc,collapse="|"), docauthorid), select = "docauthorid"))

In [43]:
# Confirm absence of ids in other dataset
nrow(subset(authorData, grepl(paste(idsDoc,collapse="|"), docauthorid), select = "docauthorid"))
nrow(subset(daidDoc, grepl(paste(idsAuthor,collapse="|"), docauthorid), select = "docauthorid"))

In [44]:
# Summarize data for IDs in authorData but not daidDoc
subset(authorData, 
       grepl(paste(idsAuthor,collapse="|"), 
             docauthorid), 
       select = c("docauthorid",
                  "briefname", 
                  "author_generation")
      ) %>% 
mutate_if(is.character,as.factor) %>% 
summary()
# code from https://stackoverflow.com/questions/20637360/convert-all-data-frame-character-columns-to-factors

     docauthorid                  briefname        author_generation
 per0002514:  1   A. B. Shults         :  1   First         :243    
 per0003886:  1   A. R. Waud           :  1   Not applicable:  1    
 per0005132:  1   Adeline Harmer       :  1   Not indicated : 87    
 per0006512:  1   Albert Berghaus      :  1   Other         :  4    
 per0018263:  1   Albert Edward Sterner:  1   Second        : 11    
 per0018345:  1   Albert Miamidian     :  1                         
 (Other)   :340   (Other)              :340                         

In [45]:
# Summarize data for IDs in daidDoc but not authorData
subset(daidDoc, 
       grepl(paste(idsDoc,collapse="|"), 
             docauthorid), 
       select = c("docauthorid",
                  "briefname", 
                  "author_generation", 
                  "doctype", 
                  "language")) %>% 
mutate_if(is.character,as.factor) %>% 
summary()
# code from https://stackoverflow.com/questions/20637360/convert-all-data-frame-character-columns-to-factors

     docauthorid            briefname       author_generation
 per0000339:  1   John Davies    :  4   First        : 42    
 per0004486:  1   David Davies   :  3   Not indicated:124    
 per0004487:  1   Evan Davies    :  2                        
 per0004488:  1   John Lewis     :  2                        
 per0004489:  1   John Powell    :  2                        
 per0004490:  1   Richard Edwards:  2                        
 (Other)   :160   (Other)        :151                        
             doctype       language  
 Emigration guide:  1   English:166  
 Letter          :165                
                                     
                                     
                                     
                                     
                                     

The main concern here is that author information for first-generation authors of English-language letters will not import correctly, because these docauthorids are missing from authorData.

In [46]:
# Extract names for IDs where author is 1st generation and writing a letter in English 
briefnames2check <- subset(daidDoc, 
       grepl(paste(idsDoc,collapse="|"), 
             docauthorid) & author_generation == "First" & doctype == "Letter" & language == "English", 
       select = c("docauthorid",
                  "briefname", 
                  "author_generation",
                  "doctype", 
                  "language")) %>% pull(briefname)

In [47]:
# Does Briefname match authorData?
subset(authorData, 
       grepl(paste(briefnames2check,collapse="|"), 
             briefname), 
       c("docauthorid", 
         "briefname",
         "north_american_occupation"))

# Does Briefname match daidDoc (the deduplicated docData)?
subset(daidDoc, 
       grepl(paste(briefnames2check,collapse="|"), 
             briefname), 
       c("docauthorid", 
         "briefname",
         "north_american_occupation"))

Unnamed: 0,docauthorid,briefname,north_american_occupation
957,per0049780,William Thomas Smedley,Artist; Illustrator


Unnamed: 0,docauthorid,briefname,north_american_occupation
3382,per0031173,George Roberts,Clergy
3383,per0031175,David Shone Harry,
3386,per0041383,Edward Jones,
3387,per0031178,Robert Williams,
3388,per0031180,John Cheshire,
3391,per0004486,Samuel Roberts,
3393,per0041388,William Jenkins,
3396,per0041394,Morddal,
3398,per0031224,John Owen,
3400,per0031227,John Lloyd,


None of the brief names has an exact match in authorData, so it seems reasonable to assume that the mismatches are due to omissions from authorData rather than to docauthorid errors. The single hit, "William Thomas Smedley", arises only because that name contains "William Thomas". It remains possible that the two William Thomases are the same person with two different IDs.

Note: It might be useful to include ageatdeath as a variable for migration outcome. It will need to be integrated from authorData because this variable is not in docData.
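The partial match on "William Thomas" comes from how `grepl()` works: a pattern built with `paste(..., collapse = "|")` matches any string that contains one of the names, not only strings equal to one. The sketch below reproduces this with short toy vectors (names taken from the output above, but the vectors themselves are illustrative) and contrasts it with an exact test using `%in%`.

```r
# Toy vectors: a few author brief names and the names being looked up.
authors_toy    <- c("William Thomas Smedley", "A. B. Shults", "Adeline Harmer")
names_to_check <- c("William Thomas", "John Owen")

# Pattern matching: any author name that CONTAINS one of the names is a hit.
regex_hits <- authors_toy[grepl(paste(names_to_check, collapse = "|"), authors_toy)]
regex_hits

# Exact matching: only names that are identical count as hits.
exact_hits <- authors_toy[authors_toy %in% names_to_check]
exact_hits
```

For fixed-width IDs such as docauthorid the two approaches should agree, but for free-text names the exact test avoids false positives like this one.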

<b>Decision</b>: Merge authorData (x) and docData (y) using right_join in dplyr (by = docauthorid).
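A minimal sketch of what that right join does, using toy tables (the IDs, ages and document IDs are hypothetical). Every row of the document table is kept; author columns are filled in where the ID matches and left as NA where it does not, and authors without documents are dropped.

```r
library(dplyr)

# Toy author table (x): one row per author.
author_toy <- data.frame(
  docauthorid = c("per0000001", "per0000002"),
  ageatdeath  = c(81L, 64L),
  stringsAsFactors = FALSE
)

# Toy document table (y): one row per document; per0000003 has no author row.
doc_toy <- data.frame(
  docid       = c("S1-D001", "S1-D002", "S2-D001"),
  docauthorid = c("per0000001", "per0000001", "per0000003"),
  stringsAsFactors = FALSE
)

# Keep all rows of y; bring in author columns from x where the ID matches.
merged_toy <- right_join(author_toy, doc_toy, by = "docauthorid")
merged_toy
```

Because the real tables share many column names, it may be cleaner to select only the author columns that are needed (for example docauthorid and ageatdeath) before joining, so that the result does not fill up with `.x` and `.y` duplicates.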

## prepDataset

In [48]:
# Create and summarize a subset for docauthorids that are in y (docData) but not x (authorData). 
prepDataset01 <- subset(docData, 
               grepl(paste(idsDoc,collapse="|"), 
                     docauthorid), 
               select = c("docauthorid", 
                          "author_generation", 
                          "language", 
                          "wwritten", "doctype"))
summary(prepDataset01)

 docauthorid        author_generation    language           wwritten        
 Length:193         Length:193         Length:193         Length:193        
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character  
   doctype         
 Length:193        
 Class :character  
 Mode  :character  

In [49]:
# Converts all character columns to factor class
factorize <- function(df) {
  for (i in which(sapply(df, class) == "character")) df[[i]] <- as.factor(df[[i]])
  return(df)
}

In [50]:
prepDataset01 <- factorize(prepDataset01)
summary(prepDataset01)

     docauthorid      author_generation    language  
 per0000339:  4   First        : 48     English:193  
 per0004486:  3   Not indicated:145                  
 per0004491:  3                                      
 per0004529:  3                                      
 per0004552:  3                                      
 per0031180:  3                                      
 (Other)   :174                                      
                                                                                                                 wwritten  
 Ohio; United States; East North Central States; Midwest States; Mississippi Basin States; North America             : 17  
 California; United States; Western States; West Coast States; North America                                         : 15  
 Pennsylvania; United States; Mid-Atlantic States; Northeast States; East Coast States; North America                : 13  
 Illinois; United States; East North Central States; Midwest States; Mis

In [61]:
# How many items in the subset meet all key conditions?
letters <- filter(prepDataset01, 
                  doctype == "Letter" & #Only letters
                  grepl("English", language) & #Originally in English or translated into English
                  author_generation == "First" & #Only 1st generation migrants
                  grepl("North America", wwritten)) #Writing only from North America
nrow(letters) #How many letters?


In [62]:
summary(letters)

     docauthorid     author_generation    language 
 per0031180: 3   First        :47      English:47  
 per0004486: 2   Not indicated: 0                  
 per0031175: 2                                     
 per0032241: 2                                     
 per0004492: 1                                     
 per0004495: 1                                     
 (Other)   :36                                     
                                                                                                                    wwritten 
 Utica, NY; New York; United States; Mid-Atlantic States; Northeast States; East Coast States; North America            : 4  
 Pittsburgh, PA; Pennsylvania; United States; Mid-Atlantic States; Northeast States; East Coast States; North America   : 3  
 Racine, WI; Wisconsin; United States; East North Central States; Midwest States; Great Lakes States; North America     : 3  
 Scranton, PA; Pennsylvania; United States; Mid-Atlantic States; Northeast State

In [60]:
unique(letters$docauthorid)

In [56]:
write.table(unique(letters$docauthorid), file = "missingIds.txt")
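Note that `write.table()` applied to a bare vector also writes a header line and quoted row names. If the goal is a plain text file with one ID per line, `writeLines()` is one option. The sketch below uses three IDs from the summary above and a temporary file (the file path is illustrative, not the notebook's "missingIds.txt").

```r
# A few IDs standing in for unique(letters$docauthorid).
ids_toy <- c("per0031180", "per0004486", "per0031175")

# Write one ID per line, with no header, quotes or row names.
out_file <- tempfile(fileext = ".txt")
writeLines(ids_toy, out_file)

# Read the file back to confirm the round trip.
ids_back <- readLines(out_file)
identical(ids_toy, ids_back)
```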