Add vectorized, multi-method version of gender? #43

Open
bmschmidt opened this issue Sep 9, 2016 · 1 comment

@bmschmidt
Contributor

gender_df is very useful, but it requires a lot of programming for what I imagine is the normal case: creating a new column with the gender of every person in a data frame, where the birth years vary widely and might well span 1930. I'm finding I need a function that works on vectors in a less-hadleyverse-mandating kind of way. I just want to feed it a vector of names and years (and maybe countries), and get back a vector of male, female, or NA, one element per name.

(This is essentially the same in result as the function humaniformat::first_name, which I must not be the only one using in conjunction with this).

I wonder if something like this would be useful in the package; it selects the appropriate name/data set based on the year, marks an NA for anything that it doesn't know how to handle, and allows you to just add gender to a data.frame without loads of merging and unmerging. E.g.:

bloo = authors %>% ungroup %>%
  filter(!is.na(name)) %>% 
  mutate(gender = vectorized_gender(names = first,years = birth,fuzz=30))

A buggy, feature-incomplete, and barely tested version of the function is below. If you want a pull request with a clean version, let me know and maybe I can make one; I can see all sorts of reasons you wouldn't.

vectorized_gender = function(years,names,fuzz,threshold = .9) {
  # A function that takes a list of years and names and vectorizes the assignment of gender.
  # Returns a vector the same length as years and names, where each element
  # is 'male', 'female', or NA.
  #
  # It uses 'ssa' for dates after 1930, and 'ipums' for dates before
  # 'fuzz' is the wiggle room on either side of the given year; eg if fuzz is 10
  # and year is 1930, names between 1920 and 1940 will be matched.
  # Avoids duplicating identical queries by using the gender_df method.

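  input = data.frame(name=names,year=years,id = seq_along(names),stringsAsFactors = FALSE)

  # Year range covered by each method; the "NA" row keeps unmatched rows through the join.
  mins_frame = data.frame(method = c("ssa","ipums","NA"), maxx = c(2012,1930,NA), minn = c(1880,1789,NA), stringsAsFactors = FALSE)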

  labeled = 
    input %>% 
    mutate(method = if_else(
      is.na(year),"NA",
      if_else((year + fuzz) < 1789, "NA",
              if_else((year - fuzz) > 2012, "NA",
                      if_else (year > 1930, "ssa", "ipums"))) )
    ) %>% mutate(min = year-fuzz,
                 max = year+fuzz) %>%
    inner_join(mins_frame) %>%
    mutate(min = if_else(min < minn, minn, min),
           max = if_else(max > maxx, maxx, max))

  mergeable = labeled %>% group_by(method) %>% filter(method %in% c("ssa","ipums","napp")) %>%
    do(gender_df(., year_col = c("min","max"), name_col = "name", method = .$method[1])) %>%
    ungroup %>%
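    # Compare on the logit scale: keep names whose proportion_male is above
    # threshold or below 1 - threshold (qlogis, not plogis, maps threshold to that scale).
    filter(abs(qlogis(proportion_male))>abs(qlogis(threshold))) %>%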
    mutate(min=year_min,max=year_max) %>% 
    select(min,max,name,gender)

  meta_mergeable = labeled %>% left_join(mergeable) %>% select(name,year,gender) %>% distinct
  newversion = input %>% left_join(meta_mergeable)
  newversion$gender
}
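
For reference, the method-selection and window-clamping steps in the function above can be sketched in base R, without the gender package or dplyr. The helper names choose_method, clamp_window, and confident_gender are hypothetical and not part of the package. The year ranges (1880 to 2012 for ssa, 1789 to 1930 for ipums) are copied from the code above, and the threshold rule is only one reading of what the threshold argument is meant to do.

```r
# Hypothetical helpers illustrating the logic above; base R only.

# Pick a data source per birth year: "ssa" after 1930, "ipums" otherwise,
# and NA when the year is missing or the fuzzed window misses 1789-2012 entirely.
choose_method <- function(years, fuzz) {
  method <- ifelse(years > 1930, "ssa", "ipums")
  out_of_range <- is.na(years) | (years + fuzz) < 1789 | (years - fuzz) > 2012
  method[out_of_range] <- NA_character_
  method
}

# Clamp each [year - fuzz, year + fuzz] window to the range its source covers.
# Rows with an NA method get NA bounds.
clamp_window <- function(years, fuzz, method) {
  lower <- c(ssa = 1880, ipums = 1789)
  upper <- c(ssa = 2012, ipums = 1930)
  data.frame(
    min = pmax(years - fuzz, unname(lower[method])),
    max = pmin(years + fuzz, unname(upper[method]))
  )
}

# Assumed reading of `threshold`: label a name only when the share of one
# gender is at least `threshold`; otherwise return NA.
confident_gender <- function(proportion_male, threshold = 0.9) {
  ifelse(proportion_male >= threshold, "male",
         ifelse(proportion_male <= 1 - threshold, "female", NA_character_))
}

# Example usage with made-up birth years.
years  <- c(1935, 1900, NA, 1700, 2020)
method <- choose_method(years, fuzz = 30)
window <- clamp_window(years, fuzz = 30, method = method)
```

Keeping these steps as plain vector functions means each one can be checked on its own before any lookup against the name data is attempted.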
@lmullen
Owner

lmullen commented Sep 10, 2016

@bmschmidt I agree that this would be a more useful interface than gender_df. (Virtually everyone who has e-mailed to ask for help with this package never read the documentation to figure out which method to use for their data, so may as well do the right thing for them automatically.) If you want to send a PR, please do.
