Be more forgiving of exterior ranges? #42

Open
bmschmidt opened this issue Sep 8, 2016 · 5 comments

@bmschmidt
Contributor

If I have someone named "Orestes" in 1831, I can't match the name in the IPUMS sample:

> gender("Orestes",years=c(1831),method="ipums")
Source: local data frame [0 x 6]

Variables not shown: name <chr>, proportion_male <dbl>, proportion_female <dbl>, gender <lgl>, year_min

No problem, right? Just broaden the net when you have a rare name:

> gender("Orestes",years=c(1821,1841),method="ipums")
Source: local data frame [1 x 6]

     name proportion_male proportion_female gender year_min year_max
    <chr>           <dbl>             <dbl>  <chr>    <dbl>    <dbl>
1 Orestes               1                 0   male     1821     1841

Super. But if I want to do a batch test on many names, I'd like to be able to just set the years for each of them at c(year-30,year+30). But this is going to raise loads of errors for anyone near the edge of the range.

> gender("Orestes",years=c(1803-15,1803+15),method="ipums")
Error in gender("Orestes", years = c(1803 - 15, 1803 + 15), method = "ipums") : 
  Please provide a year range between 1789 and 1930.

Of course I can muck up my code with a lot of maxes and mins for each of the datasets I'm using. But why not just clip c(1788,1818) to c(1789,1818) and write a warning instead of raising an error?
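A minimal sketch of the clip-and-warn behavior proposed above. `clip_years()` is a hypothetical helper, not a function in the gender package, and its default bounds are simply the 1789 to 1930 range quoted in the error message. It assumes the requested range overlaps the bounds and does not handle a range that lies entirely outside them.

```r
# Hypothetical helper, not part of the gender package: clip a requested
# c(min, max) year range to the bounds a method supports, warning on change.
clip_years <- function(years, bounds = c(1789, 1930)) {
  # Default bounds are the IPUMS range quoted in the error message above.
  clipped <- c(max(years[1], bounds[1]), min(years[2], bounds[2]))
  if (!all(clipped == years)) {
    warning(sprintf("Year range %s-%s trimmed to %s-%s.",
                    years[1], years[2], clipped[1], clipped[2]),
            call. = FALSE)
  }
  clipped
}

# Batch use: a +/- 30 year window around each (made-up) year of interest.
target_years <- c(1803, 1850, 1920)
windows <- lapply(target_years, function(y) {
  suppressWarnings(clip_years(c(y - 30, y + 30)))
})
```

Each element of `windows` could then be passed as the `years` argument, so callers near either edge of the range would not need their own max/min bookkeeping.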

lmullen added a commit that referenced this issue Sep 10, 2016
@lmullen
Owner

lmullen commented Sep 10, 2016

I agree that it is better just to trim the range than stop entirely. Can you try the version I just pushed? Note that Orestes doesn't appear until 1828, so this might be a better test:

gender("Orestes",years=c(1920-20,1920+20),method="ipums")

@bmschmidt
Contributor Author

Super, thanks. Edge case note: the behavior is now unclear when both dates are outside the allowed range.

> gender("James",years=c(1930,1930),method="ipums")
Source: local data frame [1 x 6]

   name proportion_male proportion_female gender year_min year_max
  <chr>           <dbl>             <dbl>  <chr>    <dbl>    <dbl>
1 James          0.9902            0.0098   male     1930     1930
> gender("James",years=c(1960,1980),method="ipums")
Source: local data frame [0 x 6]

Variables not shown: name <chr>, proportion_male <dbl>, proportion_female <dbl>, gender <lgl>, year_min
  <dbl>, year_max <dbl>.
Warning message:
In gender("James", years = c(1960, 1980), method = "ipums") :
  The year range provided has been trimmed to fit within 1789 to 1930.

@lmullen
Owner

lmullen commented Sep 10, 2016

Hmm. Good point. As it stands, dates which are completely outside the range of the method will be reset to the entire range of the method. But I suppose it is possible that someone could pass nonsensical dates and get nonsensical answers. I should just report what the dates given were and what the dates actually used are. For that matter, this whole thing should be refactored.
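
One way the "report what was given and what was used" idea might look, as a rough sketch only. `resolve_years()` and its return shape are assumptions for illustration and are not taken from the gender package. Here a request with no overlap yields `NA` for the used range rather than silently falling back to the whole range.

```r
# Hypothetical sketch; the function name and return shape are assumptions,
# not the gender package's API.
resolve_years <- function(years, bounds) {
  requested <- range(years)
  # Two closed intervals overlap when each starts before the other ends.
  overlaps <- requested[1] <= bounds[2] && requested[2] >= bounds[1]
  if (overlaps) {
    used <- c(max(requested[1], bounds[1]), min(requested[2], bounds[2]))
    if (!all(used == requested)) {
      warning(sprintf("Requested years %s-%s; using %s-%s.",
                      requested[1], requested[2], used[1], used[2]),
              call. = FALSE)
    }
  } else {
    used <- c(NA_real_, NA_real_)
    warning(sprintf("Requested years %s-%s lie entirely outside %s-%s; no years used.",
                    requested[1], requested[2], bounds[1], bounds[2]),
            call. = FALSE)
  }
  # Return both ranges so the caller can see exactly what happened.
  list(requested = requested, used = used)
}
```

Returning both ranges alongside the warning would let batch callers log or filter the trimmed cases instead of parsing warning text.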

Lincoln Mullen
Assistant Professor, Department of History & Art History
George Mason University

@bmschmidt
Contributor Author

Based on the output of gender("James", years = c(1960, 1980), method = "ipums"), I think the range is currently being trimmed to years=c(1960,1930), which sails through because the trimming runs after the check that years is ordered. Guaranteed to return nothing, which isn't the worst possible option.
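
To make the suspected order-of-operations problem concrete, here is an assumed reconstruction. Neither function is the package's actual code; both are hypothetical and only illustrate the hypothesis. With the upper bound of 1930 stated in the warning message, trimming each endpoint independently after the ordering check turns c(1960, 1980) into the reversed range c(1960, 1930).

```r
# Assumed reconstruction of the behavior described above; NOT the
# gender package's actual code.
bounds <- c(1789, 1930)

trim_after_check <- function(years, bounds) {
  stopifnot(years[1] <= years[2])                   # ordering check runs first
  if (years[1] < bounds[1]) years[1] <- bounds[1]   # trim lower endpoint
  if (years[2] > bounds[2]) years[2] <- bounds[2]   # trim upper endpoint
  years                                             # may now be reversed
}

trim_then_check <- function(years, bounds) {
  if (years[1] < bounds[1]) years[1] <- bounds[1]
  if (years[2] > bounds[2]) years[2] <- bounds[2]
  # Checking order after trimming catches ranges with no overlap.
  if (years[1] > years[2]) {
    stop("Requested years do not overlap the available range.", call. = FALSE)
  }
  years
}

reversed <- trim_after_check(c(1960, 1980), bounds)
```

Here `reversed` has its minimum above its maximum, so a filter on it can match no rows. Moving the ordering check after the trim, as in `trim_then_check()`, would surface the problem as an explicit error instead.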

@lmullen
Owner

lmullen commented Sep 27, 2016

Yeah, I wasn't thinking clearly about how the range was set for odd inputs. The whole code for setting ranges should be refactored. Will fix.
