add "babies" dataset referenced in EOCE 6.1 of ISRS #4

Closed
beanumber opened this issue Sep 30, 2015 · 10 comments

Comments

@beanumber
Contributor

Unless I am missing it, this is neither the births nor the ncbirths data set.

Side note: Is it necessary to have both births and ncbirths?

@beanumber
Contributor Author

Oh... never mind.
It seems that this is the same as mosaicData::Gestation.
But then shouldn't this package import mosaicData?

@beanumber
Contributor Author

Upon further review, these data sets do not appear to be the same. In particular, the parity and smoke variables are handled differently. In babies they are binary, but in mosaicData::Gestation they are not.

Can anyone illuminate?
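
A minimal sketch of how the coding difference described above could be checked. The helper names `is_binary` and `distinct_codes` are hypothetical, and the toy vectors are made-up values for illustration, not data from either package.

```r
# Toy vectors standing in for the two codings; the values are invented for
# illustration and are NOT taken from babies or mosaicData::Gestation.
smoke_binary   <- c(0, 1, 0, NA, 1)
smoke_detailed <- c(0, 1, 2, 3, 9)

# TRUE when every non-missing value is 0 or 1.
is_binary <- function(x) {
  all(x[!is.na(x)] %in% c(0, 1))
}

# Distinct non-missing codes in sorted order, for eyeballing the difference.
distinct_codes <- function(x) {
  sort(unique(x[!is.na(x)]))
}

is_binary(smoke_binary)
is_binary(smoke_detailed)
distinct_codes(smoke_detailed)

# With the datasets loaded, the same helpers could be applied to the real
# columns named in this thread (assumes both objects exist in the session):
# is_binary(babies$smoke)
# is_binary(mosaicData::Gestation$smoke)
```

Running the same two helpers on `parity` would show whether that variable differs in the same way.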

@mine-cetinkaya-rundel
Collaborator

I'm afraid there is a data provenance issue here and I have not been able to track down the origin any further than what is stated in the help file of the package.

@mine-cetinkaya-rundel
Collaborator

mine-cetinkaya-rundel commented Nov 24, 2020

@AmeliaMN
Contributor

I feel like this could be a good opportunity to update the ncbirths dataset using data from 2019. I'm pretty sure all my wrangling code would work on the new data. The only tricky piece is that real ages are redacted in the public-facing data, so only ranges are available. Maybe someone has a smart idea for how to impute ages, or we could just randomly assign an age within the range.

@mine-cetinkaya-rundel
Collaborator

I think it's a great idea to have the updated datasets here @AmeliaMN! We use the existing dataset in current books so I'd hesitate to replace it -- we can put a note in the docs clarifying the provenance issue as well as suggesting using the newer version. Once it's out of all the most recent editions of the books we could consider deprecating it.

I wonder about naming, how about ncbirths19?

Also, ages, hm... First question that comes to mind is, do we have to have ages? I don't have a great suggestion for imputing but could look up an appropriate method. Selecting from a random distribution in the range should be straightforward.

@AmeliaMN
Contributor

That's fair, and I have seen other textbooks do similar things. For example, Stat2Data::BaseballTimes vs Stat2Data::BaseballTimes2017 or Lock5Data::HollywoodMovies vs Lock5Data::HollywoodMovies2011. I think age was a nice variable because it is numeric, and this dataset gets used in places like the inference for numeric data lab (that's probably an old link, just easy to put my hands on) and your exploring NC births lab.

@mine-cetinkaya-rundel
Collaborator

Given that the age bands are not very wide, I wouldn't be opposed to a random draw from a uniform in that range. We can place the data prep code in the data-raw folder. I'd be happy to do this based on your work or a PR is good too, whichever you prefer!
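
A minimal sketch of the uniform draw suggested above, of the kind that could live in a data-raw script. The band labels, the column names `mage_band` and `mage`, and the toy `births` data frame are all assumptions for illustration; the actual coding in the public natality file may differ.

```r
# Fixed seed so the data prep script gives the same ages on every run.
set.seed(2014)

# Hypothetical age bands; the real labels in the public file may differ.
age_bands <- data.frame(
  band  = c("15-19", "20-24", "25-29", "30-34", "35-39"),
  lower = c(15, 20, 25, 30, 35),
  upper = c(19, 24, 29, 34, 39)
)

# Toy input standing in for the redacted age column.
births <- data.frame(
  mage_band = c("20-24", "30-34", "15-19", "25-29", "20-24")
)

# Look up each row's band limits.
idx   <- match(births$mage_band, age_bands$band)
lower <- age_bands$lower[idx]
upper <- age_bands$upper[idx]

# Draw one whole-number age per row, uniformly from lower..upper inclusive.
births$mage <- lower + floor(runif(nrow(births)) * (upper - lower + 1))

births
```

Drawing whole numbers keeps the imputed variable looking like a reported age; if fractional ages are acceptable, `runif(n, lower, upper + 1)` without the `floor()` step would be an alternative.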

@AmeliaMN
Contributor

I started working on a PR and realized a couple of things: 1. it seems like the most recent natality data is from 2014, and 2. the reason the ncbirths data was from 2004 is probably that it was the last year the data included state information! So, I could make a births14 dataset that would have random babies born in 2014, but they wouldn't necessarily be from North Carolina.

@mine-cetinkaya-rundel
Collaborator

I think that's perfectly fine!
