Could there be errors in this data set? #18

Closed
labarba opened this issue Oct 17, 2017 · 2 comments

Comments

labarba commented Oct 17, 2017

Hello! I was pointed to this repo by this Twitter exchange:
https://twitter.com/R_Graph_Gallery/status/920074231269941248

I'm working on a lesson for my students, and took some inspiration from
https://python-graph-gallery.com/341-python-gapminder-animation/
... which uses your data.

A line plot of all life expectancies shows a dramatic drop for one country in 1977 and another in 1992. The first corresponds to Cambodia, but the value (31.2) is not consistent with the actual life expectancy in Cambodia during the crisis in the 70s, which was around 20 years!

Have a look at my draft:
http://go.gwu.edu/engcomp2lesson4

It's an unexecuted Jupyter notebook (we push notebooks with outputs only once they are finalized, to avoid diff bloat).

When I look at the text data in this repo, I find the same: Cambodia in 1977 = 31.2
However, various sources report a life expectancy there in 1977 that was < 20.
For example: https://data.worldbank.org/country/cambodia

The other dip is Rwanda in 1992 = 23.6
But the World Bank gives 28.1
https://data.worldbank.org/country/rwanda
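
To make the two comparisons concrete, here is a minimal sketch that computes the gap between the values in this data set and the reference figures quoted above. It uses only numbers mentioned in this issue. The names `dataset`, `reference`, and `discrepancies` are made up for illustration, and the Cambodia reference of 20.0 is an assumption standing in for the "< 20" upper bound, so the real gap there would be larger.

```python
# Values quoted in this issue; nothing here is read from the package itself.
dataset = {("Cambodia", 1977): 31.2, ("Rwanda", 1992): 23.6}

# Reference figures as reported in this thread. For Cambodia only an upper
# bound ("< 20") was given, so 20.0 is a stand-in (an assumption).
reference = {("Cambodia", 1977): 20.0, ("Rwanda", 1992): 28.1}


def discrepancies(dataset, reference):
    """Return {key: dataset value minus reference value}, rounded to 0.1."""
    return {
        key: round(dataset[key] - reference[key], 1)
        for key in dataset
        if key in reference
    }


gaps = discrepancies(dataset, reference)
for (country, year), gap in sorted(gaps.items()):
    print(f"{country} {year}: dataset - reference = {gap:+.1f} years")
```

A positive gap means the data set reports a higher life expectancy than the reference, a negative gap a lower one.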

So I wonder: did something go awry when preparing this data set?

jennybc (Owner) commented Oct 17, 2017

This is definitely the data on the Gapminder website at the time of download. This repo is transparent about where the data came from and traces how the current data frames arise from Excel spreadsheets. See the data-raw directory for detail.

However, I don't doubt there could be data quality problems! It should definitely NOT be used as an authoritative source for life expectancy. Others have pointed out similar problems in other issues.

The package is offered as a dataset for teaching and demonstrating data wrangling and visualization. I, for one, have a lot of resources built around it. Altering a couple of data points would cause huge diffs in many web resources, with questionable pedagogical gain.

With some hand-wringing, I've concluded that package stability does more good than updating it whenever someone finds a better or different estimate for specific data points.

I hope this makes sense.

jennybc closed this as completed Oct 17, 2017

labarba (Author) commented Oct 17, 2017

Thanks for the response! I will complete my lesson with the data set, and include a discussion of what you say above.

Perhaps it's worth slapping a warning on the repo about the data quality, noting that it should not be used for research purposes.
