3-letter iso country code #4

Closed
ckhung opened this issue Apr 15, 2023 · 5 comments


@ckhung

ckhung commented Apr 15, 2023

Hi, thanks for your work!

It would be very interesting to "join" several datasets and study correlations among vastly different topics such as sanitation and transportation. Using the 3-letter ISO country code as the joining key between tables (instead of the full country name) would make this much easier. I therefore wrote a small program, country-encode.py, that prepends every row with the 3-letter code and the continent in which the country is located. Might it be possible to integrate this idea and code into the entire dataset?
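For readers who want to see the idea in code, here is a minimal sketch of prepending a 3-letter code and a continent to every row of a CSV. It is not the country-encode.py program mentioned above; the `LOOKUP` table, the `prepend_codes` helper, and the `Entity` column name are all hypothetical and only illustrate the approach.

```python
import csv
import io

# Hypothetical lookup: country name -> (ISO 3166-1 alpha-3 code, continent).
# A real table would cover every country; these three rows are illustrative.
LOOKUP = {
    "France": ("FRA", "Europe"),
    "Japan": ("JPN", "Asia"),
    "Kenya": ("KEN", "Africa"),
}


def prepend_codes(csv_text, name_column="Entity"):
    """Return CSV text with 'Code' and 'Continent' columns prepended to every row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    out = io.StringIO()
    fieldnames = ["Code", "Continent"] + list(reader.fieldnames)
    writer = csv.DictWriter(out, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        # Names missing from the lookup (aggregates, variant spellings) get empty cells.
        code, continent = LOOKUP.get(row[name_column], ("", ""))
        writer.writerow({"Code": code, "Continent": continent, **row})
    return out.getvalue()


# The name column is assumed to be called "Entity"; adjust to the actual header.
sample = "Entity,Year,Value\nFrance,2020,1.5\nWorld,2020,2.0\n"
result = prepend_codes(sample)
print(result)
```

Rows whose name is not in the lookup keep empty code and continent cells, which makes unmatched names easy to spot before joining tables on the code.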

BTW, during the process I found that "Faroe Islands" are misspelled as "Faeroe Islands" in some files. Presently I see 52 files have this problem using this command: grep -l Faeroe */*.csv | wc

@Marigold
Contributor

Thanks for your interest @ckhung! We're trying to "harmonise" country names to be consistent across datasets (see how we do it), so joins across datasets should work. It's possible there are some old datasets with non-harmonised names or typos, but the important and recent ones should be clean.

We have a similar table of countries and regions, with ISO codes and harmonised country names, that you could use.

I'd also recommend taking a look at our catalog, which has a Python interface that lets you easily load entire datasets. Good luck with your data work, and don't hesitate to give us feedback!

@edomt
Contributor

edomt commented Apr 17, 2023

A quick note on the Faroe Islands: "Faeroe" is an alternative spelling that's less common now, but that's the one we've used historically in our database. (See merriam-webster.com)
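When joining by name against other sources, variant spellings like this one can be mapped to a single form first. The following is only a sketch under assumptions: the `ALIASES` table and the `harmonise` helper are hypothetical, and this is not the project's own harmonisation tooling.

```python
# Hypothetical alias table: variant spelling -> the spelling used as the join key.
ALIASES = {
    "Faroe Islands": "Faeroe Islands",
}


def harmonise(name):
    """Map a known variant to its canonical spelling; leave other names unchanged."""
    cleaned = name.strip()
    return ALIASES.get(cleaned, cleaned)


names = ["Faroe Islands", "Faeroe Islands", " Iceland "]
harmonised = [harmonise(n) for n in names]
print(harmonised)
```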

@ckhung
Author

ckhung commented Apr 18, 2023

Thanks, @edomt for the note.

Thanks, @Marigold for the explanation and links. I only read a few pages of the etl project. The answer to my following question may just lie somewhere there, but if you could point me to a specific page it would be most helpful :-) How do I easily filter out entries representing aggregates (e.g. world, continent, G20, ...)?

@Marigold
Contributor

> How do I easily filter out entries representing aggregates (e.g. world, continent, G20, ...)?

Sorry, I should have kept it simple! The easiest way is to load the dataframe directly from our catalog with:

# read_feather needs the pyarrow package installed
import pandas as pd
df = pd.read_feather('https://catalog.ourworldindata.org/garden/regions/2023-01-01/regions/definitions.feather')
df.head()  # preview the first rows to see which columns are available

That should give you all you need. We also have more information about countries, such as common aliases, other non-ISO codes, and historical regions, but you probably don't need that.
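As a rough illustration of how a regions table can be used to drop aggregates, here is a sketch on small stand-in data. The column names `code`, `name`, and `region_type`, and all the values shown, are assumptions made for the example; check `df.columns` on the real table before relying on them.

```python
import pandas as pd

# Stand-in for the regions table. The column names and values here are
# assumptions for illustration, not the confirmed schema of the real table.
regions = pd.DataFrame({
    "code": ["FRA", "JPN", "OWID_WRL"],
    "name": ["France", "Japan", "World"],
    "region_type": ["country", "country", "aggregate"],
})

# Stand-in for a dataset that mixes countries and aggregates.
data = pd.DataFrame({
    "country": ["France", "Japan", "World"],
    "year": [2020, 2020, 2020],
    "value": [1.5, 2.5, 4.0],
})

# Keep only rows whose name is listed as a country in the regions table.
country_names = set(regions.loc[regions["region_type"] == "country", "name"])
countries_only = data[data["country"].isin(country_names)]

# Attach the code so later joins can use it instead of the full name.
countries_only = (
    countries_only
    .merge(regions[["code", "name"]], left_on="country", right_on="name", how="left")
    .drop(columns="name")
)
print(countries_only)
```

Filtering against an explicit list of countries, rather than guessing from the names, keeps entries such as "World" or "G20" out even when new aggregates are added later.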

@ckhung
Author

ckhung commented Apr 21, 2023

Thank you very much! Yes, that's exactly what I need. Appreciate it!

ckhung closed this as completed on Apr 21, 2023