
UN SDG - bulk import of UN SDG code #6

Merged: 68 commits, Jul 15, 2021
Commits
cdae93a
updating code
spoonerf Jun 14, 2021
39301a3
zipping raw data
spoonerf Jun 14, 2021
30432b6
Merge branch 'feature/wdi-bulk-import' into un_sdg_fiona
spoonerf Jun 14, 2021
8c269b8
using chart revision suggester
spoonerf Jun 14, 2021
67992a8
update gitignore
spoonerf Jun 14, 2021
a454715
Merge branch 'feature/wdi-bulk-import' into un_sdg_fiona
spoonerf Jun 14, 2021
0c0d320
updating alongside the CRS PR
spoonerf Jun 15, 2021
d7863bf
fix(CSR): An indentation issue on my end
spoonerf Jun 16, 2021
8664489
fix(download data): some missing modules and fixing the delete_output…
spoonerf Jun 16, 2021
cdf5d42
fix(main): removing extraneous global vars
spoonerf Jun 16, 2021
2b30463
fix(conflict): fixing merge conflict
spoonerf Jun 21, 2021
45ede6e
chore(.gitignore): Updating gitignore
spoonerf Jun 21, 2021
8b45c70
fix(cleaning): removing extraneous files
spoonerf Jun 21, 2021
048a137
chore(.gitignore): Updating gitignore and removing unnecessary files
spoonerf Jun 21, 2021
0263cbe
fix(gitignore): updating the gitignore to allow upload of zipped inpu…
spoonerf Jun 22, 2021
64949d7
feat(clean): adding function to zip output data
spoonerf Jun 22, 2021
d78758f
fix(gitignore): blocking upload of metadata files as they are too big
spoonerf Jun 22, 2021
98b7c00
feat(output): zipping the output datapoints
spoonerf Jun 22, 2021
4ea521b
feat(upload): uploading input data
spoonerf Jun 22, 2021
6c13b08
feat(print comments): Adding print comments to show progress of code
spoonerf Jun 25, 2021
2999e76
feat(print comments): Adding print comments to show progress of code
spoonerf Jun 25, 2021
87b70b8
feat(data upload): Uploading the input data
spoonerf Jun 25, 2021
49958d5
style(tidying): removing unused modules and imports
spoonerf Jun 25, 2021
8ca3951
style(tidying): removing unused modules and imports
spoonerf Jun 25, 2021
c2c0031
style(tidying): adding print comments
spoonerf Jun 25, 2021
96a9c3e
style(tidying): removing unused modules and imports
spoonerf Jun 25, 2021
00b36e4
feat(data upload): uploading output
spoonerf Jun 25, 2021
10eae86
chore(upload and gitignore): uploading variable replacement json and …
spoonerf Jun 28, 2021
97d2148
chore(cleaning): Removing unused modules and code
spoonerf Jun 28, 2021
a771bc5
chore(cleaning): removing code relating to metadata pdfs as these are…
spoonerf Jun 28, 2021
def4259
chore(cleaning): removing modules that are no longer needed
spoonerf Jun 28, 2021
f3117df
style(download): changing clean_up argument to snake case from camel …
spoonerf Jun 30, 2021
cacab4c
style(function return types): adding type annotations to function arg…
spoonerf Jun 30, 2021
fdd8425
refactor(remove csv): Removing the original download csv and just sto…
spoonerf Jun 30, 2021
07c709f
refactor(clean): removing unused files
spoonerf Jun 30, 2021
8d0dcc4
update(upload): uploading the output data
spoonerf Jun 30, 2021
0c995ec
chore(.gitignore): updating gitignore so only datapoints folder is ig…
spoonerf Jul 5, 2021
d27c949
style(worldbank_wdi): autoformat with black
spoonerf Jul 5, 2021
6ff3867
style(comments): removing outdated comments
spoonerf Jul 5, 2021
8fec443
style(hardcoded): removing hardcoded var from main.py
spoonerf Jul 5, 2021
a574044
style(namespace): using more appropriate global var in function, DATA…
spoonerf Jul 5, 2021
e6fc56d
style(comment block): moving large block of comments to top
spoonerf Jul 5, 2021
a1a7917
style(clean): removing unused code
spoonerf Jul 5, 2021
0e52d2b
chore(clean): remove unnecessary file ext
spoonerf Jul 5, 2021
906998f
refactor(download date): updating the global download date var to be …
spoonerf Jul 5, 2021
acb0334
refactor(data manipulation): creating unique df of series, indicator,…
spoonerf Jul 5, 2021
6d3bd35
refactor(init): removing full file path from dataset_dir
spoonerf Jul 5, 2021
6ee9a1d
refactor(clean): adding comments, keeping columns rather than droppin…
spoonerf Jul 5, 2021
fd909b9
refactor(hardcoded): removing hardcoded variables
spoonerf Jul 5, 2021
28ee36a
refactor(function vars): speeding up the generate_tables_for_indicato…
spoonerf Jul 5, 2021
edda0a7
refactor(function): adding a var to function call so that it doesn't …
spoonerf Jul 5, 2021
88200c7
refactor(clean): removing an unnecessary drop_duplicates, using an as…
spoonerf Jul 5, 2021
051abfd
refactor(un_sdg.refactor): Trying to get a workaround for a problem s…
spoonerf Jul 6, 2021
f709995
refactor(un_sdg.clean): removing an unnecessary condition which was s…
spoonerf Jul 6, 2021
4a76fea
update(un_sdg.data_update): updating the input and output data in lin…
spoonerf Jul 6, 2021
9f11a84
refactor(un_sdg): fixing the delete_output function so it is more fle…
spoonerf Jul 13, 2021
e800982
docs(un_sdg.download): adding comments to function
spoonerf Jul 14, 2021
73850b5
refactor(un_sdg.match_var): change search for datasets to include all…
spoonerf Jul 14, 2021
a4c6f2f
refactor(gitignore): merging conflict, adding venv
spoonerf Jul 14, 2021
9ae711e
chore(un_sdg.file_move): moving standardised_entity_names.csv to CONF…
spoonerf Jul 14, 2021
6570b47
refactor(un_sdg.var rename): reserving all caps var names for global …
spoonerf Jul 14, 2021
22966f8
refactor(un_sdg.download): removing standardised_entity_names.csv fro…
spoonerf Jul 14, 2021
685f468
refactor(un_sdg.comments): adding comments to core.py
spoonerf Jul 14, 2021
261351a
refactor(un_sdg.json indent): adding json dump indent to make a nicer…
spoonerf Jul 14, 2021
e0dd1b3
refactor(un_dsg.unique key): Setting code and display to None, was ru…
spoonerf Jul 15, 2021
209f626
refactor(un_sdg.match_vars): Had an issue with unique database IDs so…
spoonerf Jul 15, 2021
ee9d80a
upload(un_sdg.data): uploading latest and hopefully last version of i…
spoonerf Jul 15, 2021
6dc0eaa
refactor(un_sdg.config): Moving entity file to config rather than output
spoonerf Jul 15, 2021
13 changes: 11 additions & 2 deletions .gitignore
@@ -1,16 +1,25 @@
.vscode/*
.vscode/
!.vscode/tasks.json
!.vscode/launch.json
!.vscode/extensions.json
*.code-workspace


.DS_Store
.env
env/
.venv
venv/
__pycache__/
.ipynb_checkpoints/
povcal/data_by_poverty_line/
povcal/output/

venv
un_sdg/input/**/*.csv
un_sdg/output/datapoints
un_sdg/metadata/
*.Rproj.user
*.Rhistory
*.Rproj
.Rproj.user
venv
2 changes: 1 addition & 1 deletion standard_importer/chart_revision_suggester.py
@@ -156,7 +156,7 @@ def insert(
WHERE status IN ("pending", "flagged")
GROUP BY chartId
ORDER BY c DESC
) as grouped
) as grouped
WHERE grouped.c > 1
"""
)
18 changes: 18 additions & 0 deletions un_sdg/__init__.py
@@ -0,0 +1,18 @@
import os
import pathlib
import datetime

DATASET_NAME = "United Nations Sustainable Development Goals"
DATASET_AUTHORS = "United Nations"
DATASET_VERSION = "2021-03"
DATASET_LINK = "https://unstats.un.org/sdgs/indicators/database/"
DATASET_DIR = os.path.dirname(__file__).split("/")[-1]
DATASET_NAMESPACE = f"{DATASET_DIR}@{DATASET_VERSION}"
CONFIGPATH = os.path.join(DATASET_DIR, "config")
INPATH = os.path.join(DATASET_DIR, "input")
OUTPATH = os.path.join(DATASET_DIR, "output")
INFILE = os.path.join(INPATH, "un-sdg-" + DATASET_VERSION + ".csv.zip")
ENTFILE = os.path.join(INPATH, "entities-" + DATASET_VERSION + ".csv")
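# Derive the dataset retrieval date from the input file's modification time (formatted like "15-July-21").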
fname = pathlib.Path(INFILE)
mtime = datetime.datetime.fromtimestamp(fname.stat().st_mtime)
DATASET_RETRIEVED_DATE = mtime.strftime("%d-%B-%y")
308 changes: 308 additions & 0 deletions un_sdg/clean.py
@@ -0,0 +1,308 @@
"""
After running load_and_clean() to create $ENTFILE, use the country standardiser tool to standardise $ENTFILE:
1. Open the OWID Country Standardizer Tool
(https://owid.cloud/admin/standardize);
2. Change the "Input Format" field to "Non-Standard Country Name";
3. Change the "Output Format" field to "Our World In Data Name";
4. In the "Choose CSV file" field, upload $ENTFILE;
5. For any country codes that do NOT get matched, enter a custom name on
the webpage (in the "Or enter a Custom Name" table column);
* NOTE: For this dataset, you will most likely need to enter custom
names for regions/continents (e.g. "Arab World", "Lower middle
income");
6. Click the "Download csv" button;
7. Name the downloaded csv 'standardized_entity_names.csv' and save it in the config folder;
8. Rename the "Country" column to "country_code" (a scripted sketch of steps 7-8 follows this docstring).
"""

import pandas as pd
import os
import shutil
import json
import numpy as np
import re
from pathlib import Path
from tqdm import tqdm

from un_sdg import (
INFILE,
ENTFILE,
OUTPATH,
CONFIGPATH,
DATASET_NAME,
DATASET_AUTHORS,
DATASET_VERSION,
DATASET_RETRIEVED_DATE,
DATASET_NAMESPACE,
)

from un_sdg.core import (
create_short_unit,
extract_datapoints,
get_distinct_entities,
clean_datasets,
dimensions_description,
attributes_description,
get_series_with_relevant_dimensions,
generate_tables_for_indicator_and_series,
)

"""
load_and_clean():
- Loads in the raw data
- Keeps rows where values in the "Value" column are not Null
- Creates $ENTFILE, a list of unique geographical entities from the "GeoAreaName" column
- Creates the output/datapoints folder
- Returns the cleaned data as a DataFrame
"""


def load_and_clean() -> pd.DataFrame:
# Load and clean the data
print("Reading in original data...")
original_df = pd.read_csv(INFILE, low_memory=False, compression="gzip")
original_df = original_df[original_df["Value"].notnull()]
print("Extracting unique entities to " + ENTFILE + "...")
original_df[["GeoAreaName"]].drop_duplicates().dropna().rename(
columns={"GeoAreaName": "Country"}
).to_csv(ENTFILE, index=False)
# Make the datapoints folder
Path(OUTPATH, "datapoints").mkdir(parents=True, exist_ok=True)
return original_df


"""
create_datasets():
- Creates a very simple one-line csv with the dataset id and name
"""


def create_datasets() -> pd.DataFrame:
df_datasets = clean_datasets(DATASET_NAME, DATASET_AUTHORS, DATASET_VERSION)
assert (
df_datasets.shape[0] == 1
), f"Only expected one dataset in {os.path.join(OUTPATH, 'datasets.csv')}."
print("Creating datasets csv...")
df_datasets.to_csv(os.path.join(OUTPATH, "datasets.csv"), index=False)
return df_datasets


"""
create_sources():
- Creates a csv where each row represents a source for each unique series code in the database
- Each indicator can have multiple series codes associated with it
- Each series code may be associated with multiple indicators
- Each series code may be made up of multiple sources ('dataPublisherSource')
- For each series we extract the 'dataPublisherSource'; if there are two or fewer we record all of them,
and if there are more we record 'Data from multiple sources compiled by UN Global SDG Database - https://unstats.un.org/sdgs/indicators/database/' (illustrated below)
"""


def create_sources(original_df: pd.DataFrame, df_datasets: pd.DataFrame) -> None:
df_sources = pd.DataFrame(columns=["id", "name", "description", "dataset_id"])
source_description_template = {
"dataPublishedBy": "United Nations Statistics Division",
"dataPublisherSource": None,
"link": "https://unstats.un.org/sdgs/indicators/database/",
"retrievedDate": DATASET_RETRIEVED_DATE,
"additionalInfo": None,
}
all_series = (
original_df[["SeriesCode", "SeriesDescription", "[Units]"]]
.drop_duplicates()
.reset_index()
)
source_description = source_description_template.copy()
print("Extracting sources from original data...")
for i, row in tqdm(all_series.iterrows(), total=len(all_series)):
dp_source = original_df[
original_df.SeriesCode == row["SeriesCode"]
].Source.drop_duplicates()
if len(dp_source) <= 2:
source_description["dataPublisherSource"] = dp_source.str.cat(sep="; ")
else:
source_description[
"dataPublisherSource"
] = "Data from multiple sources compiled by UN Global SDG Database - https://unstats.un.org/sdgs/indicators/database/"
try:
source_description["additionalInfo"] = None
except:
pass
df_sources = df_sources.append(
{
"id": i,
"name": "%s %s" % (row["SeriesDescription"], DATASET_NAMESPACE),
"description": json.dumps(source_description),
"dataset_id": df_datasets.iloc[0][
"id"
], # this may need to be more flexible!
"series_code": row["SeriesCode"],
},
ignore_index=True,
)
print("Saving sources csv...")
df_sources.to_csv(os.path.join(OUTPATH, "sources.csv"), index=False)


"""
create_variables_datapoints():
- Outputs variables.csv, where each variable is a row (see the example names below)
- Writes a datapoints csv to output/datapoints for each variable
"""


def create_variables_datapoints(original_df: pd.DataFrame) -> None:
variable_idx = 0
variables = pd.DataFrame(columns=["id", "name", "unit", "dataset_id", "source_id"])

new_columns = []
for k in original_df.columns:
new_columns.append(re.sub(r"[\[\]]", "", k))

original_df.columns = new_columns

entity2owid_name = (
pd.read_csv(os.path.join(CONFIGPATH, "standardized_entity_names.csv"))
.set_index("country_code")
.squeeze()
.to_dict()
)

sources = pd.read_csv(os.path.join(OUTPATH, "sources.csv"))
sources = sources[["id", "series_code"]]

series2source_id = sources.set_index("series_code").squeeze().to_dict()

unit_description = attributes_description()

dim_description = dimensions_description()

original_df["country"] = original_df["GeoAreaName"].apply(
lambda x: entity2owid_name[x]
)
original_df["Units_long"] = original_df["Units"].apply(
lambda x: unit_description[x]
)

init_dimensions = tuple(dim_description.id.unique())
init_non_dimensions = tuple(
[c for c in original_df.columns if c not in set(init_dimensions)]
)
all_series = (
original_df[["Indicator", "SeriesCode", "SeriesDescription", "Units_long"]]
.drop_duplicates()
.reset_index()
)
all_series["short_unit"] = create_short_unit(all_series.Units_long)
print("Extracting variables from original data...")
for i, row in tqdm(all_series.iterrows(), total=len(all_series)):
data_filtered = pd.DataFrame(
original_df[
(original_df.Indicator == row["Indicator"])
& (original_df.SeriesCode == row["SeriesCode"])
]
)
_, dimensions, dimension_members = get_series_with_relevant_dimensions(
data_filtered, init_dimensions, init_non_dimensions
)
if len(dimensions) == 0:
# no additional dimensions
table = generate_tables_for_indicator_and_series(
data_filtered, init_dimensions, init_non_dimensions, dim_description
)
variable = {
"dataset_id": 0,
"source_id": series2source_id[row["SeriesCode"]],
"id": variable_idx,
"name": "%s - %s - %s"
% (row["Indicator"], row["SeriesDescription"], row["SeriesCode"]),
"description": None,
"code": row["SeriesCode"],
"unit": row["Units_long"],
"short_unit": row["short_unit"],
"timespan": "%s - %s"
% (
int(np.min(data_filtered["TimePeriod"])),
int(np.max(data_filtered["TimePeriod"])),
),
"coverage": None,
"display": None,
"original_metadata": None,
}
variables = variables.append(variable, ignore_index=True)
extract_datapoints(table).to_csv(
os.path.join(OUTPATH, "datapoints", "datapoints_%d.csv" % variable_idx),
index=False,
)
variable_idx += 1
else:
# has additional dimensions
for member_combination, table in generate_tables_for_indicator_and_series(
data_filtered, init_dimensions, init_non_dimensions, dim_description
).items():
variable = {
"dataset_id": 0,
"source_id": series2source_id[row["SeriesCode"]],
"id": variable_idx,
"name": "%s - %s - %s - %s"
% (
row["Indicator"],
row["SeriesDescription"],
row["SeriesCode"],
" - ".join(map(str, member_combination)),
),
"description": None,
"code": None,
"unit": row["Units_long"],
"short_unit": row["short_unit"],
"timespan": "%s - %s"
% (
int(np.min(data_filtered["TimePeriod"])),
int(np.max(data_filtered["TimePeriod"])),
),
"coverage": None,
# "display": None,
"original_metadata": None,
}
variables = variables.append(variable, ignore_index=True)
extract_datapoints(table).to_csv(
os.path.join(
OUTPATH, "datapoints", "datapoints_%d.csv" % variable_idx
),
index=False,
)
variable_idx += 1
print("Saving variables csv...")
variables.to_csv(os.path.join(OUTPATH, "variables.csv"), index=False)


def create_distinct_entities() -> None:
df_distinct_entities = pd.DataFrame(
get_distinct_entities(), columns=["name"]
) # Goes through each datapoints file to get the distinct entities
df_distinct_entities.to_csv(
os.path.join(OUTPATH, "distinct_countries_standardized.csv"), index=False
)


def compress_output(outpath) -> None:
outpath = os.path.realpath(outpath)
zip_loc = os.path.join(outpath, "datapoints")
zip_dest = os.path.join(outpath, "datapoints")
shutil.make_archive(
base_dir=zip_loc, root_dir=zip_loc, format="zip", base_name=zip_dest
)


def main() -> None:
original_df = load_and_clean()
df_datasets = create_datasets()
create_sources(original_df, df_datasets)
create_variables_datapoints(original_df)
create_distinct_entities()
compress_output(OUTPATH)


if __name__ == "__main__":
main()
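# A sketch of how to run this step, assuming the zipped input data is already in
# un_sdg/input/ and standardized_entity_names.csv is in un_sdg/config/ (run from the repo root):
#   python -m un_sdg.clean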