UN SDG - bulk import of UN SDG code #6

Merged: 68 commits, merged Jul 15, 2021
Changes from 36 commits

Commits (68)
cdae93a
updating code
spoonerf Jun 14, 2021
39301a3
zipping raw data
spoonerf Jun 14, 2021
30432b6
Merge branch 'feature/wdi-bulk-import' into un_sdg_fiona
spoonerf Jun 14, 2021
8c269b8
using chart revision suggester
spoonerf Jun 14, 2021
67992a8
update gitignore
spoonerf Jun 14, 2021
a454715
Merge branch 'feature/wdi-bulk-import' into un_sdg_fiona
spoonerf Jun 14, 2021
0c0d320
updating alongside the CRS PR
spoonerf Jun 15, 2021
d7863bf
fix(CSR): An indentation issue on my end
spoonerf Jun 16, 2021
8664489
fix(download data): some missing modules and fixing the delete_output…
spoonerf Jun 16, 2021
cdf5d42
fix(main): removing extraneous global vars
spoonerf Jun 16, 2021
2b30463
fix(conflict): fixing merge conflict
spoonerf Jun 21, 2021
45ede6e
chore(.gitignore): Updating gitignore
spoonerf Jun 21, 2021
8b45c70
fix(cleaning): removing extraneous files
spoonerf Jun 21, 2021
048a137
chore(.gitignore): Updating gitignore and removing unnecessary files
spoonerf Jun 21, 2021
0263cbe
fix(gitignore): updating the gitignore to allow upload of zipped inpu…
spoonerf Jun 22, 2021
64949d7
feat(clean): adding function to zip output data
spoonerf Jun 22, 2021
d78758f
fix(gitignore): blocking upload of metadata files as they are too big
spoonerf Jun 22, 2021
98b7c00
feat(output): zipping the output datapoints
spoonerf Jun 22, 2021
4ea521b
feat(upload): uploading input data
spoonerf Jun 22, 2021
6c13b08
feat(print comments): Adding print comments to show progress of code
spoonerf Jun 25, 2021
2999e76
feat(print comments): Adding print comments to show progress of code
spoonerf Jun 25, 2021
87b70b8
feat(data upload): Uploading the input data
spoonerf Jun 25, 2021
49958d5
style(tidying): removing unused modules and imports
spoonerf Jun 25, 2021
8ca3951
style(tidying): removing unused modules and imports
spoonerf Jun 25, 2021
c2c0031
style(tidying): adding print comments
spoonerf Jun 25, 2021
96a9c3e
style(tidying): removing unused modules and imports
spoonerf Jun 25, 2021
00b36e4
feat(data upload): uploading output
spoonerf Jun 25, 2021
10eae86
chore(upload and gitignore): uploading variable replacement json and …
spoonerf Jun 28, 2021
97d2148
chore(cleaning): Removing unused modules and code
spoonerf Jun 28, 2021
a771bc5
chore(cleaning): removing code relating to metadata pdfs as these are…
spoonerf Jun 28, 2021
def4259
chore(cleaning): removing modules that are no longer needed
spoonerf Jun 28, 2021
f3117df
style(download): changing clean_up argument to snake case from camel …
spoonerf Jun 30, 2021
cacab4c
style(function return types): adding type annotations to function arg…
spoonerf Jun 30, 2021
fdd8425
refactor(remove csv): Removing the original download csv and just sto…
spoonerf Jun 30, 2021
07c709f
refactor(clean): removing unused files
spoonerf Jun 30, 2021
8d0dcc4
update(upload): uploading the output data
spoonerf Jun 30, 2021
0c995ec
chore(.gitignore): updating gitignore so only datapoints folder is ig…
spoonerf Jul 5, 2021
d27c949
style(worldbank_wdi): autoformat with black
spoonerf Jul 5, 2021
6ff3867
style(comments): removing outdated comments
spoonerf Jul 5, 2021
8fec443
style(hardcoded): removing hardcoded var from main.py
spoonerf Jul 5, 2021
a574044
style(namespace): using more appropriate global var in function, DATA…
spoonerf Jul 5, 2021
e6fc56d
style(comment block): moving large block of comments to top
spoonerf Jul 5, 2021
a1a7917
style(clean): removing unused code
spoonerf Jul 5, 2021
0e52d2b
chore(clean): remove unnecessary file ext
spoonerf Jul 5, 2021
906998f
refactor(download date): updating the global download date var to be …
spoonerf Jul 5, 2021
acb0334
refactor(data manipulation): creating unique df of series, indicator,…
spoonerf Jul 5, 2021
6d3bd35
refactor(init): removing full file path from dataset_dir
spoonerf Jul 5, 2021
6ee9a1d
refactor(clean): adding comments, keeping columns rather than droppin…
spoonerf Jul 5, 2021
fd909b9
refactor(hardcoded): removing hardcoded variables
spoonerf Jul 5, 2021
28ee36a
refactor(function vars): speeding up the generate_tables_for_indicato…
spoonerf Jul 5, 2021
edda0a7
refactor(function): adding a var to function call so that it doesn't …
spoonerf Jul 5, 2021
88200c7
refactor(clean): removing an unnecessary drop_duplicates, using an as…
spoonerf Jul 5, 2021
051abfd
refactor(un_sdg.refactor): Trying to get a workaround for a problem s…
spoonerf Jul 6, 2021
f709995
refactor(un_sdg.clean): removing an unnecessary condition which was s…
spoonerf Jul 6, 2021
4a76fea
update(un_sdg.data_update): updating the input and output data in lin…
spoonerf Jul 6, 2021
9f11a84
refactor(un_sdg): fixing the delete_output function so it is more fle…
spoonerf Jul 13, 2021
e800982
docs(un_sdg.download): adding comments to function
spoonerf Jul 14, 2021
73850b5
refactor(un_sdg.match_var): change search for datasets to include all…
spoonerf Jul 14, 2021
a4c6f2f
refactor(gitignore): merging conflic, adding venv
spoonerf Jul 14, 2021
9ae711e
chore(un_sdg.file_move): moving standardised_entity_names.csv to CONF…
spoonerf Jul 14, 2021
6570b47
refactor(un_sdg.var rename): reserving all caps var names for global …
spoonerf Jul 14, 2021
22966f8
refactor(un_sdg.download): removing standardised_entity_names.csv fro…
spoonerf Jul 14, 2021
685f468
refactor(un_sdg.comments): adding comments to core.py
spoonerf Jul 14, 2021
261351a
refactor(un_sdg.json indent): adding json dump indent to make a nicer…
spoonerf Jul 14, 2021
e0dd1b3
refactor(un_dsg.unique key): Setting code and display to None, was ru…
spoonerf Jul 15, 2021
209f626
refactor(un_sdg.match_vars): Had an issue with unique database IDs so…
spoonerf Jul 15, 2021
ee9d80a
upload(un_sdg.data): uploading latest and hopefully last version of i…
spoonerf Jul 15, 2021
6dc0eaa
refactor(un_sdg.config): Moving entity file to config rather than output
spoonerf Jul 15, 2021
10 changes: 10 additions & 0 deletions .gitignore
@@ -1,14 +1,24 @@
.vscode/*
.vscode/
!.vscode/tasks.json
!.vscode/launch.json
!.vscode/extensions.json
*.code-workspace


.DS_Store
.env
env/
.venv
venv/
__pycache__/
.ipynb_checkpoints/
povcal/data_by_poverty_line/
povcal/output/
un_sdg/input/**/*.csv
un_sdg/output/**/*.csv
un_sdg/metadata/
*.Rproj.user
*.Rhistory
*.Rproj
.Rproj.user
2 changes: 1 addition & 1 deletion standard_importer/chart_revision_suggester.py
@@ -156,7 +156,7 @@ def insert(
WHERE status IN ("pending", "flagged")
GROUP BY chartId
ORDER BY c DESC
) as grouped
) as grouped
WHERE grouped.c > 1
"""
)
16 changes: 16 additions & 0 deletions un_sdg/__init__.py
@@ -0,0 +1,16 @@
import os

#from standard_importer.import_dataset import USER_ID
# Dataset constants.
DATASET_NAME = "United Nations Sustainable Development Goals Indicators"
DATASET_AUTHORS = "United Nations"
DATASET_VERSION = "2021-03"
DATASET_LINK = "https://unstats.un.org/sdgs/indicators/database/" # Have to request dataset by email
DATASET_RETRIEVED_DATE = "10-May-2021"
DATASET_DIR = os.path.dirname(__file__)
DATASET_NAMESPACE = f"{DATASET_DIR.split('/')[-1]}@{DATASET_VERSION}"
CONFIGPATH = os.path.join(DATASET_DIR, 'config')
INPATH = os.path.join(DATASET_DIR, 'input')
OUTPATH = os.path.join(DATASET_DIR, 'output')
INFILE = os.path.join(INPATH, 'un-sdg-' + DATASET_VERSION + '.csv.zip')
ENTFILE = os.path.join(INPATH, 'entities-' + DATASET_VERSION + '.csv')
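
For orientation, a quick illustrative check of what the derived constants above resolve to (an editor's sketch, assuming the package sits at <repo>/un_sdg on a POSIX path):

from un_sdg import DATASET_NAMESPACE, INFILE, ENTFILE

print(DATASET_NAMESPACE)  # "un_sdg@2021-03"
print(INFILE)             # ".../un_sdg/input/un-sdg-2021-03.csv.zip"
print(ENTFILE)            # ".../un_sdg/input/entities-2021-03.csv"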
219 changes: 219 additions & 0 deletions un_sdg/clean.py
@@ -0,0 +1,219 @@
import pandas as pd
import os
import shutil
import json
import numpy as np
import re
from datetime import datetime
from pathlib import Path
from tqdm import tqdm

from un_sdg import (
    INFILE,
    ENTFILE,
    OUTPATH,
    DATASET_NAME,
    DATASET_AUTHORS,
    DATASET_VERSION
)

from un_sdg.core import (
    create_short_unit,
    extract_datapoints,
    get_distinct_entities,
    clean_datasets,
    dimensions_description,
    attributes_description,
    get_series_with_relevant_dimensions,
    generate_tables_for_indicator_and_series
)

def load_and_clean() -> pd.DataFrame:
    # Load and clean the data
    print("Reading in original data...")
    original_df = pd.read_csv(
        INFILE,
        low_memory=False,
        compression="gzip"
    )
    original_df = original_df[original_df['Value'].notnull()]
    print("Extracting unique entities to " + ENTFILE + "...")
    original_df[['GeoAreaName']].drop_duplicates() \
        .dropna() \
        .rename(columns={'GeoAreaName': 'Country'}) \
        .to_csv(ENTFILE, index=False)
    # Make the datapoints folder
    Path(OUTPATH, 'datapoints').mkdir(parents=True, exist_ok=True)
    return original_df

"""
Now use the country standardiser tool to standardise $ENTFILE
1. Open the OWID Country Standardizer Tool
(https://owid.cloud/admin/standardize);
2. Change the "Input Format" field to "Non-Standard Country Name";
3. Change the "Output Format" field to "Our World In Data Name";
4. In the "Choose CSV file" field, upload {outfpath};
5. For any country codes that do NOT get matched, enter a custom name on
the webpage (in the "Or enter a Custom Name" table column);
* NOTE: For this dataset, you will most likely need to enter custom
names for regions/continents (e.g. "Arab World", "Lower middle
income");
6. Click the "Download csv" button;
7. Replace {outfpath} with the downloaded CSV;
8. Rename the "Country" column to "country_code".
"""

### Datasets
def create_datasets() -> pd.DataFrame:
    df_datasets = clean_datasets(DATASET_NAME, DATASET_AUTHORS, DATASET_VERSION)
    assert df_datasets.shape[0] == 1, f"Only expected one dataset in {os.path.join(OUTPATH, 'datasets.csv')}."
    print("Creating datasets csv...")
    df_datasets.to_csv(os.path.join(OUTPATH, 'datasets.csv'), index=False)
    return df_datasets

### Sources

def create_sources(original_df: pd.DataFrame, df_datasets: pd.DataFrame) -> None:
    df_sources = pd.DataFrame(columns=['id', 'name', 'description', 'dataset_id'])
    source_description_template = {
        'dataPublishedBy': "United Nations Statistics Division",
        'dataPublisherSource': None,
        'link': "https://unstats.un.org/sdgs/indicators/database/",
        'retrievedDate': datetime.now().strftime("%d-%B-%y"),
        'additionalInfo': None
    }
    all_series = original_df[['SeriesCode', 'SeriesDescription', '[Units]']] \
        .groupby(by=['SeriesCode', 'SeriesDescription', '[Units]']) \
        .count() \
        .reset_index()
    source_description = source_description_template.copy()
    print("Extracting sources from original data...")
    for i, row in tqdm(all_series.iterrows(), total=len(all_series)):
        dp_source = original_df[original_df.SeriesCode == row['SeriesCode']].Source.drop_duplicates()
        if len(dp_source) <= 2:
            source_description['dataPublisherSource'] = dp_source.str.cat(sep='; ')
        else:
            source_description['dataPublisherSource'] = 'Data from multiple sources compiled by UN Global SDG Database - https://unstats.un.org/sdgs/indicators/database/'
        print(source_description['dataPublisherSource'])
        try:
            source_description['additionalInfo'] = None
        except:
            pass
        df_sources = df_sources.append({
            'id': i,
            'name': "%s (UN SDG, 2021)" % row['SeriesDescription'],
            'description': json.dumps(source_description),
            'dataset_id': df_datasets.iloc[0]['id'],  # this may need to be more flexible!
            'series_code': row['SeriesCode']
        }, ignore_index=True)
    print("Saving sources csv...")
    df_sources.to_csv(os.path.join(OUTPATH, 'sources.csv'), index=False)


### Variables

def create_variables_datapoints(original_df: pd.DataFrame) -> None:
    variable_idx = 0
    variables = pd.DataFrame(columns=['id', 'name', 'unit', 'dataset_id', 'source_id'])

    # Strip square brackets from column names, e.g. "[Units]" -> "Units"
    new_columns = []
    for k in original_df.columns:
        new_columns.append(re.sub(r"[\[\]]", '', k))

    original_df.columns = new_columns

    entity2owid_name = pd.read_csv(os.path.join(OUTPATH, 'standardized_entity_names.csv')) \
        .set_index('country_code') \
        .squeeze() \
        .to_dict()

    series2source_id = pd.read_csv(os.path.join(OUTPATH, 'sources.csv')) \
        .drop(['name', 'description', 'dataset_id'], 1) \
        .set_index('series_code') \
        .squeeze() \
        .to_dict()

    unit_description = attributes_description()

    dim_description = dimensions_description()

    original_df['country'] = original_df['GeoAreaName'].apply(lambda x: entity2owid_name[x])
    original_df['Units_long'] = original_df['Units'].apply(lambda x: unit_description[x])

    DIMENSIONS = tuple(dim_description.id.unique())
    NON_DIMENSIONS = tuple([c for c in original_df.columns if c not in set(DIMENSIONS)])  # not sure if units should be in here

    all_series = original_df[['Indicator', 'SeriesCode', 'SeriesDescription', 'Units_long']] \
        .groupby(by=['Indicator', 'SeriesCode', 'SeriesDescription', 'Units_long']) \
        .count() \
        .reset_index()
    all_series['short_unit'] = create_short_unit(all_series.Units_long)
    print("Extracting variables from original data...")
    for i, row in tqdm(all_series.iterrows(), total=len(all_series)):
        data_filtered = pd.DataFrame(original_df[(original_df.Indicator == row['Indicator']) & (original_df.SeriesCode == row['SeriesCode'])])
        _, dimensions, dimension_members = get_series_with_relevant_dimensions(data_filtered, DIMENSIONS, NON_DIMENSIONS)
        print(i)
        if len(dimensions) == 0 or (data_filtered[dimensions].isna().sum().sum() > 0):
            # no additional dimensions
            table = generate_tables_for_indicator_and_series(data_filtered, DIMENSIONS, NON_DIMENSIONS)
            print(type(table))
            variable = {
                'dataset_id': 0,
                'source_id': series2source_id[row['SeriesCode']],
                'id': variable_idx,
                'name': "%s - %s - %s" % (row['Indicator'], row['SeriesDescription'], row['SeriesCode']),
                'description': None,
                'code': row['SeriesCode'],
                'unit': row['Units_long'],
                'short_unit': row['short_unit'],
                'timespan': "%s - %s" % (int(np.min(data_filtered['TimePeriod'])), int(np.max(data_filtered['TimePeriod']))),
                'coverage': None,
                'display': None,
                'original_metadata': None
            }
            variables = variables.append(variable, ignore_index=True)
            extract_datapoints(table).to_csv(os.path.join(OUTPATH, 'datapoints', 'datapoints_%d.csv' % variable_idx), index=False)
            variable_idx += 1
        else:
            # has additional dimensions
            for member_combination, table in generate_tables_for_indicator_and_series(data_filtered, DIMENSIONS, NON_DIMENSIONS).items():
                variable = {
                    'dataset_id': 0,
                    'source_id': series2source_id[row['SeriesCode']],
                    'id': variable_idx,
                    'name': "%s - %s - %s - %s" % (
                        row['Indicator'],
                        row['SeriesDescription'],
                        row['SeriesCode'],
                        ' - '.join(map(str, member_combination))),
                    'description': None,
                    'code': row['SeriesCode'],
                    'unit': row['Units_long'],
                    'short_unit': row['short_unit'],
                    'timespan': "%s - %s" % (int(np.min(data_filtered['TimePeriod'])), int(np.max(data_filtered['TimePeriod']))),
                    'coverage': None,
                    'display': None,
                    'original_metadata': None
                }
                print(member_combination)
                variables = variables.append(variable, ignore_index=True)
                extract_datapoints(table).to_csv(os.path.join(OUTPATH, 'datapoints', 'datapoints_%d.csv' % variable_idx), index=False)
                variable_idx += 1
                print(table)
    print("Saving variables csv...")
    variables.to_csv(os.path.join(OUTPATH, 'variables.csv'), index=False)

def create_distinct_entities() -> None:
    # Goes through each datapoints file to get the distinct entities
    df_distinct_entities = pd.DataFrame(get_distinct_entities(), columns=['name'])
    df_distinct_entities.to_csv(os.path.join(OUTPATH, 'distinct_countries_standardized.csv'), index=False)

def compress_output(outpath) -> None:
    # Zip the datapoints folder into datapoints.zip alongside it
    zip_loc = os.path.join(outpath, 'datapoints')
    zip_dest = os.path.join(outpath, 'datapoints')
    shutil.make_archive(base_dir=zip_loc, root_dir=zip_loc, format='zip', base_name=zip_dest)

def main() -> None:
    original_df = load_and_clean()
    df_datasets = create_datasets()
    create_sources(original_df, df_datasets)
    create_variables_datapoints(original_df)
    create_distinct_entities()
    compress_output(OUTPATH)

if __name__ == '__main__':
    main()
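
For context, a minimal sketch of running this cleaning step once the input zip and standardized_entity_names.csv are in place (assumes execution from the repository root; equivalently, python -m un_sdg.clean):

from un_sdg import clean

clean.main()  # writes datasets.csv, sources.csv, variables.csv,
              # distinct_countries_standardized.csv and datapoints.zip to un_sdg/output/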