# Add chart revision suggester and enhance World Bank WDI bulk import #5

**Merged** on Jun 18, 2021 (18 commits)

## Commits
- `cb456c1` wdi: add data cleaning script for 2021.03 dataset (bnjmacdonald, Apr 16, 2021)
- `e227507` wdi: add chart update scripts for 2021.03 dataset (bnjmacdonald, Apr 16, 2021)
- `2de157f` bugfix: fix issue with long dataset namespace (bnjmacdonald, Apr 16, 2021)
- `605c0e2` remove config + output files with hardcoded db ids (bnjmacdonald, Apr 16, 2021)
- `4583a27` fix(wdi): attempt xls download if csv download fails (bnjmacdonald, Jun 8, 2021)
- `770e1c8` refactor(worldbank_wdi): move suggested revision upsert code to `stan… (bnjmacdonald, Jun 9, 2021)
- `d54b583` data(worldbank_wdi): update WDI dataset to version 2021.05.25 (bnjmacdonald, Jun 9, 2021)
- `90f0fb6` fix(standard_importer): update `delete_dataset` to work with newly re… (bnjmacdonald, Jun 9, 2021)
- `ad42394` docs(readme): add more detail (bnjmacdonald, Jun 10, 2021)
- `231f639` docs(standard_importer): minor doc improvements (bnjmacdonald, Jun 10, 2021)
- `bc7d17f` fix(worldbank_wdi): fix bug with reading xlsx via pandas (bnjmacdonald, Jun 10, 2021)
- `80b22b4` fix(worldbank_wdi): fix error with >1 old variable version (bnjmacdonald, Jun 10, 2021)
- `0699d38` fix(chart_revision_suggester): remove on duplicate key update (bnjmacdonald, Jun 11, 2021)
- `0ecf4ff` refactor(chart_revision_suggester): minor refactoring (bnjmacdonald, Jun 11, 2021)
- `a349de4` style(chart_revision_suggester): add suggested changes (bnjmacdonald, Jun 18, 2021)
- `6f5e878` style(chart_revision_suggester): autoformat with `black` (bnjmacdonald, Jun 18, 2021)
- `14c28f7` style(import_dataset): autoformat with `black` (bnjmacdonald, Jun 18, 2021)
- `b3102d1` style(worldbank_wdi): autoformat with `black` (bnjmacdonald, Jun 18, 2021)
## `.env.example` (2 additions, 0 deletions)

```diff
@@ -3,3 +3,5 @@ DB_HOST="localhost"
 DB_PORT=3306
 DB_USER="user"
 DB_PASS="password"
+USER_ID=57
+DEBUG=False
```
## `README.md` (58 additions, 12 deletions)

@@ -4,29 +4,75 @@ _Bulk import scripts for ingesting large external datasets into OWID's master da

## Overview

OWID keeps a master MySQL database of the charts that appear on our website, as well as the datasets used to create these charts (see [owid-grapher](https://github.com/owid/owid-grapher)).

The `importers` repository aids in the maintenance of this database by:

**1. Importing datasets:** The folders in this repository contain scripts for uploading external datasets to the database at regular intervals, such as the World Bank World Development Indicators. Only some of the datasets in our database are updated in this way. Most are instead manually added to the database by OWID researchers using the [grapher](https://github.com/owid/owid-grapher) admin interface.

**2. Suggesting chart revisions:** Once a new version of a dataset has been uploaded to the database, the next task is to update the corresponding OWID charts to display the newly available data in place of the old data. Because dataset imports create a _new_ version of an existing dataset rather than overwriting an old version of the same dataset, the relevant charts must be amended to display the new data. The scripts in this repository _suggest_ these chart revisions, which are then manually approved or rejected by an OWID researcher using the [grapher](https://github.com/owid/owid-grapher) admin interface.

## Development

1. Install Python 3.8+ and the required packages:

```
pip install -r requirements.txt
```

2. Copy `.env.example` to `.env` and change any variables as needed. If you are unsure what to change, ask a member of the OWID data team.

```
cp .env.example .env
```

3. Follow the [setup instructions in the owid-grapher repository](https://github.com/owid/owid-grapher#initial-development-setup) to initialize a local version of the OWID MySQL database.

> Note: After following the setup instructions, you must initialize the `suggested_chart_revisions` MySQL table by switching to the [feature/admin-suggested-chart-revision-approver](https://github.com/owid/owid-grapher/tree/feature/admin-suggested-chart-revision-approver) branch of the owid-grapher repository and running `yarn buildTsc && yarn typeorm migration:run`.

### Folder structure

Each dataset has its own folder `{institution}_{dataset}/` (e.g. `worldbank_wdi/` for the World Bank World Development Indicators), containing all code and configuration files required to execute the dataset import and suggest the chart revisions.

Typical folder structure:

```
__init__.py # used for storing dataset constants (e.g. {DATASET_NAME}, {DATASET_VERSION})
main.py # executes all 6 steps in sequence
... # helper scripts (e.g. `download.py`, `clean.py`)
input/ # original copy of the dataset
output/ # the cleaned data to be imported + other *generated* files for steps 1-6
config/ # *manually* constructed files for steps 1-6.
```

See [worldbank_wdi/](worldbank_wdi) for a recent example to follow.

## Conventions to follow

Each `{institution}_{dataset}/` folder executes the same 6 steps to import a dataset and suggest chart revisions:

1. Download the data.
- Example: [worldbank_wdi/download.py](worldbank_wdi/download.py).
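
To illustrate, a minimal download sketch (the URLs and the CSV-to-XLS fallback shown here are assumptions inferred from the commit history, not the actual contents of `download.py`):

```
import requests

def download_bulk_file(csv_url: str, xls_url: str, dest_path: str) -> None:
    """Download the bulk data file, attempting the XLS download if the
    CSV download fails (both URLs are hypothetical, supplied by the caller)."""
    for url in (csv_url, xls_url):
        response = requests.get(url)
        if response.ok:
            with open(dest_path, "wb") as f:
                f.write(response.content)
            return
    raise RuntimeError(f"Failed to download dataset from {csv_url} or {xls_url}")
```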

2. Specify which variables in the dataset are to be cleaned and imported into the database.
> **Contributor:** Sorry I haven't read this in detail, so totally uninformed question: does this mean we are not importing all variables available in the WDI dataset? And if yes, is it possible to easily import others later? Asking because, as I understand it, authors can decide to use any WDI variable at any point; it's likely not the case that the ones in the database are the only ones we want to use.
>
> **Contributor Author:** That's correct – the variables stored in `variables_to_clean.json` at the end of step 2 are a subset of all variables in the dataset. In the case of the WDI bulk import in this PR, I've written `init_variables_to_clean.py` (which constructs `variables_to_clean.json`) so as to only keep the WDI variables that have been used in at least 1 chart. My thinking is that this is probably a good rule of thumb for keeping db clutter to a minimum while still providing authors with >90% of the variables they are ever going to use. But more discussion is certainly needed here, and it would be easy enough to alter `init_variables_to_clean.py` to include more variables.
- This information is typically stored in a `variables_to_clean.json` file. Example: [worldbank_wdi/output/variables_to_clean.json](worldbank_wdi/output/variables_to_clean.json).
- For some datasets, it makes sense to generate `variables_to_clean.json` programmatically (as in [worldbank_wdi/init_variables_to_clean.py](worldbank_wdi/init_variables_to_clean.py)), in which case `variables_to_clean.json` should be stored in the `output/` sub-folder. For other datasets, it may make more sense for you to generate `variables_to_clean.json` manually, in which case it should be stored in the `config/` sub-folder.
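
For illustration only, a `variables_to_clean.json` might look roughly like this (the schema shown is an assumption, not the actual format; see the worldbank_wdi example above for the real structure):

```
{
    "variables": [
        {"name": "GDP per capita, PPP (constant 2017 international $)", "code": "NY.GDP.PCAP.PP.KD"}
    ]
}
```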

3. Clean/transform/manipulate the selected variables prior to import.
- This step involves the construction of metadata for each variable (name, description, ...), as well as any required sanity checks on data observations, the removal or correction of problematic data observations, data transformations (e.g. per capita), et cetera.
- The cleaned variables must be saved in CSV files in preparation for import into MySQL. See [standard_importer/README.md](standard_importer/README.md) for the required CSV format.
- Example: [worldbank_wdi/clean.py](worldbank_wdi/clean.py).

> Note: This step generally requires use of OWID's Country Standardizer Tool ([owid.cloud/admin/standardize](https://owid.cloud/admin/standardize), or [localhost:3030/admin/standardize](http://localhost:3030/admin/standardize) if you are running the grapher locally). Upload a list of all unique country/entity names in the dataset to the tool, then save the downloaded csv file to `{DATASET_DIR}/config/standardized_entity_names.csv` (e.g. [worldbank_wdi/config/standardized_entity_names.csv](worldbank_wdi/config/standardized_entity_names.csv)) for use in your cleaning script to harmonize all entity names with OWID entity names.
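
A sketch of how the downloaded mapping might then be applied in a cleaning script (the column names and input file below are assumptions for illustration):

```
import pandas as pd

# hypothetical column names; check the csv downloaded from the standardizer tool
mapping = pd.read_csv("worldbank_wdi/config/standardized_entity_names.csv")
name_map = dict(zip(mapping["Original Name"], mapping["Our World In Data Name"]))

df = pd.read_csv("worldbank_wdi/input/wdi_data.csv")  # hypothetical input file
df["country"] = df["country"].map(name_map)
assert df["country"].notnull().all(), "some entity names were not standardized"
```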

4. Import the dataset into MySQL.
- The [standard_importer/import_dataset.py](standard_importer/import_dataset.py) module exists to implement this step for any dataset, as long as you have saved the cleaned variables from step 3 in the required CSV format.
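
For example (mirroring the usage documented in [standard_importer/README.md](standard_importer/README.md)):

```
from standard_importer import import_dataset

dataset_dir = "worldbank_wdi"
dataset_namespace = "worldbank_wdi@2021.05.25"
import_dataset.main(dataset_dir, dataset_namespace)
```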

5. For each variable in the new dataset, specify which variable in the old dataset is its equivalent.
- This information is typically stored in a `variable_replacements.json` file. Example: [worldbank_wdi/output/variable_replacements.json](worldbank_wdi/output/variable_replacements.json).
- For some datasets, it makes sense to generate `variable_replacements.json` programmatically (as in [worldbank_wdi/match_variables.py](worldbank_wdi/match_variables.py)), in which case `variable_replacements.json` should be stored in the `output/` sub-folder. For other datasets, it may make more sense for you to generate `variable_replacements.json` manually, in which case it should be stored in the `config/` sub-folder.
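
Illustratively, `variable_replacements.json` holds `oldVariable -> newVariable` key-value pairs, e.g. (the variable IDs below are made up, and whether keys are IDs or names is an assumption):

```
{
    "157342": 423563,
    "157343": 423564
}
```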

6. Suggest the chart revisions using the `oldVariable -> newVariable` key-value pairs from `variable_replacements.json`.
- The [standard_importer/chart_revision_suggester.py](standard_importer/chart_revision_suggester.py) module exists to implement this step for any dataset.
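
A hedged sketch of how this step might be run (the class name and interface below are assumptions, not confirmed by this diff):

```
# hypothetical interface for standard_importer/chart_revision_suggester.py
from standard_importer.chart_revision_suggester import ChartRevisionSuggester

suggester = ChartRevisionSuggester("worldbank_wdi")  # reads output/variable_replacements.json
suggester.suggest()  # inserts rows into the suggested_chart_revisions table
```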

Historical data transformations might have used Jupyter notebooks, but all recent ones use Python or R scripts.
## `db.py` (10 additions, 9 deletions)

```diff
@@ -5,12 +5,13 @@
 load_dotenv()

 # Connect to the database
-connection = pymysql.connect(
-    db=os.getenv("DB_NAME"),
-    host=os.getenv("DB_HOST"),
-    port=int(os.getenv("DB_PORT")),
-    user=os.getenv("DB_USER"),
-    password=os.getenv("DB_PASS"),
-    charset="utf8mb4",
-    autocommit=True
-)
+def get_connection() -> pymysql.Connection:
+    return pymysql.connect(
+        db=os.getenv("DB_NAME"),
+        host=os.getenv("DB_HOST"),
+        port=int(os.getenv("DB_PORT")),
+        user=os.getenv("DB_USER"),
+        password=os.getenv("DB_PASS"),
+        charset="utf8mb4",
+        autocommit=True
+    )
```
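
With this change, callers open a short-lived connection per use instead of sharing a module-level connection. For example (the query is illustrative):

```
from db import get_connection

with get_connection() as connection:
    with connection.cursor() as cursor:
        cursor.execute("SELECT id, name FROM datasets LIMIT 5")
        for row in cursor.fetchall():
            print(row)
```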
## `requirements.txt` (2 additions, 0 deletions)

```diff
@@ -5,10 +5,12 @@ beautifulsoup4==4.9.3
 bs4==0.0.1
 certifi==2020.12.5
 chardet==4.0.0
+et-xmlfile==1.1.0
 idna==2.10
 lxml==4.6.3
 multidict==5.1.0
 numpy==1.20.1
+openpyxl==3.0.7
 pandas==1.2.3
 PyMySQL==1.0.2
 python-dateutil==2.8.1
```
## `standard_importer/README.md` (25 additions, 11 deletions)

@@ -1,43 +1,57 @@
# OWID standard importer

Imports a cleaned dataset and associated data sources, variables, and data points into the MySQL database.

Example usage:

```
from standard_importer import import_dataset

dataset_dir = "worldbank_wdi"
dataset_namespace = "worldbank_wdi@2021.05.25"
import_dataset.main(dataset_dir, dataset_namespace)
```

`import_dataset.main(...)` expects a set of CSV files to exist in `{DATASET_DIR}/output/` (e.g. `worldbank_wdi/output`):

- `distinct_countries_standardized.csv`
- `datasets.csv`
- `sources.csv`
- `variables.csv`
- `datapoints/data_points_{VARIABLE_ID}.csv` (one `data_points_{VARIABLE_ID}.csv` file for each variable in `variables.csv`)

## Expected format of CSV files

(see [worldbank_wdi/output](../worldbank_wdi/output) for an example)


### Entities file (`distinct_countries_standardized.csv`)

This file lists all entities present in the data, so that new entities can be created if necessary. Located in `output/distinct_countries_standardized.csv`:

* `name`: name of the entity.
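
For example (illustrative rows):

```
name
Afghanistan
Albania
Zimbabwe
```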


### Datasets file (`datasets.csv`)

Located in `output/datasets.csv`:

* `id`: temporary dataset ID for loading process
* `name`: name of the Grapher dataset


### Sources file (`sources.csv`)

Located in `output/sources.csv`:

* `id`: temporary source ID for loading process
* `name`: name of the source
* `description`: JSON string with `dataPublishedBy` (string), `dataPublisherSource` (string), `link` (string), `retrievedDate` (string), `additionalInfo` (string)
* `dataset_id`: foreign key matching each source with a dataset ID
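
An illustrative `description` value (the field contents are made up for this example):

```
{"dataPublishedBy": "World Bank", "dataPublisherSource": "World Development Indicators", "link": "https://datacatalog.worldbank.org/dataset/world-development-indicators", "retrievedDate": "2021-05-25", "additionalInfo": ""}
```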


### Variables file (`variables.csv`)

Located in `output/variables.csv`:

@@ -55,11 +69,11 @@
* `original_metadata`: JSON object representing original uncleaned metadata from the data source


### Datapoint files (`datapoints/data_points_{VARIABLE_ID}.csv`)

Located in `output/datapoints/datapoints_{VARIABLE_ID}.csv`:

* `{VARIABLE_ID}` in the file name is a foreign key matching values with a temporary variable ID in `variables.csv`
* `country`: location of the observation
* `year`: year of the observation
* `value`: value of the observation
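
For example, a hypothetical `datapoints_157342.csv` might contain:

```
country,year,value
Afghanistan,2019,2156.4
Albania,2019,13965.0
```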