Merge pull request #5 from owid/feature/wdi-bulk-import
Add chart revision suggester and enhance World Bank WDI bulk import
bnjmacdonald committed Jun 18, 2021
2 parents b504a3f + b3102d1 commit 9a69724
Showing 292 changed files with 1,464,125 additions and 3,296 deletions.
2 changes: 2 additions & 0 deletions .env.example
@@ -3,3 +3,5 @@ DB_HOST="localhost"
DB_PORT=3306
DB_USER="user"
DB_PASS="password"
USER_ID=57
DEBUG=False
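
For reference, a minimal sketch of how these settings might be consumed, assuming the `python-dotenv` package (the same pattern `db.py` uses below); the `DEBUG` parsing shown is an illustrative convention, not something the repository prescribes:

```
import os

from dotenv import load_dotenv

load_dotenv()  # read key-value pairs from a local .env file into the environment

USER_ID = int(os.getenv("USER_ID"))  # e.g. 57: ID of the OWID user loading the data
DEBUG = os.getenv("DEBUG", "False") == "True"  # naive string-to-bool conversion
```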
70 changes: 58 additions & 12 deletions README.md
@@ -4,29 +4,75 @@ _Bulk import scripts for ingesting large external datasets into OWID's master da

## Overview

OWID keeps a master MySQL database of all known data sets. Whilst some are manually added by researchers using the [grapher](https://github.com/owid/owid-grapher) admin interface, the bulk of the data comes from importing large external datasets, which is the focus of this repository. Datasets are often updated versions of older datasets; new versions do not overwrite old data, but are instead stored under their own versioned `namespace`. This codebase also proposes which `grapher` charts should be updated to use data from a fresher dataset version.
OWID keeps a master MySQL database of the charts that appear on our website, as well as the datasets used to create these charts (see [owid-grapher](https://github.com/owid/owid-grapher)).

## Rough convention
The `importers` repository aids in the maintenance of this database by:

Each large dataset has its own folder `<dataset>/`, and inside that you can find:
**1. Importing datasets:** The folders in this repository contain scripts for uploading external datasets, such as the World Bank World Development Indicators, to the database at regular intervals. Only some of the datasets in our database are updated in this way. Most are instead manually added to the database by OWID researchers using the [grapher](https://github.com/owid/owid-grapher) admin interface.

- Scripts that transform it
- `input/`: an original(ish) copy
- `output/`: the transformed copy that will be uploaded
- `standardisation/`: any country name transformations that were needed
**2. Suggesting chart revisions:** Once a new version of a dataset has been uploaded to the database, the next task is to update the corresponding OWID charts to display the newly available data in place of the old. Because each dataset import creates a _new_ version of an existing dataset rather than overwriting the old one, the relevant charts must be amended to display the new data. The scripts in this repository _suggest_ these chart revisions, which are then manually approved or rejected by an OWID researcher using the [grapher](https://github.com/owid/owid-grapher) admin interface.

## Development

You should install Python 3.8+ and the required packages:
1. Install Python 3.8+ and the required packages:

```
pip install -r requirements.txt
```

Historical data transformations might have used Jupyter notebooks, but all recent ones use Python scripts.
2. Copy `.env.example` to `.env` and change any variables as needed. If you are unsure what to change, ask a member of the OWID data team.

## Updating data
```
cp .env.example .env
```

3. Follow the [setup instructions in the owid-grapher repository](https://github.com/owid/owid-grapher#initial-development-setup) to initialize a local version of the OWID MySQL database.

> Note: After following the setup instructions, you must initialize the `suggested_chart_revisions` MySQL table by switching to the [feature/admin-suggested-chart-revision-approver](https://github.com/owid/owid-grapher/tree/feature/admin-suggested-chart-revision-approver) branch of the owid-grapher repository and running `yarn buildTsc && yarn typeorm migration:run`.
### Folder structure

Each dataset has its own folder `{institution}_{dataset}/` (e.g. `worldbank_wdi/` for the World Bank World Development Indicators), containing all code and configuration files required to execute the dataset import and suggest the chart revisions.

Typical folder structure:

```
__init__.py # used for storing dataset constants (e.g. {DATASET_NAME}, {DATASET_VERSION})
main.py # executes all 6 steps in sequence
... # helper scripts (e.g. `download.py`, `clean.py`)
input/ # original copy of the dataset
output/ # the cleaned data to be imported + other *generated* files for steps 1-6
config/ # *manually* constructed files for steps 1-6
```

See [worldbank_wdi/](worldbank_wdi) for a recent example to follow.
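
To make the flow concrete, here is a hypothetical sketch of a `main.py` tying the steps together. The helper modules `download`, `clean`, and `match_variables` and their `main()` entry points are assumptions modeled on the `worldbank_wdi/` folder; `import_dataset.main()` is documented in [standard_importer/README.md](standard_importer/README.md):

```
# Hypothetical main.py for a {institution}_{dataset}/ folder.
from worldbank_wdi import download, clean, match_variables  # assumed helper modules
from standard_importer import import_dataset

DATASET_DIR = "worldbank_wdi"
DATASET_NAMESPACE = "worldbank_wdi@2021.05.25"

def main():
    download.main()         # step 1: download the data to input/
    clean.main()            # steps 2-3: select and clean variables, write CSVs to output/
    import_dataset.main(DATASET_DIR, DATASET_NAMESPACE)  # step 4: import into MySQL
    match_variables.main()  # step 5: write variable_replacements.json
    # step 6: suggest chart revisions (standard_importer/chart_revision_suggester.py)

if __name__ == "__main__":
    main()
```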

## Conventions to follow

Each `{institution}_{dataset}/` folder executes the same 6 steps to import a dataset and suggest chart revisions:

1. Download the data.
- Example: [worldbank_wdi/download.py](worldbank_wdi/download.py).

2. Specify which variables in the dataset are to be cleaned and imported into the database.
- This information is typically stored in a `variables_to_clean.json` file. Example: [worldbank_wdi/output/variables_to_clean.json](worldbank_wdi/output/variables_to_clean.json).
- For some datasets, it makes sense to generate `variables_to_clean.json` programmatically (as in [worldbank_wdi/init_variables_to_clean.py](worldbank_wdi/init_variables_to_clean.py)), in which case `variables_to_clean.json` should be stored in the `output/` sub-folder. For other datasets, it may make more sense for you to generate `variables_to_clean.json` manually, in which case it should be stored in the `config/` sub-folder.

3. Clean/transform/manipulate the selected variables prior to import.
- This step involves the construction of metadata for each variable (name, description, ...), as well as any required sanity checks on data observations, the removal or correction of problematic data observations, data transformations (e.g. per capita), et cetera.
- The cleaned variables must be saved in CSV files in preparation for import into MySQL. See [standard_importer/README.md](standard_importer/README.md) for the required CSV format.
- Example: [worldbank_wdi/clean.py](worldbank_wdi/clean.py).

> Note: This step generally requires the use of OWID's Country Standardizer Tool ([owid.cloud/admin/standardize](https://owid.cloud/admin/standardize), or [localhost:3030/admin/standardize](http://localhost:3030/admin/standardize) if you are running the grapher locally). Upload a list of all unique country/entity names in the dataset to the tool, then save the downloaded CSV file to `{DATASET_DIR}/config/standardized_entity_names.csv` (e.g. [worldbank_wdi/config/standardized_entity_names.csv](worldbank_wdi/config/standardized_entity_names.csv)) so that your cleaning script can use it to harmonize all entity names with OWID entity names.

4. Import the dataset into MySQL.
- The [standard_importer/import_dataset.py](standard_importer/import_dataset.py) module exists to implement this step for any dataset, as long as you have saved the cleaned variables from step 3 in the required CSV format.

5. For each variable in the new dataset, specify which variable in the old dataset is its equivalent.
- This information is typically stored in a `variable_replacements.json` file. Example: [worldbank_wdi/output/variable_replacements.json](worldbank_wdi/output/variable_replacements.json).
- For some datasets, it makes sense to generate `variable_replacements.json` programmatically (as in [worldbank_wdi/match_variables.py](worldbank_wdi/match_variables.py)), in which case `variable_replacements.json` should be stored in the `output/` sub-folder. For other datasets, it may make more sense for you to generate `variable_replacements.json` manually, in which case it should be stored in the `config/` sub-folder.

Depending on the dataset, it may be uploaded as new variables, or existing variables may have additional values appended. In both cases, existing charts do not automatically pick up the new data and may need their JSON configuration updated.
6. Suggest the chart revisions using the `oldVariable -> newVariable` key-value pairs from `variable_replacements.json` (a sketch of this file's format follows this list).
- The [standard_importer/chart_revision_suggester.py](standard_importer/chart_revision_suggester.py) module exists to implement this step for any dataset.
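
As referenced in step 6, here is a sketch of one plausible shape for `variable_replacements.json`, assuming it maps old variable IDs (keys) to new variable IDs (values). The IDs below are made up; [worldbank_wdi/output/variable_replacements.json](worldbank_wdi/output/variable_replacements.json) is the authoritative example:

```
{
    "2032": 145600,
    "2033": 145601
}
```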

Ask the Data Team for more info!
Historical data transformations might have used Jupyter notebooks, but all recent ones use Python or R scripts.
19 changes: 10 additions & 9 deletions db.py
@@ -5,12 +5,13 @@
load_dotenv()

# Connect to the database
connection = pymysql.connect(
    db=os.getenv("DB_NAME"),
    host=os.getenv("DB_HOST"),
    port=int(os.getenv("DB_PORT")),
    user=os.getenv("DB_USER"),
    password=os.getenv("DB_PASS"),
    charset="utf8mb4",
    autocommit=True
)
def get_connection() -> pymysql.Connection:
    return pymysql.connect(
        db=os.getenv("DB_NAME"),
        host=os.getenv("DB_HOST"),
        port=int(os.getenv("DB_PORT")),
        user=os.getenv("DB_USER"),
        password=os.getenv("DB_PASS"),
        charset="utf8mb4",
        autocommit=True
    )
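
A caller might then use it as follows (the query is illustrative only):

```
from db import get_connection

connection = get_connection()
try:
    with connection.cursor() as cursor:
        cursor.execute("SELECT id, name FROM datasets LIMIT 5")
        for row in cursor.fetchall():
            print(row)
finally:
    connection.close()  # each caller now owns, and must close, its connection
```

Wrapping the connection in a function, rather than opening it at import time as before, lets each script decide when to connect and when to close.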
2 changes: 2 additions & 0 deletions requirements.txt
@@ -5,10 +5,12 @@ beautifulsoup4==4.9.3
bs4==0.0.1
certifi==2020.12.5
chardet==4.0.0
et-xmlfile==1.1.0
idna==2.10
lxml==4.6.3
multidict==5.1.0
numpy==1.20.1
openpyxl==3.0.7
pandas==1.2.3
PyMySQL==1.0.2
python-dateutil==2.8.1
36 changes: 25 additions & 11 deletions standard_importer/README.md
@@ -1,43 +1,57 @@
# OWID standard importer

This is a standard importer that loads data into the OWID database. It can be run with `python3 -m standard_importer.import_dataset` after setting the right values at the top of the script:
Imports a cleaned dataset and associated data sources, variables, and data points into the MySQL database.

Example usage:

```
DATASET_DIR = "vdem" # Directory in this Git repo where data is located
USER_ID = 46 # ID of OWID user loading the data
from standard_importer import import_dataset
dataset_dir = "worldbank_wdi"
dataset_namespace = "worldbank_wdi@2021.05.25"
import_dataset.main(dataset_dir, dataset_namespace)
```

`import_dataset.main(...)` expects a set of CSV files to exist in `{DATASET_DIR}/output/` (e.g. `worldbank_wdi/output`):

- `distinct_countries_standardized.csv`
- `datasets.csv`
- `sources.csv`
- `variables.csv`
- `datapoints/data_points_{VARIABLE_ID}.csv` (one `data_points_{VARIABLE_ID}.csv` file for each variable in `variables.csv`)

## Expected format
## Expected format of CSV files

Inside the dataset directory (e.g. `vdem`), data must be located in an `output` directory, with the following structure:

(see [worldbank_wdi/output](../worldbank_wdi/output) for an example)


### Entities file
### Entities file (`distinct_countries_standardized.csv`)

This file lists all entities present in the data, so that new entities can be created if necessary. Located in `output/distinct_countries_standardized.csv`:

* `name`: name of the entity.


### Datasets file
### Datasets file (`datasets.csv`)

Located in `output/datasets.csv`:

* `id`: temporary dataset ID for loading process
* `name`: name of the Grapher dataset


### Sources file
### Sources file (`sources.csv`)

Located in `output/sources.csv`:

* `id`: temporary source ID for loading process
* `name`: name of the source
* `description`: JSON string with `dataPublishedBy` (string), `dataPublisherSource` (string), `link` (string), `retrievedDate` (string), `additionalInfo` (string)
* `dataset_id`: foreign key matching each source with a dataset ID
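
For instance, the `description` JSON string for a single source might look like this (all values illustrative):

```
{
    "dataPublishedBy": "World Bank",
    "dataPublisherSource": "World Development Indicators",
    "link": "https://datacatalog.worldbank.org/dataset/world-development-indicators",
    "retrievedDate": "25-May-2021",
    "additionalInfo": "Hypothetical notes about the source."
}
```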


### Variables file
### Variables file (`variables.csv`)

Located in `output/variables.csv`:

@@ -55,11 +69,11 @@ Located in `output/variables.csv`:
* `original_metadata`: JSON object representing original uncleaned metadata from the data source


### Datapoint files
### Datapoint files (`datapoints/data_points_{VARIABLE_ID}.csv`)

Located in `output/datapoints/datapoints_NNN.csv`:
Located in `output/datapoints/data_points_{VARIABLE_ID}.csv`:

* `NNN` in the file name is a foreign key matching values with a variable ID
* `{VARIABLE_ID}` in the file name is a foreign key matching values with a temporary variable ID in `variables.csv`
* `country`: location of the observation
* `year`: year of the observation
* `value`: value of the observation
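
For example, a hypothetical `output/datapoints/data_points_104.csv` (for temporary variable ID 104) might contain:

```
country,year,value
Afghanistan,2019,71.4
Albania,2019,98.1
```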