-
-
Notifications
You must be signed in to change notification settings - Fork 8
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add chart revision suggester and enhance World Bank WDI bulk import #5
Conversation
Note: "wdi" = "World Development Indicators"
Fixes a bug where the dataset namespace was constructed from the whole current working directory (e.g. "Users/.../importers/worldbank_wdi") instead of just "worldbank_wdi".
Removes config and output files such as "variable_replacemennts.json" and "charts_to_update.json" that contain hard-coded SQL db ids. The problem with hard-coding these db ids is that the files may havve been constructed from a SQL db instance that was not up to date with the production db.
…dard_importer` + other refactoring - Moves suggested revision upsert code to `standard_importer.chart_revision_suggester` - Refactors `worldbank_wdi` folder to store all generated json/csv files (e.g. `variable_replacements.json) in the `output` folder instead of `config` folder, while storing manually constructed json/csv files (e.g. `standardized_entity_names.csv`) in the`config` folder. - Refactors `db` to return a `get_connection` method instead of an active SQL connection, so that it is easier to create and close multiple connections in a single module as needed.
Fixes errors raised in `worldbank_wdi/init_variables.py` and `worldbank_wdi/match_variables.py` when there are multiple versions of an old variable.
Removes "ON DUPLICATE KEY UPDATE..." from ChartRevisionSuggester.upsert b/c it updates suggested chart revisions that may have already been approved/rejected, which is undesired behavior.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice work! Didn't find any actual errors, but gave a bunch of stylistic feedback to suggest ways the Python and Pandas could be more idiomatic. Hope it's helpful!
1. Download the data. | ||
- Example: [worldbank_wdi/download.py](worldbank_wdi/download.py). | ||
|
||
2. Specify which variables in the dataset are to be cleaned and imported into the database. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry I haven't read this in detail, so totally uninformed question: Does this mean we are not importing all variables available in the WDI dataset? And if yes, is it possible to easily import others later?
Asking because as I understand, authors can decide to use any WDI variable at any point, it's likely not the case that the ones in the database are the only ones we want to use.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's correct – the variables stored in variables_to_clean.json
at the end of step 2 are a subset of all variables in the dataset.
In the case of the WDI bulk import in this PR, I've written init_variables_to_clean.py
(which constructs variables_to_clean.json
) so as to only keep the WDI variables that have been used in at least 1 chart.
My thinking is that this is probably a good rule of thumb for effectively keeping db clutter to a minimum while still providing authors with >90% of the variables they are ever going to use. But more discussion is certainly needed here, and it would be easy enough to alter init_variables_to_clean.py
to include more variables.
Thanks everyone! I'll make the requested changes before the end of the week. |
Features and enhancements
ChartRevisionSuggester
class for suggesting chart revisions after a dataset bulk import has been executed.Breaking changes
standard_importer/import_dataset.py
: now intended for use as an imported module, rather than as a standalone script to be executed from the command line. Also includes minor changes to expected column names in input CSV files..env.example
: aUSER_ID
must now be specified in your.env
file.db.py
: previous usage for connecting to MySQL:New usage: