Add chart revision suggester and enhance World Bank WDI bulk import #5

Merged
merged 18 commits into from
Jun 18, 2021

Conversation

Contributor

@bnjmacdonald bnjmacdonald commented Jun 10, 2021

Features and enhancements

  • Adds a ChartRevisionSuggester class for suggesting chart revisions after a dataset bulk import has been executed.
  • Refactors and enhances old World Bank World Development Indicator (WDI) code and adds functionality for suggesting chart revisions following the bulk dataset import.
  • Adds bulk import CSV files for World Bank World Development Indicator (WDI) 2021.05.25 version.

Breaking changes

  • standard_importer/import_dataset.py: now intended for use as an imported module, rather than as a standalone script to be executed from the command line. Also includes minor changes to expected column names in input CSV files.
  • .env.example: a USER_ID must now be specified in your .env file.
  • db.py: previous usage for connecting to MySQL:
from db import connection

New usage:

from db import get_connection
connection = get_connection()
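
A minimal sketch of why a connection factory can be easier to work with than a shared module-level connection. This is not the actual `db.py`: it uses the standard-library `sqlite3` as a stand-in for MySQL so that it runs anywhere, and the `database` parameter, the `count_rows` helper and the `variables` table are hypothetical.

```python
import sqlite3
from contextlib import closing


def get_connection(database: str = ":memory:") -> sqlite3.Connection:
    """Hypothetical stand-in for db.get_connection: returns a NEW connection per call."""
    return sqlite3.connect(database)


def count_rows() -> int:
    """Open a connection, do some work, and close the connection again."""
    # closing() guarantees the connection and cursor are closed when the block exits.
    with closing(get_connection()) as connection:
        with closing(connection.cursor()) as cursor:
            cursor.execute("CREATE TABLE variables (id INTEGER PRIMARY KEY, name TEXT)")
            cursor.execute("INSERT INTO variables (name) VALUES ('GDP per capita')")
            cursor.execute("SELECT COUNT(*) FROM variables")
            return cursor.fetchone()[0]
```

Because every call returns a fresh connection, each function can own and close its connection instead of depending on one that was opened at import time.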

Note: "wdi" = "World Development Indicators"
Fixes a bug where the dataset namespace was
constructed from the whole current working directory
(e.g. "Users/.../importers/worldbank_wdi") instead of
just "worldbank_wdi".
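
A rough illustration of the namespace fix described above, assuming the namespace is meant to be the name of the importer folder. The function name `dataset_namespace` is hypothetical and the real code may derive the value differently.

```python
from pathlib import Path


def dataset_namespace(importer_dir: str) -> str:
    """Return only the final path component, e.g. 'worldbank_wdi'."""
    # Using the whole directory string would give something like
    # "/Users/.../importers/worldbank_wdi"; .name keeps the last component only.
    return Path(importer_dir).name
```

For example, `dataset_namespace("/Users/someone/importers/worldbank_wdi")` gives `"worldbank_wdi"`.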
Removes config and output files such as
"variable_replacements.json" and
"charts_to_update.json" that contain hard-coded SQL
db ids. The problem with hard-coding these db ids is
that the files may have been constructed from a SQL
db instance that was not up to date with the
production db.
…dard_importer` + other refactoring

- Moves suggested revision upsert code to
`standard_importer.chart_revision_suggester`
- Refactors `worldbank_wdi` folder to store all
generated json/csv files (e.g. `variable_replacements.json`) in the
`output` folder instead of `config` folder, while storing manually
constructed json/csv files (e.g. `standardized_entity_names.csv`) in
the `config` folder.
- Refactors `db` to expose a `get_connection` function instead of
an active SQL connection, so that it is easier to create and close
multiple connections in a single module as needed.
Fixes errors raised in `worldbank_wdi/init_variables.py` and
`worldbank_wdi/match_variables.py` when there are multiple versions of
an old variable.
Removes "ON DUPLICATE KEY UPDATE..." from ChartRevisionSuggester.upsert b/c it
updates suggested chart revisions that may
have already been approved/rejected, which
is undesired behavior.
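
A small sketch of the behavior this change is after: re-running the insert must not overwrite a suggestion that has already been reviewed. The commit message only says that "ON DUPLICATE KEY UPDATE..." was removed; it does not say what the insert looks like now. The example below therefore uses SQLite's `INSERT OR IGNORE` (via the standard-library `sqlite3`) purely as a stand-in, and the table name, column names and helper functions are all hypothetical.

```python
import sqlite3
from typing import Optional


def make_table() -> sqlite3.Connection:
    """Create an in-memory table standing in for the suggested revisions table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE suggested_revisions ("
        "chart_id INTEGER PRIMARY KEY, status TEXT NOT NULL)"
    )
    return conn


def insert_if_absent(conn: sqlite3.Connection, chart_id: int, status: str = "pending") -> None:
    """Insert a suggestion, leaving any existing row for chart_id untouched."""
    conn.execute(
        "INSERT OR IGNORE INTO suggested_revisions (chart_id, status) VALUES (?, ?)",
        (chart_id, status),
    )


def status_of(conn: sqlite3.Connection, chart_id: int) -> Optional[str]:
    """Return the stored status for chart_id, or None if there is no row."""
    row = conn.execute(
        "SELECT status FROM suggested_revisions WHERE chart_id = ?", (chart_id,)
    ).fetchone()
    return row[0] if row else None
```

With this shape, a suggestion that a reviewer has since marked as approved keeps that status even if the suggester is run again for the same chart.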
Contributor

@larsyencken larsyencken left a comment

Nice work! Didn't find any actual errors, but gave a bunch of stylistic feedback to suggest ways the Python and Pandas could be more idiomatic. Hope it's helpful!

README.md
1. Download the data.
- Example: [worldbank_wdi/download.py](worldbank_wdi/download.py).

2. Specify which variables in the dataset are to be cleaned and imported into the database.
Contributor

Sorry I haven't read this in detail, so totally uninformed question: Does this mean we are not importing all variables available in the WDI dataset? And if yes, is it possible to easily import others later?

Asking because as I understand, authors can decide to use any WDI variable at any point, it's likely not the case that the ones in the database are the only ones we want to use.

Contributor Author

That's correct – the variables stored in variables_to_clean.json at the end of step 2 are a subset of all variables in the dataset.

In the case of the WDI bulk import in this PR, I've written init_variables_to_clean.py (which constructs variables_to_clean.json) so as to only keep the WDI variables that have been used in at least 1 chart.

My thinking is that this is probably a good rule of thumb for effectively keeping db clutter to a minimum while still providing authors with >90% of the variables they are ever going to use. But more discussion is certainly needed here, and it would be easy enough to alter init_variables_to_clean.py to include more variables.
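
To make the rule of thumb concrete, here is a minimal sketch of the "used in at least 1 chart" filter. It is not the actual contents of init_variables_to_clean.py: the function name and arguments are hypothetical, and the indicator codes are only illustrative inputs.

```python
from typing import Iterable, List


def variables_used_in_charts(
    variable_codes: Iterable[str], chart_variable_codes: Iterable[str]
) -> List[str]:
    """Keep only the variables that appear in at least one chart, preserving order."""
    used = set(chart_variable_codes)  # set membership checks are O(1)
    return [code for code in variable_codes if code in used]


# Illustrative inputs: all variables in the dataset, and the variables referenced by charts.
all_codes = ["NY.GDP.PCAP.CD", "SP.POP.TOTL", "EN.ATM.CO2E.PC"]
codes_in_charts = ["SP.POP.TOTL", "NY.GDP.PCAP.CD", "SP.POP.TOTL"]
variables_to_clean = variables_used_in_charts(all_codes, codes_in_charts)
```

Including more variables later would then only mean widening the second argument, or skipping the filter entirely.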

@bnjmacdonald
Contributor Author

Thanks everyone! I'll make the requested changes before the end of the week.

@bnjmacdonald bnjmacdonald merged commit 9a69724 into master Jun 18, 2021
@bnjmacdonald bnjmacdonald deleted the feature/wdi-bulk-import branch June 18, 2021 19:29

4 participants