WBqueryR

A "brick and mortar" R-package to query the World Bank Microdata Library.

TL;DR

The R-package WBqueryR makes it easy to query the World Bank's Microdata Library for variables from within R. Its main function WBqueryR::WBquery() takes user-defined search parameters and a list of keywords as input, downloads codebooks that meet the search criteria, and queries the variable labels in them for the presence of the keywords.

Background

Researchers and practitioners from diverse disciplinary backgrounds (Economics, Public Policy, Development Studies, and Sociology, to name just a few) rely on high-quality data at the firm, household, and individual level for their work. Typically, these data are collected through survey instruments by national or international organizations, statistical services, or private enterprises. The World Bank Microdata Library is a large and popular repository for this kind of data. At the time of writing, it holds 3864 datasets.

Each of these datasets may include thousands of individual variables, all of which are described in the respective survey's codebook. Browsing through these codebooks one by one to find what you are looking for can consume a lot of time. Automated querying has the potential to greatly reduce this burden and enable researchers to spend more of their time thinking about the big questions they are trying to answer. The World Bank provides a dedicated API for this purpose. WBquery acts as a wrapper for this API and implements a simple work routine for R users to query the Microdata Library for entries of interest.

Note that via the API, users can query for collections, datasets, and specific variables that match, or exactly match, the search parameters provided. Because the labelling of variables varies widely from survey to survey, exact matches return only a subset of the entries relevant to the researcher. WBqueryR addresses this problem by implementing a vector-space-model (VSM) based search engine that scores variable labels for the presence of the key words provided by the user.

Usage

⚠️ This project is in its very early development stage: expect bugs, performance issues, crashes or data loss!

How does it work?

For now, users of the WBqueryR package can only engage with one function: WBqueryR::WBquery(). It implements the following three steps:

  1. Gather codebooks for (a subset of) all datasets listed in the Microdata Library, using the World Bank's API.

  2. Score the variable labels (descriptions) in these codebooks for the presence of key words provided.

  3. Return a list of tibbles for variables with a matching score above the accuracy threshold.

Example

To illustrate the use of WBqueryR::WBquery(), consider the following example. Say you are interested in obtaining data on total consumption and expenditure for households in Nigeria, South Africa, or Vietnam. You are only interested in data that was collected between 2000 and 2019, and which is either in the public domain or else open access. Lastly, you want the results to match your key words by at least 60%. The example below queries the Microdata Library to find data that suits your needs:

library(WBqueryR)

my_example <- WBquery(
    key = c("total consumption", "total expenditure"), # enter your keywords
    
    from = 2000,                            # lower time limit
    to = 2019,                              # upper time limit
    country = c("nigeria", "zaf", "vnm"),   # specify countries
    collection = c("lsms"),                 # only look for lsms data
    access = c("open", "public"),           # specify access
    accuracy = 0.6                          # set accuracy to 60%
    )
#> gathering codebooks...
#> scoring for key word 1/2 (total consumption)...
#> scoring for key word 2/2 (total expenditure)...
#> Search complete! Print results in the console? (type y for YES, n for NO):

When WBqueryR::WBquery() has completed the search, the user is prompted to decide whether a summary of the search results should be printed in the console. Typing y into the console prints the summary, whereas typing n does not. Let us see what the summary looks like for the example above:

#> Search complete! Print results in the console? (type y for YES, n for NO):y
   
#> 5 result(s) for --total consumption-- in 2 library item(s):
#>    NGA_2018_GHSP-W4_v03_M
#>        s7bq2b (CONSUMPTION UNIT) - 67% match
#>        s7bq2b_os (OTHER CONSUMPTION UNIT) - 63% match
#>        s7bq2c (CONSUMPTION SIZE) - 61% match
#>    
#>    NGA_2015_GHSP-W3_v02_M
#>        totcons (Total consumption per capita) - 61% match
#>        totcons (Total consumption per capita) - 61% match
#>    
#>   
#> 3 result(s) for --total expenditure-- in 2 library item(s):
#>    NGA_2012_GHSP-W2_v02_M
#>        s2q19i (AGGREGATE EXPENDITURE) - 60% match
#>    
#>    NGA_2010_GHSP-W1_v03_M
#>        s2aq23i (aggregate expenditure) - 61% match
#>        s2bq14i (aggregate expenditure) - 61% match

Note that no matter whether you choose to display the summary or not, all the information necessary to find the data later has been assigned to the new R-object my_example in your environment. This object is a list of 2 items - one for each key word - and each of these items includes tibbles of varying sizes that correspond to the datasets in the Microdata Library for which results have been found. Every tibble includes information on the matched variables: their name, label, and matching score. Type the code below to inspect the structure of my_example in R.

str(my_example)
#>List of 2
#> $ total consumption:List of 2
#>  ..$ NGA_2018_GHSP-W4_v03_M: tibble [3 × 3] (S3: tbl_df/tbl/data.frame)
#>  .. ..$ doc  : chr [1:3] "s7bq2b" "s7bq2b_os" "s7bq2c"
#>  .. ..$ score: num [1:3, 1] 0.675 0.629 0.613
#>  .. .. ..- attr(*, "dimnames")=List of 2
#>  .. .. .. ..$ : chr [1:3] "s7bq2b" "s7bq2b_os" "s7bq2c"
#>  .. .. .. ..$ : NULL
#>  .. ..$ text :List of 3
#>  .. .. ..$ s7bq2b   : chr "CONSUMPTION UNIT"
#>  .. .. ..$ s7bq2b_os: chr "OTHER CONSUMPTION UNIT"
#>  .. .. ..$ s7bq2c   : chr "CONSUMPTION SIZE"
#>  ..$ NGA_2015_GHSP-W3_v02_M: tibble [2 × 3] (S3: tbl_df/tbl/data.frame)
#>  .. ..$ doc  : chr [1:2] "totcons" "totcons"
#>  .. ..$ score: num [1:2, 1] 0.606 0.606
#>  .. .. ..- attr(*, "dimnames")=List of 2
#>  .. .. .. ..$ : chr [1:2] "totcons" "totcons"
#>  .. .. .. ..$ : NULL
#>  .. ..$ text :List of 2
#>  .. .. ..$ totcons: chr "Total consumption per capita"
#>  .. .. ..$ totcons: chr "Total consumption per capita"
#> $ total expenditure:List of 2
#>  ..$ NGA_2012_GHSP-W2_v02_M: tibble [1 × 3] (S3: tbl_df/tbl/data.frame)
#>  .. ..$ doc  : chr "s2q19i"
#>  .. ..$ score: num [1, 1] 0.601
#>  .. .. ..- attr(*, "dimnames")=List of 2
#>  .. .. .. ..$ : chr "s2q19i"
#>  .. .. .. ..$ : NULL
#>  .. ..$ text :List of 1
#>  .. .. ..$ s2q19i: chr "AGGREGATE EXPENDITURE"
#>  ..$ NGA_2010_GHSP-W1_v03_M: tibble [2 × 3] (S3: tbl_df/tbl/data.frame)
#>  .. ..$ doc  : chr [1:2] "s2aq23i" "s2bq14i"
#>  .. ..$ score: num [1:2, 1] 0.61 0.61
#>  .. .. ..- attr(*, "dimnames")=List of 2
#>  .. .. .. ..$ : chr [1:2] "s2aq23i" "s2bq14i"
#>  .. .. .. ..$ : NULL
#>  .. ..$ text :List of 2
#>  .. .. ..$ s2aq23i: chr "aggregate expenditure"
#>  .. .. ..$ s2bq14i: chr "aggregate expenditure"
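
Since the result is a nested list (key word, then dataset, then table), a single table can be reached with double brackets, e.g. my_example[["total consumption"]][["NGA_2015_GHSP-W3_v02_M"]]. For further work it can be handy to flatten everything into one table. The sketch below shows one way this might be done. It rebuilds a small part of the structure by hand, using plain data frames instead of tibbles, so that it runs without the package or an internet connection. The names mock_result and flatten_results() are made up for this illustration and are not part of WBqueryR.

```r
# Hand-built stand-in for part of `my_example` (plain data frames, not tibbles).
mock_result <- list(
  "total consumption" = list(
    "NGA_2015_GHSP-W3_v02_M" = data.frame(
      doc   = "totcons",
      score = 0.606,
      text  = "Total consumption per capita"
    )
  ),
  "total expenditure" = list(
    "NGA_2010_GHSP-W1_v03_M" = data.frame(
      doc   = c("s2aq23i", "s2bq14i"),
      score = c(0.61, 0.61),
      text  = c("aggregate expenditure", "aggregate expenditure")
    )
  )
)

# Hypothetical helper: one row per matched variable, tagged with the key word
# and the dataset ID it was found in.
flatten_results <- function(res) {
  rows <- list()
  for (key in names(res)) {
    for (dataset in names(res[[key]])) {
      tbl <- res[[key]][[dataset]]
      rows[[length(rows) + 1]] <- data.frame(
        key      = key,
        dataset  = dataset,
        variable = as.character(tbl$doc),
        label    = as.character(unlist(tbl$text)),  # `text` may be a list column
        score    = as.numeric(tbl$score),           # `score` may be a 1-column matrix
        stringsAsFactors = FALSE
      )
    }
  }
  do.call(rbind, rows)
}

flat <- flatten_results(mock_result)
print(flat)
```

The unlist() and as.numeric() calls are there because, in the str() output above, text is a list and score is a one-column matrix.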

Parameters for WBqueryR::WBquery()

WBqueryR::WBquery() takes nine parameters. One of them is called key and is required for the function to work as it includes the key words for step 2. above. There are also eight optional parameters. If left blank, they either remain undefined or default to internally defined values. Five of these optional values are used in step 1. above to limit the scope of codebooks gathered by time, country, collection, and access type. Two others, sort_by and sort_order are merely used to sort the codebooks before step 2. The last optional parameter is accuracy and defines a matching score threshold for step 2. such that matches with scores below it are discarded and only those matches with scores at or above the threshold are included in the final results. If unspecified by the user, this threshold defaults to accuracy = 0.5, which essentially means that a variable's label needs to match the key word at least by 50% for the variable to be included in the results.
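
To make the role of accuracy concrete, here is a minimal sketch of the thresholding rule described above, applied to a few scores copied from the example output. The helper keep_matches() is hypothetical and only mimics the rule "keep scores at or above the threshold"; it is not the package's internal code.

```r
# Scores copied from the example output above, for illustration.
scores <- c(s7bq2b = 0.675, s7bq2b_os = 0.629, s7bq2c = 0.613, totcons = 0.606)

# Hypothetical helper mimicking the rule described above:
# keep only matches whose score is at or above `accuracy`.
keep_matches <- function(scores, accuracy = 0.5) {
  stopifnot(accuracy >= 0, accuracy <= 1)
  scores[scores >= accuracy]
}

names(keep_matches(scores))                   # default threshold keeps all four
names(keep_matches(scores, accuracy = 0.62))  # stricter threshold keeps two
```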

In summary, the parameters for WBqueryR::WBquery() are:

| Parameter | Syntax example | Default | Type | Description |
| --- | --- | --- | --- | --- |
| key | key = c(...) | none | required | a character (string) vector of key words, separated by commas |
| from | from = #### | none | optional | an integer indicating the minimum year of data collection |
| to | to = #### | none | optional | an integer indicating the maximum year of data collection |
| country | country = c(...) | none | optional | a character (string) vector of country name(s) or iso3 codes, or a mix of both; separated by commas |
| collection | collection = c(...) | "lsms" | optional | a character (string) vector including one or more of the Microdata Library collection identifiers listed below, separated by commas |
| access | access = c(...) | none | optional | a character (string) vector indicating the desired type(s) of access rights; one or more of: "open", "public", "direct", "remote", "licensed"; separated by commas |
| sort_by | sort_by = "..." | none | optional | a character (string) indicating one of: "rank", "title", "nation", "year" |
| sort_order | sort_order = "..." | none | optional | a character (string) indicating ascending or descending sort order; one of: "asc", "desc" |
| accuracy | accuracy = # | 0.5 | optional | a real number between 0 and 1, indicating the desired level of scoring accuracy |

Collection identifiers accepted by collection:

| String | Collection title | # of datasets |
| --- | --- | --- |
| "afrobarometer" | Afrobarometer | 32 |
| "datafirst" | DataFirst, University of Cape Town, South Africa | 257 |
| "dime" | Development Impact Evaluation (DIME) | 35 |
| "microdata_rg" | Development Research Microdata | 59 |
| "enterprise_surveys" | Enterprise Surveys | 566 |
| "FCV" | Fragility, Conflict and Violence | 1011 |
| "global-findex" | Global Financial Inclusion (Global Findex) Database | 436 |
| "ghdx" | Global Health Data Exchange (GHDx), Institute for Health Metrics and Evaluation (IHME) | 20 |
| "hfps" | High-Frequency Phone Surveys | 58 |
| "impact_evaluation" | Impact Evaluation Surveys | 198 |
| "ipums" | Integrated Public Use Microdata Series (IPUMS) | 431 |
| "lsms" | Living Standards Measurement Study (LSMS) | 151 |
| "dhs" | MEASURE DHS: Demographic and Health Surveys | 362 |
| "mrs" | Migration and Remittances Surveys | 9 |
| "MCC" | Millennium Challenge Corporation (MCC) | 39 |
| "pets" | Service Delivery Facility Surveys | 13 |
| "sdi" | Service Delivery Indicators | 19 |
| "step" | The STEP Skills Measurement Program | 22 |
| "sief" | The Strategic Impact Evaluation Fund (SIEF) | 38 |
| "COS" | The World Bank Group Country Opinion Survey Program (COS) | 343 |
| "MICS" | UNICEF Multiple Indicator Cluster Surveys (MICS) | 221 |
| "unhcr" | United Nations Refugee Agency (UNHCR) | 269 |
| "WHO" | WHO’s Multi-Country Studies Programmes | 72 |

Installation

As I wrote WBqueryR primarily with myself and colleagues in mind, and because it is the first R-package I have ever written, I have no aspiration to get it onto CRAN. Built by an amateur, it might very well "act out" and throw all kinds of errors and bugs at you. If you want to give it a try nonetheless, please use the code snippet below to install it from this GitHub repo. YOU HAVE BEEN WARNED 😉

# first check if devtools is installed, if not install...
if (!require("devtools", character.only = TRUE)) {
    install.packages("devtools", dependencies = TRUE)
    }

# now install WBqueryR from github...
devtools::install_github("mathiasweidinger/WBqueryR")
#> Downloading GitHub repo mathiasweidinger/WBqueryR@HEAD
#> rlang (1.0.2 -> 1.0.3) [CRAN]
#> Installing 1 packages: rlang
#> Installing package into 'C:/Users/mweidinger/AppData/Local/R/win-library/4.2'
#> (as 'lib' is unspecified)
#> 
#> The downloaded binary packages are in
#>  C:\Users\mweidinger\AppData\Local\Temp\RtmpeaDfjL\downloaded_packages
#> ✔  checking for file 'C:\Users\mweidinger\AppData\Local\Temp\RtmpeaDfjL\remotes460066734547\mathiasweidinger-WBqueryR-1c8ad39/DESCRIPTION'
#> ─  preparing 'WBqueryR':
#> ✔  checking DESCRIPTION meta-information
#> ─  checking for LF line-endings in source and make files and shell scripts
#> ─  checking for empty or unneeded directories
#> ─  building 'WBqueryR_0.0.0.9000.tar.gz'
#>
#> Installing package into 'C:/Users/mweidinger/AppData/Local/R/win-library/4.2'
#> (as 'lib' is unspecified)

Details

WBqueryR::WBquery() internally calls the helper function vsm_score() to score the labels from the codebooks for the presence of the user-defined key words in the parameter key. vsm_score() is a custom-built function that implements a simple vector-space-model. It is broadly based on the excellent online tutorials by Fernando Torres H., Ben Ogorek, and Suresh Gorakala.

VSM Basics

To quote Wikipedia,

Vector space model or term vector model is an algebraic model for representing text documents (and any objects, in general) as vectors of identifiers (such as index terms). It is used in information filtering, information retrieval, indexing and relevancy rankings.

The vector space model procedure can be divided in to three stages. The first stage is the document indexing where content bearing terms are extracted from the document text. The second stage is the weighting of the indexed terms to enhance retrieval of document relevant to the user. The last stage ranks the document with respect to the query according to a similarity measure.

In the use case of WBqueryR, each variable label $j$ in the codebooks gathered from the Microdata Library is represented by a vector $d_j$, and the user-defined key word by another vector $q$:

$$ d_j = ( w_{1,j} ,w_{2,j} , \dotsc ,w_{t,j} ) $$

$$ q = ( w_{1,q} ,w_{2,q} , \dotsc ,w_{t,q} ) $$

Each dimension corresponds to a separate term. If a term occurs in the document, its value in the vector is non-zero. The definition of term depends on the application. In the present use case, terms are either single words, keywords, or longer phrases. If words are chosen to be the terms, the dimensionality of the vector is the number of words in the vocabulary (the number of distinct words occurring in the corpus).

Vector operations can be used to compare documents with queries. In the case of vsm_score, every variable label receives a matching-score $m$ in the unit-interval,

$$ m \in \mathbb{R} \mid 0 \leq m \leq 1, $$

indicating how well the label fits the keywords. A score of $m=1$ means that the label exactly matches the keyword. As described above, the user can specify a threshold below which results are discarded. By default, WBqueryR::WBquery() only returns variables with $m\geq 0.5$ in its results.
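
This README does not spell out which similarity measure vsm_score() uses. A common choice in vector space models, given here purely as an illustration, is the cosine of the angle between the two vectors. Assuming both vectors are indexed over the same $t$ terms and all weights are non-negative, it lies in the unit interval:

$$ m_j = \frac{d_j \cdot q}{\lVert d_j \rVert \, \lVert q \rVert} = \frac{\sum_{i=1}^{t} w_{i,j} \, w_{i,q}}{\sqrt{\sum_{i=1}^{t} w_{i,j}^2} \, \sqrt{\sum_{i=1}^{t} w_{i,q}^2}} $$

The base-R sketch below computes this measure with raw word counts as weights. The functions tokenize() and cosine_score() are hypothetical names for this illustration. They are a stand-in, not the actual vsm_score() implementation, which may weight terms differently, so the scores need not agree with those printed in the example output.

```r
# Split a string into lower-case word tokens, dropping empty strings.
tokenize <- function(x) {
  words <- strsplit(tolower(x), "[^a-z0-9]+")[[1]]
  words[nzchar(words)]
}

# Cosine similarity between the raw term-count vectors of a label and a query.
# NOTE: illustrative stand-in only, not the package's vsm_score().
cosine_score <- function(label, query) {
  label_words <- tokenize(label)
  query_words <- tokenize(query)
  vocab <- union(label_words, query_words)   # shared vocabulary = vector dimensions
  d <- as.numeric(table(factor(label_words, levels = vocab)))
  q <- as.numeric(table(factor(query_words, levels = vocab)))
  if (sum(d) == 0 || sum(q) == 0) return(0)  # guard against empty strings
  sum(d * q) / (sqrt(sum(d^2)) * sqrt(sum(q^2)))
}

cosine_score("Total consumption per capita", "total consumption")  # 1 / sqrt(2), about 0.71
cosine_score("total consumption", "Total Consumption")             # 1, up to rounding
cosine_score("AGGREGATE EXPENDITURE", "total consumption")         # 0, no shared terms
```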

Limitations (and their relevance)

Borrowing once more from Wikipedia, the commonly acknowledged limitations of VSM include:

  1. Long documents are poorly represented because they have poor similarity values (a small scalar product and a large dimensionality)
  2. Search keywords must precisely match document terms; word substrings might result in a "false positive match"
  3. Semantic sensitivity; documents with similar context but different term vocabulary won't be associated, resulting in a "false negative match".
  4. The order in which the terms appear in the document is lost in the vector space representation.
  5. Theoretically assumes terms are statistically independent.
  6. Weighting is intuitive but not very formal.

Luckily, some of these limitations do not immediately cause issues for WBqueryR. Limitation 1 is rather insignificant since variable labels rarely exceed a single sentence. As for limitation 2, I suspect that most users would use full words, not substrings thereof, to query for variable names. Limitation 3 might cause problems for WBqueryR. It is perhaps wise to include closely related words to ensure that as many relevant variables as possible are found and retrieved. For example, when looking for variables on expenditure, one might add the closely related term "expenses" to make sure variables containing it in their description are scored accurately:

WBqueryR::WBquery(key = c("expenditure", "expenses"))
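
Why the second key word helps can be seen in a small base-R sketch. The two labels below are invented for illustration, and shares_term() is a hypothetical helper that only checks for whole-word overlap. It is a much cruder test than the package's scoring, but it shows the same "false negative" effect.

```r
# Hypothetical variable labels, for illustration only.
labels <- c(v1 = "household expenditure on food",
            v2 = "monthly expenses for food")

# TRUE if the label shares at least one whole word with the key word.
shares_term <- function(label, key) {
  words <- function(x) strsplit(tolower(x), "[^a-z0-9]+")[[1]]
  length(intersect(words(label), words(key))) > 0
}

# Rows are labels, columns are key words.
hits <- sapply(c("expenditure", "expenses"), function(key) {
  vapply(labels, shares_term, logical(1), key = key)
})
print(hits)
```

With "expenditure" alone, only v1 is found. The label of v2 shares no term with it, even though it describes a closely related concept. Adding "expenses" picks it up.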

Limitation 4 is not an issue at all; in fact, the order in which the terms appear in the variable labels should not, ex ante, affect the value of the matching score assigned by vsm_score(). The flexibility of matching key words with labels that are not exactly in the same order was the primary reason for choosing the VSM framework over simply pattern-matching strings with one another (e.g. using grep() in base-R).
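
The difference from plain pattern matching can be illustrated in base R. The label below is invented, and the bag-of-words check is a simplified stand-in for vector-space scoring rather than the package's own code.

```r
label <- "consumption, total (per capita)"  # hypothetical label, reversed word order
key   <- "total consumption"

# Literal pattern matching fails because the words appear in a different order.
grepl(key, label, ignore.case = TRUE)  # FALSE

# Split a string into lower-case word tokens, dropping empty strings.
words <- function(x) {
  w <- strsplit(tolower(x), "[^a-z0-9]+")[[1]]
  w[nzchar(w)]
}

# A bag-of-words comparison ignores order: every term of the key word
# occurs somewhere in the label.
all(words(key) %in% words(label))  # TRUE
```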

Lastly, limitations 5 and 6 are acknowledged. There are more computationally intensive methods that would have resolved these concerns. However, considering how simple the task at hand really is, implementing them for the sake of querying variable labels seemed a little excessive to me.

Development

This section draws up directions for future developments of WBqueryR in terms of features and usability. If you have ideas or constructive criticism, feel free to submit a feature request. From there, I will sporadically update this list.

Datascraping

In its current form, WBquery() only yields a summary of where (specifically, in which datasets) the data searched by the user can be found. To me, the most obvious useful addition to WBqueryR would be to give it the capability to scrape the Microdata Library and download the actual datasets onto the user's system. I might work away on this over the summer of 2022; no promises quite yet, though!

Disclaimer

Copyright

Mathias Weidinger, 2022.

Licensing

WBqueryR is licensed under version 3 of the GNU General Public License (GPLv3). To learn what that means for you (the user), please refer to the license file; or you can find a quick guide to GPLv3 here.

Copying Permission

This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program. If not, see https://www.gnu.org/licenses/.

Notes

With rdhs, there exists a much more comprehensive package to download and process data from DHS in R.
