Skip to content

notshi/dquery

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 
 
 
 
 

Repository files navigation

🎉 Introduction

http://d-portal.org/dquery/

Use SQL to search the complete IATI data, including non-standard attributes and extensions directly into the live d-portal database, in a modern browser.

+ Documentation is ongoing so please bear with us.

If you have a data need or question that could benefit from a query but don't know how, do raise an issue and we can work it out together.

dQuery works well if you are familiar with the IATI Standard and querying in SQL using JSONB data types in PostgreSQL. However, it shouldn't be too hard to pick up once you've done a few recipes.

dQuery is an internal tool created by Wetgenes to aid the development of d-portal but publicly available as a courtesy.

Contents

📌 Getting started

The current interface is sparse and consists of two panels; the editor on the left and the console on the right.

The numbers you see on the side of the editor are line numbers and are purely there as visual aid.
The console displays results of your query when you click the Run button.

There are five buttons above these panels which perform various things.

Examples
Some example recipes.

Run (optional)
View results in the browser. ( Keyboard Shortcut Ctrl + Enter )
Please limit your results! If queries are complex, this can slow down the browser.

Explain
Inserts the SQL keyword, "explain" in front of your query.
Tells you how the query will run instead of running it.
Can be used to optimise queries.

View on d-portal
View the results of the query on the d-portal website.
Only works if query includes activity identifiers using the aid command.

Download
Download data in various formats without viewing results in the browser.

How to
Sends you to this documentation page in a new window/tab.

You can drag the partition to increase or decrease the space of either panel.

To Top

Run

When you click 'Run' or press 'Ctrl+Enter', the console to the right of the editor will display the results of your query.

Generally, it will look something like this:

{
    rows: [
        {
            aid: "DK-CVR-40781218-AA"
        }
    ],
    dquery: http://d-portal.org/dquery#select%20aid%0Afrom%20xson%20where%20root='/iati-activities/iati-activity'%0Alimit%201;%0A,
    time: 0.017,
    count: 1
}

That's quite a bit to dissect, isn't it?

There are 4 separate things being displayed here but you will probably only need rows.

rows will contain the data you asked for.

dquery is the url of the current query page.
time is how long the database took to answer this query, in seconds.
count is how many rows there are in total on the page.

To Top

Data formats

Downloads of data is available as csv, json, xml and html.

The default format of the queried results is json.

When getting a lot of data, json has to repeat the header names so you may find that opting for csv is substantially faster for larger queries.

The html option will only work at the top level of an activity or an organisation file; ie. when select *, the root needs to be at either /iati-activities/iati-activity or /iati-organisations/iati-organisation.

The html option is a Print-friendly version of SAVi (Simple Activity Viewer).

Replace /# in the url with ?form=csv&sql= to get the query link in different formats.

To Top

Commands

You can more or less figure out what a query does by reading it like a sentence.
And just like a sentence, you can write out a whole query in a single line as linebreaks in these examples are mostly for legibility purposes.

The following lists the most common SQL clauses and operators you can use to create queries.

Commands What it does
select Picks field(s) that contain data of interest
from Picks table(s) that contain data of interest
where Adds conditions to the query
and Additional conditions
or Additional conditions
not Additional conditions
like Use with % or _
ilike Use with % or _
as Gives a table, or a column a temporary name
in Specifies multiple values
group by Aggregates values across row
order by Specifies sorting conditions with option for asc or desc
limit Limits the number of returned results
offset Skips a given number of results
join Get data from 2 or more tables

To Top

Select

Select specifies the kind of data we want to get.

Multiple column names are separated by a comma ,.

If column names are not provided, this defaults to an asterisk * which means everything. The valid values here depend on which tables you are querying.

The following lists the most common, but not all commands using Select.

Commands What it does
* Returns full acitivity information, in their original order
count(*) Returns a number of items
distinct Returns a list of unique results
aid Returns a list of identifiers
pid Returns a list of publishers
as Gives a table, or a column a temporary name

Examples

select distinct aid 
select * 

By using the select aid command, you will get a list of activity identifiers that you can then explore on d-portal by clicking on the, 'View on d-portal' button.
Note that if you rename this column by using the as command, you will not be able to view the results on d-portal.

To Top

As

Sometimes we want column names that are more human readable when they are displayed in a CSV or JSON output.
We can give them temporary aliases that exists only for the duration of that query.

We first specify which column we want, followed by As and the temporary column name.
Double quotation marks "" are required if the alias name contains spaces.

Examples

select pid as "Organisation Identifier"
select pid as org_id

To Top

From

Tables are where the data is stored and From specifies which tables to look at to get them.

For most queries, we are only looking at the xson table.
The xson table includes the entire IATI activity and organisation elements, as well as non-standard attributes and extensions.

Multiple table names are separated by a comma ,.

For a full list of available tables in the database, please refer here.
For a full list of IATI acitivity elements, please refer here.

Examples

from xson
from location

To Top

Where

At its most basic, Where is used to get data that fulfills a specified condition.
For most queries, we do this by specifying the xpath of an element in root.

Where can be combined with operators to get results based on more than one condition.
Multiple column names are separated by a comma ,.

Operator What it does
and Returns data if all the conditions separated by And are TRUE
or Returns data if all the conditions separated by Or are TRUE
not Returns data if the condition(s) is NOT TRUE
in Specifies multiple values, arrays, short for multiple Or conditions
between Selects a range, start and end values included (values can be numbers, text, or dates)
like Search for a specified pattern, case sensitive
ilike Search for a specified pattern, case insensitive

For Like and iLike, there are two wildcard options:

  • The percent sign % for multiple characters.
  • The underscore sign _ for a single character.
Use What it does
like a% Finds data that starts with "a"
like %a Finds data that ends with "a"
like %aa% Finds data that has "aa" within it
like _a% Finds data that has "a" as the second character
like a_% Finds data that starts with "a" and is at least 2 characters in length
like a__% Finds data that starts with "a" and is at least 3 characters in length
like _a_ Finds data that has 3 characters that has "a" as the second character
like a%s Finds data that starts with "a" and ends with "s"

Examples

where root='/iati-activities/iati-activity' 
where root='/iati-activities/iati-activity/other-identifier' and xson->>'@type' = 'B1'

To Top

In

Sometimes you want to look for something within multiple elements or within an element that has multiple values.
There are many elements in the IATI Standard that have multiple values; for example, there can be multiple countries reported in an activity.

In such cases, we can use in to look through multiple values or a number of arrays.

For a full list of arrays in the database, please refer here.

Examples

This looks into the recipient-country element that can occur multiple times in an activity.
We refer to this element as an array.

and aid in (
    select aid from xson where root='/iati-activities/iati-activity/recipient-country'
    and xson->>'@code' = 'SO'
)

This looks into multiple elements seperated by a comma ,.

where root in (
    '/iati-activities/iati-activity/participating-org',
    '/iati-activities/iati-activity/transaction/'
)

To Top

And

Adds additional limits to the current query to narrow the results.

You can use the following operators to enhance a query.

Operator Description
= Equal
> Greater than
< Less than
>= Greater than or equal
<= Less than or equal
!= Not equal
<> Not equal
is null Find missing values
is not null Ignore missing values

Examples

and xson->>'@type' = 'B1'
and xson->>'/total-budget' IS NOT NULL

To Top

Group by

When counting or adding up values, it is often neccesary to group columns with the same values.
Column numbers can be used instead of column names, as well as a combination of numbers and names.

Multiple column names are separated by a comma ,.

Examples

group by 1
group by xson->>'@role', xson->>'@type'

To Top

Order by

This sorts the resulting data in ascending or descending order.
Column numbers can be used instead of column names, as well as a combination of numbers and names.

Multiple column names are separated by a comma ,.

The default order is in ascending order (low to high).

Operator Description
asc Sorts in ascending order (low to high)
desc Sorts in descending order (high to low)

Examples

order by pid desc
order by 1 asc, xson->>'@role' desc

To Top

Limit

We use Limit to specify the number of results we want in the query.
This is especially useful if we want to test a query that might be intensive or to quickly see if it works.

This will return the first 10 results from a table.

Examples

limit 10;

To Top

Offset

We use Offset for pagination, especially useful on large tables when we need to see the next set of results.
When we combine it with Limit, this allows us to page through the results whilst still returning a limited number of results.

Examples

offset 5;

The following query will return only 5 results, starting on the the 11th result.

limit 5 offset 10;

To Top

Join

We use Join to combine rows from two or more tables, based on a related column between them.

Examples

join country on act.aid = country.aid

We can also refer to the same table as two different table aliases, thus the xson table below is joined with itself.
An example of such use is when you want to return results that includes data within and outside a particular element.

This can happen with IATI data due to the hierarchical nature of XML.

However, it is highly recommended that we always aim to keep queries straightforward and simple to reduce load times!

from xson as "Table 1", xson as "Table 2"

To Top

🔍 Basic queries

We often refer to tables, rows and columns when querying. This is because the database contains many tables.

Each table is identified by a name, for example xson.
Each table has rows and columns, and those have names too, for example iati-identifier.

For a full list of available table and column names in the database, please refer here.

When we do a query, we write SQL statements.

SQL keywords are not case sensitive, so select is the same as SELECT.
We always end a statement with a semicolon ; so you can write many statements in a query.

However, although we allow multiple SQL statements to be executed at a time, only the latest one will display.

We always limit our queries because the database is huge!
Returning large results can impact your browser performance, and ultimately the server and the d-portal website.

To Top

Syntax

The following is a SQL statement.
For this purpose, all SQL keywords are in upper-case.

Hopefully by reading the following code snippet, you are able to figure out the different clauses and fields that make up parts of the statement.
You can always refer back to the most common commands here.

SELECT column_name(s), column_name AS alias_name
FROM table_name(s)
WHERE condition
AND condition
GROUP BY column_name(s)
ORDER BY column_name(s)
LIMIT number;

All possible column names can be found by querying a table and looking at the returned JSON.
By specifying the xpath of an element in root, you can filter results to within that element without displaying the entire xml.

Examples

This gets you all the data within the budget element table from a random activity.

select *
from xson where root='/iati-activities/iati-activity/budget'
limit 1;

Result

{
    rows: [
        {
            aid: "SE-0-SE-6-10451A0101-MLI-72012",
            pid: "SE-0",
            root: "/iati-activities/iati-activity/budget",
            xson: {
                /value: 189381.7616514557,
                /value@currency: "USD",
                /value@value-date: "2019-01-01",
                /period-end@iso-date: "2019-12-28",
                /period-start@iso-date: "2019-01-01"
            }
        }
    ],
    time: 0.014
}

If we change the select field from * to aid, the result will change.
This is because the query has specified that we only want a list of identifiers.

select aid
from xson where root='/iati-activities/iati-activity/budget'
limit 1;

Result

{
    rows: [
        {
            aid: "XM-DAC-903-SPI-10117"
        }
    ],
    time: 0.035
}

This gets you all the data within the total-expenditure element table from a random organisation file.

select *
from xson where root='/iati-organisations/iati-organisation/total-expenditure'
limit 1;

Result

{
    rows: [
        {
            aid: null,
            pid: "XM-DAC-41123",
            root: "/iati-organisations/iati-organisation/total-expenditure",
            xson: {
                /value: 1060488055.37,
                /value@currency: "USD",
                /value@value-date: "2013-01-01",
                /period-end@iso-date: "2013-12-31",
                /period-start@iso-date: "2013-01-01"
            }
        }
    ],
    time: 0.009
}

If we tweak the query like so, the result will change.
Here we only want the publisher identifier pid and value element which we have renamed as Total Expenditure in USD.

select pid, xson->>'/value' as "Total Expenditure in USD"
from xson where root='/iati-organisations/iati-organisation/total-expenditure'
limit 1;

Result

{
    rows: [
        {
            pid: "XM-DAC-41146",
            Total Expenditure in USD: "354113927"
        }
    ],
    time: 0.009
}

To Top

XSON

For the majority of our queries, we will be using the xson table as it includes all activity and organisation elements according to the IATI Standard, as well as non-standard attributes and extensions that can be found in published data.

You will find root='/iati-activities/iati-activity' in many of the example queries below.

This tells PostgreSQL where to look within the xson table by giving it a start of the xpath.
When we say xpath, we mean where an IATI element is within the Activity or Organisation Standard; ie. /iati-activities/iati-activity.

If you are familiar with the IATI Standard, you can manipulate the xpath to look at any activity or organisation element.

When we set conditions for a query, it usually looks like the following.

and xson->>'@vocabulary' = '99'
and xson->>'@dstore:dataset' = 'slovakaid-69_1_ac'

This filters the queries to certain requirements by looking at specific values of IATI element attributes.

xson is a JSON representation of the XML data, such that if you concatenate the root value plus all the object key names in a tree, the result will be a valid XML path that can be looked up in the IATI schema to tell you what data it represents.

For example, within the xson row, the entries to the left of the colon; ie. @type, @iso-date, /narrative and @xml are key names.
And thus, object entries on the right are the values.

aid, pid, root and xson are row values from the PostgreSQL database.

{
    aid: "ID-KHH-5018112631240040-ID012-LEAP",
    pid: "ID-KHH-5018112631240040",
    root: "/iati-activities/iati-activity/activity-date",
    xson: {
        @type: "1",
        @iso-date: "2017-09-01",
        /narrative: [
            {
                : "Grant Decision LEAP Project was signed by EKN (Minbuza) Reference: JAK-2017/859",
                @xml:lang: "EN"
            }
        ]
    }
}

To Top

Advanced Queries

For more information about the SQL language, the full list of functions and futher documentation, please refer to the following official pages.

  1. PostgreSQL Queries

  2. PostgreSQL Functions and Operators

To Top

Tables and references

We have a few different tables you can query but you should probably use the ones referred to in the recipes and examples.

For example, the slug table stores duplicate identifiers and would be the best place to look for such things.

The trans and budget tables would be very useful for analysts interested in IATI transaction data as the complicated job of splitting by sector / country and currency conversions to USD, CAD, GBP and Euro is already done for you.

Please note that transaction data is in IATI Standard version 1.04 as that is when d-portal was created.

This will return a list of available tables in the database.

select tablename from pg_tables
where schemaname='public'
order by tablename

This will return a list of available xson roots and how often they are used in the database.
They also represent whereabouts in the json you will find arrays.

select root, count(*)
from xson
group by 1 order by 2 desc

For a complete list of column names and elements , please refer to the following pages.

  1. List of available table and column names in the database

  2. List of IATI acitivity elements

Adding explain before your query displays how the database handles your query.

explain
select * from table
where condition
limit 1;

To Top

🍩 Database Dump

Spin up a server, import and start querying.

The entire d-portal database is about 3GB in size.

The following dumps can be imported locally and queried using the same SQL code that is used on the web interface so you can run large queries on it without clogging up d-portal.

  1. PostgreSQL pg_dump custom format so pg_restore can be used with its various options (Updated nightly)
    http://d-portal.org/db/dstore.pg

  2. A zip of all the raw cached xml (Updated nightly)
    http://d-portal.org/db/cache.xml.zip

To Top

Server

For us, a full restore of the database takes about 7 hours, or 2 hours if you use multithreading with pg_restore.

--jobs=(number of cpu cores)

Almost all of the time is spent rebuilding indexes so it's not the data itself that takes the time.
We create a lot of indexes because, as a public facing site, we need the database queries to run as fast as possible.

For comparison, this is our test server specification that imported the data in 2 hours.

CPU RAM Hard Drive(s)
Intel Atom C2750 - 2.4 GHz - 8 core(s) 16GB - DDR3 1x 256GB (SSD SATA)

What we found is having a solid state drive really speeds up the import process more than anything else.

To Top

✨ Recipes

The following recipes are some examples of what you can query using this tool.

Some of these are requests for when more complex data is needed outside the scope of the d-portal platform.
Where applicable, results are limited to 1 in the examples below.

Add comments like so

-- This is a comment

/*
This is a  
multi line comment
*/

To Top

Display all unique reporting-org/@ref published in a dataset

Raised codeforIATI/iati-data-bugtracker#7

Here we have selected the reporting_ref column from the act table and gave the column a temporary name (alias) for legibility.

We are looking at a particular dataset slovakaid-69_1_ac in the slug column.

Finally, we have grouped the results by the temporary name (alias) reporting_org and sorted it alphabetically, in the default ascending order.

select act.reporting_ref as reporting_org
from act where act.slug='slovakaid-69_1_ac'
group by reporting_org
order by reporting_org
limit 1;

Result

{
    rows: [
        {
            reporting_org: "XM-DAC-69-0"
        }
    ],
    time: 0.015
}

We can do the same search using the xson table but this will be much slower as this column is not part of the IATI Standard and so is not indexed.

You can compare this by looking at the duration of the query.

This query took 17 seconds compared to almost instantaneous in the previous query.

When creating queries, it is always good to aim at efficiency so it could be that it takes a few tries to get it right.
Databases are complex creatures!

Here we selected the /reporting-org@ref element from the xson table and gave the column a temporary name (alias).
We are looking at the dataset slovakaid-69_1_ac in the @dstore:dataset column.

select xson->>'/reporting-org@ref' as reporting_org
from xson where root='/iati-activities/iati-activity'
and xson->>'@dstore:dataset' = 'slovakaid-69_1_ac'
group by reporting_org
order by reporting_org
limit 1;

Result

{
    rows: [
        {
            reporting_org: "XM-DAC-69-0"
        }
    ],
    time: 17.329
}

To Top

Display count of certain element in org file

select

count(*)

from xson where root='/iati-organisations/iati-organisation'
and xson->>'/total-budget' IS NOT NULL

Result

{
    rows: [
        {
            count: "437"
        }
    ],
    time: 0.053
}

To Top

Look for similar iati-identifier using a wildcard %

Raised codeforIATI/iati-ideas#37

The wildcard % can be placed before, after or before and after a search term to get the following results.

%<searchterm> looks for identifiers that end with the search term.
<searchterm>% looks for identifiers that start with the search term.
%<searchterm>% looks for identifiers that has the search term within it.

select

distinct aid

from xson where root='/iati-activities/iati-activity' 
and xson->>'/iati-identifier' like '%1022474%'

limit 1;

Result

{
    rows: [
        {
            aid: "1022474"
        },
        {
            aid: "DE-1-201022474"
        }
    ],
    time: 18.484
}

Below is an alternative query using a different table act.

This is a much faster search because it is looking at a smaller table that has the same data in column, iati-identifier.
The act table is a curated table and does not include all IATI elements.

For a full list of available table and column names in the database, please refer here.

select act.aid as identifier
from act where act.aid like '%1022474%'
group by act.aid
limit 1;

Result

{
    rows: [
        {
            identifier: "1022474"
        },
        {
            identifier: "DE-1-201022474"
        }
    ],
    time: 1.457
}

To Top

Display IATI Registry dataset for publishers that use the same organisation-identifier

From codeforIATI/iati-data-bugtracker#19, we wanted to check if there were more publishers that used the same id.

select
distinct xson->>'@dstore:dataset' as "IATI Registry Dataset"
from xson where root='/iati-organisations/iati-organisation' 
and xson->>'/organisation-identifier' like '%NL-KVK-30285304%'
limit 1;

Result

{
    rows: [
        {
            IATI Registry Dataset: "humana_houben-org"
        }
    ],
    time: 0.058
}

We can display the narrative of both /name and /reporting-org so that we know who these publishers are.
We include these two different elements become sometimes publishers can publish either one or both fields.

select
distinct xson->'/reporting-org/narrative'->0->>'' as "Publisher",
xson->'/name/narrative'->0->>'' as "Publisher Name",
xson->>'@dstore:dataset' as "IATI Registry Dataset"
from xson where root='/iati-organisations/iati-organisation' 
and xson->>'/organisation-identifier' like '%NL-KVK-30285304%'
limit 1;

Result

{
    rows: [
        {
            Publisher: "Humana people to people",
            Publisher Name: "Humana people to people",
            IATI Registry Dataset: "humana_houben-org"
        }
    ],
    time: 0.059
}

To Top

Display IATI Registry dataset for an activity

select
distinct xson->>'@dstore:dataset' as "IATI Registry Dataset"
from xson where root='/iati-activities/iati-activity' 
and xson->>'/iati-identifier'='XM-DAC-6-4-011752-01-6'
limit 1;

Result

{
    rows: [
        {
            IATI Registry Dataset: "aics-jo"
        }
    ],
    time: 0.252
}

To Top

Display duplicate activities and their count

Multiple identifiers are only stored in the slug table so we should look there.

View this query on dQuery.

select aid, count(*)
from slug
group by aid having count(*) > 1
order by count(*) desc
limit 1;

Result

{
    rows: [
        {
            aid: "ES-DIR3-E04585801-009-066169",
            count: "105"
        }
    ],
    time: 0.557
}

To Top

Display IATI Registry dataset for duplicate activities where we know the iati-identifier

From codeforIATI/iati-data-bugtracker#10, we wanted to look for datasets containing a specific iati-identifier.

Multiple identifiers are only stored in the slug table so we should look there.

select *
from slug
where aid='XM-DAC-6-4-011752-01-6'
limit 1;

Result

{
    rows: [
        {
            aid: "XM-DAC-6-4-011752-01-6",
            slug: "aics-679"
        },
        {
            aid: "XM-DAC-6-4-011752-01-6",
            slug: "aics-jo"
        }
    ],
    time: 0.009
}

To Top

Exploring traceability within IATI data

Here are steps to reproduce the results for traceability as outlined in the methodology used for IATI Trace.

The report used a previous version of d-portal that was not able to provide support for all the data needed.
However, this should now be possible with dQuery.

View this query on dQuery.

You can write as many SQL statements in the editor but only the latest one will be displayed as results.

Any line starting with -- are commented out so you can delete any -- to start a comment chunk.
Alternatively, just delete the query chunk you don't need.

Queries are limited so it doesn't stress the server but you can increase or decrease the limit accordingly.

Click the Download button to choose the format you want to download the data in.