# OC4IDS Database - Download, check and import

Use this notebook to:

* Download data from OC4IDS publishers
* Check data using the OC4IDS Data Review Tool
* Import data and check results into the OC4IDS database

**How to run the notebook**

1. [Enter your database credentials](#scrollTo=z4-iWuZRhoEe)
1. Run all cells in the [Setup](#scrollTo=3PU3KAsPuYP7) section
1. For each load (set of data sources that you wish to analyze as a group):
  1. [Set a new `load_id`](#scrollTo=XJSBAcLbWJHu)
  1. For each data source, [choose a data source](#scrollTo=HFTcVz0Q0tMr) and run all cells under [Download, check and import data](#scrollTo=veSGp6SIwYRt)

**How to add a new publisher**

1. Add a `source_id` for the publisher in the [Choose a data source](#scrollTo=HFTcVz0Q0tMr) section.
2. In the [Download data](#scrollTo=veSGp6SIwYRt) section:
  1. Add the `source_id` and download URL to `sources`.
  1. If the data is not accessible as a single project package via a simple `GET` request, add an [`elif`](https://docs.python.org/3/reference/compound_stmts.html#elif) part to the `if` statement with code that returns a single project package named `project_package.json`.
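
As an illustration of the last step, the sketch below shows one way an `elif` branch might combine several responses into a single project package before saving it as `project_package.json`. It assumes a publisher whose API returns several packages, each with a `projects` array; the function name `merge_project_packages` and the sample pages are hypothetical and not part of this notebook.

```python
def merge_project_packages(packages):
    """Combine several project packages into one.

    Package-level metadata is copied from the first package (an assumption
    made for this sketch); the `projects` arrays are concatenated in order.
    """
    if not packages:
        raise ValueError("at least one package is required")
    merged = {k: v for k, v in packages[0].items() if k != "projects"}
    merged["projects"] = [
        project
        for package in packages
        for project in package.get("projects", [])
    ]
    return merged


# Hypothetical pages, standing in for responses from a paginated API.
pages = [
    {"version": "0.9", "projects": [{"id": "p1"}]},
    {"version": "0.9", "projects": [{"id": "p2"}, {"id": "p3"}]},
]
merged = merge_project_packages(pages)
print([project["id"] for project in merged["projects"]])
```

In the notebook, the merged dictionary would then be written to `project_package.json` with `json.dump`.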

**Having problems?**

[Open an issue in the `notebooks-oc4ids` repository](https://github.com/open-contracting/notebooks-oc4ids/issues/new).

## Enter your database credentials

> **Open Data Services users:** Refer to the password database.

In [None]:
import getpass

user = 'postgres'
password = getpass.getpass('Password:')

## Setup

Install `psql` client:

In [None]:
%%shell

sudo apt-get update
sudo apt-get install -y postgresql-client

Create a `.pgpass` file with database credentials:

In [None]:
!touch ~/.pgpass
!chmod 0600 ~/.pgpass
!echo oc4ids-database-2.cuujgua4wses.us-east-1.rds.amazonaws.com:5432:postgres:{user}:{password} > ~/.pgpass
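
The `echo` command above writes the password verbatim. PostgreSQL's password file format requires any `:` or `\` inside a field to be escaped with a backslash, so a password containing those characters would need escaping first. A minimal sketch of that escaping follows; `pgpass_escape` and `pgpass_line` are hypothetical helper names, and the host and password are placeholders.

```python
def pgpass_escape(field):
    """Escape backslashes and colons, as the .pgpass format requires."""
    return field.replace("\\", "\\\\").replace(":", "\\:")


def pgpass_line(host, port, database, user, password):
    """Build one hostname:port:database:username:password entry."""
    fields = [host, str(port), database, user, password]
    return ":".join(pgpass_escape(field) for field in fields)


# Placeholder values for illustration only.
line = pgpass_line("db.example.com", 5432, "postgres", "postgres", r"pa:ss\word")
print(line)
```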

Install `jq`:

In [None]:
%%shell

sudo apt-get install -y jq

Connect notebook to database:

In [None]:
connection_string = 'postgresql://' + user + ':' + password + '@oc4ids-database-2.cuujgua4wses.us-east-1.rds.amazonaws.com/postgres'

%load_ext sql
%sql $connection_string
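
The connection string above is built by plain concatenation, so a password containing characters such as `@`, `/` or `:` would likely produce a URL that cannot be parsed. One possible safeguard is to percent-encode the credentials first. The sketch below uses `urllib.parse.quote`; `build_connection_string` is a hypothetical helper, and the host and password are placeholders.

```python
from urllib.parse import quote


def build_connection_string(user, password, host, database):
    """Return a postgresql:// URL with the credentials percent-encoded."""
    return (
        "postgresql://"
        + quote(user, safe="") + ":" + quote(password, safe="")
        + "@" + host + "/" + database
    )


# Placeholder credentials and host for illustration only.
url = build_connection_string("postgres", "p@ss/word", "db.example.com", "postgres")
print(url)
```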

Install lib-cove-oc4ids:

In [None]:
%%shell

pip install libcoveoc4ids

## Set a new `load_id`

In [None]:
from datetime import datetime
load_id = datetime.now()

## Choose a data source

In [None]:
# @title { run: "auto" }

# @markdown After running this cell manually, it will auto-run when you change the source_id.

source_id = 'uganda_gpp' #@param [ 'mexico_cost_jalisco', 'ghana_cost_sekondi_takoradi', 'mexico_nuevo_leon', 'indonesia_cost_west_lombok', 'ukraine_cost_ukraine', 'uganda_gpp', 'malawi_cost_malawi']

print('Source selected:', source_id)

# Download, check and import data

In [None]:
import json
import requests

from datetime import datetime

sources = {
  'mexico_cost_jalisco': 'http://www.costjalisco.org.mx/jsonprojects',
  'ghana_cost_sekondi_takoradi': 'https://costsekondi-takoradigh.org/uploads/projectJson.json',
  'mexico_nuevo_leon': 'http://si.nl.gob.mx/siasi_ws/api/edcapi/DescargarProjectPackage',
  'indonesia_cost_west_lombok': 'https://intras.lombokbaratkab.go.id/oc4ids',
  'ukraine_cost_ukraine': 'https://portal.costukraine.org/data.json',
  'uganda_gpp': 'https://gpp.ppda.go.ug/adminapi/public/api/open-data/v1/infrastructure/projects/download?format=json',
  'malawi_cost_malawi': 'https://ippi.mw/api/projects/query',
  'indonesia_cost_ntb':'https://intras.ntbprov.go.id/storage/docs/oc4ids.json'
}

if source_id == 'malawi_cost_malawi':
  # The IPPI API accepts a POST request with JSON-encoded start_date and end_date parameters
  payload = {
      "start_date": "2010-01-01",
      "end_date": datetime.today().strftime('%Y-%m-%d')
    }
  response = requests.post(sources[source_id], json=payload)

  with open('project_package.json', 'wb') as f:
    f.write(response.content)

else:
  # verify=False skips TLS certificate verification for this request
  response = requests.get(sources[source_id], verify=False)

  with open('project_package.json', 'wb') as f:
    f.write(response.content)
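
The cell above saves whatever the server returns, which could include an HTML error page. A sketch of a check that could run on the downloaded bytes before the later steps is shown below. The only structural assumption is the top-level `projects` array that the `jq` filter in this notebook also relies on; `count_projects` is a hypothetical helper name and the sample bytes are invented.

```python
import json


def count_projects(raw):
    """Parse downloaded bytes and return the number of projects found.

    Raises ValueError if the content is not JSON or lacks a `projects` array.
    """
    try:
        package = json.loads(raw)
    except json.JSONDecodeError as error:
        raise ValueError(f"response is not valid JSON: {error}") from error
    if not isinstance(package, dict) or not isinstance(package.get("projects"), list):
        raise ValueError("response has no top-level `projects` array")
    return len(package["projects"])


# Sample bytes standing in for `response.content`.
sample = b'{"projects": [{"id": "p1"}, {"id": "p2"}]}'
print(count_projects(sample))
```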


## Check data

Check data using `libcoveoc4ids`:

In [None]:
%%shell

libcoveoc4ids project_package.json > results.json
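
For a quick overview of `results.json` before importing it, a sketch like the one below lists each top-level key with the type and size of its value. It deliberately assumes nothing about the key names that `libcoveoc4ids` emits; `summarize` is a hypothetical helper and the sample data is invented.

```python
import json


def summarize(results):
    """Map each top-level key to a short description of its value."""
    summary = {}
    for key, value in results.items():
        if isinstance(value, (list, dict)):
            summary[key] = f"{type(value).__name__} of length {len(value)}"
        else:
            summary[key] = repr(value)
    return summary


# Invented sample; in the notebook the input would be loaded from results.json.
sample = json.loads('{"a_list": [1, 2, 3], "a_flag": true, "an_object": {"x": 1}}')
for key, description in summarize(sample).items():
    print(f"{key}: {description}")
```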

## Import data and check results

Use `jq` to generate a newline-delimited JSON file from the project package:

In [None]:
%%shell

jq -crM '.projects[]' project_package.json > projects.json
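
For readers less familiar with `jq`: the `.projects[]` filter emits each element of the `projects` array, `-c` prints each one compactly on its own line, and `-M` disables colored output. A rough Python equivalent is sketched below; `to_ndjson` is a hypothetical helper name and the sample package is invented.

```python
import json


def to_ndjson(package):
    """Return one compact JSON document per project, one per line."""
    return "".join(
        json.dumps(project, separators=(",", ":"), ensure_ascii=False) + "\n"
        for project in package["projects"]
    )


# Invented sample package for illustration.
sample = {"projects": [{"id": "p1", "title": "Puente"}, {"id": "p2"}]}
print(to_ndjson(sample), end="")
```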

Import data to `temp_data` table:

In [None]:
%%sql

delete from temp_data;

In [None]:
!cat projects.json | psql -h "oc4ids-database-2.cuujgua4wses.us-east-1.rds.amazonaws.com" -U {user} -d "postgres" -c "COPY temp_data (data) FROM STDIN WITH escape '\' quote e'\x01' delimiter e'\x02' CSV"
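
The `COPY` options above set the quote and delimiter characters to the control characters `\x01` and `\x02`, presumably so that each input line is loaded as a single value instead of being split as CSV. That approach would rely on neither character appearing raw in the input, which should hold for valid JSON because control characters inside strings must be escaped. A sketch of a pre-flight check is shown below; `find_unsafe_lines` is a hypothetical helper and the sample lines are invented.

```python
def find_unsafe_lines(lines):
    """Return 1-based numbers of lines containing a raw \\x01 or \\x02."""
    return [
        number
        for number, line in enumerate(lines, start=1)
        if "\x01" in line or "\x02" in line
    ]


# Invented sample lines; the second contains a raw control character.
sample = ['{"id":"p1"}', '{"id":"p\x022"}', '{"id":"p3"}']
print(find_unsafe_lines(sample))
```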

Import check results to `temp_checks`:

In [None]:
%%sql

delete from temp_checks;

In [None]:
!cat results.json | jq -crM . | psql -h "oc4ids-database-2.cuujgua4wses.us-east-1.rds.amazonaws.com" -U {user} -d "postgres" -c "COPY temp_checks (cove_output) FROM STDIN WITH escape '\' quote e'\x01' delimiter e'\x02' CSV"

Create a collection, copy data to the `projects` table, copy check results to the `collection_check` table, and populate the `field_counts` and `project_fields` tables:

In [None]:
%%sql

INSERT INTO collection (source_id, data_version, load_id)
    VALUES (:source_id, CURRENT_TIMESTAMP, :load_id);

INSERT INTO projects (collection_id, project_id, data)
SELECT
    (
        SELECT
            id
        FROM
            collection
        ORDER BY
            id DESC
        LIMIT 1) AS collection_id,
    trim(BOTH '"' FROM (data -> 'id')::text) AS project_id,
    data AS data
FROM
    temp_data;

DELETE FROM temp_data;

INSERT INTO collection_check (collection_id, cove_output)
SELECT
    (
        SELECT
            id
        FROM
            collection
        ORDER BY
            id DESC
        LIMIT 1) AS collection_id,
    cove_output AS cove_output
FROM
    temp_checks;

DELETE FROM temp_checks;

INSERT INTO field_counts
SELECT
    (
        SELECT
            id
        FROM
            collection
        ORDER BY
            id DESC
        LIMIT 1) AS collection_id,
    path,
    regexp_split_to_array(path, '/') AS path_array,
    sum(object_property) object_property,
    sum(array_item) array_count,
    count(DISTINCT id) distinct_projects
FROM
    projects
    CROSS JOIN flatten (data)
WHERE
    collection_id = (
        SELECT
            id
        FROM
            collection
        ORDER BY
            id DESC
        LIMIT 1)
GROUP BY
    collection_id,
    path;

WITH RECURSIVE paths (
    project_id,
    path,
    "value"
) AS (
    SELECT
        project_id,
        (key_value).KEY "path",
        (key_value).value "value",
        'true'::boolean "use_path"
    FROM (
        SELECT
            project_id,
            jsonb_each(data) key_value
        FROM
            projects
        WHERE
            collection_id = (
                SELECT
                    id
                FROM
                    collection
                ORDER BY
                    id DESC
                LIMIT 1)) a
    UNION ALL (
        SELECT
            project_id,
            CASE WHEN key_value IS NOT NULL THEN
                path || '/'::text || (key_value).KEY::text
            ELSE
                path
            END "path",
            CASE WHEN key_value IS NOT NULL THEN
                (key_value).value
            ELSE
                array_value
            END "value",
            key_value IS NOT NULL "use_path"
        FROM (
            SELECT
                project_id,
                path,
                jsonb_each(
                    CASE WHEN jsonb_typeof(value) = 'object' THEN
                        value
                    ELSE
                        '{}'::jsonb
                    END) key_value,
                jsonb_array_elements(
                    CASE WHEN jsonb_typeof(value) = 'array'
                        AND jsonb_typeof(value -> 0) = 'object' THEN
                        value
                    ELSE
                        '[]'::jsonb
                    END) "array_value"
            FROM
                paths) a))
INSERT INTO project_fields
SELECT
    (
        SELECT
            id
        FROM
            collection
        ORDER BY
            id DESC
        LIMIT 1) AS collection_id,
    project_id,
    array_agg(path) AS paths
FROM
    paths
WHERE
    use_path
GROUP BY
    project_id;
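
To make the recursive query easier to follow, here is a Python sketch of the traversal it appears to perform for one project. Every object key contributes a `/`-separated path. Arrays whose first element is an object are descended into without adding an index to the path. Paths are not de-duplicated, mirroring `array_agg(path)`. The name `field_paths` and the sample project are hypothetical and not part of the notebook or the database.

```python
def field_paths(value, prefix=""):
    """List the path of every object key reachable from `value`."""
    paths = []
    if isinstance(value, dict):
        for key, child in value.items():
            path = f"{prefix}/{key}" if prefix else key
            paths.append(path)
            paths.extend(field_paths(child, path))
    elif isinstance(value, list) and value and isinstance(value[0], dict):
        # Array items share their parent's path; only their keys add segments.
        for item in value:
            paths.extend(field_paths(item, prefix))
    return paths


# Invented sample project for illustration.
project = {
    "id": "p1",
    "parties": [{"id": "a"}, {"id": "b", "name": "x"}],
    "tags": ["x"],
}
print(field_paths(project))
```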
