<a href="https://colab.research.google.com/github/open-contracting/oc4ids_database/blob/main/OC4IDS_Database_Data_Import.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# OC4IDS Database - Import Data

Use this notebook to import data and CoVE check results into the OC4IDS database.

If your data is formatted as a project package, edit `source_id` and the download URL in the next two cells, then press `Ctrl+F9` to run all the cells in the notebook.

Otherwise, you need to reformat your data into a project package and save it as `project_package.json` before running the notebook.
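
If your data is a plain list of project objects, a wrapper along these lines might be enough to produce a project package. This is a minimal sketch, not a definitive implementation: only the `projects` key is relied on by the rest of this notebook. The other field names, the `version` value, the function name `wrap_in_project_package` and the example identifiers and URL are assumptions that you should check against the OC4IDS project package schema.

```python
import json
from datetime import datetime, timezone


def wrap_in_project_package(projects, uri, publisher_name, version="0.9"):
    """Wrap a list of project dicts in a minimal project package (assumed layout)."""
    return {
        "uri": uri,
        # Timestamp of packaging, in UTC.
        "publishedDate": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "version": version,
        "publisher": {"name": publisher_name},
        # The notebook reads projects from this key with `jq .projects[]`.
        "projects": list(projects),
    }


# Hypothetical input: two bare project objects.
example_projects = [{"id": "oc4ids-abc123-0001"}, {"id": "oc4ids-abc123-0002"}]
package = wrap_in_project_package(
    example_projects,
    uri="https://example.com/project_package.json",
    publisher_name="Example Publisher",
)

# To use the result with this notebook, save it under the expected file name:
# with open("project_package.json", "w") as f:
#     json.dump(package, f)
```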

Set `source_id`:

In [None]:
source_id = 'example'

Download a project package:

In [None]:
!curl https://standard.open-contracting.org/infrastructure/latest/en/_static/example.json > project_package.json

## Setup

Install `psql` client:

In [None]:
!sudo apt-get install -y postgresql-client

Create a `.pgpass` file with database credentials:

In [None]:
!touch ~/.pgpass
!chmod 0600 ~/.pgpass
!echo "database-1.cmc8bohiuyg3.us-east-1.rds.amazonaws.com:5432:postgres:postgres:w3b5pDXIB6ZLARBO4rn3" > ~/.pgpass
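
Each `.pgpass` entry has the form `hostname:port:database:username:password`, and a colon or backslash inside a field has to be escaped with a backslash. The sketch below builds such a line in Python and writes it with restrictive permissions. The helper names and the example values are hypothetical, not part of this notebook's setup.

```python
import os


def pgpass_line(host, port, database, user, password):
    """Build one .pgpass entry: hostname:port:database:username:password."""
    def escape(field):
        # Backslashes and colons inside a field must be backslash-escaped.
        return str(field).replace("\\", "\\\\").replace(":", "\\:")

    return ":".join(escape(f) for f in (host, port, database, user, password))


def write_pgpass(path, line):
    """Write the entry; mode 0600 is applied when the file is newly created."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "w") as f:
        f.write(line + "\n")


# Hypothetical values, not real credentials.
line = pgpass_line("db.example.com", 5432, "postgres", "postgres", "s3cr:et")
print(line)

# write_pgpass(os.path.expanduser("~/.pgpass"), line)
```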

Install `jq`:

In [None]:
!sudo apt-get install -y jq

Connect notebook to database:

In [None]:
%load_ext sql
%sql postgresql://postgres:w3b5pDXIB6ZLARBO4rn3@database-1.cmc8bohiuyg3.us-east-1.rds.amazonaws.com/postgres
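
If the password contains characters such as `@` or `/`, it needs to be percent-encoded before it goes into the connection URL. The sketch below shows one way to assemble a URL of the same shape as the one above. The function name `connection_url` and the example host and password are hypothetical.

```python
from urllib.parse import quote_plus


def connection_url(user, password, host, database, port=5432):
    """Assemble a postgresql:// URL, percent-encoding the user and password."""
    return (
        f"postgresql://{quote_plus(user)}:{quote_plus(password)}"
        f"@{host}:{port}/{database}"
    )


# Hypothetical values; '@' and '/' in the password are encoded.
url = connection_url("postgres", "p@ss/word", "db.example.com", "postgres")
print(url)
```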

Install `libcoveoc4ids`:

In [None]:
!pip install libcove==0.18.0
!pip install libcoveoc4ids

## Check data

Check data using `libcoveoc4ids`:

In [None]:
!libcoveoc4ids project_package.json > results.json
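
Before importing, it can help to glance at what the check produced. The sketch below summarises the top level of a results object without assuming any particular key names; the keys in `sample` are hypothetical stand-ins, and the real ones come from `libcoveoc4ids`.

```python
import json


def summarize_results(results):
    """Map each top-level key to its item count (lists, dicts) or its scalar value."""
    summary = {}
    for key, value in results.items():
        if isinstance(value, (list, dict)):
            summary[key] = len(value)
        else:
            summary[key] = value
    return summary


# Hypothetical stand-in for the contents of results.json.
sample = {"some_list": [1, 2, 3], "some_flag": True, "some_object": {}}
print(summarize_results(sample))

# For the real file:
# with open("results.json") as f:
#     print(summarize_results(json.load(f)))
```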

## Import data and check results

Use `jq` to generate a newline-delimited JSON file from the project package, with one project per line:

In [None]:
!jq -crM '.projects[]' project_package.json > projects.json
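
For readers less familiar with `jq`, the same transformation can be sketched in Python: each element of `projects` is serialised compactly on its own line. The function name `projects_to_ndjson` and the sample package are hypothetical.

```python
import json


def projects_to_ndjson(package):
    """Serialise each project as compact JSON on its own line."""
    lines = [
        json.dumps(project, separators=(",", ":"), ensure_ascii=False)
        for project in package["projects"]
    ]
    return "\n".join(lines) + ("\n" if lines else "")


# Hypothetical two-project package.
sample_package = {"projects": [{"id": "a"}, {"id": "b", "title": "Bridge"}]}
ndjson = projects_to_ndjson(sample_package)
print(ndjson, end="")

# For the real file:
# with open("project_package.json") as src, open("projects.json", "w") as dst:
#     dst.write(projects_to_ndjson(json.load(src)))
```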

Import data to `temp_data` table:

In [None]:
%%sql

delete from temp_data;

In [None]:
!cat projects.json | psql -h "database-1.cmc8bohiuyg3.us-east-1.rds.amazonaws.com" -U "postgres" -d "postgres" -c "COPY temp_data (data) FROM STDIN WITH escape '\' quote e'\x01' delimiter e'\x02' CSV"
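
The `COPY` options above set the CSV quote and delimiter to the control characters `\x01` and `\x02`, so that each input line is loaded as a single value rather than being split into columns. That relies on those characters never appearing raw in the file. The sketch below is one way to check a newline-delimited file before loading it; the function name `unsafe_lines` and the sample text are hypothetical.

```python
import json

QUOTE_CHAR = "\x01"      # matches quote e'\x01' in the COPY command
DELIMITER_CHAR = "\x02"  # matches delimiter e'\x02' in the COPY command


def unsafe_lines(ndjson_text):
    """Return 1-based numbers of non-empty lines that would not load as one JSON value."""
    bad = []
    for number, line in enumerate(ndjson_text.split("\n"), start=1):
        if not line:
            continue  # skip the empty string after a trailing newline
        if QUOTE_CHAR in line or DELIMITER_CHAR in line:
            bad.append(number)
            continue
        try:
            json.loads(line)
        except json.JSONDecodeError:
            bad.append(number)
    return bad


# Hypothetical sample: line 1 contains a raw \x01, line 3 is not JSON.
sample_text = '{"id":"a\x01"}\n{"id":"b"}\nnot json\n'
print(unsafe_lines(sample_text))
```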

Import check results to `temp_checks`:

In [None]:
%%sql

delete from temp_checks;

In [None]:
!cat results.json | jq -crM . | psql -h "database-1.cmc8bohiuyg3.us-east-1.rds.amazonaws.com" -U "postgres" -d "postgres" -c "COPY temp_checks (cove_output) FROM STDIN WITH escape '\' quote e'\x01' delimiter e'\x02' CSV"

Create a collection, copy the data to the `projects` table, copy the check results to the `collection_check` table, and populate the `field_counts` and `project_fields` tables:

In [None]:
%%sql

INSERT INTO collection (source_id, data_version) VALUES (:source_id, current_timestamp);

INSERT INTO
  projects (collection_id, project_id, data)
SELECT
  (SELECT id FROM collection ORDER BY id DESC LIMIT 1) as collection_id,
  data ->> 'id' as project_id,
  data as data
FROM
  temp_data;

DELETE FROM temp_data;

INSERT INTO
  collection_check (collection_id, cove_output)
SELECT
  (SELECT id FROM collection ORDER BY id DESC LIMIT 1) as collection_id,
  cove_output as cove_output
FROM
  temp_checks;

DELETE FROM temp_checks;

INSERT INTO
  field_counts
SELECT
  (SELECT id FROM collection ORDER BY id DESC LIMIT 1) as collection_id,
  path,
  regexp_split_to_array(path, '/') as path_array,
  sum(object_property) as object_property,
  sum(array_item) as array_count,
  count(DISTINCT id) as distinct_projects
FROM
  projects
CROSS JOIN
  flatten(data)
WHERE
  collection_id = (SELECT id FROM collection ORDER BY id DESC LIMIT 1)
GROUP BY
  collection_id,
  path;

WITH RECURSIVE paths(project_id, path, "value") AS (
    select project_id,
        (key_value).key "path", 
        (key_value).value "value", 
        'true'::boolean "use_path" from 
    (select project_id, jsonb_each(data) key_value from projects where collection_id = (SELECT id FROM collection ORDER BY id DESC LIMIT 1)) a
  UNION ALL
    (select project_id,
            case when key_value is not null then
                path || '/'::text || (key_value).key::text
            else
                path
            end "path",
            case when key_value is not null then
                (key_value).value
            else
                array_value
            end "value",
            key_value is not null "use_path"
      from
        (select 
            project_id,
            path,
            jsonb_each(case when jsonb_typeof(value) = 'object' then value else '{}'::jsonb end) key_value,
            jsonb_array_elements(case when jsonb_typeof(value) = 'array' and jsonb_typeof(value -> 0) = 'object' then value else '[]'::jsonb end) "array_value"
            from paths
        ) a
    )
)
INSERT INTO
  project_fields
SELECT
  (SELECT id FROM collection ORDER BY id DESC LIMIT 1) as collection_id,
  project_id,
  array_agg(path) as paths
FROM
  paths
WHERE
  use_path
GROUP BY
  project_id;
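
The recursive query that fills `project_fields` can be hard to follow. The Python sketch below approximates what it appears to do for a single project: every object key adds a `/`-separated path, and arrays whose first item is an object are descended into without adding an index to the path, so a field repeated across array items is listed once per item. The function name `field_paths` and the sample project are hypothetical, and the ordering of the result may differ from the SQL.

```python
def field_paths(value, prefix=""):
    """List the paths of all object keys reachable from a JSON value."""
    paths = []
    if isinstance(value, dict):
        for key, child in value.items():
            path = f"{prefix}/{key}" if prefix else key
            paths.append(path)
            paths.extend(field_paths(child, path))
    elif isinstance(value, list) and value and isinstance(value[0], dict):
        # Arrays of objects keep the parent's path; no index is added.
        for item in value:
            paths.extend(field_paths(item, prefix))
    # Scalars and arrays of scalars contribute no further paths.
    return paths


# Hypothetical project with a nested object, an array of objects and an array of strings.
sample_project = {
    "id": "p1",
    "parties": [{"name": "A"}, {"name": "B", "id": "x"}],
    "period": {"startDate": "2020-01-01"},
    "tags": ["a"],
}
print(sorted(field_paths(sample_project)))
```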
