# OC4IDS Publisher Status Report

## How to author a report

1. Load the data that you want to report on using the [data load notebook]().
2. Run the status checks using the [status checks notebook]().
3. Run the cells in [Appendix 1: Report Setup](#scrollTo=GYwdqQevW-zi).
4. Run all cells in the [Summary](), [Criteria](), [Checks]() and [Metrics]() sections.
5. For each criterion and check whose methodology is `manual`:
  1. Follow the instructions in the methodology.
  2. Add code and/or Markdown cells to document your findings (e.g. check failures).
  3. Save the results.
6. Run the cells in the Summary section to update the summary table.
7. Remove this cell.

## Introduction

This report assesses the status of OC4IDS publications. It covers:

* Quality [criteria](#scrollTo=7N-KAMJHQkad) that all OC4IDS publications should meet.
* Other [checks](#scrollTo=JZ_mhib6Q_sK) on the quality of OC4IDS data.
* [Metrics](#scrollTo=xmCUh-_CtLwA) related to the criteria and checks.

## Data sources

This report covers data from the following OC4IDS publications:

In [None]:
# @title ### Publications

%%sql

select
  source_id,
  data_version as collection_date
from
  collection
join
  run_collection
on
  collection.id = run_collection.collection_id
where
  run_id = :run_id
order by
  source_id asc;

## Summary

This section provides a summary of criteria and check results. It is intended to support comparison between publications and assessment of the overall quality of the corpus of OC4IDS data.

`True` indicates success against a criterion or check, `False` indicates failure and `None` indicates that a criterion or check was not assessed.



In [None]:
# @title ### Comparison table
get_results(run_id = run_id, extra_results = manual_checks)

## Criteria

This section assesses publications against pass/fail criteria that all publications should meet.

### Registered

**Description:**

The data uses an OC4IDS prefix in its project identifiers.

**Methodology:** `automated`

Check against the [list of registered prefixes](https://standard.open-contracting.org/infrastructure/latest/en/reference/prefixes/).
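The prefix check can be sketched as follows. This is a minimal illustration, not the status checks notebook's implementation: it assumes identifiers have the form `<prefix>-<local id>`, and the function name, prefixes and identifiers are hypothetical.

```python
def find_unregistered(project_ids, registered_prefixes):
    """Return the project identifiers that do not start with a registered prefix."""
    # Assumption: an identifier is a registered prefix, a hyphen, then a local id.
    prefixes = tuple(f"{prefix}-" for prefix in registered_prefixes)
    return [pid for pid in project_ids if not pid.startswith(prefixes)]


# Hypothetical prefixes and identifiers, for illustration only.
registered = {"oc4ids-aaaaaa", "oc4ids-bbbbbb"}
ids = ["oc4ids-aaaaaa-001", "oc4ids-zzzzzz-002", "project-3"]
unregistered = find_unregistered(ids, registered)
print(unregistered)
```

An empty list corresponds to a pass against this criterion.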

**Output:**

List of project identifiers without a registered prefix.

In [None]:
# @title #### Output
get_output(run_id = run_id, check_id = 'criteria_registered')

In [None]:
# @title #### Results
get_results(run_id = run_id, check_id='criteria_registered')

### Discoverable

**Description:**

It is possible to discover the data by navigating a website whose homepage is indexed by popular web search engines.

**Methodology:** `manual`

Ask the publisher where the access methods are publicly listed and/or review the publisher’s website.

**Output:**

None

In [None]:
# @title #### Results

display_result_widgets('criteria_discoverable')

### Retrievable


**Description:**

It is possible to automate the download of all the data, either using an HTML page listing bulk download URLs, or using only machine-readable data as input.

**Methodology:** `manual`

First review: Author and run a Python scraper.

Subsequent reviews: Run the Python scraper and update if needed.

**Output:**

None

In [None]:
# @title #### Results

display_result_widgets('criteria_retrievable')

### Reviewable

**Description:**

The OC4IDS Data Review Tool is able to report results on the data.

**Methodology:** `manual`

Check that libcoveoc4ids reports results.


**Output:**

None

In [None]:
# @title #### Results

display_result_widgets('criteria_reviewable')

### Appropriate

**Description:**

Concepts are published in semantic accordance with the rules of OC4IDS rather than using a non-OC4IDS field or code. There must not be more than 10 cases in which a concept is covered by a field or code in OC4IDS but is disclosed using another field or code.

**Methodology:** `manual`

Review the output to identify concepts covered by a field or code in OC4IDS but disclosed using another field or code.


**Output:**

List of additional fields and example values reported by the Data Review Tool.

In [None]:
# @title #### Output

%%sql

select
  source_id,
  output.key as path,
  output.value -> 'count' as count,
  output.value -> 'examples' as examples
from
  check_results
cross join
  jsonb_each(output) as output
join collection on
  collection_id = collection.id
where
  run_id = :run_id
and
  check_id = 'criteria_appropriate'
order by
  source_id asc;

In [None]:
# @title #### Results

display_result_widgets('criteria_appropriate')

### Active

**Description:**

The data has been updated within the last 12 months.

**Methodology:** `automated`

There is a project with a top-level `updated` field value within the last 12 months.
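A minimal sketch of this check, under stated assumptions: "12 months" is approximated as 365 days, `updated` holds an ISO 8601 string, and values without a UTC offset are treated as UTC. The function name `is_active` is hypothetical.

```python
from datetime import datetime, timedelta, timezone


def is_active(projects, now=None, max_age_days=365):
    """Return True if any project has an `updated` value within the window."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=max_age_days)
    for project in projects:
        updated = project.get("updated")
        if not isinstance(updated, str):
            continue
        try:
            # Replace a trailing "Z" so older Python versions can parse the value.
            parsed = datetime.fromisoformat(updated.replace("Z", "+00:00"))
        except ValueError:
            continue  # Unparseable dates cannot demonstrate activity.
        if parsed.tzinfo is None:
            # Assumption: values without an offset are in UTC.
            parsed = parsed.replace(tzinfo=timezone.utc)
        if cutoff <= parsed <= now:
            return True
    return False
```

Passing `now` explicitly makes the check reproducible when a report is re-run for an earlier collection date.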

**Output:**

None. For more information, see the [last updated metric](#scrollTo=RdJl4q6sw-pj).

In [None]:
# @title #### Results
get_results(run_id = run_id, check_id='criteria_active')

### Documented

**Description:**

The publisher provides a publication policy/data user guide alongside the data.

**Methodology:** `manual`

Ask the publisher where the publication policy/data user guide is publicly available and/or review the publisher’s website.

**Output:**

None


In [None]:
# @title #### Results

display_result_widgets('criteria_documented')

### Accessible

**Description:**

The data is available as a bulk download in tabular (CSV or spreadsheet) format.

**Methodology:** `manual`

Ask the publisher for a link to the bulk downloads and/or review the publisher’s website.

**Output:**

None

In [None]:
# @title #### Results

display_result_widgets('criteria_accessible')

### Valid

**Description:**

The OC4IDS Data Review Tool reports no validation errors.

**Methodology:** `automated`

Use libcoveoc4ids to generate a list of validation errors.

**Output:**

None. For more information, see the [validation error count metric](#scrollTo=HYOcgsFSxKWD).

In [None]:
# @title #### Results

get_results(run_id = run_id, check_id='criteria_valid')

### Conformant

**Description:**

The OC4IDS Data Review Tool reports no structure warnings.

**Methodology:** `automated`

Use libcoveoc4ids to generate a list of structure warnings.

**Output:**

None


In [None]:
# @title #### Results

get_results(run_id = run_id, check_id='criteria_conformant')

## Checks

This section documents the results of pass/fail checks on the quality of OC4IDS data.

### Sectors are standardised

**Description:**

Projects are classified against the OC4IDS sector codelist.

**Methodology:** `automated`

Check that `sector` is present for at least one project and that it contains no values from outside the OC4IDS sector codelist.
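The two conditions can be sketched as below. The function name is hypothetical, `sector` is assumed to be an array of codes, and the codelist shown is an illustrative subset rather than the full OC4IDS sector codelist.

```python
def check_sectors(projects, codelist):
    """Return (passed, additional_codes) for a list of project dicts."""
    used = set()
    for project in projects:
        sectors = project.get("sector", [])
        if isinstance(sectors, str):
            sectors = [sectors]  # Tolerate a single code given as a string.
        used.update(sectors)
    additional = sorted(used - set(codelist))
    # Pass only if at least one sector is present and none is outside the codelist.
    passed = bool(used) and not additional
    return passed, additional


# Illustrative subset of codes, not the full codelist.
codelist = {"transport", "energy", "health"}
projects = [{"sector": ["transport"]}, {"sector": ["energy", "roads"]}, {}]
print(check_sectors(projects, codelist))
```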

**Output:**

List of additional sector codes.

In [None]:
# @title #### Output

%%sql

select
  source_id,
  output -> 'all_projects' as additional_codes
from
  check_results
cross join
  jsonb_each(output)
join collection on
  collection_id = collection.id
where
  run_id = :run_id
and
  check_id = 'semantics_sector_codelist'
order by
  source_id asc;

In [None]:
# @title #### Results

get_results(run_id = run_id, check_id='semantics_sector_codelist')

### Public authority names are realistic

**Description:**

Check that the public authority names in a sample are realistic, e.g. that they are government departments rather than suppliers or individuals.

**Methodology:** `manual`

Review the output and check that names are realistic.

**Output:**

Sample of `publicAuthority.name` values.

In [None]:
# @title #### Output

get_output(run_id, 'semantics_public_authority_names')

In [None]:
# @title #### Results

display_result_widgets('semantics_public_authority_names')

### Supplier names are realistic

**Description:**

Check that the supplier names in a sample are realistic, e.g. that they are private businesses rather than government departments.


**Methodology:** `manual`

Review the output and check that names are realistic.

**Output:**

Sample of `contractingProcesses/summary/suppliers/name` values.


In [None]:
# @title #### Output

get_output(run_id, 'semantics_supplier_names')

In [None]:
# @title #### Results

display_result_widgets('semantics_supplier_names')

### Project budgets are realistic

**Description:**

Check that project budgets are non-zero and less than 5bn USD.

**Methodology:** `automated`

Convert `project.budget` to USD and check against the thresholds.
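A minimal sketch of the conversion and threshold test. It assumes the budget is stored as `budget.amount.amount` with `budget.amount.currency`, and that a table of exchange rates to USD is available. The rates, function name and sample projects are hypothetical.

```python
MAX_USD = 5_000_000_000  # 5bn USD upper threshold


def unrealistic_budgets(projects, usd_rates):
    """Map project id to budget in USD for budgets that are zero or too large."""
    flagged = {}
    for project in projects:
        amount = project.get("budget", {}).get("amount", {})
        value, currency = amount.get("amount"), amount.get("currency")
        if value is None or currency not in usd_rates:
            continue  # Not assessable without a value and a known rate.
        usd = value * usd_rates[currency]
        if usd == 0 or usd >= MAX_USD:
            flagged[project["id"]] = usd
    return flagged


# Hypothetical exchange rates (USD per unit of currency).
rates = {"USD": 1.0, "GBP": 1.25}
projects = [
    {"id": "a", "budget": {"amount": {"amount": 0, "currency": "USD"}}},
    {"id": "b", "budget": {"amount": {"amount": 1_000_000, "currency": "GBP"}}},
    {"id": "c", "budget": {"amount": {"amount": 6_000_000_000, "currency": "USD"}}},
]
print(unrealistic_budgets(projects, rates))
```

The same pattern would apply to any other monetary value that is compared against these thresholds.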

**Output:**

List of unrealistic budgets.


In [None]:
# @title #### Output

get_output(run_id, 'semantics_budgets').rename(columns={"output": "budget_usd"})

In [None]:
# @title #### Results

get_results(run_id, 'semantics_budgets')

### Contract values are realistic

**Description:**

Check that contract values are non-zero and less than 5bn USD.

**Methodology:** `automated`

Convert `contractingProcesses/summary/contractValue` to USD and check against the thresholds.

**Output:**

List of unrealistic contract values.


In [None]:
# @title #### Output

get_output(run_id, 'semantics_contract_values').rename(columns={"output": "contract_value_usd"})

In [None]:
# @title #### Results

get_results(run_id, 'semantics_contract_values')

### Funder names are realistic

**Description:**

Check that the funder names in a sample are realistic, e.g. that they are government agencies, donors or multilateral financial institutions rather than private businesses.

**Methodology:** `manual`

Review the output and check that names are realistic.

**Output:**

Sample of `parties/name` values.


In [None]:
# @title #### Output
get_output(run_id, 'semantics_funder_names')

In [None]:
# @title #### Results

display_result_widgets('semantics_funder_names')

### Dates are realistic

**Description:**

Check that dates are after 1st January 1970 and before 1st January 2050.

**Methodology:** `automated`

Check the following dates against the thresholds:

* `updated`
* `period/startDate`
* `period/endDate`
* `completion/endDate`
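The threshold test over the paths listed above can be sketched as follows. This is an illustration only: the helper names are hypothetical, values are assumed to be ISO 8601 strings, and unparseable values are skipped because they are a validation matter rather than a realism one.

```python
from datetime import date

EARLIEST = date(1970, 1, 1)
LATEST = date(2050, 1, 1)
DATE_PATHS = ["updated", "period/startDate", "period/endDate", "completion/endDate"]


def get_path(obj, path):
    """Follow a slash-separated path through nested dicts, or return None."""
    for key in path.split("/"):
        if not isinstance(obj, dict) or key not in obj:
            return None
        obj = obj[key]
    return obj


def unrealistic_dates(project):
    """Map each path to its value when the date falls outside the thresholds."""
    found = {}
    for path in DATE_PATHS:
        value = get_path(project, path)
        if not isinstance(value, str):
            continue
        try:
            day = date.fromisoformat(value[:10])  # Compare on the date part only.
        except ValueError:
            continue
        if not (EARLIEST < day < LATEST):
            found[path] = value
    return found
```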

**Output:**

List of unrealistic dates.

In [None]:
# @title #### Output

get_output(run_id, 'semantics_dates')

In [None]:
# @title #### Results

get_results(run_id, 'semantics_dates')

### Roles are set

**Description:**

Check that organization `.roles` are set according to their references.

**Methodology:** `automated`

Check that:

* The organization referenced in `publicAuthority` has 'publicAuthority' in `.roles`.
* The organizations referenced in `budget/sourceParty` have 'sourceParty' in `.roles`.
* The organizations referenced in `contractingProcesses/summary/tender/tenderers` have 'tenderer' in `.roles`.
* The organization referenced in `contractingProcesses/summary/tender/procuringEntity` has 'procuringEntity' in `.roles`.
* The organization referenced in `contractingProcesses/summary/tender/administrativeEntity` has 'administrativeEntity' in `.roles`.
* The organizations referenced in `contractingProcesses/summary/suppliers` have 'supplier' in `.roles`.
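A sketch of the role comparison for most of the references listed above (`budget/sourceParty` is omitted for brevity). It assumes organizations are listed in a top-level `parties` array with `id` and `roles`, and that each reference carries the `id` of the organization it points to. The function name is hypothetical.

```python
def missing_roles(project):
    """Return sorted (organization id, role) pairs where the role is not set."""
    roles_by_id = {
        party.get("id"): set(party.get("roles", []))
        for party in project.get("parties", [])
    }

    # Collect (reference, expected role) pairs from the project.
    references = []
    if "publicAuthority" in project:
        references.append((project["publicAuthority"], "publicAuthority"))
    for process in project.get("contractingProcesses", []):
        summary = process.get("summary", {})
        tender = summary.get("tender", {})
        for supplier in summary.get("suppliers", []):
            references.append((supplier, "supplier"))
        for tenderer in tender.get("tenderers", []):
            references.append((tenderer, "tenderer"))
        for key in ("procuringEntity", "administrativeEntity"):
            if key in tender:
                references.append((tender[key], key))

    missing = {
        (ref.get("id"), role)
        for ref, role in references
        if role not in roles_by_id.get(ref.get("id"), set())
    }
    return sorted(missing)
```

A reference to an organization that does not appear in `parties` at all is reported in the same way as one whose role is missing.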

**Output:**

List of missing roles.

In [None]:
# @title #### Output

get_output(run_id, 'semantics_role_coherence')

In [None]:
# @title #### Results

get_results(run_id, 'semantics_role_coherence')

### Coordinates are valid

**Description:**

Check that project location coordinates are valid.

**Methodology:** `automated`

Check that `locations/geometry/coordinates` are in the range of [-90, 90] for latitudes and [-180, 180] for longitudes.
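A minimal sketch of the range test. It assumes positions follow the GeoJSON order of `[longitude, latitude]` and that `coordinates` may be nested to any depth (a point, a line or a polygon). The function names are hypothetical.

```python
def positions(coordinates):
    """Yield [longitude, latitude, ...] positions from nested coordinate arrays."""
    if coordinates and all(isinstance(v, (int, float)) for v in coordinates):
        yield coordinates
    else:
        for item in coordinates:
            if isinstance(item, (list, tuple)):
                yield from positions(item)


def invalid_positions(coordinates):
    """Return the positions whose longitude or latitude is out of range."""
    invalid = []
    for position in positions(coordinates):
        if len(position) < 2:
            invalid.append(position)  # A position needs both values.
            continue
        longitude, latitude = position[0], position[1]
        if not (-180 <= longitude <= 180 and -90 <= latitude <= 90):
            invalid.append(position)
    return invalid


print(invalid_positions([[10, 95], [200, 0], [5, 5]]))
```

If a publisher has swapped latitude and longitude, this test flags only positions whose swapped values fall outside the ranges, so a pass does not guarantee the order is correct.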

**Output:**

List of invalid coordinates.

In [None]:
# @title #### Output

get_output(run_id, 'semantics_coordinates')

In [None]:
# @title #### Results

get_results(run_id, 'semantics_coordinates')

## Metrics

This section provides measurements related to the criteria and checks. These measurements carry no judgements; rather, they provide additional context for the pass/fail criteria and checks.

### New project count

**Description:**

A count of projects added since the previous report.

**Methodology:** `automated`

Identify projects added since the previous report by comparing project identifiers (`id`).
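The comparison amounts to a set difference over identifiers, sketched below with hypothetical names and sample identifiers.

```python
def new_project_ids(current_ids, previous_ids):
    """Return the identifiers present now that were absent from the previous report."""
    return sorted(set(current_ids) - set(previous_ids))


# Hypothetical identifiers, for illustration only.
previous = ["p-1", "p-2"]
current = ["p-1", "p-2", "p-3", "p-4"]
added = new_project_ids(current, previous)
print(len(added), added)
```

Projects whose identifiers changed between reports would be counted as new under this approach.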

In [None]:
# @title #### Output

get_metric_output(run_id, 'metrics_new_projects')

### Last updated date

**Description:**

The last updated date of the most recently updated project.


**Methodology:** `automated`

The maximum `updated` value amongst the projects in the dataset.

In [None]:
# @title #### Output

get_metric_output(run_id, 'metrics_last_updated')

### Earliest project start date

**Description:**

The earliest project start date.

**Methodology:** `automated`

The minimum `period/startDate` amongst the projects in the dataset.


In [None]:
# @title #### Output

get_metric_output(run_id, 'metrics_earliest_start_date')

### Latest project end date

**Description:**

The latest project end date.

**Methodology:** `automated`

The maximum `period/endDate` amongst the projects in the dataset.

In [None]:
# @title #### Output

get_metric_output(run_id, 'metrics_latest_end_date')

### Additional field count

**Description:**

A count of non-OC4IDS fields in the dataset.


**Methodology:** `automated`

Use libcoveoc4ids to generate a count of additional fields.


In [None]:
# @title #### Output

get_metric_output(run_id, 'metrics_additional_field_count')

### Project count

**Description:**

A count of projects in the dataset.

**Methodology:** `automated`

Count the projects in the dataset.

In [None]:
# @title #### Output

get_metric_output(run_id, 'metrics_project_count')

### Validation error count

**Description:**

A count of the validation errors reported by the OC4IDS Data Review Tool.

**Methodology:** `automated`

Count the types of validation error reported by libcoveoc4ids, not the number of occurrences of each error type.
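The difference between counting error types and counting occurrences can be illustrated as follows. The error records and their `type` key are hypothetical and do not reflect the actual structure of libcoveoc4ids output.

```python
from collections import Counter


def count_error_types(errors):
    """Return (number of distinct error types, total number of occurrences)."""
    occurrences = Counter(error["type"] for error in errors)
    return len(occurrences), sum(occurrences.values())


# Hypothetical error records, for illustration only.
errors = [
    {"type": "required", "path": "projects/0/id"},
    {"type": "required", "path": "projects/1/id"},
    {"type": "format", "path": "projects/0/updated"},
]
print(count_error_types(errors))
```

The metric corresponds to the first value in the tuple.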


In [None]:
# @title #### Output

get_metric_output(run_id, 'metrics_validation_error_count')

### Structure warning count

**Description:**

A count of the structure warnings reported by the OC4IDS Data Review Tool.

**Methodology:** `automated`

Count the distinct structure warnings reported by libcoveoc4ids, not the number of occurrences of each structure warning.

In [None]:
# @title #### Output

get_metric_output(run_id, 'metrics_structure_warning_count')

## Appendix 1: Report Setup

In [None]:
# @title ### Install requirements
# @markdown After running this cell, you must restart the session (Ctrl+M .)
!pip install --upgrade ipython-sql > pip.log
!pip install --upgrade "pandas>=2.2"

In [None]:
# @title ### Connect to the database
# @markdown ODS users: enter the password for the `readonly` user, from the ODS password database.
import getpass

print('Enter your credentials')
user = 'readonly'
password = getpass.getpass('Password:')

connection_string = 'postgresql://' + user + ':' + password + '@oc4ids-database-2.cuujgua4wses.us-east-1.rds.amazonaws.com/postgres'
%load_ext sql
%sql $connection_string
%config SqlMagic.autopandas = True  # Return Pandas DataFrames instead of regular result sets
%config SqlMagic.displaycon = False  # Don't show connection string after execute
%config SqlMagic.feedback = False  # Don't print number of rows affected by DML


In [None]:
# @title ### Choose a `run_id` to report on
from ipywidgets import interact

def set_run_id(id):
  global run_id
  run_id = id

  global source_ids
  source_ids = %sql select source_id from run_collection join collection on run_collection.collection_id = collection.id where run_id = :run_id order by source_id asc;
  source_ids = source_ids['source_id']

run_ids = %sql select distinct run_id from check_results order by run_id desc;

interact(set_run_id, id=run_ids['run_id']);

In [None]:
# @title ### Set up notebook environment

# https://colab.research.google.com/notebooks/data_table.ipynb
%load_ext google.colab.data_table
from google.colab.data_table import DataTable
DataTable.max_columns = 50 # Increase max columns so that dataframes with many columns are rendered as data tables
DataTable.include_index = False # Remove the index from data tables for easier copy-pasting to Google Docs
DataTable.num_rows_per_page = 10

import functools
import ipywidgets
import pandas as pd

from IPython.display import display

manual_checks = {}

In [None]:
# @title ### Define functions

def get_results(run_id = run_id, check_id = None, extra_results = None):

  query = f"""

  select
    case
      when check_id = 'criteria_registered' then 'Criteria: Registered'
      when check_id = 'criteria_discoverable' then 'Criteria: Discoverable'
      when check_id = 'criteria_retrievable' then 'Criteria: Retrievable'
      when check_id = 'criteria_reviewable' then 'Criteria: Reviewable'
      when check_id = 'criteria_appropriate' then 'Criteria: Appropriate'
      when check_id = 'criteria_active' then 'Criteria: Active'
      when check_id = 'criteria_documented' then 'Criteria: Documented'
      when check_id = 'criteria_accessible' then 'Criteria: Accessible'
      when check_id = 'criteria_valid' then 'Criteria: Valid'
      when check_id = 'criteria_conformant' then 'Criteria: Conformant'
      when check_id = 'semantics_sector_codelist' then 'Check: Sectors are standardised'
      when check_id = 'semantics_public_authority_names' then 'Check: Public authority names are realistic'
      when check_id = 'semantics_supplier_names' then 'Check: Supplier names are realistic'
      when check_id = 'semantics_budgets' then 'Check: Project budgets are realistic'
      when check_id = 'semantics_contract_values' then 'Check: Contract values are realistic'
      when check_id = 'semantics_funder_names' then 'Check: Funder names are realistic'
      when check_id = 'semantics_dates' then 'Check: Dates are realistic'
      when check_id = 'semantics_role_coherence' then 'Check: Roles are set'
      when check_id = 'semantics_coordinates' then 'Check: Coordinates are valid'
      else check_id
    end as check,
    source_id,
    result
  from
    check_results
  join collection on
    collection_id = collection.id
  where
    run_id = '{run_id}'
    and (left(check_id, 8) = 'criteria'
      or left(check_id, 9) = 'semantics')
    {f"and check_id = '{check_id}'" if check_id else ""}
  order by
    array_position(array[
    'criteria_registered',
    'criteria_discoverable',
    'criteria_retrievable',
    'criteria_reviewable',
    'criteria_appropriate',
    'criteria_active',
    'criteria_documented',
    'criteria_accessible',
    'criteria_valid',
    'criteria_conformant',
    'semantics_sector_codelist',
    'semantics_public_authority_names',
    'semantics_supplier_names',
    'semantics_budgets',
    'semantics_contract_values',
    'semantics_funder_names',
    'semantics_dates',
    'semantics_role_coherence',
    'semantics_coordinates'],
    check_id) asc,
    source_id asc;

  """

  results = %sql {query}

  if extra_results is not None:
    for check, source in extra_results.items():
      for source_id, result in source.items():
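        # Use the public pd.concat API rather than the private DataFrame._append method
        results = pd.concat([results, pd.DataFrame([{'check': check, 'source_id': source_id, 'result': result}])], ignore_index=True)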

  results = results.pivot(index=['check'], columns='source_id', values='result')

  styler = results.style

  return styler.map(lambda x: 'background-color:rgba(0, 255, 0, 0.25);' if x == True else ('background-color:rgba(255, 0, 0, 0.25);' if x == False else 'background-color:rgba(100, 100, 100, 0.25);'))

def get_output(run_id, check_id):

  query = f"""

  select
    source_id,
    key as project_id,
    value as output
  from
    check_results
  cross join
    jsonb_each(output)
  join collection on
    collection_id = collection.id
  where
    run_id = '{run_id}'
  and
    check_id = '{check_id}'
  order by
    check_id, source_id;

  """

  output = %sql {query}

  return output

def get_metric_output(run_id, check_id):

  query = f"""

  select
    source_id,
    coalesce(output->'count', output->'date') as count
  from check_results
  join collection on
    collection_id = collection.id
  where
    run_id = '{run_id}'
  and
    check_id = '{check_id}'
  order by
    check_id, source_id;

  """

  output = %sql {query}

  return output

def save_results(b, check_id, widgets):
  global manual_checks

  results = {source_id: widget.value for source_id, widget in widgets.items()}

  manual_checks[check_id] = results

def display_result_widgets(check_id):
  global source_ids

  widgets = {}

  description_length = max([len(source_id) for source_id in source_ids])

  for source_id in source_ids:

    widgets[source_id] = ipywidgets.Dropdown(
      options=[True, False, None],
      value=None,
      description=f'{source_id}:',
      disabled=False,
      layout={'width': '35em'},
      style={'description_width': f'{description_length}em'}
    )

  button = ipywidgets.Button(description="Save")

  for widget in widgets.values():
    display(widget)

  display(button)

  button.on_click(functools.partial(save_results, check_id = check_id, widgets = widgets))