# OC4IDS CoST IDS Coverage

Use this notebook to calculate the coverage of the CoST Infrastructure Data Standard (IDS), according to the [CoST IDS to OC4IDS mapping](https://standard.open-contracting.org/infrastructure/latest/en/cost/#cost-ids-to-oc4ids-mapping).

The notebook generates a CSV file with 3 columns:

* The CoST IDS element
* The coverage of the element
* The coverage of the individual OC4IDS fields that map to the element

| CoST IDS element                 | Coverage          | Field Coverage                                     |
| -------------------------------- | ----------------- | -------------------------------------------------- |
| Procuring entity contact details | 0.833333333333333 | parties/address: 83%<br>parties/contactPoint: 100% |


Coverage is measured as the proportion of projects that include the required fields for an element. There may be some false positives, because more granular levels of coverage are not measured: for example, how many individual contracting processes within a project include the required fields. See [this issue](https://github.com/open-contracting/kingfisher-views/issues/29#issuecomment-680326386) for a detailed explanation.
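
To make the measurement concrete, here is a minimal sketch in plain Python. It is not the notebook's implementation (which runs in SQL); the sample data and the function names `element_coverage` and `field_coverage` are assumptions made for illustration. A project counts towards an element only if every required path appears somewhere in it.

```python
# Hypothetical sample: the set of field paths present in each project.
project_paths = {
    "project-1": {"parties/address", "parties/contactPoint"},
    "project-2": {"parties/contactPoint"},
    "project-3": {"parties/address", "parties/contactPoint"},
}


def element_coverage(required_fields, project_paths):
    """Share of projects whose paths include every required field."""
    if not project_paths:
        return 0.0
    required = set(required_fields)
    successes = sum(1 for paths in project_paths.values() if required <= paths)
    return successes / len(project_paths)


def field_coverage(field, project_paths):
    """Share of projects that include a single field at least once."""
    if not project_paths:
        return 0.0
    hits = sum(1 for paths in project_paths.values() if field in paths)
    return hits / len(project_paths)


required = ["parties/address", "parties/contactPoint"]
print(element_coverage(required, project_paths))
print(field_coverage("parties/contactPoint", project_paths))
```

Because the check only asks whether a path occurs anywhere in a project, a project with several contracting processes is counted as covered even if only one of them publishes the field.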

## Setup

Enter your OC4IDS database credentials.

In [None]:
import getpass

print('Enter your credentials')
user = 'readonly'
password = getpass.getpass('Password:')
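
If the password may contain characters such as `@` or `/`, concatenating it into a connection URL unescaped can make the URL ambiguous. The sketch below shows one way to percent-encode the credentials first. The helper name `build_connection_string` and the host `db.example.org` are hypothetical, and it assumes the connection string is parsed as a standard URL.

```python
from urllib.parse import quote


def build_connection_string(user, password, host, database):
    """Build a PostgreSQL URL, percent-encoding the user name and password."""
    # safe='' also encodes '/' so it cannot be mistaken for a path separator.
    return (
        f"postgresql://{quote(user, safe='')}:{quote(password, safe='')}"
        f"@{host}/{database}"
    )


# Hypothetical values, for illustration only.
example = build_connection_string("readonly", "p@ss/word", "db.example.org", "postgres")
print(example)
```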

Set up the notebook environment:

In [None]:
# https://ocdskit.readthedocs.io/
!pip install ocdskit

connection_string = 'postgresql://' + user + ':' + password + '@oc4ids-database-2.cuujgua4wses.us-east-1.rds.amazonaws.com/postgres'

# https://pypi.org/project/ipython-sql/
!pip install --upgrade ipython-sql > pip.log
%load_ext sql
%sql $connection_string
%config SqlMagic.autopandas = True  # Return Pandas DataFrames instead of regular result sets
%config SqlMagic.displaycon = False  # Don't show connection string after execute
%config SqlMagic.feedback = False  # Don't print number of rows affected by DML
%config SqlMagic.style = '_DEPRECATED_DEFAULT'

# https://colab.research.google.com/notebooks/data_table.ipynb
%load_ext google.colab.data_table

import csv
import matplotlib.pyplot as plt
import pandas as pd
import requests
import seaborn as sns

from google.colab import files

def get_gsheet(doc_id, sheet_id):

  url = f'https://docs.google.com/spreadsheets/d/{doc_id}/export?format=csv&gid={sheet_id}'
  response = requests.get(url)
  content = response.content.decode('utf-8').splitlines(keepends=True)

  return csv.DictReader(content, quotechar='"')

## Choose a collection to query

Set the `collection_id` to query:

In [None]:
collection_id = 50

If you don't know which collection you need, run the next cell and use the **Filter** button to narrow down the table of collections. You can filter on the `source_id` column, which holds the name of the data source entered when the data was imported. Then set `collection_id` in the previous cell to the value from the `collection_id` column.

In [None]:
%%sql

select
  collection.id as collection_id,
  source_id,
  data_version,
  count(*) as project_count
from
  collection
join
  projects on collection.id = projects.collection_id
group by
  collection.id,
  source_id,
  data_version
order by
  collection.id desc;
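
As an alternative to filtering by hand, the lookup can be sketched in plain Python. The rows below are hypothetical and only mirror the columns returned by the query; the helper `latest_collection_id` is not part of the notebook, and it assumes that a higher `collection_id` means a more recent import.

```python
# Hypothetical rows in the shape returned by the query above.
rows = [
    {"collection_id": 50, "source_id": "example_source", "data_version": "2021-03-01", "project_count": 120},
    {"collection_id": 42, "source_id": "example_source", "data_version": "2020-11-15", "project_count": 95},
    {"collection_id": 47, "source_id": "other_source", "data_version": "2021-01-10", "project_count": 30},
]


def latest_collection_id(rows, source_id):
    """Return the highest collection_id for a source, or None if there is none."""
    matches = [row["collection_id"] for row in rows if row["source_id"] == source_id]
    return max(matches) if matches else None


print(latest_collection_id(rows, "example_source"))
```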

## Calculate IDS coverage

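Fetch the CoST IDS mapping files and, for each element with mapped OC4IDS fields, calculate the coverage of the element and of each individual field: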

In [None]:
import csv
import requests

def get_csv(url):

  response = requests.get(url)
  response.raise_for_status()  # Fail early instead of parsing an error page as CSV
  content = response.content.decode('utf-8').splitlines(keepends=True)

  return csv.DictReader(content, quotechar='"')

# Define a query to check the coverage of a group of fields
element_query_template = """

    SELECT
      SUM(CASE WHEN ARRAY{fields} <@ paths THEN 1 ELSE 0 END) as successes,
      count(*) as checks,
      SUM(CASE WHEN ARRAY{fields} <@ paths THEN 1 ELSE 0 END) / count(*)::numeric as percentage
    FROM
      project_fields
    WHERE
      collection_id = :collection_id

"""

# Define a query to check the coverage of a single field
field_query_template = """

    SELECT
      path,
      distinct_projects
    FROM
      field_counts
    WHERE
      collection_id = :collection_id
    AND
      path IN {fields}

"""

base_url = 'https://standard.open-contracting.org/infrastructure/latest/en/_static/project-level/'

csv_files = [
  'process-level-implementation.csv',
  'process-level-procurement.csv',
  'project-level-completion.csv',
  'project-level-identification.csv',
  'project-level-preparation.csv',
  'reactive-process-level-contract.csv',
  'reactive-process-level-implementation.csv',
  'reactive-process-level-procurement.csv',
  'reactive-project-level-completion.csv',
  'reactive-project-level-identification-preparation.csv']

elements = []

for url in [f'{base_url}{filename}' for filename in csv_files]:
  mapping_reader = get_csv(url)

  for element in mapping_reader:
    if len(element['OC4IDS Fields']) > 0:

      # Strip whitespace so that a mapping such as "a, b" still matches the stored paths
      element['fields'] = {field.strip(): {'coverage': 0} for field in element['OC4IDS Fields'].split(',')}

      # Calculate coverage for the element
      element_query = element_query_template.format(fields = list(element['fields'].keys()))
      element_coverage = %sql {element_query}

      element['successes'] = element_coverage['successes'][0]
      element['checks'] = element_coverage['checks'][0]
      element['coverage'] = element_coverage['percentage'][0]

      # Calculate coverage for each field in the element
      fields = [f"'{field}'" for field in element["fields"].keys()]
      field_query = field_query_template.format(fields = f'({",".join(fields)})')
      field_coverage = %sql {field_query}

      if len(field_coverage) > 0:
        for field, coverage in field_coverage.set_index('path').to_dict('index').items():
          element['fields'][field]['coverage'] = coverage['distinct_projects'] / element['checks']
      else:
        for field in element["fields"].keys():
          element['fields'][field]['coverage'] = 0

      elements.append(element)
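
The string handling in the cell above can be tried without a database. The sketch below uses a hypothetical two-row mapping (only the column names `CoST IDS element` and `OC4IDS Fields` come from the notebook) and shows the SQL fragments that the two templates receive: a Python list's `repr` happens to match PostgreSQL's `ARRAY[...]` literal syntax, while the `IN` clause needs each path quoted and wrapped in parentheses.

```python
import csv

# Hypothetical mapping content, standing in for one downloaded CSV file.
mapping_csv = (
    'CoST IDS element,OC4IDS Fields\n'
    'Procuring entity contact details,"parties/address,parties/contactPoint"\n'
    'Element without a mapping,\n'
)

# Keep only the elements that map to at least one OC4IDS field.
reader = csv.DictReader(mapping_csv.splitlines(keepends=True))
rows = [row for row in reader if row['OC4IDS Fields']]
fields = rows[0]['OC4IDS Fields'].split(',')

# Fragment used by the element query: all fields must be contained in `paths`.
array_clause = 'ARRAY{fields} <@ paths'.format(fields=fields)

# Fragment used by the field query: one row per matching path.
in_clause = 'path IN ({})'.format(','.join(f"'{field}'" for field in fields))

print(array_clause)
print(in_clause)
```

This approach relies on the paths containing no quote characters, since they are interpolated into the SQL text rather than passed as bound parameters.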


## Generate report

In [None]:
from google.colab import files

filename = f'{collection_id}_coverage.csv'

# newline='' lets the csv module control line endings, as its documentation recommends
with open(filename, 'w', newline='') as f:

  fieldnames = ['CoST IDS element', 'Coverage', 'Field Coverage']

  writer = csv.DictWriter(f, fieldnames = fieldnames)
  writer.writeheader()

  for element in elements:
    row = {'CoST IDS element': element['CoST IDS element'],
           'Coverage': element['coverage'],
           'Field Coverage': '\n'.join(f'{path}: {details["coverage"]:.0%}' for path, details in element['fields'].items())}
    writer.writerow(row)

files.download(filename)
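
The following sketch shows what a single report row looks like, using the example values from the table at the top of this notebook and an in-memory buffer instead of a file. It is an illustration under those assumptions rather than part of the notebook. It also shows that the multi-line `Field Coverage` cell survives a round trip through the `csv` module.

```python
import csv
import io

# Example element, using the values from the sample table in the introduction.
element = {
    'CoST IDS element': 'Procuring entity contact details',
    'coverage': 5 / 6,
    'fields': {
        'parties/address': {'coverage': 5 / 6},
        'parties/contactPoint': {'coverage': 1.0},
    },
}

# One "path: percentage" line per OC4IDS field.
field_coverage = '\n'.join(
    f'{path}: {details["coverage"]:.0%}' for path, details in element['fields'].items()
)

buffer = io.StringIO(newline='')
writer = csv.DictWriter(buffer, fieldnames=['CoST IDS element', 'Coverage', 'Field Coverage'])
writer.writeheader()
writer.writerow({
    'CoST IDS element': element['CoST IDS element'],
    'Coverage': element['coverage'],
    'Field Coverage': field_coverage,
})

# Read the report back: the quoted cell keeps its embedded line break.
buffer.seek(0)
rows = list(csv.DictReader(buffer))
print(rows[0]['Field Coverage'])
```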