<a href="https://colab.research.google.com/github/open-contracting/notebooks-oc4ids/blob/indicator_coverage/OC4IDS_Indicator_Coverage.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# OC4IDS Indicator Coverage

Use this notebook to calculate the coverage of the indicators defined in the [OC4IDS Indicators spreadsheet](https://docs.google.com/spreadsheets/d/1Vo6-Jis-J61PB_33QQx1YKnmQnI7M4rTq6CTMckAFsE/edit#gid=882740051).

The notebook generates:

* A CSV file containing a copy of the [Indicators sheet](https://docs.google.com/spreadsheets/d/1Vo6-Jis-J61PB_33QQx1YKnmQnI7M4rTq6CTMckAFsE/edit#gid=2023307133), annotated with:
  * The coverage for the indicator
  * The coverage for each field needed to calculate the indicator
* Charts showing:
  * How many indicators are calculable for at least one project, grouped by use case
  * How many indicators are calculable for less than 33%, 33-66% and greater than 66% of projects

Coverage is measured at the project level: a project counts towards an indicator's coverage if it includes all the fields required to calculate the indicator. More granular levels are not measured, for example how many individual contracting processes within a project include the required fields, so the results may include false positives. See [this issue](https://github.com/open-contracting/kingfisher-views/issues/29#issuecomment-680326386) for a detailed explanation.
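
As a rough illustration of the project-level measure, and of how it can overstate coverage, the sketch below uses made-up projects and field paths (all names and data here are hypothetical and are not taken from the database). A project counts as a success when every required path appears somewhere in it, even if only one of its contracting processes carries the field.

```python
# Hypothetical required paths for one indicator (illustration only).
required = {'budget/amount/amount', 'contractingProcesses/summary/contractValue/amount'}

# Hypothetical projects: project-level paths plus the paths found in each contracting process.
projects = {
    'project-1': {
        'project_paths': {'budget/amount/amount'},
        'processes': {
            'cp-1': {'contractingProcesses/summary/contractValue/amount'},
            'cp-2': set(),  # this process lacks the contract value
        },
    },
    'project-2': {
        'project_paths': set(),
        'processes': {'cp-1': {'contractingProcesses/summary/contractValue/amount'}},
    },
}

def project_paths(project):
    """Return every path that appears anywhere in the project."""
    paths = set(project['project_paths'])
    for process_paths in project['processes'].values():
        paths |= process_paths
    return paths

# A project succeeds when the required paths are a subset of its paths.
successes = sum(required <= project_paths(p) for p in projects.values())
coverage = successes / len(projects)
print(f'{successes} of {len(projects)} projects ({coverage:.0%})')
```

Here `project-1` counts as covered even though `cp-2` has no contract value, which is the kind of false positive described above.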

## Setup

Enter your OC4IDS database credentials. OCDS Helpdesk analysts and OCP staff can find credentials in [CRM-6335](https://crm.open-contracting.org/issues/6335).

In [None]:
import getpass

print('Enter your credentials')
user = input('Username:')
password = getpass.getpass('Password:')

Set up the notebook environment:

In [None]:
# https://ocdskit.readthedocs.io/
!pip install ocdskit

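# URL-encode the credentials so that special characters don't break the connection string
from urllib.parse import quote_plus
connection_string = 'postgresql://' + quote_plus(user) + ':' + quote_plus(password) + '@oc4ids-database-1.cuujgua4wses.us-east-1.rds.amazonaws.com/postgres'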

# https://pypi.org/project/ipython-sql/
!pip install --upgrade ipython-sql > pip.log
%load_ext sql
%sql $connection_string
%config SqlMagic.autopandas = True  # Return Pandas DataFrames instead of regular result sets
%config SqlMagic.displaycon = False  # Don't show connection string after execute
%config SqlMagic.feedback = False  # Don't print number of rows affected by DML

# https://colab.research.google.com/notebooks/data_table.ipynb
%load_ext google.colab.data_table

import csv
import matplotlib.pyplot as plt
import pandas as pd
import requests
import seaborn as sns

from google.colab import files

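def get_gsheet(doc_id, sheet_id):
  """Return a csv.DictReader over a Google Sheets tab exported as CSV."""

  url = f'https://docs.google.com/spreadsheets/d/{doc_id}/export?format=csv&gid={sheet_id}'
  response = requests.get(url)
  response.raise_for_status()  # Fail early if the sheet can't be downloaded
  content = response.content.decode('utf-8').splitlines(keepends=True)

  return csv.DictReader(content, quotechar='"')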
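
The coverage calculation later splits each method's `fields` cell on newlines, so the CSV export can contain quoted cells that span several lines. The following is a small offline sketch, using a made-up CSV string rather than the real spreadsheet, of how `csv.DictReader` handles such a cell when it is given lines that keep their line endings:

```python
import csv

# Made-up CSV export with a quoted, multi-line `fields` cell (illustration only).
content = 'id,indicator_id,fields\r\nm1,i1,"budget/amount/amount\nperiod/startDate"\r\n'

# Keeping the line endings lets the reader rebuild the newline inside the quoted cell.
rows = list(csv.DictReader(content.splitlines(keepends=True)))

fields = rows[0]['fields'].split('\n')
print(fields)
```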

## Choose a collection to query

Set the `collection_id` to query:

In [None]:
collection_id = 41

If you don't know which collection you need, run the next cell and use the **Filter** button to search the resulting table. You can filter on the `source_id` column, which holds the name of the data source entered when the data was imported. Then set `collection_id` in the previous cell to the value from the `collection_id` column.

In [None]:
%%sql

select
  collection.id as collection_id,
  source_id,
  data_version,
  count(*) as project_count
from
  collection
join
  projects on collection.id = projects.collection_id
group by
  collection.id,
  source_id,
  data_version
order by
  collection.id desc;

## Calculate coverage

Download the OC4IDS schema and generate a [mapping sheet](https://ocdskit.readthedocs.io/en/latest/cli/schema.html#mapping-sheet):

In [None]:
%%shell

curl 'https://standard.open-contracting.org/infrastructure/latest/en/project-schema.json' > project-schema.json
ocdskit mapping-sheet project-schema.json > project-schema.csv

Calculate coverage:

In [None]:
with open('project-schema.csv', 'r') as f:

  oc4ids_fields = [field['path'] for field in csv.DictReader(f)]

indicator_reader = get_gsheet('1Vo6-Jis-J61PB_33QQx1YKnmQnI7M4rTq6CTMckAFsE', '2023307133')
indicators = {indicator['id']: indicator for indicator in indicator_reader}

method_reader = get_gsheet('1Vo6-Jis-J61PB_33QQx1YKnmQnI7M4rTq6CTMckAFsE', '835339370')

method_query_template = """

    SELECT
      SUM(CASE WHEN ARRAY{fields} <@ paths THEN 1 ELSE 0 END) as successes,
      count(*) as checks,
      SUM(CASE WHEN ARRAY{fields} <@ paths THEN 1 ELSE 0 END) / count(*)::numeric as percentage
    FROM
      project_fields
    WHERE
      collection_id = {collection_id};

"""

field_query_template = """

    SELECT
      path,
      distinct_projects
    FROM
      field_counts
    WHERE
      collection_id = {collection_id}
    AND
      path IN {fields}

"""

for method in method_reader:

  method['fields'] = {field: {'coverage': 0} for field in method['fields'].split('\n')}
 
  # Check that the field paths for this method are valid
  for field in method['fields'].keys():
    if len(field) > 0:
      assert field in oc4ids_fields, f'Found invalid fields in method {method["id"]}: {field}. ' \
                                    'Update the OC4IDS Indicators spreadsheet to correct the field path.'

  # Calculate coverage for this method
  method_query = method_query_template.format(collection_id = collection_id, fields = list(method['fields'].keys()))
  method_coverage = %sql {method_query}

  method['coverage'] = {'successes': method_coverage['successes'][0],
                        'checks': method_coverage['checks'][0],
                        'percentage': method_coverage['percentage'][0]}

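  # Calculate coverage for each field in this method.
  # Render the IN list explicitly: a one-element Python tuple formats as ('path',), which is invalid SQL.
  field_list = ', '.join(f"'{field}'" for field in method['fields'])
  field_query = field_query_template.format(collection_id = collection_id, fields = f'({field_list})')
  field_coverage = %sql {field_query}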

  for field, coverage in field_coverage.set_index('path').to_dict('index').items():
    method['fields'][field]['coverage'] = coverage['distinct_projects'] / method['coverage']['checks']

  # Add method details to indicators
  if 'methods' not in indicators[method['indicator_id']]:
    indicators[method['indicator_id']]['methods'] = [method]
  else:
    indicators[method['indicator_id']]['methods'].append(method)

for indicator in indicators.values():

  # Calculate indicator coverage, i.e. the coverage of the best method.
  # An indicator with no methods keeps the sentinel value -1.
  indicator['coverage'] = -1

  for method in indicator.get('methods', []):

    if method['coverage']['percentage'] > indicator['coverage']:

      indicator['best_method'] = method['id']
      indicator['coverage'] = method['coverage']['percentage']
      indicator['successes'] = method['coverage']['successes']
      indicator['checks'] = method['coverage']['checks']
      indicator['field_coverage'] = '\n'.join([f'{path}: {details["coverage"]:.0%}' for path, details in method['fields'].items()])
      indicator['missing_fields'] = '\n'.join([path for path, details in method['fields'].items() if details['coverage'] == 0])

# Generate indicator coverage report
with open('coverage.csv', 'w', newline='') as f:

  # Copy the header so that the reader's own fieldnames are not modified
  fieldnames = list(indicator_reader.fieldnames)
  fieldnames.extend(['best_method', 'coverage', 'successes', 'checks', 'field_coverage', 'missing_fields'])

  writer = csv.DictWriter(f, fieldnames = fieldnames)
  writer.writeheader()

  for indicator in indicators.values():
    row = dict(indicator)
    row.pop('methods', None)
    writer.writerow(row)
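
The selection step in the cell above keeps, for each indicator, the method with the highest project-level percentage. The following is a compact sketch of the same idea using `max`, with invented method records (the IDs and numbers are hypothetical). Like the loop's strict `>` comparison, `max` keeps the first method when two methods tie.

```python
# Invented method records for one indicator (illustration only).
methods = [
    {'id': 'method-a', 'coverage': {'successes': 2, 'checks': 10, 'percentage': 0.2}},
    {'id': 'method-b', 'coverage': {'successes': 7, 'checks': 10, 'percentage': 0.7}},
    {'id': 'method-c', 'coverage': {'successes': 7, 'checks': 10, 'percentage': 0.7}},
]

def best_method(candidates):
    """Return the candidate with the highest percentage, or None if there are none."""
    return max(candidates, key=lambda m: m['coverage']['percentage'], default=None)

best = best_method(methods)
best_id = best['id'] if best else None
print(best_id)
```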

Download coverage report:

In [None]:
files.download('coverage.csv')

Generate coverage summary:

In [None]:
coverage_summary = pd.DataFrame.from_dict(indicators, orient='index')
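# Copy the filtered rows so that columns can be added later without a SettingWithCopyWarning
coverage_summary = coverage_summary[coverage_summary['deprecated'] == 'FALSE'].copy()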

Plot how many indicators are calculable for at least one project, grouped by use case:

In [None]:
coverage_summary['calculable'] = coverage_summary['coverage'] > 0
calculable = coverage_summary[['id', 'calculable', 'use_case']].groupby(['use_case','calculable']).count()
calculable.reset_index(inplace=True)

plt.figure(figsize = (15,8))
ax = sns.histplot(calculable, x='use_case', hue='calculable', weights='id', multiple='stack')
plt.xticks(rotation=45)

Plot how many indicators are calculable for less than 33%, 33-66% and greater than 66% of projects:

In [None]:
def coverage_bin(coverage):
  if coverage < 0.33:
    return '<33%'
  elif coverage <= 0.66:
    return '33-66%'
  else:
    return '>66%'

coverage_summary['coverage_bin'] = coverage_summary['coverage'].apply(coverage_bin)

ax = sns.countplot(x='coverage_bin', order=['<33%', '33-66%', '>66%'], data=coverage_summary)
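
As a quick check of the bin boundaries, the sketch below applies the same thresholds to a handful of invented coverage values (not real results) and tallies them. Values of exactly 0.33 and 0.66 both land in the middle bin.

```python
from collections import Counter

def coverage_bin(coverage):
  if coverage < 0.33:
    return '<33%'
  elif coverage <= 0.66:
    return '33-66%'
  else:
    return '>66%'

# Invented coverage values, including both boundary values (illustration only).
sample = [0.0, 0.1, 0.33, 0.5, 0.66, 0.67, 1.0]
counts = Counter(coverage_bin(value) for value in sample)
print(counts)
```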