## Setup Data Registry

### Install Cardinal

This notebook uses [Cardinal](https://cardinal.readthedocs.io/en/latest/), a Rust package to calculate red flags with OCDS data.

In [None]:
! curl -sSOL https://github.com/open-contracting/cardinal-rs/releases/download/0.0.5/ocdscardinal-0.0.5-linux-64-bit.zip
! unzip -oj ocdscardinal-0.0.5-linux-64-bit.zip ocdscardinal-0.0.5-linux-64-bit/ocdscardinal
! ls

### Download the data from the Data Registry

To select the data source, go to the [Data Registry](https://data.open-contracting.org/) and select the desired publisher. For that publisher, copy the URL of a **JSON file** and paste it as input below.

**In the registry, you will also find a description of the data source and direct links to the publisher website where you can find more information about the scope of the publication.**

<img src="https://drive.google.com/uc?id=10dlm8c55pN89YTGEyZgvsLDc8fFMLNf0"  width="200" height="300">

In [None]:
url = input('URL of JSON file:')

In [None]:
! curl -sSOJ "$url"

In the files tab on the left-hand side of the notebook, look for the file ending in `.jsonl.gz` that you downloaded (e.g., `chile_compra_api_releases_full.jsonl.gz`), and enter its name in the prompt below:

<img src="https://drive.google.com/uc?id=19z86Nj5OY7Y8REfcd2sZbFPXDTAWZYS6" width="200" height="200">



In [None]:
file_gz = input('Name of .jsonl.gz file:')

In [None]:
# Name of the decompressed file: the downloaded file name without '.gz'
file_jsonl = file_gz.replace('.gz', '')

In [None]:
! gunzip -f "$file_gz"

In [None]:
! ls -lh "$file_jsonl"
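
Before running Cardinal, it can help to confirm that the decompressed file really is line-delimited JSON. The following is a minimal sketch and not part of the original notebook: the helper name `peek_first_record` is hypothetical, and it assumes that each non-empty line of the file is a single JSON object.

```python
import json


def peek_first_record(path):
    """Return the parsed first non-empty line of a JSON Lines file, or None if the file has none."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                # Raises json.JSONDecodeError if the line is not valid JSON
                return json.loads(line)
    return None
```

For example, `sorted(peek_first_record(file_jsonl))` would list the top-level keys of the first record, which gives a quick sanity check without loading the whole file into memory.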

### Calculate the field list

Use Cardinal's [coverage command](https://cardinal.readthedocs.io/en/latest/cli/coverage.html) to extract the OCDS data fields published in the dataset. The results are saved to `result_fields.json` and then loaded into a dataframe.

In [None]:
! ./ocdscardinal coverage "$file_jsonl" > result_fields.json

The table below lists the published fields and, for each one, the count that Cardinal reports across the [OCDS releases](https://standard.open-contracting.org/latest/en/schema/reference/) in the dataset.

In [None]:
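# Load Cardinal's output (path -> count) into a two-column dataframe
fields = pd.DataFrame(pd.read_json('result_fields.json', typ='series'), columns=['releases']).rename_axis('path').reset_index()
# Keep only paths that end in a lowercase letter, which leaves only object members
fields_table = fields[fields.path.str.contains('[a-z]$')].copy()
# Strip square brackets and the leading slash from each path
fields_table['path'] = fields_table['path'].str.replace(r'[][]|^/', '', regex=True)
fields_table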
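
To see what the filtering and cleaning steps above do, here is a small self-contained sketch that applies the same two regular expressions to a made-up sample. The paths and counts in `sample` are hypothetical: they only imitate the general shape of a path-to-count mapping and are not taken from any real dataset or from Cardinal's documentation.

```python
import pandas as pd

# Hypothetical path -> count mapping; keys and values are invented for illustration
sample = {
    "": 2,
    "/": 2,
    "/ocid": 2,
    "/tender": 2,
    "/tender/": 2,
    "/tender/items": 1,
    "/tender/items[]": 3,
    "/tender/items[]/": 3,
    "/tender/items[]/id": 3,
}

demo = pd.Series(sample, name="releases").rename_axis("path").reset_index()
# Keep only paths ending in a lowercase letter; this drops "", "/", and entries ending in "]" or "/"
demo = demo[demo["path"].str.contains(r"[a-z]$")].copy()
# Remove the leading slash and any square brackets
demo["path"] = demo["path"].str.replace(r"[][]|^/", "", regex=True)
print(demo["path"].tolist())
# ['ocid', 'tender', 'tender/items', 'tender/items/id']
```

The `.copy()` call makes `demo` an independent dataframe, so assigning to its `path` column does not trigger pandas' `SettingWithCopyWarning`.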


In [None]:
save_dataframe_to_sheet(fields_table, 'fields')