
# Validate a data contract against a Databricks dataset

We use [datacontract-cli](https://github.com/datacontract/datacontract-cli).

The datacontract CLI is an open-source command-line tool for working with data contracts. It reads data contract YAML files written in the [Data Contract Specification](https://datacontract.com/) or [ODCS](https://bitol-io.github.io/open-data-contract-standard/latest/) format and uses them to lint the data contract, connect to data sources and execute schema and quality tests, detect breaking changes, and export to different formats. The tool is written in Python. It can be used as a standalone CLI tool, in a CI/CD pipeline, or directly as a Python library.

## Install libraries

You can ignore this warning: `Core Python package version(s) changed`

In [0]:
%pip install 'datacontract-cli[databricks]==0.10.21' databricks-sql-connector==3.7.2

In [0]:
dbutils.library.restartPython()

## Make Databricks token available

In [0]:
import os
api_token = dbutils.notebook.entry_point.getDbutils().notebook().getContext().apiToken().get()
os.environ["DATACONTRACT_DATABRICKS_SERVER_HOSTNAME"] = "dbc-639f4875-165d.cloud.databricks.com"
os.environ["DATACONTRACT_DATABRICKS_HTTP_PATH"] = "/sql/1.0/warehouses/aee0a674651b7e21"
os.environ["DATACONTRACT_DATABRICKS_TOKEN"] = api_token
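
Before running any test, it can help to confirm that all three connection settings are actually present. The following is a minimal sketch: the variable names are the ones set in the cell above, while `REQUIRED_VARS` and `missing_connection_vars` are hypothetical names introduced only for this illustration and are not part of datacontract-cli.

```python
import os

# Variable names taken from the cell above; the helper itself is hypothetical.
REQUIRED_VARS = (
    "DATACONTRACT_DATABRICKS_SERVER_HOSTNAME",
    "DATACONTRACT_DATABRICKS_HTTP_PATH",
    "DATACONTRACT_DATABRICKS_TOKEN",
)


def missing_connection_vars(env=None):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


missing = missing_connection_vars()
if missing:
    print("Missing connection settings: " + ", ".join(missing))
```

Passing a plain dictionary instead of `os.environ` makes the check easy to try out without touching the real environment.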


## Test the data contract with Python

This could be done at the end of a notebook, in a GitHub Action, or in another CI/CD context.

If the test fails, use `run.pretty()` to get output about what failed.

In [0]:
from datacontract.data_contract import DataContract

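# Load the contract from the contracts subfolder
data_contract = DataContract(data_contract_file="./contracts/revenue_per_inhabitant.yaml")

# Run the contract's tests against the configured data source
run = data_contract.test()

if not run.has_passed():
    # Print the details of the run to see which checks failed
    print(run.pretty())
    print("ERROR: Data quality validation failed.")
else:
    print("Data quality validation successful.")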
run.finish()
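
Printing an error does not stop a notebook or a CI job on its own. One possible way to make a failed validation halt execution is to raise an exception, sketched below. `DataContractTestFailed` and `raise_if_failed` are hypothetical names and not part of datacontract-cli; the sketch relies only on the `has_passed()` and `pretty()` methods used in the cell above.

```python
class DataContractTestFailed(Exception):
    """Hypothetical exception type, not part of datacontract-cli."""


def raise_if_failed(run):
    """Raise when the run did not pass, so the notebook or CI job stops.

    `run` is any object offering has_passed() and pretty(), the two
    methods used in the cell above.
    """
    if not run.has_passed():
        raise DataContractTestFailed(run.pretty())
    return run
```

In the notebook this could be called as `raise_if_failed(data_contract.test())`.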

## Task: Fix failing quality check

Hint: `population_share` stores percentages on a 0-100 scale, not 0.0-1.0.
The contract can be found in the `contracts` subfolder.
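
To illustrate the hint: a value stored on a 0-100 scale passes a 0-100 range check but violates a 0.0-1.0 one. The sketch below uses made-up sample values and a hypothetical helper, `out_of_range`; it only demonstrates the scale mismatch and is not how datacontract-cli evaluates quality checks.

```python
def out_of_range(values, lower, upper):
    """Return the values that fall outside the closed interval [lower, upper]."""
    return [v for v in values if not (lower <= v <= upper)]


# Made-up percentages on a 0-100 scale, for illustration only
sample_shares = [16.7, 31.1, 5.6]

# Every value violates a 0.0-1.0 bound
print(out_of_range(sample_shares, 0.0, 1.0))

# No value violates a 0-100 bound
print(out_of_range(sample_shares, 0.0, 100.0))
```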


## Test the data contract from command line

This could be useful in a GitHub Action or another CI/CD context.

In [0]:
%sh
datacontract test ./contracts/revenue_per_inhabitant.yaml
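
When the pipeline is driven by a Python script rather than a shell step, the same command can be launched with `subprocess`. This is a sketch under one assumption that should be verified for the installed version: that the CLI exits with a non-zero code when a test fails. `build_test_command` and `run_contract_test` are hypothetical helper names.

```python
import subprocess


def build_test_command(contract_path):
    """Build the same command line as the %sh cell above."""
    return ["datacontract", "test", contract_path]


def run_contract_test(contract_path):
    """Run the CLI and return its exit code.

    Assumption: a non-zero exit code signals a failed test.
    """
    completed = subprocess.run(build_test_command(contract_path), check=False)
    return completed.returncode
```

A CI script could then end with `sys.exit(run_contract_test("./contracts/revenue_per_inhabitant.yaml"))` to pass the result on to the pipeline.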

## Generate SQL table definition with data contract

This could be useful when the data contract is used to generate the base schema.

In [0]:
%sh
datacontract export --format sql ./contracts/revenue_per_inhabitant.yaml


Compare this to the original definition created by running the Spark notebook:

```sql
CREATE TABLE acme_transport_taxinyc.dev_paal_main_de1f9ba0_revenue.revenue_per_inhabitant (
  pickup_borough STRING,
  amount DOUBLE,
  borough STRING,
  population INT,
  population_share FLOAT,
  revenue_per_inhabitant DOUBLE)
USING delta
TBLPROPERTIES (
  'delta.enableDeletionVectors' = 'true',
  'delta.feature.deletionVectors' = 'supported',
  'delta.minReaderVersion' = '3',
  'delta.minWriterVersion' = '7')
```
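
The comparison can also be done programmatically. The sketch below parses the column list of the original definition shown above and reports type differences between two column mappings. `parse_columns` and `diff_columns` are hypothetical helpers, and the parser is deliberately simplified: it assumes one column per line and no parameterized types such as `DECIMAL(10,2)`. Whether the exported SQL follows the same layout is an assumption to check before applying the parser to it.

```python
import re

# Column list copied from the original definition above
ORIGINAL_DDL = """CREATE TABLE acme_transport_taxinyc.dev_paal_main_de1f9ba0_revenue.revenue_per_inhabitant (
  pickup_borough STRING,
  amount DOUBLE,
  borough STRING,
  population INT,
  population_share FLOAT,
  revenue_per_inhabitant DOUBLE)
USING delta
"""

# A column line: optional indentation, a name, whitespace, then a type word
COLUMN_LINE = re.compile(r"^\s*(\w+)\s+([A-Za-z]+)")


def parse_columns(ddl):
    """Map column name to type for a CREATE TABLE with one column per line."""
    # Keep only the text between the first "(" and the first ")"
    body = ddl[ddl.index("(") + 1 : ddl.index(")")]
    columns = {}
    for line in body.splitlines():
        match = COLUMN_LINE.match(line)
        if match:
            columns[match.group(1)] = match.group(2).upper()
    return columns


def diff_columns(expected, actual):
    """Return {name: (expected_type, actual_type)} for columns that differ or are missing."""
    names = sorted(expected.keys() | actual.keys())
    return {
        name: (expected.get(name), actual.get(name))
        for name in names
        if expected.get(name) != actual.get(name)
    }


original_columns = parse_columns(ORIGINAL_DDL)
print(original_columns)
```

Calling `diff_columns(original_columns, parse_columns(exported_sql))`, where `exported_sql` holds the output of the export step, would list any columns whose types disagree.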