# VTL Check hierarchy

The original doc can be found at **line 6399 of VTL-2.0-Reference-Manual**

## Syntax

```text
check_hierarchy ( op , hr { condition condComp { , condComp }* } { rule ruleComp }
{ mode } { input } { output } )
  - mode ::= non_null | non_zero | partial_null | partial_zero | always_null | always_zero
  - input ::= dataset | dataset_priority
  - output ::= invalid | all | all_measures

```

- **op**: the Data Set to be checked
- **hr**: the hierarchical Ruleset to be used
- **condComp**: a Component of `op` to be associated (in positional order) with the conditioning Value Domains or Variables defined in `hr` (if any).
- **ruleComp**: the Identifier Component of `op` to be associated with the rule Value Domain or Variable defined in `hr`.
- **mode**: this parameter specifies how to treat the possible missing Data Points corresponding to the Code Items in the left and right sides of the rules and which Data Points are produced in output. The meaning of the possible values of the parameter is explained below.
- **output**: specifies the Data Points and the Measures of the resulting Data Set:
     - **invalid**: the resulting Data Set contains a Data Point for each Data Point of `op` and each Rule in `hr` that evaluates to `FALSE` on that Data Point. The resulting Data Set has the Measures of `op`.
     - **all**: the resulting Data Set contains a Data Point for each Data Point of `op` and each Rule in `hr`. The resulting Data Set has the boolean Measure `bool_var`.
     - **all_measures**: the resulting Data Set contains a Data Point for each Data Point of `op` and each Rule in `hr`. The resulting Data Set has the Measures of `op` and the boolean Measure `bool_var`.
     - If not specified then output is assumed to be invalid. See the Behaviour for further details.
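The optional parameters can be validated before a check is run. The following is a minimal sketch, not part of the VTL specification: the class name `CheckHierarchyOptions` and its field names are hypothetical, and only the `invalid` default for `output` comes from the description above. `mode` and `input_` are left as `None` when they are not specified.

```python
from dataclasses import dataclass
from typing import Optional

# Accepted keywords, copied from the syntax block above
MODES = {"non_null", "non_zero", "partial_null", "partial_zero", "always_null", "always_zero"}
INPUTS = {"dataset", "dataset_priority"}
OUTPUTS = {"invalid", "all", "all_measures"}


@dataclass(frozen=True)
class CheckHierarchyOptions:
    """Hypothetical container for the optional check_hierarchy parameters."""

    mode: Optional[str] = None    # None means "not specified in the call"
    input_: Optional[str] = None  # None means "not specified in the call"
    output: str = "invalid"       # default stated in the parameter list

    def __post_init__(self):
        if self.mode is not None and self.mode not in MODES:
            raise ValueError(f"Unknown mode: {self.mode!r}")
        if self.input_ is not None and self.input_ not in INPUTS:
            raise ValueError(f"Unknown input: {self.input_!r}")
        if self.output not in OUTPUTS:
            raise ValueError(f"Unknown output: {self.output!r}")


# Only the output is given explicitly here; mode and input_ stay unspecified
options = CheckHierarchyOptions(output="all")
print(options)
```

Rejecting unknown keywords at construction time keeps the rule functions free of repeated argument checks.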

## Example

1. Define hierarchical ruleset.

```text
define hierarchical ruleset HR_1 ( valuedomain rule VD_1 ) is
   R010 : A = J + K + L errorcode Bad_val errorlevel 5;
   R020 : B = M + N + O errorcode Bad_val errorlevel 5;
   R070 : G = B + C errorcode Bad_val errorlevel 1
end hierarchical ruleset
```
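To generate code from a ruleset, each rule line first has to be turned into a structure. The sketch below is one possible way to do that, under the assumption that every rule has the shape `ruleid : left = item + item + ... errorcode X errorlevel N` used in HR_1. The names `HierarchicalRule` and `parse_rule` are hypothetical, and other VTL operators or comparison signs are not handled.

```python
import re
from typing import List, NamedTuple


class HierarchicalRule(NamedTuple):
    rule_id: str
    left: str
    right: List[str]
    error_code: str
    error_level: int


# Assumption: only "left = a + b + ..." rules with errorcode and errorlevel, as in HR_1
RULE_PATTERN = re.compile(
    r"(\w+)\s*:\s*(\w+)\s*=\s*(.+?)\s+errorcode\s+(\w+)\s+errorlevel\s+(\d+)\s*;?"
)


def parse_rule(line: str) -> HierarchicalRule:
    """Parse one rule line of a hierarchical ruleset into its parts."""
    match = RULE_PATTERN.fullmatch(line.strip())
    if match is None:
        raise ValueError(f"Unsupported rule syntax: {line!r}")
    rule_id, left, right, error_code, error_level = match.groups()
    # Split the right-hand side into its code items
    items = [item.strip() for item in right.split("+")]
    return HierarchicalRule(rule_id, left, items, error_code, int(error_level))


rule = parse_rule("R070 : G = B + C errorcode Bad_val errorlevel 1")
print(rule)
```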

Given a dataset ds_1:

```text
Id_1,Id_2,Me_1
2010,A,5
2010,B,11
2010,C,0
2010,G,19
2010,H,NULL
2010,I,14
2010,M,2
2010,N,5
2010,O,4
2010,P,7
2010,Q,-7
2010,S,3
2010,T,9
2010,U,NULL
2010,V,6
```

With the `all` output option (boolean Measure `Bool_var`, without the Measure of `ds_1`), the expected result is:

```text
Id_1,Id_2,ruleid,Bool_var,imbalance,errorcode,errorlevel
2010,A,R010,NULL,NULL,NULL,NULL
2010,B,R020,TRUE,0,NULL,NULL
2010,G,R070,FALSE,8,Bad_val,1
```
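The expected rows can be checked by hand with a few lines of plain Python. This is a minimal sketch under two assumptions that are not stated in the example itself: a code item that is missing or NULL makes the whole rule evaluate to NULL, and the imbalance is the left-hand value minus the sum of the right-hand values. These assumptions reproduce the table above but do not cover the different `mode` values. The function name `evaluate_rule` is hypothetical.

```python
# ds_1 for Id_1 = 2010: code item (Id_2) -> Me_1, with None standing for NULL
ds_1 = {
    "A": 5, "B": 11, "C": 0, "G": 19, "H": None, "I": 14, "M": 2, "N": 5,
    "O": 4, "P": 7, "Q": -7, "S": 3, "T": 9, "U": None, "V": 6,
}

# (ruleid, left item, right items, errorcode, errorlevel), copied from HR_1
HR_1 = [
    ("R010", "A", ["J", "K", "L"], "Bad_val", 5),
    ("R020", "B", ["M", "N", "O"], "Bad_val", 5),
    ("R070", "G", ["B", "C"], "Bad_val", 1),
]


def evaluate_rule(values, rule):
    """Return (Id_2, ruleid, bool_var, imbalance, errorcode, errorlevel)."""
    rule_id, left, right, error_code, error_level = rule
    left_val = values.get(left)
    right_vals = [values.get(item) for item in right]
    # Assumption: any missing or NULL item makes the result NULL
    if left_val is None or any(v is None for v in right_vals):
        return (left, rule_id, None, None, None, None)
    imbalance = left_val - sum(right_vals)
    ok = imbalance == 0
    # errorcode and errorlevel are only filled in when the rule fails
    return (left, rule_id, ok, imbalance,
            None if ok else error_code, None if ok else error_level)


results = [evaluate_rule(ds_1, rule) for rule in HR_1]
for row in results:
    print(row)
```

Rule R070 fails because G is 19 while B + C is 11, which gives the imbalance of 8 shown in the expected output.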

In [4]:
import os

from pyspark.sql import DataFrame, SparkSession
from pyspark.sql.functions import col, lit, when


In [5]:
local = True

if local:
    spark = SparkSession.builder \
        .master("local[4]") \
        .appName("VTLValidation_check_hierarchy")\
        .getOrCreate()
else:
    spark = SparkSession.builder\
        .master("k8s://https://kubernetes.default.svc:443") \
        .appName("VTLValidation_check_hierarchy")\
        .config("spark.kubernetes.container.image", "inseefrlab/jupyter-datascience:py3.9.7-spark3.2.0")\
        .config("spark.kubernetes.authenticate.driver.serviceAccountName", os.environ['KUBERNETES_SERVICE_ACCOUNT'])\
        .config("spark.executor.instances", "4")\
        .config("spark.executor.memory", "4g")\
        .config("spark.kubernetes.namespace", os.environ['KUBERNETES_NAMESPACE'])\
        .getOrCreate()


In [6]:
root_path="../../data"
data_path=f"{root_path}/check_hier_ds.csv"

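# read the literal string NULL as a missing value so that Me_1 is inferred as a numeric column
df=spark.read.csv(data_path, header=True, inferSchema=True, nullValue="NULL")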
df.show()

+----+----+----+
|Id_1|Id_2|Me_1|
+----+----+----+
|2010|   A|   5|
|2010|   B|  11|
|2010|   C|   0|
|2010|   G|  19|
|2010|   H|NULL|
|2010|   I|  14|
|2010|   M|   2|
|2010|   N|   5|
|2010|   O|   4|
|2010|   P|   7|
|2010|   Q|  -7|
|2010|   S|   3|
|2010|   T|   9|
|2010|   U|NULL|
|2010|   V|   6|
+----+----+----+



## Step 1: Implement hierarchical ruleset

A hierarchical ruleset can contain one or more rules. For each rule, we need to define a corresponding validation function in Spark that implements the logic and generates the resulting columns.

Note that the check has three output options (`invalid`, `all`, `all_measures`), each with its own output column layout, so each generated function must handle all three.
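The difference between the three layouts can be sketched without Spark. The example below assumes that a rule has already been evaluated into rows carrying both the Measure `Me_1` and `bool_var`. The function name `shape_output` and the dict-based row format are hypothetical, and the column choices follow the description of the `output` parameter given earlier.

```python
from typing import Dict, List

Row = Dict[str, object]


def shape_output(rows: List[Row], option: str) -> List[Row]:
    """Select rows and columns according to the output option."""
    if option == "invalid":
        # Only failing rows, with the Measures of op and without bool_var
        return [
            {key: value for key, value in row.items() if key != "bool_var"}
            for row in rows
            if row["bool_var"] is False
        ]
    if option == "all":
        # Every row, with bool_var and without the Measures of op
        return [{key: value for key, value in row.items() if key != "Me_1"} for row in rows]
    if option == "all_measures":
        # Every row, with both the Measures of op and bool_var
        return [dict(row) for row in rows]
    raise ValueError("Unknown option value, accepted values are: invalid, all, all_measures")


# Two evaluated rows taken from the example (rules R020 and R070)
evaluated = [
    {"Id_1": 2010, "Id_2": "B", "Me_1": 11, "ruleid": "R020", "bool_var": True,
     "imbalance": 0, "errorcode": None, "errorlevel": None},
    {"Id_1": 2010, "Id_2": "G", "Me_1": 19, "ruleid": "R070", "bool_var": False,
     "imbalance": 8, "errorcode": "Bad_val", "errorlevel": 1},
]

for option in ("invalid", "all", "all_measures"):
    print(option, shape_output(evaluated, option))
```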

The functions below should be generated when we encounter **define datapoint ruleset**.

In [12]:
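# this function gets the value of one operand (code item) of a rule;
# it returns None when the data set has no row for that code item (e.g. J, K, L in ds_1)
def get_op_var(ds,reference_col,val_col,op_val):
    row=ds.filter(col(reference_col)==op_val).select(val_col).first()
    return row[0] if row is not None else None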

In [80]:
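# this function completes the output columns for the check option: all
def trans_for_all(ds,rule_id,error_code,error_level):
    return ds.withColumn("rule_id",lit(rule_id)) \
        .withColumn("error_code",when(col("bool_var")==False,error_code)) \
        .withColumn("error_level",when(col("bool_var")==False,error_level))

# assumed reconstruction (called by the rule functions but not defined in this notebook): completes the output columns for the check option invalid
def trans_for_invalid(ds,rule_id,error_code,error_level):
    return ds.withColumn("rule_id",lit(rule_id)) \
        .withColumn("error_code",lit(error_code)) \
        .withColumn("error_level",lit(error_level))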

In [14]:
reference_col="Id_2"
val_col="Me_1"
op1="A"
op2="J"
op3="K"
op4="L"

# get the value of op1
val= get_op_var(df,reference_col,val_col,op1)

In [15]:
print(val)

5


In [81]:
# implementation of rule dpr1_1 in dpr1, this should be generated based on dpr1 definition
# note: it expects a data frame with an Id_3 column, as in the output shown below
def dpr1_1(ds,option):
    rule_id="dpr1_1"
    cond_col="Id_3"
    cond_val="CREDIT"
    check_col="Me_1"
    check_val=0
    error_code="Bad credit"
    error_level="5"
    if option=="invalid":
        tmp=ds.filter((col(cond_col)==cond_val) & (col(check_col)<check_val) ).withColumnRenamed(check_col,"obs_value")
        return trans_for_invalid(tmp,rule_id,error_code,error_level)
    elif option=="all":
        tmp=ds.withColumn("bool_var",when((col(cond_col)==cond_val) & (col(check_col)<check_val),False).otherwise(True))
        return trans_for_all(tmp,rule_id,error_code,error_level)
    elif option=="all_measures":
        return ds
    else:
        raise ValueError("Unknown option value, accepted values are : invalid, all, all_measures")

In [82]:
dpr1_1_resu=dpr1_1(df,"all")

In [83]:
dpr1_1_resu.show()

+----+----+------+----+--------+-------+----------+-----------+
|Id_1|Id_2|  Id_3|Me_1|bool_var|rule_id|error_code|error_level|
+----+----+------+----+--------+-------+----------+-----------+
|2011|   1|CREDIT|  10|    true| dpr1_1|      null|       null|
|2011|   1| DEBIT|  -2|    true| dpr1_1|      null|       null|
|2012|   1|CREDIT|  10|    true| dpr1_1|      null|       null|
|2012|   1| DEBIT|   2|    true| dpr1_1|      null|       null|
+----+----+------+----+--------+-------+----------+-----------+



In [84]:
# implementation of rule dpr1_2 in dpr1, this should be generated based on dpr1 definition
def dpr1_2(ds,option):
    rule_id="dpr1_2"
    cond_col="Id_3"
    cond_val="DEBIT"
    check_col="Me_1"
    check_val=0
    error_code="Bad debit"
    error_level="6"
    if option=="invalid":
        tmp=ds.filter((col(cond_col)==cond_val) & (col(check_col)<check_val) ).withColumnRenamed(check_col,"obs_value")
        return trans_for_invalid(tmp,rule_id,error_code,error_level)
    elif option=="all":
        tmp=ds.withColumn("bool_var",when((col(cond_col)==cond_val) & (col(check_col)<check_val),False).otherwise(True))
        return trans_for_all(tmp,rule_id,error_code,error_level)
    elif option=="all_measures":
        return ds
    else:
        raise ValueError("Unknown option value, accepted values are : invalid, all, all_measures")

In [85]:
dpr1_2_resu=dpr1_2(df,"all")

In [86]:
dpr1_2_resu.show()

+----+----+------+----+--------+-------+----------+-----------+
|Id_1|Id_2|  Id_3|Me_1|bool_var|rule_id|error_code|error_level|
+----+----+------+----+--------+-------+----------+-----------+
|2011|   1|CREDIT|  10|    true| dpr1_2|      null|       null|
|2011|   1| DEBIT|  -2|   false| dpr1_2| Bad debit|          6|
|2012|   1|CREDIT|  10|    true| dpr1_2|      null|       null|
|2012|   1| DEBIT|   2|    true| dpr1_2|      null|       null|
+----+----+------+----+--------+-------+----------+-----------+



In [87]:
dpr1_1_resu.union(dpr1_2_resu).show()

+----+----+------+----+--------+-------+----------+-----------+
|Id_1|Id_2|  Id_3|Me_1|bool_var|rule_id|error_code|error_level|
+----+----+------+----+--------+-------+----------+-----------+
|2011|   1|CREDIT|  10|    true| dpr1_1|      null|       null|
|2011|   1| DEBIT|  -2|    true| dpr1_1|      null|       null|
|2012|   1|CREDIT|  10|    true| dpr1_1|      null|       null|
|2012|   1| DEBIT|   2|    true| dpr1_1|      null|       null|
|2011|   1|CREDIT|  10|    true| dpr1_2|      null|       null|
|2011|   1| DEBIT|  -2|   false| dpr1_2| Bad debit|          6|
|2012|   1|CREDIT|  10|    true| dpr1_2|      null|       null|
|2012|   1| DEBIT|   2|    true| dpr1_2|      null|       null|
+----+----+------+----+--------+-------+----------+-----------+



## Step 2: Apply data point ruleset to a data frame

This function should be generated when we encounter a call to **check_datapoint**.
Note that the rulesets and rules are generated in step 1. They must be present when we call **check_datapoint**.

In [88]:
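def data_validation(ds,rules,option):
    # at least one rule function is needed to build a result
    if not rules:
        raise ValueError("rules must contain at least one rule function")
    # apply each rule and stack the results; union matches columns by position
    result=rules[0](ds,option)
    for rule in rules[1:]:
        result=result.union(rule(ds,option))
    return result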
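Because `union` matches columns by position, rule functions that return their columns in a different order would be stacked incorrectly. A possible variant, shown here only as a sketch, folds the results with `functools.reduce` and `unionByName`, an existing PySpark `DataFrame` method that matches columns by name. The function name `data_validation_by_name` is hypothetical, and the code is duck-typed so it accepts any object that exposes `unionByName`.

```python
from functools import reduce
from typing import Any, Callable, Sequence


def data_validation_by_name(
    ds: Any,
    rules: Sequence[Callable[[Any, str], Any]],
    option: str,
) -> Any:
    """Apply every rule function to ds and stack the results by column name."""
    if not rules:
        raise ValueError("rules must contain at least one rule function")
    # Evaluate the rules lazily, then combine the results pairwise
    results = (rule(ds, option) for rule in rules)
    return reduce(lambda left, right: left.unionByName(right), results)
```

With the rule functions defined above, a call would look like `data_validation_by_name(df, [dpr1_1, dpr1_2], "all")`.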

In [89]:
rule_sets=[dpr1_1,dpr1_2]

In [90]:
invalid_resu=data_validation(df,rule_sets,"invalid")

In [91]:
invalid_resu.show()

+----+----+-----+---------+-------+----------+-----------+
|Id_1|Id_2| Id_3|obs_value|rule_id|error_code|error_level|
+----+----+-----+---------+-------+----------+-----------+
|2011|   1|DEBIT|       -2| dpr1_2| Bad debit|          6|
+----+----+-----+---------+-------+----------+-----------+



In [92]:
all_resu=data_validation(df,rule_sets,"all")

In [93]:
all_resu.show()

+----+----+------+----+--------+-------+----------+-----------+
|Id_1|Id_2|  Id_3|Me_1|bool_var|rule_id|error_code|error_level|
+----+----+------+----+--------+-------+----------+-----------+
|2011|   1|CREDIT|  10|    true| dpr1_1|      null|       null|
|2011|   1| DEBIT|  -2|    true| dpr1_1|      null|       null|
|2012|   1|CREDIT|  10|    true| dpr1_1|      null|       null|
|2012|   1| DEBIT|   2|    true| dpr1_1|      null|       null|
|2011|   1|CREDIT|  10|    true| dpr1_2|      null|       null|
|2011|   1| DEBIT|  -2|   false| dpr1_2| Bad debit|          6|
|2012|   1|CREDIT|  10|    true| dpr1_2|      null|       null|
|2012|   1| DEBIT|   2|    true| dpr1_2|      null|       null|
+----+----+------+----+--------+-------+----------+-----------+

