# VTL Check hierarchy

The original doc can be found at **line 6399 of VTL-2.0-Reference-Manual**

To use the check hierarchy, the user needs to define two parts:
- define a hierarchical ruleset
- define a check_hierarchy function


## 1. Define hierarchical ruleset

The original doc can be found at **line 650 of VTL-2.0-Reference-Manual**

### 1.1 Syntax
Below is the syntax for defining a new ruleset:

```text
define hierarchical ruleset rulesetName ( hrRulesetSignature ) is
hrRule
{ ; hrRule }*
end hierarchical ruleset
```
- **hrRulesetSignature**: the definition of the parameters that the `hrRule` can use (e.g. column names)
- **hrRule**: the definition of a single rule

### 1.2 Example

``` text
define hierarchical ruleset BeneluxRuleset ( valuedomain rule GeoArea) is
Belgium = Belgium;
Luxembourg = Luxembourg;
Netherlands = Netherlands;
Benelux = Belgium + Luxembourg + Netherlands
end hierarchical ruleset
```
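
To make the structure concrete, one way to hold such a ruleset in Python is a mapping from each left-hand code item to the code items summed on its right-hand side. This is only an illustrative sketch: the names `benelux_ruleset` and `aggregate_items` are hypothetical and are not part of VTL.

```python
# Hypothetical in-memory form of BeneluxRuleset:
# left-hand code item -> list of right-hand code items.
benelux_ruleset = {
    "Belgium": ["Belgium"],
    "Luxembourg": ["Luxembourg"],
    "Netherlands": ["Netherlands"],
    "Benelux": ["Belgium", "Luxembourg", "Netherlands"],
}


def aggregate_items(ruleset):
    """Return the code items that are defined as a sum of other code items."""
    return [left for left, right in ruleset.items() if right != [left]]


print(aggregate_items(benelux_ruleset))  # ['Benelux']
```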

## 2. Define a check_hierarchy function

### 2.1 Syntax

```text
check_hierarchy ( op , hr { condition condComp { , condComp }* } { rule ruleComp }
{ mode } { input } { output } )
  - mode ::= non_null | non_zero | partial_null | partial_zero | always_null | always_zero
  - input ::= dataset | dataset_priority
  - output ::= invalid | all | all_measures

```

- **op**: the Data Set to be checked
- **hr**: the hierarchical Ruleset to be used
- **condComp**: `condComp` is a Component of `op` to be associated (in positional order) to the conditioning Value Domains or Variables defined in `hr` (if any).
- **ruleComp**: Component of `op`
- **mode**: specifies how to treat possible missing Data Points corresponding to the Code Items on the left and right sides of the rules, and which Data Points are produced in output. The meaning of each possible value is explained in the reference manual.
- **output**: specifies the Data Points and the Measures of the resulting Data Set:
     - **invalid**: the resulting Data Set contains a Data Point for each Data Point of `op` and each Rule in `hr` that evaluates to `FALSE` on that Data Point. The resulting Data Set has the Measures of `op`.
     - **all**: the resulting Data Set contains a Data Point for each Data Point of `op` and each Rule in `hr`. The resulting Data Set has the boolean Measure `bool_var`.
     - **all_measures**: the resulting Data Set contains a Data Point for each Data Point of `op` and each Rule in `hr`. The resulting Data Set has the Measures of `op` and the boolean Measure `bool_var`.
     - If not specified, then output is assumed to be invalid. See the Behaviour section of the reference manual for further details.
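
The three `output` options differ only in which rows and which measures are kept. The plain-Python sketch below illustrates this, following the descriptions above. It assumes the rules have already been evaluated, and the function name `select_output` and the row layout (one dict per data point) are hypothetical.

```python
def select_output(results, measure="Me_1", output="invalid"):
    """Shape already-evaluated rule results according to the `output` option.

    `results` is a list of dicts holding the identifiers, the measure and `bool_var`.
    """
    if output == "invalid":
        # keep only failing data points, with the original measure and no bool_var
        return [
            {k: v for k, v in row.items() if k != "bool_var"}
            for row in results
            if row["bool_var"] is False
        ]
    if output == "all":
        # keep every data point, with bool_var but without the original measure
        return [{k: v for k, v in row.items() if k != measure} for row in results]
    if output == "all_measures":
        # keep every data point, with both the original measure and bool_var
        return [dict(row) for row in results]
    raise ValueError("output must be one of: invalid, all, all_measures")


results = [
    {"Geo_Area": "Europe", "Me_1": 23, "bool_var": False},
    {"Geo_Area": "Africa", "Me_1": 24, "bool_var": True},
]
print(select_output(results, output="invalid"))
# [{'Geo_Area': 'Europe', 'Me_1': 23}]
```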

### 2.2 Example

```text
DS_r := check_hierarchy ( DS_1, HR_1 rule Id_2 partial_null all )
```

- **DS_1**: the input dataset
- **HR_1**: the ruleset
- **rule Id_2**: the column of `DS_1` that will be used by `HR_1`

## 3. A complete example

### 3.1. Dataset

Suppose we have the below input dataset:

```text
Id_1,Geo_Area,Me_1
2010,Italia,5
2010,France,11
2010,Germany,8
2010,Europe,23
2010,Algeria,5
2010,Gongo,11
2010,Zimbabwe,8
2010,Africa,24
2010,China,8
2010,Japan,7
2010,Laos,6
2010,Asia,21
```

### 3.2 Define hierarchical ruleset

```text
define hierarchical ruleset HR_1 ( valuedomain rule GeoArea ) is
   R010 : Europe = Italia + France + Germany errorcode Bad_val errorlevel 5;
   R020 : Africa = Algeria + Gongo + Zimbabwe errorcode Bad_val errorlevel 5;
   R030 : Asia = China + Japan + Laos errorcode Bad_val errorlevel 1
end hierarchical ruleset
```

### 3.3 Define the check_hierarchy() function

```text
DS_r := check_hierarchy ( DS_1, HR_1 rule Geo_Area partial_null all )
```

The output should be:

```text

Id_1,Geo_Area,Me_1,Bool_var,imbalance,errorcode,errorlevel
2010,Italia,5,NULL,NULL,NULL,5
2010,France,11,NULL,NULL,NULL,5
2010,Germany,8,NULL,NULL,NULL,5
2010,Europe,23,FALSE,-1,Bad_val,5
2010,Algeria,5,NULL,NULL,NULL,5
2010,Gongo,11,NULL,NULL,NULL,5
2010,Zimbabwe,8,NULL,NULL,NULL,5
2010,Africa,24,TRUE,0,Bad_val,5
2010,China,8,NULL,NULL,NULL,5
2010,Japan,7,NULL,NULL,NULL,5
2010,Laos,6,NULL,NULL,NULL,5
2010,Asia,21,TRUE,0,Bad_val,1
```
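
The `imbalance` values above can be checked by hand: for each rule, the imbalance is the value of the left-hand code item minus the sum of the right-hand code items. A small plain-Python sketch of that arithmetic follows. The dictionaries `values` and `rules` are a hypothetical transcription of the dataset and of `HR_1`, and `evaluate` is an illustrative helper.

```python
# Me_1 per Geo_Area, transcribed from the input dataset (all rows have Id_1 = 2010).
values = {
    "Italia": 5, "France": 11, "Germany": 8, "Europe": 23,
    "Algeria": 5, "Gongo": 11, "Zimbabwe": 8, "Africa": 24,
    "China": 8, "Japan": 7, "Laos": 6, "Asia": 21,
}

# HR_1 as: rule id -> (left-hand item, right-hand items).
rules = {
    "R010": ("Europe", ["Italia", "France", "Germany"]),
    "R020": ("Africa", ["Algeria", "Gongo", "Zimbabwe"]),
    "R030": ("Asia", ["China", "Japan", "Laos"]),
}


def evaluate(values, rules):
    """Return rule id -> (imbalance, bool_var) for rules of the form left = sum(right)."""
    out = {}
    for rule_id, (left, right) in rules.items():
        imbalance = values[left] - sum(values[item] for item in right)
        out[rule_id] = (imbalance, imbalance == 0)
    return out


print(evaluate(values, rules))
# {'R010': (-1, False), 'R020': (0, True), 'R030': (0, True)}
```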

```python
import os

from pyspark.sql import DataFrame, SparkSession
# note: this `sum` is the Spark aggregate function and shadows Python's builtin `sum`
from pyspark.sql.functions import col, lit, sum, when
```

```python
# set to False to run on a Kubernetes cluster instead of a local Spark session
local = True

if local:
    spark = SparkSession.builder \
        .master("local[4]") \
        .appName("VTLValidation_check_hierarchy") \
        .getOrCreate()
else:
    spark = SparkSession.builder \
        .master("k8s://https://kubernetes.default.svc:443") \
        .appName("VTLValidation_check_hierarchy") \
        .config("spark.kubernetes.container.image", "inseefrlab/jupyter-datascience:py3.9.7-spark3.2.0") \
        .config("spark.kubernetes.authenticate.driver.serviceAccountName", os.environ['KUBERNETES_SERVICE_ACCOUNT']) \
        .config("spark.executor.instances", "4") \
        .config("spark.executor.memory", "4g") \
        .config("spark.kubernetes.namespace", os.environ['KUBERNETES_NAMESPACE']) \
        .getOrCreate()
```




```python
root_path = "../../data"
data_path = f"{root_path}/check_hier_ds.csv"

# read the input dataset of section 3.1
df = spark.read.csv(data_path, header=True, inferSchema=True)
df.show()
```

```text
+----+--------+----+
|Id_1|Geo_Area|Me_1|
+----+--------+----+
|2010|  Italia|   5|
|2010|  France|  11|
|2010| Germany|   8|
|2010|  Europe|  23|
|2010| Algeria|   5|
|2010|   Gongo|  11|
|2010|Zimbabwe|   8|
|2010|  Africa|  24|
|2010|   China|   8|
|2010|   Japan|   7|
|2010|    Laos|   6|
|2010|    Asia|  21|
+----+--------+----+
```



### 3.4 Implement check_hierarchy in Spark

A hierarchical ruleset can contain one or more rules. For each rule, we need to define a corresponding validation function in Spark that implements the logic and generates the resulting columns.

Note that the `output` parameter has three options (invalid, all, all_measures), and each option has its own output column format, so each generated function must take that into account.

The functions below should be generated when we encounter **define hierarchical ruleset**.
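
Generating one function per rule requires first extracting the pieces of each rule. As a rough sketch of that step, the snippet below parses the simple `id : left = a + b + c errorcode X errorlevel N` form used in `HR_1` with a regular expression. The name `parse_rule` is hypothetical, and the pattern only covers this form (a single `=` with `+` terms), not the full VTL grammar.

```python
import re

# Only handles rules shaped like:
# "R010 : Europe = Italia + France + Germany errorcode Bad_val errorlevel 5"
RULE_PATTERN = re.compile(
    r"^\s*(?P<rule_id>\w+)\s*:\s*(?P<left>\w+)\s*=\s*(?P<right>\w+(?:\s*\+\s*\w+)*)"
    r"\s+errorcode\s+(?P<error_code>\w+)\s+errorlevel\s+(?P<error_level>\d+)\s*;?\s*$"
)


def parse_rule(text):
    """Split one rule line into its id, left item, right items, error code and level."""
    match = RULE_PATTERN.match(text)
    if match is None:
        raise ValueError(f"Unsupported rule syntax: {text!r}")
    return {
        "rule_id": match["rule_id"],
        "left": match["left"],
        "right": [item.strip() for item in match["right"].split("+")],
        "error_code": match["error_code"],
        "error_level": int(match["error_level"]),
    }


print(parse_rule("R030 : Asia = China + Japan + Laos errorcode Bad_val errorlevel 1"))
```

The resulting dictionary carries everything the hand-written rule functions hard-code (reference item, operands, error code and level), so a single generic function could take it as a parameter.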

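```python
# Return the value of one operand of a rule (the `val_col` of the row where
# `reference_col` equals `op_val`). Raises TypeError when the operand is missing.
def get_op_var(ds, reference_col, val_col, op_val):
    cell = ds.filter(col(reference_col) == op_val).select(val_col).collect()
    if len(cell) and len(cell[0]):
        return cell[0][0]
    raise TypeError(f"No value found for {reference_col} = {op_val}")
```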

```python
# Complete the output columns for the check option `all`:
# the error code and error level are only filled in when the rule fails.
def trans_for_all(ds, rule_id, error_code, error_level):
    return ds.withColumn("rule_id", lit(rule_id)) \
        .withColumn("error_code", when(col("bool_var") == False, lit(error_code))) \
        .withColumn("error_level", when(col("bool_var") == False, lit(error_level)))
```

```python
# Implementation of rule R010 in HR_1; this should be generated from the HR_1:R010 definition
# Europe = Italia + France + Germany
def r010(ds, option):
    rule_id = "R010"
    reference_col = "Geo_Area"
    val_col = "Me_1"
    op1 = "Europe"
    op2 = "Italia"
    op3 = "France"
    op4 = "Germany"

    # get the value of each operand from `ds`; a missing or null operand gives a null imbalance
    try:
        val1 = get_op_var(ds, reference_col, val_col, op1)
        val2 = get_op_var(ds, reference_col, val_col, op2)
        val3 = get_op_var(ds, reference_col, val_col, op3)
        val4 = get_op_var(ds, reference_col, val_col, op4)
        print(f"{op1}: {val1}, {op2}: {val2}, {op3}: {val3}, {op4} : {val4}")
        imbalance = val1 - val2 - val3 - val4
    except TypeError:
        imbalance = None
    error_code = "Bad val"
    error_level = "5"
    if option == "invalid":
        # note: the row is not filtered on bool_var here, so a valid row is also returned
        tmp = ds.filter(col(reference_col) == op1) \
            .withColumn("imbalance", lit(imbalance))
        tmp = tmp.withColumn("bool_var", when(tmp.imbalance == 0, "True")
                             .when(tmp.imbalance.isNull(), "None")
                             .otherwise("False"))
        tmp = tmp.withColumn("error_code", lit(error_code)).withColumn("error_level", lit(error_level))
        return tmp
    elif option in ("all", "all_measures"):
        # these two output options are not implemented yet
        raise NotImplementedError(f"Option {option!r} is not implemented yet")
    else:
        raise ValueError("Unknown option value, accepted values are : invalid, all, all_measures")
```

```python
r010_resu = r010(df, "invalid")
```

```text
Europe: 23, Italia: 5, France: 11, Germany : 8
```

```python
r010_resu.show()
```

```text
+----+--------+----+---------+--------+----------+-----------+
|Id_1|Geo_Area|Me_1|imbalance|bool_var|error_code|error_level|
+----+--------+----+---------+--------+----------+-----------+
|2010|  Europe|  23|       -1|   False|   Bad val|          5|
+----+--------+----+---------+--------+----------+-----------+
```



```python
# Implementation of rule R020 in HR_1; this should be generated from the HR_1:R020 definition
# Africa = Algeria + Gongo + Zimbabwe
def r020(ds, option):
    rule_id = "R020"
    reference_col = "Geo_Area"
    val_col = "Me_1"
    op1 = "Africa"
    op2 = "Algeria"
    op3 = "Gongo"
    op4 = "Zimbabwe"

    # get the value of each operand from `ds`; a missing or null operand gives a null imbalance
    try:
        val1 = get_op_var(ds, reference_col, val_col, op1)
        val2 = get_op_var(ds, reference_col, val_col, op2)
        val3 = get_op_var(ds, reference_col, val_col, op3)
        val4 = get_op_var(ds, reference_col, val_col, op4)
        print(f"{op1}: {val1}, {op2}: {val2}, {op3}: {val3}, {op4} : {val4}")
        imbalance = val1 - val2 - val3 - val4
    except TypeError:
        imbalance = None
    error_code = "Bad val"
    error_level = "5"
    if option == "invalid":
        tmp = ds.filter(col(reference_col) == op1) \
            .withColumn("imbalance", lit(imbalance))
        tmp = tmp.withColumn("bool_var", when(tmp.imbalance == 0, "True")
                             .when(tmp.imbalance.isNull(), "None")
                             .otherwise("False"))
        tmp = tmp.withColumn("error_code", lit(error_code)).withColumn("error_level", lit(error_level))
        return tmp
    elif option in ("all", "all_measures"):
        # these two output options are not implemented yet
        raise NotImplementedError(f"Option {option!r} is not implemented yet")
    else:
        raise ValueError("Unknown option value, accepted values are : invalid, all, all_measures")
```

```python
r020_resu = r020(df, "invalid")
```

```text
Africa: 24, Algeria: 5, Gongo: 11, Zimbabwe : 8
```

```python
r020_resu.show()
```

```text
+----+--------+----+---------+--------+----------+-----------+
|Id_1|Geo_Area|Me_1|imbalance|bool_var|error_code|error_level|
+----+--------+----+---------+--------+----------+-----------+
|2010|  Africa|  24|        0|    True|   Bad val|          5|
+----+--------+----+---------+--------+----------+-----------+
```



```python
# Implementation of rule R030 in HR_1; this should be generated from the HR_1:R030 definition
# Asia = China + Japan + Laos
def r030(ds, option):
    rule_id = "R030"
    reference_col = "Geo_Area"
    val_col = "Me_1"
    op1 = "Asia"
    op2 = "China"
    op3 = "Japan"
    op4 = "Laos"

    # get the value of each operand from `ds`; a missing or null operand gives a null imbalance
    try:
        val1 = get_op_var(ds, reference_col, val_col, op1)
        val2 = get_op_var(ds, reference_col, val_col, op2)
        val3 = get_op_var(ds, reference_col, val_col, op3)
        val4 = get_op_var(ds, reference_col, val_col, op4)
        print(f"{op1}: {val1}, {op2}: {val2}, {op3}: {val3}, {op4} : {val4}")
        imbalance = val1 - val2 - val3 - val4
    except TypeError:
        imbalance = None
    error_code = "Bad val"
    # errorlevel 1, as declared for R030 in HR_1
    error_level = "1"
    if option == "invalid":
        tmp = ds.filter(col(reference_col) == op1) \
            .withColumn("imbalance", lit(imbalance))
        tmp = tmp.withColumn("bool_var", when(tmp.imbalance == 0, "True")
                             .when(tmp.imbalance.isNull(), "None")
                             .otherwise("False"))
        tmp = tmp.withColumn("error_code", lit(error_code)).withColumn("error_level", lit(error_level))
        return tmp
    elif option in ("all", "all_measures"):
        # these two output options are not implemented yet
        raise NotImplementedError(f"Option {option!r} is not implemented yet")
    else:
        raise ValueError("Unknown option value, accepted values are : invalid, all, all_measures")
```

```python
r030_resu = r030(df, "invalid")
```

```text
Asia: 21, China: 8, Japan: 7, Laos : 6
```

```python
r030_resu.show()
```

```text
+----+--------+----+---------+--------+----------+-----------+
|Id_1|Geo_Area|Me_1|imbalance|bool_var|error_code|error_level|
+----+--------+----+---------+--------+----------+-----------+
|2010|    Asia|  21|        0|    True|   Bad val|          1|
+----+--------+----+---------+--------+----------+-----------+
```

```python
# combine the results of the three rules
finalDf = r010_resu.union(r020_resu).union(r030_resu)
```

```python
finalDf.show()
```

```text
+----+--------+----+---------+--------+----------+-----------+
|Id_1|Geo_Area|Me_1|imbalance|bool_var|error_code|error_level|
+----+--------+----+---------+--------+----------+-----------+
|2010|  Europe|  23|       -1|   False|   Bad val|          5|
|2010|  Africa|  24|        0|    True|   Bad val|          5|
|2010|    Asia|  21|        0|    True|   Bad val|          1|
+----+--------+----+---------+--------+----------+-----------+
```



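## 4. What bothers me

There are two major problems with the function `check_hierarchy`:

- Different kinds of information are mixed in the same column: in the example, `Geo_Area` holds both countries and continents.
- The calculation works across rows instead of columns, which will kill the performance. The Spark implementation above needs one filter and one `collect` per code item of each rule. With small data the row approach may work, but with big data it will not scale.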




## 5. What we propose

Let's revisit the above example.

We propose to split the dataset into two datasets that share a common key called `Continent`: one at the country level and one at the continent level.

```text
Id_1,Country,Me_1,Continent
2010,Italia,5,Europe
2010,France,11,Europe
2010,Germany,8,Europe
2010,Algeria,5,Africa
2010,Gongo,11,Africa
2010,Zimbabwe,8,Africa
2010,China,8,Asia
2010,Japan,7,Asia
2010,Laos,6,Asia
```

```text
Id_1,Continent,Me_1
2010,Europe,23
2010,Africa,24
2010,Asia,21
```
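
The split itself can be derived from the ruleset: rows whose `Geo_Area` is a left-hand item go to the continent dataset, and rows whose `Geo_Area` appears on a right-hand side go to the country dataset, tagged with their parent. A plain-Python sketch of that idea is shown below. The names `split_by_hierarchy` and `hierarchy` are hypothetical, and each country is assumed to have exactly one parent.

```python
def split_by_hierarchy(rows, hierarchy):
    """Split (Id_1, Geo_Area, Me_1) rows into country rows and continent rows."""
    # child item -> parent item; assumes each child has exactly one parent
    parent_of = {
        child: parent
        for parent, children in hierarchy.items()
        for child in children
    }
    countries, continents = [], []
    for id_1, geo_area, me_1 in rows:
        if geo_area in hierarchy:
            continents.append((id_1, geo_area, me_1))
        elif geo_area in parent_of:
            countries.append((id_1, geo_area, me_1, parent_of[geo_area]))
    return countries, continents


hierarchy = {"Europe": ["Italia", "France", "Germany"]}
rows = [
    (2010, "Italia", 5),
    (2010, "France", 11),
    (2010, "Germany", 8),
    (2010, "Europe", 23),
]
countries, continents = split_by_hierarchy(rows, hierarchy)
print(countries[0])  # (2010, 'Italia', 5, 'Europe')
print(continents)    # [(2010, 'Europe', 23)]
```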

With this layout, the same check as `DS_r := check_hierarchy ( DS_1, HR_1 rule Geo_Area partial_null all )` can be expressed as an aggregation followed by a join:

```python
country_path = f"{root_path}/check_hier_ds_country.csv"

# country-level dataset, with the parent continent as an extra column
df_country = spark.read.csv(country_path, header=True, inferSchema=True)
```

```python
df_country.printSchema()
df_country.show()
```

```text
root
 |-- Id_1: integer (nullable = true)
 |-- Country: string (nullable = true)
 |-- Me_1: integer (nullable = true)
 |-- Continent: string (nullable = true)

+----+--------+----+---------+
|Id_1| Country|Me_1|Continent|
+----+--------+----+---------+
|2010|  Italia|   5|   Europe|
|2010|  France|  11|   Europe|
|2010| Germany|   8|   Europe|
|2010| Algeria|   5|   Africa|
|2010|   Gongo|  11|   Africa|
|2010|Zimbabwe|   8|   Africa|
|2010|   China|   8|     Asia|
|2010|   Japan|   7|     Asia|
|2010|    Laos|   6|     Asia|
+----+--------+----+---------+
```



```python
continent_path = f"{root_path}/check_hier_ds_continent.csv"

# continent-level dataset
df_continent = spark.read.csv(continent_path, header=True, inferSchema=True)
```

```python
df_continent.printSchema()
df_continent.show()
```

```text
root
 |-- Id_1: integer (nullable = true)
 |-- Continent: string (nullable = true)
 |-- Me_1: integer (nullable = true)

+----+---------+----+
|Id_1|Continent|Me_1|
+----+---------+----+
|2010|   Europe|  23|
|2010|   Africa|  24|
|2010|     Asia|  21|
+----+---------+----+
```



```python
# sum the country values per continent
df_sum = df_country.groupBy("Continent").agg(sum("Me_1").alias("country_sum_me1"))
```

```python
df_sum.show()
```

```text
+---------+---------------+
|Continent|country_sum_me1|
+---------+---------------+
|   Europe|             24|
|   Africa|             24|
|     Asia|             21|
+---------+---------------+
```



```python
# join the two dataframes on the common key
df_resu = df_sum.join(df_continent, "Continent", "inner")

# create the columns imbalance, bool_var, error_code and error_level
df_resu = df_resu.withColumn("imbalance", col("Me_1") - col("country_sum_me1"))

df_resu = df_resu.withColumn("bool_var", when(col("imbalance") == 0, True)
                             .when(col("imbalance").isNull(), None)
                             .otherwise(False)) \
    .withColumn("error_code", lit("Bad val")) \
    .withColumn("error_level", lit(5)).drop("country_sum_me1")
```


```python
df_resu.show()
```

```text
+---------+----+----+---------+--------+----------+-----------+
|Continent|Id_1|Me_1|imbalance|bool_var|error_code|error_level|
+---------+----+----+---------+--------+----------+-----------+
|   Europe|2010|  23|       -1|   false|   Bad val|          5|
|   Africa|2010|  24|        0|    true|   Bad val|          5|
|     Asia|2010|  21|        0|    true|   Bad val|          5|
+---------+----+----+---------+--------+----------+-----------+
```

