# VTL Check data point

The original documentation can be found at **line 6319 of the VTL-2.0-Reference-Manual**.

## Syntax

```text
check_datapoint ( op , dpr { components listComp } { output } )
   - listComp ::= comp { , comp }*
   - output ::= invalid | all | all_measures

```

- **op**: the Data Set to check
- **dpr**: the Data Point Ruleset to be used
- **listComp**: if `dpr` is defined on Value Domains, then `listComp` is the list of Components of `op` to be associated (in positional order) with the conditioning Value Domains defined in `dpr`. If `dpr` is defined on Variables, then `listComp` is the list of Components of `op` to be associated (in positional order) with the conditioning Variables defined in `dpr` (for documentation purposes).
- **comp**: Component of `op`
- **output**: specifies the Data Points and the Measures of the resulting Data Set:
     - **invalid**: the resulting Data Set contains a Data Point for each Data Point of `op` and each Rule in `dpr` that evaluates to `FALSE` on that Data Point. The resulting Data Set has the Measures of `op`.
     - **all**: the resulting Data Set contains a Data Point for each Data Point of `op` and each Rule in `dpr`. The resulting Data Set has the boolean Measure `bool_var`.
     - **all_measures**: the resulting Data Set contains a Data Point for each Data Point of `op` and each Rule in `dpr`. The resulting Data Set has the Measures of `op` and the boolean Measure `bool_var`.
     - If not specified, `output` is assumed to be `invalid`. See the Behaviour section of the reference manual for further details.
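
The three output modes can be illustrated with a small plain-Python sketch (no Spark involved). It assumes that a rule `when A then B` holds on a Data Point when `A` is false or `B` is true, and it ignores null handling. The `Rule` tuple, the row layout and the sample data are hypothetical and only serve the illustration.

```python
from typing import Callable, NamedTuple

Row = dict  # one Data Point: component name -> value


class Rule(NamedTuple):
    rule_id: str
    antecedent: Callable[[Row], bool]  # the "when" part
    consequent: Callable[[Row], bool]  # the "then" part
    errorcode: str
    errorlevel: int


def check_datapoint(rows, rules, identifiers, output="invalid"):
    """Plain-Python illustration of the three output modes."""
    if output not in ("invalid", "all", "all_measures"):
        raise ValueError(f"unknown output: {output}")
    result = []
    for row in rows:
        for rule in rules:
            # "when A then B" is violated only if A is true and B is false.
            ok = (not rule.antecedent(row)) or rule.consequent(row)
            if output == "invalid" and ok:
                continue  # invalid: keep only the failing combinations
            if output == "all":
                out = {k: row[k] for k in identifiers}  # measures dropped
            else:
                out = dict(row)  # invalid / all_measures keep the measures
            out["ruleid"] = rule.rule_id
            if output != "invalid":
                out["bool_var"] = ok
            out["errorcode"] = None if ok else rule.errorcode
            out["errorlevel"] = None if ok else rule.errorlevel
            result.append(out)
    return result


# Hypothetical data: one identifier ("id") and one measure ("amount").
rows = [{"id": "a", "amount": 5}, {"id": "b", "amount": -1}]
rules = [Rule("r_1", lambda r: True, lambda r: r["amount"] >= 0, "negative", 1)]

invalid = check_datapoint(rows, rules, ["id"])
all_rows = check_datapoint(rows, rules, ["id"], "all")
all_measures = check_datapoint(rows, rules, ["id"], "all_measures")
```

With this data, `invalid` holds only the failing point `b`, while `all_rows` and `all_measures` hold one entry per Data Point and rule.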

## Example

Step 1. Define `dpr`

```text
define datapoint ruleset valide_livret_A ( variable transaction_type, balance ) is
            when transaction_type = "CREDIT" then balance <= 22950 errorcode "Limit reached" errorlevel 5;
            when transaction_type = "DEBIT" then balance >= 0 errorcode "Not enough credit" errorlevel 6
end datapoint ruleset

```

Step 2. Apply `dpr` on `ds`

Given the Data Set DS_1:
```text
Id_1,account_number,transaction_type,balance
2011,1,CREDIT,23950
2011,1,DEBIT,-2
2012,1,CREDIT,10
2012,1,DEBIT,2
```

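`DS_r := check_datapoint ( DS_1, dpr1 )` results in the following. Note that the result tables name the ruleset `dpr1` and the identifiers `Id_1`, `Id_2`, `Id_3`, unlike Step 1 and the DS_1 header above: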

```text
Id_1,Id_2,Id_3,ruleid,obs_value,errorcode,errorlevel
2011,1,DEBIT,dpr1_2,-2,Bad debit,null
```

`DS_r := check_datapoint ( DS_1, dpr1 all )` results in:

```text
Id_1,Id_2,Id_3,ruleid,bool_var,errorcode,errorlevel
2011,1,CREDIT,dpr1_1,true,null,null
2011,1,CREDIT,dpr1_2,true,null,null
2011,1,DEBIT,dpr1_1,true,null,null
2011,1,DEBIT,dpr1_2,false,Bad debit,null
2012,1,CREDIT,dpr1_1,true,null,null
2012,1,CREDIT,dpr1_2,true,null,null
2012,1,DEBIT,dpr1_1,true,null,null
2012,1,DEBIT,dpr1_2,true,null,null
```

```python
import os

from pyspark.sql import DataFrame, SparkSession
from pyspark.sql.functions import col, lit, when
```


```python
local = True

if local:
    # Local session with 4 worker threads
    spark = (
        SparkSession.builder
        .master("local[4]")
        .appName("VTLValidation")
        .getOrCreate()
    )
else:
    # Session on a Kubernetes cluster; account and namespace come from the environment
    spark = (
        SparkSession.builder
        .master("k8s://https://kubernetes.default.svc:443")
        .appName("VTLValidation")
        .config("spark.kubernetes.container.image", "inseefrlab/jupyter-datascience:py3.9.7-spark3.2.0")
        .config("spark.kubernetes.authenticate.driver.serviceAccountName", os.environ["KUBERNETES_SERVICE_ACCOUNT"])
        .config("spark.executor.instances", "4")
        .config("spark.executor.memory", "4g")
        .config("spark.kubernetes.namespace", os.environ["KUBERNETES_NAMESPACE"])
        .getOrCreate()
    )
```




```python
root_path = "../../data"
data_path = f"{root_path}/validation_ds.csv"

# Read the sample data; column types are inferred from the CSV content
df = spark.read.csv(data_path, header=True, inferSchema=True)
df.show()
```

```text
+----+----------+----------------+------------------+-------------+
|date|acc_number|transaction_type|transaction_amount|balance_after|
+----+----------+----------------+------------------+-------------+
|2011|         1|          CREDIT|               250|        23950|
|2011|         2|           DEBIT|               100|           -2|
|2012|         3|          CREDIT|              1000|        17800|
|2012|         4|           DEBIT|               200|        20550|
+----+----------+----------------+------------------+-------------+
```



## Step 1: Implement data point ruleset

A data point ruleset can contain one or more rules. For each rule, we need to define a corresponding validation function in Spark that implements the rule's logic and generates the resulting columns.

Note that `check_datapoint` has three output modes (`invalid`, `all`, `all_measures`), and each mode has its own output column layout, so every generated function must handle all three.

The functions below should be generated when we encounter **define datapoint ruleset**.

```python
# Complete the output columns for the check option "invalid".
def trans_for_invalid(ds, rule_id, error_code, error_level):
    return (
        ds.withColumn("rule_id", lit(rule_id))
        .withColumn("error_code", lit(error_code))
        .withColumn("error_level", lit(error_level))
    )
```

```python
# Complete the output columns for the check option "all".
# when() without otherwise() yields null, so rows that pass the rule
# get a null error_code and error_level.
def trans_for_all(ds, rule_id, error_code, error_level):
    return (
        ds.withColumn("rule_id", lit(rule_id))
        .withColumn("error_code", when(~col("bool_var"), error_code))
        .withColumn("error_level", when(~col("bool_var"), error_level))
    )
```

```python
# Implementation of rule dpr1_1 in dpr1; this should be generated from the dpr1 definition.
def dpr1_1(ds, option):
    rule_id = "valide_livret_A_1"
    cond_col = "transaction_type"
    cond_val = "CREDIT"
    check_col = "balance_after"
    check_val = 22950
    error_code = "Limit reached"
    error_level = "5"
    # The rule is violated when the condition applies and the balance exceeds the limit.
    violated = (col(cond_col) == cond_val) & (col(check_col) > check_val)
    if option == "invalid":
        tmp = ds.filter(violated).withColumnRenamed(check_col, "obs_value")
        return trans_for_invalid(tmp, rule_id, error_code, error_level)
    elif option == "all":
        tmp = ds.withColumn("bool_var", when(violated, False).otherwise(True))
        return trans_for_all(tmp, rule_id, error_code, error_level)
    elif option == "all_measures":
        # Not implemented yet: the input is returned unchanged.
        return ds
    else:
        raise ValueError("Unknown option value, accepted values are: invalid, all, all_measures")
```

```python
dpr1_1_resu = dpr1_1(df, "all")
dpr1_1_resu.show()
```

```text
+----+----------+----------------+------------------+-------------+--------+-----------------+-------------+-----------+
|date|acc_number|transaction_type|transaction_amount|balance_after|bool_var|          rule_id|   error_code|error_level|
+----+----------+----------------+------------------+-------------+--------+-----------------+-------------+-----------+
|2011|         1|          CREDIT|               250|        23950|   false|valide_livret_A_1|Limit reached|          5|
|2011|         2|           DEBIT|               100|           -2|    true|valide_livret_A_1|         null|       null|
|2012|         3|          CREDIT|              1000|        17800|    true|valide_livret_A_1|         null|       null|
|2012|         4|           DEBIT|               200|        20550|    true|valide_livret_A_1|         null|       null|
+----+----------+----------------+------------------+-------------+--------+-----------------+-------------+-----------+
```



```python
# Implementation of rule dpr1_2 in dpr1; this should be generated from the dpr1 definition.
def dpr1_2(ds, option):
    rule_id = "valide_livret_A_2"
    cond_col = "transaction_type"
    cond_val = "DEBIT"
    check_col = "balance_after"
    check_val = 0
    error_code = "Not enough credit"
    error_level = "6"
    # The rule is violated when the condition applies and the balance is negative.
    violated = (col(cond_col) == cond_val) & (col(check_col) < check_val)
    if option == "invalid":
        tmp = ds.filter(violated).withColumnRenamed(check_col, "obs_value")
        return trans_for_invalid(tmp, rule_id, error_code, error_level)
    elif option == "all":
        tmp = ds.withColumn("bool_var", when(violated, False).otherwise(True))
        return trans_for_all(tmp, rule_id, error_code, error_level)
    elif option == "all_measures":
        # Not implemented yet: the input is returned unchanged.
        return ds
    else:
        raise ValueError("Unknown option value, accepted values are: invalid, all, all_measures")
```
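
The two rule functions above differ only in their constants, so one possible way to generate them is to describe each rule as data and derive the Spark filter from it. The sketch below is plain Python. The `RuleSpec` class, its field names and the `sql_literal` helper are hypothetical, and it assumes that the `then` part is a simple `column operator literal` comparison and that string values contain no quotes or backslashes. It builds a SQL predicate string selecting the Data Points that violate a rule, which could then be passed to `DataFrame.filter` or `pyspark.sql.functions.expr`.

```python
from dataclasses import dataclass


def sql_literal(value):
    """Render a Python value as a SQL literal (assumes simple strings)."""
    if isinstance(value, str):
        if "'" in value or "\\" in value:
            raise ValueError("quotes and backslashes are not handled in this sketch")
        return f"'{value}'"
    return str(value)


@dataclass(frozen=True)
class RuleSpec:
    rule_id: str
    cond_col: str      # column tested in the "when" part
    cond_val: object   # value it must equal for the rule to apply
    check_col: str     # column tested in the "then" part
    check_op: str      # comparison that must hold, e.g. "<="
    check_val: object
    error_code: str
    error_level: int

    def failure_predicate(self) -> str:
        """SQL predicate that is true for Data Points violating the rule."""
        if self.check_op not in ("<", "<=", ">", ">=", "=", "<>"):
            raise ValueError(f"unsupported operator: {self.check_op}")
        return (
            f"{self.cond_col} = {sql_literal(self.cond_val)} "
            f"AND NOT ({self.check_col} {self.check_op} {sql_literal(self.check_val)})"
        )


# The two rules, expressed against the notebook's column names and read as
# validity conditions (an assumption of this sketch).
rule_1 = RuleSpec("valide_livret_A_1", "transaction_type", "CREDIT",
                  "balance_after", "<=", 22950, "Limit reached", 5)
rule_2 = RuleSpec("valide_livret_A_2", "transaction_type", "DEBIT",
                  "balance_after", ">=", 0, "Not enough credit", 6)

predicate_1 = rule_1.failure_predicate()
predicate_2 = rule_2.failure_predicate()
# With a Spark session available, ds.filter(predicate_1) would keep the violating rows.
```

Keeping the rule as data means a single generic function can serve every rule of a ruleset, instead of one hand-written function per rule.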

```python
dpr1_2_resu = dpr1_2(df, "all")
dpr1_2_resu.show()
```

```text
+----+----------+----------------+------------------+-------------+--------+-----------------+-----------------+-----------+
|date|acc_number|transaction_type|transaction_amount|balance_after|bool_var|          rule_id|       error_code|error_level|
+----+----------+----------------+------------------+-------------+--------+-----------------+-----------------+-----------+
|2011|         1|          CREDIT|               250|        23950|    true|valide_livret_A_2|             null|       null|
|2011|         2|           DEBIT|               100|           -2|   false|valide_livret_A_2|Not enough credit|          6|
|2012|         3|          CREDIT|              1000|        17800|    true|valide_livret_A_2|             null|       null|
|2012|         4|           DEBIT|               200|        20550|    true|valide_livret_A_2|             null|       null|
+----+----------+----------------+------------------+-------------+--------+-----------------+-----------------+-----------+
```


```python
dpr1_1_resu.union(dpr1_2_resu).show()
```

```text
+----+----------+----------------+------------------+-------------+--------+-----------------+-----------------+-----------+
|date|acc_number|transaction_type|transaction_amount|balance_after|bool_var|          rule_id|       error_code|error_level|
+----+----------+----------------+------------------+-------------+--------+-----------------+-----------------+-----------+
|2011|         1|          CREDIT|               250|        23950|   false|valide_livret_A_1|    Limit reached|          5|
|2011|         2|           DEBIT|               100|           -2|    true|valide_livret_A_1|             null|       null|
|2012|         3|          CREDIT|              1000|        17800|    true|valide_livret_A_1|             null|       null|
|2012|         4|           DEBIT|               200|        20550|    true|valide_livret_A_1|             null|       null|
|2011|         1|          CREDIT|               250|        23950|    true|valide_livret_A_2|             null|       null|
```


## Step 2: Apply data point ruleset on a data frame

This function should be generated when a **check_datapoint** call is encountered.
Note that the rulesets and their rules are generated in Step 1, so they must already exist when **check_datapoint** is called.

```python
def data_validation(ds, rules, option):
    # Apply the first rule, then union the result of every remaining rule
    result = rules[0](ds, option)
    for i in range(1, len(rules)):
        result = result.union(rules[i](ds, option))
    return result
```
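
The loop above can also be written as a fold. The sketch below relies only on the result objects exposing a `union` method, as Spark DataFrames do. `apply_rules` is a hypothetical name, `FakeFrame` is a hypothetical stand-in so that the example runs without Spark, and the empty-ruleset check is an addition that the notebook's function does not have.

```python
from functools import reduce


def apply_rules(ds, rules, option):
    """Apply every rule function to ds and union the results."""
    if not rules:
        raise ValueError("at least one rule is required")
    results = (rule(ds, option) for rule in rules)
    return reduce(lambda acc, nxt: acc.union(nxt), results)


class FakeFrame:
    """Hypothetical stand-in for a DataFrame: a list of rows with union()."""

    def __init__(self, rows):
        self.rows = list(rows)

    def union(self, other):
        return FakeFrame(self.rows + other.rows)


# Two toy rule functions with the same (ds, option) signature as the notebook's.
def keep_negative(ds, option):
    return FakeFrame(r for r in ds.rows if r < 0)


def keep_large(ds, option):
    return FakeFrame(r for r in ds.rows if r > 100)


combined = apply_rules(FakeFrame([-2, 10, 250]), [keep_negative, keep_large], "invalid")
```

As with the loop, the results are combined by position, so every rule function has to return its columns in the same order.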

```python
rule_sets = [dpr1_1, dpr1_2]
```
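
The list above ties the ruleset to its rule functions by hand. Since a ruleset must be defined before `check_datapoint` can use it, a small registry keyed by ruleset name could enforce that ordering. This is only a sketch: `RULESETS`, `register_ruleset`, `get_ruleset` and the placeholder rule functions are hypothetical names, not part of the notebook.

```python
RULESETS = {}  # ruleset name -> list of rule functions


def register_ruleset(name, rules):
    """Record the rule functions generated for `define datapoint ruleset <name>`."""
    if not rules:
        raise ValueError(f"ruleset {name!r} has no rules")
    RULESETS[name] = list(rules)


def get_ruleset(name):
    """Return the rule functions of `name`, failing early if it was never defined."""
    try:
        return RULESETS[name]
    except KeyError:
        raise LookupError(f"datapoint ruleset {name!r} is not defined") from None


# Placeholder rule functions standing in for the generated ones.
def rule_a(ds, option):
    return ds


def rule_b(ds, option):
    return ds


register_ruleset("valide_livret_A", [rule_a, rule_b])
registered = get_ruleset("valide_livret_A")
```

A `check_datapoint` call would then look up its ruleset by name and fail with a clear message when the definition is missing.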

```python
invalid_resu = data_validation(df, rule_sets, "invalid")
invalid_resu.show()
```

```text
+----+----------+----------------+------------------+---------+-----------------+-----------------+-----------+
|date|acc_number|transaction_type|transaction_amount|obs_value|          rule_id|       error_code|error_level|
+----+----------+----------------+------------------+---------+-----------------+-----------------+-----------+
|2011|         1|          CREDIT|               250|    23950|valide_livret_A_1|    Limit reached|          5|
|2011|         2|           DEBIT|               100|       -2|valide_livret_A_2|Not enough credit|          6|
+----+----------+----------------+------------------+---------+-----------------+-----------------+-----------+
```



```python
all_resu = data_validation(df, rule_sets, "all")
all_resu.show()
```

```text
+----+----------+----------------+------------------+-------------+--------+-----------------+-----------------+-----------+
|date|acc_number|transaction_type|transaction_amount|balance_after|bool_var|          rule_id|       error_code|error_level|
+----+----------+----------------+------------------+-------------+--------+-----------------+-----------------+-----------+
|2011|         1|          CREDIT|               250|        23950|   false|valide_livret_A_1|    Limit reached|          5|
|2011|         2|           DEBIT|               100|           -2|    true|valide_livret_A_1|             null|       null|
|2012|         3|          CREDIT|              1000|        17800|    true|valide_livret_A_1|             null|       null|
|2012|         4|           DEBIT|               200|        20550|    true|valide_livret_A_1|             null|       null|
|2011|         1|          CREDIT|               250|        23950|    true|valide_livret_A_2|             null|       null|
```
