# VTL Check

The original documentation starts at **line 6575 of the VTL-2.0-Reference-Manual**.

## Syntax

```text
check ( op { errorcode errorcode } { errorlevel errorlevel } { imbalance imbalance } { output } )
  - output ::= invalid | all

```

- **op**: a boolean Data Set (a boolean condition expressed on one or more Data Sets)
- **errorcode**: the error code to be produced when the condition evaluates to FALSE. It must be a valid value of the `errorcode_vd` Value Domain (or a string if the `errorcode_vd` Value Domain is not found). It can be a Data Set or a scalar. If not specified then `errorcode` is NULL.
- **errorlevel**: the error level to be produced when the condition evaluates to FALSE. It must be a valid value of the `errorlevel_vd` Value Domain (or an integer if the `errorlevel_vd` Value Domain is not found). It can be a Data Set or a scalar. If not specified then `errorlevel` is NULL.
- **imbalance**: the imbalance to be computed. `imbalance` is a numeric mono-measure Data Set with the same Identifiers as `op`. If not specified then `imbalance` is NULL.
- **output**: specifies which Data Points are returned to the resulting Data Set:
     - **invalid**: returns the Data Points of `op` for which the condition evaluates to FALSE
     - **all**: returns all Data Points of op
     - If not specified then output is `all`.
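
The interplay of `errorcode`, `errorlevel` and `output` can be sketched in plain Python. This is only an illustration under assumptions: the helper name `apply_check_options` and the dict-based representation of Data Points are hypothetical, and the sketch keeps `bool_var` in both output modes for simplicity.

```python
from typing import Any, Dict, List, Optional


def apply_check_options(
    points: List[Dict[str, Any]],
    errorcode: Optional[str] = None,
    errorlevel: Optional[int] = None,
    output: str = "all",
) -> List[Dict[str, Any]]:
    """Hypothetical helper: attach errorcode/errorlevel and apply the output filter.

    Each input point is a dict that already carries a boolean "bool_var".
    """
    if output not in ("all", "invalid"):
        raise ValueError(f"output must be 'all' or 'invalid', got {output!r}")

    result = []
    for point in points:
        failed = point["bool_var"] is False
        # "invalid" keeps only the points for which the condition is FALSE
        if output == "invalid" and not failed:
            continue
        row = dict(point)
        # errorcode/errorlevel are only produced for failing points, else NULL (None)
        row["errorcode"] = errorcode if failed else None
        row["errorlevel"] = errorlevel if failed else None
        result.append(row)
    return result


points = [
    {"Id_1": 2010, "bool_var": False},
    {"Id_1": 2011, "bool_var": True},
]
all_rows = apply_check_options(points, errorcode="E1", errorlevel=6)
invalid_rows = apply_check_options(points, errorcode="E1", errorlevel=6, output="invalid")
```

With the default `output="all"` both points are returned and only the failing one carries an error code; with `output="invalid"` only the failing point is returned.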

## Example

Given the Data Set DS_1:
```text
Id_1,Id_2,Me_1
2010,I,1
2011,I,2
2012,I,10
2013,I,4
2014,I,5
2015,I,6
2010,D,25
2011,D,35
2012,D,45
2013,D,55
2014,D,50
2015,D,75
```

Given the Data Set DS_2:
```text
Id_1,Id_2,Me_1
2010,I,9
2011,I,2
2012,I,10
2013,I,7
2014,I,5
2015,I,6
2010,D,50
2011,D,35
2012,D,40
2013,D,55
2014,D,65
2015,D,75
```

`DS_r := check ( DS_1 >= DS_2 imbalance DS_1 - DS_2 )` results in:

```text
Id_1 Id_2 bool_var imbalance errorcode errorlevel
2010 I FALSE -8 NULL NULL
2011 I TRUE 0 NULL NULL
2012 I TRUE 0 NULL NULL
2013 I FALSE -3 NULL NULL
2014 I TRUE 0 NULL NULL
2015 I TRUE 0 NULL NULL
2010 D FALSE -25 NULL NULL
2011 D TRUE 0 NULL NULL
2012 D TRUE 5 NULL NULL
2013 D TRUE 0 NULL NULL
2014 D FALSE -15 NULL NULL
2015 D TRUE 0 NULL NULL
```
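
The result table can be reproduced by hand with a few lines of plain Python, which is a convenient way to double-check the expected values before moving to Spark. This is a minimal sketch: the dictionaries `ds_1` and `ds_2` are an assumed in-memory representation of the two Data Sets above, keyed by `(Id_1, Id_2)`.

```python
# Data Sets from the example, keyed by the identifiers (Id_1, Id_2)
ds_1 = {
    (2010, "I"): 1, (2011, "I"): 2, (2012, "I"): 10,
    (2013, "I"): 4, (2014, "I"): 5, (2015, "I"): 6,
    (2010, "D"): 25, (2011, "D"): 35, (2012, "D"): 45,
    (2013, "D"): 55, (2014, "D"): 50, (2015, "D"): 75,
}
ds_2 = {
    (2010, "I"): 9, (2011, "I"): 2, (2012, "I"): 10,
    (2013, "I"): 7, (2014, "I"): 5, (2015, "I"): 6,
    (2010, "D"): 50, (2011, "D"): 35, (2012, "D"): 40,
    (2013, "D"): 55, (2014, "D"): 65, (2015, "D"): 75,
}

rows = []
for key, left in ds_1.items():
    if key not in ds_2:
        continue  # only identifier combinations present in both Data Sets are compared
    right = ds_2[key]
    bool_var = left >= right   # the condition DS_1 >= DS_2
    imbalance = left - right   # the imbalance DS_1 - DS_2
    rows.append((key[0], key[1], bool_var, imbalance))

for row in rows:
    print(*row)

# Data Points for which the condition evaluates to FALSE
failing = [row for row in rows if not row[2]]
```

The four failing Data Points are the ones marked FALSE in the table: 2010/I, 2013/I, 2010/D and 2014/D.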


In [1]:
import os

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, when


In [2]:
local = True

if local:
    spark = SparkSession.builder \
        .master("local[4]") \
        .appName("VTLValidation-check")\
        .getOrCreate()
else:
    spark = SparkSession.builder\
        .master("k8s://https://kubernetes.default.svc:443") \
        .appName("VTLValidation-check")\
        .config("spark.kubernetes.container.image", "inseefrlab/jupyter-datascience:py3.9.7-spark3.2.0")\
        .config("spark.kubernetes.authenticate.driver.serviceAccountName", os.environ['KUBERNETES_SERVICE_ACCOUNT'])\
        .config("spark.executor.instances", "4")\
        .config("spark.executor.memory", "4g")\
        .config("spark.kubernetes.namespace", os.environ['KUBERNETES_NAMESPACE'])\
        .getOrCreate()




In [3]:
root_path="../../data"
data_path1=f"{root_path}/check_ds1.csv"
data_path2=f"{root_path}/check_ds2.csv"
df1=spark.read.csv(data_path1, header=True,inferSchema=True)
df2=spark.read.csv(data_path2, header=True,inferSchema=True)


In [4]:
df1.show()

+----+----+----+
|Id_1|Id_2|Me_1|
+----+----+----+
|2010|   I|   1|
|2011|   I|   2|
|2012|   I|  10|
|2013|   I|   4|
|2014|   I|   5|
|2015|   I|   6|
|2010|   D|  25|
|2011|   D|  35|
|2012|   D|  45|
|2013|   D|  55|
|2014|   D|  50|
|2015|   D|  75|
+----+----+----+



In [5]:
df2.show()

+----+----+----+
|Id_1|Id_2|Me_1|
+----+----+----+
|2010|   I|   9|
|2011|   I|   2|
|2012|   I|  10|
|2013|   I|   7|
|2014|   I|   5|
|2015|   I|   6|
|2010|   D|  50|
|2011|   D|  35|
|2012|   D|  40|
|2013|   D|  55|
|2014|   D|  65|
|2015|   D|  75|
+----+----+----+



## Step 1: Inner join the two DataFrames



In [10]:
target_col_name="Me_1"
clean_col_name1=f"df1_{target_col_name}"
clean_col_name2=f"df2_{target_col_name}"
df1_clean=df1.withColumnRenamed(target_col_name,clean_col_name1)
df2_clean=df2.withColumnRenamed(target_col_name,clean_col_name2)
df_join=df1_clean.join(df2_clean, ["Id_1","Id_2"])

In [11]:
df_join.show()

+----+----+--------+--------+
|Id_1|Id_2|df1_Me_1|df2_Me_1|
+----+----+--------+--------+
|2010|   I|       1|       9|
|2011|   I|       2|       2|
|2012|   I|      10|      10|
|2013|   I|       4|       7|
|2014|   I|       5|       5|
|2015|   I|       6|       6|
|2010|   D|      25|      50|
|2011|   D|      35|      35|
|2012|   D|      45|      40|
|2013|   D|      55|      55|
|2014|   D|      50|      65|
|2015|   D|      75|      75|
+----+----+--------+--------+



## Step 2: Apply the check condition and compute the imbalance

In [14]:
df_part=df_join.withColumn("imbalance",col(clean_col_name1)-col(clean_col_name2))\
    .withColumn("bool_var",when((col(clean_col_name1) >= col(clean_col_name2)),True).otherwise(False))
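
One detail of the expression above is worth spelling out. A SQL-style comparison returns NULL when either operand is NULL, and `when(...).otherwise(False)` maps that NULL to `False`, so a Data Point with a missing measure is reported as failing rather than as unknown. The plain-Python sketch below only emulates this behaviour; the helper names `sql_ge` and `bool_var` are hypothetical and `None` stands in for NULL.

```python
from typing import Optional


def sql_ge(a: Optional[int], b: Optional[int]) -> Optional[bool]:
    """Emulate a SQL-style >= comparison: NULL (None) if either side is NULL."""
    if a is None or b is None:
        return None
    return a >= b


def bool_var(a: Optional[int], b: Optional[int]) -> bool:
    """Emulate when(cond, True).otherwise(False): anything but TRUE becomes False."""
    return sql_ge(a, b) is True


print(sql_ge(45, 40), bool_var(45, 40))      # comparison holds
print(sql_ge(1, 9), bool_var(1, 9))          # comparison fails
print(sql_ge(None, 9), bool_var(None, 9))    # unknown comparison is treated as failing
```

If unknown comparisons should stay NULL instead, using the comparison expression directly (without `when`/`otherwise`) would preserve them; which behaviour is wanted depends on how missing measures should be validated.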

## Step 3: Apply the error code and error level

In [19]:
error_code="Ds1 is not valid"
error_level=6  # errorlevel is an integer when no errorlevel_vd Value Domain is defined

# when() without otherwise() leaves a real NULL (not the string "null") on valid rows
df_resu=df_part.withColumn("error_code",when(~col("bool_var"),error_code)) \
           .withColumn("error_level",when(~col("bool_var"),error_level))

In [20]:
df_resu.show()

+----+----+--------+--------+---------+--------+----------------+-----------+
|Id_1|Id_2|df1_Me_1|df2_Me_1|imbalance|bool_var|      error_code|error_level|
+----+----+--------+--------+---------+--------+----------------+-----------+
|2010|   I|       1|       9|       -8|   false|Ds1 is not valid|          6|
|2011|   I|       2|       2|        0|    true|            null|       null|
|2012|   I|      10|      10|        0|    true|            null|       null|
|2013|   I|       4|       7|       -3|   false|Ds1 is not valid|          6|
|2014|   I|       5|       5|        0|    true|            null|       null|
|2015|   I|       6|       6|        0|    true|            null|       null|
|2010|   D|      25|      50|      -25|   false|Ds1 is not valid|          6|
|2011|   D|      35|      35|        0|    true|            null|       null|
|2012|   D|      45|      40|        5|    true|            null|       null|
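|2013|   D|      55|      55|        0|    true|            null|       null|
|2014|   D|      50|      65|      -15|   false|Ds1 is not valid|          6|
|2015|   D|      75|      75|        0|    true|            null|       null|
+----+----+--------+--------+---------+--------+----------------+-----------+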

In [23]:
def check_01(df1,df2,option="all"):
    """Emulate check(DS_1 >= DS_2 imbalance DS_1 - DS_2) on two Spark DataFrames.

    option: "all" returns every data point, "invalid" only the failing ones.
    """
    if option not in ("all","invalid"):
        raise ValueError(f"option must be 'all' or 'invalid', got {option!r}")
    error_code="Ds1 is not valid"
    error_level=6
    target_col_name="Me_1"
    clean_col_name1=f"df1_{target_col_name}"
    clean_col_name2=f"df2_{target_col_name}"
    # Rename the measure on each side so both survive the join
    df1_clean=df1.withColumnRenamed(target_col_name,clean_col_name1)
    df2_clean=df2.withColumnRenamed(target_col_name,clean_col_name2)
    # Inner join on the identifiers
    df_join=df1_clean.join(df2_clean, ["Id_1","Id_2"])

    # Compute the imbalance and evaluate the condition
    df_part=df_join.withColumn("imbalance",col(clean_col_name1)-col(clean_col_name2))\
    .withColumn("bool_var",when((col(clean_col_name1) >= col(clean_col_name2)),True).otherwise(False))
    # when() without otherwise() leaves a real NULL on valid rows
    df_all=df_part.withColumn("error_code",when(~col("bool_var"),error_code)) \
           .withColumn("error_level",when(~col("bool_var"),error_level))
    if option=="invalid":
        return df_all.filter(~col("bool_var"))
    return df_all


In [24]:
result_all=check_01(df1,df2,"all")
result_all.show()

+----+----+--------+--------+---------+--------+----------------+-----------+
|Id_1|Id_2|df1_Me_1|df2_Me_1|imbalance|bool_var|      error_code|error_level|
+----+----+--------+--------+---------+--------+----------------+-----------+
|2010|   I|       1|       9|       -8|   false|Ds1 is not valid|          6|
|2011|   I|       2|       2|        0|    true|            null|       null|
|2012|   I|      10|      10|        0|    true|            null|       null|
|2013|   I|       4|       7|       -3|   false|Ds1 is not valid|          6|
|2014|   I|       5|       5|        0|    true|            null|       null|
|2015|   I|       6|       6|        0|    true|            null|       null|
|2010|   D|      25|      50|      -25|   false|Ds1 is not valid|          6|
|2011|   D|      35|      35|        0|    true|            null|       null|
|2012|   D|      45|      40|        5|    true|            null|       null|
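|2013|   D|      55|      55|        0|    true|            null|       null|
|2014|   D|      50|      65|      -15|   false|Ds1 is not valid|          6|
|2015|   D|      75|      75|        0|    true|            null|       null|
+----+----+--------+--------+---------+--------+----------------+-----------+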

In [25]:
result_invalid=check_01(df1,df2,"invalid")
result_invalid.show()

+----+----+--------+--------+---------+--------+----------------+-----------+
|Id_1|Id_2|df1_Me_1|df2_Me_1|imbalance|bool_var|      error_code|error_level|
+----+----+--------+--------+---------+--------+----------------+-----------+
|2010|   I|       1|       9|       -8|   false|Ds1 is not valid|          6|
|2013|   I|       4|       7|       -3|   false|Ds1 is not valid|          6|
|2010|   D|      25|      50|      -25|   false|Ds1 is not valid|          6|
|2014|   D|      50|      65|      -15|   false|Ds1 is not valid|          6|
+----+----+--------+--------+---------+--------+----------------+-----------+

