# Use TDDA to validate data


In [4]:
from tdda.constraints import discover_df, verify_df
import pandas as pd
import os

## Step 1: Read data

In [5]:
file_path = "../../data/adult_with_duplicates.csv"
columns = ["age", "workclass", "fnlwgt", "education", "education-num", "marital-status", "occupation", "relationship",
           "race", "sex", "capital-gain", "capital-loss", "hours-per-week", "native-country", "income"]
df_base = pd.read_csv(file_path, names=columns, header=None, sep=',', na_values=["null"])
df_base.head()

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income
0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income
1,139,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
2,fourty five,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
3,-12,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
4,,emp-by-pengfei,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K


In [16]:
root_dir = './tdda_refs'
my_rule_path = 'my_constraints.tdda'

In [19]:
# Test
validation_rule_path = f'{root_dir}/{my_rule_path}'
dir_path=os.path.dirname(validation_rule_path)
print(dir_path)

/home/pliu/git/DataQualityAndValidation/demo/02.tdda/tdda_refs


In [22]:

result = verify_df(df_base, validation_rule_path, type_checking='strict', epsilon=0)
print(str(result))

FIELDS:

age: 4 failures  1 pass  type ✗  min ✗  max ✗  sign ✗  max_nulls ✓

SUMMARY:

Constraints passing: 1
Constraints failing: 4


In [4]:
# generate the constraint
constraints = discover_df(df_base)

# Show the generated constraints
print(str(constraints))


FIELDS:

Field age:
           type: TypeConstraint(value='string')
     min_length: MinLengthConstraint(value=2)
     max_length: MaxLengthConstraint(value=11)
      max_nulls: MaxNullsConstraint(value=1)

Field workclass:
           type: TypeConstraint(value='string')
     min_length: MinLengthConstraint(value=7)
     max_length: MaxLengthConstraint(value=16)
  allowed_values: AllowedValuesConstraint(value=['Federal-gov', 'Local-gov', 'Never-worked', 'Private', 'Self-emp-inc', 'Self-emp-not-inc', 'State-gov', 'Without-pay', 'emp-by-pengfei', 'workclass'])

Field fnlwgt:
           type: TypeConstraint(value='string')
     min_length: MinLengthConstraint(value=5)
     max_length: MaxLengthConstraint(value=7)
      max_nulls: MaxNullsConstraint(value=0)

Field education:
           type: TypeConstraint(value='string')
     min_length: MinLengthConstraint(value=3)
     max_length: MaxLengthConstraint(value=12)
      max_nulls: MaxNullsConstraint(value=0)
  allowed_values: AllowedValues

## Step 2: Save the constraints for validation

The generated constraints are saved in JSON format. TDDA groups the validation rules by column. For example, for the column 'age' (a numeric column), we have the following rules:

```json
{
"age": {
            "type": "int",
            "min": 17,
            "max": 90,
            "sign": "positive",
            "max_nulls": 0
        }
}
```

- "type": "int" means every value in this column must be an integer.
- "min": 17 means no value in this column may be lower than 17. (You will often need to adjust the min and max values by hand: they are derived from the base data used for discovery, which may not cover all legitimate values.)
- "max": 90 means no value in this column may be higher than 90.
- "sign": "positive" means every value in this column must be positive.
- "max_nulls": 0 means this column may not contain any null values.
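
To make the meaning of these five rules concrete, here is a minimal plain-Python sketch that evaluates them against a list of values. It is not how TDDA is implemented: the function name `check_numeric_field` and the `AGE_RULES` dictionary are hypothetical, and the sketch only mirrors the rule semantics described above (it handles only the "positive" sign, and uses `None` to stand for null).

```python
# Illustrative only: mirrors the semantics of the "age" rules, not TDDA's own code.
AGE_RULES = {"type": "int", "min": 17, "max": 90, "sign": "positive", "max_nulls": 0}


def check_numeric_field(values, rules):
    """Return {rule_name: passed} for a list of values (None stands for null)."""
    non_null = [v for v in values if v is not None]
    # bool is a subclass of int in Python, so exclude it explicitly
    all_ints = all(isinstance(v, int) and not isinstance(v, bool) for v in non_null)
    numbers = [v for v in non_null
               if isinstance(v, (int, float)) and not isinstance(v, bool)]
    # min/max/sign can only be evaluated if every non-null value is a number
    comparable = bool(numbers) and len(numbers) == len(non_null)
    return {
        "type": all_ints,
        "min": comparable and min(numbers) >= rules["min"],
        "max": comparable and max(numbers) <= rules["max"],
        # this sketch only understands sign == "positive"
        "sign": comparable and all(v > 0 for v in numbers),
        "max_nulls": len(values) - len(non_null) <= rules["max_nulls"],
    }


good = check_numeric_field([39, 50, 17, 90], AGE_RULES)
bad = check_numeric_field([139, "fourty five", -12, None], AGE_RULES)
print(good)
print(bad)
```

The second call uses the kind of values seen in the first rows of the sample file (an out-of-range age, a textual age, a negative age and a null), so every rule fails for it.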

For the column workclass (a string/categorical column), we have the following rules:
```json
{
"workclass": {
            "type": "string",
            "min_length": 7,
            "max_length": 16,
            "allowed_values": [
                "Federal-gov",
                "Local-gov",
                "Never-worked",
                "Private",
                "Self-emp-inc",
                "Self-emp-not-inc",
                "State-gov",
                "Without-pay"
            ]
        }
}
```

- "type": "string" means every value in this column must be a string.
- "min_length": 7 means every string must be at least 7 characters long.
- "max_length": 16 means every string must be at most 16 characters long.
- "allowed_values" means every value in this column must appear in the list.
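
The string rules can be illustrated the same way. The sketch below is a plain-Python approximation of their semantics, not TDDA's implementation; `check_string_field` and `WORKCLASS_RULES` are hypothetical names, and `None` stands for null. A `max_nulls` check is only evaluated when that rule is present in the rule set.

```python
# Illustrative only: mirrors the semantics of the "workclass" rules, not TDDA's own code.
WORKCLASS_RULES = {
    "type": "string",
    "min_length": 7,
    "max_length": 16,
    "allowed_values": ["Federal-gov", "Local-gov", "Never-worked", "Private",
                       "Self-emp-inc", "Self-emp-not-inc", "State-gov", "Without-pay"],
}


def check_string_field(values, rules):
    """Return {rule_name: passed} for a list of values (None stands for null)."""
    non_null = [v for v in values if v is not None]
    lengths = [len(v) for v in non_null if isinstance(v, str)]
    result = {
        "type": all(isinstance(v, str) for v in non_null),
        "min_length": bool(lengths) and min(lengths) >= rules["min_length"],
        "max_length": bool(lengths) and max(lengths) <= rules["max_length"],
        "allowed_values": all(v in rules["allowed_values"] for v in non_null),
    }
    # Nulls are only counted when a max_nulls rule exists for the column
    if "max_nulls" in rules:
        result["max_nulls"] = len(values) - len(non_null) <= rules["max_nulls"]
    return result


ok = check_string_field(["Private", "State-gov", None], WORKCLASS_RULES)
bad = check_string_field(["Private", "emp-by-pengfei"], WORKCLASS_RULES)
print(ok)
print(bad)
```

In the second call, "emp-by-pengfei" satisfies the type and length rules but is not in the allowed list, so only `allowed_values` fails.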

Note that there is no "max_nulls": 0 rule for this column, because the column contains many null values. We will first validate the data with the generated rules, then add new rules and run the validation again.

In [8]:
def write_constrain(constrains, constrain_root_dir: str, constrain_file_name: str):
    """
    Write a tdda constraints object to a file in JSON format.

    :param constrains: tdda constraints object (for example, the result of discover_df)
    :param constrain_root_dir: root directory in which to store the constraints file
    :param constrain_file_name: name of the constraints file
    :return: None
    """
    # Create the target directory (and any missing parents) if needed
    os.makedirs(constrain_root_dir, exist_ok=True)
    constraints_path = f'{constrain_root_dir}/{constrain_file_name}'
    print(f"write constrain to path {constraints_path}")
    with open(constraints_path, 'w') as f:
        # Use the function argument, not the global `constraints` variable
        f.write(constrains.to_json())

In [9]:
# write the constraints to the local file system
file_name = 'generated_constraints.tdda'
write_constrain(constraints, root_dir, file_name)

write constrain to path tdda_refs/generated_constraints.tdda


Now we can use the generated constraints to validate some data. Let's start with the valid data and look at the output:

FIELDS:

age: 0 failures  4 passes  type ✓  min_length ✓  max_length ✓  max_nulls ✓

workclass: 0 failures  4 passes  type ✓  min_length ✓  max_length ✓  allowed_values ✓

fnlwgt: 0 failures  4 passes  type ✓  min_length ✓  max_length ✓  max_nulls ✓

education: 0 failures  5 passes  type ✓  min_length ✓  max_length ✓  max_nulls ✓  allowed_values ✓

education-num: 0 failures  5 passes  type ✓  min_length ✓  max_length ✓  max_nulls ✓  allowed_values ✓

marital-status: 0 failures  5 passes  type ✓  min_length ✓  max_length ✓  max_nulls ✓  allowed_values ✓

occupation: 0 failures  4 passes  type ✓  min_length ✓  max_length ✓  allowed_values ✓

relationship: 0 failures  5 passes  type ✓  min_length ✓  max_length ✓  max_nulls ✓  allowed_values ✓

race: 0 failures  5 passes  type ✓  min_length ✓  max_length ✓  max_nulls ✓  allowed_values ✓

sex: 0 failures  5 passes  type ✓  min_length ✓  max_length ✓  max_nulls ✓  allowed_values ✓

capital-gain: 0 failures  4 passes  type ✓  min_length ✓  ma

All validation rules pass the test. Now let's add some new rules.

For this purpose, I created a new constraint JSON file called "my_constraints.tdda". In the snippet below, I added a new rule: the maximum number of null values in column workclass is 10.

```json
{
"workclass": {
            "type": "string",
            "min_length": 7,
            "max_length": 16,
            "allowed_values": [
                "Federal-gov",
                "Local-gov",
                "Never-worked",
                "Private",
                "Self-emp-inc",
                "Self-emp-not-inc",
                "State-gov",
                "Without-pay"
            ],
             "max_nulls": 10
        }
}
```
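
Instead of editing the file by hand, the same change could be scripted with the standard `json` module. The sketch below is only an illustration: the helper name `add_max_nulls` is hypothetical, and it assumes the per-column rules sit under a top-level "fields" key (falling back to a flat layout like the snippet above if that key is absent). The demo runs against a tiny stand-in file in a temporary directory, not the real constraints file.

```python
import json
import os
import tempfile


def add_max_nulls(src_path, dst_path, column, max_nulls):
    """Copy a constraints file, adding a max_nulls rule for one column."""
    with open(src_path) as f:
        spec = json.load(f)
    # Assumption: rules are grouped per column under "fields";
    # otherwise treat the top-level object as the column mapping.
    fields = spec.get("fields", spec)
    fields[column]["max_nulls"] = max_nulls
    with open(dst_path, "w") as f:
        json.dump(spec, f, indent=4)
    return spec


# Demo on a stand-in file (the content here is made up for illustration)
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "generated_constraints.tdda")
    dst = os.path.join(tmp, "my_constraints.tdda")
    with open(src, "w") as f:
        json.dump({"fields": {"workclass": {"type": "string",
                                            "min_length": 7,
                                            "max_length": 16}}}, f)
    updated = add_max_nulls(src, dst, "workclass", 10)
    with open(dst) as f:
        reloaded = json.load(f)

print(reloaded["fields"]["workclass"])
```

Writing to a separate destination file keeps the generated constraints untouched, so the two rule sets can be compared later.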

In [13]:
my_constrains="my_constraints.tdda"
constraints_path = f'{root_dir}/{my_constrains}'
valid_result = verify_df(df_base, constraints_path, type_checking='strict', epsilon=0)
print(str(valid_result))

FIELDS:

age: 5 failures  0 passes  type ✗  min ✗  max ✗  sign ✗  max_nulls ✗

workclass: 0 failures  4 passes  type ✓  min_length ✓  max_length ✓  allowed_values ✓

fnlwgt: 0 failures  4 passes  type ✓  min_length ✓  max_length ✓  max_nulls ✓

education: 0 failures  5 passes  type ✓  min_length ✓  max_length ✓  max_nulls ✓  allowed_values ✓

education-num: 0 failures  5 passes  type ✓  min_length ✓  max_length ✓  max_nulls ✓  allowed_values ✓

marital-status: 0 failures  5 passes  type ✓  min_length ✓  max_length ✓  max_nulls ✓  allowed_values ✓

occupation: 0 failures  4 passes  type ✓  min_length ✓  max_length ✓  allowed_values ✓

relationship: 0 failures  5 passes  type ✓  min_length ✓  max_length ✓  max_nulls ✓  allowed_values ✓

race: 0 failures  5 passes  type ✓  min_length ✓  max_length ✓  max_nulls ✓  allowed_values ✓

sex: 0 failures  5 passes  type ✓  min_length ✓  max_length ✓  max_nulls ✓  allowed_values ✓

capital-gain: 0 failures  4 passes  type ✓  min_length ✓  max_leng

In the result you can see that validation rules now fail, max_nulls among them.

Now let's try the constraints on some sample data that contains many errors.

In [17]:
test_file_path="../data/adult_with_duplicates.csv"
df_test=pd.read_csv(test_file_path,header=None, names=columns,sep=',', na_values=["null"])

test_result = verify_df(df_test, constraints_path, type_checking='strict', epsilon=0)
print(str(test_result))

FIELDS:

age: 5 failures  0 passes  type ✗  min ✗  max ✗  sign ✗  max_nulls ✗

workclass: 2 failures  3 passes  type ✓  min_length ✓  max_length ✓  max_nulls ✗  allowed_values ✗

fnlwgt: 4 failures  1 pass  type ✗  min ✗  max ✗  sign ✗  max_nulls ✓

education: 1 failure  4 passes  type ✓  min_length ✓  max_length ✓  max_nulls ✓  allowed_values ✗

education-num: 4 failures  1 pass  type ✗  min ✗  max ✗  sign ✗  max_nulls ✓

marital-status: 1 failure  4 passes  type ✓  min_length ✓  max_length ✓  max_nulls ✓  allowed_values ✗

occupation: 1 failure  3 passes  type ✓  min_length ✓  max_length ✓  allowed_values ✗

relationship: 1 failure  4 passes  type ✓  min_length ✓  max_length ✓  max_nulls ✓  allowed_values ✗

race: 2 failures  3 passes  type ✓  min_length ✗  max_length ✓  max_nulls ✓  allowed_values ✗

sex: 2 failures  3 passes  type ✓  min_length ✗  max_length ✓  max_nulls ✓  allowed_values ✗

capital-gain: 4 failures  1 pass  type ✗  min ✗  max ✗  sign ✗  max_nulls ✓

capital-loss: 

Now many validation rules fail. For example, every validation rule for the age column fails.
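
With this many failures, it can help to list only the failing rules per column. The sketch below parses the printed report text with a regular expression; it is a rough illustration that assumes the line layout shown in the output above, and the names `LINE` and `failed_rules` are hypothetical. Parsing printed text is brittle, so treat it as a convenience for reading a report, not as a stable interface.

```python
import re

# A few lines copied from the report printed above
REPORT = """\
age: 5 failures  0 passes  type ✗  min ✗  max ✗  sign ✗  max_nulls ✗
workclass: 2 failures  3 passes  type ✓  min_length ✓  max_length ✓  max_nulls ✗  allowed_values ✗
fnlwgt: 4 failures  1 pass  type ✗  min ✗  max ✗  sign ✗  max_nulls ✓
"""

# field name, failure/pass counts, then alternating "rule mark" tokens
LINE = re.compile(
    r"^(?P<field>[\w-]+): \d+ failures?\s+\d+ pass(?:es)?\s+(?P<checks>.*)$"
)


def failed_rules(report):
    """Map each column name to the list of rule names marked as failed."""
    failures = {}
    for line in report.splitlines():
        match = LINE.match(line.strip())
        if not match:
            continue  # skip headers, blank lines and truncated lines
        tokens = match.group("checks").split()
        failed = [name for name, mark in zip(tokens[::2], tokens[1::2])
                  if mark == "✗"]
        if failed:
            failures[match.group("field")] = failed
    return failures


summary = failed_rules(REPORT)
for field, rules in summary.items():
    print(f"{field}: {', '.join(rules)}")
```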