# Use TDDA to validate data

## What is TDDA?

Test-driven data analysis (TDDA) is an approach to improving the quality, correctness and robustness of analytical processes by transferring the ideas of test-driven development from the domain of software development to that of data analysis, extending and adjusting them where appropriate.

### A methodology and a toolset

TDDA is primarily a methodology that can be implemented in many different ways. **Stochastic Solutions also develops an open-source (MIT-licensed) Python module, `tdda`, to provide tooling support for test-driven data analysis**.


## What will this tutorial provide?

In this tutorial, we will use TDDA to generate "constraints" from a reference data asset (valid training data) and verify whether another data asset (test data) matches those constraints or not.

## Conclusion
In TDDA's vocabulary, validation rules are called constraints.


### Pros
- Easy to install and use
- Declarative validation rules
- Existing validation rule sets can be reused
- Provides a profiler that generates candidate validation rules for a given dataset
- The validation rules generated by the profiler are highly relevant to the data

### Cons
- The functions used here (`discover_df`, `verify_df`) only accept a pandas DataFrame as input
- Limited set of built-in validation rules
- Hard to implement new validation rules
- The validation report only indicates which validation rules failed
- Heavy dependency on Python: although the validation rules (constraints) are stored in JSON, all other operations require Python


In [7]:
from tdda.constraints import discover_df, verify_df
import pandas as pd
import os

## Step 1: Generate the constraints from the valid dataframe

In [8]:
file_path = "../data/adult.csv"
columns = ["age", "workclass", "fnlwgt", "education", "education-num", "marital-status", "occupation", "relationship",
           "race", "sex", "capital-gain", "capital-loss", "hours-per-week", "native-country", "income"]
# A regex separator needs a raw string and the Python parsing engine
df_base = pd.read_csv(file_path, names=columns, header=None, sep=r',\s+', na_values=["?"], engine='python')
df_base.head()



| | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | income |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 39 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States | <=50K |
| 1 | 50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | <=50K |
| 2 | 38 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K |
| 3 | 53 | Private | 234721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States | <=50K |
| 4 | 28 | Private | 338409 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 40 | Cuba | <=50K |


In [12]:
# generate the constraint
constraints = discover_df(df_base)

# Show the generated constraints
print(str(constraints))


FIELDS:

Field age:
           type: TypeConstraint(value='int')
            min: MinConstraint(value=17, precision=None)
            max: MaxConstraint(value=90, precision=None)
           sign: SignConstraint(value='positive')
      max_nulls: MaxNullsConstraint(value=0)

Field workclass:
           type: TypeConstraint(value='string')
     min_length: MinLengthConstraint(value=7)
     max_length: MaxLengthConstraint(value=16)
  allowed_values: AllowedValuesConstraint(value=['Federal-gov', 'Local-gov', 'Never-worked', 'Private', 'Self-emp-inc', 'Self-emp-not-inc', 'State-gov', 'Without-pay'])

Field fnlwgt:
           type: TypeConstraint(value='int')
            min: MinConstraint(value=12285, precision=None)
            max: MaxConstraint(value=1484705, precision=None)
           sign: SignConstraint(value='positive')
      max_nulls: MaxNullsConstraint(value=0)

Field education:
           type: TypeConstraint(value='string')
     min_length: MinLengthConstraint(value=3)
     max_

## Step 2: Save the constraints for validation

We save the generated constraints in JSON format. TDDA groups the validation rules by column. For example, for the numeric column `age`, we have the following rules:

```json
{
"age": {
            "type": "int",
            "min": 17,
            "max": 90,
            "sign": "positive",
            "max_nulls": 0
        }
}
```

- `"type": "int"` means the values in this column must be integers.
- `"min": 17` means the minimum allowed value in this column is 17. (Note: you will often need to adjust the min and max values. They are derived from the reference data, which may not cover the full range of valid values.)
- `"max": 90` means the maximum allowed value in this column is 90.
- `"sign": "positive"` means all values in this column must be positive.
- `"max_nulls": 0` means this column must not contain any null values.
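
To make the meaning of these rules concrete, here is a minimal plain-Python sketch that applies the same five checks to a list of values. It is not how tdda implements verification; the function name `check_numeric` is hypothetical, `None` stands in for a null, and reading `"positive"` as strictly greater than zero is an assumption.

```python
def check_numeric(values, rules):
    """Return {rule_name: passed} for a list of values; None stands for null."""
    present = [v for v in values if v is not None]
    results = {}
    if rules.get("type") == "int":
        # bool is a subclass of int in Python, so exclude it explicitly
        results["type"] = all(
            isinstance(v, int) and not isinstance(v, bool) for v in present
        )
    if "min" in rules:
        results["min"] = all(v >= rules["min"] for v in present)
    if "max" in rules:
        results["max"] = all(v <= rules["max"] for v in present)
    if rules.get("sign") == "positive":
        # Assumption: "positive" means strictly greater than zero
        results["sign"] = all(v > 0 for v in present)
    if "max_nulls" in rules:
        null_count = len(values) - len(present)
        results["max_nulls"] = null_count <= rules["max_nulls"]
    return results


age_rules = {"type": "int", "min": 17, "max": 90, "sign": "positive", "max_nulls": 0}

ok = check_numeric([39, 50, 38], age_rules)
bad = check_numeric([39, 16, None, 95], age_rules)  # 16 < min, 95 > max, one null
print(ok)
print(bad)
```

The second call shows why the generated min and max often need widening: a perfectly plausible value outside the range seen in the reference data is reported as a failure.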

For the string (categorical) column `workclass`, we have the following rules:
```json
{
"workclass": {
            "type": "string",
            "min_length": 7,
            "max_length": 16,
            "allowed_values": [
                "Federal-gov",
                "Local-gov",
                "Never-worked",
                "Private",
                "Self-emp-inc",
                "Self-emp-not-inc",
                "State-gov",
                "Without-pay"
            ]
        }
}
```

- `"type": "string"` means the values in this column must be strings.
- `"min_length": 7` means each string must be at least 7 characters long.
- `"max_length": 16` means each string must be at most 16 characters long.
- `"allowed_values"` means every value in this column must appear in the given list.
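
The string rules can be illustrated the same way. The sketch below is plain Python, not tdda's implementation; `check_string` is a hypothetical helper, and skipping nulls (`None`) in the length and membership checks is an assumption made for the illustration.

```python
def check_string(values, rules):
    """Return {rule_name: passed} for a list of strings; None stands for null."""
    present = [v for v in values if v is not None]  # nulls are skipped here
    results = {}
    if rules.get("type") == "string":
        results["type"] = all(isinstance(v, str) for v in present)
    if "min_length" in rules:
        results["min_length"] = all(len(v) >= rules["min_length"] for v in present)
    if "max_length" in rules:
        results["max_length"] = all(len(v) <= rules["max_length"] for v in present)
    if "allowed_values" in rules:
        allowed = set(rules["allowed_values"])
        results["allowed_values"] = all(v in allowed for v in present)
    return results


workclass_rules = {
    "type": "string",
    "min_length": 7,
    "max_length": 16,
    "allowed_values": ["Federal-gov", "Local-gov", "Never-worked", "Private",
                       "Self-emp-inc", "Self-emp-not-inc", "State-gov", "Without-pay"],
}

good = check_string(["Private", "State-gov", None], workclass_rules)
bad = check_string(["Private", "Gov", "Self-employed-other"], workclass_rules)
print(good)
print(bad)
```

Because no `max_nulls` rule is present, the null in the first list does not cause any failure.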

Note that there is no `"max_nulls": 0` rule here, because this column contains many null values in the reference data. We will first validate data with the generated rules, then add new rules and run the validation again.

In [None]:
def write_constrain(constrains, constrain_root_dir: str, constrain_file_name: str):
    """
    Write a tdda constraints object to a file in JSON format.

    :param constrains: tdda constraints object (as returned by discover_df)
    :param constrain_root_dir: root directory in which to store the constraints file
    :param constrain_file_name: name of the constraints file
    :return: None
    """
    # Create the target directory (and any missing parents) if needed
    os.makedirs(constrain_root_dir, exist_ok=True)
    constraints_path = os.path.join(constrain_root_dir, constrain_file_name)
    with open(constraints_path, 'w') as f:
        # Serialize the object passed in, not the global `constraints` variable
        f.write(constrains.to_json())

In [None]:
# Write the constraints to the local file system
root_dir = 'tdda_refs'
file_name = 'generated_constraints.tdda'
write_constrain(constraints, root_dir, file_name)

Now we can use the generated constraints to validate some data. Let's start with the valid data and look at the output.

In [11]:
# Test
constraints_path = f'{root_dir}/{file_name}'
result = verify_df(df_base, constraints_path, type_checking='strict', epsilon=0)
print(str(result))

FIELDS:

age: 0 failures  5 passes  type ✓  min ✓  max ✓  sign ✓  max_nulls ✓

workclass: 0 failures  4 passes  type ✓  min_length ✓  max_length ✓  allowed_values ✓

fnlwgt: 0 failures  5 passes  type ✓  min ✓  max ✓  sign ✓  max_nulls ✓

education: 0 failures  5 passes  type ✓  min_length ✓  max_length ✓  max_nulls ✓  allowed_values ✓

education-num: 0 failures  5 passes  type ✓  min ✓  max ✓  sign ✓  max_nulls ✓

marital-status: 0 failures  5 passes  type ✓  min_length ✓  max_length ✓  max_nulls ✓  allowed_values ✓

occupation: 0 failures  4 passes  type ✓  min_length ✓  max_length ✓  allowed_values ✓

relationship: 0 failures  5 passes  type ✓  min_length ✓  max_length ✓  max_nulls ✓  allowed_values ✓

race: 0 failures  5 passes  type ✓  min_length ✓  max_length ✓  max_nulls ✓  allowed_values ✓

sex: 0 failures  5 passes  type ✓  min_length ✓  max_length ✓  max_nulls ✓  allowed_values ✓

capital-gain: 0 failures  5 passes  type ✓  min ✓  max ✓  sign ✓  max_nulls ✓

capital-loss: 0 f

All validation rules pass. Now let's add some new rules.

For this purpose, I created a new constraints JSON file called "my_constraints.tdda". In the snippet below, I added a new rule: the maximum number of null values allowed in column `workclass` is 10.

```json
{
"workclass": {
            "type": "string",
            "min_length": 7,
            "max_length": 16,
            "allowed_values": [
                "Federal-gov",
                "Local-gov",
                "Never-worked",
                "Private",
                "Self-emp-inc",
                "Self-emp-not-inc",
                "State-gov",
                "Without-pay"
            ],
             "max_nulls": 10
        }
}
```
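
Instead of editing the file by hand, the extra rule could also be added with the standard `json` module. This is only a sketch: `add_rule` is a hypothetical helper, and the top-level `"fields"` key is an assumption about the layout of the `.tdda` file, so check your own generated file before relying on it.

```python
import copy
import json


def add_rule(constraints, field, rule, value):
    """Return a copy of a constraints dict with one extra rule on one field."""
    updated = copy.deepcopy(constraints)  # leave the generated rules untouched
    # Assumption: per-column rules live under a top-level "fields" key
    updated["fields"][field][rule] = value
    return updated


# In practice this dict would come from json.load() on the generated file;
# a small in-memory stand-in keeps the example self-contained.
generated = {
    "fields": {
        "workclass": {"type": "string", "min_length": 7, "max_length": 16}
    }
}

custom = add_rule(generated, "workclass", "max_nulls", 10)
print(json.dumps(custom, indent=4))
```

Writing `json.dumps(custom, indent=4)` to a new file keeps the generated constraints and the hand-tuned ones separate, so the generated file can be regenerated without losing the custom rules.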

In [13]:
my_constrains="my_constraints.tdda"
constraints_path = f'{root_dir}/{my_constrains}'
valid_result = verify_df(df_base, constraints_path, type_checking='strict', epsilon=0)
print(str(valid_result))

FIELDS:

age: 0 failures  5 passes  type ✓  min ✓  max ✓  sign ✓  max_nulls ✓

workclass: 1 failure  4 passes  type ✓  min_length ✓  max_length ✓  max_nulls ✗  allowed_values ✓

fnlwgt: 0 failures  5 passes  type ✓  min ✓  max ✓  sign ✓  max_nulls ✓

education: 0 failures  5 passes  type ✓  min_length ✓  max_length ✓  max_nulls ✓  allowed_values ✓

education-num: 0 failures  5 passes  type ✓  min ✓  max ✓  sign ✓  max_nulls ✓

marital-status: 0 failures  5 passes  type ✓  min_length ✓  max_length ✓  max_nulls ✓  allowed_values ✓

occupation: 0 failures  4 passes  type ✓  min_length ✓  max_length ✓  allowed_values ✓

relationship: 0 failures  5 passes  type ✓  min_length ✓  max_length ✓  max_nulls ✓  allowed_values ✓

race: 0 failures  5 passes  type ✓  min_length ✓  max_length ✓  max_nulls ✓  allowed_values ✓

sex: 0 failures  5 passes  type ✓  min_length ✓  max_length ✓  max_nulls ✓  allowed_values ✓

capital-gain: 0 failures  5 passes  type ✓  min ✓  max ✓  sign ✓  max_nulls ✓

capit

The result now shows that one validation rule (`max_nulls` on `workclass`) fails.

Now let's try the constraints on some sample data that contains many errors.

In [17]:
test_file_path="../data/adult_with_duplicates.csv"
df_test=pd.read_csv(test_file_path,header=None, names=columns,sep=',', na_values=["null"])

test_result = verify_df(df_test, constraints_path, type_checking='strict', epsilon=0)
print(str(test_result))

FIELDS:

age: 5 failures  0 passes  type ✗  min ✗  max ✗  sign ✗  max_nulls ✗

workclass: 2 failures  3 passes  type ✓  min_length ✓  max_length ✓  max_nulls ✗  allowed_values ✗

fnlwgt: 4 failures  1 pass  type ✗  min ✗  max ✗  sign ✗  max_nulls ✓

education: 1 failure  4 passes  type ✓  min_length ✓  max_length ✓  max_nulls ✓  allowed_values ✗

education-num: 4 failures  1 pass  type ✗  min ✗  max ✗  sign ✗  max_nulls ✓

marital-status: 1 failure  4 passes  type ✓  min_length ✓  max_length ✓  max_nulls ✓  allowed_values ✗

occupation: 1 failure  3 passes  type ✓  min_length ✓  max_length ✓  allowed_values ✗

relationship: 1 failure  4 passes  type ✓  min_length ✓  max_length ✓  max_nulls ✓  allowed_values ✗

race: 2 failures  3 passes  type ✓  min_length ✗  max_length ✓  max_nulls ✓  allowed_values ✗

sex: 2 failures  3 passes  type ✓  min_length ✗  max_length ✓  max_nulls ✓  allowed_values ✗

capital-gain: 4 failures  1 pass  type ✗  min ✗  max ✗  sign ✗  max_nulls ✓

capital-loss: 

Many validation rules now fail. For example, every validation rule on the `age` column fails.
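
With this many failures, it can help to pull the failed rule names out of the printed report. The sketch below parses the text format shown above with a regular expression. `failed_rules` is a hypothetical helper, it assumes the report layout stays exactly as printed here, and the tdda result object may offer a more direct, structured way to get the same information, so check the tdda documentation before relying on text parsing.

```python
import re


def failed_rules(report):
    """Map each field name to the list of rule names marked with ✗."""
    failures = {}
    for line in report.splitlines():
        # Field lines look like "age: 5 failures  0 passes  type ✗  min ✗ ..."
        match = re.match(
            r"^(\S+): \d+ failures?\s+\d+ pass(?:es)?\s+(.*)$", line.strip()
        )
        if not match:
            continue  # skip the "FIELDS:" header and blank lines
        field, checks = match.groups()
        failed = re.findall(r"(\w+) ✗", checks)
        if failed:
            failures[field] = failed
    return failures


sample_report = """FIELDS:

age: 5 failures  0 passes  type ✗  min ✗  max ✗  sign ✗  max_nulls ✗

workclass: 2 failures  3 passes  type ✓  min_length ✓  max_length ✓  max_nulls ✗  allowed_values ✗

fnlwgt: 4 failures  1 pass  type ✗  min ✗  max ✗  sign ✗  max_nulls ✓
"""

summary = failed_rules(sample_report)
for field, rules in summary.items():
    print(f"{field}: {', '.join(rules)}")
```

In the notebook, the same function could be called as `failed_rules(str(test_result))`.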