# Voluptuous

Voluptuous is a Python data validation library. It is primarily intended for validating data coming into Python as JSON, YAML, etc.

It has three goals:

- Simplicity.
- Support for complex data structures.
- Provide useful error messages.

Pros:
- easy to install
- easy to use/understand the concept

Cons:
- few built-in validation rules, and custom validation rules are hard to add
- pandas dataframes must be converted into python dicts first (which causes performance issues as the df grows)
- performance issues when the data set is large (>100 MB)
- The main contributor has abandoned the project, so there are no more updates or bug fixes.

In [1]:
from voluptuous import Schema, All, Range, ALLOW_EXTRA
import pandas as pd

In [3]:
# paths of the valid data and of the test data
valid_file_path = "../data/adult.csv"
test_file_path = "../data/adult_with_duplicates.csv"

In [4]:
columns = ["age", "workclass", "fnlwgt", "education", "education-num", "marital-status", "occupation",
               "relationship",
               "race", "sex", "capital-gain", "capital-loss", "hours-per-week", "native-country", "income"]
df1 = pd.read_csv(valid_file_path, names=columns, header=None)

To validate with voluptuous, we first define the expected schema of the dataset, then use that schema to validate each row of the dataframe.
Note that each row passed to the schema must be a dictionary of the form {'col_name1': value, 'col_name2': value, ...}.

In [7]:
# The docs are not very clear about the dict structure the schema expects for an input row, so I had to test all the possible formats.
# This returns the df if all rows pass, or raises an error as soon as a row fails
def validate_df(df):
    # define the expected schema
    schema = Schema(
        {
            'age': All(int, Range(min=1, max=100)),
            'education': All(str)
        },
        extra=ALLOW_EXTRA
    )
    # convert the input data frame into a dictionary
    for r in df.to_dict('records'):
        # validate each row by using the schema
        schema(r)
    return df

# Test this out with df1, which should pass
validate_df(df1).head()

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K


In [10]:
df2 = pd.read_csv(test_file_path, names=columns, header=1, sep=',', na_values=["null"])
df2 = df2.fillna({"age": 0})
df2 = df2.astype({"age": int}, errors='ignore')
df2.head()

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income
0,-12,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,0,emp-by-pengfei,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
2,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
3,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
4,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K


In [9]:
# use the same schema to check the bad sample; it should raise an exception
try:
    validate_df(df2).head()
except Exception as e:
    print("**Exception**", e)

**Exception** value must be at least 1 for dictionary value @ data['age']


The exception shows that the error comes from age, but it does not say which row or which value violated the expected schema (validation rule), so tracking down the offending data is difficult.

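# Appendix: pandas df to dict

pandas dataframes provide a to_dict() method that converts a dataframe to a dict. It supports several modes, selected with the orient argument: dict (the default), list, series, split, records and index.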

In [5]:
df = pd.DataFrame({'col1': ['red', 'yellow', 'blue'], 'col2': [0.5, 0.25, 0.125]})

In [6]:
df.head()

Unnamed: 0,col1,col2
0,red,0.5
1,yellow,0.25
2,blue,0.125


## 1. dict mode
The dict mode is the default mode. In this mode, the keys are column names and the values are dictionaries of index:data pairs, as in the example below.

In [7]:
df.to_dict('dict')

{'col1': {0: 'red', 1: 'yellow', 2: 'blue'},
 'col2': {0: 0.5, 1: 0.25, 2: 0.125}}
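
A small check of the "default mode" statement, as a sketch: calling to_dict() with no argument should give the same result as asking for the dict mode explicitly. The variable names are assumptions.

```python
import pandas as pd

df_demo = pd.DataFrame(
    {"col1": ["red", "yellow", "blue"], "col2": [0.5, 0.25, 0.125]}
)

default_result = df_demo.to_dict()                # no orient given
explicit_result = df_demo.to_dict(orient="dict")  # dict mode requested

# Both calls should produce the same nested dictionary.
print(default_result == explicit_result)

# Values are looked up by column name first, then by index label.
print(default_result["col1"][1])
```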

## 2. list mode
In list mode, the keys are column names and the values are lists of column data. Note that the row numbers are gone: all values of a column are packed into a single list.

In [8]:
df.to_dict('list')

{'col1': ['red', 'yellow', 'blue'], 'col2': [0.5, 0.25, 0.125]}

## 3. series mode
Similar to 'list' mode, but the values are Series instead of lists.

In [9]:
df.to_dict('series')

{'col1': 0       red
 1    yellow
 2      blue
 Name: col1, dtype: object,
 'col2': 0    0.500
 1    0.250
 2    0.125
 Name: col2, dtype: float64}

## 4. split mode
In split mode, the dataframe is split into three keys, columns, data and index, whose values are the column names, the data values by row, and the index labels respectively.

In [10]:
df.to_dict('split')

{'index': [0, 1, 2],
 'columns': ['col1', 'col2'],
 'data': [['red', 0.5], ['yellow', 0.25], ['blue', 0.125]]}
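
Because split mode keeps the index, the columns and the data separately, the three parts can be fed back into the DataFrame constructor. This is only a sketch; the names parts and rebuilt are assumptions.

```python
import pandas as pd

df_demo = pd.DataFrame(
    {"col1": ["red", "yellow", "blue"], "col2": [0.5, 0.25, 0.125]}
)

# Split the frame into its index labels, column names and row-wise data.
parts = df_demo.to_dict(orient="split")

# Reassemble a frame from the three parts.
rebuilt = pd.DataFrame(
    parts["data"], index=parts["index"], columns=parts["columns"]
)
print(rebuilt)
```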

## 5. records mode
In records mode, each row becomes a dictionary whose keys are the column names and whose values are the data in the corresponding cells.

In [14]:
# enumerate(..., start=1) numbers the rows from 1
for i, r in enumerate(df.to_dict('records'), start=1):
    print(f"row {i}: {r}")

row 1: {'col1': 'red', 'col2': 0.5}
row 2: {'col1': 'yellow', 'col2': 0.25}
row 3: {'col1': 'blue', 'col2': 0.125}


## 6. index mode
Similar to records mode, but the result is a dictionary of dictionaries keyed by index label (rather than a list).

In [12]:
df.to_dict('index')

{0: {'col1': 'red', 'col2': 0.5},
 1: {'col1': 'yellow', 'col2': 0.25},
 2: {'col1': 'blue', 'col2': 0.125}}
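
The difference from records mode is easier to see with a non-default index. In this sketch the row labels a, b and c are invented for illustration: index mode keeps them as keys, while records mode drops them and leaves only the position in the list.

```python
import pandas as pd

labelled = pd.DataFrame(
    {"col1": ["red", "yellow", "blue"], "col2": [0.5, 0.25, 0.125]},
    index=["a", "b", "c"],  # hypothetical row labels
)

by_label = labelled.to_dict(orient="index")      # dict keyed by row label
as_records = labelled.to_dict(orient="records")  # list, labels are lost

print(by_label["b"])   # row looked up by its label
print(as_records[1])   # same row, reachable only by position
```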