# Checking data integrity

## Basic data invariants
We want to check that:
1. Every child id is in the dataset.
2. Every parent id is in the dataset.
3. Every parent_nodes field has either 0 or 1 elements; only the root node (id 'mms') may have 0.
4. Every JSON object has exactly the same fields, and they match the documented keys.
5. Every JSON object has the same number of fields.
6. The set of all child nodes is equal to the set of all nodes minus the root node.
7. The union of child nodes and parent nodes is equal to the set of all nodes.

Below is the script we use to check for these conditions.

In [1]:
import json

# all the keys we have documented
valid_keys = {'child_nodes', 'title', 'id', 'parent_nodes', 'index_terms', 'code', 'coding_note',
              'class_kind', 'code_range', 'inclusions', 'definition', 'exclusions', 'block_id', 'p_scale'}

# our cleaned data (the with block closes the file once it has been read)
with open('clean_icd11_mms_v3.json', 'r') as f:
    clean_data = json.load(f)
clean_dicts = {clean_dict['id']: clean_dict for clean_dict in clean_data}

parents = set()
children = set()
fields = set()
#Set of all ids
all_ids = set(clean_dicts.keys())
for datum in clean_data:
    
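    # sanity check: this id is in our set of ids
    assert datum['id'] in all_ids
    
    # 1. making sure every child id is in the set of all valid ids
    for child_node in datum['child_nodes']:
        assert child_node in all_ids

    # 2. making sure every parent id is in the set of all valid ids
    for parent_node in datum['parent_nodes']:
        assert parent_node in all_ids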
        
    # 3. making sure parent_nodes has the correct number of elements:
    # 1 for normal nodes, 0 allowed only for the root
    assert len(datum['parent_nodes']) == 1 or (len(datum['parent_nodes']) == 0 and datum['id'] == 'mms')
    
    # set up for 4. (frozenset, so that key order does not matter)
    fields.add(frozenset(datum.keys()))
    
    # 5. making sure every json has the same number of fields
    assert len(datum.keys()) == len(valid_keys)
    
    # collect the sets of all child nodes and all parent nodes
    children.update(datum['child_nodes'])
    parents.update(datum['parent_nodes'])
            
# 6. check that the set of child nodes is equal to the set of all nodes minus the root node
assert children == all_ids - {'mms'}

# 7. check union of children and parent nodes gives all nodes in the data set
assert children.union(parents) == all_ids

# 4. check that there is only one set of keys and that it is equal to valid_keys
# (len is tested first, so an empty dataset fails the assert instead of raising an IndexError)
assert len(fields) == 1 and set(next(iter(fields))) == set(valid_keys)
print('All basic tests passed')
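
To see what the structural checks (1, 2, 3, 6 and 7) catch, it can help to run them on a tiny hand-made tree. The following is only a sketch: the function name `check_tree_invariants`, the toy records, and the choice to collect messages instead of asserting are our own assumptions for illustration and are not part of the script above.

```python
def check_tree_invariants(records, root_id):
    """Return a list of violation messages (empty if all checks pass)."""
    problems = []
    all_ids = {r['id'] for r in records}
    children, parents = set(), set()

    for r in records:
        # 1. every child id must be a known id
        for child in r['child_nodes']:
            if child not in all_ids:
                problems.append(f"{r['id']}: unknown child {child}")
        # 2. every parent id must be a known id
        for parent in r['parent_nodes']:
            if parent not in all_ids:
                problems.append(f"{r['id']}: unknown parent {parent}")
        # 3. one parent per node; zero is allowed only for the root
        n = len(r['parent_nodes'])
        if not (n == 1 or (n == 0 and r['id'] == root_id)):
            problems.append(f"{r['id']}: unexpected number of parents ({n})")
        children.update(r['child_nodes'])
        parents.update(r['parent_nodes'])

    # 6. child nodes are exactly all nodes minus the root
    if children != all_ids - {root_id}:
        problems.append("child set differs from all ids minus the root")
    # 7. children and parents together cover every node
    if children | parents != all_ids:
        problems.append("children and parents do not cover all ids")
    return problems


# hypothetical toy data: a root with two leaves
toy = [
    {'id': 'mms', 'parent_nodes': [], 'child_nodes': ['a', 'b']},
    {'id': 'a', 'parent_nodes': ['mms'], 'child_nodes': []},
    {'id': 'b', 'parent_nodes': ['mms'], 'child_nodes': []},
]
good = check_tree_invariants(toy, 'mms')

# same data, but node 'b' has lost its parent link
broken = list(toy)
broken[2] = {'id': 'b', 'parent_nodes': [], 'child_nodes': []}
bad = check_tree_invariants(broken, 'mms')

print(good)
print(bad)
```

Collecting messages rather than asserting reports every violation in one pass, which may be more convenient when cleaning a large file. The asserts in the script above stop at the first failure.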