# Verify

Recipes to verify that datasets comply with specified uniqueness, referential integrity, and check constraints.

In [1]:
import pandas as pd

## Load Example Datasets

The "territorial change" dataset from the CoW has several columns with polity identifiers. Some of these are official country codes, contained in the 'states2016' dataset. Others are supplementary codes for territories: polities that don't rise to the level of statehood by CoW's estimation. These codes are contained in a separate PDF file. The table in the PDF file has been converted into a CSV file using Tabula, resulting in the 'Territories' dataset. To get a complete list of polity codes, the states and territories need to be concatenated.

In [2]:
tc = pd.read_csv("../example_datasets/source_data/tc2014.csv", encoding='utf-8', na_values=[-9, '.'])

In [3]:
states = pd.read_csv("../example_datasets/source_data/states2016.csv", encoding='utf-8', 
                     usecols=['ccode', 'statenme', 'styear', 'endyear']) \
            .rename(columns= {'ccode':'id', 'statenme':'name', 'styear':'startyear'})
states['type'] = 'state'

In [4]:
terr = pd.read_csv("../example_datasets/source_data/Territories.csv", encoding='utf-8', 
                   usecols=['Entity Number', 'Name', 'Begin Year', 'End Year']) \
            .rename(columns={'Entity Number':'id', 'Name':'name', 'Begin Year':'startyear', 'End Year':'endyear'})
terr['type'] = 'territory'

In [5]:
pol = pd.concat([states, terr])

In [6]:
pol

Unnamed: 0,id,name,startyear,endyear,type
0,2,United States of America,1816.0,2016.0,state
1,20,Canada,1920.0,2016.0,state
2,31,Bahamas,1973.0,2016.0,state
3,40,Cuba,1902.0,1906.0,state
4,40,Cuba,1909.0,2016.0,state
...,...,...,...,...,...
2646,9985,Argentine Antarctica,1942.0,1993.0,territory
2647,9986,Chilean Antarctica,1940.0,1993.0,territory
2648,9987,Neu Schwabenland,1939.0,1945.0,territory
2649,9991,Peter I I.,1931.0,1993.0,territory
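
In the output above, Cuba appears twice with id 40, once per period of statehood, so 'id' on its own cannot be a unique key for 'pol'. The sketch below shows one way to test a uniqueness constraint with `DataFrame.duplicated`. It uses a toy frame copied from the first rows shown above; the helper name `is_unique` and the choice of ('id', 'startyear') as a candidate composite key are assumptions for illustration, not something the datasets guarantee.

```python
import pandas as pd

# Toy frame mirroring the first rows of 'pol' shown above
pol_demo = pd.DataFrame({
    'id': [2, 20, 31, 40, 40],
    'name': ['United States of America', 'Canada', 'Bahamas', 'Cuba', 'Cuba'],
    'startyear': [1816.0, 1920.0, 1973.0, 1902.0, 1909.0],
})

def is_unique(df, cols):
    """Hypothetical helper: True if no two rows share the same values in cols."""
    return not df.duplicated(subset=cols).any()

# 'id' alone repeats (Cuba), so it fails the uniqueness check
id_unique = is_unique(pol_demo, ['id'])

# Assumed composite key: one row per polity and start year
key_unique = is_unique(pol_demo, ['id', 'startyear'])
```

On the toy frame, `id_unique` is False and `key_unique` is True; whether the composite key holds for the full 'pol' table would need to be checked against the real data.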


## Check Referential Integrity

The 'gainer', 'entity', and 'loser' columns each refer to a polity, and every polity they reference should exist in the polity table. In database terms, tc's 'gainer', 'entity', and 'loser' are "foreign keys" to pol's 'id'.

Let's make sure that all polity IDs in 'tc' also exist in 'pol'.

In [7]:
tc[['number', 'gainer', 'entity', 'loser']]

Unnamed: 0,number,gainer,entity,loser
0,3,160,160.0,230.0
1,4,200,790.0,790.0
2,5,200,420.0,
3,28,220,433.0,200.0
4,29,365,365.0,
...,...,...,...,...
832,886,471,475.0,475.0
833,887,710,365.0,365.0
834,888,710,702.0,702.0
835,889,626,626.0,625.0


In [8]:
polities = set(pol['id'].dropna().unique().tolist())
tc_gainers = set(tc['gainer'].dropna().unique().tolist())
tc_gainers - polities

{0, 1}

In [9]:
tc_losers = set(tc['loser'].dropna().unique().tolist())
tc_losers - polities

{0.0, 1.0, 822.0}

In [10]:
tc_entities = set(tc['entity'].dropna().unique().tolist())
tc_entities - polities

{822.0}

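Looks like the polity IDs 0, 1, and 822 are referenced in 'tc' but missing from the polity table. The 'loser' and 'entity' results print as floats because those columns contain missing values, which makes pandas store them as float64.

Let's create a function that performs this check more generally.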
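
The set differences above say which IDs are missing, but not which rows of 'tc' use them. A minimal sketch of one way to locate those rows with `DataFrame.isin` follows. The frames `pol_demo` and `tc_demo` are small made-up stand-ins for 'pol' and 'tc'; their values are illustrative and not taken from the datasets.

```python
import pandas as pd

# Toy stand-ins for 'pol' and 'tc' (illustrative values only)
pol_demo = pd.DataFrame({'id': [2, 20, 200]})
tc_demo = pd.DataFrame({
    'number': [1, 2, 3, 4],
    'gainer': [2, 0, 200, 20],
    'loser': [20.0, 200.0, None, 822.0],
})

fk_cols = ['gainer', 'loser']

# Pass a plain list to isin: passing a Series would match on index labels too
known_ids = pol_demo['id'].tolist()

# True where a foreign-key value is present but has no match in pol_demo['id']
unmatched = tc_demo[fk_cols].notna() & ~tc_demo[fk_cols].isin(known_ids)

# Rows with at least one dangling reference
bad_rows = tc_demo[unmatched.any(axis=1)]
```

Here `bad_rows` keeps the rows numbered 2 and 4. Missing values are excluded from the mask on purpose, mirroring the `dropna()` calls above, so an empty 'loser' is not reported as a broken reference.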

In [11]:
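from typing import Union

def check_ids_ref_integrity(primary_df, pdf_id: str, related_df,
                            rdf_fk: Union[str, list], verbose=False):
    """Check that every ID in related_df's foreign-key column(s) exists in primary_df[pdf_id].

    rdf_fk may be a single column name or a list of column names.
    Returns True/False by default; with verbose=True, returns the set of missing IDs.
    """
    # Accept a single column name as well as a list of names
    if isinstance(rdf_fk, str):
        rdf_fk = [rdf_fk]

    missing_ids = set()

    p_ids = set(primary_df[pdf_id].dropna().unique().tolist())
    for c in rdf_fk:
        f_ids = set(related_df[c].dropna().unique().tolist())
        # Collect IDs referenced in related_df that have no match in primary_df
        missing_ids.update(f_ids - p_ids)

    if verbose:
        return missing_ids
    # Integrity holds only if no referenced ID is missing
    return not missing_ids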

In [12]:
check_ids_ref_integrity(primary_df = pol, pdf_id = 'id', 
                    related_df = tc, rdf_fk = 'gainer')

False

In [13]:
check_ids_ref_integrity(primary_df = pol, pdf_id = 'id', 
                    related_df = tc, 
                    rdf_fk = ['gainer', 'loser', 'entity'], 
                    verbose = True)

{0, 1, 822.0}
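
The verbose result pools the missing IDs from all foreign-key columns into one set, so it does not show which column each ID came from. As a sketch of one possible variant, the hypothetical helper `missing_ids_by_column` below returns a dict keyed by column name. It is not part of the recipe above, and the toy frames are made up for illustration.

```python
import pandas as pd

def missing_ids_by_column(primary_df, pdf_id, related_df, rdf_fk):
    """Hypothetical helper: map each foreign-key column to its unmatched IDs."""
    p_ids = set(primary_df[pdf_id].dropna().unique().tolist())
    report = {}
    for c in rdf_fk:
        diff = set(related_df[c].dropna().unique().tolist()) - p_ids
        if diff:
            # Only columns with at least one unmatched ID appear in the report
            report[c] = diff
    return report

# Toy stand-ins for 'pol' and 'tc' (illustrative values only)
pol_demo = pd.DataFrame({'id': [2, 20, 200]})
tc_demo = pd.DataFrame({'gainer': [2, 0, 200], 'loser': [20.0, None, 822.0]})

report = missing_ids_by_column(pol_demo, 'id', tc_demo, ['gainer', 'loser'])
```

An empty dict means every reference resolved, so `not report` gives the same True/False answer as the non-verbose check.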