# Verify

Recipes for verifying that datasets comply with specified uniqueness, referential integrity, and check constraints.

In [1]:
import pandas as pd

## Load Example Datasets

The "territorial change" dataset from the Correlates of War (CoW) project has several columns with polity identifiers. Some of these are official country codes, contained in the 'states2016' dataset. Others are supplementary codes for territories: polities that don't rise to the level of statehood by CoW's estimation. These codes are contained in a separate PDF file. The table in the PDF file has been converted into a CSV file using Tabula, resulting in the 'Territories' dataset. To get a complete list of polity codes, the states and territories need to be concatenated.

In [2]:
tc = pd.read_csv("../example_datasets/source_data/tc2014.csv", encoding='utf-8', na_values=[-9, '.'], dtype={'gainer':'Int64', 'loser':'Int64', 'entity':'Int64'})

In [3]:
states = pd.read_csv("../example_datasets/source_data/states2016.csv", encoding='utf-8', 
                     usecols=['ccode', 'statenme', 'styear', 'endyear']) \
            .rename(columns= {'ccode':'id', 'statenme':'name', 'styear':'startyear'})
states['type'] = 'state'

In [4]:
terr = pd.read_csv("../example_datasets/source_data/Territories.csv", encoding='utf-8', 
                   usecols=['Entity Number', 'Name', 'Begin Year', 'End Year']) \
            .rename(columns={'Entity Number':'id', 'Name':'name', 'Begin Year':'startyear', 'End Year':'endyear'})
terr['type'] = 'territory'

In [5]:
pol = pd.concat([states, terr])
pol['startyear'] = pol['startyear'].astype('Int64')
pol['endyear'] = pol['endyear'].astype('Int64')

In [6]:
pol.sample(5)

Unnamed: 0,id,name,startyear,endyear,type
2452,930,New Caledonia and Dependencies,1853,1993,territory
735,3362,Wallachia,1828,1834,territory
20,21,Newfoundland,1816,1920,territory
2332,8203,Pahang,1896,1946,territory
1709,6823,Dhala,1937,1967,territory


## Check Referential Integrity

The 'gainer', 'entity', and 'loser' columns all refer to a polity - and that polity should exist in the polity table. In database terms, tc's 'gainer', 'entity', and 'loser' are "foreign keys" to pol's 'id'.

Let's make sure that all polity IDs in 'tc' also exist in 'pol'.

In [7]:
tc[['number', 'gainer', 'entity', 'loser']].sample(5)

Unnamed: 0,number,gainer,entity,loser
370,413,710,711,365
332,373,200,720,710
430,473,678,678,640
784,838,372,372,365
677,727,260,210,210


In [8]:
polities = set(pol['id'].dropna().unique())
tc_gainers = set(tc['gainer'].dropna().unique())
tc_gainers - polities

{0, 1}

In [9]:
tc_losers = set(tc['loser'].dropna().unique())
tc_losers - polities

{0, 1, 822}

In [10]:
tc_entities = set(tc['entity'].dropna().unique())
tc_entities - polities

{822}

Looks like the polity IDs 0, 1, and 822 are missing from the polity table.

Let's create a function that performs this check more generally.

In [11]:
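from typing import Union

def check_ids_ref_integrity(primary_df, pdf_id: str, related_df, rdf_fk: Union[list, str], verbose=False):
    """Check that every non-null foreign key in related_df exists in primary_df[pdf_id].

    rdf_fk is a column name or a list of column names. Returns True/False,
    or the set of missing IDs when verbose is True.
    """
    # Accept a single column name as well as a list of names.
    if isinstance(rdf_fk, str):
        rdf_fk = [rdf_fk]

    missing_ids = set()

    p_ids = set(primary_df[pdf_id].dropna().unique())
    for c in rdf_fk:
        f_ids = set(related_df[c].dropna().unique())
        # Collect foreign-key values that have no matching primary key.
        missing_ids.update(f_ids - p_ids)

    if verbose:
        return missing_ids
    return not missing_ids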

In [12]:
check_ids_ref_integrity(primary_df = pol, 
                        pdf_id = 'id', 
                        related_df = tc, 
                        rdf_fk = 'gainer')

False

In [13]:
check_ids_ref_integrity(primary_df = pol, 
                        pdf_id = 'id', 
                        related_df = tc, 
                        rdf_fk = ['gainer', 'loser', 'entity'], 
                        verbose = True)

{0, 1, 822}
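
Knowing which IDs are missing is often only half the job; the next step is usually to find the rows that reference them. The following is a minimal sketch of one way to do that, not part of the original recipe. The helper `rows_with_missing_ids` and the toy tables `pol_demo` and `tc_demo` are hypothetical names with made-up values, used so the example runs on its own.

```python
import pandas as pd

# Toy stand-ins for the notebook's `pol` and `tc` tables (illustrative values only).
pol_demo = pd.DataFrame({"id": [200, 210, 365]})
tc_demo = pd.DataFrame({
    "number": [1, 2, 3],
    "gainer": pd.array([200, 0, 365], dtype="Int64"),
    "loser": pd.array([210, 365, pd.NA], dtype="Int64"),
})

def rows_with_missing_ids(primary_df, pdf_id, related_df, rdf_fk):
    """Return the rows of related_df that hold a foreign key absent from primary_df."""
    known = primary_df[pdf_id].dropna()
    orphan = pd.Series(False, index=related_df.index)
    for col in rdf_fk:
        s = related_df[col]
        # A value is an orphan when it is present (not null) and not a known ID.
        orphan |= (s.notna() & ~s.isin(known)).astype(bool)
    return related_df[orphan]

orphans = rows_with_missing_ids(pol_demo, "id", tc_demo, ["gainer", "loser"])
print(orphans)
```

Null foreign keys are skipped here, mirroring the `dropna()` calls in `check_ids_ref_integrity`; whether nulls should count as violations depends on the target schema.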

## Check Uniqueness of Primary Key

The most important quality of a primary key is that it is unique.

We previously created a new "episodes" dataset from the ucdp_df dataset when demonstrating the Transformation recipes. Let's see if this new table is ready to be loaded into the database.

In [14]:
ucdp_episode = pd.read_csv("../example_datasets/transformed_data/ucdp_episodes.csv")

In [15]:
ucdp_episode

Unnamed: 0,conflict_id,start_date2,ep_end_date,start_prec2
0,13637,2015-03-03,,1
1,333,1978-04-27,,1
2,431,1979-12-27,1979-12-28,1
3,13692,2001-10-07,2001-11-13,1
4,215,1946-10-22,1946-12-31,1
...,...,...,...,...
806,402,1994-04-28,1994-07-04,2
807,318,1967-09-05,,1
808,318,1967-09-05,1968-12-31,1
809,318,1973-04-04,,1


A quick visual check shows that the primary key (conflict_id + start_date2) is not unique: rows 807 and 808 share the same conflict_id and start_date2. But let's create a function that can do this for us.

In [16]:
def check_key_uniqueness(df, key:list):
    # The key is unique when no row repeats the key values of an earlier row.
    return not df.duplicated(subset=key).any()

In [17]:
check_key_uniqueness(ucdp_episode, ['conflict_id', 'start_date2'])

False
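
A plain True/False answer does not say which keys collide. As a possible companion to the check above, the sketch below lists every row that takes part in a duplicate. The helper `find_key_duplicates` and the toy table `episodes_demo` are hypothetical names with illustrative values, not part of the original dataset.

```python
import pandas as pd

# Toy stand-in for `ucdp_episode` (illustrative values only).
episodes_demo = pd.DataFrame({
    "conflict_id": [318, 318, 318, 402],
    "start_date2": ["1967-09-05", "1967-09-05", "1973-04-04", "1994-04-28"],
    "ep_end_date": [None, "1968-12-31", None, "1994-07-04"],
})

def find_key_duplicates(df, key):
    """Return every row whose key value occurs more than once."""
    # keep=False flags all members of a duplicate group, not only the repeats.
    dupes = df[df.duplicated(subset=key, keep=False)]
    return dupes.sort_values(by=key)

dupes = find_key_duplicates(episodes_demo, ["conflict_id", "start_date2"])
print(dupes)
```

An empty result means the key is unique, so the same helper can double as the check itself via `find_key_duplicates(df, key).empty`.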

In [18]:
ucdp_episode.sort_values(by=['conflict_id', 'start_date2', 'ep_end_date'])

Unnamed: 0,conflict_id,start_date2,ep_end_date,start_prec2
48,200,1946-07-21,1946-07-21,2
49,200,1952-04-09,1952-04-12,1
50,200,1967-03-31,1967-10-16,3
70,201,1946-08-31,1953-11-09,3
69,201,1946-08-31,,3
...,...,...,...,...
82,14129,2017-12-08,,2
371,14268,2017-06-11,2017-06-11,1
635,14275,2016-06-01,,1
758,14333,2016-03-07,2016-11-09,1


In [19]:
ucdp_episode = ucdp_episode.sort_values(by=['conflict_id', 'start_date2', 'ep_end_date']) \
                           .drop_duplicates(subset=['conflict_id', 'start_date2'], keep='first')
check_key_uniqueness(ucdp_episode, ['conflict_id', 'start_date2'])

True
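
The deduplication above depends on sort order: `sort_values` places missing values last by default, so within each key the row with a known `ep_end_date` comes first and is the one that `keep='first'` retains. The sketch below illustrates this on a toy table; `demo` and `deduped` are hypothetical names, and `na_position="last"` is spelled out only to make the default visible.

```python
import pandas as pd

# Toy episodes with one open-ended and one closed row per key (illustrative values only).
demo = pd.DataFrame({
    "conflict_id": [201, 201, 318, 318],
    "start_date2": ["1946-08-31", "1946-08-31", "1967-09-05", "1967-09-05"],
    "ep_end_date": [None, "1953-11-09", None, "1968-12-31"],
})

key = ["conflict_id", "start_date2"]
deduped = (
    # Missing end dates sort after known ones within each key.
    demo.sort_values(by=key + ["ep_end_date"], na_position="last")
        # keep='first' therefore retains the row with a known end date.
        .drop_duplicates(subset=key, keep="first")
)
print(deduped)
```

If the open-ended row were the one to keep instead, passing `na_position="first"` would reverse the choice.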