# Transform

Recipes to transform datasets: decompose a table into new tables with proper functional dependencies, create new identifiers, and separate multivalued attributes.

In [1]:
import pandas as pd
import numpy as np

## Load Example Datasets

In [2]:
ucdp_df = pd.read_csv("../example_datasets/source_data/ucdp-prio-acd-191.csv", encoding = 'utf-8')

In [3]:
cow_alliance_df = pd.read_csv("../example_datasets/source_data/alliance_v4.1_by_member.csv", encoding = 'utf-8')

## Separate Multivalued Columns

The first step in normalizing your tables is to make sure every value in every cell is atomic.

Multivalued attributes come in many forms. A cell might contain a list, or the values might be spread across multiple dummy columns. Here are functions to deal with both.

### When the cell contains a list

The UCDP dataset has several columns that contain lists of identifiers. If we want to be able to reference these identifiers, we need to isolate them.

This function requires you to first identify which column(s) uniquely identify each value of the multivalued column, and which delimiter separates the values within it.

In [4]:
ucdp_df[ucdp_df['gwno_a_2nd'].str.contains(',', na=False)][['conflict_id', 'year', 'side_a', 'gwno_a', 'side_a_2nd', 'gwno_a_2nd']].sample(5)

Unnamed: 0,conflict_id,year,side_a,gwno_a,side_a_2nd,gwno_a_2nd
44,333,2018,Government of Afghanistan,700,"Government of Pakistan, Government of United S...","770, 2"
1211,260,1984,Government of Lebanon,660,"Government of France, Government of United Sta...","220, 2"
1239,229,1950,Government of United Kingdom,200,"Government of Australia, Government of New Zea...","900, 920"
1210,260,1983,Government of Lebanon,660,"Government of France, Government of United Sta...","220, 2"
2329,418,2014,Government of United States of America,2,"Government of Afghanistan, Government of Pakistan","700, 770"


In [5]:
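def split_lists(df, pk_columns:list, list_column:str, delim=","):
    # keep only the identifying columns and the multivalued column
    new_df = df[pk_columns + [list_column]].copy()
    # turn each delimited string into a list (missing values stay NaN)
    new_df[list_column] = new_df[list_column].str.split(pat=delim)
    # one row per list element; drop rows with any missing value
    new_df_exploded = new_df.explode(list_column).dropna().reset_index(drop=True)
    return new_df_exploded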

In [6]:
split_lists(ucdp_df, pk_columns=['conflict_id', 'year'], 
            list_column='gwno_a_2nd', delim=', ').head(10)

Unnamed: 0,conflict_id,year,gwno_a_2nd
0,13637,2015,770
1,13637,2015,2
2,13637,2016,770
3,13637,2016,2
4,13637,2017,2
5,13637,2018,2
6,333,1980,365
7,333,1981,365
8,333,1982,365
9,333,1983,365
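
Note that the split values are still strings, and that the delimiter has to match the separator exactly (here `', '`, with the space). As a minimal sketch of a more forgiving variant, the hypothetical `split_lists_clean` below splits on a bare comma and trims whitespace afterwards. The toy data and the function name are assumptions made for illustration, not part of the recipes.

```python
import pandas as pd

# Toy data standing in for the UCDP columns (values are illustrative only)
toy_df = pd.DataFrame({
    "conflict_id": [1, 1, 2],
    "year": [2000, 2001, 2000],
    "gwno_a_2nd": ["770, 2", "2", None],
})

def split_lists_clean(df, pk_columns, list_column, delim=","):
    """Hypothetical variant of split_lists that also trims whitespace."""
    new_df = df[pk_columns + [list_column]].copy()
    # split each delimited string into a list (missing values stay missing)
    new_df[list_column] = new_df[list_column].str.split(delim)
    # one row per list element; drop rows that had nothing to split
    new_df = new_df.explode(list_column).dropna(subset=[list_column])
    # remove the whitespace left over around each element
    new_df[list_column] = new_df[list_column].str.strip()
    return new_df.reset_index(drop=True)

toy_result = split_lists_clean(toy_df, ["conflict_id", "year"], "gwno_a_2nd")
print(toy_result)
```

If the identifiers need to be numeric for joins, they can be converted afterwards, for example with `pd.to_numeric`.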


### When the values are spread out over "dummy" columns

"Dummy" columns are used for analyzing categorical variables. They are created through a process usually called dummy coding or one-hot encoding, in which each category gets its own column of indicator values (1 = True, 0 = False).
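
To make the starting point concrete, here is a small sketch of how indicator columns like these are typically produced in pandas with `pd.get_dummies`. The `alliances` table and its values are made up for illustration.

```python
import pandas as pd

# Made-up categorical data for illustration only
alliances = pd.DataFrame({
    "alliance_id": [1, 2, 3],
    "trait": ["defense", "entente", "defense"],
})

# One indicator column per category; dtype=int gives 1/0 instead of True/False
dummies = pd.get_dummies(alliances["trait"], dtype=int)

# Put the identifier back next to the indicator columns
encoded = pd.concat([alliances[["alliance_id"]], dummies], axis=1)
print(encoded)
```

In this toy table each row has exactly one trait. In a dataset where a row can carry several traits at once, several indicator columns can be 1 in the same row.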

For example, see this CoW Alliance dataset, which records traits for each alliance with 4 dummy variables:

In [7]:
cow_alliance_df[['version4id', 'ccode', 'defense', 'neutrality', 'nonaggression', 'entente']].astype('Int64').head(10)

Unnamed: 0,version4id,ccode,defense,neutrality,nonaggression,entente
0,1,200,1,0,1,0
1,1,235,1,0,1,0
2,2,200,0,0,0,1
3,2,380,0,0,0,1
4,3,240,1,0,1,1
5,3,240,1,0,1,1
6,3,245,1,0,1,1
7,3,245,1,0,1,1
8,3,255,1,0,1,1
9,3,255,1,0,1,1


In [8]:
def de_dummify(df, pk_columns:list, dummy_columns:list, col_name:str):
    new_df = df[pk_columns + dummy_columns].copy()
    # propagate the category/column name for rows where it is true
    for c in dummy_columns:
        new_df[c] = np.where(new_df[c]==1, c, None)
    # create a list out of the categories, and explode/melt the list
    new_df[col_name] = new_df[dummy_columns].values.tolist()
    new_df = new_df.drop(columns=dummy_columns).explode(col_name) \
                    .dropna().reset_index(drop=True)
    return new_df

In [9]:
de_dummify(cow_alliance_df, pk_columns=['version4id', 'ccode'], 
           dummy_columns=['defense', 'neutrality', 
           'nonaggression', 'entente'], col_name='trait')

Unnamed: 0,version4id,ccode,trait
0,1,200,defense
1,1,200,nonaggression
2,1,235,defense
3,1,235,nonaggression
4,2,200,entente
...,...,...,...
2007,412,626,nonaggression
2008,413,651,nonaggression
2009,413,666,nonaggression
2010,414,2,entente
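
The same reshaping can also be expressed with `DataFrame.melt`. The sketch below is an alternative technique, not the function defined above: it stacks the dummy columns into long form and keeps only the rows flagged with 1. The name `de_dummify_melt`, the temporary `flag` column, and the toy data are assumptions for illustration.

```python
import pandas as pd

# Toy data shaped like the CoW alliance columns (values are illustrative only)
toy_alliance_df = pd.DataFrame({
    "version4id": [1, 1, 2],
    "ccode": [200, 235, 200],
    "defense": [1, 1, 0],
    "entente": [0, 0, 1],
    "nonaggression": [1, 1, 0],
})

def de_dummify_melt(df, pk_columns, dummy_columns, col_name):
    """Hypothetical melt-based variant: one row per (key, trait) where the dummy is 1."""
    # stack the dummy columns: the column name goes to col_name, the 0/1 value to "flag"
    long_df = df.melt(id_vars=pk_columns, value_vars=dummy_columns,
                      var_name=col_name, value_name="flag")
    # keep only the traits that are switched on, then drop the helper column
    kept = long_df[long_df["flag"] == 1].drop(columns="flag")
    # sort so rows for the same key sit together (traits end up in alphabetical order)
    return kept.sort_values(pk_columns + [col_name]).reset_index(drop=True)

toy_traits = de_dummify_melt(toy_alliance_df, ["version4id", "ccode"],
                             ["defense", "entente", "nonaggression"], "trait")
print(toy_traits)
```

Because of the final sort, the order of traits within a key may differ from the order produced by `de_dummify`, which follows the order of `dummy_columns`.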


## Decompose into new tables

After exploring the tangled functional dependencies of the original dataset, you will want to break apart that dataset into new tables with proper functional dependencies. We can use the find_dependent_columns() function and add a few more steps.

In [10]:
from recipes import find_dependent_columns
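
The implementation of `find_dependent_columns` is not shown in this notebook. As a rough sketch of the idea only, and assuming the function returns the columns that take a single value within each group of key values, a check of that kind could look like the hypothetical `find_dependent_columns_sketch` below. How the real function treats missing values is not known here; this sketch counts NaN as a distinct value.

```python
import pandas as pd

def find_dependent_columns_sketch(df, key_columns):
    """Hypothetical stand-in: columns whose value is fully determined by key_columns."""
    other_columns = [c for c in df.columns if c not in key_columns]
    # number of distinct values (NaN counted as a value) per column within each key group
    distinct_counts = df.groupby(key_columns)[other_columns].nunique(dropna=False)
    # a column depends on the key if no group ever holds more than one value
    return [c for c in other_columns if (distinct_counts[c] <= 1).all()]

# Toy data for illustration only
toy_df = pd.DataFrame({
    "conflict_id": [1, 1, 2, 2],
    "year": [2000, 2001, 2000, 2001],
    "location": ["A", "A", "B", "B"],
    "intensity": [1, 2, 1, 1],
})

by_conflict = find_dependent_columns_sketch(toy_df, ["conflict_id"])
by_conflict_year = find_dependent_columns_sketch(toy_df, ["conflict_id", "year"])
print(by_conflict, by_conflict_year)
```

In the toy table, `location` is fixed for each conflict, while `intensity` varies by year, so it only becomes dependent once `year` is part of the key.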

For example, these columns are dependent on the combination of 'conflict_id' and 'start_date2', the secondary start date.

In [11]:
find_dependent_columns(ucdp_df, ['conflict_id', 'start_date2'])

['location',
 'side_a',
 'side_a_id',
 'incompatibility',
 'territory_name',
 'start_date',
 'start_prec',
 'start_prec2',
 'ep_end_date',
 'gwno_a',
 'gwno_b',
 'gwno_loc',
 'region']

However, some of these columns can be determined by 'conflict_id' alone. Which columns depend on the full combination of 'conflict_id' and 'start_date2', and not on 'conflict_id' by itself?

In [12]:
set(find_dependent_columns(ucdp_df, ['conflict_id', 'start_date2'])) - set(find_dependent_columns(ucdp_df, ['conflict_id']))

{'ep_end_date', 'start_prec2'}

So, let's create a dataframe for just these attributes.

In [13]:
ucdp_episode_df = ucdp_df[['conflict_id', 'start_date2', 'start_prec2', 'ep_end_date']].copy().drop_duplicates().dropna(subset=['conflict_id', 'start_date2'])
ucdp_episode_df

Unnamed: 0,conflict_id,start_date2,start_prec2,ep_end_date
0,13637,2015-03-03,1,
4,333,1978-04-27,1,
45,431,1979-12-27,1,1979-12-28
46,13692,2001-10-07,1,2001-11-13
47,215,1946-10-22,1,1946-12-31
...,...,...,...,...
2374,402,1994-04-28,2,1994-07-04
2375,318,1967-09-05,1,
2376,318,1967-09-05,1,1968-12-31
2377,318,1973-04-04,1,


Let's formalize this process into a generic function:

In [14]:
def decompose_table(df, primary_key:list):
    # find which columns are determined by the full primary key
    table_columns = set(find_dependent_columns(df, primary_key))
    if len(primary_key) > 1:
        # remove columns already determined by a single key column,
        # since they belong in a table keyed on that column alone
        for column in primary_key:
            dependent_cols = set(find_dependent_columns(df, [column]))
            table_columns -= dependent_cols
    # create the new table; sorting the set keeps the column order reproducible
    new_df = df[primary_key + sorted(table_columns)].copy().dropna(subset=primary_key).drop_duplicates().reset_index(drop=True)
    return new_df

In [15]:
decompose_table(ucdp_df, ['conflict_id', 'start_date2']).sort_values(by=['conflict_id', 'start_date2', 'ep_end_date'])

Unnamed: 0,conflict_id,start_date2,ep_end_date,start_prec2
48,200,1946-07-21,1946-07-21,2
49,200,1952-04-09,1952-04-12,1
50,200,1967-03-31,1967-10-16,3
70,201,1946-08-31,1953-11-09,3
69,201,1946-08-31,,3
...,...,...,...,...
82,14129,2017-12-08,,2
371,14268,2017-06-11,2017-06-11,1
635,14275,2016-06-01,,1
758,14333,2016-03-07,2016-11-09,1


In [16]:
decompose_table(ucdp_df, ['conflict_id', 'start_date2']).to_csv("../example_datasets/transformed_data/ucdp_episodes.csv", index=False)

Further cleaning is still required, so we save the dataset as it is now for later investigation.
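
One thing the output above suggests is that the intended key is not yet unique: conflict 201 appears twice for the same 'start_date2', once with an 'ep_end_date' and once without. A minimal sketch of how such rows could be found is shown below. The toy rows copy that pattern from the output. Collapsing each key to its first non-missing value is only one possible cleaning rule, and it is an assumption here, not a step taken in this notebook.

```python
import pandas as pd

# Toy rows mirroring the pattern seen in the decomposed episode table
toy_episode_df = pd.DataFrame({
    "conflict_id": [201, 201, 200],
    "start_date2": ["1946-08-31", "1946-08-31", "1946-07-21"],
    "ep_end_date": ["1953-11-09", None, "1946-07-21"],
})

key = ["conflict_id", "start_date2"]

# every row whose key occurs more than once violates the intended primary key
violations = toy_episode_df[toy_episode_df.duplicated(subset=key, keep=False)]
print(violations)

# one possible rule (an assumption): keep the first non-missing value per key
collapsed = toy_episode_df.groupby(key, as_index=False).first()
print(collapsed)
```

Whether taking the first non-missing value is appropriate depends on what a missing 'ep_end_date' means in the source data, so it is worth checking the codebook before applying a rule like this.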