# Transform

Recipes for transforming datasets: decompose a table into new tables with proper functional dependencies, create new identifiers, and separate multivalued attributes.

In [1]:
import pandas as pd
import numpy as np

## Load Example Datasets

In [2]:
ucdp_df = pd.read_csv("../example_datasets/source_data/ucdp-prio-acd-191.csv", encoding = 'utf-8')

In [3]:
cow_alliance_df = pd.read_csv("../example_datasets/source_data/alliance_v4.1_by_member.csv", encoding = 'utf-8')

## Separate Multivalued Columns

The first step in normalizing your tables is to make sure every value in every cell is atomic.

Multivalued attributes come in many forms. A multivalued attribute might be a list inside a single cell, or it might be spread across multiple dummy columns. The functions below handle both cases.

### When the cell contains a list

The UCDP dataset has several columns that contain lists of identifiers. If we want to be able to reference these identifiers, we need to isolate them.

The function below requires you to first identify which column(s) uniquely identify each row and which delimiter separates the values in the multivalued column.

In [4]:
ucdp_df[['gwno_a', 'gwno_a_2nd', 'gwno_b', 'gwno_b_2nd']].head(5)

Unnamed: 0,gwno_a,gwno_a_2nd,gwno_b,gwno_b_2nd
0,700,"770, 2",,
1,700,"770, 2",,
2,700,2,,
3,700,2,,
4,700,,,


In [5]:
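def split_lists(df, pk_columns:list, list_column:str, delim=","):
    # keep only the identifying columns and the multivalued column
    new_df = df[pk_columns + [list_column]].copy()
    # turn each delimited string into a Python list
    new_df[list_column] = new_df[list_column].str.split(pat=delim)
    # one row per list element; drop rows with missing values
    new_df_exploded = new_df.explode(list_column).dropna().reset_index(drop=True)
    return new_df_exploded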

In [6]:
split_lists(ucdp_df, pk_columns=['conflict_id', 'year'], 
            list_column='gwno_a_2nd', delim=", ").head(10)

Unnamed: 0,conflict_id,year,gwno_a_2nd
0,13637,2015,770
1,13637,2015,2
2,13637,2016,770
3,13637,2016,2
4,13637,2017,2
5,13637,2018,2
6,333,1980,365
7,333,1981,365
8,333,1982,365
9,333,1983,365
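
The exploded values are still strings, and a delimiter that does not account for spaces can leave stray whitespace behind. As a minimal sketch, assuming a hypothetical toy table (the `allies` column and its values are made up for illustration and are not part of the UCDP data), the same split-and-explode steps can be followed by a cleanup pass that strips whitespace and casts the identifiers to integers:

```python
import pandas as pd

# Hypothetical toy data that mimics the shape of the UCDP columns above
toy_df = pd.DataFrame({
    "conflict_id": [1, 1, 2],
    "year": [2015, 2016, 2015],
    "allies": ["770, 2", "2", None],
})

exploded = toy_df.copy()
# Split on a bare comma, so some elements keep a leading space
exploded["allies"] = exploded["allies"].str.split(",")
# One row per list element; rows with no list at all are dropped
exploded = exploded.explode("allies").dropna().reset_index(drop=True)
# Cleanup: strip whitespace, then cast the identifiers to integers
exploded["allies"] = exploded["allies"].str.strip().astype(int)

print(exploded)
```

Whether to cast to integers depends on how the identifiers are used downstream; keeping them as strings may be preferable if some codes have leading zeros.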


### When the values are spread out over "dummy" columns

"Dummy" columns are used to analyze categorical variables. They are created through a process commonly called one-hot (or dummy) encoding, in which each category gets its own column of binary values (1 = True, 0 = False).
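
To illustrate how such columns typically arise, here is a minimal sketch that uses `pd.get_dummies` on a hypothetical toy table. The `alliance_id` and `kind` columns are assumptions made for illustration; this is not how the dataset used in this notebook was built.

```python
import pandas as pd

# Hypothetical toy table with a single categorical column
toy_df = pd.DataFrame({
    "alliance_id": [1, 2, 3],
    "kind": ["defense", "entente", "defense"],
})

# Each distinct category becomes its own 0/1 column
dummies = pd.get_dummies(toy_df, columns=["kind"], dtype=int)

print(dummies)
```

The recipe in this section goes in the opposite direction: it collapses a set of 0/1 columns back into a single categorical column.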

For example, see this CoW Alliance dataset, which records the traits of each alliance with four dummy variables:

In [7]:
cow_alliance_df[['version4id', 'ccode', 'defense', 'neutrality', 'nonaggression', 'entente']].head(10)

Unnamed: 0,version4id,ccode,defense,neutrality,nonaggression,entente
0,1,200,1,0,1.0,0.0
1,1,235,1,0,1.0,0.0
2,2,200,0,0,0.0,1.0
3,2,380,0,0,0.0,1.0
4,3,240,1,0,1.0,1.0
5,3,240,1,0,1.0,1.0
6,3,245,1,0,1.0,1.0
7,3,245,1,0,1.0,1.0
8,3,255,1,0,1.0,1.0
9,3,255,1,0,1.0,1.0


In [8]:
def de_dummify(df, pk_columns:list, dummy_columns:list, col_name:str):
    new_df = df[pk_columns + dummy_columns].copy()
    # propagate the category/column name for rows where it is true
    for c in dummy_columns:
        new_df[c] = np.where(new_df[c]==1, c, None)
    # create a list out of the categories, and explode/melt the list
    new_df[col_name] = new_df[dummy_columns].values.tolist()
    new_df = new_df.drop(columns=dummy_columns).explode(col_name) \
                    .dropna().reset_index(drop=True)
    return new_df

In [9]:
de_dummify(cow_alliance_df, pk_columns=['version4id', 'ccode'], 
           dummy_columns=['defense', 'neutrality', 'nonaggression', 'entente'], 
           col_name='trait').head(10)

Unnamed: 0,version4id,ccode,trait
0,1,200,defense
1,1,200,nonaggression
2,1,235,defense
3,1,235,nonaggression
4,2,200,entente
5,2,380,entente
6,3,240,defense
7,3,240,nonaggression
8,3,240,entente
9,3,240,defense
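
Note that the output above repeats some rows (for example, `3,240,defense` appears at index 6 and again at index 9). This follows from the source table, where the pair `version4id`, `ccode` occurs on more than one row. As a minimal sketch of one way to handle this, assuming that duplicate trait rows are not wanted, the same reshaping can be written with `DataFrame.melt` followed by `drop_duplicates`. The toy data and the intermediate `flag` column name are hypothetical.

```python
import pandas as pd

# Hypothetical toy data with a repeated (version4id, ccode) pair
toy_df = pd.DataFrame({
    "version4id": [3, 3, 4],
    "ccode": [240, 240, 200],
    "defense": [1, 1, 0],
    "entente": [1, 1, 1],
})

traits = (
    # Long format: one row per (key, dummy column) pair
    toy_df.melt(id_vars=["version4id", "ccode"],
                var_name="trait", value_name="flag")
    # Keep only the traits that are set
    .query("flag == 1")
    .drop(columns="flag")
    # Collapse rows repeated because the key is not unique in the source
    .drop_duplicates()
    .sort_values(["version4id", "ccode", "trait"])
    .reset_index(drop=True)
)

print(traits)
```

If the repeated rows carry meaning (for instance, separate time periods for the same member), a better fix may be to add the distinguishing column to `pk_columns` instead of dropping duplicates.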
