## Normalization on YAML Files

The following is an implementation of normalization on `.yaml` files.

### Imports

In [1]:
import os
import sys

In [2]:
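# The quoted '__file__' is resolved relative to the current working directory,
# so currentdir is the folder the notebook runs in.
currentdir = os.path.dirname(os.path.realpath('__file__'))
parentdir = os.path.dirname(currentdir)
# Put the project root on the import path so the local package can be imported.
sys.path.insert(0, parentdir)
currentdir, parentdir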

('/Users/me/Projects/flat_table/notebooks', '/Users/me/Projects/flat_table')

In [3]:
import yaml
import pandas as pd
import flat_table as norm

### Data

The data used here comes from the [unitedstates/congress-legislators](https://github.com/unitedstates/congress-legislators) repository.

- legislators-historical.yml
- executive.yml
- committees-current.yml

Download these files and place them in the `data` folder.
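Before loading anything, it can help to confirm that the three files are where the notebook expects them. The following is a minimal sketch rather than part of `flat_table`: the helper name `missing_files` is hypothetical, and the `../data` location is an assumption based on the relative paths used when the files are loaded.

```python
from pathlib import Path

# File names come from the list above. The "../data" location is an
# assumption based on the relative paths used when loading the files.
EXPECTED_FILES = [
    "legislators-historical.yml",
    "executive.yml",
    "committees-current.yml",
]


def missing_files(data_dir, names=EXPECTED_FILES):
    """Return the expected file names that are not present in data_dir."""
    data_dir = Path(data_dir)
    return [name for name in names if not (data_dir / name).is_file()]


if __name__ == "__main__":
    missing = missing_files("../data")
    if missing:
        print("Missing from the data folder:", ", ".join(missing))
    else:
        print("All data files found.")
```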

In [4]:
def load_data(filepath):
    with open(filepath) as f:
        contents = yaml.load(f, Loader=yaml.SafeLoader)
    return contents

In [5]:
source = load_data('../data/legislators-historical.yml')
source2 = load_data('../data/executive.yml')
source3 = load_data('../data/committees-current.yml')

### Analysis

A series can be handled in one of three ways:

    1. Expanded into rows
    2. Expanded into columns
    3. Kept as is

    * Each individual series carries its parent DataFrame name, its own name, and the series itself.
    * If parent and self have the same name, it is an object.
    * if parent and 
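
The three cases above can be illustrated with plain pandas. This is only a sketch of the idea, not how `flat_table` is implemented internally: the toy records are made up, and `DataFrame.explode` and `pd.json_normalize` stand in for whatever the library does.

```python
import pandas as pd

# Toy records, made up for illustration (not taken from the data files).
records = [
    {"name": {"first": "Ann", "last": "Lee"},
     "terms": [{"type": "rep"}, {"type": "sen"}]},
    {"name": {"first": "Bob", "last": "Ray"},
     "terms": [{"type": "rep"}]},
]
df = pd.DataFrame(records)

# 1. A series of lists is expanded into rows: one row per list element.
rows = df.explode("terms").reset_index(drop=True)

# 2. A series of dicts is expanded into columns: one column per key,
#    prefixed with the parent name.
name_cols = pd.json_normalize(rows["name"].tolist()).add_prefix("name.")
term_cols = pd.json_normalize(rows["terms"].tolist()).add_prefix("terms.")

# 3. A series of scalars would be kept as is (none in this toy frame).
flat = pd.concat([name_cols, term_cols], axis=1)
print(flat)
```

The list-valued column multiplies the number of rows, while the dict-valued columns multiply the number of columns, which is the same pattern visible in the shapes reported in the sections that follow.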
    

#### Source 1

In [6]:
df1 = pd.DataFrame(source)
df1_in = norm.mapper(df1)
df1_in.shape

(68, 4)

In [7]:
df1.shape

(11982, 7)

In [8]:
df1_in.tail()

|  | parent | child | type | obj |
|---|---|---|---|---|
| 63 | leadership_roles | leadership_roles.title | str | 0 NaN 1 NaN 2 NaN 3 ... |
| 64 | . | family | list | 0 ... |
| 65 |  | family | dict | 0 ... |
| 66 | family | family.relation | str | 0 NaN 1 NaN 2 NaN 3 ... |
| 67 | family | family.name | str | 0 NaN 1 ... |


In [9]:
df1.sample()

|  | id | name | bio | terms | other_names | leadership_roles | family |
|---|---|---|---|---|---|---|---|
| 2460 | {'bioguide': 'W000692', 'govtrack': 411844, 'i... | {'first': 'Bradford', 'middle': 'Ripley', 'las... | {'birthday': '1800-09-03', 'gender': 'M'} | [{'type': 'rep', 'start': '1845-12-01', 'end':... |  |  |  |


In [10]:
df_norm = norm.normalize(df1_in, is_mapper=True)

In [11]:
df_norm.shape

(275401, 54)

In [12]:
df_norm.columns

Index(['index', 'id.maplight', 'id.opensecrets', 'id.ballotpedia', 'id.lis',
       'id.votesmart', 'id.cspan', 'id.thomas', 'id.house_history_alternate',
       'id.house_history', 'id.google_entity_id', 'id.wikidata',
       'id.wikipedia', 'id.icpsr', 'id.govtrack', 'id.bioguide', 'id.fec',
       'id.bioguide_previous', 'name.official_full', 'name.suffix',
       'name.nickname', 'name.middle', 'name.last', 'name.first', 'bio.gender',
       'bio.birthday', 'terms.rss_url', 'terms.state_rank', 'terms.office',
       'terms.contact_form', 'terms.fax', 'terms.phone', 'terms.address',
       'terms.url', 'terms.how', 'terms.district', 'terms.party',
       'terms.class', 'terms.state', 'terms.end', 'terms.start', 'terms.type',
       'terms.party_affiliations.party', 'terms.party_affiliations.end',
       'terms.party_affiliations.start', 'other_names.last',
       'other_names.middle', 'other_names.end', 'leadership_roles.end',
       'leadership_roles.start', 'leadership_roles.chamb

#### Source 2

In [13]:
df2 = pd.DataFrame(source2)
df2_in = norm.mapper(df2)
df2_in.shape

(31, 4)

In [14]:
df2.shape

(78, 4)

In [15]:
df2_in.head()

|  | parent | child | type | obj |
|---|---|---|---|---|
| 0 | . | id | dict | 0 {'bioguide': 'W000178', 'govtrack': 4113... |
| 1 | id | id.google_entity_id | str | 0 NaN 1 NaN 2 ... |
| 2 | id | id.wikidata | str | 0 NaN 1 NaN 2 NaN 3 ... |
| 3 | id | id.house_history | float | 0 NaN 1 NaN 2 NaN 3 ... |
| 4 | id | id.wikipedia | str | 0 NaN 1 NaN 2 ... |


In [16]:
norm.normalize(df2_in)

|  | index | parent | child | type | obj |
|---|---|---|---|---|---|
| 0 | 0 | . | id | dict | 0 {'bioguide': 'W000178', 'govtrack': 4113... |
| 1 | 1 | id | id.google_entity_id | str | 0 NaN 1 NaN 2 ... |
| 2 | 2 | id | id.wikidata | str | 0 NaN 1 NaN 2 NaN 3 ... |
| 3 | 3 | id | id.house_history | float | 0 NaN 1 NaN 2 NaN 3 ... |
| 4 | 4 | id | id.wikipedia | str | 0 NaN 1 NaN 2 ... |
| 5 | 5 | id | id.fec | list | 0 NaN 1 NaN 2 ... |
| 6 | 6 |  | id.fec | str | 0 NaN 1 NaN 2 Na... |
| 7 | 7 | id | id.votesmart | float | 0 NaN 1 NaN 2 NaN 3 ... |
| 8 | 8 | id | id.opensecrets | str | 0 NaN 1 NaN 2 Na... |
| 9 | 9 | id | id.lis | str | 0 NaN 1 NaN 2 NaN 3 NaN 4 ... |


#### Source 3

In [17]:
df3 = pd.DataFrame(source3)
df3_in = norm.mapper(df3)
df3_in.shape

(22, 4)

In [18]:
df3.shape

(49, 16)

In [19]:
df3_in.head()

|  | parent | child | type | obj |
|---|---|---|---|---|
| 0 | . | type | str | 0 house 1 house 2 house 3 ... |
| 1 | . | name | str | 0 House Committee on Ag... |
| 2 | . | url | str | 0 https://agriculture.h... |
| 3 | . | minority_url | str | 0 https://republicans-agriculture.h... |
| 4 | . | thomas_id | str | 0 HSAG 1 HSAP 2 HSAS 3 HSBA 4 ... |


In [20]:
norm.normalize(df3_in)

|  | index | parent | child | type | obj |
|---|---|---|---|---|---|
| 0 | 0 | . | type | str | 0 house 1 house 2 house 3 ... |
| 1 | 1 | . | name | str | 0 House Committee on Ag... |
| 2 | 2 | . | url | str | 0 https://agriculture.h... |
| 3 | 3 | . | minority_url | str | 0 https://republicans-agriculture.h... |
| 4 | 4 | . | thomas_id | str | 0 HSAG 1 HSAP 2 HSAS 3 HSBA 4 ... |
| 5 | 5 | . | house_committee_id | str | 0 AG 1 AP 2 AS 3 BA 4 ... |
| 6 | 6 | . | subcommittees | list | 0 [{'name': 'Conservation and Forestry', '... |
| 7 | 7 |  | subcommittees | dict | 0 {'name': 'Conservation and Forestry', 't... |
| 8 | 8 | subcommittees | subcommittees.wikipedia | str | 0 ... |
| 9 | 9 | subcommittees | subcommittees.phone | str | 0 (202) 225-2171 0 (202) 225-2171 0 ... |
