## DATA 512 Final Project Analysis
### Natasha Halfin
### Autumn 2020

### Introduction
This Jupyter notebook documents the analysis for my final project, which addresses the following research question and hypothesis:
 
**Research Question**: What factors are associated with greater representation of women in a country's politics?  

**Hypothesis**: Countries with better performance across key gender parity indicators are more likely to have greater representation of women in politics.

### Environment Setup
Before beginning the analysis, I will import all necessary libraries and load my datasets.

In [2]:
#Import libraries/packages
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

### Part 0: Data Loading, Cleaning, and Preparation

In [24]:
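#Load data (comma is the default separator for read_csv)
df_GIDDB = pd.read_csv("GIDDB2019_NH.csv")
df_EMP = pd.read_csv("GENDER_EMP_2014-2019.csv")
#a notebook cell displays only its last expression, so only df_EMP.head() is shown below
df_GIDDB.head()
df_EMP.head()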

,COU,Country,IND,Indicator,SEX,Sex,AGE,Age Group,TIME,Time,Unit Code,Unit,PowerCode Code,PowerCode,Reference Period Code,Reference Period,Value,Flag Codes,Flags
0,SVN,Slovenia,EMP17,Female share of seats in national parliaments,WOMEN,Women,TOTAL,Total,2014,2014,,,0,Units,,,35.6,,
1,GRC,Greece,EMP17,Female share of seats in national parliaments,WOMEN,Women,TOTAL,Total,2014,2014,,,0,Units,,,21.0,,
2,LUX,Luxembourg,EMP17,Female share of seats in national parliaments,WOMEN,Women,TOTAL,Total,2014,2014,,,0,Units,,,28.3,,
3,SVK,Slovak Republic,EMP17,Female share of seats in national parliaments,WOMEN,Women,TOTAL,Total,2014,2014,,,0,Units,,,18.7,,
4,SWE,Sweden,EMP17,Female share of seats in national parliaments,WOMEN,Women,TOTAL,Total,2014,2014,,,0,Units,,,44.7,,


In [25]:
##clean data##
##part 1: GIDDB dataset
#rename variable codes to friendly names in df_GIDDB
var_names = {
    'DF_DV_LAW': 'Divorce Law',
    'DF_HR_ATT': 'Household Responsibilities Attitudes',
    'RPI_VAW_PRACT': 'Violence Against Women Practices',
    'RPI_RA_LAW': 'Reproductive Autonomy Law',
    'RAPFR_SAFS_LAW': 'Access to Financial Services Law',
    'RCL_PV_PRACT': 'Political Voice Practices',
}
df_GIDDB['VAR'] = df_GIDDB['VAR'].replace(var_names)

df_GIDDB.head()


,REGION,Region,LOCATION,Country,INC,Income,VAR,Variable,TIME,Year,Value,Flag Codes,Flags
0,ASI,Asia,AUS,Australia,HIN,High income,Household Responsibilities Attitudes,Attitudes,2019,2019,21.1,,
1,ASI,Asia,AUS,Australia,HIN,High income,Divorce Law,Law,2019,2019,0.0,,
2,ASI,Asia,AUS,Australia,HIN,High income,Violence Against Women Practices,Practice,2019,2019,16.9,,
3,ASI,Asia,AUS,Australia,HIN,High income,Reproductive Autonomy Law,Law,2019,2019,0.0,,
4,ASI,Asia,AUS,Australia,HIN,High income,Access to Financial Services Law,Law,2019,2019,0.0,,


In [26]:
#drop redundant or unneeded columns
df_GIDDB = df_GIDDB.drop(['REGION','LOCATION','INC','Variable','TIME','Flag Codes','Flags','Income','Region'],axis=1)
df_GIDDB.head()

,Country,VAR,Year,Value
0,Australia,Household Responsibilities Attitudes,2019,21.1
1,Australia,Divorce Law,2019,0.0
2,Australia,Violence Against Women Practices,2019,16.9
3,Australia,Reproductive Autonomy Law,2019,0.0
4,Australia,Access to Financial Services Law,2019,0.0


In [27]:
#pivot dataframe to create a column for each variable/indicator
#(pivot_table averages any duplicate Country/Year/VAR rows by default)
df_GIDDB_pivot = pd.pivot_table(df_GIDDB, values='Value', columns='VAR', index=['Country','Year'])
df_GIDDB_pivot.head()

Country,Year,Access to Financial Services Law,Divorce Law,Household Responsibilities Attitudes,Political Voice Practices,Reproductive Autonomy Law,Violence Against Women Practices
Afghanistan,2019,0.25,1.0,,27.7,0.75,60.8
Albania,2019,0.25,0.25,,27.9,0.0,24.6
Algeria,2019,0.25,1.0,75.1,25.8,0.75,
Angola,2019,0.25,0.5,,30.5,0.0,34.8
Antigua and Barbuda,2019,0.25,0.0,,11.1,0.75,
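
The pivot step above can hide duplicate rows, because `pd.pivot_table` aggregates with the mean by default. The following is a minimal sketch of how one might check for that before pivoting. The frame `long_df` and its values are invented for illustration and are not taken from the GIDDB file.

```python
import pandas as pd

# Toy long-format data; names and values are hypothetical, for illustration only.
long_df = pd.DataFrame({
    "Country": ["A", "A", "A", "B", "B"],
    "Year": [2019, 2019, 2019, 2019, 2019],
    "VAR": ["Divorce Law", "Divorce Law", "Political Voice Practices",
            "Divorce Law", "Political Voice Practices"],
    "Value": [0.0, 1.0, 30.0, 0.25, 20.0],
})

# Count rows that share a (Country, Year, VAR) key with at least one other row.
key = ["Country", "Year", "VAR"]
dup_count = int(long_df.duplicated(subset=key, keep=False).sum())

# With the default aggfunc (mean), the two "Divorce Law" rows for
# country A collapse into a single averaged cell.
wide_df = pd.pivot_table(long_df, values="Value", columns="VAR",
                         index=["Country", "Year"])

print(dup_count)
print(wide_df)
```

If `dup_count` is zero on the real data, the pivot is a pure reshape and no values are averaged.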


In [35]:
##part 2: EMP dataset
#keep only the needed columns and rename Time to Year to match df_GIDDB
df_EMP = df_EMP[['Country','Indicator','Time','Value']]
df_EMP = df_EMP.rename(columns={"Time":"Year"})
df_EMP.head()

,Country,Indicator,Year,Value
0,Slovenia,Female share of seats in national parliaments,2014,35.6
1,Greece,Female share of seats in national parliaments,2014,21.0
2,Luxembourg,Female share of seats in national parliaments,2014,28.3
3,Slovak Republic,Female share of seats in national parliaments,2014,18.7
4,Sweden,Female share of seats in national parliaments,2014,44.7


In [37]:
#pivot indicators to columns
df_EMP_pivot = pd.pivot_table(df_EMP,values='Value',columns='Indicator',index = ['Country','Year'])
df_EMP_pivot.head()

Country,Year,Female share of seats in national parliaments,Female share of seats on boards of the largest publicly listed companies,Length of maternity leave,Share of female managers
Australia,2014,26.0,,6.0,36.0
Australia,2015,26.7,,6.0,37.2
Australia,2016,28.7,26.0,6.0,36.3
Australia,2017,,28.7,,38.2
Australia,2018,,31.5,,37.3


In [39]:
#join GIDDB and EMP datasets
#inner join keeps only the Country/Year pairs present in both tables
df_join = df_EMP_pivot.merge(df_GIDDB_pivot, how='inner', on=['Country','Year'])
df_join.head()

Country,Year,Female share of seats in national parliaments,Female share of seats on boards of the largest publicly listed companies,Length of maternity leave,Share of female managers,Access to Financial Services Law,Divorce Law,Household Responsibilities Attitudes,Political Voice Practices,Reproductive Autonomy Law,Violence Against Women Practices
Australia,2019,,31.2,,,0.0,0.0,21.1,28.7,0.0,16.9
Austria,2019,,31.3,,,0.0,0.0,58.3,34.4,0.0,13.0
Belgium,2019,,35.9,,,0.0,0.0,32.0,38.0,0.0,24.0
Brazil,2019,,11.9,,,0.0,0.0,60.4,10.7,0.5,33.5
Canada,2019,,29.1,,,0.0,0.0,27.7,27.0,0.0,1.9


In [40]:
df_join

Country,Year,Female share of seats in national parliaments,Female share of seats on boards of the largest publicly listed companies,Length of maternity leave,Share of female managers,Access to Financial Services Law,Divorce Law,Household Responsibilities Attitudes,Political Voice Practices,Reproductive Autonomy Law,Violence Against Women Practices
Australia,2019,,31.2,,,0.0,0.0,21.1,28.7,0.0,16.9
Austria,2019,,31.3,,,0.0,0.0,58.3,34.4,0.0,13.0
Belgium,2019,,35.9,,,0.0,0.0,32.0,38.0,0.0,24.0
Brazil,2019,,11.9,,,0.0,0.0,60.4,10.7,0.5,33.5
Canada,2019,,29.1,,,0.0,0.0,27.7,27.0,0.0,1.9
Chile,2019,,8.5,,,0.0,0.25,36.0,22.6,0.5,6.7
China (People's Republic of),2019,,11.4,,,0.0,0.0,42.4,24.2,0.0,
Colombia,2019,,13.5,,,0.0,0.0,42.4,18.7,0.5,37.4
Czech Republic,2019,,18.2,,,0.25,0.25,32.5,22.0,0.0,21.0
Denmark,2019,,30.0,,,0.0,0.0,22.1,37.4,0.0,32.0
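
In the joined rows shown above, the parliament-share column is empty for 2019, so it is worth measuring missingness before modeling. Below is a minimal sketch of such a check. The frames `emp` and `gid` are toy stand-ins for `df_EMP_pivot` and `df_GIDDB_pivot` with invented values, and the "latest available year" fallback is only one possible workaround, not necessarily what this project used.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the two pivoted tables; all values are hypothetical.
emp_idx = pd.MultiIndex.from_tuples(
    [("A", 2018), ("A", 2019), ("B", 2018), ("B", 2019)],
    names=["Country", "Year"])
emp = pd.DataFrame({"Parliament share": [26.0, np.nan, 30.0, np.nan],
                    "Board share": [28.0, 31.0, 20.0, 22.0]}, index=emp_idx)

gid_idx = pd.MultiIndex.from_tuples(
    [("A", 2019), ("B", 2019)], names=["Country", "Year"])
gid = pd.DataFrame({"Divorce Law": [0.0, 0.25]}, index=gid_idx)

# Same join as in the notebook: inner merge on the shared index levels.
joined = emp.merge(gid, how="inner", on=["Country", "Year"])

# Fraction of missing values per column after the join.
missing_share = joined.isna().mean()

# Possible fallback: most recent non-missing parliament share per country.
latest = (emp["Parliament share"].dropna()
          .sort_index()
          .groupby(level="Country").last())

print(missing_share)
print(latest)
```

A column whose missing share is 1.0 cannot serve as a regression target without a fallback of this kind or an additional data source.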


### Part 1: Descriptive Analysis to Measure Correlation Between Key Gender Parity Indicators and Representation of Women in Politics

In [None]:
#Standard Linear Regression

In [None]:
#Create stratified sample/training/testing split

In [None]:
#measure accuracy, RMSE, etc.
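
The placeholder cells in this part name three steps: fit a linear regression, split the data into training and testing sets, and measure error. The following is a minimal sketch of that workflow using only NumPy. The data are synthetic (the array `X` stands in for the indicators and `y` for the parliament share), and the split is a simple random 80/20 split rather than a stratified one.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: 3 hypothetical indicators and a linear target with noise.
n = 200
X = rng.normal(size=(n, 3))
true_coef = np.array([2.0, -1.0, 0.5])
y = 25.0 + X @ true_coef + rng.normal(scale=0.5, size=n)

# Random 80/20 train/test split (not stratified).
perm = rng.permutation(n)
train, test = perm[:160], perm[160:]

# Ordinary least squares with an explicit intercept column.
A_train = np.column_stack([np.ones(len(train)), X[train]])
beta, *_ = np.linalg.lstsq(A_train, y[train], rcond=None)

# Evaluate on the held-out rows with root mean squared error.
A_test = np.column_stack([np.ones(len(test)), X[test]])
pred = A_test @ beta
rmse = float(np.sqrt(np.mean((y[test] - pred) ** 2)))

print("intercept and coefficients:", beta)
print("test RMSE:", rmse)
```

RMSE is reported in the units of the target, which makes it easier to interpret than a squared error when the target is a percentage share.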

### Part 2: Linear Regression Model to Predict Representation of Women in Politics

In [None]:
#LASSO (to determine if certain features should be excluded)

In [None]:
#repeat the train/test split and error measurement steps from Part 1
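
The LASSO step named above can be sketched as follows, assuming scikit-learn is available (it is not imported elsewhere in this notebook). The data are synthetic, with only the first two of four hypothetical features affecting the target, and `alpha=0.3` is an arbitrary illustrative penalty rather than a tuned value.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic data: 4 hypothetical features, only the first two matter.
n = 200
X = rng.normal(size=(n, 4))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=n)

# Standardize features so the L1 penalty treats them on the same scale.
X_scaled = StandardScaler().fit_transform(X)

# Fit LASSO; the L1 penalty shrinks weak coefficients to exactly zero.
model = Lasso(alpha=0.3).fit(X_scaled, y)

# Indices of features with non-zero coefficients.
kept = [i for i, c in enumerate(model.coef_) if c != 0.0]

print("coefficients:", model.coef_)
print("kept feature indices:", kept)
```

In practice the penalty strength would usually be chosen by cross-validation, for example with `sklearn.linear_model.LassoCV`, rather than fixed by hand.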