011816incomeandwagedataaggregation.py - Cleans and merges DHS data, then
aggregates the variables in the files (most importantly income and wage data) to the sector and municipality level

Jay Sayre - sayrejay@gmail.com

INPUTS: 

"DHS/2013Standard/geo/merge2013clust.csv" - DHS 2013 Dominican Republic geospatial data corresponding to geo-tagged keys in DHS 2013 dataset, compiled by extractoneshapefiletopoint.py

"DHS/2013Standard/hhmember/DRPR61FL.csv" - DHS 2013 Dominican Republic Household Member Dataset (converted to csv by DHS/DHSbuild.R script)

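"DHS/2007Standard/geo/merge2007clust.csv" and "DHS/2007Standard/hhmember/DRPR52FL.csv" - The corresponding DHS 2007 Dominican Republic geospatial and Household Member files, which the code below also reads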

INTERMEDIATE DATA: N/A

OUTPUTS: "averageincomebymunicipality2010.csv" - Average income in each D.R. municipality for 2007 and 2013 (only municipalities observed in both years are kept)

In [1]:
import pandas as pd
import os

if os.name == 'nt':
    base_dir = "D:/Dropbox/Dropbox (Personal)/College/DR_Paper/"
else:
    base_dir = "/home/j/Dropbox/College/DR_Paper/"

Takes the DHS geocluster CSVs (merged with ONE municipality codes by 091215extractoneshapefiletopoint.py and converted to CSV by 091215dbftodata.py) and merges them with the DHS samples

Number of DHS clusters incorrectly marked in another province: 53
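
The mismatch count above comes from comparing the province reported by DHS (`ADM1DHS`) with the province assigned from the shapefile (`PROV`). As a minimal sketch, assuming a geocluster table with those two columns, the same count can be computed without an explicit loop. The toy values below are invented for illustration and are not DHS data.

```python
import pandas as pd

# Toy stand-in for a geocluster table (values are made up)
geo = pd.DataFrame({
    "DHSCLUST": [1, 2, 3, 4],
    "ADM1DHS":  [1, 1, 2, 3],   # province according to DHS
    "PROV":     [1, 2, 2, 3],   # province according to the shapefile
})

# Element-wise comparison; a missing value on either side counts as a mismatch
mismatch = geo["ADM1DHS"] != geo["PROV"]
n_mismatch = int(mismatch.sum())
print("Number of DHS clusters incorrectly marked in another province:", n_mismatch)
```

Keeping `mismatch` as a boolean column also makes it easy to list the affected clusters with `geo[mismatch]`.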

In [19]:
dhs_geoclust_2007 = base_dir + "DHS/2007Standard/geo/merge2007clust.csv"
dhs_hhmember_2007 = base_dir + "DHS/2007Standard/hhmember/DRPR52FL.csv" # Converted to csv by DHSbuild.R in main directory
dhs_geoclust_2013 = base_dir + "DHS/2013Standard/geo/merge2013clust.csv"
dhs_hhmember_2013 = base_dir + "DHS/2013Standard/hhmember/DRPR61FL.csv" # Converted to csv by DHSbuild.R in main directory

geo_df_2007 = pd.read_csv(dhs_geoclust_2007)
geo_df_2013 = pd.read_csv(dhs_geoclust_2013)
dhs_data_2007 = pd.read_csv(dhs_hhmember_2007)
dhs_data_2013 = pd.read_csv(dhs_hhmember_2013)

# Subset geocluster datasets down to relevant variables
geo_keep_cols = ['ADM1DHS','ALT_DEM',"DHSCLUST","MUN","PROV","REG","URBAN_RURA"]
geo_df_2007 = geo_df_2007[geo_keep_cols]
geo_df_2013 = geo_df_2013[geo_keep_cols]

# May want to apply some correction for municipality displacement. The only problem is that correction isn't possible for the 2013 data.
# The commented section below tests whether a different shapefile has better municipality displacement, which it does not.
#count = 0 
#for i in geo_df_2013.index:
#    if geo_df_2013['ADM1DHS'][i] != geo_df_2013['PROV'][i]:
#        count += 1
#print "Number of DHS clusters incorrectly marked in another province:", count

dhs_data_2007 = dhs_data_2007.merge(geo_df_2007, left_on="hv001", right_on="DHSCLUST", how="left")
dhs_data_2013 = dhs_data_2013.merge(geo_df_2013, left_on="hv001", right_on="DHSCLUST", how="left")

### DEPRECATED
## Used to test whether DIVA-GIS provided ADM shapefile is more accurate. 
## It isn't.
## 60 missing clusters

#dhs_testclust_2007 = base_dir + "DHS/2007Standard/geo/TEST2007clust.csv"
#dhs_testmerge = base_dir + "DR Codigos/muncorrespondence.csv"

#geo_df_test = pd.read_csv(dhs_testclust_2007)
#test_keeps = ['ADM1DHS','ALT_DEM',"DHSCLUST","URBAN_RURA","ID_1","ID_2"]
#geo_df_test = geo_df_test[test_keeps]
#geo_df_test.columns = ['ADM1DHS','ALT_DEM',"DHSCLUST","URBAN_RURA","GEOID_1","GEOID_2"]
#testcode = pd.read_csv(dhs_testmerge)
#testcode = testcode[[u'REGION', u'PROV', u'MUN', u'GEOID_1', u'GEOID_2']]

#geo_df_test = geo_df_test.merge(testcode, on=['GEOID_1', 'GEOID_2'], how='left')

#count = 0 
#for i in geo_df_test.index:
#    if geo_df_test['ADM1DHS'][i] != geo_df_test['PROV'][i]:
#        count += 1
#print count
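
The left merges on `hv001` / `DHSCLUST` above keep every household member even when a cluster has no geographic match. A minimal sketch of how unmatched clusters could be detected, assuming one row per cluster in the geocluster table, uses the `indicator` and `validate` options of `DataFrame.merge`. The `member_id` column and all values are hypothetical.

```python
import pandas as pd

# Toy household-member rows; member_id is a hypothetical placeholder column
members = pd.DataFrame({"hv001": [1, 1, 2, 5], "member_id": [1, 2, 1, 1]})
# Toy geocluster rows, one per cluster
clusters = pd.DataFrame({"DHSCLUST": [1, 2, 3], "PROV": [1, 1, 2], "MUN": [1, 2, 1]})

merged = members.merge(
    clusters,
    left_on="hv001",
    right_on="DHSCLUST",
    how="left",
    indicator=True,           # adds a _merge column: both / left_only
    validate="many_to_one",   # raises if a cluster appears twice in clusters
)

# Clusters present in the survey rows but absent from the geocluster table
unmatched = merged.loc[merged["_merge"] == "left_only", "hv001"].unique().tolist()
print(unmatched)
```

Rows flagged `left_only` carry missing `PROV` and `MUN` values, so they drop out of any later grouping on those columns.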

In [3]:
def codigo(prov):
    prov = str(prov)
    if len(prov) == 1:
        return '0'+prov
    else:
        return prov
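
As a quick illustration of what `codigo` does, the sketch below restates the function and compares it with `str.zfill(2)`, which gives the same result for integer or string inputs. Note the caveat for floats: if a code column has been converted to float (for example after a merge that introduced missing values), the value `1.0` is not padded.

```python
def codigo(prov):
    # Left-pad single-character codes with a zero, as in the cell above
    prov = str(prov)
    return '0' + prov if len(prov) == 1 else prov

examples = [codigo(1), codigo(12), codigo("7")]
padded = [str(v).zfill(2) for v in (1, 12, "7")]
print(examples)          # ['01', '12', '07']
print(padded)            # ['01', '12', '07']

# Caveat: a float code is stringified as '1.0', which is left unchanged
float_case = codigo(1.0)
print(float_case)        # '1.0'
```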

In [16]:
## Subset down to workers employed in the private sector
# Do I want to include the 162 or so employers here? ('employer')
dhs_data_2013 = dhs_data_2013[dhs_data_2013['sg111'] == "private employee"]
#print dhs_data_2013['sg110'].unique() # detailed employment categories (espanol)
#dhs_data_2013['sg112a'].unique() # gross salary

#print dhs_data_2007['gs14'].unique() #these are the codes in 2013/occ categories
#print dhs_data_2007['gs16'].unique()
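
On the open question of whether to include employers: a minimal sketch, assuming the category labels shown for `sg111`, is to tabulate the categories first and then subset with `isin`, so that adding or dropping a category is a one-line change. The rows below are invented for illustration.

```python
import pandas as pd

# Toy data using the sg111 labels; income values are made up
df = pd.DataFrame({
    "sg111": ["private employee", "employer", "public employee", None, "private employee"],
    "singresoo": [100.0, 300.0, 200.0, 50.0, None],
})

# How many observations fall in each employment category (missing included)
counts = df["sg111"].value_counts(dropna=False)
print(counts)

# Keep private employees and, optionally, employers
keep = ["private employee", "employer"]
subset = df[df["sg111"].isin(keep)]
print(len(subset))
```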

In [29]:
#Check average income numbers for DR and if they match up with GDP P.C.
# 2013 -  5,968.7, 2002 - 3,008.5 
print(dhs_data_2013['sg111'].unique())
print(dhs_data_2013['singresoo'].notnull().sum())  # observations with non-missing income


print(dhs_data_2013['singresoo'].mean())
print(dhs_data_2013['singrthp'].mean())
print(dhs_data_2013['sg117a'].mean())
print(dhs_data_2013['singresotp'].mean())

['public employee' 'self-employee' 'domestic worker' 'private employee' nan
 'employer' 'member of cooperative' "don't know"]
7909
10446.2326843
8832.79855481
11441.00513
13432.408699
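
The non-missing count and the means printed above can also be gathered in one table. A minimal sketch, assuming numeric income columns and using `DataFrame.agg`, is shown below on invented values; both `count` and `mean` skip missing entries.

```python
import pandas as pd

# Toy income columns named after two of the variables above (values are made up)
income = pd.DataFrame({
    "singresoo": [100.0, None, 300.0],
    "singrthp":  [50.0, 70.0, None],
})

# One row per statistic, one column per income variable
summary = income.agg(["count", "mean"])
print(summary)
```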


In [28]:
# Aggregate income variables, very quickly - come back and improve this
#print len(dhs_data_2007) - len(dhs_data_2007[dhs_data_2007['ingoo'].astype(str) != 'nan']) # Observations missing data
#print len(dhs_data_2013) - len(dhs_data_2013[dhs_data_2013['singresoo'].astype(str) != 'nan']) # Observations missing data

# Drop observations with missing income
dhs_data_2007 = dhs_data_2007[dhs_data_2007['ingoo'].notnull()]
dhs_data_2013 = dhs_data_2013[dhs_data_2013['singresoo'].notnull()]

occinc2007 = dhs_data_2007.groupby(['PROV','MUN'], as_index=False)['ingoo'].mean()
occinc2013 = dhs_data_2013.groupby(['PROV','MUN'], as_index=False)['singresoo'].mean()
# Keep only municipalities observed in both years
occinc = occinc2007.merge(occinc2013, on=['PROV','MUN'], how="inner")
occinc['PROV'] = occinc['PROV'].astype(str)
occinc['MUN'] = occinc['MUN'].apply(codigo)
occinc['CODIGO'] = occinc['PROV']+occinc['MUN']
occinc.to_csv(base_dir + "averageincomebymunicipality2010.csv", index=False)

#provoccinc2007 = dhs_data_2007.groupby('PROV', as_index=False)['ingoo'].mean()
#provoccinc2013 = dhs_data_2013.groupby('PROV', as_index=False)['singresoo'].mean()
#provoccinc = provoccinc2007.merge(provoccinc2013, on='PROV', how="inner")
#provoccinc.to_csv(base_dir + "PROVoccinc.csv", index=False)
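
To make the aggregation step concrete, here is a minimal end-to-end sketch on invented data: mean income by (`PROV`, `MUN`) for each year, an inner merge across years, and a `CODIGO` built from the province code followed by the two-digit municipality code. As in the cell above, the province code is not zero-padded; whether it should be depends on the code list being matched, which is an assumption left open here.

```python
import pandas as pd

# Toy person-level rows for each survey year (values are made up)
d07 = pd.DataFrame({"PROV": [1, 1, 2], "MUN": [1, 1, 3], "ingoo": [100.0, 200.0, 50.0]})
d13 = pd.DataFrame({"PROV": [1, 2, 2], "MUN": [1, 3, 3], "singresoo": [400.0, 100.0, 300.0]})

# Mean income per municipality in each year
inc07 = d07.groupby(["PROV", "MUN"], as_index=False)["ingoo"].mean()
inc13 = d13.groupby(["PROV", "MUN"], as_index=False)["singresoo"].mean()

# Inner merge keeps only municipalities present in both years
occ = inc07.merge(inc13, on=["PROV", "MUN"], how="inner")

# Province code followed by zero-padded municipality code
occ["CODIGO"] = occ["PROV"].astype(str) + occ["MUN"].astype(str).str.zfill(2)
print(occ)
```

Because the merge is an inner join, a municipality sampled in only one of the two years silently disappears from the output file; `how="outer"` would keep it with a missing value for the other year.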