DHSdataaggregation.py - Takes DHS data and
aggregates the variables in the files (most importantly income and wage data) to the occupation and municipality level, after the necessary cleaning and merging steps

Jay Sayre - sayrejay (at) gmail,

INPUTS: 

"../DHS/2013Standard/geo/merge2013clust.csv" - DHS 2013 Dominican Republic geospatial data corresponding to geo-tagged keys in DHS 2013 dataset, compiled by extractoneshapefiletopoint.py

"../DHS/2013Standard/hhmember/DRPR61FL.csv" - DHS 2013 Dominican Republic Household Member Dataset (converted to csv by DHS/DHSbuild.R script)

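The corresponding DHS 2007 files (merge2007clust.csv and DRPR52FL.csv) are also defined below, but the code that reads them is commented out, so they are not currently used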

"/cafta-dr/DHS/DHSoccupationsISIC.xlsx" - Code for converting DHS occupation categories into ISIC occupation codes, compiled by author (me!)

INTERMEDIATE DATA: N/A

OUTPUTS: "averageincomebymunicipality2013.csv" - Average income in each D.R. municipality for 2013

"averageincomebyoccmun2013.csv" - Average income in each D.R. municipality and occupation for 2013


In [1]:
import pandas as pd
import os

if os.name == 'nt':
    base_dir = "D:/Dropbox/Dropbox (Personal)/College/DR_Paper/"
else:
    base_dir = "/home/j/Dropbox/College/DR_Paper/"
    
## INPUTS
dhs_geoclust_2007 = base_dir + "DHS/2007Standard/geo/merge2007clust.csv"
dhs_hhmember_2007 = base_dir + "DHS/2007Standard/hhmember/DRPR52FL.csv" # Converted to csv by DHSbuild.R in main directory
dhs_geoclust_2013 = base_dir + "DHS/2013Standard/geo/merge2013clust.csv"
dhs_hhmember_2013 = base_dir + "DHS/2013Standard/hhmember/DRPR61FL.csv" # Converted to csv by DHSbuild.R in main directory
dhs_occupations_conversion = base_dir + "cafta-dr/DHS/DHSoccupationsISIC.xlsx"

## OUTPUTS
## base_dir already ends in "/", so the relative part must not start with one
averageincmun = base_dir + "cafta-dr/Output/averageincomebymunicipality2013.csv"
averageincmunocc = base_dir + "cafta-dr/Output/averageincomebyoccmun2013.csv"
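
The input and output paths above are built by string concatenation, which only works cleanly when base_dir ends in a slash and the appended part does not start with one. A minimal sketch of an alternative using os.path.join follows. The helper name project_path is hypothetical and not part of the original script, and the base directory is used only for illustration.

```python
import os

# Illustrative base directory (taken from the non-Windows branch of the notebook).
base_dir = "/home/j/Dropbox/College/DR_Paper/"


def project_path(*parts):
    """Join path components under base_dir without doubling separators.

    Hypothetical helper, not part of the original script.
    """
    # Strip leading/trailing slashes so os.path.join does not treat a part
    # such as "/cafta-dr" as a new absolute path and discard base_dir.
    cleaned = [p.strip("/") for p in parts]
    return os.path.join(base_dir, *cleaned)


averageincmun = project_path("cafta-dr", "Output", "averageincomebymunicipality2013.csv")
print(averageincmun)
```

Stripping the slashes first matters because os.path.join restarts from any component that looks absolute.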

Takes the DHS geocluster CSVs (merged with ONE municipality codes by 091215extractoneshapefiletopoint.py and converted to CSV by 091215dbftodata.py) and merges them with the DHS samples on the cluster number

Number of DHS clusters incorrectly marked in another province: 53
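
The count above comes from comparing, for each cluster, the province reported by DHS (ADM1DHS) with the province in the merged geodata (PROV), as the commented-out loop in the next cell does. A minimal vectorized sketch on invented toy data follows. It assumes both columns use the same province coding, which the source does not state explicitly.

```python
import pandas as pd

# Toy stand-in for the geocluster table; all values are made up.
geo = pd.DataFrame({
    "DHSCLUST": [1, 2, 3, 4],
    "ADM1DHS": [1, 1, 2, 3],  # province code as reported by DHS
    "PROV": [1, 2, 2, 3],     # province code from the merged geodata
})

# Boolean mask of clusters whose two province codes disagree.
mismatch = geo["ADM1DHS"] != geo["PROV"]
n_mismatch = int(mismatch.sum())

print("Number of DHS clusters marked in another province: %d" % n_mismatch)
print(geo.loc[mismatch, "DHSCLUST"].tolist())
```

Keeping the mask makes it possible to list the affected clusters as well as count them.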

In [2]:
#geo_df_2007 = pd.read_csv(dhs_geoclust_2007)
#dhs_data_2007 = pd.read_csv(dhs_hhmember_2007)
geo_df_2013 = pd.read_csv(dhs_geoclust_2013, encoding='latin_1')
dhs_data_2013 = pd.read_csv(dhs_hhmember_2013, encoding='latin_1')

## Subset geocluster datasets down to relevant variables
geo_keep_cols = ['ADM1DHS','ALT_DEM',"DHSCLUST","MUN","PROV","REG","URBAN_RURA"]
#geo_df_2007 = geo_df_2007[geo_keep_cols]
geo_df_2013 = geo_df_2013[geo_keep_cols]
## Merge geodata with DHS data
#dhs_data_2007 = dhs_data_2007.merge(geo_df_2007, left_on="hv001", right_on="DHSCLUST", how="left")
dhs_data_2013 = dhs_data_2013.merge(geo_df_2013, left_on="hv001", right_on="DHSCLUST", how="left")

### A correction for municipality displacement may be warranted, but no such correction is possible for the 2013 data.
### The commented-out section below tests whether a different shapefile gives better municipality displacement; it does not.
#count = 0 
#for i in geo_df_2013.index:
#    if geo_df_2013['ADM1DHS'][i] != geo_df_2013['PROV'][i]:
#        count += 1
#print "Number of DHS clusters incorrectly marked in another province:", count

### Clean DHS data

## Convert occupations provided in Spanish to ISIC two-digit codes
## Build conversion dictionary: Spanish occupation label -> list of ISIC codes
## (no encoding argument: read_excel does not need one for .xlsx files)
occonversion = pd.read_excel(dhs_occupations_conversion)
occonversion['isic2digitV3']=occonversion['isic2digitV3'].astype(str)
## A label can map to several comma-separated codes
occonversion['isic2digitV3']=occonversion['isic2digitV3'].apply(lambda x: x.split(','))
occonversion = dict(zip(occonversion['espanol'],occonversion['isic2digitV3']))
## Take steps necessary to convert occupations
dhs_data_2013['sg110'] = dhs_data_2013['sg110'].fillna('missing')
dhs_data_2013 = dhs_data_2013[dhs_data_2013['sg110'] != 'missing']
dhs_data_2013['sg110'] = dhs_data_2013['sg110'].apply(lambda x: occonversion[x.rstrip()])
print("Observations with occupation data %d" % len(dhs_data_2013))

## Subset down to workers employed in the private sector
## Not applied here (the filter below is commented out), although it was applied to the IPUMS data
## Open question: should the roughly 162 respondents coded 'employer' be included here?
print("Observations classified as having a private employer %d" % len(dhs_data_2013[dhs_data_2013['sg111'] == "private employee"]))
#dhs_data_2013 = dhs_data_2013[dhs_data_2013['sg111'] == "private employee"] 


Observations with occupation data 8487
Observations classified as having a private employer 3248
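
The occupation conversion in the cell above indexes the dictionary directly, so any label missing from the crosswalk raises a KeyError. The sketch below, on invented toy data, shows one way to build the same kind of label-to-codes dictionary and to list unmapped labels instead of failing. The labels, codes, and variable names are illustrative assumptions. Stripping whitespace around each code is an addition that the original does not perform.

```python
import pandas as pd

# Toy crosswalk; the occupation labels and ISIC codes are invented.
crosswalk = pd.DataFrame({
    "espanol": ["agricultor", "vendedor"],
    "isic2digitV3": ["1", "51, 52"],
})

# Split the comma-separated codes and strip stray whitespace around each one.
codes = crosswalk["isic2digitV3"].astype(str).apply(
    lambda s: [c.strip() for c in s.split(",")]
)
occ_to_isic = dict(zip(crosswalk["espanol"], codes))

# Toy survey column with trailing whitespace and one label not in the crosswalk.
occupations = pd.Series(["agricultor  ", "vendedor", "astronauta"])
cleaned = occupations.str.rstrip()

# dict.get returns None for labels missing from the crosswalk instead of raising.
mapped = cleaned.map(lambda label: occ_to_isic.get(label))
unmapped = sorted(cleaned[mapped.isnull()].unique())

print(unmapped)
```

Listing the unmapped labels first makes it easier to decide whether to extend the crosswalk or drop those observations.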



In [72]:
## Check average income numbers for the DR and whether they match up with GDP per capita
##2013 -  5,968.7, 2002 - 3,008.5 
## Takeaway: there are several different sources for income data, and their means differ
print("Observations with income data %d" % dhs_data_2013['singresoo'].notnull().sum())

print(dhs_data_2013['singresoo'].mean())
print(dhs_data_2013['singrthp'].mean())
print(dhs_data_2013['sg117a'].mean())
print(dhs_data_2013['singresotp'].mean())

Observations with income data 7854
10451.003947
8756.32806935
13425.4538088
12820.858232
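
The count and the means above can also be produced in a single call, which makes it easier to compare the income variables side by side. The sketch below uses invented values. Only the column names come from the notebook.

```python
import numpy as np
import pandas as pd

# Toy data; the values are invented for illustration.
df = pd.DataFrame({
    "singresoo": [100.0, np.nan, 300.0],
    "singrthp": [50.0, 150.0, np.nan],
})

income_cols = ["singresoo", "singrthp"]

# 'count' ignores NaN, so it reports how many observations have income data;
# 'mean' also skips NaN by default.
summary = df[income_cols].agg(["count", "mean"])

print(summary)
```

The result has one row per statistic and one column per income variable.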


In [3]:
def codigo(prov):
    ## Zero-pad a one-character code to two characters (e.g. 1 -> '01')
    return str(prov).zfill(2)

#dhs_data_2007 = dhs_data_2007[dhs_data_2007['ingoo'].notnull()]
## Keep only observations with income data
dhs_data_2013 = dhs_data_2013[dhs_data_2013['singresoo'].notnull()]
columnstoaggregate = ['singresoo','singrthp','sg117a','singresotp']

## Aggregate income at the municipality level
muninc2013 = dhs_data_2013.groupby(['PROV','MUN'], as_index=False)[columnstoaggregate].mean()
muninc2013['PROV'] = muninc2013['PROV'].astype(str)
muninc2013['MUN'] = muninc2013['MUN'].apply(lambda x: codigo(x))
muninc2013['CODIGO'] = muninc2013['PROV']+muninc2013['MUN']
muninc2013.to_csv(averageincmun, index=False)

## Aggregate income at the municipality and occupation level
## Duplicate rows into different occupation categories
## A worker whose occupation maps to several ISIC codes gets one row per code
dhs_data_2013 = dhs_data_2013.reset_index(drop=True)
origrows = list(dhs_data_2013.index)
for i in origrows:
    isics = dhs_data_2013.loc[i,'sg110']
    numisics = len(isics)
    if numisics != 1:
        updaterows = []
        ## Build one copy of the row for each ISIC code after the first
        for j in range(1, numisics):
            newrow = dict(dhs_data_2013.loc[i,:])
            newrow['sg110'] = isics[j]
            updaterows.append(newrow)
        ## pd.concat replaces DataFrame.append, which was removed in pandas 2.0
        newrows = pd.DataFrame(updaterows, columns=dhs_data_2013.columns)
        dhs_data_2013 = pd.concat([dhs_data_2013, newrows], ignore_index=True)

    ## The original row keeps the first ISIC code
    dhs_data_2013.loc[i,'sg110'] = isics[0]


dhs_data_2013 = dhs_data_2013.groupby(['PROV','MUN','sg110'], as_index=False)[columnstoaggregate].mean()
dhs_data_2013['PROV'] = dhs_data_2013['PROV'].astype(str)
dhs_data_2013['MUN'] = dhs_data_2013['MUN'].apply(lambda x: codigo(x))
dhs_data_2013['CODIGO'] = dhs_data_2013['PROV']+dhs_data_2013['MUN']
dhs_data_2013.to_csv(averageincmunocc, index=False)
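
The row-duplication loop and the aggregation above can be expressed more compactly on newer pandas versions. The sketch below uses DataFrame.explode, which requires pandas 0.25 or later and is therefore not available in the Python 2 environment this notebook was written in. The data are invented and only one income column is shown.

```python
import pandas as pd

# Toy data: the first worker's occupation maps to two ISIC codes. Values invented.
df = pd.DataFrame({
    "PROV": [1, 1, 2],
    "MUN": [1, 1, 10],
    "sg110": [["51", "52"], ["51"], ["1"]],
    "singresoo": [100.0, 300.0, 500.0],
})

# explode gives each list element its own row and repeats the other columns,
# which mirrors what the row-duplication loop does.
long_df = df.explode("sg110").reset_index(drop=True)

# Mean income per province, municipality and ISIC code.
means = long_df.groupby(["PROV", "MUN", "sg110"], as_index=False)["singresoo"].mean()

# Zero-pad the municipality code to two characters, as codigo() does.
means["CODIGO"] = means["PROV"].astype(str) + means["MUN"].astype(str).str.zfill(2)

print(means)
```

With either approach, a worker mapped to several ISIC codes contributes their full income to the mean of every one of those codes.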